I would like to get some opinions on how to solve a minor hiccup with safegraph_py

Hello @here,

I would like to get some opinions on how to solve a minor hiccup @Ruowei_Yang_UM_Baltimore ran into with safegraph_py. @Ruowei_Yang_UM_Baltimore needed to alter the data type for one of the read-ins, but the way it is currently set up, if the dtype is not explicitly stated in the code, you cannot change it (this was not intentional).

I see 2 ways to go about this fix:

  1. Create a dtype variable in the .py file that explicitly states all columns and their dtypes (brute force), or write them out inline.
  2. Remove all customization of dtypes and revert to the pandas defaults (which are typically wrong for this data).
    I cannot think of a simple way to go about this. I assume there is some way to force pandas to treat all columns as strings and then allow the user to override that, but that would mean that if they changed one column, the rest would likely fall back to the pandas defaults.

Anyone have any ideas?
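
On the "all columns as strings, then override" idea: a minimal sketch, assuming a hypothetical helper named `read_as_strings` (not part of safegraph_py). `pd.read_csv(..., dtype=str)` reads every column as a string, and `DataFrame.astype` with a dict then converts only the columns the user names, so the rest stay strings rather than falling back to the pandas defaults.

```python
import io

import pandas as pd


def read_as_strings(source, overrides=None, **kwargs):
    """Hypothetical helper: read every column as str, then cast selected columns."""
    df = pd.read_csv(source, dtype=str, **kwargs)
    # astype with a dict only touches the listed columns.
    return df.astype(overrides) if overrides else df


# Illustrative data only; the column names mirror the ones discussed in this thread.
csv_text = "postal_code,poi_cbg,visits\n02139,010010201001,5\n"
df = read_as_strings(io.StringIO(csv_text), overrides={"visits": "int64"})
```

Here `visits` becomes an integer while `postal_code` and `poi_cbg` keep their leading zeros.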

@Jack_Lindsay_Kraken1 i don’t really understand the problem.

"if the dtype is not explicitly stated in the code, you cannot change it (this was not intentional)."

is this for the reading functions?

any more specifics you can provide may help

Sorry, I will add some code snippets

    pattern_files = glob.glob(os.path.join(path_to_pattern, "*.csv.gz"))
    print(f"You are about to load in {len(pattern_files)} pattern files")

    li = []
    for pattern in pattern_files:
        print(pattern)
        df = pd.read_csv(pattern, compression=compression, *args, **kwargs,
                         dtype={'postal_code': str, 'phone_number': str, 'naics_code': str})
        li.append(df)

    SG_pattern = pd.concat(li, axis=0)
    return SG_pattern```
This is hard coded in. Originally that was so the user didnt have to worry about the dtype, but now I am realizing I obviously didnt cover all the columns that needed to be covered

what is the specific column in this case that Ruowei (or whoever is the user) wanted to change?

poi_cbg

OK.

but because it is hard-coded there, the function doesn't accept a dtype argument
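
A small illustration of why that happens, assuming the function forwards `**kwargs` the way the snippet does. `fake_read_csv` and `read_pattern` are hypothetical stand-ins, not the real pandas or safegraph_py functions: when the caller also passes `dtype`, Python sees the keyword twice and raises a `TypeError`.

```python
def fake_read_csv(path, **options):
    """Stand-in for pd.read_csv that just returns the options it received."""
    return options


def read_pattern(path, **kwargs):
    # dtype is hard-coded here, mirroring the snippet in this thread.
    return fake_read_csv(path, **kwargs, dtype={"postal_code": str})


try:
    # The user tries to supply their own dtype for poi_cbg.
    read_pattern("patterns.csv.gz", dtype={"poi_cbg": str})
    error_message = None
except TypeError as err:
    # Python rejects the call because 'dtype' is supplied twice.
    error_message = str(err)
```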

I am thinking something along the lines of #1.

but you don’t need to specify all variables

or rather,

Ok, that is what I was thinking, otherwise the read_in functions are kind of useless haha

just make sg_dtypes a parameter in the function that defaults to whatever dictionary you want, but the user can override.

i mean probably we should just add poi_cbg to the default dtypes.

But we could ALSO make it so that the user can pass their own dtypes dict and override the defaults to unblock future people’s requests.
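
A minimal sketch of that suggestion, under stated assumptions: `DEFAULT_SG_DTYPES`, `merge_dtypes`, and `read_pattern_multi` are illustrative names, not confirmed as safegraph_py's actual API. The user's `dtype` dict is layered over the defaults, so overriding one column leaves the other defaults in place.

```python
import glob
import os

import pandas as pd

# Hypothetical defaults: the columns from the snippet in this thread, plus poi_cbg.
DEFAULT_SG_DTYPES = {
    "postal_code": str,
    "phone_number": str,
    "naics_code": str,
    "poi_cbg": str,
}


def merge_dtypes(user_dtypes=None, defaults=DEFAULT_SG_DTYPES):
    """Return a copy of the defaults with any user-supplied entries on top."""
    merged = dict(defaults)
    merged.update(user_dtypes or {})
    return merged


def read_pattern_multi(path_to_pattern, compression="gzip", dtype=None, **kwargs):
    """Read every *.csv.gz file in a folder into one DataFrame (illustrative name)."""
    pattern_files = glob.glob(os.path.join(path_to_pattern, "*.csv.gz"))
    frames = [
        pd.read_csv(f, compression=compression, dtype=merge_dtypes(dtype), **kwargs)
        for f in pattern_files
    ]
    return pd.concat(frames, axis=0)
```

With this shape, `read_pattern_multi(folder)` uses the defaults, while `read_pattern_multi(folder, dtype={"poi_cbg": "int64"})` changes only `poi_cbg`.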

got it. will do