I would like to get some opinions on how to solve a minor hiccup with safegraph_py

Jack_Lindsay_Kraken1 · September 3, 2020, 12:00am

Hello @here,

I would like to get some opinions on how to solve a minor hiccup @Ruowei_Yang_UM_Baltimore ran into with safegraph_py. @Ruowei_Yang_UM_Baltimore needed to alter the data type for one of the read-in columns, but with the way the function is currently set up, the dtypes are fixed in the code and the user cannot change them or add new ones (this was not intentional).

I see 2 ways to go about this fix:

1. Create a dtype variable in the .py file that explicitly states all columns and their dtypes (brute force), or write them out inline.
2. Remove all customization of dtypes and revert to the pandas defaults (which are often wrong for these columns).

I cannot think of a simple way to go about this. I assume there is some way to force pandas to treat all columns as strings and then let the user override that, but then changing one column would likely force the rest back to the pandas defaults.

Anyone have any ideas?

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:34am

@Jack_Lindsay_Kraken1 i don’t really understand the problem.

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:34am

"if the dtype is not explicitly stated in the code, you cannot change it (this was not intentional)."

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:34am

is this for the reading functions?

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:35am

any more specifics you can provide may help

Jack_Lindsay_Kraken1 · September 4, 2020, 2:35am

Sorry, I will add some code snippets

    pattern_files = glob.glob(os.path.join(path_to_pattern, "*.csv.gz"))
    print(f"You are about to load in {len(pattern_files)} pattern files")

    li = []
    for pattern in pattern_files:
        print(pattern)
        df = pd.read_csv(pattern, compression=compression, *args, **kwargs,
                         dtype={'postal_code': str, 'phone_number': str, 'naics_code': str})
        li.append(df)

    SG_pattern = pd.concat(li, axis=0)
    return SG_pattern

This is hard-coded. Originally that was so the user didn't have to worry about the dtype, but now I am realizing I obviously didn't cover all the columns that needed to be covered.

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:36am

what is the specific column in this case that Ruowei (or whoever is the user) wanted to change?

Jack_Lindsay_Kraken1 · September 4, 2020, 2:36am

poi_cbg

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:36am

OK.

Jack_Lindsay_Kraken1 · September 4, 2020, 2:37am

but because it is hard-coded there, the function doesn't accept a dtype argument
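
To make the failure concrete, here is a minimal sketch of what likely happens when a caller passes their own dtype to a function built like the snippet above. The names `read_csv_stub` and `read_pattern` are hypothetical stand-ins (not the safegraph_py or pandas functions), used only so the example runs without any data files. The point is that Python rejects a keyword that is supplied both at the call site and through `**kwargs`.

```python
def read_csv_stub(path, dtype=None):
    """Hypothetical stand-in for pd.read_csv; it only returns the dtype it was given."""
    return dtype


def read_pattern(path, *args, **kwargs):
    # Mirrors the snippet above: dtype is fixed at the call site,
    # so a dtype arriving through **kwargs collides with it.
    return read_csv_stub(path, *args, **kwargs, dtype={"postal_code": str})


# Without a user dtype the call works and uses the hard-coded mapping.
print(read_pattern("patterns.csv.gz"))

# With a user dtype, Python raises TypeError because 'dtype' is supplied twice.
try:
    read_pattern("patterns.csv.gz", dtype={"poi_cbg": str})
except TypeError as err:
    print(f"TypeError: {err}")
```

So the user's dtype is never merged with the hard-coded one; the call fails before any file is read.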

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:37am

I am thinking something along the lines of #1.

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:37am

but you don’t need to specify all variables

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:37am

or rather,

Jack_Lindsay_Kraken1 · September 4, 2020, 2:37am

Ok, that is what I was thinking, otherwise the read_in functions are kind of useless haha

Ryan_Fox_Squire_SafeGraph · September 4, 2020, 2:38am

just make sg_dtypes a parameter in the function that defaults to whatever dictionary you want, but the user can override.

i mean probably we should just add poi_cbg to the default dtypes.

But we could ALSO make it so that the user can pass their own dtypes dict and override the defaults to unblock future people’s requests.
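
A rough sketch of that suggestion, under some assumptions: the names `DEFAULT_SG_DTYPES`, `merge_dtypes`, and `read_pattern_files`, and the `dtype` parameter name, are made up for illustration and are not the actual safegraph_py API. The defaults are the three columns from the snippet above plus `poi_cbg`. The user's dict is merged over the defaults, so overriding one column leaves the other defaults in place.

```python
import glob
import os

import pandas as pd

# Assumed defaults: the three hard-coded columns plus poi_cbg.
DEFAULT_SG_DTYPES = {
    "postal_code": str,
    "phone_number": str,
    "naics_code": str,
    "poi_cbg": str,
}


def merge_dtypes(user_dtype=None):
    """Return a copy of the defaults with any user-supplied entries taking precedence."""
    merged = dict(DEFAULT_SG_DTYPES)
    if user_dtype:
        merged.update(user_dtype)
    return merged


def read_pattern_files(path_to_pattern, compression="gzip", dtype=None, **kwargs):
    """Hypothetical reader: load every *.csv.gz file in a folder into one DataFrame."""
    pattern_files = glob.glob(os.path.join(path_to_pattern, "*.csv.gz"))
    frames = [
        # dtype is a named parameter here, so it can no longer collide with **kwargs.
        pd.read_csv(f, compression=compression, dtype=merge_dtypes(dtype), **kwargs)
        for f in pattern_files
    ]
    return pd.concat(frames, axis=0)
```

With this shape, `read_pattern_files(folder)` uses the defaults, while `read_pattern_files(folder, dtype={"poi_cbg": "int64"})` changes only `poi_cbg` and keeps the other columns as strings.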

Jack_Lindsay_Kraken1 · September 4, 2020, 2:39am

got it. will do