We’ll explore whether alcohol spend at the county level predicts voting behavior by analyzing open government datasets. Along the way, we’ll also demonstrate the difficulties data scientists face when joining and linking open datasets (and SafeGraph’s upcoming solution to this problem).
First, we’ll need a proxy for the amount spent on alcohol in different counties. Fortunately, under the Texas Tax Code, alcohol permittees must report their receipts from alcohol sales. We’ll use that data to obtain the total liquor, wine, and beer receipts in each county per day: for each store, we divide the amount reported by the number of days the store reported, then sum over the county where the store is located.
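As a rough illustration of that aggregation, here is a minimal pandas sketch. The column names (tx_county_number, permit_id, reported_receipts, days_reported) and the sample rows are made up for this example and are not the actual schema of the Texas data; the sketch also assumes that a store’s daily rate is its total receipts divided by its total days reported.

```python
import pandas as pd

# Hypothetical receipt records: one row per store (permit) per report.
# Column names and values are illustrative, not the real Texas schema.
receipts = pd.DataFrame({
    "tx_county_number": [1, 1, 2, 2],
    "permit_id": ["A", "B", "C", "C"],
    "reported_receipts": [3000.0, 6000.0, 900.0, 1500.0],
    "days_reported": [30, 60, 30, 30],
})


def daily_spend_by_county(df: pd.DataFrame) -> pd.Series:
    """Sum each store's receipts-per-day over the county it is located in."""
    # Collapse multiple reports from the same store into one row.
    per_store = df.groupby(
        ["tx_county_number", "permit_id"], as_index=False
    )[["reported_receipts", "days_reported"]].sum()
    # Amount reported divided by the number of days the store reported.
    per_store["receipts_per_day"] = (
        per_store["reported_receipts"] / per_store["days_reported"]
    )
    # Aggregate the per-store daily rates up to the county.
    return per_store.groupby("tx_county_number")["receipts_per_day"].sum()


daily = daily_spend_by_county(receipts)
print(daily)
```

Dividing per store before summing keeps a store that reported for only part of the period from being undercounted relative to one that reported throughout.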
To find out which party people are affiliated with, we’ll use the results of the 2016 US General Election as a proxy, assuming that people mostly don’t switch parties and that voters’ affiliation is representative of the county’s population. Since not everyone is registered to vote, we’ll also get a population estimate for each county so that we can normalize each county’s drinking per capita, assuming that Texas counties didn’t substantially grow or shrink during the reporting period.
Already we find ourselves facing a couple of problems:
(a) The Texas tax data is keyed on the Texas County Number, an internal numbering system
(b) The Census data is keyed on the county’s FIPS code
(c) The US General Election results are reported by the county’s name
So now we’ll have to find something that links a county’s name to its Texas County Number and its FIPS code. This is usually where we also run into spelling differences and minor inconsistencies, such as some data sources suffixing their county names with the word “County” and others not.
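To make the linking step concrete, here is a small sketch of how the three keys might be reconciled with pandas. Everything in it is an assumption for illustration: the crosswalk table, the column names, the normalize_county_name helper, and all of the toy values. Real data would need more cleanup than stripping a “County” suffix.

```python
import pandas as pd


def normalize_county_name(name: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing 'county'."""
    cleaned = " ".join(name.strip().lower().split())
    if cleaned.endswith(" county"):
        cleaned = cleaned[: -len(" county")]
    return cleaned


# Hypothetical crosswalk linking the three identifiers (illustrative rows).
crosswalk = pd.DataFrame({
    "county_name": ["Anderson County", "Andrews County"],
    "tx_county_number": [1, 2],
    "fips": ["48001", "48003"],
})

# Toy stand-ins for the three sources, each keyed differently.
alcohol = pd.DataFrame(
    {"tx_county_number": [1, 2], "receipts_per_day": [200.0, 40.0]}
)
population = pd.DataFrame(
    {"fips": ["48001", "48003"], "population": [50000, 20000]}
)
election = pd.DataFrame(
    {"county": ["ANDERSON", " andrews county "], "dem_share": [0.2, 0.3]}
)

# Build a common name key on both sides before joining on names.
crosswalk["county_key"] = crosswalk["county_name"].map(normalize_county_name)
election["county_key"] = election["county"].map(normalize_county_name)

# validate= makes pandas raise if a key is unexpectedly duplicated.
joined = (
    crosswalk
    .merge(alcohol, on="tx_county_number", how="inner", validate="one_to_one")
    .merge(population, on="fips", how="inner", validate="one_to_one")
    .merge(
        election[["county_key", "dem_share"]],
        on="county_key",
        how="inner",
        validate="one_to_one",
    )
)

# Normalize spend by population to get a per-capita figure.
joined["spend_per_capita_per_day"] = (
    joined["receipts_per_day"] / joined["population"]
)
print(joined)
```

With inner joins, any county whose name fails to match silently drops out, so it is worth comparing the row count of the result against the number of counties we expected.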
But now that we have all of this data, we can join it together, and look for, say, a linear correlation.
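For reference, a linear correlation of this kind is usually measured with the Pearson coefficient r, and r² is the share of variance a straight-line fit would explain. The sketch below computes both with NumPy on made-up numbers; the pearson_r helper and the sample values are hypothetical and are not the county data used in this analysis.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Made-up values standing in for vote share and per-capita spend.
dem_share = [0.20, 0.35, 0.50, 0.65]
spend_per_capita = [1.0, 3.0, 2.0, 4.0]

r = pearson_r(dem_share, spend_per_capita)
r_squared = r ** 2
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

In practice, np.corrcoef or pandas’ Series.corr gives the same coefficient; the explicit version is only here to show what is being computed.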
As it so happens, our vast adventure in acquiring and cleaning all this data led to yet another null result. There’s only the weakest of correlations (about r = 0.25, for an r² ≈ 0.06) between how strongly a county votes Democratic and how much money it spends on alcohol.
And our search for interesting correlations must continue elsewhere, since there’s also only a weak correlation (r = 0.17, r² ≈ 0.03) between how strongly a county votes Republican and how much it prefers spending money on beer over wine or liquor. On the bright side, it’s quite heartening to know that no matter where we are on the political divide, the great commonality we have is how much we like to spend on our booze.
The analysis here is fairly simplistic, and this isn’t a particularly statistically robust result, but it illustrates that a large portion of our time is often spent identifying which datasets contain the statistics we’re interested in, cleaning up the data in them, and finding a way to join the datasets together, even when all the data is open access. At SafeGraph, we’re now working on a tool to find, structure, and join disparate open-access datasets. When common keys exist across datasets, the ability to join them opens up vastly more opportunities to study relationships. Stay tuned for more!