In Part 1: SafeGraph’s Data on Brick-and-Mortar Customer Demographics is the Most Accurate and Comprehensive in the World, we explained the power of SafeGraph Patterns for demographic analysis and showed a simple example for how to turn SafeGraph Patterns into powerful insights.
How do we know that the panel isn’t oversampling higher-income individuals or people from certain geographies? These types of biases, called sampling bias, can significantly skew the analysis of customer demographics.
If you want to see this logic fully implemented in Python, see the Teacher Jupyter Co-Lab Notebook: Measuring and Correcting Sampling Bias.
We’ve previously explored the concept of sampling bias in earlier posts, such as What About Bias In The SafeGraph Dataset? Sampling bias goes by multiple names (e.g., selection bias, ascertainment bias; these terms have nuanced technical differences, but are generally used synonymously). Sampling bias arises when data are collected disproportionately from some segments of the population. The intuition is simple: you want to sample from each segment in proportion to its presence in the population. If your population is 80% women and 20% men, then a true random sample should also show approximately 80% women and 20% men. If your sample is actually 50% men and 50% women, then your sample is biased in favor of men.
To motivate the need for correcting sampling bias, let’s consider the Walmart example from Part 1.
Figure 1 is derived from SafeGraph Patterns data and shows that approximately 8% of visitors to this Walmart are counted in the Hispanic or Latino Origin demographic segment (blue bar). But maybe the data show such a large percentage of Not Hispanic or Latino Origin visitors simply because the SafeGraph dataset is biased in favor of (over-indexed on) Not Hispanic people. This type of “sampling bias” almost certainly exists any time you don’t have precise control over the data collection. Luckily, there is a straightforward method for correcting this sampling bias in SafeGraph Patterns data.
First some intuition for how this works. The SafeGraph dataset is a “sample” of the true USA population. We know there are different segments of people in the sample (i.e., people living in different census block groups, or CBGs). And we know the true frequencies of these segments of people in the overall population because we know the true population of each census block group according to the Census. Similarly, we know the empirical frequencies of these segments of people in the SafeGraph “sample”.
By comparing the true population frequency with the frequency in our sample, we can calculate which CBGs are over- or under-represented in the sample. Then we adjust (i.e., re-weight) each group in the SafeGraph sample so that each group affects our measurement proportionate to its true population frequency. In some cases, this adjustment will be extrapolating up (weighting more heavily) a group because they are under-indexed. In other cases, it means down-weighting a group that is over-indexed.
This method is called post-hoc stratified re-weighting. Each group is called a stratum. It’s post-hoc because we apply the correction after collecting the data (in contrast to stratified sampling, in which one uses known population frequencies to control the data collection in the first place). Adjusting the relative frequencies up or down is called “re-weighting”.
In our case, we are only controlling for sampling bias in a single dimension (i.e., the geography of census block groups), and in that case the formula for adjusting (correcting) your measurements is fairly intuitive. As an aside, if you need to control for sampling bias on multiple dimensions simultaneously, there is an elegant regression-based solution called Heckman’s Correction. If you want to learn more about Heckman’s, I recommend this implementation & tutorial. Here we will not use Heckman’s and will instead use the straightforward stratified adjustment for a single dimension.
We will adjust our measurements from each stratum (e.g., census block group) individually.
Here is the re-weighting “formula” for each stratum:
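Written out from the definitions in this post (the symbol names population and sample_size are chosen here for illustration and are an assumption), the factor for stratum n compares that stratum's share of the population with its share of the sample:

```latex
\[
\mathit{adjust\_factor}_n
  = \frac{\mathit{population}_n \,/\, \sum_{k=1}^{N} \mathit{population}_k}
         {\mathit{sample\_size}_n \,/\, \sum_{k=1}^{N} \mathit{sample\_size}_k}
\]
```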
Another (IMHO, more intuitive) way to re-state this formula is:
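Using the notation of the note that follows, the restated form is the ratio of two probabilities (a reconstruction from the text, not a quotation):

```latex
\[
\mathit{adjust\_factor}
  = \frac{P(\mathrm{Stratum\ in\ Population})}{P(\mathrm{Stratum\ in\ Sample})}
\]
```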
Note that if 𝑃(𝑆𝑡𝑟𝑎𝑡𝑢𝑚 𝑖𝑛 𝑃𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛) > 𝑃(𝑆𝑡𝑟𝑎𝑡𝑢𝑚 𝑖𝑛 𝑆𝑎𝑚𝑝𝑙𝑒) then 𝑎𝑑𝑗𝑢𝑠𝑡_𝑓𝑎𝑐𝑡𝑜𝑟 > 1; you need to over-weight your sample estimate, because this stratum is under-represented in your sample compared to the population. Conversely, if 𝑃(𝑆𝑡𝑟𝑎𝑡𝑢𝑚 𝑖𝑛 𝑃𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛) < 𝑃(𝑆𝑡𝑟𝑎𝑡𝑢𝑚 𝑖𝑛 𝑆𝑎𝑚𝑝𝑙𝑒) then 𝑎𝑑𝑗𝑢𝑠𝑡_𝑓𝑎𝑐𝑡𝑜𝑟 < 1; you need to down-weight your sample estimate, because this stratum is over-represented in your sample compared to the population.
Remember, in our case, each stratum is a census block group.
See “Reducing Sampling Errors” by “Stratification” for more explanation.
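As a minimal sketch of the adjustment factor, the snippet below computes population share divided by sample share for each stratum. The function name, the dictionary layout, and the two toy strata are assumptions made for illustration; they mirror the 80%/20% population versus 50%/50% sample example from earlier in this post and are not SafeGraph data.

```python
def adjust_factors(population, sample_size):
    """Return {stratum: population share / sample share}.

    Both arguments map a stratum id (e.g., a CBG) to a count.
    A stratum with a sample size of zero would raise ZeroDivisionError.
    """
    total_population = sum(population.values())
    total_sample = sum(sample_size.values())
    factors = {}
    for stratum, pop in population.items():
        population_share = pop / total_population
        sample_share = sample_size[stratum] / total_sample
        factors[stratum] = population_share / sample_share
    return factors


# Hypothetical strata: the population is split 80/20, the sample 50/50.
population = {"stratum_a": 8000, "stratum_b": 2000}
sample_size = {"stratum_a": 500, "stratum_b": 500}

factors = adjust_factors(population, sample_size)
for stratum, factor in factors.items():
    print(stratum, round(factor, 3))
```

The under-sampled stratum gets a factor above 1 and the over-sampled stratum a factor below 1, which is the behavior described in the note above.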
First, to make sure we are on the same page, let’s confirm how things work if you are not doing any corrections. Let’s say you surveyed a sample of people from several different census block groups (CBGs) about some outcome_variable, e.g., income. To estimate the mean income across all CBGs, without any corrections or adjustments, you would calculate a weighted mean across all 𝑁 strata (e.g. CBGs), with each stratum weighted by the number of measurements (i.e. sample size) from that stratum.
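Written as a formula (a reconstruction from the definitions in the text; the name of the estimate on the left-hand side is an assumption):

```latex
\[
\hat{\mu}
  = \frac{\sum_{n=1}^{N} \mu_n \cdot \mathit{sample\_size}_n}
         {\sum_{n=1}^{N} \mathit{sample\_size}_n}
\]
```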
Where 𝜇_𝑛 is the mean income of stratum 𝑛, and 𝑠𝑎𝑚𝑝𝑙𝑒_𝑠𝑖𝑧𝑒_𝑛 is the sample size of that stratum (i.e., how many people you surveyed about their income.)
But maybe most of your survey came from particularly wealthy census block groups (sampling bias). To correct for sampling bias, you re-weight each stratum.
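In formula form, a reconstruction consistent with the weighted mean described above, the corrected estimate multiplies each stratum's weight by its adjustment factor:

```latex
\[
\hat{\mu}_{\mathit{adj}}
  = \frac{\sum_{n=1}^{N} \mu_n \cdot \mathit{sample\_size}_n \cdot \mathit{adjust\_factor}_n}
         {\sum_{n=1}^{N} \mathit{sample\_size}_n \cdot \mathit{adjust\_factor}_n}
\]
```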
Where 𝑎𝑑𝑗𝑢𝑠𝑡_𝑓𝑎𝑐𝑡𝑜𝑟_𝑛 is as defined above.
Effectively you are changing the “sample size” of each stratum so that its impact on the overall estimate is proportionate to the presence of this stratum in the whole population.
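To make the "effective sample size" idea concrete, here is a small sketch comparing the uncorrected and corrected weighted means. All numbers are hypothetical: two CBGs with made-up mean incomes, equal sample sizes, and adjustment factors that assume the wealthier CBG was over-sampled.

```python
def weighted_mean(values, weights):
    """Weighted mean of values, with one weight per value."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical strata: [wealthy CBG, less wealthy CBG]
stratum_mean_income = [100_000.0, 50_000.0]
sample_size = [500, 500]
adjust_factor = [0.4, 1.6]  # assumed: the wealthy CBG is over-represented

# Uncorrected: each stratum is weighted by its raw sample size.
raw_estimate = weighted_mean(stratum_mean_income, sample_size)

# Corrected: each stratum is weighted by sample_size * adjust_factor.
effective_size = [s * a for s, a in zip(sample_size, adjust_factor)]
corrected_estimate = weighted_mean(stratum_mean_income, effective_size)

print(round(raw_estimate), round(corrected_estimate))
```

With these toy inputs the corrected estimate is lower than the raw one, because the over-sampled wealthy stratum is weighted down.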
When analyzing the relative demographic composition of visitors (e.g., % Hispanic vs. % Not Hispanic), the “outcome_variable” is really just a count of people visiting a place. To be precise, as far as our formulas above are concerned, we can define 𝜇_𝑛 as the rate of visitors from 𝐶𝐵𝐺_𝑛 to the POI or brand, i.e., 𝑛𝑢𝑚_𝑣𝑖𝑠𝑖𝑡_𝑛 divided by 𝑠𝑎𝑚𝑝𝑙𝑒_𝑠𝑖𝑧𝑒_𝑛.
Then the numerator reduces to just the count of visitors 𝑛𝑢𝑚_𝑣𝑖𝑠𝑖𝑡_𝑛 multiplied by the 𝑎𝑑𝑗𝑢𝑠𝑡_𝑓𝑎𝑐𝑡𝑜𝑟_𝑛.
Technically, in our use case, we will actually calculate weighted sums of people, rather than weighted means, so our formula will look more like this:
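A reconstruction of that weighted sum from the quantities named in the text (the name adjusted_visitors is an assumption):

```latex
\[
\mathit{adjusted\_visitors}
  = \sum_{n=1}^{N} \mathit{num\_visit}_n \cdot \mathit{adjust\_factor}_n
\]
```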
Let’s re-compute the demographic profile for the same Walmart from Figure 1 (sg:23540fe68cb14f3b9bf848fda3e848fc). This time we will correct for sampling bias using stratified re-weighting.
Remember, if you want to see this logic fully implemented in Python, see the Teacher Jupyter Co-Lab Notebook: Measuring and Correcting Sampling Bias.
In order to measure and correct for sampling bias, we need to load two additional datasets:
We join all these datasets together and calculate the adjustment factor for each CBG using the formula described above, and the results look like this:
This dataframe has one row for every unique CBG visiting the Walmart sg:23540fe68cb14f3b9bf848fda3e848fc.
The column visitor_count_cbg_adj represents the count after correcting for sampling bias (it is just visitor_count * cbg_adjust_factor). You can compare it to the visitor_count column to see whether the correction adjusted the count up or down.
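The following pandas sketch shows how such a column could be derived. The columns visitor_count, cbg_adjust_factor, and visitor_count_cbg_adj come from the text; the columns cbg_population and cbg_devices, and all of the values, are hypothetical stand-ins. For brevity, the totals are taken over the three toy rows, whereas in practice the shares would presumably be computed against the full population and the full panel.

```python
import pandas as pd

# Hypothetical per-CBG inputs (not SafeGraph or Census data).
df = pd.DataFrame({
    "cbg": ["cbg_a", "cbg_b", "cbg_c"],
    "visitor_count": [40, 10, 50],
    "cbg_population": [1000, 3000, 1000],  # assumed Census population
    "cbg_devices": [100, 100, 50],         # assumed panel size per CBG
})

# Share of each CBG in the population and in the sample.
population_share = df["cbg_population"] / df["cbg_population"].sum()
sample_share = df["cbg_devices"] / df["cbg_devices"].sum()

# Adjustment factor and the bias-corrected visitor count.
df["cbg_adjust_factor"] = population_share / sample_share
df["visitor_count_cbg_adj"] = df["visitor_count"] * df["cbg_adjust_factor"]

print(df[["cbg", "visitor_count", "cbg_adjust_factor", "visitor_count_cbg_adj"]])
```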
We now split the adjusted counts into demographic segments and sum across all CBGs, which gives a new, slightly different demographic profile:
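One way to picture this step is sketched below, under the assumption that each CBG's visitors are split into segments in proportion to that CBG's Census composition. The tuple layout, the function name, and every number are hypothetical and chosen only to show the mechanics.

```python
# Each row: (visitor_count, cbg_adjust_factor, fraction Hispanic in the CBG).
# All values are hypothetical.
rows = [
    (40, 0.5, 0.05),
    (10, 1.5, 0.20),
    (50, 1.0, 0.10),
]


def segment_share(rows, use_adjustment):
    """Share of visitors in the segment, summed across all CBGs."""
    segment_total = 0.0
    overall_total = 0.0
    for visitor_count, adjust_factor, segment_fraction in rows:
        weight = visitor_count * (adjust_factor if use_adjustment else 1.0)
        segment_total += weight * segment_fraction
        overall_total += weight
    return segment_total / overall_total


raw_share = segment_share(rows, use_adjustment=False)
adjusted_share = segment_share(rows, use_adjustment=True)
print(f"raw: {raw_share:.3f}, adjusted: {adjusted_share:.3f}")
```

In this toy data the CBG with the largest Hispanic fraction is under-sampled, so the adjusted share comes out higher than the raw share.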
The specific values of each segment are printed at the top for reference.
If you want to see the generation of this chart in Python, see the Teacher Jupyter Co-Lab Notebook.
According to the uncorrected (raw) SafeGraph counts, about 8% of customers to this Walmart are in the category Hispanic or Latino Origin (left-side blue bar). However, as it turns out, across the ~130 census block groups that patronize this Walmart, the SafeGraph dataset is relatively under-indexed on the CBGs with larger Hispanic populations. Our methodology allows us to empirically calculate this bias, and to use stratified re-weighting to under-weight the CBGs with smaller Hispanic populations and over-weight the CBGs with larger Hispanic populations, so that we are measuring all CBGs proportionate to their frequency in the population. This gives us a revised estimate that about 9% of customers are Hispanic or Latino Origin (right-side blue bar).
The corrected data isn’t dramatically different from the uncorrected data because, overall, the SafeGraph panel is not very biased across ethnicities. However, 8% vs. 9% is a relative difference of about 11% (one percentage point against the corrected value of 9%) and is not inconsequential. If you project revenue for your business over the next year based on this demographic analysis, an 11% error could potentially cause serious problems.
This correction is only possible because we know the true frequencies of these different census block groups in the US Population from the Census.
Knowing the ground-truth frequencies allows us to apply the post-hoc stratified re-weighting method. This is powerful, because it ensures that insights we report are not due to biases in the SafeGraph sample. Our results show a low-bias view of true underlying demographics.
Of course, this correction does not come for “free”. It reduces the statistical certainty of our final estimate (anytime you extrapolate you will be less certain). To see the details of how we calculate our certainty, see Part 4 in the Series. Before that, we discuss how to wrangle all this census data in Part 3.
Thanks for reading! If you found this useful or interesting please upvote and share with a friend.
You are strongly encouraged to try out a sample of SafeGraph patterns data for free, no strings attached at the SafeGraph Data Bar. Use coupon code AnalyzeDemographics for $200 worth of free data! And please send us your ideas, feedback, bug discoveries and suggestions: [email protected]