This post is available as an interactive Google Colaboratory notebook. Click here to see the full post, see the results, and play with the code yourself.
Below we’ve copied the Introduction and Highlights.
SafeGraph Patterns measures foot-traffic patterns to 3.6MM commercial points-of-interest from over 45MM mobile devices in the United States and provides a monumental window into American commerce. SafeGraph data users look through this window to ask detailed questions about consumer behavior (e.g., What is a brand’s true customer demographic? How far do people travel to go grocery shopping? What is the impact of opening a national brand coffee shop on all the other coffee shops in a neighborhood?).
Common questions we hear from SafeGraph Patterns customers include “What about bias in your dataset?”, “Does your panel really represent the true American public?”, and “How do we know that your panel isn’t oversampling wealthier people?” This is the kind of sophisticated data skepticism we love to hear. A key part of SafeGraph’s vision is to “Seek the Truth About the World…Of course, data can never be 100% true, but we should strive to make it 100% true.”
And although SafeGraph Patterns aggregates data from ~10% of devices in the United States (a very impressive sample, if we do say so ourselves!), this sample is not a perfectly representative subset of the population. Like all samples, the SafeGraph dataset has sampling error.
For example, when compared against census data, the panel correlates strongly across USA counties (Pearson correlation coefficient, r=0.97), educational attainment (r=0.99), and household income (r=0.99).
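To make the Pearson correlation coefficient quoted above concrete, here is a minimal sketch of how such a number could be computed by comparing, group by group, the census population with the number of devices in a panel. This is not the notebook's code: the function name `pearson_r` and the toy county figures are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT SafeGraph or census figures):
# one entry per county, in the same order in both lists.
census_population = [1_000_000, 250_000, 50_000, 600_000, 120_000]
panel_devices = [95_000, 27_000, 4_000, 63_000, 11_000]

r = pearson_r(census_population, panel_devices)
print(f"r = {r:.3f}")
```

An r close to 1 means the panel's device counts rise and fall with the census counts across groups. Because Pearson's r is unchanged by rescaling either variable, the result is the same whether raw counts or population shares are compared.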
Curious what this means, exactly how we got these numbers, and how to test other census sub-populations? See the full notebook.
We also include a short preview of a future post: