This is Part 1 of a 4-Part blog series exploring how to analyze customer demographics using SafeGraph Patterns data.
Here, in Part 1 we explain:
Parts 2, 3, and 4 discuss three key technical challenges for taking your demographic analysis to the next level:
These are referenced frequently throughout the series, and much of the content of these blog posts is duplicated in the Teacher Notebook augmented with fully-coded working examples.
We hope you enjoy and learn something!
At the core of every business strategy, every marketing strategy, every product strategy is the question: Who are my customers?
Demography, or demographic analysis, is the statistical study of populations. In business, demographic profiles are descriptions of your customers (or potential customers) along various dimensions such as Age, Gender, Race, Ethnicity, Income, Education, Marital Status, Geography, etc.
Understanding who your customers are (and who the customers of your competitors and complements are) is a powerful tool that can positively impact many aspects of your business, such as marketing, inventory planning, real estate decisions, product development, and more. Making good decisions for your business depends on an accurate understanding of your customers. And having a view of the customers of your competitors and complements provides even more benefits.
There are broadly 3 methods used today to generate demographic profiles for your business.
These methods each have strengths but they all have important limitations.
Surveys are costly and time-consuming, and it is difficult to get good coverage because survey completion rates are low. You also have to worry about sampling bias in who actually completes the survey. Did only my most loyal customers complete the survey?
Marketing Data typically starts with customer loyalty programs that collect email addresses, phone numbers or other identity information directly from customers. Then you purchase marketing datasets tied to these identifiers. There are a few problems. First, opt-in loyalty members represent a small and probably skewed slice of your customer base, so you have to worry about sampling bias. Second, the data is expensive. Third, marketing data are not very accurate anyway. Bummer.
Marketing data relies on identity-matching to your first-party data and can be skewed and inaccurate.
Drive-time catchment areas put a pin on a map for the location of your business, then analyze all of the census demographic data for all homes located within a 15-minute driving distance of your business. These are powerful GIS approaches to studying demographics. The limitation is that they assume people only visit businesses geographically near their homes. What about people that commute to work and visit businesses near their work locations? What about high tourist areas? Furthermore, these methods cannot capture the nuances of customer segmentation within geographic areas. According to drive-time catchment analysis, a nail salon, a pet store and a hardware store sitting adjacent to each other in a mall all have identical customer demographics.
Finally, most of these methods may give you a picture of your own customers, but you care about more than just your customers. How do you get insights into customers of competitive or complementary businesses?
Here we show a new approach for building demographic profiles using SafeGraph Patterns data.
SafeGraph Patterns is a dataset covering ~3.6MM commercial brick-and-mortar points-of-interest (POI) and includes anonymized counts of how many people visit these POI each month.
The counts of visitors are derived from an anonymized panel (a sample of the population that is measured longitudinally) of ~46MM mobile devices (e.g., smartphones) in the USA.
These counts are broken down by different dimensions, including by the home census block group of the visitor. For example, the dataset may report that 85 total devices visited a specific POI. It also reports that 50 of those visitors live in census block group X, 25 live in census block group Y, and 10 live in census block group Z.
SafeGraph provides an aggregated summary of the panel called Home Location Distributions by State/Census Block Group as part of the Panel Overview Data. This is an important reference which we will use to quantify and correct for sampling bias.
Protecting individual consumer privacy is at the core of the SafeGraph mission:
“SafeGraph’s mission is to make the world’s data open for innovation while protecting individual privacy.” — SafeGraph Vision and Values
The devices in the panel are fully anonymized; no device-level demographic data exists for devices in the panel. Instead, for every device in the panel, SafeGraph accurately derives the home census block group (CBG) of that device, based on the most common location during night time hours. Importantly, none of the anonymized device-specific data is available to customers of SafeGraph; SafeGraph Patterns is an aggregated view.
Aggregation at the level of census block groups (CBGs) is all we need. The Census reports demographic data at the level of each CBG. This allows us to build an average demographic profile based on the derived home CBGs for customers visiting a particular store. There are over 220,000 CBGs in the USA, with a mean population of ~1500 persons per CBG. This provides a relatively precise demographic picture, as we will show below.
The aggregated form of SafeGraph Patterns helps to ensure the protection of individual privacy, while also providing actionable data for statistical analysis and data science.
For all the details on SafeGraph Patterns, see the SafeGraph Patterns docs.
A demographic profile is just a description of your customers (or potential customers) along various dimensions such as Age, Gender, Race, Ethnicity, Income, Education, Marital Status, Geography, etc.
To make this concrete, let’s show an end-to-end example. To make the example and calculations super simple, we will examine only a single point-of-interest (POI) location (a Walmart) and a single demographic dimension (Ethnicity) with only two demographic segments ("Hispanic or Latino Origin" and "Not Hispanic or Latino Origin"; more details below).
Once you load and transform your Patterns data, your SafeGraph data basically looks as follows (to see this example fully implemented in Python, check out the Teaching Jupyter Co-Lab Notebook):
Each row is a unique pair of a safegraph_place_id (sgpid) and one home census block group (visitor_home_cbg). The column visitor_count shows the SafeGraph measurement of how many visitors from that home census block group visited that sgpid.
This example is for a single POI and has 134 rows. If we sum visitor_count we see there are a total of 3223 unique visitors to this POI (coming from 134 unique census block groups).
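The "load and transform" step can be sketched in plain Python. This is only an illustration under stated assumptions: it assumes the raw Patterns row stores the home CBG counts as a JSON-encoded mapping in a column called `visitor_home_cbgs`, and the place id, CBG ids, and counts below are made up.

```python
import json

# Hypothetical raw row: one POI with a JSON-encoded {home CBG: visitor count} mapping.
# The column name "visitor_home_cbgs" is an assumption about the raw schema.
raw_row = {
    "safegraph_place_id": "sg:example-walmart",  # placeholder id
    "visitor_home_cbgs": '{"010010201001": 50, "010010201002": 25, "010010202001": 10}',
}


def explode_home_cbgs(row):
    """Return one record per (place id, home CBG) pair."""
    counts = json.loads(row["visitor_home_cbgs"])
    return [
        {
            "safegraph_place_id": row["safegraph_place_id"],
            "visitor_home_cbg": cbg,  # kept as a string so leading zeros survive
            "visitor_count": count,
        }
        for cbg, count in counts.items()
    ]


rows = explode_home_cbgs(raw_row)

# Summary numbers analogous to the ones quoted above for the real POI.
total_visitors = sum(r["visitor_count"] for r in rows)
unique_cbgs = len({r["visitor_home_cbg"] for r in rows})
```

Keeping the CBG id as a string matters in practice, because ids for some states begin with a zero that is lost if the value is parsed as an integer.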
We don’t know any demographic data about those individual visitors, but we can use Census data to find the average demographics of the people who live in each CBG. So, for each of those CBGs, let’s look up the fraction of residents that are Hispanic Or Latino Origin vs Not Hispanic Or Latino Origin.
Census data can be difficult to navigate, so SafeGraph packaged all of it in a convenient single download called Open Census Data.
In this example, we are analyzing Latino or Hispanic Origin Ethnicity. The census tracks this data under the category Hispanic Or Latino Origin, and we can look up this category in the handy file from Open Census Data called cbg_field_descriptions.csv to see that the two table_ids we need are:
Going forward I will abbreviate these two groups as Not Hispanic and Hispanic, respectively. The Census counts every resident in one of these two categories, so for each CBG we use these table_ids and convert the counts to relative fractions.
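As a rough sketch of the count-to-fraction conversion: the keys `not_hispanic` and `hispanic` below are placeholder names standing in for the two table_ids, and every number is invented for illustration. The handling of an empty CBG is a design choice of this sketch, not something prescribed by the data.

```python
# Hypothetical census counts per CBG. The keys "not_hispanic" and "hispanic" are
# placeholders for the two table_ids; all CBG ids and counts are made up.
census_counts = {
    "010010201001": {"not_hispanic": 900, "hispanic": 100},
    "010010201002": {"not_hispanic": 600, "hispanic": 200},
    "010010202001": {"not_hispanic": 0, "hispanic": 0},  # CBG with no residents
}


def to_fractions(counts):
    """Convert segment counts for one CBG into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # No residents reported: leave the fractions undefined instead of dividing by zero.
        return {segment: None for segment in counts}
    return {segment: n / total for segment, n in counts.items()}


census_fractions = {cbg: to_fractions(c) for cbg, c in census_counts.items()}
```

Because the two segments cover the whole population of a CBG, the two fractions for any populated CBG sum to 1.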
Once we load these data and convert them to fractions, we have census data that looks like the following (again, to see this example fully implemented in Python, check out the Teaching Jupyter Co-Lab Notebook).
Join this dataset with the SafeGraph Patterns data on census_block_group, multiply the fractions by the visitor_count, and you have an estimate of how many people from each demographic segment visited this Walmart from each census block group. From there, you simply add all the census block groups (rows) together for each segment (column) and divide the two counts by the total to get a percent of the total. Voila, you have a demographic profile.
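The join, multiply, sum, and divide steps can be sketched as follows. This is a minimal illustration in plain Python with made-up inputs, not the notebook's implementation; the function name `demographic_profile` and the segment keys are hypothetical.

```python
# Toy inputs (all values made up): visitor counts by home CBG for one POI,
# and the fraction of residents in each segment for each CBG.
visitor_counts = {"010010201001": 50, "010010201002": 25, "010010202001": 10}
census_fractions = {
    "010010201001": {"not_hispanic": 0.90, "hispanic": 0.10},
    "010010201002": {"not_hispanic": 0.75, "hispanic": 0.25},
    "010010202001": {"not_hispanic": 0.50, "hispanic": 0.50},
}


def demographic_profile(visitor_counts, census_fractions):
    """Estimate the share of visitors in each segment, weighting each CBG by its visitors."""
    estimated = {}
    for cbg, visitors in visitor_counts.items():
        fractions = census_fractions.get(cbg)
        if fractions is None:
            continue  # behaves like an inner join: skip CBGs with no census record
        for segment, fraction in fractions.items():
            # Estimated visitors from this CBG who belong to this segment.
            estimated[segment] = estimated.get(segment, 0.0) + visitors * fraction
    total = sum(estimated.values())
    if total == 0:
        raise ValueError("no visitors could be matched to census data")
    return {segment: count / total for segment, count in estimated.items()}


profile = demographic_profile(visitor_counts, census_fractions)
```

With these toy numbers, the estimated Hispanic visitor count is 5 + 6.25 + 5 = 16.25 out of 85 visitors, so `profile["hispanic"]` is roughly 0.19. Note that CBGs with more visitors pull the profile toward their own demographics, which is exactly the weighting described above.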
Figure 1 shows that according to raw* SafeGraph data, approximately 8% of visitors to this Walmart are counted in the Hispanic or Latino Origin demographic segment. *This doesn’t account for sampling bias, which we discuss in Part 2. To see this chart generated in Python, check out the Teaching Jupyter Co-Lab Notebook.
Here in Part 1, we’ve introduced the concept of demographic profiles and showed the core logic for how you can analyze them with SafeGraph Patterns.
But SafeGraph customers are sophisticated. They want to scale demographic analysis to include many locations, many brands, and many different demographic dimensions, and they need the results to be statistically and methodologically rigorous.
There are 3 main technical challenges to take demographic analysis to the next level:
This was Part 1 of a 4-Part blog series exploring how to analyze customer demographics using SafeGraph Patterns data. Keep reading as we explain how to solve core technical challenges one-by-one.
Thanks for reading! If you found this useful or interesting, please upvote and share with a friend. You are strongly encouraged to try out a sample of SafeGraph Patterns data for free, no strings attached, at the SafeGraph Data Bar. Use coupon code AnalyzeDemographics for $200 worth of free data! And please send us your ideas, feedback, bug discoveries and suggestions: [email protected]