
SafeGraph’s Data on Brick-And-Mortar Customer Demographics Is the Most Accurate and Comprehensive in the World

November 14, 2019
Ryan Fox Squire

SafeGraph Patterns, a retail foot-traffic dataset, is a treasure trove of customer demographic insights waiting to be unlocked.

Quick orientation to this blog series

This is Part 1 of a 4-Part blog series exploring how to analyze customer demographics using SafeGraph Patterns data.

Here in Part 1, we explain:

  • Why SafeGraph Patterns provides the most accurate and comprehensive customer demographic data available.
  • How to analyze demographics with SafeGraph Patterns data, using a simple end-to-end example.

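Parts 2, 3, and 4 discuss three key technical challenges for taking your demographic analysis to the next level:

  • Part 2: Measuring and correcting sampling bias.
  • Part 3: Wrangling census data and visualization.
  • Part 4: Quantifying statistical certainty.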

To complement these blog posts we also provide two interactive Jupyter Google Co-Lab notebooks written in Python.

These are referenced frequently throughout the series, and much of the content of these blog posts is duplicated in the Teacher Notebook augmented with fully-coded working examples.

  1. The Teacher Notebook dives into the methodology step-by-step and explains how everything works.
  2. The Analysis Workbook glosses over the methodology so you can run demographic analysis on your own data easily in less than 5 minutes. This is an end-to-end implementation from raw data to visualizations.

We hope you enjoy and learn something!

Do we even need to explain why demographics are important?

At the core of every business strategy, every marketing strategy, every product strategy is the question: Who are my customers?

Demography, or demographic analysis, is the statistical study of populations. In business, demographic profiles are descriptions of your customers (or potential customers) along various dimensions such as Age, Gender, Race, Ethnicity, Income, Education, Marital Status, Geography, etc.

Understanding who your customers are (and who the customers of your competitors and your complements are) is a powerful tool that can improve many aspects of your business, such as marketing, inventory planning, real estate decisions, product development, and more. Making good decisions for your business depends on an accurate understanding of your customers. And having a view of the customers of your competitors and your complements provides even more benefits.

How do businesses typically conduct demographic analysis?

There are broadly 3 methods used today to generate demographic profiles for your business.

  1. Surveys
  2. Marketing data based on emails, phone numbers, or other identity information
  3. Drive-time catchment areas

These methods each have strengths but they all have important limitations.

Surveys are costly and time-consuming, and it is difficult to get good coverage (survey completion rates are low). You also have to worry about sampling bias in who actually completes the survey. Did only my most loyal customers complete the survey?

Surveys are costly and time-consuming, and it is difficult to get good coverage.

Marketing data typically starts with customer loyalty programs that collect email addresses, phone numbers, or other identity information directly from customers. Then you purchase marketing datasets tied to these identifiers. There are a few problems. First, opt-in loyalty members represent a small and probably skewed slice of your customer base, so you have to worry about sampling bias. Second, the data is expensive. Third, marketing data is not very accurate anyway. Bummer.

Marketing data relies on identity-matching to your first-party data and can be skewed and inaccurate.


Drive-time catchment areas put a pin on a map at the location of your business, then analyze the census demographic data for all homes located within a 15-minute drive of your business. These are powerful GIS approaches to studying demographics. The limitation is that they assume people only visit businesses near their homes. What about people who commute to work and visit businesses near their workplaces? What about areas with many tourists? Furthermore, these methods cannot capture the nuances of customer segmentation within geographic areas.

According to drive-time catchment analysis, a nail salon, a pet store and a hardware store sitting adjacent to each other in a mall all have identical customer demographics.

Finally, most of these methods may give you a picture of your own customers, but you care about more than just your customers. How do you get insights into customers of competitive or complementary businesses?

With SafeGraph Patterns you can analyze demographic profiles for any store, any brand, any geography, any business.

Here we show a new approach for building demographic profiles using SafeGraph Patterns data.

What is SafeGraph Patterns?

SafeGraph Patterns is a dataset about ~3.6MM commercial brick-and-mortar points-of-interest (POI) and includes anonymized counts of how many people visit these POI each month.

The counts of visitors are derived from an anonymized panel (sample of population that is measured longitudinally) of ~46MM mobile devices (e.g., smartphones) in the USA.

These counts are broken down by different dimensions, including by the home census block group of the visitor. For example, the dataset may report that 85 total devices visited a specific POI. It also reports that 50 of those visitors live in census block group X, 25 live in census block group Y, and 10 live in census block group Z.
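To make the breakdown concrete, here is a minimal Python sketch using the illustrative numbers from the paragraph above. The labels X, Y, and Z are placeholders rather than real census block group IDs, and the variable names are our own.

```python
# Illustrative visitor counts by home census block group (CBG) for one POI.
# "X", "Y", and "Z" are placeholder labels, not real CBG identifiers.
visitors_by_home_cbg = {"X": 50, "Y": 25, "Z": 10}

# Total devices seen at the POI is the sum over all home CBGs.
total_visitors = sum(visitors_by_home_cbg.values())

# Share of this POI's visitors that come from each home CBG.
visitor_share = {
    cbg: count / total_visitors for cbg, count in visitors_by_home_cbg.items()
}

print(total_visitors)
for cbg, share in visitor_share.items():
    print(f"CBG {cbg}: {share:.1%}")
```

These per-CBG shares are the weights that later let us blend census demographics into a profile for the POI.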

SafeGraph provides an aggregated summary of the panel called Home Location Distributions by State/Census Block Group as part of the Panel Overview Data. This is an important reference which we will use to quantify and correct for sampling bias.

If the data is anonymized and aggregated, then how do we estimate demographics?

Protecting individual consumer privacy is at the core of the SafeGraph mission:

“SafeGraph’s mission is to make the world’s data open for innovation while protecting individual privacy.” — SafeGraph Vision and Values

The devices in the panel are fully anonymized; no device-level demographic data exists for devices in the panel. Instead, for every device in the panel, SafeGraph derives the home census block group (CBG) of that device, based on its most common location during nighttime hours. Importantly, none of the anonymized device-specific data is available to customers of SafeGraph; SafeGraph Patterns is an aggregated view.

Aggregation at the level of census block groups (CBGs) is all we need. The Census reports demographic data at the level of each CBG. This allows us to build an average demographic profile based on the derived home CBGs of customers visiting a particular store. There are over 220,000 CBGs in the USA, with a mean population of ~1500 persons per CBG. This provides a relatively precise demographic picture, as we will show below.

The aggregated form of SafeGraph Patterns helps to ensure the protection of individual privacy, while also providing actionable data for statistical analysis and data science.

For all the details on SafeGraph Patterns, see the SafeGraph Patterns docs.

SafeGraph Patterns is the Most Accurate and Comprehensive View of Retail Customer Demographics

  • SafeGraph Patterns is a direct measure of consumer behavior in the physical world, tied to highly accurate census data via empirically-derived home census areas.
  • Simultaneously, SafeGraph data is aggregated and abstracted away from individuals to be highly privacy-safe and super easy to use.
  • Unlike most customer demographic solutions, SafeGraph Patterns does not rely on surveys, customer loyalty programs, messy and unreliable marketing data or unrealistic geographic assumptions in consumer behavior.
  • It measures consumer visits equally well whether they happen near home, far from home, or anywhere in between.
  • It is comprehensive. It covers every retail brand and every brick-and-mortar retail business in the United States so you can analyze not only your own customers, but also the customers of your competitors, your complements, and new markets.

What exactly is a demographic profile? A simple end-to-end example.

A demographic profile is just a description of your customers (or potential customers) along various dimensions such as Age, Gender, Race, Ethnicity, Income, Education, Marital Status, Geography, etc.

We will analyze a demographic profile for this Walmart located at 3141 Garden Rd, Burlington, NC 27215, US (sg:23540fe68cb14f3b9bf848fda3e848fc). Geospatial polygon data courtesy of SafeGraph Geometry.

To make this concrete, let’s show an end-to-end example. To keep the example and calculations super simple, we will examine only a single point-of-interest (POI) location (a Walmart) and a single demographic dimension (Ethnicity) with only two demographic segments (“Hispanic or Latino Origin” and “Not Hispanic or Latino Origin”; more details below).

Once you load and transform your Patterns data, your SafeGraph data basically looks as follows (to see this example fully implemented in Python, check out the Teaching Jupyter Co-Lab Notebook):

Each row is a unique pair of a safegraph_place_id (sgpid) and one home census block group (visitor_home_cbg). The column visitor_count shows the SafeGraph measurement of how many visitors from that home census block group visited that sgpid.

This example is for a single POI and has 134 rows. If we sum visitor_count we see there are a total of 3223 unique visitors to this POI (coming from 134 unique census block groups).
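The transform step can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: it assumes the raw Patterns row stores the home-CBG breakdown as a JSON string in a column called visitor_home_cbgs, and the place ID, CBG IDs, and counts are all made up.

```python
import json

# One hypothetical raw Patterns row. The JSON-string layout of
# "visitor_home_cbgs" is an assumption; the IDs and counts are made up.
raw_row = {
    "safegraph_place_id": "sg:hypothetical-poi",
    "visitor_home_cbgs": (
        '{"000000000001": 50, "000000000002": 25, "000000000003": 10}'
    ),
}


def explode_home_cbgs(row):
    """Return one record per (safegraph_place_id, visitor_home_cbg) pair."""
    counts = json.loads(row["visitor_home_cbgs"])
    return [
        {
            "safegraph_place_id": row["safegraph_place_id"],
            "visitor_home_cbg": cbg,
            "visitor_count": count,
        }
        for cbg, count in counts.items()
    ]


long_rows = explode_home_cbgs(raw_row)

# The two summaries quoted in the text: number of rows (unique home CBGs)
# and total visitors (sum of visitor_count).
n_cbgs = len(long_rows)
total_visitors = sum(r["visitor_count"] for r in long_rows)
print(n_cbgs, total_visitors)
```

Applied to the real Walmart rows, the same two summaries correspond to the 134 rows and 3223 visitors quoted above.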

We don’t know any demographic data about those individual visitors, but we can use Census data to look up the average demographics of the people who live in each CBG. So, for each of those CBGs, let’s look up the fraction of residents that are Hispanic Or Latino Origin vs Not Hispanic Or Latino Origin.

SafeGraph Open Census Data is your friend

Census data can be difficult to navigate, so SafeGraph packaged all of it in a convenient single download called Open Census Data.

In this example, we are analyzing Latino or Hispanic Origin Ethnicity. The census tracks this data under the category Hispanic Or Latino Origin, and we can look up this category in the handy file from Open Census Data called cbg_field_descriptions.csv to see that the two table_ids we need are:

  • B03003e2 HISPANIC OR LATINO ORIGIN: Not Hispanic or Latino: Total population — (Estimate)
  • B03003e3 HISPANIC OR LATINO ORIGIN: Hispanic or Latino: Total population — (Estimate)

Going forward I will abbreviate these two groups as Not Hispanic and Hispanic, respectively. The Census counts every resident in exactly one of these two categories, so for each CBG we take the estimates for these two table_ids and convert them to relative fractions that sum to 1.
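As a rough sketch of that conversion, the snippet below turns the two estimates into fractions for each CBG. The CBG IDs and population counts are invented for illustration, and the function and key names are our own, not part of Open Census Data.

```python
# Hypothetical census estimates for two CBGs; the IDs and counts are made up.
# B03003e2 = Not Hispanic or Latino, B03003e3 = Hispanic or Latino.
census_counts = {
    "000000000001": {"B03003e2": 1350, "B03003e3": 150},
    "000000000002": {"B03003e2": 800, "B03003e3": 200},
}


def to_fractions(counts):
    """Convert the two B03003 estimates for one CBG into relative fractions."""
    total = counts["B03003e2"] + counts["B03003e3"]
    if total == 0:
        # An unpopulated CBG has no defined fractions.
        return {"not_hispanic": None, "hispanic": None}
    return {
        "not_hispanic": counts["B03003e2"] / total,
        "hispanic": counts["B03003e3"] / total,
    }


census_fractions = {
    cbg: to_fractions(counts) for cbg, counts in census_counts.items()
}

for cbg, fractions in census_fractions.items():
    print(cbg, fractions)
```

Returning None for an empty CBG is one possible design choice; dropping such CBGs before the join would work just as well.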

Once we load these data and convert them to fractions, we have census data that looks like the following (again, to see this example fully implemented in Python, check out the Teaching Jupyter Co-Lab Notebook).

Join this dataset with the SafeGraph Patterns data by matching visitor_home_cbg to census_block_group, multiply the fractions by the visitor_count, and you have an estimate of how many people from each demographic segment visited this Walmart from each census block group. From there, you simply sum across all the census block groups (rows) for each segment (column) and divide each of the two sums by the overall total to get that segment’s percentage of visitors. Voilà, you have a demographic profile.
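The join, multiply, sum, and divide steps can be sketched in a few lines of plain Python. Everything here is illustrative: the CBG IDs, visitor counts, and fractions are made up, and demographic_profile is a hypothetical helper rather than a function from the notebooks.

```python
# Hypothetical inputs; every ID and number below is made up.
visits = [  # one record per home CBG for a single POI
    {"visitor_home_cbg": "000000000001", "visitor_count": 60},
    {"visitor_home_cbg": "000000000002", "visitor_count": 40},
]
census_fractions = {  # keyed by census_block_group
    "000000000001": {"hispanic": 0.10, "not_hispanic": 0.90},
    "000000000002": {"hispanic": 0.20, "not_hispanic": 0.80},
}


def demographic_profile(visits, census_fractions):
    """Estimate each segment's share of visitors to one POI."""
    estimated = {}
    for row in visits:
        # Join: look up census fractions for the visitor's home CBG.
        fractions = census_fractions.get(row["visitor_home_cbg"])
        if fractions is None:
            continue  # no census match, so the row is dropped (inner join)
        # Multiply fractions by visitor_count and accumulate per segment.
        for segment, fraction in fractions.items():
            estimated[segment] = (
                estimated.get(segment, 0.0) + fraction * row["visitor_count"]
            )
    total = sum(estimated.values())
    if total == 0:
        return {}
    # Divide each segment's estimated count by the overall total.
    return {segment: count / total for segment, count in estimated.items()}


profile = demographic_profile(visits, census_fractions)
for segment, share in profile.items():
    print(f"{segment}: {share:.0%}")
```

With these made-up inputs the Hispanic share is (0.10 × 60 + 0.20 × 40) / 100 = 0.14. Note that this is the raw estimate only; it applies no correction for sampling bias.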

Figure 1. A demographic profile of visitors to Walmart 3141 Garden Rd, Burlington, NC 27215 (sg:23540fe68cb14f3b9bf848fda3e848fc)

Figure 1 shows that according to raw* SafeGraph data, approximately 8% of visitors to this Walmart are counted in the Hispanic or Latino Origin demographic segment. *This doesn’t account for sampling bias, which we discuss in Part 2. To see this chart generated in Python, check out the Teaching Jupyter Co-Lab Notebook.

Taking Demographic Profiles to the Next Level: 3 Technical Challenges

Here in Part 1, we’ve introduced the concept of demographic profiles and showed the core logic for how you can analyze them with SafeGraph Patterns.

But SafeGraph customers are sophisticated. They want to scale demographic analysis to include many locations, many brands, and many different demographic dimensions, and they need the results to be statistically and methodologically rigorous.

There are 3 main technical challenges to take demographic analysis to the next level:

  1. Measuring and Correcting Sampling Bias. What if the SafeGraph dataset is biased towards higher-income individuals, or certain geographic regions? This is an important issue to control and correct for precise results. See Part 2.
  2. Wrangling Census Data and Visualization. Analyzing many demographic dimensions quickly grows the number of variables and census codes into hundreds or thousands of variables. Organizing all this data isn’t glamorous, but someone’s got to do it. See Part 3.
  3. Quantifying Statistical Certainty. How confident are we in our results? How do we know the difference between real insights and statistical noise? See Part 4.

Are you ready to take your demographic analysis to the next level?

This was Part 1 of a 4-Part blog series exploring how to analyze customer demographics using SafeGraph Patterns data. Keep reading as we explain how to solve core technical challenges one-by-one.

Thanks for reading! If you found this useful or interesting, please upvote and share with a friend. We strongly encourage you to try a free sample of SafeGraph Patterns data, no strings attached, at the SafeGraph Data Bar. Use coupon code AnalyzeDemographics for $200 worth of free data! And please send us your ideas, feedback, bug discoveries, and suggestions: [email protected]

Ryan Fox Squire
Product & Data Science