Step-by-Step Guide for Correlating Your First-Party Store Data with SafeGraph Patterns

February 11, 2020 | By Ryan Fox Squire

SafeGraph Patterns foot-traffic data boosts the accuracy of first-party store data 

Many businesses and organizations like to validate SafeGraph Patterns foot-traffic data by analyzing it alongside their own first-party store data. They do this to gain a deeper understanding of their own store performance as well as to battle-test and bullet-proof the data before applying the same techniques in their own competitive intelligence efforts. It’s all about driving greater confidence in the data overall, a necessary first step in extracting meaningful and actionable insights that can drive real business results.

Unfortunately, correlating this data can be a somewhat complicated process. So, to help you do this correctly every time and get the most out of SafeGraph Patterns foot-traffic data, we’ve created this easy step-by-step guide to eliminate any guesswork. Let’s jump right in.

>> This information is also available in greater detail as an interactive Google Colab notebook. Click here to read the full post, see the results, and play with the code yourself. <<

First, background on SafeGraph foot traffic data

SafeGraph Patterns measures the foot-traffic patterns to 3.6 million commercial points-of-interest, via an anonymized mobile location data panel that spans 45 million mobile devices in the United States, to offer an amazingly detailed window into the day-to-day of American commerce. Additionally, it gives you a new way to validate and extract greater value from your own store data. 

>> If you’d like to start correlating your store data on your own, be sure to check out the Google Colab notebook first. <<

Step #1: Load first-party store data 

The first step in the correlation process is to load your first-party store data. As you’ll see from the example provided in the Colab notebook, I’ve prepared some synthetic, hypothetical transaction data about Home Depot locations between 2018 and 2019.

For those of you working along as you read this, be sure to characterize the distribution of this transaction data via descriptive statistics, histogram, and a time series by month.
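If you'd like a starting point, here is a minimal sketch of that characterization in pandas. The data frame, its column names (store_id, month, transactions), and the random numbers are all hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for first-party data: one row per store per month.
rng = np.random.default_rng(0)
months = pd.date_range("2018-01-01", "2019-12-01", freq="MS")
n_stores = 5
transactions = pd.DataFrame({
    "store_id": np.repeat(np.arange(1, n_stores + 1), len(months)),
    "month": np.tile(months.to_numpy(), n_stores),
    "transactions": rng.poisson(1000, size=n_stores * len(months)),
})

# Descriptive statistics (count, mean, std, quartiles) of the counts.
summary = transactions["transactions"].describe()

# Histogram as bin counts; pass the same column to a plotting library for a chart.
counts, bin_edges = np.histogram(transactions["transactions"], bins=10)

# Time series: total transactions per calendar month across all stores.
monthly_totals = transactions.groupby("month")["transactions"].sum()

print(summary)
print(monthly_totals.head())
```

With real data, the monthly totals are where seasonal peaks would show up.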

Right off the bat, we can spot a few interesting patterns from this completely hypothetical data set. Based on this information alone, we can infer that transaction volume at Home Depot is highly seasonal in nature, with the largest number of transactions happening in May and June, and then again at the end of the year around the holiday shopping season. 

Now that we have a handle on our transaction data, let’s turn to SafeGraph Patterns data.

Step #2: Load SafeGraph Patterns data

The next step in this process is to load SafeGraph Patterns data. If you have any questions about any of the SafeGraph Patterns columns, check out the schema documentation here.
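As a rough sketch, loading a Patterns file with pandas might look like the following. The inline CSV is made up, and the column names are assumptions based on my reading of the Patterns schema, so check them against the schema documentation for your delivery.

```python
import io
import json
import pandas as pd

# Tiny made-up sample standing in for a Patterns CSV file (assumed columns).
sample_csv = io.StringIO(
    "safegraph_place_id,location_name,date_range_start,"
    "raw_visit_counts,raw_visitor_counts,visits_by_day\n"
    'sg:aaa,Hypothetical Store A,2019-05-01,89,31,"[30,28,31]"\n'
    'sg:bbb,Hypothetical Store B,2019-05-01,67,24,"[25,20,22]"\n'
)

patterns = pd.read_csv(sample_csv, parse_dates=["date_range_start"])

# Some columns arrive as JSON-encoded strings; decode them into Python lists.
patterns["visits_by_day"] = patterns["visits_by_day"].apply(json.loads)

print(patterns.dtypes)
```

For a real file, replace the StringIO object with the path to your download.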

Easy enough, right? Now comes the tricky(ish) part.

Step #3: Join first-party data to SafeGraph Patterns using the SafeGraph Matching Service

Bringing together two points-of-interest datasets, using fields like street address, zip code, and location name, is not always an easy task. Fortunately for you, SafeGraph offers a free POI Matching Service to facilitate this process. Just drop us a line, and we’ll get right on it!

Now back to the example at hand. Here, I’ve used SafeGraph’s Matching Service to successfully append a SafeGraph place ID to all of the records in our first-party transaction dataset. This made it possible to make a clean join between SafeGraph Patterns and the transaction dataset. 
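Once every first-party record carries a place ID, the join itself is a plain key merge. Here is a hedged sketch with made-up rows; the column names (safegraph_place_id, month, raw_visitor_counts) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical first-party records, already tagged with a place ID.
first_party = pd.DataFrame({
    "store_id": [101, 102, 103],
    "safegraph_place_id": ["sg:aaa", "sg:bbb", "sg:ccc"],
    "month": ["2019-05", "2019-05", "2019-05"],
    "transactions": [1200, 950, 1430],
})

# Hypothetical Patterns rows for the same month.
patterns = pd.DataFrame({
    "safegraph_place_id": ["sg:aaa", "sg:bbb", "sg:ddd"],
    "month": ["2019-05", "2019-05", "2019-05"],
    "raw_visitor_counts": [310, 240, 500],
})

# Left join on place ID and month; validate guards against duplicate keys,
# and indicator records which rows found a match.
merged = first_party.merge(
    patterns,
    on=["safegraph_place_id", "month"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

unmatched = merged[merged["_merge"] == "left_only"]
joined = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(f"{len(joined)} matched rows, {len(unmatched)} unmatched")
```

Inspecting the unmatched rows before dropping them is a cheap way to catch ID or date formatting mismatches.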

>> If you’d like to see this in all its glory, check out the Colab notebook. <<

Step #4: Check the raw correlation

If we agree that SafeGraph Patterns is an accurate measure of real-world consumer behaviors, then we’d expect a strong correlation between SafeGraph Patterns visit counts and other datasets concerning consumer behaviors. 

We make it easy to quickly check the correlations between SafeGraph Patterns and transaction volume across time and locations. In this example, the correlation (~0.7) looks very promising:


There’s a strong correlation between ‘raw visitors count’ (from SafeGraph Patterns) and ‘transactions’ (from our hypothetical transaction dataset).
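For readers following along without the notebook, a raw correlation check can be sketched as below. The data is randomly generated for illustration, so the number it prints is not the ~0.7 from this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the joined dataset: visitors drive transactions, plus noise.
rng = np.random.default_rng(42)
visitors = rng.gamma(shape=4.0, scale=80.0, size=500)
transactions = 3.0 * visitors + rng.normal(0, 300, size=500)
joined = pd.DataFrame({
    "raw_visitor_counts": visitors,
    "transactions": transactions,
})

# Pearson correlation across all store-month rows.
pearson_r = joined["raw_visitor_counts"].corr(joined["transactions"])
print(f"Pearson r = {pearson_r:.2f}")
```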

Step #5: Clean up the data

There are a few outliers in the SafeGraph Patterns data that may be worth a closer look. However, keep in mind that outliers can also dominate regressions and distort our correlation analysis. Our goal with this exercise is simply to see whether SafeGraph Patterns data correlates with our first-party data, in general. Let’s use this outlier filtering method to clean things up:

Doing so removes the outliers so we can focus on the core dataset, which now has both a slightly improved correlation and a more nuanced picture in our scatter plot.
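The exact filter used in the notebook isn't reproduced here. As one common alternative, this sketch drops rows whose visitor counts fall outside 1.5 times the interquartile range; the helper name drop_iqr_outliers and the sample rows are hypothetical.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value in `column` lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Made-up joined rows; the last visitor count is an obvious outlier.
joined = pd.DataFrame({
    "raw_visitor_counts": [200, 220, 250, 260, 280, 300, 310, 5000],
    "transactions": [800, 900, 1000, 1010, 1100, 1200, 1250, 1300],
})

cleaned = drop_iqr_outliers(joined, "raw_visitor_counts")
print(f"Removed {len(joined) - len(cleaned)} row(s)")
```

Whatever rule you pick, record how many rows it removes so the cleaned correlation can be reported alongside the raw one.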

>> Check the quality of the relationship between the data sets via a univariate linear regression. Check out the Colab notebook to see this in more detail. <<

Fortunately, this regression confirms what we suspected: there’s a strong relationship between the data, despite some variance in transaction volume.
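A univariate regression of this kind can be sketched with NumPy alone. The data below is synthetic and the coefficients are arbitrary, so treat it as an illustration of the mechanics rather than the notebook's result.

```python
import numpy as np

# Synthetic data: transactions rise linearly with visitor counts, plus noise.
rng = np.random.default_rng(1)
visitors = rng.uniform(100, 600, size=300)
transactions = 250 + 3.0 * visitors + rng.normal(0, 150, size=300)

# Fit transactions = intercept + slope * visitors by least squares.
slope, intercept = np.polyfit(visitors, transactions, deg=1)

# R-squared: the share of variance in transactions explained by the fit.
predicted = intercept + slope * visitors
ss_res = np.sum((transactions - predicted) ** 2)
ss_tot = np.sum((transactions - transactions.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.1f}, R^2={r_squared:.2f}")
```

With a single predictor, R-squared equals the square of the Pearson correlation, which is a handy consistency check against Step #4.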

Step #6: Run a multivariate regression

At this point, we need to ask: is a model any good without SafeGraph Patterns data? 

In other words, how much of the story around transaction volume can we tell using only each individual store’s average number of transactions and the effects of seasonality?

Answering this question requires running a multivariate regression that accounts for both the month and the store number, which tells us that around 65 percent of the variance in the transaction data can be explained by these two variables.
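One way to set up such a baseline is to treat store and month as categorical variables and fit ordinary least squares on their dummy columns. The sketch below assumes that approach and uses a synthetic panel; the helper names and all numbers are hypothetical, so its R-squared will not match the 65 percent quoted above.

```python
import numpy as np
import pandas as pd

def make_panel(n_stores=20, seed=0):
    """Build a synthetic, hypothetical store-by-month panel (not real data)."""
    rng = np.random.default_rng(seed)
    store = np.repeat(np.arange(n_stores), 12)
    month = np.tile(np.arange(1, 13), n_stores)
    store_effect = rng.normal(0, 200, n_stores)[store]
    season = 150 * np.sin(2 * np.pi * (month - 3) / 12)
    noise = rng.normal(0, 120, store.size)
    return pd.DataFrame({
        "store": store,
        "month": month,
        "transactions": 1000 + store_effect + season + noise,
    })

def fixed_effects_design(df):
    """Intercept plus dummy columns for store and month (first level dropped)."""
    dummies = pd.get_dummies(
        df[["store", "month"]], columns=["store", "month"], drop_first=True
    )
    return np.column_stack([np.ones(len(df)), dummies.to_numpy(dtype=float)])

def r_squared(X, y):
    """In-sample R-squared of an ordinary least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

panel = make_panel()
X_baseline = fixed_effects_design(panel)
baseline_r2 = r_squared(X_baseline, panel["transactions"].to_numpy())
print(f"Baseline R^2 (store + month only): {baseline_r2:.2f}")
```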

>> This can get a little complicated, so check out the Colab notebook to see how it’s done. <<

Now, all we need to do is to verify whether SafeGraph Patterns data can help explain additional variance beyond what we already know from the regression analysis above. To do this, we must look at the differences not explained by the model to see the full picture. This will require us to factor out seasonality and store effects from the SafeGraph Patterns data. 

Step #7: Check the correlation

After doing this, we can see a strong correlation between the residuals of transaction volume and the residuals of SafeGraph Patterns visitor counts. This tells us that the variance left unexplained by store averages and seasonality is shared between the two datasets, which means that a model including SafeGraph Patterns data will perform better than a model that only includes first-party data (i.e., store averages and seasonality).
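To make the residual check concrete, here is a hedged sketch: regress both series on the same store and month dummies, keep what is left over, and correlate the leftovers. The panel is synthetic, and the helper names (make_panel, residualize) are hypothetical.

```python
import numpy as np
import pandas as pd

def make_panel(n_stores=20, seed=0):
    """Synthetic panel in which a store-month demand shock moves both series."""
    rng = np.random.default_rng(seed)
    store = np.repeat(np.arange(n_stores), 12)
    month = np.tile(np.arange(1, 13), n_stores)
    store_effect = rng.normal(0, 200, n_stores)[store]
    season = 150 * np.sin(2 * np.pi * (month - 3) / 12)
    shock = rng.normal(0, 100, store.size)
    return pd.DataFrame({
        "store": store,
        "month": month,
        "transactions": 1000 + store_effect + season + shock
                        + rng.normal(0, 50, store.size),
        "raw_visitor_counts": 300 + 0.3 * store_effect + 0.3 * season
                              + 0.5 * shock + rng.normal(0, 20, store.size),
    })

def residualize(df, column):
    """Remove store and month effects from `column`; return the residuals."""
    dummies = pd.get_dummies(
        df[["store", "month"]], columns=["store", "month"], drop_first=True
    )
    X = np.column_stack([np.ones(len(df)), dummies.to_numpy(dtype=float)])
    y = df[column].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

panel = make_panel()
tx_resid = residualize(panel, "transactions")
visit_resid = residualize(panel, "raw_visitor_counts")

# Correlation of what store averages and seasonality could not explain.
residual_r = np.corrcoef(tx_resid, visit_resid)[0, 1]
print(f"Residual correlation: {residual_r:.2f}")
```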

But how much better, exactly? In this example, adding SafeGraph Patterns data to the model increased R-squared from 0.65 to 0.82, which means that SafeGraph Patterns explains substantially more of the variability in transactions than store averages and seasonality alone.

Combining store averages, seasonality, and SafeGraph Patterns allows us to create a highly correlated model (R-squared = 0.815).
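The comparison can be sketched by fitting the same regression twice, once with and once without the visitor counts as an extra column. Everything below is synthetic and hypothetical, so the printed values will differ from the 0.65 and 0.82 reported here.

```python
import numpy as np
import pandas as pd

# Synthetic store-by-month panel (hypothetical numbers, not real data).
rng = np.random.default_rng(0)
n_stores = 20
store = np.repeat(np.arange(n_stores), 12)
month = np.tile(np.arange(1, 13), n_stores)
store_effect = rng.normal(0, 200, n_stores)[store]
season = 150 * np.sin(2 * np.pi * (month - 3) / 12)
shock = rng.normal(0, 100, store.size)
panel = pd.DataFrame({
    "store": store,
    "month": month,
    "transactions": 1000 + store_effect + season + shock
                    + rng.normal(0, 50, store.size),
    "raw_visitor_counts": 300 + 0.3 * store_effect + 0.3 * season
                          + 0.5 * shock + rng.normal(0, 20, store.size),
})

def r_squared(X, y):
    """In-sample R-squared of an ordinary least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Baseline design: intercept plus store and month dummies.
dummies = pd.get_dummies(
    panel[["store", "month"]], columns=["store", "month"], drop_first=True
)
X_baseline = np.column_stack([np.ones(len(panel)), dummies.to_numpy(dtype=float)])

# Full design: the baseline plus visitor counts as one more column.
X_full = np.column_stack([X_baseline, panel["raw_visitor_counts"].to_numpy()])

y = panel["transactions"].to_numpy()
baseline_r2 = r_squared(X_baseline, y)
full_r2 = r_squared(X_full, y)
print(f"R^2 without visitors: {baseline_r2:.2f}, with visitors: {full_r2:.2f}")
```

In-sample R-squared can never go down when a column is added, so on real data it is worth confirming the gain with adjusted R-squared or a held-out set of months.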

SafeGraph helps you unlock actionable insights

SafeGraph Patterns data helps you unlock new observations and insights that your own first-party data can’t provide on its own. This can be especially helpful for understanding the nuanced dynamics of businesses, like Home Depot, that are heavily influenced by seasonality or experience large store-to-store differences, two factors that can tell a bigger story about consumer behavior.

Performing univariate and multivariate regressions to assess the correlation between SafeGraph Patterns data and other data about consumer behavior is key to unlocking these insights.

Lucky for you, you’re now an expert!

Try it for yourself

Now it’s your turn to correlate your own first-party data with SafeGraph Patterns data. Get $200 worth of data for free at the SafeGraph Data Bar when you use code: CorrelationNation.

Got ideas, feedback, bug discoveries, and suggestions? Send to [email protected]

Ryan Fox Squire
Product & Data Science @ SafeGraph