Tips for Normalizing Foot Traffic Data

February 2, 2021
Francisco Utrera

Foot traffic data is incredibly useful for understanding consumer behavior. With fields for number of visitors, number of visits, dwell times, origin, and more, SafeGraph Patterns data empowers organizations to derive actionable insights related to how people interact with points of interest. 

One of the most common ways data scientists use SafeGraph Patterns data is to measure foot traffic over time. This can reveal interesting trends related to seasonality, brand affinity, and proximity to other businesses. An accurate and detailed analysis of foot traffic over time can be used to inform site selection, investment, and advertising decisions, among many other use cases spanning different industries. But to do this effectively, some technical considerations need to be factored in.

Reasons to normalize

Like most frequently updated datasets, raw foot traffic data can show rapid fluctuations that may overwhelm an analysis. This can be remedied by applying moving averages, which smooth out the data while still preserving the important trends for analysis.
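As a rough illustration of the smoothing idea, here is a minimal sketch of a trailing moving average in plain Python. The function name, the 7-day window, and the sample visit counts are all assumptions made for this example and do not come from SafeGraph.

```python
from collections import deque


def moving_average(values, window=7):
    """Trailing moving average; the first window-1 entries are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    recent = deque(maxlen=window)  # holds the last `window` values
    running_sum = 0.0
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # value about to fall out of the window
        recent.append(value)
        running_sum += value
        smoothed.append(running_sum / window if len(recent) == window else None)
    return smoothed


# Made-up daily visit counts with one spike on day 5.
daily_visits = [120, 95, 130, 110, 400, 105, 115, 125, 100, 135]
smoothed = moving_average(daily_visits, window=7)
```

A 7-day window is a natural starting point for daily retail data because it averages over one full weekly cycle, but the right window depends on the analysis.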

Granularity should also be considered so that you find the right balance between privacy regulations and specificity for your analysis. Deciding on the appropriate level of granularity, for example, CBG-level or POI-level foot traffic, can help you determine the natural variance of mobile location pings while also ensuring you maintain compliance with privacy agreements for your specific use case.

Panel bias can arise from collecting data from sub-groups of the population disproportionately. If a Walmart POI's typical visitor profile is 10% Hispanic or Latino, but the sample panel accounts for only 5%, the results from analyzing the raw data will not be as accurate as they could be if corrected for bias. Any other type of bias that does not fit into panel bias is considered an outlier and should also be filtered out.

Adjusting for sample bias to represent the true demographic profile of a POI improves the accuracy of results.

SafeGraph Patterns is aggregated from a panel of ~40MM mobile devices in the US and Canada. As such, there are going to be biases and outliers in the data. However, when normalized correctly, Patterns can provide the foot traffic insights needed to truly transform a business strategy.

Simple ways to normalize patterns

Even though there are advanced methods to normalize our panel, many use cases only need easy fixes to get most of the value from the data. To correct for panel bias, users can simply scale visits and visitors by multiplying them by the statewide constant relevant for a given POI. This constant is calculated by dividing the total population in that state by the number of home devices in that state (our panel size). This will account for panel changes over time and will also correct demographic bias at the state level.
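The state-level scaling described above can be sketched as follows. The function names and all numbers are hypothetical; in practice the population would come from Census data and the home-device count from the panel summary for the same period.

```python
def state_scaling_factor(state_population, state_home_devices):
    """Population of the state divided by the panel's home devices there."""
    if state_home_devices <= 0:
        raise ValueError("state_home_devices must be positive")
    return state_population / state_home_devices


def scale_counts(raw_visits, raw_visitors, factor):
    """Multiply raw visits and raw visitors by the statewide constant."""
    return raw_visits * factor, raw_visitors * factor


# Hypothetical state: 5,000,000 residents, 250,000 panel home devices.
factor = state_scaling_factor(state_population=5_000_000, state_home_devices=250_000)

# Hypothetical POI in that state for one month.
scaled_visits, scaled_visitors = scale_counts(raw_visits=300, raw_visitors=120, factor=factor)
```

Because the panel size changes over time, the constant should be recomputed for each period rather than reused across months.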

Separately, outliers can be detected with quartile ranges of aggregated raw visits and raw visitors over a 2-year period. A slightly more sophisticated approach would remove locations that have higher-than-normal standard deviations in month-over-month percentage changes. These simple approaches are easy enough that anyone can implement them in a Google Sheet.
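One way to read the quartile-range approach is the common interquartile range (IQR) fence rule, sketched below. The 1.5 multiplier is a widely used convention and an assumption here, not a value stated in this article; the POI names and totals are made up.

```python
from statistics import quantiles


def iqr_outliers(totals_by_poi, k=1.5):
    """Return the POIs whose aggregated total falls outside the IQR fences."""
    values = list(totals_by_poi.values())
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {poi for poi, total in totals_by_poi.items() if total < low or total > high}


# Made-up raw visit totals aggregated over the analysis period.
totals = {
    "poi_a": 1000, "poi_b": 1100, "poi_c": 950, "poi_d": 1050,
    "poi_e": 1200, "poi_f": 980, "poi_g": 1020, "poi_h": 25000,
}
flagged = iqr_outliers(totals)  # only poi_h lies outside the fences
```

The same function can be applied to raw visitor totals, and flagged POIs can then be reviewed or dropped before further analysis.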

If these simple approaches do not work for your purposes, we recommend going further and working through the following three steps.

We break down the important steps to normalizing Patterns data here. When you’re ready to get started, check out our full technical guide.

1. Clean and prep the data

Cleaning and preparing Patterns data will simplify the time series analysis by joining data for the sample size by month and visits per visitor for each POI. Shared polygons (with less accurate visit counts) can also be discarded in this step.

Adjusting visitor counts based on CBG sample sizes will adjust the overall visitor count for each POI by month and each origin-CBG based on the known sample sizes and true Census population counts. 

The CBG adjusted count is in units of visitors and is the extrapolated estimate of total real-world visitors based on the fraction of visitors measured by SafeGraph (SG) in the sample.
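A minimal sketch of this adjustment, assuming each origin CBG's visitors are scaled by that CBG's Census population divided by the number of panel devices residing there. That formula is our reading of the description above, and the CBG keys and counts are hypothetical.

```python
def cbg_adjusted_visitors(visitors_by_cbg, devices_by_cbg, population_by_cbg):
    """Extrapolate sampled visitors to estimated real-world visitors."""
    total = 0.0
    for cbg, visitors in visitors_by_cbg.items():
        devices = devices_by_cbg.get(cbg, 0)
        if devices <= 0:
            continue  # no panel devices in this CBG, so nothing to scale by
        total += visitors * population_by_cbg[cbg] / devices
    return total


# Hypothetical POI-month with visitors from two origin CBGs.
visitors_by_cbg = {"cbg_1": 4, "cbg_2": 6}          # sampled visitors
devices_by_cbg = {"cbg_1": 50, "cbg_2": 120}        # panel devices living there
population_by_cbg = {"cbg_1": 1000, "cbg_2": 1800}  # Census population

adjusted = cbg_adjusted_visitors(visitors_by_cbg, devices_by_cbg, population_by_cbg)
# 4 * 1000 / 50 + 6 * 1800 / 120 = 80 + 90 = 170 estimated visitors
```

Skipping CBGs with no panel devices is a simplification in this sketch; a real pipeline would need a deliberate policy for them.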

If you are looking to analyze foot traffic at a daily level, you can explode the daily visit count for a POI. To do this, combine scaled CBG visit counts with daily visit counts to adjust the raw daily visit counts for the sample size.
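One possible way to carry the adjustment down to daily counts is to spread the adjusted monthly total across days in proportion to the raw daily counts. This proportional rule is an assumption made for illustration, and the function name and figures are hypothetical.

```python
def scale_daily_visits(raw_daily_visits, adjusted_monthly_visits):
    """Rescale raw daily counts so they sum to the adjusted monthly total."""
    raw_total = sum(raw_daily_visits)
    if raw_total == 0:
        return [0.0 for _ in raw_daily_visits]
    ratio = adjusted_monthly_visits / raw_total
    return [visits * ratio for visits in raw_daily_visits]


# Hypothetical four-day series and an adjusted total for the same span.
raw_daily = [10, 20, 30, 40]
adjusted_daily = scale_daily_visits(raw_daily, adjusted_monthly_visits=2000)
```

The shape of the daily curve is preserved; only its level changes to reflect the sample size.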

Don’t forget about outliers - performing outlier detection on aggregated visit counts to each POI will enable you to flag and discard anomalies that would distort results. By filtering to only include POIs with a full time series, you prevent additional variance by removing any POI or transaction data that does not have data for every month of the time range. To account for daily shifts in sample size, join the day-by-day file to the normalization file.
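The full-time-series filter can be sketched as below, assuming each POI's data is held as a mapping from month to visit count. The data layout, names, and values are hypothetical.

```python
def pois_with_full_series(monthly_visits_by_poi, required_months):
    """Keep only POIs that have a value for every month in the range."""
    required = set(required_months)
    return {
        poi: series
        for poi, series in monthly_visits_by_poi.items()
        if required <= set(series)  # every required month is present
    }


months = ["2020-01", "2020-02", "2020-03"]
monthly_visits_by_poi = {
    "poi_a": {"2020-01": 100, "2020-02": 110, "2020-03": 90},
    "poi_b": {"2020-01": 80, "2020-03": 85},  # missing February
}
complete = pois_with_full_series(monthly_visits_by_poi, months)
```

The same check applies to the first party data: any store or account without a value for every month would be dropped before joining.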

Now that the data is prepped, you can run a few preliminary analyses to see if there are any correlations between Patterns data and whatever you are analyzing with it.

A preliminary analysis can show correlations between first party data (transactions) and daily visit counts.
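A simple preliminary check is the Pearson correlation between daily visit counts and daily transactions. The sketch below computes it by hand in plain Python; the two series are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up daily visits and first party transactions for the same days.
daily_visits = [100, 120, 90, 150, 200]
transactions = [40, 50, 35, 62, 85]
r = pearson(daily_visits, transactions)
```

A high value here is only a starting point, since shared drivers such as weekends or holidays can inflate it.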

2. Encode other known sources of variance for better controls

Once you’ve prepared the data and run a few initial analyses, you can begin to control for coincident or real correlations. Using regression logic enables you to determine if correlations discovered in the preliminary analysis are coincidental or based on real-world events.

In practice, much of the variance in foot traffic of consumer retail businesses is driven by factors that can easily be estimated from historical data. You can analyze effects like seasonality, day of the week, or store-by-store differences by running a full regression accounting for known sources of variance. Using the encoded known sources of variance, you can run one regression to see how well you can use Patterns data to predict the other data you are analyzing.

Common known sources of variance include:

  • Major holidays
  • Different days of the week
  • Different months/years
  • Differences by brand or by category
  • Differences in individual POI

There is an added benefit to controlling for known sources of variance. It is hard to fully control for all methodological changes in SafeGraph data over time, but controlling for variance at various levels helps isolate additional latent sources of methodological artifact from the data. 

These known sources of variance can be encoded as categorical variables using target (mean) encoding, which keeps the dimensionality low, or one-hot encoding, which adds one column per category level. When Patterns data is encoded for known variance and joined to the first party data you are analyzing it with, you can start to see trends without the noise caused by seasonality, holidays, or other known sources of variance.
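To make the two encodings concrete, here is a minimal sketch that encodes day of the week both ways. The helper names and sample data are hypothetical.

```python
from collections import defaultdict


def one_hot(values):
    """One 0/1 column per category level, with levels in sorted order."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return rows, levels


def target_encode(categories, targets):
    """Replace each category with the mean target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


# Made-up observations: day of week and visits on that day.
days = ["Mon", "Sat", "Mon", "Sun", "Sat"]
visits = [100, 200, 120, 180, 220]

one_hot_rows, levels = one_hot(days)           # three columns: Mon, Sat, Sun
encoded, day_means = target_encode(days, visits)  # one numeric column
```

Target encoding collapses each variable to a single column, which helps with high-cardinality variables such as individual POIs. Computing the means on the same rows you later model can leak the target, so in practice they are usually estimated on a separate historical or training split.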

3. Multiple regression to control for methodology and other sources of variance simultaneously

When you’ve encoded for multiple known sources of variance, it’s helpful to run a full regression with all of them along with daily and monthly panel metrics. This may introduce multicollinearity, which is often a red flag in analytics, but it’s okay in this instance. Here we are just trying to understand how well Patterns data can predict the first party data you are analyzing with it, not trying to interpret individual coefficients.

Comparing correlations from a model that includes SafeGraph Patterns data and one that does not will enable you to better understand and explain variances.

Another helpful step is to include encoded variables for both Patterns data and the first party data in your multiple regression. This effectively controls for both historical trends and changes in methodology that may be inseparable from historical trends without sufficient data.
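The with-and-without comparison can be sketched with a small ordinary least squares fit in plain Python. Everything here is illustrative: the helper names, the single weekend control, and the data are assumptions, and the tiny solver only works when no column is perfectly collinear with the others. A library least-squares routine would be the usual choice for a real analysis.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular; drop a redundant column")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(rows, y):
    """Fit OLS via the normal equations and return R^2 on the same data."""
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    predicted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((target - p) ** 2 for target, p in zip(y, predicted))
    ss_tot = sum((target - mean_y) ** 2 for target in y)
    return 1 - ss_res / ss_tot


# Made-up two weeks of data: weekend flag, scaled visits, first party sales.
is_weekend = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1]
visits = [100, 110, 95, 105, 120, 180, 170, 98, 112, 90, 108, 125, 190, 165]
sales = [1000, 1150, 950, 1020, 1250, 1900, 1700, 990, 1130, 900, 1100, 1300, 2000, 1650]

baseline_rows = [[1, w] for w in is_weekend]               # intercept + control
full_rows = [[1, w, v] for w, v in zip(is_weekend, visits)]  # plus Patterns visits

r2_baseline = r_squared(baseline_rows, sales)
r2_with_patterns = r_squared(full_rows, sales)
```

The gap between `r2_with_patterns` and `r2_baseline` indicates how much variance visits explain beyond the known controls. In-sample R² never decreases when a column is added, so a held-out comparison is a safer test of real predictive value.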

Putting it all together

Once you’ve normalized your data, you can start to run analyses and deliver reliable results to inform your business strategy. Understanding the effect foot traffic has on your other data, whether credit card transactions, offer redemptions, or product sales, can help you uncover valuable connections otherwise undiscoverable. By factoring in seasonality, holidays, and other known sources of variance, you can be confident in your ability to make strategic decisions that will boost your overall business.

Francisco Utrera
Technical Product Manager at SafeGraph