
Data Normalization Methods: Which to Use and Why

January 21, 2022
by
Nick Huntington-Klein

Data is one thing, and the questions we want to answer with data are another. In any context, there are often challenges when bringing the two together.

In the case of SafeGraph foot traffic data, what we have in the data is information about foot traffic that SafeGraph observes. Among the devices in the SafeGraph sample, we can check, for a given location on a particular day, how many of those devices we tend to see.

But you probably don’t want to know about what’s happening in the SafeGraph sample. You want to know what’s happening in the population. To make that jump, we need to adjust for anything that changes about the SafeGraph sample in ways that don’t reflect actual changes in the population. One of these adjustments that we need to make is normalization.

Normalization is the process of scaling the foot traffic counts we get by some measure of how big the sample is. If we observe 10 people visiting a location when we only have 10 people in the sample, that’s probably a pretty heavily trafficked area. If we observe 10 people visiting a location when we have 10 million people in the sample, that location’s a bit more obscure. We normalize so we don’t mistake one of those scenarios for the other.

Normalization is also important because the SafeGraph sample changes in size over time. You don’t want to think that your foot traffic has doubled when really it’s just that the sample expanded to cover twice as many devices.
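To make that concrete, here is a minimal Python sketch of the basic idea. The `visit_rate` helper and all of the numbers are made up for illustration; this is not how SafeGraph computes its columns, just the arithmetic of dividing by sample size.

```python
def visit_rate(observed_visits: int, sample_size: int) -> float:
    """Visits per device in the sample (hypothetical helper, not a SafeGraph function)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return observed_visits / sample_size


# Same raw count of 10 visits, very different stories.
busy = visit_rate(10, 10)
obscure = visit_rate(10, 10_000_000)

# The sample doubles and raw visits double with it: the rate does not move.
before = visit_rate(100, 1_000)
after = visit_rate(200, 2_000)

print(busy, obscure, before == after)
```

The raw counts in the last two lines differ by a factor of two, but the normalized rates are identical, which is the whole point.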

There are many different ways you can normalize, and SafeGraph has recently added several columns to its Patterns files that perform some of these approaches for you. These include:

  • normalized_visits_by_state_scaling
  • normalized_visits_by_region_naics_visits
  • normalized_visits_by_region_naics_visitors
  • normalized_visits_by_total_visits
  • normalized_visits_by_total_visitors

You can get detailed information about these in the SafeGraph docs.

What are these all for, and when should we use them?

normalized_visits_by_state_scaling

normalized_visits_by_state_scaling takes the number of visits to a given point of interest (POI) and divides it by the number of devices in the SafeGraph sample in that state (or province, for Canadian locations), then multiplies it by the population of the state/province.

What is this good for? First of all, this allows for better cross-state comparisons. If you’re worried that you might be seeing more foot traffic for your brand in Massachusetts than in Minnesota because maybe the SafeGraph sample has a higher proportion of the population in MA than MN, this scaling will account for that difference and even it out. Now, someone counts equally no matter which state they happen to be in.
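The arithmetic described above can be sketched roughly as follows. This is an illustration based on the description in this post, not SafeGraph's actual implementation; the function name and the state figures are hypothetical.

```python
def state_scaled_visits(raw_visits: int, state_devices: int, state_population: int) -> float:
    """Scale raw sample visits up to a population-level estimate.

    Mathematically this is raw_visits / state_devices * state_population;
    multiplying first just keeps the toy numbers tidy.
    """
    if state_devices <= 0:
        raise ValueError("state_devices must be positive")
    return raw_visits * state_population / state_devices


# Made-up figures: the sample covers 10% of MA but only 5% of MN.
ma = state_scaled_visits(raw_visits=50, state_devices=700_000, state_population=7_000_000)
mn = state_scaled_visits(raw_visits=25, state_devices=285_000, state_population=5_700_000)

# Raw counts say MA has twice the traffic; the scaled values say they match.
print(ma, mn)
```

In this toy case the raw gap between the two states comes entirely from the difference in sampling rates, and the scaling removes it.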

Another way normalized_visits_by_state_scaling can be used is to generate population counts of foot traffic. If we expect that someone in the SafeGraph sample is equally likely to visit a given POI as another random person in the population, then normalized_visits_by_state_scaling represents how many people (not just how many SafeGraph users) visited a given location. A word of warning, though: these population counts can be pretty noisy, especially when you’re looking at single POIs. If SafeGraph has 10% of the people in a state, which would be pretty good, then one more sampled person walking in the door looks like ten more people walking in the door in the population count. That can make the numbers pretty swingy. This problem isn’t so bad if you’re looking at broader levels of aggregation, like foot traffic for all locations of a given brand.

If you are combining multiple locations, by the way, since the value is scaled so that one unit represents one person, you can just add them up. Easy.
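Here is a small sketch of both points: summing scaled values across locations, and how swingy a single POI is compared to the aggregate. The 10% sampling rate, the theater names, and the counts are all assumptions for illustration.

```python
# Assumed for illustration: the sample holds 1 in 10 people in the state.
STATE_DEVICES = 1_000_000
STATE_POPULATION = 10_000_000


def population_estimate(sampled_visits: int) -> float:
    """Hypothetical helper: scale sampled visits up to the state population."""
    return sampled_visits * STATE_POPULATION / STATE_DEVICES


sampled = {"theater_a": 3, "theater_b": 40, "theater_c": 12}
estimates = {poi: population_estimate(n) for poi, n in sampled.items()}

# One unit is one person, so a brand-level figure is just the sum.
brand_total = sum(estimates.values())

# One extra sampled visitor at theater_a adds ten people to the estimate.
bumped_poi = population_estimate(sampled["theater_a"] + 1)
poi_jump = bumped_poi / estimates["theater_a"] - 1    # relative change for one POI
brand_jump = (brand_total + 10) / brand_total - 1     # relative change for the brand

print(estimates, brand_total, round(poi_jump, 3), round(brand_jump, 3))
```

The same extra visitor moves the single-theater estimate by about a third but the brand total by less than 2%, which is why aggregating helps with the noise.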

Let’s see how this looks. We’re going to be looking at AMC movie theaters in December 2021, the month that saw the release of Spider-Man: No Way Home on December 17. Can we see the movie’s effect in the weekly foot traffic data?

We can definitely see the movie’s effect in the theaters, with a huge jump from the week of Dec 6 to Dec 13 (the week containing the movie’s release). The normalization doesn’t change the shape of the jump much: normalizing makes the jump from Dec 13 to Dec 20 slightly shallower, suggesting that states that are overrepresented in the SafeGraph data saw bigger Dec 13-20 jumps, which got scaled back down by the normalization.

The normalization definitely changes the scale, though. The data here suggests that about 2 million people went to an AMC theater the week that Spider-Man came out. Is that accurate? We don’t have exact numbers for AMC ticket sales to Spider-Man, but we can get a bit closer by looking at theaters as a whole.

Now we get 14 million visits to theaters in general in Spider-Man’s first week. This may be a bit low, given that the movie’s $260 million opening weekend divided by a roughly $10 average ticket price implies about 26 million tickets, and that people saw other movies, too. Getting population counts is hard! Scaling by the sampling rate, as this variable does, will help you get some of the way there (and it will generally work better the more you aggregate) but don’t be fooled into thinking it will always get you the right answer. The best population scaling approach is going to be specific to the kind of location you’re working with.
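The back-of-the-envelope check above is easy to write down. The figures are the rough ones quoted in this post, and the variable names are just for illustration.

```python
# Rough figures from the text, not precise box office data.
opening_weekend_gross = 260_000_000   # dollars
avg_ticket_price = 10                 # dollars, approximate
estimated_theater_visits = 14_000_000 # state-scaled visits, all theaters, first week

implied_tickets = opening_weekend_gross / avg_ticket_price
coverage = estimated_theater_visits / implied_tickets

print(f"implied tickets: {implied_tickets:,.0f}")
print(f"scaled visits as a share of implied tickets: {coverage:.2f}")
```

The two numbers measure slightly different things (one weekend of one movie versus a full week of all theater visits), so treat the ratio as a rough sanity check rather than a calibration factor.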

normalized_visits_by_region_naics_visits and normalized_visits_by_region_naics_visitors

The two variables representing visits normalized by region and NAICS code show foot traffic to each POI as a share of the amount of foot traffic we find to any POI in that state/province and that industry (NAICS code).

These variables are especially good for comparing market share of a brand across regions. If Brand X has normalized visits in the West that are twice what you see in the South, that means that Brand X has a foot traffic market share twice as big in the West as in the South.

This is good for adjusting for differences in industry popularity across regions. Is Movie Theater Brand X doing poorly in Oregon because Brand X isn’t doing well in Oregon, or because movie theaters in general aren’t as popular in Oregon as in California? Visits normalized by region and NAICS narrow it down to Brand X specifically being different.

Additionally, since these are scaled by visits within NAICS, this can help show you how a brand has done relative to other brands in the same industry.

There are two different versions of this variable: _visits and _visitors. What’s the difference? The _visits variable, as you might guess, scales by the number of visits we observe in total in that region and NAICS code, while the _visitors variable instead scales by the number of unique visitors.

The _visits version is a bit easier to wrap your head around - it’s the share of all visits to this NAICS code in this region that went to this POI (or, if you add up, say, all the POIs for a given brand, the share of all visits to this NAICS in this region going to this brand). _visitors is a bit harder to think about because it’s the number of visits divided by the number of visitors - unique people visiting. Since individual people visit multiple places, this could theoretically even be above 1. However, it’s a good normalization method to use if you’re more concerned about the number of people in the SafeGraph sample than the market share.
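A toy example may help show the difference between the two denominators. The visit log below is invented, and the calculation follows the description in this post rather than SafeGraph's actual pipeline.

```python
from collections import Counter

# Made-up visit log for one region and one NAICS code: (device_id, poi).
visits = [
    ("d1", "amc_1"),
    ("d1", "amc_1"),
    ("d1", "amc_1"),
    ("d2", "amc_1"),
    ("d2", "regal_1"),
]

total_visits = len(visits)                                  # every visit counts
unique_visitors = len({device for device, _ in visits})     # each device counts once
visits_per_poi = Counter(poi for _, poi in visits)

by_visits = {poi: n / total_visits for poi, n in visits_per_poi.items()}
by_visitors = {poi: n / unique_visitors for poi, n in visits_per_poi.items()}

print(by_visits)    # shares of all visits; these sum to 1
print(by_visitors)  # visits per unique visitor; can exceed 1
```

With two devices making five visits, the _visits-style values stay between 0 and 1, while the _visitors-style value for amc_1 comes out above 1.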

When aggregating these normalized values, you can add them up as long as you’re working within a particular region - if AMC theater 1 is 1% of the movie theater visits in Idaho and AMC theater 2 is 2% of the movie theater visits in Idaho, then together they’re 3% of movie theater visits in Idaho. But if you’re aggregating a brand further than that, say to a national or USA + Canada level, you’ll have to think a bit. If you’re taking an average, do you want to count regions where a brand isn’t present as 0 (in which case you’d get average market share per POI, although this counts large and small markets as equally important in the average), or leave them out entirely (which gives average market share where present)? Or if you’re summing them up instead of taking the mean, then you avoid the problem that one POI could compete for market share with another POI of the same brand, although the scale ceases to mean much. Maybe you could sum to the regional level and then take the mean. You’ll have to think about it.
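Here is one way to sketch the "sum within region, then average" option, with both treatments of regions where the brand is absent. The shares, POI names, and region list are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-POI shares for one brand: (region, poi) -> share of regional NAICS visits.
poi_shares = {
    ("ID", "amc_1"): 0.01,
    ("ID", "amc_2"): 0.02,
    ("OR", "amc_3"): 0.05,
}
all_regions = ["ID", "OR", "UT"]  # assume the brand has no locations in UT

# Step 1: within a region, shares simply add up.
regional = defaultdict(float)
for (region, _poi), share in poi_shares.items():
    regional[region] += share

# Step 2a: average only over regions where the brand is present.
mean_where_present = sum(regional.values()) / len(regional)

# Step 2b: average over every region, counting absent regions as 0.
mean_zero_filled = sum(regional.get(r, 0.0) for r in all_regions) / len(all_regions)

print(dict(regional), mean_where_present, mean_zero_filled)
```

Neither average weights regions by market size, so a version weighted by regional visit totals might suit some questions better.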

Let’s see how this looks for AMC, and just go ahead and average things, keeping in mind that this will understate market share, since this allows each AMC location to cannibalize market share from the others. This probably isn’t a great option, but I’m trying to push you to think about aggregation rather than just doing what the blog did:

We see a similar jump to before, which is actually interesting here - this is normalized by NAICS already, so the existence of a jump here for AMC suggests that AMC saw a bigger Spider-Man boost than other theater brands. Something like a 4x boost, compared to a roughly 2.5x boost at other theater brands. Neat! I wouldn’t pay too much attention to the weird Y-axis values.

normalized_visits_by_total_visits and normalized_visits_by_total_visitors

These final two normalized variables are intended mainly to help adjust for the changing size of the SafeGraph sample as a whole. That’s also something that the other normalization variables do, but they also tack on other adjustments, while normalized_visits_by_total_visits and normalized_visits_by_total_visitors keep things pure and simple.

These two help adjust for the fact that if the SafeGraph sample gets, say, 10% more devices in it, you’re likely to see a 10% rise in your foot traffic numbers, even if nothing actually changed. They do this in two ways.

The _by_total_visits version adjusts for changes in the overall number of visits in the data. This helps adjust for the amount of activity SafeGraph is seeing in total, which will also adjust for actual aggregate changes in foot traffic activity. Do you want your analysis to smooth out the fact that foot traffic in general spikes on Black Friday? _by_total_visits will help adjust for that. The result is “the share of all observed visits that were to this POI.”

_by_total_visitors is similar except it adjusts for the number of active devices seen in the SafeGraph sample, which is closer to a pure way of adjusting for the size of the sample. More specifically it adjusts for the number of active devices we see with any activity. If a bunch of people start switching off their phones, or drop out of the sample, this normalization method will make sure you don’t mistake that drop in sample size for a drop in foot traffic.

Conveniently, both of these can be easily aggregated across POIs by simply summing them up. _by_total_visits has an easy interpretation, too, as the share of all visits that went to the activity you’re aggregating over. Whether that’s a useful interpretation or not depends, since often that percentage will be tiny. _by_total_visitors, on the other hand, doesn’t really have a value you can interpret in absolute terms. Both variables, however, are great for looking at growth over time.
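As a rough sketch of what these two columns are doing, consider a sample that grows by 10% between two weeks while nothing real changes. The data structure, names, and numbers are invented, and the formulas follow the description in this post rather than SafeGraph's exact method.

```python
# Made-up data: every count grows 10% in week 2 purely because the sample grew.
weeks = {
    "week_1": {"poi_visits": {"amc_1": 100, "amc_2": 50},
               "total_visits": 1_000_000, "total_devices": 50_000},
    "week_2": {"poi_visits": {"amc_1": 110, "amc_2": 55},
               "total_visits": 1_100_000, "total_devices": 55_000},
}


def brand_normalized(week: dict) -> dict:
    """Normalize each POI, then sum across POIs to get a brand-level value."""
    by_visits = sum(v / week["total_visits"] for v in week["poi_visits"].values())
    by_visitors = sum(v / week["total_devices"] for v in week["poi_visits"].values())
    return {"by_total_visits": by_visits, "by_total_visitors": by_visitors}


result = {name: brand_normalized(week) for name, week in weeks.items()}

raw_growth = (sum(weeks["week_2"]["poi_visits"].values())
              / sum(weeks["week_1"]["poi_visits"].values()) - 1)
print(f"raw growth: {raw_growth:.0%}")
print(result)
```

Raw brand visits rise 10%, but both normalized values are flat from week 1 to week 2, so the sample growth is not mistaken for a change in foot traffic.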

In fact, let’s look at them in terms of growth over time by putting the “change since week 1” on the Y-axis instead of the absolute normalized value.
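Computing "change since week 1" is just dividing each week by the first one. A minimal sketch, with a hypothetical helper and made-up weekly values:

```python
def change_since_first(values: list) -> list:
    """Relative change of each value versus the first (0.0 means no change)."""
    if not values:
        return []
    base = values[0]
    if base == 0:
        raise ValueError("first value must be non-zero to compute relative change")
    return [v / base - 1 for v in values]


# Made-up normalized values for four consecutive weeks.
weekly = [2.0, 2.2, 9.0, 6.0]
growth = change_since_first(weekly)

print(growth)
```

Because each week is divided by week 1, the units of the normalized variable cancel out, which is why this works even for values that have no absolute interpretation.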

We see pretty massive jumps in AMC foot traffic in relative terms, increasing 4-5 fold from the week before Spider-Man to the week of Spider-Man.

Parting Thoughts

Normalization is hard, and it’s not automatic. If it were easy, we’d just have one way of normalizing and that would answer all our problems. Instead we have to think about what it is we actually want out of our normalization and choose the appropriate one.

The different normalized-visits columns in the data offer a few different avenues for normalization. It’s fairly rare that the different normalization methods will produce wildly different trends (unless, say, a brand trends upwards in lockstep with the rest of its NAICS code, in which case the NAICS scaling will remove that trend but other normalization methods wouldn’t). Looking at the AMC and movie theater graphs above, if you ignore the Y-axes you can barely tell them apart.

But there are differences in the minor details - is week 2 of Spider-Man bigger than week 3? We get slightly different answers to those sometimes. And the scale definitely changes. The magnitude of the difference from week to week may change (even in relative terms), and the absolute numbers are of course different since they’re all measuring things on different scales.

So what to do? A good first step in thinking about normalization is to think carefully about what you want to count as an actual change in your data. Want to keep in industry trends? Then be sure not to normalize by NAICS. Want to exclude country-wide increases in visits as people emerge from their COVID caverns? Probably want to pick a normalization method that adjusts for overall visit numbers.

Once you’ve narrowed down a set of usable normalization methods, it’s worth trying them out. See what results they get you, if they differ, and which seems to be doing the best job. And if you’re trying to get absolute visit counts by scaling up to the population, see if you can get some ground truth to compare to (like our Spider-Man box office take) so you can check whether the method you end up with works well. And then the hard part is done.

So which of these normalization methods works best for AMC, and therefore which set of details wins out? Well, they all look pretty reasonable. So which one to go with depends on what you want to “count” as a change you want to keep as opposed to a change you want to toss. If you want to really just see AMC as opposed to other theaters, that’s probably one of the NAICS-normalized ones. If you’re mostly worried about the sample size changing, perhaps scaling by total visitors. If you want to get a rough population count of how many people saw Spider-Man, then the state scaling would be your best bet.

TL;DR

Which pre-normalized variable to use?

  • normalized_visits_by_state_scaling: When you want a rough estimate of the actual number of visitors, or want to adjust for differences in SafeGraph sampling rates across states
  • normalized_visits_by_region_naics_visits: When you want to isolate the change in a given brand/POI relative to changes in the rest of the local part of the industry
  • normalized_visits_by_region_naics_visitors: When you want to isolate the change in a given brand/POI relative to changes in the rest of the local sample size of people who frequent that industry
  • normalized_visits_by_total_visits: When you want to isolate the change in a given brand/industry/POI relative to both general changes in foot traffic activity and changes in the SafeGraph sample size
  • normalized_visits_by_total_visitors: When you want to isolate the change in a given brand/industry/POI relative to changes in the SafeGraph sample size
