Fast forward to the 2024 Super Bowl. Patrick Mahomes is trying to become the first QB since Tom Brady to repeat, and he’s up against the Dallas Cowboys, who miraculously didn’t choke in the playoffs. A giant watch party is scheduled at Jerry World (the Cowboys’ home stadium, Placekey: zzw-222@5qw-vxp-h3q). You run a sports betting side hustle and are modeling each team’s expected points to decide whether to bet on or against the spread. Your method is bottom-up: take every play from every game, assign it a situational success rating, and predict points scored from the aggregated ratings.
Before you start modeling, though, you need to distill each game the Chiefs and Cowboys played that season into a CSV. Then you will assess the data quality. What if you were missing the play where the Cowboys’ third-string corner had a pick-six against Jalen Hurts in the NFC Championship? In your data files, each row is a play and each column contains details about the play: down, distance, time, personnel, run or pass, yards gained, and description. How would you assess the “quality” of this dataset?
Let’s explore: Do we have rows for all the plays that happened? Are there extra rows from other games or seasons that snuck in during processing? What should we do about duplicate plays? Or the down being wrong for some percentage of rows? And whoops, one game has a completely null distance column. And our data provider didn’t include the time column until Week 7 - who knows what happened there, probably an intern. Ready to place your bets?
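The checks above are simple to sketch in code. Here is a hypothetical illustration - the field names mirror the example dataset described here, and the expected play count is an assumption for the sake of the example:

```python
# Minimal row-level quality audit for hypothetical play-by-play records.
# Field names (game_id, play_id, distance, time) are illustrative assumptions.

def audit_plays(plays, expected_count):
    """Return simple quality metrics for a list of play dicts."""
    seen = set()
    duplicates = 0
    for p in plays:
        key = (p.get("game_id"), p.get("play_id"))
        if key in seen:
            duplicates += 1  # same play appears more than once
        seen.add(key)

    def null_rate(col):
        missing = sum(1 for p in plays if p.get(col) in (None, ""))
        return missing / len(plays) if plays else 1.0

    return {
        "row_recall": len(seen) / expected_count,  # do we have every play?
        "duplicates": duplicates,                  # extra copies of a play
        "null_distance": null_rate("distance"),    # the fully-null column
        "null_time": null_rate("time"),            # missing until Week 7
    }

plays = [
    {"game_id": 1, "play_id": 1, "down": 1, "distance": 10, "time": "15:00"},
    {"game_id": 1, "play_id": 2, "down": 2, "distance": None, "time": "14:20"},
    {"game_id": 1, "play_id": 2, "down": 2, "distance": None, "time": "14:20"},
]
report = audit_plays(plays, expected_count=4)
```

Each metric maps to one of the questions above: missing rows, duplicates, and broken columns are separate failure modes, and a single “quality” score hides all three.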
Clearly data “quality” is a multi-faceted and hard problem. As a data-only company, we are well suited to tackle it and care immensely about addressing it. So much so that we have spent the last few months thinking about how to bucket different quality problems, which fixes will have the biggest impact for our customers, and how to communicate improvements to the market. We have settled on a three-pronged framework and started in the US.
This radical transparency about data quality is unprecedented in the POI industry. And while you can expect improvements in future releases, we are proud of where we stand today. Below we will discuss row precision and row recall in detail, including our process, tradeoffs we considered, and what to expect in future releases.
You’d think that determining if a POI is real would be simple. If we had Santa’s ubiquity, we would check each one individually in a single night. But alas, our reindeer are sleeping, and we need to use the internet and scalable methods to verify each of our tens of millions of US POIs. A SafeGraph POI falls into one of four mutually exclusive categories: Real Open, Unconfirmed, Closed, or Duplicate. Our north star metric for this prong is Real Open divided by Total Rows.
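As a toy illustration of that metric - the counts below are invented for the example, not SafeGraph figures:

```python
from collections import Counter

# Invented label counts for a batch of POIs.
# The four categories are mutually exclusive, so they sum to Total Rows.
labels = Counter(real_open=60, unconfirmed=25, closed=10, duplicate=5)

real_open_rate = labels["real_open"] / sum(labels.values())
print(f"Real Open Rate: {real_open_rate:.0%}")  # 60% for these toy counts
```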
First let’s consider a Real Open POI: Open Range Steakhouse, located at 241 E Main Street in Bozeman, MT. A quick search reveals that it has a first-party website, an active Facebook page, several Google reviews from the past month, and recent Yelp reviews. There’s also an article about country music star Gavin DeGraw and his brother buying the restaurant. This one is easy to classify as Real Open.
Next we will look at a now-removed Unconfirmed row: a POI named “Phantom of the Opera”. Whoops. That’s a play. And even though it may have a claimed Yelp page with a decent number of reviews or its own first-party website with an embedded map widget, it’s not a place. The Booth Theater where the play is being performed? Absolutely a place. And it’s already in our dataset (Placekey: 227-224@627-wbv-kcq). But the line for the play itself? That’s got to go, and it now gets filtered out as Unconfirmed.
Duplicates should be fairly self-explanatory. If we find a POI represented twice, we’ll…wait for it…remove one.
Between the clear-cut cases, there are obviously significant gray areas. We addressed these ambiguities one at a time, with many SafeGraphers manually verifying thousands of POIs. After cross-checking our work and sharing strategies, looser guidelines coalesced into the consistent rules outlined here. The TLDR is that when we manually classify a POI as Real Open, we look for 1) a first-party website where data originates, 2) several recent reviews, or 3) a phone number that is answered when we call.
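That decision rule can be sketched as code. The field names and the “several reviews” threshold below are our own illustrative assumptions, not SafeGraph’s production logic:

```python
def is_real_open(poi, recent_review_threshold=3):
    """Any one of the three signals is enough to classify a POI as Real Open."""
    return (
        bool(poi.get("first_party_website"))  # 1) first-party website exists
        or poi.get("recent_review_count", 0) >= recent_review_threshold  # 2) recent reviews
        or poi.get("answers_phone", False)    # 3) someone picks up the phone
    )

# A thriving restaurant clears the bar; a stage play does not.
steakhouse = {"first_party_website": "https://example.com", "recent_review_count": 12}
phantom = {"recent_review_count": 2}
```

Note the rule is a disjunction: a POI needs only one strong signal, which keeps the manual process fast while staying consistent across reviewers.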
Our March 2023 release had a Real Open Rate of 60%, which means 40% of rows were Unconfirmed, Closed, or Duplicates. At face value, not an ideal result! But even the behemoth Google was at the same level, and other competitors had even lower rates. While we were pleased to be at parity with Google, we knew our customers would want more, so we set out to identify which of our US rows should be filtered out.
We prioritized identifying and filtering Unconfirmed POIs. After manually classifying around 30k POIs to generate a truth set, we trained a machine learning model that used attributes and metadata like websites, category, region, reviews, and sources to predict the likelihood that a POI is Unconfirmed or Real. Internally, we dubbed this concept the “IRL factor.” Once all our POIs were scored, we set category-specific thresholds, making sure to balance precision and recall tradeoffs. As the graphic below shows, more aggressive thresholds mean that more Unconfirmed POIs are filtered out (true positives), but also increase the chance that Real POIs are incorrectly flagged (false positives). In the July 2023 release, we stayed conservative and prioritized keeping Real POIs over removing every single Unconfirmed row.
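To make the tradeoff concrete, here is a minimal sketch of the threshold-sweeping step. The scores and labels below are invented; SafeGraph’s actual model, features, and thresholds differ:

```python
# Given model scores (probability a POI is Unconfirmed) and a labeled truth
# set, measure the precision/recall of filtering at a given threshold.

def filter_metrics(scores, is_unconfirmed, threshold):
    flagged = [s >= threshold for s in scores]
    tp = sum(f and u for f, u in zip(flagged, is_unconfirmed))        # Unconfirmed, filtered
    fp = sum(f and not u for f, u in zip(flagged, is_unconfirmed))    # Real, wrongly filtered
    fn = sum((not f) and u for f, u in zip(flagged, is_unconfirmed))  # Unconfirmed, missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores         = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
is_unconfirmed = [True, True, False, True, False, False]

# A conservative (high) threshold rarely removes Real POIs but misses some
# Unconfirmed rows; an aggressive (low) threshold catches more Unconfirmed
# rows at the cost of more false positives.
conservative = filter_metrics(scores, is_unconfirmed, threshold=0.9)
aggressive = filter_metrics(scores, is_unconfirmed, threshold=0.3)
```

With these toy numbers, the conservative threshold achieves perfect precision but only one-third recall, while the aggressive one achieves full recall at 75% precision - exactly the tension behind staying conservative in the July 2023 release.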
Thanks to this model, our Real Open Rate improved to 66% in the July 2023 release, and we know how to keep it growing. Each month, we publish our Real Open Rate on the Accuracy Metrics page. In Q2, we mostly focused on identifying Unconfirmed rows, but we know we still have work to do for Closed and Duplicate rows. And we will always strive for further improvement.
Next, let’s talk about a slightly easier problem: row recall. Colloquially, many people refer to this as “coverage,” and it is simply how many total POIs we cover in a geographic zone. As mentioned above, in the US we gauge ourselves against Google. When doing these analyses, row precision is the first step: we eliminate Unconfirmed, Closed, and Duplicate rows so we compare only Real Open rows between vendors. This elimination is manual and follows the guidelines outlined in the prior section: broadly, we verify via 1) the existence of a first-party website or 2) recent reviews.
Once we have the Real Open rows for SafeGraph and Google, we manually match the rows. The end result is the number of Real Open POIs each vendor has in a specific zone. For the July 2023 release, we kept it close to home and looked at zip codes 94103 (San Francisco, CA; the zip code of SafeGraph’s first office) and 98110 (Bainbridge Island, WA; the lowest population density zip code that a SafeGrapher calls home). Our north star metric for this prong is what we call the “Coverage” rate, which is SafeGraph’s Real Open POIs over Google’s. As of the July 2023 release, we are at 79%.
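A toy version of that final computation - the POI sets below are invented, and the real matching step is manual and far more involved:

```python
# Real Open POIs for one zone after manual verification and matching
# (invented names, except the two places mentioned earlier in this post).
safegraph_real_open = {"Open Range Steakhouse", "Booth Theatre", "Blue Bottle Coffee"}
google_real_open = {"Open Range Steakhouse", "Booth Theatre",
                    "Blue Bottle Coffee", "Corner Bakery"}

# Coverage: SafeGraph's Real Open count over Google's for the same zone.
coverage = len(safegraph_real_open) / len(google_real_open)
print(f"Coverage: {coverage:.0%}")  # 75% in this toy zone
```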
We know we have work to do to close the gap with Google, but we are proud of where we stand today and of the rapid progress we make every month.
Naturally, there are nuances in deciding if a place is a POI. Should every real estate agent who works at a real estate firm count as a separate POI? What about each individual lawyer at a law firm? Or a counselor who is a sole proprietor and has an office in a large office building? The ambiguities are endless. Kiosks, ATMs, transit stops, parks, rivers…trees?? Just kidding, trees are probably going too far. But we did put a lot of thought into what should be a place within our scope, and it is always subject to change pending market needs.
We have documented guidelines here. While in the past we have focused on places where people spend money - restaurants, bars, retail stores, gas stations - recently we have expanded to also include places of leisure, work, and travel. Improving recall is simple: add sources. Each month we add hundreds of sources to our pipeline. These sources are cleaned, joined, and deduped into a single source of truth for each POI. We list counts by NAICS code and country on the Summary Stats page, and if you are interested in something we don’t have yet, please don’t hesitate to Contact Us.
Row precision, row recall, all this detailed and technical work - why bother? The cursory reason is that we are a data company and pride ourselves on selling high-veracity data. But more importantly, better SafeGraph data improves our customers’ bottom lines.
Imagine you are a Product Manager at a company that makes a mapping application - think Apple Maps, Bing Maps, or Mapbox - and people rely on your app to find and route themselves to nearby places, aka “local search.” If SafeGraph has a higher Real Open Rate, that means fewer bad arrivals for your users. If we have more Coverage, more places exist when users query in the search bar. A better user experience leads to increased usage and higher revenue per user.
Imagine you are the Head of Real Estate at a QSR - think Subway, Domino’s, or Chipotle - and you are responsible for choosing new locations. Real estate teams use complex Huff or gravity models to estimate foot traffic and revenue at future sites, but we all know that in models, garbage in = garbage out. When SafeGraph has cleaner data, your team can be more confident in the model output and ensure that the million-plus-dollar decision you are making is the correct one.
Imagine you are a Product Manager for a site selection or real estate software company - think Kalibrate, Buxton, or Crexi. You embed POIs in your site-selection software and/or model so that users can analyze different trade areas for new store development. When your customers see inaccuracies like closed or missing POIs, they doubt the validity of your platform; ‘bad data’ may impact model output. When SafeGraph has more accurate data, aka a higher Real Open Rate and more Coverage, your customers can make better decisions, feel more confident in the platform, and renew at a higher rate.
Imagine you are the Chief Data Officer at an advertising firm that does visit attribution or OOH campaign planning and measurement - think Clear Channel, Vistar, or Billups. You take billions of mobile GPS pings, cluster them into groups, and see which POIs people actually visited to create audiences for your customers. Or, your customer McDonald’s wants to advertise specifically on billboards near Wendy’s locations. Better SafeGraph data means more accurate audiences or improved campaign planning, enabling your customers to derive more revenue from their advertising efforts.
It is difficult to enumerate all the potential use cases for POI data. We sell into many verticals, each with slightly different product requirements. But, we are confident that in all use cases, across all industries, more accurate data allows our customers to make better decisions that improve their bottom lines. If you’ve worked with SafeGraph before, you know that we stop at nothing to make life easier for our customers and this effort was no exception.