SafeGraph aims to be the source of truth for physical places. No fake news — just the facts.
But the dynamic, evolving, complex world we live in poses a real challenge for us SafeGraph-ers. By aiming for 100% accuracy, we know we are undertaking a Sisyphean task.
Capturing the full complexity of the world and encapsulating it neatly into one clean CSV dataset is impossible. One reason our job is so difficult is that every data source is noisy, and that noise causes our algorithms to make mistakes.
Our machine learning team fuses data from many, many sources including satellite imagery, first-party data, municipal and government data, web searches, and more. This has enabled SafeGraph to maintain a very accurate understanding of almost everywhere people spend money.
But operating at the scale we do, it’s not surprising that SafeGraph’s algorithms make mistakes.
Once, we put a point of interest squarely in the middle of a big lake. That point of interest was NOT Atlantis. It was a Burger King. Clearly, a mistake.
We found out about it because one of our customers was using that location to give someone driving directions. Luckily, that person wisely decided not to drive into the water.
Our mistakes have huge real-world consequences because the largest mobile carriers, search engines, and satellite companies rely on SafeGraph’s data.
But this blog post isn’t about how our algorithms mess up and how we fix our obvious mistakes. This blog post is about the weird edge cases where even after multiple humans look at the data, we still don’t know what the right answer is.
This blog post is about the cases where we struggle to translate the complexity and nuance of the real world into simple rules and heuristics which our algorithms can then follow.
We don’t have all the answers yet, but we want to shine some light on some of the challenges we face every day. If you have any suggestions, please let us know (or come work with us!).
Knowing when a place is open for business or not seems easy enough. But how would you handle the open hours for this urgent care center?
Do we report Office Hours or InstaCare Hours? If both, how can we cleanly represent that in our schema?
Open hours become even more challenging when you account for points of interest that are seasonal, like water parks open only in the summer, or malls which have extended shopping hours during the holiday season.
Last Christmas Day, our clients recommended that people go to hundreds of closed businesses … all because we couldn’t get the store hours straight. Again, our mistakes have serious real-world consequences (but luckily all the toy stores were open … so little Tony still got his truck).
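To make the schema question concrete, here is a minimal sketch of one way to represent several sets of hours for the same place, plus seasonal ranges and holiday closures. It is an illustration, not SafeGraph's schema: `HoursRule`, `open_services`, the service labels, and all of the hours and dates below are invented for the example.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional


@dataclass(frozen=True)
class HoursRule:
    """One weekly opening interval for one named service at a place."""
    service: str                       # hypothetical label, e.g. "office" or "instacare"
    weekday: int                       # 0 = Monday ... 6 = Sunday, as in date.weekday()
    opens: time
    closes: time                       # assumed later than `opens`; no overnight spans
    valid_from: Optional[date] = None  # None = no seasonal start
    valid_to: Optional[date] = None    # None = no seasonal end

    def applies_on(self, day: date) -> bool:
        """True if this rule covers the given calendar day."""
        if day.weekday() != self.weekday:
            return False
        if self.valid_from is not None and day < self.valid_from:
            return False
        if self.valid_to is not None and day > self.valid_to:
            return False
        return True


def open_services(rules, day, at, closed_dates=frozenset()):
    """Return the set of services open at time `at` on `day`."""
    if day in closed_dates:            # explicit exceptions such as holidays win
        return set()
    return {r.service for r in rules
            if r.applies_on(day) and r.opens <= at < r.closes}


# Invented hours for illustration only (not taken from any real clinic).
urgent_care = (
    [HoursRule("office", d, time(9), time(17)) for d in range(5)]       # Mon-Fri
    + [HoursRule("instacare", d, time(8), time(20)) for d in range(7)]  # every day
)
holidays = {date(2023, 12, 25)}

# A seasonal place: Saturdays only, summer only (dates are made up).
water_park = [HoursRule("park", 5, time(10), time(18),
                        date(2023, 6, 1), date(2023, 8, 31))]
```

Asking `open_services(urgent_care, date(2023, 12, 18), time(18))` returns only the walk-in service, because the office rule ends at 17:00. The price is that "is this place open?" no longer has a single yes/no answer, which is hard to flatten into one CSV column.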
Here’s a great article on the falsehoods programmers believe about people’s names. You can imagine when it comes to physical places, which have fewer social conventions for naming than people, there is even greater complexity to understanding and representing place names accurately.
Take for example the Broncos Stadium at Mile High. Or whatever they decided to name it this year.
Names of places change. Often.
And some places might go by two names, both equally valid. This makes determining the best name for a place challenging even for humans, let alone algorithms.
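One way to cope with renames and aliases is to store every known name with an optional validity window and choose a display name only when it is needed. The sketch below assumes exactly that. `PlaceName`, `names_on`, and `display_name` are hypothetical helpers, and the stadium names and dates are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PlaceName:
    name: str
    valid_from: Optional[date] = None  # None = valid since forever
    valid_to: Optional[date] = None    # None = still valid
    preferred: bool = False            # editorial hint, not ground truth

    def valid_on(self, day: date) -> bool:
        after_start = self.valid_from is None or day >= self.valid_from
        before_end = self.valid_to is None or day <= self.valid_to
        return after_start and before_end


def names_on(names, day):
    """All names that were valid on `day`, in their stored order."""
    return [n.name for n in names if n.valid_on(day)]


def display_name(names, day):
    """Pick one name: a preferred valid name if any, else the first valid one."""
    valid = [n for n in names if n.valid_on(day)]
    if not valid:
        return None
    preferred = [n for n in valid if n.preferred]
    return (preferred or valid)[0].name


# Made-up names and dates, for illustration only.
stadium = [
    PlaceName("Acme Stadium", valid_to=date(2019, 12, 31)),
    PlaceName("Globex Field", valid_from=date(2020, 1, 1), preferred=True),
    PlaceName("The Old Ballpark"),  # a nickname with no validity window
]
```

This keeps both equally valid names around instead of forcing a choice at ingestion time. It does not settle which name deserves the `preferred` flag, and that is still the hard part.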
Businesses are continually merging and acquiring other businesses. Sometimes these businesses undergo rebrands. Sometimes they don’t. Sometimes they create new special regional co-branding.
We’ve reached 99% recall and 99% precision for the top 3,000 brands in the U.S. But for smaller chains and brands, it’s difficult to organize and keep track of this information without extensive research and local familiarity. Take for example Daphne’s.
Without local familiarity, it’s not easy to know how many distinct restaurants and brands are contained in the above search results and news stories.
NAICS codes are an industry-standard system for categorizing types of businesses. Some example sub-categories in the NAICS system are “Commercial Banks”, “Snack and Nonalcoholic Beverage Bars”, and “Lessors of Nonresidential Buildings (except Miniwarehouses)”.
As much as we love working out of Capital One Cafes, we hate them when it comes time to categorize these points of interest.
What’s the best category for a bank which is also a cafe and also a co-working space? And another question: should Capital One Cafes be a separate brand from Capital One?
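One possible answer to the first question is to stop insisting on a single category and store a primary code plus any number of secondary ones. The sketch below assumes that design. `Category` and `PlaceCategories` are hypothetical, the six-digit codes are the NAICS codes as we understand them (check them against the official manual before relying on them), and treating the bank as "primary" is an arbitrary choice for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Category:
    naics_code: str  # kept as a string so prefix matching is straightforward
    title: str


@dataclass(frozen=True)
class PlaceCategories:
    primary: Category
    secondary: Tuple[Category, ...] = ()

    def all_codes(self):
        """Primary code first, then secondary codes in stored order."""
        return [self.primary.naics_code] + [c.naics_code for c in self.secondary]

    def matches(self, code_prefix):
        """True if any code falls under the given NAICS prefix.

        NAICS codes are hierarchical, so "52" selects a whole sector
        and "7225" a narrower industry group.
        """
        return any(code.startswith(code_prefix) for code in self.all_codes())


# Codes believed correct but unverified here; primary/secondary split is arbitrary.
bank = Category("522110", "Commercial Banks")
snack_bar = Category("722515", "Snack and Nonalcoholic Beverage Bars")
lessor = Category("531120", "Lessors of Nonresidential Buildings (except Miniwarehouses)")

bank_cafe = PlaceCategories(primary=bank, secondary=(snack_bar, lessor))
```

With this shape, `bank_cafe.matches("52")` and `bank_cafe.matches("7225")` are both true, so the place shows up whether someone is looking for banks or for coffee. Anyone who needs exactly one category per row still has to decide which code wins.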
As you can see, the real world is tricky, and it’s hard to cleanly represent what’s happening in a simple CSV with a clear taxonomy.
We need to get this right because we believe that truth data is fundamental to innovation in the machine-learning-driven future.
So, until we reach the impossible goal of 100% accuracy, we’ll keep fixing our errors and handling these edge cases.
All models are wrong… but we are trying to make SafeGraph’s models of the physical world the most accurate and useful.
But we still make tons of mistakes. Many of the mistakes make us cringe. Our commitment to our (very demanding) customers is that we significantly improve the data every month and that we will be a bit more true every month.
You can track our progress on this journey by following our release notes, which are published with every monthly update of the data. We feature the bugs and edge cases we’ve handled, and describe the known problems that are not solved (yet!).