Whoever originally said “less is more” probably wasn’t thinking about data. But little did they know that, in today’s data-driven age, this simple concept would hold so much water.
There’s no question that data now plays a critical role in propelling businesses, governments, non-profit organizations, and academic institutions forward. It can fuel new innovations and new insights that can conceivably change the world for the better. In a business context, specifically, it can provide a competitive edge for driving long-term growth and success.
But we’ve said it before, and we’ll say it again: Not all data is created equal. When working with a new data source, you must evaluate it to ensure that it can actually support whatever you’re trying to accomplish. In this sense, being choosy about the data sources you use is essential to avoid getting buried by a data avalanche that can cause more harm than good.
If you’re trying to answer specific questions or (dis)prove certain hypotheses, it’s easy to first go down the rabbit hole of collecting as much data as you possibly can to support your objective. After all, having more data at your fingertips feels more complete and, therefore, can lead you to believe that it will give you a competitive edge by default.
Unfortunately, that’s rarely ever true. If you’re not working with the right data, it’s just more data—and more data, on its own, can actually get in the way of uncovering actionable insights.
We’ve seen this play out in a variety of ways during the COVID-19 pandemic. On the more positive end of the spectrum, we’ve seen how many local, state, and federal governments—as well as the businesses and people in the communities they serve—have relied heavily on SafeGraph’s accurate data to mitigate the ongoing health and financial impacts of this crisis on the ground level. We’ve covered this at length via the SafeGraph blog—be sure to take a look.
But while data can do a lot of good, misusing it can have serious consequences for a business. Most people who use data incorrectly have no idea they’re doing it, and then inadvertently go on to make misguided decisions with a false sense of confidence.
Bad data can affect businesses in different ways: Retailers with dirty customer data may open a store in the wrong location, or insurance underwriters with inaccurate geocodes may significantly underestimate a property’s flood risk. Regardless of industry, the common thread is the high cost of making important decisions based on bad data.
So, why does this happen? For starters, it has a lot to do with data simply being more accessible and affordable than ever before. While this has, on the upside, led to a rise in the tools and resources available for collecting, cleaning, processing, and analyzing data more effectively, it still doesn’t negate the fact that the vast majority of data sources today are inherently flawed.
As a result, many trained data experts still have to walk a delicate tightrope, balancing the right amount of data with the right level of data accuracy before leaning into any insights drawn from that data. This is one of the primary reasons why we are such big believers in data standards. Approaching and analyzing all data sources through an objectively critical lens is the only way to drive outcomes that can lead to positive change—and minimize potential harm.
When you’re wading in lots of data, it can be hard to single out the good data from the bad. However, there are a few ways to ‘sanity check’ yourself to ensure that you stay focused on quality at all times. Here are four things to look out for:
Having a wide selection of data to choose from is crucial for sourcing the right solutions, but it doesn't mean you need to use it all. While data providers should focus on selection to ensure they offer what users need, users should only concern themselves with the data that's actually going to solve their business problems. Adding more data to the equation, simply because it's there, can lead to over-complication and incorrect results.
As Aaron Lipeles in Towards Data Science so aptly puts it:
"Making a dataset wider, by adding a lot of extra fields, increases the odds that something somewhere will look like it’s correlated when it’s not. The only way to mitigate that risk is to make the dataset deeper by adding more examples."
Here’s another way of looking at this predicament. Having more columns in your dataset is great. It indicates that you’ve got a lot of information about a particular entity at your fingertips. But if some of those columns are wholly irrelevant to the kind of analysis you’re doing, they could easily lead you to draw conclusions or make correlations that are completely inaccurate.
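To make the wide-versus-deep point concrete, here is a small sketch using NumPy and purely synthetic data. The function name and parameters are our own illustration, not from the quoted article: it measures the strongest correlation that appears by pure chance between a random target and columns of random noise. With many columns and few rows, a convincing-looking "signal" is almost inevitable; with more rows, it shrinks.

```python
import numpy as np

rng = np.random.default_rng(42)

def max_spurious_corr(n_rows, n_cols):
    """Max absolute correlation between a random target and purely random columns.

    Every column here is noise, so any correlation found is spurious by construction.
    """
    y = rng.standard_normal(n_rows)
    X = rng.standard_normal((n_rows, n_cols))
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_cols)]
    return max(corrs)

wide = max_spurious_corr(50, 200)    # wide but shallow: many columns, few rows
deep = max_spurious_corr(5000, 200)  # same width, far more rows
```

With 200 noise columns and only 50 rows, the strongest chance correlation is typically substantial; with 5,000 rows it collapses toward zero—exactly the "wider versus deeper" trade-off Lipeles describes.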
As a rule of thumb, you should have a clear understanding of what you’re trying to get out of the data before you actually start working with it. That’s the only way to ensure that you stay focused on the parts of the dataset you need and avoid any distractions that could eventually lead you down the wrong path.
When you’re faced with a sea of data to cull through, it’s all too easy to go on a ‘wild goose chase’ until you find the hidden gems in there. But the problem is that, if the data is all bad to begin with, you’re going to spend a lot of time trying to make sense of it—in addition to cleaning and sorting it—only to draw bad conclusions that lead to bad decisions.
See how quickly this becomes an undesirable domino effect? Massive amounts of data can quickly lure you into a false sense of confidence that makes you want to force the data to be usable. Unfortunately, that won’t do you—or anyone relying on your insights—any good. So avoid the temptation to hide behind an endless flow of data because, let’s face it, it’s not likely to end well.
"More data is not better if much of that data is irrelevant to what you are trying to predict. Even machine learning models can be misled if the training set is not representative of reality." – Michael Grogan, “Why More Data Isn’t Always Better” in Towards Data Science
At this point, you might think that we’re a broken record, but that’s only because this is a really important part of the data analysis equation. If you don’t go into this process knowing what problem you need to solve, you can’t work backward to identify what elements you need—including the right datasets—to get you there.
As Grogan’s point makes clear, even sophisticated machine learning models can be misled when the data feeding them is irrelevant or unrepresentative. This further underscores just how important it is to first know what you’re trying to accomplish and then find the data sources to support it.
A lot of organizations these days tend to acquire data just for the sake of acquiring it, perhaps thinking that it’ll come in handy one day. Or maybe it’s purely out of a deep-rooted fear of ‘feast or famine.’ Either way, before getting too far ahead of yourself, be sure to confirm that you have an actual use for the data in front of you and then cross-check it for accuracy to ensure that it won’t ruin anything good that you already have in place. Today's most data-mature organizations carefully align their data strategy and procurement processes.
Depending on the problem you’re trying to solve, there’s a very good chance that you’ll need to work with multiple datasets to get to the answer you’re looking for. But if the datasets can’t be connected well—even after some serious cleaning—you’ve got an uphill battle on your hands. Acquiring all that data is really only helpful if you’re able to join datasets together. Not being able to do so often signals a deeper problem with the data that, in reality, you should avoid altogether. While the industry has made strides in recent years with advancements like Placekey, not all data is easily joinable, and it may end up costing more time and resources to work with than it is really worth.
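One cheap sanity check before committing to a new dataset is to measure how many of its keys actually match the data you already have. A minimal sketch in plain Python—the Placekey-style identifiers and values below are entirely hypothetical:

```python
# Two hypothetical datasets keyed by a shared place identifier
# (e.g. a Placekey-style ID); all records below are made up.
visits = {"pk-001": 120, "pk-002": 85, "pk-003": 40}
attributes = {"pk-001": "restaurant", "pk-002": "gym", "pk-999": "cafe"}

# How many of our existing records can the new dataset actually enrich?
matched = set(visits) & set(attributes)
match_rate = len(matched) / len(visits)

# Join only the records that share a key.
joined = {key: (visits[key], attributes[key]) for key in sorted(matched)}
```

A low match rate at this stage is a red flag worth catching before you spend time and money on cleaning and integration.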
The moral of the story: When it comes to data, more isn’t always a good thing. In fact, working with more bad data will actually make your life a lot harder and, even worse, lead you to draw bad conclusions that then inform bad decision-making. Trust us, none of that is worth it.
That’s why, at SafeGraph, we’ve made a point to own our niche and focus squarely on data related to physical places. For us, keeping a narrow focus gives us a leg up in prioritizing quality over quantity and enables us to provide consistently high-quality data at a meaningful scale (and growing!). But we've also made it easier than ever to connect our places data with other datasets when needed.
We’ve been through the trenches and have worked with datasets of all shapes and sizes. We know what good data looks like and can spot a bad dataset from a mile away. In delivering on our mission to democratize access to data for all, we will never compromise on quality for the sake of quantity. That simply wouldn’t do anyone any good.
That's it – that's all we do. We want to understand the physical world and power innovation through open access to geospatial data. We believe data should be an open platform, not a trade secret. Information should not be hoarded so that only a few can innovate.