Most of the world’s data is sitting on a shelf, being used in a very narrow domain. This data, if properly activated, could solve some of the world’s biggest problems and lead to more health, happiness, and love for society. We could use this data to uncover some of society’s biggest secrets.
The data is there. We just need to use it.
We need the courage to harness the world’s data for good.
We have a MORAL OBLIGATION to get this data into the hands of millions of innovators. Not doing so is a true failing of society. This data can save hundreds of millions of lives and help all of humanity … which means not using it hastens the death of hundreds of millions of people.
But there are hundreds of special interest groups fighting against it. Many of them have good intentions; they know this data can make people’s lives better. But they fight against making data more accessible anyway, whether to protect their profits, to protect their power, or just to protect the status quo.
Like Marc Andreessen’s essay It’s Time to Build, this piece is a full-throated argument to massively increase the accessibility of data. And we need to do it now.
For decades, there have been very powerful, sensitive datasets completely unavailable to the research and business communities because the institutions that own them have been unwilling to share or sell. This unwillingness has been largely driven by a concern for people’s privacy, which in a vacuum makes sense.
But what if we could have our cake and eat it too?
Large societal institutions like the government and big tech companies have tons and tons of data… and 99.999% of it isn’t accessible to the millions of brilliant researchers, engineers, and entrepreneurs out there. We’re talking about data that can fundamentally change the trajectory of society. And because they have a monopoly on the data, they monopolize innovation and slow down technological progress.
There’s an obesity epidemic, economic inequality is still very high, wages are stagnant, we’re in the middle of an opioid crisis, average lifespan is not increasing, and public policy is responding very incrementally. The human condition is not improving fast enough, yet we’ve somehow convinced ourselves to become risk-averse at a time when we need to be daring.
We need courage.
But the solution is right in front of us. It’s time for a step-function change in progress and it starts with making data more accessible. These institutions aren’t inherently wrong for being cautious with these datasets — people’s privacy is at stake and that’s important.
But the game is not zero sum. Protecting personal privacy and developing next-generation technology and research are essential and mutually inclusive.
It doesn’t have to be one or the other – we can choose to make data more accessible and protect people’s privacy.
Before we dive in, let’s clarify that making data accessible is not the same as making it free. It’s okay to charge for data (we do that at SafeGraph), but it’s not okay to let it go to waste.
We as a society have a moral obligation to release data (for free or sell it at a reasonable price) in a privacy-safe way.
The IRS has income data on hundreds of millions of people over decades – including the incomes of people’s parents and grandparents. It is one of the largest and most comprehensive longitudinal studies in history.
However, only a select few researchers have access to the data.
The IRS is rightly concerned about people’s privacy. This is super sensitive data. But what if we could give out access to the data while completely protecting our privacy? It is possible (read on). We can allow people to ask questions of the data without seeing the underlying sensitive data. We can do it. We just need the courage to work on it. And give every researcher in the world access to one of the most important longitudinal studies the world has ever seen.
Raj Chetty is famous. He’s a Professor of Economics at Harvard. He won the John Bates Clark Medal. His studies have been cited by thousands of articles. He’s amazing. He is one of roughly four researchers that has access to the IRS data.
By analyzing the tax returns, Chetty and his colleagues were able to publish many monumental longitudinal studies. In one, they analyzed upward mobility across generations throughout the U.S. They found that upward mobility was heavily influenced by where one grew up.
His finding: upward mobility exists – it’s just not evenly distributed.
Other amazing research Chetty has been able to conduct by having access to de-identified administrative data includes:
This type of work has a huge impact on public policy, but the data is only available to a number of people you can count on one hand. This doesn’t make any sense.
But how did Chetty get access to this data? He had to apply through a rigorous RFP process run by the IRS. I’m sure it also helps that he is an esteemed academic from prestigious institutions. Therein lies the problem. You shouldn’t need a John Bates Clark Medal to get access to this data. We should make this data available to EVERY innovator.
Imagine if there were a million other researchers working with the same data. Society would benefit enormously. We could better understand what types of social programs are working, where to best allocate resources and how to help humanity. Data accessibility is the cornerstone of this innovation in data and data-as-a-service.
So let’s open up access to this data in a privacy-safe way.
By the way, this data doesn’t have to be free. I’m sure there are lots of costs associated with administering a dataset of this size in a privacy-compliant way. It’s totally okay for the IRS to charge money to recoup those costs. There are still hundreds of thousands of researchers that could afford a reasonable data access fee.
Currently, most researchers work with survey data, which is less accurate, less consistent, and far smaller. The real datasets are over 1,000 times the size of survey data. And the real data produces studies that are truly longitudinal – you can follow the progression of individuals over the years, while survey data usually captures a single moment in time. Real data reveals what actually happened. Survey data reveals only what people remember happening.
I recently discussed with Susan Athey on World of DaaS how Raj Chetty wrote a famous paper on a government experiment from decades past. The government moved low-income families to higher-income areas and paid for their housing. The initial findings did not show any improvement in the parents’ life situations, so the experiment was labeled a failure. But when Chetty ran the numbers years later, he found that the kids actually benefited greatly from it: young children who were part of this relocation program had a higher rate of college attendance and higher overall earnings.
This is going to sound obvious but it must be said: longitudinal studies built on real data, with high response rates and low attrition, produce much better results than studies built on survey data.
The Centers for Medicare & Medicaid Services (CMS) and the Department of Veterans Affairs (VA) have a LOT of data about people’s physical wellbeing. In fact, there are thousands of healthcare datasets held by federal, state, and local governments. Almost none of these datasets are accessible, again for privacy reasons.
Before we proceed further, we must acknowledge that CMS does share a lot of statistics about people, as does the IRS. That’s not the main point here. To advance society, researchers need more than statistics. But they don’t need full access to the underlying data either. What they do need is the ability to ask questions of the data about the care people receive over long periods of time.
The same is true for healthcare providers and insurance companies. Lots of data in very few hands.
And it makes sense. There are lots of regulations to make sure nobody’s health situation can be identified. HIPAA violations exist for a reason. Health data is extremely private (and it should be). But like tax data, there is a way to make asking questions of the data available while still protecting people’s privacy. We don’t need to make a choice between progress and privacy – we CAN DO BOTH.
And if there’s a privacy-safe way to make it accessible (more on this later), then why don’t we do it today? The ramifications are infinite. Here are a few obvious ones:
Here’s a chart from the AEI that’s made the rounds on the interwebs over the past 5 years – healthcare costs have been outpacing inflation significantly:
If opening up access to micro-healthcare data creates an opportunity to reduce our healthcare costs while upleveling the quality of care, then shouldn’t we strive to pursue it?
Isn’t it our duty?
It is our obligation to make data accessible. In fact, it is a moral obligation that we should not shun.
Big tech companies like Google, Amazon, and Apple also have a LOT of data about us. No surprise there. Pretty much all of it stays within their ecosystem.
These companies have some of the smartest people working on some of the hardest problems we have today. But at the same time, there are millions of other smart people who could solve very challenging problems if they had access to this data.
By hoarding the data, these tech companies significantly slow innovation. Not selling (or sharing) their consumer data is morally wrong. We should build a world where access to data — to knowledge and history — is made available to all potential innovators.
The world has already democratized access to compute power. Today it’s available to anyone. Open access to compute power (via AWS, Microsoft Azure, Google Cloud, and more) has massively accelerated innovation. And no, it’s not free. But it is available to anyone that wants to pay for it.
That’s the future we need for data. It doesn’t all have to be free, but it should be accessible to all. How many companies (and frankly, industries) exist today because compute became accessible to all? Well, 10x that impact if data became accessible.
Imagine the innovation.
Inherently, data has no value. It’s the information that can be derived from data that is valuable, and that ultimately dictates the value of the data. Combining datasets opens up new types of information, thereby making each dataset more valuable.
We won’t solve most of society’s problems by only unlocking one or two datasets (although it will help a lot). We need a movement to make all datasets accessible, and to then enable us to join data from different datasets to draw deeper insights. Making just the IRS data or Google’s data accessible will help, but not enable all the insights we need. Joining multiple datasets is where the power lies.
Travis May, CEO of Datavant, wrote a piece on how healthcare data is mostly fragmented (full disclosure: I am an investor in Datavant). It’s when you combine data about prescriptions, doctor’s visits, hospital check-ins, and lab tests that you get a clearer picture.
“All of these disparate data points have limited utility when analyzed individually — it is when they are brought together that these data points form a full picture of the patient’s health. Each additional piece of data that can be linked together has the potential to exponentially increase the value of the data set for understanding key public health questions.” - Travis May
What if we could combine pharmaceutical data with people’s physical records from their doctors, their hospital visits data, and the wellness data from their Apple watches? The leaps in pharmacology and physiology would be huge!
Imagine if we could combine anonymized IRS data with the Medicaid and Medicare data. By empirically tying people’s financial wellbeing to their physical wellbeing, we could see all sorts of new programs. We could fund programs to direct public health initiatives right to the people who need them most.
The advancement in public health and policy alone would be mind-boggling. There would no longer be a question of how to best allocate resources. The data would all be there.
There should also be an easy way to join these datasets. Similar to how Placekey is a common identifier for every physical place, we need encrypted identifiers for people data as well. Identifiers should be SIMPLE:
It sounds scary to combine this data. It sounds like something that could hurt privacy. But what if these datasets could be joined without anyone having access to the underlying data? Where each dataset is still stored separately, but questions can be asked across dozens of datasets at once? That’s actually possible. We just need the courage to build it (and to fight the special interests that want to protect the status quo).
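To make the idea of privacy-safe join identifiers concrete, here is a minimal sketch of one common approach: a keyed hash. Each data holder derives the same opaque token from a person’s identifying fields using a shared secret, and datasets are then joined on tokens rather than raw PII. This is an illustration, not SafeGraph’s or any real linkage system’s actual scheme; every name and value below is made up.

```python
import hmac
import hashlib

# A shared secret held by a neutral linkage service (illustrative value).
LINKAGE_KEY = b"rotate-me-regularly"

def join_token(ssn: str, name: str) -> str:
    """Derive a stable, non-reversible identifier from PII.

    The same person yields the same token in every dataset, so records
    can be joined -- but the token cannot be decoded back to the PII.
    """
    material = f"{ssn}|{name.lower().strip()}".encode()
    return hmac.new(LINKAGE_KEY, material, hashlib.sha256).hexdigest()

# Hypothetical scenario: two agencies tokenize locally,
# then share only tokens plus non-identifying attributes.
irs_row = {"token": join_token("123-45-6789", "Jane Doe"), "income": 72_000}
cms_row = {"token": join_token("123-45-6789", "Jane Doe"), "claims": 3}
assert irs_row["token"] == cms_row["token"]  # joinable without raw PII
```

In practice a design like this also has to defend against brute-forcing the input space, which is why the key stays with a trusted party and rotates.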
There are many examples throughout history of making data more accessible leading to innovation and societal good.
Government is actually great at sharing specific types of data. For example, local, state, and federal governments share an abundance of data about property, mortgages, and real estate transactions. It’s messy and the structure varies from one locality to another but it’s out there.
This resulted in companies like First American, CoreLogic, and Zillow that ingest, clean up, and package this data for sale. Their datasets are then utilized by governments themselves for urban planning and economic development. This is a great example of how opening up access to data can transform society for the better.
Weather data is another example. The National Weather Service and NASA make their data accessible resulting in businesses like AccuWeather. There are also lots of companies that help industries like agriculture innovate by helping make sense of this data.
We take this for granted. But progress comes from building on top of data.
All this innovation was only possible because institutions chose to make their data more accessible.
The biggest challenge in opening up access to data boils down to protecting people’s privacy. It’s incredibly important to protect individual privacy and that’s not really up for debate. But we have ways to solve for this.
There have been huge advancements in privacy technology over the past decade, ensuring personally identifiable information is kept private and safe. But people are still making decisions as if we had the same tools we had in the 1980s. Some of this is because the entrenched special interests are powerful, but some is just because people are unaware of all the advances in protecting people’s privacy.
The entrenched special interests will still be powerful after you finish this piece. But at least you, the reader, will have a better survey of how society can promote innovation AND still protect privacy.
Let’s start with Differential Privacy, which is probably the most commonly used measure to ensure data privacy. To boil it down to simple terms: Differential Privacy adds “noise” or slight modifications to processes that ingest sensitive data.
How much noise is added depends on the process. The idea is that even if you add or remove any single person’s data points, you arrive at essentially the same end product. So when researchers ask questions of a dataset and run analyses on it (the process), Differential Privacy lets them arrive at effectively the same answers they would get from the raw data, without any individual’s data points changing the result in a detectable way.
Why is this important? Because it ensures that the end user can’t deduce who is in the dataset (any one person in a dataset can theoretically be added or removed and the answer would still be the same). Lots of organizations use Differential Privacy today including Google, Microsoft, Apple, Facebook, JP Morgan, and even the US Census Bureau. Differential Privacy, when done well, makes it virtually impossible to reconstruct an underlying dataset or identify any one individual.
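As a sketch of how this works in practice, here is a minimal, illustrative implementation of the Laplace mechanism for a counting query. The function names are my own and the parameters are for demonstration, not a production system.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon: float = 0.5) -> float:
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1 (one person joining or leaving the dataset
    changes it by at most 1), so Laplace noise with scale 1/epsilon
    is enough for an epsilon-DP guarantee.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

incomes = [30_000, 80_000, 55_000, 120_000, 45_000]
# The researcher learns roughly how many incomes exceed $50k,
# but cannot tell whether any one person is in the data.
print(private_count(incomes, lambda x: x > 50_000))
```

Smaller epsilon means more noise and stronger privacy; the data steward chooses that tradeoff per query.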
Here are some tactical methods of implementing privacy:
If direct access to data is not available, it is possible to create synthetic data. By modeling the statistical properties of a dataset and generating new records from that model, you get a new dataset with the same statistical properties as the original, but where no actual data point is the same. Because every record is generated by an algorithm, nobody’s privacy is at risk. Yet the resulting dataset can be nearly as useful as the original one.
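A minimal sketch of the idea, assuming we only need to preserve each column’s mean and spread. A real synthetic-data system would also preserve correlations between columns and add formal privacy guarantees such as differential privacy; the `synthesize` name is my own.

```python
import random
import statistics

def synthesize(rows, n):
    """Generate n synthetic rows matching each column's mean and
    standard deviation. No output row belongs to a real individual."""
    columns = list(zip(*rows))  # transpose rows into columns
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [tuple(random.gauss(mu, sd) for mu, sd in params)
            for _ in range(n)]

# Hypothetical (age, income-in-$10k) records:
real = [(34.0, 7.2), (51.0, 9.1), (29.0, 4.8), (45.0, 8.3)]
fake = synthesize(real, 1000)  # same shape, no real people inside
```

Researchers can then develop and debug their analysis against `fake` before ever requesting a privacy-gated run against the real data.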
All data tied to people can be homomorphically encrypted, which allows end users to perform computations on data without ever decrypting it or seeing the underlying values. The results of those computations are themselves encrypted and must be decrypted by the key holder.
So if our fictional character John wants to figure out (X+Y), he can submit a computation to add X and Y and will receive an encrypted solution. He’ll then have to decrypt that solution. If the solution to X+Y is 10, he will not receive the answer 10, rather the answer will be encrypted waiting to be decrypted.
So why doesn’t everyone use homomorphic encryption today? Well, it is slow and expensive. But it’s getting better and faster every day. Pushing Homomorphic Encryption forward is one of the most important things we can do.
Functional Encryption is a method of encryption in which unique decryption keys allow end users to perform specific functions on the data. If you have the decryption key for a given function, you can run that specific analysis on private data without ever accessing the data itself. This allows us to ask questions of the data and see results, but nothing else.
How is this different from Homomorphic Encryption? Functional Encryption requires access to a specific key which corresponds to a specific function on the data itself. The output is also not encrypted in functional encryption, unlike Homomorphic Encryption. The one drawback of Functional Encryption is that the generation of decryption keys to perform functions on the data can be a bottleneck for widespread use.
Going back to the example with John, he can submit the computation of (X+Y) and will receive a decrypted solution (e.g. if the answer is 10, he will receive that).
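Here is a toy simulation of the functional-encryption interface, to show how it differs from the homomorphic case. To be clear about the assumption: real FE schemes enforce this cryptographically, while here a trusted in-process authority stands in for the mathematics, and all names (`Authority`, `keygen`, the record handle) are illustrative.

```python
# Toy *simulation* of the functional-encryption interface.
# Real FE enforces these rules with cryptography; here a trusted
# authority object stands in for the math, purely to show the workflow.

class Authority:
    """Holds the master secret and the submitted data."""

    def __init__(self):
        self._vault = {}  # a real scheme would hold ciphertexts here

    def submit(self, handle, value):
        self._vault[handle] = value

    def keygen(self, f):
        """Issue a 'function key': its holder learns f(x), nothing else."""
        def function_key(handle):
            return f(self._vault[handle])
        return function_key

authority = Authority()
authority.submit("john_record", (4, 6))  # (X, Y); John never sees this
sum_key = authority.keygen(lambda xy: xy[0] + xy[1])
print(sum_key("john_record"))  # -> 10, already in the clear
```

Note the two contrasts with the homomorphic case: the answer comes back decrypted, and each new analysis requires its own key from the authority, which is exactly the key-generation bottleneck described above.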
In Secure Multi-Party Computation, two or more parties jointly compute a function over their private inputs. The input data is masked, meaning the underlying data is obfuscated or modified, and the output is shared among the parties computing the function. The benefit of this methodology is that it is very hard to leak data, since multiple parties compute the function together rather than relying on a single point of failure. The drawback is that it does require multiple parties to coordinate to compute a single function.
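A minimal sketch of one multi-party-computation building block, additive secret sharing: each party splits its input into random-looking shares, distributes them, and only sums of shares are ever revealed. The three-hospital scenario and all numbers are hypothetical.

```python
import random

P = 2**61 - 1  # a large prime modulus; individual shares look random mod P

def share(secret, n_parties):
    """Split `secret` into n additive shares.

    Any n-1 shares together reveal nothing about the secret; only the
    sum of all n shares (mod P) reconstructs it.
    """
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

# Three hospitals compute their total patient count without any hospital
# revealing its own count to the others.
inputs = [1200, 850, 2300]
all_shares = [share(x, 3) for x in inputs]
# Party i receives the i-th share of every input and publishes only the sum.
partials = [sum(s[i] for s in all_shares) % P for i in range(3)]
total = sum(partials) % P
assert total == sum(inputs)  # 4350, with no individual count disclosed
```

Real MPC protocols layer multiplication, comparison, and malicious-party protections on top of this, but the core trick is the same: reveal only aggregates of masked values.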
Travis May from Datavant recently stated that we don’t have to trade privacy for data utility; both can be achieved. He went on to explain that the tradeoff will only need to be made once we’ve reached the efficiency frontier, which we’re nowhere close to. He uses the graph below to visualize this:
The combination of all this new technology now means it is entirely possible to join different types of data about people without ever uncovering who they are.
If we make data that exists today across large public and private institutions more accessible, it’ll be a huge step forward for humanity. Doing so will result in unparalleled economic and policy innovation.
It doesn’t have to be free, but it also can’t be egregiously expensive.
Is the problem really privacy? On the face of it, it would seem so. But if you dig deeper, it’s a very solvable problem. The real problem is having the courage to do the very hard work of making data privacy-compliant.
Our goal shouldn’t be to hide the data; it should be to make it safe and securely accessible.
It starts with a collaboration between large institutions (access) and people (consent). The technology is there to make sure it’s executed safely and privacy is protected. And by the way, willingness to open up access will most definitely result in advances in privacy technology.
We should ask the institutions to meet us halfway. If you make the data accessible, we promise you the world will rise to make sure it’s used for innovation in a safe way.
All we need is courage.
You can find me on Twitter @auren