Challenges of Geospatial Data Integration

We explained earlier in this guide how geospatial data integration can benefit different types of organizations in many different ways. However, putting geospatial data to work for your company can often be easier said than done.

You may not initially have the right personnel or technology infrastructure to work with geospatial data properly. And data that is inaccurate or lacks standards can require a lot of time and effort to clean, or you risk basing critical business decisions on flawed information.

We’ll elaborate on these potential pitfalls of using geospatial data, as well as some work-arounds, in the following sections:

  • The need for data integration
  • Top 5 challenges of geospatial data integration

We’ll start with a recap of why integrating data, especially geospatial data, into a company’s operations is such a big deal.

The need for data integration

Data allows companies to measure multiple instances of an element or variable to make fact-based decisions. But any single dataset can only give a limited amount of information and be used for a limited number of purposes. That’s why integrating and linking multiple datasets in your organization’s operations is essential for being able to gain more insights and answer more questions.

Geospatial data, in particular, can be a powerful tool because it links information to specific places in the physical world. It can also reveal relationships between places across space and time. For example, it can allow a business to calculate how likely visitors to a nearby point of interest are to also visit its store. Of course, these data points and the relationships between them have many other applications, such as mapmaking, civic planning, and even assessing risk for insurance policies.

Top 5 challenges of geospatial data integration

Unfortunately, integrating geospatial data into your organization’s decision-making is not without its obstacles. Some are common to other data integration processes. Others are unique to geospatial data because of what it describes and how it behaves.

1. Data standardization

Many data scientists and GIS analysts spend up to 90% of their time just cleaning data before they can use it. A major reason is a lack of standards. For instance, timestamps may come from different time zones, or measurements may be taken in different units – sometimes units that don’t convert neatly into one another (think metric vs. imperial).
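As a rough illustration of that cleaning work, here is a minimal Python sketch (using pandas) that normalizes timestamps to UTC and converts mixed distance units to kilometers. The column names and sample values are hypothetical:

```python
import pandas as pd

# Hypothetical raw records: timestamps in mixed time zones, distances in mixed units.
raw = pd.DataFrame({
    "observed_at": ["2023-06-01 09:30:00-04:00", "2023-06-01 15:30:00+02:00"],
    "distance": [3.1, 5.0],
    "distance_unit": ["mi", "km"],
})

# Normalize all timestamps to a single reference (UTC).
raw["observed_at_utc"] = pd.to_datetime(raw["observed_at"], utc=True)

# Normalize all distances to kilometers before any analysis or joins.
MI_TO_KM = 1.609344
raw["distance_km"] = raw.apply(
    lambda row: row["distance"] * MI_TO_KM if row["distance_unit"] == "mi" else row["distance"],
    axis=1,
)

print(raw[["observed_at_utc", "distance_km"]])
```

Multiply this by dozens of columns and sources, and it becomes clear why cleaning eats so much analyst time when no shared standard exists.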

A standard is also sometimes only as good as its adoption rate, and there can be barriers to this as well. For example, the standard’s creator(s) may charge money, require re-sharing of data, or impose some other obligation that makes people and organizations hesitant to adopt it. And remember: a standard doesn’t have to fit every case perfectly; it just has to fit enough cases that a critical mass of people or organizations agree to it and derive value from it.

How to solve this problem:

A good standard should allow your datasets to be understood in the context of as many other datasets as possible. To do this, it should be able to identify data points under a series of guidelines, often summarized as the “S.I.M.P.L.E.” formula:

  • Storable – Data point IDs should be storable in systems that don’t require Internet access.
  • Immutable – Data point IDs shouldn’t change over time, except in extreme circumstances.
  • Meticulous – Data points should be uniquely identifiable across all systems they’re in.
  • Portable – Standardized IDs should allow data points to smoothly transition from one storage system or dataset to another.
  • Low-cost – The standard should be inexpensive, or even free, to use for data transactions.
  • Established – The standard needs to cover almost all data points it could be applied to.

2. Address standardization

Addresses are so notorious for causing data standardization problems that they deserve their own section. For starters, there are many different elements to addresses: street name, building unit number, city, region, country, mailing code, and so on. Some databases may not have these pieces of information in a standard order, or may not even have all of them. This can make it difficult for a computer program or algorithm to tell if two or more addresses point to the same location.

There are other challenges as well. Some place names may be misspelled or have other typos. Even varying use of punctuation, abbreviations, or acronyms can cause problems. Does your data processing platform recognize that “US”, “USA”, “U.S.A.”, “the (United) States”, and even “America” all refer to the same country? Can it tell if the abbreviation “St.” stands for “street” or “saint”, and in which cases either one applies?
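To make the problem concrete, the sketch below shows one naive way to canonicalize a few of these variations in Python before trying to match records. The lookup tables are illustrative, not exhaustive, and real pipelines usually rely on a dedicated address parser or a shared identifier rather than hand-rolled rules:

```python
import re

# Illustrative (not exhaustive) lookup tables for common variants.
COUNTRY_ALIASES = {
    "us": "US", "usa": "US", "u.s.a.": "US", "u.s.": "US",
    "united states": "US", "america": "US",
}
STREET_ABBREVIATIONS = {"st": "street", "ave": "avenue", "blvd": "boulevard", "rd": "road"}

def normalize_country(value: str) -> str:
    return COUNTRY_ALIASES.get(value.strip().lower(), value.strip().upper())

def normalize_street(value: str) -> str:
    # Expand only a trailing abbreviation, so a leading "st" ("saint") is left alone.
    tokens = re.sub(r"[.,]", "", value.lower()).split()
    if tokens and tokens[-1] in STREET_ABBREVIATIONS:
        tokens[-1] = STREET_ABBREVIATIONS[tokens[-1]]
    return " ".join(tokens)

print(normalize_country("U.S.A."))          # -> US
print(normalize_street("St. Mary's St."))   # -> st mary's street
```

Even this tiny example needs a judgment call about “St.” vs. “saint”, which is exactly why ad hoc rules break down at scale.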

How to solve this problem:

Correcting these issues requires storing address data in a more efficient and less arbitrary way. That’s why Placekey was invented: to provide a free, open, and concise standard for representing information about a specific location. It generates a unique “what @ where” string of encoded characters: the “what” part identifies the location’s address and, if one exists, a specific point of interest there, while the “where” part defines the geographic area the location occupies as a hexagon tied to its latitude and longitude coordinates.
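As a minimal sketch, assuming the open-source placekey Python package (pip install placekey) and its documented geo_to_placekey / placekey_to_geo helpers, here is how the “where” part can be derived locally from coordinates. The coordinates are hypothetical, and resolving full “what@where” keys for addresses and points of interest goes through the free Placekey API rather than this offline call:

```python
import placekey as pk

# Hypothetical coordinates for a storefront.
lat, lon = 37.7371, -122.44283

# Derive the "where" part of a Placekey locally from latitude/longitude.
where_part = pk.geo_to_placekey(lat, lon)
print(where_part)  # e.g. "@5vg-82n-kzz" (output shown for illustration)

# Records whose keys share the same "where" part fall in the same hexagon,
# which makes them easy to join even if their raw address strings differ.
center_lat, center_lon = pk.placekey_to_geo(where_part)
```

Because the key is a short, deterministic string, it can serve as a join column across datasets instead of fragile free-text address matching.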

3. Lack of institutional knowledge

Traditionally, geospatial data and geographic information systems (GIS) have been in a class of their own, separate from data science or other engineering fields. So only a small group of people in these latter fields (about 5%) actually know how to work with geospatial data. It doesn’t behave the same way as, say, tabular data, so many organizations struggle to ingest it into their workflows because there is a skills gap.

Bridging that skills gap can be difficult as well, and not just because companies have a limited talent pool to draw from. They also have to make sure they hire people with the unique skill sets and experience they need. This often makes the recruitment process – from drafting a posting to interviews to technical tests – take longer than usual, which clashes with the organization’s desire to keep ongoing projects moving. That can put immense pressure on hiring managers to hire someone as fast as possible, rather than someone who can actually do the specific job.

How to solve this problem:

Start by looking within your company’s own network. Then get creative if you need to: host webinars, hackathons, or meetups; attend conferences; or hire a specialized recruiting agency to attract contacts with specialized geospatial data know-how.

Ideally, you’ll want someone with strong programming skills and a background in statistics. They should also know how to build data products, visualizations, workflows, and data pipelines. Finally, you’ll want someone who’s familiar with machine learning, distributed computing, and (obviously) GIS software.

4. File size/processing times

Like any type of data science analysis, geospatial analytics requires the right systems and infrastructure. That said, you don’t necessarily need anything radically different from other types of data analysis. But basic tools like Excel and ODBC connections to SQL databases might not cut it if you’re looking to work with a large number of datasets, or at least to scale up in the future.

You also have to decide how much you want to preprocess data or optimize it as you go (cost-efficiency versus flexibility to answer unique questions). Finally, you need to communicate these decisions with stakeholders so that they understand the limits of how fast or completely you can answer their questions, based on how fast you can process the relevant data.

How to solve this problem:

Data experts recommend building on a cloud-based data platform, such as one backed by Amazon S3. While it may take more time and expertise to operate and manage, it offers better processing capabilities and scalability over time, and it leaves room to add custom components to the tech stack as they’re needed. Your system should also include a data lake, a data storage system, a processing platform, a task scheduler, and a pipeline creation tool.
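As one minimal sketch of what that scalability looks like in practice, the snippet below uses pyarrow to read only the columns and partitions it needs from a Parquet dataset stored in S3, rather than pulling everything into Excel or a single machine’s memory. The bucket path, column names, and partition values are hypothetical:

```python
import pyarrow.dataset as ds

# Hypothetical partitioned Parquet dataset sitting in an S3 data lake.
places = ds.dataset(
    "s3://example-geo-data-lake/places/",   # placeholder bucket/prefix
    format="parquet",
    partitioning="hive",                    # e.g. .../region=CA/part-000.parquet
)

# Push the filter and column selection down to the scan so only the
# relevant files and columns are read, instead of the whole dataset.
table = places.to_table(
    columns=["placekey", "latitude", "longitude", "category"],
    filter=ds.field("region") == "CA",
)

df = table.to_pandas()
print(len(df), "rows loaded for region=CA")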

5. Data quality

A lot of bad data exists. Most of it is caused by a lack of expertise in how to collect and process it, or just simple human error. As we’ve already discussed, lack of standardization plays a large part in this, as it can cause analysts to miss critical details. Other inaccuracies in geocoding and digitizing physical places and features can cause a cascade of inconsistencies in their geographic representation. These make it difficult, if not impossible, to accurately measure foot traffic and other variables surrounding a business or other point of interest.

Open source geospatial data is great because everyone can check it for mistakes and omissions — at least in theory. In reality, users should still be careful to vet open source data and make sure it is correct and suitable to their needs. The problem is that this process is expensive and time-consuming, so companies will often skip it — especially when they’re on a tight deadline and need insights quickly. But the consequences of making important decisions with inaccurate data can be even more costly.

How to solve this problem:

Take four steps to check data before using it. First, make sure it comes from reliable sources. Second, evaluate what it’s capable of, including any gaps it may leave and any assumptions you might make about it. Third, determine how much work it will take to get the data ready for use. Finally, based on what you know the data can (and can’t) do, draw up a plan for what specific function(s) it will serve in your operations.
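To make the evaluation and preparation steps more concrete, here is a small Python sketch (using pandas) of the kind of automated checks that can surface gaps before the data drives a decision. The column names and sample values are hypothetical and should be adapted to your own schema:

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Run a few cheap sanity checks on a hypothetical POI dataset."""
    return {
        # Gaps: how much of each key column is missing?
        "null_rates": df[["placekey", "latitude", "longitude"]].isna().mean().to_dict(),
        # Duplicates: is the same place represented more than once?
        "duplicate_placekeys": int(df["placekey"].duplicated().sum()),
        # Plausibility: coordinates outside valid ranges point to geocoding errors.
        "bad_coordinates": int(
            (~df["latitude"].between(-90, 90) | ~df["longitude"].between(-180, 180)).sum()
        ),
    }

# Example usage with a toy frame; real checks would run against the full dataset.
sample = pd.DataFrame({
    "placekey": ["abc@5vg-82n-kzz", "abc@5vg-82n-kzz", None],
    "latitude": [37.74, 37.74, 91.0],
    "longitude": [-122.44, -122.44, -122.44],
})
print(basic_quality_report(sample))
```

Checks like these won’t catch every problem, but they make the “what can this data actually support?” conversation with stakeholders much easier.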

If that sounds like a lot to go through, consider cutting down on some of the manual labor by investing in SafeGraph’s datasets. They’re checked for accuracy and cleaned every month by SafeGraph’s expert data technicians, making them among the most up-to-date and immediately usable geospatial datasets on the market.

In summary, if you’re going to use geospatial data, first make sure you have the right people and infrastructure to work with it properly. Then, make sure the actual data you’re using is as accurate, standardized, and as relevant to your organization’s needs as possible. If you’d like further help, get in touch with SafeGraph. We’re experts in managing geospatial data – because it’s all we do.

If you're ready to learn more, check out the next chapter, "Geospatial Data Management Best Practices".

If you’re on the integration path and have questions about the process, make sure you check out our guide, “Geospatial Data Integration — Importance + Top 5 Challenges”.

Want to get started on your own? Check out our tutorials below!