Our clients often ask us for best practices when setting up data infrastructure or tooling for analyzing large quantities of geospatial data. To help answer these questions, we teamed up with Rayne Gaisford, Head of Data Strategy and Equity Research at Jefferies, and Felix Cheung, SafeGraph VP of Engineering. You can watch the full webinar here, or read what these data experts have to say below.
The data infrastructure landscape has changed completely over the last few years. Most notably, there has been a shift from “bare minimum” solutions such as Excel and SQL to cloud environments that offer far more processing capability. Managing and operating these tools takes time and effort, but it also reflects a shift from a system-management approach to a more data-centered approach, driven by the adoption of programming languages and libraries.
A second change is that new data infrastructure has altered how different stakeholders interact with each other. For example, as the business community adopts data science, data providers now expose work that used to be reviewed only by data scientists to non-technical users, helping them draw conclusions for decision-making.
Many users ask what the main parts of a stack for working with large quantities of geospatial data are. Both Felix Cheung and Rayne Gaisford recommend starting with a cloud object store such as Amazon S3, which offers reliability, convenience, consistency, and speed. A table format such as Delta Lake or Apache Iceberg running on top of it adds central versioning and snapshot protection, and lays the foundation for downstream data processing, for example with Apache Spark. All of this needs to be complemented with a compute platform for processing, a scheduling service such as Amazon Managed Workflows for Apache Airflow (MWAA), and pipeline tooling that simplifies the act of writing a data processing pipeline.
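To make the versioning and snapshot idea concrete, here is a toy Python sketch of the copy-on-write snapshot semantics that table formats like Delta Lake and Apache Iceberg provide on top of object storage. The class and method names are illustrative, not the real APIs; the real systems persist snapshots as metadata files on storage such as S3 rather than in memory.

```python
# Toy illustration of snapshot-based versioning, the core idea behind
# table formats such as Delta Lake and Apache Iceberg. All names here
# are hypothetical stand-ins, not the actual library APIs.

class VersionedTable:
    def __init__(self):
        self._snapshots = []  # each snapshot is an immutable list of rows

    def append(self, rows):
        """Every write creates a new immutable snapshot (copy-on-write)."""
        current = self._snapshots[-1] if self._snapshots else []
        self._snapshots.append(current + list(rows))
        return len(self._snapshots) - 1  # version number of the new snapshot

    def read(self, version=None):
        """Readers can pin a version ('time travel'); default is latest."""
        if not self._snapshots:
            return []
        if version is None:
            version = len(self._snapshots) - 1
        return self._snapshots[version]

table = VersionedTable()
v0 = table.append([{"store": "A", "visits": 10}])
v1 = table.append([{"store": "B", "visits": 7}])

print(len(table.read(v0)))  # the old snapshot is still readable
print(len(table.read()))    # the latest snapshot sees both rows
```

Because every write yields a new immutable snapshot, concurrent readers are never disturbed by in-flight writes, and older versions remain queryable, which is what makes central versioning and snapshot protection possible on a shared data lake.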
But there’s more to it than hardware and tools. To orchestrate multiple teams in a large organization, it helps to have an active committee that maintains a centralized data catalogue, so that different teams know what data exists outside their own. Connecting datasets to each other also takes more than technical integration: different brands, products, and companies map to different catalogues, and the goal is to find the best way to convey the resulting insights to decision makers.
Even though spatial data does not require different tooling from other datasets, it helps to understand what it represents and which insights can be drawn from it. GIS data is more than the physical location of a shop: it also says something about the relationships between businesses and consumer behavior across brands. For example, spatial data can reveal whether a shopper who shops at place A has a higher propensity to also shop at brand B. That propensity question is not often found outside the world of spatial data analysis.
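As a minimal illustration of the propensity question, the pure-Python sketch below estimates how much more likely visitors of place A are to also visit brand B than the overall shopper population. The visit data is entirely made up; in practice these sets would come from foot-traffic or transaction datasets.

```python
# Minimal sketch of a cross-shopping propensity calculation on
# hypothetical visit data: does shopping at place A raise the
# likelihood of also shopping at brand B?

visits = {  # shopper id -> set of places visited (made-up data)
    "s1": {"A", "B"},
    "s2": {"A", "B"},
    "s3": {"A"},
    "s4": {"B"},
    "s5": {"C"},
}

shoppers_a = {s for s, places in visits.items() if "A" in places}
shoppers_b = {s for s, places in visits.items() if "B" in places}

# Compare P(visits B | visits A) against the baseline P(visits B).
p_b_given_a = len(shoppers_a & shoppers_b) / len(shoppers_a)
p_b = len(shoppers_b) / len(visits)
lift = p_b_given_a / p_b

print(f"P(B|A) = {p_b_given_a:.2f}, P(B) = {p_b:.2f}, lift = {lift:.2f}")
```

A lift above 1.0 means shoppers at A are disproportionately likely to also shop at B; on this toy data, two of the three A-shoppers also visit B, giving a lift of about 1.11 over the 60% baseline.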
To get started with analyzing large GIS datasets, it’s possible to use existing data analysis tools before investing in anything new. This could mean using Excel, or connecting to a database through an Open Database Connectivity (ODBC) interface, which allows data access in database management systems using SQL. However, this only accommodates small databases and does not scale over time. Cheung and Gaisford say a better option is to adopt a cloud data platform and run geospatial queries through a spatial database. Besides being very affordable, this solution can be extended into one’s own data lake for parsing the data. An added benefit of starting right at the edge of what an organization is used to is that, over time, different teams will want to take ownership of the data and move it up the stack instead of holding on to whatever solution they are familiar with.
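To illustrate that plain-SQL starting point, here is a sketch using Python’s built-in sqlite3 module as a stand-in for a database reached over ODBC. The store names and coordinates are invented, and since plain SQLite has no spatial extension, the query falls back on a simple bounding-box filter; a true spatial database such as PostGIS would use geometry types and functions like ST_DWithin instead.

```python
# Sketch: querying point locations with plain SQL, using the stdlib
# sqlite3 module in place of an ODBC connection. Data is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (name TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO stores VALUES (?, ?, ?)",
    [("Downtown", 40.71, -74.00),  # illustrative coordinates
     ("Uptown",   40.80, -73.95),
     ("Suburb",   41.20, -74.50)],
)

# Bounding-box filter: the crude, non-spatial way to ask a
# "what is near here?" question in a plain SQL database.
rows = conn.execute(
    """SELECT name FROM stores
       WHERE lat BETWEEN 40.70 AND 40.85
         AND lon BETWEEN -74.05 AND -73.90
       ORDER BY name"""
).fetchall()
print([name for (name,) in rows])
```

This pattern works fine for a small table, which is exactly the point made above: it gets an organization started, but it is the kind of solution teams eventually outgrow and move up the stack.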
Striking a balance between how much an organization preprocesses and how much it optimizes as it goes can be tough. To get both, the data analysis infrastructure can be split so that it offers both flexibility and cost-efficiency. For example, instead of running 12 hours of ‘pre-canned’ cloud jobs, it might be preferable to run a 3-hour job and accept a slightly slower data processing tool. This creates the opportunity to run the tool incrementally to answer a one-off query when it comes up.
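One common way to implement that incremental middle ground is a watermark: instead of reprocessing all history in one long pre-canned batch, each run processes only the records that arrived since the previous run. The sketch below shows the pattern in plain Python; the record layout, the doubling “processing” step, and the function name are all illustrative.

```python
# Sketch of incremental processing with a watermark: each run handles
# only records newer than the last run, instead of the full history.
# Record shape and the processing step are illustrative stand-ins.

def run_incremental(records, watermark):
    """Process records with ts > watermark; return (results, new watermark)."""
    new = [r for r in records if r["ts"] > watermark]
    results = [r["value"] * 2 for r in new]  # stand-in for real processing
    new_watermark = max((r["ts"] for r in new), default=watermark)
    return results, new_watermark

log = [{"ts": 1, "value": 10}, {"ts": 2, "value": 20}]

out1, wm = run_incremental(log, watermark=0)  # first run processes everything
log.append({"ts": 3, "value": 30})            # a one-off question arrives later
out2, wm = run_incremental(log, wm)           # second run touches only ts=3

print(out1, out2, wm)
```

The second run does a fraction of the work of the first, which is what makes it cheap to answer ad-hoc questions as they come up rather than paying for the full batch every time.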
Long before any data analysis takes place, it’s important to talk with stakeholders about their expectations: which insights are they looking for, and which data could they benefit from? Having that conversation ahead of everything else lays the foundation for a strategy to find, connect, and integrate the data later on. Although such a from-scratch approach can take a lot of time, it’s not necessary to go through the whole process for each client.
When having this conversation with a client, the programmer’s first task is to build trust with the non-technical team so that they are seen as allies. Once both the technical and non-technical parties are clear on the question the data needs to answer, it becomes easier to manage client expectations and, when not every question can be answered the first time around, to develop a common language for answering more questions in the future. In a large organization with multiple teams, having this overview brings an added benefit: one team’s questions can often be answered using another team’s data catalogue.