Polygon data on building and property footprints is a critical part of a geospatial data ecosystem. It provides context on how places relate to each other physically: in terms of size, shape, proximity, positioning, and so on. Many factors therefore go into making it as informative as possible for fields such as mapping, marketing, real estate, and insurance.
In this webinar, SafeGraph product manager Bryan Bonack explains some of the conventions we look at to evaluate the quality of polygon data, as well as to identify and resolve any errors with it. Here’s a summary of what’s inside:
Before we dive too deep, we’ll first provide a little more background as to what polygon data is and how it can be useful.
Polygon data is, at its core, a collection of shapes, each defined by a series of connected vertices that form a closed boundary. For SafeGraph’s purposes, these shapes are designed to represent the boundaries of real-world places. These places could be buildings, units within buildings, or even properties encompassing multiple buildings.
The polygons themselves can then be attributed with metadata to contextualize the relationships between places (area, perimeter, proximity, etc.). This makes them useful for a variety of applications in marketing, real estate site selection, store visit attribution, insurance risk assessment, and beyond. Of course, all of this is only helpful if both the polygons themselves and the metadata associated with them are precise and accurate.
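To make the idea of attributing a polygon with area and perimeter concrete, here is a minimal sketch. It assumes the vertices are already in a planar (projected) coordinate system with units of meters; the function names and the sample footprint are illustrative and not part of SafeGraph's tooling.

```python
import math

def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula.

    `vertices` is a list of (x, y) pairs in a planar coordinate system;
    the ring is closed implicitly (last vertex connects back to the first).
    """
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

def polygon_perimeter(vertices):
    """Sum of the edge lengths, including the closing edge."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))

# Hypothetical 30 m x 20 m rectangular building footprint.
footprint = [(0, 0), (30, 0), (30, 20), (0, 20)]
area_m2 = polygon_area(footprint)
perimeter_m = polygon_perimeter(footprint)
```

For raw latitude/longitude coordinates, the vertices would first need to be projected, since degrees are not a uniform unit of distance.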
The purpose of this webinar is to go over some technical guidelines we follow when creating our “Geometry” dataset of polygon data. A few are things we strive to do at all times, while others are things we try to avoid if possible. Here are the major highlights:
Time in Video: 8:58
SafeGraph emphasizes metadata explaining the relationships between geospatial polygons that overlap, which we collectively call “spatial hierarchy”. Sometimes, a polygon belongs to a place that’s part of a larger point of interest (such as a park or a school/hospital campus). In these cases, we refer to the smaller place as having a “child” polygon and the larger place as having a “parent” polygon.
There are also cases where a place is completely indoors and can only be reached by entering its parent structure. We refer to such a place as having an “enclosed” polygon. These polygons are difficult to map accurately, so sometimes we simply note the corresponding places as having “shared” polygons with their parent buildings. This is in contrast to an “owned” polygon, which exactly marks the boundaries of the specific point of interest (POI) it refers to.
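One way to picture these parent, child, enclosed, shared, and owned relationships is as a few extra fields on each place record. The sketch below is only an illustration: the class, field names, and sample places are hypothetical and are not taken from the actual Geometry schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PlaceRecord:
    # All field names here are hypothetical, chosen for illustration only.
    place_id: str
    name: str
    parent_id: Optional[str] = None   # None means the place has no parent
    polygon_class: str = "owned"      # "owned" or "shared" (with the parent)
    enclosed: bool = False            # True if reachable only via the parent

def children_of(parent_id: str, records: List[PlaceRecord]) -> List[PlaceRecord]:
    """Return every record whose parent is `parent_id`."""
    return [r for r in records if r.parent_id == parent_id]

records = [
    PlaceRecord("mall-1", "Example Mall"),
    PlaceRecord("shop-1", "Indoor Kiosk", parent_id="mall-1",
                polygon_class="shared", enclosed=True),
    PlaceRecord("shop-2", "Anchor Store", parent_id="mall-1"),
]

mall_children = children_of("mall-1", records)
```

Here the kiosk is an enclosed child that shares the mall's polygon, while the anchor store is a child with its own owned polygon.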
Time in Video: 15:08
On rare occasions, we are unable to match a polygon from our database to a specific POI. In these cases, we can create a “synthetic” polygon for that POI by drawing a circle of a chosen radius around its latitude/longitude center point. This shape can be adjusted to account for how large the POI likely is (based on its category), as well as to help it avoid overlapping with streets, terrain features, and other buildings.
We avoid using synthetic polygons as much as possible (currently less than 4% of the time), as they are merely estimates and not accurate representations of the boundaries of places. We accomplish this by adding more polygons to our dataset, refining our existing polygons, and attributing more precise geocodes to our POIs so it becomes easier to match polygons to them.
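As a rough sketch of the basic idea behind a synthetic polygon, the code below approximates a circle around a center point as a closed ring of vertices. The category-to-radius table, the function name, and the local flat-earth approximation are all assumptions made for illustration; the adjustments for streets, terrain, and neighboring buildings described above are not attempted here.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, used for a local approximation

# Hypothetical radii per category; real values would come from observed POI sizes.
CATEGORY_RADIUS_M = {"coffee_shop": 10.0, "supermarket": 40.0}

def synthetic_polygon(lat, lon, radius_m, n_vertices=16):
    """Approximate a circle of `radius_m` around (lat, lon).

    Returns a closed ring of (lon, lat) pairs (first vertex repeated at the end).
    Uses a small-distance approximation that is reasonable for building-sized radii
    away from the poles.
    """
    ring = []
    for i in range(n_vertices):
        theta = 2 * math.pi * i / n_vertices
        dlat = radius_m * math.sin(theta) / EARTH_RADIUS_M
        dlon = radius_m * math.cos(theta) / (EARTH_RADIUS_M * math.cos(math.radians(lat)))
        ring.append((lon + math.degrees(dlon), lat + math.degrees(dlat)))
    ring.append(ring[0])  # close the ring
    return ring

ring = synthetic_polygon(37.7749, -122.4194, CATEGORY_RADIUS_M["coffee_shop"])
```

Because the result is only an estimate built from a point and a guessed size, it is a fallback rather than a substitute for a real footprint.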
Time in Video: 17:20
A non-synthetic owned polygon is a polygon that both refers to a distinct POI and accurately depicts the shape and size of that POI. These are the most suitable polygons, regardless of whether they are parent polygons or even enclosed child polygons, for use cases where absolute precision is needed. They can even be refined further by calculating their areas and then comparing them against the average area of POIs belonging to a particular brand or category.
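The area comparison mentioned above can be sketched as a simple outlier check. This example uses the median rather than the mean as the "typical" area, since a median is less distorted by the very outliers being looked for; that substitution, the thresholds, and the sample data are assumptions for illustration.

```python
from statistics import median

def flag_area_outliers(areas_by_place, low=0.25, high=4.0):
    """Return {place_id: ratio} for places whose area is far from the group median.

    `areas_by_place` maps a place id to its polygon area (same units throughout).
    `low` and `high` are hypothetical ratio thresholds.
    """
    typical = median(areas_by_place.values())
    flagged = {}
    for place_id, area in areas_by_place.items():
        ratio = area / typical
        if ratio < low or ratio > high:
            flagged[place_id] = ratio
    return flagged

# Hypothetical polygon areas (square meters) for four locations of one brand.
brand_areas = {"store-a": 180.0, "store-b": 200.0, "store-c": 220.0, "store-d": 2400.0}
flagged = flag_area_outliers(brand_areas)
```

A flagged polygon is a candidate for review, for example because it may actually be the footprint of an entire shopping center rather than of a single store.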
Time in Video: 19:16
Sometimes, two or more distinct non-parent polygons will overlap each other. This usually happens when multiple data sources each represent the same place in a slightly different way. Ideally, it is resolved either by merging the overlapping polygons or by assigning one of them as a shared parent polygon for each corresponding POI.
If left unsolved, this situation is undesirable because it clutters visualizations and data files with redundant information. In rare but severe cases, it can even interfere with correctly attributing visits to clustered POIs.
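A cheap first-pass screen for this kind of overlap is to compare bounding boxes and compute their intersection-over-union (IoU). This is a deliberate simplification: overlapping bounding boxes do not prove that the polygons themselves overlap, so pairs that pass this screen would still need an exact geometric check. The function names and sample footprints are illustrative.

```python
def bbox(vertices):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a vertex list."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_iou(poly_a, poly_b):
    """Intersection-over-union of the two polygons' bounding boxes (0.0 to 1.0)."""
    ax0, ay0, ax1, ay1 = bbox(poly_a)
    bx0, by0, bx1, by1 = bbox(poly_b)
    inter_w = min(ax1, bx1) - max(ax0, bx0)
    inter_h = min(ay1, by1) - max(ay0, by0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes are disjoint or only touch along an edge
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union

# Two hypothetical sources tracing the same building with a 1 m offset.
source_a = [(0, 0), (10, 0), (10, 10), (0, 10)]
source_b = [(1, 1), (11, 1), (11, 11), (1, 11)]
far_away = [(20, 20), (30, 20), (30, 30), (20, 30)]

likely_duplicate_score = bbox_iou(source_a, source_b)
unrelated_score = bbox_iou(source_a, far_away)
```

A high score suggests two sources describing the same place, which makes the pair a candidate for merging.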
Time in Video: 21:18
Correctly defining parent polygons can be challenging at times. They often have unique shapes that go beyond a single building footprint (or even groups of buildings) to include features like parking lots, terrain, and other structures. This makes it difficult not only to capture their boundaries precisely, but also to decide whether polygons that overlap with them are their children.
We use a few techniques to improve how accurately we can define parent polygons. One is to consider which categories of places are most likely to contain child POIs, based on their function and average size. In a similar vein, we can look at the average size of branded POIs to work out potential parent and child relationships when polygons associated with those brands overlap.
Above all, though, we tend to hand-curate parent polygons to ensure their accuracy. We do this because we consider them to be the foundation of our spatial hierarchy concept. If our parent polygons are not correct, then it becomes difficult – if not impossible – to establish the relationships between overlapping POIs, or to properly attribute foot traffic to them.
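Before any hand curation, a simple geometric check can at least suggest which overlapping polygons might be children of a given parent. The sketch below uses a standard ray-casting point-in-polygon test and treats a polygon as a candidate child when all of its vertices fall inside the parent. This is an assumption-laden heuristic, not SafeGraph's method: for a concave parent, all vertices lying inside does not strictly guarantee full containment, and points exactly on the boundary are not handled specially.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if `point` lies strictly inside `polygon`.

    `polygon` is a list of (x, y) vertices, closed implicitly.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def is_candidate_child(child, parent):
    """Heuristic: every vertex of `child` falls inside `parent`."""
    return all(point_in_polygon(v, parent) for v in child)

# Hypothetical campus parent polygon and two nearby footprints.
campus = [(0, 0), (100, 0), (100, 80), (0, 80)]
library = [(10, 10), (30, 10), (30, 25), (10, 25)]
neighbor = [(90, 70), (120, 70), (120, 90), (90, 90)]

library_is_child = is_candidate_child(library, campus)
neighbor_is_child = is_candidate_child(neighbor, campus)
```

A partially overlapping footprint such as `neighbor` fails the check, which is exactly the kind of ambiguous case that benefits from the category and brand-size cues described above.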
Attributing polygon data with spatial hierarchy metadata is an important part of our dataset-building process, because overlapping polygons without that metadata (such as parent-child relationships) can cause certain problems. One is that, if all polygons are treated equally, it becomes difficult to tell where one is relative to another. Buildings with multiple floors are an especially problematic example, since child POIs may not all be on the same level and may instead be above or below each other.
Another problem arises when multiple data sources present similar but slightly different polygons. A lack of spatial hierarchy metadata here makes it challenging to determine if these polygons refer to the same place, or to multiple different places at or near the same location. And in the latter case, there is again ambiguity as to what spatial relationship these places have with each other, if any.
The major issue this all can lead to is trouble with correctly attributing visits and foot traffic to multiple points of interest in close proximity. This is especially true for child POIs that do not have precise polygons because they are enclosed by their parent structure. Without spatial hierarchy metadata, it becomes ambiguous whether footfall or visits correspond only to a parent POI, or to one or more of its child POIs (and which ones) as well. Improper attribution can lead to erroneous conclusions and, ultimately, bad organizational decisions.
SafeGraph’s Geometry dataset of polygon data includes additional metadata that helps to better explain places’ sizes, shapes, and relationships to surrounding points of interest. That includes attributes we collectively refer to as “spatial hierarchy”, such as if: