If you’ve heard of SafeGraph, you’re most likely familiar with our foot traffic insights and data for COVID-19 response. These products are highly valuable and deserve the spotlight, but what if I told you that our Geometry data is the crux that makes it all possible? You know - the shapes depicting the places we care so much about? Without these obscure little polygons, SafeGraph products would not hold the value they do today. Below, we’ll discuss where these polygons come from and why they’re so useful.
Unfortunately, geometry data doesn't grow on trees - nor does it grow on maps, apps, or S3 buckets for that matter, and this makes our mission to provide a “best fitting” polygon for each record in Core Places an incredibly tall order. To further complicate matters, the definition of a “best fitting” polygon varies by POI type and can range from a building footprint (or even a slice of a building footprint) to a massive shape containing parking lots, land, and several buildings within its bounds - like a college campus for example. We obviously don’t tackle all of this alone (we’re ambitious - not crazy) and are fortunate to be in business with some amazing partners who specialize in curating geometry data of varying criteria. In most cases, we prefer to have polygons extracted from aerial imagery using the latest methods in object recognition and AI. We recognize that this is the future of geometry data sourcing and has the best chance of scaling rapidly. In other cases, and especially for places with complex requirements, we prefer to have polygons hand drawn. This is still the most sure-fire way to source an accurate polygon, and that fact is unlikely to change in the short term.
For some, geometry data is already useful in raw form. Polygons offer a robust visual representation of places and can aid in use cases ranging from square footage calculations to site selection. But for others, geometry (and the metadata inferred from it) really proves its value when used to derive additional geospatial products. Case in point, the ML algorithm responsible for building SafeGraph Patterns relies heavily on geometry metadata to intelligently attribute visits to POIs (we eat our own dog food). Without this metadata, Patterns would fail to account for the complex spatial relationships that persist in the real world, and we would struggle to predict foot traffic in many scenarios. Externally, our customers also use SafeGraph Geometry data as a “blueprint” of sorts to derive their own foot traffic insights (we like to share the dog food).
So, in the spirit of transparency, we’d like to walk through the metadata we build into our geometry as well as our best practices for putting that metadata to work.
For every place, we always want to answer three key questions:
1) Does this place encompass other places?
2) Is this place completely enclosed inside of a larger place?
3) How many places belong to this polygon?
Let’s take these one at a time...
The real world is full of places that contain other places, and these relationships exist in many forms. Some places are massive and represented by a broad, expansive boundary, and these places encompass several, if not hundreds, of smaller places within their borders. An outdoor shopping mall, for example, encompasses many POIs within its footprint, and so do hospitals, college campuses, ski resorts, stadiums, casinos, etc. In other cases, a single building may represent the footprint of a POI, but it still might contain other POIs within. A Walmart containing a Subway is a canonical example of this, and we are also interested in understanding these relationships.
In any case, we identify spatial relationships (what we refer to as “spatial hierarchy”) by measuring polygon overlap. For each pair of overlapping polygons, if the larger polygon contains at least 80% of the smaller polygon, and if the larger polygon is also of a particular POI category, then we mark it as the “parent” of the smaller polygon. It’s important to restrict parent POI candidates to a specific set of categories or brands so that we’re not solely reliant on polygon precision to determine spatial hierarchy. For example, we want airports to be parents when overlapping other POIs, but we generally don’t want cafes to be parents if overlap exists and the cafe happens to be the larger of the two polygons. See our Places Manual for a complete list of POI categories that are eligible parents. We flag these relationships in our geometry data by setting the “parent_placekey” of the smaller POI equal to the “placekey” of the larger, encompassing POI. We colloquially refer to the larger, containing POI as the "parent" and the smaller POI as the "child."
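As a rough illustration of the rule above (not the production pipeline), here is a minimal Python sketch. It makes several simplifying assumptions: axis-aligned rectangles stand in for real polygons, the placekey strings are made-up placeholders, and the ELIGIBLE_PARENT_CATEGORIES set is hypothetical (the real list lives in the Places Manual).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of parent-eligible categories, for illustration only.
ELIGIBLE_PARENT_CATEGORIES = {"Airport", "Shopping Mall"}
OVERLAP_THRESHOLD = 0.8  # the 80% containment rule described above


@dataclass
class Poi:
    placekey: str  # placeholder string, not a real Placekey
    category: str
    # Axis-aligned rectangle (min_x, min_y, max_x, max_y) standing in for a polygon.
    box: tuple
    parent_placekey: Optional[str] = None


def area(box):
    min_x, min_y, max_x, max_y = box
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def intersection_area(a, b):
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def assign_parent(a, b):
    """Set parent_placekey on the smaller POI if the larger one qualifies."""
    larger, smaller = (a, b) if area(a.box) >= area(b.box) else (b, a)
    if area(smaller.box) == 0:
        return
    # Share of the smaller shape that falls inside the larger shape.
    contained = intersection_area(larger.box, smaller.box) / area(smaller.box)
    if contained >= OVERLAP_THRESHOLD and larger.category in ELIGIBLE_PARENT_CATEGORIES:
        smaller.parent_placekey = larger.placekey


airport = Poi("pk-airport", "Airport", (0, 0, 100, 100))
coffee = Poi("pk-coffee", "Cafe", (10, 10, 20, 20))
bakery = Poi("pk-bakery", "Bakery", (12, 12, 15, 15))

assign_parent(airport, coffee)  # airport is larger and an eligible category
assign_parent(coffee, bakery)   # cafe is larger but not an eligible parent

print(coffee.parent_placekey)  # pk-airport
print(bakery.parent_placekey)  # None
```

The category check is what keeps the cafe from becoming a parent even though it fully contains the bakery's shape. A real implementation would use true polygon geometry from a geospatial library rather than rectangles.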
Now that we have an understanding of spatial hierarchy, we can implement some rules that consistently attribute visits where spatial hierarchy exists. Since some of our customers leverage only a subset of places, we attribute visits to both the parent POI and its children when building SafeGraph Patterns. This means that the “raw_visit_counts” at a parent POI = SUM(raw_visit_counts) at all child POIs + visits to the parent independent of its children. Therefore, if you count visits at the parent and then again for all children, you are double counting visits.
In many cases, the raw_visit_counts at the parent will equal the SUM(raw_visit_counts) for its children. This is common at shopping center POIs because there are few areas within the shopping center that a visitor can occupy without being inside a child POI. On the other hand, a parent POI like a golf course will likely have more visits than the sum of its children because there are plenty of areas within the golf course to visit without visiting a child POI (ex: playing 18 holes on the course but never stopping at the pro shop or restaurant).
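To make the double-counting point concrete, here is a small Python sketch over made-up rows. The column names placekey, parent_placekey, and raw_visit_counts come from the text above; the placekey values, the visit numbers, and both helper functions are hypothetical. It assumes a parent's count already includes all of its children, as described.

```python
from collections import defaultdict

# Made-up rows for illustration; the numbers are not real data.
rows = [
    {"placekey": "pk-golf", "parent_placekey": None, "raw_visit_counts": 500},
    {"placekey": "pk-proshop", "parent_placekey": "pk-golf", "raw_visit_counts": 120},
    {"placekey": "pk-grill", "parent_placekey": "pk-golf", "raw_visit_counts": 80},
    {"placekey": "pk-cafe", "parent_placekey": None, "raw_visit_counts": 60},
]


def total_visits_without_double_counting(rows):
    """Skip any child whose parent is also in the data set."""
    present = {r["placekey"] for r in rows}
    return sum(
        r["raw_visit_counts"]
        for r in rows
        if r["parent_placekey"] is None or r["parent_placekey"] not in present
    )


def parent_only_visits(rows):
    """Visits to each parent that were not attributed to any of its children."""
    child_totals = defaultdict(int)
    for r in rows:
        if r["parent_placekey"] is not None:
            child_totals[r["parent_placekey"]] += r["raw_visit_counts"]
    return {
        r["placekey"]: r["raw_visit_counts"] - child_totals[r["placekey"]]
        for r in rows
        if r["placekey"] in child_totals
    }


naive_total = sum(r["raw_visit_counts"] for r in rows)
print(naive_total)                                 # 760 (children counted twice)
print(total_visits_without_double_counting(rows))  # 560
print(parent_only_visits(rows))                    # {'pk-golf': 300}
```

Children are still counted when their parent is absent from the subset you are working with, which matters if you only license part of the places data.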
Within spatial hierarchy, we are interested in further classifying parent/child relationships. In general, we want to know when a parent POI encompasses its children completely indoors vs. on open-air grounds. For example, a ski resort boundary may enclose a restaurant midway up the mountain, but the ski resort boundary itself is not an indoor enclosing structure. On the other hand, an airport containing a Starbucks completely encloses that Starbucks indoors. As a general guideline, if you must enter another structure to arrive at a POI, we want to be aware of that fact, and we set the “enclosed” column in our geometry data to “true” wherever that is the case.
Similar to determining eligible parent POIs, we rely on categories to distinguish enclosing vs. non-enclosing spatial hierarchy relationships. See the enclosed section of our Places Manual for a complete breakdown of the spatial hierarchy relationships we treat as “enclosing.”
Although simple, the enclosed column is super important fuel for our visits algorithm. GPS drift within major structures makes it hard to pinpoint which enclosed POI a device is actually in, so when enclosed = “true,” we exclude visits to that POI and only attribute visits to the parent POI. We pride ourselves on archiving facts, so we would rather not attribute visits at all than make rash assumptions.
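The attribution rules described so far can be summarized in a few lines of Python. This is only a sketch of the stated rules, not the visits algorithm itself (which is an ML model that weighs many features). The function name, the placekey values, and the use of a Python boolean for the enclosed column are all assumptions made for illustration.

```python
# Made-up POIs keyed by placeholder placekey.
pois = {
    "pk-airport": {"parent_placekey": None, "enclosed": False},
    "pk-starbucks": {"parent_placekey": "pk-airport", "enclosed": True},
    "pk-resort": {"parent_placekey": None, "enclosed": False},
    "pk-lodge": {"parent_placekey": "pk-resort", "enclosed": False},
}


def pois_credited_for_visit(placekey, pois):
    """Return the placekeys that would receive credit for one candidate visit."""
    poi = pois[placekey]
    parent = poi["parent_placekey"]
    if poi["enclosed"] and parent is not None:
        # Enclosed children receive no visits; only the parent is credited.
        return [parent]
    if parent is not None:
        # Non-enclosed children and their parent are both credited.
        return [placekey, parent]
    return [placekey]


print(pois_credited_for_visit("pk-starbucks", pois))  # ['pk-airport']
print(pois_credited_for_visit("pk-lodge", pois))      # ['pk-lodge', 'pk-resort']
print(pois_credited_for_visit("pk-airport", pois))    # ['pk-airport']
```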
It’s important to distinguish when geometry data reflects the shape and size of a POI’s real world footprint and when it does not. In most cases, each polygon represents the unique footprint of a single POI, but in some cases, a precise polygon for a POI does not exist (or is not discernible through our sourcing methods), so the only polygon available may be too large and could represent several POIs.
When a polygon reflects the true shape and size of a unique POI, we give it an “OWNED_POLYGON” value in the “polygon_class” column. This means the polygon represents that unique POI, but there could be child POIs within its borders attached to the same polygon. In other words, if a single POI maps to a distinct polygon (excluding that POI's children), then polygon_class = “OWNED_POLYGON”; otherwise, polygon_class = “SHARED_POLYGON.”
We exclude children from influencing their parent POI's polygon_class because there are cases where a unique polygon does not exist for each child POI, and the child POIs most likely share the same polygon as their parent. In these cases, it does not mean that the polygon is a bad representation of the parent itself. A canonical example of this is a Nike store inside of a shopping mall. If we don't have a good polygon for the Nike store, then the Nike store likely shares the same polygon as the mall. Despite the fact that multiple POIs are attached to this polygon, the polygon is still representative of the mall's shape and size, so the polygon_class for the mall POI = “OWNED_POLYGON” and the polygon_class for the Nike store POI = “SHARED_POLYGON.” Read more about polygon_class in the Places Manual.
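Here is a minimal Python sketch of that classification rule, using the mall and Nike store example. The polygon_id field is a hypothetical identifier introduced only so that POIs sharing a shape can be grouped; the placekey values and the function name are likewise made up.

```python
from collections import defaultdict

# Made-up POIs; "polygon_id" is a hypothetical key for grouping shared shapes.
pois = [
    {"placekey": "pk-mall", "parent_placekey": None, "polygon_id": "poly-1"},
    {"placekey": "pk-nike", "parent_placekey": "pk-mall", "polygon_id": "poly-1"},
    {"placekey": "pk-cafe", "parent_placekey": None, "polygon_id": "poly-2"},
]


def classify_polygons(pois):
    """Return {placekey: polygon_class} under the rule described above."""
    by_polygon = defaultdict(list)
    for poi in pois:
        by_polygon[poi["polygon_id"]].append(poi)

    classes = {}
    for poi in pois:
        # Other POIs on the same polygon, ignoring this POI's own children.
        others = [
            o for o in by_polygon[poi["polygon_id"]]
            if o["placekey"] != poi["placekey"]
            and o["parent_placekey"] != poi["placekey"]
        ]
        classes[poi["placekey"]] = "SHARED_POLYGON" if others else "OWNED_POLYGON"
    return classes


print(classify_polygons(pois))
# {'pk-mall': 'OWNED_POLYGON', 'pk-nike': 'SHARED_POLYGON', 'pk-cafe': 'OWNED_POLYGON'}
```

The mall stays OWNED_POLYGON because the only other POI on its polygon is its own child, while the Nike store is SHARED_POLYGON because the mall on that polygon is not its child.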
When computing visits, the treatment of OWNED vs. SHARED POIs is ultimately a judgement call. Our visits algorithm considers a multitude of features to determine visits to a POI (more on that here), and both the shape and size of the polygon as well as distance to centroid (latitude/longitude) are considered. When polygon_class = “SHARED_POLYGON” we rely more heavily on proximity to centroid and have found that this still produces accurate results.
At SafeGraph, we focus on a deep understanding of the physical landscape before determining POI visits, and we hope to share this context with our partners who set out to do the same.
What details are we missing? What are we getting wrong? What other metadata would be useful for you?
We’d love to hear your thoughts. Check out our docs site to learn more.