De-clutter Your Maps With Simple Outlier Filtering

December 11, 2019
by Ryan Fox Squire

Outliers in datasets are controversial. Are they bad, or are they trying to tell you something? Is an outlier an artifact (sometimes called “contamination”) or an important anomaly? Should you ignore outliers or make them the target of your focus?

Outliers are particularly evident (and problematic) when visualized on maps. In this blog post, I show real-world examples of how outliers complicate map visualization, describe a simple method to filter outliers, and provide a Google CoLab Jupyter Notebook with sample SafeGraph data so you can replicate the analysis and play with the data yourself.

Sometimes extreme values (“outliers”) are distracting on a map

What the heck is going on with that one data point in the southern tip of California?

I want to visualize how visitor dwell time varies by geographic location. I can use SafeGraph Core Places to get centroid locations of each McDonald’s and SafeGraph Patterns data (specifically the column median_dwell) for the median visit duration for each point-of-interest (POI).
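As a rough sketch of that join, here is how the two tables could be combined with pandas. The column names (safegraph_place_id, brands, latitude, longitude, median_dwell) are my assumptions about the schema, the helper build_dwell_table is hypothetical, and the rows are made-up toy values rather than real SafeGraph data.

```python
import pandas as pd


def build_dwell_table(core_places, patterns, brand="McDonald's"):
    """Return one row per POI of `brand` with its centroid and median dwell.

    Column names are assumed, not taken from the original post.
    """
    is_brand = core_places["brands"] == brand
    centroids = core_places.loc[
        is_brand, ["safegraph_place_id", "latitude", "longitude"]
    ]
    dwell = patterns[["safegraph_place_id", "median_dwell"]]
    # Inner join: keep only POIs that appear in both tables.
    return centroids.merge(dwell, on="safegraph_place_id", how="inner")


# Made-up toy rows, for illustration only.
core_places = pd.DataFrame({
    "safegraph_place_id": ["sg:a", "sg:b", "sg:c"],
    "brands": ["McDonald's", "McDonald's", "Other Brand"],
    "latitude": [34.05, 37.77, 32.71],
    "longitude": [-118.24, -122.42, -117.16],
})
patterns = pd.DataFrame({
    "safegraph_place_id": ["sg:a", "sg:b", "sg:c"],
    "median_dwell": [10.0, 12.0, 30.0],
})

dwell_table = build_dwell_table(core_places, patterns)
print(dwell_table)
```

The resulting table has one row per POI, which is the shape most mapping libraries expect: a latitude, a longitude, and a value to drive the color scale.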

Clearly there is something unusual about a single POI in the southern tip of California that has a median_dwell 600x larger than the average McDonald's and 6x larger than the next largest median_dwell.

The goal of my map is to illustrate the variance in dwell times for all McDonald’s across the state. Whether we think the extreme POI is interesting or an artifact, this one extreme value is very distracting on my map because it is off-scale from the rest of the dataset.

We could cherry-pick it and drop this one POI from the dataset, but instead, I prefer a simple and generic method for filtering out extreme values.

A Simple Method to Filter Outliers in SafeGraph Patterns

There is no one correct way to “handle” outliers, and sometimes outliers shouldn’t be “handled” at all.

Warning! You should always look at your data; do not blindly filter “outliers” without consideration.

John Tukey’s IQR method for defining outliers.

Here we implement the interquartile range (IQR) based method as originally formulated by John Tukey [1] [2]. Note: This is one of the most common definitions for the “whiskers” on a box-and-whiskers plot.

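Typically the Upper Extreme is defined as Upper_Quartile + k * IQR and the Lower Extreme as Lower_Quartile - k * IQR, where IQR = Upper_Quartile - Lower_Quartile. Any value above the Upper Extreme or below the Lower Extreme is treated as an outlier.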

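The standard choice is k = 1.5. But for the purposes of visualization, you can tune k to whatever works for your map: a larger k keeps more of the extreme values, a smaller k filters more aggressively.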
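To make the definition concrete, here is a minimal sketch of the IQR filter using only the Python standard library. This is not the notebook's code. The function names tukey_fences and filter_outliers are hypothetical, the dwell values are made up, and the quartiles use the "inclusive" (linear interpolation) convention; other quartile conventions give slightly different fences.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower_extreme, upper_extreme) for the given values."""
    # quantiles(..., n=4) returns the three quartile cut points.
    lower_quartile, _median, upper_quartile = quantiles(
        values, n=4, method="inclusive"
    )
    iqr = upper_quartile - lower_quartile
    return lower_quartile - k * iqr, upper_quartile + k * iqr


def filter_outliers(values, k=1.5):
    """Keep only the values that fall inside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if lower <= v <= upper]


# Made-up dwell times with one extreme value, for illustration only.
median_dwell = [8, 9, 10, 11, 12, 13, 14, 6000]

print(tukey_fences(median_dwell))     # (4.5, 18.5)
print(filter_outliers(median_dwell))  # [8, 9, 10, 11, 12, 13, 14]
```

In this toy example the quartiles are 9.75 and 13.25, so the IQR is 3.5 and the fences sit at 4.5 and 18.5. The single value of 6000 falls far outside them and is dropped, while the rest of the distribution is untouched.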

See the full post on Google CoLab

To read the full post, see the results, and play with the code yourself, click here!

Want to see a different question answered with SafeGraph data?

Please send us your ideas, feedback, bug discoveries, and suggestions to [email protected]

Ryan Fox Squire
Product & Data Science