Outliers in datasets are controversial. Are they bad, or are they trying to tell you something? Is an outlier an artifact (sometimes called “contamination”) or an important anomaly? Should you ignore outliers or make them the focus of your analysis?
Outliers are particularly evident (and problematic) when visualized on maps. In this blog post, I show real-world examples of how outliers make map visualization complicated, describe a simple method to filter outliers, and provide a Google CoLab Jupyter Notebook with sample SafeGraph data so you can replicate and play with the data yourself.
As an example, I want to visualize how visitor dwell time at McDonald’s restaurants varies by geographic location. I can use SafeGraph Core Places to get the centroid location of each McDonald’s, and SafeGraph Patterns data (specifically the column median_dwell) to get the median visit duration for each point-of-interest (POI).
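To make the data preparation concrete, here is a minimal sketch of joining the two tables, using only the Python standard library. The rows are made up, and the names `poi_id`, `brand`, `latitude`, and `longitude` are hypothetical stand-ins rather than the actual SafeGraph schema; only `median_dwell` is a column named in this post.

```python
# Made-up rows standing in for Core Places (one row per POI, with a centroid).
# NOTE: poi_id, brand, latitude, and longitude are hypothetical column names.
core_places = [
    {"poi_id": "a1", "brand": "McDonald's", "latitude": 34.05, "longitude": -118.24},
    {"poi_id": "b2", "brand": "McDonald's", "latitude": 37.77, "longitude": -122.42},
    {"poi_id": "c3", "brand": "Other Brand", "latitude": 32.72, "longitude": -117.16},
]

# Made-up rows standing in for Patterns (one row per POI, with median_dwell).
patterns = [
    {"poi_id": "a1", "median_dwell": 9.0},
    {"poi_id": "b2", "median_dwell": 12.0},
    {"poi_id": "c3", "median_dwell": 30.0},
]


def join_dwell_to_locations(places, pattern_rows, brand):
    """Attach median_dwell to each place of the given brand (inner join on poi_id)."""
    dwell_by_id = {row["poi_id"]: row["median_dwell"] for row in pattern_rows}
    return [
        {**place, "median_dwell": dwell_by_id[place["poi_id"]]}
        for place in places
        if place["brand"] == brand and place["poi_id"] in dwell_by_id
    ]


mcdonalds = join_dwell_to_locations(core_places, patterns, "McDonald's")
for poi in mcdonalds:
    print(poi["latitude"], poi["longitude"], poi["median_dwell"])
```

Each resulting row carries both a location and a dwell time, which is all a map layer needs.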
Clearly, there is something unusual about a single POI at the southern tip of California: its median_dwell is 600x larger than that of the average McDonald's and 6x larger than the next-largest median_dwell.
The goal of my map is to illustrate the variance in dwell times for all McDonald’s across the state. Whether we think the extreme POI is interesting or an artifact, this one extreme value is very distracting on my map because it is off-scale from the rest of the dataset.
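A small sketch shows why a single extreme value is so distracting. It assumes the map uses a linear (min-max) color scale, which may not match the actual map in the post, and the dwell times are made up for illustration.

```python
def minmax_scale(values):
    """Linearly rescale values to the range 0..1, as a linear color scale would."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up dwell times in minutes; the last value plays the role of the extreme POI.
dwell = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 6000.0]

with_outlier = minmax_scale(dwell)
without_outlier = minmax_scale(dwell[:-1])

# With the outlier, every typical POI is squeezed into the bottom of the scale.
print("largest typical value, outlier included:", max(with_outlier[:-1]))
# Without it, the same POIs spread across the full 0..1 range.
print("largest typical value, outlier excluded:", max(without_outlier))
```

With the extreme value included, all the typical POIs land on nearly the same color, so the variation the map is meant to show disappears.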
We could cherry-pick it and drop this one POI from the dataset, but instead, I prefer a simple and generic method for filtering out extreme values.
There is no one correct way to “handle” outliers, and sometimes outliers shouldn’t be “handled” at all.
Warning! You should always look at your data; do not blindly filter “outliers” without consideration.
Here we implement the interquartile range (IQR) based method as originally formulated by John Tukey. Note: This is one of the most common definitions for the “whiskers” on a box-and-whiskers plot.
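The IQR is the difference between the upper and lower quartiles: IQR = Upper_Quartile - Lower_Quartile. Typically, the Upper Extreme is defined as Upper_Quartile + k * IQR and the Lower Extreme as Lower_Quartile - k * IQR. Values outside these two extremes are treated as outliers.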
The standard is to use k = 1.5. But for the purposes of visualization, you should use whatever works.
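The rule above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code from the notebook: the function names and sample data are made up, and the quartile interpolation (`method="inclusive"`) is one of several reasonable choices that can give slightly different fences.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) extremes: Q1 - k * IQR and Q3 + k * IQR."""
    # quantiles(..., n=4) returns the three quartile cut points.
    lower_quartile, _median, upper_quartile = quantiles(values, n=4, method="inclusive")
    iqr = upper_quartile - lower_quartile
    return lower_quartile - k * iqr, upper_quartile + k * iqr


def filter_outliers(rows, key, k=1.5):
    """Keep only the rows whose value for `key` lies within the Tukey fences."""
    lower, upper = tukey_fences([row[key] for row in rows], k)
    return [row for row in rows if lower <= row[key] <= upper]


# Made-up dwell times in minutes; the last POI is deliberately extreme.
pois = [{"median_dwell": v} for v in [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 6000.0]]

lower, upper = tukey_fences([p["median_dwell"] for p in pois])
kept = filter_outliers(pois, "median_dwell")
print(f"fences: {lower} to {upper}; kept {len(kept)} of {len(pois)} POIs")
```

Raising k widens the fences and keeps more points; lowering it filters more aggressively, which is the knob to turn when the goal is a readable map.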