At SafeGraph, we rely on Apache Spark, one of the most widely used large-scale data processing frameworks, to generate our global points-of-interest (POI) dataset, which includes detailed attributes such as brand affiliation, advanced category tagging, and open hours. We run hundreds to thousands of Spark applications each day for data transformation, machine learning model inference, and operational tasks.
Keeping these Spark applications reliable and efficient, while preserving the iteration speed of the engineers who author them, is a major challenge for our platform engineering team. Choosing the right Spark service provider gives us a strong foundation for addressing these issues. In a recent blog post coauthored by Nan Zhu, Tech Lead Manager of the Platform Engineering team at SafeGraph, and Dave Thibault, Sr. Solution Architect at AWS, we shared our journey of building our latest Spark platform on top of AWS EMR on EKS. The new platform gives us a robust and efficient foundation that meets our needs, and it even led to a 50% reduction in costs compared to our previous managed Spark service vendor.