SafeGraph is a geospatial data company that curates high-precision data on millions of places around the globe. Our datasets provide detailed, accurate, and up-to-date information on points of interest and how people interact with those locations.
To scale SafeGraph's Data as a Service (DaaS) offering to a rapidly growing user base, we have built a platform engineering team that serves as an enabler for other teams. In this blog post, we share our vision of platform engineering, how we implement that vision at SafeGraph, and how it enables SafeGraph's product development teams to deliver products more easily.
"Why do we need a platform engineering team?" This is not an easy question to answer, especially for startups like SafeGraph. There is no doubt that giant companies like Google, Facebook, Netflix, and Uber need platform teams to build and operate their own infra stacks. They face challenges from massive traffic volume, complex use cases, and so on. These situations do not seem to exist for smaller startups. Cloud vendors and open source technologies seem to have eliminated most, if not all, of the need for startups to build an infra stack from scratch. However, as indicated in some discussions, we still need platform engineering to connect pieces of solutions from vendors into a complete cloud-based infra system for the company.
Besides that, here are some of the other challenges we faced at SafeGraph and were able to solve with platform engineering:
The role of the platform team, serving as the domain expert, is to unblock the product development teams from immediate issues with development tools and infrastructure. The platform team is also responsible for building the long-term infrastructure solution that supports the fast growth of the engineering teams, which usually follows from the rapid growth of the company's business. Additionally, the platform team should facilitate the building of an engineering culture in the company by lowering overhead and minimizing interruptions.
In this section, we introduce how we fulfill the aforementioned missions of platform engineering in SafeGraph.
Unblocking other teams from immediate issues in infra and development tools is one of the most critical tasks for platform engineers. Many startups build their infrastructure on top of open source technologies like Kubernetes, Apache Spark, etc. Open source does not mean free: it sometimes comes with a high operational cost or even becomes the bottleneck of your product development pipeline. The platform engineering team should be the domain expert on these open source systems. Success in this role has a twofold meaning for both the platform team and the whole company:
One example where the platform team at SafeGraph helped with an immediate product development issue was improving the performance of our Spark-based data processing stack. We had a Spark job that could run for as long as 24 hours only to fail, and it did so frequently. The failing or hanging Spark job had a cascading effect on all downstream consumers. The strangest thing about the job was that it did not process a huge amount of data, yet it still hung for a long time. The team owning the job had tried many solutions but still could not resolve the issue. The platform team jumped in and found it to be a very tricky case requiring a solid understanding of the internals of both Spark and the Scala language. The team found that:
We optimized the job by batching multiple jobs submitted from multiple threads, relieving the pressure on Spark's DAGScheduler. Furthermore, we avoided using the default fork join pool in the code so that we could skip the hash code calculation. With these optimizations, we shortened the running time of the job from 24 hours to only 3 hours and met the expected reliability SLA.
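The batching idea can be illustrated independently of Spark. The Python sketch below is only an illustration under stated assumptions, not the actual Scala job: it groups many small work items into a few batches and submits them through a dedicated, explicitly sized thread pool rather than a shared default pool, so the scheduler receives far fewer submissions. The names `chunked`, `run_in_batches`, and `submit_batch`, as well as the batch size, are hypothetical and chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_in_batches(
    items: Iterable[T],
    submit_batch: Callable[[List[T]], R],
    batch_size: int,
    max_workers: int = 4,
) -> List[R]:
    """Submit one job per batch instead of one job per item."""
    # A dedicated pool with an explicit size, rather than a shared default pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        return list(pool.map(submit_batch, chunked(items, batch_size)))


submissions: List[int] = []


def submit_batch(batch: List[int]) -> int:
    # Hypothetical stand-in for a single Spark action covering the whole batch.
    submissions.append(len(batch))
    return sum(batch)


# Ten items become three submissions instead of ten.
totals = run_in_batches(range(10), submit_batch, batch_size=4, max_workers=2)
```

The design choice worth noting is that the number of submissions now scales with the number of batches, not the number of items, and the pool size is a deliberate setting instead of an inherited default.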
We have many other examples of how the platform team clears up immediate and unexpected issues and unblocks product developers (stabilizing CI/CD systems so teams have a reliable test infrastructure, nailing down various Kubernetes issues for service owners, etc.). With this practice, we keep our product development teams focused on the business and also lay a strong foundation for the platform team to seek further collaboration in implementing our platform vision.
The other important mission for the platform engineering team is to build the long-term infrastructure solution that supports the growth of the company. For a startup, the most common way to "build infrastructure" is to introduce the right technology for a certain purpose, such as managing configurations of services and data jobs, easing the process of deploying services, etc.
“Introducing a new technology” also brings challenges:
The platform engineering team should build a sustainable and efficient infrastructure solution by:
Both of the above items can be achieved through the art of abstraction in infrastructure.
One example of where we fought extra complexity and smoothed adoption with infrastructure abstraction is our machine learning (ML) model management, deployment, and versioning system. We have built various ML models that help us deliver high quality data products. As the customer base grows and user requirements become more complex, we have more and more ML models, which makes managing them neatly a challenging requirement.
MLflow was the most promising technology on our radar for this purpose, although introducing MLflow to SafeGraph comes with a high cost. We would need to add config files for every ML project, change workflows by adding manual steps such as running commands before committing a model, etc.
One example of the low ROI brought by the change is what we need to do to show the git commit hash in the MLflow Run UI. To leverage the built-in functionality in MLflow for this purpose, our ML engineers have to do at least two things:
However, all we need is to show some information in a UI. We face similar challenges with showing the version of the training data in the MLflow UI, where every engineer "MUST" remember to log such a parameter with the MLflow API, and the same step is repeated across projects. In general, we realized that MLflow is a powerful MLOps tool, but using it correctly imposes too many distractions on our ML engineers, whose major mission is to use state-of-the-art ML techniques to improve our data product quality.
Instead of adopting MLflow by simply following the official docs or what our vendors say, we built an internal library on top of MLflow that exposes a set of APIs for logging parameters and metrics, uploading models, integrating with Git, etc. Additionally, we automatically log info like the artifact version and the versions of data read and written. These APIs complete those manual steps for ML engineers, who can now focus on their business with the blessing of state-of-the-art MLOps technology and zero change to their workflow.
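A minimal sketch of what such a wrapper could look like is shown below. It is an illustration under assumptions, not our internal library: `tracked_run`, `current_git_commit`, and `RecordingBackend` are hypothetical names, and the `backend` argument is assumed to expose `start_run`, `set_tag`, `log_params`, and `end_run` the way the `mlflow` module does. Passing the backend in keeps the example runnable without an MLflow installation. The tag key `mlflow.source.git.commit` is assumed here to be the system tag MLflow uses for the source commit.

```python
import subprocess
from contextlib import contextmanager
from typing import Any, Dict, Iterator, Optional


def current_git_commit() -> str:
    """Return the HEAD commit hash, or "unknown" outside a Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


@contextmanager
def tracked_run(
    backend: Any,
    name: str,
    data_versions: Dict[str, str],
    git_commit: Optional[str] = None,
) -> Iterator[Any]:
    """Open a run and log the bookkeeping every project would otherwise repeat."""
    backend.start_run(run_name=name)
    try:
        # Assumption: this tag key is what the tracking UI reads for the commit.
        backend.set_tag("mlflow.source.git.commit", git_commit or current_git_commit())
        # Data versions are logged automatically so nobody has to remember to.
        backend.log_params({f"data_version.{k}": v for k, v in data_versions.items()})
        yield backend
    finally:
        # The run is always closed, even if the training code raises.
        backend.end_run()


class RecordingBackend:
    """In-memory stand-in for the `mlflow` module, for demonstration only."""

    def __init__(self) -> None:
        self.tags: Dict[str, str] = {}
        self.params: Dict[str, str] = {}
        self.metrics: Dict[str, float] = {}
        self.active = False

    def start_run(self, run_name: Optional[str] = None) -> None:
        self.active = True

    def end_run(self) -> None:
        self.active = False

    def set_tag(self, key: str, value: str) -> None:
        self.tags[key] = value

    def log_params(self, params: Dict[str, str]) -> None:
        self.params.update(params)

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


backend = RecordingBackend()
with tracked_run(backend, "demo", {"places": "v12"}, git_commit="abc123") as run:
    run.log_metric("auc", 0.9)  # the only line an ML engineer writes by hand
```

With a wrapper of this shape, the commit hash and data versions are recorded as a side effect of opening a run, so the engineer's own code only logs what is specific to the experiment.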
Another benefit of abstraction is that it keeps SafeGraph flexible when the marketplace for a technical solution is still uncertain.
Data lake formats often confuse users and force difficult decisions. When we started the project to build SafeGraph's data lake, there were multiple choices in the marketplace, like Delta Lake, Apache Iceberg, and Apache Hudi, that could serve as the foundation of our data lake and provide our most desired functionalities, like data versioning and time travel. In the end we had to choose between Delta Lake and Apache Iceberg, and it turned out to be a difficult decision for the following reasons:
To resolve the dilemma, we built an internal library that provides APIs for the common data lake operations (reading and writing versioned datasets, showing dataset history, etc.). While the implementation of these operations is based on Delta Lake or Apache Iceberg, engineers in our product development teams do not need to care which one is actually used. Because the underlying file format is transparent to callers, we expect zero code changes in product code even if we switch to another format in the future.
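The shape of such an abstraction can be sketched as follows. This is a simplified illustration, not our library: `VersionedStore`, `InMemoryStore`, and `open_store` are hypothetical names, and the in-memory backend merely stands in for an implementation that would delegate to Delta Lake or Apache Iceberg.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class VersionedStore(ABC):
    """The only operations product code is allowed to depend on."""

    @abstractmethod
    def write(self, dataset: str, rows: List[Any]) -> int:
        """Store a new version and return its version number."""

    @abstractmethod
    def read(self, dataset: str, version: Optional[int] = None) -> List[Any]:
        """Return the rows of `version`, or of the latest version if None."""

    @abstractmethod
    def history(self, dataset: str) -> List[int]:
        """Return all version numbers, oldest first."""


class InMemoryStore(VersionedStore):
    """Stand-in backend; a real one would delegate to a data lake format."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[List[Any]]] = {}

    def write(self, dataset: str, rows: List[Any]) -> int:
        versions = self._versions.setdefault(dataset, [])
        versions.append(list(rows))  # copy so later caller mutations do not leak in
        return len(versions) - 1

    def read(self, dataset: str, version: Optional[int] = None) -> List[Any]:
        versions = self._versions[dataset]
        index = len(versions) - 1 if version is None else version
        return list(versions[index])

    def history(self, dataset: str) -> List[int]:
        return list(range(len(self._versions.get(dataset, []))))


# The backend is chosen by configuration; callers never name a format.
_BACKENDS = {"memory": InMemoryStore}


def open_store(backend: str = "memory") -> VersionedStore:
    """Return the configured backend behind the common interface."""
    return _BACKENDS[backend]()


store = open_store()
store.write("places", ["a"])
store.write("places", ["a", "b"])
latest = store.read("places")
first = store.read("places", version=0)
```

Switching formats then means registering a different class in `_BACKENDS`; code written against `VersionedStore` stays the same.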
A well-established and healthy engineering culture, a must-have for a great technical company, does not come free. The cost comes not only from changes in thinking and action, but also from the unavoidable tooling overhead, even when everyone has agreed to embrace the culture.
The platform team serves as an enabler of engineering culture by lowering the cost involved. Taking SafeGraph as an example, we want to build an engineering culture that values the operational excellence of services. Operational excellence comes from comprehensive monitoring and timely alerting, as well as many other facilities that help engineers improve service SLAs and debug issues. All of these should be built as part of platform engineering instead of being pushed onto each product team with the expectation that they squeeze out resources to build their own tools from scratch.
The platform team also has an advantage in promoting the desired culture. The "product" delivered by the platform team is used across teams; when a valued part of the culture leads to success, the broad benefit shared by all teams is easy to observe, and that part of the culture spreads naturally across teams. For instance, the platform team at SafeGraph started building solutions to minimize or eliminate the manual steps in using Terraform, which is well known to be hard to ramp up on. The progressively better user experience is observed across teams, the benefit is shared, and it contributes to our culture, which always seeks to minimize human intervention in any process.
By resolving immediate issues, building long-term infrastructure solutions with low cost, keeping it future-oriented, and also serving as the engineering culture enabler, platform engineering is critical to optimizing an engineering organization for efficiency and sustainability.