Scaling Data As a Service (DaaS) with Platform Engineering

January 20, 2022

Nan Zhu

SafeGraph is a geospatial data company that curates high-precision data on millions of places around the globe. Our datasets provide detailed, accurate, and up-to-date information on points of interest and how people interact with those locations.

To scale SafeGraph’s Data As a Service (DaaS) offering to a rapidly growing user base, we have built a platform engineering team that serves as an enabler for other teams. In this blog post, we share our vision of platform engineering, how we implement that vision at SafeGraph, and how it enables SafeGraph’s product development teams to deliver products more easily.

Platform Engineering in Startups

“Why do we need a platform engineering team?” This is not an easy question to answer, especially for startups like SafeGraph. There is no doubt that giant companies like Google, Facebook, Netflix, and Uber need platform teams to build and operate their own infra stacks. They face challenges from massive traffic volume, the complexity of their use cases, and so on. These situations do not seem to exist for smaller startups. Cloud vendors and open source technologies seem to have eliminated most, if not all, of the need for startups to build an infra stack from scratch. However, as indicated in some discussions [1] [2], we still need platform engineering to connect the pieces of vendor solutions into a complete cloud-based infra system for the company.

Besides that, here are some of the other challenges we faced at SafeGraph and were able to solve with platform engineering:

Our product developers are sometimes blocked by errors or issues in the tools they use (e.g. Apache Spark, Kubernetes), and our vendors usually cannot help, due to a lack of either experience or business domain knowledge;
Like many other companies, we often face friction when introducing new technologies, because they impose changes on working habits;
We also face difficult decisions when choosing among multiple solutions for a given purpose, with concerns like vendor lock-in.

The Role of the Platform Team in Solving Key Challenges

The role of the platform team, serving as the domain expert, is to unblock the product development teams from immediate issues with development tools and infrastructure. The platform team is also responsible for building the long-term infrastructure solution that supports the fast growth of the engineering teams, which the rapid growth of the company’s business usually requires. Additionally, the platform team should facilitate the building of an engineering culture in the company by lowering overhead and minimizing interruptions.

Platform Engineering at SafeGraph

In this section, we describe how we fulfill these missions of platform engineering at SafeGraph.

Tackling Immediate Issues

Unblocking other teams from immediate issues in infra/development tools is one of the most critical tasks for platform engineers. Many startups build their infrastructure on top of open source technologies like Kubernetes and Apache Spark. Open source does not mean free: sometimes it comes with a high operational cost, or even becomes the bottleneck of your product development pipeline. The platform engineering team should be the domain expert in these open source systems. Success in this role has a twofold meaning for both the platform team and the whole company:

Unblocking product development: Delivering a high-quality product to customers in a timely manner is always the top priority for a startup. In-house experts in the technologies used in product development are undoubtedly valuable, especially in the most challenging situations.

Building trust and lowering cost for long-term platform engineering: Platform engineering is a long-term, iterative process. It usually comes with a short-term cost, such as some unavoidable interruption to others’ daily work and a “distraction” of resources. Being able to solve immediate problems not only builds trust between the platform team and other teams but also offsets the cost of platform engineering by saving the engineering resources that would otherwise have been spent resolving the same issues without enough domain expertise.

One example of the platform team at SafeGraph helping with an immediate product development issue comes from our ongoing work to improve the performance of our Spark-based data processing stack. We had a Spark job that could run for as long as 24 hours and then fail, and it did so frequently. The failing or hanging job had a cascading effect on all downstream consumers. The strangest thing about the job was that it did not process a huge amount of data, yet it still hung for a long time. The team owning the job had tried many solutions but could not resolve the issue. The platform team jumped in and found it to be a very tricky case that required a solid understanding of the internals of both Spark and the Scala language. The team found that:

The single-threaded task serialization mechanism in Spark’s DAGScheduler was overwhelmed by our multi-threaded job submission mechanism; as a result, the DAGScheduler could not keep up with the processing speed of the executors, which were left idle.

The Spark job implementation used Scala’s parallel collections, which, with their default fork-join pool, imposed an expensive hash code calculation on one of our classes.

We optimized the job by batching the jobs submitted from multiple threads, which relieved the pressure on Spark’s DAGScheduler. Furthermore, we avoided using the default fork-join pool in the code so that we could skip the hash code calculation. With these optimizations, we shortened the running time of the job from 24 hours to only 3 hours and met the expected reliability SLA.
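
The batching idea can be illustrated outside Spark with a minimal Python sketch. It is not our production code: `run_batch` is a hypothetical stand-in for a single job submission, and the batch size and worker count are arbitrary. The sketch groups many small work items into a few batches and runs them on an explicitly sized thread pool rather than a shared default one, so the scheduler sees a handful of submissions instead of one per item.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def submit_in_batches(
    items: Sequence[T],
    batch_size: int,
    run_batch: Callable[[Sequence[T]], List[R]],
    max_workers: int = 4,
) -> List[R]:
    """Submit one job per batch instead of one job per item.

    run_batch is a hypothetical stand-in for a single job submission.
    The pool is created explicitly, so its size is under our control.
    """
    results: List[R] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input batches.
        for batch_result in pool.map(run_batch, chunked(items, batch_size)):
            results.extend(batch_result)
    return results


def square_all(batch: Sequence[int]) -> List[int]:
    """Toy workload: one 'job' that processes a whole batch."""
    return [x * x for x in batch]


# Ten items become three submissions (4 + 4 + 2) instead of ten.
squares = submit_in_batches(list(range(10)), batch_size=4, run_batch=square_all)
```

The trade-off is the usual one: larger batches mean fewer submissions for the scheduler to handle, at the cost of coarser parallelism.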

We have many other examples of how the platform team can clear up immediate, unexpected issues and unblock product developers (stabilizing CI/CD systems so that teams have a reliable test infrastructure, tracking down various Kubernetes issues for service owners, etc.). With this practice, we keep our product development teams focused on the business and also lay a strong foundation for the platform team to seek further collaboration in implementing our platform vision.

Long-term Infrastructure Solution

The other important mission of the platform engineering team is to build the long-term infrastructure solution that supports the growth of the company. For a startup, the most common way to “build infrastructure” is to introduce the right technology for a certain purpose, such as managing the configurations of services and data jobs or easing the process of deploying services.

“Introducing a new technology” also brings challenges:

A new technology sometimes means a significant change to the existing workflow that other engineers are used to. The conflict between product delivery and the potential long-term benefit of infrastructure is a recurring topic for companies.
Judging whether a technology can serve in the long term is also a difficult task. There are plenty of examples: data warehouses, stream processing, message queues, etc. Each of these areas has a zoo of technologies, which differ from one another and impose a high cost if the user wants to switch to another one because business requirements change.

The platform engineering team should build a sustainable and efficient infrastructure solution by:

Minimizing the overhead and easing the adoption of new technologies
Keeping the company flexible across different solutions with minimal switch-over cost

Both can be achieved through the art of abstraction in infrastructure.

Abstraction for Easier Adoption

One example of where we fight extra complexity and smooth adoption through infrastructure abstraction is our machine learning (ML) model management, deployment, and versioning system. We have built various ML models that help us deliver high-quality data products. As the customer base grows and user requirements become more complex, we have more and more ML models, which makes managing them neatly a challenging requirement.

MLflow is the most promising technology on our radar for this purpose, although introducing MLflow to SafeGraph comes with a high cost. We would need to add config files for every ML project and change workflows by adding manual steps, such as running certain commands before committing a model.

One example of the low ROI of such changes is what we need to do to show the Git commit hash in the MLflow Run UI. To leverage the built-in MLflow functionality for this purpose, our ML engineers have to do at least two things:

Add a project description file in their project directory

Change whatever workflow they are used to (local Python runner, Jupyter, etc.) to run the project through the MLflow command line

However, all we need is to show some information in a UI. We face similar challenges with showing the version of the training data in the MLflow UI: every engineer “MUST” remember to log such a parameter with the MLflow API, and that step is repeated across projects. In general, we realized that MLflow is a powerful MLOps tool, but using it correctly imposes too many distractions on our ML engineers, whose main mission is to use state-of-the-art ML techniques to improve the quality of our data products.

Instead of using MLflow exactly as the official docs or our vendors describe, we built an internal library on top of MLflow that exposes a set of APIs for logging parameters and metrics, uploading models, integrating with Git, etc. Additionally, we automatically log information such as the artifact version and the versions of the data read and written. These APIs take care of the manual steps for ML engineers, who can now focus on their business with the blessing of state-of-the-art MLOps technology and zero change to their workflow.
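
A minimal sketch of what such a wrapper could look like is shown below. It is an illustration under assumptions, not our internal library: `TrackedRun`, `InMemoryBackend`, and the keys `git_commit` and `training_data_version` are hypothetical names. The backend only needs `log_param`, `log_metric`, and `set_tag` methods; in real use it could be the `mlflow` module itself, whose module-level functions of the same names have this shape. An in-memory backend is used here so the sketch runs without a tracking server.

```python
import subprocess
from typing import Any, Callable, Dict


def current_git_commit() -> str:
    """Return the HEAD commit hash, or 'unknown' outside a Git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


class InMemoryBackend:
    """Stand-in for the tracking client, so the sketch needs no server."""

    def __init__(self) -> None:
        self.params: Dict[str, Any] = {}
        self.metrics: Dict[str, float] = {}
        self.tags: Dict[str, Any] = {}

    def log_param(self, key: str, value: Any) -> None:
        self.params[key] = value

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value

    def set_tag(self, key: str, value: Any) -> None:
        self.tags[key] = value


class TrackedRun:
    """Hypothetical wrapper that records bookkeeping info automatically."""

    def __init__(
        self,
        backend: Any,
        data_version: str,
        commit_resolver: Callable[[], str] = current_git_commit,
    ) -> None:
        self._backend = backend
        # Logged once at construction, so engineers never have to remember it.
        backend.set_tag("git_commit", commit_resolver())
        backend.log_param("training_data_version", data_version)

    def log_param(self, key: str, value: Any) -> None:
        self._backend.log_param(key, value)

    def log_metric(self, key: str, value: float) -> None:
        self._backend.log_metric(key, value)


# Engineers only log what is specific to their experiment.
backend = InMemoryBackend()
run = TrackedRun(backend, data_version="2022-01")
run.log_metric("auc", 0.91)
```

Injecting the backend and the commit resolver keeps the wrapper testable and leaves room to swap the tracking tool later without touching project code.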

Abstraction for Flexibility

Another benefit of abstraction is that it keeps SafeGraph flexible when the marketplace for a technical solution is still uncertain.

Data lake formats often confuse users and force difficult decisions. When we started the project to build SafeGraph’s data lake, there were multiple choices in the marketplace, such as Delta Lake, Apache Iceberg, and Apache Hudi, that could serve as the foundation of our data lake and provide the functionality we wanted most, like data versioning and time travel. In the end we had to choose between Delta Lake and Apache Iceberg, and it turned out to be a difficult decision for the following reasons:

Delta Lake is mainly developed by Databricks. While it is open source, Databricks does have a version with many proprietary features in the Databricks Spark Platform. In other words, using Delta Lake implicitly locks the computing and storage layers of our infrastructure to Databricks.

Apache Iceberg, despite its very active open source community, is still an early-stage project, and we found several constraints on its usage that seem unnecessary (in my personal opinion, at least), as well as bugs.

Delta Lake and Iceberg differ in how the same functionality is implemented on top of them, which means a potentially high cost if we want to switch between formats in the future.

To resolve the dilemma, we built an internal library that provides APIs for the common data lake operations (reading and writing versioned datasets, showing dataset history, etc.). While the implementation of these operations is based on Delta Lake or Apache Iceberg, engineers on our product development teams do not need to care about which one is actually used. Because the underlying file format is transparent to them, we will also have zero code change even if we want to switch to another format in the future.
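
The shape of such an abstraction can be sketched in a few lines of Python. This is an illustrative sketch only: `VersionedStore`, `InMemoryStore`, and `latest_row_count` are hypothetical names, and the in-memory backend merely stands in for a Delta Lake or Iceberg implementation so that the example runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class VersionedStore(ABC):
    """Format-agnostic interface for common data lake operations."""

    @abstractmethod
    def write(self, name: str, rows: List[Row]) -> int:
        """Write a new version of the dataset and return its version number."""

    @abstractmethod
    def read(self, name: str, version: Optional[int] = None) -> List[Row]:
        """Read the given version, or the latest one when version is None."""

    @abstractmethod
    def history(self, name: str) -> List[int]:
        """List the available version numbers, oldest first."""


class InMemoryStore(VersionedStore):
    """Toy backend standing in for a Delta Lake or Iceberg implementation."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[List[Row]]] = {}

    def write(self, name: str, rows: List[Row]) -> int:
        versions = self._versions.setdefault(name, [])
        versions.append(list(rows))  # every write creates a new version
        return len(versions) - 1

    def read(self, name: str, version: Optional[int] = None) -> List[Row]:
        versions = self._versions[name]
        index = len(versions) - 1 if version is None else version
        return list(versions[index])

    def history(self, name: str) -> List[int]:
        return list(range(len(self._versions.get(name, []))))


def latest_row_count(store: VersionedStore, name: str) -> int:
    """Product code depends only on the interface, never on the format."""
    return len(store.read(name))


store = InMemoryStore()
store.write("places", [{"id": 1}])
store.write("places", [{"id": 1}, {"id": 2}])
```

Switching formats then means adding another `VersionedStore` subclass and changing which one is constructed; callers such as `latest_row_count` stay untouched.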

Engineering Culture Enabler

A well-established, healthy engineering culture, a must-have for a great technical company, does not come for free. The cost comes not only from changes in thinking and action but also from unavoidable tooling overhead, even when everyone has agreed to embrace the culture.

The platform team enables the building of an engineering culture by lowering the cost involved. Taking SafeGraph as an example, we want to build an engineering culture that values the operational excellence of services. Operational excellence comes from comprehensive monitoring and timely alerting, as well as many other facilities that help engineers improve the service SLA and debug issues. All of these should be built as part of platform engineering instead of being dumped on each product team with the expectation that they squeeze out resources to build their own tools from scratch.

The platform team also has an advantage in promoting the desired culture. The “product” delivered by the platform team is used across teams; when a valued part of the culture leads to success, the benefit is shared by all teams, is easy to observe, and spreads naturally across them. For instance, the platform team at SafeGraph started building solutions to minimize or eliminate the manual steps in using Terraform, which is well known for being hard to ramp up on. The progressively better user experience is observed across teams, the benefit is shared, and it contributes to our culture, which always seeks to minimize human intervention in any process.

Summary

By resolving immediate issues, building low-cost, future-oriented long-term infrastructure solutions, and serving as an enabler of engineering culture, platform engineering is critical to optimizing an engineering organization for efficiency and sustainability.