Work-Bench Snapshot: Augmenting Streaming and Batch Processing Workflows
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more shaping the enterprise on a particular topic we’re looking at from an investment standpoint.
This post was originally published on November 17th, 2023 in The Data Source, my monthly newsletter covering the top innovations in data infrastructure, engineering, and developer-first tooling. Subscribe here!
There’s quite a lot happening in the data and distributed systems ecosystem. Three themes in particular have come up in recent conversations: microservice orchestration, testing in the context of distributed systems, and declarative data transformation.
As I wrap my head around these concepts, I’m sharing below a running list of articles that have caught my attention this week:
1. How Rama is tested: a primer on testing distributed systems
By Nathan Marz, Red Planet Labs
A distributed system is only as robust as the quality and thoroughness of its tests. More so, there are certain testing techniques which must be used if you’re going to have any hope of building a robust system resilient to the crazy edge cases that inevitably happen in production. If a distributed system isn’t tested with these techniques, you shouldn’t trust it.
Rigorous testing is crucial in software development, and doubly so when building infrastructure. The Red Planet Labs team has spent years developing its own testing strategies to ensure the reliability and robustness of its tools, and shares those learnings in this post.
Nathan Marz uses Red Planet Labs’ Rama, a general-purpose system for handling computation and storage needs at scale, to underscore the importance of testing when building distributed systems. The piece identifies non-negotiable system properties (e.g. no data loss, no stalling, timely recovery from faults) and the challenges of achieving them in a distributed context. Rama’s testing strategy centers on deterministic simulation, a powerful unit-testing technique that makes failures reproducible and simplifies debugging in complex scenarios.
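Rama’s harness goes much further than anything I can reproduce here, but the core idea behind deterministic simulation is easy to sketch: funnel every source of nondeterminism (message ordering, timers, scheduling) through a single seeded random number generator, so a test can explore many interleavings and replay any failing one exactly from its seed. A minimal Python sketch of that idea, not Rama’s actual test code:

```python
import random

class SimNode:
    """A toy replica that stores the writes it has received."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def receive(self, value):
        self.log.append(value)

def run_simulation(seed, writes):
    """Deliver replication messages in a random but seed-determined order."""
    rng = random.Random(seed)  # all nondeterminism flows from this one seed
    nodes = [SimNode(f"n{i}") for i in range(3)]
    pending = [(node, w) for w in writes for node in nodes]
    while pending:
        # The simulated "network" picks the next delivery; with the same
        # seed, the schedule (and thus the whole run) is identical every time.
        node, value = pending.pop(rng.randrange(len(pending)))
        node.receive(value)
    return nodes

def test_no_data_loss():
    writes = list(range(100))
    for seed in range(1000):  # cheaply explore many interleavings
        for node in run_simulation(seed, writes):
            assert sorted(node.log) == writes, f"lost data at seed={seed}"

test_no_data_loss()
```

With this structure, a failure report like `lost data at seed=417` is everything you need to replay the exact interleaving in a debugger.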
2. Fairy Tales of Workflow Orchestration
By Chris O'Hara, Achille Roussel, Julien Fabre and Thomas Pelletier, Stealth Rocket
At a high level, the solution we developed somewhat resembled workflow orchestrators we can find in today’s ecosystem. This framework raised the abstraction level, giving engineers a composable building block that they could use to express the multiple stages of execution needed to implement integrations with internal and external services while delegating the responsibility of reliably executing the workflows to the engine. The decoupling was highly successful; it unlocked orders of magnitude of infrastructure scale and, over time, grew to serve more and more use cases across pillars of the engineering org, becoming a key competitive advantage for the product. However, software engineers are creative individuals, and ultimately, the workflow orchestration model became too restrictive. Constraints started to arise in many shapes at several steps of the software development process.
This is a good read on the limitations of the microservice programming model and the growing adoption of durable execution engines in distributed systems engineering. While workflow orchestrators have emerged as a common route to durable execution, it’s unclear whether these frameworks are the ultimate fix or just a transitional phase.
The Stealth Rocket team’s past experience leading infrastructure work at Segment is particularly insightful. There, they built Centrifuge, a workflow orchestration engine for delivering user events. Despite its initial success, the orchestration model proved too restrictive for developers seeking greater flexibility in their workflows.
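To make that restrictiveness concrete: orchestrators typically ask engineers to decompose logic into engine-visible steps that pass state through a serializable context, rather than writing ordinary code with loops and local variables. A hypothetical sketch of that style (`WORKFLOW`, `run`, and the step functions are all illustrative, not any real engine’s API):

```python
# Hypothetical step-based workflow in the style of today's orchestrators.

def load_user(ctx):
    # Every step reads and writes a serializable context dict.
    ctx["user"] = {"id": ctx["user_id"]}

def deliver_event(ctx):
    print(f"delivering event for user {ctx['user']['id']}")

WORKFLOW = [load_user, deliver_event]  # stages must be declared up front

def run(engine_state, ctx):
    """The engine durably records each completed step, then resumes there.
    Loops, branches, and local state must be contorted into this shape."""
    for i, step in enumerate(WORKFLOW):
        if i < engine_state["completed"]:
            continue  # this step finished before the crash; skip it
        step(ctx)
        engine_state["completed"] = i + 1  # a real engine persists this

run({"completed": 0}, {"user_id": 42})
```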
The concept of "durable coroutines” as a proposed solution to the challenges surrounding durable execution is intelligent. Durable coroutines allow long-running workloads to be resumed after a crash or restart by persisting the intermediate state periodically. Unlike traditional workflow orchestrators, durable coroutines do not enforce a restrictive programming style, providing more flexibility to developers.
To explore this new programming model, check out Stealth Rocket’s open-source project, which aims to build a source-to-source Go compiler for creating durable programs.
3. Why Data Teams Are Adopting Declarative Pipelines
By Iaroslav Zeigerman, Tobiko Data
Maintaining state is not an easy task. Developers have to worry about the consistency of the persisted state, failure recovery, and account for migrations as the product evolves. Users must consider where the state will be stored and manage additional configuration overhead. However, real-world use cases often demand capabilities that go beyond simply running things in the correct order. Features like data completeness, incremental processing, versioning, and deployments are all stateful by nature. When using stateless tools, users are compelled to build some of these capabilities themselves or accept a substantial bill from their cloud data provider as well as testing code changes directly in production.
The evolution of data transformation and DataOps products has a lot to borrow from how the DevOps discipline has matured. This piece highlights the DevOps shift from imperative, stateless solutions (e.g. Chef, Puppet, Ansible) to declarative, stateful ones (e.g. Terraform and Kubernetes), and explores a similar evolution underway in the DataOps domain.
Stateless data transformation poses quite a few challenges: because information about prior invocations is not retained, users are left to grapple with complexities such as consistency, failure recovery, and migrations on their own.
I like this piece because it covers the significance of stateful solutions, using SQLMesh as a prime example. Stateful approaches let users define data transformations declaratively, leveraging internal state to reduce computational overhead, improve versioning, and deploy data efficiently.
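SQLMesh manages this bookkeeping itself, but the payoff of retained state is easy to illustrate: keep a high-watermark between runs so each invocation transforms only new rows instead of recomputing the whole table. A toy Python sketch using SQLite (the tables and the watermark dict are illustrative, not SQLMesh’s internals):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE clean_events (id INTEGER PRIMARY KEY, amount REAL);
""")
state = {"watermark": 0}  # a stateful tool persists this between runs

def run_pipeline():
    """Incremental run: transform only rows above the stored watermark."""
    rows = conn.execute(
        "SELECT id, amount FROM raw_events WHERE id > ? ORDER BY id",
        (state["watermark"],),
    ).fetchall()
    conn.executemany(
        "INSERT INTO clean_events VALUES (?, ?)",
        [(i, round(a, 2)) for i, a in rows],  # the "transformation"
    )
    if rows:
        state["watermark"] = rows[-1][0]  # remember progress for next run

conn.executemany("INSERT INTO raw_events VALUES (?, ?)", [(1, 9.991), (2, 5.0)])
run_pipeline()  # processes rows 1 and 2
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", [(3, 7.25)])
run_pipeline()  # processes only row 3
print(conn.execute("SELECT * FROM clean_events").fetchall())
```

A stateless tool would either reprocess `raw_events` in full on every run or push the user to build and persist this watermark logic themselves, which is exactly the tradeoff the article calls out.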
Practitioners and startup builders, if this is an area of interest to you, please reach out to chat!