Distilling The Composable Data Management System Manifesto

Apr 18, 2024

This post was originally published on April 18th, 2024 on The Data Source, my monthly newsletter covering the top innovations in data infrastructure, engineering, and developer-first tooling. Subscribe here!

It’s been a while since The Composable Data Management System Manifesto came out, and I recently had the chance to revisit it. As I've been diving deeply into data management systems and the array of tools that have emerged around the ecosystem, I've discovered this paper to be an invaluable resource for gaining perspective.

Breaking down the manifesto, it's clear that the future of data management relies on adopting modularity, standardization, and open collaboration. These principles will be especially crucial for addressing fragmentation, improving interoperability, and promoting user-centricity as data ecosystems grow increasingly complex.

My takeaways:

While specialized data management systems offer task-specific flexibility, traditional monolithic architectures persist, largely because software development practices for data have stayed static even as data needs evolve.

The variety of tasks within current data applications has led to the creation of specialized systems, each tailored for a specific purpose. Instead of a universal solution, numerous database options now exist, catering to diverse industries and use cases. The challenge is that software development practices for managing data remain largely unchanged, so teams end up building inflexible systems that struggle to adapt to the constantly evolving demands of today's data environments.

Despite shared core components, data management systems lack consistency and reusability, resulting in a fragmented user experience.

While data management systems often share components like data storage, processing engines, and query languages, they remain inconsistent due to proprietary technologies, vendor lock-in, and differing design philosophies. This fragmentation has forced users to navigate disjointed interfaces, APIs, and workflows across multiple systems. As a result, greater interoperability and standardization are needed to enable users to seamlessly integrate and interchange components while maintaining workflow consistency.

The paper proposes a model for building composable data management systems. 
  • Standardize and consolidate language frontends to serve as the user interface for data management systems. Language frontends receive user commands and queries and translate them into intermediate representations (IR). Initiatives like ZetaSQL (Google), CoreSQL (Meta), and enhancements to PostgreSQL's parser aim to unify language frontends, ensuring consistency and easy integration across systems. Unified frontends would let users interact with different systems using familiar syntax, simplifying query execution and development. (A small frontend sketch follows this list.)
  • Define a unified specification for the intermediate representation to standardize its format, enabling portability and optimization across environments. The IR acts as a bridge between high-level language frontends and low-level execution engines, providing a uniform format for queries and tasks. Substrait is an example of a project that establishes a unified IR specification to create cross-system compatibility and optimization. Adopting a common IR allows query plans to be exchanged and optimized seamlessly among data management systems, improving performance and resource utilization across environments. (A Substrait sketch follows this list.)
  • Develop composable and extensible query optimizers to improve data processing efficiency. Projects like Orca and Apache Calcite have led the way in this field, enhancing query planning and execution through techniques like cost-based optimization and adaptive query processing. This facilitates optimal query performance across diverse environments while ensuring code reuse and extensibility. (A toy rule-based optimizer follows this list.)
  • Make the execution engine a unified framework for executing queries across systems, enabling distributed and parallel task execution. Velox is an example of a tool that simplifies distributed query processing with a common runtime environment; a unified framework ensures efficient resource usage and fault tolerance, which is ideal for large-scale analytical workloads. Advancements in execution runtime frameworks like Apache Spark, Ray, and Dask support distributed and parallel task execution, which enhances flexibility for data science workflows. They facilitate efficient resource utilization and provide high-level abstractions for complex data processing pipelines, streamlining application development for developers and data scientists. (A small Dask sketch follows this list.)
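
To make the frontend idea concrete, here is a minimal sketch using the open-source sqlglot library (my choice of example; the manifesto itself points to projects like ZetaSQL and PostgreSQL's parser). It parses a query into a dialect-agnostic syntax tree and re-emits it for another engine's dialect:

```python
# Sketch: a language frontend translating user SQL into a shared tree,
# then re-emitting it for a different engine. sqlglot is an illustrative
# choice, not a tool named in the manifesto.
import sqlglot

# The frontend's job: turn user syntax into an internal representation.
ast = sqlglot.parse_one("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")
print(ast)

# Because the tree is engine-agnostic, the same query can be regenerated
# for another dialect without a manual rewrite.
print(sqlglot.transpile("SELECT CAST(ts AS TEXT) FROM events",
                        read="postgres", write="spark")[0])
```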
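
For the IR itself, the sketch below assumes the ibis-substrait package and its SubstraitCompiler entry point (per its published examples; treat the exact import path as an assumption), which compiles an Ibis expression into a Substrait plan that any Substrait-aware engine can consume:

```python
# Sketch: compiling a dataframe-style expression into a Substrait plan.
# Assumes ibis-substrait exposes SubstraitCompiler as in its docs.
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# A hypothetical table schema; only the shape matters here.
orders = ibis.table([("user_id", "int64"), ("amount", "float64")], name="orders")
expr = orders.group_by("user_id").aggregate(total=orders.amount.sum())

# The result is a Substrait Plan protobuf: a serialized, engine-neutral
# query plan that can be handed to any Substrait-consuming engine.
plan = SubstraitCompiler().compile(expr)
print(type(plan))
```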
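
Cost-based frameworks like Orca and Calcite are far richer than anything that fits here, but a toy rule-based optimizer shows the composability the paper is after: plans are plain data, each rewrite rule is an independent function, and extending the optimizer means appending a rule rather than modifying an engine. All names below are my own, purely illustrative:

```python
# Sketch: a composable rule-based optimizer. Plans are plain dataclasses
# and each rewrite rule is a standalone function, so new rules compose
# without touching the driver loop.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Scan:
    table: str
    predicate: Optional[str] = None  # filter pushed down into the scan

@dataclass
class Filter:
    predicate: str
    child: "Plan"

Plan = Union[Scan, Filter]

def push_down_filter(plan: Plan) -> Plan:
    """Rule: fold a Filter sitting directly above a Scan into the Scan."""
    if isinstance(plan, Filter) and isinstance(plan.child, Scan):
        return Scan(plan.child.table, predicate=plan.predicate)
    return plan

RULES = [push_down_filter]  # extensibility: just append more rules

def optimize(plan: Plan) -> Plan:
    for rule in RULES:
        plan = rule(plan)
    return plan

print(optimize(Filter("amount > 100", Scan("orders"))))
# -> Scan(table='orders', predicate='amount > 100')
```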
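
On the execution side, Dask (named above) gives a feel for what a unified runtime buys: the same dataframe logic builds a task graph that the scheduler then executes in parallel across partitions. The file path and column names below are hypothetical:

```python
# Sketch: distributed, lazy execution with Dask. The dataset path and
# columns are hypothetical placeholders.
import dask.dataframe as dd

# Reading lazily partitions the data; no work happens yet.
df = dd.read_parquet("events/*.parquet")

# This builds a task graph describing a per-user aggregation.
per_user = df.groupby("user_id")["latency_ms"].mean()

# Only .compute() hands the graph to the scheduler, which runs the
# per-partition tasks in parallel.
print(per_user.compute())
```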

While the journey toward composable systems is still in its early stages, it’s exciting to see innovation in this area. Voltron Data is an example of a company that’s spearheading the development of composable data systems, leveraging open standards such as ADBC, Arrow, Ibis, and Substrait. There’s also significant momentum in the data ecosystem, especially in the Apache Arrow community, to bring composable systems to the masses.

People to Follow - Authors of the Manifesto
  • Pedro Pedreira leads the Velox team at Meta, focusing on commoditizing execution for data management and modernizing compute engines.
  • Wes McKinney is the co-founder of Voltron Data and co-creator of Pandas, Apache Arrow, and Ibis.
  • Orri Erling, an engineer at Meta, specializes in the Presto analytical database. He developed a SQL/SPARQL query optimizer that handles analytics and lookups by utilizing sampling instead of precomputed statistics.
  • Konstantinos Karanasos works on the data infrastructure team at Meta, where he focuses on systems for ML, large-scale data platforms, and distributed systems.
  • Scott Schneider is the tech lead on the AI Infrastructure team at Meta, where he focuses on feature engineering development tools.
  • Jacques Nadeau is the co-founder of Sundeck.io and co-creator of Apache Arrow, Dremio, and Substrait.
  • Satya R. Valluri and Mohamed Zait are software engineers at Databricks and co-chairs of the Composable Data Management Systems Workshop at the Very Large Data Bases (VLDB) conference.

If you’re a data practitioner focusing on designing and building composable data management systems, I’d love to chat. Please reach out!
