Work-Bench Snapshot: Augmenting Streaming and Batch Processing Workflows
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more, shaping the enterprise on a particular topic we’re looking at from an investment standpoint.
This post was originally published on The Data Source on April 8th, 2024, my monthly newsletter covering the top innovation in data infrastructure, engineering and developer-first tooling. Subscribe here!
Apache Arrow DataFusion is a query engine built by Andy Grove, designed for building robust data-centric systems. Developed in Rust, a language known for its performance and memory safety, DataFusion serves as a foundational component for processing extensive datasets.
In the evolving landscape of data processing and analytics platforms, DataFusion distinguishes itself with its seamless integration with the Apache Arrow format. Apache Arrow, a cross-language development platform for in-memory data, is designed to expedite analytical processing. Leveraging Arrow's columnar memory layout and zero-copy data exchange capabilities, DataFusion enhances performance and efficiency in handling large datasets. This integration minimizes data movement and serialization overhead, thereby facilitating faster query execution and reducing computational costs.
Within the competitive market, various platforms offer unique strengths. Spark provides versatility and scalability alongside a rich ecosystem of libraries. Presto excels in interactive SQL querying over large datasets. Flink prioritizes real-time stream processing with low latency and high throughput. Dask competes in scalable data processing, particularly for Python-centric workflows. ClickHouse stands out for its prowess in real-time analytics.
The growing community around DataFusion suggests that more advancements will be made in the project. What I’m most excited about is seeing whether DataFusion ends up challenging Apache Spark's dominance, given Spark's widespread adoption and associated complexities around data movement and resource provisioning.
Currently, DataFusion is widely adopted across diverse data processing use cases:
We built a new SQL Engine on Arrow and DataFusion: Arroyo unveiled a revamped SQL engine powered by Apache Arrow and DataFusion SQL toolkit. Initially relying on DataFusion solely for SQL parsing, Arroyo now integrates its physical plans, operators, and expressions, enhancing flexibility and compatibility, especially for streaming operations. This shift mirrors the momentum within the Rust data community, with DataFusion spearheading efforts to bolster streaming capabilities alongside its established batch processing prowess.
Apple’s Comet Brings Fast Vector Processing to Apache Spark: Apple has released a plug-in designed to enhance Apache Spark's execution of vector searches, thus augmenting the platform's suitability for large-scale machine learning data analysis. Comet, built upon the adaptable Apache DataFusion query engine, written in Rust, and utilizing the Arrow columnar data format, aims to expedite Spark query execution.
Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0: InfluxDB 3.0 faces the challenge of efficiently handling incoming data and making it queryable. Instead of building a new engine from scratch, InfluxData chose to use Apache Arrow DataFusion for querying time series data using SQL. This enables them to implement custom features like late arrival resolution and time series gap filling with ease, and extend functionalities like the Flux language in their cloud environment.
pg_analytics: Transforming Postgres into a Fast OLAP Database: Building an advanced analytical database in Postgres is expensive and challenging. Early attempts like Greenplum and subsequent products from Citus and Timescale have struggled to match the performance of non-Postgres databases. This has led many companies to favor alternatives like Elasticsearch. However, with the rise of embeddable query engines like DataFusion, the ParadeDB team believes the project can outpace many OLAP databases in query speed, indicating a shift away from building query engines from scratch within databases. By integrating with DataFusion, there can be continuous improvements made to the database performance.
If you’re a data practitioner leveraging DataFusion or interested in chatting about the broader landscape of data processing / analytics tools, please reach out as I’d love to swap notes!