Work-Bench Snapshot: Areas Primed for Opportunity in Streaming & Batch Data Processing
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more, shaping the enterprise on a particular topic we’re looking at from an investment standpoint.
Due to the rising demand for large-scale data processing, today's data systems have undergone significant changes to handle transactional data models effectively as well as support a wider variety of sources, including logs and metrics from web servers, sensor data from IoT systems and more. In fact, the current data ecosystem is split between two fundamental computing paradigms: batch processing, where large volumes of data are scheduled and processed offline, and stream processing, where continuous streams of data are processed for real-time analysis.
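To make the distinction concrete, here is a minimal, illustrative Python sketch (the data and function names are invented for this example): a batch job produces one answer once the whole dataset is available, while a streaming job refreshes its answer as each event arrives.

```python
import time

# Batch: the full dataset is available up front and processed in one scheduled run.
def batch_total(records):
    return sum(records)

# Stream: records arrive continuously; the result is updated incrementally per event.
def stream_totals(records):
    running = 0.0
    for value in records:
        running += value
        yield running  # emit an up-to-date result after every event

if __name__ == "__main__":
    data = [12.5, 3.0, 7.25, 1.0]

    # Batch answer: one result, available only after the whole job finishes.
    print("batch total:", batch_total(data))

    # Streaming answer: a continuously refreshed result per incoming event.
    for partial in stream_totals(iter(data)):
        print("running total:", partial)
        time.sleep(0.1)  # stand-in for events trickling in over time
```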
Today, an increasing number of applications require both stream and batch processing. For example, financial services organizations use stream analytics where fast analytical results on critical jobs matter, such as fraud detection and analyzing customer behavior and stock trades. Batch processing, on the other hand, is used where processing large volumes of data matters more than getting near-instant results, such as end-of-cycle data reconciliation.
The technologies underpinning these paradigms have evolved significantly over the past couple of years. Let's dive in!
Kafka, the leading platform for high-throughput, low-latency messaging created at LinkedIn, became widely popular because it can publish log and event data from many sources into databases and other downstream systems in a real-time, scalable and durable manner. But Kafka is not as scalable and performant as it could be: owing to its monolithic architecture, the storage and serving layers are coupled and can only be deployed together. What this means is that every request for data in the storage layer has to go through the message broker first, which slows down queries, increases latency and reduces throughput.
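For context, here is a minimal sketch of that publish/consume round trip using the kafka-python client; the broker address and topic name are placeholder assumptions. Note that both the write and every subsequent read are served through the broker, which is the coupling described above.

```python
from kafka import KafkaProducer, KafkaConsumer

# Broker address and topic name are placeholders for this sketch.
BROKER = "localhost:9092"
TOPIC = "web-logs"

# Produce: the event is written to the broker, which also owns the storage layer.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b'{"path": "/checkout", "status": 200}')
producer.flush()

# Consume: every read is served back through the same broker.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
consumer.close()
```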
Trend: Apache Pulsar is a next-generation messaging and queuing system that came out of Yahoo. Unlike Kafka, Pulsar has a multi-layer architecture that decouples its compute, storage and messaging into distinct layers, enabling developers to access data directly from each layer. This not only gives near-instant access to data as the broker publishes it, but also significantly increases throughput, data scalability and availability.
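Below is a minimal sketch with the pulsar-client Python library; the service URL and topic name are placeholders. The reader API illustrates attaching directly to a position in the topic's log, which lives in the separate storage layer, rather than consuming through a subscription.

```python
import pulsar

# Service URL and topic name are placeholders for this sketch.
client = pulsar.Client("pulsar://localhost:6650")

# Publish an event through the broker (serving layer).
producer = client.create_producer("web-logs")
producer.send(b'{"path": "/checkout", "status": 200}')

# A reader attaches at an explicit position in the topic's log, which is stored
# in the separate storage layer (Apache BookKeeper), independent of subscriptions.
reader = client.create_reader("web-logs", pulsar.MessageId.earliest)
while reader.has_message_available():
    print(reader.read_next().data())

client.close()
```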
Other tools include: Kesque and Cloudkarafka
Like event streaming platforms, batch processing systems have their own advantages and disadvantages. Innovation in the ETL pipeline has made it easier for engineers and end users to collaboratively work with and process batch data. But since data in ETL is loaded on a schedule, every time an end user poses a question, the data has to be processed all over again to answer that particular query. And as more and more users query the same pipeline and spin up multiple workflows on an ad-hoc basis, query times slow and infrastructure costs rise.
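As a rough illustration of the pattern, here is a toy scheduled ETL job in Python; the directory names and schema are invented for the example. Every run re-reads the full raw dataset, which is exactly why ad-hoc questions between runs get expensive.

```python
import csv
from collections import defaultdict
from pathlib import Path

# A toy nightly ETL job: extract raw order files, transform them into per-customer
# totals, and load the result into a "warehouse" table (here, just another CSV).
RAW_DIR = Path("raw_orders")             # placeholder landing zone for raw exports
WAREHOUSE = Path("customer_totals.csv")  # placeholder warehouse table

def run_batch_etl():
    totals = defaultdict(float)

    # Extract + transform: every scheduled run re-reads the full raw dataset,
    # so any new question asked between runs forces another full pass.
    for raw_file in RAW_DIR.glob("*.csv"):
        with raw_file.open() as f:
            for row in csv.DictReader(f):
                totals[row["customer_id"]] += float(row["amount"])

    # Load: overwrite the warehouse table with the freshly computed aggregate.
    with WAREHOUSE.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "total_spend"])
        for customer_id, total in sorted(totals.items()):
            writer.writerow([customer_id, f"{total:.2f}"])

if __name__ == "__main__":
    run_batch_etl()  # in practice triggered by a scheduler such as cron or Airflow
```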
Trend: Traditionally, the operational and analytical stacks have largely been kept separate owing to the complexities of integrating the two. A trend we're observing in this space is the unification of these stacks in a single model that expresses both batch and streaming computations, offering the best of each ecosystem. Here are some of the tools tackling this space and their approaches:
Other tools include: Dataflow and Apache Beam
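As a sketch of the unified model, here is a minimal Apache Beam pipeline in Python: the same transforms can run over a bounded source (here via beam.Create) or an unbounded one such as Pub/Sub or Kafka, and on different runners. The sample events are invented for illustration.

```python
import apache_beam as beam

# A minimal unified pipeline: the transform code is identical whether the source
# is bounded (batch) or unbounded (streaming), and whether it runs locally on
# the DirectRunner or on a managed runner such as Dataflow.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.Create([
            {"user": "a", "amount": 10.0},
            {"user": "b", "amount": 4.5},
            {"user": "a", "amount": 2.0},
        ])  # bounded stand-in; a streaming job would read from Pub/Sub or Kafka instead
        | "KeyByUser" >> beam.Map(lambda e: (e["user"], e["amount"]))
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```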
While the first two forward-looking trends dealt with the infrastructure side of stream and batch data processing, the user side of the problem also needs to be addressed. Data-driven organizations today have a large number of non-technical employees who need to analyze real-time streaming data without having to wrestle with the complexities of the underlying infrastructure.
Trend: We are seeing a growing number of tools that democratize access for non-technical users by letting them query data through SQL, a language familiar to most data practitioners. These tools not only create an end-to-end self-service experience for users, but also simplify the process by giving data engineers, data scientists and analysts a common base to work from collaboratively.
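As a rough sketch of what this looks like in practice, here is a small PyFlink example in the streaming-SQL spirit of these tools (the table name, columns and datagen source are placeholders): the analyst writes ordinary SQL, and the engine evaluates it incrementally over an unbounded stream.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# The table, columns, and datagen source below are placeholders; a real
# deployment would point at Kafka, Pulsar, or another streaming connector.
table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a continuously generated stream of fake page-view events.
table_env.execute_sql("""
    CREATE TABLE page_views (
        user_id INT,
        url STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# The query itself is plain SQL; the engine maintains the aggregate
# incrementally as new events arrive instead of rescanning a static table.
result = table_env.execute_sql(
    "SELECT user_id, COUNT(*) AS views FROM page_views GROUP BY user_id"
)
result.print()  # streams updates to stdout until the job is interrupted
```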
Stephan Ewen is a committer and PMC member of the Apache Flink project and CTO of Ververica (formerly data Artisans).
Arjun Narayan is the CEO and co-founder of Materialize, a NYC-based real-time streaming SQL database and was formerly a software engineer at Cockroach Labs.
Ricardo Ferreira is a developer advocate at Elastic and was formerly a member of the developer relations team at Confluent.
Maximilian Michels is a software engineer and committer to Apache Flink and Apache Beam and previously worked on the data infrastructure team at Lyft.
If you’re a startup or data practitioner working on a solution in this space, please reach out! I’d love to chat. We continue to learn and evolve our thinking in this space.