Work-Bench Snapshot: Augmenting Streaming and Batch Processing Workflows
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more shaping the enterprise around a particular topic we’re looking at from an investment standpoint.
Here at Work-Bench, we strongly believe that great businesses can be built by commercializing open source projects that target an acute pain point and cultivate developer love. For the past 10 years, we have been investing in technical infrastructure businesses like Cockroach Labs, CoreOS (acquired by Red Hat), Authzed, CommonFate, Appland, and many more. It’s our belief that the best open source businesses require a mix of technical innovation, grit, and tailwinds to get off the ground. As such, we love studying the best open source businesses.
Recently, I sat down with Martin Mao, Co-Founder and CEO of Chronosphere, to learn more about how he leveraged his experience building and maintaining Uber’s M3 platform to found Chronosphere.
Today, we’re examining Trino and the technologies that led to its rise.
If you’re building an open source business, we’d love to talk.
Trino is a tale of enabling technologies and step-function advances in data storage and search.
The enabling technologies behind Trino can be traced back to the early 2000s, when Google developed its Google File System and MapReduce framework, both of which went on to heavily influence the Hadoop ecosystem. Likewise, the arrival of web-scale data pushed Facebook to create Hive in 2007 and Presto in 2012, which was open sourced in 2013.
At the end of 2018, the original creators of Presto left Facebook and founded the Presto Software Foundation to ensure the project remained collaborative and independent. In 2019, the project was forked and became known as PrestoSQL. Towards the end of 2020, PrestoSQL was renamed Trino to reduce confusion between it and the legacy PrestoDB project.
Today, Trino is an open source SQL query engine that enables users to query data across disparate data sources without needing to copy or move it. Trino can query big data systems, relational databases, and file stores. It’s important to note that while Trino understands SQL, it is not a general-purpose relational database, nor a replacement for databases like MySQL, PostgreSQL, or Oracle. One of the main business benefits of using Trino is that it enables cost optimization for data teams by separating the compute and storage layers.
When Google first launched its search service, it had to contend with the billions of web pages being created to take advantage of the internet’s promise. At the time, it was normal to download large files to a storage device and transfer them to a server where they’d be processed. Transmission speeds were slow, and it took far too long to move the data to the compute layer, making the network the bottleneck in this decoupled architecture.
To work around this, Google created a new file system, the Google File System, that chopped files into chunks and distributed them across the servers in a cluster, and paired it with MapReduce, a framework for processing that distributed data. What made MapReduce most attractive was that it enabled developers to process petabytes of data concurrently by splitting the work into smaller chunks and processing them in parallel on commodity servers.
After the map phase emits key-value pairs, the system shuffles them so that all values for the same key land together, aggregates the data from multiple servers, and returns a consolidated output back to the application.
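To make the dataflow concrete, here is a minimal, single-machine Python sketch of the MapReduce pattern using word count, the canonical example. The real frameworks run the map and reduce phases across many servers, but the three stages are the same:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values emitted for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values into one result per key.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```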
Another key benefit of MapReduce is data locality: compute cores are co-located with the disks that hold the data they process, eliminating massive network transfers each time the data is processed. In other words, MapReduce sends the program to where the data resides, rather than sending the data to where the program resides.
Inspired by Google’s publications on its File System and MapReduce framework, Doug Cutting and Mike Cafarella created Hadoop and open sourced it, enabling anyone to benefit from the same file system and MapReduce capabilities.
The Hadoop ecosystem focuses on big data management and leverages principles similar to Google’s File System and MapReduce frameworks, allowing for the distributed processing of large data sets across clusters of commodity hardware.
Initially, Hadoop consisted of three core components that were specifically designed to work with Big Data: HDFS (the Hadoop Distributed File System) for distributed storage, MapReduce for parallel processing, and YARN for cluster resource management.
As Facebook scaled, it was collecting hundreds of petabytes of data about its users and quickly adopted Hadoop. But most of the people who wanted to query that data weren’t skilled in Java, so Facebook created Hive.
Hive is an abstraction layer on top of MapReduce that enables users to write SQL statements to analyze data stored in the Hadoop Distributed File System. Hive automatically maps SQL operations onto the low-level Java MapReduce API, translating each query into a pipeline of MapReduce jobs.
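As an illustration, here is a hedged sketch using PyHive, a community Python client for Hive; the server address and the web_logs table are hypothetical placeholders. The analyst writes plain SQL, and Hive compiles it into MapReduce jobs behind the scenes:

```python
from pyhive import hive  # community client for HiveServer2

# Hypothetical connection details; adjust for your cluster.
conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# An ordinary SQL aggregation; Hive compiles this into a MapReduce job:
# the map phase emits (page, 1) pairs and the reduce phase sums them.
cursor.execute("""
    SELECT page, COUNT(*) AS views
    FROM web_logs
    GROUP BY page
""")
for page, views in cursor.fetchall():
    print(page, views)
```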
Hadoop provided fault-tolerant batch processing, which was especially important in the early 2000s because Hadoop was originally built to run on cheap commodity servers that were expected to fail every once in a while. Every stage of a MapReduce pipeline that Hive created wrote its intermediate results to disk; that way, if a server failed in the middle of a query, Hive could recover the failed server’s workload on another server in the cluster. However, all the reading and writing that made MapReduce resilient also made querying data extremely slow.
Because the design constraints of Hadoop (and thereby Hive) imposed limits on what Hive could do and how quickly it could do it, Facebook developed Presto as a faster alternative.
Presto was designed to do all of its processing in memory, enabling fast queries over the massive amounts of data Facebook stored in Hadoop. To achieve lower latency, Presto does not provide mid-query fault tolerance: if something goes wrong mid-query, the user has to rerun the failed query. The bet was that Presto’s speed would more than offset the cost of occasionally rerunning failed queries.
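That trade-off shows up in how client code tends to be written: since a worker failure aborts the whole query, callers often wrap queries in a simple retry. A minimal sketch of that pattern follows; the run_query callable and the retry budget are illustrative, not part of any Presto API:

```python
import time

def run_with_retry(run_query, sql, attempts=3, backoff_seconds=5):
    # Presto trades mid-query fault tolerance for speed, so a worker
    # failure kills the whole query; the standard remedy is to rerun it.
    for attempt in range(1, attempts + 1):
        try:
            return run_query(sql)
        except Exception:
            if attempt == attempts:
                raise  # out of retries; surface the failure to the caller
            time.sleep(backoff_seconds * attempt)  # back off, then rerun
```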
In the decade between Google building its File System and MapReduce and Facebook launching Presto, network speeds improved by an order of magnitude. Facebook designed the original Presto to copy data out of its Hadoop cluster and push it across a high-speed network into the memory of a Presto cluster for fast in-memory processing.
After years of developing new connectors, Presto could query data from many disparate data sources, many of which previously couldn’t be queried with SQL at all. With Presto, users no longer need separate languages for different source types; they can query all of their data with SQL, regardless of where it sits.
After five years of open source development, Facebook’s management wanted more control over Presto, which led its creators to leave the company and fork the project as PrestoSQL, later renamed Trino.
Back in the early 2000s, network speeds were slow, decoupled systems were impractical, and data-locality systems like Hadoop were created. Now that network speeds are faster, systems like Trino have returned to decoupled architectures because data can be moved to compute efficiently.
In modern parallel distributed systems, data is still sent across a cluster of servers, so systems like Trino use a mix of the two architectures: decoupled and data-local. When storage and compute are decoupled, they can be scaled independently.
Since Trino is a compute layer, you can scale query power simply by adjusting the number of worker nodes in the cluster. And because Trino acts as a universal translator, a single SQL query can span multiple data sources.
When all the sources are relational databases, Trino does straightforward SQL-to-SQL translations and data-type mappings. It can also perform more complicated translations that provide SQL access to systems that don’t use table-based data models.
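Here is a hedged sketch of what that looks like in practice, using the open source trino Python client (pip install trino); the coordinator address and the catalog, schema, and table names (mysql.crm.customers, hive.web.page_views) are hypothetical placeholders:

```python
import trino  # open source Trino Python client

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
)
cursor = conn.cursor()

# One SQL statement joins a MySQL table with a table stored in HDFS/Hive.
# Trino resolves each catalog.schema.table name to the right connector
# and pushes work down to each source where it can.
cursor.execute("""
    SELECT c.region, COUNT(*) AS views
    FROM mysql.crm.customers AS c
    JOIN hive.web.page_views AS v ON v.customer_id = c.id
    GROUP BY c.region
    ORDER BY views DESC
""")
for region, views in cursor.fetchall():
    print(region, views)
```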
Trino allows users to perform data warehouse analytics without the data warehouse, querying data wherever it lives. In practice, an analyst doesn’t need to load, transform, or otherwise prepare the data; they can access it anywhere, using regular SQL queries, without having to worry about the underlying infrastructure that makes it all work. This is an alternative to the traditional approach of collecting and consolidating data in a centralized data warehouse.
Trino bypasses the need for ETL because it queries data where it lives. Its biggest advantage is that it is just a SQL engine: it sits agnostically on top of data sources like MySQL, HDFS, and SQL Server, eliminating the need for users to understand the connection details and SQL dialects of each underlying system. Additional benefits include federated queries that join across sources, independent scaling of compute and storage, and fast in-memory parallel processing.
Founded in 2017, Starburst is the company behind the large-scale commercialization and maintenance of Trino. The Starburst co-founders, Justin Borgman and Matt Fuller, previously sold their “SQL-on-Hadoop” company, Hadapt, to Teradata. After their tenure at Teradata, they decided to focus on turning Presto into an enterprise-grade service, and, after a few years, they succeeded in hiring Presto creators Dain Sundstrom, Martin Traverso, and David Phillips.
Starburst offers a fully supported, enterprise-grade Trino product with dozens of pre-built connectors, security, and services.
Additionally, the company launched Starburst Galaxy, a fully managed, multi-cloud analytics SaaS offering designed to minimize the infrastructure management burden for data teams.
These new tools help data engineers define relevant metadata for creating, publishing, finding, and managing curated data products built on multiple data sets. They also provide data governance and query capabilities around those data products.
It’s clear that Starburst is going all-in on the notion that decentralized data is the future. The company is pushing “data mesh” into existence: an approach that helps companies analyze distributed sets of data across domains, clouds, and geographic locations.
Further reading:
GitHub: trinodb
Hadoop: MapReduce Framework
Trino
Trino: An Origin Story
Starburst raises $22M to modernize data analytics with Presto
Starburst Valuation Climbs To $3.35B With Latest Funding Round
What Is Trino And Why Is It Great At Processing Big Data