| Presto (software) | |
|---|---|
| Name | Presto |
| Developer | Facebook, later Presto Software Foundation |
| Released | 2013 |
| Programming language | Java |
| Operating system | Linux |
| License | Apache License 2.0 |
Presto is a distributed SQL query engine designed for interactive analytic queries against large datasets. Originally developed at Facebook and later stewarded by the Presto Software Foundation, Presto enables low-latency, ad hoc queries across heterogeneous data sources such as Hadoop, Amazon S3, MySQL, and PostgreSQL. Organizations use it for analytics, business intelligence, and data engineering workloads, often alongside systems such as Apache Hive, Apache Spark, and Snowflake.
Presto was created at Facebook in 2012 and open-sourced in 2013 to replace MapReduce-based queries executed through Apache Hadoop and Apache Hive, aiming to deliver sub-second to seconds latency for interactive workloads. Key contributors and adopters in the early ecosystem included engineers with ties to Cloudera, Dropbox, and Netflix, who later proposed community governance models. In 2019, stewardship shifted toward an independent, vendor-neutral foundation model inspired by organizations such as the Cloud Native Computing Foundation and the Apache Software Foundation, culminating in the formation of the Presto Software Foundation to coordinate development, releases, and collaboration.
Presto employs a clustered, coordinator-worker architecture in which a single coordinator node manages query planning and scheduling while multiple worker nodes execute tasks. The system's core components are the coordinator, the workers, and a catalog layer whose pluggable connectors expose systems such as Amazon S3, Apache Cassandra, MongoDB, MySQL, and PostgreSQL through a connector API similar in intent to JDBC drivers or Apache Kafka connectors. The planner and optimizer are implemented in Java and feed a memory-centric, pipelined execution engine that does not depend on Apache Hadoop YARN for resource isolation; instead, Presto can be deployed under cluster managers such as Kubernetes, Mesos, or Docker Swarm, following deployment patterns adopted by operators at Airbnb and LinkedIn.
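The coordinator-worker division of labor can be illustrated with a minimal sketch. This is not Presto's actual API; the function names and the toy filter-and-aggregate query are hypothetical, and threads stand in for worker nodes:

```python
# Illustrative sketch (not Presto's real interfaces): a "coordinator"
# partitions a table scan into splits, dispatches them to "workers",
# and performs the final aggregation over the partial results.
from concurrent.futures import ThreadPoolExecutor

def worker_task(split):
    # Each worker filters and partially aggregates its split locally.
    return sum(row["amount"] for row in split if row["amount"] > 10)

def coordinator(rows, num_workers=3):
    # The coordinator partitions the data and schedules one task per worker.
    splits = [rows[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(worker_task, splits)
    return sum(partials)  # final aggregation on the coordinator

rows = [{"amount": a} for a in [5, 20, 30, 8, 15]]
print(coordinator(rows))  # 65
```

In the real system, splits are produced by connectors against remote storage and tasks stream pages of data between stages over the network rather than returning a single value.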
Queries are parsed into a logical plan, transformed by rule-based and cost-based optimizations, and converted into a distributed physical plan executed across worker nodes. The optimizer applies techniques comparable to those in PostgreSQL and Oracle Database, including predicate pushdown, projection pruning, and join reordering driven by cardinality estimates, with statistics collected via connectors or external metastores such as the Apache Hive Metastore. Execution employs vectorized processing and runtime filters, akin to approaches in Apache Impala and ClickHouse, to reduce I/O and CPU cost, while SQL support largely follows the ANSI SQL standard with extensions comparable to those found in BigQuery and Snowflake.
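The idea behind cost-based join reordering can be sketched with a toy cost model. This is a simplification, not Presto's actual optimizer: the table names, row counts, and the uniform-selectivity assumption are all illustrative:

```python
# Hedged sketch of cost-based join ordering: enumerate join orders and
# pick the one whose estimated intermediate results are smallest.
from itertools import permutations

# Hypothetical per-table row counts, as a metastore might report them.
stats = {"orders": 1_000_000, "customers": 50_000, "nation": 25}

def estimated_cost(order, selectivity=0.01):
    # Crude model: cost accumulates the estimated size of each
    # intermediate join result, assuming a fixed join selectivity.
    cost, rows = 0, stats[order[0]]
    for table in order[1:]:
        rows = rows * stats[table] * selectivity  # estimated join output
        cost += rows
    return cost

best = min(permutations(stats), key=estimated_cost)
print(best)  # the largest table ("orders") ends up joined last
```

Real optimizers prune this exponential search space (e.g. with dynamic programming) and use per-column statistics rather than a single selectivity constant, but the principle of ranking plans by estimated intermediate cardinality is the same.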
Operators deploy Presto in cloud and on-premises environments using orchestration frameworks such as Kubernetes and the provisioning tooling of providers such as Google Cloud and Amazon Web Services. Scalability is achieved by adding worker nodes behind the coordinator, with elastic-scaling patterns adopted by cloud-native teams at Netflix and Uber. High-availability practices include coordinator failover and catalog replication, while multi-tenant security typically combines authentication mechanisms such as LDAP and OAuth 2.0 with identity providers such as Okta and Keycloak. Monitoring and observability commonly rely on Prometheus and Grafana, with log aggregation through Elasticsearch and Fluentd.
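A coordinator's `etc/config.properties` file, as described in Presto's deployment documentation, illustrates how the coordinator and worker roles are configured; the port, memory limits, and discovery URI below are placeholder values, not recommendations:

```properties
# Coordinator node configuration (workers set coordinator=false
# and omit the discovery-server property).
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://coordinator.example:8080
```

Worker nodes point `discovery.uri` at the coordinator, which is how the cluster described above assembles itself at startup.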
Presto is used for interactive business intelligence, ad hoc analytics, data exploration, and federated analytics at enterprises including Facebook, Uber, Netflix, Airbnb, Twitter, and LinkedIn. It integrates with BI tools such as Tableau, Looker, Power BI, and Apache Superset, enabling analysts to query federated datasets spanning Amazon S3, the Hadoop Distributed File System, and relational stores such as Oracle Database and Microsoft SQL Server. Industry use cases range from product analytics at Salesforce and Pinterest to security analytics in environments operated by CrowdStrike and Splunk.
Benchmarking Presto involves measuring query latency, throughput, and resource efficiency against engines such as Apache Spark, Apache Impala, ClickHouse, and commercial systems like Snowflake and Google BigQuery. Public and private benchmarks, conducted by research teams and vendors with suites such as TPC-DS and TPC-H, show Presto delivering competitive low-latency interactive performance on star-schema queries and federated scans when tuned with appropriate memory, network, and connector configurations. As in comparative evaluations of PostgreSQL and Oracle Database, performance depends heavily on join strategies, data layout (columnar formats such as Apache Parquet and Apache ORC), and the underlying storage I/O provided by services such as Amazon EBS and Google Cloud Storage.
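Because interactive engines are judged on tail behavior as well as typical latency, benchmark results are usually summarized as percentiles rather than means. A minimal sketch (the sample latencies are invented):

```python
# Summarize benchmark query latencies by percentile: the mean hides
# tail outliers (e.g. one slow federated scan), while p95 exposes them.
def percentile(samples, p):
    # Nearest-rank percentile over the sorted latency samples.
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

# Hypothetical per-query latencies in milliseconds from one benchmark run.
latencies_ms = [120, 95, 400, 110, 130, 105, 2500, 115, 125, 100]
print(percentile(latencies_ms, 50))  # median latency
print(percentile(latencies_ms, 95))  # tail latency, dominated by outliers
```

Comparing engines on p50 and p95 separately distinguishes steady-state interactivity from sensitivity to stragglers, which is exactly the trade-off the tuning knobs mentioned above (memory, network, connectors) influence.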
Category:Distributed SQL engines