Presto (SQL query engine)

Name	Presto
Developer	Facebook, Starburst Data, Uber Technologies, Netflix, LinkedIn
Initial release	2013
Written in	Java
Operating system	Linux, macOS, Microsoft Windows
License	Apache License 2.0

Contents

Overview
Architecture
Query Processing and Optimizer
Connectors and Supported Data Sources
Deployment, Scalability, and Performance
History and Forks
Use Cases and Adoption

Presto is a distributed, open-source SQL query engine designed for interactive analytic queries against large-scale data lakes and data warehouses. It enables fast, ad hoc analytics across heterogeneous storage systems and integrates with a wide ecosystem of data platforms and cloud services. Presto is used by major technology companies for low-latency analytics and has spawned commercial distributions and forks maintained by multiple organizations.

Overview

Presto is a parallel, distributed SQL engine originally developed by engineers at Facebook and later released as open source. Companies such as Uber Technologies have contributed to it, and vendors such as Starburst Data have offered commercial distributions. It executes ANSI SQL, including features added after SQL-92, and targets analytics workloads at organizations like Netflix, LinkedIn, Airbnb, Twitter, and Pinterest. Presto queries can read data from storage systems such as Hadoop Distributed File System, Amazon S3, Google Cloud Storage, and Azure Blob Storage, and resolve table metadata through metastore services like the Apache Hive metastore and AWS Glue. Comparable query engines and platforms include Apache Spark SQL, Trino, Apache Impala, Dremio, Google BigQuery, and Snowflake.

Architecture

Presto employs a coordinator-worker architecture with a single coordinator node and multiple worker nodes, a layout also found in MapReduce-era systems and in modern analytics stacks like Apache Flink and Apache Storm. The coordinator parses, plans, and schedules each query, while workers execute the resulting distributed tasks and exchange intermediate results with one another over the network. Presto uses HTTP for this exchange, where other systems rely on RPC frameworks such as gRPC or Apache Thrift. Execution is pipelined: operators pass batches of rows in columnar form from one stage to the next instead of materializing each stage in full, an approach comparable to the vectorized processing of Apache Arrow-based and LLVM-accelerated engines. Presto can be deployed under cluster managers and resource schedulers such as Kubernetes, Apache YARN, and Apache Mesos, and on cloud platforms from Amazon Web Services, Google Cloud Platform, and Microsoft Azure.
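The division of labour between coordinator and workers can be illustrated with a minimal Python sketch. It is not Presto code: the function names (`plan_splits`, `run_task`, `coordinate`) are hypothetical, threads stand in for worker nodes, and the "query" is fixed to a grouped row count. It only shows the shape of the idea, in which the coordinator cuts the input into splits, workers compute partial results, and the partials are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, split_size):
    """Coordinator side: cut the input into fixed-size splits."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def run_task(split):
    """Worker side: partial COUNT(*) grouped by key for one split."""
    return Counter(key for key, _value in split)


def coordinate(rows, split_size=2, workers=3):
    """Schedule one task per split, then merge the partial counts."""
    splits = plan_splits(rows, split_size)
    # Threads are an assumption of this sketch; real workers are separate nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_task, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


rows = [("us", 1), ("de", 2), ("us", 3), ("fr", 4), ("de", 5), ("us", 6)]
result = coordinate(rows)
```

Here `result` holds one count per key, regardless of how the rows were distributed across splits. A real engine would stream partial results between stages instead of collecting them in one place.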

Query Processing and Optimizer

Presto's query processing pipeline parses SQL into an abstract syntax tree and then performs semantic analysis, logical planning, optimization, and physical planning, following design principles found in systems like PostgreSQL, Apache Calcite, and the Catalyst optimizer of Apache Spark SQL. The optimizer applies both rule-based and cost-based transformations, including statistics-driven join reordering in the tradition of the Selinger optimizer, and chooses between broadcast and repartitioned join strategies comparable to methods used in Teradata and Vertica. Presto supports window functions, aggregation, subqueries, and user-defined functions, and incorporates dynamic filtering and other runtime optimizations influenced by research from Stanford University and the University of California, Berkeley. Execution orchestration supports adaptive query execution similar to features in Apache Spark and Trino.
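The difference between a broadcast join and a repartitioned join can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Presto's optimizer: the function names and the `broadcast_limit` threshold are hypothetical, the cost model is reduced to a row count of the build side, and partitions are plain lists in one process.

```python
from collections import defaultdict


def choose_join_distribution(build_rows_estimate, broadcast_limit):
    """Pick a strategy from an estimated build-side row count (toy cost model)."""
    if build_rows_estimate <= broadcast_limit:
        return "broadcast"
    return "repartition"


def hash_join(probe, build, key):
    """Inner join of two lists of dicts on `key` via an in-memory hash table."""
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)
    return [{**b, **p} for p in probe for b in table.get(p[key], [])]


def repartition(rows, key, partitions):
    """Route each row to a partition chosen by the hash of its join key."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        buckets[hash(row[key]) % partitions].append(row)
    return buckets


def distributed_join(probe, build, key, partitions=2, broadcast_limit=2):
    strategy = choose_join_distribution(len(build), broadcast_limit)
    if strategy == "broadcast":
        # Every probe partition is joined against the full build side.
        probe_parts = [probe[i::partitions] for i in range(partitions)]
        joined = [r for part in probe_parts for r in hash_join(part, build, key)]
    else:
        # Both sides are hashed on the key, so matching rows share a partition.
        probe_parts = repartition(probe, key, partitions)
        build_parts = repartition(build, key, partitions)
        joined = [r for p, b in zip(probe_parts, build_parts)
                  for r in hash_join(p, b, key)]
    return strategy, joined


orders = [{"cust": 1, "amount": 10}, {"cust": 2, "amount": 20},
          {"cust": 1, "amount": 30}, {"cust": 3, "amount": 5}]
customers = [{"cust": 1, "name": "a"}, {"cust": 2, "name": "b"}]

small = distributed_join(orders, customers, "cust", broadcast_limit=2)
large = distributed_join(orders, customers, "cust", broadcast_limit=1)
```

Both calls produce the same joined rows, possibly in a different order. Only the data movement differs: broadcasting copies the small side to every partition, while repartitioning moves both sides once, which tends to pay off when neither side is small.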

Connectors and Supported Data Sources

Presto's connector architecture provides pluggable adapters for diverse data sources including Apache Hive, Apache HBase, Apache Cassandra, MongoDB, MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, Amazon S3, Google BigQuery, Snowflake, Azure Data Lake Storage, Apache Kafka, and Elasticsearch. Each connector exposes a source's metadata (schemas, tables, columns) and its data to Presto's internal execution model, similar to connector frameworks in Apache Flink and Apache NiFi. The ecosystem includes community and vendor-provided connectors for systems like ClickHouse, Greenplum, IBM Db2, SAP HANA, Teradata, S3 Select, and MinIO.
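The pluggable-adapter idea can be shown with a small Python sketch. Presto's actual connector interface is written in Java and is considerably richer. The classes below (`Connector`, `InMemoryConnector`, `Catalogs`) and their methods are hypothetical names chosen for illustration, assuming a connector only needs to list tables, describe columns, and scan rows.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical minimal connector contract: metadata plus row access."""

    @abstractmethod
    def list_tables(self): ...

    @abstractmethod
    def columns(self, table): ...

    @abstractmethod
    def scan(self, table, wanted_columns): ...


class InMemoryConnector(Connector):
    """Toy data source backed by a dict of {table name: list of row dicts}."""

    def __init__(self, tables):
        self._tables = tables

    def list_tables(self):
        return sorted(self._tables)

    def columns(self, table):
        rows = self._tables[table]
        return list(rows[0]) if rows else []

    def scan(self, table, wanted_columns):
        # Only the requested columns are returned (column pruning).
        for row in self._tables[table]:
            yield tuple(row[c] for c in wanted_columns)


class Catalogs:
    """Maps a catalog name to a connector so the engine stays source-agnostic."""

    def __init__(self):
        self._connectors = {}

    def register(self, name, connector):
        self._connectors[name] = connector

    def scan(self, qualified_name, wanted_columns):
        catalog, table = qualified_name.split(".", 1)
        return self._connectors[catalog].scan(table, wanted_columns)


catalogs = Catalogs()
catalogs.register("mem", InMemoryConnector(
    {"users": [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]}))
names = list(catalogs.scan("mem.users", ["name"]))
```

Adding a new data source then means writing another `Connector` subclass and registering it under a catalog name. The code that plans and executes queries does not change.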

Deployment, Scalability, and Performance

Presto is designed for horizontal scalability across commodity hardware and cloud instances, supporting thousands of concurrent queries in production deployments at companies like Facebook, Netflix, and Uber. Performance tuning involves configuring memory pools, spill-to-disk behavior, parallelism, and JVM settings, practices familiar from operating other JVM-based systems such as Apache HBase and Apache Cassandra. Production clusters are typically monitored with tools such as Prometheus, Grafana, Elasticsearch, Kibana, Datadog, and Splunk. Performance comparisons often reference Apache Impala, Apache Spark, and the Trino fork (formerly PrestoSQL); benchmarking studies appear in work by Stanford University and the University of California, Berkeley, and in industry research from Google Research and Microsoft Research.

History and Forks

Development of Presto began at Facebook in 2012 as a faster alternative to running interactive queries through Apache Hive, and the project was released as open source in 2013, drawing participation from companies like Twitter and Netflix. In 2019 the project split. The original creators, who had left Facebook, established the Presto Software Foundation around their fork, PrestoSQL, which is backed commercially by Starburst Data and counts engineers from Teradata and Facebook alumni among its contributors; this fork was renamed Trino in 2020. In the same year, Facebook placed the original code base under the Presto Foundation, hosted by the Linux Foundation. The project's evolution includes contributions from cloud vendors Amazon Web Services, Google Cloud Platform, and Microsoft Azure, and from corporate adopters such as Airbnb and LinkedIn. These legal and governance developments mirror patterns seen in other open-source transitions involving OpenStack, the Linux Foundation, and the Apache Software Foundation.

Use Cases and Adoption

Presto is used for interactive analytics, business intelligence, the read side of ETL pipelines, and data exploration in enterprises including Facebook, Netflix, Uber Technologies, LinkedIn, Airbnb, Pinterest, and Expedia Group. It integrates with BI tools and visualization platforms like Tableau, Looker, Power BI, Qlik, Apache Superset, and Grafana for dashboarding and reporting. Presto supports machine learning feature extraction workflows that feed frameworks such as TensorFlow, PyTorch, scikit-learn, and XGBoost, and is used alongside orchestration systems like Apache Airflow and Luigi for pipeline automation. Enterprises combine Presto with data catalog and governance tools such as Apache Atlas, Collibra, and Alation to meet compliance and discovery needs.
