| Apache Tez | |
|---|---|
| Name | Apache Tez |
| Developer | Apache Software Foundation |
| Released | 2014 |
| Latest release | 0.10.1 |
| Programming language | Java |
| License | Apache License 2.0 |

Apache Tez is a distributed data-processing framework for building high-performance applications whose work is expressed as a directed acyclic graph (DAG) of tasks. It provides a runtime that executes these graphs on clusters managed by Apache Hadoop YARN and serves as an execution engine for ecosystem projects that run batch and interactive workloads. Tez focuses on optimizing data movement and task execution to improve throughput and latency in data platforms.
Tez was introduced to address the limitations of Hadoop MapReduce as an execution engine for Apache Hive and Apache Pig. Rather than forcing every computation into a chain of map and reduce jobs, it executes an arbitrary DAG of tasks, an approach similar to that of Dryad. The project was incubated at the Apache Software Foundation, with development led largely by engineers at Hortonworks and contributions from Yahoo! and other companies that wanted to reduce the overhead of the repeated job submissions common in ETL pipelines and OLAP workloads. Tez is implemented in Java, runs on YARN, and reads and writes data through Hadoop-compatible file systems such as HDFS, Amazon S3, and Azure Blob Storage.
Tez models a job as a DAG in which vertices represent processing steps and edges represent the movement of data between them; the model draws on Dryad. At runtime, each DAG is coordinated by a Tez Application Master (AM), which requests containers from the YARN ResourceManager, uses the location hints attached to input splits (for example, HDFS block locations) to place tasks near their data, and expands each vertex into parallel tasks. The AM tracks task lifecycle events, re-runs failed tasks, and can recover DAG state if the AM itself is restarted. Intermediate data is typically written to local disk and fetched by downstream tasks through a shuffle service, and Tez supports MapReduce-style combiners along with sorted, partitioned outputs.
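The vertex-and-edge model described above can be illustrated with a toy sketch. This is plain Python rather than the Tez Java API: `ToyDag`, `add_vertex`, `add_edge`, and `schedule_order` are hypothetical names chosen for illustration, and the sketch assumes, as a simplification, that a vertex becomes runnable only after all of its producers have finished.

```python
from collections import defaultdict, deque


class ToyDag:
    """Toy model of a vertex/edge graph. Hypothetical; not the Tez API."""

    def __init__(self):
        self.vertices = []                   # vertex names in insertion order
        self.downstream = defaultdict(list)  # producer -> list of consumers
        self.indegree = {}                   # vertex -> number of producers

    def add_vertex(self, name):
        if name in self.indegree:
            raise ValueError(f"duplicate vertex: {name}")
        self.vertices.append(name)
        self.indegree[name] = 0
        return self

    def add_edge(self, producer, consumer):
        for vertex in (producer, consumer):
            if vertex not in self.indegree:
                raise ValueError(f"unknown vertex: {vertex}")
        self.downstream[producer].append(consumer)
        self.indegree[consumer] += 1
        return self

    def schedule_order(self):
        """Return one valid execution order using Kahn's algorithm."""
        remaining = dict(self.indegree)
        ready = deque(v for v in self.vertices if remaining[v] == 0)
        order = []
        while ready:
            vertex = ready.popleft()
            order.append(vertex)
            for consumer in self.downstream[vertex]:
                remaining[consumer] -= 1
                if remaining[consumer] == 0:
                    ready.append(consumer)
        if len(order) != len(self.vertices):
            raise ValueError("graph contains a cycle")
        return order


# A two-input join followed by an aggregation, expressed as a single graph.
dag = (
    ToyDag()
    .add_vertex("scan_orders")
    .add_vertex("scan_customers")
    .add_vertex("join")
    .add_vertex("aggregate")
    .add_edge("scan_orders", "join")
    .add_edge("scan_customers", "join")
    .add_edge("join", "aggregate")
)
order = dag.schedule_order()
print(order)
```

The sketch only orders whole vertices. A real engine would additionally split each vertex into parallel tasks and may overlap the execution of adjacent vertices.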
Developers compose processing graphs from Tez's vertex and edge abstractions, in a manner analogous to DryadLINQ and Apache Beam, and submit them either through the Java API or indirectly through higher-level engines such as Apache Hive and Apache Pig. The runtime API is built around three abstractions, Input, Output, and Processor (exposed as interfaces such as LogicalInput, LogicalOutput, and LogicalIOProcessor), which play a role similar to Hadoop MapReduce's Mapper and Reducer but allow finer-grained control, comparable to Flink operators and Spark RDD transformations. The integration layers in Hive and Pig compile SQL queries and Pig Latin scripts into Tez DAGs, much as Presto and Impala translate queries into plans for their own execution engines.
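The division of labor between inputs, a processor, and outputs can be sketched as follows. This is a minimal Python analogy, not the Tez runtime API: the classes `ListInput`, `ListOutput`, and `TokenCountProcessor` and their methods are hypothetical, and passing the processor dictionaries of named inputs and outputs is an assumption made to loosely mirror how a processor is handed its logical inputs and outputs by name.

```python
from collections import Counter


class ListInput:
    """Toy input: yields records from an in-memory list (hypothetical)."""

    def __init__(self, records):
        self._records = list(records)

    def read(self):
        return iter(self._records)


class ListOutput:
    """Toy output: collects key/value pairs in memory (hypothetical)."""

    def __init__(self):
        self.pairs = []

    def write(self, key, value):
        self.pairs.append((key, value))


class TokenCountProcessor:
    """Toy processor: counts whitespace-separated tokens across all inputs."""

    def run(self, inputs, outputs):
        counts = Counter()
        for source in inputs.values():
            for line in source.read():
                counts.update(line.split())
        # Emit in sorted key order so the result is deterministic.
        for token in sorted(counts):
            outputs["counts"].write(token, counts[token])


inputs = {"lines": ListInput(["a b a", "b c"])}
outputs = {"counts": ListOutput()}
TokenCountProcessor().run(inputs, outputs)
result = outputs["counts"].pairs
print(result)
```

Because the processor sees only named inputs and outputs, the same logic could be wired to a different source or sink without modification, which is the kind of flexibility the Input, Output, and Processor split is meant to provide.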
Tez is used for interactive SQL query execution through Apache Hive, for batch ETL scheduled by Apache Oozie workflows, and as an execution engine in Hadoop distributions from vendors such as Cloudera, Hortonworks, and MapR. Hive on Tez can also query data held in Apache HBase, or exposed through Apache Phoenix, by way of Hive storage handlers. Tez-based pipelines are commonly scheduled with orchestration tools such as Apache Airflow and monitored with systems such as Prometheus and Grafana, including in deployments on cloud providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
Tez improves latency and throughput by collapsing what would otherwise be several chained MapReduce jobs into a single DAG. This removes the per-job submission and task startup overhead of Hadoop MapReduce and avoids writing intermediate results to HDFS between jobs, a benefit comparable to the pipelined execution of Apache Spark and Apache Flink. Adopters such as Yahoo! have reported lower query latency and cluster resource usage, attributed to eliminating redundant serialization and reusing YARN containers across tasks. How well Tez scales depends on the capacity and scheduling policy of the underlying YARN cluster, and tuning typically involves Tez settings such as container memory, sort buffer size, and input split grouping, alongside HDFS and operating-system I/O settings.
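The two effects described above, fewer job boundaries and fewer container launches, can be made concrete with a back-of-the-envelope model. The function names and all numbers below are hypothetical and purely illustrative; the model assumes an ideal case in which every container is reused back-to-back, and it is not derived from any published benchmark.

```python
def chained_jobs_overhead(num_jobs):
    """Overhead of running a pipeline as back-to-back MapReduce-style jobs.

    Each job is submitted separately, and every job except the last
    writes its result to the distributed file system for the next job.
    """
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    return {"submissions": num_jobs, "intermediate_dfs_writes": num_jobs - 1}


def single_dag_overhead():
    """Overhead of the same pipeline expressed as one DAG."""
    return {"submissions": 1, "intermediate_dfs_writes": 0}


def container_launches(num_tasks, concurrent_slots, reuse):
    """Number of container launches needed to run num_tasks tasks.

    Without reuse every task launches its own container. With ideal
    reuse, at most one container is launched per concurrent slot.
    """
    if num_tasks < 0 or concurrent_slots < 1:
        raise ValueError("invalid task or slot count")
    return min(num_tasks, concurrent_slots) if reuse else num_tasks


# Illustrative inputs only: a 4-stage pipeline, 1000 tasks, 100 slots,
# and an assumed 2-second startup cost per container launch.
chained = chained_jobs_overhead(4)
dag = single_dag_overhead()
launches_without_reuse = container_launches(1000, 100, reuse=False)
launches_with_reuse = container_launches(1000, 100, reuse=True)
startup_seconds_saved = (launches_without_reuse - launches_with_reuse) * 2.0

print(chained, dag)
print(launches_without_reuse, launches_with_reuse, startup_seconds_saved)
```

The model ignores many real factors (data skew, scheduling delays, shuffle cost), so it indicates only where the savings come from, not how large they would be on a given cluster.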
Operators deploy Tez as part of platform distributions from Cloudera and Hortonworks, or install it alongside Apache Hive on custom clusters managed with provisioning tools such as Ansible, Chef, and Puppet. Production deployments require coordination with a YARN scheduler such as the CapacityScheduler or FairScheduler, Kerberos authentication backed by MIT Kerberos or Active Directory, and log aggregation, for example with the ELK Stack (Elasticsearch, Logstash, and Kibana). Upgrades and rolling restarts follow the same patterns used for HDFS and YARN clusters to maintain availability and minimize the impact on running queries.
Tez originated from engineering work at Hortonworks, entered the Apache Incubator in 2013 with contributors from several companies, including Yahoo!, and graduated to a top-level project in 2014. Its development milestones have tracked releases of Apache Hadoop, optimizations in Apache Hive, and the ecosystem's broader shift toward DAG-oriented engines exemplified by Apache Spark and Apache Flink. The project's contributors and committers coordinate through Apache JIRA and the project mailing lists, and its roadmap has been shaped by performance research presented at venues such as SIGMOD and VLDB and by production feedback from large-scale deployments.