Apache Hive on Tez

Name	Apache Hive on Tez
Developer	Apache Software Foundation
Initial release	2014
Latest release	(varies)
Repo	Apache Hive
License	Apache License 2.0

Contents

Overview
Architecture and Execution Model
Performance and Optimizations
Deployment and Configuration
Use Cases and Integration
Comparison with Other Execution Engines

Apache Hive on Tez is an execution option for Apache Hive, the SQL-on-Hadoop data warehouse, in which queries are compiled into directed acyclic graphs (DAGs) and run on the Apache Tez runtime rather than as chains of MapReduce jobs, improving throughput and latency for large-scale analytics. It occupies a middle ground between batch-oriented Apache Hadoop MapReduce and in-memory frameworks such as Apache Spark. Cluster resources are obtained through Apache YARN, and some deployments also use Kubernetes. Contributors and adopters include engineers from Facebook, Hortonworks, Cloudera, and MapR, and organizations such as Netflix, Uber, Airbnb, eBay, and Twitter.

Overview

Hive on Tez emerged to address limitations of the original Hive execution on MapReduce by adopting the Tez engine, which was developed initially at Hortonworks and contributed to the Apache Software Foundation (Hive itself originated at Facebook). The design targeted interactive SQL workloads of the kind served by Apache Impala and Presto while preserving compatibility with HiveQL and with metadata services such as the Hive Metastore. Community governance involved stakeholders from Hortonworks, Cloudera, Yahoo!, and LinkedIn, along with academic collaborators, including researchers from UC Berkeley and MIT, who studied query optimization and DAG scheduling. Ecosystem tools that interoperate with Hive on Tez include Apache Pig, Apache Flume, Apache Sqoop, Apache Oozie, Apache Ranger, and Apache Atlas.

Architecture and Execution Model

The architecture centers on compiling Hive query plans into Tez DAGs that are submitted to a resource manager, usually Apache YARN, or to container orchestration via Kubernetes. Logical plans generated by the Hive compiler are transformed into physical operators, which are grouped into Tez vertices connected by edges that describe how data moves between them. Intermediate data is exchanged between tasks through a shuffle service (on YARN, typically the NodeManager auxiliary shuffle service backed by local disks), while table data resides in the Hadoop Distributed File System (HDFS) or in object stores offered by Amazon Web Services and Google Cloud Platform. The model supports features developed in conjunction with Apache Calcite for cost-based optimization and HCatalog for metadata interoperability. Scheduling and resource negotiation rely on YARN schedulers such as CapacityScheduler and FairScheduler, while security integrates with Kerberos for authentication, Apache Ranger for authorization, and Apache Knox for perimeter protection.
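The vertex-and-edge structure described above can be illustrated with a small sketch. This is not Tez code: the vertex names and the query they stand for are hypothetical, and the ordering function is a plain topological sort (Kahn's algorithm) used only to show that a vertex can start once all of its inputs are available.

```python
from collections import deque

# Hypothetical vertices for a query that joins two tables and then aggregates.
# Each key is a vertex; each value lists the vertices whose output it consumes.
dag = {
    "Map 1 (scan orders)": [],
    "Map 2 (scan customers)": [],
    "Reducer 3 (join)": ["Map 1 (scan orders)", "Map 2 (scan customers)"],
    "Reducer 4 (group by)": ["Reducer 3 (join)"],
}


def schedule_order(dag):
    """Return vertices in an order where every vertex follows its inputs."""
    # Number of inputs each vertex is still waiting for.
    remaining = {vertex: len(inputs) for vertex, inputs in dag.items()}
    # Reverse mapping: which vertices consume the output of each vertex.
    consumers = {vertex: [] for vertex in dag}
    for vertex, inputs in dag.items():
        for upstream in inputs:
            consumers[upstream].append(vertex)

    ready = deque(v for v, count in remaining.items() if count == 0)
    order = []
    while ready:
        vertex = ready.popleft()
        order.append(vertex)
        for downstream in consumers[vertex]:
            remaining[downstream] -= 1
            if remaining[downstream] == 0:
                ready.append(downstream)

    if len(order) != len(dag):
        raise ValueError("plan contains a cycle, so it is not a DAG")
    return order


for vertex in schedule_order(dag):
    print(vertex)
```

In this sketch the two scan vertices have no inputs, so either could run first. A real runtime would run such independent vertices concurrently rather than in a single sequence.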

Performance and Optimizations

Hive on Tez reduces job latency through vertex fusion, in-memory combiners, and pipelined shuffles, influenced by research from the UC Berkeley AMPLab and by production practice at Facebook. Optimizations include operator reordering, vectorized execution (processing rows in batches rather than one at a time), runtime partition pruning comparable to techniques used in engines such as Dremio, and predicate pushdown into the Apache Parquet and Apache ORC columnar formats. Cost-based optimization uses table and column statistics stored in the Hive Metastore together with the Apache Calcite planning framework, while access policies from Apache Ranger are applied independently of cost-based planning. Workload management ideas associated with Google Borg and Apache Mesos inform concurrency controls, and I/O can be reduced further through storage integrations with Alluxio and Ceph.
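Predicate pushdown into a columnar format can be pictured as follows. The sketch assumes that each block of a file carries minimum and maximum statistics for a column, as ORC and Parquet do. The `Stripe` class and `scan_greater_than` function are hypothetical illustrations, not part of any Hive, ORC, or Parquet API.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """Hypothetical stand-in for a file block with min/max column statistics."""
    min_value: int
    max_value: int
    rows: list


def scan_greater_than(stripes, threshold):
    """Return values greater than threshold and the number of stripes read."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        # The statistics prove that no row here can satisfy value > threshold,
        # so the stripe is skipped without reading its rows.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(value for value in stripe.rows if value > threshold)
    return matches, stripes_read


stripes = [
    Stripe(1, 10, [1, 5, 10]),
    Stripe(11, 20, [11, 15, 20]),
    Stripe(21, 30, [21, 25, 30]),
]

matches, stripes_read = scan_greater_than(stripes, 18)
print(matches, stripes_read)
```

With a threshold of 18, the first stripe is skipped on its statistics alone and only two of the three stripes are read. The benefit grows when data is sorted or clustered on the filtered column, because the value ranges of different blocks then overlap less.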

Deployment and Configuration

Typical deployments run on clusters managed by Hadoop distributions from vendors such as Cloudera and Hortonworks, or in cloud services including Amazon EMR, Google Dataproc, and Microsoft Azure HDInsight. Configuration involves tuning Tez memory parameters, sizing containers to fit the memory available on YARN NodeManagers, and tuning the shuffle service. Cluster management and monitoring typically use tools such as Apache Ambari, Cloudera Manager, Prometheus, Grafana, Elasticsearch, Kibana, and Nagios. High-availability setups commonly rely on Apache ZooKeeper ensembles, for example for HiveServer2 service discovery, and backup and restore workflows often target object stores such as Amazon S3 and Google Cloud Storage.
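As an illustration of container sizing, the sketch below derives a few related settings from a single container size. The property names are real Hive and Tez settings, but the ratios (heap at 80% of the container, sort buffer at 40%, map-join threshold at one third) are rule-of-thumb assumptions for this example, not official defaults. The helper functions are hypothetical, and suitable values depend on the workload and cluster.

```python
def suggest_tez_settings(container_mb: int) -> dict:
    """Derive illustrative memory settings from one container size in MB."""
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    # Assumed ratio: leave about 20% of the container for non-heap memory.
    heap_mb = container_mb * 8 // 10
    # Assumed ratio: sort buffer at about 40% of the container.
    sort_mb = container_mb * 4 // 10
    # Assumed ratio: map-join threshold (in bytes) at one third of the container.
    map_join_bytes = container_mb * 1024 * 1024 // 3
    return {
        "hive.execution.engine": "tez",
        "hive.tez.container.size": str(container_mb),
        "hive.tez.java.opts": f"-Xmx{heap_mb}m",
        "tez.runtime.io.sort.mb": str(sort_mb),
        "hive.auto.convert.join.noconditionaltask.size": str(map_join_bytes),
    }


def as_set_statements(settings: dict) -> list:
    """Render settings as session-level SET statements."""
    return [f"SET {key}={value};" for key, value in settings.items()]


for statement in as_set_statements(suggest_tez_settings(4096)):
    print(statement)
```

The container size should be a multiple of the YARN minimum allocation and no larger than the maximum allocation, since YARN rounds requests up to its own increments.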

Use Cases and Integration

Use cases include interactive analytics at enterprises such as Netflix and Uber, ETL pipelines for e-commerce platforms such as eBay and PayPal, feature engineering for data science at companies such as Airbnb and LinkedIn, and log analytics for security teams that use alternatives to Splunk. Integrations span business intelligence and dashboarding tools such as Tableau, Looker, MicroStrategy, and Power BI, as well as machine learning pipelines built with Apache Spark MLlib, TensorFlow, and PyTorch, with model serving via KFServing and Seldon Core.

Comparison with Other Execution Engines

Compared to Apache Hadoop MapReduce, Hive on Tez offers lower latency, more flexible DAG composition, and better resource utilization, largely because a multi-stage query runs as one DAG instead of a chain of separate jobs. Against Apache Spark SQL, Tez emphasizes tight Hive integration and lower task startup cost in certain batch scenarios, while Spark provides richer in-memory abstractions and closer ties to the machine learning ecosystem. Versus engines such as Presto, Apache Impala, and ClickHouse, Hive on Tez trades some latency and concurrency for broad HiveQL compatibility and the metadata governance provided by the Hive Metastore. Choices among Tez, Spark, and Apache Flink often depend on which engines a vendor platform (for example from Cloudera or Confluent) supports and on how each engine integrates with security tooling such as Kerberos and Ranger.
