| Hive on Tez | |
|---|---|
| Name | Hive on Tez |
| Developer | Apache Software Foundation |
| Initial release | 2014 |
| Latest release | 0.14.0 |
| Repository | Apache Hive |
| Language | Java |
| License | Apache License 2.0 |
Hive on Tez is an execution layer that integrates the Apache Tez data-processing framework with the Apache Hive data warehousing system to provide a more efficient execution model for SQL-on-Hadoop workloads. It replaces Hive's traditional MapReduce-based execution with a directed acyclic graph (DAG) runtime, improving latency and resource utilization for analytic queries on large clusters. It is used by organizations running Apache Hadoop distributions and cloud data platforms.
Hive on Tez combines components from Apache Hive, Apache Tez, and Apache Hadoop to implement a DAG-based execution engine for interactive and batch query workloads on distributed storage systems such as HDFS, Amazon S3, and Azure Blob Storage. It emerged as an alternative to MapReduce-based execution to address limitations encountered in deployments by vendors such as Cloudera, Hortonworks, and MapR. It runs on the Apache YARN resource manager and is commonly driven by workflow schedulers such as Apache Oozie and Apache Airflow. It is available in managed services such as Microsoft Azure HDInsight and Amazon EMR, and on on-premises clusters whose workloads resemble the large-scale analytics use cases of Facebook and Netflix.
The architecture relies on a planner in Apache Hive that compiles SQL queries into Tez DAGs, whose tasks run in YARN containers allocated by Hadoop's ResourceManager and launched by NodeManagers. The execution units in Tez (vertices and edges) correspond to groups of Hive operators and the data flows between them, as produced by Hive's optimizer phases. These phases apply rule-based transformations and cost-based optimization built on Apache Calcite, similar in spirit to techniques used in Oracle Database and IBM Db2. The Tez runtime supports custom processors analogous to operator implementations found in systems like Presto and Apache Spark SQL. It relies on the Hadoop Distributed File System for storage I/O and can be combined with Apache Kafka in streaming integration patterns. Because a whole query runs as one DAG, intermediate results do not have to be written back to HDFS between stages as they are in a chain of MapReduce jobs. The execution model further reduces shuffle overhead through intermediate data locality and in-memory data transfer, comparable to methods used in Apache Flink.
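The vertex-and-edge model described above can be sketched in a few lines of Python. This is an illustrative model only, not the Tez API: the vertex names (`Map 1`, `Reducer 2`, and so on) and the `topological_order` helper are assumptions chosen for the example. It shows how a DAG of two scans feeding a join and then an aggregation yields an order in which each vertex follows its inputs.

```python
from collections import deque


def topological_order(edges):
    """Order vertices so each one appears after all of its upstream vertices.

    Uses Kahn's algorithm; raises ValueError if the graph has a cycle.
    """
    downstream = {}
    indegree = {}
    for src, dst in edges:
        for vertex in (src, dst):
            indegree.setdefault(vertex, 0)
            downstream.setdefault(vertex, [])
        downstream[src].append(dst)
        indegree[dst] += 1

    # Vertices with no inputs (for example table scans) can start immediately.
    ready = deque(v for v, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        vertex = ready.popleft()
        order.append(vertex)
        for nxt in downstream[vertex]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


# Hypothetical plan: two scans feed a join vertex, which feeds an aggregation.
QUERY_DAG = [
    ("Map 1", "Reducer 2"),
    ("Map 2", "Reducer 2"),
    ("Reducer 2", "Reducer 3"),
]

if __name__ == "__main__":
    print(topological_order(QUERY_DAG))
```

In a MapReduce plan the same query would typically be split into separate jobs with the intermediate result persisted between them; in the DAG form the join output flows directly to the aggregation vertex.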
Hive on Tez improves performance through fewer I/O passes, reduced task latency, and more efficient resource packing compared to MapReduce-based execution; these benefits are documented in performance studies that use benchmarks such as TPC-H and TPC-DS. Optimizations include vectorized execution, which processes rows in batches rather than one at a time, predicate pushdown similar to approaches in PostgreSQL and Greenplum Database, and advanced join strategies (broadcast, sort-merge) comparable to those in Apache Spark. Cost-based optimization depends on table and column statistics stored in the Hive Metastore, gathered for example with `ANALYZE TABLE ... COMPUTE STATISTICS`; Apache Ranger and Apache Atlas integrate with the Metastore for authorization and metadata governance rather than for statistics. Runtime metrics are often monitored using Apache Ambari or Cloudera Manager dashboards. Performance tuning commonly also involves kernel-level I/O settings, such as those available in Red Hat Enterprise Linux, and container CPU isolation provided by Linux cgroups.
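To make the broadcast versus sort-merge distinction concrete, the following Python sketch shows a simplified size-based choice and a broadcast (hash) join over in-memory rows. It is a teaching model, not Hive's planner: the function names and the 10 MB default threshold are assumptions for illustration. In Hive, a comparable role is played by settings such as `hive.auto.convert.join` and `hive.auto.convert.join.noconditionaltask.size`.

```python
from collections import defaultdict


def choose_join_strategy(left_bytes, right_bytes, broadcast_threshold_bytes=10_000_000):
    """Pick 'broadcast' when the smaller input fits under the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    if min(left_bytes, right_bytes) <= broadcast_threshold_bytes:
        return "broadcast"
    return "sort-merge"


def broadcast_hash_join(small_rows, large_rows, key):
    """Join two lists of dicts by building a hash table on the small side.

    The small side is what would be shipped to every task; the large side
    is streamed once and probed against the hash table.
    """
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)

    joined = []
    for row in large_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    dims = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    facts = [{"id": 1, "amt": 10}, {"id": 1, "amt": 5}, {"id": 3, "amt": 7}]
    print(choose_join_strategy(2_000_000, 50_000_000_000))
    print(broadcast_hash_join(dims, facts, "id"))
```

The broadcast form avoids shuffling the large input at all, which is why it is preferred when one side is small; when both sides are large, a sort-merge join shuffles and sorts both inputs by key instead.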
Hive on Tez maintains compatibility with HiveQL constructs supported in Apache Hive releases and integrates with ecosystem components such as the Hive Metastore, Apache HBase, and the Apache Parquet and Apache ORC columnar formats. It interoperates with authentication and authorization stacks like Kerberos and LDAP and with governance tools from projects such as Apache Sentry and Apache Ranger. Clients such as Hue and Beeline, and the JDBC/ODBC drivers used by Tableau and Microsoft Power BI, connect through HiveServer2, enabling BI connectivity analogous to integrations seen with Snowflake and Google BigQuery clients.
Hive on Tez is commonly packaged within Hadoop distributions from Cloudera and Hortonworks and within cloud services such as Amazon EMR and Microsoft Azure HDInsight. Common configuration tasks include tuning Tez session pooling, adjusting DAG scheduler parameters, and setting container memory and vCore allocations in line with YARN and Linux performance guidance. Operators often use configuration management and orchestration tools like Ansible, Puppet, and Chef for reproducible deployments, and CI/CD pipelines built on Jenkins or GitLab CI for automated testing. Security configurations follow patterns established by Kerberos realms and, where applicable for web UIs and service endpoints, certificates issued by authorities such as Let's Encrypt.
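As a rough illustration of the container memory settings mentioned above, the Python sketch below renders session-level `SET` statements for a few Hive and Tez properties (`hive.execution.engine`, `hive.tez.container.size`, `hive.tez.java.opts`, `tez.am.resource.memory.mb`). The helper names, the example sizes, and the 80% heap-to-container ratio are assumptions for illustration; appropriate values depend on the cluster and should be checked against the distribution's documentation.

```python
def tez_session_settings(container_mb, am_mb, heap_fraction=0.8):
    """Build a dict of Hive-on-Tez settings from container sizes in MB.

    heap_fraction is an assumed rule of thumb that leaves headroom for
    off-heap memory inside the YARN container.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mb = int(container_mb * heap_fraction)
    return {
        "hive.execution.engine": "tez",
        "hive.tez.container.size": str(container_mb),
        "hive.tez.java.opts": f"-Xmx{heap_mb}m",
        "tez.am.resource.memory.mb": str(am_mb),
    }


def render_set_statements(settings):
    """Render settings as HiveQL SET statements, one per property."""
    return [f"SET {name}={value};" for name, value in settings.items()]


if __name__ == "__main__":
    for line in render_set_statements(tez_session_settings(4096, 2048)):
        print(line)
```

Generating the statements from one function keeps the JVM heap consistent with the container size, which is a frequent source of container kills when the two are edited independently.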
Use cases for Hive on Tez include interactive SQL analytics, ETL pipelines, data lake querying, and large-scale reporting of the kind associated with companies such as Facebook, Netflix, and Yahoo!, and with batch analytics at financial institutions such as Goldman Sachs and JPMorgan Chase. It is selected for scenarios requiring low-latency query response, complex joins, and integration with columnar formats such as ORC in Hadoop data lakes. Adoption has been influenced by the broader shift toward DAG-based engines exemplified by Apache Spark and Apache Flink, with Hive on Tez remaining a pragmatic choice where HiveQL compatibility and Metastore integration are priorities.