| ORC (file format) | |
|---|---|
| Name | ORC |
| Extension | .orc |
| Mime | application/octet-stream |
| Owner | Apache Software Foundation (originally Hortonworks) |
| Genre | columnar storage file format |
| Released | 2013 |
ORC (Optimized Row Columnar) is a columnar storage file format designed for high-performance processing of large-scale datasets in the Apache Hadoop ecosystem. It was introduced within the Apache Hive project to improve throughput for Hive, MapReduce, and later Apache Spark workloads, reducing storage overhead and query latency for analytic engines used by large operators such as Facebook and Yahoo!. The format emphasizes a compact on-disk representation, predicate pushdown, and fast vectorized reads, and is also supported by distributed systems such as Apache Flink, Presto, and Trino.
ORC was created in 2013 as a successor to the RCFile format in Apache Hive, with the goals of better compression and faster read performance for Hive queries; it was initially developed at Hortonworks with contributions from Facebook, and became a top-level Apache project (Apache ORC) in 2015. The format stores data by column rather than by row, which lets systems such as Apache Spark, Impala, Amazon EMR, and Google Cloud Dataproc read only the columns a query references. Its design prioritizes the needs of organizations operating large-scale analytic pipelines.
An ORC file organizes data into stripes, each containing index data, row data, and a stripe footer; stripes are independently readable, which facilitates parallel reads by engines such as Presto, Trino, Apache Drill, and Dremio. The stripe layout suits vectorized processing models like those used in Apache Arrow-based systems, and the format works on the Hadoop Distributed File System as well as object stores such as Amazon S3, Azure Blob Storage, and Google Cloud Storage. A file footer, serialized with Protocol Buffers, records the schema and file-level column statistics that query planners (for example, via the Hive Metastore or AWS Glue Data Catalog) can consult without scanning the data.
ORC integrates multiple compression codecs (including zlib, Snappy, LZ4, and Zstandard) and encoding strategies to balance storage space against CPU cost. It supports run-length encoding, dictionary encoding, and bit packing for primitive column types, with compression applied per stream, an approach similar to that of Parquet implementations used by engines such as Snowflake and Redshift Spectrum. Built-in lightweight indexes record minimum and maximum values per stripe and per row group, enabling predicate pushdown: query engines such as Trino, Presto, and Spark SQL use these statistics to prune stripes or row groups and reduce I/O, much like zone maps in other analytic systems.
ORC is optimized for the analytical, read-heavy workloads typical of data warehousing and business intelligence. Its columnar layout and stripe-level statistics accelerate aggregations and OLAP-style queries in engines such as Apache Hive LLAP, Impala, and ClickHouse, while vectorized readers reduce CPU overhead in Spark SQL and Flink SQL. Common use cases include ETL pipelines orchestrated with tools such as Apache NiFi and Apache Airflow, log analytics, and large-scale feature stores for machine learning.
Multiple open-source projects and commercial vendors provide readers, writers, and connectors for ORC. The Apache ORC project maintains reference implementations in Java and C++; the Java reader and writer originated in Apache Hive, and connectors exist for Presto, Trino, Spark, and Flink. Cloud services such as AWS Glue, Google BigQuery (via external tables), and Azure Synapse Analytics can query ORC data stored in object stores, and bindings for other languages, such as Python through Apache Arrow's pyarrow, broaden adoption further.
ORC files rely on the surrounding storage and cluster security model for access control and auditing, typically enforced through tools such as Apache Ranger, Kerberos, and cloud identity services like AWS Identity and Access Management and Azure Active Directory. Governance of schema evolution, metadata stewardship, and data lineage is commonly handled by catalogs such as Apache Atlas, Collibra, and Alation in enterprise deployments. The format itself embeds structural metadata that helps detect corruption, and recent versions of the specification add column-level encryption, while encryption at rest and in transit are otherwise provided by the underlying file systems and transport layers.
Category:File formats