LLMpedia: the first transparent, open encyclopedia generated by LLMs

Parquet (file format)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Pandas (software) Hop 4
Expansion Funnel Raw 94 → Dedup 0 → NER 0 → Enqueued 0
Parquet (file format)
Name: Parquet
Format: Columnar storage file format
Developer: Apache Software Foundation, Twitter, Cloudera
Released: 2013
Type: Data serialization, columnar file
License: Apache License 2.0

Parquet is a columnar storage file format designed for efficient data processing, analytical workloads, and interoperability across distributed systems. It is optimized for use with large-scale data processing engines and storage systems, emphasizing compression, schema evolution, and fast columnar reads. Parquet integrates with many data processing frameworks and storage platforms to support high-performance analytics and ETL pipelines.

Overview

Parquet was co-developed by contributors from Twitter (service), Cloudera, and the Apache Software Foundation to provide a standardized columnar file format that could be used across projects such as Apache Hadoop, Apache Spark, Apache Hive, Presto (SQL query engine), and Apache Impala. Its design aims to reduce I/O and storage costs for analytical queries typical in environments like Amazon Web Services, Google Cloud Platform, Microsoft Azure, and on-premises clusters used by organizations such as Netflix, LinkedIn, Uber Technologies, Airbnb, and Facebook. Parquet files commonly appear in data lakes built on platforms including Hadoop Distributed File System, Amazon S3, Google Cloud Storage, and Azure Blob Storage, and are consumed by tools such as Pandas (software), Dask (software), PrestoDB, and Trino (software).

Design and Format

The Parquet format uses a nested, columnar storage layout based on the record-shredding and assembly model described in the Dremel (paper), with influences from RCFile and research on Columnar databases. Its logical schema supports complex nested types influenced by Thrift (software) and integrates with serialization systems like Apache Avro and Protocol Buffers. The physical layout organizes data into row groups, column chunks, and pages, enabling predicate pushdown and vectorized processing in engines such as Apache Spark SQL, Apache Flink, Presto, and Apache Drill. Parquet metadata includes a footer and page headers that allow readers to discover schema and statistics without scanning the full dataset, facilitating integration with catalogs like Apache Hive Metastore, AWS Glue Data Catalog, and Apache Atlas.

Compression and Encoding

Parquet supports multiple compression codecs and encoding schemes to optimize storage and query performance. Common codecs include Snappy (compression), Gzip, LZ4, and Zstandard, while encoding techniques include Run-Length Encoding, Dictionary Encoding, Delta Encoding, and Bit-Packing, similar to techniques used in systems like ORC (file format) and Apache Arrow. Hybrid approaches allow engines such as Spark and Presto to select per-column strategies, efficiently handling the value types found in datasets ingested from systems like MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, MongoDB, and Cassandra (database).
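The hybrid dictionary/run-length approach can be illustrated with a toy sketch in pure Python. This is not Parquet's actual wire format (real Parquet bit-packs the codes and run lengths); it only shows why the combination compresses low-cardinality columns well:

```python
# Toy sketch of dictionary encoding followed by run-length encoding,
# the hybrid strategy Parquet applies to low-cardinality columns.
# (Illustrative only; not Parquet's actual on-disk encoding.)
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def run_length_encode(codes):
    """Collapse runs of repeated codes into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dictionary_encode(column)
print(dictionary)                # ['DE', 'US']
print(codes)                     # [1, 1, 1, 0, 0, 1]
print(run_length_encode(codes))  # [(1, 3), (0, 2), (1, 1)]
```

A repeated string value is stored once in the dictionary, and long runs of the same code collapse to a single pair, which is why sorted or clustered columns compress especially well.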

Implementations and Ecosystem

Parquet has wide language and framework support through libraries and bindings for Java (programming language), C++, Python (programming language), Go (programming language), Rust (programming language), and .NET Framework. Implementations appear in projects including Apache Parquet (library), parquet-mr, parquet-cpp, pyarrow, fastparquet, and integrations in platforms like Snowflake Computing, Databricks, Confluent, Kubernetes, and Apache NiFi. Ecosystem tools for schema management, validation, and migration interoperate with systems such as Apache Avro, Protobuf, Thrift, JSON Schema, and governance tools like Apache Ranger and Cloudera Manager.

Performance and Use Cases

Parquet excels for analytical queries, OLAP workloads, and batch processing where column pruning, predicate pushdown, and vectorized execution reduce I/O. It is widely used in ETL pipelines built with Apache Beam, Apache Airflow, Luigi (software), and Sqoop, and in interactive analytics with Presto, Trino, Apache Drill, Apache Superset, and Metabase. Use cases include event analytics for companies like Segment (company), ad-tech reporting at The Trade Desk, time-series aggregation in InfluxData, and large-scale machine learning feature stores used by Google, Facebook, Uber, and Netflix. Parquet’s columnar layout also complements in-memory columnar formats like Apache Arrow to accelerate data interchange between systems such as Pandas, Dask, Ray (software), and Modin.
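The predicate-pushdown behavior described above can be sketched in pure Python. The dictionaries below stand in for row groups with footer min/max statistics (the names and layout are illustrative, not Parquet's API); a reader skips any group whose statistics prove no row can match:

```python
# Minimal sketch of predicate pushdown: skip row groups whose footer
# statistics rule out a range filter. (Illustrative, not Parquet's API.)
row_groups = [
    {"min": 0,  "max": 9,  "rows": list(range(0, 10))},
    {"min": 10, "max": 19, "rows": list(range(10, 20))},
    {"min": 20, "max": 29, "rows": list(range(20, 30))},
]


def scan(row_groups, lo, hi):
    """Return values in [lo, hi], reading only groups that can overlap it."""
    out, groups_read = [], 0
    for rg in row_groups:
        if rg["max"] < lo or rg["min"] > hi:
            continue  # pruned: statistics exclude the whole group
        groups_read += 1
        out.extend(v for v in rg["rows"] if lo <= v <= hi)
    return out, groups_read


values, groups_read = scan(row_groups, 12, 15)
print(values)       # [12, 13, 14, 15]
print(groups_read)  # 1 -- two of three row groups were never read
```

Real engines apply the same idea per column chunk and page, which is why selective filters on sorted data can skip most of a file's bytes.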

History and Development

Parquet’s origins trace to collaborative efforts in the early 2010s among engineers from Twitter (service), Cloudera, and contributors to the Apache Software Foundation ecosystem seeking a performant, open columnar format. The format incorporated lessons from RCFile, ORC (file format), and academic research such as Dremel (paper), and leveraged serialization conventions from Thrift (software) and Avro (software). Over time, governance and specification work moved under the Apache Software Foundation umbrella, with ongoing contributions from corporations including Google, Facebook, Amazon (company), Microsoft, Snowflake Computing, and community projects like Apache Arrow and Apache Spark. Continuous extensions added support for schema evolution, complex nested types, encryption, and improved statistics for query planning used by engines like Impala and Hive. The format continues to evolve through community-driven proposals and implementations across major cloud providers and analytics vendors.

Category:File formats