| Parquet (columnar storage format) | |
|---|---|
| Name | Parquet |
| Developer | Apache Software Foundation |
| Released | 2013 |
| Programming language | Java, C++ |
| Operating system | Cross-platform |
| License | Apache License 2.0 |
Parquet (columnar storage format) is an open-source, column-oriented data storage format designed for efficient storage and retrieval of large-scale analytical datasets. It targets high-throughput analytics workflows used by distributed systems and big data platforms, providing strong compression and encoding techniques to reduce storage costs and improve query performance. Parquet is widely adopted across cloud services, data engineering pipelines, and analytics engines.
Parquet organizes data by columns rather than by rows, enabling selective I/O for analytical queries that access only a subset of fields. This columnar layout complements systems such as Apache Hadoop, Apache Spark, Amazon Redshift, Google BigQuery, and Snowflake, and integrates with storage layers like Amazon S3, Google Cloud Storage, and Azure Blob Storage. Parquet files include metadata, schema information, and optional statistics that help query planners in systems such as Presto, Trino, Apache Impala, Apache Hive, and Dremio perform predicate pushdown and partition pruning.
Parquet's architecture defines a logical schema and a physical representation, decoupling the storage layout from logical types. The format supports nested data models, using a record-shredding and assembly approach derived from Google's Dremel paper, and is compatible with serialization libraries such as Protocol Buffers, Apache Thrift, and Apache Avro. Parquet employs a layered structure of file-level metadata, row groups, column chunks, and pages; this hierarchy allows distributed engines such as Apache Spark and Apache Flink to read subsets of data efficiently. Parquet's type system maps to languages and runtimes including Java, C++, Python, and Scala, enabling broad language interoperability.
A Parquet file stores its schema, row group locations, and column metadata in a footer; row groups are the unit of parallelism for readers such as Presto, Trino, and Apache Drill. Within column chunks, data is divided into pages (data pages and dictionary pages) that permit encoding schemes such as run-length encoding (RLE), delta encoding, and dictionary encoding. Parquet supports compression codecs including Snappy, Zstandard, Gzip, and LZO, and can combine encodings with compression to balance CPU and I/O, similar to techniques used by Apache ORC and Apache Avro. File metadata also stores column-level statistics that planners in Apache Hive and Apache Impala use to skip file ranges when evaluating predicates.
Parquet excels at analytics workloads characterized by wide tables and selective projections, commonly found in ETL pipelines on platforms such as Databricks, Google Cloud Platform, Amazon EMR, and Microsoft Azure HDInsight. Use cases include data warehousing, business intelligence with tools like Tableau, Looker, and Power BI, and machine learning feature stores used by organizations such as Netflix, Airbnb, Uber, and LinkedIn. Its columnar storage reduces disk I/O and improves cache locality for vectorized execution engines such as ClickHouse and those built on Apache Arrow. Parquet's support for nested types and schema evolution makes it suitable for event stores and time-series datasets processed by Fluentd, Logstash, and Apache Kafka connectors.
Parquet's reference implementation is maintained within the Apache Software Foundation ecosystem and is available in languages including Java, C++, Python, and R. Native readers and writers exist in projects such as Apache Spark, Apache Drill, Dask, pandas, and PyArrow, with integrations into cloud-managed services like Amazon Athena, Google BigQuery, and Azure Synapse Analytics. Tooling for schema management, validation, and migration includes platforms such as Apache Atlas, Gobblin, Apache Airflow, and dbt (data build tool). Enterprise vendors and open-source projects such as Cloudera, Hortonworks, Confluent, and Starburst Data provide optimized connectors and query accelerators.
Parquet was originally developed by engineers from Twitter and Cloudera and contributed to the Apache Software Foundation to address limitations in row-oriented formats when applied to petabyte-scale analytics. The format evolved alongside contemporaries such as Apache ORC and benefited from collaboration across companies including Netflix, Pinterest, and Facebook. Over time, standards for columnar storage and interoperability advanced through efforts involving Apache Arrow and community-driven enhancements in codec support, schema evolution, and nested type handling. Parquet continues to be driven by contributions from cloud providers, database vendors, and analytics communities within the Apache Software Foundation and the broader open-source ecosystem.
Category:Computer file formats
Category:Big data