| Parquet (software) | |
|---|---|
| Name | Parquet |
| Developer | Apache Software Foundation |
| Released | 2013 |
| Programming language | Java, C++, Python, Rust |
| Operating system | Cross-platform |
| License | Apache License 2.0 |
Apache Parquet is a columnar storage file format designed for high-performance analytical data processing. It was developed to optimize analytics workflows by combining a columnar layout with rich type metadata, compression, and encoding schemes to reduce I/O in Apache Hadoop-based and cloud-native data pipelines. Parquet is maintained under the Apache Software Foundation and is widely supported across systems such as Apache Spark, Apache Hive, Apache Drill, and Google BigQuery.
Parquet's column-oriented storage layout enables efficient scanning, predicate pushdown, and vectorized processing in analytics engines such as Apache Spark, Presto, Trino, and Dremio. The format can represent nested data structures defined in object models such as Apache Avro, Protocol Buffers, and Thrift, while exposing rich type information to systems such as Apache Impala and ClickHouse. Each file carries metadata that allows readers, including Apache Arrow-based runtimes and machine-learning ingestion pipelines, to perform schema discovery and selective reads without deserializing the entire file.
Parquet originated as a collaboration between engineers at Twitter and Cloudera to address the limitations of row-oriented formats for analytics on HDFS. Early design work drew contributors from other projects, including Apache Drill; the format was donated to the Apache Software Foundation and graduated to a top-level project in 2015. Subsequent development attracted engineers from major technology companies and cloud providers, leading to deep integration with Apache Spark and to implementations maintained by the Java, C++, and Python communities.
Parquet defines a self-describing file layout composed of row groups, column chunks, data pages, and a file footer that stores the schema and column metadata, allowing systems such as Apache Hive and Apache Drill to interpret a file without external schema information. The logical schema maps to physical encodings such as run-length encoding (RLE), dictionary encoding, and bit packing, implemented in runtimes including Apache Arrow and Gandiva. Nested records are supported via a definition- and repetition-level model derived from Google's Dremel system. Per-column statistics in the footer enable fast predicate pushdown in query engines such as Presto and Trino, supporting column pruning and statistics-driven planning in systems like Apache Spark SQL.
Reference implementations exist in multiple ecosystems: the canonical Java implementation (parquet-java, formerly parquet-mr), a high-performance C++ implementation maintained within Apache Arrow, and Python bindings via libraries such as pyarrow and fastparquet. Other implementations include a Rust library developed alongside DataFusion and Ballista, Go implementations used by projects such as InfluxData, and .NET bindings for Microsoft-centric stacks; engines such as DuckDB also read and write Parquet natively. These implementations enable integrations with data systems such as Snowflake, Google BigQuery, and Amazon Redshift, and with orchestration platforms like Apache Airflow.
Parquet is optimized for analytics workloads in environments such as Hadoop, AWS Glue, and Google Cloud, where columnar storage reduces I/O for aggregation and group-by queries executed by engines like Apache Spark and Presto. Typical use cases include data lake storage at organizations such as Netflix and Airbnb, ETL pipelines orchestrated by Apache NiFi or Airflow, and machine learning feature stores connected to Kubeflow or Tecton. Performance benefits stem from block-level compression with codecs such as Snappy and Zstandard, dictionary encoding of low-cardinality columns, and compatibility with vectorized execution in Apache Arrow, enabling low-latency reads for OLAP workloads.
Parquet integrates with metadata services and catalogs including Apache Hive Metastore, AWS Glue Data Catalog, and Google Cloud Data Catalog to enable schema discovery and governance in platforms such as Databricks and Cloudera Data Platform. It is supported by query engines including Presto, Trino, Apache Impala, and Apache Spark SQL, and by storage layers like HDFS, Amazon S3, and Google Cloud Storage. Ecosystem tooling such as Delta Lake, Apache Hudi, and Apache Iceberg often use Parquet as an on-disk representation while adding transactional semantics, compaction, and time-travel features.
Parquet files are binary containers with no built-in access controls, although the format specification defines modular encryption for individual columns and footers. Access control typically relies on infrastructure-level mechanisms such as Apache Ranger and cloud IAM policies. Encryption at rest and client-side encryption are provided by storage services such as Amazon S3 and Google Cloud Storage, or by key-management layers such as Hadoop Key Management Server and HashiCorp Vault. Because metadata stored in Parquet footers can leak schema-level information, governance tools such as Apache Atlas and AWS Lake Formation are commonly used to apply catalog-level access controls and schema redaction to mitigate privacy risks.