LLMpedia: The first transparent, open encyclopedia generated by LLMs

OpenLineage

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Airflow (software) Hop 5
Expansion Funnel: Extracted 107 → After dedup 0 → After NER 0 → Enqueued 0
OpenLineage
Name: OpenLineage
Developer: Linux Foundation
Released: 2020
Programming language: Python, Java
Operating system: Cross-platform
License: Apache License 2.0

OpenLineage is an open specification and metadata framework for capturing lineage across data processing ecosystems. It enables interoperability among platforms such as Apache Airflow, Apache Spark, Databricks, Google Cloud Platform, and Amazon Web Services, and supports governance initiatives driven by organizations like the Linux Foundation, CNCF, and OpenSSF. The project aligns with compliance efforts involving the General Data Protection Regulation and the California Consumer Privacy Act, and with enterprise programs from IBM, Microsoft, Snowflake Inc., and Confluent.

History

The project emerged in 2020 following collaborations among engineers from Databricks, WeWork, Expedia Group, Adobe Inc., and contributors associated with the Linux Foundation. Early momentum followed presentations at conferences such as Strata Data Conference, Spark + AI Summit, and KubeCon + CloudNativeCon, and drew comparisons to initiatives led by Apache Atlas, DataHub, Amundsen (software) and LinkedIn. Governance and stewardship conversations involved participants from Cloudera, Microsoft Research, Netflix, Airbnb, Oracle Corporation and community meetups with members of O’Reilly Media and IEEE working groups. Adoption accelerated as cloud providers and platform vendors integrated lineage capture into products showcased at events like AWS re:Invent and Google Cloud Next.

Design and Architecture

The architecture centers on an event-driven model using JSON-based payloads transmitted via transport layers such as Apache Kafka, RabbitMQ, or HTTP to collectors and backends including Elasticsearch, PostgreSQL, Neo4j, and object stores like Amazon S3. The design maps tasks and datasets to canonical entities similar to concepts in the OpenAPI Specification and W3C PROV, enabling interoperability with graph databases and metadata catalogs such as Neo4j and JanusGraph. Components include producers integrated into orchestration engines such as Apache Airflow and compute frameworks like Apache Spark and Flink (software), plus lineage consumers used by governance tools from Collibra, Alation, and Informatica, and by cloud services such as Google BigQuery. The modular design supports SDKs in languages such as Python (programming language) and Java (programming language), and integrates with CI/CD pipelines orchestrated by Jenkins, GitHub Actions, and GitLab.
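The producer side of this event-driven model can be sketched as a minimal payload builder. The core field names (eventType, eventTime, run.runId, job.namespace/name, producer) follow the published run-event structure, but this is a simplified sketch rather than the official Python client, and the producer URI and collector endpoint shown in comments are hypothetical placeholders.

```python
import json
import uuid
from datetime import datetime, timezone

def build_run_event(event_type, job_namespace, job_name, run_id=None):
    """Build a minimal OpenLineage-style run-event payload (core fields only)."""
    return {
        "eventType": event_type,                        # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},  # unique id for this run
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [],                                   # input dataset references
        "outputs": [],                                  # output dataset references
        "producer": "https://example.com/my-producer",  # hypothetical producer URI
    }

start = build_run_event("START", "example_namespace", "daily_etl")
payload = json.dumps(start)

# A producer would then emit `payload` over a transport such as HTTP or Kafka,
# e.g. POSTing to a collector endpoint (path here is a hypothetical placeholder):
#   http://localhost:5000/api/v1/lineage
```

Separating payload construction from transport mirrors the architecture described above: the same event can be handed to an HTTP, Kafka, or RabbitMQ transport unchanged.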

Specification

The specification defines a core JSON schema for run events, job facets, dataset facets, and the relationships among runs, jobs, and datasets, drawing conceptual parallels to provenance work by the W3C, ISO/IEC, and the National Institute of Standards and Technology. The specification is versioned according to Semantic Versioning, under governance models comparable to those of the IETF. The schema supports facets such as schema change, data quality, and audit metadata that echo metadata models in Dublin Core and the Schema.org vocabulary, while facilitating export to formats used by Apache Parquet, Avro (data serialization system), and ORC (file format).
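The run/job/dataset relationships the schema defines can be illustrated with a small structural check. This is a deliberately simplified sketch: the required-field set below is an assumption for illustration, and the real specification defines more required fields (such as schemaURL) and is enforced with full JSON Schema validation rather than hand-written checks.

```python
# Simplified required-field set (assumption for illustration; the spec
# defines more, e.g. schemaURL, and uses JSON Schema validation).
REQUIRED_TOP_LEVEL = {"eventTime", "producer", "run", "job"}

def validate_run_event(event):
    """Return a list of structural problems (empty list = passes this check)."""
    problems = ["missing " + k for k in sorted(REQUIRED_TOP_LEVEL - event.keys())]
    if "run" in event and "runId" not in event["run"]:
        problems.append("run.runId is required")
    if "job" in event:
        for k in ("namespace", "name"):
            if k not in event["job"]:
                problems.append("job." + k + " is required")
    return problems
```

A consumer might run such a check on ingestion to reject malformed events before writing them to a lineage backend.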

Implementations and Integrations

Implementations include SDKs and plugins for orchestration platforms such as Apache Airflow, Dagster, and Prefect (software), and for managed services from Databricks and Google Cloud Dataflow. Integrations extend to messaging systems like Apache Kafka and Google Pub/Sub; storage backends including Amazon S3, Google Cloud Storage, and Azure Blob Storage; and analytics engines like Snowflake, Redshift, Presto (software), and Trino (software). Community-driven collectors and lineage sinks interface with catalogs such as DataHub (software), Apache Atlas, and Amundsen (software), and with visualization tools like Grafana and Kibana.

Use Cases and Adoption

Common use cases encompass impact analysis in enterprises such as Capital One, HSBC, and Goldman Sachs; regulatory reporting for institutions like Deutsche Bank and Barclays; debugging and incident response at technology companies including Airbnb, Uber, LinkedIn, and Netflix; and data catalog enrichment at Spotify and Pinterest. Adoption scenarios include ML model lineage tracking for platforms influenced by TensorFlow and PyTorch, and MLOps workflows patterned after Kubeflow and MLflow (software). Vendors and cloud providers incorporate lineage capture for compliance programs involving SOX (law), HIPAA, and PCI DSS.
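Impact analysis, the first use case above, amounts to traversing the dataset-to-dataset graph assembled from collected run events. The sketch below uses a hypothetical lineage graph and plain breadth-first search; it illustrates the technique in general rather than any specific vendor implementation.

```python
from collections import deque

# Hypothetical lineage graph: dataset -> set of directly downstream datasets,
# as would be derived from the inputs/outputs of collected run events.
EDGES = {
    "raw.orders": {"staging.orders"},
    "staging.orders": {"marts.daily_revenue", "marts.customer_ltv"},
    "marts.daily_revenue": set(),
    "marts.customer_ltv": set(),
}

def downstream_impact(graph, dataset):
    """Breadth-first search: every dataset affected by a change to `dataset`."""
    seen, queue = set(), deque([dataset])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):  # follow downstream edges
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Given a proposed schema change to `raw.orders`, the traversal reports every table that would need review, which is the core of impact analysis and of regulatory lineage reporting alike.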

Governance and Community

The project operates under a foundation-affiliated governance model with contributors from corporations such as Databricks, Confluent, and Iguazio, alongside representatives from Linux Foundation projects, mirroring models used by Kubernetes and Istio. Community engagement occurs via mailing lists, GitHub repositories, and working groups that coordinate with standards bodies like the W3C and OASIS and collaborate with adjacent projects including DataHub (software), Apache Atlas, and the OpenMetadata community. Contributor agreements and licensing adhere to practices common to Apache Software Foundation projects and the Linux Foundation.

Security and Privacy Considerations

Security considerations include authentication and authorization using standards such as OAuth 2.0, OpenID Connect, and encryption practices recommended by NIST and implemented in cloud platforms like AWS Identity and Access Management, Google Cloud IAM, and Azure Active Directory. Privacy controls address lineage metadata that may implicate subject data governed by General Data Protection Regulation and California Consumer Privacy Act requirements, requiring techniques similar to those described in ISO/IEC 27001 and privacy engineering guidance from IAPP. Operational hardening aligns with incident response playbooks used by SANS Institute and best practices deployed by enterprises including IBM and Microsoft Corporation.
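One common way to apply the privacy controls described above is to mask facet payloads that may carry subject data before events leave a producer. The sketch below is an assumption-laden illustration: the deny-list of facet keys is hypothetical, and this redaction step is not part of the OpenLineage specification itself.

```python
import copy

# Hypothetical deny-list of facet keys whose values may embed personal data
# (e.g. literal SQL text or captured environment variables).
SENSITIVE_FACETS = {"sql", "sourceCode", "environmentVariables"}

def redact_facets(event):
    """Return a copy of a run event with sensitive facet payloads masked."""
    redacted = copy.deepcopy(event)  # never mutate the caller's event
    for section in ("run", "job"):
        facets = redacted.get(section, {}).get("facets", {})
        for key in list(facets):
            if key in SENSITIVE_FACETS:
                # Mask the payload but keep the key, so auditors can see
                # that a facet existed without seeing its contents.
                facets[key] = {"_redacted": True}
    return redacted
```

Keeping the facet key while masking its value preserves an audit trail, which suits the GDPR/CCPA-style requirements the section describes better than silently dropping the facet.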

Category:Data management