| Apache Atlas | |
|---|---|
| Name | Apache Atlas |
| Developer | Apache Software Foundation |
| Initial release | 2015 |
| Programming language | Java |
| Operating system | Linux, Windows, macOS |
| License | Apache License 2.0 |
Apache Atlas is an open-source metadata management and data governance platform developed under the Apache Software Foundation. It provides services for metadata discovery, lineage, classification, and policy-driven governance across data ecosystems such as Hadoop, Apache Hive, Apache Kafka, and Apache Spark. Organizations in finance, healthcare, telecommunications, and technology use Atlas to meet compliance obligations, integrate with data catalogs, and enable data stewardship workflows.
Atlas centralizes metadata about datasets, schemas, processes, and business glossaries to support stewardship and discovery across platforms from Cloudera and Hortonworks and on clouds such as Microsoft Azure and Amazon Web Services. It collects technical and business metadata from sources including Apache HBase, HDFS, and Apache NiFi, and shares classifications with Apache Ranger. Stakeholders such as data engineers, data scientists, and compliance officers can annotate assets with classifications tied to regulations like GDPR, HIPAA, and the Sarbanes–Oxley Act. Atlas exposes REST APIs and a web UI, and supports authentication mechanisms such as LDAP and Kerberos for enterprise workflows.
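To illustrate the REST-based discovery described above, the following minimal sketch builds a URL for a classification-filtered search. The base URL, the `/search/basic` path, and the parameter names are assumptions modeled on the commonly documented Atlas v2 API, and `basic_search_url` is a hypothetical helper; check them against the API reference for the release in use.

```python
from urllib.parse import urlencode

# Assumed base URL: 21000 is the port commonly used by a default Atlas install.
ATLAS_BASE = "http://localhost:21000/api/atlas/v2"


def basic_search_url(type_name, classification=None, limit=25):
    """Build a URL for a basic search (assumed path: /search/basic).

    Hypothetical helper: it only assembles the URL and sends nothing.
    """
    params = {"typeName": type_name, "limit": limit}
    if classification:
        # Restrict results to entities carrying this classification (tag).
        params["classification"] = classification
    return f"{ATLAS_BASE}/search/basic?{urlencode(params)}"


url = basic_search_url("hive_table", classification="PII")
print(url)
```

Keeping URL construction separate from the HTTP call makes the query easy to inspect and test without a running server.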
Atlas implements a modular architecture comprising a metadata repository, a type system, a REST API, and a web UI. The metadata repository persists entity graphs and classifications in JanusGraph, which typically stores its data in Apache HBase and indexes it for search with Apache Solr or Elasticsearch. The type system models datasets, processes, columns, and terms as entities connected by relationships, enabling lineage across processing engines such as Apache Flink and Presto. The REST API supports programmatic ingestion and retrieval, and the web UI supports exploration similar to catalog offerings from Collibra and Alation. Atlas can run within container orchestration platforms like Kubernetes and integrates with configuration management tools such as Ansible and Chef.
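As a rough illustration of how the type system and REST API fit together, the sketch below assembles two JSON payloads: a set of type definitions and an entity instance of the new type. The type name `sample_table`, its `retentionDays` attribute, and the qualified name are hypothetical; the field names follow the shape commonly documented for the Atlas v2 API and should be treated as assumptions.

```python
import json


def typedefs_payload():
    """Type definitions: one classification and one entity type (both hypothetical)."""
    return {
        "classificationDefs": [
            {"name": "PII", "description": "Personally Identifiable Information",
             "superTypes": [], "attributeDefs": []}
        ],
        "entityDefs": [
            {
                "name": "sample_table",       # hypothetical custom type
                "superTypes": ["DataSet"],    # assumed built-in base type
                "attributeDefs": [
                    {"name": "retentionDays", "typeName": "int",
                     "isOptional": True, "cardinality": "SINGLE",
                     "isUnique": False, "isIndexable": False}
                ],
            }
        ],
    }


def entity_payload(qualified_name, name, retention_days):
    """An entity instance of the hypothetical sample_table type."""
    return {
        "entity": {
            "typeName": "sample_table",
            "attributes": {
                "qualifiedName": qualified_name,  # unique key for the entity
                "name": name,
                "retentionDays": retention_days,
            },
        }
    }


body = json.dumps(entity_payload("sales.orders@cluster1", "orders", 90), indent=2)
print(body)
```

Under these assumptions, the first payload would be posted to the type-definition endpoint and the second to the entity endpoint; neither request is sent here.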
Atlas provides classification, glossary management, data lineage, and metadata discovery. Classifications tag entities with labels such as Personally Identifiable Information (PII) or annotations related to the Payment Card Industry Data Security Standard (PCI DSS), so that policies can be enforced by Apache Ranger or external policy engines. The glossary supports hierarchical business terms and stewardship assignments, mirroring governance programs at organizations like IBM and Microsoft. Lineage captures upstream and downstream dependencies across orchestration engines like Apache Oozie and Apache Airflow, facilitating impact analysis and incident investigation for compliance frameworks including SOX and PCI DSS.
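Impact analysis over lineage amounts to a graph traversal. The sketch below walks a lineage result breadth-first to collect everything downstream of one entity. The `relations`, `fromEntityId`, and `toEntityId` field names are assumptions modeled on the lineage response commonly documented for Atlas, and the sample data is invented for illustration.

```python
from collections import deque


def downstream(lineage, start_guid):
    """Return the GUIDs reachable downstream of start_guid.

    `lineage` is assumed to hold a "relations" list of directed edges,
    each with "fromEntityId" and "toEntityId" keys.
    """
    edges = {}
    for rel in lineage.get("relations", []):
        edges.setdefault(rel["fromEntityId"], []).append(rel["toEntityId"])

    seen = set()
    queue = deque([start_guid])
    while queue:
        guid = queue.popleft()
        for nxt in edges.get(guid, []):
            # Skip nodes already visited so cycles cannot loop forever.
            if nxt not in seen and nxt != start_guid:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Invented sample: t0 -> p0 -> t1 -> p1 -> t2 (tables linked by processes).
sample = {
    "baseEntityGuid": "t1",
    "relations": [
        {"fromEntityId": "t1", "toEntityId": "p1"},
        {"fromEntityId": "p1", "toEntityId": "t2"},
        {"fromEntityId": "t0", "toEntityId": "p0"},
        {"fromEntityId": "p0", "toEntityId": "t1"},
    ],
}
impacted = downstream(sample, "t1")
print(sorted(impacted))
```

Reversing the edge direction in `edges` would give the upstream view, which is the usual starting point for tracing where a bad value came from.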
Atlas integrates broadly with the Hadoop ecosystem and offers connectors for Apache Kafka, metadata bridges for AWS Glue, and links to data catalog tools, including services on Google Cloud Platform. It exposes plugins for ingestion from systems like Oracle Database and Microsoft SQL Server and works alongside data processing frameworks including Apache Beam and Apache Storm. Distributions from Cloudera and Hortonworks (which merged in 2019) have provided packaged integrations, and third-party projects extend Atlas connectivity to Snowflake and Databricks.
Atlas supports authentication via Kerberos, LDAP, and OAuth integrations with identity providers including Active Directory and Okta. Authorization and fine-grained access control can be enforced through integration with Apache Ranger or enterprise policy managers; audit trails capture metadata changes for regulatory review by auditors from institutions like Deloitte, PwC, and KPMG. Tagged data subject to regulations such as GDPR and HIPAA can be discovered and documented to support breach notification processes and data subject access requests managed by legal teams.
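For deployments that accept username and password credentials (for example file-based or LDAP-backed logins), a client would typically send HTTP Basic authentication. The sketch below prepares such a request with the standard library without sending it. The URL, the `admin`/`admin` credentials, and the `atlas_request` helper are placeholders; Kerberos-secured clusters need SPNEGO negotiation instead, which is not shown.

```python
import base64
import urllib.request


def atlas_request(url, username, password):
    """Prepare (but do not send) a GET request with HTTP Basic credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")
    return req


# Placeholder URL and credentials, for illustration only.
req = atlas_request(
    "http://localhost:21000/api/atlas/v2/types/typedefs/headers",
    "admin",
    "admin",
)
print(req.get_method(), req.full_url)
```

Basic credentials are only base64-encoded, not encrypted, so requests like this should travel over TLS outside a local test environment.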
Atlas may be deployed on-premises, in hybrid clouds, or on public clouds such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform. Production deployments commonly use high-availability (HA) configurations that rely on ZooKeeper for coordination, with JanusGraph storing data in Cassandra or HBase and indexing with Elasticsearch or Solr. Operational concerns include backup and restore of metadata graphs, monitoring via Prometheus and Grafana, and CI/CD pipelines using Jenkins or GitLab CI for configuration management. Containerized deployments use Docker, with Kubernetes orchestration for scaling and resilience.
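In an HA setup, clients need to find the instance that is currently serving requests. The sketch below picks the active node from a set of status responses. It assumes each server exposes an admin status endpoint returning JSON such as `{"Status": "ACTIVE"}` or `{"Status": "PASSIVE"}`; the host names, endpoint path, and `pick_active` helper are hypothetical, and the responses are supplied as strings so that the selection logic runs without a cluster.

```python
import json


def pick_active(status_by_url):
    """Return the first URL whose status body reports ACTIVE, else None.

    `status_by_url` maps a server URL to the raw response body fetched
    from its (assumed) admin status endpoint.
    """
    for url, body in status_by_url.items():
        try:
            status = json.loads(body).get("Status")
        except (ValueError, TypeError, AttributeError):
            # Unreachable or misbehaving node: ignore it and keep looking.
            continue
        if status == "ACTIVE":
            return url
    return None


# Hypothetical hosts and response bodies.
responses = {
    "http://atlas-1:21000": '{"Status": "PASSIVE"}',
    "http://atlas-2:21000": '{"Status": "ACTIVE"}',
    "http://atlas-3:21000": "not json",
}
active = pick_active(responses)
print(active)
```

A monitoring job could run the same check on a schedule and alert when no instance reports an active status.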
The project originated within the Hortonworks community to standardize metadata management across the Hadoop stack. It entered the Apache Incubator in 2015 and became a top-level Apache project in 2017. Contributions have come from engineering teams at Facebook, Netflix, Yahoo!, and enterprise distribution vendors such as Cloudera. Over successive releases, Atlas added an extensible type system, richer lineage capture, and enhanced security integrations, influenced by governance needs voiced by customers in sectors like finance and healthcare. Development continues in open-source repositories, with contributions coordinated through mailing lists and the governance processes of the Apache Software Foundation.