
Amundsen (software)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Name: Amundsen
Developer: Lyft
Release date: 2019
Programming language: Python, JavaScript
License: Apache License 2.0
Repository: GitHub

Amundsen is an open-source data discovery and metadata engine originally developed by Lyft to improve data productivity inside large technology organizations. It indexes, surfaces, and visualizes metadata from data platforms such as Apache Hive, Apache Spark, Snowflake, and BigQuery, so that analysts and engineers can locate, understand, and trust datasets. It can be deployed on cloud providers such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure, and is maintained by a community of contributors on GitHub.

Overview

Amundsen provides a centralized metadata catalog that combines dataset, table, column, user, and dashboard information. Lyft built it for internal needs similar to those that motivated Apache Atlas and LinkedIn's DataHub, then released it as open source. It complements orchestration tools such as Apache Airflow and Prefect rather than replacing them. The platform emphasizes search-driven discovery for analytics teams and can display lineage where ingestion jobs supply it.

Architecture and Components

The architecture separates the frontend, API, and metadata storage layers into independently deployable services. Core components are a frontend service (a Flask application that serves a React user interface written in TypeScript), a metadata service (a Flask API backed by a graph store such as Neo4j, or by Apache Atlas), and a search service backed by an indexing engine such as Elasticsearch or a compatible engine like OpenSearch. A separate ingestion library, Databuilder, populates both the graph and the search index and is commonly scheduled with Apache Airflow or a comparable orchestrator such as Dagster. The default setup relies only on the graph store and the search index; additional stores, such as PostgreSQL for user and usage metrics or a JanusGraph-compatible graph for provenance and lineage, are not part of that default and would need custom integration.
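
The graph-backed metadata layer can be pictured with a small in-memory model. The following is a minimal sketch, not Amundsen's actual code: the `MetadataGraph` class, the `table_detail` function, the relation names `COLUMN` and `OWNER`, and the example table key are all hypothetical, and a plain Python dictionary stands in for the graph database.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Toy in-memory stand-in for a graph store (hypothetical, not a real client)."""
    nodes: dict = field(default_factory=dict)  # key -> {"label": ..., **props}
    edges: list = field(default_factory=list)  # (source_key, relation, target_key)

    def add_node(self, key, label, **props):
        self.nodes[key] = {"label": label, **props}

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    def neighbors(self, source, relation):
        """Return keys reachable from `source` over edges of one relation type."""
        return [t for s, r, t in self.edges if s == source and r == relation]


def table_detail(graph, table_key):
    """Assemble what an API layer might return for a table detail page."""
    table = graph.nodes[table_key]
    column_keys = graph.neighbors(table_key, "COLUMN")
    return {
        "key": table_key,
        "description": table.get("description", ""),
        "columns": [graph.nodes[c]["name"] for c in column_keys],
        "owners": graph.neighbors(table_key, "OWNER"),
    }


graph = MetadataGraph()
table_key = "hive://gold.core/rides"  # hypothetical key: database://cluster.schema/table
graph.add_node(table_key, "Table", description="One row per completed ride")
for name in ("ride_id", "rider_id", "fare_usd"):
    col_key = f"{table_key}/{name}"
    graph.add_node(col_key, "Column", name=name)
    graph.add_edge(table_key, "COLUMN", col_key)
graph.add_node("alice@example.com", "User")
graph.add_edge(table_key, "OWNER", "alice@example.com")

detail = table_detail(graph, table_key)
print(detail["columns"])
print(detail["owners"])
```

Keeping tables, columns, and users as nodes joined by typed edges is what lets a single lookup answer "which columns does this table have and who owns it" without joins across separate stores.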

Metadata Ingestion and Lineage

Ingestion pipelines extract metadata from sources such as Apache Hive, MySQL, PostgreSQL, Snowflake, and BigQuery, and from business intelligence tools such as Tableau and Looker. Connectors reach these sources through standard database interfaces such as JDBC and ODBC, or through vendor APIs such as that of Microsoft Power BI. Lineage is captured by integrating with orchestration and compute platforms such as Apache Spark, Apache Airflow, and dbt (data build tool), which allows downstream impact to be traced across ETL jobs, SQL queries, and dashboard widgets. Ingestion runs primarily as batch jobs, and metadata can also be consumed from messaging systems such as Apache Kafka.
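
The extract-then-load pattern and the downstream lineage walk can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's ingestion framework: an in-memory SQLite database stands in for the source system, and the functions `extract_columns`, `load`, and `downstream`, as well as the lineage edges, are hypothetical.

```python
import sqlite3
from collections import deque


def extract_columns(conn):
    """Yield one record per column by reading the source's own catalog."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, col_type, *_rest in conn.execute(f"PRAGMA table_info('{table}')"):
            yield {"table": table, "column": name, "type": col_type}


def load(records):
    """Group column records by table, as a loader might before publishing."""
    catalog = {}
    for rec in records:
        catalog.setdefault(rec["table"], []).append((rec["column"], rec["type"]))
    return catalog


def downstream(lineage, start):
    """Breadth-first walk over 'feeds into' edges, returning affected assets in order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, fare_usd REAL)")
conn.execute("CREATE TABLE riders (rider_id INTEGER, name TEXT)")
catalog = load(extract_columns(conn))

# Hypothetical lineage edges, e.g. as reported by an orchestrator or a SQL parser.
lineage = {
    "rides": ["daily_revenue"],
    "riders": ["daily_revenue"],
    "daily_revenue": ["revenue_dashboard"],
}
affected = downstream(lineage, "rides")
print(catalog["rides"])
print(affected)
```

The breadth-first walk lists nearer dependents first, which is the order a user usually wants when asking what a schema change to `rides` would break.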

Search and Discovery Features

Amundsen offers full-text search powered by Elasticsearch and faceted navigation by tags, owners, and domains. The UI provides column-level descriptions, popularity metrics derived from query logs, and owner contact information that can be synchronized with directories such as Okta and Active Directory. Search relevance can be tuned with usage signals so that frequently queried tables rank higher; sourcing such signals from monitoring systems like Prometheus or Datadog would typically require a custom ingestion job. The project supports curated badges and can display glossary terms and dataset profiling statistics when an ingestion job supplies them from external tools such as Collibra or Great Expectations.
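
How text matching, a popularity signal, and a tag facet can combine into one ranking is easiest to see in a toy example. The sketch below is a simplified stand-in and does not reflect how Elasticsearch scores documents: the catalog entries, the `usage` counts, and the scoring formula (matched terms multiplied by a log-scaled usage boost) are assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical catalog entries; `usage` stands in for query-log read counts.
TABLES = [
    {"name": "rides", "description": "completed rides with fare",
     "tags": {"core"}, "usage": 900},
    {"name": "rides_staging", "description": "raw rides before cleanup",
     "tags": {"staging"}, "usage": 12},
    {"name": "riders", "description": "rider profiles",
     "tags": {"core", "pii"}, "usage": 300},
]


def build_index(tables):
    """Map each lowercase token in name/description to the tables containing it."""
    index = defaultdict(set)
    for i, table in enumerate(tables):
        text = f"{table['name'].replace('_', ' ')} {table['description']}"
        for token in text.lower().split():
            index[token].add(i)
    return index


def search(tables, index, query, tag=None):
    """Rank by matched-term count boosted by log-scaled usage; optionally filter by tag."""
    hits = defaultdict(int)
    for token in query.lower().split():
        for i in index.get(token, ()):
            hits[i] += 1
    scored = []
    for i, matched in hits.items():
        table = tables[i]
        if tag is not None and tag not in table["tags"]:
            continue  # facet filter: drop results that lack the requested tag
        score = matched * (1.0 + math.log1p(table["usage"]))
        scored.append((score, table["name"]))
    return [name for _score, name in sorted(scored, reverse=True)]


index = build_index(TABLES)
all_hits = search(TABLES, index, "rides")
staging_hits = search(TABLES, index, "rides", tag="staging")
print(all_hits)
print(staging_hits)
```

The logarithm keeps a heavily used table ahead of a rarely used one without letting raw read counts drown out the text match entirely.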

Deployment and Operations

Amundsen's services are typically packaged as Docker containers and run on a container orchestration platform such as Kubernetes, with CI/CD pipelines implemented in Jenkins or GitHub Actions. Production deployments integrate observability stacks like Prometheus and Grafana alongside logging solutions such as Elasticsearch and Fluentd. Cloud-native deployments often use managed services from Amazon Web Services (EKS, RDS), Google Cloud Platform (GKE, Cloud SQL), or Microsoft Azure (AKS, Azure Database). Backup and disaster recovery follow standard practice for stateful services, with the graph store and the search index being the components that hold state.
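
In containerized deployments, each service usually learns where its peers live from environment variables. The following is a minimal sketch of that pattern, assuming the variable names `SEARCHSERVICE_BASE` and `METADATASERVICE_BASE` and the local default ports shown; these names and defaults, the `load_service_urls` helper, and the `amundsen-metadata` host name are illustrative and should be checked against the configuration of the release being deployed.

```python
import os
from urllib.parse import urlparse

# Assumed variable names and local defaults, for illustration only.
DEFAULTS = {
    "SEARCHSERVICE_BASE": "http://localhost:5001",
    "METADATASERVICE_BASE": "http://localhost:5002",
}


def load_service_urls(env=None):
    """Resolve each service base URL from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    urls = {}
    for name, default in DEFAULTS.items():
        value = env.get(name, default)
        parsed = urlparse(value)
        # Fail fast at startup instead of at the first request to a bad address.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"{name} must be an http(s) URL, got {value!r}")
        urls[name] = value.rstrip("/")
    return urls


# In Kubernetes the value would typically point at an in-cluster Service name
# (the host name below is hypothetical).
urls = load_service_urls({"METADATASERVICE_BASE": "http://amundsen-metadata:5002/"})
print(urls)
```

Validating the URLs once at startup makes a misconfigured container fail its first health check rather than surface as intermittent request errors later.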

Security and Access Control

Amundsen delegates authentication to external identity providers such as Okta, Auth0, and LDAP, and role-based or attribute-based access control can be layered on top in combination with them. Integration with policy engines like Open Policy Agent enables fine-grained metadata governance aligned with regulatory requirements such as GDPR and with the needs of privacy engineering teams. Inter-service communication can be protected with TLS or mutual TLS, and audit logs can be exported to SIEM platforms such as Splunk for incident response.
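
The difference between the role-based and attribute-based checks can be shown in a few lines. This is a hypothetical sketch rather than a feature of Amundsen itself: the role names, the `pii_cleared` attribute, the `pii` tag rule, and the `is_allowed` and `audit_record` functions are all assumptions, and a real deployment would more likely delegate the decision to the identity provider or a policy engine.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission mapping (the role-based part).
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "steward": {"read", "edit_description", "edit_owner"},
}


@dataclass(frozen=True)
class User:
    email: str
    roles: frozenset
    pii_cleared: bool = False  # attribute that might come from the identity provider


@dataclass(frozen=True)
class Resource:
    key: str
    tags: frozenset


def is_allowed(user, action, resource):
    """Allow only if some role grants the action AND the attribute rule passes."""
    granted = any(action in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)
    if not granted:
        return False
    # Attribute-based rule: metadata tagged "pii" requires a cleared user.
    if "pii" in resource.tags and not user.pii_cleared:
        return False
    return True


def audit_record(user, action, resource):
    """Build a structured audit entry that could be shipped to a log pipeline."""
    return {
        "user": user.email,
        "action": action,
        "resource": resource.key,
        "allowed": is_allowed(user, action, resource),
    }


riders = Resource("hive://gold.core/riders", frozenset({"pii"}))
viewer = User("bob@example.com", frozenset({"viewer"}))
steward = User("alice@example.com", frozenset({"steward"}), pii_cleared=True)

viewer_entry = audit_record(viewer, "read", riders)
steward_entry = audit_record(steward, "edit_owner", riders)
print(viewer_entry)
print(steward_entry)
```

Recording denied attempts as well as granted ones is what makes the resulting log useful for incident response.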

Adoption and Community

Since its release, Amundsen has attracted contributors from organizations such as Lyft, Expedia Group, Airbnb, and CERN. Development takes place in a public GitHub repository, where issues and pull requests are tracked and continuous integration runs on GitHub Actions; governance follows an open model comparable to that of Apache Software Foundation incubator projects. The community organizes meetups and discussions on platforms including Slack and Discourse, and at conferences such as Strata Data Conference and KubeCon. Commercial support and integrations are offered by ecosystem vendors and consultancies that serve enterprise customers.

Categories: Free software, Data management software, Open-source software projects