Amundsen (software)

Name	Amundsen
Developer	Lyft
Release date	2019
Programming language	Python, JavaScript
License	Apache License 2.0
Repository	GitHub

Contents

Overview
Architecture and Components
Metadata Ingestion and Lineage
Search and Discovery Features
Deployment and Operations
Security and Access Control
Adoption and Community

Amundsen is an open-source data discovery and metadata engine originally developed by Lyft to improve the productivity of data users inside large technology organizations. It indexes, surfaces, and visualizes metadata from data platforms such as Apache Hive, Apache Spark, Snowflake, and BigQuery, helping teams at software engineering organizations such as Uber Technologies and Airbnb locate, understand, and trust datasets. The project integrates with cloud providers such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure and is maintained by a community of contributors on GitHub.

Overview

Amundsen provides a centralized metadata catalog that combines dataset, table, column, user, and dashboard information. It was open-sourced by Lyft to address internal needs similar to those that drove the creation of Apache Atlas and LinkedIn's DataHub, and it complements orchestration tools such as Apache Airflow and Prefect. The platform emphasizes search-driven discovery and automated lineage tracking for analytics teams at enterprises including Netflix and Pinterest.

Architecture and Components

The architecture separates the API, frontend, and metadata storage layers. Core components include a Flask API service, a React-based frontend built with TypeScript, and a metadata graph backed by Neo4j or an Apache Atlas-compatible store. Auxiliary services often include an indexing engine such as Elasticsearch or OpenSearch, a metadata proxy for connectors, and a task orchestration integration with Apache Airflow or Dagster. Persistent storage for user and usage metrics can be provided by PostgreSQL, while provenance and lineage may be represented in a graph compatible with JanusGraph.
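The layering described above can be illustrated with a minimal sketch in Python. It is not Amundsen's actual code: the `MetadataGraph` class, the `table_detail` function, the relation names, and the sample data are all hypothetical, and an in-memory dictionary stands in for a graph store such as Neo4j. The table keys follow a `database://cluster.schema/table` layout, which is assumed here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Hypothetical in-memory stand-in for a graph store such as Neo4j."""
    nodes: dict = field(default_factory=dict)  # key -> {"label": ..., other properties}
    edges: list = field(default_factory=list)  # (start_key, relation, end_key)

    def add_node(self, key, label, **props):
        self.nodes[key] = {"label": label, **props}

    def add_edge(self, start, relation, end):
        self.edges.append((start, relation, end))

    def neighbors(self, key, relation):
        """Keys reachable from `key` over edges of the given relation, in insertion order."""
        return [end for start, rel, end in self.edges
                if start == key and rel == relation]


def table_detail(graph, table_key):
    """Shape of a response an API layer might hand to the frontend (assumed fields)."""
    table = graph.nodes[table_key]
    return {
        "key": table_key,
        "description": table.get("description", ""),
        "columns": [graph.nodes[k]["name"]
                    for k in graph.neighbors(table_key, "COLUMN")],
        "owners": graph.neighbors(table_key, "OWNER"),
    }


# Storage layer: populate the graph with one table, two columns, and an owner.
graph = MetadataGraph()
graph.add_node("hive://gold.core/rides", "Table",
               description="One row per completed ride")
graph.add_node("hive://gold.core/rides/ride_id", "Column", name="ride_id")
graph.add_node("hive://gold.core/rides/fare_usd", "Column", name="fare_usd")
graph.add_node("jdoe@example.com", "User")
graph.add_edge("hive://gold.core/rides", "COLUMN", "hive://gold.core/rides/ride_id")
graph.add_edge("hive://gold.core/rides", "COLUMN", "hive://gold.core/rides/fare_usd")
graph.add_edge("hive://gold.core/rides", "OWNER", "jdoe@example.com")

# API layer: assemble the view of a single table for the frontend.
detail = table_detail(graph, "hive://gold.core/rides")
```

Keeping the graph behind a small interface like `neighbors` is what would allow the backing store to be swapped without changing the API layer.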

Metadata Ingestion and Lineage

Ingestion pipelines extract metadata from sources such as Apache Hive, MySQL, PostgreSQL, Snowflake, and BigQuery, and from business intelligence tools such as Tableau and Looker. Connectors rely on standard database interfaces such as JDBC and ODBC and on proprietary APIs such as that of Microsoft Power BI. Lineage is captured through integration with orchestration and compute platforms such as Apache Spark, Apache Airflow, and dbt (data build tool), enabling downstream traceability across ETL jobs, SQL queries, and dashboard widgets. The ingestion framework supports batch and streaming modes compatible with Apache Kafka and other messaging systems.
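Downstream traceability can be sketched as a breadth-first traversal over lineage edges loaded by a batch job. The example below is illustrative only: `extract_lineage`, `load_lineage`, and `downstream` are hypothetical names, the records are invented sample data, and a real deployment would read edges from a source system and write them to the metadata graph rather than to a dictionary.

```python
from collections import defaultdict, deque


def extract_lineage():
    """Hypothetical extractor: yields (upstream_key, downstream_key) pairs."""
    yield ("hive://gold.raw/events", "hive://gold.core/rides")
    yield ("hive://gold.core/rides", "hive://gold.reporting/daily_rides")
    yield ("hive://gold.reporting/daily_rides", "dashboard://tableau/ride_volume")
    yield ("hive://gold.core/rides", "dashboard://tableau/ride_volume")


def load_lineage(records):
    """Load edges into an adjacency list, skipping duplicate edges."""
    graph = defaultdict(list)
    for upstream, downstream_key in records:
        if downstream_key not in graph[upstream]:
            graph[upstream].append(downstream_key)
    return graph


def downstream(graph, start):
    """Return every entity reachable from `start`, nearest first (breadth-first)."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


lineage = load_lineage(extract_lineage())
affected = downstream(lineage, "hive://gold.raw/events")
```

Here `affected` lists the table, the reporting table, and the dashboard that depend on the raw events table, which is the kind of answer a user needs before changing or deprecating an upstream dataset.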

Search and Discovery Features

Amundsen offers full-text search powered by Elasticsearch and faceted navigation by tags, owners, and domains, a pattern familiar from organizations such as Google LLC and Facebook. The UI provides column-level descriptions, popularity metrics derived from query logs, and owner contact information synchronized with directories such as Okta and Active Directory. Search relevancy can be tuned using signals from usage statistics recorded in Prometheus or application telemetry from Datadog. The product supports curated badges, glossary terms from Collibra, and dataset profiling akin to features in Great Expectations.
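The idea of combining text relevance with a popularity signal and a tag facet can be shown with a small, self-contained sketch. It does not reproduce Elasticsearch scoring or Amundsen's ranking: the `score` and `search` functions, the scoring formula, and the sample documents are assumptions chosen for illustration.

```python
import math

# Invented sample documents; `query_count` stands in for usage derived from query logs.
DOCS = [
    {"name": "rides", "description": "completed rides per day",
     "query_count": 5000, "tags": ["core"]},
    {"name": "rides_staging", "description": "staging copy of rides",
     "query_count": 10, "tags": ["staging"]},
    {"name": "drivers", "description": "driver profiles",
     "query_count": 3000, "tags": ["core"]},
]


def score(doc, query):
    """Count whole-word matches, then boost by the log of the usage count."""
    terms = query.lower().split()
    words = f"{doc['name']} {doc['description']}".lower().split()
    matches = sum(words.count(term) for term in terms)
    if matches == 0:
        return 0.0
    return matches * (1.0 + math.log1p(doc["query_count"]))


def search(docs, query, tag=None):
    """Return matching names, best score first; `tag` acts as an optional facet filter."""
    candidates = [d for d in docs if tag is None or tag in d["tags"]]
    hits = [(score(d, query), d["name"]) for d in candidates]
    hits = [(s, name) for s, name in hits if s > 0]
    return [name for s, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


all_hits = search(DOCS, "rides")
staging_hits = search(DOCS, "rides", tag="staging")
```

The logarithm dampens the popularity signal so that a heavily queried table outranks a rarely used one without drowning out the text match entirely.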

Deployment and Operations

Amundsen is typically deployed as a set of Docker containers, often orchestrated with Kubernetes, with CI/CD pipelines implemented via Jenkins or GitHub Actions. Production deployments integrate observability stacks such as Prometheus and Grafana alongside logging solutions such as Elasticsearch and Fluentd. Cloud-native deployments often use managed services from Amazon Web Services (EKS, RDS), Google Cloud Platform (GKE, Cloud SQL), or Microsoft Azure (AKS, Azure Database). Backup and disaster recovery strategies mirror best practices used by Stripe and Shopify for stateful services.

Security and Access Control

Amundsen supports role-based and attribute-based access controls when combined with identity providers such as Okta and Auth0 or with an LDAP directory. Integration with policy engines such as Open Policy Agent enables fine-grained metadata governance aligned with the practices of GDPR compliance teams and privacy engineering groups at Facebook and Google LLC. TLS encryption and mutual TLS are supported for inter-service communication, and audit logs can be exported to SIEM platforms such as Splunk for incident response workflows like those practiced by PayPal.
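The difference between the role-based and attribute-based checks can be sketched as follows. This is a simplified illustration rather than Amundsen's implementation or an Open Policy Agent policy: the role names, the `contains_pii` and `pii_cleared` attributes, and the `is_allowed` function are all hypothetical.

```python
# Hypothetical mapping from roles to the actions they permit.
ROLE_PERMISSIONS = {
    "viewer": {"read_metadata"},
    "steward": {"read_metadata", "edit_description"},
    "admin": {"read_metadata", "edit_description", "delete_entry"},
}


def is_allowed(user, action, resource):
    """Grant access only if a role permits the action AND the attribute rule passes."""
    # Role-based check: at least one of the user's roles must permit the action.
    role_ok = any(action in ROLE_PERMISSIONS.get(role, set())
                  for role in user.get("roles", []))
    if not role_ok:
        return False
    # Attribute-based check: PII-tagged resources require a cleared user.
    if resource.get("contains_pii") and not user.get("pii_cleared", False):
        return False
    return True


viewer = {"roles": ["viewer"]}
steward = {"roles": ["steward"], "pii_cleared": True}
public_table = {"key": "hive://gold.core/rides", "contains_pii": False}
pii_table = {"key": "hive://gold.core/riders", "contains_pii": True}

viewer_reads_public = is_allowed(viewer, "read_metadata", public_table)
viewer_reads_pii = is_allowed(viewer, "read_metadata", pii_table)
steward_edits_pii = is_allowed(steward, "edit_description", pii_table)
```

In practice the role and attribute values would come from the identity provider, and the rules themselves would more likely live in a policy engine than in application code.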

Adoption and Community

Since its release, Amundsen has attracted contributors from organizations such as Lyft, Expedia Group, Airbnb, and CERN. The project maintains an active repository on GitHub, where issues and pull requests are tracked and automated checks run through GitHub Actions, and its governance is modeled after that of other open-source projects, such as those in incubation at the Apache Software Foundation. The community organizes meetups and discussions on platforms including Slack and Discourse and at conferences such as Strata Data Conference and KubeCon. Commercial support and integrations are offered by ecosystem vendors and consultancies that serve enterprises such as Goldman Sachs and Capital One.

Category:Free software Category:Data management software Category:Open-source software projects