Apache Druid — LLMpedia

Apache Druid
Name	Apache Druid
Developer	Apache Software Foundation
Initial release	2012
Latest release	2024
Programming language	Java
License	Apache License 2.0
Website	druid.apache.org

Contents

Overview
Architecture
Data Ingestion and Storage
Querying and Performance
Security and Operations
Use Cases and Adoption

Apache Druid is a high-performance, column-oriented, distributed data store designed for real-time analytics on large-scale event data. It supports low-latency queries, fast aggregations, and flexible ingestion from streaming and batch sources, and organizations use it for OLAP-style workloads, exploratory analytics, and interactive dashboards.

Overview

Druid was created to analyze large data sets of the kind processed with MapReduce and Apache Hadoop, but it is optimized for low-latency analytical queries, a goal it shares with systems such as Google BigQuery, Amazon Redshift, and Snowflake. It is commonly deployed alongside Apache Kafka, Apache Storm, Apache Flink, Apache Spark, and Apache Cassandra in streaming and batch pipelines. Early adopters included teams influenced by engineering practices at Twitter, Inc., Netflix, Inc., LinkedIn Corporation, and Uber Technologies, Inc., reflecting broader trends such as the Lambda architecture and the Kappa architecture and the platforms of vendors like Confluent, Inc. and Cloudera, Inc.

Architecture

Druid uses a segment-based, tiered architecture in which separate processes handle coordination, ingestion, storage, and query serving, a separation of roles also seen in Elasticsearch and Cassandra. Its cluster coordination plays a role comparable to that of ZooKeeper, etcd, and Consul, and the split between management and data-serving nodes is conceptually similar to the NameNodes and DataNodes of the Hadoop Distributed File System. Cluster state and metadata can be kept in a relational database such as MySQL, PostgreSQL, or Apache Derby. Clusters are commonly deployed with container and orchestration platforms like Docker and Kubernetes, and observability integrates with projects such as Prometheus and Grafana.

Data Ingestion and Storage

Druid supports ingestion from streaming platforms including Apache Kafka, Amazon Kinesis, and Apache Pulsar, as well as from batch sources like Amazon S3, Google Cloud Storage, and HDFS. Its storage model uses immutable, time-partitioned segments, which are similar in concept to the columnar files produced by Parquet and ORC and to the data files of Apache Avro. Indexing and compaction workflows resemble the batch processes used with Apache Hive and Presto, while schema evolution and serialization are handled with formats and registries akin to Confluent Schema Registry and Thrift, as used in systems built by Facebook, Inc. and Twitter, Inc.
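
The idea of immutable, time-partitioned segments can be illustrated with a short Python sketch. This is a conceptual illustration only, not Druid's actual segment format or API: the hourly granularity, the function names `segment_start` and `build_segments`, and the sample events are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def segment_start(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (assumed segment granularity)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def build_segments(events):
    """Group events into hourly buckets and freeze each bucket as a tuple."""
    buckets = defaultdict(list)
    for event in events:
        buckets[segment_start(event["timestamp"])].append(event)
    # Rows are sorted by time inside each bucket; the tuple stands in for immutability.
    return {
        start: tuple(sorted(rows, key=lambda r: r["timestamp"]))
        for start, rows in sorted(buckets.items())
    }


# Hypothetical click events, invented for this example.
events = [
    {"timestamp": datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), "page": "home", "clicks": 3},
    {"timestamp": datetime(2024, 1, 1, 10, 40, tzinfo=timezone.utc), "page": "docs", "clicks": 1},
    {"timestamp": datetime(2024, 1, 1, 11, 2, tzinfo=timezone.utc), "page": "home", "clicks": 2},
]

segments = build_segments(events)
for start, rows in segments.items():
    print(start.isoformat(), len(rows))
```

Because each bucket covers a fixed time interval and is never modified after it is built, a query restricted to a time range only needs to read the buckets whose intervals overlap that range.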

Querying and Performance

Druid provides native support for OLAP-style queries with primitives comparable to those in Apache Pinot, ClickHouse, and Dremio, and it exposes a SQL interface familiar to users of PostgreSQL and MySQL clients. Its vectorized execution, bitmap indexing, and columnar compression draw on techniques also found in CPU-level optimizations from Intel and in the storage engines of Oracle Corporation and Microsoft SQL Server. For high concurrency and low latency, Druid borrows ideas explored in Google Bigtable, Amazon DynamoDB, and Memcached deployments, while integration with tracing and telemetry relies on standards advanced by OpenTelemetry and Jaeger.
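
Bitmap indexing, mentioned above, can be sketched in a few lines of Python. This is a minimal illustration of the general technique under simplifying assumptions, not Druid's implementation: plain Python integers stand in for compressed bitmaps, and the helper names `build_bitmap_index` and `matching_rows` and the sample rows are hypothetical.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bitset of row ids."""
    index = {}
    for row_id, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Expand a bitset back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical event rows, invented for this example.
rows = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "DE", "device": "desktop", "clicks": 5},
    {"country": "US", "device": "desktop", "clicks": 2},
    {"country": "US", "device": "mobile", "clicks": 7},
]

country_idx = build_bitmap_index(rows, "country")
device_idx = build_bitmap_index(rows, "device")

# A filter such as country = 'US' AND device = 'mobile' becomes a bitwise AND.
hits = country_idx["US"] & device_idx["mobile"]
row_ids = matching_rows(hits)
total_clicks = sum(rows[i]["clicks"] for i in row_ids)
print(row_ids, total_clicks)  # [0, 3] 10
```

The filter is resolved entirely on the bitmaps, so only the matching rows of the metric column are read for the aggregation. Production systems typically use compressed bitmap formats rather than raw integers.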

Security and Operations

Operational security practices for Druid clusters follow patterns established by Center for Internet Security (CIS) benchmarks and rely on tools from vendors like HashiCorp and Okta, Inc. for secrets management and access control. Authentication and authorization integrate with identity systems such as LDAP, Active Directory, and OAuth 2.0 providers used by enterprises including Microsoft Corporation and IBM. Encryption in transit relies on the TLS standards developed by Internet Engineering Task Force working groups, and encryption at rest uses key management approaches similar to AWS KMS and Google Cloud KMS.

Use Cases and Adoption

Druid is widely used for real-time analytics, user-facing dashboards, and monitoring systems at companies including Netflix, Inc., Airbnb, Inc., Uber Technologies, Inc., PayPal Holdings, Inc., and LinkedIn Corporation. Common deployments parallel the ingestion patterns behind Twitter, Inc. timelines, e-commerce telemetry similar to Shopify's, and the telemetry pipelines of cloud providers such as Amazon Web Services and Google Cloud Platform. The ecosystem around Druid includes commercial vendors and managed services similar to offerings from Confluent, Inc., Cloudera, Inc., Databricks, and Elastic NV, and its user community participates in forums and conferences such as KubeCon, Strata Data Conference, and Open Source Summit.

Category:Distributed data stores