| Apache Storm | |
|---|---|
| Name | Apache Storm |
| Developer | Apache Software Foundation |
| Initial release | 2011 |
| Programming language | Java, Clojure |
| Operating system | Cross-platform |
| License | Apache License 2.0 |
Apache Storm is a distributed, fault-tolerant, real-time computation system for processing high-velocity data streams. Created by Nathan Marz at BackType, it was open-sourced in 2011 after Twitter acquired the company, and was later donated to the Apache Software Foundation, graduating to a top-level project in 2014. Storm targets stream-processing scenarios that require low-latency, scalable computation across clusters, and it integrates with a broad ecosystem of data infrastructure, including systems from the Hadoop, Kafka, and Spark ecosystems.
Storm was created to address continuous, tuple-at-a-time processing requirements that emerged from social-media analytics and online telemetry, first at BackType and then at Twitter after the acquisition. Its evolution was shaped by the operational needs of practitioners at organizations including Yahoo!, Groupon, and Netflix. Storm occupies a role among streaming platforms alongside projects such as Apache Kafka, Apache Flink, Apache Samza, and Apache Spark Streaming, and it frequently interoperates with storage systems like the Hadoop Distributed File System and messaging systems like RabbitMQ.
Storm's architecture separates computation from coordination and scheduling to enable elasticity and fault recovery. A master daemon (Nimbus) distributes code and assigns work, Supervisor daemons on each node launch and monitor the worker processes that execute tasks, and cluster state is coordinated through Apache ZooKeeper; this control-plane model resembles patterns used by cluster managers like Apache Mesos and Kubernetes. Topologies define processing flows as directed acyclic graphs, comparable to the dataflow models of Google MapReduce and Apache Tez. For durability and state, Storm integrates with external stores including Apache Cassandra, Redis, and HBase; internally it uses Thrift to define topology structures and, by default, Kryo for tuple serialization.
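Conceptually, a topology is a directed acyclic graph of named components connected by stream edges. The following minimal Python sketch (a toy model, not the Storm API; all names are illustrative) shows how such a wiring might be declared and validated for acyclicity:

```python
from collections import defaultdict

class Topology:
    """Toy model of a Storm-style topology: named components plus
    directed stream edges. Not the real Storm API."""
    def __init__(self):
        self.edges = defaultdict(list)   # component -> downstream components
        self.components = set()

    def add_component(self, name):
        self.components.add(name)

    def connect(self, upstream, downstream):
        self.edges[upstream].append(downstream)

    def is_acyclic(self):
        # Depth-first search with a recursion stack to detect back edges.
        visiting, done = set(), set()
        def visit(node):
            if node in visiting:
                return False             # back edge => cycle
            if node in done:
                return True
            visiting.add(node)
            ok = all(visit(n) for n in self.edges[node])
            visiting.discard(node)
            if ok:
                done.add(node)
            return ok
        return all(visit(c) for c in self.components)

# Wire a simple three-stage topology: source -> parse -> count.
topo = Topology()
for name in ("kafka-spout", "parse-bolt", "count-bolt"):
    topo.add_component(name)
topo.connect("kafka-spout", "parse-bolt")
topo.connect("parse-bolt", "count-bolt")
print(topo.is_acyclic())  # → True
```

In Storm itself, the analogous declaration is done with `TopologyBuilder` in Java, which additionally attaches parallelism hints and stream-grouping strategies to each edge.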
Storm applications are built from two primary component types: spouts and bolts. Spouts act as stream sources and adapters to messaging systems like Apache Kafka and Amazon Kinesis, while bolts perform transformation, aggregation, and enrichment, similar to operators in Flink and Spark Streaming. Complex topologies chain bolts to implement join, window, and stateful operations of the kind explored in streaming research at UC Berkeley and in systems like Google Cloud Dataflow. Developers typically write components in Java or Clojure, and Storm's multi-language protocol also allows spouts and bolts to be written in languages such as Python and Ruby.
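The spout/bolt contract can be illustrated with a classic word-count pipeline. This is a simplified Python model of the data flow only (in-memory, single-process; the class and method names are illustrative, not Storm's API):

```python
from collections import Counter

class SentenceSpout:
    """Toy spout: yields tuples from a fixed in-memory source.
    A real spout would pull from Kafka, Kinesis, etc."""
    def __init__(self, sentences):
        self.sentences = sentences
    def next_tuple(self):
        for s in self.sentences:
            yield (s,)

class SplitBolt:
    """Toy bolt: transforms one tuple into many (sentence -> words)."""
    def execute(self, tup):
        for word in tup[0].split():
            yield (word,)

class CountBolt:
    """Toy stateful bolt: maintains a running word count."""
    def __init__(self):
        self.counts = Counter()
    def execute(self, tup):
        self.counts[tup[0]] += 1

# Drive the pipeline by hand: spout -> split bolt -> count bolt.
# In Storm, the framework routes tuples between components instead.
spout = SentenceSpout(["the quick brown fox", "the lazy dog"])
split, count = SplitBolt(), CountBolt()
for tup in spout.next_tuple():
    for word_tup in split.execute(tup):
        count.execute(word_tup)
print(count.counts["the"])  # → 2
```

In a real topology the split and count stages would run as separate, parallel tasks, with a fields grouping on the word ensuring each counter sees all occurrences of its words.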
Storm can be deployed on bare-metal clusters, virtual machines, and container orchestration platforms. Production deployments often pair Storm with cluster coordination tools such as Apache ZooKeeper and resource schedulers like YARN or Kubernetes. Operational concerns—rolling upgrades, capacity planning, and monitoring—are addressed with observability stacks that incorporate systems like Prometheus, Graphite, and Grafana, and logging infrastructures such as ELK Stack and Splunk. Organizations such as Salesforce and Pinterest have published operational patterns emphasizing metrics, backpressure handling, and topology versioning practices.
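As an illustration of such a deployment, a minimal `storm.yaml` fragment might wire a cluster to its ZooKeeper ensemble and define worker slots as follows (hostnames and ports are placeholders):

```yaml
# Minimal storm.yaml sketch; hostnames are placeholders.
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
nimbus.seeds: ["nimbus1.example.com"]
supervisor.slots.ports:   # one worker JVM per listed port
  - 6700
  - 6701
```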
Storm has been used for real-time analytics, online feature computation for machine learning, stream ETL, fraud detection, and anomaly detection at companies including Twitter, Yahoo!, Spotify, and Alibaba Group. Workflows that integrate Storm with machine-learning platforms, model stores, and feature stores from ecosystems like TensorFlow, H2O.ai, and MLflow support low-latency scoring and feature extraction. Vertical adoption spans advertising-technology firms, financial institutions such as Goldman Sachs and Morgan Stanley, and telecommunication providers that require real-time call-record processing.
Storm is optimized for low-latency per-tuple processing and scales horizontally by adding worker processes and cluster nodes. Performance tuning typically involves thread-pool sizing, network-buffer configuration, and serialization choices such as Kryo (Storm's default for tuples), Avro, or Protocol Buffers. Benchmarks conducted by vendors and research groups compare Storm to alternatives like Apache Flink and Apache Spark Streaming on metrics including throughput (tuples/second), end-to-end latency, and delivery guarantees. Storm's core engine provides at-least-once delivery via tuple acking and replay; the higher-level Trident API layers transactional, exactly-once state processing on top of that mechanism.
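The at-least-once contract can be sketched as follows: the spout keeps every emitted tuple pending until it is acked, and re-queues it on failure, so a tuple may be processed more than once but is never lost. This Python model is a simplification (real Storm tracks acks across an entire tuple tree and also replays on timeout; all names here are illustrative):

```python
import uuid

class AckingSpout:
    """Toy model of Storm's at-least-once contract. Not the real API:
    emitted tuples stay pending until acked; failed ones are re-queued."""
    def __init__(self, items):
        self.queue = list(items)
        self.pending = {}                   # msg_id -> tuple awaiting ack

    def next_tuple(self):
        if self.queue:
            msg_id = uuid.uuid4().hex
            tup = self.queue.pop(0)
            self.pending[msg_id] = tup
            return msg_id, tup
        return None

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)      # fully processed: forget it

    def fail(self, msg_id):
        tup = self.pending.pop(msg_id, None)
        if tup is not None:
            self.queue.append(tup)          # replay => at-least-once

def flaky_bolt(tup, state={"calls": 0}):
    """Toy bolt that fails its first tuple to force one replay."""
    state["calls"] += 1
    return state["calls"] > 1               # False on the first call only

spout = AckingSpout(["a", "b"])
processed = []
while True:
    emitted = spout.next_tuple()
    if emitted is None:
        if not spout.pending:
            break                           # queue drained, nothing pending
        continue
    msg_id, tup = emitted
    if flaky_bolt(tup):
        processed.append(tup)
        spout.ack(msg_id)
    else:
        spout.fail(msg_id)                  # "a" fails once, is replayed
print(sorted(processed))  # → ['a', 'b']
```

Every tuple reaches the output despite the simulated failure; deduplicating such replays (or managing state transactionally, as Trident does) is what upgrades at-least-once to effectively exactly-once semantics.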
Enterprise deployments of Storm incorporate authentication, authorization, and encryption mechanisms mirroring practices established on platforms such as Kubernetes and Apache Hadoop. Integration with directory and identity infrastructure via protocols such as LDAP and Kerberos provides access control, while TLS for network traffic and encryption-at-rest solutions align with recommendations from standards bodies and cloud providers including Amazon Web Services and Google Cloud Platform. Management tooling from vendors such as Cloudera and Hortonworks (now part of Cloudera) has included commercial support, monitoring, and lifecycle automation for Storm clusters.