LLMpedia: The first transparent, open encyclopedia generated by LLMs

Storm (distributed realtime computation system)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Flink (hop 5)
Expansion Funnel: Extracted 91 → After dedup 0 → After NER 0 → Enqueued 0
Storm (distributed realtime computation system)
Name: Storm
Developer: Twitter, BackType, Hortonworks, Yahoo
Initial release: 2011
Programming languages: Java, Clojure
Operating system: Linux
License: Apache License 2.0

Storm (distributed realtime computation system) is a fault-tolerant, distributed stream-processing platform originally developed at BackType and open-sourced by Twitter after its 2011 acquisition of BackType, before becoming an Apache project. It provides scalable, low-latency processing of high-throughput data streams and has been adopted by organizations such as Yahoo!, Hortonworks, Alibaba Group, and Tencent for real-time analytics. Storm interoperates with ecosystem components including Apache Kafka, Apache Hadoop, Apache ZooKeeper, Apache HBase, and Apache Cassandra.

Overview

Storm was created to handle the continuous, unbounded data streams common to services like Twitter, Facebook, LinkedIn, and Netflix. It contrasts with batch systems such as Apache Hadoop MapReduce, batch processing in Apache Spark, and Google BigQuery by focusing on event-by-event processing, similar to Apache Flink and Google Dataflow. The project entered the Apache Incubator in 2013, became a top-level Apache project in 2014, and has been compared with proprietary offerings from Amazon Web Services, Microsoft Azure, and Google Cloud Platform in low-latency streaming scenarios.

Architecture

Storm's architecture separates responsibility among a cluster-management daemon, worker processes, and a coordination service. The central daemon, Nimbus, distributes topology code and assigns work, a role analogous to the control plane in Apache Mesos or Kubernetes; supervisor daemons on each worker node launch and monitor the worker JVMs that execute topology tasks. Task assignment and cluster state rely on Apache ZooKeeper for leader election and metadata, much as HBase and Solr use ZooKeeper. Workers typically consume from message queues such as Apache Kafka, RabbitMQ, and Amazon Kinesis for ingestion, and write results to storage systems such as Apache HBase, Apache Cassandra, Redis, and Elasticsearch.
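The division of labor above is reflected in Storm's cluster configuration. The fragment below is an illustrative storm.yaml sketch: the keys are standard Storm configuration options, but every hostname and value here is a placeholder, not a recommendation.

```yaml
# Illustrative storm.yaml sketch; hostnames are placeholders.
# ZooKeeper ensemble used for coordination and cluster state.
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
# Nimbus hosts (the cluster-management daemons).
nimbus.seeds: ["nimbus1.example.com"]
# Each port corresponds to one worker JVM slot on a supervisor node.
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```

In this layout, the number of entries under supervisor.slots.ports bounds how many worker processes a single supervisor node will run.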

Core Concepts and Components

Topologies in Storm are directed acyclic graphs composed of spouts and bolts: spouts emit tuples from sources such as Apache Kafka, Amazon Kinesis, Azure Event Hubs, and Google Pub/Sub, while bolts transform tuples and write to sinks like Apache HBase, Apache Cassandra, Elasticsearch, and MongoDB. Stream groupings (shuffle, fields, all, global, direct) determine how tuples are routed among a bolt's tasks, similar to partitioning in Apache Kafka and sharding in MongoDB. Reliability rests on tuple anchoring and acknowledgements, akin to message semantics in Apache ActiveMQ and RabbitMQ, giving the at-least-once guarantees used by services like Uber and Airbnb for event processing; the higher-level Trident API adds exactly-once processing semantics. Storm's stateful processing options evolved to integrate with external state stores such as RocksDB and with in-memory caches such as Memcached and Redis.
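The two most common groupings can be illustrated with a small, self-contained sketch. This is plain Python, not Storm's actual Java API: shuffle grouping spreads tuples evenly across a bolt's tasks, while fields grouping hash-partitions on the grouping field so equal keys always reach the same task.

```python
import itertools
import zlib

def shuffle_grouping(num_tasks):
    """Round-robin task chooser, approximating Storm's shuffle grouping."""
    counter = itertools.cycle(range(num_tasks))
    return lambda tup: next(counter)

def fields_grouping(field, num_tasks):
    """Hash the grouping field so equal values map to the same task,
    mimicking Storm's fields grouping (hash partitioning)."""
    def choose(tup):
        return zlib.crc32(str(tup[field]).encode()) % num_tasks
    return choose

# Route a stream of word tuples to 4 hypothetical "count" bolt tasks.
route = fields_grouping("word", 4)
tuples = [{"word": w} for w in ["storm", "flink", "storm", "kafka", "storm"]]
assignments = [route(t) for t in tuples]

# Every occurrence of "storm" lands on the same task, which is what
# makes per-key aggregation (e.g. word counting) correct.
storm_tasks = {a for t, a in zip(tuples, assignments) if t["word"] == "storm"}
print(len(storm_tasks))  # 1
```

This is why a word-count bolt uses fields grouping on the word: the counter for a given word lives in exactly one task, whereas shuffle grouping would scatter the same word across tasks.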

Deployment and Operations

Operators deploy Storm clusters on bare metal or on virtualization and orchestration platforms such as Apache Mesos, Kubernetes, OpenStack, and VMware ESXi. Monitoring and logging commonly integrate with Prometheus, Grafana, Nagios, Zabbix, the ELK Stack, and Splunk, or with hosted services such as Amazon CloudWatch and Google Stackdriver, to visualize metrics and trace events. High-availability patterns reuse techniques from Hadoop YARN clusters and Cassandra rings, and multi-tenant deployments require quota and isolation strategies similar to those used by OpenStack projects. Security integrations often involve Kerberos, LDAP, OAuth, and Apache Ranger or Apache Sentry for authorization.

Use Cases and Performance

Storm has been applied to real-time analytics, user-facing features, fraud detection, and stream enrichment at companies like Twitter, Yahoo!, Instagram, LinkedIn, and Pinterest. Example workloads include clickstream aggregation similar to pipelines built around Google Analytics, conversion tracking of the kind offered by Adobe Analytics and Mixpanel, and anomaly detection used by financial firms such as Goldman Sachs and JPMorgan Chase. Benchmark comparisons often pit Storm against Apache Flink, Apache Spark Streaming, Apache Samza, and proprietary offerings such as Amazon Kinesis Data Analytics and Google Cloud Dataflow on throughput and latency; tuning typically involves JVM settings, backpressure controls, and CPU-affinity strategies inspired by NUMA-aware deployments at high-frequency trading firms like Jane Street.

Integration and Ecosystem

Storm's ecosystem includes connectors, libraries, and management tools that link with projects like Apache Kafka, Apache Hadoop, Apache HBase, Apache Cassandra, Elasticsearch, Redis, Logstash, and Apache Flume. Community contributions and enterprise integrations have come from organizations such as Twitter, Yahoo!, Hortonworks, Cloudera, Confluent, and DataStax. Tooling for deployment and CI/CD borrows patterns from Jenkins, Travis CI, CircleCI, Ansible, Chef, and Puppet; observability and tracing use Jaeger, Zipkin, and OpenTracing. Storm remains relevant within architectures that also employ Apache NiFi, Apache Beam, Apache Samza, and Apache Flink for hybrid streaming and batch processing.

Category:Distributed computing Category:Stream processing Category:Apache Software Foundation projects