LLMpedia
The first transparent, open encyclopedia generated by LLMs

Apache Spark Structured Streaming

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Apache Beam (hop 4)
Expansion funnel: 82 extracted → 0 after deduplication → 0 after NER → 0 enqueued
Apache Spark Structured Streaming
Name: Apache Spark Structured Streaming
Developer: Apache Software Foundation
Initial release: 2016
Latest release: 2024
Written in: Scala, Java, Python
Operating system: Cross-platform
License: Apache License 2.0

Apache Spark Structured Streaming is a stream processing engine built on Apache Spark that provides a declarative, high-level API for processing real-time data. It unifies batch and streaming semantics, enabling continuous processing with the same DataFrame and Dataset abstractions used in batch workloads. Designed by contributors from projects and organizations such as Databricks and the Apache Software Foundation, it targets production deployments across cloud platforms like Amazon Web Services, Microsoft Azure, and Google Cloud Platform.

Overview

Structured Streaming emerged as an evolution of Spark Streaming to address limitations identified in production services at companies such as Twitter, Netflix, Uber Technologies, Airbnb, and Alibaba Group. It treats streaming data as an unbounded table that is incrementally updated, leveraging the Catalyst (query optimizer) and the Tungsten (Spark) execution engine. The design reflects influences from academic and industrial systems including MapReduce, Apache Flink, Kafka Streams, and Apache Storm, while integrating with storage systems like the Hadoop Distributed File System, Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
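The "unbounded table" model can be illustrated with a plain-Python sketch (this is not Spark code): each micro-batch conceptually appends rows to an ever-growing input table, and the engine incrementally updates the query result rather than recomputing it from scratch.

```python
# Plain-Python sketch of incremental processing over an unbounded table.
# Each micro-batch is folded into a running aggregate (here: per-key counts),
# which is how the result table is kept up to date without a full recompute.

def update_counts(state, batch):
    """Incrementally fold one micro-batch into the running aggregate."""
    for key in batch:
        state[key] = state.get(key, 0) + 1
    return state

state = {}
for micro_batch in [["a", "b"], ["b", "b"], ["a", "c"]]:
    state = update_counts(state, micro_batch)

print(state)  # running result after all batches: {'a': 2, 'b': 3, 'c': 1}
```

The key point is that only the new rows of each batch are touched; the prior result survives as state between batches.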

Architecture and Concepts

The architecture centers on micro-batch and continuous processing modes implemented atop the Spark SQL planning layer and the Catalyst (query optimizer). Key concepts include unbounded DataFrames, event-time and processing-time semantics, watermarks for handling late data (inspired by systems such as Google Cloud Dataflow and Apache Beam), and stateful operators for aggregations and joins. Components interact with cluster managers such as Apache Mesos, Hadoop YARN, and Kubernetes for resource orchestration. The execution plan composes logical plans, physical plans, and an RDD-based execution fallback for compatibility with legacy Spark Core components.
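Watermark semantics can be sketched in plain Python (this simplifies Spark's actual implementation; the lateness and window values are illustrative): the watermark trails the maximum observed event time by an allowed lateness, and events whose tumbling window closed before the watermark are dropped.

```python
# Plain-Python sketch of watermark-based late-data handling.
ALLOWED_LATENESS = 10  # seconds of tolerated lateness (illustrative)
WINDOW = 10            # tumbling window size in seconds (illustrative)

windows = {}           # window start time -> event count
max_event_time = 0

def process(event_time):
    """Accept an event into its window, or drop it if the window is finalized."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    window_start = (event_time // WINDOW) * WINDOW
    if window_start + WINDOW <= watermark:
        return False   # too late: the window already closed below the watermark
    windows[window_start] = windows.get(window_start, 0) + 1
    return True

accepted = [process(t) for t in [5, 12, 31, 8, 2]]
# t=31 advances the watermark to 21, so the late events t=8 and t=2
# (window [0, 10), which ends at 10 <= 21) are dropped.
print(accepted)  # [True, True, True, False, False]
```

Bounding state this way is what lets the engine discard finalized windows instead of buffering all history.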

Programming Model and APIs

Developers use high-level APIs in languages including Scala (programming language), Java (programming language), Python (programming language), and R (programming language) to express streaming queries as transformations on DataFrames and Datasets. The API exposes operations such as select, filter, groupBy, window, and join, and integrates with query optimization techniques from Catalyst (query optimizer). Connectors implement sources and sinks through the DataSource API (Spark), interoperating with systems like Apache Kafka, Amazon Kinesis, Apache Pulsar, JDBC, and Elasticsearch. Declarative SQL support enables integration with tools like Apache Hive and Presto (SQL query engine) for ad hoc analytics.
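The composition of these declarative operations can be mimicked in plain Python (this is a conceptual sketch, not the Spark API): a chain equivalent to `select("user", "action").filter(action == "click").groupBy("user").count()` applied to one batch of rows, which Structured Streaming would re-apply to each incoming micro-batch.

```python
# Plain-Python sketch of a select -> filter -> groupBy -> count pipeline
# over one batch of rows; the column names and data are made up.
rows = [
    {"user": "ann", "action": "click", "ms": 120},
    {"user": "bob", "action": "view",  "ms": 80},
    {"user": "ann", "action": "click", "ms": 45},
]

# select("user", "action"): project down to the needed columns
selected = [{"user": r["user"], "action": r["action"]} for r in rows]
# filter(col("action") == "click"): keep only click events
clicks = [r for r in selected if r["action"] == "click"]
# groupBy("user").count(): aggregate per user
counts = {}
for r in clicks:
    counts[r["user"]] = counts.get(r["user"], 0) + 1

print(counts)  # {'ann': 2}
```

In Spark the same chain is a logical plan handed to Catalyst, which can reorder and fuse these steps before execution.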

Fault Tolerance and State Management

Fault tolerance relies on checkpointing, write-ahead logs, and idempotent sinks to achieve exactly-once semantics where possible, echoing practices used by Apache Kafka, Apache ZooKeeper, and Google Bigtable. State management for large stateful operators uses local state stores checkpointed to durable storage; since Spark 3.2 a RocksDB-based state store provider is available, and external systems such as Redis, HBase, and Cassandra are sometimes used at the sink layer for durability and scalability. Recovery semantics are coordinated with cluster managers like Kubernetes and Hadoop YARN and draw on distributed-systems techniques such as the Raft and Paxos consensus protocols in ancillary components.
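The idempotent-sink idea behind exactly-once delivery can be sketched in plain Python (a simplified model, not Spark's sink implementation): the sink remembers which batch IDs it has already committed, so a batch replayed after recovery is not applied twice.

```python
# Plain-Python sketch of an idempotent sink keyed by batch ID.
class IdempotentSink:
    def __init__(self):
        self.committed = set()  # batch IDs already written
        self.total = 0          # stand-in for the sink's durable output

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return  # replay after a failure: already applied, skip
        self.total += sum(rows)
        self.committed.add(batch_id)

sink = IdempotentSink()
sink.write(0, [1, 2, 3])
sink.write(1, [4])
sink.write(1, [4])   # simulated replay of batch 1 after a crash
print(sink.total)    # 10, not 14: the replay was deduplicated
```

Combined with checkpointed offsets on the source side, replays become harmless, which is what upgrades at-least-once reprocessing into effectively-exactly-once output.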

Integration and Connectors

Structured Streaming ships and interoperates with an ecosystem of connectors maintained by projects and companies including Confluent (company), Debezium, Databricks, and cloud vendors like Amazon Web Services and Microsoft Azure. Native connectors exist for Apache Kafka, Amazon S3, Google Cloud Pub/Sub, Azure Event Hubs, JDBC, and HDFS, while community adapters provide integration with Snowflake (cloud data platform), MongoDB, Elasticsearch, InfluxDB, and Prometheus. Monitoring and observability integrate with systems such as Prometheus, Grafana, Datadog, and New Relic.

Deployment and Performance Tuning

Deployments often target managed services and orchestration platforms offered by Databricks, Amazon EMR, Google Dataproc, and Azure Synapse Analytics. Performance tuning involves configuring parallelism, shuffle partitions, state TTL, memory management in the Tungsten (Spark) execution layer, and off-heap allocations, guided by recommendations from Spark Summit presentations and engineering posts by organizations such as LinkedIn and Uber Technologies. Optimizations include predicate pushdown, vectorized I/O, whole-stage code generation via the Catalyst (query optimizer), and adaptive query execution techniques akin to those discussed in the SIGMOD and VLDB communities.
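Several of the knobs mentioned above map to standard Spark configuration keys; a sketch of a streaming job's configuration might look like the following (the values shown are illustrative examples, not recommendations):

```properties
# Illustrative Spark configuration for a Structured Streaming job
spark.sql.shuffle.partitions=200
spark.sql.adaptive.enabled=true
spark.sql.streaming.stateStore.providerClass=org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=2g
```

Shuffle partition count and off-heap sizing are workload-dependent and are typically tuned against observed batch durations and state-store metrics.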

Use Cases and Adoption

Common use cases include real-time ETL, fraud detection pipelines at firms such as Stripe and PayPal, anomaly detection for operations teams at Netflix and Spotify, clickstream analysis as practiced at Google, and monitoring systems in enterprises using the Elastic (company) stack. Adoption spans financial services, advertising technology firms like The Trade Desk, telecommunications providers such as Verizon Communications, and scientific data pipelines at institutions like CERN and NASA.

Category:Apache Spark