| Structured Streaming | |
|---|---|
| Name | Structured Streaming |
| Developer | Apache Software Foundation |
| Initial release | 2016 |
| Latest release | Released with each Apache Spark version |
| Programming language | Scala, Java, Python |
| License | Apache License 2.0 |
Structured Streaming
Structured Streaming is a high-level stream processing engine developed as part of the Apache Spark project. It provides a declarative query interface that treats continuous data as a series of incremental computations, unifying batch and streaming semantics so that developers can build real-time analytics, event processing, and ETL pipelines with the same Spark SQL, DataFrame, and Dataset abstractions used for batch jobs. The design emphasizes integration with the Spark SQL optimizer, a broad ecosystem of data source connectors, and operational features such as fault recovery and end-to-end exactly-once guarantees.
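The incremental-computation model can be illustrated with a toy, pure-Python sketch (the class and method names here are invented for illustration; this is not the Spark API):

```python
# Toy model (not Spark code) of Structured Streaming's core idea:
# a stream is an unbounded input table, and a streaming query is a
# batch-style aggregation recomputed incrementally as rows arrive.
from collections import defaultdict

class IncrementalCount:
    """Maintains `SELECT word, COUNT(*) ... GROUP BY word` over an unbounded input."""
    def __init__(self):
        self.counts = defaultdict(int)  # running state, one entry per group

    def process_micro_batch(self, rows):
        """Fold a new micro-batch into the running result instead of rescanning all input."""
        for word in rows:
            self.counts[word] += 1
        return dict(self.counts)  # the "result table" as of this batch

query = IncrementalCount()
query.process_micro_batch(["spark", "kafka", "spark"])
result = query.process_micro_batch(["kafka"])
# result == {"spark": 2, "kafka": 2}
```

Each new batch updates only the affected groups, which is the property that lets a streaming query reuse batch semantics without reprocessing the whole input table.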
Structured Streaming emerged from efforts in the Apache Spark community to move beyond the legacy DStream API of Spark Streaming, drawing on ideas from record-at-a-time systems such as Apache Flink and producing a model that treats a stream as a table that is continuously appended. It uses the Catalyst query optimizer to plan streaming queries and integrates with the Hadoop Distributed File System, Apache Kafka, Amazon S3, and relational databases such as PostgreSQL through connectors. Key contributors include engineers from companies such as Databricks, and the project is governed by the Apache Software Foundation's community process.
Structured Streaming represents input as an unbounded table and produces output as incremental results, using logical plans, physical plans, and stateful operations implemented on the Spark SQL engine. Core components include the query planner, built on the Catalyst optimizer; the execution engine, built on Resilient Distributed Datasets and the Tungsten execution engine for performance; and the State Store, which provides durable state for operators such as aggregations and joins. The system models time using the notions of event time, processing time, and watermarking, influenced by streaming research from systems such as Google Dataflow and work published at venues like VLDB and SIGMOD.
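The event-time and watermarking notions can be sketched in plain Python (a conceptual model, not Spark's implementation; the 10-second delay threshold is an arbitrary example value):

```python
# Conceptual watermark model (not Spark internals): the watermark trails the
# maximum event time seen so far by a fixed delay, and state for event-time
# windows older than the watermark can be dropped as "too late".
MAX_DELAY = 10  # allowed lateness in seconds (arbitrary example value)

class Watermark:
    def __init__(self, max_delay):
        self.max_event_time = 0
        self.max_delay = max_delay

    def observe(self, event_time):
        """Advance the watermark as events arrive (possibly out of order)."""
        self.max_event_time = max(self.max_event_time, event_time)

    def value(self):
        return self.max_event_time - self.max_delay

    def is_late(self, event_time):
        """Events older than the watermark are dropped rather than buffered forever."""
        return event_time < self.value()

wm = Watermark(MAX_DELAY)
for t in [100, 105, 112]:   # event times in seconds, arriving in order here
    wm.observe(t)
# watermark is now 112 - 10 = 102, so an event stamped 101 counts as late
late = wm.is_late(101)      # True
```

Bounding state this way is what allows event-time aggregations over an unbounded input to run with finite memory.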
Structured Streaming supports a wide range of connectors for sources and sinks, including message brokers such as Apache Kafka and RabbitMQ, distributed file systems and object stores such as the Hadoop Distributed File System and Amazon S3, databases such as Apache Cassandra and PostgreSQL, and cloud services including Google Cloud Pub/Sub and Azure Event Hubs. Sinks include append-only files, transactional stores using Delta Lake, and message sinks that integrate with Apache Pulsar and Amazon Kinesis. The connector ecosystem is driven by organizations such as Confluent and Databricks, along with community adapters maintained in the Apache Software Foundation ecosystem.
Users program Structured Streaming through the DataFrame and Dataset APIs in Scala, Java, and Python. SQL users can express streaming transformations with Spark SQL extensions and register streams as temporary views for interactive analysis in notebooks such as Apache Zeppelin and Jupyter. The API supports map, flatMap, groupBy, windowed aggregation, and join operators, and integrates with libraries such as MLlib for streaming machine learning pipelines and GraphX for graph processing adaptations.
To provide fault tolerance, Structured Streaming relies on write-ahead logs, checkpointing of query progress and state to durable stores such as the Hadoop Distributed File System or Amazon S3, and cluster management through ecosystems such as Apache ZooKeeper and Kubernetes. Exactly-once semantics at sinks are achieved through idempotent writes and transactional sinks, such as Delta Lake or Kafka transactions coordinated with the broker, following two-phase commit patterns discussed in the distributed-systems literature. Recovery mechanisms draw on techniques refined in systems such as Google MillWheel and in academic work presented at OSDI and SOSP.
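The idempotent-write pattern behind exactly-once sinks can be sketched as follows (a conceptual pure-Python model; the batch-id bookkeeping shown here is a simplification, not Spark's sink protocol):

```python
# Conceptual exactly-once sink (not Spark code): each micro-batch carries a
# monotonically increasing batch_id, and the sink records committed ids so a
# batch replayed after failure recovery is skipped instead of double-written.
class IdempotentSink:
    def __init__(self):
        self.committed = set()  # ids of batches already written durably
        self.rows = []          # stand-in for the external store

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return False              # replay after recovery: skip, no duplicates
        self.rows.extend(rows)        # write the data...
        self.committed.add(batch_id)  # ...then record the commit
        return True

sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(1, ["c"])
sink.write_batch(1, ["c"])  # replayed batch after a failure is ignored
# sink.rows == ["a", "b", "c"]
```

In a real system the data write and the commit record must be made atomic (for example via a transaction), which is where transactional sinks like Delta Lake or Kafka transactions come in.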
Performance in Structured Streaming derives from optimizations in the Tungsten execution engine and whole-stage code generation via the Catalyst optimizer, which reduce CPU and memory overheads. Scalability is achieved through partitioning strategies compatible with Apache Kafka partitions, autoscaling on cluster managers such as YARN and Kubernetes, and file-based partition pruning when used with columnar storage formats such as Parquet and ORC. Benchmarks comparing latency and throughput often reference systems such as Apache Flink, Apache Storm, and Google Dataflow to contextualize trade-offs between micro-batch latency and record-at-a-time processing.
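File-based partition pruning can be illustrated with a small sketch over Hive-style `key=value` directory paths, the layout commonly used for partitioned Parquet datasets (pure Python; the paths and predicate are invented examples):

```python
# Sketch of partition pruning (not Spark internals): with Hive-style
# partitioned layouts like .../date=2024-01-01/part-0.parquet, a filter on
# the partition column lets the planner skip whole directories before
# reading any file contents.
paths = [
    "events/date=2024-01-01/part-0.parquet",
    "events/date=2024-01-02/part-0.parquet",
    "events/date=2024-01-03/part-0.parquet",
]

def partition_value(path, column):
    """Extract a Hive-style partition value (column=value) from a path."""
    for segment in path.split("/"):
        if segment.startswith(column + "="):
            return segment.split("=", 1)[1]
    return None

# Predicate pushed down from a query like: WHERE date >= '2024-01-02'
pruned = [p for p in paths if partition_value(p, "date") >= "2024-01-02"]
# only the last two paths survive; the first directory is never scanned
```

Because the predicate is evaluated against path metadata alone, pruning cost scales with the number of partitions rather than the volume of data.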
Structured Streaming is employed in a variety of production scenarios: real-time analytics and dashboards at enterprises such as Netflix, fraud-detection pipelines at financial institutions that interface with Apache Cassandra and PostgreSQL, clickstream processing for advertising companies using Apache Kafka and Amazon S3, and IoT telemetry ingestion integrated with Azure Event Hubs and Google Cloud Pub/Sub. Deployments often combine Structured Streaming with Delta Lake for ACID transactional sinks, MLflow or TensorFlow Serving for model scoring, and orchestration platforms such as Apache Airflow for end-to-end workflows.