| Structured Streaming | |
|---|---|
| Name | Structured Streaming |
| Developer | Apache Software Foundation |
| Initial release | 2016 |
| Latest release | Released with each Apache Spark version |
| Programming language | Scala, Java, Python |
| License | Apache License 2.0 |
Structured Streaming
Structured Streaming is a high-level stream processing engine developed as part of the Apache Spark project. It provides a declarative query interface that treats continuous data as a series of incremental computations, unifying batch and streaming semantics so that developers can build real-time analytics, event processing, and ETL pipelines with the same Spark SQL, DataFrame, and Dataset abstractions used for batch jobs. The design emphasizes integration with the Spark SQL optimizer, a broad ecosystem of data source connectors, and operational features such as fault recovery and end-to-end exactly-once guarantees.
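The incremental-computation model can be illustrated with a toy, pure-Python sketch (the class and method names here are invented for illustration; this is not the Spark API):

```python
# Toy model (not Spark code) of Structured Streaming's core idea:
# a stream is an unbounded input table, and a streaming query is a
# batch-style aggregation recomputed incrementally as rows arrive.
from collections import defaultdict

class IncrementalCount:
    """Maintains `SELECT word, COUNT(*) ... GROUP BY word` over an unbounded input."""
    def __init__(self):
        self.counts = defaultdict(int)  # running state, one entry per group

    def process_micro_batch(self, rows):
        """Fold a new micro-batch into the running result instead of rescanning all input."""
        for word in rows:
            self.counts[word] += 1
        return dict(self.counts)  # the "result table" as of this batch

query = IncrementalCount()
query.process_micro_batch(["spark", "kafka", "spark"])
result = query.process_micro_batch(["kafka"])
# result == {"spark": 2, "kafka": 2}
```

Each new batch updates only the affected groups, which is the property that lets a streaming query reuse batch semantics without reprocessing the whole input table.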
Structured Streaming emerged from efforts in the Apache Spark community to move beyond the legacy DStream API of Spark Streaming, drawing on ideas from record-at-a-time systems such as Apache Flink and producing a model that treats a stream as a table that is continuously appended. It uses the Catalyst query optimizer to plan streaming queries and integrates with the Hadoop Distributed File System, Apache Kafka, Amazon S3, and relational databases such as PostgreSQL through connectors. Key contributors include engineers from companies such as Databricks, and the project is governed by the Apache Software Foundation's community process.
Structured Streaming represents input as an unbounded table and produces output as incremental results, using logical plans, physical plans, and stateful operations implemented on the Spark SQL engine. Core components include the query planner, built on the Catalyst optimizer; the execution engine, built on Resilient Distributed Datasets and the Tungsten execution engine for performance; and the State Store, which provides durable state for operators such as aggregations and joins. The system models time using the notions of event time, processing time, and watermarking, influenced by streaming research from systems such as Google Dataflow and work published at venues like VLDB and SIGMOD.
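The event-time and watermarking notions can be sketched in plain Python (a conceptual model, not Spark's implementation; the 10-second delay threshold is an arbitrary example value):

```python
# Conceptual watermark model (not Spark internals): the watermark trails the
# maximum event time seen so far by a fixed delay, and state for event-time
# windows older than the watermark can be dropped as "too late".
MAX_DELAY = 10  # allowed lateness in seconds (arbitrary example value)

class Watermark:
    def __init__(self, max_delay):
        self.max_event_time = 0
        self.max_delay = max_delay

    def observe(self, event_time):
        """Advance the watermark as events arrive (possibly out of order)."""
        self.max_event_time = max(self.max_event_time, event_time)

    def value(self):
        return self.max_event_time - self.max_delay

    def is_late(self, event_time):
        """Events older than the watermark are dropped rather than buffered forever."""
        return event_time < self.value()

wm = Watermark(MAX_DELAY)
for t in [100, 105, 112]:   # event times in seconds, arriving in order here
    wm.observe(t)
# watermark is now 112 - 10 = 102, so an event stamped 101 counts as late
late = wm.is_late(101)      # True
```

Bounding state this way is what allows event-time aggregations over an unbounded input to run with finite memory.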
Structured Streaming supports a wide range of connectors for sources and sinks, including message brokers such as Apache Kafka and RabbitMQ, distributed file systems and object stores such as the Hadoop Distributed File System and Amazon S3, databases such as Apache Cassandra and PostgreSQL, and cloud services including Google Cloud Pub/Sub and Azure Event Hubs. Sinks include append-only files, transactional stores using Delta Lake, and message sinks that integrate with Apache Pulsar and Amazon Kinesis. The connector ecosystem is driven by organizations such as Confluent and Databricks, along with community adapters maintained in the Apache Software Foundation ecosystem.
Users program Structured Streaming through the DataFrame and Dataset APIs in Scala, Java, and Python. SQL users can express streaming transformations with Spark SQL extensions and register streams as temporary views for interactive analysis in notebooks such as Apache Zeppelin and Jupyter. The API supports map, flatMap, groupBy, windowed aggregation, and join operators, and integrates with libraries such as MLlib for streaming machine learning pipelines and GraphX for graph processing adaptations.
To provide fault tolerance, Structured Streaming relies on write-ahead logs, checkpointing of query progress and state to durable stores such as the Hadoop Distributed File System or Amazon S3, and cluster management through ecosystems such as Apache ZooKeeper and Kubernetes. Exactly-once semantics at sinks are achieved through idempotent writes and transactional sinks, such as Delta Lake or Kafka transactions coordinated with the broker, following two-phase commit patterns discussed in the distributed-systems literature. Recovery mechanisms draw on techniques refined in systems such as Google MillWheel and in academic work presented at OSDI and SOSP.
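The idempotent-write pattern behind exactly-once sinks can be sketched as follows (a conceptual pure-Python model; the batch-id bookkeeping shown here is a simplification, not Spark's sink protocol):

```python
# Conceptual exactly-once sink (not Spark code): each micro-batch carries a
# monotonically increasing batch_id, and the sink records committed ids so a
# batch replayed after failure recovery is skipped instead of double-written.
class IdempotentSink:
    def __init__(self):
        self.committed = set()  # ids of batches already written durably
        self.rows = []          # stand-in for the external store

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return False              # replay after recovery: skip, no duplicates
        self.rows.extend(rows)        # write the data...
        self.committed.add(batch_id)  # ...then record the commit
        return True

sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(1, ["c"])
sink.write_batch(1, ["c"])  # replayed batch after a failure is ignored
# sink.rows == ["a", "b", "c"]
```

In a real system the data write and the commit record must be made atomic (for example via a transaction), which is where transactional sinks like Delta Lake or Kafka transactions come in.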
Performance in Structured Streaming derives from optimizations in the Tungsten execution engine and whole-stage code generation via the Catalyst optimizer, which reduce CPU and memory overheads. Scalability is achieved through partitioning strategies compatible with Apache Kafka partitions, autoscaling on cluster managers such as YARN and Kubernetes, and file-based partition pruning when used with columnar storage formats such as Parquet and ORC. Benchmarks comparing latency and throughput often reference systems such as Apache Flink, Apache Storm, and Google Dataflow to contextualize trade-offs between micro-batch latency and record-at-a-time processing.
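File-based partition pruning can be illustrated with a small sketch over Hive-style `key=value` directory paths, the layout commonly used for partitioned Parquet datasets (pure Python; the paths and predicate are invented examples):

```python
# Sketch of partition pruning (not Spark internals): with Hive-style
# partitioned layouts like .../date=2024-01-01/part-0.parquet, a filter on
# the partition column lets the planner skip whole directories before
# reading any file contents.
paths = [
    "events/date=2024-01-01/part-0.parquet",
    "events/date=2024-01-02/part-0.parquet",
    "events/date=2024-01-03/part-0.parquet",
]

def partition_value(path, column):
    """Extract a Hive-style partition value (column=value) from a path."""
    for segment in path.split("/"):
        if segment.startswith(column + "="):
            return segment.split("=", 1)[1]
    return None

# Predicate pushed down from a query like: WHERE date >= '2024-01-02'
pruned = [p for p in paths if partition_value(p, "date") >= "2024-01-02"]
# only the last two paths survive; the first directory is never scanned
```

Because the predicate is evaluated against path metadata alone, pruning cost scales with the number of partitions rather than the volume of data.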
Structured Streaming is employed in a variety of production scenarios: real-time analytics and dashboards at enterprises such as Netflix, fraud-detection pipelines at financial institutions that interface with Apache Cassandra and PostgreSQL, clickstream processing for advertising companies using Apache Kafka and Amazon S3, and IoT telemetry ingestion integrated with Azure Event Hubs and Google Cloud Pub/Sub. Deployments often combine Structured Streaming with Delta Lake for ACID transactional sinks, MLflow or TensorFlow Serving for model scoring, and orchestration platforms such as Apache Airflow for end-to-end workflows.