| Streaming Replication | |
|---|---|
| Name | Streaming Replication |
| Type | Data replication technique |
| Introduced | 2000s |
| Primary domains | Distributed systems;Databases |
| Technologies | PostgreSQL;MySQL;Oracle |
Streaming Replication
Streaming Replication is a continuous data-replication technique used to propagate changes from a primary system to one or more secondary systems in near real time. It is employed in database systems such as PostgreSQL, MySQL, Oracle Database, Microsoft SQL Server, and MongoDB to support high availability, disaster recovery, and read scaling. Implementations build on mechanisms such as write-ahead logging (WAL), interact with tools like rsync, SSH, and pg_basebackup, and integrate with orchestration systems such as Kubernetes, Docker Swarm, and Apache Mesos.
Streaming Replication provides a continuous stream of change records from a source node to replicas, enabling secondary nodes to apply updates and remain close to the primary's state. The approach appears across database systems including PostgreSQL streaming replication, MySQL Group Replication, and Oracle Data Guard, and is often combined with cluster managers such as Pacemaker and Corosync or with cloud platforms like Amazon Web Services, Google Cloud Platform, and Microsoft Azure. Use cases include read scaling for read-heavy applications, disaster recovery for enterprise databases, and regulatory-compliance strategies that require geographically separated copies of data. Design choices draw on standards and research published through bodies such as ACM and IEEE and on consensus algorithms such as Raft and Paxos.
Typical architectures consist of a primary node, one or more standbys, a replication log or stream, and a transport layer. Core components mirror concrete system elements such as the PostgreSQL WAL sender and receiver processes, the MySQL binary log (binlog) with its applier threads, and the Oracle Data Guard Broker. Network transport relies on protocols and tools like TCP/IP, TLS, and SSH, and on middleware including HAProxy, NGINX, and Envoy. Management and automation integrate with configuration systems such as Ansible, SaltStack, Chef, and Puppet, and with monitoring via Prometheus, Grafana, and Nagios.
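As an illustration of the primary/standby split described above, a minimal PostgreSQL streaming-replication pair can be configured roughly as follows. This is a sketch, not a complete setup guide; the hostname and role name are placeholders.

```
# postgresql.conf on the primary (assumed PostgreSQL 12 or later)
wal_level = replica          # emit enough WAL detail for physical standbys
max_wal_senders = 5          # allow up to five concurrent WAL sender processes

# postgresql.conf on the standby; an empty standby.signal file in the
# data directory puts the server into standby mode
primary_conninfo = 'host=primary.example.internal port=5432 user=replicator'
```

On startup, the standby's WAL receiver connects to a WAL sender on the primary and streams change records over that connection, matching the sender/receiver architecture described above.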
In streaming designs, the primary writes change records to a durable log, and those records are shipped to replicas; systems vary in consistency from eventual to strong. For example, PostgreSQL supports synchronous and asynchronous commit modes that relate to the guarantees discussed in CAP-theorem literature and in consensus algorithms like Raft and Paxos; MySQL replication historically provided eventual consistency, and later releases added semi-synchronous options. Concepts such as write-ahead logging in PostgreSQL and Oracle, and binlog positions in MySQL, drive recovery semantics and point-in-time restore workflows. Multi-primary systems may implement conflict detection and resolution strategies informed by research such as Google's Spanner.
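The log-shipping loop described above can be sketched in a few lines of Python. This is a toy model, not any real database's API: `Primary`, `Replica`, and `replication_lag` are illustrative names, and the "log position" is simply a record count standing in for an LSN.

```python
from dataclasses import dataclass, field

@dataclass
class Primary:
    log: list = field(default_factory=list)  # the durable change log

    def commit(self, change: str) -> int:
        """Append a change record and return its log position (LSN-like)."""
        self.log.append(change)
        return len(self.log)

@dataclass
class Replica:
    applied: list = field(default_factory=list)

    def apply_from(self, primary: Primary, batch: int) -> None:
        """Stream and apply up to `batch` records past our current position."""
        start = len(self.applied)
        for record in primary.log[start:start + batch]:
            self.applied.append(record)

def replication_lag(primary: Primary, replica: Replica) -> int:
    """Number of committed records the replica has not yet applied."""
    return len(primary.log) - len(replica.applied)

primary, replica = Primary(), Replica()
for i in range(5):
    primary.commit(f"row-{i}")
replica.apply_from(primary, batch=3)
print(replication_lag(primary, replica))  # 2: asynchronous replicas trail the primary
```

In asynchronous mode the primary commits without waiting, so a nonzero lag like this is normal; a synchronous mode would block the commit until the replica confirmed the apply.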
Deployment patterns include primary-standby, multi-primary, and cascading replicas; orchestration is commonly performed with Kubernetes operators, systemd units, or managed cloud services like Amazon RDS, Google Cloud SQL, and Azure Database for PostgreSQL. Configuration touches on replication slots, WAL retention, binlog format, and network tuning with tools such as ethtool and tc. Backup and bootstrap workflows use utilities like pg_basebackup, mysqldump, RMAN, and snapshots via VMware ESXi, AWS EBS, or Google Compute Engine images. Infrastructure-as-code practices tie into Terraform and CloudFormation templates used by enterprises including Spotify and Airbnb.
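Two of the configuration items mentioned above, replication slots and cascading replicas, can be sketched for PostgreSQL as follows; the slot and host names are placeholders.

```
-- On the primary: reserve WAL for a named standby so it is not
-- recycled before that standby has consumed it.
SELECT pg_create_physical_replication_slot('standby1');

-- On the standby (postgresql.conf): attach to that slot. Pointing
-- primary_conninfo at another standby instead of the primary
-- yields a cascading replica.
-- primary_slot_name = 'standby1'
-- primary_conninfo  = 'host=standby0.example.internal user=replicator'
```

Slots trade disk space for safety: WAL retained for a slow or offline standby accumulates on the primary, which is why slot monitoring usually accompanies this setup.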
Failover strategies span automated and manual approaches coordinated by systems such as Patroni, repmgr, Keepalived, and Pacemaker, with fencing solutions like STONITH. Recovery involves promoting a standby to primary, applying any remaining logs, and reconciling the old primary using tools like pg_rewind, mysqlrplsync, and Oracle Flashback. Outage management draws on runbooks inspired by published incident responses at companies such as Amazon, GitHub, and Twitter, and emphasizes testing with chaos-engineering tools such as Netflix's Chaos Monkey and failure-injection frameworks like Gremlin.
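One rule that automated failover managers commonly apply during promotion is to pick, among the reachable standbys, the one that has applied the most of the primary's log, minimizing data loss. A minimal sketch of that selection step, with an illustrative `Standby` type and integer log positions standing in for LSNs (not any tool's real API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Standby:
    name: str
    applied_position: int   # highest log position applied (LSN-like)
    reachable: bool         # did this node answer a health check?

def choose_promotion_candidate(standbys: list) -> Optional[Standby]:
    """Return the reachable standby with the most applied log, or None."""
    live = [s for s in standbys if s.reachable]
    return max(live, key=lambda s: s.applied_position, default=None)

cluster = [
    Standby("standby-a", applied_position=1200, reachable=True),
    Standby("standby-b", applied_position=1450, reachable=True),
    Standby("standby-c", applied_position=1500, reachable=False),  # fenced off
]
print(choose_promotion_candidate(cluster).name)  # standby-b
```

Note that standby-c holds more log but is excluded because it failed its health check; fencing the old primary (e.g. via STONITH) is what makes it safe to promote another node at all.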
Performance tuning requires attention to network latency, I/O throughput, and commit settings; metrics tracked include replication lag, WAL shipping rate, and IOPS, typically collected with Prometheus and visualized with Grafana. Scalability patterns incorporate read-only replicas behind proxies like PgPool-II, ProxySQL, and HAProxy, or sharding approaches related to Cassandra and MongoDB strategies. Capacity planning references published case studies from companies such as LinkedIn and Uber and leverages benchmarking tools such as sysbench, pgbench, and fio.
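Replication lag in PostgreSQL is often measured in bytes as the difference between two WAL log sequence numbers (LSNs), which print as two hexadecimal halves separated by a slash. The arithmetic behind that metric can be sketched as follows; the function names are illustrative (PostgreSQL itself exposes the same calculation as `pg_wal_lsn_diff`).

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL-style LSN like '16/B374D848' to a byte offset.

    The value is the high 32-bit half shifted left by 32, plus the low half.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)

print(lag_bytes("16/B374D848", "16/B3700000"))  # 317512 bytes behind
```

Exporters feeding Prometheus typically sample exactly this kind of byte difference on a schedule, so spikes in the resulting time series map directly to WAL the standby has yet to replay.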
Secure replication uses authentication and encryption mechanisms including SSL/TLS, SSH tunnels, client certificates, and role-based access control, often integrated with identity providers such as LDAP, Active Directory, and OAuth 2.0 services like Okta. Network isolation employs virtual networks such as AWS VPC, Google Cloud VPC, and Azure Virtual Network, together with firewalls such as iptables and Cisco ASA. Compliance and audit capabilities align with frameworks like PCI DSS, HIPAA, and GDPR and are often managed by security teams using tools from Splunk, the ELK Stack, and Snyk.
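As a concrete illustration of combining TLS, client certificates, and role-based restrictions, PostgreSQL's host-based access file can limit replication connections roughly as follows; the subnet and role name are placeholders.

```
# pg_hba.conf: allow the 'replicator' role to open replication
# connections only over TLS, from one subnet, and only when the
# client presents a certificate signed by a trusted CA.
# TYPE     DATABASE     USER        ADDRESS        METHOD
hostssl    replication  replicator  10.0.0.0/24    scram-sha-256 clientcert=verify-ca
```

The `hostssl` type rejects plaintext connections outright, so even a misconfigured standby cannot stream WAL unencrypted.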
Category:Data replication