| LSM trees (Log-Structured Merge-Tree) | |
|---|---|
| Name | Log-Structured Merge-Tree |
| Acronym | LSM |
| Inventors | Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil |
| First published | 1996 |
| Field | Computer science |
| Related | B-tree, Write-ahead logging, NoSQL, Database indexing |
LSM trees (log-structured merge-trees) are a family of data structures for managing high write throughput in persistent storage systems. They organize data across multiple levels of immutable components, converting random writes into sequential writes and improving performance for write-heavy ingestion workloads. LSM designs have influenced storage systems at Google, Facebook, Amazon, LinkedIn, and Twitter and are central to many NoSQL and embedded database engines.
LSM trees were introduced in 1996 by Patrick O'Neil and colleagues and have since been adapted by organizations such as Google, Facebook, Amazon, Netflix, Twitter, and LinkedIn for scalable storage. The core idea is to buffer updates in fast in-memory structures and periodically merge them into larger on-disk structures, a strategy that contrasts with in-place index structures such as the B-tree used in systems from Oracle and IBM. LSM architectures interact with storage technologies such as solid-state drives and hard disk drives, and with storage stacks deployed by companies such as Microsoft and Apple Inc.
An LSM deployment typically combines an in-memory component (often a mutable tree) with on-disk immutable components called SSTables or segments, a write-ahead log, and a background compaction scheduler. The in-memory buffer can be implemented as a sorted structure such as a red–black tree or skip list, while on-disk representations resemble the SSTables used by Google Bigtable and engines inspired by it. Durability is often provided by a write-ahead log, a mechanism that also appears in PostgreSQL, MySQL, and SQLite. In clustered deployments, compaction is often throttled in coordination with resource managers such as Kubernetes, Hadoop YARN, and Apache Mesos.
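The components above (mutable memtable, append-only log, immutable sorted segments) can be sketched in a few lines. This is a minimal illustration, not any real engine's design; the class name and the tiny flush threshold are made up for readability:

```python
import bisect

class MiniLSM:
    """Minimal LSM sketch: mutable memtable + newest-first immutable segments.
    Illustrative only; real engines add compaction, tombstones, and recovery."""

    FLUSH_THRESHOLD = 4  # tiny threshold so the flush path is easy to observe

    def __init__(self):
        self.wal = []          # stand-in for an append-only write-ahead log
        self.memtable = {}     # in-memory mutable component
        self.segments = []     # newest-first list of immutable sorted runs

    def put(self, key, value):
        self.wal.append((key, value))      # durability first: append to the log
        self.memtable[key] = value
        if len(self.memtable) >= self.FLUSH_THRESHOLD:
            self._flush()

    def _flush(self):
        # Freeze the memtable into an immutable, sorted segment (an "SSTable").
        segment = sorted(self.memtable.items())
        self.segments.insert(0, segment)   # newer segments shadow older ones
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in self.segments:      # search newest to oldest
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None
```

Note how `get` must consult the memtable first and then each segment from newest to oldest, which is exactly why unmerged segments increase read cost.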
Writes are appended to a log and inserted into the in-memory tree; when the memory component reaches a threshold, it is flushed to disk as an immutable component. Reads must consult the in-memory structure and one or more on-disk components, sometimes aided by Bloom filters and index summaries to avoid full scans, techniques also used in systems by Google, Facebook, Amazon, and Bloomberg L.P. Compaction merges overlapping key ranges across levels to reclaim space and maintain query efficiency; compaction scheduling involves algorithms studied in the context of Bigtable, HBase, Cassandra, and RocksDB. Concurrency control and transactionality integrate concepts from ACID, snapshot isolation, and concurrency mechanisms found in Berkeley DB and Oracle databases.
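The filter check on the read path can be illustrated with a toy Bloom filter; the bit-array size and hashing scheme below are illustrative, not tuned for real workloads:

```python
import hashlib

class SimpleBloom:
    """Toy Bloom filter used to skip segments that cannot contain a key.
    A 'no' answer is definitive; a 'yes' answer may be a false positive."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key):
        # Derive nhashes bit positions from seeded SHA-256 digests.
        for seed in range(self.nhashes):
            h = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits & (1 << p) for p in self._positions(key))
```

In an LSM read path, one such filter is typically attached to each on-disk segment and consulted before the segment is binary-searched, so most segments that lack the key are skipped without any I/O.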
LSM designs accept higher read amplification and background compaction overhead in exchange for sequential write performance, and compaction in turn introduces write amplification; this balance is also encountered in storage systems at Google, Facebook, Amazon, and Microsoft. Write-heavy workloads (log ingestion, time-series data) benefit, as seen in deployments at Twitter and LinkedIn, while read-heavy workloads may favor B-tree-based systems such as PostgreSQL, MySQL, and SQLite. The choice of storage media (NVMe, SSD, HDD) affects tuning decisions; operators at companies like Netflix and Dropbox tune compaction to balance latency and throughput. Storage cost models and SLAs from Amazon Web Services, Google Cloud Platform, and Microsoft Azure also influence these engineering trade-offs.
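A back-of-envelope model makes the write-amplification side of this trade-off concrete. Assuming leveled compaction where each byte is rewritten roughly once per size-ratio step at every level it descends through (a rough heuristic, not a benchmark of any real system):

```python
import math

def leveled_write_amp(data_bytes, memtable_bytes, fanout):
    """Rough write amplification under leveled compaction.
    Assumes each byte is rewritten ~`fanout` times per level it passes
    through; a heuristic model, not a measurement."""
    ratio = max(data_bytes / memtable_bytes, 1)
    levels = math.ceil(math.log(ratio, fanout))
    return levels * fanout

# Example: 1 TiB of data, a 64 MiB memtable, fanout 10:
# ratio = 2**14 = 16384, so 5 levels, giving write amplification ~50.
```

Under this model, a larger fanout means fewer levels (better reads) but more rewriting per level, which is the essence of the tiered-vs-leveled tuning knob discussed below.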
Numerous variants extend the basic LSM idea: tiered vs. leveled compaction strategies (as implemented in HBase and RocksDB), partitioning schemes as used by Cassandra and ScyllaDB, and hybrid designs that incorporate in-place updates similar to B-tree mechanisms found in PostgreSQL and MySQL. Optimizations include Bloom filters (invented by Burton Howard Bloom), fence pointers, partitioned filters, and write-throttling algorithms used in RocksDB and LevelDB. Work on adaptive compaction scheduling has been pursued by research groups at the University of California, Berkeley, the Massachusetts Institute of Technology, Carnegie Mellon University, and Stanford University.
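Tiered and leveled strategies differ in when and how many runs they merge, but both rest on the same core step: a k-way merge of sorted runs that keeps only the newest value per key. A sketch of that step, simplified to ignore tombstones and snapshots:

```python
import heapq

def compact(segments):
    """Merge sorted runs, keeping the newest value per key.
    `segments` is a newest-first list of sorted (key, value) lists.
    Simplified: no tombstone handling, no snapshot retention."""
    # Tag each entry with its segment's rank so that, for equal keys,
    # the entry from the newer segment (lower rank) sorts first.
    tagged = [
        [(key, rank, value) for key, value in seg]
        for rank, seg in enumerate(segments)
    ]
    merged, last_key = [], object()
    for key, _rank, value in heapq.merge(*tagged):
        if key != last_key:   # first occurrence is the newest version
            merged.append((key, value))
            last_key = key
    return merged
```

For example, merging a newer run `[("a", 1), ("c", 3)]` with an older run `[("a", 0), ("b", 2)]` yields `[("a", 1), ("b", 2), ("c", 3)]`: the newer value for `"a"` wins, and the result is a single sorted run.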
Production implementations appear in Google Bigtable, Apache HBase, Apache Cassandra, RocksDB, LevelDB, ScyllaDB, TiKV, FoundationDB (in parts of its layers), and Amazon DynamoDB-inspired systems; embedded engines are often contrasted with LMDB, which instead uses a copy-on-write B-tree. LSM-based stores power services at Google, Facebook, Amazon, Twitter, LinkedIn, Uber, and Airbnb, as well as scientific datasets managed by institutions like CERN and NASA. Use cases include time-series databases, event logs, message ingestion, and analytical pipelines built on Apache Kafka, Apache Flink, Apache Hadoop, and Apache Spark.
Category:Data structures