LLMpedia: The first transparent, open encyclopedia generated by LLMs

Log-Structured Merge-tree

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: CouchDB (Hop 4)
Expansion Funnel: Raw 69 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 69
2. After dedup: 0 (None)
3. After NER: 0
4. Enqueued: 0
Log-Structured Merge-tree
Name: Log-Structured Merge-tree
Type: Data structure
Developer: Patrick O'Neil; Edward Cheng; Dieter Gawlick; Elizabeth O'Neil
First appeared: 1996
Paradigm: Write-optimized indexing
Key concepts: Append-only storage, Compaction, Tiered storage, Bloom filter

The Log-Structured Merge-tree (LSM-tree) is a write-optimized indexing data structure designed to provide high-throughput insertions and scalable storage for large datasets on magnetic and solid-state devices. It was introduced to address the mismatch between the high cost of random writes on storage devices and the write-heavy workload patterns found in systems developed by organizations such as Google, Facebook, Amazon, and Yahoo!. The design influenced multiple distributed systems and databases from companies like LinkedIn, Twitter, and Apple Inc., and from research groups at the University of California, Berkeley, MIT, and Stanford University.

History and motivation

The LSM-tree was introduced in a 1996 Acta Informatica paper by Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil, which combined fast sequential writes with delayed merging, drawing on prior ideas from systems such as the log-structured file system. Early motivation cited performance bottlenecks in database and storage systems where random writes to rotating media and flash devices degraded throughput. Subsequent interest accelerated with large-scale services at Google LLC, Facebook, Inc., and Amazon Web Services that required horizontally scalable storage engines, inspiring engineering efforts at LinkedIn Corporation, Twitter, Inc., Netflix, Inc., and cloud providers like Microsoft Azure.

Design and data structures

An LSM-tree organizes data into multiple levels of sorted structures, combining an in-memory component with on-disk components. The primary in-memory component is analogous to the memtable used in Google's Bigtable and LevelDB, while the on-disk components resemble SSTables, immutable sorted files popularized by Bigtable and adopted by Apache Software Foundation projects. Supporting auxiliary structures include probabilistic filters, most commonly Bloom filters, implemented in products like Apache Cassandra, Apache HBase, and RocksDB. The architecture leverages append-only logs and sorted runs inspired by log-structured storage research.
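The in-memory component described above can be sketched as a small sorted buffer that is sealed and emitted as an immutable sorted run once it fills. A minimal illustration, assuming nothing about any particular engine (the class name `MemTable` and the tiny capacity are illustrative only):

```python
import bisect

class MemTable:
    """In-memory sorted buffer; flushed as an immutable sorted run when full."""

    def __init__(self, capacity=4):
        self.keys, self.values = [], []
        self.capacity = capacity

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # overwrite newest value in place
        else:
            self.keys.insert(i, key)        # keep keys sorted on insert
            self.values.insert(i, value)

    def full(self):
        return len(self.keys) >= self.capacity

    def flush(self):
        """Emit an immutable sorted run (key/value pairs) and reset the buffer."""
        run = list(zip(self.keys, self.values))
        self.keys, self.values = [], []
        return run
```

Because the buffer is kept sorted, a flush is a straight sequential write of an already ordered run, which is the property the on-disk components rely on.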

Operations (write, read, compaction)

Writes are ingested into an in-memory sorted structure and appended to a durable log for crash recovery, an approach also used in systems built by Google LLC and LinkedIn Corporation. When the in-memory structure reaches capacity, it is flushed as an immutable on-disk component, mirroring practices in Apache Cassandra, LevelDB, and RocksDB. Reads consult the in-memory structure first and then the collection of immutable components from newest to oldest, with Bloom filters reducing I/O by letting reads skip components that cannot contain the key. Compaction merges multiple on-disk runs into larger runs to reclaim space from overwritten and deleted entries and to bound the number of components a read must examine; compaction strategies are tuned in production systems to balance latency and throughput.
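The write, read, and compaction paths above can be sketched end to end. This is a minimal illustration under stated simplifications, not the API of any real engine: the log and on-disk runs are plain Python lists, and reads scan runs linearly rather than using binary search or Bloom filters.

```python
class LSMStore:
    """Sketch of the LSM write/read/compaction cycle (no real WAL or filters)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}                 # in-memory component
        self.wal = []                      # stand-in for the durable append-only log
        self.runs = []                     # immutable on-disk runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))      # append to log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:           # newest data wins
            return self.memtable[key]
        for run in reversed(self.runs):    # then runs, newest to oldest
            for k, v in run:               # real engines binary-search sorted runs
                if k == key:
                    return v
        return None

    def _flush(self):
        """Seal the memtable as an immutable sorted run; the log can be truncated."""
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in self.runs:              # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())]
```

After `compact()`, a read touches at most the memtable and a single run, which is exactly the read-amplification benefit compaction buys at the cost of rewriting data.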

Variants and extensions

Multiple variants extend the baseline design: tiered versus leveled compaction strategies evolved in systems like RocksDB and Cassandra, hybrid approaches combining B-tree features were explored at Stanford University, and SSD-aware layouts were researched at Intel Corporation and Samsung Electronics. Extensions incorporate transactional semantics influenced by work at Microsoft Research and IBM Research, distributed indexing architectures used in Apache HBase and Amazon DynamoDB, and multi-version concurrency control ideas from databases developed at Oracle Corporation and PostgreSQL Global Development Group.
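The size-tiered idea behind strategies such as Cassandra's STCS can be illustrated with a small sketch that groups runs of similar size and nominates full buckets for merging. The function name, threshold, and ratio below are illustrative assumptions, not Cassandra's actual parameters or code:

```python
def size_tiered_candidates(run_sizes, min_threshold=4, bucket_ratio=1.5):
    """Group runs of similar size; any bucket with >= min_threshold runs
    is a compaction candidate. Merging similarly sized runs bounds how
    often each byte is rewritten, the core idea of size-tiered compaction."""
    buckets = []                            # each bucket: [representative_size, sizes]
    for size in sorted(run_sizes):
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1][1].append(size)     # close enough in size: same bucket
        else:
            buckets.append([size, [size]])  # start a new bucket
    return [sizes for _, sizes in buckets if len(sizes) >= min_threshold]
```

Leveled compaction makes the opposite choice: it merges a small run into a much larger sorted level immediately, paying more write work for fewer overlapping runs per read.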

Performance and trade-offs

LSM-trees accept higher read amplification and space amplification in exchange for inexpensive, largely sequential writes, a trade-off documented in literature from the University of California, Berkeley, Cornell University, and ETH Zurich. Under write-heavy workloads typical of services run by Twitter, Inc. and Uber Technologies, Inc., LSM-trees can outperform the B-tree engines used in systems like MySQL and PostgreSQL, while reads can suffer because multiple components may need to be consulted, as analyzed by researchers at Princeton University and Columbia University. Tuning knobs, shaped by engineering practice at Facebook, Inc. and LinkedIn Corporation, include compaction scheduling, Bloom filter sizing, and level fanout, which together balance latency, throughput, and storage efficiency.
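The level-fanout knob can be made concrete with a back-of-the-envelope estimate: under leveled compaction, each byte is rewritten roughly once per level it passes through, multiplied by the fanout, while the number of levels shrinks logarithmically as fanout grows. The helper below is an illustrative approximation under those assumptions, not a formula from the cited literature:

```python
import math

def leveled_write_amp(data_size, memtable_size, fanout=10):
    """Rough write-amplification estimate for leveled compaction.

    levels ~ ceil(log_fanout(data_size / memtable_size)); each byte is
    rewritten about `fanout` times per level, so WA ~ fanout * levels.
    Illustrative only: real engines deviate from this simple model."""
    levels = max(1, math.ceil(math.log(data_size / memtable_size, fanout)))
    return fanout * levels
```

The estimate shows why fanout is a genuine trade-off: a larger fanout means fewer levels (better reads) but more rewriting per level (worse write amplification).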

Implementations and uses

Production implementations include Apache Cassandra, Apache HBase, RocksDB, LevelDB, ScyllaDB, and proprietary engines at Google LLC and Amazon. Large-scale users include Facebook, Inc. for messaging storage, LinkedIn Corporation for activity feeds, Twitter, Inc. for timeline services, Uber Technologies, Inc. for event logs, and cloud services provided by Microsoft Azure and Amazon Web Services. Academic prototypes and experiments originated from groups at MIT, Stanford University, UC Berkeley, and CMU.

LSM-trees are often compared to B-trees, a venerable structure used in Oracle Database and IBM Db2 deployments, and to log-structured file systems such as the one pioneered by Rosenblum and Ousterhout at the University of California, Berkeley. They rely on probabilistic filters (Bloom filters), introduced by Burton H. Bloom in 1970 and prominent in systems like Apache Cassandra and RocksDB, and on compaction techniques influenced by storage research at Intel Corporation and Samsung Electronics. Comparative evaluations appear in conferences hosted by ACM and IEEE, and in venues affiliated with SIGMOD, VLDB, and USENIX.
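A minimal Bloom filter makes the comparison concrete: membership tests may return false positives but never false negatives, which is why an LSM read can safely skip any run whose filter says the key is absent. The sizes and hash construction below are illustrative assumptions, not any engine's defaults:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: false positives possible, false negatives impossible.

    LSM engines keep one filter per on-disk run so most point reads can
    skip runs that cannot contain the key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.bits = bytearray(num_bits)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        # Derive k bit positions by salting one cryptographic hash; a real
        # implementation would use cheaper non-cryptographic hashes.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))
```

Sizing the bit array and hash count against the expected number of keys controls the false-positive rate, which is exactly the "Bloom filter sizing" tuning knob mentioned above.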

Category:Data structures