Storage engines — LLMpedia

Storage engines
Name	Storage engines
Type	Software component
Developed by	Various vendors and projects
Initial release	Various
License	Various

Contents

Overview
Architecture and Components
Types and Examples
Performance and Optimization
Transactions, Concurrency, and Recovery
Use Cases and Selection Criteria
Security and Data Integrity

Storage engines are the software modules that determine how a database stores, retrieves, and maintains data on persistent media. They underpin database systems from Oracle Corporation, MySQL, PostgreSQL, MongoDB, and SQLite, and they influence performance, durability, and feature sets whether those systems run on Amazon Web Services, Google Cloud Platform, Microsoft Azure, or on-premises hardware. Implementations range from embedded libraries such as SQLite and LevelDB to the storage layers of distributed systems such as Apache Cassandra, CockroachDB, HBase, and TiDB.

Overview

Storage engines define file formats, indexing structures, buffering strategies, and logging mechanisms. Engine design is driven by vendors and projects including Oracle Corporation (Oracle Database), Percona (Percona Server), MariaDB Corporation (MariaDB), MongoDB, Inc. (MongoDB), and the open-source community behind PostgreSQL. Storage research has been shaped by work at Bell Labs, the University of California, Berkeley, the Massachusetts Institute of Technology, and corporate labs at IBM and Microsoft Research.

Architecture and Components

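A storage engine’s architecture typically includes a page cache, a buffer manager, a write-ahead log (WAL) or redo log, an index manager, and a layer that translates between in-memory records and the on-disk format. These components build on concepts found in projects such as Berkeley DB, InnoDB, RocksDB, and LMDB, and in in-memory columnar formats such as Apache Arrow. The buffer manager works with the operating system (for example Linux or FreeBSD) and the filesystem (for example ext4, XFS, or ZFS) to optimize I/O. Transaction logs drive crash recovery: changes are recorded in the log before they reach the data files, so the engine can replay them after a failure. The underlying algorithms draw on work by researchers such as Leslie Lamport and on papers presented at the SIGMOD and VLDB conferences.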
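
The log-before-data rule described above can be illustrated with a minimal sketch. It is not taken from any of the engines named here: the class name `TinyWalStore`, the JSON-lines log format, and the `demo_wal` helper are all assumptions made for illustration, and a real engine would add checksums, log truncation, and page-level buffering.

```python
import json
import os
import tempfile


class TinyWalStore:
    """Toy key-value store (hypothetical): log every change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Recovery: re-apply every complete record found in the log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record left by a crash; stop here
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to stable storage.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2. Only then apply it to the in-memory state.
        self.data[key] = value

    def close(self):
        self._log.close()


def demo_wal():
    """Write two keys, 'restart' by reopening the log, return recovered state."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        store = TinyWalStore(path)
        store.put("a", 1)
        store.put("b", 2)
        store.close()
        recovered = TinyWalStore(path)
        result = dict(recovered.data)
        recovered.close()
        return result
```

Calling `demo_wal()` rebuilds the in-memory dictionary purely from the log file, which is the essence of redo-style recovery; the `fsync` call is what makes the log, rather than the in-memory state, the source of truth.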

Types and Examples

Storage engines vary by data layout and architecture: row-oriented engines (e.g., InnoDB, the default engine in MySQL), columnar engines and formats (e.g., the MergeTree family in ClickHouse and the Apache Parquet file format), and log-structured merge-tree (LSM) engines (e.g., LevelDB, RocksDB, and the storage layer of Apache Cassandra). Multiversion concurrency control (MVCC), as implemented in PostgreSQL, is a concurrency technique rather than a data layout and can be combined with any of these designs. Hybrid and pluggable engines appear in systems like MariaDB and Percona Server, while embedded engines include SQLite and Realm. Distributed engines for scale-out come from projects such as Apache HBase, Apache Cassandra, CockroachDB, TiDB, and YugabyteDB.
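
To make the log-structured merge-tree idea concrete, the following sketch shows the basic write and read paths under simplifying assumptions: runs are kept in memory rather than on disk, deletions (tombstones) are omitted, and the class name `TinyLSM` and its `memtable_limit` parameter are hypothetical. It does not reflect the internals of LevelDB, RocksDB, or Cassandra.

```python
import bisect


class TinyLSM:
    """Toy LSM tree (hypothetical): a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.runs = []  # oldest first; each run is (sorted_keys, values)

    def put(self, key, value):
        # Writes only touch the in-memory table, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, immutable run.
        items = sorted(self.memtable.items())
        self.runs.append(([k for k, _ in items], [v for _, v in items]))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None

    def compact(self):
        # Merge all runs into one, letting newer values overwrite older ones.
        merged = {}
        for keys, values in self.runs:
            merged.update(zip(keys, values))
        items = sorted(merged.items())
        self.runs = [([k for k, _ in items], [v for _, v in items])] if items else []
```

The sketch shows the trade-off that defines this family: writes are sequential and cheap, reads may have to consult several runs, and compaction periodically pays to bring the number of runs back down.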

Performance and Optimization

Optimization strategies include index selection, compression, caching, and adaptive I/O scheduling. Many techniques trace back to Google (SSTables, the sorted on-disk tables described in the Bigtable paper), Facebook (which developed RocksDB as a fork of LevelDB), and research from Stanford University and MIT. Hardware-aware tuning targets NVMe storage, persistent memory, and GPU-accelerated I/O from vendors such as Intel and NVIDIA; deployments at scale are commonly orchestrated with Kubernetes, Docker, and OpenStack. Benchmarking often relies on the TPC benchmark suites and on academic evaluations published at SIGMOD and VLDB.
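
As an illustration of the caching strategy mentioned above, the sketch below implements a least-recently-used (LRU) page cache and counts hits and misses. LRU is only one possible eviction policy and is used here as an assumption; the class name `LRUPageCache` and the `read_page` callback are hypothetical and stand in for whatever a real buffer manager would use to fetch a page from disk.

```python
from collections import OrderedDict


class LRUPageCache:
    """Toy page cache (hypothetical): evicts the least recently used page."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page  # callback that loads a page on a miss
        self.pages = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.misses += 1
        page = self.read_page(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page
```

With a capacity of two pages and the access sequence 1, 2, 1, 3, 2, this cache records one hit and four misses: the second access to page 1 hits, page 3 evicts page 2, and the final access to page 2 misses again. Tracking this ratio under a representative workload is a simple way to judge whether a cache is sized appropriately.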

Transactions, Concurrency, and Recovery

Concurrency control and recovery are implemented with lock managers, MVCC, two-phase commit, and write-ahead logging. These mechanisms underpin ACID guarantees; distributed systems add coordination services such as Apache ZooKeeper and consensus protocols such as Paxos and Raft (Raft is used by etcd and Consul). Database vendors, including Oracle Corporation and Microsoft (SQL Server), and open-source projects implement varied isolation levels and recovery algorithms informed by research from IBM Research and papers appearing at EuroSys.
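
The core idea of MVCC, that readers see a consistent snapshot while writers add new versions instead of overwriting old ones, can be sketched as follows. This is a deliberately reduced model and not how PostgreSQL or any other engine implements it: every write is treated as an immediately committed single-statement transaction, timestamps come from a simple counter, and the names `TinyMVCC`, `write`, `snapshot`, and `read` are hypothetical.

```python
import itertools


class TinyMVCC:
    """Toy multiversion store (hypothetical): writers append, readers pick by timestamp."""

    def __init__(self):
        self._clock = itertools.count(1)  # monotonically increasing timestamps
        self._versions = {}  # key -> list of (commit_ts, value), oldest first

    def write(self, key, value):
        # Assumption: each write commits immediately with a fresh timestamp.
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def snapshot(self):
        # A reader's snapshot is just the timestamp at which it started.
        return next(self._clock)

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None
```

A reader that takes a snapshot between two writes to the same key keeps seeing the first value even after the second write commits, without either side taking a lock. Real engines must additionally handle uncommitted and aborted versions and garbage-collect versions that no snapshot can still see.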

Use Cases and Selection Criteria

The choice of engine depends on the workload profile and on the constraints of the target platform, whether that is AWS Lambda, Google Kubernetes Engine, or an enterprise system from SAP or Salesforce. OLTP workloads often favor the row-oriented engines of the MySQL and PostgreSQL ecosystems, analytic workloads use columnar stores such as ClickHouse or Amazon Redshift, and time-series or IoT scenarios use purpose-built engines such as InfluxDB (from InfluxData) or TimescaleDB. Considerations include latency, throughput, consistency, operational complexity, and integration with ecosystem projects such as Apache Kafka, Spark, and Hadoop.

Security and Data Integrity

Security features in engines include encryption at rest, encryption in transit, audit logging, and access controls that integrate with identity providers such as Okta and Active Directory. Integrity mechanisms include checksums and Merkle trees (hash trees that predate Bitcoin and were later popularized by it), often built on cryptographic libraries such as OpenSSL; compliance requirements tie implementations to ISO standards and to regulations such as GDPR and HIPAA. Backup and recovery integrate with tools from Veeam and Commvault and with snapshots stored in cloud object storage such as Amazon S3, Google Cloud Storage, and Azure Blob Storage.
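
The two integrity mechanisms named above can be sketched with the Python standard library. The page layout (a 4-byte CRC32 prefix), the duplicate-last-node rule for odd levels, and the function names `seal_page`, `open_page`, and `merkle_root` are assumptions chosen for illustration; real engines use their own page formats and tree layouts.

```python
import hashlib
import zlib


def seal_page(payload: bytes) -> bytes:
    """Prefix a page payload with its CRC32 checksum (hypothetical layout)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def open_page(page: bytes) -> bytes:
    """Verify the checksum and return the payload, or raise on corruption."""
    stored = int.from_bytes(page[:4], "big")
    payload = page[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("page checksum mismatch")
    return payload


def merkle_root(blocks):
    """Compute a SHA-256 Merkle root over a list of byte blocks."""
    level = [hashlib.sha256(block).digest() for block in blocks]
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumption: duplicate the last node
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

A per-page checksum detects corruption of a single page when it is read, while a Merkle root lets two replicas compare one short hash to decide whether their data sets differ and then descend the tree to find the differing blocks. CRC32 guards against accidental corruption only; it offers no protection against deliberate tampering.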

Category:Database engines