LLMpedia: the first transparent, open encyclopedia generated by LLMs

C-Store

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: HP Vertica (Hop 4)
Expansion Funnel Raw 60 → Dedup 0 → NER 0 → Enqueued 0
C-Store
Name: C-Store
Developer: Brown University; MIT; Brandeis University
Initial release: 2005
Latest release: 2006
Programming language: C, C++
Platform: Linux, x86
License: BSD license
Website: C-Store project

C-Store is a column-oriented storage system for analytic workloads, originally developed by researchers at Brandeis University, Brown University, and MIT. It emphasizes read-optimized storage, projection-based data organization, and aggressive compression to accelerate the queries typical of data warehousing and decision-support workloads. The project influenced commercial systems and academic research in columnar databases, OLAP, and big-data technologies.

Overview

C-Store was introduced by a collaboration among researchers at Brown University, MIT, and Brandeis University to address the limitations of row-oriented engines on large-scale analytical queries such as those in the TPC-H and TPC-DS benchmarks. The architecture separates reads from writes and introduces concepts such as projections, the tuple mover, and a write-optimized store to support fast scans; these ideas later influenced systems such as Apache Hive, Amazon Redshift, Google BigQuery, and the Vertica lineage. The original C-Store paper appeared at VLDB 2005, and the design shaped Hadoop-ecosystem columnar formats such as Parquet and ORC.

Architecture and Design

C-Store's design splits data into multiple projections—physically stored subsets of columns—allowing storage layouts tailored to access patterns similar to techniques used by Sybase IQ and MonetDB. The system separates a read-optimized store (ROS) and a write-optimized store (WOS) with a background tuple mover merging WOS into ROS; this pattern is analogous to approaches in RocksDB and LevelDB. Metadata, recovery, and transaction handling draw on concepts from PostgreSQL and Ingres while leveraging techniques discussed in Transaction Processing: Concepts and Techniques. The planner and executor incorporate cost models comparable to those in System R and Microsoft SQL Server.
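The WOS/ROS split and the background tuple mover described above can be sketched as follows. This is a minimal illustration of the pattern, not C-Store's actual interfaces; all class and method names are invented for the example.

```python
# Sketch: writes land in a row-oriented write-optimized store (WOS); a
# background "tuple mover" sorts them and merges column-by-column into the
# read-optimized store (ROS). Names are illustrative, not C-Store's API.

class ColumnStore:
    def __init__(self, columns):
        self.columns = columns                  # column names
        self.wos = []                           # row-oriented insert buffer
        self.ros = {c: [] for c in columns}     # sorted, column-oriented storage

    def insert(self, row):
        """Writes go to the WOS, avoiding a columnar rewrite per insert."""
        self.wos.append(row)

    def tuple_mover(self, sort_key):
        """Background merge: sort buffered rows, append column-by-column."""
        for row in sorted(self.wos, key=lambda r: r[sort_key]):
            for i, c in enumerate(self.columns):
                self.ros[c].append(row[i])
        self.wos.clear()

store = ColumnStore(["id", "price"])
store.insert((2, 9.5))
store.insert((1, 3.0))
store.tuple_mover(sort_key=0)
print(store.ros["id"])  # column now holds [1, 2]
```

A real system would merge the sorted WOS batch with existing ROS segments and handle concurrent reads during the merge; the sketch only shows the two-store split itself.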

Data Storage and Compression

C-Store organizes data into columnar projections, enabling compression schemes such as run-length encoding, dictionary encoding, and delta encoding, also used in Oracle Database and SAP HANA. Compression reduces I/O, much as in column-store research and products such as Amazon Redshift and Vertica, improving the scan throughput measured in studies that compare against MySQL, PostgreSQL, and SQLite. The physical layout supports the late-materialization techniques discussed in column-store query-processing research and optimizes for CPU-cache locality, as in compiled (e.g., LLVM-based) query execution engines.
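Two of the encodings mentioned above are simple enough to sketch directly. These are generic textbook implementations, not C-Store's code; they show why sorted columnar data compresses well (sorted columns produce long runs and small deltas).

```python
def run_length_encode(column):
    """RLE: collapse runs of equal values into (value, count) pairs."""
    encoded = []
    for v in column:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)   # extend the current run
        else:
            encoded.append((v, 1))                  # start a new run
    return encoded

def delta_encode(column):
    """Delta: store the first value, then successive differences."""
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]

print(run_length_encode(["US", "US", "US", "EU"]))  # [('US', 3), ('EU', 1)]
print(delta_encode([100, 102, 105]))                # [100, 2, 3]
```

On a column sorted by country, RLE stores one pair per distinct run; on a monotonically increasing key, delta encoding keeps every stored value small enough for tighter bit-packing.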

Query Processing and Optimization

The query processor in C-Store performs projection-aware planning, exploiting precomputed projections to minimize reads, much like the optimizers in Teradata and Netezza. It uses a vectorized execution model influenced by MonetDB and Vectorwise and applies cost-based selection of projections, akin to Selinger-style optimization in System R and IBM DB2. Join processing, aggregation, and group-by operations are engineered for columnar access patterns, comparable to the hash-join techniques used by Greenplum and Amazon Aurora.
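The vectorized, column-at-a-time style referenced above can be sketched with selection vectors: operators consume whole column chunks and pass lists of qualifying positions downstream, deferring row reconstruction (late materialization). Function names here are illustrative, not from any of the systems named.

```python
# Sketch of vectorized filtering in the MonetDB/Vectorwise style: a predicate
# runs over an entire column chunk and emits a selection vector; other columns
# are fetched only for the surviving positions (late materialization).

def vectorized_filter(column, predicate):
    """Return the positions of qualifying values (a selection vector)."""
    return [i for i, v in enumerate(column) if predicate(v)]

def gather(column, selection):
    """Fetch only the selected positions from another column."""
    return [column[i] for i in selection]

price    = [12.0, 99.0, 7.5, 45.0]
quantity = [3, 1, 10, 2]

sel = vectorized_filter(price, lambda p: p > 10.0)  # positions 0, 1, 3
print(gather(quantity, sel))                        # [3, 1, 2]
```

Processing a chunk at a time amortizes interpretation overhead across many values and keeps the inner loops tight, which is the core performance argument for vectorized columnar executors.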

Implementation and Use Cases

C-Store served as a research prototype and a foundation for commercial systems and academic projects; its ideas underpin Vertica and influenced Amazon Redshift and Google BigQuery. Typical use cases include large-scale analytical reporting; business-intelligence workloads built with SAP BusinessObjects, Tableau, and MicroStrategy; and scientific analytics similar to those run on Apache Spark clusters and Hadoop ecosystems. Integration points include ETL pipelines built with Informatica and Pentaho, and data-ingestion paths comparable to Apache Flume and Logstash.

Performance and Benchmarking

Benchmarks comparing C-Store prototypes with traditional row stores used workloads such as TPC-H and showed order-of-magnitude improvements on scan-heavy queries, similar to results reported in Vertica and Redshift evaluations. Performance tuning involves selecting projections, compression codecs, and merge policies, analogous to configuration in RocksDB and LevelDB, and results are often compared against MySQL, PostgreSQL, and Oracle Database baselines. Research publications at SIGMOD, VLDB, and ICDE quantify the trade-offs among write throughput, query latency, and storage efficiency, demonstrating the advantages of columnar systems for read-dominant analytic workloads.

Category:Database management systems