| Selinger optimizer | |
|---|---|
| Name | Selinger optimizer |
| Caption | Logical and physical plan search |
| Introduced | 1979 |
The Selinger optimizer is a cost-based query optimization approach introduced in 1979 for relational query processing. It pioneered a systematic search over logical and physical execution plans using dynamic programming and statistical cost models, becoming foundational to IBM's System R and its commercial successors and influencing the modern optimizers of PostgreSQL, Oracle Database, Microsoft SQL Server, and MySQL. The approach united ideas from query processing, database systems, compiler theory, and performance engineering to produce efficient execution strategies for complex SQL queries.
The Selinger optimizer emerged from research at IBM and was published in the 1979 SIGMOD paper "Access Path Selection in a Relational Database Management System" by Patricia Selinger and colleagues on the IBM System R team. Motivated by growing demand for automated optimization in relational database management systems, the work addressed challenges raised by expressive SQL features, large schemas, and limited hardware. The optimizer replaced the ad hoc, heuristic plan choices of earlier systems such as Ingres with cost-driven decisions, echoing cost-based techniques from compiler optimization.
At its core the Selinger optimizer applies dynamic programming to search a space of logical query trees and associated physical implementations. Starting from a relational algebra tree produced by the parser, it enumerates alternative join orders, access methods, and join algorithms, selecting among options such as nested-loop and sort-merge joins (hash joins were added by later cost-based optimizers), while pruning dominated plans. The algorithm builds optimal plans for progressively larger sets of relations from optimal subplans, restricting the search largely to left-deep join trees and retaining, alongside the cheapest plan for each subset, any plan that produces an "interesting order" useful to later sorts or merge joins.
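The dynamic program described above can be sketched as follows. The relations, cardinalities, selectivities, and nested-loop cost formula are toy assumptions, and interesting orders are omitted for brevity; this is a minimal left-deep enumeration, not System R's full algorithm:

```python
from itertools import combinations

# Hypothetical inputs: base-relation cardinalities and pairwise join selectivities.
CARD = {"A": 1000, "B": 100, "C": 10}
SEL = {frozenset({"A", "B"}): 0.01, frozenset({"B", "C"}): 0.1}  # missing pair = cross product

def join_sel(left, right):
    """Combined selectivity of all join predicates between two relation sets."""
    s = 1.0
    for l in left:
        for r in right:
            s *= SEL.get(frozenset({l, r}), 1.0)
    return s

def selinger(relations):
    """Left-deep dynamic programming over join orders (no interesting orders)."""
    best = {}  # frozenset of relations -> (cost, output cardinality, plan tree)
    for r in relations:
        best[frozenset({r})] = (0.0, CARD[r], r)  # toy model: free base scans
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for r in subset:  # extend the best plan for subset - {r} with r
                rest = subset - {r}
                lcost, lcard, lplan = best[rest]
                out_card = lcard * CARD[r] * join_sel(rest, {r})
                cost = lcost + lcard * CARD[r]  # toy nested-loop join cost
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, out_card, (lplan, r))
    return best[frozenset(relations)]

cost, card, plan = selinger(["A", "B", "C"])
```

With these numbers the search joins B and C first (the cheap, selective pair) and adds A last, illustrating how subplan reuse avoids re-costing each of the n! join orders independently.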
Cost estimation in the Selinger framework combines CPU and I/O considerations into a single weighted cost (in System R, page fetches plus a weight times the number of storage-interface calls) based on statistical summaries maintained by the system. Cardinality estimates rely on per-relation statistics such as row counts, page counts, and distinct-value counts; later systems added histograms gathered by utilities like ANALYZE. Inaccurate statistics can lead to plan misselection, a problem studied extensively in the research literature. The model influenced later cost formulations in PostgreSQL and commercial optimizers, and industrial benchmarks such as TPC-H and TPC-DS are used to measure optimizer quality.
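A minimal sketch of this style of selectivity-based cardinality estimation. The statistics table, column names, and numbers are invented; the uniformity assumption for equality predicates and the 1/3 default for unbounded range predicates follow the spirit of the original paper:

```python
# Hypothetical per-column statistics: row count, distinct values, value range.
STATS = {
    "orders.status": {"rows": 10000, "distinct": 5, "min": None, "max": None},
    "orders.amount": {"rows": 10000, "distinct": 1000, "min": 0, "max": 500},
}

def selectivity(column, op, value):
    """Estimated fraction of rows satisfying `column op value`."""
    s = STATS[column]
    if op == "=":
        return 1.0 / s["distinct"]        # uniformity assumption
    if op == ">":
        if s["min"] is not None:
            # Linear interpolation over the known value range.
            return (s["max"] - value) / (s["max"] - s["min"])
        return 1.0 / 3.0                  # default guess when no range is known
    return 1.0

def estimate_rows(column, op, value):
    return STATS[column]["rows"] * selectivity(column, op, value)
```

For example, `estimate_rows("orders.status", "=", "open")` yields 10000 / 5 = 2000 rows; a skewed real distribution (say, 90% of orders "closed") would make this estimate badly wrong, which is exactly the misselection risk the paragraph describes.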
Selinger-style optimization separates logical transformations from physical implementation choices. Logical rewrite rules, such as join commutativity and associativity, predicate pushdown, and projection pruning, are applied before or during enumeration. Physical choices include index scan versus full table scan, and selection of join algorithms tied to available access methods such as B-tree or hash indexes. The interplay of algebraic rewrites and implementation rules became a template adopted by systems including Ingres, Oracle Database, and Microsoft SQL Server.
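One such physical choice, index scan versus full table scan, can be illustrated with a simple cost comparison. The page counts, index height, and worst-case one-heap-page-per-matching-row assumption are illustrative, not any real system's formula:

```python
def seq_scan_cost(pages):
    """Full table scan: read every page once."""
    return float(pages)

def index_scan_cost(rows, selectivity, index_height=3):
    """B-tree descent plus, pessimistically, one heap page per matching row."""
    return index_height + rows * selectivity

def choose_access_path(pages, rows, selectivity):
    """Pick the cheaper of sequential scan and index scan."""
    seq, idx = seq_scan_cost(pages), index_scan_cost(rows, selectivity)
    return ("index scan", idx) if idx < seq else ("seq scan", seq)
```

For a highly selective predicate (`choose_access_path(1000, 100000, 0.001)`) the index wins; at 50% selectivity the sequential scan does, showing why the decision must be cost-based rather than a fixed rule.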
Implementations of Selinger-style optimizers introduce practical enhancements: memoization to avoid re-enumerating subplans, plan caching as found in Oracle Database and Microsoft SQL Server, and adaptive re-optimization techniques. Engineering trade-offs include pruning thresholds on the search space, plan-shape constraints (such as the left-deep restriction) to limit combinatorial growth, and cost models extended for parallel and distributed execution. Implementations also integrate statistics collectors and maintenance utilities that keep the cost model's inputs current.
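Plan caching can be sketched as a lookup keyed on normalized query text, so that queries differing only in literal values reuse one optimized plan. The `PlanCache` class and its crude literal normalization are hypothetical illustrations, not any product's API:

```python
import re

class PlanCache:
    """Cache optimized plans keyed on normalized SQL text (illustrative only)."""

    def __init__(self, optimize):
        self.optimize = optimize   # the expensive Selinger-style search
        self.cache = {}
        self.hits = 0

    def plan(self, sql):
        key = re.sub(r"\s+", " ", sql.strip().lower())  # collapse whitespace
        key = re.sub(r"\b\d+\b", "?", key)              # crude: literals -> placeholder
        if key in self.cache:
            self.hits += 1
        else:
            self.cache[key] = self.optimize(sql)
        return self.cache[key]

# Usage: the second query differs only in its literal, so optimization runs once.
calls = []
pc = PlanCache(lambda sql: calls.append(sql) or "plan")
pc.plan("SELECT * FROM t WHERE id = 1")
pc.plan("SELECT * FROM t WHERE id =  2")
```

Real systems normalize via parse trees and must also invalidate cached plans when statistics or schemas change; this sketch shows only the core trade of optimization time for a cache lookup.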
Empirical evaluation of Selinger-style optimizers uses microbenchmarks, standard suites, and production workloads. Academic studies and industry reports compare the approach to heuristic and randomized search, showing robust performance for moderate-size joins but exposing weaknesses for very large join graphs, where the dynamic programming search grows exponentially. Benchmarking with TPC-H and TPC-DS demonstrates scenarios where cardinality misestimation or search cost degrades performance, prompting hybrid strategies in systems such as Teradata and Vertica.
The Selinger optimizer directly influenced extensions including bushy-plan enumeration, parametric query optimization, and adaptive query processing. Modern distributed systems such as Apache Hive, Apache Spark, and cloud databases incorporate cost-based planning elements traceable to Selinger, combining them with rule-based optimization engines. Research on machine-learning-driven cost models extends the Selinger legacy, integrating learned cardinality estimators and adaptive re-optimization into the optimizer stack.