Why This Matters
Query optimization sits at the heart of database system performance—it's the difference between a query that returns in milliseconds and one that brings your system to its knees. You're being tested on your understanding of how the query optimizer makes decisions, why certain techniques reduce computational cost, and when to apply different strategies. These concepts connect directly to relational algebra, cost models, and system architecture principles that appear throughout database coursework.
Don't just memorize that "indexes make queries faster." Know why a B-tree index works for range queries but a hash index doesn't. Understand how the optimizer estimates costs and what statistics it needs. When exam questions ask you to analyze a query plan or suggest optimizations, you need to think like the optimizer—evaluating trade-offs, considering data characteristics, and predicting resource consumption.
Reducing Data Early
The most powerful optimization principle is simple: process less data. These techniques shrink the dataset as early as possible in the execution pipeline, reducing work for all downstream operations.
Predicate Pushdown
- Moves filter conditions closer to the data source—applying WHERE clauses before joins or aggregations dramatically cuts the rows flowing through the query plan
- Reduces I/O and memory pressure by eliminating irrelevant tuples at the earliest possible stage, especially critical in distributed systems where network transfer is expensive
- Enables index utilization since pushed-down predicates can leverage existing indexes that wouldn't be accessible later in execution
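The effect of pushdown can be sketched in plain Python. The table names, columns, and values below are hypothetical; real optimizers do this rewrite on the plan tree, but the row-count arithmetic is the same: filtering each input before the join touches far fewer row pairs than filtering the joined result.

```python
# Sketch of predicate pushdown: filter before joining rather than after.
# Tables are lists of dicts; names and data are hypothetical.
orders = [{"id": i, "cust": i % 100, "status": "shipped" if i % 10 else "pending"}
          for i in range(1000)]
customers = [{"cust": c, "region": "EU" if c < 50 else "US"} for c in range(100)]

# Naive plan: join everything (1000 x 100 comparisons), then filter.
joined = [(o, c) for o in orders for c in customers if o["cust"] == c["cust"]]
naive = [(o, c) for o, c in joined
         if o["status"] == "pending" and c["region"] == "EU"]

# Pushed-down plan: filter each input first, then join only the survivors.
pending = [o for o in orders if o["status"] == "pending"]   # 100 rows
eu = [c for c in customers if c["region"] == "EU"]          # 50 rows
pushed = [(o, c) for o in pending for c in eu if o["cust"] == c["cust"]]

# Both plans return the same rows; the pushed plan did 100 x 50 comparisons.
assert sorted(o["id"] for o, _ in naive) == sorted(o["id"] for o, _ in pushed)
```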
Query Rewriting and Simplification
- Transforms complex queries into equivalent but cheaper forms—the optimizer uses algebraic properties like commutativity and associativity of relational operations
- Eliminates redundant operations such as unnecessary DISTINCT clauses, dead code branches, or subqueries that can be flattened into joins
- Applies logical equivalences to restructure queries; for example, converting a correlated subquery into a more efficient semi-join
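Decorrelation can be illustrated the same way. In this hypothetical sketch, the correlated form re-scans the inner table once per outer row, while the rewritten semi-join builds the inner key set once and probes it:

```python
# Sketch of decorrelation: correlated "subquery" vs. semi-join rewrite.
# Table names and data are hypothetical.
customers = [{"id": c} for c in range(500)]
orders = [{"cust": c} for c in range(500) if c % 3 == 0]

# Correlated form: for each customer, scan all orders (O(n * m) work).
correlated = [c for c in customers
              if any(o["cust"] == c["id"] for o in orders)]

# Semi-join form: one pass builds the key set, then one O(1) probe per row.
order_custs = {o["cust"] for o in orders}
semijoin = [c for c in customers if c["id"] in order_custs]

assert correlated == semijoin   # logically equivalent, far cheaper
```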
Compare: Predicate Pushdown vs. Query Rewriting—both reduce work, but predicate pushdown focuses on where filters execute in the plan, while query rewriting changes the logical structure of the query itself. FRQ tip: if asked to optimize a query, consider both approaches.
Choosing the Right Access Path
Once you've minimized data volume, the next question is how to retrieve what remains. Access path selection determines whether you scan entire tables or surgically extract just the rows you need.
Index Selection and Usage
- B-tree indexes excel at range queries and ordered access—they maintain sorted order and support operations like <, >, BETWEEN, and ORDER BY efficiently
- Hash indexes provide O(1) lookup for equality conditions—ideal for point queries but useless for range scans since hashing destroys ordering
- Index maintenance overhead must be balanced against read benefits; every INSERT, UPDATE, or DELETE must update all relevant indexes
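The two access patterns can be mimicked with standard Python structures (a sorted list standing in for a B-tree leaf chain, a dict for a hash index; the data is hypothetical):

```python
import bisect

# Sorted list as a B-tree stand-in; dict as a hash-index stand-in.
# Salaries and ids here are hypothetical.
salaries = sorted([30000, 72000, 45000, 51000, 98000, 60000])
by_id = {42: "alice", 7: "bob"}

# Range query (salary > 50000): binary search to the boundary, then scan right.
start = bisect.bisect_right(salaries, 50000)
over_50k = salaries[start:]

# Equality lookup: O(1) in the hash map, but it cannot answer the range
# query without touching every key, because hashing destroys ordering.
row = by_id[42]
```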
Statistics and Cost Estimation
- Histograms capture value distributions to estimate selectivity—knowing that 90% of orders are "completed" vs. 1% are "pending" changes everything about plan choice
- Cardinality estimation predicts how many rows each operation produces; errors compound multiplicatively through the plan, making accurate statistics critical
- Cost models combine CPU, I/O, and memory estimates to score candidate plans, typically measured in abstract cost units rather than actual time
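A minimal selectivity estimate from a histogram looks like this (the status distribution mirrors the example above; the 1%-vs-90% split is hypothetical):

```python
from collections import Counter

# Sketch of histogram-based selectivity estimation; data is hypothetical.
statuses = ["completed"] * 900 + ["pending"] * 10 + ["shipped"] * 90
hist = Counter(statuses)
total = sum(hist.values())

def selectivity(value):
    """Estimated fraction of rows matching status = value."""
    return hist.get(value, 0) / total

# 'pending' matches ~1% of rows, so an index lookup wins; 'completed'
# matches 90%, so a full scan is likely cheaper than index probes.
est_pending = selectivity("pending")      # 0.01
est_completed = selectivity("completed")  # 0.9
```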
Compare: B-tree vs. Hash Index—both accelerate lookups, but B-trees support range queries (salary > 50000) while hash indexes only handle equality (id = 42). Know which to recommend based on the query pattern.
Optimizing Multi-Table Operations
Joins are often the most expensive operations in a query. The optimizer must decide which tables to join first and what algorithm to use—choices that can change execution time by orders of magnitude.
Join Order Optimization
- Join order affects intermediate result sizes—joining a 1M-row table with a 10-row lookup table first produces far less data than starting with two large tables
- Search space grows factorially with the number of tables; for n tables, there are n! possible orderings, making exhaustive search impractical for large queries
- Dynamic programming algorithms (like the Selinger optimizer) prune the search space by keeping only the cheapest plan for each subset of tables
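The dynamic-programming idea can be sketched for left-deep plans. The cardinalities, the uniform 1% join selectivity, and the cost function (sum of intermediate result sizes) are all hypothetical simplifications of what a Selinger-style optimizer actually models:

```python
from itertools import combinations

# Selinger-style DP over table subsets, keeping only the cheapest plan
# per subset. Cardinalities and selectivity below are hypothetical.
card = {"A": 1000, "B": 10, "C": 500}
sel = 0.01  # assume every join predicate keeps 1% of the cross product

def result_size(tables):
    size = 1.0
    for t in tables:
        size *= card[t]
    return size * sel ** (len(tables) - 1)

best = {frozenset([t]): (0.0, [t]) for t in card}  # subset -> (cost, order)
for k in range(2, len(card) + 1):
    for subset in map(frozenset, combinations(card, k)):
        for t in subset:                    # try each table joined last
            rest = subset - {t}
            # Simplified cost: prior cost plus the intermediate fed to the join.
            cost = best[rest][0] + result_size(rest)
            if subset not in best or cost < best[subset][0]:
                best[subset] = (cost, best[rest][1] + [t])

cost, order = best[frozenset(card)]   # joins the tiny table B first
```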
Parallel Query Execution
- Partitions work across multiple cores or nodes—a single expensive join can be split into independent pieces that execute concurrently
- Speedup depends on data distribution and the ability to minimize coordination; skewed data can create bottlenecks where one worker handles most of the load
- Synchronization overhead for combining partial results means parallelism isn't free—small queries may run slower with parallelism enabled
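The partition-then-combine shape can be sketched with a thread pool (threads stand in for the workers; a real system would use processes or nodes, and the data here is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of partitioned execution: hash-partition rows, aggregate each
# partition independently, then combine partial results. Data hypothetical.
rows = [{"cust": i % 7, "amount": i} for i in range(1000)]
n_workers = 4

partitions = [[] for _ in range(n_workers)]
for r in rows:
    # Skewed key distributions would overload one partition here.
    partitions[hash(r["cust"]) % n_workers].append(r)

def partial_sum(part):
    return sum(r["amount"] for r in part)

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)   # combine step: synchronization overhead lives here
```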
Compare: Join Order Optimization vs. Parallel Execution—join order reduces total work by minimizing intermediate results, while parallelism reduces elapsed time by distributing fixed work across resources. Both matter, but join order typically has larger impact.
Precomputation and Storage Strategies
Sometimes the best optimization is doing work ahead of time. These techniques trade storage space and maintenance overhead for faster query response.
Materialized Views
- Stores precomputed query results as physical tables—complex aggregations or multi-table joins execute once and get reused across many queries
- Query rewriting can automatically substitute materialized views when the optimizer detects a matching query pattern
- Refresh strategies range from immediate (always consistent) to deferred (eventually consistent); the choice depends on staleness tolerance and update frequency
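The trade-off can be shown in miniature: a dict plays the materialized view, and deferred refresh means updates leave it stale until recomputed (all names and figures are hypothetical):

```python
# Sketch: a "materialized view" as a precomputed aggregate; data hypothetical.
orders = [{"region": r, "amount": a}
          for r, a in [("EU", 10), ("EU", 20), ("US", 5), ("US", 7), ("US", 1)]]

def refresh_view():
    """Run the expensive aggregation once and store the result."""
    view = {}
    for o in orders:
        view[o["region"]] = view.get(o["region"], 0) + o["amount"]
    return view

sales_by_region = refresh_view()          # {"EU": 30, "US": 13}

# Later queries matching the pattern read the view instead of re-aggregating.
eu_total = sales_by_region["EU"]          # no scan of orders

# Deferred refresh: new rows make the view stale until refresh_view() reruns.
orders.append({"region": "EU", "amount": 100})
stale = sales_by_region["EU"]             # still the old total
```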
Partitioning
- Divides tables into smaller physical segments—range partitioning by date lets queries skip entire years of irrelevant data
- Partition pruning eliminates partitions from consideration based on query predicates; a query for 2024 data never touches 2020-2023 partitions
- Supports partition-wise joins where matching partitions from two tables join independently, enabling parallelism and reducing memory pressure
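Partition pruning reduces to checking partition metadata against the predicate before touching any rows, as in this hypothetical year-partitioned sketch:

```python
# Sketch of range partitioning by year with pruning; data is hypothetical.
partitions = {
    2020: [("a", 2020)], 2021: [("b", 2021)],
    2022: [("c", 2022)], 2023: [("d", 2023)],
    2024: [("e", 2024), ("f", 2024)],
}

def query(year):
    """Prune first: only partitions whose range matches the predicate are read."""
    touched = [y for y in partitions if y == year]
    rows = [r for y in touched for r in partitions[y]]
    return touched, rows

touched, rows = query(2024)   # reads 1 of 5 partitions; 2020-2023 never touched
```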
Caching and Buffer Management
- Buffer pool keeps hot pages in memory—repeated access to the same data blocks avoids expensive disk I/O
- Replacement policies like LRU (Least Recently Used) or clock algorithms decide which pages to evict when memory fills
- Query result caching stores entire result sets for repeated identical queries; particularly effective for read-heavy workloads with repetitive access patterns
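An LRU buffer pool fits in a few lines using an ordered dict; the page ids and the `read_from_disk` callback here are hypothetical stand-ins for real page I/O:

```python
from collections import OrderedDict

# Minimal LRU buffer-pool sketch: evict the least-recently-used page on overflow.
class BufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()          # page_id -> page data

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:           # hit: refresh recency, no I/O
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        data = read_from_disk(page_id)      # miss: expensive disk read
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict LRU page
        return data

reads = []
pool = BufferPool(capacity=2)
fetch = lambda pid: (reads.append(pid), f"page-{pid}")[1]  # records each disk read
pool.get(1, fetch)
pool.get(2, fetch)
pool.get(1, fetch)          # hit: no disk read
pool.get(3, fetch)          # evicts page 2 (least recently used)
pool.get(2, fetch)          # miss again: page 2 must be re-read
```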
Compare: Materialized Views vs. Caching—both store precomputed results, but materialized views are persistent and query-aware (the optimizer knows about them and can rewrite queries to use them), while caches are typically transient and transparent to the optimizer. Materialized views require explicit refresh; caches are usually managed automatically through eviction policies or expiration.
Execution Pipeline Optimization
Even with the best plan, how operations execute matters. Pipelining minimizes the overhead of moving data between operators.
Pipelining
- Passes tuples directly between operators without materializing intermediate results—a filter feeding a join streams rows one at a time
- Reduces memory consumption since you don't need to store the entire output of each operator before starting the next
- Blocking operators like sorts and hash joins break pipelines; they must consume all input before producing any output, forcing materialization
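Python generators make the streaming-vs-blocking distinction concrete: the filter below yields its first tuple before the scan finishes, while `sorted()` must materialize everything first (operator names and data are hypothetical):

```python
# Sketch: generators stream tuples through a filter (pipelined), while a
# sort is a blocking operator that materializes its whole input first.
def scan(rows):
    for r in rows:
        yield r                           # one tuple at a time, nothing stored

def filter_op(rows, pred):
    return (r for r in rows if pred(r))   # streams: pulls input only on demand

rows = [{"id": i, "v": i % 5} for i in range(10)]
pipeline = filter_op(scan(rows), lambda r: r["v"] == 0)

first = next(pipeline)                    # produced before the scan finishes

# A sort breaks the pipeline: it consumes all input before emitting anything.
blocked = sorted(filter_op(scan(rows), lambda r: r["v"] == 0),
                 key=lambda r: -r["id"])
```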
Compare: Pipelining vs. Materialization—pipelining minimizes memory and latency by streaming data, but some operations (sorting, grouping) inherently require seeing all data first. Recognizing which operators are pipeline breakers is key to understanding query plan performance.
Quick Reference Table
| Optimization Goal | Key Techniques |
| --- | --- |
| Reduce data early | Predicate Pushdown, Query Rewriting |
| Access path selection | Index Selection, Statistics/Cost Estimation |
| Join optimization | Join Order Optimization, Parallel Execution |
| Precomputation | Materialized Views, Caching |
| Physical data layout | Partitioning, Index Selection |
| Execution efficiency | Pipelining, Parallel Execution |
| Cost-based decisions | Statistics, Join Order Optimization |
Self-Check Questions
- Which two techniques both aim to reduce the amount of data processed, but operate at different stages of optimization? Explain when each would be applied.
- A query filters on status = 'active' and also requires results sorted by created_date. What type of index would you recommend, and why might a hash index be insufficient?
- Compare and contrast materialized views and query result caching. Under what workload characteristics would you prefer one over the other?
- An optimizer estimates a join will produce 1,000 rows but actually produces 1,000,000. What component failed, and how does this error propagate through the rest of the query plan?
- You're asked to optimize a query joining five tables. Why can't the optimizer simply try all possible join orders, and what technique does it use instead to find a good plan efficiently?