Why This Matters
Query optimization sits at the heart of database system performance—it's the difference between a query that returns in milliseconds and one that brings your system to its knees. You're being tested on your understanding of how the query optimizer makes decisions, why certain techniques reduce computational cost, and when to apply different strategies. These concepts connect directly to relational algebra, cost models, and system architecture principles that appear throughout database coursework.
Don't just memorize that "indexes make queries faster." Know why a B-tree index works for range queries but a hash index doesn't. Understand how the optimizer estimates costs and what statistics it needs. When exam questions ask you to analyze a query plan or suggest optimizations, you need to think like the optimizer—evaluating trade-offs, considering data characteristics, and predicting resource consumption.
Reducing Data Early
The most powerful optimization principle is simple: process less data. These techniques shrink the dataset as early as possible in the execution pipeline, reducing work for all downstream operations.
Predicate Pushdown
- Moves filter conditions closer to the data source—applying WHERE clauses before joins or aggregations dramatically cuts the rows flowing through the query plan
- Reduces I/O and memory pressure by eliminating irrelevant tuples at the earliest possible stage, especially critical in distributed systems where network transfer is expensive
- Enables index utilization since pushed-down predicates can leverage existing indexes that wouldn't be accessible later in execution
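The effect of pushdown can be sketched in plain Python. The table names, columns, and values below are hypothetical; real optimizers do this rewrite on the plan tree, but the row-count arithmetic is the same: filtering each input before the join touches far fewer row pairs than filtering the joined result.

```python
# Sketch of predicate pushdown: filter before joining rather than after.
# Tables are lists of dicts; names and data are hypothetical.
orders = [{"id": i, "cust": i % 100, "status": "shipped" if i % 10 else "pending"}
          for i in range(1000)]
customers = [{"cust": c, "region": "EU" if c < 50 else "US"} for c in range(100)]

# Naive plan: join everything (1000 x 100 comparisons), then filter.
joined = [(o, c) for o in orders for c in customers if o["cust"] == c["cust"]]
naive = [(o, c) for o, c in joined
         if o["status"] == "pending" and c["region"] == "EU"]

# Pushed-down plan: filter each input first, then join only the survivors.
pending = [o for o in orders if o["status"] == "pending"]   # 100 rows
eu = [c for c in customers if c["region"] == "EU"]          # 50 rows
pushed = [(o, c) for o in pending for c in eu if o["cust"] == c["cust"]]

# Both plans return the same rows; the pushed plan did 100 x 50 comparisons.
assert sorted(o["id"] for o, _ in naive) == sorted(o["id"] for o, _ in pushed)
```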
Query Rewriting and Simplification
- Transforms complex queries into equivalent but cheaper forms—the optimizer uses algebraic properties like commutativity and associativity of relational operations
- Eliminates redundant operations such as unnecessary DISTINCT clauses, dead code branches, or subqueries that can be flattened into joins
- Applies logical equivalences to restructure queries; for example, converting a correlated subquery into a more efficient semi-join
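Decorrelation can be illustrated the same way. In this hypothetical sketch, the correlated form re-scans the inner table once per outer row, while the rewritten semi-join builds the inner key set once and probes it:

```python
# Sketch of decorrelation: correlated "subquery" vs. semi-join rewrite.
# Table names and data are hypothetical.
customers = [{"id": c} for c in range(500)]
orders = [{"cust": c} for c in range(500) if c % 3 == 0]

# Correlated form: for each customer, scan all orders (O(n * m) work).
correlated = [c for c in customers
              if any(o["cust"] == c["id"] for o in orders)]

# Semi-join form: one pass builds the key set, then one O(1) probe per row.
order_custs = {o["cust"] for o in orders}
semijoin = [c for c in customers if c["id"] in order_custs]

assert correlated == semijoin   # logically equivalent, far cheaper
```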
Compare: Predicate Pushdown vs. Query Rewriting—both reduce work, but predicate pushdown focuses on where filters execute in the plan, while query rewriting changes the logical structure of the query itself. FRQ tip: if asked to optimize a query, consider both approaches.
Choosing the Right Access Path
Once you've minimized data volume, the next question is how to retrieve what remains. Access path selection determines whether you scan entire tables or surgically extract just the rows you need.
Index Selection and Usage
- B-tree indexes excel at range queries and ordered access—they maintain sorted order and support operations like <, >, BETWEEN, and ORDER BY efficiently
- Hash indexes provide O(1) lookup for equality conditions—ideal for point queries but useless for range scans since hashing destroys ordering
- Index maintenance overhead must be balanced against read benefits; every INSERT, UPDATE, or DELETE must update all relevant indexes
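The two access patterns can be mimicked with standard Python structures (a sorted list standing in for a B-tree leaf chain, a dict for a hash index; the data is hypothetical):

```python
import bisect

# Sorted list as a B-tree stand-in; dict as a hash-index stand-in.
# Salaries and ids here are hypothetical.
salaries = sorted([30000, 72000, 45000, 51000, 98000, 60000])
by_id = {42: "alice", 7: "bob"}

# Range query (salary > 50000): binary search to the boundary, then scan right.
start = bisect.bisect_right(salaries, 50000)
over_50k = salaries[start:]

# Equality lookup: O(1) in the hash map, but it cannot answer the range
# query without touching every key, because hashing destroys ordering.
row = by_id[42]
```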
Statistics and Cost Estimation
- Histograms capture value distributions to estimate selectivity—knowing that 90% of orders are "completed" vs. 1% are "pending" changes everything about plan choice
- Cardinality estimation predicts how many rows each operation produces; errors compound multiplicatively through the plan, making accurate statistics critical
- Cost models combine CPU, I/O, and memory estimates to score candidate plans, typically measured in abstract cost units rather than actual time
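A minimal selectivity estimate from a histogram looks like this (the status distribution mirrors the example above; the 1%-vs-90% split is hypothetical):

```python
from collections import Counter

# Sketch of histogram-based selectivity estimation; data is hypothetical.
statuses = ["completed"] * 900 + ["pending"] * 10 + ["shipped"] * 90
hist = Counter(statuses)
total = sum(hist.values())

def selectivity(value):
    """Estimated fraction of rows matching status = value."""
    return hist.get(value, 0) / total

# 'pending' matches ~1% of rows, so an index lookup wins; 'completed'
# matches 90%, so a full scan is likely cheaper than index probes.
est_pending = selectivity("pending")      # 0.01
est_completed = selectivity("completed")  # 0.9
```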
Compare: B-tree vs. Hash Index—both accelerate lookups, but B-trees support range queries (salary > 50000) while hash indexes only handle equality (id = 42). Know which to recommend based on the query pattern.
Optimizing Multi-Table Operations
Joins are often the most expensive operations in a query. The optimizer must decide which tables to join first and what algorithm to use—choices that can change execution time by orders of magnitude.
Join Order Optimization
- Join order affects intermediate result sizes—joining a 1M-row table with a 10-row lookup table first produces far less data than starting with two large tables
- Search space grows factorially with the number of tables; for n tables, there are n! possible orderings, making exhaustive search impractical for large queries
- Dynamic programming algorithms (like the Selinger optimizer) prune the search space by keeping only the cheapest plan for each subset of tables
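The dynamic-programming idea can be sketched for left-deep plans. The cardinalities, the uniform 1% join selectivity, and the cost function (sum of intermediate result sizes) are all hypothetical simplifications of what a Selinger-style optimizer actually models:

```python
from itertools import combinations

# Selinger-style DP over table subsets, keeping only the cheapest plan
# per subset. Cardinalities and selectivity below are hypothetical.
card = {"A": 1000, "B": 10, "C": 500}
sel = 0.01  # assume every join predicate keeps 1% of the cross product

def result_size(tables):
    size = 1.0
    for t in tables:
        size *= card[t]
    return size * sel ** (len(tables) - 1)

best = {frozenset([t]): (0.0, [t]) for t in card}  # subset -> (cost, order)
for k in range(2, len(card) + 1):
    for subset in map(frozenset, combinations(card, k)):
        for t in subset:                    # try each table joined last
            rest = subset - {t}
            # Simplified cost: prior cost plus the intermediate fed to the join.
            cost = best[rest][0] + result_size(rest)
            if subset not in best or cost < best[subset][0]:
                best[subset] = (cost, best[rest][1] + [t])

cost, order = best[frozenset(card)]   # joins the tiny table B first
```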
Parallel Query Execution
- Partitions work across multiple cores or nodes—a single expensive join can be split into independent pieces that execute concurrently
- Speedup depends on data distribution and the ability to minimize coordination; skewed data can create bottlenecks where one worker handles most of the load
- Synchronization overhead for combining partial results means parallelism isn't free—small queries may run slower with parallelism enabled
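The partition-then-combine shape can be sketched with a thread pool (threads stand in for the workers; a real system would use processes or nodes, and the data here is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of partitioned execution: hash-partition rows, aggregate each
# partition independently, then combine partial results. Data hypothetical.
rows = [{"cust": i % 7, "amount": i} for i in range(1000)]
n_workers = 4

partitions = [[] for _ in range(n_workers)]
for r in rows:
    # Skewed key distributions would overload one partition here.
    partitions[hash(r["cust"]) % n_workers].append(r)

def partial_sum(part):
    return sum(r["amount"] for r in part)

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)   # combine step: synchronization overhead lives here
```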
Compare: Join Order Optimization vs. Parallel Execution—join order reduces total work by minimizing intermediate results, while parallelism reduces elapsed time by distributing fixed work across resources. Both matter, but join order typically has larger impact.
Precomputation and Storage Strategies
Sometimes the best optimization is doing work ahead of time. These techniques trade storage space and maintenance overhead for faster query response.
Materialized Views
- Stores precomputed query results as physical tables—complex aggregations or multi-table joins execute once and get reused across many queries
- Query rewriting can automatically substitute materialized views when the optimizer detects a matching query pattern
- Refresh strategies range from immediate (always consistent) to deferred (eventually consistent); the choice depends on staleness tolerance and update frequency
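The trade-off can be shown in miniature: a dict plays the materialized view, and deferred refresh means updates leave it stale until recomputed (all names and figures are hypothetical):

```python
# Sketch: a "materialized view" as a precomputed aggregate; data hypothetical.
orders = [{"region": r, "amount": a}
          for r, a in [("EU", 10), ("EU", 20), ("US", 5), ("US", 7), ("US", 1)]]

def refresh_view():
    """Run the expensive aggregation once and store the result."""
    view = {}
    for o in orders:
        view[o["region"]] = view.get(o["region"], 0) + o["amount"]
    return view

sales_by_region = refresh_view()          # {"EU": 30, "US": 13}

# Later queries matching the pattern read the view instead of re-aggregating.
eu_total = sales_by_region["EU"]          # no scan of orders

# Deferred refresh: new rows make the view stale until refresh_view() reruns.
orders.append({"region": "EU", "amount": 100})
stale = sales_by_region["EU"]             # still the old total
```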
Partitioning
- Divides tables into smaller physical segments—range partitioning by date lets queries skip entire years of irrelevant data
- Partition pruning eliminates partitions from consideration based on query predicates; a query for 2024 data never touches 2020-2023 partitions
- Supports partition-wise joins where matching partitions from two tables join independently, enabling parallelism and reducing memory pressure
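Partition pruning reduces to checking partition metadata against the predicate before touching any rows, as in this hypothetical year-partitioned sketch:

```python
# Sketch of range partitioning by year with pruning; data is hypothetical.
partitions = {
    2020: [("a", 2020)], 2021: [("b", 2021)],
    2022: [("c", 2022)], 2023: [("d", 2023)],
    2024: [("e", 2024), ("f", 2024)],
}

def query(year):
    """Prune first: only partitions whose range matches the predicate are read."""
    touched = [y for y in partitions if y == year]
    rows = [r for y in touched for r in partitions[y]]
    return touched, rows

touched, rows = query(2024)   # reads 1 of 5 partitions; 2020-2023 never touched
```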
Caching and Buffer Management
- Buffer pool keeps hot pages in memory—repeated access to the same data blocks avoids expensive disk I/O
- Replacement policies like LRU (Least Recently Used) or clock algorithms decide which pages to evict when memory fills
- Query result caching stores entire result sets for repeated identical queries; particularly effective for read-heavy workloads with repetitive access patterns
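An LRU buffer pool fits in a few lines using an ordered dict; the page ids and the `read_from_disk` callback here are hypothetical stand-ins for real page I/O:

```python
from collections import OrderedDict

# Minimal LRU buffer-pool sketch: evict the least-recently-used page on overflow.
class BufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()          # page_id -> page data

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:           # hit: refresh recency, no I/O
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        data = read_from_disk(page_id)      # miss: expensive disk read
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict LRU page
        return data

reads = []
pool = BufferPool(capacity=2)
fetch = lambda pid: (reads.append(pid), f"page-{pid}")[1]  # records each disk read
pool.get(1, fetch)
pool.get(2, fetch)
pool.get(1, fetch)          # hit: no disk read
pool.get(3, fetch)          # evicts page 2 (least recently used)
pool.get(2, fetch)          # miss again: page 2 must be re-read
```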
Compare: Materialized Views vs. Caching—both store precomputed results, but materialized views are persistent and query-aware (the optimizer knows about them and can rewrite queries to use them), while caches are typically transient and transparent to the optimizer. Materialized views require explicit refresh; caches are usually managed automatically through eviction policies or expiration.
Execution Pipeline Optimization
Even with the best plan, how operations execute matters. Pipelining minimizes the overhead of moving data between operators.
Pipelining
- Passes tuples directly between operators without materializing intermediate results—a filter feeding a join streams rows one at a time
- Reduces memory consumption since you don't need to store the entire output of each operator before starting the next
- Blocking operators like sorts and hash joins break pipelines; they must consume all input before producing any output, forcing materialization
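Python generators make the streaming-vs-blocking distinction concrete: the filter below yields its first tuple before the scan finishes, while `sorted()` must materialize everything first (operator names and data are hypothetical):

```python
# Sketch: generators stream tuples through a filter (pipelined), while a
# sort is a blocking operator that materializes its whole input first.
def scan(rows):
    for r in rows:
        yield r                           # one tuple at a time, nothing stored

def filter_op(rows, pred):
    return (r for r in rows if pred(r))   # streams: pulls input only on demand

rows = [{"id": i, "v": i % 5} for i in range(10)]
pipeline = filter_op(scan(rows), lambda r: r["v"] == 0)

first = next(pipeline)                    # produced before the scan finishes

# A sort breaks the pipeline: it consumes all input before emitting anything.
blocked = sorted(filter_op(scan(rows), lambda r: r["v"] == 0),
                 key=lambda r: -r["id"])
```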
Compare: Pipelining vs. Materialization—pipelining minimizes memory and latency by streaming data, but some operations (sorting, grouping) inherently require seeing all data first. Recognizing which operators are pipeline breakers is key to understanding query plan performance.
Quick Reference Table
| Optimization Goal | Key Techniques |
| --- | --- |
| Reduce data early | Predicate Pushdown, Query Rewriting |
| Access path selection | Index Selection, Statistics/Cost Estimation |
| Join optimization | Join Order Optimization, Parallel Execution |
| Precomputation | Materialized Views, Caching |
| Physical data layout | Partitioning, Index Selection |
| Execution efficiency | Pipelining, Parallel Execution |
| Cost-based decisions | Statistics, Join Order Optimization |
Self-Check Questions
- Which two techniques both aim to reduce the amount of data processed, but operate at different stages of optimization? Explain when each would be applied.
- A query filters on status = 'active' and also requires results sorted by created_date. What type of index would you recommend, and why might a hash index be insufficient?
- Compare and contrast materialized views and query result caching. Under what workload characteristics would you prefer one over the other?
- An optimizer estimates a join will produce 1,000 rows but actually produces 1,000,000. What component failed, and how does this error propagate through the rest of the query plan?
- You're asked to optimize a query joining five tables. Why can't the optimizer simply try all possible join orders, and what technique does it use instead to find a good plan efficiently?