
Advanced Computer Architecture

Key Concepts of Cache Coherence Protocols


Why This Matters

Cache coherence is the glue that holds multiprocessor systems together. When multiple processors each have their own cache, you're essentially creating multiple "versions" of the same data—and without a coherence protocol, those versions can diverge, leading to incorrect program behavior. You're being tested on your understanding of how these protocols maintain consistency, why certain designs scale better than others, and when to apply different strategies based on system architecture.

The concepts here connect directly to broader themes in computer architecture: parallelism and its challenges, memory hierarchy trade-offs, scalability versus simplicity, and bandwidth optimization. Exam questions often ask you to compare protocols, identify which approach fits a given system size, or analyze the trade-offs between bus traffic and implementation complexity. Don't just memorize protocol names and states—know what problem each protocol solves and why its mechanism works.


Snoopy Protocol Foundations: State-Based Coherence

These protocols form the backbone of cache coherence in bus-based systems. Each cache line exists in one of several states, and state transitions are triggered by local processor actions or observed bus transactions. Understanding the progression from MSI to MESI to MOESI reveals how architects incrementally solved performance problems.

MSI (Modified-Shared-Invalid) Protocol

  • Three fundamental states—Modified (dirty, exclusive), Shared (clean, potentially in multiple caches), and Invalid (not usable)
  • Write-invalidate mechanism ensures coherence by forcing all other caches to invalidate their copies before a write completes
  • High bus traffic results from the lack of an Exclusive state: even data held by only one processor must broadcast an invalidation before its first write, generating unnecessary coherence transactions
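
The MSI transitions above can be sketched as a toy state machine (a simplified illustration with one line per cache, not a full protocol; bus arbitration and write-backs are elided):

```python
# Minimal MSI sketch: each cache tracks one line's state, and a write
# broadcasts invalidations to every other cache before completing.
MODIFIED, SHARED, INVALID = "M", "S", "I"

class MSICache:
    def __init__(self):
        self.state = INVALID

    def read(self):
        # On a read miss the line is loaded as Shared, even if no one else
        # holds it: MSI has no Exclusive state to record "I am the only holder."
        if self.state == INVALID:
            self.state = SHARED

    def write(self, other_caches):
        # Write-invalidate: all other copies are discarded before the write.
        for c in other_caches:
            c.state = INVALID
        self.state = MODIFIED

a, b = MSICache(), MSICache()
a.read(); b.read()        # both caches hold the line in Shared
a.write([b])              # a's write invalidates b's copy
assert (a.state, b.state) == ("M", "I")
```

Note that the invalidation broadcast in `write` happens even when no other cache actually holds the line, because a lone reader still sits in Shared; this is exactly the traffic MESI's Exclusive state removes.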

MESI (Modified-Exclusive-Shared-Invalid) Protocol

  • Exclusive state addition allows a cache to hold clean, private data without broadcasting—silent upgrades to Modified become possible
  • Reduced invalidation traffic compared to MSI because the protocol distinguishes between "only copy" and "shared copy" scenarios
  • Industry standard in most modern x86 and ARM multiprocessors due to its balance of simplicity and efficiency
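
The silent-upgrade path can be shown by extending the same toy model with an Exclusive state (again an illustrative sketch, with write-backs on downgrade elided):

```python
# MESI sketch: a read miss with no other sharers installs the line as
# Exclusive, so a later write moves to Modified without any bus transaction.
class MESICache:
    def __init__(self):
        self.state = "I"
        self.bus_transactions = 0

    def read(self, others):
        if self.state == "I":
            self.bus_transactions += 1            # the read miss itself
            sharers = [c for c in others if c.state != "I"]
            for c in sharers:
                c.state = "S"                     # downgrade any other holder
            self.state = "S" if sharers else "E"  # Exclusive when alone

    def write(self, others):
        if self.state in ("M", "E"):
            self.state = "M"                      # silent upgrade: no bus traffic
        else:
            self.bus_transactions += 1            # must broadcast an invalidate
            for c in others:
                c.state = "I"
            self.state = "M"

a, b = MESICache(), MESICache()
a.read([b])        # no other sharer -> line installed in Exclusive
a.write([b])       # E -> M silently
assert a.state == "M" and a.bus_transactions == 1
```

Under MSI the same read-then-write sequence would cost two bus transactions: the read miss plus an upgrade invalidation before the write.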

MOESI (Modified-Owned-Exclusive-Shared-Invalid) Protocol

  • Owned state permits a cache to supply data to other caches directly, avoiding expensive write-backs to main memory
  • Cache-to-cache transfers reduce memory bandwidth pressure when multiple processors share frequently-modified data
  • Best for high-contention workloads where shared data is read often but modified by one processor at a time
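
The Owned state's cache-to-cache supply path can be sketched as follows (illustrative only; the counters and `remote_read` helper are hypothetical bookkeeping, not part of any real protocol interface):

```python
# MOESI-flavored sketch: when another cache reads a Modified line, the holder
# moves to Owned and forwards its dirty copy directly, leaving memory stale.
MEMORY = 0            # stale value still sitting in main memory

class MOESILine:
    def __init__(self):
        self.state = "I"
        self.data = None

cache_to_cache = 0
memory_reads = 0

def remote_read(requester, holder):
    global cache_to_cache, memory_reads
    if holder.state in ("M", "O"):
        # Owner forwards the dirty copy cache-to-cache; it keeps write-back
        # responsibility, so no expensive write-back to memory happens here.
        requester.data = holder.data
        holder.state = "O"
        cache_to_cache += 1
    else:
        # Clean or absent copy: the data would come from main memory instead.
        requester.data = MEMORY
        memory_reads += 1
    requester.state = "S"

writer, reader = MOESILine(), MOESILine()
writer.state, writer.data = "M", 42    # writer holds a dirty copy
remote_read(reader, writer)
assert writer.state == "O" and reader.data == 42
assert cache_to_cache == 1 and memory_reads == 0
```

The key observation: the reader received the fresh value 42 even though memory still holds the stale value, because the Owned cache serviced the request directly.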

Compare: MSI vs. MESI—both use invalidation-based coherence, but MESI's Exclusive state eliminates bus traffic for private data. If an FRQ asks about optimizing single-threaded performance in a multiprocessor, MESI's silent upgrade path is your answer.


Architectural Approaches: Snooping vs. Directory

The mechanism for detecting coherence violations differs fundamentally between these approaches. Snooping relies on broadcast; directories rely on point-to-point messaging. This distinction drives scalability trade-offs that appear frequently on exams.

Snooping Protocols

  • Bus monitoring by all caches—every cache controller watches every transaction, checking addresses against its tags
  • Broadcast-based communication makes implementation straightforward but creates O(n) traffic per coherence event, where n is the processor count
  • Scalability ceiling around 8-16 processors before bus bandwidth becomes the bottleneck

Directory-Based Coherence Protocols

  • Centralized or distributed directory tracks which caches hold copies of each line and their current states
  • Point-to-point messages replace broadcasts, achieving O(1) traffic per coherence event regardless of processor count
  • Essential for large-scale systems like NUMA architectures where hundreds of processors must maintain coherence

Bus-Based Coherence Protocols

  • Shared bus as communication backbone—all coherence traffic flows through a single interconnect
  • Atomic transactions simplify protocol design because only one request can be in flight at a time
  • Bottleneck at scale makes this approach unsuitable for systems beyond roughly 16 processors
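
The scalability gap can be made concrete with a back-of-the-envelope message-count model (an assumption-laden sketch: one message per cache for a broadcast, and request plus per-sharer invalidate/ack pairs for a directory):

```python
# Why snooping traffic grows with processor count while directory traffic
# tracks only the number of actual sharers.
def snoop_messages(n_processors):
    # Broadcast: every coherence event is observed by all n cache controllers.
    return n_processors

def directory_messages(n_sharers):
    # Point-to-point: one request to the directory, then one invalidate plus
    # one ack per actual sharer -- independent of total processor count.
    return 1 + 2 * n_sharers

# A 64-processor system where only 2 caches actually share the line:
assert snoop_messages(64) == 64
assert directory_messages(2) == 5
```

This is why processor count is the first decision point: at 64 processors, every snooped event burdens 64 controllers, while the directory touches only the caches that matter.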

Compare: Snooping vs. Directory—snooping is simpler and has lower latency for small systems, but directory-based protocols scale to hundreds of processors. When analyzing a system design question, processor count is your first decision point.


Write Strategies: Invalidate vs. Update

When a processor writes to shared data, the protocol must decide how to inform other caches. This fundamental choice affects bandwidth consumption, read latency, and overall system performance.

Write-Invalidate vs. Write-Update Protocols

  • Write-invalidate discards other copies—only the writing cache retains valid data, minimizing bandwidth for write-heavy workloads
  • Write-update broadcasts new values—all caches stay current, reducing read-miss latency but consuming bandwidth on every write
  • Workload-dependent choice where write-invalidate wins for most general-purpose computing, while write-update suits specific patterns like producer-consumer
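
The workload dependence can be quantified with a toy bandwidth model (illustrative only; the costs are assumptions: 1 bus word per invalidation, 8 words for a full-line update broadcast or re-fetch):

```python
# Count bus words moved by each write strategy for a given access trace
# on one shared line.
INVALIDATE_COST = 1   # a short invalidation message
LINE_COST = 8         # a full cache line (update broadcast or miss refill)

def bus_words(strategy, trace):
    words = 0
    others_valid = True           # do other caches still hold a valid copy?
    for ev in trace:
        if ev == "write":
            if strategy == "update":
                words += LINE_COST          # broadcast the new value every write
            elif others_valid:
                words += INVALIDATE_COST    # one invalidate, then write locally
                others_valid = False
        else:  # "read" by another processor
            if strategy == "invalidate" and not others_valid:
                words += LINE_COST          # remote cache re-fetches the line
                others_valid = True
    return words

# Write burst that is never re-read: invalidate wins decisively.
burst = ["write"] * 10
assert bus_words("invalidate", burst) == 1
assert bus_words("update", burst) == 80

# Producer-consumer (alternating write/read): update wins here.
pc = ["write", "read"] * 5
assert bus_words("invalidate", pc) == 45
assert bus_words("update", pc) == 40
```

The model matches the rule of thumb above: invalidate dominates for general write-heavy sharing, while update pulls ahead only when every write is promptly consumed by a reader.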

Compare: Write-invalidate vs. Write-update—invalidate optimizes for bandwidth, update optimizes for read latency. Most real systems use invalidate because writes to shared data that's never re-read waste bandwidth under update protocols.


Advanced Scalability Solutions

These protocols address limitations of traditional approaches, enabling coherence in systems with dozens to thousands of processors. They represent the cutting edge of coherence research and appear in high-end servers and supercomputers.

Scalable Coherence Interface (SCI)

  • Ring or hierarchy topology replaces the shared bus, allowing parallel coherence transactions across different parts of the system
  • Distributed directory embedded in cache lines—each line contains pointers to sharing caches, eliminating central directory bottlenecks
  • IEEE Standard 1596, designed specifically for systems where bus-based approaches fail

Token Coherence

  • Fixed number of tokens per cache line—a processor needs all tokens to write, at least one token to read
  • Decouples correctness from performance—the token mechanism guarantees safety while allowing flexible optimizations
  • Reduces serialization compared to directory protocols because token transfers can occur without central coordination
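
The counting rule is simple enough to sketch directly (illustrative only; the one-token-per-cache distribution is an assumption, and real token coherence adds persistent requests to handle starvation):

```python
# Token coherence's safety rule: a fixed token count per line; a cache may
# read with at least one token and may write only while holding all of them.
TOTAL_TOKENS = 4

class TokenCache:
    def __init__(self, tokens=0):
        self.tokens = tokens

    def can_read(self):
        return self.tokens >= 1

    def can_write(self):
        return self.tokens == TOTAL_TOKENS

def transfer(src, dst, n):
    # Token transfers are plain point-to-point moves; no bus ordering or
    # directory lookup is needed -- the conserved count is the invariant.
    assert src.tokens >= n
    src.tokens -= n
    dst.tokens += n

caches = [TokenCache(1) for _ in range(4)]   # tokens spread: everyone can read
writer = caches[0]
assert all(c.can_read() for c in caches) and not writer.can_write()
for c in caches[1:]:
    transfer(c, writer, 1)                   # collect every token to write
assert writer.can_write()
assert sum(c.tokens for c in caches) == TOTAL_TOKENS   # conservation holds
```

Because correctness rests only on the conserved count, the protocol is free to move tokens along any path the interconnect offers, which is the "decoupling correctness from performance" claim above.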

Timestamp-Based Coherence Protocols

  • Logical timestamps order operations—each write increments a timestamp, and caches compare timestamps to determine validity
  • Lazy coherence possible where stale data is tolerated briefly if the application semantics allow it
  • Flexibility for relaxed consistency models where strict ordering isn't required for correctness
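
The timestamp-comparison idea can be sketched as follows (a hypothetical, heavily simplified model: one global logical clock per line, with all distribution and ordering machinery elided):

```python
# Timestamp-based validity check: each write bumps the line's logical
# timestamp; a cached copy is stale once its timestamp falls behind.
class TimestampedLine:
    def __init__(self):
        self.latest = 0            # timestamp of the most recent write

    def write(self):
        self.latest += 1
        return self.latest         # the writer's copy carries this timestamp

    def is_current(self, copy_ts):
        return copy_ts >= self.latest

line = TimestampedLine()
ts_a = line.write()                # cache A writes; its copy is current
assert line.is_current(ts_a)
line.write()                       # cache B writes later
assert not line.is_current(ts_a)   # A's copy is now stale
```

Under a relaxed consistency model, a cache could keep serving the stale copy for a bounded window instead of invalidating it immediately, which is the "lazy coherence" flexibility described above.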

Compare: SCI vs. Token Coherence—both target large-scale systems, but SCI uses distributed directories while token coherence uses a counting mechanism. Token coherence offers more flexibility for performance optimization at the cost of implementation complexity.


Quick Reference Table

Concept                     | Best Examples
----------------------------|-----------------------------------------------
Basic state-based protocols | MSI, MESI, MOESI
Broadcast-based coherence   | Snooping protocols, Bus-based protocols
Scalable coherence          | Directory-based, SCI, Token coherence
Write strategy trade-offs   | Write-invalidate, Write-update
Reducing memory traffic     | MOESI (Owned state), Cache-to-cache transfers
Small system optimization   | MESI, Snooping protocols
Large system optimization   | Directory-based, SCI
Flexible ordering           | Timestamp-based protocols

Self-Check Questions

  1. Which two protocols both use state-based coherence but differ in how they handle clean, private data? Explain why this difference matters for performance.

  2. A system architect is designing a 64-processor server. Which coherence approach (snooping or directory) should they choose, and what specific scalability limitation are they avoiding?

  3. Compare and contrast write-invalidate and write-update strategies. Under what workload conditions would write-update actually outperform write-invalidate?

  4. If an FRQ asks you to reduce memory bandwidth consumption in a system with high read-sharing of modified data, which protocol state (from MOESI) directly addresses this, and how does it work?

  5. Token coherence and directory-based protocols both aim to scale beyond bus-based systems. What fundamental mechanism differs between them, and what trade-off does each approach make?