Advanced Computer Architecture

Key Concepts of Cache Coherence Protocols


Why This Matters

Cache coherence is the glue that holds multiprocessor systems together. When multiple processors each have their own cache, you're creating multiple "versions" of the same data. Without a coherence protocol, those versions can diverge, leading to incorrect program behavior. You're being tested on your understanding of how these protocols maintain consistency, why certain designs scale better than others, and when to apply different strategies based on system architecture.

The concepts here connect directly to broader themes in computer architecture: parallelism and its challenges, memory hierarchy trade-offs, scalability versus simplicity, and bandwidth optimization. Exam questions often ask you to compare protocols, identify which approach fits a given system size, or analyze the trade-offs between bus traffic and implementation complexity. Don't just memorize protocol names and states. Know what problem each protocol solves and why its mechanism works.


Snoopy Protocol Foundations: State-Based Coherence

These protocols form the backbone of cache coherence in bus-based systems. Each cache line exists in one of several states, and state transitions are triggered by local processor actions or observed bus transactions. Understanding the progression from MSI to MESI to MOESI reveals how architects incrementally solved performance problems.

MSI (Modified-Shared-Invalid) Protocol

The simplest invalidation-based protocol uses three states:

  • Modified (M): This cache holds the only valid copy, and it's dirty (modified relative to main memory). The cache is responsible for writing back before the line can be evicted.
  • Shared (S): The line is clean and may exist in multiple caches. Any cache in S can read freely, but must issue a bus transaction before writing.
  • Invalid (I): The line is not usable. Any access triggers a miss.

The write-invalidate mechanism forces all other caches to move their copies to Invalid before a write completes. This guarantees only one writable copy exists at any time.

MSI's main weakness: it has no way to distinguish between "I'm the only cache with this line" and "multiple caches share this line." Even private data that no other cache holds will generate unnecessary bus invalidations on a write. This is exactly what MESI fixes.
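The three states and the write-invalidate rule can be captured as a small transition table. This is a minimal sketch of one cache line in one cache; the event names (PR_RD/PR_WR for local processor reads and writes, BUS_RD/BUS_RDX for observed bus transactions) and bus-action names are illustrative, not from any specific implementation.

```python
# Minimal sketch of MSI state transitions for a single cache line.
# (state, event) -> (next_state, bus_action); names are illustrative.
MSI = {
    ("I", "PR_RD"):   ("S", "BusRd"),   # read miss: fetch a shared copy
    ("I", "PR_WR"):   ("M", "BusRdX"),  # write miss: fetch and invalidate others
    ("S", "PR_RD"):   ("S", None),      # read hit, no bus traffic
    ("S", "PR_WR"):   ("M", "BusRdX"),  # must invalidate other sharers first
    ("S", "BUS_RDX"): ("I", None),      # another cache is writing: invalidate
    ("M", "PR_RD"):   ("M", None),
    ("M", "PR_WR"):   ("M", None),
    ("M", "BUS_RD"):  ("S", "Flush"),   # supply dirty data, demote to Shared
    ("M", "BUS_RDX"): ("I", "Flush"),   # supply dirty data, then invalidate
}

def step(state, event):
    return MSI.get((state, event), (state, None))

# A write to a Shared line always goes to the bus, even if no other cache
# actually holds a copy -- this is exactly the weakness MESI addresses:
print(step("S", "PR_WR"))  # ('M', 'BusRdX')
```

Note that the `("S", "PR_WR")` row issues a bus transaction unconditionally: MSI has no way to know whether the line is actually shared.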

MESI (Modified-Exclusive-Shared-Invalid) Protocol

MESI adds a fourth state, Exclusive (E), which means: this cache holds the only copy, and it's clean (matches main memory).

Why this matters so much: a cache line in E can be silently upgraded to M on a write, with no bus transaction at all. The cache already knows no one else has a copy, so there's nothing to invalidate.

  • Reduced bus traffic compared to MSI, especially for private data or single-threaded workloads running on a multiprocessor
  • Industry standard in most modern x86 and ARM multiprocessors due to its balance of simplicity and efficiency
  • On a read miss, if no other cache holds the line, the line enters E rather than S. If another cache does hold it, both copies enter S.
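The silent upgrade path can be sketched in a few lines. The function and bus-action names (e.g., `BusUpgr`) are illustrative; the point is that only the Exclusive case produces no bus traffic on a write.

```python
# Sketch of MESI's key optimization: Exclusive -> Modified is silent.
# Names are illustrative, not from any real spec.

def mesi_write(state):
    """Return (next_state, bus_action) for a local processor write."""
    if state in ("M", "E"):
        return ("M", None)        # E: silent upgrade -- no one else has a copy
    if state == "S":
        return ("M", "BusUpgr")   # must invalidate other sharers
    return ("M", "BusRdX")        # Invalid: full read-for-ownership

def mesi_read_miss(other_cache_has_line):
    """On a read miss, enter E if no other cache holds the line, else S."""
    return "S" if other_cache_has_line else "E"

print(mesi_write("E"))        # ('M', None) -- zero bus traffic
print(mesi_read_miss(False))  # 'E'
```

Contrast this with MSI, where a write to any non-Modified line requires a bus transaction, shared or not.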

MOESI (Modified-Owned-Exclusive-Shared-Invalid) Protocol

MOESI adds a fifth state, Owned (O), which solves a specific and costly problem: what happens when a processor with a Modified line sees another processor request that data?

In MESI, the modifying cache must write back to main memory, and then both caches hold the line in S. That write-back costs memory bandwidth. In MOESI, the modifying cache transitions to Owned and supplies the data directly to the requester via a cache-to-cache transfer. The O state means: "I have a dirty copy that I'm responsible for eventually writing back, but other caches can hold Shared copies that they got from me."

  • Cache-to-cache transfers reduce memory bandwidth pressure significantly when multiple processors share frequently-modified data
  • Best suited for high-contention workloads where shared data is read often but modified by one processor at a time
  • Used notably in AMD processors
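The Owned state's behavior shows up in how a cache reacts to a snooped read. This sketch (names illustrative) contrasts the MOESI reaction with what MESI would do at the same point: MESI forces a write-back to memory, MOESI supplies the data cache-to-cache and keeps the write-back obligation.

```python
# Sketch of the MOESI Owned state: when another cache reads a line we hold
# dirty, supply it directly and retain write-back responsibility.

def moesi_snoop_bus_read(state):
    """Reaction of a cache observing a BusRd for a line it holds."""
    if state == "M":
        # MESI would write back to memory here; MOESI avoids that.
        return ("O", "cache_to_cache_transfer")
    if state == "O":
        return ("O", "cache_to_cache_transfer")  # keep supplying sharers
    if state == "E":
        return ("S", None)   # line is clean: memory can supply the data
    return (state, None)

print(moesi_snoop_bus_read("M"))  # ('O', 'cache_to_cache_transfer')
```

The requester ends up in S, the owner in O, and main memory is never touched until the Owned line is finally evicted.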

Compare: MSI vs. MESI: both use invalidation-based coherence, but MESI's Exclusive state eliminates bus traffic for private data. If a question asks about optimizing single-threaded performance in a multiprocessor, MESI's silent upgrade path is your answer. MOESI goes further by also eliminating unnecessary write-backs when sharing dirty data.


Architectural Approaches: Snooping vs. Directory

The mechanism for detecting coherence violations differs fundamentally between these approaches. Snooping relies on broadcast; directories rely on point-to-point messaging. This distinction drives scalability trade-offs that appear frequently on exams.

Snooping Protocols

Every cache controller monitors (snoops on) every bus transaction, checking addresses against its own tags. When a cache sees a transaction relevant to a line it holds, it takes action (e.g., invalidating or supplying data).

  • Broadcast-based communication makes implementation straightforward but creates O(n) traffic per coherence event, where n is the number of processors
  • Low latency for small systems because the bus provides a natural serialization point and all caches see transactions simultaneously
  • Scalability ceiling around 8-16 processors before bus bandwidth saturates

Directory-Based Coherence Protocols

Instead of broadcasting, a directory tracks which caches hold copies of each memory line and what state those copies are in. When a coherence action is needed, the directory sends targeted (point-to-point) messages only to the caches that actually hold the line.

  • The directory can be centralized (at a single memory controller) or distributed (partitioned across memory controllers in a NUMA system)
  • Point-to-point messages replace broadcasts, so traffic per coherence event scales with the number of actual sharers of a line rather than the total processor count
  • The trade-off: directory lookups add indirection latency (an extra hop through the directory), and the directory itself consumes storage proportional to the number of cache lines times the number of processors
  • Essential for large-scale systems like NUMA architectures with dozens to hundreds of processors
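A common way to organize a directory is a full-map entry per memory line: one presence bit per processor plus a dirty bit, which is exactly why directory storage grows with lines times processors. This is a hedged sketch; the class and method names are illustrative.

```python
# Sketch of a full-map directory entry: a presence bit vector plus a dirty
# bit, so coherence messages go only to actual sharers. Names illustrative.

class DirectoryEntry:
    def __init__(self, n_processors):
        self.sharers = [False] * n_processors  # presence bit per processor
        self.dirty = False

    def read_miss(self, requester):
        targets = []
        if self.dirty:
            # Exactly one cache holds the dirty copy; fetch from it.
            targets = [p for p, s in enumerate(self.sharers) if s]
            self.dirty = False
        self.sharers[requester] = True
        return targets  # point-to-point messages, never a broadcast

    def write_miss(self, requester):
        # Invalidate every current sharer except the requester.
        targets = [p for p, s in enumerate(self.sharers) if s and p != requester]
        self.sharers = [False] * len(self.sharers)
        self.sharers[requester] = True
        self.dirty = True
        return targets

d = DirectoryEntry(8)
d.read_miss(0)
d.read_miss(3)
print(d.write_miss(5))  # [0, 3] -- only the two actual sharers are contacted
```

With 64 processors and a snooping bus, that write would have been broadcast to all 63 other caches; here it generates exactly two messages (plus the directory hop, which is the indirection-latency cost noted above).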

Bus-Based Coherence Protocols

This refers to the interconnect choice rather than the protocol logic itself. A shared bus serves as the communication backbone for snooping protocols.

  • Atomic transactions simplify protocol design because only one request can be in flight at a time, providing a natural total order on memory operations
  • This atomicity is what makes snooping protocols relatively easy to reason about and verify
  • Bottleneck at scale makes this approach unsuitable for systems beyond roughly 16 processors, since every coherence message competes for the same shared bus

Compare: Snooping vs. Directory: snooping is simpler and has lower latency for small systems, but directory-based protocols scale to hundreds of processors. When analyzing a system design question, processor count is your first decision point. Small system (under ~16 cores)? Snooping is likely fine. Large system? You need a directory.


Write Strategies: Invalidate vs. Update

When a processor writes to shared data, the protocol must decide how to inform other caches. This fundamental choice affects bandwidth consumption, read latency, and overall system performance.

Write-Invalidate vs. Write-Update Protocols

Write-invalidate discards all other cached copies before the write proceeds. After the invalidation, only the writing cache holds valid data. Any subsequent read from another processor will miss and fetch the new value.

Write-update (also called write-broadcast) pushes the new value to all caches that hold the line, keeping every copy current.

The trade-offs break down like this:

  • Bandwidth: Write-invalidate is cheaper per write because it sends a short invalidation message, not the full data. Write-update sends the new data value on every write.
  • Read latency: Write-update wins here because other caches always have the latest value, so they never miss on a subsequent read.
  • Multiple writes to the same line: Write-invalidate pays the invalidation cost only once; subsequent writes are local. Write-update pays the broadcast cost on every write, even if no other processor ever reads the updated value.

Most real systems use write-invalidate because, in practice, writes to shared data that are never re-read by other processors are common. Write-update wastes bandwidth in those cases. Write-update can outperform write-invalidate in specific patterns like tight producer-consumer loops, where one core writes and another core reads the same line repeatedly.

Compare: Write-invalidate vs. Write-update: invalidate optimizes for bandwidth, update optimizes for read latency. The key question is whether other caches will actually re-read the data after a write. If yes, update helps. If no, update wastes bandwidth.


Advanced Scalability Solutions

These protocols address limitations of traditional approaches, enabling coherence in systems with dozens to thousands of processors. They represent the cutting edge of coherence research and appear in high-end servers and supercomputers.

Scalable Coherence Interface (SCI)

SCI replaces the shared bus with a ring or hierarchy topology, allowing parallel coherence transactions across different parts of the system.

  • Distributed directory embedded in cache lines: each cached copy contains forward and backward pointers to other caches sharing that line, forming a doubly-linked list. This eliminates the need for a separate centralized directory structure.
  • IEEE standard (1596), designed specifically for systems where bus-based approaches fail
  • The linked-list approach means adding or removing a sharer requires pointer updates, which adds protocol complexity but avoids any single point of contention
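The pointer mechanics are ordinary doubly-linked-list operations, just distributed across caches. This sketch shows the two operations the bullet describes: a new sharer attaching at the head of the list, and a sharer splicing itself out on eviction. Class and function names are illustrative, not SCI terminology.

```python
# Sketch of SCI-style sharing lists: each cached copy carries forward and
# backward pointers to other sharers of the same line. Names illustrative.

class Sharer:
    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.fwd = None  # next sharer in the list
        self.bwd = None  # previous sharer

def attach_at_head(head, new_sharer):
    """New sharers prepend themselves; only local pointer updates needed."""
    new_sharer.fwd = head
    if head is not None:
        head.bwd = new_sharer
    return new_sharer  # new head of the sharing list

def detach(node, head):
    """On eviction, splice neighbours together -- no central structure."""
    if node.bwd is not None:
        node.bwd.fwd = node.fwd
    if node.fwd is not None:
        node.fwd.bwd = node.bwd
    return node.fwd if node is head else head

def sharer_ids(head):
    out = []
    while head is not None:
        out.append(head.cache_id)
        head = head.fwd
    return out

head = None
for cid in (0, 1, 2):
    head = attach_at_head(head, Sharer(cid))
print(sharer_ids(head))  # [2, 1, 0] -- most recent sharer at the head
```

Because each update touches only a node and its immediate neighbours, there is no single contended structure; the cost is the extra protocol traffic to coordinate those pointer updates between caches.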

Token Coherence

Token coherence takes a fundamentally different approach to guaranteeing correctness. Each cache line in the system has a fixed number of tokens (typically equal to the number of processors plus one).

  • A processor needs all tokens to perform a write, and at least one token to perform a read
  • This decouples correctness from performance: the token counting mechanism guarantees safety (no conflicting accesses), while the transport layer can use any combination of broadcast, point-to-point, or persistent requests to move tokens around
  • Reduces serialization compared to directory protocols because token transfers can occur without central coordination. Two unrelated coherence events don't need to go through the same directory controller.
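The safety rule itself is just counting. This sketch assumes a total of five tokens for the line purely for illustration; the function names are not from the token coherence literature.

```python
# Sketch of token coherence's safety rule: a read needs at least one token,
# a write needs every token for the line. Token count is an assumption.

TOTAL_TOKENS = 5  # fixed per line, set at system design time (illustrative)

def can_read(tokens_held):
    return tokens_held >= 1

def can_write(tokens_held):
    return tokens_held == TOTAL_TOKENS

# Tokens split 2/3 between two caches: both may read, neither may write.
print(can_read(2), can_read(3))  # True True
print(can_write(3))              # False
print(can_write(TOTAL_TOKENS))   # True -- all tokens collected
```

Because this invariant alone rules out conflicting accesses, the system is free to move tokens by broadcast, point-to-point messages, or persistent requests without affecting correctness, which is the decoupling the bullets above describe.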

Timestamp-Based Coherence Protocols

These protocols use logical timestamps to order operations rather than relying on bus ordering or directory serialization.

  • Each write increments a timestamp associated with the cache line, and caches compare timestamps to determine whether their copy is still valid
  • Lazy coherence becomes possible: stale data can be tolerated briefly if the application's consistency model allows it
  • Particularly useful with relaxed consistency models where strict sequential ordering of all memory operations isn't required for correctness
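The validity check reduces to comparing timestamps, and the "lazy" variant just widens the comparison. This is a loose sketch under assumed structure: a per-line write counter held somewhere authoritative, and a staleness bound exposed to the consistency model. All names are illustrative.

```python
# Sketch of timestamp-based validity: a cached copy is usable while its
# timestamp matches the latest write timestamp -- or, under a relaxed
# consistency model, lags it by at most an allowed bound. Illustrative.

latest_ts = {}  # authoritative write timestamp per line address

def write(addr):
    """Each write increments the line's timestamp; returns the new value."""
    latest_ts[addr] = latest_ts.get(addr, 0) + 1
    return latest_ts[addr]

def is_valid(addr, cached_ts, staleness_allowed=0):
    """Strict mode (bound 0) demands an exact match; lazy mode tolerates lag."""
    return latest_ts.get(addr, 0) - cached_ts <= staleness_allowed

ts = write(0x40)      # a cache copies the line at timestamp 1
write(0x40)           # another core writes: latest timestamp is now 2
print(is_valid(0x40, ts))                       # False -- stale under strict
print(is_valid(0x40, ts, staleness_allowed=1))  # True  -- lazy coherence
```

The `staleness_allowed` parameter is where a relaxed consistency model buys performance: briefly tolerating stale data avoids an immediate refetch.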

Compare: SCI vs. Token Coherence: both target large-scale systems, but SCI uses distributed linked-list directories while token coherence uses a counting mechanism. Token coherence offers more flexibility for performance optimization (you can mix broadcast and point-to-point freely) at the cost of more complex implementation and the need for a persistent request mechanism to handle token loss or starvation.


Quick Reference Table

Concept                        Best Examples
Basic state-based protocols    MSI, MESI, MOESI
Broadcast-based coherence      Snooping protocols, Bus-based protocols
Scalable coherence             Directory-based, SCI, Token coherence
Write strategy trade-offs      Write-invalidate, Write-update
Reducing memory traffic        MOESI (Owned state), Cache-to-cache transfers
Small system optimization      MESI, Snooping protocols
Large system optimization      Directory-based, SCI
Flexible ordering              Timestamp-based protocols

Self-Check Questions

  1. Which two protocols both use state-based coherence but differ in how they handle clean, private data? Explain why this difference matters for performance.

  2. A system architect is designing a 64-processor server. Which coherence approach (snooping or directory) should they choose, and what specific scalability limitation are they avoiding?

  3. Compare and contrast write-invalidate and write-update strategies. Under what workload conditions would write-update actually outperform write-invalidate?

  4. If a question asks you to reduce memory bandwidth consumption in a system with high read-sharing of modified data, which protocol state (from MOESI) directly addresses this, and how does it work?

  5. Token coherence and directory-based protocols both aim to scale beyond bus-based systems. What fundamental mechanism differs between them, and what trade-off does each approach make?
