
💾Intro to Computer Architecture Unit 5 Review


5.2 Cache memory: design, mapping, and replacement policies


Written by the Fiveable Content Team • Last updated August 2025

Cache memory sits between the processor and main memory, giving the CPU fast access to the data it needs most. Without it, the processor would constantly wait on slower main memory, wasting cycles. This topic covers how caches are designed, how data gets mapped into them, and how the system decides what to evict when the cache is full.

Cache Memory Purpose and Benefits

Bridging the Performance Gap

Cache is a small, fast memory placed physically close to the processor. It stores copies of data and instructions the CPU is likely to need soon. Because cache is built from faster (and more expensive) SRAM rather than the DRAM used in main memory, it can respond to requests much more quickly.

The core problem cache solves: processors can execute instructions in fractions of a nanosecond, but fetching data from main memory can take tens to hundreds of nanoseconds. Cache absorbs most of those requests so the processor doesn't stall.
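To see how much a cache helps, here's a quick back-of-the-envelope calculation. The latencies and hit rate below are illustrative assumptions, not figures from this guide:

```python
# Effective access time with a cache in front of main memory.
# These numbers are assumed for illustration: 1 ns cache, 100 ns
# main memory, and a 95% hit rate.
cache_time_ns = 1.0
memory_time_ns = 100.0
hit_rate = 0.95

# Hits cost the cache time; misses cost the cache lookup plus the memory fetch.
effective_ns = hit_rate * cache_time_ns + (1 - hit_rate) * (cache_time_ns + memory_time_ns)
print(round(effective_ns, 2))  # 6.0 — far closer to cache speed than memory speed
```

Even with 5% of accesses going all the way to memory, the average access time stays near the cache's speed rather than the memory's.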

Principle of Locality

Cache works because programs don't access memory randomly. They follow predictable patterns described by the principle of locality:

  • Temporal locality: If the processor accessed a memory location recently, it's likely to access it again soon. Think of loop variables or frequently called functions.
  • Spatial locality: If the processor accessed one memory location, it's likely to access nearby locations next. This happens with sequential instructions and array traversals.

These two patterns are why caches are effective. The cache keeps recently used data (temporal) and pulls in neighboring data whenever it fetches a block (spatial).
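A simple loop shows both patterns at once. This sketch just illustrates the access pattern; the array contents are arbitrary:

```python
# Summing an array exhibits both kinds of locality.
data = list(range(1000))

total = 0                # 'total' is touched every iteration -> temporal locality
for value in data:       # elements are visited in order -> spatial locality
    total += value

print(total)  # 499500
```

A cache serves this loop almost entirely from fast storage: `total` stays resident because it's reused constantly, and each block fetch brings in the next several array elements before the loop asks for them.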

Cache Memory Design Parameters

Cache Size

Cache size is the total storage capacity, typically measured in kilobytes (e.g., 4KB, 8KB, 16KB for L1 caches). A larger cache can hold more data, which generally means fewer misses. But there's a trade-off: bigger caches take longer to search, consume more power, and occupy more chip area. The right size depends on the workload and where the cache sits in the hierarchy.

Block Size

Block size (also called cache line size) is the unit of data transferred between main memory and the cache. When a cache miss occurs, the system doesn't fetch just the one byte you need. It fetches an entire block (commonly 32, 64, or 128 bytes).

  • Larger blocks exploit spatial locality by pulling in more adjacent data per fetch.
  • But if the block is too large, you waste bandwidth fetching data the processor won't actually use, and you fit fewer total blocks in the cache, which can increase the miss rate.

Associativity

Associativity determines how many locations in the cache a given memory block can occupy:

  • Direct-mapped (1-way): Each memory block maps to exactly one cache location. Simple and fast, but vulnerable to conflict misses when two frequently used blocks compete for the same slot.
  • Fully associative: A block can go anywhere in the cache. This eliminates conflict misses entirely, but the hardware must compare tags against every cache entry simultaneously, which is expensive.
  • N-way set-associative: The cache is divided into sets, each containing N slots (e.g., 2-way, 4-way, 8-way). A block maps to a specific set but can occupy any slot within that set. This balances flexibility against hardware cost and is the most common design in practice.

Cache Mapping Techniques: Direct vs Associative vs Set-Associative

Direct Mapping

Direct mapping is the simplest approach. Each block in main memory maps to exactly one cache location based on its address. The mapping formula is:

Cache_Index = Block_Address mod Number_of_Blocks_in_Cache

For example, if your cache has 8 blocks, then memory block 13 maps to cache index 13 mod 8 = 5. Memory block 5 also maps to index 5, which means these two blocks can never be in the cache at the same time. That's the main drawback: even if the rest of the cache is empty, two blocks that map to the same index will keep evicting each other.
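The mapping formula is easy to check in a few lines. This is a minimal sketch of the index calculation, using the 8-block example from the text:

```python
def direct_mapped_index(block_address: int, num_blocks: int) -> int:
    """Cache index for a direct-mapped cache: block_address mod num_blocks."""
    return block_address % num_blocks

# 8-block cache: memory blocks 13 and 5 collide at the same index
print(direct_mapped_index(13, 8))  # 5
print(direct_mapped_index(5, 8))   # 5
```

Because both addresses produce index 5, alternating accesses to blocks 13 and 5 would miss every time, no matter how empty the rest of the cache is.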

Fully Associative Mapping

Fully associative mapping lets a block go in any cache location. There's no index calculation. Instead, the hardware compares the requested address tag against the tag of every block in the cache simultaneously. This requires a comparator for each cache entry, which gets expensive as cache size grows. Fully associative caches have the lowest miss rate but the highest hardware overhead, so they're typically only used for very small caches (like TLBs).

Set-Associative Mapping

Set-associative mapping splits the difference. The cache is organized into sets, and each memory block maps to one specific set but can occupy any slot within that set.

The relevant formulas:

Set_Index = Block_Address mod Number_of_Sets

Tag = Block_Address ÷ Number_of_Sets

For a 4-way set-associative cache with 16 total blocks, you'd have 16 ÷ 4 = 4 sets. Memory block 10 maps to set 10 mod 4 = 2, and it can go in any of the 4 slots within set 2. This significantly reduces conflict misses compared to direct mapping while keeping hardware costs manageable.
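The set-index and tag formulas can be sketched the same way, using the 4-way, 16-block example above:

```python
def set_index_and_tag(block_address: int, num_sets: int) -> tuple[int, int]:
    """Set_Index = block_address mod num_sets; Tag = block_address // num_sets."""
    return block_address % num_sets, block_address // num_sets

# 4-way set-associative cache with 16 total blocks -> 16 // 4 = 4 sets
num_sets = 16 // 4

# Memory block 10 maps to set 2 with tag 2
print(set_index_and_tag(10, num_sets))  # (2, 2)
```

On a lookup, the hardware uses the set index to pick one set, then compares the stored tags of only that set's 4 slots against the address tag — far cheaper than the full-cache comparison a fully associative design needs.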

Quick comparison: Direct-mapped is the simplest and fastest to access but has the most conflict misses. Fully associative has the fewest misses but the most complex hardware. Set-associative is the practical middle ground used in most real processors.


Cache Replacement Policies: LRU vs FIFO vs Random

Replacement policies only matter for associative and set-associative caches (direct-mapped caches have no choice since each block has only one possible location). When a set is full and a new block needs to come in, the policy decides which existing block gets evicted.

Least Recently Used (LRU)

LRU evicts the block that hasn't been accessed for the longest time. The logic is straightforward: if you haven't used it recently, you're less likely to need it soon (temporal locality). LRU generally performs well, but it requires extra hardware to track the access order of every block in a set. For a 2-way cache this is just a single bit, but tracking gets significantly more complex at higher associativities.
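LRU is easy to trace in software, which is exactly what intro-level problems ask you to do by hand. Here's a minimal simulation of one fully associative set under LRU (the access stream is an invented example):

```python
from collections import OrderedDict

def lru_accesses(capacity, accesses):
    """Trace a cache of 'capacity' blocks under LRU.
    Returns the evicted blocks, in eviction order."""
    cache = OrderedDict()   # insertion order doubles as recency order
    evicted = []
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)        # a hit refreshes recency
        else:
            if len(cache) == capacity:
                old, _ = cache.popitem(last=False)  # least recently used
                evicted.append(old)
            cache[block] = True
    return evicted

# Capacity 2: after A, B, A, block B is the least recently used,
# so loading C evicts B
print(lru_accesses(2, ["A", "B", "A", "C"]))  # ['B']
```

Note how the hit on A moves it to the "most recent" end, protecting it from eviction — that's the temporal-locality bet LRU makes.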

First-In, First-Out (FIFO)

FIFO evicts the block that was loaded into the cache earliest, regardless of whether it's been accessed recently. It's simpler to implement than LRU since you just maintain a queue. The downside is that FIFO can evict a block that's still being used frequently, simply because it arrived first.
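Tracing the same (invented) access stream under FIFO shows the difference: the recent hit on a block doesn't protect it.

```python
from collections import deque

def fifo_accesses(capacity, accesses):
    """Trace a cache of 'capacity' blocks under FIFO.
    Returns the evicted blocks, in eviction order."""
    cache = set()
    order = deque()          # arrival queue; hits never reorder it
    evicted = []
    for block in accesses:
        if block not in cache:
            if len(cache) == capacity:
                old = order.popleft()   # oldest arrival, even if just used
                cache.remove(old)
                evicted.append(old)
            cache.add(block)
            order.append(block)
    return evicted

# Same stream as the LRU trace: FIFO evicts A because it arrived first,
# even though A was accessed more recently than B
print(fifo_accesses(2, ["A", "B", "A", "C"]))  # ['A']
```

Where LRU kept A and evicted B, FIFO does the opposite — arrival order is all it tracks.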

Random Replacement

Random replacement picks a block to evict at random. It requires no tracking hardware at all, making it the cheapest option. Performance is surprisingly competitive with LRU in many workloads, especially at higher associativities where any choice is unlikely to be catastrophic. It's a solid option when minimizing hardware complexity matters.
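For completeness, random replacement needs no bookkeeping at all. This sketch seeds the generator only so the example is repeatable; real hardware would use a cheap pseudo-random source:

```python
import random

def random_evict(capacity, accesses, seed=0):
    """Trace a cache of 'capacity' blocks under random replacement.
    Returns the final cache contents."""
    rng = random.Random(seed)   # seeded only to make this sketch repeatable
    cache = []
    for block in accesses:
        if block not in cache:
            if len(cache) == capacity:
                victim = rng.randrange(len(cache))  # any slot is fair game
                cache[victim] = block               # overwrite the victim
            else:
                cache.append(block)
    return cache
```

Notice there is no recency list and no arrival queue — the entire "policy" is one random number per miss, which is why it's so cheap in hardware.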

Advanced Replacement Policies

  • Pseudo-LRU: Approximates true LRU using a binary tree of bits rather than full ordering. Much cheaper to implement in hardware while capturing most of LRU's benefit. Commonly used in real L1 and L2 caches.
  • Least Frequently Used (LFU): Evicts the block with the fewest total accesses. This favors blocks that are accessed often, but it can be slow to adapt when access patterns change since old counts linger.

The choice of replacement policy depends on the cache's associativity, the target workload, and the design's power and area budget. For most intro-level problems, you'll work with LRU since it's the easiest to reason about and trace through by hand.