V22.0436 - Prof. Grishman
Lectures 21 and 22: Cache
Direct mapped cache
- if cache has N words, location k of memory goes into word (k mod N)
of the cache
- simplest cache design (Fig. 5.7)
- conflicts between memory locations with the same (k mod N) reduce the
hit rate as compared to a fully associative cache
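The direct-mapped placement rule can be sketched in a few lines of Python (an illustrative example; `direct_mapped_slot` is my own name, not from the text):

```python
def direct_mapped_slot(address, num_slots):
    # location k of memory goes into word (k mod N) of the cache
    return address % num_slots

# Two addresses that differ by a multiple of N map to the same slot,
# so they conflict even when the rest of the cache is empty.
```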
Set associative cache
- typical designs are 2-way or 4-way set associative
- somewhat greater complexity than direct mapped (more comparison
hardware) (Fig. 5.17)
- for 2-way, organize cache as S=N/2 sets of 2 words each
- location k of memory goes into set (k mod S) of cache
- approaches performance of fully associative
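As a rough sketch of the 2-way organization (a toy model with LRU replacement; the class name, and the choice to track only addresses rather than data, are my own assumptions):

```python
class TwoWaySetAssociativeCache:
    """Toy 2-way set-associative cache of num_words words, LRU replacement."""

    def __init__(self, num_words):
        self.num_sets = num_words // 2        # S = N/2 sets of 2 words each
        self.sets = [[] for _ in range(self.num_sets)]  # MRU entry kept last

    def access(self, address):
        """Return True on a hit, False on a miss (which fills the entry)."""
        entries = self.sets[address % self.num_sets]    # set (k mod S)
        if address in entries:
            entries.remove(address)
            entries.append(address)           # mark most recently used
            return True
        if len(entries) == 2:
            entries.pop(0)                    # evict least recently used
        entries.append(address)
        return False
```

Unlike a direct-mapped cache, two conflicting addresses (e.g. k and k + S) can live in the same set at the same time.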
Spatial locality and block size
So far we assumed that the block size -- the amount of data in each
cache entry -- is one word.
By storing multiple (2, 4, ...) consecutive words in a cache entry and
fetching all the words on a cache miss, we can improve performance due
to spatial locality.
There is a limit to the benefit of increasing block size, however:
- it reduces the number of cache entries.
- it increases the miss penalty: more words must be fetched from main
memory on each cache miss. To reduce the miss penalty, modern main
memories are designed to fetch multiple words on successive clock
cycles.
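With multi-word blocks, a word address splits into a tag, a cache index, and a word-within-block offset. A hedged sketch of that decomposition (function name mine; sizes assumed to be powers of two):

```python
def split_address(address, block_words, num_blocks):
    block_number = address // block_words
    offset = address % block_words       # which word inside the block
    index = block_number % num_blocks    # which cache entry
    tag = block_number // num_blocks     # identifies the block in that entry
    return tag, index, offset

# Consecutive addresses share a block: with 4-word blocks, words 12-15
# all land in the same entry, so one miss fetches all four (spatial locality).
```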
Strategies for memory writes
Two basic strategies:
- write-through: writes always update both cache and memory. So that the
processor does not have to wait for the memory write to finish, we
include a write buffer (which holds information on store instructions
which have not yet been written to memory).
- write-back: writes update only the cache; when the block is replaced in the
cache, the modified words are written back to main memory. This is more
complex but reduces the main memory traffic, since a program may modify
a memory word several times while it is in the cache.
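To see why write-back reduces memory traffic, here is a toy count of main-memory writes for a direct-mapped cache with one-word blocks (my own illustrative model, not from the text):

```python
def write_back_mem_writes(writes, num_slots):
    """Memory writes under write-back: a dirty word goes to memory
    only when evicted (plus a final flush of whatever is still dirty)."""
    dirty = {}                       # slot -> address of dirty word held there
    mem_writes = 0
    for addr in writes:
        slot = addr % num_slots
        if slot in dirty and dirty[slot] != addr:
            mem_writes += 1          # eviction: write old dirty word back
        dirty[slot] = addr
    return mem_writes + len(dirty)   # flush remaining dirty words

# Write-through would access memory once per store (4 times for [5, 5, 5, 5]);
# write-back coalesces the repeated stores into a single memory write.
```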
Effect on performance: effective memory access time
Goal is to have the effective memory access time be close to the access
time of the fastest memory (the cache)
- hit rate = percentage of memory accesses which are satisfied by the cache
- miss rate = 1 - hit rate
- hit and miss rates measured using processor and cache simulator
- effective memory access time = (hit rate * cache access time) +
(miss rate * access time for cache miss)
[access time for cache miss is predominantly main memory access time]
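Plugging in illustrative numbers (the 95% hit rate, 1 ns cache, and 60 ns memory below are my assumptions, not figures from the text):

```python
def effective_access_time(hit_rate, cache_time, miss_time):
    # effective time = (hit rate * cache access time) + (miss rate * miss time)
    return hit_rate * cache_time + (1 - hit_rate) * miss_time

# e.g. 0.95 * 1 ns + 0.05 * 60 ns = 3.95 ns -- close to cache speed,
# but only because the hit rate is high.
```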
Effect on performance: CPI (p. 475-477)
Calculate cache performance in terms of its effect on the CPI:
assume each miss (for instruction, data load, or data store) leads to a
miss penalty, measured in clock cycles
(resulting from the CPU stalling while it waits for data from main memory)
Instruction fetch miss cycles / instruction = instruction miss rate x
miss penalty
Data load/store miss cycles / instruction = % of load/store
instructions x data miss rate x miss penalty
Total miss cycles / instruction = instruction fetch miss
cycles/instruction + data load/store miss cycles/instruction
Effective CPI is increased by total miss cycles / instruction
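The CPI calculation above, as a small function (the sample rates in the comment are illustrative assumptions, not the textbook's numbers):

```python
def effective_cpi(base_cpi, instr_miss_rate, data_miss_rate,
                  load_store_frac, miss_penalty):
    instr_miss_cycles = instr_miss_rate * miss_penalty
    data_miss_cycles = load_store_frac * data_miss_rate * miss_penalty
    return base_cpi + instr_miss_cycles + data_miss_cycles

# e.g. base CPI 1.0, 2% instruction miss rate, 4% data miss rate,
# 36% loads/stores, 40-cycle penalty:
# 1.0 + 0.02*40 + 0.36*0.04*40 = 2.376
```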
Unified vs. split instruction / data cache
Having separate caches for instructions and data does not improve hit
rate but does support increased bandwidth -- one can fetch an
instruction and data word at the same time. Most current
processors have separate L1 I and D caches.
Multilevel caches
As the gap between CPU speed and memory speed grows larger, the penalty
for a cache miss becomes unacceptably high. To address this problem,
all modern high-end CPUs have at least two levels
of caches: a very fast, and hence not very big, first-level (L1) cache
together with a larger but slower L2 cache. Some recent
microprocessors (e.g., Core i7) have 3 levels.
When a miss occurs in L1, L2 is examined, and only if a miss occurs
there is main memory referenced. (Performance analysis, p. 485).
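The two-level lookup gives an average access time of the form below (a sketch with assumed cycle counts; the L2 miss rate here is the local miss rate, i.e. misses as a fraction of L2 accesses):

```python
def two_level_access_time(l1_time, l1_miss_rate,
                          l2_time, l2_miss_rate, mem_time):
    # L2 is examined only on an L1 miss; main memory only on an L2 miss
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * mem_time)

# e.g. 1-cycle L1, 5% L1 misses, 10-cycle L2, 20% local L2 misses,
# 100-cycle memory: 1 + 0.05 * (10 + 0.2 * 100) = 2.5 cycles on average
```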