V22.0436 - Prof. Grishman
Lecture 23: Cache
Spatial locality and block size
So far we have assumed that the block size -- the amount of data in
each cache entry -- is one word.
By storing multiple (2, 4, ...) consecutive words in a cache entry and
fetching all of them on a cache miss, we can exploit spatial locality
to improve performance (see Gottlieb's diagram of 4-word blocks).
There is a limit to the benefit of increasing block size, however:
- for a fixed cache size, it reduces the number of cache entries.
- it increases the miss penalty: the time required to fetch a cache
entry on a cache miss. To reduce the miss penalty, modern main
memories are designed to fetch multiple words on successive clock
cycles.
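The trade-off above can be illustrated with a small simulation. This is only a sketch under stated assumptions: a direct-mapped cache of fixed total capacity, word addresses, and two made-up access traces. The function name miss_rate and all the numbers are hypothetical and do not come from the text.

```python
def miss_rate(addresses, cache_words, block_words):
    """Miss rate of a direct-mapped cache on a trace of word addresses.

    Assumption: total capacity (cache_words) is fixed, so larger
    blocks mean fewer cache entries.
    """
    num_entries = cache_words // block_words
    tags = [None] * num_entries          # None marks an empty entry
    misses = 0
    for addr in addresses:
        block = addr // block_words      # which memory block holds addr
        index = block % num_entries      # cache entry the block maps to
        tag = block // num_entries       # identifies the block in that entry
        if tags[index] != tag:           # miss: fetch the whole block
            misses += 1
            tags[index] = tag
    return misses / len(addresses)


# Trace 1: sequential sweep over 256 words (strong spatial locality).
sequential = list(range(256))
# Trace 2: 16 scattered words (stride 9), visited four times over.
scattered = [9 * k for k in range(16)] * 4

for block_words in (1, 2, 4, 8):
    print(block_words,
          miss_rate(sequential, 64, block_words),
          miss_rate(scattered, 64, block_words))
```

On the sequential trace the miss rate falls from 1.0 with one-word blocks to 0.125 with eight-word blocks. On the scattered trace it rises from 0.25 to 1.0, because the eight remaining entries are not enough to hold the 16 blocks being reused.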
Strategies for memory writes
Two basic strategies:
- write-through: writes always update both cache and memory. So that
the processor does not have to wait for the memory write to finish,
we include a write buffer (which holds information on store
instructions which have not yet been written to memory).
- write-back: writes only update the block in the cache; when the
block is replaced in the cache, the modified words are written back to
main memory. This is more complex but reduces the main memory traffic,
since a program may modify a memory word several times while it is in
the cache.
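The difference in memory traffic can be sketched by counting main-memory writes under each policy. The model below is a simplification under stated assumptions: a direct-mapped cache with one-word blocks, a trace containing only stores, and a flush of all dirty entries at the end. The name count_memory_writes and the trace are hypothetical.

```python
def count_memory_writes(store_addresses, num_entries, policy):
    """Count words written to main memory for a trace of store addresses.

    Assumptions: direct-mapped cache, one-word blocks, stores only,
    and dirty entries are flushed once the trace ends.
    """
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy: " + policy)
    tags = [None] * num_entries
    dirty = [False] * num_entries
    writes = 0
    for addr in store_addresses:
        index = addr % num_entries
        tag = addr // num_entries
        if policy == "write-through":
            tags[index] = tag
            writes += 1                  # every store also goes to memory
        else:
            if tags[index] != tag:       # a different word is being replaced
                if dirty[index]:
                    writes += 1          # write the modified word back
                tags[index] = tag
            dirty[index] = True          # the store only updates the cache
    if policy == "write-back":
        writes += sum(dirty)             # final flush of modified words
    return writes


# A word updated 100 times and another updated 50 times.
stores = [5] * 100 + [6] * 50
print(count_memory_writes(stores, 8, "write-through"))  # 150
print(count_memory_writes(stores, 8, "write-back"))     # 2
```

Under these assumptions write-through sends all 150 stores to memory, while write-back writes each modified word only once.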
Effect on performance: effective memory access time
The goal is for the effective memory access time to be close to the
access time of the fastest memory
- hit rate = fraction of memory accesses which are satisfied by the cache
- miss rate = 1 - hit rate
- hit and miss rates measured using processor and cache simulator
- effective memory access time = (hit rate * cache access time) +
(miss rate * access time for cache miss)
[access time for cache miss is predominantly main memory access time]
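As a worked example of the formula above, here is a minimal sketch. The function name and the numbers (95% hit rate, 1-cycle cache access, 100-cycle access on a miss) are illustrative assumptions, not measurements from the text.

```python
def effective_access_time(hit_rate, cache_time, miss_time):
    """Effective memory access time, in the same units as the inputs."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * cache_time + miss_rate * miss_time


# Hypothetical values: 95% hit rate, 1-cycle hit, 100-cycle miss.
t = effective_access_time(0.95, 1, 100)
print(f"{t:.2f} cycles")  # 5.95 cycles
```

Even with a 95% hit rate, the average access in this example costs almost six times the cache access time, which is why small changes in miss rate matter.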
Effect on performance: CPI
The effect of a cache miss is to stall the CPU; the effect of cache
misses can be measured most accurately in terms of the number of
memory stall cycles for read misses and write misses (text,
section 7.3), and hence the effect on CPI.
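One common way to turn stall cycles into a CPI figure is to add the average stall cycles per instruction to the CPI of a machine with a perfect cache. The sketch below assumes one instruction fetch per instruction and the same miss penalty for instruction and data misses, reads and writes alike. The function name and all numbers are hypothetical.

```python
def effective_cpi(base_cpi, instr_miss_rate, data_miss_rate,
                  data_accesses_per_instr, miss_penalty):
    """CPI including memory stall cycles.

    Assumptions: every instruction is fetched once, and every miss
    stalls the CPU for miss_penalty cycles.
    """
    instr_stalls = instr_miss_rate * miss_penalty
    data_stalls = data_accesses_per_instr * data_miss_rate * miss_penalty
    return base_cpi + instr_stalls + data_stalls


# Hypothetical machine: base CPI 1.0, 2% instruction miss rate,
# 4% data miss rate, 0.36 loads/stores per instruction, 100-cycle penalty.
cpi = effective_cpi(1.0, 0.02, 0.04, 0.36, 100)
print(f"{cpi:.2f}")  # 4.44
```

With these numbers, stalls contribute 2.00 cycles per instruction for instruction fetches and 1.44 for data accesses, so the machine runs more than four times slower than it would with a perfect cache.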
Unified vs. split instruction / data cache
Having separate caches for instructions and data does not improve hit
rate but does support increased bandwidth -- one can fetch an
instruction and data word at the same time. Most current
processors have separate L1 I and D caches.