CSCI-UA.0436 - Prof. Grishman
Lectures 20 and 21: Cache
Direct mapped cache
- if cache has N words, location k of memory goes into word (k
mod N) of cache (Fig. 5.5)
- simplest cache design (Fig. 5.7, or see Prof. Gottlieb's notes)
- conflicts between memory locations with the same (k mod N) reduce
the hit rate as compared to a fully associative cache
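The (k mod N) mapping above can be sketched in a few lines (a minimal illustration; the cache size N and the addresses are hypothetical, and blocks are assumed to be one word):

```python
# Direct-mapped placement: location k of memory goes into
# word (k mod N) of the cache.
N = 8  # cache size in words (hypothetical)

def direct_mapped_slot(k, n=N):
    """Cache slot for memory location k."""
    return k % n

# Memory locations 3 and 11 conflict: both map to slot 3,
# so they cannot be cached at the same time.
assert direct_mapped_slot(3) == 3
assert direct_mapped_slot(11) == 3
```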
Set associative cache
- typical designs are 2-way or 4-way set associative
- somewhat greater complexity than direct mapped (more
comparators and multiplexers)
(Fig. 5.17, or see Prof. Gottlieb's notes)
- for 2-way, organize cache as S=N/2 sets of 2 words each
- location k of memory goes into set (k mod S) of cache
- on miss, replace least-recently-used member of set
(for 2-way, requires only 1 bit per set)
- approaches performance of fully associative
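The set-associative scheme above, including the 1-bit-per-set LRU replacement, can be sketched as follows (a simplified model with hypothetical sizes; tags are just the full addresses here):

```python
# 2-way set-associative cache with a 1-bit LRU per set.
N = 8          # total cache words (hypothetical)
S = N // 2     # S = N/2 sets of 2 words each

sets = [{"ways": [None, None], "lru": 0} for _ in range(S)]

def access(k):
    """Return True on hit; on miss, replace the LRU way of set (k mod S)."""
    s = sets[k % S]
    for i, tag in enumerate(s["ways"]):
        if tag == k:
            s["lru"] = 1 - i      # the other way is now least recently used
            return True
    victim = s["lru"]             # miss: evict the least-recently-used way
    s["ways"][victim] = k
    s["lru"] = 1 - victim
    return False

# Locations 3 and 7 both map to set 3; a 2-way cache holds both,
# where a direct-mapped cache of the same size would thrash.
assert access(3) is False   # miss, fills way 0
assert access(7) is False   # miss, fills way 1
assert access(3) is True    # hit
```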
Spatial locality and block size
So far we assumed that the block size -- the amount of data in each
cache entry -- is one word.
By storing multiple (2, 4, ...) consecutive words in a cache entry and
fetching all the words on a cache miss, we can improve performance due
to spatial locality (see Prof. Gottlieb's notes).
There is a limit to the benefit of increasing block size, however:
- it reduces the number of cache entries.
- it increases the miss penalty -- the time required to
fetch a cache entry on a cache miss. To
reduce the miss penalty, modern main memories are designed to deliver
multiple words on successive clock cycles.
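A rough illustration of why such memories help (all timing numbers hypothetical): if memory delivers one word per clock cycle after an initial latency, the miss penalty grows only slowly with block size.

```python
# Hypothetical timing: first word arrives after LATENCY cycles,
# remaining words of the block stream in at one per cycle.
LATENCY = 30  # cycles to get the first word (assumed)

def miss_penalty(block_words):
    return LATENCY + (block_words - 1)

# Quadrupling the block size adds only 3 cycles to the penalty,
# not 4x the cost -- so larger blocks can still pay off.
assert miss_penalty(1) == 30
assert miss_penalty(4) == 33
```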
Strategies for memory writes
Two basic strategies:
- write-through: writes always update both cache and memory. So that the CPU
does not have to wait for the memory write to finish, we include a write buffer (which holds
information on store instructions which have not yet been written to memory).
- write-back: writes update only
the cache; when the block is replaced in the
cache, the modified words are written back to main memory. This is more
complex but reduces the main memory traffic, since a program may write
a memory word several times while it is in the cache.
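A toy comparison of the two strategies (hypothetical store stream; the cache mechanics themselves are omitted) shows why write-back reduces main memory traffic:

```python
# Count main-memory writes for each strategy on a stream of stores.
stores = [5, 5, 5, 9]  # program writes location 5 three times, 9 once

# Write-through: every store also writes memory.
write_through_traffic = len(stores)

# Write-back (idealized): each dirty location is written back only
# once, when its block is eventually replaced.
write_back_traffic = len(set(stores))

assert write_through_traffic == 4
assert write_back_traffic == 2
```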
Effect on performance: effective memory access time
Goal is to have effective memory access time be close to the access
time of the fastest memory (the cache)
- hit rate = percentage of memory accesses which are satisfied by the cache
- miss rate = 1 - hit rate
- hit and miss rates measured using processor and cache simulations
- effective memory access time = (hit rate * cache access time)
+ (miss rate * access time for cache miss)
[access time for cache miss is predominantly main memory access time]
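The formula above can be worked through with hypothetical numbers (a 95% hit rate, 1 ns cache, and 60 ns miss time are assumptions for illustration only):

```python
# Effective memory access time = hit_rate * cache_time
#                              + miss_rate * miss_time
hit_rate = 0.95
cache_time = 1.0    # ns (hypothetical)
miss_time = 60.0    # ns, predominantly main memory access time

effective = hit_rate * cache_time + (1 - hit_rate) * miss_time
assert abs(effective - 3.95) < 1e-9   # 0.95*1 + 0.05*60 = 3.95 ns
```

Note how a 5% miss rate nearly quadruples the average access time relative to the cache alone.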
Effect on performance: CPI (p. 475-477)
Calculate cache performance in terms of its effect on the CPI:
assume each miss (for instruction, data load, or data store) leads to a
miss penalty, measured in clock cycles
(resulting from the CPU stalling while it waits for data from main memory)
Instruction fetch miss cycles / instruction = instruction miss rate x miss penalty
Data load/store miss cycles / instruction = % of load/store
instructions x data miss rate x miss penalty
Total miss cycles / instruction = instruction fetch miss
cycles/instruction + data load/store miss cycles/instruction
Effective CPI is increased by total miss cycles / instruction
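The CPI calculation above can be worked through numerically (all rates and the penalty below are hypothetical, in the style of the textbook's examples on pp. 475-477):

```python
# Effect of cache misses on CPI (hypothetical parameters).
base_cpi = 2.0
instr_miss_rate = 0.02
data_miss_rate = 0.04
load_store_frac = 0.36      # fraction of instructions that are loads/stores
miss_penalty = 40           # clock cycles

instr_miss_cycles = instr_miss_rate * miss_penalty                   # 0.8
data_miss_cycles = load_store_frac * data_miss_rate * miss_penalty   # 0.576
effective_cpi = base_cpi + instr_miss_cycles + data_miss_cycles

assert abs(effective_cpi - 3.376) < 1e-9   # misses add 1.376 cycles/instr
```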
Unified vs. split instruction / data cache
Having separate caches for instructions and data does not improve the hit
rate but does support increased bandwidth -- one can fetch an
instruction and data word at the same time. Most current
processors have separate L1 I and D caches.
Multilevel caches
As the gap between CPU speed and memory speed grows larger, the penalty for a
cache miss becomes unacceptably high. To address this problem,
all modern high-end CPUs have at least two levels
of caches: a very fast, and hence not very big, first level (L1)
together with a larger but slower L2 cache. Some recent
microprocessors (e.g., Core i7) have 3 levels.
When a miss occurs in L1, L2 is examined, and only if a miss occurs
there is main memory referenced. (Performance analysis, p. ...)
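The two-level lookup described above extends the effective-access-time formula by nesting the L2 term inside the L1 miss path. A sketch with hypothetical timings (note the L2 miss rate here is local, i.e., measured over accesses that reach L2):

```python
# Average memory access time for a two-level cache hierarchy.
l1_time, l1_miss = 1.0, 0.05     # ns, L1 miss rate (hypothetical)
l2_time, l2_miss = 8.0, 0.20     # ns, local L2 miss rate (hypothetical)
mem_time = 80.0                  # ns main memory access time

amat = l1_time + l1_miss * (l2_time + l2_miss * mem_time)
assert abs(amat - 2.2) < 1e-9    # 1 + 0.05*(8 + 0.20*80) = 2.2 ns
```

Without the L2 (going straight to memory on an L1 miss), the same numbers would give 1 + 0.05*80 = 5.0 ns, which is why the extra level pays off.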