Start Lecture #20
7.2: The Basics of Caches
We start with a very simple cache organization.
One that was used on the DECstation 3100, a 1980s workstation.
- All references are for one word (not too bad).
- Cache blocks are one word long.
- This does not take advantage of spatial locality so is not
done in modern machines.
- We will soon drop this assumption.
- Each memory block can only go in one specific cache block.
- This is called a Direct Mapped organization.
- The location of the memory block in the cache (i.e., the
block number in the cache) is the memory block number modulo
the number of blocks in the cache.
- For example, if the cache contains 100 blocks, then memory
block 34452 is stored in cache block 52. Memory block 352
is also stored in cache block 52 (but not at the same time,
of course).
- In real systems the number of blocks in the cache is a power
of 2 so taking modulo is just extracting low order bits.
- Example: if the cache has 4 blocks, the location of a
block in the cache is the low order 2 bits of the block
number.
- A direct mapped cache is simple and fast, but has more
misses than the associative caches we will study in a
later section.
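The modulo/low-order-bits equivalence can be sketched as follows (a toy example; the function name and the 4-block size are mine):

```python
C = 4  # number of cache blocks; must be a power of 2

def cache_index(block_number):
    # Direct mapped placement: cache block = memory block number mod C.
    # When C is a power of 2, the mod is just the low order bits.
    return block_number & (C - 1)

# The bit mask and the modulo always agree for a power-of-2 cache size.
for n in (0, 5, 9, 14):
    assert cache_index(n) == n % C

print(cache_index(14))  # low order 2 bits of 14 (0b1110) -> 2
```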
Accessing a Cache
On the right is a pictorial example for a direct mapped cache with
4 blocks and a memory with 16 blocks.
How can we tell if a memory block is in the cache?
- We know where it will be if it is there at all.
Specifically, if memory block N is in the cache, it will be
in cache block N mod C, where C is the number of blocks in
the cache.
- But many memory blocks are assigned to that same cache block.
For example, in the diagram above all the green
blocks in memory are assigned to the one green block in the cache.
- So we need the
rest of the address (i.e., the part lost
when we reduced the block number modulo the size of the cache)
to see if the block in the cache is the memory block of
interest.
That number is N/C (integer division), using the terminology above.
- The cache stores the rest of the address, called the
tag and we check the tag when looking for a block.
- Since C is a power of 2, the tag (N/C) is simply the high
order bits of N.
Also stored is a valid bit per cache block so that we
can tell if there is a memory block stored in this cache
block.
For example, when the system is powered on, all the cache blocks
are marked invalid.
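A minimal sketch of the lookup just described (the names and the 4-block size are mine): each cache block holds a valid bit and the tag N/C, and a hit requires the block to be valid with a matching tag.

```python
C = 4  # number of cache blocks (a power of 2)

# Each cache block: (valid, tag). All invalid at power-on.
cache = [(False, 0)] * C

def access(block_number):
    """Return True on a hit; on a miss, install the block."""
    index = block_number % C    # where the block must be, if present
    tag = block_number // C     # the rest of the block number
    valid, stored_tag = cache[index]
    if valid and stored_tag == tag:
        return True
    cache[index] = (True, tag)  # direct mapped: evict whatever was there
    return False

print(access(9))   # miss (cache starts empty)
print(access(9))   # hit
print(access(13))  # miss: 13 maps to the same index (1) and evicts block 9
```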
Consider the example on page 476.
- We have a tiny 8-word, direct-mapped cache with block size one
word and all memory references are for one word.
- In the table on the right, all the addresses are word
addresses.
For example the reference to 3 means the reference to word 3
(which includes bytes 12, 13, 14, and 15).
- If a reference experiences a miss and the cache block is valid, the
current contents of the cache block are discarded (in this example
only) and the new reference takes its place.
- Do this example on the board showing the addresses stored in the
cache at all times.
The circuitry needed for this simple cache (direct mapped, block
size 1, all references to 1 word) to determine if we have a hit or a
miss, and to return the data in case of a hit is quite easy.
We are showing a 1024 word (= 4KB) direct mapped cache with block
size = reference size = 1 word.
Make sure you understand the division of the 32 bit address into
20, 10, and 2 bits.
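The field extraction can be sketched as follows (the function name is mine; the 20/10/2 split is specific to this 1024-word cache with 4-byte words):

```python
def split_address(addr):
    # 32-bit byte address, 1024-word direct-mapped cache, 1-word blocks.
    byte_offset = addr & 0x3      # low 2 bits: byte within the word
    index = (addr >> 2) & 0x3FF   # next 10 bits: which of 1024 cache blocks
    tag = addr >> 12              # remaining 20 bits: the tag
    return tag, index, byte_offset

print(split_address(0x12345678))
```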
Calculate on the board the total number of bits in this cache.
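As a check on the board calculation: each of the 1024 blocks stores 32 data bits plus a 20-bit tag and a valid bit, so the cache holds considerably more than its 4KB of "pure" data.

```python
blocks = 1024
bits_per_block = 32 + 20 + 1   # data + tag + valid
total = blocks * bits_per_block
print(total)                   # 54272 bits, vs 32768 bits of data alone
```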
Homework: 7.2, 7.3, 7.4.
Processing a Read for this Simple Cache
The action required for a hit is obvious, namely return the data
found to the processor.
For a miss, the best action is fairly clear, but requires some
thought.
- Clearly we must go to central memory to fetch the requested
data since it is not available in the cache.
- The question is: should we place this new data in the cache,
evicting the old data (which was for
a different address), or should we keep the old
data in the cache?
- But it is clear that we want to store the new data instead of
the old. Why?
Answer: Temporal Locality.
- What should we do with the old data?
Can we just toss it or do we need to write it back to central
memory?
Answer: It depends!
We will see shortly that the action needed on a
read miss, depends on our choice of action
for a write hit.
Handling Cache Misses
We can skip much of this section as it discusses the multicycle and
pipelined implementations of chapter 6, which we skipped.
For the single cycle processor implementation we just need to note a
few points.
- The instruction and data memory are replaced with caches.
- On cache misses one needs to fetch/store the desired
datum or instruction from/to central memory.
- This is very slow and hence our cycle time must be very
long.
- A major reason why the single cycle implementation is
not used in practice.
- The above is simplified; one could use a
lockup-free cache and avoid much of the problem.
Processing a Write for our Simple Cache (direct mapped with block
size = reference size = 1 word)
We have 4 possibilities: For a write hit we must choose between
Write through and Write back.
For a write miss we must choose between write-allocate and
write-no-allocate (also called store-allocate and store-no-allocate
and other names).
Write through: Write the data to
memory as well as to the cache.
Write back: Don't write to memory
now, do it later when this cache block is evicted.
The fact that an eviction must trigger a write to memory for
write-back caches explains the comment above that the write hit
policy affects the read miss policy.
Write-allocate: Allocate a slot and write the
new data into the cache (recall we have a write miss).
The handling of the eviction this allocation (probably) causes
depends on the write hit policy.
- If the cache is write through, discard the old data
(since it is in memory) and write the new data to memory (as
well as in the cache).
- If the cache is write back, the old data must now be
written back to memory, but the new data is not
written to memory.
Write-no-allocate: Leave the cache alone and
just write the new data to memory.
Write no-allocate is not normally as effective as write allocate
due to temporal locality.
The simplest policy is write-through, write-allocate.
The DECstation 3100 discussed above adopted this policy and
performed the following actions for any write, hit or miss (recall
that, for the 3100, block size = reference size = 1 word and the
cache is direct mapped).
- Index the cache using the correct LOBs (i.e., not the very
lowest order bits as these give the byte offset).
- Write the data and the tag into the cache.
- For a hit, we are overwriting the tag with itself.
- For a miss, we are performing a write allocate and,
since the cache is write-through, memory is guaranteed to
be correct so we can simply overwrite the current entry.
- Set Valid to true.
- Send request to main memory.
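The write steps above can be sketched as follows (a toy model; the 8-block size and the data structures are mine, not from the 3100):

```python
C = 8  # number of cache blocks (a power of 2)
cache = [{"valid": False, "tag": 0, "data": 0} for _ in range(C)]
memory = {}  # stands in for central memory, indexed by word address

def write(word_address, value):
    # Write-through, write-allocate; block size = 1 word; direct mapped.
    index = word_address % C   # index with the correct low order bits
    tag = word_address // C
    line = cache[index]
    # Hit or miss, overwrite the tag and data (on a hit, the tag is
    # overwritten with itself).  Under write-through, memory is always
    # correct, so the old entry can simply be discarded.
    line["tag"], line["data"], line["valid"] = tag, value, True
    memory[word_address] = value   # send the request to main memory

write(3, 99)    # miss: allocate, and write through to memory
write(3, 100)   # hit: same steps, no special casing needed
print(memory[3])  # 100
```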
Although the above policy has the advantage of simplicity,
it is out of favor due to its poor performance.
- For the GCC benchmark 11% of the operations are stores.
- If we assume an infinite speed central memory (i.e., a
zero miss penalty) or a zero miss rate, the CPI is 1.2 for
some reasonable estimate of instruction speeds.
- If we assume a 10 cycle store penalty (conservative) since
we have to write main memory (recall we are using a
write-through cache), then the
CPI becomes 1.2 + 10 * 11% = 2.3, which is
poor.
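The arithmetic for the penalty estimate:

```python
base_cpi = 1.2         # CPI with a zero miss penalty (or zero miss rate)
store_fraction = 0.11  # fraction of operations that are stores (GCC)
store_penalty = 10     # cycles to write main memory (write-through)

cpi = base_cpi + store_penalty * store_fraction
print(round(cpi, 2))   # 2.3
```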
Improvement: Use a Write Buffer
- Hold a few writes at the processor while they are being
processed at memory.
- As soon as the word is written into the write buffer, the
instruction is considered complete and the next instruction can
proceed.
- Hence the write penalty is eliminated as long as the word can be
written into the write buffer.
- Must stall (i.e., incur a write penalty) if the write buffer is
full.
This occurs if a bunch of writes occur in a short period.
- If the rate of writes is greater than the rate at which memory
can handle writes, you must stall eventually.
The purpose of a write-buffer (indeed of buffers in general) is to
handle short bursts.
- The Decstation 3100 (which employed the simple cache structure
just described) had a 4-word write buffer.
- Note that a read miss must check the write buffer.
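A sketch of the write buffer and the read-miss check (a hypothetical 4-entry buffer, as on the 3100; the names are mine):

```python
from collections import deque

WRITE_BUFFER_SIZE = 4
write_buffer = deque()  # pending (address, value) pairs, not yet in memory
memory = {}             # stands in for central memory

def drain_one():
    # Memory finishes one buffered write.
    address, value = write_buffer.popleft()
    memory[address] = value

def buffered_write(address, value):
    if len(write_buffer) == WRITE_BUFFER_SIZE:
        drain_one()                        # buffer full: stall for memory
    write_buffer.append((address, value))  # instruction now complete

def read(address):
    # A read miss must check the write buffer: the newest matching
    # entry may hold data that has not yet reached memory.
    for a, v in reversed(write_buffer):
        if a == address:
            return v
    return memory.get(address)

buffered_write(100, 1)
print(read(100))  # 1: found in the write buffer, not yet in memory
```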
Unified vs Split I and D (Instruction and Data) Caches
Given a fixed total size (in bytes) for the cache, is it better to
have two caches, one for instructions and one for data; or is it
better to have a single unified cache?
- Unified is better because it automatically performs load
balancing.
If the current program needs more data references than
instruction references, the cache will accommodate.
Similarly if more instruction references are needed.
- Split is better because it can do two references at once (one
instruction reference and one data reference).
- The winner is ...
split I and D (at least for L1).
- But unified has the better (i.e., higher) hit ratio.
- So hit ratio is not the ultimate measure of good cache
performance.