Computer Architecture
1999-2000 Fall
MW 3:30-4:45
Ciww 109
Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/~gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best
======== START LECTURE #20 ========
Improvement: Blocksize > Wordsize
-
The current setup does not take any advantage of spatial locality.
The idea of having a multiword blocksize is to bring in words near
the referenced word since, by spatial locality, they are likely to
be referenced in the near future.
-
The figure below shows a 64KB direct mapped cache with 4-word
blocks.
-
What addresses in memory are in the block and where in the cache
do they go?
- The memory block number =
the word address / number of words per block =
the byte address / number of bytes per block
- The cache block number =
the memory block number modulo the number of blocks in the cache
- The block offset =
the word address modulo the number of words per block
- The tag =
the word address / the number of words in the cache =
the byte address / the number of bytes in the cache
- Show from the diagram how this gives the red portion for the
tag and the green portion for the index or cache block number.
- Consider the cache shown in the diagram above and a reference to
word 17003.
- 17003 / 4 gives 4250 with a remainder of 3.
- So the memory block number is 4250 and the block offset is 3.
- 4K=4096 and 4250 / 4096 gives 1 with a remainder of 154.
- So the cache block number is 154.
- Putting this together, a reference to word 17003 is a reference
to word 3 (the last word) of the cache block with index 154.
- The tag is 17003 / (4K * 4) = 1.
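The worked example above can be sketched in a few lines of Python, assuming the cache in the diagram: direct mapped, 4-word blocks, 4K blocks (64KB of data).

```python
# Verify the worked example for word address 17003 in a direct-mapped
# cache with 4-word blocks and 4K blocks (64KB of data).

WORDS_PER_BLOCK = 4
BLOCKS_IN_CACHE = 4 * 1024  # 4K

word_address = 17003

memory_block = word_address // WORDS_PER_BLOCK               # 4250
block_offset = word_address % WORDS_PER_BLOCK                # 3
cache_block = memory_block % BLOCKS_IN_CACHE                 # 154
tag = word_address // (WORDS_PER_BLOCK * BLOCKS_IN_CACHE)    # 1

print(memory_block, block_offset, cache_block, tag)  # 4250 3 154 1
```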
-
Cachesize = Blocksize * #Entries. For the diagram above this is 64KB.
-
Calculate the total number of bits in this cache and in one
with one word blocks but still 64KB of data.
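One way to carry out the calculation, assuming 32-bit byte addresses (the course's usual convention) and counting one valid bit, the tag, and the data per entry:

```python
# Total bits in a direct-mapped cache with 64KB of data, assuming
# 32-bit byte addresses; compare 4-word blocks against 1-word blocks.

def total_bits(data_bytes, words_per_block, address_bits=32):
    block_bytes = 4 * words_per_block
    num_blocks = data_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1    # log2 of number of blocks
    offset_bits = block_bytes.bit_length() - 1  # byte offset within a block
    tag_bits = address_bits - index_bits - offset_bits
    bits_per_entry = 1 + tag_bits + 8 * block_bytes  # valid + tag + data
    return num_blocks * bits_per_entry

print(total_bits(64 * 1024, 4))  # 4-word blocks: 593,920 bits
print(total_bits(64 * 1024, 1))  # 1-word blocks: 802,816 bits
```

The one-word-block cache needs noticeably more bits for the same 64KB of data because it pays a tag and a valid bit per word rather than per 4-word block.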
-
If the references are strictly sequential, the pictured cache has 75% hits;
the simpler cache with one-word blocks has no
hits.
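A minimal sketch of the sequential case: with 4-word blocks only the first reference to each block misses, so 3 of every 4 references hit.

```python
# Strictly sequential word references: a reference hits unless it is
# the first word of its 4-word block.

WORDS_PER_BLOCK = 4
N = 10000

hits = sum(1 for word in range(N) if word % WORDS_PER_BLOCK != 0)
print(hits / N)  # 0.75 -- with one-word blocks every reference would miss
```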
-
How do we process read/write hits/misses?
-
Read hit: As before, return the data found to the processor.
-
Read miss: As before, due to locality we discard (or write
back) the old line and fetch the new line.
-
Write hit: As before, write the word in the cache (and perhaps
write memory as well).
-
Write miss: A new consideration arises. As before we might or
might not decide to replace the current line with the
referenced line. The new consideration is that if we decide
to replace the line (i.e., if we are implementing
store-allocate), we must remember that we only have a new
word and the unit of cache transfer is a
multiword line.
-
The simplest idea is to fetch the entire line from memory and
then overwrite the referenced word. This is called
write-fetch and is something you wouldn't
even consider with blocksize = reference size = 1 word.
-
Why fetch the whole line including the word you are going
to overwrite?
Ans. The memory subsystem probably can't fetch just words
1, 2, and 4 of the line.
-
Why might we want store-allocate and
write-no-fetch?
-
Ans: Because a common case is storing consecutive words:
With store-no-allocate all are misses and with
write-fetch, each store fetches the line to
overwrite another part of it.
-
To implement store-allocate-no-write-fetch (SANF), we need
to keep a valid bit per word.
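The idea can be sketched as follows; the class and function names are illustrative, not part of any real implementation.

```python
# A hypothetical sketch of store-allocate-no-write-fetch (SANF):
# each cache line keeps one valid bit per word, so a write can
# allocate a line without fetching the rest of it from memory.

WORDS_PER_BLOCK = 4

class Line:
    def __init__(self):
        self.tag = None
        self.valid = [False] * WORDS_PER_BLOCK  # one valid bit per word
        self.data = [0] * WORDS_PER_BLOCK

def write_word(line, tag, offset, value):
    if line.tag != tag:                       # write miss: allocate, no fetch
        line.tag = tag
        line.valid = [False] * WORDS_PER_BLOCK
    line.data[offset] = value                 # only this word becomes valid
    line.valid[offset] = True

line = Line()
write_word(line, tag=1, offset=3, value=42)
print(line.valid)  # [False, False, False, True]
```

A later read of a word whose valid bit is still False would have to fetch from memory, which is the price SANF pays for skipping the write-fetch.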
Homework:
7.7, 7.8, 7.9
Why not make blocksize enormous? For example, why not have the cache
be one huge block?
-
NOT all accesses are sequential.
-
With too few blocks misses go up again.
Memory support for wider blocks
-
Should memory be wide?
-
Should the bus from the cache to the processor be wide?
-
Assume
-
1 clock required to send the address. Only one address is
needed per access for all designs.
-
15 clocks are required for each memory access (independent of
width).
-
1 Clock/busload required to transfer data.
-
How long does it take to satisfy a read miss for the cache above and
each of the three memory/bus systems?
-
Narrow design (a) takes 65 clocks: 1 address transfer, 4 memory
reads, 4 data transfers (do it on the board).
-
Wide design (b) takes 17.
-
Interleaved design (c) takes 20.
-
Interleaving works great because in this case we are
guaranteed to have sequential accesses.
-
Imagine a design between (a) and (b) with a 2-word wide datapath.
It takes 33 cycles and is more expensive to build than (c).
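The four timings above follow directly from the stated assumptions (1 clock for the address, 15 clocks per memory access, 1 clock per busload, 4-word lines) and can be sketched as:

```python
# Read-miss time for a 4-word line under the stated assumptions:
# 1 clock for the address, 15 clocks per memory access, and 1 clock
# to transfer each busload of data.

ADDRESS = 1
MEM_ACCESS = 15
BUS = 1
WORDS = 4

def miss_time(words_per_busload):
    transfers = WORDS // words_per_busload
    return ADDRESS + transfers * (MEM_ACCESS + BUS)

print(miss_time(1))  # narrow design (a): 65
print(miss_time(4))  # wide design (b): 17
print(miss_time(2))  # 2-word datapath: 33

# Interleaved design (c): the four banks access memory in parallel,
# but the four one-word transfers are sequential.
print(ADDRESS + MEM_ACCESS + WORDS * BUS)  # 20
```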
Homework: 7.11
7.3: Measuring and Improving Cache Performance
Performance example to do on the board (a dandy exam question).
-
Assume
-
5% I-cache miss.
-
10% D-cache miss.
-
1/3 of the instructions access data.
-
CPI = 4 if miss penalty is 0 (A 0 miss penalty is not
realistic of course).
-
What is CPI with miss penalty 12 (do it)?
-
What is CPI if we upgrade to a double speed cpu+cache, but keep a
single speed memory (i.e., a 24 clock miss penalty)?
Do it on the board.
-
How much faster is the ``double speed'' machine? It would be double
speed if the miss penalty were 0 or if there was a 0% miss rate.
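The board work above can be sketched as follows, using the standard formula CPI = base CPI + (misses per instruction) x (miss penalty):

```python
# CPI example: 5% I-cache miss rate, 10% D-cache miss rate, 1/3 of
# instructions access data, base CPI = 4 with a 0 miss penalty.

BASE_CPI = 4
I_MISS = 0.05
D_MISS = 0.10
DATA_FRACTION = 1 / 3

def cpi(miss_penalty):
    return (BASE_CPI
            + I_MISS * miss_penalty
            + DATA_FRACTION * D_MISS * miss_penalty)

slow = cpi(12)   # original machine: 4 + 0.6 + 0.4 = 5.0
fast = cpi(24)   # double-speed cpu+cache, same memory: 4 + 1.2 + 0.8 = 6.0
print(slow, fast)

# The fast machine's cycle is half as long: time per instruction is
# slow * t versus fast * t/2, so the speedup is 5/3, not 2.
print(slow / (fast / 2))
```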
Homework:
7.15, 7.16