Computer Architecture
1999-2000 Fall
MW 3:30-4:45
Ciww 109
Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/~gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best
======== START LECTURE #20 ========
Improvement: Blocksize > Wordsize
-
The current setup does not take any advantage of spatial locality.
The idea of having a multiword blocksize is to bring in words near
the referenced word since, by spatial locality, they are likely to
be referenced in the near future.
-
The figure below shows a 64KB direct mapped cache with 4-word
blocks.
-
What addresses in memory are in the block and where in the cache
do they go?
- The memory block number =
the word address / number of words per block =
the byte address / number of bytes per block
- The cache block number =
the memory block number modulo the number of blocks in the cache
- The block offset =
the word address modulo the number of words per block
- The tag =
the word address / the number of words in the cache =
the byte address / the number of bytes in the cache
- Show from the diagram how this gives the red portion for the
tag and the green portion for the index or cache block number.
- Consider the cache shown in the diagram above and a reference to
word 17003.
- 17003 / 4 gives 4250 with a remainder of 3.
- So the memory block number is 4250 and the block offset is 3.
- 4K=4096 and 4250 / 4096 gives 1 with a remainder of 154.
- So the cache block number is 154.
- Putting this together, a reference to word 17003 is a reference
to word 3 (the last word) of the cache block with index 154.
- The tag is 17003 / (4K * 4) = 1.
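The worked example above can be sketched in a few lines of Python, assuming the cache in the diagram: direct mapped, 4-word blocks, 4K blocks (64KB of data).

```python
# Verify the worked example for word address 17003 in a direct-mapped
# cache with 4-word blocks and 4K blocks (64KB of data).

WORDS_PER_BLOCK = 4
BLOCKS_IN_CACHE = 4 * 1024  # 4K

word_address = 17003

memory_block = word_address // WORDS_PER_BLOCK               # 4250
block_offset = word_address % WORDS_PER_BLOCK                # 3
cache_block = memory_block % BLOCKS_IN_CACHE                 # 154
tag = word_address // (WORDS_PER_BLOCK * BLOCKS_IN_CACHE)    # 1

print(memory_block, block_offset, cache_block, tag)  # 4250 3 154 1
```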
-
Cachesize = Blocksize * #Entries. For the diagram above this is 64KB.
-
Calculate the total number of bits in this cache and in one
with one word blocks but still 64KB of data.
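One way to carry out the calculation, assuming 32-bit byte addresses (the course's usual convention) and counting one valid bit, the tag, and the data per entry:

```python
# Total bits in a direct-mapped cache with 64KB of data, assuming
# 32-bit byte addresses; compare 4-word blocks against 1-word blocks.

def total_bits(data_bytes, words_per_block, address_bits=32):
    block_bytes = 4 * words_per_block
    num_blocks = data_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1    # log2 of number of blocks
    offset_bits = block_bytes.bit_length() - 1  # byte offset within a block
    tag_bits = address_bits - index_bits - offset_bits
    bits_per_entry = 1 + tag_bits + 8 * block_bytes  # valid + tag + data
    return num_blocks * bits_per_entry

print(total_bits(64 * 1024, 4))  # 4-word blocks: 593,920 bits
print(total_bits(64 * 1024, 1))  # 1-word blocks: 802,816 bits
```

The one-word-block cache needs noticeably more bits for the same 64KB of data because it pays a tag and a valid bit per word rather than per 4-word block.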
-
If the references are strictly sequential, the pictured cache has 75% hits;
the simpler cache with one-word blocks has no
hits.
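A minimal sketch of the sequential case: with 4-word blocks only the first reference to each block misses, so 3 of every 4 references hit.

```python
# Strictly sequential word references: a reference hits unless it is
# the first word of its 4-word block.

WORDS_PER_BLOCK = 4
N = 10000

hits = sum(1 for word in range(N) if word % WORDS_PER_BLOCK != 0)
print(hits / N)  # 0.75 -- with one-word blocks every reference would miss
```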
-
How do we process read/write hits/misses?
-
Read hit: As before, return the data found to the processor.
-
Read miss: As before, due to locality we discard (or write
back) the old line and fetch the new line.
-
Write hit: As before, write the word in the cache (and perhaps
write memory as well).
-
Write miss: A new consideration arises. As before we might or
might not decide to replace the current line with the
referenced line. The new consideration is that if we decide
to replace the line (i.e., if we are implementing
store-allocate), we must remember that we only have a new
word and the unit of cache transfer is a
multiword line.
-
The simplest idea is to fetch the entire line from memory and
then overwrite the referenced word. This is called
write-fetch and is something you wouldn't
even consider with blocksize = reference size = 1 word.
-
Why fetch the whole line including the word you are going
to overwrite?
Ans. The memory subsystem probably can't fetch just words
1, 2, and 4 of the line.
-
Why might we want store-allocate and
write-no-fetch?
-
Ans: Because a common case is storing consecutive words:
With store-no-allocate all are misses and with
write-fetch, each store fetches the line to
overwrite another part of it.
-
To implement store-allocate-no-write-fetch (SANF), we need
to keep a valid bit per word.
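The idea can be sketched as follows; the class and function names are illustrative, not part of any real implementation.

```python
# A hypothetical sketch of store-allocate-no-write-fetch (SANF):
# each cache line keeps one valid bit per word, so a write can
# allocate a line without fetching the rest of it from memory.

WORDS_PER_BLOCK = 4

class Line:
    def __init__(self):
        self.tag = None
        self.valid = [False] * WORDS_PER_BLOCK  # one valid bit per word
        self.data = [0] * WORDS_PER_BLOCK

def write_word(line, tag, offset, value):
    if line.tag != tag:                       # write miss: allocate, no fetch
        line.tag = tag
        line.valid = [False] * WORDS_PER_BLOCK
    line.data[offset] = value                 # only this word becomes valid
    line.valid[offset] = True

line = Line()
write_word(line, tag=1, offset=3, value=42)
print(line.valid)  # [False, False, False, True]
```

A later read of a word whose valid bit is still False would have to fetch from memory, which is the price SANF pays for skipping the write-fetch.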
Homework:
7.7, 7.8, 7.9
Why not make blocksize enormous? For example, why not have the cache
be one huge block?
-
NOT all accesses are sequential.
-
With too few blocks misses go up again.
Memory support for wider blocks
-
Should memory be wide?
-
Should the bus from the cache to the processor be wide?
-
Assume
-
1 clock required to send the address. Only one address is
needed per access for all designs.
-
15 clocks are required for each memory access (independent of
width).
-
1 Clock/busload required to transfer data.
-
How long does it take to satisfy a read miss for the cache above and
each of the three memory/bus systems?
-
Narrow design (a) takes 65 clocks: 1 address transfer, 4 memory
reads, 4 data transfers (do it on the board).
-
Wide design (b) takes 17.
-
Interleaved design (c) takes 20.
-
Interleaving works great because in this case we are
guaranteed to have sequential accesses.
-
Imagine a design between (a) and (b) with a 2-word wide datapath.
It takes 33 cycles and is more expensive to build than (c).
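The four timings above follow directly from the stated assumptions (1 clock for the address, 15 clocks per memory access, 1 clock per busload, 4-word lines) and can be sketched as:

```python
# Read-miss time for a 4-word line under the stated assumptions:
# 1 clock for the address, 15 clocks per memory access, and 1 clock
# to transfer each busload of data.

ADDRESS = 1
MEM_ACCESS = 15
BUS = 1
WORDS = 4

def miss_time(words_per_busload):
    transfers = WORDS // words_per_busload
    return ADDRESS + transfers * (MEM_ACCESS + BUS)

print(miss_time(1))  # narrow design (a): 65
print(miss_time(4))  # wide design (b): 17
print(miss_time(2))  # 2-word datapath: 33

# Interleaved design (c): the four banks access memory in parallel,
# but the four one-word transfers are sequential.
print(ADDRESS + MEM_ACCESS + WORDS * BUS)  # 20
```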
Homework: 7.11
7.3: Measuring and Improving Cache Performance
Performance example to do on the board (a dandy exam question).
-
Assume
-
5% I-cache miss.
-
10% D-cache miss.
-
1/3 of the instructions access data.
-
CPI = 4 if miss penalty is 0 (A 0 miss penalty is not
realistic of course).
-
What is CPI with miss penalty 12 (do it)?
-
What is CPI if we upgrade to a double speed cpu+cache, but keep a
single speed memory (i.e., a 24 clock miss penalty)?
Do it on the board.
-
How much faster is the ``double speed'' machine? It would be double
speed if the miss penalty were 0 or if there was a 0% miss rate.
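The board work above can be sketched as follows, using the standard formula CPI = base CPI + (misses per instruction) x (miss penalty):

```python
# CPI example: 5% I-cache miss rate, 10% D-cache miss rate, 1/3 of
# instructions access data, base CPI = 4 with a 0 miss penalty.

BASE_CPI = 4
I_MISS = 0.05
D_MISS = 0.10
DATA_FRACTION = 1 / 3

def cpi(miss_penalty):
    return (BASE_CPI
            + I_MISS * miss_penalty
            + DATA_FRACTION * D_MISS * miss_penalty)

slow = cpi(12)   # original machine: 4 + 0.6 + 0.4 = 5.0
fast = cpi(24)   # double-speed cpu+cache, same memory: 4 + 1.2 + 0.8 = 6.0
print(slow, fast)

# The fast machine's cycle is half as long: time per instruction is
# slow * t versus fast * t/2, so the speedup is 5/3, not 2.
print(slow / (fast / 2))
```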
Homework:
7.15, 7.16