Computer Architecture
1999-2000 Fall
MW 3:30-4:45
Ciww 109
Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/~gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best
======== START LECTURE #22 ========
- How do we find a memory block in an associative cache (with
  block size 1 word)?
  - Mod the memory block number by the number of sets to get
    the index into the cache (i.e., the set number).
  - Divide the memory block number by the number of sets to get the tag.
  - Check all the tags in the set against the tag of the
    memory block.
  - If any tag matches, a hit has occurred and the
    corresponding data entry contains the memory block.
  - If no tag matches, a miss has occurred.
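A minimal sketch of that lookup (my own illustration, not from the text;
each set is a list of (valid, tag, data) entries):

    # Look up a memory block (blocksize 1 word) in a set-associative cache.
    def lookup(cache_sets, block_number):
        num_sets = len(cache_sets)
        index = block_number % num_sets     # low-order bits pick the set
        tag = block_number // num_sets      # remaining high-order bits
        for valid, line_tag, data in cache_sets[index]:
            if valid and line_tag == tag:   # compare against every tag in the set
                return data                 # hit
        return None                         # miss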
- Why is set associativity bad?
  Ans: It is a little slower due to the mux and AND gate.
- Which block (in the set) should be replaced?
  - Random is sometimes used.
    - But it is not used for paging!
    - The number of blocks in a set is small, so the likely
      difference in quality between the best and the worst choice is small.
    - For caches, speed is crucial, so there is no time for
      calculations, even on misses.
- LRU is better, but not easy to do quickly.
- If the cache is 2-way set associative, each set is of size
  two and it is easy to find the LRU block quickly.
  How?
  Ans: For each set keep a bit indicating which block in the set
  was just referenced; the LRU block is the other one.
- If the cache is 4-way set associative, each set is of size
  4. Consider these 4 blocks as two groups of 2. Use the
  trick above to find the group most recently used and pick
  the other group. Also use the trick within each group and
  choose the block in that group not used last.
- Sounds great. We can do LRU fast for any power of two using
  a binary tree.
- Wrong! The above is not LRU; it is just an approximation.
  Show this on the board.
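A minimal sketch of this tree-based pseudo-LRU scheme for one 4-way set
(the class and its layout are my own illustration), together with a
reference sequence showing it is only an approximation:

    # Tree pseudo-LRU for one 4-way set: 3 bits instead of a full LRU order.
    class PseudoLRU4:
        def __init__(self):
            # Each bit points toward the less recently used side of its subtree.
            self.group_bit = 0        # which pair ({0,1} vs {2,3}) to evict from
            self.pair_bit = [0, 0]    # within each pair, which block to evict

        def touch(self, way):
            group, offset = divmod(way, 2)
            self.group_bit = 1 - group          # the other pair is now "older"
            self.pair_bit[group] = 1 - offset   # so is the other block in this pair

        def victim(self):
            g = self.group_bit
            return 2 * g + self.pair_bit[g]

    # After references 0, 1, 2, 3, 1 true LRU would evict way 0,
    # but the tree picks way 2.
    plru = PseudoLRU4()
    for way in [0, 1, 2, 3, 1]:
        plru.touch(way)
    print(plru.victim())    # -> 2, not the true LRU victim 0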
-
Sizes
  - How big is the cache? That is, what is its capacity?
    Ans: 256 * 4 * 4B = 4KB.
  - How many bits are in the cache?
    - Answer:
      - The 32 address bits contain 8 bits of index and 2 bits
        giving the byte offset.
      - So the tag is 22 bits (more examples just below).
      - Each block contains 1 valid bit, 22 tag bits and 32 data
        bits, for a total of 55 bits.
      - There are 1K blocks.
      - So the total size is 55Kb (kilobits).
  - What fraction of the bits are user data?
    Ans: 4KB / 55Kb = 32Kb / 55Kb = 32/55.
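The same arithmetic as a quick sketch (the numbers are those of this
example, not a general formula):

    # Bit counts for the 4-way, 256-set, 1-word-block cache above.
    sets, ways = 256, 4
    index_bits, offset_bits = 8, 2
    tag_bits = 32 - index_bits - offset_bits      # 22
    bits_per_block = 1 + tag_bits + 32            # valid + tag + data = 55
    blocks = sets * ways                          # 1K blocks
    total_bits = blocks * bits_per_block          # 55Kb = 56320 bits
    data_bits = blocks * 32                       # 32Kb of user data
    print(data_bits / total_bits)                 # 32/55, about 0.58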
Tag size and division of the address bits
We continue to assume a byte-addressed machine with all references
to a 4-byte word (lw and sw).
The 2 LOBs are not used (they specify the byte within the word but
all our references are for a complete word). We show these two bits in
dark blue.
We continue to assume 32
bit addresses so there are 2**30 words in the address space.
Let's review various possible cache organizations and determine for
each how large is the tag and how the various address bits are used.
We will always use a 16KB cache. That is, the size of the
data portion of the cache is 16KB = 4 kilowords =
2**12 words.
- Direct mapped, blocksize 1 (word).
- Since the blocksize is one word, there are 2**30 memory blocks
and all the address bits (except the 2 LOBs that specify the byte
within the word) are used for the memory block number.
Specifically 30 bits are so used.
- The cache has 2**12 words, which is 2**12 blocks.
- So the low order 12 bits of the memory block number give the
index in the cache (the cache block number), shown in cyan.
- The remaining 18 (30-12) bits are the tag, shown in red.
- Direct mapped, blocksize 8
- Three bits of the address give the word within the 8-word
block. These are drawn in magenta.
- The remaining 27 HOBs of the
memory address give the memory block number.
- The cache has 2**12 words, which is 2**9 blocks.
- So the low order 9 bits of the memory block number give the
  index in the cache.
- The remaining 18 bits are the tag.
- 4-way set associative, blocksize 1
- Blocksize is 1 so there are 2**30 memory blocks and 30 bits
are used for the memory block number.
- The cache has 2**12 blocks, which is 2**10 sets (each set has
4=2**2 blocks).
- So the low order 10 bits of the memory block number give
  the index in the cache.
- The remaining 20 bits are the tag.
- As the associativity grows, the tag gets bigger. Why?
  Ans: Growing associativity reduces the number of sets into which a
  block can be placed. This increases the number of memory blocks
  eligible to be placed in a given set. Hence more bits are needed
  to see if the desired block is there.
- 4-way set associative, blocksize 8
  - Three bits of the address give the word within the block.
  - The remaining 27 HOBs of the
    memory address give the memory block number.
  - The cache has 2**12 words = 2**9 blocks = 2**7 sets.
  - So the low order 7 bits of the memory block number give
    the index in the cache.
  - The remaining 20 bits are the tag.
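A small helper sketching these breakdowns (my own illustration, assuming
the 16KB data portion and 32-bit byte addresses used throughout):

    # Tag size for the 16KB cache, given blocksize (words) and associativity.
    def tag_bits(block_words, ways):
        offset_bits = 2                              # byte within a 4-byte word
        block_bits = (block_words - 1).bit_length()  # word within the block
        sets = (2**12 // block_words) // ways
        index_bits = (sets - 1).bit_length()
        return 32 - offset_bits - block_bits - index_bits

    for block_words, ways in [(1, 1), (8, 1), (1, 4), (8, 4)]:
        print(block_words, ways, tag_bits(block_words, ways))
    # prints tag sizes 18, 18, 20, 20, matching the four cases above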
Homework: 7.39, 7.40
Improvement: Multilevel caches
Modern high-end PCs and workstations all have at least two levels
of caches: a very fast, and hence not very big, first level (L1) cache
together with a larger but slower L2 cache.
When a miss occurs in L1, L2 is examined, and only if a miss occurs
there is main memory referenced.
So the average miss penalty for an L1 miss is
(L2 hit rate)*(L2 time) + (L2 miss rate)*(L2 time + memory time)
We are assuming L2 time is the same for an L2 hit or L2 miss. We are
also assuming that the access doesn't begin to go to memory until the
L2 miss has occurred.
Do an example (a worked sketch follows the list).
  - Assume
    - L1 I-cache miss rate 4%
    - L1 D-cache miss rate 5%
    - 40% of instructions reference data
    - L2 miss rate 6%
    - L2 time of 15ns
    - Memory access time 100ns
    - Base CPI of 2
    - Clock rate 400MHz
  - How many instructions per second does this machine execute?
  - How many instructions per second would this machine execute if
    the L2 cache were eliminated?
  - How many instructions per second would this machine execute if
    both caches were eliminated?
  - How many instructions per second would this machine execute if the
    L2 cache had a 0% miss rate (L1 as originally specified)?
  - How many instructions per second would this machine execute if
    both L1 caches had a 0% miss rate?
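One plausible worked sketch, assuming the base CPI already includes L1
hit time and that an L1 miss costs the average L2 access given by the
formula above:

    # Cycle counts at 400MHz (2.5ns cycle): L2 = 15ns = 6 cycles,
    # memory = 100ns = 40 cycles.
    L2, MEM, BASE_CPI, CLOCK = 6, 40, 2.0, 400e6
    misses = 1 * 0.04 + 0.40 * 0.05    # L1 misses per instruction = 0.06

    def mips(cpi):
        return CLOCK / cpi / 1e6

    penalty = 0.94 * L2 + 0.06 * (L2 + MEM)   # avg L1 miss penalty = 8.4 cycles
    print(mips(BASE_CPI + misses * penalty))  # ~160 MIPS (CPI 2.504)
    print(mips(BASE_CPI + misses * MEM))      # no L2: ~91 MIPS (CPI 4.4)
    print(mips(BASE_CPI + 1.4 * MEM))         # no caches: ~6.9 MIPS (CPI 58)
    print(mips(BASE_CPI + misses * L2))       # perfect L2: ~169 MIPS (CPI 2.36)
    print(mips(BASE_CPI))                     # perfect L1: 200 MIPS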
7.4: Virtual Memory
I realize this material was covered in operating systems class
(V22.0202). I am just reviewing it here. The goal is to show the
similarity to caching, which we just studied. Indeed, (the demand
part of) demand paging is caching: In demand paging the
memory serves as a cache for the disk, just as in caching the cache
serves as a cache for the memory.
The names used are different and there are other differences as well.
Cache concept        | Demand paging analogue
---------------------|--------------------------
Memory block         | Page
Cache block          | Page frame (frame)
Blocksize            | Pagesize
Tag                  | None (table lookup)
Word in block        | Page offset
Valid bit            | Valid bit
Miss                 | Page fault
Hit                  | Not a page fault
Miss rate            | Page fault rate
Hit rate             | 1 - Page fault rate
Placement question   | Placement question
Replacement question | Replacement question
Associativity        | None (fully associative)
- For both caching and demand paging, the placement
  question is trivial since the items are fixed size (no first-fit,
  best-fit, buddy, etc.).
- The replacement question is not trivial. (H&P
  list this under the placement question, which I believe is in error.)
  Approximations to LRU are popular for both caching and demand
  paging.
- The cost of a page fault vastly exceeds the cost of a cache miss,
  so it is worthwhile in paging to slow down hit processing to lower
  the miss rate. Hence demand paging is fully associative and uses a
  table to locate the frame in which the page is located.
- The figures to the right are for demand paging. But they can be
interpreted for caching as well.
- The (virtual) page number is the memory block number
- The Page offset is the word-in-block
- The frame (physical page) number is the cache block number
(which is the index into the cache).
- Since demand paging uses full associativity, the tag is the
entire memory block number. Instead of checking every cache block
to see if the tags match, a (page) table is used.
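A toy sketch of that table lookup (the names and the 4KB pagesize are my
own illustration):

    PAGE_SIZE = 4096                 # assume 4KB pages
    page_table = {}                  # virtual page number -> frame number

    def translate(vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in page_table:    # valid bit off: a page fault (the "miss")
            raise RuntimeError("page fault: OS must bring the page in from disk")
        return page_table[vpn] * PAGE_SIZE + offset   # frame number + page offset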
Homework: 7.32
Write through vs. write back
Question: On a write hit should we write the new value through to
(memory/disk) or just keep it in the (cache/memory) and write it back
to (memory/disk) when the (cache-line/page) is replaced?
- Write through is simpler, since write back requires two operations
  at a single event (on an eviction, the dirty victim must be written
  back and the new block read in).
- But write-back has fewer writes to (memory/disk) since multiple
writes to the (cache-line/page) may occur before the (cache-line/page)
is evicted.
- For caching, the cost of writing through to memory is probably less
  than 100 cycles, so with a write buffer the cost of write through is
  bearable, and it does simplify the situation.
- For paging the cost of writing through to disk is on the order of
1,000,000 cycles. Since write-back has fewer writes to disk, it is used.
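A minimal sketch of the write-back bookkeeping (one dirty bit per
cache-line/page; the names below are my own illustration):

    # Mark the line dirty on a write hit; write it to the next level
    # (memory or disk) only when the dirty line is evicted.
    class Line:
        def __init__(self, tag, data):
            self.tag, self.data, self.dirty = tag, data, False

    def write_hit(line, new_data):
        line.data = new_data
        line.dirty = True            # defer the write to the next level

    def evict(line, next_level):
        if line.dirty:               # only dirty lines cost a write on eviction
            next_level[line.tag] = line.data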