======== START LECTURE #20 ========

Chapter 7 Memory

Homework: Read Chapter 7

Ideal memory is

Fast
Big (in capacity; not physical size)
Cheap
Imposible

We observe empirically

Temporal Locality: The word referenced now is likely to be referenced again soon. Hence it is wise to keep the currently accessed word handy for a while.
Spacial Locality: Words near the currently referenced word are likely to be referenced soon. Hence it is wise to prefetch words near the currently referenced word and keep them handy for a while.

So use a memory hierarchy

Registers
Cache (really L1 L2 maybe L3)
Memory
Disk
Archive

There is a gap between each pair of adjacent levels. We study the cache <---> memory gap

In modern systems there are many levels of caches.
Similar considerations apply to the other gaps (e.g., memory<--->disk, where virtual memory techniques are applied)
But terminology is often different, e.g. cache line vs page.
In fall 97 my OS class was studying ``the same thing'' at this exact point (memory management). Not true this year since the OS text changed and memory management is earlier.

A cache is a small fast memory between the processor and the main memory. It contains a subset of the contents of the main memory.

A Cache is organized in units of blocks. Common block sizes are 16, 32, and 64 bytes.

We also view memory as organized in blocks as well. If the block size is 16, then bytes 0-15 of memory are in block 0, bytes 16-31 are in block 1, etc.
Transfers from memory to cache and back are one block.
Big blocks make good use of spacial locality
(OS think of pages and page frames)

A hit occurs when a memory reference is found in the upper level of memory hierarchy.

We will be interested in cache hits (OS in page hits), when the reference is found in the cache (OS: when found in main memory)
A miss is a non-hit.
The hit rate is the fraction of memory references that are hits.
The miss rate is 1 - hit rate, which is the fraction of references that are misses.
The hit time is the time required for a hit.
The miss time is the time required for a miss.
The miss penalty is Miss time - Hit time.

We start with a very simple cache organization.

Assume all referencess are for one word (not too bad).
Assume cache blocks are one word.
- This does not take advantage of spacial locality so is not done in real machines.
- We will drop this assumption soon.
Assume each memory block can only go in one specific cache block
- Called Direct Mapped.
- The location of the memory block in the cache (i.e. the block number in the cache) is the memory block number modulo the number of blocks in the cache.
- For example, if the cache contains 100 blocks, then memory block 34452 is stored in cache block 52.
- In real systems the number of blocks in the cache is a power of 2 so taking modulo is just extracting low order bits.
Example: if the cache has 16 blocks, the location of a block in the cache is the low order 4 bits of block number.
A pictorial example for a cache with only 4 blocks and a memory with only 16 blocks.
How can we tell if a memory block is in the cache?
- We know where it will be if it is there at all (memory block number mod number of blocks in the cache).
- But many memory blocks are assigned to that same cache block.
- So we need the ``rest'' of the address (i.e., the part lost when we reduced the block number modulo the size of the cache) to see if the block in the cache is the memory block of interest,
- Store the rest of the address, called the tag.
- Also store a valid bit per cache block so that we can tell if there is a memory block stored in this cache block.
  For example when the system is powered on, all the cache blocks are invalid.

Example on pp. 547-8.

Tiny 8 word direct mapped cache with block size one word and all references are for a word.
In the table to follow all the addresses are word addresses. For example the reference to 3 means the reference to word 3 (which includes bytes 12, 13, 14, and 15).
If reference experience a miss and the cache block is valid, the current reference is discarded (in this example only) and the new reference takes its place.
Do this example on the board showing the address store in the cache at all times

Address(10)	Address(2)	hit/miss	block#
22	10110	miss	110
26	11010	miss	010
22	10110	hit	110
26	11010	hit	010
16	10000	mis	000
3	00011	miss	011
16	10000	hit	000
18	10010	miss	010

The basic circuitry for this simple cache to determine hit or miss and to return the data is quite easy.

Calculate on the board the total number of bits in this cache.

Homework: 7.1 7.2 7.3

Processing a read for this simple cache

Hit is trivial (return the data found).
Miss: Evict and replace
- Why? I.e., why keep new data instead of old?
  Ans: Temporal Locality
- This is called write-allocate.
- The alternative is called write-no-allocate.

Skip section ``handling cache misses'' as it discusses the multicycle and pipelined implementations of chapter 6, which we skipped.

For our single cycle processor implementation we just need to note a few points

The instruction and data memory are replaced with caches.
On cache misses one needs to fetch or store the desired data or instruction in central memory.
This is very slow and hence our cycle time must be very long.
Yet another reason why the single cycle implementation is not used in practice.

Processing a write for this simple cache

Hit: Write through vs write back.
- Write through writes the data to memory as well as to the cache.
- Write back: Don't write to memory now, do it later when this cache block is evicted.
Miss: write-allocate vs write-no-allocate
The simplist is write-through, write-allocate
- Still assuming blksize=refsize = 1 word and direct mapped
- For any write (Hit or miss) do the following:
  1. Index the cache using the correct LOBs (i.e., not the very lowest order bits as these give the byte offset.
  2. Write the data and the tag
    - For a hit, we are overwriting the tag with itself.
    - For a miss, we are performing a write allocate and since the cache is write-throught we can simply overwrite the current entry
  3. Set Valid to true
  4. Send request to main memory
- Poor performance
  - GCC benchmark has 11% of operations stores
  - If we assume an infinite speed memory, CPI is 1.2 for some reasonable estimate of instruction speeds
  - Assume a 10 cycle store penalty (reasonable) since we have to write main memory (recall we are using a write-through cache).
  - CPI becomes 1.2 + 10 * 11% = 2.5, which is half speed.