## Memory Hierarchy

Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3

Instructor: Joanna Klukowska

Slides adapted from Randal E. Bryant and David R. O'Hallaron (CMU) and Mohamed Zahran (NYU)

Cache Memory Organization and Access



























### What about writes?

- Multiple copies of data exist:
  - L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer write to memory until replacement of line)
    - Needs a dirty bit (records whether the line differs from memory)
- What to do on a write-miss?
  - Write-allocate (load into cache, update line in cache)
    - Good if more writes to the location follow
  - No-write-allocate (writes straight to memory, does not load into cache)
- Typical combinations
  - Write-through + No-write-allocate
  - Write-back + Write-allocate



### **Cache Performance Metrics**

### Miss Rate

- Fraction of memory references not found in cache (misses / accesses)
   = 1 − hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - can be very small (e.g., < 1%) for L2, depending on size, etc.

### Hit Time

- Time to deliver a line in the cache to the processor
  - includes time to determine whether the line is in the cache
- Typical numbers:
  - 4 clock cycles for L1
  - 10 clock cycles for L2

### Miss Penalty

- Additional time required because of a miss
  - typically 50-200 cycles for main memory




### Let's think about those numbers

- Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
  - Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
  - Average access time:
    - 97% hits: 0.97 × 1 cycle + 0.03 × 100 cycles ≈ 1 cycle + 3 cycles = 4 cycles
    - 99% hits: 0.99 × 1 cycle + 0.01 × 100 cycles ≈ 1 cycle + 1 cycle = 2 cycles
- This is why "miss rate" is used instead of "hit rate"

## **Writing Cache Friendly Code**

- Make the **common** case go fast
  - Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality) because there is a good chance that they are stored in registers.
  - Stride-1 reference patterns are good (spatial locality) because subsequent references to elements in the same block will be able to hit the cache (one cache miss followed by many cache hits).

## Rearranging Loops to Improve Spatial Locality


## **Matrix Multiplication Example**

```
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;                      /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```

- Description:
  - Multiply N x N matrices
  - Matrix elements are doubles (8 bytes)
  - O(N³) total operations
  - N reads per source element
  - N values summed per destination
    - but may be able to hold in register


### Miss Rate Analysis for Matrix Multiply

### Assume:

- Block size = 32B (big enough for four doubles)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

### Analysis Method:

Look at access pattern of inner loop



## **Layout of C Arrays in Memory (review)**

- C arrays allocated in row-major order
  - each row in contiguous memory locations
- Stepping through columns in one row:

```
for (i = 0; i < N; i++)
    sum += a[0][i];
```

- accesses successive elements
- if block size (B) > sizeof(a<sub>ij</sub>) bytes, exploit spatial locality
  - miss rate = sizeof(a<sub>ij</sub>) / B
- Stepping through rows in one column:

```
for (i = 0; i < N; i++)
    sum += a[i][0];
```

- accesses distant elements
- no spatial locality!
  - miss rate = 1 (i.e. 100%)









## Learn about your machine's cache

- lshw command: lists hardware information
  - sudo lshw -C memory
- lscpu command: displays information about the CPU architecture
  - lscpu
- dmidecode command: reports hardware information from the system's DMI (SMBIOS) tables
  - sudo dmidecode -t cache

Note: some of these may not work well in a virtual machine environment.

## **Cache Summary**

- Cache memories can have significant performance impact
- You can write your programs to exploit this!
  - Focus on the inner loops, where the bulk of computations and memory accesses occur.
  - Try to maximize spatial locality by reading data objects sequentially with stride 1.
  - Try to maximize temporal locality by using a data object as often as possible once it's read from memory.