Computer Architecture
Start Lecture #19
Homework:
Carefully go through and understand the example on page 247 that I
just did in class.
Homework: The next 5 problems form a set, i.e.,
the data from one applies to all the following problems.
The first three, 4.1, 4.2, and 4.3, are from the book.
Homework: If the clock rates of the machines M1
and M2 from exercise 4.1 are 1GHz and 2GHz, respectively, find the
CPI for program 1 on both machines.
Homework:
Assume the CPI for program 2 on each machine
is the same as the CPI for program 1 you calculated in the previous
problem.
What is the instruction count for program 2 on each machine?
4.3: Evaluating Performance
I have nothing to add.
4.4: Real Stuff: Two SPEC Benchmarks and the Performance of
Recent Intel Processors
Skipped.
4.5 Fallacies and Pitfalls
What is the MIPS rating for a computer and how useful is it?
- MIPS stands for Millions of Instructions Per Second.
- It is a unit of rate or speed (like MHz), not of time (like ns).
- It is not the same as the MIPS computer (but the name
similarity is not a coincidence).
- The number of seconds required to execute a given (machine
  language) program is
  (the number of instructions executed) / (the number of instructions executed per second).
- Hence the number of microseconds required to execute a given program is
  (the number of machine language instructions executed) / (the MIPS rating).
- BUT ...
- The same program in C (or Java, or Ada, etc.) might need a
  different number of instructions on different computers.
  For example, one VAX instruction might require 2 instructions on a
  PowerPC and 3 instructions on a MIPS.
- The same program in C, when compiled by two different compilers
  for the same computer architecture, might need to execute a
  different number of instructions.
- Different programs may achieve different MIPS ratings on the
  same architecture.
- Some programs execute more long instructions
than do other programs.
- Some programs have more cache misses and hence cause
more waiting for memory.
- Some programs inhibit full pipelining
(e.g., they may have more mispredicted branches).
- Some programs inhibit full superscalar behavior
(e.g., they may have unhideable data dependencies).
- One can often raise the MIPS rating by adding NOPs, despite
  increasing execution time. How?
  Ans. MIPS doesn't require that the instructions counted be useful,
  and NOPs, while useless, are nonetheless very fast.
  (A worked example follows this list.)
- So, unlike MHz, MIPS is not a value that can be defined for a
  specific computer; it depends on other factors, e.g., the
  language/compiler used, the problem solved, and the algorithm employed.
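To make the formulas and the NOP trick concrete, here is a small
worked example in Python; the 1GHz clock, instruction counts, and
cycle counts are all invented for illustration.

    # Hypothetical 1GHz machine (10^9 cycles per second).
    clock_hz = 1e9

    # Original program: 100 million instructions averaging 2 cycles each.
    insts  = 100e6
    cycles = insts * 2.0
    time_s = cycles / clock_hz                # 0.2 seconds
    mips   = insts / time_s / 1e6             # 500 MIPS

    # Sanity check of the formula above: microseconds = instructions / MIPS.
    assert abs(insts / mips - time_s * 1e6) < 1e-3

    # Now pad the program with 100 million NOPs at 1 cycle each.
    insts_nop  = insts + 100e6
    cycles_nop = cycles + 100e6 * 1.0
    time_nop   = cycles_nop / clock_hz        # 0.3 seconds: slower ...
    mips_nop   = insts_nop / time_nop / 1e6   # ... yet about 667 MIPS, a "better" rating

The padded program takes 50% longer but reports a 33% higher MIPS
rating, which is exactly the pitfall.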
Homework:
Carefully go through and understand the example on pages 248-249.
How about MFLOPS (Millions of FLoating point OPerations per Second)?
For numerical calculations, floating point operations are the ones
you are interested in; the others are overhead (a very rough
approximation to reality).
It has similar problems to MIPS.
- The same program needs different numbers of floating point operations
on different machines (e.g., is sqrt one instruction or several?).
- Compilers affect the MFLOPS rating.
- MFLOPS is not as bad as MIPS, since adding NOPs lowers the MFLOPS
  rating.
- But you can insert unnecessary floating point ADD instructions
  and this will probably raise the MFLOPS rating. Why?
  Because it will lower the percentage of overhead (i.e.,
  non-floating point) instructions.
  (A small illustration follows this list.)
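A similar invented illustration in Python for the floating point
trick: the cycle counts and the 2-cycle FP ADD below are assumptions
chosen only to show the effect.

    clock_hz  = 1e9       # hypothetical 1GHz machine
    flops     = 10e6      # useful floating point operations
    oh_cycles = 150e6     # cycles spent in overhead (non-FP) instructions
    fp_cycles = 50e6      # cycles spent in FP instructions

    time_s = (oh_cycles + fp_cycles) / clock_hz    # 0.2 seconds
    mflops = flops / time_s / 1e6                  # 50 MFLOPS

    # Insert 10 million useless FP ADDs at (say) 2 cycles each.
    flops_pad  = flops + 10e6
    time_pad   = (oh_cycles + fp_cycles + 10e6 * 2) / clock_hz  # 0.22 seconds: slower ...
    mflops_pad = flops_pad / time_pad / 1e6        # ... yet about 91 MFLOPS

The program slowed down by 10%, but since a larger fraction of the
work is now (counted) floating point operations, the MFLOPS rating
nearly doubles.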
Benchmarks are better than MIPS or MFLOPS, but still have difficulties.
- It is hard to find benchmarks that represent your future
usage.
- Compilers can be tuned for important benchmarks.
- Benchmarks can be chosen to favor certain architectures.
  - If your processor has 256KB of cache memory and
    your competitor's has 128KB, you try to find a benchmark that
    frequently accesses a region of memory having size between 128KB
    and 256KB.
  - If your 128KB cache is 2-way set associative (defined later
    this semester) while your competitor's 256KB cache is direct
    mapped, then you build/choose a benchmark that frequently accesses
    exactly two 10KB arrays separated by an exact multiple of 256KB
    (sketched below).
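Here is a minimal Python sketch of that last trick; it models only
which cache set a byte address maps to (the 32-byte block size and
the array addresses are invented, and a real benchmark would of
course access the arrays rather than compute set numbers).

    BLOCK = 32                            # assumed bytes per block

    def dm_set(addr, cache_bytes):
        # Direct mapped: the whole cache is a single way.
        return (addr // BLOCK) % (cache_bytes // BLOCK)

    def sa2_set(addr, cache_bytes):
        # 2-way set associative: each of the 2 ways holds half the cache.
        return (addr // BLOCK) % (cache_bytes // (2 * BLOCK))

    a = 0x100000                          # base of the first 10KB array
    b = a + 256 * 1024                    # second array, exactly 256KB away

    # Competitor's direct-mapped 256KB cache: corresponding blocks of
    # the two arrays land in the same set, which holds only one block,
    # so alternating accesses evict each other (thrashing).
    print(dm_set(a, 256 * 1024) == dm_set(b, 256 * 1024))    # True

    # Your 2-way 128KB cache: the blocks also share a set, but the set
    # has two ways, so one block from each array stays resident.
    print(sa2_set(a, 128 * 1024) == sa2_set(b, 128 * 1024))  # True, and harmless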
4.6: Concluding Remarks
Homework:
Read this (very short) section.
Chapter 7: Memory
Homework: Read Chapter 7.
7.1: Introduction
An ideal memory is
- Big (in capacity; not physical size).
- Fast.
- Cheap.
- Impossible.
Unable to achieve the impossible ideal, we use a
memory hierarchy consisting of
- Registers
- Cache (really L1, L2, and maybe L3)
- (Central or Main) Memory
- Disk
- Archive (e.g. Tape)
... and try to satisfy most references in the small fast memories
near the top of the hierarchy.
There is a capacity/performance/price gap between each pair of
adjacent levels.
We will study the cache-to-memory gap.
- In modern systems there are many levels of caches so we should
study the L1-to-L2 gap, the L2-to-L3 gap, and the L3-to-memory
gap.
- Similar considerations to those we shall study apply as well
to the other gaps (e.g., memory-to-disk, where virtual memory
techniques are applied).
This last is the gap studied in OS classes such as 202/2250.
- But the terminology is often different, e.g., in architecture we
evict cache blocks or lines whereas in OS we evict pages.
- In fall 97 my OS class was studying the same thing at the same
  time as my architecture class, but with different, almost
  disjoint, terminology.
We observe empirically (and teach in OS) two kinds of locality; a
small example follows the list.
- Temporal Locality: The word referenced now is likely to be
referenced again soon.
Hence it is wise to keep the currently
accessed word handy (high in the memory hierarchy) for a while.
- Spatial Locality: Words near the currently referenced
word are likely to be referenced soon.
Hence it is wise to prefetch words near the currently referenced
word and keep them handy (high in the memory hierarchy) for a
while.
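A tiny Python loop exhibiting both kinds of locality (the array and
its size are arbitrary):

    data = list(range(1000))      # 1000 consecutive elements

    total = 0
    for i in range(len(data)):
        total += data[i]
        # 'total' is referenced every iteration: temporal locality.
        # data[i], data[i+1], ... are adjacent: spatial locality, so
        # fetching the block containing data[i] also brings in the
        # next few elements.

(Python's actual memory layout is more complicated than these
comments suggest; think of the equivalent loop over a C array.)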
A cache is a small fast memory between
the processor and the main memory.
It contains a subset of the
contents of the main memory.
A cache is organized in units of blocks.
Common block sizes are 16, 32, and 64 bytes.
This is the smallest unit we can move to/from a cache (some designs
move subblocks, but we will not discuss them).
- We view memory as organized in blocks as well.
  If the block size is 16, then bytes 0-15 of memory are in block
  0, bytes 16-31 are in block 1, etc. (see the short computation
  after this list).
- Transfers from memory to cache and back are one block.
- Big blocks make good use of spatial locality.
- If you remember memory management in OS, think of pages and page
frames.
- The terminology in memory management is:
Pages are located in the big slow disk; frames are in the small
fast (main) memory.
- The terminology in caches is:
Memory blocks are located in the big slow (main) memory;
cache blocks are located in the small fast cache.
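Continuing the block-size-16 example, the block number and the
position within the block are just the quotient and remainder; in
Python:

    BLOCK_SIZE = 16                    # bytes, as in the example above

    def block_number(addr):
        return addr // BLOCK_SIZE      # bytes 0-15 -> block 0, 16-31 -> block 1, ...

    def block_offset(addr):
        return addr % BLOCK_SIZE       # where the byte sits within its block

    print(block_number(25), block_offset(25))   # byte 25 is byte 9 of block 1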
A hit occurs when a memory reference is found in
the upper level of the memory hierarchy.
Definitions
- We will be interested in cache hits (OS courses
study page hits), when the reference is found in the cache (OS:
when found in main memory).
- A miss is a non-hit.
- The hit rate is the fraction of memory references
that are hits.
- The miss rate is 1 - hit rate, which is the
fraction of references that are misses.
- The hit time is the time required for a hit.
- The miss time is the time required for a miss.
- The miss penalty is miss time - hit time (see the sketch below).
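These definitions combine into the average time per memory reference;
a small Python check using invented numbers (1ns hit time, 40ns miss
time, 5% miss rate):

    hit_time  = 1e-9          # seconds (invented)
    miss_time = 40e-9         # seconds (invented)
    miss_rate = 0.05          # 5% of references miss
    hit_rate  = 1 - miss_rate
    miss_penalty = miss_time - hit_time

    # Averaging over hits and misses ...
    avg = hit_rate * hit_time + miss_rate * miss_time
    # ... equals hit time + miss rate * miss penalty (just algebra).
    assert abs(avg - (hit_time + miss_rate * miss_penalty)) < 1e-15
    print(avg)                # 2.95e-09, i.e., 2.95ns per reference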