======== START LECTURE #19
========
Even Faster
-
Pipeline the cycles
-
Since several instructions will be active at one time, each in a
different cycle of its execution, resources can't be reused (e.g., more
than one instruction might need to do a register read/write at one
time)
-
Pipelining is more complicated than the single cycle
implementation we did.
-
This was the basic RISC technology of the 1980s
-
It is covered in chapter 6.
-
Multiple datapaths (superscalar)
-
Issue several instructions each cycle and the hardware
figures out dependencies and only executes instructions when the
dependencies are satisfied.
-
Much more logic is required, but conceptually it is not too difficult
provided the system executes instructions in order
-
Pretty hairy if out-of-order (OOO) execution is
permitted.
-
Current high end processors
-
VLIW (Very Long Instruction Word)
-
The user (i.e., the compiler) packs several instructions into one
``superinstruction'' called a very long instruction.
-
The user guarantees that there are no dependencies within a
superinstruction.
-
Hardware still needs multiple datapaths (indeed the datapaths are
not so different from superscalar's).
-
Was proposed and tried in the 80s, but was dominated by superscalar.
-
A comeback (?) with Intel's EPIC (Explicitly Parallel Instruction
Computing).
-
Called IA-64 (Intel Architecture 64-bits); the first
implementation was called Merced and now has a funny name
(Itanium?). It should be available next year.
-
It has other features as well (e.g. predication).
-
The x86, Pentium, etc. are called IA-32.
Chapter 2: Performance Analysis
Homework:
Read Chapter 2
Throughput measures the number of jobs per day
that can be accomplished. Response time measures how
long an individual job takes.
- So a faster machine improves both (increasing throughput and
decreasing response time).
- Normally anything that improves response time improves throughput.
-
Adding a processor is likely to increase throughput more than
it decreases response time.
-
We will mostly be concerned with response time.
Performance = 1 / Execution time
So machine X is n times faster than Y means that
-
Performance-X = n * Performance-Y
-
Execution-time-X = (1/n) * Execution-time-Y
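A quick sanity check of the two statements above, with made-up numbers (n = 2 and a 10-second run on Y are assumptions for illustration):

```python
# If X is n times faster than Y, X's performance is n times Y's,
# and X's execution time is 1/n of Y's.
n = 2.0
exec_time_y = 10.0             # seconds on machine Y (assumed)
exec_time_x = exec_time_y / n  # 5.0 seconds on machine X
perf_x = 1 / exec_time_x
perf_y = 1 / exec_time_y
print(perf_x == n * perf_y)    # True
```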
How should we measure execution-time?
-
CPU time
-
Includes time waiting for memory
-
Does not include time waiting for I/O,
as this process is not running
-
Elapsed time on empty system
-
Elapsed time on ``normally loaded'' system
-
Elapsed time on ``heavily loaded'' system
We mostly use CPU time, but this does not mean the other
metrics are worse.
Cycle time vs. Clock rate
- Recall that the cycle time is the length of a cycle.
- It is a unit of time.
- For modern computers it is expressed in nanoseconds,
abbreviated ns.
- One nanosecond is one billionth of a second = 10^(-9) seconds.
- The clock rate tells how many cycles fit into a given time unit
(normally in one second).
- So the natural unit is cycles per second. This was
abbreviated CPS.
- However, the world has changed and the new name (for the same
thing) is Hertz, abbreviated Hz. One Hertz is one cycle per second.
- For modern computers the rate is expressed in megahertz,
abbreviated MHz.
- One megahertz is one million hertz = 10^6 hertz.
- What is the cycle time for a 333MHz computer?
- 333 million cycles = 1 second
- 3.33*10^8 cycles = 1 second
- 1 cycle = 1/(3.33*10^8) seconds = 10/3.33 * 10^(-9) seconds ~= 3ns
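The same calculation in a couple of lines of Python (the 333 MHz figure comes from the example above):

```python
# Cycle time is the reciprocal of clock rate: cycle_time = 1 / clock_rate.
clock_rate_hz = 333e6                # 333 MHz
cycle_time_ns = 1e9 / clock_rate_hz  # convert seconds to nanoseconds
print(round(cycle_time_ns, 2))       # 3.0 (i.e., about 3 ns)
```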
- Electricity travels about 1 foot in 1ns
So the execution time for a given job on a given computer is
(CPU) execution time = (#CPU clock cycles required) * (cycle time)
= (#CPU clock cycles required) / (Clock rate)
So a machine with a 10ns cycle time runs at a rate of
1 cycle per 10 ns = 100,000,000 cycles per second = 100 MHz
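Putting the formula into code; the cycle count below is a made-up number for illustration:

```python
# (CPU) execution time = #CPU clock cycles / clock rate
clock_rate = 100e6                   # 100 MHz, i.e., a 10 ns cycle time
cpu_cycles = 5_000_000               # cycles required (assumed)
exec_time = cpu_cycles / clock_rate  # seconds
print(exec_time)                     # 0.05
```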
The number of CPU clock cycles required equals the number of
instructions executed times the number of cycles in each
instruction.
- In our single cycle implementation, the number of cycles required
is just the number of instructions executed.
- If every instruction took 5 cycles, the number of cycles required
would be five times the number of instructions executed.
But systems are more complicated than that!
- Some instructions take more cycles than others.
- With pipelining, several instructions are in progress at different
stages of their execution.
- With superscalar (or VLIW) many instructions are issued at once
and many can be at the same stage of execution.
- Since modern superscalars (and VLIWs) are also pipelined, we have
many, many instructions in execution at once.
Through a great many measurements, one calculates for a given machine
the average CPI (cycles per instruction).
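One way such measurements combine is as a frequency-weighted average; the instruction mix and per-class cycle counts below are made up for illustration:

```python
# (fraction of executed instructions, cycles per instruction) per class
mix = {
    "alu":    (0.5, 1),
    "load":   (0.2, 5),
    "store":  (0.1, 3),
    "branch": (0.2, 2),
}
avg_cpi = sum(frac * cycles for frac, cycles in mix.values())
print(round(avg_cpi, 2))  # 2.2
```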
#instructions for a given program depends on the instruction set. For
example, we saw in chapter 3 that one VAX instruction often
accomplishes more than one MIPS instruction.
Complicated instructions take longer: either more cycles or a longer
cycle time.
Older machines with complicated instructions (e.g., the VAX in the 80s) had CPI >> 1.
With pipelining, each instruction can take many cycles, but the machine
can still have a CPI of nearly 1.
Modern superscalar machines have CPI < 1
-
They issue many instructions each cycle
-
They are pipelined so the instructions don't finish for several cycles
-
If the machine is 4-issue and every instruction has 5 pipeline stages,
there are up to 20 = 5*4 instructions in progress (often called in
flight) at one time.
Putting this together, we see that
Time (in seconds) = #Instructions * CPI * Cycle_time (in seconds)
Time (in ns) = #Instructions * CPI * Cycle_time (in ns)
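The formula with hypothetical numbers plugged in (instruction count, CPI, and cycle time are all assumptions):

```python
instructions = 1_000_000
cpi = 2.0              # average cycles per instruction (assumed)
cycle_time_ns = 10     # 10 ns cycle, i.e., 100 MHz
time_ns = instructions * cpi * cycle_time_ns
print(time_ns / 1e9)   # 0.02 (seconds)
```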
Homework:
Carefully go through and understand the example on page 59
Homework:
2.1-2.5 2.7-2.10
Homework:
Make sure you can easily do all the problems with a rating of
[5] and can do all with a rating of [10]
What about MIPS?
-
Millions of Instructions Per Second
-
Not the same as the MIPS computer (but not a
coincidence).
- For a given machine language program, the time in seconds is
the number of instructions executed / (MIPS rating * 10^6)
BUT
- The same program in C might need a different number of instructions
on different computers (e.g., one VAX instruction might correspond to 2
instructions on a PowerPC)
-
Different programs generate different MIPS ratings on the same
architecture.
- Some programs execute more long instructions
than do other programs.
- Some programs have more cache misses and hence cause
more waiting for memory.
- Some programs inhibit full pipelining
(e.g. mispredicted branches)
- Some programs inhibit full superscalar behavior (data
dependencies)
-
One can often raise the MIPS rating by adding NOPs, despite
increasing execution time
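A numeric sketch of the last point, with made-up numbers: padding a program with single-cycle NOPs raises its MIPS rating even though it runs longer.

```python
clock = 100e6                        # 100 MHz (assumed)

# Original program: 1e6 instructions averaging 2 cycles each.
cycles = 1e6 * 2.0
time = cycles / clock                # 0.02 s
mips = 1e6 / time / 1e6              # 50 MIPS

# Padded program: add 1e6 NOPs at 1 cycle each.
# More instructions, lower average CPI -- but more total cycles.
cycles_pad = cycles + 1e6 * 1.0
time_pad = cycles_pad / clock        # 0.03 s -- slower
mips_pad = 2e6 / time_pad / 1e6      # ~66.7 MIPS -- "faster"?

print(round(mips), round(mips_pad))  # 50 67
```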
Homework:
Carefully go through and understand the example on pages 61-3
Why not use MFLOPS?
-
Millions of FLoating point Operations Per Second
-
Similar problems to MIPS (not quite as bad since one can't add NOPs)
Benchmarks
-
This is better, but still has difficulties
-
Hard to find benchmarks that represent your future usage.
Homework:
Carefully go through and understand 2.7 ``fallacies and pitfalls''