======== START LECTURE #19
========
Even Faster
-
Pipeline the cycles
-
Since several instructions will be active at one time, each in a
different cycle of its execution, resources can't be reused (e.g., more
than one instruction might need to do a register read/write at one
time)
-
Pipelining is more complicated than the single cycle
implementation we did.
-
This was the basic RISC technology of the 1980s
-
It is covered in chapter 6.
-
Multiple datapaths (superscalar)
-
Issue several instructions each cycle and the hardware
figures out dependencies and only executes instructions when the
dependencies are satisfied.
-
Much more logic is required, but conceptually it is not too difficult
provided the system executes instructions in order
-
Pretty hairy if out-of-order (OOO) execution is
permitted.
-
Current high end processors
-
VLIW (Very Long Instruction Word)
-
The user (i.e., the compiler) packs several instructions into one
``superinstruction'' called a very long instruction.
-
The user guarantees that there are no dependencies within a
superinstruction.
-
Hardware still needs multiple datapaths (indeed the datapaths are
not so different from superscalar's).
-
Was proposed and tried in the 80s, but was dominated by superscalar.
-
A comeback (?) with Intel's EPIC (Explicitly Parallel Instruction
Computing).
-
Called IA-64 (Intel Architecture 64-bits); the first
implementation was called Merced and now has a funny name
(Itanium?). It should be available next year.
-
It has other features as well (e.g. predication).
-
The x86, Pentium, etc. are called IA-32.
Chapter 2: Performance Analysis
Homework:
Read Chapter 2
Throughput measures the number of jobs per day
that can be accomplished. Response time measures how
long an individual job takes.
- So a faster machine improves both (increasing throughput and
decreasing response time).
- Normally anything that improves response time improves throughput.
-
Adding a processor is likely to increase throughput more than
it decreases response time.
-
We will mostly be concerned with response time.
Performance = 1 / Execution time
So machine X is n times faster than Y means that
-
Performance-X = n * Performance-Y
-
Execution-time-X = (1/n) * Execution-time-Y
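A quick sanity check of the two statements above, with made-up numbers (n = 2 and a 10-second run on Y are assumptions for illustration):

```python
# If X is n times faster than Y, X's performance is n times Y's,
# and X's execution time is 1/n of Y's.
n = 2.0
exec_time_y = 10.0             # seconds on machine Y (assumed)
exec_time_x = exec_time_y / n  # 5.0 seconds on machine X
perf_x = 1 / exec_time_x
perf_y = 1 / exec_time_y
print(perf_x == n * perf_y)    # True
```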
How should we measure execution-time?
-
CPU time
-
Includes time waiting for memory
-
Does not include time waiting for I/O,
as this process is not running
-
Elapsed time on empty system
-
Elapsed time on ``normally loaded'' system
-
Elapsed time on ``heavily loaded'' system
We mostly use CPU time, but this does not mean the other
metrics are worse.
Cycle time vs. Clock rate
- Recall that the cycle time is the length of a cycle.
- It is a unit of time.
- For modern computers it is expressed in nanoseconds,
abbreviated ns.
- One nanosecond is one billionth of a second = 10^(-9) seconds.
- The clock rate tells how many cycles fit into a given time unit
(normally in one second).
- So the natural unit is cycles per second. This was
abbreviated CPS.
- However, the world has changed and the new name (for the same
thing) is Hertz, abbreviated Hz. One Hertz is one cycle per second.
- For modern computers the rate is expressed in megahertz,
abbreviated MHz.
- One megahertz is one million hertz = 10^6 hertz.
- What is the cycle time for a 333MHz computer?
- 333 million cycles = 1 second
- 3.33*10^8 cycles = 1 second
- 1 cycle = 1/(3.33*10^8) seconds = 10/3.33 * 10^(-9) seconds ~= 3ns
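The same calculation in a couple of lines of Python (the 333 MHz figure comes from the example above):

```python
# Cycle time is the reciprocal of clock rate: cycle_time = 1 / clock_rate.
clock_rate_hz = 333e6                # 333 MHz
cycle_time_ns = 1e9 / clock_rate_hz  # convert seconds to nanoseconds
print(round(cycle_time_ns, 2))       # 3.0 (i.e., about 3 ns)
```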
- Electricity travels about 1 foot in 1ns
So the execution time for a given job on a given computer is
(CPU) execution time = (#CPU clock cycles required) * (cycle time)
= (#CPU clock cycles required) / (Clock rate)
So a machine with a 10ns cycle time runs at a rate of
1 cycle per 10 ns = 100,000,000 cycles per second = 100 MHz
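Putting the formula into code; the cycle count below is a made-up number for illustration:

```python
# (CPU) execution time = #CPU clock cycles / clock rate
clock_rate = 100e6                   # 100 MHz, i.e., a 10 ns cycle time
cpu_cycles = 5_000_000               # cycles required (assumed)
exec_time = cpu_cycles / clock_rate  # seconds
print(exec_time)                     # 0.05
```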
The number of CPU clock cycles required equals the number of
instructions executed times the number of cycles in each
instruction.
- In our single cycle implementation, the number of cycles required
is just the number of instructions executed.
- If every instruction took 5 cycles, the number of cycles required
would be five times the number of instructions executed.
But systems are more complicated than that!
- Some instructions take more cycles than others.
- With pipelining, several instructions are in progress at different
stages of their execution.
- With superscalar (or VLIW) many instructions are issued at once
and many can be at the same stage of execution.
- Since modern superscalars (and VLIWs) are also pipelined, we have
many, many instructions in execution at once.
Through a great many measurements, one calculates for a given machine
the average CPI (cycles per instruction).
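One way such measurements combine is as a frequency-weighted average; the instruction mix and per-class cycle counts below are made up for illustration:

```python
# (fraction of executed instructions, cycles per instruction) per class
mix = {
    "alu":    (0.5, 1),
    "load":   (0.2, 5),
    "store":  (0.1, 3),
    "branch": (0.2, 2),
}
avg_cpi = sum(frac * cycles for frac, cycles in mix.values())
print(round(avg_cpi, 2))  # 2.2
```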
#instructions for a given program depends on the instruction set. For
example, we saw in chapter 3 that one VAX instruction often
accomplishes more than one MIPS instruction.
Complicated instructions take longer: either more cycles or a longer
cycle time.
Older machines with complicated instructions (e.g., the VAX in the 80s) had CPI >> 1.
With pipelining, each instruction can take many cycles, but the machine
can still have a CPI of nearly 1.
Modern superscalar machines have CPI < 1
-
They issue many instructions each cycle
-
They are pipelined so the instructions don't finish for several cycles
-
If the machine is 4-issue and every instruction has 5 pipeline stages,
there are up to 20 = 5*4 instructions in progress (often called in
flight) at one time.
Putting this together, we see that
Time (in seconds) = #Instructions * CPI * Cycle_time (in seconds)
Time (in ns) = #Instructions * CPI * Cycle_time (in ns)
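The formula with hypothetical numbers plugged in (instruction count, CPI, and cycle time are all assumptions):

```python
instructions = 1_000_000
cpi = 2.0              # average cycles per instruction (assumed)
cycle_time_ns = 10     # 10 ns cycle, i.e., 100 MHz
time_ns = instructions * cpi * cycle_time_ns
print(time_ns / 1e9)   # 0.02 (seconds)
```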
Homework:
Carefully go through and understand the example on page 59
Homework:
2.1-2.5 2.7-2.10
Homework:
Make sure you can easily do all the problems with a rating of
[5] and can do all with a rating of [10]
What about MIPS?
-
Millions of Instructions Per Second
-
Not the same as the MIPS computer (but not a
coincidence).
- For a given machine language program, the time in seconds is
the number of instructions executed / (MIPS rating * 10^6)
BUT
- The same program in C might need a different number of instructions
on different computers (e.g., one VAX instruction might correspond to 2
instructions on a PowerPC)
-
Different programs generate different MIPS ratings on the same
architecture.
- Some programs execute more long instructions
than do other programs.
- Some programs have more cache misses and hence cause
more waiting for memory.
- Some programs inhibit full pipelining
(e.g. mispredicted branches)
- Some programs inhibit full superscalar behavior (data
dependencies)
-
One can often raise the MIPS rating by adding NOPs, despite
increasing execution time
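A numeric sketch of the last point, with made-up numbers: padding a program with single-cycle NOPs raises its MIPS rating even though it runs longer.

```python
clock = 100e6                        # 100 MHz (assumed)

# Original program: 1e6 instructions averaging 2 cycles each.
cycles = 1e6 * 2.0
time = cycles / clock                # 0.02 s
mips = 1e6 / time / 1e6              # 50 MIPS

# Padded program: add 1e6 NOPs at 1 cycle each.
# More instructions, lower average CPI -- but more total cycles.
cycles_pad = cycles + 1e6 * 1.0
time_pad = cycles_pad / clock        # 0.03 s -- slower
mips_pad = 2e6 / time_pad / 1e6      # ~66.7 MIPS -- "faster"?

print(round(mips), round(mips_pad))  # 50 67
```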
Homework:
Carefully go through and understand the example on pages 61-3
Why not use MFLOPS?
-
Millions of FLoating point Operations Per Second
-
Similar problems to MIPS (not quite as bad since one can't add NOPs)
Benchmarks
-
This is better, but still has difficulties
-
Hard to find benchmarks that represent your future usage.
Homework:
Carefully go through and understand 2.7 ``fallacies and pitfalls''