Computer Architecture
1999-2000 Fall
MW 3:30-4:45
Ciww 109

Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/~gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best

======== START LECTURE #17 ========

Note: Lab 3 (the final lab) is handed out today and is due 20 November.

Even Faster (we are not covering this).

Pipeline the cycles.
- Since several instructions will be active at one time, each at a different cycle, the resources can't be reused (e.g., more than one instruction might need to do a register read/write at one time).
- Pipelining is more complicated than the single cycle implementation we did.
- This was the basic RISC technology of the 1980s.
- A pipelined implementation of the MIPS CPU is covered in chapter 6.
Multiple datapaths (superscalar).
- Issue several instructions each cycle and the hardware figures out dependencies and only executes instructions when the dependencies are satisfied.
- Much more logic required, but conceptually not too difficult provided the system executes instructions in order.
- Pretty hairy if out of order (OOO) execution is permitted.
- Current high end processors are all OOO superscalar (and are indeed pretty hairy).
VLIW (Very Long Instruction Word)
- User (i.e., the compiler) packs several instructions into one ``superinstruction'' called a very long instruction.
- User guarantees that there are no dependencies within a superinstruction.
- Hardware still needs multiple datapaths (indeed the datapaths are not so different from superscalar).
- The hairy control for superscalar (especially OOO superscalar) is not needed since the dependency checking is done by the compiler, not the hardware.
- Was proposed and tried in the 80s, but was dominated by superscalar.
- A comeback (?) with Intel's EPIC (Explicitly Parallel Instruction Computing) architecture.
  - Called IA-64 (Intel Architecture 64-bits); the first implementation was called Merced and now has a funny name (Itanium). It should be available RSN (Real Soon Now).
  - It has other features as well (e.g. predication).
  - The x86, Pentium, etc. are called IA-32.

Chapter 2 Performance analysis

Homework: Read Chapter 2

2.1: Introduction

Throughput measures the number of jobs per day that can be accomplished. Response time measures how long an individual job takes.

A faster machine improves both metrics (increases throughput and decreases response time).
Normally anything that improves response time improves throughput.
But the reverse isn't true. For example, adding a processor is likely to increase throughput more than it decreases response time.
We will be concerned primarily with response time.

We define Performance as 1 / Execution time.

So machine X is n times faster than Y means that

The performance of X = n * the performance of Y.
The execution time of X = (1/n) * the execution time of Y.
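
As a quick check of these definitions, here is a minimal Python sketch. The function names and the 10-second and 15-second timings are made up for illustration; they are not from the text.

```python
def performance(execution_time):
    """Performance is defined as 1 / execution time."""
    return 1.0 / execution_time


def times_faster(time_x, time_y):
    """Return n such that machine X is n times faster than machine Y."""
    return performance(time_x) / performance(time_y)


# Hypothetical timings: X runs the job in 10 seconds, Y in 15 seconds.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

With these numbers n comes out to about 1.5, and the execution time of X (10 seconds) is indeed (1/1.5) times the execution time of Y (15 seconds).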

2.2: Measuring Performance

How should we measure execution time?

CPU time.
- This includes the time spent waiting for memory.
- It does not include the time spent waiting for I/O, since the process is not running then and hence is using no CPU time.
Elapsed time on an otherwise empty system.
Elapsed time on a ``normally loaded'' system.
Elapsed time on a ``heavily loaded'' system.

We use CPU time, but this does not mean the other metrics are worse.

Cycle time vs. Clock rate.

Recall that cycle time is the length of a cycle.
It is a unit of time.
For modern computers it is expressed in nanoseconds, abbreviated ns.
One nanosecond is one billionth of a second = 10^(-9) seconds.
Electricity travels about 1 foot in 1ns (in normal media).
The clock rate tells how many cycles fit into a given time unit (normally in one second).
So the natural unit for clock rate is cycles per second. This used to be abbreviated CPS.
However, the world has changed and the new name for the same thing is Hertz, abbreviated Hz. One Hertz is one cycle per second.
For modern computers the rate is expressed in megahertz, abbreviated MHz.
One megahertz is one million hertz = 10^6 hertz.
A few machines have a clock rate exceeding a gigahertz (GHz). Next year many new machines will pass the gigahertz mark; possibly some will exceed 2GHz.
One gigahertz is one billion hertz = 10^9 hertz.
What is the cycle time for a 700MHz computer?
- 700 million cycles = 1 second
- 7*10^8 cycles = 1 second
- 1 cycle = 1/(7*10^8) seconds = 10/7 * 10^(-9) seconds ~= 1.4ns
What is the clock rate for a machine with a 10ns cycle time?
- 1 cycle = 10ns = 10^(-8) seconds.
- 10^8 cycles = 1 second.
- Rate is 10^8 Hertz = 100 * 10^6 Hz = 100MHz = 0.1GHz
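
The two conversions above are inverses of each other, which the following small Python sketch makes explicit. The function names are invented for this illustration; the inputs are the 700MHz and 10ns examples just worked by hand.

```python
NS_PER_SECOND = 1e9  # one second is 10^9 nanoseconds


def cycle_time_in_ns(rate_hz):
    """Cycle time (in ns) of a clock running at rate_hz cycles per second."""
    return NS_PER_SECOND / rate_hz


def clock_rate_in_hz(time_ns):
    """Clock rate (in Hz) of a machine whose cycle lasts time_ns nanoseconds."""
    return NS_PER_SECOND / time_ns


# 700MHz machine: roughly 1.4ns per cycle.
print(f"{cycle_time_in_ns(700e6):.2f} ns")

# 10ns cycle time: 100MHz.
print(f"{clock_rate_in_hz(10) / 1e6:.0f} MHz")
```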

2.3: Relating the metrics

The execution time for a given job on a given computer is

(CPU) execution time = (#CPU clock cycles required) * (cycle time)
                     = (#CPU clock cycles required) / (clock rate)

The number of CPU clock cycles required equals the number of instructions executed times the number of cycles in each instruction.

In our single cycle implementation, the number of cycles required is just the number of instructions executed.
If every instruction took 5 cycles, the number of cycles required would be five times the number of instructions executed.

But real systems are more complicated than that!

Some instructions take more cycles than others.
With pipelining, several instructions are in progress at different stages of their execution.
With superscalar (or VLIW) many instructions are issued at once.
Since modern superscalars (and VLIWs) are also pipelined we have many many instructions executing at once.

Through a great many measurements, one calculates for a given machine the average CPI (cycles per instruction).
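
One common way to arrive at such an average is to weight the cycle count of each class of instruction by how often that class occurs. The sketch below assumes this weighted-average approach; the instruction classes, frequencies, and cycle counts are invented for illustration and are not measurements of any real machine.

```python
def average_cpi(mix):
    """Weighted average CPI.

    mix is a list of (fraction_of_instructions, cycles) pairs,
    where the fractions are assumed to sum to 1.
    """
    return sum(fraction * cycles for fraction, cycles in mix)


# Hypothetical instruction mix (made-up numbers).
example_mix = [
    (0.50, 1),  # ALU operations
    (0.30, 2),  # loads and stores
    (0.20, 3),  # branches
]

print(f"average CPI = {average_cpi(example_mix):.2f}")
```

For this made-up mix the average works out to 0.5*1 + 0.3*2 + 0.2*3 = 1.7 cycles per instruction.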

The number of instructions required for a given program depends on the instruction set. For example, we saw in chapter 3 that 1 VAX instruction often accomplishes more than 1 MIPS instruction.

Complicated instructions take longer; either more cycles or longer cycle time.

Older machines with complicated instructions (e.g. the VAX in the 80s) had CPI>>1.

With pipelining, each instruction can take many cycles and yet the CPI can still be nearly 1.

Modern superscalar machines have CPI < 1.

They issue many instructions each cycle.
They are pipelined so the instructions don't finish for several cycles.
If we consider a 4-issue superscalar and assume that all instructions require 5 (pipelined) cycles, there are up to 20=5*4 instructions in progress (often called in flight) at one time.

Putting this together, we see that

   Time (in seconds) =  #Instructions * CPI * Cycle_time (in seconds).
   Time (in ns)      =  #Instructions * CPI * Cycle_time (in ns).
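
The formula is easy to try out numerically. Here is a minimal Python sketch; the function name and the workload (one billion instructions at a CPI of 2) are hypothetical, and the 10ns cycle time is the 100MHz machine from the earlier example.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """Time (in ns) = #Instructions * CPI * Cycle_time (in ns)."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical workload: 10^9 instructions with an average CPI of 2,
# run on a machine with a 10ns cycle time (i.e., 100MHz).
time_ns = cpu_time_ns(1e9, 2, 10)
print(f"{time_ns / 1e9:.1f} seconds")
```

Halving any one of the three factors (fewer instructions, lower CPI, or shorter cycle time) halves the time, provided the other two stay fixed. In practice they rarely do: as noted above, complicated instructions reduce the instruction count but take more cycles or a longer cycle time.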

Homework: Carefully go through and understand the example on page 59

Homework: 2.1-2.5 2.7-2.10

Homework: Make sure you can easily do all the problems with a rating of [5] and can do all with a rating of [10]