======== START LECTURE #18 ========

Implementing a J-type instruction, unconditional jump

    opcode  addr
    31-26   25-0

Addr is word address; bottom 2 bits of PC are always 0

Top 4 bits of PC stay as they were (AFTER incr by 4)

Easy to add.

Smells like a good final exam type question.

What's Wrong

Some instructions are likely slower than others and we must set the clock cycle time long enough for the slowest. The disparity between the cycle times needed for different instructions is quite significant when one considers implementing more difficult instructions, like divide and floating point ops. Actually, if we considered cache misses, which result in references to external DRAM, the cycle time ratios can approach 100.

Possible solutions

Variable length cycle. How do we do it?
Asynchronous logic
- ``Self-timed'' logic.
- No clock. Instead each signal (or group of signals) is coupled with another signal that changes only when the first signal (or group) is stable.
- Hard to debug.
Multicycle instructions.
- More complicated instructions have more cycles.
- Since only one instruction is executed at a time, can reuse a single ALU and other resourses during different cycles.
- It is in the book right at this point but we are not covering it.

Even Faster (we are not covering this).

Pipeline the cycles.
- Since at one time we will have several instructions active, each at a different cycle, the resources can't be reused (e.g., more than one instruction might need to do a register read/write at one time).
- Pipelining is more complicated than the single cycle implementation we did.
- This was the basic RISC technology on the 1980s.
- A pipelined implementation of the MIPS CPU is covered in chapter 6.
Multiple datapaths (superscalar).
- Issue several instructions each cycle and the hardware figures out dependencies and only executes instructions when the dependencies are satisfied.
- Much more logic required, but conceptually not too difficult providing the system executes instructions in order.
- Pretty hairy if out of order (OOO) exectuion is permitted.
- Current high end processors are all OOO superscalar (and are indeed pretty hairy).
VLIW (Very Long Instruction Word)
- User (i.e., the compiler) packs several instructions into one ``superinstruction'' called a very long instruction.
- User guarentees that there are no dependencies within a superinstruction.
- Hardware still needs multiple datapaths (indeed the datapaths are not so different from superscalar).
- The hairy control for superscalar (especially OOO superscalar) is not needed since the dependency checking is done by the compiler, not the hardware.
- Was proposed and tried in 80s, but was dominated by superscalar.
- A comeback (?) with Intel's EPIC (Explicitly Parallel Instruction Computer) architecture.
  - Called IA-64 (Intel Architecture 64-bits); the first implementation was called Merced and now has a funny name (Itanium). It has recently become available.
  - It has other features as well (e.g. predication).
  - The x86, Pentium, etc are called IA-32.

Chapter 2 Performance analysis

Homework: Read Chapter 2

2.1: Introductions

Throughput measures the number of jobs per day that can be accomplished. Response time measures how long an individual job takes.

A faster machine improves both metrics (increases throughput and decreases response time).
Normally anything that improves response time improves throughput.
But the reverse isn't true. For example, adding a processor likely to increase throughput more than it decreases response time.
We will be concerned primarily with response time.

We define Performance as 1 / Execution time.

So machine X is n times faster than Y means that

The performance of X = n * the performance of Y.
The execution time of X = (1/n) * the execution time of Y.

2.2: Measuring Performance

How should we measure execution time?

CPU time.
- This includes the time waiting for memory.
- It does not include the time waiting for I/O as this process is not running and hence using no CPU time.
Elapsed time on an otherwise empty system.
Elapsed time on a ``normally loaded'' system.
Elapsed time on a ``heavily loaded'' system.

We use CPU time, but this does not mean the other metrics are worse.

Cycle time vs. Clock rate.

Recall that cycle time is the length of a cycle.
It is a unit of time.
For modern computers it is expressed in nanoseconds, abbreviated ns.
One nanosecond is one billionth of a second = 10^(-9) seconds.
Electricity travels about 1 foot in 1ns (in normal media).
The clock rate tells how many cycles fit into a given time unit (normally in one second).
So the natural unit for clock rate is cycles per second. This used to be abbreviated CPS.
However, the world has changed and the new name for the same thing is Hertz, abbreviated Hz. One Hertz is one cycle per second.
For modern computers the rate is expressed in megahertz, abbreviated MHz.
One megahertz is one million hertz = 10^6 hertz.
A few machines have a clock rate exceeding a gigahertz (GHz). Next year many new machines will pass the gigahertz mark; possibly some will exceed 2GHz.
One gigahertz is one billion hertz = 10^9 hertz.
What is the cycle time for a 700MHz computer?
- 700 million cycles = 1 second
- 7*10^8 cycles = 1 second
- 1 cycle = 1/(7*10^8) seconds = 10/7 * 10^(-9) seconds ~= 1.4ns
What is the clock rate for a machine with a 10ns cycle time?
- 1 cycle = 10ns = 10^(-8) seconds.
- 10^8 cycles = 1 second.
- Rate is 10^8 Hertz = 100 * 10^6 Hz = 100MHz = 0.1GHz