Computer Architecture
Start Lecture #18
Chapter 4 Performance analysis
Homework:
Read Chapter 4.
4.1: Introduction
Defining Performance
Throughput measures the number of jobs
per day/second/etc. that can be accomplished.
Response time measures how long an individual job
takes.
- A faster machine improves both metrics (increases throughput and
decreases response time).
- Normally anything that improves (i.e., decreases) response
time improves (i.e., increases) throughput.
- But the reverse isn't true.
For example, adding a processor is likely to increase throughput
more than it decreases response time.
- We will be concerned primarily with response time.
We define Performance as 1 / Execution time.
Relative Performance
We say that machine X is n times faster than machine Y or
machine X has n times the performance of machine Y if
the execution time of a given program on X = (1/n) * the
execution time of the same program on Y.
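Here is a small Python sketch of these definitions; the 10-second and 15-second execution times are made-up numbers, purely for illustration:

    performance_of_X = 1 / 10.0    # hypothetical: the program takes 10 seconds on X
    performance_of_Y = 1 / 15.0    # hypothetical: the same program takes 15 seconds on Y

    n = performance_of_X / performance_of_Y   # equals 15/10 = 1.5
    # So X is 1.5 times faster than Y: the program's execution time on X
    # is (1/1.5) times its execution time on Y.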
But what program should be used for the comparison?
Various suites have been proposed, some emphasizing CPU integer
performance, others floating-point performance, and still others
I/O performance.
Measuring Performance
How should we measure execution time?
- CPU time.
- This includes the time waiting for memory.
- It does not include the time waiting for I/O
as this process is not running and hence using no CPU time.
- Should we include system time, i.e., time when the
  CPU is executing the operating system on behalf
  of the user program?
- Elapsed time on an otherwise empty system.
- Elapsed time on a normally loaded system.
- Elapsed time on a heavily loaded system.
We mostly employ user-mode
CPU time, but this
does not mean the other metrics are worse.
Cycle time vs. Clock rate.
- Recall that cycle time is the length of a cycle.
- It is a unit of time.
- For modern (non-embedded) computers it is expressed
in nanoseconds, abbreviated ns,
or picoseconds, abbreviated ps.
- One nanosecond is one billionth of a second = 10^-9 seconds.
- One picosecond is one trillionth of a second = 10^-12 seconds.
- Other units of time are the microsecond, abbreviated us, which
  equals 10^-6 seconds, and the millisecond, abbreviated ms, which
  equals 10^-3 seconds.
- Embedded CPUs often have their cycle times expressed in
microseconds; the time required for a single I/O (disk access)
is normally expressed in milliseconds.
- Electricity travels about 1 foot in 1ns (in normal media).
- The clock rate tells how many cycles fit into
a given time unit (normally in one second).
- So the natural unit for clock rate is cycles per second.
This used to be the standard unit and was abbreviated CPS.
- However, the world has changed and the new name for the same
thing is Hertz, abbreviated Hz.
One Hertz is one cycle per second.
- For modern (non-embedded) CPUs the rate is normally expressed
in gigahertz, abbreviated GHz, which equals one billion
hertz = 10^9 hertz.
- For older or embedded processors the rate is normally
expressed in megahertz, abbreviated MHz, which equals
one million hertz.
What is the cycle time for a 700MHz computer?
- 700 million cycles = 1 second
- 7*10^8 cycles = 1 second
- 1 cycle = 1/(7*10^8) seconds = (10/7) * 10^-9 seconds ~= 1.4ns
What is the clock rate for a machine with a 10ns cycle time?
- 1 cycle = 10ns = 10^-8 seconds.
- 10^8 cycles = 1 second.
- Rate is 10^8 Hertz = 100 * 10^6 Hz = 100MHz = 0.1GHz.
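The two conversions above are just reciprocals of each other; here is a
small Python sketch using the same numbers as the worked examples:

    def cycle_time_in_ns(hz):
        return 1e9 / hz          # 1/rate seconds, converted to nanoseconds

    def clock_rate_in_hz(ns):
        return 1e9 / ns          # cycles per second

    print(cycle_time_in_ns(700e6))   # 700 MHz -> about 1.43 ns per cycle
    print(clock_rate_in_hz(10))      # 10 ns   -> 1e8 Hz = 100 MHz = 0.1 GHz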
4.2: CPU Performance and its Factors
The execution time for a given job on a given computer is
(CPU) execution time = (#CPU clock cycles required) * (cycle time)
= (#CPU clock cycles required) / (clock rate)
The number of CPU clock cycles required equals the number of
instructions executed times the average number of cycles in each
instruction.
- In our single cycle implementation, the number of cycles required
is just the number of instructions executed.
- If every instruction took 5 cycles, the number of cycles required
would be five times the number of instructions executed.
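A tiny Python sketch of that relationship; the instruction count is an
assumed number, purely for illustration:

    instructions_executed = 1_000_000        # assumed count for some program

    # single-cycle implementation: every instruction takes exactly 1 cycle
    cycles_single_cycle = instructions_executed * 1

    # hypothetical implementation in which every instruction takes 5 cycles
    cycles_five_cycles = instructions_executed * 5

    print(cycles_single_cycle, cycles_five_cycles)   # 1000000 5000000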
But real systems are more complicated than that!
- Some instructions take more cycles than others.
- With pipelining, several instructions are in progress at different
stages of their execution.
- With superscalar (or VLIW) designs, many instructions are issued at once.
- Since modern superscalars (and VLIWs) are also pipelined, we have
  many, many instructions executing at once.
Through a great many measurements, one calculates for a given machine
the average CPI (cycles per instruction).
The number of instructions required for a given program depends on
the instruction set.
For example, one x86 instruction often accomplishes more than one
MIPS instruction.
CPI is a good way to compare two implementations of the same
instruction set (i.e., the same instruction set architecture
or ISA).
If the clock cycle time is unchanged, then
the performance of a given ISA is inversely proportional to the CPI
(e.g., halving the CPI doubles the performance).
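For example, here is a Python sketch comparing two hypothetical
implementations of the same ISA; the instruction count, cycle time, and
CPI values are assumed, not from the text:

    instructions = 1_000_000_000   # same program, same ISA, so same instruction count
    cycle_time = 1e-9              # assume both implementations run at 1 GHz (1 ns cycle)

    time_A = instructions * 2.0 * cycle_time   # implementation A: CPI = 2.0 -> 2.0 s
    time_B = instructions * 1.0 * cycle_time   # implementation B: CPI = 1.0 -> 1.0 s

    print(time_A / time_B)   # 2.0: halving the CPI doubled the performance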
Complicated instructions take longer; they require either more cycles or
a longer cycle time.
Older machines with complicated instructions (e.g., the VAX in the 1980s) had CPI >> 1.
With pipelining we can have many cycles for each instruction but still
achieve a CPI of nearly 1.
Modern superscalar machines often have a CPI less than one.
Sometimes one speaks of the IPC, or instructions per cycle, for
such machines.
- These machines issue (i.e., initiate) many instructions each cycle.
- They are pipelined so the instructions don't finish for several cycles.
- If we consider a 4-issue superscalar and assume that all
instructions require 5 (pipelined) cycles, there are up to
20=5*4 instructions in progress (often called in flight) at one
time.
Putting this together, we see that
Time (in seconds) = #Instructions * CPI * Cycle_time (in seconds).
Time (in ns) = #Instructions * CPI * Cycle_time (in ns).
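As a small worked illustration of this formula in Python (all numbers
below are assumed, purely for illustration):

    def cpu_time_in_seconds(instruction_count, cpi, cycle_time_in_seconds):
        # Time = #Instructions * CPI * Cycle_time
        return instruction_count * cpi * cycle_time_in_seconds

    # hypothetical program: 10 billion instructions, average CPI 1.5,
    # 0.5 ns cycle time (i.e., a 2 GHz clock)
    print(cpu_time_in_seconds(10e9, 1.5, 0.5e-9))   # 7.5 seconds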
Do on the board the example on page 247.