Computer Architecture
Start Lecture #18
Chapter 4 Performance analysis
Homework:
Read Chapter 4.
4.1: Introduction
Defining Performance
Throughput measures the number of jobs
per day/second/etc. that can be accomplished.
Response time measures how long an individual job
takes.
- A faster machine improves both metrics (increases throughput and
decreases response time).
- Normally anything that improves (i.e., decreases) response
time improves (i.e., increases) throughput.
- But the reverse isn't true.
For example, adding a processor is likely to increase throughput
more than it decreases response time.
- We will be concerned primarily with response time.
We define Performance as 1 / Execution time.
Relative Performance
We say that machine X is n times faster than machine Y or
machine X has n times the performance of machine Y if
the execution time of a given program on X = (1/n) * the
execution time of the same program on Y.
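Here is a small Python sketch of these definitions; the 10-second and 15-second execution times are made-up numbers, purely for illustration:

    performance_of_X = 1 / 10.0    # hypothetical: the program takes 10 seconds on X
    performance_of_Y = 1 / 15.0    # hypothetical: the same program takes 15 seconds on Y

    n = performance_of_X / performance_of_Y   # equals 15/10 = 1.5
    # So X is 1.5 times faster than Y: the program's execution time on X
    # is (1/1.5) times its execution time on Y.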
But what program should be used for the comparison?
Various suites have been proposed, some emphasizing CPU integer
performance, others floating-point performance, and still others
I/O performance.
Measuring Performance
How should we measure execution time?
- CPU time.
- This includes the time waiting for memory.
- It does not include the time waiting for I/O
as this process is not running and hence using no CPU time.
- Should we include system time, i.e., time when the
  CPU is executing the operating system on behalf
  of the user program?
- Elapsed time on an otherwise empty system.
- Elapsed time on a normally loaded system.
- Elapsed time on a heavily loaded system.
We mostly employ user-mode
CPU time, but this
does not mean the other metrics are worse.
Cycle time vs. Clock rate.
- Recall that cycle time is the length of a cycle.
- It is a unit of time.
- For modern (non-embedded) computers it is expressed
in nanoseconds, abbreviated ns,
or picoseconds, abbreviated ps.
- One nanosecond is one billionth of a second = 10^-9 seconds.
- One picosecond is one trillionth of a second = 10^-12 seconds.
- Other units of time are the microsecond, abbreviated us, which
  equals 10^-6 seconds, and the millisecond, abbreviated ms, which
  equals 10^-3 seconds.
- Embedded CPUs often have their cycle times expressed in
microseconds; the time required for a single I/O (disk access)
is normally expressed in milliseconds.
- Electricity travels about 1 foot in 1ns (in normal media).
- The clock rate tells how many cycles fit into
a given time unit (normally in one second).
- So the natural unit for clock rate is cycles per second.
This used to be the standard unit and was abbreviated CPS.
- However, the world has changed and the new name for the same
thing is Hertz, abbreviated Hz.
One Hertz is one cycle per second.
- For modern (non-embedded) CPUs the rate is normally expressed
in gigahertz, abbreviated GHz, which equals one billion
hertz = 10^9 hertz.
- For older or embedded processors the rate is normally
expressed in megahertz, abbreviated MHz, which equals
one million hertz.
What is the cycle time for a 700MHz computer?
- 700 million cycles = 1 second
- 7*10^8 cycles = 1 second
- 1 cycle = 1/(7*10^8) seconds = (10/7) * 10^-9 seconds ~= 1.4ns
What is the clock rate for a machine with a 10ns cycle time?
- 1 cycle = 10ns = 10^-8 seconds.
- 10^8 cycles = 1 second.
- Rate is 10^8 Hertz = 100 * 10^6 Hz = 100MHz = 0.1GHz.
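The two conversions above are just reciprocals of each other; here is a
small Python sketch using the same numbers as the worked examples:

    def cycle_time_in_ns(hz):
        return 1e9 / hz          # 1/rate seconds, converted to nanoseconds

    def clock_rate_in_hz(ns):
        return 1e9 / ns          # cycles per second

    print(cycle_time_in_ns(700e6))   # 700 MHz -> about 1.43 ns per cycle
    print(clock_rate_in_hz(10))      # 10 ns   -> 1e8 Hz = 100 MHz = 0.1 GHz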
4.2: CPU Performance and its Factors
The execution time for a given job on a given computer is
(CPU) execution time = (#CPU clock cycles required) * (cycle time)
= (#CPU clock cycles required) / (clock rate)
The number of CPU clock cycles required equals the number of
instructions executed times the average number of cycles in each
instruction.
- In our single cycle implementation, the number of cycles required
is just the number of instructions executed.
- If every instruction took 5 cycles, the number of cycles required
would be five times the number of instructions executed.
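A tiny Python sketch of that relationship; the instruction count is an
assumed number, purely for illustration:

    instructions_executed = 1_000_000        # assumed count for some program

    # single-cycle implementation: every instruction takes exactly 1 cycle
    cycles_single_cycle = instructions_executed * 1

    # hypothetical implementation in which every instruction takes 5 cycles
    cycles_five_cycles = instructions_executed * 5

    print(cycles_single_cycle, cycles_five_cycles)   # 1000000 5000000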
But real systems are more complicated than that!
- Some instructions take more cycles than others.
- With pipelining, several instructions are in progress at different
stages of their execution.
- With superscalar (or VLIW) designs, many instructions are issued at once.
- Since modern superscalars (and VLIWs) are also pipelined, we have
  many, many instructions executing at once.
Through a great many measurements, one calculates for a given machine
the average CPI (cycles per instruction).
The number of instructions required for a given program depends on
the instruction set.
For example, one x86 instruction often accomplishes more than one
MIPS instruction.
CPI is a good way to compare two implementations of the same
instruction set (i.e., the same instruction set architecture
or ISA).
If the clock cycle time is unchanged, then
the performance of a given ISA is inversely proportional to the CPI
(e.g., halving the CPI doubles the performance).
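For example, here is a Python sketch comparing two hypothetical
implementations of the same ISA; the instruction count, cycle time, and
CPI values are assumed, not from the text:

    instructions = 1_000_000_000   # same program, same ISA, so same instruction count
    cycle_time = 1e-9              # assume both implementations run at 1 GHz (1 ns cycle)

    time_A = instructions * 2.0 * cycle_time   # implementation A: CPI = 2.0 -> 2.0 s
    time_B = instructions * 1.0 * cycle_time   # implementation B: CPI = 1.0 -> 1.0 s

    print(time_A / time_B)   # 2.0: halving the CPI doubled the performance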
Complicated instructions take longer; they require either more cycles or
a longer cycle time.
Older machines with complicated instructions (e.g., the VAX in the 1980s) had CPI >> 1.
With pipelining we can have many cycles for each instruction but still
achieve a CPI of nearly 1.
Modern superscalar machines often have a CPI less than one.
Sometimes one speaks of the IPC, or instructions per cycle, for
such machines.
- These machines issue (i.e., initiate) many instructions each cycle.
- They are pipelined so the instructions don't finish for several cycles.
- If we consider a 4-issue superscalar and assume that all
instructions require 5 (pipelined) cycles, there are up to
20=5*4 instructions in progress (often called in flight) at one
time.
Putting this together, we see that
Time (in seconds) = #Instructions * CPI * Cycle_time (in seconds).
Time (in ns) = #Instructions * CPI * Cycle_time (in ns).
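As a small worked illustration of this formula in Python (all numbers
below are assumed, purely for illustration):

    def cpu_time_in_seconds(instruction_count, cpi, cycle_time_in_seconds):
        # Time = #Instructions * CPI * Cycle_time
        return instruction_count * cpi * cycle_time_in_seconds

    # hypothetical program: 10 billion instructions, average CPI 1.5,
    # 0.5 ns cycle time (i.e., a 2 GHz clock)
    print(cpu_time_in_seconds(10e9, 1.5, 0.5e-9))   # 7.5 seconds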
Do on the board the example on page 247.