### Lecture 14: Measuring Performance

(review mid-term)

#### How fast is our single-cycle MIPS CPU?

For a synchronous machine, the clock period must be greater than the maximum propagation delay of the combinational circuit (which computes the next state).  So the clock period for our single-cycle MIPS must be greater than the time required for the longest instruction.

How long is that?  Assume the memories and the ALU each have a 200 ps delay, and the register file (for a read or a write) a 100 ps delay.  (Text, pp. 332-333)  Add up the delays for each type of instruction and determine the maximum delay (the slowest instruction type).
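The tally above can be sketched in a few lines of Python. The stage delays are the ones assumed in the text; the stage-per-instruction breakdown follows the single-cycle datapath discussion on pp. 332-333 (which stages each class actually exercises):

```python
# Stage delays from the text's assumptions, in picoseconds:
# memories and ALU 200 ps, register file (read or write) 100 ps.
DELAY = {"inst_mem": 200, "reg_read": 100, "alu": 200, "data_mem": 200, "reg_write": 100}

# Stages used by each instruction class in the single-cycle datapath.
STAGES = {
    "R-type": ["inst_mem", "reg_read", "alu", "reg_write"],
    "load":   ["inst_mem", "reg_read", "alu", "data_mem", "reg_write"],
    "store":  ["inst_mem", "reg_read", "alu", "data_mem"],
    "branch": ["inst_mem", "reg_read", "alu"],
    "jump":   ["inst_mem"],
}

delays = {cls: sum(DELAY[s] for s in stages) for cls, stages in STAGES.items()}
for cls, d in delays.items():
    print(f"{cls:7s} {d} ps")

slowest = max(delays, key=delays.get)
print(f"minimum clock period: {delays[slowest]} ps (set by {slowest})")
```

The load instruction uses every stage, so it sets the clock period: 800 ps for the whole machine, even though a jump needs only 200 ps.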

Can we do better?  To answer that, we must consider

### Measuring Performance

Text: Section 1.4

What is performance? Computer performance is a measure of how long it takes to perform a task, or how many tasks can be performed in a given time period. The performance that matters to us is how long it takes to perform our tasks. However, unless we can afford to benchmark our task on each machine we are considering, we have to rely on more generic measures of computer performance.

For the moment, we shall just discuss CPU performance and ignore IO. The basic equation is:

time to run program = (number of instructions executed) * (average CPI) * (clock cycle time)

where CPI = number of clock cycles per instruction. For a given program, the number of instructions executed depends on the compiler used and on the architecture (instruction set). The average CPI depends on the implementation of the architecture.
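The basic equation is easy to exercise numerically. A minimal sketch, with illustrative numbers (a billion-instruction program, CPI of 1 as in a single-cycle machine, and an 800 ps clock period; all three are hypothetical):

```python
def cpu_time(instruction_count, avg_cpi, cycle_time_s):
    """time = instructions * (cycles / instruction) * (seconds / cycle)"""
    return instruction_count * avg_cpi * cycle_time_s

# Hypothetical program and machine:
t = cpu_time(1_000_000_000, 1.0, 800e-12)
print(f"{t:.2f} s")  # prints "0.80 s"
```

Note that the units cancel down to seconds, which is a quick sanity check on the formula itself.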

#### Some Popular Metrics

• MHz (clock speed). Only a useful measure when comparing the same implementation of the same architecture.
• MIPS (millions of instructions per second). Only meaningful when comparing machines with the same architecture, since some architectures may require substantially more instructions than others for the same program. It can also be very dependent on the mix of instructions, and hence on the program used to measure MIPS. (This is particularly true for machines with specialized instructions for graphics or video, such as the MMX or SSE instructions.)  Some manufacturers report "peak MIPS" on carefully designed but useless programs.
• MFLOPS (millions of floating point operations per second). Primarily of interest for scientific applications. Again, deceptive "peak MFLOPS" are sometimes reported.  Many machines are now rated by their performance on Linpack, a collection of linear algebra routines; see Top500.
• benchmarks: SPECint, SPECfp, ... .
  • These are maintained by the Standard Performance Evaluation Corporation.
  • SPEC started with CPU metrics but now provides a wide range of metrics, including graphics tasks, Java support, and Web servers.
  • CPU metrics are based on the execution time for standard collections of programs. More realistic than other metrics, but one must make sure that the benchmarks reasonably reflect the actual application.
  • SPECint for integer applications, SPECfp for floating point applications.  The score is the elapsed time on a standard reference machine divided by the elapsed time on the machine being evaluated.
  • Elapsed time is not the best measure for multiple-CPU or multiple-core machines, so for such machines a SPECrate is quoted, which effectively measures the number of times the benchmark can be executed in a given period.
• Several metrics have been developed specifically for Intel PC architectures (Intel's iCOMP, Ziff-Davis' CPUmark).  Some labs report a variety of benchmarks to better match the range of possible applications.
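The SPEC scoring rule (reference time divided by measured time) can be sketched as below. SPEC combines the per-benchmark ratios with a geometric mean; the benchmark times here are hypothetical, chosen only to illustrate the arithmetic:

```python
from math import prod

def spec_ratio(ref_time_s, measured_time_s):
    # Elapsed time on the reference machine divided by elapsed time
    # on the machine under evaluation; higher is better.
    return ref_time_s / measured_time_s

def spec_score(ratios):
    # SPEC combines per-benchmark ratios with a geometric mean,
    # so no single benchmark dominates the composite score.
    return prod(ratios) ** (1 / len(ratios))

# Hypothetical (reference, measured) times in seconds for three benchmarks:
times = [(9000, 3000), (8000, 4000), (6000, 1500)]
ratios = [spec_ratio(r, m) for r, m in times]
print(ratios)               # prints [3.0, 2.0, 4.0]
print(spec_score(ratios))   # cube root of 24, about 2.88
```

Note how the geometric mean differs from the arithmetic mean (3.0 here): speeding up one benchmark by some factor always changes the composite score by the same proportion.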

#### How Architecture Affects Performance

Our goal is to minimize the product of the three factors (number of instructions executed, average CPI, clock cycle time). Whenever we consider a change to the architecture, we must evaluate its effect on each of these factors.

In particular, when we add an instruction to the instruction set, we must consider whether it can significantly reduce the number of instructions to execute (the first factor) without affecting the time per instruction (the last two factors). A specialized instruction may be used only rarely by a compiler (most of the execution time is spent on a small number of instructions;  see Figure 3.26 for the distribution of instructions for the SPEC benchmarks).  On the other hand, if the instruction requires a longer data path, it may require a longer clock cycle. The net effect would be a slower machine. [We ignore the issue of code size, which is much less important than it used to be because memory is so much cheaper.]
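This trade-off is easy to quantify with the basic performance equation. A sketch with purely illustrative numbers: suppose a new specialized instruction cuts the dynamic instruction count by 5%, but lengthening the datapath stretches the clock period from 800 ps to 880 ps:

```python
def run_time(instructions, cpi, cycle_time_ps):
    # time = instructions * (cycles / instruction) * (ps / cycle)
    return instructions * cpi * cycle_time_ps

# Hypothetical numbers: 5% fewer instructions, but a 10% longer clock period.
base = run_time(1_000_000, 1.0, 800)
new  = run_time(  950_000, 1.0, 880)
print(new / base)  # prints 1.045: the "improved" machine is 4.5% slower
```

Here the 10% clock penalty applies to every instruction, while the 5% saving applies only to the instruction count, so the net effect is a slower machine.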

Good candidates for instructions are those which would be used frequently and would take much longer if performed by a sequence of other instructions ... for example, floating point operations for scientific applications.

The increased use of RISC machines, starting in the mid-1980s, reflected a more careful assessment of the benefits and costs of adding instructions to the instruction set.

However, the issue of binary code compatibility remains very important, if not overwhelming.  The development of entirely new machine architectures has declined since the 1990s, with the Intel PC architecture and its variants increasingly dominant.  As we shall discuss later, current microprocessors achieve both speed and code compatibility by internally translating Intel instructions into a RISC-like instruction set.