Lecture 9: Performance Improvements

Text: Section 5.4, Chapter 6

MIPS Implementations: multiple clock cycles / instruction (cont'd)

Performance Analysis

With the modified (multicycle) design, all instructions begin with two cycles: an instruction fetch cycle, and an instruction decode / register fetch cycle. This is followed by:

• for loads, 3 cycles: address computation, memory access, register write (total, 5 cycles)
• for stores, 2 cycles: address computation and memory write (total, 4 cycles)
• for R-type instructions, 2 cycles: ALU computation and register write (total, 4 cycles)
• for branches and jumps, 1 cycle: write new PC (total, 3 cycles)

How do we compute the net effect on performance? We need to compute the average time per instruction: cycle time * average CPI.

Average CPI depends in turn on the relative frequency of instructions: in computing the average, we need to weight the CPI of each instruction by its relative frequency. These relative frequencies are determined by simulating a variety of programs and counting the frequency of each instruction. Some examples are shown in P&H, figure 4.54, page 311. (Note how a few instructions dominate the instruction mix.)

Given the instruction mix for gcc (23% loads, 13% stores, 19% branches, 2% jumps, 43% R-type), the average CPI is 4.02 (P&H, p. 397).
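The weighted average above can be checked directly. A minimal sketch, using the per-instruction cycle counts and the gcc instruction mix given in the text:

```python
# Cycle counts per instruction class (from the multicycle design above)
cycles = {"load": 5, "store": 4, "r-type": 4, "branch": 3, "jump": 3}

# gcc instruction mix (from the text)
mix = {"load": 0.23, "store": 0.13, "r-type": 0.43, "branch": 0.19, "jump": 0.02}

# Weight each instruction's CPI by its relative frequency
avg_cpi = sum(mix[i] * cycles[i] for i in mix)
print(f"average CPI = {avg_cpi:.2f}")  # → average CPI = 4.02
```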

Next, we need to compute the effect on cycle time. Using P&H's assumptions (2 ns for memory read/write, 2 ns for ALU, 1 ns for register read/write), the single-cycle machine needs 8 ns per cycle (p. 374), while the multicycle machine needs only 2 ns. But with an average CPI of 4.02, the multicycle machine's average time per instruction is 2 ns * 4.02 = 8.04 ns, so cycle time * average CPI is slightly higher (worse) than the single-cycle machine's 8 ns. The multicycle system would look better if memory operations took much longer (relative to register and ALU times), or if the instruction set included more complex instructions such as a multiply (and, in general, if there was greater variation in the length of instructions).
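The comparison works out as follows (a sketch using the text's numbers: an 8 ns single-cycle clock, a 2 ns multicycle clock, and the average CPI of 4.02 for gcc):

```python
# Average time per instruction = cycle time * average CPI
single_cycle = 8.0 * 1.0   # ns: one long cycle does everything, CPI = 1
multi_cycle = 2.0 * 4.02   # ns: short cycles, but 4.02 of them on average

print(f"single-cycle: {single_cycle} ns, multicycle: {multi_cycle} ns")
# → single-cycle: 8.0 ns, multicycle: 8.04 ns  (multicycle slightly worse)
```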

Implementation

The control unit must be more complex to handle sequential execution: we create a finite-state machine in which the transition between steps is determined by the opcode. This control unit can be optimized down to individual gates, just as the combinational control unit (for the single-cycle design) was.
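The state sequencing can be sketched as a table keyed by opcode. (A toy illustration only: the state names below are invented for readability and are not the actual MIPS control states or signals.)

```python
def trace(opcode):
    """Return the sequence of control states for one instruction."""
    common = ["fetch", "decode"]          # every instruction starts here
    rest = {
        "lw":     ["mem_addr", "mem_read", "reg_write"],  # 5 cycles total
        "sw":     ["mem_addr", "mem_write"],              # 4 cycles
        "r-type": ["execute", "alu_write"],               # 4 cycles
        "beq":    ["branch_complete"],                    # 3 cycles
        "j":      ["jump_complete"],                      # 3 cycles
    }
    return common + rest[opcode]          # opcode selects the path

print(trace("lw"))
# → ['fetch', 'decode', 'mem_addr', 'mem_read', 'reg_write']
```

The length of each trace matches the cycle counts in the performance analysis above.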

Alternatively, we can employ a microprogrammed design, in which the tables for the control unit (the state transition table and the output table) are stored directly in a microprogram memory. This provides a more uniform structure and a design which is easier to change. At one time, microprograms were widely used, particularly for CISC machines (with large instruction sets), but they are used less today.

Pipelining

Much greater speed-ups are possible by overlapping the execution of successive instructions.

The simplest such overlap is instruction fetch overlap: fetch the next instruction while executing the current instruction. Even relatively simple processors employed such overlap.

Greater gain can be achieved by overlapping the execution (register fetch, ALU operation, ...) of successive instructions. A full pipelining scheme overlaps such operations completely, resulting ideally in a CPI (cycles per instruction) of 1. However, machines which employ such overlap must deal with data and branch hazards: instructions which influence later instructions in the pipeline. This makes the design of pipelined machines much more complex.
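The ideal speedup can be made concrete with a simple count (a sketch, ignoring hazards and stalls): without pipelining, N instructions on a k-stage machine take N * k cycles; with full overlap they take k + (N - 1) cycles, so the speedup approaches k for large N.

```python
def ideal_speedup(n_instructions, k_stages):
    """Ideal pipeline speedup, assuming no hazards or stalls."""
    unpipelined = n_instructions * k_stages       # one instruction at a time
    pipelined = k_stages + (n_instructions - 1)   # fill pipe, then 1/cycle
    return unpipelined / pipelined

print(round(ideal_speedup(1000, 5), 2))  # → 4.98, approaching k = 5
```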

Pipeline hazards:  overlapping instruction execution can give rise to problems

• structural hazards:  two instructions want to use the same module (e.g., the ALU) in the same clock cycle.  This problem is reduced in machines with a very uniform instruction set, such as MIPS.  Now that logic is cheap, we may duplicate some components to avoid structural hazards.
• data hazards:  one instruction uses the result of the previous instruction.  Detecting such hazards requires keeping a table recording, for each register, whether it is the result register for an instruction currently in the pipeline. The simplest solution if a data hazard is detected is to "stall" ... to hold up the current instruction until the prior one has finished.  A more efficient solution is data forwarding ... to send a result of one instruction directly to the ALU for the next instruction, as well as putting it in the register.
• branch hazards:  we must wait until a conditional branch completes before we know whether the following instructions should be executed.  Again, we could stall the instruction after the branch, but this is inefficient.  Alternatively, we can guess whether or not the branch is taken, start subsequent instructions based upon our guess, but wait to store their results until we know the outcome of the branch.  If our guess is correct, we continue;  if it is wrong, we invalidate the instructions we issued following the branch, and try again.  Modern CPUs use a branch prediction table which keeps track of recent branches and whether they were taken in order to "guess" more accurately.
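One common entry in such a branch prediction table is a 2-bit saturating counter, which tolerates a single surprise outcome before changing its prediction. A minimal sketch (the initial counter value is an arbitrary choice):

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 2  # start at "weakly taken" (arbitrary)

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        # Move toward 3 on a taken branch, toward 0 on a not-taken branch
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]  # actual branch outcomes
hits = sum(p.predict() == actual or p.update(actual) or False
           for actual in outcomes if (p.update(actual) or True))
```

A clearer usage loop:

```python
p = TwoBitPredictor()
hits = 0
for actual in [True, True, False, True]:
    hits += (p.predict() == actual)
    p.update(actual)
print(f"{hits}/4 correct")  # → 3/4 correct: one miss on the lone not-taken
```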

Some architectures, including MIPS, have delayed branches: the instruction after the branch is executed unconditionally, and then the branch decision is made. (p. 444) This produces some performance improvement with short pipelines, but the improvement is reduced as the pipeline gets longer.

Pipelining is more complex for CISC machines, because the instructions may take different lengths of time to execute. However, RISC-style pipelining is now incorporated into high-performance CISC processors (such as the Pentium) by translating most instructions into a series of  RISC-like operations.

Spring 2002