### Lecture 17: Performance Improvements: Multicycle Implementation

Text: Section 5.5

#### MIPS Implementations: multiple clock cycles / instruction

The first implementation we considered executed all instructions in a single cycle. This had two major disadvantages:
• functionality had to be replicated (most obvious problem: two memories; we also needed multiple ALUs and adders, but this is a less important problem as logic becomes cheaper)
• the execution time of every instruction = the execution time of the longest instruction (in particular, load/store instructions need more time than R-type instructions)
Suppose that a memory operation takes 200 ps, an ALU operation takes 100 ps, and a register operation takes 50 ps. Then R-type operations take 400 ps, beq 350 ps, sw 550 ps, and lw 600 ps (see P&H p. 316 for details). With a single-cycle system, however, every instruction takes one clock cycle, so the clock cycle must be at least 600 ps.
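The figures above can be reproduced with a short calculation. This is a minimal sketch: the component delays come from the text, while the breakdown of each instruction into a sequence of units (instruction fetch, register read, ALU, data memory, register write) is the usual one and is assumed here; the names `PATHS` and `latency_ps` are illustrative only.

```python
# Component delays in picoseconds, as given in the text.
MEM_PS, ALU_PS, REG_PS = 200, 100, 50

# Units on the critical path of each instruction class (assumed breakdown).
PATHS = {
    "R-type": [MEM_PS, REG_PS, ALU_PS, REG_PS],          # fetch, reg read, ALU, reg write
    "lw":     [MEM_PS, REG_PS, ALU_PS, MEM_PS, REG_PS],  # ... plus data memory read
    "sw":     [MEM_PS, REG_PS, ALU_PS, MEM_PS],          # data memory write, no reg write
    "beq":    [MEM_PS, REG_PS, ALU_PS],                  # compare only
}

# Latency of each class is the sum of the delays along its path.
latency_ps = {name: sum(path) for name, path in PATHS.items()}

# A single-cycle clock must accommodate the slowest instruction.
single_cycle_clock_ps = max(latency_ps.values())

for name, ps in latency_ps.items():
    print(f"{name:7s} {ps} ps")
print("single-cycle clock:", single_cycle_clock_ps, "ps")
```

The slowest class (lw) sets the clock, so an R-type instruction that needs only 400 ps still occupies the full 600 ps cycle.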

We can modify the design of the MIPS machine to use a faster clock (200 ps) and multiple clock cycles per instruction. In the design given in Section 5.5, instructions require up to 5 clock cycles:
1. instruction fetch (for all instructions)
2. instruction decode and register fetch (for all instructions)
3. ALU operation (for all instructions); for beq and jump, update the PC
4. for R-type instructions, register store; for lw/sw, data memory operation
5. for lw, register store
R-type instructions take 4 cycles; lw, 5 cycles; sw, 4 cycles; and beq (and jump), 3 cycles.
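The cycle counts above follow directly from which of the five steps each instruction class uses. A minimal sketch, assuming the 200 ps clock from the text; the step labels are illustrative names, not terminology from the textbook:

```python
CLOCK_PS = 200  # multicycle clock period from the text

# Steps used by each instruction class, following the numbered list above.
STEPS = {
    "R-type": ["fetch", "decode", "alu", "reg_write"],
    "lw":     ["fetch", "decode", "alu", "mem", "reg_write"],
    "sw":     ["fetch", "decode", "alu", "mem"],
    "beq":    ["fetch", "decode", "alu_and_pc_update"],
    "jump":   ["fetch", "decode", "alu_and_pc_update"],
}

# One clock cycle per step.
cycles = {name: len(steps) for name, steps in STEPS.items()}
time_ps = {name: n * CLOCK_PS for name, n in cycles.items()}

for name in STEPS:
    print(f"{name:7s} {cycles[name]} cycles = {time_ps[name]} ps")
```

Every step is padded to a full 200 ps cycle, even when the unit it uses needs only 50 or 100 ps. As a result each individual instruction takes longer than its single-cycle latency (lw needs 1000 ps here rather than 600 ps).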

This revised design enables a single memory to be used for instructions and data, but requires additional registers to hold the instruction and the data read from memory, as well as registers to hold the output of the register file and the ALU (Fig. 5.25, page 318). This design also uses a single ALU, combining the functions of the main ALU and the adder and ALU for the PC, resulting in more multiplexers (Fig. 5.26, page 320).

How do we compute the net effect on performance? The average time per instruction is cycle time × average CPI. Average CPI depends in turn on the relative frequency of instructions: in computing the average, we need to weight the CPI of each instruction by its relative frequency. These relative frequencies are determined by simulating a variety of programs and counting the frequency of each instruction. The frequencies for SPECint and SPECfp for MIPS are shown in P&H, figure 3.26, page 228.
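As a worked example of the weighted average, the sketch below combines the per-class cycle counts from the list above with an instruction mix. The mix is an illustrative assumption, not data quoted in these notes; substitute the measured frequencies from the figure cited above for a real estimate.

```python
CLOCK_PS = 200  # multicycle clock period from the text

# Cycles per instruction class in the multicycle design (from the text).
CYCLES = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3, "jump": 3}

# ASSUMED instruction mix in percent (illustrative only).
MIX_PERCENT = {"lw": 25, "sw": 10, "R-type": 52, "beq": 11, "jump": 2}
assert sum(MIX_PERCENT.values()) == 100

# Weight each class's CPI by its relative frequency.
average_cpi = sum(MIX_PERCENT[k] * CYCLES[k] for k in CYCLES) / 100

# Average time per instruction = cycle time * average CPI.
avg_time_ps = average_cpi * CLOCK_PS

print(f"average CPI = {average_cpi:.2f}")
print(f"average time per instruction = {avg_time_ps:.0f} ps")
```

With this assumed mix the average CPI is 4.12, which at 200 ps per cycle gives about 824 ps per instruction, compared with 600 ps for the single-cycle design.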

The net gains from this design are small or negative for this limited instruction set (p. 331). They would be greater if we used a faster clock (say, 100 ps) and allowed 2 clock cycles for memory operations, or if the instruction set included more complex instructions such as a multiply. However, the main significance of the multicycle design is as a step towards a pipelined design.
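The faster-clock variant can be estimated the same way. The sketch below assumes a 100 ps clock in which every memory access (instruction fetch as well as data access) takes 2 cycles and every other step takes 1 cycle. It also uses an illustrative instruction mix. Both the step costs and the mix are assumptions made for this example, not figures from the notes.

```python
CLOCK_PS = 100          # faster clock suggested in the text
SINGLE_CYCLE_PS = 600   # single-cycle clock period from the text

# ASSUMED cycles per step: memory accesses take 2 cycles, everything else 1.
STEP_CYCLES = {"fetch": 2, "decode": 1, "alu": 1, "mem": 2, "reg_write": 1}

STEPS = {
    "R-type": ["fetch", "decode", "alu", "reg_write"],
    "lw":     ["fetch", "decode", "alu", "mem", "reg_write"],
    "sw":     ["fetch", "decode", "alu", "mem"],
    "beq":    ["fetch", "decode", "alu"],
    "jump":   ["fetch", "decode", "alu"],
}

# ASSUMED instruction mix in percent (illustrative only).
MIX_PERCENT = {"lw": 25, "sw": 10, "R-type": 52, "beq": 11, "jump": 2}

# Total cycles per class, then the frequency-weighted average.
cycles = {name: sum(STEP_CYCLES[s] for s in steps) for name, steps in STEPS.items()}
average_cpi = sum(MIX_PERCENT[k] * cycles[k] for k in cycles) / 100
avg_time_ps = average_cpi * CLOCK_PS

print(cycles)
print(f"average CPI = {average_cpi:.2f}, average time = {avg_time_ps:.0f} ps")
print("faster than single-cycle:", avg_time_ps < SINGLE_CYCLE_PS)
```

Under these assumptions the CPI rises, but the shorter cycle wastes less time on the fast steps, so the average time per instruction falls below the 600 ps of the single-cycle design.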

The control unit must be more complex, to handle the sequential execution: we would create a finite-state machine in which the transition between steps is determined by the opcode. The state transition diagram for this machine is shown in Fig. 5.38, page 339. The machine is implemented using a state register and combinational logic that both determines the next state and sets the control lines (Fig. 5.37, p. 338). This control unit can be optimized down to individual gates, as was the combinational control unit for the single-cycle design. Alternatively, we can employ a microprogrammed design, in which the tables for the control unit (the state transition table and the output table) are stored directly in a microprogram memory. This provides a more uniform structure and a design which is easier to change.
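The idea of a table-driven finite-state machine can be sketched in software. The state names and the `NEXT` table below are illustrative labels chosen for this example; the figure cited above remains the authoritative diagram, and the control-line outputs of each state are omitted. Keeping the transitions in a lookup table, rather than in hand-written logic, loosely mirrors what a microprogrammed design does.

```python
from enum import Enum, auto

class State(Enum):
    FETCH = auto()
    DECODE = auto()
    MEM_ADDR = auto()         # address computation for lw/sw
    MEM_READ = auto()         # lw: read data memory
    MEM_WRITEBACK = auto()    # lw: write loaded value to register
    MEM_WRITE = auto()        # sw: write data memory
    EXECUTE = auto()          # R-type: ALU operation
    RTYPE_COMPLETE = auto()   # R-type: write result to register
    BRANCH_COMPLETE = auto()  # beq: compare and update PC
    JUMP_COMPLETE = auto()    # jump: update PC

# Transition table keyed by (state, opcode class); None means "any opcode".
# Any state not listed returns to FETCH for the next instruction.
NEXT = {
    (State.FETCH, None): State.DECODE,
    (State.DECODE, "lw"): State.MEM_ADDR,
    (State.DECODE, "sw"): State.MEM_ADDR,
    (State.DECODE, "R-type"): State.EXECUTE,
    (State.DECODE, "beq"): State.BRANCH_COMPLETE,
    (State.DECODE, "jump"): State.JUMP_COMPLETE,
    (State.MEM_ADDR, "lw"): State.MEM_READ,
    (State.MEM_ADDR, "sw"): State.MEM_WRITE,
    (State.MEM_READ, None): State.MEM_WRITEBACK,
    (State.EXECUTE, None): State.RTYPE_COMPLETE,
}

def next_state(state, opcode):
    """Look up the next state, preferring an opcode-specific entry."""
    if (state, opcode) in NEXT:
        return NEXT[(state, opcode)]
    return NEXT.get((state, None), State.FETCH)

def trace(opcode):
    """Return the sequence of states one instruction passes through."""
    states = [State.FETCH]
    while True:
        nxt = next_state(states[-1], opcode)
        if nxt is State.FETCH:   # instruction finished
            return states
        states.append(nxt)

for op in ("R-type", "lw", "sw", "beq", "jump"):
    print(op, [s.name for s in trace(op)])
```

The length of each trace matches the cycle counts given earlier: 4 states for R-type and sw, 5 for lw, and 3 for beq and jump.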

#### Pipelining

Much greater speed-ups are possible by overlapping the execution of successive instructions.

The simplest such overlap is instruction fetch overlap: fetch the next instruction while executing the current instruction. Even relatively simple processors employed such overlap.

Greater gains can be achieved by overlapping the execution (register fetch, ALU operation, ...) of successive instructions. A full pipelining scheme overlaps such operations completely, resulting ideally in a CPI (cycles per instruction) of 1. However, machines which employ such overlap must deal with data and branch hazards, which arise when an instruction influences later instructions already in the pipeline. This makes the design of pipelined machines much more complex.
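To see why the ideal CPI approaches 1, consider an idealized pipeline with no hazards or stalls: the first instruction needs one cycle per stage, and each later instruction completes one cycle after its predecessor. The sketch below assumes a 5-stage pipeline, matching the five steps of the multicycle design; the function names are illustrative.

```python
def pipelined_cycles(n_instructions, stages=5):
    """Cycles to run n instructions on an ideal pipeline (no hazards or stalls)."""
    if n_instructions <= 0:
        return 0
    # Fill the pipeline once, then one instruction completes per cycle.
    return stages + n_instructions - 1

def effective_cpi(n_instructions, stages=5):
    """Average cycles per instruction over the whole run."""
    return pipelined_cycles(n_instructions, stages) / n_instructions

for n in (1, 10, 1000):
    print(n, pipelined_cycles(n), round(effective_cpi(n), 3))
```

For a single instruction the CPI is still 5, but over 1000 instructions it falls to about 1.004. Real pipelines fall short of this figure because hazards force stalls.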