CSCI-UA.0436 - Prof. Grishman

Lecture 16:  Pipelining

Single-cycle vs. multi-cycle

We could change from a single-cycle design (with an 800 ps clock period) to a multi-cycle design (with a 100 ps clock) in which different instruction types take different numbers of cycles.  The machine would be slightly faster (clock cycle time would go down by a factor of 8, while average CPI would go up by somewhat less than a factor of 8), but the control unit would be considerably more complex.
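The trade-off can be checked with a little arithmetic.  The sketch below is illustrative only: the per-instruction cycle counts and the instruction mix are assumptions chosen for the example, not figures from these notes; only the 800 ps and 100 ps clock periods come from the text above.

```python
# Assumed cycle counts per instruction type in the multi-cycle design
# (100 ps clock).  These numbers are illustrative assumptions.
CYCLES = {"lw": 8, "sw": 7, "rtype": 6, "beq": 5}

# Assumed instruction mix (fractions sum to 1).  Also an assumption.
MIX = {"lw": 0.25, "sw": 0.10, "rtype": 0.50, "beq": 0.15}

SINGLE_CYCLE_PS = 800      # clock period of the single-cycle design
MULTI_CYCLE_CLOCK_PS = 100  # clock period of the multi-cycle design


def average_cpi(cycles, mix):
    """Weighted average of cycles per instruction over the mix."""
    return sum(cycles[kind] * mix[kind] for kind in mix)


cpi = average_cpi(CYCLES, MIX)
multi_cycle_ps = cpi * MULTI_CYCLE_CLOCK_PS   # average time per instruction
speedup = SINGLE_CYCLE_PS / multi_cycle_ps

print(f"average CPI = {cpi:.2f}")
print(f"average time per instruction = {multi_cycle_ps:.0f} ps")
print(f"speedup over single-cycle = {speedup:.2f}")
```

Under these assumed numbers the CPI rises to between 6 and 7 while the clock is 8 times faster, so the gain is modest.  A mix with a few very long instructions would change the picture considerably.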

The benefit of multi-cycle would be much greater if the instruction set included some instructions which took much longer (multiply, divide, floating point) but were less frequently executed than load, store, add/sub, branch.

For our small MIPS instruction subset, much greater speed-ups are possible by overlapping the execution of different instructions.

An Overview of Pipelining (text, section 4.5)

The simplest such overlap is instruction fetch overlap: fetch the next instruction while executing the current instruction. Even relatively simple processors employed such overlap.

Greater gain can be achieved by overlapping the execution (register fetch, ALU operation, ...) of successive instructions. A full pipelining scheme overlaps such operations completely, resulting ideally in a CPI (cycles per instruction) of 1. However, machines which employ such overlap must deal with data and branch hazards: instructions which influence later instructions in the pipeline. This makes the design of pipelined machines much more complex.

The benefits of pipelining increase when we have instructions which are relatively uniform in execution time and which can be finely divided into pipeline stages.  We can then have a relatively short clock cycle and issue one instruction each clock cycle (Fig. 4.27).  Under ideal conditions, instruction throughput is multiplied by the number of pipeline stages.
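The ideal throughput claim can be made concrete with a cycle count.  This is a minimal sketch assuming every stage takes exactly one clock cycle, there are no hazards, and the unpipelined machine uses the same clock; the function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction takes n_stages cycles to fill the pipeline;
    after that, one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def speedup(n_instructions, n_stages):
    return (unpipelined_cycles(n_instructions, n_stages)
            / pipelined_cycles(n_instructions, n_stages))


for n in (1, 10, 1000):
    print(n, "instructions, 5 stages: speedup =", round(speedup(n, 5), 2))
```

For a single instruction there is no gain at all; the speedup approaches the number of stages only as the instruction count grows large relative to the pipeline depth.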

Instruction sets for pipelining (text, p. 335)

RISC machines like MIPS are well suited for pipelining.  Instruction format is simple and execution relatively uniform.  Pipelining is more complex for CISC machines, because the instructions may take different lengths of time to execute.  However, RISC-style pipelining is now incorporated into high-performance CISC processors (such as the Pentium and Core 2) by translating most instructions into a series of RISC-like operations.

Pipelined Data Path (text, section 4.6)

The basic idea is to introduce a set of pipeline registers which hold all the information required to complete execution of the instruction.  This includes portions of the instruction, control signals which have been decoded from the instruction, computed effective addresses, and data.  Starting with the single-cycle machine, we can build a system with a 5-stage pipeline;  the basic design is shown in Figure 4.35.

Pipeline hazards:  overlapping instruction execution can give rise to problems

Pipeline Control

The pipeline registers have to hold all the information required to complete execution of the instruction.  This includes both data and control information.  Fig. 4.46 shows the processor with the control signals it requires;  Fig. 4.50 shows how these control signals are passed along from pipeline register to pipeline register.  Fig. 4.51 puts these together.

Handling Data Hazards (text, section 4.7)

Consider the code from p. 363 of the text:

sub   $2,$1,$3
and   $12,$2,$5
or    $13,$6,$2
add   $14,$2,$2
sw    $15,100($2)

Figure 4.52 shows the pipeline dependencies in this sequence, and Fig. 4.53 shows how some of them can be addressed by sending the output of the ALU from one instruction directly to a following instruction, bypassing the register file ("forwarding").  Fig. 4.54 shows the additions to the data path to do this.
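The dependencies in this sequence can be listed mechanically.  The sketch below is not the forwarding unit itself; it is a small hypothetical analysis script that, for each source register, finds the nearest earlier instruction writing it and reports the distance.  It assumes the 5-stage pipeline, a register file that writes in the first half of a cycle and reads in the second half, and producers whose result comes from the ALU (a load as producer is a separate case).

```python
import re

# Where the consumer gets the value, by distance from the producer.
SOURCE_BY_DISTANCE = {
    1: "forward from EX/MEM",
    2: "forward from MEM/WB",
    3: "register file (written and read in the same cycle)",
}

PROGRAM = [
    "sub   $2,$1,$3",
    "and   $12,$2,$5",
    "or    $13,$6,$2",
    "add   $14,$2,$2",
    "sw    $15,100($2)",
]


def parse(instr):
    """Return (opcode, destination, sources) for the instruction forms above."""
    op, operands = instr.split(None, 1)
    regs = re.findall(r"\$\d+", operands)
    if op == "sw":
        return op, None, regs        # sw reads both registers, writes none
    return op, regs[0], regs[1:]


def classify(program):
    """List (consumer index, register, distance) for each dependence
    on an instruction at most 3 positions earlier."""
    parsed = [parse(line) for line in program]
    found = []
    for i, (_, _, sources) in enumerate(parsed):
        for reg in dict.fromkeys(sources):   # unique sources, in order
            for distance in (1, 2, 3):       # nearest producer wins
                j = i - distance
                if j >= 0 and parsed[j][1] == reg and reg != "$0":
                    found.append((i, reg, distance))
                    break
    return found


for index, reg, distance in classify(PROGRAM):
    print(PROGRAM[index], "needs", reg, "->", SOURCE_BY_DISTANCE[distance])
```

In this sequence only the and and or instructions need forwarding.  The add reads $2 in the same cycle the sub writes it, and the sw is far enough behind that it reads the register file normally.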

Not all data hazards can be addressed by forwarding.  Sometimes a result is really needed before it is available;  for example, if a load is immediately followed by an R-type instruction which uses the loaded data (p. 372, Figure 4.58):

lw    $2,20($1)
and   $4,$2,$5
...

In that case the only thing we can do is wait ("stall").  We do this by changing the second instruction going down the pipeline to a "no operation" and re-issuing it in the next cycle (Fig. 4.59).
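The condition that triggers the stall can be written as a one-line test.  This is a sketch of the check, with the pipeline register fields passed in as plain arguments; the function and parameter names are hypothetical, and registers are given by number.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use hazard check.

    id_ex_mem_read: True if the instruction now in EX is a load.
    id_ex_rt:       register that load will write.
    if_id_rs/rt:    source registers of the instruction now in ID.
    """
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2,20($1) is in EX while and $4,$2,$5 is in ID: stall.
print(must_stall(True, 2, 2, 5))

# Same pair, but the first instruction is not a load: forwarding suffices.
print(must_stall(False, 2, 2, 5))
```

When the check is true, the instruction in ID is held for one cycle and a no-operation is sent down the pipeline in its place; after that one-cycle delay the loaded value can be forwarded as usual.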

As the length of the pipeline increases, this problem gets worse: more stalls are required and the CPI (which in ideal conditions is 1) goes up.  The compiler can reduce this effect by judicious scheduling of instructions.  In particular, for our pipelined MIPS, it can move loads earlier so that there is at least one unrelated instruction between a load and the instruction which uses the result of the load.
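The effect of scheduling on CPI can be illustrated by counting load-use stalls before and after reordering.  The two instruction sequences below are hypothetical examples (registers and ordering are assumptions made for illustration), each instruction is written as a tuple (opcode, destination, sources), and the CPI figure ignores the cycles needed to fill the pipeline.

```python
def count_load_use_stalls(program):
    """One stall for each load whose result is used by the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in cur[2]:
            stalls += 1
    return stalls


def cpi(program):
    """Cycles per instruction, counting only load-use stalls."""
    n = len(program)
    return (n + count_load_use_stalls(program)) / n


# Hypothetical unscheduled code: each add directly follows the load it needs.
ORIGINAL = [
    ("lw",  "$1", ["$10"]),
    ("lw",  "$2", ["$10"]),
    ("add", "$3", ["$1", "$2"]),   # uses $2 right after its load: stall
    ("sw",  None, ["$3", "$10"]),
    ("lw",  "$4", ["$10"]),
    ("add", "$5", ["$1", "$4"]),   # uses $4 right after its load: stall
    ("sw",  None, ["$5", "$10"]),
]

# Same work with the third load moved up, away from its use.
SCHEDULED = [
    ("lw",  "$1", ["$10"]),
    ("lw",  "$2", ["$10"]),
    ("lw",  "$4", ["$10"]),
    ("add", "$3", ["$1", "$2"]),
    ("sw",  None, ["$3", "$10"]),
    ("add", "$5", ["$1", "$4"]),
    ("sw",  None, ["$5", "$10"]),
]

print("original: ", count_load_use_stalls(ORIGINAL), "stalls, CPI", round(cpi(ORIGINAL), 2))
print("scheduled:", count_load_use_stalls(SCHEDULED), "stalls, CPI", round(cpi(SCHEDULED), 2))
```

Moving a single load earlier removes both stalls in this example, because it separates each load from the instruction that consumes its result.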