CSCI-UA.0436 - Prof. Grishman
Lecture 16: Pipelining
Single-cycle vs. multi-cycle
We could change from a single-cycle design (with an 800 ps clock
period) to a multi-cycle design (with a 100 ps clock) in which we
allowed a different number of cycles for different instruction
types. The machine would be slightly faster (clock cycle time
would go down by a factor of 8, while average CPI would go up
less), but the control unit would be considerably more complex.
The benefit of multi-cycle would be much greater if the instruction
set included some instructions which took much longer (multiply,
divide, floating point) but were less frequently executed than load,
store, and other simple instructions.
For our small MIPS instruction subset, much greater speed-ups are
possible by overlapping the execution of different instructions.
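
The single-cycle vs. multi-cycle comparison above can be checked with a
little arithmetic. The sketch below is only illustrative: the 800 ps and
100 ps clock periods come from these notes, but the cycle counts per
instruction type and the instruction mix are assumptions chosen for the
example.

```python
# Clock periods from the notes.
SINGLE_CYCLE_PS = 800
MULTI_CYCLE_PS = 100

# ASSUMPTION: cycles needed by each instruction type on the 100 ps clock.
CYCLES = {"load": 8, "store": 7, "alu": 6, "branch": 5}
# ASSUMPTION: instruction mix, in percent of instructions executed.
MIX_PERCENT = {"load": 25, "store": 10, "alu": 45, "branch": 20}


def average_cpi(cycles, mix_percent):
    """Weighted average cycles per instruction for the given mix."""
    total = sum(mix_percent.values())
    return sum(cycles[kind] * mix_percent[kind] for kind in cycles) / total


def time_per_instruction_ps(cpi, clock_ps):
    """Average time per instruction = CPI x clock period."""
    return cpi * clock_ps


multi_cpi = average_cpi(CYCLES, MIX_PERCENT)
single_time = time_per_instruction_ps(1, SINGLE_CYCLE_PS)
multi_time = time_per_instruction_ps(multi_cpi, MULTI_CYCLE_PS)
speedup = single_time / multi_time

print(f"multi-cycle CPI: {multi_cpi:.2f}")
print(f"single-cycle: {single_time:.0f} ps, multi-cycle: {multi_time:.0f} ps")
print(f"speedup: {speedup:.2f}x")
```

With these assumed numbers the CPI rises by less than the factor of 8
gained in clock rate, so the multi-cycle machine is only modestly faster.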
An Overview of Pipelining (Sec'n 4.5)
The simplest such overlap is instruction fetch overlap: fetch the
next instruction while executing the current instruction. Even
relatively simple processors employed such overlap.
Greater gain can be achieved by overlapping the execution steps
(instruction fetch, ALU operation, ...) of successive instructions.
A full pipelining scheme overlaps such operations completely,
resulting ideally in a CPI (cycles per instruction) of 1. However,
machines which employ such overlap must deal with data and branch
hazards: instructions which depend on earlier instructions still in
the pipeline. This makes the design of pipelined machines much more
complex.
The benefits of pipelining increase when we have instructions which
are relatively uniform in execution time and which can be finely
divided into pipeline stages. We can then have a relatively short
clock cycle and issue one instruction each clock cycle (Fig. 4.27).
Under ideal conditions, throughput is multiplied by the number of
pipeline stages.
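
The ideal throughput claim can be made concrete with a small
calculation. This is a sketch under stated assumptions: stages are
perfectly balanced, there are no hazards or stalls, and the 200 ps stage
time and the function names are chosen for illustration.

```python
def unpipelined_time_ps(n_instructions, n_stages, stage_ps):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages * stage_ps


def pipelined_time_ps(n_instructions, n_stages, stage_ps):
    """The first instruction needs n_stages cycles; every later one
    finishes one cycle after its predecessor (ideal case, no stalls)."""
    return (n_stages + n_instructions - 1) * stage_ps


def ideal_speedup(n_instructions, n_stages, stage_ps):
    return (unpipelined_time_ps(n_instructions, n_stages, stage_ps)
            / pipelined_time_ps(n_instructions, n_stages, stage_ps))


# ASSUMPTION: 5 stages of 200 ps each.
for n in (1, 10, 1000):
    print(f"{n:>5} instructions: speedup {ideal_speedup(n, 5, 200):.2f}")
```

For a single instruction there is no gain; as the number of instructions
grows, the speedup approaches the number of stages.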
Instruction sets for pipelining (text, p. 335)
RISC machines like MIPS are well suited for pipelining.
Instruction format is simple and execution relatively uniform.
Pipelining is more complex for CISC machines, because the
instructions may take different lengths of time to execute. However,
RISC-style pipelining is now incorporated into high-performance CISC
processors (such as Pentium and Core 2) by translating most
instructions into a series of simpler, RISC-like operations.
Pipelined Data Path (text, section 4.6)
The basic idea is to introduce a set of pipeline registers which
hold all the information required to complete execution of the
instruction. This includes portions of the instruction, control
signals which have been decoded from the instruction, computed
effective addresses and data. Starting with the single-cycle
machine, we can build a system with a 5-stage pipeline; the
resulting design is shown in Figure 4.35.
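
The way instructions move through the pipeline registers can be
sketched with a toy simulation. It is only a model: the register names
IF/ID, ID/EX, EX/MEM and MEM/WB follow the usual naming for a 5-stage
pipeline, each instruction is represented by its name alone (a real
pipeline register carries instruction fields, control signals,
addresses and data), and hazards are ignored.

```python
PIPELINE_REGISTERS = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]


def simulate(program):
    """Return one snapshot per clock cycle of what each pipeline
    register holds at the end of that cycle (None = empty)."""
    regs = [None] * len(PIPELINE_REGISTERS)
    pending = list(program)
    snapshots = []
    while pending or any(r is not None for r in regs):
        fetched = pending.pop(0) if pending else None
        # On each clock edge every register passes its contents to the
        # next one, and IF/ID captures the newly fetched instruction.
        regs = [fetched] + regs[:-1]
        snapshots.append(list(regs))
    return snapshots


program = ["lw", "add", "sub"]  # hypothetical instruction names
for cycle, regs in enumerate(simulate(program), start=1):
    cells = " ".join(f"{name}={instr or '-':<4}"
                     for name, instr in zip(PIPELINE_REGISTERS, regs))
    print(f"cycle {cycle}: {cells}")
```

The last snapshot is empty because the final instruction spends its
last cycle in write-back, which has no pipeline register after it.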
Pipeline hazards: overlapping instruction execution can give rise to
problems:
- structural hazards: two instructions want to use the same resource
(e.g., the ALU) in the same clock cycle. This problem is reduced
in machines with a very uniform instruction set, such as MIPS. Now
that logic is cheap, we may duplicate some components to avoid
such conflicts.
- for load instructions, memory is needed at two points in
execution (instruction fetch and data fetch); this is handled
by having two separate memories. In machines with a single main
memory, we may still have separate caches for instructions and data.
- in the pipelined MIPS, the registers are needed at two points
in instruction execution: loading registers and storing
registers. This is not a problem as both can be done in the same
clock cycle.
- data hazards: one instruction uses the result of the previous
instruction. The simplest solution is to "stall" ... to hold up the
current instruction until the prior one has finished. A more
efficient solution is data forwarding ... to send a result of one
instruction directly to the ALU for the next instruction, as well as
putting it in the register.
- branch hazards: we must wait until a conditional branch is resolved
before we know whether the following instructions should be executed.
Again, we could stall the instruction after the branch, but this
slows down the pipeline.
Alternatively, we can guess whether or not the branch is taken, issue
instructions based upon our guess, but wait to store their results
until we know the outcome of the branch. If our guess is right, we
continue; if it is wrong, we invalidate the instructions we issued
after the branch, and try again. Modern CPUs use a branch prediction
buffer which keeps track of recent branches and whether they were
taken to "guess" more accurately.
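
One common way to keep track of recent branch behavior is a table of
2-bit saturating counters indexed by branch address. The sketch below
illustrates that particular scheme; it is not necessarily the one the
text has in mind. The class name, the initial counter value and the
branch trace are all assumptions made for the example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters, one per branch address.
    Counter values 0-1 predict 'not taken', 2-3 predict 'taken'."""

    def __init__(self, initial=1):
        self.initial = initial   # ASSUMPTION: start at 'weakly not taken'
        self.counters = {}       # branch address -> counter in 0..3

    def predict(self, address):
        return self.counters.get(address, self.initial) >= 2

    def update(self, address, taken):
        count = self.counters.get(address, self.initial)
        count = min(count + 1, 3) if taken else max(count - 1, 0)
        self.counters[address] = count


def accuracy(predictor, trace):
    """Fraction of branches in the trace that were predicted correctly."""
    correct = 0
    for address, taken in trace:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(trace)


# Hypothetical trace: a loop branch at address 0x40 that is taken nine
# times and then falls through, with the whole loop executed twice.
trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 2
result = accuracy(TwoBitPredictor(), trace)
print(f"prediction accuracy: {result:.2f}")
```

Because the counter must be wrong twice in a row before the prediction
flips, the single fall-through at the end of the loop does not disturb
the prediction for the next time the loop is entered.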
The pipeline registers have to hold all the information needed
to complete execution of the instruction. This includes both
data and control information. Fig. 4.46 shows
the processor with the control signals it requires; Fig. 4.50 shows how these control signals are
passed along from pipeline register to pipeline register. Fig. 4.51 puts these together.
Handling Data Hazards (Text 4.7)
Consider the code from p. 363 of the text.
Figure 4.52 shows the pipeline dependencies
in this sequence, and Fig. 4.53 shows how
some of them can be addressed by sending the output of the ALU from one
instruction directly to a following instruction, bypassing the register
file ("forwarding"). Fig. 4.54 shows
the additions to the data path to do this.
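
The forwarding decision can be sketched as a check on the one or two
instructions immediately ahead in the pipeline. This is a simplified
model, not the text's hardware: it assumes every instruction is R-type
(so its result is available at the ALU output), and the instruction
sequence and register names are hypothetical, not the listing from the
text.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def forwarding_sources(program):
    """For each instruction, report where each source operand comes from:
    'EX/MEM' (result of the instruction just ahead), 'MEM/WB' (result of
    the instruction two ahead) or 'register file' (no forwarding needed)."""
    result = []
    for i, instr in enumerate(program):
        choices = {}
        for src in (instr.src1, instr.src2):
            choice = "register file"
            # Check the older producer first so that the most recent
            # result takes priority when both write the same register.
            if i >= 2 and program[i - 2].dest == src:
                choice = "MEM/WB"
            if i >= 1 and program[i - 1].dest == src:
                choice = "EX/MEM"
            if src == "$0":
                choice = "register file"  # $0 is always zero in MIPS
            choices[src] = choice
        result.append(choices)
    return result


# Hypothetical sequence in which $t0 is produced once and used three times.
program = [
    Instr("add", "$t0", "$t1", "$t2"),
    Instr("sub", "$t3", "$t0", "$t4"),
    Instr("and", "$t5", "$t6", "$t0"),
    Instr("or", "$t7", "$t0", "$t0"),
]
sources = forwarding_sources(program)
for instr, choice in zip(program, sources):
    print(instr.op, choice)
```

The third use of $t0 needs no forwarding in this model because by then
the result has been written to the register file.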
Not all data hazards can be addressed by forwarding. Sometimes a
result is really needed before it is available; for example, when a
load is immediately followed by an R-type instruction which uses the
loaded data (p. 372, Figure 4.58).
In that case the only thing we can do is wait ("stall"). We do
this by changing the second instruction going down the pipeline to a
"no operation" and re-issuing it in the next cycle (Fig. 4.59).
As the length of the pipeline increases, this problem gets worse ...
more stalls are required and the CPI (which in ideal conditions is 1)
goes up. The compiler can reduce this effect by judicious ordering
of instructions. In particular, for our pipelined MIPS, it can move
loads earlier so that there is at least one unrelated instruction
between a load and the instruction which uses the results of the
load.
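
The effect of instruction ordering on load-use stalls can be sketched
as follows. The model is deliberately simple: it assumes a 5-stage
pipeline with full forwarding, so the only stall counted is one cycle
when a load is immediately followed by an instruction that reads the
loaded register. The two instruction sequences are hypothetical
examples.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str                 # "" for instructions that write no register
    sources: Tuple[str, ...]


def count_load_use_stalls(program):
    """One stall for each load immediately followed by a user of its result."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in curr.sources:
            stalls += 1
    return stalls


def total_cycles(program, n_stages=5):
    """Ideal pipelined cycle count plus one cycle per load-use stall."""
    return n_stages + len(program) - 1 + count_load_use_stalls(program)


lw1 = Instr("lw", "$t1", ("$t0",))
lw2 = Instr("lw", "$t2", ("$t0",))
lw4 = Instr("lw", "$t4", ("$t0",))
add3 = Instr("add", "$t3", ("$t1", "$t2"))
add5 = Instr("add", "$t5", ("$t1", "$t4"))
sw3 = Instr("sw", "", ("$t3", "$t0"))
sw5 = Instr("sw", "", ("$t5", "$t0"))

original = [lw1, lw2, add3, sw3, lw4, add5, sw5]
reordered = [lw1, lw2, lw4, add3, sw3, add5, sw5]  # third load moved earlier

for name, prog in (("original", original), ("reordered", reordered)):
    print(f"{name}: {count_load_use_stalls(prog)} stalls, "
          f"{total_cycles(prog)} cycles")
```

Moving the third load ahead of the first add puts an unrelated
instruction after every load, so both stalls disappear without changing
the result of the program.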