V22.0436 - Prof. Grishman
Lecture 16: Performance (cont'd)
How Architecture Affects Performance
Our goal is to minimize the product of the three factors (number of
instructions executed, average CPI, clock cycle time). Whenever we
consider a change to the architecture, we must evaluate its effect on
each of these factors.
In particular, when we add an instruction to the instruction set, we
must consider whether it can significantly reduce the number of
instructions to execute (the first factor) without affecting the time
per instruction (the last two factors). A specialized instruction may
be used only rarely by a compiler (most of the execution time is spent
on a small number of instructions; see Figure 3.26 for the distribution
of instructions for the SPEC benchmarks). On the other hand, if the
instruction requires a longer data path, it may require a longer clock
cycle. The net effect would be a slower machine. [We ignore the issue
of code size, which is much less important than it used to be because
memory is so much cheaper.]
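The trade-off above can be checked with a little arithmetic. The sketch
below is illustrative only: the function name and all of the numbers
(instruction counts, CPI, and the 2% and 5% changes) are assumptions
made up for the example, not measurements.

```python
def cpu_time_ps(instruction_count, avg_cpi, cycle_time_ps):
    """Execution time = instructions executed * average CPI * clock cycle time."""
    return instruction_count * avg_cpi * cycle_time_ps


# Hypothetical baseline machine (every number here is an assumption).
baseline = cpu_time_ps(instruction_count=1_000_000, avg_cpi=1.5, cycle_time_ps=800)

# Hypothetical specialized instruction: the compiler uses it rarely, so only
# 2% fewer instructions are executed, but its longer data path stretches the
# clock cycle by 5% (800 ps -> 840 ps) for every instruction.
with_new_instruction = cpu_time_ps(instruction_count=980_000, avg_cpi=1.5, cycle_time_ps=840)

speedup = baseline / with_new_instruction
print(f"speedup = {speedup:.3f}")  # a value below 1 means a slower machine
```

With these assumed numbers the speedup comes out below 1: the small
saving in instruction count does not pay for the longer clock cycle.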
Good candidates for instructions are those which would be used
frequently and would take much longer if performed by a sequence of
other instructions ... for example, floating point operations for
scientific applications.
The increased use of RISC machines, starting in the mid-80's, reflected
a more careful assessment of the benefits and costs of adding
instructions to the instruction set.
However, the issue of binary code compatibility remains very important,
if not overwhelming. The development of entirely new machine architectures
has decreased since the 90's, with the Intel PC architecture, and its
successors, increasingly dominant. As we shall discuss later, current
microprocessors achieve both speed and code compatibility by
translating Intel instructions into a RISC-like instruction set.
Single-cycle vs. multi-cycle
We could change from a single-cycle design (with an 800 ps clock
period) to a multi-cycle design (with a 100 ps clock) in which we
allowed a different number of cycles for different instruction
types. The machine would be slightly faster (clock cycle time
would go down by a factor of 8, while average CPI would go up somewhat
less), but the control unit would be considerably more complex.
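As a rough check of "slightly faster", here is a small sketch. The 800 ps
and 100 ps clock periods come from the text above; the per-class cycle
counts and the instruction mix are assumptions chosen for illustration.

```python
SINGLE_CYCLE_PERIOD_PS = 800  # from the text
MULTI_CYCLE_PERIOD_PS = 100   # from the text

# Assumed number of 100 ps cycles needed by each instruction class.
CYCLES = {"load": 8, "store": 7, "alu": 6, "branch": 5}

# Assumed (hypothetical) fraction of executed instructions in each class.
MIX = {"load": 0.25, "store": 0.10, "alu": 0.50, "branch": 0.15}

# Average CPI of the multi-cycle design is the mix-weighted cycle count.
avg_cpi = sum(MIX[kind] * CYCLES[kind] for kind in CYCLES)

# Average time per instruction = CPI * clock period (single-cycle CPI is 1).
single_cycle_time_ps = 1 * SINGLE_CYCLE_PERIOD_PS
multi_cycle_time_ps = avg_cpi * MULTI_CYCLE_PERIOD_PS

speedup = single_cycle_time_ps / multi_cycle_time_ps
print(f"average CPI = {avg_cpi:.2f}, speedup = {speedup:.2f}")
```

Under these assumptions the clock is 8 times faster but the average CPI
rises to about 6.45, so the net gain is only about 1.24.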
The benefit of multi-cycle would be much greater if the instruction set
included some instructions which took much longer (multiply, divide,
floating point) but were less frequently executed than instructions
such as load and store.
For our small MIPS instruction subset, much greater speed-ups are
possible by overlapping the execution of different instructions.
An Overview of Pipelining (Sec'n 4.5)
The simplest such overlap is instruction fetch overlap: fetch the next
instruction while executing the current instruction. Even relatively
simple processors employed such overlap.
Greater gain can be achieved by overlapping the execution (register
fetch, ALU operation, ...) of successive instructions. A full pipelining
scheme overlaps such operations completely, resulting ideally in a CPI
(cycles per instruction) of 1. However, machines which employ such
overlap must deal with data and branch hazards: instructions which
influence later instructions in the pipeline. This makes the design of
pipelined machines much more complex.
The benefits of pipelining increase when we have instructions which are
relatively uniform in execution time and which can be finely divided
into pipeline stages. We can then have a relatively short clock
cycle and issue one instruction each clock cycle (Fig. 4.27). Under ideal conditions,
throughput is multiplied by the number of pipeline stages.
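The ideal case can be sketched as a cycle count. This assumes every stage
takes exactly one clock cycle, the same clock is used with and without
overlap, and there are no hazards or stalls; the function names are
made up for the example.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction needs n_stages cycles to flow through the pipeline;
    after that, one instruction completes every cycle."""
    return n_stages + (n_instructions - 1)


def ideal_speedup(n_instructions, n_stages):
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


for n in (1, 10, 1000):
    print(n, round(ideal_speedup(n, n_stages=5), 2))
```

For a single instruction there is no gain, but as the number of
instructions grows the speedup approaches the number of stages (5 here).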
Instruction sets for pipelining (text, p. 335)
RISC machines like MIPS are well suited for pipelining.
Instruction format is simple and execution relatively uniform.
Pipelining is more complex for CISC machines, because the instructions
may take different lengths of time to execute. However, RISC-style
pipelining is now incorporated into high-performance CISC processors
(such as the Pentium and Core 2) by translating most instructions into
a series of RISC-like operations.
Pipelined Data Path (text, section 4.6)
The basic idea is to introduce a set of pipeline registers which hold
all the information required to complete execution of the
instruction. This includes portions of the instruction, control
signals which have been decoded from the instruction, computed
effective addresses and data. Starting with the single-cycle
machine, we can build a system with a 5-stage pipeline; the basic
design is shown in Figure 4.35.
Pipeline hazards: overlapping instruction execution can give rise to
hazards:
- structural hazards: two instructions want to use the same resource
(e.g., the ALU) in the same clock cycle. This problem is reduced
in machines with a very uniform instruction set, such as MIPS. Given
that logic is cheap, we may duplicate some components to avoid
such conflicts.
- for load instructions, memory is needed at two points in
execution (instruction fetch and data fetch); this is addressed here
by having two separate memories. In machines with a single memory,
we may still have separate caches for instructions and data.
- in the pipelined MIPS, the registers are needed at two points
in instruction execution: loading registers and storing into
registers. This is not a problem as both can be done in a
single clock cycle.
- data hazards: one instruction uses the result of the previous
instruction. The simplest solution is to "stall" ... to hold up the
current instruction until the prior one has finished. A more efficient
solution is data forwarding ... to send a result of one instruction
directly to the ALU for the next instruction, as well as putting it in
the register.
- branch hazards: we must wait until a conditional branch is resolved
before we know whether the following instructions should be executed.
Again, we could stall the instruction after the branch, but this is
inefficient. Alternatively, we can guess whether or not the branch is
taken, start instructions based upon our guess, but wait to store their
results until we know the outcome of the branch. If our guess is
correct, we proceed; if it is wrong, we invalidate the instructions we
issued following the branch, and try again. Modern CPUs use a branch
prediction table which keeps track of recent branches and whether they
were taken in order to "guess" more accurately.
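The idea of a branch prediction table can be sketched in a few lines.
This is a deliberately simple model, not how any particular CPU does it:
the class, the one-entry-per-address table, the "predict the same as last
time" rule, and the branch address 0x40 are all assumptions for
illustration.

```python
class BranchPredictionTable:
    """Hypothetical predictor: for each branch address, remember whether the
    branch was taken the last time it executed, and guess the same again."""

    def __init__(self, default_taken=False):
        self.last_outcome = {}            # branch address -> taken last time?
        self.default_taken = default_taken  # guess for branches not yet seen

    def predict(self, branch_address):
        return self.last_outcome.get(branch_address, self.default_taken)

    def update(self, branch_address, taken):
        self.last_outcome[branch_address] = taken


def count_mispredictions(table, trace):
    """trace is a list of (branch_address, actually_taken) pairs."""
    misses = 0
    for address, taken in trace:
        if table.predict(address) != taken:
            # Wrong guess: the instructions issued after the branch
            # would have to be invalidated.
            misses += 1
        table.update(address, taken)
    return misses


# A loop-closing branch at (hypothetical) address 0x40: taken 9 times,
# then not taken when the loop exits.
loop_trace = [(0x40, True)] * 9 + [(0x40, False)]
misses = count_mispredictions(BranchPredictionTable(), loop_trace)
print(f"{misses} mispredictions out of {len(loop_trace)} branches")
```

On this trace the table guesses wrong only on the first and last
iterations, whereas always guessing "not taken" would be wrong 9 times.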