V22.0436 - Prof. Grishman
Lecture 10: Pushing CPU Performance
Text, chapter 6
Implementing pipelined machines (text, section 6.2 & 6.3)
For a simple pipelined design, we introduce a set of pipeline registers
(Figure 6.12); each pipeline register must be able to hold all the
information about an instruction at that stage of processing, including
both the data and the (decoded) instruction (Fig. 6.29).
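As a software analogy (a minimal sketch; the field names and stage split
are illustrative, not the text's exact register layout from Fig. 6.29),
each pipeline register can be pictured as a record carrying the decoded
instruction together with its data:

```python
# Sketch of two pipeline registers in a 5-stage MIPS-style pipeline.
# Each register must hold everything the later stages still need:
# the (decoded) instruction fields plus the data values computed so far.
from dataclasses import dataclass

@dataclass
class IF_ID:                 # between instruction fetch and decode
    pc: int = 0
    instruction: int = 0     # raw 32-bit instruction word

@dataclass
class ID_EX:                 # between decode and execute
    pc: int = 0
    opcode: str = "nop"      # decoded operation
    reg_a: int = 0           # operand values read from the register file
    reg_b: int = 0
    imm: int = 0             # sign-extended immediate
    dest: int = 0            # destination register number

# On each clock edge every stage writes its output register while the
# next stage reads the values latched on the previous edge.
```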
Superscalar and dynamic pipelining (text, section 6.8)
Most processors now try to go beyond pipelining to execute more than one
instruction per clock cycle, producing an effective CPI < 1. This is
possible if we duplicate some of the functional parts of the processor
(e.g., have two ALUs or a register file with 4 read ports and 2 write ports),
and have logic to issue several instructions concurrently. However,
it requires even more complex logic to guard against hazards. Such designs
are called superscalar.
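A toy version of that guard logic (a sketch under assumed formats: each
instruction is a (functional_unit, dest_reg, source_regs) tuple, and the
unit counts are made up) shows the kinds of checks a dual-issue machine
must make every cycle:

```python
# Sketch of a dual-issue legality check. Instruction format is invented
# for illustration: (functional_unit, dest_reg, source_regs).
# UNITS says how many copies of each functional unit this design has.
UNITS = {"alu": 2, "mem": 1, "branch": 1}

def can_dual_issue(first, second):
    unit1, dest1, srcs1 = first
    unit2, dest2, srcs2 = second
    # structural hazard: the pair needs more copies of a unit than exist
    if unit1 == unit2 and UNITS.get(unit1, 0) < 2:
        return False
    # data (RAW) hazard: the second instruction reads the first's result
    if dest1 in srcs2:
        return False
    # output (WAW) hazard: both write the same register
    if dest1 == dest2:
        return False
    return True
```

In real hardware these comparisons are done by parallel comparator logic,
which is why superscalar issue logic grows quickly with issue width.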
In addition, current processors use dynamic pipeline scheduling rather
than the simple sequential scheduling described in this chapter.
In dynamic scheduling, if an instruction is not ready to execute (because,
for example, not all operands are available), the CPU will look past that
instruction for one which is ready to execute. Instructions may therefore
be executed out-of-order; however, the final storing of results (instruction
commit) will be done according to the order of the instructions.
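The policy can be sketched as a small simulation (the instruction format
and latencies are invented for illustration): issue whichever instruction
has its operands available, but retire results strictly in program order:

```python
# Sketch of dynamic pipeline scheduling. Instruction format is invented:
# (name, source_regs, dest_reg, latency_in_cycles).
def run(window, initially_valid):
    valid = set(initially_valid)   # registers whose values are available
    finish = {}                    # window index -> cycle result is ready
    issued = [False] * len(window)
    exec_order, commit_order = [], []
    committed, cycle = 0, 0
    while committed < len(window):
        cycle += 1
        assert cycle < 10_000, "deadlock: some operand is never produced"
        # results completing this cycle become available to waiting instrs
        for i, c in finish.items():
            if c == cycle:
                valid.add(window[i][2])
        # issue (one per cycle): oldest unissued instruction that is
        # ready, even if an older one is still stalled -- out-of-order
        for i, (name, srcs, dest, lat) in enumerate(window):
            if not issued[i] and all(s in valid for s in srcs):
                issued[i] = True
                finish[i] = cycle + lat
                exec_order.append(name)
                break
        # commit: retire finished instructions strictly in program order
        while (committed < len(window) and committed in finish
               and finish[committed] <= cycle):
            commit_order.append(window[committed][0])
            committed += 1
    return exec_order, commit_order
```

With a long-latency load followed by a dependent add and an independent
sub, the sub executes ahead of the add but still commits after it.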
Taking advantage of technology improvements
How has the steady progress in integrated circuit technology been translated
into improvements in processor performance?
The technology improvements lead to faster transistors and smaller transistors.
Faster transistors mean faster clock times. Smaller transistors mean that
we can put more transistors on a chip (the latest chips have over 30M
transistors). What can we do with the increasing number of transistors
to improve performance?
- increase the width of the data we process. The first microprocessor
(the Intel 4004) had 4-bit data paths; later processors had wider paths.
However, it doesn't help much to increase processing beyond 32- or 64-bit
units.
- use fast combinatorial circuits: carry-lookahead (CLA) adders;
combinatorial multipliers. A combinatorial multiplier for N-bit numbers
requires roughly N*N full adders (about 1000 full adders for 32-bit
numbers, so perhaps 10,000 gates or more).
- larger register files (e.g., 32 registers on MIPS)
- pipelined or superscalar architecture (with multiple functional units);
benefits limited by dependencies between instructions
- on-chip memory cache
The benefits of overlapped execution can be magnified by a cooperative
compiler: a code generator which schedules instructions to reduce
hazards can speed up execution considerably.
All of these techniques can be observed in the progress of x86 implementations:
- increase in register size to 16 bits in the 8086 and 32 bits in the
80386, and associated increases in the width of data paths
- more operations (e.g., floating point) in the CPU
- instruction prefetch (instruction buffer in the 80386)
- execution overlap (pipelining) in the 80486
- superscalar execution (two execution pipelines in the Pentium)
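The full-adder array multiplier mentioned earlier can be modeled bit by
bit (a sketch; this is the classic ripple-carry array organization, not
any particular chip's design), which also lets us count the adder cells:

```python
# Bit-level sketch of a combinatorial (array) multiplier: an N x N grid
# of full adders sums the shifted partial products a_bit[i] AND b_bit[j].
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def array_multiply(a, b, n):
    adders = 0
    acc = [0] * (2 * n)                 # running sum, one bit per wire
    for i in range(n):
        a_bit = (a >> i) & 1
        carry = 0
        for j in range(n):
            pp = a_bit & ((b >> j) & 1) # partial-product bit
            acc[i + j], carry = full_adder(acc[i + j], pp, carry)
            adders += 1                 # one full-adder cell per (i, j)
        acc[i + n] = carry              # ripple-out of this row
    value = sum(bit << k for k, bit in enumerate(acc))
    return value, adders
```

The adder count comes out to exactly N*N, matching the rough estimate
above; the attraction is that the whole product is formed in one pass of
combinatorial logic instead of N sequential add-and-shift steps.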
At some point most of these methods have diminishing returns; it
is very hard to squeeze out additional parallelism from a serial architecture.
New architectural approaches are needed:
SIMD instructions (single-instruction multiple-data) for specialized
applications. A single instruction specifies operations, element
by element, on arrays (vectors) of data. The operations are explicitly
parallel, so no complex checking is required. Intel added MMX instructions
to the Pentium for this purpose, and added further instructions (including
floating-point vector operations) in the Pentium III.
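The flavor of these instructions can be sketched in software (a toy model
of a wraparound packed byte add, in the style of the MMX PADDB operation;
lane widths and word size follow MMX's 64-bit registers):

```python
# Sketch of an MMX-style packed add: one SIMD "instruction" adds eight
# 8-bit lanes packed into a 64-bit word, element by element, with no
# carries crossing lane boundaries (wraparound on overflow).
def packed_add_8x8(x, y):
    result = 0
    for lane in range(8):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)  # per-lane wraparound
    return result
```

Because the eight additions are independent by definition, the hardware
needs no dependence checking to perform them all in one cycle.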
EPIC (explicitly parallel instruction computing) architectures. An entirely
new architecture (instruction set) based upon large instructions containing
several operations which are to be performed in parallel. By making
the parallelism explicit, we gain two advantages over superscalar:
much less logic (and hence less time) is required to identify parallelism
at execution time, and compilers can generate code to take full advantage
of the parallelism. Intel and HP have developed an architecture (IA-64)
of this type, and Intel has been delivering a processor (Itanium) based
on this architecture. This architecture includes:
- bundles of operations which are explicitly marked as being executable
in parallel (the architecture has 128-bit bundles, each containing three
41-bit RISC-like instructions plus a 5-bit template).
- predicated instructions: associated with each instruction is a predicate
number (0 to 63). Predicate-setting instructions test one or two registers
and set a predicate bit (to indicate, for example, that one register is
less than another, like the MIPS SLT operation). Other instructions which
refer to that predicate bit will execute only if the bit is 1. This allows
considerable conditional execution without branches and hence with minimal
stalls in pipelined execution.
However, the initial release of the Itanium has had little speed advantage
over the latest Pentiums, so few have been sold.
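A software sketch of predication (the register names, predicate file
size, and instruction forms here are invented for illustration):

```python
# Sketch of predicated execution: a compare "instruction" sets a
# predicate bit; later instructions tagged with that predicate number
# execute only if the bit is 1 -- no branch needed.
preds = [False] * 64               # predicate register file (bits 0..63)
regs = {"r1": 5, "r2": 9, "r3": 0}

def cmp_lt(p, ra, rb):
    preds[p] = regs[ra] < regs[rb] # like MIPS SLT, but into a predicate

def add_if(p, rd, ra, rb):
    if preds[p]:                   # squashed (a no-op) if the bit is 0
        regs[rd] = regs[ra] + regs[rb]

# if (r1 < r2) r3 = r1 + r2   -- with no branch instruction
cmp_lt(3, "r1", "r2")
add_if(3, "r3", "r1", "r2")
```

The add always flows down the pipeline; when the predicate is 0 its
result is simply discarded, so no branch prediction is involved.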
Multiprocessors (text, chapter 9). For applications which
can take advantage of high-level parallelism, multiprocessors can provide
a large improvement in performance, even though this adds to software complexity.
Many systems now use large collections of workstations as the most effective
approach for high-performance computing and for large on-line services,
such as web servers, database servers, and transaction processing.