V22.0436 - Prof. Grishman

Lecture 10: Pushing CPU Performance

Text, chapter 6

Implementing pipelined machines (text, sections 6.2 & 6.3)

For a simple pipelined design, we introduce a set of pipeline registers (Figure 6.12);  each pipeline register must be able to hold all the information about an instruction at that stage of processing, including both the data and the (decoded) instruction (Fig. 6.29).
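The role of the pipeline registers can be illustrated with a minimal sketch. It is not the text's datapath: the class name PipelineRegister, the function run, the five stage names, and the single "data" field are assumptions made for illustration, and each stage is modeled as one slot that copies its predecessor's contents on every clock.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PipelineRegister:
    """Everything a later stage needs to know about one instruction (hypothetical layout)."""
    instr: Optional[str] = None   # decoded instruction; None means an empty slot (bubble)
    data: Optional[int] = None    # data value carried along with the instruction

# Assumed classic five-stage pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def run(program: List[str]) -> List[List[Optional[str]]]:
    """Return, for each clock cycle, the instruction occupying each stage."""
    regs = [PipelineRegister() for _ in STAGES]
    pending = list(program)
    trace = []
    while pending or any(r.instr is not None for r in regs):
        # On each clock, every register takes over its predecessor's contents.
        for i in range(len(regs) - 1, 0, -1):
            regs[i] = regs[i - 1]
        # The first stage fetches the next instruction, or a bubble if none remain.
        regs[0] = PipelineRegister(pending.pop(0)) if pending else PipelineRegister()
        if any(r.instr is not None for r in regs):
            trace.append([r.instr for r in regs])
    return trace

trace = run(["lw", "add", "sw"])
for cycle, row in enumerate(trace, start=1):
    print(cycle, row)
```

With n instructions and k stages the trace has n + k - 1 rows, which is the usual pipeline fill-and-drain count.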

Superscalar and dynamic pipelining (text, section 6.8)

Most processors now try to go beyond pipelining and execute more than one instruction per clock cycle, producing an effective CPI (cycles per instruction) < 1. This is possible if we duplicate some of the functional parts of the processor (e.g., have two ALUs or a register file with 4 read ports and 2 write ports), and add logic to issue several instructions concurrently. However, this requires even more complex logic to guard against hazards. Such designs are called superscalar.
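The interplay between multiple issue and hazard checking can be sketched as follows. This is a simplified model, not a description of any real processor: it assumes in-order issue, a single-cycle latency for every instruction, and a hypothetical function issue_cycles that only checks dependences among instructions issued in the same cycle.

```python
from typing import List, Tuple

# (destination register, source registers); a hypothetical minimal encoding
Instr = Tuple[str, Tuple[str, ...]]

def issue_cycles(program: List[Instr], width: int = 2) -> int:
    """Count cycles needed to issue `program` in order, up to `width` per cycle.

    An instruction may not issue in the same cycle as an earlier instruction
    whose result it reads (RAW) or whose destination it also writes (WAW).
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()   # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            if written & set(srcs) or dest in written:
                break     # hazard: this instruction waits for the next cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles

independent = [("r1", ("r2", "r3")), ("r4", ("r5", "r6")),
               ("r7", ("r8", "r9")), ("r10", ("r11", "r12"))]
chain = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")),
         ("r6", ("r4", "r7")), ("r8", ("r6", "r9"))]

print(issue_cycles(independent) / len(independent))  # CPI for independent code
print(issue_cycles(chain) / len(chain))              # CPI for a dependence chain
```

Under these assumptions the independent sequence reaches a CPI of 0.5, while the dependence chain stays at a CPI of 1.0: the duplicated hardware helps only when the instruction stream contains parallelism.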

In addition, current processors use dynamic pipeline scheduling rather than the simple sequential scheduling described in this chapter. In dynamic scheduling, if an instruction is not ready to execute (because, for example, not all of its operands are available), the CPU will look past that instruction for one which is ready to execute. Instructions may therefore be executed out of order; however, the final storing of results (instruction commit) is still done in the original program order.
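A small simulation may make the distinction between execution order and commit order concrete. It is a sketch under stated assumptions, not a model of a real scheduler: the function simulate, the instruction tuples, and the latencies are hypothetical, one instruction starts per cycle, any number may commit per cycle, and each source register is tied to the most recent earlier writer (which sidesteps WAR and WAW hazards).

```python
from typing import Dict, List, Optional, Tuple

# (name, destination register, source registers, latency in cycles); hypothetical encoding
Instr = Tuple[str, str, Tuple[str, ...], int]

def simulate(program: List[Instr]) -> Tuple[List[str], List[str]]:
    """Return (execution order, commit order) for a simple dynamic scheduler."""
    n = len(program)
    # For each instruction, find the earlier instructions that produce its sources.
    producers: List[List[int]] = []
    last_writer: Dict[str, int] = {}
    for idx, (_, dest, srcs, _) in enumerate(program):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = idx

    finish: List[Optional[int]] = [None] * n  # cycle in which each result is ready
    exec_order: List[str] = []
    commit_order: List[str] = []
    cycle = 0
    while len(commit_order) < n:
        cycle += 1
        # Look past stalled instructions: start the oldest one whose operands are ready.
        for idx in range(n):
            if finish[idx] is not None:
                continue  # already started
            if all(finish[p] is not None and finish[p] < cycle for p in producers[idx]):
                finish[idx] = cycle + program[idx][3] - 1
                exec_order.append(program[idx][0])
                break     # only one instruction starts per cycle
        # Commit strictly in program order, and only once the result is ready.
        head = len(commit_order)
        while head < n and finish[head] is not None and finish[head] <= cycle:
            commit_order.append(program[head][0])
            head += 1
    return exec_order, commit_order

program = [
    ("lw",  "r1", ("r2",),      3),  # slow load
    ("add", "r3", ("r1", "r4"), 1),  # must wait for the load
    ("sub", "r5", ("r6", "r7"), 1),  # independent of the load
    ("or",  "r8", ("r5", "r9"), 1),  # depends only on sub
]
exec_order, commit_order = simulate(program)
print("executed: ", exec_order)
print("committed:", commit_order)
```

In this example sub and or execute ahead of add, which is waiting on the load, yet the commit order matches the program order.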

Taking advantage of technology improvements

How has the steady progress in integrated circuit technology been translated into improvements in processor performance?

Technology improvements lead to faster transistors and smaller transistors. Faster transistors mean shorter clock cycle times (higher clock rates). Smaller transistors mean that we can put more transistors on a chip (the latest chips have over 30M transistors). What can we do with the increasing number of transistors to improve performance?

All of these techniques can be observed in the progress of x86 implementations.

The benefits of overlapped execution can be magnified by a cooperative compiler: a code generator which schedules instructions to reduce hazards can speed up execution considerably.
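As a rough illustration of compiler scheduling, the sketch below counts load-use stalls in two orderings of the same computation. It assumes a simple pipeline in which an instruction that reads a register loaded by the immediately preceding lw stalls for one cycle; the function load_use_stalls, the instruction tuples, and the register names are hypothetical, and the reordering was done by hand rather than by an actual scheduler.

```python
from typing import List, Optional, Tuple

# (opcode, destination register or None, source registers); hypothetical encoding
Instr = Tuple[str, Optional[str], Tuple[str, ...]]

def load_use_stalls(code: List[Instr]) -> int:
    """Count one-cycle stalls caused by using a loaded value in the very next instruction."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

# Straightforward code: each add immediately follows the load it depends on.
original = [
    ("lw",  "t1", ("t0",)),
    ("lw",  "t2", ("t0",)),
    ("add", "t3", ("t1", "t2")),
    ("sw",  None, ("t3", "t0")),
    ("lw",  "t4", ("t0",)),
    ("add", "t5", ("t1", "t4")),
    ("sw",  None, ("t5", "t0")),
]

# Scheduled code: the third load is moved up so that no add follows its load directly.
# This assumes the moved load and the store it passes access different addresses.
scheduled = [
    ("lw",  "t1", ("t0",)),
    ("lw",  "t2", ("t0",)),
    ("lw",  "t4", ("t0",)),
    ("add", "t3", ("t1", "t2")),
    ("sw",  None, ("t3", "t0")),
    ("add", "t5", ("t1", "t4")),
    ("sw",  None, ("t5", "t0")),
]

print(load_use_stalls(original), load_use_stalls(scheduled))
```

Both sequences contain the same seven instructions; under the stated stall model the original order incurs two stalls and the scheduled order none.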

Architectural Approaches

At some point most of these methods have diminishing returns; it is very hard to squeeze out additional parallelism from a serial architecture. New architectural approaches are needed.

Spring 2002