A Whirlwind Tour through Computer Architecture:  Part II

Honors Computer Systems Organization (Prof. Grishman)


Most machines are synchronous:  there is one 'master clock' signal, which 'strikes' (goes from 0 to 1) at regular intervals.  Registers load new data only when the clock strikes.  The 'clock rate' of a chip is the rate at which this clock strikes (e.g., 1 GHz).  The clock period is 1/clock rate (e.g., 1 ns).

How fast can we run the clock?  When the clock strikes, new data is loaded into the registers and the outputs of the registers change.  The changed values then have to propagate through the circuits of the machine until they reach the input of the next set of registers.  We shouldn't strike the clock again until the signals have time to reach the register inputs;  otherwise the values loaded into the inputs may be incorrect.  So the minimum clock period = the (maximum) propagation delay of the circuit, from register output to register input.

How long does it take a program to run?  If the program executes I instructions from start to finish,

time for program = clock period  *  clock cycles per instruction  *   I
We can introduce additional registers immediately before and after the ALU.  Let's call the registers on the ALU inputs A and B, and the register on the ALU output C.  We can then reduce the clock period, because the maximum delay has been decreased (possibly, by a factor of 2 or even 3).  However, it now takes 3 clock cycles for one register-to-register instructions (one cycle to load A and B from the register file;  one to go through the ALU and load C;  one cycle to move the result from C to the register file).  Is that progress?


The idea:  keep our 3-cycle machine organization, but start a new instruction every clock cycle.  Operations are staggered.

The benefit:  we get one clock cycle per instruction, but with our faster clock.

The problems:  data hazards and conditional branches.  In a data hazard, the operand of one instruction depends on the result of a previous instruction.  We have to wait until the first instruction finishes before starting the second instruction.  This reduces the value of the pipeline (we no longer have one cycle per instruction), and requires complex circuitry in order to detect the hazard in the first place.  If we have a conditional branch, we don't know which instruction will be executed next, so we can't put the next instruction in the pipeline until the conditional branch has finished.  Again, this reduces the benefits of pipelining.  And all this makes the processor more complicated.  Still, once we have millions of transistors to play with, we put up with this complexity to gain speed.

The trend:  towards pipelines with more stages, in order to reduce the clock cycle time (e.g., from Pentium III to Pentium IV).

(see http://www.ece.arizona.edu/~ece462/Lec03-pipe for some nice pictures)


The x86 is an example of a complex instruction set computer (CISC):  it has a large number of instructions, they come in various sizes, they have complex layouts, and some of them do multiple operations in one instruction (some even do loops!).

In contrast, reduced instruction set computers (RISC) have fewer instructions;  all instructions are all the same size and have one or a few layouts;  each instruction does one operation;  and all arithmetic operations are done between registers.  Because the instructions are much more uniform, and close to the basic machine model, it is simpler to create a pipeline for a RISC.  CISC machines like the x86 manage to create pipelines by translating complex instructions internally into more RISC-like operations.


In an effort to squeeze even more performance out of the chip, modern CPUs employ superscalar design, which is one step beyond pipelining.  They have multiple ALUs, and issue more than one instruction in each clock cycle. In terms of our formula, the 'clock cycles per instruction' can go below 1.  But the logic to keep track of hazards becomes even more complex;  more logic is needed to schedule operations than to do them.  And even with complex logic, it is hard to schedule parallel operations 'on the fly'.


The limits of dynamic operation scheduling have led machine designers to consider a very different architecture, explicitly parallel instruction computers (EPIC), exemplified by the Itanium (IA-64 architecture).  EPIC machines have very large instructions (for the Itanium, 128 bits) which specify several operations to be done in parallel.  Thus the burden of scheduling is shifted from the processor to the compiler, and much more time can be spent in developing a good schedule (and analyzing data hazards).

To reduce the pipelining problems due to conditional branches, the IA-64 introduced predicated instructions.  Comparison instructions set predicate bits, much like they set condition codes on the x86 machine (except that there are 64 predicate bits).  Each operation specifies a predicate bit;  it is executed only if the predicate bit = 1.  In practice, all operations are performed, but the result is stored into the register file only if the predicate bit = 1.  The result is that more instructions are executed, but we don't have to stall the pipeline waiting for a condition.

Itanium chips have been available for about half a year (since summer 2001), but the performance of these early chips does not beat that of top-of-the-line Pentium IV's.