A Whirlwind Tour through Computer Architecture:  Part II

Honors Computer Systems Organization (Prof. Grishman)


Most machines are synchronous:  there is one 'master clock' signal, which 'strikes' (goes from 0 to 1) at regular intervals.  Registers load new data only when the clock strikes.  The 'clock rate' of a chip is the rate at which this clock strikes (e.g., 1 GHz).  The clock period is 1/clock rate (e.g., 1 ns).

How fast can we run the clock?  When the clock strikes, new data is loaded into the registers and the outputs of the registers change.  The changed values then have to propagate through the circuits of the machine until they reach the input of the next set of registers.  We shouldn't strike the clock again until the signals have time to reach the register inputs;  otherwise the values loaded into the inputs may be incorrect.  So the minimum clock period = the (maximum) propagation delay of the circuit, from register output to register input.

How long does it take a program to run?  If the program executes I instructions from start to finish,

time for program = clock period  *  clock cycles per instruction  *   I
We can introduce additional registers immediately before and after the ALU.  Let's call the registers on the ALU inputs A and B, and the register on the ALU output C.  We can then reduce the clock period, because the maximum delay has been decreased (possibly, by a factor of 2 or even 3).  However, it now takes 3 clock cycles for one register-to-register instructions (one cycle to load A and B from the register file;  one to go through the ALU and load C;  one cycle to move the result from C to the register file).  Is that progress?


The idea:  keep our 3-cycle machine organization, but start a new instruction every clock cycle.  Operations are staggered.

The benefit:  we get one clock cycle per instruction, but with our faster clock.

The problems:  data hazards and conditional branches.  In a data hazard, the operand of one instruction depends on the result of a previous instruction.  We have to wait until the first instruction finishes before starting the second instruction.  This reduces the value of the pipeline (we no longer have one cycle per instruction), and requires complex circuitry in order to detect the hazard in the first place.  If we have a conditional branch, we don't know which instruction will be executed next, so we can't put the next instruction in the pipeline until the conditional branch has finished.  Again, this reduces the benefits of pipelining.  And all this makes the processor more complicated.  Still, once we have millions of transistors to play with, we put up with this complexity to gain speed.

The trend:  towards pipelines with more stages, in order to reduce the clock cycle time (e.g., from Pentium III to Pentium IV).