How fast can we run the clock? When the clock strikes, new data is loaded into the registers and the outputs of the registers change. The changed values then have to propagate through the circuits of the machine until they reach the input of the next set of registers. We shouldn't strike the clock again until the signals have time to reach the register inputs; otherwise the values loaded into the inputs may be incorrect. So the minimum clock period = the (maximum) propagation delay of the circuit, from register output to register input.
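To make the relationship concrete, here is a toy calculation with made-up delay numbers (the stage names and delays are illustrative, not measurements of any real machine): the clock period must cover the slowest register-output-to-register-input path, and the maximum clock rate is its inverse.

```python
# Hypothetical propagation delays along the worst-case path, in nanoseconds.
delays_ns = {
    "register file read": 0.5,
    "ALU": 2.0,
    "register file write": 0.5,
}

# The clock can't strike again until the slowest path has settled.
min_period_ns = sum(delays_ns.values())   # worst-case path delay
max_clock_ghz = 1 / min_period_ns         # fastest safe clock rate

print(f"minimum clock period: {min_period_ns} ns")
print(f"maximum clock rate:   {max_clock_ghz:.2f} GHz")
```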
How long does it take a program to run? If the program executes I instructions from start to finish,
time for program = clock period * clock cycles per instruction * I

We can introduce additional registers immediately before and after the ALU. Let's call the registers on the ALU inputs A and B, and the register on the ALU output C. We can then reduce the clock period, because the maximum delay has been decreased (possibly by a factor of 2 or even 3). However, it now takes 3 clock cycles for one register-to-register instruction (one cycle to load A and B from the register file; one cycle to go through the ALU and load C; one cycle to move the result from C to the register file). Is that progress?
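Plugging made-up numbers into the formula shows why, by itself, this is not progress: if adding the A, B, and C registers cuts the period by a factor of 3 but each instruction now takes 3 cycles, program time is unchanged. (The delay values here are hypothetical, chosen only to make the arithmetic clean.)

```python
instructions = 1_000_000

# Original design: long combinational path, 1 cycle per instruction.
period_orig_ns = 3.0
cycles_orig = 1
time_orig_ns = period_orig_ns * cycles_orig * instructions

# With the A/B/C registers: period cut by 3x, but 3 cycles per instruction.
period_new_ns = 1.0
cycles_new = 3
time_new_ns = period_new_ns * cycles_new * instructions

print(time_orig_ns == time_new_ns)  # True: no gain yet, until we pipeline
```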
The benefit: once these stage registers are in place, we can pipeline, starting a new instruction into A and B on every cycle while earlier instructions move through the ALU and out of C. We then complete one instruction per clock cycle, but with our faster clock.
The problems: data hazards and conditional branches. In a data hazard, the operand of one instruction depends on the result of a previous instruction. We have to wait until the first instruction finishes before starting the second instruction. This reduces the value of the pipeline (we no longer have one cycle per instruction), and requires complex circuitry in order to detect the hazard in the first place. If we have a conditional branch, we don't know which instruction will be executed next, so we can't put the next instruction in the pipeline until the conditional branch has finished. Again, this reduces the benefits of pipelining. And all this makes the processor more complicated. Still, once we have millions of transistors to play with, we put up with this complexity to gain speed.
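A small sketch of the data-hazard cost (not a real ISA; the 3-stage model, register names, and 2-cycle stall penalty are assumptions chosen to match the A/B/C design above): each instruction is a destination register plus its source registers, and the pipeline stalls whenever an instruction reads the result of the one just before it.

```python
def pipeline_cycles(program, stall_penalty=2):
    """Cycles to run `program` on a 3-stage pipeline.

    program: list of (dest, sources) tuples, in order.
    """
    cycles = 3                      # first instruction fills all 3 stages
    prev_dest = program[0][0]
    for dest, sources in program[1:]:
        cycles += 1                 # ideally, one new result per cycle
        if prev_dest in sources:    # data hazard: wait for the result
            cycles += stall_penalty
        prev_dest = dest
    return cycles

independent = [("r1", ("r2", "r3")), ("r4", ("r5", "r6")), ("r7", ("r8", "r9"))]
dependent   = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r6", ("r4", "r7"))]

print(pipeline_cycles(independent))  # 5: one result per cycle after filling
print(pipeline_cycles(dependent))    # 9: two stalls of 2 cycles each
```

With dependencies, the same three instructions take almost twice as long, which is exactly the lost value described above.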
The trend: towards pipelines with more stages, in order to reduce the clock cycle time (e.g., from Pentium III to Pentium IV).
(see http://www.ece.arizona.edu/~ece462/Lec03-pipe for some nice pictures)
In contrast, reduced instruction set computers (RISC) have fewer instructions; all instructions are the same size and have one or a few layouts; each instruction does one operation; and all arithmetic operations are done between registers. Because the instructions are much more uniform, and close to the basic machine model, it is simpler to create a pipeline for a RISC. CISC machines like the x86 manage to create pipelines by translating complex instructions internally into more RISC-like operations.
To reduce the pipelining problems due to conditional branches, the IA-64 introduced predicated instructions. Comparison instructions set predicate bits, much like they set condition codes on the x86 machine (except that there are 64 predicate bits). Each operation specifies a predicate bit; it is executed only if the predicate bit = 1. In practice, all operations are performed, but the result is stored into the register file only if the predicate bit = 1. The result is that more instructions are executed, but we don't have to stall the pipeline waiting for a condition.
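The idea can be sketched in a few lines (a simplified model, not IA-64 code: one predicate per arm instead of 64 predicate bits, and Python arithmetic standing in for the commit logic): both arms of the conditional are computed unconditionally, and the predicate only selects which result is kept, so no branch has to be resolved.

```python
def predicated_abs(x):
    """Branchless absolute value in the style of predicated execution."""
    p = int(x < 0)        # the comparison sets the predicate bit (0 or 1)
    neg = -x              # both arms execute regardless of the predicate...
    pos = x
    return p * neg + (1 - p) * pos   # ...but only one result is committed

print(predicated_abs(-5))  # 5
print(predicated_abs(3))   # 3
```

Note the cost stated in the text: both `neg` and `pos` are always computed, so more work is done per instruction, in exchange for never stalling on the condition.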
Itanium chips have been available for about half a year (since summer 2001), but the performance of these early chips does not beat that of top-of-the-line Pentium IVs.