Computer Systems Design


Chapter 4

Homework: Read 4.1-4.4

4.2: Signed and Unsigned Numbers

MIPS uses 2s complement (just like 8086)

To form the 2s complement (of 0000 1111 0000 1010 0000 0000 1111 1100)

Need comparisons for signed and unsigned.

sltu and sltiu

Like slt and slti but the comparison is unsigned.

Homework: 4.1-4.9

4.3: Addition and subtraction

To add two (signed) numbers just add them. That is don't treat the sign bit special.

To subtract A-B, just take the 2s complement of B and add.

Overflows

An overflow occurs when the result of an operation cannot be represented with the available hardware. For MIPS this means when the result does not fit in a 32-bit word.

Homework: Prove this last statement (4.29) (for fun only, do not hand in).

addu, subu, addiu

These three instructions perform addition and subtraction the same way as do add and sub, but do not signal overflow

4.4: Logical Operations

Shifts: sll, srl

Bitwise AND and OR: and, or, andi, ori

No surprises.

4.5: Constructing an ALU--the fun begins

First goal is 32-bit AND, OR, and addition

Recall we know how to build a full adder. We will draw it as shown on the right.









With this adder, the ALU is easy.












With this 1-bit ALU, constructing a 32-bit version is simple.
  1. Use an array of logic elements for the logic. The logic element is the 1-bit ALU
  2. Use buses for A, B, and Result.
  3. ``Broadcast'' Opcode to all of the internal 1-bit ALUs. This means wire the external Opcode to the Opcode input of each of the internal 1-bit ALUs

First goal accomplished.



======== START LECTURE #9 ========

Implementing Addition and Subtraction

We wish to augment the ALU so that we can perform subtraction (as well as addition, AND, and OR).

A 1-bit ALU with ADD, SUB, AND, OR is shown on the right.

  1. To implement addition we use opcode 10 as before and de-assert both b-invert and Cin.
  2. To implement subtraction we still use opcode 10 but we assert both b-invert and Cin.











32-bit version is simply a bunch of these.

(More or less) all ALUs do AND, OR, ADD, SUB. Now we want to customize our ALU for the MIPS architecture, which has a few extra requirements.

  1. slt set-less-than
  2. Overflows
  3. Zero Detect

Implementing SLT

This is fairly clever as we shall see.









Homework: figure out correct rule, i.e. prob 4.23. Hint: when an overflow occurs the sign bit is definitely wrong (so the complement of the sign bit is right).




Implementing Overflow Detection














Simpler Overflow Detection

Recall the simpler overflow detection: An overflow occurs if and only if the carry in to the HOB differs from the carry out of the HOB.

Implementing Zero Detection


Observation: The CarryIn to the LOB and Binvert to all the 1-bit ALUs are always the same. So the 32-bit ALU has just one input called Bnegate, which is sent to the appropriate inputs in the 1-bit ALUs.

The Final Result is

The symbol used for an ALU is on the right

What are the control lines?

What functions can we perform?

What (3-bit) values for the control lines do we need for each function? The control lines are Bnegate (1-bit) and Operation (2-bits)

and 0 00
or 0 01
add 0 10
sub 1 10
slt 1 11


======== START LECTURE #10 ========

Fast Adders

  1. We have done what is called a ripple carry adder.
  2. What about doing the entire 32 (or 64) bit adder with 2 levels of logic?
  3. There are faster adders, e.g. carry lookahead and carry save. We will study carry lookahead adders.

Carry Lookahead Adder (CLA)

This adder is much faster than the ripple adder we did before, especially for wide (i.e., many bit) addition.

To summarize, using a subscript i to represent the bit number,

    to generate  a carry:   gi = ai bi
    to propagate a carry:   pi = ai+bi

H&P give a plumbing analogue for generate and propagate.

Given the generates and propagates, we can calculate all the carries for a 4-bit addition (recall that c0=Cin is an input) as follows (this is the formula version of the plumbing):

c1 = g0 + p0 c0

c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0

c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0

c4 = g3 + p3 c3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0

Thus we can calculate c1 ... c4 in just two additional gate delays (where we assume one gate can accept upto 5 inputs). Since we get gi and pi after one gate delay, the total delay for calculating all the carries is 3 (this includes c4=Carry-Out)

Each bit of the sum si can be calculated in 2 gate delays given ai, bi, and ci. Thus, for 4-bit addition, 5 gate delays after we are given a, b and Carry-In, we have calculated s and Carry-Out.

So, for 4-bit addition, the faster adder takes time 5 and the slower adder time 8.

Now we want to put four of these together to get a fast 16-bit adder.

As black boxes, both ripple-carry adders and carry-lookahead adders (CLAs) look the same.

We could simply put four CLAs together and let the Carry-Out from one be the Carry-In of the next. That is, we could put these CLAs together in a ripple-carry manner to get a hybrid 16-bit adder.


We want to do better so we will put the 4-bit carry-lookahead adders together in a carry-lookahead manner. Thus the diagram above is not what we are going to do.

We start by determining ``super generate'' and ``super propagate'' bits.

P0 = p3 p2 p1 p0          Does the low order 4-bit adder
                          propagate a carry?
P1 = p7 p6 p5 p4

P2 = p11 p10 p9 p8

P3 = p15 p14 p13 p12      Does the high order 4-bit adder
                          propagate a carry?


G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0        Does low order 4-bit
                                                adder generate a carry
G1 = g7 + p7 g6 + p7 p6 g5 + p7 p6 p5 g4

G2 = g11 + p11 g10 + p11 p10 g9 + p11 p10 p9 g8

G3 = g15 + p15 g14 + p15 p14 g13 + p15 p14 p13 g12

From these super generates and super propagates, we can calculate the super carries, i.e. the carries for the four 4-bit adders.

C1 = G0 + P0 c0

C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0

C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0

C4 = G3 + P3 C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 c0

Now these C's (together with the original inputs a and b) are just what the 4-bit CLAs need.

How long does this take, again assuming 5 input gates?

  1. We calculate the p's and g's (lower case) in 1 gate delay (as with the 4-bit CLA).
  2. We calculate the P's one gate delay after we have the p's or 2 gate delays after we start.
  3. The G's are determined 2 gate delays after we have the g's and p's. So the G's are done 3 gate delays after we start.
  4. The C's are determined 2 gate delays after the P's and G's. So the C's are done 5 gate delays after we start.
  5. Now the C's are sent back to the 4-bit CLAs, which have already calculated the p's and g's. The C's are calculated in 2 more gate delays (7 total) and the s's 2 more after that (9 total).

In summary, a 16-bit CLA takes 9 cycles instead of 32 for a ripple carry adder and 14 for the mixed adder.





Some pictures follow.


Take our original picture of the 4-bit CLA and collapse the details so it looks like.







Next include the logic to calculate P and G.



















Now put four of these with a CLA block (to calculate C's from P's, G's and Cin) and we get a 16-bit CLA. Note that we do not use the Cout from the 4-bit CLAs.

Note that the tall skinny box is general. It takes 4 Ps 4Gs and Cin and calculates 4Cs. The Ps can be propagates, superpropagates, superduperpropagates, etc. That is, you take 4 of these 16-bit CLAs and the same tall skinny box and you get a 64-bit CLA.

Homework: 44, 45


======== START LECTURE #11 ========

As noted just above the tall skinny box is useful for all size CLAs. To expand on that point and to review CLAs, let's redo CLAs with the general box.

Since we are doing 4-bits at a time, the box takes 9=2*4+1 input bits and produces 6=4+2 outputs










A 4-bit adder is now the figure on the right

What does the ``?'' box do?

  1. Calculates Gi and Pi based on ai and bi
  2. Calculates si based on ai, bi, and Ci=Cin
  3. Does not bother calculating Cout.







Now take four of these 4-bit adders and use the identical CLA box to get a 16-bit adder.

The picture on the right shows ONE 4-bit adder (the magenta box) used with the CLA box. To get a 16-bit adder you need 3 more magenta box, one above the one shown (to process bits 0-3) and two below (to process bits 8-15).





Four of these 16-bit adders with the identical CLA box gives a 64-bit adder (no picture shown).





Shifter

This is a sequential circuit.




Homework: A 4-bit shift register initially contains 1101. It is shifted six times to the right with the serial input being 101101. What is the contents of the register after each shift.

Homework: Same register, same initial condition. For the first 6 cycles the opcodes are left, left, right, nop, left, right and the serial input is 101101. The next cycle the register is loaded (in parallel) with 1011. The final 6 cycles are the same as the first 6. What is the contents of the register after each cycle?

4.6: Multiplication

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            product = product + multiplicand
        shift multiplicand left 1 bit
        shift multiplier right 1 bit

Do on the board 4-bit multiplication (8-bit registers) 1100 x 1101. Since the result has (up to) 8 bits, this is often called a 4x4->8 multiply.

The diagrams below are for a 32x32-->64 multiplier.

What about the control?



======== START LECTURE #12 ========

This works!

But, when compared to the better solutions to come, is wasteful of resourses and hence is

The product register must be 64 bits since the product can contain 64 bits.

Why is multiplicand register 64 bits?

Why is ALU 64-bits?

POOF!! ... as the smoke clears we see an idea.

We can solve both problems at once

This results in the following algorithm

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            (serial_in, product[32-63]) <- product[32-63] + multiplicand
        shift product right 1 bit
        shift multiplier right 1 bit






What about control

Redo same example on board

A final trick (``gate bumming'', like ``code bumming'' of 60s).

    product[0-31] <- multiplier
    for i = 0 to 31
        if LOB of product = 1
            (serial_in, product[32-63]) <- product[32-63] + multiplicand
        shift product right 1 bit










Control again boring.

Redo the same example on the board.

The above was for unsigned 32-bit multiplication.

What about signed multiplication.

There are faster multipliers, but we are not covering them.

4.7: Division

We are skiping division.

4.8: Floating Point

We are skiping floating point.

4.9: Real Stuff: Floating Point in the PowerPC and 80x86

We are skiping floating point.

Homework: Read 4.10 ``Fallacies and Pitfalls'', 4.11 ``Conclusion'', and 4.12 ``Historical Perspective''.


======== START LECTURE #13 ========

MIDTERM EXAM


======== START LECTURE #14 ========



Allan Gottlieb