Computer Architecture
1999-2000 Fall
MW 11:55-1:10
Ciww 109

Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best

Administrivia

Web Pages

There is a web page for the course. You can find it from my home page.

Textbook

Text is Hennessy and Patterson ``Computer Organization and Design: The Hardware/Software Interface''.

Homeworks and Labs

I make a distinction between homework and labs.

Labs are

Homeworks are

Upper left board for assignments and announcements.

Appendix B: Logic Design

Homework: Read B1

B.2: Gates, Truth Tables and Logic Equations

Homework: Read B2

Digital ==> Discrete

Primarily (but NOT exclusively) binary at the hardware level

Use only two voltages -- high and low

Since this is not an engineering course, we will ignore these issues and assume square waves.

In English digital (think digit, i.e. finger) => 10, but not in computers

Bit = Binary digIT

Instead of saying high voltage and low voltage, we say true and false or 1 and 0 or asserted and deasserted.

0 and 1 are called complements of each other.

A logic block can be thought of as a black box that takes signals in and produces signals out. There are two kinds of blocks

We are doing combinational now. Will do sequential later (few lectures).

TRUTH TABLES

Since combinatorial logic has no memory, it is simply a function from its inputs to its outputs. A Truth Table has as columns all inputs and all outputs. It has one row for each possible set of input values and the output columns have the output for that input. Let's start with a really simple case: a logic block with one input and one output.

There are two columns (1 + 1) and two rows (2**1).

In  Out
0   ?
1   ?

How many are there?

How many different truth tables are there for one in and one out?

Just 4: the constant functions 1 and 0, the identity, and an inverter (pictures in a few minutes). There were two `?'s in the above table; each can be a 0 or 1, so there are 2**2 possibilities.

OK. Now how about two inputs and 1 output.

Three columns (2+1) and 4 rows (2**2).

In1 In2  Out
0   0    ?
0   1    ?
1   0    ?
1   1    ?

How many are there? It is just how many ways can you fill in the output entries. There are 4 output entries so answer is 2**4=16.

How about 2 in and 8 out?

3 in and 8 out?

n in and k out?

Gets big fast!
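
To see just how fast, here is a tiny C sketch that tabulates the count 2**(k * 2**n) (the loop bounds and variable names are mine):

    #include <stdio.h>

    int main(void)
    {
        /* A truth table with n inputs has 2**n rows; each of the k output
           columns can be filled freely, so there are 2**(k * 2**n) tables. */
        for (int n = 1; n <= 3; n++)
            for (int k = 1; k <= 4; k++) {
                unsigned long long rows = 1ULL << n;          /* 2**n */
                unsigned long long tables = 1ULL << (k * rows);
                printf("n=%d k=%d: %llu truth tables\n", n, k, tables);
            }
        /* 3 in and 8 out would be 2**64 -- already past 64-bit range. */
        return 0;
    }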

Boolean algebra

Certain logic functions (i.e. truth tables) are quite common and familiar.

We use a notation that looks like algebra to express logic functions and expressions involving them.

The notation is called Boolean algebra in honor of George Boole.

A Boolean value is a 1 or a 0.
A Boolean variable takes on Boolean values.
A Boolean function takes in Boolean variables and produces Boolean values.

  1. The (inclusive) OR Boolean function of two variables. Draw its truth table. This is written + (e.g. X+Y where X and Y are Boolean variables) and often called the logical sum. (Three out of four output values in the truth table look right!)

  2. AND. Draw TT. Called the logical product and written as a centered dot (like product in regular algebra). All four values look right.

  3. NOT. Draw TT. This is a unary operator (one argument, not two like the above; the two above are called binary). Written A with a bar over it (I will use ' instead of a bar as it is easier for me to type).

  4. Exclusive OR (XOR). Written as + with circle around. True if exactly one input is true (i.e. true XOR true = false). Draw TT.

Homework: Consider the Boolean function of 3 boolean vars that is true if and only if exactly 1 of the three variables is true. Draw the TT.

Some manipulation laws. Remember this is Boolean ALGEBRA.

Identity:

Inverse:

Both + and . are commutative so don't need as much as I wrote

The name inverse law is somewhat funny since you add the inverse and get the identity for the product, or multiply by the inverse and get the identity for the sum.

Associative:

Due to associative law we can write A.B.C since either order of evaluation gives the same answer.

Often elide the . so the product associative law is A(BC)=(AB)C.

Distributive:

How does one prove these laws??

Homework: Do the second distributive law.

Let's do (on the board) the examples on pages B-5 and B-6. Consider a logic function with three inputs A, B, and C; and three outputs D, E, and F defined as follows: D is true if at least one input is true, E if exactly two are true, and F if all three are true. (Note that by if we mean if and only if.)

Draw the truth table.

Show the logic equations


================ Start Lecture 2 ================

The first way we solved part E shows that any logic function can be written using just AND, OR, and NOT. Indeed, it is in a nice form. Called two levels of logic, i.e. it is a sum of products of just inputs and their complements.

DeMorgan's laws:

You prove DM laws with TTs. Indeed that is ...

Homework: B.6 on page B-45

Do beginning of HW on the board.
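
Since each law involves only 2 variables, the whole proof is a 4-row table. A C sketch of the exhaustive check (variable names are mine):

    #include <stdio.h>

    int main(void)
    {
        /* DeMorgan: (A+B)' = A'B'  and  (AB)' = A'+B' */
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                printf("A=%d B=%d: (A+B)'=%d A'B'=%d  (AB)'=%d A'+B'=%d\n",
                       a, b, !(a | b), !a & !b, !(a & b), !a | !b);
        return 0;
    }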

With DM we can do quite a bit without resorting to TTs. For example one can show that the two expressions for E in the example above (page B-6) are equal. Indeed that is

Homework: B.7 on page B-45

Do beginning of HW on board.

GATES

Gates implement basic logic functions: AND OR NOT XOR Equivalence

Show why the picture is equivalence, i.e., (A XOR B)' is AB + A'B'

(A XOR B)' =
(A'B+AB')' = 
(A'B)' (AB')' = 
(A''+B') (A'+B'') = 
(A + B') (A' + B) = 
AA' + AB + B'A' + B'B = 
0   + AB + B'A' + 0 = 
AB + A'B'

Often omit the inverters and draw the little circles at the input or output of the other gates (AND OR). These little circles are sometimes called bubbles.

This explains why inverter is drawn as a buffer with a bubble.

Homework: B.2 on page B-45 (I previously did the first part of this homework).

Homework: Consider the Boolean function of 3 boolean vars (i.e. a three input function) that is true if and only if exactly 1 of the three variables is true. Draw the TT. Draw the logic diagram with AND OR NOT. Draw the logic diagram with AND OR and bubbles.

We have seen that any logic function can be constructed from AND OR NOT. So this triple is called universal.

Are there any pairs that are universal? Could it be that there is a single function that is universal? YES!

NOR (NOT OR) is true when OR is false. Do TT.

NAND (NOT AND) is true when AND is false. Do TT.

Draw two logic diagrams for each, one from definition and equivalent one with bubbles.

Theorem A 2-input NOR is universal and A 2-input NAND is universal.

Proof

We must show that you can get A', A+B, and AB using just a two input NOR.
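
Here is that construction as a C sketch (function names are mine); note that every helper bottoms out in the 2-input NOR and nothing else:

    #include <stdio.h>

    int nor(int a, int b)  { return !(a | b); }

    int not_(int a)        { return nor(a, a); }             /* A' = A NOR A     */
    int or_(int a, int b)  { return not_(nor(a, b)); }       /* A+B = (A NOR B)' */
    int and_(int a, int b) { return nor(not_(a), not_(b)); } /* AB = A' NOR B'   */

    int main(void)
    {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                printf("A=%d B=%d: A'=%d A+B=%d AB=%d\n",
                       a, b, not_(a), or_(a, b), and_(a, b));
        return 0;
    }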

Homework: Show that a 2-input NAND is universal.

Can draw NAND and NOR each two ways (because (AB)' = A' + B')

We have seen how to get a logic function from a TT. Indeed we can get one that is just two levels of logic. But it might not be the simplest possible. That is we may have more gates than necessary.

Trying to minimize the number of gates is NOT trivial. Mano covers this in detail. We will not cover it in this course. It is not in H&P. I actually like it but must admit that it takes a few lectures to cover well and it is not used so much since it is algorithmic and is done automatically by CAD tools.

Minimization is not unique, i.e. there can be two or more minimal forms.

Given A'BC + ABC + ABC'
Combine first two to get BC + ABC'
Combine last two to get A'BC + AB

Sometimes when building a circuit, you don't care what the output is for certain input values. For example, it may be that the input combination cannot occur. Another example occurs when, for this combination of input values, a later part of the circuit will ignore the output of this part. These are called don't care output situations. Making use of don't cares can reduce the number of gates needed.

Can also have don't care inputs when, for certain values of a subset of the inputs, the output is already determined and you don't have to look at the remaining inputs. We will see a case of this in the very next topic, multiplexors.

An aside on theory

Putting a circuit in disjunctive normal form (i.e. two levels of logic) means that every path from the input to the output goes through very few gates. In fact only two, an OR and an AND. Maybe we should say three since the AND can have a NOT (bubble). Theoreticians call this number (2 or 3 in our case) the depth of the circuit. So we see that every logic function can be implemented with small depth. But what about the width, i.e., the number of gates?

The news is bad. The parity function takes n inputs and gives TRUE if and only if the number of input TRUEs is odd. If the depth is fixed (say limited to 3), the number of gates needed for parity is exponential in n.

B.3 COMBINATIONAL LOGIC

Homework: Read B.3.

Generic Homework: Read sections in book corresponding to the lectures.

Multiplexor

Often called a mux or a selector

Show equiv circuit with AND OR

Hardware if-then-else

    if S=0
        M=A
    else
        M=B
    endif

Can have 4 way mux (2 selector lines)

This is an if-then-elif-elif-else

   if S1=0 and S2=0
        M=A
    elif S1=0 and S2=1
        M=B
    elif S1=1 and S2=0
        M=C
    else -- S1=1 and S2=1
        M=D
    endif
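
The if-then-elif chain above is what the mux means; what the gates actually compute is a single sum of products. A C fragment of that view (0/1 values, bitwise operators standing in for gates):

    /* 4-way mux as two levels of logic:
       M = S1'S2'A + S1'S2 B + S1 S2'C + S1 S2 D */
    int mux4(int a, int b, int c, int d, int s1, int s2)
    {
        return (!s1 & !s2 & a) | (!s1 & s2 & b)
             | ( s1 & !s2 & c) | ( s1 & s2 & d);
    }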

Do a TT for 2 way mux. Redo it with don't care values.
Do a TT for 4 way mux with don't care values.


================ Start Lecture 3 ================

Homework: B-12. Assume you have constant signals 1 and 0 as well.

Decoder

Encoder

Sneaky way to see that NAND is universal.

Half Adder

Homework: Draw logic diagram

Full Adder

Homework:

How about 4 bit adder ?

How about n bit adder ?


================ Start Lecture 4 ================

PLAs--Programmable Logic Arrays

Idea is to make use of the algorithmic way you can look at a TT and produce a circuit diagram in the sum of products form.

Consider the following TT from the book (page B-13)

     A | B | C || D | E | F
     --+---+---++---+---+--
     0 | 0 | 0 || 0 | 0 | 0
     0 | 0 | 1 || 1 | 0 | 0
     0 | 1 | 0 || 1 | 0 | 0
     0 | 1 | 1 || 1 | 1 | 0
     1 | 0 | 0 || 1 | 0 | 0
     1 | 0 | 1 || 1 | 1 | 0
     1 | 1 | 0 || 1 | 1 | 0
     1 | 1 | 1 || 1 | 0 | 1

Here is the circuit diagram for this truth table.

Here it is redrawn

Finally, it can be redrawn in a more abstract form.

When a PLA is manufactured all the connections have been specified. That is, a PLA is specific for a given circuit. The name is somewhat of a misnomer since a PLA is not programmable by the user.

Homework: B.10 and B.11

Can also have a PAL or Programmable array logic in which the final dots are specified by the user. The manufacturer produces a ``sea of gates''; the user programs it to the desired logic function.

Homework: Read B-5

ROMs

One way to implement a mathematical (or C) function (without side effects) is to perform a table lookup.

A ROM (Read Only Memory) is the analogous way to implement a logic function.

Important: The ROM does not have state. It is still a combinational circuit. That is, it does not represent ``memory''. The reason is that once a ROM is manufactured, the output depends only on the input.

A PROM is a programmable ROM. That is, you buy the ROM with ``nothing'' in its memory and then, before it is placed in the circuit, you load the memory and never change it.

An EPROM is an erasable PROM. It costs more but if you decide to change its memory this is possible (but is slow).

``Normal'' EPROMs are erased by some ultraviolet light process. But EEPROMs (electrically erasable PROMS) are faster and are done electronically.

All these EPROMS are erasable not writable, i.e. you can't just change one bit.

A ROM is similar to PLA

Don't Cares

Example

Full truth table

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     0   1   1 || 1   1   0
     1   0   0 || 1   1   1
     1   0   1 || 1   1   0
     1   1   0 || 1   1   0
     1   1   1 || 1   1   1

Put in the output don't cares

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     0   1   1 || 1   1   X
     1   0   0 || 1   1   X
     1   0   1 || 1   1   X
     1   1   0 || 1   1   X
     1   1   1 || 1   1   X

Now do the input don't cares

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     X   1   1 || 1   1   X
     1   X   X || 1   1   X

These don't cares are important for logic minimization. Compare the number of gates needed for the full TT and the reduced TT. There are techniques for minimizing logic, but we will not cover them.

Arrays of Logic Elements

32-bit mux


*********** Big Change Coming ***********

Sequential Circuits, Memory, and State

Why do we want to have state?

Assume you have a real OR gate. Assume the two inputs are both zero for an hour. At time t one input becomes 1. The output will OSCILLATE for a while before settling on exactly 1. We want to be sure we don't look at the answer before it's ready.


start lecture #5

B.4: Clocks

Frequency and period

Edges

Synchronous system

Now we are going to add state elements to the combinational circuits we have been using previously.

Remember that a combinational/combinatorial circuit has its outputs determined by its inputs, i.e. combinatorial circuits do not contain state.

Reading and writing State Elements.

State elements include state (naturally).

B.5: Memory Elements

We want edge-triggered clocked memory and will only use edge-triggered clocked memory in our designs. However we get there by stages. We first show how to build unclocked memory; then using unclocked memory we build level-sensitive clocked memory; finally from level-sensitive clocked memory we build edge-triggered clocked memory.

Unclocked Memory

S-R latch (set-reset)

Clocked Memory: Flip-flops and latches

The S-R latch defined above is not clocked memory. Unfortunately the terminology is not perfect.

For both flip-flops and latches the output equals the value stored in the structure. Both have an input and an output (and the complemented output) and a clock input as well. The clock determines when the internal value is set to the current input. For a latch, the change occurs whenever the clock is asserted (level sensitive). For a flip-flop, the change only occurs during the active edge.

D latch

The D is for data

In the traces below notice how the output follows the input when the clock is high and remains constant when the clock is low. We assume the stored value is initially low.

D or Master-Slave Flip-flop

This was our goal. We now have an edge-triggered, clocked memory.

Note how much less wiggly the output is with the master-slave flop than before with the transparent latch. As before we are assuming the output is initially low.

Homework: Try moving the inverter to the other latch. What has changed?

This picture shows the setup and hold times discussed above. It is crucial when building circuits with flip flops that D is stable during the interval between the setup and hold times. Note that D is wild outside the critical interval, but that is OK.

Homework: B.18


start lecture #6

Registers

Register File

Set of registers each numbered

To read just need mux from register file to select correct register.


start lecture #7

For writes use a decoder on register number to determine which register to write. Note that 3 errors in the book's figure were fixed

The idea is to gate the write line with the output of the decoder. In particular, we should perform a write to register r this cycle providing

Homework: 20

SRAMS and DRAMS

Note: There are other kinds of flip-flops: T, J-K. Also one could learn about excitation tables for each. We will not cover this material (H&P doesn't either). If interested, see Mano.

Counters

A counter counts (naturally). The counting is done in binary.

Let's look at the state transition diagram for A, the output of a 1-bit counter.

We need one flop and a combinatorial circuit.

The flop producing A is often itself called A and the D input to this flop is called DA (really D sub A).

Current      Next ||
   A    I R   A   || DA <-- i.e. to what must I set DA
------------------++--      in order to
   0    0 0   0   || 0      get the desired Next A for
   1    0 0   1   || 1      the next cycle
   0    1 0   1   || 1
   1    1 0   0   || 0
   x    x 1   0   || 0

But this table (without Next A) is the truth table for the combinatorial circuit.

A I R  || DA
-------++--
0 0 0  || 0
1 0 0  || 1
0 1 0  || 1
1 1 0  || 0
x x 1  || 0

DA = R' (A XOR I)
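
A quick behavioral check of this equation in C (a sketch; the input sequences are made up):

    #include <stdio.h>

    int main(void)
    {
        int A = 0;                        /* state of the flop         */
        int I[] = {1, 1, 1, 0, 1, 1};     /* increment input per cycle */
        int R[] = {0, 0, 0, 0, 1, 0};     /* reset input per cycle     */
        for (int t = 0; t < 6; t++) {
            int DA = !R[t] & (A ^ I[t]);  /* DA = R'(A XOR I) */
            printf("cycle %d: A=%d I=%d R=%d  next A=%d\n",
                   t, A, I[t], R[t], DA);
            A = DA;                       /* clock edge: flop loads DA */
        }
        return 0;
    }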

start lecture #8

How about a two bit counter.

To determine the combinatorial circuit we could proceed as before

Current      Next ||
  A B   I R  A B  || DA DB
------------------++------

This would work but we can instead think about how a counter works and see that:

DA = R'(A XOR I)
DB = R'(B XOR AI)
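
Again a behavioral check is easy; the sketch below counts 0,1,2,3,0,... with A the low-order bit (names as in the equations):

    #include <stdio.h>

    int main(void)
    {
        int A = 0, B = 0;                 /* A = low-order bit, B = high-order */
        int I = 1, R = 0;                 /* always increment, never reset     */
        for (int t = 0; t < 8; t++) {
            int DA = !R & (A ^ I);        /* DA = R'(A XOR I)  */
            int DB = !R & (B ^ (A & I));  /* DB = R'(B XOR AI) */
            printf("t=%d: count BA = %d%d\n", t, B, A);
            A = DA; B = DB;
        }
        return 0;
    }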

Homework: 23

B.6: Finite State Machines

Skipped

B.7 Timing Methodologies

Skipped

Chapter 1

Homework: READ chapter 1. Do 1.1 -- 1.26 (really one matching question)
Do 1.27 to 1.44 (another matching question),
1.45 (and do 7200 RPM and 10,000 RPM), 1.46, 1.50

Chapter 3

Homework: Read sections 3.1 3.2 3.3

3.4 Representing instructions in the Computer (MIPS)

Register file

Homework: 3.2

R-type instruction (R for register)

    op    rs    rt    rd    shamt  funct
    6     5     5     5      5      6

The fields are quite consistent

Example: add $1,$2,$3

I-type (why I?)

    op    rs    rt   address
    6     5     5     16

lw/sw $1,addr($2)

RISC-like properties of the MIPS architecture

Branching instruction

slt (set less-than)

beq and bne (branch (not) equal)

blt (branch if less than)

ble (branch if less than or equal)

bgt (branch if greater than)

bge (branch if greater than or equal)

Note: Please do not make the mistake of thinking that

    slt $1,$5,$8
    beq $1,$0,L
is the same as
    slt $1,$8,$5
    bne $1,$0,L
The negation of X < Y is not Y < X; it is X >= Y.

Homework: 3.12-3.17

J-type instructions (J for jump)

        op   address
        6     26

j (jump)

jr (jump register)

jal (jump and link)

I type instructions (revisited)

addi (add immediate)

    addi $1,$2,100

Why is there no subi?
Ans: Make the immediate operand negative.

slti (set less-than immediate)

    slti $1,$2,50

**** START LECTURE #9 ****

Handout Lab #1 and supporting sheets

lui (load upper immediate)

Homework: 3.1, 3.3-3.7, 3.9, 3.18, 3.37 (for fun)

Chapter 4

Homework: Read 4.1-4.4

Homework: 4.1-4.9

4.2: Signed and Unsigned Numbers

MIPS uses 2s complement (just like 8086)

To form the 2s complement (of 0000 1111 0000 1010 0000 0000 1111 1100)
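
The rule is: complement every bit and then add 1. In C (the hex constant is just the bit pattern above):

    #include <stdio.h>

    int main(void)
    {
        unsigned x = 0x0F0A00FC;  /* 0000 1111 0000 1010 0000 0000 1111 1100 */
        unsigned neg = ~x + 1;    /* complement every bit, then add 1 */
        printf(" x = 0x%08X\n", x);
        printf("-x = 0x%08X\n", neg);     /* 0xF0F5FF04 */
        return 0;
    }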

Need comparisons for signed and unsigned.

sltu and sltiu

Just like slt and slti but the comparison is unsigned.

4.3: Addition and subtraction

To add two (signed) numbers just add them. That is don't treat the sign bit special.

To subtract A-B, just take the 2s complement of B and add.

Overflows

An overflow occurs when the result of an operation cannot be represented with the available hardware. For MIPS this means when the result does not fit in a 32-bit word.


*** START LECTURE #10 ****

> I have a question about the first lab; I'm not sure how we
> would implement a mux, would a series of if-else
> statements be an acceptable option?

No.  But that is a good question.  if-then-elif...-else
would be a FUNCTIONAL simulation.  That is you are
simulating what the mux does but not HOW it does it.  For a
gate level simulation, you need to implement the mux in
terms of AND, NOT, OR, XOR and then write code like
FullAdder.c.

The implementation of a two way mux in terms of AND OR NOT
is figure B.4 on page B-9 of the text.  You need to do a 3
way mux.

Homework: (for fun) prove this last statement (4.29)
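
Back to the mux question from the email above: a gate-level 2-way mux in C looks roughly like this (a sketch of the figure B.4 circuit, not the lab solution; the lab needs a 3-way mux):

    /* 2-way mux, gate level only: M = S'A + SB */
    int NOT(int a)        { return !a; }
    int AND(int a, int b) { return a & b; }
    int OR (int a, int b) { return a | b; }

    int mux2(int A, int B, int S)
    {
        return OR(AND(NOT(S), A), AND(S, B));
    }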

addu, subu, addiu

These add and subtract the same way as add and sub, but do not signal overflow

4.4: Logical Operations

Shifts: sll, srl

Bitwise AND and OR: and, or, andi, ori

No surprises.

4.5: Constructing an ALU--the fun begins

First goal is 32-bit AND, OR, and addition

Recall we know how to build a full adder. Will draw it as

With this adder, the ALU is easy.

32-bit version is simple.

  1. Use an array of logic elements for the logic. The logic element is the 1-bit ALU
  2. Use buses for A, B, and Result.
  3. ``Broadcast'' Opcode to all of the internal 1-bit ALUs. This means wire the external Opcode to the Opcode input of each of the internal 1-bit ALUs

First goal accomplished.

How about subtraction?

1-bit ALU with ADD, SUB, AND, OR is

For subtraction set Binvert and Cin.

32-bit version is simply a bunch of these.

Simulating Combinatorial Circuits at the Gate Level

Write a procedure for each logic box with the following properties.

Handout: FullAdder.c and FourBitAdder.c.
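
A minimal sketch of the style those handouts use (my reconstruction, not the handout itself): one procedure per logic box, gates only.

    #include <stdio.h>

    static int XOR(int a, int b) { return a ^ b; }
    static int AND(int a, int b) { return a & b; }
    static int OR (int a, int b) { return a | b; }

    /* full adder: sum = a XOR b XOR cin, cout = ab + cin(a XOR b) */
    static void FullAdder(int a, int b, int cin, int *sum, int *cout)
    {
        int t = XOR(a, b);
        *sum  = XOR(t, cin);
        *cout = OR(AND(a, b), AND(cin, t));
    }

    int main(void)
    {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                for (int c = 0; c <= 1; c++) {
                    int s, co;
                    FullAdder(a, b, c, &s, &co);
                    printf("%d+%d+%d = carry %d sum %d\n", a, b, c, co, s);
                }
        return 0;
    }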

Lab 1: Do the equivalent for 1-bit-alu (without subtraction). This is easy. Lab 2 will be similar but for a more sophisticated ALU.

Extra requirements for MIPS alu:

  1. slt set-less-than





  2. Why isn't this method used?

  3. Ans: It is wrong!

  4. Example using 3 bit numbers (i.e. -4 .. 3). Try slt on -3 and +2. True subtraction (-3 - +2) gives -5. The negative sign in -5 indicates (correctly) that -3 < +2. But three bit subtraction -3 - +2 gives +3! Hence we will incorrectly conclude that -3 is NOT less than +2. (Really, it signals an overflow, unless doing unsigned.)

  5. Solution: Need the correct rule for less than (not just sign of subtraction)

    Homework: figure out correct rule, i.e. prob 4.23. Hint: when an overflow occurs the sign bit is definitely wrong (so the complement of the sign bit is right).

    1. Overflows
      • The HOB is already unique (outputs SET)
      • Need to enhance it some more to produce the overflow output
      • Recall that we gave the rule for overflow: you need to examine
        • Whether add or sub (binvert)
        • The sign of A
        • The sign of B
        • The sign of the result
        • Since this is the HOB we have all the sign bits.
        • The book also uses Cout, but this appears to be an error



    2. Simpler overflow detection
      • An overflow occurs if and only if the carry in to the HOB differs from the carry out of the HOB



    3. Zero Detect
      • To see if all bits are zero just need NOR of all the bits
      • Conceptually trivial but does require some wiring

    4. Observation: The initial Cin and Binvert are always the same. So just use one input called Bnegate.

    The Final Result is

    Symbol for the alu is

    What are the control lines?

    What functions can we perform?

    What (3-bit) values for the control lines do we need for each function?
    and 0 00
    or 0 01
    add 0 10
    sub 1 10
    slt 1 11
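
    Behaviorally (not at the gate level), the ALU's response to the three control bits can be sketched in C as follows (packing Bnegate:OP into one int is my convention):

        /* behavioral ALU sketch: ctl holds Bnegate:OP as a 3-bit value */
        int alu(int ctl, int a, int b)
        {
            switch (ctl) {
            case 0: return a & b;           /* 0 00 and */
            case 1: return a | b;           /* 0 01 or  */
            case 2: return a + b;           /* 0 10 add */
            case 6: return a - b;           /* 1 10 sub */
            case 7: return (a < b) ? 1 : 0; /* 1 11 slt */
            default: return 0;              /* encodings not used */
            }
        }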


    *** START LECTURE #11 ****

    Adders

    1. We have done what is called a ripple carry adder.
      The carry "ripples" from one bit to the next (LOB to HOB).
      So the time required is proportional to the wordlength.
      Each carry can be computed with two levels of logic (any function can be so computed) hence the number of gate delays is 2*wordsize.
    2. What about doing the entire 32 (or 64) bit adder with 2 levels of logic?
      • Such a circuit clearly exists. Why?
      • Ans: A two levels of logic circuit exists for any function.
      • But it would be very expensive: many gates and wires.
      • The big problem: The AND and OR gates have high fan-in, i.e., they have a large number of inputs. It is not true that a 64-input AND takes the same time as a 2-input AND.
      • Unless you are doing full custom VLSI, you get a toolbox of primitive functions (say 4 input NAND) and must build from that
    3. There are faster adders, e.g. carry lookahead and carry save. We will study the carry lookahead.

    Carry Lookahead adders

    We did a ripple adder

    We will now do the carry lookahead adder, which is much faster, especially for many bit (e.g. 64 bit) addition.

    For each bit we can in one gate delay calculate

        generate a carry    gi = ai bi
    
        propagate a carry   pi = ai+bi
    

    H&P give a plumbing analogue for generate and propagate.

    Given the generates and propagates, we can calculate all the carries for a 4-bit addition (recall that c0=Cin is an input) as follows

    c1 = g0 + p0 c0
    
    c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0
    
    c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
    
    c4 = g3 + p3 c3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0
    

    Thus we can calculate c1 ... c4 in just two additional gate delays (where we assume one gate can accept up to 5 inputs). Since we get gi and pi after one gate delay, the total delay for calculating all the carries is 3 (this includes c4=CarryOut)

    Each bit of the sum si can be calculated in 2 gate delays given ai, bi, and ci. Thus 5 gate delays after we are given a, b and CarryIn, we have calculated s and CarryOut

    So for 4-bit addition the faster adder takes time 5 and the slower adder time 8.
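
    The carry calculation transcribes directly into C (a sketch; the loop form below is logically identical to the expanded two-gate-delay equations above):

        /* carry lookahead, 4 bits: a[i], b[i], c0 are 0/1; fills c[0..4] */
        void cla4(const int a[4], const int b[4], int c0, int c[5])
        {
            int g[4], p[4];
            for (int i = 0; i < 4; i++) {
                g[i] = a[i] & b[i];  /* generate:  gi = ai bi */
                p[i] = a[i] | b[i];  /* propagate: pi = ai+bi */
            }
            c[0] = c0;
            for (int i = 0; i < 4; i++)
                c[i + 1] = g[i] | (p[i] & c[i]);  /* ci+1 = gi + pi ci */
        }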

    Now we want to put four of these together to get a fast 16-bit adder. Again we are assuming a gate can accept up to 5 inputs. It is important that the number of inputs per gate does not grow with the size of the numbers to add. If the technology available supplies only 4-input gates, we would use groups of 3 bits rather than four.

    We start by determining ``supergenerate'' and ``superpropagate'' bits. The superpropagate indicates whether the 4-bit adder constructed above propagates a CarryIn to a CarryOut; the supergenerate indicates whether it generates a CarryOut on its own.

    P0 = p3 p2 p1 p0          Does the low order 4-bit adder
                              propagate a carry?
    P1 = p7 p6 p5 p4
    
    P2 = p11 p10 p9 p8
    
    P3 = p15 p14 p13 p12      Does the high order 4-bit adder
                              propagate a carry?
    
    
    G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0        Does low order 4-bit
                                                    adder generate a carry
    G1 = g7 + p7 g6 + p7 p6 g5 + p7 p6 p5 g4
    
    G2 = g11 + p11 g10 + p11 p10 g9 + p11 p10 p9 g8
    
    G3 = g15 + p15 g14 + p15 p14 g13 + p15 p14 p13 g12
    
    
    C1 = G0 + P0 c0
    
    C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0
    
    C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0
    
    C4 = G3 + P3 C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 c0
    

    From these C's you just need to do a 4-bit CLA since the C's are the CarryIns for each group of 4-bits out of the 16-bits.

    How long does this take, again assuming 5 input gates?

    Some pictures follow.

    Take our original picture of the 4-bit CLA and collapse the details so it looks like.

    Next include the logic to calculate P and G

    Now put four of these with a CLA block (to calculate C's from P's, G's and Cin) and we get a 16-bit CLA. Note that we do not use the Cout from the 4-bit CLAs.

    Note that the tall skinny box is general. It takes 4 Ps, 4 Gs, and Cin and calculates 4 Cs. The Ps can be propagates, superpropagates, superduperpropagates, etc. That is, you take 4 of these 16-bit CLAs and the same tall skinny box and you get a 64-bit CLA.

    Homework: 4.44, 4.45


    *** START LECTURE #12 ****

    As noted just above the tall skinny box is useful for all size CLAs. To expand on that point and to review CLAs, let's redo CLAs with the general box.

    Since we are doing 4-bits at a time, the box takes 9=2*4+1 input bits and produces 6=4+2 outputs

    A 4-bit adder is now

    What does the ``?'' box do?

    Now take four of these 4-bit adders and use the identical CLA box to get a 16-bit adder

    Four of these 16-bit adders with the identical CLA box to gives a 64-bit adder.

    Shifter

    Homework: A 4-bit shift register initially contains 1101. It is shifted six times to the right with the serial input being 101101. What are the contents of the register after each shift?

    Homework: Same register, same init condition. For the first 6 cycles the opcodes are left, left, right, nop, left, right and the serial input is 101101. The next cycle the register is loaded (in parallel) with 1011. The final 6 cycles are the same as the first 6. What are the contents of the register after each cycle?

    Multipliers

        product <- 0
        for i = 0 to 31
            if LOB of multiplier = 1
                product = product + multiplicand
            shift multiplicand left 1 bit
            shift multiplier right 1 bit
    

    Do on board 4-bit addition (8-bit registers) 1100 x 1101
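
    The same example as a behavioral C sketch (8-bit registers as on the board; 1100 x 1101 = 12 x 13):

        #include <stdio.h>

        int main(void)
        {
            unsigned multiplicand = 0xC;   /* 1100 = 12 */
            unsigned multiplier   = 0xD;   /* 1101 = 13 */
            unsigned product      = 0;
            for (int i = 0; i < 4; i++) {          /* 4-bit version */
                if (multiplier & 1)                /* LOB of multiplier */
                    product += multiplicand;
                multiplicand <<= 1;                /* shift multiplicand left */
                multiplier  >>= 1;                 /* shift multiplier right  */
            }
            printf("product = %u\n", product);     /* 156 = 1001 1100 */
            return 0;
        }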

    What about the control?


    *** START LECTURE #13 ****

    This works!
    but, when compared to the better solutions to come, is wasteful of resources.

    The product register must be 64 bits since the product is 64 bits

    Why is multiplicand register 64 bits?

    Why is ALU 64-bits?

    POOF!! ... as the smoke clears we see an idea.

    We can solve both problems at once

    This results in the following algorithm

        product <- 0
        for i = 0 to 31
            if LOB of multiplier = 1
                (serial_in, product[32-63]) <- product[32-63] + multiplicand
            shift product right 1 bit
            shift multiplier right 1 bit
    

    What about control

    Redo same example on board

    A final trick (``gate bumming'', like code bumming of 60s)

        product[0-31] <- multiplier
        for i = 0 to 31
            if LOB of product = 1
                (serial_in, product[32-63]) <- product[32-63] + multiplicand
            shift product right 1 bit
    

    Control again boring

    Redo same example on board

    The above was for unsigned 32-bit multiplication

    For signed multiplication

    There are faster multipliers, but we are not covering them.

    We are skipping division.

    We are skipping floating point.

    Homework: Read 4.11 ``Historical Perspective''.


    *** START LECTURE #14 ****

    Midterm Exam
    *** START LECTURE #15 ****

    Lab 2. Due in three weeks. Modify lab 1 to deal with sub, slt, zero detect, overflow. Also lab 2 is to be 32 bits. That is, Figure 4.18.

    Go over the exam.

    Chapter 5: The processor: datapath and control

    Homework: Start Reading Chapter 5.

    5.1: Introduction

    We are going to build the MIPS processor

    Figure 5.1 redrawn below shows the main idea

    Note that the instruction gives the three register numbers as well as an immediate value to be added.

    5.2: Building a datapath

    Let's begin doing the pieces in more detail.

    Instruction fetch

    We are ignoring branches for now.

    R-type instructions


    ======== START LECTURE #16 ========

    Don't forget the mirror site. My main website will be going down for an OS upgrade. Start at http://cs.nyu.edu/

    load and store

    lw  $r,disp($s)
    sw  $r,disp($s)
    

    There is a cheat here.

    Branch on equal (beq)

    Compare two registers and branch if equal

    5.3: A simple implementation scheme

    We will just put the pieces together and then figure out the control lines that are needed and how to set them. We are not now worried about speed.

    We are assuming that the instruction memory and data memory are separate. So we are not permitting self modifying code. We are not showing how either memory is connected to the outside world (i.e. we are ignoring I/O).

    We have to use the same register file with all the pieces since when a load changes a register a subsequent R-type instruction must see the change, and when an R-type instruction makes a change the lw/sw must see it (for loading or calculating the effective address, etc.).

    We could use separate ALUs for each type but it is easy not to so we will use the same ALU for all. We do have a separate adder for incrementing the PC.

    Combining R-type and lw/sw

    The problem is that some inputs can come from different sources.

    1. For R-type, both ALU operands are registers. For I-type (lw/sw) the second operand is the (sign extended) immediate field.
    2. For R-type, the write data comes from the ALU. For lw it comes from the memory.
    3. For R-type, the write register comes from field rd, which is bits 15-11. For lw, the write register comes from field rt, which is bits 20-16.

    We will deal with the first two now by using a mux for each. We will deal with the third shortly by (surprise) using a mux.

    Combining R-type and lw/sw

    Including instruction fetch

    This is quite easy

    Finally, beq

    We need to have an ``if stmt'' for PC (i.e., a mux)

    Homework:


    ======== START LECTURE #17 ========

    The control for the datapath

    We start with our last figure, which shows the data path and then add the missing mux and show how the instruction is broken down.

    We need to set the muxes.

    We need to generate the three ALU cntl lines: 1-bit Bnegate and 2-bit OP

        And     0 00
        Or      0 01
        Add     0 10
        Sub     1 10
        Set-LT  1 11
    
    Homework: What happens if we use 1 00? if we use 1 01? Ignore the funny business in the HOB. The funny business ``ruins'' these ops.

    What information can we use to decide on the muxes and alu cntl lines?

    The instruction!

    So no problem, just do a truth table.

    We will let the main control (to be done later) ``summarize'' the opcode for us. It will generate a 2-bit field ALUop

        ALUop   Action needed by ALU
    
        00      Addition (for load and store)
        01      Subtraction (for beq)
        10      Determined by funct field (R-type instruction)
        11      Not used
    

    How many entries do we have now in the truth table

        ALUop | Funct        ||  Bnegate:OP
        1 0   | 5 4 3 2 1 0  ||  B OP
        ------+--------------++------------
        0 0   | x x x x x x  ||  0 10
        x 1   | x x x x x x  ||  1 10
        1 x   | x x 0 0 0 0  ||  0 10
        1 x   | x x 0 0 1 0  ||  1 10
        1 x   | x x 0 1 0 0  ||  0 00
        1 x   | x x 0 1 0 1  ||  0 01
        1 x   | x x 1 0 1 0  ||  1 11
        

    The circuit is then easy.
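
    Transcribing that truth table into C is mechanical (a sketch; I pack Bnegate:OP into one 3-bit value):

        /* ALU control: 2-bit ALUop plus low 4 bits of funct -> Bnegate:OP */
        int alu_control(int aluop, int funct)
        {
            if (aluop == 0) return 2;  /* 0 10 add (lw/sw) */
            if (aluop & 1)  return 6;  /* 1 10 sub (beq)   */
            switch (funct & 0xF) {     /* R-type: look at funct */
            case 0x0: return 2;        /* 0 10 add */
            case 0x2: return 6;        /* 1 10 sub */
            case 0x4: return 0;        /* 0 00 and */
            case 0x5: return 1;        /* 0 01 or  */
            case 0xA: return 7;        /* 1 11 slt */
            default:  return 0;        /* not in the table */
            }
        }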


    ======== START LECTURE #18 ========

    I cleaned up the discussion about OP[2-0] from the end of last time

    Now we need the main control

    So 9 bits

    The following figure shows where these occur.

    They all are determined by the opcode

    The MIPS instruction set is fairly regular. Most fields we need are always in the same place in the instruction.

    MemRead: Memory delivers the value stored at the specified addr
    MemWrite: Memory stores the specified value at the specified addr
    ALUSrc: Second ALU operand comes from (reg-file / sign-ext-immediate)
    RegDst: Number of reg to write comes from the (rt / rd) field
    RegWrite: Reg-file stores the specified value in the specified register
    PCSrc: New PC is Old PC+4 / Branch target
    MemtoReg: Value written in reg-file comes from (alu / mem)

    We have seen the wiring before (and given a hardcopy handout)

    We are interested in four opcodes

    Do a stage play

    The following figures illustrate the play.

    We start with R-type instructions

    Next we show lw

    The following truth table shows the settings for the control lines for each opcode. This is drawn differently since the labels of what should be the columns are long (e.g. RegWrite) and it is easier to have long labels for rows.

    Signal    R-type  lw   sw   beq
    Op5       0       1    1    0
    Op4       0       0    0    0
    Op3       0       0    1    0
    Op2       0       0    0    1
    Op1       0       1    1    0
    Op0       0       1    1    0
    RegDst    1       0    X    X
    ALUSrc    0       1    1    0
    MemtoReg  0       1    X    X
    RegWrite  1       1    0    0
    MemRead   0       1    0    0
    MemWrite  0       0    1    0
    Branch    0       0    0    1
    ALUOp1    1       0    0    0
    ALUOp0    0       0    0    1

    Now it is straightforward but tedious to get the logic equations

    When drawn in pla style the circuit is

    Homework: 5.5 and 5.11 control, 5.1, 5.2, 5.10 (just the single-cycle datapath) 5.11

    Implementing a J-type instruction, unconditional jump

        opcode  addr
        31-26   25-0
    
    Addr is word address; bottom 2 bits of PC are always 0

    Top 4 bits of PC stay as they were (AFTER incr by 4)

    Easy to add.

    Smells like a good final exam type question.

    What's Wrong

    Some instructions are likely slower than others and we must set the clock cycle time long enough for the slowest. The disparity between the cycle times needed for different instructions is quite significant when one considers implementing more difficult instructions, like divide and floating point ops.

    Possible solutions


    ======== START LECTURE #19 ========

    Even Faster

    Chapter 2 Performance analysis

    Homework: Read Chapter 2

    Throughput measures the number of jobs per day that can be accomplished. Response time measures how long an individual job takes.

    Performance = 1 / Execution time

    So machine X is n times faster than Y means that

    How should we measure execution-time?

    We mostly use CPU time, but this does not mean the other metrics are worse.

    Cycle time vs. Clock rate

    So the execution time for a given job on a given computer is

    (CPU) execution time = (#CPU clock cycles required) * (cycle time)
                         = (#CPU clock cycles required) / (Clock rate)
    

    So a machine with a 10ns cycle time runs at a rate of
    1 cycle per 10 ns = 100,000,000 cycles per second = 100 MHz

    The number of CPU clock cycles required equals the number of instructions executed times the number of cycles in each instruction.

    But systems are more complicated than that!

    Through a great many measurements, one calculates for a given machine the average CPI (cycles per instruction).

    #instructions for a given program depends on the instruction set. For example we saw in chapter 3 that 1 VAX instruction often accomplishes more than 1 MIPS instruction.

    Complicated instructions take longer; either more cycles or longer cycle time

    Older machines with complicated instructions (e.g. VAX in 80s) had CPI>>1

    With pipelining can have many cycles for each instruction but still have CPI nearly 1.

    Modern superscalar machines have CPI < 1

    Putting this together, we see that

       Time (in seconds) =  #Instructions * CPI * Cycle_time (in seconds)
       Time (in ns) =  #Instructions * CPI * Cycle_time (in ns)
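
    For instance (numbers invented for illustration): a program that executes 10,000,000 instructions on a machine with CPI 2.5 and a 10ns cycle time takes 10**7 * 2.5 * 10ns = 2.5 * 10**8 ns = 0.25 seconds.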
    

    Homework: Carefully go through and understand the example on page 59

    Homework: 2.1-2.5 2.7-2.10

    Homework: Make sure you can easily do all the problems with a rating of [5] and can do all with a rating of [10]

    What about MIPS?

    Homework: Carefully go through and understand the example on pages 61-3

    Why not use MFLOPS

    Benchmarks

    Homework: Carefully go through and understand 2.7 ``fallacies and pitfalls''


    ======== START LECTURE #20 ========

    Chapter 7 Memory

    Homework: Read Chapter 7

    Ideal memory is

    We observe empirically

    So use a memory hierarchy

    1. Registers
    2. Cache (really L1 L2 maybe L3)
    3. Memory
    4. Disk
    5. Archive

    There is a gap between each pair of adjacent levels. We study the cache <---> memory gap

    A cache is a small fast memory between the processor and the main memory. It contains a subset of the contents of the main memory.

    A Cache is organized in units of blocks. Common block sizes are 16, 32, and 64 bytes.

    A hit occurs when a memory reference is found in the upper level of memory hierarchy.

    We start with a very simple cache organization.

    Example on pp. 547-8.

    Address(10)  Address(2)  hit/miss  block#
    22           10110       miss      110
    26           11010       miss      010
    22           10110       hit       110
    26           11010       hit       010
    16           10000       miss      000
    3            00011       miss      011
    16           10000       hit       000
    18           10010       miss      010

    The basic circuitry for this simple cache to determine hit or miss and to return the data is quite easy.
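
    In fact the whole cache fits in a few lines of behavioral C (a sketch: 8 one-word blocks, so the index is the low 3 bits of the block number; this reproduces the table above):

        #include <stdio.h>

        int main(void)
        {
            int tag[8], valid[8] = {0};   /* 8 blocks, 1 word each */
            int refs[] = {22, 26, 22, 26, 16, 3, 16, 18};
            for (int i = 0; i < 8; i++) {
                int index = refs[i] & 7;  /* low 3 bits of block number */
                int t     = refs[i] >> 3; /* the rest is the tag        */
                int hit   = valid[index] && tag[index] == t;
                printf("%2d -> block %d: %s\n", refs[i], index,
                       hit ? "hit" : "miss");
                if (!hit) { valid[index] = 1; tag[index] = t; } /* load on miss */
            }
            return 0;
        }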

    Calculate on the board the total number of bits in this cache.

    Homework: 7.1 7.2 7.3

    Processing a read for this simple cache

    Skip section ``handling cache misses'' as it discusses the multicycle and pipelined implementations of chapter 6, which we skipped.

    For our single cycle processor implementation we just need to note a few points

    Processing a write for this simple cache
    ======== START LECTURE #21 ========

    Homework: 7.2 7.3 (should have been given above with 7.1. I changed the notes so this is fixed for ``next time'')

    Improvement: Use a write buffer

    Unified vs split I and D (instruction and data)

    Improvement: Blocksize > Wordsize

    Homework: 7.7 7.8 7.9

    Why not make blocksize enormous? Cache one huge block.

    Memory support for wider blocks

    Homework: 7.11


    ======== START LECTURE #22 ========

    Performance example to do on the board (dandy exam question).

    Homework: 7.15, 7.16

    A lower base (i.e. miss-free) CPI makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to losing more instructions if the CPI is lower.

    Faster CPU (i.e., a faster clock) makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to more cycles if the clock is faster (and hence more instructions since the base CPI is the same).

    Another performance example

    Remark: Larger caches have longer hit times.

    Improvement: Associative Caches

    Consider the following sad story. Jane had a cache that held 1000 blocks and had a program that only references 4 (memory) blocks, namely 23, 1023, 123023, and 7023. In fact the references occur in the order: 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, etc. Referencing only 4 blocks and having room for 1000 in her cache, Jane expected an extremely high hit rate for her program. In fact, the hit rate was zero. She was so sad, she gave up her job as webmistress, went to medical school, and is now a brain surgeon at the Mayo Clinic in Rochester, MN.

    Tag size and division of the address bits

    We continue to assume a byte addressed machine, but all references are to a 4-byte word (lw and sw).

    The 2 LOBs are not used (they specify the byte within the word but all our references are for a complete word). We show these two bits in dark blue. We continue to assume 32 bit addresses so there are 2**30 words in the address space.

    Let's review various possible cache organizations and determine for each how large is the tag and how the various address bits are used. We will always use a 16KB cache. That is the size of the data portion of the cache is 16KB = 4 kilowords = 2**12 words.

    1. Direct mapped, blocksize 1 (word).
      • Since the blocksize is one word, there are 2**30 memory blocks and all the address bits (except the 2 LOBs that specify the byte within the word) are used for the memory block number. Specifically 30 bits are so used.
      • The cache has 2**12 words, which is 2**12 blocks.
      • So the low order 12 bits of the memory block number give the index in the cache (the cache block number), shown in cyan.
      • The remaining 18 (30-12) bits are the tag, shown in red.

    2. Direct mapped, blocksize 8
      • Three bits of the address give the word within the 8-word block. These are drawn in magenta.
      • The remaining 27 HOBs of the memory address give the memory block number.
      • The cache has 2**12 words, which is 2**9 blocks.
      • So the low order 9 bits of the memory block number gives the index in the cache.
      • The remaining 18 bits are the tag

    3. 4-way set associative, blocksize 1
      • Blocksize is 1 so there are 2**30 memory blocks and 30 bits are used for the memory block number.
      • The cache has 2**12 blocks, which is 2**10 sets (each set has 4=2**2 blocks).
      • So the low order 10 bits of the memory block number gives the index in the cache.
      • The remaining 20 bits are the tag. As the associativity grows the tag gets bigger. Why?
        Growing associativity reduces the number of sets into which a block can be placed. This increases the number of memory blocks that can be placed in a given set. Hence more bits are needed to see if the desired block is there.

    4. 4-way set associative, blocksize 8
      • Three bits of the address give the word within the block.
      • The remaining 27 HOBs of the memory address give the memory block number.
      • The cache has 2**12 words = 2**9 blocks = 2**7 sets.
      • So the low order 7 bits of the memory block number gives the index in the cache.
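
    The arithmetic is the same for all four organizations; here is a C sketch that computes the split for our 16KB cache (byte addresses, 4-byte words; blockwords and ways are the parameters, the address is arbitrary):

        #include <stdio.h>

        static int lg(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

        static void split(unsigned addr, int blockwords, int ways)
        {
            int words = 4096;                      /* 16KB = 2**12 words     */
            int sets  = words / (blockwords * ways);
            int offw  = lg(blockwords);            /* word-within-block bits */
            int idxw  = lg(sets);                  /* index bits             */
            unsigned block = (addr >> 2) >> offw;  /* memory block number    */
            printf("index=%u tag=%u (%d index bits, %d tag bits)\n",
                   block % sets, block / sets, idxw, 30 - offw - idxw);
        }

        int main(void)
        {
            split(0x12345678, 1, 1);  /* direct mapped, blocksize 1: 18 tag bits   */
            split(0x12345678, 8, 4);  /* 4-way set assoc, blocksize 8: 20 tag bits */
            return 0;
        }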

    Homework: 7.39, 7.40 (not assigned 1999-2000)

    Improvement: Multilevel caches

    Modern high end PCs and workstations all have at least two levels of caches: A very fast, and hence not too big, first level (L1) cache together with a larger but slower L2 cache.

    When a miss occurs in L1, L2 is examined and only if a miss occurs there is main memory referenced.

    So the average miss penalty for an L1 miss is

    (L2 hit rate)*(L2 time) + (L2 miss rate)*(L2 time + memory time)
    
    We are assuming L2 time is the same for an L2 hit or L2 miss. We are also assuming that the access doesn't begin to go to memory until the L2 miss has occurred.

    Do an example
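
    For instance (numbers invented): if an L2 access takes 10 cycles, memory takes 100 cycles, and the local L2 miss rate is 20%, the average penalty for an L1 miss is (.80)(10) + (.20)(10 + 100) = 30 cycles.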

    7.4: Virtual Memory

    I realize this material was covered in operating systems class (V22.0202). I am just reviewing it here. The goal is to show the similarity to caching, which we just studied. Indeed, (the demand part of) demand paging is caching: In demand paging the memory serves as a cache for the disk, just as in caching the cache serves as a cache for the memory.

    The names used are different and there are other differences as well.

    Cache concept   Demand paging analogue
    Memory block    Page
    Cache block     Page Frame (frame)
    Blocksize       Pagesize
    Tag             None (table lookup)
    Word in block   Page offset
    Valid bit       Valid bit

    Cache concept         Demand paging analogue
    Associativity         None (fully associative)
    Miss                  Page fault
    Hit                   Not a page fault
    Miss rate             Page fault rate
    Hit rate              1 - Page fault rate
    Placement question    Placement question
    Replacement question  Replacement question


    ======== START LECTURE #24 ========

    Homework: 7.39, 7.40 (should have been asked earlier)

    Homework: 7.32

    Write through vs. write back

    Question: On a write hit should we write the new value through to (memory/disk) or just keep it in the (cache/memory) and write it back to (memory/disk) when the (cache-line/page) is replaced.

    Translation Lookaside Buffer (TLB)

    A TLB is a cache of the page table



    Putting it together: TLB + Cache

    This is the DECstation 3100

    Actions taken

    1. The page number is searched in the fully associative TLB
    2. If a TLB hit occurs, the frame number from the TLB together with the page offset gives the physical address. A TLB miss causes an exception to reload the TLB, which we do not discuss.
    3. The physical address is broken into a cache tag and cache index (plus a two bit byte offset that is not used for word references).
    4. If the reference is a write, just do it without checking for a cache hit (this is possible because the cache is so simple as we discussed previously).
    5. For a read, if the tag located in the cache entry specified by the index matches the tag in the physical address, the referenced word has been found in the cache; i.e., we had a read hit.
    6. For a read miss, the cache entry specified by the index is fetched from memory and the data returned to satisfy the request.

    Hit/Miss possibilities

    TLB   Page  Cache  Remarks
    hit   hit   hit    Possible, but page table not checked on TLB hit, data from cache
    hit   hit   miss   Possible, but page table not checked, cache entry loaded from memory
    hit   miss  hit    Impossible, TLB references in-memory pages
    hit   miss  miss   Impossible, TLB references in-memory pages
    miss  hit   hit    Possible, TLB entry loaded from page table, data from cache
    miss  hit   miss   Possible, TLB entry loaded from page table, cache entry loaded from memory
    miss  miss  hit    Impossible, cache is a subset of memory
    miss  miss  miss   Possible, page fault brings in page, TLB entry loaded, cache loaded

    Homework: 7.31, 7.33

    7.5: A Common Framework for Memory Hierarchies

    Question 1: Where can the block be placed?

    This could be called the placement question. There is another placement question in OS memory management. When dealing with varying size pieces (segmentation or whole program swapping), the available space becomes broken into varying size available blocks (called holes) and varying size allocated blocks. We do not discuss the above placement question in this course (but presumably it was in 204 when you took it and for sure it will be in 204 next semester--when I teach it).

    The placement question we do study is the associativity of the structure.

    Assume a cache with N blocks

    Typical Values

    Feature                  Typical values   Typical values     Typical values
                             for caches       for paged memory   for TLBs
    Size                     8KB-8MB          16MB-2GB           256B-32KB
    Block size               16B-256B         4KB-64KB           4B-32B
    Miss penalty in clocks   10-100           1M-10M             10-100
    Miss rate                .1%-10%          .000001-.0001%     .01%-2%

    Question 2: How is a block found?

    Associativity    Location method                       Comparisons required
    Direct mapped    Index                                 1
    Set associative  Index the set, search among elements  Degree of associativity
    Full             Search all cache entries              Number of cache blocks
    Full             Separate lookup table                 0

    The difference in sizes and costs for demand paging vs. caching leads to a different choice of implementation for finding the block. Demand paging always uses the bottom row, with a separate table (the page table), but caching never uses such a table.

    Question 3: Which block should be replaced?

    This is called the replacement question and is much studied in demand paging (remember back to 202).

    Question 4: What happens on a write?

    1. Write-through
      • Data written to both the cache and main memory (in general to both levels of the hierarchy).
      • Sometimes used for caching, never used for demand paging
      • Advantages
        • Misses are simpler and cheaper (no copy back)
        • Easier to implement, especially for block size 1, which we did in class.
        • For blocksize > 1, a write miss is more complicated since the rest of the block now is invalid. Fetch the rest of the block from memory (or mark those parts invalid by extra valid bits--not covered in this course).

    Homework: 7.41

    1. Write-back
      • Data only written to the cache. The memory has stale data, but becomes up to date when the cache block is subsequently replaced in the cache.
      • Only real choice for demand paging since writing to the lower level of the memory hierarchy (in this case disk) is so slow.
      • Advantages
        • Words can be written at cache speed not memory speed
        • When blocksize > 1, writes to multiple words in the cache block are only written once to memory (when the block is replaced).
        • Multiple writes to the same word in a short period are written to memory only once.
        • When blocksize > 1, the replacement can utilize a high bandwidth transfer. That is, writing one 64-byte block is faster than 16 writes of 4-bytes each.


      ======== START LECTURE #25 ========

      • Write miss policy (advanced)
        • For demand paging, the case is pretty clear. Every implementation I know of allocates a frame for the page miss and fetches the page from disk. That is it does both an allocate and a fetch.
        • For caching this is not always the case. Since there are two optional actions there are four possibilities.
          1. Don't allocate and don't fetch: This is sometimes called write around. It is done when the data is not expected to be read and is large.
          2. Don't allocate but do fetch: Impossible, where would you put the fetched block?
          3. Do allocate, but don't fetch: Sometimes called no-fetch-on-write. Also called SANF (store-allocate-no-fetch). Requires multiple valid bits per block since the just-written word is valid but the others are not (since we updated the tag to correspond to the just-written word).
          4. Do allocate and do fetch: The normal case we have been using.

    Chapter 8: Interfacing Processors and Peripherals.

    With processor speed increasing 50% / year, I/O must improve or essentially all jobs will be I/O bound.

    The diagram on the right is quite oversimplified for modern PCs but serves the purpose of this course.

    8.2: I/O Devices

    Devices are quite varied and their datarates vary enormously.

    Show a real disk opened up and illustrate the components

    8.4: Buses

    A bus is a shared communication link, using one set of wires to connect many subsystems.

    Synchronous vs. Asynchronous Buses

    A synchronous bus is clocked.

    An asynchronous bus is not clocked.

    1. The device makes a request (asserts ReadReq and puts the desired address on the data lines).

    2. Memory, which has been waiting, sees ReadReq, records the address and asserts Ack.

    3. The device waits for the Ack; once seen, it drops the data lines and deasserts ReadReq.

    4. The memory waits for the request line to drop. Then it can drop Ack (which it knows the device has now seen). The memory now at its leisure puts the data on the data lines (which it knows the device is not driving) and then asserts DataRdy. (DataRdy has been deasserted until now).

    5. The device has been waiting for DataRdy. It detects DataRdy and records the data. It then asserts Ack indicating that the data has been read.

    6. The memory sees Ack and then deasserts DataRdy and releases the data lines.

    7. The device seeing DataRdy low deasserts Ack ending the show.


    ======== START LECTURE #26 ========

    Improving Bus Performance

    These improvements mostly come at the cost of increased expense and/or complexity.

    1. Hierarchy of buses.

    2. Synchronous instead of asynchronous protocols.
      • Synchronous is actually simpler, but it essentially implies a hierarchy of protocols since not all devices can operate at the same speed.

    3. Wider data path: Use more wires, send more at once.

    4. Separate address and data lines: Same as above.

    5. Block transfers: Permit a single transaction to transfer more than one busload of data. Saves the time to release and acquire the bus, but the protocol is more complex.

    6. Obtaining bus access:
      • The simplest scheme is to permit only one bus master.
        • That is, on each bus only one device is permitted to initiate a bus transaction.
        • The other devices are slaves that only respond to requests.
        • With a single master, there is no issue of arbitrating among multiple requests.
      • One can have multiple masters with daisy chaining of the grant line.
        • Any device can raise the request line.
        • The device with the request raises the release line when done.
        • The arbiter monitors the request and release lines and raises the grant line.
        • The grant signal is passed from one to the other so the devices near the arbiter have priority and can starve the ones further away.
        • Passing the grant from device to device takes time.
        • Simple but not fair or high performance
      • Centralized parallel arbiter: Separate request lines from each device and separate grant lines. The arbiter decides which device should be granted the bus.
      • Distributed arbitration by self-selection: Requesting processes identify themselves on the bus and decide individually (and consistently) which one gets the grant.
      • Distributed arbitration by collision detection: Each device transmits whenever it wants, but detects collisions and retries. Ethernet uses this scheme (but not new switched ethernets).


    Option         High performance                 Low cost
    bus width      separate address and data lines  multiplex address and data lines
    data width     wide                             narrow
    transfer size  multiple bus loads               single bus loads
    bus masters    multiple                         single
    clocking       synchronous                      asynchronous


    ======== START LECTURE #27 ========

    Do on the board the example on pages 665-666

    8.5: Interfacing I/O Devices

    Giving commands to I/O Devices

    This is really an OS issue. Must write/read to/from device registers, i.e. must communicate commands to the controller.

    Communicating with the Processor

    Should we check periodically or be told when there is something to do? Better yet can we get someone else to do it since we are not needed for the job?

    Polling

    Processor continually checks the device status to see if action is required.

    Do on the board the example on pages 676-677

    Interrupt driven I/O

    Processor is told by the device when to look. The processor is interrupted by the device.

    Do on the board the example on pages 681-682

    Direct Memory Access (DMA)

    The processor initiates the I/O operation then ``something else'' takes care of it and notifies the processor when it is done (or if an error occurs).


    More Sophisticated Controllers

    Subtleties involving the memory system

    8.6: Designing an I/O system

    Do on the board the example on page 681

    Remark: The above analysis was very simplistic. It assumed everything overlapped just right, that the I/Os were not bursty, and that the I/Os conveniently spread themselves across the disks.


    ======== START LECTURE #28 ========

    Review for final.