Computer Architecture
1999-2000 Fall
MW 3:30-4:45
Ciww 109

Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/~gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best

0: Administrivia

Web Pages

There is a web page for the course. You can find it from my home page, which is http://allan.ultra.nyu.edu/~gottlieb

Textbook

Text is Hennessy and Patterson ``Computer Organization and Design The Hardware/Software Interface'', 2nd edition.

Computer Accounts and mailman mailing list

Homeworks and Labs

I make a distinction between homework and labs.

Labs are

Homeworks are

Upper left board for assignments and announcements

Appendix B: Logic Design

Homework: Read B1

B.2: Gates, Truth Tables and Logic Equations

Homework: Read B2.

Digital ==> Discrete

Primarily (but NOT exclusively) binary at the hardware level

Use only two voltages--high and low.

Since this is not an engineering course, we will ignore these issues and assume square waves.

In English, digital implies base 10 (from digit, i.e. finger), but not in computers.

Bit = Binary digIT

Instead of saying high voltage and low voltage, we say true and false or 1 and 0 or asserted and deasserted.

0 and 1 are called complements of each other.

A logic block can be thought of as a black box that takes signals in and produces signals out. There are two kinds of blocks

We are doing combinational blocks now. Will do sequential blocks later (in a few lectures).

TRUTH TABLES

Since combinatorial logic has no memory, it is simply a function from its inputs to its outputs. A Truth Table has as columns all inputs and all outputs. It has one row for each possible set of input values, and the output columns give the outputs for that input. Let's start with a really simple case: a logic block with one input and one output.

There are two columns (1 + 1) and two rows (2**1).

In  Out
0   ?
1   ?

How many are there?

How many different truth tables are there for a ``one in one out'' logic block?

Just 4: the constant functions 1 and 0, the identity, and an inverter (pictures in a few minutes). There were two `?'s in the above table; each can be a 0 or 1, so there are 2**2 = 4 possibilities.

OK. Now how about two inputs and one output?

Three columns (2+1) and 4 rows (2**2).

In1 In2  Out
0   0    ?
0   1    ?
1   0    ?
1   1    ?

How many are there? It is just the number of ways you can fill in the output entries, i.e. the question marks. There are 4 output entries, so the answer is 2**4=16.

How about 2 in and 8 out?

3 in and 8 out?

n in and k out? In general the table has k*2**n output entries, each independently 0 or 1, so there are 2**(k*2**n) truth tables.

Gets big fast!

Boolean algebra

Certain logic functions (i.e. truth tables) are quite common and familiar.

We use a notation that looks like algebra to express logic functions and expressions involving them.

The notation is called Boolean algebra in honor of George Boole.

A Boolean value is a 1 or a 0.
A Boolean variable takes on Boolean values.
A Boolean function takes in Boolean variables and produces Boolean values.

  1. The (inclusive) OR Boolean function of two variables. Draw its truth table. This is written + (e.g. X+Y where X and Y are Boolean variables) and often called the logical sum. (Three out of four output values in the truth table look right!)

  2. AND. Draw TT. Called logical product and written as a centered dot (like product in regular algebra). All four values look right.

  3. NOT. Draw TT. This is a unary operator (One argument, not two as above; functions with two inputs are called binary). Written A with a bar over it. I will use ' instead of a bar as it is easier for me to type in html.

  4. Exclusive OR (XOR). Written as + with a circle around it. True if exactly one input is true (i.e., true XOR true = false). Draw TT.

Homework: Consider the Boolean function of 3 boolean variables that is true if and only if exactly 1 of the three variables is true. Draw the TT.

Some manipulation laws. Remember this is Boolean ALGEBRA.

Identity:

Inverse:

Both + and . are commutative so my identity and inverse examples contained redundancy.

The name inverse law is somewhat funny since you Add the inverse and get the identity for Product or Multiply by the inverse and get the identity for Sum.

Associative:

Due to the associative law we can write A.B.C since either order of evaluation gives the same answer. Similarly we can write A+B+C.

We often elide the . so the product associative law is A(BC)=(AB)C. So we better not have three variables A, B, and AB. In fact, we normally use one letter variables.

Distributive:

How does one prove these laws??

Homework: Prove the second distributive law.

Let's do (on the board) the examples on pages B-5 and B-6. Consider a logic function with three inputs A, B, and C; and three outputs D, E, and F defined as follows: D is true if at least one input is true, E if exactly two are true, and F if all three are true. (Note that by ``if'' we mean ``if and only if''.)

Draw the truth table.

Show the logic equations.

The first way we solved part E shows that any logic function can be written using just AND, OR, and NOT. Indeed, it is in a nice form, called two levels of logic, i.e. it is a sum of products of just inputs and their complements.
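As an aside, here is a minimal C sketch (mine, not from the course handouts) of this two-level form for the example: each output is an OR (sum) of AND (product) terms over the inputs and their complements.

    #include <stdio.h>

    /* Sum-of-products form for the page B-5/B-6 example (names mine). */
    int D(int a, int b, int c) { return a | b | c; }           /* at least one true */
    int E(int a, int b, int c) {                               /* exactly two true  */
        return (!a & b & c) | (a & !b & c) | (a & b & !c);
    }
    int F(int a, int b, int c) { return a & b & c; }           /* all three true    */

    int main(void) {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                for (int c = 0; c <= 1; c++)       /* print the truth table */
                    printf("%d %d %d || %d %d %d\n",
                           a, b, c, D(a,b,c), E(a,b,c), F(a,b,c));
        return 0;
    }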

DeMorgan's laws:

You prove DM laws with TTs. Indeed that is ...

Homework: B.6 on page B-45.

Do beginning of HW on the board.


======== START LECTURE #2 ========

With DM (DeMorgan's Laws) we can do quite a bit without resorting to TTs. For example one can show that the two expressions for E in the example above (page B-6) are equal. Indeed that is

Homework: B.7 on page B-45

Do beginning of HW on board.

GATES

Gates implement basic logic functions: AND OR NOT XOR Equivalence

Often omit the inverters and draw the little circles at the input or output of the other gates (AND OR). These little circles are sometimes called bubbles.

This explains why the inverter is drawn as a buffer with a bubble.

Show why the picture for equivalence is the negation of XOR, i.e. (A XOR B)' is AB + A'B'.

(A XOR B)' =
(A'B+AB')' = 
(A'B)' (AB')' = 
(A''+B') (A'+B'') = 
(A + B') (A' + B) = 
AA' + AB + B'A' + B'B = 
0   + AB + B'A' + 0 = 
AB + A'B'
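
The identity can also be checked exhaustively. A throwaway C check (mine):

    #include <assert.h>

    int main(void) {
        /* (A XOR B)' == AB + A'B' for all four input pairs */
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                assert(!(a ^ b) == ((a & b) | (!a & !b)));
        return 0;
    }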

Homework: B.2 on page B-45 (I previously did the first part of this homework).

Homework: Consider the Boolean function of 3 boolean vars (i.e. a three input function) that is true if and only if exactly 1 of the three variables is true. Draw the TT. Draw the logic diagram with AND OR NOT. Draw the logic diagram with AND OR and bubbles.

A set of gates is called universal if these gates are sufficient to generate all logic functions.

NOR (NOT OR) is true when OR is false. Do TT.

NAND (NOT AND) is true when AND is false. Do TT.

Draw two logic diagrams for each, one from the definition and an equivalent one with bubbles.

Theorem: A 2-input NOR is universal and a 2-input NAND is universal.

Proof

We must show that you can get A', A+B, and AB using just a two input NOR.
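Here is a small C sketch (function names mine) of the standard constructions; note that or_g and and_g call nothing but nor_g.

    /* The only primitive: a 2-input NOR. */
    int nor_g(int a, int b) { return !(a | b); }

    /* Everything else built from NOR alone. */
    int not_g(int a)        { return nor_g(a, a); }               /* A' = (A+A)'     */
    int or_g (int a, int b) { return not_g(nor_g(a, b)); }        /* A+B = ((A+B)')' */
    int and_g(int a, int b) { return nor_g(not_g(a), not_g(b)); } /* AB  = (A'+B')'  */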

Homework: Show that a 2-input NAND is universal.

Can draw NAND and NOR each two ways (because (AB)' = A' + B')

We have seen how to get a logic function from a TT. Indeed we can get one that is just two levels of logic. But it might not be the simplest possible. That is, we may have more gates than are necessary.

Trying to minimize the number of gates is NOT trivial. Mano covers the topic of gate minimization in detail. We will not cover it in this course. It is not in H&P. I actually like it but must admit that it takes a few lectures to cover well and it is not used much in practice since it is algorithmic and is done automatically by CAD tools.

Minimization is not unique, i.e. there can be two or more minimal forms.

Given A'BC + ABC + ABC'
Combine first two to get BC + ABC'
Combine last two to get A'BC + AB

Sometimes when building a circuit, you don't care what the output is for certain input values. For example, that input combination might be known not to occur. Another example occurs when, for some combination of input values, a later part of the circuit will ignore the output of this part. These are called don't-care output situations. Making use of don't cares can reduce the number of gates needed.

Can also have don't care inputs when, for certain values of a subset of the inputs, the output is already determined and you don't have to look at the remaining inputs. We will see a case of this in the very next topic, multiplexors.

An aside on theory

Putting a circuit in disjunctive normal form (i.e. two levels of logic) means that every path from the input to the output goes through very few gates. In fact only two, an OR and an AND. Maybe we should say three since the AND can have a NOT (bubble). Theoreticians call this number (2 or 3 in our case) the depth of the circuit. So we see that every logic function can be implemented with small depth. But what about the width, i.e., the number of gates?

The news is bad. The parity function takes n inputs and gives TRUE if and only if the number of TRUE inputs is odd. If the depth is fixed (say limited to 3), the number of gates needed for parity is exponential in n.

B.3 COMBINATIONAL LOGIC

Homework: Read B.3.

Generic Homework: Read sections in book corresponding to the lectures.

Multiplexor

Often called a mux or a selector

Show equiv circuit with AND OR

Hardware if-then-else

    if S=0
        M=A
    else
        M=B
    endif

Can have 4 way mux (2 selector lines)

This is an if-then-elif-elif-else

   if S1=0 and S2=0
        M=A
    elif S1=0 and S2=1
        M=B
    elif S1=1 and S2=0
        M=C
    else -- S1=1 and S2=1
        M=D
    endif
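
In gate terms the 2-way mux is M = S'A + SB, and the 4-way version ORs together one AND term per input. A minimal C sketch (names mine):

    /* 2-way mux: M = S'A + SB */
    int mux2(int s, int a, int b) {
        return (!s & a) | (s & b);
    }

    /* 4-way mux: one AND term per input, one OR at the end */
    int mux4(int s1, int s2, int a, int b, int c, int d) {
        return (!s1 & !s2 & a) | (!s1 & s2 & b)
             | ( s1 & !s2 & c) | ( s1 &  s2 & d);
    }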

Do a TT for 2 way mux. Redo it with don't care values.
Do a TT for 4 way mux with don't care values.

Homework: B.12.
B.5 (Assume you have constant signals 1 and 0 as well.)


======== START LECTURE #3 ========

Decoder

Encoder

Sneaky way to see that NAND is universal.

Half Adder

Homework: Draw logic diagram

Full Adder

Homework:

How about 4 bit adder ?

How about an n-bit adder ?

PLAs--Programmable Logic Arrays

Idea is to make use of the algorithmic way you can look at a TT and produce a circuit diagram in the sum-of-products form.

Consider the following TT from the book (page B-13)

     A | B | C || D | E | F
     --+---+---++---+---+--
     0 | 0 | 0 || 0 | 0 | 0
     0 | 0 | 1 || 1 | 0 | 0
     0 | 1 | 0 || 1 | 0 | 0
     0 | 1 | 1 || 1 | 1 | 0
     1 | 0 | 0 || 1 | 0 | 0
     1 | 0 | 1 || 1 | 1 | 0
     1 | 1 | 0 || 1 | 1 | 0
     1 | 1 | 1 || 1 | 0 | 1

Here is the circuit diagram for this truth table.

Here it is redrawn in a more schematic style.

Finally, it can be redrawn in a more abstract form.

Before a PLA is manufactured, all the connections are specified. That is, a PLA is specific for a given circuit. The name is somewhat of a misnomer since it is not programmable by the user.

Homework: B.10 and B.11

Can also have a PAL or Programmable array logic in which the final dots are specified by the user. The manufacturer produces a ``sea of gates''; the user programs it to the desired logic function by adding the dots.


======== START LECTURE #4 ========

ROMs

One way to implement a mathematical (or C) function (without side effects) is to perform a table lookup.

A ROM (Read Only Memory) is the analogous way to implement a logic function.

Important: A ROM does not have state. It is another combinational circuit. That is, it does not represent ``memory''. The reason is that once a ROM is manufactured, the output depends only on the input.

A PROM is a programmable ROM. That is, you buy the ROM with ``nothing'' in its memory and then, before it is placed in the circuit, you load the memory and never change it. This is like a CD-R.

An EPROM is an erasable PROM. It costs more but if you decide to change its memory this is possible (but is slow). This is like a CD-RW.

``Normal'' EPROMs are erased by some ultraviolet light process. But EEPROMs (electrically erasable PROMS) are faster and are done electronically.

All these EPROMS are erasable not writable, i.e. you can't just change one bit.

A ROM is similar to a PLA

Don't Cares

Example (from the book):

Full truth table

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     0   1   1 || 1   1   0
     1   0   0 || 1   1   1
     1   0   1 || 1   1   0
     1   1   0 || 1   1   0
     1   1   1 || 1   1   1

This has 7 minterms.

Put in the output don't cares

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     0   1   1 || 1   1   X
     1   0   0 || 1   1   X
     1   0   1 || 1   1   X
     1   1   0 || 1   1   X
     1   1   1 || 1   1   X

Now do the input don't cares

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     X   1   1 || 1   1   X
     1   X   X || 1   1   X

These don't cares are important for logic minimization. Compare the number of gates needed for the full TT and the reduced TT. There are techniques for minimizing logic, but we will not cover them.

Arrays of Logic Elements




*** Big Change Coming ***

Sequential Circuits, Memory, and State

Why do we want to have state?

Assume you have a real OR gate. Assume the two inputs are both zero for an hour. At time t one input becomes 1. The output will OSCILLATE for a while before settling on exactly 1. We want to be sure we don't look at the answer before it's ready.

B.4: Clocks

Frequency and period

Edges

Synchronous system

Now we are going to add state elements to the combinational circuits we have been using previously.

Remember that a combinational/combinatorial circuit has its outputs determined by its inputs, i.e. combinatorial circuits do not contain state.

State elements include state (naturally).


B.5: Memory Elements

We want edge-triggered clocked memory and will only use edge-triggered clocked memory in our designs. However we get there by stages. We first show how to build unclocked memory; then using unclocked memory we build level-sensitive clocked memory; finally from level-sensitive clocked memory we build edge-triggered clocked memory.

Unclocked Memory

S-R latch (set-reset)

Clocked Memory: Flip-flops and latches

The S-R latch defined above is not clocked memory. Unfortunately the terminology is not perfect.

For both flip-flops and latches the output equals the value stored in the structure. Both have an input and an output (and the complemented output) and a clock input as well. The clock determines when the internal value is set to the current input. For a latch, the change occurs whenever the clock is asserted (level sensitive). For a flip-flop, the change occurs at the active edge.

D latch

The D is for data

In the traces below notice how the output follows the input when the clock is high and remains constant when the clock is low. We assume the stored value is initially low.

D or Master-Slave Flip-flop

This was our goal. We now have an edge-triggered, clocked memory.

Note how much less wiggly the output is with the master-slave flop than before with the transparent latch. As before we are assuming the output is initially low.
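
A rough C sketch (mine, with one function call per clock phase standing in for real timing) of the difference between the two structures; in this wiring the flip-flop output changes only at the falling edge.

    /* Transparent D latch: output follows D whenever the clock is high. */
    typedef struct { int q; } DLatch;
    void latch_eval(DLatch *l, int clk, int d) { if (clk) l->q = d; }

    /* Master-slave D flip-flop: the master is open while clk is high,
       the slave while clk is low, so q changes only at the falling edge. */
    typedef struct { DLatch master, slave; } DFlipFlop;
    void ff_eval(DFlipFlop *f, int clk, int d) {
        latch_eval(&f->master, clk, d);
        latch_eval(&f->slave, !clk, f->master.q);
    }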

Homework: Try moving the inverter to the other latch. What has changed?


======== START LECTURE #5 ========

Homework: B.18

Registers

Register File

Set of registers each numbered

To read just need mux from register file to select correct register.

For writes use a decoder on the register number to determine which register to write. Note that 3 errors in the book's figure were fixed.

The idea is to gate the write line with the output of the decoder. In particular, we should perform a write to register r this cycle provided the write line is asserted and the decoder selects register r.
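
A small C sketch (names mine) of the read mux and the gated write:

    #include <stdint.h>

    #define NREGS 32
    static uint32_t regs[NREGS];

    /* Read: a mux selects the named register. */
    uint32_t regfile_read(int r) { return regs[r]; }

    /* Write: the decoder asserts exactly one line (i == r); that line
       is ANDed with the write line before any register is loaded.     */
    void regfile_write(int write_enable, int r, uint32_t value) {
        for (int i = 0; i < NREGS; i++)
            if ((i == r) && write_enable)
                regs[i] = value;
    }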

Homework: 20


======== START LECTURE #6 ========

SRAMS and DRAMS


Note: There are other kinds of flip-flops: T, J-K. Also one could learn about excitation tables for each. We will not cover this material (H&P doesn't either). If interested, see Mano.

B.6: Finite State Machines

I do a different example from the book (counters instead of traffic lights). The ideas are the same and the two generic pictures (below) apply to both examples.

Counters

A counter counts (naturally).








The state transition diagram



The circuit diagram.



How do we determine the combinatorial circuit?

Current      || Next A
   A    I R  || DA <-- i.e. to what must I set DA
-------------++--      in order to get the desired
   0    0 0  || 0      Next A for the next cycle.
   1    0 0  || 1      
   0    1 0  || 1
   1    1 0  || 0
   x    x 1  || 0

But this table is simply the truth table for the combinatorial circuit.

A I R  || DA
-------++--
0 0 0  || 0
1 0 0  || 1
0 1 0  || 1
1 1 0  || 0
x x 1  || 0

DA = R' (A XOR I)

How about a two-bit counter?

To determine the combinatorial circuit we could proceed as before

Current      ||
  A B   I R  || DA DB
-------------++------

This would work, but we can instead think about how a counter works and see that:

DA = R'(A XOR I)
DB = R'(B XOR AI)
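
A tiny C simulation (mine) of this two-bit counter, with A as the low-order bit (as the equations suggest); I is the increment input and R the reset.

    #include <stdio.h>

    int main(void) {
        int A = 0, B = 0;              /* state bits (A = low order)  */
        int I[] = {1, 1, 1, 1, 0, 1};  /* increment input, per cycle  */
        int R[] = {0, 0, 0, 0, 1, 0};  /* reset input, per cycle      */
        for (int t = 0; t < 6; t++) {
            int DA = !R[t] & (A ^ I[t]);        /* DA = R'(A XOR I)  */
            int DB = !R[t] & (B ^ (A & I[t]));  /* DB = R'(B XOR AI) */
            A = DA; B = DB;                     /* clock edge        */
            printf("cycle %d: BA = %d%d\n", t, B, A);
        }
        return 0;
    }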

Homework: B.23

B.7 Timing Methodologies

Skipped


======== START LECTURE #7 ========

Simulating Combinatorial Circuits at the Gate Level

The idea is, given a circuit diagram, write a program that behaves the way the circuit does. This means more than getting the same answer. The program is to work the way the circuit does.

For each logic box, you write a procedure with the following properties.

Simulating a Full Adder

Remember that a full adder has three inputs and two outputs. Hand out hard copies of FullAdder.c.
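
I don't reproduce the handout here, but a minimal sketch of what such a gate-level simulation might look like (function and parameter names are mine, not necessarily those of FullAdder.c) is:

    /* Full adder: inputs a, b, carry-in; outputs sum and carry-out.
       sum = a XOR b XOR cin;  cout = ab + a cin + b cin.            */
    void full_adder(int a, int b, int cin, int *sum, int *cout) {
        *sum  = a ^ b ^ cin;
        *cout = (a & b) | (a & cin) | (b & cin);
    }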

Simulating a 4-bit Adder

This implementation uses the full adder code above. Hand out hard copies of FourBitAdder.c.
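
Again a sketch only (not the handout): ripple the carry through four copies of the full adder above.

    /* 4-bit ripple-carry adder; a[], b[], s[] hold one bit per element,
       LOB first.  Returns the carry-out of the HOB.                    */
    int four_bit_adder(const int a[4], const int b[4], int cin, int s[4]) {
        int carry = cin;
        for (int i = 0; i < 4; i++)
            full_adder(a[i], b[i], carry, &s[i], &carry);
        return carry;
    }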

Lab 1: Simulating A 1-bit ALU

Hand out Lab 1, which is available in text (without the diagram), pdf, and postscript.

Chapter 1: Computer Abstractions and Technologies

Homework: READ chapter 1. Do 1.1 -- 1.26 (really one matching question)
Do 1.27 to 1.44 (another matching question),
1.45 (and do 10,000 RPM),
1.46, 1.50

Chapter 3: Instructions: Language of the Machine

Homework: Read sections 3.1 3.2 3.3

3.4 Representing instructions in the Computer (MIPS)

Register file

Homework: 3.2.

The fields of a MIPS instruction are quite consistent

    op    rs    rt    rd    shamt  funct   <-- name of field
    6     5     5     5      5      6      <-- number of bits

R-type instruction (R for register)

Example: add $1,$2,$3
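
Worked encoding (add is opcode 0, funct 32, and the destination is rd): add $1,$2,$3 has op=0, rs=2, rt=3, rd=1, shamt=0, funct=32, i.e.

    000000 00010 00011 00001 00000 100000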

I-type (why I?)

    op    rs    rt   address
    6     5     5     16

Examples: lw/sw $1,1000($2)
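
Worked encoding (lw is opcode 35, sw is 43): lw $1,1000($2) has op=35, rs=2, rt=1, address=1000, i.e.

    100011 00010 00001 0000001111101000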

RISC-like properties of the MIPS architecture.

Branching instruction

slt (set less-than)

Example: slt $3,$8,$2

beq and bne (branch (not) equal)

Examples: beq/bne $1,$2,123


======== START LECTURE #8 ========

blt (branch if less than)

Examples: blt $5,$8,123

ble (branch if less than or equal)

bgt (branch if greater than)

bge (branch if greater than or equal)

Note: Please do not make the mistake of thinking that

    slt $1,$5,$8
    beq $1,$0,L
is the same as
    slt $1,$8,$5
    bne $1,$0,L

The negation of X < Y is X >= Y, not Y < X.

End of Note

Homework: 3.12

J-type instructions (J for jump)

        op   address
        6     26

j (jump)

Example: j 10000

jr (jump register)

Example: jr $10

jal (jump and link)

Example: jal 10000

I type instructions (revisited)

addi (add immediate)

Example: addi $1,$2,100

slti (set less-than immediate)

Example: slti $1,$2,50

lui (load upper immediate)

Example: lui $4,123

Homework: 3.1, 3.3, 3.4, and 3.5.

Chapter 4

Homework: Read 4.1-4.4

4.2: Signed and Unsigned Numbers

MIPS uses 2s complement (just like 8086)

To form the 2s complement (of 0000 1111 0000 1010 0000 0000 1111 1100), complement all the bits and then add 1:

    complement:  1111 0000 1111 0101 1111 1111 0000 0011
    add 1:       1111 0000 1111 0101 1111 1111 0000 0100

Need comparisons for signed and unsigned.

sltu and sltiu

Just like slt and slti but the comparison is unsigned.

Homework: 4.1-4.9

4.3: Addition and subtraction

To add two (signed) numbers just add them. That is, don't treat the sign bit specially.

To subtract A-B, just take the 2s complement of B and add.

Overflows

An overflow occurs when the result of an operation cannot be represented with the available hardware. For MIPS this means when the result does not fit in a 32-bit word.

Homework: Prove this last statement (4.29) (for fun only, do not hand in).

addu, subu, addiu

These add and subtract the same as add and sub, but do not signal overflow.

4.4: Logical Operations

Shifts: sll, srl

Bitwise AND and OR: and, or, andi, ori

No surprises.

4.5: Constructing an ALU--the fun begins




First goal is 32-bit AND, OR, and addition

Recall we know how to build a full adder. We will draw it as shown on the right.









With this adder, the ALU is easy.



With this 1-bit ALU, constructing a 32-bit version is simple.
  1. Use an array of logic elements for the logic. The logic element is the 1-bit ALU
  2. Use buses for A, B, and Result.
  3. ``Broadcast'' Opcode to all of the internal 1-bit ALUs. This means wire the external Opcode to the Opcode input of each of the internal 1-bit ALUs

First goal accomplished.


======== START LECTURE #9 ========

Now we augment the ALU so that we can perform subtraction (as well as addition, AND, and OR).

1-bit ALU with ADD, SUB, AND, OR is

Implementing addition and subtraction

  1. To implement addition we use opcode 10 as before and de-assert both b-invert and Cin.
  2. To implement subtraction we still use opcode 10 but we assert both b-invert and Cin.










32-bit version is simply a bunch of these.



(More or less) all ALUs do AND, OR, ADD, SUB. Now we want to customize our ALU for the MIPS architecture. Extra requirements for MIPS ALU:

  1. slt set-less-than








    Homework: figure out correct rule, i.e. prob 4.23. Hint: when an overflow occurs the sign bit is definitely wrong (so the complement of the sign bit is right).
    2. Overflows
      • The HOB ALU is already unique (outputs SET).
      • Need to enhance it some more to produce the overflow output.
      • Recall that we gave the rule for overflow. You need to examine:
        • Whether the operation is add or sub (binvert).
        • The sign of A.
        • The sign of B.
        • The sign of the result.
        • Since this is the HOB we have all the sign bits.
        • The book also uses Cout, but this appears to be an error.





    3. Zero Detect
      • To see if all bits are zero just need NOR of all the bits
      • Conceptually trivially but does require some wiring

    4. Observation: The CarryIn to the LOB and Binvert to all the 1-bit ALUs are always the same. So the 32-bit ALU has just one input called Bnegate, which is sent to the appropriate inputs in the 1-bit ALUs.

    The Final Result is

    The symbol used for an ALU is on the right

    What are the control lines?

    What functions can we perform?

    What (3-bit) values for the control lines do we need for each function? The control lines are Bnegate (1-bit) and Operation (2-bits)
    and 0 00
    or 0 01
    add 0 10
    sub 1 10
    slt 1 11


    ======== START LECTURE #10 ========

    Fast Adders

    1. We have done what is called a ripple carry adder.
      • The carry ``ripples'' from one bit to the next (LOB to HOB).
      • So the time required is proportional to the wordlength
      • Each carry can be computed with two levels of logic (any function can be so computed), hence the number of gate delays for an n-bit adder is 2n.
        • For a 4-bit adder, 8 gate delays are required.
        • For a 16-bit adder, 32 gate delays are required.
        • For a 32-bit adder, 64 gate delays are required.
        • For a 64-bit adder, 128 gate delays are required.
    2. What about doing the entire 32 (or 64) bit adder with 2 levels of logic?
      • Such a circuit clearly exists. Why?
        Ans: A two-level logic circuit exists for any function.
      • But it would be very expensive: many gates and wires.
      • The big problem: when expressed with two levels of logic, the AND and OR gates have high fan-in, i.e., they have a large number of inputs. It is not true that a 64-input AND takes the same time as a 2-input AND.
      • Unless you are doing full custom VLSI, you get a toolbox of primitive functions (say 4-input NAND) and must build from that.
    3. There are faster adders, e.g. carry lookahead and carry save. We will study carry lookahead adders.

    Carry Lookahead Adder (CLA)

    This adder is much faster than the ripple adder we did before, especially for wide (i.e., many bit) addition.

    To summarize, using a subscript i to represent the bit number,

        to generate  a carry:   gi = ai bi
        to propagate a carry:   pi = ai+bi
    

    H&P give a plumbing analogue for generate and propagate.

    Given the generates and propagates, we can calculate all the carries for a 4-bit addition (recall that c0=Cin is an input) as follows (this is the formula version of the plumbing):

    c1 = g0 + p0 c0
    
    c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0
    
    c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
    
    c4 = g3 + p3 c3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0
    

    Thus we can calculate c1 ... c4 in just two additional gate delays (where we assume one gate can accept up to 5 inputs). Since we get gi and pi after one gate delay, the total delay for calculating all the carries is 3 (this includes c4=Carry-Out).

    Each bit of the sum si can be calculated in 2 gate delays given ai, bi, and ci. Thus, for 4-bit addition, 5 gate delays after we are given a, b and Carry-In, we have calculated s and Carry-Out.

    So, for 4-bit addition, the faster adder takes time 5 and the slower adder time 8.
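
    Here is a small C sketch (mine, not from the book) of the 4-bit carry
    calculation. For brevity it evaluates the recurrence ci+1 = gi + pi ci
    sequentially; the hardware computes the expanded forms above in parallel.

        #include <stdint.h>

        /* 4-bit carry lookahead computation; a, b are 4-bit operands,
           c0 is Carry-In.  Returns Carry-Out and fills in the sum.    */
        uint32_t cla4(uint32_t a, uint32_t b, uint32_t c0, uint32_t *sum)
        {
            uint32_t g = a & b;   /* gi = ai bi  (bitwise, i = 0..3) */
            uint32_t p = a | b;   /* pi = ai + bi                    */
            uint32_t c[5];
            c[0] = c0 & 1;
            for (int i = 0; i < 4; i++)          /* ci+1 = gi + pi ci */
                c[i+1] = ((g >> i) & 1) | (((p >> i) & 1) & c[i]);
            *sum = 0;
            for (int i = 0; i < 4; i++)          /* si = ai XOR bi XOR ci */
                *sum |= (((a >> i) ^ (b >> i) ^ c[i]) & 1) << i;
            return c[4];
        }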

    Now we want to put four of these together to get a fast 16-bit adder.

    As black boxes, both ripple-carry adders and carry-lookahead adders (CLAs) look the same.

    We could simply put four CLAs together and let the Carry-Out from one be the Carry-In of the next. That is, we could put these CLAs together in a ripple-carry manner to get a hybrid 16-bit adder.


    We want to do better so we will put the 4-bit carry-lookahead adders together in a carry-lookahead manner. Thus the diagram above is not what we are going to do.

    We start by determining ``super generate'' and ``super propagate'' bits.

    P0 = p3 p2 p1 p0          Does the low order 4-bit adder
                              propagate a carry?
    P1 = p7 p6 p5 p4
    
    P2 = p11 p10 p9 p8
    
    P3 = p15 p14 p13 p12      Does the high order 4-bit adder
                              propagate a carry?
    
    
    G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0        Does low order 4-bit
                                                    adder generate a carry
    G1 = g7 + p7 g6 + p7 p6 g5 + p7 p6 p5 g4
    
    G2 = g11 + p11 g10 + p11 p10 g9 + p11 p10 p9 g8
    
    G3 = g15 + p15 g14 + p15 p14 g13 + p15 p14 p13 g12
    
    

    From these super generates and super propagates, we can calculate the super carries, i.e. the carries for the four 4-bit adders.

    C1 = G0 + P0 c0
    
    C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0
    
    C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0
    
    C4 = G3 + P3 C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 c0
    

    Now these C's (together with the original inputs a and b) are just what the 4-bit CLAs need.

    How long does this take, again assuming 5 input gates?

    1. We calculate the p's and g's (lower case) in 1 gate delay (as with the 4-bit CLA).
    2. We calculate the P's one gate delay after we have the p's or 2 gate delays after we start.
    3. The G's are determined 2 gate delays after we have the g's and p's. So the G's are done 3 gate delays after we start.
    4. The C's are determined 2 gate delays after the P's and G's. So the C's are done 5 gate delays after we start.
    5. Now the C's are sent back to the 4-bit CLAs, which have already calculated the p's and g's. The C's are calculated in 2 more gate delays (7 total) and the s's 2 more after that (9 total).

    In summary, a 16-bit CLA takes 9 gate delays instead of 32 for a ripple carry adder and 14 for the mixed adder.





    Some pictures follow.


    Take our original picture of the 4-bit CLA and collapse the details so it looks like.







    Next include the logic to calculate P and G.

    Now put four of these with a CLA block (to calculate C's from P's, G's and Cin) and we get a 16-bit CLA. Note that we do not use the Cout from the 4-bit CLAs.

    Note that the tall skinny box is general. It takes 4 Ps, 4 Gs, and Cin and calculates 4 Cs. The Ps can be propagates, superpropagates, superduperpropagates, etc. That is, you take 4 of these 16-bit CLAs and the same tall skinny box and you get a 64-bit CLA.

    Homework: 4.44, 4.45

    As noted just above the tall skinny box is useful for all size CLAs. To expand on that point and to review CLAs, let's redo CLAs with the general box.

    Since we are doing 4 bits at a time, the box takes 9=2*4+1 input bits and produces 6=4+2 outputs.

    A 4-bit adder is now

    What does the ``?'' box do?

    Now take four of these 4-bit adders and use the identical CLA box to get a 16-bit adder

    Four of these 16-bit adders with the identical CLA box give a 64-bit adder.


    ======== START LECTURE #11 ========





    Shifter

    This is a sequential circuit.




    Homework: A 4-bit shift register initially contains 1101. It is shifted six times to the right with the serial input being 101101. What are the contents of the register after each shift?

    Homework: Same register, same initial condition. For the first 6 cycles the opcodes are left, left, right, nop, left, right and the serial input is 101101. The next cycle the register is loaded (in parallel) with 1011. The final 6 cycles are the same as the first 6. What are the contents of the register after each cycle?

    4.6: Multiplication

        product <- 0
        for i = 0 to 31
            if LOB of multiplier = 1
                product = product + multiplicand
            shift multiplicand left 1 bit
            shift multiplier right 1 bit
    

    Do on the board 4-bit multiplication (8-bit registers) 1100 x 1101. Since the result has (up to) 8 bits, this is often called a 4x4->8 multiply.

    The diagrams below are for a 32x32-->64 multiplier.

    What about the control?

    This works!

    But, when compared to the better solutions to come, it is wasteful of resources and hence is

    The product register must be 64 bits since the product can contain 64 bits.

    Why is the multiplicand register 64 bits?

    Why is the ALU 64 bits?

    POOF!! ... as the smoke clears we see an idea.

    We can solve both problems at once

    This results in the following algorithm

        product <- 0
        for i = 0 to 31
            if LOB of multiplier = 1
                (serial_in, product[32-63]) <- product[32-63] + multiplicand
            shift product right 1 bit
            shift multiplier right 1 bit
    

    What about the control?

    Redo same example on board

    A final trick (``gate bumming'', like code bumming of 60s).

        product[0-31] <- multiplier
        for i = 0 to 31
            if LOB of product = 1
                (serial_in, product[32-63]) <- product[32-63] + multiplicand
            shift product right 1 bit
    










    Control again boring.
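
    Here is a hedged C sketch (mine, not course code) of this final version
    for unsigned 32x32-->64 multiplication: the product register starts with
    the multiplier in its low half, a 33-bit sum (serial_in, product[32-63])
    is formed when the LOB is 1, and the whole register then shifts right.

        #include <stdint.h>

        uint64_t multiply(uint32_t multiplicand, uint32_t multiplier)
        {
            uint64_t product = multiplier;     /* product[0-31] <- multiplier */
            for (int i = 0; i < 32; i++) {
                uint64_t high = product >> 32;           /* product[32-63]  */
                if (product & 1)                         /* LOB of product  */
                    high += multiplicand;  /* 33 bits: (serial_in, high)    */
                /* shift (serial_in, product) right 1 bit */
                product = (high << 31) | ((product & 0xFFFFFFFFu) >> 1);
            }
            return product;
        }

    On the 4-bit example, multiply(12, 13) (i.e., 1100 x 1101) yields 156 = 1001 1100.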



    Redo the same example on the board.

    The above was for unsigned 32-bit multiplication.

    What about signed multiplication?

    There are faster multipliers, but we are not covering them.

    4.7: Division

    We are skipping division.

    4.8: Floating Point

    We are skipping floating point.

    4.9: Real Stuff: Floating Point in the PowerPC and 80x86

    We are skipping floating point.

    Homework: Read 4.10 ``Fallacies and Pitfalls'', 4.11 ``Conclusion'', and 4.12 ``Historical Perspective''.


    ======== START LECTURE #12 ========

    Notes:

    Midterm exam 25 Oct.

    Lab 2. Due 1 November. Extend lab 1 to a 32-bit ALU that in addition handles sub, slt, zero detect, and overflow. That is, produce a gate level simulation of Figure 4.19. This figure is also in the class notes; it is the penultimate figure before ``Fast Adders''.

    It is NOW DEFINITE that on Monday 23 Oct, my office hours will have to move from 2:30--3:30 to 1:30--2:30 due to a departmental committee meeting.

    Don't forget the mirror site. My main website will be going down for an OS upgrade at some point. Start at http://cs.nyu.edu

    End of Notes:

    Chapter 5: The processor: datapath and control

    Homework: Start Reading Chapter 5.

    5.1: Introduction

    We are going to build the MIPS processor

    Figure 5.1 redrawn below shows the main idea

    Note that the instruction gives the three register numbers as well as an immediate value to be added.

    5.2: Building a datapath

    Let's begin doing the pieces in more detail.





    Instruction fetch

    We are ignoring branches for now.




    R-type instructions

    Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)?

    load and store

    lw  $r,disp($s)
    sw  $r,disp($s)
    

    Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)? What would happen if the MemWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the MemWrite line had a stuck-at-1 fault (was always asserted)?

    There is a cheat here.

    Branch on equal (beq)

    Compare two registers and branch if equal.

    Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)?

    5.3: A simple implementation scheme

    We will just put the pieces together and then figure out the control lines that are needed and how to set them. We are not now worried about speed.

    We are assuming that the instruction memory and data memory are separate. So we are not permitting self modifying code. We are not showing how either memory is connected to the outside world (i.e. we are ignoring I/O).

    We have to use the same register file with all the pieces since, when a load changes a register, a subsequent R-type instruction must see the change, and when an R-type instruction makes a change, the lw/sw must see it (for loading or calculating the effective address, etc.).

    We could use separate ALUs for each type but it is easy not to, so we will use the same ALU for all. We do have a separate adder for incrementing the PC.

    Combining R-type and lw/sw

    The problem is that some inputs can come from different sources.

    1. For R-type, both ALU operands are registers. For I-type (lw/sw) the second operand is the (sign extended) immediate field.
    2. For R-type, the write data comes from the ALU. For lw it comes from the memory.
    3. For R-type, the write register comes from field rd, which is bits 15-11. For lw, the write register comes from field rt, which is bits 20-16.

    We will deal with the first two now by using a mux for each. We will deal with the third shortly by (surprise) using a mux.

    Combining R-type and lw/sw


    ======== START LECTURE #13 ========

    Including instruction fetch

    This is quite easy

    Finally, beq

    We need to have an ``if stmt'' for PC (i.e., a mux)

    Homework: 5.5 (just the datapath, not the control), 5.8 (just the datapath, not the control), 5.9.

    The control for the datapath

    We start with our last figure, which shows the data path and then add the missing mux and show how the instruction is broken down.

    We need to set the muxes.

    We need to generate the three ALU cntl lines: 1-bit Bnegate and 2-bit OP

        And     0 00
        Or      0 01
        Add     0 10
        Sub     1 10
        Set-LT  1 11
    
    Homework: What happens if we use 1 00 for the three ALU control lines? What if we use 1 01?

    What information can we use to decide on the muxes and alu cntl lines?

    The instruction!

    So no problem, just do a truth table.

    We will let the main control (to be done later) ``summarize'' the opcode for us. It will generate a 2-bit field ALUOp.

        ALUOp   Action needed by ALU
    
        00      Addition (for load and store)
        01      Subtraction (for beq)
        10      Determined by funct field (R-type instruction)
        11      Not used
    

    How many entries do we have now in the truth table?

        ALUOp | Funct        ||  Bnegate:OP
        1 0   | 5 4 3 2 1 0  ||  B OP
        ------+--------------++------------
        0 0   | x x x x x x  ||  0 10
        x 1   | x x x x x x  ||  1 10
        1 x   | x x 0 0 0 0  ||  0 10
        1 x   | x x 0 0 1 0  ||  1 10
        1 x   | x x 0 1 0 0  ||  0 00
        1 x   | x x 0 1 0 1  ||  0 01
        1 x   | x x 1 0 1 0  ||  1 11
        
    1. When is Bnegate (called Op2 in book) asserted?
      • Those rows where its bit is 1, rows 2, 4, and 7.
            ALUOp | Funct      
            1 0   | 5 4 3 2 1 0
            ------+------------
            x 1   | x x x x x x
            1 x   | x x 0 0 1 0
            1 x   | x x 1 0 1 0
            
      • Notice that, in the 5 rows with ALUOp=1x, F1=1 is enough to distinguish the two rows where Bnegate is asserted.
      • This gives
            ALUOp | Funct       
            1 0   | 5 4 3 2 1 0 
            ------+-------------
            x 1   | x x x x x x 
            1 x   | x x x x 1 x 
            
      • Hence Bnegate is ALUOp0 + (ALUOp1 F1)

    2. When is OP1 asserted?
      • Again we begin with the rows where its bit is one
            ALUOp | Funct      
            1 0   | 5 4 3 2 1 0
            ------+------------
            0 0   | x x x x x x
            x 1   | x x x x x x
            1 x   | x x 0 0 0 0
            1 x   | x x 0 0 1 0
            1 x   | x x 1 0 1 0
            
      • Again inspection of the 5 rows with ALUOp=1x yields one F bit that distinguishes when OP1 is asserted, namely F2=0
            ALUOp | Funct      
            1 0   | 5 4 3 2 1 0
            ------+------------
            0 0   | x x x x x x
            x 1   | x x x x x x
            1 x   | x x x 0 x x
            
      • Since x 1 in the second row is really 0 1, rows 1 and 2 can be combined to give
            ALUOp | Funct      
            1 0   | 5 4 3 2 1 0
            ------+------------
            0 x   | x x x x x x
            1 x   | x x x 0 x x
            
      • Now we can use the first row to enlarge the scope of the last row
            ALUOp | Funct      
            1 0   | 5 4 3 2 1 0
            ------+------------
            0 x   | x x x x x x
            x x   | x x x 0 x x
            
      • So OP1 = NOT ALUOp1 + NOT F2

    3. When is OP0 asserted?
      • Start with the rows where its bit is set.
            ALUOp | Funct      
            1 0   | 5 4 3 2 1 0
            ------+------------
            1 x   | x x 0 1 0 1
            1 x   | x x 1 0 1 0
            
      • But again looking at all the rows where ALUOp=1x we see that the two rows where OP0 is asserted are characterized by just two Function bits
            ALUOp | Funct      
            1 0   | 5 4 3 2 1 0
            ------+------------
            1 x   | x x x x x 1
            1 x   | x x 1 x x x
            
      • So OP0 is ALUOp1 F0 + ALUOp1 F3
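
    As a sanity check, here is a small C program (mine) that evaluates the
    three equations just derived against the seven truth table rows above.

        #include <stdio.h>

        int main(void)
        {
            /* Each row: ALUOp1 ALUOp0, F5..F0, expected Bnegate OP1 OP0.
               Don't-care bits are filled in with representative values.  */
            int rows[7][11] = {
                {0,0, 0,0,0,0,0,0, 0,1,0},   /* lw/sw: add */
                {0,1, 0,0,0,0,0,0, 1,1,0},   /* beq:   sub */
                {1,0, 0,0,0,0,0,0, 0,1,0},   /* funct: add */
                {1,0, 0,0,0,0,1,0, 1,1,0},   /* funct: sub */
                {1,0, 0,0,0,1,0,0, 0,0,0},   /* funct: and */
                {1,0, 0,0,0,1,0,1, 0,0,1},   /* funct: or  */
                {1,0, 0,0,1,0,1,0, 1,1,1},   /* funct: slt */
            };
            for (int r = 0; r < 7; r++) {
                int aluop1 = rows[r][0], aluop0 = rows[r][1];
                int f3 = rows[r][4], f2 = rows[r][5];
                int f1 = rows[r][6], f0 = rows[r][7];
                int bneg = aluop0 | (aluop1 & f1);         /* ALUOp0 + ALUOp1 F1    */
                int op1  = !aluop1 | !f2;                  /* ALUOp1' + F2'         */
                int op0  = (aluop1 & f0) | (aluop1 & f3);  /* ALUOp1 F0 + ALUOp1 F3 */
                printf("row %d: got %d %d%d, expected %d %d%d\n", r,
                       bneg, op1, op0, rows[r][8], rows[r][9], rows[r][10]);
            }
            return 0;
        }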

    The circuit is then easy.


    ======== START LECTURE #14 ========

    Now we need the main control.

    So 9 bits.

    The following figure shows where these occur.

    They all are determined by the opcode

    The MIPS instruction set is fairly regular. Most fields we need are always in the same place in the instruction.

    MemRead: Memory delivers the value stored at the specified addr
    MemWrite: Memory stores the specified value at the specified addr
    ALUSrc: Second ALU operand comes from (reg-file / sign-ext-immediate)
    RegDst: Number of reg to write comes from the (rt / rd) field
    RegWrite: Reg-file stores the specified value in the specified register
    PCSrc: New PC is Old PC+4 / Branch target
    MemtoReg: Value written in reg-file comes from (alu / mem)

    We have seen the wiring before (and have a hardcopy to handout).

    We are interested in four opcodes.

    Do a stage play

    The following figures illustrate the play.

    We start with R-type instructions

    Next we show lw

    The following truth table shows the settings for the control lines for each opcode. This is drawn differently since the labels of what should be the columns are long (e.g. RegWrite) and it is easier to have long labels for rows.

    Signal   | R-type | lw | sw | beq
    ---------+--------+----+----+----
    Op5      |   0    | 1  | 1  |  0
    Op4      |   0    | 0  | 0  |  0
    Op3      |   0    | 0  | 1  |  0
    Op2      |   0    | 0  | 0  |  1
    Op1      |   0    | 1  | 1  |  0
    Op0      |   0    | 1  | 1  |  0
    RegDst   |   1    | 0  | X  |  X
    ALUSrc   |   0    | 1  | 1  |  0
    MemtoReg |   0    | 1  | X  |  X
    RegWrite |   1    | 1  | 0  |  0
    MemRead  |   0    | 1  | 0  |  0
    MemWrite |   0    | 0  | 1  |  0
    Branch   |   0    | 0  | 0  |  1
    ALUOp1   |   1    | 0  | 0  |  0
    ALUOp0   |   0    | 0  | 0  |  1

    Now it is straightforward but tedious to get the logic equations

    When drawn in pla style the circuit is


    ======== START LECTURE #15 ========

    Midterm Exam


    ======== START LECTURE #16 ========

    Notes: I might have said that for simulating NOT in the lab ~ was good and ! is bad. That is wrong. You SHOULD use !x for (NOT x). The problem with ~ is that ~1 isn't zero, i.e. ~ TRUE is still TRUE.

    The class did well on the midterm. Although the exam was not very difficult, I am delighted the class did well. The only bad part is that the few students who did not do well need to study more as the final certainly won't be easier. I will go over the exam in a few minutes when more students have arrived. The median grade was 87 and the breakdown was

    90-100 18
    80-89  14
    70-79   8
    60-69   2
    50-59   1
    

    Lab2 is due on Wednesday

    End of Notes.

    Homework: 5.5 and 5.8 (control, we already did the datapath), 5.1, 5.2, 5.10 (just the single-cycle datapath) 5.11.

    Implementing a J-type instruction, unconditional jump

        opcode  addr
        31-26   25-0
    
    Addr is word address; bottom 2 bits of PC are always 0

    Top 4 bits of PC stay as they were (AFTER incr by 4)

    Easy to add.

    Smells like a good final exam type question.

    What's Wrong

    Some instructions are likely slower than others and we must set the clock cycle time long enough for the slowest. The disparity between the cycle times needed for different instructions is quite significant when one considers implementing more difficult instructions, like divide and floating point ops. Actually, if we considered cache misses, which result in references to external DRAM, the cycle time ratios can approach 100.

    Possible solutions


    ======== START LECTURE #17 ========

    Note: Lab 3 (the final lab) handed out today due 20 November.

    Even Faster (we are not covering this).

    Chapter 2 Performance analysis

    Homework: Read Chapter 2

    2.1: Introduction

    Throughput measures the number of jobs per day that can be accomplished. Response time measures how long an individual job takes.

    We define Performance as 1 / Execution time.

    So machine X is n times faster than Y means that Performance(X) = n * Performance(Y), i.e., Execution time(Y) = n * Execution time(X).

    2.2: Measuring Performance

    How should we measure execution time?

    We use CPU time, but this does not mean the other metrics are worse.

    Cycle time vs. Clock rate.

    2.3: Relating the metrics

    The execution time for a given job on a given computer is

    (CPU) execution time = (#CPU clock cycles required) * (cycle time)
                         = (#CPU clock cycles required) / (clock rate)
    

    The number of CPU clock cycles required equals the number of instructions executed times the number of cycles in each instruction.

    But real systems are more complicated than that!

    Through a great many measurements, one calculates for a given machine the average CPI (cycles per instruction).

    The number of instructions required for a given program depends on the instruction set. For example, we saw in chapter 3 that 1 VAX instruction often accomplishes more than 1 MIPS instruction.

    Complicated instructions take longer; either more cycles or longer cycle time.

    Older machines with complicated instructions (e.g. VAX in 80s) had CPI>>1.

    With pipelining can have many cycles for each instruction but still have CPI nearly 1.

    Modern superscalar machines have CPI < 1.

    Putting this together, we see that

       Time (in seconds) =  #Instructions * CPI * Cycle_time (in seconds).
       Time (in ns)      =  #Instructions * CPI * Cycle_time (in ns).
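
    For example (numbers invented for illustration): a program that executes
    10**8 instructions on a machine with CPI 2 and a 500 MHz clock (2 ns
    cycle time) takes 10**8 * 2 * 2 ns = 0.4 seconds.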
    

    Homework: Carefully go through and understand the example on page 59

    Homework: 2.1-2.5 2.7-2.10

    Homework: Make sure you can easily do all the problems with a rating of [5] and can do all with a rating of [10]


    ======== START LECTURE #18 ========

    What is the MIPS rating for a computer and how useful is it?

    Homework: Carefully go through and understand the example on pages 61-3

    How about MFLOPS (Millions of FLoating point OPerations per Second)? For numerical calculations floating point operations are the ones you are interested in; the others are ``overhead'' (a very rough approximation to reality).

    It has similar problems to MIPS.

    Benchmarks are better than MIPS or MFLOPS, but still have difficulties.

    Homework: Carefully go through and understand 2.7 ``fallacies and pitfalls''.

    Chapter 7: Memory

    Homework: Read Chapter 7

    7.1: Introduction

    Ideal memory is

    So we use a memory hierarchy ...

    1. Registers
    2. Cache (really L1, L2, and maybe L3)
    3. Memory
    4. Disk
    5. Archive

    ... and try to catch most references in the small fast memories near the top of the hierarchy.

    There is a capacity/performance/price gap between each pair of adjacent levels. We will study the cache <---> memory gap

    We observe empirically (and teach in 202).

    A cache is a small fast memory between the processor and the main memory. It contains a subset of the contents of the main memory.

    A Cache is organized in units of blocks. Common block sizes are 16, 32, and 64 bytes. This is the smallest unit we can move to/from a cache.

    A hit occurs when a memory reference is found in the upper level of memory hierarchy.

    7.2: The Basics of Caches

    We start with a very simple cache organization. One that was used on the Decstation 3100, a 1980s workstation.

    Example on pp. 547-8.

    Address(10)  Address(2)  hit/miss  block#
    -----------  ----------  --------  ------
    22           10110       miss      110
    26           11010       miss      010
    22           10110       hit       110
    26           11010       hit       010
    16           10000       miss      000
    3            00011       miss      011
    16           10000       hit       000
    18           10010       miss      010

    The basic circuitry for this simple cache to determine hit or miss and to return the data is quite easy. We are showing a 1024 word (= 4KB) direct mapped cache with block size = reference size = 1 word.

    Calculate on the board the total number of bits in this cache.

    Homework: 7.1 7.2 7.3

    Processing a read for this simple cache.

    Skip the section ``handling cache misses'' as it discusses the multicycle and pipelined implementations of chapter 6, which we skipped. For our single cycle processor implementation we just need to note a few points.

    Processing a write for our simple cache (direct mapped with block size = reference size = 1 word).

    Improvement: Use a write buffer

    Unified vs split I and D (instruction and data) caches


    ======== START LECTURE #20 ========

    Improvement: Blocksize > Wordsize

    Homework: 7.7 7.8 7.9

    Why not make blocksize enormous? For example, why not have the cache be one huge block?

    Memory support for wider blocks

    Homework: 7.11

    7.3: Measuring and Improving Cache Performance

    Performance example to do on the board (a dandy exam question).

    Homework: 7.15, 7.16


    ======== START LECTURE #21 ========

    A lower base (i.e. miss-free) CPI makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to losing more instructions if the CPI is lower.

    A faster CPU (i.e., a faster clock) makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to more cycles if the clock is faster (and hence more instructions since the base CPI is the same).

    Another performance example.

    Remark: Larger caches have longer hit times.

    Improvement: Associative Caches

    Consider the following sad story. Jane has a cache that holds 1000 blocks and has a program that only references 4 (memory) blocks, namely 23, 1023, 123023, and 7023. In fact the references occur in order: 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, etc. Referencing only 4 blocks and having room for 1000 in her cache, Jane expected an extremely high hit rate for her program. In fact, the hit rate was zero (all four block numbers are congruent mod 1000, so in a direct mapped cache they map to the same block and keep evicting each other). She was so sad, she gave up her job as webmistress, went to medical school, and is now a brain surgeon at the Mayo Clinic in Rochester, MN.

    So far we have studied only direct mapped caches, i.e. those for which the location in the cache is determined by the address. Since there is only one possible location in the cache for any block, to check for a hit we compare one tag with the HOBs of the addr.

    The other extreme is fully associative.

    Most common for caches is an intermediate configuration called set associative or n-way associative (e.g., 4-way associative).


    ======== START LECTURE #22 ========

    Tag size and division of the address bits

    We continue to assume a byte-addressed machine with all references to a 4-byte word (lw and sw).

    The 2 LOBs are not used (they specify the byte within the word but all our references are for a complete word). We show these two bits in dark blue. We continue to assume 32 bit addresses so there are 2**30 words in the address space.

    Let's review various possible cache organizations and determine for each how large is the tag and how the various address bits are used. We will always use a 16KB cache. That is the size of the data portion of the cache is 16KB = 4 kilowords = 2**12 words.

    1. Direct mapped, blocksize 1 (word).
      • Since the blocksize is one word, there are 2**30 memory blocks and all the address bits (except the 2 LOBs that specify the byte within the word) are used for the memory block number. Specifically 30 bits are so used.
      • The cache has 2**12 words, which is 2**12 blocks.
      • So the low order 12 bits of the memory block number give the index in the cache (the cache block number), shown in cyan.
      • The remaining 18 (30-12) bits are the tag, shown in red. (A small C sketch of this address splitting appears after the list.)

    2. Direct mapped, blocksize 8
      • Three bits of the address give the word within the 8-word block. These are drawn in magenta.
      • The remaining 27 HOBs of the memory address give the memory block number.
      • The cache has 2**12 words, which is 2**9 blocks.
      • So the low order 9 bits of the memory block number give the index in the cache.
      • The remaining 18 bits are the tag

    3. 4-way set associative, blocksize 1
      • Blocksize is 1 so there are 2**30 memory blocks and 30 bits are used for the memory block number.
      • The cache has 2**12 blocks, which is 2**10 sets (each set has 4=2**2 blocks).
      • So the low order 10 bits of the memory block number give the index in the cache.
      • The remaining 20 bits are the tag.
      • As the associativity grows, the tag gets bigger. Why?
        Ans: Growing associativity reduces the number of sets into which a block can be placed. This increases the number of memory blocks eligible to be placed in a given set. Hence more bits are needed to see if the desired block is there.

    4. 4-way set associative, blocksize 8
      • Three bits of the address give the word within the block.
      • The remaining 27 HOBs of the memory address give the memory block number.
      • The cache has 2**12 words = 2**9 blocks = 2**7 sets.
      • So the low order 7 bits of the memory block number give the index in the cache.
      • The remaining 20 (27-7) bits are the tag.
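
    Here is a minimal C sketch (mine; the helper is hypothetical) of the
    address splitting for case 1 above (direct mapped, blocksize 1, 16KB).

        #include <stdint.h>

        /* Split a 32-bit byte address: drop the 2 byte-offset LOBs, use
           the low 12 bits of the block number as the cache index, and
           keep the remaining 18 bits as the tag.                        */
        void split_address(uint32_t addr, uint32_t *tag, uint32_t *index)
        {
            uint32_t block = addr >> 2;    /* memory block (word) number */
            *index = block & 0xFFF;        /* low 12 bits: cache index   */
            *tag   = block >> 12;          /* remaining 18 bits: tag     */
        }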

    Homework: 7.39, 7.40

    Improvement: Multilevel caches

    Modern high end PCs and workstations all have at least two levels of caches: A very fast, and hence not very big, first level (L1) cache together with a larger but slower L2 cache.

    When a miss occurs in L1, L2 is examined, and only if a miss occurs there is main memory referenced.

    So the average miss penalty for an L1 miss is

    (L2 hit rate)*(L2 time) + (L2 miss rate)*(L2 time + memory time)
    
    We are assuming L2 time is the same for an L2 hit or L2 miss. We are also assuming that the access doesn't begin to go to memory until the L2 miss has occurred.

    Do an example
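
    For instance (numbers invented): if the L2 hit rate is 90%, an L2 access
    takes 10 cycles, and a memory access takes 100 cycles, the average L1
    miss penalty is .9*10 + .1*(10+100) = 20 cycles.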

    7.4: Virtual Memory

    I realize this material was covered in operating systems class (V22.0202). I am just reviewing it here. The goal is to show the similarity to caching, which we just studied. Indeed, (the demand part of) demand paging is caching: In demand paging the memory serves as a cache for the disk, just as in caching the cache serves as a cache for the memory.

    The names used are different and there are other differences as well.

    Cache concept         Demand paging analogue
    --------------------  ------------------------
    Memory block          Page
    Cache block           Page frame (frame)
    Blocksize             Pagesize
    Tag                   None (table lookup)
    Word in block         Page offset
    Valid bit             Valid bit
    Miss                  Page fault
    Hit                   Not a page fault
    Miss rate             Page fault rate
    Hit rate              1 - Page fault rate

    Cache concept         Demand paging analogue
    --------------------  ------------------------
    Placement question    Placement question
    Replacement question  Replacement question
    Associativity         None (fully associative)

    Homework: 7.32

    Write through vs. write back

    Question: On a write hit should we write the new value through to (memory/disk) or just keep it in the (cache/memory) and write it back to (memory/disk) when the (cache-line/page) is replaced?


    ======== START LECTURE #23 ========

    Translation Lookaside Buffer (TLB)

    A TLB is a cache of the page table



    Putting it together: TLB + Cache

    This is the Decstation 3100

    Actions taken

    1. The page number is searched in the fully associative TLB
    2. If a TLB hit occurs, the frame number from the TLB together with the page offset gives the physical address. A TLB miss causes an exception to reload the TLB from the page table, which the figure does not show.
    3. The physical address is broken into a cache tag and cache index (plus a two bit byte offset that is not used for word references).
    4. If the reference is a write, just do it without checking for a cache hit (this is possible because the cache is so simple as we discussed previously).
    5. For a read, if the tag located in the cache entry specified by the index matches the tag in the physical address, the referenced word has been found in the cache; i.e., we had a read hit.
    6. For a read miss, the cache entry specified by the index is fetched from memory and the data returned to satisfy the request.

    Hit/Miss possibilities

    TLB   Page  Cache  Remarks
    ----  ----  -----  -------
    hit   hit   hit    Possible, but page table not checked on TLB hit, data from cache
    hit   hit   miss   Possible, but page table not checked, cache entry loaded from memory
    hit   miss  hit    Impossible, TLB references in-memory pages
    hit   miss  miss   Impossible, TLB references in-memory pages
    miss  hit   hit    Possible, TLB entry loaded from page table, data from cache
    miss  hit   miss   Possible, TLB entry loaded from page table, cache entry loaded from memory
    miss  miss  hit    Impossible, cache is a subset of memory
    miss  miss  miss   Possible, page fault brings in page, TLB entry loaded, cache loaded

    Homework: 7.31, 7.33

    7.5: A Common Framework for Memory Hierarchies

    Question 1: Where can/should the block be placed?

    This question has three parts.

    1. In what slot are we able to place the block?
      • For a direct mapped cache, there is only one choice.
      • For an n-way associative cache, there are n choices.
      • For a fully associative cache, any slot is permitted.
      • The n-way case includes both the direct mapped and fully associative cases.
      • For a TLB any slot is permitted. That is, a TLB is a fully associative cache of the page table.
      • For paging any slot (i.e., frame) is permitted. That is, paging uses a fully associative mapping (via a page table).
      • For segmentation, any large enough slot (i.e., region) can be used.

    2. If several possible slots are available, which one should be used?
      • I call this question the placement question.
      • For caches, TLBs and paging, which use fixed size slots, the question is trivial; any available slot is just fine.
      • For segmentation, the question is interesting and there are several algorithms, e.g., first fit, best fit, buddy, etc.

    3. If no possible slots are available, which victim should be chosen?
      • For direct mapped caches, the question is trivial. Since the block can only go in one slot, if you need to place the block and the only possible slot is not available, it must be the victim.
      • For all the other cases, n-way associative caches (n>1), TLBs, paging, and segmentation, the question is interesting and there are several algorithms, e.g., LRU, Random, Belady min, FIFO, etc. (a small LRU sketch appears below).
      • See question 3, below.

    Question 2: How is a block found?

    Associativity    Location method                       Comparisons required
    Direct mapped    Index                                 1
    Set associative  Index the set, search among elements  Degree of associativity
    Full             Search all cache entries              Number of cache blocks
    Full             Separate lookup table                 0

    Typical sizes and costs

    Feature                 Typical values   Typical values      Typical values
                            for caches       for demand paging   for TLBs
    Size                    8KB-8MB          16MB-2GB            256B-32KB
    Block size              16B-256B         4KB-64KB            4B-32B
    Miss penalty in clocks  10-100           1M-10M              10-100
    Miss rate               .1%-10%          .000001%-.0001%     .01%-2%

    The difference in sizes and costs for demand paging vs. caching leads to different algorithms for finding the block. Demand paging always uses the bottom row with a separate table (the page table), but caching never uses such a table.

    Question 3: Which block should be replaced?

    This is called the replacement question and is much studied in demand paging (remember back to 202).


    ======== START LECTURE #24 ========

    Question 4: What happens on a write?

    1. Write-through
      • Data is written to both the cache and main memory (in general, to both levels of the hierarchy).
      • Sometimes used for caching; never used for demand paging.
      • Advantages
        • Misses are simpler and cheaper (no copy back).
        • Easier to implement, especially for block size 1, which we did in class.
      • Complication: for blocksize > 1, a write miss is more involved since the rest of the block is now invalid. Fetch the rest of the block from memory (or mark those parts invalid with extra valid bits--not covered in this course).

    Homework: 7.41

    2. Write-back
      • Data is written only to the cache. The memory has stale data, but becomes up to date when the cache block is subsequently replaced.
      • The only real choice for demand paging, since writing to the lower level of the memory hierarchy (in this case disk) is so slow.
      • Advantages
        • Words can be written at cache speed, not memory speed.
        • When blocksize > 1, writes to multiple words in the cache block are only written once to memory (when the block is replaced).
        • Multiple writes to the same word in a short period are written to memory only once.
        • When blocksize > 1, the replacement can utilize a high bandwidth transfer. That is, writing one 64-byte block is faster than 16 writes of 4-bytes each.

      Write miss policy (advanced)

      • For demand paging, the case is pretty clear. Every implementation I know of allocates a frame for the faulting page and fetches the page from disk. That is, it does both an allocate and a fetch.
      • For caching this is not always the case. Since there are two optional actions there are four possibilities.
        1. Don't allocate and don't fetch: This is sometimes called write around. It is done when the data is not expected to be read before it will be evicted. For example, if you are writing a matrix whose size is much larger than the cache.
        2. Don't allocate but do fetch: Impossible, where would you put the fetched block?
        3. Do allocate, but don't fetch: Sometimes called no-fetch-on-write. Also called SANF (store-allocate-no-fetch). Requires multiple valid bits per block since the just-written word is valid but the others are not (since we updated the tag to correspond to the just-written word).
        4. Do allocate and do fetch: The normal case we have been using.

      Chapter 8: Interfacing Processors and Peripherals.

      With processor speed increasing 50% / year, I/O must improve or essentially all jobs will be I/O bound.

      The diagram on the right is quite oversimplified for modern PCs; a more detailed version is below.

      8.2: I/O Devices

      Devices are quite varied and their data rates vary enormously.

      • Some devices like keyboards and mice have tiny data rates.
      • Printers, etc., have moderate data rates.
      • Disks and fast networks have high data rates.
      • A good graphics card and monitor have a huge data rate.

      Show a real disk opened up and illustrate the components

      • Platter
      • Surface
      • Head
      • Track
      • Sector
      • Cylinder
      • Seek time
      • Rotational latency
      • Transfer time

      8.4: Buses

      A bus is a shared communication link, using one set of wires to connect many subsystems.

      • Sounds simple (once you have tri-state drivers) ...

      • ... but it's not.

      • Very serious electrical considerations (e.g., signals reflecting from the end of the bus). We have ignored (and will continue to ignore) all electrical issues.

      • Getting high speed buses is state-of-the-art engineering.

      • Tri-state drivers (advanced):
        • An output device that can either
          1. Drive the line to 1.
          2. Drive the line to 0.
          3. Not drive the line at all (be in a high impedance state).
        • It is possible to have many of these devices connected to the same wire, provided you are careful to ensure that all but one are in the high-impedance mode.
        • This is why a single bus can have many output devices attached (but only one actually performing output at a given time).

      • Buses support bidirectional transfer, sometimes using separate wires for each direction, sometimes not.

      • Normally the memory bus is kept separate from the I/O bus. It is a fast synchronous bus (see next section) and I/O devices can't keep up.

      • Indeed the memory bus is normally custom designed (i.e., companies design their own).

      • The graphics bus is also kept separate in modern designs for bandwidth reasons, but is an industry standard (the so called AGP bus).

      • Many I/O buses are industry standards (ISA, EISA, SCSI, PCI) and support open architectures, where components can be purchased from a variety of vendors.

      • The figure above is similar to H&P's figure 8.9(c), which is shown on the right. The primary difference is that they have the processor directly connected to the memory with a processor memory bus.

      • The processor memory bus has the highest bandwidth, the backplane bus less and the I/O buses the least. Clearly the (sustained) bandwidth of each I/O bus is limited by the backplane bus. Why?
        Because all the data passing on an I/O bus must also pass on the backplane bus. Similarly the backplane bus clearly has at least the bandwidth of an I/O bus.

      • Bus adaptors are used as interfaces between buses. They perform speed matching and may also perform buffering, data width matching, and converting between synchronous and asynchronous buses (see next section).

      • For a realistic example, on the right is a diagram adapted from the 25 October 1999 issue of Microprocessor Reports on a then new Intel chip set, the so called 840.

      • Bus adaptors have a variety of names, e.g. host adapters, hubs, bridges.

      • Bus lines (i.e., wires) include those for data (data lines), function codes, device addresses. Data and address are considered data and the function codes are considered control (remember our datapath for MIPS).

      • Address and data may be multiplexed on the same lines (i.e., first send one then the other) or may be given separate lines. One is cheaper (good) and the other has higher performance (also good). Which is which?
        Ans: the multiplexed version is cheaper.

      Synchronous vs. Asynchronous Buses

      A synchronous bus is clocked.

      • One of the lines in the bus is a clock that serves as the clock for all the devices on the bus.

      • All the bus actions are done on fixed clock cycles. For example, 4 cycles after receiving a request, the memory delivers the first word.

      • This can be handled by a simple finite state machine (FSM). Basically, once the request is seen everything works one clock at a time. There are no decisions like the ones we will see for an asynchronous bus.

      • Because the protocol is so simple, it requires few gates and is very fast. So far so good.

      • Two problems with synchronous buses.
        1. All the devices must run at the same speed.
        2. The bus must be short due to clock skew.

      • Processor to memory buses are now normally synchronous.
        • The number of devices on the bus are small.
        • The bus is short.
        • The devices (i.e. processor and memory) are prepared to run at the same speed.
        • High speed is crucial.

      An asynchronous bus is not clocked.

      • Since the bus is not clocked devices of varying speeds can be on the same bus.

      • There is no problem with clock skew (since there is no clock).

      • But the bus must now contain control lines to coordinate transmission.

      • Common is a handshaking protocol.

      • We now describe, in words and with an FSM, a protocol for a device to obtain data from memory.

      1. The device makes a request (asserts ReadReq and puts the desired address on the data lines).

      2. Memory, which has been waiting, sees ReadReq, records the address and asserts Ack.

      3. The device waits for the Ack; once seen, it drops the data lines and deasserts ReadReq.

      4. The memory waits for the request line to drop. Then it can drop Ack (which it knows the device has now seen). The memory now at its leisure puts the data on the data lines (which it knows the device is not driving) and then asserts DataRdy. (DataRdy has been deasserted until now.)

      5. The device has been waiting for DataRdy. It detects DataRdy and records the data. It then asserts Ack indicating that the data has been read.

      6. The memory sees Ack and then deasserts DataRdy and releases the data lines.

      7. The device seeing DataRdy low deasserts Ack ending the show. Note that both sides are prepared for another performance.

      Improving Bus Performance

      These improvements mostly come at the cost of increased expense and/or complexity.

      1. A multiplicity of buses, as in the diagrams above.

      2. Synchronous instead of asynchronous protocols. Synchronous is actually simpler, but it essentially implies a multiplicity of buses, since not all devices can operate at the same speed.
      3. Wider data path: Use more wires, send more data at one time.

      4. Separate address and data lines: Same as above.

      5. Block transfers: Permit a single transaction to transfer more than one busload of data. Saves the time to release and acquire the bus, but the protocol is more complex.


        ======== START LECTURE #25 ========

        Obtaining bus access

        • The simplest scheme is to permit only one bus master.
          • That is, on each bus only one device is permitted to initiate a bus transaction.
          • The other devices are slaves that only respond to requests.
          • With a single master, there is no issue of arbitrating among multiple requests.

        • One can have multiple masters with daisy chaining of the grant line.
          • Any device can assert the request line, indicating that it wishes to use the bus.
            • This is not trivial: uses ``open collector drivers''.
            • If no output drives the line, it will be ``pulled up'' to 5v, i.e., a logical true.
            • If one or more outputs drive the line to 0v it will go to 0v (a logical false).
            • So if a device wishes to make a request it drives the line to 0v; if it does not wish to make a request it does nothing.
            • This is (another example of) active low logic. The request line is asserted by driving it low.
          • When the arbiter sees the request line asserted (and the previous grantee has issued a release), the arbiter raises the grant line.
          • The grant signal is passed from one device to another if the first device is not requesting the bus. Hence devices near the arbiter have priority and can starve the ones further away.
          • The device whose request is granted asserts the release line when done.
          • Simple, but not fair and not of high performance.

        • Centralized parallel arbiter: Separate request lines from each device and separate grant lines. The arbiter decides which device should be granted the bus.

        • Distributed arbitration by self-selection: Requesting processes identify themselves on the bus and decide individually (and consistently) which one gets the grant.

        • Distributed arbitration by collision detection: Each device transmits whenever it wants, but detects collisions and retries. Ethernet uses this scheme (but modern switched ethernets do not).
        Option         High performance              Low cost
        Bus width      Separate addr and data lines  Multiplex addr and data lines
        Data width     Wide                          Narrow
        Transfer size  Multiple bus loads            Single bus loads
        Bus masters    Multiple                      Single
        Clocking       Synchronous                   Asynchronous

        Do on the board the example on pages 665-666

        • Memory and bus support two widths of data transfer: 4 words and 16 words
        • 64-bit synchronous bus; 200MHz; 1 clock for addr; 1 for data.
        • Two clocks of ``rest'' between bus accesses
        • Memory access times: 4 words in 200ns; additional 4 word blocks in 20ns per block.
        • Can overlap transferring data with reading next data.
        • Find
          1. Sustained bandwidth and latency for reading 256 words using both size transfers
          2. How many bus transactions per sec for each (addr+data)

        • Four word blocks
          • 1 clock to send addr
          • 40 clocks read mem
          • 2 clocks to send data
          • 2 idle clocks
          • 45 total clocks
          • 256/4=64 transactions needed so latency is 64*45*5ns=14.4us
          • 64 trans per 14.4us = 64/14.4 trans per 1us = 4.44M trans per sec
          • Bandwidth = 1024 bytes per 14.4us = 1024/14.4 B/us = 71.11MB/sec

        • Sixteen word blocks
          • 1 clock for addr
          • 40 clocks for reading first 4 words
          • 2 clocks to send
          • 2 clocks idle
          • 4 clocks to read next 4 words. But this is free! Why?
            Because it is done during the send and idle of previous block.
          • So we only pay for the long initial read
          • Total = 1 + 40 + 4*(2+2) = 57 clocks.
          • 16 transactions needed; latency = 57*16*5ns = 4.56us, which is much better than with 4 word blocks.
          • 16 transactions per 4.56us = 3.51M transactions/sec
          • Bandwidth = 1024B per 4.56us = 224.56MB/sec


        ======== START LECTURE #26 ========

        Notes: I received the official final exam notice from Robin.
        V22.0436.001   Gottlieb    Weds. 12/20      WWH 109
                                   2:00-3:50pm
        

        Last year's final exam is on the course home page.

        End of Notes

        8.5: Interfacing I/O Devices

        Giving commands to I/O Devices

        This is really an OS issue. Must write/read to/from device registers, i.e. must communicate commands to the controller. Note that a controller normally contains a microprocessor, but when we say the processor, we mean the central processor not the one on the controller.

        • The controller has a few registers that can be read and/or written by the processor, similar to how the processor reads and writes memory. These registers are also read and written by the controller.
        • Nearly every controller contains
          • A data register, which is readable (by the processor) for an input device (e.g., a simple keyboard), writable for an output device (e.g., a simple printer), and both readable and writable for input/output devices (e.g., disks).
          • A control register for giving commands to the device.
          • A readable status register for reporting errors and announcing when the device is ready for the next action (e.g., for a keyboard telling when the data register is valid, and for a printer telling when the character to be printed has been successfully retrieved from the data register). Remember the communication protocol we studied where ack was used.
        • Many controllers have more registers.

        Communicating with the Processor

        Should we check periodically or be told when there is something to do? Better yet can we get someone else to do it since we are not needed for the job?

        • We get mail at home once a day.
        • At some business offices mail arrives a few times per day.
        • No problem checking once an hour for mail.
        • If email wasn't buffered, you would have to check several times per minute (second? millisecond?).
        • Checking email this often is too much of a burden and most of the time when you check you find there is none so the check was wasted.

        Polling

        Processor continually checks the device status to see if action is required.

        • Like the mail example above.
        • For a general purpose OS, one needs a timer to tell the processor it is time to check (OS issue).
        • For an embedded system (microwave) make the checking part of the main control loop, which is guaranteed to be executed at a minimum frequency (application software issue).
        • For a keyboard or mouse, which have very low data rates, the system can afford to have the main CPU check. We do an example just below.
        • It is a little better for slave-like output devices such as a simple printer. Then the processor only has to poll after a request has been made until the request has been satisfied.

        Do on the board the example on pages 676-677

        • Cost of a poll is 400 clocks.
        • CPU is 500MHz.
        • How much of the CPU is needed to poll
          1. A mouse that requires 30 polls per sec?
          2. A floppy that sends 2 bytes at a time and achieves 50KB/sec?
          3. A hard disk that sends 16 bytes at a time and achieves 4MB/sec?

        • For the mouse, we use 400*30 = 12,000 clock cycles each second for polling. The CPU runs at 500*10^6 cycles/sec. So polling the mouse requires 12,000/(500*10^6) = 2.4*10^-5 of the CPU. A very small penalty.

        • The floppy delivers 25,000 (two byte) data packets per second, so we must poll at that rate not to miss one. CPU cycles needed each second are (400)(25,000) = 10^7. This represents 10^7/(500*10^6) = 2% of the CPU.

        • To keep up with the disk requires 250K polls/sec or 10^8 clock cycles or 20% of the CPU.

        • The system need not poll the floppy and disk until the CPU has issued a request. But then it must keep polling until the request is satisfied.

        Interrupt driven I/O

        Processor is told by the device when to look. The processor is interrupted by the device.

        • Dedicated lines (i.e. wires) on the bus are assigned for interrupts.

        • When a device wants to send an interrupt it asserts the corresponding line.

        • The processor checks for interrupts after each instruction. This requires ``zero time'' as it is done in parallel with the instruction execution.

        • If an interrupt is pending (i.e., if a line is asserted) the processor (this is mostly an OS issue, covered in 202).
          1. Saves the PC and perhaps some registers.
          2. Switches to kernel (i.e., privileged) mode.
          3. Jumps to a location specified in the hardware (the interrupt handler).
          At this point the OS takes over.

        • What if we have several different devices and want to do different things depending on what caused the interrupt?

        • Use vectored interrupts.
          • Instead of jumping to a single fixed location, the system defines a set of locations.
          • The system might have several interrupt lines. If line 1 is asserted, jump to location 100; if line 2 is asserted, jump to location 200; etc.
          • Alternatively, the system could have just one line and have the device send the address to jump to.

        • There are other issues with interrupts that are taught in OS. For example, what happens if an interrupt occurs while an interrupt is being processed. For another example, what if one interrupt is more important than another. These are OS issues and are not covered in this course.

        • The time for processing an interrupt is typically longer than the time for a poll. But interrupts are not generated when the device is idle, a big advantage.

        Do on the board the example on pages 681-682.

        • Same hard disk and processor as above.
        • Cost of servicing an interrupt is 500 cycles.
        • The disk is active only 5% of the time.
        • What percent of the processor would be used to service the interrupts?

        • Cycles/sec needed for processing interrupts while the disk is active is 125 million.
        • This represents 25% of the processor cycles available.
        • But the true cost is only 1.25%, since the disk is active only 5% of the time.
        • Note that the disk is not active (i.e., actively generating interrupts) right after the request is made. During the seek and rotational latency, interrupts are not generated. Only during the transfer are interrupts generated.


        ======== START LECTURE #27 ========

        Direct Memory Access (DMA)

        The processor initiates the I/O operation then ``something else'' takes care of it and notifies the processor when it is done (or if an error occurs).

        • Have a DMA engine (a small processor) on the controller.
        • The processor initiates the DMA by writing the command into data registers on the controller (e.g., read sector 5, head 4, cylinder 123 into memory location 34500)
        • For commands that are longer than the size of the data register(s), a protocol must be used to transmit the information.
        • (I/O done by the processor as in the previous methods is called programmed I/O, PIO).
        • The controller collects data from the device and then sends it on the bus to the memory without bothering the CPU.
          • So we have a multimaster bus and need some sort of arbitration.
          • Normally the I/O devices are given higher priority than the CPU.
          • Freeing the CPU from this task is good but isn't as wonderful as it seems since the memory is busy (but cache hits can be processed).
          • A big gain is that only one bus transaction is needed per bus load. With PIO, two transactions are needed: controller to processor and then processor to memory.
          • This was for an input operation (the controller writes to memory). A similar situation occurs for output, where the controller reads from the memory. Once again, one bus transaction per bus load.
        • When the controller detects that the I/O is complete or if an error occurs, it sets the status register accordingly and sends an interrupt to the processor to notify the latter that the I/O is complete.

        More Sophisticated Controllers

        • Sometimes called ``intelligent'' device controllers, but I prefer not to use anthropomorphic terminology.
        • Some devices, for example a modem on a serial line, deliver data without being requested to. So a controller may need to be prepared for unrequested data.
        • Some devices, for example an ethernet, have a complicated protocol so it is desirable for the controller to process some of that protocol. In particular, the collision detection and retry with exponential backoff characteristic of (non-switched) ethernet requires a real program.
        • Hence some controllers have microprocessors on board that handle much more than block transfers.
        • In the old days there were I/O channels, which would execute programs written dynamically by the main processor. For the modern controllers, the programs are fixed and loaded in ROM or PROM.

        Subtleties involving the memory system

        • Having the controller simply write to memory doesn't update the cache. Must at least invalidate the cache line.
        • Having the controller simply read from memory gets old values with a write-back cache. Must force writebacks.
        • The memory area to be read or written is specified by the program using virtual addresses. But the I/O must actually go to physical addresses. Need help from the MMU.

        8.6: Designing an I/O system

        Do on the board the example on page 681

        • Assume a system with the following characteristics.
          1. A CPU that executes 300 million instructions/sec.
          2. 50K (OS) instructions required for each I/O.
          3. A Backplane bus (on which all I/O travels) that supports a data rate of 100MB/sec.
          4. Disk controllers supporting a data rate of 20MB/sec and accommodating up to 7 disks.
          5. Disks with bandwidth 5MB/sec and seek plus rotational latency of 10ms.
        • Assume a workload of 64-KB reads and 100K instructions between reads.
        • Find
          1. The maximum I/O rate achievable.
          2. How many controllers are needed for this rate?
          3. How many disks are needed for this rate?

        • One I/O plus the user's code between I/Os takes 150,000 instructions combined.
        • So the CPU limits us to 2000 I/Os per sec.
        • The backplane bus limits us to 100 million / 64,000 = 1562 I/Os per sec.
        • Hence the CPU limit is not relevant and the maximum I/O rate is 1562 I/Os per sec.
        • The disk time for each I/O is 10ms + 64KB/(5MB/sec) = 10ms + 12.8ms = 22.8ms.
        • So each disk can achieve 1/(.0228) = 43.9 I/Os per sec.
        • So we need ceil(1562/43.9) = 36 disks.
        • Each disk uses 64KB/22.8ms = 2.81MB/sec of bus bandwidth.
        • Since the SCSI bus supports 20MB/sec, we can put 7 disks (the maximum permitted) on it without saturating the bus.
        • So, to support 36 disks we need 6 controllers (not all will have 7 disks).

        Remark: The above analysis was very simplistic. It assumed everything overlapped just right, that the I/Os were not bursty, and that the I/Os conveniently spread themselves across the disks.

        Notes:

        We will go over the practice final and review next time.

        Good luck on the (real) final!