Computer Architecture

2001-2002 Fall
Tues Thurs 2:00-3:15
Ciww 109

Chapter 0: Administrivia

Contact Information

Web Pages

There is a web site for the course. You can find it from my home page, which is http://allan.ultra.nyu.edu/~gottlieb

Textbook

The text is Hennessy and Patterson, ``Computer Organization and Design: The Hardware/Software Interface'', 2nd edition.

Computer Accounts and mailman mailing list

Homeworks and Labs

I make a distinction between homework and labs.

Labs are

Homeworks are

Doing Labs on non-NYU Systems

You may solve lab assignments on any system you wish, but ...

Obtaining Help with the Labs

Good methods for obtaining help include

  1. Asking me during office hours (see web page for my hours).
  2. Asking the mailing list.
  3. Asking another student, but ...
    Your lab must be your own.
    That is, each student must submit a unique lab. Of course, simply changing comments, variable names, etc does not produce a unique lab.

Upper left board for assignments and announcements

I use the upper left board for lab/homework assignments and announcements. I should never erase that board. Viewed as a file, it is group readable (the group is those in the room), appendable by just me, and (re-)writable by no one. If you see me start to erase an announcement, let me know.

Appendix B: Logic Design

Homework: Read B1

B.2: Gates, Truth Tables and Logic Equations

Homework: Read B2

Digital ==> Discrete

Primarily (but NOT exclusively) binary at the hardware level

Use only two voltages--high and low.

Since this is not an engineering course, we will ignore these issues and assume square waves.

In English, digital implies base 10 (based on digit, i.e. finger), but not in computers.

Bit = Binary digIT

Instead of saying high voltage and low voltage, we say true and false or 1 and 0 or asserted and deasserted.

0 and 1 are called complements of each other.

A logic block can be thought of as a black box that takes signals in and produces signals out. There are two kinds of blocks

We are doing combinational blocks now. Will do sequential blocks later (in a few lectures).

TRUTH TABLES

Since combinatorial logic has no memory, it is simply a function from its inputs to its outputs. A Truth Table has as columns all inputs and all outputs. It has one row for each possible set of input values and the output columns have the output for that input. Let's start with a really simple case: a logic block with one input and one output.

There are two columns (1 + 1) and two rows (2**1).

In  Out
0   ?
1   ?

How many possible truth tables are there?

How many different truth tables are there for a ``one in one out'' logic block?

Just 4: The constant functions 1 and 0, the identity, and an inverter (pictures in a few minutes). There are two `?'s in the above table; each can be a 0 or 1 so 2**2 possibilities.

OK. Now how about two inputs and 1 output.

Three columns (2+1) and 4 rows (2**2).

In1 In2  Out
0   0    ?
0   1    ?
1   0    ?
1   1    ?

How many are there? It is just the number of ways you can fill in the output entries, i.e. the question marks. There are 4 output entries so the answer is 2**4=16.

How about 2 in and 8 out?

3 in and 8 out?

n in and k out?
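
One way to count: a block with n inputs and k outputs has a truth table with 2**n rows and k output columns, and each of the k*2**n output entries can independently be 0 or 1. So there are 2**(k*2**n) different logic blocks: 2**32 for 2 in and 8 out, and 2**64 for 3 in and 8 out.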

Gets big fast!

Boolean algebra

Certain logic functions (i.e. truth tables) are quite common and familiar.

We use a notation that looks like algebra to express logic functions and expressions involving them.

The notation is called Boolean algebra in honor of George Boole.

A Boolean value is a 1 or a 0.
A Boolean variable takes on Boolean values.
A Boolean function takes in Boolean variables and produces Boolean values.

  1. The (inclusive) OR Boolean function of two variables. Draw its truth table. This is written + (e.g. X+Y where X and Y are Boolean variables) and often called the logical sum. (Three out of four output values in the truth table look right!)

  2. AND. Draw TT. Called logical product and written as a centered dot (like product in regular algebra). All four values look right.

  3. NOT. Draw TT. This is a unary operator (One argument, not two as above; functions with two inputs are called binary). Written A with a bar over it. I will use ' instead of a bar as it is easier for me to type in html.

  4. Exclusive OR (XOR). Written as + with a circle around it. True if exactly one input is true (i.e., true XOR true = false). Draw TT.
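
A tiny C sketch (not part of any lab; the function names are mine) that treats each of these as a function on ints restricted to 0 and 1 and prints the four truth tables side by side:

    #include <stdio.h>

    /* Each function works on ints that are 0 or 1. */
    int OR (int x, int y) { return x | y; }
    int AND(int x, int y) { return x & y; }
    int NOT(int x)        { return 1 - x; }
    int XOR(int x, int y) { return x ^ y; }   /* true if exactly one input is true */

    int main(void) {
        printf("X Y  X+Y  X.Y  X'  X xor Y\n");
        for (int x = 0; x <= 1; x++)
            for (int y = 0; y <= 1; y++)
                printf("%d %d   %d    %d   %d      %d\n",
                       x, y, OR(x, y), AND(x, y), NOT(x), XOR(x, y));
        return 0;
    }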

Homework: Consider the Boolean function of 3 boolean variables that is true if and only if exactly 1 of the three variables is true. Draw the TT.

Some manipulation laws. Remember this is Boolean ALGEBRA.

Identity:

Inverse:

Both + and . are commutative so my identity and inverse examples contained redundancy.

The name inverse law is somewhat funny since you Add the inverse and get the identity for Product (namely 1), or Multiply by the inverse and get the identity for Sum (namely 0).

Associative:

Due to the associative law we can write A.B.C since either order of evaluation gives the same answer. Similarly we can write A+B+C.

We often elide the . so the product associative law is A(BC)=(AB)C. So we better not have three variables A, B, and AB. In fact, we normally use one letter variables.

Distributive:

How does one prove these laws??

Homework: Prove the second distributive law.

Let's do (on the board) the examples on pages B-5 and B-6. Consider a logic function with three inputs A, B, and C; and three outputs D, E, and F defined as follows: D is true if at least one input is true, E if exactly two are true, and F if all three are true. (Note that by ``if'' we mean ``if and only if''.)

Draw the truth table.

Show the logic equations.


======== START LECTURE #2 ========

The first way we solved part E shows that any logic function can be written using just AND, OR, and NOT.

DeMorgan's laws:

You prove DM laws with TTs. Indeed that is ...
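
To make ``prove it with a TT'' concrete, here is a hedged C sketch that simply checks both DeMorgan laws on all four input rows; complement is written as 1-x so the values stay 0 or 1:

    #include <assert.h>

    /* Check both laws on every row of the truth table. */
    int main(void) {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++) {
                assert((1 - (a | b)) == ((1 - a) & (1 - b)));   /* (A+B)' = A'B'  */
                assert((1 - (a & b)) == ((1 - a) | (1 - b)));   /* (AB)'  = A'+B' */
            }
        return 0;   /* reaching here means both laws hold for all four rows */
    }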

Homework: B.6 on page B-45.

Do beginning of HW on the board.

With DM (DeMorgan's Laws) we can do quite a bit without resorting to TTs. For example one can show that the two expressions for E in the example above (page B-6) are equal. Indeed that is

Homework: B.7 on page B-45

Do beginning of HW on board.

GATES

Gates implement the basic logic functions: AND OR NOT XOR Equivalence

We often omit the inverters and draw the little circles at the input or output of the other gates (AND OR). These little circles are sometimes called bubbles.

This explains why the inverter is drawn as a buffer with a bubble.

Show why the picture for equivalence is the negation of XOR, i.e. (A XOR B)' is AB + A'B'.

(A XOR B)' =
(A'B+AB')' =
(A'B)' (AB')' =
(A''+B') (A'+B'') =
(A + B') (A' + B) =
AA' + AB + B'A' + B'B =
0   + AB + B'A' + 0 =
AB + A'B'

Homework: B.2 on page B-45 (I previously did the first part of this homework).

Homework: Consider the Boolean function of 3 boolean vars (i.e. a three input function) that is true if and only if exactly 1 of the three variables is true. Draw the TT. Draw the logic diagram with AND OR NOT. Draw the logic diagram with AND OR and bubbles.

A set of gates is called universal if these gates are sufficient to generate all logic functions.

NOR (NOT OR) is true when OR is false. Do TT.

NAND (NOT AND) is true when AND is false. Do TT.

Draw two logic diagrams for each, one from the definition and an equivalent one with bubbles.

Theorem A 2-input NOR is universal and a 2-input NAND is universal.

Proof

We must show that you can get A', A+B, and AB using just a two input NOR.
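
Here is one way the three constructions can be written down, as a small C sketch (nor, not_, or_, and_ are my names); the homework asks for the analogous NAND constructions:

    int nor(int a, int b) { return 1 - (a | b); }

    int not_(int a)        { return nor(a, a); }              /* A'  = A NOR A             */
    int or_ (int a, int b) { return not_(nor(a, b)); }        /* A+B = (A NOR B)'          */
    int and_(int a, int b) { return nor(not_(a), not_(b)); }  /* AB  = (A'+B')' (DeMorgan) */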

Homework: Show that a 2-input NAND is universal.

Can draw NAND and NOR each two ways (because (AB)' = A' + B')

We have seen how to get a logic function from a TT. Indeed we can get one that is just two levels of logic. But it might not be the simplest possible. That is, we may have more gates than are necessary.

Trying to minimize the number of gates is NOT trivial. Mano covers the topic of gate minimization in detail. We will not cover it in this course. It is not in H&P. I actually like it but must admit that it takes a few lectures to cover well and it is not used much in practice since it is algorithmic and is done automatically by CAD tools.

Minimization is not unique, i.e. there can be two or more minimal forms.

Given A'BC + ABC + ABC'
Combine first two to get BC + ABC'
Combine last two to get A'BC + AB

Sometimes when building a circuit, you don't care what the output is for certain input values. For example, that input combination might be known not to occur. Another example occurs when, for some combination of input values, a later part of the circuit will ignore the output of this part. These are called don't care outputs. Making use of don't cares can reduce the number of gates needed.

Can also have don't care inputs when, for certain values of a subset of the inputs, the output is already determined and you don't have to look at the remaining inputs. We will see a case of this in the very next topic, multiplexors.

An aside on theory

Putting a circuit in disjunctive normal form (i.e. two levels of logic) means that every path from the input to the output goes through very few gates. In fact only two, an OR and an AND. Maybe we should say three since the AND can have a NOT (bubble). Theoreticians call this number (2 or 3 in our case) the depth of the circuit. So we see that every logic function can be implemented with small depth. But what about the width, i.e., the number of gates?

The news is bad. The parity function takes n inputs and gives TRUE if and only if the number of TRUE inputs is odd. If the depth is fixed (say limited to 3), the number of gates needed for parity is exponential in n.

B.3 COMBINATIONAL LOGIC

Homework: Read B.3.

Generic Homework: Read sections in book corresponding to the lectures.

Multiplexor

Often called a mux or a selector

Show equivalent circuit with AND OR

Hardware if-then-else

    if S=0
        M=A
    else
        M=B
    endif

Can have 4 way mux (2 selector lines)

This is an if-then-elif-elif-else

   if S1=0 and S2=0
        M=A
    elif S1=0 and S2=1
        M=B
    elif S1=1 and S2=0
        M=C
    else -- S1=1 and S2=1
        M=D
    endif
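
The same if-then-else's, written as a small C sketch (mux2/mux4 are my names; in hardware each is just ANDs feeding an OR):

    /* 2-way mux: one selector line. */
    int mux2(int s, int a, int b) {
        return s == 0 ? a : b;
    }

    /* 4-way mux: two selector lines S1 and S2, as in the if-then-elif above. */
    int mux4(int s1, int s2, int a, int b, int c, int d) {
        if (s1 == 0 && s2 == 0) return a;
        if (s1 == 0 && s2 == 1) return b;
        if (s1 == 1 && s2 == 0) return c;
        return d;                            /* S1=1 and S2=1 */
    }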

Do a TT for 2 way mux. Redo it with don't care values.
Do a TT for 4 way mux with don't care values.

Homework: B.12.
B.5 (Assume you have constant signals 1 and 0 as well.)

Decoder

Basically, the opposite of a mux.


======== START LECTURE #3 ========

Test of XOR symbol ⊕ end of test

Encoder

Sneaky way to see that NAND is universal.

Half Adder

Homework: Draw logic diagram

Full Adder

Homework:

How about a 4-bit adder?

How about an n-bit adder?

PLAs--Programmable Logic Arrays

The idea is to make use of the algorithmic way you can look at a TT and produce a circuit diagram in sum of products form.

Consider the following TT from the book (page B-13)

     A | B | C || D | E | F
     --+---+---++---+---+--
     0 | 0 | 0 || 0 | 0 | 0
     0 | 0 | 1 || 1 | 0 | 0
     0 | 1 | 0 || 1 | 0 | 0
     0 | 1 | 1 || 1 | 1 | 0
     1 | 0 | 0 || 1 | 0 | 0
     1 | 0 | 1 || 1 | 1 | 0
     1 | 1 | 0 || 1 | 1 | 0
     1 | 1 | 1 || 1 | 0 | 1

Recall how we construct a circuit from a truth table.


To the right, the above figure is redrawn in a more schematic style.


Finally, it can be redrawn in the more abstract form shown on the right.

Before a PLA is manufactured all the connections are specified. That is, a PLA is specific for a given circuit. The name is somewhat of a misnomer since the array is not programmable by the user.

Homework: B.10 and B.11

Can also have a PAL or Programmable array logic in which the final dots are made by the user. The manufacturer produces a ``sea of gates''; the user programs it to the desired logic function by adding the dots.

ROMs

One way to implement a mathematical (or java) function (without side effects) is to perform a table lookup.

A ROM (Read Only Memory) is the analogous way to implement a logic function.

Important: A ROM does not have state. It is another combinational circuit. That is, it does not represent ``memory''. The reason is that once a ROM is manufactured, the output depends only on the input. I realize this sounds wrong, but it is right.

A PROM is a programmable ROM. That is, you buy the ROM with ``nothing'' in its memory and then, before it is placed in the circuit, you load the memory and never change it. This is like a CD-R.

An EPROM is an erasable PROM. It costs more but if you decide to change its memory this is possible (but is slow). This is like a CD-RW.

``Normal'' EPROMs are erased by some ultraviolet light process. But EEPROMs (electrically erasable PROMS) are faster and are done electronically.

All these EPROMS are erasable not writable, i.e. you can't just change one bit.

A ROM is similar to a PLA

Don't Cares

Example (from the book). Consider a logic function with three inputs A, B, and C, and three outputs D, E, and F.

Full truth table

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     0   1   1 || 1   1   0
     1   0   0 || 1   1   1
     1   0   1 || 1   1   0
     1   1   0 || 1   1   0
     1   1   1 || 1   1   1

This has 7 minterms.

Put in the output don't cares

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     0   1   1 || 1   1   X
     1   0   0 || 1   1   X
     1   0   1 || 1   1   X
     1   1   0 || 1   1   X
     1   1   1 || 1   1   X

Now do the input don't cares

     A   B   C || D   E   F
     ----------++----------
     0   0   0 || 0   0   0
     0   0   1 || 1   0   1
     0   1   0 || 0   1   1
     X   1   1 || 1   1   X
     1   X   X || 1   1   X

These don't cares are important for logic minimization. Compare the number of gates needed for the full TT and the reduced TT. There are techniques for minimizing logic, but we will not cover them.


======== START LECTURE #4 ========

Arrays of Logic Elements



*** Big Change Coming ***

Sequential Circuits, Memory, and State

Why do we want to have state?

B.4: Clocks

Assume you have a real OR gate. Assume the two inputs are both zero for an hour. At time t one input becomes 1. The output will OSCILLATE for a while before settling on exactly 1. We want to be sure we don't look at the answer before it's ready.

Frequency and period

Edges

Synchronous system

Now we are going to add state elements to the combinational circuits we have been using previously.

Remember that a combinational/combinatorial circuit has its outputs determined solely by its inputs, i.e. combinatorial circuits do not contain state.

State elements include state (naturally).

Combinatorial circuits can NOT contain loops. For example imagine an inverter with its output connected to its input. So if the input is false, the output becomes true. But this output is wired to the input, which is now true. Thus the output becomes false, which is the new input. So the output becomes true ... .
However sequential circuits CAN and often do contain loops.


B.5: Memory Elements

We will use only edge-triggered clocked memory in our designs as it is the simplest memory to understand. Our current goal is to construct edge-triggered clocked memory. However we get there in three stages.

  1. We first show how to build unclocked memory.
  2. Then, using unclocked memory, we build level-sensitive clocked memory.
  3. Finally from level-sensitive clocked memory we build edge-triggered clocked memory.

Unclocked Memory

S-R latch (set-reset)

Clocked Memory: Flip-flops and latches

The S-R latch defined above is not clocked memory; unfortunately the terminology is not perfect.

For both flip-flops and latches the output equals the value stored in the structure. Both have an input and an output (and the complemented output) and a clock input as well. The clock determines when the internal value is set to the current input. For a latch, the output can change whenever the clock is asserted (level sensitive). For a flip-flop, changes occur only at the active edge.

D latch

The D is for data








In the traces to the right notice how the output follows the input when the clock is high and remains constant when the clock is low. We assume the stored value was initially low.
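
A behavioral C sketch of the D latch just described (not a gate-level model, and not lab code): while the clock is high the output follows D; while it is low the output holds:

    /* While the clock is high the latch is transparent; while it is low the
       output holds the stored value. */
    int d_latch(int clock, int d, int *stored) {
        if (clock)
            *stored = d;     /* output follows the input while the clock is asserted */
        return *stored;      /* the output is whatever is stored */
    }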

D or Master-Slave Flip-flop

This was our goal. We now have an edge-triggered, clocked memory.







Note how much less wiggly the output is with the master-slave flop than before with the transparent latch. As before we are assuming the output is initially low.
Homework: Try moving the inverter to the other latch. What has changed?


======== START LECTURE #5 ========






This picture shows the setup and hold times discussed above.


Homework: B.18

Registers

A register is basically just an array of D flip-flops. For example a 32-bit register is an array of 32 D flops.


This, however, is not so good! We must have the write line correct quite a while before the active edge. That is, you must know whether you are writing quite a while in advance.





An alternative is to use an active low write line, i.e. have a W' input.




To implement a multibit register, just use multiple D flops.

Register File

A register file is just a set of registers, each one numbered.


Reading From a Register File

To support reading a register we just need a (big) mux from the register file to select the correct register.

Writing a Register in a Register File

To support writing a register we use a decoder on the register number to determine which register to write. Note that 3 errors in the book's figure were fixed.

Recall that the inputs to a register are W, the write line, D the data to write (if the write line is asserted), and the clock. We should perform a write to register r this cycle if the write line is asserted and the register number specified is r. The idea is to gate the write line with the output of the decoder.
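
A behavioral C sketch of the idea (names and sizes are mine): the decoder plus the AND gates collapse into the index test, so only the selected register is written, and only when the write line is asserted:

    #include <stdint.h>

    #define NREGS 32
    static uint32_t regs[NREGS];          /* the 32 registers */

    /* Reading: the big mux just selects register r. */
    uint32_t read_reg(int r) {
        return regs[r];
    }

    /* Writing, once per clock edge: the decoder output for register r,
       ANDed with the write line, is modeled by the if plus the index. */
    void clock_edge(int write, int r, uint32_t data) {
        if (write)
            regs[r] = data;               /* only the selected register is written */
    }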

Homework: 20

SRAMS and DRAMS


Note: There are other kinds of flip-flops: T, J-K. Also one could learn about excitation tables for each. We will not cover this material (H&P doesn't either). If interested, see Mano.


======== START LECTURE #6 ========

B.6: Finite State Machines (FSMs)

I do a different example from the book (counters instead of traffic lights). The ideas are the same and the two generic pictures (below) apply to both examples.

Counters

A counter counts (naturally).








The state transition diagram



The circuit diagram.



How do we determine the combinatorial circuit?

Current      || Next A
   A    I R  || DA <-- i.e. to what must I set DA
-------------++--      in order to get the desired
   0    0 0  || 0      Next A for the next cycle.
   1    0 0  || 1
   0    1 0  || 1
   1    1 0  || 0
   x    x 1  || 0

But this table is simply the truth table for the combinatorial circuit.

A I R  || DA
-------++--
0 0 0  || 0
1 0 0  || 1
0 1 0  || 1
1 1 0  || 0
x x 1  || 0

DA = R' (A XOR I)

How about a two bit counter.

To determine the combinatorial circuit we could proceed as before

Current      ||
  A B   I R  || DA DB
-------------++------

This would work but we can instead think about how a counter works and see that.

DA = R'(A XOR I)
DB = R'(B XOR AI)
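
A behavioral C sketch of this 2-bit counter using exactly those next-state equations (A is the low-order bit, B the high-order bit; each call to tick is one clock edge):

    #include <stdio.h>

    /* One clock edge of the 2-bit counter; I is the increment input, R the reset. */
    void tick(int *A, int *B, int I, int R) {
        int DA = !R & (*A ^ I);           /* DA = R'(A XOR I)  */
        int DB = !R & (*B ^ (*A & I));    /* DB = R'(B XOR AI) */
        *A = DA;                          /* both flops load on the edge */
        *B = DB;
    }

    int main(void) {
        int A = 0, B = 0;
        for (int i = 0; i < 6; i++) {
            tick(&A, &B, 1, 0);           /* keep counting, no reset */
            printf("%d%d\n", B, A);       /* prints 01 10 11 00 01 10 */
        }
        return 0;
    }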

Homework: B.23

B.7 Timing Methodologies

Skipped

Simulating Combinatorial Circuits at the Gate Level

The idea is, given a circuit diagram, write a program that behaves the way the circuit does. This means more than getting the same answer. The program is to work the way the circuit does.

For each logic box, you write a procedure with the following properties.

Simulating a Full Adder

Remember that a full adder has three inputs and two outputs. Discuss FullAdder.c or perhaps FullAdder.java.
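
A sketch of what such a procedure might look like (the actual FullAdder.c handed out in class may well differ); all values are ints restricted to 0 or 1, to mimic wires:

    /* A possible shape for the full-adder procedure. */
    void full_adder(int a, int b, int carry_in, int *sum, int *carry_out) {
        *sum = a ^ b ^ carry_in;                                 /* sum = XOR of the three inputs */
        *carry_out = (a & b) | (a & carry_in) | (b & carry_in);  /* carry if at least two inputs are 1 */
    }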

Simulating a 4-bit Adder

This implementation uses the full adder code above. Discuss FourBitAdder.c or perhaps FourBitAdder.java
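
A matching sketch built from the full_adder sketch above (again, the real FourBitAdder.c may differ); a[0] and b[0] are the low-order bits:

    /* 4-bit ripple-carry adder; the return value is the carry out of the
       high-order bit. */
    int four_bit_adder(int a[4], int b[4], int carry_in, int sum[4]) {
        int carry = carry_in;
        for (int i = 0; i < 4; i++)
            full_adder(a[i], b[i], carry, &sum[i], &carry);   /* the carry ripples from bit to bit */
        return carry;
    }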


======== START LECTURE #7 ========

Lab 1: Simulating A 1-bit ALU

Hand out Lab 1, which is available in text (without the diagram), pdf, and postscript.

Chapter 1: Computer Abstractions and Technologies

Homework: READ chapter 1. Do 1.1 -- 1.26 (really one matching question)
Do 1.27 to 1.44 (another matching question),
1.45 (and do 10,000 RPM),
1.46, 1.50

Chapter 3: Instructions: Language of the Machine

Homework: Read sections 3.1 3.2 3.3

3.4 Representing instructions in the Computer (MIPS)

Register file

Homework: 3.2.

The fields of a MIPS instruction are quite consistent

    op    rs    rt    rd    shamt  funct   <-- name of field
    6     5     5     5      5      6      <-- number of bits

R-type instruction (R for register)

Example: add $1,$2,$3

I-type (why I?)

    op    rs    rt   address
    6     5     5     16

Examples: lw/sw $1,1000($2)
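
As an illustration of the field layout (a sketch, not course-supplied code), here is how the fields pack into a 32-bit word in C; the opcode/funct values in the comment are the standard MIPS ones (add is funct 0x20, lw is opcode 0x23):

    #include <stdint.h>

    /* op 6, rs 5, rt 5, rd 5, shamt 5, funct 6 */
    uint32_t r_type(int op, int rs, int rt, int rd, int shamt, int funct) {
        return ((uint32_t)op << 26) | ((uint32_t)rs << 21) | ((uint32_t)rt << 16)
             | ((uint32_t)rd << 11) | ((uint32_t)shamt << 6) | (uint32_t)funct;
    }

    /* op 6, rs 5, rt 5, 16-bit immediate */
    uint32_t i_type(int op, int rs, int rt, int imm16) {
        return ((uint32_t)op << 26) | ((uint32_t)rs << 21) | ((uint32_t)rt << 16)
             | ((uint32_t)imm16 & 0xFFFFu);
    }

    /* add $1,$2,$3     is  r_type(0, 2, 3, 1, 0, 0x20)   (funct 0x20 = add)
       lw  $1,1000($2)  is  i_type(0x23, 2, 1, 1000)      (opcode 0x23 = lw) */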

RISC-like properties of the MIPS architecture.

Branching instruction

slt (set less-than)

Example: slt $3,$8,$2

beq and bne (branch (not) equal)

Examples: beq/bne $1,$2,123

blt (branch if less than)

Examples: blt $5,$8,123

ble (branch if less than or equal)

bgt (branch if greater than)

bge (branch if greater than or equal)

Note: Please do not make the mistake of thinking that

    slt $1,$5,$8
    beq $1,$0,L
is the same as
    slt $1,$8,$5
    bne $1,$0,L

It is not the case that the negation of X < Y is Y < X.

End of Note

Homework: 3.12

J-type instructions (J for jump)

        op   address
        6     26

j (jump)

Example: j 10000

jr (jump register)

Example: jr $10

jal (jump and link)

Example: jal 10000


======== START LECTURE #8 ========

Notes
  1. Can now get to homework solutions right from the home page. A password is still required.
  2. Syllabus added.
  3. Lectures through #7 now on course pages.
  4. Homework solutions through #6 now on course pages.
End of Notes

I type instructions (revisited)

addi (add immediate)

Example: addi $1,$2,100

slti (set less-than immediate)

Example slti $1,$2,50

lui (load upper immediate)

Example: lui $4,123

Homework: 3.1, 3.3, 3.4, and 3.5.

Chapter 4

Homework: Read 4.1-4.4

4.2: Signed and Unsigned Numbers

MIPS uses 2s complement (just like 8086)

To form the 2s complement (of 0000 1111 0000 1010 0000 0000 1111 1100)

Need comparisons for signed and unsigned.

sltu and sltiu

Like slt and slti but the comparison is unsigned.

Homework: 4.1-4.9

4.3: Addition and subtraction

To add two (signed) numbers just add them. That is, don't treat the sign bit specially.

To subtract A-B, just take the 2s complement of B and add.
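
A one-line C illustration of ``complement and add'' (made-up numbers); in the ALU of section 4.5 the +1 will come from asserting the carry-in:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t a = 17, b = 5;
        uint32_t diff = a + (~b + 1);     /* two's complement of b, then add */
        printf("%u\n", diff);             /* prints 12 */
        return 0;
    }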

Overflows

An overflow occurs when the result of an operation cannot be represented with the available hardware. For MIPS this means when the result does not fit in a 32-bit word.

Homework: Prove this last statement (4.29) (for fun only, do not hand in).

addu, subu, addiu

These three instructions perform addition and subtraction the same way as do add and sub, but do not signal overflow

4.4: Logical Operations

Shifts: sll, srl

Bitwise AND and OR: and, or, andi, ori

No surprises.

4.5: Constructing an ALU--the fun begins

First goal is 32-bit AND, OR, and addition

Recall we know how to build a full adder. We will draw it as shown on the right.









With this adder, the ALU is easy.












With this 1-bit ALU, constructing a 32-bit version is simple.
  1. Use an array of logic elements for the logic. The logic element is the 1-bit ALU
  2. Use buses for A, B, and Result.
  3. ``Broadcast'' Opcode to all of the internal 1-bit ALUs. This means wire the external Opcode to the Opcode input of each of the internal 1-bit ALUs

First goal accomplished.



======== START LECTURE #9 ========

Implementing Addition and Subtraction

We wish to augment the ALU so that we can perform subtraction (as well as addition, AND, and OR).

A 1-bit ALU with ADD, SUB, AND, OR is shown on the right.

  1. To implement addition we use opcode 10 as before and de-assert both b-invert and Cin.
  2. To implement subtraction we still use opcode 10 but we assert both b-invert and Cin.











32-bit version is simply a bunch of these.

(More or less) all ALUs do AND, OR, ADD, SUB. Now we want to customize our ALU for the MIPS architecture, which has a few extra requirements.

  1. slt set-less-than
  2. Overflows
  3. Zero Detect

Implementing SLT

This is fairly clever as we shall see.









Homework: figure out correct rule, i.e. prob 4.23. Hint: when an overflow occurs the sign bit is definitely wrong (so the complement of the sign bit is right).




Implementing Overflow Detection














Simpler Overflow Detection

Recall the simpler overflow detection: An overflow occurs if and only if the carry in to the HOB differs from the carry out of the HOB.
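
A C sketch of that rule for 32-bit operands (the helper name and the bit-twiddling are mine): the carries into each bit position fall out of a ^ b ^ sum, and the carry out of the high-order bit is just the unsigned wraparound test:

    #include <stdint.h>

    int add_overflows(uint32_t a, uint32_t b) {
        uint32_t sum = a + b;
        uint32_t carry_ins     = a ^ b ^ sum;       /* bit i = carry into bit i */
        uint32_t carry_in_hob  = (carry_ins >> 31) & 1;
        uint32_t carry_out_hob = (sum < a);         /* unsigned wraparound = carry out of bit 31 */
        return carry_in_hob != carry_out_hob;       /* differ ==> signed overflow */
    }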

Implementing Zero Detection


Observation: The CarryIn to the LOB and Binvert to all the 1-bit ALUs are always the same. So the 32-bit ALU has just one input called Bnegate, which is sent to the appropriate inputs in the 1-bit ALUs.

The Final Result is

The symbol used for an ALU is on the right

What are the control lines?

What functions can we perform?

What (3-bit) values for the control lines do we need for each function? The control lines are Bnegate (1-bit) and Operation (2-bits)
and 0 00
or 0 01
add 0 10
sub 1 10
slt 1 11


======== START LECTURE #10 ========

Fast Adders

  1. We have done what is called a ripple carry adder.
  2. What about doing the entire 32 (or 64) bit adder with 2 levels of logic?
  3. There are faster adders, e.g. carry lookahead and carry save. We will study carry lookahead adders.

Carry Lookahead Adder (CLA)

This adder is much faster than the ripple adder we did before, especially for wide (i.e., many bit) addition.

To summarize, using a subscript i to represent the bit number,

    to generate  a carry:   gi = ai bi
    to propagate a carry:   pi = ai+bi

H&P give a plumbing analogue for generate and propagate.

Given the generates and propagates, we can calculate all the carries for a 4-bit addition (recall that c0=Cin is an input) as follows (this is the formula version of the plumbing):

c1 = g0 + p0 c0

c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0

c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0

c4 = g3 + p3 c3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0

Thus we can calculate c1 ... c4 in just two additional gate delays (where we assume one gate can accept up to 5 inputs). Since we get gi and pi after one gate delay, the total delay for calculating all the carries is 3 (this includes c4=Carry-Out)

Each bit of the sum si can be calculated in 2 gate delays given ai, bi, and ci. Thus, for 4-bit addition, 5 gate delays after we are given a, b and Carry-In, we have calculated s and Carry-Out.

So, for 4-bit addition, the faster adder takes time 5 and the slower adder time 8.
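
The four carry formulas, transcribed directly into a C sketch (not a gate-level model; the array names are mine):

    /* g[i] = ai bi, p[i] = ai + bi, c[0] = carry-in; fills in c[1]..c[4]. */
    void cla4(const int g[4], const int p[4], int c0, int c[5]) {
        c[0] = c0;
        c[1] = g[0] | (p[0] & c[0]);
        c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
        c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
        c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
                    | (p[3] & p[2] & p[1] & p[0] & c[0]);
        /* the sum bits are then si = ai ^ bi ^ c[i] */
    }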

Now we want to put four of these together to get a fast 16-bit adder.

As black boxes, both ripple-carry adders and carry-lookahead adders (CLAs) look the same.

We could simply put four CLAs together and let the Carry-Out from one be the Carry-In of the next. That is, we could put these CLAs together in a ripple-carry manner to get a hybrid 16-bit adder.


We want to do better so we will put the 4-bit carry-lookahead adders together in a carry-lookahead manner. Thus the diagram above is not what we are going to do.

We start by determining ``super generate'' and ``super propagate'' bits.

P0 = p3 p2 p1 p0          Does the low order 4-bit adder
                          propagate a carry?
P1 = p7 p6 p5 p4

P2 = p11 p10 p9 p8

P3 = p15 p14 p13 p12      Does the high order 4-bit adder
                          propagate a carry?


G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0        Does low order 4-bit
                                                adder generate a carry
G1 = g7 + p7 g6 + p7 p6 g5 + p7 p6 p5 g4

G2 = g11 + p11 g10 + p11 p10 g9 + p11 p10 p9 g8

G3 = g15 + p15 g14 + p15 p14 g13 + p15 p14 p13 g12

From these super generates and super propagates, we can calculate the super carries, i.e. the carries for the four 4-bit adders.

C1 = G0 + P0 c0

C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0

C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0

C4 = G3 + P3 C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 c0

Now these C's (together with the original inputs a and b) are just what the 4-bit CLAs need.

How long does this take, again assuming 5 input gates?

  1. We calculate the p's and g's (lower case) in 1 gate delay (as with the 4-bit CLA).
  2. We calculate the P's one gate delay after we have the p's or 2 gate delays after we start.
  3. The G's are determined 2 gate delays after we have the g's and p's. So the G's are done 3 gate delays after we start.
  4. The C's are determined 2 gate delays after the P's and G's. So the C's are done 5 gate delays after we start.
  5. Now the C's are sent back to the 4-bit CLAs, which have already calculated the p's and g's. The C's are calculated in 2 more gate delays (7 total) and the s's 2 more after that (9 total).

In summary, a 16-bit CLA takes 9 gate delays instead of 32 for a ripple carry adder and 14 for the mixed adder.





Some pictures follow.


Take our original picture of the 4-bit CLA and collapse the details so it looks like.







Next include the logic to calculate P and G.



















Now put four of these with a CLA block (to calculate C's from P's, G's and Cin) and we get a 16-bit CLA. Note that we do not use the Cout from the 4-bit CLAs.

Note that the tall skinny box is general. It takes 4 Ps, 4 Gs, and Cin and calculates 4 Cs. The Ps can be propagates, superpropagates, superduperpropagates, etc. That is, you take 4 of these 16-bit CLAs and the same tall skinny box and you get a 64-bit CLA.

Homework: 44, 45


======== START LECTURE #11 ========

As noted just above the tall skinny box is useful for all size CLAs. To expand on that point and to review CLAs, let's redo CLAs with the general box.

Since we are doing 4 bits at a time, the box takes 9 = 2*4+1 input bits and produces 6 = 4+2 output bits.










A 4-bit adder is now the figure on the right

What does the ``?'' box do?

  1. Calculates Gi and Pi based on ai and bi
  2. Calculates si based on ai, bi, and Ci=Cin
  3. Does not bother calculating Cout.







Now take four of these 4-bit adders and use the identical CLA box to get a 16-bit adder.

The picture on the right shows ONE 4-bit adder (the magenta box) used with the CLA box. To get a 16-bit adder you need three more magenta boxes, one above the one shown (to process bits 0-3) and two below (to process bits 8-15).





Four of these 16-bit adders with the identical CLA box gives a 64-bit adder (no picture shown).





Shifter

This is a sequential circuit.




Homework: A 4-bit shift register initially contains 1101. It is shifted six times to the right with the serial input being 101101. What are the contents of the register after each shift?

Homework: Same register, same initial condition. For the first 6 cycles the opcodes are left, left, right, nop, left, right and the serial input is 101101. The next cycle the register is loaded (in parallel) with 1011. The final 6 cycles are the same as the first 6. What are the contents of the register after each cycle?

4.6: Multiplication

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            product = product + multiplicand
        shift multiplicand left 1 bit
        shift multiplier right 1 bit

Do on the board 4-bit multiplication (8-bit registers) 1100 x 1101. Since the result has (up to) 8 bits, this is often called a 4x4->8 multiply.
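
A C sketch of this first algorithm on the 4-bit example (8-bit C variables stand in for the 8-bit registers; the function name is mine):

    #include <stdio.h>
    #include <stdint.h>

    /* 4-bit multiplier, 8-bit product and multiplicand registers. */
    uint8_t multiply(uint8_t multiplicand, uint8_t multiplier) {
        uint8_t product = 0;
        for (int i = 0; i < 4; i++) {
            if (multiplier & 1)             /* LOB of multiplier = 1 ? */
                product += multiplicand;
            multiplicand <<= 1;             /* shift multiplicand left  */
            multiplier >>= 1;               /* shift multiplier right   */
        }
        return product;
    }

    int main(void) {
        printf("%d\n", multiply(0xC, 0xD)); /* 1100 x 1101 = 10011100 = 156 */
        return 0;
    }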

The diagrams below are for a 32x32-->64 multiplier.

What about the control?



======== START LECTURE #12 ========

This works!

But, when compared to the better solutions to come, is wasteful of resources and hence is

The product register must be 64 bits since the product can contain 64 bits.

Why is multiplicand register 64 bits?

Why is ALU 64-bits?

POOF!! ... as the smoke clears we see an idea.

We can solve both problems at once

This results in the following algorithm

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            (serial_in, product[32-63]) <- product[32-63] + multiplicand
        shift product right 1 bit
        shift multiplier right 1 bit






What about the control?

Redo same example on board

A final trick (``gate bumming'', like ``code bumming'' of 60s).

    product[0-31] <- multiplier
    for i = 0 to 31
        if LOB of product = 1
            (serial_in, product[32-63]) <- product[32-63] + multiplicand
        shift product right 1 bit
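
A C sketch of this final version for 32x32->64 (the names are mine): the 64-bit product register starts with the multiplier in its low half, and the carry out of the 32-bit add plays the role of serial_in:

    #include <stdio.h>
    #include <stdint.h>

    uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
        uint64_t product = multiplier;                    /* product[0-31] <- multiplier */
        for (int i = 0; i < 32; i++) {
            uint64_t high = product >> 32;                /* product[32-63] */
            uint64_t serial_in = 0;
            if (product & 1) {                            /* LOB of product = 1 ? */
                high += multiplicand;                     /* 33-bit sum */
                serial_in = high >> 32;                   /* the carry out is serial_in */
                high &= 0xFFFFFFFFu;
            }
            /* shift the whole (serial_in, product) right one bit */
            product = (serial_in << 63) | (high << 31) | ((product & 0xFFFFFFFFu) >> 1);
        }
        return product;
    }

    int main(void) {
        printf("%llu\n", (unsigned long long)multiply(0xC, 0xD));   /* prints 156 again */
        return 0;
    }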










Control again boring.

Redo the same example on the board.

The above was for unsigned 32-bit multiplication.

What about signed multiplication.

There are faster multipliers, but we are not covering them.

4.7: Division

We are skipping division.

4.8: Floating Point

We are skipping floating point.

4.9: Real Stuff: Floating Point in the PowerPC and 80x86

We are skipping floating point.

Homework: Read 4.10 ``Fallacies and Pitfalls'', 4.11 ``Conclusion'', and 4.12 ``Historical Perspective''.


======== START LECTURE #13 ========

MIDTERM EXAM


======== START LECTURE #14 ========

Chapter 5: The processor: datapath and control

Homework: Start Reading Chapter 5.

5.1: Introduction

We are going to build the MIPS processor

Figure 5.1 redrawn below shows the main idea

Note that the instruction gives the three register numbers as well as an immediate value to be added.

5.2: Building a datapath

Let's begin doing the pieces in more detail.





Instruction fetch

We are ignoring branches for now.






R-type instructions

Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)?



load and store

lw  $r,disp($s)
sw  $r,disp($s)

lw $r,disp($s):

  1. Computes the effective address formed by adding the 16-bit immediate constant ``disp'' to the contents of register $s.
  2. Fetches the value in data memory at this address.
  3. Inserts this value into register $r.

sw $r,disp($s):

  1. Computes the same effective address as lw $r,disp($s)
  2. Stores the contents of register $r into this address

Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)? What would happen if the MemWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the MemWrite line had a stuck-at-1 fault (was always asserted)?

There is a cheat here.


======== START LECTURE #15 ========

Branch on equal (beq)

Compare two registers and branch if equal.

Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)?






5.3: A Simple Implementation Scheme

We will just put the pieces together and then figure out the control lines that are needed and how to set them. We are not now worried about speed.

We are assuming that the instruction memory and data memory are separate. So we are not permitting self modifying code. We are not showing how either memory is connected to the outside world (i.e. we are ignoring I/O).

We have to use the same register file with all the pieces since when a load changes a register a subsequent R-type instruction must see the change, and when an R-type instruction makes a change the lw/sw must see it (for loading or calculating the effective address, etc.).

We could use separate ALUs for each type but it is easy not to, so we will use the same ALU for all. We do have a separate adder for incrementing the PC.

Combining R-type and lw/sw

The problem is that some inputs can come from different sources.

  1. For R-type, both ALU operands are registers. For I-type (lw/sw) the second operand is the (sign extended) immediate field.
  2. For R-type, the write data comes from the ALU. For lw it comes from the memory.
  3. For R-type, the write register comes from field rd, which is bits 15-11. For lw, the write register comes from field rt, which is bits 20-16.

We will deal with the first two now by using a mux for each. We will deal with the third shortly by (surprise) using a mux.

Combining R-type and lw/sw

Including instruction fetch

This is quite easy

Finally, beq

We need to have an ``if stmt'' for PC (i.e., a mux)

Homework: 5.5 (just the datapath, not the control), 5.8 (just the datapath, not the control), 5.9.


======== START LECTURE #16 ========

The control for the datapath

We start with our last figure, which shows the data path and then add the missing mux and show how the instruction is broken down.

We need to set the muxes.

We need to generate the three ALU cntl lines: 1-bit Bnegate and 2-bit OP

    And     0 00
    Or      0 01
    Add     0 10
    Sub     1 10
    Set-LT  1 11
Homework: What happens if we use 1 00 for the three ALU control lines? What if we use 1 01?

What information can we use to decide on the muxes and alu cntl lines?

The instruction!

So no problem, just do a truth table.

We will let the main control (to be done later) ``summarize'' the opcode for us. It will generate a 2-bit field ALUOp

    ALUOp   Action needed by ALU

    00      Addition (for load and store)
    01      Subtraction (for beq)
    10      Determined by funct field (R-type instruction)
    11      Not used

How many entries do we have now in the truth table?
opcode  ALUOp  operation     funct field  ALU action        ALU cntl
LW      00     load word     xxxxxx       add               010
SW      00     store word    xxxxxx       add               010
BEQ     01     branch equal  xxxxxx       subtract          110
R-type  10     add           100000       add               010
R-type  10     subtract      100010       subtract          110
R-type  10     AND           100100       and               000
R-type  10     OR            100101       or                001
R-type  10     SLT           101010       set on less than  111

    ALUOp | Funct        ||  Bnegate:OP
    1 0   | 5 4 3 2 1 0  ||  B OP
    ------+--------------++------------
    0 0   | x x x x x x  ||  0 10
    x 1   | x x x x x x  ||  1 10
    1 x   | x x 0 0 0 0  ||  0 10
    1 x   | x x 0 0 1 0  ||  1 10
    1 x   | x x 0 1 0 0  ||  0 00
    1 x   | x x 0 1 0 1  ||  0 01
    1 x   | x x 1 0 1 0  ||  1 11
    

How would we implement this?

When is Bnegate (called Op2 in book) asserted?

When is OP0 asserted?

When is OP1 asserted?

The circuit is then easy.
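
One way to express that truth table in C (a sketch; the encoding of the three control bits as an int is mine):

    /* Returns the 3 ALU control bits as an int: (Bnegate << 2) | OP. */
    int alu_control(int aluop, int funct) {
        if (aluop == 0)  return 0x2;          /* 0 10  add (lw/sw)    */
        if (aluop & 1)   return 0x6;          /* 1 10  subtract (beq) */
        switch (funct & 0xF) {                /* R-type: the low funct bits suffice */
            case 0x0: return 0x2;             /* add  0 10 */
            case 0x2: return 0x6;             /* sub  1 10 */
            case 0x4: return 0x0;             /* and  0 00 */
            case 0x5: return 0x1;             /* or   0 01 */
            case 0xA: return 0x7;             /* slt  1 11 */
        }
        return 0x2;                           /* a don't-care row */
    }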

Now we need the main control.

So 9 bits.

The following figure shows where these occur.

They all are determined by the opcode

The MIPS instruction set is fairly regular. Most fields we need are always in the same place in the instruction.

MemRead: Memory delivers the value stored at the specified addr
MemWrite: Memory stores the specified value at the specified addr
ALUSrc: Second ALU operand comes from (reg-file / sign-ext-immediate)
RegDst: Number of reg to write comes from the (rt / rd) field
RegWrite: Reg-file stores the specified value in the specified register
PCSrc: New PC is Old PC+4 / Branch target
MemtoReg: Value written in reg-file comes from (alu / mem)

We have seen the wiring before.

We are interested in four opcodes.

Do a stage play


======== START LECTURE #17 ========

The following figures illustrate the play.

We start with R-type instructions



Next we show lw

The following truth table shows the settings for the control lines for each opcode. This is drawn differently since the labels of what should be the columns are long (e.g. RegWrite) and it is easier to have long labels for rows.

Signal    R-type  lw  sw  beq
Op5       0       1   1   0
Op4       0       0   0   0
Op3       0       0   1   0
Op2       0       0   0   1
Op1       0       1   1   0
Op0       0       1   1   0
RegDst    1       0   X   X
ALUSrc    0       1   1   0
MemtoReg  0       1   X   X
RegWrite  1       1   0   0
MemRead   0       1   0   0
MemWrite  0       0   1   0
Branch    0       0   0   1
ALUOp1    1       0   0   0
ALUOp0    0       0   0   1

Now it is straightforward but tedious to get the logic equations

When drawn in pla style the circuit is

Homework: 5.5 and 5.8 (control, we already did the datapath), 5.1, 5.2, 5.10 (just the single-cycle datapath) 5.11.


======== START LECTURE #18 ========

Implementing a J-type instruction, unconditional jump

    opcode  addr
    31-26   25-0
Addr is word address; bottom 2 bits of PC are always 0

Top 4 bits of PC stay as they were (AFTER incr by 4)

Easy to add.
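
As a one-line sketch (names are mine), the new PC for j is the incremented PC's top 4 bits concatenated with the 26-bit address field shifted left 2:

    #include <stdint.h>

    uint32_t jump_target(uint32_t pc_plus_4, uint32_t addr26) {
        return (pc_plus_4 & 0xF0000000u) | (addr26 << 2);   /* keep top 4 bits, word addr * 4 */
    }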

Smells like a good final exam type question.

What's Wrong

Some instructions are likely slower than others and we must set the clock cycle time long enough for the slowest. The disparity between the cycle times needed for different instructions is quite significant when one considers implementing more difficult instructions, like divide and floating point ops. Actually, if we considered cache misses, which result in references to external DRAM, the cycle time ratios can approach 100.

Possible solutions

Even Faster (we are not covering this).

Chapter 2 Performance analysis

Homework: Read Chapter 2

2.1: Introduction

Throughput measures the number of jobs per day that can be accomplished. Response time measures how long an individual job takes.

We define Performance as 1 / Execution time.

So machine X is n times faster than Y means that

2.2: Measuring Performance

How should we measure execution time?

We use CPU time, but this does not mean the other metrics are worse.

Cycle time vs. Clock rate.


======== START LECTURE #19 ========

2.3: Relating the metrics

The execution time for a given job on a given computer is

(CPU) execution time = (#CPU clock cycles required) * (cycle time)
                     = (#CPU clock cycles required) / (clock rate)

The number of CPU clock cycles required equals the number of instructions executed times the average number of cycles in each instruction.

But real systems are more complicated than that!

Through a great many measurements, one calculates for a given machine the average CPI (cycles per instruction).

The number of instructions required for a given program depends on the instruction set. For example, we saw in chapter 3 that 1 Vax instruction often accomplishes more than 1 MIPS instruction.

Complicated instructions take longer; either more cycles or longer cycle time.

Older machines with complicated instructions (e.g. VAX in 80s) had CPI>>1.

With pipelining can have many cycles for each instruction but still have CPI nearly 1.

Modern superscalar machines have CPI < 1.

Putting this together, we see that

   Time (in seconds) =  #Instructions * CPI * Cycle_time (in seconds).
   Time (in ns)      =  #Instructions * CPI * Cycle_time (in ns).
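
For example (made-up numbers, not the ones in the book's example): a program that executes 500 million instructions on a machine with an average CPI of 1.4 and a 2 ns cycle time takes 500*10**6 * 1.4 * 2 ns = 1.4 seconds.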

Homework: Carefully go through and understand the example on page 59

Homework: 2.1-2.5 2.7-2.10

Homework: Make sure you can easily do all the problems with a rating of [5] and can do all with a rating of [10].

What is the MIPS rating for a computer and how useful is it?

Homework: Carefully go through and understand the example on pages 61-3

How about MFLOPS (Million of FLoating point OPerations per Second)? For numerical calculations floating point operations are the ones you are interested in; the others are ``overhead'' (a very rough approximation to reality).

It has similar problems to MIPS.

Benchmarks are better than MIPS or MFLOPS, but still have difficulties.

Homework: Carefully go through and understand 2.7 ``fallacies and pitfalls''.

Chapter 7: Memory

Homework: Read Chapter 7

7.1: Introduction

Ideal memory is

So we use a memory hierarchy ...

  1. Registers
  2. Cache (really L1, L2, and maybe L3)
  3. Memory
  4. Disk
  5. Archive

... and try to catch most references in the small fast memories near the top of the hierarchy.

There is a capacity/performance/price gap between each pair of adjacent levels. We will study the cache <---> memory gap

We observe empirically (and teach in 202).

A cache is a small fast memory between the processor and the main memory. It contains a subset of the contents of the main memory.

A Cache is organized in units of blocks. Common block sizes are 16, 32, and 64 bytes. This is the smallest unit we can move to/from a cache.

A hit occurs when a memory reference is found in the upper level of memory hierarchy.


======== START LECTURE #20 ========




Note:
Lab3 is assigned and due in 3 weeks.
End of Note

7.2: The Basics of Caches

We start with a very simple cache organization. One that was used on the Decstation 3100, a 1980s workstation.

Address (decimal)  Address (binary)  hit/miss  block#
22                 10110             miss      110
26                 11010             miss      010
22                 10110             hit       110
26                 11010             hit       010
16                 10000             miss      000
 3                 00011             miss      011
16                 10000             hit       000
18                 10010             miss      010

Example on pp. 547-8.


The basic circuitry for this simple cache to determine hit or miss and to return the data is quite easy. We are showing a 1024 word (= 4KB) direct mapped cache with block size = reference size = 1 word.

Calculate on the board the total number of bits in this cache.
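
One way the count comes out (assuming 32-bit byte addresses, as in the figure): 10 bits of the address index the 1024 entries and 2 bits are the byte offset, so the tag is 32-10-2 = 20 bits. Each entry holds 32 data bits + 20 tag bits + 1 valid bit = 53 bits, so the whole cache needs 1024 * 53 = 54,272 bits, noticeably more than the 32K bits of data alone.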

Homework: 7.1 7.2 7.3

Processing a read for this simple cache.

Skip the section ``handling cache misses'' as it discusses the multicycle and pipelined implementations of chapter 6, which we skipped. For our single cycle processor implementation we just need to note a few points.

Processing a write for our simple cache (direct mapped with block size = reference size = 1 word).

Improvement: Use a write buffer

Unified vs split I and D (instruction and data) caches

Improvement: Blocksize > Wordsize

Homework: 7.7 7.8 7.9

Why not make blocksize enormous? For example, why not have the cache be one huge block.



======== START LECTURE #21 ========

Memory support for wider blocks

Homework: 7.11

7.3: Measuring and Improving Cache Performance

Performance example to do on the board (a dandy exam question).

Homework: 7.15, 7.16

A lower base (i.e. miss-free) CPI makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to losing more instructions if the CPI is lower.

A faster CPU (i.e., a faster clock) makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to more cycles if the clock is faster (and hence more instructions since the base CPI is the same).

Another performance example.

Remark: Larger caches have longer hit times.

Improvement: Associative Caches

Consider the following sad story. Jane has a cache that holds 1000 blocks and has a program that only references 4 (memory) blocks, namely 23, 1023, 123023, and 7023. In fact the references occur in order: 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, etc. Referencing only 4 blocks and having room for 1000 in her cache, Jane expected an extremely high hit rate for her program. In fact, the hit rate was zero. She was so sad, she gave up her job as webmistress, went to medical school, and is now a brain surgeon at the Mayo Clinic in Rochester, MN.

So far we have studied only direct mapped caches, i.e. those for which the location in the cache is determined by the address. Since there is only one possible location in the cache for any block, to check for a hit we compare one tag with the HOBs of the addr.

The other extreme is fully associative.

Most common for caches is an intermediate configuration called set associative or n-way associative (e.g., 4-way associative).

How do we find a memory block in an associative cache (with block size 1 word)?


Tag size and division of the address bits

We continue to assume a byte addressed machine with all references to a 4-byte word (lw and sw).

The 2 LOBs are not used (they specify the byte within the word but all our references are for a complete word). We show these two bits in dark blue. We continue to assume 32 bit addresses so there are 2**30 words in the address space.

Let's review various possible cache organizations and determine for each how large the tag is and how the various address bits are used. We will always use a 16KB cache. That is, the size of the data portion of the cache is 16KB = 4 kilowords = 2**12 words.

  1. Direct mapped, blocksize 1 (word).
  2. Direct mapped, blocksize 8
  3. 4-way set associative, blocksize 1
  4. 4-way set associative, blocksize 8
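
One way the arithmetic comes out (assuming 32-bit byte addresses; the 2 low-order bits are always the byte within the word):

  1. Direct mapped, blocksize 1: 2**12 blocks, so 12 index bits and a tag of 32-12-2 = 18 bits.
  2. Direct mapped, blocksize 8: 2**9 blocks, so 9 index bits, 3 bits for the word within the block, and a tag of 18 bits.
  3. 4-way set associative, blocksize 1: 2**10 sets, so 10 index bits and a tag of 20 bits.
  4. 4-way set associative, blocksize 8: 2**7 sets, so 7 index bits, 3 bits for the word within the block, and a tag of 20 bits.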

Homework: 7.39, 7.40

Improvement: Multilevel caches

Modern high end PCs and workstations all have at least two levels of caches: A very fast, and hence not very big, first level (L1) cache together with a larger but slower L2 cache.

When a miss occurs in L1, L2 is examined, and only if a miss occurs there is main memory referenced.

So the average miss penalty for an L1 miss is

(L2 hit rate)*(L2 time) + (L2 miss rate)*(L2 time + memory time)
We are assuming L2 time is the same for an L2 hit or L2 miss. We are also assuming that the access doesn't begin to go to memory until the L2 miss has occurred.
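
For example (made-up numbers): if the L2 access time is 10 cycles, the memory time is 100 cycles, and 80% of L1 misses hit in L2, the average L1 miss penalty is .8*10 + .2*(10+100) = 30 cycles.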

Do an example

7.4: Virtual Memory

I realize this material was covered in operating systems class (V22.0202). I am just reviewing it here. The goal is to show the similarity to caching, which we just studied. Indeed, (the demand part of) demand paging is caching: In demand paging the memory serves as a cache for the disk, just as in caching the cache serves as a cache for the memory.

The names used are different and there are other differences as well.

Cache concept         Demand paging analogue
Memory block          Page
Cache block           Page Frame (frame)
Blocksize             Pagesize
Tag                   None (table lookup)
Word in block         Page offset
Valid bit             Valid bit
Miss                  Page fault
Hit                   Not a page fault
Miss rate             Page fault rate
Hit rate              1 - Page fault rate

Cache concept         Demand paging analogue
Placement question    Placement question
Replacement question  Replacement question
Associativity         None (fully associative)

Homework: 7.32

Write through vs. write back

Question: On a write hit should we write the new value through to (memory/disk) or just keep it in the (cache/memory) and write it back to (memory/disk) when the (cache-line/page) is replaced?

Translation Lookaside Buffer (TLB)

A TLB is a cache of the page table



Putting it together: TLB + Cache

This is the decstation 3100

Actions taken

  1. The page number is searched in the fully associative TLB
  2. If a TLB hit occurs, the frame number from the TLB together with the page offset gives the physical address. A TLB miss causes an exception to reload the TLB from the page table, which the figure does not show.
  3. The physical address is broken into a cache tag and cache index (plus a two bit byte offset that is not used for word references).
  4. If the reference is a write, just do it without checking for a cache hit (this is possible because the cache is so simple as we discussed previously).
  5. For a read, if the tag located in the cache entry specified by the index matches the tag in the physical address, the referenced word has been found in the cache; i.e., we had a read hit.
  6. For a read miss, the cache entry specified by the index is fetched from memory and the data returned to satisfy the request.

Hit/Miss possibilities

TLB   Page  Cache  Remarks
hit   hit   hit    Possible, but page table not checked on TLB hit, data from cache
hit   hit   miss   Possible, but page table not checked, cache entry loaded from memory
hit   miss  hit    Impossible, TLB references in-memory pages
hit   miss  miss   Impossible, TLB references in-memory pages
miss  hit   hit    Possible, TLB entry loaded from page table, data from cache
miss  hit   miss   Possible, TLB entry loaded from page table, cache entry loaded from memory
miss  miss  hit    Impossible, cache is a subset of memory
miss  miss  miss   Possible, page fault brings in page, TLB entry loaded, cache loaded

Homework: 7.31, 7.33

7.5: A Common Framework for Memory Hierarchies

Question 1: Where can/should the block be placed?

This question has three parts.

  1. In what slot are we able to place the block.
  2. If several possible slots are available, which one should be used?
  3. If no possible slots are available, which victim should be chosen?

Question 2: How is a block found?

Associativity    Location method                       Comparisons Required
Direct mapped    Index                                 1
Set Associative  Index the set, search among elements  Degree of associativity
Full             Search all cache entries              Number of cache blocks
                 Separate lookup table                 0

Typical sizes and costs

Feature                 Typical values  Typical values     Typical values
                        for caches      for demand paging  for TLBs
Size                    8KB-8MB         16MB-2GB           256B-32KB
Block size              16B-256B        4KB-64KB           4B-32B
Miss penalty in clocks  10-100          1M-10M             10-100
Miss rate               .1%-10%         .000001%-.0001%    .01%-2%

The difference in sizes and costs for demand paging vs. caching, leads to different algorithms for finding the block. Demand paging always uses the bottom row with a separate table (page table) but caching never uses such a table.

Question 3: Which block should be replaced?

This is called the replacement question and is much studied in demand paging (remember back to 202).

Question 4: What happens on a write?

  1. Write-through

Homework: 7.41

  1. Write-back

Write miss policy (advanced)


======== START LECTURE #23 ========

Notes:
  1. I received the official final exam notice from robin.
    18 December 12:00-1:50 Room 109 CIWW
  2. A practice final exam is on the course home page.
  3. Solutions next week.
End of Notes

Chapter 8: Interfacing Processors and Peripherals.

With processor speed increasing at least 50% / year, I/O must improve or essentially all jobs will be I/O bound.

The diagram on the right is quite oversimplified for modern PCs; a more detailed version is below.

8.2: I/O Devices

Devices are quite varied and their data rates vary enormously.

Show a real disk opened up and illustrate the components (done in 202).

8.4: Buses

A bus is a shared communication link, using one set of wires to connect many subsystems.



Synchronous vs. Asynchronous Buses

A synchronous bus is clocked.

An asynchronous bus is not clocked.

We now describe a protocol in words (below) and with a finite state machine (on the right) for a device to obtain data from memory.

  1. The device makes a request (asserts ReadReq and puts the desired address on the data lines).

  2. Memory, which has been waiting, sees ReadReq, records the address and asserts Ack.

  3. The device waits for the Ack; once seen, it drops the data lines and deasserts ReadReq.

  4. The memory waits for the request line to drop. Then it can drop Ack (which it knows the device has now seen). The memory now at its leisure puts the data on the data lines (which it knows the device is not driving) and then asserts DataRdy. (DataRdy has been deasserted until now.)

  5. The device has been waiting for DataRdy. It detects DataRdy and records the data. It then asserts Ack indicating that the data has been read.

  6. The memory sees Ack and then deasserts DataRdy and releases the data lines.

  7. The device seeing DataRdy low deasserts Ack ending the show. Note that both sides are prepared for another performance.

Improving Bus Performance

These improvements mostly come at the cost of increased expense and/or complexity.

  1. A multiplicity of buses as in the diagrams above.

  2. Synchronous instead of asynchronous protocols. Synchronous is actually simpler, but it essentially implies a multiplicity of buses, since not all devices can operate at the same speed.

  3. Wider data path: Use more wires, send more data at one time.

  4. Separate address and data lines: Same as above.

  5. Block transfers: Permit a single transaction to transfer more than one busload of data. Saves the time to release and acquire the bus, but the protocol is more complex.


======== START LECTURE #24 ========

Obtaining bus access

Option         High performance              Low cost
bus width      separate addr and data lines  multiplex addr and data lines
data width     wide                          narrow
transfer size  multiple bus loads            single bus loads
bus masters    multiple                      single
clocking       synchronous                   asynchronous

Do on the board the example on pages 665-666

8.5: Interfacing I/O Devices

Giving commands to I/O Devices

This is really an OS issue. Must write/read to/from device registers, i.e. must communicate commands to the controller. Note that a controller normally contains a microprocessor, but when we say the processor, we mean the central processor not the one on the controller.

Communicating with the Processor

Should we check periodically or be told when there is something to do? Better yet can we get someone else to do it since we are not needed for the job?

Polling

Processor continually checks the device status to see if action is required.

Do on the board the example on pages 676-677

Interrupt driven I/O

Processor is told by the device when to look. The processor is interrupted by the device.

Do on the board the example on pages 681-682.

Direct Memory Access (DMA)

The processor initiates the I/O operation then ``something else'' takes care of it and notifies the processor when it is done (or if an error occurs).


======== START LECTURE #25 ========

More Sophisticated Controllers

Subtleties involving the memory system

8.6: Designing an I/O system

Do on the board the example on page 681

Remark: The above analysis was very simplistic. It assumed everything overlapped just right and the I/Os were not bursty and that the I/Os conveniently spread themselves across the disks.

Notes:

We will go over the practice final and review next time.

Good luck on the (real) final!