V22.0436: Computer Architecture
2006-07 Fall
Allan Gottlieb
Mondays and Wednesdays 11:00-12:15
Room 102 Ciww

Start Lecture #1

Chapter 0: Administrivia

I start at Chapter 0 so that when we get to chapter 1, the numbering will agree with the text.

0.1: Contact Information

0.2: Course Web Page

There is a web site for the course. You can find it from my home page listed above.

0.3: Textbook

The course text is Hennessy and Patterson, Computer Organization and Design: The Hardware/Software Interface, 3rd edition.

0.4: Computer Accounts and Mailman Mailing List

0.5: Grades

Your grade will be a function of your exams and laboratory assignments (see below). I am not yet sure of the exact weightings, but it will be approximately 30% midterm, 30% labs, 40% final exam.

0.6: The Upper Left Board

I use the upper left board for lab/homework assignments and announcements. I should never erase that board. If you see me start to erase an announcement, please let me know.

I try very hard to remember to write all announcements on the upper left board and I am normally successful. If, during class, you see that I have forgotten to record something, please let me know. HOWEVER, if I forgot and no one reminds me, the assignment has still been given.

0.7: Homeworks and Labs

I make a distinction between homeworks and labs.

Labs are

Homeworks are

0.7.1: Homework Numbering

Homeworks are numbered by the class in which they are assigned. So any homework given today is homework #1. Even if I do not give homework today, the homework assigned next class will be homework #2. Unless I explicitly state otherwise, all homework assignments can be found in the class notes. So the homework present in the notes for lecture #n is homework #n (even if I inadvertently forgot to write it on the upper left board).

0.7.2: Doing Labs on non-NYU Systems

You may solve lab assignments on any system you wish, but ...

0.7.3: Obtaining Help with the Labs

Good methods for obtaining help include

  1. Asking me during office hours (see web page for my hours).
  2. Asking the mailing list.
  3. Asking another student, but ...
    Your lab must be your own.
    That is, each student must submit a unique lab. Naturally, simply changing comments, variable names, etc. does not produce a unique lab.

0.7.4: Computer Language Used for Labs

You may write your lab in Java, C, or C++. Other languages may be possible, but please ask in advance. I need to ensure that the TA is comfortable with the language.

0.8: A Grade of Incomplete

The rules for incompletes and grade changes are set by the school and not the department or individual faculty member. The rules set by CAS, which can be found at http://cas.nyu.edu/object/bulletin0608.ug.academicpolicies.html, state:

The grade of I (Incomplete) is a temporary grade that indicates that the student has, for good reason, not completed all of the course work but that there is the possibility that the student will eventually pass the course when all of the requirements have been completed. A student must ask the instructor for a grade of I, present documented evidence of illness or the equivalent, and clarify the remaining course requirements with the instructor.

The incomplete grade is not awarded automatically. It is not used when there is no possibility that the student will eventually pass the course. If the course work is not completed after the statutory time for making up incompletes has elapsed, the temporary grade of I shall become an F and will be computed in the student's grade point average.

All work missed in the fall term must be made up by the end of the following spring term. All work missed in the spring term or in a summer session must be made up by the end of the following fall term. Students who are out of attendance in the semester following the one in which the course was taken have one year to complete the work. Students should contact the College Advising Center for an Extension of Incomplete Form, which must be approved by the instructor. Extensions of these time limits are rarely granted.

Once a final (i.e., non-incomplete) grade has been submitted by the instructor and recorded on the transcript, the final grade cannot be changed by turning in additional course work.

0.9: Academic Integrity Policy

The CS policy on academic integrity, which applies to all graduate courses in the department, can be found here.

Appendix B: Logic Design

Remark: Appendix B is on the CD that comes with the book, but is not in the book itself. If anyone does not have convenient access to a printer, please let me know and I will print a black and white copy for you. The pdf on the CD is in color so downloading it to your computer for viewing in color is probably a good idea. If you have a color printer that is not terribly slow, you might want to print it in color—that's what I did.

Homework: Read B1

B.2: Gates, Truth Tables and Logic Equations

Homework: Read B2

The word digital, when used in digital logic or digital computer, means discrete. That is, the electrical values (i.e., voltages) of the signals in a circuit are treated as non-negative integers (normally just 0 and 1).

The alternative is analog, where the electrical values are treated as real numbers.

To summarize, we will use only two voltages: high and low. A signal at the high voltage is referred to as 1 or true or set or asserted. A signal at the low voltage is referred to as 0 or false or unset or deasserted.

The assumption that at any time all signals are either 1 or 0 hides a great deal of engineering.

Since this is not an engineering course, we will ignore these issues and assume square waves.

In English, digital suggests base 10 (from digit, i.e., finger), but not in computers.

Indeed, the word bit is short for Binary digIT, and binary means base 2, not base 10.

0 and 1 are called complements of each other, as are true and false (also asserted/deasserted; also set/unset).

A logic block can be thought of as a black box that takes signals in and produces signals out. There are two kinds of blocks: combinational blocks, which contain no memory, and sequential blocks, which do.

We are doing combinational blocks now. Will do sequential blocks later (in a few lectures).

Truth Tables

Since combinatorial logic has no memory, it is simply a (mathematical) function from its inputs to its outputs.

A common way to represent the function is using a Truth Table. A Truth Table has a column for each input and a column for each output. It has one row for each possible set of input values. So, if there are A inputs, there are 2^A rows. In each of these rows the output columns have the output for that input.

How many possible truth tables are there?

1-input, 1-output Truth Tables

Let's start with a really simple truth table, one corresponding to a logic block with one input and one output.

How many different truth tables are there for a one input one output logic block?

There are two columns (1 + 1) and two rows (2^1). Hence the truth table looks like the one on the right with the question marks filled in.

1-input, 1-output Truth Table

    In | Out
     0 |  ?
     1 |  ?

Since there are two question marks and each one can have one of two values, there are just 2^2 = 4 possible truth tables.

  1. The constant function 1, which has output 1 (i.e., true) for either input value.
  2. The constant function 0.
  3. The identity function, i.e., the function whose output equals its input. This logic block is sometimes called a buffer.
  4. An inverter. This function has output the opposite of the input.
We will see pictures for the last two possibilities very soon.
2-input, 1-output Truth Table

    In1 In2 | Out
     0   0  |  ?
     0   1  |  ?
     1   0  |  ?
     1   1  |  ?
2-input, 1-output Truth Tables

Three columns (2+1) and 4 rows (2^2).

How many are there? It is just the number of ways you can fill in the output entries, i.e., the question marks. There are 4 output entries, so the answer is 2^4 = 16.

Larger Truth Tables

How about 2 in and 8 out?

3 in and 8 out?

n in and k out?
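
For reference, here is the counting argument (a sketch of the reasoning): with n inputs there are 2^n rows, hence k·2^n output entries to fill in, each independently 0 or 1.

    number of truth tables = 2^(k·2^n)
    e.g., 2 in, 8 out: 2^(8·2^2) = 2^32
          3 in, 8 out: 2^(8·2^3) = 2^64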

Boolean algebra

We use a notation that looks like algebra to express logic functions and expressions involving them.

The notation is called Boolean algebra in honor of George Boole.

A Boolean value is a 1 or a 0.
A Boolean variable takes on Boolean values.
A Boolean function takes in Boolean variables and produces Boolean values.

Four Boolean functions are especially common.

  1. The (inclusive) OR Boolean function of two variables. Draw its truth table on the board. This is written + (e.g. X+Y where X and Y are Boolean variables) and often called the logical sum. (Three out of four output values in the truth table look like the sum.)

  2. AND. Draw its truth table on the board. And is often called the logical product and written as a centered dot (like the normal product in regular algebra). I will often write it as a period in these notes. As in regular algebra, when all the logical variables are just one character long, we indicate the product by juxtaposition, that is, AB represents the product of A and B when it is clear that it does not represent the two character symbol AB. All four truth table values look like a product.

  3. NOT. Draw its truth table on the board. This is a unary operator (i.e., it has one argument, not two as above; functions with two inputs are called binary operators). Written A with a bar over it. I will use ' instead of a bar as it is easier for me to input in html.

  4. Exclusive OR (XOR). Draw its truth table on the board. Written as ⊕, a + with a circle around it. True if exactly one input is true. In particular, remember that TRUE ⊕ TRUE = FALSE.

Homework: Draw the truth table of the Boolean function of 3 boolean variables that is true if and only if exactly 1 of the three variables is true.

Some manipulation laws

Remember this is Boolean Algebra.

How does one prove these laws??

Homework: Prove the second distributive law.

Homework: Prove DeMorgan's laws.

Let's do (on the board) the example on pages B-6 and B-7.

Consider a logic function with three inputs A, B, and C; and three outputs D, E, and F defined as follows: D is true if at least one input is true, E if exactly two are true, and F if all three are true. (Note that by if we mean if and only if.)

  1. Construct the truth table. This is straightforward; simply fill in the 24 entries by looking at the definitions of D, E, and F.

  2. Produce logic equations for D, E, and F. This can be done in two ways.

    1. Examine the column of the truth table for a given output and write one term for each entry that is a 1. This method requires constructing the truth table and might be called the method of perspiration.

    2. Look at the definition of D, E, and F and just figure it out. This might be called the method of inspiration.

      For D and F it is fairly clear. E requires some cleverness: the key idea is that exactly two are true is the same as (at least) two are true AND it is not the case that all three are true. So we have the AND of two expressions: the first is a three way OR and the second the negation of a three way AND.

Start Lecture #2

The first way we solved the previous example shows that any logic equation can be written using just AND, OR, and NOT. Indeed it shows more. Each entry in the output column of the truth table corresponds to the AND of three (because there are three inputs) literals.

A literal is either an input variable or the negation of an input variable.

In mathematical logic such a formula is said to be in disjunctive normal form because it is the disjunction (i.e., OR) of conjunctions (i.e., ANDs).

In computer architecture disjunctive normal form is called two levels of logic because it shows that any formula can be computed by passing signals through only two logic functions, AND and then OR (assuming we are given the inputs and their complements).

  1. First compute all the ANDs. There can be many, many of these, but they can all be computed at once using many, many AND gates.
  2. Compute the required ORs of the ANDs computed in step 1.
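
Remark: In C (a sketch; the function name is mine), the two-level form of the output E from the example above (true iff exactly two of A, B, C are true) looks like this. One AND term per 1-row of the truth table, then a single OR:

    /* Sum-of-products (two levels of logic) for E: exactly two of
       a, b, c are true.  The ! operator plays the role of the
       complemented inputs; && is the AND level, || the OR level. */
    int E(int a, int b, int c) {
        int m011 = !a &&  b &&  c;    /* truth-table row 011 */
        int m101 =  a && !b &&  c;    /* row 101 */
        int m110 =  a &&  b && !c;    /* row 110 */
        return m011 || m101 || m110;  /* the OR level */
    }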

With DM (DeMorgan's Laws) we can do quite a bit without resorting to truth tables.

For example one can ...

Homework: Show that the two expressions for E in the example above are equal.

Start to do the homework on the board.

Remark: You may ignore the references to Verilog in the text.


GATES

Gates implement the basic logic functions: AND, OR, NOT, XOR, and Equivalence. When drawing them we use the standard shapes shown to the right.

Note that none of the figures is input-output symmetric. That is, one can tell which lines are inputs and which are outputs without resorting to arrowheads and without the convention that inputs are on the left. Sometimes the figure is rotated 90 or 180 degrees.

Bubbles

We often omit the inverters and draw the little circles at the input or output of the other gates (AND, OR). These little circles are sometimes called bubbles.

For example, the diagram on the right shows three ways of writing the same logic function.

This explains why the inverter is drawn as a buffer with an output bubble.

Show why the picture for equivalence is correct. That is, show that equivalence is the negation of XOR, i.e., show that AB + A'B' = (A ⊕ B)'.

    (A ⊕ B)' =
    (A'B+AB')' =
    (A'B)' (AB')' =
    (A''+B') (A'+B'') =
    (A + B') (A' + B) =
    (A + B') A'  +  (A + B') B =
    AA' + B'A' + AB + B'B =
    0   + B'A' + AB + 0 =
    AB + A'B'
  

Homework: B.4.

Homework: Recall the Boolean function E that is true if and only if exactly 1 of the three variables is true. We have already drawn the truth table.
Draw a logic diagram for E using AND OR NOT.
Draw a logic diagram for E using AND OR and bubbles.

A set of gates is called universal if these gates are sufficient to generate all logic functions.


NOR (NOT OR) is true when OR is false. Draw the truth table on the board.

NAND (NOT AND) is true when AND is false. Draw the truth table on the board.

We can draw NAND and NOR each in two ways, as shown in the diagram on the right. The top pictures are from the definition; the bottom use DeMorgan's laws.

Theorem A 2-input NOR is universal and a 2-input NAND is universal.

Proof We will show that you can get A', A+B, and AB using just a two input NOR.

Draw the truth tables showing the last three statements. Also say why they are correct, i.e., we are at the stage where simple identities like these don't need truth tables.
End of Proof
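
Remark: In C (a sketch; the function names are mine, just for illustration), the proof's construction of A', A+B, and AB from 2-input NORs is:

    int nor_g(int a, int b) { return !(a || b); }

    int not_g(int a)        { return nor_g(a, a); }              /* A'  = A NOR A    */
    int or_g (int a, int b) { return not_g(nor_g(a, b)); }       /* A+B = (A NOR B)' */
    int and_g(int a, int b) { return nor_g(not_g(a), not_g(b)); }/* AB  = A' NOR B'  */

The last line is DeMorgan again: (A' + B')' = AB.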

Homework: Show that a 2-input NAND is universal.

In fact, the above proof was overkill; it would have been enough to show that you can get A' and A+B.
Why?
Because we already know that the pair OR,NOT is universal.
It would also have been enough to show that you can get A' and AB.

Sneaky way to see that NAND is universal.

We have seen how to implement any logic function given its truth table. Indeed, the natural implementation from the truth table uses just two levels of logic. But that implementation might not be the simplest possible. That is, we may have more gates than are necessary.

Trying to minimize the number of gates is decidedly NOT trivial. A text by Mano covers the topic of gate minimization in detail. We will not cover it in this course. It is mentioned and referenced, but not covered, in P&H. I actually like the topic, but it takes a few lectures to cover well and is not used much in practice since it is algorithmic and is done automatically by CAD tools.

Minimization is not unique, i.e. there can be two or more minimal forms.

Given A'BC + ABC + ABC'.
Combine the first two terms to get BC + ABC'.
Or instead combine the last two terms to get A'BC + AB.

Don't Cares (preview)

Sometimes when building a circuit, you don't care what the output is for certain input values. For example, that input combination might be known not to occur. Another example occurs when, for some combination of input values, a later part of the circuit will ignore the output of this part. These are called don't care outputs. Making use of don't cares can reduce the number of gates needed.

Can also have don't care inputs when, for certain values of a subset of the inputs, the output is already determined and you don't have to look at the remaining inputs. We will see a case of this very soon when we do multiplexors.

An aside on theory

Putting a circuit in disjunctive normal form (i.e., two levels of logic) means that every path from the input to the output goes through very few gates. In fact only two, an OR and an AND. Maybe we should say three since the AND can have a NOT (bubble). Theoreticians call this number (2 or 3 in our case) the depth of the circuit. So we see that every logic function can be implemented with small depth. But what about the width, i.e., the number of gates?

The news is bad. The parity function takes n inputs and gives TRUE if and only if the number of TRUE inputs is odd. If the depth is fixed (say limited to 3), the number of gates needed for parity is exponential in n.

B.3: Combinational Logic

Homework: Read B.3.

Generic Homework: Read sections in book corresponding to the lectures.

Decoders (and Encoders)

Imagine you are writing a program and have 32 flags, each of which can be either true or false. You could declare 32 variables, one per flag. If permitted by the programming language, you would declare each variable to be a bit. In a language like C, without bits, you might use a single 32-bit int and play with shifts and masks to store the 32 flags in this one word.

In either case, an architect would say that you have these flags fully decoded. That is, you can detect the value of any combination of the bits.

Now imagine that for some reason you know that, at all times, exactly one of the flags is true and the others are all false. Then, instead of storing 32 bits, you could store a 5-bit integer that specifies which of the 32 flags is true. This is called fully encoded.

A 5-to-32 decoder converts an encoded 5-bit signal into 32 signals with exactly one signal true.

A 32-to-5 encoder does the reverse operation. Note that the output of an encoder is defined only if exactly one input bit is set (recall set means true).

The diagram on the right shows a 3-to-8 decoder.
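
Remark: In C (a sketch; the function names are mine), decoding is just a shift and encoding finds the position of the single set bit:

    #include <stdint.h>

    /* 5-to-32 decoder: a 5-bit code in, 32 lines out, exactly one set. */
    uint32_t decode5to32(unsigned code) {
        return (uint32_t)1 << (code & 31);
    }

    /* 32-to-5 encoder: defined only when exactly one input bit is set. */
    unsigned encode32to5(uint32_t flags) {
        unsigned code = 0;
        while (flags >>= 1)   /* shift until the single 1 reaches bit 0 */
            code++;
        return code;
    }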

Remark: Lab 1 assigned, due 17 September 2007.
Demo logisim.

Multiplexors

2-way mux

A multiplexor, often called a mux or a selector, is used to select one (output) signal from a group of (input) signals based on the value of a group of (select) signals. In the 2-input mux shown on the right, the select line S is thought of as an integer 0..1. If the integer has value j, then the jth input is sent to the output.

Construct on the board an equivalent circuit with ANDs and ORs in two ways:

  1. Construct a truth table with 8 rows (don't forget that, despite its name, the select line is an input) and write the sum of product form, one product for each row and a large 8-input OR. This is the canonical two-levels of logic solution.
  2. A simpler, more clever, two-levels of logic solution. Two ANDs, one per input (not including the selector). The selector goes to each AND, one with a bubble. The output from the two ANDs goes to a 2-input OR.
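
Remark: In C, the clever solution of step 2 is one line of bitwise logic (a sketch; here the 1-bit select is assumed broadcast across the word, i.e., s is all 0s or all 1s):

    #include <stdint.h>

    /* 2-way mux: two ANDs, one with the select inverted, feeding an OR. */
    uint32_t mux2(uint32_t s, uint32_t i0, uint32_t i1) {
        return (~s & i0) | (s & i1);
    }

So mux2(0, a, b) yields a and mux2(~0u, a, b) yields b.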

Start Lecture #3

4-way mux

The diagram on the right shows a 4-input MUX.

Construct on the board an equivalent circuit with ANDs and ORs in three ways:

  1. Construct the truth table (64 rows!) and write the sum of products form, one product (6-input AND) for each row and a gigantic 64-way OR. Just start this, don't finish it.
  2. A simpler (more clever) two-level logic solution. Four ANDS (one per input), each gets one of the inputs and both select lines with appropriate bubbles. The four outputs go into a 4-way OR.
  3. Construct a 2-input mux (using the clever solution). Then construct a 4-input mux using a tree of three 2-input muxes. One select line is used for the two muxes at the base of the tree, the other is used at the root.

All three of these methods generalize to a mux with 2^k input lines and k select lines.

A 2-way mux is the hardware analogue of if-then-else.

    if S=0
        M=A
    else
        M=B
    endif

A 4-way mux is an if-then-elif-elif-else

    if S1=0 and S2=0
        M=A
    elif S1=0 and S2=1
        M=B
    elif S1=1 and S2=0
        M=C
    else -- S1=1 and S2=1
        M=D
    endif

Don't Cares (again)

    S I0 I1 | O
    0  0  X | 0
    0  1  X | 1
    1  X  0 | 0
    1  X  1 | 1

Consider a 2-input mux. If the selector is 0, the output is I0 and the value of I1 is irrelevant. Thus, when the selector is 0, I1 is a don't care input. Similarly, when the selector is 1, I0 is a don't care input.

On the right we see the resulting truth table. Recall that without using don't cares the table would have 8 rows since there are three inputs; in this example the use of don't cares reduced the table size by a factor of 2.

The truth table for a 4-input mux has 64 rows, but the use of don't care inputs has a dramatic effect. When the selector is 01 (i.e., S0 is 0 and S1 is 1), the output equals the value of I1 and the other three I's are don't cares. A corresponding result occurs for other values of the selector.

Homework: Draw the truth table for a 4-input mux making use of don't care inputs. What size reduction occurred with the don't cares?

Homework: B.13. (I am not sure what is meant by hierarchical; perhaps modular.)
B.10. (Assume you have constant signals 1 and 0 as well.)

Recall that a don't care output occurs when for some input values (i.e., rows in the truth table), we don't care what the value is for certain outputs.

Powers of 2 NOT Required

How can one construct a 5-way mux?

Construct an 8-way mux and use it as follows.

Can do better by realizing that select-line values of 5, 6, or 7 are don't cares; hence the 8-way mux can be customized and would use fewer gates than a full 8-way mux.

PLAs—Programmable Logic Arrays (and PALs)

The idea is to partially automate the algorithmic way you can produce a circuit diagram (in sum-of-products form) from a given truth table. Since the form of the circuit is always a bunch of ANDs feeding into a bunch of ORs, we can manufacture all the gates in advance of knowing the desired logic functions; when the functions are specified, we just need to make the necessary connections from the ANDs to the ORs. In essence all possible connections are configured, but with switches that can be open or closed.

    A B C | D E F
    0 0 0 | 0 0 0
    0 0 1 | 1 0 0
    0 1 0 | 1 0 0
    0 1 1 | 1 1 0
    1 0 0 | 1 0 0
    1 0 1 | 1 1 0
    1 1 0 | 1 1 0
    1 1 1 | 1 0 1

The description just given is more accurate for a PAL (Programmable Array Logic) than for a PLA, as we shall soon see.

Consider the truth table on the right, which we have seen before. It has three inputs A, B, and C, and three outputs D, E, F.

Below the truth table we see the corresponding logic diagram in sum-of-products form.

Recall how we construct this diagram from the truth table.

To the right, the above figure is redrawn in a more schematic style.


Finally, a PLA can be redrawn in the more abstract form shown on the right.

Before a PLA is manufactured all the connections are specified. That is, a PLA is specific for a given circuit. Hence the name Programmable Logic Array is somewhat of a misnomer since the device is not programmable by the user.

Homework: B.11 and B.12.

PAL (Programmable Array Logic)

A PAL can be thought of as a PLA in which the final dots are made by the user. The manufacturer produces a sea of gates. The user programs it to the desired logic function by adding the dots.

ROMs

One way to implement a mathematical function (or a Java function without side effects) is to perform a table lookup.

A ROM (Read Only Memory) is the analogous way to implement a logic function.

Important: A ROM does not have state. It is another combinational circuit. That is, it does not represent memory. The reason is that once a ROM is manufactured, the output depends only on the input. I realize this sounds wrong, but it is right.

Indeed, we will shortly see that a ROM is like a PLA. Both are structures that can be used to implement a truth table.
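
Remark: As a software analogue (a sketch; the names are mine), the D, E, F example from section B.2 becomes an 8-entry table indexed by the 3 input bits ABC, with each entry packing the 3 output bits D E F. Like a ROM, the output depends only on the input:

    /* ROM contents for D = at least one true, E = exactly two,
       F = all three, packed as the bits D E F. */
    static const unsigned def_rom[8] = {
        0,  /* 000 -> DEF = 000 */
        4,  /* 001 -> 100 */
        4,  /* 010 -> 100 */
        6,  /* 011 -> 110 */
        4,  /* 100 -> 100 */
        6,  /* 101 -> 110 */
        6,  /* 110 -> 110 */
        5,  /* 111 -> 101 */
    };

    unsigned def_lookup(unsigned abc) { return def_rom[abc & 7]; }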

The key property of combinational circuits is that the outputs depend only on the inputs. This property (having no state) is false for a RAM chip. The input to a RAM, just like the input to a ROM, is an address. The RAM responds by presenting at its outputs the value CURRENTLY stored at that address. Thus just knowing the input (i.e., the address) is not sufficient for determining the output.

A PROM is a programmable ROM. That is, you buy the ROM with nothing in its memory and then before it is placed in the circuit you load the memory, and never change it. This is like a CD-R.

An EPROM is an erasable PROM. It costs more but if you decide to change its memory this is possible (but is slow). This is like a CD-RW.

Normal EPROMs are erased by some ultraviolet light process. But EEPROMs (electrically erasable PROMs) are erased electronically and are not as slow.

Flash is a modern EEPROM that is reasonably fast.

All these EPROMs are erasable, not writable, i.e., you can't just change one byte to an arbitrary value. (Some modern flash RAMs can nearly replace true RAM and perhaps should not be called EPROMs.)

ROMs and PLAs

A ROM is similar to a PLA

Full Truth Table

    A B C | D E F
    0 0 0 | 0 0 0
    0 0 1 | 1 0 1
    0 1 0 | 0 1 1
    0 1 1 | 1 1 0
    1 0 0 | 1 1 1
    1 0 1 | 1 1 0
    1 1 0 | 1 1 0
    1 1 1 | 1 1 0

Don't Cares (bigger example)


Truth Table with Output Don't Cares

    A B C | D E F
    0 0 0 | 0 0 0
    0 0 1 | 1 0 1
    0 1 0 | 0 1 1
    0 1 1 | 1 1 X
    1 0 0 | 1 1 X
    1 0 1 | 1 1 X
    1 1 0 | 1 1 X
    1 1 1 | 1 1 X

The top diagram on the right is the full truth table for the following example (from the book). Consider a logic function with three inputs A, B, and C, and three outputs D, E, and F.

The full truth table has 7 minterms (rows with at least one nonzero output).

The middle truth table has the output don't cares included.

Truth Table with Input and Output Don't Cares

    A B C | D E F
    0 0 0 | 0 0 0
    0 0 1 | 1 0 1
    0 1 0 | 0 1 1
    X 1 1 | 1 1 X
    1 X X | 1 1 X

Now do the input don't cares.

The resulting truth table is also shown on the right. Note how much smaller it is.

These don't cares are important for logic minimization. Compare the number of gates needed for the full truth table and the reduced truth table. There are techniques for minimizing logic, but we will not cover them.

Arrays of Logic Elements

Often we want to consider signals that are wider than a single bit. An array of logic elements is used when each of the individual bits is treated similarly. As we will soon see, sometimes most of the bits are treated similarly, but there are a few exceptions. For example, a 32-bit structure might treat the lob (low order bit) and hob (high order bit) differently from the others. In such a case we would have an array 30 bits wide and two 1-bit structures.

B.4: Using a Hardware Description Language

Skipped.

B.5: Constructing a Basic Arithmetic Logic Unit (ALU)

We will produce logic designs for the integer portion of the MIPS ALU (the floating point operations are more complicated and will not be implemented).

MIPS is a computer architecture widely used in embedded designs. In the 80s and early 90s, it was quite popular for desktop (or desk-side) computers. This was the era of the killer micros that decimated the market for minicomputers. (When I got my DECstation with a MIPS R3000, I think it was the fastest integer computer at NYU for a short while.)

Much of the design (all of the beginning part) is generic. I will point out when we are tailoring it for MIPS.

A 1-bit ALU

Our first goal will be a 1-bit wide structure that computes the AND, OR, and SUM of two 1-bit quantities. For the sum there is actually a third input, CarryIn, and a 2nd output, CarryOut.

Since our basic logic toolkit already includes AND and OR gates, our first real task is a 1-bit adder.

Half Adder

If the final goal was a 1-bit ALU, then we would not have a CarryIn. For a multi-bit ALU, the CarryIn for each bit is the CarryOut of the preceding lower-order bit (e.g., the CarryIn for bit 3 is the CarryOut from bit 2). When we don't have a CarryIn, the structure is sometimes called a half adder. Don't treat the name too seriously; it is not half of an adder.

Homework: Draw the logic diagram.

Start Lecture #4

Remark: Show in class how to broadcast S (select line) to many ANDs (used for wide muxes in 2nd lab) and assign lab 2.

Full Adder


Now we include the carry-in.
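
Remark: In C (a sketch; these are the standard full-adder equations), the sum is the three-way XOR and the carry-out is true when at least two of the three inputs are true:

    /* 1-bit full adder. */
    void full_adder(int a, int b, int cin, int *sum, int *cout) {
        *sum  = a ^ b ^ cin;
        *cout = (a & b) | (a & cin) | (b & cin);
    }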

Homework:

Combining 1-bit AND, OR, and ADD

We have implemented 1-bit versions of AND (a basic gate), OR (a basic gate), and SUM (the FA just constructed, which we henceforth draw as shown on the right). We now want a single structure that, given another input (the desired operation, another one of those control lines), produces as output the specified operation.

There is a general principle used to produce a structure that yields either X or Y depending on the value of operation.

  1. Implement a structure that always computes X.
  2. Implement another structure that always computes Y.
  3. Mux X and Y together using operation as the select line.

This mux, with an operation select line, gives a structure that sometimes produces one result and sometimes produces another. Internally both results are always produced.

In our case we have three possible operations, so we need a three-way mux and the select line is a 2-bit wide bus. With a 2-bit select line we can specify 4 operations; for now we are using only three.

We show the diagram for this 1-bit ALU on the right.

The Operation input is shown in green to distinguish it as a control line rather than a data line. That is, the goal is to produce two bits of result from 2 (AND, OR) or 3 (ADD) bits of data. The 2 bits of control tell what to do, rather than what data to do it on.

The extra data output (CarryOut) is always produced. Presumably if the operation was AND or OR, CarryOut is not used.

I believe the distinction between data and control will become quite clear as we encounter more examples. However, I wouldn't want to be challenged to give a (mathematically precise) definition.

A 32-bit ALU

A 1-bit ALU is interesting, but we need a 32-bit ALU to implement the MIPS 32-bit operations, acting on 32-bit data values.

For AND and OR, there is almost nothing to do; a 32-bit AND is just 32 1-bit ANDs so we can simply use an array of logic elements.

However, ADD is a little more interesting since the bits are not quite independent: The CarryOut of one bit becomes the CarryIn of the next.

A 32-bit Adder


Let's start with a 4-bit adder.

How about a 32-bit adder, or even an n-bit adder?


Combining 32-bit AND, OR, and ADD

To obtain a 32-bit ALU, we put together the 1-bit ALUs in a manner similar to the way we constructed a 32-bit adder from 32 FAs. Specifically we proceed as follows and as shown in the figure on the right.

  1. Use an array of logic elements for the logic. The individual logic element is the 1-bit ALU.
  2. Use buses for A, B, and Result.
  3. Broadcast Operation to all of the internal 1-bit ALUs. This means wire the external Operation to the Operation input of each of the internal 1-bit ALUs.

Facts Concerning (4-bit) Two's Complement Arithmetic

Remark
This is one place where our treatment must go a little out of order. Appendix B in the book assumes you have read the chapter on computer arithmetic; in particular it assumes that you know about two's complement arithmetic.

I do not assume you know this material and we will cover it later, when we do that chapter. What I will do here is assert some facts about two's complement arithmetic that we will use to implement the circuit for SUB.
End of Remark.

For simplicity I will be presenting 4-bit arithmetic. We are really interested in 32-bit arithmetic, but the idea is the same and the 4-bit examples are much shorter (and hence less likely to contain typos).

4-bit Two's Complement Numbers

With 4 bits, there can be only 16 numbers. One of them is zero, 8 are negative, and 7 are positive.

The high order bit (hob) on the left is the sign bit. The sign bit is zero for positive numbers and for the number zero; the sign bit is one for negative numbers.

Zero is written simply 0000.

1-7 are written 0001, 0010, 0011, 0100, 0101, 0110, 0111. That is, you set the sign bit to zero and write 1-7 using the remaining three low order bits. This last statement is also true for zero.

-1, -2, ..., -7 are written by taking the two's complement of the corresponding positive number. The two's complement is computed in two steps.

  1. Take the (ordinary) complement, i.e. turn ones to zeros and vice versa. This is sometimes called the one's complement.
    For example, the (4-bit) one's complement of 3 is 1100.
  2. Add 1.
    For example, the (4-bit) two's complement of 3 is 1101.
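
Remark: The two-step recipe in C for 4-bit quantities (a sketch; the name is mine): ~x is the one's complement, adding 1 gives the two's complement, and the mask keeps only 4 bits.

    unsigned twos_complement4(unsigned x) {
        return (~x + 1) & 0xF;
    }

    /* twos_complement4(3) == 0xD, i.e., 1101, as in the example above. */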

If you take the two's complement of -1, -2, ..., -7, you get back the corresponding positive number. Try it.

If you take the two's complement of zero you get zero. Try it.

What about the 8th negative number?
-8 is written 1000.
But if you take its (4-bit) two's complement, you must get the wrong number because the correct number (+8) cannot be expressed in 4-bit two's complement notation.

Two's Complement Addition and Subtraction

Amazingly easy (if you ignore overflows).

Implementing SUB (with AND, OR, and ADD)

No change is needed to our circuit above to handle two's complement numbers for AND/OR/ADD. That statement is not clear for ADD and will be shown true later in the course.

We wish to augment the ALU so that we can perform subtraction as well. As we stated above, A-B is obtained by taking the two's complement of B and adding. A 1-bit implementation is drawn on the right with the new structures in blue (I often use blue for this purpose). The enhancement consists of

  1. Using an inverter to get the one's complement of B.
  2. Using a mux with control line (in green) Binvert to select whether B or B' is fed to the adder.
  3. Using a clever trick to obtain the effect of B's two's complement when we have only B's one's complement. Namely, we set Cin, the carry-in to the lob, equal to 1 instead of 0. This trick increases the sum by one and, as a result, calculates A+B'+1, which is A plus the two's complement of B, which is A-B.
  4. So for the lob CarryIn is kinda-sorta a data line used as a control line.
  5. As before, setting Operation to 00 and 01 gives AND and OR respectively, provided we de-assert Binvert. CarryIn is a don't care for AND and OR.
  6. To implement addition we use opcode 10 as before and de-assert both Binvert and CarryIn.
  7. To implement subtraction we again use opcode 10, but we assert both Binvert and CarryIn.
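
Remark: A word-level C model of the structure just described (a sketch; the op encoding 00=AND, 01=OR, 10=ADD follows the notes):

    #include <stdint.h>

    uint32_t alu(unsigned op, int binvert, uint32_t carryin,
                 uint32_t a, uint32_t b) {
        if (binvert)
            b = ~b;                      /* one's complement of B */
        switch (op) {
        case 0:  return a & b;           /* 00: AND */
        case 1:  return a | b;           /* 01: OR  */
        case 2:  return a + b + carryin; /* 10: ADD; with binvert and
                                            carryin=1 this is a+B'+1 = a-b */
        default: return 0;               /* 11: not used yet */
        }
    }

Thus alu(2, 1, 1, a, b) computes a-b, and note that alu(0, 1, 0, a, b) computes AB', which we will have occasion to mention below.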
Extending to 32 Bits

A 32-bit version is simply a bunch of the 1-bit structures wired together as shown on the right.

Tailoring the 32-bit ALU to MIPS

AND, OR, ADD, and SUB are found in nearly all ALUs. In that sense, the construction up to this point has been generic. However, most real architectures have some extras. For MIPS they include:

  1. NOR, not very special and very easy.
  2. Overflow handling, common but not so easy.
  3. Set on less than (slt), not common and not so easy.
  4. Equality test, not very special and easy.


Implementing NOR

We noted above that our ALU already gives us the ability to calculate AB', a fairly uncommon logic function. A MIPS ALU needs NOR and, by DeMorgan's law,
A NOR B = (A + B)' = A'B',
which is rather close; we just need to invert A as well as B.

The diagram on the right shows the needed structures: an inverter to get A', a mux to choose between A and A', and a control line for the mux.

NOR is obtained by asserting Ainvert and Binvert and setting Operation=00.

The other operations are done as before, with Ainvert de-asserted.

The 32-bit version is a straightforward ...

Homework: Draw the 32-bit ALU that supports AND, OR, ADD, SUB, and NOR.

Overflows


Remark: As with two's complement arithmetic, I just present the bare boned facts here; they are explained later in the course.

The facts are trivial (although the explanation is not). Indeed there is just one fact.

  1. An overflow occurs for two's complement addition (which includes subtraction) if and only if the carry-in to the sign bit does not equal the carry out from the sign bit.

Only the hob portion of the ALU needs to be changed. We need to see if the carry-in is different from the carry-out, but that is exactly XOR. The simple modification to the hob structure is shown on the right.
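
Remark: Here is a C sketch of the rule, using a 4-bit ripple-carry loop so that both carries are visible; you can check the four board examples below with it.

    /* 4-bit ripple-carry add instrumented with the rule above: overflow
       iff the carry INTO the sign bit differs from the carry OUT of it. */
    unsigned add4(unsigned a, unsigned b, int *overflow) {
        unsigned c = 0, cin_sign = 0, sum = 0;
        for (int i = 0; i < 4; i++) {
            unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
            if (i == 3)
                cin_sign = c;                     /* carry into the hob  */
            sum |= (ai ^ bi ^ c) << i;            /* full-adder sum bit  */
            c = (ai & bi) | (ai & c) | (bi & c);  /* carry out of bit i  */
        }
        *overflow = (int)(cin_sign ^ c);          /* the XOR in the hob  */
        return sum;
    }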

Do on the board the 4-bit two's complement addition of

  1. 1 + 1
  2. -1 + -1 Note that there is NO overflow despite a carry-out.
  3. 6 + 6
  4. -6 + -6

The 32-bit version is again a straightforward ...

Homework: Draw the 32-bit ALU that supports AND, OR, ADD, SUB, and NOR and that asserts an overflow line when appropriate.

Implementing Set on Less Than (SLT)

We are given two 32-bit, two's complement numbers A and B as input and seek a 32-bit result that is 1 if A<B and 0 otherwise. Note that only the lob of the result varies; the other bits are all 0.

The implementation is fairly clever as we shall see.

  1. We need to set the LOB of the result equal to the sign bit of the subtraction A-B, and set the rest of the result bits to zero.

  2. Idea #1. Give the 4-way mux another (i.e., fourth) input, called LESS. This input is brought in from outside the bit cell. To generate slt, we make the select line to the mux equal to 11 so that the output is this new input. See the diagram on the right.

  3. For all the bits except the LOB, the LESS input is zero. This is trivial to do: simply label a wire false (or 0, or de-asserted) and connect it to the 31 Less inputs (i.e., all but the LOB).

  4. For the LOB we still need to figure out how to set less to the sign of A-B. Note that the circuit for the lob is the same as for the other bits; the difference is in the input to the circuit.

  5. (Figure: slt overview.)
  6. Recall that even though we have selected input 3 from the mux, all 4 inputs are computed. This is IMPORTANT: an OR gate always computes the OR of its inputs, whether you want it to or not, same for AND, etc.

  7. Hence the adder is adding, and if Binvert is asserted, Ainvert is de-asserted, and CarryIn is 1, the addition actually produces A-B.

  8. Idea #2. Use the settings just mentioned so that the adder computes A-B (and the mux throws it away). Modify the HOB logic as follows (you could do this modification for all bits, but just use the result from the HOB).

  9. Why didn't I show a detailed diagram for this method?
    Because this method is not used.

  10. Why isn't the method used?
    Because it is wrong!

The problem with the above solution is that it ignores overflows. Consider the following 4-bit (instead of 32-bit) example.

The fix is to use the correct rule for less than rather than the sometimes incorrect rule "the sign bit of A-B is 1".

Homework: Figure out the correct rule, i.e., a non-pictorial version of problem B.24. Hint: When an overflow occurs, the sign bit is definitely wrong.

The diagram on the right shows the correct calculation of Set.

Start Lecture #5

Remark: Lab 3 assigned.

A Simple Observation

The CarryIn to the LOB and Binvert to all the 1-bit ALUs are always the same. So the ALU has just one input called Bnegate, which is sent to the appropriate inputs in the 1-bit ALUs. The final 1-bit cell of the ALU is shown on the right.

Note that the circuit is the same for all bits; however different bits are wired differently, i.e., they have different inputs and their outputs are sent to different places.

Equality Detection

To see if A = B, we simply form A-B and test if the result is zero.

The Final Result

The final 32-bit ALU is shown below on the left. Again note that all the bits have the same circuit. The lob and hob have special external wiring; the other 30 bits are wired the same.

To the right of this diagram we see the symbol used for an ALU.

What are the control lines?

What functions can we perform?

    Function | 4-bit cntl | Ainv | Bneg | Oper
    AND      |    0000    |  0   |  0   |  00
    OR       |    0001    |  0   |  0   |  01
    ADD      |    0010    |  0   |  0   |  10
    SUB      |    0110    |  0   |  1   |  10
    slt      |    0111    |  0   |  1   |  11
    NOR      |    1100    |  1   |  1   |  00

We think of the three control lines Ainvert, Bnegate, and Operation as forming a single 4-bit control line. The table on the right shows what four bit value is needed for each function.

Defining the MIPS ALU in Verilog

Skipped.

B.6: Faster Addition: Carry Lookahead

This adder is much faster than the ripple adder we did before, especially for wide (i.e., many bit) addition.

Fast Carry Using Infinite Hardware

This is a simple (theoretical) result.

  1. An adder is a combinatorial circuit hence it can be constructed with two (or three if you count the bubbles) levels of logic. Done

  2. Consider 32-bit (or 64-bit, or 128-bit, or N-bit) addition, R=A+B.
  3. The above applies to any logic function; here are the calculations specific to addition.

Fast Carry Using the First Level of Abstraction: Propagate and Generate

At each bit position we have two input bits a and b as well as a CarryIn input. We now define two other bits, propagate and generate (pi = ai + bi and gi = ai bi).


To summarize, using a subscript i to represent the bit number,

    to generate  a carry:   gi = ai bi
    to propagate a carry:   pi = ai+bi
  

The diagram on the right, from P&H, gives a plumbing analogue for generate and propagate. A full size version of the diagram is here in pdf.

The point is that liquid enters the main pipe if either the initial CarryIn or one of the generates is true. The water exits the pipe at the lower left (i.e., there is a CarryOut for this bit position) if all the propagate valves are open from the lowest liquid entrance to the exit.

The two diagrams in these notes are from the 2e; the colors changed between editions.

Given the generates and propagates, we can calculate all the carries for a 4-bit addition (recall that c0=Cin is an input) as follows (this is the formula version of the plumbing):

    c1 = g0 + p0 c0
    c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0
    c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
    c4 = g3 + p3 c3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0
  


Thus we can calculate c1 ... c4 in just two additional gate delays given the p's and g's (where we assume one gate can accept up to 5 inputs). Since we get gi and pi after one gate delay, the total delay for calculating all the carries is 3 (this includes c4=Carry-Out).
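
Remark: Transcribing the four formulas above into C (a sketch; the name is mine) makes the structure explicit: every carry is a direct sum of products of p's, g's, and c0, with no rippling.

    void cla4_carries(const int a[4], const int b[4], int c0, int c[5]) {
        int p[4], g[4];
        for (int i = 0; i < 4; i++) {
            p[i] = a[i] | b[i];                   /* propagate */
            g[i] = a[i] & b[i];                   /* generate  */
        }
        c[0] = c0;
        c[1] = g[0] | (p[0] & c0);
        c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
        c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                    | (p[2] & p[1] & p[0] & c0);
        c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                    | (p[3] & p[2] & p[1] & g[0])
                    | (p[3] & p[2] & p[1] & p[0] & c0);
    }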

Each bit of the sum si can be calculated in 2 gate delays given ai, bi, and ci. Thus, for 4-bit addition, 5 gate delays after we are given a, b and Carry-In, we have calculated s and Carry-Out.

We show this in the diagram on the right.

Note that this uses only a modest amount of realistic (no more than 5-input) logic.

How does the speed of this carry-lookahead adder (CLA) compare to our original ripple-carry adder?

Fast Carry Using the Second Level of Abstraction

We have finished the design of a 4-bit CLA; the next goal is a 16-bit fast adder. Let's consider, at varying levels of detail, five possibilities.

  1. Ripple carry. Simple, we know it, but not fast.
  2. General 2 levels of logic. Always applicable, we know it, but not practical.
  3. Extend the above design to 16 bits. Possible, we could do it, but some gates have 17 inputs. Would need a tree to reduce the input count.
  4. Put together four of the 4-bit CLAs. Shown in the diagram to the right is a schematic of our 4-bit CLA and a 16-bit adder constructed from four of them.
  5. Be more clever and put together the 4-bit CLAs in a carry-lookahead manner. One could call the result a 2-level CLA.

Start Lecture #6


Super Propagate and Super Generate

We start the adventure by defining super propagate and super generate bits.

    P0 = p3 p2 p1 p0      Low order 4-bit adder propagates a carry
    P1 = p7 p6 p5 p4
    P2 = p11 p10 p9 p8
    P3 = p15 p14 p13 p12  High order 4-bit adder propagates a carry

    G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0   Low order 4-bit adder generates a carry
    G1 = g7 + p7 g6 + p7 p6 g5 + p7 p6 p5 g4
    G2 = g11 + p11 g10 + p11 p10 g9 + p11 p10 p9 g8
    G3 = g15 + p15 g14 + p15 p14 g13 + p15 p14 p13 g12
  

From these super propagates and super generates, we can calculate the super carries, i.e. the carries for the four 4-bit adders.

    C1 = G0 + P0 c0
    C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0
    C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0
    C4 = G3 + P3 C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 c0
  

But this looks terrific! These super carries are what we need to combine four 4-bit CLAs into a 16-bit CLA in a carry-lookahead manner. Recall that the hybrid approach suffered because the carries from one 4-bit CLA to the next (i.e., the super carries) were done in a ripple carry manner.

Since it is not completely clear how to combine the pieces so far presented to get a 16-bit, 2-level CLA, I will give a pictorial account very soon.

Before the pictures, let's assume the pieces can be put together and see how fast the 16-bit, 2-level CLA actually is. Recall that we have already seen two practical 16-bit adders: A ripple carry version taking 32 gate delays and a hybrid structure taking 14 gate delays. If the 2-level design isn't faster than 14 gate delays, we won't bother with the pictures.

Remember we are assuming 5-input gates. We use lower case p, g, and c for propagates, generates, and carries; and use capital P, G, and C for the super- versions.

  1. We calculate the p's and g's (lower case) in 1 gate delay (as with the 4-bit CLA).
  2. We calculate the P's one gate delay after we have the p's or 2 gate delays after we start.
  3. The G's are determined 2 gate delays after we have the g's and p's. So the G's are done 3 gate delays after we start.
  4. The C's are determined 2 gate delays after the P's and G's. So the C's are done 5 gate delays after we start.
  5. Now the C's are sent back to the 4-bit CLAs, which have already calculated the p's and g's. The c's are calculated in 2 more gate delays (7 total) and the s's 2 more after that (9 total).

Since 9<14, let the pictures begin!

  1. First perform minor surgery on the 4-bit CLA.

  2. Next put four of these 4-bit CLAs together with a Carry Lookahead Block that calculates the C's from the P's, G's, and Cin=C0.

  3. We actually are not done with the CL Block.

Building CLAs Using the CL Block

It is time to validate the claim that all sizes of CLAs can be built (recursively) using the CL Block.

1-bit CLA-PG

A 1-bit CLA is just a 1-bit adder. With only one bit there is no need for any lookahead since there is no ripple to try to avoid.

However, to enable us to build a 4-bit CLA from the 1-bit version, we actually need to build what we previously called a CLA-PG. The 1-bit CLA-PG has three inputs a, b, and cin. It produces 4 outputs s, cout, p, and g. We have given the logic formulas for all four outputs previously.

4-bit CLA-PG

A 4-bit CLA-PG is shown as the red portion in the figure to the right.

It has nine inputs: 4 a's, 4 b's, and cin and must produce seven outputs: 4 s's, cout, p, and g (recall that the last two were previously called the super propagate and super generate respectively).

The tall black box is our CL Block.

The question is: what must the ith ? box do in order for the entire (red) structure to be a 4-bit CLA-PG?

  1. The box must produce si, one bit of the desired sum. But this is easy since the box receives ai, bi, and ci the carry in (c0 is cin).
  2. The box must produce pi and gi for the CL Block to consume. But that is also easy since it has as input ai and bi.
  3. It looks like the ? box is just a 1-bit CLA-PG!
  4. Unfortunately, not quite. The ? box is only a (large) subset of a 1-bit CLA-PG.
  5. What is missing?
  6. Ans. The ? box doesn't need to produce a carry out since the larger (4-bit) CLA-PG contains a CL Block that produces all of them.

So, if we want to say that the 4-bit (1-level) CLA-PG is composed of four 1-bit (0-level) CLA-PGs together with a CL Block, we must draw the picture as on the right. The difference is that we explicitly show that the ? box produces cout, which is then not used.

This situation will occur for all sizes. For example, either picture on the right for a 4-bit CLA-PG produces a carry out since all 4-bit full adders do so. However, a 16-bit CLA-PG, built from four of the 4-bit units and a CL Block, does not use the carry outs produced by the four 4-bit units.

We have several alternatives.

  1. Don't mention the problem of the unused cout. A common solution but too late for us.
  2. Draw the top version of the diagram (without the unused cout's) and declare that a CLA-PG doesn't produce a carry out. Seems weird that a CLA-PG doesn't fully replace a full adder.
  3. Draw the top version of the diagram and admit that a level k CLA-PG doesn't really use four level k-1 CLA-PG's.
  4. Draw the bottom version of the diagram.
  5. Draw the top version of the diagram, but view it as an abbreviation of the bottom version. This last is the alternative we will choose.

As another abbreviation, we will henceforth say CLA when we mean CLA-PG.

Remark: Hence the 4-bit CLA (meaning CLA-PG) is composed of

  1. Four 1-bit CLAs
  2. One CLA block
  3. Wires
  4. Nothing else
16-bit CLA-PG

Now take four of these 4-bit adders and use the identical CL block to get a 16-bit adder.

The picture on the right shows one 4-bit adder (the red box) in detail. The other three 4-bit adders are just given schematically as small empty red boxes. The CL block is also shown and is wired to all four 4-bit adders.

The complete (large) picture is shown here.

Remark: Hence the 16-bit CLA is composed of

  1. Four 4-bit CLAs
  2. One CLA block
  3. Wires
  4. Nothing else
64-bit CLA-PG

To construct a 64-bit CLA no new components are needed. That is, the only components needed have already been constructed. Specifically you need:

  1. Four magenta boxes, identical to the one just constructed.
  2. One additional CL Block, identical to the one just used to make the magenta box.
  3. Wires to connect these five boxes.

Remark: Hence the 64-bit CLA (meaning CLA-PG) is composed of

  1. Four 16-bit CLAs
  2. One CLA block
  3. Wires
  4. Nothing else

When drawn (with a brown box) the 64-bit CLA-PG has 129 inputs (64+64+1) and 67 outputs (64+1+2).

256-bit CLA-PG
  1. Four brown boxes, identical to the one just constructed.
  2. One additional CL Block, identical to the one just used to make the brown box.
  3. Wires to connect these five boxes.

Remark: Hence the 256-bit CLA (meaning CLA-PG) is composed of

  1. Four 64-bit CLAs
  2. One CLA block
  3. Wires
  4. Nothing else
etc

Homework: How many gate delays are required for our 64-bit CLA-PG? How many gate delays are required for a 64-bit ripple carry adder (constructed from 1-bit full adders)?

Summary

CLAs greatly speed up addition; the increase in speed grows with the size of the numbers to be added.

Remark: CLAs implement n-bit addition in O(log(n)) gate delays.

Start Lecture #7

Shifters

MIPS (and most other) processors must execute shift (and rotate) instructions.

We could easily extend the ALU to do 1-bit shift/rotates (i.e., shift/rotate a 32-bit quantity by 1 bit), and then perform an n-bit shift/rotate as n 1-bit shift/rotates.

This is not done in practice. Instead a separate structure, called a barrel shifter, is built outside the ALU.

Remark: Barrel shifters, like CLAs, are of logarithmic complexity.



*** Big Change Coming ***

Sequential Circuits, Memory, and State

Why do we need state?

B.7: Clocks

Assume you have a physical OR gate. Assume the two inputs are both zero for an hour. At time t one input becomes 1. The output will OSCILLATE for a while before settling on exactly 1. We want to be sure we don't look at the answer before it's ready.

This will require us to establish a clocking methodology, i.e. an approach to determining when data is valid.

First, however, we need some ...

Terminology

Micro-, Mega-, and Friends
Nano means one billionth, i.e., 10^-9.
Micro means one millionth, i.e., 10^-6.
Milli means one thousandth, i.e., 10^-3.
Kilo means one thousand, i.e., 10^3.
Mega means one million, i.e., 10^6.
Giga means one billion, i.e., 10^9.
Frequency and period

Consider the idealized waveform shown on the right. The horizontal axis is time and the vertical axis is (say) voltage.

If the waveform repeats itself indefinitely (as the one on the right does), it is called periodic.

The time required for one complete cycle, i.e., the time between two equivalent points in consecutive cycles, is called the period.

Since it is a time, period is measured in units such as seconds, days, nanoseconds, etc.

The rate at which cycles occur is called the frequency.

Since it is a rate at which cycles occur, frequency is measured in units such as cycles per hour, cycles per second, kilocycles per microweek, etc.

The modern (and less informative) name for cycles per second is Hertz, which is abbreviated Hz.

Prediction: At least one student will confuse frequency and period on the midterm or final and hence mess up a gift question. Please, prove me wrong!

Make absolutely sure you understand why

  1. A kilohertz clock is (a million times) faster than a millihertz clock.
  2. A clock with a kilosecond period is (a million times) slower than one with a millisecond period.
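
For example (worth checking yourself): period and frequency are reciprocals, so a 1 gigahertz clock has a period of 1/10^9 seconds, i.e., 1 nanosecond, and a 100 megahertz clock has a 10 nanosecond period.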
Edges

Look at the diagram above and note the rising edge and the falling edge.

We will use edge-triggered logic, which means that state changes (i.e., writes to memory) occur at a clock edge.

For each design we choose to either

The edge on which changes occur (either the rising or falling edge) is called the active edge. For us, choosing which edge is active is basically a coin flip.

In real designs the choice is governed by the technology used. Some designs permit both edges to be active, for example DDR memory and double-pumped register files. This permits a portion of the design to run at effectively twice the speed since state changes occur twice as often.

Synchronous system

Now we are going to add state elements to the combinational circuits we have been using previously.

Remember that a combinational/combinatorial circuit has its outputs determined solely by its inputs, i.e., combinatorial circuits do not contain state.

State elements include state (naturally).

Combinatorial circuits can NOT contain loops. For example, imagine an inverter with its output connected to its input. If the input is false, the output becomes true. But this output is wired to the input, which is now true. Thus the output becomes false, which is the new input. So the output becomes true ... .
However, sequential circuits CAN and often DO contain loops.

B.8: Memory Elements: Flip-Flops, Latches, and Registers

We will use only edge-triggered, clocked memory in our designs as they are the simplest memory to understand. So our current goal is to construct a 1-bit, edge-triggered, clocked memory cell. However, to get there we will proceed in three stages.

  1. We first show how to build unclocked memory.
  2. Then, using unclocked memory, we build level-sensitive clocked memory.
  3. Finally from level-sensitive clocked memory we build edge-triggered clocked memory.

Unclocked Memory

The only unclocked memory we will use is a so-called S-R latch (S-R stands for Set-Reset).

When we define latch below to be a level-sensitive, clocked memory, we will see that the S-R latch is not really a latch.

The circuit for an S-R latch is on the right. Note the following properties.

Clocked Memory: Flip-flops and latches

The S-R latch defined above is UNclocked memory; unfortunately the terminology is not perfect.

For both flip-flops and latches the output equals the value stored in the structure. Both have an input and an output (and the complemented output) and a clock input as well. The clock determines when the internal value is set to the current input. For a latch, the output can change whenever the clock is asserted (level sensitive). For a flip-flop, changes occur only at the active edge.

D latch

The D stands for data.

Note the following properties of the D latch circuit shown on the right.

A D latch is sometimes called a transparent latch since, whenever the clock is high, the output equals the input (the input passes right through the latch).

We won't use D latches in our designs, except right now to build our workhorse, the master-slave flip-flop, an edge-triggered memory cell.

The lower diagram is how a D-latch is normally drawn.



In the traces to the right notice how the output follows the input when the clock is high and remains constant when the clock is low. We assume the stored value was initially low.

D or Master-Slave Flip-flop

This structure was our goal. It is an edge-triggered, clocked memory.

The circuit for a D flop, shown on the right, has the following properties.

Homework: Move the inverter to the other latch. What has changed?

The picture on the right is for a master-slave flip-flop. Note how much less wiggly the output is in this picture than before with the transparent latch. As before we are assuming the output is initially low.


This picture shows the setup and hold times discussed above.

Homework: Which code better describes a flip-flop and which a latch?

    repeat {
       while (clock is low) {do nothing}
       Q=D
       while (clock is high) {do nothing}
    } until forever

  or

    repeat {
       while (clock is high) {Q=D}
    } until forever
  

Start Lecture #8

Registers

A register is basically just an array of D flip-flops. For example a 32-bit register is an array of 32 D flops.

This, however, is not so good! We must have the write line correct quite a while before the active edge. That is, you must know whether you are writing quite a while in advance.

An alternative is to use an active low write line, i.e. have a W' input.




To implement a multibit register, just use multiple D flops.


Register File

A register file is just a set of registers, each one numbered.


Reading From a Register File

To support reading a register we just need a (big) mux from the register file to select the correct register.


Writing a Register in a Register File

To support writing a register we use a decoder on the register number to determine which register to write. Note that errors in the book's figure were fixed.

  1. The decoder is log n to n (5 to 32 for MIPS).
  2. The decoder outputs are numbered 0 to n-1 (NOT n).

Note also that I show the clock explicitly.

Recall that the inputs to a register are W, the write line, D the data to write (if the write line is asserted), and the clock. We should perform a write to register r this cycle if the write line is asserted and the register number specified is r. The idea is to gate the write line with the output of the decoder.

Start Lecture #9

Homework: B.36

SRAMS and DRAMS

Note: There are other kinds of flip-flops, e.g., T and J-K. Also, one could learn about excitation tables for each. We will not cover this material (P&H doesn't either). If interested, see Mano.

B.10: Finite State Machines (FSMs)

More precisely, we are learning about deterministic finite state machines or deterministic finite automata (DFA). The alternative, nondeterministic finite automata (NFA), are somewhat strange and, although seemingly unrealistic and of theoretical value only, form, together with DFAs, what I call the secret weapon used in the first stage of a compiler (the lexical analyzer).

I do a different example from the book (counters instead of traffic lights). The ideas are the same and the two generic pictures (below) apply to both examples.

Counters

A counter counts (naturally).

The State Transition Diagram
The circuit diagram.
Determining the combinatorial circuit
Truth Table for the Combinatorial Circuit
    Current   | Next
    A  I  R   | DA
    ----------+----
    0  0  0   |  0
    1  0  0   |  1
    0  1  0   |  1
    1  1  0   |  0
    x  x  1   |  0

How do we determine the combinatorial circuit?

A 2-bit Counter.

No new ideas are needed; just more work.

Beginning of the Truth Table for a 2-bit Counter
    Current      | Next
    A  B  I  R   | DA  DB
    -------------+-------
    x  x  x  1   |  0   0
    0  0  1  0   |  0   1

To determine the combinatorial circuit we could proceed as before. The beginning of the truth table is on the right.

This would work (do a few more rows on the board), but we can instead think about how a counter works and see that:

    DB = R'(B ⊕ I)
    DA = R'(A ⊕ BI)
  

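To check these equations, here is a small C sketch (my own, not from the text) that steps the counter using exactly the next-state equations above; I is the increment input and R is reset.

    #include <stdio.h>

    /* Next-state equations for the 2-bit counter:
       DB = R'(B xor I)          -- the low bit toggles when incrementing
       DA = R'(A xor (B and I))  -- the high bit toggles on a carry out of B */
    int main(void) {
        int A = 0, B = 0;                 /* current state, initially 00 */
        int inputs[][2] = {{1,0},{1,0},{1,0},{1,0},{1,0},{0,1}}; /* {I,R} per cycle */
        for (int t = 0; t < 6; t++) {
            int I = inputs[t][0], R = inputs[t][1];
            int DB = !R & (B ^ I);
            int DA = !R & (A ^ (B & I));
            A = DA; B = DB;               /* the state changes at the active edge */
            printf("cycle %d: I=%d R=%d -> AB=%d%d\n", t, I, R, A, B);
        }
        return 0;
    }

Running it shows the count 01, 10, 11, 00, 01 and then the reset back to 00.
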
A 3-bit Counter

Homework: B.39

B.7 Timing Methodologies

Skipped

Simulating Combinatorial Circuits at the Gate Level

The idea is, given a circuit diagram, write a program that behaves the way the circuit does. This means more than getting the same answer. The program is to work the way the circuit does.

For each logic box, you write a procedure with the following properties.

Simulating a Full Adder

Remember that a full adder has three inputs and two outputs. Discuss FullAdder.c or perhaps FullAdder.java.

Simulating a 4-bit Adder

This implementation uses the full adder code above. Discuss FourBitAdder.c or perhaps FourBitAdder.java
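I don't have FullAdder.c or FourBitAdder.c in these notes, but a minimal sketch of what such gate-level code looks like follows (names are my own). Each procedure computes its outputs solely from its inputs, just as the corresponding circuit does.

    #include <stdio.h>

    /* Full adder: three inputs (a, b, carry-in), two outputs (sum, carry-out). */
    void full_adder(int a, int b, int cin, int *sum, int *cout) {
        *sum  = a ^ b ^ cin;                      /* two cascaded XORs */
        *cout = (a & b) | (a & cin) | (b & cin);  /* majority of the three inputs */
    }

    /* 4-bit ripple-carry adder built from four full adders; a[0] is the LOB. */
    void four_bit_adder(const int a[4], const int b[4], int sum[4], int *cout) {
        int carry = 0;
        for (int i = 0; i < 4; i++)
            full_adder(a[i], b[i], carry, &sum[i], &carry);
        *cout = carry;
    }

    int main(void) {
        int a[4] = {1,0,1,1}, b[4] = {1,0,1,0};   /* 13 + 5, LOB first */
        int sum[4], cout;
        four_bit_adder(a, b, sum, &cout);
        printf("%d%d%d%d%d\n", cout, sum[3], sum[2], sum[1], sum[0]); /* 10010 = 18 */
        return 0;
    }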

Chapter 1: Computer Abstractions and Technologies

Homework: READ chapter 1. Do 1.1 -- 1.28 (really one matching question)
Do 1.29 to 1.45 (another matching question),
1.46.

Chapter 2: Instructions: Language of the Machine

Homework: Read sections 2.1, 2.2, and 2.3 (you need not worry about how a compiler works, but you might want to).

2.4 Representing instructions in the Computer (MIPS)

The Register File

MIPS Fields

The fields of a MIPS instruction are quite consistent

    op    rs    rt    rd    shamt  funct   name of field
     6     5     5     5      5      6     number of bits
  

R-type Instructions (R for register)

Examples: add/sub $1,$2,$3
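To make the field layout concrete, here is a small C sketch of my own that packs the six fields into a 32-bit R-type instruction word using the widths above (for add $1,$2,$3 the destination rd is 1, rs is 2, rt is 3, and the funct field is 100000, as we will see in the table in chapter 5/appendix C).

    #include <stdio.h>
    #include <stdint.h>

    /* Pack the six R-type fields (6,5,5,5,5,6 bits) into one 32-bit word. */
    uint32_t encode_rtype(unsigned op, unsigned rs, unsigned rt,
                          unsigned rd, unsigned shamt, unsigned funct) {
        return (op    & 0x3f) << 26 |
               (rs    & 0x1f) << 21 |
               (rt    & 0x1f) << 16 |
               (rd    & 0x1f) << 11 |
               (shamt & 0x1f) <<  6 |
               (funct & 0x3f);
    }

    int main(void) {
        /* add $1,$2,$3: op=0, rs=2, rt=3, rd=1, shamt=0, funct=0x20 */
        printf("0x%08x\n", encode_rtype(0, 2, 3, 1, 0, 0x20));
        return 0;
    }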

Start Lecture #10

I-type (Immediate)

The I is for immediate.

Load Word and Store Word

Examples: lw/sw $1,1000($2)

RISC-like properties of the MIPS architecture.

addi (add immediate)

Example: addi $1,$2,100

2.5: Logical Operations

Shifts: sll and srl (shift left/right) logical

Examples sll/srl $8,$12,7

Bitwise AND and OR: and, or, andi, ori

No surprises.

Bitwise NOR (includes NOT): nor

MIPS includes a bitwise NOR (our ALU implemented it) implemented as an R-type instruction.

2.6: Instructions for Making Decisions

beq and bne (branch (not) equal)

Examples: beq/bne $1,$2,123

slt (set less-than)

Example: slt $3,$8,$2

slti (set less-than immediate)

Example: slti $3,$8,20

blt (branch if less than)

Example: blt $5,$8,123

ble (branch if less than or equal)

Example: ble $5,$8,L (L a label to be calculated by the assembler.)

bgt (branch if greater than)

Example bgt $5,$8,L

bge (branch if greater than or equal)

Example: bge $5,$8,L

Note: Please do not make the mistake of thinking that
  slt $1,$5,$8
  beq $1,$0,L
is the same as
  slt $1,$8,$5
  bne $1,$0,L

It is not the case that the negation of X < Y is X > Y; the two differ when X = Y.
End of Note

J-type instructions (J for jump)

These have a different format, but again the opcode is the first 6 bits.
  op address
  6   26

The effect is to jump to the specified (immediate) address. Note that there are no registers specified in this instruction and that the target address is not relative to (i.e. added to) the address of the current instruction as was done with branches.

j (jump)

Example: j 10000

But MIPS is a 32-bit machine with 32-bit addresses and we have specified only 26 bits. What about the other 6 bits?

In detail, the address of the next instruction is calculated via a multi-step process (a C sketch follows the list).

  1. The 26 bit address field is extracted from the instruction.
  2. This address is left shifted two bits. The result is a 28-bit address (call it A) that is always a multiple of 4, which makes sense since all instructions must begin on a multiple of 4 bytes.
  3. The high order 4 bits are extracted from the address of the current instruction (not the address in the current instruction). Call this 4-bit quantity B.
  4. The address of the next instruction is formed by concatenating B with A.
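In C, the whole calculation is just a shift, a mask, and an OR (my sketch; addr26 is the 26-bit address field):

    #include <stdint.h>

    /* Form the next PC for a MIPS j instruction. */
    uint32_t jump_target(uint32_t pc, uint32_t addr26) {
        uint32_t A = (addr26 & 0x03ffffff) << 2;  /* 28-bit, multiple of 4 */
        uint32_t B = pc & 0xf0000000;             /* top 4 bits of the current PC */
        return B | A;                             /* concatenate B with A */
    }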

2.7: Supporting Procedures in Computer Hardware

jal (jump and link)

Example: jal 10000

jr (jump register)

Important example: jr $31

Homework: 2.38

2.8: Communicating with People

Skipped.

MIPS Addressing for 32-bit Immediates and Addresses

How can we put a 32-bit value (say 2 billion) into register 6?

  1. Zero and add.
  2. Load the word
  3. Load shift add
  4. Load shift OR

lui (load upper immediate)

Example: lui $4,123
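For the load-shift-OR method the assembler (or programmer) splits the 32-bit constant into two 16-bit halves; here is a C sketch of the split (my own illustration, using the 2 billion example from above).

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t value = 2000000000u;     /* the 32-bit constant to load */
        uint32_t upper = value >> 16;     /* goes in lui $6,upper */
        uint32_t lower = value & 0xffff;  /* then ori $6,$6,lower */
        printf("lui $6,%u\nori $6,$6,%u\n", upper, lower);
        return 0;
    }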

Homework: 2.7.

Start Lecture #11

Chapter 3

Homework: Read 3.1-3.4

3.1: Introduction

I have nothing to add.

3.2: Signed and Unsigned Numbers

MIPS uses 2s complement (just like 8086)

To form the 2s complement (of 0000 1111 0000 1010 0000 0000 1111 1100)

Need comparisons for signed and unsigned.

Comments on Two's Complement

You could easily ask what does this funny notation have to do with negative numbers. Let me make a few comments.

  1. What does minus 1 mean?
    Ans: It is the unique number that, when added to 1, gives zero.
  2. The binary number 1111...1111 has this property (using regular n-bit addition and discarding the carry-out) so we do seem to have -1 correct.
  3. Just as n+1 (for n≥0) is defined as the successor of n, -(n+1) is the number that has -n as successor. That is, we need to show that
    TwosComp(n+1) + 1 = TwosComp(n).
  4. This would follow if we could show
    OnesComp(n+1) + 1 = OnesComp(n), i.e., (n+1)' + 1 = n'.
    1. Let n be even, n = *0, * arbitrary.
    2. Write n', n+1 and (n+1)' and see that it works.
    3. Let n be odd, n = *0 1s, where 1s just means a bunch of ones (an odd number ends in a run of ones preceded by a 0).
    4. Again it works.
  5. So for example TwosComp(6)+1=TwosComp(5) and hence TwosComp(6)+6=zero, so it really is -6.
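Both properties are easy to check mechanically; here is a C sketch of mine that verifies them for all 16-bit values.

    #include <stdio.h>
    #include <stdint.h>

    /* One's complement: flip every bit. Two's complement: flip and add 1. */
    static uint16_t twos_comp(uint16_t n) { return ~n + 1; }

    int main(void) {
        for (uint32_t n = 0; n < 65535; n++) {
            /* n + TwosComp(n) is zero once the carry-out is discarded. */
            if ((uint16_t)(n + twos_comp(n)) != 0)
                printf("property 1 fails at %u\n", n);
            /* TwosComp(n+1) + 1 = TwosComp(n), i.e., -(n+1) has -n as successor. */
            if ((uint16_t)(twos_comp(n + 1) + 1) != twos_comp(n))
                printf("property 2 fails at %u\n", n);
        }
        printf("done\n");
        return 0;
    }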

sltu and sltiu

Like slt and slti but the comparison is unsigned.

Homework: 3.1-3.6

3.3: Addition and subtraction

To add two (signed) numbers just add them. That is, don't treat the sign bit special.

To subtract A-B, just take the 2s complement of B and add.

Overflows

An overflow occurs when the result of an operation cannot be represented with the available hardware. For MIPS this means when the result does not fit in a 32-bit word.

Homework: Prove this last statement (4.29) (for fun only, do not hand in).

addu, subu, addiu

These three instructions perform addition and subtraction the same way as do add and sub, but do not signal overflow.

shifter

Shifter

This is a sequential circuit.

Homework: A 4-bit shift register initially contains 1101. It is shifted six times to the right with the serial input being 101101. What are the contents of the register after each shift?

Homework: Same register, same initial condition. For the first 6 cycles the opcodes are left, left, right, nop, left, right and the serial input is 101101. The next cycle the register is loaded (in parallel) with 1011. The final 6 cycles are the same as the first 6. What are the contents of the register after each cycle?

3.4: Multiplication

Of course we can do this with two levels of logic since multiplication is just a function of its inputs.

But, just as with addition, this would give a very big circuit with large fan-in. Instead we use a sequential circuit that mimics the algorithm we all learned in grade school.

Recall how to do multiplication.

We will do it the same way ...
... but differently

This results in the following algorithm

    product ← 0
    for i = 0 to 31
        if LOB of multiplier = 1
            product = product + multiplicand
        shift multiplicand left 1 bit
        shift multiplier right 1 bit
  

Do on the board 4-bit multiplication (8-bit registers) 1100 x 1101. Since the result has (up to) 8 bits, this is often called a 4x4→8 multiply.
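Here is the algorithm transcribed into C (my transcription; a 32x32→64 version with the 64-bit multiplicand register of the first hardware attempt below).

    #include <stdio.h>
    #include <stdint.h>

    uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
        uint64_t product = 0;
        uint64_t mcand = multiplicand;    /* 64-bit multiplicand register */
        for (int i = 0; i < 32; i++) {
            if (multiplier & 1)           /* LOB of multiplier = 1 */
                product += mcand;
            mcand <<= 1;                  /* shift multiplicand left 1 bit */
            multiplier >>= 1;             /* shift multiplier right 1 bit */
        }
        return product;
    }

    int main(void) {
        /* the 1100 x 1101 board example: 12 x 13 = 156 */
        printf("%llu\n", (unsigned long long)multiply(12, 13));
        return 0;
    }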

The First Attempt

The diagrams below are for a 32x32-->64 multiplier.

What about the control?

This works!

But, when compared to the better solutions to come, it is wasteful of resources and hence is not the design we will keep.

An Improved Circuit

The product register must be 64 bits since the product can contain 64 bits.

Why is multiplicand register 64 bits?

Why is ALU 64-bits?

POOF!! ... as the smoke clears we see an idea.

We can solve both problems at once

This results in the following algorithm

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            (serial_in, product[32-63]) <- product[32-63] + multiplicand
        shift product right 1 bit
        shift multiplier right 1 bit
  

What about control

Redo same example on board

A final trick (gate bumming, like the code bumming of the 60s)

The algorithm changes to:

    product[0-31] <- multiplier
    for i = 0 to 31
      if LOB of product = 1
        (serial_in, product[32-63]) <- product[32-63] + multiplicand
      shift product right 1 bit
  

Control again boring.

Redo the same example on the board.
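In C the final version looks as follows (my transcription). The 65th bit (serial_in) shows up as the carry out of the 32-bit addition, so the 65-bit right shift is done by shifting the 33-bit sum left only 31 places.

    #include <stdio.h>
    #include <stdint.h>

    uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
        uint64_t product = multiplier;            /* product[0-31] <- multiplier */
        for (int i = 0; i < 32; i++) {
            uint64_t high = product >> 32;        /* product[32-63] */
            uint64_t low  = product & 0xffffffffu;
            if (low & 1)                          /* LOB of product = 1 */
                high += multiplicand;             /* bit 32 of high is serial_in */
            product = (high << 31) | (low >> 1);  /* shift the 65-bit value right 1 */
        }
        return product;
    }

    int main(void) {
        printf("%llu\n", (unsigned long long)multiply(12, 13)); /* 156 again */
        return 0;
    }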

Signed Multiplication

The above was for unsigned 32-bit multiplication. What about signed multiplication?

There are faster multipliers, but we are not covering them.

3.5: Division

We are skipping division.

3.6: Floating Point

We are skipping floating point.

3.7: Real Stuff: Floating Point in the IA-32

We are skipping floating point.

Homework: Read for your pleasure (not on exams) 3.8 Fallacies and Pitfalls, 3.9 Conclusion, and 3.10 ``Historical Perspective'' (the last is on the CD).

Start Lecture #12

Chapter 5: The Processor: Datapath and Control

Homework: Start Reading Chapter 5.

5.1: Introduction

We are going to build a basic MIPS processor.

Figure 5.1 redrawn below shows the main idea

Note that the instruction gives the three register numbers as well as an immediate value to be added.

5.2 Logic Design Convention

Done in appendix B.

5.3: Building a Datapath

Let's begin doing the pieces in more detail.

We draw buses in magenta (mostly 32 bits) and control lines in green.

Instruction fetch

We are ignoring branches and jumps for now.

The diagram on the right shows the main loop involving instruction fetch (i-fetch)

R-type instructions

We did the register file in appendix B. Recall the following points made when discussing the appendix.

The 32-bit bus with the instruction is divided into three 5-bit buses for each register number (plus other wires not shown).

Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)?

load and store

The diagram on the right shows the structures used to implement load word and store word (lw and sw).

lw $r,disp($s):

  1. Computes the effective address formed by adding the 16-bit immediate constant disp (displacement) to the contents of register $s.
  2. Fetches the value in data memory at this address.
  3. Inserts this value into register $r.

sw $r,disp($s):

  1. Computes the same effective address as lw $r,disp($s).
  2. Stores the contents of register $r into this address.

We have a 32-bit adder and, more importantly, have a 32-bit addend coming from the register file. Hence we need to extend the 16-bit immediate constant to 32 bits. That is, we must replicate the HOB of the 16-bit immediate constant to produce an additional 16 HOBs, all equal to the sign bit of the 16-bit immediate constant. This is called sign extending the constant.

What about the control lines?

Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)?
What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)?
What would happen if the MemWrite line had a stuck-at-0 fault
What would happen if the MemWrite line had a stuck-at-1 fault?

The Diagram is Wrong (specifically, incomplete)

The diagram cheats a little for clarity.

Branch on equal (beq)

Compare two registers and branch if equal. Recall the following from appendix B, where we built the ALU, and from chapter 2, where we discussed beq.

shift left 2

The top ALU, labeled add, is just an adder and so does not need any control.

The shift left 2 is not a shifter. It simply moves wires and includes two zero wires. We need a 32-bit version; below is a 5-bit version.

Homework: What would happen if the RegWrite line had a stuck-at-0 fault? What would happen if the RegWrite line had a stuck-at-1 fault?

5.4: A Simple Implementation Scheme

We will first put the pieces together and later figure out the control lines that are needed and how to set them. We are not now worried about speed.

We are assuming that the instruction memory and data memory are separate. So we are not permitting self modifying code. We are not showing how either memory is connected to the outside world (i.e., we are ignoring I/O).

We must use the same register file with all the pieces since when a load changes a register, a subsequent R-type instruction must see the change and when an R-type instruction makes a change, the lw/sw must see it (for loading or calculating the effective address, etc).

We could use separate ALUs for each type of instruction but we are not worried about speed so we will use the same ALU for all instruction types. We do have a separate adder for incrementing the PC.

Combining R-type and lw/sw

The problem is that some inputs can come from different sources.

  1. For R-type instructions, both ALU operands are registers. For I-type instructions (lw/sw) the second operand is the (sign extended) immediate field.
  2. For R-type instructions, the write data comes from the ALU. For lw it comes from the memory.
  3. For R-type instructions, the write register comes from field rd, which is bits 15-11. For lw, the write register comes from field rt, which is bits 20-16.

We will deal with the first two now by using a mux for each. We will deal with the third shortly by (surprise) using a mux.

Including instruction fetch

This is quite easy

Finally, beq

We need to have an if stmt for PC (i.e., a mux)

Homework: Extend the datapath just constructed to support the addi instruction as well as the instructions already supported. This is essentially the datapath component of problem 5.19 from the text.

Homework: Extend the datapath just constructed to support a variation of the lw instruction where the effective address is computed by adding the contents of two registers (instead of using an immediate field). This new instruction would be an R-type. Continue to support all the instructions that the original datapath supported. This is essentially the datapath component of problem 5.22 from the text.

Homework: Can you support a hypothetical swap instruction that swaps the contents of two registers using the same building blocks that we have used to date? Very similar to problem 5.23 from the text.

Start Lecture #13

The Control for the Datapath

We start with our last figure, which shows the data path and then add the missing mux and show how the instruction is broken down.

We need to set the muxes.

We need to generate the four ALU cntl lines: 1-bit Anegate, 1-bit Bnegate and 2-bit OP

    AND     0 0 00
    OR      0 0 01
    Add     0 0 10
    Sub     0 1 10
    Set-LT  0 1 11
    NOR     1 1 00
  

Homework: What happens if we use 0 1 00 for the four ALU control lines? What if we use 0 1 01?

What information can we use to decide on the muxes and alu cntl lines?

The instruction!

So no problem, just do a truth table.

A Two-Stage Approach

We will let the main control (to be done later) summarize the opcode for us. From this summary we determine the control lines for the muxes. Specifically, the main control will generate a 2-bit field ALUOp

    ALUOp   Action needed by ALU

    00      Addition (for load and store)
    01      Subtraction (for beq)
    10      Determined by funct field (R-type instruction)
    11      Not used
  

Start Lecture #14

MIDTERM EXAM

Start Lecture #15

Remark: Typo in lecture #13. There are four (not three) ALU control lines; we will use only three since we are not doing NOR.

Remark: Review answers for Midterm.

Controlling the ALU Given the Summary

Remark: This material is (in the 3e) in Appendix C, section C.2.

How many entries do we have now in the truth table?

Some simplifications we can take advantage of.

    opcode  ALUOp  operation     funct   ALU action        ALU cntl
    ------  -----  ------------  ------  ----------------  --------
    LW      00     load word     xxxxxx  add               0010
    SW      00     store word    xxxxxx  add               0010
    BEQ     01     branch equal  xxxxxx  subtract          0110
    R-type  10     add           100000  add               0010
    R-type  10     subtract      100010  subtract          0110
    R-type  10     AND           100100  and               0000
    R-type  10     OR            100101  or                0001
    R-type  10     SLT           101010  set on less than  0111

Applying these simplifications yields

    ALUOp | Funct        ||  Bnegate:OP
    1 0   | 5 4 3 2 1 0  ||  B OP
    ------+--------------++------------
    0 0   | x x x x x x  ||  0 10
    x 1   | x x x x x x  ||  1 10
    1 x   | x x 0 0 0 0  ||  0 10
    1 x   | x x 0 0 1 0  ||  1 10
    1 x   | x x 0 1 0 0  ||  0 00
    1 x   | x x 0 1 0 1  ||  0 01
    1 x   | x x 1 0 1 0  ||  1 11
  

Start Lecture #16

How should we implement this?
We will do it PLA style (disjunctive normal form, 2-levels of logic).

When is Bnegate (called Op2 in book) asserted?
Ans: Those rows where its bit is 1, namely rows 2, 4, and 7.

    ALUOp | Funct
    1 0   | 5 4 3 2 1 0
    ------+------------
    x 1   | x x x x x x
    1 x   | x x 0 0 1 0
    1 x   | x x 1 0 1 0
  

Notice that, in the 5 rows with ALUOp=1x, F1=1 is enough to distinguish the two rows where Bnegate is asserted. This gives

    ALUOp | Funct
    1 0   | 5 4 3 2 1 0
    ------+-------------
    x 1   | x x x x x x
    1 x   | x x x x 1 x
  

Hence Bnegate is ALUOp0 + (ALUOp1 F1)

Now we apply the same technique to determine when is OP0 asserted and begin by listing the rows where its bit is set.

    ALUOp | Funct
    1 0   | 5 4 3 2 1 0
    ------+------------
    1 x   | x x 0 1 0 1
    1 x   | x x 1 0 1 0
  
Again looking at all the rows where ALUOp=1x we see that the two rows where OP0 is asserted are characterized by just two Function bits
    ALUOp | Funct
    1 0   | 5 4 3 2 1 0
    ------+------------
    1 x   | x x x x x 1
    1 x   | x x 1 x x x
  

So OP0 is ALUOp1 F0 + ALUOp1 F3

Finally, we determine when is OP1 asserted and once again begin by listing the rows where its bit is one.

    ALUOp | Funct
    1 0   | 5 4 3 2 1 0
    ------+------------
    0 0   | x x x x x x
    x 1   | x x x x x x
    1 x   | x x 0 0 0 0
    1 x   | x x 0 0 1 0
    1 x   | x x 1 0 1 0
  

Inspection of the 5 rows with ALUOp=1x yields one F bit that distinguishes when OP1 is asserted, namely F2=0. Is this good luck, or well chosen funct values, or wise subset selection by H&P?

    ALUOp | Funct
    1 0   | 5 4 3 2 1 0
    ------+------------
    0 0   | x x x x x x
    x 1   | x x x x x x
    1 x   | x x x 0 x x
  

Since x 1 in the second row is really 0 1, rows 1 and 2 can be combined to give

	ALUOp | Funct
	1 0   | 5 4 3 2 1 0
	------+------------
	0 x   | x x x x x x
	1 x   | x x x 0 x x
      

Now we can use the first row to enlarge the scope of the last row

	ALUOp | Funct
	1 0   | 5 4 3 2 1 0
	------+------------
	0 x   | x x x x x x
	x x   | x x x 0 x x
      

So OP1 = NOT ALUOp1 + NOT F2

The circuit is then easy and is shown on the right.
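The same three equations in C (a sketch; returns the three bits packed as Bnegate:OP1:OP0, with Anegate always 0 since we are not doing NOR):

    /* ALU control from the 2-bit ALUOp and the 6-bit funct field.
       aluop1 and aluop0 must be 0 or 1. */
    unsigned alu_control(unsigned aluop1, unsigned aluop0, unsigned funct) {
        unsigned f0 = funct      & 1;
        unsigned f1 = funct >> 1 & 1;
        unsigned f2 = funct >> 2 & 1;
        unsigned f3 = funct >> 3 & 1;
        unsigned bnegate = aluop0 | (aluop1 & f1);
        unsigned op1     = !aluop1 | !f2;
        unsigned op0     = (aluop1 & f0) | (aluop1 & f3);
        return bnegate << 2 | op1 << 1 | op0;
    }

For example, funct=100010 (subtract) with ALUOp=10 gives 110, matching the table above.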

The Main Control

Our task, illustrated in the diagram below, is to calculate 9 bits, specifically:

All 9 bits are determined by the opcode. We show the logic diagram after we illustrate the operation of the control logic.

Note that the MIPS instruction set is fairly regular. Most of the fields we need are always in the same place in the instruction (independent of the instruction type).

MemRead: Memory delivers the value stored at the specified addr
MemWrite: Memory stores the specified value at the specified addr
ALUSrc: Second ALU operand comes from (reg-file / sign-ext-immediate)
RegDst: Number of reg to write comes from the (rt / rd) field
RegWrite: Reg-file stores the specified value in the specified register
PCSrc: New PC is Old PC+4 / Branch target
MemtoReg: Value written in reg-file comes from (alu / mem)

We have just seen how to calculate ALUOp; the remaining 7 bits (recall that ALUOp is 2 bits) are described in the table to the right and their uses in controlling the datapath are shown in the picture above.

We are interested in four opcodes.

Do a stage play

The following figures illustrate the play. Bigger versions of the pictures are here.

We start with R-type instructions

Start Lecture #17



Next we show lw

The following truth table shows the settings for the control lines for each opcode. This is drawn differently since the labels of what should be the columns are long (e.g. RegWrite) and it is easier to have long labels for rows.

    Signal    R-type  lw  sw  beq
    --------  ------  --  --  ---
    Op5       0       1   1   0
    Op4       0       0   0   0
    Op3       0       0   1   0
    Op2       0       0   0   1
    Op1       0       1   1   0
    Op0       0       1   1   0
    RegDst    1       0   X   X
    ALUSrc    0       1   1   0
    MemtoReg  0       1   X   X
    RegWrite  1       1   0   0
    MemRead   0       1   0   0
    MemWrite  0       0   1   0
    Branch    0       0   0   1
    ALUOp1    1       0   0   0
    ALUOp0    0       0   0   1

If drawn the normal way the table would look like this.

    Op5 Op4 Op3 Op2 Op1 Op0 | RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
    ------------------------+---------------------------------------------------------------------
    0   0   0   0   0   0   |   1      0       0        1        0       0       0      1      0
    1   0   0   0   1   1   |   0      1       1        1        1       0       0      0      0
    1   0   1   0   1   1   |   X      1       X        0        0       1       0      0      0
    0   0   0   1   0   0   |   X      0       X        0        0       0       1      0      1


Now it is straightforward to get the logic equations. The circuit, drawn in PLA style (2-levels of logic) is shown on the right.
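In C, the main control is just a table lookup on the opcode; here is a sketch (X's, i.e., don't-cares, are shown as 0):

    struct control {
        unsigned RegDst, ALUSrc, MemtoReg, RegWrite, MemRead,
                 MemWrite, Branch, ALUOp1, ALUOp0;
    };

    /* The main control: 9 output bits determined solely by the 6-bit opcode. */
    struct control main_control(unsigned opcode) {
        switch (opcode) {
        case 0x00: return (struct control){1,0,0,1,0,0,0,1,0}; /* R-type 000000 */
        case 0x23: return (struct control){0,1,1,1,1,0,0,0,0}; /* lw     100011 */
        case 0x2b: return (struct control){0,1,0,0,0,1,0,0,0}; /* sw     101011 */
        case 0x04: return (struct control){0,0,0,0,0,0,1,0,1}; /* beq    000100 */
        default:   return (struct control){0};  /* opcode not supported here */
        }
    }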

Homework: In a previous homework, you modified the datapath to support addi and a variant of lw. Determine the control needed for these instructions. Also do 5.15 and 5.16.

Homework (part of 5.13): Can we eliminate MemtoReg and use MemRead instead?

Homework: Can any other control signals be eliminated?

Implementing a J-type instruction, unconditional jump

Recall the jump instruction.

    opcode  addr
    31-26   25-0
  

Addr is a word address; the bottom 2 bits of the PC are always 0; and the top 4 bits of the PC are unchanged (AFTER incrementing by 4).

This is quite easy to add and smells like a good final exam question.

What's Wrong

Some instructions are likely slower than others and we must set the clock cycle time long enough for the slowest. The disparity between the cycle times needed for different instructions is quite significant when one considers implementing more difficult instructions, like divide and floating point ops. Actually, if we considered cache misses, which result in references to external DRAM, the cycle time ratios exceed 100.

Possible solutions

Even Faster (we are not covering this).

Start Lecture #18

Chapter 4: Performance Analysis

Homework: Read Chapter 4.

4.1: Introduction

Defining Performance

Throughput measures the number of jobs per day/second/etc that can be accomplished.

Response time measures how long an individual job takes.

We define Performance as 1 / Execution time.

Relative Performance

We say that machine X is n times faster than machine Y or machine X has n times the performance of machine Y if the execution time of a given program on X = (1/n) * the execution time of the same program on Y.

But what program should be used for the comparison? Various suites have been proposed; some emphasizing CPU integer performance, others floating point performance, and still others I/O performance.

Measuring Performance

How should we measure execution time?

We mostly employ user-mode CPU time, but this does not mean the other metrics are worse.

Cycle time vs. Clock rate.

What is the cycle time for a 700MHz computer?

What is the clock rate for a machine with a 10ns cycle time?

4.2: CPU Performance and its Factors

The execution time for a given job on a given computer is

    (CPU) execution time = (#CPU clock cycles required) * (cycle time)
                         = (#CPU clock cycles required) / (clock rate)
  

The number of CPU clock cycles required equals the number of instructions executed times the average number of cycles in each instruction.

But real systems are more complicated than that!

Through a great many measurements, one calculates for a given machine the average CPI (cycles per instruction).

The number of instructions required for a given program depends on the instruction set. For example, one x86 instruction often accomplishes more than one MIPS instruction.

CPI is a good way to compare two implementations of the same instruction set (i.e., the same instruction set architecture or ISA). If the clock cycle is unchanged, then the performance of a given ISA is inversely proportional to the CPI (e.g., halving the CPI doubles the performance).

Complicated instructions take longer; either more cycles or longer cycle time.

Older machines with complicated instructions (e.g. VAX in 80s) had CPI>>1.

With pipelining we can have many cycles for each instruction but still achieve a CPI of nearly 1.

Modern superscalar machines often have a CPI less than one. Sometimes one speaks of the IPC or instructions per cycle for such machines.

Putting this together, we see that

    CPU Time (in seconds) =  #Instructions * CPI * Cycle_time (in seconds).
    CPU Time (in ns)      =  #Instructions * CPI * Cycle_time (in ns).
    CPU Time (in seconds) =  #Instructions * CPI / Clock_Rate (in Hz).
  

Do on the board the example on page 247.
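The formula is mechanical to apply; here is a tiny C sketch with hypothetical numbers (my own, not the book's example):

    #include <stdio.h>

    int main(void) {
        double instructions = 1e9;    /* instructions executed (hypothetical) */
        double cpi          = 2.0;    /* average cycles per instruction */
        double clock_rate   = 500e6;  /* 500 MHz, in Hz */
        printf("CPU time = %.2f seconds\n", instructions * cpi / clock_rate);
        return 0;
    }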

Start Lecture #19

Homework: Carefully go through and understand the example on page 247 that I just did in class.

Homework: The next 5 problems form a set, i.e., the data from one applies to all the following problems. The first three, 4.1, 4.2, and 4.3, are from the book.

Homework: If the clock rates of the machines M1 and M2 from exercise 4.1 are 1GHz and 2GHz, respectively, find the CPI for program 1 on both machines.

Homework: Assume the CPI for program 2 on each machine is the same as the CPI for program 1 you calculated in the previous problem. What is the instruction count for program 2 on each machine?

4.3: Evaluating Performance

I have nothing to add.

4.4: Real Stuff: Two SPEC Benchmarks and the Performance of Recent Intel Processors

Skipped.

4.5 Fallacies and Pitfalls

What is the MIPS rating for a computer and how useful is it?

Homework: Carefully go through and understand the example on pages 248-249

How about MFLOPS (Millions of FLoating point OPerations per Second)? For numerical calculations floating point operations are the ones you are interested in; the others are overhead (a very rough approximation to reality).

It has similar problems to MIPS.

Benchmarks are better than MIPS or MFLOPS, but still have difficulties.

4.6: Concluding Remarks

Homework: Read this (very short) section.

Chapter 7: Memory

Homework: Read Chapter 7.

7.1: Introduction

An ideal memory is

Unable to achieve the impossible ideal we use a memory hierarchy consisting of

  1. Registers
  2. Cache (really L1, L2, and maybe L3)
  3. (Central or Main) Memory
  4. Disk
  5. Archive (e.g. Tape)

... and try to satisfy most references in the small fast memories near the top of the hierarchy.

There is a capacity/performance/price gap between each pair of adjacent levels. We will study the cache-to-memory gap.

We observe empirically (and teach in OS).

A cache is a small fast memory between the processor and the main memory. It contains a subset of the contents of the main memory.

A Cache is organized in units of blocks. Common block sizes are 16, 32, and 64 bytes.

This is the smallest unit we can move to/from a cache (some designs move subblocks, but we will not discuss them).

A hit occurs when a memory reference is found in the upper level of the memory hierarchy.

Definitions

Start Lecture #20

7.2: The Basics of Caches

We start with a very simple cache organization, one that was used on the DECstation 3100, a 1980s workstation.

Accessing a Cache

On the right is a pictorial example for a direct mapped cache with 4 blocks and a memory with 16 blocks.

How can we tell if a memory block is in the cache?

Also stored is a valid bit per cache block so that we can tell if there is a memory block stored in this cache block.

For example, when the system is powered on, all the cache blocks are invalid.

    Addr(10)  Addr(2)  hit/miss  block#
    --------  -------  --------  ------
    22        10110    miss      110
    26        11010    miss      010
    22        10110    hit       110
    26        11010    hit       010
    16        10000    miss      000
     3        00011    miss      011
    16        10000    hit       000
    18        10010    miss      010

Consider the example on page 476.


The circuitry needed for this simple cache (direct mapped, block size 1, all references to 1 word) to determine if we have a hit or a miss, and to return the data in case of a hit is quite easy. We are showing a 1024 word (= 4KB) direct mapped cache with block size = reference size = 1 word.

Make sure you understand the division of the 32 bit address into 20, 10, and 2 bits.

Calculate on the board the total number of bits in this cache.
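As a sketch (my own arithmetic), the address division and the bit count for this cache in C:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr  = 0x12345678;          /* an arbitrary address */
        uint32_t tag   = addr >> 12;          /* high-order 20 bits */
        uint32_t index = addr >> 2 & 0x3ff;   /* next 10 bits: which cache block */
        uint32_t byte  = addr & 0x3;          /* low 2 bits: byte within the word */
        printf("tag=0x%05x index=%u byte=%u\n", tag, index, byte);

        /* Each of the 1024 blocks holds 32 data bits, a 20-bit tag, and a
           valid bit, so the total is 1024 * (32 + 20 + 1) bits. */
        printf("total bits = %d\n", 1024 * (32 + 20 + 1));
        return 0;
    }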

Homework: 7.2 7.3 7.4

Processing a Read for this Simple Cache

The action required for a hit is obvious, namely return the data found to the processor.

For a miss, the best action is fairly clear, but requires some thought.

Handling Cache Misses

We can skip much of this section as it discusses the multicycle and pipelined implementations of chapter 6, which we skipped. For the single cycle processor implementation we just need to note a few points.

Handling Writes

Processing a write for our simple cache (direct mapped with block size = reference size = 1 word).

We have 4 possibilities: For a write hit we must choose between Write through and Write back. For a write miss we must choose between write-allocate and write-no-allocate (also called store-allocate and store-no-allocate and other names).

Write through: Write the data to memory as well as to the cache.

Write back: Don't write to memory now, do it later when this cache block is evicted.

The fact that an eviction must trigger a write to memory for write-back caches explains the comment above that the write hit policy affects the read miss policy.

Write-allocate: Allocate a slot and write the new data into the cache (recall we have a write miss). The handling of the eviction that this allocation (probably) causes depends on the write hit policy.

  1. If the cache is write through, discard the old data (since it is in memory) and write the new data to memory (as well as in the cache).
  2. If the cache is write back, the old data must now be written back to memory, but the new data is not written to memory.

Write-no-allocate: Leave the cache alone and just write the new data to memory.

Write no-allocate is not normally as effective as write allocate due to temporal locality.

The simplest policy is write-through, write-allocate. The DECstation 3100 discussed above adopted this policy and performed the following actions for any write, hit or miss (recall that, for the 3100, block size = reference size = 1 word and the cache is direct mapped).

  1. Index the cache using the correct LOBs (i.e., not the very lowest order bits as these give the byte offset).
  2. Write the data and the tag into the cache.
  3. Set Valid to true.
  4. Send request to main memory.

Although the above policy has the advantage of simplicity, it is out of favor due to its poor performance.

Improvement: Use a Write Buffer

Unified vs Split I and D (Instruction and Data) Caches

Given a fixed total size (in bytes) for the cache, is it better to have two caches, one for instructions and one for data; or is it better to have a single unified cache?

Start Lecture #21

Remark: Demo of tristate drivers in logisim (controlled registers).

Improvement: Multiword Blocks

The setup we have described does not take any advantage of spatial locality. The idea of having a multiword block size is to bring into the cache words near the referenced word since, by spatial locality, they are likely to be referenced in the near future.

We continue to assume (for a while) that the cache is direct mapped and that all references are for one word.

The terminology for byte offset and block offset is inconsistent. The byte offset gives the offset of the byte within the word so the offset of the word within the block should be called the word offset, but alas it is not in both the 2e and 3e. I don't know if this is standard (poor) terminology or a long standing typo in both editions.

The figure to the right shows a 64KB direct mapped cache with 4-word blocks.

What addresses in memory are in the block and where in the cache do they go?

Show from the diagram how this gives the red portion for the tag and the green portion for the index or cache block number.

Consider the cache shown in the diagram above and a reference to word 17003.

The cache size is the size of the data portion of the cache (normally measured in bytes).

For the caches we have seen so far this is the blocksize times the number of entries. For the diagram above this is 64KB. For the simpler direct mapped caches blocksize = wordsize, so the cache size is the wordsize times the number of entries.

Let's compare the pictured cache with another one containing 64KB of data, but with one word blocks.

  1. Calculate on the board the total number of bits in each cache; this is not simply 8 times the cache size in bytes.
  2. If the references are strictly sequential the pictured cache has 75% hits; the simpler cache with one word blocks has no hits.

How do we process read/write hits/misses for a cache with multiword blocks?

Homework: 7.9, 7.10, 7.12.

Why not make blocksize enormous? For example, why not have the cache be one huge block?

Memory support for wider blocks

Recall that our processor fetches one word at a time and our memory produces one word per request. With a large blocksize cache the processor still requests one word and the cache responds with one word. However the cache requests a multiword block from memory and to date our memory is only able to respond with a single word.

The question is, "Which pieces and buses should be narrow (one word) and which ones should be wide (a full block)?" The same question arises when the cache requests that the memory store a block, and the answers are the same, so we will only consider the case of reading the memory.

Homework: 7.14

Start Lecture #22

7.3: Measuring and Improving Cache Performance

Do the following performance example on the board. It would be an appropriate final exam question.

Homework: 7.17, 7.18.

A lower base (i.e. miss-free) CPI makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to losing more instructions if the CPI is lower.

A faster CPU (i.e., a faster clock) makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to more cycles if the clock is faster (and hence more instructions since the base CPI is the same).

Another performance example.

Remark: Larger caches have longer hit times.

Reducing Cache Misses by More Flexible Placement of Blocks

Improvement: Associative Caches

Consider the following sad story. Jane has a cache that holds 1000 blocks and has a program that only references 4 (memory) blocks, namely 23, 1023, 123023, and 7023. In fact, the references occur in order: 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, etc. Referencing only 4 blocks and having room for 1000 in her cache, Jane expected an extremely high hit rate for her program. In fact, the hit rate was zero. She was so sad, she gave up her job as webmistress, went to medical school, and is now a brain surgeon at the Mayo Clinic in Rochester, MN.

So far we have studied only direct mapped caches, i.e., those for which the location in the cache is determined by the address. Since there is only one possible location in the cache for any block, to check for a hit we compare one tag with the HOBs of the address.

The other extreme is fully associative.

Set Associative Caches

Most common for caches is an intermediate configuration called set associative or n-way associative (e.g., 4-way associative). The value of n is typically 2, 4, or 8.

If the cache has B blocks, we group them into B/n sets each of size n. Since an n-way associative cache has sets of size n blocks, it is often called a set size n cache. For example, you often hear of set size 4 caches.

In a set size n cache, memory block number K is stored in set K mod the number of sets, which equals K mod (B/n).

The figure above shows a 2-way set associative cache. Do the same example on the board for 4-way set associative.

Determining the Set Number and the Tag.

Recall that for a direct-mapped cache, the cache index gives the number of the block in the cache. For a set-associative cache, the cache index gives the number of the set.

Just as the block number for a direct-mapped cache is the memory block number mod the number of blocks in the cache, the set number equals the (memory) block number mod the number of sets.

Just as the tag for a direct mapped cache is the memory block number divided by the number of blocks, the tag for a set-associative cache is the memory block number divided by the number of sets.
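In C (my sketch; K is the memory block number, B the number of blocks in the cache, n the set size):

    /* Set number and tag for an n-way set-associative cache. */
    unsigned set_number(unsigned K, unsigned B, unsigned n) {
        return K % (B / n);   /* K mod the number of sets */
    }
    unsigned tag(unsigned K, unsigned B, unsigned n) {
        return K / (B / n);   /* K divided by the number of sets */
    }

Setting n=1 gives the direct-mapped formulas; setting n=B gives fully associative (one set, and the tag is the whole block number).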

Do NOT make the mistake of thinking that a set size 2 cache has 2 sets, it has B/2 sets each of size 2.

Ask in class.

Why is set associativity good? For example, why is 2-way set associativity better than direct mapped?

Locating a Block in the Cache

How do we find a memory block in a set associative cache with block size 1 word?

Recall that a 1-way associative cache is a direct mapped cache and that an n-way associative cache, for n the number of blocks in the cache, is a fully associative cache.

The advantage of increased associativity is normally an increased hit ratio.

What are the disadvantages?
Answer: It is slower and a little bigger due to the extra logic.

Combining Set-associativity and Multiword Blocks

This is a fairly simple merger.

  1. Start with the picture just above for a set-associative cache.
  2. The blue is now a block not just a word.
  3. Hence the data coming out of the bottom right is a block.
  4. So use the word-within-block bits to choose the proper word.
  5. This requires the same mux as it did for direct mapped caches (with multiword blocks). See the description and picture here.
  6. We discuss below which bits of the memory address are used for which purpose.

Choosing Which Block to Replace

When an existing block must be replaced, which victim should we choose? We ask the exact same question (with different words) when we study demand paging (remember 202!).

Sizes

There are two notions of size.

  1. The cache size is the capacity of the cache.
    That is, the size of all the blocks. In the diagram above it is the size of the blue portion.

    The size of the cache in the diagram is 256 * 4 * 4B = 4KB.

  2. Another size is the total number of bits in the cache, which includes tags and valid bits. For the diagram this is computed as follows.

For the diagrammed cache, what fraction of the bits are user data?
Ans: 4KB / 55Kb = 32Kb / 55Kb = 32/55.

Tag Size and Division of the Address Bits

We continue to assume a byte addressed machine with all references to a 4-byte word.

The 2 LOBs are not used (they specify the byte within the word, but all our references are for a complete word). We show these two bits in dark blue. We continue to assume 32-bit addresses so there are 2^30 words in the address space.

Let's review various possible cache organizations and determine for each how large the tag is and how the various address bits are used. We will consider four configurations, each a 16KB cache; that is, the size of the data portion of the cache is 16KB = 4 kilowords = 2^12 words. (A small code sketch working out the answers follows the list.)

  1. Direct mapped, blocksize 1 (word).
  2. Direct mapped, blocksize 8
  3. 4-way set associative, blocksize 1
  4. 4-way set associative, blocksize 8
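Working these out is a matter of counting bits; here is a C sketch of my arithmetic (block size in words, 32-bit byte addresses):

    #include <stdio.h>

    static int log2i(int n) { int b = 0; while (n > 1) { n /= 2; b++; } return b; }

    /* Address-bit division for a 16KB (2^12-word) cache. */
    static void config(const char *name, int blockwords, int ways) {
        int sets  = (1 << 12) / blockwords / ways;
        int index = log2i(sets);               /* bits that select the set */
        int word  = log2i(blockwords);         /* word-within-block bits */
        int tag   = 32 - index - word - 2;     /* 2 byte-offset bits */
        printf("%-36s index=%2d word=%d tag=%d\n", name, index, word, tag);
    }

    int main(void) {
        config("direct mapped, blocksize 1", 1, 1);
        config("direct mapped, blocksize 8", 8, 1);
        config("4-way set associative, blocksize 1", 1, 4);
        config("4-way set associative, blocksize 8", 8, 4);
        return 0;
    }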

Homework: 7.46 and 7.47. Note that 7.46 contains a typo !! should be ||.

Reducing the Miss Penalty Using Multilevel Caches

Improvement: Multilevel caches

Modern high end PCs and workstations all have at least two levels of caches: A very fast, and hence not very big, first level (L1) cache together with a larger but slower L2 cache.

When a miss occurs in L1, L2 is examined and only if a miss occurs there is main memory referenced.

So the average miss penalty for an L1 miss is

    (L2 hit rate)*(L2 time) + (L2 miss rate)*(L2 time + memory time)
  

We are assuming that L2 time is the same for an L2 hit or L2 miss and that the main memory access doesn't begin until the L2 miss has occurred.

Do the example on the board (a reasonably exam question, but a little long since it has so many parts).

Start Lecture #23

7.4: Virtual Memory

I realize this material was covered in operating systems class (V22.0202). I am just reviewing it here. The goal is to show the similarity to caching, which we just studied. Indeed, (the demand part of) demand paging is caching: In demand paging the memory serves as a cache for the disk, just as in caching the cache serves as a cache for the memory.

The names used are different and there are other differences as well.

    Cache concept   Demand paging analogue
    -------------   ----------------------
    Memory block    Page
    Cache block     Page frame (frame)
    Blocksize       Pagesize
    Tag             None (table lookup)
    Word in block   Page offset
    Valid bit       Valid bit
    Miss            Page fault
    Hit             Not a page fault
    Miss rate       Page fault rate
    Hit rate        1 - Page fault rate

    Cache concept         Demand paging analogue
    --------------------  ------------------------
    Placement question    Placement question
    Replacement question  Replacement question
    Associativity         None (fully associative)

Homework: 7.32

Write through vs. write back

Question: On a write hit should we write the new value through to (memory/disk) or just keep it in the (cache/memory) and write it back to (memory/disk) when the (cache-line/page) is replaced?

Translation Lookaside Buffer (TLB)

A TLB is a cache of the page table



Putting it together: TLB + Cache

This is the DECstation 3100.

Actions taken

  1. The page number is searched in the fully associative TLB
  2. If a TLB hit occurs, the frame number from the TLB together with the page offset gives the physical address. A TLB miss causes an exception to reload the TLB from the page table, which the figure does not show.
  3. The physical address is broken into a cache tag and cache index (plus a two bit byte offset that is not used for word references).
  4. If the reference is a write, just do it without checking for a cache hit (this is possible because the cache is so simple as we discussed previously).
  5. For a read, if the tag located in the cache entry specified by the index matches the tag in the physical address, the referenced word has been found in the cache; i.e., we had a read hit.
  6. For a read miss, the cache entry specified by the index is fetched from memory and the data returned to satisfy the request.

Hit/Miss possibilities

    TLB   Page  Cache  Remarks
    ----  ----  -----  -------------------------------------------------------------------
    hit   hit   hit    Possible, but page table not checked on TLB hit, data from cache
    hit   hit   miss   Possible, but page table not checked, cache entry loaded from memory
    hit   miss  hit    Impossible, TLB references in-memory pages
    hit   miss  miss   Impossible, TLB references in-memory pages
    miss  hit   hit    Possible, TLB entry loaded from page table, data from cache
    miss  hit   miss   Possible, TLB entry loaded from page table, cache entry loaded from memory
    miss  miss  hit    Impossible, cache is a subset of memory
    miss  miss  miss   Possible, page fault brings in page, TLB entry loaded, cache loaded

Homework: 7.31, 7.33

7.5: A Common Framework for Memory Hierarchies

Question 1: Where can/should the block be placed?

This question has three parts.

  1. In what slot are we able to place the block.
  2. If several possible slots are available, which one should be used?
  3. If no possible slots are available, which victim should be chosen?

Question 2: How is a block found?

    Associativity    Location method                       Comparisons required
    ---------------  ------------------------------------  -----------------------
    Direct mapped    Index                                 1
    Set associative  Index the set, search among elements  Degree of associativity
    Full             Search all cache entries              Number of cache blocks
                     Separate lookup table                 0

Typical sizes and costs

    Feature                 Typical values  Typical values     Typical values
                            for caches      for demand paging  for TLBs
    ----------------------  --------------  -----------------  --------------
    Size                    8KB-8MB         128MB-8GB          128B-4KB
    Block size              32B-128B        4KB-64KB           4B-32B
    Miss penalty in clocks  10-100          1M-10M             10-100
    Miss rate               .1%-10%         .000001%-.0001%    .01%-2%

The difference in sizes and costs for demand paging vs. caching, leads to different algorithms for finding the block. Demand paging always uses the bottom row with a separate table (page table) but caching never uses such a table.

Question 3: Which block should be replaced?

This is called the replacement question and is much studied in demand paging (remember back to 202).

Question 4: What happens on a write?

  1. Write-through

Homework: 7.41

  2. Write-back

Write miss policy (advanced)

Start Lecture #24

Chapter 8: Storage, Networks, and Other Peripherals.

Introduction

Peripherals are varied; indeed they vary widely in many dimensions, e.g., cost, physical size, purpose, capacity, transfer rate, response time, support for random access, connectors, and protocol.

Consider just transfer rate for the moment.

The text mentions three especially important dimensions.

The diagram on the right is quite oversimplified for modern PCs; a more detailed version is below.

8.2: Disk Storage and Dependability

Devices are quite varied and their data rates vary enormously.

Show a real disk opened up and illustrate the components.

Disk Access Time

The time for a disk access has five components, of which we concentrate on the first three.

  1. Seek.
  2. Rotational latency.
  3. Transfer time.
  4. Controller overhead.
  5. Queuing delays.
Seek Time

Today seek times are typically 5-10ms on average. It takes longer to go all the way across the disk but it does not take twice as long to go twice as far (the head must accelerate, decelerate, and settle on the track).

How should we calculate the average?

Rotational Latency

Since disks have just one arm the average rotational latency is half the time of a revolution, and is thus determined by the RPM (revolutions per minute) of the disk.

Disks today spin at 5400-15,000 RPM; they used to all spin at 3600 RPM.

Transfer Time

You might consider the other three times all overhead since it is the transfer time during which the data is actually being supplied.

The transfer rate is typically a few tens of MB per second. Given the rate, which is determined by the disk in use, the transfer time is proportional to the length of the request.

Some manufacturers quote a much higher rate, but that is for cache hits. In addition to supplying data much sooner, the electronic cache can transfer data at a higher rate than the mechanical disk.

Start Lecture #25

Remark: Lab 7, the finale, is assigned and is due 10 December 2007.

Remark: Reviewed set associative caches (which were taught the day before thanksgiving to very few students).

Start Lecture #26

Remark: I expect the final exam to be on the 7th floor like the midterm. A practice final is on the web.

Remark: Covered Tag Size and Division of Address Bits which was inadvertently omitted.

Controller Time

Not much to say. It is typically small. We will use 0ms (i.e., ignore this time).

Queuing Delays

This can be the largest component, but we will ignore it since it is not a function of the architecture, but rather of the load and OS.

Dependability, Reliability, and Availability

Reliability measures the length of time during which service is continuously delivered as expected.

An example reliability measure is mean time to failure (MTTF), which measures the average length of time that the system is delivering service as expected. Bigger values are better.

Another important measure is mean time to repair (MTTR), which measures how long the system is not delivering service as expected. Smaller values are better.

Finally we have mean time between failures (MTBF).
MTBF = MTTF + MTTR

One might think that having a large MTBF is good, but that is not necessarily correct. Consider a system with a certain MTBF and simply have the repair center deliberately add an extra 1 hour to the repair time and poof the MTBF goes up by one hour!

RAID

The acronym was coined by Patterson and his students. It stands for Redundant Array of Inexpensive Disks. Now it is often redefined as Redundant Array of Independent Disks.

RAID comes in several flavors often called levels.

No Redundancy (RAID 0)

The base, non-RAID, case from which the others are built.

Mirroring (RAID 1)

Two disks containing the same content.

Error Detecting and Correcting Code (RAID 2)

Often called ECC (error correcting code or error checking and correcting code). Widely used in RAM, not used in RAID.

Bit-Interleaved Parity (RAID 3)

Normally byte-interleaved or several-byte-interleaved. For most applications, RAID 4 is better.

Block-Interleaved Parity (RAID 4)

Striping a.k.a. Interleaving

To increase performance, rather than reliability and availability, it is a good idea to stripe or interleave blocks across several disks. In this scheme block n is stored on disk n mod k, where k is the number of disks. The quotient n/k is called the stripe number. For example, if there are 4 disks, stripe number 0 (the first stripe) consists of block 0, which is stored on disk 0, block 1 stored on 1, block 2 stored on 2, and block 3 stored on 3. Stripe 1 (like all stripes in this example) also contains 4 blocks. The first one is block 4, which is stored on disk 0.

Striping is especially good if one is accessing full stripes in which case all the blocks in the stripe can be read concurrently.
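In C the mapping is one line each (k is the number of disks):

    /* With k data disks, block n lives on disk n mod k, within stripe n / k. */
    unsigned disk_of(unsigned n, unsigned k)   { return n % k; }
    unsigned stripe_of(unsigned n, unsigned k) { return n / k; }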

RAID 4

RAID 4 combines striping and parity. In addition to the k so-called data disks used in striping, one has a single parity disk that contains the parity of the stripe.

Consider all k data blocks in one stripe. Extend this stripe to k+1 blocks by including the corresponding block on the parity disk. The block on the parity disk is calculated as the bitwise exclusive OR of the k data blocks.

Thus a stripe contains k data blocks and one parity block, which is the exclusive OR of the data blocks.

The great news is that any block in the stripe, parity or data, is the exclusive OR of the other k. This means we can survive the failure of any one disk.

For example, let k=4 and let the data blocks be A, B, C, and D.

  1. If the parity disk fails, we can easily recreate it since, by definition, the parity block for this stripe is
          A ⊕ B ⊕ C ⊕ D
    which is the exclusive OR of the other blocks.
  2. If a data disk fails, we can again recreate it since, by the commutative and associative properties of XOR,
        A ⊕ B ⊕ C ⊕ the parity block = A ⊕ B ⊕ C ⊕ (A ⊕ B ⊕ C ⊕ D) = D
    and again the missing block is the exclusive OR of the remaining blocks.
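The XOR arithmetic is easy to demonstrate; this C sketch (my own illustration) builds the parity block for k=4 tiny data blocks and then rebuilds a lost one.

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK 8   /* bytes per block; tiny, just for illustration */

    int main(void) {
        uint8_t A[BLOCK] = "RAID 4 ", B[BLOCK] = "parity ",
                C[BLOCK] = "demo   ", D[BLOCK] = "blocks ";
        uint8_t parity[BLOCK], rebuilt[BLOCK];

        /* The parity block is the bitwise XOR of the k data blocks. */
        for (int i = 0; i < BLOCK; i++)
            parity[i] = A[i] ^ B[i] ^ C[i] ^ D[i];

        /* If the disk holding D fails, D is the XOR of the survivors:
           A xor B xor C xor (A xor B xor C xor D) = D. */
        for (int i = 0; i < BLOCK; i++)
            rebuilt[i] = A[i] ^ B[i] ^ C[i] ^ parity[i];

        printf("rebuilt D = \"%.7s\"\n", (char *)rebuilt);
        return 0;
    }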

Properties of RAID 4.

Distributed Block-Interleaved Parity RAID 5

Rotate the disk used for parity.

Again using our 4 data-disk example, we continue to put the parity for blocks 0-3 on disk 4 (the fifth disk) but rotate the assignment of which disk holds the parity block of different stripes. In more detail.

Raid 1 and Raid 5 are widely used.

P + Q Redundancy (RAID 6)

Gives more than single error correction at a higher storage overhead.

Start Lecture #27

8.3 Networks

Skipped. (It is on the CD.)

8.4: Buses and Other Connections between Processors, Memory, and I/O Devices

A bus is a shared communication link, using one set of wires to connect many subsystems.

Tri-state Drivers

Bus Basics

Synchronous vs. Asynchronous Buses

A synchronous bus is clocked.

An asynchronous bus is not clocked.

We now describe a protocol in words (below) and with a finite state machine (on the right) for a device to obtain data from memory. The book uses a different form of diagram and is not as explicit about the address.

  1. The device makes a request (asserts ReadReq and puts the desired address on the data lines). The name data lines sounds odd since it is (now) being used for the address. It will also be used for the data itself in this design. Data lines should be contrasted with control lines (such as ReadReq).
  2. Memory, which has been waiting, sees ReadReq, records the address and asserts Ack.
  3. The device waits for the Ack; once seen, it drops the data lines and deasserts ReadReq.
  4. The memory waits for the request line to drop. Then it can drop Ack (which it knows the device has now seen). The memory now at its leisure puts the data on the data lines (which it knows the device is not driving) and then asserts DataRdy. (DataRdy has been deasserted until now.)
  5. The device has been waiting for DataRdy. It detects DataRdy and records the data. It then asserts Ack indicating that the data has been read.
  6. The memory sees Ack and then deasserts DataRdy and releases the data lines.
  7. The device seeing DataRdy low deasserts Ack ending the show. Note that both sides are prepared for another performance.
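The seven steps, walked through sequentially in C (a single-threaded sketch of mine; real hardware runs the two sides concurrently):

    #include <stdio.h>
    #include <stdbool.h>

    static bool ReadReq, Ack, DataRdy;   /* the control lines */
    static unsigned data_lines;          /* shared address/data lines */

    int main(void) {
        data_lines = 0x1234; ReadReq = true;        /* 1. device requests */
        unsigned addr = data_lines; Ack = true;     /* 2. memory latches addr */
        ReadReq = false;                            /* 3. device drops request */
        Ack = false;                                /* 4. memory drops Ack ... */
        data_lines = 0xbeef; DataRdy = true;        /*    ... and drives the data */
        unsigned data = data_lines; Ack = true;     /* 5. device latches data */
        DataRdy = false;                            /* 6. memory releases lines */
        Ack = false;                                /* 7. device ends the show */
        printf("read of 0x%x returned 0x%x\n", addr, data);
        return 0;
    }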



The Buses and Networks of the Pentium 4

For a realistic example, on the right is a diagram adapted from the 25 October 1999 issue of Microprocessor Reports on a then brand new Intel chip set, the so called 840.

Figure 8.11 in the text shows a 2003 chipset (the date is from http://www.karbosguide.com/books/pcarchitecture/chapter22.htm). I deliberately kept my diagram of the 4-year-older 840 so that you can see the changes in 4 years. If you look at 2007 PCs you will see that the speeds have again increased.

Bus adaptors have a variety of names, e.g., host adapters, hubs, bridges. The memory controller hub is often called the north bridge and the I/O controller hub is often called the south bridge.

Bus lines (i.e., wires) include those for data, function codes, and device addresses. Data and address are considered data, and the function codes are considered control (remember our datapath for MIPS).

Address and data may be multiplexed on the same lines (i.e., first send one then the other) or may be given separate lines. One is cheaper (good) and the other has higher performance (also good). Which is which?
Ans: the multiplexed version is cheaper.

Improving Bus Performance

These improvements mostly come at the cost of increased expense and/or complexity.

Obtaining bus access

    Option         High performance              Low cost
    -------------  ----------------------------  -----------------------------
    bus width      separate addr and data lines  multiplex addr and data lines
    data width     wide                          narrow
    transfer size  multiple bus loads            single bus loads
    bus masters    multiple                      single
    clocking       synchronous                   asynchronous

Start Lecture #28

Do on the board the following example. Given

Find

  1. Sustained bandwidth and latency for reading 256 words using 4 word transfers.
  2. Sustained bandwidth and latency for reading 256 words using 16 word transfers.
  3. How many bus transactions per second are needed for each (a transaction includes both address and data)?

Solution with four word blocks.

Solution with sixteen word blocks

8.5: Interfacing I/O Devices to the Processor, Memory, and Operating System

This is an I/O issue and is taught in 202.

Giving commands to I/O Devices

This is really an OS issue. Must write/read to/from device registers, i.e. must communicate commands to the controller. Note that a controller normally contains a microprocessor, but when we say the processor, we mean the central processor not the one on the controller.

Communicating with the Processor

Should we check periodically or be told when there is something to do? Better yet can we get someone else to do it since we are not needed for the job?

Polling

Processor continually checks the device status to see if action is required.

Do on the board the example on pages 676-677

Interrupt driven I/O

Processor is told by the device when to look. The processor is interrupted by the device.

Do on the board the example on pages 681-682.

Direct Memory Access (DMA)

The processor initiates the I/O operation then something else takes care of it and notifies the processor when it is done (or if an error occurs).

More Sophisticated Controllers

Subtleties involving the memory system

8.6: I/O Performance Measures: Examples from Disk and File Systems

Transaction Processing I/O Benchmarks

Skipped

File System and Web I/O Benchmarks

Skipped

I/O Performance versus Processor Performance

We do an example to illustrate the increasing impact of I/O time. It is similar to the one in the book.

Assume

  1. A job currently takes 100 seconds of CPU time and 50 seconds of I/O time.
  2. The CPU and I/O times can not be overlapped. Thus the total time required is 150 seconds.
  3. The CPU speed increases at a rate of 40% per year. This implies that the CPU time required in year n+1 is (1/1.4) times the CPU time required in year n.
  4. The I/O speed increases at a rate of 10% per year.

Calculate

  1. The CPU, I/O, and overall time required after 1,2,5,10,20 years.
  2. The percentage of the job time that the CPU is active for each year.
  3. The CPU, I/O, and overall speedup for each year.
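A C sketch computing the requested numbers under the assumptions above (my own code; compile with -lm):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int years[] = {1, 2, 5, 10, 20};
        for (int i = 0; i < 5; i++) {
            int n = years[i];
            double cpu = 100.0 / pow(1.4, n);  /* CPU 40% faster each year */
            double io  =  50.0 / pow(1.1, n);  /* I/O 10% faster each year */
            double total = cpu + io;           /* no overlap: SUM(CPU, I/O) */
            printf("year %2d: CPU=%6.2fs I/O=%6.2fs total=%6.2fs "
                   "CPU active %4.1f%%, speedup %.2f\n",
                   n, cpu, io, total, 100 * cpu / total, 150.0 / total);
        }
        return 0;
    }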

What would happen if CPU and I/O can be overlapped, i.e., if the overall time is MAX(CPU,I/O) rather than SUM(CPU,I/O)?

8.7: Designing an I/O system

We do an example similar to the one in the book to see how the various components affect overall performance.

Assume a system with the following characteristics.

Assume a workload of 64-KB reads and 100K instructions between reads.

Find

  1. The maximum I/O rate achievable.
  2. How many controllers are needed for this rate?
  3. How many disks are needed for this rate?

Solution

Remark: The above analysis was very simplistic. It assumed everything overlapped just right, that the I/Os were not bursty, and that the I/Os conveniently spread themselves across the disks.

Remark: Review of practice final.

Good luck on the (real) final!