Start Lecture #1
I start at Chapter 0 so that when we get to chapter 1, the numbering will agree with the text.
There is a web site for the course. You can find it from my home page listed above.
The course text is Patterson and Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th edition, which I will refer to as 4e.
All figures from Computer Organization and Design: The Hardware/Software Interface, Fourth Edition, by David Patterson and John Hennessy, are copyrighted material (copyright 2009 by Elsevier Inc. All rights reserved). Figures may be reproduced only for classroom or personal educational use in conjunction with the book and only when the above copyright line is included. They may not be otherwise reproduced, distributed, or incorporated into other works without the prior written consent of the publisher.
Your grade will be a function of your midterm, final exam, and laboratory assignments (see below). I am not yet sure of the exact weightings, but each of the three will be important.
I use the upper left board for lab/homework assignments and announcements. I should never erase that board. If you see me start to erase an announcement, please let me know.
I try very hard to remember to write all announcements on the upper left board and I am normally successful. If, during class, you see that I have forgotten to record something, please let me know. HOWEVER, if I forgot and no one reminds me, the assignment has still been given.
I make a distinction between homeworks and labs.
Labs are
In this course the "programming language" is a graphical language for drawing electronic circuits and simulating their behavior.
Homeworks are
Homeworks are numbered by the class in which they are assigned. So any homework given today is homework #1. Even if I do not give homework today, the homework assigned next class will be homework #2. Unless I explicitly state otherwise, all homework assignments can be found in the class notes. So the homework present in the notes for lecture #n is homework #n (even if I inadvertently forgot to write it to the upper left board).
This course will have graphical labs so I expect you will work on your personal computers. However, you will send your labs via email from your nyu.edu email account.
Good methods for obtaining help include
Most if not all labs will be in logisim, a graphical language for drawing electronic circuits and simulating their behavior. I do not assume you know logisim now.
Incomplete
The rules for incompletes and grade changes are set by the school and not the department or individual faculty member. The rules set by CAS, which can be found at http://cas.nyu.edu/object/bulletin0608.ug.academicpolicies.html, state:
The grade of I (Incomplete) is a temporary grade that indicates that the student has, for good reason, not completed all of the course work but that there is the possibility that the student will eventually pass the course when all of the requirements have been completed. A student must ask the instructor for a grade of I, present documented evidence of illness or the equivalent, and clarify the remaining course requirements with the instructor.
The incomplete grade is not awarded automatically. It is not used when there is no possibility that the student will eventually pass the course. If the course work is not completed after the statutory time for making up incompletes has elapsed, the temporary grade of I shall become an F and will be computed in the student's grade point average.
All work missed in the fall term must be made up by the end of the following spring term. All work missed in the spring term or in a summer session must be made up by the end of the following fall term. Students who are out of attendance in the semester following the one in which the course was taken have one year to complete the work. Students should contact the College Advising Center for an Extension of Incomplete Form, which must be approved by the instructor. Extensions of these time limits are rarely granted.
Once a final (i.e., non-incomplete) grade has been submitted by the instructor and recorded on the transcript, the final grade cannot be changed by turning in additional course work.
This email from the assistant director describes the policy.
Dear faculty,
The vast majority of our students comply with the department's academic integrity policies; see
www.cs.nyu.edu/web/Academic/Undergrad/academic_integrity.html
www.cs.nyu.edu/web/Academic/Graduate/academic_integrity.html
Unfortunately, every semester we discover incidents in which students copy programming assignments from those of other students, making minor modifications so that the submitted programs are extremely similar but not identical. To help in identifying inappropriate similarities, we suggest that you and your TAs consider using Moss, a system that automatically determines similarities between programs in several languages, including C, C++, and Java. For more information about Moss, see: http://theory.stanford.edu/~aiken/moss/
Feel free to tell your students in advance that you will be using this software or any other system. And please emphasize, preferably in class, the importance of academic integrity.
Rosemary Amico
Assistant Director, Computer Science
Courant Institute of Mathematical Sciences
The university-wide policy is described here
Homework: Download logisim (first google hit) and play with it. The help button offers a tutorial; try it.
Remark: Appendix C is on the CD that comes with the book, but is not in the book itself. If anyone does not have convenient access to a printer, please let me know and I will print a black and white copy for you. The pdf on the CD is in color so downloading it to your computer for viewing in color is probably a good idea. If you have a color printer that is not terribly slow, you might want to print it in color; that's what I did.
Lab 1 part 1: (15 points), due 13 September. See the detailed assignment here.
Homework: Read C.1
Homework: Read C.2
The word digital, when used in "digital logic" or "digital computer", means discrete. That is, the electrical values (i.e., voltages) of the signals in a circuit are treated as integers (normally just 0 and 1).
The alternative is analog, where the electrical values are treated as real numbers.
To summarize, we will use only two voltages: high and low. A signal at the high voltage is referred to as 1 or true or set or asserted. A signal at the low voltage is referred to as 0 or false or unset or deasserted.
The assumption that at any time all signals are either 1 or 0 hides a great deal of engineering.
Since this is not an engineering course, we will ignore these issues and assume square waves.
In English, digit implies 10 (a digit is a finger), but not in computers.
Indeed, the word Bit is short for Binary digIT and binary means base 2 not 10.
0 and 1 are called complements of each other as are true and false (also asserted/deasserted; also set/unset)
A logic block can be thought of as a black box that takes in electrical signals and puts out other electrical signals. There are two kinds of blocks: combinational and sequential.
We shall study combinational blocks first and will study sequential blocks later (in a few lectures).
Since combinational logic has no memory, it is simply a (mathematical) function from its inputs to its outputs.
A common way to represent the function is using a Truth Table. A Truth Table has a column for each input and a column for each output. It has one row for each possible set of input values. So, if there are A inputs, there are 2^{A} rows. In each of these rows the output columns have the output for that input.
Such a table is possible only because there are only a finite number of possible input values. Consider trying to produce a table for the mathematical function
y = f(x) = x^{3} + 6 x^{2} - 12 x - 3.5
There would be only two columns (one for x and one for y) but there would need to be an infinite number of rows!
Let's start with a really simple truth table, one corresponding to a logic block with one input and one output.
How many different truth tables are there for a "one input one output" logic block?
In | Out |
---|---|
0 | ? |
1 | ? |
There are two columns (1+1) and two rows (2^{1}). Hence the truth table looks like the one on the right with the question marks filled in.
Since there are two question marks and each one can have one of two values there are just 2^{2}=4 possible truth tables. They are:
We will see pictures for the last two possibilities very soon.
In1 | In2 | Out |
---|---|---|
0 | 0 | ? |
0 | 1 | ? |
1 | 0 | ? |
1 | 1 | ? |
Three columns (2+1) and 4 rows (2^{2}).
How many are there? It is just the number of ways you can fill in the output entries, i.e. the question marks. There are 4 output entries so the answer is 2^{4}=16.
How about 2 in and 8 out?
3 in and 8 out?
n in and k out?
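The counting argument above generalizes directly, and a few lines of Python make it concrete. This is only a sketch; the function name count_truth_tables is mine and does not come from the text.

```python
def count_truth_tables(n_inputs: int, k_outputs: int) -> int:
    """Number of distinct truth tables for a block with n inputs and k outputs."""
    rows = 2 ** n_inputs            # one row per combination of input values
    entries = rows * k_outputs      # each row has k output entries (question marks)
    return 2 ** entries             # each entry is independently 0 or 1

print(count_truth_tables(1, 1))               # 4
print(count_truth_tables(2, 1))               # 16
print(count_truth_tables(2, 8) == 2 ** 32)    # True
print(count_truth_tables(3, 8) == 2 ** 64)    # True
```

So for n inputs and k outputs there are 2^{k 2^{n}} truth tables.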
We use a notation that looks like algebra to express logic functions and expressions involving them.
The notation is called Boolean algebra in honor of George Boole.
A Boolean value is a 1 or a 0.
A Boolean variable takes on Boolean values.
A Boolean function takes in Boolean variables and produces Boolean values.
Four Boolean functions are especially common.
Homework: Draw the truth table of the Boolean function of 3 boolean variables that is true if and only if exactly 1 of the three variables is true.
Remember this is Boolean Algebra.
How does one prove these laws?
Answer: It is simple, but tedious.
Write the truth tables for each side and see that the outputs are the same. Actually you write just one truth table with columns for all the inputs and for the outputs of both sides. You often write columns for intermediate outputs as well, but that is only a convenience. The key is that you have a column for the final value of the LHS (left hand side) and a column for the final value of the RHS and that these two columns have identical results.
Prove the first distributive law on the board. The following columns are required: the inputs A, B, C; the LHS A(B+C); and the RHS AB+AC. Beginners like us would also use columns for the intermediate results B+C, AB, and AC. (Note that I am now indicating product by simple juxtaposition.)
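The "simple but tedious" truth table proof is easy to mechanize. The sketch below prints the columns just listed for the first distributive law and checks that the LHS and RHS columns agree; the helper name equivalent is my own, and 0/1 integers stand in for the signals.

```python
from itertools import product

def equivalent(lhs, rhs, n_inputs):
    """True if lhs and rhs produce identical columns in the truth table."""
    return all(lhs(*row) == rhs(*row) for row in product((0, 1), repeat=n_inputs))

lhs = lambda a, b, c: a & (b | c)          # A(B+C)
rhs = lambda a, b, c: (a & b) | (a & c)    # AB+AC

print("A B C B+C AB AC LHS RHS")
for a, b, c in product((0, 1), repeat=3):
    # Intermediate columns B+C, AB, AC are only a convenience.
    print(a, b, c, b | c, a & b, a & c, lhs(a, b, c), rhs(a, b, c))

print(equivalent(lhs, rhs, 3))   # True
```

A program like this checks a law but is not a substitute for writing the table yourself in the homework below.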
Homework: Prove the second distributive law.
Homework: Prove DeMorgan's laws.
Let's do (on the board) the example on page C-5.
Consider a logic function with three inputs A, B, and C; and three outputs D, E, and F defined as follows:
Compute the truth table and logic equations.
figure it out. This might be called the method of inspiration.
"Exactly two are true" is the same as "(at least) two are true AND it is not the case that all three are true". So we have the AND of two expressions: the first is a three way OR and the second the negation of a three way AND.
Start Lecture #2
The first way we solved the previous example shows that any logic equation can be written using just AND, OR, and NOT. Indeed it shows more. Each entry in the output column of the truth table corresponds to the AND of three (because there are three inputs) literals.
A literal is either an input variable or the negation of an input variable.
In mathematical logic such a formula is said to be in "disjunctive normal form" because it is the disjunction (i.e., OR) of conjunctions (i.e., ANDs).
In computer architecture disjunctive normal form is often called two levels of logic because it shows that any formula can be computed by passing signals through only two logic functions, AND and then OR (assuming we are given the inputs and their complements).
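The construction is mechanical: for each truth table row whose output is 1, AND together one literal per input, then OR the resulting terms. Here is a minimal sketch; the name sum_of_products is mine, and the "exactly two of three inputs are true" function is used as the example.

```python
from itertools import product

def sum_of_products(f, names):
    """Build the disjunctive normal form of f: one AND term per row where f is 1."""
    terms = []
    for row in product((0, 1), repeat=len(names)):
        if f(*row):
            # A literal is the variable itself (value 1) or its negation (value 0).
            terms.append("".join(n if v else n + "'" for n, v in zip(names, row)))
    return " + ".join(terms) if terms else "0"

exactly_two = lambda a, b, c: int(a + b + c == 2)
print(sum_of_products(exactly_two, "ABC"))   # A'BC + AB'C + ABC'
```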
With DM (DeMorgan's Laws) we can do quite a bit without resorting to truth tables.
For example one can ...
Homework: Show that the two expressions for E in the example above are equal.
Start to do the homework on the board.
Remark: You may ignore the references to Verilog in the text.
Gates implement the basic logic functions: AND OR NOT XOR Equivalence. When drawing logic functions, we use the standard shapes shown to the right.
Note that none of the figures is input-output symmetric. That is, one can tell which lines are inputs and which are outputs without resorting to arrowheads and without the convention that inputs are on the left. Sometimes the figure is rotated 90 or 180 degrees.
We often omit the inverters and draw little circles at the input or output of the other gates (e.g., AND OR). These little circles are sometimes called bubbles.
For example, the diagram on the right shows three ways of writing the same logic function.
This explains why the inverter is drawn as a buffer with an output bubble.
Show why the picture above for equivalence is correct. That is, show that equivalence is the negation of XOR. Specifically, show that AB + A'B' = (A ⊕ B)'.
(A ⊕ B)' = (A'B+AB')'
= (A'B)' (AB')'
= (A''+B') (A'+B'')
= (A + B') (A' + B)
= (A + B') A' + (A + B') B
= AA' + B'A' + AB + B'B
= 0 + B'A' + AB + 0
= AB + A'B'
Homework: C.4.
Homework: Recall the Boolean function E that is
true if and only if exactly 1 of the three variables is true.
We have already drawn the truth table.
Draw a logic diagram for E using AND OR NOT.
Draw a logic diagram for E using AND OR and bubbles.
A set of gates is called universal if these gates are sufficient to generate all logic functions.
Definition: NOR (NOT OR) is true when OR is false.
Draw the truth table on the board.
Definition: NAND (NOT AND) is true when AND is false.
Draw the truth table on the board.
We can draw both NAND and NOR in two ways as shown in the diagram on the right. The top pictures are from the definition; the bottom use DeMorgan's laws.
Theorem A 2-input NOR is universal and a 2-input NAND is universal.
Proof We will show that you can get A', A+B, and AB using just a two input NOR.
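One standard way to carry out the NOR half of the proof can be checked by brute force. The three constructions below are the usual ones (they may or may not be drawn exactly this way on the board), and the function names are mine.

```python
def nor(a, b):
    return 1 - (a | b)

def not_(a):
    return nor(a, a)                 # A' = A NOR A

def or_(a, b):
    return not_(nor(a, b))           # A+B = (A NOR B)'

def and_(a, b):
    return nor(not_(a), not_(b))     # AB = A' NOR B', by DeMorgan

# Check every row of each truth table.
for a in (0, 1):
    assert not_(a) == 1 - a
    for b in (0, 1):
        assert or_(a, b) == (a | b)
        assert and_(a, b) == (a & b)
print("NOT, OR, and AND all built from 2-input NOR")
```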
Homework: Show that a 2-input NAND is universal.
Sneaky way to see that NAND is universal.
cancel.
We have seen how to implement any logic function given its truth table. Indeed, the natural implementation from the truth table uses just two levels of logic. But that implementation might not be the simplest possible. That is, we may have more gates than are necessary.
Trying to minimize the number of gates is decidedly NOT trivial. Many texts, including one by Mano, cover the topic of gate minimization in detail. We will not cover it in this course. It is mentioned and referenced, but not covered, in 4e. I actually like the topic, but it takes a few lectures to cover well and it is no longer used in practice since it is done automatically by CAD tools.
Minimization is not unique, i.e. there can be two or more minimal forms.
Given A'BC + ABC + ABC'
Combine first two to get BC + ABC'
Combine last two to get A'BC + AB
Sometimes when building a circuit, you don't care what the output is for certain input values. For example, that input combination might be known not to occur. Another example occurs when, for some combination of input values, a later part of the circuit will ignore the output of this part. These are called don't care outputs. Making use of don't cares can reduce the number of gates needed.
One can also have don't care inputs when, for certain values of a subset of the inputs, the output is already determined and you don't have to look at the remaining inputs. We will see a case of this very soon when we do multiplexors.
Putting a circuit in disjunctive normal form (i.e. two levels of logic) means that every path from the input to the output goes through very few gates. In fact only two, an OR and an AND. Maybe we should say three since the AND can have a NOT (bubble). Theoreticians call this number (2 or 3 in our case) the depth of the circuit. So we see that every logic function can be implemented with small depth. But what about the width, i.e., the number of gates?
The news is bad. The parity function takes n inputs and gives TRUE if and only if the number of TRUE inputs is odd. If the depth is fixed (say limited to 3), the number of gates needed for parity is exponential in n.
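The general lower bound is a real theorem and is not proved here. What a short program can show is the weaker, easily checked fact that the truth table (two-level) implementation of parity needs 2^{n-1} AND terms, one per row with an odd number of 1s. The function names are mine.

```python
from itertools import product

def parity(bits):
    return sum(bits) % 2

def parity_minterms(n):
    """Count the AND terms in the truth-table sum-of-products form of n-input parity."""
    return sum(parity(row) for row in product((0, 1), repeat=n))

for n in range(1, 9):
    print(n, parity_minterms(n))     # the count doubles with each extra input
```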
Homework: Read C.3.
Generic Homework: Read sections in book corresponding to the lectures.
Imagine you are writing a program and have 32 flags, each of which can be either true or false. You could declare 32 variables, one per flag. If permitted by the programming language, you would declare each variable to be a bit. In a language like C, without bits, you might use a single 32-bit int and play with shifts and masks to store the 32 flags in this one word.
In either case, an architect would say that you have these flags fully decoded. That is, you can specify the values of each of the bits.
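The shifts and masks approach looks like the following. I show it in Python rather than C, using an int restricted to 32 bits; the helper names are made up for illustration.

```python
WORD_MASK = 0xFFFFFFFF             # keep values to 32 bits, like a C unsigned int

def set_flag(word, i):
    return word | (1 << i)

def clear_flag(word, i):
    return word & ~(1 << i) & WORD_MASK

def get_flag(word, i):
    return (word >> i) & 1

flags = 0                          # all 32 flags false
flags = set_flag(flags, 5)
flags = set_flag(flags, 31)
flags = clear_flag(flags, 5)
print(get_flag(flags, 5), get_flag(flags, 31))   # 0 1
```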
Now imagine that for some reason you know that, at all times, exactly one of the flags is true and the others are all false. Then, instead of storing 32 bits, you could store a 5-bit integer that specifies which of the 32 flags is true. This is called fully encoded.
For an example, consider radio buttons on a web page.
A 5-to-32 decoder converts an encoded 5-bit signal into the decoded 32-bit signal having the one specified signal true.
A 32-to-5 encoder does the reverse operation.
Note that the output of an encoder is defined only if exactly one input bit is set (recall set means true).
The top diagram on the right shows a 3-to-8 decoder.
Note the "3" with a slash, which signifies a three bit input. This notation represents a bundle of three (1-bit) wires, often called a 3-bit line.
View the input as "k written as an n-bit binary number" and view the output as 2^{n} bits with the k-th bit set and all the other bits clear.
Similarly, the bottom diagram shows an 8-to-3 encoder.
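Behaviorally, a decoder and an encoder are tiny functions. The sketch below models the bits as a Python list indexed from bit 0; the names decode and encode are mine.

```python
def decode(k, n):
    """n-to-2**n decoder: 2**n output bits with only bit k set."""
    return [1 if i == k else 0 for i in range(2 ** n)]

def encode(bits):
    """2**n-to-n encoder: the output is defined only if exactly one input bit is set."""
    if sum(bits) != 1:
        raise ValueError("encoder output is undefined unless exactly one bit is set")
    return bits.index(1)

out = decode(5, 3)
print(out)             # [0, 0, 0, 0, 0, 1, 0, 0]
print(encode(out))     # 5
```

Raising an error for an illegal encoder input is a software choice; in hardware the output would simply be some unspecified value.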
Why do we use decoders and encoders?
Remark: Lab 1 part 2 assigned, due 15 September 2011.
A multiplexor, often called a mux or a selector, is used to select one (output) signal from a group of (input) signals based on the value of a group of (select) signals. In the 2-input mux shown on the right, the select line S is thought of as an integer 0..1. If the integer has value j then the j^{th} input is sent to the output.
Construct on the board an equivalent circuit with ANDs and ORs in two ways:
Start Lecture #3
Remark: You should all have accounts on i5.nyu.edu.
The diagram on the right shows a 4-input MUX.
Construct on the board an equivalent circuit with ANDs and ORs in three ways:
All three of these methods generalize to a mux with 2^{k} input lines, and k select lines.
A 2-way mux is the hardware analogue of if-then-else.
    if S=0
        M=A
    else
        M=B
    endif
A 4-way mux is an if-then-elif-elif-else
    if S1=0 and S2=0
        M=A
    elif S1=0 and S2=1
        M=B
    elif S1=1 and S2=0
        M=C
    else -- S1=1 and S2=1
        M=D
    endif
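The if-then-else view can be turned back into gates. Below, the 2-way mux is written as the sum of products M = S'A + SB, and the 4-way mux is built as a tree of 2-way muxes. The tree is one possible construction and I do not claim it is one of those done on the board; the function names are mine.

```python
from itertools import product

def mux2(s, a, b):
    """2-way mux using only AND, OR, NOT: M = S'A + SB."""
    return ((1 - s) & a) | (s & b)

def mux4(s1, s2, a, b, c, d):
    """4-way mux as a tree of 2-way muxes; S1 S2 = 00 picks a, 01 b, 10 c, 11 d."""
    return mux2(s1, mux2(s2, a, b), mux2(s2, c, d))

# Compare against the if-then-elif description for every input combination.
for s1, s2, a, b, c, d in product((0, 1), repeat=6):
    assert mux4(s1, s2, a, b, c, d) == (a, b, c, d)[2 * s1 + s2]
print("mux4 matches the if-then-elif-elif-else description")
```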
S | In0 | In1 | Out |
---|---|---|---|
0 | 0 | X | 0 |
0 | 1 | X | 1 |
1 | X | 0 | 0 |
1 | X | 1 | 1 |
Consider a 2-input mux. If the selector is 0, the output is I0 and the value of I1 is irrelevant. Thus, when the selector is 0, I1 is a don't care input. Similarly, when the selector is 1, I0 is a don't care input.
On the right we see the resulting truth table. Recall that without using don't cares the table would have 8 rows since there are three inputs; in this example the use of don't cares reduced the table size by a factor of 2.
The truth table for a 4-input mux has 64 rows, but the use of don't care inputs has a dramatic effect. When the selector is 01 (i.e., S0 is 0 and S1 is 1), the output equals the value of I1 and the other three I's are don't care. A corresponding result occurs for other values of the selector.
Homework: Draw the truth table for a 4-input mux making use of don't care inputs. What size reduction occurred with the don't cares?
Homework:
C.13.
C.10. (Assume you have constant signals 1 and 0 as well.)
Recall that a don't care output occurs when for some input values (i.e., rows in the truth table), we don't care what the value is for certain outputs.
For example, a later part of the circuit might "mux out" these outputs when we have the specified inputs.
How can one construct a 5-way mux?
Construct an 8-way mux and use it as follows.
Can do better by realizing the select lines equalling 5, 6, or 7 are don't cares and hence the 8-way can be customized and would use fewer gates than an 8-way mux.
A | B | C | D | E | F |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
0 | 1 | 0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 | 1 | 0 |
1 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 1 | 1 | 1 | 0 |
1 | 1 | 0 | 1 | 1 | 0 |
1 | 1 | 1 | 1 | 0 | 1 |
The idea is to partially automate the algorithmic way you can produce a circuit diagram (in sum of products form) from a given truth table. Since the form of the circuit is always a bunch of ANDs feeding into a bunch of ORs, we can "manufacture" all the gates in advance of knowing the desired logic functions and, when the functions are specified, we just need to make the necessary connections from the ANDs to the ORs. In essence all possible connections are configured but with switches that can be open or closed.
Actually, the words above better describe a PAL (Programmable Array Logic) than a PLA, as we shall soon see.
Consider the truth table on the upper right, which we have seen before. It has three inputs A, B, and C, and three outputs D, E, F.
To the right we see the corresponding logic diagram in sum of products form.
Recall how we construct this diagram from the truth table.
To the right, the above figure is redrawn in a more schematic style.
Finally, we can draw a PLA in the more abstract form shown on bottom right.
When a PLA is manufactured all the specified connections are made. That is, a manufactured PLA is specific for a given circuit. Hence the name Programmable Logic Array is somewhat of a misnomer since the device is not programmable by the user.
Homework: C.11 and C.12.
A PAL can be thought of as a PLA in which the final dots are made by the user. The manufacturer produces a "sea of gates". The user programs it to the desired logic function by adding the dots.
One way to implement a mathematical function (or a Java function without side effects) is to perform a table lookup.
A ROM (Read Only Memory) is the analogous way to implement a logic function.
Important: A ROM does not have state. It is another combinational circuit. That is, it does not represent "memory". The reason is that once a ROM is manufactured, the output depends only on the input. I realize this sounds wrong, but it is right.
Indeed, we will shortly see that a ROM is like a PLA. Both are structures that can be used to implement a truth table.
The key property of combinational circuits is that the outputs depend only on the inputs. This property (having no state) is false for a RAM chip: The input to a RAM is (like the input to a ROM) an address and (unlike a ROM) an operation (read vs write). The RAM (given a read request) responds by presenting at its outputs the value CURRENTLY stored at that address. Thus knowing just the input (i.e., the address and the operation) is NOT sufficient for determining the output. In contrast, knowing the address supplied to a given ROM is sufficient for determining the output.
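The distinction is perhaps easiest to see in software terms. In this sketch the ROM is a fixed table holding the D, E, F outputs of the three-input example from the PLA discussion (address = ABC), while the RAM is an object with state. The names rom_read and RAM are mine.

```python
# A ROM is a pure table lookup: the address alone determines the output.
# Entry i holds the bits D E F for inputs A B C = i written in binary.
ROM = (0b000, 0b100, 0b100, 0b110, 0b100, 0b110, 0b110, 0b101)

def rom_read(address):
    return ROM[address]

class RAM:
    """A RAM has state: a read returns whatever was most recently written."""
    def __init__(self, size):
        self.cells = [0] * size

    def access(self, address, write=False, value=0):
        if write:
            self.cells[address] = value
        return self.cells[address]

ram = RAM(8)
first = ram.access(3)                    # read address 3
ram.access(3, write=True, value=7)       # an intervening write
second = ram.access(3)                   # the same read request again
print(rom_read(3) == rom_read(3), first == second)   # True False
```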
A PROM is a programmable ROM. That is, you buy the ROM with "nothing" in its memory and then before it is placed in the circuit you load the memory, and never change it. This is like a CD-R.
Again, as with a ROM, when you are using a PROM in a circuit, the output is determined by the input (the address).
An EPROM is an erasable PROM. It costs more but if you decide to change its memory this is possible (but is slow). This is like a CD-RW.
"Normal" EPROMs are erased by some ultraviolet light process that is performed outside the circuit. But EEPROMs (electrically erasable PROMs) are not as slow and are done electronically. Since this is done inside the circuit you could consider it a RAM if you considered the erasing as a normal circuit operation.
Flash is a modern EEPROM that is reasonably fast.
All these EPROMS are erasable not writable, i.e. you can't just change one byte to an arbitrary value. (Some modern flash rams can nearly replace true RAM and perhaps should not be called EPROMS).
A ROM is similar to a PLA
A | B | C | D | E | F |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 1 |
0 | 1 | 1 | 1 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 1 |
1 | 0 | 1 | 1 | 1 | 0 |
1 | 1 | 0 | 1 | 1 | 0 |
1 | 1 | 1 | 1 | 1 | 0 |
Sometimes not all the input and output entries in a truth table are needed. We indicate this with an X and it can result in a smaller truth table.
For example, an output might be "muxed out" downstream.
A | B | C | D | E | F |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 1 |
0 | 1 | 1 | 1 | 1 | X |
1 | 0 | 0 | 1 | 1 | X |
1 | 0 | 1 | 1 | 1 | X |
1 | 1 | 0 | 1 | 1 | X |
1 | 1 | 1 | 1 | 1 | X |
The top diagram on the right is the full truth table for the following example (from the book) that we have considered before. Consider a logic function with three inputs A, B, and C, and three outputs D, E, and F.
The full truth table has 7 minterms (rows with at least one nonzero output).
The middle truth table has the output don't cares indicated.
A | B | C | D | E | F |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 1 |
X | 1 | 1 | 1 | 1 | X |
1 | X | X | 1 | 1 | X |
Now we do the input don't cares
The resulting truth table is also shown on the right.
These don't cares are important for logic minimization. Compare the number of gates needed for the full truth table and the reduced truth table. There are techniques for minimizing logic, but we will not cover them.
Often we want to consider signals that are wider than a single bit. An array of logic elements is used when each of the individual bits is treated similarly. As we will soon see, sometimes most of the bits are treated similarly, but there are a few exceptions. For example, a 32-bit structure might treat the lob (low order bit) and hob differently from the others. In such a case we would have an array 30 bits wide and two 1-bit structures.
"by n" notation.
Start Lecture #4
We will produce logic designs for the integer portion of the MIPS ALU. The floating point operations are more complicated and will not be implemented.
MIPS is a computer architecture widely used in embedded designs. In the 80s and early 90s, it was quite popular for desktop (or desk-side) computers. This was the era of the "killer micros" that decimated the market for minicomputers. (When I got a DECstation desktop with a MIPS R3000, I think that, for a short while, it was the fastest integer computer at NYU.)
Much of the design we will present (indeed, all of the beginning part) is generic. I will point out when we are tailoring it for MIPS.
Our first goal will be a 1-bit wide structure that computes the AND, OR, and SUM of two 1-bit quantities. For the sum there is actually a third input, CarryIn, and a 2nd output, CarryOut.
Since our basic logic toolkit already includes AND and OR gates, our first real task is a 1-bit adder.
If the final goal was a 1-bit ALU, then we would not have a CarryIn. For a multi-bit ALU, the CarryIn for each bit is the CarryOut of the preceding lower-order bit (e.g., the CarryIn for bit 3 is the CarryOut from bit 2). When we don't have a CarryIn, the structure is sometimes called a "half adder". Don't treat the name too seriously; it is not half of an adder.
A half adder has the following inputs and outputs.
Draw the truth table.
Homework: Draw the logic diagram.
Now we include the carry-in.
the total number of 1s in X, Y, and Ci is odd
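In behavioral terms both adders are one-liners. The sketch uses X, Y, and Ci as in the text; the output names S and Co and the function names are my choice. The sum is the "odd number of 1s" condition and the carry out is 1 when at least two inputs are 1.

```python
from itertools import product

def half_adder(x, y):
    """Return (S, Co) for two 1-bit inputs and no carry-in."""
    return x ^ y, x & y

def full_adder(x, y, ci):
    """Return (S, Co); S is 1 exactly when the number of 1s in x, y, ci is odd."""
    s = x ^ y ^ ci
    co = (x & y) | (x & ci) | (y & ci)    # at least two of the three inputs are 1
    return s, co

print("X Y Ci S Co")
for x, y, ci in product((0, 1), repeat=3):
    s, co = full_adder(x, y, ci)
    print(x, y, ci, s, co)
```

A quick sanity check is that S + 2 Co always equals X + Y + Ci.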
Homework:
We have implemented 1-bit versions of AND (a basic gate), OR (a basic gate), and SUM (the FA just constructed, which we henceforth draw as shown on the right). We now want a single structure that, given another input (the desired operation, a so-called "control line"), produces as output the specified operation.
There is a general principle used to produce a structure that yields either X or Y depending on the value of operation.
This mux, with an operation select line, gives a structure that "sometimes" produces one result and "sometimes" produces another. Internally both results are always produced.
In our case we have three possible operations so we need a three way mux and the select line is a 2-bit wide bus. With a 2-bit select line we can specify 4 operations; for now we are using only three.
We show the diagram for this 1-bit ALU on the right.
In subsequent diagrams the "Operation" input will be shown in green to distinguish it as a control line rather than a data line. (Now it is drawn in blue to show that it is new in this diagram.) The goal is to produce two bits of result from 2 (AND, OR) or 3 (ADD) bits of data. The 2 bits of control tell what to do, rather than what data to do it on.
The extra data output (CarryOut) is always produced. Presumably if the operation was AND or OR, CarryOut is not used.
I believe the distinction between data and control will become quite clear as we encounter more examples. However, I wouldn't want to be challenged to give a (mathematically precise) definition.
Remarks:
A 1-bit ALU is interesting, but we need a 32-bit ALU to implement the MIPS 32-bit operations, acting on 32-bit data values.
For AND and OR, there is almost nothing to do; a 32-bit AND is just 32 1-bit ANDs so we can simply use an array of logic elements.
However, ADD is a little more interesting since the bits are not quite independent: The CarryOut of one bit becomes the CarryIn of the next.
Let's start with a 4-bit adder.
How about a 32-bit adder, or even an n-bit adder?
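The n-bit case is just the same chaining repeated n times, which a loop expresses nicely. In this sketch bits are stored low-order bit first; the helper names (ripple_add, to_bits, from_bits) are mine.

```python
def full_adder(x, y, ci):
    s = x ^ y ^ ci
    co = (x & y) | (x & ci) | (y & ci)
    return s, co

def ripple_add(a_bits, b_bits, carry_in=0):
    """Add two equal-length bit lists (low-order bit first); return (sum_bits, carry_out)."""
    carry = carry_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # this bit's CarryOut is the next bit's CarryIn
        sum_bits.append(s)
    return sum_bits, carry

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

s, c = ripple_add(to_bits(6, 4), to_bits(7, 4))
print(from_bits(s), c)    # 13 0
```

The same function works unchanged for 32 bits; only the length of the lists differs.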
To obtain a 32-bit ALU, we put together the 1-bit ALUs in a manner similar to the way we constructed a 32-bit adder from 32 FAs. Specifically we proceed as follows and as shown in the figure on the right.
"Broadcast" Operation to all of the internal 1-bit ALUs. This means wire the external Operation to the Operation input of each of the internal 1-bit ALUs.
Remark
This is one place where our treatment must go a little out of order. Appendix C in the book assumes you have read the chapter on computer arithmetic; in particular it assumes that you know about two's complement arithmetic. I do not assume you know this material (although I know many of you do). We will cover it (briefly) later, when we do that chapter. What I will do here is assert some facts about two's complement arithmetic that we will use to implement the circuit for SUB.
End of Remark.
For simplicity I will be presenting 4-bit arithmetic. We are really interested in 32-bit arithmetic, but the idea is the same and the 4-bit examples are much shorter (and hence less likely to contain typos).
With 4 bits, there are 16 numbers. Since two's complement notation has one representation for each number, there are 15 nonzero values. Since there are an odd number of nonzero values, there cannot be the same number of positive and negative values. In fact 4-bit two's complement notation has 8 negative values, and 7 positive values. (In one's complement notation there are the same number of positive and negative values, but there are two representations for zero, which is inconvenient.)
The high order bit (hob) on the left is the sign bit. The sign bit is zero for positive numbers and for the number zero; the sign bit is one for negative numbers.
Zero is written simply 0000.
1-7 are written 0001, 0010, 0011, 0100, 0101, 0110, 0111. That is, you set the sign bit zero and write 1-7 using the remaining three lob's. This last statement is also true for zero.
-1, -2, ..., -7 are written by taking the two's complement of the corresponding positive number. The two's complement is computed in two steps.
If you take the two's complement of -1, -2, ..., -7, you get back the corresponding positive number. Try it.
If you take the two's complement of zero you get zero. Try it.
What about the 8th negative number?
-8 is written 1000. But if you take its (4-bit) two's complement, you must get the wrong number because the correct number (+8) cannot be expressed in 4-bit two's complement notation.
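If you want to "try it" without pencil and paper, here is a small sketch. It uses the standard rule for the two's complement (invert every bit, then add 1, discarding any carry out of the high order bit); the function names are mine.

```python
def twos_complement(value, n=4):
    """Two's complement of an n-bit pattern: invert each bit, add 1, keep n bits."""
    mask = (1 << n) - 1
    return ((value ^ mask) + 1) & mask

def as_signed(value, n=4):
    """Interpret an n-bit pattern as a two's complement integer."""
    return value - (1 << n) if value >> (n - 1) else value

for k in range(1, 8):
    neg = twos_complement(k)
    print(k, format(neg, "04b"), as_signed(neg))

print(format(twos_complement(0b0000), "04b"))   # 0000: zero maps to zero
print(format(twos_complement(0b1000), "04b"))   # 1000: -8 maps to itself, not to +8
```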
Amazingly easy (if you ignore overflows).
No change is needed to our circuit above to handle two's complement numbers for AND/OR/ADD. That statement is not clear for ADD and will be shown true later in the course.
We wish to augment the ALU so that we can perform subtraction as well. As we stated above, A-B is obtained by taking the two's complement of B and adding. A 1-bit implementation is drawn on the right with the new structures in blue (I often use blue for this purpose). The enhancement consists of
A 32-bit version is simply a bunch of the 1-bit structures wired together as shown on the right. I use CarryIn and CarryOut when referring to the external carry signals of the entire 32-bit structure. Please do not confuse them with Cin and Cout, the corresponding signals to each individual 1-bit structure.
AND, OR, ADD, and SUB are found in nearly all ALUs. In that sense, the construction up to this point has been generic. However, most real architectures have some extras. For MIPS they include:
Set on less than (slt), not common and not so easy.
We noted above that our ALU already gives us the ability to calculate AB', an uncommon logic function. A MIPS ALU needs NOR and, by DeMorgan's law
A NOR B = (A + B)' = A'B'
which is rather close to AB'; we just need to invert A as well as B.
The diagram on the right shows the added structures: an inverter to get A', a mux to choose between A and A', and a control line for the mux.
NOR is obtained by asserting Ainvert and Binvert and setting Operation=00.
The other operations are done as before, with Ainvert de-asserted.
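As a rough sketch of the 1-bit cell described so far (the diagram itself is not reproduced here), the following function models the two invert muxes, the adder, and the Operation mux. The function name alu_cell and the encoding of Operation as the integers 0, 1, 2 (for 00, 01, 10) are assumptions made for illustration.

```python
def alu_cell(a, b, cin, ainvert, binvert, operation):
    """One bit of the ALU. Returns (result, cout).

    operation: 0 selects AND, 1 selects OR, 2 selects the adder output.
    """
    x = a ^ ainvert            # models the mux choosing between A and A'
    y = b ^ binvert            # models the mux choosing between B and B'
    s = x ^ y ^ cin            # sum output of the full adder
    cout = (x & y) | (x & cin) | (y & cin)
    result = (x & y, x | y, s)[operation]
    return result, cout


# NOR: assert Ainvert and Binvert and select the AND gate.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, alu_cell(a, b, 0, 1, 1, 0)[0])
```

The adder always computes its outputs; the Operation mux simply ignores the sum when AND or OR is selected.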
The 32-bit version is a straightforward ...
Homework: Draw the 32-bit ALU that supports AND, OR, ADD, SUB, and NOR.
Start Lecture #5
Remark: See if final date can be moved.
Remark: As with two's complement arithmetic, I just present the bare-bones facts here; they are explained later in the course.
The facts are trivial (although the explanation is not). Indeed there is just one fact.
Only the hob portion of the ALU needs to be changed. We need to see if the carry-in is different from the carry-out, but that is exactly XOR. The simple modification to the hob structure is shown on the right.
Do on the board 4-bit twos complement addition of
The 32-bit version is again a straightforward ...
Homework: Draw the 32-bit ALU that supports AND, OR, ADD, SUB, and NOR and that asserts an overflow line when appropriate.
Note that to ease the homework and, more importantly, the real design, we can use the enhanced 1-bit ALU for all bits and simply ignore the overflow output for all but the HOB.
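The overflow rule (carry-in to the hob differs from carry-out of the hob) can be checked exhaustively for 4 bits. This is only a sketch; add4 and to_signed are hypothetical helper names.

```python
def to_signed(pattern):
    """Signed value of a 4-bit two's complement pattern."""
    return pattern - 16 if pattern & 8 else pattern


def add4(a, b, carry_in=0):
    """Ripple-add two 4-bit patterns. Returns (4-bit result, overflow)."""
    carry = carry_in
    result = 0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        cin = carry                      # carry into bit i
        result |= (ai ^ bi ^ cin) << i
        carry = (ai & bi) | (ai & cin) | (bi & cin)
    # After the loop, cin is the carry into the hob and carry is its carry out.
    overflow = cin ^ carry
    return result, overflow


# Compare the XOR rule with "the true sum does not fit in 4 bits".
mismatches = 0
for a in range(16):
    for b in range(16):
        _, overflow = add4(a, b)
        fits = -8 <= to_signed(a) + to_signed(b) <= 7
        if overflow == fits:             # overflow should be 1 exactly when it does not fit
            mismatches += 1
print("mismatches:", mismatches)
```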
We are given two 32-bit, two's complement numbers A and B as input and seek a 32-bit result that is 1 if A<B and 0 otherwise. Note that only the lob of the result varies; the other 31 bits are always 0.
The implementation is fairly clever as we shall see.
The first idea is rather simple.
The sign of A-B is 1 precisely when A<B. Thus, to implement slt, we need to set the LOB of the result equal to the sign bit of the subtraction A-B, and set the rest of the result bits to zero.
Give the 4-way mux another (i.e., a fourth) input, called Less. This input is brought in from outside the bit cell. To generate slt, we make the select line to the mux equal to 11 so that the output is this new input. See the diagram on the right.
Use the settings just mentioned so that the adder computes A-B (and the mux throws it away). Modify the HOB logic as follows (again it might be easier to do this modification for all bits, but just use the result from the HOB).
Why didn't I show a detailed diagram for this method?
Because this method is not used.
Why isn't the method used?
Because it is wrong!
The problem with the above solution is that it ignores overflows. Consider the following 4-bit (instead of 32-bit) example.
The fix is to use the correct rule for "less than" rather than the incorrect rule "the sign bit of A-B is 1", which ignores overflows and gives the wrong answer when an overflow occurs.
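To see the bug concretely, here is a small sketch that applies the incorrect rule to every pair of 4-bit values and collects the pairs where it disagrees with a real comparison. The particular pair printed (A = -7, B = 6) is my own choice of example; the names naive_slt and to_signed are hypothetical. The sketch deliberately does not show the corrected rule, which is the homework.

```python
def to_signed(pattern):
    """Signed value of a 4-bit two's complement pattern."""
    return pattern - 16 if pattern & 8 else pattern


def naive_slt(a, b):
    """Incorrect rule: return the sign bit of the 4-bit difference a - b."""
    diff = (a - b) & 0xF
    return (diff >> 3) & 1


# Every (A, B) pair, as signed values, where the naive rule gives the wrong answer.
failures = [
    (to_signed(a), to_signed(b))
    for a in range(16)
    for b in range(16)
    if naive_slt(a, b) != int(to_signed(a) < to_signed(b))
]

print(naive_slt(0b1001, 0b0110))   # A = -7, B = 6: A < B, yet the sign bit of A-B is 0
print(len(failures), "failing pairs")
```

Every failing pair is one where the true difference A-B lies outside the range -8..7, i.e., where the subtraction overflows.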
Homework: Figure out the correct rule, i.e. a non-pictorial version of problem C.24. Hint: When an overflow occurs, the sign bit is definitely wrong.
An even bigger hint is that the diagram on the right shows the correct calculation of Set in the HOB.
This is an example where explaining the bug is harder than fixing it.
On the right is a picture of the final 1-bit ALU that shows the external interface but hides the internal details.
Recall that our goal is a 32-bit ALU. It will contain 32 of the 1-bit ALUs we have just constructed. When drawing the larger structure, we want to hide the details of the individual 1-bit cells. Thus we draw the 1-bit structure as shown on the right.
In the pictures below, to save space, I sometimes omit the labels on the interfaces of the internal structures. I try to ensure that they are in the same order as in the picture on the right (let me know of any bugs you see) and try to have enough information in the picture so that you do not need to know the order.
To see if A = B we simply form A-B and test if the result is zero.
The final 32-bit ALU is shown below on the right. Note that all 32 1-bit cells are identical; it is only the inter-cell wiring that differs. This is important!
Although each 1-bit cell has 4 control inputs (Ainvert, Binvert, Cin, Operation), the entire 32-bit ALU has only 3 control inputs (CarryIn is not present).
For all bits except the LOB, Cin is wired to Cout of the preceding bit.
For the LOB, Cin is the same as Binvert. So we define a single external line Bnegate, which is sent to Binvert for every 1-bit ALU and is also sent to Cin of the LOB. Thus there is no CarryIn signal needed.
Again note that all the bits have the same circuit. The lob and hob have special external wiring; the other 30 bits are wired the same.
To the right we see the symbol used for an ALU. Note that we have combined the two 1-bit control lines (Ainvert and Bnegate) plus the 2-bit Operation control line into a single 4-bit control line called ALUOperation.
The book uses the label Zero for the middle output. I believe a better label would be Equal since the line gives the Boolean result A==B (which is computed as (A-B)==0). I use the term Equal Zero, rather than Equal, to ease a comparison with the book.
function | 4-bit cntl | Ainv | Bneg | Oper |
---|---|---|---|---|
AND | 0000 | 0 | 0 | 00 |
OR | 0001 | 0 | 0 | 01 |
ADD | 0010 | 0 | 0 | 10 |
SUB | 0110 | 0 | 1 | 10 |
slt | 0111 | 0 | 1 | 11 |
NOR | 1100 | 1 | 1 | 00 |
The ALU can directly perform the following MIPS instructions by setting the control lines as indicated in the table on the right.
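The table can be read as a decoding recipe: the 4-bit control splits into Ainvert, Bnegate, and the 2-bit Operation. The sketch below is behavioral only (it works on whole 32-bit words rather than wiring 32 cells together), and it computes slt with an ordinary signed comparison rather than with the hardware rule. The names alu and signed are assumptions made for illustration.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def signed(x):
    """Interpret a 32-bit pattern as a two's complement number."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x


def alu(control, a, b):
    """Behavioral model of the ALU. Returns (result, zero)."""
    ainvert = (control >> 3) & 1
    bnegate = (control >> 2) & 1
    operation = control & 0b11
    x = (a ^ MASK) if ainvert else a
    y = (b ^ MASK) if bnegate else b
    if operation == 0b00:
        result = x & y
    elif operation == 0b01:
        result = x | y
    elif operation == 0b10:
        result = (x + y + bnegate) & MASK    # Bnegate also feeds Cin of the LOB
    else:
        result = int(signed(a) < signed(b))  # slt, modeled by a plain comparison
    zero = int(result == 0)
    return result, zero


print(alu(0b0110, 7, 7))      # SUB of equal values
print(alu(0b1100, 0, 0))      # NOR of two zero words
```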
Remark: We have developed the logic needed to implement 6 machine instructions. The technical term is that we have developed the data path. That is one of three tasks needed for a full implementation. The other two are:
Before we do either of these tasks, we will learn a much faster method for addition (and subtraction).
The ALU above is not used in practice since it is too slow. The fundamental problem is that calculating the i^{th} bit of the result requires the carry out of the (i-1)^{th} bit. For this reason the above ALU is said to perform a "ripple carry" since the carry computation ripples along from the lob to the hob. Thus, for a 64-bit addition, the hob will take a long time to compute.
The adder we will study next is much faster than the ripple adder we did before, especially for wide (i.e., many bit) addition. (Due to the joys of two's complement addition, any adder can subtract by complementing the bits and adding, as we did above.)
Infinite Hardware
This is a simple (theoretical) result, but not practical.
At each bit position we have two input bits a_{i} and b_{i} as well as a CarryIn input. We now define two other bits called propagate (p_{i}=a_{i}+b_{i}) and generate (g_{i}=a_{i}b_{i}), which have the following properties.
rippleeffect.
The reason for the name "propagate" is that p is true if the current bit will propagate a carry from its input to its output. More precisely:
    if (p_{i}) then { if there is a carry in to bit i then there is a carry out from bit i }
    if NOT (p_{i}) then { there is never a carry out from bit i }
The reason for the name "generate" is that g is true if the current bit will generate a carry out (independent of the carry in). More precisely:
    if (g_{i}) then { there is definitely a carry out from bit i }
    if NOT (g_{i}) then { there might not be a carry out from bit i }
These key formulas are quite simple, but are very useful. To repeat:
    Generate:  g_{i} = a_{i}·b_{i}
    Propagate: p_{i} = a_{i}+b_{i}
The diagram on the right, from P&H, gives a plumbing analogue for generate and propagate. A full size version of the diagram (in pdf) is here. (The plumbing diagrams in these notes are from the 2e; the colors changed between editions, but the contents are the same.)
The point is that liquid enters the main pipe if either the initial CarryIn or one of the generates is true. The water exits the pipe at the lower left (i.e., there is a CarryOut for this bit position) if all the propagate valves are open from the lowest liquid entrance to the exit.
Given the generates and propagates, we can calculate all the carries for a 4-bit addition as follows (recall that c_{0}=Cin is an input). These formulas correspond directly to the plumbing picture on the right. For simplicity, I will stop writing subscripts smaller and lowered.
c1 = g0 + p0 c0
c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0
c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
c4 = g3 + p3 c3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0
Thus we can calculate c1 ... c4 in just two additional gate delays given the p's and g's. (We assume one gate can accept up to 5 inputs). Since we get gi and pi after one gate delay, the total delay for calculating all the carries is 3 gate delays. This includes calculating c4=CarryOut.
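As a sanity check on the expanded formulas, the sketch below computes c1 through c4 directly from the p's, g's, and c0 and compares them with the carries obtained by rippling. The function names are mine; in Python `&` binds tighter than `|`, so each line reads just like the sum-of-products formula.

```python
def lookahead_carries(a, b, c0):
    """Return [c1, c2, c3, c4] for 4-bit a and b using the expanded formulas."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]       # generate bits
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]     # propagate bits
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    c4 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1]
          | p[3] & p[2] & p[1] & g[0] | p[3] & p[2] & p[1] & p[0] & c0)
    return [c1, c2, c3, c4]


def ripple_carries(a, b, c0):
    """Return [c1, c2, c3, c4] by rippling the carry through four full adders."""
    carries, c = [], c0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | (ai & c) | (bi & c)
        carries.append(c)
    return carries


agree = all(
    lookahead_carries(a, b, c0) == ripple_carries(a, b, c0)
    for a in range(16) for b in range(16) for c0 in (0, 1)
)
print("formulas agree with ripple carry:", agree)
```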
Start Lecture #6
Note: The above formulas are for 4-bit arithmetic. A crucial point is that, when we use more bits, the formulas will NOT get bigger.
Each bit of the sum si can be calculated in 2 gate delays given ai, bi, and ci. Since we just saw that the ci's can be calculated in 3 gate delays, for 4-bit addition, 5 gate delays after we are given a, b and Carry-In, we have calculated s and Carry-Out.
We show this in the diagram on the right.
The Carry Lookahead Block has inputs a and b and the carry in. The block calculates the pi's and gi's internally (not shown in the diagram) and then calculates the carries, which are the outputs of the block. We have seen above that the block requires 3 gate delays.
+ is the part of a full adder that calculates the sum
    si = ai⊕bi⊕ci
       = ai·bi·ci + ai·bi'·ci' + ai'·bi·ci' + ai'·bi'·ci
Thus, for 4-bit addition, 5 gate delays after we are given a, b and the Carry-In, we have calculated s and the Carry-Out using a modest amount of realistic (no more than 5-input) logic.
How does the speed of this carry-lookahead adder (CLA) compare to our original ripple-carry adder?
We have finished the design of a 4-bit CLA. Our next goal is a 16-bit fast adder. Let's consider, at varying levels of detail, five possibilities.
Super Propagate and Super Generate
We start the adventure by defining "super propagate" and "super generate" bits.
Plumbing picture for super propagate and super generate. A larger picture is here.
P0 = p3 p2 p1 p0          Low order 4-bit adder propagates a carry
P1 = p7 p6 p5 p4          Second 4-bit adder propagates a carry
P2 = p11 p10 p9 p8        Third 4-bit adder propagates a carry
P3 = p15 p14 p13 p12      High order 4-bit adder propagates a carry
G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0            Low order 4-bit adder generates a carry
G1 = g7 + p7 g6 + p7 p6 g5 + p7 p6 p5 g4            Second 4-bit adder generates a carry
G2 = g11 + p11 g10 + p11 p10 g9 + p11 p10 p9 g8     Third 4-bit adder generates a carry
G3 = g15 + p15 g14 + p15 p14 g13 + p15 p14 p13 g12  High order 4-bit adder generates a carry
From these super propagates and super generates, we can calculate the super carries, i.e. the carries for the four 4-bit adders. We will use four of the 4-bit CLAs to form our 16-bit CLA but we need to calculate all the Carry-In's to the 4-bit CLAs at once NOT in a ripple-carry manner.
C0 = c0
C1 = G0 + P0 c0
C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0
C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0
C4 = G3 + P3 C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 c0
But this looks terrific! These super carries are what we need to combine four 4-bit CLAs into a 16-bit CLA in a carry-lookahead manner. Recall that the hybrid approach suffered because the carries from one 4-bit CLA to the next (i.e., the super carries) were done in a ripple carry manner.
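The super carry formulas can be checked numerically before we trust them in a picture. The sketch below computes the P's, G's, and C's for 16-bit operands and compares each Ci with the carry that actually enters bit 4i in an ordinary addition. The helper names super_carries and true_carry_into are hypothetical, and only a sample of operand pairs is tested.

```python
def super_carries(a, b, c0):
    """Return [C0, C1, C2, C3, C4] for 16-bit a and b from the super formulas."""
    p = [((a >> i) | (b >> i)) & 1 for i in range(16)]
    g = [(a >> i) & (b >> i) & 1 for i in range(16)]
    P, G = [], []
    for j in range(4):                       # one super pair per 4-bit group
        p0, p1, p2, p3 = p[4 * j:4 * j + 4]
        g0, g1, g2, g3 = g[4 * j:4 * j + 4]
        P.append(p3 & p2 & p1 & p0)
        G.append(g3 | p3 & g2 | p3 & p2 & g1 | p3 & p2 & p1 & g0)
    C1 = G[0] | P[0] & c0
    C2 = G[1] | P[1] & G[0] | P[1] & P[0] & c0
    C3 = G[2] | P[2] & G[1] | P[2] & P[1] & G[0] | P[2] & P[1] & P[0] & c0
    C4 = (G[3] | P[3] & G[2] | P[3] & P[2] & G[1]
          | P[3] & P[2] & P[1] & G[0] | P[3] & P[2] & P[1] & P[0] & c0)
    return [c0, C1, C2, C3, C4]


def true_carry_into(a, b, c0, k):
    """Carry entering bit k when a + b + c0 is computed the ordinary way."""
    low = (1 << k) - 1
    return ((a & low) + (b & low) + c0) >> k


ok = all(
    super_carries(a, b, c0) == [true_carry_into(a, b, c0, 4 * j) for j in range(5)]
    for a in range(0, 1 << 16, 257)
    for b in range(0, 1 << 16, 263)
    for c0 in (0, 1)
)
print("super carries match:", ok)
```

Note that no Ci is computed from another Ci; each uses only the P's, G's, and c0.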
Since it is not completely clear how to combine the pieces so far presented to get a 16-bit, 2-level CLA, I will give a pictorial account very soon.
Before the pictures, let's assume the pieces can be put together and see how fast the 16-bit, 2-level CLA actually is. Recall that we have already seen two practical 16-bit adders: A ripple carry version taking 32 gate delays and a hybrid structure taking 14 gate delays. If the 2-level design isn't faster than 14 gate delays, we won't bother with the pictures.
Remember we are assuming 5-input gates. We use lower case p, g, and c for propagates, generates, and carries; and use capital P, G, and C for the super- versions.
Since 9<14, let the pictures begin!
pi = ai+bi
gi = ai bi
c1 = g0 + p0 c0
c2 = g1 + p1 g0 + p1 p0 c0
c3 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
c4 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0
si = ai bi ci + ai bi' ci' + ai' bi ci' + ai' bi' ci
P = p3 p2 p1 p0
G = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0
Next we put four of these 4-bit CLAs together with a new structure called a Carry Lookahead Block (CL-Block) that calculates the carries needed by the 4-bit CLA-PGs using the P's, G's and C_{in}=C0. The result is a 16-bit CLA!
The formulas for the C's are above, but I repeat them here.
C0 = Cin
C1 = G0 + P0 Cin
C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 Cin
C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 Cin
C4 = G3 + P3 C3 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 Cin
Note that I do not call the new structure a 4-bit CL Block or a 16-bit CL block. More on this later.
The diagram on the right shows the result; the color of the wires indicates when the values are calculated.
The last paragraph is stated sloppily. Gates are always calculating their outputs from their inputs. When we say something is calculated in k gate delays, we mean that the outputs are correct k gate delays after the inputs are correct. A more accurate version of the previous paragraph would be
Since the magenta lines flow right to left (then down), I drew their arrowheads. I typically do not draw arrowheads for lines that go to the right, go down, or go right and down.
Note that all the Ci's are calculated at once. For example, it is not the case that C3 depends on C2.
Start Lecture #7
We are not done with the CL Block since our goal is to construct CLAs for any power-of-4 number of bits using this CL Block. Specifically, again assuming 5-input gates, we want the exact same CL Block to be used for a 4-bit (1-level) CLA; a 16-bit (2-level) CLA; a 64-bit (3-level) CLA; a 256-bit (4-level) CLA, etc.
In fact, we will go back further and construct a 1-bit (0-level) CLA, from which the 4-bit (1-level) CLA is built.
Moreover, when going from a 4^{n}-bit (n-level) CLA to a 4^{n+1}-bit ((n+1)-level) CLA, there should be no new logic that is needed. Specifically, we want a 64-bit (3-level) CLA to be composed of four 16-bit (2-level) CLAs, one additional CL Block (identical to those in the smaller constituent CLAs), some wires, and nothing else.
In the previous diagram we used a CL Block to assemble a 16-bit CLA from four 4-bit CLAs, but did not prepare for constructing a 64-bit CLA from four of these 16-bit CLAs. For that reason we did not have Pout and Gout (note that each 4-bit CLA used did output a P and a G, which were used when constructing a 16-bit CLA).
In general, when constructing a CLA using the CL Block, there are actually three sizes of CLAs that are relevant (so far we have only dealt with two of the three).
The full CL Block is drawn on the right and contains two outputs not shown or used previously, Pout and Gout.
This Block has the following 9 inputs.
It has the following 6 outputs
These outputs are calculated from the following, previously studied, formulas.
C1 = Gin0 + Pin0 C_{in}
C2 = Gin1 + Pin1 Gin0 + Pin1 Pin0 C_{in}
C3 = Gin2 + Pin2 Gin1 + Pin2 Pin1 Gin0 + Pin2 Pin1 Pin0 C_{in}
C4 = Gin3 + Pin3 Gin2 + Pin3 Pin2 Gin1 + Pin3 Pin2 Pin1 Gin0 + Pin3 Pin2 Pin1 Pin0 C_{in}
Pout = Pin3 Pin2 Pin1 Pin0
Gout = Gin3 + Pin3 Gin2 + Pin3 Pin2 Gin1 + Pin3 Pin2 Pin1 Gin0
It is now time to validate the claim that all (power of 4) sizes of CLAs can be built (recursively) using the CL Block.
A 1-bit CLA is just a 1-bit adder. With only one bit there is no need for any "lookahead" since there is no "ripple" to try to avoid.
However, to enable us to build a 4-bit CLA from the 1-bit version, we actually need to build what we call a CLA-PG. The 1-bit CLA-PG has three inputs a, b, and cin. It produces 4 outputs s, cout, p, and g. We have given the logic formulas for all four outputs previously, but here they are again.
s    = a b cin + a b' cin' + a' b cin' + a' b' cin    an odd number of bits are 1
cout = a b + a cin + b cin                            at least two bits are 1
p    = a + b
g    = a b
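With only eight input combinations, the four formulas are easy to verify by brute force. The following sketch (the function name cla_pg_1bit is mine) checks that s and cout really are the sum and carry of a + b + cin, and that cout can also be written as g + p·cin.

```python
def cla_pg_1bit(a, b, cin):
    """The 1-bit CLA-PG: returns (s, cout, p, g) from the formulas above."""
    na, nb, nc = 1 - a, 1 - b, 1 - cin        # complemented inputs
    s = a & b & cin | a & nb & nc | na & b & nc | na & nb & cin
    cout = a & b | a & cin | b & cin
    p = a | b
    g = a & b
    return s, cout, p, g


for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout, p, g = cla_pg_1bit(a, b, cin)
            total = a + b + cin
            print(a, b, cin, "->", s, cout, p, g,
                  "ok" if (s, cout) == (total % 2, total // 2) else "WRONG")
```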
A 4-bit CLA-PG is shown as the red portion in the figure to the right.
It has nine inputs: 4 a's, 4 b's, and cin and must produce seven outputs: 4 s's, cout, p, and g (recall that the last two were previously called the super propagate and super generate respectively).
The tall black box is our CL Block.
The question is, what must the i^{th} ? box do in order for the entire (red) structure to be a 4-bit CLA-PG?
p_{i} = a_{i} + b_{i}
g_{i} = a_{i} b_{i}
So the ? box is just a 1-bit CLA-PG.
Wrong!
The ? box is only a (large) subset of a 1-bit CLA-PG.
What is missing?
Ans. The ? box doesn't need to produce a carry out since the CL Block produces all the carries.
So, if we want to say that the 4-bit (1-level) CLA-PG is composed of four 1-bit (0-level) CLA-PGs together with a CL Block, we must draw the picture as on the right. The difference is that we explicitly show that the ? box produces cout, which is then not used.
This situation will occur for all sizes. For example, either picture on the right for a 4-bit CLA-PG produces a carry out since all 4-bit full adders do so. However, a 16-bit CLA-PG, built from four of the 4-bit units and a CL Block, does not use the carry outs produced by the four 4-bit units.
We have several alternatives.
We choose the last alternative.
As another abbreviation, we will henceforth say CLA when we mean CLA-PG.
Remark: Hence the 4-bit CLA (meaning CLA-PG) is composed of
Now take four of these 4-bit adders and use the identical CL Block to get a 16-bit adder.
The picture on the right shows one 4-bit adder (the red box) in detail. The other three 4-bit adders are just given schematically as small empty red boxes. The CL Block is also shown and is wired to all four 4-bit adders.
The complete (large) picture is shown here.
Remark: Hence the 16-bit CLA is composed of
To construct a 64-bit CLA no new components are needed. That is, the only components needed have already been constructed. Specifically you need.
Remark: Hence the 64-bit CLA (meaning CLA-PG) is composed of
When drawn (with a brown box) the 64-bit CLA-PG has 129 inputs (64+64+1) and 67 outputs (64+1+2).
Remark: Hence the 256-bit CLA (meaning CLA-PG) is composed of
Homework: How many gate delays are required for our 64-bit CLA-PG? How many gate delays are required for a 64-bit ripple carry adder (constructed from 1-bit full adders)?
Start logisim and look at my circuits.
Remark: Lab 3 assigned; it is due 4 October 2011.
CLAs greatly speed up addition; the increase in speed grows with the size of the numbers to be added.
Remark: CLAs implement n-bit addition in O(log(n)) gate delays.
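The recursive construction (four copies of the next smaller CLA plus one CL Block) can be sketched in software. This is a functional model only, under my own naming (cl_block, cla); it says nothing about gate delays. One software artifact is labeled in the comments: each sub-adder is evaluated twice, once to obtain its P and G (which do not depend on its carry in) and once with the real carry in, whereas the hardware does both at once.

```python
def cl_block(P, G, cin):
    """The CL Block: P and G are lists of 4 bits.

    Returns (C1, C2, C3, C4, Pout, Gout).
    """
    C1 = G[0] | P[0] & cin
    C2 = G[1] | P[1] & G[0] | P[1] & P[0] & cin
    C3 = G[2] | P[2] & G[1] | P[2] & P[1] & G[0] | P[2] & P[1] & P[0] & cin
    Pout = P[3] & P[2] & P[1] & P[0]
    Gout = G[3] | P[3] & G[2] | P[3] & P[2] & G[1] | P[3] & P[2] & P[1] & G[0]
    C4 = Gout | Pout & cin          # same terms as the expanded C4 formula
    return C1, C2, C3, C4, Pout, Gout


def cla(a, b, cin, level):
    """A 4**level-bit CLA-PG. Returns (s, cout, P, G)."""
    if level == 0:                  # the 1-bit CLA-PG
        return a ^ b ^ cin, a & b | a & cin | b & cin, a | b, a & b
    width = 4 ** (level - 1)        # bits handled by each of the four sub-adders
    mask = (1 << width) - 1
    parts = [((a >> (i * width)) & mask, (b >> (i * width)) & mask) for i in range(4)]
    # Software artifact: evaluate each sub-adder once (with cin=0) just for its P and G.
    pg = [cla(x, y, 0, level - 1)[2:] for x, y in parts]
    P = [t[0] for t in pg]
    G = [t[1] for t in pg]
    C1, C2, C3, C4, Pout, Gout = cl_block(P, G, cin)
    s = 0
    for i, ((x, y), c) in enumerate(zip(parts, (cin, C1, C2, C3))):
        # The sub-adder's own carry out is produced but not used.
        s |= cla(x, y, c, level - 1)[0] << (i * width)
    return s, C4, Pout, Gout


print(cla(0x0123456789ABCDEF, 0xFEDCBA9876543210, 1, 3))
```

The same cl_block function is used unchanged at every level, which is the point of the construction.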
MIPS (and most other) processors must execute shift (and rotate) instructions.
We could easily extend the ALU to do 1-bit shift/rotates (i.e., shift/rotate a 32-bit quantity by 1 bit), and then perform an n-bit shift/rotate as n 1-bit shift/rotates.
This is not done in practice. Instead a separate structure, called a barrel shifter is built outside the ALU.
Remark: Barrel shifters, like CLAs, are of logarithmic complexity.
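One common way to organize a barrel shifter, assumed here since the notes do not give a circuit, is as a cascade of stages in which stage k either passes its input through or shifts it by 2^k, selected by bit k of the shift amount. A 32-bit shifter then needs only 5 stages, which is where the logarithm comes from. A sketch for left shifts (the function name is hypothetical):

```python
def barrel_shift_left(x, amount, width=32):
    """Shift x left by amount (0 <= amount < width) using log2(width) stages."""
    mask = (1 << width) - 1
    stages = (width - 1).bit_length()        # 5 stages for a 32-bit word
    for k in range(stages):
        if (amount >> k) & 1:                # mux select is bit k of the amount
            x = (x << (1 << k)) & mask       # this stage shifts by 2**k
    return x


print(hex(barrel_shift_left(0xFFFFFFFF, 4)))
```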
Why do we need state?
Assume you have a physical OR gate. Assume the two inputs are both zero for an hour. At time t one input becomes 1; the other one never changes. The output will oscillate for a while before settling on 1. We want to be sure we don't look at the answer before it's ready.
This will require us to establish a clocking methodology, i.e., an approach to determining when data is valid.
First, however, we need some ...
Nano means one billionth, i.e., 10^{-9}.
Micro means one millionth, i.e., 10^{-6}.
Milli means one thousandth, i.e., 10^{-3}.
Kilo means one thousand, i.e., 10^{3}.
Mega means one million, i.e., 10^{6}.
Giga means one billion, i.e., 10^{9}.
Consider the idealized waveform shown on the right. The horizontal axis is time and the vertical axis is (say) voltage.
If the waveform repeats itself indefinitely (as the one on the right does), it is called periodic.
The time required for one complete cycle, i.e., the time between two equivalent points in consecutive cycles, is called the period.
Since it is a time, period is measured in units such as seconds, days, nanoseconds, etc.
The rate at which cycles occur is called the frequency.
Since it is a rate, frequency is measured in units such as cycles per hour, cycles per second, kilocycles per micro-week, etc.
The modern (and less informative) name for cycles per second is Hertz, which is abbreviated Hz.
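Since period is a time and frequency is a rate, each is the reciprocal of the other. A tiny sketch with illustrative numbers of my own choosing (the function names are hypothetical):

```python
def period_of(frequency_hz):
    """Period in seconds of a waveform with the given frequency in Hz."""
    return 1.0 / frequency_hz


def frequency_of(period_s):
    """Frequency in Hz of a waveform with the given period in seconds."""
    return 1.0 / period_s


# A 2 GHz clock (2 * 10**9 cycles per second) has a period of half a nanosecond.
print(period_of(2e9))
# A waveform that repeats every 4 milliseconds has a frequency of 250 Hz.
print(frequency_of(4e-3))
```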
Prediction: At least one student will confuse frequency and period on the midterm or final and hence mess up a gift question. Please, prove me wrong!
Make absolutely sure you understand why
Look at the diagram above and note the rising edge and the falling edge.
We will use edge-triggered logic, which means that state changes (i.e., writes to memory) occur at a clock edge.
Each of our designs will either
The edge on which changes occur (either the rising or falling edge) is called the active edge. For us, choosing which edge is active is basically a coin flip.
In real designs the choice is governed by the technology used. Some designs permit both edges to be active. Examples include DDR (double data rate) memory and double-pumped register files. This permits a portion of the design to run at effectively twice the speed since state changes occur twice as often.
Now we are going to add state elements (memory) to the combinational circuits we have been using previously.
Remember that a combinational/combinatorial circuit has its outputs determined solely by its inputs, i.e. combinatorial circuits do not contain state.
State elements include state (naturally).
Combinatorial circuits can NOT contain loops. For example, imagine an inverter with its output connected to its input. So if the input is false, the output becomes true. But this output is wired to the input, which is now true. Thus the output becomes false, which is the new input. So the output becomes true ... .
Sequential circuits, however, can and often do contain loops.
We will use only edge-triggered, clocked memory in our designs as they are the simplest memory to understand. So our current goal is to construct a 1-bit, edge-triggered, clocked memory cell. However, to get there we will proceed in three stages.
The only unclocked memory we will use is a so called S-R latch (S-R stands for Set-Reset).
When we define "latch" below to be a level-sensitive, clocked memory, we will see that the S-R latch is not really a latch. The circuit for an S-R latch, which is constructed from cross-coupled NOR gates, is on the right.
Since it has two single-bit inputs, there are four possible input combinations.
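The behavior of the cross-coupled NOR gates can be explored with a small simulation. The diagram is not reproduced here, so the sketch assumes the usual wiring: Q is the NOR of R and Q', and Q' is the NOR of S and Q. It re-evaluates both gates until the outputs stop changing; the function names are mine.

```python
def nor(x, y):
    return 1 - (x | y)


def sr_latch(s, r, q, qbar):
    """Settle the cross-coupled NOR gates, starting from outputs (q, qbar).

    Returns the stable (Q, Q') pair for the cases tried below.
    """
    for _ in range(4):                   # a few rounds are enough to settle
        new_q = nor(r, qbar)
        new_qbar = nor(s, q)
        if (new_q, new_qbar) == (q, qbar):
            break
        q, qbar = new_q, new_qbar
    return q, qbar


state = (0, 1)                           # start with the latch holding 0
for s, r in [(1, 0), (0, 0), (0, 1), (0, 0)]:
    state = sr_latch(s, r, *state)
    print(f"S={s} R={r} -> Q={state[0]} Q'={state[1]}")
```

With S=R=0 the loop of gates simply keeps whatever value was last set or reset, which is what makes the structure a memory.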
For both flip-flops and latches the output equals the value stored in the structure. Both have an input and an output (and the complemented output) and a clock input as well. The clock determines when the internal value is set to the current input. For a latch, the output can change whenever the clock is asserted (level sensitive). For a flip-flop, changes occur only at the active edge.
Unfortunately the terminology used is not perfect: the S-R latch defined above is unclocked memory.
The D stands for data.
Note the following properties of the D latch circuit shown on the right.
stored. That is Q, the output of the latch, is D.
The lower abbreviated diagram is how a D-latch is normally drawn.
A D latch is sometimes called a transparent latch since, whenever the clock is high, the output equals the input (the input passes right through the latch).
We won't use D latches in our designs, except right now to construct our workhorse, the master-slave flip-flop, an edge-triggered memory cell.
Note the following points illustrated by the traces to the right. We assume the stored value was initially low.
This structure has been our goal. It is an edge-triggered, clocked memory. It is often referred to as a D-flop. Again the D stands for data.
The circuit for a D flop, shown on the right, has the following properties.
A D flop is sometimes called a master-slave flip-flop, with the left latch called the master and the right the slave.
Note that the substructures reuse the same letters as the main structure but have different meanings (similar to block structured languages in the Algol style).
The left D latch is set during the time the clock is asserted. Remember that the latch is transparent, i.e. it follows its input when its clock is asserted. But the right latch is ignoring its input at this time. When the clock falls, the 2nd latch pays attention and the first latch keeps producing whatever D was at fall-time.
Actually D must remain constant for some time around the active edge.
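A behavioral sketch may help contrast the transparent latch with the master-slave flop. Time is modeled as a sequence of samples of the clock and of D, which is a simplification of my own (real signals are continuous and setup and hold times are ignored). The class names and the sample traces are hypothetical.

```python
class DLatch:
    """Level-sensitive: the output follows D while the clock is asserted."""

    def __init__(self):
        self.q = 0                       # stored value, initially low

    def step(self, clock, d):
        if clock:
            self.q = d
        return self.q


class DFlop:
    """Two D latches; the second one sees the inverted clock."""

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def step(self, clock, d):
        m = self.master.step(clock, d)
        return self.slave.step(1 - clock, m)


clock = [0, 1, 1, 0, 0, 1, 1, 0, 0]
data = [0, 1, 1, 1, 1, 1, 0, 0, 0]

latch, flop = DLatch(), DFlop()
latch_out = [latch.step(c, d) for c, d in zip(clock, data)]
flop_out = [flop.step(c, d) for c, d in zip(clock, data)]
print("clock", clock)
print("D    ", data)
print("latch", latch_out)
print("flop ", flop_out)
```

In this trace the latch output changes in the middle of a high clock phase, while the flop output changes only at the samples where the clock has just fallen.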
Start Lecture #8
Remark: Forgot to mention shifters last time; do it now.
Homework: Move the inverter to the other latch. What has changed?
The picture on the right is for a master-slave flip-flop. As before we are assuming the output is initially low.
Note how much less wiggly the output is in this picture than before with the transparent latch.
The next picture shows the setup and hold times discussed above.
Homework: Which code better describes a flip-flop and which a latch?
repeat {
   while (clock is low) {do nothing}
   Q=D
   while (clock is high) {do nothing}
} until forever
or
repeat {
   while (clock is high) {Q=D}
} until forever
A register is basically just an array of D flip-flops. For example a 32-bit register is an array of 32 D flops.
What if we don't want to write the register during a particular cycle (i.e. at the active edge of a particular cycle)?
As shown in the diagram on the right, we introduce another input, the write line, which is used to "gate the clock".
If the write line is high forever, the clock input to the register is passed right along to the D flop and hence the input to the register is stored in the D flop when the active edge occurs, which for us is the falling edge. That is, the register is written every cycle.
If the write line is low forever, the clock to the D flop is always low so has no edges. Thus the register is never written.
Now that we understand what happens if the write line is constant, either always high or always low, we must ask what happens if we change the write line from high to low or vice versa.
We do not change the write line when the (external) clock is high since that would cause extra edges to be passed to the D-flop. Instead we change the write line only when the clock is low.
This, however, is not so good!
Essentially the same effect can be achieved in another manner as well. Instead of having the register negate the write line, the register specifies that the input be a "don't write" line, i.e. W'. Such a register, which is depicted on the right, is often called an "active low" register since it is active when its W' input is low.
To implement a multibit register, just use multiple D flops.
A register file is just a set of registers, each one numbered.
To support reading a register we just need a (big) mux from the register file to select the correct register.
bigmux means an n-input, b-bit mux, where
To support writing a register we use a decoder on the register number to determine which register to write. Note that errors in the book's figure were fixed.
Note also that I show the clock explicitly.
Recall that the inputs to a register are W, the write line, D the data to write (if the write line is asserted), and the clock. We should perform a write to register r this cycle if the write line is asserted and the register number specified is r. The idea is to gate the write line with the output of the decoder.
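The read and write paths can be sketched behaviorally. In the sketch below the read is just an indexed selection (the job of the big mux), and the write gates the write line with a one-hot decoder output so that exactly one register is updated at the active edge. The class and method names are my own, and 32 registers of 32 bits are assumed.

```python
class RegisterFile:
    def __init__(self, count=32):
        self.regs = [0] * count

    def read(self, number):
        """The big mux: select the register with the given number."""
        return self.regs[number]

    def clock_edge(self, write, number, data):
        """What happens at an active edge, given the write line, register number, and data."""
        # The decoder produces a one-hot list: 1 only for the selected register.
        decoder = [int(i == number) for i in range(len(self.regs))]
        for i, selected in enumerate(decoder):
            if write & selected:         # write line gated by the decoder output
                self.regs[i] = data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write=1, number=5, data=42)
rf.clock_edge(write=0, number=5, data=7)     # write line low: nothing changes
print(rf.read(5), rf.read(4))
```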
Homework: C.36
fastermemory) an entire row.
logicare made from similar technologies but DRAM technology is quite different.
Note: There are other kinds of flip-flops T, J-K. Also one could learn about excitation tables for each. We will not cover this material (P&H doesn't either). If interested, see Mano.
More precisely, we are learning about deterministic finite state machines or deterministic finite automata (DFA). The alternative, nondeterministic finite automata (NFA), are somewhat strange and, although seemingly nonrealistic and of theoretical value only, form, together with DFAs, what I call the "secret weapon" used in the first stage of a compiler (the lexical analyzer).
We will do a different example from the one in the book (counters instead of traffic lights). The ideas are the same and the two generic pictures (below) apply to both examples.
A counter counts (naturally).
incrementline is asserted.
shouldn't happen. We will say that the reset takes precedence.
Current A | I | R | Next A (D_{A}) |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
0 | 1 | 0 | 1 |
1 | 1 | 0 | 0 |
x | x | 1 | 0 |
How do we determine the combinatorial circuit?
currentA and added the label Next A to the D_{A} column.
Start Lecture #9
No new ideas are needed; just more work.
Current H | Current L | I | R | Next H (D_{H}) | Next L (D_{L}) |
---|---|---|---|---|---|
x | x | x | 1 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 1 |
To determine the combinatorial circuit we could proceed as before. The beginning of the truth table is on the right.
This would work (do a few more rows on the board), but we can instead think about how a counter works and see that.
D_{L} = R'(L ⊕ I)
D_{H} = R'(H ⊕ LI)
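The two formulas are all that is needed to simulate the counter: at each active edge the registers H and L are loaded with D_{H} and D_{L}. A minimal sketch, with an input sequence of my own choosing and a hypothetical function name:

```python
def next_state(h, l, i, r):
    """Combinatorial circuit of the 2-bit counter: returns (D_H, D_L)."""
    not_r = 1 - r
    d_l = not_r & (l ^ i)
    d_h = not_r & (h ^ (l & i))
    return d_h, d_l


state = (0, 0)                            # (H, L), initially zero
trace = []
# Each pair is (I, R) for one clock cycle.
for i, r in [(1, 0), (1, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 1)]:
    state = next_state(*state, i, r)
    trace.append(state)
print(trace)
```

The trace counts 1, 2, stays at 2 while I is low, continues to 3, wraps to 0, counts to 1, and is then forced to 0 by the reset even though I is asserted.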
On the right is (a diagram depicting) the Logisim circuit for the 2-bit counter. The two 1-bit registers are on the right, the clock is near the middle and the combinatorial circuit is most of the left part. There are two 1-bit inputs, namely I and R.
If you want to play with this circuit the .circ file can be downloaded here.
Homework: C.39
The idea is, given a circuit diagram, write a program that behaves the way the circuit does. This means more than getting the same answer. The program is to work the way the circuit does.
For each logic box, you write a procedure with the following properties.
Remember that a full adder has three inputs and two outputs. Discuss FullAdder.c or perhaps FullAdder.java.
This implementation uses the full adder code above. Discuss FourBitAdder.c or perhaps FourBitAdder.java
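The files mentioned above are not reproduced in these notes, so here is a sketch of the same idea in Python rather than C or Java. Each logic box becomes a procedure whose parameters are the box's inputs and whose return values are its outputs, and the 4-bit adder calls the full adder four times, wired exactly as the boxes are wired. The names and the bit-list representation (index 0 is the lob) are my assumptions.

```python
def full_adder(a, b, cin):
    """Three 1-bit inputs, two 1-bit outputs: (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def four_bit_adder(a, b, carry_in):
    """a and b are lists of 4 bits (index 0 is the lob).

    Returns (list of 4 sum bits, carry out).
    """
    # One call per full-adder box; each carry out is wired to the next carry in.
    s0, c1 = full_adder(a[0], b[0], carry_in)
    s1, c2 = full_adder(a[1], b[1], c1)
    s2, c3 = full_adder(a[2], b[2], c2)
    s3, c4 = full_adder(a[3], b[3], c3)
    return [s0, s1, s2, s3], c4


# 5 + 3 = 8: lob-first bit lists [1,0,1,0] and [1,1,0,0].
print(four_bit_adder([1, 0, 1, 0], [1, 1, 0, 0], 0))
```

Writing four explicit calls instead of a loop is deliberate: the program then has the same shape as the circuit.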
Read.
Read.
Read.
Read.
Homework: 1.1, a multi-part matching question.
Will be done later.
Some of the benchmarking material will be done later.
Some of the benchmarking material will be done later.
Read this short section.
Homework: Read section 2.1.
Homework: Read section 2.2. For this course you do not have to worry much about how a program in C is translated into assembly language. However, it is an important concept (at least at the high level). It is touched upon in 201 (sometimes) and I believe will be covered more in future versions of that course.
Remark: The rest of chapter 2 contains much material on this same topic, which we are de-emphasizing. The book is quite well written and you may well find that material interesting. Were this a 2-semester course, we would certainly cover it.
Many of the MIPS instructions operate of values stored in registers. The MIPS architecture we shall study has thirty-two 32-bit registers. There is another MIPS architecture that has thirty-two 64-bit registers.
A very serious task for a compiler is to make efficient use of this scarce resource; however, we shall not discuss that problem.
The text, which emphasizes the correspondence between a C or Java program and assembly language much more than we shall, is very careful in distinguishing between those registers used for C-program variables, those used for temporary values, those used when one function calls another, and those used for other purposes.
The actual hardware makes no such distinction: In machine instructions a register operand is simply a 5-bit number (from 0 to 31). The distinction between register types is just a convention used by software.
I shall try to follow the conventions, but remember that from a hardware perspective, a register is named simply by a 5-bit value.
Of course computers can contain many more than thirty-two 32-bit values. Indeed, today even a modest laptop has a central memory at least ten million times larger.
In MIPS, arithmetic is performed only on values located in registers. Thus, in addition to arithmetic instructions, MIPS (and essentially all other computers) needs data transfer instructions to fetch values from memory to registers and to update memory with newly calculated values.
Often one operand in an arithmetic instruction is a constant, not a variable. MIPS supplies corresponding immediate instructions. For example, add naturally adds the contents of two registers, placing the result in a third; whereas addi (add immediate) adds a constant (contained in the instruction) to one register, placing the result in a second register.
MIPS uses 2s complement representation for signed numbers (as do all modern processors).
To form the 2s complement (of 0000 1111 0000 1010 0000 0000 1111 1100)
MIPS (like most computers) can also process 32-bit values as unsigned numbers, that is, the hob is not a sign bit. It is instead the bit in the 2^{31} place.
As mentioned previously, addition/subtraction on signed numbers does not treat the sign bit specially so unsigned and signed addition/subtraction give the same answer if the operands are the same bit strings.
The differences are in overflow and comparisons.
You could easily ask what does this funny notation have to do with negative numbers. Let me make a few comments.
Homework: 2.7.1 and 2.7.2.
Homework: 2.7.4 and 2.7.6 (but change hexadecimal to binary).
Converting base 2 values to/from base 10 is work, but converting base 2 to/from base 16 is easy. You simply group the base 2 number into groups of 4 or expand the base 16 number from right to left. The one question is how do you write the 6 digits past 9 and the answer is A, B, C, D, E, and F.
Do some examples on the board.
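Here is a small sketch of the grouping idea in Python. The function names are mine; the only real content is the groups-of-4 rule and the digit string 0-9, A-F.

```python
HEX_DIGITS = "0123456789ABCDEF"

def bin_to_hex(bits):
    """Group a base 2 string into 4-bit groups, starting from the right."""
    bits = bits.replace(" ", "")
    bits = bits.zfill((len(bits) + 3) // 4 * 4)   # pad on the left to a multiple of 4
    return "".join(HEX_DIGITS[int(bits[k:k + 4], 2)]
                   for k in range(0, len(bits), 4))

def hex_to_bin(hex_str):
    """Expand each base 16 digit into exactly 4 base 2 digits."""
    return " ".join(format(HEX_DIGITS.index(d), "04b") for d in hex_str.upper())

print(bin_to_hex("0000 1111 0000 1010 0000 0000 1111 1100"))
print(hex_to_bin("0F0A00FC"))
print(bin_to_hex("101101"))
```

Padding on the left is what makes the grouping go from right to left when the number of bits is not a multiple of 4.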
We just learned how to build this structure. We need 2 read ports and 1 write port since MIPS instructions can read up to 2 registers and write up to 1 register.
Recall that, by software convention, registers are given two character names. The first character is a letter and indicates what the software conventionally uses the register for. For example $t2 is the third ($t0 is the first) register used for temporary values.
As stated above, MIPS has thirty-two 32-bit registers. Some machines, notably the 32-bit Intel (PC) architecture, have a number of register classes, where only certain registers can be used for certain tasks. However, MIPS treats registers 1-31 the same; only register 0 is special.
The fields of a MIPS instruction are fairly consistent. There are just a few classes of instruction formats and within each class the various bit positions of the instructions are used in the same way.
These instructions have three operands, each of which is a register number. All R-type instructions have the following fixed format.
Name of field | op | rs | rt | rd | shamt | funct |
---|---|---|---|---|---|---|
Number of bits | 6 | 5 | 5 | 5 | 5 | 6 |
These fields are used consistently in R-type instructions.
Examples: add/sub $t1,$t2,$t3 (i.e., add/sub $9,$10,$11).
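To see how the six fields pack into one 32-bit word, here is a sketch in Python. The helper names encode_r and decode_r are mine. Two facts used here are not stated in this part of the notes: the assembly syntax is add rd,rs,rt (so the destination is written first but stored fourth), and the funct values are 32 for add and 34 for sub, both with op=0.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields (6,5,5,5,5,6 bits) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def decode_r(word):
    """Split a 32-bit word back into (op, rs, rt, rd, shamt, funct)."""
    return ((word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F,
            (word >> 11) & 0x1F, (word >> 6) & 0x1F, word & 0x3F)

# add $9,$10,$11 : rd=9, rs=10, rt=11
add_word = encode_r(op=0, rs=10, rt=11, rd=9, shamt=0, funct=32)
# sub $9,$10,$11 differs only in the funct field
sub_word = encode_r(op=0, rs=10, rt=11, rd=9, shamt=0, funct=34)
print(format(add_word, "08X"), format(sub_word, "08X"))
print(decode_r(add_word))
```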
The I is for immediate.
Name of field | op | rs | rt | immediate operand |
---|---|---|---|---|
Number of bits | 6 | 5 | 5 | 16 |
Examples: lw/sw $t1,1000($s3) (i.e., lw/sw $9,1000($19))
RISC-like properties of the MIPS architecture.
Example: addi $t1,$t2,100 (i.e., addi $9,$10,100)
Homework: 2.6.5 and 2.6.6.
Homework: 2.10.1, 2.10.2, and 2.10.5. (officially assigned next class)
These instructions deal with the bits within the word rather than treating the word as a unit. Such instructions are often called logical, presumably because they are often used in the logic of programs (conditionals and loops).
Examples: sll/srl $t7,$t2,7 (i.e., sll/srl $15,$10,7)
Examples: and/or/nor $s2,$s1,$s0 (i.e., and/or/nor $18,$17,$16)
Examples: andi/ori $s0,$t0,31 (i.e., andi/ori $16,$8,31)
Homework: 2.13.1 and 2.14.1 (officially assigned next class)
Examples: beq/bne $t1,$t2,123 (i.e., beq/bne $9,$10,123)
if reg-9==reg-10 then go to the 124th instruction after this one.
if reg-9!=reg-10 then go to the 124th instruction after this one.
Start Lecture #10
This is an old friend. Recall the extra effort we put into the alu a few weeks ago to implement this important MIPS instruction.
Example: slt $t1,$t2,$t3 (i.e., slt $9,$10,$11)
Example: slti $s1,$s2,20000 (i.e., slti $17,$18,20000)
Recall that comparison is different for unsigned and signed numbers. For example: signed values with 1 in the hob are less than those with 0 in the hob (the first value is negative); but, if the values are unsigned, a 1 in the hob is greater than a 0 in the hob. For this reason, MIPS has in addition unsigned versions of slt and slti, that use the unsigned definition of less than.
The instructions are named sltu and sltiu as you would expect. Our MIPS subset implementation will not include them.
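The difference between the signed and unsigned comparisons is easy to see in a few lines of Python. This is only a sketch; the Python functions slt and sltu model the comparison on two 32-bit patterns and are not meant as a full instruction simulator.

```python
def to_signed(word):
    """Interpret a 32-bit pattern as a 2s complement number."""
    return word - (1 << 32) if word & 0x80000000 else word

def slt(a, b):
    """Signed set-less-than on two 32-bit patterns."""
    return 1 if to_signed(a) < to_signed(b) else 0

def sltu(a, b):
    """Unsigned set-less-than on the same two patterns."""
    return 1 if a < b else 0

all_ones = 0xFFFFFFFF        # -1 if signed, about 4 billion if unsigned
print(slt(all_ones, 1), sltu(all_ones, 1))
```

The same two bit strings give opposite answers because the hob is a sign bit in one case and the largest data bit in the other.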
Example: blt $t5,$t7,123 (i.e., blt $13,$15,123)
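slt $t1,$t5,$t7
bne $t1,$zero,123
Example: ble $t5,$t7,L (L a label to be calculated by the assembler.)
The hardware has no ble $t5,$t7,L instruction.
Nor does it have sle $t1,$t5,$t7, set $t1 if $t5 less or equal $t7.
Since $t5≤$t7 is the negation of $t7<$t5, use
slt $t1,$t7,$t5
beq $t1,$zero,L
Example: bgt $t5,$t7,L
The hardware has no bgt $t5,$t7,L instruction.
Nor does it have sgt $t1,$t5,$t7, set $t1 if $t5 greater than $t7.
Since $t5>$t7 is the same as $t7<$t5, use
slt $t1,$t7,$t5
bne $t1,$zero,L
Example: bge $t5,$t7,L
The hardware has no bge $t5,$t7,L instruction.
Nor does it have sge $t1,$t5,$t7, set $t1 if $t5 greater or equal $t7.
Since $t5≥$t7 is the negation of $t5<$t7, use
slt $t1,$t5,$t7
beq $t1,$zero,L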
Note:
Please do not make the mistake of thinking that
slt $t1,$t5,$t7
beq $t1,$zero,L
is the same as
slt $t1,$t7,$t5
bne $t1,$zero,L
The first pair branches when $t5≥$t7; the second branches only when $t5>$t7.
It is not true that the negation of X<Y is Y<X.
End of Note
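If you are unsure about any of the four slt-plus-branch pairs, you can check them by brute force. The sketch below models slt on ordinary Python integers (standing in for signed register values); the xxx_taken function names are mine, and each comment gives the two-instruction sequence being modelled.

```python
def slt(x, y):
    """Set-less-than on signed values."""
    return 1 if x < y else 0

def blt_taken(a, b): return slt(a, b) != 0   # slt t,a,b ; bne t,$zero,L
def ble_taken(a, b): return slt(b, a) == 0   # slt t,b,a ; beq t,$zero,L
def bgt_taken(a, b): return slt(b, a) != 0   # slt t,b,a ; bne t,$zero,L
def bge_taken(a, b): return slt(a, b) == 0   # slt t,a,b ; beq t,$zero,L

# Compare each sequence with the comparison it is supposed to implement.
values = range(-3, 4)
all_agree = all(blt_taken(a, b) == (a < b) and ble_taken(a, b) == (a <= b) and
                bgt_taken(a, b) == (a > b) and bge_taken(a, b) == (a >= b)
                for a in values for b in values)
print(all_agree)

# The two sequences in the Note differ exactly when the operands are equal.
print(bge_taken(5, 5), bgt_taken(5, 5))
```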
These have a different format, but again the opcode is the first 6 bits.
Name of field | op | address |
---|---|---|
Number of bits | 6 | 26 |
The effect is to jump to the specified (immediate) address. Note that there are no registers specified in this instruction and that the target address is not relative to (i.e. added to) the address of the current instruction as was done with branches.
Example: j 10000
But MIPS is a 32-bit machine with 32-bit address and we have specified only 26 bits. What about the other 6 bits?
In detail the address of the next instruction is calculated via a multi-step process.
Homework: 2.16.1.
Example: jal 40000
Important example: jr $ra (i.e., jr $31)
How can we put a 32-bit value (say 2 billion) into register 6?
Example: lui $t4,123 (i.e., lui $12,123)
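One common way to build a full 32-bit constant is lui for the upper 16 bits followed by ori for the lower 16 bits. The sketch below models just the effect of those two instructions on a register value. I use ori rather than addi for the lower half on the assumption (true of MIPS, but not stated in these notes) that ori zero-extends its immediate, so the upper half is left undisturbed.

```python
def lui(imm16):
    """Load upper immediate: the constant goes in the HO 16 bits, LO 16 bits are 0."""
    return (imm16 & 0xFFFF) << 16

def ori(reg, imm16):
    """OR immediate, with the 16-bit constant zero-extended."""
    return reg | (imm16 & 0xFFFF)

value = 2_000_000_000                    # "say 2 billion"
upper, lower = value >> 16, value & 0xFFFF
reg = ori(lui(upper), lower)             # lui $6,upper ; ori $6,$6,lower
print(upper, lower, reg)
print(lui(123))                          # effect of lui $t4,123
```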
Homework: Read 3.1-3.4
I have nothing to add.
Recall that MIPS uses 2s complement (just like the Intel chips).
To form the 2s complement (of 0000 1111 0000 1010 0000 0000 1111 1100)
To add two (signed) numbers just add them. That is, don't treat the sign bit specially.
To subtract A-B, just take the 2s complement of B (forming -B) and add.
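These rules can be sketched in a few lines of Python. I use the standard recipe for the 2s complement (flip every bit, then add 1) and keep everything to 32 bits with a mask; the function names are mine.

```python
MASK = 0xFFFFFFFF                      # keep results to 32 bits

def twos_complement(word):
    """Flip every bit, then add 1 (discarding any carry out)."""
    return ((word ^ MASK) + 1) & MASK

def add32(a, b):
    """Plain binary addition; the sign bit gets no special treatment."""
    return (a + b) & MASK

def sub32(a, b):
    """A-B is A plus the 2s complement of B."""
    return add32(a, twos_complement(b))

def to_signed(word):
    return word - (1 << 32) if word & 0x80000000 else word

x = 0x0F0A00FC      # 0000 1111 0000 1010 0000 0000 1111 1100
print(format(twos_complement(x), "08X"))
print(to_signed(sub32(5, 7)))
```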
An overflow occurs when the result of an operation cannot be represented with the available hardware. For MIPS this means when the result does not fit in a 32-bit word.
Recall that the operands each have 31 data bits and a sign bit; thus the result would definitely fit in 33 bits (32 data plus 1 sign).
  11111111111111111111111111111111   (32 ones is -1)
+ 11111111111111111111111111111111
----------------------------------
 111111111111111111111111111111110   Discard the carry out
  11111111111111111111111111111110   this is -2 as desired
As shown on the right the hardware simply discards the carry out of the high order (i.e., sign) bit, which might seem hopelessly naive, but is normally correct.
The bottom 31 bits are always correct.
Overflow occurs when the 32nd (sign) bit of the result is set to a value that is not the correct sign of the true result.
An overflow occurs in the following cases
Operation | Operand A | Operand B | Result |
---|---|---|---|
A+B | ≥ 0 | ≥ 0 | < 0 |
A+B | < 0 | < 0 | ≥ 0 |
A-B | ≥ 0 | < 0 | < 0 |
A-B | < 0 | ≥ 0 | ≥ 0 |
These conditions are the same as
Carry-In to sign position != Carry-Out from sign position.
Homework: Prove this last statement (for fun only, do not hand in).
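A brute-force check is not a proof, but it is reassuring. The sketch below uses 4-bit words (my choice, so that every pair of operands can be tried) and covers addition only; it compares "the result has the wrong value" with "carry-in to the sign position differs from carry-out of the sign position".

```python
WIDTH = 4                       # small enough to try every operand pair
MASK = (1 << WIDTH) - 1

def to_signed(word):
    """Interpret a WIDTH-bit pattern as a 2s complement number."""
    return word - (1 << WIDTH) if word >> (WIDTH - 1) else word

def add_with_carries(a, b):
    """Return (result, carry into the sign position, carry out of it)."""
    low = MASK >> 1                                    # the bits below the sign bit
    carry_in = ((a & low) + (b & low)) >> (WIDTH - 1)
    carry_out = (a + b) >> WIDTH
    return (a + b) & MASK, carry_in, carry_out

def overflows(a, b):
    """True if the WIDTH-bit result is not the true signed sum."""
    result, _, _ = add_with_carries(a, b)
    return to_signed(result) != to_signed(a) + to_signed(b)

mismatches = []
for a in range(1 << WIDTH):
    for b in range(1 << WIDTH):
        _, cin, cout = add_with_carries(a, b)
        if overflows(a, b) != (cin != cout):
            mismatches.append((a, b))
print(mismatches)   # an empty list means the two tests agree on every pair
```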
Since unsigned numbers are often used for address arithmetic where overflows should be ignored, these three instructions perform addition and subtraction the same way as do add and sub, but do not signal overflow.
This is a sequential circuit.
It is just a string of D-flops; the output of one is input of the next.
We want more.
Parallel output is just wires.
Shifter has 4 modes (nop, left-shift, right-shift, load) so
We could modify our registers to be shifters (bigger mux),
but ...
Our shifters are slow for big shifts; barrel shifters are faster and kept separate from the processor registers.
Homework: A 4-bit shift register initially contains 1101. It is shifted six times to the right with the serial input being 101101. What is the contents of the register after each shift.
Homework: Same register, same initial condition. For the first 6 cycles the opcodes are left, left, right, nop, left, right and the serial input is 101101. The next cycle the register is loaded (in parallel) with 1011. The final 6 cycles are the same as the first 6. What is the contents of the register after each cycle?
Of course we can do this with two levels of logic since multiplication is just a function of its inputs.
But just as with addition, we would have a very big circuit and large fan in. Instead we use a sequential circuit that mimics the algorithm we all learned in grade school.
Recall how to do multiplication.
Our first solution multiplies in essentially the same way.
We are doing binary arithmetic so each digit of the multiplier is 1 or zero. Hence multiplying the multiplicand by a digit of the multiplier results in either the multiplicand or zero.
Use an "if appropriate bit of multiplier is 1" test.
To get the appropriate bit:
Putting in the correct column means putting it one column further left than the last time. This is done by shifting the multiplicand left one bit each time (even if the multiplier bit is zero).
Instead of adding partial products at end, we keep a running sum.
This results in the following algorithm
product ← 0
for i = 0 to 31
    if LOB of multiplier = 1
        product = product + multiplicand
    shift multiplicand left 1 bit
    shift multiplier right 1 bit
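The algorithm above translates almost line for line into Python. This is a sketch for unsigned operands; the function name and the bits parameter (so that the 4x4→8 board example can be run as well as 32x32→64) are my additions.

```python
def multiply_v1(multiplicand, multiplier, bits=32):
    """First attempt: double-width multiplicand register that shifts left."""
    wide_mask = (1 << (2 * bits)) - 1      # product and multiplicand are 2*bits wide
    product = 0
    for _ in range(bits):
        if multiplier & 1:                 # LOB of multiplier
            product = (product + multiplicand) & wide_mask
        multiplicand = (multiplicand << 1) & wide_mask
        multiplier >>= 1
    return product

print(multiply_v1(0b1100, 0b1101, bits=4))   # the 4x4->8 example, 12 x 13
```

Note that the add uses the multiplicand before it is shifted and the test uses the LOB before the multiplier is shifted.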
Start Lecture #11
What about the control?
This works!
It clearly works if we test the LOB and write the product on one cycle and shift the next cycle (so two cycles per bit). With some more care you can do it all in one cycle, you just need to be sure you add the multiplicand before it is shifted and that you get the LOB before the multiplier is shifted.
The real weakness of the above solution, when compared to the improved versions to come, is that the first attempt is wasteful of resources and hence is:
All these are bad.
Do on the board 4-bit multiplication (8-bit registers) 1100 x 1101. Since the result has (up to) 8 bits, this is often called a 4x4→8 multiply.
The diagrams are for a 32x32→64 multiplier.
The product register must be 64 bits since the product can contain 64 bits.
Why is multiplicand register 64 bits?
Ans: So that we can shift it left, i.e., for our convenience.
By this I mean it is not required by the problem specification, but only by the solution method chosen.
Why is ALU 64-bits?
POOF!! ... as the smoke clears we see an idea.
We can solve both problems at once.
Don't shift the multiplicand left
Instead shift the product right!
Add the high-order (HO) 32-bits of product register to the multiplicand and place the result back into HO 32-bits
product ← 0
for i = 0 to 31
    if LOB of multiplier = 1
        (serial_in, product[32-63]) ← product[32-63] + multiplicand
    shift product right 1 bit
    shift multiplier right 1 bit
What about control?
Redo the same example on the board.
(gate bumming, like code bumming of the 60s)
There is still a waste of registers, i.e. they are not fully utilized.
Timeshare the LO half of the product register.
The algorithm changes to:
product[0-31] ← multiplier
for i = 0 to 31
    if LOB of product = 1
        (serial_in, product[32-63]) ← product[32-63] + multiplicand
    shift product right 1 bit
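Here is a Python sketch of this last version, again for unsigned operands. The function name and the bits parameter are mine. I assume serial_in is 0 on cycles where no add is done, and that the carry out of the add is what gets shifted into the HO bit of the product.

```python
def multiply_v3(multiplicand, multiplier, bits=32):
    """Final version: the multiplier lives in the LO half of the product register."""
    half = (1 << bits) - 1
    product = multiplier & half                 # HO half starts at 0
    for _ in range(bits):
        serial_in = 0
        if product & 1:                         # LOB of product is the current multiplier bit
            hi = (product >> bits) + multiplicand       # may need bits+1 bits
            serial_in = hi >> bits                      # carry out of the adder
            product = ((hi & half) << bits) | (product & half)
        # shift right 1 bit, with the carry entering at the HO end
        product = (serial_in << (2 * bits - 1)) | (product >> 1)
    return product

print(multiply_v3(0b1100, 0b1101, bits=4))      # same 12 x 13 example
```

Only a bits-wide adder and one double-width register are needed; the multiplier bits are consumed exactly as fast as the product grows into their space.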
Control again boring.
Redo the same example on the board.
The above was for unsigned 32-bit multiplication. What about signed multiplication?
There are (asymptotically) faster multipliers, but we are not covering them.
Read for pleasure.
Read for pleasure.
Read for pleasure (located on CD).
Remark: End of material on midterm. I posted a practice midterm. Today is lecture 11. Exam is lecture 13, one week from today. Remind me anytime saturday, via email to the list, to post solutions. Advice: do NOT look at the answers until you have done the questions.
Homework: Start Reading Chapter 4.
We are going to build a basic MIPS processor.
Figure 4.1 redrawn below shows the main idea.
Note that the diagram shows the instruction including three register numbers, an immediate value to be added to a register, and an immediate value to be added to the PC.
No single instruction has all those components, but our datapath must include pathways for all possibilities. Eventually, we will add muxes to choose which possibilities are relevant for the given instruction.
We shall see how we arrange for only certain datapaths to be used for each instruction type.
Why are we doing arithmetic on the program counter?
Done in appendix C.
Let's begin doing the pieces in more detail.
We draw buses in magenta (mostly 32 bits) and control lines in green.
We are ignoring branches and jumps for now.
The diagram on the right shows the main loop involving instruction fetch (i-fetch).
We did the register file in appendix C. Recall the following points made when discussing the appendix.
Read and Write in the diagram are adjectives, not verbs.
The 32-bit bus with the instruction is divided into three 5-bit buses, one for each register number (plus other wires not shown).
Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)?
Start Lecture #12
Remarks
For this week and next there are departmental meetings on Thursday that will cut into my office hours.
For this week I will have an extra office hour on Wednesday from 12:15-4:45. Architecture has priority, due to our midterm.
For next week I will have an extra office hour on Monday from 12:15-4:45. OS has priority, due to their midterm.
End of Remarks
In this chapter we are interested in building the processor, and not as interested in seeing how Java or C statements could be translated into machine instructions. As a result, I will refer to the registers in an instruction by their hardware names
The diagram on the right shows the structures used to implement load word and store word (lw and sw).
lw rt,disp(rs):
The effective address is formed by adding disp (displacement) to the contents of register rs.
sw rt,disp(rs):
We have a 32-bit adder and more importantly have a 32-bit addend coming from the register file. Hence we need to extend the 16-bit immediate constant to 32 bits. That is we must replicate the HOB of the 16-bit immediate constant to produce an additional 16 HOBs all equal to the sign bit of the 16-bit immediate constant. This is called sign extending the constant.
Note that the Sign Extend oval consists of just wires, no gates at all.
On the right is a small example, a 4→8 sign extender.
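In software, sign extension looks like the sketch below (the function name and the from_bits/to_bits parameters are mine). Of course the hardware does no computation at all; it just fans the HOB wire out to the new high-order positions.

```python
def sign_extend(value, from_bits=16, to_bits=32):
    """Replicate the HOB of a from_bits-wide value into the added high-order bits."""
    value &= (1 << from_bits) - 1
    hob = value >> (from_bits - 1)
    if hob:
        # set every bit from position from_bits up to to_bits-1
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

print(format(sign_extend(0b1010, 4, 8), "08b"))   # the 4->8 case, negative value
print(format(sign_extend(0b0101, 4, 8), "08b"))   # the 4->8 case, positive value
print(format(sign_extend(0xFFFE), "08X"))         # 16->32, the constant -2
```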
What about the control lines?
add for both lw and sw.
Homework: What would happen if the RegWrite line had a stuck-at-0 fault (was always deasserted)? What would happen if the RegWrite line had a stuck-at-1 fault (was always asserted)? What would happen if the MemWrite line had a stuck-at-0 fault? What would happen if the MemWrite line had a stuck-at-1 fault?
The diagram cheats a little for clarity.
mux deficient.
Compare two registers and branch if equal. The circuit on the right computes two values, the branch target address and a Boolean specifying whether or not to branch. Note the familiar pattern.
Remember that this diagram is just for beq. If the instruction is not beq then the Equal line from the ALU is not relevant. This will be fixed up later when we do all the control.
Recall the following from appendix C, where we built the ALU, and from chapter 2, where we discussed beq.
To check if two registers equal we subtract one from the other and test the result for zero (our ALU subtracts if ALU Operation says to, and our ALU always checks if the result is 0). In this case we are not interested in the result itself (so we don't wire that output to anything), just whether it is zero.
The target of the branch on equal instruction
beq rs,rt,disp
is the sum of
disp (treated as a signed number) left shifted 2 bits. The constant represents (32-bit) words and the address is specified in (8-bit) bytes. Since there are 4 bytes per word, we must multiply the word address by 4, which can be accomplished by a left shift of 2.
The shift left 2 is not a shifter. It simply moves wires and includes two zero wires. We need a 32-bit version of the 5 bit version shown on the right.
Since the immediate constant is signed it must be sign extended. As mentioned and drawn previously this is just replicating the HOB.
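Putting the sign extend, the shift left 2, and the adder together gives the branch target. The Python sketch below assumes the other term of the sum is PC+4 (the address of the instruction after the beq), which agrees with the earlier remark that a displacement of 123 reaches the 124th instruction after this one. The function names are mine.

```python
def sign_extend16(imm16):
    """16 -> 32 bit sign extension: replicate the HOB of the immediate."""
    imm16 &= 0xFFFF
    return imm16 | 0xFFFF0000 if imm16 & 0x8000 else imm16

def branch_target(pc, disp16):
    """Branch target = (PC + 4) + (sign-extended disp shifted left 2), mod 2^32."""
    offset = (sign_extend16(disp16) << 2) & 0xFFFFFFFF   # words -> bytes
    return (pc + 4 + offset) & 0xFFFFFFFF

print(hex(branch_target(0x1000, 123)))      # forward: 124 instructions past the beq
print(hex(branch_target(0x1008, 0xFFFE)))   # backward: disp = -2 words
```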
The top ALU labeled add is just an adder so does not need any control.
Homework: What would happen if the RegWrite line had a stuck-at-0 fault? What would happen if the RegWrite line had a stuck-at-1 fault?
We will first put the pieces together in a way that the resulting single datapath is able to execute all of the above instructions (several R-type instructions including set-less-than, load and store word, and branch on equal).
This will require several multiplexors and their associated select lines. After we have the pieces assembled into a unified whole, we will discuss how to calculate the select lines (and other control lines).
We are not now worried about speed.
We are assuming that the instruction memory and data memory are separate. So we are not permitting self modifying code. We are not showing how either memory is connected to the outside world (i.e., we are ignoring I/O).
We must use the same register file for all the instruction types since when a load changes a register, a subsequent R-type instruction must see the change and when an R-type instruction makes a change, the lw/sw must see it (for storing or calculating the effective address).
We could use separate ALUs for each type of instruction so that several instructions could proceed at the same time, but we are not worried about speed so we will use the same ALU for all instruction types. We do have a separate adder for incrementing the PC.
The problem is that some inputs can come from different sources depending on the instruction type. We need to add muxes as shown on the right.
Adding instruction fetch is quite easy.
We simply attach the instruction fetch block done above to the left of the previous diagram.
The result is shown on the right, where the new material is in blue.
Not shown yet is how the 32-bit instruction leaving the instruction memory is divided into the various 5-bit and 16-bit fields. This is not trivial since it is not true that the same bits always go to the same field.
We need to have an if stmt for updating the PC corresponding to the two possibilities: the branch is taken and the branch is not taken.
This conditional assignment to the PC should be compared to the conditional expressions found in C and Java, for example
y = (c==4) ? x : z;
As usual, in logic design the conditional assignment is done with a mux (and a control line, named PCSrc—what is the input to the PC register).
Homework: Extend the datapath just constructed to support the addi instruction as well as the instructions already supported.
Homework: Extend the datapath just constructed to support an R-type instruction that is a variation of the lw instruction where the memory address is computed by adding the contents of two registers (instead of using an immediate field) and the contents of that memory location is loaded into the third register. Continue to support all the instructions that the original datapath supported.
Homework: Can you support a hypothetical swap instruction that swaps the contents of two registers using the same building blocks that we have used to date?
Start Lecture #13
Start Lecture #14
There are basically two tasks remaining. We shall see they are related; the key is the instruction itself.
The diagram above has a 32-bit instruction magically dividing into various fields, three of 5-bits and one of 16 bits. Moreover we know that not all fields are relevant for all instructions: I-type instructions do not have a third register and R-type instructions do not have a 16-bit immediate field.
In addition, register rt is sometimes a read register and sometimes a write register.
We must figure which bits of the 32-bit instructions should go to each of the various fields in all possible circumstances.
We have ignored the control signals. Each of our muxes has a 1-bit control line that appears to be created out of thin air. We need to determine the values of each of these lines for all cases. Similarly, our ALU takes a 4-bit control line, but we have not determined how to calculate those four bits.
The diagram below shows (in blue as usual) the additions needed to divide the instruction. One cost of this solution is yet another mux, with yet another to-be-calculated 1-bit control line (having yet another slightly cryptic name RegDst, meaning this line determines whether Register rt or rd should be the Destination register).
Also added is an unspecified logic block ALU Control (abbreviated in the diagram as ALU Cntl to save space), with an unspecified 2-bit control line ALUOp as input. The new control lines will be determined in the next section entitled The Control for the Datapath, and the new block will calculate the 4-bit ALU Operation from the new control lines and the funct bits of the instruction.
We write I:n-m to represent instruction bits n through m (inclusive). For example I:15-0 represents the low order 16 bits of the instruction, which we recall is the immediate field in an I-type instruction.
Bits I:31-26, the opcode, have not been used up to this point. We shall see that the opcode, will play a prominent role when we determine the control lines.
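Function | Ainvert | Bnegate | OP |
---|---|---|---|
AND | 0 | 0 | 00 |
OR | 0 | 0 | 01 |
Add | 0 | 0 | 10 |
Sub | 0 | 1 | 10 |
Set-LT | 0 | 1 | 11 |
NOR | 1 | 1 | 00 |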
Remark: Much of the material in this section is found in Appendix D section 2.
Now that we have added the one missing mux and shown how the instruction bits are divided, two related tasks remain.
Homework:
What happens if we use 0 1 00 for the four ALU control lines?
What if we use 0 1 01?
What information can we use to decide on the muxes and alu control lines?
The instruction!
What must we calculate?
No problem, just do a truth table.
Start Lecture #15
ALUOp | Action needed by ALU |
---|---|
00 | Addition (for load and store) |
01 | Subtraction (for beq) |
10 | Determined by funct field (R-type instruction) |
11 | Not used |
We will let the main control (to be done later) summarize the opcode for us. From this summary and the 6-bit funct field, we shall determine the control lines for the ALU. Specifically, the main control will summarize the opcode as the 2-bit field ALUOp, whose meaning is shown on the right.
How many entries do we have now in the truth table?
don't care bits.
opcode | ALUOp | operation | funct | ALU action | ALU cntl |
---|---|---|---|---|---|
LW | 00 | load word | xxxxxx | add | 0010 |
SW | 00 | store word | xxxxxx | add | 0010 |
BEQ | 01 | branch equal | xxxxxx | subtract | 0110 |
R-type | 10 | add | 100000 | add | 0010 |
R-type | 10 | subtract | 100010 | subtract | 0110 |
R-type | 10 | AND | 100100 | and | 0000 |
R-type | 10 | OR | 100101 | or | 0001 |
R-type | 10 | SLT | 101010 | set on less than | 0111 |
ALUOp | Funct       || Bnegate:OP
 1 0  | 5 4 3 2 1 0 || B OP
------+-------------++------------
 0 0  | x x x x x x || 0 10
 x 1  | x x x x x x || 1 10
 1 x  | x x 0 0 0 0 || 0 10
 1 x  | x x 0 0 1 0 || 1 10
 1 x  | x x 0 1 0 0 || 0 00
 1 x  | x x 0 1 0 1 || 0 01
 1 x  | x x 1 0 1 0 || 1 11
Applying these simplifications yields the truth table on the right.
How should we implement this?
We will do it PLA style (disjunctive normal form, 2-levels of logic) for each of the three output bits separately.
Specifically, for each output, we will
Only the first part requires any real work.
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 x 1  | x x x x x x
 1 x  | x x 0 0 1 0
 1 x  | x x 1 0 1 0
When is Bnegate (called Op2 in book) asserted?
Ans: Those rows in the table above where its bit (the leftmost output bit) is 1. That is, rows 2, 4, and 7. We show those three rows on the right.
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 x 1  | x x x x x x
 1 x  | x x x x 1 x
Looking again at the full (7-row) table, we notice that, in the 5 rows with ALUOp=1x, F1=1 is enough to distinguish the two rows where Bnegate is asserted. This gives the last table for Bnegate, again shown on the right.
Hence Bnegate is simply ALUOp0 + (ALUOp1 · F1).
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 1 x  | x x 0 1 0 1
 1 x  | x x 1 0 1 0
Now we apply the same technique to determine when OP0 is asserted and begin by listing on the right the rows in the full table where its bit (the rightmost output bit) is set.
As with BNegate, we look back at the full table and study all the rows where ALUOp=1x, and, within that group of rows, those rows where OP0 is asserted (the last two rows).
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 1 x  | x x x x x 1
 1 x  | x x 1 x x x
We see that the rows where OP0 is asserted are characterized by just two Function bits (3 and 0), which reduces the table to that on the right.
Hence OP0 is ALUOp1 · F0 + ALUOp1 · F3
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 0 0  | x x x x x x
 x 1  | x x x x x x
 1 x  | x x 0 0 0 0
 1 x  | x x 0 0 1 0
 1 x  | x x 1 0 1 0
Finally, we determine when is OP1 asserted using the same technique. However, we shall see that this bit is more inspiring than the first two. Once again the procedure begins by listing on the right those rows where the relevant bit (the middle output bit) is one.
Right away we get a hint that we have more work to do as five rows pop up.
As before we study the 5 rows in the original 7-row table that have ALUOp=1x, and, within that group, those rows where OP1 is asserted (rows 3, 4, and 7).
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 0 0  | x x x x x x
 x 1  | x x x x x x
 1 x  | x x x 0 x x
We again find that one Funct bit distinguishes when OP1 is asserted, namely Funct bit 2 (in this case OP1 is asserted when Funct bit 2 is false).
As a result the truth table for OP1 reduces to the 3-row version shown on the right.
Although this truth table would yield a fairly small circuit, we shall simplify it further.
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 0 x  | x x x x x x
 1 x  | x x x 0 x x
Recall from the original table, that the x 1 in the second row is really 0 1.
Although x 1 gives us more freedom than 0 1 in implementing this row by itself, we are able to simplify further by undoing the don't care and noting that, with 0 1 in the second row, rows 1 and 2 can be combined to give the table on the right.
ALUOp | Funct
 1 0  | 5 4 3 2 1 0
------+------------
 0 x  | x x x x x x
 x x  | x x x 0 x x
Last, we can use the first row to enlarge the scope (and hence simplify the implementation) of the last row resulting in the final table on the right.
So OP1 = (ALUOp1)' + (Funct2)'
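It is worth checking that the three simplified equations really do reproduce the full 7-row table. The Python sketch below does so; the function name and the dictionary of funct codes are mine, but the equations and the expected outputs are exactly those derived above.

```python
def alu_control(aluop1, aluop0, funct):
    """Return (Bnegate, OP1, OP0) from ALUOp and the 6-bit funct field."""
    f = [(funct >> k) & 1 for k in range(6)]      # f[k] is Funct bit k
    bnegate = aluop0 | (aluop1 & f[1])            # ALUOp0 + ALUOp1.F1
    op1 = (1 - aluop1) | (1 - f[2])               # ALUOp1' + F2'
    op0 = (aluop1 & f[0]) | (aluop1 & f[3])       # ALUOp1.F0 + ALUOp1.F3
    return bnegate, op1, op0

# funct codes and expected outputs for the R-type rows of the table
r_type = {"add": (0b100000, (0, 1, 0)), "sub": (0b100010, (1, 1, 0)),
          "and": (0b100100, (0, 0, 0)), "or":  (0b100101, (0, 0, 1)),
          "slt": (0b101010, (1, 1, 1))}
for name, (funct, expected) in r_type.items():
    print(name, alu_control(1, 0, funct), expected)

print(alu_control(0, 0, 0b111111))   # lw/sw: funct bits are don't cares
print(alu_control(0, 1, 0b111111))   # beq:   funct bits are don't cares
```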
After all the simplification the circuit itself is very easy and is shown in the diagram below.
Indeed, the simplifications were so successful that we are led to question whether this was due to
At long last we get to use the opcode (instruction bits 31-26).
Shown in blue in the diagram below is the control unit, the logic block that calculates the green control lines that have appeared above but were floating, i.e., they started from nothing.
Specifically our task, illustrated in the diagram below, is to calculate the following nine bits. (Note that a smaller picture—without the control—is shown here).
All 9 bits are determined by the opcode. We show the logic diagram after we illustrate the operation of the control logic.
Note that the MIPS instruction set is fairly regular. Most of the fields we need are always in the same place in the instruction (independent of the instruction type).
MemRead: | Memory delivers the value stored at the specified addr |
MemWrite: | Memory stores the specified value at the specified addr |
ALUSrc: | Second ALU operand comes from (reg-file / sign-ext-immediate) |
RegDst: | Number of reg to write comes from the (rt / rd) field |
RegWrite: | Reg-file stores the specified value in the specified register |
PCSrc: | New PC is Old PC+4 / Branch target |
MemtoReg: | Value written in reg-file comes from (alu / mem) |
We have just seen how ALUOp is used to calculate the control bits for the ALU. The purposes of the remaining 7 bits (recall that ALUOp is 2 bits) are described in the table to the right and their uses in controlling the datapath are shown in the picture above.
We are interested in four opcodes.
Do a stage play
volunteers
add r9,r5,r1    r9=r5+r1    0 5 1 9 0 32
sub r9,r9,r6                0 9 6 9 0 34
beq r9,r0,-8                4 9 0 < -2 >
slt r1,r9,r0                0 9 0 1 0 42
lw r1,102(r2)               35 2 1 < 102 >
sw r9,102(r2)
The following figures illustrate the play. Bigger versions of the pictures are here.
Start Lecture #16
Remark: Lab 5 assigned, due 8 November 2011.
The following truth table shows, for each of the four opcodes we are studying, the values needed for each control line.
Recall that we have more than four instructions since we are implementing several R-type instructions all of which have the same opcode (opcode zero). As we have seen, these R-type instructions are distinguished by the 6 Funct bits.
Instruction | Op5 | Op4 | Op3 | Op2 | Op1 | Op0 | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-type | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
lw | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
sw | 1 | 0 | 1 | 0 | 1 | 1 | X | 1 | X | 0 | 0 | 1 | 0 | 0 | 0 |
beq | 0 | 0 | 0 | 1 | 0 | 0 | X | 0 | X | 0 | 0 | 0 | 1 | 0 | 1 |
Control | Signal | R-type | lw | sw | beq |
---|---|---|---|---|---|
Inputs | Op5 | 0 | 1 | 1 | 0 |
Inputs | Op4 | 0 | 0 | 0 | 0 |
Inputs | Op3 | 0 | 0 | 1 | 0 |
Inputs | Op2 | 0 | 0 | 0 | 1 |
Inputs | Op1 | 0 | 1 | 1 | 0 |
Inputs | Op0 | 0 | 1 | 1 | 0 |
Outputs | RegDst | 1 | 0 | X | X |
Outputs | ALUSrc | 0 | 1 | 1 | 0 |
Outputs | MemtoReg | 0 | 1 | X | X |
Outputs | RegWrite | 1 | 1 | 0 | 0 |
Outputs | MemRead | 0 | 1 | 0 | 0 |
Outputs | MemWrite | 0 | 0 | 1 | 0 |
Outputs | Branch | 0 | 0 | 0 | 1 |
Outputs | ALUOp1 | 1 | 0 | 0 | 0 |
Outputs | ALUOp0 | 0 | 0 | 0 | 1 |
The numerous columns, many with wide labels, clearly lead to an extremely wide, and hence awkward, table.
To make the table easier to read, Patterson and Hennessy draw this particular table in a non-standard manner as shown on the right.
The key change is that what were previously the column headings are now the row headings.
Just for fun, I tried to keep the original format and arrived at the version shown below.
Instruction | Op5 | Op4 | Op3 | Op2 | Op1 | Op0 | | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-type | 0 | 0 | 0 | 0 | 0 | 0 | | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
lw | 1 | 0 | 0 | 0 | 1 | 1 | | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
sw | 1 | 0 | 1 | 0 | 1 | 1 | | X | 1 | X | 0 | 0 | 1 | 0 | 0 | 0 |
beq | 0 | 0 | 0 | 1 | 0 | 0 | | X | 0 | X | 0 | 0 | 0 | 1 | 0 | 1 |
As always, given a truth table, it is quite easy to produce logic equations and a logic diagram, both in PLA style (i.e., using 2 levels of logic).
The circuit, drawn in PLA style, is shown on the right.
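As a rough sketch of those logic equations (my own transcription of the truth table into Python, not something from the text), each row of the table becomes one AND term over the opcode bits and each output is the OR of the rows in which it is 1. The function name main_control is made up for illustration, and I have arbitrarily set the don't-care (X) entries to 0.

```python
def main_control(opcode):
    """Return the nine control lines for a 6-bit opcode, PLA style (two levels)."""
    op = [(opcode >> i) & 1 for i in range(6)]  # op[0] is Op0, ..., op[5] is Op5
    n = [1 - b for b in op]                     # complemented inputs

    # First level (AND plane): one product term per row of the truth table.
    r_type = n[5] & n[4] & n[3] & n[2] & n[1] & n[0]      # 000000
    lw = op[5] & n[4] & n[3] & n[2] & op[1] & op[0]       # 100011
    sw = op[5] & n[4] & op[3] & n[2] & op[1] & op[0]      # 101011
    beq = n[5] & n[4] & n[3] & op[2] & n[1] & n[0]        # 000100

    # Second level (OR plane): each output ORs the rows where it is 1.
    # Assumption: the X (don't care) entries are treated as 0.
    return {
        "RegDst": r_type,
        "ALUSrc": lw | sw,
        "MemtoReg": lw,
        "RegWrite": r_type | lw,
        "MemRead": lw,
        "MemWrite": sw,
        "Branch": beq,
        "ALUOp1": r_type,
        "ALUOp0": beq,
    }


print(main_control(0b100011))  # the control lines for lw
```

Treating an X as 0 is only one choice; a real PLA designer would pick whichever value gives the smaller circuit.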
Homework: In a previous homework, you modified the datapath to support addi and a variant of lw. Determine the control needed for these instructions.
Homework: Can we eliminate MemtoReg and use MemRead instead?
Homework: Can any other control signals be eliminated?
Recall the jump instruction.
opcode | addr |
---|---|
31-26 | 25-0 |
Addr is a word address. Since the machine is byte addressable, we need to shift the address left 2 bits (filling the right with zeros).
The address in the instruction is 26 bits.
When shifted and 0 filled, the result is 28 bits.
But the machine has 32-bit addresses.
Where do the remaining 4 bits come from?
Answer: The high order 4 bits of the new address are set equal to the high order 4 bits of the address of the current instruction (after incrementing the latter by 4), i.e., the high order 4 bits of PC+4.
This is quite easy to implement as seen in the following diagram. Basically all that is added to the datapath is one mux and its associated control line (plus a few wires).
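The address arithmetic can be sketched in a few lines of Python. This is only an illustration of the bit manipulation, not of the hardware; the function name jump_target and the sample PC value are my own.

```python
def jump_target(pc, instruction):
    """Compute the 32-bit target of a jump instruction located at address pc."""
    addr26 = instruction & 0x03FFFFFF       # bits 25-0: the 26-bit word address
    low28 = addr26 << 2                     # shift left 2, filling with zeros: 28 bits
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF       # the incremented PC
    return (pc_plus_4 & 0xF0000000) | low28  # top 4 bits come from PC+4


# A jump (opcode 000010) with addr = 1, located at the made-up address 0x40000000.
print(hex(jump_target(0x40000000, 0x08000001)))
```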
Some instructions are likely slower than others and we must set the clock cycle time long enough for the slowest. The disparity between the cycle times needed for different instructions is quite significant when one considers implementing more difficult instructions, like divide and floating point ops.
Actually, if we considered cache misses, which result in references to external DRAM, the cycle time ratios would exceed 100.
Possible solutions
Self-timed logic.
stage plays we did earlier. You can see that the instructions execute in phases: That is, first the instruction is fetched, then the registers are read, then the ALU is accessed, etc.
superinstruction called a very long instruction.
Patterson and Hennessy give a real-world example of pipelining based on doing multiple loads of laundry. For variety, I will present a different example, based on sandbagging a river to prevent (or at least minimize) flooding.
We have a huge quantity of dirt in the western part of an old gray town and a river with rising water in the eastern part. Since we anticipated the possibility of the river rising, we stockpiled empty burlap bags near the dirt and we have a small loop of train tracks running between the dirt and the river. We purchased two bright red carts and placed them on the track, one at the dirt and one at the river.
If we adopted the method of our single cycle MIPS implementation we would proceed as follows.
If we make the simplifying assumption that each of the five steps takes the same time, say T minutes, then it takes 5T minutes to complete the job for one bag of sand.
We can do better than the approach just given; we can pipeline the activities. That is, once we start the cart carrying bag 1 east (step 2 of bag 1 begins) we can immediately start to fill bag 2 (step 1 of bag 2 begins).
When carrying bag 1 to the river (step 3 of bag 1 begins), we can start the cart carrying bag 2 east (step 2 of bag 2 begins) and can start filling bag 3 (step 1 of bag 3 begins).
It gets better. When placing bag 1 at the correct place (step 4 of bag 1 begins), we can (i) carry bag 2 to the river (step 3 of bag 2 begins), (ii) send the cart carrying bag 3 east (step 2 of bag 3 begins), and (iii) start filling bag 4 (step 1 of bag 4 begins).
Finally, when sending an empty cart west for the first time (step 5 of bag 1 begins), we can (i) place bag 2 at the correct place (step 4 of bag 2 begins), (ii) carry bag 3 to the river (step 3 of bag 3 begins), (iii) start the cart carrying bag 4 east (step 2 of bag 4 begins), and (iv) start filling bag 5 (step 1 of bag 5 begins).
The second solution seems much better: Instead of a sand bag being placed once every 5T minutes, we now place one every T minutes, a fivefold improvement.
But the time for each sand bag is unchanged; it remains 5T. The improvement comes from the fact that we are working on several sand bags simultaneously. This is the gain in pipelining. The overall latency of each operation remains constant (actually it increases—i.e., gets worse—slightly), but the throughput increases—i.e. gets better—considerably.
Put another way, we can say that pipelining improves performance by increasing throughput, not by decreasing the time for one instruction.
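To check the bookkeeping in the sandbag story, here is a small sketch (my own, with made-up function names) that reports which of the five steps each bag is performing during each time slot of length T. It assumes bag b starts step 1 in slot b and that no step is ever delayed.

```python
STEPS = 5  # the five steps of the sandbag job, each assumed to take T minutes


def step_in_slot(bag, slot):
    """Step (1..5) that `bag` is performing during time `slot`, or None if idle.

    Bags and slots are numbered from 1; bag b starts step 1 in slot b.
    """
    step = slot - bag + 1
    return step if 1 <= step <= STEPS else None


def slots_needed(bags):
    """Number of T-minute slots needed to finish `bags` bags when pipelined."""
    return bags + STEPS - 1


for slot in range(1, 6):
    active = {bag: step_in_slot(bag, slot) for bag in range(1, slot + 1)}
    print("slot", slot, "bag -> step:", active)
```

Once the pipeline is full, one bag finishes in every slot, even though each individual bag still spends 5 slots in the pipeline.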
The same idea used for sand bagging and laundry can be applied for executing computer instructions. For executing MIPS instructions the pipeline has 5 steps or stages.
Instruction | Instruction fetch | Register read | ALU operation | Data access | Register write | Total time |
---|---|---|---|---|---|---|
lw | 200 ps | 100 ps | 200 ps | 200 ps | 100 ps | 800 ps |
sw | 200 ps | 100 ps | 200 ps | 200 ps | | 700 ps |
R-type | 200 ps | 100 ps | 200 ps | | 100 ps | 600 ps |
beq | 200 ps | 100 ps | 200 ps | | | 500 ps |
The table on the right gives approximate times for each part of executing the MIPS instructions we have implemented.
Using our single-cycle implementation, we would need to make the clock cycle time 800ps, the time for the longest instruction.
Using a five-stage pipeline, we would need to make the cycle time 200ps, the time of the slowest stage. Since all instructions go through all 5 stages (even if nothing is done for that instruction during one or more stages), every instruction will take 1000ps=1ns from beginning to end.
This sounds worse!
Indeed, it is worse if you judge performance by the time for one instruction. But, as we mentioned before, the more relevant measure is the throughput, i.e., the number of instructions executed in one second.
Let's look at executing a three instruction program that adds the value in register 3 to a location in memory.
lw $r1, 50($r2) // uses all 5 stages
add $r1, $r1, $r3 // no data access
sw $r1, 50($r2) // no register write
Our single-cycle implementation requires 3 * 800ps = 2400ps to execute these three instructions. The instruction execution time is 800ps and the throughput is
3 instructions / (2400 * 10^{-12} second) = 1.25 * 10^{9} instructions/second
The pipelined execution requires 1400ps; the instruction execution time is 1000ps; and the throughput is
3 instructions / (1400 * 10^{-12} second) ∼ 2.14 * 10^{9} instructions/second
The result would get better if we used a bigger example. Indeed the asymptotic speedup is 4 since the single cycle implementation starts one instruction every 800ps and the pipelined implementation starts one instruction every 200ps.
Remember that real programs execute (at least) billions of instructions so the value obtained for such programs would be extremely close to 4.
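The arithmetic above is easy to reproduce. The sketch below is my own; it assumes an ideal pipeline with no hazards, so n instructions need 5 + (n - 1) cycles of 200ps, and it compares that with the single-cycle time of n * 800ps.

```python
STAGES = 5
SINGLE_CYCLE_PS = 800      # cycle time of the single-cycle implementation
PIPELINED_CYCLE_PS = 200   # cycle time of the pipelined implementation


def single_cycle_time_ps(n):
    """Time to run n instructions, one per 800ps cycle."""
    return n * SINGLE_CYCLE_PS


def pipelined_time_ps(n):
    """Time to run n instructions: the first needs 5 cycles, each later one adds 1."""
    return (STAGES + n - 1) * PIPELINED_CYCLE_PS


def speedup(n):
    return single_cycle_time_ps(n) / pipelined_time_ps(n)


for n in (3, 1000, 10**9):
    print(n, single_cycle_time_ps(n), pipelined_time_ps(n), speedup(n))
```

The speedup approaches 800/200 = 4 as n grows, but never quite reaches it because of the 4 extra cycles needed to fill the pipeline.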
The MIPS instruction set was designed to ease pipelining in the following ways.
The example shown above gives a slightly unrealistic view of pipelining since we have not discussed hazards, which can delay an instruction because it cannot execute the next pipeline stage right away.
This occurs when the hardware cannot execute all the actions required during one cycle. MIPS was designed to minimize this possibility, but consider a different design that had one memory instead of the two as in MIPS. Then, if the first instruction was lw, it would be accessing this memory during cycle 4 to read the data. But at the same time, the 4th instruction needs to do an I-fetch. The resulting contention for the combined data-instruction memory is a structural hazard.
lw $r1,40($r8)
lw $r2,44($r8)
add $r3,$r1,$r2
sw $r3,48($r8)
Consider the familiar statement A = B+C; found in most languages. It would most likely be translated by the compiler into an instruction sequence like the one shown on the right.
Assume the sequence starts at cycle 1. Then during cycle 4, the add instruction reads registers $r1 and $r2.
But those registers are not written with the required values until cycles 5 and 6. Hence the third instruction encounters a data hazard at its 2nd cycle, which causes the pipeline to stall for three cycles so that at cycle 7, the add can perform its second step. Such stalls are often called bubbles in the pipeline.
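The stall count can be checked with a small model. This is only a sketch under my own simplifying assumptions: a five-stage pipeline with no forwarding, registers read in stage 2 and written in stage 5, a value written in cycle w not readable before cycle w+1, and each instruction at least one cycle behind its predecessor. The function name and the (dest, sources) encoding are invented for illustration.

```python
def register_read_cycles(program):
    """Return the cycle in which each instruction performs its register read.

    program is a list of (dest, sources) pairs; dest is None if no register
    is written. Assumes no forwarding (see the lead-in for the full model).
    """
    write_cycle = {}   # register -> cycle in which its new value is written
    reads = []
    prev_read = 1      # the first instruction is fetched in cycle 1
    for dest, sources in program:
        read = prev_read + 1                  # one cycle behind the previous instruction
        for reg in sources:
            if reg in write_cycle:            # must wait until after the write
                read = max(read, write_cycle[reg] + 1)
        reads.append(read)
        if dest is not None:
            write_cycle[dest] = read + 3      # stage 5 is three cycles after stage 2
        prev_read = read
    return reads


program = [
    ("r1", ["r8"]),          # lw  $r1,40($r8)
    ("r2", ["r8"]),          # lw  $r2,44($r8)
    ("r3", ["r1", "r2"]),    # add $r3,$r1,$r2
    (None, ["r3", "r8"]),    # sw  $r3,48($r8)
]
print(register_read_cycles(program))
```

Under this model the add reads its registers in cycle 7 rather than cycle 4, the three-cycle stall described above.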
add $r1,$r1,$r2
add $r1,$r1,$r3
Consider the two instruction sequence on the right that replaces register r1 with the sum of the first three registers. The second instruction adds the third register to what should be the sum of the first two. Looking at a cycle-by-cycle picture of the pipeline (draw this on the board) we see that the second instruction reads register r1 during cycle 3 (its second cycle). But that sum will not appear in register r1 until the fifth cycle.
However, if we look again at the cycle-by-cycle picture, we see that the sum is calculated during cycle 3 (the third stage of the first instruction) and not actually used until cycle 4 (the third stage of the second instruction).
As a result one could run a wire from the end of stage 3 to the beginning of stage 3 (and add a mux and some serious control logic) and get the value there in time.
We say the value has been forwarded from the first instruction to the second or that it has bypassed some of the steps.
In this case the solution was perfect; no bubble remains.
Homework: How would forwarding be used in the previous 4-instruction sequence? Do any bubbles remain?
beq $r1,$r2,L
some instructions
L: some other instructions
Consider the conditional branch shown on the right. During cycle 2 we need to fetch the 2nd instruction to execute, but we don't know what that instruction is since we don't know yet if the branch will be taken. We won't know until the end of cycle 3, when the ALU has determined if registers r1 and r2 are equal.
We could guess that the branch will not be taken and start executing some instructions. If we guess wrong, we must throw out the work we did based on the guess.
This hazard has led to a large field of study called branch prediction that uses sophisticated techniques to make a more intelligent (i.e., a more-likely-to-be-correct) guess as to whether or not the branch will be taken.
Pipelining is an important component in the processor designer's toolbox; all modern microprocessors use it. Pipelining permits the execution of consecutive instructions to be overlapped.
Although no instruction is itself sped up (indeed some are slowed down), the throughput is increased significantly.
Hazards can greatly decrease the potential improvement of pipelining; a well designed ISA can make hazards easier to deal with, but in any case hazards complicate the design of modern high-performance processors.
Start Lecture #17
Homework: 4.6.1, 4.6.2, 4.7.1, 4.7.2, 4.9.1-3.
Remark: We only sketch some ideas in the rest of this chapter. For a complete treatment, read the book carefully.
The diagram below shows the datapath divided into the same 5 pipeline stages we just studied. These stages are normally referred to as:
The next step is to capture the state after each stage. This means that we need to replace the simple dotted red lines, which were just for our visualization, by pipeline registers that hold all the values produced by each stage.
Now you can do another stage play and see that, at the beginning of each stage, the pipeline registers are read and, at the end of each stage, they are written.
There are various subtleties that must be addressed.
For example, the register file is written during the fifth pipeline stage, but the register number is read from the instruction during the first stage. Hence, by the time the fifth stage is executed, the register number is from a later instruction.
This particular problem is fixed by moving the write register number from pipeline register to pipeline register as the instruction moves through the pipeline. At the fifth stage, the register number is sent from the last pipeline register to the write register input of the register file.
You might wonder why there are only 4 pipeline registers since there are five stages. The answer is that all the fifth stage does is write a register so this value is being saved for subsequent instructions and no pipeline register is needed at the end of stage 5.
The last diagram above represents what is called a single-clock-cycle pipeline diagram. Diagrams such as the one on the right (which we have already discussed) are called multiple-clock-cycle pipeline diagrams. The latter are easier to follow, but supply fewer details.
We already calculated the control lines. The trouble is that we calculate them at the beginning, but use them in subsequent stages. Hence they must be passed from pipeline register to pipeline register as the instruction moves along the pipeline.
Three points are made
We have seen how to design the datapath and control for a subset of the MIPS processor using a single-cycle strategy. Although successful this simple implementation is too slow so we investigated, to a limited extent, a more aggressive, pipelined implementation.
In addition to pipelining, modern designs are multiple-issue/superscalar. That is, they issue several instructions each cycle and hence have several instructions performing each pipeline stage and thus can have very many instructions active (in flight) at one time.
This, coupled with out-of-order execution, which we haven't discussed, complicates the design and increases the power usage considerably.
Design complexity may have already hit and passed its peak. Power concerns have caused all the major players to cut back on the complexity of their designs.
Read.
Read.
Read.
Problem 1.1 was assigned earlier in the course.
Throughput measures the number of jobs per day/second/etc that can be accomplished.
Response time measures how long an individual job takes.
We define Performance as 1 / Execution time.
We say that machine X is n times faster than machine Y or machine X has n times the performance of machine Y if the execution time of a given program on X = (1/n) * the execution time of the same program on Y.
But what program should be used for the comparison? Various suites have been proposed; some emphasizing CPU integer performance, others floating point performance, and still others I/O performance.
How should we measure execution time?
system time, i.e., time when the CPU is executing the operating system on behalf of the user program.
normally loaded system.
heavily loaded system.
We mostly employ user-mode CPU time, but this does not mean the other metrics are worse.
Cycle time vs. Clock rate.
What is the cycle time for a 700MHz computer?
What is the clock rate for a machine with a 10ns cycle time?
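Since cycle time and clock rate are reciprocals, both questions are one-line conversions. The helper names below are my own; the only thing to watch is the units (ns versus seconds, MHz versus Hz).

```python
def cycle_time_ns(clock_rate_hz):
    """Cycle time in nanoseconds for a clock rate given in Hz."""
    return 1e9 / clock_rate_hz


def clock_rate_mhz(cycle_ns):
    """Clock rate in MHz for a cycle time given in nanoseconds."""
    return 1e3 / cycle_ns


print(cycle_time_ns(700e6))   # a 700MHz computer: roughly 1.43 ns
print(clock_rate_mhz(10))     # a 10ns cycle time: 100 MHz
```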
The execution time for a given job on a given computer is
(CPU) execution time = (#CPU clock cycles required) * (cycle time) = (#CPU clock cycles required) / (clock rate)
Since the number of CPU clock cycles required equals the number of instructions executed times the average number of cycles in each instruction, we can write this equation in other equivalent forms.
CPU Time (in seconds) = #Instructions * CPI * Cycle_time (in seconds).
CPU Time (in ns) = #Instructions * CPI * Cycle_time (in ns).
CPU Time (in seconds) = #Instructions * CPI / Clock_Rate (in Hz).
In our single cycle implementation, the number of cycles required is just the number of instructions executed. That is, the CPI is 1.
Similarly, if every instruction took 5 cycles, the number of cycles required would be five times the number of instructions executed.
But real systems are more complicated than that!
After extensive measurements, one calculates for a given machine the average CPI (cycles per instruction).
We shall sometimes assume this average CPI actually applies to all instructions. Other times we shall say something like
Assume there are two classes of instructions. Class A instructions require 4 cycles to execute; class B instructions require 3 cycles to execute. Assume an execution of program P involves 30% class A instructions and 70% class B. What is the (average) CPI for this execution?
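The average CPI in a question like this one is just a weighted average of the per-class cycle counts. A minimal sketch, with a function name of my own choosing:

```python
def average_cpi(mix):
    """mix is a list of (fraction_of_instructions, cycles) pairs; fractions sum to 1."""
    return sum(fraction * cycles for fraction, cycles in mix)


# 30% class A at 4 cycles each, 70% class B at 3 cycles each.
print(average_cpi([(0.30, 4), (0.70, 3)]))   # about 3.3
```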
The number of instructions required for a given program depends on the instruction set. For example, assume we want to add the contents of register 1 to a location X in memory. Then MIPS would require three instructions; whereas, x86 needs only one.
MIPS | x86 |
---|---|
lw $r3,X | |
add $r3,$r1 | add X,$r1 |
sw $r3,X | |
CPI is a good way to compare two implementations of the same instruction set (also called the same ISA, instruction set architecture).
If the clock cycle is unchanged, then the performance of a given ISA is inversely proportional to the CPI (e.g., halving the CPI doubles the performance).
Naturally, complicated instructions often take longer to execute. They require either more cycles or a longer cycle time. Older machines with complicated instructions (e.g., the Digital Equipment Corporation VAX, an important machine in the 1980s) had CPI>>1.
As we have seen, with pipelining we can have many cycles for each instruction but still achieve a CPI of nearly 1.
Modern superscalar machines often have a CPI less than one.
As a result sometimes one speaks of the IPC or instructions per cycle for such machines. However, we won't use IPC.
Start Lecture #18
Remark: For midterm grades I computed
0.6* exam + 0.4 * (lab1 + lab2 + lab3 + lab4)
I did not believe there was enough precision to give +- so only gave A,B,C,D.
Remark: A draft of lab6 is available. It will be assigned next class. I constructed a demo of implementing a small RAM in logisim. The file name is ram.circ. It implements (a small version of) the Data Memory component of lab 6.
Do on the board the following example from pages 35-36.
A compiler designer is developing code sequences for a particular computer. The computer has three classes of instructions, A, B, and C, which have CPIs of 1, 2, and 3 respectively.
Note: This is shorthand for saying "Class A instructions, on average, add one cycle to the execution time" and similarly for classes B and C. It is not saying that executing one class A instruction takes one cycle from beginning to end. Again we see the difference between the latency of a single instruction and the throughput (instructions per second). Perhaps it would be better to say that "the cost of a class A instruction is one cycle".
The compiler writer has a choice of two possible sequences of machine language instructions as a translation of a particular high-level language statement. The first sequence has 2 class A instructions, 1 class B, and 2 class C. The second sequence has 4 class A and 1 each class B and C.
Which sequence executes the most instructions? Which is faster? What is the CPI for each sequence?
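For reference, the bookkeeping for this example can be written as a short Python sketch (my own; the names CPI, summarize, seq1, and seq2 are invented). It counts instructions and cycles for each sequence and divides to get the CPI.

```python
CPI = {"A": 1, "B": 2, "C": 3}   # cycles per instruction for each class


def summarize(counts):
    """Return (instruction count, cycle count, CPI) for a sequence.

    counts maps an instruction class to how many instructions of that class
    the sequence contains.
    """
    instructions = sum(counts.values())
    cycles = sum(CPI[cls] * n for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}
print(summarize(seq1))
print(summarize(seq2))
```

The sequence that executes more instructions turns out to need fewer cycles, so instruction count alone is a poor predictor of speed.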
Homework: Carefully go through and understand the example that I just did in class.
Homework: 1.3.1, 1.3.3, 1.3.4, 1.4.1, 1.4.2, 1.5.2a.
We will cover none of the manufacturing and only some of the benchmarking material.
As mentioned in the previous section, some instructions take longer than others. Moreover, different ISAs perform better on certain instructions and other ISAs perform better on other instructions. Different application programs use different mixes of instructions.
It might be, for example, that computer A does great on programs that reference memory heavily, but poorly on programs dominated by floating point operations. In contrast computer B might excel on floating-point, but be sluggish on memory references.
As you can imagine, computer manufacturers would prefer that a customer evaluates the company's products by running programs on which the products do particularly well.
To standardize the measurements, many vendors agreed on certain sets of benchmarks on which they would provide performance evaluations. Perhaps the best known standard benchmarks are those sanctioned by SPEC (System Performance Evaluation Cooperative). SPEC actually contains several benchmark suites.
One fallacy is to assume that by fixing part of the problem, the entire problem is fixed to a very great extent.
For example, assume a simple system with two classes of instructions
A new whiz-bang floating-point unit is proposed that speeds floating-point instructions by a factor of 5 (new CPI is 2), has no effect on cycle time, and only doubles the cost of the machine. Sounds great; a speedup of 5 at a cost of only 2! But just how great is it?
Say the customer is primarily interested in a single application A. Measurements show that A executes N instructions, 20% of which are floating point.
To execute application A, the old system takes
.2N * 10 cycles + .8N * 2 cycles = 2N + 1.6N cycles = 3.6N cycles
The new, improved system would take
.2N * 2 cycles + .8N * 2 cycles = .4N + 1.6N cycles = 2N cycles
Since the cycle time hasn't changed, execution time is proportional to the number of cycles. Thus the new system is 3.6N / 2N = 1.8 times faster for "only" twice the price. No sale!
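The same calculation as a sketch in Python (my own; N cancels, so the code works in cycles per instruction). The CPI values of 10 and 2 for the old system are read off the formula above.

```python
def cycles_per_instruction(fp_fraction, fp_cpi, other_cpi):
    """Average cycles per instruction for a two-class instruction mix."""
    return fp_fraction * fp_cpi + (1 - fp_fraction) * other_cpi


old = cycles_per_instruction(0.20, fp_cpi=10, other_cpi=2)   # 3.6
new = cycles_per_instruction(0.20, fp_cpi=2, other_cpi=2)    # 2.0
print(old, new, old / new)
```

Trying other values of fp_fraction shows how strongly the overall speedup depends on the fraction of the work that was actually improved.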
NOAA has a new computer program that is predicting tomorrow's weather very well, but the computation takes a week, which makes the results useless. They need to reduce the 168 hours (one week) to 1 hour.
They know that the program spends 99% of its time doing material that can be partitioned evenly on up to a thousand processors.
Since they need a speedup of at least 168 and have the money, they decide to buy 1000 processors. How long does the program now take to run?
Answer: 1% of 168 hours is 1.68 hours; 99% of 168 hours is 166.32 hours. The 1000 processors cooperate on the second piece and reduce it to 166.32/1000=.16632 hours.
However, the small 1% piece isn't sped up at all and still takes 1.68 hours. So the entire job takes 1.68+.16632 = 1.84632 hours, which exceeds the 1 hour requirement.
The 1000 processors gave a speedup of only 168/1.84632 ∼ 91.
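Here is the calculation as a small Python function (a sketch of my own; it assumes the parallel part divides perfectly evenly among the processors and that the rest is not sped up at all).

```python
def parallel_time(total_hours, parallel_fraction, processors):
    """Run time when only parallel_fraction of the work is split across processors."""
    serial = total_hours * (1 - parallel_fraction)               # not sped up
    parallel = total_hours * parallel_fraction / processors      # evenly divided
    return serial + parallel


hours = parallel_time(168, 0.99, 1000)
print(hours, 168 / hours)   # about 1.846 hours, a speedup of about 91
```

Note that the serial 1% alone already takes 1.68 hours, so no number of processors can meet the 1 hour requirement.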
Homework: What would the speedup be if they purchased only 100 processors?
MIPS is an acronym for Millions of Instructions Per Second. It is a unit of rate or speed (like MHz); it is NOT a unit of time (like ns).
As its full name suggests, MIPS is defined as
how many million instructions were executed / how many seconds were required
This is the same as
the number of instructions executed / the number of microseconds used
The instructions we have been studying (lw, R-type, etc) are those of the MIPS computer company. That usage of the word MIPS is different from the acronym above, but the company's founders certainly knew about the acronym above.
Indeed, the company started with a Stanford research project headed by Hennessy. This project was called MIPS standing for Microprocessor without Interlocked Pipeline Stages. However, the microprocessors produced by the MIPS company did have interlocked pipeline stages.
At roughly the same time, Patterson headed a research project called RISC standing for Reduced Instruction Set Computer. Sun Microsystems was to an extent based on this research project.
MIPS only counts instructions and does not take into account that some ISAs require more instructions than other ISAs to solve the same problem. For example, we saw that adding a register to a memory location takes 1 instruction on an x86 ISA but three on a MIPS. If the single instruction took one microsecond, the x86 would achieve a 1 MIPS rating. If the three instructions took a total of 2 microseconds, the MIPS computer would achieve a 1.5 MIPS rating, much better than the x86 even though it required twice as long to accomplish the same task.
For this reason, MIPS cannot be used (or at least should not be used) to compare systems with different ISAs.
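The x86 versus MIPS comparison takes only a couple of lines to reproduce. In this sketch (my own) the rating is computed as instructions per microsecond, which equals millions of instructions per second.

```python
def mips_rating(instructions, microseconds):
    """Millions of instructions per second = instructions per microsecond."""
    return instructions / microseconds


x86 = mips_rating(1, 1)    # one instruction in 1 microsecond
mips = mips_rating(3, 2)   # three instructions in 2 microseconds
print(x86, mips)           # the slower machine gets the higher rating
```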
Even with a fixed ISA there are difficulties with MIPS. As with many computer ratings, the program used is important. A program that has predominately fast instructions will achieve a higher MIPS rating than a program that has predominately slow instructions.
A new sophisticated compiler might be able to reduce the number of instructions needed to complete a program and also reduce the total execution time. Clearly a good thing. But, if the instructions eliminated were the fastest ones, then the MIPS rating would go down even though performance went up! To say it in reverse, it is often the case that if one padded a program with NOPs (which are very fast), the program would have exactly the same effect, would take longer, would execute more instructions, but yet would likely achieve a higher MIPS rating than the original.
For numerical calculations floating point operations are often the ones you are interested in; the others are overhead (a very rough approximation to reality). For this reason the MFLOPS (Millions of FLoating point OPerations per Second) was introduced. It has similar problems to MIPS.
overhead (i.e., non-floating point) instructions and floating-point ADD is probably the fastest floating-point instruction.
Benchmarks are better than MIPS or MFLOPS, but still have difficulties.
tuned for important benchmarks.
Homework: Read this (very short) section.
Homework: Read Chapter 5.
Remark: Perhaps the chapter should be entitled "Large vs. Fast".
An ideal memory is
Unable to achieve the impossible ideal we use a memory hierarchy consisting of
... and try to satisfy most references in the small fast memories near the top of the hierarchy.
There is a capacity/performance/price gap between each pair of adjacent levels. We will study the cache-to-memory gap.
the same thing at the same time as my architecture class, but with different, almost disjoint, terminology.
We observe empirically (and teach in OS).
Start Lecture #19
Remark: Lab 6 assigned; due 17 November 2011.
A cache is a small fast memory between the processor and the main memory. It contains a subset of the contents of the main memory.
A cache is organized in units of blocks or lines. Common block sizes are 16, 32, and 64 bytes.
A block is the smallest unit we can move to/from a cache (some designs move subblocks, but we will not discuss such designs).
A hit occurs when a memory reference is found in the upper level (small, fast) of the memory hierarchy.
Consider the following address (in binary).
10101010_11110000_00001111_11001010.
This is a 32-bit address. I used underscores to separate it into four 8-bit pieces just to make it easy to read; the underscores have no significance.
Machine addresses are non-negative (unsigned) so the address above is a large positive number (greater than 2 billion).
All the computers we shall discuss are byte addressed. Thus the 32-bit number references a byte. So far, so good.
We will always assume that each word is four bytes. That is, we assume the computer has 32-bit words. This is not always true (many old machines had 16-bit, or smaller, words; and many new machines have 64-bit words), but to repeat, we will always assume 32-bit words.
Since 32 bits is 4 bytes, each word contains 4 bytes. We assume aligned accesses (as does the MIPS architecture we studied). This means that a word (a 4-byte quantity) must begin on a byte address that is a multiple of the word size, i.e., a multiple of 4. So word 0 includes bytes 0-3; word 1 includes bytes 4-7; word n includes bytes 4n, 4n+1, 4n+2 and 4n+3; and the four consecutive bytes 6-9 do NOT form a word.
What word includes the byte address given above, 10101010_11110000_00001111_11001010?
Answer: 10101010_11110000_00001111_110010, i.e., the address divided by 4.
What are the other bytes in this word?
Answer: 10101010_11110000_00001111_11001000, 10101010_11110000_00001111_11001001, and 10101010_11110000_00001111_11001011.
What is the byte offset of the original byte in its word?
Answer: 10 (i.e., two), the address mod 4.
What are the byte-offsets of the other three bytes in that same word?
Answer: 00, 01, and 11 (i.e., zero, one, and three).
Blocks vary in size. We will not make any assumption about the size, other than that it is a power of two. For these examples (only), assume each block is 32 bytes.
Since we assume aligned accesses, each 32-byte block has a byte address that is a multiple of 32. So block 0 is bytes 0-31, which is words 0-7. Block n is bytes 32n, 32n+1, ..., 32n+31.
What block includes our byte address 10101010_11110000_00001111_11001010?
Answer: 10101010_11110000_00001111_110, i.e., the byte address divided by 32 (the number of bytes in the block) or the word address divided by 8 (the number of words in the block).
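Because the word size and block size are powers of two, all of these divisions and remainders are just shifts and masks. The sketch below is my own (the constant and function names are invented); it redoes the examples for 4-byte words and 32-byte blocks.

```python
ADDR = 0b10101010_11110000_00001111_11001010   # the byte address used above

WORD_BYTES = 4     # 32-bit words
BLOCK_BYTES = 32   # the block size assumed for these examples only


def word_address(byte_addr):
    return byte_addr // WORD_BYTES      # same as byte_addr >> 2


def byte_offset_in_word(byte_addr):
    return byte_addr % WORD_BYTES       # same as byte_addr & 0b11


def block_number(byte_addr):
    return byte_addr // BLOCK_BYTES     # same as byte_addr >> 5


print(bin(word_address(ADDR)))
print(bin(byte_offset_in_word(ADDR)))
print(bin(block_number(ADDR)))
```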
We start with a very simple cache organization, one that was used on the Decstation 3100, a 1980s workstation. In this design cache lines (and hence memory blocks) are one word long.
Also in this design each memory block can only go in one specific cache line.
cache block number) is the memory block number modulo the number of blocks in the cache.
set associative caches we will soon study.
We shall assume that each memory reference issued by the processor is for a single, complete word. This assumption holds for the MIPS subset we implemented since the only memory accesses were lw and sw. The full MIPS ISA, however, includes instructions that reference bytes and halfwords.
On the right is a diagram representing a direct mapped cache with 4 blocks and a memory with 16 blocks.
How can we find a memory block in such a cache? This is actually two questions in one.
The second question is the easier. Let C be the number of blocks in the cache. Then memory block number N can be found only in cache line number N mod C (it might not be present at all).
But many memory blocks are assigned to that same cache line. For example, in the diagram above all the green blocks in memory are assigned to the one green block in the cache.
So the first question reduces to:
Is memory block N present in cache block N mod C?
Referring to the diagram we note that, since only a green memory block can appear in the green cache block, we know that the last two digits of the number of the memory block in the green cache block are 10 (the number of the green cache block). So to determine if a specific green memory block is in the green cache block we need the rest of the memory block number. Specifically, is the memory block in the green cache block 0010, 0110, 1010, or 1110? It is also possible that the green cache block is empty (called invalid), i.e., it is possible that no memory block is in this cache block.
rest of the address (i.e., red digits lost when we reduced the block number modulo the size of the cache) to see if the block in the cache is the memory block of interest. That number is N/C, using the terminology above.
When the system is powered on, all the cache blocks are invalid so all the valid bits are off.
Addr(10) | Addr(2) | hit/miss | block# |
---|---|---|---|
22 | 10110 | miss | 110 |
26 | 11010 | miss | 010 |
22 | 10110 | hit | 110 |
26 | 11010 | hit | 010 |
16 | 10000 | miss | 000 |
3 | 00011 | miss | 011 |
16 | 10000 | hit | 000 |
18 | 10010 | miss | 010 |
On the right is an example from the book (page 460). It refers to figure 5, which is an enlarged version of the example diagram above. Figure 5 has C=8 (rather than 4) and M=32 (rather than 16).
In both the diagram above and the example from the book, we have M/C=4 memory blocks eligible to be stored in each cache block. Thus there are two tag bits for each cache block.
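The table from the book can be reproduced with a tiny simulation. This is a sketch of my own, not the book's code: the cache is modeled as a list holding one tag per cache block (None meaning invalid), and memory block N maps to cache block N mod C with tag N / C (integer division).

```python
def simulate(block_addresses, num_cache_blocks=8):
    """Simulate a direct-mapped cache; return a (hit/miss, cache block #) per reference."""
    tags = [None] * num_cache_blocks    # None means the cache block is invalid
    results = []
    for n in block_addresses:
        index = n % num_cache_blocks    # which cache block to look in
        tag = n // num_cache_blocks     # the rest of the memory block number
        hit = tags[index] == tag
        if not hit:
            tags[index] = tag           # load the block, evicting any old one
        results.append(("hit" if hit else "miss", index))
    return results


for addr, (outcome, index) in zip([22, 26, 22, 26, 16, 3, 16, 18],
                                  simulate([22, 26, 22, 26, 16, 3, 16, 18])):
    print(addr, format(addr, "05b"), outcome, format(index, "03b"))
```

The last reference (18) misses because it maps to the same cache block as 26 but has a different tag.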
Shown on the right is an eight-entry, direct-mapped cache with block size one word. As usual all references are for a single word. In order to make the diagram and arithmetic smaller the machine has only 10-bit addressing, instead of our usual 32-bit addressing. Above the cache we see a 10-bit address issued by the processor.
There are several points to go over.
The circuitry needed for a simple cache (direct mapped, block size 1 word, all references to 1 word) is shown on the right. The only difference from the example above is size. This cache holds 1024 blocks (not just 8) and the memory holds 2^{30}∼1,000,000,000 blocks (not just 32). That is, the cache size is 4KB and the memory size is 4GB.
Determining whether we have a hit or a miss, and returning the data in case of a hit, is quite easy, as the circuitry indicates.
Make sure you understand the division of the 32 bit address into 20, 10, and 2 bits.
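If it helps, here is a small sketch of that division in Python. The function name `split_address` and the sample address are my own choices for illustration; the field widths (20-bit tag, 10-bit cache index, 2-bit byte offset) are the ones in the diagram.

```python
def split_address(addr, index_bits=10, offset_bits=2):
    """Split a 32-bit byte address into (tag, cache index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)                # 2 low-order bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # next 10 bits
    tag = addr >> (offset_bits + index_bits)                # remaining 20 bits
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), index, offset)
```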
Calculate on the board the total number of bits in this cache and the number used to hold data.
Homework: 5.3.1. Calculate the total number of bits in the 5.3.1 cache and the number used to hold data.
The action required for a hit is clear, namely return to the processor the data found in the cache.
For a miss, the best action is fairly clear, but requires some thought.
We just need to note a few points.
Processing a write for our simple cache (direct mapped with block size = reference size = 1 word).
We have 4 possibilities: For a write hit we must choose between Write through and Write back. For a write miss we must choose between write-allocate and write-no-allocate (also called store-allocate and store-no-allocate and other names).
Write through: Write the data to memory as well as to the cache.
With a write-through cache policy, both the memory and the cache are always up-to-date.
Write back: Don't write to memory now, do it later when this cache block is evicted.
With a write back policy, the cache is always up-to-date, but the memory can be stale (contain an out-of-date value).
The fact that an eviction must trigger a write to memory for write-back caches explains the comment above that the write hit policy affects the read miss policy.
Write-allocate: Allocate a slot and write the new data into the cache (recall we have a write miss). The handling of the eviction this allocation (probably) causes depends on the write hit policy.
Write-no-allocate: Leave the cache alone and just write the new data to memory.
Write-no-allocate is normally not as effective as write-allocate, due to temporal locality.
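To get a feel for the difference between the two write hit policies, here is a toy model of my own (the class `TinyCache` is hypothetical, not anything from the book). It assumes a direct-mapped cache with one-word blocks and write-allocate, and it only counts how many writes reach memory.

```python
class TinyCache:
    """Direct-mapped, one-word blocks, write-allocate; counts memory writes."""

    def __init__(self, num_blocks, write_back):
        self.num_blocks = num_blocks
        self.write_back = write_back
        self.lines = {}            # cache index -> [tag, dirty]
        self.memory_writes = 0

    def write(self, block_number):
        index = block_number % self.num_blocks
        tag = block_number // self.num_blocks
        line = self.lines.get(index)
        if line is None or line[0] != tag:       # write miss: allocate
            if line is not None and line[1]:     # evicted block is dirty
                self.memory_writes += 1          # write it back now
            line = self.lines[index] = [tag, False]
        if self.write_back:
            line[1] = True                       # defer the memory write
        else:
            self.memory_writes += 1              # write through immediately

    def flush(self):
        """Write back every dirty block (e.g., at program end)."""
        for line in self.lines.values():
            if line[1]:
                self.memory_writes += 1
                line[1] = False


writes = [5] * 10 + [13]          # 5 and 13 share a cache block when C=8
through, back = TinyCache(8, write_back=False), TinyCache(8, write_back=True)
for block in writes:
    through.write(block)
    back.write(block)
back.flush()
print(through.memory_writes, back.memory_writes)
```

With repeated writes to the same block, write-back sends far fewer writes to memory, at the cost of memory being stale in between.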
Start Lecture #20
Remark: I added a section Cache Contents, Hits, and Misses to lecture 19, which we will cover now.
The simplest write policy is write-through, write-allocate. The DECstation 3100 discussed above adopted this policy and performed the following actions for any write, hit or miss (recall that, for the 3100, block size = reference size = 1 word and the cache is direct mapped).
Although the above policy has the advantage of simplicity, it is out of favor due to its poor performance.
Given a fixed total size (in bytes) for the cache, is it better to have two caches, one for instructions and one for data; or is it better to have a single unified cache?
A unified cache offers load balancing. If the current program needs more data references than instruction references, the cache will accommodate. Similarly if more instruction references are needed.
The setup we have described does not take any advantage of spatial locality. The idea of having a multiword block size is to bring into the cache words near the referenced word since, by spatial locality, they are likely to be referenced in the near future.
We continue to assume (for a while) that the cache is direct mapped and that all references are for one word.
The book's terminology for byte offset and block offset is inconsistent. The byte offset gives the offset of the byte within the word so the offset of the word within the block should be called the word offset, but alas it is called the block offset in the 2e, 3e, and 4e. I don't know if this is standard terminology or a long standing typo in all three editions. I wrote to Patterson, who basically agreed. I will try to use the longer but clearer term word-in-block for the offset of the word in the block.
The figure to the right shows a 64KB direct mapped cache with 4-word (16-byte) blocks.
What addresses in memory are in the block and where in the cache do they go?
Show from the diagram how this gives the red portion for the tag and the green portion for the index or cache block number.
Consider the cache shown in the diagram above and a reference to word 17003.
Summary: Memory word 17003 resides in word 3 of cache block 154, with the tag of entry 154 set to 1 and the valid bit of entry 154 true.
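The arithmetic behind that summary is just two divisions with remainder. Here is a sketch (the names `locate_word`, `WORDS_PER_BLOCK`, and `NUM_CACHE_BLOCKS` are mine); it assumes the 64KB direct-mapped cache with 4-word (16-byte) blocks from the diagram.

```python
WORDS_PER_BLOCK = 4
NUM_CACHE_BLOCKS = 64 * 1024 // 16    # 64KB of data / 16 bytes per block


def locate_word(word_number):
    """Return (tag, cache block, word-in-block) for a memory word number."""
    block_number, word_in_block = divmod(word_number, WORDS_PER_BLOCK)
    tag, cache_block = divmod(block_number, NUM_CACHE_BLOCKS)
    return tag, cache_block, word_in_block


print(locate_word(17003))   # (1, 154, 3)
```

Word 17003 is in memory block 4250 (17003 divided by 4, remainder 3), and 4250 divided by 4096 is 1 with remainder 154.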
The cache size or cache capacity is the size of the data portion of the cache (normally measured in bytes).
For the caches we have seen so far this is the block size times the number of entries. For the diagram above this is 64KB. For the simpler direct mapped caches block size = word size so the cache size is the word size times the number of entries.
Note that the total size of the cache includes all the bits. Everything except for the data portion is considered overhead since it is not part of the running program.
For the caches we have seen so far the total size is
(block size + tag size + 1) * the number of entries
Let's compare the pictured cache with another one containing 64KB of data, but with one word blocks.
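Here is one way to carry out the comparison using the total-size formula above. This is a sketch under my usual assumptions (32-bit byte addresses, 4-byte words, direct mapped, one valid bit per entry); the function name `direct_mapped_bits` is made up for illustration.

```python
def direct_mapped_bits(data_bytes, words_per_block, address_bits=32):
    """Return (total bits, data bits) for a direct-mapped cache."""
    entries = data_bytes // (4 * words_per_block)
    index_bits = entries.bit_length() - 1            # log2, power of 2 assumed
    word_in_block_bits = words_per_block.bit_length() - 1
    tag_bits = address_bits - index_bits - word_in_block_bits - 2
    block_bits = 32 * words_per_block
    total = entries * (block_bits + tag_bits + 1)    # +1 for the valid bit
    return total, entries * block_bits


for words in (4, 1):
    total, data = direct_mapped_bits(64 * 1024, words)
    print(f"{words}-word blocks: {total} total bits, {data} data bits")
```

Both caches hold the same 524,288 data bits, but the one-word-block cache has four times as many entries and hence four times as many tags and valid bits.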
Homework: 5.3.1 and 5.3.2. Also calculate the total number of bits in each of the caches and the number used to hold data.
How do we process read/write hits/misses for a cache with multiword blocks?
Why not make the block size enormous? For example, why not have the cache be one huge block?
Recall that our processor fetches one word at a time and our memory produces one word per request. With a large block size cache, the processor still requests one word and the cache still responds with one word. However the cache requests a multiword block from memory and to date our memory is only able to respond with a single word.
The question is, "Which pieces and buses should be narrow (one word) and which ones should be wide (a full block)?" (The same question arises when the cache requests that the memory store a block and the answers are the same, so we will only consider the case of reading the memory.)
Since the processor is only requesting a single word, a wide bus between the cache and processor seems silly. The processor would then need a mux to discard the other words.
The question we want to consider is whether the memory should be wide. That is, should the memory have enough pins and the bus enough wires so that the entire block can be transferred at once?
We make the following timing assumptions.
Consider the three designs shown on the right. The left one assumes the memory delivers one word at a time and the bus is 1-word wide. This is the most economical design.
The middle design has a wide memory that can deliver an entire (4-word) block at one time and has a block-wide bus that can deliver the entire block to the cache in one cycle.
The rightmost design has four word-wide memories that are interleaved and thus can together produce a 4-word block at one time. However, the bus can only deliver one word at a time to the cache.
The question is how long does it take to satisfy a read miss for the cache above and each of the three memory/bus systems.
Interleaving works great because in this case we are guaranteed to have sequential accesses.
Imagine a design between (a) and (b) with a 2-word wide datapath. It takes 33 cycles and is more expensive to build than (c).
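The read-miss times for the various designs can be tabulated with a short sketch. Be careful: the three timing constants below are an assumption on my part (1 cycle to send the address, 15 cycles per memory access, 1 cycle per bus transfer); I chose them because they reproduce the 33 cycles quoted above for the 2-word design. The function name `miss_penalty` is also mine.

```python
ADDRESS_CYCLES = 1     # assumed: cycles to send the address to memory
ACCESS_CYCLES = 15     # assumed: cycles for each memory access
TRANSFER_CYCLES = 1    # assumed: cycles for each bus transfer


def miss_penalty(block_words, words_per_access, words_per_transfer):
    """Cycles to fetch one block; sizes are assumed to divide evenly."""
    accesses = block_words // words_per_access
    transfers = block_words // words_per_transfer
    return (ADDRESS_CYCLES
            + accesses * ACCESS_CYCLES
            + transfers * TRANSFER_CYCLES)


designs = {
    "narrow (a)": (1, 1),        # 1-word memory, 1-word bus
    "wide (b)": (4, 4),          # block-wide memory and bus
    "interleaved (c)": (4, 1),   # 4 banks accessed in parallel, 1-word bus
    "2-word wide": (2, 2),
}
for name, (per_access, per_transfer) in designs.items():
    print(name, miss_penalty(4, per_access, per_transfer))
```

Under these assumptions the interleaved design is nearly as fast as the wide one, since the four banks overlap their accesses and only the four bus transfers are serialized.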
Homework: Assume the block size is 8 words. How long would an access take for a narrow, wide, and interleaved design? How long for a 2-word wide design and for a 4-word design.
Do the following performance example on the board. It would be an appropriate final exam question.
Start Lecture #21
double speed machine? It would be double speed if the miss penalty were 0 or if there was a 0% miss rate.
A lower base (i.e. miss-free) CPI makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to losing more instructions if the CPI is lower.
A faster CPU (i.e., a faster clock) makes stalls appear more expensive since waiting a fixed amount of time for the memory corresponds to more cycles if the clock is faster (and hence more instructions since the base CPI is the same).
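The first of these two observations can be checked numerically. The sketch below is mine and all the parameter values are invented for illustration (they are not the ones from the board example or the homework); the function name `effective_cpi` is likewise hypothetical.

```python
def effective_cpi(base_cpi, icache_miss_rate, dcache_miss_rate,
                  mem_ref_fraction, miss_penalty):
    """Base CPI plus the average memory stall cycles per instruction."""
    instruction_stalls = icache_miss_rate * miss_penalty
    data_stalls = mem_ref_fraction * dcache_miss_rate * miss_penalty
    return base_cpi + instruction_stalls + data_stalls


# Hypothetical values: 1% I-cache misses, 4% D-cache misses,
# 25% of instructions reference memory, 100-cycle miss penalty.
for base in (2.0, 1.0):
    cpi = effective_cpi(base, 0.01, 0.04, 0.25, 100)
    print(f"base CPI {base}: CPI with stalls {cpi:.2f}, "
          f"slowdown {cpi / base:.2f}x")
```

The stall cycles per instruction are the same in both cases, so the machine with the lower base CPI loses a larger fraction of its performance to misses.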
Another performance example.
Homework: Consider a system that has a miss-free CPI of 2, a D-cache miss rate of 5%, an I-cache miss rate of 2%, has 1/3 of the instructions referencing memory, and has a memory that gives a miss penalty of 20 cycles. The clock speed stays the same throughout this problem.
Remark: Larger caches have longer hit times.
Consider the following sad story. Jane has a cache that holds 1000 blocks and has a program that only references 4 (memory) blocks, namely 23, 1023, 123023, and 7023. In fact the references occur in order: 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, 23, 1023, 123023, 7023, etc. Referencing only 4 blocks and having room for 1000 in her cache, Jane expected an extremely high hit rate for her program. In fact, the hit rate was zero. She was so sad, she gave up her job as webmistress, went to medical school, and is now a brain surgeon at the Mayo Clinic in Rochester, MN.
So far we have studied only direct mapped caches, i.e., those for which the location in the cache is determined by the address. Since there is only one possible location in the cache for any block, to check for a hit we compare one tag with the high-order bits (HOBs) of the address.
The other extreme is a fully associative cache.
Most common for caches is an intermediate configuration called set associative or n-way associative (e.g., 4-way associative). The value of n is typically 2, 4, or 8.
If the cache has B blocks, we group them into B/n sets each of size n. Since an n-way associative cache has sets of size n blocks, it is often called a set size n cache. For example, you often hear of set size 4 caches.
In a set size n cache, memory block number K is stored in set K mod the number of sets, which equals K mod (B/n).
Recall that for a direct-mapped cache, the cache index gives the number of the block in the cache. For a set-associative cache, the cache index gives the number of the set.
Just as the line number for a direct-mapped cache is the memory block number mod the number of blocks in the cache, the set number equals the (memory) block number mod the number of sets.
Just as the tag for a direct mapped cache is the memory block number divided by the number of blocks, the tag for a set-associative cache is the memory block number divided by the number of sets.
Do NOT make the mistake of thinking that a set size 2 cache has 2 sets; it has B/2 sets each of size 2.
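To see the set number and tag formulas at work, here is a sketch that replays Jane's reference string on her 1000-block cache with different set sizes. The function name `hit_count` is mine, and I am assuming LRU replacement within a set (we have not yet discussed which victim to choose, so treat that as an assumption).

```python
from collections import OrderedDict


def hit_count(block_numbers, total_blocks, ways):
    """Count hits in a cache of total_blocks blocks with sets of size ways."""
    num_sets = total_blocks // ways            # B/n sets, each of size n
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for k in block_numbers:
        set_number = k % num_sets              # memory block mod number of sets
        tag = k // num_sets                    # memory block / number of sets
        blocks_in_set = sets[set_number]
        if tag in blocks_in_set:
            hits += 1
            blocks_in_set.move_to_end(tag)     # mark as most recently used
        else:
            if len(blocks_in_set) == ways:
                blocks_in_set.popitem(last=False)   # evict least recently used
            blocks_in_set[tag] = None
    return hits


trace = [23, 1023, 123023, 7023] * 4
for ways in (1, 2, 4):
    print(f"set size {ways}: {hit_count(trace, 1000, ways)} hits "
          f"out of {len(trace)} references")
```

All four memory blocks land in set 23 in each case. With set size 1 (direct mapped) or 2 every reference misses, but with set size 4 only the first four references miss.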
Ask in class.
Why is set associativity good? For example, why is 2-way set associativity better than direct mapped?
How do we find a memory block in a set associative cache with block size 1 word?
Recall that a 1-way associative cache is a direct mapped cache and that an n-way associative cache, for n the number of blocks in the cache, is a fully associative cache.
The advantage of increased associativity is normally an increased hit ratio.
What are the disadvantages?
Answer: It is slower and a little bigger due to the extra logic.
Start Lecture #22
Remark: Lab 7 is assigned. It is due in 1.5 NYU weeks, i.e., 3 lectures, 6 December 2011.
Remark: Last time, after doing the 2-way set associative example, I should have redone it in class for 4-way set associative. Let's do it now. It is right before Determining the Set Number and the Tag.
This is a fairly simple combination of the two ideas and is illustrated by the diagram on the right.
The data coming out of the original multiplexor at the bottom right is a block. In the diagram, the block is 4 words.
When an existing block must be replaced, which victim should we choose? We ask the exact same question (with different words) when we study demand paging in 202.
How Big Is a Cache?
There are two notions of size.
Definition: The cache size is the capacity of the cache.
Another size is the total number of bits in the cache, which includes tags and valid bits. For the example above with a 4-way associative, 1-word block cache, this size is computed as follows.
For this cache, what fraction of the bits are user data?
Ans: 4KB / 55Kb = 32Kb / 55Kb = 32/55.
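The 55Kb figure can be reproduced as follows. This is a sketch of mine (the function name `set_associative_bits` is hypothetical); it assumes 32-bit byte addresses, one-word (4-byte) blocks, and one valid bit per block.

```python
def set_associative_bits(data_bytes, ways, address_bits=32):
    """Return (tag bits, total bits) for a cache with one-word blocks."""
    blocks = data_bytes // 4                 # one 4-byte word per block
    sets = blocks // ways
    index_bits = sets.bit_length() - 1       # log2, power of 2 assumed
    tag_bits = address_bits - index_bits - 2   # 2 bits give byte in word
    total_bits = blocks * (32 + tag_bits + 1)  # data + tag + valid
    return tag_bits, total_bits


tag_bits, total_bits = set_associative_bits(4 * 1024, ways=4)
print(tag_bits, total_bits, total_bits // 1024)   # 22 56320 55
```

The 4KB cache has 1024 blocks in 256 sets, so the index is 8 bits, the tag is 22 bits, and each block needs 32+22+1 = 55 bits.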
Calculate in class the equivalent fraction for the last diagrammed cache, having 4-word blocks (and still 4-way set associative).
We continue to assume a byte-addressed machine with all references to a 4-byte word.
The 2 LOBs are not used (they specify the byte within the word, but all our references are for a complete word). We show these two bits in light blue. We continue to assume 32-bit addresses so there are 2^{30} words in the address space.
Let us review various possible cache organizations and determine for each the tag size and how the various address bits are used. We will consider four configurations each a 16KB cache. That is the size of the data portion of the cache is 16KB = 4 kilowords = 2^{12} words.
On the board calculate, for each of the four caches, what is the overhead percentage.
Homework: Redo the four caches above with the size of the cache 64KB (instead of 16KB) determining the number of bits in each portion of the address as well as the overhead percentages.
Start Lecture #23
Modern high end PCs and workstations all have at least two levels of caches: A very fast, and hence not very big, first level (L1) cache together with a larger but slower L2 cache.
When a miss occurs in L1, L2 is examined and only if a miss occurs there is main memory referenced.
So the average miss penalty for an L1 miss is
(L2 hit rate)*(L2 time) + (L2 miss rate)*(L2 time + memory time)
We are assuming that L2 time is the same for an L2 hit or L2 miss and that the main memory access doesn't begin until the L2 miss has occurred.
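As a quick sanity check of the formula, here it is in Python with made-up numbers (the function name `l1_miss_penalty` and the values 90%, 10 cycles, and 100 cycles are mine, chosen only for illustration).

```python
def l1_miss_penalty(l2_hit_rate, l2_time, memory_time):
    """Average penalty of an L1 miss, per the formula above."""
    l2_miss_rate = 1 - l2_hit_rate
    return (l2_hit_rate * l2_time
            + l2_miss_rate * (l2_time + memory_time))


# Hypothetical: 90% L2 hit rate, L2 time 10 cycles, memory time 100 cycles.
print(l1_miss_penalty(0.90, 10, 100))
```

Since the L2 time is paid on every L1 miss, the formula simplifies to (L2 time) + (L2 miss rate)*(memory time), which for these numbers is 10 + 0.1*100 = 20 cycles.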
Do this example on the board (a reasonable exam question, but a little long since it has so many parts).
Assume
Calculate
Our company's current product has the following characteristics
Homework: Redo example 2 with a memory access time of 50ns.
I realize this material is covered in operating systems class (CSCI-UA.0202). I am just reviewing it here. The goal is to show the similarity to caching, which we just studied. Indeed, (the demand part of) demand paging is caching: In demand paging the memory serves as a cache for the disk, just as in caching the cache serves as a cache for the memory.
However, the names used are different and there are other differences as well.
Cache concept | Demand paging analogue |
---|---|
Memory block | Page |
Cache line | Page Frame (frame) |
Block Size | Page Size |
Tag | None (table lookup) |
Word in block | Page offset |
Valid bit | Valid bit |
Miss | Page fault |
Hit | Not a page fault |
Miss rate | Page fault rate |
Hit rate | 1 - Page fault rate |
Placement question | Placement question |
Replacement question | Replacement question |
Associativity | None (fully associative) |
For both caching and demand paging, the placement question does not have serious performance implications since the items are fixed size (no first-fit, best-fit, buddy, etc) as are the slots into which they are placed.
The replacement question, in contrast, is quite important for performance. Indeed, we spend significant time discussing replacement strategies in 202. Approximations to LRU (least recently used) are popular for both caching and demand paging. However, cache approximations are very crude since miss processing must be very fast and cannot involve a long calculation.
The cost of a page fault vastly exceeds the cost of a cache miss so it is worthwhile in paging to slow down hit processing to lower the miss rate. Hence demand paging is fully associative and uses a table to locate the frame in which the page is located.
The figure on the right and the one below it both indicate the translation of page numbers into frame numbers, the latter showing the involvement of the page table. Although both figures are worded in terms of demand paging they can be interpreted for caching as well by essentially changing the names of certain concepts and realizing that demand paging corresponds to the extreme of a fully-associative cache.
In this section, choose the first element of each parenthesized pair for caching, and choose the second for paging.
Question: On a (write hit / write to an in-memory page) should we write the new value through to (memory/disk) or just keep it in the (cache/memory) and write it back to (memory/disk) when the (cache-line/page) is replaced?
A TLB is a cache of the page table. It is there for the same reason as any cache: the page table is too big to access fast enough, so we maintain a subset that can be accessed quickly and (we hope) has few misses.
Without a TLB, every memory reference in the program would require two memory references, one to read the page table and one to read the requested memory word.
This would be an unacceptable performance loss and hence a TLB is crucial for a system with paging.
For now, we ignore the cache and just look at the TLB, pages, and frames. The diagram on the right shows the three possibilities, color-coded to indicate their relative speeds.
Typical TLB parameter values
Real systems have TLBs, page tables, and caches. Since the caches are based on real memory addresses the cache can be accessed only after the TLB or page table has converted the virtual address (page number + offset) to the real address (frame number + offset). In some systems, caches are accessed by virtual address (page number + offset), but we will ignore this possibility.
The diagram on the right is based on the DECstation 3100, which is perhaps the simplest possible design. The 3100 had the following parameter values.
Actions taken
TLB | Page | Cache | Remarks |
---|---|---|---|
hit | hit | hit | Possible, but page table not checked on TLB hit, data from cache |
hit | hit | miss | Possible, but page table not checked, cache entry loaded from memory |
hit | miss | hit | Impossible, TLB references only in-memory pages |
hit | miss | miss | Impossible, TLB references only in-memory pages |
miss | hit | hit | Possible, TLB entry loaded from page table, data from cache |
miss | hit | miss | Possible, TLB entry loaded from page table, cache entry loaded from memory |
miss | miss | hit | Impossible, cache is a subset of memory |
miss | miss | miss | Possible, page fault brings in page, TLB entry loaded, cache loaded |
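The possible/impossible column follows from just two rules stated in the table: the TLB references only in-memory pages, and the cache is a subset of memory. Here is a sketch that enumerates all eight cases from those rules (the function name `classify` is mine).

```python
from itertools import product


def classify(tlb_hit, page_in_memory, cache_hit):
    """Apply the two rules from the table to one combination."""
    if tlb_hit and not page_in_memory:
        return "impossible: TLB references only in-memory pages"
    if cache_hit and not page_in_memory:
        return "impossible: cache is a subset of memory"
    return "possible"


def name(hit):
    return "hit" if hit else "miss"


for tlb, page, cache in product([True, False], repeat=3):
    print(f"{name(tlb):4s} | {name(page):4s} | {name(cache):4s} | "
          f"{classify(tlb, page, cache)}")
```

Note that when both rules are violated (hit, miss, hit) the sketch reports the TLB rule, as the table does.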
Disk accesses are extremely expensive, which dictates many choices made for demand paging. It also explains why choices that are good for caching (where a miss costs a few tens of nanoseconds), although valid for demand paging, are not good choices there (where the miss penalty is several milliseconds). In particular, demand paging implementations make the following choices.
1. L1 cache | a. Not a cache |
2. L2 cache | b. A cache for a cache |
3. Main memory | c. A cache for disks |
4. TLB | d. A cache for main memory |
5. Page Table | e. A cache for page table entries |
Do the following two problems in class.
Feature | Typical values for L1 caches | Typical values for L2 caches | Typical values for demand paging | Typical values for TLBs |
---|---|---|---|---|
Size | 16KB-64KB | 500KB-4MB | 1GB-1TB | 256B-16KB |
Block size | 16B-64B | 64B-128B | 4KB-64KB | 4B-32B |
Miss penalty in clocks | 10-25 | 100-1000 | 10M-100M | 10-1000 |
Miss rate | 2%-5% | 0.1%-2% | 0.00001%-0.0001% | 0.01%-2% |
This question has two parts.
Associativity | Location method | Comparisons Required |
---|---|---|
Direct mapped | Index | 1 |
Set Associative | Index the set, search among elements | Degree of associativity |
Full | Search all entries | Number of entries |
Full | Separate lookup table | 0 |
The difference in sizes and costs for demand paging vs. caching leads to different algorithms for finding the block. Demand paging always uses the bottom row with a separate table (page table) but caching never uses such a table.
If no possible slots are available, which victim should be chosen?
I call this the replacement question; it is much studied in demand paging.
Start Lecture #24
Remark: A second L2 cache example and a related homework problem were added. Do the example and assign the homework.
Peripherals are varied; indeed they vary widely in many dimensions, e.g., cost, physical size, purpose, capacity, transfer rate, response time, support for random access, connectors, and protocol.
Consider just transfer rate for the moment.
The text mentions three especially important characteristics which can be used to classify peripherals.
Probably the most important quality metric for I/O is not performance but how frequently data is irretrievably corrupted. We will soon discuss RAID, a technique to improve this metric.
There are at least three ways to measure I/O performance
startup overhead for each request.
Do not make the error of thinking that the 3rd metric is simply the reciprocal of the second. It takes the post office at least one day to deliver a letter from here to California, but I can send one every minute if I wish. This is another example of pipelining.
A system alternates between two states of delivered service
Transitioning from the first state to the second is called a failure. Transitioning from the second state to the first is called a restoration.
Reliability measures the length of time during which service is continuously delivered as expected.
An example reliability measure is mean time to failure (MTTF), which measures the average length of time that the system is delivering service as expected. Bigger values are better.
Another important measure is mean time to repair (MTTR), which measures how long the system is not delivering service as expected. Smaller values are better.
Finally we have mean time between failures (MTBF). MTBF = MTTF + MTTR.
One might think that having a large MTBF is good, but that is not necessarily correct. Consider a system with a certain MTBF and simply have the repair center deliberately add an extra 1 hour to the repair time and poof the MTBF goes up by one hour!
Devices are quite varied and their data rates vary enormously.
Show a real disk opened up and illustrate the components.
The time for a disk access has five components, of which we concentrate on the first three.
Today seek times are typically 3-8ms on average. It takes longer to go all the way across the disk but it does not take twice as long to go twice as far (the head must accelerate, decelerate, and settle on the track).
How should we calculate the average?
Since disks have just one arm the average rotational latency is half the time of a revolution, and is thus determined by the RPM (revolutions per minute) of the disk.
Disks today spin at 5400-15,000 RPM; they used to all spin at 3600 RPM.
Calculate on the board the average rotational latency of a 3600 RPM disk.
Homework: What is the average rotational latency for a 5400 RPM disk, a 10,000 RPM disk, and a 15,000 RPM disk?
You might consider the other four times all overhead since it is the transfer time during which the data is actually being supplied.
The transfer rate is typically tens of MB per second, sometimes over 100MB/sec. Given the rate, which is determined by the disk in use, the transfer time is proportional to the length of the request.
Some manufacturers quote a much higher rate, but that is for cache hits. In addition to supplying data much sooner, the electronic cache can transfer data at a higher rate than the mechanical disk.
Consider a disk with a 5ms seek time, a transfer rate of 80MB/sec, and a rotational rate of 10,000 RPM. Calculate on the board how long it takes for a 1K block request to complete. What overall transfer rate (bytes delivered / total time) was achieved?
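Here is a sketch of the calculation as the sum of seek time, average rotational latency, and transfer time. The function names are mine, and I am making several assumptions: the seek time is meant in milliseconds (seek times are typically 3-8ms, as noted above), 1K means 1024 bytes, 1MB means 10^6 bytes, and the remaining two components of the access time are taken to be zero.

```python
def average_rotational_latency_ms(rpm):
    """Half a revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def access_time_ms(seek_ms, rpm, transfer_mb_per_sec, request_bytes):
    """Seek + average rotational latency + transfer time, in milliseconds."""
    transfer_ms = request_bytes / (transfer_mb_per_sec * 1_000_000) * 1000
    return seek_ms + average_rotational_latency_ms(rpm) + transfer_ms


def overall_rate_bytes_per_sec(seek_ms, rpm, transfer_mb_per_sec,
                               request_bytes):
    """Bytes delivered divided by total time."""
    total_ms = access_time_ms(seek_ms, rpm, transfer_mb_per_sec,
                              request_bytes)
    return request_bytes / (total_ms / 1000)


print(average_rotational_latency_ms(3600))           # the 3600 RPM disk
print(access_time_ms(5, 10_000, 80, 1024))           # the 1K request
print(overall_rate_bytes_per_sec(5, 10_000, 80, 1024))
```

For a small request the total is about 8ms, nearly all of it seek and rotational latency, so the overall transfer rate is a tiny fraction of the 80MB/sec the disk can sustain.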
Start Lecture #25
Remark: Assign the homework problem above about rotational latency and do section 6.2.
Homework: Consider a disk with a 6ms seek time, a transfer rate of 60MB/sec, and a rotational rate of 10,000 RPM. How long does a request for a 100K block require to complete? A 10MB block? What overall transfer rates (bytes delivered / total time) were achieved in each case?
Not much to say. It is typically small. We will use 0ms (i.e., ignore this time).
This can be the largest component, but we will ignore it since it is not a function of the architecture, but rather of the load and OS.
Remark: I am doing 6.9 now since it concerns disks.
The acronym RAID was coined by Patterson and his students as an abbreviation for Redundant Array of Inexpensive Disks. Now it is often redefined as Redundant Array of Independent Disks.
RAID comes in several flavors often called levels.
To increase performance, rather than reliability and availability, it is a good idea to stripe or interleave blocks across several disks. In this scheme block n is stored on disk n mod k, where k is the number of disks. The quotient n/k is called the stripe number. For example, if there are 4 disks, stripe number 0 (the first stripe) consists of block 0, which is stored on disk 0, block 1 stored on 1, block 2 stored on 2, and block 3 stored on 3. Stripe 1 (like all stripes in this example) also contains 4 blocks. The first one is block 4, which is stored on disk 0.
Striping is especially good if one is accessing full stripes in which case all the blocks in the stripe can be read or written concurrently.
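The block-to-disk mapping is a single division with remainder. A sketch (the function name `locate_block` is mine), using the 4-disk example:

```python
def locate_block(n, k):
    """Return (disk, stripe number) for block n striped across k disks."""
    stripe, disk = divmod(n, k)    # stripe = n/k (integer quotient), disk = n mod k
    return disk, stripe


for n in range(8):
    disk, stripe = locate_block(n, 4)
    print(f"block {n}: disk {disk}, stripe {stripe}")
```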
The base, non-redundant, case from which the others are built. Typically the data is striped across several disks. Since RAID 0 has no redundancy, it offers no reliability advantage. It does permit large (multi-block) I/Os to use multiple disks and hence to finish faster.
Two disks containing the same content.
Often called ECC (error correcting code or error checking and correcting code). Widely used in RAM, not used in RAID.
Normally byte-interleaved or several-byte-interleaved. For most applications, RAID 4 is better.
RAID 4 combines striping and parity. In addition to the k so-called data disks used in striping, one has a single parity disk that contains the parity of the stripe.
Consider all k data blocks in one stripe. Extend this stripe to k+1 blocks by including the corresponding block on the parity disk. The block on the parity disk is calculated as the bitwise exclusive OR of the k data blocks.
Thus a stripe contains k data blocks and one parity block, which is the exclusive OR of the data blocks.
The great news is that any block in the stripe, parity or data, is the exclusive OR of the other k. This means we can survive the failure of any one disk.
For example, let k=4 and let the data blocks be A, B, C, and D.
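Here is a sketch showing the parity computation and the recovery of a lost block for this k=4 case. The function name `xor_blocks` and the two-byte block contents are made up for illustration (real blocks are of course much larger).

```python
from functools import reduce


def xor_blocks(blocks):
    """Bitwise exclusive OR of equal-length blocks, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column)
                 for column in zip(*blocks))


# Hypothetical (tiny) data blocks A, B, C, and D.
A, B, C, D = b"\x12\x34", b"\xab\xcd", b"\x0f\xf0", b"\x55\xaa"
P = xor_blocks([A, B, C, D])           # block stored on the parity disk

# Suppose the disk holding C fails: C is the XOR of the other k blocks.
recovered = xor_blocks([A, B, D, P])
print(recovered == C)                  # True
```

The same call recovers any one missing block, data or parity, since each block in the stripe is the exclusive OR of the other k.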
Properties of RAID 4.
Rotate the disk used for parity.
Again using our 4 data-disk example, we continue to put the parity for blocks 0-3 on disk 4 (the fifth disk) but rotate the assignment of which disk holds the parity block of different stripes. In more detail.
RAID 0, RAID 1, and RAID 5 are widely used.
Gives more than single error correction at a higher storage overhead.
Often called a solid-state disk, flash is the latest attempted gap-filler technology, i.e., a technology between RAM and conventional disks. Unlike most past efforts, this one has succeeded to some extent.
Flash is between DRAM and disks in both price and performance: it is cheaper and slower than DRAM, but more expensive and faster than disks. However, the minimal size disk is much larger than the minimal size flash and hence, for devices with a modest memory requirement, flash is cheaper than (as well as faster than) a disk.
Other advantages of flash over disks include lower power, smaller physical size, silence, and shock resistance. These are due to the semiconductor nature of flash implying that it has no moving parts.
Technically flash is a kind of EEPROM, an electrically erasable, programmable read-only memory. Like other EEPROM technologies, but unlike DRAM, flash retains the values stored when power is turned off, a crucial requirement for a disk replacement.
Another typical characteristic of EEPROMs shared by flash is a significantly limited lifetime with respect to writing. A given flash cell can be rewritten many thousands of times, but not millions of times. This is a serious limitation and solid state disks contain software that remaps heavily used flash blocks to other flash cells, a technique called wear leveling.
There are two flavors of flash called NOR and NAND. The former is older technology, but has higher performance primarily because NAND can be read and written only in large blocks. NAND flash is increasingly popular due to its lower price ($4/GB in 2008, compared to $65/GB for NOR). Today (Dec, 2011) NAND is available for about $1/GB, about 10 times the price per byte of a very large disk.
A bus is a shared communication link, using one set of wires to connect many subsystems.
A synchronous bus is clocked.
An asynchronous bus is not clocked.
Consider the situation pictured at right where a device receives an I/O request and then needs to retrieve some data from memory. We are using an asynchronous bus for the memory request and transfer. Recall that this means, neither the device nor the memory knows the speed of its partner and must be prepared for very long or essentially instantaneous responses.
Note that Ack is bidirectional. We must ensure that both sides are never driving (outputting on) this line at the same time. You may think of Ack as two lines, one going in each direction but, in fact, one line is sufficient if tri-state drivers are used. A similar consideration applies to the Data Lines.
We describe below the protocol used between the device and the memory and illustrate on the right a finite state machine used to manage these interactions.
The system is initialized with the memory in the top right state and the device in the top left state. Ack is not asserted by either side. ReadReq, DataRdy, and NewReq are also deasserted.
At some point an external entity (likely the CPU) raises NewReq. Events then proceed as follows.
For a realistic example, on the right is a diagram adapted from the 25 October 1999 issue of Microprocessor Reports on a then brand new Intel chip set, the so called 840.
Bus adaptors have a variety of names, e.g. host adapters, hubs, bridges. The memory controller hub is often called the north bridge and the I/O controller hub is often called the south bridge.
Bus lines (i.e., wires) include those for data, function codes, and device addresses. Data and address are considered data and the function codes are considered control (remember our datapath for MIPS).
Address and data may be multiplexed on the same lines (i.e., first send one then the other) or may be given separate lines. One is cheaper (good) and the other has higher performance (also good). Which is which? Ans: the multiplexed version is cheaper.
These improvements mostly come at the cost of increased expense and/or complexity.
Start Lecture #26
open collector drivers.
pulled up to 5v, i.e., a logical true.
Option | High performance | Low cost |
---|---|---|
bus width | separate addr and data lines | multiplex addr and data lines |
data width | wide | narrow |
transfer size | multiple bus loads | single bus loads |
bus masters | multiple | single |
clocking | synchronous | asynchronous |
Do on the board the following example. Given
rest between bus accesses.
Find
Solution with four word blocks.
Solution with sixteen word blocks
Homework: Redo the last example but do not permit transmitting data to overlap reading more data.
This is an I/O issue and is taught in 202.
This is really an OS issue. Must write/read to/from device registers, i.e. must communicate commands to the controller. Note that a controller normally contains a microprocessor, but when we say the processor, we mean the central processor not the one on the controller.
Should we check periodically or be told when there is something to do? Better yet can we get someone else to do it since we are not needed for the job?
Processor continually checks the device status to see if action is required.
Do on the board the example on pages 676-677
Processor is told by the device when to look. The processor is interrupted by the device.
zero time as it is done in parallel with the instruction execution.
Do on the board the example on pages 681-682.
The processor initiates the I/O operation then something else takes care of it and notifies the processor when it is done (or if an error occurs).
intelligent device controllers, but I prefer not to use anthropomorphic terminology.
Start Lecture #27
We do an example to illustrate the increasing impact of I/O time.
Assume
Calculate
Homework: Redo the above example assuming that CPU and I/O activity can be overlapped, i.e., assume the overall time is MAX(CPU,I/O) rather than SUM(CPU,I/O).
Recall the picture on the right. When we are dealing with disks, the bus adapters between the backplane bus and the various I/O buses are called disk controllers. On each of those I/O buses, one would find disks.
Assume a system with the following characteristics is executing a workload of 64KB reads with 100K instructions between reads.
Find
Solution
Remark: The above analysis was very simplistic. It assumed everything overlapped just right, that the I/Os were not bursty, and that the I/Os conveniently spread themselves across the disks.
Homework: Redo the above with the following parameters (more reflective of 2011 technology). Parameters not mentioned should be given the values in the example and your work should make the same simplistic assumptions that were made in the analysis.
Start Lecture #28
Done right after 6.3
Remark: Review of practice final.
Good luck on the (real) final!