Computer Systems Design   Thurs 5-7pm   room 102 WWH   Fall 1997

Will set up a www page for the course.

Allan Gottlieb   gottlieb@nyu.edu (or gottlieb@cs.nyu.edu or ...)
715 bway rm 1001
212-998-3344   609-951-2707
email is best

Text is Hennessy and Patterson "Computer Organization and Design: The
Hardware/Software Interface".  Available in bookstore.

These notes are available.  They are low quality (a FEATURE).

The main body of the book assumes you know logic design.  I do NOT make
that assumption.  We will start with appendix B, which is logic design
review.  A more extensive treatment of logic design is M. Morris Mano
"Computer System Architecture", Prentice Hall.  We will not need as much
as Mano covers, and it is not a cheap book, so I am not requiring you to
get it.  I will get it put into the library.  My treatment will follow
H&P, not Mano.

Homework vs labs (describe)

Left board for assignments and announcements (will be on www when that
is set up).

HOMEWORK Read B.1

B.2 Gates, Truth Tables and Logic Equations

Digital ==> Discrete

Primarily binary at the hardware (but NOT exclusively).
Use only two voltages -- high and low.
This hides a great deal of engineering.
Must make sure not to sample the signal when it is not in one of these
two states.  Sometimes it is just a matter of waiting long enough
(determines the clock rate, i.e. megahertz).  Other times it is worse
and you must avoid glitches.

Draw sketch of scope trace: square wave, sine wave, real wave.
WE WILL IGNORE THIS.

In English digital (think digit, i.e. finger) => 10, but not in
computers.

Bit = Binary digIT

Instead of saying high voltage and low voltage, we say true and false,
or 1 and 0, or asserted and deasserted.

0 and 1 are complements of each other.

A logic block can be thought of as a black box that takes signals in
and produces signals out.  Two kinds: combinational (or combinatorial)
and sequential.  The first kind doesn't have memory (so is simpler);
the second kind does have memory.
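The combinational/sequential distinction can be sketched in a few lines
of Python (an illustrative sketch, not from the text; the names are
made up):

```python
# Combinational block: a pure function -- output depends only on inputs.
def and_gate(a, b):
    return a & b

# Sequential block: has memory, so the output also depends on what
# happened before.  Here: a 1-bit toggle that flips on every clock tick.
class Toggle:
    def __init__(self):
        self.state = 0          # the stored value
    def clock(self):
        self.state ^= 1
        return self.state

assert and_gate(1, 1) == 1 and and_gate(1, 0) == 0
t = Toggle()
# The same "input" (a clock tick) gives different outputs over time:
assert [t.clock() for _ in range(4)] == [1, 0, 1, 0]
```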
The current value in the memory of the block is called the state of
the block.

We are doing combinational now.  Will do sequential later (few weeks).

TRUTH TABLES

Since comb log has no mem, it is simply a function from its inputs to
its outputs.  The truth table has as columns all inputs and all
outputs.  It has one row for each possible input value(s), and the
output columns have the output for that input.

Let's start with a really simple case: a logic block with one input
and one output.  DRAW IT.  There are two columns (1 + 1) and two rows
(2**1).  How many different truth tables are there for one in and one
out?  Just 4: the constant functions 1 and 0, the identity, and an
inverter (pictures in a few minutes).

OK.  Now how about two inputs and 1 output?  Three columns (2+1) and 4
rows (2**2).  How many are there?  It is just how many ways you can
fill in the output entries.  There are 4 output entries so ans is
2**4=16.

How about 2 in and 8 out?  10 cols; 4 rows; 2**(4*8)=4 billion possible.
3 in and 8 out?  11 cols; 8 rows; 2**(8*8)=2**64 possible.
n in and k out?  n+k cols; 2**n rows; 2**([2**n]*k) possible.
Gets big fast!

Certain logic functions (i.e. truth tables) are quite common and
familiar.  We use a notation that looks like algebra for them and
expressions involving them: Boolean algebra (George Boole).  A Boolean
variable takes on just two values, 1 and 0.  A Boolean function takes
in Boolean variables and produces Boolean values.

1. The (inclusive) OR Boolean function of two variables.  Draw its
   truth table.  This is written + (e.g. X+Y where X and Y are Boolean
   variables) and often called the logical sum.  (Three out of four
   squares look right!)

2. AND.  Draw TT.  Called the logical product and written as a
   centered dot (like product in regular algebra).  All four values
   look right.

3. NOT.  This is a unary operator (one argument, not two like above;
   the two above are called binary).  Written A bar.  Draw TT.

4. Exclusive OR (XOR).  Written as + with a circle around it.
True if exactly one input is true (i.e. true XOR true = false).
Draw TT.

HOMEWORK Consider the Boolean function of 3 Boolean vars that is true
if and only if exactly 1 of the three variables is true.  Draw the TT.

Some manipulation laws.  Remember this is ALGEBRA.

Identity:  A+0 = 0+A = A     A.1 = 1.A = A   (using . for AND)
Inverse:   A+A_ = A_+A = 1   A.A_ = A_.A = 0 (using _ for inverse)

Both + and . are commutative, so we don't need as much as I wrote.
Really funny to call the second an inverse law (you ADD the inverse
and get the identity for PRODUCT).

Associative:  A+(B+C) = (A+B)+C    A.(B.C) = (A.B).C

Due to the assoc law we can write A.B.C since either order of
evaluation gives the same answer.  Often elide the . so the product
assoc law is A(BC) = (AB)C.

Distributive (note BOTH dist laws hold):
    A(B+C) = AB+AC
    A+(BC) = (A+B)(A+C)

How does one prove these laws??  Simple (but long): write the TT.
Do the first dist law.

HOMEWORK: Do the second distributive law.

Do the example on page B-6 (based on the example on page B-5 (fig A)).
For E first use the obvious method of writing one condition for each
1-value in the E column, i.e.
    (A_BC) + (AB_C) + (ABC_)
Observe that E is true if two (but not three) inputs are true, i.e.
    (AB+AC+BC) (ABC)_    (using . higher precedence than +)

My first way of getting E shows that ANY logic function can be written
using just AND, OR, and NOT.  Indeed, it is in a nice form, called two
levels of logic, i.e. it is a sum of products of just inputs and their
complements.

DeMorgan's laws:
    (A+B)_ = A_B_
    (AB)_  = A_+B_

You prove a DM law with a TT.  Indeed that is
HOMEWORK B.6 on page B-45

============== End of Lecture 1 ======================

============== Start of Lecture 2 ====================

Do the beginning of this HW on the board.

With DM we can do quite a bit without resorting to TTs.  For example
one can show that the two expressions for E in the example above (page
B-6) are equal.  Indeed that is
HOMEWORK B.7 on page B-45
Do the beginning of the HW on the board.
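The "write the TT" proof method is easy to mechanize.  A Python sketch
(mine, not from the text) that brute-forces the second distributive law
and both DeMorgan laws over every row of the truth table:

```python
from itertools import product

def equal_on_all_rows(f, g, n):
    """TT proof: two n-input Boolean functions are equal iff they agree
    on every one of the 2**n rows of the truth table."""
    return all(f(*row) == g(*row) for row in product((0, 1), repeat=n))

# Second distributive law: A+(BC) = (A+B)(A+C), all 8 rows.
assert equal_on_all_rows(lambda A, B, C: A | (B & C),
                         lambda A, B, C: (A | B) & (A | C), 3)

# DeMorgan (1 - X plays the role of X_): (A+B)_ = A_B_ and (AB)_ = A_+B_
assert equal_on_all_rows(lambda A, B: 1 - (A | B),
                         lambda A, B: (1 - A) & (1 - B), 2)
assert equal_on_all_rows(lambda A, B: 1 - (A & B),
                         lambda A, B: (1 - A) | (1 - B), 2)
```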
GATES

Gates implement the basic logic functions: AND, OR, NOT, XOR,
Equivalence.  Show pictures (Fig B).

Show why the picture is equivalence, i.e. (A XOR B)_ is AB + A_B_
(Fig C).

Often omit the inverters and draw little circles at the input or
output of the other gates (AND, OR).  These little circles are
sometimes called bubbles.

Explain the picture at the bottom of B-7 (Fig D).  This explains how an
inverter is a buffer with a bubble.

HOMEWORK B.2 on page B-45 (I previously did the first part of this
homework).

HOMEWORK Consider the Boolean function of 3 Boolean vars (i.e. a three
input function) that is true if and only if exactly 1 of the three
variables is true.  Draw the TT.  Draw the logic diagram with AND OR
NOT.  Draw the logic diagram with AND OR and bubbles.

We have seen that any logic function can be constructed from AND OR
NOT.  So this triple is called universal.  Are there any pairs that
are universal?  Could it be that there is a single function that is
universal?  YES!

NOR (not OR) is true when OR is false.  Do TT.
NAND (not AND) is true when AND is false.  Do TT.
Draw both diagrams (one from the def and the equivalent one) with
bubbles.

A 2-input NOR is universal.  A 2-input NAND is universal.

Show on the board that a 2-input NOR is universal:
    A_  = A NOR A
    A+B = (A NOR B)_
    AB  = (A_ OR B_)_ = A_ NOR B_

HOMEWORK Show that a 2-input NAND is universal.

Notes
1. Can draw NAND and NOR each of two ways (because (AB)_ = A_ + B_).
2. We have seen how to get a logic function from a TT.  Indeed we can
   get one that is just two levels of logic.  But it might not be the
   simplest possible.  That is, we may have more gates than necessary.
   Trying to minimize the number of gates is NOT trivial.  Mano covers
   this in detail.  We will not cover it in this course.  It is not in
   H&P.  I actually like it but must admit that it takes a few
   lectures to cover well and is not used so much since it is
   algorithmic and is done automatically by CAD tools.
3. Example of non-unique minimization.  Given A_BC + ABC + ABC_.
   Combine the first two to get BC + ABC_.
   Combine the last two to get A_BC + AB.
4. Can have "don't care" results.  Helps minimization.

COMBINATIONAL LOGIC

Multiplexor
    Called a Mux.  Also called a selector.
    Two different diagrams (fig E).
    Show the equivalent circuit with AND OR:
        if S=0
            M=A
        else
            M=B
        endif
    Can have a 4-way mux (2 selector lines):
        if S1=0 and S2=0
            M=A
        else if ...
            M=B
        ...
        else
            M=D
    Do TT for the 2-way mux.  Redo it with don't care values.

HOMEWORK B-12.  Assume you have constant signals 1 and 0 as well.

Decoder
    Takes n signals in, produces 2^n signals out.
    Input "binary n"; output has the n'th bit set.
    Picture is fig F (note the "by 3" symbol).
    Implement on the board with AND/OR.

Encoder
    Reverse "function" of the decoder.
    Not defined for all inputs (exactly one must be 1).

============== End of Lecture 2 ======================

============== Start of Lecture 3 ====================

Sneaky way to see that NAND is universal (Fig sneaky).
Do this lecture 3 after the hw for lect 2.

Half Adder
    Inputs X and Y; outputs S and Co (carry out).  No carry-in.
    Draw TT.
HOMEWORK Draw the logic diagram.

Full Adder
    Inputs X, Y and Ci; outputs S and Co.
    S  = 1 iff the # of 1s in X, Y, Ci is odd.
    Co = 1 iff the # of 1s is at least 2.
HOMEWORK Draw the TT (8 rows); show S = X XOR Y XOR Ci; show
Co = XY + (X XOR Y)Ci.

Draw the circuit using the formulas for S and Co from the homework.
How about a 4-bit adder?  Do it.
How about an n-bit adder?  Linear complexity.  Called ripple carry.
Faster methods exist.

PLAs
    Programmable Logic Array.  Fig G (from book).
    Minterms.
    Start with a logical formula.
    Convert to sum of products form (only NOTs on vbles).
    Can also have a PAL, in which the final dots are specified later.
    Mass produce the sea of gates first.
HOMEWORK B-5

ROM
    Another way to implement a logic function.
    For n inputs and k outputs need (2^n)k bits stored, namely the
    columns for the output vbles in the TT.
    NOT considered state.  Once the ROM is made, the output depends
    only on the input.
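A sketch (mine, not from the book) of the ROM idea: the "contents" are
just the output columns of the TT, and reading is indexing.  Here the
burned-in function is the full adder from above.

```python
# A ROM implementing an n-input, k-output logic function is a table of
# 2**n k-bit words, indexed by the n input bits.  Burn in a full adder:
# inputs (x, y, cin) -> outputs (sum, carry-out), so n=3, k=2.
n, k = 3, 2
rom = [0] * (2 ** n)
for addr in range(2 ** n):
    x, y, cin = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
    s = x ^ y ^ cin                     # S  = X XOR Y XOR Ci
    co = (x & y) | ((x ^ y) & cin)      # Co = XY + (X XOR Y)Ci
    rom[addr] = (s << 1) | co           # pack the k=2 output bits

# Reading the ROM: the output depends only on the input (no state).
assert rom[0b110] == 0b01               # 1+1+0 -> sum 0, carry 1
assert rom[0b111] == 0b11               # 1+1+1 -> sum 1, carry 1
```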
Similar to a PLA.  Fully decoded.
PROMs, EPROMs, EEPROMs.

Don't Cares
    Input don't cares; output don't cares.
    The input DC example was the mux.
    Do the output DC from the book (Fig H).

Arrays of Logic Elements
    Do the same thing to many signals.
    Draw thicker lines and use the "by n" notation.
    Show the diagram for an 8-bit 2-way mux and the implementation
    with 8 muxes.
    A bus is a collection of data lines treated as a single logical
    (n-bit) value.  Use arrays of logic elements to process buses.
    The above mux switches between two 8-bit buses.

------------- Big Change Coming --------

Why do we want to have state?
    Memory (i.e. RAM, not just ROM or PROM)
    Counters
    Reducing gate count
        A multiplier would be quadratic in comb logic.
        With sequential logic (state) it can be done in linear.
        What follows is unofficial (i.e. too fast to understand):
        a shift register holds the partial sum; the real slick trick
        is to share this shift reg with the multiplier.

Assume you have a real OR gate.  Assume the two inputs are both zero
for an hour.  At time t one input becomes 1.  The output will
OSCILLATE for a while before settling on exactly 1.  We want to be
sure we don't look at the answer before it's ready.

Clocks
    Frequency; period.
    Rising edge; falling edge.
    We use edge-triggered logic: state changes occur only on a clock
    edge.  Will explain later what this really means.
    Active edge: the edge on which changes occur.  The choice is
    technology dependent.

Synchronous system

    state-element ----> comb circuit ----> state-element

    State-elements have the clock as an input; they can change state
    only at the active edge; they produce output ALWAYS, based on the
    current state.
    All signals written to state elements must be valid at the active
    edge.  E.g. if the cycle time is 10ns, make sure the combinational
    circuit used to compute the new state values completes in 10ns.
    So state elements change on the active edge; the comb circuit
    stabilizes between active edges.

    Can have

    +---> state-element ----> comb circuit ---+
    |                                         |
    +-----------------------------------------+

Memory
    We want CLOCKED memory and will only use CLOCKED memory in our
    designs.
However for simplicity we first describe how to make UNCLOCKED memory.

S-R latch (set-reset)   Fig I
    DON'T assert both S and R at once.
    S asserted sets the latch: Q true, Q_ false.
    R asserted resets the latch: Q false, Q_ true.
    Neither asserted: Q and Q_ remain as they were.
    This is the MEMORY.

========================== End Lecture 3 ==================

========================== Begin Lecture 4 ================

IMPORTANT NOTE: Class will be held each Thursday BUT no assignments
will be due the 2nd, 16th, or 23rd.  Any homework ASSIGNED on those
days will be available on the web.  These notes are now web available
as http://allan.ultra.nyu.edu/arch/class-notes

So homework assigned this week (please number it HW#4) will be due on
the 9th (14 days from today).  HW#5 (assigned on the 9th), HW#6
(16th), and HW#7 (23rd) will all be due on the 30th.  Please have each
HW on separate pages.

On the 30th we will schedule the midterm; the midterm will NOT be on
the 30th.  That is the day we will decide on the midterm date and on
how much will be covered.

CLOCKED memory: what we are really interested in.

Clocked latch
    Output changes when the input changes and the clock is asserted.
    "Level sensitive" rather than "edge triggered"; sometimes called
    "transparent".  We won't use these in designs but will show how to
    build one.  Fig J is a D (clocked) latch.  D for "data".

Flip-flop
    Changes on the active edge.  NOT transparent.
    Fig K is a D flip-flop built from D latches.  This one has the
    falling edge as the active edge.  Sometimes called a master-slave
    flip-flop.
    Note the box around the main structure and letters reused with
    different meaning (block structure a la algol).
    The master latch is set during the time the clock is asserted.
    Remember that the latch is transparent, i.e. follows its input
    when its clock is asserted.  But the second latch is ignoring its
    input at this time.  When the clock falls, the 2nd latch pays
    attention and the first latch keeps producing whatever D was at
    fall-time.
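A behavioral sketch (mine, not from the book) of the master-slave
construction just described; the class names are made up:

```python
# Two transparent D latches back to back (as in fig K): the master
# follows D while the clock is high; the slave is clocked on NOT clock,
# so it copies the master when the clock falls -> falling-edge trigger.
class DLatch:
    def __init__(self):
        self.q = 0
    def tick(self, d, clk):
        if clk:                    # transparent while clock asserted
            self.q = d
        return self.q              # otherwise holds its value

class MasterSlaveFF:
    def __init__(self):
        self.master, self.slave = DLatch(), DLatch()
    def tick(self, d, clk):
        self.master.tick(d, clk)
        return self.slave.tick(self.master.q, 1 - clk)  # slave on NOT clk

ff = MasterSlaveFF()
ff.tick(d=1, clk=1)        # clock high: master follows D=1, slave holds
assert ff.slave.q == 0
ff.tick(d=0, clk=0)        # falling edge: slave captures the master's 1
assert ff.slave.q == 1
```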
Actually D must remain constant for some time around the active edge:
it must be valid for the set-up time before and the hold time after.

HOMEWORK Try moving the inverter to the other latch (see fig K).  This
should give the rising edge as the active edge.

Flip-flops give counters.  Explained next lecture.
HOMEWORK B.13  Don't worry if it seems hard.

Register
    Just an array of D flip-flops with a write line, fig M.
    Must have the write line and data valid during the setup and hold
    times.

================ End of lecture 4 ================

================ Start of lecture 5 ================

NOTE: The real story on counters is told in Figure P.
HOMEWORK B.13  Now you can really do it.

Register File
    Set of registers, each numbered.
    Supply reg#, write line, and data (if a write).
    Often have several read and write ports so that several registers
    can be read and written during one cycle.
    Can read and write the same reg in the same cycle (read gets the
    old value).
    We will do 2 read ports and one write port since that is what is
    needed for ALU ops.  NOT adequate for superscalar.
    To read, just need a mux from the register file to select the
    correct register.  Have one of these for each read port.
    For writes, use a decoder on the register number to determine
    which register to write.
    Figure N (B.20 from book).  Note that 2 errors were fixed.

SRAMs and DRAMs
    External interface is Figure O (B.21 from book).
    (Sadly) we will not look inside.  The following is unofficial:
    Different from the above because there would be too many wires and
    the muxes would be too big.
    Two stage decode.
    Tri-state buffers instead of a mux for SRAM.
    DRAM latches a whole row but outputs only one (or a few)
    column(s), so it can speed up access to elts in the same row.
    Merged DRAM + CPU is a new hot topic.

Error Correction   Skipped

There are other kinds of flip-flops: T, J-K.  Also one could learn
about excitation tables for each.  We will NOT (H&P doesn't either).
If interested, see Mano.

Finite State Machines   Skipped
Timing Methodologies    Skipped

------------------ End Appendix B ------------

================ End Lecture 5 ================

================ Start Lecture 6 ================

---------------- Start Chapter 1 ----------------

HOMEWORK READ chapter 1.
Do 1.1 -- 1.26 (really one matching question)
Do 1.27 -- 1.44 (another matching question)
1.45 (and do 7200 RPM and 10,000 RPM)
1.46, 1.50

---------------- End Chapter 1 ----------------

---------------- Start Chapter 3 ----------------

HOMEWORK Read sections 3.1 3.2 3.3 3.4

Representing instructions in the Computer (MIPS)

32 32-bit registers.  Register 0 is always 0.
HOMEWORK 3.2

R-type instruction (R for register)

    op  rs  rt  rd  shamt  funct
    6   5   5   5   5      6

    rs, rt are the source operands; rd is the destination; shamt is
    the shift amount; funct is used for op=0 to distinguish ALU ops.

    add/sub $1,$2,$3
        op=0, funct tells add vs sub
        reg1 <-- reg2 + reg3
        the regs can be the same (doubles the value in the reg)

I-type (why I?)

    op  rs  rt  address
    6   5   5   16

    rs is a source reg; rt is a destination reg.
    Transfers to/from memory are often in words (32 bits), but the
    machine is byte addressable, so shift the address left two bits.

    lw/sw $1,addr($2)
        machine format is  op 2 1 addr
        lw: $1 <-- Mem[$2+addr]
        sw: $1 --> Mem[$2+addr]

    Note how the field sizes of R-type and I-type correspond.  Will be
    important later.  The type is determined by the op.

Branching instructions: slt (set less-than), R-type

    slt $3,$8,$2
        reg3 <-- (1 if reg8 < reg2; else 0)

For slt the result should be 1 iff a < b.  So need to make the LOB of
the result = the sign bit of a subtraction, and the rest of the result
bits are always zero.

Idea #1.  Give the mux another input.  This input is brought in from
outside the bit cell.  That is, for this setting of the mux, the cell
copies an input to its output.  Set the mux to deliver this output
when an slt is requested.  Fig V

Idea #2.  Bring out the result of the adder (BEFORE the mux), Fig W.
Only needed for the HOB (i.e. the sign).  Take this new output from
the HOB and connect it to the new input in idea #1 for the LOB.
The new inputs for the other bits are set to zero.

Problem: This is wrong!
    Take an example using 3 bits (i.e. -4 .. 3).  Try slt on -3 and 2.
    The subtraction (-3 - 2) should give -5, saying that -3 < 2.  But
    -3 - 2 gives +3!!  Really it gives OVERFLOW.  But slt on -3 and 2
    should not overflow.

Solution: Need the correct rule for less than (not just the sign of
the subtraction).
HOMEWORK: figure out the correct rule, i.e. prob 4.25

Extra requirement: deal with overflows.
    The 1-bit ALU for the HOB really is different from the others.
    Fig X (4.15).

Extra requirement: zero detect.
    Large NOR.

Observation: The initial Cin and Binvert are always the same, so just
use one input called Bnegate.

The final result is Figure Y (4.17).  The symbol for the ALU is fig Z
(4.18).

What are the control lines?  Bnegate and OP.

What functions can we perform?
    and, or, add, sub, set on less than.

What (3-bit) control lines do we need for each?

    and  0 00
    or   0 01
    add  0 10
    sub  1 10
    slt  1 11

Adder with just 2 levels of logic
    Clearly exists.  Expensive.  Large fan-in means not as fast as 2
    simple levels of logic.

Carry Lookahead Adders

We did the ripple adder; its delay is proportional to the # of bits.

For each bit we can immediately (one gate delay) calculate
    generate a carry:    gi = ai bi
    propagate a carry:   pi = ai + bi

Now we can calculate all the carries (4-bit) just given c0 = Cin:
    c1 = g0 + p0 c0
    c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0
    c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
    c4 = g3 + p3 g2 + ... + p3 p2 p1 p0 c0

Thus we can calculate c1 ... c4 in just two (5-input) gate delays.

HANDOUT Diagram for the 4-bit CLA (also on web).

Now put 4 of these together to get all the 16-bit carries:
    P0 = p3 p2 p1 p0
    P1 = p7 p6 p5 p4
    P2 = p11 p10 p9 p8
    P3 = p15 p14 p13 p12
    G0 = g3 + p3 g2 + ... + p3 p2 p1 g0
    G1 = g7 + p7 g6 + ... + p7 p6 p5 g4
    G2 = g11 + p11 g10 + ... + p11 p10 p9 g8
    G3 = g15 + p15 g14 + ... + p15 p14 p13 g12
    C1 = G0 + P0 c0
    C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0
    C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0
    C4 = G3 + P3 G2 + ...
         + P3 P2 P1 P0 c0

Draw a diagram just like the 4-bit CLA with different labels and
presto: a 16-bit CLA.

HOMEWORK 4-31 -- 4-35

================ End Lecture 7 ================

================ Start Lecture 8 ================

HOMEWORK If I didn't assign 4-31 -- 4-35, assign it now.

NOTE: I want to say more about CLA but don't want to break up
shifter/multiplier across two lectures.

There are other fast adders (e.g. carry save).

Shifter

    Just a string of D-flops; the output of one is the input of the
    next.  The input to the first is the serial input; the output of
    the last is the serial output.

    But we want more:
        Left and right shifting (with serial input/output)
        Parallel load
        Parallel output
        Don't shift every cycle

    Parallel output is just wires.
    The shifter has 4 modes (left, right, nop, load), so a 4-1 mux
    inside; 2 control lines must come in.
    Handout (also on web) shows the diagram (Fig AA in my notes).
    Our shifters are slow for big shifts; barrel shifters are better.

HOMEWORK: A 4-bit shift register initially contains 1101.  It is
shifted six times to the right with the serial input being 101101.
What are the contents of the register after each shift?

HOMEWORK: Same register, same initial condition.  For the first 6
cycles the opcodes are left, left, right, nop, left, right and the
serial input is 101101.  The next cycle the register is loaded (in
parallel) with 1011.  The final 6 cycles are the same as the first 6.
What are the contents of the register after each cycle?

Multipliers

    Recall how to do multiplication:
        multiplicand times multiplier gives product.
        Multiply the multiplicand by each digit of the multiplier.
        Put the result in the right column.
        Then add the partial products just produced.

    We will do it the same way ...
    ... BUT differently.

    We have binary, so each "digit" of the multiplier is 1 or zero.
    Hence "multiplying" by the digit means either
        getting the multiplicand, or
        getting zero.
    Use an "if the appropriate bit of the multiplier is 1" stmt.
    To get the "appropriate bit":
        Use the LOB.
        Shift right (so the next bit is the LOB).
    Putting it in the right column is done by shifting the
    multiplicand left one bit each time (even if the multiplier bit is
    zero).
    Instead of adding the partial products at the end, keep a running
    sum.  Don't add zero if the multiplier bit is zero; just skip the
    step.

This results in the following algorithm:

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            product = product + multiplicand
        shift multiplicand left 1 bit
        shift multiplier right 1 bit

Do on the board a 4-bit multiplication (8-bit registers): 1100 x 1101.

The circuit diagram is figure 4.20 (handout).

What about the control?
    Always give the ALU the ADD operation.
    Always send a 1 to the multiplicand to shift left.
    Always send a 1 to the multiplier to shift right.
    Pretty boring so far, but:
    Send a 1 to the write line in product if and only if the LOB of
    the multiplier is a 1, i.e. send the LOB to the write line.
    I.e. it really is pretty boring.

This works! ...
... but it is wasteful of resources and hence is
    slower
    hotter
    bigger
and all of these are bad.

    The product register must be 64 bits since the product is 64 bits.
    Why is the multiplicand register 64 bits?
        So that we can shift it left, i.e. for our convenience.
    Why is the ALU 64 bits?
        Because the product is 64 bits.
        But we are only adding a 32-bit quantity to the product at any
        one step.  Hmmm.  Maybe we can just pull out the correct bits
        from the product.
        Would be tricky to pull out bits in the middle because which
        bits to pull changes each step.

POOF!!  Solving both problems at once:
    DON'T shift the multiplicand left.
        Hence its register is 32 bits and not a shifter.
    Instead shift the product right!
    Add the HO 32 bits of the product reg to the multiplicand and
    place the result back into the HO 32 bits.
    Only do this if the current multiplier bit is one.
    Do NOT need the carry-out from the adder; why?
    Because the HOB of the product BEFORE the add is zero.

This results in the following algorithm:

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            product[32-63] = product[32-63] + multiplicand
        shift product right 1 bit
        shift multiplier right 1 bit

The circuit diagram is figure 4.21 (handout).

What about control?
    Just as boring as before.
    Send ADD, 1, 1 to ALU, multiplier, product (shift right).
    Send the LOB to product (write).

Redo the same example on the board.

A final trick ("gate bumming", like the code bumming of the 60s):
    There is sort of a waste of registers.
    Not for the multiplicand, since we always need all 32 bits.
    But once we use a multiplier bit, we can toss it.
    And the product is half unused at the beginning and only slowly
    fills ...

POOF!!  "Timeshare" the LO half of the "product register":
    In the beginning the LO half contains the multiplier.
    Each step we shift right and more goes to product, less to
    multiplier.

The algorithm changes to:

    product <- multiplier
    for i = 0 to 31
        if LOB of product = 1
            product[32-63] = product[32-63] + multiplicand
        shift product-multiplier-register right 1 bit

Redo the same example on the board.

The above was for unsigned 32-bit multiplication.  For signed
multiplication:
    Save the signs of the multiplier and multiplicand.
    Convert the multiplier and multiplicand to non-neg numbers.
    Use the above algorithm.
        Don't need the 32nd iteration since the HOB of the multiplier
        is zero (the sign bit).
    Complement the product if the original signs were different.

There are faster multipliers.  Skip.
Skip division.  Skip floating point.

HOMEWORK Read 4.11 "Historical Perspective".
Start Reading Chapter 5.

================ End Lecture 8 ================

================ Start Lecture 9 ================

Class given by Prof. Grishman.
Datapath for the single cycle CPU.

================ End Lecture 9 ================

================ Start Lecture 10 ================

Midterm Exam

================ End Lecture 10 ================

================ Begin Lecture 11 ================

HOMEWORK 5.1, 5.11 (just datapaths, not control).

Lab 2.
Due in ONE week.  Modify lab 1 to deal with slt, zero detect, and
overflow.  That is, Figure 4.17.

Overflow

Will EXPLAIN for addition of nonnegatives and give the rule for all.

Assume 32-bit addition of nonnegs, so the numbers are 31 bits with
sign 0.  How can you get an overflow?  The 31-bit addition gives a
32-bit sum.  This looks like a negative number!  That is, the sign
is 1.

Rule: When adding nonnegs, overflow iff the result sign is 1.

The remaining rules all say you get an overflow iff the sign bit is
"wrong".

Rule: When adding negatives, overflow iff the result sign is 0.
Rule: When adding a neg and a nonneg, never overflows.
Rule: When subtracting two negs or two nonnegs, never overflows.
Rule: When subtracting a neg from a nonneg, overflow iff the result
      sign is 1.
Rule: When subtracting a nonneg from a neg, overflow iff the result
      sign is 0.

All this can be summarized as the
GRAND RULE: overflow iff carry into sign bit != carry out of sign bit
    OVERFLOW = CARRY-IN XOR CARRY-OUT

For adding nonnegs the carry out is surely 0 and you need a 0 carry in
to get a result with positive sign.
For adding negs the carry out is surely 1 and you need a 1 carry in to
get a result with negative sign.

Modification to slt

Recall that for slt we want to know if x-y is negative.  Sounds like
subtraction.  But we want the correct sign even if there is overflow.
If there is overflow the sign is wrong, so the correct sign is
    SUM[0] XOR OVERFLOW

---------------- The control for the datapath ----------------

We start with figure 5.14, which shows the datapath.

We need to set the muxes.

We need to give the three ALU cntl lines: 1-bit Bnegate and 2-bit OP.

    And     0 00
    Or      0 01
    Add     0 10
    Sub     1 10
    Set-LT  1 11

HOMEWORK What happens if we use 1 00?  If we use 1 01?  Ignore the
funny business in the HOB; the funny business "ruins" these ops.

What information do we have to decide on the muxes and ALU cntl lines?
The instruction!
    The opcode field (6 bits).
    For R-type, the funct field (6 bits).

So no problem, just do a truth table.
    12 inputs, 3 outputs.
    4096 rows, 15 columns, > 60K entries.
HELP!
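Back to overflow for a moment: the GRAND RULE and the slt sign fix can
be checked on the 3-bit slt example from earlier (a Python sketch, mine,
not from the book):

```python
def add3(a, b):
    """3-bit ripple-carry add of two's-complement bit patterns;
    returns (result bits, overflow by the GRAND RULE)."""
    carry = 0
    result = 0
    for i in range(3):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        cin = carry                               # carry INTO bit i
        result |= (ai ^ bi ^ cin) << i
        carry = (ai & bi) | ((ai ^ bi) & cin)     # carry OUT of bit i
    # after the loop: cin = carry into the sign bit, carry = carry out
    return result, cin ^ carry                    # CARRY-IN XOR CARRY-OUT

def to_signed(v):                # interpret 3 bits as a value in -4..3
    return v - 8 if v & 4 else v

# -3 - 2, i.e. -3 + (-2): the true answer -5 is out of range, so
# overflow, and the raw sum looks like +3 (the wrong sign).
res, ovf = add3(0b101, 0b110)    # -3 and -2 in 3-bit two's complement
assert ovf == 1 and to_signed(res) == 3
# The correct sign for slt is the sum's sign XOR OVERFLOW:
assert ((res >> 2) & 1) ^ ovf == 1        # so slt says -3 < 2, as desired
```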
We will let the main control (to be done later) "summarize" the opcode
for us.  It will generate a 2-bit field ALUop:

    ALUop   Action needed by ALU
    00      Addition (for load and store)
    01      Subtraction (for beq)
    10      Determined by funct field (R-type instruction)
    11      Not used

So now we have 8 inputs (2+6) and 3 outputs.
    256 rows, 11 columns; ~2800 entries.
    Certainly easy for automation ... but we will be clever.
    We only have 8 MIPS instructions that use the ALU (fig 5.15).
    Funct is only used for ALUop 10; 11 is impossible
        ==> 01 = X1 and 10 = 1X

So we get

    ALUop | Funct        || Bnegate:OP
    1  0  | 5 4 3 2 1 0  || B  OP
    ------+--------------++-----------
    0  0  | x x x x x x  || 0  10
    x  1  | x x x x x x  || 1  10
    1  x  | x x 0 0 0 0  || 0  10
    1  x  | x x 0 0 1 0  || 1  10
    1  x  | x x 0 1 0 0  || 0  00
    1  x  | x x 0 1 0 1  || 0  01
    1  x  | x x 1 0 1 0  || 1  11

How would we implement this?  A circuit for each of the three output
bits.  Just decide when the individual output bit is 1 (figure 5.17).
The circuit is then easy (5.18).

Now we need the main control, setting:
    the four muxes
    writing the registers
    writing the memory
    reading the memory ??? not really (well maybe ... for DRAM)
    calculating ALUop
So 9 (really 8) bits.  Fig 5.20 shows where these occur.

They are all determined by the opcode.  The MIPS instruction set is
fairly regular: most fields we need are always in the same place in
the instruction.

    Opcode (called Op[5-0]) always in 31-26.
    Regs to be read always 25-21 and 20-16 (R-type, beq, store).
    Base reg always 25-21 (load, store).
    Offset always 15-0.
    Oops: Reg to be written EITHER 20-16 (load) OR 15-11 (R-type).
        MUX!!
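The little ALUop/funct truth table above turns into a few lines (a
sketch of the decode logic, not the actual gates of fig 5.18):

```python
# ALU control: ALUop from the main control plus the funct field give
# the 3-bit ALU control Bnegate:OP.
def alu_control(aluop, funct):
    if aluop == 0b00:                       # lw/sw -> add
        return 0b010
    if aluop & 0b01:                        # beq -> subtract
        return 0b110
    # ALUop 1x: R-type, decode the low funct bits
    return {0b0000: 0b010,                  # add
            0b0010: 0b110,                  # sub
            0b0100: 0b000,                  # and
            0b0101: 0b001,                  # or
            0b1010: 0b111}[funct & 0b1111]  # slt

assert alu_control(0b00, 0) == 0b010        # load/store: add
assert alu_control(0b01, 0) == 0b110        # beq: sub
assert alu_control(0b10, 0b100010) == 0b110 # R-type sub
assert alu_control(0b10, 0b101010) == 0b111 # R-type slt
```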
Fig 5.21 describes each signal:

    MemRead:  Memory delivers the value stored at the specified addr.
    MemWrite: Memory stores the specified value at the specified addr.
    ALUSrc:   Second ALU operand comes from (reg-file /
              sign-ext-immediate).
    RegDst:   Number of the reg to write comes from the (rt / rd)
              field.
    RegWrite: Reg-file stores the specified value in the specified
              register.
    PCSrc:    New PC is Old PC + (4 / shifted-branch-target-disp).
    MemtoReg: Value written in the reg-file comes from (alu / mem).

Fig 5.22 shows the wires for control.

We are interested in four opcodes: R-type, load, store, BEQ.
Do a stage play.
Fig 5.23 summarizes the play (i.e. the control lines).
Fig 5.27 (from the FIFTH edition, an improvement to 5.30 in the fourth
edition) shows the control signal settings for each opcode.

Now it is straightforward but tedious to get the logic equations.
The result is figure 5.31.

HOMEWORK 5.1, 5.11 control, 5.2, 5.12

New instruction format: unconditional jump

    opcode  addr
    31-26   25-0

    Addr is a word addr; the bottom 2 bits of the PC are always 0.
    The top 4 bits of the PC stay as they were (AFTER incr by 4).
    Easy to add.  My overlay to fig 5.22.

================ End Lecture 11 ================

================ Start Lecture 12 ================

---------------- What's wrong? ----------------

Some instructions could be especially slow, and all take the time of
the slowest.  Worse if we look at really tough ones (floating pt
divide).

Solns:
    Variable length cycle
        Asynchronous logic
    Multicycle instructions
        Can reuse the same ALU for different cycles
    Even faster
        Pipeline the cycles
            Can't reuse; complicated; chapter 6
        Multiple datapaths (superscalar)
            Lots of logic, but not too hard if we execute IN ORDER
            Faster if done OUT of order, but very complicated

---------------- Chapter 2 Performance analysis ------------

HOMEWORK Read Chapter 2

Difference between response time and throughput.  Adding a processor
is likely to increase throughput more than it decreases response time.
We will mostly be concerned with response time.

PERFORMANCE = 1 / Execution time

So machine X is n times faster than Y means
    Performance-X = n * Performance-Y
    Execution-time-X = (1/n) * Execution-time-Y

How to measure execution time:
    CPU time
        Includes time waiting for memory.
        Does not include time waiting for I/O, as this process is not
        running then.
    Elapsed time on an empty system
    Elapsed time on a "normally loaded" system
    Elapsed time on a "heavily loaded" system

We mostly use CPU time.  Does NOT mean the other metrics are worse.

(CPU) execution time = (#CPU clock cycles) * (clock cycle time)
                     = (#CPU clock cycles) / (clock rate)

So a machine with a 10ns cycle time runs at a rate of
    1 cycle per 10 ns = 100,000,000 cycles per second = 100 MHz.

#CPU clock cycles = #instructions * CPI
    CPI = Cycles Per Instruction

#instructions for a given program depends on the instruction set.
    We saw in chapter 3 that 1 VAX instruction is often > 1 MIPS
    instruction.

Complicated instructions take longer: either many cycles or a long
cycle time.

Older machines with complicated instructions (e.g. VAX in the 80s) had
CPI > 1.
With pipelining we can have many cycles for each instruction but still
have CPI = 1.
Modern ``superscalar'' machines have CPI < 1.
    They issue many instructions each cycle.
    They are pipelined, so the instructions don't finish for several
    cycles.
    If you have 4-issue and all instructions have 5 pipeline stages,
    there are 20 = 5*4 instructions in progress at one time.

Putting this together:

    Time (in seconds) = Num inst executed * (Clock cycles / inst)
                                          * (seconds / Clock cycle)

HOMEWORK Carefully go through and understand the example on page 59.
HOMEWORK 2.1-2.5, 2.7-2.10
HOMEWORK Make sure you can easily do all the problems with a rating of
[5] and can do all those with a rating of [10].

Why not just use MIPS?
MIPS stands for Millions of Instructions Per Second
    NOT the same as the MIPS computer (but NOT a coincidence)
    Different architectures (inst sets) with the same MIPS rating take
    different time on the same program
    Different programs generate different MIPS ratings on the same arch
    Can raise the rating by adding NOPs, despite increasing exec time

HOMEWORK Carefully go through and understand the example on pages 61-3

Why not use MFLOPS?
    Millions of FLoating point Operations Per Second
    Similar problems to MIPS

Benchmarks
    A start, but the difficulty is choosing a representative benchmark
    for YOUR purchase

HOMEWORK Carefully go through and understand 2.7 "fallacies and pitfalls"

---------------- Chapter 7 Memory ------------

HOMEWORK Read Chapter 7

Ideal memory is
    Fast
    Big (in capacity; not physical size)
    Cheap
    Impossible

We observe empirically

TEMPORAL LOCALITY - A word referenced now is likely to be ref'ed again soon
    Good to keep the currently referenced word around for a while

SPATIAL LOCALITY - Words near the currently ref'ed word are likely to be
    ref'ed soon
    Good to prepare for other words near the current ref

So use a memory hierarchy
    Regs
    Cache (really L1 L2 maybe L3)
    Mem
    Disk
    Archive

We will first study the cache <---> mem gap
    Really many levels of caches
    Similar considerations apply to the other gaps
    But terminology is often different, e.g.
    cache line vs page

(In fall 97) My OS class is studying "the same thing" right now (mem mgt)

Cache is organized in units of BLOCKS
    Transfer a block from mem to cache
    Big blocks are good for spatial locality
    Think of mem as organized in blocks as well
        (OS thinks of pages and page frames)

A HIT occurs when a mem ref is found in the upper level of the mem hierarchy
    We will be interested in cache hits (OS in page hits)
    A miss is a non-hit

Hit rate is the fraction of mem refs that are hits
Miss rate is 1 - hit rate
Hit time is the time for a hit
Miss time is the time for a miss
Miss penalty is Miss time - Hit time

Start with a simple cache organization
    Assume all refs are for one word (not too bad)
    Assume cache blocks are one word
        Bad for spatial locality, so not done in real machines
        We will drop this assumption soon
    Assume each mem block can only go in one specific cache block
        DIRECT MAPPED
        Take mem blk # modulo # blocks in the cache for the location
        Make # blocks in the cache a power of 2
        Example: if the cache has 16 blocks, the location in the cache is
        the low order 4 bits of the block number

How can we tell if a mem blk is in the cache?
    We know where it will be *IF* it is there at all
    We need the "rest" of the addr
        Store the rest of the addr, called the TAG
    Also store a VALID bit per cache block (in case no mem blk is stored
    in this cache block, e.g. when the system boots up).

Show fig AB (fig 7.6)
Calculate the total number of bits

HOMEWORK 7.1

Processing a read for this simple cache
    Hit is trivial
    Miss: Evict and replace
        Why?
    I.e., why keep the new data instead of the old?
    Ans: Temporal Locality

Skip the section "handling cache misses" as it needs chapter 6

Processing a write for this simple cache
    Hit: Write through vs write back
        Write through: write to mem as well
        Write back: don't write to mem now; do it on evict
    Miss: write-allocate vs write-no-allocate

Simplest is write-through, write-allocate
    Still assuming blksize = refsize = 1 word and direct mapped
    For any write (hit or miss) do the following:
        Index the cache using the correct LOBs
        Write the data and the tag
        Set Valid
        Send the request to main mem
    Poor performance
        In the GCC benchmark, 11% of operations are stores
        If we assume an infinite speed memory, CPI is 1.2 for some
        reasonable estimate of instruction speeds
        Assume a 10 cycle store penalty (reasonable)
        CPI becomes 1.2 + 10 * 11% = 2.3   HALF SPEED
    Improvement: Use a write buffer
        Hold a few (four is common) writes at the processor while they
        are being processed at memory
        Satisfy reads from here as well

Unified vs split I and D (instruction and data) caches
    Unified is better because of better "load balancing"
    Split is better because it can do two refs at once

================ End Lecture 12 ================

================ Start Lecture 13 ================

Improvement: Blocksize > Wordsize
    Take advantage of spatial locality
    Figure AC (7.94)
    Show how an access takes place
    Cachesize = Blocksize * #Entries
    Calculate the total number of bits in this cache and for one with
    one word blocks but still 64KB
    If accesses are strictly sequential, this cache has 75% hits;
    the one with one word blocks has 0%

HOMEWORK 7.2 7.6

Why not make the blocksize enormous?  Cache one huge block.
    NOT all accesses are sequential
    With too few blocks, misses go up again

Memory support for wider blocks
    Should memory be wide?
    Should the bus be wide?
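Stepping back, the simple cache above can be put together in a toy sketch (the sizes, names, and example block numbers are mine, not from the text): the direct mapped lookup with tag and valid bit, plus the write-through penalty arithmetic from the GCC numbers.

```python
NUM_BLOCKS = 16   # 16 one-word blocks, a power of 2

def split_block_number(mem_block):
    """Direct mapped: cache location = block# mod #blocks (low 4 bits here);
    the TAG is the rest of the block number."""
    return mem_block % NUM_BLOCKS, mem_block // NUM_BLOCKS

# one VALID bit and one TAG per cache block; valid is clear at boot
cache = [{"valid": False, "tag": 0} for _ in range(NUM_BLOCKS)]

def read(mem_block):
    """True on a hit; on a miss, evict and replace (temporal locality)."""
    index, tag = split_block_number(mem_block)
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return True
    line["valid"], line["tag"] = True, tag
    return False

def cpi_with_stores(base_cpi, store_frac, store_penalty):
    """Simple write-through: every store pays the full memory penalty."""
    return base_cpi + store_frac * store_penalty

# the numbers from the notes: 1.2 + 0.11 * 10 = 2.3, i.e. half speed
```

Referencing the same block twice gives a miss then a hit; a block 16 away maps to the same index with a different tag and evicts it, which is the conflict the associative caches below will address.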
Consider Figure AD (7.12)
    Assume
        1 clock to send the address
        10 clocks for each memory access
        1 clock per busload of data
    Narrow design (a) takes 45 clocks for a read miss (do it)
    Wide design (b) takes 12
    Interleaved design (c) takes 15
    Interleaving works great because we are GUARANTEED to have
    sequential accesses
    Imagine a design between (a) and (b) with a 2-word wide datapath
        Takes 23 cycles and is more expensive to build than (c)

Performance example (dandy exam question)
    Assume a 5% I-cache miss rate and a 10% D-cache miss rate
    1/3 of the instructions access data
    CPI = 4 if the miss penalty is 0 (not realistic of course)
    What is the CPI with a miss penalty of 12 (do it)?
    What is the CPI if we double the speed of the cpu+cache but not the
    mem (i.e., a 24 clock miss penalty) (do it)?
    How much faster is the "double speed" machine?
        It would be double speed if the miss penalty were 0 or the miss
        rate were 0%

HOMEWORK 7.15, 7.16

A lower CPI makes stalls more expensive
A faster CPU (i.e. a faster clock) makes stalls more expensive

Remark: Large caches have a LONGER hit time

SKIP virtual memory
    2nd edition does it later
    OS students should look at it
    Would cover it if we had one more lecture

Improvement: Associative Caches

We have studied DIRECT MAPPED caches, i.e. the location in the cache is
determined by the address.  To check for a hit we compare ONE tag with
the HOBs of the addr.

The other extreme is FULLY ASSOCIATIVE
    A memory block can be placed in any cache block
    So the tag must be the entire block number, and we must check all
    cache blocks to see if we have a hit
        The larger tag is a minor problem
        The search is a disaster
            Either do it sequentially (too slow) or have a comparator
            with each tag entry (too big; and we also need a humongous mux)
        [An alternative is to have a table with one entry per MEMORY
        block giving the cache block number.  This is too big and too
        slow for caches but is used for virtual memory (demand paging)]

Most common for caches is an intermediate configuration called
SET ASSOCIATIVE or n-way associative (e.g. 4-way associative)

Why good?
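Two quick numeric sketches (the framing is mine; the miss rates and penalties are the ones from the exam-style example above): first the effective-CPI arithmetic, then a hint at the answer to the question just asked.

```python
def effective_cpi(base_cpi, i_miss, d_miss, data_frac, penalty):
    """Every instruction is fetched (I-cache stalls); only data_frac of
    them also make a data reference (D-cache stalls)."""
    return base_cpi + i_miss * penalty + data_frac * d_miss * penalty

slow = effective_cpi(4, 0.05, 0.10, 1/3, 12)  # 4 + 0.6 + 0.4 = 5
fast = effective_cpi(4, 0.05, 0.10, 1/3, 24)  # 4 + 1.2 + 0.8 = 6
speedup = slow / (fast * 0.5)                 # fast cycles are half as long

# why associativity is good: block numbers 1M and 2M apart by a power of 2
# get the SAME index in a 4096-block direct mapped cache and evict each
# other, but could share one set of a 2-way set associative cache
NUM = 4096
a, b = 0x100000, 0x200000
same_index = (a % NUM == b % NUM)
```

Note the speedup is 5/3, not 2: doubling the cpu+cache clock leaves the memory stalls the same length in real time.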
Consider referencing two modest arrays (<< cache size) that start at
locations 1MB and 2MB.  Both will contend for the same cache locations
in a direct mapped cache, but will fit together in a cache whose sets
have size >= 2 (i.e. at least 2-way associative).

Unfortunately the terminology is confusing: a cache whose sets each hold
n blocks is called n-way associative, or a cache of set size n.

How do we find the block?  Figure AE (7.9)
    Set# = Block# mod #sets

Which block (in the set) should be replaced?
    Random is sometimes used
    LRU is better but not so easy to do fast enough
        If the sets have size 2, easy.  Keep track of the last used
        block and choose the other.
        If the sets have size 4, view each set as two groups of two.
        Choose the group not last used, and then the block in that group
        not last used.  This is an approximation to LRU.
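The two-level LRU approximation just described, for sets of size 4, can be coded directly (the class name and encoding are my own; real hardware would keep this state as 3 bits per set):

```python
class PseudoLRU4:
    """Tree pseudo-LRU for one 4-way set: remember which pair of blocks
    was used last, and within each pair which member was used last.
    The victim is the not-last-used member of the not-last-used pair."""
    def __init__(self):
        self.last_pair = 0      # which pair (0 or 1) was touched last
        self.last_in = [0, 0]   # within each pair, which member was touched

    def touch(self, way):       # way in 0..3; called on every hit or fill
        pair, member = divmod(way, 2)
        self.last_pair = pair
        self.last_in[pair] = member

    def victim(self):           # block to replace on a miss
        pair = 1 - self.last_pair
        return pair * 2 + (1 - self.last_in[pair])
```

For example, after touching ways 0 then 3, the victim is way 1: the pair {0,1} was not used last, and within it way 0 was used more recently than way 1.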