Computer Systems Design   Thurs 5-7pm   room 102 WWH   Fall 1997

Will set up a www page for the course.

Allan Gottlieb   gottlieb@nyu.edu (or gottlieb@cs.nyu.edu or ...)
715 bway rm 1001
212-998-3344   609-951-2707
email is best

Text is Hennessy and Patterson "Computer Organization and Design: The
Hardware/Software Interface".  Available in bookstore.

These notes are available.  They are low quality (a FEATURE).

The main body of the book assumes you know logic design.  I do NOT make
that assumption.  We will start with appendix B, which is logic design
review.  A more extensive treatment of logic design is M. Morris Mano
"Computer System Architecture", Prentice Hall.  We will not need as much
as Mano covers, and it is not a cheap book, so I am not requiring you to
get it.  I will get it put into the library.  My treatment will follow
H&P, not Mano.

Homework vs labs (describe)

Left board for assignments and announcements (will be on www when that
is set up).

HOMEWORK Read B.1

B.2 Gates, Truth Tables and Logic Equations

Digital ==> Discrete

Primarily binary at the hardware (but NOT exclusively).
Use only two voltages -- high and low.
This hides a great deal of engineering.
Must make sure not to sample the signal when it is not in one of these
two states.  Sometimes it is just a matter of waiting long enough
(determines the clock rate, i.e. megahertz).  Other times it is worse
and you must avoid glitches.

Draw sketch of scope trace: square wave, sine wave, real wave.
WE WILL IGNORE THIS.

In English digital (think digit, i.e. finger) => 10, but not in
computers.

Bit = Binary digIT

Instead of saying high voltage and low voltage, we say true and false,
or 1 and 0, or asserted and deasserted.

0 and 1 are complements of each other.

A logic block can be thought of as a black box that takes signals in
and produces signals out.  Two kinds: combinational (or combinatorial)
and sequential.  The first kind doesn't have memory (so is simpler);
the second kind does have memory.
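The combinational/sequential distinction can be sketched in a few lines
of Python (an illustrative sketch, not from the text; the names are
made up):

```python
# Combinational block: a pure function -- output depends only on inputs.
def and_gate(a, b):
    return a & b

# Sequential block: has memory, so the output also depends on what
# happened before.  Here: a 1-bit toggle that flips on every clock tick.
class Toggle:
    def __init__(self):
        self.state = 0          # the stored value
    def clock(self):
        self.state ^= 1
        return self.state

assert and_gate(1, 1) == 1 and and_gate(1, 0) == 0
t = Toggle()
# The same "input" (a clock tick) gives different outputs over time:
assert [t.clock() for _ in range(4)] == [1, 0, 1, 0]
```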
The current value in the memory of the block is called the state of
the block.

We are doing combinational now.  Will do sequential later (few weeks).

TRUTH TABLES

Since comb log has no mem, it is simply a function from its inputs to
its outputs.  The truth table has as columns all inputs and all
outputs.  It has one row for each possible input value(s), and the
output columns have the output for that input.

Let's start with a really simple case: a logic block with one input
and one output.  DRAW IT.  There are two columns (1 + 1) and two rows
(2**1).  How many different truth tables are there for one in and one
out?  Just 4: the constant functions 1 and 0, the identity, and an
inverter (pictures in a few minutes).

OK.  Now how about two inputs and 1 output?  Three columns (2+1) and 4
rows (2**2).  How many are there?  It is just how many ways you can
fill in the output entries.  There are 4 output entries so ans is
2**4=16.

How about 2 in and 8 out?  10 cols; 4 rows; 2**(4*8)=4 billion possible.
3 in and 8 out?  11 cols; 8 rows; 2**(8*8)=2**64 possible.
n in and k out?  n+k cols; 2**n rows; 2**([2**n]*k) possible.
Gets big fast!

Certain logic functions (i.e. truth tables) are quite common and
familiar.  We use a notation that looks like algebra for them and
expressions involving them: Boolean algebra (George Boole).  A Boolean
variable takes on just two values, 1 and 0.  A Boolean function takes
in Boolean variables and produces Boolean values.

1. The (inclusive) OR Boolean function of two variables.  Draw its
   truth table.  This is written + (e.g. X+Y where X and Y are Boolean
   variables) and often called the logical sum.  (Three out of four
   squares look right!)

2. AND.  Draw TT.  Called the logical product and written as a
   centered dot (like product in regular algebra).  All four values
   look right.

3. NOT.  This is a unary operator (one argument, not two like above;
   the two above are called binary).  Written A bar.  Draw TT.

4. Exclusive OR (XOR).  Written as + with a circle around it.
True if exactly one input is true (i.e. true XOR true = false).
Draw TT.

HOMEWORK Consider the Boolean function of 3 Boolean vars that is true
if and only if exactly 1 of the three variables is true.  Draw the TT.

Some manipulation laws.  Remember this is ALGEBRA.

Identity:  A+0 = 0+A = A     A.1 = 1.A = A   (using . for AND)
Inverse:   A+A_ = A_+A = 1   A.A_ = A_.A = 0 (using _ for inverse)

Both + and . are commutative, so we don't need as much as I wrote.
Really funny to call the second an inverse law (you ADD the inverse
and get the identity for PRODUCT).

Associative:  A+(B+C) = (A+B)+C    A.(B.C) = (A.B).C

Due to the assoc law we can write A.B.C since either order of
evaluation gives the same answer.  Often elide the . so the product
assoc law is A(BC) = (AB)C.

Distributive (note BOTH dist laws hold):
    A(B+C) = AB+AC
    A+(BC) = (A+B)(A+C)

How does one prove these laws??  Simple (but long): write the TT.
Do the first dist law.

HOMEWORK: Do the second distributive law.

Do the example on page B-6 (based on the example on page B-5 (fig A)).
For E first use the obvious method of writing one condition for each
1-value in the E column, i.e.
    (A_BC) + (AB_C) + (ABC_)
Observe that E is true if two (but not three) inputs are true, i.e.
    (AB+AC+BC) (ABC)_    (using . higher precedence than +)

My first way of getting E shows that ANY logic function can be written
using just AND, OR, and NOT.  Indeed, it is in a nice form, called two
levels of logic, i.e. it is a sum of products of just inputs and their
complements.

DeMorgan's laws:
    (A+B)_ = A_B_
    (AB)_  = A_+B_

You prove a DM law with a TT.  Indeed that is
HOMEWORK B.6 on page B-45

============== End of Lecture 1 ======================

============== Start of Lecture 2 ====================

Do the beginning of this HW on the board.

With DM we can do quite a bit without resorting to TTs.  For example
one can show that the two expressions for E in the example above (page
B-6) are equal.  Indeed that is
HOMEWORK B.7 on page B-45
Do the beginning of the HW on the board.
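The "write the TT" proof method is easy to mechanize.  A Python sketch
(mine, not from the text) that brute-forces the second distributive law
and both DeMorgan laws over every row of the truth table:

```python
from itertools import product

def equal_on_all_rows(f, g, n):
    """TT proof: two n-input Boolean functions are equal iff they agree
    on every one of the 2**n rows of the truth table."""
    return all(f(*row) == g(*row) for row in product((0, 1), repeat=n))

# Second distributive law: A+(BC) = (A+B)(A+C), all 8 rows.
assert equal_on_all_rows(lambda A, B, C: A | (B & C),
                         lambda A, B, C: (A | B) & (A | C), 3)

# DeMorgan (1 - X plays the role of X_): (A+B)_ = A_B_ and (AB)_ = A_+B_
assert equal_on_all_rows(lambda A, B: 1 - (A | B),
                         lambda A, B: (1 - A) & (1 - B), 2)
assert equal_on_all_rows(lambda A, B: 1 - (A & B),
                         lambda A, B: (1 - A) | (1 - B), 2)
```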
GATES

Gates implement the basic logic functions: AND, OR, NOT, XOR,
Equivalence.  Show pictures (Fig B).

Show why the picture is equivalence, i.e. (A XOR B)_ is AB + A_B_
(Fig C).

Often omit the inverters and draw little circles at the input or
output of the other gates (AND, OR).  These little circles are
sometimes called bubbles.

Explain the picture at the bottom of B-7 (Fig D).  This explains how an
inverter is a buffer with a bubble.

HOMEWORK B.2 on page B-45 (I previously did the first part of this
homework).

HOMEWORK Consider the Boolean function of 3 Boolean vars (i.e. a three
input function) that is true if and only if exactly 1 of the three
variables is true.  Draw the TT.  Draw the logic diagram with AND OR
NOT.  Draw the logic diagram with AND OR and bubbles.

We have seen that any logic function can be constructed from AND OR
NOT.  So this triple is called universal.  Are there any pairs that
are universal?  Could it be that there is a single function that is
universal?  YES!

NOR (not OR) is true when OR is false.  Do TT.
NAND (not AND) is true when AND is false.  Do TT.
Draw both diagrams (one from the def and the equivalent one) with
bubbles.

A 2-input NOR is universal.  A 2-input NAND is universal.

Show on the board that a 2-input NOR is universal:
    A_  = A NOR A
    A+B = (A NOR B)_
    AB  = (A_ OR B_)_ = A_ NOR B_

HOMEWORK Show that a 2-input NAND is universal.

Notes
1. Can draw NAND and NOR each of two ways (because (AB)_ = A_ + B_).
2. We have seen how to get a logic function from a TT.  Indeed we can
   get one that is just two levels of logic.  But it might not be the
   simplest possible.  That is, we may have more gates than necessary.
   Trying to minimize the number of gates is NOT trivial.  Mano covers
   this in detail.  We will not cover it in this course.  It is not in
   H&P.  I actually like it but must admit that it takes a few
   lectures to cover well and is not used so much since it is
   algorithmic and is done automatically by CAD tools.
3. Example of non-unique minimization.  Given A_BC + ABC + ABC_.
   Combine the first two to get BC + ABC_.
   Combine the last two to get A_BC + AB.
4. Can have "don't care" results.  Helps minimization.

COMBINATIONAL LOGIC

Multiplexor
    Called a Mux.  Also called a selector.
    Two different diagrams (fig E).
    Show the equivalent circuit with AND OR:
        if S=0
            M=A
        else
            M=B
        endif
    Can have a 4-way mux (2 selector lines):
        if S1=0 and S2=0
            M=A
        else if ...
            M=B
        ...
        else
            M=D
    Do TT for the 2-way mux.  Redo it with don't care values.

HOMEWORK B-12.  Assume you have constant signals 1 and 0 as well.

Decoder
    Takes n signals in, produces 2^n signals out.
    Input "binary n"; output has the n'th bit set.
    Picture is fig F (note the "by 3" symbol).
    Implement on the board with AND/OR.

Encoder
    Reverse "function" of the decoder.
    Not defined for all inputs (exactly one must be 1).

============== End of Lecture 2 ======================

============== Start of Lecture 3 ====================

Sneaky way to see that NAND is universal (Fig sneaky).
Do this lecture 3 after the hw for lect 2.

Half Adder
    Inputs X and Y; outputs S and Co (carry out).  No carry-in.
    Draw TT.
HOMEWORK Draw the logic diagram.

Full Adder
    Inputs X, Y and Ci; outputs S and Co.
    S  = 1 iff the # of 1s in X, Y, Ci is odd.
    Co = 1 iff the # of 1s is at least 2.
HOMEWORK Draw the TT (8 rows); show S = X XOR Y XOR Ci; show
Co = XY + (X XOR Y)Ci.

Draw the circuit using the formulas for S and Co from the homework.
How about a 4-bit adder?  Do it.
How about an n-bit adder?  Linear complexity.  Called ripple carry.
Faster methods exist.

PLAs
    Programmable Logic Array.  Fig G (from book).
    Minterms.
    Start with a logical formula.
    Convert to sum of products form (only NOTs on vbles).
    Can also have a PAL, in which the final dots are specified later.
    Mass produce the sea of gates first.
HOMEWORK B-5

ROM
    Another way to implement a logic function.
    For n inputs and k outputs need (2^n)k bits stored, namely the
    columns for the output vbles in the TT.
    NOT considered state.  Once the ROM is made, the output depends
    only on the input.
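A sketch (mine, not from the book) of the ROM idea: the "contents" are
just the output columns of the TT, and reading is indexing.  Here the
burned-in function is the full adder from above.

```python
# A ROM implementing an n-input, k-output logic function is a table of
# 2**n k-bit words, indexed by the n input bits.  Burn in a full adder:
# inputs (x, y, cin) -> outputs (sum, carry-out), so n=3, k=2.
n, k = 3, 2
rom = [0] * (2 ** n)
for addr in range(2 ** n):
    x, y, cin = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
    s = x ^ y ^ cin                     # S  = X XOR Y XOR Ci
    co = (x & y) | ((x ^ y) & cin)      # Co = XY + (X XOR Y)Ci
    rom[addr] = (s << 1) | co           # pack the k=2 output bits

# Reading the ROM: the output depends only on the input (no state).
assert rom[0b110] == 0b01               # 1+1+0 -> sum 0, carry 1
assert rom[0b111] == 0b11               # 1+1+1 -> sum 1, carry 1
```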
Similar to a PLA.  Fully decoded.
PROMs, EPROMs, EEPROMs.

Don't Cares
    Input don't cares; output don't cares.
    The input DC example was the mux.
    Do the output DC from the book (Fig H).

Arrays of Logic Elements
    Do the same thing to many signals.
    Draw thicker lines and use the "by n" notation.
    Show the diagram for an 8-bit 2-way mux and the implementation
    with 8 muxes.
    A bus is a collection of data lines treated as a single logical
    (n-bit) value.  Use arrays of logic elements to process buses.
    The above mux switches between two 8-bit buses.

------------- Big Change Coming --------

Why do we want to have state?
    Memory (i.e. RAM, not just ROM or PROM)
    Counters
    Reducing gate count
        A multiplier would be quadratic in comb logic.
        With sequential logic (state) it can be done in linear.
        What follows is unofficial (i.e. too fast to understand):
        a shift register holds the partial sum; the real slick trick
        is to share this shift reg with the multiplier.

Assume you have a real OR gate.  Assume the two inputs are both zero
for an hour.  At time t one input becomes 1.  The output will
OSCILLATE for a while before settling on exactly 1.  We want to be
sure we don't look at the answer before it's ready.

Clocks
    Frequency; period.
    Rising edge; falling edge.
    We use edge-triggered logic: state changes occur only on a clock
    edge.  Will explain later what this really means.
    Active edge: the edge on which changes occur.  The choice is
    technology dependent.

Synchronous system

    state-element ----> comb circuit ----> state-element

    State-elements have the clock as an input; they can change state
    only at the active edge; they produce output ALWAYS, based on the
    current state.
    All signals written to state elements must be valid at the active
    edge.  E.g. if the cycle time is 10ns, make sure the combinational
    circuit used to compute the new state values completes in 10ns.
    So state elements change on the active edge; the comb circuit
    stabilizes between active edges.

    Can have

    +---> state-element ----> comb circuit ---+
    |                                         |
    +-----------------------------------------+

Memory
    We want CLOCKED memory and will only use CLOCKED memory in our
    designs.
However for simplicity we first describe how to make UNCLOCKED memory.

S-R latch (set-reset)   Fig I
    DON'T assert both S and R at once.
    S asserted sets the latch: Q true, Q_ false.
    R asserted resets the latch: Q false, Q_ true.
    Neither asserted: Q and Q_ remain as they were.
    This is the MEMORY.

========================== End Lecture 3 ==================

========================== Begin Lecture 4 ================

IMPORTANT NOTE: Class will be held each Thursday BUT no assignments
will be due the 2nd, 16th, or 23rd.  Any homework ASSIGNED on those
days will be available on the web.  These notes are now web available
as http://allan.ultra.nyu.edu/arch/class-notes

So homework assigned this week (please number it HW#4) will be due on
the 9th (14 days from today).  HW#5 (assigned on the 9th), HW#6
(16th), and HW#7 (23rd) will all be due on the 30th.  Please have each
HW on separate pages.

On the 30th we will schedule the midterm; the midterm will NOT be on
the 30th.  That is the day we will decide on the midterm date and on
how much will be covered.

CLOCKED memory: what we are really interested in.

Clocked latch
    Output changes when the input changes and the clock is asserted.
    "Level sensitive" rather than "edge triggered"; sometimes called
    "transparent".  We won't use these in designs but will show how to
    build one.  Fig J is a D (clocked) latch.  D for "data".

Flip-flop
    Changes on the active edge.  NOT transparent.
    Fig K is a D flip-flop built from D latches.  This one has the
    falling edge as the active edge.  Sometimes called a master-slave
    flip-flop.
    Note the box around the main structure and letters reused with
    different meaning (block structure a la algol).
    The master latch is set during the time the clock is asserted.
    Remember that the latch is transparent, i.e. follows its input
    when its clock is asserted.  But the second latch is ignoring its
    input at this time.  When the clock falls, the 2nd latch pays
    attention and the first latch keeps producing whatever D was at
    fall-time.
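A behavioral sketch (mine, not from the book) of the master-slave
construction just described; the class names are made up:

```python
# Two transparent D latches back to back (as in fig K): the master
# follows D while the clock is high; the slave is clocked on NOT clock,
# so it copies the master when the clock falls -> falling-edge trigger.
class DLatch:
    def __init__(self):
        self.q = 0
    def tick(self, d, clk):
        if clk:                    # transparent while clock asserted
            self.q = d
        return self.q              # otherwise holds its value

class MasterSlaveFF:
    def __init__(self):
        self.master, self.slave = DLatch(), DLatch()
    def tick(self, d, clk):
        self.master.tick(d, clk)
        return self.slave.tick(self.master.q, 1 - clk)  # slave on NOT clk

ff = MasterSlaveFF()
ff.tick(d=1, clk=1)        # clock high: master follows D=1, slave holds
assert ff.slave.q == 0
ff.tick(d=0, clk=0)        # falling edge: slave captures the master's 1
assert ff.slave.q == 1
```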
Actually D must remain constant for some time around the active edge:
it must be valid for the set-up time before and the hold time after.

HOMEWORK Try moving the inverter to the other latch (see fig K).  This
should give the rising edge as the active edge.

Flip-flops give counters.  Explained next lecture.
HOMEWORK B.13  Don't worry if it seems hard.

Register
    Just an array of D flip-flops with a write line, fig M.
    Must have the write line and data valid during the setup and hold
    times.

================ End of lecture 4 ================

================ Start of lecture 5 ================

NOTE: The real story on counters is told in Figure P.
HOMEWORK B.13  Now you can really do it.

Register File
    Set of registers, each numbered.
    Supply reg#, write line, and data (if a write).
    Often have several read and write ports so that several registers
    can be read and written during one cycle.
    Can read and write the same reg in the same cycle (read gets the
    old value).
    We will do 2 read ports and one write port since that is what is
    needed for ALU ops.  NOT adequate for superscalar.
    To read, just need a mux from the register file to select the
    correct register.  Have one of these for each read port.
    For writes, use a decoder on the register number to determine
    which register to write.
    Figure N (B.20 from book).  Note that 2 errors were fixed.

SRAMs and DRAMs
    External interface is Figure O (B.21 from book).
    (Sadly) we will not look inside.  The following is unofficial:
    Different from the above because there would be too many wires and
    the muxes would be too big.
    Two stage decode.
    Tri-state buffers instead of a mux for SRAM.
    DRAM latches a whole row but outputs only one (or a few)
    column(s), so it can speed up access to elts in the same row.
    Merged DRAM + CPU is a new hot topic.

Error Correction   Skipped

There are other kinds of flip-flops: T, J-K.  Also one could learn
about excitation tables for each.  We will NOT (H&P doesn't either).
If interested, see Mano.

Finite State Machines   Skipped
Timing Methodologies    Skipped

------------------ End Appendix B ------------

================ End Lecture 5 ================

================ Start Lecture 6 ================

---------------- Start Chapter 1 ----------------

HOMEWORK READ chapter 1.
Do 1.1 -- 1.26 (really one matching question)
Do 1.27 -- 1.44 (another matching question)
1.45 (and do 7200 RPM and 10,000 RPM)
1.46, 1.50

---------------- End Chapter 1 ----------------

---------------- Start Chapter 3 ----------------

HOMEWORK Read sections 3.1 3.2 3.3 3.4

Representing instructions in the Computer (MIPS)

32 32-bit registers.  Register 0 is always 0.
HOMEWORK 3.2

R-type instruction (R for register)

    op  rs  rt  rd  shamt  funct
    6   5   5   5   5      6

    rs, rt are the source operands; rd is the destination; shamt is
    the shift amount; funct is used for op=0 to distinguish ALU ops.

    add/sub $1,$2,$3
        op=0, funct tells add vs sub
        reg1 <-- reg2 + reg3
        the regs can be the same (doubles the value in the reg)

I-type (why I?)

    op  rs  rt  address
    6   5   5   16

    rs is a source reg; rt is a destination reg.
    Transfers to/from memory are often in words (32 bits), but the
    machine is byte addressable, so shift the address left two bits.

    lw/sw $1,addr($2)
        machine format is  op 2 1 addr
        lw: $1 <-- Mem[$2+addr]
        sw: $1 --> Mem[$2+addr]

    Note how the field sizes of R-type and I-type correspond.  Will be
    important later.  The type is determined by the op.

Branching instructions: slt (set less-than), R-type

    slt $3,$8,$2
        reg3 <-- (1 if reg8 < reg2; else 0)

For slt the result should be 1 iff a < b.  So need to make the LOB of
the result = the sign bit of a subtraction, and the rest of the result
bits are always zero.

Idea #1.  Give the mux another input.  This input is brought in from
outside the bit cell.  That is, for this setting of the mux, the cell
copies an input to its output.  Set the mux to deliver this output
when an slt is requested.  Fig V

Idea #2.  Bring out the result of the adder (BEFORE the mux), Fig W.
Only needed for the HOB (i.e. the sign).  Take this new output from
the HOB and connect it to the new input in idea #1 for the LOB.
The new inputs for the other bits are set to zero.

Problem: This is wrong!
    Take an example using 3 bits (i.e. -4 .. 3).  Try slt on -3 and 2.
    The subtraction (-3 - 2) should give -5, saying that -3 < 2.  But
    -3 - 2 gives +3!!  Really it gives OVERFLOW.  But slt on -3 and 2
    should not overflow.

Solution: Need the correct rule for less than (not just the sign of
the subtraction).
HOMEWORK: figure out the correct rule, i.e. prob 4.25

Extra requirement: deal with overflows.
    The 1-bit ALU for the HOB really is different from the others.
    Fig X (4.15).

Extra requirement: zero detect.
    Large NOR.

Observation: The initial Cin and Binvert are always the same, so just
use one input called Bnegate.

The final result is Figure Y (4.17).  The symbol for the ALU is fig Z
(4.18).

What are the control lines?  Bnegate and OP.

What functions can we perform?
    and, or, add, sub, set on less than.

What (3-bit) control lines do we need for each?

    and  0 00
    or   0 01
    add  0 10
    sub  1 10
    slt  1 11

Adder with just 2 levels of logic
    Clearly exists.  Expensive.  Large fan-in means not as fast as 2
    simple levels of logic.

Carry Lookahead Adders

We did the ripple adder; its delay is proportional to the # of bits.

For each bit we can immediately (one gate delay) calculate
    generate a carry:    gi = ai bi
    propagate a carry:   pi = ai + bi

Now we can calculate all the carries (4-bit) just given c0 = Cin:
    c1 = g0 + p0 c0
    c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0
    c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
    c4 = g3 + p3 g2 + ... + p3 p2 p1 p0 c0

Thus we can calculate c1 ... c4 in just two (5-input) gate delays.

HANDOUT Diagram for the 4-bit CLA (also on web).

Now put 4 of these together to get all the 16-bit carries:
    P0 = p3 p2 p1 p0
    P1 = p7 p6 p5 p4
    P2 = p11 p10 p9 p8
    P3 = p15 p14 p13 p12
    G0 = g3 + p3 g2 + ... + p3 p2 p1 g0
    G1 = g7 + p7 g6 + ... + p7 p6 p5 g4
    G2 = g11 + p11 g10 + ... + p11 p10 p9 g8
    G3 = g15 + p15 g14 + ... + p15 p14 p13 g12
    C1 = G0 + P0 c0
    C2 = G1 + P1 C1 = G1 + P1 G0 + P1 P0 c0
    C3 = G2 + P2 C2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 c0
    C4 = G3 + P3 G2 + ...
         + P3 P2 P1 P0 c0

Draw a diagram just like the 4-bit CLA with different labels and
presto: a 16-bit CLA.

HOMEWORK 4-31 -- 4-35

================ End Lecture 7 ================

================ Start Lecture 8 ================

HOMEWORK If I didn't assign 4-31 -- 4-35, assign it now.

NOTE: I want to say more about CLA but don't want to break up
shifter/multiplier across two lectures.

There are other fast adders (e.g. carry save).

Shifter

    Just a string of D-flops; the output of one is the input of the
    next.  The input to the first is the serial input; the output of
    the last is the serial output.

    But we want more:
        Left and right shifting (with serial input/output)
        Parallel load
        Parallel output
        Don't shift every cycle

    Parallel output is just wires.
    The shifter has 4 modes (left, right, nop, load), so a 4-1 mux
    inside; 2 control lines must come in.
    Handout (also on web) shows the diagram (Fig AA in my notes).
    Our shifters are slow for big shifts; barrel shifters are better.

HOMEWORK: A 4-bit shift register initially contains 1101.  It is
shifted six times to the right with the serial input being 101101.
What are the contents of the register after each shift?

HOMEWORK: Same register, same initial condition.  For the first 6
cycles the opcodes are left, left, right, nop, left, right and the
serial input is 101101.  The next cycle the register is loaded (in
parallel) with 1011.  The final 6 cycles are the same as the first 6.
What are the contents of the register after each cycle?

Multipliers

    Recall how to do multiplication:
        multiplicand times multiplier gives product.
        Multiply the multiplicand by each digit of the multiplier.
        Put the result in the right column.
        Then add the partial products just produced.

    We will do it the same way ...
    ... BUT differently.

    We have binary, so each "digit" of the multiplier is 1 or zero.
    Hence "multiplying" by the digit means either
        getting the multiplicand, or
        getting zero.
    Use an "if the appropriate bit of the multiplier is 1" stmt.
    To get the "appropriate bit":
        Use the LOB.
        Shift right (so the next bit is the LOB).
    Putting it in the right column is done by shifting the
    multiplicand left one bit each time (even if the multiplier bit is
    zero).
    Instead of adding the partial products at the end, keep a running
    sum.  Don't add zero if the multiplier bit is zero; just skip the
    step.

This results in the following algorithm:

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            product = product + multiplicand
        shift multiplicand left 1 bit
        shift multiplier right 1 bit

Do on the board a 4-bit multiplication (8-bit registers): 1100 x 1101.

The circuit diagram is figure 4.20 (handout).

What about the control?
    Always give the ALU the ADD operation.
    Always send a 1 to the multiplicand to shift left.
    Always send a 1 to the multiplier to shift right.
    Pretty boring so far, but:
    Send a 1 to the write line in product if and only if the LOB of
    the multiplier is a 1, i.e. send the LOB to the write line.
    I.e. it really is pretty boring.

This works! ...
... but it is wasteful of resources and hence is
    slower
    hotter
    bigger
and all of these are bad.

    The product register must be 64 bits since the product is 64 bits.
    Why is the multiplicand register 64 bits?
        So that we can shift it left, i.e. for our convenience.
    Why is the ALU 64 bits?
        Because the product is 64 bits.
        But we are only adding a 32-bit quantity to the product at any
        one step.  Hmmm.  Maybe we can just pull out the correct bits
        from the product.
        Would be tricky to pull out bits in the middle because which
        bits to pull changes each step.

POOF!!  Solving both problems at once:
    DON'T shift the multiplicand left.
        Hence its register is 32 bits and not a shifter.
    Instead shift the product right!
    Add the HO 32 bits of the product reg to the multiplicand and
    place the result back into the HO 32 bits.
    Only do this if the current multiplier bit is one.
    Do NOT need the carry-out from the adder; why?
    Because the HOB of the product BEFORE the add is zero.

This results in the following algorithm:

    product <- 0
    for i = 0 to 31
        if LOB of multiplier = 1
            product[32-63] = product[32-63] + multiplicand
        shift product right 1 bit
        shift multiplier right 1 bit

The circuit diagram is figure 4.21 (handout).

What about control?
    Just as boring as before.
    Send ADD, 1, 1 to ALU, multiplier, product (shift right).
    Send the LOB to product (write).

Redo the same example on the board.

A final trick ("gate bumming", like the code bumming of the 60s):
    There is sort of a waste of registers.
    Not for the multiplicand, since we always need all 32 bits.
    But once we use a multiplier bit, we can toss it.
    And the product is half unused at the beginning and only slowly
    fills ...

POOF!!  "Timeshare" the LO half of the "product register":
    In the beginning the LO half contains the multiplier.
    Each step we shift right and more goes to product, less to
    multiplier.

The algorithm changes to:

    product <- multiplier
    for i = 0 to 31
        if LOB of product = 1
            product[32-63] = product[32-63] + multiplicand
        shift product-multiplier-register right 1 bit

Redo the same example on the board.

The above was for unsigned 32-bit multiplication.  For signed
multiplication:
    Save the signs of the multiplier and multiplicand.
    Convert the multiplier and multiplicand to non-neg numbers.
    Use the above algorithm.
        Don't need the 32nd iteration since the HOB of the multiplier
        is zero (the sign bit).
    Complement the product if the original signs were different.

There are faster multipliers.  Skip.
Skip division.  Skip floating point.

HOMEWORK Read 4.11 "Historical Perspective".
Start Reading Chapter 5.

================ End Lecture 8 ================

================ Start Lecture 9 ================

Class given by Prof. Grishman.
Datapath for the single cycle CPU.

================ End Lecture 9 ================

================ Start Lecture 10 ================

Midterm Exam

================ End Lecture 10 ================

================ Begin Lecture 11 ================

HOMEWORK 5.1, 5.11 (just datapaths, not control).

Lab 2.
Due in ONE week.  Modify lab 1 to deal with slt, zero detect, and
overflow.  That is, Figure 4.17.

Overflow

Will EXPLAIN for addition of nonnegatives and give the rule for all.

Assume 32-bit addition of nonnegs, so the numbers are 31 bits with
sign 0.  How can you get an overflow?  The 31-bit addition gives a
32-bit sum.  This looks like a negative number!  That is, the sign
is 1.

Rule: When adding nonnegs, overflow iff the result sign is 1.

The remaining rules all say you get an overflow iff the sign bit is
"wrong".

Rule: When adding negatives, overflow iff the result sign is 0.
Rule: When adding a neg and a nonneg, never overflows.
Rule: When subtracting two negs or two nonnegs, never overflows.
Rule: When subtracting a neg from a nonneg, overflow iff the result
      sign is 1.
Rule: When subtracting a nonneg from a neg, overflow iff the result
      sign is 0.

All this can be summarized as the
GRAND RULE: overflow iff carry into sign bit != carry out of sign bit
    OVERFLOW = CARRY-IN XOR CARRY-OUT

For adding nonnegs the carry out is surely 0 and you need a 0 carry in
to get a result with positive sign.
For adding negs the carry out is surely 1 and you need a 1 carry in to
get a result with negative sign.

Modification to slt

Recall that for slt we want to know if x-y is negative.  Sounds like
subtraction.  But we want the correct sign even if there is overflow.
If there is overflow the sign is wrong, so the correct sign is
    SUM[0] XOR OVERFLOW

---------------- The control for the datapath ----------------

We start with figure 5.14, which shows the datapath.

We need to set the muxes.

We need to give the three ALU cntl lines: 1-bit Bnegate and 2-bit OP.

    And     0 00
    Or      0 01
    Add     0 10
    Sub     1 10
    Set-LT  1 11

HOMEWORK What happens if we use 1 00?  If we use 1 01?  Ignore the
funny business in the HOB; the funny business "ruins" these ops.

What information do we have to decide on the muxes and ALU cntl lines?
The instruction!
    The opcode field (6 bits).
    For R-type, the funct field (6 bits).

So no problem, just do a truth table.
    12 inputs, 3 outputs.
    4096 rows, 15 columns, > 60K entries.
HELP!
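Back to overflow for a moment: the GRAND RULE and the slt sign fix can
be checked on the 3-bit slt example from earlier (a Python sketch, mine,
not from the book):

```python
def add3(a, b):
    """3-bit ripple-carry add of two's-complement bit patterns;
    returns (result bits, overflow by the GRAND RULE)."""
    carry = 0
    result = 0
    for i in range(3):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        cin = carry                               # carry INTO bit i
        result |= (ai ^ bi ^ cin) << i
        carry = (ai & bi) | ((ai ^ bi) & cin)     # carry OUT of bit i
    # after the loop: cin = carry into the sign bit, carry = carry out
    return result, cin ^ carry                    # CARRY-IN XOR CARRY-OUT

def to_signed(v):                # interpret 3 bits as a value in -4..3
    return v - 8 if v & 4 else v

# -3 - 2, i.e. -3 + (-2): the true answer -5 is out of range, so
# overflow, and the raw sum looks like +3 (the wrong sign).
res, ovf = add3(0b101, 0b110)    # -3 and -2 in 3-bit two's complement
assert ovf == 1 and to_signed(res) == 3
# The correct sign for slt is the sum's sign XOR OVERFLOW:
assert ((res >> 2) & 1) ^ ovf == 1        # so slt says -3 < 2, as desired
```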
We will let the main control (to be done later) "summarize" the opcode
for us.  It will generate a 2-bit field ALUop:

    ALUop   Action needed by ALU
    00      Addition (for load and store)
    01      Subtraction (for beq)
    10      Determined by funct field (R-type instruction)
    11      Not used

So now we have 8 inputs (2+6) and 3 outputs.
    256 rows, 11 columns; ~2800 entries.
    Certainly easy for automation ... but we will be clever.
    We only have 8 MIPS instructions that use the ALU (fig 5.15).
    Funct is only used for ALUop 10; 11 is impossible
        ==> 01 = X1 and 10 = 1X

So we get

    ALUop | Funct        || Bnegate:OP
    1  0  | 5 4 3 2 1 0  || B  OP
    ------+--------------++-----------
    0  0  | x x x x x x  || 0  10
    x  1  | x x x x x x  || 1  10
    1  x  | x x 0 0 0 0  || 0  10
    1  x  | x x 0 0 1 0  || 1  10
    1  x  | x x 0 1 0 0  || 0  00
    1  x  | x x 0 1 0 1  || 0  01
    1  x  | x x 1 0 1 0  || 1  11

How would we implement this?  A circuit for each of the three output
bits.  Just decide when the individual output bit is 1 (figure 5.17).
The circuit is then easy (5.18).

Now we need the main control, setting:
    the four muxes
    writing the registers
    writing the memory
    reading the memory ??? not really (well maybe ... for DRAM)
    calculating ALUop
So 9 (really 8) bits.  Fig 5.20 shows where these occur.

They are all determined by the opcode.  The MIPS instruction set is
fairly regular: most fields we need are always in the same place in
the instruction.

    Opcode (called Op[5-0]) always in 31-26.
    Regs to be read always 25-21 and 20-16 (R-type, beq, store).
    Base reg always 25-21 (load, store).
    Offset always 15-0.
    Oops: Reg to be written EITHER 20-16 (load) OR 15-11 (R-type).
        MUX!!
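The little ALUop/funct truth table above turns into a few lines (a
sketch of the decode logic, not the actual gates of fig 5.18):

```python
# ALU control: ALUop from the main control plus the funct field give
# the 3-bit ALU control Bnegate:OP.
def alu_control(aluop, funct):
    if aluop == 0b00:                       # lw/sw -> add
        return 0b010
    if aluop & 0b01:                        # beq -> subtract
        return 0b110
    # ALUop 1x: R-type, decode the low funct bits
    return {0b0000: 0b010,                  # add
            0b0010: 0b110,                  # sub
            0b0100: 0b000,                  # and
            0b0101: 0b001,                  # or
            0b1010: 0b111}[funct & 0b1111]  # slt

assert alu_control(0b00, 0) == 0b010        # load/store: add
assert alu_control(0b01, 0) == 0b110        # beq: sub
assert alu_control(0b10, 0b100010) == 0b110 # R-type sub
assert alu_control(0b10, 0b101010) == 0b111 # R-type slt
```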
Fig 5.21 describes each signal:

    MemRead:  Memory delivers the value stored at the specified addr.
    MemWrite: Memory stores the specified value at the specified addr.
    ALUSrc:   Second ALU operand comes from (reg-file /
              sign-ext-immediate).
    RegDst:   Number of the reg to write comes from the (rt / rd)
              field.
    RegWrite: Reg-file stores the specified value in the specified
              register.
    PCSrc:    New PC is Old PC + (4 / shifted-branch-target-disp).
    MemtoReg: Value written in the reg-file comes from (alu / mem).

Fig 5.22 shows the wires for control.

We are interested in four opcodes: R-type, load, store, BEQ.
Do a stage play.
Fig 5.23 summarizes the play (i.e. the control lines).
Fig 5.27 (from the FIFTH edition, an improvement to 5.30 in the fourth
edition) shows the control signal settings for each opcode.

Now it is straightforward but tedious to get the logic equations.
The result is figure 5.31.

HOMEWORK 5.1, 5.11 control, 5.2, 5.12

New instruction format: unconditional jump

    opcode  addr
    31-26   25-0

    Addr is a word addr; the bottom 2 bits of the PC are always 0.
    The top 4 bits of the PC stay as they were (AFTER incr by 4).
    Easy to add.  My overlay to fig 5.22.

================ End Lecture 11 ================

================ Start Lecture 12 ================

---------------- What's wrong? ----------------

Some instructions could be especially slow, and all take the time of
the slowest.  Worse if we look at really tough ones (floating pt
divide).

Solns:
    Variable length cycle
        Asynchronous logic
    Multicycle instructions
        Can reuse the same ALU for different cycles
    Even faster
        Pipeline the cycles
            Can't reuse; complicated; chapter 6
        Multiple datapaths (superscalar)
            Lots of logic, but not too hard if we execute IN ORDER
            Faster if done OUT of order, but very complicated

---------------- Chapter 2 Performance analysis ------------

HOMEWORK Read Chapter 2

Difference between response time and throughput.  Adding a processor
is likely to increase throughput more than it decreases response time.
We will mostly be concerned with response time.

PERFORMANCE = 1 / Execution time

So machine X is n times faster than Y means
    Performance-X = n * Performance-Y
    Execution-time-X = (1/n) * Execution-time-Y

How to measure execution time:
    CPU time
        Includes time waiting for memory.
        Does not include time waiting for I/O, as this process is not
        running then.
    Elapsed time on an empty system
    Elapsed time on a "normally loaded" system
    Elapsed time on a "heavily loaded" system

We mostly use CPU time.  Does NOT mean the other metrics are worse.

(CPU) execution time = (#CPU clock cycles) * (clock cycle time)
                     = (#CPU clock cycles) / (clock rate)

So a machine with a 10ns cycle time runs at a rate of
    1 cycle per 10 ns = 100,000,000 cycles per second = 100 MHz.

#CPU clock cycles = #instructions * CPI
    CPI = Cycles Per Instruction

#instructions for a given program depends on the instruction set.
    We saw in chapter 3 that 1 VAX instruction is often > 1 MIPS
    instruction.

Complicated instructions take longer: either many cycles or a long
cycle time.

Older machines with complicated instructions (e.g. VAX in the 80s) had
CPI > 1.
With pipelining we can have many cycles for each instruction but still
have CPI = 1.
Modern ``superscalar'' machines have CPI < 1.
    They issue many instructions each cycle.
    They are pipelined, so the instructions don't finish for several
    cycles.
    If you have 4-issue and all instructions have 5 pipeline stages,
    there are 20 = 5*4 instructions in progress at one time.

Putting this together:

    Time (in seconds) = Num inst executed * (Clock cycles / inst)
                                          * (seconds / Clock cycle)

HOMEWORK Carefully go through and understand the example on page 59.
HOMEWORK 2.1-2.5, 2.7-2.10
HOMEWORK Make sure you can easily do all the problems with a rating of
[5] and can do all those with a rating of [10].

Why not just use MIPS?
MIPS stands for Millions of Instructions Per Second
    NOT the same as the MIPS computer (but NOT a coincidence)
    Different architectures (inst sets) with the same MIPS rating take
    different time on the same program
    Different programs generate different MIPS ratings on the same arch
    Can raise the rating by adding NOPs, despite increasing exec time

HOMEWORK Carefully go through and understand the example on pages 61-3

Why not use MFLOPS?
    Millions of FLoating point Operations Per Second
    Similar problems to MIPS

Benchmarks
    A start, but the difficulty is choosing a representative benchmark
    for YOUR purchase

HOMEWORK Carefully go through and understand 2.7 "fallacies and pitfalls"

---------------- Chapter 7 Memory ------------

HOMEWORK Read Chapter 7

Ideal memory is
    Fast
    Big (in capacity; not physical size)
    Cheap
    Impossible

We observe empirically

TEMPORAL LOCALITY - A word referenced now is likely to be ref'ed again soon
    Good to keep the currently referenced word around for a while

SPATIAL LOCALITY - Words near the currently ref'ed word are likely to be
    ref'ed soon
    Good to prepare for other words near the current ref

So use a memory hierarchy
    Regs
    Cache (really L1 L2 maybe L3)
    Mem
    Disk
    Archive

We will first study the cache <---> mem gap
    Really many levels of caches
    Similar considerations apply to the other gaps
    But terminology is often different, e.g.
    cache line vs page

(In fall 97) My OS class is studying "the same thing" right now (mem mgt)

Cache is organized in units of BLOCKS
    Transfer a block from mem to cache
    Big blocks are good for spatial locality
    Think of mem as organized in blocks as well
        (OS thinks of pages and page frames)

A HIT occurs when a mem ref is found in the upper level of the mem hierarchy
    We will be interested in cache hits (OS in page hits)
    A miss is a non-hit

Hit rate is the fraction of mem refs that are hits
Miss rate is 1 - hit rate
Hit time is the time for a hit
Miss time is the time for a miss
Miss penalty is Miss time - Hit time

Start with a simple cache organization
    Assume all refs are for one word (not too bad)
    Assume cache blocks are one word
        Bad for spatial locality, so not done in real machines
        We will drop this assumption soon
    Assume each mem block can only go in one specific cache block
        DIRECT MAPPED
        Take mem blk # modulo # blocks in the cache for the location
        Make # blocks in the cache a power of 2
        Example: if the cache has 16 blocks, the location in the cache is
        the low order 4 bits of the block number

How can we tell if a mem blk is in the cache?
    We know where it will be *IF* it is there at all
    We need the "rest" of the addr
        Store the rest of the addr, called the TAG
    Also store a VALID bit per cache block (in case no mem blk is stored
    in this cache block, e.g. when the system boots up).

Show fig AB (fig 7.6)
Calculate the total number of bits

HOMEWORK 7.1

Processing a read for this simple cache
    Hit is trivial
    Miss: Evict and replace
        Why?
    I.e., why keep the new data instead of the old?
    Ans: Temporal Locality

Skip the section "handling cache misses" as it needs chapter 6

Processing a write for this simple cache
    Hit: Write through vs write back
        Write through: write to mem as well
        Write back: don't write to mem now; do it on evict
    Miss: write-allocate vs write-no-allocate

Simplest is write-through, write-allocate
    Still assuming blksize = refsize = 1 word and direct mapped
    For any write (hit or miss) do the following:
        Index the cache using the correct LOBs
        Write the data and the tag
        Set Valid
        Send the request to main mem
    Poor performance
        In the GCC benchmark, 11% of operations are stores
        If we assume an infinite speed memory, CPI is 1.2 for some
        reasonable estimate of instruction speeds
        Assume a 10 cycle store penalty (reasonable)
        CPI becomes 1.2 + 10 * 11% = 2.3   HALF SPEED
    Improvement: Use a write buffer
        Hold a few (four is common) writes at the processor while they
        are being processed at memory
        Satisfy reads from here as well

Unified vs split I and D (instruction and data) caches
    Unified is better because of better "load balancing"
    Split is better because it can do two refs at once

================ End Lecture 12 ================

================ Start Lecture 13 ================

Improvement: Blocksize > Wordsize
    Take advantage of spatial locality
    Figure AC (7.94)
    Show how an access takes place
    Cachesize = Blocksize * #Entries
    Calculate the total number of bits in this cache and for one with
    one word blocks but still 64KB
    If accesses are strictly sequential, this cache has 75% hits;
    the one with one word blocks has 0%

HOMEWORK 7.2 7.6

Why not make the blocksize enormous?  Cache one huge block.
    NOT all accesses are sequential
    With too few blocks, misses go up again

Memory support for wider blocks
    Should memory be wide?
    Should the bus be wide?
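Stepping back, the simple cache above can be put together in a toy sketch (the sizes, names, and example block numbers are mine, not from the text): the direct mapped lookup with tag and valid bit, plus the write-through penalty arithmetic from the GCC numbers.

```python
NUM_BLOCKS = 16   # 16 one-word blocks, a power of 2

def split_block_number(mem_block):
    """Direct mapped: cache location = block# mod #blocks (low 4 bits here);
    the TAG is the rest of the block number."""
    return mem_block % NUM_BLOCKS, mem_block // NUM_BLOCKS

# one VALID bit and one TAG per cache block; valid is clear at boot
cache = [{"valid": False, "tag": 0} for _ in range(NUM_BLOCKS)]

def read(mem_block):
    """True on a hit; on a miss, evict and replace (temporal locality)."""
    index, tag = split_block_number(mem_block)
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return True
    line["valid"], line["tag"] = True, tag
    return False

def cpi_with_stores(base_cpi, store_frac, store_penalty):
    """Simple write-through: every store pays the full memory penalty."""
    return base_cpi + store_frac * store_penalty

# the numbers from the notes: 1.2 + 0.11 * 10 = 2.3, i.e. half speed
```

Referencing the same block twice gives a miss then a hit; a block 16 away maps to the same index with a different tag and evicts it, which is the conflict the associative caches below will address.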
Consider Figure AD (7.12)
    Assume
        1 clock to send the address
        10 clocks for each memory access
        1 clock per busload of data
    Narrow design (a) takes 45 clocks for a read miss (do it)
    Wide design (b) takes 12
    Interleaved design (c) takes 15
    Interleaving works great because we are GUARANTEED to have
    sequential accesses
    Imagine a design between (a) and (b) with a 2-word wide datapath
        Takes 23 cycles and is more expensive to build than (c)

Performance example (dandy exam question)
    Assume a 5% I-cache miss rate and a 10% D-cache miss rate
    1/3 of the instructions access data
    CPI = 4 if the miss penalty is 0 (not realistic of course)
    What is the CPI with a miss penalty of 12 (do it)?
    What is the CPI if we double the speed of the cpu+cache but not the
    mem (i.e., a 24 clock miss penalty) (do it)?
    How much faster is the "double speed" machine?
        It would be double speed if the miss penalty were 0 or the miss
        rate were 0%

HOMEWORK 7.15, 7.16

A lower CPI makes stalls more expensive
A faster CPU (i.e. a faster clock) makes stalls more expensive

Remark: Large caches have a LONGER hit time

SKIP virtual memory
    2nd edition does it later
    OS students should look at it
    Would cover it if we had one more lecture

Improvement: Associative Caches

We have studied DIRECT MAPPED caches, i.e. the location in the cache is
determined by the address.  To check for a hit we compare ONE tag with
the HOBs of the addr.

The other extreme is FULLY ASSOCIATIVE
    A memory block can be placed in any cache block
    So the tag must be the entire block number, and we must check all
    cache blocks to see if we have a hit
        The larger tag is a minor problem
        The search is a disaster
            Either do it sequentially (too slow) or have a comparator
            with each tag entry (too big; and we also need a humongous mux)
        [An alternative is to have a table with one entry per MEMORY
        block giving the cache block number.  This is too big and too
        slow for caches but is used for virtual memory (demand paging)]

Most common for caches is an intermediate configuration called
SET ASSOCIATIVE or n-way associative (e.g. 4-way associative)

Why good?
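Two quick numeric sketches (the framing is mine; the miss rates and penalties are the ones from the exam-style example above): first the effective-CPI arithmetic, then a hint at the answer to the question just asked.

```python
def effective_cpi(base_cpi, i_miss, d_miss, data_frac, penalty):
    """Every instruction is fetched (I-cache stalls); only data_frac of
    them also make a data reference (D-cache stalls)."""
    return base_cpi + i_miss * penalty + data_frac * d_miss * penalty

slow = effective_cpi(4, 0.05, 0.10, 1/3, 12)  # 4 + 0.6 + 0.4 = 5
fast = effective_cpi(4, 0.05, 0.10, 1/3, 24)  # 4 + 1.2 + 0.8 = 6
speedup = slow / (fast * 0.5)                 # fast cycles are half as long

# why associativity is good: block numbers 1M and 2M apart by a power of 2
# get the SAME index in a 4096-block direct mapped cache and evict each
# other, but could share one set of a 2-way set associative cache
NUM = 4096
a, b = 0x100000, 0x200000
same_index = (a % NUM == b % NUM)
```

Note the speedup is 5/3, not 2: doubling the cpu+cache clock leaves the memory stalls the same length in real time.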
Consider referencing two modest arrays (<< cache size) that start at
locations 1MB and 2MB.  Both will contend for the same cache locations
in a direct mapped cache, but will fit together in a cache whose sets
have size >= 2 (i.e. at least 2-way associative).

Unfortunately the terminology is confusing: a cache whose sets each hold
n blocks is called n-way associative, or a cache of set size n.

How do we find the block?  Figure AE (7.9)
    Set# = Block# mod #sets

Which block (in the set) should be replaced?
    Random is sometimes used
    LRU is better but not so easy to do fast enough
        If the sets have size 2, easy.  Keep track of the last used
        block and choose the other.
        If the sets have size 4, view each set as two groups of two.
        Choose the group not last used, and then the block in that group
        not last used.  This is an approximation to LRU.
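The two-level LRU approximation just described, for sets of size 4, can be coded directly (the class name and encoding are my own; real hardware would keep this state as 3 bits per set):

```python
class PseudoLRU4:
    """Tree pseudo-LRU for one 4-way set: remember which pair of blocks
    was used last, and within each pair which member was used last.
    The victim is the not-last-used member of the not-last-used pair."""
    def __init__(self):
        self.last_pair = 0      # which pair (0 or 1) was touched last
        self.last_in = [0, 0]   # within each pair, which member was touched

    def touch(self, way):       # way in 0..3; called on every hit or fill
        pair, member = divmod(way, 2)
        self.last_pair = pair
        self.last_in[pair] = member

    def victim(self):           # block to replace on a miss
        pair = 1 - self.last_pair
        return pair * 2 + (1 - self.last_in[pair])
```

For example, after touching ways 0 then 3, the victim is way 1: the pair {0,1} was not used last, and within it way 0 was used more recently than way 1.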