---------------- What's wrong? ----------------

Some instructions could be especially slow, and all take the time of the
slowest. Worse if we look at really tough ones (floating point divide).

Solutions

  Variable length cycle
    Asynchronous logic

  Multicycle instructions
    Can reuse the same ALU for different cycles

  Even faster
    Pipeline the cycles
      Can't reuse
      Complicated
      Chapter 6
    Multiple datapaths (superscalar)
      Lots of logic, but not too hard if you execute IN ORDER
      Faster if you execute OUT of order, but very complicated

---------------- Chapter 2 Performance analysis ----------------

HOMEWORK Read Chapter 2

Difference between response time and throughput
  Adding a processor is likely to increase throughput more than it
  decreases response time
  We will mostly be concerned with response time

PERFORMANCE = 1 / Execution time

So "machine X is n times faster than Y" means
  Performance-X = n * Performance-Y
  Execution-time-X = (1/n) * Execution-time-Y

How to measure execution time
  CPU time
    Includes time waiting for memory
    Does not include time waiting for I/O, as this process is not running
  Elapsed time on an empty system
  Elapsed time on a "normally loaded" system
  Elapsed time on a "heavily loaded" system

We mostly use CPU time; this does NOT mean the other metrics are worse.

(CPU) execution time = (#CPU clock cycles) * (clock cycle time)
                     = (#CPU clock cycles) / (clock rate)

So a machine with a 10ns cycle time runs at a rate of
1 cycle per 10ns = 100,000,000 cycles per second = 100 MHz.

#CPU clock cycles = #instructions * CPI, where CPI = Cycles Per Instruction

The #instructions for a given program depends on the instruction set.
We saw in chapter 3 that 1 VAX instruction is often > 1 MIPS instruction.

Complicated instructions take longer: either many cycles or a long cycle time
  Older machines with complicated instructions (e.g. the VAX in the 80s)
  had CPI > 1
  With pipelining, each instruction can take many cycles but the machine
  can still have CPI = 1
  Modern ``superscalar'' machines have CPI < 1
    They issue many instructions each cycle
    They are pipelined, so the instructions don't finish for several cycles
    If we have a 4-issue machine and all instructions take 5 pipeline
    stages, there are 20 = 5*4 instructions in progress at one time

Putting this together (two small sketches appear at the end of this chapter),

  Time (in seconds) = #instructions executed
                      * (clock cycles / instruction)
                      * (seconds / clock cycle)

HOMEWORK Carefully go through and understand the example on page 59
HOMEWORK 2.1-2.5, 2.7-2.10
HOMEWORK Make sure you can easily do all the problems with a rating of [5]
         and can do all with a rating of [10]

Why not just use MIPS?
  MIPS = Millions of Instructions Per Second
  NOT the same as the MIPS computer (but NOT a coincidence)
  Different architectures (instruction sets) with the same MIPS rating
  can take different amounts of time for the same job
  Different programs generate different MIPS ratings on the same
  architecture
  You can raise the rating by adding NOPs, despite increasing the
  execution time

HOMEWORK Carefully go through and understand the example on pages 61-3

Why not use MFLOPS?
  MFLOPS = Millions of FLoating point Operations Per Second
  Similar problems to MIPS

Benchmarks
  A start, but the difficulty is choosing a benchmark representative
  of YOUR purchase

HOMEWORK Carefully go through and understand 2.7 "fallacies and pitfalls"
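To make the execution-time formula concrete, here is a minimal C sketch
(my own illustration, not the book's page 59 example; the machine
parameters are made up):

    #include <stdio.h>

    /* Execution time = #instructions * CPI * (seconds per cycle).
       A minimal sketch; the numbers below are made up for illustration. */
    double exec_time(double insts, double cpi, double clock_rate_hz)
    {
        return insts * cpi / clock_rate_hz;  /* 1/rate = cycle time */
    }

    int main(void)
    {
        /* Hypothetical machine: 100 MHz (10ns cycle), CPI 1.2,
           running a program of 10 million instructions. */
        printf("%g seconds\n", exec_time(10e6, 1.2, 100e6));  /* 0.12 */
        return 0;
    }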
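The MIPS pitfall above is also just arithmetic: the rating is
#instructions / (execution time * 10^6), so padding a program with NOPs
can raise the rating while making the program slower. A sketch with
invented numbers (a hypothetical 100 MHz machine where useful
instructions average CPI 2 and NOPs take 1 cycle):

    #include <stdio.h>

    int main(void)
    {
        double rate = 100e6;              /* 100 MHz, invented */
        double work = 10e6, nops = 40e6;  /* instruction counts, invented */

        double t1 = work * 2.0 / rate;                  /* as written   */
        double t2 = (work * 2.0 + nops * 1.0) / rate;   /* NOP-padded   */

        printf("original: %g s, %g MIPS\n", t1, work / (t1 * 1e6));
        printf("padded:   %g s, %g MIPS\n", t2, (work + nops) / (t2 * 1e6));
        /* Prints 0.2 s at 50 MIPS vs 0.6 s at 83.3 MIPS: the padded
           version is 3 times slower yet gets the higher rating. */
        return 0;
    }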
---------------- Chapter 7 Memory ----------------

HOMEWORK Read Chapter 7

Ideal memory is
  Fast
  Big (in capacity, not physical size)
  Cheap
Impossible to get all three.

We observe empirically

  TEMPORAL LOCALITY - a word referenced now is likely to be referenced
  again soon.  Good to keep the currently referenced word around for
  a while.

  SPATIAL LOCALITY - words near the currently referenced word are
  likely to be referenced soon.  Good to prepare for other words near
  the current reference.

So use a memory hierarchy
  Registers
  Cache (really L1, L2, maybe L3)
  Memory
  Disk
  Archive

We will first study the cache <---> memory gap
  Really there are many levels of caches
  Similar considerations apply to the other gaps, but the terminology
  is often different, e.g. cache line vs page
  (In fall 97) My OS class is studying "the same thing" right now
  (memory management)

Cache is organized in units of BLOCKS
  Transfer a block from memory to cache
  Big blocks are good for spatial locality
  Think of memory as organized in blocks as well (the OS thinks of
  pages and page frames)

A HIT occurs when a memory reference is found in the upper level of
the memory hierarchy
  We will be interested in cache hits (the OS in page hits)
  A miss is a non-hit
  Hit rate is the fraction of memory references that are hits
  Miss rate is 1 - hit rate
  Hit time is the time for a hit
  Miss time is the time for a miss
  Miss penalty is miss time - hit time

Start with a simple cache organization
  Assume all references are for one word (not too bad)
  Assume cache blocks are one word
    Bad for spatial locality, so not done in real machines
    We will drop this assumption soon
  Assume each memory block can go in only one specific cache block:
  DIRECT MAPPED
    The location is the memory block number modulo the number of
    blocks in the cache
    Make the number of blocks in the cache a power of 2
    Example: if the cache has 16 blocks, the location in the cache is
    the low order 4 bits of the block number

How can we tell if a memory block is in the cache?
  We know where it will be *IF* it is there at all
  We need the "rest" of the address, so we store it: it is called the TAG
  Also store a VALID bit per cache block (in case no memory block is
  stored in this cache block, e.g. when the system boots up)

Show fig AB (fig 7.6)
Calculate the total number of bits (worked in the first sketch at the
end of these notes)

HOMEWORK 7.1

Processing a read for this simple cache
  Hit is trivial
  Miss: evict and replace
    Why? I.e., why keep the new data instead of the old?
    Ans: temporal locality

Skip the section "handling cache misses" as it needs chapter 6

Processing a write for this simple cache
  Hit: write through vs write back
    Write through: write to memory as well
    Write back: don't write to memory now; do it on evict
  Miss: write-allocate vs write-no-allocate

Simplest is write-through, write-allocate
  Still assuming blocksize = refsize = 1 word, and direct mapped
  For any write (hit or miss) do the following:
    Index the cache using the correct low order bits
    Write the data and the tag
    Set valid
    Send the request to main memory

Poor performance
  The GCC benchmark has 11% of operations stores
  If we assume an infinite speed memory, CPI is 1.2 for some
  reasonable estimate of instruction speeds
  Assume a 10 cycle store penalty (reasonable)
  CPI becomes 1.2 + 10 * 11% = 2.3    HALF SPEED
  (the second sketch at the end works this arithmetic)

Improvement: use a write buffer
  Hold a few (four is common) writes at the processor while they are
  being processed at memory
  Satisfy reads from here as well

Unified vs split I and D (instruction and data) caches
  Unified is better because of better "load balancing"
  Split is better because it can do two references at once
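To make the direct-mapped organization concrete, here is a minimal C
sketch of the simple cache above: 16 one-word blocks, with a tag and a
valid bit per block. This is my own illustration of the technique, not
the book's figure 7.6; all names and addresses are made up.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NBLOCKS 16                    /* must be a power of 2 */

    struct line {
        bool     valid;                   /* nothing stored here yet? */
        uint32_t tag;                     /* the "rest" of the address */
        uint32_t data;                    /* one word per block */
    };

    static struct line cache[NBLOCKS];    /* valid bits start out false */

    /* Look up a 32-bit byte address; returns true on a hit.  On a miss,
       evict whatever is there and install the new block (temporal
       locality says keep the new data). */
    bool lookup(uint32_t byte_addr, uint32_t *word)
    {
        uint32_t block = byte_addr / 4;   /* one-word (4-byte) blocks */
        uint32_t index = block % NBLOCKS; /* low order 4 bits of block# */
        uint32_t tag   = block / NBLOCKS; /* the remaining high bits */

        if (cache[index].valid && cache[index].tag == tag) {
            *word = cache[index].data;    /* hit */
            return true;
        }
        cache[index].valid = true;        /* miss: evict and replace */
        cache[index].tag   = tag;
        cache[index].data  = 0;           /* would be fetched from memory */
        return false;
    }

    int main(void)
    {
        uint32_t w;
        printf("%d\n", lookup(0x1040, &w)); /* 0: miss, cache is cold */
        printf("%d\n", lookup(0x1040, &w)); /* 1: hit, same block */
        printf("%d\n", lookup(0x1080, &w)); /* 0: miss, same index but a
                                               different tag, so the old
                                               block is evicted */
        return 0;
    }

Bit count for this toy cache, assuming 32-bit byte addresses: 2 bits of
byte offset and 4 bits of index leave a 26-bit tag, so each block needs
32 (data) + 26 (tag) + 1 (valid) = 59 bits, and the whole cache
16 * 59 = 944 bits.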
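The write-through slowdown quoted above is one line of arithmetic; a
sketch using the same figures (base CPI 1.2, 11% stores, 10 cycle
store penalty):

    #include <stdio.h>

    /* Effective CPI with write-through and no write buffer: every
       store stalls for the full store penalty. */
    int main(void)
    {
        double base_cpi      = 1.2;   /* infinite speed memory */
        double store_frac    = 0.11;  /* GCC: 11% of operations */
        double store_penalty = 10.0;  /* cycles per store */

        double cpi = base_cpi + store_penalty * store_frac;
        printf("effective CPI = %.1f\n", cpi);        /* 1.2 + 1.1 = 2.3 */
        printf("slowdown = %.2fx\n", cpi / base_cpi); /* about half speed */
        return 0;
    }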