---------------- What's wrong? ----------------

Some instructions could be especially slow, and all take the time of the
slowest. Worse if we look at really tough ones (floating point divide).

Solutions

  Variable length cycle
    Asynchronous logic

  Multicycle instructions
    Can reuse the same ALU for different cycles

  Even faster
    Pipeline the cycles
      Can't reuse
      Complicated
      Chapter 6
    Multiple datapaths (superscalar)
      Lots of logic, but not too hard if you execute IN ORDER
      Faster if you execute OUT of order, but very complicated

---------------- Chapter 2 Performance analysis ----------------

HOMEWORK Read Chapter 2

Difference between response time and throughput
  Adding a processor is likely to increase throughput more than it
  decreases response time
  We will mostly be concerned with response time

PERFORMANCE = 1 / Execution time

So "machine X is n times faster than Y" means
  Performance-X = n * Performance-Y
  Execution-time-X = (1/n) * Execution-time-Y

How to measure execution time
  CPU time
    Includes time waiting for memory
    Does not include time waiting for I/O, as this process is not running
  Elapsed time on an empty system
  Elapsed time on a "normally loaded" system
  Elapsed time on a "heavily loaded" system

We mostly use CPU time; this does NOT mean the other metrics are worse.

(CPU) execution time = (#CPU clock cycles) * (clock cycle time)
                     = (#CPU clock cycles) / (clock rate)

So a machine with a 10ns cycle time runs at a rate of
1 cycle per 10ns = 100,000,000 cycles per second = 100 MHz.

#CPU clock cycles = #instructions * CPI, where CPI = Cycles Per Instruction

The #instructions for a given program depends on the instruction set.
We saw in chapter 3 that 1 VAX instruction is often > 1 MIPS instruction.

Complicated instructions take longer: either many cycles or a long cycle time
  Older machines with complicated instructions (e.g. the VAX in the 80s)
  had CPI > 1
  With pipelining, each instruction can take many cycles but the machine
  can still have CPI = 1
  Modern ``superscalar'' machines have CPI < 1
    They issue many instructions each cycle
    They are pipelined, so the instructions don't finish for several cycles
    If we have a 4-issue machine and all instructions take 5 pipeline
    stages, there are 20 = 5*4 instructions in progress at one time

Putting this together (two small sketches appear at the end of this chapter),

  Time (in seconds) = #instructions executed
                      * (clock cycles / instruction)
                      * (seconds / clock cycle)

HOMEWORK Carefully go through and understand the example on page 59
HOMEWORK 2.1-2.5, 2.7-2.10
HOMEWORK Make sure you can easily do all the problems with a rating of [5]
         and can do all with a rating of [10]

Why not just use MIPS?
  MIPS = Millions of Instructions Per Second
  NOT the same as the MIPS computer (but NOT a coincidence)
  Different architectures (instruction sets) with the same MIPS rating
  can take different amounts of time for the same job
  Different programs generate different MIPS ratings on the same
  architecture
  You can raise the rating by adding NOPs, despite increasing the
  execution time

HOMEWORK Carefully go through and understand the example on pages 61-3

Why not use MFLOPS?
  MFLOPS = Millions of FLoating point Operations Per Second
  Similar problems to MIPS

Benchmarks
  A start, but the difficulty is choosing a benchmark representative
  of YOUR purchase

HOMEWORK Carefully go through and understand 2.7 "fallacies and pitfalls"
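To make the execution-time formula concrete, here is a minimal C sketch
(my own illustration, not the book's page 59 example; the machine
parameters are made up):

    #include <stdio.h>

    /* Execution time = #instructions * CPI * (seconds per cycle).
       A minimal sketch; the numbers below are made up for illustration. */
    double exec_time(double insts, double cpi, double clock_rate_hz)
    {
        return insts * cpi / clock_rate_hz;  /* 1/rate = cycle time */
    }

    int main(void)
    {
        /* Hypothetical machine: 100 MHz (10ns cycle), CPI 1.2,
           running a program of 10 million instructions. */
        printf("%g seconds\n", exec_time(10e6, 1.2, 100e6));  /* 0.12 */
        return 0;
    }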
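The MIPS pitfall above is also just arithmetic: the rating is
#instructions / (execution time * 10^6), so padding a program with NOPs
can raise the rating while making the program slower. A sketch with
invented numbers (a hypothetical 100 MHz machine where useful
instructions average CPI 2 and NOPs take 1 cycle):

    #include <stdio.h>

    int main(void)
    {
        double rate = 100e6;              /* 100 MHz, invented */
        double work = 10e6, nops = 40e6;  /* instruction counts, invented */

        double t1 = work * 2.0 / rate;                  /* as written   */
        double t2 = (work * 2.0 + nops * 1.0) / rate;   /* NOP-padded   */

        printf("original: %g s, %g MIPS\n", t1, work / (t1 * 1e6));
        printf("padded:   %g s, %g MIPS\n", t2, (work + nops) / (t2 * 1e6));
        /* Prints 0.2 s at 50 MIPS vs 0.6 s at 83.3 MIPS: the padded
           version is 3 times slower yet gets the higher rating. */
        return 0;
    }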
---------------- Chapter 7 Memory ----------------

HOMEWORK Read Chapter 7

Ideal memory is
  Fast
  Big (in capacity, not physical size)
  Cheap
Impossible to get all three.

We observe empirically

  TEMPORAL LOCALITY - a word referenced now is likely to be referenced
  again soon.  Good to keep the currently referenced word around for
  a while.

  SPATIAL LOCALITY - words near the currently referenced word are
  likely to be referenced soon.  Good to prepare for other words near
  the current reference.

So use a memory hierarchy
  Registers
  Cache (really L1, L2, maybe L3)
  Memory
  Disk
  Archive

We will first study the cache <---> memory gap
  Really there are many levels of caches
  Similar considerations apply to the other gaps, but the terminology
  is often different, e.g. cache line vs page
  (In fall 97) My OS class is studying "the same thing" right now
  (memory management)

Cache is organized in units of BLOCKS
  Transfer a block from memory to cache
  Big blocks are good for spatial locality
  Think of memory as organized in blocks as well (the OS thinks of
  pages and page frames)

A HIT occurs when a memory reference is found in the upper level of
the memory hierarchy
  We will be interested in cache hits (the OS in page hits)
  A miss is a non-hit
  Hit rate is the fraction of memory references that are hits
  Miss rate is 1 - hit rate
  Hit time is the time for a hit
  Miss time is the time for a miss
  Miss penalty is miss time - hit time

Start with a simple cache organization
  Assume all references are for one word (not too bad)
  Assume cache blocks are one word
    Bad for spatial locality, so not done in real machines
    We will drop this assumption soon
  Assume each memory block can go in only one specific cache block:
  DIRECT MAPPED
    The location is the memory block number modulo the number of
    blocks in the cache
    Make the number of blocks in the cache a power of 2
    Example: if the cache has 16 blocks, the location in the cache is
    the low order 4 bits of the block number

How can we tell if a memory block is in the cache?
  We know where it will be *IF* it is there at all
  We need the "rest" of the address, so we store it: it is called the TAG
  Also store a VALID bit per cache block (in case no memory block is
  stored in this cache block, e.g. when the system boots up)

Show fig AB (fig 7.6)
Calculate the total number of bits (worked in the first sketch at the
end of these notes)

HOMEWORK 7.1

Processing a read for this simple cache
  Hit is trivial
  Miss: evict and replace
    Why? I.e., why keep the new data instead of the old?
    Ans: temporal locality

Skip the section "handling cache misses" as it needs chapter 6

Processing a write for this simple cache
  Hit: write through vs write back
    Write through: write to memory as well
    Write back: don't write to memory now; do it on evict
  Miss: write-allocate vs write-no-allocate

Simplest is write-through, write-allocate
  Still assuming blocksize = refsize = 1 word, and direct mapped
  For any write (hit or miss) do the following:
    Index the cache using the correct low order bits
    Write the data and the tag
    Set valid
    Send the request to main memory

Poor performance
  The GCC benchmark has 11% of operations stores
  If we assume an infinite speed memory, CPI is 1.2 for some
  reasonable estimate of instruction speeds
  Assume a 10 cycle store penalty (reasonable)
  CPI becomes 1.2 + 10 * 11% = 2.3    HALF SPEED
  (the second sketch at the end works this arithmetic)

Improvement: use a write buffer
  Hold a few (four is common) writes at the processor while they are
  being processed at memory
  Satisfy reads from here as well

Unified vs split I and D (instruction and data) caches
  Unified is better because of better "load balancing"
  Split is better because it can do two references at once
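To make the direct-mapped organization concrete, here is a minimal C
sketch of the simple cache above: 16 one-word blocks, with a tag and a
valid bit per block. This is my own illustration of the technique, not
the book's figure 7.6; all names and addresses are made up.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NBLOCKS 16                    /* must be a power of 2 */

    struct line {
        bool     valid;                   /* nothing stored here yet? */
        uint32_t tag;                     /* the "rest" of the address */
        uint32_t data;                    /* one word per block */
    };

    static struct line cache[NBLOCKS];    /* valid bits start out false */

    /* Look up a 32-bit byte address; returns true on a hit.  On a miss,
       evict whatever is there and install the new block (temporal
       locality says keep the new data). */
    bool lookup(uint32_t byte_addr, uint32_t *word)
    {
        uint32_t block = byte_addr / 4;   /* one-word (4-byte) blocks */
        uint32_t index = block % NBLOCKS; /* low order 4 bits of block# */
        uint32_t tag   = block / NBLOCKS; /* the remaining high bits */

        if (cache[index].valid && cache[index].tag == tag) {
            *word = cache[index].data;    /* hit */
            return true;
        }
        cache[index].valid = true;        /* miss: evict and replace */
        cache[index].tag   = tag;
        cache[index].data  = 0;           /* would be fetched from memory */
        return false;
    }

    int main(void)
    {
        uint32_t w;
        printf("%d\n", lookup(0x1040, &w)); /* 0: miss, cache is cold */
        printf("%d\n", lookup(0x1040, &w)); /* 1: hit, same block */
        printf("%d\n", lookup(0x1080, &w)); /* 0: miss, same index but a
                                               different tag, so the old
                                               block is evicted */
        return 0;
    }

Bit count for this toy cache, assuming 32-bit byte addresses: 2 bits of
byte offset and 4 bits of index leave a 26-bit tag, so each block needs
32 (data) + 26 (tag) + 1 (valid) = 59 bits, and the whole cache
16 * 59 = 944 bits.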
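The write-through slowdown quoted above is one line of arithmetic; a
sketch using the same figures (base CPI 1.2, 11% stores, 10 cycle
store penalty):

    #include <stdio.h>

    /* Effective CPI with write-through and no write buffer: every
       store stalls for the full store penalty. */
    int main(void)
    {
        double base_cpi      = 1.2;   /* infinite speed memory */
        double store_frac    = 0.11;  /* GCC: 11% of operations */
        double store_penalty = 10.0;  /* cycles per store */

        double cpi = base_cpi + store_penalty * store_frac;
        printf("effective CPI = %.1f\n", cpi);        /* 1.2 + 1.1 = 2.3 */
        printf("slowdown = %.2fx\n", cpi / base_cpi); /* about half speed */
        return 0;
    }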