Class 17
CS 439
19 March 2013
[lecturer: Parth Upadhyay]

On the board
------------
0) Questions before we start?

1) Preview -- I/O, disks, file systems, transactions

2) I/O

3) Disks

I/O
  * architecture
  * communicating with devices
  * device drivers

Start:

Okay, so today we're going to talk a bit about I/O and disks. We're going to
take one more layer of mysticism away from the computer and try to understand
what happens between the CPU and the various devices that can be connected to
it. Then we'll take a decently detailed look at how disks work, and at why
this is still important.

A) Architecture

    draw this picture!! (it's the one on the handout)
    ----
    ----

    - Some examples of devices here are things like network cards, graphics
      cards, disks, etc.
    - The CPU accesses physical memory over a bus.
    - Devices, too, access memory over an I/O bus.

B) Communicating with devices

    How can we communicate with these devices?

    (a) Memory-mapped device registers
        - Loads and stores to certain memory addresses actually correspond to
          registers on the device. You as a programmer read and write those
          addresses with ordinary loads and stores, but they really go
          to/from the device's registers.

    (b) Device memory
        - The device may have memory of its own that the OS can write to
          directly, on the other side of the I/O bus.

    (c) Special I/O instructions
        - Some CPUs have special instructions for I/O.
        - They're like load/store, but they address I/O ports rather than
          memory.
        - They allow user processes to be given access to I/O ports at a
          finer granularity than a page.

    (d) DMA (direct memory access)
        - The OS places buffer descriptors (instructions for the device) in
          main memory.
        - It "pokes" the card by writing to a register.
        - The card knows where to find this list, and it can then access the
          buffers on its own.
        - This overlaps unrelated computation with moving data over the bus.
        - This makes a lot of sense: instead of constantly dealing with one
          word or byte at a time, the device can write the contents of its
          operation straight into memory.

        (i) DMA disk read example:

            Let's look at an example of how DMA would work.

            1) The OS has a request (received from, potentially, a user
               process) that asks it to transfer disk data to a buffer at
               address X.
            2) The OS (really the device driver -- you'll learn about those
               in a minute) tells the disk controller to transfer C bytes to
               address X.
            3) The disk controller initiates the transfer, sending each byte
               to the DMA controller.
            4) The DMA controller transfers the bytes into memory at X.
            5) When the transfer is finished, the DMA controller interrupts
               the CPU to let it know.

C) Device drivers

    - A driver exposes an interface to the kernel so that the kernel can make
      comparatively simple read/write calls; it abstracts away the nasty
      hardware details.
    - How does the driver synchronize with the device?
        - For example, the driver may need to know when a network card is
          done sending a packet, or when a disk request is complete.

    [NOTE: the device doesn't care a huge deal which of the following two
    options is in play. Interrupts and polling are the mechanisms between the
    device and the CPU; the question here is about the logic in the driver
    and the interrupt controller.]

    (a) Approach 1: polling
        - Sent a packet? Loop asking the card whether the buffer is free yet.
        - Waiting for a packet? Loop asking the card whether one has arrived.
        - Disk I/O? Keep looping until the ready bit is set.

        Disadvantages: wasted CPU cycles, and high latency (the data may have
        arrived, but you're not due to check for it until some time in the
        future). A little sketch of this approach follows below.
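        To make memory-mapped registers and polling concrete, here is a
        minimal bare-metal-style C sketch. It is not any real device: the
        addresses, the ready bit, and the register layout are all made up for
        illustration.

            /* Hypothetical device: a status register and a data register are
             * memory-mapped at fixed (made-up) physical addresses. */
            #include <stdint.h>

            #define DEV_STATUS   ((volatile uint32_t *)0xFEED0000) /* made-up address */
            #define DEV_DATA     ((volatile uint32_t *)0xFEED0004) /* made-up address */
            #define STATUS_READY 0x1                               /* made-up bit */

            /* Polling: spin until the device says the data register holds
             * something, then read it with an ordinary load. The spin loop is
             * exactly the "wasted CPU cycles" cost described above. */
            uint32_t read_when_ready(void)
            {
                while ((*DEV_STATUS & STATUS_READY) == 0)
                    ;               /* busy-wait */
                return *DEV_DATA;   /* ordinary load, but it reads a device register */
            }

        (The volatile qualifier is what keeps the compiler from optimizing the
        repeated loads away; a real driver would also need the right memory
        barriers for its platform.)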
    (b) Approach 2: interrupts
        - Ask the card to interrupt the CPU on events.
        - The interrupt handler runs at high priority.
        - It asks the card what happened (buffer freed, new packet, etc.).
        - This is what most general-purpose OSes do.

        Disadvantages: what could go wrong with interrupts? They are BAD in
        cases where the data arrival rate is high:
        -- packets arrive faster than the OS can process them
        -- interrupts are very expensive (context switch)
        -- interrupt handlers have high priority
        -- in the worst case, you get receive livelock, where you spend 100%
           of your time in the interrupt handler but no real work gets done.

        ANALOGY: interrupts vs. polling, in the context of phone
        notifications. There are two ways for you to find out whether you
        have new emails/tweets. One is to stop everything and look at your
        phone every time it buzzes (God forbid you miss that tweet). The
        other is to check your email periodically. The trade-offs are pretty
        similar! You get that snapchat five minutes later than you could
        have, but you don't pay the cost of so many context switches.

        How do you design systems given these trade-offs? Start with
        interrupts. If you notice that your system is slowing down because of
        them, switch to polling. If polling is chewing up too many cycles,
        move toward adaptively switching between interrupts and polling. (But
        of course, never optimize until you actually know what the problem
        is.)

        -- interrupts are great for disk requests! which segues into our next
           topic....

3) Disks

    A. What is a disk?
    B. Geometry
    C. Performance
    D. Common #s
    E. How drivers interface to the disk
    F. Performance II
    G. Disk scheduling (Performance III)
    H. Technology and systems trends

    (Show of hands: how many of you have built a computer before?)

A) What is a disk?

    This is your standard SATA drive that you stick into your computer. It
    whirrs up when you start it....

    - a stack of magnetic platters
    - the platters rotate together on a central spindle, at between 3,600 and
      15,000 RPM

    // draw picture: stack of platters on a spindle

    (a) disk arm assembly
        - the arms rotate around a pivot, and all move together
        - the pivot offers some resistance to linear shocks
        - the arms carry the disk heads -- one for each recording surface
        - the heads read/write the data on the platters

    Kind of like vinyl records.

    -- interjection --

    Before we continue, you might ask why this is important! Education is for
    the future, and arguably SSDs (explain what they are in case people don't
    know) are the future. The reason it's important is that disks are still
    widely used everywhere, and will be for some time. Google, Facebook, etc.
    all still pack their data centers full of cheap, old disks. Also, for
    them, disk failure is the common case, not the random/weird case (they
    have so many disks that some are always failing), so they can't cram
    their data centers with expensive SSDs. The second reason is technical
    literacy: many file systems were designed with the disk in mind
    (sequential access significantly better than random). You have to know
    how these things work as a computer scientist and as a programmer; you
    don't want to look dumb when you go out into the real world.

B) Geometry of a disk

    // draw a picture here! (a big one, preferably).
    // This may be a good picture to put on the handout, honestly.

    Track - a circle on a platter. Each platter is divided into concentric
            tracks.
    Sector - a chunk of a track (fixed size, typically 512 bytes).

    Cylinder - the locus of all tracks of a fixed radius, across all
               platters.

    - The heads are roughly lined up on a cylinder.
    - NOTE: a significant fraction of the encoded stream is used for error
      correction.
    - Generally only one head is active at a time (why?)
        -- disks usually have only one set of read/write circuitry
        -- must worry about cross-talk between channels
        -- hard to keep multiple heads exactly aligned
    - Disk positioning system
        - moves the head to a particular track and keeps it there
        - resists physical shocks, imperfect tracks, etc.
    - Disks also usually have a buffer; it just makes sense. It's used for
      things like reading ahead, or just reading a whole track as you go.

    A **seek** consists of up to four phases:
        (a) speedup: accelerate the arm to max speed or to the halfway point
        (b) coast: stay at max speed (for long seeks)
        (c) slowdown: brake to a stop near the destination
        (d) settle: adjust the head onto the actual desired track

C) Performance!

    (important to understand if you're building systems that need good perf.)

    What are the components of a transfer?
        * rotational delay - time for the sector to rotate under the disk head
        * seek - speedup, coast, slowdown, settle
        * transfer time - we'll discuss this in a minute (pretty basic: how
          fast the head can read, and how much you need to read)

    SEEK:
        - seeking from track to track is comparatively fast (~1 ms) and is
          mainly just the settle time
        - short seeks are dominated by speedup (btw, the arm can accelerate
          at up to several hundred g's)
        - longer seeks are dominated by coast
        - head switches are comparable to short seeks
        - sometimes you'll see a statistic called "average seek time". This
          could actually mean several things:
            * the time to seek 1/3 of the disk
            * 1/3 of the time to seek the whole disk
          (why aren't these the same??)

    We'll do an example soon where we use these numbers to figure out the
    time it takes to access a portion of the disk.

D) Some common #s

    - capacity: 100s of GB
    - platters: 8
    - number of cylinders: tens of thousands or more
    - sectors per track: ~1000
    - RPM: 10,000
    - transfer rate: 50-85 MB/s
    - mean time between failures: ~1 million hours
      (for disks in data centers it's vastly less. They have so many disks
      that disk failures become common, and they have huge software systems
      in place to replicate data and handle these failures. They don't have
      super crazy expensive disks.)

    (Latency numbers! Source: https://gist.github.com/jboner/2841832)

    Putting the numbers in perspective: disks are *the* bottleneck in many
    systems. Here are some numbers that should give you context for just how
    expensive disk-related I/O can be.

    (Ask people to guess these numbers! See if anyone has an intuition about
    the relative costs of these operations.)

    If an L1 cache reference took 1 second:

        L1 cache reference           0:01                 (really ~0.5 ns)
        Mutex lock/unlock            0:50                 (~25 ns)
        Main memory reference        3:20                 (~100 ns)
        Read 4K randomly from SSD    3 days, 11:20:00     (~150 us, i.e. < 1 ms)
        Disk seek                    231 days, 11:33:20   (~10 ms)

    // in case you're thinking "oh, but my macbook retina has this epic SSD
    // in it. this isn't a problem anymore, right?" NO! disk access is still
    // a huge bummer.

    How can anyone get anything done?? Well, the real numbers are scaled down
    by about 9 orders of magnitude, so they're not long in HUMAN terms. Also,
    the whole computer doesn't just come to a halt every time someone wants
    to access the disk: we can context switch the waiting process out, wait
    for an interrupt, and get other useful work done. BUT what is true is
    that if your program needs to access the disk often, it's running through
    molasses.
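    A quick way to reproduce the scaled figures above is to multiply each real
    latency by whatever factor stretches an L1 reference (~0.5 ns) out to one
    second. Here is a minimal C sketch of that arithmetic; the latencies are
    the rough figures from the gist cited above, not measurements of any
    particular machine.

        #include <stdio.h>

        int main(void)
        {
            /* Rough real-world latencies, in nanoseconds (from the
             * latency-numbers gist cited above; your hardware will differ). */
            struct { const char *name; double ns; } ops[] = {
                { "L1 cache reference",        0.5 },
                { "mutex lock/unlock",        25.0 },
                { "main memory reference",   100.0 },
                { "4K random read from SSD", 150e3 },
                { "disk seek",                10e6 },
            };

            /* Scale factor that makes an L1 reference take exactly 1 second. */
            double scale = 1.0 / 0.5;
            for (int i = 0; i < 5; i++) {
                double secs = ops[i].ns * scale;
                printf("%-26s %14.1f scaled seconds (%.2f days)\n",
                       ops[i].name, secs, secs / 86400.0);
            }
            return 0;
        }

    The disk-seek line comes out to about 2 * 10^7 scaled seconds, i.e. the
    ~231 days in the table.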
F) Disk Performance II

    Let's do two examples to really flesh this issue out and make these
    numbers a little less abstract.

    Numbers:
        Spindle speed:                      7200 RPM
        Avg seek time (read / write):       10.5 ms / 12 ms
        Maximum seek time:                  19 ms
        Track-to-track seek time:           1 ms
        Transfer rate (surface to buffer):  54-128 MB/s
        Transfer rate (buffer to host):     375 MB/s

    Q: How long would it take to do 500 read requests, spread out randomly
       over the disk (and serviced in FIFO order)?

    A: This is basically the worst case; we have no locality in our requests.
       Disk access time is: seek time + rotation time + transfer time. This
       makes sense, right? You first have to get the head to the right place,
       then wait for the disk to spin the sector under it, and then transfer
       the data. Let's figure out these numbers!

       1) seek time: given to us in the table: 10.5 ms.

       2) rotation time: given to us implicitly through the 7200 RPM figure.
          7200 rev/min = 120 rev/sec, so one revolution takes 1/120 sec =
          8.33 ms. But that's a whole rotation; on average you expect to wait
          half a revolution, so call it 4.17 ms.

       3) transfer time: we can read at no less than 54 MB/s, and we're
          reading 1 sector (512 bytes), so:
          512 bytes / (54 MB/s) = 0.0095 ms.

       So our cost PER READ is (make sure people are convinced):

          10.5 ms (seek) + 4.17 ms (avg rotation) + 0.0095 ms (transfer)
          = 14.7 ms/request

          14.7 ms * 500 = 7.3 seconds!

       (Holy crud! This is kind of a ridiculous number..... how can we do
       better?)

    Q: How long would it take to do 500 requests laid out SEQUENTIALLY on the
       disk (FIFO order once more)?

    A: The difference now is that we don't need to seek OR rotate more than
       once! Why? Because once we've gotten to the start, we're reading
       sequentially, so we don't pay for any of that again.

       1) seek time: given to us in the table, 10.5 ms
       2) rotation time: same as above, 4.17 ms
       3) transfer time: 500 * 512 bytes / (54 MB/s) = 4.74 ms,
          OR, at the top surface rate: 500 * 512 bytes / (128 MB/s) = 2 ms

       SO: total = 10.5 + 4.17 + 4.74 = 19.4 ms
           total = 10.5 + 4.17 + 2    = 16.7 ms

    Takeaway: sequential reads are MUCH MUCH MUCH faster than random reads,
    and we should do everything we possibly can to perform sequential reads.
    When you learn about file systems, you'll see that this was a very
    serious concern for file system designers (LFS!). (A small sketch of this
    arithmetic appears at the end of this section, after the list below.)

    Analogy? Imagine you have a grocery list, and you decide to pick up the
    items IN THE ORDER that they're listed. In one case, pretend you made
    your list randomly, just throwing things on it. When you get to HEB or
    whatever, you'll run around the store A LOT. Most of your time will be
    spent running around looking for things; the time to pick up the items
    themselves will be minimal. Now imagine a second list, in which you sort
    the items based on the layout of the store: the produce all together, the
    dairy goods all together, the meats together, etc., listed in the order
    they appear in the store. You'll spend significantly less time looking
    for things, since you'll make one straight run through the store, picking
    things up along the way.

    What are some things that help this situation?

    - Disk cache used for read-ahead (the disk keeps reading past the last
      host request)
        - otherwise, sequential reads would incur a whole revolution
        - policy decision: should read-ahead cross track boundaries? a head
          switch cannot be stopped, so there is a cost to aggressive
          read-ahead.
    - Write caching can be a big win!
        - (if battery-backed): data in the buffer can be overwritten many
          times before actually being written back to the disk. Also, many
          writes can be held and then scheduled more optimally.
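    Here is the arithmetic from the two examples above as a tiny C program,
    using the drive numbers in the table (average read seek 10.5 ms,
    7200 RPM, 512-byte sectors, and the worst-case 54 MB/s surface transfer
    rate):

        #include <stdio.h>

        int main(void)
        {
            /* Drive parameters from the table above. */
            double seek_ms   = 10.5;     /* average read seek              */
            double rpm       = 7200.0;
            double xfer_MBps = 54.0;     /* slowest surface-to-buffer rate */
            double sector_B  = 512.0;
            int    requests  = 500;

            double rot_ms  = 0.5 * (60000.0 / rpm);               /* ~4.17 ms   */
            double xfer_ms = sector_B / (xfer_MBps * 1e6) * 1e3;  /* ~0.0095 ms */

            /* Random: pay seek + rotation + transfer for every single request. */
            double random_ms = requests * (seek_ms + rot_ms + xfer_ms);

            /* Sequential: pay seek + rotation once, then just stream sectors. */
            double seq_ms = seek_ms + rot_ms + requests * xfer_ms;

            printf("500 random reads:     %.0f ms (~%.1f s)\n",
                   random_ms, random_ms / 1000.0);
            printf("500 sequential reads: %.1f ms\n", seq_ms);
            return 0;
        }

    It prints roughly 7.3 seconds for the random case and ~19.4 ms for the
    sequential case, matching the hand calculation.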
    NOTE: one thing we're going to start thinking about more is fault
    tolerance! This wasn't an issue until now, because if memory got cleared,
    whatever; it is what it is. But now we have persistent storage, and you
    can't just punt on the issue of keeping it consistent.

    - We can schedule our requests to help reduce that first number (the seek
      cost). That is, we can order accesses more effectively.
    - To that end, we can also take it upon ourselves to order our requests
      so as to minimize disk seeks (which are such a huge cost).

G) Disk Scheduling: Performance III

    You're probably thinking "oh god, not ANOTHER scheduling/policy-decision
    talk." But the fact is that these are the kinds of decisions that come up
    in real systems; we often have to consider these options and their
    trade-offs.

    NOTE: scheduling ONLY makes sense when we actually have more than one
    outstanding request -- which, in practice, is probably reasonably often.

    FCFS/FIFO: process requests in the order they are received
        +: easy to implement
        +: good fairness
        -: cannot exploit locality
        -: increases average latency, decreasing throughput

    SPTF/SSTF/SSF: shortest positioning time first / shortest seek time
    first / shortest seek first
        +: exploits locality of requests
        +: higher throughput
        -: starvation (jobs far away may never be serviced)
        -: we don't always know which request will be fastest

        improvement: aged SPTF -- give older requests priority using an
        "effective" positioning time:  T_eff = T_pos - W * T_wait

    Elevator scheduling: like SPTF, but the next seek must be in the same
    direction as the current sweep; switch direction only if there are no
    further requests that way. (A small C sketch of this policy appears just
    before the flash section below.)
        +: exploits locality
        +: bounded waiting
        -: cylinders in the middle get better service
        -: doesn't fully exploit locality

        Common modification: sweep only in one direction. Very commonly used
        in Unix.

H) Technology and Systems Trends

    - Unfortunately, while seeks and rotational delay are getting a little
      faster, they have not kept up with the huge growth elsewhere in
      computers.
    - Transfer bandwidth has grown about 10x per decade.
    - Disk DENSITY (bytes stored per dollar) is growing very fast
      (exponentially! ref. Bill Gates). It used to be dollars per GB; now
      it's practically free.
    - Disk access is still a huge bottleneck! And it's getting worse. (why?)
      What we can do is use the increased bandwidth to prefetch large chunks
      for the same cost. Recall the example we did: we can grab huge chunks
      of data without incurring much extra cost, since we already paid for
      the seek + rotation.
    - Saving grace: memory size is increasing faster than typical workload
      size.
        - More and more of the workload fits in the file cache, which means
          the profile of traffic to the disk has changed: mostly writes and
          new data.
        - Logging and journalling become viable. (The basis for LFS is that
          MOST reads are served from the cache, so optimize for writes.)
    - SSDs. WAY faster (if you have one, you can actually tell), but they're
      not going to change the whole game: flash is still expensive per byte
      compared to disk. One thing they do change is that random reads are no
      longer expensive, because there are no mechanical seeks.
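    Before we move on to flash, here is a minimal C sketch of the elevator
    policy from section G. The request-queue representation and the function
    name are made up for illustration; a real driver would track much more
    per request (sector, buffer, completion callback, ...).

        #include <stdlib.h>

        /* A pending request, identified here only by its cylinder number. */
        typedef struct { int cylinder; int valid; } request_t;

        /* Elevator (SCAN): keep sweeping in the current direction and pick
         * the nearest pending request at or ahead of the head; reverse
         * direction only when nothing is left that way. Returns the index of
         * the chosen request, or -1 if the queue is empty. */
        int elevator_pick(request_t *q, int n, int head_pos, int *direction)
        {
            for (int pass = 0; pass < 2; pass++) {
                int best = -1, best_dist = 0;
                for (int i = 0; i < n; i++) {
                    if (!q[i].valid)
                        continue;
                    int delta = q[i].cylinder - head_pos;
                    if (*direction > 0 ? delta < 0 : delta > 0)
                        continue;                 /* behind us: skip this sweep */
                    int dist = abs(delta);
                    if (best < 0 || dist < best_dist) {
                        best = i;
                        best_dist = dist;
                    }
                }
                if (best >= 0)
                    return best;
                *direction = -*direction;         /* nothing ahead: reverse */
            }
            return -1;                            /* no valid requests at all */
        }

    Replacing the "behind us" test with plain nearest-distance gives SSTF;
    never reversing and instead wrapping around to the lowest cylinder gives
    the sweep-in-one-direction variant mentioned above.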
4) Flash memory

    SSDs! One big win with these solid-state devices (your flash drives, SD
    cards, etc.) is that you don't have to pay for seeks. The bummer is that
    you pay in $$ what you would've paid in time.

    Flash memory is implemented using floating-gate transistors. These are
    like regular transistors, except that they have an extra "floating gate"
    that is completely insulated in non-conductive material, meaning the
    charge inside can stay there for an extremely long time.

    **DIAGRAM**

    The way you "set" a bit in a floating-gate transistor is to rely on
    quantum tunneling: if you apply a sufficiently high voltage nearby, some
    of the electrons will jump through the insulating material into the
    floating gate. Later, you can use the control gate to read the bit back
    (the trapped charge changes whether the transistor conducts).

    NAND flash memory is wired to allow reads and writes only in whole chunks
    of the device at a time (confusingly called pages). These are often
    2 KB-4 KB. You can:
        * write a page
        * read a page
        * erase an erasure block

    Writing and reading a page are cheap (each takes tens of microseconds,
    versus roughly 100 ns for memory). But before you can write something
    new, whatever was there before must be erased, and you unfortunately
    can't just erase one word in NAND flash memory (you can in NOR flash), so
    you have to erase whole blocks at a time, often 128 KB to 512 KB. Erasure
    can take a few milliseconds.

---------------------------------------------------------------------------
Credit: David Mazieres's notes, Alison Norman's notes, Mike Dahlin's notes,
Mike Walfish's notes, and "Operating Systems: Principles and Practice" (by
Anderson and Dahlin).