Class 17
CS 439
19 March 2013
[lecturer: Parth Upadhyay]

On the board
------------
0) Questions before we start?

1) Preview -- I/O, disks, file systems, transactions

2) I/O

3) Disks

I/O
  * architecture
  * communicating with devices
  * device drivers

Start:

Okay, so today we're going to talk a bit about I/O and disks. We're going to
take one more layer of mysticism away from the computer and try to understand
what happens between the CPU and the various devices that can be connected to
it. Then we'll take a decently detailed look at how disks work, and at why
this is still important.

A) Architecture

    draw this picture!! (it's the one on the handout)
    ----
    ----

    - Some examples of devices here are things like network cards, graphics
      cards, disks, etc.
    - The CPU accesses physical memory over a bus.
    - Devices, too, access memory over an I/O bus.

B) Communicating with devices

    How can we communicate with these devices?

    (a) Memory-mapped device registers
        - Loads and stores to certain memory addresses actually correspond to
          registers on the device. You as a programmer read and write those
          addresses with ordinary loads and stores, but they really go
          to/from the device's registers.

    (b) Device memory
        - The device may have memory of its own that the OS can write to
          directly, on the other side of the I/O bus.

    (c) Special I/O instructions
        - Some CPUs have special instructions for I/O.
        - They're like load/store, but they address I/O ports rather than
          memory.
        - They allow user processes to be given access to I/O ports at a
          finer granularity than a page.

    (d) DMA (direct memory access)
        - The OS places buffer descriptors (instructions for the device) in
          main memory.
        - It "pokes" the card by writing to a register.
        - The card knows where to find this list, and it can then access the
          buffers on its own.
        - This overlaps unrelated computation with moving data over the bus.
        - This makes a lot of sense: instead of constantly dealing with one
          word or byte at a time, the device can write the contents of its
          operation straight into memory.

        (i) DMA disk read example:

            Let's look at an example of how DMA would work.

            1) The OS has a request (received from, potentially, a user
               process) that asks it to transfer disk data to a buffer at
               address X.
            2) The OS (really the device driver -- you'll learn about those
               in a minute) tells the disk controller to transfer C bytes to
               address X.
            3) The disk controller initiates the transfer, sending each byte
               to the DMA controller.
            4) The DMA controller transfers the bytes into memory at X.
            5) When the transfer is finished, the DMA controller interrupts
               the CPU to let it know.

C) Device drivers

    - A driver exposes an interface to the kernel so that the kernel can make
      comparatively simple read/write calls; it abstracts away the nasty
      hardware details.
    - How does the driver synchronize with the device?
        - For example, the driver may need to know when a network card is
          done sending a packet, or when a disk request is complete.

    [NOTE: the device doesn't care a huge deal which of the following two
    options is in play. Interrupts and polling are the mechanisms between the
    device and the CPU; the question here is about the logic in the driver
    and the interrupt controller.]

    (a) Approach 1: polling
        - Sent a packet? Loop asking the card whether the buffer is free yet.
        - Waiting for a packet? Loop asking the card whether one has arrived.
        - Disk I/O? Keep looping until the ready bit is set.

        Disadvantages: wasted CPU cycles, and high latency (the data may have
        arrived, but you're not due to check for it until some time in the
        future). A little sketch of this approach follows below.
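        To make memory-mapped registers and polling concrete, here is a
        minimal bare-metal-style C sketch. It is not any real device: the
        addresses, the ready bit, and the register layout are all made up for
        illustration.

            /* Hypothetical device: a status register and a data register are
             * memory-mapped at fixed (made-up) physical addresses. */
            #include <stdint.h>

            #define DEV_STATUS   ((volatile uint32_t *)0xFEED0000) /* made-up address */
            #define DEV_DATA     ((volatile uint32_t *)0xFEED0004) /* made-up address */
            #define STATUS_READY 0x1                               /* made-up bit */

            /* Polling: spin until the device says the data register holds
             * something, then read it with an ordinary load. The spin loop is
             * exactly the "wasted CPU cycles" cost described above. */
            uint32_t read_when_ready(void)
            {
                while ((*DEV_STATUS & STATUS_READY) == 0)
                    ;               /* busy-wait */
                return *DEV_DATA;   /* ordinary load, but it reads a device register */
            }

        (The volatile qualifier is what keeps the compiler from optimizing the
        repeated loads away; a real driver would also need the right memory
        barriers for its platform.)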
    (b) Approach 2: interrupts
        - Ask the card to interrupt the CPU on events.
        - The interrupt handler runs at high priority.
        - It asks the card what happened (buffer freed, new packet, etc.).
        - This is what most general-purpose OSes do.

        Disadvantages: what could go wrong with interrupts? They are BAD in
        cases where the data arrival rate is high:
        -- packets arrive faster than the OS can process them
        -- interrupts are very expensive (context switch)
        -- interrupt handlers have high priority
        -- in the worst case, you get receive livelock, where you spend 100%
           of your time in the interrupt handler but no real work gets done.

        ANALOGY: interrupts vs. polling, in the context of phone
        notifications. There are two ways for you to find out whether you
        have new emails/tweets. One is to stop everything and look at your
        phone every time it buzzes (God forbid you miss that tweet). The
        other is to check your email periodically. The trade-offs are pretty
        similar! You get that snapchat five minutes later than you could
        have, but you don't pay the cost of so many context switches.

        How do you design systems given these trade-offs? Start with
        interrupts. If you notice that your system is slowing down because of
        them, switch to polling. If polling is chewing up too many cycles,
        move toward adaptively switching between interrupts and polling. (But
        of course, never optimize until you actually know what the problem
        is.)

        -- interrupts are great for disk requests! which segues into our next
           topic....

3) Disks

    A. What is a disk?
    B. Geometry
    C. Performance
    D. Common #s
    E. How drivers interface to the disk
    F. Performance II
    G. Disk scheduling (Performance III)
    H. Technology and systems trends

    (Show of hands: how many of you have built a computer before?)

A) What is a disk?

    This is your standard SATA drive that you stick into your computer. It
    whirrs up when you start it....

    - a stack of magnetic platters
    - the platters rotate together on a central spindle, at between 3,600 and
      15,000 RPM

    // draw picture: stack of platters on a spindle

    (a) disk arm assembly
        - the arms rotate around a pivot, and all move together
        - the pivot offers some resistance to linear shocks
        - the arms carry the disk heads -- one for each recording surface
        - the heads read/write the data on the platters

    Kind of like vinyl records.

    -- interjection --

    Before we continue, you might ask why this is important! Education is for
    the future, and arguably SSDs (explain what they are in case people don't
    know) are the future. The reason it's important is that disks are still
    widely used everywhere, and will be for some time. Google, Facebook, etc.
    all still pack their data centers full of cheap, old disks. Also, for
    them, disk failure is the common case, not the random/weird case (they
    have so many disks that some are always failing), so they can't cram
    their data centers with expensive SSDs. The second reason is technical
    literacy: many file systems were designed with the disk in mind
    (sequential access significantly better than random). You have to know
    how these things work as a computer scientist and as a programmer; you
    don't want to look dumb when you go out into the real world.

B) Geometry of a disk

    // draw a picture here! (a big one, preferably).
    // This may be a good picture to put on the handout, honestly.

    Track - a circle on a platter. Each platter is divided into concentric
            tracks.
    Sector - a chunk of a track (fixed size, typically 512 bytes).

    Cylinder - the locus of all tracks of a fixed radius, across all
               platters.

    - The heads are roughly lined up on a cylinder.
    - NOTE: a significant fraction of the encoded stream is used for error
      correction.
    - Generally only one head is active at a time (why?)
        -- disks usually have only one set of read/write circuitry
        -- must worry about cross-talk between channels
        -- hard to keep multiple heads exactly aligned
    - Disk positioning system
        - moves the head to a particular track and keeps it there
        - resists physical shocks, imperfect tracks, etc.
    - Disks also usually have a buffer; it just makes sense. It's used for
      things like reading ahead, or just reading a whole track as you go.

    A **seek** consists of up to four phases:
        (a) speedup: accelerate the arm to max speed or to the halfway point
        (b) coast: stay at max speed (for long seeks)
        (c) slowdown: brake to a stop near the destination
        (d) settle: adjust the head onto the actual desired track

C) Performance!

    (important to understand if you're building systems that need good perf.)

    What are the components of a transfer?
        * rotational delay - time for the sector to rotate under the disk head
        * seek - speedup, coast, slowdown, settle
        * transfer time - we'll discuss this in a minute (pretty basic: how
          fast the head can read, and how much you need to read)

    SEEK:
        - seeking from track to track is comparatively fast (~1 ms) and is
          mainly just the settle time
        - short seeks are dominated by speedup (btw, the arm can accelerate
          at up to several hundred g's)
        - longer seeks are dominated by coast
        - head switches are comparable to short seeks
        - sometimes you'll see a statistic called "average seek time". This
          could actually mean several things:
            * the time to seek 1/3 of the disk
            * 1/3 of the time to seek the whole disk
          (why aren't these the same??)

    We'll do an example soon where we use these numbers to figure out the
    time it takes to access a portion of the disk.

D) Some common #s

    - capacity: 100s of GB
    - platters: 8
    - number of cylinders: tens of thousands or more
    - sectors per track: ~1000
    - RPM: 10,000
    - transfer rate: 50-85 MB/s
    - mean time between failures: ~1 million hours
      (for disks in data centers it's vastly less. They have so many disks
      that disk failures become common, and they have huge software systems
      in place to replicate data and handle these failures. They don't have
      super crazy expensive disks.)

    (Latency numbers! Source: https://gist.github.com/jboner/2841832)

    Putting the numbers in perspective: disks are *the* bottleneck in many
    systems. Here are some numbers that should give you context for just how
    expensive disk-related I/O can be.

    (Ask people to guess these numbers! See if anyone has an intuition about
    the relative costs of these operations.)

    If an L1 cache reference took 1 second:

        L1 cache reference           0:01                 (really ~0.5 ns)
        Mutex lock/unlock            0:50                 (~25 ns)
        Main memory reference        3:20                 (~100 ns)
        Read 4K randomly from SSD    3 days, 11:20:00     (~150 us, i.e. < 1 ms)
        Disk seek                    231 days, 11:33:20   (~10 ms)

    // in case you're thinking "oh, but my macbook retina has this epic SSD
    // in it. this isn't a problem anymore, right?" NO! disk access is still
    // a huge bummer.

    How can anyone get anything done?? Well, the real numbers are scaled down
    by about 9 orders of magnitude, so they're not long in HUMAN terms. Also,
    the whole computer doesn't just come to a halt every time someone wants
    to access the disk: we can context switch the waiting process out, wait
    for an interrupt, and get other useful work done. BUT what is true is
    that if your program needs to access the disk often, it's running through
    molasses.
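    A quick way to reproduce the scaled figures above is to multiply each real
    latency by whatever factor stretches an L1 reference (~0.5 ns) out to one
    second. Here is a minimal C sketch of that arithmetic; the latencies are
    the rough figures from the gist cited above, not measurements of any
    particular machine.

        #include <stdio.h>

        int main(void)
        {
            /* Rough real-world latencies, in nanoseconds (from the
             * latency-numbers gist cited above; your hardware will differ). */
            struct { const char *name; double ns; } ops[] = {
                { "L1 cache reference",        0.5 },
                { "mutex lock/unlock",        25.0 },
                { "main memory reference",   100.0 },
                { "4K random read from SSD", 150e3 },
                { "disk seek",                10e6 },
            };

            /* Scale factor that makes an L1 reference take exactly 1 second. */
            double scale = 1.0 / 0.5;
            for (int i = 0; i < 5; i++) {
                double secs = ops[i].ns * scale;
                printf("%-26s %14.1f scaled seconds (%.2f days)\n",
                       ops[i].name, secs, secs / 86400.0);
            }
            return 0;
        }

    The disk-seek line comes out to about 2 * 10^7 scaled seconds, i.e. the
    ~231 days in the table.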
F) Disk Performance II

    Let's do two examples to really flesh this issue out and make these
    numbers a little less abstract.

    Numbers:
        Spindle speed:                      7200 RPM
        Avg seek time (read / write):       10.5 ms / 12 ms
        Maximum seek time:                  19 ms
        Track-to-track seek time:           1 ms
        Transfer rate (surface to buffer):  54-128 MB/s
        Transfer rate (buffer to host):     375 MB/s

    Q: How long would it take to do 500 read requests, spread out randomly
       over the disk (and serviced in FIFO order)?

    A: This is basically the worst case; we have no locality in our requests.
       Disk access time is: seek time + rotation time + transfer time. This
       makes sense, right? You first have to get the head to the right place,
       then wait for the disk to spin the sector under it, and then transfer
       the data. Let's figure out these numbers!

       1) seek time: given to us in the table: 10.5 ms.

       2) rotation time: given to us implicitly through the 7200 RPM figure.
          7200 rev/min = 120 rev/sec, so one revolution takes 1/120 sec =
          8.33 ms. But that's a whole rotation; on average you expect to wait
          half a revolution, so call it 4.17 ms.

       3) transfer time: we can read at no less than 54 MB/s, and we're
          reading 1 sector (512 bytes), so:
          512 bytes / (54 MB/s) = 0.0095 ms.

       So our cost PER READ is (make sure people are convinced):

          10.5 ms (seek) + 4.17 ms (avg rotation) + 0.0095 ms (transfer)
          = 14.7 ms/request

          14.7 ms * 500 = 7.3 seconds!

       (Holy crud! This is kind of a ridiculous number..... how can we do
       better?)

    Q: How long would it take to do 500 requests laid out SEQUENTIALLY on the
       disk (FIFO order once more)?

    A: The difference now is that we don't need to seek OR rotate more than
       once! Why? Because once we've gotten to the start, we're reading
       sequentially, so we don't pay for any of that again.

       1) seek time: given to us in the table, 10.5 ms
       2) rotation time: same as above, 4.17 ms
       3) transfer time: 500 * 512 bytes / (54 MB/s) = 4.74 ms,
          OR, at the top surface rate: 500 * 512 bytes / (128 MB/s) = 2 ms

       SO: total = 10.5 + 4.17 + 4.74 = 19.4 ms
           total = 10.5 + 4.17 + 2    = 16.7 ms

    Takeaway: sequential reads are MUCH MUCH MUCH faster than random reads,
    and we should do everything we possibly can to perform sequential reads.
    When you learn about file systems, you'll see that this was a very
    serious concern for file system designers (LFS!). (A small sketch of this
    arithmetic appears at the end of this section, after the list below.)

    Analogy? Imagine you have a grocery list, and you decide to pick up the
    items IN THE ORDER that they're listed. In one case, pretend you made
    your list randomly, just throwing things on it. When you get to HEB or
    whatever, you'll run around the store A LOT. Most of your time will be
    spent running around looking for things; the time to pick up the items
    themselves will be minimal. Now imagine a second list, in which you sort
    the items based on the layout of the store: the produce all together, the
    dairy goods all together, the meats together, etc., listed in the order
    they appear in the store. You'll spend significantly less time looking
    for things, since you'll make one straight run through the store, picking
    things up along the way.

    What are some things that help this situation?

    - Disk cache used for read-ahead (the disk keeps reading past the last
      host request)
        - otherwise, sequential reads would incur a whole revolution
        - policy decision: should read-ahead cross track boundaries? a head
          switch cannot be stopped, so there is a cost to aggressive
          read-ahead.
    - Write caching can be a big win!
        - (if battery-backed): data in the buffer can be overwritten many
          times before actually being written back to the disk. Also, many
          writes can be held and then scheduled more optimally.
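    Here is the arithmetic from the two examples above as a tiny C program,
    using the drive numbers in the table (average read seek 10.5 ms,
    7200 RPM, 512-byte sectors, and the worst-case 54 MB/s surface transfer
    rate):

        #include <stdio.h>

        int main(void)
        {
            /* Drive parameters from the table above. */
            double seek_ms   = 10.5;     /* average read seek              */
            double rpm       = 7200.0;
            double xfer_MBps = 54.0;     /* slowest surface-to-buffer rate */
            double sector_B  = 512.0;
            int    requests  = 500;

            double rot_ms  = 0.5 * (60000.0 / rpm);               /* ~4.17 ms   */
            double xfer_ms = sector_B / (xfer_MBps * 1e6) * 1e3;  /* ~0.0095 ms */

            /* Random: pay seek + rotation + transfer for every single request. */
            double random_ms = requests * (seek_ms + rot_ms + xfer_ms);

            /* Sequential: pay seek + rotation once, then just stream sectors. */
            double seq_ms = seek_ms + rot_ms + requests * xfer_ms;

            printf("500 random reads:     %.0f ms (~%.1f s)\n",
                   random_ms, random_ms / 1000.0);
            printf("500 sequential reads: %.1f ms\n", seq_ms);
            return 0;
        }

    It prints roughly 7.3 seconds for the random case and ~19.4 ms for the
    sequential case, matching the hand calculation.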
    NOTE: one thing we're going to start thinking about more is fault
    tolerance! This wasn't an issue until now, because if memory got cleared,
    whatever; it is what it is. But now we have persistent storage, and you
    can't just punt on the issue of keeping it consistent.

    - We can schedule our requests to help reduce that first number (the seek
      cost). That is, we can order accesses more effectively.
    - To that end, we can also take it upon ourselves to order our requests
      so as to minimize disk seeks (which are such a huge cost).

G) Disk Scheduling: Performance III

    You're probably thinking "oh god, not ANOTHER scheduling/policy-decision
    talk." But the fact is that these are the kinds of decisions that come up
    in real systems; we often have to consider these options and their
    trade-offs.

    NOTE: scheduling ONLY makes sense when we actually have more than one
    outstanding request -- which, in practice, is probably reasonably often.

    FCFS/FIFO: process requests in the order they are received
        +: easy to implement
        +: good fairness
        -: cannot exploit locality
        -: increases average latency, decreasing throughput

    SPTF/SSTF/SSF: shortest positioning time first / shortest seek time
    first / shortest seek first
        +: exploits locality of requests
        +: higher throughput
        -: starvation (jobs far away may never be serviced)
        -: we don't always know which request will be fastest

        improvement: aged SPTF -- give older requests priority using an
        "effective" positioning time:  T_eff = T_pos - W * T_wait

    Elevator scheduling: like SPTF, but the next seek must be in the same
    direction as the current sweep; switch direction only if there are no
    further requests that way. (A small C sketch of this policy appears just
    before the flash section below.)
        +: exploits locality
        +: bounded waiting
        -: cylinders in the middle get better service
        -: doesn't fully exploit locality

        Common modification: sweep only in one direction. Very commonly used
        in Unix.

H) Technology and Systems Trends

    - Unfortunately, while seeks and rotational delay are getting a little
      faster, they have not kept up with the huge growth elsewhere in
      computers.
    - Transfer bandwidth has grown about 10x per decade.
    - Disk DENSITY (bytes stored per dollar) is growing very fast
      (exponentially! ref. Bill Gates). It used to be dollars per GB; now
      it's practically free.
    - Disk access is still a huge bottleneck! And it's getting worse. (why?)
      What we can do is use the increased bandwidth to prefetch large chunks
      for the same cost. Recall the example we did: we can grab huge chunks
      of data without incurring much extra cost, since we already paid for
      the seek + rotation.
    - Saving grace: memory size is increasing faster than typical workload
      size.
        - More and more of the workload fits in the file cache, which means
          the profile of traffic to the disk has changed: mostly writes and
          new data.
        - Logging and journalling become viable. (The basis for LFS is that
          MOST reads are served from the cache, so optimize for writes.)
    - SSDs. WAY faster (if you have one, you can actually tell), but they're
      not going to change the whole game: flash is still expensive per byte
      compared to disk. One thing they do change is that random reads are no
      longer expensive, because there are no mechanical seeks.
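    Before we move on to flash, here is a minimal C sketch of the elevator
    policy from section G. The request-queue representation and the function
    name are made up for illustration; a real driver would track much more
    per request (sector, buffer, completion callback, ...).

        #include <stdlib.h>

        /* A pending request, identified here only by its cylinder number. */
        typedef struct { int cylinder; int valid; } request_t;

        /* Elevator (SCAN): keep sweeping in the current direction and pick
         * the nearest pending request at or ahead of the head; reverse
         * direction only when nothing is left that way. Returns the index of
         * the chosen request, or -1 if the queue is empty. */
        int elevator_pick(request_t *q, int n, int head_pos, int *direction)
        {
            for (int pass = 0; pass < 2; pass++) {
                int best = -1, best_dist = 0;
                for (int i = 0; i < n; i++) {
                    if (!q[i].valid)
                        continue;
                    int delta = q[i].cylinder - head_pos;
                    if (*direction > 0 ? delta < 0 : delta > 0)
                        continue;                 /* behind us: skip this sweep */
                    int dist = abs(delta);
                    if (best < 0 || dist < best_dist) {
                        best = i;
                        best_dist = dist;
                    }
                }
                if (best >= 0)
                    return best;
                *direction = -*direction;         /* nothing ahead: reverse */
            }
            return -1;                            /* no valid requests at all */
        }

    Replacing the "behind us" test with plain nearest-distance gives SSTF;
    never reversing and instead wrapping around to the lowest cylinder gives
    the sweep-in-one-direction variant mentioned above.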
4) Flash memory

    SSDs! One big win with these solid-state devices (your flash drives, SD
    cards, etc.) is that you don't have to pay for seeks. The bummer is that
    you pay in $$ what you would've paid in time.

    Flash memory is implemented using floating-gate transistors. These are
    like regular transistors, except that they have an extra "floating gate"
    that is completely insulated in non-conductive material, meaning the
    charge inside can stay there for an extremely long time.

    **DIAGRAM**

    The way you "set" a bit in a floating-gate transistor is to rely on
    quantum tunneling: if you apply a sufficiently high voltage nearby, some
    of the electrons will jump through the insulating material into the
    floating gate. Later, you can use the control gate to read the bit back
    (the trapped charge changes whether the transistor conducts).

    NAND flash memory is wired to allow reads and writes only in whole chunks
    of the device at a time (confusingly called pages). These are often
    2 KB-4 KB. You can:
        * write a page
        * read a page
        * erase an erasure block

    Writing and reading a page are cheap (each takes tens of microseconds,
    versus roughly 100 ns for memory). But before you can write something
    new, whatever was there before must be erased, and you unfortunately
    can't just erase one word in NAND flash memory (you can in NOR flash), so
    you have to erase whole blocks at a time, often 128 KB to 512 KB. Erasure
    can take a few milliseconds.

---------------------------------------------------------------------------
Credit: David Mazieres's notes, Alison Norman's notes, Mike Dahlin's notes,
Mike Walfish's notes, and "Operating Systems: Principles and Practice" (by
Anderson and Dahlin).