Class 17
CS 372H  25 March 2010

On the board
------------

1. disks, continued

2. flash memory

3. file systems

---------------------------------------------------------------------------

1. Disks, continued

  last time:

    A. What is a disk?
    B. Geometry

  today:

    C. Performance
    D. Common #s
    E. how driver interfaces to disk
    F. how disk interfaces to bus
    G. Performance II
    H. technology and systems trends

  C. disk performance

    (important to understand this if you are building systems that need
    good performance)

    components of a transfer: rotational delay, seek delay, transfer time

        rotational delay: we discussed
        seek: speedup, coast, slowdown, settle
        transfer time: will discuss

    discuss seeks in a bit of detail now:

    --seeking track-to-track: comparatively fast (~1 ms). mainly settle
      time

    --short seeks (200-400 cylinders) dominated by speedup

        --BTW, this thing can accelerate at up to several hundred g

    --longer seeks dominated by coast

    --head switches comparable to short seeks

    --settle time takes longer for writes than for reads. why?

        --because if a read strays, the error will be caught, and the
          disk can retry

        --if a write strays, some other track just got clobbered. so
          write settles need to be done precisely

    --note: the quoted "average seek time" can mean many things:

        --time to seek 1/3 of the disk
        --1/3 of the time to seek the whole disk
        --(convince yourself those may not be the same)

  D. common disk #s

    --capacity: 100s of GB
    --platters: 8
    --number of cylinders: tens of thousands or more
    --sectors per track: ~1000
    --RPM: 10000
    --transfer rate: 50-85 MB/s
    --mean time between failures: ~1 million hours

        (for disks in data centers, it's vastly less; for a provider
        like Google, even if they had very reliable disks, they'd still
        need an automated way to handle failures, because failures would
        be common (imagine 100,000 disks: *some* will be on the fritz at
        any given moment). so what they do is buy cheap disks, which are
        less reliable but far less expensive, letting them save on
        hardware costs. they get away with it because they *anyway*
        needed software and systems -- replication and other
        fault-tolerance schemes -- to handle failures.)

  E. how driver interfaces to disk

    --Sectors

        --Disk interface presents a linear array of **sectors**

        --generally 512 bytes, written atomically (even if power fails;
          the disk saves enough momentum to finish the sector)

        --larger atomic units have to be synthesized by the OS (will
          discuss later)

            --goes for multiple contiguous sectors or even a whole
              collection of unrelated sectors

            --the OS will find ways to make such writes *appear* atomic,
              though, of course, the disk itself can't write more than a
              sector atomically

            --analogy to critical sections in code:

                --a thread holds a lock for a while, doing a bunch of
                  things. to the other threads, whatever that thread does
                  is atomic: they can observe the state before lock
                  acquisition and after lock release, but not in the
                  middle, even though, of course, the lock-holding thread
                  is really doing a bunch of operations that are not
                  atomic from the processor's perspective

    --disk maps logical sector # to physical sectors

        --Zoning: puts more sectors on the longer (outer) tracks

        --Track skewing: sector 0's position varies by track, but let
          the disk worry about it. Why? (for speed when doing sequential
          access)

        --Sparing: flawed sectors remapped elsewhere

        --all of this is invisible to the OS. stated more precisely, the
          OS does not know the logical-to-physical sector mapping: the
          OS specifies a platter, track, and sector, but who knows where
          the data really is?

    --In any case, a larger difference in logical sector # generally
      means a larger seek

        --Highly non-linear relationship (*and* it depends on the zone)

        --OS has no info on rotational positions

        --Can empirically build a table to estimate access times

    --Turns out that sometimes the logical-->physical sector mapping is
      what you'd expect (sketched below)
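        --to make "what you'd expect" concrete, here is a sketch in C of
          the classical cylinder/head/sector numbering, with made-up
          geometry constants; real disks with zoning, skewing, and
          sparing only approximate this:

            /* Nominal logical-block numbering for a disk with a fixed
               (made-up) geometry: blocks are numbered along a track,
               then across the heads of a cylinder, then cylinder by
               cylinder. A real disk's mapping only approximates this. */

            #define SECTORS_PER_TRACK 1000
            #define HEADS_PER_CYL     16    /* 8 platters, 2 surfaces each */

            /* (cylinder, head, sector) -> logical block number; sectors
               are traditionally numbered starting at 1 */
            unsigned chs_to_lba(unsigned cyl, unsigned head, unsigned sector)
            {
                return (cyl * HEADS_PER_CYL + head) * SECTORS_PER_TRACK
                       + (sector - 1);
            }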
  F. how disk interfaces to bus

    --Computer and disk are often connected by a bus (e.g., SCSI)

    --Multiple devices may contend for the bus

    --Possible disk interface features:

        --Disconnect from the bus during requests

        --Command queuing: give the disk multiple requests

            --Disk can schedule them using rotational information

        --Disk cache used for read-ahead

            --Otherwise, sequential reads would incur a whole revolution

            --Cross track boundaries? Can't stop a head switch

        --Some disks support write caching

            --But the data is not stable---not suitable for all requests

  G. disk performance, II

    --Placement and ordering of requests is critical

    --Sequential I/O much, much MUCH **MUCH** faster than random

        --Long seeks much slower than short ones

    --Power might fail at any time, leaving inconsistent state

        --Must be careful about ordering, for crashes

        --More on this over the next few weeks

    --Try to achieve contiguous accesses where possible

        --for example, make big chunks of individual files contiguous

        --"The secret to making disks fast is to treat them like tape"
          (John Ousterhout)

    --Why? say you want to read 1 KB at a random location. how much does
      that cost?

        average seek: ~4 ms
        1/2 rotation: ~3 ms   (10000 RPM = ~167 RPS = 6 ms/rotation)
        transfer:     ~.01 ms

            because 512 bytes/sector * 1000 sectors/track * 1 track/6 ms
            = ~85 MB/s transfer speed (call it 80 MB/s), so
            1 KB / (80 MB/s) = 1 KB / (80 KB/ms) = ~.01 ms

        seek + rotation time dominates!

    --implication: can get 100s of times more data with almost no
      further overhead (more data affects only the transfer-time term)

    --more abstractly:

        effective_bandwidth(chunk_size) =
            chunk_size / (10 ms + chunk_size/actual_BW),

        with actual_BW ~80 MB/s
        (a short C sketch of this calculation appears at the end of this
        disks section)

    --Try to order requests to minimize seek times

        --OS (or disk) can only do this if it has multiple requests to
          order

        --Requires disk I/O concurrency

        --High-performance apps try to maximize I/O concurrency

        --or avoid I/O except to do write-logging (stick all your data
          structures in memory; write "backup" copies to disk
          sequentially; don't do random-access reads from the disk)

    --disk scheduling (see 5.4.3 in the book)

        --FCFS: process requests in the order they are received

            +: easy to implement
            +: good fairness
            -: cannot exploit request locality
            -: increases average latency, decreasing throughput

        --SPTF/SSTF/SSF: shortest positioning time first / shortest seek
          time first: pick the request with the shortest positioning
          (or seek) time

            +: exploits locality of requests
            +: higher throughput
            -: starvation
            -: don't always know which request will be fastest

            improvement: aged SPTF

                --give older requests higher priority

                --adjust the "effective" positioning time with a
                  weighting [no pun intended] factor:
                  T_{eff} = T_{pos} - W*T_{wait}

        --Elevator scheduling: like SPTF, but the next seek must be in
          the same direction; switch direction only if there are no
          further requests in that direction

            +: exploits locality
            +: bounded waiting
            -: cylinders in the middle get better service
            -: doesn't fully exploit locality

            modification: only sweep in one direction; very commonly
            used in Unix

  H. technology and systems trends

    --unfortunately, while seeks and rotational delays are getting a
      little faster, they have not kept up with the huge growth
      elsewhere in computers

        --transfer bandwidth has grown about 10x per decade

        --the thing that is growing fast is disk density (so $/byte
          stored keeps falling); that's because density depends less on
          the mechanical limitations that hold back seek and rotation
          times

            --to improve density, need to get the head closer to the
              surface

            --[aside: what happens if the head contacts the surface?
              it's called a "head crash": it scrapes off the magnetic
              material ... and, with it, the data]

    --disk accesses are a huge system bottleneck, and it's getting
      worse. so what to do?

        --bandwidth increases let the system (pre-)fetch large chunks
          for about the same cost as a small chunk

        --so trade latency for bandwidth if you can get lots of related
          stuff at roughly the same time. how to do that?

        --by clustering the related stuff together on the disk

    --the saving grace for big systems is that memory size is increasing
      faster than typical workload size

        --result: more and more of the workload fits in the file cache,
          which in turn means that the profile of traffic to the disk
          has changed: it's now mostly writes and new data

        --which means logging and journaling become viable (more on this
          over the next few classes)
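    --here is the effective-bandwidth formula from part G as a small C
      program; the 10 ms positioning cost and 80 MB/s transfer rate are
      the rough numbers quoted above, not measurements of any particular
      disk:

        #include <stdio.h>

        /* Effective bandwidth of one random access that transfers
           chunk_bytes, assuming ~10 ms of positioning cost (seek +
           rotation) and ~80 MB/s sustained transfer rate. */
        double effective_bw_MBps(double chunk_bytes)
        {
            double positioning_s = 0.010;      /* seek + rotation */
            double transfer_Bps  = 80e6;       /* bytes/sec off the platter */
            double total_s = positioning_s + chunk_bytes / transfer_Bps;
            return (chunk_bytes / total_s) / 1e6;
        }

        int main(void)
        {
            printf("1 KB: %.2f MB/s\n", effective_bw_MBps(1 << 10));  /* ~0.10 */
            printf("1 MB: %.2f MB/s\n", effective_bw_MBps(1 << 20));  /* ~45   */
            printf("8 MB: %.2f MB/s\n", effective_bw_MBps(8 << 20));  /* ~73   */
            return 0;
        }

      big, contiguous accesses amortize the positioning cost; tiny
      random ones are dominated by it.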
2. Flash memory

  A. Overview

    --Today, people are increasingly using flash memory

    --Completely solid state (no moving parts)

        --Remembers data by storing charge

        --Lower power consumption and heat

        --No mechanical seek times to worry about

    --Limited # of overwrites possible

        --Blocks wear out after 10,000 (MLC) -- 100,000 (SLC) erases

        --Requires a _flash translation layer_ (FTL) to provide _wear
          leveling_, so repeated writes to a logical block don't wear
          out one physical block

        --FTL can seriously impact performance

            --In particular, random writes are _very_ expensive (see the
              sketch at the end of this flash section);
              see http://research.microsoft.com/pubs/63681/TR-2005-176.pdf

    --Limited durability

        --Charge wears out over time

        --Turn the device off for a year, and you can easily lose data

  B. Types of flash memory

    --NAND flash (most prevalent for storage)

        --Higher density

        --Faster erase and write

        --More errors internally, so it needs error correction

    --NOR flash

        --Faster reads in smaller data units

        --Can execute code straight out of NOR flash

        --Significantly slower erases

    --Single-level cell (SLC) vs. multi-level cell (MLC)

        --MLC encodes multiple bits in the voltage level

        --MLC slower to write than SLC

    --NAND flash overview

        --Flash device has 2112-byte _pages_

            --2048 bytes of data + 64 bytes of metadata & ECC

        --_Blocks_ contain 64 (SLC) or 128 (MLC) pages (so blocks are
          128 KB or 256 KB)

        --Blocks are divided into 2--4 _planes_

            --All planes contend for the same package pins

            --But they can access their blocks in parallel to overlap
              latencies

        --Can _read_ one page at a time

            --Takes 25 microseconds + time to get the data off the chip

        --Must _erase_ a whole block before _programming_ (writing) it

            --Erase sets all bits to 1: very expensive (2 msec)

            --Programming a page of a pre-erased block requires moving
              the data to an internal buffer, then 200 (SLC) -- 800
              (MLC) microseconds

    --so random reads and writes are way faster than on a disk.
      But......

        --sequential disk reads and writes are roughly as fast as flash
          (at least in terms of order of magnitude) and much cheaper in
          $/byte

    --Flash characteristics (from
      http://cseweb.ucsd.edu/~swanson/papers/Asplos2009Gordon.pdf):

        Parameter                    SLC        MLC
        ---------------------------------------------------------
        Density per die (GB)         4          8
        Page size (bytes)            2048+32    2048+64
        Block size (pages)           64         128
        Read latency (us)            25         25
        Write latency (us)           200        800
        Erase latency (us)           2000       2000

        40 MHz, 16-bit bus:
        Read b/w (MB/s)              75.8       75.8
        Program b/w (MB/s)           20.1       5.0

        133 MHz:
        Read b/w (MB/s)              126.4      126.4
        Program b/w (MB/s)           20.1       5.0

    --disk vs. MLC NAND flash vs. regular DRAM:

                           disk          flash          DRAM
        --------------------------------------------------------------
        Smallest write     sector        sector         byte
        Atomic write       sector        sector         byte/word
        Random read        8 ms          75 us          50 ns
        Random write       8 ms          300 us*        50 ns
        Sequential read    100 MB/s      250 MB/s       > 1 GB/s
        Sequential write   100 MB/s      170 MB/s*      > 1 GB/s
        Cost               $.08--1/GB    $3/GB          $10-25/GB
        Persistence        Non-volatile  Non-volatile   Volatile

        *flash write performance degrades over time
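    --to see why random writes are so expensive without a clever FTL,
      here is a back-of-the-envelope C sketch using the SLC numbers from
      the table above; this is a simplified model, not how any
      particular FTL actually behaves:

        #include <stdio.h>

        /* Rough cost model: a naive in-place update of one page must
           copy and reprogram its whole block, whereas programming a
           page of a pre-erased block only pays the program time plus an
           amortized share of the erase. Real FTLs remap pages to avoid
           most of the copying. */

        enum {
            PAGES_PER_BLOCK = 64,     /* SLC */
            READ_US         = 25,
            PROGRAM_US      = 200,
            ERASE_US        = 2000,
        };

        int main(void)
        {
            /* read the other 63 pages, erase the block, reprogram all 64 */
            int naive_us = (PAGES_PER_BLOCK - 1) * READ_US
                         + ERASE_US
                         + PAGES_PER_BLOCK * PROGRAM_US;

            /* program one page of a pre-erased block, amortizing the erase */
            int seq_us = PROGRAM_US + ERASE_US / PAGES_PER_BLOCK;

            printf("naive random write of one page: %d us\n", naive_us); /* 16375 */
            printf("sequential write of one page:   %d us\n", seq_us);   /* 231   */
            return 0;
        }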
3. File systems

  A. Intro
  B. Files
  C. Implementing files
  D. Directories
  E. FS performance

  A. Intro

    --more papers on file systems than on any other single topic

    --probably also the hardest part of operating systems

    --what does a FS do?

        --provides persistence (the data doesn't go away ... ever)

        --somehow associates bytes on the disk with names (files)

        --somehow associates names with each other (directories)

    --where are FSes implemented?

        --can implement them on disk, over the network, in memory, in
          NVRAM (non-volatile RAM), on tape, with paper (!!!!)

        --we are going to focus on the disk and generalize later. we'll
          see what it means to implement a FS over the network

    --a few quick notes about disks in the context of FS design

        --the disk is the first thing we've seen that (a) doesn't go
          away and (b) we can modify (BIOS ROM, hardware configuration,
          etc. don't go away, but we weren't able to modify those
          things). two implications:

            (i) we're going to have to put all of our important state on
                the disk

            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't
                 happen again. scribble randomly on the disk --> now
                 what? (answer: in many cases, we're hosed.)

        --mismatch: the CPU and memory are *also* working with
          "important state", but they are vastly faster than disks

        --the disk is enormous: 100-1000x more data than memory

            --how to organize all of this information?

            --answer: by categorizing things (taxonomies). a FS is a
              kind of taxonomy ("/homes" has home directories,
              "/homes/bob/classes/cs372h" has bob's cs372h material,
              etc.)

  B. Files

    --what is a file?

        answer from the user's view: a bunch of named bytes on the disk

        answer from the FS's view: a collection of disk blocks

    --big job of a FS: map names and offsets to disk blocks

                             FS
            {file, offset} ------> disk address

    --operations are create(file), delete(file), read(), write()

    --***goal: operations should take as few disk accesses as possible
      and have minimal space overhead

        --wait, why do we want minimal space overhead, given that the
          disk is huge?

        --answer: cache space is never enough, and the amount of data
          that can be retrieved in one fetch is never enough. hence, we
          really don't want to waste space or fetches.

    [[--note that we have seen translation/indirection before:

        page table:
                             page table
            virtual address ------------> physical address

        per-file metadata:
                     inode
            offset ---------> disk block address

        how'd we get the inode?

        directory:
                        directory
            file name -------------> file #

            (a file # *is* an inode in Unix)
    ]]
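    --here is a minimal C sketch of those last two translations
      (directory lookup and inode-based block mapping); the structures
      and sizes are made up for illustration, not any real on-disk
      format:

        #include <string.h>

        #define BLOCK_SIZE 4096
        #define N_DIRECT   12
        #define FNAME_MAX  28

        struct inode     { unsigned nbytes; unsigned blocks[N_DIRECT]; };
        struct dir_entry { char name[FNAME_MAX]; int inum; };

        /* directory: file name -> file # (inode #), or -1 if not found */
        int dir_lookup(const struct dir_entry *de, int n, const char *name)
        {
            for (int i = 0; i < n; i++)
                if (strcmp(de[i].name, name) == 0)
                    return de[i].inum;
            return -1;
        }

        /* per-file metadata: (inode, byte offset) -> disk block address */
        unsigned bmap(const struct inode *ip, unsigned offset)
        {
            return ip->blocks[offset / BLOCK_SIZE];
        }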
  C. Implementing files

    --our task: uphold the goal marked *** above

    --for now, we're going to assume that the file's metadata is given
      to us. when we look at directories in a bit, we'll see where the
      metadata comes from; the picture above should also give a hint

    access patterns we could imagine supporting:

        (i) sequential:

            --file data processed in sequential order

            --by far the most common mode

            --example: editor writes out a new file, compiler reads in a
              file, etc.

        (ii) random access:

            --address any block in the file directly, without passing
              through the blocks before it

            --examples: large data set, demand paging, databases

        (iii) keyed access:

            --search for a block with particular values

            --examples: associative database, index

            --this is everywhere in the world of databases and search
              engines, but...

            --...usually not provided by the FS in an OS

    helpful observations:

        (i) all blocks in a file tend to be used together, sequentially

        (ii) all files in a directory tend to be used together

        (iii) all *names* in a directory tend to be used together

    further design parameters:

        (i) most files are small

        (ii) much of the disk is allocated to large files

        (iii) many of the I/O operations are made to large files

        (iv) want good sequential and good random access

    candidate designs........

    1. contiguous allocation

        "extent based"

        --when creating a file, make the user pre-specify its length,
          and allocate all the space at once

        --file metadata contains the location and size

        --example: IBM OS/360

            [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple

        +: fast access, both sequential and random

        -: fragmentation

        where have we seen something similar? (answer: segmentation in
        virtual memory)

    2. linked files

        --keep a linked list of free blocks

        --metadata: pointer to the file's first block

        --each block holds a pointer to the next one

        +: no more fragmentation

        +: sequential access is easy (and probably mostly fast, assuming
           decent free-space management, since the pointers will point
           close by)

        -: random access is a disaster

        -: pointers take up room in blocks; messes up alignment of data

    3. modification of linked files: FAT

        --keep the link structure in memory

            --in a fixed-size "FAT" (file allocation table)

            --pointer chasing now happens in RAM (see the sketch at the
              very end of these notes)

        [DRAW PICTURE]

        --example: MS-DOS (and iPods, MP3 players, digital cameras)

        +: no need to maintain a separate free list (the table says
           what's free)

        +: low space overhead

        -: maximum size limited: 64K entries * 512-byte blocks --> 32 MB
           max file system

            --bigger blocks bring advantages and disadvantages, and
              ditto a bigger table

        note: to guard against bad sectors, better store multiple copies
        of the FAT on the disk!!

        note: the root directory needs to live at a well-known location

  [thanks to David Mazieres for portions of the above]
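    --the sketch referenced in the FAT discussion above: FAT-style
      pointer chasing in C, with made-up constants; real FAT entry
      encodings and sizes differ:

        /* fat[] is the in-memory table: for each block, it holds the
           number of the file's next block, with FAT_EOF marking a
           file's last block. */

        #define BLOCK_SIZE 512
        #define N_BLOCKS   65536      /* 64K entries * 512 bytes = 32 MB */
        #define FAT_EOF    0xFFFFu

        static unsigned short fat[N_BLOCKS];  /* one entry per block, in RAM */

        /* (first block of a file, byte offset) -> block #, or FAT_EOF
           if the offset is past the end of the file */
        unsigned fat_bmap(unsigned first, unsigned offset)
        {
            unsigned b = first;
            for (unsigned i = 0; i < offset / BLOCK_SIZE && b != FAT_EOF; i++)
                b = fat[b];           /* pointer chasing happens in memory */
            return b;
        }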