Class 17
CS 372H  25 March 2010

On the board
------------

1. disks, continued

2. flash memory

3. file systems

---------------------------------------------------------------------------

1. Disks, continued

  last time:

    A. What is a disk?
    B. Geometry

  today:

    C. Performance
    D. Common #s
    E. how driver interfaces to disk
    F. how disk interfaces to bus
    G. Performance II
    H. technology and systems trends

  C. disk performance

    (important to understand this if you are building systems that need
    good performance)

    components of a transfer: rotational delay, seek delay, transfer time

        rotational delay: we discussed
        seek: speedup, coast, slowdown, settle
        transfer time: will discuss

    discuss seeks in a bit of detail now:

    --seeking track-to-track: comparatively fast (~1 ms). mainly settle
      time

    --short seeks (200-400 cylinders) dominated by speedup

        --BTW, this thing can accelerate at up to several hundred g

    --longer seeks dominated by coast

    --head switches comparable to short seeks

    --settle time takes longer for writes than for reads. why?

        --because if a read strays, the error will be caught, and the
          disk can retry

        --if a write strays, some other track just got clobbered. so
          write settles need to be done precisely

    --note: the quoted "average seek time" can mean many things:

        --time to seek 1/3 of the disk
        --1/3 of the time to seek the whole disk
        --(convince yourself those may not be the same)

  D. common disk #s

    --capacity: 100s of GB
    --platters: 8
    --number of cylinders: tens of thousands or more
    --sectors per track: ~1000
    --RPM: 10000
    --transfer rate: 50-85 MB/s
    --mean time between failures: ~1 million hours

        (for disks in data centers, it's vastly less; for a provider
        like Google, even if they had very reliable disks, they'd still
        need an automated way to handle failures, because failures would
        be common (imagine 100,000 disks: *some* will be on the fritz at
        any given moment). so what they do is buy cheap disks, which are
        less reliable but far less expensive, letting them save on
        hardware costs. they get away with it because they *anyway*
        needed software and systems -- replication and other
        fault-tolerance schemes -- to handle failures.)

  E. how driver interfaces to disk

    --Sectors

        --Disk interface presents a linear array of **sectors**

        --generally 512 bytes, written atomically (even if power fails;
          the disk saves enough momentum to finish the sector)

        --larger atomic units have to be synthesized by the OS (will
          discuss later)

            --goes for multiple contiguous sectors or even a whole
              collection of unrelated sectors

            --the OS will find ways to make such writes *appear* atomic,
              though, of course, the disk itself can't write more than a
              sector atomically

            --analogy to critical sections in code:

                --a thread holds a lock for a while, doing a bunch of
                  things. to the other threads, whatever that thread does
                  is atomic: they can observe the state before lock
                  acquisition and after lock release, but not in the
                  middle, even though, of course, the lock-holding thread
                  is really doing a bunch of operations that are not
                  atomic from the processor's perspective

    --disk maps logical sector # to physical sectors

        --Zoning: puts more sectors on the longer (outer) tracks

        --Track skewing: sector 0's position varies by track, but let
          the disk worry about it. Why? (for speed when doing sequential
          access)

        --Sparing: flawed sectors remapped elsewhere

        --all of this is invisible to the OS. stated more precisely, the
          OS does not know the logical-to-physical sector mapping: the
          OS specifies a platter, track, and sector, but who knows where
          the data really is?

    --In any case, a larger difference in logical sector # generally
      means a larger seek

        --Highly non-linear relationship (*and* it depends on the zone)

        --OS has no info on rotational positions

        --Can empirically build a table to estimate access times

    --Turns out that sometimes the logical-->physical sector mapping is
      what you'd expect (sketched below)
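        --to make "what you'd expect" concrete, here is a sketch in C of
          the classical cylinder/head/sector numbering, with made-up
          geometry constants; real disks with zoning, skewing, and
          sparing only approximate this:

            /* Nominal logical-block numbering for a disk with a fixed
               (made-up) geometry: blocks are numbered along a track,
               then across the heads of a cylinder, then cylinder by
               cylinder. A real disk's mapping only approximates this. */

            #define SECTORS_PER_TRACK 1000
            #define HEADS_PER_CYL     16    /* 8 platters, 2 surfaces each */

            /* (cylinder, head, sector) -> logical block number; sectors
               are traditionally numbered starting at 1 */
            unsigned chs_to_lba(unsigned cyl, unsigned head, unsigned sector)
            {
                return (cyl * HEADS_PER_CYL + head) * SECTORS_PER_TRACK
                       + (sector - 1);
            }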
  F. how disk interfaces to bus

    --Computer and disk are often connected by a bus (e.g., SCSI)

    --Multiple devices may contend for the bus

    --Possible disk interface features:

        --Disconnect from the bus during requests

        --Command queuing: give the disk multiple requests

            --Disk can schedule them using rotational information

        --Disk cache used for read-ahead

            --Otherwise, sequential reads would incur a whole revolution

            --Cross track boundaries? Can't stop a head switch

        --Some disks support write caching

            --But the data is not stable---not suitable for all requests

  G. disk performance, II

    --Placement and ordering of requests is critical

    --Sequential I/O much, much MUCH **MUCH** faster than random

        --Long seeks much slower than short ones

    --Power might fail at any time, leaving inconsistent state

        --Must be careful about ordering, for crashes

        --More on this over the next few weeks

    --Try to achieve contiguous accesses where possible

        --for example, make big chunks of individual files contiguous

        --"The secret to making disks fast is to treat them like tape"
          (John Ousterhout)

    --Why? say you want to read 1 KB at a random location. how much does
      that cost?

        average seek: ~4 ms
        1/2 rotation: ~3 ms   (10000 RPM = ~167 RPS = 6 ms/rotation)
        transfer:     ~.01 ms

            because 512 bytes/sector * 1000 sectors/track * 1 track/6 ms
            = ~85 MB/s transfer speed (call it 80 MB/s), so
            1 KB / (80 MB/s) = 1 KB / (80 KB/ms) = ~.01 ms

        seek + rotation time dominates!

    --implication: can get 100s of times more data with almost no
      further overhead (more data affects only the transfer-time term)

    --more abstractly:

        effective_bandwidth(chunk_size) =
            chunk_size / (10 ms + chunk_size/actual_BW),

        with actual_BW ~80 MB/s
        (a short C sketch of this calculation appears at the end of this
        disks section)

    --Try to order requests to minimize seek times

        --OS (or disk) can only do this if it has multiple requests to
          order

        --Requires disk I/O concurrency

        --High-performance apps try to maximize I/O concurrency

        --or avoid I/O except to do write-logging (stick all your data
          structures in memory; write "backup" copies to disk
          sequentially; don't do random-access reads from the disk)

    --disk scheduling (see 5.4.3 in the book)

        --FCFS: process requests in the order they are received

            +: easy to implement
            +: good fairness
            -: cannot exploit request locality
            -: increases average latency, decreasing throughput

        --SPTF/SSTF/SSF: shortest positioning time first / shortest seek
          time first: pick the request with the shortest positioning
          (or seek) time

            +: exploits locality of requests
            +: higher throughput
            -: starvation
            -: don't always know which request will be fastest

            improvement: aged SPTF

                --give older requests higher priority

                --adjust the "effective" positioning time with a
                  weighting [no pun intended] factor:
                  T_{eff} = T_{pos} - W*T_{wait}

        --Elevator scheduling: like SPTF, but the next seek must be in
          the same direction; switch direction only if there are no
          further requests in that direction

            +: exploits locality
            +: bounded waiting
            -: cylinders in the middle get better service
            -: doesn't fully exploit locality

            modification: only sweep in one direction; very commonly
            used in Unix

  H. technology and systems trends

    --unfortunately, while seeks and rotational delays are getting a
      little faster, they have not kept up with the huge growth
      elsewhere in computers

        --transfer bandwidth has grown about 10x per decade

        --the thing that is growing fast is disk density (so $/byte
          stored keeps falling); that's because density depends less on
          the mechanical limitations that hold back seek and rotation
          times

            --to improve density, need to get the head closer to the
              surface

            --[aside: what happens if the head contacts the surface?
              it's called a "head crash": it scrapes off the magnetic
              material ... and, with it, the data]

    --disk accesses are a huge system bottleneck, and it's getting
      worse. so what to do?

        --bandwidth increases let the system (pre-)fetch large chunks
          for about the same cost as a small chunk

        --so trade latency for bandwidth if you can get lots of related
          stuff at roughly the same time. how to do that?

        --by clustering the related stuff together on the disk

    --the saving grace for big systems is that memory size is increasing
      faster than typical workload size

        --result: more and more of the workload fits in the file cache,
          which in turn means that the profile of traffic to the disk
          has changed: it's now mostly writes and new data

        --which means logging and journaling become viable (more on this
          over the next few classes)
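    --here is the effective-bandwidth formula from part G as a small C
      program; the 10 ms positioning cost and 80 MB/s transfer rate are
      the rough numbers quoted above, not measurements of any particular
      disk:

        #include <stdio.h>

        /* Effective bandwidth of one random access that transfers
           chunk_bytes, assuming ~10 ms of positioning cost (seek +
           rotation) and ~80 MB/s sustained transfer rate. */
        double effective_bw_MBps(double chunk_bytes)
        {
            double positioning_s = 0.010;      /* seek + rotation */
            double transfer_Bps  = 80e6;       /* bytes/sec off the platter */
            double total_s = positioning_s + chunk_bytes / transfer_Bps;
            return (chunk_bytes / total_s) / 1e6;
        }

        int main(void)
        {
            printf("1 KB: %.2f MB/s\n", effective_bw_MBps(1 << 10));  /* ~0.10 */
            printf("1 MB: %.2f MB/s\n", effective_bw_MBps(1 << 20));  /* ~45   */
            printf("8 MB: %.2f MB/s\n", effective_bw_MBps(8 << 20));  /* ~73   */
            return 0;
        }

      big, contiguous accesses amortize the positioning cost; tiny
      random ones are dominated by it.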
2. Flash memory

  A. Overview

    --Today, people are increasingly using flash memory

    --Completely solid state (no moving parts)

        --Remembers data by storing charge

        --Lower power consumption and heat

        --No mechanical seek times to worry about

    --Limited # of overwrites possible

        --Blocks wear out after 10,000 (MLC) -- 100,000 (SLC) erases

        --Requires a _flash translation layer_ (FTL) to provide _wear
          leveling_, so repeated writes to a logical block don't wear
          out one physical block

        --FTL can seriously impact performance

            --In particular, random writes are _very_ expensive (see the
              sketch at the end of this flash section);
              see http://research.microsoft.com/pubs/63681/TR-2005-176.pdf

    --Limited durability

        --Charge wears out over time

        --Turn the device off for a year, and you can easily lose data

  B. Types of flash memory

    --NAND flash (most prevalent for storage)

        --Higher density

        --Faster erase and write

        --More errors internally, so it needs error correction

    --NOR flash

        --Faster reads in smaller data units

        --Can execute code straight out of NOR flash

        --Significantly slower erases

    --Single-level cell (SLC) vs. multi-level cell (MLC)

        --MLC encodes multiple bits in the voltage level

        --MLC slower to write than SLC

    --NAND flash overview

        --Flash device has 2112-byte _pages_

            --2048 bytes of data + 64 bytes of metadata & ECC

        --_Blocks_ contain 64 (SLC) or 128 (MLC) pages (so blocks are
          128 KB or 256 KB)

        --Blocks are divided into 2--4 _planes_

            --All planes contend for the same package pins

            --But they can access their blocks in parallel to overlap
              latencies

        --Can _read_ one page at a time

            --Takes 25 microseconds + time to get the data off the chip

        --Must _erase_ a whole block before _programming_ (writing) it

            --Erase sets all bits to 1: very expensive (2 msec)

            --Programming a page of a pre-erased block requires moving
              the data to an internal buffer, then 200 (SLC) -- 800
              (MLC) microseconds

    --so random reads and writes are way faster than on a disk.
      But......

        --sequential disk reads and writes are roughly as fast as flash
          (at least in terms of order of magnitude) and much cheaper in
          $/byte

    --Flash characteristics (from
      http://cseweb.ucsd.edu/~swanson/papers/Asplos2009Gordon.pdf):

        Parameter                    SLC        MLC
        ---------------------------------------------------------
        Density per die (GB)         4          8
        Page size (bytes)            2048+32    2048+64
        Block size (pages)           64         128
        Read latency (us)            25         25
        Write latency (us)           200        800
        Erase latency (us)           2000       2000

        40 MHz, 16-bit bus:
        Read b/w (MB/s)              75.8       75.8
        Program b/w (MB/s)           20.1       5.0

        133 MHz:
        Read b/w (MB/s)              126.4      126.4
        Program b/w (MB/s)           20.1       5.0

    --disk vs. MLC NAND flash vs. regular DRAM:

                           disk          flash          DRAM
        --------------------------------------------------------------
        Smallest write     sector        sector         byte
        Atomic write       sector        sector         byte/word
        Random read        8 ms          75 us          50 ns
        Random write       8 ms          300 us*        50 ns
        Sequential read    100 MB/s      250 MB/s       > 1 GB/s
        Sequential write   100 MB/s      170 MB/s*      > 1 GB/s
        Cost               $.08--1/GB    $3/GB          $10-25/GB
        Persistence        Non-volatile  Non-volatile   Volatile

        *flash write performance degrades over time
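    --to see why random writes are so expensive without a clever FTL,
      here is a back-of-the-envelope C sketch using the SLC numbers from
      the table above; this is a simplified model, not how any
      particular FTL actually behaves:

        #include <stdio.h>

        /* Rough cost model: a naive in-place update of one page must
           copy and reprogram its whole block, whereas programming a
           page of a pre-erased block only pays the program time plus an
           amortized share of the erase. Real FTLs remap pages to avoid
           most of the copying. */

        enum {
            PAGES_PER_BLOCK = 64,     /* SLC */
            READ_US         = 25,
            PROGRAM_US      = 200,
            ERASE_US        = 2000,
        };

        int main(void)
        {
            /* read the other 63 pages, erase the block, reprogram all 64 */
            int naive_us = (PAGES_PER_BLOCK - 1) * READ_US
                         + ERASE_US
                         + PAGES_PER_BLOCK * PROGRAM_US;

            /* program one page of a pre-erased block, amortizing the erase */
            int seq_us = PROGRAM_US + ERASE_US / PAGES_PER_BLOCK;

            printf("naive random write of one page: %d us\n", naive_us); /* 16375 */
            printf("sequential write of one page:   %d us\n", seq_us);   /* 231   */
            return 0;
        }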
3. File systems

  A. Intro
  B. Files
  C. Implementing files
  D. Directories
  E. FS performance

  A. Intro

    --more papers on file systems than on any other single topic

    --probably also the hardest part of operating systems

    --what does a FS do?

        --provides persistence (the data doesn't go away ... ever)

        --somehow associates bytes on the disk with names (files)

        --somehow associates names with each other (directories)

    --where are FSes implemented?

        --can implement them on disk, over the network, in memory, in
          NVRAM (non-volatile RAM), on tape, with paper (!!!!)

        --we are going to focus on the disk and generalize later. we'll
          see what it means to implement a FS over the network

    --a few quick notes about disks in the context of FS design

        --the disk is the first thing we've seen that (a) doesn't go
          away and (b) we can modify (BIOS ROM, hardware configuration,
          etc. don't go away, but we weren't able to modify those
          things). two implications:

            (i) we're going to have to put all of our important state on
                the disk

            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't
                 happen again. scribble randomly on the disk --> now
                 what? (answer: in many cases, we're hosed.)

        --mismatch: the CPU and memory are *also* working with
          "important state", but they are vastly faster than disks

        --the disk is enormous: 100-1000x more data than memory

            --how to organize all of this information?

            --answer: by categorizing things (taxonomies). a FS is a
              kind of taxonomy ("/homes" has home directories,
              "/homes/bob/classes/cs372h" has bob's cs372h material,
              etc.)

  B. Files

    --what is a file?

        answer from the user's view: a bunch of named bytes on the disk

        answer from the FS's view: a collection of disk blocks

    --big job of a FS: map names and offsets to disk blocks

                             FS
            {file, offset} ------> disk address

    --operations are create(file), delete(file), read(), write()

    --***goal: operations should take as few disk accesses as possible
      and have minimal space overhead

        --wait, why do we want minimal space overhead, given that the
          disk is huge?

        --answer: cache space is never enough, and the amount of data
          that can be retrieved in one fetch is never enough. hence, we
          really don't want to waste space or fetches.

    [[--note that we have seen translation/indirection before:

        page table:
                             page table
            virtual address ------------> physical address

        per-file metadata:
                     inode
            offset ---------> disk block address

        how'd we get the inode?

        directory:
                        directory
            file name -------------> file #

            (a file # *is* an inode in Unix)
    ]]
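    --here is a minimal C sketch of those last two translations
      (directory lookup and inode-based block mapping); the structures
      and sizes are made up for illustration, not any real on-disk
      format:

        #include <string.h>

        #define BLOCK_SIZE 4096
        #define N_DIRECT   12
        #define FNAME_MAX  28

        struct inode     { unsigned nbytes; unsigned blocks[N_DIRECT]; };
        struct dir_entry { char name[FNAME_MAX]; int inum; };

        /* directory: file name -> file # (inode #), or -1 if not found */
        int dir_lookup(const struct dir_entry *de, int n, const char *name)
        {
            for (int i = 0; i < n; i++)
                if (strcmp(de[i].name, name) == 0)
                    return de[i].inum;
            return -1;
        }

        /* per-file metadata: (inode, byte offset) -> disk block address */
        unsigned bmap(const struct inode *ip, unsigned offset)
        {
            return ip->blocks[offset / BLOCK_SIZE];
        }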
  C. Implementing files

    --our task: uphold the goal marked *** above

    --for now, we're going to assume that the file's metadata is given
      to us. when we look at directories in a bit, we'll see where the
      metadata comes from; the picture above should also give a hint

    access patterns we could imagine supporting:

        (i) sequential:

            --file data processed in sequential order

            --by far the most common mode

            --example: editor writes out a new file, compiler reads in a
              file, etc.

        (ii) random access:

            --address any block in the file directly, without passing
              through the blocks before it

            --examples: large data set, demand paging, databases

        (iii) keyed access:

            --search for a block with particular values

            --examples: associative database, index

            --this is everywhere in the world of databases and search
              engines, but...

            --...usually not provided by the FS in an OS

    helpful observations:

        (i) all blocks in a file tend to be used together, sequentially

        (ii) all files in a directory tend to be used together

        (iii) all *names* in a directory tend to be used together

    further design parameters:

        (i) most files are small

        (ii) much of the disk is allocated to large files

        (iii) many of the I/O operations are made to large files

        (iv) want good sequential and good random access

    candidate designs........

    1. contiguous allocation

        "extent based"

        --when creating a file, make the user pre-specify its length,
          and allocate all the space at once

        --file metadata contains the location and size

        --example: IBM OS/360

            [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple

        +: fast access, both sequential and random

        -: fragmentation

        where have we seen something similar? (answer: segmentation in
        virtual memory)

    2. linked files

        --keep a linked list of free blocks

        --metadata: pointer to the file's first block

        --each block holds a pointer to the next one

        +: no more fragmentation

        +: sequential access is easy (and probably mostly fast, assuming
           decent free-space management, since the pointers will point
           close by)

        -: random access is a disaster

        -: pointers take up room in blocks; messes up alignment of data

    3. modification of linked files: FAT

        --keep the link structure in memory

            --in a fixed-size "FAT" (file allocation table)

            --pointer chasing now happens in RAM (see the sketch at the
              very end of these notes)

        [DRAW PICTURE]

        --example: MS-DOS (and iPods, MP3 players, digital cameras)

        +: no need to maintain a separate free list (the table says
           what's free)

        +: low space overhead

        -: maximum size limited: 64K entries * 512-byte blocks --> 32 MB
           max file system

            --bigger blocks bring advantages and disadvantages, and
              ditto a bigger table

        note: to guard against bad sectors, better store multiple copies
        of the FAT on the disk!!

        note: the root directory needs to live at a well-known location

  [thanks to David Mazieres for portions of the above]
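    --the sketch referenced in the FAT discussion above: FAT-style
      pointer chasing in C, with made-up constants; real FAT entry
      encodings and sizes differ:

        /* fat[] is the in-memory table: for each block, it holds the
           number of the file's next block, with FAT_EOF marking a
           file's last block. */

        #define BLOCK_SIZE 512
        #define N_BLOCKS   65536      /* 64K entries * 512 bytes = 32 MB */
        #define FAT_EOF    0xFFFFu

        static unsigned short fat[N_BLOCKS];  /* one entry per block, in RAM */

        /* (first block of a file, byte offset) -> block #, or FAT_EOF
           if the offset is past the end of the file */
        unsigned fat_bmap(unsigned first, unsigned offset)
        {
            unsigned b = first;
            for (unsigned i = 0; i < offset / BLOCK_SIZE && b != FAT_EOF; i++)
                b = fat[b];           /* pointer chasing happens in memory */
            return b;
        }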