Class 19
CS 372H  5 April 2011

On the board
------------
1. Last time
2. Crash recovery
3. LFS

---------------------------------------------------------------------------

1. Last time

    --file systems

    --didn't do mmap in class, but see notes

    --correction: I said that hard links mean that there needs to be
      garbage collection. This was imprecise (and wrong). Hard links mean
      that there needs to be *reference counting*. GC is needed only if
      there can be cycles in the directory graph (but we rule out those
      cycles in Unix by disallowing hard links to directories).

    --question #1: why are inode 0 and inode 1 reserved?

        --different file systems use them for different purposes, and
          some don't use them for any purpose (reserved can mean
          "reserved for future use"). Historically, inode 1 linked
          together all of the bad blocks; 0 may be reserved because it
          doubles as an error return code.

    --question #2: can a symlink contain a relative path?

        answer: yes, it can. The relative path is interpreted relative to
        the directory containing the symlink, that is, the directory
        through which the link was reached. One subtlety: the symlink's
        inode may itself have multiple hard links pointing to it. In
        other words, it may have multiple names, in different
        directories. In that case, the same relative path resolves to a
        different target depending on which name is used to reach the
        link.

2. Crash recovery

    --there are a lot of data structures used to implement the file
      system (bitmap of free blocks, directories, inodes, indirect
      blocks, data blocks, etc.)

    --requirement: crash anywhere, and the system can be recovered

    --options:

        --*write through*: write changes immediately to disk.
          problem: slow! have to wait for each write to complete before
          going on

        --*write back*: delay writing modified data back to disk.
          problem: can lose data.
          another problem: updates can go to the disk in the wrong order

    --if multiple updates are needed, do them in a specific order so
      that, if a crash occurs, **fsck** can work

        --see 4.4.3 in the text for an explanation of fsck

    --approaches to crash recovery:

        * Ad-hoc
        * Ordered updates
        * WAL (write-ahead logging) / journaling

A. Ad-hoc

    --can't have all data written asynchronously.
      If all data were written asynchronously, we could encounter the
      following unacceptable scenarios:

      (a) Delete/truncate a file, append to another file, crash

          --the new file may reuse a block from the old
          --the old inode may not be updated
          --cross-allocation!
          --often the inode with the older mtime is the wrong one, but we
            can't be sure

      (b) Append to a file, allocate an indirect block, crash

          --inode points to the indirect block
          --but the indirect block may contain garbage

    --so what's the actual approach?

        --be careful about the order of updates. specifically:
            --write the new inode to disk before the directory entry
            --remove the directory name before deallocating the inode
            --write the cleared inode to disk before updating the
              cylinder group free map

    --how is it implemented?

        --synchronous write-through for *metadata*
        --doing one metadata write at a time ensures ordering

        example: for file create:
            --write data to file
            --update/write inode
            --mark inode "allocated" in bitmap
            --mark data blocks "allocated" in bitmap
            --update directory
            --(if directory grew) mark new file block "allocated" in
              bitmap

        now, cases:
            --inode not marked allocated in bitmap --> the only writes
              were to unallocated, unreachable blocks; the result is that
              the write "disappears"
            --inode allocated, data blocks not marked allocated in bitmap
              --> fsck must update the bitmap
            --file created, but not yet in any directory --> fsck
              ultimately deletes the file (after all that!)

    disadvantages of this ad-hoc approach:

        (a) need to get the ad-hoc reasoning exactly right

        (b) poor performance (synchronous writes of metadata)
            --multiple updates to the same block must be issued
              separately. for example, two updates to the same directory
              block require the first to complete before the second is
              issued (otherwise, not synchronous)
            --more generally, the cost of crash recoverability is
              enormous (a job like "untar" could be 10-20x slower)

        (c) slow recovery: fsck must scan the entire disk
            --recovery gets slower as disks get bigger. if fsck takes one
              minute today, what happens when the disk gets 10 times
              bigger?
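The file-create cases above can be sketched as a toy simulation (not real FFS code; the field names and step order are illustrative assumptions): crash after any prefix of the synchronous writes, and check that fsck always has an answer.

```python
# Toy model of the ad-hoc approach: metadata writes for "create file" go
# to disk synchronously, in a fixed order, so a crash after any prefix
# leaves a state that fsck can repair. All names here are illustrative.

CREATE_STEPS = [
    "data_written",    # write file data
    "inode_written",   # write the new inode
    "inode_bitmap",    # mark inode "allocated" in bitmap
    "data_bitmap",     # mark data blocks "allocated" in bitmap
    "dir_entry",       # update the directory
]

def crash_during_create(steps_completed):
    """Disk state if we crash after `steps_completed` synchronous writes."""
    return {s: i < steps_completed for i, s in enumerate(CREATE_STEPS)}

def fsck(disk):
    """What recovery concludes from the on-disk state (the cases above)."""
    if not disk["inode_bitmap"]:
        # only unallocated, unreachable blocks were touched:
        # the half-finished create simply "disappears"
        return "nothing to do"
    if not disk["data_bitmap"]:
        return "fix bitmap"          # inode allocated, data blocks not marked
    if not disk["dir_entry"]:
        return "delete orphan file"  # file reachable from no directory
    return "consistent"

if __name__ == "__main__":
    for k in range(len(CREATE_STEPS) + 1):
        print(f"crash after {k} writes -> fsck: {fsck(crash_during_create(k))}")
```

The point of the fixed order is visible in `fsck`: each branch only ever sees states the ordering allows, so recovery never has to guess.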
    [aside: why not use battery-backed RAM?
     answer:
        --expensive (requires specialized hardware)
        --often don't learn the battery has died until too late
        --a pain if the computer dies (can't just move the disk)
        --if an OS bug causes the crash, the RAM might be garbage]

B. Approach: ordered updates

    --reason carefully about the precise order in which asynchronous
      writes should go to disk

    --advantages
        --performance
        --fsck is very fast and can run in the background, since all it
          needs to do is fix up bookkeeping

    --limitations
        --hard to get right
        --arguably ad-hoc: very specific to FFS data structures. unclear
          how to apply the approach to FSes that use data structures like
          B+-trees
        --metadata updates can happen out of order (for example: create
          A, create B, crash.... it might be that only B exists after the
          crash!)

    --to see this approach in action:
        [G. R. Ganger, M. K. McKusick, C. A. N. Soules, and Y. N. Patt.
        Soft Updates: A Solution to the Metadata Update Problem in File
        Systems. ACM Trans. on Computer Systems, Vol. 18, No. 2, May
        2000, pp. 127-153.
        http://portal.acm.org/citation.cfm?id=350853.350863]

C. Journaling

    --reserve a portion of the disk for a **write-ahead log**

    --write any metadata operation first to the log, then to the disk

    --after crash/reboot, re-play the log (efficient)
        --may re-do an already committed change, but won't miss anything

    --performance advantage:
        --the log is a consecutive portion of the disk
        --multiple log writes are very fast (at disk bandwidth)
        --consider updates committed when they are written to the log

    --example: delete a directory tree
        --record all freed blocks and changed directory entries in the
          log
        --return control to the user
        --write out changed directories, bitmaps, etc. in the background
          (sort for good disk arm scheduling)

    --on recovery, must do three things:
        i.   find the oldest relevant log entry
        ii.  find the end of the log
        iii. read and replay the committed portion of the log

    i. find the oldest relevant log entry

        --otherwise, it would be redundant and slow to replay the whole
          log
        --idea: checkpoints!
          (this idea is used throughout systems)

        --once all records up to log entry N have been processed, and
          once all affected blocks are stably committed to disk ...
        --record N to disk, either in a reserved checkpoint location or
          in a checkpoint log record
        --never need to go back before the most recently checkpointed N

    ii. find the end of the log

        --the log is typically a circular buffer, so look at sequence
          numbers
        --can include begin-transaction/end-transaction records
            --but then need to make sure that "end transaction" gets to
              the disk only after all the other disk blocks in the
              transaction are on disk
            --but the disk can reorder requests, and then the system
              crashes
            --to avoid that, need a separate disk write for "end
              transaction", which is a performance hit
            --to avoid *that*, use checksums: a log entry is committed
              when all of its disk blocks match its checksum value

    iii. not much to say: read and replay!

    --logs are key: they enable atomic complex operations. to see this,
      we'll take a slight detour.....

    [can skip, since the same points come up again under transactions]

    detour: some file systems (for example, XFS from SGI) use a B+-tree
    data structure. a few quick words about B+-trees:

        --key-value map
        --ordering defined on keys (where is the nearest key?)
        --data stored in blocks, so explicitly designed for efficient
          disk access
        --with n items stored, all operations are O(log n):
            --retrieve the pair closest to target key k
            --insert a new pair
            --delete a pair
        --see any algorithms book (e.g., Cormen et al.) for details
        --**complex to implement**

    --wait, why are we mentioning B+-trees? because some file systems use
      them for:

        --efficient implementation of large directories
            (map key = hash(filename) to value = inode #)
        --efficient implementation of inodes
            --instead of using FFS-style fixed block pointers, map:
              file offset (key) --> {start block, # blocks} (value)
            --if a file consists of a small number of extents (i.e.,
              segments), then inodes are small, even for large files
        --efficient implementation of the map from inode # to inode.
            map: inode # --> {block #, # of consecutive inodes in use}
            [bonus: allows a fast way to identify a free inode!]

    --some B+-tree operations require multiple disk writes. intermediate
      states are incorrect. what happens if there's a crash in the
      middle? the B+-tree could be in an inconsistent state

    --journaling is a big help here
        --first write all changes to the log ("insert k,v", "delete k",
          etc.)
        --if we crash while writing the log, the incomplete log record
          will be discarded, and no change is made
        --otherwise, if we crash while updating the B+-tree, recovery
          will replay the entire log record and write everything

    --limitations of journaling
        --fsync() syncs *all* operations' metadata to the log

    --write-ahead logging is everywhere
        --what's the problem? (all data is written twice, in the worst
          case)
        --(aside: it's less of a problem to write data twice if you have
          two disks. a common way to make systems fast: use multiple
          disks. then it's easier to avoid seeks)
        --the log started as a way to help with consistency, but now the
          log is authoritative, so do we actually need the *other* copy
          of the data? what if everything is just stored in the log????

        Transition to Log-structured file system (LFS)

D. Summary of crash recovery

    --three viable approaches to crash recovery in file systems:

        (i) ad-hoc
            --worry about metadata consistency, not data consistency
            --accomplish metadata consistency by being careful about the
              order of updates
            --write metadata synchronously

        (ii) ordered updates (soft updates), which is in OpenBSD
            --worry about metadata consistency
            --leads to great performance: metadata doesn't have to be
              written synchronously (writes just have to obey a partial
              order)

        (iii) journaling (the approach in most Linux file systems)
            --more flexible
            --easier to reason about
            --possibly worse performance

3. Log-structured file system (LFS)

    A. Intro
    B. Finding data
    C. Crash recovery
    D. [next time] Garbage collection (cleaning)
    E. [next time] Discussion

A.
   Intro

    [DRAW PICTURE]

    --idea: write data only once, by having the log be the only copy on
      disk
        --> as you modify blocks in a file, just store them out on disk
            in the log
        --> this goes for everything: data blocks in a file, data blocks
            in a directory (of course), file metadata, etc.
        --> so writes seem pretty easy
        --> what about reads?
            --as long as the inode points to the right disk blocks, it's
              no problem to put disk blocks in the log
            --of course, if a file block is updated, the inode has to be
              updated to point to the new disk block, but no problem:
              just write a new inode to the log!
        --> raises the question: how do we find inodes? (see below)

    --performance characteristics:
        --all writes are sequential!
        --no seeks except for reads!

    --why we would want to build a file system like this: as we discussed
      a few classes ago, RAM is getting bigger --> caches are getting
      bigger --> less disk I/O is for reads. But we can't avoid writes.
      Conclusion: optimize for writes, which is precisely what LFS does.

    --moreover, if write order predicts read order (which it often does),
      then even read performance will be fast

B. Finding data

    [DRAW PICTURE]

    --need to maintain an index in RAM and on disk: a map from inode
      numbers to inodes
        --called the inode map: just an array of pointers to inodes
          (conceptually)
        --the inode map is periodically dumped to disk

C. Crash recovery

    --(aside: no free-block list or bitmap! simplifies crash recovery.)

    --checkpoints!

    --what happens on a checkpoint?
        --write out everything that is modified to the log
            --file data blocks, indirect blocks, inodes, blocks of the
              inode map, and the segment usage table
        --write to the checkpoint region:
            --pointers to the inode map blocks and segment usage table,
              plus the current time and a pointer to the last segment
              written

    --what happens on recovery?
        --read in the checkpoint
        --replay the log from the checkpoint

    --how do you find the most recent checkpoint?
        --there are two checkpoint regions
        --use the one with the most recent time
        --then, given the info (inode map, etc.), you're in business: you
          have the file system!!!

    --a few wrinkles:

        --what do you have to do when replaying the log? (basically just
          update the in-memory data structures, which are the inode map
          and the segment usage table.)

        --what's the plan?
            --when replaying a segment:
                --look at the segment summary
                --use it to figure out what information in the segment is
                  live
                --update the inode map and segment usage table
                  accordingly
                    --usage of the current segment will increase (it
                      started at 0, because we're talking about segments
                      written *after* the checkpoint)
                    --usage of other segments will decrease. why?
                      (because the new inodes point to new file data
                      blocks, thereby implicitly invalidating the old
                      ones, causing them to become free space)

        --directories and inodes may not be consistent:
            --a dir entry could point to an inode, but the inode was
              never written
            --an inode could be written with a too-high link count (i.e.,
              the dir entry was not yet written)
        --so what's the plan for fixing this?
            --log directory operations:
                [ "create"|"link"|"rename"
                  dir inode #, position in dir
                  name of file, file inode #
                  new ref count ]
            --and make sure that these operations appear in the log
              before the new directory block or new inode
            --for "link" and "rename", the dir op gives all the info
              needed to actually complete the operation
            --for "create", this is not the case. if "create" is in the
              log but the new inode isn't, then the directory entry is
              removed on roll-forward
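The core of the recovery sequence above (pick the newer of the two checkpoint regions, then roll forward through segments written after it) can be sketched as a toy model. The data structures here are simplified assumptions, not the real on-disk formats: a checkpoint is a dict with a timestamp and an inode map, and a segment summary is a list of (inode #, new inode address) pairs.

```python
# Sketch of LFS recovery: choose a checkpoint, then roll forward.
# "Addresses" are just strings like "seg3+7" for illustration.

def choose_checkpoint(region_a, region_b):
    """LFS alternates between two checkpoint regions; use the newer one."""
    return region_a if region_a["time"] >= region_b["time"] else region_b

def roll_forward(inode_map, summaries_after_checkpoint):
    """Replay segment summaries in order. Each rewritten inode supersedes
    the old copy, so pointing the inode map at the new address implicitly
    frees whatever the old inode referenced (no free list needed)."""
    for summary in summaries_after_checkpoint:
        for inum, new_addr in summary:
            inode_map[inum] = new_addr
    return inode_map

# Two (hypothetical) checkpoint regions; B is more recent.
ckpt_a = {"time": 100, "inode_map": {1: "seg0+4"}}
ckpt_b = {"time": 250, "inode_map": {1: "seg3+7", 2: "seg3+9"}}

ckpt = choose_checkpoint(ckpt_a, ckpt_b)

# Two segments were written after the checkpoint: their summaries say
# inode 2 was rewritten and inode 5 was created.
recovered = roll_forward(dict(ckpt["inode_map"]),
                         [[(2, "seg4+1")], [(5, "seg5+2")]])
# recovered now maps each live inode number to its newest address
```

This omits the segment usage table updates and the directory-operation log described above; it only shows why roll-forward can be so simple: the inode map is the single point of truth, so replay is just a series of pointer updates.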