Class 20
CS 372H  6 April 2010

On the board
------------
1. crash recovery
    --ad-hoc (last time)
    --ordered updates (last time)
    --soft updates (see notes)
    --journaling (today)
2. LFS (today)
3. Transactions (next time)

---------------------------------------------------------------------------

0. last time

    --corrections:
        --block size might be larger than 2-4KB. configurable. these days,
          the default might be as big as 16KB
        --disk block writes: blocks are big, but not written atomically.
          doesn't hurt, given the guarantees that these systems are
          providing.
        --directory blocks, however, need to be smaller

    --three viable approaches to crash recovery in file systems:

        (i) ad-hoc
            --worry about metadata consistency, not data consistency
            --accomplish metadata consistency by being careful about the
              order of updates
            --write metadata synchronously

        (ii) soft updates (the approach in OpenBSD)
            --evolution of the strawman presented last time
            --again, worry about metadata consistency
            --leads to great performance: metadata doesn't have to be
              written synchronously (writes just have to obey a partial
              order)
            --more on soft updates below, but we won't discuss in detail

        (iii) journaling (the approach in most Linux file systems)
            --more flexible
            --easier to reason about
            --possibly worse performance
            --discuss now

1. crash recovery

C. Approach: soft updates
   [http://portal.acm.org/citation.cfm?id=350853.350863]

    recall how we got here: we didn't want ad-hoc crash recovery, so we
    followed a principled approach to which file system blocks make it to
    disk in which order ("ordered updates"). but this strawman had a few
    problems, chief among them the cyclic dependency problem. this motivates
    soft updates.

    --summary:
        --Write blocks in any order
        --But keep track of dependencies
        --When writing a block, temporarily roll back any changes you can't
          yet commit to disk
            --that is, a given disk block may contain multiple logical
              updates, but when writing the block, write only the updates
              whose dependencies are already on the disk
            --in essence, you may wind up writing the same disk blocks
              several times

    --high-level approach:
        --for each updated field or pointer, maintain a structure that
          contains:
            --old value
            --new value
            --list of updates on which this update depends
          (a small sketch of such a structure appears at the end of this
          subsection)
        --now, can write disk blocks in any order, but....
        --have to temporarily undo updates with pending dependencies

    --fsck in this regime
        --very quick to get the FS consistent (just make sure per-cylinder
          summary info makes sense)
        --may take a bit of time to identify leaked disk blocks and inodes,
          but that can be done in the background
        --compared to traditional FFS fsck:
            --can have lots of inodes with non-zero link counts
            --they don't all belong in lost+found

    --limitations of soft updates:
        --arguably ad-hoc: very specific to FFS data structures. unclear how
          to apply the approach to FSes that use data structures like
          B+-trees
        --metadata updates can happen out of order (for example: create A,
          create B, crash.... it might be that only B exists after the
          crash!)
        --fsck not totally dispensed with (but runs in the background)
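    --to make the "old value / new value / dependency list" bookkeeping
      concrete, here is a minimal C sketch of one per-field update record.
      the names and layout are hypothetical, for illustration only; this is
      not the actual BSD soft updates code, which tracks dependencies per
      inode, per directory entry, per bitmap block, etc.:

        /* hypothetical soft-updates bookkeeping for a single updated field */
        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        struct update {
            void *field;              /* location of the field in the cached block */
            size_t len;               /* size of the field */
            unsigned char old_val[8]; /* value to roll back to if we can't commit */
            unsigned char new_val[8]; /* value we eventually want on disk */
            bool on_disk;             /* has this update reached the disk? */
            struct update **deps;     /* updates that must reach disk first */
            size_t ndeps;
        };

        /* an update may be written only if everything it depends on is on disk */
        static bool can_commit(const struct update *u)
        {
            for (size_t i = 0; i < u->ndeps; i++)
                if (!u->deps[i]->on_disk)
                    return false;
            return true;
        }

        /* before writing the block containing u, expose the new value only if
           its dependencies are satisfied; otherwise roll back to the old value */
        static void prepare_field_for_write(struct update *u)
        {
            memcpy(u->field, can_commit(u) ? u->new_val : u->old_val, u->len);
        }

        int main(void)
        {
            uint32_t inode_link_count = 1;   /* a field in some cached inode block */
            struct update bump = { .field = &inode_link_count,
                                   .len = sizeof inode_link_count };
            memcpy(bump.old_val, &(uint32_t){1}, sizeof(uint32_t));
            memcpy(bump.new_val, &(uint32_t){2}, sizeof(uint32_t));
            /* no dependencies recorded, so the new value can go out as-is */
            prepare_field_for_write(&bump);
            printf("link count written to disk: %u\n", inode_link_count);  /* 2 */
            return 0;
        }

      the point is just that each block write consults the dependency
      information and may temporarily expose old values, which is why the
      same block can end up being written several times.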
D. journaling

    --Reserve a portion of the disk for a **write-ahead log**
    --Write any metadata operation first to the log, then to the disk
    --After crash/reboot, re-play the log (efficient)
        --May re-do an already committed change, but won't miss anything
    --Performance advantage:
        --Log is a consecutive portion of the disk
        --Multiple log writes are very fast (at disk bandwidth)
        --Consider updates committed when written to the log
    --Example: delete directory tree
        --Record all freed blocks and changed directory entries in the log
        --Return control to user
        --Write out changed directories, bitmaps, etc. in the background
          (sort for good disk arm scheduling)

    --On recovery, must do three things:
        i.   find oldest relevant log entry
        ii.  find end of log
        iii. read and replay committed portion of log

        i. find oldest relevant log entry
            --Otherwise, redundant and slow to replay the whole log
            --Idea: checkpoints! (this idea is used throughout systems)
                --Once all records up to log entry N have been processed,
                  and once all affected blocks are stably committed to
                  disk...
                --Record N to disk, either in a reserved checkpoint location
                  or in a checkpoint log record
                --Never need to go back before the most recent
                  checkpointed N

        ii. find end of log
            --Typically a circular buffer, so look at sequence numbers
            --Can include begin-transaction/end-transaction records
                --but then need to make sure that "end transaction" only
                  gets to the disk after all other disk blocks in the
                  transaction are on disk
                --but the disk can reorder requests, and then the system
                  crashes
                --to avoid that, need a separate disk write for "end
                  transaction", which is a performance hit
                --to avoid that, use checksums: a log entry is committed
                  when all of its disk blocks match its checksum value

        iii. not much to say: read and replay!
             (a small end-to-end sketch of the replay loop appears at the
             end of this section, just before LFS)

    --Logs are key: they enable atomic complex operations. to see this,
      we'll take a slight detour.....

      [can skip, since the same points come up again under transactions]

      detour: some file systems (for example, XFS from SGI) use a B+-tree
      data structure. a few quick words about B+-trees:
        --key-value map
        --ordering defined on keys (where is nearest key?)
        --data stored in blocks, so explicitly designed for efficient disk
          access
        --with n items stored, all operations are O(log n):
            --retrieve closest to target key k
            --insert new pair
            --delete pair
        --see any algorithms book (e.g., Cormen et al.) for details
        --**complex to implement**

      --wait, why are we mentioning B+-trees? because some file systems use
        them:
          --efficient implementation of large directories (map key =
            hash(filename) to value = inode #)
          --efficient implementation of inodes
              --instead of using FFS-style fixed block pointers, map:
                file offset (key) --> {start block, # blocks} (value)
              --if a file consists of a small number of extents (i.e.,
                segments), then inodes are small, even for large files
          --efficient implementation of the map from inode # to inode. map:
            inode # --> {block #, # of consecutive inodes in use}
            [bonus: allows a fast way to identify a free inode!]

      --some B+-tree operations require multiple disk operations.
        intermediate states are incorrect. what happens if there's a crash
        in the middle? the B+-tree could be in an inconsistent state
      --journaling is a big help here
          --First write all changes to the log ("insert k,v", "delete k",
            etc.)
          --If crash while writing the log, the incomplete log record will
            be discarded, and no change made
          --Otherwise, if crash while updating the B+-tree, will replay the
            entire log record and write everything

    --limitations of journaling
        --fsync() syncs *all* operations' metadata to the log

    --write-ahead logging is everywhere
        --what's the problem? (all data is written twice, in the worst case)
        --(aside: less of a problem to write data twice if you have two
          disks. common way to make systems fast: use multiple disks. then
          it is easier to avoid seeks)
    --the log started as a way to help with consistency, but now the log is
      authoritative, so do we actually need the *other* copy of the data?
      what if everything is just stored in the log????
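    --before moving on to LFS, here is a minimal, self-contained C sketch of
      the replay loop from recovery steps (i)-(iii) above, using
      checksum-based commit. the record layout, the toy checksum, and the
      in-memory "log" and "disk" are all invented for illustration; no real
      journal looks like this:

        #include <stdint.h>
        #include <stdio.h>

        #define LOG_LEN   8   /* circular log, in records  */
        #define BLK_WORDS 4   /* toy "block" size, in words */

        struct log_record {
            uint64_t seqno;             /* increasing sequence number        */
            uint64_t home;              /* "disk address" the block lives at */
            uint64_t data[BLK_WORDS];   /* the block image itself            */
            uint64_t checksum;          /* commit marker: checksum over data */
        };

        static uint64_t checksum(const struct log_record *r) {
            uint64_t s = r->seqno ^ r->home;
            for (int i = 0; i < BLK_WORDS; i++)
                s += r->data[i] * 1000003u;
            return s;
        }

        static uint64_t disk[16][BLK_WORDS];  /* home locations replay writes to */

        static void recover(const struct log_record *log, uint64_t checkpoint_seq) {
            /* i.   oldest relevant entry = first record after the checkpoint
               ii.  end of log = first record whose seqno or checksum is wrong
               iii. replay (re-do) every committed record; re-doing is harmless */
            for (uint64_t seq = checkpoint_seq + 1;; seq++) {
                const struct log_record *r = &log[seq % LOG_LEN];
                if (r->seqno != seq || r->checksum != checksum(r))
                    break;                          /* uncommitted tail: stop */
                for (int i = 0; i < BLK_WORDS; i++)
                    disk[r->home][i] = r->data[i];  /* write block to its home */
            }
        }

        int main(void) {
            struct log_record log[LOG_LEN] = {0};
            /* two committed records after checkpoint 10, then a torn record 13 */
            for (uint64_t seq = 11; seq <= 12; seq++) {
                struct log_record *r = &log[seq % LOG_LEN];
                r->seqno = seq;
                r->home  = seq - 10;
                for (int i = 0; i < BLK_WORDS; i++)
                    r->data[i] = seq * 100 + i;
                r->checksum = checksum(r);
            }
            log[13 % LOG_LEN].seqno = 13;   /* crash mid-write: checksum missing */
            recover(log, 10);
            printf("disk[2][0] after recovery = %llu\n",
                   (unsigned long long)disk[2][0]);  /* 1200: record 12 replayed */
            return 0;
        }

      note how the checksum plays the role of the "end transaction" record:
      a torn record simply fails the checksum check and is ignored, with no
      extra synchronous disk write needed.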
2. Log-structured file system (LFS)

    A. Intro
    B. Finding data
    C. Garbage collection (cleaning)
    D. Crash recovery
    E. Discussion

A. Intro

    [DRAW PICTURE]

    --Idea: write data only once by having the log be the only copy on disk
        --> as you modify blocks in a file, just store them out on disk in
            the log
        --> this goes for everything: data blocks in a file, data blocks in
            a directory (of course), file metadata, etc.
        --> So writes seem pretty easy
        --> What about reads?
            --as long as the inode points to the right disk blocks, no
              problem to put disk blocks in the log
            --of course, if a file block is updated, the inode has to be
              updated to point to a new disk block, but no problem: just
              write a new inode to the log!
        --> Raises the question: how do we find inodes? (see below)

    --Performance characteristics:
        --all writes are sequential!
        --no seeks except for reads!

    --why we would want to build a file system like this: as we discussed a
      few classes ago, RAM is getting bigger --> caches are getting bigger
      --> less disk I/O is for reads. But we can't avoid writes. Conclusion:
      optimize for writes, which is precisely what LFS does.
        --moreover, if write order predicts read order (which it often
          does), then even read performance will be fast

B. Finding data

    [DRAW PICTURE]

    --Need to maintain an index in RAM and on the disk: a map from inode
      numbers to inodes
        --called the inode map: just an array of pointers to inodes
          (conceptually)
        --inode map periodically dumped to disk

C. Garbage collection (cleaning)

    --what if the log fills up? then we're in trouble. to avoid this, need
      to do *cleaning*.
    --approach: basically, compress the log, and leave some free space on
      the disk
    --use segments: contiguous regions of the log (1 MB in their
      implementation)
    --two data structures they maintain:
        --in-memory: segment usage table:
            [ <# free bytes> ]
        --on disk, per-segment: segment summary (probably a table indexed by
          entry in the segment, so the first entry in the table gives info
          about the first entry in the segment)
            [ ]

    --okay, which segments should we clean? and do you want the utilization
      to be high or low when you clean?
        --observe: if the utilization of a segment is 0 (as indicated by the
          segment usage table), cleaning it is really easy: just overwrite
          the segment!
        --if utilization is very low, that's a good sign: clean that segment
          (rewriting it is very little work)
        --but what if utilization is high? can we clean the segment?
        --insight: yes! provided what? (provided that the segment generally
          has a lot of "cold", that is, unchanging, data). the insight is
          that:
            --because the segment is cold, it's never going to get to a low
              utilization
            --at the same time, because it's cold, you're not wasting work
              by compressing it. the segment will "stay compressed".
        --they analyze bang for the buck: how long after compressing will
          the data stick around, to justify the work we did to clean and
          compress it?

            (benefit/cost) = (1-u)*age/(1+u)

            --cost: 1+u (1 to read in, u to write back)
            --benefit: "1-u" is the space we're taking back, times "age",
              namely how long those blocks will stay compact
            --point being: it is worthwhile to compact old data because it
              will be a while before those blocks are returned to the system
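    --to make the selection policy concrete, here is a small C sketch that
      ranks segments by (1-u)*age/(1+u). the structure names and fields are
      made up for illustration; this is not Sprite LFS's actual segment
      usage table:

        #include <stdio.h>
        #include <time.h>

        struct segment_usage {
            double utilization;   /* u: fraction of the segment that is live */
            time_t last_modified; /* proxy for the "age" of its live data    */
        };

        /* benefit/cost = (space reclaimed * how long it stays free) / (read + write) */
        static double benefit_over_cost(const struct segment_usage *s, time_t now)
        {
            double u   = s->utilization;
            double age = difftime(now, s->last_modified);
            if (u >= 1.0)
                return 0.0;               /* fully live: nothing to reclaim */
            return (1.0 - u) * age / (1.0 + u);
        }

        int main(void)
        {
            time_t now = time(NULL);
            struct segment_usage hot  = { 0.50, now - 60 };    /* half full, hot   */
            struct segment_usage cold = { 0.75, now - 86400 }; /* fuller, but cold */

            printf("hot : %.1f\n", benefit_over_cost(&hot, now));
            printf("cold: %.1f\n", benefit_over_cost(&cold, now));
            /* the cold segment wins despite its higher utilization: its blocks
               will "stay compressed", so the cleaning work pays off for longer */
            return 0;
        }

      the cleaner would compute this ratio for each segment (from the
      segment usage table) and clean the segments with the highest values
      first.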
D. Crash recovery

    --(aside: no free-block list or bitmap! simplifies crash recovery.)
    --Checkpoints!
        --Replay the log from the checkpoint
    --what happens on checkpoint?
        --write out everything to the log
            --file data blocks, indirect blocks, inodes, blocks of the inode
              map, and the segment usage table
        --write to the checkpoint region:
            --pointers to the inode map blocks and segment usage table, plus
              the current time and a pointer to the last segment written
    --how do you find the most recent checkpoint?
        --there are two checkpoint regions
        --use the one with the most recent time
        --then given the info (inode map, etc.), you're in business: you
          have the file system!!!
    --what happens on recovery?
        --read in the checkpoint
        --replay the log from the checkpoint
        (a small sketch of this recovery path appears at the end of these
        notes)
    --a few wrinkles:
        --what do they have to do when replaying the log? (basically just
          update the in-memory data structures, which are the inode map and
          the segment usage table.)
        --what's the plan?
            --when replaying a segment:
                --look at the segment summary table
                --use it to figure out what information in the segment is
                  live
                --update the inode map and segment usage table accordingly
                --usage of the current segment will increase (it started at
                  0 because we're talking about segments *after* the
                  checkpoint was written)
                --usage of other segments will decrease. why? (because the
                  new inodes point to new file data blocks, thereby
                  implicitly invalidating the old ones, causing them to
                  become free space)
        --directories and inodes may not be consistent:
            --a dir entry could point to an inode, but the inode not yet
              written
            --an inode could be written with a too-high link count (i.e.,
              the dir entry not yet written)
            --so what's the plan for fixing this?
                --log directory operations:
                    [ "create"|"link"|"rename"
                      dir inode #, position in dir
                      name of file, file inode #
                      new ref count ]
                --and make sure that these operations appear before the new
                  directory block or new inode
                --for "link" and "rename", the dir op gives all the info
                  needed to actually complete the operation
                --for "create", this is not the case. if "create" is in the
                  log, but the new inode isn't, then the directory entry is
                  removed on roll-forward

E. Discussion

    --crucial for you to understand the authors' points in the last
      paragraph of section 5.1
    --where does LFS fall down?
        --If the write workload is random access and the read workload is
          sequential, performance will be terrible. there will be lots of
          disk seeks.
        --see the rightmost set of bars in figure 9
    --are people using LFS today? why not?
        --not using it in its proposed form
        --but journaling is everywhere (for example, the ext3 file system on
          Linux).
        --people might not be using it for any number of reasons:
            --added performance not needed
            --performance so bad in pathological cases that people don't
              want to use it

[note: in LFS, the log *is* the file system. now going to briefly describe
transactional systems, which in some ways are closer to journaling. the log
is the authoritative record, but it's not the copy of the data used for read
accesses.]
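    --(appendix to section D: a minimal C sketch of the start of the LFS
      recovery path: pick the newer of the two checkpoint regions, then roll
      the log forward from there. all names, fields, and the stubbed-out
      roll-forward are hypothetical; this is not Sprite LFS's on-disk
      format.)

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct checkpoint {
            uint64_t timestamp;        /* when this checkpoint region was written   */
            uint64_t inode_map_block;  /* where the inode map blocks live           */
            uint64_t seg_usage_block;  /* where the segment usage table lives       */
            uint64_t last_segment;     /* last segment written before checkpointing */
            bool     valid;            /* e.g., the region's own checksum verified  */
        };

        /* two fixed checkpoint regions: use whichever valid one is newer */
        static const struct checkpoint *pick_checkpoint(const struct checkpoint cr[2])
        {
            if (!cr[0].valid) return cr[1].valid ? &cr[1] : NULL;
            if (!cr[1].valid) return &cr[0];
            return cr[0].timestamp >= cr[1].timestamp ? &cr[0] : &cr[1];
        }

        int main(void)
        {
            struct checkpoint cr[2] = {
                { .timestamp = 1000, .last_segment = 40, .valid = true },
                { .timestamp = 1005, .last_segment = 42, .valid = true },
            };
            const struct checkpoint *cp = pick_checkpoint(cr);
            if (cp == NULL)
                return 1;   /* neither region usable */

            printf("recover from checkpoint written at time %llu\n",
                   (unsigned long long)cp->timestamp);
            /* roll-forward (not shown): read the inode map and segment usage
               table that cp points to, then walk the segments written after
               cp->last_segment, using each segment summary to find live inodes
               and blocks, updating the in-memory inode map and segment usage
               table, and applying logged directory operations whose inodes
               actually made it to disk */
            return 0;
        }

      (having two regions means a crash while writing one checkpoint still
      leaves the other one usable.)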