Class 18
CS 202
14 April 2015

On the board
------------

1. Last time

2. File systems: crash recovery
    --intro
    --ad-hoc
    --ordered updates
    --journaling

---------------------------------------------------------------------------

1. Last time

--finished directories

--looked at file system performance

--today: look at file system persistence and crash recovery (and a bit more
  about performance)

    theme: we'll see two ways in which we gain from "treating the disk like
    tape"

--next time: revisit some of these concepts in the context of a different
  kind of system (database management systems, transactions, etc.)

2. crash recovery

--There are a lot of data structures used to implement the file system:
  bitmap of free blocks, directories, inodes, indirect blocks, data blocks,
  etc.

--We want these data structures to be *consistent*: we want invariants to
  hold

--Thorny issue: *crashes*

--Two things make the problem worse: (a) write-back caching and (b)
  non-ordered disk writes. (a) means the OS delays writing back modified
  disk blocks. (b) means that the modified disk blocks can go to the disk
  in an unspecified order.

--Example: [DRAW PICTURE]

    INODE
    DATA BLOCK ADDED
    DATA BITMAP UPDATED

    crash. restart. uh-oh.

--Solution: the system requires a notion of atomicity

--How to think about this stuff: imagine that a crash can happen at any
  time. (The only thing that happens truly atomically is a write of a
  512-byte disk sector.) So you want to arrange for the world to look sane,
  regardless of where a crash happens.

    --> Your leverage, as file system designer, is that you can arrange for
        some disk writes to happen *synchronously* (meaning that the system
        won't do anything until these disk writes complete), and you can
        impose some ordering on the actual writes to the disk.

--So we need to arrange for higher-level operations ("add data to file") to
  _look_ atomic: an update either occurs or it doesn't.

--Potentially useful analogy: during our concurrency unit, we had to worry
  about arbitrary interleavings (which we then tamed with concurrency
  primitives). Here, we have to worry that a crash can happen at any time
  (and we will tame this with abstractions like transactions). The response
  in both cases is a notion of atomicity.

--We will mention three approaches to crash recovery in file systems:

    A. Ad-hoc (the book calls this "fsck")
    B. Ordered (soft) updates
    C. Journaling (also known as write-ahead logging)

A. Ad-hoc

    --Goal: metadata consistency, not data consistency (rationale: too
      expensive to provide data consistency; cannot live without metadata
      consistency.)

    --Approach: arrange to send file system updates to the disk in such a
      way that, if there is a crash, **fsck** can clean up inconsistencies

    --example: for file create:
        --write data to file
        --then update/write inode
        --then mark inode "allocated" in bitmap
        --then mark data blocks "allocated" in bitmap
        --then update directory
        --(if directory grew) mark new file block (for directory)
          "allocated" in bitmap

      now, cases:

        inode not marked allocated in bitmap --> the only writes were to
        unallocated, unreachable blocks; the result is that the write
        "disappears"

        inode allocated, data blocks not marked allocated in bitmap -->
        fsck must update the bitmap

        file created, but not yet in any directory --> fsck ultimately
        deletes the file (after all that!)

    Disadvantages to this ad-hoc approach:

        (a) need to get the ad-hoc reasoning exactly right

        (b) poor performance (synchronous writes of metadata)

            --multiple updates to the same block require that they be
              issued separately. for example, imagine two updates to the
              same directory block: the first must complete before the
              second is issued (otherwise, the writes are not synchronous)

            --more generally, the cost of crash recoverability is enormous.
              (a job like "untar" could be 10-20x slower)

        (c) slow recovery: fsck must scan the entire disk

            --recovery gets slower as disks get bigger. if fsck takes one
              minute, what happens when the disk gets 10 times bigger?

            --essentially, fsck has to scan the entire disk
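    To make the fsck cases above concrete, here is a minimal sketch of the
    kind of metadata pass fsck makes: rebuild the block bitmap from the
    reachable inodes and free inodes that no directory references. The
    in-memory layout (struct inode, the arrays, the field names) is invented
    for illustration; the real fsck reads these structures off the disk and
    handles far more cases.

    /* sketch of an fsck-style metadata pass; layout is invented */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NINODES 16
    #define NBLOCKS 64
    #define NDIRECT  4

    struct inode {
        bool allocated;          /* is this inode in use?                  */
        bool reachable;          /* does some directory entry point to it? */
        int  blocks[NDIRECT];    /* direct block pointers; -1 = unused     */
    };

    static struct inode inodes[NINODES];
    static bool block_bitmap[NBLOCKS];   /* true = block marked allocated */

    void fsck_metadata_pass(void)
    {
        bool correct[NBLOCKS];
        memset(correct, 0, sizeof correct);

        for (int i = 0; i < NINODES; i++) {
            if (!inodes[i].allocated)
                continue;
            if (!inodes[i].reachable) {
                /* "created, but not yet in any directory": delete the file */
                printf("inode %d unreachable; freeing it\n", i);
                inodes[i].allocated = false;
                continue;
            }
            /* every block a live inode points to must be marked allocated */
            for (int j = 0; j < NDIRECT; j++)
                if (inodes[i].blocks[j] >= 0)
                    correct[inodes[i].blocks[j]] = true;
        }

        /* fix up the bitmap so it agrees with the reachable inodes */
        for (int b = 0; b < NBLOCKS; b++) {
            if (block_bitmap[b] != correct[b])
                printf("block %d: bitmap said %d, should be %d; fixing\n",
                       b, block_bitmap[b], correct[b]);
            block_bitmap[b] = correct[b];
        }
    }

    int main(void)
    {
        /* fabricate one inconsistent state: inode 3 points at block 9,
           but the crash happened before the bitmap write landed */
        inodes[3] = (struct inode){ .allocated = true, .reachable = true,
                                    .blocks = { 9, -1, -1, -1 } };
        fsck_metadata_pass();
        return 0;
    }

    Note how the fix-up only works because the creation order above
    guarantees which direction the inconsistency can point (inode written
    before bitmap); that is exactly the "ad-hoc reasoning" that must be
    gotten exactly right.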
B. Ordered updates

    --could reason carefully about the precise order in which asynchronous
      writes should go to disk

    --advantages

        --performance

        --fsck is very fast and can run in the background, since all it
          needs to do is fix up bookkeeping

    --limitations

        --hard to get right

        --arguably ad-hoc: very specific to FFS data structures. unclear
          how to apply the approach to FSes that use data structures like
          B+-trees

        --metadata updates can happen out of order (for example: create A,
          create B, crash.... it might be that only B exists after the
          crash!)

    --to see this approach in action:

        [G. R. Ganger, M. K. McKusick, C. A. N. Soules, and Y. N. Patt.
        Soft Updates: A Solution to the Metadata Update Problem in File
        Systems. ACM Trans. on Computer Systems, Vol. 18, No. 2, May 2000,
        pp. 127-153. http://portal.acm.org/citation.cfm?id=350853.350863]

C. Journaling

    Golden rule of atomicity, per Saltzer-Kaashoek: "never modify the only
    copy"

    The use of journaling in file systems is borrowed from databases.
    Journaling provides a kind of transaction. In the next class, we will
    study transactional systems directly.

    --Reserve a portion of the disk for a **write-ahead log**

    --Write any metadata operation first to the log, then to the disk

    --After crash/reboot, re-play the log (efficient)

        --May re-do an already committed change, but won't miss anything

    --Performance advantage:

        --Log is a consecutive portion of the disk

        --Multiple log writes are very fast (at disk bandwidth)

        --Consider updates committed when written to the log

    --Example: delete directory tree

        --Record all freed blocks and changed directory entries in the log

        --Return control to user

        --Write out changed directories, bitmaps, etc. in the background
          (sort for good disk arm scheduling)

    --On recovery, must do three things:

        i.   find the oldest relevant log entry
        ii.  find the end of the log
        iii. read and replay the committed portion of the log

        i. find the oldest relevant log entry

            --Otherwise, it is redundant and slow to replay the whole log

            --Idea: checkpoints! (this idea is used throughout systems)

                --Once all records up to log entry N have been processed,
                  and once all affected blocks are stably committed to
                  disk ...

                --Record N to disk, either in a reserved checkpoint
                  location or in a checkpoint log record

                --Never need to go back before the most recent
                  checkpointed N

        ii. find the end of the log

            --Typically a circular buffer, so look at sequence numbers

            --Can include begin-transaction/commit-transaction records

                --but then need to make sure that "commit transaction"
                  only gets to the disk after all other updates in the
                  transaction are in the log on disk

                --but the disk can reorder requests, and then the system
                  crashes

                --to avoid that, need a separate disk write for "commit
                  transaction", which is a performance hit

                --to avoid that, use checksums: a log entry is committed
                  when all of its disk blocks match its checksum value

        iii. not much to say: read and replay!

    --write-ahead logging is everywhere

    --what's the problem? (all data is written twice, in the worst case)

        --(aside: less of a problem to write data twice if you have two
          disks. common way to make systems fast: use multiple disks. then
          it's easier to avoid seeks)
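    To make the log mechanics concrete, here is a minimal sketch of
    write-ahead logging with a separate commit record. The file name
    "journal.log", the record format, and the printf stand-ins for in-place
    block writes are all invented for illustration; real journaling file
    systems (ext4 and friends) lay out their journals quite differently.

    /* sketch: redo-only write-ahead log with a commit record */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    struct logrec {
        uint64_t block;        /* which "disk block" this update is for  */
        char     data[64];     /* new contents of that block (redo info) */
        uint32_t committed;    /* 1 only in the commit record            */
    };

    /* append one record and force it to stable storage before returning:
       the write-ahead rule */
    static int log_append(int logfd, const struct logrec *r)
    {
        if (write(logfd, r, sizeof *r) != sizeof *r)
            return -1;
        return fsync(logfd);
    }

    /* commit a two-block update: record bodies first, then the commit
       record; only afterward may the in-place copies be written lazily */
    int update_two_blocks(const char *logpath)
    {
        int logfd = open(logpath, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (logfd < 0)
            return -1;

        struct logrec r1 = { .block = 7,  .committed = 0 };
        struct logrec r2 = { .block = 12, .committed = 0 };
        struct logrec commit = { .block = 0, .committed = 1 };
        strcpy(r1.data, "new contents of block 7");
        strcpy(r2.data, "new contents of block 12");

        if (log_append(logfd, &r1) || log_append(logfd, &r2) ||
            log_append(logfd, &commit)) {      /* <-- the commit point */
            close(logfd);
            return -1;
        }
        close(logfd);
        /* in-place writes can now happen in the background; if we crash
           before they land, recovery replays the log */
        return 0;
    }

    /* recovery: replay only records that are followed by a commit record;
       re-doing an already applied record is harmless (idempotent) */
    void recover(const char *logpath)
    {
        int logfd = open(logpath, O_RDONLY);
        if (logfd < 0)
            return;

        struct logrec pending[16], r;
        int npending = 0;

        while (read(logfd, &r, sizeof r) == sizeof r) {
            if (!r.committed) {
                if (npending < 16)
                    pending[npending++] = r;   /* not yet known committed */
                continue;
            }
            for (int i = 0; i < npending; i++)
                printf("redo: write block %llu <- \"%s\"\n",
                       (unsigned long long)pending[i].block,
                       pending[i].data);
            npending = 0;
        }
        /* records after the last commit record are simply ignored */
        close(logfd);
    }

    int main(void)
    {
        update_two_blocks("journal.log");   /* normal operation */
        recover("journal.log");             /* what reboot would do */
        return 0;
    }

    Note that forcing the commit record with its own fsync is exactly the
    "separate disk write for commit transaction" performance hit mentioned
    above; the checksum trick is the way around it.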
D. Summarize crash recovery

    --three viable approaches to crash recovery in file systems:

        (i) ad-hoc

            --worry about metadata consistency, not data consistency

            --accomplish metadata consistency by being careful about the
              order of updates

            --write metadata synchronously

            --fsck cleans up on restart

        (ii) ordered updates (soft updates), which is in OpenBSD

            --worry about metadata consistency

            --leads to great performance: metadata doesn't have to be
              written synchronously (writes just have to obey a partial
              order)

            --fsck again cleans up on restart, but its job is simpler and
              faster

        (iii) journaling (the approach in most Linux file systems)

            --more flexible

            --easier to reason about

            --possibly worse performance

            --on restart, no need for fsck, just log recovery (although
              this is logically a kind of fsck)

    --the log started as a way to help with consistency, but now the log is
      authoritative, so do we actually need the *other* copy of the data?
      what if everything is just stored in the log????

        Transition to Log-structured file system (LFS)