Class 18
CS372H 29 March 2012

On the board
------------

1. Last time

2. LFS

---------------------------------------------------------------------------

1. Last time

--disk performance

--file systems: discussed the data structures

--today: performance and crash recovery

2. LFS

   A. Intro
   B. Finding data
   C. Crash recovery
   D. [next time] Garbage collection (cleaning)
   E. Discussion

A. Intro

[DRAW PICTURE]

--Idea: write data only once by having the log be the only copy on disk

    --> as you modify blocks in a file, just store them out on disk in
    the log

    --> this goes for everything: data blocks in a file, data blocks in
    a directory (of course), file metadata, etc.

    --> So writes seem pretty easy

    --> What about reads?

        --as long as the inode points to the right disk blocks, there
        is no problem with putting disk blocks in the log

        --of course, if a file block is updated, the inode has to be
        updated to point to a new disk block, but no problem: just
        write a new inode to the log!

    --> Raises the question: how do we find inodes? (see below)

--Performance characteristics:

    --all writes are sequential!

    --no seeks except for reads!

--why we would want to build a file system like this: as we discussed
last time, RAM was getting bigger --> caches get bigger --> less disk
I/O is for reads. But we can't avoid writes. Conclusion: optimize for
writes, which is precisely what LFS does.

--moreover, if write order predicts read order (which it often does),
then even read performance will be fast

B. Finding data

[DRAW PICTURE]

--Need to maintain an index in RAM and on the disk: a map from inode
numbers to inodes

    --called the inode map: just an array of pointers to inodes
    (conceptually)

    --the inode map is periodically dumped to disk

C. Crash recovery

--(aside: no free-block list or bitmap! simplifies crash recovery.)

--Checkpoints!

    --what happens on a checkpoint? (Sect. 4.1)

        --write out everything that is modified to the log: file data
        blocks, indirect blocks, inodes, blocks of the inode map, and
        the segment usage table

        --write to the checkpoint region: pointers to the inode map
        blocks and the segment usage table, plus the current time and
        a pointer to the last segment written

    --what happens on recovery?

        --read in the checkpoint

        --replay the log from the checkpoint

    --how do you find the most recent checkpoint?

        --there are two fixed and well-known checkpoint regions

        --use the one with the most recent time

    --the checkpoint region contains the information needed to
    reconstruct the inode map and the segment usage table

        --note: the inode map is big and may take more than one disk
        location; blocks of the inode map may be strewn throughout the
        disk

    --the checkpoint region also contains a pointer to the last
    segment written

--recovery:

    --read in the inode map and the segment usage table

        --at this point, we're mostly in business. we have most of the
        file system!!!

    --but we need to get the part of the file system that was written
    after the checkpoint and before the crash

    --this is called *roll-forward*. it works as follows in this
    context, but the idea is broader than LFS.

        --start at the last segment written and scan forward from that
        segment

        --read each segment's summary block, and use it to figure out
        what information in the segment is live

        --usage of the current segment will increase (it started at 0
        because we're talking about segments written *after* the
        checkpoint)

        --usage of other segments will decrease. why? (because the new
        inodes point to new file data blocks, thereby implicitly
        invalidating the old ones and turning them into free space)

        --stop when we arrive at the last inode written in the log

            --we know that we're at the last inode written because we
            look at every segment's segment summary block, and if the
            summary block indicates that the segment is old, we're
            done. (the segment summary block has a checksum and a
            timestamp, which is enough.)

    --recovery wrinkle #1: what about partial segment writes? (we
    can't assume that an entire segment's worth of data is written
    every time)

        --the backwards hack, which ensures one seek per partial write

    --recovery wrinkle #2: directories and inodes may not be
    consistent:

        --a directory entry could point to an inode, but the inode was
        never written

        --an inode could be written with a too-high link count (i.e.,
        the directory entry was not yet written)

        --so what's the plan for fixing this?

            --log directory operations:

                [ "create" | "link" | "rename"
                  dir inode #, position in dir
                  name of file
                  file inode #
                  new ref count ]

            --and make sure that these operations appear in the log
            before the new directory block or new inode

            --for "link" and "rename", the directory operation gives
            all the info needed to actually complete the operation

            --for "create", this is not the case. if the "create" is
            in the log but the new inode isn't, then the directory
            entry is removed on roll-forward
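--[aside: a minimal C sketch of the recovery path above, for
concreteness. the struct layouts, field names, and helpers
(read_checkpoint_region, load_inode_map, segment_is_newer,
apply_segment_summary, next_segment) are made-up illustrations, not
the actual Sprite LFS code; the helpers are left as prototypes
standing in for disk I/O, and a real recovery would also replay the
directory-operation log described above.]

    #include <stdbool.h>
    #include <stdint.h>

    /* assumed layout: two copies of this live at fixed, well-known
       disk addresses */
    struct checkpoint_region {
        uint64_t timestamp;            /* pick the newer of the two */
        uint64_t last_segment;         /* last segment written at
                                          checkpoint time */
        uint64_t imap_blocks[64];      /* disk addresses of inode-map
                                          blocks */
        uint64_t seg_usage_blocks[16]; /* disk addresses of the
                                          segment usage table */
    };

    /* in-memory inode map: inode number -> disk address of its inode */
    struct inode_map {
        uint64_t *inode_addr;
        uint32_t  n_inodes;
    };

    /* hypothetical helpers; each would do disk I/O in a real system */
    struct checkpoint_region read_checkpoint_region(int which);
    void load_inode_map(struct inode_map *imap,
                        const struct checkpoint_region *cr);
    /* checks the segment summary block's checksum and timestamp */
    bool segment_is_newer(uint64_t seg, uint64_t ckpt_time);
    void apply_segment_summary(struct inode_map *imap, uint64_t seg);
    uint64_t next_segment(uint64_t seg);

    void lfs_recover(struct inode_map *imap)
    {
        /* 1. find the most recent checkpoint: read both fixed regions
           and keep the one with the later timestamp */
        struct checkpoint_region a = read_checkpoint_region(0);
        struct checkpoint_region b = read_checkpoint_region(1);
        struct checkpoint_region cr =
            (a.timestamp > b.timestamp) ? a : b;

        /* 2. rebuild the inode map (and segment usage table) from the
           blocks the checkpoint region points at; at this point we
           have the file system as of the checkpoint */
        load_inode_map(imap, &cr);

        /* 3. roll forward: scan from the last segment the checkpoint
           knows about, folding newer inodes into the inode map; stop
           when a segment's summary block shows the segment is old */
        for (uint64_t seg = cr.last_segment;
             segment_is_newer(seg, cr.timestamp);
             seg = next_segment(seg)) {
            apply_segment_summary(imap, seg);
        }
    }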
D. [next time] Garbage collection (cleaning)

--what if the log fills up? then we're in trouble. to avoid this, we
need to do *cleaning*.

--approach: basically, compress the log, and leave some free space on
the disk

--use segments: contiguous regions of the log (1 MB in their
implementation)

--two data structures they maintain:

    --in memory: segment usage table:
        [ <# free bytes> <most recent modification time> ]

    --on disk, per segment: segment summary (a table indexed by entry
    in the segment, so the first entry in the table gives info about
    the first entry in the segment):
        [ <file #> <block # within file> ]

--okay, which segments should we clean? and do you want the
utilization to be high or low when you clean?

    --observe: if the utilization of a segment is 0 (as indicated by
    the segment usage table), cleaning it is really easy: just
    overwrite the segment!

    --if utilization is very low, that's a good sign: clean that
    segment (rewriting it is very little work)

    --but what if utilization is high? can we clean the segment?

    --insight: yes! provided what? (provided that the segment
    generally has a lot of "cold", that is, unchanging, data). the
    insight is that:

        --because the segment is cold, it's never going to get to a
        low utilization

        --at the same time, because it's cold, you're not wasting work
        by compressing it. the segment will "stay compressed".

--they analyze bang for the buck: how long after compressing will the
data stick around to justify the work we did to clean and compress it?

    --(benefit/cost) = (1-u)*age/(1+u)

    --cost: 1+u (1 to read the segment in, u to write the live data
    back)

    --benefit: "1-u" is the fraction of blocks we're taking back,
    times "age". The "age" is an estimate of how long the compacted
    blocks will stay compact. It may seem counter-intuitive to
    multiply these together, but this particular metric is just a
    rough guide anyway. The idea is that there are two factors that
    matter: how long the compacted blocks will stay compact (captured
    by age) and how many blocks we actually got by compacting the
    segment (captured by 1-u). The notion that we "keep" the blocks
    for their "age" (as stated in the paper) isn't really right in
    literal terms, because we don't know anything about what will
    happen once the blocks are pressed back into service. However,
    this metric captures the idea that it's worthwhile to compact old
    data, even if it's in a segment that is highly utilized, because
    it will be a while before the old blocks need to be re-compacted.
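--[aside: a small, self-contained C sketch of the selection policy.
only the formula benefit/cost = (1-u)*age/(1+u) comes from the paper;
the struct, names, and numbers below are made up for illustration.]

    #include <stdio.h>
    #include <stddef.h>

    #define SEG_SIZE (1 << 20)    /* 1 MB segments, as in the paper */

    struct seg_usage {
        size_t live_bytes;        /* from the segment usage table */
        double age;               /* time since most recent write to
                                     the segment */
    };

    /* benefit/cost = (space reclaimed * how long it stays reclaimed)
                      / (cost: 1 to read the segment in, u to write
                         the live data back out) */
    static double clean_score(const struct seg_usage *s)
    {
        double u = (double)s->live_bytes / SEG_SIZE;  /* utilization */
        return (1.0 - u) * s->age / (1.0 + u);
    }

    /* clean the segment with the highest benefit/cost ratio */
    static size_t pick_segment(const struct seg_usage *table, size_t n)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (clean_score(&table[i]) > clean_score(&table[best]))
                best = i;
        return best;
    }

    int main(void)
    {
        struct seg_usage table[] = {
            { .live_bytes = 100 << 10, .age = 5.0   },  /* hot, nearly empty */
            { .live_bytes = 900 << 10, .age = 500.0 },  /* cold, mostly full */
            { .live_bytes = 600 << 10, .age = 50.0  },
        };
        size_t n = sizeof(table) / sizeof(table[0]);
        size_t best = pick_segment(table, n);
        printf("clean segment %zu (benefit/cost %.1f)\n",
               best, clean_score(&table[best]));
        return 0;
    }

--[with these made-up numbers, the cold, ~90%-utilized segment scores
higher than the hot, nearly empty one: exactly the point above, that
old data is worth compacting even at high utilization.]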
E. Discussion

--you should understand the authors' points in the last paragraph of
section 5.1

--where does LFS fall down?

    --if the write workload is random access and the read workload is
    sequential, performance will be terrible: there will be lots of
    disk seeks.

        --see the rightmost set of bars in figure 9

--are people using LFS today? why not?

    --not in its proposed form

    --but journaling is everywhere (for example, the ext3 file system
    on Linux)

    --people might not be using it for any number of reasons:

        --the added performance isn't needed

        --performance is so bad in pathological cases that people
        don't want to use it