Class 19
CS 372H
1 April 2010

On the board
------------

1. file systems

    E. performance (case study: FFS)
    F. caching, mmap, etc.

2. crash recovery

    --ad-hoc
    --ordered updates
    --[next time] soft updates
    --[next time] journaling

3. [next time] LFS

---------------------------------------------------------------------------

0. clarification from last time

    --multiple inodes fit in a single disk block (so our picture from
    last time wasn't totally accurate)

    --looks like this: [DRAW PICTURE]

1. file systems, continued

E. FS performance (case study: FFS)

    Motivation:

    --the original Unix FS was simple and elegant ...

        [superblock | bookkeeping info | inodes | data blocks (512 bytes each)]

        superblock: specifies the number of blocks in the FS, counts of
        the max # of files, a pointer to the head of the free list, disk
        characteristics, and the info needed to get an inode from an
        inumber

    --and .... slowwwwwwww

        --blocks too small
            --file index too large
            --too many layers of mapping indirection
            --transfer rate low (they were getting one block at a time)

        --poor clustering of related objects
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning that things
              like "ls" and "grep foo *.c" were slowwwww

        --other problems:
            --14-character file names were the limit
            --can't atomically update a file in a crash-proof way

    FFS (the fast file system) fixes these problems to a degree.

    let's look at FFS:

        * make block size bigger
        * cluster related objects
        * bitmaps
        * reserve space

    (i) first thing: make block size bigger

        --okay, but then how do we avoid internal fragmentation, i.e.,
        lots of block wastage?

        --fragments:
            --FS uses a large block size (4KB or 8KB)
            --but allows large blocks to be chopped into small
              "fragments" (1024 or 2048 bytes)
            --fragments are used for little files and for the pieces at
              the ends of files:

              [PICTURE: a block divided into 1024-byte fragments;
               A = a 1024-byte fragment, B = 1024-byte fragments]

        allocation approach:
            --Allocate space when the user writes beyond the end of the file
            --Want the last block to be a fragment if it is not full-size
            --If it is already a fragment, it may contain space for the
              write -- done
            --Else, must deallocate any existing fragment and allocate a
              new one
            --If there are no appropriate free fragments, break up a full block

        [[--Bummer: slow if there are lots of small writes. to mitigate
        that, add a "blksize" field to the "stat" struct (this is the
        struct that applications get from the kernel that contains file
        metadata). then a library like stdio (or even the application)
        can learn the block size and buffer that much data (see the
        sketch just below). as a result, the application can avoid many
        fragment writes.]]
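        [[aside, not from the original notes: a minimal sketch of how an
        application (or a stdio-like library) might use the stat
        struct's st_blksize field to choose a buffer size, so that
        writes go to the FS in full-block units and fragment churn is
        avoided. the file name and record contents are made up, and
        error handling is abbreviated.]]

            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            #include <fcntl.h>
            #include <unistd.h>
            #include <sys/stat.h>

            int main(void)
            {
                int fd = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0) { perror("open"); return 1; }

                struct stat st;
                if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

                /* st_blksize is the FS's preferred I/O size (e.g., the FFS
                 * block size). */
                size_t bufsz = (size_t)st.st_blksize;
                char *buf = malloc(bufsz);
                if (!buf) return 1;

                /* pretend the application produces data in tiny pieces; we
                 * accumulate them in buf and only call write() once a full
                 * block's worth is ready. */
                size_t used = 0;
                const char *piece = "small record\n";
                for (int i = 0; i < 10000; i++) {
                    size_t len = strlen(piece);
                    if (used + len > bufsz) {   /* buffer full: flush a block */
                        if (write(fd, buf, used) < 0) { perror("write"); return 1; }
                        used = 0;
                    }
                    memcpy(buf + used, piece, len);
                    used += len;
                }
                if (used > 0 && write(fd, buf, used) < 0) { perror("write"); return 1; }

                free(buf);
                close(fd);
                return 0;
            }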
    (ii) next: cluster related objects

        "cylinder groups"

        --observe: can access any block in a cylinder without performing
        a seek. the next fastest place is an adjacent cylinder.

        --FS approach: put all related items (directories, inodes, data
        blocks, etc.) in the same cylinder group

        --ideally, unrelated stuff goes in a different cylinder group

        --what's the approach?

            --first, put sequential blocks in adjacent sectors

              (always a good idea; reason: if you access a block, you
              are probably going to access the next one)

            --next, put the inode in the same cylinder as the file's
              data, and perhaps next to it.

              (reason: if you look at the inode, you will probably look
              at the data too)

            --also, try to keep all of the inodes named by a given
              directory (remember, a directory is just a table of file
              names and inumbers) in the same cylinder group

              (reason: if you access one file *name* in a directory,
              there is a good chance you will access other file names in
              the directory too; think about "ls").

        --so what the heck does a cylinder group look like?

          [superblock | bookkeeping info | inodes | data blocks]

          superblock:
            --Contains the same info as above, plus:
            --cylinder group (CG) info
            --Some of the superblock is replicated (once per cylinder group)
            --Note that the superblock is replicated at shifting offsets,
              so that it spans multiple platters

          bookkeeping info:
            --block map: bitmap of available fragments
            --# of free inodes, blocks/frags, files, dirs

        block allocation:
            --Try to optimize for sequential access
            --If available, use a rotationally close block in the same
              cylinder
            --Otherwise, use a block in the same CG
            --If the CG is totally full, find another CG with quadratic
              hashing, i.e., if CG n is full, try n + 1^2, n + 2^2,
              n + 3^2, ... (mod # of CGs)
            --Otherwise, search all CGs for some free space

            ---------------------------------------------------------------
            | --but not going to write more than 1 MB per file per CG.
            |   why? .... see in a moment....
            ---------------------------------------------------------------

        Want to ensure there's space for related stuff. Approach:

            --Place different directories in different cylinder groups
            --keep a "free space reserve" in each group so we can
              allocate near existing things
            --when a file grows too big (1MB), send its remainder to a
              different cylinder group.
                --why? (answer: we don't want any one file in the CG to
                  get too big, because then the inodes in that CG would
                  have their data too far away).
                --why does this perform decently? answer: because we are
                  amortizing a 10 ms seek over a 1MB read. effective
                  bandwidth: 1MB/(10 ms + 1MB/(80MB/s)) = 44 MB/s. not bad.

    (iii) track free blocks with a bitmap (instead of a linked list)

        --Easier to find contiguous blocks.
        --Can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4KB disk blocks = 25,000,000 entries = ~3MB.
              not outrageous these days.
        --Good idea: keep reserves of free blocks. makes finding a close
          block easier
        --Allocate a block close to block x? (see the sketch just below)
            --Check for blocks near bmap[x/32]
            --If the disk is almost empty, will likely find one nearby
            --As the disk becomes full, the search becomes more expensive
              and less effective.
        --Trade space for time (search time, file access time)
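        [[aside, not from the original notes: a minimal sketch, in C, of
        the "check for blocks near bmap[x/32]" idea. this is an
        illustration, not actual FFS code; the names (find_near) and
        sizes are made up. the bitmap is an in-memory array of 32-bit
        words, one bit per block (1 = free).]]

            #include <stdint.h>

            #define NBLOCKS  (25 * 1000 * 1000)   /* e.g., 100GB / 4KB blocks  */
            #define NWORDS   ((NBLOCKS + 31) / 32)

            static uint32_t bmap[NWORDS];         /* 1 bit per block, 1 = free */

            /* return the number of a free block near block x, or -1 if the
             * bitmap shows nothing free.  scan outward from the word
             * containing x. */
            long find_near(long x)
            {
                long start = x / 32;
                for (long d = 0; d < (long)NWORDS; d++) {
                    /* alternate looking d words after and d words before start */
                    long cand[2] = { start + d, start - d };
                    for (int i = 0; i < 2; i++) {
                        long w = cand[i];
                        if (w < 0 || w >= (long)NWORDS)
                            continue;
                        if (bmap[w] == 0)         /* no free blocks in this word */
                            continue;
                        for (int b = 0; b < 32; b++)  /* find a set bit */
                            if (bmap[w] & (1u << b))
                                return w * 32 + b;
                    }
                }
                return -1;
            }

            /* tiny usage example */
            int main(void)
            {
                bmap[100000 / 32] |= 1u << (100000 % 32);  /* mark block 100000 free */
                return find_near(100001) == 100000 ? 0 : 1;
            }

        (as the notes say: when the disk is mostly empty this finds a set
        bit almost immediately near "start"; as the disk fills, the scan
        has to range farther from x and gets slower.)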
    (iv) keep a reserve (e.g., 10%) of the disk always free, ideally
    scattered across the disk

        --but don't tell users!! (df makes a full disk look 110% full)
        --with 10% free, can almost always find free blocks near where
          you want them

    --Performance improvements:
        --20-40% of disk bandwidth for large files
        --10-20x the original Unix file system!
        --better small-file performance (why?)

    --Is this the best we can do? No.

        --It's still block-based rather than extent-based
            [[--Extent-based:
                --name contiguous blocks with a single pointer and a
                  length (Linux ext2fs)]]

        --Meta-data writes happen synchronously
            --really hurts performance on small files
            --but how to make them asynchronous?
                --need write-ordering ("soft updates") or logging (LFS)
                --and/or play with semantics (/tmp file systems)

    --Usability improvements:
        --file names up to 255 characters
        --atomic rename() system call

    --Other performance hacks (beyond FFS)

        --Most obvious: a big file cache

        --fact: no rotation delay if you're reading the whole track.
            --how can we use this fact?

        --fact: transfer cost is negligible for small chunks
            --reading many nearby sectors costs roughly what reading one
              sector costs
            --how can we use this?

        --fact: if the transfer itself is huge, then seek + rotation
          times become negligible (!!)
            --how can we use this?

        --answer to all three: dump a lot of data at a time, and read a
          lot at a time
            --FFS accumulates data into 64KB clusters (but needs to do
              bookkeeping to track free clusters)
            --read ahead in 64KB clusters
            --why not just read/write 1 MB at a time?
                --(for writes: may not get data to disk often enough)
                --(for reads: may waste read bandwidth)

F. Caching, mmap, etc.

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is a fd?
        --indexes into a table
        --what's in the given entry in the table?
            --inumber!
            --inode, probably!
            --and per-open-file data (file position, etc.)

    --caching:

        --in order to make file system operations fast, the kernel
          maintains a *buffer cache* in memory

        --internally, all uses of

            ReadDisk(blockNum, readbuf)

          get replaced with:

            ReadDiskCache(blockNum, readbuf) {
                ptr = buffercache.get(blockNum);
                if (ptr) {
                    copy BLKSIZE bytes from ptr to readbuf
                } else {
                    newBuf = malloc(BLKSIZE);
                    ReadDisk(blockNum, newBuf);
                    buffercache.insert(blockNum, newBuf);
                    copy BLKSIZE bytes from newBuf to readbuf
                }
            }

    --memory-mapping files

        --syscall:

            void* mmap(void* addr, size_t len, int prot, int flags,
                       int fd, off_t offset);

        --map the specified open file (fd) into a region of my virtual
          memory (at addr, or at a kernel-selected place if addr is 0),
          and return a pointer to it (a usage sketch appears at the end
          of this section)

        --after this, loads and stores to addr[x] are equivalent to
          reading and writing the file at offset + x

        --how's this implemented?! (answer: through virtual memory. the
          VA is addr, or whatever the kernel selects; and the PA is...
          what? answer: the physical address of the page holding the
          file's data in the kernel's buffer cache.)

        --have to deal with eviction from the buffer cache, but this
          problem is not unique: in all operating systems besides JOS,
          the kernel designers *anyway* have to be able to invalidate
          VA-->PA mappings when a page is removed from RAM
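    [[aside, not from the original notes: a minimal usage sketch of
    mmap. it maps a file read-only and then accesses the file's bytes
    with ordinary loads instead of read() calls. the choice of file is
    just illustrative, and error handling is abbreviated.]]

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/stat.h>

        int main(void)
        {
            int fd = open("/etc/hosts", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) < 0 || st.st_size == 0) return 1;

            /* addr = 0: let the kernel pick the virtual address.  the pages
             * backing this mapping are the buffer-cache pages that hold the
             * file's data, per the discussion above. */
            char *p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            /* ordinary loads now stand in for read() calls */
            long lines = 0;
            for (off_t i = 0; i < st.st_size; i++)
                if (p[i] == '\n')
                    lines++;
            printf("%ld lines\n", lines);

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }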
2. Crash recovery

    * ad-hoc
    * ordered updates
    * soft updates
    * journaling

    there are a lot of data structures used to implement the file system
    (bitmap of free blocks, directories, inodes, indirect blocks, data
    blocks, etc.). they are going to have to be kept consistent. and
    we're going to need a cache. but it's hard to balance consistency
    and performance here.

    write-through vs. write-back caching:

        --*write through*: write changes immediately to disk. problem:
          slow! have to wait for each write to complete before going on

        --*write back*: delay writing modified data back to disk.
          problem: can lose data. another problem: updates can go to the
          disk in the wrong order

    [preview of transactions: even if we had write-through, that may not
    be enough. we probably want to do operations like <"delete a file"
    and "create a new file"> atomically. if there's a crash in between
    the "delete" and the "create", then write-through doesn't solve the
    problem.]

    if multiple updates are needed, do them in a specific order so that
    if a crash occurs, **fsck** can work.

    **crash recovery permeates FS code**
        --need to ensure that fsck can recover the file system

    A. First approach: ad-hoc

    --don't worry about data consistency
    --worry about metadata consistency. we will see below how this is
      accomplished

    (i) fsck

        fsck runs after a crash. it scans the entire disk, checking for
        internal consistency, and fixes up anything that was "in
        progress" at the time of the crash:

        --Summary info is usually bad after a crash
            --Scan to check the free block map and the block/inode counts
        --System may have corrupt inodes (not just from a simple crash)
            --Bad block numbers, cross-allocation, etc.
            --Do a sanity check, clear inodes containing garbage
        --Fields in inodes may be wrong
            --Count the number of directory entries to verify the link
              count; if there are no entries but the count != 0, move
              the file to lost+found
            --Make sure the size and used-data counts match the blocks
        --Directories may be bad
            --Holes are illegal, "." and ".." must be valid, ...
            --All directories must be reachable

    (ii) recall we said that the FS is permeated by code whose purpose
    is to ensure that fsck will work when it runs. what's required?

        --for one thing, we can't have all data written asynchronously.
        if all data were written asynchronously, we could encounter the
        following unacceptable scenarios:

            (a) delete/truncate a file, append to another file, crash
                --the new file may reuse a block from the old one
                --the old inode may not have been updated
                --cross-allocation!
                --often the inode with the older mtime is the wrong one,
                  but we can't be sure

            (b) append to a file, allocate an indirect block, crash
                --the inode points to the indirect block
                --but the indirect block may contain garbage

        --so what's the actual approach?

            --be careful about the order of updates. specifically:
                --Write the new inode to disk before the directory entry
                --Remove the directory name before deallocating the inode
                --Write the cleared inode to disk before updating the CG
                  free map

            --how is it implemented?
                --synchronous write-through for *metadata*
                --doing one metadata write at a time ensures ordering

            example: file create (a pseudocode sketch follows the crash
            cases below):
                --write data to the file
                --update the file header (inode)
                --mark the file header "allocated" in the bitmap
                --mark the file's blocks "allocated" in the bitmap
                --update the directory
                --(if the directory grew) mark the directory's new block
                  "allocated" in the bitmap

            now, the crash cases:
                --file header not in the bitmap --> the only writes were
                  to unallocated, unreachable blocks; the result is that
                  the write "disappears"
                --file header allocated, file blocks not in the bitmap
                  --> update the bitmap
                --file created, but not yet in any directory --> delete
                  the file (after all that!)
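            [[aside, not from the original notes: a small C sketch of
            that create ordering. the function names are invented; each
            sync_write_* stands for "issue the metadata write and wait
            for the disk to acknowledge it before moving on", which is
            what synchronous write-through means here.]]

                #include <stdio.h>

                /* stand-ins for synchronous disk writes of individual structures */
                static void sync_write_data_block(int blk)   { printf("data block %d\n", blk); }
                static void sync_write_inode(int ino)        { printf("inode %d\n", ino); }
                static void sync_write_inode_bitmap(int ino) { printf("inode bitmap bit %d\n", ino); }
                static void sync_write_block_bitmap(int blk) { printf("block bitmap bit %d\n", blk); }
                static void sync_write_dir_block(int blk)    { printf("directory block %d\n", blk); }

                /* create a one-block file: each step completes before the next
                 * begins, so a crash at any point leaves a state that fsck can
                 * repair (per the crash cases above) */
                void create_file(int ino, int data_blk, int dir_blk)
                {
                    sync_write_data_block(data_blk);    /* 1. file contents               */
                    sync_write_inode(ino);              /* 2. inode points at the data    */
                    sync_write_inode_bitmap(ino);       /* 3. inode marked allocated      */
                    sync_write_block_bitmap(data_blk);  /* 4. data block marked allocated */
                    sync_write_dir_block(dir_blk);      /* 5. name->inumber entry last    */
                }

                int main(void)
                {
                    create_file(/*ino=*/7, /*data_blk=*/1042, /*dir_blk=*/88);
                    return 0;
                }

            (a crash before step 3 leaves only unreachable structures, so
            the create just "disappears"; between 3 and 4, fsck updates
            the bitmap; between 4 and 5, fsck deletes the orphaned file.)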
    (iii) disadvantages to this ad-hoc approach:

        (a) need to get the ad-hoc reasoning exactly right

        (b) poor performance (synchronous writes)

            --multiple updates to the same block require that they be
              issued separately. for example, imagine two updates to the
              same directory block. the first must complete before the
              second is issued (otherwise, the writes aren't synchronous)

            --more generally, the cost of crash recoverability is
              enormous. (a job like "untar" could be 10-20x slower)

        (c) slow recovery: fsck must scan the entire disk

            --recovery gets slower as disks get bigger. if fsck takes
              one minute today, what happens when the disk gets 10 times
              bigger?

    [aside: why not use battery-backed RAM? answer:
        --Expensive (requires specialized hardware)
        --Often don't learn that the battery has died until it's too late
        --A pain if the computer dies (can't just move the disk)
        --If an OS bug causes the crash, the RAM might be garbage]

    remaining approaches: try to get better performance with the same
    consistency

    B. Approach: ordered updates

    --Follow three rules in ordering updates:

        (i) Never write a pointer before initializing the structure it
            points to
        (ii) Never reuse a resource before nullifying all pointers to it
        (iii) Never clear the last pointer to a live resource before
              setting the new one [e.g., rename]

    --If you do this, the file system will be recoverable ... and
      quickly!

        --Might leak free disk space, but otherwise correct
        --So start running right after reboot, and scavenge for space in
          the background

    --How to actually implement this?

        --Keep a partial order on buffered blocks (a data-structure
          sketch appears at the end of these notes)

        --Example: create file A
            --Block X contains an inode
            --Block Y contains a directory block
            --Create file A in inode block X, dir block Y
            --We say "Y-->X", pronounced "Y _depends on_ X"
                --Means Y cannot be written before X is written
                --X is called the _dependee_, Y the _depender_
            --Can delay both writes, so long as the order is preserved
            --Say you create a second file B in blocks X and Y
                --Only have to write each block out once

    --So what's the catch? Cyclic dependencies....

        --Suppose you create file A and unlink file B
            --Both files are in the same directory block & inode block
        --Can only write the directory after inode A is initialized
            --Otherwise, after a crash the directory would point to a
              bogus inode
            --Worse yet, the same inode # could be re-allocated
            --So we could end up with file name A naming an unrelated file
        --But can only write the inode block after the directory entry
          for B is cleared
            --Otherwise, we might be deleting a file while links to it
              still exist
            --That is, the link count would now be inaccurate

        [[--In principle, we could get around this cyclic issue by
        writing the inode block *before* B is cleared, but then fsck
        becomes super slow: it would have to check every directory entry
        and every inode link count]]

        [DRAW PICTURE TO ILLUSTRATE THE CYCLIC DEPENDENCY PROBLEM]

    --Summary: the problems with ordering updates naively are:
        --cyclic dependencies
        --a crash might occur between ordered but related writes
        --summary information is wrong after a block is freed
        --block aging
            --a block with a dependency will never get written back
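    [[aside, not from the original notes: a minimal data-structure
    sketch of "keep a partial order on buffered blocks". this is an
    illustration with invented names (struct buf, add_dependency,
    flush), not the real soft-updates machinery, and it deliberately
    does nothing about the cycle problem discussed above.]]

        #include <stdbool.h>
        #include <stdio.h>

        #define MAX_DEPS 8

        struct buf {
            int         blockno;            /* which disk block this caches   */
            bool        dirty;              /* needs to be written back?      */
            struct buf *deps[MAX_DEPS];     /* dependees: must hit disk first */
            int         ndeps;
        };

        /* record "b depends on d" (b is the depender, d the dependee):
         * b must not be written before d */
        void add_dependency(struct buf *b, struct buf *d)
        {
            if (b->ndeps < MAX_DEPS)
                b->deps[b->ndeps++] = d;
        }

        /* stand-in for a synchronous disk write */
        static void disk_write(struct buf *b)
        {
            printf("writing block %d\n", b->blockno);
            b->dirty = false;
        }

        /* flush b, first flushing everything it depends on.  a cycle in
         * the dependency graph would make this recurse forever -- which
         * is exactly the cyclic-dependency problem the notes point out */
        void flush(struct buf *b)
        {
            for (int i = 0; i < b->ndeps; i++)
                if (b->deps[i]->dirty)
                    flush(b->deps[i]);
            if (b->dirty)
                disk_write(b);
        }

        int main(void)
        {
            struct buf X = { .blockno = 10, .dirty = true };  /* inode block     */
            struct buf Y = { .blockno = 20, .dirty = true };  /* directory block */
            add_dependency(&Y, &X);  /* Y-->X: write the inode block first       */
            flush(&Y);               /* prints block 10, then block 20           */
            return 0;
        }

    (note how the "create file A, unlink file B" scenario above would
    force Y-->X and X-->Y at the same time, so flush() would never
    terminate; handling that is what soft updates, next time, is about.)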