Class 19
CS 372H  5 April 2011

On the board
------------
1. Last time
2. Crash recovery
3. LFS

---------------------------------------------------------------------------

1. Last time

    --file systems

    --didn't do mmap in class, but see notes

    --correction: I said that hard links mean that there needs to be
      garbage collection. This was imprecise (and wrong). Hard links mean
      that there needs to be *reference counting*. GC is needed only if
      there can be cycles in the directory graph (but we rule out those
      cycles in Unix by disallowing hard links to directories).

    --question #1: why are inode 0 and inode 1 reserved?

        --different file systems use them for different purposes, and
          some don't use them for any purpose (reserved can mean
          "reserved for future use"). Historically, inode 1 linked
          together all of the bad blocks; 0 may be reserved because it
          doubles as an error return code.

    --question #2: can a symlink contain a relative path?

        answer: yes, it can. The relative path is interpreted relative to
        the directory containing the symlink, that is, the directory
        through which the link was reached. One subtlety: the symlink's
        inode may itself have multiple hard links pointing to it. In
        other words, it may have multiple names, in different
        directories. In that case, the same relative path resolves to a
        different target depending on which name is used to reach the
        link.

2. Crash recovery

    --there are a lot of data structures used to implement the file
      system (bitmap of free blocks, directories, inodes, indirect
      blocks, data blocks, etc.)

    --requirement: crash anywhere, and the system can be recovered

    --options:

        --*write through*: write changes immediately to disk.
          problem: slow! have to wait for each write to complete before
          going on

        --*write back*: delay writing modified data back to disk.
          problem: can lose data.
          another problem: updates can go to the disk in the wrong order

    --if multiple updates are needed, do them in a specific order so
      that, if a crash occurs, **fsck** can work

        --see 4.4.3 in the text for an explanation of fsck

    --approaches to crash recovery:

        * Ad-hoc
        * Ordered updates
        * WAL (write-ahead logging) / journaling

A. Ad-hoc

    --can't have all data written asynchronously.
      If all data were written asynchronously, we could encounter the
      following unacceptable scenarios:

      (a) Delete/truncate a file, append to another file, crash

          --the new file may reuse a block from the old
          --the old inode may not be updated
          --cross-allocation!
          --often the inode with the older mtime is the wrong one, but we
            can't be sure

      (b) Append to a file, allocate an indirect block, crash

          --inode points to the indirect block
          --but the indirect block may contain garbage

    --so what's the actual approach?

        --be careful about the order of updates. specifically:
            --write the new inode to disk before the directory entry
            --remove the directory name before deallocating the inode
            --write the cleared inode to disk before updating the
              cylinder group free map

    --how is it implemented?

        --synchronous write-through for *metadata*
        --doing one metadata write at a time ensures ordering

        example: for file create:
            --write data to file
            --update/write inode
            --mark inode "allocated" in bitmap
            --mark data blocks "allocated" in bitmap
            --update directory
            --(if directory grew) mark new file block "allocated" in
              bitmap

        now, cases:
            --inode not marked allocated in bitmap --> the only writes
              were to unallocated, unreachable blocks; the result is that
              the write "disappears"
            --inode allocated, data blocks not marked allocated in bitmap
              --> fsck must update the bitmap
            --file created, but not yet in any directory --> fsck
              ultimately deletes the file (after all that!)

    disadvantages of this ad-hoc approach:

        (a) need to get the ad-hoc reasoning exactly right

        (b) poor performance (synchronous writes of metadata)
            --multiple updates to the same block must be issued
              separately. for example, two updates to the same directory
              block require the first to complete before the second is
              issued (otherwise, not synchronous)
            --more generally, the cost of crash recoverability is
              enormous (a job like "untar" could be 10-20x slower)

        (c) slow recovery: fsck must scan the entire disk
            --recovery gets slower as disks get bigger. if fsck takes one
              minute today, what happens when the disk gets 10 times
              bigger?
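The file-create cases above can be sketched as a toy simulation (not real FFS code; the field names and step order are illustrative assumptions): crash after any prefix of the synchronous writes, and check that fsck always has an answer.

```python
# Toy model of the ad-hoc approach: metadata writes for "create file" go
# to disk synchronously, in a fixed order, so a crash after any prefix
# leaves a state that fsck can repair. All names here are illustrative.

CREATE_STEPS = [
    "data_written",    # write file data
    "inode_written",   # write the new inode
    "inode_bitmap",    # mark inode "allocated" in bitmap
    "data_bitmap",     # mark data blocks "allocated" in bitmap
    "dir_entry",       # update the directory
]

def crash_during_create(steps_completed):
    """Disk state if we crash after `steps_completed` synchronous writes."""
    return {s: i < steps_completed for i, s in enumerate(CREATE_STEPS)}

def fsck(disk):
    """What recovery concludes from the on-disk state (the cases above)."""
    if not disk["inode_bitmap"]:
        # only unallocated, unreachable blocks were touched:
        # the half-finished create simply "disappears"
        return "nothing to do"
    if not disk["data_bitmap"]:
        return "fix bitmap"          # inode allocated, data blocks not marked
    if not disk["dir_entry"]:
        return "delete orphan file"  # file reachable from no directory
    return "consistent"

if __name__ == "__main__":
    for k in range(len(CREATE_STEPS) + 1):
        print(f"crash after {k} writes -> fsck: {fsck(crash_during_create(k))}")
```

The point of the fixed order is visible in `fsck`: each branch only ever sees states the ordering allows, so recovery never has to guess.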
    [aside: why not use battery-backed RAM?
     answer:
        --expensive (requires specialized hardware)
        --often don't learn the battery has died until too late
        --a pain if the computer dies (can't just move the disk)
        --if an OS bug causes the crash, the RAM might be garbage]

B. Approach: ordered updates

    --reason carefully about the precise order in which asynchronous
      writes should go to disk

    --advantages
        --performance
        --fsck is very fast and can run in the background, since all it
          needs to do is fix up bookkeeping

    --limitations
        --hard to get right
        --arguably ad-hoc: very specific to FFS data structures. unclear
          how to apply the approach to FSes that use data structures like
          B+-trees
        --metadata updates can happen out of order (for example: create
          A, create B, crash.... it might be that only B exists after the
          crash!)

    --to see this approach in action:
        [G. R. Ganger, M. K. McKusick, C. A. N. Soules, and Y. N. Patt.
        Soft Updates: A Solution to the Metadata Update Problem in File
        Systems. ACM Trans. on Computer Systems, Vol. 18, No. 2, May
        2000, pp. 127-153.
        http://portal.acm.org/citation.cfm?id=350853.350863]

C. Journaling

    --reserve a portion of the disk for a **write-ahead log**

    --write any metadata operation first to the log, then to the disk

    --after crash/reboot, re-play the log (efficient)
        --may re-do an already committed change, but won't miss anything

    --performance advantage:
        --the log is a consecutive portion of the disk
        --multiple log writes are very fast (at disk bandwidth)
        --consider updates committed when they are written to the log

    --example: delete a directory tree
        --record all freed blocks and changed directory entries in the
          log
        --return control to the user
        --write out changed directories, bitmaps, etc. in the background
          (sort for good disk arm scheduling)

    --on recovery, must do three things:
        i.   find the oldest relevant log entry
        ii.  find the end of the log
        iii. read and replay the committed portion of the log

    i. find the oldest relevant log entry

        --otherwise, it would be redundant and slow to replay the whole
          log
        --idea: checkpoints!
          (this idea is used throughout systems)

        --once all records up to log entry N have been processed, and
          once all affected blocks are stably committed to disk ...
        --record N to disk, either in a reserved checkpoint location or
          in a checkpoint log record
        --never need to go back before the most recently checkpointed N

    ii. find the end of the log

        --the log is typically a circular buffer, so look at sequence
          numbers
        --can include begin-transaction/end-transaction records
            --but then need to make sure that "end transaction" gets to
              the disk only after all the other disk blocks in the
              transaction are on disk
            --but the disk can reorder requests, and then the system
              crashes
            --to avoid that, need a separate disk write for "end
              transaction", which is a performance hit
            --to avoid *that*, use checksums: a log entry is committed
              when all of its disk blocks match its checksum value

    iii. not much to say: read and replay!

    --logs are key: they enable atomic complex operations. to see this,
      we'll take a slight detour.....

    [can skip, since the same points come up again under transactions]

    detour: some file systems (for example, XFS from SGI) use a B+-tree
    data structure. a few quick words about B+-trees:

        --key-value map
        --ordering defined on keys (where is the nearest key?)
        --data stored in blocks, so explicitly designed for efficient
          disk access
        --with n items stored, all operations are O(log n):
            --retrieve the pair closest to target key k
            --insert a new pair
            --delete a pair
        --see any algorithms book (e.g., Cormen et al.) for details
        --**complex to implement**

    --wait, why are we mentioning B+-trees? because some file systems use
      them for:

        --efficient implementation of large directories
            (map key = hash(filename) to value = inode #)
        --efficient implementation of inodes
            --instead of using FFS-style fixed block pointers, map:
              file offset (key) --> {start block, # blocks} (value)
            --if a file consists of a small number of extents (i.e.,
              segments), then inodes are small, even for large files
        --efficient implementation of the map from inode # to inode.
            map: inode # --> {block #, # of consecutive inodes in use}
            [bonus: allows a fast way to identify a free inode!]

    --some B+-tree operations require multiple disk writes. intermediate
      states are incorrect. what happens if there's a crash in the
      middle? the B+-tree could be in an inconsistent state

    --journaling is a big help here
        --first write all changes to the log ("insert k,v", "delete k",
          etc.)
        --if we crash while writing the log, the incomplete log record
          will be discarded, and no change is made
        --otherwise, if we crash while updating the B+-tree, recovery
          will replay the entire log record and write everything

    --limitations of journaling
        --fsync() syncs *all* operations' metadata to the log

    --write-ahead logging is everywhere
        --what's the problem? (all data is written twice, in the worst
          case)
        --(aside: it's less of a problem to write data twice if you have
          two disks. a common way to make systems fast: use multiple
          disks. then it's easier to avoid seeks)
        --the log started as a way to help with consistency, but now the
          log is authoritative, so do we actually need the *other* copy
          of the data? what if everything is just stored in the log????

        Transition to Log-structured file system (LFS)

D. Summary of crash recovery

    --three viable approaches to crash recovery in file systems:

        (i) ad-hoc
            --worry about metadata consistency, not data consistency
            --accomplish metadata consistency by being careful about the
              order of updates
            --write metadata synchronously

        (ii) ordered updates (soft updates), which is in OpenBSD
            --worry about metadata consistency
            --leads to great performance: metadata doesn't have to be
              written synchronously (writes just have to obey a partial
              order)

        (iii) journaling (the approach in most Linux file systems)
            --more flexible
            --easier to reason about
            --possibly worse performance

3. Log-structured file system (LFS)

    A. Intro
    B. Finding data
    C. Crash recovery
    D. [next time] Garbage collection (cleaning)
    E. [next time] Discussion

A.
   Intro

    [DRAW PICTURE]

    --idea: write data only once, by having the log be the only copy on
      disk
        --> as you modify blocks in a file, just store them out on disk
            in the log
        --> this goes for everything: data blocks in a file, data blocks
            in a directory (of course), file metadata, etc.
        --> so writes seem pretty easy
        --> what about reads?
            --as long as the inode points to the right disk blocks, it's
              no problem to put disk blocks in the log
            --of course, if a file block is updated, the inode has to be
              updated to point to the new disk block, but no problem:
              just write a new inode to the log!
        --> raises the question: how do we find inodes? (see below)

    --performance characteristics:
        --all writes are sequential!
        --no seeks except for reads!

    --why we would want to build a file system like this: as we discussed
      a few classes ago, RAM is getting bigger --> caches are getting
      bigger --> less disk I/O is for reads. But we can't avoid writes.
      Conclusion: optimize for writes, which is precisely what LFS does.

    --moreover, if write order predicts read order (which it often does),
      then even read performance will be fast

B. Finding data

    [DRAW PICTURE]

    --need to maintain an index in RAM and on disk: a map from inode
      numbers to inodes
        --called the inode map: just an array of pointers to inodes
          (conceptually)
        --the inode map is periodically dumped to disk

C. Crash recovery

    --(aside: no free-block list or bitmap! simplifies crash recovery.)

    --checkpoints!

    --what happens on a checkpoint?
        --write out everything that is modified to the log
            --file data blocks, indirect blocks, inodes, blocks of the
              inode map, and the segment usage table
        --write to the checkpoint region:
            --pointers to the inode map blocks and segment usage table,
              plus the current time and a pointer to the last segment
              written

    --what happens on recovery?
        --read in the checkpoint
        --replay the log from the checkpoint

    --how do you find the most recent checkpoint?
        --there are two checkpoint regions
        --use the one with the most recent time
        --then, given the info (inode map, etc.), you're in business: you
          have the file system!!!

    --a few wrinkles:

        --what do you have to do when replaying the log? (basically just
          update the in-memory data structures, which are the inode map
          and the segment usage table.)

        --what's the plan?
            --when replaying a segment:
                --look at the segment summary
                --use it to figure out what information in the segment is
                  live
                --update the inode map and segment usage table
                  accordingly
                    --usage of the current segment will increase (it
                      started at 0, because we're talking about segments
                      written *after* the checkpoint)
                    --usage of other segments will decrease. why?
                      (because the new inodes point to new file data
                      blocks, thereby implicitly invalidating the old
                      ones, causing them to become free space)

        --directories and inodes may not be consistent:
            --a dir entry could point to an inode, but the inode was
              never written
            --an inode could be written with a too-high link count (i.e.,
              the dir entry was not yet written)
        --so what's the plan for fixing this?
            --log directory operations:
                [ "create"|"link"|"rename"
                  dir inode #, position in dir
                  name of file, file inode #
                  new ref count ]
            --and make sure that these operations appear in the log
              before the new directory block or new inode
            --for "link" and "rename", the dir op gives all the info
              needed to actually complete the operation
            --for "create", this is not the case. if "create" is in the
              log but the new inode isn't, then the directory entry is
              removed on roll-forward
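The core of the recovery sequence above (pick the newer of the two checkpoint regions, then roll forward through segments written after it) can be sketched as a toy model. The data structures here are simplified assumptions, not the real on-disk formats: a checkpoint is a dict with a timestamp and an inode map, and a segment summary is a list of (inode #, new inode address) pairs.

```python
# Sketch of LFS recovery: choose a checkpoint, then roll forward.
# "Addresses" are just strings like "seg3+7" for illustration.

def choose_checkpoint(region_a, region_b):
    """LFS alternates between two checkpoint regions; use the newer one."""
    return region_a if region_a["time"] >= region_b["time"] else region_b

def roll_forward(inode_map, summaries_after_checkpoint):
    """Replay segment summaries in order. Each rewritten inode supersedes
    the old copy, so pointing the inode map at the new address implicitly
    frees whatever the old inode referenced (no free list needed)."""
    for summary in summaries_after_checkpoint:
        for inum, new_addr in summary:
            inode_map[inum] = new_addr
    return inode_map

# Two (hypothetical) checkpoint regions; B is more recent.
ckpt_a = {"time": 100, "inode_map": {1: "seg0+4"}}
ckpt_b = {"time": 250, "inode_map": {1: "seg3+7", 2: "seg3+9"}}

ckpt = choose_checkpoint(ckpt_a, ckpt_b)

# Two segments were written after the checkpoint: their summaries say
# inode 2 was rewritten and inode 5 was created.
recovered = roll_forward(dict(ckpt["inode_map"]),
                         [[(2, "seg4+1")], [(5, "seg5+2")]])
# recovered now maps each live inode number to its newest address
```

This omits the segment usage table updates and the directory-operation log described above; it only shows why roll-forward can be so simple: the inode map is the single point of truth, so replay is just a series of pointer updates.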