Class 18
CS 202
14 April 2015

On the board
------------

1. Last time

2. File systems: crash recovery
    --intro
    --ad-hoc
    --ordered updates
    --journaling

---------------------------------------------------------------------------

1. Last time

--finished directories

--looked at file system performance

--today: look at file system persistence and crash recovery (and a bit more
  about performance)

    theme: we'll see two ways in which we gain from "treating the disk like
    tape"

--next time: revisit some of these concepts in the context of a different
  kind of system (database management systems, transactions, etc.)

2. crash recovery

--There are a lot of data structures used to implement the file system:
  bitmap of free blocks, directories, inodes, indirect blocks, data blocks,
  etc.

--We want these data structures to be *consistent*: we want invariants to
  hold

--Thorny issue: *crashes*

--Two things make the problem worse: (a) write-back caching and (b)
  non-ordered disk writes. (a) means the OS delays writing back modified
  disk blocks. (b) means that the modified disk blocks can go to the disk
  in an unspecified order.

--Example: [DRAW PICTURE]

    INODE
    DATA BLOCK ADDED
    DATA BITMAP UPDATED

    crash. restart. uh-oh.

--Solution: the system requires a notion of atomicity

--How to think about this stuff: imagine that a crash can happen at any
  time. (The only thing that happens truly atomically is a write of a
  512-byte disk sector.) So you want to arrange for the world to look sane,
  regardless of where a crash happens.

    --> Your leverage, as file system designer, is that you can arrange for
        some disk writes to happen *synchronously* (meaning that the system
        won't do anything until these disk writes complete), and you can
        impose some ordering on the actual writes to the disk.

--So we need to arrange for higher-level operations ("add data to file") to
  _look_ atomic: an update either occurs or it doesn't.

--Potentially useful analogy: during our concurrency unit, we had to worry
  about arbitrary interleavings (which we then tamed with concurrency
  primitives). Here, we have to worry that a crash can happen at any time
  (and we will tame this with abstractions like transactions). The response
  in both cases is a notion of atomicity.

--We will mention three approaches to crash recovery in file systems:

    A. Ad-hoc (the book calls this "fsck")
    B. Ordered (soft) updates
    C. Journaling (also known as write-ahead logging)

A. Ad-hoc

    --Goal: metadata consistency, not data consistency (rationale: too
      expensive to provide data consistency; cannot live without metadata
      consistency.)

    --Approach: arrange to send file system updates to the disk in such a
      way that, if there is a crash, **fsck** can clean up inconsistencies

    --example: for file create:
        --write data to file
        --then update/write inode
        --then mark inode "allocated" in bitmap
        --then mark data blocks "allocated" in bitmap
        --then update directory
        --(if directory grew) mark new file block (for directory)
          "allocated" in bitmap

      now, cases:

        inode not marked allocated in bitmap --> the only writes were to
        unallocated, unreachable blocks; the result is that the write
        "disappears"

        inode allocated, data blocks not marked allocated in bitmap -->
        fsck must update the bitmap

        file created, but not yet in any directory --> fsck ultimately
        deletes the file (after all that!)

    Disadvantages to this ad-hoc approach:

        (a) need to get the ad-hoc reasoning exactly right

        (b) poor performance (synchronous writes of metadata)

            --multiple updates to the same block require that they be
              issued separately. for example, imagine two updates to the
              same directory block: the first must complete before the
              second is issued (otherwise, the writes are not synchronous)

            --more generally, the cost of crash recoverability is enormous.
              (a job like "untar" could be 10-20x slower)

        (c) slow recovery: fsck must scan the entire disk

            --recovery gets slower as disks get bigger. if fsck takes one
              minute, what happens when the disk gets 10 times bigger?

            --essentially, fsck has to scan the entire disk
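    To make the fsck cases above concrete, here is a minimal sketch of the
    kind of metadata pass fsck makes: rebuild the block bitmap from the
    reachable inodes and free inodes that no directory references. The
    in-memory layout (struct inode, the arrays, the field names) is invented
    for illustration; the real fsck reads these structures off the disk and
    handles far more cases.

    /* sketch of an fsck-style metadata pass; layout is invented */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NINODES 16
    #define NBLOCKS 64
    #define NDIRECT  4

    struct inode {
        bool allocated;          /* is this inode in use?                  */
        bool reachable;          /* does some directory entry point to it? */
        int  blocks[NDIRECT];    /* direct block pointers; -1 = unused     */
    };

    static struct inode inodes[NINODES];
    static bool block_bitmap[NBLOCKS];   /* true = block marked allocated */

    void fsck_metadata_pass(void)
    {
        bool correct[NBLOCKS];
        memset(correct, 0, sizeof correct);

        for (int i = 0; i < NINODES; i++) {
            if (!inodes[i].allocated)
                continue;
            if (!inodes[i].reachable) {
                /* "created, but not yet in any directory": delete the file */
                printf("inode %d unreachable; freeing it\n", i);
                inodes[i].allocated = false;
                continue;
            }
            /* every block a live inode points to must be marked allocated */
            for (int j = 0; j < NDIRECT; j++)
                if (inodes[i].blocks[j] >= 0)
                    correct[inodes[i].blocks[j]] = true;
        }

        /* fix up the bitmap so it agrees with the reachable inodes */
        for (int b = 0; b < NBLOCKS; b++) {
            if (block_bitmap[b] != correct[b])
                printf("block %d: bitmap said %d, should be %d; fixing\n",
                       b, block_bitmap[b], correct[b]);
            block_bitmap[b] = correct[b];
        }
    }

    int main(void)
    {
        /* fabricate one inconsistent state: inode 3 points at block 9,
           but the crash happened before the bitmap write landed */
        inodes[3] = (struct inode){ .allocated = true, .reachable = true,
                                    .blocks = { 9, -1, -1, -1 } };
        fsck_metadata_pass();
        return 0;
    }

    Note how the fix-up only works because the creation order above
    guarantees which direction the inconsistency can point (inode written
    before bitmap); that is exactly the "ad-hoc reasoning" that must be
    gotten exactly right.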
B. Ordered updates

    --could reason carefully about the precise order in which asynchronous
      writes should go to disk

    --advantages

        --performance

        --fsck is very fast and can run in the background, since all it
          needs to do is fix up bookkeeping

    --limitations

        --hard to get right

        --arguably ad-hoc: very specific to FFS data structures. unclear
          how to apply the approach to FSes that use data structures like
          B+-trees

        --metadata updates can happen out of order (for example: create A,
          create B, crash.... it might be that only B exists after the
          crash!)

    --to see this approach in action:

        [G. R. Ganger, M. K. McKusick, C. A. N. Soules, and Y. N. Patt.
        Soft Updates: A Solution to the Metadata Update Problem in File
        Systems. ACM Trans. on Computer Systems, Vol. 18, No. 2, May 2000,
        pp. 127-153. http://portal.acm.org/citation.cfm?id=350853.350863]

C. Journaling

    Golden rule of atomicity, per Saltzer-Kaashoek: "never modify the only
    copy"

    The use of journaling in file systems is borrowed from databases.
    Journaling provides a kind of transaction. In the next class, we will
    study transactional systems directly.

    --Reserve a portion of the disk for a **write-ahead log**

    --Write any metadata operation first to the log, then to the disk

    --After crash/reboot, re-play the log (efficient)

        --May re-do an already committed change, but won't miss anything

    --Performance advantage:

        --Log is a consecutive portion of the disk

        --Multiple log writes are very fast (at disk bandwidth)

        --Consider updates committed when written to the log

    --Example: delete directory tree

        --Record all freed blocks and changed directory entries in the log

        --Return control to user

        --Write out changed directories, bitmaps, etc. in the background
          (sort for good disk arm scheduling)

    --On recovery, must do three things:

        i.   find the oldest relevant log entry
        ii.  find the end of the log
        iii. read and replay the committed portion of the log

        i. find the oldest relevant log entry

            --Otherwise, it is redundant and slow to replay the whole log

            --Idea: checkpoints! (this idea is used throughout systems)

                --Once all records up to log entry N have been processed,
                  and once all affected blocks are stably committed to
                  disk ...

                --Record N to disk, either in a reserved checkpoint
                  location or in a checkpoint log record

                --Never need to go back before the most recent
                  checkpointed N

        ii. find the end of the log

            --Typically a circular buffer, so look at sequence numbers

            --Can include begin-transaction/commit-transaction records

                --but then need to make sure that "commit transaction"
                  only gets to the disk after all other updates in the
                  transaction are in the log on disk

                --but the disk can reorder requests, and then the system
                  crashes

                --to avoid that, need a separate disk write for "commit
                  transaction", which is a performance hit

                --to avoid that, use checksums: a log entry is committed
                  when all of its disk blocks match its checksum value

        iii. not much to say: read and replay!

    --write-ahead logging is everywhere

    --what's the problem? (all data is written twice, in the worst case)

        --(aside: less of a problem to write data twice if you have two
          disks. common way to make systems fast: use multiple disks. then
          it's easier to avoid seeks)
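    To make the log mechanics concrete, here is a minimal sketch of
    write-ahead logging with a separate commit record. The file name
    "journal.log", the record format, and the printf stand-ins for in-place
    block writes are all invented for illustration; real journaling file
    systems (ext4 and friends) lay out their journals quite differently.

    /* sketch: redo-only write-ahead log with a commit record */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    struct logrec {
        uint64_t block;        /* which "disk block" this update is for  */
        char     data[64];     /* new contents of that block (redo info) */
        uint32_t committed;    /* 1 only in the commit record            */
    };

    /* append one record and force it to stable storage before returning:
       the write-ahead rule */
    static int log_append(int logfd, const struct logrec *r)
    {
        if (write(logfd, r, sizeof *r) != sizeof *r)
            return -1;
        return fsync(logfd);
    }

    /* commit a two-block update: record bodies first, then the commit
       record; only afterward may the in-place copies be written lazily */
    int update_two_blocks(const char *logpath)
    {
        int logfd = open(logpath, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (logfd < 0)
            return -1;

        struct logrec r1 = { .block = 7,  .committed = 0 };
        struct logrec r2 = { .block = 12, .committed = 0 };
        struct logrec commit = { .block = 0, .committed = 1 };
        strcpy(r1.data, "new contents of block 7");
        strcpy(r2.data, "new contents of block 12");

        if (log_append(logfd, &r1) || log_append(logfd, &r2) ||
            log_append(logfd, &commit)) {      /* <-- the commit point */
            close(logfd);
            return -1;
        }
        close(logfd);
        /* in-place writes can now happen in the background; if we crash
           before they land, recovery replays the log */
        return 0;
    }

    /* recovery: replay only records that are followed by a commit record;
       re-doing an already applied record is harmless (idempotent) */
    void recover(const char *logpath)
    {
        int logfd = open(logpath, O_RDONLY);
        if (logfd < 0)
            return;

        struct logrec pending[16], r;
        int npending = 0;

        while (read(logfd, &r, sizeof r) == sizeof r) {
            if (!r.committed) {
                if (npending < 16)
                    pending[npending++] = r;   /* not yet known committed */
                continue;
            }
            for (int i = 0; i < npending; i++)
                printf("redo: write block %llu <- \"%s\"\n",
                       (unsigned long long)pending[i].block,
                       pending[i].data);
            npending = 0;
        }
        /* records after the last commit record are simply ignored */
        close(logfd);
    }

    int main(void)
    {
        update_two_blocks("journal.log");   /* normal operation */
        recover("journal.log");             /* what reboot would do */
        return 0;
    }

    Note that forcing the commit record with its own fsync is exactly the
    "separate disk write for commit transaction" performance hit mentioned
    above; the checksum trick is the way around it.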
D. Summarize crash recovery

    --three viable approaches to crash recovery in file systems:

        (i) ad-hoc

            --worry about metadata consistency, not data consistency

            --accomplish metadata consistency by being careful about the
              order of updates

            --write metadata synchronously

            --fsck cleans up on restart

        (ii) ordered updates (soft updates), which is in OpenBSD

            --worry about metadata consistency

            --leads to great performance: metadata doesn't have to be
              written synchronously (writes just have to obey a partial
              order)

            --fsck again cleans up on restart, but its job is simpler and
              faster

        (iii) journaling (the approach in most Linux file systems)

            --more flexible

            --easier to reason about

            --possibly worse performance

            --on restart, no need for fsck, just log recovery (although
              this is logically a kind of fsck)

    --the log started as a way to help with consistency, but now the log is
      authoritative, so do we actually need the *other* copy of the data?
      what if everything is just stored in the log????

        Transition to Log-structured file system (LFS)