Class 20
CS 372H  6 April 2010

On the board
------------
1. crash recovery
    --ad-hoc (last time)
    --ordered updates (last time)
    --soft updates (see notes)
    --journaling (today)
2. LFS (today)
3. Transactions (next time)

---------------------------------------------------------------------------

0. last time

    --corrections:
        --block size might be larger than 2-4KB. configurable. these days,
          the default might be as big as 16KB
        --disk block writes: blocks are big, but not written atomically.
          doesn't hurt, given the guarantees that these systems are
          providing.
        --directory blocks, however, need to be smaller

    --three viable approaches to crash recovery in file systems:

        (i) ad-hoc
            --worry about metadata consistency, not data consistency
            --accomplish metadata consistency by being careful about the
              order of updates
            --write metadata synchronously

        (ii) soft updates (the approach in OpenBSD)
            --evolution of the strawman presented last time
            --again, worry about metadata consistency
            --leads to great performance: metadata doesn't have to be
              written synchronously (writes just have to obey a partial
              order)
            --more on soft updates below, but we won't discuss in detail

        (iii) journaling (the approach in most Linux file systems)
            --more flexible
            --easier to reason about
            --possibly worse performance
            --discuss now

1. crash recovery

C. Approach: soft updates
   [http://portal.acm.org/citation.cfm?id=350853.350863]

    recall how we got here: we didn't want ad-hoc crash recovery, so we
    followed a principled approach to which file system blocks make it to
    disk in which order ("ordered updates"). but this strawman had a few
    problems, chief among them the cyclic dependency problem. this motivates
    soft updates.

    --summary:
        --Write blocks in any order
        --But keep track of dependencies
        --When writing a block, temporarily roll back any changes you can't
          yet commit to disk
            --that is, a given disk block may contain multiple logical
              updates, but when writing the block, write only the updates
              whose dependencies are already on the disk
            --in essence, you may wind up writing the same disk blocks
              several times

    --high-level approach:
        --for each updated field or pointer, maintain a structure that
          contains:
            --old value
            --new value
            --list of updates on which this update depends
          (a small sketch of such a structure appears at the end of this
          subsection)
        --now, can write disk blocks in any order, but....
        --have to temporarily undo updates with pending dependencies

    --fsck in this regime
        --very quick to get the FS consistent (just make sure per-cylinder
          summary info makes sense)
        --may take a bit of time to identify leaked disk blocks and inodes,
          but that can be done in the background
        --compared to traditional FFS fsck:
            --can have lots of inodes with non-zero link counts
            --they don't all belong in lost+found

    --limitations of soft updates:
        --arguably ad-hoc: very specific to FFS data structures. unclear how
          to apply the approach to FSes that use data structures like
          B+-trees
        --metadata updates can happen out of order (for example: create A,
          create B, crash.... it might be that only B exists after the
          crash!)
        --fsck not totally dispensed with (but runs in the background)
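    --to make the "old value / new value / dependency list" bookkeeping
      concrete, here is a minimal C sketch of one per-field update record.
      the names and layout are hypothetical, for illustration only; this is
      not the actual BSD soft updates code, which tracks dependencies per
      inode, per directory entry, per bitmap block, etc.:

        /* hypothetical soft-updates bookkeeping for a single updated field */
        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        struct update {
            void *field;              /* location of the field in the cached block */
            size_t len;               /* size of the field */
            unsigned char old_val[8]; /* value to roll back to if we can't commit */
            unsigned char new_val[8]; /* value we eventually want on disk */
            bool on_disk;             /* has this update reached the disk? */
            struct update **deps;     /* updates that must reach disk first */
            size_t ndeps;
        };

        /* an update may be written only if everything it depends on is on disk */
        static bool can_commit(const struct update *u)
        {
            for (size_t i = 0; i < u->ndeps; i++)
                if (!u->deps[i]->on_disk)
                    return false;
            return true;
        }

        /* before writing the block containing u, expose the new value only if
           its dependencies are satisfied; otherwise roll back to the old value */
        static void prepare_field_for_write(struct update *u)
        {
            memcpy(u->field, can_commit(u) ? u->new_val : u->old_val, u->len);
        }

        int main(void)
        {
            uint32_t inode_link_count = 1;   /* a field in some cached inode block */
            struct update bump = { .field = &inode_link_count,
                                   .len = sizeof inode_link_count };
            memcpy(bump.old_val, &(uint32_t){1}, sizeof(uint32_t));
            memcpy(bump.new_val, &(uint32_t){2}, sizeof(uint32_t));
            /* no dependencies recorded, so the new value can go out as-is */
            prepare_field_for_write(&bump);
            printf("link count written to disk: %u\n", inode_link_count);  /* 2 */
            return 0;
        }

      the point is just that each block write consults the dependency
      information and may temporarily expose old values, which is why the
      same block can end up being written several times.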
D. journaling

    --Reserve a portion of the disk for a **write-ahead log**
    --Write any metadata operation first to the log, then to the disk
    --After crash/reboot, re-play the log (efficient)
        --May re-do an already committed change, but won't miss anything
    --Performance advantage:
        --Log is a consecutive portion of the disk
        --Multiple log writes are very fast (at disk bandwidth)
        --Consider updates committed when written to the log
    --Example: delete directory tree
        --Record all freed blocks and changed directory entries in the log
        --Return control to user
        --Write out changed directories, bitmaps, etc. in the background
          (sort for good disk arm scheduling)

    --On recovery, must do three things:
        i.   find oldest relevant log entry
        ii.  find end of log
        iii. read and replay committed portion of log

        i. find oldest relevant log entry
            --Otherwise, redundant and slow to replay the whole log
            --Idea: checkpoints! (this idea is used throughout systems)
                --Once all records up to log entry N have been processed,
                  and once all affected blocks are stably committed to
                  disk...
                --Record N to disk, either in a reserved checkpoint location
                  or in a checkpoint log record
                --Never need to go back before the most recent
                  checkpointed N

        ii. find end of log
            --Typically a circular buffer, so look at sequence numbers
            --Can include begin-transaction/end-transaction records
                --but then need to make sure that "end transaction" only
                  gets to the disk after all other disk blocks in the
                  transaction are on disk
                --but the disk can reorder requests, and then the system
                  crashes
                --to avoid that, need a separate disk write for "end
                  transaction", which is a performance hit
                --to avoid that, use checksums: a log entry is committed
                  when all of its disk blocks match its checksum value

        iii. not much to say: read and replay!
             (a small end-to-end sketch of the replay loop appears at the
             end of this section, just before LFS)

    --Logs are key: they enable atomic complex operations. to see this,
      we'll take a slight detour.....

      [can skip, since the same points come up again under transactions]

      detour: some file systems (for example, XFS from SGI) use a B+-tree
      data structure. a few quick words about B+-trees:
        --key-value map
        --ordering defined on keys (where is nearest key?)
        --data stored in blocks, so explicitly designed for efficient disk
          access
        --with n items stored, all operations are O(log n):
            --retrieve closest to target key k
            --insert new pair
            --delete pair
        --see any algorithms book (e.g., Cormen et al.) for details
        --**complex to implement**

      --wait, why are we mentioning B+-trees? because some file systems use
        them:
          --efficient implementation of large directories (map key =
            hash(filename) to value = inode #)
          --efficient implementation of inodes
              --instead of using FFS-style fixed block pointers, map:
                file offset (key) --> {start block, # blocks} (value)
              --if a file consists of a small number of extents (i.e.,
                segments), then inodes are small, even for large files
          --efficient implementation of the map from inode # to inode. map:
            inode # --> {block #, # of consecutive inodes in use}
            [bonus: allows a fast way to identify a free inode!]

      --some B+-tree operations require multiple disk operations.
        intermediate states are incorrect. what happens if there's a crash
        in the middle? the B+-tree could be in an inconsistent state
      --journaling is a big help here
          --First write all changes to the log ("insert k,v", "delete k",
            etc.)
          --If crash while writing the log, the incomplete log record will
            be discarded, and no change made
          --Otherwise, if crash while updating the B+-tree, will replay the
            entire log record and write everything

    --limitations of journaling
        --fsync() syncs *all* operations' metadata to the log

    --write-ahead logging is everywhere
        --what's the problem? (all data is written twice, in the worst case)
        --(aside: less of a problem to write data twice if you have two
          disks. common way to make systems fast: use multiple disks. then
          it is easier to avoid seeks)
    --the log started as a way to help with consistency, but now the log is
      authoritative, so do we actually need the *other* copy of the data?
      what if everything is just stored in the log????
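    --before moving on to LFS, here is a minimal, self-contained C sketch of
      the replay loop from recovery steps (i)-(iii) above, using
      checksum-based commit. the record layout, the toy checksum, and the
      in-memory "log" and "disk" are all invented for illustration; no real
      journal looks like this:

        #include <stdint.h>
        #include <stdio.h>

        #define LOG_LEN   8   /* circular log, in records  */
        #define BLK_WORDS 4   /* toy "block" size, in words */

        struct log_record {
            uint64_t seqno;             /* increasing sequence number        */
            uint64_t home;              /* "disk address" the block lives at */
            uint64_t data[BLK_WORDS];   /* the block image itself            */
            uint64_t checksum;          /* commit marker: checksum over data */
        };

        static uint64_t checksum(const struct log_record *r) {
            uint64_t s = r->seqno ^ r->home;
            for (int i = 0; i < BLK_WORDS; i++)
                s += r->data[i] * 1000003u;
            return s;
        }

        static uint64_t disk[16][BLK_WORDS];  /* home locations replay writes to */

        static void recover(const struct log_record *log, uint64_t checkpoint_seq) {
            /* i.   oldest relevant entry = first record after the checkpoint
               ii.  end of log = first record whose seqno or checksum is wrong
               iii. replay (re-do) every committed record; re-doing is harmless */
            for (uint64_t seq = checkpoint_seq + 1;; seq++) {
                const struct log_record *r = &log[seq % LOG_LEN];
                if (r->seqno != seq || r->checksum != checksum(r))
                    break;                          /* uncommitted tail: stop */
                for (int i = 0; i < BLK_WORDS; i++)
                    disk[r->home][i] = r->data[i];  /* write block to its home */
            }
        }

        int main(void) {
            struct log_record log[LOG_LEN] = {0};
            /* two committed records after checkpoint 10, then a torn record 13 */
            for (uint64_t seq = 11; seq <= 12; seq++) {
                struct log_record *r = &log[seq % LOG_LEN];
                r->seqno = seq;
                r->home  = seq - 10;
                for (int i = 0; i < BLK_WORDS; i++)
                    r->data[i] = seq * 100 + i;
                r->checksum = checksum(r);
            }
            log[13 % LOG_LEN].seqno = 13;   /* crash mid-write: checksum missing */
            recover(log, 10);
            printf("disk[2][0] after recovery = %llu\n",
                   (unsigned long long)disk[2][0]);  /* 1200: record 12 replayed */
            return 0;
        }

      note how the checksum plays the role of the "end transaction" record:
      a torn record simply fails the checksum check and is ignored, with no
      extra synchronous disk write needed.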
2. Log-structured file system (LFS)

    A. Intro
    B. Finding data
    C. Garbage collection (cleaning)
    D. Crash recovery
    E. Discussion

A. Intro

    [DRAW PICTURE]

    --Idea: write data only once by having the log be the only copy on disk
        --> as you modify blocks in a file, just store them out on disk in
            the log
        --> this goes for everything: data blocks in a file, data blocks in
            a directory (of course), file metadata, etc.
        --> So writes seem pretty easy
        --> What about reads?
            --as long as the inode points to the right disk blocks, no
              problem to put disk blocks in the log
            --of course, if a file block is updated, the inode has to be
              updated to point to a new disk block, but no problem: just
              write a new inode to the log!
        --> Raises the question: how do we find inodes? (see below)

    --Performance characteristics:
        --all writes are sequential!
        --no seeks except for reads!

    --why we would want to build a file system like this: as we discussed a
      few classes ago, RAM is getting bigger --> caches are getting bigger
      --> less disk I/O is for reads. But we can't avoid writes. Conclusion:
      optimize for writes, which is precisely what LFS does.
        --moreover, if write order predicts read order (which it often
          does), then even read performance will be fast

B. Finding data

    [DRAW PICTURE]

    --Need to maintain an index in RAM and on the disk: a map from inode
      numbers to inodes
        --called the inode map: just an array of pointers to inodes
          (conceptually)
        --inode map periodically dumped to disk

C. Garbage collection (cleaning)

    --what if the log fills up? then we're in trouble. to avoid this, need
      to do *cleaning*.
    --approach: basically, compress the log, and leave some free space on
      the disk
    --use segments: contiguous regions of the log (1 MB in their
      implementation)
    --two data structures they maintain:
        --in-memory: segment usage table:
            [ <# free bytes> ]
        --on disk, per-segment: segment summary (probably a table indexed by
          entry in the segment, so the first entry in the table gives info
          about the first entry in the segment)
            [ ]

    --okay, which segments should we clean? and do you want the utilization
      to be high or low when you clean?
        --observe: if the utilization of a segment is 0 (as indicated by the
          segment usage table), cleaning it is really easy: just overwrite
          the segment!
        --if utilization is very low, that's a good sign: clean that segment
          (rewriting it is very little work)
        --but what if utilization is high? can we clean the segment?
        --insight: yes! provided what? (provided that the segment generally
          has a lot of "cold", that is, unchanging, data). the insight is
          that:
            --because the segment is cold, it's never going to get to a low
              utilization
            --at the same time, because it's cold, you're not wasting work
              by compressing it. the segment will "stay compressed".
        --they analyze bang for the buck: how long after compressing will
          the data stick around, to justify the work we did to clean and
          compress it?

            (benefit/cost) = (1-u)*age/(1+u)

            --cost: 1+u (1 to read in, u to write back)
            --benefit: "1-u" is the space we're taking back, times "age",
              namely how long those blocks will stay compact
            --point being: it is worthwhile to compact old data because it
              will be a while before those blocks are returned to the system
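    --to make the selection policy concrete, here is a small C sketch that
      ranks segments by (1-u)*age/(1+u). the structure names and fields are
      made up for illustration; this is not Sprite LFS's actual segment
      usage table:

        #include <stdio.h>
        #include <time.h>

        struct segment_usage {
            double utilization;   /* u: fraction of the segment that is live */
            time_t last_modified; /* proxy for the "age" of its live data    */
        };

        /* benefit/cost = (space reclaimed * how long it stays free) / (read + write) */
        static double benefit_over_cost(const struct segment_usage *s, time_t now)
        {
            double u   = s->utilization;
            double age = difftime(now, s->last_modified);
            if (u >= 1.0)
                return 0.0;               /* fully live: nothing to reclaim */
            return (1.0 - u) * age / (1.0 + u);
        }

        int main(void)
        {
            time_t now = time(NULL);
            struct segment_usage hot  = { 0.50, now - 60 };    /* half full, hot   */
            struct segment_usage cold = { 0.75, now - 86400 }; /* fuller, but cold */

            printf("hot : %.1f\n", benefit_over_cost(&hot, now));
            printf("cold: %.1f\n", benefit_over_cost(&cold, now));
            /* the cold segment wins despite its higher utilization: its blocks
               will "stay compressed", so the cleaning work pays off for longer */
            return 0;
        }

      the cleaner would compute this ratio for each segment (from the
      segment usage table) and clean the segments with the highest values
      first.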
D. Crash recovery

    --(aside: no free-block list or bitmap! simplifies crash recovery.)
    --Checkpoints!
        --Replay the log from the checkpoint
    --what happens on checkpoint?
        --write out everything to the log
            --file data blocks, indirect blocks, inodes, blocks of the inode
              map, and the segment usage table
        --write to the checkpoint region:
            --pointers to the inode map blocks and segment usage table, plus
              the current time and a pointer to the last segment written
    --how do you find the most recent checkpoint?
        --there are two checkpoint regions
        --use the one with the most recent time
        --then given the info (inode map, etc.), you're in business: you
          have the file system!!!
    --what happens on recovery?
        --read in the checkpoint
        --replay the log from the checkpoint
        (a small sketch of this recovery path appears at the end of these
        notes)
    --a few wrinkles:
        --what do they have to do when replaying the log? (basically just
          update the in-memory data structures, which are the inode map and
          the segment usage table.)
        --what's the plan?
            --when replaying a segment:
                --look at the segment summary table
                --use it to figure out what information in the segment is
                  live
                --update the inode map and segment usage table accordingly
                --usage of the current segment will increase (it started at
                  0 because we're talking about segments *after* the
                  checkpoint was written)
                --usage of other segments will decrease. why? (because the
                  new inodes point to new file data blocks, thereby
                  implicitly invalidating the old ones, causing them to
                  become free space)
        --directories and inodes may not be consistent:
            --a dir entry could point to an inode, but the inode not yet
              written
            --an inode could be written with a too-high link count (i.e.,
              the dir entry not yet written)
            --so what's the plan for fixing this?
                --log directory operations:
                    [ "create"|"link"|"rename"
                      dir inode #, position in dir
                      name of file, file inode #
                      new ref count ]
                --and make sure that these operations appear before the new
                  directory block or new inode
                --for "link" and "rename", the dir op gives all the info
                  needed to actually complete the operation
                --for "create", this is not the case. if "create" is in the
                  log, but the new inode isn't, then the directory entry is
                  removed on roll-forward

E. Discussion

    --crucial for you to understand the authors' points in the last
      paragraph of section 5.1
    --where does LFS fall down?
        --If the write workload is random access and the read workload is
          sequential, performance will be terrible. there will be lots of
          disk seeks.
        --see the rightmost set of bars in figure 9
    --are people using LFS today? why not?
        --not using it in its proposed form
        --but journaling is everywhere (for example, the ext3 file system on
          Linux).
        --people might not be using it for any number of reasons:
            --added performance not needed
            --performance so bad in pathological cases that people don't
              want to use it

[note: in LFS, the log *is* the file system. now going to briefly describe
transactional systems, which in some ways are closer to journaling. the log
is the authoritative record, but it's not the copy of the data used for read
accesses.]
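    --(appendix to section D: a minimal C sketch of the start of the LFS
      recovery path: pick the newer of the two checkpoint regions, then roll
      the log forward from there. all names, fields, and the stubbed-out
      roll-forward are hypothetical; this is not Sprite LFS's on-disk
      format.)

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct checkpoint {
            uint64_t timestamp;        /* when this checkpoint region was written   */
            uint64_t inode_map_block;  /* where the inode map blocks live           */
            uint64_t seg_usage_block;  /* where the segment usage table lives       */
            uint64_t last_segment;     /* last segment written before checkpointing */
            bool     valid;            /* e.g., the region's own checksum verified  */
        };

        /* two fixed checkpoint regions: use whichever valid one is newer */
        static const struct checkpoint *pick_checkpoint(const struct checkpoint cr[2])
        {
            if (!cr[0].valid) return cr[1].valid ? &cr[1] : NULL;
            if (!cr[1].valid) return &cr[0];
            return cr[0].timestamp >= cr[1].timestamp ? &cr[0] : &cr[1];
        }

        int main(void)
        {
            struct checkpoint cr[2] = {
                { .timestamp = 1000, .last_segment = 40, .valid = true },
                { .timestamp = 1005, .last_segment = 42, .valid = true },
            };
            const struct checkpoint *cp = pick_checkpoint(cr);
            if (cp == NULL)
                return 1;   /* neither region usable */

            printf("recover from checkpoint written at time %llu\n",
                   (unsigned long long)cp->timestamp);
            /* roll-forward (not shown): read the inode map and segment usage
               table that cp points to, then walk the segments written after
               cp->last_segment, using each segment summary to find live inodes
               and blocks, updating the in-memory inode map and segment usage
               table, and applying logged directory operations whose inodes
               actually made it to disk */
            return 0;
        }

      (having two regions means a crash while writing one checkpoint still
      leaves the other one usable.)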