Class 18
CS372H 29 March 2012

On the board
------------

1. Last time

2. LFS

---------------------------------------------------------------------------

1. Last time

--disk performance

--file systems: discussed the data structures

--today: performance and crash recovery

2. LFS

   A. Intro
   B. Finding data
   C. Crash recovery
   D. [next time] Garbage collection (cleaning)
   E. Discussion

A. Intro

[DRAW PICTURE]

--Idea: write data only once by having the log be the only copy on disk

    --> as you modify blocks in a file, just store them out on disk in
    the log

    --> this goes for everything: data blocks in a file, data blocks in
    a directory (of course), file metadata, etc.

    --> So writes seem pretty easy

    --> What about reads?

        --as long as the inode points to the right disk blocks, there
        is no problem with putting disk blocks in the log

        --of course, if a file block is updated, the inode has to be
        updated to point to a new disk block, but no problem: just
        write a new inode to the log!

    --> Raises the question: how do we find inodes? (see below)

--Performance characteristics:

    --all writes are sequential!

    --no seeks except for reads!

--why we would want to build a file system like this: as we discussed
last time, RAM was getting bigger --> caches get bigger --> less disk
I/O is for reads. But we can't avoid writes. Conclusion: optimize for
writes, which is precisely what LFS does.

--moreover, if write order predicts read order (which it often does),
then even read performance will be fast

B. Finding data

[DRAW PICTURE]

--Need to maintain an index in RAM and on the disk: a map from inode
numbers to inodes

    --called the inode map: just an array of pointers to inodes
    (conceptually)

    --the inode map is periodically dumped to disk

C. Crash recovery

--(aside: no free-block list or bitmap! simplifies crash recovery.)

--Checkpoints!

    --what happens on a checkpoint? (Sect. 4.1)

        --write out everything that is modified to the log: file data
        blocks, indirect blocks, inodes, blocks of the inode map, and
        the segment usage table

        --write to the checkpoint region: pointers to the inode map
        blocks and the segment usage table, plus the current time and
        a pointer to the last segment written

    --what happens on recovery?

        --read in the checkpoint

        --replay the log from the checkpoint

    --how do you find the most recent checkpoint?

        --there are two fixed and well-known checkpoint regions

        --use the one with the most recent time

    --the checkpoint region contains the information needed to
    reconstruct the inode map and the segment usage table

        --note: the inode map is big and may take more than one disk
        location; blocks of the inode map may be strewn throughout the
        disk

    --the checkpoint region also contains a pointer to the last
    segment written

--recovery:

    --read in the inode map and the segment usage table

        --at this point, we're mostly in business. we have most of the
        file system!!!

    --but we need to get the part of the file system that was written
    after the checkpoint and before the crash

    --this is called *roll-forward*. it works as follows in this
    context, but the idea is broader than LFS.

        --start at the last segment written and scan forward from that
        segment

        --read each segment's summary block, and use it to figure out
        what information in the segment is live

        --usage of the current segment will increase (it started at 0
        because we're talking about segments written *after* the
        checkpoint)

        --usage of other segments will decrease. why? (because the new
        inodes point to new file data blocks, thereby implicitly
        invalidating the old ones and turning them into free space)

        --stop when we arrive at the last inode written in the log

            --we know that we're at the last inode written because we
            look at every segment's segment summary block, and if the
            summary block indicates that the segment is old, we're
            done. (the segment summary block has a checksum and a
            timestamp, which is enough.)

    --recovery wrinkle #1: what about partial segment writes? (we
    can't assume that an entire segment's worth of data is written
    every time)

        --the backwards hack, which ensures one seek per partial write

    --recovery wrinkle #2: directories and inodes may not be
    consistent:

        --a directory entry could point to an inode, but the inode was
        never written

        --an inode could be written with a too-high link count (i.e.,
        the directory entry was not yet written)

        --so what's the plan for fixing this?

            --log directory operations:

                [ "create" | "link" | "rename"
                  dir inode #, position in dir
                  name of file
                  file inode #
                  new ref count ]

            --and make sure that these operations appear in the log
            before the new directory block or new inode

            --for "link" and "rename", the directory operation gives
            all the info needed to actually complete the operation

            --for "create", this is not the case. if the "create" is
            in the log but the new inode isn't, then the directory
            entry is removed on roll-forward
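--[aside: a minimal C sketch of the recovery path above, for
concreteness. the struct layouts, field names, and helpers
(read_checkpoint_region, load_inode_map, segment_is_newer,
apply_segment_summary, next_segment) are made-up illustrations, not
the actual Sprite LFS code; the helpers are left as prototypes
standing in for disk I/O, and a real recovery would also replay the
directory-operation log described above.]

    #include <stdbool.h>
    #include <stdint.h>

    /* assumed layout: two copies of this live at fixed, well-known
       disk addresses */
    struct checkpoint_region {
        uint64_t timestamp;            /* pick the newer of the two */
        uint64_t last_segment;         /* last segment written at
                                          checkpoint time */
        uint64_t imap_blocks[64];      /* disk addresses of inode-map
                                          blocks */
        uint64_t seg_usage_blocks[16]; /* disk addresses of the
                                          segment usage table */
    };

    /* in-memory inode map: inode number -> disk address of its inode */
    struct inode_map {
        uint64_t *inode_addr;
        uint32_t  n_inodes;
    };

    /* hypothetical helpers; each would do disk I/O in a real system */
    struct checkpoint_region read_checkpoint_region(int which);
    void load_inode_map(struct inode_map *imap,
                        const struct checkpoint_region *cr);
    /* checks the segment summary block's checksum and timestamp */
    bool segment_is_newer(uint64_t seg, uint64_t ckpt_time);
    void apply_segment_summary(struct inode_map *imap, uint64_t seg);
    uint64_t next_segment(uint64_t seg);

    void lfs_recover(struct inode_map *imap)
    {
        /* 1. find the most recent checkpoint: read both fixed regions
           and keep the one with the later timestamp */
        struct checkpoint_region a = read_checkpoint_region(0);
        struct checkpoint_region b = read_checkpoint_region(1);
        struct checkpoint_region cr =
            (a.timestamp > b.timestamp) ? a : b;

        /* 2. rebuild the inode map (and segment usage table) from the
           blocks the checkpoint region points at; at this point we
           have the file system as of the checkpoint */
        load_inode_map(imap, &cr);

        /* 3. roll forward: scan from the last segment the checkpoint
           knows about, folding newer inodes into the inode map; stop
           when a segment's summary block shows the segment is old */
        for (uint64_t seg = cr.last_segment;
             segment_is_newer(seg, cr.timestamp);
             seg = next_segment(seg)) {
            apply_segment_summary(imap, seg);
        }
    }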
D. [next time] Garbage collection (cleaning)

--what if the log fills up? then we're in trouble. to avoid this, we
need to do *cleaning*.

--approach: basically, compress the log, and leave some free space on
the disk

--use segments: contiguous regions of the log (1 MB in their
implementation)

--two data structures they maintain:

    --in memory: segment usage table:
        [ <# free bytes> <most recent modification time> ]

    --on disk, per segment: segment summary (a table indexed by entry
    in the segment, so the first entry in the table gives info about
    the first entry in the segment):
        [ <file #> <block # within file> ]

--okay, which segments should we clean? and do you want the
utilization to be high or low when you clean?

    --observe: if the utilization of a segment is 0 (as indicated by
    the segment usage table), cleaning it is really easy: just
    overwrite the segment!

    --if utilization is very low, that's a good sign: clean that
    segment (rewriting it is very little work)

    --but what if utilization is high? can we clean the segment?

    --insight: yes! provided what? (provided that the segment
    generally has a lot of "cold", that is, unchanging, data). the
    insight is that:

        --because the segment is cold, it's never going to get to a
        low utilization

        --at the same time, because it's cold, you're not wasting work
        by compressing it. the segment will "stay compressed".

--they analyze bang for the buck: how long after compressing will the
data stick around to justify the work we did to clean and compress it?

    --(benefit/cost) = (1-u)*age/(1+u)

    --cost: 1+u (1 to read the segment in, u to write the live data
    back)

    --benefit: "1-u" is the fraction of blocks we're taking back,
    times "age". The "age" is an estimate of how long the compacted
    blocks will stay compact. It may seem counter-intuitive to
    multiply these together, but this particular metric is just a
    rough guide anyway. The idea is that there are two factors that
    matter: how long the compacted blocks will stay compact (captured
    by age) and how many blocks we actually got by compacting the
    segment (captured by 1-u). The notion that we "keep" the blocks
    for their "age" (as stated in the paper) isn't really right in
    literal terms, because we don't know anything about what will
    happen once the blocks are pressed back into service. However,
    this metric captures the idea that it's worthwhile to compact old
    data, even if it's in a segment that is highly utilized, because
    it will be a while before the old blocks need to be re-compacted.
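--[aside: a small, self-contained C sketch of the selection policy.
only the formula benefit/cost = (1-u)*age/(1+u) comes from the paper;
the struct, names, and numbers below are made up for illustration.]

    #include <stdio.h>
    #include <stddef.h>

    #define SEG_SIZE (1 << 20)    /* 1 MB segments, as in the paper */

    struct seg_usage {
        size_t live_bytes;        /* from the segment usage table */
        double age;               /* time since most recent write to
                                     the segment */
    };

    /* benefit/cost = (space reclaimed * how long it stays reclaimed)
                      / (cost: 1 to read the segment in, u to write
                         the live data back out) */
    static double clean_score(const struct seg_usage *s)
    {
        double u = (double)s->live_bytes / SEG_SIZE;  /* utilization */
        return (1.0 - u) * s->age / (1.0 + u);
    }

    /* clean the segment with the highest benefit/cost ratio */
    static size_t pick_segment(const struct seg_usage *table, size_t n)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (clean_score(&table[i]) > clean_score(&table[best]))
                best = i;
        return best;
    }

    int main(void)
    {
        struct seg_usage table[] = {
            { .live_bytes = 100 << 10, .age = 5.0   },  /* hot, nearly empty */
            { .live_bytes = 900 << 10, .age = 500.0 },  /* cold, mostly full */
            { .live_bytes = 600 << 10, .age = 50.0  },
        };
        size_t n = sizeof(table) / sizeof(table[0]);
        size_t best = pick_segment(table, n);
        printf("clean segment %zu (benefit/cost %.1f)\n",
               best, clean_score(&table[best]));
        return 0;
    }

--[with these made-up numbers, the cold, ~90%-utilized segment scores
higher than the hot, nearly empty one: exactly the point above, that
old data is worth compacting even at high utilization.]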
E. Discussion

--you should understand the authors' points in the last paragraph of
section 5.1

--where does LFS fall down?

    --if the write workload is random access and the read workload is
    sequential, performance will be terrible: there will be lots of
    disk seeks.

        --see the rightmost set of bars in figure 9

--are people using LFS today? why not?

    --not in its proposed form

    --but journaling is everywhere (for example, the ext3 file system
    on Linux)

    --people might not be using it for any number of reasons:

        --the added performance isn't needed

        --performance is so bad in pathological cases that people
        don't want to use it