Class 19
CS 372H
1 April 2010

On the board
------------

1. file systems

    E. performance (case study: FFS)
    F. caching, mmap, etc.

2. crash recovery

    --ad-hoc
    --ordered updates
    --[next time] soft updates
    --[next time] journaling

3. [next time] LFS

---------------------------------------------------------------------------

0. clarification from last time

    --multiple inodes fit in a single disk block (so our picture from
    last time wasn't totally accurate)

    --looks like this: [DRAW PICTURE]

1. file systems, continued

E. FS performance (case study: FFS)

    Motivation:

    --the original Unix FS was simple and elegant ...

        [superblock | bookkeeping info | inodes | data blocks (512 bytes each)]

        superblock: specifies the number of blocks in the FS, counts of
        the max # of files, a pointer to the head of the free list, disk
        characteristics, and the info needed to get an inode from an
        inumber

    --and .... slowwwwwwww

        --blocks too small
            --file index too large
            --too many layers of mapping indirection
            --transfer rate low (they were getting one block at a time)

        --poor clustering of related objects
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning that things
              like "ls" and "grep foo *.c" were slowwwww

        --other problems:
            --14-character file names were the limit
            --can't atomically update a file in a crash-proof way

    FFS (the fast file system) fixes these problems to a degree.

    let's look at FFS:

        * make block size bigger
        * cluster related objects
        * bitmaps
        * reserve space

    (i) first thing: make block size bigger

        --okay, but then how do we avoid internal fragmentation, i.e.,
        lots of block wastage?

        --fragments:
            --FS uses a large block size (4KB or 8KB)
            --but allows large blocks to be chopped into small
              "fragments" (1024 or 2048 bytes)
            --fragments are used for little files and for the pieces at
              the ends of files:

              [PICTURE: a block divided into 1024-byte fragments;
               A = a 1024-byte fragment, B = 1024-byte fragments]

        allocation approach:
            --Allocate space when the user writes beyond the end of the file
            --Want the last block to be a fragment if it is not full-size
            --If it is already a fragment, it may contain space for the
              write -- done
            --Else, must deallocate any existing fragment and allocate a
              new one
            --If there are no appropriate free fragments, break up a full block

        [[--Bummer: slow if there are lots of small writes. to mitigate
        that, add a "blksize" field to the "stat" struct (this is the
        struct that applications get from the kernel that contains file
        metadata). then a library like stdio (or even the application)
        can learn the block size and buffer that much data (see the
        sketch just below). as a result, the application can avoid many
        fragment writes.]]
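        [[aside, not from the original notes: a minimal sketch of how an
        application (or a stdio-like library) might use the stat
        struct's st_blksize field to choose a buffer size, so that
        writes go to the FS in full-block units and fragment churn is
        avoided. the file name and record contents are made up, and
        error handling is abbreviated.]]

            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            #include <fcntl.h>
            #include <unistd.h>
            #include <sys/stat.h>

            int main(void)
            {
                int fd = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0) { perror("open"); return 1; }

                struct stat st;
                if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

                /* st_blksize is the FS's preferred I/O size (e.g., the FFS
                 * block size). */
                size_t bufsz = (size_t)st.st_blksize;
                char *buf = malloc(bufsz);
                if (!buf) return 1;

                /* pretend the application produces data in tiny pieces; we
                 * accumulate them in buf and only call write() once a full
                 * block's worth is ready. */
                size_t used = 0;
                const char *piece = "small record\n";
                for (int i = 0; i < 10000; i++) {
                    size_t len = strlen(piece);
                    if (used + len > bufsz) {   /* buffer full: flush a block */
                        if (write(fd, buf, used) < 0) { perror("write"); return 1; }
                        used = 0;
                    }
                    memcpy(buf + used, piece, len);
                    used += len;
                }
                if (used > 0 && write(fd, buf, used) < 0) { perror("write"); return 1; }

                free(buf);
                close(fd);
                return 0;
            }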
    (ii) next: cluster related objects

        "cylinder groups"

        --observe: can access any block in a cylinder without performing
        a seek. the next fastest place is an adjacent cylinder.

        --FS approach: put all related items (directories, inodes, data
        blocks, etc.) in the same cylinder group

        --ideally, unrelated stuff goes in a different cylinder group

        --what's the approach?

            --first, put sequential blocks in adjacent sectors

              (always a good idea; reason: if you access a block, you
              are probably going to access the next one)

            --next, put the inode in the same cylinder as the file's
              data, and perhaps next to it.

              (reason: if you look at the inode, you will probably look
              at the data too)

            --also, try to keep all of the inodes named by a given
              directory (remember, a directory is just a table of file
              names and inumbers) in the same cylinder group

              (reason: if you access one file *name* in a directory,
              there is a good chance you will access other file names in
              the directory too; think about "ls").

        --so what the heck does a cylinder group look like?

          [superblock | bookkeeping info | inodes | data blocks]

          superblock:
            --Contains the same info as above, plus:
            --cylinder group (CG) info
            --Some of the superblock is replicated (once per cylinder group)
            --Note that the superblock is replicated at shifting offsets,
              so that it spans multiple platters

          bookkeeping info:
            --block map: bitmap of available fragments
            --# of free inodes, blocks/frags, files, dirs

        block allocation:
            --Try to optimize for sequential access
            --If available, use a rotationally close block in the same
              cylinder
            --Otherwise, use a block in the same CG
            --If the CG is totally full, find another CG with quadratic
              hashing, i.e., if CG n is full, try n + 1^2, n + 2^2,
              n + 3^2, ... (mod # of CGs)
            --Otherwise, search all CGs for some free space

            ---------------------------------------------------------------
            | --but not going to write more than 1 MB per file per CG.
            |   why? .... see in a moment....
            ---------------------------------------------------------------

        Want to ensure there's space for related stuff. Approach:

            --Place different directories in different cylinder groups
            --keep a "free space reserve" in each group so we can
              allocate near existing things
            --when a file grows too big (1MB), send its remainder to a
              different cylinder group.
                --why? (answer: we don't want any one file in the CG to
                  get too big, because then the inodes in that CG would
                  have their data too far away).
                --why does this perform decently? answer: because we are
                  amortizing a 10 ms seek over a 1MB read. effective
                  bandwidth: 1MB/(10 ms + 1MB/(80MB/s)) = 44 MB/s. not bad.

    (iii) track free blocks with a bitmap (instead of a linked list)

        --Easier to find contiguous blocks.
        --Can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4KB disk blocks = 25,000,000 entries = ~3MB.
              not outrageous these days.
        --Good idea: keep reserves of free blocks. makes finding a close
          block easier
        --Allocate a block close to block x? (see the sketch just below)
            --Check for blocks near bmap[x/32]
            --If the disk is almost empty, will likely find one nearby
            --As the disk becomes full, the search becomes more expensive
              and less effective.
        --Trade space for time (search time, file access time)
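        [[aside, not from the original notes: a minimal sketch, in C, of
        the "check for blocks near bmap[x/32]" idea. this is an
        illustration, not actual FFS code; the names (find_near) and
        sizes are made up. the bitmap is an in-memory array of 32-bit
        words, one bit per block (1 = free).]]

            #include <stdint.h>

            #define NBLOCKS  (25 * 1000 * 1000)   /* e.g., 100GB / 4KB blocks  */
            #define NWORDS   ((NBLOCKS + 31) / 32)

            static uint32_t bmap[NWORDS];         /* 1 bit per block, 1 = free */

            /* return the number of a free block near block x, or -1 if the
             * bitmap shows nothing free.  scan outward from the word
             * containing x. */
            long find_near(long x)
            {
                long start = x / 32;
                for (long d = 0; d < (long)NWORDS; d++) {
                    /* alternate looking d words after and d words before start */
                    long cand[2] = { start + d, start - d };
                    for (int i = 0; i < 2; i++) {
                        long w = cand[i];
                        if (w < 0 || w >= (long)NWORDS)
                            continue;
                        if (bmap[w] == 0)         /* no free blocks in this word */
                            continue;
                        for (int b = 0; b < 32; b++)  /* find a set bit */
                            if (bmap[w] & (1u << b))
                                return w * 32 + b;
                    }
                }
                return -1;
            }

            /* tiny usage example */
            int main(void)
            {
                bmap[100000 / 32] |= 1u << (100000 % 32);  /* mark block 100000 free */
                return find_near(100001) == 100000 ? 0 : 1;
            }

        (as the notes say: when the disk is mostly empty this finds a set
        bit almost immediately near "start"; as the disk fills, the scan
        has to range farther from x and gets slower.)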
    (iv) keep a reserve (e.g., 10%) of the disk always free, ideally
    scattered across the disk

        --but don't tell users!! (df makes a full disk look 110% full)
        --with 10% free, can almost always find free blocks near where
          you want them

    --Performance improvements:
        --20-40% of disk bandwidth for large files
        --10-20x the original Unix file system!
        --better small-file performance (why?)

    --Is this the best we can do? No.

        --It's still block-based rather than extent-based
            [[--Extent-based:
                --name contiguous blocks with a single pointer and a
                  length (Linux ext2fs)]]

        --Meta-data writes happen synchronously
            --really hurts performance on small files
            --but how to make them asynchronous?
                --need write-ordering ("soft updates") or logging (LFS)
                --and/or play with semantics (/tmp file systems)

    --Usability improvements:
        --file names up to 255 characters
        --atomic rename() system call

    --Other performance hacks (beyond FFS)

        --Most obvious: a big file cache

        --fact: no rotation delay if you're reading the whole track.
            --how can we use this fact?

        --fact: transfer cost is negligible for small chunks
            --reading many nearby sectors costs roughly what reading one
              sector costs
            --how can we use this?

        --fact: if the transfer itself is huge, then seek + rotation
          times become negligible (!!)
            --how can we use this?

        --answer to all three: dump a lot of data at a time, and read a
          lot at a time
            --FFS accumulates data into 64KB clusters (but needs to do
              bookkeeping to track free clusters)
            --read ahead in 64KB clusters
            --why not just read/write 1 MB at a time?
                --(for writes: may not get data to disk often enough)
                --(for reads: may waste read bandwidth)

F. Caching, mmap, etc.

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is a fd?
        --indexes into a table
        --what's in the given entry in the table?
            --inumber!
            --inode, probably!
            --and per-open-file data (file position, etc.)

    --caching:

        --in order to make file system operations fast, the kernel
          maintains a *buffer cache* in memory

        --internally, all uses of

            ReadDisk(blockNum, readbuf)

          get replaced with:

            ReadDiskCache(blockNum, readbuf) {
                ptr = buffercache.get(blockNum);
                if (ptr) {
                    copy BLKSIZE bytes from ptr to readbuf
                } else {
                    newBuf = malloc(BLKSIZE);
                    ReadDisk(blockNum, newBuf);
                    buffercache.insert(blockNum, newBuf);
                    copy BLKSIZE bytes from newBuf to readbuf
                }
            }

    --memory-mapping files

        --syscall:

            void* mmap(void* addr, size_t len, int prot, int flags,
                       int fd, off_t offset);

        --map the specified open file (fd) into a region of my virtual
          memory (at addr, or at a kernel-selected place if addr is 0),
          and return a pointer to it (a usage sketch appears at the end
          of this section)

        --after this, loads and stores to addr[x] are equivalent to
          reading and writing the file at offset + x

        --how's this implemented?! (answer: through virtual memory. the
          VA is addr, or whatever the kernel selects; and the PA is...
          what? answer: the physical address of the page holding the
          file's data in the kernel's buffer cache.)

        --have to deal with eviction from the buffer cache, but this
          problem is not unique: in all operating systems besides JOS,
          the kernel designers *anyway* have to be able to invalidate
          VA-->PA mappings when a page is removed from RAM
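    [[aside, not from the original notes: a minimal usage sketch of
    mmap. it maps a file read-only and then accesses the file's bytes
    with ordinary loads instead of read() calls. the choice of file is
    just illustrative, and error handling is abbreviated.]]

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/stat.h>

        int main(void)
        {
            int fd = open("/etc/hosts", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) < 0 || st.st_size == 0) return 1;

            /* addr = 0: let the kernel pick the virtual address.  the pages
             * backing this mapping are the buffer-cache pages that hold the
             * file's data, per the discussion above. */
            char *p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            /* ordinary loads now stand in for read() calls */
            long lines = 0;
            for (off_t i = 0; i < st.st_size; i++)
                if (p[i] == '\n')
                    lines++;
            printf("%ld lines\n", lines);

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }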
2. Crash recovery

    * ad-hoc
    * ordered updates
    * soft updates
    * journaling

    there are a lot of data structures used to implement the file system
    (bitmap of free blocks, directories, inodes, indirect blocks, data
    blocks, etc.). they are going to have to be kept consistent. and
    we're going to need a cache. but it's hard to balance consistency
    and performance here.

    write-through vs. write-back caching:

        --*write through*: write changes immediately to disk. problem:
          slow! have to wait for each write to complete before going on

        --*write back*: delay writing modified data back to disk.
          problem: can lose data. another problem: updates can go to the
          disk in the wrong order

    [preview of transactions: even if we had write-through, that may not
    be enough. we probably want to do operations like <"delete a file"
    and "create a new file"> atomically. if there's a crash in between
    the "delete" and the "create", then write-through doesn't solve the
    problem.]

    if multiple updates are needed, do them in a specific order so that
    if a crash occurs, **fsck** can work.

    **crash recovery permeates FS code**
        --need to ensure that fsck can recover the file system

    A. First approach: ad-hoc

    --don't worry about data consistency
    --worry about metadata consistency. we will see below how this is
      accomplished

    (i) fsck

        fsck runs after a crash. it scans the entire disk, checking for
        internal consistency, and fixes up anything that was "in
        progress" at the time of the crash:

        --Summary info is usually bad after a crash
            --Scan to check the free block map and the block/inode counts
        --System may have corrupt inodes (not just from a simple crash)
            --Bad block numbers, cross-allocation, etc.
            --Do a sanity check, clear inodes containing garbage
        --Fields in inodes may be wrong
            --Count the number of directory entries to verify the link
              count; if there are no entries but the count != 0, move
              the file to lost+found
            --Make sure the size and used-data counts match the blocks
        --Directories may be bad
            --Holes are illegal, "." and ".." must be valid, ...
            --All directories must be reachable

    (ii) recall we said that the FS is permeated by code whose purpose
    is to ensure that fsck will work when it runs. what's required?

        --for one thing, we can't have all data written asynchronously.
        if all data were written asynchronously, we could encounter the
        following unacceptable scenarios:

            (a) delete/truncate a file, append to another file, crash
                --the new file may reuse a block from the old one
                --the old inode may not have been updated
                --cross-allocation!
                --often the inode with the older mtime is the wrong one,
                  but we can't be sure

            (b) append to a file, allocate an indirect block, crash
                --the inode points to the indirect block
                --but the indirect block may contain garbage

        --so what's the actual approach?

            --be careful about the order of updates. specifically:
                --Write the new inode to disk before the directory entry
                --Remove the directory name before deallocating the inode
                --Write the cleared inode to disk before updating the CG
                  free map

            --how is it implemented?
                --synchronous write-through for *metadata*
                --doing one metadata write at a time ensures ordering

            example: file create (a pseudocode sketch follows the crash
            cases below):
                --write data to the file
                --update the file header (inode)
                --mark the file header "allocated" in the bitmap
                --mark the file's blocks "allocated" in the bitmap
                --update the directory
                --(if the directory grew) mark the directory's new block
                  "allocated" in the bitmap

            now, the crash cases:
                --file header not in the bitmap --> the only writes were
                  to unallocated, unreachable blocks; the result is that
                  the write "disappears"
                --file header allocated, file blocks not in the bitmap
                  --> update the bitmap
                --file created, but not yet in any directory --> delete
                  the file (after all that!)
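            [[aside, not from the original notes: a small C sketch of
            that create ordering. the function names are invented; each
            sync_write_* stands for "issue the metadata write and wait
            for the disk to acknowledge it before moving on", which is
            what synchronous write-through means here.]]

                #include <stdio.h>

                /* stand-ins for synchronous disk writes of individual structures */
                static void sync_write_data_block(int blk)   { printf("data block %d\n", blk); }
                static void sync_write_inode(int ino)        { printf("inode %d\n", ino); }
                static void sync_write_inode_bitmap(int ino) { printf("inode bitmap bit %d\n", ino); }
                static void sync_write_block_bitmap(int blk) { printf("block bitmap bit %d\n", blk); }
                static void sync_write_dir_block(int blk)    { printf("directory block %d\n", blk); }

                /* create a one-block file: each step completes before the next
                 * begins, so a crash at any point leaves a state that fsck can
                 * repair (per the crash cases above) */
                void create_file(int ino, int data_blk, int dir_blk)
                {
                    sync_write_data_block(data_blk);    /* 1. file contents               */
                    sync_write_inode(ino);              /* 2. inode points at the data    */
                    sync_write_inode_bitmap(ino);       /* 3. inode marked allocated      */
                    sync_write_block_bitmap(data_blk);  /* 4. data block marked allocated */
                    sync_write_dir_block(dir_blk);      /* 5. name->inumber entry last    */
                }

                int main(void)
                {
                    create_file(/*ino=*/7, /*data_blk=*/1042, /*dir_blk=*/88);
                    return 0;
                }

            (a crash before step 3 leaves only unreachable structures, so
            the create just "disappears"; between 3 and 4, fsck updates
            the bitmap; between 4 and 5, fsck deletes the orphaned file.)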
    (iii) disadvantages to this ad-hoc approach:

        (a) need to get the ad-hoc reasoning exactly right

        (b) poor performance (synchronous writes)

            --multiple updates to the same block require that they be
              issued separately. for example, imagine two updates to the
              same directory block. the first must complete before the
              second is issued (otherwise, the writes aren't synchronous)

            --more generally, the cost of crash recoverability is
              enormous. (a job like "untar" could be 10-20x slower)

        (c) slow recovery: fsck must scan the entire disk

            --recovery gets slower as disks get bigger. if fsck takes
              one minute today, what happens when the disk gets 10 times
              bigger?

    [aside: why not use battery-backed RAM? answer:
        --Expensive (requires specialized hardware)
        --Often don't learn that the battery has died until it's too late
        --A pain if the computer dies (can't just move the disk)
        --If an OS bug causes the crash, the RAM might be garbage]

    remaining approaches: try to get better performance with the same
    consistency

    B. Approach: ordered updates

    --Follow three rules in ordering updates:

        (i) Never write a pointer before initializing the structure it
            points to
        (ii) Never reuse a resource before nullifying all pointers to it
        (iii) Never clear the last pointer to a live resource before
              setting the new one [e.g., rename]

    --If you do this, the file system will be recoverable ... and
      quickly!

        --Might leak free disk space, but otherwise correct
        --So start running right after reboot, and scavenge for space in
          the background

    --How to actually implement this?

        --Keep a partial order on buffered blocks (a data-structure
          sketch appears at the end of these notes)

        --Example: create file A
            --Block X contains an inode
            --Block Y contains a directory block
            --Create file A in inode block X, dir block Y
            --We say "Y-->X", pronounced "Y _depends on_ X"
                --Means Y cannot be written before X is written
                --X is called the _dependee_, Y the _depender_
            --Can delay both writes, so long as the order is preserved
            --Say you create a second file B in blocks X and Y
                --Only have to write each block out once

    --So what's the catch? Cyclic dependencies....

        --Suppose you create file A and unlink file B
            --Both files are in the same directory block & inode block
        --Can only write the directory after inode A is initialized
            --Otherwise, after a crash the directory would point to a
              bogus inode
            --Worse yet, the same inode # could be re-allocated
            --So we could end up with file name A naming an unrelated file
        --But can only write the inode block after the directory entry
          for B is cleared
            --Otherwise, we might be deleting a file while links to it
              still exist
            --That is, the link count would now be inaccurate

        [[--In principle, we could get around this cyclic issue by
        writing the inode block *before* B is cleared, but then fsck
        becomes super slow: it would have to check every directory entry
        and every inode link count]]

        [DRAW PICTURE TO ILLUSTRATE THE CYCLIC DEPENDENCY PROBLEM]

    --Summary: the problems with ordering updates naively are:
        --cyclic dependencies
        --a crash might occur between ordered but related writes
        --summary information is wrong after a block is freed
        --block aging
            --a block with a dependency will never get written back
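    [[aside, not from the original notes: a minimal data-structure
    sketch of "keep a partial order on buffered blocks". this is an
    illustration with invented names (struct buf, add_dependency,
    flush), not the real soft-updates machinery, and it deliberately
    does nothing about the cycle problem discussed above.]]

        #include <stdbool.h>
        #include <stdio.h>

        #define MAX_DEPS 8

        struct buf {
            int         blockno;            /* which disk block this caches   */
            bool        dirty;              /* needs to be written back?      */
            struct buf *deps[MAX_DEPS];     /* dependees: must hit disk first */
            int         ndeps;
        };

        /* record "b depends on d" (b is the depender, d the dependee):
         * b must not be written before d */
        void add_dependency(struct buf *b, struct buf *d)
        {
            if (b->ndeps < MAX_DEPS)
                b->deps[b->ndeps++] = d;
        }

        /* stand-in for a synchronous disk write */
        static void disk_write(struct buf *b)
        {
            printf("writing block %d\n", b->blockno);
            b->dirty = false;
        }

        /* flush b, first flushing everything it depends on.  a cycle in
         * the dependency graph would make this recurse forever -- which
         * is exactly the cyclic-dependency problem the notes point out */
        void flush(struct buf *b)
        {
            for (int i = 0; i < b->ndeps; i++)
                if (b->deps[i]->dirty)
                    flush(b->deps[i]);
            if (b->dirty)
                disk_write(b);
        }

        int main(void)
        {
            struct buf X = { .blockno = 10, .dirty = true };  /* inode block     */
            struct buf Y = { .blockno = 20, .dirty = true };  /* directory block */
            add_dependency(&Y, &X);  /* Y-->X: write the inode block first       */
            flush(&Y);               /* prints block 10, then block 20           */
            return 0;
        }

    (note how the "create file A, unlink file B" scenario above would
    force Y-->X and X-->Y at the same time, so flush() would never
    terminate; handling that is what soft updates, next time, is about.)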