Class 18
CS 372H 29 March 2011

On the board
------------

1. Last time

2. File systems, continued
    A. [last time] Intro
    B. [last time] Files
    C. [last time, today] Implementing files
        1. [last time] contiguous
        2. [last time] linked files
        3. [last time] FAT
        4. indexed files
    D. Directories
    E. FS performance
    F. mmap

---------------------------------------------------------------------------

1. Last time

2. File systems, continued

    per-file metadata: inode

        offset ------> disk block address

        [the inode is a case of something more general: a map from
        file offsets to disk block addresses]

C. Implementing files

    remember, we're assuming the metadata is provided, and the question
    is "what does it look like?". we'll discuss in "directories" below
    where the metadata comes from.

    4. indexed files

        [DRAW PICTURE]

        --Each file has an array holding all of its block pointers
            --like a page table, so similar issues crop up
        --Allocate this array on file creation
        --Allocate blocks on demand (using a free list)

        +: sequential and random access are both easy

        -: need to somehow store the array
            --large possible file size --> lots of unused entries in
              the block array
            --large actual file size --> the array itself is huge, so
              a big contiguous disk chunk is needed to hold it
            --solve the problem the same way we did for page tables,
              with a multi-level structure:

                      [............]
                 [..........]  [.........]
                 [ block block block ]

        --okay, so now we're not wasting disk blocks, but what's the
          problem? (answer: issues equivalent to those for page
          tables: here, it's extra disk accesses to look up the
          blocks)

    5. indexed files, take two

        --classic Unix file system
        --inode contains:
            permissions
            times for file access, file modification, and inode-change
            link count (# of directories containing the file)
            ptr 1  --> data block
            ptr 2  --> data block
            ptr 3  --> data block
            .....
            ptr 11 --> indirect block
                           [ptr --> data block]
                           [ptr --> data block]
                           ....
            ptr 12 --> indirect block
            ptr 13 --> double indirect block
            ptr 14 --> triple indirect block

          (see the sketch below for how a byte offset maps through
          these pointers)

        +: Simple, easy to build, fast access to small files
        +: Maximum file length can be enormous, with multiple levels
           of indirection

        -: worst case # of accesses pretty bad (several pointer-block
           reads before reaching the data)
        -: worst case overhead pretty bad (an 11-block file burns a
           whole indirect block to hold a single pointer)
        -: because you allocate blocks by taking them off an unordered
           free list, metadata and data get strewn across the disk
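        --to make the pointer scheme concrete, here is a minimal C
          sketch of the offset-to-block mapping (this is not the
          actual Unix source: NDIRECT, PTRS_PER_BLOCK, the struct
          layout, and read_disk_block() are invented names, and the
          double/triple indirect cases are omitted; each one adds one
          more disk read, following the same pattern):

            /* sketch only: illustrative names, not real Unix code */
            #include <stdint.h>

            #define BLKSIZE        4096
            #define NDIRECT        10                   /* direct ptrs */
            #define PTRS_PER_BLOCK (BLKSIZE / sizeof(uint32_t))

            struct inode {
                uint32_t direct[NDIRECT]; /* each --> a data block     */
                uint32_t indirect;        /* --> a block of block ptrs */
                /* double/triple indirect ptrs omitted in this sketch  */
            };

            /* assumed helper: read disk block 'blknum' into 'buf' */
            extern void read_disk_block(uint32_t blknum, void *buf);

            /* return the disk block number holding byte 'offset' */
            uint32_t
            offset_to_block(const struct inode *ip, uint32_t offset)
            {
                uint32_t bn = offset / BLKSIZE; /* logical block index */

                if (bn < NDIRECT)
                    return ip->direct[bn];   /* no extra disk access */

                bn -= NDIRECT;
                if (bn < PTRS_PER_BLOCK) {   /* singly indirect */
                    uint32_t ptrs[PTRS_PER_BLOCK];
                    read_disk_block(ip->indirect, ptrs); /* extra access */
                    return ptrs[bn];
                }

                return 0;  /* doubly/triply indirect: same idea, one
                              more read_disk_block() per level */
            }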
    Notes about inodes:
        --stored in a fixed-size array
        --size of the array is fixed when the disk is initialized;
          it can't be changed
        --multiple inodes per disk block
        --lives in a known location: originally at one side of the
          disk, now lives in pieces spread across the disk (helps keep
          metadata close to data)
        --the index of an inode in the inode array is called an
          ***i-number***
        --internally, the OS refers to files by i-number
        --when a file is opened, its inode is brought into memory
        --written back when modified, and when the file is closed or
          after enough time elapses

D. Directories

    --Problem: "Spend all day generating data, come back the next
      morning, want to use it." -- F. Corbato, on why files/dirs were
      invented.

    --Approach 0: Have users remember where on disk their files are
        --like remembering your social security or bank account #
        --yuck. (people want human-friendly names.)

    --So use directories to map names to file blocks, somehow
    --But what is in a directory?

    --A short history of directories:

        --Approach 1: Single directory for entire system
            --Put directory at a known location on disk
            --Directory contains <file name, location> pairs
            --If one user uses a name, no one else can
            --Many ancient personal computers worked this way

        --Approach 2: Single directory for each user
            --Still clumsy, and "ls" on 10,000 files is a real pain
            --(But some oldtimers still work this way)

        --Approach 3: Hierarchical name spaces
            --Allow a directory to map names to files ***or other
              dirs***
            --File system forms a tree (or a graph, if links allowed)
            --Large name spaces tend to be hierarchical
                --examples: IP addresses (will come up in the
                  networking unit), domain names, scoping in
                  programming languages, etc.
            --more generally, the concept of hierarchy is everywhere
              in computer systems

    --Hierarchical Unix

        --used since CTSS (1960s), and Unix picked it up and used it
          nicely
        --structure like:

                        "/"
               /   |    |    |    \
             bin cdrom dev  sbin  tmp
             /  \
           awk  chmod ....

        --directories are stored on disk just like regular files
        --here's the data in a directory file; this data is in the
          *data blocks* of the directory (see the lookup sketch at the
          end of this section):

            [<file name, i-number>]
            [<file name, i-number>]
            ....

        --the i-node for a directory contains a special flag bit
        --only special users can write directory files
        --key point: the i-number might reference another directory
            --this neatly turns the FS into a hierarchical tree, with
              almost no extra work
            --another nice thing about this: if you speed up file
              operations, you also speed up directory operations,
              because directories are just like files
        --bootstrapping: where do you start looking?
            --root dir is always inode #2 (0 and 1 are reserved)
        --and, voila, we have a namespace!
            --special names: "/", ".", ".."
            --given those names, we need only two operations to
              navigate the entire name space:
                --"cd name": (change context to directory "name")
                --"ls": (list all names in the current directory)
        --example: [DRAW PICTURE]

        --links:

            --hard link: multiple dir entries point to the same inode;
              the inode contains a refcount

                "ln a b": creates a synonym ("b") for file ("a")

                --how do we avoid cycles in the graph? (answer: can't
                  hard link to directories)

            --soft link: synonym for a *name*

                "ln -s /d/a b":
                    --creates a new inode, not just a new directory
                      entry
                    --the new inode has a "sym link" bit set
                    --contents of that new file: "/d/a"
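        --to make directory lookup concrete, here is a minimal C
          sketch (illustrative only: the entry layout echoes the old
          2-byte i-number + 14-character-name Unix entries, and
          read_file_block() is an assumed helper, not the real kernel
          interface). resolving a path like "/d/a" is just this lookup
          in a loop, starting from the root directory's i-number:

            /* sketch only: layout and helper are assumptions */
            #include <stdint.h>
            #include <string.h>

            #define BLKSIZE 512
            #define NAMELEN 14

            struct dirent {             /* one <name, i-number> pair */
                uint16_t inum;          /* 0 means "unused slot"     */
                char     name[NAMELEN]; /* null-padded, not always
                                           null-terminated           */
            };

            /* assumed helper: read logical block 'bn' of the file
               with i-number 'dir_inum' into 'buf' (a directory is
               just a file, so this is the ordinary file read path);
               returns # of valid bytes, 0 at end of file */
            extern int read_file_block(uint16_t dir_inum, uint32_t bn,
                                       void *buf);

            /* return the i-number bound to 'name' in directory
               'dir_inum', or 0 if not found */
            uint16_t
            dir_lookup(uint16_t dir_inum, const char *name)
            {
                char buf[BLKSIZE];

                for (uint32_t bn = 0; ; bn++) {
                    int n = read_file_block(dir_inum, bn, buf);
                    if (n <= 0)
                        return 0;          /* searched whole dir */
                    struct dirent *de = (struct dirent *) buf;
                    for (int i = 0; i < n / (int) sizeof(*de); i++)
                        if (de[i].inum != 0 &&
                            strncmp(de[i].name, name, NAMELEN) == 0)
                            return de[i].inum;   /* found it */
                }
            }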
E. FS Performance

    --Unix FS was simple, elegant, and ... slow

        --blocks too small
        --file index (inode) too large
        --too many layers of mapping indirection
        --transfer rate low (they were getting one block at a time)
        --poor clustering of related objects:
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning things
              like "ls" and "grep foo *.c" were slowwwww

        --other problems:
            --14-character names were the limit
            --can't atomically update a file in a crash-proof way

    --FFS (fast file system) fixes these problems to a degree.

        [Reference: M. K. McKusick, W. N. Joy, S. J. Leffler, and
        R. S. Fabry. A Fast File System for UNIX. ACM Trans. on
        Computer Systems, Vol. 2, No. 3, Aug. 1984, pp. 181-197.]

        what can we do about the problems above? [ask for suggestions]

        * make the block size bigger (4 KB, 8 KB, or 16 KB)

        * cluster related objects

            "cylinder groups" (one or more consecutive cylinders)

            [superblock | bookkeeping info | inodes | bitmap |
             data blocks (512 bytes each) ]

            --try to put inodes and data blocks in the same cylinder
              group
            --try to put all inodes of files in the same directory in
              the same cylinder group
            --new directories are placed in the cylinder group with a
              greater-than-average number of free inodes
            --as files are allocated, use a heuristic: spill to the
              next cylinder group after 48 KB of a file (which would
              be the point at which an indirect block would be
              required, assuming 4096-byte blocks) and at every
              megabyte thereafter

        * bitmaps (to track free blocks)

            --easier to find contiguous blocks
            --can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4 KB disk blocks = 25,000,000 bits, which
              is about 3 MB. not outrageous these days.

        * reserve space

            --but don't tell users. (df makes a full disk look 110%
              full)

        * total performance:

            --20-40% of disk bandwidth for large files
            --10-20x the performance of the original Unix file system!
            --still not the best we can do (metadata writes happen
              synchronously, which really hurts performance. but
              making them asynchronous requires a story for crash
              recovery.)

    Other performance techniques:

        --Most obvious: big file cache
            --kernel maintains a *buffer cache* in memory
            --internally, all uses of ReadDisk(blockNum, readbuf) are
              replaced with:

                ReadDiskCache(blockNum, readbuf) {
                    ptr = buffercache.get(blockNum);
                    if (ptr) {
                        copy BLKSIZE bytes from ptr to readbuf
                    } else {
                        newBuf = malloc(BLKSIZE);
                        ReadDisk(blockNum, newBuf);
                        buffercache.insert(blockNum, newBuf);
                        copy BLKSIZE bytes from newBuf to readbuf
                    }
                }

        --no rotation delay if you're reading the whole track
            --so try to read the whole track
        --more generally, try to work with big chunks (lots of disk
          blocks)
            --write in big chunks
            --read ahead in big chunks (64 KB)
            --why not just read/write 1 MB at a time?
                --(for writes: may not get data to disk often enough)
                --(for reads: may waste read bandwidth)

F. mmap: memory mapping files

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is an fd?
        --it indexes into a table
        --what's in the given entry in the table?
            --inumber!
            --inode, probably!
            --and per-open-file data (file position, etc.)

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags,
                   int fd, off_t offset);

        --map the specified open file (fd) into a region of my virtual
          memory (at addr, or at a kernel-selected place if addr is
          0), and return a pointer to it

        --after this, loads and stores to addr[i] are equivalent to
          reading and writing the file at offset+i

    --how's this implemented?!

        (answer: through virtual memory, with the VA being addr [or
        whatever the kernel selects] and the PA being what? answer:
        the physical address storing the given page in the kernel's
        buffer cache)

        --have to deal with eviction from the buffer cache, but this
          problem is not unique. in all operating systems besides JOS,
          the kernel designers *anyway* have to be able to invalidate
          VA-->PA mappings when a page is removed from RAM
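    --to make the interface concrete, here is a small usage sketch
      that prints a file's first bytes through a mapping instead of
      read(); the calls (open, fstat, mmap, munmap) are the standard
      POSIX ones, and error handling is abbreviated:

        /* usage sketch: read a file's first bytes via mmap */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
            if (argc < 2)
                return 1;

            int fd = open(argv[1], O_RDONLY);
            struct stat st;
            fstat(fd, &st);              /* need the file's length */

            /* addr = 0, so the kernel selects the place */
            char *p = mmap(0, st.st_size, PROT_READ, MAP_SHARED,
                           fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            /* these loads fault pages in from the buffer cache;
               no read() calls */
            for (off_t i = 0; i < st.st_size && i < 32; i++)
                putchar(p[i]);

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }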