Class 17
CS372H
27 March 2012

On the board
------------
1. Last time
2. Disks
3. File systems

---------------------------------------------------------------------------

1. Last time

    --Tanenbaum/Torvalds
        --one of you posted about this on stackoverflow. great to see it
          climb up there, on reddit, and on Hacker News.
    --reminder about background section tomorrow

2. Disks

    --try to make all reads and writes contiguous and sequential

    [Reference: "An Introduction to Disk Drive Modeling", by Chris Ruemmler
    and John Wilkes. IEEE Computer, Vol. 27, No. 3, 1994, pp. 17-28.]

    --delays: seek, rotational, transfer
        --transfer rate keeps growing (85 MB/s and up)
        --seek and rotational delays aren't really shrinking
            rotational: 10,000 rotations/min = 166 rotations/sec
                        ==> 6 ms / rotation
            avg seek: ~4 ms
    --disk accesses are a huge system bottleneck and getting worse (in
      systems that have disks)
    --Bandwidth increase lets the system (pre-)fetch large chunks for about
      the same cost as a small chunk.
    --So trade latency for bandwidth if you can get lots of related stuff
      at roughly the same time. How to do that?
        --By clustering the related stuff together on the disk
    --The saving grace for big systems is that memory size is increasing
      faster than typical workload size
        --result: more and more of the workload fits in the file cache,
          which in turn means that the profile of traffic to the disk has
          changed: now mostly writes and new data
        --which means logging and journaling become viable (more on this
          next class)
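    Before moving on to file systems, here is a back-of-the-envelope sketch
    of why seek and rotation dominate, using the rough numbers above (4 ms
    seek, 6 ms/rotation, 85 MB/s transfer are the figures quoted in these
    notes, not measurements of any particular drive):

        #include <stdio.h>

        /* Rough disk timing model using the assumed numbers above:
         * ~4 ms avg seek, 6 ms per rotation (so ~3 ms avg rotational
         * delay), ~85 MB/s sustained transfer. */
        int main(void) {
            double seek_ms = 4.0;
            double rot_ms  = 6.0 / 2.0;   /* wait half a rotation on average */
            double xfer_ms_per_kb = 1000.0 / (85.0 * 1024.0);

            double small = seek_ms + rot_ms + 4    * xfer_ms_per_kb;  /* random 4 KB read */
            double big   = seek_ms + rot_ms + 1024 * xfer_ms_per_kb;  /* sequential 1 MB read */

            printf("random 4 KB read:     %.2f ms (%.1f MB/s effective)\n",
                   small, (4.0 / 1024.0) / (small / 1000.0));
            printf("sequential 1 MB read: %.2f ms (%.1f MB/s effective)\n",
                   big, 1.0 / (big / 1000.0));
            return 0;
        }

    The positioning cost is almost entirely wasted on a small random read
    (well under 1 MB/s effective), which is exactly the argument for
    clustering related data and fetching it in big chunks.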
3. File systems

    A. Intro
    B. Files
        1. contiguous
        2. linked files
        3. FAT
        4. indexed files
        5. indexed files, take two
    D. Directories
    E. FS performance
    F. mmap

A. Intro

    --more papers on FSs than on any other single topic
    --probably also the hardest part of operating systems
    --what does a FS do?
        --provides persistence (doesn't go away ... ever)
        --somehow associates bytes on the disk with names (files)
        --somehow associates names with each other (directories)
    --where are FSes implemented?
        --can implement them on disk, over the network, in memory, in NVRAM
          (non-volatile RAM), on tape, with paper (!!!!)
        --we are going to focus on the disk and generalize later. we'll see
          what it means to implement a FS over the network
    --a few quick notes about disks in the context of FS design
        --disk is the first thing we've seen that (a) doesn't go away; and
          (b) we can modify (BIOS ROM, hardware configuration, etc. don't
          go away, but we weren't able to modify those things). two
          implications here:
            (i) we're going to have to put all of our important state on
                the disk
            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't happen
                 again. scribble randomly on the disk --> now what?
                 (answer: in many cases, we're hosed.)
        --mismatch: CPU and memory are *also* working with "important
          state", but they are vastly faster than disks
        --disk is enormous: 100-1000x more data than memory
            --how to organize all of this information?
            --answer is by categorizing things (taxonomies). a FS is a kind
              of taxonomy ("/homes" has home directories,
              "/homes/bob/classes/cs372h" has bob's cs372h material, etc.)

B. Files

 * Intro

    --what is a file?
        --answer from the user's view: a bunch of named bytes on the disk
        --answer from the FS's view: a collection of disk blocks
        --big job of a FS: map names and offsets to disk blocks:

                              FS
              {file, offset} ----> disk address

    --operations are create(file), delete(file), read(), write()
    --***goal: operations have as few disk accesses as possible and
       minimal space overhead
    --wait, why do we want minimal space overhead, given that the disk is
      huge?
        --answer: cache space is never enough; the amount of data that can
          be retrieved in one fetch is never enough. hence, we really don't
          want to waste space.

    [[--note that we have seen translation/indirection before:

        page table:
                          page table
          virtual address ----------> physical address

        per-file metadata:
                  inode
          offset ------> disk block address

        how'd we get the inode?

        directory:
                     directory
          file name ----------> file #
                                (a file # *is* an inode number in Unix)
    ]]

 * Implementing files

    --our task: meet the goal marked *** above.
    --for now, we're going to assume that the file's metadata is given to
      us. when we look at directories in a bit, we'll see where the
      metadata comes from; the picture above should also give a hint.

    access patterns we could imagine supporting:

        (i) Sequential:
            --File data processed in sequential order
            --By far the most common mode
            --Example: editor writes out a new file, compiler reads in a
              file, etc.

        (ii) Random access:
            --Address any block in the file directly, without passing
              through the blocks before it
            --Examples: large data set, demand paging, databases

        (iii) Keyed access:
            --Search for a block with particular values
            --Examples: associative database, index
            --This thing is everywhere in the field of databases and
              search engines, but...
            --...usually not provided by a FS in the OS

    helpful observations:

        (i) All blocks in a file tend to be used together, sequentially
        (ii) All files in a directory tend to be used together
        (iii) All *names* in a directory tend to be used together

    further design parameters:

        (i) Most files are small
        (ii) Much of the disk is allocated to large files
        (iii) Many of the I/O operations are made to large files
        (iv) Want good sequential and good random access

    candidate designs........

    1. contiguous allocation

        "extent based"
        --when creating a file, make the user pre-specify its length, and
          allocate all of the space at once
        --file metadata contains location and size
        --example: IBM OS/360

            [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple
        +: fast access, both sequential and random
        -: fragmentation

        where have we seen something similar? (answer: segmentation in
        virtual memory)

    2. linked files

        --keep a linked list of free blocks
        --metadata: pointer to the file's first block
        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access is easy (and probably mostly fast, assuming
           decent free space management, since the pointers will point
           close by)
        -: random access is a disaster
        -: pointers take up room in blocks; messes up alignment of data

    3. modification of linked files: FAT

        --keep the link structure in memory
            --in a fixed-size "FAT" (file allocation table)
            --pointer chasing now happens in RAM
        [DRAW PICTURE]
        --example: MS-DOS (and iPods, MP3 players, digital cameras)

        +: no need to maintain a separate free list (the table says what's
           free)
        +: low space overhead
        -: maximum size limited.
            64K entries * 512-byte blocks --> 32 MB max file system
            bigger blocks bring advantages and disadvantages, and ditto a
            bigger table

        note: to guard against bad sectors, better store multiple copies of
        the FAT on the disk!!
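        To make the pointer chasing concrete, here is a minimal, self-
        contained sketch of a FAT-style scheme (the tiny block size, the
        fat[] array, and the block numbers are all made up for
        illustration). The chain lives in the in-memory table, so finding
        block i of a file takes i steps through the table, but no disk
        reads for the chain itself:

            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            #define BLKSIZE  16          /* tiny blocks so the demo is readable */
            #define NBLOCKS  8
            #define FAT_EOF  0xFFFF      /* assumed end-of-chain marker */

            /* a pretend disk and its in-memory FAT */
            static char     disk[NBLOCKS][BLKSIZE];
            static uint16_t fat[NBLOCKS];     /* fat[b] = block after b */

            /* Return block 'bn' of the file whose first block is 'start':
             * walk the chain in RAM, then "access the disk" for the block. */
            static char *read_file_block(uint16_t start, int bn)
            {
                uint16_t b = start;
                for (int i = 0; i < bn; i++) {   /* bn RAM lookups */
                    if (b == FAT_EOF) return NULL;
                    b = fat[b];
                }
                return (b == FAT_EOF) ? NULL : disk[b];
            }

            int main(void)
            {
                /* file f occupies blocks 5 -> 2 -> 7 (not contiguous) */
                uint16_t f = 5;
                fat[5] = 2; fat[2] = 7; fat[7] = FAT_EOF;
                strcpy(disk[5], "one ");
                strcpy(disk[2], "two ");
                strcpy(disk[7], "three");

                for (int bn = 0; bn < 3; bn++)
                    printf("block %d of f: \"%s\"\n", bn,
                           read_file_block(f, bn));
                return 0;
            }

        Sequential reads walk the chain once; a random read at a large
        offset has to walk the whole prefix of the chain, which is tolerable
        only because the table is in memory.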
    4. indexed files

        [DRAW PICTURE]

        --Each file has an array holding all of its block pointers
            --like a page table, so similar issues crop up
        --Allocate this array on file creation
        --Allocate blocks on demand (using the free list)

        +: sequential and random access are both easy
        -: need to somehow store the array
            --large possible file size --> lots of unused entries in the
              block array
            --large actual block size --> huge contiguous disk chunk needed
            --solve the problem the same way we did for page tables:

                [............]   [..........]   [.........]
                  [ block           block          block ]

        --okay, so now we're not wasting disk blocks, but what's the
          problem? (answer: equivalent issues as for page tables: here,
          it's extra disk accesses to look up the blocks)

    5. indexed files, take two

        --classic Unix file system
        --inode contains:
            permissions
            times for file access, file modification, and inode change
            link count (# of directories containing the file)
            ptr 1  --> data block
            ptr 2  --> data block
            ptr 3  --> data block
            .....
            ptr 11 --> indirect block
                         ptr --> ptr --> ptr --> ptr --> ptr -->
            ptr 12 --> indirect block
            ptr 13 --> double indirect block
            ptr 14 --> triple indirect block

        +: Simple, easy to build, fast access to small files
        +: Maximum file length can be enormous, with multiple levels of
           indirection
        -: worst case # of accesses pretty bad
        -: worst case overhead (such as an 11-block file) pretty bad
        -: Because you allocate blocks by taking them off an unordered
           free list, metadata and data get strewn across the disk

        Notes about inodes:

        --stored in a fixed-size array
            --Size of the array is fixed when the disk is initialized;
              can't be changed
            --Multiple inodes per disk block
            --Lives in a known location, originally at one side of the
              disk, now lives in pieces across the disk (helps keep
              metadata close to data)
        --The index of an inode in the inode array is called an
          ***i-number***
        --Internally, the OS refers to files by i-number
        --When a file is opened, its inode is brought into memory
        --Written back when modified and the file is closed or time elapses
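        A concrete way to see this mapping is the routine that turns a file
        offset into a disk block number. This is a simplified sketch: it
        handles only the direct pointers and one single-indirect block, the
        names (NDIRECT, read_block, the block numbers) are made up, and the
        fake read_block stands in for a real disk/buffer-cache read. The
        double and triple indirect cases work the same way, one extra level
        of lookup each:

            #include <stdint.h>
            #include <stdio.h>

            #define BLKSIZE   4096
            #define NDIRECT   10                           /* assumed: first 10 ptrs direct */
            #define NINDIRECT (BLKSIZE / sizeof(uint32_t)) /* ptrs per indirect block */

            struct inode {                   /* simplified in-memory inode */
                uint32_t direct[NDIRECT];    /* direct block pointers */
                uint32_t indirect;           /* single-indirect block pointer */
                /* ... double/triple indirect, permissions, times, link count ... */
            };

            /* Stand-in for reading the indirect block off disk; for the
             * demo, indirect block 99 maps slot i to block 1000+i. */
            static void read_block(uint32_t blockno, uint32_t *ptrs)
            {
                for (uint32_t i = 0; i < NINDIRECT; i++)
                    ptrs[i] = (blockno == 99) ? 1000 + i : 0;
            }

            /* Return the disk block holding byte 'off' of file 'ip', or 0
             * if the offset is beyond direct + single-indirect reach. */
            static uint32_t bmap(const struct inode *ip, uint32_t off)
            {
                uint32_t bn = off / BLKSIZE;      /* logical block # in file */
                if (bn < NDIRECT)
                    return ip->direct[bn];        /* no extra disk access */
                bn -= NDIRECT;
                if (bn < NINDIRECT) {
                    uint32_t ptrs[NINDIRECT];
                    read_block(ip->indirect, ptrs);   /* one extra disk access */
                    return ptrs[bn];
                }
                return 0;                         /* double/triple indirect omitted */
            }

            int main(void)
            {
                struct inode ino = {
                    .direct = {500,501,502,503,504,505,506,507,508,509},
                    .indirect = 99 };
                printf("offset 0     -> block %u\n", bmap(&ino, 0));      /* direct */
                printf("offset 45000 -> block %u\n", bmap(&ino, 45000));  /* via indirect */
                return 0;
            }

        Small files are reached with no extra accesses; big files pay one or
        more extra reads per lookup, which is the "worst case # of accesses"
        complaint above.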
D. Directories

    --Problem: "Spend all day generating data, come back the next morning,
      want to use it." -- F. Corbato, on why files/dirs were invented.

    --Approach 0: Have users remember where on disk their files are
        --like remembering your social security or bank account #
        --yuck. (people want human-friendly names.)

    --So use directories to map names to file blocks, somehow
    --But what is in a directory?

    --A short history of directories:

    --Approach 1: Single directory for the entire system
        --Put the directory at a known location on disk
        --Directory contains <file name, file #> pairs
        --If one user uses a name, no one else can
        --Many ancient personal computers worked this way

    --Approach 2: Single directory for each user
        --Still clumsy, and "ls" on 10,000 files is a real pain
        --(But some oldtimers still work this way)

    --Approach 3: Hierarchical name spaces
        --Allow a directory to map names to files ***or other dirs***
        --File system forms a tree (or graph, if links are allowed)
        --Large name spaces tend to be hierarchical
            --examples: IP addresses (will come up in the networking unit),
              domain names, scoping in programming languages, etc.
        --more generally, the concept of hierarchy is everywhere in
          computer systems

    --Hierarchical Unix

        --used since CTSS (1960s); Unix picked it up and used it nicely
        --structure like:

                            "/"
                 /     |     |     |     \
               bin   cdrom  dev   sbin   tmp
              /   \
            awk  chmod  ....

        --directories are stored on disk just like regular files
            --here's the data in a directory file; this data is in the
              *data blocks* of the directory:

                [<i-number, file name>, <i-number, file name>, ....]

            --the i-node for a directory contains a special flag bit
            --only special users can write directory files
        --key point: an i-number might reference another directory
            --this neatly turns the FS into a hierarchical tree, with
              almost no work
        --another nice thing about this: if you speed up file operations,
          you also speed up directory operations, because directories are
          just like files
        --bootstrapping: where do you start looking?
            --root dir is always inode #2 (0 and 1 are reserved)
        --and, voila, we have a namespace!
            --special names: "/", ".", ".."
            --given those names, we need only two operations to navigate
              the entire name space:
                --"cd name": (change context to directory "name")
                --"ls": (list all names in the current directory)
        --example: [DRAW PICTURE]

    --links:

        --hard link: multiple directory entries point to the same inode;
          the inode contains a refcount

            "ln a b": creates a synonym ("b") for file ("a")

            --how do we avoid cycles in the graph?
              (answer: can't hard link to directories)

        --soft link: synonym for a *name*

            "ln -s /d/a b":
                --creates a new inode, not just a new directory entry
                --the new inode has a "sym link" bit set
                --the contents of that new file: "/d/a"
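    Since a directory is just a file whose data blocks hold
    <i-number, name> pairs, name lookup is nothing more than scanning those
    entries for a match. A minimal sketch (the struct dirent layout and the
    sample entries are illustrative assumptions, not a real on-disk format;
    NAMELEN of 14 matches the old Unix limit mentioned in the next section):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define NAMELEN 14          /* old 14-character name limit */

        struct dirent {             /* assumed directory entry layout */
            uint16_t inum;          /* i-number; 0 means "unused slot" */
            char     name[NAMELEN];
        };

        /* Scan a directory's data (here an in-memory array standing in for
         * its data blocks) for 'name'; return the i-number, or 0 if absent. */
        static uint16_t dirlookup(const struct dirent *ents, int n,
                                  const char *name)
        {
            for (int i = 0; i < n; i++)
                if (ents[i].inum != 0 &&
                    strncmp(ents[i].name, name, NAMELEN) == 0)
                    return ents[i].inum;
            return 0;
        }

        int main(void)
        {
            /* data blocks of a directory: <i-number, name> pairs,
             * including "." and ".." */
            struct dirent root[] = {
                { 2, "." }, { 2, ".." },
                { 7, "bin" }, { 9, "tmp" }, { 11, "homes" },
            };
            printf("\"homes\" -> i-number %u\n", dirlookup(root, 5, "homes"));
            printf("\"nope\"  -> i-number %u\n", dirlookup(root, 5, "nope"));
            return 0;
        }

    Resolving a path like /homes/bob just repeats this: look up "homes" in
    the root directory's data, load that inode, look up "bob" in it, and so
    on, one component at a time.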
E. FS Performance

    --the Unix FS was simple, elegant and ... slow

        --blocks too small
            --file index (inode) too large
            --too many layers of mapping indirection
            --transfer rate low (they were getting one block at a time)

        --poor clustering of related objects
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning things like
              "ls" and "grep foo *.c" were slowwwww

        --other problems:
            --14-character names were the limit
            --can't atomically update a file in a crash-proof way

    --FFS (fast file system) fixes these problems to a degree.

    [Reference: M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry.
    A Fast File System for UNIX. ACM Trans. on Computer Systems, Vol. 2,
    No. 3, Aug. 1984, pp. 181-197.]

    what can we do about the above? [ask for suggestions]

    * make the block size bigger (4 KB, 8 KB, or 16 KB)

    * cluster related objects

        "cylinder groups" (one or more consecutive cylinders)

        [ superblock | bookkeeping info | inodes | bitmap |
          data blocks (512 bytes each) ]

        --try to put inodes and data blocks in the same cylinder group
        --try to put all inodes of files in the same directory in the same
          cylinder group
        --new directories are placed in the cylinder group with a greater
          than average number of free inodes
        --as files are allocated, use a heuristic: spill to the next
          cylinder group after 48 KB of file (which would be the point at
          which an indirect block would be required, assuming 4096-byte
          blocks) and at every megabyte thereafter.

    * bitmaps (to track free blocks)

        --Easier to find contiguous blocks
        --Can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4 KB disk blocks = 25,000,000 entries = ~3 MB.
              not outrageous these days.

    * reserve space

        --but don't tell users. (df makes a full disk look 110% full)

    * total performance

        --20-40% of disk bandwidth for large files
        --10-20x the original Unix file system!
        --still not the best we can do (metadata writes happen
          synchronously, which really hurts performance. but making them
          asynchronous requires a story for crash recovery.)

    Others:

    --Most obvious: big file cache
        --the kernel maintains a *buffer cache* in memory
        --internally, all uses of ReadDisk(blockNum, readbuf) are replaced
          with:

            ReadDiskCache(blockNum, readbuf) {
                ptr = buffercache.get(blockNum);
                if (ptr) {
                    copy BLKSIZE bytes from ptr to readbuf
                } else {
                    newBuf = malloc(BLKSIZE);
                    ReadDisk(blockNum, newBuf);
                    buffercache.insert(blockNum, newBuf);
                    copy BLKSIZE bytes from newBuf to readbuf
                }
            }

    --no rotation delay if you're reading the whole track
        --so try to read the whole track
    --more generally, try to work with big chunks (lots of disk blocks)
        --write in big chunks
        --read ahead in big chunks (64 KB)
        --why not just read/write 1 MB at a time?
            --(for writes: may not get data to disk often enough)
            --(for reads: may waste read bandwidth)

F. mmap: memory mapping files

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is a fd?
        --it indexes into a table
        --what's in the given entry in the table?
            --the i-number!
            --the inode, probably!
            --and per-open-file data (file position, etc.)

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags, int fd,
                   off_t offset);

        --map the specified open file (fd) into a region of my virtual
          memory (at addr, or at a kernel-selected place if addr is 0), and
          return a pointer to it
        --after this, loads and stores to addr[offset] are equivalent to
          reading and writing the file at the given offset
          (a small usage sketch appears at the very end of these notes)

    --how's this implemented?!
        --(answer: through virtual memory, with the VA being addr [or
          whatever the kernel selects] and the PA being what? answer: the
          physical address storing the given page in the kernel's buffer
          cache)
        --have to deal with eviction from the buffer cache, but this
          problem is not unique. in all operating systems besides JOS, the
          kernel designers *anyway* have to be able to invalidate VA-->PA
          mappings when a page is removed from RAM

[thanks to David Mazieres for portions of the above]

Midterm.....

    --as you saw, a non-negligible fraction of the exam was on JOS. we
      cross-checked partners' performance. in most cases, things looked
      healthy. however, there were a few cases where the two partners
      seemed to have very different understandings of JOS.
        --please remember that we are serious about pair programming. if we
          see such divergences on the final, it is not going to reflect
          well.
    --scores were a bit lower than I'd expected
    --some statistics:
        --Mean:                69.0
        --Median:              65.5
        --Standard deviation:  15.1
        --Distribution:
            90 - 95   2
            86 - 89   2
            80 - 85   3
            70 - 79   0
            60 - 69   3
            50 - 59   6
            40 - 49   1
    --interpretation:
        --no letter grades yet (sorry)
        --don't panic if you're not happy with your score; there's lots of
          opportunity to bring things up
        --if you didn't do as well as you wanted, don't worry too much....
        --....but do study for the final
    --regardless of how you think you did, *please* make sure you
      understand all the answers; the solutions are posted on the course
      Web page, and they are intended to be helpful here
    --if you have questions, let me know. we tried to be careful, but it's
      possible we made mistakes.
        --please note that a regrade request will generate a regrade of the
          entire exam
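Appendix: back to section F, a minimal usage sketch of mmap. The filename
"notes.txt" is just a placeholder; the point is that plain loads on the
mapped region stand in for read() calls, with each page faulted in from the
kernel's buffer cache on first touch:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* count newlines in a file by mapping it instead of read()ing it */
    int main(void)
    {
        int fd = open("notes.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* map the whole file read-only; kernel picks the address (addr = 0) */
        char *p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)   /* plain loads; page faults
                                                    pull pages from the buffer cache */
            if (p[i] == '\n')
                lines++;

        printf("%ld lines\n", lines);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }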