Class 17
CS372H
27 March 2012

On the board
------------
1. Last time
2. Disks
3. File systems

---------------------------------------------------------------------------

1. Last time

    --Tanenbaum/Torvalds
        --one of you posted about this on stackoverflow. great to see it
          climb up there, on reddit, and on Hacker News.
    --reminder about background section tomorrow

2. Disks

    --try to make all reads and writes contiguous and sequential

    [Reference: "An Introduction to Disk Drive Modeling", by Chris Ruemmler
    and John Wilkes. IEEE Computer, Vol. 27, No. 3, 1994, pp. 17-28.]

    --delays: seek, rotational, transfer
        --transfer rate keeps growing (85 MB/s and up)
        --seek and rotational delays aren't really shrinking
            rotational: 10,000 rotations/min = 166 rotations/sec
                        ==> 6 ms / rotation
            avg seek: ~4 ms
    --disk accesses are a huge system bottleneck and getting worse (in
      systems that have disks)
    --Bandwidth increase lets the system (pre-)fetch large chunks for about
      the same cost as a small chunk.
    --So trade latency for bandwidth if you can get lots of related stuff
      at roughly the same time. How to do that?
        --By clustering the related stuff together on the disk
    --The saving grace for big systems is that memory size is increasing
      faster than typical workload size
        --result: more and more of the workload fits in the file cache,
          which in turn means that the profile of traffic to the disk has
          changed: now mostly writes and new data
        --which means logging and journaling become viable (more on this
          next class)
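    Before moving on to file systems, here is a back-of-the-envelope sketch
    of why seek and rotation dominate, using the rough numbers above (4 ms
    seek, 6 ms/rotation, 85 MB/s transfer are the figures quoted in these
    notes, not measurements of any particular drive):

        #include <stdio.h>

        /* Rough disk timing model using the assumed numbers above:
         * ~4 ms avg seek, 6 ms per rotation (so ~3 ms avg rotational
         * delay), ~85 MB/s sustained transfer. */
        int main(void) {
            double seek_ms = 4.0;
            double rot_ms  = 6.0 / 2.0;   /* wait half a rotation on average */
            double xfer_ms_per_kb = 1000.0 / (85.0 * 1024.0);

            double small = seek_ms + rot_ms + 4    * xfer_ms_per_kb;  /* random 4 KB read */
            double big   = seek_ms + rot_ms + 1024 * xfer_ms_per_kb;  /* sequential 1 MB read */

            printf("random 4 KB read:     %.2f ms (%.1f MB/s effective)\n",
                   small, (4.0 / 1024.0) / (small / 1000.0));
            printf("sequential 1 MB read: %.2f ms (%.1f MB/s effective)\n",
                   big, 1.0 / (big / 1000.0));
            return 0;
        }

    The positioning cost is almost entirely wasted on a small random read
    (well under 1 MB/s effective), which is exactly the argument for
    clustering related data and fetching it in big chunks.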
3. File systems

    A. Intro
    B. Files
        1. contiguous
        2. linked files
        3. FAT
        4. indexed files
        5. indexed files, take two
    D. Directories
    E. FS performance
    F. mmap

A. Intro

    --more papers on FSs than on any other single topic
    --probably also the hardest part of operating systems
    --what does a FS do?
        --provides persistence (doesn't go away ... ever)
        --somehow associates bytes on the disk with names (files)
        --somehow associates names with each other (directories)
    --where are FSes implemented?
        --can implement them on disk, over the network, in memory, in NVRAM
          (non-volatile RAM), on tape, with paper (!!!!)
        --we are going to focus on the disk and generalize later. we'll see
          what it means to implement a FS over the network
    --a few quick notes about disks in the context of FS design
        --disk is the first thing we've seen that (a) doesn't go away; and
          (b) we can modify (BIOS ROM, hardware configuration, etc. don't
          go away, but we weren't able to modify those things). two
          implications here:
            (i) we're going to have to put all of our important state on
                the disk
            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't happen
                 again. scribble randomly on the disk --> now what?
                 (answer: in many cases, we're hosed.)
        --mismatch: CPU and memory are *also* working with "important
          state", but they are vastly faster than disks
        --disk is enormous: 100-1000x more data than memory
            --how to organize all of this information?
            --answer is by categorizing things (taxonomies). a FS is a kind
              of taxonomy ("/homes" has home directories,
              "/homes/bob/classes/cs372h" has bob's cs372h material, etc.)

B. Files

 * Intro

    --what is a file?
        --answer from the user's view: a bunch of named bytes on the disk
        --answer from the FS's view: a collection of disk blocks
        --big job of a FS: map names and offsets to disk blocks:

                              FS
              {file, offset} ----> disk address

    --operations are create(file), delete(file), read(), write()
    --***goal: operations have as few disk accesses as possible and
       minimal space overhead
    --wait, why do we want minimal space overhead, given that the disk is
      huge?
        --answer: cache space is never enough; the amount of data that can
          be retrieved in one fetch is never enough. hence, we really don't
          want to waste space.

    [[--note that we have seen translation/indirection before:

        page table:
                          page table
          virtual address ----------> physical address

        per-file metadata:
                  inode
          offset ------> disk block address

        how'd we get the inode?

        directory:
                     directory
          file name ----------> file #
                                (a file # *is* an inode number in Unix)
    ]]

 * Implementing files

    --our task: meet the goal marked *** above.
    --for now, we're going to assume that the file's metadata is given to
      us. when we look at directories in a bit, we'll see where the
      metadata comes from; the picture above should also give a hint.

    access patterns we could imagine supporting:

        (i) Sequential:
            --File data processed in sequential order
            --By far the most common mode
            --Example: editor writes out a new file, compiler reads in a
              file, etc.

        (ii) Random access:
            --Address any block in the file directly, without passing
              through the blocks before it
            --Examples: large data set, demand paging, databases

        (iii) Keyed access:
            --Search for a block with particular values
            --Examples: associative database, index
            --This thing is everywhere in the field of databases and
              search engines, but...
            --...usually not provided by a FS in the OS

    helpful observations:

        (i) All blocks in a file tend to be used together, sequentially
        (ii) All files in a directory tend to be used together
        (iii) All *names* in a directory tend to be used together

    further design parameters:

        (i) Most files are small
        (ii) Much of the disk is allocated to large files
        (iii) Many of the I/O operations are made to large files
        (iv) Want good sequential and good random access

    candidate designs........

    1. contiguous allocation

        "extent based"
        --when creating a file, make the user pre-specify its length, and
          allocate all of the space at once
        --file metadata contains location and size
        --example: IBM OS/360

            [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple
        +: fast access, both sequential and random
        -: fragmentation

        where have we seen something similar? (answer: segmentation in
        virtual memory)

    2. linked files

        --keep a linked list of free blocks
        --metadata: pointer to the file's first block
        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access is easy (and probably mostly fast, assuming
           decent free space management, since the pointers will point
           close by)
        -: random access is a disaster
        -: pointers take up room in blocks; messes up alignment of data

    3. modification of linked files: FAT

        --keep the link structure in memory
            --in a fixed-size "FAT" (file allocation table)
            --pointer chasing now happens in RAM
        [DRAW PICTURE]
        --example: MS-DOS (and iPods, MP3 players, digital cameras)

        +: no need to maintain a separate free list (the table says what's
           free)
        +: low space overhead
        -: maximum size limited.
            64K entries * 512-byte blocks --> 32 MB max file system
            bigger blocks bring advantages and disadvantages, and ditto a
            bigger table

        note: to guard against bad sectors, better store multiple copies of
        the FAT on the disk!!
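        To make the pointer chasing concrete, here is a minimal, self-
        contained sketch of a FAT-style scheme (the tiny block size, the
        fat[] array, and the block numbers are all made up for
        illustration). The chain lives in the in-memory table, so finding
        block i of a file takes i steps through the table, but no disk
        reads for the chain itself:

            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            #define BLKSIZE  16          /* tiny blocks so the demo is readable */
            #define NBLOCKS  8
            #define FAT_EOF  0xFFFF      /* assumed end-of-chain marker */

            /* a pretend disk and its in-memory FAT */
            static char     disk[NBLOCKS][BLKSIZE];
            static uint16_t fat[NBLOCKS];     /* fat[b] = block after b */

            /* Return block 'bn' of the file whose first block is 'start':
             * walk the chain in RAM, then "access the disk" for the block. */
            static char *read_file_block(uint16_t start, int bn)
            {
                uint16_t b = start;
                for (int i = 0; i < bn; i++) {   /* bn RAM lookups */
                    if (b == FAT_EOF) return NULL;
                    b = fat[b];
                }
                return (b == FAT_EOF) ? NULL : disk[b];
            }

            int main(void)
            {
                /* file f occupies blocks 5 -> 2 -> 7 (not contiguous) */
                uint16_t f = 5;
                fat[5] = 2; fat[2] = 7; fat[7] = FAT_EOF;
                strcpy(disk[5], "one ");
                strcpy(disk[2], "two ");
                strcpy(disk[7], "three");

                for (int bn = 0; bn < 3; bn++)
                    printf("block %d of f: \"%s\"\n", bn,
                           read_file_block(f, bn));
                return 0;
            }

        Sequential reads walk the chain once; a random read at a large
        offset has to walk the whole prefix of the chain, which is tolerable
        only because the table is in memory.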
    4. indexed files

        [DRAW PICTURE]

        --Each file has an array holding all of its block pointers
            --like a page table, so similar issues crop up
        --Allocate this array on file creation
        --Allocate blocks on demand (using the free list)

        +: sequential and random access are both easy
        -: need to somehow store the array
            --large possible file size --> lots of unused entries in the
              block array
            --large actual block size --> huge contiguous disk chunk needed
            --solve the problem the same way we did for page tables:

                [............]   [..........]   [.........]
                  [ block           block          block ]

        --okay, so now we're not wasting disk blocks, but what's the
          problem? (answer: equivalent issues as for page tables: here,
          it's extra disk accesses to look up the blocks)

    5. indexed files, take two

        --classic Unix file system
        --inode contains:
            permissions
            times for file access, file modification, and inode change
            link count (# of directories containing the file)
            ptr 1  --> data block
            ptr 2  --> data block
            ptr 3  --> data block
            .....
            ptr 11 --> indirect block
                         ptr --> ptr --> ptr --> ptr --> ptr -->
            ptr 12 --> indirect block
            ptr 13 --> double indirect block
            ptr 14 --> triple indirect block

        +: Simple, easy to build, fast access to small files
        +: Maximum file length can be enormous, with multiple levels of
           indirection
        -: worst case # of accesses pretty bad
        -: worst case overhead (such as an 11-block file) pretty bad
        -: Because you allocate blocks by taking them off an unordered
           free list, metadata and data get strewn across the disk

        Notes about inodes:

        --stored in a fixed-size array
            --Size of the array is fixed when the disk is initialized;
              can't be changed
            --Multiple inodes per disk block
            --Lives in a known location, originally at one side of the
              disk, now lives in pieces across the disk (helps keep
              metadata close to data)
        --The index of an inode in the inode array is called an
          ***i-number***
        --Internally, the OS refers to files by i-number
        --When a file is opened, its inode is brought into memory
        --Written back when modified and the file is closed or time elapses
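        A concrete way to see this mapping is the routine that turns a file
        offset into a disk block number. This is a simplified sketch: it
        handles only the direct pointers and one single-indirect block, the
        names (NDIRECT, read_block, the block numbers) are made up, and the
        fake read_block stands in for a real disk/buffer-cache read. The
        double and triple indirect cases work the same way, one extra level
        of lookup each:

            #include <stdint.h>
            #include <stdio.h>

            #define BLKSIZE   4096
            #define NDIRECT   10                           /* assumed: first 10 ptrs direct */
            #define NINDIRECT (BLKSIZE / sizeof(uint32_t)) /* ptrs per indirect block */

            struct inode {                   /* simplified in-memory inode */
                uint32_t direct[NDIRECT];    /* direct block pointers */
                uint32_t indirect;           /* single-indirect block pointer */
                /* ... double/triple indirect, permissions, times, link count ... */
            };

            /* Stand-in for reading the indirect block off disk; for the
             * demo, indirect block 99 maps slot i to block 1000+i. */
            static void read_block(uint32_t blockno, uint32_t *ptrs)
            {
                for (uint32_t i = 0; i < NINDIRECT; i++)
                    ptrs[i] = (blockno == 99) ? 1000 + i : 0;
            }

            /* Return the disk block holding byte 'off' of file 'ip', or 0
             * if the offset is beyond direct + single-indirect reach. */
            static uint32_t bmap(const struct inode *ip, uint32_t off)
            {
                uint32_t bn = off / BLKSIZE;      /* logical block # in file */
                if (bn < NDIRECT)
                    return ip->direct[bn];        /* no extra disk access */
                bn -= NDIRECT;
                if (bn < NINDIRECT) {
                    uint32_t ptrs[NINDIRECT];
                    read_block(ip->indirect, ptrs);   /* one extra disk access */
                    return ptrs[bn];
                }
                return 0;                         /* double/triple indirect omitted */
            }

            int main(void)
            {
                struct inode ino = {
                    .direct = {500,501,502,503,504,505,506,507,508,509},
                    .indirect = 99 };
                printf("offset 0     -> block %u\n", bmap(&ino, 0));      /* direct */
                printf("offset 45000 -> block %u\n", bmap(&ino, 45000));  /* via indirect */
                return 0;
            }

        Small files are reached with no extra accesses; big files pay one or
        more extra reads per lookup, which is the "worst case # of accesses"
        complaint above.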
D. Directories

    --Problem: "Spend all day generating data, come back the next morning,
      want to use it." -- F. Corbato, on why files/dirs were invented.

    --Approach 0: Have users remember where on disk their files are
        --like remembering your social security or bank account #
        --yuck. (people want human-friendly names.)

    --So use directories to map names to file blocks, somehow
    --But what is in a directory?

    --A short history of directories:

    --Approach 1: Single directory for the entire system
        --Put the directory at a known location on disk
        --Directory contains <file name, file #> pairs
        --If one user uses a name, no one else can
        --Many ancient personal computers worked this way

    --Approach 2: Single directory for each user
        --Still clumsy, and "ls" on 10,000 files is a real pain
        --(But some oldtimers still work this way)

    --Approach 3: Hierarchical name spaces
        --Allow a directory to map names to files ***or other dirs***
        --File system forms a tree (or graph, if links are allowed)
        --Large name spaces tend to be hierarchical
            --examples: IP addresses (will come up in the networking unit),
              domain names, scoping in programming languages, etc.
        --more generally, the concept of hierarchy is everywhere in
          computer systems

    --Hierarchical Unix

        --used since CTSS (1960s); Unix picked it up and used it nicely
        --structure like:

                            "/"
                 /     |     |     |     \
               bin   cdrom  dev   sbin   tmp
              /   \
            awk  chmod  ....

        --directories are stored on disk just like regular files
            --here's the data in a directory file; this data is in the
              *data blocks* of the directory:

                [<i-number, file name>, <i-number, file name>, ....]

            --the i-node for a directory contains a special flag bit
            --only special users can write directory files
        --key point: an i-number might reference another directory
            --this neatly turns the FS into a hierarchical tree, with
              almost no work
        --another nice thing about this: if you speed up file operations,
          you also speed up directory operations, because directories are
          just like files
        --bootstrapping: where do you start looking?
            --root dir is always inode #2 (0 and 1 are reserved)
        --and, voila, we have a namespace!
            --special names: "/", ".", ".."
            --given those names, we need only two operations to navigate
              the entire name space:
                --"cd name": (change context to directory "name")
                --"ls": (list all names in the current directory)
        --example: [DRAW PICTURE]

    --links:

        --hard link: multiple directory entries point to the same inode;
          the inode contains a refcount

            "ln a b": creates a synonym ("b") for file ("a")

            --how do we avoid cycles in the graph?
              (answer: can't hard link to directories)

        --soft link: synonym for a *name*

            "ln -s /d/a b":
                --creates a new inode, not just a new directory entry
                --the new inode has a "sym link" bit set
                --the contents of that new file: "/d/a"
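    Since a directory is just a file whose data blocks hold
    <i-number, name> pairs, name lookup is nothing more than scanning those
    entries for a match. A minimal sketch (the struct dirent layout and the
    sample entries are illustrative assumptions, not a real on-disk format;
    NAMELEN of 14 matches the old Unix limit mentioned in the next section):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define NAMELEN 14          /* old 14-character name limit */

        struct dirent {             /* assumed directory entry layout */
            uint16_t inum;          /* i-number; 0 means "unused slot" */
            char     name[NAMELEN];
        };

        /* Scan a directory's data (here an in-memory array standing in for
         * its data blocks) for 'name'; return the i-number, or 0 if absent. */
        static uint16_t dirlookup(const struct dirent *ents, int n,
                                  const char *name)
        {
            for (int i = 0; i < n; i++)
                if (ents[i].inum != 0 &&
                    strncmp(ents[i].name, name, NAMELEN) == 0)
                    return ents[i].inum;
            return 0;
        }

        int main(void)
        {
            /* data blocks of a directory: <i-number, name> pairs,
             * including "." and ".." */
            struct dirent root[] = {
                { 2, "." }, { 2, ".." },
                { 7, "bin" }, { 9, "tmp" }, { 11, "homes" },
            };
            printf("\"homes\" -> i-number %u\n", dirlookup(root, 5, "homes"));
            printf("\"nope\"  -> i-number %u\n", dirlookup(root, 5, "nope"));
            return 0;
        }

    Resolving a path like /homes/bob just repeats this: look up "homes" in
    the root directory's data, load that inode, look up "bob" in it, and so
    on, one component at a time.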
E. FS Performance

    --the Unix FS was simple, elegant and ... slow

        --blocks too small
            --file index (inode) too large
            --too many layers of mapping indirection
            --transfer rate low (they were getting one block at a time)

        --poor clustering of related objects
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning things like
              "ls" and "grep foo *.c" were slowwwww

        --other problems:
            --14-character names were the limit
            --can't atomically update a file in a crash-proof way

    --FFS (fast file system) fixes these problems to a degree.

    [Reference: M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry.
    A Fast File System for UNIX. ACM Trans. on Computer Systems, Vol. 2,
    No. 3, Aug. 1984, pp. 181-197.]

    what can we do about the above? [ask for suggestions]

    * make the block size bigger (4 KB, 8 KB, or 16 KB)

    * cluster related objects

        "cylinder groups" (one or more consecutive cylinders)

        [ superblock | bookkeeping info | inodes | bitmap |
          data blocks (512 bytes each) ]

        --try to put inodes and data blocks in the same cylinder group
        --try to put all inodes of files in the same directory in the same
          cylinder group
        --new directories are placed in the cylinder group with a greater
          than average number of free inodes
        --as files are allocated, use a heuristic: spill to the next
          cylinder group after 48 KB of file (which would be the point at
          which an indirect block would be required, assuming 4096-byte
          blocks) and at every megabyte thereafter.

    * bitmaps (to track free blocks)

        --Easier to find contiguous blocks
        --Can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4 KB disk blocks = 25,000,000 entries = ~3 MB.
              not outrageous these days.

    * reserve space

        --but don't tell users. (df makes a full disk look 110% full)

    * total performance

        --20-40% of disk bandwidth for large files
        --10-20x the original Unix file system!
        --still not the best we can do (metadata writes happen
          synchronously, which really hurts performance. but making them
          asynchronous requires a story for crash recovery.)

    Others:

    --Most obvious: big file cache
        --the kernel maintains a *buffer cache* in memory
        --internally, all uses of ReadDisk(blockNum, readbuf) are replaced
          with:

            ReadDiskCache(blockNum, readbuf) {
                ptr = buffercache.get(blockNum);
                if (ptr) {
                    copy BLKSIZE bytes from ptr to readbuf
                } else {
                    newBuf = malloc(BLKSIZE);
                    ReadDisk(blockNum, newBuf);
                    buffercache.insert(blockNum, newBuf);
                    copy BLKSIZE bytes from newBuf to readbuf
                }
            }

    --no rotation delay if you're reading the whole track
        --so try to read the whole track
    --more generally, try to work with big chunks (lots of disk blocks)
        --write in big chunks
        --read ahead in big chunks (64 KB)
        --why not just read/write 1 MB at a time?
            --(for writes: may not get data to disk often enough)
            --(for reads: may waste read bandwidth)

F. mmap: memory mapping files

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is a fd?
        --it indexes into a table
        --what's in the given entry in the table?
            --the i-number!
            --the inode, probably!
            --and per-open-file data (file position, etc.)

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags, int fd,
                   off_t offset);

        --map the specified open file (fd) into a region of my virtual
          memory (at addr, or at a kernel-selected place if addr is 0), and
          return a pointer to it
        --after this, loads and stores to addr[offset] are equivalent to
          reading and writing the file at the given offset
          (a small usage sketch appears at the very end of these notes)

    --how's this implemented?!
        --(answer: through virtual memory, with the VA being addr [or
          whatever the kernel selects] and the PA being what? answer: the
          physical address storing the given page in the kernel's buffer
          cache)
        --have to deal with eviction from the buffer cache, but this
          problem is not unique. in all operating systems besides JOS, the
          kernel designers *anyway* have to be able to invalidate VA-->PA
          mappings when a page is removed from RAM

[thanks to David Mazieres for portions of the above]

Midterm.....

    --as you saw, a non-negligible fraction of the exam was on JOS. we
      cross-checked partners' performance. in most cases, things looked
      healthy. however, there were a few cases where the two partners
      seemed to have very different understandings of JOS.
        --please remember that we are serious about pair programming. if we
          see such divergences on the final, it is not going to reflect
          well.
    --scores were a bit lower than I'd expected
    --some statistics:
        --Mean:                69.0
        --Median:              65.5
        --Standard deviation:  15.1
        --Distribution:
            90 - 95   2
            86 - 89   2
            80 - 85   3
            70 - 79   0
            60 - 69   3
            50 - 59   6
            40 - 49   1
    --interpretation:
        --no letter grades yet (sorry)
        --don't panic if you're not happy with your score; there's lots of
          opportunity to bring things up
        --if you didn't do as well as you wanted, don't worry too much....
        --....but do study for the final
    --regardless of how you think you did, *please* make sure you
      understand all the answers; the solutions are posted on the course
      Web page, and they are intended to be helpful here
    --if you have questions, let me know. we tried to be careful, but it's
      possible we made mistakes.
        --please note that a regrade request will generate a regrade of the
          entire exam
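Appendix: back to section F, a minimal usage sketch of mmap. The filename
"notes.txt" is just a placeholder; the point is that plain loads on the
mapped region stand in for read() calls, with each page faulted in from the
kernel's buffer cache on first touch:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* count newlines in a file by mapping it instead of read()ing it */
    int main(void)
    {
        int fd = open("notes.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* map the whole file read-only; kernel picks the address (addr = 0) */
        char *p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)   /* plain loads; page faults
                                                    pull pages from the buffer cache */
            if (p[i] == '\n')
                lines++;

        printf("%ld lines\n", lines);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }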