Class 18
CS 372H 29 March 2011

On the board
------------

1. Last time

2. File systems, continued
    A. [last time] Intro
    B. [last time] Files
    C. [last time, today] Implementing files
        1. [last time] contiguous
        2. [last time] linked files
        3. [last time] FAT
        4. indexed files
    D. Directories
    E. FS performance
    F. mmap

---------------------------------------------------------------------------

1. Last time

2. File systems, continued

    per-file metadata: inode

        offset ------> disk block address

        [the inode is a case of something more general: a map from
        file offsets to disk block addresses]

C. Implementing files

    remember, we're assuming the metadata is provided, and the question
    is "what does it look like?". we'll discuss in "directories" below
    where the metadata comes from.

    4. indexed files

        [DRAW PICTURE]

        --Each file has an array holding all of its block pointers
            --like a page table, so similar issues crop up
        --Allocate this array on file creation
        --Allocate blocks on demand (using a free list)

        +: sequential and random access are both easy

        -: need to somehow store the array
            --large possible file size --> lots of unused entries in
              the block array
            --large actual file size --> the array itself is huge, so
              a big contiguous disk chunk is needed to hold it
            --solve the problem the same way we did for page tables,
              with a multi-level structure:

                      [............]
                 [..........]  [.........]
                 [ block block block ]

        --okay, so now we're not wasting disk blocks, but what's the
          problem? (answer: issues equivalent to those for page
          tables: here, it's extra disk accesses to look up the
          blocks)

    5. indexed files, take two

        --classic Unix file system
        --inode contains:
            permissions
            times for file access, file modification, and inode-change
            link count (# of directories containing the file)
            ptr 1  --> data block
            ptr 2  --> data block
            ptr 3  --> data block
            .....
            ptr 11 --> indirect block
                           [ptr --> data block]
                           [ptr --> data block]
                           ....
            ptr 12 --> indirect block
            ptr 13 --> double indirect block
            ptr 14 --> triple indirect block

          (see the sketch below for how a byte offset maps through
          these pointers)

        +: Simple, easy to build, fast access to small files
        +: Maximum file length can be enormous, with multiple levels
           of indirection

        -: worst case # of accesses pretty bad (several pointer-block
           reads before reaching the data)
        -: worst case overhead pretty bad (an 11-block file burns a
           whole indirect block to hold a single pointer)
        -: because you allocate blocks by taking them off an unordered
           free list, metadata and data get strewn across the disk
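        --to make the pointer scheme concrete, here is a minimal C
          sketch of the offset-to-block mapping (this is not the
          actual Unix source: NDIRECT, PTRS_PER_BLOCK, the struct
          layout, and read_disk_block() are invented names, and the
          double/triple indirect cases are omitted; each one adds one
          more disk read, following the same pattern):

            /* sketch only: illustrative names, not real Unix code */
            #include <stdint.h>

            #define BLKSIZE        4096
            #define NDIRECT        10                   /* direct ptrs */
            #define PTRS_PER_BLOCK (BLKSIZE / sizeof(uint32_t))

            struct inode {
                uint32_t direct[NDIRECT]; /* each --> a data block     */
                uint32_t indirect;        /* --> a block of block ptrs */
                /* double/triple indirect ptrs omitted in this sketch  */
            };

            /* assumed helper: read disk block 'blknum' into 'buf' */
            extern void read_disk_block(uint32_t blknum, void *buf);

            /* return the disk block number holding byte 'offset' */
            uint32_t
            offset_to_block(const struct inode *ip, uint32_t offset)
            {
                uint32_t bn = offset / BLKSIZE; /* logical block index */

                if (bn < NDIRECT)
                    return ip->direct[bn];   /* no extra disk access */

                bn -= NDIRECT;
                if (bn < PTRS_PER_BLOCK) {   /* singly indirect */
                    uint32_t ptrs[PTRS_PER_BLOCK];
                    read_disk_block(ip->indirect, ptrs); /* extra access */
                    return ptrs[bn];
                }

                return 0;  /* doubly/triply indirect: same idea, one
                              more read_disk_block() per level */
            }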
    Notes about inodes:
        --stored in a fixed-size array
        --size of the array is fixed when the disk is initialized;
          it can't be changed
        --multiple inodes per disk block
        --lives in a known location: originally at one side of the
          disk, now lives in pieces spread across the disk (helps keep
          metadata close to data)
        --the index of an inode in the inode array is called an
          ***i-number***
        --internally, the OS refers to files by i-number
        --when a file is opened, its inode is brought into memory
        --written back when modified, and when the file is closed or
          after enough time elapses

D. Directories

    --Problem: "Spend all day generating data, come back the next
      morning, want to use it." -- F. Corbato, on why files/dirs were
      invented.

    --Approach 0: Have users remember where on disk their files are
        --like remembering your social security or bank account #
        --yuck. (people want human-friendly names.)

    --So use directories to map names to file blocks, somehow
    --But what is in a directory?

    --A short history of directories:

        --Approach 1: Single directory for entire system
            --Put directory at a known location on disk
            --Directory contains <file name, location> pairs
            --If one user uses a name, no one else can
            --Many ancient personal computers worked this way

        --Approach 2: Single directory for each user
            --Still clumsy, and "ls" on 10,000 files is a real pain
            --(But some oldtimers still work this way)

        --Approach 3: Hierarchical name spaces
            --Allow a directory to map names to files ***or other
              dirs***
            --File system forms a tree (or a graph, if links allowed)
            --Large name spaces tend to be hierarchical
                --examples: IP addresses (will come up in the
                  networking unit), domain names, scoping in
                  programming languages, etc.
            --more generally, the concept of hierarchy is everywhere
              in computer systems

    --Hierarchical Unix

        --used since CTSS (1960s), and Unix picked it up and used it
          nicely
        --structure like:

                        "/"
               /   |    |    |    \
             bin cdrom dev  sbin  tmp
             /  \
           awk  chmod ....

        --directories are stored on disk just like regular files
        --here's the data in a directory file; this data is in the
          *data blocks* of the directory (see the lookup sketch at the
          end of this section):

            [<file name, i-number>]
            [<file name, i-number>]
            ....

        --the i-node for a directory contains a special flag bit
        --only special users can write directory files
        --key point: the i-number might reference another directory
            --this neatly turns the FS into a hierarchical tree, with
              almost no extra work
            --another nice thing about this: if you speed up file
              operations, you also speed up directory operations,
              because directories are just like files
        --bootstrapping: where do you start looking?
            --root dir is always inode #2 (0 and 1 are reserved)
        --and, voila, we have a namespace!
            --special names: "/", ".", ".."
            --given those names, we need only two operations to
              navigate the entire name space:
                --"cd name": (change context to directory "name")
                --"ls": (list all names in the current directory)
        --example: [DRAW PICTURE]

        --links:

            --hard link: multiple dir entries point to the same inode;
              the inode contains a refcount

                "ln a b": creates a synonym ("b") for file ("a")

                --how do we avoid cycles in the graph? (answer: can't
                  hard link to directories)

            --soft link: synonym for a *name*

                "ln -s /d/a b":
                    --creates a new inode, not just a new directory
                      entry
                    --the new inode has a "sym link" bit set
                    --contents of that new file: "/d/a"
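        --to make directory lookup concrete, here is a minimal C
          sketch (illustrative only: the entry layout echoes the old
          2-byte i-number + 14-character-name Unix entries, and
          read_file_block() is an assumed helper, not the real kernel
          interface). resolving a path like "/d/a" is just this lookup
          in a loop, starting from the root directory's i-number:

            /* sketch only: layout and helper are assumptions */
            #include <stdint.h>
            #include <string.h>

            #define BLKSIZE 512
            #define NAMELEN 14

            struct dirent {             /* one <name, i-number> pair */
                uint16_t inum;          /* 0 means "unused slot"     */
                char     name[NAMELEN]; /* null-padded, not always
                                           null-terminated           */
            };

            /* assumed helper: read logical block 'bn' of the file
               with i-number 'dir_inum' into 'buf' (a directory is
               just a file, so this is the ordinary file read path);
               returns # of valid bytes, 0 at end of file */
            extern int read_file_block(uint16_t dir_inum, uint32_t bn,
                                       void *buf);

            /* return the i-number bound to 'name' in directory
               'dir_inum', or 0 if not found */
            uint16_t
            dir_lookup(uint16_t dir_inum, const char *name)
            {
                char buf[BLKSIZE];

                for (uint32_t bn = 0; ; bn++) {
                    int n = read_file_block(dir_inum, bn, buf);
                    if (n <= 0)
                        return 0;          /* searched whole dir */
                    struct dirent *de = (struct dirent *) buf;
                    for (int i = 0; i < n / (int) sizeof(*de); i++)
                        if (de[i].inum != 0 &&
                            strncmp(de[i].name, name, NAMELEN) == 0)
                            return de[i].inum;   /* found it */
                }
            }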
E. FS Performance

    --Unix FS was simple, elegant, and ... slow

        --blocks too small
        --file index (inode) too large
        --too many layers of mapping indirection
        --transfer rate low (they were getting one block at a time)
        --poor clustering of related objects:
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning things
              like "ls" and "grep foo *.c" were slowwwww

        --other problems:
            --14-character names were the limit
            --can't atomically update a file in a crash-proof way

    --FFS (fast file system) fixes these problems to a degree.

        [Reference: M. K. McKusick, W. N. Joy, S. J. Leffler, and
        R. S. Fabry. A Fast File System for UNIX. ACM Trans. on
        Computer Systems, Vol. 2, No. 3, Aug. 1984, pp. 181-197.]

        what can we do about the problems above? [ask for suggestions]

        * make the block size bigger (4 KB, 8 KB, or 16 KB)

        * cluster related objects

            "cylinder groups" (one or more consecutive cylinders)

            [superblock | bookkeeping info | inodes | bitmap |
             data blocks (512 bytes each) ]

            --try to put inodes and data blocks in the same cylinder
              group
            --try to put all inodes of files in the same directory in
              the same cylinder group
            --new directories are placed in the cylinder group with a
              greater-than-average number of free inodes
            --as files are allocated, use a heuristic: spill to the
              next cylinder group after 48 KB of a file (which would
              be the point at which an indirect block would be
              required, assuming 4096-byte blocks) and at every
              megabyte thereafter

        * bitmaps (to track free blocks)

            --easier to find contiguous blocks
            --can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4 KB disk blocks = 25,000,000 bits, which
              is about 3 MB. not outrageous these days.

        * reserve space

            --but don't tell users. (df makes a full disk look 110%
              full)

        * total performance:

            --20-40% of disk bandwidth for large files
            --10-20x the performance of the original Unix file system!
            --still not the best we can do (metadata writes happen
              synchronously, which really hurts performance. but
              making them asynchronous requires a story for crash
              recovery.)

    Other performance techniques:

        --Most obvious: big file cache
            --kernel maintains a *buffer cache* in memory
            --internally, all uses of ReadDisk(blockNum, readbuf) are
              replaced with:

                ReadDiskCache(blockNum, readbuf) {
                    ptr = buffercache.get(blockNum);
                    if (ptr) {
                        copy BLKSIZE bytes from ptr to readbuf
                    } else {
                        newBuf = malloc(BLKSIZE);
                        ReadDisk(blockNum, newBuf);
                        buffercache.insert(blockNum, newBuf);
                        copy BLKSIZE bytes from newBuf to readbuf
                    }
                }

        --no rotation delay if you're reading the whole track
            --so try to read the whole track
        --more generally, try to work with big chunks (lots of disk
          blocks)
            --write in big chunks
            --read ahead in big chunks (64 KB)
            --why not just read/write 1 MB at a time?
                --(for writes: may not get data to disk often enough)
                --(for reads: may waste read bandwidth)

F. mmap: memory mapping files

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is an fd?
        --it indexes into a table
        --what's in the given entry in the table?
            --inumber!
            --inode, probably!
            --and per-open-file data (file position, etc.)

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags,
                   int fd, off_t offset);

        --map the specified open file (fd) into a region of my virtual
          memory (at addr, or at a kernel-selected place if addr is
          0), and return a pointer to it

        --after this, loads and stores to addr[i] are equivalent to
          reading and writing the file at offset+i

    --how's this implemented?!

        (answer: through virtual memory, with the VA being addr [or
        whatever the kernel selects] and the PA being what? answer:
        the physical address storing the given page in the kernel's
        buffer cache)

        --have to deal with eviction from the buffer cache, but this
          problem is not unique. in all operating systems besides JOS,
          the kernel designers *anyway* have to be able to invalidate
          VA-->PA mappings when a page is removed from RAM
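    --to make the interface concrete, here is a small usage sketch
      that prints a file's first bytes through a mapping instead of
      read(); the calls (open, fstat, mmap, munmap) are the standard
      POSIX ones, and error handling is abbreviated:

        /* usage sketch: read a file's first bytes via mmap */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
            if (argc < 2)
                return 1;

            int fd = open(argv[1], O_RDONLY);
            struct stat st;
            fstat(fd, &st);              /* need the file's length */

            /* addr = 0, so the kernel selects the place */
            char *p = mmap(0, st.st_size, PROT_READ, MAP_SHARED,
                           fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            /* these loads fault pages in from the buffer cache;
               no read() calls */
            for (off_t i = 0; i < st.st_size && i < 32; i++)
                putchar(p[i]);

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }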