Class 15
CS 202
31 March 2015

On the board
------------

1. Last time

2. Disks

3. File systems
        intro
        files
        implementing files
            contiguous
            linked
            FAT
            indexed files

4. Return midterms

(summary of concurrent programming models is at the end of the notes)

---------------------------------------------------------------------------

1. Last time

    alternatives to shared-memory preemptively scheduled threads:

        --> user-level threads: shared-memory *non*-preemptively scheduled
            threads

        --> event-driven programming: one thread of control (the event loop).

    advice on concurrency generally is at the end of the notes

2. Go over disk quickly

    --you saw the high-level picture in CS 201; the notes below include some
    detail.

    Geometry of a disk

        --track: circle on a platter. each platter is divided into concentric
        tracks.
        --sector: chunk of a track
        --cylinder: locus of all tracks of fixed radius on all platters
        --Heads are roughly lined up on a cylinder
        --A significant fraction of the encoded stream is devoted to error
        correction
        --Generally only one head is active at a time
            --Disks usually have one set of read-write circuitry
            --Must worry about cross-talk between channels
            --Hard to keep multiple heads exactly aligned
        --disk positioning system
            --Moves the head to a specific track and keeps it there
            --Resists physical shocks, imperfect tracks, etc.
        --a **seek** consists of up to four phases:
            --*speedup*: accelerate the arm to max speed or to the halfway
              point
            --*coast*: at max speed (for long seeks)
            --*slowdown*: stop the arm near the destination
            --*settle*: adjust the head to the actual desired track

        [BTW, this thing can accelerate at up to several hundred g]

    Performance

        (important to understand this if you are building systems that need
        good performance)

        components of a transfer: rotational delay, seek delay, transfer time.

            rotational delay: time for the sector to rotate under the disk head
            seek: speedup, coast, slowdown, settle
            transfer time: will discuss

        discussing seeks in a bit of detail now:

            --seeking track-to-track: comparatively fast (~1 ms). mainly settle
              time
            --short seeks (200-400 cyl.) dominated by speedup
            --longer seeks dominated by coast
            --head switches comparable to short seeks
            --settle time takes longer for writes than for reads. why?
                --because if a read strays, the error will be caught, and the
                  disk can retry
                --if a write strays, some other track just got clobbered. so
                  write settles need to be done precisely
            --note: the "average seek time" quoted can be many things
                --time to seek 1/3 of the disk
                --1/3 of the time to seek the whole disk
                --(convince yourself those may not be the same)

    Common #s

        --capacity: 100s of GB
        --platters: 8
        --number of cylinders: tens of thousands or more
        --sectors per track: ~1000
        --RPM: 10000
        --transfer rate: 50-85 MB/s
        --mean time between failures: ~1 million hours (for disks in data
          centers, it's vastly less; for a provider like Google, even if they
          had very reliable disks, they'd still need an automated way to handle
          failures, because failures would be common (imagine 2 million disks:
          *some* will be on the fritz at any given moment). so what they do is
          buy cheap disks, even lower-quality ones, which saves on hardware
          costs. they get away with it because they *anyway* needed the
          software and systems -- replication and other fault-tolerance
          schemes -- to handle failures.)
    let's work through a disk performance example

        Spindle speed: 7200 RPM
        Avg seek time, read/write: 10.5 ms / 12 ms
        Maximum seek time: 19 ms
        Track-to-track seek time: 1 ms
        Transfer rate (surface to buffer): 54-128 MB/s
        Transfer rate (buffer to host): 375 MB/s

        Two questions:

        (a) How long would it take to do 500 sector reads, spread out randomly
            over the disk (and serviced in FIFO order)?

        (b) How long would it take to do 500 sector reads, laid out
            SEQUENTIALLY on the disk? (FIFO order once more)

        Let's begin with (a), looking at one request:

            (rotation delay + seek time + transfer time) * 500

            rotation delay: 60 s/min * 1 min/7200 rotations = 8.33 ms per
            rotation; on average, you have to wait for half a rotation: 4.17 ms

            seek time: 10.5 ms (given)

            transfer time: 512 bytes * 1 s/54 MB * 1 MB/10^6 bytes = .0095 ms

            **per read**: 4.17 ms + 10.5 ms + .0095 ms = 14.68 ms

            500 reads: 14.68 ms/request * 500 requests = 7.3 seconds

            total throughput: data/time = 35 KB/s

            This is terrible!

        Let's look at (b):

            rotation delay + seek time + 500 * transfer time

            rotation delay: 4.17 ms (same as above)
            seek time: 10.5 ms (same as above)
            transfer time: 500 * .0095 ms = 4.75 ms

            total: 4.17 ms + 10.5 ms + 4.75 ms = 19.4 ms

            total throughput: 13.2 MB/s

            This is much better!

        Takeaway: sequential reads are MUCH MUCH MUCH faster than random reads,
        and we should do everything that we can possibly do to perform
        sequential reads. When you learn about file systems, you'll see that
        this was a very serious concern for file system designers (LFS!).

            --"The secret to making disks fast is to treat them like tape"
              (John Ousterhout).

        What are some things that help this situation?

        - Disk cache used for read-ahead (the disk keeps reading past the last
          host request)
            - otherwise, each successive sequential read would incur a whole
              revolution
            - policy decision: should read-ahead cross track boundaries? a head
              switch cannot be stopped, so there is a cost to aggressive
              read-ahead.

        - Write caching can be a big win!
            - if battery backed: data in the buffer can be overwritten many
              times before actually being put back on the disk. also, many
              writes can be stored, so they can be scheduled more optimally
            - if not battery backed, then there is a policy decision between
              disk and host about whether to report data in the cache as being
              on the disk or not

        - Try to order requests to minimize seek times
            - the OS (or the disk) can only do this if it has multiple requests
              to order
            - requires disk I/O concurrency
            - high-performance apps try to maximize I/O concurrency
            - or avoid I/O except to do write-logging (stick all your data
              structures in memory; write "backup" copies to disk sequentially;
              don't do random-access reads from the disk)
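        If you want to replay the arithmetic above with different drive
        parameters, here is a small C sketch of the same per-request model (it
        just redoes the calculation; it is not a simulation of a real disk):

            #include <stdio.h>

            /* Replays the model used above: each random request pays an
               average rotational delay (half a rotation), an average seek,
               and one sector transfer; the sequential case pays the seek and
               rotation once and then streams the sectors. */
            int main(void) {
                double rpm          = 7200.0;
                double avg_seek_ms  = 10.5;    /* average read seek, as quoted */
                double xfer_MB_s    = 54.0;    /* slowest surface-to-buffer rate */
                double sector_bytes = 512.0;
                int    nreqs        = 500;

                double half_rot_ms = 0.5 * 60.0 * 1000.0 / rpm;               /* ~4.17 ms   */
                double xfer_ms     = sector_bytes / (xfer_MB_s * 1e6) * 1000; /* ~0.0095 ms */

                double random_ms = nreqs * (half_rot_ms + avg_seek_ms + xfer_ms);
                double seq_ms    = half_rot_ms + avg_seek_ms + nreqs * xfer_ms;
                double bytes     = nreqs * sector_bytes;

                printf("random:     %.1f s   (%.1f KB/s)\n",
                       random_ms / 1000.0, bytes / random_ms);   /* bytes/ms == KB/s */
                printf("sequential: %.1f ms  (%.1f MB/s)\n",
                       seq_ms, bytes / 1000.0 / seq_ms);         /* KB/ms == MB/s    */
                return 0;
            }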
    Disk scheduling: not covering in class; you can read about it in the text.
    Some notes below:

        --FCFS: process requests in the order they are received
            +: easy to implement
            +: good fairness
            -: cannot exploit request locality
            -: increases average latency, decreasing throughput

        --SPTF/SSTF/SSF/SJF: shortest positioning time first / shortest seek
          time first: pick the request with the shortest seek time
            +: exploits locality of requests
            +: higher throughput
            -: starvation
            -: don't always know which request will be fastest

          improvement: aged SPTF
            --give older requests priority
            --adjust "effective" seek time with a weighting [no pun intended]
              factor: T_{eff} = T_{pos} - W*T_{wait}

        --Elevator scheduling: like SPTF, but the next seek must be in the same
          direction; switch direction only if there are no further requests in
          the current direction
            +: exploits locality
            +: bounded waiting
            -: cylinders in the middle get better service
            -: doesn't fully exploit locality

          modification: only sweep in one direction, treating all addresses as
          being circular: very commonly used in Unix. (a code sketch of this
          variant appears at the end of this disk discussion.)

    technology and systems trends

        --unfortunately, while seeks and rotational delays are getting a little
          faster, they have not kept up with the huge growth elsewhere in
          computers.
        --transfer bandwidth has grown about 10x per decade
        --the thing that is growing fast is disk density (bytes_stored/$).
          that's because density is less constrained by the mechanical
          limitations.
            --to improve density, need to get the head close to the surface.
            --[aside: what happens if the head contacts the surface? called a
              "head crash": scrapes off the magnetic material ... and, with it,
              the data.]
        --Disk accesses are a huge system bottleneck, and it's getting worse.
          So what to do?
            --The bandwidth increase lets the system (pre-)fetch large chunks
              for about the same cost as a small chunk.
            --So trade latency for bandwidth if you can get lots of related
              stuff at roughly the same time. How to do that?
            --By clustering the related stuff together on the disk. can grab
              huge chunks of data without incurring a big cost, since we
              already paid for the seek + rotation.
        --The saving grace for big systems is that memory size is increasing
          faster than typical workload size
            --result: more and more of the workload fits in the file cache,
              which in turn means that the profile of traffic to the disk has
              changed: now mostly writes and new data.
            --which means logging and journaling become viable (more on this
              over the next few classes)

    how the driver interfaces to the disk

        --Sectors

            --Disk interface presents a linear array of **sectors**

                --generally 512 bytes, written atomically (even if there is a
                  power failure; the disk saves enough momentum to complete the
                  write)
                --larger atomic units have to be synthesized by the OS (will
                  discuss later). this goes for multiple contiguous sectors or
                  even a whole collection of unrelated sectors. the OS will
                  find ways to make such writes *appear* atomic, though, of
                  course, the disk itself can't write more than a sector
                  atomically.
                --analogy to critical sections in code:
                    --> a thread holds a lock for a while, doing a bunch of
                        things. to the other threads, whatever that thread does
                        is atomic: they can observe the state before lock
                        acquisition and after lock release, but not in the
                        middle, even though, of course, the lock-holding thread
                        is really doing a bunch of operations that are not
                        atomic from the processor's perspective

            --disk maps logical sector # to physical sectors

                --Zoning: puts more sectors on longer tracks
                --Track skewing: sector 0 position varies by track, but let the
                  disk worry about it. Why? (for speed when doing sequential
                  access)
                --Sparing: flawed sectors remapped elsewhere

            --all of this is invisible to the OS. stated more precisely, the OS
              does not know the logical-to-physical sector mapping. the OS
              specifies a platter, track, and sector, but who knows where the
              data really is?

                --In any case, a larger logical sector # difference means a
                  larger seek
                --Highly non-linear relationship (*and* it depends on the zone)
                --OS has no info on rotational positions
                --Can empirically build a table to estimate times
                --It turns out that sometimes the logical-->physical sector
                  mapping is what you'd expect.
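    To tie the scheduling policies above to code: below is a minimal sketch of
    the circular ("sweep in one direction") elevator variant. The request
    structure and the function name are made up for illustration; a real driver
    would also merge adjacent requests, handle priorities, and so on.

        #include <stddef.h>

        /* Hypothetical pending-request record; a real driver's differs. */
        struct request {
            unsigned cyl;          /* target cylinder of this request     */
            struct request *next;  /* unordered singly-linked pending list */
        };

        /* Circular elevator (C-SCAN): always sweep toward higher cylinder
           numbers. Pick the pending request with the smallest cylinder >=
           the head's current position; if there is none, wrap around and
           pick the smallest cylinder overall. */
        struct request *pick_next(struct request *pending, unsigned head_cyl) {
            struct request *best = NULL, *wrap = NULL;
            for (struct request *r = pending; r != NULL; r = r->next) {
                if (r->cyl >= head_cyl && (best == NULL || r->cyl < best->cyl))
                    best = r;                  /* candidate in current sweep */
                if (wrap == NULL || r->cyl < wrap->cyl)
                    wrap = r;                  /* candidate after wrapping   */
            }
            return best != NULL ? best : wrap;
        }

    The wrap-around is what evens out the service: plain elevator favors the
    middle cylinders, since the head passes over them twice per sweep cycle.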
3. File systems

    A. intro
    B. files
    C. implementing files
        1. contiguous
        2. linked files
        3. FAT
        4. indexed files
    D. Directories
    E. FS performance

A. intro

    --what does a FS do?

        --provide persistence (don't go away ... ever)
        --somehow associate bytes on the disk with names (files)
        --somehow associate names with each other (directories)

    --where are FSes implemented?

        --can implement them on disk, over the network, in memory, in NVRAM
          (non-volatile RAM), on tape, with paper (!!!!)
        --we are going to focus on the disk and generalize later. we'll see
          what it means to implement a FS over the network

    --a few quick notes about disks in the context of FS design

        --the disk is the first thing we've seen that (a) doesn't go away; and
          (b) we can modify (BIOS ROM, hardware configuration, etc. don't go
          away, but we weren't able to modify these things). two implications
          here:

            (i) we're going to have to put all of our important state on the
                disk

            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't happen
                 again. scribble randomly on the disk --> now what? (answer: in
                 many cases, we're hosed.)

B. Files

    --what is a file?

        --answer from the user's view: a bunch of named bytes on the disk
        --answer from the FS's view: a collection of disk blocks
        --big job of a FS: map names and offsets to disk blocks

                            FS
            {file, offset} ----> disk address

    --operations are create(file), delete(file), read(), write()

    [where does RAID fit in all of this? answer: it plugs into the interface to
    the disk. to a first approximation, a RAID exports a
    sector_read()/sector_write() interface.]

    --***goal: operations have as few disk accesses as possible and minimal
      space overhead

        --wait, why do we want minimal space overhead, given that the disk is
          huge?
        --answer: cache space is never enough, and the amount of data that can
          be retrieved in one fetch is never enough. hence, we really don't
          want to waste space (or fetches) on overhead.

    [[--note that we have seen translation/indirection before:

        page table:

                             page table
            virtual address ----------> physical address

        per-file metadata:

                    inode
            offset ------> disk block address

        how'd we get the inode?

                       directory
            file name ----------> file #

            (the file # *is* the inode number, or i-number, in Unix)
    ]]

C. Implementing files

    --our task: meet the goal marked *** above.

    --NOTE: NOTE: NOTE: for most of today we're going to assume that the file's
      metadata is known to the system

        --> when we look at directories in a bit, we'll see where the metadata
            comes from; the picture above should also give a hint

    access patterns we could imagine supporting:

        (i) Sequential:
            --File data processed in sequential order
            --By far the most common mode
            --Example: editor writes out a new file, compiler reads in a file,
              etc.

        (ii) Random access:
            --Address any block in the file directly, without passing through
              the blocks before it
            --Examples: large data set, demand paging, databases

        (iii) Keyed access:
            --Search for a block with particular values
            --Examples: associative database, index
            --This thing is everywhere in the field of databases and search
              engines, but it is usually not provided by a FS in the OS
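    To make the two layers of translation above concrete, here is a toy,
    purely in-memory sketch; the structures, sizes, and names (dir_lookup,
    inode_block, etc.) are invented for illustration and are not how any
    particular FS lays things out:

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Toy stand-ins for the two translations in the notes:
           directory: name -> file # (i-number);  inode: offset -> "disk block". */

        #define BLOCK_SIZE 512
        #define NPTRS      8

        struct inode  { uint32_t blocks[NPTRS]; };          /* direct pointers only */
        struct dirent { char name[16]; uint32_t inum; };

        static struct inode  inodes[4]  = { [2] = { .blocks = {70, 71, 90, 91} } };
        static struct dirent rootdir[2] = { { "readme.txt", 2 } };

        /* directory: file name --> i-number (-1 if absent) */
        static int dir_lookup(const char *name) {
            for (size_t i = 0; i < sizeof rootdir / sizeof rootdir[0]; i++)
                if (strcmp(rootdir[i].name, name) == 0) return (int)rootdir[i].inum;
            return -1;
        }

        /* inode: byte offset --> disk block number */
        static uint32_t inode_block(const struct inode *ip, uint32_t offset) {
            return ip->blocks[offset / BLOCK_SIZE];
        }

        int main(void) {
            int inum = dir_lookup("readme.txt");
            if (inum < 0) return 1;
            printf("i-number %d; offset 1200 lives in block %u\n",
                   inum, (unsigned)inode_block(&inodes[inum], 1200)); /* 1200/512 = 2 -> block 90 */
            return 0;
        }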
    helpful observations:

        (i) All blocks in a file tend to be used together, sequentially
        (ii) All files in a directory tend to be used together
        (iii) All *names* in a directory tend to be used together

    further design parameters:

        (i) Most files are small
        (ii) Much of the disk is allocated to large files
        (iii) Many of the I/O operations are made to large files
        (iv) Want good sequential and good random access

    candidate designs........

    1. contiguous allocation

        "extent based"

        --when creating a file, make the user pre-specify its length, and
          allocate all the space at once
        --file metadata contains location and size
        --example: IBM OS/360

        [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple
        +: fast access, both sequential and random
        -: fragmentation

        where have we seen something similar? (answer: segmentation in virtual
        memory)

    2. linked files

        --keep a linked list of free blocks
        --metadata: pointer to the file's first block
        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access is easy (and probably mostly fast, assuming decent
           free space management, since the pointers will point close by)
        -: random access is a disaster
        -: pointers take up room in the blocks; messes up alignment of data

    3. modification of linked files: FAT

        --keep the link structure in memory
            --in a fixed-size "FAT" (file allocation table)
            --pointer chasing now happens in RAM

        [DRAW PICTURE; two files; two different colors]

        --example: MS-DOS (and iPods, MP3 players, digital cameras)

        +: no need to maintain a separate free list (the table says what's
           free)
        +: low space overhead
        -: maximum size limited. 64K entries, 512-byte blocks --> 32 MB max
           file system
            bigger blocks bring advantages and disadvantages, and ditto a
            bigger table

        note: to guard against bad sectors, better store multiple copies of the
        FAT on the disk!!
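        A minimal sketch of the pointer chasing this design implies, assuming a
        64K-entry in-memory table and an end-of-chain marker (the marker value
        and the function name are made up for illustration; real FAT variants
        also reserve entries for "free" and "bad"):

            #include <stdint.h>

            #define FAT_ENTRIES 65536      /* 64K entries, as in the estimate above */
            #define FAT_EOF     0xFFFF     /* assumed end-of-chain marker           */

            /* fat[b] holds the number of the block that follows block b in its
               file (or FAT_EOF). The whole table lives in memory, so chasing
               the chain costs no disk accesses; only reading data blocks does. */
            static uint16_t fat[FAT_ENTRIES];

            /* Return the disk block holding block index k of the file whose
               first block is `start`, or FAT_EOF if the file is shorter. */
            uint16_t fat_nth_block(uint16_t start, unsigned k) {
                uint16_t b = start;
                while (k-- > 0 && b != FAT_EOF)
                    b = fat[b];            /* one link followed per file block */
                return b;
            }

        Random access to block k costs k pointer chases in memory plus one disk
        read; contrast with plain linked files, where each chase is itself a
        disk read.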
    4. indexed files

        [DRAW PICTURE]

        --Each file has an array holding all of its block pointers
            --like a page table, so similar issues crop up
        --Allocate this array on file creation
        --Allocate blocks on demand (using the free list)

        +: sequential and random access are both easy
        -: need to somehow store the array
            --large possible file size --> lots of unused entries in the block
              array
            --large actual block size --> huge contiguous disk chunk needed
            --solve the problem the same way we did for page tables, with
              multiple levels of index blocks:

                [............]   [..........]   [.........]
                    [ block  block  block ]

        --okay, so now we're not wasting disk blocks, but what's the problem?
          (answer: equivalent issues as for page tables: here, it's extra disk
          accesses to look up the blocks)

    [finish this next time]

    5. indexed files, take two

        --classic Unix file system

        --inode contains:

            permissions
            times for file access, file modification, and inode change
            link count (# of directories containing the file)

            ptr 1  --> data block
            ptr 2  --> data block
            ptr 3  --> data block
            .....
            ptr 11 --> indirect block
                           ptr -->
                           ptr -->
                           ptr -->
                           ptr -->
                           ptr -->
            ptr 12 --> indirect block
            ptr 13 --> double indirect block
            ptr 14 --> triple indirect block

        +: Simple, easy to build, fast access to small files
        +: Maximum file length can be enormous, with multiple levels of
           indirection
        -: worst-case # of accesses is pretty bad
        -: worst-case overhead (such as for an 11-block file) is pretty bad
        -: Because you allocate blocks by taking them off an unordered free
           list, metadata and data get strewn across the disk

        Notes about inodes:

            --stored in a fixed-size array
            --Size of the array is fixed when the disk is initialized; can't be
              changed
            --Multiple inodes fit in a disk block
            --Lives in a known location, originally at one side of the disk,
              now lives in pieces across the disk (helps keep metadata close to
              data)
            --The index of an inode in the inode array is called an
              ***i-number***
            --Internally, the OS refers to files by i-number
            --When a file is opened, its inode is brought into memory
            --It is written back when modified and the file is closed, or after
              some time elapses
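        A sketch of what the inode picture above implies for lookups: given a
        block index within a file, how many disk accesses does it take to reach
        the data? The geometry (10 direct pointers, two singly indirect, one
        double, one triple, 128 pointers per indirect block) is read off the
        picture plus assumed 512-byte blocks and 4-byte pointers; it's an
        illustration, not a spec of any particular Unix FS.

            #include <stdio.h>

            #define NDIRECT   10      /* direct pointers in the inode           */
            #define NSINGLE   2       /* singly indirect pointers in the inode  */
            #define PTRS_PER  128     /* assumed pointers per indirect block    */

            /* Disk reads needed to reach the data of file block `bn`
               (the inode itself is assumed to already be in memory). */
            int accesses_for_block(unsigned long bn) {
                unsigned long single = (unsigned long)NSINGLE * PTRS_PER;
                unsigned long dbl    = (unsigned long)PTRS_PER * PTRS_PER;

                if (bn < NDIRECT)        return 1;   /* data block only        */
                bn -= NDIRECT;
                if (bn < single)         return 2;   /* indirect + data        */
                bn -= single;
                if (bn < dbl)            return 3;   /* 2 indirects + data     */
                bn -= dbl;
                if (bn < dbl * PTRS_PER) return 4;   /* 3 indirects + data     */
                return -1;                           /* beyond max file size   */
            }

            int main(void) {
                printf("block 5:    %d disk accesses\n", accesses_for_block(5));
                printf("block 10:   %d disk accesses\n", accesses_for_block(10));  /* the 11th block */
                printf("block 5000: %d disk accesses\n", accesses_for_block(5000));
                return 0;
            }

        Note how the 11-block case pays for an entire indirect block just to
        reach one extra data block: that is the "worst-case overhead" complaint
        above.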
4. Return midterms

---------------------------------------------------------------------------

thanks to David Mazieres and Mike Dahlin for portions of the above.

---------------------------------------------------------------------------

Concurrency advice and reflections

A. My advice on best approaches (higher-level advice than the thread coding
   advice from before)

    --application programming:

        --cooperative user-level multithreading

        --kernel-level threads with *simple* synchronization
            --this is where the thread coding advice and the commandments come
              in

        --event-driven programming

        --transactions, if your package provides them, and you are willing to
          deal with the performance trade-offs (namely that performance is poor
          under contention, because of lots of wasted work)

    --kernel code: no silver bullet here. want to avoid locks as much as
      possible. sometimes they are unavoidable. sometimes you want to use
      *non-blocking synchronization* (wait-free algorithms, lock-free
      algorithms, etc.).

B. Reflections and conclusions on concurrency

    --Threads and concurrency primitives have solved a hard problem: how to
      take advantage of hardware resources with a sequential abstraction (the
      thread) and how to safely coordinate access to shared resources
      (concurrency primitives).

    --But of course concurrency primitives have the disadvantages that we've
      discussed

    --old debate about whether threads are a good idea:

        John Ousterhout: "Why Threads Are a Bad Idea (for most purposes)", 1996
        talk. http://home.pacbell.net/ouster/threads.pdf

        Robert van Renesse: "Goal-Oriented Programming, or Composition Using
        Events, or Threads Considered Harmful". Eighth ACM SIGOPS European
        Workshop, September 1998.
        http://www.cs.cornell.edu/home/rvr/papers/GoalOriented.pdf

        --and lots of "events vs. threads" papers (use Google)

    --the debate comes down to this:

        --compared to code written in the event-driven style, shared-memory
          multiprogramming code is easier to read: it's easier to know the
          code's purpose. however, it's harder to make that code correct, and
          it's harder to know, when reading the code, whether it's correct.

    --who is right? sort of like the vi vs. emacs debates. threads, events, and
      the other alternatives all have advantages and disadvantages. one thing
      is for sure: make sure that you understand those advantages and
      disadvantages before picking a model to work with.

    --Some people think that threads, i.e., concurrent applications, shouldn't
      be used at all (because of the many bugs and difficult cases that come
      up, as we've discussed). However, that position is becoming increasingly
      less tenable, given multicore computing.

        --The fundamental reason is this: if you have a computation-intensive
          job that wants to take advantage of all of the hardware resources of
          a machine, you either need to (a) structure the job as different
          processes; or (b) use kernel-level threading. There is no other way,
          given mainstream OS abstractions, to take advantage of a machine's
          parallelism. (a) winds up being inconvenient (in order to share data,
          the processes either have to separately set up shared-memory regions,
          or else pass messages). So people use (b).

---------------------------------------------------------------------------