Class 19
CS 439
26 March 2013

On the board
------------
1. Last time
2. Alternatives to locking
3. Advice
4. File systems

(summary of the concurrency unit is at the end of the notes)

---------------------------------------------------------------------------

1. Last time

    --reviewed locking; discussed MCS locks; also discussed performance, etc.

    --CLARIFICATION: cores can access the caches of other cores; that is what
      those lines in the figure represent.

2. Alternatives to locking

    --Futexes (won't really cover)
    --RCU (read-copy-update); won't really cover
    --Event-driven programming
    --Transactions
    --Non-blocking synchronization

    A. Futexes (for performance)

        --motivation: locks must interact with the scheduler, and syscalls on
          locks have to go into the kernel, which is expensive.

        --so can we optimize the process of acquiring the lock so that we go
          into the kernel only if we can't get the lock?

        --enter futexes
          ["Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux",
          H. Franke, R. Russell, and M. Kirkwood, Ottawa Linux Symposium,
          2002.]

        --idea: ask the kernel to put us to sleep only if a memory location
          hasn't changed (if it has changed, the lock is contended, and we'll
          need the kernel's help to adjudicate)

            --void futex(int *uaddr, FUTEX_WAIT, int val, ...)
                --go to sleep only if *uaddr == val
                --extra arguments allow timeouts, etc.
            --void futex(int *uaddr, FUTEX_WAKE, int val, ...)
                --wake up at most val threads sleeping on uaddr
            --uaddr is translated down to an offset in a VM object
                --so this works on a memory-mapped file mapped at different
                  virtual addresses in different processes

        --idea for a mutex built on futexes:
            --to "acquire", atomically decrement or set
            --to "release", atomically increment or reset. if the value is 1,
              great; if the value is *less than* 1, it means someone else
              asked for the mutex, so ask for the kernel's help waking the
              waiters
          (a code sketch appears just after the threading example below)

    B. RCU (read-copy-update)

        [see: "Read-Copy Update", McKenney et al., Proc. Ottawa Linux
        Symposium, 2001.
        http://lse.sourceforge.net/locking/rcu/rclock_OLS.2001.05.01c.sc.pdf]

        --some data is read far more often than it is written
            --like routing tables: consulted for each packet that is forwarded
            --or data maps in a system with 100+ disks: updated when a disk
              fails, maybe once every 10^10 operations
        --optimize for the common case of reading without a lock
            e.g., global variable:  routing_table *rt;
                  call lookup(rt, route); with no locking
        --update by making a copy and then swapping the pointer:
            --routing_table *nrt = copy_routing_table(rt);
            --update nrt   [threads still reading the old table keep working]
            --set the global rt = nrt when done updating
            --all lookup calls see a consistent old or new table
        --the hard part: when can we free the memory of the old routing table?
            --answer: when we are guaranteed that no one is using it
            --but how can we determine that?
            --loosely speaking, wait until each thread has context switched at
              least once
            --at that point, each thread has been in a _quiescent state_ at
              least once, and while a thread is in a _quiescent state_, its
              temporary variables (which might have pointed into the old
              table) are dead

    C. Event-driven programming

        event-driven versus threading

        threading
        ---------

            for (;;) {
                fd = accept_client();
                /* note: passing &fd is racy as written (the next accept may
                 * overwrite fd before service_client copies it); a careful
                 * version would pass a heap-allocated copy */
                thread_create(service_client, &fd);
            }

            void service_client(void* arg) {
                int* fd_ptr = (int*)arg;
                int fd = *fd_ptr;

                while (client_request_not_read_in) {
                    read(fd, ....);     /* [+] */
                }
                do_work_for_client();
                while (response_to_client_not_fully_written_out) {
                    write(fd, ...);
                }
                thread_exit();
            }
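        [To make the acquire/release idea in part A concrete, here is a
        minimal futex-backed mutex sketch in C. It is a sketch, not the exact
        protocol above: instead of the decrement/increment convention it uses
        an equivalent 0/1/2 convention (0 = unlocked, 1 = locked, 2 = locked
        with possible waiters), which is a common way to write this. The
        names (fmutex, sys_futex, fmutex_acquire, fmutex_release) are made
        up; the kernel interface is Linux's futex system call, invoked here
        through syscall(2).]

            #include <linux/futex.h>
            #include <sys/syscall.h>
            #include <unistd.h>
            #include <stddef.h>

            /* 0 = unlocked, 1 = locked (no waiters), 2 = locked, maybe waiters */
            typedef struct { int state; } fmutex;       /* initialize to {0} */

            static long sys_futex(int *uaddr, int op, int val)
            {
                return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
            }

            /* atomically: if (*p == old) *p = new; return the value seen */
            static int cmpxchg(int *p, int old, int new)
            {
                return __sync_val_compare_and_swap(p, old, new);
            }

            void fmutex_acquire(fmutex *m)
            {
                int c = cmpxchg(&m->state, 0, 1);
                if (c == 0)
                    return;           /* fast path: got the lock, no syscall */
                do {
                    /* mark the lock contended, then sleep in the kernel;
                     * FUTEX_WAIT returns immediately if state is no longer 2,
                     * which closes the race with a concurrent release */
                    if (c == 2 || cmpxchg(&m->state, 1, 2) != 0)
                        sys_futex(&m->state, FUTEX_WAIT, 2);
                } while ((c = cmpxchg(&m->state, 0, 2)) != 0);
            }

            void fmutex_release(fmutex *m)
            {
                if (__sync_fetch_and_sub(&m->state, 1) != 1) {
                    /* state was 2: there may be sleepers; reset and wake one */
                    m->state = 0;
                    sys_futex(&m->state, FUTEX_WAKE, 1);
                }
            }

        [The point to notice is the one the notes make: the kernel is entered
        only on the slow path, that is, only when the lock is contended.]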
        event-driven
        ------------

            /* functions "register": "notify me if an event happens. here is
             * a function pointer to call...."
             */

            for (;;) {
                rc = select(n, &read_events, &write_events, NULL, &tv);

                /* go through read_events; handle any that need to be
                 * handled */

                /* go through write_events; handle each */

                /* above, "handling" means dereferencing the function
                 * pointers. */
            }

        --event-driven programming also manages "concurrency".
            --why? (because the processor really can only do one thing at a
              time.)
        --good match if there is lots of I/O

        --what happens if we try to program in event-driven style and we are
          a CPU-bound process?
            --there aren't natural yield points, so the other tasks may never
              get to run
            --or the CPU-bound tasks have to insert artificial yield()s.

        In vanilla event-driven programming, there is no explicit yield()
        call. A function that wants to stop running but wants to be run again
        has to queue up a request to be re-run. That request unfortunately
        has to manually take the important stack variables and place them in
        an entry in the event queue (a list of function pointers). In a very
        real sense, this is what the thread scheduler does automatically (the
        thread scheduler, viewed at a high level, takes various threads'
        stacks and switches them around, and also loads the CPU's registers
        with a thread's registers; this is not so different from scheduling a
        function for later and specifying its arguments, where the arguments
        *are* the important stack variables that the function will need in
        order to execute).

        --this is one reason why you want threads: it is easy to write code
          where each thread just does some CPU-intensive thing, and the
          thread scheduler worries about interleaving the operations
          (otherwise, the interleaving is more manual and consists of the
          steps mentioned above).

    D. Transactions

        --using them in the kernel requires hardware support
            see
            http://www.cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf
            Intel's latest chips should have such extensions
        --using them in application space does not
        --when deadlock is detected, the transaction manager aborts a
          transaction

    E. Non-blocking synchronization

        --wait-free algorithms
        --lock-free algorithms
        --the skinny on these is:
            --using atomic instructions such as compare-and-swap (CMPXCHG on
              the x86), you can implement many common data structures
              (stacks, queues, even hash tables); see the sketch just after
              this subsection
            --in fact, you can implement *any* algorithm in wait-free
              fashion, given the right hardware
            --the problem is that such algorithms wind up using lots of
              memory or involving many retries, so they are inefficient
            --but since they never hold locks, they are provably free of
              deadlock!
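        [To make the compare-and-swap point concrete, here is a minimal
        sketch of the "push" operation of a lock-free stack, written with
        C11 <stdatomic.h>. The names (node, head, push) are invented for
        illustration. A real implementation would also need a pop() that
        deals with safe memory reclamation and the ABA problem, which is
        exactly the kind of complexity and retry overhead mentioned above.]

            #include <stdatomic.h>
            #include <stdlib.h>

            struct node {
                int          value;
                struct node *next;
            };

            /* top of the stack, shared among threads */
            static _Atomic(struct node *) head;

            void push(int value)
            {
                struct node *n = malloc(sizeof *n);
                n->value = value;

                /* retry loop: link n in front of the current head, then try
                 * to swing head to n; on failure, "old" is refreshed with the
                 * current head and we retry */
                struct node *old = atomic_load(&head);
                do {
                    n->next = old;
                } while (!atomic_compare_exchange_weak(&head, &old, n));
            }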
* How free are the above approaches from the disadvantages listed above?

    --RCU has some complexity required to handle garbage collection, and it's
      not always applicable
    --event-driven code removes the possibility of races and deadlock, but
      that's because it doesn't involve true concurrency, so if you want your
      app to take advantage of multiple CPUs, you can't use plain vanilla
      event-driven programming
    --non-blocking synchronization often leads to complexity and inefficiency
      (but not deadlock or broken modularity!)
    --transactions are nice but not supported everywhere, and they are not
      very efficient if there is lots of contention

3. Advice

    A. My advice on the best approaches (higher-level advice than the thread
       coding advice from before)

        --application programming:
            --cooperative user-level multithreading
            --kernel-level threads with *simple* synchronization
                --this is where the thread coding advice given above applies
            --event-driven coding
            --transactions, if your package provides them, and you are
              willing to deal with the performance trade-offs (namely that
              performance is poor under contention because of lots of wasted
              work)

        --kernel hacking: no silver bullet here. want to avoid locks as much
          as possible. sometimes they are unavoidable, in which case fancy
          things need to happen.
            --UT professor Emmett Witchel proposes using transactions inside
              the kernel (TxLinux)

    B. Reflections and conclusions from the concurrency unit

        --Threads and concurrency primitives have solved a hard problem: how
          to take advantage of hardware resources with a sequential
          abstraction (the thread) and how to safely coordinate access to
          shared resources (concurrency primitives).

        --But of course concurrency primitives have the disadvantages that
          we've discussed.

        --old debate about whether threads are a good idea:

            John Ousterhout: "Why Threads are a bad idea (for most
            purposes)", 1996 talk.
            http://home.pacbell.net/ouster/threads.pdf

            Robert van Renesse, "Goal-Oriented Programming, or Composition
            Using Events, or Threads Considered Harmful". Eighth ACM SIGOPS
            European Workshop, September 1998.
            http://www.cs.cornell.edu/home/rvr/papers/GoalOriented.pdf

            --and lots of "events vs. threads" papers (use Google)

        --the debate comes down to this:

            --compared to code written in event-driven style, shared memory
              multiprogramming code is easier to read: it's easier to know
              the code's purpose. however, it's harder to make that code
              correct, and it's harder to know, when reading the code,
              whether it's correct.

        --who is right? sort of like the vi vs. emacs debates. threads,
          events, and the other alternatives all have advantages and
          disadvantages. one thing is for sure: make sure that you understand
          those advantages and disadvantages before picking a model to work
          with.

        --Some people think that threads, i.e., concurrent applications,
          shouldn't be used at all (because of the many bugs and difficult
          cases that we've discussed). However, that position is becoming
          increasingly less tenable, given multicore computing.

            --The fundamental reason is this: if you have a
              computation-intensive job that wants to take advantage of all
              of the hardware resources of a machine, you either need to
              (a) structure the job as different processes; or (b) use
              kernel-level threading. There is no other way, given mainstream
              OS abstractions, to take advantage of a machine's parallelism.
              (a) winds up being inconvenient (in order to share data, the
              processes either have to separately set up shared memory
              regions, or else pass messages). So people use (b).

---------------------------------------------------------------------------

midterm topics

    --everything since the last midterm
    --and application-level use of concurrency primitives (locking, Mike
      Dahlin's commandments, etc.)

---------------------------------------------------------------------------

4. File systems

    A. intro
    B. files
    C. implementing files
        1. contiguous
        2. linked files
        3. FAT
        4. indexed files
    D. directories
    E. FS performance
    F. mmap

A. Intro

    --more papers on FSs than on any other single topic
    --probably also the hardest part of operating systems

    --what does a FS do?
        --provides persistence (doesn't go away ... ever)
        --somehow associates bytes on the disk with names (files)
        --somehow associates names with each other (directories)

    --where are FSes implemented?
        --can implement them on disk, over the network, in memory, in NVRAM
          (non-volatile RAM), on tape, with paper (!!!!)
        --we are going to focus on the disk and generalize later. we'll see
          what it means to implement a FS over the network

    --a few quick notes about disks in the context of FS design
        --the disk is the first thing we've seen that (a) doesn't go away and
          (b) we can modify (BIOS ROM, hardware configuration, etc. don't go
          away, but we weren't able to modify them). two implications here:
            (i) we're going to have to put all of our important state on the
                disk
            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't happen
                 again. scribble randomly on the disk --> now what? (answer:
                 in many cases, we're hosed.)
        --mismatch: the CPU and memory are *also* working with "important
          state", but they are vastly faster than disks
        --the disk is enormous: 100-1000x more data than memory
            --how do we organize all of this information?
            --answer: by categorizing things (taxonomies). a FS is a kind of
              taxonomy ("/homes" has home directories,
              "/homes/bob/classes/cs372h" has bob's cs372h material, etc.)

B. Files

 * Intro

    --what is a file?
        --answer from the user's view: a bunch of named bytes on the disk
        --answer from the FS's view: a collection of disk blocks
        --the big job of a FS: map names and offsets to disk blocks

                            FS
            {file,offset} ------> disk address

    --operations are create(file), delete(file), read(), write()

    --***goal: operations have as few disk accesses as possible and minimal
      space overhead

        --wait, why do we want minimal space overhead, given that the disk is
          huge?
        --answer: cache space is never enough, and the amount of data that
          can be retrieved in one fetch is never enough. hence, we really
          don't want to waste space.

    [[--note that we have seen translation/indirection before:

            page table:
                            page table
            virtual address ----------> physical address

            per-file metadata:
                     inode
            offset --------> disk block address

            how'd we get the inode?
                       directory
            file name ----------> file #
                                  (file # *is* an inode in Unix)

       (a schematic sketch of this chain of translations, in C, appears at
       the end of this subsection)
    ]]

 * Implementing files

    --our task: meet the goal marked *** above.

    --for now, we're going to assume that the file's metadata is given to us.
      when we look at directories in a bit, we'll see where the metadata
      comes from; the picture above should also give a hint.

    access patterns we could imagine supporting:

        (i) Sequential:
            --file data is processed in sequential order
            --by far the most common mode
            --example: editor writes out a new file, compiler reads in a
              file, etc.

        (ii) Random access:
            --address any block in the file directly, without passing through
              the blocks before it
            --examples: large data set, demand paging, databases

        (iii) Keyed access:
            --search for a block with particular values
            --examples: associative database, index
            --this is everywhere in the fields of databases and search
              engines, but....
            --...it is usually not provided by the FS in the OS

    helpful observations:

        (i) All blocks in a file tend to be used together, sequentially
        (ii) All files in a directory tend to be used together
        (iii) All *names* in a directory tend to be used together

    further design parameters:

        (i) Most files are small
        (ii) Much of the disk is allocated to large files
        (iii) Many of the I/O operations are made to large files
        (iv) Want good sequential and good random access
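    [As a purely illustrative sketch of the chain of translations above
    (file name -> file # -> inode -> disk block), here are some hypothetical
    C structures and a lookup helper. None of this is a real on-disk format:
    the constants and field names are made up, and the "inode with an array
    of block pointers" shape anticipates the indexed design listed in the
    outline (C.4).]

        #include <stdint.h>

        #define BLOCK_SIZE  4096
        #define NUM_DIRECT  12             /* hypothetical number of pointers */

        struct dirent {                    /* one entry in a directory file */
            char     name[28];             /* file name */
            uint32_t inum;                 /* file #: index into the inode table */
        };

        struct inode {                     /* per-file metadata */
            uint32_t size;                 /* length of the file in bytes */
            uint32_t blocks[NUM_DIRECT];   /* disk block #s holding file data */
        };

        /* {file, offset} --> disk block #, for offsets covered by the direct
         * pointers; a real design also needs a story for larger files
         * (indirect blocks, extents, etc.) */
        uint32_t block_for_offset(const struct inode *ip, uint32_t offset)
        {
            uint32_t idx = offset / BLOCK_SIZE;
            if (offset >= ip->size || idx >= NUM_DIRECT)
                return 0;                  /* 0 means "no such block" here */
            return ip->blocks[idx];
        }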
    candidate designs........

    1. contiguous allocation ("extent based")

        --when creating a file, make the user pre-specify its length, and
          allocate all of the space at once
        --file metadata contains the location and size
        --example: IBM OS/360

            [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple
        +: fast access, both sequential and random
        -: fragmentation

        --where have we seen something similar?
          (answer: segmentation in virtual memory)

    2. linked files

        --keep a linked list of free blocks
        --metadata: pointer to the file's first block
        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access is easy (and probably mostly fast, assuming
           decent free-space management, since the pointers will point close
           by)
        -: random access is a disaster
        -: pointers take up room in blocks and mess up the alignment of data

---------------------------------------------------------------------------
thanks to David Mazieres and Mike Dahlin for portions of the above.
---------------------------------------------------------------------------

SUMMARY AND REVIEW OF CONCURRENCY

We've discussed different ways to handle concurrency. Here's a review and
summary.

Unfortunately, there is no one right approach to handling concurrency. The
"right answer" changes as operating systems and hardware evolve, and it
depends on whether we're talking about what goes on inside the kernel, how to
structure an application, etc. For example, in a world in which most machines
had one CPU, it may make more sense to use event-driven programming in
applications (note that this is a potentially controversial claim) and to
rely on turning off interrupts in the kernel. But in a world with multiple
CPUs, event-driven programming in an application fails to take advantage of
the hardware's parallelism, and turning off interrupts in the kernel will not
avoid concurrency problems.

Why we want concurrency in the first place: better use of hardware resources.
Increase total performance by running different tasks concurrently on
different CPUs. But sometimes serial execution of atomic operations is needed
for correctness. So how do we solve these problems?

--*threads are an abstraction that can take advantage of concurrency.* this
  applies at multiple levels, as discussed in class. indeed, a kernel that
  runs on multiple CPUs (say, handling system calls from two different
  processes running on two different CPUs) can be regarded as using
  threads-inside-the-kernel, or there can be explicit
  threads-inside-the-kernel. this is apart from kernel-level threading and
  user-level threading, which are abstractions that applications use.

--to get serial execution of atomic operations, we need hardware support at
  the lowest level. we may use the hardware support directly (as in the case
  of lock-free data structures), with a thin wrapper (as in the case of
  spinlocks), or wrapped in a much higher-level abstraction (as in the case
  of mutexes and monitors).

    --the hardware support that we're talking about: test&set, the LOCK
      prefix, LD_L/ST_C (load-linked/store-conditional, on the DEC Alpha),
      and interrupt enable/disable.

1. The most natural (but not the only) thing we can do with the hardware
   support is to build spinlocks. We saw a few kinds of spinlocks:

    --test_and_set  (while (xchg) {} )
    --test-and-test_and_set (example given in class from Linux; a sketch
      appears below)
    --MCS locks
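    [As a concrete rendering of the test-and-test_and_set idea, here is a
    minimal spinlock sketch using GCC's __sync builtins. It is not the Linux
    code shown in class; the type and function names are invented.]

        typedef struct {
            volatile int locked;           /* 0 = free, 1 = held */
        } spinlock_t;

        void spin_acquire(spinlock_t *l)
        {
            for (;;) {
                /* "test": spin with plain reads (cheap, stays in the local
                 * cache) while the lock looks held */
                while (l->locked)
                    ;
                /* "test_and_set": atomically swap in 1; an old value of 0
                 * means we got the lock */
                if (__sync_lock_test_and_set(&l->locked, 1) == 0)
                    return;
            }
        }

        void spin_release(spinlock_t *l)
        {
            __sync_lock_release(&l->locked);  /* store 0, release semantics */
        }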
Spinlocks are *sometimes* useful inside the kernel and more rarely useful in
application space. The reason is that wrapping larger pieces of functionality
with spinlocks wastes CPU cycles on the waiting processor. There is a
trade-off between the overhead of putting a thread to sleep and the cycles
wasted by spinning. For very short critical sections, spinlocks are a win.
For longer ones, put the thread of execution to sleep.

2. For larger pieces of functionality, higher-level synchronization
   primitives are useful:

    --mutexes
    --mutexes and condition variables (known as monitors)
    --shared reader / single writer locks (can implement as a monitor)
    --semaphores (but you should not use these, as it is easy to make
      mistakes with them)
    --futexes (basically a mutex or semaphore used for synchronizing
      processes on Linux; the advantage is that if the futex is uncontended,
      the process never enters the kernel. the cost of a system call is
      incurred only when there is contention and a process needs to go to
      sleep, since going to sleep and getting woken requires kernel help.)

    Building all of the above correctly requires lower-level synchronization
    primitives. Usually, inside of these higher-level abstractions is a
    spinlock that is held for a brief time before the thread is put to sleep
    and after it is woken.

    [Disadvantages of both spinlocks and higher-level synchronization
     primitives:

        --performance (because of the synchronization point and cache-line
          bounces)
        --performance v. complexity trade-off
            --hard to get code safe, live, and well-performing
            --to increase performance, we need finer-grained locking, which
              increases complexity, which imperils:
                --safety (race conditions more likely)
                --liveness (for example, deadlock and starvation more likely)
        --deadlock (hard to ensure liveness)
        --starvation (hard to ensure progress for all threads)
        --priority inversion
        --broken modularity
        --careful coding required]

    In user-level code, manage these disadvantages by sacrificing performance
    for correctness. In kernel code, it's trickier. Any performance problems
    in the kernel will be passed on to applications. Here, the situation is
    sort of a mess. People use a combination of partial lock orders, careful
    thought, static detection tools, code review, and prayer.

3. Can also use the hardware support to build lock-free data structures (for
   example, using atomic compare-and-swap).

    --avoids the possibility of deadlock
    --better performance
    --downside: further complexity

4. Can also use the hardware support to enable the read-copy-update (RCU)
   technique. This technique is used inside the Linux kernel. Very elegant.

    --here, *writers* need to synchronize (using spinlocks, other hardware
      support, etc.), but readers do not
      (a sketch of the reader/writer pointer-swap pattern appears after item
      5 below)

[Aside: another paradigm for handling concurrency: transactions
    --transactional memory (requires different hardware abstractions)
    --transactions exposed to applications and users of applications, like
      queriers of databases]

Another approach to handling concurrency is to avoid it:

5. event-driven programming

    --also manages "concurrency".
        --why? (because the processor really can only do one thing at a
          time.)
    --good match if there is lots of I/O

    what happens if we try to program in event-driven style and we are a
    CPU-bound process?
        --there aren't natural yield points, so the other tasks may never get
          to run
        --or the CPU-bound tasks have to insert artificial yield()s.

    In vanilla event-driven programming, there is no explicit yield() call. A
    function that wants to stop running but wants to be run again has to
    queue up a request to be re-run. That request unfortunately has to
    manually take the important stack variables and place them in an entry in
    the event queue. In a very real sense, this is what the thread scheduler
    does automatically (the thread scheduler, viewed at a high level, takes
    various threads' stacks and switches them around, and also loads the
    CPU's registers with a thread's registers; this is not so different from
    scheduling a function for later and specifying its arguments, where the
    arguments *are* the important stack variables that the function will need
    in order to execute).

    --this is one reason why you want threads: it is easy to write code where
      each thread just does some CPU-intensive thing, and the thread
      scheduler worries about interleaving the operations (otherwise, the
      interleaving is more manual and consists of the steps mentioned above).
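    [Returning to item 4: here is a minimal sketch of the RCU-style read path
    and pointer-swap write path, using the routing-table example from part
    2.B earlier in these notes. The struct layout and names are invented;
    readers do no locking, the genuinely hard part (knowing when the old copy
    may be freed) is only indicated by a comment, and the writer side would
    also need its own lock if there can be more than one writer.]

        #include <stdatomic.h>
        #include <stdlib.h>
        #include <string.h>

        #define NROUTES 256

        struct routing_table {
            int next_hop[NROUTES];          /* made-up contents */
        };

        /* global pointer that readers consult without a lock */
        static _Atomic(struct routing_table *) rt;

        void routes_init(void)
        {
            atomic_store(&rt, calloc(1, sizeof(struct routing_table)));
        }

        /* reader path: one atomic load, then use that snapshot */
        int lookup(int route)
        {
            struct routing_table *t = atomic_load(&rt);
            return t->next_hop[route];
        }

        /* writer path: copy, update the copy, publish it with one store */
        void set_route(int route, int next_hop)
        {
            struct routing_table *old = atomic_load(&rt);
            struct routing_table *nrt = malloc(sizeof(*nrt));
            memcpy(nrt, old, sizeof(*nrt));         /* copy_routing_table(rt) */

            nrt->next_hop[route] = next_hop;        /* update the copy */
            atomic_store(&rt, nrt);                 /* readers now see nrt */

            /* the hard part: we cannot free(old) yet. first wait until every
             * thread has passed through a quiescent state (e.g., has context
             * switched at least once); only then is it safe to free(old). */
        }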
6. What does JOS do?

    Answer: JOS's approach to concurrency is probably not something to take
    as a lesson. It uses a "big kernel lock" and ensures that kernel code is
    executing on only one processor at a time (note that a user-level
    environment can execute on the other CPU).

    A JOS system has to worry about interrupts; for this it takes a simple
    approach. On the CPU on which it is executing, JOS turns interrupts off
    for the entire time that it is executing in the kernel. JOS runs
    environments in user mode with interrupts enabled, so at any point a
    timer interrupt may take the CPU away from an environment and switch to a
    different environment. This can happen on either CPU.

Ultimately, threads, synchronization primitives, etc. solve a really hard
problem: how to have many things going on at the same time while allowing the
programmer to keep them organized, sane, and correct. To do this, we
introduced abstractions like threads, functions like swtch(), relied on
hardware primitives like XCHG, and built higher-level objects like mutexes,
monitors, and condition variables. All of this is, at the end of the day,
about presenting a relatively sane model to the programmer, built on top of
something that is otherwise really hard to reason about. On the other hand,
these abstractions aren't perfect, as the litany of disadvantages should make
clear, so the solution is to be very careful when writing code that has
multiple units of execution that share memory (aka shared memory
multiprogramming aka threading aka processes that share memory).

---------------------------------------------------------------------------