Class 6
CS 439
5 February 2013

On the board
------------

1. Last time

2. Reinforce atomicity

3. Trade-offs and problems from locking

    A. Hard to get right
    B. Performance v. complexity trade-off
    C. Starvation
    D. Priority inversion
    E. Deadlock
    F. Broken modularity

4. More advice

---------------------------------------------------------------------------

1. Last time

    --clarified condition variables
    --monitors
    --standards
    --bit of practice

2. Reinforce atomicity

    --atomicity is required if you want to reason about code without
    contorting your brain to reason about all possible interleavings

    --atomicity requires mutual exclusion, aka a solution to critical
    sections

    --mutexes provide that solution

    --once you have mutexes, you don't have to worry about arbitrary
    interleavings. critical sections are interleaved, but those are much
    easier to reason about than individual operations.

        --why? because of _invariants_. example of an invariant:

            "list structure has integrity"

        --the meaning of lock.acquire() is that if and only if you get
        past that line, it's safe to violate the invariants.

        --the meaning of lock.release() is that right _before_ that line,
        any invariants need to be restored.

    --the above is abstract. let's make it concrete:

        invariant: "list structure has integrity"

        so protect the list with a mutex:

            only after acquire() is it safe to manipulate the list

            just before release(), the list needs to be in a sane state

    --ASK: based on the above, what do we need to be careful about
    before/after wait()?

        that invariants hold. (because wait() releases the mutex.)

        this is why there is state manipulation (of AW, WW, WR, AR)
        before/after wait() in the example from last time

3. Trade-offs and problems from locking

    Locking (in all its forms: mutexes, monitors, semaphores) raises many
    issues:

    A. Hard to get right

        --this is a programming model where, unfortunately, the incorrect
        version of the code is far easier to write than the correct
        version of the code.

    B. Performance/complexity trade-off

        --one big lock is often not great for performance

        --indeed, locking itself is the issue: changing the lock type is
        unlikely to be as big of a performance win as restructuring the
        code

        --the fundamental issue with coarse-grained locking is that only
        one CPU can execute anywhere in the part of your code protected
        by a lock. If your code is called a lot, this may reduce the
        performance of an expensive multiprocessor to that of a single
        CPU.

        --if this happens inside the kernel, it means that applications
        will inherit the performance problems from the kernel

        --Perhaps locking at smaller granularity would get higher
        performance through more concurrency.

        --But how to best reduce lock granularity is a bit of an art.

        --And unfortunately finer-grained locking makes incorrect code
        far more likely

        --And modularity further suffers (see item F. below)

        --Two examples of the above issues:

            --Example 1: imagine that every file in the file system is
            represented by a number, in a big table

                --You might inspect the file system code and notice that
                most operations use just one file or directory, leading
                you to have one lock per file

                --You could imagine the code implementing directories
                exporting various operations like:

                    dir_lookup(d, name)
                    dir_add(d, name, file_number)
                    dir_del(d, name)

                --With fine-grained locking, these directory operations
                would *internally* acquire the lock on d, do their work,
                and release the lock

                --Then higher-level code could implement operations like
                moving a file from one directory to another:

                    move(olddir, oldname, newdir, newname) {
                        file_number = dir_lookup(olddir, oldname)
                        dir_del(olddir, oldname)
                        dir_add(newdir, newname, file_number)
                    }

                --Unfortunately, this isn't great: there is a period of
                time when the file is visible in neither directory. to
                fix that requires that the directory locks _not_ be
                hidden inside the dir_* operations.
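The race in this fine-grained design can be made concrete. Below is a minimal Python sketch (the dict-backed Dir class is an invented stand-in; dir_lookup/dir_add/dir_del and move follow the pseudocode in these notes). Replaying move()'s steps by hand exposes, deterministically, the window in which the file is in neither directory:

```python
import threading

class Dir:
    """Hypothetical directory: a dict of name -> file_number plus a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.entries = {}

# Fine-grained design: each operation acquires the per-directory
# lock *internally*, so each call is individually atomic.
def dir_lookup(d, name):
    with d.lock:
        return d.entries.get(name)

def dir_add(d, name, file_number):
    with d.lock:
        d.entries[name] = file_number

def dir_del(d, name):
    with d.lock:
        del d.entries[name]

def move(olddir, oldname, newdir, newname):
    file_number = dir_lookup(olddir, oldname)
    dir_del(olddir, oldname)
    # <-- window: a thread running here sees the file in NEITHER
    #     directory, even though every dir_* call is atomic
    dir_add(newdir, newname, file_number)

# Replay move()'s steps by hand to expose the window deterministically:
a, b = Dir(), Dir()
dir_add(a, "paper.txt", 7)
n = dir_lookup(a, "paper.txt")
dir_del(a, "paper.txt")
in_neither = (dir_lookup(a, "paper.txt") is None and
              dir_lookup(b, "paper.txt") is None)
dir_add(b, "paper.txt", n)
print(in_neither)  # True: the file was briefly in neither directory
```

The point: making each dir_* call atomic is not enough, because the atomic unit the caller needs is the whole move().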
                --so we need something like this:

                    move(olddir, oldname, newdir, newname) {
                        acquire(olddir.lock)
                        acquire(newdir.lock)
                        file_number = dir_lookup(olddir, oldname)
                        dir_del(olddir, oldname)
                        dir_add(newdir, newname, file_number)
                        release(newdir.lock)
                        release(olddir.lock)
                    }

                --The above code is a bummer in that it exposes the
                implementation of directories to move(), but (if all you
                have is locks) you have to do it this way.

            --Example 2: see filemap.c at end of handout for an extreme
            case

        --Mitigation? Unfortunately, no way around this trade-off.

            --worse, it's easy to get this stuff wrong: correct code is
            harder to write than buggy code

            --If you have fine-grained locking (i.e., you are trading off
            simplicity), then you are much more likely to encounter the
            two types of errors:

                (i) safety errors (race conditions)
                (ii) liveness errors (deadlocks, etc.)

        --***So what do people do?***

            --in app space:

                --don't worry too much about performance up front. this
                makes it easier to keep your code free of safety problems
                *and* liveness problems

                --if you are worrying about performance, make sure there
                are no race conditions. that is much more important than
                worrying about deadlock.

                    --SAFETY FIRST.

                    --almost always far better for your program to do
                    nothing than to do the wrong thing (example of using
                    a linear accelerator for radiation therapy: **way**
                    better not to subject the patient to the radiation
                    beam than to subject the patient to a beam that is
                    100x too strong, leading to gruesome, atrocious
                    injuries)

                    --if the program deadlocks, the evidence is intact,
                    and we can go back and see what the problem was.

                --there are ways around deadlock, as we will discuss in a
                moment

                --but we shouldn't be too cavalier about liveness issues
                because they can lead to catastrophic cases. Example:
                Mars Pathfinder (which was addressed; see below), but
                still.
            --in kernel space:

                --same thing, to some extent

                --but performance matters more in kernel space, so you
                are likely to be dealing with more complex issues

                --here again, SAFETY FIRST:

                    --lock more aggressively

                    --worry about deadlock later

                --not a satisfying answer, but there is no silver bullet
                for concurrency-related issues

        --By the way, if there is lots of contention, then the style and
        granularity of locks will not eliminate the problem. Where does
        contention come from?

            --application requirements. lots of contention from
            applications that inherently require global resources or
            shared data.

                --example of Apache: every CPU needs to write to a global
                logfile, which causes contention in the kernel. you can
                make the locking as fine-grained as you want, but at the
                end of the day, if there's a single logfile, a single
                writer permitted at a time, and many contending writers,
                then that logfile is going to wind up serializing all of
                the writers.

    C. Starvation

        --thread waiting indefinitely (if low priority and/or if the
        resource is contended)

    D. Priority inversion

        --T1, T2, T3: (highest, middle, lowest priority)

        --T1 wants to get the lock; T2 is runnable; T3 is runnable and
        holding the lock

        --System will preempt T3 and run the highest-priority runnable
        thread, namely T2

        --Solutions:

            --Temporarily bump T3 to the highest priority of any thread
            that is ever waiting on the lock

            --Disable interrupts, so no preemption (T3 finishes) ...
            works okay unless a page fault occurs

            --Don't handle it; structure the app so that only
            adjacent-priority processes/threads share locks

        --Happens in real life. For a real-life example, see:
        http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Mars_Pathfinder.html

---------------------------------------------------------------------------

Admin stuff

    --video over weekend

---------------------------------------------------------------------------

    E. Deadlock

        --see handout: simple example based on two locks

        --see handout: more complex example

            --M calls N

            --N waits

            --but let's say the condition can only become true if N is
            invoked through M

            --now the lock inside N is unlocked, but M remains locked;
            that is, no one is going to be able to enter M and hence N.

        --can also get deadlocks with condition variables

            --lesson: dangerous to hold locks (M's mutex in the case on
            the handout) when crossing abstraction barriers

        --deadlocks without mutexes:

            --the real issue is resources & how they are requested

            --non-computer example **[picture of bridge]**:

                --bridge only allows traffic in one direction

                --Each section of a bridge can be viewed as a resource.

                --If a deadlock occurs, it can be resolved if one car
                backs up (preempt resources and rollback).

                --Several cars may have to be backed up if a deadlock
                occurs.

                --Starvation is possible.

            --other example:

                --one thread/process grabs the disk and then tries to
                grab the scanner

                --another thread/process grabs the scanner and then tries
                to grab the disk

        --how do we get around deadlock?

            (i) ignore it: worry about it when it happens

            (ii) detect and recover: not great

                --could imagine attaching a debugger

                    --not really viable for production software, but
                    works well in development

                --threads package can keep track of the
                resource-allocation graph (see book)

                    --For each lock acquired, order with other locks held

                    --If a cycle occurs, abort with an error

                    --Detects potential deadlocks even if they do not
                    occur

            (iii) avoid algorithmically [didn't cover this year]

                --banker's algorithm (see book)

                    --very elegant but impractical

                    --if you're using the banker's algorithm, the
                    gameboard looks like this:

                        ResourceMgr::Request(ResourceID resc, RequestorID thrd) {
                            acquire(&mutex);
                            assert(system in a safe state);
                            while (state that would result from giving
                                   resource to thread is not safe) {
                                wait(&cv, &mutex);
                            }
                            update state by giving resource to thread
                            assert(system in a safe state);
                            release(&mutex);
                        }

                    --Now we need to determine if a state is safe....
                    To do so, see book

                --disadvantages of the banker's algorithm:

                    --requires every single resource request to go
                    through a single broker

                    --requires every thread to state its maximum resource
                    needs up front. unfortunately, if threads are
                    conservative and claim they need huge quantities of
                    resources, the algorithm will reduce concurrency

            (iv) prevent them by careful coding

                --negate one of the four conditions:

                    1. mutual exclusion
                    2. hold-and-wait
                    3. no preemption
                    4. circular wait

                --can sort of negate 1:

                    --put a queue in front of resources, like the printer

                    --virtualize memory

                --not much hope of negating 2

                --can sort of negate 3:

                    --consider physical memory: virtualized with VM, the
                    OS can take a physical page away and give it to
                    another process!

                --what about negating #4?

                    --in practice, this is what people do

                    --idea: partial order on locks

                        --establish an order on all locks and make sure
                        that every thread acquires its locks in that
                        order

                    --why this works:

                        --can view deadlock as a cycle in the resource
                        acquisition graph

                        --partial order implies no cycles and hence no
                        deadlock

                    --three bummers:

                        1. hard to represent CVs inside this framework.
                        works best only for locks.

                        2. the compiler can't check at compile time that
                        the partial order is being adhered to, because
                        the calling pattern is impossible to determine
                        without running the program (thanks to function
                        pointers and the halting problem)

                        3. picking and obeying the order on *all* locks
                        requires that modules make public their locking
                        behavior, and requires them to know about other
                        modules' locking. This can be painful and
                        error-prone.

                    --Linux's filemap.c is an example of the complexity
                    introduced by having a locking order [will cover next
                    time; listing here for context/flow]

            (v) static and dynamic detection tools

                --See, for example, these citations, citations therein,
                and papers that cite them (there is a long literature on
                this stuff):

                    Engler, D. and K. Ashcraft. RacerX: effective, static
                    detection of race conditions and deadlocks. Proc. ACM
                    Symposium on Operating Systems Principles (SOSP),
                    October 2003, pp. 237-252.
                    http://portal.acm.org/citation.cfm?id=945468

                    Savage, S., M. Burrows, G. Nelson, P. Sobalvarro, and
                    T. Anderson. Eraser: a dynamic data race detector for
                    multithreaded programs. ACM Transactions on Computer
                    Systems (TOCS), Volume 15, No. 4, November 1997,
                    pp. 391-411.
                    http://portal.acm.org/citation.cfm?id=265927

                --Disadvantage to dynamic checking: slows the program
                down

                --Disadvantage to static checking: many false alarms (the
                tool says "there is a deadlock", but in fact there is
                none) or else missed problems

                --Note that these tools get better every year. I believe
                that Valgrind has a race and deadlock detection tool

    F. Broken modularity

        --examples above: avoiding deadlock requires understanding how
        programs call each other.

        --also, when calling a library, you need to know whether it's
        thread-safe: printf, malloc, etc. If not, surround the call with
        a mutex. (Can always surround calls with mutexes conservatively.)

        --basically, locks bubble out of the interface
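To make the "partial order on locks" idea from (iv) concrete, here is a minimal Python sketch. (Account and transfer are invented names for illustration, not from the notes.) Every thread acquires the two locks in one global order, here by object id, so no cycle of waiting threads can form:

```python
import threading

class Account:
    """Invented example resource: a balance protected by its own lock."""
    def __init__(self, balance):
        self.lock = threading.Lock()
        self.balance = balance

def transfer(src, dst, amount):
    # Acquire both locks in a single global order (here: by id()).
    # A consistent order means no cycle in the waits-for graph,
    # hence no deadlock -- this negates "circular wait".
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)
# Two threads transferring in opposite directions: if each thread
# locked src before dst, this pattern could deadlock; with a global
# lock order, it cannot.
t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(1000)])
t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(1000)])
t1.start(); t2.start()
t1.join(); t2.join()
print(a.balance, b.balance)  # 100 100: the transfers cancel; no deadlock
```

Note that the "broken modularity" bummer shows up even in this tiny sketch: transfer() has to reach into both accounts' locks, i.e., the locking leaks out of the Account abstraction.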