Class 12
CS372H
24 February 2011

On the board
------------

(One handout)

1. Last time

2. Trade-offs and problems from locking
        A. deadlock
        B. starvation
        C. priority inversion
        D. broken modularity
        E. performance
        F. performance/complexity trade-off

3. Concurrency is hard!

4. More advice

---------------------------------------------------------------------------

1. Last time

    --one problem with a programming model that depends on spinlocks,
      mutexes, condition variables, monitors: deadlock.

    --several ways of avoiding it

    --no silver bullet

    --in practice, people try to code carefully

    --automated tools get better every year; consider using them when you
      code (Valgrind has Helgrind.)

2. Trade-offs and problems from locking

  A. Deadlock: last time

  B. Starvation

    --thread waiting indefinitely (if low priority and/or if the resource is
      contended)

  C. Priority inversion

    --T1, T2, T3: (highest, middle, lowest priority)
    --T1 wants to get the lock, T2 runnable, T3 runnable and holding the lock
    --System will preempt T3 and run the highest-priority runnable thread,
      namely T2
    --Solutions:
        --Temporarily bump T3 to the highest priority of any thread that is
          ever waiting on the lock
        --Disable interrupts, so no preemption (T3 finishes) ... works okay
          unless a page fault occurs
        --Don't handle it; structure the app so that only adjacent-priority
          processes/threads share locks
    --Happens in real life. For a real-life example, see:
      http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Mars_Pathfinder.html

  D. Broken modularity

    --examples above: avoiding deadlock requires understanding how programs
      call each other.
    --also, need to know, when calling a library, whether it's thread-safe:
      printf, malloc, etc. If not, surround the call with a mutex. (Can
      always surround calls with mutexes conservatively.)
    --we'll see other examples below.

  E. Performance

    quick digression:

        --_dance hall_ architecture: any CPU can "dance with" any memory
          equally (equally slowly)

        --NUMA (non-uniform memory access): each CPU has fast access to some
          "close" memory; slower to access memory that is further away
            --AMD Opterons like this
            --Intel CPUs moving toward this
            --see next-to-last page of handout

        --two further choices: cache coherent or not. in the former case,
          hardware runs a cache coherence (cc) protocol to invalidate caches
          when a local change happens. in the latter case, it does not. the
          former case is far more common.

    let's assume ccNUMA machines...back to performance issues....

    our baseline is a test-and-test-and-set spinlock, which is basically
    what Linux uses:

        void acquire(Lock* lock) {
            pushcli();
            while (xchg_val(&lock->locked, 1) == 1) {
                while (lock->locked)
                    ;
            }
        }

        void release(Lock* lock) {
            xchg_val(&lock->locked, 0);
            popcli();
        }

    the performance issues are:

    (i) fairness
        --one CPU gets the lock because the memory holding the "locked"
          variable is closer to that CPU
        --allegedly, Google had fairness problems on Opterons
          (I have no proof of this)

    (ii) lots of traffic over the memory bus: if there is lots of contention
        for the lock, then the cache coherence protocol creates lots of
        remote invalidations every time someone tries to do a lock
        acquisition

    (iii) cache line bounces (same reason as (ii))

    (iv) locking inherently reduces concurrency

    mitigation of (i)--(iii): better locks

        --MCS locks
            --see handout
            --advantages
                --guarantees FIFO ordering of lock acquisitions
                  (addresses (i))
                --spins on a local variable only (addresses (ii), (iii))
                --[not discussing this, but: works equally well on machines
                  with and without coherent caches]
            --(a rough sketch of the idea appears just after this list)
            --NOTE: with fewer cores, spinlocks are better. why?
            --In fact, if there is high contention, performance will be poor
              no matter what, though MCS locks will make it a little less
              poor. More on that in a bit.

        --futexes
            --see notes below or next time
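    Rough sketch of the MCS idea (this is NOT the handout's implementation;
    it is a minimal sketch using GCC's __atomic builtins, and the names
    mcs_lock, mcs_qnode, mcs_acquire, mcs_release are invented here). Each
    waiter enqueues a node of its own and spins only on that node's flag;
    the releaser hands the lock directly to its successor. The sketch omits
    interrupt disabling (pushcli/popcli) and glosses over some memory-
    ordering details a production version would need.

        #include <stddef.h>   /* NULL */

        typedef struct mcs_qnode {
            struct mcs_qnode *next;   /* successor in the wait queue, if any */
            volatile int locked;      /* 1 while this waiter must keep spinning */
        } mcs_qnode;

        typedef struct {
            mcs_qnode *tail;          /* last waiter in line, or NULL if free */
        } mcs_lock;

        void mcs_acquire(mcs_lock *lk, mcs_qnode *me)
        {
            me->next = NULL;
            me->locked = 1;
            /* atomically make ourselves the tail; whoever was there before
               is our predecessor */
            mcs_qnode *pred = __atomic_exchange_n(&lk->tail, me,
                                                  __ATOMIC_ACQ_REL);
            if (pred != NULL) {
                pred->next = me;      /* link in behind the predecessor */
                while (me->locked)    /* spin on our own node only */
                    ;
            }
        }

        void mcs_release(mcs_lock *lk, mcs_qnode *me)
        {
            if (me->next == NULL) {
                /* no successor visible: try to swing tail back to NULL */
                mcs_qnode *expected = me;
                if (__atomic_compare_exchange_n(&lk->tail, &expected, NULL, 0,
                                                __ATOMIC_ACQ_REL,
                                                __ATOMIC_ACQUIRE))
                    return;           /* nobody was waiting */
                while (me->next == NULL)
                    ;                 /* a waiter is mid-enqueue; wait for it
                                         to link in */
            }
            me->next->locked = 0;     /* hand the lock to the next waiter */
        }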
    mitigation of (iv): more fine-grained locking

        --unfortunately, fine-grained locking leads to the next issue, which
          is also fundamental

  F. Performance/complexity trade-off

    --one big lock is often not great for performance, even when we use the
      fancier locks above

    --indeed, locking itself is the issue: changing the lock type is
      unlikely to be as big of a performance win as restructuring the code

    --the fundamental issue with coarse-grained locking is that only one CPU
      at a time can execute anywhere in your code. If your code is called a
      lot, this may reduce the performance of an expensive multiprocessor to
      that of a single CPU.

    --if this happens inside the kernel, it means that applications inherit
      the performance problems from the kernel

    --Perhaps locking at smaller granularity would get higher performance
      through more concurrency.

    --But how best to reduce lock granularity is a bit of an art.

    --And unfortunately finer-grained locking makes incorrect code far more
      likely

    --And modularity further suffers (see item D. above)

    --Two examples of the above issues:

    --Example 1: imagine that every file in the file system is represented
      by a number, in a big table

        --You might inspect the file system code and notice that most
          operations use just one file or directory, leading you to have one
          lock per file

        --You could imagine the code implementing directories exporting
          various operations like

            dir_lookup(d, name)
            dir_add(d, name, file_number)
            dir_del(d, name)

        --With fine-grained locking, these directory operations would
          *internally* acquire the lock on d, do their work, and release the
          lock

        --Then higher-level code could implement operations like moving a
          file from one directory to another:

            move(olddir, oldname, newdir, newname) {
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
            }

        --Unfortunately, this isn't great:
            --there is a period of time when the file is visible in neither
              directory. to fix that requires that the directory locks _not_
              be hidden inside the dir_* operations.

        --so we need something like this:

            move(olddir, oldname, newdir, newname) {
                acquire(olddir.lock)
                acquire(newdir.lock)
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
                release(newdir.lock)
                release(olddir.lock)
            }

        --The above code is a bummer in that it exposes the implementation
          of directories to move(), but (if all you have is locks) you have
          to do it this way.
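        --One further wrinkle worth flagging (not in the original example):
          if one thread moves a file from d1 to d2 while another moves a
          file from d2 to d1, the two acquisitions above can deadlock. The
          usual fix is to acquire the two locks in a fixed global order. A
          hypothetical sketch, ordering by address and reusing the Dir,
          Lock, acquire()/release(), and dir_*() names assumed above (this
          is illustration, not a real API):

            void move(Dir *olddir, const char *oldname,
                      Dir *newdir, const char *newname)
            {
                /* pick a global order for the two locks (here, by the
                   directories' addresses) so concurrent moves in opposite
                   directions cannot deadlock */
                Lock *first  = (olddir < newdir) ? &olddir->lock : &newdir->lock;
                Lock *second = (olddir < newdir) ? &newdir->lock : &olddir->lock;

                acquire(first);
                if (second != first)      /* olddir == newdir: rename in place */
                    acquire(second);

                int file_number = dir_lookup(olddir, oldname);
                dir_del(olddir, oldname);
                dir_add(newdir, newname, file_number);

                if (second != first)
                    release(second);
                release(first);
            }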
        --Example 2: see filemap.c at the end of the handout for an extreme
          case

    --Mitigation? Unfortunately, there is no way around this trade-off.

    --worse, it's easy to get this stuff wrong: correct code is harder to
      write than buggy code

    --If you have fine-grained locking (i.e., you are trading off
      simplicity), then you are much more likely to encounter the two types
      of errors:

        (i) safety errors (race conditions)
        (ii) liveness errors (deadlocks, etc.)

    --***So what do people do?***

        --in app space:

            --don't worry too much about performance up front. makes it
              easier to keep your code free of safety problems *and*
              liveness problems

            --if you are worrying about performance, make sure there are no
              race conditions. much more important than worrying about
              deadlock.

            --SAFETY FIRST.

                --almost always far better for your program to do nothing
                  than to do the wrong thing (example of using a linear
                  accelerator for radiation therapy: **way** better not to
                  subject the patient to the radiation beam at all than to
                  subject the patient to a beam that is 100x too strong,
                  leading to gruesome, atrocious injuries)

                --if the program deadlocks, the evidence is intact, and we
                  can go back and see what the problem was.

            --there are ways around deadlock, as we will discuss in a moment

            --but we shouldn't be too cavalier about liveness issues because
              they can lead to catastrophic cases. Example: Mars Pathfinder
              (which was addressed; see above), but still.

        --in kernel space:

            --same thing, to some extent

            --but performance matters more in kernel space, so you are
              likely to be dealing with more complex issues

            --here again, SAFETY FIRST
                --lock more aggressively
                --worry about deadlock later

            --not a satisfying answer, but there is no silver bullet for
              concurrency-related issues

    --By the way, if there is lots of contention, then the style and
      granularity of locks will not eliminate the problem. Where does
      contention come from?

        --application requirements. lots of contention comes from
          applications that inherently require global resources or shared
          data.

        --example of Apache: every CPU needs to write to a global logfile,
          which causes contention in the kernel. you can make the locking as
          fine-grained as you want, but at the end of the day, if there's a
          single logfile, a single writer permitted at a time, and many
          contending writers, then that logfile is going to wind up
          serializing all of the writers.

3. Concurrency is hard!

    Sequential consistency

    Our examples all along have been assuming sequential consistency....but
    what does this amount to assuming? See examples on handout.

    (i) Defn of sequential consistency:

        The result of execution is as if all operations were executed in
        some sequential order, and the operations of each processor occurred
        in the order specified by the program.

        [citation: L. Lamport. How to Make a Multiprocessor Computer that
        Correctly Executes Multiprocess Programs. _IEEE Transactions on
        Computers_, Volume C-28, Number 9, September 1979, pp. 690-691.
        http://research.microsoft.com/en-us/um/people/lamport/pubs/multi.pdf]

        Basically means:
            --Maintaining program order on individual processors
            --Ensuring write atomicity

    (ii) Why isn't sequential consistency always in effect?

        --It's expensive for the hardware (sometimes overlapping
          instructions, or providing non-blocking memory reads, helps the
          hardware's performance)

        --The compiler sometimes wants to violate s.c.:
            --moves code around
            --caches values in registers
            --common subexpression elimination (could cause memory to be
              read fewer times)
            --re-arranges loops for better cache performance
            --software pipelining
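        To make these points concrete, here is a small example (hypothetical;
        not from the handout) of code that silently assumes sequential
        consistency. The names data/ready/producer/consumer are invented,
        and pthreads is assumed; with plain int variables and no
        synchronization, either the compiler or the CPU is free to break it:

            #include <pthread.h>
            #include <stdio.h>

            int data  = 0;
            int ready = 0;          /* plain int: nothing stops the compiler
                                       from caching it in a register */

            void *producer(void *arg) {
                data  = 42;
                ready = 1;          /* compiler or CPU may reorder these two
                                       stores */
                return NULL;
            }

            void *consumer(void *arg) {
                while (!ready)      /* compiler may hoist this load out of the
                                       loop and spin forever on a stale copy */
                    ;
                printf("%d\n", data);  /* even if the loop exits, this may
                                          print 0, not 42, without fences */
                return NULL;
            }

            int main(void) {
                pthread_t p, c;
                pthread_create(&c, NULL, consumer, NULL);
                pthread_create(&p, NULL, producer, NULL);
                pthread_join(p, NULL);
                pthread_join(c, NULL);
                return 0;
            }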
    (iii) What does the x86 do?

        --x86 supports multiple consistency/caching models
            --Memory Type Range Registers (MTRR) specify consistency for
              ranges of physical memory (e.g., the frame buffer)
            --Page Attribute Table (PAT) allows control for each 4K page

        --Choices include:
            WB: Write-back caching (the default)
            WT: Write-through caching (all writes go to memory)
            UC: Uncacheable (for device memory)
            WC: Write-combining: weak consistency & no caching

        --Some instructions have weaker consistency
            --String instructions
            --Special "non-temporal" instructions that bypass the cache

    (iv) x86 WB consistency

        --Old x86s (e.g., 486, Pentium 1) had almost SC
            --Exception: a read could finish before an earlier write to a
              different location

        --Newer x86s let a processor read its own writes early
            --see handout, item 3c: both of those functions can return 2;
              that is, the two processors see the loads in different orders
            --Older CPUs would wait at "f = ..." until the store completed

    (v) x86 atomicity (review)

        --lock prefix
            --review: the lock prefix makes a memory instruction atomic (by
              locking the bus for the duration of the instruction, which is
              expensive).
            --all locked instructions are totally ordered
            --other memory instructions cannot be re-ordered with locked
              ones

        --xchg (always locked; no prefix needed)

        --fence instructions that can prevent re-ordering
            LFENCE -- can't be reordered with reads (or later writes)
            SFENCE -- can't be reordered with writes
            MFENCE -- can't be reordered with reads or writes
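        As an aside, the xchg bullet is exactly what the xchg_val() helper
        in the spinlock from section 2E relies on. A guess at how that
        helper might be written (a sketch using GCC inline assembly; this is
        not code from the handout):

            /* sketch: atomically swap newval into *addr and return the old
               value. xchg with a memory operand is implicitly locked, so it
               is atomic and acts as a full barrier; no "lock" prefix
               needed. */
            static inline unsigned int
            xchg_val(volatile unsigned int *addr, unsigned int newval)
            {
                unsigned int result;
                asm volatile("xchgl %0, %1"
                             : "+m" (*addr), "=a" (result)
                             : "1" (newval)
                             : "cc", "memory");
                return result;
            }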
4. More advice

    Two things for you to remember here and always; these two things are
    implied by the above discussion of sequential consistency:

    (1) if you're *using* a synchronization primitive (e.g., a mutex), do
        NOT try to read any shared data outside of the mutex (the mutex
        provides the needed ordering).

    (2) if you're *implementing* a synchronization primitive, you need to
        read the language manual carefully (to tell the compiler what not to
        order), and you need to read the processor manual carefully (to
        understand the default memory model and how to override it if
        necessary).

    ***Your best hope if you're working with threads and monitors in
    application space:***

    (3) coarse-grained locking

    (4) the MikeD rules/commandments/standards: lock()/unlock() at the
        beginning and end of functions, use monitors, use while loops to
        check scheduling constraints, etc. (a short sketch of this style
        appears at the end of these notes)

    (5) disciplined hierarchical structure to your code (so you can order
        the locks), and avoid up-calls. if your structure is poor, you have
        little hope. so the following more detailed advice assumes decent
        structure:

        --if you have to make an up-call, better to ensure that the partial
          order on locks is maintained and/or that the up-call doesn't
          require a lock to issue the up-call (*)

        --if you have nested objects or monitors (as in the M,N example in
          l11-handout), then there are some cases:

            --if the target of the call does not lock, then no problem. the
              outer monitor can keep holding the lock.

            --if the target of the call locks but does not wait, then the
              caller can continue to hold the lock PROVIDED that a partial
              ordering exists (for example, that the callee never issues a
              callback/up-call to the calling module or to an even higher
              layer, as mentioned in (*) above)

            --if the target of the call locks and does wait, then it is
              dangerous to call it while holding a lock in the outer layer.
              here, you need a different code structure. unfortunately,
              there is no silver bullet.

        --to avoid nested monitors, you can/should break your code up:

                M
                |
                N

            becomes:

                  O
                 / \
                M   N

            where O is an ordinary module, not a monitor.

            example: M implements checkin/checkout for a database. O is the
            database, and N is some other monitor.

    (6) run static detection tools (commercial products; search around);
        they're getting better every year

    (7) run dynamic detection tools (Valgrind, etc.): instrument the program
        if needed

    --Bummer about all of this: it's hard to hide the details of
      synchronization behind object interfaces

    --That is, even the solutions that avoid deadlock require breaking
      abstraction barriers and highly disciplined coding
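    Sketch for item (4) above (a hypothetical example, not from the handout,
    written with pthreads; the counter and the names buffer_put/buffer_get
    are invented): lock at the top of each function, unlock at the bottom,
    and re-check scheduling constraints with a while loop.

        #include <pthread.h>

        static pthread_mutex_t mu       = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
        static int count = 0;

        void buffer_put(void)
        {
            pthread_mutex_lock(&mu);            /* lock at the top */
            count++;
            pthread_cond_broadcast(&nonempty);  /* wake any waiters */
            pthread_mutex_unlock(&mu);          /* unlock at the bottom */
        }

        int buffer_get(void)
        {
            pthread_mutex_lock(&mu);
            while (count == 0)                  /* while, not if: re-check the
                                                   condition after waking */
                pthread_cond_wait(&nonempty, &mu);
            int c = count--;
            pthread_mutex_unlock(&mu);
            return c;
        }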