Class 13
CS 372H
02 March 2010

On the board
------------

0. Last time
        Trade-offs and problems from synchronization primitives
        A. Performance
        B. Performance v. complexity trade-off
        C. Deadlock
        D. Starvation

1. Trade-offs, continued
        E. Priority inversion
        F. Broken modularity
        G. Careful coding required (more advice)

2. Reflections and conclusions

3. Alternatives

4. Some loose ends
        A. sequential consistency
        B. futexes

5. Scheduling

---------------------------------------------------------------------------

NOTE: the end of these notes contains a written review of our concurrency
unit. It's not "complete", but it may help organize a lot of what we
covered.

---------------------------------------------------------------------------

0. Last time

    --We showed that spinlocks are better for a smaller number of CPUs but
      have worse scaling. We said why they have worse scaling but didn't
      say why they're better for a smaller number of CPUs. Why are they?

        --Because acquire(Lock* lock, qnode* q) in the MCS lock requires
          more operations than a simple spinlock acquire.

1. Trade-offs and problems from synchronization primitives
   [A-->D were covered last time]

    E. Priority inversion

        --T1, T2, T3: (highest, middle, lowest priority)
        --T1 wants to get the lock, T2 runnable, T3 runnable and holding
          the lock
        --System will preempt T3 and run the highest-priority runnable
          thread, namely T2
        --Solutions:
            --Temporarily bump T3 to the highest priority of any thread
              that is ever waiting on the lock
            --Disable interrupts, so no preemption (T3 finishes)
              ... works okay unless a page fault occurs
            --Don't handle it; structure the app so that only
              adjacent-priority processes/threads share locks
        --Happens in real life. For a real-life example, see:
          http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Mars_Pathfinder.html

    F. Broken modularity

        --many examples above: avoiding deadlock requires understanding how
          programs call each other
        --need to know, when calling a library, whether it's thread-safe:
          printf, malloc, etc. If not, surround the call with a mutex.
          (Can always surround calls with mutexes conservatively.)

    G. Careful coding required / more advice

        ***Your best hope if you're working with threads and monitors in
        application space: four higher-level pieces of advice***

        (i) coarse-grain locking

        (ii) disciplined hierarchical structure to your code (so you can
        order the locks), and avoid up-calls. if your structure is poor,
        you have little hope. so the following more detailed advice assumes
        decent structure (a small lock-ordering sketch appears after this
        list):

            --if you have to make an up-call, better ensure that the
              partial order on locks is maintained and/or that the up-call
              doesn't require holding a lock to issue the up-call (*)

            --if you have nested objects or monitors (as in the example
              earlier today), then there are some cases:

                --if the target of the call does not lock, then no problem.
                  the outer monitor can keep holding the lock.

                --if the target of the call locks but does not wait, then
                  the caller can continue to hold the lock PROVIDED that a
                  partial ordering exists (for example, that the callee
                  never issues a callback/up-call to the calling module or
                  to an even higher layer, as mentioned in (*) above)

                --if the target of the call locks and does wait, then it is
                  dangerous to call it while holding a lock in the outer
                  layer. here, you need a different code structure.
                  unfortunately, there is no silver bullet.

            --to avoid nested monitors, you can/should break your code up:

                        M
                        |
                        N

                  becomes:

                        O
                       / \
                      M   N

              where O is an ordinary module, not a monitor. example: M
              implements checkin/checkout for the database. O is the
              database, and N is some other monitor.

        (iii) run static detection tools; they're getting better every year

        (iv) run dynamic detection tools: instrument the program if needed
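        [A minimal sketch of advice (ii), lock ordering, using pthreads.
        The names here (struct account, transfer) are made up for
        illustration, and the mutexes are assumed to be initialized
        elsewhere; the point is only that every code path acquires the two
        locks in the same global order (here, ascending address), so two
        opposite-direction transfers cannot deadlock:]

            #include <pthread.h>
            #include <stdint.h>

            struct account {
                pthread_mutex_t mu;     /* assume initialized elsewhere */
                long balance;
            };

            /* move 'amount' from a to b; assumes a != b */
            void transfer(struct account *a, struct account *b, long amount) {
                /* fix one global order -- here, ascending address -- and
                   always acquire in that order */
                struct account *first  = ((uintptr_t)a < (uintptr_t)b) ? a : b;
                struct account *second = ((uintptr_t)a < (uintptr_t)b) ? b : a;

                pthread_mutex_lock(&first->mu);
                pthread_mutex_lock(&second->mu);

                a->balance -= amount;
                b->balance += amount;

                pthread_mutex_unlock(&second->mu);
                pthread_mutex_unlock(&first->mu);
            }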
        --Bummer about all of this: it is hard to hide the details of
          synchronization behind object interfaces
        --That is, even the solutions that avoid deadlock require breaking
          abstraction barriers and highly disciplined coding

2. Reflections and conclusions

    --Threads and concurrency primitives have solved a hard problem: how to
      take advantage of hardware resources with a sequential abstraction
      (the thread) and how to safely coordinate access to shared resources
      (concurrency primitives).

    --But of course concurrency primitives have the disadvantages listed
      above

    --Threads are also not free. They are cheap, but (in most packages and
      languages) not free
        --fun example: OS/2 from Microsoft and IBM (1980s). lots of
          threads, say 100. each thread needed a stack, say 10KB. result:
          1MB of memory consumed by waiting threads. cost of 1MB of memory
          in 1988: $200. But from the end-user's perspective: okay, you
          could keep working while the application spooled to the printer,
          but was that worth $200?

    --old debate about whether threads are a good idea:

        John Ousterhout: "Why Threads Are a Bad Idea (for most purposes)",
        1996 talk. http://home.pacbell.net/ouster/threads.pdf

        Robert van Renesse, "Goal-Oriented Programming, or Composition
        Using Events, or Threads Considered Harmful". Eighth ACM SIGOPS
        European Workshop, September 1998.
        http://www.cs.cornell.edu/home/rvr/papers/GoalOriented.pdf

    --the case comes down to this:
        --it is hard to get multithreaded programs correct, AND there are
          safer alternatives

    --who is right? sort of like the vi vs. emacs debates. threads, events,
      and the other alternatives that we will discuss below all have
      advantages and disadvantages.

3. What else can we do?

    A. Other approaches to handling concurrency:

        --non-blocking synchronization
            --wait-free algorithms
            --lock-free algorithms
            --the skinny on these is that:
                --using atomic instructions such as compare-and-swap
                  (CMPXCHG on the x86), you can implement many common data
                  structures (stacks, queues, even hash tables)
                --in fact, you can implement *any* algorithm in wait-free
                  fashion, given the right hardware
                --the problem is that such algorithms wind up using lots of
                  memory or involving many retries, so they are inefficient
                --but since they never hold locks, they cannot deadlock

        --ignore benign race conditions
            --for a Web site, do "hits++" on some counter without
              protecting it with a mutex. who cares if it's a bit too low?
            --warning: not a good practice in general

        --RCU (read-copy-update)
            [citation: P. E. McKenney et al. Read-Copy Update. Proc. Linux
            Symposium, 2001.
            http://lse.sourceforge.net/locking/rcu/rclock_OLS.2001.05.01c.sc.pdf]
            --really neat technique used widely in the Linux kernel
            --basic idea: because reading is so much more common than
              writing, don't synchronize readers. synchronize writers, but
              make sure that writers update data structures in such a way
              that readers don't get messed up.
              approach: leave old data for readers, don't update in place
                --no need for reader locks if they never see changes
                  in place; this fundamentally removes the need for
                  feedback from readers to updaters!
                --instead, update by copying data items, and atomically
                  change pointers
                --of course, this raises the question: when can we reclaim
                  the *old* memory? (the technique has to be very careful
                  about this.)
            --benefits: reader code is simpler, and it avoids locking
              issues and the bus operations needed to notify updaters.
              thus, good performance for read-heavy workloads.
              (a small sketch of the pointer-publication idea follows.)
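        [A minimal, hedged sketch of the core RCU idea using C11 atomics.
        This is NOT the Linux kernel's rcu_* API, and it punts on
        reclamation (the hard part); it only shows readers loading a
        pointer without locks while an updater copies, modifies, and
        atomically publishes a new version. The struct and function names
        are illustrative; if there were multiple updaters, they would still
        need to synchronize among themselves, e.g., with a spinlock:]

            #include <stdatomic.h>
            #include <stdlib.h>
            #include <string.h>

            struct config { int timeout; int retries; };

            /* shared pointer to the current version (assumed initialized) */
            _Atomic(struct config *) current_config;

            /* reader: no locks, no stores to shared state */
            int get_timeout(void) {
                struct config *c =
                    atomic_load_explicit(&current_config, memory_order_acquire);
                return c->timeout;
            }

            /* updater: copy, modify the copy, publish with one atomic store */
            void set_timeout(int t) {
                struct config *old = atomic_load(&current_config);
                struct config *new = malloc(sizeof *new);   /* error check omitted */
                memcpy(new, old, sizeof *new);
                new->timeout = t;
                atomic_store_explicit(&current_config, new, memory_order_release);
                /* cannot free(old) yet: some reader may still be using it.
                   real RCU waits for a "grace period" before reclaiming. */
            }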
        --event-driven code
            --also manages "concurrency".
            --why? (because the processor really can only do one thing at
              a time.)
            --good match if there is lots of I/O.
              what happens if we try to program in event-driven style and
              we are a CPU-bound process?
                --there aren't natural yield points, so the other tasks may
                  never get to run
                --or the CPU-bound tasks have to insert artificial yield()s

        --transactions
            --we will see these later in the course
            --using them in the kernel requires hardware support
            --using them in application space does not

    B. How free are these other techniques from the disadvantages listed
       above?

        --non-blocking synchronization often leads to complexity and
          inefficiency (but not deadlock or broken modularity!)
        --RCU has some complexity required to handle garbage collection,
          and it's not always applicable
        --event-driven code removes the possibility of races and deadlock,
          but that's because it doesn't involve true concurrency, so if you
          want your app to take advantage of multiple CPUs, you can't use
          event-driven programming
        --transactions are nice but not supported everywhere and not very
          efficient if there is lots of contention

    C. My advice on best approaches (higher-level advice than the thread
       coding advice above)

        --application programming:
            --cooperative user-level multithreading
            --kernel-level threads with *simple* synchronization (lab T)
                --this is where the thread coding advice given above
                  applies
            --event-driven coding
            --transactions, if your package provides them, and you are
              willing to deal with the performance trade-offs (namely that
              performance is poor under contention because of lots of
              wasted work)

        --kernel hacking: no silver bullet here. you want to avoid locks as
          much as possible. sometimes they are unavoidable, in which case
          fancy things need to happen.
            --UT professor Emmett Witchel proposes using transactions
              inside the kernel (TxLinux)

4. Some loose ends

    A. Sequential consistency

    Our examples all along have been assuming sequential consistency....but
    what does this amount to assuming?

    (i) Examples

        --example 1

            int data = 0, ready = 0;

            void p1 () {
                data = 2000;
                ready = 1;
            }

            int p2 () {
                while (!ready) {}
                return data;
            }

          What might p2 return if run concurrently with p1?

        --example 2

            int flag1 = 0, flag2 = 0;

            int main () {
                tid id = thread_create (p1, NULL);
                p2 ();
                thread_join (id);
            }

            void p1 (void *ignored) {
                flag1 = 1;
                if (!flag2) {
                    critical_section_1 ();
                }
            }

            void p2 (void *ignored) {
                flag2 = 1;
                if (!flag1) {
                    critical_section_2 ();
                }
            }

          Can both critical sections run?

        [examples from S.V. Adve and K. Gharachorloo, IEEE Computer,
        December 1996, 66-76.
        http://rsim.cs.uiuc.edu/~sadve/Publications/computer96.pdf]

        --Answers are "no" *if* the hardware provides sequential
          consistency (but if it doesn't, the answers are "yes"):

    (ii) Defn of sequential consistency:

        The result of execution is as if all operations were executed in
        some sequential order, and the operations of each processor
        occurred in the order specified by the program.

        [citation: L. Lamport. How to Make a Multiprocessor Computer that
        Correctly Executes Multiprocess Programs. _IEEE Transactions on
        Computers_, Volume C-28, Number 9, September 1979, pp. 690-691.
        http://research.microsoft.com/en-us/um/people/lamport/pubs/multi.pdf]

        Basically means:
            --Maintaining program order on individual processors
            --Ensuring write atomicity

    (iii) Why isn't sequential consistency always in effect?

        --It's expensive for the hardware (sometimes overlapping
          instructions, or providing non-blocking memory reads, helps the
          hardware's performance)
        --The compiler sometimes wants to violate s.c.:
            --moves code around
            --caches values in registers
            --common subexpression elimination (could cause memory to be
              read fewer times)
            --re-arranges loops for better cache performance
            --software pipelining
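        [A hedged sketch of one way a programmer gets the sequentially
        consistent answer back for example 2 above: mark the flags as C11
        atomics (which postdate these notes, but are now the standard
        tool). With the default memory_order_seq_cst operations, the
        compiler and hardware must preserve the store-then-load order on
        each thread, so at most one critical section can run even on
        hardware whose plain loads and stores are weaker:]

            #include <stdatomic.h>

            atomic_int flag1, flag2;    /* both initially 0 */

            void p1 (void *ignored) {
                atomic_store (&flag1, 1);          /* seq_cst store */
                if (atomic_load (&flag2) == 0) {   /* seq_cst load  */
                    /* critical_section_1 (); */
                }
            }

            void p2 (void *ignored) {
                atomic_store (&flag2, 1);
                if (atomic_load (&flag1) == 0) {
                    /* critical_section_2 (); */
                }
            }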
    (iv) What does the x86 do?

        --x86 supports multiple consistency/caching models
            --Memory Type Range Registers (MTRRs) specify consistency for
              ranges of physical memory (e.g., the frame buffer)
            --the Page Attribute Table (PAT) allows control for each 4K
              page
        --Choices include:
            WB: Write-back caching (the default)
            WT: Write-through caching (all writes go to memory)
            UC: Uncacheable (for device memory)
            WC: Write-combining: weak consistency & no caching
        --Some instructions have weaker consistency
            --String instructions
            --Special "non-temporal" instructions that bypass the cache

    (v) x86 WB consistency

        --Old x86s (e.g., 486, Pentium 1) had almost SC
            --Exception: a read could finish before an earlier write to a
              different location
        --Newer x86s let a processor read its own writes early
            --see handout, item #3: both of those functions can return 2;
              that is, the two processors see the loads in different orders
            --Older CPUs would wait at "f = ..." until the store completed

    (vi) x86 atomicity (review)

        --lock prefix
            --review: the lock prefix makes a memory instruction atomic (by
              locking the bus for the duration of the instruction, which is
              expensive)
            --all locked instructions are totally ordered
            --other memory instructions cannot be re-ordered with locked
              ones
        --xchg (always locked; no prefix needed)
        --fence instructions that can prevent re-ordering
            LFENCE -- can't be re-ordered with reads (or later writes)
            SFENCE -- can't be re-ordered with writes
            MFENCE -- can't be re-ordered with reads or writes

    B. Futexes

        [citation: H. Franke, R. Russell, and M. Kirkwood. Fuss, Futexes
        and Furwocks: Fast Userlevel Locking in Linux. Proc. Linux
        Symposium, 2002.
        http://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf]

        --abstraction that is useful for synchronizing user-level programs
          efficiently
        --basically, ask the kernel to put the current process to sleep
          only if some memory hasn't changed
        --this abstraction is really two things:
            --some shared memory (that the programs must coordinate on,
              likely by using a library wrapper on top of the futex)
            --a system call:
                void futex(int* uaddr, FUTEX_WAIT|FUTEX_WAKE, int val, ....)
        --in the non-contended case, the process never makes the system
          call: the processes just acquire and release the shared object
          (say, by atomically incrementing and decrementing counters)
        --in the contended case, a program may need to sleep. In that case,
          it executes a call like:
                futex(shared_counter_address, FUTEX_WAIT, v)
          which says: "sleep if *shared_counter_address == v". the whole
          point of this is that some other program might have changed the
          counter right before the call to futex(), so if the counter
          changed, we might not want to go to sleep (e.g., if the counter
          is 0 or 1 to mean "is_not_locked" or "is_locked", you don't want
          to go to sleep unless the counter really is equal to is_locked).
        --likewise, in the contended case, a program may need to wake up
          some waiters. In that case, it executes a call like:
                futex(shared_counter_address, FUTEX_WAKE, v)
          which says: "wake up at most v of the processes sleeping on
          shared_counter_address"
        --see futex(2) and futex(7) in the man pages (i.e., type
          "man 2 futex" and "man 7 futex")
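        [To make the fast-path/slow-path split concrete, here is a hedged
        sketch of a futex-based lock in the style of the scheme in the
        Franke et al. paper (and Ulrich Drepper's later note "Futexes Are
        Tricky"). The three-state encoding (0 = unlocked, 1 = locked,
        2 = locked with possible waiters) and the helper names are
        illustrative, not the glibc implementation; error handling is
        omitted:]

            #include <stdatomic.h>
            #include <linux/futex.h>
            #include <sys/syscall.h>
            #include <unistd.h>

            static long sys_futex(atomic_int *uaddr, int op, int val) {
                return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
            }

            void futex_lock(atomic_int *f) {
                int c = 0;
                /* fast path: 0 -> 1, no system call when uncontended */
                if (atomic_compare_exchange_strong(f, &c, 1))
                    return;
                /* slow path: advertise a waiter (state 2), sleep while locked */
                if (c != 2)
                    c = atomic_exchange(f, 2);
                while (c != 0) {
                    sys_futex(f, FUTEX_WAIT, 2);  /* kernel sleeps us only if *f == 2 */
                    c = atomic_exchange(f, 2);
                }
            }

            void futex_unlock(atomic_int *f) {
                if (atomic_exchange(f, 0) == 2)   /* 2 means someone may be sleeping */
                    sys_futex(f, FUTEX_WAKE, 1);  /* wake at most one waiter */
            }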
--------------------

Admin notes

    --lab 3B due tomorrow
    --am holding office hours from 2-3
    --so far no questions
    --some further advice
        --start lab 4A now
        --clarify what I said last time: I really didn't mean to say "start
          early". what I meant was "start on time". it just so happens that
          "on time" is many days before the deadline

--------------------

5. scheduling intro

    A. When do scheduling decisions happen?

        [draw picture]

        scheduling decisions take place when a process:
            (i)   Switches from running to waiting state
            (ii)  Switches from running to ready state
            (iii) Switches from waiting to ready
            (iv)  Exits

    B. What are metrics and criteria?

        --system throughput
            # of processes that complete per unit time
        --turnaround time
            time for each process to complete
        --response time
            time from request to first response (e.g., key press to
            character echo, not launch to exit)
        --fairness
            different possible definitions:
                --freedom from starvation
                --all users get equal time on the CPU
                --highest-priority jobs get most of the CPU
                --etc.
            [often conflicts with efficiency. true in life as well.]

        the above are affected by secondary criteria:
            --CPU utilization (fraction of time the CPU is actually
              working)
            --waiting time (time each process waits in the ready queue)

    C. Context switching costs

        --direct costs
            --CPU time in the kernel
            --save and restore registers
            --switch address spaces
        --indirect costs
            --TLB shootdowns, processor cache, OS caches (e.g., buffer
              caches)
        --result: more frequent context switches lead to worse throughput
          (higher overhead)

---------------------------------------------------------------------------

SUMMARY AND REVIEW OF CONCURRENCY

We've discussed different ways to handle concurrency. Here's a review and
summary.

Unfortunately, there is no one right approach to handling concurrency. The
"right answer" changes as operating systems and hardware evolve, and
depending on whether we're talking about what goes on inside the kernel,
how to structure an application, etc. For example, in a world in which most
machines have one CPU, it may make more sense to use event-driven
programming in applications (note that this is a potentially controversial
claim), and to rely on turning off interrupts in the kernel. But in a world
with multiple CPUs, event-driven programming in an application fails to
take advantage of the hardware's parallelism, and turning off interrupts in
the kernel will not avoid concurrency problems.

Why we want concurrency in the first place: better use of hardware
resources. Increase total performance by running different tasks
concurrently on different CPUs. But sometimes, serial execution of atomic
operations is needed for correctness. So how do we solve these problems?

--*threads are an abstraction that can take advantage of concurrency.*
  they apply at multiple levels, as discussed in class. indeed, a kernel
  that runs on multiple CPUs (say, in handling system calls from two
  different processes running on two different CPUs) can be regarded as
  using threads-inside-the-kernel, or there can be explicit
  threads-inside-the-kernel. this is apart from kernel threads and
  user-level threads, which are abstractions that applications use.

--to get serial execution of atomic operations, we need hardware support
  at the lowest level. we may use the hardware support directly (as in the
  case of lock-free data structures), with a thin wrapper (as in the case
  of spinlocks), or wrapped in a much higher-level abstraction (as in the
  case of mutexes and monitors).

    --the hardware support that we're talking about is test&set, the LOCK
      prefix, LD_L/ST_C (load-linked/store-conditional, on the DEC Alpha),
      and interrupt enable/disable
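[As a reminder of what the "thin wrapper" looks like, here is a minimal
sketch of the simplest spinlock of item 1 below, written with C11 atomics
rather than inline assembly (atomic_exchange is what the x86's xchg gives
you). The type and function names are illustrative; the lock word starts
at 0, meaning free:]

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;          /* 0 = free, 1 = held */

    void spin_acquire(spinlock_t *l) {
        while (atomic_exchange(l, 1) == 1)  /* atomically set to 1; retry if it was already 1 */
            ;                               /* spin (a real version might pause or back off) */
    }

    void spin_release(spinlock_t *l) {
        atomic_store(l, 0);
    }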
1. The most natural (but not the only) thing we can do with the hardware
support is to build spinlocks. We saw a few kinds of spinlocks:

    --test_and_set (while (xchg) {} )
    --test-and-test_and_set (example given in class from Linux)
    --MCS locks

Spinlocks are *sometimes* useful inside the kernel and more rarely useful
in application space. The reason is that wrapping larger pieces of
functionality with spinlocks wastes CPU cycles on the waiting processor.
There is a trade-off between the overhead of putting a thread to sleep and
the cycles wasted by spinning. For very short critical sections, spinlocks
are a win. For longer ones, put the thread of execution to sleep.

2. For larger pieces of functionality, higher-level synchronization
primitives are useful:

    --mutexes
    --mutexes and condition variables (known as monitors)
    --shared reader / single writer locks (can implement as a monitor)
    --semaphores
    --futexes (basically a semaphore or mutex used for synchronizing
      processes on Linux; the advantage is that if the futex is
      uncontended, the process never enters the kernel. The cost of a
      system call is only incurred when there is contention and a process
      needs to go to sleep. Going to sleep and getting woken requires
      kernel help.)

Building all of the above correctly requires lower-level synchronization
primitives. Usually, inside of these higher-level abstractions is a
spinlock that is held for a brief time before the thread is put to sleep
and after it is woken. (A minimal monitor sketch appears below, after
item 4.)

[Disadvantages of both spinlocks and higher-level synchronization
primitives:

    --performance (because of the synchronization point and cache line
      bounces)
    --performance v. complexity trade-off
        --hard to get code safe, live, and well-performing
        --to increase performance, we need finer-grained locking, which
          increases complexity, which imperils:
            --safety (i.e., race conditions more likely)
            --liveness (i.e., deadlock, starvation more likely)
    --deadlock (hard to ensure liveness)
    --starvation (hard to ensure progress for all threads)
    --priority inversion
    --broken modularity
    --careful coding required]

In user-level code, manage these disadvantages by sacrificing performance
for correctness. In kernel code, it's trickier. Any performance problems in
the kernel will be passed on to applications. Here, the situation is sort
of a mess. People use a combination of partial lock orders, careful
thought, static detection tools, code review, and prayer.

3. Can also use the hardware support to build lock-free data structures
(for example, using atomic compare-and-swap).

    --avoids the possibility of deadlock
    --often better performance
    --downside: further complexity

4. Can also use the hardware support to enable the Read-Copy Update (RCU)
technique. Technique used inside the Linux kernel. Very elegant.

    --here, *writers* need to synchronize (using spinlocks, other hardware
      support, etc.), but readers do not

[Aside: another paradigm for handling concurrency: transactions

    --transactional memory (requires different hardware abstractions)
    --transactions exposed to applications and users of applications, like
      queriers of databases]
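[The monitor sketch mentioned under item 2: a minimal mutex + condition
variable example, here a one-slot mailbox using pthreads. The struct and
function names are made up for illustration, and the mutex and condition
variable are assumed to be initialized elsewhere; the key discipline is
holding the mutex while touching shared state and re-checking the
waited-for condition in a while loop:]

    #include <pthread.h>

    struct mailbox {
        pthread_mutex_t mu;     /* assume initialized elsewhere */
        pthread_cond_t  cv;
        int full;               /* is there a message waiting? */
        int msg;
    };

    void put(struct mailbox *mb, int msg) {
        pthread_mutex_lock(&mb->mu);
        while (mb->full)                       /* always re-check in a loop */
            pthread_cond_wait(&mb->cv, &mb->mu);
        mb->msg = msg;
        mb->full = 1;
        pthread_cond_broadcast(&mb->cv);
        pthread_mutex_unlock(&mb->mu);
    }

    int get(struct mailbox *mb) {
        pthread_mutex_lock(&mb->mu);
        while (!mb->full)
            pthread_cond_wait(&mb->cv, &mb->mu);
        int msg = mb->msg;
        mb->full = 0;
        pthread_cond_broadcast(&mb->cv);
        pthread_mutex_unlock(&mb->mu);
        return msg;
    }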
Another approach to handling concurrency is to avoid it:

5. event-driven programming

    --also manages "concurrency".
    --why? (because the processor really can only do one thing at a time.)
    --good match if there is lots of I/O.
      what happens if we try to program in event-driven style and we are a
      CPU-bound process?
        --there aren't natural yield points, so the other tasks may never
          get to run
        --or the CPU-bound tasks have to insert artificial yield()s

    [only really relevant in application space. a kernel that is entirely
    event-driven will have trouble running on more than one CPU.]

6. what does JOS do?

Answer: JOS's approach to concurrency is a one-off solution that you
shouldn't take as a lesson. Here's its approach:

JOS is meant to run on single-CPU machines. It doesn't have to worry about
concurrent operations from other CPUs, but it does have to worry about
interrupts. JOS takes a simple approach: it turns interrupts off for the
entire time it is executing in the kernel. For the most part this means JOS
kernel code doesn't have to do anything special in situations where other
OSes would use locks.

JOS runs environments in user mode with interrupts enabled, so at any point
a timer interrupt may take the CPU away from an environment and switch to a
different environment. This interleaves the two environments' instructions
a bit like running them on two CPUs. The library operating system has some
data structures in memory that are shared among multiple environments
(e.g., pipes), so it needs a way to coordinate access to that data. In JOS
we will use special-case solutions, as you will find out in lab 6. For
example, to implement pipes we will assume there is one reader and one
writer. The reader and writer never update each other's variables; they
only read each other's variables. By programming carefully using this rule,
we can avoid races. (A sketch of this pattern appears at the end of these
notes.)

Ultimately, threads, synchronization primitives, etc. solve a really hard
problem: how to have multiple things going on at the same time while
letting the programmer keep it all organized, sane, and correct. To do
this, we introduced abstractions like threads, functions like switch()
[usually known as swtch()], relied on hardware primitives like XCHG, and
built higher-level objects like mutexes, monitors, and condition variables.
All of this is, at the end of the day, about presenting a relatively sane
model to the programmer, built on top of something that was otherwise
really hard to reason about.
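[The pipe sketch referenced above: a hedged illustration of the
one-reader/one-writer pattern, not the actual JOS lab code. The writer only
ever writes wpos and the buffer; the reader only ever writes rpos; each
merely reads the other's counter. In the single-CPU, sequentially
consistent setting JOS assumes, that restriction is enough to avoid races;
on a multiprocessor with a weaker memory model you would additionally need
the fences discussed in section 4:]

    #define PIPEBUF 32

    struct pipe {
        unsigned rpos;              /* updated only by the reader */
        unsigned wpos;              /* updated only by the writer */
        char buf[PIPEBUF];
    };

    /* writer: waits while the buffer is full */
    void pipe_write(struct pipe *p, char c) {
        while (p->wpos - p->rpos >= PIPEBUF)
            ;                       /* in JOS, an environment would sys_yield() here */
        p->buf[p->wpos % PIPEBUF] = c;
        p->wpos++;                  /* advance the counter only after the data is in place */
    }

    /* reader: waits while the buffer is empty */
    char pipe_read(struct pipe *p) {
        while (p->rpos == p->wpos)
            ;
        char c = p->buf[p->rpos % PIPEBUF];
        p->rpos++;
        return c;
    }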