Class 13
CS 372H
01 March 2011

On the board
------------
(One handout)

1. Alternatives to locks
    --benign races
    --event-driven programming
    --transactions
    --others

2. Reflections and conclusions from concurrency unit

3. Therac-25 and software safety
    --Background
    --Mechanics
    --What went wrong?
    --Discussion

---------------------------------------------------------------------------

0. Last time

** NOTE: the end of the notes today contains a summary of the whole
concurrency unit. Not claiming it's necessary or sufficient, but it
does try to place what we've covered in some sort of context.

1. Alternatives

A. Other approaches to handling concurrency:

    --non-blocking synchronization
        --wait-free algorithms
        --lock-free algorithms
        --the skinny on these is that:
            --using atomic instructions such as compare-and-swap
              (CMPXCHG on the x86), you can implement many common data
              structures (stacks, queues, even hash tables)
            --in fact, can implement *any* algorithm in wait-free
              fashion, given the right hardware
            --the problem is that such algorithms wind up using lots of
              memory or involving many retries, so they are inefficient
            --but since they don't lock, they are provably lock-free!

    --ignore benign race conditions
        --for a Web site, set "hits++" in some counter without
          protecting it with a mutex. who cares if it's a bit too low?
        --warning: not a good practice in general

    --RCU (read-copy-update)
      [citation: P. E. McKenney et al. Read-Copy Update. Proc. Linux
      Symposium, 2001.
      http://lse.sourceforge.net/locking/rcu/rclock_OLS.2001.05.01c.sc.pdf]
        --really neat technique used widely in the Linux kernel
        --basic idea: because reading is so much more common than
          writing, don't synchronize readers. synchronize writers, but
          make sure that writers update data structures in such a way
          that readers don't get messed up
        --approach: leave old data for readers; don't update in-place
            --no need for reader locks if they don't see changes
              in-place
            --fundamentally removes need for feedback from readers to
              updaters!
        --instead, update by copying data items and atomically changing
          pointers
        --of course, this raises the question: when can we reclaim the
          *old* memory? (technique has to be very careful about this.)
        --benefits: reader code is simpler and avoids locking issues
          and bus operations to notify updaters; thus, good performance
          for read-heavy workloads

    --event-driven code
        --also manages "concurrency"
        --why? (because the processor really only can do one thing at a
          time.)
        --good match if there is lots of I/O
        --what happens if we try to program in event-driven style and
          we are a CPU-bound process?
            --there aren't natural yield points, so the other tasks may
              never get to run
            --or the CPU-bound tasks have to insert artificial
              yield()s. In vanilla event-driven programming, there is
              no explicit yield() call. A function that wants to stop
              running but wants to be run again has to queue up a
              request to be re-run. That request unfortunately has to
              manually take the important stack variables and place
              them in an entry in the event queue. In a very real
              sense, this is what the thread scheduler does
              automatically (the thread scheduler, if you look at it at
              a high level, takes various threads' stacks and switches
              them around, and also loads the CPU's registers with a
              thread's registers. this is not super-different from
              scheduling a function for later and specifying the
              arguments, where the arguments *are* the important stack
              variables that the function will need in order to
              execute).
        --this is one reason why you want threads: it's easy to write
          code where each thread just does some CPU-intensive thing,
          and the thread scheduler worries about interleaving the
          operations (otherwise, the interleaving is more manual and
          consists of the steps mentioned above).

    --transactions
        --we will see these later in the course
        --using them in the kernel requires hardware support
        --using them in application space does not
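The compare-and-swap style mentioned under non-blocking synchronization
above can be sketched concretely. This is a minimal example of my own
(not from the handout), using C11 atomics in place of raw CMPXCHG: a
lock-free counter where a thread that loses the race simply retries,
and no thread ever blocks holding a lock.

```c
/* Minimal lock-free counter sketch (assumption: C11 <stdatomic.h>
 * stands in for the raw x86 CMPXCHG instruction mentioned above). */
#include <stdatomic.h>

static _Atomic long counter = 0;

void lockfree_increment(void) {
    long old = atomic_load(&counter);
    /* The CAS succeeds only if counter still equals old. On failure,
     * old is reloaded with the current value and we retry; "weak" CAS
     * may also fail spuriously, which the loop absorbs. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;
}
```

Many threads can call lockfree_increment() concurrently and every
increment eventually lands, but under heavy contention the retries are
exactly the wasted work that makes these algorithms inefficient, as
noted above.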
B. How free are these other techniques from the disadvantages listed
   above?

    --non-blocking synchronization often leads to complexity and
      inefficiency (but not deadlock or broken modularity!)
    --RCU has some complexity required to handle garbage collection,
      and it's not always applicable
    --event-driven code removes the possibility of races and deadlock,
      but that's because it doesn't involve true concurrency, so if you
      want your app to take advantage of multiple CPUs, you can't use
      plain vanilla event-driven programming
    --transactions are nice but not supported everywhere and not very
      efficient if there is lots of contention

C. My advice on best approaches (higher-level advice than the thread
   coding advice from before)

    --application programming:
        --cooperative user-level multithreading
        --kernel-level threads with *simple* synchronization (lab T)
            --this is where the thread coding advice given above
              applies
        --event-driven coding
        --transactions, if your package provides them, and you are
          willing to deal with performance trade-offs (namely that
          performance is poor under contention because of lots of
          wasted work)
    --kernel hacking: no silver bullet here. want to avoid locks as
      much as possible. sometimes they are unavoidable, in which case
      fancy things need to happen.
        --UT professor Emmett Witchel proposes using transactions
          inside the kernel (TxLinux)

2. Reflections and conclusions from concurrency unit

    --Threads and concurrency primitives have solved a hard problem:
      how to take advantage of hardware resources with a sequential
      abstraction (the thread) and how to safely coordinate access to
      shared resources (concurrency primitives).

    --But of course concurrency primitives have the disadvantages that
      we've discussed.

    --old debate about whether threads are a good idea:

        John Ousterhout: "Why Threads Are a Bad Idea (for most
        purposes)", 1996 talk.
        http://home.pacbell.net/ouster/threads.pdf

        Robert van Renesse: "Goal-Oriented Programming, or Composition
        Using Events, or Threads Considered Harmful". Eighth ACM SIGOPS
        European Workshop, September 1998.
        http://www.cs.cornell.edu/home/rvr/papers/GoalOriented.pdf

        --and lots of "events vs threads" papers (use Google)

    --the debate comes down to this:
        --compared to code written in event-driven style, shared memory
          multiprogramming code is easier to read: it's easier to know
          the code's purpose. however, it's harder to make that code
          correct, and it's harder to know, when reading the code,
          whether it's correct.
    --who is right? sort of like vi vs. emacs debates. threads, events,
      and the other alternatives all have advantages and disadvantages.
      one thing is for sure: make sure that you understand those
      advantages and disadvantages before picking a model to work with.

3. Therac-25 and software safety

* Background

    --Draw linear accelerator
    --Magnets
        --bending magnets
    --Bombard tungsten to get photons

* Mechanics

    [draw picture of this thing]

    dual-mode machine (actually, triple mode, given the disasters)

                               beam       beam          beam modifier
                               energy     current       (given by TT position)
    intended settings:
    ---------------------------------------------------------------------
    for electron therapy     | 5-25 MeV   low           magnets
                             |
    for X-ray (photon)       | 25 MeV     high (100x)   flattener
      therapy mode           |
    for field light mode     | 0          0             none

    (b/c of the flattener, more current is needed in X-ray mode)

    What can go wrong?

    (a) if the beam has high current, but the turntable has 'magnets',
        not the flattener, it is a disaster: the patient gets hit with
        a high current electron beam

    (b) another way to kill a patient is to turn the beam on with the
        turntable in the field-light position

    So what's going on? (Multiple modes, and mixing them up is very,
    very bad)

* What actually went wrong?

    --two software problems
    --a bunch of non-technical problems

    (i) software problem #1: [this is our best guess; actually hard to
        know for sure, given the way that the paper is written.]
        --three threads
            --keyboard
            --turntable
            --general parameter setting
        --see handout for the pseudocode
        --now, if the operator sets a consistent set of parameters for
          x (X-ray (photon) mode), realizes that the doctor ordered
          something different, and then edits very quickly to e
          (electron) mode, then what happens?
            --if the re-editing takes less than 8 seconds, the general
              parameter setting thread never sees that the editing
              happened because it's busy doing something else. when it
              returns, it misses the setup signal
              (probably every single concurrency commandment was
              violated here....)
            --now the turntable is in 'e' position (magnets)
            --but the beam is a high intensity beam because the 'Treat'
              never saw the request to go to electron mode
            --each thread and the operator thinks everything is okay
            --operator presses BEAM ON
                --> patient mortally injured
        --so why doesn't the computer check the set-up for consistency
          before turning on the beam? [all it does is check that
          there's no more input processing.] alternatives:
            --double-check with operator
            --end-to-end consistency check in software
            --hardware interlocks
          [probably want all of the above]

    (ii) software problem #2:

        how it's supposed to work:
            --operator sets up parameters on the screen
            --operator moves turntable to field-light mode, and
              visually checks that patient is properly positioned
            --operator hits "set" to store the parameters
            --at this point, the class3 "interlock" (in quotation marks
              for a reason) is supposed to tell the software to check
              and perhaps modify the turntable position
            --operator presses "beam on"

        how they implemented this:
            --see pseudocode on handout

        but it doesn't always work out that way. why?
            --because this boolean flag is implemented as a counter.
            --(why implemented as a counter? the PDP-11 had an
              Increment Byte instruction that added 1 ("inc A"). This
              increment presumably took a bit less code space than
              materializing the constant 1 in an instruction like
              "A = 1".)
            --so what goes wrong?
            --every 256th time that code runs, the one-byte counter
              class3 wraps around to 0; the operator presses 'set', and
              there is no repositioning
            --operator presses "beam on", and a beam is delivered in
              field light position, with no scanning magnets or
              flattener
                --> patient injured or killed

    (iii) Lots of larger issues here too

        --***No end-to-end consistency checks***. What you actually
          want is:
            --right before turning the beam on, the software checks
              that the parameters line up
            --hardware that won't turn the beam on if the parameters
              are inconsistent
            --then double-check that by using a radiation "phantom"
        --too easy to say 'go'; errors reported by number; no
          documentation
        --false alarms (operators learn the following response: "it'll
          probably work the next time")
        --unnecessarily complex and poor code
        --weird software reuse: wrote own OS ... but used code from a
          different machine
        --measuring devices that report _underdoses_ when they are
          ridiculously saturated
        --no real quality control, unit tests, etc.
        --no error documentation, no documentation on software design
        --no follow-through on Therac-20's blown fuses
        --company lied; didn't tell users about each other's failures
        --users weren't required to report failures to a central
          clearinghouse
        --company assumed software wasn't the problem
        --risk analyses were totally bogus: parameters chosen from thin
          air. 10^{-9}, 10^{-7}, etc. Obviously those parameters were
          wrong!!
        --process
            --no unit tests
            --no quality control

* What could/should they have done?

    --Addressing the stuff above
    --You might be thinking, "So many things went wrong. There was no
      single cause of failure. Does that mean no single design change
      could have contributed to success?"
    --Answer: no! do end-to-end consistency checks! that single change
      would have prevented these errors!
    [--why no hardware interlocks?
        --decided not worth the expense
        --people (wrongly) trusted software]

* What happened in the disasters reported by the NYT?
    --Hard to know for sure
    --Looks like: software lost the treatment plan, and it defaulted to
      "all leaves open". Analog of field light position.

    What could/should have been done?
        --a good rule is: "software should have sensible defaults".
          looks like this rule was violated here.
        --in a system like this, there should be hardware interlocks
          (for example: no turning on the beam unless the leaves are
          closed)

* Discussion

    Where do the best programmers go?
        --Google, Facebook, etc....where nothing really needs to work
          (or, at least, if there are bugs, people don't die)
        --There **may** be an inverse correlation between programmer
          quality and how safety critical the code that they are
          writing is (I have no proof of this, but if I look at where
          the young "hotshot" developers are going, it's not to write
          the software that drives linear accelerators.)

    Lessons:
        --complex systems fail for complex reasons
        --be tolerant of inputs (they weren't); be strict on outputs
          (they weren't)

    Amateur ethics/philosophy

    (i). Philosophical/ethical question: you have a 999/1000 chance of
         being cured by this machine. 1/1000 times it will cause you to
         die a gruesome death. do you pick it? most people would.
            --> then, what *should* the FDA do?

    (ii). should people have to be licensed to write software? (food
          for thought)

    (iii). Would you say something if you were working at such a
           company? What if you were a new hire? What if it weren't
           safety critical?

---------------------------------------------------------------------------

SUMMARY AND REVIEW OF CONCURRENCY

We've discussed different ways to handle concurrency. Here's a review
and summary.

Unfortunately, there is no one right approach to handling concurrency.
The "right answer" changes as operating systems and hardware evolve,
and it depends on whether we're talking about what goes on inside the
kernel, how to structure an application, etc.
For example, in a world in which most machines had one CPU, it may have
made more sense to use event-driven programming in applications (note
that this is a potentially controversial claim), and to rely on turning
off interrupts in the kernel. But in a world with multiple CPUs,
event-driven programming in an application fails to take advantage of
the hardware's parallelism, and turning off interrupts in the kernel
will not avoid concurrency problems.

Why we want concurrency in the first place: better use of hardware
resources. Increase total performance by running different tasks
concurrently on different CPUs. But sometimes, serial execution of
atomic operations is needed for correctness. So how do we solve these
problems?

--*threads are an abstraction that can take advantage of concurrency.*
  applies at multiple levels, as discussed in class. indeed, a kernel
  that runs on multiple CPUs (say, in handling system calls from two
  different processes running on two different CPUs) can be regarded as
  using threads-inside-the-kernel, or there can be explicit
  threads-inside-the-kernel. this is apart from kernel-level threading
  and user-level threading, which are abstractions that applications
  use

--to get serial execution of atomic operations, we need hardware
  support at the lowest level. we may use the hardware support directly
  (as in the case of lock-free data structures), with a thin wrapper
  (as in the case of spinlocks), or wrapped in a much higher-level
  abstraction (as in the case of mutexes and monitors).

    --the hardware support that we're talking about is test&set, the
      LOCK prefix, LD_L/ST_C (load-linked/store-conditional, on the
      DEC Alpha), and interrupt enable/disable

1. The most natural (but not the only) thing we can do with the
   hardware support is to build spinlocks.
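Such a spinlock can be sketched in a few lines. This is my own minimal
sketch (not from the notes), with C11 atomic_exchange standing in for
the raw x86 XCHG instruction:

```c
/* Minimal test_and_set spinlock sketch (assumption: C11 <stdatomic.h>
 * in place of raw XCHG). */
#include <stdatomic.h>

typedef struct {
    atomic_int locked;    /* 0 = free, 1 = held */
} spinlock_t;

void spin_lock(spinlock_t *l) {
    /* atomically swap in 1; if the old value was already 1, someone
     * else holds the lock, so spin and retry */
    while (atomic_exchange(&l->locked, 1) == 1)
        ;  /* spin: this is the CPU-cycle waste discussed below */
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);
}
```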
We saw a few kinds of spinlocks:
    --test_and_set (while (xchg) {} )
    --test-and-test_and_set (example given in class from Linux)
    --MCS locks

Spinlocks are *sometimes* useful inside the kernel and more rarely
useful in application space. The reason is that wrapping larger pieces
of functionality with spinlocks wastes CPU cycles on the waiting
processor. There is a trade-off between the overhead of putting a
thread to sleep and the cycles wasted by spinning. For very short
critical sections, spinlocks are a win. For longer ones, put the thread
of execution to sleep.

2. For larger pieces of functionality, higher-level synchronization
   primitives are useful:

    --mutexes
    --mutexes and condition variables (known as monitors)
    --shared reader / single writer locks (can implement as a monitor)
    --semaphores (but you should not use these, as it is easy to make
      mistakes with them)
    --futexes (basically a semaphore or mutex used for synchronizing
      processes on Linux; the advantage is that if the futex is
      uncontended, the process never enters the kernel. The cost of a
      system call is only incurred when there is contention and a
      process needs to go to sleep (going to sleep and getting woken
      requires kernel help).)

Building all of the above correctly requires lower-level
synchronization primitives. Usually, inside of these higher-level
abstractions is a spinlock that is held for a brief time before the
thread is put to sleep and after it is woken.

[Disadvantages to both spinlocks and higher-level synchronization
primitives:

    --performance (because of synchronization points and cache line
      bounces)
    --performance v. complexity trade-off
        --hard to get code safe, live, and well-performing
        --to increase performance, we need finer-grained locking, which
          increases complexity, which imperils:
            --safety (race conditions more likely)
            --liveness (for example, deadlock, starvation more likely)
    --deadlock (hard to ensure liveness)
    --starvation (hard to ensure progress for all threads)
    --priority inversion
    --broken modularity
    --careful coding required]

In user-level code, manage these disadvantages by sacrificing
performance for correctness. In kernel code, it's trickier. Any
performance problems in the kernel will be passed to applications.
Here, the situation is sort of a mess. People use a combination of
partial lock orders, careful thought, static detection tools, code
review, and prayer.

3. Can also use hardware support to build lock-free data structures
   (for example, using atomic compare-and-swap).

    --avoids possibility of deadlock
    --better performance
    --downside: further complexity

4. Can also use hardware support to enable the Read-Copy Update (RCU)
   technique. Technique used inside the Linux kernel. Very elegant.

    --here, *writers* need to synchronize (using spinlocks, other
      hardware support, etc.), but readers do not

[Aside: another paradigm for handling concurrency: transactions
    --transactional memory (requires different hardware abstractions)
    --transactions exposed to applications and users of applications,
      like queriers of databases]

Another approach to handling concurrency is to avoid it:

5. event-driven programming
    --also manages "concurrency"
    --why? (because the processor really only can do one thing at a
      time.)
    --good match if there is lots of I/O
    --what happens if we try to program in event-driven style and we
      are a CPU-bound process?
        --there aren't natural yield points, so the other tasks may
          never get to run
        --or the CPU-bound tasks have to insert artificial yield()s.
          In vanilla event-driven programming, there is no explicit
          yield() call.
          A function that wants to stop running but wants to be run
          again has to queue up a request to be re-run. That request
          unfortunately has to manually take the important stack
          variables and place them in an entry in the event queue. In a
          very real sense, this is what the thread scheduler does
          automatically (the thread scheduler, if you look at it at a
          high level, takes various threads' stacks and switches them
          around, and also loads the CPU's registers with a thread's
          registers. this is not super-different from scheduling a
          function for later and specifying the arguments, where the
          arguments *are* the important stack variables that the
          function will need in order to execute).
    --this is one reason why you want threads: it's easy to write code
      where each thread just does some CPU-intensive thing, and the
      thread scheduler worries about interleaving the operations
      (otherwise, the interleaving is more manual and consists of the
      steps mentioned above).

6. what does JOS do?

Answer: JOS's approach to concurrency is a one-off solution that you
shouldn't take as a lesson. Here's its approach:

JOS is meant to run on single-CPU machines. It doesn't have to worry
about concurrent operations from other CPUs, but it does have to worry
about interrupts. JOS takes a simple approach: it turns interrupts off
for the entire time it is executing in the kernel. For the most part
this means JOS kernel code doesn't have to do anything special in
situations where other OSes would use locks.

JOS runs environments in user mode with interrupts enabled, so at any
point a timer interrupt may take the CPU away from an environment and
switch to a different environment. This interleaves the two
environments' instructions a bit like running them on two CPUs. The
library operating system has some data structures in memory that are
shared among multiple environments (e.g., pipes), so it needs a way to
coordinate access to that data.
In JOS we will use special-case solutions, as you will find out in lab
6. For example, to implement pipes, we will assume there is one reader
and one writer. The reader and writer never update each other's
variables; they only read each other's variables. By carefully
programming using this rule, we can avoid races.

Ultimately, threads, synchronization primitives, etc. solve a really
hard problem: how to have many things going on at the same time while
allowing the programmer to keep it all organized, sane, and correct. To
do this, we introduced abstractions like threads, functions like
swtch(), relied on hardware primitives like XCHG, and built
higher-level objects like mutexes, monitors, and condition variables.
All of this is, at the end of the day, presenting a relatively sane
model to the programmer, built on top of something that was otherwise
really hard to reason about. On the other hand, these abstractions
aren't perfect, as the litany of disadvantages should make clear, so
the solution is to be very careful when writing code that has multiple
units of execution that share memory (aka shared memory
multi-programming, aka threading, aka processes that share memory).
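As a closing sketch of the mutex-and-condition-variable (monitor)
pattern summarized above, here is a minimal one-slot mailbox of my own
devising (not code from the course), using POSIX pthreads. Waiting
threads sleep on a condition variable instead of spinning, which is
exactly the short-critical-section/longer-wait trade-off discussed in
the spinlock section.

```c
/* Minimal monitor-style sketch (assumption: POSIX pthreads): a
 * one-slot mailbox protected by a mutex, with a condition variable to
 * put waiting threads to sleep instead of spinning. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mu   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int  slot;
static bool full = false;

void put(int v) {
    pthread_mutex_lock(&mu);
    while (full)                     /* recheck: wakeups can be spurious */
        pthread_cond_wait(&cond, &mu);
    slot = v;
    full = true;
    pthread_cond_broadcast(&cond);   /* wake any waiting consumer */
    pthread_mutex_unlock(&mu);
}

int get(void) {
    pthread_mutex_lock(&mu);
    while (!full)
        pthread_cond_wait(&cond, &mu);
    int v = slot;
    full = false;
    pthread_cond_broadcast(&cond);   /* wake any waiting producer */
    pthread_mutex_unlock(&mu);
    return v;
}
```

Note the while (not if) around each wait: this follows the standard
monitor discipline of re-checking the condition after waking, since
pthread_cond_wait permits spurious wakeups.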