Class 7
CS 439
7 February 2013

On the board
------------

1. Last time
2. Trade-offs and problems from locking

    ....
    E. deadlock
    F. broken modularity

3. More advice
4. Therac-25
    --Background
    --Mechanics
    --What went wrong?
    --Discussion

---------------------------------------------------------------------------
1. Last time

    --reinforced atomicity

    --discussed some of the trade-offs/problems from locking

    --exhibit A was deadlock

    --some of you asked about the four conditions for deadlock. here
    they are; all four have to be in effect for deadlock to result:

	    * mutual exclusion: only a bounded number of
	    actors/workers/threads (often just one) can hold a
	    resource at a time

	    * hold-and-wait: wait for next resource while holding
	    current one

	    * no preemption: once the resource is granted, it cannot
	    be taken away

	    * circular wait: (cycle in graph of requests)
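
	    --A minimal sketch of the circular-wait condition, assuming a
	    hypothetical "wait-for" graph: a dict that maps each thread to
	    the threads currently holding something it wants. A cycle in
	    that graph is condition four. The function name and the graph
	    representation are illustrative, not from the lecture:

```python
def has_circular_wait(wait_for):
    """Return True if the wait-for graph contains a cycle.

    wait_for maps a thread name to the list of threads that hold a
    resource it is waiting for (hypothetical representation).
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {}

    def visit(t):
        color[t] = GRAY
        for u in wait_for.get(t, []):
            c = color.get(u, WHITE)
            if c == GRAY:              # u is already on the current path
                return True
            if c == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in wait_for)


# T1 holds A and wants B; T2 holds B and wants A: each waits for the other.
deadlocked = {"T1": ["T2"], "T2": ["T1"]}
# T1 waits for T2, but T2 waits for nobody: no cycle.
fine = {"T1": ["T2"], "T2": []}
```

	    has_circular_wait(deadlocked) finds the T1 -> T2 -> T1 cycle;
	    has_circular_wait(fine) does not. Remember that a cycle alone
	    is not enough: the other three conditions must hold too.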

2. Trade-offs and problems from locking

    E. Deadlock, continued

            [what do people do to avoid?]
            (i) ignore it (not great)
            (ii) detect and recover (not great)
            (iii) avoid algorithmically (impractical)
            (iv) prevent them by careful coding
                [negate one of the four conditions]

	    (v) Static and dynamic detection tools

		--See, for example, these citations, citations
		therein, and papers that cite them:

		    Engler, D. and K. Ashcraft. RacerX: effective,
		    static detection of race conditions and deadlocks.
		    Proc. ACM Symposium on Operating Systems Principles
		    (SOSP), October, 2003, pp237-252.
		    http://portal.acm.org/citation.cfm?id=945468

		    Savage, S., M. Burrows, G. Nelson, P. Sobalvarro,
		    and T. Anderson. Eraser: a dynamic data race
		    detector for multithreaded programs. ACM
		    Transactions on Computer Systems (TOCS), Volume 15,
		    No. 4, Nov. 1997, pp391-411.
		    http://portal.acm.org/citation.cfm?id=265927

		    a long literature on this stuff

		--Disadvantage to dynamic checking: slows program down

		--Disadvantage to static checking: many false alarms
		(tool says "there is deadlock", but in fact there is
		none) or else missed problems

		--Note that these tools get better every year. Valgrind,
		for example, includes Helgrind, which detects data races
		and inconsistent lock ordering (potential deadlocks)
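
	    --One possible way to do (iv) above, sketched under assumptions:
	    negate circular wait by always acquiring locks in one fixed
	    global order. The Account class, its rank field, and transfer()
	    are hypothetical names chosen for illustration:

```python
import threading


class Account:
    _next_rank = 0              # hypothetical global rank used to order locks

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        self.rank = Account._next_rank
        Account._next_rank += 1


def transfer(src, dst, amount):
    if src is dst:
        return                  # nothing to do; also avoids locking twice
    # Always take the lower-ranked lock first, whatever the direction of
    # the transfer, so no cycle of waiters can form.
    first, second = sorted((src, dst), key=lambda acct: acct.rank)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_demo(rounds=1000):
    a, b = Account(100), Account(100)
    # Two threads transfer in opposite directions at the same time.
    t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(rounds)])
    t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(rounds)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance
```

	    If transfer() instead locked src first and dst second, the two
	    threads in run_demo() could each hold one lock and wait forever
	    for the other.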

    F. broken modularity

	--examples above: avoiding deadlock requires understanding
	how programs call each other.

	--also, need to know, when calling a library, whether it's
	thread-safe: printf, malloc, etc. If not, surround call with
	mutex. (Can always surround calls with mutexes conservatively.)

        --basically locks bubble out of the interface
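
	--A small sketch of "surround the call with a mutex", assuming a
	made-up library class (UnsafeLog) that is not thread-safe.
	LockedLog and run_demo() are likewise hypothetical names:

```python
import threading


class UnsafeLog:
    """Stand-in for a library that makes no thread-safety promise."""

    def __init__(self):
        self.lines = []

    def append(self, line):
        self.lines.append(line)


class LockedLog:
    """Conservative wrapper: one mutex around every call into the library."""

    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    def append(self, line):
        with self._lock:
            self._inner.append(line)


def run_demo(n_threads=4, per_thread=500):
    inner = UnsafeLog()
    log = LockedLog(inner)

    def worker(i):
        for k in range(per_thread):
            log.append(f"{i}:{k}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(inner.lines)
```

	Note the modularity cost: every caller has to know to go through
	LockedLog, and if append() ever called back into code that takes
	another lock, the caller would need to know that too.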

3. More advice

    Two things for you to remember here and always; these two things are
    implied by the above discussion on sequential consistency:

	(1). if you're *using* a synchronization primitive (e.g., a
	mutex), do NOT try to read (or write) any shared data without
	holding the mutex (the mutex provides the needed ordering).

	(2). if you're *implementing* a synchronization primitive, you
	need to read the language manual carefully (to tell the
	compiler what not to order), and you need to read the
	processor manual carefully (to understand the default memory
	model and how to override if necessary).
 
    ***Your best hope if you're working with threads and monitors in
    application space:***

	(3) coarse-grained locking 
	
	(4) the MikeD rules/commandments/standards (lock()/unlock() at
	the beginning and end of functions, use monitors, use while
	loops to check scheduling constraints, etc.)

	(5) disciplined hierarchical structure to your code (so you
	can order the locks), and avoid up-calls. if your structure
	is poor, you have little hope. so the following more
	detailed advice assumes decent structure:

	    --if you have to make an up-call, better ensure that
	    partial order on locks is maintained and/or that the
	    up-call doesn't require a lock to issue the up-call (*)

	    --if you have nested objects or monitors (as in the M,N
	    example in l06-handout), then there are some cases:

		--if the target of the call does not lock, then no
		problem. the outer monitor can keep holding the
		lock.

		--if the target of the call locks but does not wait,
		then caller can continue to hold lock PROVIDED that
		partial ordering exists (for example, that the
		callee never issues a callback/up-call to the
		calling module or to an even higher layer, as
		mentioned in (*) above)

		--if the target of the call locks and does wait,
		then it is dangerous to call it while holding a lock
		in the outer layer. here, you need a different code
		structure. unfortunately, there is no silver bullet

	    --to avoid nested monitors, you can/should break your
	    code up:

		  M
		  |
		  N

	       becomes:
		
		 O
		/ \
	       M   N

		where O is an ordinary module, not a monitor.

	       example: M implements checkin/checkout for database. O is
	       the database, and N is some other monitor. 


	(6) run static detection tools (commercial products; search
	around); they're getting better every year

	(7) run dynamic detection tools (Valgrind, etc.): instrument
	program if needed

	--Bummer about all of this: hard to hide details of
	synchronization behind object interfaces 

	--That is, even the solutions that avoid deadlock require
	breaking abstraction barriers and highly disciplined coding
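
	--A sketch of what the rules in (4) might look like in practice,
	assuming Python's threading module: one lock per object, taken at
	the start of each method and released at the end, condition
	variables tied to that lock, and while loops around every wait.
	BoundedBuffer and run_demo() are illustrative names:

```python
import threading
from collections import deque


class BoundedBuffer:
    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        # Both condition variables share the monitor's single lock.
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)

    def put(self, item):
        with self._lock:                 # lock on entry, unlock on exit
            while len(self._items) >= self._capacity:   # while, not if
                self._not_full.wait()
            self._items.append(item)
            self._not_empty.notify()

    def get(self):
        with self._lock:
            while not self._items:       # re-check after every wakeup
                self._not_empty.wait()
            item = self._items.popleft()
            self._not_full.notify()
            return item


def run_demo(n=100):
    buf = BoundedBuffer(capacity=4)
    out = []
    producer = threading.Thread(target=lambda: [buf.put(i) for i in range(n)])
    consumer = threading.Thread(target=lambda: out.extend(buf.get() for _ in range(n)))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return out
```

	This monitor never calls out to another module while holding its
	lock, which is the easy case from (5).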

---------------------------------------------------------------------------

--video lecture will be assigned this weekend

---------------------------------------------------------------------------

4. Software safety and the Therac-25

    * Background

    --Draw linear accelerator

    --Magnets 
	--bending magnets

    --Bombard tungsten to get photons

    * Mechanics

	[draw picture of this thing]

	dual-mode machine (actually, triple mode, given the disasters)

                        |   beam            beam             beam modifier
                        |   energy          current          (given by
                        |                                    turntable position)
intended settings:      |
------------------------+------------------------------------------------------
   for electron therapy |   5-25 MeV        low              magnets
                        |
   for X-ray therapy    |   25 MeV          high (100 x)     flattener
     (photon mode)      |
                        |
   for field light mode |   0               0                none

	      (b/c of the flattener, more current is needed in X-ray mode)
    
       What can go wrong?

	(a) if beam has high current, but turntable has 'magnets', not
	the flattener, it is a disaster: patient gets hit with high
	current electron beam

	(b) another way to kill a patient is to turn the beam on with
	the turntable in the field-light position

	So what's going on? (Multiple modes, and mixing them up is very,
	    very bad)

    * What actually went wrong?

	--two software problems

	--a bunch of non-technical problems

	(i) software problem #1:

	[this is our best guess; actually hard to know for sure, given
	the way that the paper is written.]

	--three threads
	    --keyboard
	    --turntable
	    --general parameter setting

	--see handout for the pseudocode

	--now, if the operator sets a consistent set of parameters for x
	(X-ray (photon) mode), realizes that the doctor ordered something
	different, and then edits very quickly to e (electron) mode,
	then what happens?

	    --if the re-editing takes less than 8 seconds, the general
	    parameter setting thread never sees that the editing
	    happened because it's busy doing something else. when it
	    returns, it misses the setup signal (probably every single
	    concurrency commandment was violated here....)

	    --now the turntable is in 'e' position (magnets)

	    --but the beam is a high-intensity beam because 'Treat'
	    never saw the request to go to electron mode

	    --each thread and the operator think everything is okay

	    --operator presses BEAM ON --> patient mortally injured
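
	    --A deterministic sketch of this interleaving. This is NOT the
	    handout's pseudocode: every name here is hypothetical, and the
	    busy window is modeled as a callback rather than a real thread,
	    so that the bad ordering happens every time:

```python
class Console:
    """Shared state with no lock and no re-check (the bug being illustrated)."""

    def __init__(self, mode):
        self.mode = mode             # what the operator typed: "x" or "e"
        self.beam_current = None     # chosen by the parameter-setting step
        self.turntable = None        # chosen by the turntable step


def set_beam_parameters(console, while_busy):
    requested = console.mode         # mode is read once, up front
    while_busy()                     # long busy window; edits can land here
    # Uses the stale copy and never looks at console.mode again.
    console.beam_current = "high" if requested == "x" else "low"


def position_turntable(console):
    # Reads the current mode, so it does see the operator's edit.
    console.turntable = "flattener" if console.mode == "x" else "magnets"


def run_scenario(edit_during_busy_window):
    console = Console(mode="x")

    def operator_edit():
        if edit_during_busy_window:
            console.mode = "e"       # quick re-edit from x to e

    set_beam_parameters(console, operator_edit)
    position_turntable(console)
    return console.beam_current, console.turntable
```

	    run_scenario(True) ends with high current and the turntable at
	    'magnets', the disastrous combination; run_scenario(False) ends
	    with high current and the flattener, as intended for X-ray mode.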

	 --so why doesn't the computer check the set-up for consistency
	 before turning on the beam? [all it does is check that there's
	 no more input processing.]
	    alternatives:
		--double-check with operator
		--end-to-end consistency check in software
		--hardware interlocks
		[probably want all of the above] 
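
	    --A sketch of what an end-to-end software check could look
	    like, using the intended settings from the table above. The
	    string labels and the function name are assumptions; a real
	    check would read the actual hardware state rather than
	    software variables:

```python
# The only (beam current, turntable position) pairs intended for treatment.
# "low"/"high" and the position names are hypothetical labels.
TREATMENT_SETTINGS = {
    ("low", "magnets"),       # electron therapy
    ("high", "flattener"),    # X-ray (photon) therapy
}


def beam_on_allowed(beam_current, turntable):
    """Run immediately before BEAM ON; refuse anything not explicitly intended."""
    return (beam_current, turntable) in TREATMENT_SETTINGS
```

	    With this check, high current with 'magnets' in place, or any
	    current with the turntable in the field-light position, is
	    refused no matter how the state came to be inconsistent.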


	(ii) software problem #2:

	how it's supposed to work:

	    --operator sets up parameters on the screen

	    --operator moves turntable to field-light mode, and visually
	    checks that patient is properly positioned

	    --operator hits "set" to store the parameters

	    --at this point, the class3 "interlock" (in quotation marks
	    for a reason) is supposed to tell the software to check and
	    perhaps modify the turntable position

	    --operator presses "beam on"

	how they implemented this:

	    --see pseudocode on handout

	but it doesn't always work out that way. why?
	    
	    --because this boolean flag is implemented as a counter.

	    --(why implemented as a counter? PDP-11 had an Increment
	    Byte instruction that added 1 ("inc A"). This increment thing
	    presumably took a bit less code space than materializing the
	    constant 1 in an instruction like "A = 1".)

	    --so what goes wrong?
		
		--every 256th time that the code runs, the one-byte
		counter wraps around to 0, so class3 looks unset. if the
		operator presses 'set' at that moment, no repositioning
		happens

		--operator presses "beam on", and a beam is delivered in
		field light position, with no scanning magnets or
		flattener --> patient injured or killed
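
	    --A sketch of the wraparound, assuming a one-byte flag that is
	    incremented on every pass and treated as "set" when nonzero.
	    The function names are made up, and the byte is simulated with
	    a mask because Python integers do not overflow:

```python
def passes_that_skip_check(n_passes):
    """Return the pass numbers on which the turntable check is skipped."""
    class3 = 0                           # one byte used as a boolean flag
    skipped = []
    for i in range(1, n_passes + 1):
        class3 = (class3 + 1) & 0xFF     # increment byte: 255 wraps to 0
        if class3 == 0:
            skipped.append(i)            # flag looks clear: no check this pass
    return skipped


def passes_that_skip_check_fixed(n_passes):
    """Same loop, but the flag is set by storing a constant."""
    skipped = []
    for i in range(1, n_passes + 1):
        class3 = 1                       # "A = 1" can never wrap to 0
        if class3 == 0:
            skipped.append(i)
    return skipped
```

	    passes_that_skip_check(1024) returns [256, 512, 768, 1024]; the
	    fixed version returns an empty list.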

	(iii) Lots of larger issues here too

	    --***No end-to-end consistency checks***. What you actually
	    want is:
		--right before turning the beam on, the software checks
		that parameters line up
		--hardware that won't turn beam on if the parameters are
		inconsistent
		--then double-check that by using a radiation "phantom"

	    --too easy to say 'go', errors reported by number, no
	    documentation

            --garbage left on the screen

	    --false alarms (operators learn the following response:
	    "it'll probably work the next time") 
		(put differently, people became "insensitive to machine
		malfunctions")

	    --unnecessarily complex and poor code

	    --weird software reuse: wrote own OS ... but used code from
	    a different machine
	    
	    --measuring devices that report _underdoses_ when they are
	    ridiculously saturated

	    --no real quality control, unit tests, etc.

	    --no error documentation, no documentation on software
	    design

	    --no follow-through on Therac-20's blown fuses

	    --company lied; didn't tell users about each other's
	    failures

	    --users weren't required to report failures to a central
	    clearinghouse

	    --no investigation when other problems arose

	    --company assumed software wasn't the problem

	    --risk analyses were totally bogus: parameters chosen from
	    thin air. 10^{-11}, 4*10^{-9}, etc. Obviously those parameters
	    were wrong!!
		(they were supposedly estimating things like "computer
		selects wrong energy")

	    --bogus changes that didn't solve the problems

	    --process
		--no unit tests
		--no quality control

    * What could/should they have done?

    --Addressing the stuff above

    --You might be thinking, "So many things went wrong. There was no
    single cause of failure. Does that mean no single design change
    could have contributed to success?"

    --Answer: no! do end-to-end consistency checks! that single
    change would have prevented these errors!

    [--why no hardware interlocks?
	--decided not worth the expense
	--people (wrongly) trusted software]

   * why is it so hard to figure out what is going on?

   --because the writing isn't good

        --irrelevant details

        --repetition

        --inconsistent descriptions

        --sentences in passive voice

        --pseudo-code doesn't tell us what's actually going on

        --confusing energy and current (the problem is high _current_,
        not high energy, but they never say that)
 

    * What happened in disasters reported by NYT?

	--Hard to know for sure

	--Looks like: software lost the treatment plan, and it defaulted 
	to "all leaves open". Analog of field light position.

	What could/should have been done?

	    --a good rule is: "software should have sensible defaults".
	    looks like this rule is violated here.

	    --in a system like this, there should be hardware interlocks
	    (for example: no turning on the beam unless the leaves are
	    closed)
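
	    --A sketch of the "sensible defaults" rule, under the
	    assumption that a treatment plan is a list with one entry per
	    leaf. The constants and function names are hypothetical, and
	    this is not a description of the actual product's software:

```python
CLOSED, OPEN = 0, 1


def plan_is_complete(plan, n_leaves):
    return plan is not None and len(plan) == n_leaves


def leaf_positions(plan, n_leaves):
    """Per-leaf positions; a missing or incomplete plan means all closed."""
    if not plan_is_complete(plan, n_leaves):
        return [CLOSED] * n_leaves       # fail safe: block the beam
    return list(plan)


def beam_permitted(plan, n_leaves):
    # Separate guard: refuse to fire at all without a complete plan.
    return plan_is_complete(plan, n_leaves)
```

	    The point of the default is that losing the plan produces the
	    safest state, not the most dangerous one. Software defaults
	    like this complement a hardware interlock; they do not
	    replace it.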

    * Discussion

    Theme in building systems: be tolerant of inputs / be strict about
    outputs (they were the other way around)
    
    Authors say: "There is always another software bug." Why? (Because
    there usually is.)

    "Patient reactions were the only real indications of the seriousness
    of the problems with the Therac-25."

    Where do the best programmers go?

	--Google, Facebook, etc....where nothing really needs to work
	(or, at least, if there are bugs, people don't die)

	--There **may** be an inverse correlation between programmer
	quality and how safety critical the code that they are writing
	is (I have no proof of this, but if I look at where the young
	"hotshot" developers are going, it's not usually to write the
	software to drive linear accelerators.)
    
    Lessons:

        --complex systems fail for complex reasons

	--be tolerant of inputs (they weren't); be strict on outputs
	(they weren't)

    Amateur ethics/philosophy

	(i). Philosophical/ethical question: you have a 999/1000 chance of being
	cured by this machine. 1/1000 times it will cause you to die a gruesome
	death. do you pick it? most people would.

	--> then, what *should* the FDA do?

	(ii). should people have to be licensed to write software?
	(food for thought)

	(iii). Would you say something if you were working at such a
	company? What if you were a new hire? What if it weren't safety
	critical?