Class 8
CS 202
24 February 2015

On the board
------------

1. Last time
2. Therac-25
    --Background
    --Mechanics
    --What went wrong?
    --Discussion

---------------------------------------------------------------------------

1. Last time

    discussed the problems brought on by locking

    coarse-grained vs. fine-grained locking. each has disadvantages, as
    discussed. ADVICE: when you code on your own, start with
    coarse-grained locking! (safety and deadlock bugs less likely.)
    then, if performance is problematic, you can iterate on your design,
    introduce more locks, etc.

2. Software safety and the Therac-25

    * Background

    --Draw linear accelerator

    --Magnets 
	--bending magnets

    --Bombard tungsten to get photons

    --Field light mode: light across from angled mirror. simulates beam.

    * Mechanics

	[draw picture of this thing]

	dual-mode machine (actually, triple mode, given the disasters)

			    beam               beam                beam
			    energy            current            modifier
			                                        (given by TT
								position)
intended settings:      ---------------------------------------------------
   for electron therapy |    5-25 MeV          low                magnets
			| 
			|
   for X-ray therapy    |   25 MeV            high (100 x)       flattener
	photon mode     |  
			|  
   for field light mode |      0                 0                 none


       What can go wrong?

	(a) if beam has high current, but turntable has 'magnets', not
	the flattener, it is a disaster: patient gets hit with high
	current electron beam

	(b) another way to kill a patient is to turn the beam on with
	the turntable in the field-light position

	So what's going on? (Multiple modes, and mixing them up is very,
	    very bad)

---------------------------------------------------------------------------

admin notes:

   --promise to test concurrent programming with monitors

   --old exams posted

   --midterm review session Sunday night

---------------------------------------------------------------------------

    * What actually went wrong?

	--two software problems

	--a bunch of non-technical problems

	(i) software problem #1:

	[this is our best guess; actually hard to know for sure, given
	the way that the paper is written.]

	--three threads
	    --keyboard
	    --turntable
	    --general parameter setting

	--see handout for the pseudocode

	--now, if the operator sets a consistent set of parameters for x
	(X-ray (photon) mode), realizes that the doctor ordered something
	different, and then edits very quickly to e (electron) mode,
	then what happens?

	    --if the re-editing takes less than 8 seconds, the general
	    parameter setting thread never sees that the editing
	    happened because it's busy doing something else. when it
	    returns, it misses the setup signal (probably every single
	    concurrency commandment was violated here....)

	    --now the turntable is in 'e' position (magnets)

	    --but the beam is a high intensity beam because the 'Treat'
	    never saw the request to go to electron mode

	    --each thread and the operator thinks everything is okay

	    --operator presses BEAM ON --> patient mortally injured

	 --so why doesn't the computer check the set-up for consistency
	 before turning on the beam? [all it does it check that there's
	 no more input processing.] 
	    alternatives:
		--double-check with operator
		--end-to-end consistency check in software
		--hardware interlocks
		[probably want all of the above] 

	how it's supposd to work:

	    --operator sets up parameters on the screen

	    --operator moves turntable to field-light mode, and visually
	    checks that patient is properly positioned

	    --operator hits "set" to store the parameters

	    --at this point, the class3 "interlock" (in quotation marks
	    for a reason) is supposed to tell the software to check and
	    perhaps modify the turntable position

	    --operator presses "beam on"

	how they implemented this:

	    --see pseudocode on handout

	but it doesn't always work out that way. why?
	    
	    --because this boolean flag is implemented as a counter.

	    --(why implemented as a counter? PDP-11 had an Increment
	    Byte instruction that added 1 ("inc A"). This increment thing
	    presumably took a bit less code space than materializing the
	    constant 1 in an instruction like "A = 1".)

	    --so what goes wrong?
		
		--every 256 times that code runs, class3 is set to 0,
		operator presses 'set', and no repositioning

		--operator presses "beam on", and a beam is delivered in
		field light position, with no scanning magnets or
		flattener --> patient injured or killed

	(iii) Lots of larger issues here too

	    --***No end-to-end consistency checks***. What you actually
	    want is:
		--right before turning the beam on, the software checks
		that parameters line up
		--hardware that won't turn beam on if the parameters are
		inconsistent
		--then double-check that by using a radiation "phantom"

	    --too easy to say 'go', errors reported by number, no
	    documentation

            --garbage left on the screen

	    --false alarms (operators learn the following response:
	    "it'll probably work the next time") 
		(put differently, people became "insensitive to machine
		malfunctions")

	    --unnecessarily complex and poor code

	    --ill-fitting software reuse: wrote own OS ... but used code
	    from a different machine
	    
	    --measuring devices that report _underdoses_ when they are
	    ridiculously saturated

	    --no real quality control, unit tests, etc.

	    --no error documentation, no documentation on software
	    design

	    --no follow-through on Therac-20's blown fuses

            --ignored initial problem reports

	    --company lied; didn't tell users about each other's
	    failures

	    --users weren't required to report failures to a central
	    clearinghouse

            --wishful thinking: they made some changes, and just hoped
            that the changes addressed the bug. "The letter goes on to
            support this opinion by listing two pages of technical
            reasons why an overdose by the Therac-25 was impossible."
            (My code can't have bugs!!!)

	    --no investigation when other problems arose

	    --company assumed software wasn't the problem; in fact, they
	    assumed that software could make the machine safe.

	    --risk analyses were totally bogus: parameters chosen from
	    thin air. 10^{-11}, 4*10^{-9}, etc. Obviously those parameters
	    were wrong!!
		(they were supposedly estimating things like "computer
		selects wrong energy" as having 10^{-11} probability)

	    --bogus changes that didn't solve the problems

	    --covering the cursor "UP" key, as if that was the problem
	    (or the only problem)

	    --process
		--no unit tests
		--no quality control


    * What could/should they have done?

    --Addressing the stuff above

    --You might be thinking, "So many things went wrong. There was no
    single cause of failure. Does that mean no single design change
    could have contributed to success?"

    --Answer: no! do end-to-end consistency checks! that single
    change would have prevented these errors!

    [--why no hardware interlocks?
	--decided not worth the expense
	--people (wrongly) trusted software]

   * why is it so hard to figure out what is going on?

   --because the writing isn't good

        --irrelevant details

        --repetition

        --inconsistent descriptions

        --sentences in passive voice

        --pseudo-code doesn't tell us what's actually going on

        --confusing energy and current (the problem is a beam with high
        _current_, leading to high energy striking the patient, but they
        never say that).
 
    * What happened in disasters reported by NYT?

	--Hard to know for sure

	--Looks like: software lost the treatment plan, and it defaulted 
	to "all leaves open". Analog of field light position.

	What could/should have been done?

	    --a good rule is: "software should have sensible defaults".
	    looks like this rule is violated here.

	    --in a system like this, there should be hardware interlocks
	    (for example: no turning on the beam unless the leaves are
	    closed)


    * Discussion

    Theme in building systems: be tolerant of inputs / be strict about
    outputs (they were the other way around)
    
    Authors say: "There is always another software bug." Why? (Because
    there usually is.)

    "Patient reactions were the only real indications of the seriousness
    of the problems with the Therac-25." (Ouch.)

    Where do the best programmers go?

	--Google, Facebook, etc....where nothing really needs to work
	(or, at least, if there are bugs, people don't die)

	--There **may** be an inverse correlation between programmer
	quality and how safety critical the code that they are writing
	is (I have no proof of this, but if I look at where the young
	"hotshot" developers are going, it's not usually to write the
	software to drive linear accelerators.)
    
    Lessons:

        --complex systems fail for complex reasons

	--be tolerant of inputs (they weren't); be strict on outputs
	(they weren't)

    Amateur ethics/philosophy

	(i). Philosophical/ethical question: you have a 999/1000 chance of being
	cured by this machine. 1/1000 times it will cause you to die a gruesome
	death. do you pick it? most people would.

	--> then, what *should* the FDA do?

	(ii). should people have to be licensed to write software?
	(food for thought)

	(iii). Would you say something if you were working at such a
	company? What if you were a new hire? What if it weren't safety
	critical?