Class 7
CS 439
7 February 2013

On the board
------------

1. Last time

2. Trade-offs and problems from locking
    ....
    E. deadlock
    F. broken modularity

3. More advice

4. Therac-25
    --Background
    --Mechanics
    --What went wrong?
    --Discussion

---------------------------------------------------------------------------

1. Last time

    --reinforced atomicity

    --discussed some of the trade-offs/problems from locking

    --exhibit A was deadlock

    --some of you asked about the four conditions for deadlock. here
    they are; all four have to be in effect for deadlock to result:

        * mutual exclusion: only a finite number of
          actors/workers/threads can hold a resource at a time
        * hold-and-wait: wait for next resource while holding current
          one
        * no preemption: once the resource is granted, it cannot be
          taken away
        * circular wait: (cycle in graph of requests)

2. Trade-offs and problems from locking

    E. Deadlock, continued

        [what do people do to avoid?]

        (i) ignore it (not great)

        (ii) detect and recover (not great)

        (iii) avoid algorithmically (impractical)

        (iv) prevent it by careful coding [negate one of the four
        conditions]

        (v) Static and dynamic detection tools

            --See, for example, these citations, citations therein, and
            papers that cite them:

                Engler, D. and K. Ashcraft. RacerX: effective, static
                detection of race conditions and deadlocks. Proc. ACM
                Symposium on Operating Systems Principles (SOSP),
                October, 2003, pp237-252.
                http://portal.acm.org/citation.cfm?id=945468

                Savage, S., M. Burrows, G. Nelson, P. Sobalvarro, and
                T. Anderson. Eraser: a dynamic data race detector for
                multithreaded programs. ACM Transactions on Computer
                Systems (TOCS), Volume 15, No 4., Nov., 1997,
                pp391-411.
                http://portal.acm.org/citation.cfm?id=265927

                a long literature on this stuff

            --Disadvantage to dynamic checking: slows program down

            --Disadvantage to static checking: many false alarms (tool
            says "there is deadlock", but in fact there is none) or else
            missed problems

            --Note that these tools get better every year.
            Valgrind, for one, includes a race and lock-order (potential
            deadlock) detection tool, Helgrind.

    F. broken modularity

        --examples above: avoiding deadlock requires understanding how
        programs call each other.

        --also, need to know, when calling a library, whether it's
        thread-safe: printf, malloc, etc. If not, surround call with
        mutex. (Can always surround calls with mutexes conservatively.)

        --basically locks bubble out of the interface

3. More advice

    Two things for you to remember here and always; these two things
    are implied by the earlier discussion of sequential consistency:

    (1). if you're *using* a synchronization primitive (e.g., a mutex),
    do NOT try to read any shared data outside of the mutex (the mutex
    provides the needed ordering).

    (2). if you're *implementing* a synchronization primitive, you need
    to read the language manual carefully (to tell the compiler what
    not to reorder), and you need to read the processor manual
    carefully (to understand the default memory model and how to
    override it if necessary).

    ***Your best hope if you're working with threads and monitors in
    application space:***

    (3) coarse-grained locking

    (4) the MikeD rules/commandments/standards (lock()/unlock() at the
    beginning and end of functions, use monitors, use while loops to
    check scheduling constraints, etc.)

    (5) disciplined hierarchical structure to your code (so you can
    order the locks), and avoid up-calls. if your structure is poor,
    you have little hope. so the following more detailed advice assumes
    decent structure:

        --if you have to make an up-call, better ensure that partial
        order on locks is maintained and/or that the up-call doesn't
        require a lock to issue the up-call (*)

        --if you have nested objects or monitors (as in the M,N example
        in l06-handout), then there are some cases:

            --if the target of the call does not lock, then no problem.
            the outer monitor can keep holding the lock.
            --if the target of the call locks but does not wait, then
            caller can continue to hold lock PROVIDED that partial
            ordering exists (for example, that the callee never issues a
            callback/up-call to the calling module or to an even higher
            layer, as mentioned in (*) above)

            --if the target of the call locks and does wait, then it is
            dangerous to call it while holding a lock in the outer
            layer. here, you need a different code structure.
            unfortunately, there is no silver bullet

        --to avoid nested monitors, you can/should break your code up:

                M
                |
                N

            becomes:

                  O
                 / \
                M   N

            where O is an ordinary module, not a monitor. example: M
            implements checkin/checkout for database. O is the
            database, and N is some other monitor.

    (6) run static detection tools (commercial products; search
    around); they're getting better every year

    (7) run dynamic detection tools (Valgrind, etc.): instrument
    program if needed

    --Bummer about all of this: hard to hide details of synchronization
    behind object interfaces

    --That is, even the solutions that avoid deadlock require breaking
    abstraction barriers and highly disciplined coding

---------------------------------------------------------------------------

--video lecture will be assigned this weekend

---------------------------------------------------------------------------

4. Software safety and the Therac-25

    * Background

        --Draw linear accelerator
        --Magnets
        --bending magnets
        --Bombard tungsten to get photons

    * Mechanics

        [draw picture of this thing]

        dual-mode machine (actually, triple mode, given the disasters)

                                 beam        beam          beam
                                 energy      current       modifier
                                                           (given by TT
                                                           position)
        intended settings:
        ---------------------------------------------------------------
        for electron therapy   | 5-25 MeV    low           magnets
                               |
        for X-ray therapy      | 25 MeV      high (100 x)  flattener
          (photon mode)        |
                               |
        for field light mode   | 0           0             none

        (b/c of the flattener, more current is needed in X-ray mode)

    What can go wrong?
        (a) if beam has high current, but turntable has 'magnets', not
        the flattener, it is a disaster: patient gets hit with high
        current electron beam

        (b) another way to kill a patient is to turn the beam on with
        the turntable in the field-light position

    So what's going on?
        (Multiple modes, and mixing them up is very, very bad)

    * What actually went wrong?

        --two software problems

        --a bunch of non-technical problems

        (i) software problem #1:

            [this is our best guess; actually hard to know for sure,
            given the way that the paper is written.]

            --three threads
                --keyboard
                --turntable
                --general parameter setting

            --see handout for the pseudocode

            --now, if the operator sets a consistent set of parameters
            for x (X-ray (photon) mode), realizes that the doctor
            ordered something different, and then edits very quickly to
            e (electron) mode, then what happens?

                --if the re-editing takes less than 8 seconds, the
                general parameter setting thread never sees that the
                editing happened because it's busy doing something
                else. when it returns, it misses the setup signal
                (probably every single concurrency commandment was
                violated here....)

                --now the turntable is in 'e' position (magnets)

                --but the beam is a high intensity beam because 'Treat'
                never saw the request to go to electron mode

                --each thread and the operator thinks everything is okay

                --operator presses BEAM ON --> patient mortally injured

            --so why doesn't the computer check the set-up for
            consistency before turning on the beam? [all it does is
            check that there's no more input processing.]
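            [a rough sketch of what such a check might look like, in
            Python. everything here is hypothetical: the names
            MachineState, ALLOWED_FOR_TREATMENT, and beam_on are made up
            for illustration, and the string values just mirror the
            intended-settings table above. it is not the Therac-25
            code.]

```python
from dataclasses import dataclass

# Hypothetical names and values; they mirror the intended-settings table,
# not the real Therac-25 software.
# (requested mode, turntable position, beam current) triples that are safe.
ALLOWED_FOR_TREATMENT = {
    ("electron", "magnets", "low"),
    ("xray", "flattener", "high"),
}


@dataclass(frozen=True)
class MachineState:
    mode: str       # mode currently shown on the operator's screen
    turntable: str  # position read back from the turntable
    current: str    # beam current actually configured


def beam_on(state: MachineState) -> str:
    """Check the whole configuration right before firing the beam."""
    triple = (state.mode, state.turntable, state.current)
    if triple not in ALLOWED_FOR_TREATMENT:
        return "REFUSED"  # includes field-light mode and any mismatch
    return "BEAM ON"
```

            with a check like this, the scenario above (turntable at
            'magnets', beam still at high current) is refused rather
            than delivered, no matter which thread missed which signal.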
            alternatives:
                --double-check with operator
                --end-to-end consistency check in software
                --hardware interlocks
                [probably want all of the above]

        (ii) software problem #2:

            how it's supposed to work:

                --operator sets up parameters on the screen

                --operator moves turntable to field-light mode, and
                visually checks that patient is properly positioned

                --operator hits "set" to store the parameters

                --at this point, the class3 "interlock" (in quotation
                marks for a reason) is supposed to tell the software to
                check and perhaps modify the turntable position

                --operator presses "beam on"

            how they implemented this:

                --see pseudocode on handout

            but it doesn't always work out that way. why?

                --because this boolean flag is implemented as a counter.

                --(why implemented as a counter? PDP-11 had an Increment
                Byte instruction that added 1 ("inc A"). This increment
                thing presumably took a bit less code space than
                materializing the constant 1 in an instruction like
                "A = 1".)

            --so what goes wrong?

                --every 256th time that code runs, the one-byte class3
                wraps around to 0; if the operator presses 'set' at that
                moment, there is no repositioning

                --operator presses "beam on", and a beam is delivered in
                field light position, with no scanning magnets or
                flattener --> patient injured or killed

        (iii) Lots of larger issues here too

            --***No end-to-end consistency checks***. What you actually
            want is:

                --right before turning the beam on, the software checks
                that parameters line up

                --hardware that won't turn beam on if the parameters
                are inconsistent

                --then double-check that by using a radiation "phantom"

            --too easy to say 'go', errors reported by number, no
            documentation

            --garbage left on the screen

            --false alarms (operators learn the following response:
            "it'll probably work the next time")

                (put differently, people became "insensitive to machine
                malfunctions")

            --unnecessarily complex and poor code

            --weird software reuse: wrote own OS ...
            but used code from a different machine

            --measuring devices that report _underdoses_ when they are
            ridiculously saturated

            --no real quality control, unit tests, etc.

            --no error documentation, no documentation on software
            design

            --no follow-through on Therac-20's blown fuses

            --company lied; didn't tell users about each other's
            failures

            --users weren't required to report failures to a central
            clearinghouse

            --no investigation when other problems arose

            --company assumed software wasn't the problem

            --risk analyses were totally bogus: parameters chosen from
            thin air. 10^{-11}, 4*10^{-9}, etc. Obviously those
            parameters were wrong!! (they were supposedly estimating
            things like "computer selects wrong energy")

            --bogus changes that didn't solve the problems

            --process
                --no unit tests
                --no quality control

    * What could/should they have done?

        --Addressing the stuff above

        --You might be thinking, "So many things went wrong. There was
        no single cause of failure. Does that mean no single design
        change could have contributed to success?"

        --Answer: no! do end-to-end consistency checks! that single
        change would have prevented these errors!

        [--why no hardware interlocks?
            --decided not worth the expense
            --people (wrongly) trusted software]

    * why is it so hard to figure out what is going on?

        --because the writing isn't good
            --irrelevant details
            --repetition
            --inconsistent descriptions
            --sentences in passive voice
            --pseudo-code doesn't tell us what's actually going on
            --confusing energy and current (the problem is high
            _current_, not high energy, but they never say that)

    * What happened in disasters reported by NYT?

        --Hard to know for sure

        --Looks like: software lost the treatment plan, and it
        defaulted to "all leaves open". Analog of field light position.

        What could/should have been done?

        --a good rule is: "software should have sensible defaults".
        looks like this rule is violated here.
        --in a system like this, there should be hardware interlocks
        (for example: no turning on the beam unless the leaves are
        closed)

    * Discussion

        Theme in building systems: be tolerant of inputs / be strict
        about outputs (they were the other way around)

        Authors say: "There is always another software bug." Why?
        (Because there usually is.)

        "Patient reactions were the only real indications of the
        seriousness of the problems with the Therac-25."

        Where do the best programmers go?

            --Google, Facebook, etc....where nothing really needs to
            work (or, at least, if there are bugs, people don't die)

            --There **may** be an inverse correlation between
            programmer quality and how safety critical the code that
            they are writing is (I have no proof of this, but if I look
            at where the young "hotshot" developers are going, it's not
            usually to write the software to drive linear accelerators.)

        Lessons:

            --complex systems fail for complex reasons

            --be tolerant of inputs (they weren't); be strict on outputs
            (they weren't)

        Amateur ethics/philosophy

            (i). Philosophical/ethical question: you have a 999/1000
            chance of being cured by this machine. 1/1000 times it will
            cause you to die a gruesome death. do you pick it?

                most people would.

                --> then, what *should* the FDA do?

            (ii). should people have to be licensed to write software?
            (food for thought)

            (iii). Would you say something if you were working at such a
            company? What if you were a new hire? What if it weren't
            safety critical?
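
    [aside, going back to software problem #2: the "flag implemented as
    a counter" failure is easy to reproduce. the sketch below is Python,
    not PDP-11 assembly, and the names increment_byte and
    count_skipped_checks are made up for illustration; the only
    assumption taken from the notes is that class3 is one byte wide and
    is "set" by incrementing it.]

```python
def increment_byte(value: int) -> int:
    """Add 1 to an 8-bit value, wrapping from 255 back to 0."""
    return (value + 1) & 0xFF


def count_skipped_checks(passes: int) -> int:
    """Count the passes on which the flag reads 0 right after being 'set'."""
    class3 = 0  # hypothetical stand-in for the one-byte flag
    skipped = 0
    for _ in range(passes):
        class3 = increment_byte(class3)  # intended meaning: "flag = true"
        if class3 == 0:                  # reads as "false": check skipped
            skipped += 1
    return skipped


def set_flag() -> int:
    """The boring fix: store a constant instead of incrementing."""
    return 1
```

    count_skipped_checks(256) finds exactly one pass where the flag
    reads as "false" immediately after being "set"; storing a constant,
    as in set_flag, never does.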