Class 10
CS372H 21 February 2012

On the board
------------
1. Last time
2. Alternatives to locking/concurrency
3. Advice
4. Therac-25 and software safety
   --Background
   --Mechanics
   --What went wrong?
   --Discussion

---------------------------------------------------------------------------

1. Last time

--MCS locks

--trade-offs and problems from locking

--more about broken modularity: also, need to know, when calling a
library, whether it's thread-safe: printf, malloc, etc. If not,
surround the call with a mutex. (Can always surround calls with
mutexes conservatively.)

--basically, locks bubble out of the interface

--some of you asked about the four conditions for deadlock. here they
are; all four have to be in effect for deadlock to result:

    * mutual exclusion: only a finite number of actors/workers/threads
      can hold a resource at once

    * hold-and-wait: a thread waits for its next resource while
      holding its current one

    * no preemption: once a resource is granted, it cannot be taken
      away

    * circular wait: there is a cycle in the graph of requests

** NOTE: the end of today's notes contains a summary of the whole
concurrency unit. Not claiming it's necessary or sufficient, but it
does try to place what we've covered in some sort of context.

2. Alternatives to locking/concurrency

A. futexes (for performance)

    --motivation: locks must interact with the scheduler, and syscalls
    on locks need to go into the kernel, which is expensive.

    --so can we optimize the process of acquiring the lock so that we
    go into the kernel only if we can't get the lock?

    --enter futexes ["Fuss, Futexes and Furwocks: Fast Userlevel
    Locking in Linux", H. Franke, R. Russell, and M. Kirkwood, Ottawa
    Linux Symposium, 2002.]

    --idea: ask the kernel to put us to sleep only if a given memory
    location hasn't changed (if it has changed, the lock is contended,
    and we'll need kernel help to adjudicate)

        --void futex (int *uaddr, FUTEX_WAIT, int val, ....)

            --Go to sleep only if *uaddr == val
            --Extra arguments allow timeouts, etc.

        --void futex (int *uaddr, FUTEX_WAKE, int val, ...)
            --Wake up at most val threads sleeping on uaddr

        --uaddr is translated down to an offset in a VM object

            --So this works on a memory-mapped file at different
            virtual addresses in different processes

    --idea:

        --to "acquire", atomically decrement or set

        --to "release", atomically increment or reset. if the value is
        1, great; if the value is *less than* 1, it means someone else
        asked for the mutex, so ask for the kernel's help waking the
        waiters

B. RCU (read-copy-update)

    [see: "Read-Copy Update", McKenney et al., Proc. Ottawa Linux
    Symposium, 2001.
    http://lse.sourceforge.net/locking/rcu/rclock_OLS.2001.05.01c.sc.pdf]

    --Some data is read way more often than written

        --Like routing tables: consulted for each packet that is
        forwarded

        --Or data maps in a system with 100+ disks: updated when a
        disk fails, maybe every 10^10 operations

    --Optimize for the common case of reading without a lock

        e.g., global variable: routing_table *rt;
        Call lookup (rt, route); with no locking

    --Update by making a copy and swapping the pointer

        --routing_table *nrt = copy_routing_table (rt);
        --Update nrt [threads still in the old table keep working]
        --Set global rt = nrt when done updating
        --All lookup calls see a consistent old or new table

    --the hard part: when can we free the memory of the old routing
    table?

        --answer: when we are guaranteed that no one is using it

        --but how can we determine that?

        --loosely speaking, wait until each thread has context
        switched at least once

        --at that point, each thread will have been in a _quiescent
        state_ at least once, and in a _quiescent state_ the thread's
        temporary variables are dead

C. event-driven programming

    --also manages "concurrency".

    --why? (because the processor really can only do one thing at a
    time.)

    --good match if there is lots of I/O

    --what happens if we try to program in event-driven style and we
    are a CPU-bound process?

        --there aren't natural yield points, so the other tasks may
        never get to run

        --or the CPU-bound tasks have to insert artificial yield()s.
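The acquire/release protocol sketched in section A above can be made
concrete. Below is a minimal sketch in C of a futex-backed mutex in
the style of the Franke et al. paper, using the raw Linux futex
syscall (it has no libc wrapper) and GCC/Clang atomic builtins. It is
illustrative, not production code; state 0 means unlocked, 1 locked
and uncontended, 2 locked with possible waiters:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(int *addr, int val) {
    /* sleep only if *addr still equals val */
    syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(int *addr, int n) {
    /* wake at most n threads sleeping on addr */
    syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}

void mutex_lock(int *m) {
    int c = 0;
    /* fast path: 0 -> 1 with one atomic instruction, no syscall */
    if (__atomic_compare_exchange_n(m, &c, 1, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;
    /* slow path: mark the lock "locked with waiters" (2), then sleep */
    if (c != 2)
        c = __atomic_exchange_n(m, 2, __ATOMIC_ACQUIRE);
    while (c != 0) {
        futex_wait(m, 2);   /* kernel rechecks that *m == 2 */
        c = __atomic_exchange_n(m, 2, __ATOMIC_ACQUIRE);
    }
}

void mutex_unlock(int *m) {
    /* 1 -> 0 means no waiters: no syscall needed.
     * Otherwise the value was 2: reset to 0 and wake one waiter. */
    if (__atomic_fetch_sub(m, 1, __ATOMIC_RELEASE) != 1) {
        __atomic_store_n(m, 0, __ATOMIC_RELEASE);
        futex_wake(m, 1);
    }
}
```

The point to notice is that the uncontended paths of mutex_lock and
mutex_unlock are a single atomic instruction each; the kernel is
entered only when a thread must sleep or a waiter must be woken.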
    In vanilla event-driven programming, there is no explicit yield()
    call. A function that wants to stop running but wants to be run
    again has to queue up a request to be re-run. That request
    unfortunately has to manually take the important stack variables
    and place them in an entry in the event queue. In a very real
    sense, this is what the thread scheduler does automatically (the
    thread scheduler, if you look at it at a high level, takes various
    threads' stacks and switches them around, and also loads the CPU's
    registers with a thread's registers. this is not super-different
    from scheduling a function for later and specifying the arguments,
    where the arguments *are* the important stack variables that the
    function will need in order to execute).

    --this is one reason why you want threads: it's easy to write code
    where each thread just does some CPU-intensive thing, and the
    thread scheduler worries about interleaving the operations
    (otherwise, the interleaving is more manual and consists of the
    steps mentioned above).

D. non-blocking synchronization

    --wait-free algorithms
    --lock-free algorithms

    --the skinny on these is that:

        --using atomic instructions such as compare-and-swap (CMPXCHG
        on the x86), you can implement many common data structures
        (stacks, queues, even hash tables)

        --in fact, you can implement *any* algorithm in wait-free
        fashion, given the right hardware

        --the problem is that such algorithms wind up using lots of
        memory or involving many retries, so they are inefficient

        --but since no thread ever holds a lock, deadlock is
        impossible

E. transactions

    --using them in the kernel requires hardware support
        see http://www.cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf
        Intel's latest chips should have such extensions

    --using them in application space does not

    --when deadlock is detected, the transaction manager aborts the
    transaction

* how free are the above approaches from the disadvantages listed
above?
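To make section D concrete, here is a sketch of the classic lock-free
("Treiber") stack built on compare-and-swap, written with C11 atomics
(the compiler emits CMPXCHG on the x86). The retry loops and the
memory-reclamation caveat in the comments are exactly the
inefficiencies mentioned above; this is an illustration, not a
hardened implementation:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

struct stack {
    _Atomic(struct node *) head;
};

void push(struct stack *s, int v) {
    struct node *n = malloc(sizeof *n);
    n->value = v;
    n->next = atomic_load(&s->head);
    /* retry until no other thread has changed head under us;
     * on failure, n->next is reloaded with the current head */
    while (!atomic_compare_exchange_weak(&s->head, &n->next, n))
        ;
}

int pop(struct stack *s, int *out) {
    struct node *n = atomic_load(&s->head);
    /* on CAS failure, n is reloaded with the current head */
    while (n != NULL &&
           !atomic_compare_exchange_weak(&s->head, &n, n->next))
        ;
    if (n == NULL)
        return 0;           /* stack was empty */
    *out = n->value;
    /* caveat: naive free() is unsafe under concurrency (the "ABA"
     * problem); real lock-free structures need careful memory
     * reclamation, which is one source of their memory overhead */
    free(n);
    return 1;
}
```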
    --RCU has some complexity required to handle garbage collection,
    and it's not always applicable

    --event-driven code removes the possibility of races and deadlock,
    but that's because it doesn't involve true concurrency, so if you
    want your app to take advantage of multiple CPUs, you can't use
    plain vanilla event-driven programming

    --non-blocking synchronization often leads to complexity and
    inefficiency (but not deadlock or broken modularity!)

    --transactions are nice but not supported everywhere and not very
    efficient if there is lots of contention

3. Advice

A. General advice

Two things for you to remember here and always; these two things are
implied by the absence of sequential consistency:

    (1) if you're *using* a synchronization primitive (e.g., a mutex),
    do NOT try to read any shared data outside of the mutex (the mutex
    provides the needed ordering).

    (2) if you're *implementing* a synchronization primitive, you need
    to read the language manual carefully (to tell the compiler what
    not to reorder), and you need to read the processor manual
    carefully (to understand the default memory model and how to
    override it if necessary).

***Your best hope if you're working with threads and monitors in
application space:***

    (3) coarse-grained locking

    (4) the MikeD rules/commandments/standards: lock()/unlock() at the
    beginning and end of functions, use monitors, use while loops to
    check scheduling constraints, etc.

    (5) disciplined hierarchical structure to your code (so you can
    order the locks), and avoid up-calls. if your structure is poor,
    you have little hope. so the following more detailed advice
    assumes decent structure:

        --if you have to make an up-call, better to ensure that the
        partial order on locks is maintained and/or that issuing the
        up-call doesn't require holding a lock (*)

        --if you have nested objects or monitors (as in the M,N
        example in l09-handout), then there are some cases:

            --if the target of the call does not lock, then no
            problem:
            the outer monitor can keep holding the lock.

            --if the target of the call locks but does not wait, then
            the caller can continue to hold the lock PROVIDED that a
            partial ordering exists (for example, that the callee
            never issues a callback/up-call to the calling module or
            to an even higher layer, as mentioned in (*) above)

            --if the target of the call locks and does wait, then it
            is dangerous to call it while holding a lock in the outer
            layer. here, you need a different code structure.
            unfortunately, there is no silver bullet

        --to avoid nested monitors, you can/should break your code up:

                M
                |
                N

            becomes:

                  O
                 / \
                M   N

            where O is an ordinary module, not a monitor. example: M
            implements checkin/checkout for a database. O is the
            database, and N is some other monitor.

    (6) run static detection tools (commercial products; search
    around); they're getting better every year

    (7) run dynamic detection tools (Valgrind, etc.): instrument the
    program if needed

--Bummer about all of this: it's hard to hide the details of
synchronization behind object interfaces

--That is, even the solutions that avoid deadlock require breaking
abstraction barriers and highly disciplined coding

B. My advice on best approaches (higher-level advice than the thread
coding advice from before)

    --application programming:

        --cooperative user-level multithreading

        --kernel-level threads with *simple* synchronization (lab T)

            --this is where the thread coding advice given above
            applies

        --event-driven coding

        --transactions, if your package provides them, and you are
        willing to deal with the performance trade-offs (namely that
        performance is poor under contention because of lots of
        wasted work)

    --kernel hacking: no silver bullet here. want to avoid locks as
    much as possible. sometimes they are unavoidable, in which case
    fancy things need to happen.

        --UT professor Emmett Witchel proposes using transactions
        inside the kernel (TxLinux)

C.
Reflections and conclusions from the concurrency unit

    --Threads and concurrency primitives have solved a hard problem:
    how to take advantage of hardware resources with a sequential
    abstraction (the thread) and how to safely coordinate access to
    shared resources (concurrency primitives).

    --But of course concurrency primitives have the disadvantages
    that we've discussed

    --old debate about whether threads are a good idea:

        John Ousterhout: "Why Threads are a bad idea (for most
        purposes)", 1996 talk.
        http://home.pacbell.net/ouster/threads.pdf

        Robert van Renesse, "Goal-Oriented Programming, or
        Composition Using Events, or Threads Considered Harmful".
        Eighth ACM SIGOPS European Workshop, September 1998.
        http://www.cs.cornell.edu/home/rvr/papers/GoalOriented.pdf

        --and lots of "events vs threads" papers (use Google)

    --the debate comes down to this: compared to code written in
    event-driven style, shared memory multiprogramming code is easier
    to read: it's easier to know the code's purpose. however, it's
    harder to make that code correct, and it's harder to know, when
    reading the code, whether it's correct.

    --who is right? sort of like the vi vs. emacs debates. threads,
    events, and the other alternatives all have advantages and
    disadvantages. one thing is for sure: make sure that you
    understand those advantages and disadvantages before picking a
    model to work with.

4. Software safety

* Background

    --Draw linear accelerator
    --Magnets
        --bending magnets
    --Bombard tungsten to get photons

* Mechanics

    [draw picture of this thing]

    dual-mode machine (actually, triple mode, given the disasters)

                           beam      beam          beam modifier
                           energy    current       (given by TT position)
    intended settings:
    ---------------------------------------------------
    for electron therapy | 5-25 MeV  low           magnets
                         |
    for X-ray therapy    | 25 MeV    high (100x)   flattener
    (photon mode)        |
                         |
    for field light mode | 0         0             none

    (b/c of the flattener, more current is needed in X-ray mode)

What can go wrong?
    (a) if the beam has high current, but the turntable has 'magnets',
    not the flattener, it is a disaster: the patient gets hit with a
    high-current electron beam

    (b) another way to kill a patient is to turn the beam on with the
    turntable in the field-light position

    So what's going on? (Multiple modes, and mixing them up is very,
    very bad)

* What actually went wrong?

    --two software problems
    --a bunch of non-technical problems

    (i) software problem #1:

    [this is our best guess; it's actually hard to know for sure,
    given the way that the paper is written.]

        --three threads
            --keyboard
            --turntable
            --general parameter setting

        --see handout for the pseudocode

        --now, if the operator sets a consistent set of parameters for
        x (X-ray (photon) mode), realizes that the doctor ordered
        something different, and then edits very quickly to e
        (electron) mode, then what happens?

            --if the re-editing takes less than 8 seconds, the general
            parameter setting thread never sees that the editing
            happened, because it's busy doing something else. when it
            returns, it misses the setup signal

            (probably every single concurrency commandment was
            violated here....)

            --now the turntable is in the 'e' position (magnets)

            --but the beam is a high-intensity beam because the
            'Treat' thread never saw the request to go to electron
            mode

            --each thread and the operator thinks everything is okay

            --operator presses BEAM ON --> patient mortally injured

        --so why doesn't the computer check the set-up for consistency
        before turning on the beam? [all it does is check that there's
        no more input processing.]
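A drastically simplified sketch of the shape of this race, in C with
hypothetical names (the real Therac-25 software was PDP-11 assembly,
and the account above is already a best guess). The parameter-setting
pass consumes a one-shot completion signal, so a quick re-edit that
raises no new signal is silently missed; an end-to-end check right
before BEAM ON would catch the resulting inconsistency:

```c
/* hypothetical, simplified model; not the actual Therac-25 code */
enum mode { MODE_NONE, MODE_XRAY, MODE_ELECTRON };

static enum mode requested_mode  = MODE_NONE; /* what the operator typed */
static enum mode configured_mode = MODE_NONE; /* what the machine is set to */
static int data_entry_complete   = 0;         /* one-shot signal */

/* The parameter-setting thread's pass: it acts on the completion
 * signal once and then stops looking, so edits made after the signal
 * was consumed are never seen. */
void parameter_setting_pass(void) {
    if (data_entry_complete) {
        configured_mode = requested_mode;
        data_entry_complete = 0;    /* signal consumed */
    }
}

/* The missing end-to-end consistency check: refuse the beam unless
 * the configured state matches the operator's latest request. */
int beam_on_allowed(enum mode latest_request) {
    return configured_mode == latest_request;
}
```

Running the scenario from the notes against this sketch: complete an
X-ray setup, let the setting pass run, then re-edit to electron mode
without a fresh signal. The configured mode is now stale, and only
the end-to-end check stands between the operator and BEAM ON.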
        alternatives:
            --double-check with the operator
            --end-to-end consistency check in software
            --hardware interlocks
        [probably want all of the above]

    (ii) software problem #2:

    how it's supposed to work:

        --operator sets up parameters on the screen

        --operator moves the turntable to field-light mode, and
        visually checks that the patient is properly positioned

        --operator hits "set" to store the parameters

        --at this point, the class3 "interlock" (in quotation marks
        for a reason) is supposed to tell the software to check and
        perhaps modify the turntable position

        --operator presses "beam on"

    how they implemented this:

        --see pseudocode on handout

    but it doesn't always work out that way. why?

        --because this boolean flag is implemented as a counter.

        --(why implemented as a counter? the PDP-11 had an Increment
        Byte instruction that added 1 ("inc A"). This increment
        presumably took a bit less code space than materializing the
        constant 1 in an instruction like "A = 1".)

        --so what goes wrong?

            --every 256th time that code runs, the one-byte counter
            wraps around to 0, so class3 is 0; the operator presses
            'set', and there is no repositioning

            --operator presses "beam on", and a beam is delivered in
            field-light position, with no scanning magnets or
            flattener --> patient injured or killed

    (iii) Lots of larger issues here too

        --***No end-to-end consistency checks***. What you actually
        want is:
            --right before turning the beam on, the software checks
            that the parameters line up
            --hardware that won't turn the beam on if the parameters
            are inconsistent
            --then double-check all that by using a radiation
            "phantom"

        --too easy to say 'go'; errors reported by number; no
        documentation

        --false alarms (operators learned the following response:
        "it'll probably work the next time") (put differently, people
        became "insensitive to machine malfunctions")

        --unnecessarily complex and poor code

        --weird software reuse: wrote own OS ...
        but used code from a different machine

        --measuring devices that report _underdoses_ when they are
        ridiculously saturated

        --no real quality control, unit tests, etc.

        --no error documentation, no documentation on software design

        --no follow-through on Therac-20's blown fuses

        --company lied; didn't tell users about each other's failures

        --users weren't required to report failures to a central
        clearinghouse

        --no investigation when other problems arose

        --company assumed software wasn't the problem

        --risk analyses were totally bogus: parameters chosen from
        thin air. 10^{-11}, 4*10^{-9}, etc. Obviously those parameters
        were wrong!! (they were supposedly estimating things like
        "computer selects wrong energy")

        --bogus changes that didn't solve the problems

        --process
            --no unit tests
            --no quality control

* What could/should they have done?

    --Addressing the stuff above

    --You might be thinking, "So many things went wrong. There was no
    single cause of failure. Does that mean no single design change
    could have contributed to success?"

    --Answer: no! do end-to-end consistency checks! that single change
    would have prevented these errors!

    [--why no hardware interlocks?
        --decided not worth the expense
        --people (wrongly) trusted software]

* What happened in the disasters reported by the NYT?

    --Hard to know for sure

    --Looks like: the software lost the treatment plan, and it
    defaulted to "all leaves open", the analog of the field-light
    position.

    What could/should have been done?

    --a good rule is: "software should have sensible defaults". looks
    like this rule was violated here.

    --in a system like this, there should be hardware interlocks (for
    example: no turning on the beam unless the leaves are closed)

* Discussion

Where do the best programmers go?
    --Google, Facebook, etc. ... where nothing really needs to work
    (or, at least, if there are bugs, people don't die)

    --There **may** be an inverse correlation between programmer
    quality and how safety-critical the code that they are writing is
    (I have no proof of this, but if I look at where the young
    "hotshot" developers are going, it's not to write the software to
    drive linear accelerators.)

Lessons:

    --complex systems fail for complex reasons

    --be tolerant of inputs (they weren't); be strict on outputs (they
    weren't)

Amateur ethics/philosophy

    (i) Philosophical/ethical question: you have a 999/1000 chance of
    being cured by this machine. 1/1000 times it will cause you to die
    a gruesome death. do you pick it? most people would.

        --> then, what *should* the FDA do?

    (ii) should people have to be licensed to write software? (food
    for thought)

    (iii) Would you say something if you were working at such a
    company? What if you were a new hire? What if it weren't safety
    critical?

---------------------------------------------------------------------------

SUMMARY AND REVIEW OF CONCURRENCY

We've discussed different ways to handle concurrency. Here's a review
and summary.

Unfortunately, there is no one right approach to handling concurrency.
The "right answer" changes as operating systems and hardware evolve,
and depending on whether we're talking about what goes on inside the
kernel, how to structure an application, etc. For example, in a world
in which most machines had one CPU, it may make more sense to use
event-driven programming in applications (note that this is a
potentially controversial claim), and to rely on turning off
interrupts in the kernel. But in a world with multiple CPUs,
event-driven programming in an application fails to take advantage of
the hardware's parallelism, and turning off interrupts in the kernel
will not avoid concurrency problems.

Why we want concurrency in the first place: better use of hardware
resources.
Increase total performance by running different tasks concurrently on
different CPUs. But sometimes, serial execution of atomic operations
is needed for correctness. So how do we solve these problems?

--*threads are an abstraction that can take advantage of
concurrency.* applies at multiple levels, as discussed in class.
indeed, a kernel that runs on multiple CPUs (say, in handling system
calls from two different processes running on two different CPUs) can
be regarded as using threads-inside-the-kernel, or there can be
explicit threads-inside-the-kernel. this is apart from kernel-level
threading and user-level threading, which are abstractions that
applications use

--to get serial execution of atomic operations, we need hardware
support at the lowest level. we may use the hardware support directly
(as in the case of lock-free data structures), with a thin wrapper (as
in the case of spinlocks), or wrapped in a much higher-level
abstraction (as in the case of mutexes and monitors).

    --the hardware support that we're talking about is test&set, the
    LOCK prefix, LD_L/ST_C (load-linked/store-conditional, on the DEC
    Alpha), and interrupt enable/disable

1. The most natural (but not the only) thing we can do with the
hardware support is to build spinlocks. We saw a few kinds of
spinlocks:

    --test_and_set (while (xchg) {} )
    --test-and-test_and_set (example given in class from Linux)
    --MCS locks

Spinlocks are *sometimes* useful inside the kernel and more rarely
useful in application space. The reason is that wrapping larger pieces
of functionality with spinlocks wastes CPU cycles on the waiting
processor. There is a trade-off between the overhead of putting a
thread to sleep and the cycles wasted by spinning. For very short
critical sections, spinlocks are a win. For longer ones, put the
thread of execution to sleep.

2.
For larger pieces of functionality, higher-level synchronization
primitives are useful:

    --mutexes

    --mutexes and condition variables (known as monitors)

    --shared reader / single writer locks (can implement as a monitor)

    --semaphores (but you should not use these, as it is easy to make
    mistakes with them)

    --futexes (basically a semaphore or mutex used for synchronizing
    processes on Linux; the advantage is that if the futex is
    uncontended, the process never enters the kernel. The cost of a
    system call is only incurred when there is contention and a
    process needs to go to sleep (going to sleep and getting woken
    requires kernel help).)

Building all of the above correctly requires lower-level
synchronization primitives. Usually, inside of these higher-level
abstractions is a spinlock that is held for a brief time before the
thread is put to sleep and after it is woken.

[Disadvantages to both spinlocks and higher-level synchronization
primitives:

    --performance (because of synchronization points and cache line
    bounces)

    --performance v. complexity trade-off
        --it's hard to get code safe, live, and well-performing
        --to increase performance, we need finer-grained locking,
        which increases complexity, which imperils:
            --safety (race conditions more likely)
            --liveness (for example, deadlock and starvation more
            likely)

    --deadlock (hard to ensure liveness)

    --starvation (hard to ensure progress for all threads)

    --priority inversion

    --broken modularity

    --careful coding required]

In user-level code, manage these disadvantages by sacrificing
performance for correctness. In kernel code, it's trickier. Any
performance problems in the kernel will be passed to applications.
Here, the situation is sort of a mess. People use a combination of
partial lock orders, careful thought, static detection tools, code
review, and prayer.

3. Can also use hardware support to build lock-free data structures
(for example, using atomic compare-and-swap).
    --avoids the possibility of deadlock
    --better performance
    --downside: further complexity

4. Can also use hardware support to enable the Read-Copy Update (RCU)
technique. Used inside the Linux kernel. Very elegant.

    --here, *writers* need to synchronize (using spinlocks, other
    hardware support, etc.), but readers do not

[Aside: another paradigm for handling concurrency: transactions

    --transactional memory (requires different hardware abstractions)

    --transactions exposed to applications and users of applications,
    like queriers of databases]

Another approach to handling concurrency is to avoid it:

5. event-driven programming

    --also manages "concurrency".

    --why? (because the processor really can only do one thing at a
    time.)

    --good match if there is lots of I/O

    --what happens if we try to program in event-driven style and we
    are a CPU-bound process?

        --there aren't natural yield points, so the other tasks may
        never get to run

        --or the CPU-bound tasks have to insert artificial yield()s.

    In vanilla event-driven programming, there is no explicit yield()
    call. A function that wants to stop running but wants to be run
    again has to queue up a request to be re-run. That request
    unfortunately has to manually take the important stack variables
    and place them in an entry in the event queue. In a very real
    sense, this is what the thread scheduler does automatically (the
    thread scheduler, if you look at it at a high level, takes various
    threads' stacks and switches them around, and also loads the CPU's
    registers with a thread's registers. this is not super-different
    from scheduling a function for later and specifying the arguments,
    where the arguments *are* the important stack variables that the
    function will need in order to execute).
    --this is one reason why you want threads: it's easy to write code
    where each thread just does some CPU-intensive thing, and the
    thread scheduler worries about interleaving the operations
    (otherwise, the interleaving is more manual and consists of the
    steps mentioned above).

6. what does JOS do?

    Answer: JOS's approach to concurrency is probably not something to
    take as a lesson. It uses a "big kernel lock" and ensures that
    kernel code is executing on only one processor at a time (note
    that a user-level process can execute on the other CPU).

    A JOS system has to worry about interrupts; for this it takes a
    simple approach. On the CPU on which it is executing, JOS turns
    interrupts off for the entire time that it is executing in the
    kernel. JOS runs environments in user mode with interrupts
    enabled, so at any point a timer interrupt may take the CPU away
    from an environment and switch to a different environment. This
    can happen on either CPU.

Ultimately, threads, synchronization primitives, etc. solve a really
hard problem: how to have many things going on at the same time while
allowing the programmer to keep everything organized, sane, and
correct. To do this, we introduced abstractions like threads,
functions like swtch(), relied on hardware primitives like XCHG, and
built higher-level objects like mutexes, monitors, and condition
variables. All of this is, at the end of the day, presenting a
relatively sane model to the programmer, built on top of something
that was otherwise really hard to reason about. On the other hand,
these abstractions aren't perfect, as the litany of disadvantages
should make clear, so the solution is to be very careful when writing
code that has multiple units of execution that share memory (aka
shared memory multi-programming, aka threading, aka processes that
share memory).

---------------------------------------------------------------------------
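As an appendix to point 1 of the summary: a minimal sketch in C of the
test-and-test_and_set spinlock (using GCC/Clang atomic builtins; the
exchange compiles to XCHG on the x86). Illustrative only:

```c
void spin_lock(int *l) {
    for (;;) {
        /* "test_and_set" part: one atomic exchange attempt */
        if (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) == 0)
            return;                           /* got the lock */
        /* "test" part: spin on ordinary loads until the lock looks
         * free, so the waiting CPU mostly hits its own cache instead
         * of bouncing the cache line with atomic operations */
        while (__atomic_load_n(l, __ATOMIC_RELAXED) != 0)
            ;
    }
}

void spin_unlock(int *l) {
    __atomic_store_n(l, 0, __ATOMIC_RELEASE);
}
```

The read-only inner loop is what distinguishes this from the plain
test_and_set spinlock in point 1: the atomic exchange is retried only
when the lock appears free, which is why it behaves better under
contention.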