Class 13
CS 372H
02 March 2010

On the board
------------

0. Last time
        Trade-offs and problems from synchronization primitives
        A. Performance
        B. Performance v. complexity trade-off
        C. Deadlock
        D. Starvation

1. Trade-offs, continued
        E. Priority inversion
        F. Broken modularity
        G. Careful coding required (more advice)

2. Reflections and conclusions

3. Alternatives

4. Some loose ends
        A. sequential consistency
        B. futexes

5. Scheduling

---------------------------------------------------------------------------

NOTE: the end of these notes contains a written review of our concurrency
unit. It's not "complete", but it may help organize a lot of what we
covered.

---------------------------------------------------------------------------

0. Last time

    --We showed that spinlocks are better for a smaller number of CPUs but
      have worse scaling. We said why they have worse scaling but didn't
      say why they're better for a smaller number of CPUs. Why are they?

        --Because acquire(Lock* lock, qnode* q) in the MCS lock requires
          more operations than a simple spinlock acquire.

1. Trade-offs and problems from synchronization primitives
   [A-->D were covered last time]

    E. Priority inversion

        --T1, T2, T3: (highest, middle, lowest priority)
        --T1 wants to get the lock, T2 runnable, T3 runnable and holding
          the lock
        --System will preempt T3 and run the highest-priority runnable
          thread, namely T2
        --Solutions:
            --Temporarily bump T3 to the highest priority of any thread
              that is ever waiting on the lock
            --Disable interrupts, so no preemption (T3 finishes)
              ... works okay unless a page fault occurs
            --Don't handle it; structure the app so that only
              adjacent-priority processes/threads share locks
        --Happens in real life. For a real-life example, see:
          http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Mars_Pathfinder.html

    F. Broken modularity

        --many examples above: avoiding deadlock requires understanding how
          programs call each other
        --need to know, when calling a library, whether it's thread-safe:
          printf, malloc, etc. If not, surround the call with a mutex.
          (Can always surround calls with mutexes conservatively.)

    G. Careful coding required / more advice

        ***Your best hope if you're working with threads and monitors in
        application space: four higher-level pieces of advice***

        (i) coarse-grain locking

        (ii) disciplined hierarchical structure to your code (so you can
        order the locks), and avoid up-calls. if your structure is poor,
        you have little hope. so the following more detailed advice assumes
        decent structure (a small lock-ordering sketch appears after this
        list):

            --if you have to make an up-call, better ensure that the
              partial order on locks is maintained and/or that the up-call
              doesn't require holding a lock to issue the up-call (*)

            --if you have nested objects or monitors (as in the example
              earlier today), then there are some cases:

                --if the target of the call does not lock, then no problem.
                  the outer monitor can keep holding the lock.

                --if the target of the call locks but does not wait, then
                  the caller can continue to hold the lock PROVIDED that a
                  partial ordering exists (for example, that the callee
                  never issues a callback/up-call to the calling module or
                  to an even higher layer, as mentioned in (*) above)

                --if the target of the call locks and does wait, then it is
                  dangerous to call it while holding a lock in the outer
                  layer. here, you need a different code structure.
                  unfortunately, there is no silver bullet.

            --to avoid nested monitors, you can/should break your code up:

                        M
                        |
                        N

                  becomes:

                        O
                       / \
                      M   N

              where O is an ordinary module, not a monitor. example: M
              implements checkin/checkout for the database. O is the
              database, and N is some other monitor.

        (iii) run static detection tools; they're getting better every year

        (iv) run dynamic detection tools: instrument the program if needed
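        [A minimal sketch of advice (ii), lock ordering, using pthreads.
        The names here (struct account, transfer) are made up for
        illustration, and the mutexes are assumed to be initialized
        elsewhere; the point is only that every code path acquires the two
        locks in the same global order (here, ascending address), so two
        opposite-direction transfers cannot deadlock:]

            #include <pthread.h>
            #include <stdint.h>

            struct account {
                pthread_mutex_t mu;     /* assume initialized elsewhere */
                long balance;
            };

            /* move 'amount' from a to b; assumes a != b */
            void transfer(struct account *a, struct account *b, long amount) {
                /* fix one global order -- here, ascending address -- and
                   always acquire in that order */
                struct account *first  = ((uintptr_t)a < (uintptr_t)b) ? a : b;
                struct account *second = ((uintptr_t)a < (uintptr_t)b) ? b : a;

                pthread_mutex_lock(&first->mu);
                pthread_mutex_lock(&second->mu);

                a->balance -= amount;
                b->balance += amount;

                pthread_mutex_unlock(&second->mu);
                pthread_mutex_unlock(&first->mu);
            }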
        --Bummer about all of this: it is hard to hide the details of
          synchronization behind object interfaces
        --That is, even the solutions that avoid deadlock require breaking
          abstraction barriers and highly disciplined coding

2. Reflections and conclusions

    --Threads and concurrency primitives have solved a hard problem: how to
      take advantage of hardware resources with a sequential abstraction
      (the thread) and how to safely coordinate access to shared resources
      (concurrency primitives).

    --But of course concurrency primitives have the disadvantages listed
      above

    --Threads are also not free. They are cheap, but (in most packages and
      languages) not free
        --fun example: OS/2 from Microsoft and IBM (1980s). lots of
          threads, say 100. each thread needed a stack, say 10KB. result:
          1MB of memory consumed by waiting threads. cost of 1MB of memory
          in 1988: $200. But from the end-user's perspective: okay, you
          could keep working while the application spooled to the printer,
          but was that worth $200?

    --old debate about whether threads are a good idea:

        John Ousterhout: "Why Threads Are a Bad Idea (for most purposes)",
        1996 talk. http://home.pacbell.net/ouster/threads.pdf

        Robert van Renesse, "Goal-Oriented Programming, or Composition
        Using Events, or Threads Considered Harmful". Eighth ACM SIGOPS
        European Workshop, September 1998.
        http://www.cs.cornell.edu/home/rvr/papers/GoalOriented.pdf

    --the case comes down to this:
        --it is hard to get multithreaded programs correct, AND there are
          safer alternatives

    --who is right? sort of like the vi vs. emacs debates. threads, events,
      and the other alternatives that we will discuss below all have
      advantages and disadvantages.

3. What else can we do?

    A. Other approaches to handling concurrency:

        --non-blocking synchronization
            --wait-free algorithms
            --lock-free algorithms
            --the skinny on these is that:
                --using atomic instructions such as compare-and-swap
                  (CMPXCHG on the x86), you can implement many common data
                  structures (stacks, queues, even hash tables)
                --in fact, you can implement *any* algorithm in wait-free
                  fashion, given the right hardware
                --the problem is that such algorithms wind up using lots of
                  memory or involving many retries, so they are inefficient
                --but since they never hold locks, they cannot deadlock

        --ignore benign race conditions
            --for a Web site, do "hits++" on some counter without
              protecting it with a mutex. who cares if it's a bit too low?
            --warning: not a good practice in general

        --RCU (read-copy-update)
            [citation: P. E. McKenney et al. Read-Copy Update. Proc. Linux
            Symposium, 2001.
            http://lse.sourceforge.net/locking/rcu/rclock_OLS.2001.05.01c.sc.pdf]
            --really neat technique used widely in the Linux kernel
            --basic idea: because reading is so much more common than
              writing, don't synchronize readers. synchronize writers, but
              make sure that writers update data structures in such a way
              that readers don't get messed up.
              approach: leave old data for readers, don't update in place
                --no need for reader locks if they never see changes
                  in place; this fundamentally removes the need for
                  feedback from readers to updaters!
                --instead, update by copying data items, and atomically
                  change pointers
                --of course, this raises the question: when can we reclaim
                  the *old* memory? (the technique has to be very careful
                  about this.)
            --benefits: reader code is simpler, and it avoids locking
              issues and the bus operations needed to notify updaters.
              thus, good performance for read-heavy workloads.
              (a small sketch of the pointer-publication idea follows.)
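        [A minimal, hedged sketch of the core RCU idea using C11 atomics.
        This is NOT the Linux kernel's rcu_* API, and it punts on
        reclamation (the hard part); it only shows readers loading a
        pointer without locks while an updater copies, modifies, and
        atomically publishes a new version. The struct and function names
        are illustrative; if there were multiple updaters, they would still
        need to synchronize among themselves, e.g., with a spinlock:]

            #include <stdatomic.h>
            #include <stdlib.h>
            #include <string.h>

            struct config { int timeout; int retries; };

            /* shared pointer to the current version (assumed initialized) */
            _Atomic(struct config *) current_config;

            /* reader: no locks, no stores to shared state */
            int get_timeout(void) {
                struct config *c =
                    atomic_load_explicit(&current_config, memory_order_acquire);
                return c->timeout;
            }

            /* updater: copy, modify the copy, publish with one atomic store */
            void set_timeout(int t) {
                struct config *old = atomic_load(&current_config);
                struct config *new = malloc(sizeof *new);   /* error check omitted */
                memcpy(new, old, sizeof *new);
                new->timeout = t;
                atomic_store_explicit(&current_config, new, memory_order_release);
                /* cannot free(old) yet: some reader may still be using it.
                   real RCU waits for a "grace period" before reclaiming. */
            }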
        --event-driven code
            --also manages "concurrency".
            --why? (because the processor really can only do one thing at
              a time.)
            --good match if there is lots of I/O.
              what happens if we try to program in event-driven style and
              we are a CPU-bound process?
                --there aren't natural yield points, so the other tasks may
                  never get to run
                --or the CPU-bound tasks have to insert artificial yield()s

        --transactions
            --we will see these later in the course
            --using them in the kernel requires hardware support
            --using them in application space does not

    B. How free are these other techniques from the disadvantages listed
       above?

        --non-blocking synchronization often leads to complexity and
          inefficiency (but not deadlock or broken modularity!)
        --RCU has some complexity required to handle garbage collection,
          and it's not always applicable
        --event-driven code removes the possibility of races and deadlock,
          but that's because it doesn't involve true concurrency, so if you
          want your app to take advantage of multiple CPUs, you can't use
          event-driven programming
        --transactions are nice but not supported everywhere and not very
          efficient if there is lots of contention

    C. My advice on best approaches (higher-level advice than the thread
       coding advice above)

        --application programming:
            --cooperative user-level multithreading
            --kernel-level threads with *simple* synchronization (lab T)
                --this is where the thread coding advice given above
                  applies
            --event-driven coding
            --transactions, if your package provides them, and you are
              willing to deal with the performance trade-offs (namely that
              performance is poor under contention because of lots of
              wasted work)

        --kernel hacking: no silver bullet here. you want to avoid locks as
          much as possible. sometimes they are unavoidable, in which case
          fancy things need to happen.
            --UT professor Emmett Witchel proposes using transactions
              inside the kernel (TxLinux)

4. Some loose ends

    A. Sequential consistency

    Our examples all along have been assuming sequential consistency....but
    what does this amount to assuming?

    (i) Examples

        --example 1

            int data = 0, ready = 0;

            void p1 () {
                data = 2000;
                ready = 1;
            }

            int p2 () {
                while (!ready) {}
                return data;
            }

          What might p2 return if run concurrently with p1?

        --example 2

            int flag1 = 0, flag2 = 0;

            int main () {
                tid id = thread_create (p1, NULL);
                p2 ();
                thread_join (id);
            }

            void p1 (void *ignored) {
                flag1 = 1;
                if (!flag2) {
                    critical_section_1 ();
                }
            }

            void p2 (void *ignored) {
                flag2 = 1;
                if (!flag1) {
                    critical_section_2 ();
                }
            }

          Can both critical sections run?

        [examples from S.V. Adve and K. Gharachorloo, IEEE Computer,
        December 1996, 66-76.
        http://rsim.cs.uiuc.edu/~sadve/Publications/computer96.pdf]

        --Answers are "no" *if* the hardware provides sequential
          consistency (but if it doesn't, the answers are "yes"):

    (ii) Defn of sequential consistency:

        The result of execution is as if all operations were executed in
        some sequential order, and the operations of each processor
        occurred in the order specified by the program.

        [citation: L. Lamport. How to Make a Multiprocessor Computer that
        Correctly Executes Multiprocess Programs. _IEEE Transactions on
        Computers_, Volume C-28, Number 9, September 1979, pp. 690-691.
        http://research.microsoft.com/en-us/um/people/lamport/pubs/multi.pdf]

        Basically means:
            --Maintaining program order on individual processors
            --Ensuring write atomicity

    (iii) Why isn't sequential consistency always in effect?

        --It's expensive for the hardware (sometimes overlapping
          instructions, or providing non-blocking memory reads, helps the
          hardware's performance)
        --The compiler sometimes wants to violate s.c.:
            --moves code around
            --caches values in registers
            --common subexpression elimination (could cause memory to be
              read fewer times)
            --re-arranges loops for better cache performance
            --software pipelining
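        [A hedged sketch of one way a programmer gets the sequentially
        consistent answer back for example 2 above: mark the flags as C11
        atomics (which postdate these notes, but are now the standard
        tool). With the default memory_order_seq_cst operations, the
        compiler and hardware must preserve the store-then-load order on
        each thread, so at most one critical section can run even on
        hardware whose plain loads and stores are weaker:]

            #include <stdatomic.h>

            atomic_int flag1, flag2;    /* both initially 0 */

            void p1 (void *ignored) {
                atomic_store (&flag1, 1);          /* seq_cst store */
                if (atomic_load (&flag2) == 0) {   /* seq_cst load  */
                    /* critical_section_1 (); */
                }
            }

            void p2 (void *ignored) {
                atomic_store (&flag2, 1);
                if (atomic_load (&flag1) == 0) {
                    /* critical_section_2 (); */
                }
            }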
    (iv) What does the x86 do?

        --x86 supports multiple consistency/caching models
            --Memory Type Range Registers (MTRRs) specify consistency for
              ranges of physical memory (e.g., the frame buffer)
            --the Page Attribute Table (PAT) allows control for each 4K
              page
        --Choices include:
            WB: Write-back caching (the default)
            WT: Write-through caching (all writes go to memory)
            UC: Uncacheable (for device memory)
            WC: Write-combining: weak consistency & no caching
        --Some instructions have weaker consistency
            --String instructions
            --Special "non-temporal" instructions that bypass the cache

    (v) x86 WB consistency

        --Old x86s (e.g., 486, Pentium 1) had almost SC
            --Exception: a read could finish before an earlier write to a
              different location
        --Newer x86s let a processor read its own writes early
            --see handout, item #3: both of those functions can return 2;
              that is, the two processors see the loads in different orders
            --Older CPUs would wait at "f = ..." until the store completed

    (vi) x86 atomicity (review)

        --lock prefix
            --review: the lock prefix makes a memory instruction atomic (by
              locking the bus for the duration of the instruction, which is
              expensive)
            --all locked instructions are totally ordered
            --other memory instructions cannot be re-ordered with locked
              ones
        --xchg (always locked; no prefix needed)
        --fence instructions that can prevent re-ordering
            LFENCE -- can't be re-ordered with reads (or later writes)
            SFENCE -- can't be re-ordered with writes
            MFENCE -- can't be re-ordered with reads or writes

    B. Futexes

        [citation: H. Franke, R. Russell, and M. Kirkwood. Fuss, Futexes
        and Furwocks: Fast Userlevel Locking in Linux. Proc. Linux
        Symposium, 2002.
        http://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf]

        --abstraction that is useful for synchronizing user-level programs
          efficiently
        --basically, ask the kernel to put the current process to sleep
          only if some memory hasn't changed
        --this abstraction is really two things:
            --some shared memory (that the programs must coordinate on,
              likely by using a library wrapper on top of the futex)
            --a system call:
                void futex(int* uaddr, FUTEX_WAIT|FUTEX_WAKE, int val, ....)
        --in the non-contended case, the process never makes the system
          call: the processes just acquire and release the shared object
          (say, by atomically incrementing and decrementing counters)
        --in the contended case, a program may need to sleep. In that case,
          it executes a call like:
                futex(shared_counter_address, FUTEX_WAIT, v)
          which says: "sleep if *shared_counter_address == v". the whole
          point of this is that some other program might have changed the
          counter right before the call to futex(), so if the counter
          changed, we might not want to go to sleep (e.g., if the counter
          is 0 or 1 to mean "is_not_locked" or "is_locked", you don't want
          to go to sleep unless the counter really is equal to is_locked).
        --likewise, in the contended case, a program may need to wake up
          some waiters. In that case, it executes a call like:
                futex(shared_counter_address, FUTEX_WAKE, v)
          which says: "wake up at most v of the processes sleeping on
          shared_counter_address"
        --see futex(2) and futex(7) in the man pages (i.e., type
          "man 2 futex" and "man 7 futex")
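        [To make the fast-path/slow-path split concrete, here is a hedged
        sketch of a futex-based lock in the style of the scheme in the
        Franke et al. paper (and Ulrich Drepper's later note "Futexes Are
        Tricky"). The three-state encoding (0 = unlocked, 1 = locked,
        2 = locked with possible waiters) and the helper names are
        illustrative, not the glibc implementation; error handling is
        omitted:]

            #include <stdatomic.h>
            #include <linux/futex.h>
            #include <sys/syscall.h>
            #include <unistd.h>

            static long sys_futex(atomic_int *uaddr, int op, int val) {
                return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
            }

            void futex_lock(atomic_int *f) {
                int c = 0;
                /* fast path: 0 -> 1, no system call when uncontended */
                if (atomic_compare_exchange_strong(f, &c, 1))
                    return;
                /* slow path: advertise a waiter (state 2), sleep while locked */
                if (c != 2)
                    c = atomic_exchange(f, 2);
                while (c != 0) {
                    sys_futex(f, FUTEX_WAIT, 2);  /* kernel sleeps us only if *f == 2 */
                    c = atomic_exchange(f, 2);
                }
            }

            void futex_unlock(atomic_int *f) {
                if (atomic_exchange(f, 0) == 2)   /* 2 means someone may be sleeping */
                    sys_futex(f, FUTEX_WAKE, 1);  /* wake at most one waiter */
            }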
--------------------

Admin notes

    --lab 3B due tomorrow
    --am holding office hours from 2-3
    --so far no questions
    --some further advice
        --start lab 4A now
        --clarify what I said last time: I really didn't mean to say "start
          early". what I meant was "start on time". it just so happens that
          "on time" is many days before the deadline

--------------------

5. scheduling intro

    A. When do scheduling decisions happen?

        [draw picture]

        scheduling decisions take place when a process:
            (i)   Switches from running to waiting state
            (ii)  Switches from running to ready state
            (iii) Switches from waiting to ready
            (iv)  Exits

    B. What are metrics and criteria?

        --system throughput
            # of processes that complete per unit time
        --turnaround time
            time for each process to complete
        --response time
            time from request to first response (e.g., key press to
            character echo, not launch to exit)
        --fairness
            different possible definitions:
                --freedom from starvation
                --all users get equal time on the CPU
                --highest-priority jobs get most of the CPU
                --etc.
            [often conflicts with efficiency. true in life as well.]

        the above are affected by secondary criteria:
            --CPU utilization (fraction of time the CPU is actually
              working)
            --waiting time (time each process waits in the ready queue)

    C. Context switching costs

        --direct costs
            --CPU time in the kernel
            --save and restore registers
            --switch address spaces
        --indirect costs
            --TLB shootdowns, processor cache, OS caches (e.g., buffer
              caches)
        --result: more frequent context switches lead to worse throughput
          (higher overhead)

---------------------------------------------------------------------------

SUMMARY AND REVIEW OF CONCURRENCY

We've discussed different ways to handle concurrency. Here's a review and
summary.

Unfortunately, there is no one right approach to handling concurrency. The
"right answer" changes as operating systems and hardware evolve, and
depending on whether we're talking about what goes on inside the kernel,
how to structure an application, etc. For example, in a world in which most
machines have one CPU, it may make more sense to use event-driven
programming in applications (note that this is a potentially controversial
claim), and to rely on turning off interrupts in the kernel. But in a world
with multiple CPUs, event-driven programming in an application fails to
take advantage of the hardware's parallelism, and turning off interrupts in
the kernel will not avoid concurrency problems.

Why we want concurrency in the first place: better use of hardware
resources. Increase total performance by running different tasks
concurrently on different CPUs. But sometimes, serial execution of atomic
operations is needed for correctness. So how do we solve these problems?

--*threads are an abstraction that can take advantage of concurrency.*
  they apply at multiple levels, as discussed in class. indeed, a kernel
  that runs on multiple CPUs (say, in handling system calls from two
  different processes running on two different CPUs) can be regarded as
  using threads-inside-the-kernel, or there can be explicit
  threads-inside-the-kernel. this is apart from kernel threads and
  user-level threads, which are abstractions that applications use.

--to get serial execution of atomic operations, we need hardware support
  at the lowest level. we may use the hardware support directly (as in the
  case of lock-free data structures), with a thin wrapper (as in the case
  of spinlocks), or wrapped in a much higher-level abstraction (as in the
  case of mutexes and monitors).

    --the hardware support that we're talking about is test&set, the LOCK
      prefix, LD_L/ST_C (load-linked/store-conditional, on the DEC Alpha),
      and interrupt enable/disable
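[As a reminder of what the "thin wrapper" looks like, here is a minimal
sketch of the simplest spinlock of item 1 below, written with C11 atomics
rather than inline assembly (atomic_exchange is what the x86's xchg gives
you). The type and function names are illustrative; the lock word starts
at 0, meaning free:]

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;          /* 0 = free, 1 = held */

    void spin_acquire(spinlock_t *l) {
        while (atomic_exchange(l, 1) == 1)  /* atomically set to 1; retry if it was already 1 */
            ;                               /* spin (a real version might pause or back off) */
    }

    void spin_release(spinlock_t *l) {
        atomic_store(l, 0);
    }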
1. The most natural (but not the only) thing we can do with the hardware
support is to build spinlocks. We saw a few kinds of spinlocks:

    --test_and_set (while (xchg) {} )
    --test-and-test_and_set (example given in class from Linux)
    --MCS locks

Spinlocks are *sometimes* useful inside the kernel and more rarely useful
in application space. The reason is that wrapping larger pieces of
functionality with spinlocks wastes CPU cycles on the waiting processor.
There is a trade-off between the overhead of putting a thread to sleep and
the cycles wasted by spinning. For very short critical sections, spinlocks
are a win. For longer ones, put the thread of execution to sleep.

2. For larger pieces of functionality, higher-level synchronization
primitives are useful:

    --mutexes
    --mutexes and condition variables (known as monitors)
    --shared reader / single writer locks (can implement as a monitor)
    --semaphores
    --futexes (basically a semaphore or mutex used for synchronizing
      processes on Linux; the advantage is that if the futex is
      uncontended, the process never enters the kernel. The cost of a
      system call is only incurred when there is contention and a process
      needs to go to sleep. Going to sleep and getting woken requires
      kernel help.)

Building all of the above correctly requires lower-level synchronization
primitives. Usually, inside of these higher-level abstractions is a
spinlock that is held for a brief time before the thread is put to sleep
and after it is woken. (A minimal monitor sketch appears below, after
item 4.)

[Disadvantages of both spinlocks and higher-level synchronization
primitives:

    --performance (because of the synchronization point and cache line
      bounces)
    --performance v. complexity trade-off
        --hard to get code safe, live, and well-performing
        --to increase performance, we need finer-grained locking, which
          increases complexity, which imperils:
            --safety (i.e., race conditions more likely)
            --liveness (i.e., deadlock, starvation more likely)
    --deadlock (hard to ensure liveness)
    --starvation (hard to ensure progress for all threads)
    --priority inversion
    --broken modularity
    --careful coding required]

In user-level code, manage these disadvantages by sacrificing performance
for correctness. In kernel code, it's trickier. Any performance problems in
the kernel will be passed on to applications. Here, the situation is sort
of a mess. People use a combination of partial lock orders, careful
thought, static detection tools, code review, and prayer.

3. Can also use the hardware support to build lock-free data structures
(for example, using atomic compare-and-swap).

    --avoids the possibility of deadlock
    --often better performance
    --downside: further complexity

4. Can also use the hardware support to enable the Read-Copy Update (RCU)
technique. Technique used inside the Linux kernel. Very elegant.

    --here, *writers* need to synchronize (using spinlocks, other hardware
      support, etc.), but readers do not

[Aside: another paradigm for handling concurrency: transactions

    --transactional memory (requires different hardware abstractions)
    --transactions exposed to applications and users of applications, like
      queriers of databases]
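[The monitor sketch mentioned under item 2: a minimal mutex + condition
variable example, here a one-slot mailbox using pthreads. The struct and
function names are made up for illustration, and the mutex and condition
variable are assumed to be initialized elsewhere; the key discipline is
holding the mutex while touching shared state and re-checking the
waited-for condition in a while loop:]

    #include <pthread.h>

    struct mailbox {
        pthread_mutex_t mu;     /* assume initialized elsewhere */
        pthread_cond_t  cv;
        int full;               /* is there a message waiting? */
        int msg;
    };

    void put(struct mailbox *mb, int msg) {
        pthread_mutex_lock(&mb->mu);
        while (mb->full)                       /* always re-check in a loop */
            pthread_cond_wait(&mb->cv, &mb->mu);
        mb->msg = msg;
        mb->full = 1;
        pthread_cond_broadcast(&mb->cv);
        pthread_mutex_unlock(&mb->mu);
    }

    int get(struct mailbox *mb) {
        pthread_mutex_lock(&mb->mu);
        while (!mb->full)
            pthread_cond_wait(&mb->cv, &mb->mu);
        int msg = mb->msg;
        mb->full = 0;
        pthread_cond_broadcast(&mb->cv);
        pthread_mutex_unlock(&mb->mu);
        return msg;
    }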
Another approach to handling concurrency is to avoid it:

5. event-driven programming

    --also manages "concurrency".
    --why? (because the processor really can only do one thing at a time.)
    --good match if there is lots of I/O.
      what happens if we try to program in event-driven style and we are a
      CPU-bound process?
        --there aren't natural yield points, so the other tasks may never
          get to run
        --or the CPU-bound tasks have to insert artificial yield()s

    [only really relevant in application space. a kernel that is entirely
    event-driven will have trouble running on more than one CPU.]

6. what does JOS do?

Answer: JOS's approach to concurrency is a one-off solution that you
shouldn't take as a lesson. Here's its approach:

JOS is meant to run on single-CPU machines. It doesn't have to worry about
concurrent operations from other CPUs, but it does have to worry about
interrupts. JOS takes a simple approach: it turns interrupts off for the
entire time it is executing in the kernel. For the most part this means JOS
kernel code doesn't have to do anything special in situations where other
OSes would use locks.

JOS runs environments in user mode with interrupts enabled, so at any point
a timer interrupt may take the CPU away from an environment and switch to a
different environment. This interleaves the two environments' instructions
a bit like running them on two CPUs. The library operating system has some
data structures in memory that are shared among multiple environments
(e.g., pipes), so it needs a way to coordinate access to that data. In JOS
we will use special-case solutions, as you will find out in lab 6. For
example, to implement pipes we will assume there is one reader and one
writer. The reader and writer never update each other's variables; they
only read each other's variables. By programming carefully using this rule,
we can avoid races. (A sketch of this pattern appears at the end of these
notes.)

Ultimately, threads, synchronization primitives, etc. solve a really hard
problem: how to have multiple things going on at the same time while
letting the programmer keep it all organized, sane, and correct. To do
this, we introduced abstractions like threads, functions like switch()
[usually known as swtch()], relied on hardware primitives like XCHG, and
built higher-level objects like mutexes, monitors, and condition variables.
All of this is, at the end of the day, about presenting a relatively sane
model to the programmer, built on top of something that was otherwise
really hard to reason about.
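[The pipe sketch referenced above: a hedged illustration of the
one-reader/one-writer pattern, not the actual JOS lab code. The writer only
ever writes wpos and the buffer; the reader only ever writes rpos; each
merely reads the other's counter. In the single-CPU, sequentially
consistent setting JOS assumes, that restriction is enough to avoid races;
on a multiprocessor with a weaker memory model you would additionally need
the fences discussed in section 4:]

    #define PIPEBUF 32

    struct pipe {
        unsigned rpos;              /* updated only by the reader */
        unsigned wpos;              /* updated only by the writer */
        char buf[PIPEBUF];
    };

    /* writer: waits while the buffer is full */
    void pipe_write(struct pipe *p, char c) {
        while (p->wpos - p->rpos >= PIPEBUF)
            ;                       /* in JOS, an environment would sys_yield() here */
        p->buf[p->wpos % PIPEBUF] = c;
        p->wpos++;                  /* advance the counter only after the data is in place */
    }

    /* reader: waits while the buffer is empty */
    char pipe_read(struct pipe *p) {
        while (p->rpos == p->wpos)
            ;
        char c = p->buf[p->rpos % PIPEBUF];
        p->rpos++;
        return c;
    }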