Class 12
CS372H
24 February 2011

On the board
------------

(One handout)

1. Last time

2. Trade-offs and problems from locking
        A. deadlock
        B. starvation
        C. priority inversion
        D. broken modularity
        E. performance
        F. performance/complexity trade-off

3. Concurrency is hard!

4. More advice

---------------------------------------------------------------------------

1. Last time

    --one problem with a programming model that depends on spinlocks,
      mutexes, condition variables, monitors: deadlock.

    --several ways of avoiding it

    --no silver bullet

    --in practice, people try to code carefully

    --automated tools get better every year; consider using them when you
      code (Valgrind has Helgrind.)

2. Trade-offs and problems from locking

  A. Deadlock: last time

  B. Starvation

    --thread waiting indefinitely (if low priority and/or if the resource is
      contended)

  C. Priority inversion

    --T1, T2, T3: (highest, middle, lowest priority)
    --T1 wants to get the lock, T2 runnable, T3 runnable and holding the lock
    --System will preempt T3 and run the highest-priority runnable thread,
      namely T2
    --Solutions:
        --Temporarily bump T3 to the highest priority of any thread that is
          ever waiting on the lock
        --Disable interrupts, so no preemption (T3 finishes) ... works okay
          unless a page fault occurs
        --Don't handle it; structure the app so that only adjacent-priority
          processes/threads share locks
    --Happens in real life. For a real-life example, see:
      http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Mars_Pathfinder.html

  D. Broken modularity

    --examples above: avoiding deadlock requires understanding how programs
      call each other.
    --also, need to know, when calling a library, whether it's thread-safe:
      printf, malloc, etc. If not, surround the call with a mutex. (Can
      always surround calls with mutexes conservatively.)
    --we'll see other examples below.

  E. Performance

    quick digression:

        --_dance hall_ architecture: any CPU can "dance with" any memory
          equally (equally slowly)

        --NUMA (non-uniform memory access): each CPU has fast access to some
          "close" memory; slower to access memory that is further away
            --AMD Opterons like this
            --Intel CPUs moving toward this
            --see next-to-last page of handout

        --two further choices: cache coherent or not. in the former case,
          hardware runs a cache coherence (cc) protocol to invalidate caches
          when a local change happens. in the latter case, it does not. the
          former case is far more common.

    let's assume ccNUMA machines...back to performance issues....

    our baseline is a test-and-test-and-set spinlock, which is basically
    what Linux uses:

        void acquire(Lock* lock) {
            pushcli();
            while (xchg_val(&lock->locked, 1) == 1) {
                while (lock->locked)
                    ;
            }
        }

        void release(Lock* lock) {
            xchg_val(&lock->locked, 0);
            popcli();
        }

    the performance issues are:

    (i) fairness
        --one CPU gets the lock because the memory holding the "locked"
          variable is closer to that CPU
        --allegedly, Google had fairness problems on Opterons
          (I have no proof of this)

    (ii) lots of traffic over the memory bus: if there is lots of contention
        for the lock, then the cache coherence protocol creates lots of
        remote invalidations every time someone tries to do a lock
        acquisition

    (iii) cache line bounces (same reason as (ii))

    (iv) locking inherently reduces concurrency

    mitigation of (i)--(iii): better locks

        --MCS locks
            --see handout
            --advantages
                --guarantees FIFO ordering of lock acquisitions
                  (addresses (i))
                --spins on a local variable only (addresses (ii), (iii))
                --[not discussing this, but: works equally well on machines
                  with and without coherent caches]
            --(a rough sketch of the idea appears just after this list)
            --NOTE: with fewer cores, spinlocks are better. why?
            --In fact, if there is high contention, performance will be poor
              no matter what, though MCS locks will make it a little less
              poor. More on that in a bit.

        --futexes
            --see notes below or next time
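    Rough sketch of the MCS idea (this is NOT the handout's implementation;
    it is a minimal sketch using GCC's __atomic builtins, and the names
    mcs_lock, mcs_qnode, mcs_acquire, mcs_release are invented here). Each
    waiter enqueues a node of its own and spins only on that node's flag;
    the releaser hands the lock directly to its successor. The sketch omits
    interrupt disabling (pushcli/popcli) and glosses over some memory-
    ordering details a production version would need.

        #include <stddef.h>   /* NULL */

        typedef struct mcs_qnode {
            struct mcs_qnode *next;   /* successor in the wait queue, if any */
            volatile int locked;      /* 1 while this waiter must keep spinning */
        } mcs_qnode;

        typedef struct {
            mcs_qnode *tail;          /* last waiter in line, or NULL if free */
        } mcs_lock;

        void mcs_acquire(mcs_lock *lk, mcs_qnode *me)
        {
            me->next = NULL;
            me->locked = 1;
            /* atomically make ourselves the tail; whoever was there before
               is our predecessor */
            mcs_qnode *pred = __atomic_exchange_n(&lk->tail, me,
                                                  __ATOMIC_ACQ_REL);
            if (pred != NULL) {
                pred->next = me;      /* link in behind the predecessor */
                while (me->locked)    /* spin on our own node only */
                    ;
            }
        }

        void mcs_release(mcs_lock *lk, mcs_qnode *me)
        {
            if (me->next == NULL) {
                /* no successor visible: try to swing tail back to NULL */
                mcs_qnode *expected = me;
                if (__atomic_compare_exchange_n(&lk->tail, &expected, NULL, 0,
                                                __ATOMIC_ACQ_REL,
                                                __ATOMIC_ACQUIRE))
                    return;           /* nobody was waiting */
                while (me->next == NULL)
                    ;                 /* a waiter is mid-enqueue; wait for it
                                         to link in */
            }
            me->next->locked = 0;     /* hand the lock to the next waiter */
        }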
    mitigation of (iv): more fine-grained locking

        --unfortunately, fine-grained locking leads to the next issue, which
          is also fundamental

  F. Performance/complexity trade-off

    --one big lock is often not great for performance, even when we use the
      fancier locks above

    --indeed, locking itself is the issue: changing the lock type is
      unlikely to be as big of a performance win as restructuring the code

    --the fundamental issue with coarse-grained locking is that only one CPU
      at a time can execute anywhere in your code. If your code is called a
      lot, this may reduce the performance of an expensive multiprocessor to
      that of a single CPU.

    --if this happens inside the kernel, it means that applications inherit
      the performance problems from the kernel

    --Perhaps locking at smaller granularity would get higher performance
      through more concurrency.

    --But how best to reduce lock granularity is a bit of an art.

    --And unfortunately finer-grained locking makes incorrect code far more
      likely

    --And modularity further suffers (see item D. above)

    --Two examples of the above issues:

    --Example 1: imagine that every file in the file system is represented
      by a number, in a big table

        --You might inspect the file system code and notice that most
          operations use just one file or directory, leading you to have one
          lock per file

        --You could imagine the code implementing directories exporting
          various operations like

            dir_lookup(d, name)
            dir_add(d, name, file_number)
            dir_del(d, name)

        --With fine-grained locking, these directory operations would
          *internally* acquire the lock on d, do their work, and release the
          lock

        --Then higher-level code could implement operations like moving a
          file from one directory to another:

            move(olddir, oldname, newdir, newname) {
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
            }

        --Unfortunately, this isn't great:
            --there is a period of time when the file is visible in neither
              directory. to fix that requires that the directory locks _not_
              be hidden inside the dir_* operations.

        --so we need something like this:

            move(olddir, oldname, newdir, newname) {
                acquire(olddir.lock)
                acquire(newdir.lock)
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
                release(newdir.lock)
                release(olddir.lock)
            }

        --The above code is a bummer in that it exposes the implementation
          of directories to move(), but (if all you have is locks) you have
          to do it this way.
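        --One further wrinkle worth flagging (not in the original example):
          if one thread moves a file from d1 to d2 while another moves a
          file from d2 to d1, the two acquisitions above can deadlock. The
          usual fix is to acquire the two locks in a fixed global order. A
          hypothetical sketch, ordering by address and reusing the Dir,
          Lock, acquire()/release(), and dir_*() names assumed above (this
          is illustration, not a real API):

            void move(Dir *olddir, const char *oldname,
                      Dir *newdir, const char *newname)
            {
                /* pick a global order for the two locks (here, by the
                   directories' addresses) so concurrent moves in opposite
                   directions cannot deadlock */
                Lock *first  = (olddir < newdir) ? &olddir->lock : &newdir->lock;
                Lock *second = (olddir < newdir) ? &newdir->lock : &olddir->lock;

                acquire(first);
                if (second != first)      /* olddir == newdir: rename in place */
                    acquire(second);

                int file_number = dir_lookup(olddir, oldname);
                dir_del(olddir, oldname);
                dir_add(newdir, newname, file_number);

                if (second != first)
                    release(second);
                release(first);
            }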
        --Example 2: see filemap.c at the end of the handout for an extreme
          case

    --Mitigation? Unfortunately, there is no way around this trade-off.

    --worse, it's easy to get this stuff wrong: correct code is harder to
      write than buggy code

    --If you have fine-grained locking (i.e., you are trading off
      simplicity), then you are much more likely to encounter the two types
      of errors:

        (i) safety errors (race conditions)
        (ii) liveness errors (deadlocks, etc.)

    --***So what do people do?***

        --in app space:

            --don't worry too much about performance up front. makes it
              easier to keep your code free of safety problems *and*
              liveness problems

            --if you are worrying about performance, make sure there are no
              race conditions. much more important than worrying about
              deadlock.

            --SAFETY FIRST.

                --almost always far better for your program to do nothing
                  than to do the wrong thing (example of using a linear
                  accelerator for radiation therapy: **way** better not to
                  subject the patient to the radiation beam at all than to
                  subject the patient to a beam that is 100x too strong,
                  leading to gruesome, atrocious injuries)

                --if the program deadlocks, the evidence is intact, and we
                  can go back and see what the problem was.

            --there are ways around deadlock, as we will discuss in a moment

            --but we shouldn't be too cavalier about liveness issues because
              they can lead to catastrophic cases. Example: Mars Pathfinder
              (which was addressed; see above), but still.

        --in kernel space:

            --same thing, to some extent

            --but performance matters more in kernel space, so you are
              likely to be dealing with more complex issues

            --here again, SAFETY FIRST
                --lock more aggressively
                --worry about deadlock later

            --not a satisfying answer, but there is no silver bullet for
              concurrency-related issues

    --By the way, if there is lots of contention, then the style and
      granularity of locks will not eliminate the problem. Where does
      contention come from?

        --application requirements. lots of contention comes from
          applications that inherently require global resources or shared
          data.

        --example of Apache: every CPU needs to write to a global logfile,
          which causes contention in the kernel. you can make the locking as
          fine-grained as you want, but at the end of the day, if there's a
          single logfile, a single writer permitted at a time, and many
          contending writers, then that logfile is going to wind up
          serializing all of the writers.

3. Concurrency is hard!

    Sequential consistency

    Our examples all along have been assuming sequential consistency....but
    what does this amount to assuming? See examples on handout.

    (i) Defn of sequential consistency:

        The result of execution is as if all operations were executed in
        some sequential order, and the operations of each processor occurred
        in the order specified by the program.

        [citation: L. Lamport. How to Make a Multiprocessor Computer that
        Correctly Executes Multiprocess Programs. _IEEE Transactions on
        Computers_, Volume C-28, Number 9, September 1979, pp. 690-691.
        http://research.microsoft.com/en-us/um/people/lamport/pubs/multi.pdf]

        Basically means:
            --Maintaining program order on individual processors
            --Ensuring write atomicity

    (ii) Why isn't sequential consistency always in effect?

        --It's expensive for the hardware (sometimes overlapping
          instructions, or providing non-blocking memory reads, helps the
          hardware's performance)

        --The compiler sometimes wants to violate s.c.:
            --moves code around
            --caches values in registers
            --common subexpression elimination (could cause memory to be
              read fewer times)
            --re-arranges loops for better cache performance
            --software pipelining
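        To make these points concrete, here is a small example (hypothetical;
        not from the handout) of code that silently assumes sequential
        consistency. The names data/ready/producer/consumer are invented,
        and pthreads is assumed; with plain int variables and no
        synchronization, either the compiler or the CPU is free to break it:

            #include <pthread.h>
            #include <stdio.h>

            int data  = 0;
            int ready = 0;          /* plain int: nothing stops the compiler
                                       from caching it in a register */

            void *producer(void *arg) {
                data  = 42;
                ready = 1;          /* compiler or CPU may reorder these two
                                       stores */
                return NULL;
            }

            void *consumer(void *arg) {
                while (!ready)      /* compiler may hoist this load out of the
                                       loop and spin forever on a stale copy */
                    ;
                printf("%d\n", data);  /* even if the loop exits, this may
                                          print 0, not 42, without fences */
                return NULL;
            }

            int main(void) {
                pthread_t p, c;
                pthread_create(&c, NULL, consumer, NULL);
                pthread_create(&p, NULL, producer, NULL);
                pthread_join(p, NULL);
                pthread_join(c, NULL);
                return 0;
            }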
    (iii) What does the x86 do?

        --x86 supports multiple consistency/caching models
            --Memory Type Range Registers (MTRR) specify consistency for
              ranges of physical memory (e.g., the frame buffer)
            --Page Attribute Table (PAT) allows control for each 4K page

        --Choices include:
            WB: Write-back caching (the default)
            WT: Write-through caching (all writes go to memory)
            UC: Uncacheable (for device memory)
            WC: Write-combining: weak consistency & no caching

        --Some instructions have weaker consistency
            --String instructions
            --Special "non-temporal" instructions that bypass the cache

    (iv) x86 WB consistency

        --Old x86s (e.g., 486, Pentium 1) had almost SC
            --Exception: a read could finish before an earlier write to a
              different location

        --Newer x86s let a processor read its own writes early
            --see handout, item 3c: both of those functions can return 2;
              that is, the two processors see the loads in different orders
            --Older CPUs would wait at "f = ..." until the store completed

    (v) x86 atomicity (review)

        --lock prefix
            --review: the lock prefix makes a memory instruction atomic (by
              locking the bus for the duration of the instruction, which is
              expensive).
            --all locked instructions are totally ordered
            --other memory instructions cannot be re-ordered with locked
              ones

        --xchg (always locked; no prefix needed)

        --fence instructions that can prevent re-ordering
            LFENCE -- can't be reordered with reads (or later writes)
            SFENCE -- can't be reordered with writes
            MFENCE -- can't be reordered with reads or writes
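        As an aside, the xchg bullet is exactly what the xchg_val() helper
        in the spinlock from section 2E relies on. A guess at how that
        helper might be written (a sketch using GCC inline assembly; this is
        not code from the handout):

            /* sketch: atomically swap newval into *addr and return the old
               value. xchg with a memory operand is implicitly locked, so it
               is atomic and acts as a full barrier; no "lock" prefix
               needed. */
            static inline unsigned int
            xchg_val(volatile unsigned int *addr, unsigned int newval)
            {
                unsigned int result;
                asm volatile("xchgl %0, %1"
                             : "+m" (*addr), "=a" (result)
                             : "1" (newval)
                             : "cc", "memory");
                return result;
            }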
4. More advice

    Two things for you to remember here and always; these two things are
    implied by the above discussion of sequential consistency:

    (1) if you're *using* a synchronization primitive (e.g., a mutex), do
        NOT try to read any shared data outside of the mutex (the mutex
        provides the needed ordering).

    (2) if you're *implementing* a synchronization primitive, you need to
        read the language manual carefully (to tell the compiler what not to
        order), and you need to read the processor manual carefully (to
        understand the default memory model and how to override it if
        necessary).

    ***Your best hope if you're working with threads and monitors in
    application space:***

    (3) coarse-grained locking

    (4) the MikeD rules/commandments/standards: lock()/unlock() at the
        beginning and end of functions, use monitors, use while loops to
        check scheduling constraints, etc. (a short sketch of this style
        appears at the end of these notes)

    (5) disciplined hierarchical structure to your code (so you can order
        the locks), and avoid up-calls. if your structure is poor, you have
        little hope. so the following more detailed advice assumes decent
        structure:

        --if you have to make an up-call, better to ensure that the partial
          order on locks is maintained and/or that the up-call doesn't
          require a lock to issue the up-call (*)

        --if you have nested objects or monitors (as in the M,N example in
          l11-handout), then there are some cases:

            --if the target of the call does not lock, then no problem. the
              outer monitor can keep holding the lock.

            --if the target of the call locks but does not wait, then the
              caller can continue to hold the lock PROVIDED that a partial
              ordering exists (for example, that the callee never issues a
              callback/up-call to the calling module or to an even higher
              layer, as mentioned in (*) above)

            --if the target of the call locks and does wait, then it is
              dangerous to call it while holding a lock in the outer layer.
              here, you need a different code structure. unfortunately,
              there is no silver bullet.

        --to avoid nested monitors, you can/should break your code up:

                M
                |
                N

            becomes:

                  O
                 / \
                M   N

            where O is an ordinary module, not a monitor.

            example: M implements checkin/checkout for a database. O is the
            database, and N is some other monitor.

    (6) run static detection tools (commercial products; search around);
        they're getting better every year

    (7) run dynamic detection tools (Valgrind, etc.): instrument the program
        if needed

    --Bummer about all of this: it's hard to hide the details of
      synchronization behind object interfaces

    --That is, even the solutions that avoid deadlock require breaking
      abstraction barriers and highly disciplined coding
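    Sketch for item (4) above (a hypothetical example, not from the handout,
    written with pthreads; the counter and the names buffer_put/buffer_get
    are invented): lock at the top of each function, unlock at the bottom,
    and re-check scheduling constraints with a while loop.

        #include <pthread.h>

        static pthread_mutex_t mu       = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
        static int count = 0;

        void buffer_put(void)
        {
            pthread_mutex_lock(&mu);            /* lock at the top */
            count++;
            pthread_cond_broadcast(&nonempty);  /* wake any waiters */
            pthread_mutex_unlock(&mu);          /* unlock at the bottom */
        }

        int buffer_get(void)
        {
            pthread_mutex_lock(&mu);
            while (count == 0)                  /* while, not if: re-check the
                                                   condition after waking */
                pthread_cond_wait(&nonempty, &mu);
            int c = count--;
            pthread_mutex_unlock(&mu);
            return c;
        }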