Class 9
CS 372H  15 February 2011

On the board
------------

(One handout)

1. Review and clarify threads
        A. Kernel-level
        B. User-level
        C. Scheduling/interleaving

2. Concurrency
        --What is it?
        --What makes it hard?
        --How can we deal with races?

3. Protecting critical sections
        --Peterson
        --Next time: Spinlocks, Mutexes, Turning off interrupts

---------------------------------------------------------------------------

1. Review/clarify threads

    --abstraction. high-level motivation:
        --want to have a single process taking advantage of multiple
          CPUs; and/or
        --sometimes natural to structure a computation in terms of
          separate units of control that share memory

    --thread interface:
        --thread_create, thread_join, thread_exit
        --allocates a stack
        --and a TCB (thread control block; analogy with PCB)
        --(a minimal pthreads sketch of this interface appears at the end
          of section A below)

    --one way to understand a given implementation of threads is by
      answering:
        * where is the TCB stored?
        * what does swtch() look like, and who implements it?
        * what is the level of true concurrency?

    --please see notes from last time regarding user-level threading

    --today, we'll quickly go over what happens when the kernel implements
      threads, and then briefly review the stack switching in user-level
      threads

 A. kernel-level threading

    --Kernel maintains TCBs
        --looks a lot like a PCB
        --[Draw picture]

    --thread_create() becomes a syscall

    --when do thread switches happen?
        --with kernel-level threading, they can happen at any point

    --basic game plan for dispatch/swtch:
        --thread is running
        --switch to kernel
        --save thread state (to TCB)
        --choose new thread to run
        --load its state (from TCB)
        --new thread is running

    --Can two kernel-level threads execute on two different processors?
      (Answer: yes.)

    --Disadvantages of kernel-level threading:

        --every thread operation (create, exit, join, synchronize, etc.)
          goes through the kernel
            --> 10x-30x slower than user-level threads

        --heavier-weight memory requirements (each thread gets a stack in
          user space *and* within the kernel. compare to user-level
          threads: each thread gets a stack in user space, and there's
          one stack within the kernel that corresponds to the process.)

    --[SKIP IN CLASS] Old debates about user-level threading vs.
      kernel-level threading. The "Scheduler Activations" paper, by
      Anderson et al. [ACM Transactions on Computer Systems 10, 1
      (February 1992), pp. 53-79], proposes an abstraction that is a
      hybrid of the two.

        --basically the OS tells the process: "I'm ready to give you
          another virtual CPU (or to take one away from you); which of
          your user-level threads do you want me to run?"

        --so the user-level scheduler decides which threads run, but the
          kernel takes care of multiplexing them

    --[COVER LATER] Some people think that threads, i.e., concurrent
      applications, shouldn't be used at all (because of the many bugs
      and difficult cases that come up, as we'll discuss). However, that
      position is becoming increasingly less tenable, given multicore
      computing.

        --The fundamental reason is this: if you have a
          computation-intensive job that wants to take advantage of all
          of the hardware resources of a machine, you either need to
          (a) structure the job as different processes, or (b) use
          kernel-level threading. There is no other way, given mainstream
          OS abstractions, to take advantage of a machine's parallelism.
          (a) winds up being inconvenient (in order to share data, the
          processes either have to separately set up shared memory
          regions, or else pass messages). So people use (b).

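    [Aside, added to these notes: a minimal sketch of the thread interface
    above, using the POSIX threads API. pthread_create / pthread_join /
    pthread_exit play the roles of thread_create / thread_join /
    thread_exit; on Linux, each pthread is backed by a kernel-level
    thread, so this is an instance of case A.]

        /* threads_demo.c -- compile with: gcc threads_demo.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>

        static void *worker(void *arg)
        {
            /* each thread runs on its own stack; 'arg' is the last
               argument passed to pthread_create() */
            long id = (long)arg;
            printf("hello from thread %ld\n", id);
            return NULL;            /* same effect as pthread_exit(NULL) */
        }

        int main(void)
        {
            pthread_t t[2];

            /* "thread_create": allocate a stack and a TCB, start worker() */
            for (long i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);

            /* "thread_join": wait for each thread to exit */
            for (int i = 0; i < 2; i++)
                pthread_join(t[i], NULL);

            return 0;
        }
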
 B. User-level threading

    go over:
        * where is the TCB stored?
        * what does swtch() look like, and who implements it?
        * what is the level of true concurrency?

    --notes from last time tell you under what circumstances swtch() is
      called.

    --we'll quickly go over the implementation
        --see handout.....

    --what is the level of true concurrency?
        --answer: none. given a process that is using user-level
          threading, **only one instruction in that process can execute
          at a time**.

 C. Scheduling/interleaving threads

    --Dispatcher can choose:
        --to run each thread to completion
        --to time-slice in big chunks
        --to time-slice so that each thread executes only one instruction
          at a time

    --Programs must work in all cases, for all interleavings

    --So how can you know if your concurrent program works? Whether *all*
      interleavings work?

        1. Enumerate and test all possibilities? (Not feasible.)

        2. Instead, maintain *invariants* on program state; structure the
           program carefully to maintain these invariants

    --General strategy for dealing with concurrency:

        --use *atomic actions* [meaning the action is indivisible,
          regardless of how things are interleaved] to....

        --....build higher-level abstractions....
            --example: mutexes

        --....that provide invariants we can reason about....
            --example: only one thread of control is modifying a linked
              list at once (a sketch of this appears at the end of this
              section)

    --This is our transition to the general topic of concurrency, which
      will occupy us for a little while

    --Note that the issues with concurrency that we're going to discuss
      are relevant to all of the threading cases above. (A possible
      exception is non-preemptive user-level threads, which yield only
      when the programmer says yield(). However, it's easy to make
      mistakes, so it is best to assume that the issues of concurrency
      that we're going to discuss *always* apply if there are multiple
      execution contexts that share memory.)

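    [Aside, added to these notes: a sketch of the "only one thread
    modifies the linked list at a time" invariant, built from a mutex.
    pthread_mutex_* is the POSIX version of the mutex_lock/mutex_unlock
    interface mentioned at the end of these notes (and covered next
    time); list_push and node are made-up names for illustration. The
    handout's example 3, "incorrect list structure", is presumably a race
    of this kind.]

        /* list_mutex.c -- compile with: gcc list_mutex.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct node {
            int value;
            struct node *next;
        };

        static struct node *head;       /* shared list */
        static pthread_mutex_t list_mu = PTHREAD_MUTEX_INITIALIZER;

        static void list_push(int value)
        {
            struct node *n = malloc(sizeof(*n));
            n->value = value;

            pthread_mutex_lock(&list_mu);
            n->next = head;      /* critical section: without the mutex, */
            head = n;            /* two concurrent pushes can interleave */
            pthread_mutex_unlock(&list_mu);  /* here and lose a node     */
        }

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 100000; i++)
                list_push(i);
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            int count = 0;
            for (struct node *n = head; n != NULL; n = n->next)
                count++;
            printf("list has %d nodes (expect 200000)\n", count);
            return 0;
        }
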
---------------------------------------------------------------------------

[This material between the dashed lines is not going to be covered in
class. It is for your own reference. It may or may not be helpful in
studying.]

Quick comparison between user-level threading and kernel-level threading:

    (i). high-level choice: user-level or kernel-level
         (but can have N:M threading, in which N user-level threads are
         multiplexed over M kernel threads, so the choice is a bit
         fuzzier)

    (ii). if user-level, there's another choice: non-preemptive (also
          known as cooperative) or preemptive

          [be able to answer: why are kernel-level threads always
          preemptive?]

    --*Only* the presence of multiple kernel-level threads can give:
        --true multiprocessing (i.e., different threads running on
          different processors)
        --asynchronous disk I/O using the POSIX interface [because read()
          blocks and causes the *kernel* scheduler to be invoked]

        --but many modern operating systems provide interfaces for
          asynchronous disk I/O, at least as an extension
            --Windows
            --Linux has AIO extensions

            --thus, even user-level threads can get asynchronous disk
              I/O, by having the run-time translate calls that *appear*
              blocking to the thread [e.g., thread_read()] into a series
              of instructions that: register interest in an I/O event,
              put the thread to sleep, and swtch() to another thread

            --[moral of the story: if you find yourself needing async
              disk I/O from user-level threads, use one of the non-POSIX
              interfaces!]

Quick terminology note:

    --The kernel itself uses threads internally, when executing in kernel
      mode. Such threads-in-the-kernel are related to, but not the same
      thing as, the kernel-level threading mentioned above.

    --We'll try to keep these concepts distinct in this class, but we may
      not always succeed.

Historical notes: classification:

                            # address spaces
                            one                     many
    # threads per
    addr space

        one                 MS-DOS                  traditional Unix
                            Palm OS

        many                Embedded systems,       VMS, Mach, NT,
                            Pilot (OS on first      Solaris, HP-UX, ...
                            personal computer
                            ever built -- the
                            Alto. idea was there
                            was no need for
                            protection if there
                            was only one user.)

---------------------------------------------------------------------------

2. Concurrency

 A. What is it?

    --Stuff happening at the same time

    --Arises in many ways
        --pseudo-concurrency: from scheduling
        --real concurrency: multiple processors

    --Examples:
        --multiple kernel threads within a process
        --multiple processes sharing memory
        --what about multiple hosts distributed across a network?
          (conceptually, the issues are the same, but the needed
          mechanisms are different)

    --We're going to treat the issues in general...they apply to
      processes sharing memory pages, kernel threads sharing memory
      spaces, user-level threads that are preemptible, etc.

    --so for the rest of today, we're going to talk about two threads,
      but this could mean:
        --threads inside a single process
        --threads inside the kernel
        --even two separate processes that share memory

 B. What makes it hard?

    --lots of things can go wrong.....
        --we will see others later (deadlock, priority inversion, etc.)
        --for now, look at data races....

    --some examples; see handout:
        2a: x = 1 or x = 2.
        2b: x = 13 or x = 25.
        2c: x = 1 or x = 2 or x = 3
        3:  incorrect list structure
        4:  incorrect count in buffer

    --all of these are called *race conditions*; not all are errors,
      though.

    --the worst part of errors from race conditions is that a program may
      work fine most of the time but only occasionally show problems.
      why? (because the instructions of the various threads or processes
      or whatever get interleaved in a non-deterministic order.)

    --and it's worse than that, because inserting debugging code may
      change the timing so that the bug doesn't show up

 C. How can we deal with races?

    --make the needed operations atomic

    --how?

    1. A single-instruction add?

        'count' is in memory (that is what the example in #4 stipulates).
        assume that %ecx holds the address of 'count'.

        --Then, can we use the x86 instruction addl? For instance:

                addl $1, (%ecx)         ; count++

        --So it looks like we can implement count++/-- with one
          instruction?

        --So we're safe?

        --No: not atomic on a multiprocessor!
            --Will experience the same race condition at the hardware
              level

    2. How about using the x86 LOCK prefix?

        --can make read-modify-write instructions atomic by preceding
          them with "LOCK". examples of such instructions are: XADD,
          CMPXCHG, INC, DEC, NOT, NEG, ADD, SUB... (when their
          destination operand refers to memory)

        --but using LOCK is very expensive (flushes processor caches) and
          is not a "general-purpose abstraction"
            --it applies to only one instruction: what if we need to
              execute three or four instructions as a unit?
            --the compiler won't generate it by default; it assumes you
              don't want the penalty
            --(a small demo of a LOCK-prefixed increment appears at the
              end of this section)

    3. Critical sections

        --Place count++ and count-- in a critical section

        --Protect critical sections from concurrent execution

        --Now we need a solution to the _critical section_ problem

        --The solution must satisfy 3 rules:

            1. mutual exclusion
                only one thread can be in the c.s. at a time

            2. progress
                if no thread is executing in the c.s., one of the threads
                trying to enter a given c.s. will eventually get in

            3. bounded waiting
                once a thread T starts trying to enter the critical
                section, there is a bound on the number of other threads
                that may enter the critical section before T enters

        --Note progress vs. bounded waiting
            --If no thread can enter the C.S., we don't have progress
            --If thread A is waiting to enter the C.S. while B repeatedly
              leaves and re-enters the C.S. ad infinitum, we don't have
              bounded waiting

    --Game plan: we're now going to build primitives to protect critical
      sections

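    [Aside, added to these notes: a small demo contrasting a plain
    count++ with an increment that compiles to a LOCK-prefixed
    instruction. __sync_fetch_and_add() is a GCC/Clang builtin; on x86 it
    emits something like "lock addl" / "lock xadd". The final value of
    the racy counter varies from run to run on a multiprocessor.]

        /* atomic_add.c -- compile with: gcc -O2 atomic_add.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>

        #define N 1000000

        static int racy_count;      /* incremented with plain count++    */
        static int atomic_count;    /* incremented with a LOCKed add     */

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < N; i++) {
                racy_count++;       /* load; add; store -- a data race   */
                __sync_fetch_and_add(&atomic_count, 1);  /* atomic r-m-w */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            printf("racy_count   = %d\n", racy_count);
            printf("atomic_count = %d (expect %d)\n", atomic_count, 2 * N);
            return 0;
        }
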
3. Protecting critical sections

    --Peterson's algorithm....
        --see book
        --does satisfy mutual exclusion, progress, and bounded waiting
        --But expensive and not encapsulated
        --(a sketch of the classic two-thread version appears at the end
          of these notes)

    --High-level:

        --want: lock()/unlock() or enter()/leave() or acquire()/release()

        --lots of names for the same idea

            --mutex_init(mutex_t* m), mutex_lock(mutex_t* m),
              mutex_unlock(mutex_t* m), ....

            --pthread_mutex_init(), pthread_mutex_lock(), ...

        --in each case, the semantics are that once a thread of execution
          is executing inside the critical section, no other thread of
          execution is executing there

    --How to implement locks/mutexes/etc.?
        --Next time....

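    [Aside, added to these notes: a sketch of the classic two-thread
    Peterson's algorithm. The flag/turn protocol is the textbook version;
    the __sync_synchronize() fences are an addition that the classic
    presentation omits but that real x86 hardware and optimizing
    compilers require (stores may otherwise be reordered past later
    loads), which is one more illustration of "expensive and not
    encapsulated".]

        /* peterson.c -- compile with: gcc -O2 peterson.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>

        static volatile int flag[2];  /* flag[i] == 1: thread i wants in  */
        static volatile int turn;     /* who defers if both want in       */
        static int count;             /* the shared state being protected */

        static void peterson_lock(int i)
        {
            int other = 1 - i;
            flag[i] = 1;              /* announce intent                  */
            turn = other;             /* give the other thread priority   */
            __sync_synchronize();     /* fence: the stores above must be
                                         visible before the loads below   */
            while (flag[other] && turn == other)
                ;                     /* spin while the other thread is
                                         interested and it is its turn    */
            __sync_synchronize();     /* keep the critical section from
                                         being hoisted above the spin     */
        }

        static void peterson_unlock(int i)
        {
            __sync_synchronize();     /* flush critical-section writes    */
            flag[i] = 0;
        }

        static void *worker(void *arg)
        {
            int i = (int)(long)arg;   /* thread id: 0 or 1                */
            for (int n = 0; n < 1000000; n++) {
                peterson_lock(i);
                count++;              /* critical section                 */
                peterson_unlock(i);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t0, t1;
            pthread_create(&t0, NULL, worker, (void *)0L);
            pthread_create(&t1, NULL, worker, (void *)1L);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);
            printf("count = %d (expect 2000000)\n", count);
            return 0;
        }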