Class 24 CS 372H 19 April 2012

On the board
------------
1. Last time
2. Deterministic execution
3. Determinator
4. Determinator: quick discussion
5. Guest presenter: Chris Cotter

---------------------------------------------------------------------------

1. Last time

--transactions. discussed:
    --crash recovery (in DB context)
    --isolation (in OS context)
--there's a classic DB approach to isolation (2-phase locking, NOT to be
  confused with 2-phase commit) described in the notes from last time.

2. Deterministic execution

y = 12. two threads. one executes t1, one t2:

    t1() { x = y + 1; }
    t2() { y = y * 2; }

--what result would Determinator provide for x?
    --answer: 13
--interleaving schedule: **predictable**, not just deterministic
--all nondeterministic inputs (time, random numbers, etc.): eliminated
--why do we want this?

3. Determinator

A. What's the approach?

--Kernel exposes Put/Get/Ret
--User-level abstractions built on top of those
--Note: the kernel internally is not deterministic. if you "snapshot" the
  processor and look at which "spaces" actually run on it, it will not be
  the same every time the system is run anew. put differently, the
  scheduler that assigns spaces to the processor can be utterly
  non-deterministic!

B. Programming model and environment

--private workspace model
    --they didn't invent the model, but I believe that they are the first
      to incorporate it into a programming model exposed by the OS.
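The private workspace model can be made concrete with the t1()/t2() example from section 2. Below is a hypothetical Python sketch (all names invented; this simulates the model's semantics, it is not Determinator code): each thread runs against a snapshot of memory taken at fork time, and its writes are merged back at join, so t1 reads y's initial value no matter how the threads are scheduled.

```python
# Hypothetical sketch of the private workspace model (invented names;
# a simulation of the semantics, not Determinator code). Each thread
# runs against a snapshot of memory taken at fork time; writes are
# merged back at join. t1 therefore reads y's initial value (12)
# regardless of how the two threads are scheduled.

def fork_join(initial, threads):
    """Run each thread body on its own snapshot, then merge at join."""
    workspaces = []
    for body in threads:            # iteration order doesn't matter:
        ws = dict(initial)          # every thread sees fork-time state
        body(ws)
        workspaces.append(ws)
    merged = dict(initial)
    for ws in workspaces:           # merge only the values each
        for var, val in ws.items(): # thread actually changed
            if val != initial.get(var):
                merged[var] = val
    return merged

def t1(m): m["x"] = m["y"] + 1      # reads the snapshot's y, which is 12
def t2(m): m["y"] = m["y"] * 2

mem = fork_join({"y": 12}, [t1, t2])
print(mem["x"])   # always 13; with real shared memory it could be 13 or 25
```

Under true shared memory, x could be 13 or 25 depending on the interleaving; here the snapshot makes 13 the only possible answer, which is the predictability the notes point at.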
--cannot get read/write races (like in the example above) because the
  thing being read is the state from the beginning
--can get write/write conflicts (unavoidable if there's concurrency), but
  they do something cool here:
    --every time the program runs with the same input, the write/write
      conflict is the same
    --so they can either
        --resolve the conflict deterministically, or
        --declare the mere existence of a conflict to be a bug
--synchronization primitives are deterministic
    --natively support fork/join
    --emulate locks (though the authors would say that locks are a bad
      idea)
        --the difference is that the scheduling of acquire() [say] will
          always be deterministic
--race-free system namespaces
    --deep idea (in my opinion, this is one of the coolest things about
      this paper)
    --"Application code, not the system, decides where to allocate memory
      and what process IDs to assign children. This principle ensures that
      naming a resource reveals no shared state information other than
      what the application itself provided." [note contrast with
      exokernel!]
    --they then go on to point out that designing things this way may
      sidestep kernel bottlenecks (like a global lock for a system table),
      which may help multicore scalability
--result is that the system does not provide shared memory!! you can
  simulate two threads seeing the same memory, as we'll see below....

C. What's the mechanism?

draw picture:

    ----------------------------------------------------
    |   space                    space                 |
    |  ----------------------  ---------------------  |
    |  |        app         |  |        app        |  |
    |  |--------------------|  |-------------------|  |
    |  |      runtime       |  |      runtime      |  |
    |  ----------------------  ---------------------  |
    |        |                        |               |
    |--------------------------------------------------|
    |                       Det                        |
    |__________________________________________________|

--every space gets its own copy of memory
    --use copy-on-write for efficiency
--a parent can create child spaces
--parent can insist on child's returning control
--(how do child spaces go away?
    --if they exit, fine
    --if they don't, they can be stopped.
    --however, if they are stopped, it's not clear how to garbage collect
      them. maybe this wasn't implemented.)
--parent can do: Put/Get
    --puts the caller to sleep until Ret or a processor trap
    --(wait, what prevents processor traps from being non-deterministic?)
--child can do: Ret
    --stops the calling space, returning control to the parent
    --exceptions cause a logical Ret
--assertion: Put/Get/Ret gives determinism, so the kernel is done.
--rest of the work is how to use these three system calls to build a
  usable programming model and system

D. Implementation of fork/exec/wait

--fork straightforward
--ASK: how do they do exec?
    --answer: the runtime keeps a child space around
        --loads up the child space
        --then calls Get to load the code from the child's space into the
          parent space
--ASK: how do they do wait?
    --answer: mainly a Get call in the parent and a Ret call in the child.

E. File system

--replicate the file system in every process
--implement file versioning
--and treat it like a distributed file system
    --note that conflicts here are NOT Determinator conflicts, as in a
      get() with merge set. They're higher-level conflicts.
--after wait(), the runtime copies the child's file system image (this is
  easy because they don't have to worry about the disk!!!)

F. What about actual non-determinism on systems, such as I/O and timer
   events? (the point being that lots of things on the computer *are*
   non-deterministic: timer interrupts, cycle counters, console inputs,
   etc.) So how do they expose these things to user-level code?

--convert these things into explicit I/O channels
--represent the I/O channels as special files, maintained by the
  supervising process
    --the contents of these files get merged at sync points.
--so the idea is that the process just sits there doing read() and write()
    --if there's no more data, read() turns into Ret and waits for data.
        --the parent may ultimately do the same thing
    --write() sticks data in the console file. when the process syncs with
      its parent, it tells its parent about the data.
      the parent tells its parent, etc. eventually the kernel is told
      about the data, since the root process has access to the kernel's
      I/O devices.
--ASK: wait, where does the determinism come from?
    --to get determinism, the supervising process COULD, if it wanted:
        --replay the events
        --synthesize the content
        --etc.
    --this is where the determinism comes from, but it's arguably a fudge
      (on the other hand, they have no choice).
    --the reason that it's a fudge is that if you ran the code again,
      you'd get a different answer. to get the same answer, you'd have to
      run in a special replay mode, and then make sure that you had
      originally logged.
    --it's not clear that this approach would help debugging, unless
      they're logging by default.

G. How do they provide a conventional programming model?

--see 4.4: the easiest thing is to expose the private workspace model
  using fork/join or barriers
--also see 4.5: can provide conventional shared memory (vs. private
  workspace), but only with deterministic scheduling. how do they do it?
    --run every thread for a certain quantum of time; then merge their
      memories together
    --violates some consistency guarantees within a quantum, but after
      each quantum, everything is consistent, and all threads can see each
      other's memory.
    --requires some hacks to make it reasonably efficient
    --unhappy tradeoff:
        --large scheduling quantum: threads waste time
        --small scheduling quantum: lots of propagation of shared state
          back and forth
    --and one still cannot predict what will happen in the t1()/t2()
      example above (though it will be the same every time), if we rewrite
      the example as:

        t1() { acquire_lock(); x = y + 1; release_lock(); }
        t2() { acquire_lock(); y = y * 2; release_lock(); }

4. Questions/discussion

--clean slate approach (which is nice). they take an idea to its extreme.
--started from JOS code! (but, from a quick skim of the pios branch of the
  JOS repo, it does not look a lot like JOS)

A. Limitations?
--the file system is not persistent; that makes the "file system" a bit
  of an easier problem
--can only synchronize and communicate with the immediate parent and
  children.
    --why do you think that they do that?
        --perhaps a remnant of JOS
        --avoids circularity
        --sidesteps nasty permissions issues [CHECK]
        --what else?
--a bunch of others. see the fourth paragraph of section 5.

B. Design decisions

--why implement this functionality in the kernel? why not have a
  user-level thread scheduler do everything?
    --answer: they are going for complete determinism. complete
      determinism requires near-total control over the environment that
      the thread/process sees, and it's really the kernel that creates
      this environment. hence, they need to make the levels above the
      kernel see determinism.

C. Performance

--why does it give up performance? why does it work better with
  coarse-grained parallelism?
--a few reasons
    --first, any synchronization is very expensive. what used to be
      enqueuing and dequeuing a thread or process on a queue is now a
      copy and traversal of the page tables; not cheap
        --another way to say this is that fork() and join() are the only
          points at which threads can view each other's memory, and those
          operations are expensive.
    --second, synchronization is much more coarse-grained. that has a gain
      -- less synchronization -- but also a cost: if the app requires lots
      of synchronization, then either:
        --we're doing those virtual memory operations a lot; or
        --a thread is asleep waiting to be joined
--thus, the applications where this model is likely to work best are
  those that exhibit coarse-grained parallelism: take a chunk of work,
  compute on it, and "check it back in".

5. Guest presentation
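As a closing illustration of section 3G's quantum-based shared-memory emulation, here is a hypothetical Python sketch (all names invented; the real system merges via page tables and virtual memory, not dictionaries): each thread runs a fixed number of operations per round against a private copy of memory, threads are stepped in a fixed order, and the copies are merged back deterministically at each quantum boundary.

```python
# Hypothetical sketch of the quantum-based shared-memory emulation from
# section 3G (invented names; the real system merges page tables, not
# dicts). Threads run `quantum` operations per round against a snapshot
# of shared memory; snapshots are merged back in a fixed thread order,
# so the outcome is identical on every run, even if hard to predict.

def run_quantum_rounds(shared, threads, quantum):
    """threads: list of op lists; each op is a function on a workspace."""
    pcs = [0] * len(threads)                   # per-thread progress
    while any(pc < len(ops) for pc, ops in zip(pcs, threads)):
        base = dict(shared)                    # state at round start
        workspaces = []
        for i, ops in enumerate(threads):      # fixed, deterministic order
            ws = dict(base)                    # private copy for the round
            for _ in range(quantum):
                if pcs[i] < len(ops):
                    ops[pcs[i]](ws)
                    pcs[i] += 1
            workspaces.append(ws)
        for ws in workspaces:                  # deterministic merge:
            for var, val in ws.items():        # later threads win on
                if val != base.get(var):       # write/write conflicts
                    shared[var] = val

# the t1()/t2() example: both threads read y's value from the start of
# the round, so the result never varies across runs.
def t1_op(ws): ws["x"] = ws["y"] + 1
def t2_op(ws): ws["y"] = ws["y"] * 2

mem = {"y": 12}
run_quantum_rounds(mem, [[t1_op], [t2_op]], quantum=1)
print(mem)   # always {'y': 24, 'x': 13}
```

This also shows the unhappy tradeoff from the notes: a small quantum means re-copying and re-merging the workspaces every few operations, while a large quantum delays when threads see each other's writes.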