Class 26
CS 372H
27 April 2010

On the board
------------

1. Two-phase commit

2. Virtual machines

---------------------------------------------------------------------------

1. Finish two-phase commit

    A. Motivation: want distributed transaction

    B. Impossibility result: Two Generals' Problem

    C. Two-phase commit

        --Abstraction: distributed transaction, with all-or-nothing
        atomicity. Multiple machines agree to do something or not. All
        sites commit or all abort. It is unacceptable for some of the
        sites to commit their part while other sites abort.

        --Assume: every site in the distributed transaction has, on its
        own, the ability to implement a local transaction (using the
        techniques that we discussed several classes ago)

        --Constraint: there is no reliable delivery of messages (TCP
        attempts to provide such an abstraction, but it cannot fully,
        given the Two Generals' Problem.)

        --Approach: use write-ahead logging (of course) plus the
        unreliable network:

            [SEE PICTURE FOR DEPICTION OF ALGORITHM]

        --Question: where is the commit point? (Answer: when the
        coordinator logs "COMMIT".)

        --What happens if the coordinator crashes before the commit
        point? (When the coordinator revives, it finds no COMMIT record
        in its log, so the transaction aborts.)

        --What happens if messages are lost? (Retransmit them. No
        problem here.)

        --What happens if B says "No", and the message is dropped?
        (The coordinator waits for B's reply. Eventually B retransmits
        it or the coordinator times out. If the coordinator times out,
        it writes ABORT locally, and the transaction henceforth will
        abort. If the coordinator gets B's retransmission in time, then
        the coordinator's decision depends on the usual factors: what
        the other workers decided, whether the coordinator decided to
        go through with it, etc.)

        --What happens if the coordinator crashes just after the commit
        point? (No problem. It retransmits its COMMIT or ABORT.)

        --What happens if the "COMMIT" or "ABORT" message is dropped?
        (The coordinator obviously doesn't know that the message was
        dropped.) In this case.....
            --workers will resend their PREPARED messages
            --so the coordinator needs to be able to reply saying what
            happened
            --conclusion: the coordinator needs to maintain its logs
            indefinitely, including across reboots (a disadvantage of
            this approach)

        --(How long do workers have to maintain their logs? Depends on
        the local implementation of transactions, but probably they
        have to keep track of a given transaction in the log until the
        later of: writing that transaction's END record, and applying a
        checkpoint of the log to cell storage.)

        --Note that the workers can ask around to find out what
        happened, but there are limits...we can't avoid the blocking
        altogether. Here's why:
            --Let's say that a worker says to the other workers, "Hey,
            I haven't heard from the coordinator in a while. What did
            you all tell the coordinator?"
            --If any worker says to the querying worker, "I told the
            coordinator I couldn't enter the PREPARED state", then the
            querying worker knows that the transaction would have
            aborted, and it can abort.
            --But what if all workers say, "I told the coordinator I
            was PREPARED"?....Unfortunately the querying worker cannot
            commit on this basis. The reason is that the coordinator
            might have written ABORT to its own log (say, because of a
            local error or timeout). In that case, the transaction
            actually aborted! But the querying worker doesn't know
            whether this happened until the coordinator is revived.

        --NOTE: the coordinator is a single point of failure. If it
        fails permanently, we're in serious trouble. Can address that
        issue with three-phase commit.
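        --To make the message/log ordering concrete, here is a minimal
        sketch (in C) of the coordinator's side only. The helper
        functions (send_prepare, await_vote, log_outcome, send_outcome)
        are hypothetical stubs, not any real API; a real coordinator
        would also retransmit lost messages, wait for acknowledgments
        in phase 2, and replay its log after a crash.

        /* A sketch of the coordinator's half of two-phase commit.  The
         * four helpers below are hypothetical stubs standing in for
         * real messaging and write-ahead-logging code. */

        #include <stdio.h>

        enum outcome { TXN_ABORT, TXN_COMMIT };
        enum vote    { VOTE_NO, VOTE_PREPARED, VOTE_TIMEOUT };

        /* --- hypothetical stubs: replace with real network/log code --- */
        static void send_prepare(int w)
        {
            printf("PREPARE -> worker %d\n", w);    /* "can you commit?" */
        }

        static enum vote await_vote(int w)
        {
            (void)w;
            return VOTE_PREPARED;    /* pretend the worker logged PREPARED
                                        and voted yes; could also be
                                        VOTE_NO or VOTE_TIMEOUT */
        }

        static void log_outcome(enum outcome o)
        {
            printf("coordinator logs %s\n",
                   o == TXN_COMMIT ? "COMMIT" : "ABORT");
        }

        static void send_outcome(int w, enum outcome o)
        {
            printf("%s -> worker %d\n",
                   o == TXN_COMMIT ? "COMMIT" : "ABORT", w);
        }

        static enum outcome coordinate(const int *workers, int n)
        {
            enum outcome decision = TXN_COMMIT;

            /* Phase 1: ask every worker to prepare; any "no" vote or
             * timeout forces an abort. */
            for (int i = 0; i < n; i++) {
                send_prepare(workers[i]);
                if (await_vote(workers[i]) != VOTE_PREPARED)
                    decision = TXN_ABORT;
            }

            /* Commit point: once the outcome reaches the coordinator's
             * log, the decision is final, no matter what crashes later. */
            log_outcome(decision);

            /* Phase 2: announce the outcome.  The coordinator keeps the
             * logged outcome around so it can answer workers that
             * retransmit PREPARED later. */
            for (int i = 0; i < n; i++)
                send_outcome(workers[i], decision);

            return decision;
        }

        int main(void)
        {
            int workers[] = { 1, 2, 3 };
            return coordinate(workers, 3) == TXN_COMMIT ? 0 : 1;
        }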
    D. Three-phase commit (non-blocking)

        Typically covered in courses on distributed systems.

        In practice, 2PC is usually good enough. If you ever need 3PC,
        look it up.

    E. Wait, didn't the two generals tell us that we couldn't get
    everyone to agree?

        --The subtlety is the difference between everyone agreeing to
        take an action or not (two-phase commit) versus everyone
        agreeing to take that action at the same precise instant (two
        generals)

        --Quoting Saltzer and Kaashoek (chapter 9, page 92):

            "The persistent senders of the distributed two-phase commit
            protocol ensure that if the coordinator decides to commit,
            all of the workers will eventually also commit, but there
            is no assurance that they will do so at the same time. If
            one of the communication links goes down for a day, when it
            comes back up the worker at the other end of that link will
            then receive the notice to commit, but this action may
            occur a day later than the actions of its colleagues. Thus
            the problem solved by distributed two-phase commit is
            slightly relaxed when compared with the dilemma of the two
            generals. That relaxation doesn't help the two generals,
            but the relaxation turns out to be just enough to allow us
            to devise a protocol that ensures correctness."

            "By a similar line of reasoning, there is no way to ensure
            with complete certainty that actions will be taken
            simultaneously at two sites that communicate only via a
            best-effort network. Distributed two-phase commit can thus
            safely open a cash drawer of an ATM in Tokyo, with
            confidence that a computer in Munich will eventually update
            the balance of that account. But if, for some reason, it is
            necessary to open two cash drawers at different sites at
            the same time, the only solution is either the
            probabilistic approach [sending lots of copies of messages
            and hoping that one of them arrives] or to somehow replace
            the best-effort network with a reliable one. The
            requirement for reliable communication is why real estate
            transactions and weddings (both of which are examples of
            two-phase commit protocols) usually occur with all of the
            parties in one room."

    F. Thoughts and advice

        --If you're coding and need to do something across multiple
        machines, don't make it up.
            --Use 2PC (or 3PC).
            --If 2PC, identify the circumstances under which indefinite
            blocking can occur (and decide whether that is an
            acceptable engineering risk).

        --RPC is highly useful.... but....

        --RPC arguably provides the wrong abstraction
            --Its goal is an impossible one: to make transparent (i.e.,
            invisible) to the layers above it whether a local or remote
            program is running.
            --RPC focuses attention on the "common case" of everything
            working!
                --Some argue that this is the wrong way to think about
                distributed programs. "Everything works" is the easy
                case, and RPC encourages you to think about that case.
                --But the important and difficult cases concern partial
                failures (for example, not every message will get a
                reply).
                --"Exception paths" need to be as carefully considered
                as the "normal case" procedure call/return paths.
            --Conclusion: RPC may be the wrong abstraction.

        --An alternative: a lower-level message-passing abstraction
            --Makes explicit where the messages are, and therefore
            helps the program writer avoid making implicit "everything
            usually works" assumptions (see the sketch below)
            --May encourage structuring programs to handle failures
            elegantly
            --Example: persistent message queues
                --use 2PC for delivering messages
                --guarantees exactly-once delivery, even across machine
                failures and long partitions
                --but now on every message (or group of them), you're
                running that lengthy protocol. So each logical message
                costs many network messages. Sometimes you need this,
                though!
            --Conclusion: persistent message queues are probably a
            better abstraction than RPC for building reliable
            distributed systems, but they are heavier-weight.
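        --To make "the messages are explicit" concrete, here is a small
        C sketch written against a hypothetical message-passing layer.
        The functions send_msg and recv_msg_timeout are invented stubs,
        not a real API; the point is only that the failure path
        (timeout, retransmit, give up) sits in the caller's code
        instead of being hidden behind a procedure call.

        /* Sketch of a client using explicit message passing rather than
         * RPC.  The two network functions are hypothetical stubs. */

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>
        #include <string.h>

        /* --- hypothetical stubs for a best-effort network --- */
        static bool send_msg(int dst, const void *buf, size_t len)
        {
            (void)dst; (void)buf; (void)len;
            return true;                /* pretend the send succeeded */
        }

        static bool recv_msg_timeout(int src, void *buf, size_t len,
                                     int msecs)
        {
            (void)src; (void)msecs;
            memset(buf, 0, len);        /* pretend a reply arrived; a real
                                           network might instead time out */
            return true;
        }

        /* Returns true if a reply arrived.  On false, the server may or
         * may not have performed the request -- the caller must confront
         * that ambiguity, which an RPC stub would quietly paper over. */
        static bool request_with_retry(int server,
                                       const void *req, size_t reqlen,
                                       void *reply, size_t replylen)
        {
            for (int attempt = 0; attempt < 5; attempt++) {
                send_msg(server, req, reqlen);
                if (recv_msg_timeout(server, reply, replylen, 200 /* ms */))
                    return true;
                /* Timeout: was the request lost, or only the reply?
                 * We can't tell; decide here whether to retransmit. */
            }
            return false;
        }

        int main(void)
        {
            char reply[64];
            bool ok = request_with_retry(1, "get balance", 11,
                                         reply, sizeof reply);
            printf("%s\n", ok ? "got a reply"
                               : "no reply: outcome unknown");
            return 0;
        }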
---------------------------------------------------------------------------

Admin notes/questions

    --When would you prefer to demo?
        --in class at the last meeting?
        --for Namrata in the two days before finals?
        --for Namrata and me, during finals week?
        --not at all? (advantages and disadvantages here)

    --Project: please try to have fun with it. One thing to note is
    that there are no grading scripts. So you'll have to be more
    disciplined about testing, but, like professional developers, you
    may be able to code around your own bugs.

---------------------------------------------------------------------------

2. Virtual machines

    A. Intro

        --To "virtualize" means "to lie" (usually at some performance
        cost): the environment where the code is actually running is
        different from what the code seems to expect. Some piece of
        technology is "lying" to the code to fool it. Hence, "virtual
        memory", "virtualize the hardware", etc.

        --So what's a virtual machine?

            [DRAW PICTURE:

             REVIEW:

                proc_1   proc_2   ...
                ---------------------
                         OS
                ---------------------
                      HARDWARE

             WHAT IF:

                proc_1                       proc_2
                ------                       ------
                  OS1                          OS2
                --HARDWARE-LIKE INTERFACE--  --HARDWARE-LIKE INTERFACE--
                -------------------------------------------------------
                          VMM (VIRTUAL MACHINE MONITOR)
                -------------------------------------------------------
                                  HARDWARE
            ]

        --What's going on?
            --The VMM exposes a virtual machine abstraction that is
            supposed to look exactly like real hardware
            --OS1, OS2, ... run in *user* mode

        --A brief history of virtual machines
            --Old idea from the 1960s (Goldberg 1974)
            --IBM VM/370: a VMM for IBM mainframes
                --high performance overhead, but worth it because
                hardware is really expensive, so it's valuable to
                pretend you have multiple OS environments, even if each
                is slower
            --Interest died out in the 1980s and 1990s (hardware is
            cheap, Windows NT is not, so not much benefit in saving
            hardware)
            --1997: Rosenblum's group at Stanford decides they are
            going to make VMs fast
                --their technology became VMware
                --and sparked a renaissance in virtual machine
                research, commercialization, and use

        --Why are VMs interesting again today? There is clearly great
        demand for them. Why?

            --Compatibility
                --not all Windows NT applications run on XP, or XP
                applications on Vista. Solution: use a VMM to run both
                Windows NT and Windows XP

            --Multiplex multiple "machines" on the same hardware (e.g.,
            Amazon's EC2, where Amazon rents "machines" to customers)
                --Need the ability to allocate a fraction of a machine
                (modern CPUs are more powerful than most apps need)
                --If I have "one-tenth" of the machine, then I pay for
                one-tenth of the power, cooling, and space
                --And I only pay for the CPU cycles I consume, instead
                of having to buy hardware (though hardware is pretty
                cheap)
                --The server consolidation trend is very real

            --Similar benefit for a single organization
                --Simulate a whole server farm with a VMM and guest
                OSes
                --The IT department used to run the mail server on one
                machine, the internal Web server on another, etc.
                --And maybe those different servers required different
                OSes
                --It is much less hardware (and cheaper) to run all of
                these on the same physical machine: a server running on
                Linux for XYZ, a server running on Windows for ABC,
                etc., all sharing the same hardware
                --But why do you need a server farm in the first place?
                Note that VMs don't increase the available hardware. So
                there must be some sense in which isolated machines are
                valuable. What is it?

            --Isolation: we want that if a machine is rooted, the
            effects are localized.

            --Software management (get the right configuration of all
            supporting libraries; helps your mail server or database
            server run)
                --Lots of software needs to run as root. You don't want
                to give your database application root over the whole
                machine. Solution: run the entire database app in a
                virtual machine.

            --Checkpoint, migration, and replication
                --Lots of possibilities here
                --Scenario (virtual appliance work, such as moka5.com):
                A bunch of developers "store" their machines in a
                central repository. Those machines are configured with
                the right tools (compiler, repository, etc.). A new
                developer sits down and "checks out" a machine, and
                just starts working. Compare this to what is required
                to re-image a machine from a known image or, worse,
                start from scratch by installing a bunch of software.
                --Mobility: your computing environment follows you
                around.

            --Ultimately, VMMs turn the operating system itself into
            normal software that can be managed.

        --Wait, what do operating systems do?
            --Multiplex hardware
            --Provide isolation
            --Abstract hardware

        --So why is the VMM doing some of the same things? (Arguably,
        OS designers screwed up over all these years and should have
        been exposing narrower interfaces. Because the syscall
        interface is wide and because security in most OSes is a joke,
        people solve their problems (multiplexing, isolation, backward
        compatibility) at a different layer of the stack.)

        --General thoughts about virtual machines before we dive into
        details
            --End-to-end, this stuff is very cool. We are getting
            exceedingly complex behavior (emulating a Windows instance
            inside of a window on Linux) by emulating much smaller
            pieces (the interplay between the CPU and the OS). That's
            interesting because it's a classic case of taking a hard
            problem and turning it into a whole bunch of smaller,
            easier ones.
            --Ridiculously successful OS technology. Easily the biggest
            practical impact of OS research in the last 20 years. CPU
            manufacturers now support virtualization.

    B. What's required for the VMM to make the OS think it's running on
    real hardware?

        (This is the core technical challenge. After this works, there
        are many other problems to solve, ranging from memory
        management to I/O virtualization to even better performance.
        But the core challenge is to make the OS believe it is running
        on real hardware.)

        --There are different approaches.

        --You may have heard that "the x86 is not virtualizable". We'll
        discuss what that means and how VMware et al. get around that
        problem.

        --We focus on CPU virtualization and memory virtualization.

        --Approaches:

        (i) Binary interpretation (example: Bochs)

            --simplest VMM approach
            --See lecture 3 notes and handout. Here's a quick review....
            --Build a simulation of all the hardware.
                --*CPU*: a loop that fetches each instruction, decodes
                it, and simulates its effect on the machine state (see
                the sketch below)
                --*Memory*: physical memory is just an array; simulate
                the MMU on all memory accesses
                --*I/O*: simulate I/O devices, programmed I/O, DMA,
                interrupts
            +: simple!
            -: Too slow! 100x slowdown (mainly from the CPU/MMU
            simulation, not I/O)
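            --A minimal sketch of this kind of emulator loop in C. The
            machine structure and the two "opcodes" are invented purely
            for illustration and have nothing to do with real x86
            encodings, which are enormously more involved.

        /* Sketch of the fetch/decode/execute loop at the core of a
         * Bochs-style emulator (toy machine state, made-up opcodes). */

        #include <stdint.h>
        #include <stdio.h>

        #define MEM_SIZE 4096

        struct machine {
            uint32_t regs[8];        /* simulated general-purpose registers */
            uint32_t eip;            /* simulated instruction pointer       */
            uint8_t  mem[MEM_SIZE];  /* simulated physical memory; a full
                                        emulator would also simulate the
                                        MMU on every access                 */
        };

        enum { OP_INC = 0x01, OP_HLT = 0xf4 };   /* toy opcodes */

        static void run(struct machine *m)
        {
            for (;;) {
                if (m->eip + 1 >= MEM_SIZE)        /* stay in bounds  */
                    return;
                uint8_t op = m->mem[m->eip];       /* fetch           */
                switch (op) {                      /* decode          */
                case OP_INC:                       /* execute: reg++  */
                    m->regs[m->mem[m->eip + 1] & 7]++;
                    m->eip += 2;
                    break;
                case OP_HLT:
                    return;
                default:
                    printf("illegal opcode %#x at %#x\n", op, m->eip);
                    return;
                }
            }
        }

        int main(void)
        {
            static struct machine m;
            m.mem[0] = OP_INC;  m.mem[1] = 0;    /* "inc reg0" */
            m.mem[2] = OP_HLT;
            run(&m);
            printf("reg0 = %u\n", m.regs[0]);    /* prints "reg0 = 1" */
            return 0;
        }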
        (ii) Classic virtualization: trap-and-emulate

            (This doesn't work by itself on the x86, for reasons we
            will see in a moment. For now, we just cover the
            technique.)

            --Observation: most instructions behave the same regardless
            of the processor privilege level.
                EXAMPLE: incl %eax
            --Idea: just give the guest OS's instructions to the CPU.
            Let it pretend to be the OS.
            --Safety issue: how can the hypervisor get the CPU back, or
            prevent the guest OS from executing "cli" or "hlt", or from
            writing all over the other OS's memory?
                --Answer: use the hardware's protection mechanism, as
                we have been doing all semester to isolate processes

            --Virtualizing the CPU
                --Run the virtual machine's OS directly on the CPU at a
                non-privileged level
                --Most instructions just work
                --Privileged instructions trap into the monitor; the
                monitor then simulates the effect of running the
                instruction
                --This doesn't fully work on the x86. We'll get to that
                in a second. For now, we discuss how to virtualize
                traps and memory.

                [Keep a "process table" entry for the virtualized OS:
                    --saved registers
                    --saved privileged registers
                    --IDT, etc.
                    --etc.]

            --Virtualizing traps
                --What happens when an interrupt or trap occurs?
                    --Trap into the *monitor*
                --What if the interrupt or trap should go to the guest
                OS?
                    --Examples: page fault, illegal instruction, system
                    call
                    --Answer: restart the guest OS, simulating the trap
                    (see the sketch at the end of these notes):
                        --Look up the trap vector in the VM's IDT
                        --Just like the processor would have done, the
                        monitor pushes:

                            SS
                            ESP
                            EFLAGS
                            CS
                            EIP

                        --and then starts running the guest OS at the
                        code point given by the entry in the IDT. If
                        this sounds familiar, it's because that's
                        exactly what Bochs/QEMU do, and what the
                        processor does in hardware.
                --What if the interrupt or trap happens because the
                guest OS tried to, say, do something privileged
                (loading CR0 into eax, reading/writing the disk, etc.)?
                    --The monitor fakes it: in the "mov %cr0, %eax"
                    example (which a normal OS is allowed to do but
                    which a user-level process is not), the VMM would
                    load a fake value of CR0 into eax and restart the
                    guest OS right after that instruction

            --Virtualizing memory
                --Next time......
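            --Going back to "virtualizing traps": here is a small C
            sketch of the trap-reflection step described above. The
            guest structure, the toy stack array, and the helpers are
            hypothetical simplifications (a real VMM would switch to
            the guest's kernel stack via its TSS and translate guest
            addresses before writing guest memory); only the five
            pushed words and the IDT lookup follow the description in
            these notes.

        /* Sketch: a VMM reflecting a trap into the guest OS by pushing
         * the same five words the hardware would push, then resuming
         * the guest at the handler named in the guest's own IDT. */

        #include <stdint.h>
        #include <stdio.h>

        #define GUEST_STACK_WORDS 1024

        struct idt_entry {
            uint32_t handler_eip;     /* where the guest wants this trap */
            uint32_t handler_cs;
        };

        struct guest {
            /* saved guest state ("process table" entry for the VM) */
            uint32_t eip, esp, eflags, cs, ss;
            struct idt_entry idt[256];           /* the guest's IDT      */
            uint32_t stack[GUEST_STACK_WORDS];   /* toy stand-in for the
                                                    guest's kernel stack */
        };

        /* Push one 32-bit word onto the guest's (toy) stack.  A real
         * VMM would translate the guest virtual address and write the
         * guest's memory instead. */
        static void guest_push(struct guest *g, uint32_t word)
        {
            g->esp -= 4;
            g->stack[g->esp / 4] = word;
        }

        /* Deliver trap number `vec` to the guest, as the CPU would. */
        static void reflect_trap(struct guest *g, int vec)
        {
            uint32_t old_esp = g->esp, old_ss = g->ss;

            guest_push(g, old_ss);
            guest_push(g, old_esp);
            guest_push(g, g->eflags);
            guest_push(g, g->cs);
            guest_push(g, g->eip);

            /* Resume the guest at the handler its own IDT names. */
            g->cs  = g->idt[vec].handler_cs;
            g->eip = g->idt[vec].handler_eip;
        }

        int main(void)
        {
            static struct guest g;                   /* large struct */
            g.esp = GUEST_STACK_WORDS * 4;           /* empty stack  */
            g.eip = 0x1234; g.cs = 0x8; g.ss = 0x10; g.eflags = 0x202;
            g.idt[14].handler_eip = 0x5678;  /* guest page-fault handler */
            g.idt[14].handler_cs  = 0x8;

            reflect_trap(&g, 14);                    /* fake a page fault */
            printf("guest resumes at %#x; old eip %#x is on its stack\n",
                   g.eip, g.stack[g.esp / 4]);
            return 0;
        }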