Class 24
CS 439
11 April 2013

On the board
------------

1. Last time

2. Transactions, continued

3. Distributed systems
    --motivation for distributed transactions
    --impossibility result: two generals' problem
    --two-phase commit (2PC)

---------------------------------------------------------------------------

1. Last time

Transactions, with emphasis on crash recovery. We talked about
propagating from log to stable storage; the idea is that the modified
pieces of the database are in RAM, so what we're really talking about
is the propagation from RAM to the log versus RAM to cell storage.

Reinforce the concept: we said there is an OUTCOME(commit) record. we
said there is an END record. so what does it mean if there is an
OUTCOME(abort) record? and then a subsequent END record? (that any
modifications to stable storage that were part of the transaction have
been unwound)

Also discussed isolation.

2. Transactions: isolation

--easiest approach: one giant lock. only one transaction active at a
time. so everything really is serialized
    --advantage: easy to reason about
    --disadvantage: no concurrency

--next approach: fine-grained locks (e.g., one per cell, or per-table,
or whatever), and acquire all needed locks at begin_transaction and
release all of them at end_transaction
    --advantage: easy to reason about. works great if it could be
    implemented
    --disadvantage: requires transaction to know all of its needed
    locks in advance

--actual approach: two-phase locking. gradually acquire locks as needed
inside the transaction manager (phase 1), and then release all of them
together at the commit point (phase 2)

    --your intuition from the concurrency unit will tell you that this
    creates a problem....namely deadlock. we'll come back to that in a
    second.

    --why does this actually preserve a serial ordering? here's an
    informal argument:

        --consider the lock point, that is, the point in time at which
        the transaction owns all of the locks it will ever acquire.

        --consider any lock that is acquired. call it L. from the point
        that L is acquired to the lock point, the application always
        sees the same values for the data that lock L protects (because
        no other transaction or thread can get the lock).

        --so regard the application as having done all of its reads and
        writes instantly at the lock point

        --the lock points create the needed serialization. Here's why,
        informally. Regard the transactions as having taken place in
        the order given by their lock points. Okay, but how do we know
        that the lock points serialize? Answer: each lock point takes
        place at an instant, and any lock points with intersecting lock
        sets must be serialized with respect to each other as a result
        of the mutual exclusion given by locks.

    --to fix deadlock, several possibilities:
        --one of them is to remove the no-preempt condition: have the
        transaction manager abort transactions (roll them back) after a
        timeout

C. Loose end: checkpoints

--we haven't incorporated checkpoints into our recovery protocol. they
help performance (less work to do on crash recovery), but, in their
full generality, they add complexity. However, it's usually possible to
reduce the complexity by using redo logging. Maintain a trailing
pointer in the W.A.L. (write-ahead log) to mean: "everything before
here has been propagated". that trailing pointer identifies the current
checkpoint state. (a minimal sketch of this idea appears just below.)

Source for a lot of this material: J. H. Saltzer and M. F. Kaashoek,
Principles of Computer System Design: An Introduction, Morgan Kaufmann,
Burlington, MA, 2009. Chapter 9. Available online.
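Here is the checkpoint sketch referred to above: a toy, in-memory redo
log with a trailing checkpoint pointer, written in Python. This is not
from the notes; the class name RedoLog and its methods are invented for
illustration, everything lives in memory (a real system would force log
records and cell storage to disk), and it assumes checkpoints are taken
only at quiescent moments (no transaction in flight).

    # sketch (not from the notes): a redo-only log with a trailing
    # checkpoint pointer. in-memory stand-ins for stable storage;
    # a real system would force these structures to disk.

    class RedoLog:
        def __init__(self):
            self.records = []        # append-only write-ahead log
            self.checkpoint = 0      # everything before this index is installed
            self.cell_storage = {}   # "cell storage": the installed data

        def log_update(self, txid, key, new_value):
            self.records.append(("UPDATE", txid, key, new_value))

        def log_commit(self, txid):
            self.records.append(("COMMIT", txid))

        def install(self, txid):
            # propagate a committed transaction's updates from the log
            # to cell storage (redo logging: only after COMMIT is logged)
            for rec in self.records:
                if rec[0] == "UPDATE" and rec[1] == txid:
                    self.cell_storage[rec[2]] = rec[3]

        def take_checkpoint(self):
            # assumes a quiescent moment: no transaction in flight and all
            # committed transactions installed. then every record before
            # this index has already been propagated to cell storage.
            self.checkpoint = len(self.records)

        def recover(self):
            # redo pass over the log *suffix* only: reinstall the updates
            # of transactions that committed after the checkpoint. records
            # before the checkpoint can be skipped -- the performance win.
            suffix = self.records[self.checkpoint:]
            committed = {rec[1] for rec in suffix if rec[0] == "COMMIT"}
            for rec in suffix:
                if rec[0] == "UPDATE" and rec[1] in committed:
                    self.cell_storage[rec[2]] = rec[3]

The point of take_checkpoint() is exactly the trailing pointer described
above: recover() replays only the suffix of the log, which is why crash
recovery gets cheaper as the checkpoint advances.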
D. Some take-aways from transactions

--compare transactions, and their approach to crash recovery, to the
ad-hoc file system crash recovery that we discussed. transactions are
far more principled and far harder to implement incorrectly.

--another note: in everyday life, go with redo logging! that is, the
refinement we saw wherein the changes associated with a transaction are
propagated from the log to cell storage only after the COMMIT record is
logged.

--undo logging is really a performance optimization, and is not needed
for most things

3. Distributed systems

A distributed system -- a system running across multiple machines -- is
a key application of the network! Lots of issues to consider.....

Note that previously, we had better modularity:
    --bug in user-level program --> process crashes
    --bug in kernel --> all processes crash
    --power outage --> all machines fail

But in a distributed system, one machine can crash while others stay
up. Some machines can be slow. Some can crash and come back up. Lots of
other issues to consider......computers can lose state, reboot, have
partial state. Messages can be reordered, dropped, duplicated, delayed,
etc......How do you build a system out of multiple processors and make
the system *appear* to be tightly coupled (i.e., running in the same
machine) even if it is not?

"A distributed system is one in which the failure of a computer you
didn't even know existed can render your own computer unusable."
        --Leslie Lamport
http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt

A. Motivation for distributed transactions

(i) want to coordinate actions across sites:

    --I write you a check for $100. My bank is Frost, yours is BoA.
        --need to debit my account $100 and credit yours with $100
        --how the heck are we going to ensure that both banks execute
        the transaction or neither does?

    --More complex example:
        --debit account on computer in New York with $1000
        --open cash drawer in San Francisco, give $500
        --credit account in Houston with another $500

    --File systems example:
        --move a file from directory A on server a to directory B on
        server b (better not do one and not the other)

We want the abstraction of a multi-site (or _distributed_) transaction

    --but how the heck are we going to build a transaction if our
    messages are carried over a network that loses them, delays them,
    duplicates them? and given that some computers can fail, reboot,
    etc.?

    [if you step back for a second, this is a fundamentally hard
    problem: need to provide all-or-nothing across machines for a
    complex set of operations.]

    --and actually the situation is even worse.......

B. Two Generals' Problem (an impossibility result)

[DRAW PICTURE: TWO ARMIES SEPARATED BY A VALLEY. RUNNERS GO BETWEEN
THEM. RUNNERS CAN BE KILLED OR DELAYED. IF BOTH ARMIES ATTACK, THEY
WIN. IF ONLY ONE ATTACKS, EVERYONE WHO ATTACKS DIES.]

    ----->   "5:00 PM good?"
    <-----   "yeah, 5:00 PM is good."

[at this point, both parties know that *if* there is an attack, they
will attack at 5:00 PM. but the right-hand general cannot know that the
left-hand general actually got the reply. so they need some more
messages....a lot more.....]

    ----->   "so we're doing this thing, right?"
    <-----   "yeah, totally. but what if you don't get this ack?"

[....in fact an infinite number of messages would be required.]

Impossible to get the two generals to safely attack.

[the 1st general cannot tell the difference between the *request* being
lost and the *reply* being lost. so the 1st general cannot attack
unless it gets an ack. but the 2nd general cannot know that its ack was
received.]
Conclusion: we cannot use messages and retries over an unreliable
network to synchronize two machines so that they are guaranteed to do
the same operation at the same time.

So are we out of business? Yes, if we need to actually solve the Two
Generals' Problem. No, if we are content with a weaker guarantee.

C. Two-phase commit

--Abstraction: distributed transaction, with all-or-nothing atomicity.
Multiple machines agree to do something or not. All sites commit or all
abort. It is unacceptable for some of the sites to commit their part
while other sites abort.

--Assume: every site in the distributed transaction has, on its own,
the ability to implement a local transaction (using the techniques that
we discussed last time)

--Constraint: there is no reliable delivery of messages (TCP attempts
to provide such an abstraction, but it cannot fully, given the Two
Generals' Problem.)

--Approach: use write-ahead logging (of course) plus the unreliable
network:

    [SEE PICTURE FOR DEPICTION OF ALGORITHM]

    (a rough code sketch of the coordinator's side appears at the end
    of this section.)

--Question: where is the commit point? (answer: when the coordinator
logs "COMMIT".)

--What happens if the coordinator crashes before the commit point?
(Depends what the coordinator decides to do when it revives.)

--What happens if messages are lost? (Retransmit them. No problem
here.)

--What happens if B says "No", and the message is dropped? (The
coordinator waits for B's reply. Eventually B retransmits it, or the
coordinator times out. If the coordinator times out, it writes ABORT
locally, and the transaction henceforth will abort. If the coordinator
gets B's retransmission in time, then the coordinator's decision
depends on the usual factors: what the other workers decided, whether
the coordinator decided to go through with it, etc.)

--What happens if the coordinator crashes just after the commit point
and then restarts? (No problem. It retransmits its COMMIT or ABORT.)

--What happens if a "COMMIT" or "ABORT" message is dropped? (The
coordinator obviously doesn't know that the message was dropped.) In
this case:
    --workers will resend their PREPARED messages
    --so the coordinator needs to be able to reply saying what happened
    --conclusion: the coordinator needs to maintain logs indefinitely,
    including across reboots (a disadvantage of this approach)
    --but if acknowledgments go back from workers to the coordinator at
    the end of phase 2, then the coordinator does not have to keep the
    log entries for that transaction forever.
    --(how long do workers have to maintain their logs? depends on the
    local implementation of transactions. but probably they have to
    keep track of a given transaction in the log until the later of
    that transaction's END record and a checkpoint of the log being
    applied to cell storage.)

--Note that the workers can ask around to find out what happened, but
there are limits...we can't avoid the blocking altogether. here's why:
    --let's say that a worker says to the other workers, "Hey, I
    haven't heard from the coordinator in a while. What did you all
    tell the coordinator?"
    --if any worker says to the querying worker, "I told the
    coordinator I couldn't enter the PREPARED state", then the querying
    worker knows that the transaction would have aborted, and it can
    abort.
    --and if any worker says, "I received a COMMIT [or an ABORT]
    message from the coordinator", then the querying worker knows that
    the transaction committed (or aborted).
    --but what if all workers say, "I told the coordinator I was
    PREPARED"?....Unfortunately the querying worker cannot commit on
    this basis. The reason is that the coordinator might have written
    ABORT to its own log (say because of a local error or timeout). In
    that case, the transaction actually aborted! But the querying
    worker doesn't know whether this happened until the coordinator is
    revived.

--NOTE: the coordinator is a single point of failure. If it fails
permanently, we're in serious trouble (the system blocks). We can
address that issue with three-phase commit.
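To make the pattern concrete, here is the rough Python sketch of the
coordinator's side of 2PC promised above. It is not the algorithm from
the picture, just an illustration under simplifying assumptions: the
worker objects, their prepare()/outcome() methods, and the log_append
helper are invented names; retransmission and timeouts are waved away;
and log_append stands in for forcing a record to the coordinator's
write-ahead log.

    # sketch (not from the notes): the coordinator's side of two-phase
    # commit. "workers" is a list of objects with invented
    # prepare()/outcome() methods; log_append stands in for forcing a
    # record to the coordinator's WAL.

    def coordinator_2pc(txid, workers, log_append):
        # phase 1: ask every worker to prepare. a worker that answers
        # True has logged PREPARED locally and can later either commit
        # or abort as instructed.
        votes = [w.prepare(txid) for w in workers]

        # commit point: the decision exists once the coordinator logs it.
        decision = "COMMIT" if all(votes) else "ABORT"
        log_append((decision, txid))

        # phase 2: announce the outcome; keep retrying until each worker
        # acks (in practice, with timeouts/backoff, not a tight loop).
        for w in workers:
            while not w.outcome(txid, decision):
                pass

        # once every worker has acked, the coordinator may garbage-collect
        # this transaction's log records (per the discussion above).
        log_append(("END", txid))
        return decision

Note that the blocking discussed above is on the worker side: a worker
that has logged PREPARED but cannot reach the coordinator must hold its
locks and wait for the decision.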
D. Three-phase commit (non-blocking)

Typically covered in courses on distributed systems. In practice, 2PC
is usually good enough. If you ever need 3PC, look it up.

Paxos: an algorithm for non-blocking consensus.

E. What if failures are malicious, or result in buggy output?

These are called Byzantine failures, and more mechanisms are needed.
Again, a course in distributed systems will cover them. Mention the BFT
line of research here at UT.

F. Wait, didn't the two generals tell us that we couldn't get everyone
to agree?

--the subtlety is the difference between everyone agreeing to take an
action or not (two-phase commit) versus everyone agreeing to take that
action at the same instant (two generals)

--Quoting Saltzer and Kaashoek:

    "The persistent senders of the distributed two-phase commit
    protocol ensure that if the coordinator decides to commit, all of
    the workers will eventually also commit, but there is no assurance
    that they will do so at the same time. If one of the communication
    links goes down for a day, when it comes back up the worker at the
    other end of that link will then receive the notice to commit, but
    this action may occur a day later than the actions of its
    colleagues. Thus the problem solved by distributed two-phase commit
    is slightly relaxed when compared with the dilemma of the two
    generals. That relaxation doesn't help the two generals, but the
    relaxation turns out to be just enough to allow us to devise a
    protocol that ensures correctness.

    "By a similar line of reasoning, there is no way to ensure with
    complete certainty that actions will be taken simultaneously at two
    sites that communicate only via a best-effort network. Distributed
    two-phase commit can thus safely open a cash drawer of an ATM in
    Tokyo, with confidence that a computer in Munich will eventually
    update the balance of that account. But if, for some reason, it is
    necessary to open two cash drawers at different sites at the same
    time, the only solution is either the probabilistic approach
    [sending lots of copies of messages and hoping that one of them
    arrives] or to somehow replace the best-effort network with a
    reliable one.

    "The requirement for reliable communication is why real estate
    transactions and weddings (both of which are examples of two-phase
    commit protocols) usually occur with all of the parties in one
    room." (chapter 9, page 92)

G. Thoughts and advice

--If you're coding and need to do something across multiple machines,
don't make it up.
    --use 2PC (or 3PC)
    --if 2PC, identify the circumstances under which indefinite
    blocking can occur (and decide whether it's an acceptable
    engineering risk)

--RPC is highly useful.... but....
    --RPC arguably provides the wrong abstraction
    --its goal is an impossible one: to make transparent (i.e.,
    invisible) to the layers above it whether a local or remote program
    is running.
    --RPC focuses attention on the "common case" of everything working!
    --Some argue that this is the wrong way to think about distributed
    programs. "Everything works" is the easy case, and RPC encourages
    you to think about only that case.
    --But the important and difficult cases concern partial failures
    (for example, not every message will get a reply).
    --"Exception paths" need to be as carefully considered as the
    "normal case" procedure call/return paths.

Conclusion: RPC may be the wrong abstraction.

--An alternative: a lower-level message-passing abstraction.
    --makes explicit where the messages are. therefore helps the
    program writer avoid making implicit "everything usually works"
    assumptions.
    --may encourage structuring programs to handle failures elegantly
    --example: persistent message queues (a small sketch appears at the
    end of these notes)
        --use 2PC for delivering messages -- guarantees exactly-once
        delivery even across machine failures and long partitions
        --but now on every message (or group of them), you're running
        that lengthy protocol, so each logical message costs many
        network messages. Sometimes you need this, though!

--Conclusion: persistent message queues are probably a better
abstraction than RPC for building reliable distributed systems, but
they are heavier weight.
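As a final illustration, here is a minimal sketch of the persistent
message queue idea, assuming a local append-only file as the durable
store and deduplication by message id. The class name PersistentQueue,
the file format, and the enqueue method are all invented; only the
receiving/enqueue side is shown (no dequeue, and no 2PC handoff between
sender and receiver).

    # sketch (not from the notes): the receiving side of a persistent
    # message queue. each message is forced to an append-only file before
    # it is acked, and duplicates (retransmissions) are filtered by
    # message id, so a crash or a resend never enqueues a message twice.

    import json
    import os

    class PersistentQueue:
        def __init__(self, path):
            self.path = path
            self.seen = set()        # ids of messages already logged
            self.messages = []       # in-memory view, rebuilt on recovery
            if os.path.exists(path):
                with open(path) as f:      # recovery: replay the on-disk log
                    for line in f:
                        rec = json.loads(line)
                        if rec["id"] not in self.seen:
                            self.seen.add(rec["id"])
                            self.messages.append(rec)

        def enqueue(self, msg_id, body):
            # idempotent: a duplicate message is acked but not enqueued twice
            if msg_id in self.seen:
                return "ack"
            rec = {"id": msg_id, "body": body}
            with open(self.path, "a") as f:
                f.write(json.dumps(rec) + "\n")
                f.flush()
                os.fsync(f.fileno())       # force to stable storage before acking
            self.seen.add(msg_id)
            self.messages.append(rec)
            return "ack"

The design choice mirrors the discussion above: durability before
acknowledgment plus idempotent receipt is what lets the sender retry
safely over an unreliable network.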