Class 24
CS 372H 19 April 2011

On the board
------------
1. Last time
2. Networking, continued
3. Using networks to build distributed systems
    --Motivation for distributed transactions
    --Impossibility result: two generals' problem
    --Two-phase commit (2PC)

---------------------------------------------------------------------------

2. Networks, continued

    J. Application layer

        Example: HTTP

        Normally, HTTP servers, otherwise known as Web servers, run on
        port 80. When your Web browser connects to a URL, it knows to
        make requests on port 80 by default, meaning it stamps "80" in
        its packets. You can direct your Web browser to make requests on
        any port, though, like this:

            http://host_name:port_num

        In that case, the browser will address its packets to the IP
        address that corresponds to the name of the machine, with
        destination port port_num instead of destination port 80.

        Messages look like this:

            Browser --> Server: "GET /pics/dog.jpg HTTP/1.0\r\n"
            Server --> Browser: "HTTP/1.0 404 Not found\r\n"
                or
                "HTTP/1.0 200 OK\r\n
                 header1: value1\r\n
                 header2: value2\r\n
                 \r\n
                 [the bytes in dog.jpg]"

        [Keep in mind that the above is happening inside TCP, and that
        TCP is presenting a reliable byte stream to the layers above
        it.]

        QUESTION: where does NFS sit in this picture? [answer: runs over
        UDP or TCP on some port, either well-known, or determined with a
        port mapping service running on the server]

    K. What is the interface to the networking stack?

        --Application programmer classically sees *sockets*. Inspired by
          pipes:

            int pipe(int fds[2]);

            --Allows inter-process communication on one machine
            --Writes to fds[1] will be read on fds[0]
            --Can give each file descriptor to a different process (with
              fork); see the first sketch at the end of this section

        The idea is: let's do the same thing across machines:

            **SOCKETS**

        Write data on one machine, read it on another.

        *sockets* can represent many different network protocols, but:

            --classically an interface to TCP/IP and UDP
            --sometimes an interface to IP or Ethernet (raw sockets)

        --sockets API:

            /* senders and receivers */
            int sockfd = socket(AF_INET, SOCK_STREAM /* or SOCK_DGRAM */, 0);

            [note: with AF_INET in the first position, the choice of
            SOCK_STREAM vs SOCK_DGRAM controls whether the app's data is
            going to go over TCP or UDP]

            [with UDP sockets, apps send atomic messages that may be
            reordered or lost]

            [with TCP sockets, bytes written on one end are read on the
            other, provided no failures. but there is no guarantee that
            reads will return the full amount requested ... or that the
            data will be packetized according to the number of times the
            sender called send(). With TCP, you *must* sit there in a
            loop and keep reading. You know you're done because either
            (a) the application-level protocol is expected to understand
            where message boundaries begin and end, or (b) the sending
            machine closed its connection.]

            int rc = close(sockfd);
            select();   /* wait for activity on any of several sockets */

            struct sockaddr_in {
                short    sin_family;   /* e.g., AF_INET */
                uint16_t sin_port;
                uint32_t sin_addr;
                char     sin_zero[8];
            };

            /* senders */
            int rc = connect(sockfd, &addr, addrlen);
            int rc = send(sockfd, buf, len, 0);
            int rc = sendto(sockfd, buf, len, 0, &addr, addrlen);

            /* receivers */
            int rc = bind(sockfd, &addr, addrlen);
            int rc = listen(sockfd, backlog_len);
            int rc = accept(sockfd, &addr, &addrlen);
            int rc = recv(sockfd, buf, len, 0);
            int rc = recvfrom(sockfd, buf, len, 0, &addr, &addrlen);

        NOTES:

            * connections are named by 5 components: protocol (TCP),
              local IP address, local port, remote IP address, remote
              port
            * UDP does not require connected sockets
            * OS tracks all of this state in a PCB (protocol control
              block)
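        Two sketches to make section K concrete. First, the pipe()+fork()
        pattern mentioned at the top of this section: a minimal,
        illustrative program (not from the original notes) in which the
        child writes on fds[1] and the parent reads on fds[0].

            /* pipe_demo.c -- minimal sketch of pipe() + fork() IPC */
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            #include <unistd.h>
            #include <sys/wait.h>

            int
            main(void)
            {
                int fds[2];
                if (pipe(fds) < 0) {
                    perror("pipe");
                    exit(1);
                }

                pid_t pid = fork();
                if (pid == 0) {            /* child: writes on fds[1] */
                    close(fds[0]);
                    const char *msg = "hello from the child";
                    write(fds[1], msg, strlen(msg));
                    close(fds[1]);
                    exit(0);
                }

                /* parent: reads on fds[0] */
                close(fds[1]);
                char buf[128];
                ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
                if (n > 0) {
                    buf[n] = '\0';
                    printf("parent read: %s\n", buf);
                }
                close(fds[0]);
                waitpid(pid, NULL, 0);
                return 0;
            }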
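        Second, a sketch that ties section J to section K: a client that
        opens a TCP socket to a Web server, sends a GET, and then obeys
        the rule above -- sit in a loop and keep reading until the peer
        closes. The hard-coded IP address is a placeholder; a real
        client would resolve the name first (e.g., with getaddrinfo()).

            /* http_get.c -- sketch: TCP socket + "keep reading in a loop" */
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>
            #include <arpa/inet.h>
            #include <sys/socket.h>

            int
            main(void)
            {
                int sockfd = socket(AF_INET, SOCK_STREAM, 0);    /* TCP */

                struct sockaddr_in addr;
                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                addr.sin_port = htons(80);             /* well-known HTTP port */
                inet_pton(AF_INET, "93.184.216.34", &addr.sin_addr); /* placeholder */

                if (connect(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                    perror("connect");
                    return 1;
                }

                const char *req = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
                send(sockfd, req, strlen(req), 0);

                /* TCP delivers a byte stream, not messages: one recv() may
                 * return any prefix of the reply, so loop until the server
                 * closes the connection (recv() returns 0). */
                char buf[4096];
                ssize_t n;
                while ((n = recv(sockfd, buf, sizeof(buf), 0)) > 0)
                    fwrite(buf, 1, (size_t)n, stdout);

                close(sockfd);
                return 0;
            }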
        --What does the kernel see, and what interfaces does it invoke?

            TX direction:

                --usually gets payloads from higher levels and
                  implements TCP/IP, UDP, IP, and part of Ethernet
                --usually hands most of an Ethernet frame to the network
                  device
                --but not always: could imagine a Web server implemented
                  entirely in the kernel, or even a Web server
                  implemented on a network card
                --(in JOS, the entire networking stack is implemented in
                  user space. that is the function of the lwIP library.)

            RX direction:

                --when a packet arrives, use the 5-tuple (above) to find
                  the PCB and figure out what to do with the packet

            Note that to avoid lots of copies, the OS may not actually
            store packets contiguously. It may store a linked list of
            buffers, where each buffer is either a packet header or a
            payload.

        Network interface cards (NICs)

            --Used to be dumb
            --Now sometimes do lots of stuff
            --You are getting a network interface card working in lab 6

        Kernels also do *routing*

            --A machine has multiple NICs connected to different
              networks. The kernel gets a packet (either from one of the
              NICs or from an application); now which NIC does it go out
              on?
            --the kernel generally looks at the destination address of
              the packet and does a lookup in a table that it maintains:

                [IP address, prefix-length] --> next-hop

              where next-hop is the physical interface to send the
              packet out on. This is the same routing function that
              Internet routers do. There are data structures to make it
              efficient in time and space (radix trees are a decent
              first cut); see the sketch below.
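            A sketch of the table lookup just described (illustrative
            only -- a kernel would use a radix tree or similar, not a
            linear scan, but the matching rule is the same): pick the
            matching entry with the longest prefix.

                /* route_lookup.c -- sketch of longest-prefix match */
                #include <stdint.h>
                #include <stdio.h>

                struct route {
                    uint32_t prefix;      /* network address, host byte order */
                    int      prefix_len;  /* leading bits that must match */
                    int      out_ifindex; /* "next hop": which NIC to use */
                };

                /* mask with the top prefix_len bits set */
                static uint32_t
                mask(int prefix_len)
                {
                    return prefix_len == 0 ? 0 : ~0u << (32 - prefix_len);
                }

                /* out-interface of the longest matching prefix, or -1 */
                int
                route_lookup(const struct route *tbl, int n, uint32_t dst)
                {
                    int best = -1, best_len = -1;
                    for (int i = 0; i < n; i++) {
                        if ((dst & mask(tbl[i].prefix_len)) == tbl[i].prefix &&
                            tbl[i].prefix_len > best_len) {
                            best = tbl[i].out_ifindex;
                            best_len = tbl[i].prefix_len;
                        }
                    }
                    return best;
                }

                int
                main(void)
                {
                    struct route tbl[] = {
                        { 0x0A000000, 8,  1 },  /* 10.0.0.0/8    -> NIC 1 */
                        { 0x0A010000, 16, 2 },  /* 10.1.0.0/16   -> NIC 2 */
                        { 0x00000000, 0,  0 },  /* default route -> NIC 0 */
                    };
                    /* 10.1.2.3 matches 10/8 and 10.1/16; the /16 wins */
                    printf("out interface: %d\n",
                           route_lookup(tbl, 3, 0x0A010203));
                    return 0;
                }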
---------------------------------------------------------------------------

Admin notes

    --guest lecture this Thursday
    --reading posted
    --paper assigned for today: what's the main point?
        [one of the points is: don't believe everything you read!]

---------------------------------------------------------------------------

3. Using networks to build distributed systems

    A distributed system -- a system running across multiple machines --
    is a key application of the network!

    Lots of issues to consider.....

    Note that previously, we had better modularity:
        --bug in user-level program --> process crashes
        --bug in kernel --> all processes crash
        --power outage --> all machines fail

    But in a distributed system, one machine can crash while others stay
    up. Some machines can be slow. Some can crash and come back up.

    Lots of other issues to consider......computers can lose state,
    reboot, have partial state. Messages can be reordered, dropped,
    duplicated, delayed, etc......How do you build a system out of
    multiple processors and make the system *appear* to be tightly
    coupled (i.e., running on the same machine) even if it is not?

    "A distributed system is one in which the failure of a computer you
    didn't even know existed can render your own computer unusable."
        --Leslie Lamport
        http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt

    A. Motivation for distributed transactions

        (i) want to coordinate actions across sites:

            --I write you a check for $100. My bank is Frost, yours is
              BoA.
                --need to debit my account $100 and credit yours with
                  $100
                --how the heck are we going to ensure that both banks
                  execute the transaction, or neither does?

            --More complex example:
                --debit account on computer in New York with $1000
                --open cash drawer in San Francisco, give $500
                --credit account in Houston with another $500

            --File systems example:
                --move a file from directory A on server a to directory
                  B on server b (better not do one and not the other)

        We want the abstraction of a multi-site (or _distributed_)
        transaction
            --but how the heck are we going to build a transaction if
              our messages are carried over a network that loses them,
              delays them, duplicates them? and given that some
              computers can fail, reboot, etc.?
            --and actually the situation is even worse.......

    B. Two Generals' Problem (an impossibility result)

        [DRAW PICTURE: TWO ARMIES SEPARATED BY A VALLEY. RUNNERS GO
        BETWEEN THEM. RUNNERS CAN BE KILLED OR DELAYED. IF BOTH ARMIES
        ATTACK, THEY WIN. IF ONLY ONE ATTACKS, EVERYONE WHO ATTACKS
        DIES.]

            ----->  "5:00 PM good?"
            <-----  "yeah, 5:00 PM is good."

        [at this point, both parties know that *if* there is an attack,
        they will attack at 5:00 PM. but the right-hand general cannot
        know that the left-hand general actually got the reply. so they
        need some more messages....a lot more.....]

            ----->  "so we're doing this thing, right?"
            <-----  "yeah, totally. but what if you don't get this ack?"

        [....in fact an infinite number of messages would be required.]

        Impossible to get the two generals to safely attack.

        [the 1st general cannot tell the difference between the request
        being lost and the *reply* being lost. so the 1st general cannot
        attack unless it gets an ack. but the 2nd general cannot know
        that its ack was received.]

        Conclusion: cannot use messages and retries over an unreliable
        network to synchronize two machines so that they are guaranteed
        to do the same operation at the same time.

        So are we out of business?

        Yes, if we need to actually solve the Two Generals' Problem.
        No, if we are content with a weaker guarantee.

    C. Two-phase commit

        --Abstraction: distributed transaction, with all-or-nothing
          atomicity. Multiple machines agree to do something or not.
          All sites commit or all abort. It is unacceptable for some of
          the sites to commit their part while other sites abort.

        --Assume: every site in the distributed transaction has, on its
          own, the ability to implement a local transaction (using the
          techniques that we discussed several classes ago)

        --Constraint: there is no reliable delivery of messages (TCP
          attempts to provide such an abstraction, but it cannot fully,
          given the Two Generals' Problem)

        --Approach: use write-ahead logging (of course) plus the
          unreliable network:

            [SEE PICTURE FOR DEPICTION OF ALGORITHM]
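        Since the picture isn't reproduced here, below is a rough sketch
        of the message flow as a single-process simulation. The real
        protocol runs across machines, retransmits every arrow, and
        forces each log record to disk before acting on it; the names
        (PREPARE, PREPARED, COMMIT, ABORT) follow the notes, but the
        code itself is just an illustration, not a reference
        implementation.

            /* twopc_sketch.c -- illustration of the 2PC message flow */
            #include <stdbool.h>
            #include <stdio.h>

            #define NWORKERS 3

            /* phase 1: worker forces a PREPARED (or ABORT) record to its
             * local log, then votes. worker 2 arbitrarily votes no here,
             * to show the abort path. */
            static bool
            worker_prepare(int id)
            {
                bool ok = (id != 2);
                printf("worker %d: log %s\n", id, ok ? "PREPARED" : "ABORT");
                return ok;   /* vote sent back to the coordinator */
            }

            /* phase 2: worker applies the decision, then acks */
            static void
            worker_finish(int id, bool commit)
            {
                printf("worker %d: log %s, ack coordinator\n",
                       id, commit ? "COMMIT" : "ABORT");
            }

            int
            main(void)
            {
                /* phase 1: coordinator sends PREPARE, collects votes */
                bool all_prepared = true;
                for (int i = 0; i < NWORKERS; i++)
                    if (!worker_prepare(i))
                        all_prepared = false;

                /* commit point: coordinator forces COMMIT (or ABORT) to
                 * its own log. after this write the outcome is decided,
                 * no matter who crashes. */
                bool commit = all_prepared;
                printf("coordinator: log %s   <-- commit point\n",
                       commit ? "COMMIT" : "ABORT");

                /* phase 2: tell every worker the outcome (retransmitting
                 * until each worker acknowledges) */
                for (int i = 0; i < NWORKERS; i++)
                    worker_finish(i, commit);

                return 0;
            }

        With worker 2 voting no, the run ends in a global abort; flip
        its vote to see the commit path.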
        --Question: where is the commit point? (answer: when the
          coordinator logs "COMMIT")

        --What happens if the coordinator crashes before the commit
          point? (Depends on what the coordinator decides to do when it
          revives.)

        --What happens if messages are lost? (Retransmit them. No
          problem here.)

        --What happens if B says "No", and the message is dropped? (The
          coordinator waits for B's reply. Eventually B retransmits it,
          or the coordinator times out. If the coordinator times out, it
          writes ABORT locally, and the transaction henceforth will
          abort. If the coordinator gets B's retransmission in time,
          then the coordinator's decision depends on the usual factors:
          what the other workers decided, whether the coordinator
          decided to go through with it, etc.)

        --What happens if the coordinator crashes just after the commit
          point and then restarts? (No problem. It retransmits its
          COMMIT or ABORT.)

        --What happens if a "COMMIT" or "ABORT" message is dropped? (The
          coordinator obviously doesn't know that the message was
          dropped.) In this case.....

            --workers will resend their PREPARED messages
            --so the coordinator needs to be able to reply saying what
              happened
            --conclusion: the coordinator needs to maintain its log
              indefinitely, including across reboots (a disadvantage of
              this approach)
            --but if acknowledgments go back from workers to the
              coordinator at the end of phase 2, then the coordinator
              does not have to keep the log entry for that transaction
              forever
            --(how long do workers have to maintain their logs? depends
              on the local implementation of transactions. but probably
              they have to keep track of a given transaction in the log
              until a time equal to the later of that transaction's END
              record and a checkpoint of the log being applied to cell
              storage.)

        --Note that the workers can ask around to find out what
          happened, but there are limits...we can't avoid the blocking
          altogether. Here's why (a sketch of this querying logic
          appears at the end of this section):

            --let's say that a worker says to the other workers, "Hey, I
              haven't heard from the coordinator in a while. what did
              you all tell the coordinator?"
            --If any worker says to the querying worker, "I told the
              coordinator I couldn't enter the PREPARED state", then the
              querying worker knows that the transaction would have
              aborted, and it can abort.
            --But what if all workers say, "I told the coordinator I was
              PREPARED"?....Unfortunately the querying worker cannot
              commit on this basis. The reason is that the coordinator
              might have written ABORT to its own log (say, because of a
              local error or timeout). In that case, the transaction
              actually aborted! But the querying worker doesn't know
              whether this happened until the coordinator is revived.

        --NOTE: the coordinator is a single point of failure. If it
          fails permanently, we're in serious trouble (the system
          blocks). Can address that issue with three-phase commit.
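        The asking-around logic boils down to a three-way outcome: abort
        if any peer voted no, otherwise block. A sketch (the enum and
        function names are invented for illustration):

            /* peer_query.c -- a stalled worker polling its peers */
            #include <stdio.h>

            typedef enum { VOTED_NO, VOTED_PREPARED } vote_t;
            typedef enum { DECIDE_ABORT, DECIDE_COMMIT,
                           DECIDE_BLOCK } decision_t;

            /* what can a worker conclude from peers' phase-1 votes alone? */
            decision_t
            query_peers(const vote_t *peer_votes, int npeers)
            {
                for (int i = 0; i < npeers; i++)
                    if (peer_votes[i] == VOTED_NO)
                        return DECIDE_ABORT; /* coordinator can't have committed */

                /* every peer voted PREPARED -- but the coordinator may
                 * still have logged ABORT (local error, timeout), so
                 * committing here would be unsafe. block until the
                 * coordinator revives. */
                return DECIDE_BLOCK;
            }

            int
            main(void)
            {
                vote_t votes[] = { VOTED_PREPARED, VOTED_PREPARED };
                printf("decision: %d\n", query_peers(votes, 2));
                /* prints 2 == DECIDE_BLOCK */
                return 0;
            }

        DECIDE_COMMIT is listed only to emphasize that this function can
        never return it; that asymmetry is exactly why 2PC blocks.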
    D. Three-phase commit (non-blocking)

        Typically covered in courses on distributed systems. In
        practice, 2PC is usually good enough. If you ever need 3PC, look
        it up.

    E. Wait, didn't the two generals tell us that we couldn't get
       everyone to agree?

        --the subtlety is the difference between everyone agreeing to
          take an action or not (two-phase commit) versus everyone
          agreeing to take that action at the same precise instant (two
          generals)

        --Quoting Saltzer and Kaashoek:

            "The persistent senders of the distributed two-phase commit
            protocol ensure that if the coordinator decides to commit,
            all of the workers will eventually also commit, but there is
            no assurance that they will do so at the same time. If one
            of the communication links goes down for a day, when it
            comes back up the worker at the other end of that link will
            then receive the notice to commit, but this action may occur
            a day later than the actions of its colleagues. Thus the
            problem solved by distributed two-phase commit is slightly
            relaxed when compared with the dilemma of the two generals.
            That relaxation doesn't help the two generals, but the
            relaxation turns out to be just enough to allow us to devise
            a protocol that ensures correctness.

            "By a similar line of reasoning, there is no way to ensure
            with complete certainty that actions will be taken
            simultaneously at two sites that communicate only via a
            best-effort network. Distributed two-phase commit can thus
            safely open a cash drawer of an ATM in Tokyo, with
            confidence that a computer in Munich will eventually update
            the balance of that account. But if, for some reason, it is
            necessary to open two cash drawers at different sites at the
            same time, the only solution is either the probabilistic
            approach [sending lots of copies of messages and hoping that
            one of them arrives] or to somehow replace the best-effort
            network with a reliable one. The requirement for reliable
            communication is why real estate transactions and weddings
            (both of which are examples of two-phase commit protocols)
            usually occur with all of the parties in one room."

            (chapter 9, page 92)

    F. Thoughts and advice

        --If you're coding and need to do something across multiple
          machines, don't make it up:
            --use 2PC (or 3PC)
            --if 2PC, identify the circumstances under which indefinite
              blocking can occur (and decide whether that's an
              acceptable engineering risk)

        --RPC is highly useful.... but....

            --RPC arguably provides the wrong abstraction
                --its goal is an impossible one: to make transparent
                  (i.e., invisible) to the layers above it whether a
                  local or remote program is running
                --RPC focuses attention on the "common case" of
                  everything working!
                --Some argue that this is the wrong way to think about
                  distributed programs. "Everything works" is the easy
                  case, and RPC encourages you to think only about that
                  case.
                --But the important and difficult cases concern partial
                  failures (for example, not every message will get a
                  reply).
                --"Exception paths" need to be as carefully considered
                  as the "normal case" procedure call/return paths.

            Conclusion: RPC may be the wrong abstraction.

        --An alternative: a lower-level message passing abstraction
            --makes explicit where the messages are. therefore helps the
              program writer avoid making implicit "everything usually
              works" assumptions
            --may encourage structuring programs to handle failures
              elegantly
            --example: persistent message queues
                --use 2PC for delivering messages -- guarantees
                  exactly-once delivery even across machine failures and
                  long partitions
                --but now on every message (or group of them), you're
                  running that lengthy protocol. So each logical message
                  costs many network messages. Sometimes you need this,
                  though!

        --Conclusion: persistent message queues are probably a better
          abstraction than RPC for building reliable distributed
          systems, but they are heavier weight.
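        To make "persistent" concrete, here is a toy sketch of the
        sender's half of such a queue (the function name and file format
        are invented for illustration; a real system would layer 2PC
        between sender and receiver on top of this to get exactly-once
        delivery, and error handling is mostly elided):

            /* pmq_sketch.c -- toy durable enqueue for a message queue */
            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            /* durably append one length-prefixed message to the queue's
             * log file; the message must hit the disk before we tell
             * the caller it is "sent" */
            int
            pmq_send(const char *qfile, const void *msg, int len)
            {
                int fd = open(qfile, O_WRONLY | O_CREAT | O_APPEND, 0644);
                if (fd < 0)
                    return -1;
                write(fd, &len, sizeof(len));
                write(fd, msg, len);
                if (fsync(fd) < 0) {
                    close(fd);
                    return -1;
                }
                return close(fd);
            }

            int
            main(void)
            {
                const char *msg = "debit account 42 by $100";
                if (pmq_send("queue.log", msg, (int)strlen(msg)) == 0)
                    printf("message durably enqueued\n");
                return 0;
            }

        The point of the sketch: the message exists on disk, explicitly,
        before anyone claims it was sent -- which is exactly the
        "makes explicit where the messages are" property above.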