Class 24
CS 439
11 April 2013

On the board
------------

1. Last time

2. Transactions, continued

3. Distributed systems
    --motivation for distributed transactions
    --impossibility result: two generals' problem
    --two-phase commit (2PC)

---------------------------------------------------------------------------

1. Last time

Transactions, with emphasis on crash recovery. We talked about
propagating from log to stable storage; the idea is that the modified
pieces of the database are in RAM, so what we're really talking about
is the propagation from RAM to the log versus RAM to cell storage.

Reinforce the concept: we said there is an OUTCOME(commit) record. we
said there is an END record. so what does it mean if there is an
OUTCOME(abort) record? and then a subsequent END record? (that any
modifications to stable storage that were part of the transaction have
been unwound)

Also discussed isolation.

2. Transactions: isolation

--easiest approach: one giant lock. only one transaction active at a
time. so everything really is serialized
    --advantage: easy to reason about
    --disadvantage: no concurrency

--next approach: fine-grained locks (e.g., one per cell, or per-table,
or whatever), and acquire all needed locks at begin_transaction and
release all of them at end_transaction
    --advantage: easy to reason about. works great if it could be
    implemented
    --disadvantage: requires transaction to know all of its needed
    locks in advance

--actual approach: two-phase locking. gradually acquire locks as needed
inside the transaction manager (phase 1), and then release all of them
together at the commit point (phase 2)

    --your intuition from the concurrency unit will tell you that this
    creates a problem....namely deadlock. we'll come back to that in a
    second.

    --why does this actually preserve a serial ordering? here's an
    informal argument:

        --consider the lock point, that is, the point in time at which
        the transaction owns all of the locks it will ever acquire.

        --consider any lock that is acquired. call it L. from the point
        that L is acquired to the lock point, the application always
        sees the same values for the data that lock L protects (because
        no other transaction or thread can get the lock).

        --so regard the application as having done all of its reads and
        writes instantly at the lock point

        --the lock points create the needed serialization. Here's why,
        informally. Regard the transactions as having taken place in
        the order given by their lock points. Okay, but how do we know
        that the lock points serialize? Answer: each lock point takes
        place at an instant, and any lock points with intersecting lock
        sets must be serialized with respect to each other as a result
        of the mutual exclusion given by locks.

    --to fix deadlock, several possibilities:
        --one of them is to remove the no-preempt condition: have the
        transaction manager abort transactions (roll them back) after a
        timeout

C. Loose end: checkpoints

--we haven't incorporated checkpoints into our recovery protocol. they
help performance (less work to do on crash recovery), but, in their
full generality, they add complexity. However, it's usually possible to
reduce the complexity by using redo logging. Maintain a trailing
pointer in the W.A.L. (write-ahead log) to mean: "everything before
here has been propagated". that trailing pointer identifies the current
checkpoint state. (a minimal sketch of this idea appears just below.)

Source for a lot of this material: J. H. Saltzer and M. F. Kaashoek,
Principles of Computer System Design: An Introduction, Morgan Kaufmann,
Burlington, MA, 2009. Chapter 9. Available online.
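Here is the checkpoint sketch referred to above: a toy, in-memory redo
log with a trailing checkpoint pointer, written in Python. This is not
from the notes; the class name RedoLog and its methods are invented for
illustration, everything lives in memory (a real system would force log
records and cell storage to disk), and it assumes checkpoints are taken
only at quiescent moments (no transaction in flight).

    # sketch (not from the notes): a redo-only log with a trailing
    # checkpoint pointer. in-memory stand-ins for stable storage;
    # a real system would force these structures to disk.

    class RedoLog:
        def __init__(self):
            self.records = []        # append-only write-ahead log
            self.checkpoint = 0      # everything before this index is installed
            self.cell_storage = {}   # "cell storage": the installed data

        def log_update(self, txid, key, new_value):
            self.records.append(("UPDATE", txid, key, new_value))

        def log_commit(self, txid):
            self.records.append(("COMMIT", txid))

        def install(self, txid):
            # propagate a committed transaction's updates from the log
            # to cell storage (redo logging: only after COMMIT is logged)
            for rec in self.records:
                if rec[0] == "UPDATE" and rec[1] == txid:
                    self.cell_storage[rec[2]] = rec[3]

        def take_checkpoint(self):
            # assumes a quiescent moment: no transaction in flight and all
            # committed transactions installed. then every record before
            # this index has already been propagated to cell storage.
            self.checkpoint = len(self.records)

        def recover(self):
            # redo pass over the log *suffix* only: reinstall the updates
            # of transactions that committed after the checkpoint. records
            # before the checkpoint can be skipped -- the performance win.
            suffix = self.records[self.checkpoint:]
            committed = {rec[1] for rec in suffix if rec[0] == "COMMIT"}
            for rec in suffix:
                if rec[0] == "UPDATE" and rec[1] in committed:
                    self.cell_storage[rec[2]] = rec[3]

The point of take_checkpoint() is exactly the trailing pointer described
above: recover() replays only the suffix of the log, which is why crash
recovery gets cheaper as the checkpoint advances.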
D. Some take-aways from transactions

--compare transactions, and their approach to crash recovery, to the
ad-hoc file system crash recovery that we discussed. transactions are
far more principled and far harder to implement incorrectly.

--another note: in everyday life, go with redo logging! that is, the
refinement we saw wherein the changes associated with a transaction are
propagated from the log to cell storage only after the COMMIT record is
logged.

--undo logging is really a performance optimization, and is not needed
for most things

3. Distributed systems

A distributed system -- a system running across multiple machines -- is
a key application of the network! Lots of issues to consider.....

Note that previously, we had better modularity:
    --bug in user-level program --> process crashes
    --bug in kernel --> all processes crash
    --power outage --> all machines fail

But in a distributed system, one machine can crash while others stay
up. Some machines can be slow. Some can crash and come back up. Lots of
other issues to consider......computers can lose state, reboot, have
partial state. Messages can be reordered, dropped, duplicated, delayed,
etc......How do you build a system out of multiple processors and make
the system *appear* to be tightly coupled (i.e., running in the same
machine) even if it is not?

"A distributed system is one in which the failure of a computer you
didn't even know existed can render your own computer unusable."
        --Leslie Lamport
http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt

A. Motivation for distributed transactions

(i) want to coordinate actions across sites:

    --I write you a check for $100. My bank is Frost, yours is BoA.
        --need to debit my account $100 and credit yours with $100
        --how the heck are we going to ensure that both banks execute
        the transaction or neither does?

    --More complex example:
        --debit account on computer in New York with $1000
        --open cash drawer in San Francisco, give $500
        --credit account in Houston with another $500

    --File systems example:
        --move a file from directory A on server a to directory B on
        server b (better not do one and not the other)

We want the abstraction of a multi-site (or _distributed_) transaction

    --but how the heck are we going to build a transaction if our
    messages are carried over a network that loses them, delays them,
    duplicates them? and given that some computers can fail, reboot,
    etc.?

    [if you step back for a second, this is a fundamentally hard
    problem: need to provide all-or-nothing across machines for a
    complex set of operations.]

    --and actually the situation is even worse.......

B. Two Generals' Problem (an impossibility result)

[DRAW PICTURE: TWO ARMIES SEPARATED BY A VALLEY. RUNNERS GO BETWEEN
THEM. RUNNERS CAN BE KILLED OR DELAYED. IF BOTH ARMIES ATTACK, THEY
WIN. IF ONLY ONE ATTACKS, EVERYONE WHO ATTACKS DIES.]

    ----->   "5:00 PM good?"
    <-----   "yeah, 5:00 PM is good."

[at this point, both parties know that *if* there is an attack, they
will attack at 5:00 PM. but the right-hand general cannot know that the
left-hand general actually got the reply. so they need some more
messages....a lot more.....]

    ----->   "so we're doing this thing, right?"
    <-----   "yeah, totally. but what if you don't get this ack?"

[....in fact an infinite number of messages would be required.]

Impossible to get the two generals to safely attack.

[the 1st general cannot tell the difference between the *request* being
lost and the *reply* being lost. so the 1st general cannot attack
unless it gets an ack. but the 2nd general cannot know that its ack was
received.]
Conclusion: we cannot use messages and retries over an unreliable
network to synchronize two machines so that they are guaranteed to do
the same operation at the same time.

So are we out of business? Yes, if we need to actually solve the Two
Generals' Problem. No, if we are content with a weaker guarantee.

C. Two-phase commit

--Abstraction: distributed transaction, with all-or-nothing atomicity.
Multiple machines agree to do something or not. All sites commit or all
abort. It is unacceptable for some of the sites to commit their part
while other sites abort.

--Assume: every site in the distributed transaction has, on its own,
the ability to implement a local transaction (using the techniques that
we discussed last time)

--Constraint: there is no reliable delivery of messages (TCP attempts
to provide such an abstraction, but it cannot fully, given the Two
Generals' Problem.)

--Approach: use write-ahead logging (of course) plus the unreliable
network:

    [SEE PICTURE FOR DEPICTION OF ALGORITHM]

    (a rough code sketch of the coordinator's side appears at the end
    of this section.)

--Question: where is the commit point? (answer: when the coordinator
logs "COMMIT".)

--What happens if the coordinator crashes before the commit point?
(Depends what the coordinator decides to do when it revives.)

--What happens if messages are lost? (Retransmit them. No problem
here.)

--What happens if B says "No", and the message is dropped? (The
coordinator waits for B's reply. Eventually B retransmits it, or the
coordinator times out. If the coordinator times out, it writes ABORT
locally, and the transaction henceforth will abort. If the coordinator
gets B's retransmission in time, then the coordinator's decision
depends on the usual factors: what the other workers decided, whether
the coordinator decided to go through with it, etc.)

--What happens if the coordinator crashes just after the commit point
and then restarts? (No problem. It retransmits its COMMIT or ABORT.)

--What happens if a "COMMIT" or "ABORT" message is dropped? (The
coordinator obviously doesn't know that the message was dropped.) In
this case:
    --workers will resend their PREPARED messages
    --so the coordinator needs to be able to reply saying what happened
    --conclusion: the coordinator needs to maintain logs indefinitely,
    including across reboots (a disadvantage of this approach)
    --but if acknowledgments go back from workers to the coordinator at
    the end of phase 2, then the coordinator does not have to keep the
    log entries for that transaction forever.
    --(how long do workers have to maintain their logs? depends on the
    local implementation of transactions. but probably they have to
    keep track of a given transaction in the log until the later of
    that transaction's END record and a checkpoint of the log being
    applied to cell storage.)

--Note that the workers can ask around to find out what happened, but
there are limits...we can't avoid the blocking altogether. here's why:
    --let's say that a worker says to the other workers, "Hey, I
    haven't heard from the coordinator in a while. What did you all
    tell the coordinator?"
    --if any worker says to the querying worker, "I told the
    coordinator I couldn't enter the PREPARED state", then the querying
    worker knows that the transaction would have aborted, and it can
    abort.
    --and if any worker says, "I received a COMMIT [or an ABORT]
    message from the coordinator", then the querying worker knows that
    the transaction committed (or aborted).
    --but what if all workers say, "I told the coordinator I was
    PREPARED"?....Unfortunately the querying worker cannot commit on
    this basis. The reason is that the coordinator might have written
    ABORT to its own log (say because of a local error or timeout). In
    that case, the transaction actually aborted! But the querying
    worker doesn't know whether this happened until the coordinator is
    revived.

--NOTE: the coordinator is a single point of failure. If it fails
permanently, we're in serious trouble (the system blocks). We can
address that issue with three-phase commit.
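To make the pattern concrete, here is the rough Python sketch of the
coordinator's side of 2PC promised above. It is not the algorithm from
the picture, just an illustration under simplifying assumptions: the
worker objects, their prepare()/outcome() methods, and the log_append
helper are invented names; retransmission and timeouts are waved away;
and log_append stands in for forcing a record to the coordinator's
write-ahead log.

    # sketch (not from the notes): the coordinator's side of two-phase
    # commit. "workers" is a list of objects with invented
    # prepare()/outcome() methods; log_append stands in for forcing a
    # record to the coordinator's WAL.

    def coordinator_2pc(txid, workers, log_append):
        # phase 1: ask every worker to prepare. a worker that answers
        # True has logged PREPARED locally and can later either commit
        # or abort as instructed.
        votes = [w.prepare(txid) for w in workers]

        # commit point: the decision exists once the coordinator logs it.
        decision = "COMMIT" if all(votes) else "ABORT"
        log_append((decision, txid))

        # phase 2: announce the outcome; keep retrying until each worker
        # acks (in practice, with timeouts/backoff, not a tight loop).
        for w in workers:
            while not w.outcome(txid, decision):
                pass

        # once every worker has acked, the coordinator may garbage-collect
        # this transaction's log records (per the discussion above).
        log_append(("END", txid))
        return decision

Note that the blocking discussed above is on the worker side: a worker
that has logged PREPARED but cannot reach the coordinator must hold its
locks and wait for the decision.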
D. Three-phase commit (non-blocking)

Typically covered in courses on distributed systems. In practice, 2PC
is usually good enough. If you ever need 3PC, look it up.

Paxos: an algorithm for non-blocking consensus.

E. What if failures are malicious, or result in buggy output?

These are called Byzantine failures, and more mechanisms are needed.
Again, a course in distributed systems will cover them. Mention the BFT
line of research here at UT.

F. Wait, didn't the two generals tell us that we couldn't get everyone
to agree?

--the subtlety is the difference between everyone agreeing to take an
action or not (two-phase commit) versus everyone agreeing to take that
action at the same instant (two generals)

--Quoting Saltzer and Kaashoek:

    "The persistent senders of the distributed two-phase commit
    protocol ensure that if the coordinator decides to commit, all of
    the workers will eventually also commit, but there is no assurance
    that they will do so at the same time. If one of the communication
    links goes down for a day, when it comes back up the worker at the
    other end of that link will then receive the notice to commit, but
    this action may occur a day later than the actions of its
    colleagues. Thus the problem solved by distributed two-phase commit
    is slightly relaxed when compared with the dilemma of the two
    generals. That relaxation doesn't help the two generals, but the
    relaxation turns out to be just enough to allow us to devise a
    protocol that ensures correctness.

    "By a similar line of reasoning, there is no way to ensure with
    complete certainty that actions will be taken simultaneously at two
    sites that communicate only via a best-effort network. Distributed
    two-phase commit can thus safely open a cash drawer of an ATM in
    Tokyo, with confidence that a computer in Munich will eventually
    update the balance of that account. But if, for some reason, it is
    necessary to open two cash drawers at different sites at the same
    time, the only solution is either the probabilistic approach
    [sending lots of copies of messages and hoping that one of them
    arrives] or to somehow replace the best-effort network with a
    reliable one.

    "The requirement for reliable communication is why real estate
    transactions and weddings (both of which are examples of two-phase
    commit protocols) usually occur with all of the parties in one
    room." (chapter 9, page 92)

G. Thoughts and advice

--If you're coding and need to do something across multiple machines,
don't make it up.
    --use 2PC (or 3PC)
    --if 2PC, identify the circumstances under which indefinite
    blocking can occur (and decide whether it's an acceptable
    engineering risk)

--RPC is highly useful.... but....
    --RPC arguably provides the wrong abstraction
    --its goal is an impossible one: to make transparent (i.e.,
    invisible) to the layers above it whether a local or remote program
    is running.
    --RPC focuses attention on the "common case" of everything working!
    --Some argue that this is the wrong way to think about distributed
    programs. "Everything works" is the easy case, and RPC encourages
    you to think about only that case.
    --But the important and difficult cases concern partial failures
    (for example, not every message will get a reply).
    --"Exception paths" need to be as carefully considered as the
    "normal case" procedure call/return paths.

Conclusion: RPC may be the wrong abstraction.

--An alternative: a lower-level message-passing abstraction.
    --makes explicit where the messages are. therefore helps the
    program writer avoid making implicit "everything usually works"
    assumptions.
    --may encourage structuring programs to handle failures elegantly
    --example: persistent message queues (a small sketch appears at the
    end of these notes)
        --use 2PC for delivering messages -- guarantees exactly-once
        delivery even across machine failures and long partitions
        --but now on every message (or group of them), you're running
        that lengthy protocol, so each logical message costs many
        network messages. Sometimes you need this, though!

--Conclusion: persistent message queues are probably a better
abstraction than RPC for building reliable distributed systems, but
they are heavier weight.
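As a final illustration, here is a minimal sketch of the persistent
message queue idea, assuming a local append-only file as the durable
store and deduplication by message id. The class name PersistentQueue,
the file format, and the enqueue method are all invented; only the
receiving/enqueue side is shown (no dequeue, and no 2PC handoff between
sender and receiver).

    # sketch (not from the notes): the receiving side of a persistent
    # message queue. each message is forced to an append-only file before
    # it is acked, and duplicates (retransmissions) are filtered by
    # message id, so a crash or a resend never enqueues a message twice.

    import json
    import os

    class PersistentQueue:
        def __init__(self, path):
            self.path = path
            self.seen = set()        # ids of messages already logged
            self.messages = []       # in-memory view, rebuilt on recovery
            if os.path.exists(path):
                with open(path) as f:      # recovery: replay the on-disk log
                    for line in f:
                        rec = json.loads(line)
                        if rec["id"] not in self.seen:
                            self.seen.add(rec["id"])
                            self.messages.append(rec)

        def enqueue(self, msg_id, body):
            # idempotent: a duplicate message is acked but not enqueued twice
            if msg_id in self.seen:
                return "ack"
            rec = {"id": msg_id, "body": body}
            with open(self.path, "a") as f:
                f.write(json.dumps(rec) + "\n")
                f.flush()
                os.fsync(f.fileno())       # force to stable storage before acking
            self.seen.add(msg_id)
            self.messages.append(rec)
            return "ack"

The design choice mirrors the discussion above: durability before
acknowledgment plus idempotent receipt is what lets the sender retry
safely over an unreliable network.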