Class 26
CS 372H
27 April 2010

On the board
------------

1. Two-phase commit

2. Virtual machines

---------------------------------------------------------------------------

1. Finish two-phase commit

    A. Motivation: want distributed transaction

    B. Impossibility result: Two Generals' Problem

    C. Two-phase commit

        --Abstraction: distributed transaction, with all-or-nothing
        atomicity. Multiple machines agree to do something or not. All
        sites commit or all abort. It is unacceptable for some of the
        sites to commit their part while other sites abort.

        --Assume: every site in the distributed transaction has, on its
        own, the ability to implement a local transaction (using the
        techniques that we discussed several classes ago)

        --Constraint: there is no reliable delivery of messages (TCP
        attempts to provide such an abstraction, but it cannot fully,
        given the Two Generals' Problem.)

        --Approach: use write-ahead logging (of course) plus the
        unreliable network:

            [SEE PICTURE FOR DEPICTION OF ALGORITHM]

        --Question: where is the commit point? (Answer: when the
        coordinator logs "COMMIT".)

        --What happens if the coordinator crashes before the commit
        point? (When the coordinator revives, it finds no COMMIT record
        in its log, so the transaction aborts.)

        --What happens if messages are lost? (Retransmit them. No
        problem here.)

        --What happens if B says "No", and the message is dropped?
        (The coordinator waits for B's reply. Eventually B retransmits
        it or the coordinator times out. If the coordinator times out,
        it writes ABORT locally, and the transaction henceforth will
        abort. If the coordinator gets B's retransmission in time, then
        the coordinator's decision depends on the usual factors: what
        the other workers decided, whether the coordinator decided to
        go through with it, etc.)

        --What happens if the coordinator crashes just after the commit
        point? (No problem. It retransmits its COMMIT or ABORT.)

        --What happens if the "COMMIT" or "ABORT" message is dropped?
        (The coordinator obviously doesn't know that the message was
        dropped.) In this case.....
            --workers will resend their PREPARED messages
            --so the coordinator needs to be able to reply saying what
            happened
            --conclusion: the coordinator needs to maintain its logs
            indefinitely, including across reboots (a disadvantage of
            this approach)

        --(How long do workers have to maintain their logs? Depends on
        the local implementation of transactions, but probably they
        have to keep track of a given transaction in the log until the
        later of: writing that transaction's END record, and applying a
        checkpoint of the log to cell storage.)

        --Note that the workers can ask around to find out what
        happened, but there are limits...we can't avoid the blocking
        altogether. Here's why:
            --Let's say that a worker says to the other workers, "Hey,
            I haven't heard from the coordinator in a while. What did
            you all tell the coordinator?"
            --If any worker says to the querying worker, "I told the
            coordinator I couldn't enter the PREPARED state", then the
            querying worker knows that the transaction would have
            aborted, and it can abort.
            --But what if all workers say, "I told the coordinator I
            was PREPARED"?....Unfortunately the querying worker cannot
            commit on this basis. The reason is that the coordinator
            might have written ABORT to its own log (say, because of a
            local error or timeout). In that case, the transaction
            actually aborted! But the querying worker doesn't know
            whether this happened until the coordinator is revived.

        --NOTE: the coordinator is a single point of failure. If it
        fails permanently, we're in serious trouble. Can address that
        issue with three-phase commit.
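        --To make the message/log ordering concrete, here is a minimal
        sketch (in C) of the coordinator's side only. The helper
        functions (send_prepare, await_vote, log_outcome, send_outcome)
        are hypothetical stubs, not any real API; a real coordinator
        would also retransmit lost messages, wait for acknowledgments
        in phase 2, and replay its log after a crash.

        /* A sketch of the coordinator's half of two-phase commit.  The
         * four helpers below are hypothetical stubs standing in for
         * real messaging and write-ahead-logging code. */

        #include <stdio.h>

        enum outcome { TXN_ABORT, TXN_COMMIT };
        enum vote    { VOTE_NO, VOTE_PREPARED, VOTE_TIMEOUT };

        /* --- hypothetical stubs: replace with real network/log code --- */
        static void send_prepare(int w)
        {
            printf("PREPARE -> worker %d\n", w);    /* "can you commit?" */
        }

        static enum vote await_vote(int w)
        {
            (void)w;
            return VOTE_PREPARED;    /* pretend the worker logged PREPARED
                                        and voted yes; could also be
                                        VOTE_NO or VOTE_TIMEOUT */
        }

        static void log_outcome(enum outcome o)
        {
            printf("coordinator logs %s\n",
                   o == TXN_COMMIT ? "COMMIT" : "ABORT");
        }

        static void send_outcome(int w, enum outcome o)
        {
            printf("%s -> worker %d\n",
                   o == TXN_COMMIT ? "COMMIT" : "ABORT", w);
        }

        static enum outcome coordinate(const int *workers, int n)
        {
            enum outcome decision = TXN_COMMIT;

            /* Phase 1: ask every worker to prepare; any "no" vote or
             * timeout forces an abort. */
            for (int i = 0; i < n; i++) {
                send_prepare(workers[i]);
                if (await_vote(workers[i]) != VOTE_PREPARED)
                    decision = TXN_ABORT;
            }

            /* Commit point: once the outcome reaches the coordinator's
             * log, the decision is final, no matter what crashes later. */
            log_outcome(decision);

            /* Phase 2: announce the outcome.  The coordinator keeps the
             * logged outcome around so it can answer workers that
             * retransmit PREPARED later. */
            for (int i = 0; i < n; i++)
                send_outcome(workers[i], decision);

            return decision;
        }

        int main(void)
        {
            int workers[] = { 1, 2, 3 };
            return coordinate(workers, 3) == TXN_COMMIT ? 0 : 1;
        }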
    D. Three-phase commit (non-blocking)

        Typically covered in courses on distributed systems.

        In practice, 2PC is usually good enough. If you ever need 3PC,
        look it up.

    E. Wait, didn't the two generals tell us that we couldn't get
    everyone to agree?

        --The subtlety is the difference between everyone agreeing to
        take an action or not (two-phase commit) versus everyone
        agreeing to take that action at the same precise instant (two
        generals)

        --Quoting Saltzer and Kaashoek (chapter 9, page 92):

            "The persistent senders of the distributed two-phase commit
            protocol ensure that if the coordinator decides to commit,
            all of the workers will eventually also commit, but there
            is no assurance that they will do so at the same time. If
            one of the communication links goes down for a day, when it
            comes back up the worker at the other end of that link will
            then receive the notice to commit, but this action may
            occur a day later than the actions of its colleagues. Thus
            the problem solved by distributed two-phase commit is
            slightly relaxed when compared with the dilemma of the two
            generals. That relaxation doesn't help the two generals,
            but the relaxation turns out to be just enough to allow us
            to devise a protocol that ensures correctness."

            "By a similar line of reasoning, there is no way to ensure
            with complete certainty that actions will be taken
            simultaneously at two sites that communicate only via a
            best-effort network. Distributed two-phase commit can thus
            safely open a cash drawer of an ATM in Tokyo, with
            confidence that a computer in Munich will eventually update
            the balance of that account. But if, for some reason, it is
            necessary to open two cash drawers at different sites at
            the same time, the only solution is either the
            probabilistic approach [sending lots of copies of messages
            and hoping that one of them arrives] or to somehow replace
            the best-effort network with a reliable one. The
            requirement for reliable communication is why real estate
            transactions and weddings (both of which are examples of
            two-phase commit protocols) usually occur with all of the
            parties in one room."

    F. Thoughts and advice

        --If you're coding and need to do something across multiple
        machines, don't make it up.
            --Use 2PC (or 3PC).
            --If 2PC, identify the circumstances under which indefinite
            blocking can occur (and decide whether that is an
            acceptable engineering risk).

        --RPC is highly useful.... but....

        --RPC arguably provides the wrong abstraction
            --Its goal is an impossible one: to make transparent (i.e.,
            invisible) to the layers above it whether a local or remote
            program is running.
            --RPC focuses attention on the "common case" of everything
            working!
                --Some argue that this is the wrong way to think about
                distributed programs. "Everything works" is the easy
                case, and RPC encourages you to think about that case.
                --But the important and difficult cases concern partial
                failures (for example, not every message will get a
                reply).
                --"Exception paths" need to be as carefully considered
                as the "normal case" procedure call/return paths.
            --Conclusion: RPC may be the wrong abstraction.

        --An alternative: a lower-level message-passing abstraction
            --Makes explicit where the messages are, and therefore
            helps the program writer avoid making implicit "everything
            usually works" assumptions (see the sketch below)
            --May encourage structuring programs to handle failures
            elegantly
            --Example: persistent message queues
                --use 2PC for delivering messages
                --guarantees exactly-once delivery, even across machine
                failures and long partitions
                --but now on every message (or group of them), you're
                running that lengthy protocol. So each logical message
                costs many network messages. Sometimes you need this,
                though!
            --Conclusion: persistent message queues are probably a
            better abstraction than RPC for building reliable
            distributed systems, but they are heavier-weight.
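        --To make "the messages are explicit" concrete, here is a small
        C sketch written against a hypothetical message-passing layer.
        The functions send_msg and recv_msg_timeout are invented stubs,
        not a real API; the point is only that the failure path
        (timeout, retransmit, give up) sits in the caller's code
        instead of being hidden behind a procedure call.

        /* Sketch of a client using explicit message passing rather than
         * RPC.  The two network functions are hypothetical stubs. */

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>
        #include <string.h>

        /* --- hypothetical stubs for a best-effort network --- */
        static bool send_msg(int dst, const void *buf, size_t len)
        {
            (void)dst; (void)buf; (void)len;
            return true;                /* pretend the send succeeded */
        }

        static bool recv_msg_timeout(int src, void *buf, size_t len,
                                     int msecs)
        {
            (void)src; (void)msecs;
            memset(buf, 0, len);        /* pretend a reply arrived; a real
                                           network might instead time out */
            return true;
        }

        /* Returns true if a reply arrived.  On false, the server may or
         * may not have performed the request -- the caller must confront
         * that ambiguity, which an RPC stub would quietly paper over. */
        static bool request_with_retry(int server,
                                       const void *req, size_t reqlen,
                                       void *reply, size_t replylen)
        {
            for (int attempt = 0; attempt < 5; attempt++) {
                send_msg(server, req, reqlen);
                if (recv_msg_timeout(server, reply, replylen, 200 /* ms */))
                    return true;
                /* Timeout: was the request lost, or only the reply?
                 * We can't tell; decide here whether to retransmit. */
            }
            return false;
        }

        int main(void)
        {
            char reply[64];
            bool ok = request_with_retry(1, "get balance", 11,
                                         reply, sizeof reply);
            printf("%s\n", ok ? "got a reply"
                               : "no reply: outcome unknown");
            return 0;
        }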
---------------------------------------------------------------------------

Admin notes/questions

    --When would you prefer to demo?
        --in class at the last meeting?
        --for Namrata in the two days before finals?
        --for Namrata and me, during finals week?
        --not at all? (advantages and disadvantages here)

    --Project: please try to have fun with it. One thing to note is
    that there are no grading scripts. So you'll have to be more
    disciplined about testing, but, like professional developers, you
    may be able to code around your own bugs.

---------------------------------------------------------------------------

2. Virtual machines

    A. Intro

        --To "virtualize" means "to lie" (usually at some performance
        cost): the environment where the code is actually running is
        different from what the code seems to expect. Some piece of
        technology is "lying" to the code to fool it. Hence, "virtual
        memory", "virtualize the hardware", etc.

        --So what's a virtual machine?

            [DRAW PICTURE:

             REVIEW:

                proc_1   proc_2   ...
                ---------------------
                         OS
                ---------------------
                      HARDWARE

             WHAT IF:

                proc_1                       proc_2
                ------                       ------
                  OS1                          OS2
                --HARDWARE-LIKE INTERFACE--  --HARDWARE-LIKE INTERFACE--
                -------------------------------------------------------
                          VMM (VIRTUAL MACHINE MONITOR)
                -------------------------------------------------------
                                  HARDWARE
            ]

        --What's going on?
            --The VMM exposes a virtual machine abstraction that is
            supposed to look exactly like real hardware
            --OS1, OS2, ... run in *user* mode

        --A brief history of virtual machines
            --Old idea from the 1960s (Goldberg 1974)
            --IBM VM/370: a VMM for IBM mainframes
                --high performance overhead, but worth it because
                hardware is really expensive, so it's valuable to
                pretend you have multiple OS environments, even if each
                is slower
            --Interest died out in the 1980s and 1990s (hardware is
            cheap, Windows NT is not, so not much benefit in saving
            hardware)
            --1997: Rosenblum's group at Stanford decides they are
            going to make VMs fast
                --their technology became VMware
                --and sparked a renaissance in virtual machine
                research, commercialization, and use

        --Why are VMs interesting again today? There is clearly great
        demand for them. Why?

            --Compatibility
                --not all Windows NT applications run on XP, or XP
                applications on Vista. Solution: use a VMM to run both
                Windows NT and Windows XP

            --Multiplex multiple "machines" on the same hardware (e.g.,
            Amazon's EC2, where Amazon rents "machines" to customers)
                --Need the ability to allocate a fraction of a machine
                (modern CPUs are more powerful than most apps need)
                --If I have "one-tenth" of the machine, then I pay for
                one-tenth of the power, cooling, and space
                --And I only pay for the CPU cycles I consume, instead
                of having to buy hardware (though hardware is pretty
                cheap)
                --The server consolidation trend is very real

            --Similar benefit for a single organization
                --Simulate a whole server farm with a VMM and guest
                OSes
                --The IT department used to run the mail server on one
                machine, the internal Web server on another, etc.
                --And maybe those different servers required different
                OSes
                --It is much less hardware (and cheaper) to run all of
                these on the same physical machine: a server running on
                Linux for XYZ, a server running on Windows for ABC,
                etc., all sharing the same hardware
                --But why do you need a server farm in the first place?
                Note that VMs don't increase the available hardware. So
                there must be some sense in which isolated machines are
                valuable. What is it?

            --Isolation: we want that if a machine is rooted, the
            effects are localized.

            --Software management (get the right configuration of all
            supporting libraries; helps your mail server or database
            server run)
                --Lots of software needs to run as root. You don't want
                to give your database application root over the whole
                machine. Solution: run the entire database app in a
                virtual machine.

            --Checkpoint, migration, and replication
                --Lots of possibilities here
                --Scenario (virtual appliance work, such as moka5.com):
                A bunch of developers "store" their machines in a
                central repository. Those machines are configured with
                the right tools (compiler, repository, etc.). A new
                developer sits down and "checks out" a machine, and
                just starts working. Compare this to what is required
                to re-image a machine from a known image or, worse,
                start from scratch by installing a bunch of software.
                --Mobility: your computing environment follows you
                around.

            --Ultimately, VMMs turn the operating system itself into
            normal software that can be managed.

        --Wait, what do operating systems do?
            --Multiplex hardware
            --Provide isolation
            --Abstract hardware

        --So why is the VMM doing some of the same things? (Arguably,
        OS designers screwed up over all these years and should have
        been exposing narrower interfaces. Because the syscall
        interface is wide and because security in most OSes is a joke,
        people solve their problems (multiplexing, isolation, backward
        compatibility) at a different layer of the stack.)

        --General thoughts about virtual machines before we dive into
        details
            --End-to-end, this stuff is very cool. We are getting
            exceedingly complex behavior (emulating a Windows instance
            inside of a window on Linux) by emulating much smaller
            pieces (the interplay between the CPU and the OS). That's
            interesting because it's a classic case of taking a hard
            problem and turning it into a whole bunch of smaller,
            easier ones.
            --Ridiculously successful OS technology. Easily the biggest
            practical impact of OS research in the last 20 years. CPU
            manufacturers now support virtualization.

    B. What's required for the VMM to make the OS think it's running on
    real hardware?

        (This is the core technical challenge. After this works, there
        are many other problems to solve, ranging from memory
        management to I/O virtualization to even better performance.
        But the core challenge is to make the OS believe it is running
        on real hardware.)

        --There are different approaches.

        --You may have heard that "the x86 is not virtualizable". We'll
        discuss what that means and how VMware et al. get around that
        problem.

        --We focus on CPU virtualization and memory virtualization.

        --Approaches:

        (i) Binary interpretation (example: Bochs)

            --simplest VMM approach
            --See lecture 3 notes and handout. Here's a quick review....
            --Build a simulation of all the hardware.
                --*CPU*: a loop that fetches each instruction, decodes
                it, and simulates its effect on the machine state (see
                the sketch below)
                --*Memory*: physical memory is just an array; simulate
                the MMU on all memory accesses
                --*I/O*: simulate I/O devices, programmed I/O, DMA,
                interrupts
            +: simple!
            -: Too slow! 100x slowdown (mainly from the CPU/MMU
            simulation, not I/O)
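            --A minimal sketch of this kind of emulator loop in C. The
            machine structure and the two "opcodes" are invented purely
            for illustration and have nothing to do with real x86
            encodings, which are enormously more involved.

        /* Sketch of the fetch/decode/execute loop at the core of a
         * Bochs-style emulator (toy machine state, made-up opcodes). */

        #include <stdint.h>
        #include <stdio.h>

        #define MEM_SIZE 4096

        struct machine {
            uint32_t regs[8];        /* simulated general-purpose registers */
            uint32_t eip;            /* simulated instruction pointer       */
            uint8_t  mem[MEM_SIZE];  /* simulated physical memory; a full
                                        emulator would also simulate the
                                        MMU on every access                 */
        };

        enum { OP_INC = 0x01, OP_HLT = 0xf4 };   /* toy opcodes */

        static void run(struct machine *m)
        {
            for (;;) {
                if (m->eip + 1 >= MEM_SIZE)        /* stay in bounds  */
                    return;
                uint8_t op = m->mem[m->eip];       /* fetch           */
                switch (op) {                      /* decode          */
                case OP_INC:                       /* execute: reg++  */
                    m->regs[m->mem[m->eip + 1] & 7]++;
                    m->eip += 2;
                    break;
                case OP_HLT:
                    return;
                default:
                    printf("illegal opcode %#x at %#x\n", op, m->eip);
                    return;
                }
            }
        }

        int main(void)
        {
            static struct machine m;
            m.mem[0] = OP_INC;  m.mem[1] = 0;    /* "inc reg0" */
            m.mem[2] = OP_HLT;
            run(&m);
            printf("reg0 = %u\n", m.regs[0]);    /* prints "reg0 = 1" */
            return 0;
        }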
        (ii) Classic virtualization: trap-and-emulate

            (This doesn't work by itself on the x86, for reasons we
            will see in a moment. For now, we just cover the
            technique.)

            --Observation: most instructions behave the same regardless
            of the processor privilege level.
                EXAMPLE: incl %eax
            --Idea: just give the guest OS's instructions to the CPU.
            Let it pretend to be the OS.
            --Safety issue: how can the hypervisor get the CPU back, or
            prevent the guest OS from executing "cli" or "hlt", or from
            writing all over the other OS's memory?
                --Answer: use the hardware's protection mechanism, as
                we have been doing all semester to isolate processes

            --Virtualizing the CPU
                --Run the virtual machine's OS directly on the CPU at a
                non-privileged level
                --Most instructions just work
                --Privileged instructions trap into the monitor; the
                monitor then simulates the effect of running the
                instruction
                --This doesn't fully work on the x86. We'll get to that
                in a second. For now, we discuss how to virtualize
                traps and memory.

                [Keep a "process table" entry for the virtualized OS:
                    --saved registers
                    --saved privileged registers
                    --IDT, etc.
                    --etc.]

            --Virtualizing traps
                --What happens when an interrupt or trap occurs?
                    --Trap into the *monitor*
                --What if the interrupt or trap should go to the guest
                OS?
                    --Examples: page fault, illegal instruction, system
                    call
                    --Answer: restart the guest OS, simulating the trap
                    (see the sketch at the end of these notes):
                        --Look up the trap vector in the VM's IDT
                        --Just like the processor would have done, the
                        monitor pushes:

                            SS
                            ESP
                            EFLAGS
                            CS
                            EIP

                        --and then starts running the guest OS at the
                        code point given by the entry in the IDT. If
                        this sounds familiar, it's because that's
                        exactly what Bochs/QEMU do, and what the
                        processor does in hardware.
                --What if the interrupt or trap happens because the
                guest OS tried to, say, do something privileged
                (loading CR0 into eax, reading/writing the disk, etc.)?
                    --The monitor fakes it: in the "mov %cr0, %eax"
                    example (which a normal OS is allowed to do but
                    which a user-level process is not), the VMM would
                    load a fake value of CR0 into eax and restart the
                    guest OS right after that instruction

            --Virtualizing memory
                --Next time......
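            --Going back to "virtualizing traps": here is a small C
            sketch of the trap-reflection step described above. The
            guest structure, the toy stack array, and the helpers are
            hypothetical simplifications (a real VMM would switch to
            the guest's kernel stack via its TSS and translate guest
            addresses before writing guest memory); only the five
            pushed words and the IDT lookup follow the description in
            these notes.

        /* Sketch: a VMM reflecting a trap into the guest OS by pushing
         * the same five words the hardware would push, then resuming
         * the guest at the handler named in the guest's own IDT. */

        #include <stdint.h>
        #include <stdio.h>

        #define GUEST_STACK_WORDS 1024

        struct idt_entry {
            uint32_t handler_eip;     /* where the guest wants this trap */
            uint32_t handler_cs;
        };

        struct guest {
            /* saved guest state ("process table" entry for the VM) */
            uint32_t eip, esp, eflags, cs, ss;
            struct idt_entry idt[256];           /* the guest's IDT      */
            uint32_t stack[GUEST_STACK_WORDS];   /* toy stand-in for the
                                                    guest's kernel stack */
        };

        /* Push one 32-bit word onto the guest's (toy) stack.  A real
         * VMM would translate the guest virtual address and write the
         * guest's memory instead. */
        static void guest_push(struct guest *g, uint32_t word)
        {
            g->esp -= 4;
            g->stack[g->esp / 4] = word;
        }

        /* Deliver trap number `vec` to the guest, as the CPU would. */
        static void reflect_trap(struct guest *g, int vec)
        {
            uint32_t old_esp = g->esp, old_ss = g->ss;

            guest_push(g, old_ss);
            guest_push(g, old_esp);
            guest_push(g, g->eflags);
            guest_push(g, g->cs);
            guest_push(g, g->eip);

            /* Resume the guest at the handler its own IDT names. */
            g->cs  = g->idt[vec].handler_cs;
            g->eip = g->idt[vec].handler_eip;
        }

        int main(void)
        {
            static struct guest g;                   /* large struct */
            g.esp = GUEST_STACK_WORDS * 4;           /* empty stack  */
            g.eip = 0x1234; g.cs = 0x8; g.ss = 0x10; g.eflags = 0x202;
            g.idt[14].handler_eip = 0x5678;  /* guest page-fault handler */
            g.idt[14].handler_cs  = 0x8;

            reflect_trap(&g, 14);                    /* fake a page fault */
            printf("guest resumes at %#x; old eip %#x is on its stack\n",
                   g.eip, g.stack[g.esp / 4]);
            return 0;
        }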