Class 25
CS 372H
24 April 2012

On the board
------------

1. Last time 

2. Virtual machines overview
    --technologies
    --uses

3. Virtual machines history

4. Discuss VMWare paper

---------------------------------------------------------------------------

1. Last time

    --Determinator: discussion and Chris's presentation

2. Virtual machines overview

    TECHNOLOGIES

    --To "virtualize" means "to lie": the environment where the code is
    actually running is different from what the code seems to expect.
    Some piece of technology is "lying" to the code to fool it. Hence,
    "virtual memory", "virtualize the hardware", "virtualize IO
    devices", etc.

    --Types of virtual machine technologies:

	1. Binary interpretation (example: Bochs)

	2. Virtualization on bare metal (example: VMWare ESX, VMWare
	Workstation). [also requires binary rewriting.]

	    --run the OS and guest together
		draw picture: each guest OS looks like a separate process.

	    --how to fool kernel into thinking it's the kernel?

	    A. Trap-and-emulate

		--keep "process table" for the virtualized OS:
		    --saved registers
		    --saved privileged registers
		    --saved TLB contents
		
		--basically, just run code. when OS tries to do
		something privileged, trap to hypervisor, which emulates
		the effect on the OS and the virtual registers, jump to
		OS trap vector [so OS thinks it got a trap]

		    example:
		    guest does:
			mov %cr0, %eax
		
		    hypervisor gets trap.
		    takes its stored version of %cr0
		    overwrites %eax

		    re-starts guest OS
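
		    [sketch of the example above, in Python. this is a toy
		    model, not real hypervisor code: VirtualCPU,
		    hypervisor_trap, and the register values are all made up
		    for illustration.]

```python
from dataclasses import dataclass, field


@dataclass
class VirtualCPU:
    """Per-guest state the hypervisor keeps (its "process table" entry)."""
    regs: dict = field(default_factory=lambda: {"eax": 0})
    # Hypothetical stored copy of the guest's privileged registers.
    priv_regs: dict = field(default_factory=lambda: {"cr0": 0x80000011})
    pc: int = 0


def hypervisor_trap(vcpu, instr):
    """Emulate one privileged instruction that trapped, then resume the guest."""
    op, src, dst = instr
    if op == "mov" and src in vcpu.priv_regs:
        # Guest read a privileged register: hand it the stored virtual copy.
        vcpu.regs[dst] = vcpu.priv_regs[src]
    elif op == "mov" and dst in vcpu.priv_regs:
        # Guest wrote a privileged register: update only the virtual copy.
        vcpu.priv_regs[dst] = vcpu.regs[src]
    else:
        raise NotImplementedError(instr)
    vcpu.pc += 1  # step past the emulated instruction before restarting guest
    return vcpu


# "mov %cr0, %eax" in AT&T order: source first, destination second.
vcpu = hypervisor_trap(VirtualCPU(), ("mov", "cr0", "eax"))
```

		    after the call, the guest's %eax holds the hypervisor's
		    stored %cr0, and the real %cr0 was never exposed.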


	    --unfortunately, this technique is not sufficient on x86.
	    anyone know why?
	
		1. privileged state is visible. CPL stored in bottom two
		bits of %cs.

		2. some instructions don't cause trap; they mean
		different things in user and kernel space

   
	    --So we need another technique:

		    --translate instructions that mean different things in user
		    and kernel space. example: POPF (pop flags) pops the top of
		    the stack into EFLAGS. in user mode, it behaves differently:
		    it does not let the code set or clear the interrupt-enable
		    bit, and it does not trap. so if the guest kernel now thinks
		    that it's not going to get interrupted, the VMM needs to
		    actually make sure that the guest kernel won't get
		    interrupted.
	    this is:

	    B. trap-and-emulate + binary translation

		--there can be hardware support for technique A, leading to:

	    C. trap-and-emulate using hardware support
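
	    [sketch of why POPF defeats plain trap-and-emulate. this is a
	    simplified Python model of the interrupt-enable bit only; it
	    assumes IOPL = 0 and ignores every other flag rule.
	    hardware_popf and translated_popf are hypothetical names.]

```python
IF = 1 << 9  # interrupt-enable bit in EFLAGS


def hardware_popf(eflags, popped, cpl):
    """Simplified model of POPF: returns the new EFLAGS value."""
    if cpl == 0:
        return popped  # kernel mode: IF comes from the popped value
    # User mode: no trap is raised; IF silently keeps its old value.
    return (popped & ~IF) | (eflags & IF)


def translated_popf(virtual_eflags, popped):
    """What a binary translator could substitute for a guest-kernel POPF:
    update the guest's *virtual* EFLAGS. The VMM would then hold back
    virtual interrupts while the virtual IF bit is clear."""
    return popped


# Guest kernel, actually running at CPL 3, tries to disable interrupts.
real_after = hardware_popf(eflags=IF, popped=0, cpl=3)
virtual_after = translated_popf(virtual_eflags=IF, popped=0)
```

	    real_after still has IF set and the hypervisor never got a
	    trap, so it could not notice the attempt; the translated
	    version records the guest's intent in virtual_after.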

    
	3. Paravirtualization + binary rewriting (example: Xen, Disco)

	    --same as above, but modify the OS slightly

    USES: why do you want virtual machines?

	    --isolation

	    --Compatibility
		--not all Windows NT applications run on XP, or XP on
		Vista. solution: use a VMM to run both Windows NT and
		Windows XP

	    --Multiplex multiple "machines" on same hardware (e.g.,
	    Amazon's EC2, where Amazon rents "machines" to customers)
		--Need ability to allocate a fraction of a machine
		(modern CPUs more powerful than most apps need)
		--If I have "one-tenth" the machine, then I pay for
		one-tenth of the power, cooling, and space
		--And only pay for the CPU cycles I consume, instead of
		having to buy hardware (though hardware is pretty cheap)
		--Server consolidation trend is very real

	    --Similar benefit for a single organization
		--Simulate a whole server farm with a VMM and guest OSes
		    --IT department used to run mail server on one
		    machine, internal Web server on another, etc.
		    --And maybe those different servers required
		    different OSes	
		    --Much less hardware (and cheaper) to run all of
		    these on the same hardware: a server running on
		    Linux for XYZ, a server running on Windows for ABC,
		    etc., and all using the same hardware

	    --But why do you need a server farm in the first place? Note
	    that VMs don't increase the available hardware. So there
	    must be some sense in which isolated machines are valuable.
	    What are they?

		--Isolation: want that if a machine is rooted, the
		effects are localized.

		--Software management (get the right configuration of all
		supporting libraries; helps your mail server or database
		server run)

		--Lots of software needs to run as root. Don't want to
		give your database application root over the whole
		machine. Solution: run the entire database app in a
		virtual machine.

	    --Checkpoint, migration, and replication

		--Lots of possibilities here

		--Scenario (virtual appliance work, such as moka5.com):
		A bunch of developers "store" their machines in a
		central repository. Those machines are configured with
		the right tools (compiler, repository, etc.). A new
		developer sits down and "checks out" a machine, and just
		starts working. Compare this to what is required to
		re-image a machine from a known image or, worse, start
		from scratch by installing a bunch of software.

		--Mobility: your computing environment follows you
		around.

	    --Ultimately, VMMs turn the operating system itself into
	    normal software that can be managed.


	--Wait, what do operating systems do?

	    --Multiplex hardware

	    --Provide isolation

	    --Abstract hardware

	--So why is the VMM doing some of the same things? Several
	reasons:
	
	    --Arguably OS designers screwed up over all these years and
	    should have been exposing narrower interfaces. Because
	    the syscall interface is wide and because security in most
	    OSes is a joke, people solve their problems (multiplexing,
	    isolation, backward compatibility) at a different layer of
	    the stack.

	    --machine is a useful unit of abstraction/containment, so it
	    makes sense to virtualize it

		well, lots of rebooting
		lots of applications require people to be root
		lots of apps aren't well-isolated

	--General thoughts about virtual machines before we dive into
	history and paper

	    --End-to-end, this stuff is very cool. we are getting
	    exceedingly complex behavior (emulating a Windows instance
	    inside of a window on Linux) by emulating much smaller
	    pieces (the interplay between CPU and OS). That's
	    interesting because it's a classic case of taking a hard
	    problem and turning it into a whole bunch of smaller, easier
	    ones.

	    --Ultimately: something sort of artifactual about this. VMMs
	    help isolate OSes (so a forkbomb within one machine doesn't
	    affect the others), but, really, OSes should work the same
	    way! with proper scheduling abstractions and containers, a
	    fork bomb should not give one process the ability to have
	    its descendants take over the CPU.

3. History of virtual machines and VMWare

    --old idea from the 1960s and 1970s (Goldberg 1974)

	--IBM VM/370: a VMM for IBM mainframes
	    --high performance overhead, but worth it because
	    hardware is really expensive. so valuable to pretend you
	    have multiple OS environments, even if each is slower

    --Interest died out in 1980s and 1990s (hardware is cheap,
    Windows NT is not, so not much benefit in saving hardware)

    --1997: Rosenblum's group at Stanford decides they are going
    to make VMs fast
	    --their technology became VMWare
	    --and sparked a renaissance in virtual machines
	    research, commercialization, and use

	--Disco: virtualize MIPS. runs on ccNUMA multiprocessor.
	ccNUMA: cache-coherent, non-uniform memory access.
	    --cache, plus network substrate for accessing non-local
	    memory

	    --ASK: what are Disco people using virtual machines for?

    --Motivation for Disco (manage ccNUMA machine with thin hypervisor
    and individual OSes) reads as very dated now:

	--serious OS effort to present a "single machine" abstraction on
	top of a ccNUMA machine (multiple processors, lots of memory,
	lots of resources, etc.).

	--Thus, no commodity OS vendor, like Microsoft, was going to
	support this.
	
	--Instead, Disco argues: just run a thin hypervisor layer that
	runs lots of different OSes. Those different OSes will naturally
	work with NUMA, faults will be contained, etc.

	--Meanwhile, the hypervisor is made simple.

    --But the true killer apps for VMWare seemed to be:
    
	--everyone running Linux on Windows or Windows on Linux
	(VMWare workstation: draw picture)

	--managing server "farms" (VMWare ESX server: draw picture).
	    
	    --isolation  [IT used to run on different machines to
	    get that isolation: mail, internal Web server, etc.]

	    --resource re-use

    --Ridiculously successful OS technology. Easily the biggest
    practical impact of OS research in the last 20 years. CPU
    manufacturers now support virtualization.

    --Disco's authors founded VMWare, which brings us to the paper

---------------------------------------------------------------------------

Admin notes

    --sign up for demo

---------------------------------------------------------------------------

4. VMWare ESX paper 

    --background; the difference between VMWare Workstation and VMWare
    ESX server (one runs as the hypervisor; the other is just an
    application with supervisor privileges)

	[--if not drawn already, draw picture of VMWare ESX server, with
	multiple VMs]

    --main point

	manage multiple virtual machines (focusing on their memories)
	but without modifying the guest operating system.

	bunch of tricks. we'll discuss them.

	first need to understand how hypervisor lies about memory

    --Key approach to memory management:

	[draw pictures inside VMWare ESX server]

	--introduce and separate three terms:
	    virtual, physical, machine

	--virtual pages: referenced by software, including kernel

	--machine pages: actual H/W page in memory

	--"physical pages"
	    --in physical machine, correspond to precise machine page
	    --in virtual machine, VMM decides where the "physical page"
	    lives. could be at any machine address. or on the disk.

	--mappings:

	    1. "primary" page tables (per-process):
		  virtual -> (per-VM) "physical pages".
		  mappings are written by the guest OS, with the
		    accessed/dirty bits written by the VMM
		  hardware MMU never sees

	    2. VMM's pmap: (per-VM):
		"physical" -> (global) machine pages
		machine pages --> virtual [why?] [answer: TLB shootdowns]
		only VMM sees

	    3. Shadow page tables:
		what goes here and why?
		answer: virtual --> (global) machine pages
		Mappings written by VMM, accessed/dirty bits written by hardware
		Guest OS never sees
		--These translations are a function of primary page
		tables and pmap
		--Accessed/dirty bits need to be copied back to primary
		PT by VMM
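
	[sketch: the shadow page table as the composition of the other
	two mappings. Python dicts stand in for page tables; build_shadow
	and the page numbers are made up for illustration.]

```python
def build_shadow(primary_pt, pmap):
    """Compose virtual -> "physical" (guest) with "physical" -> machine (VMM)."""
    shadow = {}
    for vpn, ppn in primary_pt.items():
        mpn = pmap.get(ppn)
        if mpn is not None:
            # "Physical" page is currently backed by a machine page.
            shadow[vpn] = mpn
        # Otherwise leave it unmapped, so an access faults into the VMM.
    return shadow


primary_pt = {0x10: 0x2, 0x11: 0x3, 0x12: 0x7}  # written by the guest OS
pmap = {0x2: 0x9A, 0x3: 0x41}  # "physical" 0x7 is not in machine memory
shadow = build_shadow(primary_pt, pmap)
```

	the hardware MMU would walk only shadow. when the guest edits
	primary_pt or the VMM changes pmap, the affected shadow entries
	have to be recomputed.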

    --what are three parameters for a VM? (min/max/shares)

	--min: VMM guarantees this much machine memory to VM
	--max: amount of "physical" memory VM OS thinks that machine
	has
	--share: how much of machine memory this VM should have relative
	to other VMs

    --the question addressed by this paper: what to do when
    over-committed, i.e., when \sum max_i > machine memory.
		  
    --why is this hard? why not just page physical memory to disk using
    LRU?

	--answer: double-paging.

	--guest OS will feel memory pressure too. to reuse a "physical
	page", it will likely "page to disk" the very "physical page"
	that the VMM just paged out (because its current contents
	haven't been used in a while -- same reason that the VMM chose
	to page it out!!).

	--result: the OS writes the page to its virtual disk, but that
	causes the VMM to read the page back from the real disk so that
	the OS can write it to its virtual disk.

    I. technique: ballooning. 

	ASK: what is ballooning and why is it needed?

	    (make guest OS think it has less memory by saying "please pin
	    some physical pages". now, guest OS chooses the physical pages
	    based on any old policy. guest OS tells the balloon what
	    physical pages it has pinned. balloon sneakily tells ESX.  ESX
	    can now use the corresponding *machine* pages because the guest
	    OS isn't going to touch the corresponding "physical" pages
	    (because they're pinned).)

	    wait, I thought that they're not modifying the OS?

	    (Answer: they're not. they're just loading a module into it,
	    which they can do by issuing an appropriate "load this
	    module" instruction at the appropriate time.)
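
	    [sketch of the balloon handshake. this is a toy Python model:
	    Guest, Hypervisor, and their methods are hypothetical, and the
	    guest's "policy" here is just "take any free page". a real
	    guest under pressure would page something out by its own
	    policy to free up pages.]

```python
class Guest:
    def __init__(self, n_phys):
        self.free = set(range(n_phys))  # free "physical" page numbers
        self.pinned = set()

    def balloon_inflate(self, n):
        """Balloon module pins n pages chosen by the guest's own policy."""
        pages = [self.free.pop() for _ in range(min(n, len(self.free)))]
        self.pinned.update(pages)
        return pages  # the balloon reports these page numbers to the VMM


class Hypervisor:
    def __init__(self, pmap):
        self.pmap = dict(pmap)  # "physical" -> machine, for one VM
        self.free_machine = []

    def reclaim(self, guest, n):
        for ppn in guest.balloon_inflate(n):
            # Guest will not touch a pinned page, so its backing machine
            # page can be handed to another VM.
            self.free_machine.append(self.pmap.pop(ppn))
        return len(self.free_machine)


guest = Guest(n_phys=8)
hv = Hypervisor({ppn: 100 + ppn for ppn in range(8)})
reclaimed = hv.reclaim(guest, 3)
```

	    the hypervisor never had to guess which pages were cold; the
	    guest made that choice.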

	--ASK: sounds cool. is it useful? 
	
	    (compared to what? [we don't know what the naive strategy of
	    random paging would accomplish.])

	    so we're not actually able to judge ballooning, though it is an
	    elegant trick
		UPDATE: I asked Carl Waldspurger about this, and he said
		ballooning is key. Also, footnote 7 *may* imply that
		ballooning is better than random page eviction.

	--if there's an idea here, it's that trying to infer what the OS
	is doing is hard. easier just to ask OS what it would have done.

    II. technique: content-based page sharing.

	--ASK: what is it and why useful?

	--ANSWER: share pages across OSes (e.g., kernel code)
	    --Use hashing to find pages with identical contents
	    --Big hash table maps hash values onto machine pages
		--If hash match, compare contents
		--If contents match, copy-on-write sharing
	    --How to find potential matches?
		--Check pages randomly
		--If no match, install page in hash table
		--But don't prevent writes yet!
		--Instead, mark as "hint" entry
		--If a later page matches this bucket, check to see if
		the hint page itself has changed
		--If so, remove hint, install new page
		--If no change, up refcount
		--Also check pages right before paging out to disk
	    --Space saving
		--16-bit reference count plus an overflow table for
		larger counts
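
	[sketch of the hint-entry logic above. this is a toy Python
	model, not ESX's implementation: PageSharer and its fields are
	hypothetical, SHA-1 is just a stand-in hash, and copy-on-write
	is reduced to a remap table.]

```python
import hashlib


def page_hash(data):
    return hashlib.sha1(data).digest()


class PageSharer:
    def __init__(self, memory):
        self.memory = memory  # machine page number -> page contents
        self.table = {}       # hash -> {"mpn", "refs", "hint"}
        self.remap = {}       # duplicate mpn -> shared mpn (COW in real life)

    def scan(self, mpn):
        """Try to share one page; return the mpn that now backs it."""
        data = self.memory[mpn]
        h = page_hash(data)
        entry = self.table.get(h)
        if entry is None:
            # No match: remember the page as a hint, without write-protecting.
            self.table[h] = {"mpn": mpn, "refs": 0, "hint": True}
            return mpn
        if entry["mpn"] == mpn:
            return mpn
        if entry["hint"]:
            # Hint page was writable, so it may have changed since hashing.
            if page_hash(self.memory[entry["mpn"]]) != h:
                self.table[h] = {"mpn": mpn, "refs": 0, "hint": True}
                return mpn
            entry["hint"], entry["refs"] = False, 1
        if self.memory[entry["mpn"]] != data:
            return mpn  # hash collision: contents differ, do not share
        entry["refs"] += 1
        self.remap[mpn] = entry["mpn"]
        return entry["mpn"]


mem = {1: b"k" * 4096, 2: b"k" * 4096, 3: b"x" * 4096}
sharer = PageSharer(mem)
results = [sharer.scan(m) for m in (1, 2, 3)]
```

	pages 1 and 2 end up backed by the same machine page; page 3
	just becomes a new hint.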

	--How well does this work? (See figures 4 and 5: very well: 67% of memory
	was shared for identically-configured machines).

	--Low overhead to implement this technique

    III. technique: share-based allocation, with a tax

	--Basic idea: give resource rights based on *shares*, S_1, ...,
	S_n

	--The VM selected to relinquish should be the one with the
	fewest shares per allocated page, i.e., lowest ratio of S_i /
	P_i. that's the OS that's paying the least.

	    --example: A, B each have S=1. reclaim from the larger user.
	    --example: A has twice as many shares as B --> A can use
	    twice as much memory

	--Problem: what if a VM has tons of shares but isn't using its
	memory? Don't want to reclaim pages from other VMs

	--Solution: tax the idle pages:

	    tax arithmetic. if my income tax rate is T, then if I earn
	    $1, I pay T*$1 in taxes. thus:
	    
		-- $1 gross = $(1-T) take home.
	    
		-- $1/(1-T) gross = $1 take home

		-- to take home a dollar, need to earn k = 1/(1-T) gross

	    idea: 
	    
		tax idle memory

		pages that are being used are "tax deductible"

		if you're not using a page, pay a fraction, T, of it
		back to the system (not "yours").

		so each idle page costs, in shares, k times the price of
		a non-idle page.

	    consider # of shares per post-tax dollars/pages:

		rho = S / [(# used) + k*(# idle)]
		    = S / [P * (f + k*(1-f))]

	    P is # of allocated pages
	    k is "idle page cost", k = 1/(1-T)
	    f is fraction of active pages
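
	    [worked example of the formula, in Python. the function
	    names, the two VMs, and the tax rate T = 0.75 are all chosen
	    just for illustration.]

```python
def shares_per_page(S, P, f, T):
    """Adjusted shares-per-page ratio rho for one VM.

    S: shares, P: allocated pages, f: fraction of pages active,
    T: tax rate on idle pages (0 <= T < 1).
    """
    k = 1.0 / (1.0 - T)  # idle page cost
    return S / (P * (f + k * (1.0 - f)))


def pick_victim(vms, T):
    """Reclaim from the VM with the lowest rho."""
    return min(vms, key=lambda name: shares_per_page(*vms[name], T))


# name -> (S, P, f): same shares, same allocation, different idleness.
vms = {"A": (1, 100, 0.2), "B": (1, 100, 0.9)}
rho_a = shares_per_page(*vms["A"], T=0.75)  # 1 / (100 * 3.4)
rho_b = shares_per_page(*vms["B"], T=0.75)  # 1 / (100 * 1.3)
victim = pick_victim(vms, T=0.75)
```

	    with T = 0 the two VMs tie (k = 1, so idleness is ignored).
	    with T = 0.75, mostly-idle A has the lower rho and is the one
	    that gives up a page.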


	--ASK: how to measure non-idle memory (f):
	  Statistical sampling:  Pick n pages at random, invalidate, see if accessed
	    If t pages touched out of n at end of period, estimate usage as t/n
	    How expensive is this?  <= 100 page faults over 30 seconds negligible

	  Ridiculously easy
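
	  [sketch of the sampling estimate, in Python. the set of
	  touched pages stands in for "which invalidated pages faulted
	  during the period"; estimate_active_fraction and its
	  parameters are hypothetical.]

```python
import random


def estimate_active_fraction(touched, n_pages, n_samples=100, rng=random):
    """Estimate f = fraction of active pages from a random sample.

    touched: set of page numbers the guest accessed during the period.
    """
    # Pick n pages at random; in the real system each would be invalidated
    # so that the next access faults and can be counted.
    sample = rng.sample(range(n_pages), n_samples)
    t = sum(1 for page in sample if page in touched)
    return t / n_samples


n_pages = 10_000
touched = set(range(2_500))  # guest is actively using a quarter of its pages
estimate = estimate_active_fraction(touched, n_pages, rng=random.Random(0))
```

	  the cost is at most n_samples extra faults per period, no
	  matter how big the VM is.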

	--ASK: why do they keep three moving averages? What do they keep
	three moving averages of?
	    --> Slow exponentially weighted moving average of t/n over many periods
	    --> Faster weighted average that adapts more quickly
	    --> Version of faster average that incorporates samples in current period

	    --use max of 3. why?
 
	    Basic idea: respond rapidly to increases in memory usage and
	    gradually to decreases in memory usage.
		--When in doubt, want to respect priorities (so give
		credit for having had a high estimate of non-idle pages
		in the past).
		--Spike in usage likely means VM has "woken up"
		--Small pause in usage doesn't necessarily mean pause
		will continue to last
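
	    [sketch of "max of three averages", in Python. the class
	    name and the gains (0.1 slow, 0.5 fast) are assumptions for
	    illustration, not values taken from the paper.]

```python
class ActiveEstimator:
    def __init__(self, slow_gain=0.1, fast_gain=0.5):
        self.slow_gain, self.fast_gain = slow_gain, fast_gain
        self.slow = 0.0  # slow EWMA of t/n
        self.fast = 0.0  # fast EWMA of t/n

    def end_period(self, t, n):
        """Fold one finished sampling period (t of n pages touched) in."""
        x = t / n
        self.slow += self.slow_gain * (x - self.slow)
        self.fast += self.fast_gain * (x - self.fast)

    def estimate(self, t_so_far=0, n_so_far=0):
        """Max of slow, fast, and fast-plus-current-period averages."""
        current = self.fast
        if n_so_far:
            x = t_so_far / n_so_far
            current = self.fast + self.fast_gain * (x - self.fast)
        return max(self.slow, self.fast, current)


est = ActiveEstimator()
for _ in range(20):
    est.end_period(80, 100)       # VM busy for a while
busy = est.estimate()
est.end_period(0, 100)            # one idle period
after_pause = est.estimate()
woke_up = est.estimate(t_so_far=50, n_so_far=50)  # spike mid-period
```

	    after the pause the fast average has halved, but the max is
	    held up by the slow one, so the estimate drops only a little.
	    the mid-period spike pushes the estimate up right away.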

	--ASK: how do they use the estimate?

	--ASK: how well does this do? [answer: figure 6 (p. 9)]
   

    big picture:

	estimate (5.3) -->
	share-based alloc. based on tax and reclaiming from smallest rho (5.2) -->
	ballooning (3.2) or paging (3.3) to decide which page

    commentary:

	very nice design in part because it has very few parameters:
	
	    min, max, S [per VM]
	    system-wide [\tau, the tax rate; T above]