Class 25
CS 372H
24 April 2012

On the board
------------

1. Last time

2. Virtual machines overview
    --technologies
    --uses

3. Virtual machines history

4. Discuss VMWare paper

---------------------------------------------------------------------------

1. Last time

    --Determinator: discussion and Chris's presentation

2. Virtual machines overview

    TECHNOLOGIES

    --To "virtualize" means "to lie": the environment where the code is
    actually running is different from what the code seems to expect.
    Some piece of technology is "lying" to the code to fool it. Hence,
    "virtual memory", "virtualize the hardware", "virtualize IO
    devices", etc.

    --Types of virtual machine technologies:

        1. Binary interpretation (example: Bochs)

        2. Virtualization on bare metal (example: VMWare ESX, VMWare
        Workstation). [also requires binary rewriting.]

            --run the OS and guest together
                draw picture: each OS looks like a separate process.

            --how to fool kernel into thinking it's the kernel?

            A. Trap-and-emulate

                --keep "process table" for the virtualized OS:
                    --saved registers
                    --saved privileged registers
                    --saved TLB contents

                --basically, just run code. when OS tries to do
                something privileged, trap to hypervisor, which
                emulates the effect on the OS and the virtual
                registers, then jumps to OS trap vector [so OS thinks
                it got a trap]

                example:
                    guest does:
                        mov %cr0, %eax
                    hypervisor gets trap.
                        takes its stored version of %cr0
                        overwrites %eax
                        re-starts guest OS

                --unfortunately, this technique is not sufficient on
                x86. anyone know why?

                    1. privileged state is visible. CPL stored in
                    bottom two bits of %cs.

                    2. some instructions don't cause trap; they mean
                    different things in user and kernel space

                --So we need another technique:

                    --translate instructions that mean different
                    things in user and kernel space.

                    POPF (pop flags -- pops top of stack into EFLAGS.
                    in user mode, it behaves differently: it silently
                    ignores attempts to set or clear the
                    interrupt-enable bit instead of trapping.
                    thus, if guest kernel now thinks that it's not
                    going to get interrupted, need to actually make
                    sure that guest kernel won't get interrupted).

                this is:

            B. trap-and-emulate + binary translation

                --there can be hardware support for technique A,
                leading to:

            C. trap-and-emulate using hardware support

        3. Paravirtualization + binary rewriting (example: Xen, Disco)

            --same as above, but modify the OS slightly

    USES: why do you want virtual machines?

    --isolation

    --Compatibility
        --not all Windows NT applications run on XP, or XP on Vista.
        solution: use a VMM to run both Windows NT and Windows XP

    --Multiplex multiple "machines" on same hardware (e.g., Amazon's
    EC2, where Amazon rents "machines" to customers)
        --Need ability to allocate a fraction of a machine (modern CPUs
        more powerful than most apps need)
        --If I have "one-tenth" the machine, then I pay for one-tenth
        of the power, cooling, and space
        --And only pay for the CPU cycles I consume, instead of having
        to buy hardware (though hardware is pretty cheap)

    --Server consolidation trend is very real
        --Similar benefit for a single organization
        --Simulate a whole server farm with a VMM and guest OSes
        --IT department used to run mail server on one machine,
        internal Web server on another, etc.
        --And maybe those different servers required different OSes
        --Much less hardware (and cheaper) to run all of these on the
        same hardware: a server running on Linux for XYZ, a server
        running on Windows for ABC, etc., and all using the same
        hardware
        --But why do you need a server farm in the first place? Note
        that VMs don't increase the available hardware. So there must
        be some sense in which isolated machines are valuable. What
        are they?
            --Isolation: want that if a machine is rooted, the effects
            are localized.
            --Software management (get the right configuration of all
            supporting libraries; helps your mail server or database
            server run)
            --Lots of software needs to run as root.
            Don't want to give your database application root over the
            whole machine. Solution: run the entire database app in a
            virtual machine.

    --Checkpoint, migration, and replication
        --Lots of possibilities here
        --Scenario (virtual appliance work, such as moka5.com): A bunch
        of developers "store" their machines in a central repository.
        Those machines are configured with the right tools (compiler,
        repository, etc.). A new developer sits down and "checks out"
        a machine, and just starts working. Compare this to what is
        required to re-image a machine from a known image or, worse,
        start from scratch by installing a bunch of software.
        --Mobility: your computing environment follows you around.

    --Ultimately, VMMs turn the operating system itself into normal
    software that can be managed.

    --Wait, what do operating systems do?
        --Multiplex hardware
        --Provide isolation
        --Abstract hardware

    --So why is the VMM doing some of the same things? Several reasons:

        --Arguably OS designers screwed up over all these years and
        should have been exposing narrower interfaces. Because the
        syscall interface is wide and because security in most OSes is
        a joke, people solve their problems (multiplexing, isolation,
        backward compatibility) at a different layer of the stack.

        --machine is a useful unit of abstraction/containment, so it
        makes sense to virtualize it well:
            lots of rebooting
            lots of applications require people to be root
            lots of apps aren't well-isolated

    --General thoughts about virtual machines before we dive into
    history and paper

        --End-to-end, this stuff is very cool. we are getting
        exceedingly complex behavior (emulating a Windows instance
        inside of a window on Linux) by emulating much smaller pieces
        (the interplay between CPU and OS). That's interesting because
        it's a classic case of taking a hard problem and turning it
        into a whole bunch of smaller, easier ones.

        --Ultimately: something sort of artifactual about this.
        VMMs help isolate OSes (so a fork bomb within one machine
        doesn't affect the others), but, really, OSes should work the
        same way! with proper scheduling abstractions and containers,
        a fork bomb should not give one process the ability to have
        its descendants take over the CPU.

3. History of virtual machines and VMWare

    --old idea from the 1960s and 1970s (Goldberg 1974)

    --IBM VM/370: a VMM for IBM mainframes
        --high performance overhead, but worth it because hardware is
        really expensive. so valuable to pretend you have multiple OS
        environments, even if each is slower

    --Interest died out in 1980s and 1990s (hardware is cheap, Windows
    NT is not, so not much benefit in saving hardware)

    --1997: Rosenblum's group at Stanford decides they are going to
    make VMs fast
        --their technology became VMWare
        --and sparked a renaissance in virtual machines research,
        commercialization, and use

    --Disco: virtualize MIPS. runs on ccNUMA multiprocessor.
        ccNUMA: cache-coherent, non-uniform memory access.
        --cache, plus network substrate for accessing non-local memory

    --ASK: what are Disco people using virtual machines for?

    --Motivation for Disco (manage ccNUMA machine with thin hypervisor
    and individual OSes) reads as very dated now:
        --it takes serious OS effort to present a "single machine"
        abstraction on top of a ccNUMA machine (multiple processors,
        lots of memory, lots of resources, etc.).
        --Thus, no commodity OS vendor, like Microsoft, was going to
        support this.
        --Instead, Disco argues: just run a thin hypervisor layer that
        runs lots of different OSes. Those different OSes will
        naturally work with NUMA, faults will be contained, etc.
        --Meanwhile, the hypervisor is made simple.

    --But the true killer apps for VMWare seemed to be:
        --everyone running Linux on Windows or Windows on Linux
        (VMWare workstation: draw picture)
        --managing server "farms" (VMWare ESX server: draw picture).
            --isolation [IT used to run services on different machines
            to get that isolation: mail, internal Web server, etc.]
            --resource re-use

    --Ridiculously successful OS technology. Easily the biggest
    practical impact of OS research in the last 20 years. CPU
    manufacturers now support virtualization.

    --Disco's authors founded VMWare, which brings us to the paper

---------------------------------------------------------------------------

Admin notes

    --sign up for demo

---------------------------------------------------------------------------

4. VMWare ESX paper

    --background; the difference between VMWare Workstation and VMWare
    ESX server (ESX runs as the hypervisor on the bare hardware;
    Workstation is just an application, with supervisor privileges,
    on a host OS)

    [--if not drawn already, draw picture of VMWare ESX server, with
    multiple VMs]

    --main point
        manage multiple virtual machines (focusing on their memories)
        but without modifying the guest operating system.
        bunch of tricks. we'll discuss them. first need to understand
        how hypervisor lies about memory

    --Key approach to memory management:
        [draw pictures inside VMWare ESX server]

        --introduce and separate three terms: virtual, physical, machine
            --virtual pages: referenced by software, including kernel
            --machine pages: actual H/W page in memory
            --"physical pages"
                --in physical machine, correspond to precise machine
                page
                --in virtual machine, VMM decides where the "physical
                page" lives. could be at any machine address. or on
                the disk.

        --mappings:

            1. "primary" page tables (per-process):
                virtual -> (per-VM) "physical pages".
                mappings are written by the guest OS, with the
                accessed/dirty bits written by the VMM
                hardware MMU never sees

            2. VMM's pmap: (per-VM):
                "physical" -> (global) machine pages
                machine pages --> virtual [why?]
                    [answer: TLB shootdowns]
                only VMM sees

            3. Shadow page tables: what goes here and why?
                answer: virtual --> (global) machine pages
                Mappings written by VMM, accessed/dirty bits written
                by hardware
                Guest OS never sees
                --These translations are a function of primary page
                tables and pmap
                --Accessed/dirty bits need to be copied back to
                primary PT by VMM

    --what are three parameters for a VM? (min/max/shares)
        --min: VMM guarantees this much machine memory to VM
        --max: amount of "physical" memory VM OS thinks that machine
        has
        --share: how much of machine memory this VM should have
        relative to other VMs

    --the question addressed by this paper: what to do when
    over-committed, i.e., when \sum max_i > machine memory.

    --why is this hard? why not just page physical memory to disk
    using LRU?
        --answer: double-paging.
        --the guest OS will also feel memory pressure, and will "page
        to disk" the very "physical page" that the VMM just paged
        out, in order to reuse that "physical page" (because its
        current contents haven't been used in a while -- same reason
        that the VMM chose to page it out!!).
        --result: the OS writes the page to its virtual disk, but
        that causes the VMM to read the page back from the real disk
        so that the OS can write it to its virtual disk.

    I. technique: ballooning.

        ASK: what is ballooning and why is it needed?
        (make guest OS think it has less memory by saying "please pin
        some physical pages". now, guest OS chooses the physical
        pages based on any old policy. guest OS tells the balloon
        what physical pages it has pinned. balloon sneakily tells
        ESX. ESX can now use the corresponding *machine* pages
        because the guest OS isn't going to touch the corresponding
        "physical" pages (because they're pinned).)

        wait, I thought that they're not modifying the OS?
        (Answer: they're not. they're just loading a module into it,
        which they can do by issuing an appropriate "load this
        module" instruction at the appropriate time.)

        --ASK: sounds cool. is it useful?
        (compared to what?
        [we don't know what the naive strategy of random paging would
        accomplish.])
        so we're not actually able to judge ballooning, though it is
        an elegant trick

        UPDATE: I asked Carl Waldspurger about this, and he said
        ballooning is key. Also, footnote 7 *may* imply that
        ballooning is better than random page eviction.

        --if there's an idea here, it's that trying to infer what the
        OS is doing is hard. easier just to ask OS what it would have
        done.

    II. technique: content-based page sharing.

        --ASK: what is it and why useful?
        --ANSWER: share pages across OSes (e.g., kernel code)
        --Use hashing to find pages with identical contents
        --Big hash table maps hash values onto machine pages
            - If hash match, compare contents
            - If contents match, Copy-on-write sharing
            - How to find potential matches?
                -- Check pages randomly
                -- If no match, install page in hash table
                -- But don't prevent writes yet!
                -- Instead, mark as "hint" entry
                -- If a later page matches this bucket, check to see
                   if the page itself has changed
                    -- If so, remove hint, install new page
                    -- If no change, up refcount
                -- Also check pages right before paging out to disk
            - Space saving
                -- 16-bit reference count plus an overflow table for
                   larger counts
        --How well does this work? (See figures 4 and 5: very well:
        67% of memory was shared for identically-configured machines).
        --Low overhead to implement this technique

    III. technique: share-based allocation, with a tax

        --Basic idea: give resource rights based on *shares*,
        S_1, ..., S_n
        --The VM selected to relinquish should be the one with the
        fewest shares per allocated page, i.e., lowest ratio of
        S_i / P_i. that's the OS that's paying the least.
        --example: A, B each have S=1. reclaim from the larger user.
          A has twice as many shares --> A can use twice as much memory
        --Problem: what if a VM has tons of shares but isn't using its
        memory? Don't want to reclaim pages from other VMs
        --Solution: tax the idle pages:

            tax arithmetic.
            if my income tax rate is T, then if I earn $1, I pay T*$1
            in taxes. thus:
                -- $1 gross = $(1-T) take home.
                -- $1/(1-T) gross = $1 take home
                -- to take home a dollar, need to earn k = 1/(1-T)
                   dollars gross

            idea: tax idle memory
                pages that are being used are "tax deductible"
                if you're not using a page, pay a fraction, T, of it
                back to the system (not "yours"). so each idle page
                costs, in shares, k times the price of a non-idle
                page.

            consider # of shares per post-tax dollars/pages:

                rho = S / [(# used) + k*(# idle)]
                    = S / [P * (f + k*(1-f))]

                P is the number of allocated pages
                k is "idle page cost", k = 1/(1-T)
                f is fraction of active pages

        --ASK: how to measure non-idle memory (f):
            Statistical sampling:
                Pick n pages at random, invalidate, see if accessed
                If t pages touched out of n at end of period, estimate
                usage as t/n
            How expensive is this? <= 100 page faults over 30 seconds
                negligible
            Ridiculously easy

        --ASK: why do they keep three moving averages? What do they
        keep three moving averages of?
            --> Slow exponentially weighted moving average of t/n over
                many periods
            --> Faster weighted average that adapts more quickly
            --> Version of faster average that incorporates samples in
                current period

            --use max of 3. why?
                Basic idea: respond rapidly to increases in memory
                usage and gradually to decreases in memory usage.
                --When in doubt, want to respect priorities (so give
                credit for having had a high estimate of non-idle
                pages in the past).
                --Spike in usage likely means VM has "woken up"
                --Small pause in usage doesn't necessarily mean pause
                will continue to last

        --ASK: how do they use the estimate?

        --ASK: how well does this do? [answer: figure 6 (p. 9)]

    big picture:
        estimate (5.3) --> share-based alloc. based on tax and
        reclaiming from smallest (5.2) --> ballooning (3.2) or paging
        (3.3) to decide which page

    commentary: very nice design in part because it has very few
    parameters:
        min, max, S [per VM]
        system-wide [\tau]
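
    worked example (not from the paper): a minimal Python sketch of the
    rho formula above and of the "reclaim from the VM with the lowest
    rho" rule. The names VM, idle_page_cost, adjusted_ratio, and
    pick_victim are made up for illustration, and the share counts, page
    counts, and the tax rate of 0.75 are arbitrary illustrative numbers.
    ESX's real allocator also has to respect min and max, which this
    sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    shares: float       # S: shares assigned to this VM
    pages: int          # P: pages currently allocated to this VM
    active_frac: float  # f: estimated fraction of pages that are active


def idle_page_cost(tax_rate: float) -> float:
    """k = 1/(1-T): how many 'gross' pages one idle page costs."""
    if not 0.0 <= tax_rate < 1.0:
        raise ValueError("tax rate must be in [0, 1)")
    return 1.0 / (1.0 - tax_rate)


def adjusted_ratio(vm: VM, tax_rate: float) -> float:
    """rho = S / [P * (f + k*(1-f))], the shares-per-page ratio after tax."""
    k = idle_page_cost(tax_rate)
    f = vm.active_frac
    return vm.shares / (vm.pages * (f + k * (1.0 - f)))


def pick_victim(vms: list[VM], tax_rate: float) -> VM:
    """Reclaim from the VM with the lowest rho (it 'pays' least per page)."""
    return min(vms, key=lambda vm: adjusted_ratio(vm, tax_rate))


# Hypothetical scenario: equal shares; A is bigger but fully active,
# B is smaller but mostly idle.
vms = [
    VM("A", shares=100, pages=1200, active_frac=1.0),
    VM("B", shares=100, pages=1000, active_frac=0.2),
]

for T in (0.0, 0.75):
    ratios = {vm.name: round(adjusted_ratio(vm, T), 4) for vm in vms}
    print(f"T={T}: rho={ratios}, reclaim from {pick_victim(vms, T).name}")
```

    with no tax (T=0) this reduces to plain S/P, so the larger VM (A)
    loses pages. with the tax, B's idle pages each count k times, so B's
    rho drops below A's and B becomes the victim, which is the behavior
    the "Problem/Solution" bullets above are after.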