Class 27
CS 372H 29 April 2010

On the board
------------

1. Virtual machines, continued

2. Protection and security
    --stack smashing
    --Unix security model

---------------------------------------------------------------------------

0. Last time

--Two-phase commit. Correction: if acknowledgments go back from the
  workers to the coordinator at the end of phase 2, then the coordinator
  does not have to keep the log entry for that transaction forever.

1. Virtual machines

A. Intro

B. What's required for the VMM to make the OS think it's running on real
   hardware?

   (This is the core technical challenge. Once this works, there are many
   other problems to solve, ranging from memory management to I/O
   virtualization to better performance. But the core challenge is to
   make the OS believe it is running on real hardware.)

    --Different approaches.

    --You may have heard "the x86 is not virtualizable". We'll discuss
      what that means and how VMware et al. get around the problem.

    --Focus on CPU virtualization and memory virtualization.

    --Approaches:

    (i) Binary interpretation (example: Bochs)

        --simplest VMM approach
        --See lecture 3 notes and handout. Here's a quick review....
        --Build a simulation of all of the hardware:
            --*CPU*: a loop that fetches each instruction, decodes it,
              and simulates its effect on the machine state
            --*Memory*: physical memory is just an array; simulate the
              MMU on all memory accesses
            --*I/O*: simulate I/O devices, programmed I/O, DMA, interrupts
        +: simple!
        -: too slow! 100x slowdown (mainly from CPU/MMU, not I/O)

    (ii) Classic virtualization: trap-and-emulate

        (Doesn't work by itself on the x86, for reasons we will see in a
        moment. For now, we just cover the technique.)

        --Observation: most instructions behave the same regardless of
          processor privilege level. EXAMPLE: incl %eax
        --Idea: just give the guest OS's instructions to the CPU. Let the
          guest OS pretend to be the OS.
        --Safety issue: how can the hypervisor get the CPU back, or
          prevent the guest OS from executing "cli" or "hlt", or writing
          all over the other guests' memory?
        --Answer: use the hardware's protection mechanisms, as we have
          been doing all semester to isolate processes.

        --Virtualizing the CPU:
            --Run the virtual machine's OS directly on the CPU at a
              non-privileged level
            --Most instructions just work
            --Privileged instructions trap into the monitor; the monitor
              then simulates the effect of running the instruction
            --Doesn't fully work on the x86. We'll get to that in a
              second. For now, discuss how to virtualize traps and memory.

            [Keep a "process table" entry for each virtualized OS:
                --saved registers
                --saved privileged registers
                --IDT, etc.
                --etc.]

        --Virtualizing traps:
            --What happens when an interrupt or trap occurs?
                --Trap into the *monitor*.
            --What if the interrupt or trap should go to the guest OS?
                --Examples: page fault, illegal instruction, system call
                --Answer: restart the guest OS, simulating the trap:
                    --Look up the trap vector in the VM's IDT
                    --Just like the processor would have done, the
                      monitor pushes:
                        SS
                        ESP
                        EFLAGS
                        CS
                        EIP
                    --and then starts running the guest OS at the code
                      point given by the entry in its IDT. If this sounds
                      familiar, it's because it's exactly what Bochs/QEMU
                      do, and what the processor does in hardware.
            --What if the interrupt or trap happens because the guest OS
              tried to do something privileged (loading CR0 into eax,
              reading/writing the disk, etc.)?
                --The monitor fakes it: in the "mov %cr0, %eax" example
                  (which a normal OS is allowed to execute but a
                  user-level process is not), the VMM would load a fake
                  value of CR0 into eax and restart the guest OS right
                  after that instruction. (See the sketch below.)
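        --To make the trap-and-emulate flow concrete, here is a minimal
          sketch of a monitor's general-protection-fault handler. All
          type, structure, and helper names (struct vcpu,
          decode_instruction, inject_guest_trap, etc.) are invented for
          illustration; this is the shape of the technique, not anyone's
          actual VMM code:

            #include <stdint.h>

            typedef struct {
                int opcode;     /* which privileged instruction this is */
                int dst_reg;    /* destination register index, if any */
                int length;     /* instruction length in bytes */
            } insn_t;

            enum { OP_MOV_FROM_CR0, OP_CLI, OP_UNKNOWN };

            struct vcpu {
                uint32_t regs[8];    /* guest registers, saved at trap */
                uint32_t cr0;        /* guest's *virtual* CR0, not the hardware's */
                uint32_t eip;        /* guest program counter */
                int      virtual_if; /* guest's virtual interrupt-enable flag */
            };

            /* assumed helpers (not implemented here): */
            insn_t decode_instruction(struct vcpu *v);  /* decode insn at v->eip */
            void inject_guest_trap(struct vcpu *v, int trapno); /* vector via guest IDT */
            void resume_guest(struct vcpu *v);

            #define T_GPFLT 13

            /* Called when the guest kernel (actually running at a
             * non-privileged level) executes a privileged instruction
             * and the hardware delivers a general protection fault to
             * the monitor. */
            void handle_gpf(struct vcpu *v)
            {
                insn_t insn = decode_instruction(v);

                switch (insn.opcode) {
                case OP_MOV_FROM_CR0:
                    /* "mov %cr0, %eax": legal for a real kernel, traps
                     * here. Hand the guest its own fake CR0. */
                    v->regs[insn.dst_reg] = v->cr0;
                    break;
                case OP_CLI:
                    /* Don't disable interrupts on the real hardware;
                     * just record that the *guest* has them off, and
                     * defer delivery of virtual interrupts until the
                     * guest executes sti. */
                    v->virtual_if = 0;
                    break;
                default:
                    /* The guest really did something illegal: reflect
                     * the fault into the guest through its IDT. */
                    inject_guest_trap(v, T_GPFLT);
                    return;
                }

                /* Skip past the emulated instruction; restart the guest. */
                v->eip += insn.length;
                resume_guest(v);
            }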
        --Virtualizing memory:
            --Need to somehow make the guest OS think it's working with
              real physical addresses (so it can set up page tables and
              so forth) but not have those physical addresses be real
              machine addresses (because then we couldn't run multiple
              OSes at the same time: they would all be clamoring for the
              same physical memory).
            --How the heck are we going to solve this problem?
            --Another layer of indirection:

                virtual --> physical --> machine

                **machine means actual H/W pages**
                **physical no longer means hardware bits**

                --on a physical machine, a physical page corresponds to
                  a precise machine page
                --in a virtual machine, the VMM decides where each
                  "physical page" lives: it could be at any machine
                  address, or on the disk

            --How does the VMM implement this?
              Trick: **use the actual hardware MMU to simulate the
              virtual machine's MMU**

                --the guest OS works on "primary" page tables, one set
                  per process
                    --mappings are written by the guest OS, with the
                      accessed/dirty bits written by the MMU
                --the VMM maintains a per-VM pmap (physical map):
                    --maps physical addresses to machine pages (and
                      machine pages back to physical addresses)
                    --only the VMM sees it
                --the monitor keeps a *shadow* of each of the VM's page
                  tables:
                    --a shadow page table maps VA --> machine address
                    --mappings are written by the VMM; accessed/dirty
                      bits are written by the hardware
                    --the guest OS never sees them
                    --they are a function of the pmap and the "primary"
                      page tables
                    --accessed/dirty bits need to be copied back to the
                      primary page tables by the VMM

            --QUESTION: which page table does the hardware see?
                --the primary one?
                --the shadow?
              (Answer: the shadow.)

            --On a page fault, the VMM must (see the sketch below):
                --Look up VPN --> PPN in the VM's (guest OS's) page table
                --Determine where PPN is in machine memory (MPN), if
                  anywhere
                    --may require bringing a page in from disk and
                      getting an MPN for it (the monitor can demand-page
                      the virtual machine)
                --Insert the VPN --> MPN mapping in the shadow page table

            --Issue: have to be careful with the above. Consider this
              case:
                --the guest OS has a page table T mapping V_u --> P_u
                --T itself lives at physical address P_t
                --the guest OS probably has some page table entry that
                  maps V_t --> P_t
                --the VMM stores P_u at machine address M_u and P_t at
                  machine address M_t
              Now we have a problem:
                --if the guest OS changes T to map V_u --> P_u', then
                  the mapping V_u --> M_u in the shadow page table will
                  be wrong
                --or if the guest OS reads/writes V_u itself, then the
                  accessed/dirty bits need to be changed in page table T
                  (but the hardware only updates the shadow page tables)
              Solution: make V_t invalid in the shadow page table. Then
              the monitor gets invoked whenever the guest OS tries to
              access page table T.
                --so the VMM performs the guest OS's page-table updates
                  on its behalf (keeping the shadow in sync)
                --called "tracing faults" (the VMM is tracing the OS's
                  attempts to modify its own page tables)
                --an alternative is "hidden page faults": let the OS
                  work on its own page tables but make V_u invalid in
                  the shadow page tables. Then when a page fault happens
                  (which should not be visible to the guest OS, because
                  as far as the guest OS is concerned its "primary" page
                  tables are in order), the VMM computes the
                  dirty/accessed bits directly and applies them to page
                  table T.
                --complex tradeoffs
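            --Here is a minimal sketch of the page-fault path described
              above, assuming a one-level page table for brevity and
              invented names throughout (struct vm, the pmap layout,
              page_in_from_disk, etc.):

                #include <stdint.h>

                #define PTE_P           0x1          /* present bit */
                #define NO_MACHINE_PAGE 0xffffffffu  /* "physical page is on disk" */
                #define T_PGFLT         14

                struct vm {
                    uint32_t *guest_pt;  /* guest's primary table: VPN -> PPN
                                            (the guest writes it) */
                    uint32_t *shadow_pt; /* what the hardware MMU walks:
                                            VPN -> MPN (the VMM writes it) */
                    uint32_t *pmap;      /* per-VM physical map: PPN -> MPN
                                            (only the VMM sees it) */
                };

                /* assumed helpers: */
                void inject_guest_trap(struct vm *vm, int trapno, uint32_t va);
                uint32_t page_in_from_disk(struct vm *vm, uint32_t ppn);

                void vmm_page_fault(struct vm *vm, uint32_t fault_va)
                {
                    uint32_t vpn  = fault_va >> 12;
                    uint32_t gpte = vm->guest_pt[vpn]; /* step 1: VPN -> PPN,
                                                          per the guest */

                    if (!(gpte & PTE_P)) {
                        /* The guest hasn't mapped this page either, so
                         * the fault belongs to the guest OS: reflect it
                         * through the guest's IDT. */
                        inject_guest_trap(vm, T_PGFLT, fault_va);
                        return;
                    }

                    /* Step 2: PPN -> MPN, per the pmap; the monitor may
                     * have demand-paged this "physical" page to disk,
                     * in which case bring it back in. */
                    uint32_t ppn = gpte >> 12;
                    uint32_t mpn = vm->pmap[ppn];
                    if (mpn == NO_MACHINE_PAGE)
                        mpn = vm->pmap[ppn] = page_in_from_disk(vm, ppn);

                    /* Step 3: install the composed VPN -> MPN mapping in
                     * the shadow table, copying permission bits from the
                     * guest's PTE. The hardware will set accessed/dirty
                     * bits in this entry; the VMM must later copy them
                     * back to the guest's primary table. */
                    vm->shadow_pt[vpn] = (mpn << 12) | (gpte & 0xfff);
                }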
    (iii) Classic virtualization via trap-and-emulate plus binary
          translation

        --The x86 is not (classically) virtualizable, which means:

            1. Privileged state is visible: the CPL is stored in the
               bottom two bits of %cs. Thus

                    movw %cs, %ax

               will tell the guest OS that it's in a VM. Not good: a VM
               shouldn't be able to tell.

            2. Some instructions don't trap; they just mean different
               things in user and kernel mode.
                --For example, POPFL (pop flags) pops the top of the
                  stack into EFLAGS. In user mode it silently does
                  something different: it does not allow user mode to
                  set or clear the interrupt-enable bit. Thus, if the
                  guest kernel executes POPFL to disable interrupts and
                  now believes it cannot be interrupted, the VMM needs
                  to make sure the guest kernel actually won't observe
                  an interrupt.

        --Address this with binary translation:
            --Idea: translate the guest kernel's code into code that
              runs in monitor mode (or at CPL 1)
            --Tricky: have to deal with self-modifying code, have to
              make it fast, have to prevent the translated code from
              messing with the VMM's memory, etc.
            --Once you are translating all of the kernel's binary code,
              translate all instructions that would generate traps from
              executing privileged instructions
            --How can this possibly be fast?
                --Answer: store and cache the translations.
            --QUESTION: are they binary translating guest processes, or
              just the kernel?
                --Mostly just the kernel, but guest user code has to be
                  translated in some annoying corner cases (e.g., the
                  SGDT instruction, which copies out the descriptor
                  table pointer: the guest app needs to see the guest
                  OS's descriptor table, not the hardware's).
            --Once you're doing binary translation, you can get many
              other performance benefits. Trap-and-emulate is expensive;
              with binary translation, you can avoid a lot of the traps
              by rewriting the guest OS's code to execute the needed
              operations directly.

    (iv) Classic virtualization via hardware support

        --Surprisingly, hardware support for trap-and-emulate may not
          always be faster than binary translation.
            --Why? Because trap-and-emulate is inherently a blunt
              instrument.
            --Note that the binary translator can sometimes avoid
              trap-and-emulate entirely, by rewriting the offending code
              to do the right thing.

    (v) Para-virtualization

D. Thoughts

    --There is something artifactual about all of this. VMMs help
      isolate OSes from each other (so a fork bomb within one virtual
      machine doesn't affect the others), but, really, OSes should work
      the same way! With proper scheduling abstractions and containers,
      a fork bomb should not give one process the ability to have its
      descendants take over the CPU.
    --The same is true for isolation: the job of the OS was originally
      isolation, remember!
    --So arguably VMMs are solving problems that OSes should have solved
      long ago.
    --Meaning what? That arguably:
        --OSes should not have exposed wide interfaces to software but
          should instead have exposed narrow interfaces that were easy
          to keep backward compatible.
        --OSes should have done a better job at security and isolation.

---------------------------------------------------------------------------

    reminder: email us by Monday if you choose the demo-in-class option

---------------------------------------------------------------------------

2. Stack smashing

    --Switching gears......

    --Stack smashing history
        --("buffer overflow" is one way to conduct a stack-smashing
          attack.)

    --Demo:
        --mig runs the server, as Namrata
        --my laptop runs an honest client
        --my laptop runs a dishonest client
        --note: if this server had been running as root, we'd have been
          able to get a root shell
        --and if the user/syscall interface doesn't check its arguments
          properly, that interface can be buffer-overflowed too
        --in practice, once you have a user account on a machine, it's
          usually possible to get root access (why? because the syscall
          interface is really hard to secure, as a matter of practice)
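    --The demo's server isn't reproduced here, but the classic
      vulnerable pattern looks roughly like this (an illustrative
      fragment, not the actual demo code):

        #include <stdio.h>

        /* gets() has no idea how big buf is, so a request longer than
         * 128 bytes keeps writing up the stack:
         *
         *     [ buf (128 bytes) ][ saved %ebp ][ return address ][ ... ]
         *
         * A crafted input overwrites the return address with a pointer
         * back into buf, where the same input has placed machine code
         * ("shellcode"); when handle_request returns, the CPU starts
         * executing the attacker's code with the server's privileges. */
        void handle_request(void)
        {
            char buf[128];
            gets(buf);                 /* no bounds check: the whole bug */
            printf("request: %s\n", buf);
        }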
    --Other versions of these attacks:
        --overwriting function pointers
        --smashing the heap
        --return-to-libc (see Tanenbaum)

    --How do people defend against these things?

        --W ^ X (map the stack pages as non-executable, if the hardware
          allows it). But there are some issues....
            --The original 386 could not express this in its page
              tables, but all x86 chips that support PAE ("extended")
              page tables (which are used to let software get at >4GB of
              physical memory even on a 32-bit machine) also support an
              XD bit in those page tables, which means "don't execute
              code in this page."
            --Even on x86s that don't support such page tables,
              segmentation can provide do-not-execute (since the
              permissions in a segment descriptor can express it). The
              disadvantage is that the compiler needs to lay out the
              code and stack to match what the segments require.
            --The bummer with W ^ X, even when it *is* supported, is
              this: some languages not only don't need it but are
              actively harmed by it. The core of the issue is that a
              program written in a safe language (Perl, Python, Java,
              etc.) does not need W ^ X, whereas lots of C programs do.
              Meanwhile, some machines *always* enforce W ^ X, even for
              programs that do not need it. Such enforcement constrains
              certain languages, namely those that do runtime code
              generation (which write code to memory and then execute
              it).

        --Address space randomization

        --StackGuard (in gcc) (see the canary sketch at the end of these
          notes)

        --Another defense: don't use C! CPUs are so fast that a language
          with bounds checking probably isn't going to pay a huge
          performance penalty relative to one without bounds checks.

    --Unfortunately, this is an arms race, and each time a new defense
      arises, a new attack arises too. Here's the most advanced current
      technique, and it defeats many of the above defenses:

        --Smash the stack with a bunch of return addresses. Each return
          address points to a needed instruction followed by "ret"
          (which requires the attacker to have previously identified
          these instruction sequences in the code). That's not too hard
          in CISC code like the x86's, where lots of useful sequences
          are embedded in the binary, even sequences that the programmer
          didn't intend (because instructions are not fixed length, one
          can jump into the middle of an instruction and get a different
          instruction stream). Result: the control flow bounces around
          all of these byte sequences in memory, executing exactly what
          the attacker wanted, but never executing off of the stack.

        --This is called "return-oriented programming". Defending
          against it is hard (though if people used only safe languages,
          that is, languages that do bounds checking and other pointer
          checks, such attacks would be much, much harder).

    --Question: can we instead confine processes and users so that when
      they're broken into, the damage is limited?
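    --To close the loop on StackGuard (mentioned above): here is a
      hand-written approximation of what the compiler-inserted canary
      check does. The names and the fixed canary value are invented;
      real gcc (-fstack-protector) picks a random value per run and
      places the canary in the frame itself, just below the saved return
      address, as part of the prologue/epilogue it emits:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        static unsigned long __stack_canary = 0x9aa64e03UL;
            /* in practice: randomized at program startup */

        void handle_input(const char *input)
        {
            unsigned long canary = __stack_canary; /* "prologue": plant
                                                      the canary */
            char buf[128];

            strcpy(buf, input);  /* unsafe copy: a long input overruns
                                    buf, and a linear overwrite must
                                    trample the canary before it can
                                    reach the return address */

            /* "epilogue": if the canary changed, the frame was smashed;
             * abort rather than return through a corrupted address. */
            if (canary != __stack_canary) {
                fprintf(stderr, "*** stack smashing detected ***\n");
                abort();
            }
        }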