Class 27
CS 372H 29 April 2010

On the board
------------

1. Virtual machines, continued

2. Protection and security
    --stack smashing
    --Unix security model

---------------------------------------------------------------------------

0. Last time

--Two-phase commit. Correction: if acknowledgments go back from the
  workers to the coordinator at the end of phase 2, then the coordinator
  does not have to keep the log entry for that transaction forever.

1. Virtual machines

A. Intro

B. What's required for the VMM to make the OS think it's running on real
   hardware?

   (This is the core technical challenge. Once this works, there are many
   other problems to solve, ranging from memory management to I/O
   virtualization to better performance. But the core challenge is to
   make the OS believe it is running on real hardware.)

    --Different approaches.

    --You may have heard "the x86 is not virtualizable". We'll discuss
      what that means and how VMware et al. get around the problem.

    --Focus on CPU virtualization and memory virtualization.

    --Approaches:

    (i) Binary interpretation (example: Bochs)

        --simplest VMM approach
        --See lecture 3 notes and handout. Here's a quick review....
        --Build a simulation of all of the hardware:
            --*CPU*: a loop that fetches each instruction, decodes it,
              and simulates its effect on the machine state
            --*Memory*: physical memory is just an array; simulate the
              MMU on all memory accesses
            --*I/O*: simulate I/O devices, programmed I/O, DMA, interrupts
        +: simple!
        -: too slow! 100x slowdown (mainly from CPU/MMU, not I/O)

    (ii) Classic virtualization: trap-and-emulate

        (Doesn't work by itself on the x86, for reasons we will see in a
        moment. For now, we just cover the technique.)

        --Observation: most instructions behave the same regardless of
          processor privilege level. EXAMPLE: incl %eax
        --Idea: just give the guest OS's instructions to the CPU. Let the
          guest OS pretend to be the OS.
        --Safety issue: how can the hypervisor get the CPU back, or
          prevent the guest OS from executing "cli" or "hlt", or writing
          all over the other guests' memory?
        --Answer: use the hardware's protection mechanisms, as we have
          been doing all semester to isolate processes.

        --Virtualizing the CPU:
            --Run the virtual machine's OS directly on the CPU at a
              non-privileged level
            --Most instructions just work
            --Privileged instructions trap into the monitor; the monitor
              then simulates the effect of running the instruction
            --Doesn't fully work on the x86. We'll get to that in a
              second. For now, discuss how to virtualize traps and memory.

            [Keep a "process table" entry for each virtualized OS:
                --saved registers
                --saved privileged registers
                --IDT, etc.
                --etc.]

        --Virtualizing traps:
            --What happens when an interrupt or trap occurs?
                --Trap into the *monitor*.
            --What if the interrupt or trap should go to the guest OS?
                --Examples: page fault, illegal instruction, system call
                --Answer: restart the guest OS, simulating the trap:
                    --Look up the trap vector in the VM's IDT
                    --Just like the processor would have done, the
                      monitor pushes:
                        SS
                        ESP
                        EFLAGS
                        CS
                        EIP
                    --and then starts running the guest OS at the code
                      point given by the entry in its IDT. If this sounds
                      familiar, it's because it's exactly what Bochs/QEMU
                      do, and what the processor does in hardware.
            --What if the interrupt or trap happens because the guest OS
              tried to do something privileged (loading CR0 into eax,
              reading/writing the disk, etc.)?
                --The monitor fakes it: in the "mov %cr0, %eax" example
                  (which a normal OS is allowed to execute but a
                  user-level process is not), the VMM would load a fake
                  value of CR0 into eax and restart the guest OS right
                  after that instruction. (See the sketch below.)
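        --To make the trap-and-emulate flow concrete, here is a minimal
          sketch of a monitor's general-protection-fault handler. All
          type, structure, and helper names (struct vcpu,
          decode_instruction, inject_guest_trap, etc.) are invented for
          illustration; this is the shape of the technique, not anyone's
          actual VMM code:

            #include <stdint.h>

            typedef struct {
                int opcode;     /* which privileged instruction this is */
                int dst_reg;    /* destination register index, if any */
                int length;     /* instruction length in bytes */
            } insn_t;

            enum { OP_MOV_FROM_CR0, OP_CLI, OP_UNKNOWN };

            struct vcpu {
                uint32_t regs[8];    /* guest registers, saved at trap */
                uint32_t cr0;        /* guest's *virtual* CR0, not the hardware's */
                uint32_t eip;        /* guest program counter */
                int      virtual_if; /* guest's virtual interrupt-enable flag */
            };

            /* assumed helpers (not implemented here): */
            insn_t decode_instruction(struct vcpu *v);  /* decode insn at v->eip */
            void inject_guest_trap(struct vcpu *v, int trapno); /* vector via guest IDT */
            void resume_guest(struct vcpu *v);

            #define T_GPFLT 13

            /* Called when the guest kernel (actually running at a
             * non-privileged level) executes a privileged instruction
             * and the hardware delivers a general protection fault to
             * the monitor. */
            void handle_gpf(struct vcpu *v)
            {
                insn_t insn = decode_instruction(v);

                switch (insn.opcode) {
                case OP_MOV_FROM_CR0:
                    /* "mov %cr0, %eax": legal for a real kernel, traps
                     * here. Hand the guest its own fake CR0. */
                    v->regs[insn.dst_reg] = v->cr0;
                    break;
                case OP_CLI:
                    /* Don't disable interrupts on the real hardware;
                     * just record that the *guest* has them off, and
                     * defer delivery of virtual interrupts until the
                     * guest executes sti. */
                    v->virtual_if = 0;
                    break;
                default:
                    /* The guest really did something illegal: reflect
                     * the fault into the guest through its IDT. */
                    inject_guest_trap(v, T_GPFLT);
                    return;
                }

                /* Skip past the emulated instruction; restart the guest. */
                v->eip += insn.length;
                resume_guest(v);
            }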
        --Virtualizing memory:
            --Need to somehow make the guest OS think it's working with
              real physical addresses (so it can set up page tables and
              so forth) but not have those physical addresses be real
              machine addresses (because then we couldn't run multiple
              OSes at the same time: they would all be clamoring for the
              same physical memory).
            --How the heck are we going to solve this problem?
            --Another layer of indirection:

                virtual --> physical --> machine

                **machine means actual H/W pages**
                **physical no longer means hardware bits**

                --on a physical machine, a physical page corresponds to
                  a precise machine page
                --in a virtual machine, the VMM decides where each
                  "physical page" lives: it could be at any machine
                  address, or on the disk

            --How does the VMM implement this?
              Trick: **use the actual hardware MMU to simulate the
              virtual machine's MMU**

                --the guest OS works on "primary" page tables, one set
                  per process
                    --mappings are written by the guest OS, with the
                      accessed/dirty bits written by the MMU
                --the VMM maintains a per-VM pmap (physical map):
                    --maps physical addresses to machine pages (and
                      machine pages back to physical addresses)
                    --only the VMM sees it
                --the monitor keeps a *shadow* of each of the VM's page
                  tables:
                    --a shadow page table maps VA --> machine address
                    --mappings are written by the VMM; accessed/dirty
                      bits are written by the hardware
                    --the guest OS never sees them
                    --they are a function of the pmap and the "primary"
                      page tables
                    --accessed/dirty bits need to be copied back to the
                      primary page tables by the VMM

            --QUESTION: which page table does the hardware see?
                --the primary one?
                --the shadow?
              (Answer: the shadow.)

            --On a page fault, the VMM must (see the sketch below):
                --Look up VPN --> PPN in the VM's (guest OS's) page table
                --Determine where PPN is in machine memory (MPN), if
                  anywhere
                    --may require bringing a page in from disk and
                      getting an MPN for it (the monitor can demand-page
                      the virtual machine)
                --Insert the VPN --> MPN mapping in the shadow page table

            --Issue: have to be careful with the above. Consider this
              case:
                --the guest OS has a page table T mapping V_u --> P_u
                --T itself lives at physical address P_t
                --the guest OS probably has some page table entry that
                  maps V_t --> P_t
                --the VMM stores P_u at machine address M_u and P_t at
                  machine address M_t
              Now we have a problem:
                --if the guest OS changes T to map V_u --> P_u', then
                  the mapping V_u --> M_u in the shadow page table will
                  be wrong
                --or if the guest OS reads/writes V_u itself, then the
                  accessed/dirty bits need to be changed in page table T
                  (but the hardware only updates the shadow page tables)
              Solution: make V_t invalid in the shadow page table. Then
              the monitor gets invoked whenever the guest OS tries to
              access page table T.
                --so the VMM performs the guest OS's page-table updates
                  on its behalf (keeping the shadow in sync)
                --called "tracing faults" (the VMM is tracing the OS's
                  attempts to modify its own page tables)
                --an alternative is "hidden page faults": let the OS
                  work on its own page tables but make V_u invalid in
                  the shadow page tables. Then when a page fault happens
                  (which should not be visible to the guest OS, because
                  as far as the guest OS is concerned its "primary" page
                  tables are in order), the VMM computes the
                  dirty/accessed bits directly and applies them to page
                  table T.
                --complex tradeoffs
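            --Here is a minimal sketch of the page-fault path described
              above, assuming a one-level page table for brevity and
              invented names throughout (struct vm, the pmap layout,
              page_in_from_disk, etc.):

                #include <stdint.h>

                #define PTE_P           0x1          /* present bit */
                #define NO_MACHINE_PAGE 0xffffffffu  /* "physical page is on disk" */
                #define T_PGFLT         14

                struct vm {
                    uint32_t *guest_pt;  /* guest's primary table: VPN -> PPN
                                            (the guest writes it) */
                    uint32_t *shadow_pt; /* what the hardware MMU walks:
                                            VPN -> MPN (the VMM writes it) */
                    uint32_t *pmap;      /* per-VM physical map: PPN -> MPN
                                            (only the VMM sees it) */
                };

                /* assumed helpers: */
                void inject_guest_trap(struct vm *vm, int trapno, uint32_t va);
                uint32_t page_in_from_disk(struct vm *vm, uint32_t ppn);

                void vmm_page_fault(struct vm *vm, uint32_t fault_va)
                {
                    uint32_t vpn  = fault_va >> 12;
                    uint32_t gpte = vm->guest_pt[vpn]; /* step 1: VPN -> PPN,
                                                          per the guest */

                    if (!(gpte & PTE_P)) {
                        /* The guest hasn't mapped this page either, so
                         * the fault belongs to the guest OS: reflect it
                         * through the guest's IDT. */
                        inject_guest_trap(vm, T_PGFLT, fault_va);
                        return;
                    }

                    /* Step 2: PPN -> MPN, per the pmap; the monitor may
                     * have demand-paged this "physical" page to disk,
                     * in which case bring it back in. */
                    uint32_t ppn = gpte >> 12;
                    uint32_t mpn = vm->pmap[ppn];
                    if (mpn == NO_MACHINE_PAGE)
                        mpn = vm->pmap[ppn] = page_in_from_disk(vm, ppn);

                    /* Step 3: install the composed VPN -> MPN mapping in
                     * the shadow table, copying permission bits from the
                     * guest's PTE. The hardware will set accessed/dirty
                     * bits in this entry; the VMM must later copy them
                     * back to the guest's primary table. */
                    vm->shadow_pt[vpn] = (mpn << 12) | (gpte & 0xfff);
                }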
    (iii) Classic virtualization via trap-and-emulate plus binary
          translation

        --The x86 is not (classically) virtualizable, which means:

            1. Privileged state is visible: the CPL is stored in the
               bottom two bits of %cs. Thus

                    movw %cs, %ax

               will tell the guest OS that it's in a VM. Not good: a VM
               shouldn't be able to tell.

            2. Some instructions don't trap; they just mean different
               things in user and kernel mode.
                --For example, POPFL (pop flags) pops the top of the
                  stack into EFLAGS. In user mode it silently does
                  something different: it does not allow user mode to
                  set or clear the interrupt-enable bit. Thus, if the
                  guest kernel executes POPFL to disable interrupts and
                  now believes it cannot be interrupted, the VMM needs
                  to make sure the guest kernel actually won't observe
                  an interrupt.

        --Address this with binary translation:
            --Idea: translate the guest kernel's code into code that
              runs in monitor mode (or at CPL 1)
            --Tricky: have to deal with self-modifying code, have to
              make it fast, have to prevent the translated code from
              messing with the VMM's memory, etc.
            --Once you are translating all of the kernel's binary code,
              translate all instructions that would generate traps from
              executing privileged instructions
            --How can this possibly be fast?
                --Answer: store and cache the translations.
            --QUESTION: are they binary translating guest processes, or
              just the kernel?
                --Mostly just the kernel, but guest user code has to be
                  translated in some annoying corner cases (e.g., the
                  SGDT instruction, which copies out the descriptor
                  table pointer: the guest app needs to see the guest
                  OS's descriptor table, not the hardware's).
            --Once you're doing binary translation, you can get many
              other performance benefits. Trap-and-emulate is expensive;
              with binary translation, you can avoid a lot of the traps
              by rewriting the guest OS's code to execute the needed
              operations directly.

    (iv) Classic virtualization via hardware support

        --Surprisingly, hardware support for trap-and-emulate may not
          always be faster than binary translation.
            --Why? Because trap-and-emulate is inherently a blunt
              instrument.
            --Note that the binary translator can sometimes avoid
              trap-and-emulate entirely, by rewriting the offending code
              to do the right thing.

    (v) Para-virtualization

D. Thoughts

    --There is something artifactual about all of this. VMMs help
      isolate OSes from each other (so a fork bomb within one virtual
      machine doesn't affect the others), but, really, OSes should work
      the same way! With proper scheduling abstractions and containers,
      a fork bomb should not give one process the ability to have its
      descendants take over the CPU.
    --The same is true for isolation: the job of the OS was originally
      isolation, remember!
    --So arguably VMMs are solving problems that OSes should have solved
      long ago.
    --Meaning what? That arguably:
        --OSes should not have exposed wide interfaces to software but
          should instead have exposed narrow interfaces that were easy
          to keep backward compatible.
        --OSes should have done a better job at security and isolation.

---------------------------------------------------------------------------

    reminder: email us by Monday if you choose the demo-in-class option

---------------------------------------------------------------------------

2. Stack smashing

    --Switching gears......

    --Stack smashing history
        --("buffer overflow" is one way to conduct a stack-smashing
          attack.)

    --Demo:
        --mig runs the server, as Namrata
        --my laptop runs an honest client
        --my laptop runs a dishonest client
        --note: if this server had been running as root, we'd have been
          able to get a root shell
        --and if the user/syscall interface doesn't check its arguments
          properly, that interface can be buffer-overflowed too
        --in practice, once you have a user account on a machine, it's
          usually possible to get root access (why? because the syscall
          interface is really hard to secure, as a matter of practice)
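    --The demo's server isn't reproduced here, but the classic
      vulnerable pattern looks roughly like this (an illustrative
      fragment, not the actual demo code):

        #include <stdio.h>

        /* gets() has no idea how big buf is, so a request longer than
         * 128 bytes keeps writing up the stack:
         *
         *     [ buf (128 bytes) ][ saved %ebp ][ return address ][ ... ]
         *
         * A crafted input overwrites the return address with a pointer
         * back into buf, where the same input has placed machine code
         * ("shellcode"); when handle_request returns, the CPU starts
         * executing the attacker's code with the server's privileges. */
        void handle_request(void)
        {
            char buf[128];
            gets(buf);                 /* no bounds check: the whole bug */
            printf("request: %s\n", buf);
        }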
    --Other versions of these attacks:
        --overwriting function pointers
        --smashing the heap
        --return-to-libc (see Tanenbaum)

    --How do people defend against these things?

        --W ^ X (map the stack pages as non-executable, if the hardware
          allows it). But there are some issues....
            --The original 386 could not express this in its page
              tables, but all x86 chips that support PAE ("extended")
              page tables (which are used to let software get at >4GB of
              physical memory even on a 32-bit machine) also support an
              XD bit in those page tables, which means "don't execute
              code in this page."
            --Even on x86s that don't support such page tables,
              segmentation can provide do-not-execute (since the
              permissions in a segment descriptor can express it). The
              disadvantage is that the compiler needs to lay out the
              code and stack to match what the segments require.
            --The bummer with W ^ X, even when it *is* supported, is
              this: some languages not only don't need it but are
              actively harmed by it. The core of the issue is that a
              program written in a safe language (Perl, Python, Java,
              etc.) does not need W ^ X, whereas lots of C programs do.
              Meanwhile, some machines *always* enforce W ^ X, even for
              programs that do not need it. Such enforcement constrains
              certain languages, namely those that do runtime code
              generation (which write code to memory and then execute
              it).

        --Address space randomization

        --StackGuard (in gcc) (see the canary sketch at the end of these
          notes)

        --Another defense: don't use C! CPUs are so fast that a language
          with bounds checking probably isn't going to pay a huge
          performance penalty relative to one without bounds checks.

    --Unfortunately, this is an arms race, and each time a new defense
      arises, a new attack arises too. Here's the most advanced current
      technique, and it defeats many of the above defenses:

        --Smash the stack with a bunch of return addresses. Each return
          address points to a needed instruction followed by "ret"
          (which requires the attacker to have previously identified
          these instruction sequences in the code). That's not too hard
          in CISC code like the x86's, where lots of useful sequences
          are embedded in the binary, even sequences that the programmer
          didn't intend (because instructions are not fixed length, one
          can jump into the middle of an instruction and get a different
          instruction stream). Result: the control flow bounces around
          all of these byte sequences in memory, executing exactly what
          the attacker wanted, but never executing off of the stack.

        --This is called "return-oriented programming". Defending
          against it is hard (though if people used only safe languages,
          that is, languages that do bounds checking and other pointer
          checks, such attacks would be much, much harder).

    --Question: can we instead confine processes and users so that when
      they're broken into, the damage is limited?
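    --To close the loop on StackGuard (mentioned above): here is a
      hand-written approximation of what the compiler-inserted canary
      check does. The names and the fixed canary value are invented;
      real gcc (-fstack-protector) picks a random value per run and
      places the canary in the frame itself, just below the saved return
      address, as part of the prologue/epilogue it emits:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        static unsigned long __stack_canary = 0x9aa64e03UL;
            /* in practice: randomized at program startup */

        void handle_input(const char *input)
        {
            unsigned long canary = __stack_canary; /* "prologue": plant
                                                      the canary */
            char buf[128];

            strcpy(buf, input);  /* unsafe copy: a long input overruns
                                    buf, and a linear overwrite must
                                    trample the canary before it can
                                    reach the return address */

            /* "epilogue": if the canary changed, the frame was smashed;
             * abort rather than return through a corrupted address. */
            if (canary != __stack_canary) {
                fprintf(stderr, "*** stack smashing detected ***\n");
                abort();
            }
        }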