Class 15
CS372H 8 March 2012

On the board
------------

1. Last time

2. Exokernels
    --Background
    --Philosophy
    --High-level approach
    --Examples
    --Performance
    --Extensibility

3. Discussion

---------------------------------------------------------------------------

1. Last time

    --kernel organization
        * monolithic
        * microkernel
        * exokernel
    --discussed Liedtke paper

2. Exokernel

    * Background
    * Philosophy
    * High-level approach
    * Examples
    * Performance
    * Extensibility

    ASK: please summarize the thesis statement of this paper

    A. Background

        --JOS is an exokernel. Surprise!
        --people thought that radical architectures would either not be
          extensible or not be fast
        --xok people are showing that it can be fast

    B. Philosophy

        1. back off a plank of OS religion

            --we said earlier in the course that the point of an OS is to
              abstract the hardware and to multiplex resources. the
              exokernel dudes are basically throwing out one of those
              (which one?)
            --they literally argue (p4): "Additionally, an exokernel should
              export bookkeeping data structures such as freelists, disk
              arm positions, and cached TLB entries so that applications
              can tailor their allocation requests to available resources".
            --Ask yourself whether you believe this is a good idea. We'll
              come back to that discussion.

        2. what is the motivation?

            --one argument is that abstractions are expensive: they force
              applications to pay for pointless generality
            --another argument is that abstractions get in the way of the
              application doing what it really wants.

              This is not about the performance of individual operations
              (e.g., system call or IPC). The problem is application
              structure: you often just can't do what you want in an
              ordinary OS.

              Example: interaction between DBs and OSes in a demand-paging
              scenario: [draw picture]

                1. say the DB maintains its own cache
                2. two problems:
                    --the DB's disk blocks are cached by the buffer cache,
                      and the DB caches its own disk blocks too: waste of
                      space
                    --other problem: what if the DB's cache is in memory
                      that gets swapped out?
                      totally defeats the purpose of the DB cache
                    --if the DB knew that its pages were being swapped out,
                      it could release the physical page (no disk write
                      needed) and, when it needed it, read the page back
                      from the DB file, not the swap area

              Other examples of removing abstractions:
                --Disk blocks vs file systems
                --Phys mem vs address space / process
                --CPU vs time slicing or scheduler activations
                --TLB entries vs address spaces
                --Frame buffer vs windows
                --Ethernet frames vs TCP/IP

        3. so what are they going to do?

            for any problem, expose h/w or info to the app; let the app do
            what it wants

            h/w, kernel, environments, libOS, app

            **an exokernel would not provide address space, virtual cpu,
            file system, TCP**

            instead, give control to the app: phys pages, addr mappings,
            clock interrupts, disk i/o, net i/o

            let the app build a nice address space if it wants, or not

            so the app gets clock interrupts and has to handle being
            scheduled and descheduled!

            should give aggressive apps much more flexibility

        4. challenges

            a. how to multiplex CPU/mem/etc. if you expose them directly
               to apps?
            b. how to get security/isolation despite apps having low-level
               control and freedom?
            c. how to multiplex without understanding: disk (file system),
               incoming TCP packets

    C. High-level approach

        1. Design principles

            One principle: separate resource protection from management
                --basically, the exokernel keeps track of which application
                  owns which resource

            Or, four principles:
                (a) Securely expose hardware
                    ... Or, avoid resource management: "only manage
                        resources to the extent required by protection"
                (b) Expose allocation
                (c) Expose physical names
                    ... Efficient (removes a layer of indirection), and
                        physical names encode important attributes
                (d) Expose revocation

        2. approach:
            a. keep track of who owns what
            b. ask applications to revoke
            c. abort when necessary

        --What's a secure binding?
            * Fancy name for a simple idea: check once, use many times

          In this context, what's a secure binding for memory in their
          MIPS context?
            (answer: TLB entry; once it exists, the hardware keeps using
            it).

            why not a page table entry? answer: the exokernel doesn't
            *have* page tables!!

        --What would be the "exokernel" way of implementing secure
          bindings here?
            --Answer: system calls that allow access to the TLB
            --Supply a capability for a physical page and you get to map
              it (a capability is some bitstring that is presumably hard
              to forge or guess)
            --Why are capabilities important? Why not user ID/process ID?
                ==> Avoid encoding policy. This way, one app can give
                    another the capability, and then the receiving app can
                    use it.

        --What is the motivation for revoking?
            --answer: the app or libOS knows best which resources to
              release
            --how does it work here? steps:
                1. "Please relinquish something"
                2. "Please relinquish in < T microseconds or face
                   consequences"
                3. "I revoked something for you. I'll record it so you can
                   see later what it was"
            --note that this revocation is _visible_. other OS
              architectures revoke resources invisibly.

            [since #3 requires some logic from the OS, the above starts to
            call into question the claim that an exokernel is surely
            simpler.]

---------------------------------------------------------------------------

admin notes

a. review sessions

b. midterm
    --covers readings (book, papers), labs, lectures, homeworks
    --through Tuesday's class
    --question format:
        --short answers
        --design
        --coding
    --Ground rules
        --75 minutes
        --bring ONE two-sided sheet of notes; formatting requirements
          listed on Web page
        --no electronics: no laptops, cell phones, PDAs, etc.

c. how to read a paper

---------------------------------------------------------------------------

    D. Examples

        1. example: memory

            --What are the resources? (phys pages, mappings)
            --How do you allocate a page?
                ... Allocate a physical page, get R/W capabilities
                ... Kernel records owner (capability [?]) and R/W
                    capabilities
                ...
                ... Owner can change capabilities or deallocate the page

            --Wait a minute: in Aegis (an exokernel for MIPS-based
              machines), the environment maintains its own page tables.
              Why is this okay?

            --Interface exposed by kernel to app:

                pa, capab = AllocPage()
                DeallocPage(pa)
                TLBwr(va, pa, capab)    /* MIPS */
                MapPage(va, pa, capab)  /* x86 */

            --Kernel->app upcalls:

                PageFault(va)
                PleaseReleaseAPage()

            --What does the kernel need to do to make multiplexing work?
                --ensure that an app creates mappings only to physical
                  pages that it owns
                --track which environment owns which physical pages
                --decide which application to reclaim pages from, if the
                  system runs out
                --that app gets to decide which of its pages to relinquish

            Solve the DB problem mentioned above:
                a. exokernel needs physical memory for some other app
                b. exokernel sends the DB a PleaseReleaseAPage() upcall
                c. DB picks a clean page, calls DeallocPage(pa)
                d. OR DB picks a dirty page, writes it to disk, then calls
                   DeallocPage(pa)

            Shared memory: two processes want to share memory, for fast
            interaction. note that the traditional "virtual address space"
            abstraction doesn't allow for this.

                process a:
                    (pa, capab) = AllocPage()
                    put 0x5000 -> pa in private table
                    PageFault(0x5000) upcall -> TLBwr(0x5000, pa, capab)
                    give pa to process b (need to tell the exokernel...)
                process b:
                    put 0x6000 -> pa in private table
                    ...

            * Note that the app calls TLBwr(); this is an example of the
              exokernel exposing the hardware (on the MIPS processor).

        2. example: CPU

            --Exokernel breaks time into slices and by default schedules
              round-robin.
            --What does it mean to expose the CPU to the app?
                --Tell the application when it is about to be
                  context-switched *out*
                --Tell the application when it is about to be
                  context-switched *in*
            --Then, on a timer interrupt:
                --CPU jumps from the application into the kernel
                --kernel issues a please_yield() upcall, into the app's
                  context switch handler
                --then the app saves state (registers, etc.)
                --app calls yield() for real
                --so the app is responsible for giving up its time, but if
                  it keeps the CPU for too long, it gets killed.
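To make the protection-vs-management split concrete, here is a toy
user-space model of the memory interface above (AllocPage, MapPage,
DeallocPage, and the PleaseReleaseAPage upcall). The class names, the
capability scheme, and the in-process "kernel" are all invented for
illustration; the real Aegis calls operate on hardware, not Python
objects.

```python
# Toy model: the "kernel" only tracks ownership (protection); the app
# decides which page to give up on revocation (management).
import secrets

class ToyExokernel:
    def __init__(self, npages):
        self.free = list(range(npages))
        self.owners = {}              # pa -> (env, capability)

    def AllocPage(self, env):
        pa = self.free.pop()
        capab = secrets.token_hex(8)  # hard-to-guess bitstring
        self.owners[pa] = (env, capab)
        return pa, capab

    def DeallocPage(self, pa, capab):
        assert self.owners[pa][1] == capab, "bad capability"
        del self.owners[pa]
        self.free.append(pa)

    def MapPage(self, env, va, pa, capab):
        # protection check: you may only map pages you hold a capability for
        assert self.owners[pa][1] == capab, "bad capability"
        env.mappings[va] = pa

    def reclaim(self, env):
        # step 1 of the revocation protocol: a visible, polite upcall
        env.PleaseReleaseAPage()

class ToyDB:
    """A libOS/app that manages its own cache of pages."""
    def __init__(self, kernel):
        self.kernel = kernel
        self.mappings = {}            # va -> pa
        self.clean_pages = []         # (va, pa, capab): droppable pages

    def grab_page(self, va):
        pa, capab = self.kernel.AllocPage(self)
        self.kernel.MapPage(self, va, pa, capab)
        self.clean_pages.append((va, pa, capab))

    def PleaseReleaseAPage(self):
        # the DB, not the kernel, picks the victim: a clean page that it
        # can later re-read from the DB file (no swap traffic needed)
        va, pa, capab = self.clean_pages.pop()
        del self.mappings[va]
        self.kernel.DeallocPage(pa, capab)

kernel = ToyExokernel(npages=2)
db = ToyDB(kernel)
db.grab_page(0x5000)
db.grab_page(0x6000)
kernel.reclaim(db)        # kernel needs physical memory back
print(len(kernel.free))   # -> 1: the DB relinquished one page
```

Note what the toy kernel does not do: it never touches the app's
va -> pa table and never chooses a victim page itself; it only checks
capabilities and keeps the ownership records.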
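The timer-interrupt sequence above (kernel upcall -> app saves its own
state -> app calls yield) can also be simulated. Only the control flow
matches the notes: the "registers" are a dict, the "kernel" is a
function, and all names here are made up for illustration.

```python
# Toy simulation of the please_yield()/yield() protocol from the notes.
class ToyCPUApp:
    def __init__(self):
        self.registers = {"pc": 0}
        self.saved_state = None
        self.holding_mutex = False
        self.yielded = False

    def please_yield(self):
        # upcall from the kernel on a timer interrupt
        if self.holding_mutex:
            # e.g., finish the critical section before giving up the CPU
            self.holding_mutex = False
        self.saved_state = dict(self.registers)  # app saves its OWN state
        self.yield_()

    def yield_(self):
        # in Aegis this is a real system call back into the kernel
        self.yielded = True

    def resume(self):
        # upcall when the kernel gives the CPU back
        self.registers = dict(self.saved_state)

def timer_interrupt(app):
    # kernel side: deliver the upcall; if the app kept the CPU too long,
    # the kernel would kill it (not modeled here)
    app.please_yield()

app = ToyCPUApp()
app.holding_mutex = True
app.registers["pc"] = 42
timer_interrupt(app)        # app releases the mutex, saves state, yields
app.registers["pc"] = 0     # "clobbered" while descheduled
app.resume()                # app restores its own saved registers
print(app.registers["pc"], app.holding_mutex)  # -> 42 False
```

The point of the exercise: state saving and restoring happen in
application code, which is exactly what lets the app slip extra work
(like finishing a critical section) into the context-switch path.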
            --when the kernel decides to resume the application:
                --kernel jumps into the application at the "resume()"
                  upcall. the application then restores its saved registers

            --example uses:
                (a) suppose the time slice ends in the middle of
                        acquire(&mutex)
                        ....
                        release(&mutex)
                    --then, inside please_yield(), the app can complete the
                      critical section and release the mutex (this violates
                      our coding standards, of course).
                (b) what else can the context switch handler do?
                    --default: return control to the kernel ==> the next
                      app runs, in round-robin order
                    --return control to a specific app, via yield()
                    --yield() can be used to implement complex scheduling
                      policies (stride scheduling in the paper)
                    --note that the base scheduler is shared among many
                      apps

        3. Network system
            --ASHs
            --can reply immediately

        4. Exceptions
            --just keep running the app! i.e., deliver the exception right
              to the app.

        5. not all processes will need to save their floating point state.
           so the libOS, in creating the process abstraction, can decide
           whether context switches save floating point state.

        6. QUESTION: why not just have a bit that processes set in
           createprocess() or fork() or exec() that tells the OS whether
           the process should get its FP registers saved?
            --answer: modularity
                --hard to anticipate and codify every possible option.
                --easiest thing to do is just to expose the hardware and
                  let the application decide.

        7. do you really believe this?

    E. Performance

        * syscalls
        * address translation
        * IPC vs. protected control transfer
        * Pipes
        * Scheduling

        --Table 4: why is Aegis so much faster than Ultrix?

            --See Section 5.2. In the paper, this is confusing. What they
              appear to be saying is the following:

            --(Background: on the MIPS, there are two types of TLB misses:
              those from kernel space, and those from user space, roughly
              speaking.
              The general exception handler handles syscalls, TLB misses
              from kernel space, and the double-fault case wherein the
              user TLB miss handler itself faults, say because it tried to
              gain access to a paged-out structure like the user's page
              tables.)

            --(Further background: the processor was designed so that
              handling double faults was fast and did not require saving
              state; there were enough bits in the exception registers to
              "unwind" and get back to user space.)

            --In Ultrix:
                --There are two choices for the syscall handler (since it
                  is also the double-fault handler and the kernel TLB miss
                  handler): require the handler to save all register state
                  on the stack (thereby defeating the point of fast
                  double-fault handling), or else build the handler so
                  that it doesn't touch the registers.
                --Either one carries a cost for the syscall handler and
                  requires careful coding.

            --In Aegis, this problem doesn't occur. Why?
                --Because there's no possibility of a TLB miss in kernel
                  space or of the type of double fault that would invoke
                  the general exception handler. Why?
                --Because the kernel uses only physical addresses (really,
                  pseudo-physical addresses, per the MIPS architecture:
                  physical memory shows up in high VM but not via TLB
                  mappings).

        --Address translation:

            --note, we are in the world of software-managed TLBs. the
              hardware doesn't see page tables, only the TLB.
            --see p.9 for what happens on a TLB miss
            --how do they make it fast for apps to handle virtual memory?
                --answer: the kernel has a large software TLB (this means
                  that the kernel is handling TLB misses without involving
                  the app, but the kernel is not applying any
                  "intelligence": it's just taking entries from the
                  software TLB and placing them in the hardware TLB).
            --wait, I don't understand. why don't regular kernels have
              this? what is the purpose here?
                --answer: for regular kernels, checking the page table
                  structures from within the TLB miss handler is not that
                  expensive.
                  but in an exokernel, *applications* are managing memory,
                  and we want to minimize application/kernel crossings.
                  one way to do that is *not* to give control to the app
                  on a TLB miss. one way to avoid such control is an STLB
                --the kernel then pushes entries from the STLB to the real
                  TLB as needed
                --note that this is normally not needed because memory
                  management stays inside the kernel.
            --final optimization: map the STLB into application space.
              what's the point?
                --saves another crossing: applications don't have to tell
                  the kernel about a mapping if it's already in the STLB

        --IPC vs. protected control transfer

            --why faster on Aegis?
                --answer: it transfers right into the other process; no
                  context switch into the kernel, i.e., no register saving
            (--but how do they get isolation and access control?)
            (--how do they make sure that process A doesn't switch into
              process B at any old place?)
                --answer: see 5.1.2: "protected entry context". there is a
                  list of acceptable entry points. then, do access control
                  in the target to make sure that the caller is a process
                  it is comfortable being called by
            --what do you think of the comparison to L3 (Table 6 and
              accompanying text)?
                --answer: this seems a bit unfair. the x86 cannot avoid
                  TLB flushes on context switch. if the exokernel had to
                  pay for those, it would also incur that hit.
                --similarly, if the exokernel had to build an actual IPC
                  abstraction, it would have to pay more (see Table 8)

        --Pipes

            --why so much faster on Aegis?
                --answer: they map the memory into both processes. very
                  fast.
            --why can't Ultrix do this?
                --Ultrix is constrained by the POSIX API: pass in a file
                  descriptor and memory. the memory is owned by the
                  process and has to get buffered by the kernel for
                  delivery to a different buffer.
            --example of how changing the interface makes things far
              faster

        --ASHs: application-specific handlers
            --for some messages, a protocol endpoint can immediately reply
              without control going up into the application (example: ACKs
              in TCP).
            --"vectoring process can be dynamic".
              So different message types can immediately be placed into
              different message queues. This is like loading
              special-purpose code into the device driver or networking
              stack.
            --Plus, this is an application of the sandboxing paper that we
              read.
            --But this is more complex than it seems: what happens when
              the ASH accesses virtual memory? That might generate a page
              fault, requiring the invocation of an app-level fault
              handler. So now the "kernel context" (in which the ASH is
              running) depends on the app's fault handler.

    F. Extensibility

        --ASHs
        --RPC
        --Page tables
        --Scheduling (section 7.3)
            --can get isolation of processes, but fine-grained scheduling
              of threads
            --Yield() takes an argument: the target process
            --the idea is that a logical application consisting of 5
              threads or processes would just schedule those 5 threads
              itself. the application would be given time on the CPU, and
              then it could allocate that time to its own processes as it
              wished.
            --why won't this work in regular Unix?
                --answer: because processes can't schedule other
                  processes, since the interfaces aren't exposed.

3. Discussion

    A. What did you think of this paper?
        --did you buy its sales pitch?

    B. Comparison
        --L3 and Exokernel?
            --modularity (in-kernel extensions) is arguably harder in L3
              because it's so tuned for the one thing
            --but Liedtke would say that it's so well tuned you should
              just build your extensions on top of the microkernel
            --extensibility is not really part of the picture: L3 exposes
              a single interface.

    C. Why aren't people using any of this stuff?
        --Arguably OS X is like Mach, but Windows and Linux are both
          monolithic kernels.

    D. Note: in some ways, exokernel design is ludicrous