Class 15
CS372H 8 March 2012

On the board
------------

1. Last time

2. Exokernels
    --Background
    --Philosophy
    --High-level approach
    --Examples
    --Performance
    --Extensibility

3. Discussion

---------------------------------------------------------------------------

1. Last time

    --kernel organization
        * monolithic
        * microkernel
        * exokernel
    --discussed Liedtke paper

2. Exokernel

    * Background
    * Philosophy
    * High-level approach
    * Examples
    * Performance
    * Extensibility

    ASK: please summarize the thesis statement of this paper

    A. Background

        --JOS is an exokernel. Surprise!
        --people thought that radical architectures would either not be
          extensible or not be fast
        --xok people are showing that it can be fast

    B. Philosophy

        1. back off a plank of OS religion

            --we said earlier in the course that the point of an OS is to
              abstract the hardware and to multiplex resources. the
              exokernel dudes are basically throwing out one of those
              (which one?)
            --they literally argue (p4): "Additionally, an exokernel should
              export bookkeeping data structures such as freelists, disk
              arm positions, and cached TLB entries so that applications
              can tailor their allocation requests to available resources".
            --Ask yourself whether you believe this is a good idea. We'll
              come back to that discussion.

        2. what is the motivation?

            --one argument is that abstractions are expensive: they force
              applications to pay for pointless generality
            --another argument is that abstractions get in the way of the
              application doing what it really wants.

              This is not about the performance of individual operations
              (e.g., system call or IPC). The problem is application
              structure: you often just can't do what you want in an
              ordinary OS.

              Example: interaction between DBs and OSes in a demand-paging
              scenario: [draw picture]

                1. say the DB maintains its own cache
                2. two problems:
                    --the DB's disk blocks are cached by the buffer cache,
                      and the DB caches its own disk blocks too: waste of
                      space
                    --other problem: what if the DB's cache is in memory
                      that gets swapped out?
                      totally defeats the purpose of the DB cache
                    --if the DB knew that its pages were being swapped out,
                      it could release the physical page (no disk write
                      needed) and, when it needed it, read the page back
                      from the DB file, not the swap area

              Other examples of removing abstractions:
                --Disk blocks vs file systems
                --Phys mem vs address space / process
                --CPU vs time slicing or scheduler activations
                --TLB entries vs address spaces
                --Frame buffer vs windows
                --Ethernet frames vs TCP/IP

        3. so what are they going to do?

            for any problem, expose h/w or info to the app; let the app do
            what it wants

            h/w, kernel, environments, libOS, app

            **an exokernel would not provide address space, virtual cpu,
            file system, TCP**

            instead, give control to the app: phys pages, addr mappings,
            clock interrupts, disk i/o, net i/o

            let the app build a nice address space if it wants, or not

            so the app gets clock interrupts and has to handle being
            scheduled and descheduled!

            should give aggressive apps much more flexibility

        4. challenges

            a. how to multiplex CPU/mem/etc. if you expose them directly
               to apps?
            b. how to get security/isolation despite apps having low-level
               control and freedom?
            c. how to multiplex without understanding: disk (file system),
               incoming TCP packets

    C. High-level approach

        1. Design principles

            One principle: separate resource protection from management
                --basically, the exokernel keeps track of which application
                  owns which resource

            Or, four principles:
                (a) Securely expose hardware
                    ... Or, avoid resource management: "only manage
                        resources to the extent required by protection"
                (b) Expose allocation
                (c) Expose physical names
                    ... Efficient (removes a layer of indirection), and
                        physical names encode important attributes
                (d) Expose revocation

        2. approach:
            a. keep track of who owns what
            b. ask applications to revoke
            c. abort when necessary

        --What's a secure binding?
            * Fancy name for a simple idea: check once, use many times

          In this context, what's a secure binding for memory in their
          MIPS context?
            (answer: TLB entry; once it exists, the hardware keeps using
            it).

            why not a page table entry? answer: the exokernel doesn't
            *have* page tables!!

        --What would be the "exokernel" way of implementing secure
          bindings here?
            --Answer: system calls that allow access to the TLB
            --Supply a capability for a physical page and you get to map
              it (a capability is some bitstring that is presumably hard
              to forge or guess)
            --Why are capabilities important? Why not user ID/process ID?
                ==> Avoid encoding policy. This way, one app can give
                    another the capability, and then the receiving app can
                    use it.

        --What is the motivation for revoking?
            --answer: the app or libOS knows best which resources to
              release
            --how does it work here? steps:
                1. "Please relinquish something"
                2. "Please relinquish in < T microseconds or face
                   consequences"
                3. "I revoked something for you. I'll record it so you can
                   see later what it was"
            --note that this revocation is _visible_. other OS
              architectures revoke resources invisibly.

            [since #3 requires some logic from the OS, the above starts to
            call into question the claim that an exokernel is surely
            simpler.]

---------------------------------------------------------------------------

admin notes

a. review sessions

b. midterm
    --covers readings (book, papers), labs, lectures, homeworks
    --through Tuesday's class
    --question format:
        --short answers
        --design
        --coding
    --Ground rules
        --75 minutes
        --bring ONE two-sided sheet of notes; formatting requirements
          listed on Web page
        --no electronics: no laptops, cell phones, PDAs, etc.

c. how to read a paper

---------------------------------------------------------------------------

    D. Examples

        1. example: memory

            --What are the resources? (phys pages, mappings)
            --How do you allocate a page?
                ... Allocate a physical page, get R/W capabilities
                ... Kernel records owner (capability [?]) and R/W
                    capabilities
                ...
                ... Owner can change capabilities or deallocate the page

            --Wait a minute: in Aegis (an exokernel for MIPS-based
              machines), the environment maintains its own page tables.
              Why is this okay?

            --Interface exposed by kernel to app:

                pa, capab = AllocPage()
                DeallocPage(pa)
                TLBwr(va, pa, capab)    /* MIPS */
                MapPage(va, pa, capab)  /* x86 */

            --Kernel->app upcalls:

                PageFault(va)
                PleaseReleaseAPage()

            --What does the kernel need to do to make multiplexing work?
                --ensure that an app creates mappings only to physical
                  pages that it owns
                --track which environment owns which physical pages
                --decide which application to reclaim pages from, if the
                  system runs out
                --that app gets to decide which of its pages to relinquish

            Solve the DB problem mentioned above:
                a. exokernel needs physical memory for some other app
                b. exokernel sends the DB a PleaseReleaseAPage() upcall
                c. DB picks a clean page, calls DeallocPage(pa)
                d. OR DB picks a dirty page, writes it to disk, then calls
                   DeallocPage(pa)

            Shared memory: two processes want to share memory, for fast
            interaction. note that the traditional "virtual address space"
            abstraction doesn't allow for this.

                process a:
                    (pa, capab) = AllocPage()
                    put 0x5000 -> pa in private table
                    PageFault(0x5000) upcall -> TLBwr(0x5000, pa, capab)
                    give pa to process b (need to tell the exokernel...)
                process b:
                    put 0x6000 -> pa in private table
                    ...

            * Note that the app calls TLBwr(); this is an example of the
              exokernel exposing the hardware (on the MIPS processor).

        2. example: CPU

            --Exokernel breaks time into slices and by default schedules
              round-robin.
            --What does it mean to expose the CPU to the app?
                --Tell the application when it is about to be
                  context-switched *out*
                --Tell the application when it is about to be
                  context-switched *in*
            --Then, on a timer interrupt:
                --CPU jumps from the application into the kernel
                --kernel issues a please_yield() upcall, into the app's
                  context switch handler
                --then the app saves state (registers, etc.)
                --app calls yield() for real
                --so the app is responsible for giving up its time, but if
                  it keeps the CPU for too long, it gets killed.
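To make the protection-vs-management split concrete, here is a toy
user-space model of the memory interface above (AllocPage, MapPage,
DeallocPage, and the PleaseReleaseAPage upcall). The class names, the
capability scheme, and the in-process "kernel" are all invented for
illustration; the real Aegis calls operate on hardware, not Python
objects.

```python
# Toy model: the "kernel" only tracks ownership (protection); the app
# decides which page to give up on revocation (management).
import secrets

class ToyExokernel:
    def __init__(self, npages):
        self.free = list(range(npages))
        self.owners = {}              # pa -> (env, capability)

    def AllocPage(self, env):
        pa = self.free.pop()
        capab = secrets.token_hex(8)  # hard-to-guess bitstring
        self.owners[pa] = (env, capab)
        return pa, capab

    def DeallocPage(self, pa, capab):
        assert self.owners[pa][1] == capab, "bad capability"
        del self.owners[pa]
        self.free.append(pa)

    def MapPage(self, env, va, pa, capab):
        # protection check: you may only map pages you hold a capability for
        assert self.owners[pa][1] == capab, "bad capability"
        env.mappings[va] = pa

    def reclaim(self, env):
        # step 1 of the revocation protocol: a visible, polite upcall
        env.PleaseReleaseAPage()

class ToyDB:
    """A libOS/app that manages its own cache of pages."""
    def __init__(self, kernel):
        self.kernel = kernel
        self.mappings = {}            # va -> pa
        self.clean_pages = []         # (va, pa, capab): droppable pages

    def grab_page(self, va):
        pa, capab = self.kernel.AllocPage(self)
        self.kernel.MapPage(self, va, pa, capab)
        self.clean_pages.append((va, pa, capab))

    def PleaseReleaseAPage(self):
        # the DB, not the kernel, picks the victim: a clean page that it
        # can later re-read from the DB file (no swap traffic needed)
        va, pa, capab = self.clean_pages.pop()
        del self.mappings[va]
        self.kernel.DeallocPage(pa, capab)

kernel = ToyExokernel(npages=2)
db = ToyDB(kernel)
db.grab_page(0x5000)
db.grab_page(0x6000)
kernel.reclaim(db)        # kernel needs physical memory back
print(len(kernel.free))   # -> 1: the DB relinquished one page
```

Note what the toy kernel does not do: it never touches the app's
va -> pa table and never chooses a victim page itself; it only checks
capabilities and keeps the ownership records.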
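The timer-interrupt sequence above (kernel upcall -> app saves its own
state -> app calls yield) can also be simulated. Only the control flow
matches the notes: the "registers" are a dict, the "kernel" is a
function, and all names here are made up for illustration.

```python
# Toy simulation of the please_yield()/yield() protocol from the notes.
class ToyCPUApp:
    def __init__(self):
        self.registers = {"pc": 0}
        self.saved_state = None
        self.holding_mutex = False
        self.yielded = False

    def please_yield(self):
        # upcall from the kernel on a timer interrupt
        if self.holding_mutex:
            # e.g., finish the critical section before giving up the CPU
            self.holding_mutex = False
        self.saved_state = dict(self.registers)  # app saves its OWN state
        self.yield_()

    def yield_(self):
        # in Aegis this is a real system call back into the kernel
        self.yielded = True

    def resume(self):
        # upcall when the kernel gives the CPU back
        self.registers = dict(self.saved_state)

def timer_interrupt(app):
    # kernel side: deliver the upcall; if the app kept the CPU too long,
    # the kernel would kill it (not modeled here)
    app.please_yield()

app = ToyCPUApp()
app.holding_mutex = True
app.registers["pc"] = 42
timer_interrupt(app)        # app releases the mutex, saves state, yields
app.registers["pc"] = 0     # "clobbered" while descheduled
app.resume()                # app restores its own saved registers
print(app.registers["pc"], app.holding_mutex)  # -> 42 False
```

The point of the exercise: state saving and restoring happen in
application code, which is exactly what lets the app slip extra work
(like finishing a critical section) into the context-switch path.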
            --when the kernel decides to resume the application:
                --kernel jumps into the application at the "resume()"
                  upcall. the application then restores its saved registers

            --example uses:
                (a) suppose the time slice ends in the middle of
                        acquire(&mutex)
                        ....
                        release(&mutex)
                    --then, inside please_yield(), the app can complete the
                      critical section and release the mutex (this violates
                      our coding standards, of course).
                (b) what else can the context switch handler do?
                    --default: return control to the kernel ==> the next
                      app runs, in round-robin order
                    --return control to a specific app, via yield()
                    --yield() can be used to implement complex scheduling
                      policies (stride scheduling in the paper)
                    --note that the base scheduler is shared among many
                      apps

        3. Network system
            --ASHs
            --can reply immediately

        4. Exceptions
            --just keep running the app! i.e., deliver the exception right
              to the app.

        5. not all processes will need to save their floating point state.
           so the libOS, in creating the process abstraction, can decide
           whether context switches save floating point state.

        6. QUESTION: why not just have a bit that processes set in
           createprocess() or fork() or exec() that tells the OS whether
           the process should get its FP registers saved?
            --answer: modularity
                --hard to anticipate and codify every possible option.
                --easiest thing to do is just to expose the hardware and
                  let the application decide.

        7. do you really believe this?

    E. Performance

        * syscalls
        * address translation
        * IPC vs. protected control transfer
        * Pipes
        * Scheduling

        --Table 4: why is Aegis so much faster than Ultrix?

            --See Section 5.2. In the paper, this is confusing. What they
              appear to be saying is the following:

            --(Background: on the MIPS, there are two types of TLB misses:
              those from kernel space, and those from user space, roughly
              speaking.
              The general exception handler handles syscalls, TLB misses
              from kernel space, and the double-fault case wherein the
              user TLB miss handler itself faults, say because it tried to
              gain access to a paged-out structure like the user's page
              tables.)

            --(Further background: the processor was designed so that
              handling double faults was fast and did not require saving
              state; there were enough bits in the exception registers to
              "unwind" and get back to user space.)

            --In Ultrix:
                --There are two choices for the syscall handler (since it
                  is also the double-fault handler and the kernel TLB miss
                  handler): require the handler to save all register state
                  on the stack (thereby defeating the point of fast
                  double-fault handling), or else build the handler so
                  that it doesn't touch the registers.
                --Either one carries a cost for the syscall handler and
                  requires careful coding.

            --In Aegis, this problem doesn't occur. Why?
                --Because there's no possibility of a TLB miss in kernel
                  space or of the type of double fault that would invoke
                  the general exception handler. Why?
                --Because the kernel uses only physical addresses (really,
                  pseudo-physical addresses, per the MIPS architecture:
                  physical memory shows up in high VM but not via TLB
                  mappings).

        --Address translation:

            --note, we are in the world of software-managed TLBs. the
              hardware doesn't see page tables, only the TLB.
            --see p.9 for what happens on a TLB miss
            --how do they make it fast for apps to handle virtual memory?
                --answer: the kernel has a large software TLB (this means
                  that the kernel is handling TLB misses without involving
                  the app, but the kernel is not applying any
                  "intelligence": it's just taking entries from the
                  software TLB and placing them in the hardware TLB).
            --wait, I don't understand. why don't regular kernels have
              this? what is the purpose here?
                --answer: for regular kernels, checking the page table
                  structures from within the TLB miss handler is not that
                  expensive.
                  but in an exokernel, *applications* are managing memory,
                  and we want to minimize application/kernel crossings.
                  one way to do that is *not* to give control to the app
                  on a TLB miss. one way to avoid such control is an STLB
                --the kernel then pushes entries from the STLB to the real
                  TLB as needed
                --note that this is normally not needed because memory
                  management stays inside the kernel.
            --final optimization: map the STLB into application space.
              what's the point?
                --saves another crossing: applications don't have to tell
                  the kernel about a mapping if it's already in the STLB

        --IPC vs. protected control transfer

            --why faster on Aegis?
                --answer: it transfers right into the other process; no
                  context switch into the kernel, i.e., no register saving
            (--but how do they get isolation and access control?)
            (--how do they make sure that process A doesn't switch into
              process B at any old place?)
                --answer: see 5.1.2: "protected entry context". there is a
                  list of acceptable entry points. then, do access control
                  in the target to make sure that the caller is a process
                  it is comfortable being called by
            --what do you think of the comparison to L3 (Table 6 and
              accompanying text)?
                --answer: this seems a bit unfair. the x86 cannot avoid
                  TLB flushes on context switch. if the exokernel had to
                  pay for those, it would also incur that hit.
                --similarly, if the exokernel had to build an actual IPC
                  abstraction, it would have to pay more (see Table 8)

        --Pipes

            --why so much faster on Aegis?
                --answer: they map the memory into both processes. very
                  fast.
            --why can't Ultrix do this?
                --Ultrix is constrained by the POSIX API: pass in a file
                  descriptor and memory. the memory is owned by the
                  process and has to get buffered by the kernel for
                  delivery to a different buffer.
            --example of how changing the interface makes things far
              faster

        --ASHs: application-specific handlers
            --for some messages, a protocol endpoint can immediately reply
              without control going up into the application (example: ACKs
              in TCP).
            --"vectoring process can be dynamic".
              So different message types can immediately be placed into
              different message queues. This is like loading
              special-purpose code into the device driver or networking
              stack.
            --Plus, this is an application of the sandboxing paper that we
              read.
            --But this is more complex than it seems: what happens when
              the ASH accesses virtual memory? That might generate a page
              fault, requiring the invocation of an app-level fault
              handler. So now the "kernel context" (in which the ASH is
              running) depends on the app's fault handler.

    F. Extensibility

        --ASHs
        --RPC
        --Page tables
        --Scheduling (section 7.3)
            --can get isolation of processes, but fine-grained scheduling
              of threads
            --Yield() takes an argument: the target process
            --the idea is that a logical application consisting of 5
              threads or processes would just schedule those 5 threads
              itself. the application would be given time on the CPU, and
              then it could allocate that time to its own processes as it
              wished.
            --why won't this work in regular Unix?
                --answer: because processes can't schedule other
                  processes, since the interfaces aren't exposed.

3. Discussion

    A. What did you think of this paper?
        --did you buy its sales pitch?

    B. Comparison
        --L3 and Exokernel?
            --modularity (in-kernel extensions) is arguably harder in L3
              because it's so tuned for the one thing
            --but Liedtke would say that it's so well tuned you should
              just build your extensions on top of the microkernel
            --extensibility is not really part of the picture: L3 exposes
              a single interface.

    C. Why aren't people using any of this stuff?
        --Arguably OS X is like Mach, but Windows and Linux are both
          monolithic kernels.

    D. Note: in some ways, exokernel design is ludicrous