Class 14
CS372H 6 March 2012

On the board
------------

1. Last time

2. Kernel organization

3. Liedtke paper
        --Background
        --Principles
        --Interface
        --Specific optimizations
        --Discussion

---------------------------------------------------------------------------

1. Last time

    --finished linking and loading
    --discussed SFI

2. Kernel organization: microkernels vs. monolithic vs. exokernels

    --What is a microkernel? [draw picture]

    --Idea: implement many traditional OS abstractions in servers ...
      paging, file system, possibly even interrupt handlers (like L3),
      print, display

        --Some servers have privileged access to some h/w (e.g., file
          system and disks)

    --What does the kernel do? The minimum needed to support the servers:

        --address spaces
        --threads
        --inter-process communication (IPC)
            --some IPC handle to use for sending/receiving messages

    --Everything works via message passing:

        --apps talking to servers
        --apps talking to apps
        --servers talking to servers

    --Therefore message passing and IPC need to be fast

    --What are the advantages?

        --Modularity and extensibility (understandable, replaceable
          services)
        --Isolation (bugs take down a server, not the kernel)
        --Fault tolerance (same reason: just restart individual services)
        --Easier to extend to a distributed context (send/recv no
          different in terms of interface, just implementation)

    --What are the disadvantages?

        --Performance
            --What would be simple calls into the kernel are now IPCs
            --How bad is performance? (See Table 2 (p. 185).)
        --Programmability
            --Ken Thompson: "I would generally agree that microkernels
              are probably the wave of the future. However, it is in my
              opinion easier to implement a monolithic kernel. It is also
              easier for it to turn into a mess in a hurry as it is
              modified."

    --In practice...

        --huge issue: Unix compatibility
            --critical to widespread adoption
            --difficult: Unix was not designed in a modular fashion
        --Mach, L4: one big Unix server ... which is not a huge practical
          difference from a single Linux kernel (who cares if it's
          running in user space? a single bug still takes down all of the
          interesting state because it's all in that Unix server).

    --History

        --individual ideas around since the beginning
        --lots of research projects starting in the early 1980s
        --hit the big time w/ CMU's Mach in 1986
        --thought to be too slow in the early 1990s (Mach was very slow;
          people concluded microkernels are a bad idea)

          this is the context for Liedtke's paper. he is saying, "it can
          be fast".

        --now slowly returning (OS X, QNX in embedded systems/routers,
          etc.)
        --ideas very influential on non-microkernels

    --So the approaches to kernel design are:

        a. Monolithic (Linux, Unix, etc.)
            --philosophy:
                --convenience (for the application or OS programmer)
                --for any problem, either hide it from the application or
                  add a new system call
            --very successful approach

        b. Microkernel (OS X, L3, etc.)
            --philosophy:
                --IPC and user-space servers
                --for any problem, make a new server, and talk to it with
                  RPC or IPC

        c. Exokernel (JOS!)
            --philosophy:
                --eliminate all abstractions
                --for any problem, expose hardware or the needed
                  information to the application, and let the application
                  do what it wants

3. Liedtke paper

3A. Background

    --What's an RPC? What's an IPC?

        --IPC: message from thread (or process) A to thread (or process) B
        --RPC: a round trip of IPCs (there and back)

    --What's the minimum needed to do an IPC?

        --See Table 3, back page: 172 cycles

    --What's the big cost? (int, iret)

    --Why expensive?

        --pipeline flushed
        --registers dumped on stack
        --TLB misses in the wake of context switches

            --Why are 5 TLB misses needed?

                a.     B's thread control block
                b.     loading %cr3 flushes the TLB, so kernel text
                       causes a miss on the next kernel instruction after
                       "load %cr3"
                c., d. iret: accesses both the *kernel* stack and the GDT
                       -- two pages
                e.     B's user *text* looks at the message

    --How do you think this trend has progressed since the paper?

        1. It's worse now: faster processors are optimized for
           straight-line code.
        2. Traps/exceptions flush a deeper pipeline, and cache misses
           cost more cycles.

    --Actual IPC time of optimized L3: 5 usec

    --Is that expensive? Compared to what?

        --accessing a disk? (milliseconds to access a disk, so no
          problem)

        --network interrupts when packets arrive? well, what if you
          wanted to handle 50,000 packets/second? two IPCs/packet =
          100,000 IPCs/second. the processor can only do 200,000
          IPCs/second, so IPCs would take up 100,000/200,000 = 50% of
          the CPU

        # virtual memory tricks, as in Appel and Li? (several hundreds
        # of microseconds on roughly the same CPU, just for the
        # computation)

3B. Principles

    --*IPC performance is the master*

    --Plus a bunch of other things that emphasize IPC performance:

        --All design decisions require a *performance discussion*
        --If something performs poorly, look for new techniques
        --*Synergistic effects* have to be taken into consideration

            [What does this mean? That a lot of little things might add
            up to a big gain -- or to a big loss, if two changes interact
            poorly. Need to test each combination of features?!]

        --The design has to *cover all levels*, from architecture down to
          coding
        --The design has to be made on a *concrete basis*

    --Up until this point, a bunch of principles that argue that you
      should do endless IPC optimization!

        --How do we know when to stop?
        --How do we know when we can't optimize further?
        --Answer: one of the nicer principles in L3: "The design has to
          aim at a concrete performance goal."

            --Without this, you'd get lost optimizing things that don't
              matter
            --Take the minimum IPC time (172 cycles), multiply by 2 -->
              roughly 350 cycles = 7 usec (at 50 MHz)
            --set *T* = 5 usec
            --The minimum null RPC is already at 69% of T!
            --System calls + address space switches = 60% of T
            --L3 achieves 250 cycles = 5 usec

    --Basic approach: design the microkernel for a specific CPU

3C. Interface

    old:

        send    (threadID, send-message, timeout);     /* nonblocking */
        receive (receive-message, timeout);            /* nonblocking */

        if A sends to B:

            A:  send();
                receive();

            B:  while (1) {
                    select();
                    receive(&requestbuf);
                    replybuf = process();
                    send(replybuf);
                }

    new:

        /* blocking */
        call                   (threadID, send-message, receive-message,
                                timeout);
        reply_and_receive_next (reply-message, receive-message, timeout);

    now:

        A:  call(threadID, send-buf, receive-buf, timeout);

        B:  receive(&requestbuf);
            while (1) {
                replybuf = process(requestbuf);
                reply_and_receive_next(replybuf, &requestbuf, timeout);
            }

3D. Optimizations

(1) new system call: 2 system calls per RPC, instead of 4

(2) complex messages: send one message instead of a bunch

(3) direct transfer with memory mapping

    --what's going on here?

      naive solution, two copies:  A --> kernel --> B

      okay, so why not share user-level pages between A and B, and have
      the sender copy into the shared buffer? well, then the receiver
      might need write access, to signal when it's done processing.

      problem:

        --security issue: information can flow back from B to A

      other problems with shared buffers in this context:

        --receiver checks message legality, then the message changes (if
          the receiver copies the message first, then we're back where we
          started)
        --with many clients, a server could run out of VA space
        --the two sides somehow need to coordinate first
        --not app-friendly. why? [apps have to copy data anyway; they
          can't always generate data directly into the buffer, etc.]

    --Liedtke's approach, one copy:  A --> remapped B

        --The kernel does the copy inside A
        --How to do this maximally cheaply? (sketch below)
            --Copy two PDEs (8 MB) from B's address space into the kernel
              range of A's pgdir
            --Then execute the copy in A's kernel space
        --ASK: literally copy the entries?
            --No! copy each entry *except* that the PTE_U bit needs to be
              cleared, because only the kernel should be using this
              window in A
        --Why two PDEs? The maximum message size is 4 MB, so the copy is
          guaranteed to work regardless of how B aligned the message
          buffer
        --Why not just copy PTEs? That would be much more expensive

    --What does it mean for the TLB to be "window clean"? Why do we care?

        --It means the TLB contains no mappings within the communication
          window that are relevant to earlier or concurrent operations.
          During a transfer, the TLB *must* be window clean.
        --Why would there ever be old mappings?
            --say we're sending from process A to process B. Inside A, we
              need to map:
                    window_va --> process_B_buffer
              However, the TLB might contain:
                    window_va --> process_C_buffer
            --This could happen either if address space A previously sent
              to C, or if there are multiple threads in address space A,
              one of which is trying to transfer to C.
        --Why can't the IPC instructions just invalidate the mappings?
          That is, why isn't it enough to invalidate the two pages?
            --trick question: it's not two pages; it's two PDEs --> 8 MB
            --We care because mapping is cheap (copy two PDEs), but
              invalidation on the x86 only lets the programmer invalidate
              one page at a time, or flush the whole TLB
            --Because of this, the programmer (Liedtke) must reason about
              when the TLB is window clean.
        --Maintaining the invariant:
            --The only thing that complicates this invariant is the
              existence of multiple threads in the same address space.
            --But does TLB invalidation of the communication window turn
              out to be a problem? Not usually, because the kernel has to
              load %cr3 during IPC anyway (unless the address space
              doesn't change).
            --See the paper for the two cases when the programmer has to
              enforce additional TLB flushes.
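    [To make the two-PDE remap concrete, here is a minimal sketch in
    JOS-style C. The types and macros (pde_t, PDX, PTE_U) mirror JOS;
    WINDOW_VA, the function name, and the constants are hypothetical
    illustrations, not L3's actual code.]

        #include <stdint.h>

        typedef uint32_t pde_t;

        #define PTE_U      0x004        /* user-accessible bit, as in JOS */
        #define PDXSHIFT   22
        #define PDX(va)    ((((uintptr_t) (va)) >> PDXSHIFT) & 0x3FF)
        #define WINDOW_VA  0xE0000000UL /* hypothetical 8 MB window in the
                                           kernel range of A's pgdir */

        /* Map B's message buffer into A's communication window by
         * copying the two PDEs that cover it. Clearing PTE_U makes the
         * window usable only by the kernel while it runs inside A.
         * (Assumes the buffer doesn't sit in the topmost PDE.) */
        static void
        open_comm_window(pde_t *a_pgdir, pde_t *b_pgdir, uintptr_t b_buf)
        {
            a_pgdir[PDX(WINDOW_VA)]     = b_pgdir[PDX(b_buf)]     & ~(pde_t) PTE_U;
            a_pgdir[PDX(WINDOW_VA) + 1] = b_pgdir[PDX(b_buf) + 1] & ~(pde_t) PTE_U;

            /* The kernel can now copy the message to
             *     WINDOW_VA + (b_buf & 0x003FFFFF),
             * which aliases B's buffer -- provided the TLB is "window
             * clean" for this range. */
        }

    [Note the payoff: two word-sized writes map up to 8 MB, and a 4 MB
    message starting anywhere spans at most two PDEs, which is why two
    entries always suffice.]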
(4) Thread control block (TCB)

    --the TCB contains basic info about a thread: registers, links for
      various doubly linked lists, pgdir, uid, ...
    --commonly accessed fields are packed together on the same cache line

    [draw picture of array, with kernel stack inside TCB]

    --Store an array of TCBs, like JOS's array of Envs, inside every
      process's virtual memory space.

        --Easy to find any TCB (no linked-list data structure required:
          just index into the array).
        --This means that the paging system handles the case where a TCB
          isn't available (say, because the TCB itself is swapped out).

    --The kernel stack is on the same page as the TCB. Why?

        a. Minimizes TLB misses (since accessing the kernel stack will
           bring in the mapping for the TCB)
            --consider the alternative
            --NOTE: in Table 3, switching stacks doesn't cause a TLB
              miss. the reason is that B's TCB was accessed earlier in
              Table 3 (in the "access B" line)
        b. Very efficient access to the current TCB: just mask off the
           lower 12 bits of %esp

    --Another nice thing: can access *any* TCB efficiently, given the
      thread id. why?

        --the actual thread number is embedded in the 32-bit thread id in
          a very particular way: the thr_num field sits just above the
          low b bits, where tcb size = 2^b:

                                      b
                [ ... {thr_num} |<---->]

        --doing it this way replaces an {"and", "multiply", "add"} with
          an {"and", "add"}: masking the id yields the TCB's byte offset
          directly, and all that's left is adding the base of the TCB
          array (see the sketch below)
        --Note that the thread ID here is like the JOS env ID (has a
          number that serves as an index, a generation, etc.)
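    [Both TCB tricks in a minimal C sketch. TCB_BITS, TCB_AREA, and the
    field layout are illustrative assumptions, not L3's actual values.]

        #include <stdint.h>

        struct tcb;                        /* registers, queue links, ... */

        #define TCB_BITS      12           /* tcb size = 2^b = 4096: TCB and
                                              kernel stack share one page */
        #define TCB_AREA      0xE0800000UL /* hypothetical base of TCB array */
        #define THR_NUM_MASK  0x00FFF000UL /* thr_num field: bits [b, b+12) */

        /* Trick 1: the kernel stack lives on the TCB's page, so the
         * current TCB is found by masking the low 12 bits off %esp --
         * no memory access at all. */
        static inline struct tcb *
        cur_tcb(void)
        {
            uintptr_t esp;
            asm volatile("mov %%esp, %0" : "=r" (esp));
            return (struct tcb *) (esp & ~(((uintptr_t) 1 << TCB_BITS) - 1));
        }

        /* Trick 2: thr_num sits just above the low b bits of the thread
         * id, so (tid & THR_NUM_MASK) is already thr_num * 2^b -- the
         * byte offset of the TCB. The lookup is {and, add}; no multiply
         * needed. */
        static inline struct tcb *
        tcb_of(uint32_t tid)
        {
            return (struct tcb *) (TCB_AREA + (tid & THR_NUM_MASK));
        }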
(5) Lazy scheduling

    conventional approach to scheduling:

        A sends message to B:
            move A from the ready queue to the waiting queue
            move B from the waiting queue to the ready queue

        This requires 58 cycles, including 4 TLB misses. Where are the
        TLB misses? In the doubly linked lists [go over best
        implementation]:
            --the most efficient implementation would insert A at B's old
              position in the list
            --so the previous and next elements in each list must be
              touched

    lazy scheduling:

        Insight: after A blocks, *don't take it off the ready queue yet!*
        It will probably get right back on very quickly.

        [Likewise: after B wakes up, don't put it on the ready queue yet;
        just run it. It will probably go back to sleep pretty soon, so
        the scheduler can leave it wherever it was (wakeup queue or list
        of blocked threads).]

        The ready queue must contain all ready threads, EXCEPT POSSIBLY
        THE CURRENT ONE
            --it might contain other threads that aren't actually ready,
              though
        Each wakeup queue contains AT LEAST all threads waiting in that
        queue
            --again, it might contain other threads, too
        The scheduler removes inappropriate queue entries when scanning a
        queue (see the sketch below)

        Why does this help performance?

            --There are only three situations in which a thread gives up
              the CPU but stays ready: the "send" syscall (as opposed to
              "call"), preemption, and hardware interrupts
                [these are the only cases when the thread needs to be put
                on the ready list]
            --So the kernel can very often IPC into a thread while not
              putting the sender on the ready list
            --the "ipc : lazy queue update" ratio can reach 50:1 with
              high IPC rates
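    [A minimal sketch of a lazily updated ready queue; the structure and
    field names are assumptions for illustration, not L3's actual code.]

        #include <stddef.h>

        enum thread_state { READY, WAITING };

        struct tcb {
            enum thread_state state;
            struct tcb *ready_next;    /* links in the (doubly linked) */
            struct tcb *ready_prev;    /* ready queue                  */
        };

        extern struct tcb *ready_head;
        extern void ready_dequeue(struct tcb *t);  /* unlink from queue */

        /* The IPC fast path never touches the queues: a sender that
         * blocks stays on the ready queue, and the awakened receiver
         * simply runs. Only the scheduler pays for cleanup -- later, and
         * only if it actually has to scan. */
        static struct tcb *
        schedule(void)
        {
            struct tcb *t = ready_head;
            while (t != NULL) {
                struct tcb *next = t->ready_next;
                if (t->state == READY)
                    return t;          /* genuinely ready: run it */
                ready_dequeue(t);      /* stale entry: lazy queue update */
                t = next;
            }
            return NULL;               /* nothing ready: idle */
        }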
(6) Segment register optimization

    --Loading segment registers is slow: you have to access the GDT, etc.
    --But the common case is that user code doesn't change its segment
      registers
    --Observation: it's faster to check a segment register than to load
      it
        --So just check that the segment registers are okay
        --Only load them if user code changed them

(7) Various other tricks

    --Multiple timeout queues, plus a long-time wakeup list, plus a
      base+offset representation of time
    --Short messages passed through registers
    --Minimize TLB misses by putting things on the same page
    --Put commonly used data on the same cache lines
    --Other coding tricks: short offsets, avoid jumps, etc.

3E. Discussion

    --Great performance numbers! Much better than other microkernels
      (Figures 7, 8)
    --Too bad microbenchmark performance might not matter
    --Too bad, too, that hardware evolution has made IPC inherently more
      expensive
    --What do you think of the theme of the paper? Liedtke was fighting a
      losing battle against CPU makers: hardware evolution keeps making
      IPC inherently more expensive. [But it's a very nice series of
      design decisions (or hacks).]
    --Is fast IPC something that computer architects should take into
      account when designing hardware?