Class 14
CS372H 6 March 2012

On the board
------------

1. Last time

2. Kernel organization

3. Liedtke paper
        --Background
        --Principles
        --Interface
        --Specific optimizations
        --Discussion

---------------------------------------------------------------------------

1. Last time

    --finished linking and loading
    --discussed SFI

2. Kernel organization: microkernels vs. monolithic vs. exokernels

    --What is a microkernel? [draw picture]

    --Idea: implement many traditional OS abstractions in servers ...
      paging, file system, possibly even interrupt handlers (like L3),
      print, display

        --Some servers have privileged access to some h/w (e.g., file
          system and disks)

    --What does the kernel do? The minimum needed to support the servers:

        --address spaces
        --threads
        --inter-process communication (IPC)
            --some IPC handle to use for sending/receiving messages

    --Everything works via message passing:

        --apps talking to servers
        --apps talking to apps
        --servers talking to servers

    --Therefore message passing and IPC need to be fast

    --What are the advantages?

        --Modularity and extensibility (understandable, replaceable
          services)
        --Isolation (bugs take down a server, not the kernel)
        --Fault tolerance (same reason: just restart individual services)
        --Easier to extend to a distributed context (send/recv no
          different in terms of interface, just implementation)

    --What are the disadvantages?

        --Performance
            --What would be simple calls into the kernel are now IPCs
            --How bad is performance? (See Table 2 (p. 185).)
        --Programmability
            --Ken Thompson: "I would generally agree that microkernels
              are probably the wave of the future. However, it is in my
              opinion easier to implement a monolithic kernel. It is also
              easier for it to turn into a mess in a hurry as it is
              modified."

    --In practice...

        --huge issue: Unix compatibility
            --critical to widespread adoption
            --difficult: Unix was not designed in a modular fashion
        --Mach, L4: one big Unix server ... which is not a huge practical
          difference from a single Linux kernel (who cares if it's
          running in user space? a single bug still takes down all of the
          interesting state because it's all in that Unix server).

    --History

        --individual ideas around since the beginning
        --lots of research projects starting in the early 1980s
        --hit the big time w/ CMU's Mach in 1986
        --thought to be too slow in the early 1990s (Mach was very slow;
          people concluded microkernels are a bad idea)

          this is the context for Liedtke's paper. he is saying, "it can
          be fast".

        --now slowly returning (OS X, QNX in embedded systems/routers,
          etc.)
        --ideas very influential on non-microkernels

    --So the approaches to kernel design are:

        a. Monolithic (Linux, Unix, etc.)
            --philosophy:
                --convenience (for the application or OS programmer)
                --for any problem, either hide it from the application or
                  add a new system call
            --very successful approach

        b. Microkernel (OS X, L3, etc.)
            --philosophy:
                --IPC and user-space servers
                --for any problem, make a new server, and talk to it with
                  RPC or IPC

        c. Exokernel (JOS!)
            --philosophy:
                --eliminate all abstractions
                --for any problem, expose hardware or the needed
                  information to the application, and let the application
                  do what it wants

3. Liedtke paper

3A. Background

    --What's an RPC? What's an IPC?

        --IPC: message from thread (or process) A to thread (or process) B
        --RPC: a round trip of IPCs (there and back)

    --What's the minimum needed to do an IPC?

        --See Table 3, back page: 172 cycles

    --What's the big cost? (int, iret)

    --Why expensive?

        --pipeline flushed
        --registers dumped on stack
        --TLB misses in the wake of context switches

            --Why are 5 TLB misses needed?

                a.     B's thread control block
                b.     loading %cr3 flushes the TLB, so kernel text
                       causes a miss on the next kernel instruction after
                       "load %cr3"
                c., d. iret: accesses both the *kernel* stack and the GDT
                       -- two pages
                e.     B's user *text* looks at the message

    --How do you think this trend has progressed since the paper?

        1. It's worse now: faster processors are optimized for
           straight-line code.
        2. Traps/exceptions flush a deeper pipeline, and cache misses
           cost more cycles.

    --Actual IPC time of optimized L3: 5 usec

    --Is that expensive? Compared to what?

        --accessing a disk? (milliseconds to access a disk, so no
          problem)

        --network interrupts when packets arrive? well, what if you
          wanted to handle 50,000 packets/second? two IPCs/packet =
          100,000 IPCs/second. the processor can only do 200,000
          IPCs/second, so IPCs would take up 100,000/200,000 = 50% of
          the CPU

        # virtual memory tricks, as in Appel and Li? (several hundreds
        # of microseconds on roughly the same CPU, just for the
        # computation)

3B. Principles

    --*IPC performance is the master*

    --Plus a bunch of other things that emphasize IPC performance:

        --All design decisions require a *performance discussion*
        --If something performs poorly, look for new techniques
        --*Synergistic effects* have to be taken into consideration

            [What does this mean? That a lot of little things might add
            up to a big gain -- or to a big loss, if two changes interact
            poorly. Need to test each combination of features?!]

        --The design has to *cover all levels*, from architecture down to
          coding
        --The design has to be made on a *concrete basis*

    --Up until this point, a bunch of principles that argue that you
      should do endless IPC optimization!

        --How do we know when to stop?
        --How do we know when we can't optimize further?
        --Answer: one of the nicer principles in L3: "The design has to
          aim at a concrete performance goal."

            --Without this, you'd get lost optimizing things that don't
              matter
            --Take the minimum IPC time (172 cycles), multiply by 2 -->
              roughly 350 cycles = 7 usec (at 50 MHz)
            --set *T* = 5 usec
            --The minimum null RPC is already at 69% of T!
            --System calls + address space switches = 60% of T
            --L3 achieves 250 cycles = 5 usec

    --Basic approach: design the microkernel for a specific CPU

3C. Interface

    old:

        send    (threadID, send-message, timeout);     /* nonblocking */
        receive (receive-message, timeout);            /* nonblocking */

        if A sends to B:

            A:  send();
                receive();

            B:  while (1) {
                    select();
                    receive(&requestbuf);
                    replybuf = process();
                    send(replybuf);
                }

    new:

        /* blocking */
        call                   (threadID, send-message, receive-message,
                                timeout);
        reply_and_receive_next (reply-message, receive-message, timeout);

    now:

        A:  call(threadID, send-buf, receive-buf, timeout);

        B:  receive(&requestbuf);
            while (1) {
                replybuf = process(requestbuf);
                reply_and_receive_next(replybuf, &requestbuf, timeout);
            }

3D. Optimizations

(1) new system call: 2 system calls per RPC, instead of 4

(2) complex messages: send one message instead of a bunch

(3) direct transfer with memory mapping

    --what's going on here?

      naive solution, two copies:  A --> kernel --> B

      okay, so why not share user-level pages between A and B, and have
      the sender copy into the shared buffer? well, then the receiver
      might need write access, to signal when it's done processing.

      problem:

        --security issue: information can flow back from B to A

      other problems with shared buffers in this context:

        --receiver checks message legality, then the message changes (if
          the receiver copies the message first, then we're back where we
          started)
        --with many clients, a server could run out of VA space
        --the two sides somehow need to coordinate first
        --not app-friendly. why? [apps have to copy data anyway; they
          can't always generate data directly into the buffer, etc.]

    --Liedtke's approach, one copy:  A --> remapped B

        --The kernel does the copy inside A
        --How to do this maximally cheaply? (sketch below)
            --Copy two PDEs (8 MB) from B's address space into the kernel
              range of A's pgdir
            --Then execute the copy in A's kernel space
        --ASK: literally copy the entries?
            --No! copy each entry *except* that the PTE_U bit needs to be
              cleared, because only the kernel should be using this
              window in A
        --Why two PDEs? The maximum message size is 4 MB, so the copy is
          guaranteed to work regardless of how B aligned the message
          buffer
        --Why not just copy PTEs? That would be much more expensive

    --What does it mean for the TLB to be "window clean"? Why do we care?

        --It means the TLB contains no mappings within the communication
          window that are relevant to earlier or concurrent operations.
          During a transfer, the TLB *must* be window clean.
        --Why would there ever be old mappings?
            --say we're sending from process A to process B. Inside A, we
              need to map:
                    window_va --> process_B_buffer
              However, the TLB might contain:
                    window_va --> process_C_buffer
            --This could happen either if address space A previously sent
              to C, or if there are multiple threads in address space A,
              one of which is trying to transfer to C.
        --Why can't the IPC instructions just invalidate the mappings?
          That is, why isn't it enough to invalidate the two pages?
            --trick question: it's not two pages; it's two PDEs --> 8 MB
            --We care because mapping is cheap (copy two PDEs), but
              invalidation on the x86 only lets the programmer invalidate
              one page at a time, or flush the whole TLB
            --Because of this, the programmer (Liedtke) must reason about
              when the TLB is window clean.
        --Maintaining the invariant:
            --The only thing that complicates this invariant is the
              existence of multiple threads in the same address space.
            --But does TLB invalidation of the communication window turn
              out to be a problem? Not usually, because the kernel has to
              load %cr3 during IPC anyway (unless the address space
              doesn't change).
            --See the paper for the two cases when the programmer has to
              enforce additional TLB flushes.
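    [To make the two-PDE remap concrete, here is a minimal sketch in
    JOS-style C. The types and macros (pde_t, PDX, PTE_U) mirror JOS;
    WINDOW_VA, the function name, and the constants are hypothetical
    illustrations, not L3's actual code.]

        #include <stdint.h>

        typedef uint32_t pde_t;

        #define PTE_U      0x004        /* user-accessible bit, as in JOS */
        #define PDXSHIFT   22
        #define PDX(va)    ((((uintptr_t) (va)) >> PDXSHIFT) & 0x3FF)
        #define WINDOW_VA  0xE0000000UL /* hypothetical 8 MB window in the
                                           kernel range of A's pgdir */

        /* Map B's message buffer into A's communication window by
         * copying the two PDEs that cover it. Clearing PTE_U makes the
         * window usable only by the kernel while it runs inside A.
         * (Assumes the buffer doesn't sit in the topmost PDE.) */
        static void
        open_comm_window(pde_t *a_pgdir, pde_t *b_pgdir, uintptr_t b_buf)
        {
            a_pgdir[PDX(WINDOW_VA)]     = b_pgdir[PDX(b_buf)]     & ~(pde_t) PTE_U;
            a_pgdir[PDX(WINDOW_VA) + 1] = b_pgdir[PDX(b_buf) + 1] & ~(pde_t) PTE_U;

            /* The kernel can now copy the message to
             *     WINDOW_VA + (b_buf & 0x003FFFFF),
             * which aliases B's buffer -- provided the TLB is "window
             * clean" for this range. */
        }

    [Note the payoff: two word-sized writes map up to 8 MB, and a 4 MB
    message starting anywhere spans at most two PDEs, which is why two
    entries always suffice.]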
(4) Thread control block (TCB)

    --the TCB contains basic info about a thread: registers, links for
      various doubly linked lists, pgdir, uid, ...
    --commonly accessed fields are packed together on the same cache line

    [draw picture of array, with kernel stack inside TCB]

    --Store an array of TCBs, like JOS's array of Envs, inside every
      process's virtual memory space.

        --Easy to find any TCB (no linked-list data structure required:
          just index into the array).
        --This means that the paging system handles the case where a TCB
          isn't available (say, because the TCB itself is swapped out).

    --The kernel stack is on the same page as the TCB. Why?

        a. Minimizes TLB misses (since accessing the kernel stack will
           bring in the mapping for the TCB)
            --consider the alternative
            --NOTE: in Table 3, switching stacks doesn't cause a TLB
              miss. the reason is that B's TCB was accessed earlier in
              Table 3 (in the "access B" line)
        b. Very efficient access to the current TCB: just mask off the
           lower 12 bits of %esp

    --Another nice thing: can access *any* TCB efficiently, given the
      thread id. why?

        --the actual thread number is embedded in the 32-bit thread id in
          a very particular way: the thr_num field sits just above the
          low b bits, where tcb size = 2^b:

                                      b
                [ ... {thr_num} |<---->]

        --doing it this way replaces an {"and", "multiply", "add"} with
          an {"and", "add"}: masking the id yields the TCB's byte offset
          directly, and all that's left is adding the base of the TCB
          array (see the sketch below)
        --Note that the thread ID here is like the JOS env ID (has a
          number that serves as an index, a generation, etc.)
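    [Both TCB tricks in a minimal C sketch. TCB_BITS, TCB_AREA, and the
    field layout are illustrative assumptions, not L3's actual values.]

        #include <stdint.h>

        struct tcb;                        /* registers, queue links, ... */

        #define TCB_BITS      12           /* tcb size = 2^b = 4096: TCB and
                                              kernel stack share one page */
        #define TCB_AREA      0xE0800000UL /* hypothetical base of TCB array */
        #define THR_NUM_MASK  0x00FFF000UL /* thr_num field: bits [b, b+12) */

        /* Trick 1: the kernel stack lives on the TCB's page, so the
         * current TCB is found by masking the low 12 bits off %esp --
         * no memory access at all. */
        static inline struct tcb *
        cur_tcb(void)
        {
            uintptr_t esp;
            asm volatile("mov %%esp, %0" : "=r" (esp));
            return (struct tcb *) (esp & ~(((uintptr_t) 1 << TCB_BITS) - 1));
        }

        /* Trick 2: thr_num sits just above the low b bits of the thread
         * id, so (tid & THR_NUM_MASK) is already thr_num * 2^b -- the
         * byte offset of the TCB. The lookup is {and, add}; no multiply
         * needed. */
        static inline struct tcb *
        tcb_of(uint32_t tid)
        {
            return (struct tcb *) (TCB_AREA + (tid & THR_NUM_MASK));
        }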
(5) Lazy scheduling

    conventional approach to scheduling:

        A sends message to B:
            move A from the ready queue to the waiting queue
            move B from the waiting queue to the ready queue

        This requires 58 cycles, including 4 TLB misses. Where are the
        TLB misses? In the doubly linked lists [go over best
        implementation]:
            --the most efficient implementation would insert A at B's old
              position in the list
            --so the previous and next elements in each list must be
              touched

    lazy scheduling:

        Insight: after A blocks, *don't take it off the ready queue yet!*
        It will probably get right back on very quickly.

        [Likewise: after B wakes up, don't put it on the ready queue yet;
        just run it. It will probably go back to sleep pretty soon, so
        the scheduler can leave it wherever it was (wakeup queue or list
        of blocked threads).]

        The ready queue must contain all ready threads, EXCEPT POSSIBLY
        THE CURRENT ONE
            --it might contain other threads that aren't actually ready,
              though
        Each wakeup queue contains AT LEAST all threads waiting in that
        queue
            --again, it might contain other threads, too
        The scheduler removes inappropriate queue entries when scanning a
        queue (see the sketch below)

        Why does this help performance?

            --There are only three situations in which a thread gives up
              the CPU but stays ready: the "send" syscall (as opposed to
              "call"), preemption, and hardware interrupts
                [these are the only cases when the thread needs to be put
                on the ready list]
            --So the kernel can very often IPC into a thread while not
              putting the sender on the ready list
            --the "ipc : lazy queue update" ratio can reach 50:1 with
              high IPC rates
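    [A minimal sketch of a lazily updated ready queue; the structure and
    field names are assumptions for illustration, not L3's actual code.]

        #include <stddef.h>

        enum thread_state { READY, WAITING };

        struct tcb {
            enum thread_state state;
            struct tcb *ready_next;    /* links in the (doubly linked) */
            struct tcb *ready_prev;    /* ready queue                  */
        };

        extern struct tcb *ready_head;
        extern void ready_dequeue(struct tcb *t);  /* unlink from queue */

        /* The IPC fast path never touches the queues: a sender that
         * blocks stays on the ready queue, and the awakened receiver
         * simply runs. Only the scheduler pays for cleanup -- later, and
         * only if it actually has to scan. */
        static struct tcb *
        schedule(void)
        {
            struct tcb *t = ready_head;
            while (t != NULL) {
                struct tcb *next = t->ready_next;
                if (t->state == READY)
                    return t;          /* genuinely ready: run it */
                ready_dequeue(t);      /* stale entry: lazy queue update */
                t = next;
            }
            return NULL;               /* nothing ready: idle */
        }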
(6) Segment register optimization

    --Loading segment registers is slow: you have to access the GDT, etc.
    --But the common case is that user code doesn't change its segment
      registers
    --Observation: it's faster to check a segment register than to load
      it
        --So just check that the segment registers are okay
        --Only load them if user code changed them

(7) Various other tricks

    --Multiple timeout queues, plus a long-time wakeup list, plus a
      base+offset representation of time
    --Short messages passed through registers
    --Minimize TLB misses by putting things on the same page
    --Put commonly used data on the same cache lines
    --Other coding tricks: short offsets, avoid jumps, etc.

3E. Discussion

    --Great performance numbers! Much better than other microkernels
      (Figures 7, 8)
    --Too bad microbenchmark performance might not matter
    --Too bad, too, that hardware evolution has made IPC inherently more
      expensive
    --What do you think of the theme of the paper? Liedtke was fighting a
      losing battle against CPU makers: hardware evolution keeps making
      IPC inherently more expensive. [But it's a very nice series of
      design decisions (or hacks).]
    --Is fast IPC something that computer architects should take into
      account when designing hardware?