Class 6
CS 372H  4 February 2010

On the board
------------
1. Last time (x86 paging, JOS memory map)
2. Other page structures
3. Page faults and their uses
4. Page faults and their costs

---------------------------------------------------------------------------

1. Last time

    key idea in virtual memory: insert an entry in the page tables, and
    then the program can reference the address

        this is a powerful thing. it amounts to the ability to
        manufacture (and remove) opaque handles on the fly, just by
        inserting and removing entries in the mapping

        the program itself can make such requests implicitly (as it page
        faults) or explicitly (via mmap, which can be told to fail if it
        can't create a particular entry in virtual space). the OS can
        certainly create such abstractions for the program

    example: if we want a program to be able to use address 0x00402000
    to refer to physical address 0x0a370000, but in a read-only way, we
    conceptually insert the entry <0x00402000, 0x0a370000>

    we implement that mapping like this:

                                    <20 bits>   <12 bits>
                                   _____________________
        [entry 2]  ------------->  |   a370   |   W=0  |
        [entry 1]                  |__________|________|
        [entry 0]
           ...
        PGTABLE

    (without the two-level page table, but with 4KB pages and a 4GB
    address space, every process would need 4MB of contiguous physical
    memory to implement its page table)

    something it is probably worth internalizing: one of the things that
    a second-level page table is doing is to take a bunch of disparate
    physical pages, perhaps scattered throughout physical memory, and
    logically glue them together, making them appear as a contiguous
    4MB region in virtual space (just as the entire page structure glues
    disparate physical pages into a 4GB "region"). if the second-level
    page table that is chosen for this gluing is the page directory
    itself, then the disparate physical pages that all appear as a
    contiguous 4MB region wind up being the page tables themselves.
    more detail is below.....
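The index arithmetic behind the example above can be checked with a short
sketch (Python here, purely for the arithmetic; the field widths are the
x86 ones, and the only concrete numbers are the addresses from the
example):

```python
# Split a 32-bit x86 linear address into its three translation fields:
# top 10 bits index the page directory, next 10 bits index the page
# table, and the low 12 bits are the offset within the 4KB page.
def split(va):
    pdx = (va >> 22) & 0x3ff   # page directory index
    ptx = (va >> 12) & 0x3ff   # page table index
    off = va & 0xfff           # offset within the page
    return pdx, ptx, off

# The example mapping <0x00402000, 0x0a370000>:
pdx, ptx, off = split(0x00402000)
print(pdx, ptx, hex(off))      # 1 2 0x0

# So translation consults pgdir entry 1, then entry 2 of that page
# table -- the entry holding PPN 0x0a370 with W=0 -- and the physical
# address is (PPN << 12) | offset:
pa = (0x0a370 << 12) | off
assert pa == 0x0a370000
```

This is why the read-only PTE in the diagram sits at entry 2 of the page
table: PTX(0x00402000) = 2.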
    implementation trick in JOS: want the page tables to look linear and
    to appear at addresses {0xef400000, 0xefc00000} = {UVPT, VPT}. so
    insert pointers at entries 957 and 959 of the page directory back to
    the page directory itself

    the result is that the page tables *themselves* show up in the
    program's virtual address space at [UVPT, UVPT+4MB) and
    [VPT, VPT+4MB)

    result: the picture of [UVPT, UVPT+4MB) in virtual space is:

        UVPT+4MB  ___________________
                      PGTABLE 1023
                  ___________________
                          .
                          .
                  ___________________
                      PGTABLE 2
                  ___________________
                      PGTABLE 1
                  ___________________
                      PGTABLE 0
        UVPT      ___________________

    further detail on the JOS implementation trick: this works because
    the page directory has the same structure as a page table and
    because the CPU just "follows arrows", namely:

        (1) From the relevant entry in the pgdir [which entry, recall,
        covers 4MB worth of VA space] to the physical page number where
        the relevant page table lives

        (2) From the physical page number where the relevant page table
        lives, more specifically the relevant entry in the relevant page
        table (which is relevant to 4KB of address space), to the
        physical page number that is the target of the mapping.

    now, if you "trick" the CPU into following the first arrow back to
    the pgdir itself, and the program references an address
    0xef400000+x, where x < 4MB, then the logic goes like this (compare
    the exact words below to the exact words of the numbered items
    above):

        (1) From the relevant entry in the pgdir [which entry, recall,
        covers the 4MB worth of VA space [0xef400000, 0xef800000)] to
        the physical page number where the page directory lives

        (2) From the physical page number where the page directory
        lives, more specifically the relevant entry in the page
        directory (which now is relevant to only 4KB of address space),
        to the physical page number that is the target of the
        <0xef400000+x, PA> mapping. that physical page holds a
        second-level page table!

    result: the second-level page table appears at 0xef400000+x
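The "follow the arrows" logic can be modeled in miniature. This is a toy
sketch, not JOS code: frames are plain Python dicts and permission bits
are omitted; only the entry number 957 and UVPT come from the notes.

```python
# Toy model of the JOS trick: a page directory whose entry 957 points
# back at itself.  A "frame" here is a dict mapping entry index -> frame;
# a real PDE/PTE would hold a physical page number plus permission bits.
UVPT = 0xef400000

pgtable0 = {}                  # some ordinary page table (a frame)
pgdir = {0: pgtable0}          # pgdir entry 0 -> page table 0
pgdir[957] = pgdir             # the trick: entry 957 -> the pgdir itself

def walk(va):
    """Follow the two arrows the MMU follows; return the final frame."""
    pdx = (va >> 22) & 0x3ff
    ptx = (va >> 12) & 0x3ff
    level2 = pgdir[pdx]        # arrow 1: pgdir entry -> "page table"
    return level2[ptx]         # arrow 2: page table entry -> target frame

# A reference into [UVPT, UVPT+4MB) lands on a page table: page table n
# appears at UVPT + n*4096.  Here UVPT + 0*4096 resolves to page table 0:
assert walk(UVPT + 0 * 4096) is pgtable0
# and UVPT + 957*4096 resolves to the page directory itself:
assert walk(UVPT + 957 * 4096) is pgdir
```

The first arrow lands back on the pgdir because PDX(UVPT) = 957, which
is exactly the entry that was pointed at the pgdir.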
2. Other page structures

    A. Very large page sizes (e.g., 4 MB)

        --advantage: small page tables
        --disadvantage: lots of wasted memory
        --PSE (set bit 7 in the PDE and get 4MB pages, with no page
          tables)
        --**there is a trade-off between large page sizes and small page
          sizes**. what is the nature of the trade-off?
            --large page sizes mean wasting actual memory
            --small page sizes mean lots of page table entries (which
              may or may not get consumed)
        --Tanenbaum gives an equation (section 3.5.3): with s the
          average process size, p the page size, and e the bytes per
          page table entry,

              overhead = se/p + p/2

              d(overhead)/dp = -se/p^2 + 1/2

          which finds its minimum at p = sqrt(2se)

    B. Many levels of page table

        --advantage: not much memory spent on page tables if the address
          space is sparse
        --disadvantage: lots of page table walking

    C. What happens when memory gets huge?

        --many levels of page table; or
        --inverted page table
            --works as a hash table
            --stores entries mapping VPNs to PPNs
            [[--NOTE: the book and other references say that this thing
            has to have the same number of entries as the number of
            physical pages in the machine, but that is bogus. That
            number is neither a useful minimum nor a useful maximum.

            It is not a useful minimum because the table has to deal
            with collisions from the fact that a potentially very large
            number of VPNs map to a much smaller number of PPNs (e.g.,
            mapping the same PPN at different places in the address
            space), so the table needs to be able to live with a number
            of entries greater than the number of physical frames (i.e.,
            it must handle being oversubscribed). Hence, it could
            presumably have a smaller number of entries than the number
            of physical frames (which is just another kind of
            oversubscription).

            It is not a useful maximum because in general when one is
            using hash tables, one wants the hash table to be a little
            bit larger than the number of entries that one is storing;
            adding even a little bit of "wiggle room" in the form of
            blank entries tends to reduce collisions a lot. (See Knuth,
            chapter 6.4.)
            So it's not at all clear how big the inverted page table
            should be, except that the whole point is to be smaller than
            a traditional page table. Thus, one presumably wants it to
            be O(number of physical pages).]]

3. Page faults

    --what happens if the address isn't in the page table or there is a
      protection violation? [page fault!]

        --NOTE: TLB MISS != PAGE FAULT
        --not all TLB misses generate page faults, and not all page
          faults began with TLB misses

    --what happens on the x86? [see handout from last time]

        --the processor pushes a trap frame onto the kernel's stack:

                            ss
                            esp      [former value of stack pointer]
                            eflags   [former value of eflags]
                            cs
                            eip      [instruction that caused the trap]
                 %esp -->   [error code]

          %eip is now executing code to handle the trap
            [how did the processor know what to load into %eip?]

          error code:

            [ ..............unused.............. | U/S | W/R | P ]

            U/S: user mode fault / supervisor mode fault
            W/R: access was a read / access was a write
            P:   not-present page / protection violation on a page

          on a page fault, %cr2 holds the faulting linear address

          idea is that when a page fault happens, the kernel sets up
          those maps properly, or kills the process

    --exhibit A for the use of paging is virtual memory:
        --your program thinks it has, say, 512 MB of memory, but your
          hardware has only 4 MB of memory
        --the way that this worked is that the disk was (is) used to
          store memory pages
        --advantage: address space looks huge
        --disadvantage: accesses to "paged" memory (as memory pages that
          live on the disk are known) are sllooooowwwww
        --the implementation of this is described in Tanenbaum 3.6.
          Roughly:
            --on a page fault, the kernel reads in the faulting page
                --QUESTION: what is listed in the page structures? how
                  does the kernel know whether the address is invalid,
                  in memory, paged, what?
            --called demand paging, and it's one way to get program code
              into memory "lazily"
            --kernel may need to send a page to disk (under what
              conditions?
              answer: two conditions must hold for the kernel to have to
              write to disk)
                (1) kernel is out of memory
                (2) the page that it selects to write out is dirty

    --Many 32-bit machines have 4GB of memory, so it's less common to
      hear the sound of swapping these days. You either need 36-bit
      addressing and memory hogs, or multiple large memory consumers
      running on the same computer

    --many, many other uses for page faults and virtual memory
        --high-level idea: by giving the kernel (or even a user-level
          program) the opportunity to do interesting things on page
          faults, you can build interesting functionality:

        --store memory pages across the network! (Distributed Shared
          Memory)
            --basic idea was that on a page fault, the page fault
              handler went and retrieved the needed page from some
              other machine

        --copy-on-write
            --when creating a copy of another process, don't copy its
              memory. just copy its page tables, and mark the pages
              read-only
                --QUESTION: do you need to mark the parent's pages as
                  read-only as well?
            --program semantics aren't violated when programs do reads
            --when a write happens, a page fault results. at that point,
              the kernel allocates a new page, copies the memory over,
              and restarts the user program to do the write
            --thus, copies of memory happen only when there is a fault
              as a result of a write
            --this idea is all over the place

        --accounting
            --good way to sample what percentage of the memory pages are
              written to in any time slice: mark a fraction of them not
              present, and see how often you get faults

        --if you are interested in this, check out the paper "Virtual
          Memory Primitives for User Programs", by Andrew W. Appel and
          Kai Li, Proc. ASPLOS, 1991.

    --Paging in day-to-day use
        --Demand paging
        --Growing the stack
        --BSS page allocation
        --Shared text
        --Shared libraries
        --Shared memory
        --Copy-on-write (fork, mmap, etc.)

    --Okay, but in the case of demand paging, which pages do we bring to
      and from the disk?
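Returning to copy-on-write for a moment: the bookkeeping described above
(share pages read-only at fork time, copy only on the first write fault)
can be sketched as a toy model. This is Python with made-up names, not
kernel code; a real kernel does this with PTE permission bits and
per-frame reference counts.

```python
# Toy copy-on-write model: an "address space" maps vpn -> (frame,
# writable), and frames carry a reference count.
class Frame:
    def __init__(self, data):
        self.data = data
        self.refs = 1

def fork_address_space(parent):
    """Share every frame read-only instead of copying it."""
    child = {}
    for vpn, (frame, writable) in parent.items():
        frame.refs += 1
        parent[vpn] = (frame, False)  # parent's PTE also becomes read-only
        child[vpn] = (frame, False)
    return child

def write(space, vpn, value):
    frame, writable = space[vpn]
    if not writable:                  # this is the write fault
        if frame.refs > 1:            # frame still shared: copy it now
            frame.refs -= 1
            frame = Frame(frame.data)
        space[vpn] = (frame, True)    # map our private copy writable
    frame.data = value                # "restart" the faulting write

parent = {0: (Frame("hello"), True)}
child = fork_address_space(parent)
write(child, 0, "bye")                # triggers the copy
assert parent[0][0].data == "hello"   # parent's page is untouched
assert child[0][0].data == "bye"
```

Note that `fork_address_space` marks the *parent's* mapping read-only
too, which is the answer to the QUESTION above: otherwise the parent
could scribble on a page the child still shares.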
---------------------------------------------------------------------------

admin

    --homeworks posted
    --lab 2 is to be done individually
    --lab 3 released
        --pair programming option: deadline this Friday night
        --once you have decided, you may not switch tracks (i.e., you
          may not go from pair to individual or from individual to
          pair). doing so constitutes cheating.
    --no class this Tuesday
        --will be recording and assigning lecture

---------------------------------------------------------------------------

4. Page faults and their costs

    --What does demand paging (i.e., paging from the disk) cost?

        --let's look at average memory access time (AMAT)

        --AMAT = (1-p)*(memory access time) + p*(page fault time),
          where p is the probability of a page fault

            memory access time ~ 100 ns
            disk access time   ~ 10 ms = 10^7 ns

        --QUESTION: what does p need to be to ensure that paging hurts
          performance by less than 10%?

            1.1*t_M = (1-p)*t_M + p*t_D
            p = 0.1*t_M / (t_D - t_M) ~ 10^1/10^7 = 10^{-6}

          so only one access out of 1,000,000 can be a page fault!!

        --basically, page faults are super-expensive (good thing the
          machine can do other things during a page fault)

    --Thrashing is even worse

        Memory overcommitted -- pages tossed out while still needed

        Example:
            --one program touches 50 pages (each equally likely); only
              have 40 physical page frames
            --If have enough pages, 100ns/ref
            --If have too few pages, assume every 5th reference leads to
              a page fault
            --4 refs x 100ns and 1 page fault x 10ms for disk I/O
            --this gets us 5 refs per (10ms + 400ns) ~ 2ms/ref
              = 20,000x slowdown!!!

        --What we wanted: virtual memory the size of the disk with the
          access time of physical memory
        --What we have here: memory with the access time of the disk

    Concept is much larger than OSes: need to pay attention to the slow
    case if it's really slow and common enough to matter.
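The two back-of-the-envelope calculations in this section can be checked
directly; the numbers below are the ones from the notes, nothing else is
assumed.

```python
t_M = 100            # memory access time, ns
t_D = 10_000_000     # disk access time, ns (10 ms)

# Maximum page-fault probability for at most a 10% slowdown:
#   1.1*t_M = (1-p)*t_M + p*t_D  =>  p = 0.1*t_M / (t_D - t_M)
p = 0.1 * t_M / (t_D - t_M)
print(p)             # ~1e-6: one fault per million references

# Thrashing example: every 5th reference faults, so each group of 5
# references costs 4 memory hits plus one disk I/O.
ns_per_ref = (4 * t_M + t_D) / 5
slowdown = ns_per_ref / t_M
print(slowdown)      # ~20,000x
```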