Class 15
CS 439
5 March 2013

On the board
------------
1. Last time
2. Page faults, continued
3. Other page structures
4. Page replacement policies

---------------------------------------------------------------------------

1. Last time

--more paging!
--JOS memory map
--started on page faults

2B. Page faults: uses

--exhibit A for the use of paging is virtual memory:

    --your program thinks it has, say, 512 MB of memory, but your
      hardware has only 4 MB of memory

    --the way this works: the disk is used to store the memory pages
      that don't fit in physical memory

    --advantage: the address space looks huge

    --disadvantage: accesses to "paged" memory (as memory pages that
      live on the disk are known) are sllooooowwwww

    --the implementation of this is roughly:

        --on a page fault, the kernel reads in the faulting page

        --QUESTION: what is listed in the page structures? how does
          the kernel know whether an address is invalid, in memory,
          paged out to disk, etc.?

    --called demand paging, and it's one way to get program code into
      memory "lazily"

    --the kernel may need to send a page to disk (under what
      conditions? answer: two conditions must hold for the kernel to
      HAVE to write to disk):

        (1) the kernel is out of memory, and
        (2) the page that it selects to write out is dirty

    --Many 32-bit machines have 4GB of memory, so it's less common to
      hear the sound of swapping these days. You either need 36-bit
      addressing and memory hogs, or multiple large memory consumers
      running on the same computer.

--many, many other uses for page faults and virtual memory

    --high-level idea: by giving the kernel (or even a user-level
      program) the opportunity to do interesting things on page
      faults, you can build interesting functionality:

    --store memory pages across the network! (Distributed Shared
      Memory)

        --basic idea was that on a page fault, the page fault handler
          went and retrieved the needed page from some other machine

    --copy-on-write

        --when creating a copy of another process, don't copy its
          memory; just copy its page tables, and mark the pages
          read-only

        --QUESTION: do you need to mark the parent's pages as
          read-only as well?

        --program semantics aren't violated when programs do reads

        --when a write happens, a page fault results. at that point,
          the kernel allocates a new page, copies the memory over, and
          restarts the user program so that the write can proceed

        --thus, memory is copied only when a fault results from a
          write

        --this idea is all over the place

    --accounting

        --good way to sample what percentage of memory pages are
          written to in any time slice: mark a fraction of them not
          present, and see how often you get faults

    --if you are interested in this, check out the paper "Virtual
      Memory Primitives for User Programs", by Andrew W. Appel and
      Kai Li, Proc. ASPLOS, 1991.

--Paging in day-to-day use

    --Demand paging
    --Growing the stack
    --BSS page allocation
    --Shared text
    --Shared libraries
    --Shared memory
    --Copy-on-write (fork, mmap, etc.)

2C. Page faults: costs

--What does demand paging (i.e., paging from the disk) cost?

    --let's look at average memory access time (AMAT):

        AMAT = (1-p)*(memory access time) + p*(page fault service time)

      where p is the probability of a page fault, and

        memory access time ~ 100 ns
        disk access time   ~ 10 ms = 10^7 ns

    --QUESTION: what does p need to be to ensure that paging hurts
      performance by less than 10%?

        1.1*t_M = (1-p)*t_M + p*t_D
        p = 0.1*t_M / (t_D - t_M) ~ 10^1 / 10^7 = 10^{-6}

      so only one access out of 1,000,000 can be a page fault!!
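    --to make the arithmetic concrete, here is the same calculation as
      a small Python sketch (the names t_M, t_D, and amat are ours,
      for illustration; the numbers are the rough estimates above, not
      measurements):

        # AMAT back-of-the-envelope; all times in nanoseconds
        t_M = 100          # memory access time: ~100 ns
        t_D = 10 * 10**6   # page fault service (disk) time: ~10 ms

        def amat(p):
            """average memory access time, given page fault probability p"""
            return (1 - p) * t_M + p * t_D

        # largest p for which paging hurts by less than 10%:
        # solve 1.1*t_M = (1-p)*t_M + p*t_D for p
        p_max = 0.1 * t_M / (t_D - t_M)
        print(p_max)               # ~1e-06: one fault per million references
        print(amat(p_max) / t_M)   # ~1.1, i.e., the 10% slowdown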
--basically, page faults are super-expensive (good thing the machine
  can do other things during a page fault)

--Thrashing is even worse: memory is overcommitted, so pages get
  tossed out while they are still needed

  Example:

    --one program touches 50 pages (each equally likely); we have only
      40 physical page frames

    --if we have enough pages: 100 ns/ref

    --if we have too few pages, assume every 5th reference leads to a
      page fault

        --4 refs x 100 ns, plus 1 page fault x 10 ms for the disk I/O

        --this gets us 5 refs per (10 ms + 400 ns), i.e., roughly
          2 ms/ref: a 20,000x slowdown!!!

    --What we wanted: virtual memory the size of the disk, with access
      time the speed of physical memory

    --What we have here: memory with access time roughly that of the
      disk (2 ms/mem_ref, compared to 10 ms/disk_access)

  The concept is much larger than OSes: you need to pay attention to
  the slow case if it's really slow and common enough to matter.

3. Other page structures

A. Very large page sizes (e.g., 4 MB)

    --advantage: small page tables
    --disadvantage: lots of wasted memory
    --PSE (page size extension): set bit 7 in the PDE, and you get 4MB
      pages with no page tables
    --**there is a trade-off between large page sizes and small page
      sizes**. what is the nature of the trade-off?
        --large page sizes mean wasting actual memory
        --small page sizes mean lots of page table entries (which may
          or may not get consumed)

B. Many levels of page table

    --advantage: not much memory is spent on page tables if the
      address space is sparse
    --disadvantage: lots of page table walking

C. What happens when memory gets huge?

    --many levels of page table; or
    --inverted page table
        --works as a hash table: hash the virtual page number to find
          the entry
        --stores one entry per physical frame, rather than one per
          virtual page

---------------------------------------------------------------------------

4. Replacement policies

--this topic is related to the previous one but is also more general
  than the paging context

--the fundamental problem/question:

    --some entity holds a cache of entries and gets a cache miss. The
      entity now needs to decide which entry to throw away. How does
      it decide?

--make sure you understand why page faults that result from
  "page-not-present in memory" are a particular kind of cache miss

    --(the answer is that in the world of virtual memory, the pages
      resident in memory are basically a cache for the backing store
      on the disk; make sure you see why this claim, about virtual
      memory vis-a-vis the disk, is true.)

--the system needs to decide which entry to throw away, which calls
  for a *replacement policy*

--so let's cover some policies [put these on the board in one place]

* FIFO: throw out the oldest page. (This results in every page
  spending the same number of references in memory. Not a good idea:
  pages are not accessed uniformly.)

--optimal:

* MIN (also known as OPT): throw away the entry that won't be used for
  the longest time. Our textbook and other references assert that it
  is optimal, but they do not prove it. It's a good idea to get in the
  habit of convincing yourselves of (or disproving) assertions. Here's
  a proof, under the assumption that the cache is always full:

    Choose any other scheme; call it ALT. Now let's count the number
    of misses under ALT and OPT, inducting over the number of
    references. There are four cases at any given reference:
    {OPT hits, ALT hits}, {OPT hits, ALT misses}, {OPT misses, ALT
    misses}, {OPT misses, ALT hits}. The only interesting case is the
    last one (in the other cases, OPT does as well as or better than
    ALT, so OPT keeps pace with, or beats, the competition at every
    reference). Say that the last case happens at a reference, r.

    By the induction hypothesis, OPT was optimal right up until the
    *last* miss that OPT experienced, at reference, say, r-a. After
    that reference, there has been only one miss (the current one, at
    r). ALT couldn't have done better than OPT up until r-a (by the
    induction hypothesis), and since r-a, OPT has had only that one
    miss. But ALT cannot have had zero misses between r-a and r: if it
    had, that would mean that OPT replaced the wrong entry at r-a
    (another way to say the same thing: OPT is chosen so that a is
    maximal). Thus, OPT is no worse than ALT at r. In the remaining
    cases, OPT is as good as or better than ALT in terms of
    contributing to the number of misses. So by induction, OPT is
    optimal.

--evaluating these policies:

    input:
        --reference string: a sequence of page accesses
        --cache (e.g., physical memory) size
    output:
        --number of cache evictions (e.g., number of swaps)

--examples......

    --time goes left to right
    --cache hit = h

    ------------------------------------
    FIFO

    phys_slot   A   B   C   A   B   D   A   D   B   C   B

    S1          A           h       D       h       C
    S2              B           h       A
    S3                  C                       B       h

    7 swaps, 4 hits

    ------------------------------------
    OPTIMAL

    phys_slot   A   B   C   A   B   D   A   D   B   C   B

    S1          A           h           h           C
    S2              B           h               h       h
    S3                  C           D       h

    5 swaps, 6 hits

    ------------------------------------

* LRU: throw out the least recently used page. (This is often a good
  idea, but it depends on the future looking like the past. What if we
  chuck a page from our cache and then are just about to use it?)

    LRU

    phys_slot   A   B   C   A   B   D   A   D   B   C   B

    S1          A           h           h           C
    S2              B           h               h       h
    S3                  C           D       h

    5 swaps, 6 hits

--LRU looks awesome!
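--(aside: these tables are easy to check mechanically. Here is a small
  Python sketch, with function and variable names of our choosing,
  that replays a reference string under each policy and counts swaps;
  on the string above it prints 7 swaps/4 hits for FIFO and 5 swaps/6
  hits for OPT and LRU:)

    def count_swaps(policy, refs, nslots):
        """replay refs against a cache of nslots slots; return # of misses"""
        cache = []                             # resident pages
        misses = 0
        for i, page in enumerate(refs):
            if page in cache:
                if policy == "LRU":            # a hit updates recency
                    cache.remove(page)
                    cache.append(page)
                continue
            misses += 1
            if len(cache) == nslots:           # cache full: must evict
                if policy in ("FIFO", "LRU"):
                    victim = cache[0]          # oldest load / least recent use
                else:                          # OPT: evict furthest future use
                    future = refs[i+1:]
                    victim = max(cache, key=lambda q: future.index(q)
                                 if q in future else len(future))
                cache.remove(victim)
            cache.append(page)                 # newest at the tail
        return misses

    refs = list("ABCABDADBCB")
    for policy in ("FIFO", "OPT", "LRU"):
        swaps = count_swaps(policy, refs, 3)
        print(policy, swaps, "swaps,", len(refs) - swaps, "hits")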
--but what if our reference string were ABCDABCDABCD?

    LRU

    phys_slot   A   B   C   D   A   B   C   D   A   B   C   D

    S1          A           D           C           B
    S2              B           A           D           C
    S3                  C           B           A           D

    12 swaps, 0 hits. BUMMER.

--same thing happens with FIFO
--what about OPT? [not as much of a bummer at all.]

--other weirdness: Belady's anomaly: what happens if you add memory
  under a FIFO policy?

    phys_slot   A   B   C   D   A   B   E   A   B   C   D   E

    S1          A           D           E                   h
    S2              B           A           h       C
    S3                  C           B           h       D

    9 swaps, 3 hits. not great. let's add some slots; maybe we can do
    better:

    phys_slot   A   B   C   D   A   B   E   A   B   C   D   E

    S1          A               h       E               D
    S2              B               h       A               E
    S3                  C                       B
    S4                      D                       C

    10 swaps, 2 hits. this is worse.

--do these anomalies always happen?

    --answer: no. with policies like LRU, the contents of a memory of
      X pages are a subset of the contents of a memory of X+1 pages,
      so adding memory can never increase the number of misses

--all things considered, LRU is pretty good. let's try to implement
  it......

--implementing LRU

    --reasonable to do in application programs like Web servers that
      cache pages (or in dedicated Web caches). [use a queue to track
      the least recently accessed entry, and use a hash map to
      implement the (k,v) lookup; see the sketch at the end of these
      notes]

    --in the OS, LRU itself does not sound great: it would double
      memory traffic (after every reference, we'd have to move some
      structure to the head of some list)

    --and in hardware, it's way too much work to timestamp each
      reference and keep the list ordered (remember that the TLB may
      also be implementing these solutions)

--how can we approximate LRU?
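--(and, as promised above, here is the queue + hash map idea for
  application-level LRU, sketched in Python. OrderedDict supplies both
  pieces at once: hash-map lookup plus a linked list kept in order,
  which we maintain as recency order. The class name and interface are
  ours, for illustration only:)

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()   # key -> value, least recent first

        def get(self, key):
            if key not in self.entries:
                return None                # miss: caller fetches the entry
            self.entries.move_to_end(key)  # record the use: now most recent
            return self.entries[key]

        def put(self, key, value):
            if key in self.entries:
                self.entries.move_to_end(key)
            elif len(self.entries) == self.capacity:
                self.entries.popitem(last=False)   # evict the LRU entry
            self.entries[key] = value

    # replaying the reference string from the examples above:
    cache = LRUCache(3)
    for page in "ABCABDADBCB":
        if cache.get(page) is None:        # miss (a "page fault")
            cache.put(page, "contents of " + page)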