Class 7
CS 372H 11 February 2010

On the board
------------

1. Last time
    --page structures
    --page faults are expensive

2. Replacement policies

3. Misc

---------------------------------------------------------------------------

1. last time

clarify bit 7 in the PDE (the page-size flag, PS): it is set per-entry
    --raises the question: when a user-level program accesses an address
      in a 4MB range, how does the processor know whether the mapping is
      a 4MB page or a 4KB page?
    --could imagine putting ~1000 entries (one per 4KB page) in the TLB
    --but instead the processor has two TLBs: one for 4KB pages, one for
      4MB pages

also, some of you asked why ebx is callee-saved while eax, ecx, edx are
caller-saved. the answer is.....

    eax: "accumulator" register
    ebx: "base" register
    ecx: "count" register
    edx: "data" register

    the idea was that ebx would point to the base of a data structure
    (just as ebp means "base pointer" and points to the base of a stack
    frame). ebx often points to a data segment (e.g., for a dynamic
    library's data): a function sets it up at the beginning and keeps it
    constant throughout. because it's a "stable" pointer, it makes sense
    to have the callee save it. eax/ecx/edx are more ephemeral: they are
    used for particular calculations and such, so it makes sense to
    require the caller to save them if it needs the values in there.

2. Replacement

--have a cache of entries
--get a cache miss (of which page faults that result from
  "page-not-present in memory" are a particular kind)
    --make sure you understand why, in the world of virtual memory, the
      pages resident in memory are basically a cache of the backing
      store on the disk
--now, which entry do you throw away?
    --need a replacement policy
--so let's cover some policies [put these on the board in one place]

* FIFO: throw out the oldest page. (every page then spends the same
  number of references in memory. not a good idea: pages are not
  accessed uniformly.)

--optimal:

* MIN (also known as OPT): throw away the entry that won't be used for
  the longest time. our textbook and other references assert that it is
  optimal, but they do not prove it. it's a good idea to get in the
  habit of convincing yourselves of (or disproving) assertions.

  here's a proof, under the assumption that the cache is always full:

    choose any other scheme; call it ALT. now let's compare the number
    of misses under ALT and OPT, inducting over the number of
    references. there are four cases at any given reference:
    {OPT hits, ALT hits}, {OPT hits, ALT misses},
    {OPT misses, ALT misses}, {OPT misses, ALT hits}. the only
    interesting case is the last one (in the other cases, OPT does as
    well or better than ALT, so OPT keeps pace with, or beats, the
    competition at every reference).

    say that the last case happens at a reference, r. by the induction
    hypothesis, OPT was optimal right up until the *last* miss OPT
    experienced, at reference, say, r-a. after that reference, OPT has
    had only one miss (the current one, at r). ALT couldn't have done
    better than OPT up until r-a (by the induction hypothesis), and
    since r-a, OPT has had only one miss. but ALT could not have had 0
    misses between r-a and r, because if it did, OPT must have replaced
    the wrong entry at r-a (another way to say the same thing: OPT was
    chosen so that a is maximal). thus, OPT is no worse than ALT at r.
    in the remaining cases, OPT contributes no more misses than ALT. so
    by induction, OPT is optimal.

--evaluating these policies
    input:
        --reference string: sequence of page accesses
        --cache (e.g., physical memory) size
    output:
        --number of cache evictions (e.g., number of swaps)
    --examples below......
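[aside: before the worked examples, here is a minimal C sketch of a
FIFO simulator; it reproduces the "7 swaps, 4 hits" of the first
example below. the names (NSLOTS, refs, etc.) are illustrative, not
from any particular OS:]

    #include <stdio.h>

    #define NSLOTS 3   /* size of the "physical memory" cache */

    int main(void) {
        const char *refs = "ABCABDADBCB";  /* the reference string */
        char slot[NSLOTS];
        int used = 0;            /* how many slots are filled so far */
        int oldest = 0;          /* FIFO pointer: next slot to evict */
        int hits = 0, swaps = 0;

        for (const char *p = refs; *p; p++) {
            int hit = 0;
            for (int i = 0; i < used; i++)
                if (slot[i] == *p) { hit = 1; break; }
            if (hit) {
                hits++;
            } else {
                swaps++;
                if (used < NSLOTS) {
                    slot[used++] = *p;     /* fill an empty slot */
                } else {
                    slot[oldest] = *p;     /* evict the oldest page */
                    oldest = (oldest + 1) % NSLOTS;
                }
            }
        }
        printf("%d swaps, %d hits\n", swaps, hits); /* 7 swaps, 4 hits */
        return 0;
    }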
--time goes left to right.
--cache hit = h

------------------------------------
FIFO

phys_slot    A  B  C  A  B  D  A  D  B  C  B

S1           A        h     D     h     C
S2              B        h     A
S3                 C                 B     h

7 swaps, 4 hits
------------------------------------
OPTIMAL

phys_slot    A  B  C  A  B  D  A  D  B  C  B

S1           A        h        h        C
S2              B        h           h     h
S3                 C        D     h

5 swaps, 6 hits
------------------------------------

* LRU: throw out the least recently used. (this is often a good idea,
  but it depends on the future looking like the past. what if we chuck
  a page from our cache and then turn out to be just about to use it?)

LRU

phys_slot    A  B  C  A  B  D  A  D  B  C  B

S1           A        h        h        C
S2              B        h           h     h
S3                 C        D     h

5 swaps, 6 hits

--LRU looks awesome!
--but what if our reference string were ABCDABCDABCD?

phys_slot    A  B  C  D  A  B  C  D  A  B  C  D

S1           A        D        C        B
S2              B        A        D        C
S3                 C        B        A        D

12 swaps, 0 hits. BUMMER.
--same thing happens with FIFO.
--what about OPT? [much less of a bummer.]

--other weirdness: Belady's anomaly: what happens if you add memory
  under a FIFO policy?

phys_slot    A  B  C  D  A  B  E  A  B  C  D  E

S1           A        D        E              h
S2              B        A        h     C
S3                 C        B        h     D

9 swaps, 3 hits. not great. let's add some slots. maybe we can do
better:

phys_slot    A  B  C  D  A  B  E  A  B  C  D  E

S1           A           h     E           D
S2              B           h     A           E
S3                 C                 B
S4                    D                 C

10 swaps, 2 hits. this is worse.

--do these anomalies always happen?
    --answer: no. with policies like LRU, the contents of memory with X
      pages is a subset of the contents with X+1 pages (such policies
      are called "stack algorithms", and they don't suffer Belady's
      anomaly)

--all things considered, LRU is pretty good. let's try to implement
  it......

--implementing LRU
    --reasonable to do in application programs like Web servers that
      cache pages (or dedicated Web caches). [use a queue to track the
      least recently accessed entry and a hash map to implement the
      (key, value) lookup]
    --in the OS, true LRU does not sound great: tracking every
      reference would roughly double memory traffic
    --and in hardware, it's way too much work to timestamp each
      reference and keep the list ordered (remember that the TLB would
      also have to implement whatever we choose)
    --how can we approximate LRU?

--another algorithm:

* CLOCK
    --arrange the slots in a circle. a hand sweeps around, clearing a
      bit; the bit is set when the page is accessed. evict a page if
      the hand points to it when the bit is clear. (a C sketch of this
      appears at the end of these notes.)
    --approximates LRU ... because we're evicting pages that haven't
      been used in a while....though of course we may not be evicting
      the *least* recently used one (why not?)

--can generalize this:

* NTH CHANCE
    --don't throw a page out until the hand has swept by N times.
    --OS keeps a counter per page: # of sweeps
    --on page fault, the OS checks the use bit:
        1 --> clear the use bit and clear the counter
        0 --> increment the counter
              if counter < N, keep going
              if counter = N, replace the page: it hasn't been used in
              a while
    --how to pick N?
        large N --> better approximation to LRU
        small N --> more efficient. otherwise going around the circle a
          lot (might need to keep going around and around until some
          page's counter reaches N)
    --modification:
        --dirty pages are more expensive to evict (why?)
        --so give dirty pages an extra chance before replacing
        --common approach (supposedly on Solaris, but I don't know):
            --clean pages use N = 1
            --dirty pages use N = 2 (but initiate writeback when N = 1,
              i.e., try to get the page clean at N = 1)

--Section 3.4.10 in the text summarizes the various policies. for
  example:
    --NRU (two bits per page)
        --but coarse-grained: only four classes of pages:
            class 0: not referenced, not modified
            class 1: not referenced, modified
            class 2: referenced, not modified
            class 3: referenced, modified

---------------------------------------------------------------------------

admin

announcements:

* will release a lecture on Friday; watch it over the weekend.
* lab T will be about threading. it will be due between parts 3A and
  3B.

---------------------------------------------------------------------------

3. Misc points

A. Implementation points

--note that many machines, x86 included, maintain 4 bits per page
  table entry:

    --*use*: set when the page is referenced; cleared by an algorithm
      like CLOCK (called "Accessed" on x86)
    --*modified*: set when the page is modified; cleared when the page
      is written to disk (called "Dirty" on x86)
    --*valid*: the program can reference this page without getting a
      page fault. set if the page is in memory? [no. it is "only if",
      not "if": *valid*=1 implies the page is in physical memory, but
      the page being in physical memory does not imply *valid*=1; in
      other words, *valid*=0 does not imply the page is absent from
      physical memory.]
    --*read-only*: the program can read the page, but not modify it.
      set if the page is truly read-only? [no. similar to the case
      above, but slightly confusing because the bit is called
      "writable". if a page's bits make it appear read-only, that may
      or may not be because it is truly read-only. but if a page is
      truly read-only, it had better have its bits set read-only.]

--do we need the Modified and Referenced bits in the page tables to be
  set by the hardware?
    --[x86 calls these the Dirty and Accessed bits]
    --answer: no.
    --how could we simulate them?
        --for the Modified [x86: Dirty] bit, just mark all pages
          read-only. then if a write happens, the OS gets a page fault
          and can set the bit itself. the OS should then mark the page
          writable so that this page fault doesn't happen again. (a
          sketch of this appears at the end of these notes.)
        --for the Referenced [x86: Accessed] bit, just mark all pages
          as not present (even if they are present). then if a
          reference happens, the OS gets a page fault and can set the
          bit, after which point the OS should mark the page present
          (i.e., set the PRESENT bit).
    --note that it would be an algorithm like CLOCK or WSClock that
      does the clearing.

B. What if caching doesn't work?

reasons:
    --the process doesn't reuse memory
    --the process reuses memory, but it doesn't fit
    --individually, all processes fit, but together they are too much
      for the system

what do we do?
    --in the first two cases, there's nothing you can do, other than
      restructuring your computation or buying memory (e.g., expensive
      hardware that keeps an entire customer database in RAM)
    --in the third case, the OS can and must shed load. how? two
      approaches:

        a. working set
            --only run processes such that the union of their working
              sets fits in memory
            --the book defines working set. short version: the pages a
              process has touched over some trailing window of time

        b. page fault frequency
            --track the metric (# of page faults / instructions
              executed)
            --if that metric rises above a threshold, and there is not
              enough memory on the system, swap out the process

C. Fairness

--if the OS needs to swap a page out, does it consider all pages in
  one pool, or only those of the process that caused the page fault?
--what is the trade-off between local and global policies?
    --global: more flexible but less fair
    --local: less flexible but fairer
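
---------------------------------------------------------------------------

[appendix: a minimal C sketch of the CLOCK algorithm from section 2,
under the assumption of a fixed frame table whose use bits get set on
access; the names (frame, NFRAMES, clock_evict) are illustrative, not
from any real kernel:]

    #include <stdbool.h>

    #define NFRAMES 64

    struct frame {
        int  page;   /* which page occupies this frame */
        bool use;    /* set (by hardware or software) on each access */
    };

    static struct frame frames[NFRAMES];
    static int hand;                     /* the clock hand */

    /* pick a victim frame: sweep the circle, clearing use bits, until
       we find a frame whose use bit is already clear */
    int clock_evict(void) {
        for (;;) {
            if (!frames[hand].use) {
                int victim = hand;       /* bit clear: evict this one */
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            frames[hand].use = false;    /* bit set: second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }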
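[appendix: and a sketch of simulating the Modified/Dirty bit in
software, per section 3.A: map pages read-only and record the first
write when it faults. PTE_W matches the x86 writable bit, but pte_t,
soft_dirty, and note_first_write are hypothetical names:]

    #include <stdint.h>

    typedef uint32_t pte_t;        /* a 32-bit x86 page table entry */
    #define PTE_W 0x002            /* the x86 "writable" bit in a PTE */

    #define NPAGES 1024
    static int soft_dirty[NPAGES]; /* software "modified" bits */

    /* called from the page fault handler on a write to a page that we
       deliberately mapped read-only: record the write in software,
       then make the page writable so we take this fault at most once
       per page */
    void note_first_write(pte_t *pte, int vpn) {
        soft_dirty[vpn] = 1;
        *pte |= PTE_W;
        /* the caller should also invalidate the TLB entry for vpn
           (e.g., with invlpg) before resuming the faulting
           instruction */
    }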