Class 14
CS 202
17 March 2021

On the board
------------

1. Last time
2. Page replacement policies, cont'd.
3. Thrashing
4. mmap()
5. intro to I/O: I/O architecture

---------------------------------------------------------------------------

1. Last time
    
    - page faults: uses, costs, page replacement policies

2. Page replacement policies, cont'd.

    --all things considered, LRU is pretty good. let's try to implement
    it......

    --implementing LRU 

	--reasonable to do in application programs like Web servers that
	cache pages (or dedicated Web caches).
	    [use queue to track least recently accessed and use hash map
	    to implement the (k,v) lookup]

	--in OS, LRU itself does not sound great. would be doubling
	memory traffic (after every reference, have to move some
	structure to the head of some list)

	--and in hardware, it's way too much work to timestamp each
	reference and keep the list ordered (remember that the TLB may
	also be implementing these solutions)
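
    [Sketch, not from the handout: one possible shape for the
    application-level LRU cache mentioned above, with a doubly linked
    "queue" ordered by recency plus a chained hash table for the (k,v)
    lookup. All names (struct lru, lru_get, lru_put, LRU_CAP, NBUCKETS)
    and the int-to-int key/value types are made up for illustration.]

```c
#include <assert.h>
#include <stdbool.h>

#define LRU_CAP  4   /* hypothetical capacity */
#define NBUCKETS 8   /* hypothetical hash table size */

struct entry { int key, val; int prev, next; int hnext; };

struct lru {
    struct entry e[LRU_CAP];
    int bucket[NBUCKETS];   /* head of each hash chain, -1 if empty */
    int head, tail;         /* most / least recently used, -1 if empty */
    int used;
};

static unsigned h(int key) { return (unsigned)key % NBUCKETS; }

void lru_init(struct lru *c) {
    for (int i = 0; i < NBUCKETS; i++) c->bucket[i] = -1;
    c->head = c->tail = -1;
    c->used = 0;
}

static int find(struct lru *c, int key) {
    for (int i = c->bucket[h(key)]; i != -1; i = c->e[i].hnext)
        if (c->e[i].key == key) return i;
    return -1;
}

static void list_unlink(struct lru *c, int i) {
    struct entry *x = &c->e[i];
    if (x->prev != -1) c->e[x->prev].next = x->next; else c->head = x->next;
    if (x->next != -1) c->e[x->next].prev = x->prev; else c->tail = x->prev;
}

static void list_push_front(struct lru *c, int i) {
    c->e[i].prev = -1;
    c->e[i].next = c->head;
    if (c->head != -1) c->e[c->head].prev = i; else c->tail = i;
    c->head = i;
}

static void hash_remove(struct lru *c, int i) {
    int *p = &c->bucket[h(c->e[i].key)];
    while (*p != i) { assert(*p != -1); p = &c->e[*p].hnext; }
    *p = c->e[i].hnext;
}

/* Lookup; a hit moves the entry to the most-recently-used end. */
bool lru_get(struct lru *c, int key, int *val) {
    int i = find(c, key);
    if (i == -1) return false;
    list_unlink(c, i);
    list_push_front(c, i);
    *val = c->e[i].val;
    return true;
}

/* Insert or update; when full, the least recently used entry is evicted. */
void lru_put(struct lru *c, int key, int val) {
    int i = find(c, key);
    if (i != -1) {
        c->e[i].val = val;
        list_unlink(c, i);
    } else {
        if (c->used < LRU_CAP) i = c->used++;
        else { i = c->tail; list_unlink(c, i); hash_remove(c, i); }
        c->e[i].key = key;
        c->e[i].val = val;
        c->e[i].hnext = c->bucket[h(key)];
        c->bucket[h(key)] = i;
    }
    list_push_front(c, i);
}
```

    [Note that every hit rewrites list pointers. That is affordable
    per Web request, but it is exactly the per-reference bookkeeping
    that makes true LRU unattractive inside the OS.]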


    --how can we approximate LRU?

    --another algorithm:
        * CLOCK

	--arrange the slots in a circle. hand sweeps around, clearing
	each page's use bit as it passes. the bit is set when the page
	is accessed. on a page fault, evict the first page the hand
	reaches whose bit is still clear.
    
	--approximates LRU ... because we're evicting pages that haven't
	been used in a while....though of course we may not be evicting
	the *least* recently used one (why not?)
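
    [Sketch, not from the handout: a small simulation of CLOCK over
    an array of frames. The struct and function names are
    hypothetical, and the linear search in clock_access stands in for
    what the MMU and page tables would do on a real machine.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical frame table for illustration. */
struct clock_frame {
    int  page;   /* virtual page held in this frame, -1 if empty */
    bool use;    /* use bit: set on access, cleared by the hand */
};

struct clock_state {
    struct clock_frame *frames;
    size_t nframes;
    size_t hand;             /* index the clock hand points at */
};

/* Sweep until a frame with a clear use bit is found; return its index. */
size_t clock_pick_victim(struct clock_state *c)
{
    assert(c->nframes > 0);
    for (;;) {
        size_t idx = c->hand;
        struct clock_frame *f = &c->frames[idx];
        c->hand = (c->hand + 1) % c->nframes;
        if (f->page < 0 || !f->use)
            return idx;       /* empty, or unused since the last sweep */
        f->use = false;       /* used recently: clear the bit, move on */
    }
}

/* Reference `page`: returns true on a hit, false on a page fault. */
bool clock_access(struct clock_state *c, int page)
{
    for (size_t i = 0; i < c->nframes; i++) {
        if (c->frames[i].page == page) {
            c->frames[i].use = true;   /* hardware would set this bit */
            return true;
        }
    }
    size_t v = clock_pick_victim(c);
    c->frames[v].page = page;
    c->frames[v].use = true;
    return false;
}
```

    [The loop always terminates: after one full sweep every use bit
    has been cleared, so the second sweep finds a victim at the
    latest.]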

    --can generalize CLOCK:
        * NTH CHANCE

	--don't throw a page out until the hand has swept by N times.

	--OS keeps counter per page: # sweeps

	--On page fault, OS looks at page pointed to by the hand,
	and checks that page's use bit
	    1 --> clear use bit and clear counter
	    0 --> increment counter
		if counter < N, keep going
		if counter = N, replace the page: it hasn't been used in
		  a while

	--How to pick N?
	    Large N --> better approximation to LRU
	    Small N --> more efficient. otherwise going around the
	    circle a lot (might need to keep going around and around
	    until some page's counter reaches N)

	--modification:

	    --dirty pages are more expensive to evict (why?)

	    --so give dirty pages an extra chance before replacing

	    common approach (supposedly on Solaris but I don't know):
	    --clean pages use N = 1
	    --dirty pages use N = 2 
		(but initiate write-back when the page's counter reaches
		1, so that the page is hopefully clean by the time its
		counter reaches 2)
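
    [Sketch, not from the handout: NTH CHANCE victim selection with
    separate thresholds for clean and dirty pages. The names
    (nc_frame, nth_chance_pick, n_clean, n_dirty) are hypothetical,
    and write-back of dirty pages is only marked by a comment, not
    modeled.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct nc_frame {
    bool use;      /* set by hardware on access */
    bool dirty;    /* set by hardware on write */
    int  sweeps;   /* times the hand passed while the use bit was clear */
};

/*
 * Advance the hand until some frame has gone unused for its full quota
 * of sweeps, and return that frame's index as the victim.
 */
size_t nth_chance_pick(struct nc_frame *f, size_t n, size_t *hand,
                       int n_clean, int n_dirty)
{
    assert(n > 0 && n_clean >= 1 && n_dirty >= n_clean);
    for (;;) {
        size_t i = *hand;
        *hand = (*hand + 1) % n;
        if (f[i].use) {            /* use bit 1: clear bit and counter */
            f[i].use = false;
            f[i].sweeps = 0;
            continue;
        }
        f[i].sweeps++;             /* use bit 0: increment counter */
        int limit = f[i].dirty ? n_dirty : n_clean;
        if (f[i].sweeps >= limit) {
            f[i].sweeps = 0;       /* frame is about to be reused */
            return i;
        }
        /* A real kernel would start write-back of a dirty page here. */
    }
}
```

    [With n_clean = 1 and n_dirty = 2 this matches the "clean pages
    use N = 1, dirty pages use N = 2" variant above.]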


    --Summary:

	--optimal is known as OPT or MIN 

	--LRU is usually a good approximation to optimal

	--Implementing LRU in hardware or at OS/hardware interface is a
	pain

	--So implement CLOCK or NTH CHANCE ... decent approximations to
	LRU, which is in turn good approximation to OPT *assuming that
	past is a good predictor of the future* (this assumption does
	not always hold!)


    Miscellaneous implementation points

	Note that many machines, x86 included, maintain (at least) the
	following 4 bits per page table entry:

	    --*use*: Set when page referenced; cleared by an algorithm
	    like CLOCK (the bit is called "Accessed" on x86)

	    --*modified*: Set when page modified; cleared when page
	    written to disk (the bit is called "Dirty" on x86)

	    --*present*: It's set only if page is in memory [asterisk:
	    note that it's an "only if" not an "if". There are cases
	    when the page is in physical memory but the bit is clear.]

	    --*read-only*: program can read page, but not modify it. Set
	    if page is truly read-only? [no. similar case to above, but
	    slightly confusing because the bit is called "writable". if
	    a page's bits are such that it appears to be read-only, that
	    page may or may not be truly "read only". meanwhile, if a
	    page is truly read-only, it better have its bits set to be
	    read-only.]

	Do we actually need Use and Modified bits in the page tables
	set by the hardware?

	    --[again, x86 calls these the Accessed and Dirty bits]

	    --answer: no.

	    --how could we simulate them?

	    --for the Modified [x86: Dirty] bit, just mark all pages
	    read-only. Then if a write happens, the OS gets a page fault
	    and can set the bit itself. Then the OS should mark the page
	    writable so that this page fault doesn't happen again

	    --for the Use [x86: Accessed] bit, just mark all pages as
	    not present (even if they are present). Then if a reference
	    happens, the OS gets a page fault, and can set the bit,
	    after which point the OS should mark the page present (i.e.,
	    set the PRESENT bit).
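
    [Sketch, not from the handout: a user-space simulation of the
    trick above. The "hardware" view (hw_present, hw_writable) is
    deliberately more restrictive than the OS's own bookkeeping
    (resident, may_write), and the fault handler fills in the
    emulated bits. All names are hypothetical.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical software-managed PTE; field names are illustrative. */
struct soft_pte {
    bool hw_present;    /* what the MMU sees */
    bool hw_writable;   /* what the MMU sees */
    bool resident;      /* OS's truth: page really is in RAM */
    bool may_write;     /* OS's truth: process is allowed to write */
    bool accessed;      /* emulated Use bit */
    bool dirty;         /* emulated Modified bit */
};

/* Simulated fault handler: false means a "real" fault (not modeled). */
static bool soft_fault(struct soft_pte *p, bool is_write)
{
    if (!p->resident) return false;               /* needs disk I/O */
    if (is_write && !p->may_write) return false;  /* true violation */
    p->accessed = true;
    p->hw_present = true;          /* stop trapping on later reads */
    if (is_write) {
        p->dirty = true;
        p->hw_writable = true;     /* stop trapping on later writes */
    }
    return true;
}

/* Simulated access: returns faults taken (0 or 1), or -1 on a real fault. */
int soft_access(struct soft_pte *p, bool is_write)
{
    assert(p);
    bool traps = !p->hw_present || (is_write && !p->hw_writable);
    if (!traps) return 0;
    return soft_fault(p, is_write) ? 1 : -1;
}

/* What CLOCK's hand would do: clear the emulated Use bit. */
void soft_clear_accessed(struct soft_pte *p)
{
    p->accessed = false;
    p->hw_present = false;   /* so that the next reference traps again */
}
```

    [The cost of the emulation shows up as the extra fault on the
    first read and on the first write after each clearing.]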


    Fairness

	--if OS needs to swap a page out, does it consider all pages in one
	pool or only those of the process that caused the page fault? 

	--what is the trade-off between local and global policies?

	    --global: more flexible but less fair

	    --local: less flexible but fairer


3. Thrashing

    [The points below apply to any caching system, but for the sake of
    concreteness, let's assume that we're talking about page replacement
    in particular.]

    What is thrashing?

    Processes require more memory than system has

    Specifically, each time a page is brought in, another page, whose
    contents will soon be referenced, is thrown out

	Example:

	    --one program touches 50 pages (each equally likely); only 
	      have 40 physical page frames 
	    
	    --If we have enough physical pages, 100ns/ref 
     
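	    --If we have too few physical pages, assume every 5th
	    reference leads to a page fault (roughly: 10 of the 50
	    equally likely pages are not resident, so a reference
	    misses with probability 10/50 = 1/5)

	    --4 refs x 100ns and 1 page fault x 10ms for disk I/O

	    --this gets us
		5 refs per (10ms + 400ns) ~ 2ms/ref, vs. 100ns/ref
		= 20,000x slowdown!!!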
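
    [Sketch, not from the handout: the arithmetic above as two small
    functions, so that the parameters can be varied. Function and
    parameter names are hypothetical.]

```c
#include <assert.h>

/*
 * Average cost per reference, in ns, when one out of every
 * `refs_per_fault` references takes a page fault.
 */
double avg_ns_per_ref(double mem_ns, double fault_ns, int refs_per_fault)
{
    assert(refs_per_fault >= 1);
    double hits = refs_per_fault - 1;
    return (hits * mem_ns + fault_ns) / refs_per_fault;
}

/* Slowdown relative to the case where every reference hits in memory. */
double slowdown(double mem_ns, double fault_ns, int refs_per_fault)
{
    return avg_ns_per_ref(mem_ns, fault_ns, refs_per_fault) / mem_ns;
}
```

    [With the numbers from the example, avg_ns_per_ref(100, 10e6, 5)
    is 2,000,080 ns, i.e., about 2 ms per reference, and the slowdown
    is about 20,000x.]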
     

	--What we wanted: virtual memory the size of disk with access
	time the speed of physical memory 

	--What we have here: memory with access time roughly of disk
	(2 ms/mem_ref compared to 10 ms/disk_access)

	As stated earlier, this concept is much larger than OSes: need
	to pay attention to the slow case if it's really slow and common
	enough to matter.


    Reasons/cases:

	--process doesn't reuse memory (or has no temporal locality)

	--process reuses memory but the memory that is absorbing
	most of the accesses doesn't fit.

	--individually, all processes fit, but too much for the system

    what do we do?

	--well, in the first two cases above, there's nothing you can
	do, other than restructuring your computation or buying memory
	(e.g., expensive hardware that keeps entire customer database in
	RAM)

	--in the third case, can and must shed load. how?
    
    two approaches:
	a. working set
	b. page fault frequency

    a. working set

	--only run a set of processes s.t. the union of their
	working sets fits in memory

	--definition of working set (short version): the pages a
	process has touched over some trailing window of time
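
    [Sketch, not from the handout: estimating a working set from a
    reference trace. Two simplifying assumptions are made here: the
    "window of time" is measured in number of references, and
    processes share no pages, so the union of working sets is just
    the sum of their sizes. Names are hypothetical.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Number of distinct pages among the last `window` references ending at
 * trace[t] (inclusive). Quadratic in `window`; fine for a sketch.
 */
size_t working_set_size(const int *trace, size_t t, size_t window)
{
    assert(window > 0);
    size_t start = (t + 1 > window) ? t + 1 - window : 0;
    size_t distinct = 0;
    for (size_t i = start; i <= t; i++) {
        bool seen = false;
        for (size_t j = start; j < i; j++) {
            if (trace[j] == trace[i]) { seen = true; break; }
        }
        if (!seen) distinct++;
    }
    return distinct;
}

/* Admission check: do all the working sets fit in memory together? */
bool working_sets_fit(const size_t *ws_sizes, size_t nprocs, size_t nframes)
{
    size_t total = 0;
    for (size_t i = 0; i < nprocs; i++)
        total += ws_sizes[i];
    return total <= nframes;
}
```

    [A scheduler following approach (a) would only admit a process
    to the runnable set while working_sets_fit() stays true.]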

    b. page fault frequency

	--track the metric (# page faults/instructions executed)

	--if that thing rises above a threshold, and there is not enough
	memory on the system, swap out the process

4. mmap

    Plays a role in lab5. Also, cool way to bring some ideas together.

    --recall some syscalls: 
	fd = open(pathname, flags)
	write(fd, buf, sz)
	read(fd, buf, sz)

    --we've seen fds before, but what's an fd?
	--indexes into a table maintained by the kernel on behalf of the process

    --syscall:
	void* mmap(void* addr, size_t len, int prot, int flags,
		   int fd, off_t offset);


        --means, roughly, "map the specified open file (fd) into a
        region of my virtual memory (close to addr, or at a kernel-selected
        place if addr is 0), and return a pointer to it"

        [see handout]

        NOTE: the "disk image" here is the file we've mmap()'ed, not the
        process's usual backing store. The idea is that mmap() lets the
        programmer "inject" pages from a regular file on disk into the
        process's backing store (which would otherwise be part of a swap
        file).

    --after this, loads and stores to p[x] (where p is the pointer
    that mmap() returned) are equivalent to reading and writing to
    the file at offset+x.
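
    [Sketch, not from the handout: changing one byte of a file with
    an ordinary store instead of write(). The function name
    poke_file_byte is hypothetical; the calls are standard POSIX. The
    file is mapped from offset 0, so p[x] corresponds to file offset
    x. The msync() is there to be conservative about when the change
    becomes visible through read().]

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Overwrite byte `x` of the file at `path` with `c` using a store
 * through a shared mapping. Returns 0 on success, -1 on error.
 */
int poke_file_byte(const char *path, off_t x, char c)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || x < 0 || x >= st.st_size) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    /* addr = NULL: let the kernel choose where to place the mapping. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping outlives the fd */
    if (p == MAP_FAILED) return -1;

    p[x] = c;                        /* plain store, lands at file offset x */

    int rc = msync(p, len, MS_SYNC); /* push the dirty page to the file */
    if (munmap(p, len) < 0) rc = -1;
    return rc;
}
```

    [MAP_SHARED is what makes the store reach the file; with
    MAP_PRIVATE the process would get a private copy-on-write page
    instead.]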

    --why is this cool?

        - example: mmap enables copying a file to stdout without
        transferring data to user space

            see handout

            NOTE: the process never itself dereferences a pointer to
            memory containing file data.

            NOTE: this saves two sets of memory-to-memory copies
            (kernel-to-user, user-to-kernel), versus the "naive"
            solution of read()ing into a buffer in user space, and then
            write()ing

            [Also, a well-tuned buffer cache manages which file pages
            are kept in RAM, rather than leaving the app developer to
            have to explicitly try to manage that (and potentially have
            the OS page replacement algorithm underneath make
            conflicting decisions).]

        - other examples:

            - reading big files. map the whole thing, rely on the paging
            mechanism to bring the needed pieces into memory as necessary

            - shared data structures, when flag is MAP_SHARED 

            - file-based data structures:
                - load data from file, update it, write it back
                - this is implemented entirely with loads/stores

            Question: how does the OS ensure that it's only writing back
            modified pages?
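
    [Sketch, not the handout's code: one way to write the
    "copy a file to stdout" example above. The function name mmap_cat
    is hypothetical. The process hands the mapped address to write()
    but never dereferences it itself. Error handling is minimal
    (e.g., an interrupted write() is simply reported as failure).]

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Write the whole file at `path` to `out_fd` (1 is stdout).
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t mmap_cat(const char *path, int out_fd)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  /* mmap fails for len 0 */
    size_t len = (size_t)st.st_size;

    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;

    size_t done = 0;
    while (done < len) {             /* write() may be partial */
        ssize_t n = write(out_fd, p + done, len - done);
        if (n < 0) { munmap(p, len); return -1; }
        done += (size_t)n;
    }
    munmap(p, len);
    return (ssize_t)done;
}
```

    [Compare with the read()/write() loop: there is no user-space
    buffer here, only an address range that the kernel backs with
    the file's pages.]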


    --how's mmap implemented?! (answer: through virtual memory,
    with the VA being addr [or whatever the kernel selects] and
    the PA being what? answer: the physical address storing the
    given page in the kernel's buffer cache).

    --have to deal with eviction from buffer cache, so kernel will need
    a data structure that maps from:
        Phys page --> {list of (proc,va) pairs}
      
    note that the kernel needs this data structure anyway: when a
    page is evicted from RAM, the kernel needs to be able to invalidate
    the given virtual address in the page table(s) of the process(es)
    that have the page mapped.
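
    [Sketch, not from the handout: a toy version of the
    "Phys page --> {list of (proc,va) pairs}" structure. The names,
    the fixed-size array, and the fake_invalidate stand-in are all
    hypothetical; a real kernel would clear the PTE and flush the TLB
    at that point.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MAPPERS 8   /* arbitrary cap for the sketch */

struct mapping { int proc; unsigned long va; };

/* One of these per physical page: who has it mapped, and where. */
struct phys_page {
    struct mapping mappers[MAX_MAPPERS];
    size_t nmappers;
};

/* Record that `proc` has this physical page mapped at `va`. */
bool rmap_add(struct phys_page *pp, int proc, unsigned long va)
{
    if (pp->nmappers == MAX_MAPPERS) return false;
    pp->mappers[pp->nmappers++] = (struct mapping){ proc, va };
    return true;
}

/* Stand-in for "clear proc's PTE for va and flush the TLB entry". */
size_t invalidations;
void fake_invalidate(int proc, unsigned long va)
{
    (void)proc;
    (void)va;
    invalidations++;
}

/*
 * On eviction: invalidate every (proc, va) that maps this page, then
 * forget the mappings. Returns how many mappings were invalidated.
 */
size_t rmap_evict(struct phys_page *pp,
                  void (*invalidate)(int proc, unsigned long va))
{
    assert(invalidate);
    size_t n = pp->nmappers;
    for (size_t i = 0; i < n; i++)
        invalidate(pp->mappers[i].proc, pp->mappers[i].va);
    pp->nmappers = 0;
    return n;
}
```

    [After rmap_evict() returns, the next access by any of those
    processes faults, and the kernel can bring the page back in.]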


5. I/O architecture (high-level)

        general:
        [draw picture: CPU, Mem, I/O, connected to BUS]

        devices:
        [see handout]

        lots of details.
        fun to play with.
        registers that do different things when read vs. written.