Class 13
CS 202
11 March 2026

On the board
------------

1. Last time
2. Page faults: intro and mechanics
3. Page faults: uses
4. Page faults: costs
5. Page replacement policies
6. Thrashing

---------------------------------------------------------------------------

1. Last time

    - case study of x86-64: multilevel page tables
    - hardware "walks" those page tables
    - caches the result in the TLB

2. Page faults: intro and mechanics

    We've discussed these a bit. Let's go into a bit more detail...

    Concept: a reference is illegal, either because it's not mapped in
    the page tables or because there is a protection violation.

    requires the OS to get involved

    this mechanism turns out to be hugely powerful, as we will see.

    Mechanics

    --what happens on the x86?

        --processor constructs a trap frame and transfers execution to
        an interrupt or trap handler

                        ss      [stack segment; ignore]
                        rsp     [former value of stack pointer]
                        rflags  [former value of rflags]
                        cs      [code segment; ignore]
                        rip     [instruction that caused the trap]
            %rsp -->    [error code]

            %rip now points to code to handle the trap
            [how did processor know what to load into %rip?]

            error code: [see handout]

                [ ................................ U/S | W/R | P]

                U/S: fault occurred in user mode (1) / supervisor mode (0)
                W/R: access was a write (1) / a read (0)
                P:   protection violation (1) / not-present page (0)

            on a page fault, %cr2 holds the faulting virtual address

        --intent: when page fault happens, the kernel sets up the
        process's page entries properly, or terminates the process

    Questions:

        --does TLB miss imply page fault? (no!)

        --does page fault imply TLB miss? (no!)
            (imagine a page that is mapped read-only. user-level process
            tries to write to it. TLB knows about the mapping, so no TLB
            miss. But this is still a protection violation. To cut down
            on terminology, we will lump this kind of violation in with
            "page fault".)

3. Uses of page faults

    --Best example: overcommitting physical memory (the classical use of
    "virtual memory")

        --your program thinks it has, say, 64 GB of memory, but your
        hardware has only 16 GB of memory

        --the way that this worked is that the disk was (is) used to
        store memory pages

        --advantage: address space looks huge

        --disadvantage: accesses to "paged" memory (as memory pages that
        live on the disk are known) are sllooooowwwww.

        --Rough implementation:

            --on a page fault, the kernel reads in the faulting page

            --QUESTION: what is listed in the page structures? how does
            kernel know whether the address is invalid, in memory,
            paged, what?

            --kernel may need to send a page to disk (under what
            conditions? answer: two conditions must hold for kernel to
            HAVE to write to disk)

                (1) kernel is out of memory
                (2) the page that it selects to write out is dirty

    --Many other uses

        --store memory pages across the network! (Distributed Shared
        Memory)

            --basic idea was that on a page fault, the page fault
            handler went and retrieved the needed page from some other
            machine

        --copy-on-write

            --when creating a copy of another process, don't copy its
            memory. just copy its page tables, mark the pages as
            read-only

            --QUESTION: do you need to mark the parent's pages as
            read-only as well?

            --program semantics aren't violated when programs do reads

            --when a write happens, a page fault results. at that point,
            the kernel allocates a new page, copies the memory over, and
            restarts the user program to do a write

            --then, only do copies of memory when there is a fault as a
            result of a write

            --this idea is all over the place; used in fork(), mmap(),
            etc.

        --accounting

            --good way to sample what percentage of the memory pages are
            written to in any time slice: mark a fraction of them not
            present, see how often you get faults

        --if you are interested in this, check out the paper "Virtual
        Memory Primitives for User Programs", by Andrew W. Appel and Kai
        Li, Proc. ASPLOS, 1991.
        --high-level idea: by giving kernel (or even user-level program)
        the opportunity to do interesting things on page faults, you can
        build interesting functionality

    --Paging in day-to-day use

        --Demand paging: bring program code into memory "lazily"

        --Growing the stack (contiguous in virtual space, probably not
        in physical space)

        --BSS page allocation (BSS segment contains the part of the
        address space with global variables, statically initialized to
        zero. OS can delay allocating and zeroing a page until the
        program accesses a variable on the page.)

        --Shared text

        --Shared libraries

        --Shared memory

4. Page faults: costs

    --What does paging from the disk cost?

    --let's look at average memory access time (AMAT)

    --AMAT = (1-p)*memory access time + p * page fault time, where p is
    the prob. of a page fault.

        memory access time ~ 100ns                 t_M
        disk access time   ~ 10 ms = 10^7 ns       t_D

    --QUESTION: what does p need to be to ensure that paging hurts
    performance by less than 10%?

        1.1 * t_M > (1-p)*t_M + p*t_D
        .1 * t_M > p*(t_D - t_M)
        p < .1*t_M / (t_D - t_M) ~ 10^1 ns / 10^7 ns = 10^{-6}

        so only one access out of 1,000,000 can be a page fault!!

    --basically, page faults are super-expensive (good thing the machine
    can do other things during a page fault)

    Concept is much larger than OSes: need to pay attention to the slow
    case if it's really slow and common enough to matter.

5. Page replacement policies

    --the fundamental problem/question:

        --some entity holds a cache of entries and gets a cache miss.
        The entity now needs to decide which entry to throw away. How
        does it decide?

        --make sure you understand why page faults that result from
        "page-not-present in memory" are a particular kind of cache miss

            --(the answer is that in the world of virtual memory, the
            pages resident in memory are basically a cache to the
            backing store on the disk; make sure you see why this claim,
            about virtual memory vis-a-vis the disk, is true.)
        --the system needs to decide which entry to throw away, which
        calls for a *replacement policy*.

        --let's cover some policies....

    Specific policies

    * FIFO: throw out oldest (results in every page spending the same
    number of references in memory. not a good idea. pages are not
    accessed uniformly.)

    * MIN (also known as OPT). throw away the entry that won't be used
    for the longest time. this is optimal.

    --evaluating these algorithms

        input
            --reference string: sequence of page accesses
            --cache (e.g., physical memory) size

        output
            --number of cache evictions (e.g., number of swaps)

    --examples......

        --time goes left to right.
        --cache hit = h

        ------------------------------------
        FIFO

        phys_slot   A  B  C  A  B  D  A  D  B  C  B
          S1        A        h     D     h     C
          S2           B        h     A
          S3              C                 B     h

        7 swaps, 4 hits

        ------------------------------------
        OPTIMAL

        phys_slot   A  B  C  A  B  D  A  D  B  C  B
          S1        A        h        h        C
          S2           B        h           h     h
          S3              C        D     h

        5 swaps, 6 hits

        ------------------------------------

    * LRU: throw out the least recently used (this is often a good idea,
    but it depends on the future looking like the past. what if we chuck
    a page from our cache and then were about to use it?)

        LRU

        phys_slot   A  B  C  A  B  D  A  D  B  C  B
          S1        A        h        h        C
          S2           B        h           h     h
          S3              C        D     h

        5 swaps, 6 hits

        --LRU looks awesome!

        --but what if our reference string were ABCDABCDABCD?

        phys_slot   A  B  C  D  A  B  C  D  A  B  C  D
          S1        A        D        C        B
          S2           B        A        D        C
          S3              C        B        A        D

        12 swaps, 0 hits. BUMMER.

        --same thing happens with FIFO.

        --what about OPT? [not as much of a bummer at all.]

    --other weirdness: Belady's anomaly: what happens if you add memory
    under a FIFO policy?

        phys_slot   A  B  C  D  A  B  E  A  B  C  D  E
          S1        A        D        E              h
          S2           B        A        h     C
          S3              C        B        h     D

        9 swaps, 3 hits. not great.

        let's add some slots. maybe we can do better

        phys_slot   A  B  C  D  A  B  E  A  B  C  D  E
          S1        A           h     E           D
          S2           B           h     A           E
          S3              C                 B
          S4                 D                 C

        10 swaps, 2 hits. this is worse.

    --do these anomalies always happen?

        --answer: no. with policies like LRU, contents of memory with X
        pages is subset of contents with X+1 pages

    --all things considered, LRU is pretty good.
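
    [Sketch, not from lecture: the traces above can be re-checked with a
    small simulator. This is a minimal version under assumptions: the
    function names are made up for illustration, a "swap" is counted on
    every miss (including the initial fills, as in the tables), and the
    reference string is passed as a Python string of one-letter page
    names.]

```python
from collections import OrderedDict


def simulate_fifo(refs, nslots):
    """Count misses ("swaps" in the tables, initial fills included)."""
    resident = []                # oldest-loaded page sits at index 0
    misses = 0
    for page in refs:
        if page in resident:
            continue             # hit: FIFO order ignores hits
        misses += 1
        if len(resident) == nslots:
            resident.pop(0)      # evict the page loaded earliest
        resident.append(page)
    return misses


def simulate_lru(refs, nslots):
    resident = OrderedDict()     # least recently used page comes first
    misses = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)    # hit: now most recently used
            continue
        misses += 1
        if len(resident) == nslots:
            resident.popitem(last=False)  # evict least recently used
        resident[page] = True
    return misses


def simulate_opt(refs, nslots):
    resident = set()
    misses = 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        misses += 1
        if len(resident) == nslots:
            future = refs[i + 1:]

            def next_use(p):
                # Pages never used again sort as "farthest away".
                return future.index(p) if p in future else len(future)

            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return misses


if __name__ == "__main__":
    for name, sim in [("FIFO", simulate_fifo),
                      ("LRU", simulate_lru),
                      ("OPT", simulate_opt)]:
        print(name, sim("ABCABDADBCB", 3))
    # Belady's anomaly: FIFO with 3 slots vs. 4 slots on the same string.
    print(simulate_fifo("ABCDABEABCDE", 3), simulate_fifo("ABCDABEABCDE", 4))
```

    [On the first reference string this gives 7, 5, and 5 misses for
    FIFO, LRU, and OPT, matching the tables; the last line gives 9 and
    then 10 for FIFO, which is the anomaly. Running simulate_lru on the
    same string with 3 and then 4 slots is a quick way to see that LRU
    does not get worse with more memory on this input.]
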
    let's try to implement it......

    --implementing LRU

        --reasonable to do in application programs like Web servers that
        cache pages (or dedicated Web caches). [use queue to track least
        recently accessed and use hash map to implement the (k,v)
        lookup]

        --in OS, LRU itself does not sound great. would be doubling
        memory traffic (after every reference, have to move some
        structure to the head of some list)

        --and in hardware, it's way too much work to timestamp each
        reference and keep the list ordered

    --how can we approximate LRU?

    --another algorithm:

    * CLOCK

        --arrange the slots in a circle. hand sweeps around, clearing a
        bit. the bit is set when the page is accessed (this is called
        the USE bit or ACCESSED bit; see the form of a page table entry
        (PTE) on the handout). just evict a page if the hand points to
        it when the bit is clear.

        --approximates LRU ... because we're evicting pages that haven't
        been used in a while....though of course we may not be evicting
        the *least* recently used one (why not?)

    --can generalize CLOCK:

    * NTH CHANCE

        --don't throw a page out until the hand has swept by N times.

        --OS keeps counter per page: # sweeps

        --On page fault (need to evict a page), OS looks at where the
        hand is currently pointing, call it physical page p. check the
        USE bit of p.

            1 --> clear use bit and clear counter
            0 --> increment counter
                    if counter < N, keep going
                    if counter = N, replace the page: it hasn't been
                    used in a while

        --How to pick N?

            Large N --> better approximation to LRU
            Small N --> more efficient. otherwise going around the
                        circle a lot (might need to keep going around
                        and around until a page's counter gets set = to
                        N)

        --modification:

            --dirty pages are more expensive to evict (why?)
            --so give dirty pages an extra chance before replacing

            (OS knows which pages are dirty because of the MODIFIED, or
            DIRTY, bit; see again the handout that describes the form of
            a page table entry)

            common approach (supposedly on Solaris but I don't know):
                --clean pages use N = 1
                --dirty pages use N = 2 (but initiate write back when
                the counter reaches 1, i.e., try to get the page clean
                after the first sweep)

    --Summary:

        --optimal is known as OPT or MIN

        --LRU is usually a good approximation to optimal

        --Implementing LRU in hardware or at OS/hardware interface is a
        pain

        --So implement CLOCK or NTH CHANCE ... decent approximations to
        LRU, which is in turn a good approximation to OPT *assuming that
        past is a good predictor of the future* (this assumption does
        not always hold!)

    Miscellaneous implementation points

    Note that many machines, x86 included, maintain 4 bits per page
    table entry:

        --*use*: Set when page referenced; cleared by an algorithm like
        CLOCK (the bit is called "Accessed" on x86)

        --*modified*: Set when page modified; cleared when page written
        to disk (the bit is called "Dirty" on x86)

        --*present*: It's set only if page is in memory [asterisk: note
        that it's an "only if" not an "if". There are cases when the
        page is in physical memory but the bit is clear.]

        --*read-only*: program can read page, but not modify it. Set if
        page is truly read-only? [no. similar case to above, but
        slightly confusing because the bit is called "writable". if a
        page's bits are such that it appears to be read-only, that page
        may or may not be truly "read only". meanwhile, if a page is
        truly read-only, it better have its bits set to be read-only.]

    Do we actually need Use and Modified bits in the page tables set by
    the hardware?

        --[again, x86 calls these the Accessed and Dirty bits]

        --answer: no.

        --how could we simulate them?

        --for the Modified [x86: Dirty] bit, just mark all pages
        read-only. Then if a write happens, the OS gets a page fault and
        can set the bit itself.
        Then the OS should mark the page writable so that this page
        fault doesn't happen again

        --for the Use [x86: Accessed] bit, just mark all pages as not
        present (even if they are present). Then if a reference happens,
        the OS gets a page fault, and can set the bit, after which point
        the OS should mark the page present (i.e., set the PRESENT bit).

    Fairness

        --if OS needs to swap a page out, does it consider all pages in
        one pool or only those of the process that caused the page
        fault?

        --what is the trade-off between local and global policies?
            --global: more flexible but less fair
            --local: less flexible but fairer

    [for recent research on caching, check out this paper, which argues,
    contrary to some received wisdom, that FIFO works well, provided
    that there are different FIFO queues:
        https://jasony.me/publication/sosp23-s3fifo.pdf
    It's not yet clear whether the ideas here apply to memory pages
    across processes; the work is mainly geared toward higher-level
    caches. Still, it's food for thought!]

6. Thrashing

    [The points below apply to any caching system, but for the sake of
    concreteness, let's assume that we're talking about page replacement
    in particular.]

    What is thrashing?

        Processes require more memory than system has

        Specifically, each time a page is brought in, another page,
        whose contents will soon be referenced, is thrown out

    Example:

        --one program touches 50 pages (each equally likely)

        --If we have enough physical pages, 100ns/ref

        --Now assume we only have 40 physical page frames

        --Then, if each reference is equally likely and cannot be
        predicted, then every 5th reference (on average) leads to a page
        fault

        --4 refs x 100ns and 1 page fault x 10ms for disk I/O

        --this gets us 5 refs per (10ms + 400ns) ~ 2ms/ref = 20,000x
        slowdown!!!
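
    [Sketch, not from lecture: the arithmetic in this example is the
    AMAT formula from section 4 with p = 1/5. The function names below
    are made up for illustration, and the fault probability assumes, as
    the example does, uniformly random references with 40 of the 50
    pages resident at any moment.]

```python
def amat_ns(p_fault, t_mem_ns=100.0, t_disk_ns=10_000_000.0):
    """Average memory access time, (1-p)*t_M + p*t_D, in nanoseconds."""
    return (1.0 - p_fault) * t_mem_ns + p_fault * t_disk_ns


def fault_probability(pages_touched, frames):
    """Chance that a uniformly random reference hits a non-resident page."""
    if frames >= pages_touched:
        return 0.0
    return (pages_touched - frames) / pages_touched


p = fault_probability(50, 40)          # 10 of the 50 pages are not resident
per_ref_ns = amat_ns(p)                # cost of an average reference
slowdown = per_ref_ns / amat_ns(0.0)   # relative to the all-in-memory case

print(f"p = {p:.2f}")
print(f"average cost per reference: {per_ref_ns / 1e6:.2f} ms")
print(f"slowdown: roughly {slowdown:,.0f}x")
```

    [The per-reference cost comes out at about 2 ms, and the slowdown at
    about 20,000x, in line with the numbers above. Changing the second
    argument of fault_probability shows how quickly things degrade once
    even a few pages stop fitting.]
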
        --What we wanted: virtual memory the size of disk with access
        time the speed of physical memory

        --What we have here: memory with access time roughly of disk
        (2 ms/mem_ref compared to 10 ms/disk_access)

    As stated earlier, this concept is much larger than OSes: need to
    pay attention to the slow case if it's really slow and common enough
    to matter.

    Reasons/cases:

        --process doesn't re-access the same memory pages (or has no
        temporal locality of memory pages)

        -OR-

        --process *does* re-access the same memory pages, but the
        virtual memory that is absorbing most of the accesses doesn't
        fit in physical memory

        -OR-

        --individually, all processes fit, but too much for the system

    what do we do?

        --well, in the first two reasons above, there's nothing you can
        do, other than restructuring your computation or buying memory
        (e.g., expensive hardware that keeps entire customer database in
        RAM)

        --in the third case, can and must shed load. how? two
        approaches:

            a. working set
            b. page fault frequency

        a. working set

            --only run a set of processes such that the union of their
            working sets fits in memory

            --definition of working set (short version): the pages a
            process has touched over some trailing window of time

        b. page fault frequency

            --track the metric (# page faults / # instructions executed)

            --if that thing rises above a threshold, and there is not
            enough memory on the system, swap out the process

-------------------

[Acknowledgments: David Mazières, Mike Dahlin]