CS202 Review Session 5
Andrew Hua, TA
Spring 2026

0. Introduction
1. Why Virtual Memory?
1.1 The Lie
1.2 Paging
1.3 Address Translation
2. Page Tables
2.1 Linear
2.2 Multi-Indexed
3. Conclusion
4. Extra: Huge Pages
5. Problems
6. Up next
7. Q&A
8. References
9. Answers to Problems

----------------------------------------

0. Introduction
- Reinforce understanding of how virtual memory works
- Will be a combination of explanations and problems building upon them
- All notes, extras, and recordings will be posted to their usual locations

1. Why Virtual Memory?

1.1 The Lie
- The OS's goal with virtual memory is to give applications (and the programmers who write them) the illusion that they control a large amount of "virtual" memory, while the OS manages the physical memory and provides it to processes
- A process, when accessing a memory address, does not know which physical address the virtual address corresponds to
- To create this illusion, the OS and the computer hardware have to work together
- Whenever a memory address is dereferenced by a program (instruction fetch, read/write), the MMU has to translate VA -> PA
- The OS maintains data structures, such as page tables, which the MMU reads from; the OS also handles bad requests (faults)

1.2 Paging
- Imagine that for every byte of virtual memory, we stored how that byte maps to a physical byte. What's bad about this?
- First: inefficient in terms of overhead; we might need more bytes to store the mapping than the memory it describes
- Second: it fails to take advantage of spatial locality
- So, paging is the idea that we treat the "page" as the base unit of virtual memory operations
- Group memory into chunks of fixed size, then only care about mapping virtual pages to physical pages
- This means we need to keep track of far fewer mappings
- Page size varies, but can align nicely with the hardware of RAM and disks

1.3 Address Translation
- So, given these pages, our question is: how do we map a virtual address to a physical address?
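Concretely, with 4 KiB pages the bottom 12 bits of an address are the offset within the page, and the remaining bits are the virtual page number. A small sketch (assuming 4 KiB pages and 48-bit virtual addresses; the helper name `split` is just for illustration):

```python
# Sketch: splitting a virtual address into (VPN, offset),
# assuming 4 KiB (2**12-byte) pages and 48-bit virtual addresses.
PAGE_SIZE = 4096     # 2**12 bytes per page
OFFSET_BITS = 12     # log2(PAGE_SIZE)

def split(va):
    vpn = va >> OFFSET_BITS         # which virtual page
    offset = va & (PAGE_SIZE - 1)   # byte within that page
    return vpn, offset

vpn, offset = split(0x7FFF12345ABC)
print(hex(vpn), hex(offset))  # 0x7fff12345 0xabc
```

Translation then only has to map the VPN; the offset is copied through unchanged.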
- At the most abstract level, we can imagine the MMU's job as using a mapping from VPN to PPN to translate addresses
- Each process has some mapping function, which requires extra information about the mapping function to be stored somewhere
- Importantly, if each process has its own mapping function, then a VA in one process bears no relation to the same VA in another process
- Ignore the top 16 bits; take the bottom 48 bits as the virtual address
- How many bytes are addressable?
- Given that pages are 4 KiB (2^12 bytes), can we tell how many virtual pages there are?
- Can we tell how many physical pages? (No, that depends on the system)
- Translate the VPN to a PPN, and copy the offset unchanged

2. Page Tables
- Big question: how does the OS store this mapping between VPN and PPN? How do we make it space-efficient? Time-efficient? Both?

2.1 Linear
- What is the simplest way to store a list?
- First idea: create an array, and use the VPN as an index to get the physical page number
- If each process had one of these arrays, then the same VA in two processes could map to different PAs just by switching this data structure
- Emphasize this point: this is what creates separation between processes
- Wait, how many pages are there? If each entry in this array needs 8 bytes (take as given), how many bytes is that? For just *one* process?
- Okay, this method is clearly infeasible; we need a way to avoid storing the redundant, unused regions

2.2 Multi-Indexed
- Mike mentioned creating a 512-ary tree (at most 512 children per node); what does that mean?
- If we start from a root table, view it as a root node with entries that link to other pagetable nodes, which have entries that link to further nodes, etc.
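One way to picture that tree: each node is a fixed-size table whose entries are either empty or point to a child table, with leaf entries holding PPNs. A toy Python sketch (not how hardware stores tables; the two-level shape and all values here are illustrative only):

```python
FANOUT = 512  # each pagetable node has up to 512 children

def new_table():
    return [None] * FANOUT  # empty table: nothing mapped yet

# Build a tiny 2-level tree mapping VPN indices (3, 7) -> a PPN.
root = new_table()
root[3] = new_table()  # only one child table exists, at index 3
root[3][7] = 0xCAFE    # leaf entry: the PPN for indices (3, 7)

def lookup(table, indices):
    """Walk one index per level; None means the mapping is invalid."""
    for i in indices:
        if table is None:
            return None
        table = table[i]
    return table

print(hex(lookup(root, (3, 7))))  # 0xcafe
print(lookup(root, (4, 7)))       # None: no child table at index 4
```

The payoff of this shape is that entire empty subtrees cost nothing beyond one `None` entry in the parent.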
- Each layer of the walk hones in on the actual pagetable entry with the actual PPN we care about
- Remember, this is all just taking a VPN and translating it to a PPN
- Now, instead of treating the VPN as one large number, split it into chunks
- For example, the x86-64 architecture splits the 36-bit VPN into 4 indices of 9 bits each
- Use each index to figure out which child to descend into; this gives the traversal of the pagetable tree
- Index "walk" (HANDOUT)
- Imagine that you are the MMU, and that I have given you a pagetable for accessing pages relevant to certain topics
- The pagetable has 2 layers: the first index is the first letter, and the second index is everything that follows (we ignore the offset)
- Now, let's work through translating the "VPN" Algol to its "PPN"
- Split the "VPN" into the indices "A" and "lgol"
- Use the first index to index into the top pagetable and find a page; use the second index to index into that sub-pagetable and find the PPN
- If you have an 'A' copy, what page is Algol found at? If a 'B' copy?
- Different pagetables => different mappings for the same virtual address
- What does invalid mean?
- Pagetables can have different tree structures; note how 'A' has an "S" page while 'B' has a "T" page
- Analogously, how does walking through the x86-64 structure actually work?
- Use the first index to index into the L1 pagetable; the pagetable entry yields the PPN of an L2 pagetable
- How many entries are there in an L1 pagetable?
- Use the second index to index into the L2 pagetable; the pagetable entry yields the PPN of an L3 pagetable
- How many possible pages are accessible from a given L2 pagetable?
- Use the third index to index into the L3 pagetable; the pagetable entry yields the PPN of an L4 pagetable
- How many possible L3 pagetables are there if the virtual address space is fully used?
- Use the last index to index into the L4 pagetable; the pagetable entry yields the desired PPN
- How many memory accesses are needed to walk the tables and then access memory?
- Extra question: if each pagetable entry uses 8 bytes, why is a 9-bit index extra nice?
- What are some differences between this analogy and x86-64?
- 2 levels instead of 4
- Offset nonexistent (already mentioned, but want to emphasize the difference)
- Virtual page space is a different type from physical page space (translate "leading character" to the 9 leading bits of the VPN)
- Different way of splitting the "VPN"
- Pagetables separate from physical memory; you'll see that real pagetables are actually stored in physical memory
- Why is multi-indexing so useful?
- Sparsity: we don't need a page for 'B' entries, nor for 'D', ...
- Let's say each pagetable takes a page; in our index, how many pages would we need for the full alphabet?
- How many pages do we actually use in this case?
- Correspondingly: we don't need L2 pagetables for the blank regions of the address space
- This is really important!
- Problems:
- How many pages does the x86-64 architecture need to allocate just one byte of memory?
- How many pages are needed to allocate 5 pages' worth of memory?
- 2^9? (both best and worst case)
- 2^9 + 1? (best case)
- What if I spawned 32 processes, each of which allocates 8 pages (best case)?
- Finally, 2 processes, one of which allocates 1 GiB (2^30 bytes) and another 1 KiB (2^10 bytes) (again, best case)
- What are the costs of multi-indexed pagetables?
- A greater number of memory accesses is required per address if we translate solely using pagetables
- With no caching, how many memory accesses are needed to execute a simple instruction?
- 0x500 movq 0x200000, %rax
- The TLB solves this by caching the page translations that are used the most
- If both previous page translations are cached, but the actual memory is not, how many memory accesses are needed to then execute:
- 0x504 movq 0x200008, %rbx
- Time-space-complexity tradeoff

3. Conclusion
- What do we get from all this?
- Complexity is unnecessary without payoff, so what does all this complexity buy us?
- Programmability
- Each process gets its own mapping without having to worry about other processes; the OS and hardware handle all the messiness
- Process isolation
- Each process has its own mapping, so there is no way for it to reference another process's memory
- Over-allocation of memory
- Will only touch on this briefly, but does a valid VA always need to point to a valid PA?
- Demand paging is just the OS + hardware extending the lie, saying pages "exist" in memory when they are not actually there
- To go back to the index analogy, imagine if I told you it was page 200, but I needed to fetch that page from a filing cabinet
- This raises more questions
- This is another degree of caching: we have a large but slow swap space and a fast but small RAM, so how do we manage swapping to avoid slow disk accesses?

4. Huge Pages [if time]
- There are ways to define pages that are larger than our usual 4 KiB size, for caching reasons (see the TLB in upcoming classes)
- If an entry in an L3 pagetable pointed straight to a huge page, what size would make the most sense for this huge page?
- How many such pages could we allocate at once?
- If we wanted to allocate 2 huge pages and 2000 regular pages, what is the minimum number of pages that we'd need to allocate?
- We can go even further beyond: we can define a 1 GiB (2^30 byte) page
- What level of pagetable makes the most sense for entries that point to these gigantic pages?

5. Problems
- Collection of problems from the review session, along with a few new ones
- Problems I consider harder are marked (*)
- Unit reference (using binary units to be precise; if you don't understand this, ignore the 'i'):
- 1 KiB = 2^10 bytes ~ 10^3 bytes
- 1 MiB = 2^20 bytes ~ 10^6 bytes
- 1 GiB = 2^30 bytes ~ 10^9 bytes
- 1 TiB = 2^40 bytes ~ 10^12 bytes
- 1 PiB = 2^50 bytes ~ 10^15 bytes
- Warmup: Given the following number of bits in a virtual address and physical address, and the given pagesize, calculate the following values:
- VA: 56 bits, PA: 48 bits, pagesize: 16 KiB
- VPN bits
- PPN bits
- Offset bits
- How many virtual pages are possible?
- How many physical pages are possible?
- x86-64 pagetable rapid-fire from 2.2 (Multi-Indexed):
- How many entries are there in an L1 pagetable?
- How many possible pages are accessible from a given L2 pagetable?
- How many possible L3 pagetables are there if the virtual address space is fully used?
- How many memory accesses are needed to access a given memory address, assuming no faults or errors?
- (*) If each pagetable entry uses 8 bytes, why is a 9-bit index extra nice?
- Page allocation from 2.2 (Multi-Indexed):
- How many pages does the x86-64 architecture need to allocate just one byte of memory? [draw pagetable tree]
- How many pages are needed to allocate 5 pages' worth of memory? [extend tree by appending 4 more pages]
- 2^9? (both best and worst case) [prompt for both optimal and pessimal cases; ask what the pagetable tree looks like in each case]
- 2^9 + 1? (best case)
- What if I spawned 32 processes, each of which allocates 8 pages (best case)?
- (*) Finally, 2 processes, one of which allocates 1 GiB (2^30 bytes) and another 1 KiB (2^10 bytes) (again, best case)
- Memory reads from 2.2 (Multi-Indexed):
- A greater number of memory accesses is required per address if we translate solely using pagetables
- With no caching, how many memory accesses are needed to execute a simple instruction?
- 0x0FF8 movq 0x200000, %rax
- The TLB solves this by caching the page translations that are used the most
- If both previous page translations are cached, but the actual memory is not, how many memory accesses are needed to then execute:
- 0x1000 movq 0x200008, %rbx
- (*) Huge Pages from 4 (Huge Pages):
- If an entry in an L3 pagetable pointed straight to a huge page, what size would make the most sense for this huge page?
- How many such pages could we allocate at once?
- If we wanted to allocate 2 huge pages and 2000 regular pages, what is the minimum number of 4 KiB pages needed? (treat allocating 1 huge page as allocating the same amount of memory's worth of regular pages)
- We can go even further beyond: we can define a 1 GiB (2^30 byte) "gigantic" page
- What level of pagetable makes the most sense for entries that point to these gigantic pages?
- x86 Mutations:
- In the context of the x86-64 architecture, if each pagetable entry takes 8 bytes, how many bytes would be needed to store a complete linear pagetable? (binary and human form) How does this compare to modern-day computer capabilities?
- The x86 32-bit architecture, obviously, uses 32 bits for virtual addressing
- Instead of 4 pagetable layers, it has 2, while pages remain at 4 KiB
- How many bits are given to each pagetable index?
- How many entries would each pagetable have?
- How large is each pagetable, assuming each entry takes 4 bytes?
- (*) In each pagetable entry, how many bits are unused by the PPN, assuming PPNs are 20 bits?
- (*) Imagine we changed the x86-64 architecture a little
- Instead of 4 PTE layers, we set up 3
- Uneven: the L1 index takes 18 bits, while L2 and L3 each take 9 bits
- How large would the L1 pagetable be? (binary and human form)
- If I wanted to allocate 1 page under this regime, how many physical pages would be needed (physical pages remain 4 KiB)?
- 513? (best case)

6. Up next
- Next lecture, Mike will go over the exact details of x86-64 virtual memory workings and the TLB
- After the midterm, we will cover exactly how demand paging works, including page faults and eviction policies

7. Q&A
- Midterm studying
- Use the practice midterms
- Make the cheat sheet yourself; by working on it, you get to know what you're weak at
- Deadlocks

8. References
- x86-64 reference (1st page of https://cs.nyu.edu/~mwalfish/classes/25sp/lectures/handout09.pdf)
- Blog with more huge-pages explanation: https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-1/

9. Answers to Problems
- Warmup
- 56 - log_2(16 * 1024) = 56 - 14 = 42 VPN bits
- 48 - 14 = 34 PPN bits
- log_2(16 * 1024) = 14 offset bits
- 2^42 ~ 4 trillion virtual pages
- 2^34 ~ 16 billion physical pages
- Rapid-fire
- 512 = 2^9 entries in an L1 pagetable
- 2^27 = 128 million: an L2 can reach 2^9 L3 pagetables, each of which can reach 2^9 L4 pagetables, each of which can reach 2^9 pages, so 2^(9+9+9)
- 2^18 = 256 thousand: the L1 can reach 2^9 L2 pagetables, each of which can reach 2^9 L3 pagetables, so 2^(9+9)
- 5: walk L1 -> L4, then read the actual page
- (*) There are 512 entries per pagetable, so at 8 bytes each, a pagetable takes 2^12 bytes, which is exactly a page
- Page allocation
- 5 (L1 -> L2 -> L3 -> L4 -> actual page)
- 9 (L1 -> L2 -> L3 -> L4 -> 5 actual pages)
- 2^9 + 4 in best case (ditto); 4 * 2^9 + 1 = 2^11 + 1 = 2049 in worst case (each page requires its own L2, L3, L4)
- 2^9 + 6 (a second L4 pagetable is needed)
- 32 * (8 + 4) = 2^7 * 3 = 384; each process needs 12 pages to allocate 8 pages
- (*) 2^18 + 2^9 + 8 (big process: 1 L1 + 1 L2 + 1 L3 + 2^9 L4 + 2^18 actual pages; small process: 5 pages)
- Memory reads
- 2 * 5 = 10; both the instruction fetch and the data access require a full walk of the pagetables (4 reads) plus the access itself
- 6; the TLB caches the data page's translation, but 0x1000 sits on a new page, so the instruction fetch needs a fresh walk (4 pagetable reads + 1 instruction fetch + 1 data read)
- (*) Huge Pages
- 2^21 bytes: transfer the 9 L4 index bits to the offset, giving 2^(12+9)
- 2^27 (all 2^27 PTEs across all possible L3 pagetables point to a huge page)
- 3 + 2 * 512 + 4 + 2000 = 3031 (1 L1 + 1 L2 + 1 L3 -> 2 huge pages and 4 L4 pagetables which point to the 2000 regular pages)
- L2: by the same logic as the first question, a 2^30 byte page needs 30 offset bits, so take over the L3 and L4 index bits; thus an L2 entry points directly to the gigantic page
- x86 Mutations
- x86-64 linear: 2^36 virtual pages * 2^3 bytes per PTE = 2^39 bytes = 512 GiB, which vastly outstrips personal computers' memory
- x86 32-bit
- 10 bits (20 VPN bits / 2)
- 2^10 = 1024 PTEs
- 4096 bytes (1024 PTEs * 4 bytes per PTE)
- (*) 12 bits (32 bits in a PTE - 20 bits for the PPN)
- Collapsed indices
- 2^21 bytes = 2 MiB (2^18 PTEs * 2^3 bytes per entry)
- 515 (512 pages for the L1 + 1 L2 + 1 L3 + 1 actual page)
- 1028 (512 pages for the L1 + 1 L2 + 2 L3 + 513 actual pages)
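The best-case page-allocation answers can be sanity-checked with a short script (a sketch assuming x86-64's 4-level, 512-entry-per-table layout; `best_case_pages` is a hypothetical helper, not course code):

```python
import math

ENTRIES = 512  # 2^9 pagetable entries per table in x86-64

def best_case_pages(n_data_pages):
    # Best case: the data pages are contiguous, so each pagetable
    # level needs as few tables as possible.
    total = n_data_pages           # the data pages themselves
    tables = n_data_pages
    for _ in range(4):             # L4, L3, L2, L1 tables above them
        tables = math.ceil(tables / ENTRIES)
        total += tables
    return total

print(best_case_pages(1))         # 5 (L1 -> L2 -> L3 -> L4 -> page)
print(best_case_pages(5))         # 9
print(best_case_pages(2**9))      # 516 = 2^9 + 4
print(best_case_pages(2**9 + 1))  # 518 = 2^9 + 6
# Two processes, 1 GiB (2^18 pages) + 1 KiB (1 page), separate trees:
print(best_case_pages(2**18) + best_case_pages(1))  # 2^18 + 2^9 + 8
```

Each level divides the number of required tables by 512 (rounded up), which is why the pagetable overhead stays tiny relative to the data pages.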