Class 11
CS 202
3 March 2020

On the board
------------

1. Last time

2. x86-64: addresses
    - virtual
    - physical

3. x86-64: page table structures

4. TLBs

5. Where does the OS live?

---------------------------------------------------------------------------

1. Last time

    a page table conceptually implements a map from VPN --> PPN

    NOTE: VPN and PPN need not (and, in our case study, do not) have the
    same number of bits

    review: the top bits of the virtual address index into the page table;
    the contents at that index are the PPN. the bottom bits are the offset,
    which is not changed by the mapping:

        physical address = PPN concatenated with offset

    the "math" of virtual memory: get comfortable mapping between "number
    of bits required to represent something" and "size of the space". The
    latter is two raised to the power of the former.

    Examples: a virtual address of 32 bits means the virtual address space
    is 2^{32} bytes = 4 GB. a VPN of 20 bits means there are 2^{20} virtual
    pages. an offset of 12 bits means a page size of 2^{12} bytes, or 4 KB.

2. x86-64: addresses

    the x86-64 architecture is 64 bits: registers and addresses are 64 bits
    wide.

    VIRTUAL ADDRESSES

        on currently-available x86-64 machines, only 48 bits "matter"
        (conclusion: not all 64-bit patterns correspond to meaningful
        virtual addresses)

        bit patterns that are valid addresses are called _canonical
        addresses_. a canonical address has all 0s or all 1s in the upper
        16 bits (bits 63 through 48); those bits have to match whatever
        bit 47 is.

        [see 3.3.7.1 in the Intel software developer's manual]

        result: the virtual address space is 2^{48} bytes = 256 TB

        [another way to look at it: the x86-64 architecture divides
        canonical addresses into two groups, low and high. low canonical
        addresses range from 0x0000'0000'0000'0000 to 0x0000'7FFF'FFFF'FFFF.
        high canonical addresses range from 0xFFFF'8000'0000'0000 to
        0xFFFF'FFFF'FFFF'FFFF. considered as signed 64-bit numbers, all
        canonical addresses range between -2^{47} and 2^{47}-1.]

    PHYSICAL ADDRESSES

        52 bits. that means a single machine can address up to 2^{52}
        bytes = 4 PB of physical memory.

        NOTE: last time I (Mike) wrongly said that the number was higher.

        of course, if the machine only has 16 GB (say), then physical
        addresses will (roughly speaking) only have 34 bits that matter,
        and thus the top 18 bits had better be zero in the
        virtual->physical mapping.

    MAPPING

        have to map a 48-bit number (the virtual address) to a 52-bit
        number (the physical address), at the granularity of 2^{12}-byte
        ranges (pages).

3. x86-64: page table structures

    walk through the handout

    %cr3 holds the address of the top-level directory (the L1 page table)

        is that address a physical address or a virtual address?
        [answer: it is a physical address. the hardware needs to be able
        to follow the page table structure on its own.]

    each entry contains a bunch of bits, including:

        dirty          (set by hardware)
        accessed       (set by hardware)
        cache disabled (set by OS)
        write through  (set by OS)

    what do the U/S and R/W bits do?
        --are these for the kernel, the hardware, what?
        --who is setting them? what is the point?
        (the OS sets them to indicate protection; the hardware enforces
        them)

    What if the OS wants to map a process's virtual address 0x0202000 to
    physical address 0x3000 and make it accessible to user level but
    read-only? what do the page structures look like?

    solution: take off the bottom 12 bits (the offset). vpn = 0x0202.
    write it out in bits:

        0....0      000000001   000000010
        (18 zero bits, so the L1 and L2 indices are both 0; the L3 index
        is 1; the L4 index is 2)

        L1 (0th entry) --> L2 (0th entry) --> L3
                                          ...........       PGTABLE <40 bits>
                                          ...........      |0x00'0000'0003 | U=1,W=0,P=1|  [entry 2]
                                          ...........      |              |            |  [entry 1]
                                          ...........      |              |            |  [entry 0]
                                          [entry 1] ----->  ______________________________
                                          ...........
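
    [aside, not part of the handout: to double-check the arithmetic in
    this example, here is a small C sketch that splits a 64-bit virtual
    address into the four 9-bit page-table indices and the 12-bit offset,
    and checks the canonical-address rule from section 2. the function
    and variable names are made up for illustration; this is not code
    that the hardware or the OS actually runs in this form.

        #include <stdint.h>
        #include <stdio.h>

        /* a canonical address has bits 63..48 equal to bit 47,
           i.e., the top 17 bits are all 0s or all 1s */
        static int is_canonical(uint64_t va) {
            uint64_t top17 = va >> 47;
            return top17 == 0 || top17 == 0x1ffff;
        }

        int main(void) {
            uint64_t va = 0x0202000;                 /* the example above */

            uint64_t offset   = va & 0xfff;          /* bits 11..0  */
            uint64_t l4_index = (va >> 12) & 0x1ff;  /* bits 20..12 */
            uint64_t l3_index = (va >> 21) & 0x1ff;  /* bits 29..21 */
            uint64_t l2_index = (va >> 30) & 0x1ff;  /* bits 38..30 */
            uint64_t l1_index = (va >> 39) & 0x1ff;  /* bits 47..39 */

            printf("canonical? %d\n", is_canonical(va));
            printf("L1=%llu L2=%llu L3=%llu L4=%llu offset=0x%llx\n",
                   (unsigned long long)l1_index,
                   (unsigned long long)l2_index,
                   (unsigned long long)l3_index,
                   (unsigned long long)l4_index,
                   (unsigned long long)offset);
            /* prints: canonical? 1
                       L1=0 L2=0 L3=1 L4=2 offset=0x0
               which matches the picture above */
            return 0;
        }
    ]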
    helpful reminders:

        --each entry in the L1 page table corresponds to 512 GB of virtual
          address space ("corresponds to" means "selects the next-level
          page tables that actually govern the mapping")
        --each entry in the L2 page table corresponds to 1 GB of virtual
          address space
        --each entry in the L3 page table corresponds to 2 MB of virtual
          address space
        --each entry in the L4 page table corresponds to 1 page (4 KB) of
          virtual address space
        --so how much virtual memory is each L4 page *table* responsible
          for translating? 4 KB? 2 MB? 1 GB? [answer: 2 MB]
        --each page table itself consumes 4 KB of physical memory, i.e.,
          each one of these fits on a page

    [see the Intel reference manual for more: Intel 64 and IA-32
    Architectures Software Developer's Manual, Volume 3a
    https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf]

    Large pages:

        can get 2 MB (resp., 1 GB) pages on x86: each L3 (resp., L2) entry
        then points directly to the (large) page instead of to another
        page table

        + page tables are smaller, and there is less page table walking
        - more wasted memory

        to enable this, set bit 7 (the PS bit)

        example: set the PS bit in an L3 entry. the result is a 2 MB page;
        the walk for that region is L1, L2, L3, with no L4 page table.

    [if time, exercises (from cs61, 2018):
    https://cs61.seas.harvard.edu/site/2018/Section4/

        What is the minimum number of physical pages required on x86-64
        for each of the following allocations? Draw an example page table
        mapping for each scenario (start from scratch each time).

        1 byte of memory
            = [5 phys pages]
        1 allocation of size 2^{12} bytes of memory
            = [5 phys pages]
        2^9 allocations of size 2^{12} bytes of memory each
            = [512 + 4 = 516 phys pages]
        2^9 + 1 allocations of size 2^{12} bytes of memory each
            = [512 + 4 + (1 + 1) = 518 phys pages]
        2^{18} + 1 allocations of size 2^{12} bytes of memory each
            = [1 (L1) + 1 (L2) + 2 (L3) + (2^9 + 1) (L4)
               + (2^{18} + 1) (the memory itself)]
    ]

4. TLBs

    --so it looks like the CPU (specifically its MMU) has to go out to
      memory on every memory reference?
        --called "walking the page tables"
        --to make this fast, we need a cache
    --TLB: translation lookaside buffer
        hardware that caches virtual address --> physical address
        translations; the reason that all of this page table walking does
        not slow the process down too much
    --hardware managed? (x86, ARM.) the hardware populates the TLB
    --software managed? (MIPS.) the OS's job is to load the TLB when it
      receives a "TLB miss" exception. not the same thing as a page fault.
    --questions:
        --does a TLB miss imply a page fault? (no!)
        --does a page fault imply a TLB miss? (no!)
            (imagine a page that is mapped read-only, and a user-level
            process tries to write to it. the TLB knows about the mapping,
            so there is no TLB miss. but this is still a protection
            violation. to cut down on terminology, we lump this kind of
            violation in with "page fault".)
    --x86:
        --what happens to the TLB when %cr3 is loaded? [answer: it is
          flushed]
        --can we flush individual entries in the TLB otherwise?
            INVLPG addr
    --Sizes
        [the situation is more complicated than the handout suggests, so
        here are some specifics for folks who are interested:

            Instruction TLB:       2M/4M pages, fully associative, 8 entries
                                   4 KByte pages, 8-way, 64 entries
            Data TLB:              2M/4M pages, 4-way, 32 entries, plus a
                                   separate array with 1 GByte pages,
                                   4-way, 4 entries
                                   4 KByte pages, 4-way, 64 entries
            Shared 2nd-Level TLB:  4K/2M pages, 8-way, 1024 entries
        ]
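
    [aside, not part of the handout: here is a hedged sketch, in C with
    GCC-style inline assembly, of the two invalidation mechanisms just
    mentioned: INVLPG for a single entry, and a %cr3 reload for (nearly)
    everything. the function names are invented for this example; a real
    kernel wraps these instructions in its own helpers, and this code can
    only run in kernel mode.

        #include <stdint.h>

        /* flush one TLB entry: invalidate the translation for the page
           containing virtual address va */
        static inline void tlb_flush_one(void *va) {
            asm volatile("invlpg (%0)" : : "r"(va) : "memory");
        }

        /* flush (nearly) the whole TLB by rewriting %cr3 with its current
           value. entries marked "global" survive a %cr3 reload; INVLPG
           does remove those. */
        static inline void tlb_flush_all(void) {
            uint64_t cr3;
            asm volatile("mov %%cr3, %0" : "=r"(cr3));
            asm volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
        }
    ]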
5. Where does the OS live?

    [see the handout for a picture]

    In its own address space?
        -- Can't do this on most hardware (e.g., the syscall instruction
           won't switch address spaces)
        -- It would also make it harder to parse syscall arguments passed
           as pointers

    So on real systems, the kernel is actually in the same address space
    as all processes*

        * not precisely true post-Meltdown, but close enough (in that some
          of the kernel is mapped into all user processes). for those who
          are interested, see the notes from Panda at the end.

    -- Use protection bits to prohibit user code from reading/writing the
       kernel
    -- Typically all kernel text and most kernel data live at the same
       virtual address in *every* address space (every process has virtual
       addresses that map to the physical memory holding the kernel's
       instructions and data)
    -- In Linux, the kernel is mapped at the top of the address space,
       along with per-process data structures
    -- Physical memory is also mapped up top, which gives the kernel a
       convenient way to access physical memory

        NOTE: that means that physical memory that is in use is mapped in
        at least two places (once into a process's virtual address space
        and once into this upper region of the virtual address space)

    [note: in lab4, it doesn't work like this. in lab4, the kernel has its
    own separate page table]

-------------------

notes from Panda about the kernel being mapped into each process:

    [AP: this answer is complicated
    (https://www.usenix.org/system/files/login/articles/login_winter18_03_gruss.pdf),
    but see below for an attempt to explain.

    * In the post-Meltdown KAISER/KPTI/KVA/XNU Double Map world (all names
      for similar mitigations), each process has two (logical) page tables:

        - One, the user-mode page table, for use when the process is
          executing user-mode code, unmaps most (but not all) of the
          kernel; what stays mapped includes some of the kernel stack and
          a few other things. The aim of all the mitigations has been to
          minimize the number of kernel pages in the user-mode page table,
          but different mitigations choose different tradeoffs for how far
          to push this.

        - The second, the kernel-mode page table, has exactly the layout
          described above, i.e., the kernel is in every address space.

      On entry to the kernel, the OS switches as quickly as possible from
      the user-mode page table to the kernel-mode one, and switches back
      before returning to user space.

      Having a kernel-mode page table per process is partly to minimize
      how much of the kernel needs to change, so maybe one can argue that
      this is not the "best" possible solution, but it is pretty good.]
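
[aside, not part of the notes: to make Panda's description concrete, here
is a deliberately simplified C sketch of the "two page tables per process,
switch on kernel entry/exit" idea. the struct, its fields, and the
entry/exit hooks are invented for illustration; real KPTI implementations
do this switch in hand-written assembly on the syscall/interrupt path and
use PCIDs, where available, to avoid flushing the TLB on every switch.

    #include <stdint.h>

    struct process {
        uint64_t kernel_cr3;   /* full page table: kernel + user mappings   */
        uint64_t user_cr3;     /* trimmed page table: user + minimal kernel */
    };

    static inline void load_cr3(uint64_t pa) {
        asm volatile("mov %0, %%cr3" : : "r"(pa) : "memory");
    }

    /* first thing on entry to the kernel (syscall, interrupt, fault):
       switch to the page table that maps the whole kernel */
    void on_kernel_entry(struct process *p) {
        load_cr3(p->kernel_cr3);
        /* ... handle the syscall/interrupt ... */
    }

    /* last thing before returning to user space: switch back to the
       trimmed page table, so user code has almost no kernel mappings
       left to (speculatively) reach */
    void on_kernel_exit(struct process *p) {
        load_cr3(p->user_cr3);
    }
]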