Class 12
CS 202
12 March 2015

On the board
------------

1. Last time

2. Segmentation

3. Paging
    --Intro
    --key data structure: page table
    --Segmentation vs. paging

4. Case study: x86
    --Alternatives

5. TLBs

6. Page faults: intro and mechanics

7. The uses of paging and page faults

8. Where does the OS live?

---------------------------------------------------------------------------

1. Last time

    [keyboard repeat rate: BIOS vs. keyboard driver]

    virtual memory intro
    segmentation introduction
    segmentation on x86: won't cover explicitly; in the notes below.
    today: paging

    ---

    to understand what is going on in virtual memory, one can think of it
    like this:

        the OS defines a table on behalf of a process

        the virtual address itself (that is, the address used by the
        program) determines the row in the table that the MMU should look
        up

        roughly speaking, the top digits select a table entry; the bottom
        digits determine an offset into the region.

    for our segmentation example from last time, we can visualize the
    result roughly like this:

        the bottom quarter of the address space is translated according to
        the first row (not exactly the bottom quarter, since some of the
        addresses aren't valid; the same holds for the quarters below),

        the next quarter of the address space is translated according to
        the second row,

        the third quarter of the address space is translated according to
        the third row, and

        the fourth quarter consists of invalid addresses.

    ---

2. Segmentation

 A. segmentation in general [last time]

 B. segmentation on the x86

    [not covering the details in lecture; including for reference here]

    linear address = base + virtual_address

    (virtual_address is the offset here)

    what's the interface to segmentation?
    [see handout]

    there are tables populated by the OS: the GDT and LDT (global
    descriptor table, local descriptor table)

        *the entries in these tables define the segments*

        each entry determines base, limit, **protection** (R/W/X,
        user/kernel, etc.), and type

        these tables are analogous to the table we saw last time (except
        that a process has two of them instead of one).

        the processor is told where these tables live via LLDT, SLDT,
        LGDT, SGDT (load local descriptor table, store local descriptor
        table, etc.)

    there are also segment selector registers on the CPU:

        %ss (stack segment selector)
        %cs (code segment selector)
        %ds (data segment selector)
        %es (string [extra] segment selector)
        %fs
        %gs

        [more on these in a moment]

    every program instruction comes with an implicit *or* explicit segment
    register (the implicit case is the usual one). examples:

        pop %ebx                  ; implicitly uses %ss
        call $0x7000              ; implicitly uses %cs
        movl $0x1234, (%eax)      ; implicitly uses %ds
        movl $0x1234, %gs:(%eax)  ; explicitly uses %gs

    [all references to %eip (such as instruction fetches) use %cs for
    translation.]

    some instructions can take "far addresses":

        ljmp $selector, $offset   [makes the selector explicit]

        [can do similar things with loads and stores]

    a selector (held in %ss, %cs, etc.) indexes into the LDT or GDT, and
    chooses *which* table and which *entry* in that table

    analogy with the 14-bit example from last time:

        the value held in the relevant segment selector is analogous to
        the top digit [two bits] in our 14-bit example from last time. the
        entire virtual address is analogous to the bottom 3 digits [12
        bits] in that example.

        to be clear, the idea is that on the x86 (as opposed to our
        example) the segment selector is not part of the virtual address;
        it's either implicit or explicitly part of the instruction.

    the offset needs to be less than the limit

    example #1: say that %ds refers to an entry in the LDT with these
    parameters:

        base  0x30000000
        limit 0x300000f0

    now, when the program does:

        mov 0x50, %eax

    what happens?
    [0x50 gets translated into 0x3000 0050]

    example #2: what about if the program does:

        mov 0x100, %eax      ?

    [error.]

3. Paging

 A. Intro

    --Basic concept: divide all of memory (physical and virtual) into
      *fixed-size* chunks.

        --these chunks are called *PAGES*.

        --they have a size called the PAGE SIZE. (different hardware
          architectures specify different sizes)

        --in the traditional x86 (and in our labs), the PAGE SIZE will be
          4096 B = 4KB = 2^{12} bytes

    --Warm-up:

        --how many pages are there on a 32-bit architecture?

        --2^{32} bytes / (2^{12} bytes/page) = 2^{20} pages

    --Each process has a separate mapping

        --And each page is separately mapped

    --we will allow the OS to gain control on certain operations

        --Read-only pages trap to the OS on write
        --Invalid pages trap to the OS on read or write
        --the OS can change the mapping and resume the application

        (Harder to do this kind of thing with segments because the mapping
        is more coarse-grained.)

    --it is proper and fitting to talk about pages having **NUMBERS**.

        --page 0:         [0, 4095]
        --page 1:         [4096, 8191]
        --page 2:         [8192, 12287]
        --page 3:         [12288, 16383]
        .....
        --page 2^{20}-1:  [2^{32} - 2^{12}, 2^{32} - 1]

    --unfortunately, it is also proper and fitting to talk about _both_
      virtual and physical pages having numbers. sometimes we will try to
      be clear with terms like:

        vpn (virtual page number)
        ppn (physical page number)

 B. Key data structure: page table

    --conceptual model: (assuming 32-bit addresses and 4KB pages) there is
      in the sky a 2^{20}-entry array that maps each virtual page to a
      *physical* page:

        table[20-bit virtual page number] = 20-bit physical page number

      EXAMPLE: if the OS wants a program to be able to use address
      0x00402000 to refer to physical address 0x00003000, then the OS
      conceptually adds an entry:

        table[0x00402] = 0x00003

      (this maps virtual page 1026 to physical page 3.) in decimal:

        table[1026] = 3

      below, we will see how this is actually implemented

    NOTE: the top 20 bits are doing the indirection. the bottom 12 bits
    just determine where on the page the access should take place.
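    the top-bits/bottom-bits split can be sketched in a few lines of C.
    (this is an illustration of the conceptual model only, not kernel
    code; the single table entry is the 0x00402 -> 0x00003 example from
    above, and the address 0x00402abc is a made-up address on that page.)

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t va = 0x00402abc;     /* some address on virtual page 0x00402 */

        uint32_t vpn    = va >> 12;   /* top 20 bits: virtual page number */
        uint32_t offset = va & 0xfff; /* bottom 12 bits: where on the page */
        assert(vpn == 0x00402 && offset == 0xabc);

        /* conceptually, ppn = table[vpn]; here we hard-code the example
           entry table[0x00402] = 0x00003 */
        uint32_t ppn = 0x00003;
        uint32_t pa  = (ppn << 12) | offset;  /* same offset, new page */

        printf("va 0x%08x -> pa 0x%08x\n", (unsigned) va, (unsigned) pa);
        return 0;
    }
    ```

    note that the offset passes through untranslated: only the page number
    is remapped.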
        --the bottom bits are sometimes called the offset.

    --so now all we have to do is create this mapping

    --why is this hard? why not just create the mapping?

        --answer: then you need, per process, roughly 4MB (2^{20} entries
          * 32 bits per entry)

        --we will deal with this shortly

        --key idea: represent the page table as a tree that is sparse
          (i.e., many of the child nodes are never filled in)

 C. Segmentation vs. paging

    --paging:

        + eliminates external fragmentation
        + not much internal fragmentation
        + easier to allocate, free, swap, etc.
        - data structures are larger
        - more complex
        + overall: more flexible. (intuition: the mapping is more
          fine-grained, which means more OS control over it) (in more
          detail, instead of mapping a large range into a large range, we
          are going to independently control the mapping for every 4 KB.)

    --segmentation:

        - vulnerable to two kinds of fragmentation
        - hard to handle growth or shrinkage of a segment
        + smaller data structures
        + simpler overall
        - overall: less flexible

    --Segmentation is old-school and these days mostly an annoyance (but
      it cannot be turned off on the x86!)

    --however, it comes in handy every now and then:

        --thread-local memory
        --sandboxing (advanced topic)
        --it also makes it easy to share memory among processes: just use
          the same segment registers (sharing requires a bit more work if
          paging is in effect)

4. Case study: virtual memory on x86

    * Has segmentation and paging. Cannot turn off segmentation (even
      though we usually want to). Instead, set things up so that
      segmentation has no effect.

      Question: how?

      (Answer: by setting its mapping to be the identity function. Make
      the base 0 and the limit the maximum.)

    * We will focus on paging

      best overview: the Intel manual
      http://www.cs.nyu.edu/~mwalfish/classes/15sp/ref/i386/s05_02.htm

      see handout from last time

      two-level mapping structure.......

    * a VA is 32 bits:

        31 ................................... 0

    * and it gets divided as follows:

        |  dir ent  | table ent |  offset  |
         31 ...... 22 21 ..... 12 11 ..... 0

    --%cr3 holds the address of the page directory.

    --the top 10 bits (the first two nibbles plus the first half of the
      third nibble) select an entry in the page directory; this entry
      points to a **page table**

    --the next 10 bits select the entry in the page table, which holds a
      physical page number

    --so there are 1024 entries in the page directory

    --how big is an entry in the page directory? 4 bytes

    --entry in page directory and page table:

        [ base address | bunch of bits | U/S | R/W | P ]
         31..........12

        why 20 bits? [answer: there are 2^{20} 4KB pages in the system]

        is that base address a physical address, a linear address, a
        virtual address, what?

        [answer: it is a physical address. the hardware needs to be able
        to follow the page table structure.]

        the bunch of bits includes:

            dirty           (set by hardware)
            accessed        (set by hardware)
            cache disabled  (set by OS)
            write through   (set by OS)

        what do the U/S and R/W bits do?

            --are these for the kernel, the hardware, what?
            --who is setting them? what is the point?

            (the OS sets them to indicate protection; the hardware
            enforces them)

        what happens if U/S and R/W differ in the pgdir and the page
        table? [the processor does something deterministic; look it up in
        the references]

    * EXAMPLES

      Approach: examine an address and divide it up. Get used to doing
      this. We will work a few examples in class.

      Basic question: what does the OS put in the data structures that are
      visible to the CPU's MMU to enable different mappings?

      What if the OS wants to map a process's virtual address 0x00402[000]
      to physical address 0x00003[000] and make it accessible to user
      level but read-only?

            PGDIR                          PGTABLE

                                  <20 bits>   <12 bits>
          .......                _________ _____________
          [entry 2]             | 0x00003 | U=1,W=0,P=1 |  [entry 2]
          [entry 1] ----.       |         |             |  [entry 1]
          [entry 0]      `----> |_________|_____________|  [entry 0]
          .......

      Now what if the OS wants to map that process's virtual address
      0x00403[000] to physical address 0x80000[000] [this is physical
      address 2GB] and make it accessible to user level and read/write?
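      one way to check your answers to examples like these is to compute
      the indices mechanically. the C sketch below is an illustration only
      (the flag macros PG_P/PG_W/PG_U and the helper functions are made-up
      names modeled on the bit layout above, not a real kernel API): it
      splits a linear address into directory index, table index, and
      offset, and assembles a page-table entry from a physical page number
      and the U/W/P bits.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      /* flag bits in a PDE/PTE, following the layout described above */
      #define PG_P 0x001  /* present */
      #define PG_W 0x002  /* writable (R/W bit) */
      #define PG_U 0x004  /* user-accessible (U/S bit) */

      /* split a 32-bit linear address into its three pieces */
      static void split(uint32_t la, uint32_t *dir, uint32_t *tab, uint32_t *off) {
          *dir = la >> 22;           /* top 10 bits: page directory index */
          *tab = (la >> 12) & 0x3ff; /* next 10 bits: page table index */
          *off = la & 0xfff;         /* bottom 12 bits: offset in page */
      }

      /* build an entry: 20-bit physical page number plus flag bits */
      static uint32_t make_pte(uint32_t ppn, uint32_t flags) {
          return (ppn << 12) | flags;
      }

      int main(void) {
          uint32_t dir, tab, off;

          /* first example: 0x00402000 -> 0x00003000, user, read-only */
          split(0x00402000, &dir, &tab, &off);
          assert(dir == 1 && tab == 2 && off == 0);
          uint32_t pte1 = make_pte(0x00003, PG_U | PG_P); /* W=0 */

          /* second example: 0x00403000 -> 0x80000000, user, read/write */
          split(0x00403000, &dir, &tab, &off);
          assert(dir == 1 && tab == 3 && off == 0);
          uint32_t pte2 = make_pte(0x80000, PG_U | PG_W | PG_P);

          printf("pte1=0x%08x pte2=0x%08x\n", (unsigned) pte1, (unsigned) pte2);
          return 0;
      }
      ```

      notice that both addresses land in directory entry 1: they are in
      the same 4MB region, so they share a second-level page table and
      differ only in which entry of that table they use.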
    * Helpful reminders:

        --each entry in the page *directory* corresponds to 4MB of virtual
          address space ("corresponds to" means "selects the second-level
          page table that actually governs the mapping").

        --each entry in the page *table* corresponds to 4KB of virtual
          address space

        --so how much virtual memory is each page *table* responsible for
          translating? 4KB? 4MB? something else?

        --each page directory and each page table itself consumes 4KB of
          physical memory, i.e., each one of these fits on a page

    ------------------------------------------------------------------

    putting it all together.... here is how the x86's MMU translates a
    linear address to a physical address:

    ("linear address" is a synonym for "virtual address" in our context.
    the reason for the additional term is that on the x86, the
    segmentation mapping goes from virtual to linear.)

    [not discussing in class but make sure you understand what is written
    below.]

        uint
        translate (uint la, bool user, bool write)
        {
            uint pde;  /* page directory entry */
            uint pte;  /* page table entry */

            pde = read_mem (%CR3 + 4*(la >> 22));
            access (pde, user, write);   /* see function below */
            pte = read_mem ((pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
            access (pte, user, write);
            return (pte & 0xfffff000) + (la & 0xfff);
        }

        // check protection. pxe is a pte or pde.
        // user is true if CPL==3.
        // write is true if the attempted access was a write.
        // PG_P, PG_U, PG_W refer to the bits in the entry above.
        void
        access (uint pxe, bool user, bool write)
        {
            if (!(pxe & PG_P))
                => page fault -- page not present
            if (!(pxe & PG_U) && user)
                => page fault -- no access for user
            if (write && !(pxe & PG_W)) {
                if (user)
                    => page fault -- not writable
                if (%CR0 & CR0_WP)
                    => page fault -- not writable
            }
        }

    --------------------------------------------------------------------

    * Alternatives

        --Other configurations are possible (both on the x86 and on other
          hardware architectures)

        --There are some tradeoffs:

            --between large and small page sizes:

                --large page sizes mean wasting actual memory
                --small page sizes mean lots of page table entries (which
                  may or may not get consumed)

            --between many levels of mapping and few:

                --more levels of mapping means less space spent on page
                  structures when the address space is sparse (which
                  address spaces nearly always are) but more cost for the
                  hardware to walk the page tables

                --fewer levels of mapping is the other way around: need to
                  allocate larger page tables (which cost more space), but
                  the hardware has fewer levels to walk

        --Example: can get 4MB pages on the x86 (each page directory entry
          can just point to a single large page)

            + page tables smaller
            - more wasted memory

          to enable this, set PSE mode (set bit 7 in the PDE and get 4MB
          pages, no page tables)

5. TLBs

    --so it looks like the CPU (specifically its MMU) has to go out to
      memory on every memory reference?

        --called "walking the page tables"
        --to make this fast, we need a cache

    --TLB: translation lookaside buffer

        hardware that caches virtual address --> physical address
        mappings; the reason that all of this page table walking does not
        slow down the process too much

    --hardware managed? (x86.)

    --software managed? (MIPS. the OS's job is to load the TLB when the OS
      receives a "TLB miss". Not the same thing as a page fault.)

    --what happens to the TLB when %cr3 is loaded? [answer: it is flushed]

    --can we flush individual entries in the TLB otherwise?

        INVLPG addr

    --how does stuff get into the TLB?
        --answer: the hardware populates it

    --questions:

        --does a TLB miss imply a page fault? (no!)

        --does a page fault imply a TLB miss? (no!)

          (imagine a page that is mapped read-only. a user-level process
          tries to write to it. the TLB knows about the mapping, so there
          is no TLB miss. But this is still a protection violation. To cut
          down on terminology, we will lump this kind of violation in with
          "page fault".)

6. Page faults: intro and mechanics

    We discussed these above. Let's go into a bit more detail...

    Concept: a reference is illegal, either because it's not mapped in the
    page tables or because there is a protection violation. this requires
    the OS to get involved. this mechanism turns out to be hugely
    powerful, as we will see.

    Mechanics:

        --what happens on the x86? [see handout]

        --the kernel constructs a trap frame and transfers execution to an
          interrupt or trap handler

                     ss
                     esp      [former value of stack pointer]
                     eflags   [former value of eflags]
                     cs
            %esp --> eip      [instruction that caused the trap]
                     [error code]

          %eip is now executing code to handle the trap

          [how did the processor know what to load into %eip?]

        --error code:

            [ ................................ | U/S | W/R | P ]
                          unused

            U/S: fault happened in user mode / supervisor mode
            W/R: access was a read / access was a write
            P:   page not present / protection violation on a page

        --on a page fault, %cr2 holds the faulting linear address

        --intent: when a page fault happens, the kernel sets up the
          process's page entries properly, or kills the process

7. The uses of paging and page faults

    --Best example: overcommitting physical memory (the classical use of
      "virtual memory")

        --your program thinks it has, say, 512 MB of memory, but your
          hardware has only 256 MB of memory

        --the way that this worked is that the disk was (is) used to store
          memory pages

        --advantage: the address space looks huge

        --disadvantage: accesses to "paged" memory (as pages that live on
          the disk are known) are sllooooowwwww

        --Rough implementation:

            --on a page fault, the kernel reads in the faulting page

            --QUESTION: what is listed in the page structures? how does
              the kernel know whether the address is invalid, in memory,
              paged out, what?

            --the kernel may need to send a page to disk (under what
              conditions? answer: two conditions must hold for the kernel
              to HAVE to write to disk:)

                (1) the kernel is out of memory
                (2) the page that it selects to write out is dirty

        --Computers have lots of memory, so it is less common to hear the
          sound of swapping these days. (You would need multiple large
          memory consumers running on the same computer.)

    --Many other uses; will discuss next time

8. Where does the OS live? In its own address space?
    --Can't do this on most hardware (e.g., the syscall instruction won't
      switch address spaces)

    --Also, it would make it harder to parse syscall arguments passed as
      pointers

    So it's actually in the same address space as the process:

    --Use protection bits to prohibit user code from writing the kernel's
      memory

    --Typically all of the kernel's text and most of its data live at the
      same virtual address in *every* address space (every process has
      virtual addresses that map to the physical memory that stores the
      kernel's instructions and data)

---------------------------------------------------------------------------

Midterms

    --don't panic if you're not happy with your score; there is lots of
      opportunity to bring things up

    --regardless of how you think you did, *please* make sure you
      understand all the answers; the solutions are posted on the course
      Web page, and they are intended to be helpful here

    --some notes about the grading:

        --we weren't hugely generous with partial credit, especially when
          an answer indicated a misunderstanding. that's for several
          reasons:

            --want to be clear with you about what you do and don't
              understand, as reflected in what you write

            --want to be fair to those who got the question right

    --if you have questions, let me know. we tried to be careful, but it's
      possible we made mistakes.

    --please note that a regrade request will generate a regrade of the
      entire exam

---------------------------------------------------------------------------