Class 3 CS 372H 24 January 2012

On the board
------------
1. Last time
2. varargs
3. PC emulation
4. virtual memory
    --intro
    --segmentation on x86
    --paging on x86
    --JOS

---------------------------------------------------------------------------

1. Last time

    PC architecture, x86 instructions, gcc calling conventions

    Clarification/correction to first instruction fetch:

    --Contrary to what we said in class, the address of the first
      instruction fetched is at the top of the 32-bit *physical*
      address space (in class, we said the instruction was at the top
      of the 20-bit address space).

    --However, the way that this location is addressed by the CPU in
      16-bit mode [and what you see in QEMU] is the same as from the
      8088 days (0xf000:fff0); this is the value that we mentioned in
      class

    --This calls for a virtual memory hack, but virtual memory is
      supposed to be in play only when the processor is in 32-bit mode.
      (The fundamental reason a hack is needed is that the processor
      boots in 16-bit mode, giving it the ability to address 20 bits of
      address space (= 1 MB). Yet the actual target address needs to be
      way up at the top of a *32-bit* address space.)

    --Long story short: *virtual* address 0xf000:fff0 maps to physical
      address 0xfffffff0.

    --For more info, see Section 9.1.4 and Table 9.1 in the Intel
      architecture manual

2. varargs

    (draw picture)

                     +------------+
                     |   arg 2    |   \
                     +------------+    >- previous function's stack frame
                     |   arg 1    |   /
                     +------------+  |
                     |  ret %eip  |  /
                     +============+
                     | saved %ebp |   \
            %ebp->   +------------+   |
                     |   callee   |   |
                     |   saved    |   |
                     | registers  |   |
                     +------------+    >- current function's stack frame
                     |   local    |   |
                     | variables, |   |
                     |    etc.    |   |
                     |            |   |
            %esp->   +------------+  /

    note that printf takes a variable number of arguments, and the
    compiler cannot always know at compile time how many it will take.
    so how is such dynamic behavior implemented?!

    first, let's look at "normal" calls to printf:

    i.
someone calls printf("%d %d", 3, 4)

        [draw picture]

    ii. someone calls printf("%d", 3, 4)

    now....

    iii. what if someone does

            char* foo = "%x %x %x %x %x %x %x %x %x %x %x";
            printf(foo);

         ?

         printf has no way to check how many arguments were actually
         passed; it just walks up the stack, reading whatever is there

3. PC emulation

    --QEMU does exactly what a real PC would
    --But it is implemented in software, not hardware
    --Runs as a normal program on "host" operating system.
    --The layering looks like this:

                    |   JOS   |
        -------------------------------
        PC emulator | Web browser | ...
        -------------------------------
                      Linux
        -------------------------------
                   PC hardware
        -------------------------------

    --Uses normal programmatic constructs (if statements, memory, etc.)
      to emulate processor logic and state

    --Stores emulated CPU registers in global variables

        int32_t regs[8];
        #define REG_EAX 1
        #define REG_EBX 2
        #define REG_ECX 3
        ....
        int32_t eip;

    --Stores emulated physical memory in QEMU's memory

        char mem[256*1024*1024];

    --See handout

    --Simulates I/O devices, etc., by detecting accesses to "special"
      memory and I/O space and emulating the correct behavior: e.g.,

        --Reads/writes to emulated hard disk transformed into
          reads/writes of a file on the host system
        --Writes to emulated VGA display hardware transformed into
          drawing into an X window
        --Reads from emulated PC keyboard transformed into reads from
          host's keyboard API

---------------------------------------------------------------------------

Summary of lecture 2 and first part of lecture 3:

    --covered PC and x86, which is the platform for the labs
    --illustrated some important CS ideas
        --stored program computer
        --stack
        --memory-mapped I/O
        --equivalence of software and hardware

---------------------------------------------------------------------------

Admin:

    --lab 1 is due tomorrow
    --project partners due by February 3

---------------------------------------------------------------------------

4. Intro to virtual memory

    * top-most idea

    --let programs use addresses like 0, 0xc000, whatever.
    --OS arranges for hardware to translate these addresses
    --what piece of hardware does this? (A: MMU)
    --why doesn't the OS just translate the stuff itself? [slow]

    idea is to fool programs

    but OS also fools itself! (JOS thinks it is running at the top of
    physical memory [0xf0000000], but it is not)

    --draw picture: [CPU ---> translation box --> physical addresses]

    that translation box gives us a bunch of things

    --protection: processes can't touch each other's memory
        --idea: if you cannot name it, you cannot use it. deep idea.
    --relocation:
        --two instances of program foo are each loaded; each thinks
          it's using memory addresses like 0, 0x1234, whatever, but of
          course they're not using the same actual memory cells
    --sharing:
        --processes share memory under controlled circumstances, but
          that physical memory may show up at very different virtual
          addresses
        --that is, two processes have a different way to refer to the
          same physical memory cells

    * applied to the x86:

        logical [virtual] addresses ---> linear addresses ---> physical addresses

    --logical addresses are also known as virtual addresses
    --physical addresses are what is on the CPU's address pins
        --do they address RAM?
        --no, they refer to the physical memory map (i.e., hardware may
          do more translation)

    the first translation happens via *segment translation*
    the second translation happens via *page translation*

    segmentation is old-school and these days mostly an annoyance (but
    it cannot be turned off!)

    --however, it comes in handy every now and then for things like
      sandboxing (advanced topic) or thread-local memory (another
      advanced topic, though by the time the midterm comes around, you
      should see why segmentation could be useful to the implementer of
      a threads package)

5. Segmentation

    A. segmentation in general

    segmentation means: memory addresses treated like offsets into a
    contiguous region.

    QUESTION: if segmentation can't be turned off, how do we pretend
    it's not there?
    set its mapping to be the identity function: offset of 0 and no
    limit

    B. segmentation on the x86

    linear address = base + virtual_address

    (virtual_address is the offset here)

    what's the interface to segmentation?

    there are tables: GDT, LDT
    processor told where this table lives via LLDT, LGDT, SLDT, SGDT

    every instruction comes with an implicit *or* explicit segment
    register (the implicit case is the usual one):

        pop %ebx                    ; implicitly uses %ss
        call $0x7000                ; implicitly uses %cs
        movl $0x1234, (%eax)        ; implicitly uses %ds
        movl $0x1234, %gs:(%eax)    ; explicitly uses %gs

    [all references to %eip (such as instruction fetches) use %cs for
    translation.]

    some instructions can take "far addresses":

        ljmp $selector, $offset

    a segment register holds a segment selector

    different registers for the stack (ss), data (ds), code (cs),
    string [extra] operations (es), other fun stuff (fs, gs)

    a selector indexes into the LDT or GDT: it chooses *which* table
    and which *entry* in that table. the entry determines base, limit,
    **protection** (R/W/X, user/kernel, etc.), type

    offset better be less than limit

    example #1: say that %ds refers to this descriptor entry:

        base  0x30000000
        limit 0x0f0

    now, when program does:

        mov 0x50, %eax

    what happens? [0x50 gets translated into 0x30000050]

    example #2: what about if program does:

        mov 0x100, %eax

    ? [error.]

    NOTES:

    --Current privilege level (CPL) is in the low 2 bits of CS
    --CPL=0 is privileged O/S, CPL=3 is user
    --can app modify the descriptors in the LDT? it's in memory...
      yes it can. useful for certain things, like one user-level
      program sandboxing another.
    --app cannot just lower the CPL
    --don't confuse LDT and GDT with **IDT** (which you'll see in
      lab 3)

---------------------------------------------------------------------------

potentially useful reference:

    4KB    = 2^{12} =  0x00001000 = 0x00000fff + 1
    4MB    = 2^{22} =  0x00400000 = 0x003fffff + 1
    256 MB = 2^{28} =  0x10000000 = 0x0fffffff + 1
    4GB    = 2^{32} = 0x100000000 = 0xffffffff (+1) = ~0x00000000

    (0xef800000 >> 22) = 0x3be = 958
    (0xf0000000 >> 22) = 0x3c0 = 960

---------------------------------------------------------------------------

6. Paging

    [this idea is everywhere; we'll focus on how it works on the x86]

    --Basic idea: all of memory (physical and virtual) gets broken up
      into chunks called **PAGES**. those chunks have size =
      **PAGE SIZE**

    --we will be working almost exclusively with PAGES of
      PAGE SIZE = 4096 B = 4KB = 2^{12}

    --how many pages are there on a 32-bit architecture?
        --2^{32} bytes / (2^{12} bytes/page) = 2^{20} pages

    --it is proper and fitting to talk about pages having **NUMBERS**.
        --page 0: [0, 4095]
        --page 1: [4096, 8191]
        --page 2: [8192, 12287]
        --page 3: [12288, 16383]
        .....
        --page 2^{20}-1: [......, 2^{32} - 1]

    --unfortunately, it is also proper and fitting to talk about _both_
      virtual and physical pages having numbers.

    --sometimes we will try to be clear with terms like:

        vpn
        ppn

    --why isn't segmentation enough?

        segmentation can be a bummer when a segment grows or shrinks

        paging is much more flexible: instead of mapping a large range
        onto a large range, we are going to independently control the
        mapping for every 4 KB.

        [wow! how are we going to do that? seems like a lot of
        information to keep track of, since every virtual page in
        every process can conceivably be backed by *any* physical
        page.]

        still, segments have uses
            easy to share: just use the same segment registers

    [for the rest of this course, we will assume that segmentation on
    the x86 is configured to implement the identity mapping.]

    A.
page mapping

    --4KB pages and 4GB address space so 2^{20} pages
    --top bits of the VA select the virtual page number (vpn), which
      the mapping turns into a ppn
    --bottom bits indicate where in the page the memory reference is
      happening. sometimes called offset.

    --QUESTION: if our pages are of size 4KB = 2^{12}, then how many
      bottom bits are we talking about, and how many top bits are used
      for the layer of indirection?

      [answer: top 20 bits are doing the indirection. bottom 12 bits
      just figure out where on the page the access should take place.]

    --conceptual model: there is in the sky a 2^{20} sized array that
      maps the linear address to a *physical* page

        table[20-bit linear page number] = 20-bit physical page #

      so now all we have to do is create this mapping

      why is this hard? why not just create the mapping?
        --answer: then you need, per process, roughly 4MB
          (2^{20} entries * 32 bits per entry).

      so here's an idea:
        --break the 4MB table up into 4096 byte chunks, and reference
          those chunks in another table.
        --so how many entries does that other table need?
            --1024
        --so how big is that other table?
            --4096 bytes!
        --so basically every data structure is going to be 4096 bytes

    here's how it works in the standard configuration on the x86, but
    there are others

    two-level mapping structure.......

    [refer to handout as we go through this example....]

           pg dir        table         offset
        31 ....... 22 21 ...... 12 11 ....... 0

        31 ................................... 0

    --%cr3 is the address of the page directory.
    --top 10 bits select an entry in the page directory, which picks a
      **page table**
    --next 10 bits select the entry in the page table, which holds a
      physical page number
    --so there are 1024 entries in the page directory
    --how big is an entry in the page directory? 4 bytes

    --entry in page directory and page table:

        [ base address | bunch of bits | U/S R/W P ]
         31..........12

      why 20 bits? [answer: there are 2^20 4KB pages in the system]

      is that base address a physical address, a linear address, a
      virtual address, what?

      [answer: it is a physical address.
      hardware needs to be able to follow the page table structure.]

    --EXAMPLE

        JOS maps

            0xf0000000 to 0x00000000
            0xf0001000 to 0x00001000

        WHAT DOES THIS LOOK LIKE?

        [ pgdir with entry 960 pointing to page table.
          [put the physical page table at PPN 3.]
          page table has PPN(0th entry) = 0
          page table has PPN(1st entry) = 1 ]

    --EXAMPLE

        what if JOS wanted

            0xf0001000 to 0x91210000

        [no problem; change the phys page]

        point of this example: the mapping from VA to PA can be all
        over the place

    --ALWAYS REMEMBER
        --each entry in the page *directory* corresponds to 4MB of
          virtual address space
        --each entry in the page *table* corresponds to 4KB of virtual
          address space
        --so how much virtual memory is each page *table* responsible
          for translating? 4KB? 4MB? something else?
        --each page directory and each page table itself consumes 4KB
          of physical memory, i.e., each one of these fits on a page

    --So this is the picture we have so far:

        a VA is 32 bits:

               pg dir        table         offset
            31 ....... 22 21 ...... 12 11 ....... 0

            31 ................................... 0

    --go back to entry in page directory and page table:

        [ base address | bunch of bits | U/S R/W P ]
         31..........12

        bunch of bits includes: dirty, accessed, cache disabled,
        write through

    --what do these U/S and R/W bits do?
        --are these for the kernel, the hardware, what?
        --who is setting them? what is the point?
        --what happens if U/S and R/W differ in pgdir and table?
          [processor does something deterministic; look up in
          references]

    --can user modify page tables? they are in memory.......
        --but how can the user see them?
        --the page tables themselves can be mapped into the user's
          address space!
        --we will see this in the case of JOS below

------------------------------------------------------------------

putting it all together....

here is how the x86's MMU translates a linear address to a physical
address:

[not discussing in class but make sure you perfectly understand what
is written below.]
    uint translate (uint la, bool user, bool write)
    {
        uint pde, pte;

        pde = read_mem (%CR3 + 4*(la >> 22));
        access (pde, user, write);
        pte = read_mem ((pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
        access (pte, user, write);
        return (pte & 0xfffff000) + (la & 0xfff);
    }

    // check protection. pxe is a pte or pde.
    // user is true if CPL==3.
    // write is true if the attempted access was a write.
    // PG_P, PG_U, PG_W refer to the bits in the entry above
    void access (uint pxe, bool user, bool write)
    {
        if (!(pxe & PG_P))
            => page fault -- page not present
        if (!(pxe & PG_U) && user)
            => page fault -- no access for user
        if (write && !(pxe & PG_W)) {
            if (user)
                => page fault -- not writable
            if (%CR0 & CR0_WP)
                => page fault -- not writable
        }
    }

--------------------------------------------------------------------

B. memory in JOS

    --segments only used to switch privilege level into and out of the
      kernel
    --paging structures the address space
    --paging limits process memory access to its own address space

    --see handout for JOS virtual memory map

    --why are kernel and current process both mapped into the address
      space?
        --convenient for kernel

    --why is all of physical memory mapped at the top? that must mean
      that there are physical memory pages that are mapped in multiple
      places....
        --need to be able to get access to physical memory when
          setting up page tables: *kernel* has to be able to use
          physical addresses from time to time

    --wouldn't it be awesome if the 4MB worth of page table appeared
      inside the virtual address space, at address, say, 0xef800000
      (which we call UVPT)?

    --what happens if we sneakily insert a pointer in the pgdir back
      to the pgdir itself, like this:

        1023 |                        |
         960 |                        |
        .....|........................|
         958 | self..               U |
        .....|........................|
           0 | .......... not present |

    --result: the page tables *themselves* show up in the program's
      virtual address space. where? [0xef800000, 0xefc00000) --> looks
      like one contiguous page table, visible to users. read only.
      rock!
more specifically: the picture of [UVPT, UVPT+4MB) in virtual space is:

    UVPT+4MB __________________

             PGTABLE 1023
             __________________
                    .
                    .
                    .
             __________________

             PGTABLE 2
             __________________

             PGTABLE 1
             __________________

             PGTABLE 0
        UVPT __________________

    --QUESTION:

        * where does the **pgdir itself** live in the virtual address
          space?

            --0xef800000 ?
            --0xef800000 + 4KB ?
            --0xef800000 + 4KB * 958 ?
            --0xef800000 + 4KB * 960 ?
            --0xf0000000 ?

    --something it is probably worth internalizing: one of the things
      that a second-level page table is doing is to take as many as
      1024 disparate physical pages, perhaps scattered throughout RAM,
      and glue them together in a logical way, making them appear as a
      contiguous 4MB region in virtual space (just as the entire page
      structure glues disparate physical pages into a 4GB "region").
      if the second-level page table that is chosen for this gluing is
      the page directory itself, then the disparate physical pages
      that all appear as a contiguous 4MB region wind up being the
      page tables themselves

    --with the above as background, here is further detail on the JOS
      implementation trick: this works because the page directory has
      the same structure as a page table and because the CPU just
      "follows arrows", namely:

        (1) From the relevant entry in the pgdir [which entry, recall,
            covers 4MB worth of VA space] to the physical page number
            where the relevant page table lives

        (2) From the physical page number where the relevant page
            table lives, more specifically the relevant entry in the
            relevant page table (which is relevant to 4KB of address
            space), to the physical page number that is the target of
            the mapping.
now, if you "trick" the CPU into following the first arrow back to the
pgdir itself, and the program references an address 0xef800000+x,
where x < 4MB, then the logic goes like this (compare the exact words
below to the exact words of the numbered items above):

    (1) From the relevant entry in the pgdir [which entry, recall, is
        covering the 4MB worth of VA space from
        [0xef800000, 0xefc00000)] to the physical page number where
        the page directory lives

    (2) From the physical page number where the page directory lives,
        more specifically the relevant entry in the page directory
        (which now is relevant to only 4KB of address space), to the
        physical page number that is the target of the
        <0xef800000+x, PA> mapping. that physical page holds a
        second-level page table!

result: the second-level page table appears at 0xef800000+x