Class 12
CS 202
12 March 2015

On the board
------------

1. Last time

2. Segmentation

3. Paging
    --Intro
    --key data structure: page table
    --Segmentation vs. paging

4. Case study: x86
    --Alternatives

5. TLBs

6. Page faults: intro and mechanics

7. The uses of paging and page faults

8. Where does the OS live?

---------------------------------------------------------------------------

1. Last time

    [keyboard repeat rate: BIOS vs. keyboard driver]

    virtual memory intro
    segmentation introduction
    segmentation on x86: won't cover explicitly; in the notes below.
    today: paging

    ---

    to understand what is going on in virtual memory, one can think of it
    like this:

        the OS defines a table on behalf of a process

        the virtual address itself (that is, the address used by the
        program) determines the row in the table that the MMU should look
        up

        roughly speaking, the top digits select a table entry; the bottom
        digits determine an offset into the region.

    for our segmentation example from last time, we can visualize the
    result roughly like this:

        the bottom quarter of the address space is translated according to
        the first row (not exactly the bottom quarter, since some of the
        addresses aren't valid; the same holds for the quarters below),

        the next quarter of the address space is translated according to
        the second row,

        the third quarter of the address space is translated according to
        the third row, and

        the fourth quarter consists of invalid addresses.

    ---

2. Segmentation

 A. segmentation in general [last time]

 B. segmentation on the x86

    [not covering the details in lecture; including for reference here]

    linear address = base + virtual_address

    (virtual_address is the offset here)

    what's the interface to segmentation?
    [see handout]

    there are tables populated by the OS: the GDT and LDT (global
    descriptor table, local descriptor table)

        *the entries in these tables define the segments*

        each entry determines base, limit, **protection** (R/W/X,
        user/kernel, etc.), and type

        these tables are analogous to the table we saw last time (except
        that a process has two of them instead of one).

        the processor is told where these tables live via LLDT, SLDT,
        LGDT, SGDT (load local descriptor table, store local descriptor
        table, etc.)

    there are also segment selector registers on the CPU:

        %ss (stack segment selector)
        %cs (code segment selector)
        %ds (data segment selector)
        %es (string [extra] segment selector)
        %fs
        %gs

        [more on these in a moment]

    every program instruction comes with an implicit *or* explicit segment
    register (the implicit case is the usual one). examples:

        pop %ebx                  ; implicitly uses %ss
        call $0x7000              ; implicitly uses %cs
        movl $0x1234, (%eax)      ; implicitly uses %ds
        movl $0x1234, %gs:(%eax)  ; explicitly uses %gs

    [all references to %eip (such as instruction fetches) use %cs for
    translation.]

    some instructions can take "far addresses":

        ljmp $selector, $offset   [makes the selector explicit]

        [can do similar things with loads and stores]

    a selector (held in %ss, %cs, etc.) indexes into the LDT or GDT, and
    chooses *which* table and which *entry* in that table

    analogy with the 14-bit example from last time:

        the value held in the relevant segment selector is analogous to
        the top digit [two bits] in our 14-bit example from last time. the
        entire virtual address is analogous to the bottom 3 digits [12
        bits] in that example.

        to be clear, the idea is that on the x86 (as opposed to our
        example) the segment selector is not part of the virtual address;
        it's either implicit or explicitly part of the instruction.

    the offset needs to be less than the limit

    example #1: say that %ds refers to an entry in the LDT with these
    parameters:

        base  0x30000000
        limit 0x300000f0

    now, when the program does:

        mov 0x50, %eax

    what happens?
    [0x50 gets translated into 0x3000 0050]

    example #2: what about if the program does:

        mov 0x100, %eax      ?

    [error.]

3. Paging

 A. Intro

    --Basic concept: divide all of memory (physical and virtual) into
      *fixed-size* chunks.

        --these chunks are called *PAGES*.

        --they have a size called the PAGE SIZE. (different hardware
          architectures specify different sizes)

        --in the traditional x86 (and in our labs), the PAGE SIZE will be
          4096 B = 4KB = 2^{12} bytes

    --Warm-up:

        --how many pages are there on a 32-bit architecture?

        --2^{32} bytes / (2^{12} bytes/page) = 2^{20} pages

    --Each process has a separate mapping

        --And each page is separately mapped

    --we will allow the OS to gain control on certain operations

        --Read-only pages trap to the OS on write
        --Invalid pages trap to the OS on read or write
        --the OS can change the mapping and resume the application

        (Harder to do this kind of thing with segments because the mapping
        is more coarse-grained.)

    --it is proper and fitting to talk about pages having **NUMBERS**.

        --page 0:         [0, 4095]
        --page 1:         [4096, 8191]
        --page 2:         [8192, 12287]
        --page 3:         [12288, 16383]
        .....
        --page 2^{20}-1:  [2^{32} - 2^{12}, 2^{32} - 1]

    --unfortunately, it is also proper and fitting to talk about _both_
      virtual and physical pages having numbers. sometimes we will try to
      be clear with terms like:

        vpn (virtual page number)
        ppn (physical page number)

 B. Key data structure: page table

    --conceptual model: (assuming 32-bit addresses and 4KB pages) there is
      in the sky a 2^{20}-entry array that maps each virtual page to a
      *physical* page:

        table[20-bit virtual page number] = 20-bit physical page number

      EXAMPLE: if the OS wants a program to be able to use address
      0x00402000 to refer to physical address 0x00003000, then the OS
      conceptually adds an entry:

        table[0x00402] = 0x00003

      (this maps virtual page 1026 to physical page 3.) in decimal:

        table[1026] = 3

      below, we will see how this is actually implemented

    NOTE: the top 20 bits are doing the indirection. the bottom 12 bits
    just determine where on the page the access should take place.
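    the top-bits/bottom-bits split can be sketched in a few lines of C.
    (this is an illustration of the conceptual model only, not kernel
    code; the single table entry is the 0x00402 -> 0x00003 example from
    above, and the address 0x00402abc is a made-up address on that page.)

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t va = 0x00402abc;     /* some address on virtual page 0x00402 */

        uint32_t vpn    = va >> 12;   /* top 20 bits: virtual page number */
        uint32_t offset = va & 0xfff; /* bottom 12 bits: where on the page */
        assert(vpn == 0x00402 && offset == 0xabc);

        /* conceptually, ppn = table[vpn]; here we hard-code the example
           entry table[0x00402] = 0x00003 */
        uint32_t ppn = 0x00003;
        uint32_t pa  = (ppn << 12) | offset;  /* same offset, new page */

        printf("va 0x%08x -> pa 0x%08x\n", (unsigned) va, (unsigned) pa);
        return 0;
    }
    ```

    note that the offset passes through untranslated: only the page number
    is remapped.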
        --the bottom bits are sometimes called the offset.

    --so now all we have to do is create this mapping

    --why is this hard? why not just create the mapping?

        --answer: then you need, per process, roughly 4MB (2^{20} entries
          * 32 bits per entry)

        --we will deal with this shortly

        --key idea: represent the page table as a tree that is sparse
          (i.e., many of the child nodes are never filled in)

 C. Segmentation vs. paging

    --paging:

        + eliminates external fragmentation
        + not much internal fragmentation
        + easier to allocate, free, swap, etc.
        - data structures are larger
        - more complex
        + overall: more flexible. (intuition: the mapping is more
          fine-grained, which means more OS control over it) (in more
          detail, instead of mapping a large range into a large range, we
          are going to independently control the mapping for every 4 KB.)

    --segmentation:

        - vulnerable to two kinds of fragmentation
        - hard to handle growth or shrinkage of a segment
        + smaller data structures
        + simpler overall
        - overall: less flexible

    --Segmentation is old-school and these days mostly an annoyance (but
      it cannot be turned off on the x86!)

    --however, it comes in handy every now and then:

        --thread-local memory
        --sandboxing (advanced topic)
        --it also makes it easy to share memory among processes: just use
          the same segment registers (sharing requires a bit more work if
          paging is in effect)

4. Case study: virtual memory on x86

    * Has segmentation and paging. Cannot turn off segmentation (even
      though we usually want to). Instead, set things up so that
      segmentation has no effect.

      Question: how?

      (Answer: by setting its mapping to be the identity function. Make
      the base 0 and the limit the maximum.)

    * We will focus on paging

      best overview: the Intel manual
      http://www.cs.nyu.edu/~mwalfish/classes/15sp/ref/i386/s05_02.htm

      see handout from last time

      two-level mapping structure.......

    * a VA is 32 bits:

        31 ................................... 0

    * and it gets divided as follows:

        |  dir ent  | table ent |  offset  |
         31 ...... 22 21 ..... 12 11 ..... 0

    --%cr3 holds the address of the page directory.

    --the top 10 bits (the first two nibbles plus the first half of the
      third nibble) select an entry in the page directory; this entry
      points to a **page table**

    --the next 10 bits select the entry in the page table, which holds a
      physical page number

    --so there are 1024 entries in the page directory

    --how big is an entry in the page directory? 4 bytes

    --entry in page directory and page table:

        [ base address | bunch of bits | U/S | R/W | P ]
         31..........12

        why 20 bits? [answer: there are 2^{20} 4KB pages in the system]

        is that base address a physical address, a linear address, a
        virtual address, what?

        [answer: it is a physical address. the hardware needs to be able
        to follow the page table structure.]

        the bunch of bits includes:

            dirty           (set by hardware)
            accessed        (set by hardware)
            cache disabled  (set by OS)
            write through   (set by OS)

        what do the U/S and R/W bits do?

            --are these for the kernel, the hardware, what?
            --who is setting them? what is the point?

            (the OS sets them to indicate protection; the hardware
            enforces them)

        what happens if U/S and R/W differ in the pgdir and the page
        table? [the processor does something deterministic; look it up in
        the references]

    * EXAMPLES

      Approach: examine an address and divide it up. Get used to doing
      this. We will work a few examples in class.

      Basic question: what does the OS put in the data structures that are
      visible to the CPU's MMU to enable different mappings?

      What if the OS wants to map a process's virtual address 0x00402[000]
      to physical address 0x00003[000] and make it accessible to user
      level but read-only?

            PGDIR                          PGTABLE

                                  <20 bits>   <12 bits>
          .......                _________ _____________
          [entry 2]             | 0x00003 | U=1,W=0,P=1 |  [entry 2]
          [entry 1] ----.       |         |             |  [entry 1]
          [entry 0]      `----> |_________|_____________|  [entry 0]
          .......

      Now what if the OS wants to map that process's virtual address
      0x00403[000] to physical address 0x80000[000] [this is physical
      address 2GB] and make it accessible to user level and read/write?
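      one way to check your answers to examples like these is to compute
      the indices mechanically. the C sketch below is an illustration only
      (the flag macros PG_P/PG_W/PG_U and the helper functions are made-up
      names modeled on the bit layout above, not a real kernel API): it
      splits a linear address into directory index, table index, and
      offset, and assembles a page-table entry from a physical page number
      and the U/W/P bits.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      /* flag bits in a PDE/PTE, following the layout described above */
      #define PG_P 0x001  /* present */
      #define PG_W 0x002  /* writable (R/W bit) */
      #define PG_U 0x004  /* user-accessible (U/S bit) */

      /* split a 32-bit linear address into its three pieces */
      static void split(uint32_t la, uint32_t *dir, uint32_t *tab, uint32_t *off) {
          *dir = la >> 22;           /* top 10 bits: page directory index */
          *tab = (la >> 12) & 0x3ff; /* next 10 bits: page table index */
          *off = la & 0xfff;         /* bottom 12 bits: offset in page */
      }

      /* build an entry: 20-bit physical page number plus flag bits */
      static uint32_t make_pte(uint32_t ppn, uint32_t flags) {
          return (ppn << 12) | flags;
      }

      int main(void) {
          uint32_t dir, tab, off;

          /* first example: 0x00402000 -> 0x00003000, user, read-only */
          split(0x00402000, &dir, &tab, &off);
          assert(dir == 1 && tab == 2 && off == 0);
          uint32_t pte1 = make_pte(0x00003, PG_U | PG_P); /* W=0 */

          /* second example: 0x00403000 -> 0x80000000, user, read/write */
          split(0x00403000, &dir, &tab, &off);
          assert(dir == 1 && tab == 3 && off == 0);
          uint32_t pte2 = make_pte(0x80000, PG_U | PG_W | PG_P);

          printf("pte1=0x%08x pte2=0x%08x\n", (unsigned) pte1, (unsigned) pte2);
          return 0;
      }
      ```

      notice that both addresses land in directory entry 1: they are in
      the same 4MB region, so they share a second-level page table and
      differ only in which entry of that table they use.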
    * Helpful reminders:

        --each entry in the page *directory* corresponds to 4MB of virtual
          address space ("corresponds to" means "selects the second-level
          page table that actually governs the mapping").

        --each entry in the page *table* corresponds to 4KB of virtual
          address space

        --so how much virtual memory is each page *table* responsible for
          translating? 4KB? 4MB? something else?

        --each page directory and each page table itself consumes 4KB of
          physical memory, i.e., each one of these fits on a page

    ------------------------------------------------------------------

    putting it all together.... here is how the x86's MMU translates a
    linear address to a physical address:

    ("linear address" is a synonym for "virtual address" in our context.
    the reason for the additional term is that on the x86, the
    segmentation mapping goes from virtual to linear.)

    [not discussing in class but make sure you understand what is written
    below.]

        uint
        translate (uint la, bool user, bool write)
        {
            uint pde;  /* page directory entry */
            uint pte;  /* page table entry */

            pde = read_mem (%CR3 + 4*(la >> 22));
            access (pde, user, write);   /* see function below */
            pte = read_mem ((pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
            access (pte, user, write);
            return (pte & 0xfffff000) + (la & 0xfff);
        }

        // check protection. pxe is a pte or pde.
        // user is true if CPL==3.
        // write is true if the attempted access was a write.
        // PG_P, PG_U, PG_W refer to the bits in the entry above.
        void
        access (uint pxe, bool user, bool write)
        {
            if (!(pxe & PG_P))
                => page fault -- page not present
            if (!(pxe & PG_U) && user)
                => page fault -- no access for user
            if (write && !(pxe & PG_W)) {
                if (user)
                    => page fault -- not writable
                if (%CR0 & CR0_WP)
                    => page fault -- not writable
            }
        }

    --------------------------------------------------------------------

    * Alternatives

        --Other configurations are possible (both on the x86 and on other
          hardware architectures)

        --There are some tradeoffs:

            --between large and small page sizes:

                --large page sizes mean wasting actual memory
                --small page sizes mean lots of page table entries (which
                  may or may not get consumed)

            --between many levels of mapping and few:

                --more levels of mapping means less space spent on page
                  structures when the address space is sparse (which
                  address spaces nearly always are) but more cost for the
                  hardware to walk the page tables

                --fewer levels of mapping is the other way around: need to
                  allocate larger page tables (which cost more space), but
                  the hardware has fewer levels to walk

        --Example: can get 4MB pages on the x86 (each page directory entry
          can just point to a single large page)

            + page tables smaller
            - more wasted memory

          to enable this, set PSE mode (set bit 7 in the PDE and get 4MB
          pages, no page tables)

5. TLBs

    --so it looks like the CPU (specifically its MMU) has to go out to
      memory on every memory reference?

        --called "walking the page tables"
        --to make this fast, we need a cache

    --TLB: translation lookaside buffer

        hardware that caches virtual address --> physical address
        mappings; the reason that all of this page table walking does not
        slow down the process too much

    --hardware managed? (x86.)

    --software managed? (MIPS. the OS's job is to load the TLB when the OS
      receives a "TLB miss". Not the same thing as a page fault.)

    --what happens to the TLB when %cr3 is loaded? [answer: it is flushed]

    --can we flush individual entries in the TLB otherwise?

        INVLPG addr

    --how does stuff get into the TLB?
        --answer: the hardware populates it

    --questions:

        --does a TLB miss imply a page fault? (no!)

        --does a page fault imply a TLB miss? (no!)

          (imagine a page that is mapped read-only. a user-level process
          tries to write to it. the TLB knows about the mapping, so there
          is no TLB miss. But this is still a protection violation. To cut
          down on terminology, we will lump this kind of violation in with
          "page fault".)

6. Page faults: intro and mechanics

    We discussed these above. Let's go into a bit more detail...

    Concept: a reference is illegal, either because it's not mapped in the
    page tables or because there is a protection violation. this requires
    the OS to get involved. this mechanism turns out to be hugely
    powerful, as we will see.

    Mechanics:

        --what happens on the x86? [see handout]

        --the kernel constructs a trap frame and transfers execution to an
          interrupt or trap handler

                     ss
                     esp      [former value of stack pointer]
                     eflags   [former value of eflags]
                     cs
            %esp --> eip      [instruction that caused the trap]
                     [error code]

          %eip is now executing code to handle the trap

          [how did the processor know what to load into %eip?]

        --error code:

            [ ................................ | U/S | W/R | P ]
                          unused

            U/S: fault happened in user mode / supervisor mode
            W/R: access was a read / access was a write
            P:   page not present / protection violation on a page

        --on a page fault, %cr2 holds the faulting linear address

        --intent: when a page fault happens, the kernel sets up the
          process's page entries properly, or kills the process

7. The uses of paging and page faults

    --Best example: overcommitting physical memory (the classical use of
      "virtual memory")

        --your program thinks it has, say, 512 MB of memory, but your
          hardware has only 256 MB of memory

        --the way that this worked is that the disk was (is) used to store
          memory pages

        --advantage: the address space looks huge

        --disadvantage: accesses to "paged" memory (as pages that live on
          the disk are known) are sllooooowwwww

        --Rough implementation:

            --on a page fault, the kernel reads in the faulting page

            --QUESTION: what is listed in the page structures? how does
              the kernel know whether the address is invalid, in memory,
              paged out, what?

            --the kernel may need to send a page to disk (under what
              conditions? answer: two conditions must hold for the kernel
              to HAVE to write to disk:)

                (1) the kernel is out of memory
                (2) the page that it selects to write out is dirty

        --Computers have lots of memory, so it is less common to hear the
          sound of swapping these days. (You would need multiple large
          memory consumers running on the same computer.)

    --Many other uses; will discuss next time

8. Where does the OS live? In its own address space?
    --Can't do this on most hardware (e.g., the syscall instruction won't
      switch address spaces)

    --Also, it would make it harder to parse syscall arguments passed as
      pointers

    So it's actually in the same address space as the process:

    --Use protection bits to prohibit user code from writing the kernel's
      memory

    --Typically all of the kernel's text and most of its data live at the
      same virtual address in *every* address space (every process has
      virtual addresses that map to the physical memory that stores the
      kernel's instructions and data)

---------------------------------------------------------------------------

Midterms

    --don't panic if you're not happy with your score; there is lots of
      opportunity to bring things up

    --regardless of how you think you did, *please* make sure you
      understand all the answers; the solutions are posted on the course
      Web page, and they are intended to be helpful here

    --some notes about the grading:

        --we weren't hugely generous with partial credit, especially when
          an answer indicated a misunderstanding. that's for several
          reasons:

            --want to be clear with you about what you do and don't
              understand, as reflected in what you write

            --want to be fair to those who got the question right

    --if you have questions, let me know. we tried to be careful, but it's
      possible we made mistakes.

    --please note that a regrade request will generate a regrade of the
      entire exam

---------------------------------------------------------------------------