Class 15
CS 202
26 March 2020

On the board
------------
1. Last time
2. Lab4 hints/background, continued
3. Context switches (WeensyOS)
4. Synchronous vs async I/O
5. User-level threading, intro
6. [skipped] Context switches (user-level threading)
7. Cooperative multithreading
8. Preemptive user-level multithreading

---------------------------------------------------------------------------

1. Last time

    I/O (CPU/device)
    lab 4
    today, a mixture of topics

2. lab4 hints/background

    - exercise: how many page tables are allocated for 3MB? what's the
      structure?
        - 3MB of virtual address space, but the L4 page table that handles
          [2MB, 3MB) is allocated only on demand.
        - thus, make sure when calling virtual_memory_map that you're
          passing in a non-NULL allocator when you're supposed to.

    - process control block (PCB): this is the "struct proc" in kernel.h

    - recall:
        register %rax is the system call return value
        register %rdi contains the system call argument

    - remember: bugs in earlier parts may show up only later

    - pageinfo array:

        typedef struct physical_pageinfo {
            int8_t owner;
            int8_t refcount;
        } physical_pageinfo;

        static physical_pageinfo pageinfo[PAGENUMBER(MEMSIZE_PHYSICAL)];

    - x86_64_pagetable: an array of 512 entries (each 8 bytes)

    ----

    [what's below here are detailed notes from a prior recitation on lab4.]

    - Kernel virtual addresses: the kernel is set up to use an identity
      mapping: [0, MEMSIZE_PHYSICAL) -> [0, MEMSIZE_PHYSICAL)

    - Physical pages' metadata is recorded in the pageinfo array, whose
      elements contain a refcount and an owner; the owner can be the
      kernel, reserved, free, or a pid.

    - Process control block:
        * process registers, process state
        * process page table: a pointer (a kernel virtual address, which
          under the identity mapping equals the physical address) to an L1
          page table. The L1 page table's first entry points to an L2 page
          table, and so on.

      Our job mainly consists of manipulating the page tables and the
      pageinfo array.

    - High-level evolution of the lab. We have five programs: the kernel
      + 4 processes.

    Ex1.
    Five processes share the same page table. All virtual addresses are
    mapped PTE_U | PTE_P | PTE_W.

        Job: mark some of the addresses as (PTE_P | PTE_W), i.e., not
        user-accessible.

    Ex2. Each program uses its own page table.

        The kernel already has a page table. The job is to allocate and
        populate a page table for each process. The process breaks down
        into helper functions:
            1. allocate a new page for process pid, and zero it (important)
            2. populate the new page table. You can memcpy the kernel's,
               but it's easier to copy the kernel's mappings individually.
        In order to achieve the screenshot, after copying, we have to mark
        [prog_addr_start, virtual_addr_size) as not-present.

    Ex3. Physical page allocation.

        Motivation: before this exercise, during sys_page_alloc, when a
        process asks for a specific virtual page, the identity mapping is
        used to find the physical page. But that is too restrictive: with
        virtual memory, the process does not really care which physical
        page it gets.

        If we have implemented helper function 1 mentioned in Ex2
        (allocate a free page), then we are mostly good to go: just use
        that function. We also need to connect virtual to physical by
        setting the corresponding page table entry. Use
        virtual_memory_map.

    Ex4. Overlapping virtual addresses.

        Motivation: every process has its own page table and its own set
        of accessible virtual addresses (the PTE_P portions), so we don't
        need to restrict processes to using different parts of virtual
        memory. The addresses can overlap, as long as the physical pages
        backing them do not.

        Easy to do: in process_setup, use (MEMSIZE_VIRTUAL - PAGESIZE)
        instead of the per-process arithmetic to compute the process's
        stack page.

    Ex5. Fork.

        High-level goal: produce a mostly identical process (minus the
        register %rax).

        What does it mean to be an identical process?
            1. same binary
            2. same process registers
            3. AND same memory state/contents
        3 basically covers 1, because the binary is loaded in memory too.
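    Point 3 is the crux. As a self-contained sketch of the two recurring
    moves (find a PO_FREE page, assign it to a pid, and zero it; then copy
    a parent page's contents into it), here is a toy model. The names
    mimic the lab's (pageinfo, PO_FREE, PAGESIZE), but the definitions
    below are simplified stand-ins, not the actual WeensyOS code:

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Simplified stand-ins for the lab's kernel.h definitions, so this
       sketch is self-contained; the real WeensyOS code differs. */
    #define PAGESIZE 4096
    #define NPAGES   8
    #define PO_FREE  0

    typedef struct physical_pageinfo {
        int8_t owner;       /* PO_FREE, or the owning pid */
        int8_t refcount;
    } physical_pageinfo;

    static physical_pageinfo pageinfo[NPAGES];
    static uint8_t physmem[NPAGES * PAGESIZE];  /* fake physical memory */

    /* Find a PO_FREE physical page, assign it to pid, zero it, and
       return its page number; -1 if physical memory is exhausted. */
    static int page_alloc_zeroed(int8_t pid) {
        for (int pn = 0; pn < NPAGES; pn++) {
            if (pageinfo[pn].owner == PO_FREE && pageinfo[pn].refcount == 0) {
                pageinfo[pn].owner = pid;
                pageinfo[pn].refcount = 1;
                memset(&physmem[pn * PAGESIZE], 0, PAGESIZE); /* zero it! */
                return pn;
            }
        }
        return -1;
    }

    /* The heart of fork's point 3: back the child's virtual page with a
       new physical page whose contents are copied from the parent's. */
    static int page_copy(int src_pn, int8_t child_pid) {
        int dst_pn = page_alloc_zeroed(child_pid);
        if (dst_pn >= 0)
            memcpy(&physmem[dst_pn * PAGESIZE],
                   &physmem[src_pn * PAGESIZE], PAGESIZE);
        return dst_pn;  /* the caller still has to map it */
    }

    int main(void) {
        int parent_pn = page_alloc_zeroed(1);
        physmem[parent_pn * PAGESIZE] = 42;   /* parent writes a byte */
        int child_pn = page_copy(parent_pn, 2);
        printf("parent pn=%d child pn=%d copied byte=%d\n",
               parent_pn, child_pn, physmem[child_pn * PAGESIZE]);
        return 0;
    }
    ```

    In the real lab, the last step after page_copy would be connecting
    virtual to physical with virtual_memory_map.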
        2 is easy to achieve (copy the registers; can do this with a
        single line of C code). The goal here is mainly to achieve 3.
        Fork creates a copy: the memory state has to be a copy!

        Question: what does it mean to make a copy of memory?
            - The virtual pages are backed by physical pages, so we
              allocate new physical pages and copy the contents over to
              the new pages (memcpy).
            - Then connect virtual to physical by setting the page table.

        The address space is potentially 256 TB large; do we copy 256 TB?
        How do we know which parts to copy?
            - Iterate over the virtual address space; find pages that are
              (PTE_P | PTE_U | PTE_W).

        Given a page table entry, how do you check if it is user-RW-able?
        Fill in the blanks:

            pte_val _ (PTE_P | PTE_W | PTE_U) == ___

        How do you find its corresponding physical page? PTE_ADDR.

    - Useful functions to implement for said manipulations:
        * find a PO_FREE physical page and assign it to a process
          (useful for Ex2, 3, 4, 5)
        * allocate an empty page dir + page table for a process (Ex2, 4)
        * make a copy of an existing page table and assign it to a
          process (Ex2, 5)
        * implement your own helper functions as you see fit
      Tip: zero an allocated page before using it!! (memset)

    - Some useful functions/macros:
        PTE_ADDR: page table entry -> physical address
        PAGENUMBER: physical address -> corresponding index into the
            pageinfo array
        PAGEADDR: PAGENUMBER^{-1}
        virtual_memory_lookup(pagetable, va)

3. Context switches in WeensyOS

    - on interrupt, hardware saves a "trapframe": %rip/%rsp/%rflags and
      lots of other things. It saves all of that on the *kernel's stack*,
      at a well-known place in kernel memory.
    - %rdi is set equal to the stack pointer (which points at the saved
      registers); %rdi shows up, in C code, as the argument to
      exception()
    - exception() does a brute-force struct copy of the trapframe into
      process-specific memory in kernel space
    - now all of the process's registers just live in the PCB /
      struct proc
    - kernel does its thing.
    - kernel gets ready to choose another process
        - remember, that process had the same thing happen, so all of
          *the new process's* registers are sitting in the same kind of
          memory mentioned above.
    - now, exception_return(&p->p_registers)
        - note: %rdi holds the address of the saved registers
        - set the stack pointer equal to that address
        - that means that popq will do the "right" thing: pop the saved
          registers into the CPU's
        - add 16 to skip past the saved codes (error code if the page
          fault handler, and interrupt code in all cases)
        - now the stack pointer is pointing to the trapframe from the
          time of the trap
        - that trapframe includes %rip and the trap-time %rsp
        - iretq brings us back into user space with the %rip and %rsp
          from the time of the trap

4. Synchronous vs asynchronous I/O

    - A question of interface.

    - NOTE: the kernel never blocks when issuing I/O. We're discussing
      the interface presented to user-level processes.

    - Synchronous I/O: system calls block until they're handled.

    - Asynchronous I/O: I/O doesn't block. For example, if a call like
      read() or write() _would_ block, then it instead returns
      immediately but sets a flag indicating that it _would_ have
      blocked. The process discovers that data is ready either by making
      another query or by registering to be notified by a signal (we
      discuss signals later).

    - Annoyingly, the standard POSIX interface for files is always
      blocking. You need to use platform-specific extensions to POSIX to
      get async I/O for files. (Although below, we will assume a
      non-blocking read(). This isn't a total abuse, because read() can
      be set to be non-blocking if the fd represents a device, pipe, or
      socket.)

    - Pros/cons:
        - a blocking interface leads to more readable code, when
          considering the code that invokes that interface
        - but blocking interfaces BLOCK, which means that the code
          _above_ the interface cannot suddenly switch to doing something
          else. If we want concurrency, it has to be handled by a layer
          _underneath_ the blocking interface.
    - We'll see an example of this below.

5.
User-level threading

    Setting: there's a _threading package_.

    --Review: what *is* a kernel-managed thread? (We refer to that as
      "kernel-level threading.")
    --basically the same as a process, except two threads in the same
      process have the same value for %cr3
    --recall: kernel-level threads are always preemptive
    --We can also have *user*-level threading, in which the kernel is
      completely ignorant of the existence of threading.

        [draw picture]

             T1  T2  T3
            thr package
                 OS
                H/W

    --in this case, the threading package is the layer of software that
      maintains the array of TCBs (thread control blocks)
    --the threading package has other responsibilities as well:
        --make a new stack for each new thread
        --scheduling!
    --user-level threading can be non-preemptive (cooperative) or
      preemptive. We'll look at both.

6. Context switches (user-space)

    [skipped; cover next time]

7. Cooperative multithreading

    --This is also called *non-preemptive multithreading*.
    --It means that a context switch takes place only at well-defined
      points: when the thread calls yield() and when the thread would
      block on I/O.

8. Preemptive user-level multithreading

    How can we build a user-level threading package that does context
    switches at any time? We need to arrange for the package to get
    interrupted. How? Signals! Deliver a periodic timer interrupt or
    signal to a thread scheduler [setitimer()]. When the scheduler gets
    its interrupt, it swaps out the current thread and runs another one.

    This makes programming with user-level threads more complex: it has
    all the complexity of programming with kernel-level threads, but few
    of the advantages (except perhaps performance, from fewer system
    calls). In practice, systems aren't usually built this way, but
    sometimes it is what you want (for example, if you're simulating some
    OS-like thing inside a process, and you want to simulate the
    non-determinism that arises from hardware timer interrupts).

    A larger point: signals are instructive, and are used for many
    things.
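    As a concrete instance, the setitimer() idea can be sketched as
    below. This is a minimal sketch, not a real threading package: the
    handler just counts ticks rather than calling into a scheduler (a
    real package would swap thread contexts there, e.g. with
    swapcontext()), and the names are ours:

    ```c
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t ticks = 0;

    /* In a real threading package this handler would invoke the
       scheduler to preempt the running thread; here it only counts, to
       show that the "thread" below gets interrupted. */
    static void on_tick(int sig) {
        (void)sig;
        ticks++;
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_tick;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGALRM, &sa, NULL);

        /* Ask the kernel for a SIGALRM every 10 ms: the "timer
           interrupt" of the user-level scheduler. */
        struct itimerval it;
        memset(&it, 0, sizeof it);
        it.it_interval.tv_usec = 10000;
        it.it_value.tv_usec = 10000;
        setitimer(ITIMER_REAL, &it, NULL);

        while (ticks < 3)
            ;   /* the running "thread", preempted by each SIGALRM */

        printf("preempted at least %d times\n", 3);
        return 0;
    }
    ```

    Each SIGALRM here plays the role, for the user-level scheduler, that
    a hardware timer interrupt plays for the kernel's scheduler.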
What a signal is really doing is abstracting a key hardware feature: interrupts. So this is another example of the fact that the OS's job is to give a user-space process the illusion that it's running on something like a machine, by creating abstractions. In this example, the abstraction is the signal, and the thing that it's abstracting is an interrupt.
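Appendix: a concrete illustration of the non-blocking read() discussed in
section 4. This is a minimal POSIX sketch using a pipe, since read() on a
pipe (unlike on a regular file) can legitimately be made non-blocking
with O_NONBLOCK:

    ```c
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0)
            return 1;

        /* Switch the read end to non-blocking mode. */
        int flags = fcntl(fds[0], F_GETFL, 0);
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

        char buf[16];
        /* No data yet: a blocking read() would stop here; a
           non-blocking one returns -1 immediately with errno set to
           EAGAIN ("would block"). */
        ssize_t n = read(fds[0], buf, sizeof buf);
        printf("empty pipe: n=%d, would-block=%d\n", (int)n,
               n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));

        (void)write(fds[1], "hi", 2);   /* now there is data */
        n = read(fds[0], buf, sizeof buf);
        printf("after write: n=%d\n", (int)n);
        return 0;
    }
    ```

    The process learns that data wasn't ready from the EAGAIN return and
    must come back later: exactly the "query again or get a signal"
    discipline described in section 4.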