Class 16
CS 202
29 March 2021

On the board
------------
1. Last time
2. WeensyOS
3. I/O architecture
4. CPU/device interaction
    --Mechanics
        (a) explicit I/O
        (b) memory-mapped I/O
        (c) interrupts
        (d) through memory
    --Polling vs. interrupts
    --DMA vs. programmed I/O
5. Software architecture: device drivers

---------------------------------------------------------------------------

1. Last time

--began I/O
--then midterm
--today: WeensyOS I/O: kernel/device (later classes: user/kernel)

2. WeensyOS

[draw picture of the software stack: two instances of virtualization]

advice: start now!!!

processes: files with p-*
kernel code: files with k-*

processes just allocate memory. system call: sys_page_alloc();
analogous to brk() or mmap() in POSIX systems.

look at process.h for where the system call happens.
see exception_return() for where the return back into user space
happens; %rax holds the application's return value.

- figures (the animated gifs) are from the 32-bit version of the lab,
  so you'll see some differences.

- you'll use the virtual_memory_map() function. pay attention to the
  "allocator" argument (and make sure your allocator initializes the
  new page table).

- how many page tables are allocated for 3MB? what's the structure?

    - 3MB virtual address space, but the L4 page table that handles
      [2MB, 3MB) is allocated only on demand.

    - thus, make sure when calling virtual_memory_map that you're
      passing in a non-NULL allocator when you're supposed to.

- process control block (PCB): this is the "struct proc" in kernel.h

- recall:
    register %rax is the system call return value
    register %rdi contains the system call argument

- remember: bugs in earlier parts may show up only later.

- pageinfo array:

    typedef struct physical_pageinfo {
        int8_t owner;
        int8_t refcount;
    } physical_pageinfo;

    static physical_pageinfo pageinfo[PAGENUMBER(MEMSIZE_PHYSICAL)];

  one physical_pageinfo struct per _physical_ page.
- x86_64_pagetable: an array of 512 entries (each 8 bytes)

----

[what's below here are detailed notes from a prior recitation on lab4.]

- Kernel virtual addresses

    The kernel is set up to use an identity mapping:
        [0, MEMSIZE_PHYSICAL) -> [0, MEMSIZE_PHYSICAL)

- Physical pages' metadata is recorded in the physical_pageinfo array,
  whose elements contain a refcount and an owner; the owner can be
  kernel, reserved, free, or a pid.

- Process control block:
    * process registers, process state
    * process page table: a pointer (a kernel virtual address, which is
      identical to the physical address) to an L1 page table. The L1
      page table's entries point to L2 page tables, and so on...

Our job mainly consists of manipulating the page tables and the
pageinfo array.

- High-level evolution of the lab:

  We have five programs: the kernel + 4 processes.

  Ex1. All processes share the same page table.
       virtual addresses are all PTE_U | PTE_P | PTE_W
       Job: mark some of the addresses as (PTE_P | PTE_W), i.e., not
       user-accessible.

  Ex2. Each process uses its own page table. (The kernel already has a
       page table.) The job is to allocate and populate a page table
       for each process. This breaks down into helper functions:
         1. allocate a new page for process pid, and zero it
            (important) [use memset to zero it out]
         2. populate the new page table. you can memcpy the kernel's,
            but it's easier to copy the kernel's mappings individually.
       To match the screenshot, after copying, we have to mark
       [prog_addr_start, virtual_addr_size) as not-present.

  Ex3. Physical page allocation

       Motivation: before this exercise, during sys_page_alloc, when a
       process asks for a specific virtual page, the identity mapping
       is used to find the physical page. But that is too restrictive:
       with virtual memory, the process does not really care which
       physical page it gets.

       If we have implemented helper function 1 from Ex2 (allocate a
       free page), then we are mostly good to go and can just use that
       function. We also need to connect virtual to physical by setting
       the corresponding page table entry. Use virtual_memory_map.

  Ex4. Overlapping virtual addresses

       Motivation: every process has its own page table & accessible
       virtual addresses (the PTE_P portions), so we don't need to
       restrict processes to different parts of the virtual address
       space. Their virtual addresses can overlap, as long as the
       physical pages backing them do not.

       Easy to do: in process_setup, we use (MEMSIZE_VIRTUAL - PAGESIZE)
       instead of the old arithmetic to compute the process's stack
       page.

  Ex5. Fork

       High-level goal: produce a mostly identical process (minus the
       register %rax).

       What does it mean to be an identical process?
         1. same binary
         2. same process registers
         3. AND same memory state / contents

       3 basically covers 1, because the binary is loaded into memory
       too. 2 is easy to achieve (copy the registers; doable with a
       single line of C code). The goal here is mainly to achieve 3.

       Fork creates a copy: the memory state has to be a copy!

       Question: what does it mean to make a copy of memory?
         - Virtual pages are backed by physical pages, so we allocate
           new physical pages and copy the contents over (memcpy).
         - Then connect virtual to physical by setting the page table.

       The address space is potentially 256 TB large; do we copy
       256 TB? How do we know which parts to copy?
         - Iterate over the virtual address space; find pages that are
           (PTE_P | PTE_U | PTE_W).

       Given a page table entry, how do you check if it is user
       RW-able? Fill in the blanks:
            pte_val _ (PTE_P | PTE_W | PTE_U) == ___
       How do you find its corresponding physical page? PTE_ADDR

- Useful functions to implement for said manipulations:
    * find a PO_FREE physical page and assign it to a process
      (useful for Ex2, 3, 4, 5)
    * allocate an empty page dir + page table for a process (Ex2, 4)
    * make a copy of an existing page table and assign it to a process
      (Ex2, 5)
    * implement your own helper functions as you see fit

  Tip: Zero the allocated page before using it!!
  (memset is the tool for that.)

- Some useful functions/macros:
    PTE_ADDR : page table entry -> physical address
    PAGENUMBER : physical address -> corresponding index into the
                 pageinfo array
    PAGEADDR : PAGENUMBER^{-1}
    virtual_memory_lookup(pagetable, va)

3. I/O architecture

reminder about the picture:

[draw picture: CPU, Mem, I/O, connected to BUS]

4. CPU/device interaction

(can think of this as kernel/device interaction, since user-level
processes classically do not interact with devices directly.)

A. Mechanics of communication

(a) explicit I/O instructions

    outb, inb, outw, inw

    examples:

    (i) WeensyOS boot.c. see handout.
        focus on boot_readsect(), boot_waitdisk().
        compare to Figures 36.5 and 36.6 in the book.
        the code on the handout is the bootloader, which reads the
        WeensyOS kernel from disk into memory.

    (ii) reading keyboard input. see handout: keyboard_readc()

    (iii) setting the blinking cursor. see handout:
        console_show_cursor()

(b) memory-mapped I/O

    the physical address space is mostly ordinary RAM, but some
    low-memory addresses (650K-1MB) actually refer to other things.
    You as a programmer read/write these addresses using loads and
    stores. But they aren't "real" loads and stores to memory: they
    turn into other things, such as reading device registers, sending
    instructions, or reading/writing device memory.

    --interface is the same as the interface to memory (load/store)
    --but does not behave like memory:
        + reads and writes can have "side effects"
        + read results can change due to external events

    Example: writing to VGA or CGA memory makes things appear on the
    screen. See handout (last panel): console_putc() (this is called
    by console_printf().)

    Some notes about memory-mapped I/O:

    (i) avoid confusion: this is not the same thing as virtual memory.
        this is talking about the *physical* address.

        --> is this an abstraction that the OS provides to others, or
            an abstraction that the hardware provides to the OS?
            [the latter]

    (ii) aside: reset or power-on jumps to ROM at 0xffff0
         [not covered in class]
         --so what is the first instruction going to have to do?
           answer: probably jump

(c) interrupts

(d) through memory: both the CPU and the device see the same memory,
    so they can use shared memory to communicate.

    --> usually, synchronization between CPU and device requires
        lock-free techniques, plus device-specific contracts ("I will
        not overwrite memory until you set a bit in one of my
        registers telling me to do so.")
    --> as usual, need to read the manual

B. Polling vs. interrupts (vs. busy waiting)

So far, in our examples, the CPU has been busy waiting. This is fine
for these examples, but higher-bandwidth devices (disks, network
cards, etc.) need different techniques.

Polling: check back periodically. the kernel...
    - ...sent a packet? periodically ask the card whether the buffer
      is free.
    - ...is waiting for a packet? periodically ask whether there is
      data.
    - ...did disk I/O? periodically ask whether the disk is done.

    Disadvantages: wasted CPU cycles (if the device is not busy) and
    higher latency.

Interrupts: the device interrupts the CPU when its status changes
(for example, data is ready, or data is fully written).

    (The interrupt controller itself is initialized with I/O
    instructions; if you're curious, see the function interrupt_init()
    in WeensyOS's k-hardware.c.)

    This is what most general-purpose OSes do.

    There is a disadvantage, however, which could come up if you need
    to build a high-performance system. Namely: if the interrupt rate
    is high, then the computer can spend a lot of time handling
    interrupts (interrupts are expensive because they generate a
    context switch, and the interrupt handler runs at high priority).

    --> in the worst case, you can get *receive livelock*, where you
        spend 100% of the time in an interrupt handler but no work
        gets done.

This tradeoff comes up everywhere....

ANALOGY, courtesy of past TA Parth Upadhyay:

    "Interrupts vs polling in the context of phone notifications.
    There's 2 ways for you to figure out whether you have more
    emails/tweets. 1 is to, every time your phone buzzes, stop
    everything and look at it. (You don't want to miss that tweet.)
    The second way is for you to, periodically, check your email. The
    trade-offs are pretty similar! You get that snapchat 5 minutes
    later than you could have, but you don't pay the costs of so many
    context switches."

How to design systems given these tradeoffs? Start with interrupts.
If you notice that your system is slowing down because of livelock,
then switch to polling. If polling is chewing up too many cycles,
then move toward adaptively switching between interrupts and polling.
(But of course, never optimize until you actually know what the
problem is.)

A classic reference on this subject is the paper "Eliminating Receive
Livelock in an Interrupt-driven Kernel", by Mogul and Ramakrishnan,
1996.

We have just seen three approaches to synchronizing with hardware:
    busy waiting
    polling
    interrupts

QUESTION: where have we seen a conceptually similar tradeoff to
interrupts vs. the other two?

ANSWER: spinlocks vs. mutexes. (The analogy isn't perfect, because
mutex and cv calls are *blocking*, whereas the kernel never truly
blocks; see the discussion of sync-vs-async in a future class.)

C. DMA vs. programmed I/O

Programmed I/O: (mostly) what we have been seeing in the handout so
far: the CPU writes data directly to the device, and reads data
directly from the device.

DMA: a better way for large and frequent transfers (related to part
(d) above).

    The CPU (really, the device driver programmer) places some
    buffers in main memory, tells the device where the buffers are,
    and then "pokes" the device by writing to a register. The device
    then uses *DMA* (direct memory access) to read or write the
    buffers. The CPU can poll to see if the DMA completed (or the
    device can interrupt the CPU when done).

    [rough picture: buffer descriptor list --> [ buf ] --> [ buf ] .... ]

This makes a lot of sense.
Instead of having the CPU constantly dealing with a small amount of
data at a time, the device can simply write the contents of its
operation straight into memory.

NOTE: the book couples DMA to interrupts, but things don't have to
work like that. You could have all four possibilities in
{DMA, programmed I/O} x {polling, interrupts}. For example,
(DMA, polling) would mean requesting a DMA and then later polling to
see if the DMA is complete.

5. Software architecture: device drivers

The examples on the handout are simple device drivers.

Device drivers in general solve a software engineering problem ...

[draw a picture]

They expose a well-defined interface to the kernel, so that the
kernel can make comparatively simple read/write calls or whatever.
For example: reset, ioctl, output, read, write, handle_interrupt().
This abstracts away nasty hardware details so that the kernel
doesn't have to understand them.

When you write a driver, you are implementing this interface, and
also calling functions that the kernel itself exposes ...

... but device drivers also *create* software engineering problems.

Fundamental issues:
    - Each device driver is per-OS and per-device (often can't reuse
      the "hard parts").
    - They are often written by the device manufacturer (the core
      competence of device manufacturers is hardware development, not
      software development).
    - Under conventional kernel architectures, bugs in device
      drivers -- and there are many, many of them -- bring down the
      entire machine.

So we have to worry about potentially sketchy drivers ...

... but we also have to worry about potentially sketchy devices.

    - a buggy network card can scribble all over memory (solution:
      use an IOMMU; advanced topic)
    - plug in your USB stick: it claims to be a keyboard and starts
      issuing commands. (An IOMMU doesn't help you with this one.)
    - plug in a USB stick: if it's carrying a virus (aka malware),
      your computer can now be infected. (Iranian nuclear reactors
      are thought to have been attacked this way.
      Unfortunately for us, the same attacks could work against our
      power plants, etc.) [this is the Stuxnet example]

[Acknowledgments: David Mazieres, Mike Dahlin, Brad Karp]