Class 26: Putting it all (or a lot of it) together
CS 202
5 May 2020

On the board
------------
A. Last time
B. What is today's class about?
C. Binaries and loading, assuming static linking
D. Power up to terminal
    - Step 1: Power up
    - Step 2: Firmware
    - Step 3: OS Bootloader
    - Step 4: Kernel
    - Step 5: init(8)
    - Step 6: login(1)
E. Remarks and observations

---------------------------------------------------------------------------

- Announcement: final exam on Mon., May 18, 2020, 2:00PM - 3:50PM

- Review is this Thursday; we'll restate all of the class topics since the
  midterm, and the rest of the time will be given over to Q&A. Please
  bring questions!

A. Last time

    - Reflections on Trusting Trust

B. This class:

    - A walk through what happens between powering on a computer and
      arriving at a terminal.

    - Some remarks on the architecture of operating systems.

    - This is intended to be a "how stuff works" class, while also
      drawing connections to many of the subsystems and mechanisms that
      we've studied. fsck, for example, makes an appearance.

C. Binaries and loading, assuming static linking

    - All of the steps that follow require loading and executing
      programs, so we start with a high-level overview of this process.

    - The question we want to answer: what happens when a program
      executes the following code?

        ```
        char *argv[] = { "hello", NULL };  /* argument vector; argv[0] is the program name */
        char *envp[] = { NULL };           /* (empty) environment */

        if (fork() == 0) {
            // Executed in the child process.
            execve("hello", argv, envp);
        }
        ```

      [Draw picture of process memory space, getting obliterated by
      munmap().]

      How does the process's memory space get filled out? How does code
      start executing? To answer that, we need to talk about the format
      of an executable file.

    - Programs (e.g., 'hello' in the example above) are stored as
      executable files in the file system. The format of executable
      files is dictated by the operating system on which they are meant
      to be executed.

        + Linux executables are stored in the Executable and Linkable
          Format (ELF).
        + Windows executables are stored in the Portable Executable (PE)
          format.

    - Regardless of format, executables contain the following
      information:

      [Draw picture of an executable to the right of the picture of
      process memory space. Then show how the loader uses the header of
      the executable file to call mmap() to create the new address
      space.]

        + Magic number: a known byte sequence at the beginning of the
          file that identifies the file as an executable.

            * ELF's magic number is 0x7f454c46 (0x7f followed by the
              string 'ELF').

            * PE's magic number is 0x4d5a (the characters 'MZ').

        + The platform (operating system and machine architecture) on
          which the executable can be run.

            * For example, the ELF format is used by Linux, Solaris,
              IRIX, all of the BSDs, MINIX, QNX, etc.; and these
              operating systems run on processors including AMD64, x86,
              ARM of various flavors, MIPS, PowerPC, SPARC, etc.

            * Similarly, the PE format is used by all Microsoft OSes and
              by EFI; these run on processors including AMD64, x86,
              Alpha, ARM of various flavors, etc.

          While the format is reused across many platforms, most
          **executables** can only be executed on one platform.

        + The **memory layout** a process running the executable should
          have. Memory layout information is specified as an array of
          sections. Each section is a struct specifying:

            * The virtual address at which the section should be placed.
              The compiler **assumes** that it knows the virtual address
              of instructions and variables when compiling. The compiler
              does not, however, assume anything about the address of
              the stack, which is accessed using %RSP, as we have seen
              in previous classes.

            * The length of the section.

            * Whether the contents of the memory should be read from the
              file (e.g., when the entry corresponds to the `.text` or
              `.data` sections) or left uninitialized.

            * The offset in the file where the contents of the section
              are stored. This offset information is ignored for
              sections that are not read from the file.
            * The memory protection bits that should be set for the
              section.

                + The `.text` section is marked readable and executable.

                + The `.data` section is marked readable and writable.

          The memory layout information **does not** include information
          about the stack.

        + The program's entry point: the virtual memory address of the
          first function that should be executed when the program is
          run, that is, the virtual address of the `_start` function.

        + Other information that we omit, e.g., symbols, strings, etc.

    - Loading and executing an executable, or what happens when
      `execve("hello", argv, envp)` is executed:

        + The loader checks that the file is readable and can be
          executed:

            * Check file permissions.

            * Check the magic number to ensure the file is correctly
              formatted.

            * Check the file header to ensure that the executable can
              run on the current platform.

        + The loader calls munmap to unmap the process's memory.

        + The loader reads through the executable's memory layout array
          and uses mmap to allocate and set up the process's memory
          layout:

            * Use the executable's file descriptor when allocating
              sections that need to be read from the executable file.

            * Use MAP_ANONYMOUS for any sections that do not need to be
              read from the executable.

            * Set protection bits based on information in the section.

        + The loader uses mmap to allocate a stack, and sets %RSP and
          %RBP so they point to the top of the allocated stack. The
          loader then copies argv to the stack.

        + The loader then sets %RDI to argc and %RSI to point to argv,
          and jumps to the entry point specified in the executable.

          [detail: the loader is running in user space. In general, one
          cannot movq to %rip, so one has to jump to the entry point, at
          least in instructional OSes.
          In reality, there is a stub that executes a call to the entry
          point; when the call returns, it calls the exit() syscall.]

    - The loading process above only works for *statically linked*
      executables, i.e., executables compiled with the `-static` flag:

        > gcc -static -o hello hello.c

      The vast majority of executables are *not* statically linked, and
      the process above needs to be expanded to handle dynamic linking
      and relocations. The process similarly needs to be changed in
      order to handle ASLR. We *omit* the handling of both of these
      cases in these notes.

D. Power up to terminal

* Step 1: Power up

    - The processor and all peripherals turn on and initialize their
      internal state.

        + Zero out any registers.

        + Set control registers to default values. [For Intel, the
          defaults are recorded in the Software Developer's Manual.]

        + At this point the processor is in **real mode**. In this mode:

            * Paging is not enabled. All addresses are physical
              addresses, and no memory protection is used.

            * Programs can access up to 1 megabyte of physical memory.

    - The processor copies an executable from a ROM (read-only memory)
      chip connected to the processor into RAM, and jumps (using the jmp
      instruction) to a known offset of this copied binary (for early
      processors this was 0xFFFF0, but newer processors have somewhat
      more complex logic for this).

        + This initial binary is referred to as the *firmware*.

        + In all modern computers, firmware is stored on EEPROMs or
          Flash, in order to allow upgrades.

        + The firmware itself has access to stable storage, usually
          provided by a battery-backed CMOS chip. Firmware settings
          (e.g., where to boot from, etc.) are stored in this CMOS chip.

* Step 2: Firmware

    - Firmware is responsible for hardware initialization and for
      providing a runtime for the kernel during early boot.

    - On recent Intel machines, firmware can be broadly classified as
      either BIOS or UEFI. Both are specifications, and many different
      implementations exist for both.
      Here we will focus on UEFI for convenience.

        * Many (but not all) ARM64 machines also make use of firmware
          conforming to the UEFI spec.

      [Historical and pop culture note: historically, each computer
      manufacturer would produce its own firmware, which was used to
      ensure that computers from other manufacturers (even computers
      using the same processor) could not run software designed for the
      manufacturer's machines. Patents, copyrights, and other legal
      tools were used to prevent anyone from copying firmware to get
      around this limitation. In the 1980s, IBM released the IBM PC
      without patenting its firmware (we don't know exactly why they
      didn't). This led to several competitors, starting with Columbia
      Data Products, reverse engineering IBM's firmware, leading to the
      creation of IBM PC compatible machines. Today's BIOS descends from
      this original effort, and this reverse engineering is what led to
      the creation and growth of companies such as Dell. Season 1 of
      AMC's Halt and Catch Fire (also available on Netflix) presents a
      fictional account of this effort.]

    - UEFI firmware runs the following initialization steps:

        * Switch the processor from **real mode** to **long mode**.
          Long mode enables paging, 64-bit addressing, etc., and is the
          mode we have considered throughout this class. As a part of
          this switch, the firmware:

            + Creates and switches to an identity-mapped page table
              (similar to WeensyOS's initial page table in Lab 4). Many
              UEFI implementations map all available physical memory.

            + Creates and installs an initial Interrupt Descriptor Table
              (IDT). The IDT specifies what code should run when
              interrupts are raised, syscalls are invoked, etc. UEFI
              firmwares implement many functions, including functions
              that can fetch data from the network, wait on a timer,
              etc., and the interrupt handlers installed by the firmware
              are functional (not stubs).

            + Initializes other processor structures [TSS, GDT, LDT,
              etc.]
            + Sets control registers.

        * Initialize devices, including:

            + All disk drives.

            + The USB hub.

            + The display.

            + The keyboard and mouse.

            + Network cards.

          [Aside: in some cases, initializing the devices listed above
          might require loading a driver (stored with the UEFI firmware)
          or loading and executing a program stored on the device. This
          is especially common when initializing high-performance
          network cards and some GPUs.]

        * Construct a device tree, referred to as the
          `CONFIGURATION_TABLE` [described below].

        * Mount a VFAT partition from the disk (or from RAM, in the case
          of network boot) containing the operating system to be booted.

          Users configure the device from which the operating system
          will be loaded, and this configuration is stored in the CMOS.
          It is common to provide the firmware with a list of devices
          (e.g., USB, disk, network), in which case the UEFI firmware
          looks for the first device whose partition table contains a
          mountable UEFI VFAT partition. The VFAT partition is mounted
          at /EFI.

        * Load and run (using the same process listed above) the OS
          bootloader. The OS bootloader is an executable stored in the
          VFAT partition.

            + When booting from a hard disk, the path to the OS
              bootloader is a firmware option stored in CMOS.

            + When booting from removable disks (e.g., USB sticks), the
              executable must be at `/EFI/boot/bootx64.efi`.

          The OS bootloader's entry point has signature:

            `void EntryPoint(EFI_HANDLE handle, EFI_SYSTEM_TABLE *table)`

            + `handle` is an integer that the bootloader needs to
              provide as input when calling into the EFI firmware.

            + `table` is a pointer to a set of function pointers through
              which the bootloader can call UEFI services (described
              below), and a pointer to the device tree (also described
              below).

          [Note: EFI uses a slightly different calling convention from
          what we have seen so far: arguments are passed in %RCX, %RDX,
          %R8 and %R9.]
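          To make the shape of this interface concrete, here is a
          minimal sketch of a bootloader entry point in C. The types and
          the two service functions are simplified stand-ins: the real
          EFI_HANDLE and EFI_SYSTEM_TABLE come from the UEFI headers
          (e.g., gnu-efi's efi.h) and expose far more services; the fake
          "firmware" at the bottom exists only so the sketch can run as
          an ordinary user program.

            ```
            #include <stdio.h>

            /* Toy stand-ins for the real UEFI types; in a real bootloader
               these come from the UEFI headers and carry many more
               function pointers. */
            typedef void *EFI_HANDLE;

            typedef struct {
                void (*OutputString)(const char *s);     /* print to the console */
                void (*ExitBootServices)(EFI_HANDLE h);  /* "services no longer necessary" */
            } EFI_SYSTEM_TABLE;

            /* The bootloader's entry point, shaped like the signature above:
               the firmware passes an opaque handle plus a table of service
               pointers. */
            void EntryPoint(EFI_HANDLE handle, EFI_SYSTEM_TABLE *table) {
                table->OutputString("bootloader: loading kernel...\n");
                /* ... read and map the kernel using the firmware's file
                   services here ... */
                table->ExitBootServices(handle);  /* firmware tears itself down */
                /* ... jump to the kernel's entry point here ... */
            }

            /* A fake "firmware" so the sketch runs as an ordinary program. */
            int boot_services_exited = 0;

            void fake_output(const char *s) { fputs(s, stdout); }
            void fake_exit(EFI_HANDLE h)    { (void)h; boot_services_exited = 1; }

            int main(void) {
                EFI_SYSTEM_TABLE table = { fake_output, fake_exit };
                EntryPoint((EFI_HANDLE)1, &table);
                printf("boot services exited: %d\n", boot_services_exited);
                return 0;
            }
            ```

          Real UEFI applications are compiled as PE executables and use
          the %RCX/%RDX calling convention mentioned above; this sketch
          illustrates only the handle-plus-table structure of the
          interface.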
    - UEFI services

        * Like any operating system, UEFI offers the OS bootloader
          several APIs, including ones for:

            + Sending and receiving network messages.

            + Reading and writing files from the VFAT partition.

            + Drawing and printing on the screen.

            + Getting user input.

            + Getting environment variables and arguments passed to the
              bootloader. (Arguments and environment variables are
              specified in CMOS.)

            + Etc.

        * The OS bootloader must explicitly (using a function call) tell
          the UEFI firmware when these services are no longer necessary.
          The firmware exits when this is done.

    - Device trees (CONFIGURATION_TABLE)

        * In classes 14 and 15 we talked about how operating systems use
          device drivers to add support for new devices.

        * But how does the operating system know what devices are
          present on a machine, and what device drivers to load?

        * The answer is a device tree: a structure that lists every
          device attached to the machine and records, for each one:

            + Whether it is accessible through explicit I/O,
              memory-mapped I/O, or DMA.

            + For devices using explicit I/O: the device's port number.

            + For devices using memory-mapped I/O: the physical memory
              address mapped to the device.

            + For devices using DMA: the physical memory address mapped
              to the device's control register.

            + Information about how interrupts from the device are
              routed.

        * Why a tree? The structure also encodes the topology of how
          devices are connected. We are going to ignore this detail for
          now.

* Step 3: OS bootloader to kernel

    - The UEFI firmware loads and executes the OS bootloader.

        + On recent Linux kernels, the bootloader is the vmlinuz file
          with a stub for UEFI; we consider only this case here.

    - The vmlinuz file is a compressed executable containing the Linux
      kernel.

        + High-level layout:

            (i)   PE header + EFI stub
            (ii)  ELF header + stub code to decompress bzip2 or gzip
                  data
            (iii) Compressed bzip2 or gzip data (itself containing an
                  ELF header and the actual kernel)
    - vmlinuz assumes that it is given the disk ID for the Linux root
      directory ('/') as an argument (as 'root=').

    - When executed:

        (a) The EFI stub loads and executes the decompression stub (item
            (ii) in the layout above).

        (b) The decompression stub uncompresses the compressed data
            (item (iii)) into memory. As stated above, the decompressed
            data is a statically linked ELF executable containing the
            Linux kernel.

        (c) The decompression stub loads and executes the Linux kernel,
            passing in a pointer to the CONFIGURATION_TABLE and other
            arguments as input.

        (d) The kernel at this point terminates the UEFI firmware; it
            doesn't need to terminate the decompression stub, because
            that stub jumped into the kernel, so there is no separate
            execution context.

* Step 4: Kernel

    - The kernel begins by initializing the system. This includes:

        * Switching away from an identity-mapped virtual address space.

        * Rewriting the interrupt descriptor table.

        * Loading and initializing device drivers, using information
          contained in the CONFIGURATION_TABLE.

            + All device drivers run as a part of the kernel.

            + Device drivers communicate with user-mode programs through
              files and directories mounted in `/dev`.

        * Mounting the root device.

            + If required, fsck is run at this point in the boot
              process.

    - After initialization, the kernel forks and launches the `init`
      process.

* Step 5: init

    - `init`, or PID 1, is the first process executed by the system.

        * Many alternative init implementations exist on Linux.
          Currently most distributions use `systemd`.

        * We are assuming a simpler `init.rc`-like init script here.

    - Executed as root (superuser).

    - `init` serves three main purposes:

        * Finish initializing the system.

        * Launch the user login manager.

        * Relaunch the user login manager whenever users log out.

    - Finish initializing the system:

        * The kernel initialized device drivers, but many devices
          require additional initialization:

            + For network cards, need to use DHCP or other means to set
              the IP address.
            + For GPUs, need to set the initial display resolution.

            + Set power management settings.

        * init launches a set of individual programs that handle this
          initialization.

            + These programs communicate with device drivers through
              files and directories in `/dev`.

            + Early initialization programs are also run as root.

        * Many systems run daemons that provide services. Examples
          include sshd (allows sshing into a machine), httpd, etc. init
          is responsible for launching these programs.

            + Many of these programs bind to network ports between 1 and
              1024. From class 24, this is only allowed for programs run
              as root.

            + Most `init` implementations have mechanisms through which
              the programs do not have to directly bind to these ports.
              See tcpserve in handout 14
              [Panda: (https://cs.nyu.edu/courses/spring20/CSCI-UA.0202-001/lectures/handout14.pdf)
               Mike: (https://cs.nyu.edu/~mwalfish/classes/20sp/lectures/handout14.pdf)]
              for an example of how this is done. When such
              functionality is available, init (or the daemon) will drop
              privileges and run the daemon as an unprivileged user (not
              root).

            + init is also responsible for detecting failures in daemon
              processes, that is, detecting cases where they die, and
              then restarting them if appropriate. It relies on
              `wait(2)` to detect failures.

    - Launch the user session manager:

        * Once the system has finished initializing, users are presented
          with a login prompt at which they can **create a session** on
          the machine.

        * The login prompt is created by a session manager.

            + Linux distributions provide many graphical session
              managers, including slim, gnome-session, etc.

            + We are going to focus only on text-based sessions. The
              canonical text-based session manager (and the one we focus
              on) is `login(1)`.

        * For the walkthrough in this class, we assume that after
          initialization, init launches login(1).

* Step 6: login(1), also the target of Ken Thompson's attack

    - login(1) must run as root.

        * This is because of limitations on the behavior of seteuid(2),
          setegid(2), etc.,
          that were presented in class 24.

            ```
            > login
            login: Cannot possibly work without effective root
            ```

    - login(1) presents a prompt for a username and password.

        * The values typed in by the user are checked against
          information in `/etc/passwd` and `/etc/shadow`.

    - If the user's credentials match, login(1) runs the following
      steps:

        * Forks a new process. The parent uses `waitpid(2)` to wait for
          the child to die, and then exits. All steps that follow are
          executed in the child process.

        * Calls setgid(2) to set the effective group ID to the primary
          group for the user. (This must happen before setuid(2): once
          root privileges are dropped, the group can no longer be
          changed.)

        * Calls setuid(2) to set the effective user ID to that of the
          logged-in user.

          [Detail:
           * Calls setsid(2) to create a new session.]

        * Changes directory to the user's home directory (using
          chdir(2)). `/etc/passwd` stores the user's home directory, and
          login(1) relies on this information to change directories.

        * Uses exec() to launch the user's login shell (e.g., bash).
          Login shell information for each user is also stored in
          `/etc/passwd`.

        * The user now has a shell they can use as normal.

    - Users log out by killing their shell process (which was created by
      the login process that init created). This results in the parent
      login process exiting (as noted in the previous point).

    - `init` observes this exit using the same wait-based mechanism used
      for daemons, and starts a new login process.

E. Remarks and observations

    - Observe that booting a machine involves at least two operating
      systems:

        * The firmware, for which we considered UEFI, acts as a simple
          operating system providing a few services. It does not support
          multiple processes, and has only limited functionality.

        * The kernel, which provides a richer set of functionality,
          including schedulers, etc.

      This principle can extend at least two more layers: booting a
      virtual machine involves running a virtual machine firmware
      implementation on the host operating system, and then having the
      virtual machine's firmware load and execute the guest operating
      system.
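      The post-authentication sequence that login(1) runs (Step 6 above)
      can be sketched in C as follows. This is an assumption-laden
      simplification: real login(1) reads the uid, gid, home directory,
      and shell from /etc/passwd, creates a session with setsid(2), and
      does much more; here those values are passed in as placeholders,
      and /bin/true stands in for the user's shell in the demo.

        ```
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/wait.h>

        /* Sketch of login(1) after credentials have been verified.
           Returns 0 once the user's shell has exited (i.e., the user
           logged out). Must be started as root for arbitrary uid/gid. */
        int spawn_session(uid_t uid, gid_t gid, const char *home, const char *shell) {
            pid_t pid = fork();
            if (pid < 0)
                return -1;

            if (pid == 0) {
                /* Child: shed privileges, then become the user's shell.
                   setgid before setuid: after setuid, we would no longer
                   be allowed to change the group. */
                if (setgid(gid) < 0 || setuid(uid) < 0)
                    exit(1);
                if (chdir(home) < 0)           /* home dir from /etc/passwd */
                    exit(1);
                char *argv[] = { (char *)shell, NULL };
                char *envp[] = { NULL };
                execve(shell, argv, envp);     /* shell from /etc/passwd */
                exit(1);                       /* reached only if execve fails */
            }

            /* Parent: wait for the shell to die (the user logging out),
               then return; init will observe our subsequent exit via
               wait(2) and respawn login. */
            int status;
            waitpid(pid, &status, 0);
            return 0;
        }

        int main(void) {
            /* Demo with our own uid/gid and /bin/true standing in for a
               shell; setuid/setgid to our current IDs succeed even
               without root. */
            if (spawn_session(getuid(), getgid(), "/", "/bin/true") == 0)
                puts("session ended; init would now restart login");
            return 0;
        }
        ```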
      Booting a computer is, from one perspective, many repeated
      iterations of fork and exec.

    - Recall that drivers are part of the kernel in Linux. Kernels where
      device drivers are a part of the kernel are referred to as
      **MONOLITHIC KERNELS**. An alternative architecture, referred to
      as a **MICROKERNEL**, is one where device drivers and many other
      portions of the kernel (e.g., scheduling, etc.) run as independent
      processes. The main argument for microkernels is that they are
      more modular, which makes it easier to change kernel behavior.
      However, this modularity often carries a performance penalty,
      which has limited microkernels' wider adoption. Several current
      operating system kernels, including the ones for Windows and Mac
      OS X, started out as microkernels that were then converted to
      monolithic kernels due to performance concerns.

      See the Tanenbaum-Torvalds debate:
      https://cs.nyu.edu/~mwalfish/classes/ut/s10-cs372h/ref/ast-torvalds.html
      for a classic monolithic-versus-microkernel argument.

----

[credit: Aurojit Panda for this content]