Class 26: Putting it all (or a lot of it) together
CS 202
04 December 2024

On the board
------------

A. Last time

B. What is today's class about?

C. Binaries and loading, assuming static linking

D. Power up to terminal
    - Step 1: Power up
    - Step 2: Firmware
    - Step 3: OS Bootloader
    - Step 4: Kernel
    - Step 5: init(8)
    - Step 6: login(1)

E. Remarks and observations

---------------------------------------------------------------------------

- Announcement: final exam logistics

- Review is Monday; we'll restate all of the class topics since the
  midterm, and the rest of the time will be given over to Q&A. Please
  bring questions!

A. Last time

    - Unix security model

B. This class:

    - A walk through what happens between powering on a computer and
      arriving at a terminal.

    - Some remarks on the architecture of operating systems.

    - This is intended to be a "how stuff works" class, while also drawing
      connections to many of the subsystems and mechanisms that we've
      studied. fsck(), for example, makes an appearance.

C. Binaries and loading, assuming static linking

    - All of the steps that follow require loading and executing programs.
      We start by presenting a high-level overview of this process.

    - The question we want to answer: What happens when a program executes
      the following code?

      ```
      // Both arrays must be NULL-terminated.
      char *argv[] = { "hello", NULL };  // argv[0] is the program name
      char *envp[] = { NULL };           // empty environment

      if (fork() == 0) {
          // Executed in the child process.
          execve("hello", argv, envp);
      }
      ```

      [Draw picture of process memory space, getting obliterated by
      munmap().]

      How does the process's memory space get filled out, and how does
      code start executing? To answer that, we need to talk about the
      format of an executable file.

    - Programs (e.g., 'hello' in the example above) are stored as
      executable files in the file system. The format of executable files
      is dictated by the operating system on which they are meant to be
      executed.

        + Linux executables are stored in the Executable and Linkable
          Format (ELF).
        + Windows executables are stored in the Portable Executable (PE)
          format.

    - Regardless of format, executables contain the following information:

      [Draw picture of an executable to the right of the picture of
      process memory space. Then show how the loader uses the header of
      the executable file to call mmap() to create the new address space.]

        + Magic number: a known byte sequence at the beginning of the file
          that identifies the file as an executable.

            * ELF's magic number is 0x7f454c46 (0x7f followed by the
              string 'ELF').

            * PE's magic number is 0x4d5a (the characters 'MZ').

        + The platform (operating system and machine architecture) on
          which the executable can be run.

            * For example, the ELF format is used by Linux, Solaris, IRIX,
              all of the BSDs, MINIX, QNX, etc.; and these operating
              systems run on processors including AMD64, x86, ARM of
              various flavors, MIPS, PowerPC, SPARC, etc.

            * Similarly, the PE format is used by all Microsoft OSes and
              by EFI, and these run on processors including AMD64, x86,
              Alpha, ARM of various flavors, etc.

          While the format is reused across many platforms, most
          **executables** can only be executed on one platform.

        + The **memory layout** a process running the executable should
          have. Memory layout information is specified as an array of
          sections. Each section is a struct specifying:

            * The virtual address at which the section should be placed.
              The compiler **assumes** that it knows the virtual address
              for instructions and variables when compiling. The compiler
              does not, however, assume anything about the address of the
              stack, which is accessed using %RSP as we have seen in
              previous classes.

            * The length of the section.

            * Whether the contents of the memory should be read from the
              file (e.g., when the entry corresponds to the `.text` or
              `.data` sections) or left uninitialized.

            * The offset in the file where the contents of the section
              are stored. This offset information is ignored for sections
              which are not read from the file.
            * The memory protection bits that should be set for the
              section.

                + The `.text` section is marked readable and executable.

                + The `.data` section is marked readable and writable.

          The memory layout information **does not** include information
          about the stack.

        + The program's entry point: the virtual address of the first
          function that should be executed when the program is run, that
          is, the virtual address of the `_start` function.

        + Other information that we omit, e.g., symbols, strings, etc.

    - Loading and executing an executable, or what happens when
      `execve("hello", argv, envp)` is executed:

        + The loader checks that the file is readable and can be executed:

            * Check file permissions.

            * Check the magic number to ensure file is correctly
              formatted.

            * Check file header to ensure that executable can run on the
              current platform.

        + The loader calls munmap to unmap the process's memory.

        + The loader reads through the executable's memory layout array
          and uses mmap to allocate and set up the process's memory
          layout:

            * Use the executable's file descriptor when allocating
              sections that need to be read from the executable file.

            * Use MAP_ANONYMOUS for any sections that do not need to be
              read from the executable.

            * Set protection bits based on information in the section.

        + The loader uses mmap to allocate a stack, and sets RSP and RBP
          so they point to the top of the allocated stack. The loader
          then copies argv to the stack.

        + The loader then sets %RDI to argc and %RSI to point to argv,
          and jumps to the entrypoint specified in the executable.

          [detail: loader is running in user-space. In general, one
          cannot movq to %rip, so one has to jump to the entrypoint, at
          least in instructional OSes.
          In reality, there is a stub that executes a call to entrypoint;
          when the call returns it calls the exit() syscall]

    - The loading process above only works for *statically linked*
      executables, i.e., executables compiled with the `-static` flag:

          > gcc -static -o hello hello.c

      The vast majority of executables are *not* statically linked, and
      the process above needs to be expanded to handle dynamic linking
      and relocations. The process similarly needs to be changed in order
      to handle ASLR. We omit information about linking and the handling
      of both of these cases in these notes.

D. Power up to terminal

* Step 1: Power up

    - Processor and all peripherals turn on, and initialize their
      internal state.

        + Zero out any registers.

        + Set control registers to default values. [For Intel, defaults
          are recorded in the Software Development Manual.]

        + At this point the processor is in **real mode**. In this mode:

            * Paging is not enabled. All addresses are physical
              addresses, and no memory protection is used.

            * Programs can access up to 1 megabyte of physical memory.

    - The processor copies an executable from a ROM (read-only memory)
      chip connected to the processor into RAM, and jumps (using the jmp
      instruction) to a known offset of this copied binary (for early
      processors this was 0xFFFF0, but newer processors have somewhat
      more complex logic for this).

        + This initial binary is referred to as the *firmware*.

        + In all modern computers firmware is stored on EEPROMs or Flash,
          in order to allow upgrades.

        + The firmware itself has access to stable storage, usually
          provided by a battery backed CMOS chip. Firmware settings
          (e.g., where to boot from, etc.) are stored in this CMOS chip.

* Step 2: Firmware

    - Firmware is responsible for hardware initialization and providing
      a runtime for the kernel during early boot.

    - On recent Intel machines, firmware can be broadly classified as
      either BIOS or UEFI firmwares. Both are specifications, and many
      different implementations exist for both.
      Here we will focus on UEFI for convenience.

          BIOS = basic input output system
          UEFI = unified extensible firmware interface

        * Many (but not all) ARM64 machines also make use of firmware
          conforming to the UEFI spec.

      [Historical and pop culture note: Historically, each computer
      manufacturer would produce their own firmware which was used to
      ensure that computers from no other manufacturers (even computers
      using the same processor) could run software designed for the
      manufacturer. Patents, copyrights and other legal tools were used
      to prevent anyone from copying firmware to get around this
      limitation.

      In the 1980s IBM released the IBM PC without patenting its firmware
      (we don't know exactly why they didn't). This led to several
      competitors, starting with Columbia Data Products, reverse
      engineering IBM's firmware, leading to the creation of IBM PC
      compatible machines. Today's BIOS descends from this original
      effort, and this reverse engineering is what led to the creation
      and growth of companies such as Dell.

      Season 1 of AMC's Halt and Catch Fire (also available on Netflix)
      presents a fictional account of this effort.]

    - UEFI initialization steps: UEFI firmware runs the following
      initialization steps:

        * Switch processor from **real mode** to **long mode**. Long mode
          enables paging, 64-bit addressing, etc. and has been the mode
          we have considered throughout this class. As a part of this
          switch the firmware:

            + Creates and switches to an identity mapped page table
              (similar to WeensyOS's initial page table in Lab 4). Many
              UEFI implementations map all available physical memory.

            + Creates and installs an initial Interrupt Descriptor Table
              (IDT). The IDT specifies what code should run when
              interrupts are raised, syscalls are invoked, etc. UEFI
              firmwares implement many functions including functions that
              can fetch data from the network, wait on a timer, etc., and
              the interrupt handlers installed by the firmware are
              functional (and not stubs).
            + Initializes other processor structures [TSS, GDT, LDT, etc.]

            + Set Control Registers

        * Initialize devices including

            + All disk drives
            + The USB hub
            + Display
            + Keyboard and mouse
            + Network cards

          [Aside: In some cases initializing the devices listed above
          might require loading a driver (stored with UEFI firmware) or
          loading and executing a program stored on the device. This is
          especially common when initializing high-performance network
          cards and some GPUs.]

        * Constructs a device tree referred to as the
          `CONFIGURATION_TABLE` [described below].

        * Mount a VFAT partition from the disk (or RAM in case of network
          boot) containing the operating system to be booted. Users
          configure the device from which the operating system will be
          loaded, and this configuration is stored in the CMOS. It is
          common to provide the firmware with a list of devices (e.g.,
          USB, disk, network) in which case the UEFI firmware looks for
          the first device whose partition table contains a mountable
          UEFI VFAT partition. The VFAT partition is mounted at /EFI.

        * Loads and runs (using the same steps listed above) the OS boot
          loader. The OS boot loader is an executable stored in the VFAT
          partition.

            + When booting from a hard disk, the path to the OS
              bootloader is a firmware option stored in CMOS.

            + When booting from removable disks (e.g., USB sticks) the
              executable must be at `/EFI/boot/bootx64.efi`.

          The OS bootloader entry point has signature:

              `void EntryPoint(EFI_HANDLE handle, EFI_SYSTEM_TABLE *table)`

            + `handle` is an integer that the bootloader needs to provide
              as input when calling into the EFI firmware.

            + `table` is a pointer to a set of function pointers through
              which the bootloader can call UEFI services (described
              below), and a pointer to the device tree (also described
              below).

          [Note: EFI uses a slightly different calling convention from
          what we have seen so far: arguments are passed in %RCX, %RDX,
          %R8 and %R9.]
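      [Sketch: to make the entry-point contract concrete, here is a
      minimal, host-runnable imitation of it. Everything prefixed with
      `sketch_` is a hypothetical stand-in invented for illustration;
      these are NOT the real EFI_HANDLE or EFI_SYSTEM_TABLE definitions,
      and the "firmware" here is an ordinary C function. The one real
      mechanism used is the GCC/Clang `ms_abi` attribute, which on x86-64
      selects the %RCX, %RDX, %R8, %R9 convention mentioned in the note
      above; on other platforms the macro expands to nothing.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* On x86-64 with GCC or Clang, request the Microsoft calling convention
 * (arguments in %rcx, %rdx, %r8, %r9). Elsewhere, fall back to the default
 * convention so that the sketch still compiles. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_EFIAPI __attribute__((ms_abi))
#else
#define SKETCH_EFIAPI
#endif

/* Hypothetical stand-in for EFI_HANDLE: an opaque token owned by firmware. */
typedef void *sketch_handle;

/* Hypothetical stand-in for EFI_SYSTEM_TABLE: function pointers through
 * which the bootloader reaches firmware services, plus a pointer to the
 * device tree. */
struct sketch_system_table {
    int (SKETCH_EFIAPI *print)(const char *msg);
    const void *configuration_table;
};

/* Counts how many times the print service was invoked. */
static int sketch_lines_printed = 0;

/* Firmware side: the implementation behind the `print` service. */
static int SKETCH_EFIAPI sketch_firmware_print(const char *msg)
{
    sketch_lines_printed++;
    return printf("[firmware console] %s\n", msg);
}

/* Bootloader side: same shape as EntryPoint(handle, table) in the notes. */
static void SKETCH_EFIAPI sketch_entry_point(sketch_handle handle,
                                             struct sketch_system_table *table)
{
    assert(handle != NULL && table != NULL);
    /* The bootloader never calls firmware code by name; it always goes
     * through the table it was handed. */
    table->print("bootloader: running");
    if (table->configuration_table != NULL)
        table->print("bootloader: device tree available");
}

/* Firmware side: build the table, then transfer control to the bootloader.
 * Returns the number of lines the bootloader printed via the service. */
int sketch_firmware_boot(void)
{
    static const int fake_device_tree = 0; /* placeholder contents */
    int image = 0;                         /* its address acts as the handle */
    struct sketch_system_table table = { sketch_firmware_print,
                                         &fake_device_tree };

    sketch_lines_printed = 0;
    sketch_entry_point(&image, &table);
    return sketch_lines_printed;
}
```

      Calling `sketch_firmware_boot()` from a test program exercises both
      sides: the "firmware" fills in the table and the "bootloader"
      reaches the console only through it. The compiler handles the
      calling-convention switch at each annotated call.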
    - UEFI services

        * Like any operating system, UEFI offers the OS bootloader
          several APIs including ones for:

            + Sending and receiving network messages.
            + Reading and writing files from the VFAT partition.
            + Drawing and printing on the screen.
            + Getting user input.
            + Get environment variables and arguments passed to the
              bootloader. (arguments and environment variables are
              specified in CMOS.)
            + Etc.

        * The OS bootloader must explicitly (using a function call) tell
          the UEFI firmware when these services are no longer necessary.
          The firmware exits when this is done.

    - Device trees (CONFIGURATION_TABLE)

        * In class 16 we talked about how operating systems use device
          drivers to add support for new devices.

        * But how does the operating system know what devices are present
          on a machine, and what device drivers to load?

        * Answer to the previous question is a device tree, which is a
          structure containing:

            + A list of all devices attached to the machine.
            + Are they accessible through explicit I/O, memory-mapped
              I/O, or DMA?
            + For devices using explicit I/O: what is the device port
              number.
            + For devices using memory-mapped I/O: what is the physical
              memory address mapped to the device.
            + For devices using DMA: what is the physical memory address
              mapped to the device's control register.
            + Information about how interrupts from the device are
              routed.

        * Why a tree? Structure also encodes the topology of how devices
          are connected. Going to ignore this detail for now.

* Step 3: OS bootloader to Kernel

    - The UEFI firmware loads and executes the OS bootloader.

        + On recent Linux kernels the bootloader is the vmlinuz file with
          a stub for UEFI, we only consider this case here.

    - The vmlinuz file is a compressed executable containing the Linux
      kernel.

        + High level layout:

            (i)   PE Header + EFI stub
            (ii)  ELF header + stub code to decompress bzip2 or gzip
                  data.
            (iii) Compressed bzip2 or gzip data (itself containing an ELF
                  header and the actual kernel).
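      [Sketch: each layer in the layout above is a format that can be
      recognized by its leading bytes, which is the same magic-number
      idea from part C. Below is a minimal sketch of such a check. The
      names `image_kind` and `classify_image` are hypothetical, and the
      function only inspects the start of whatever buffer it is handed;
      it makes no claim about the offsets at which these layers sit
      inside a real vmlinuz file. The ELF and PE ('MZ') values are the
      ones given in part C; the gzip (0x1f 0x8b) and bzip2 ('BZh')
      signatures come from those formats' own definitions.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum image_kind {
    IMAGE_UNKNOWN,
    IMAGE_PE,     /* starts with 'M' 'Z' */
    IMAGE_ELF,    /* starts with 0x7f 'E' 'L' 'F' */
    IMAGE_GZIP,   /* starts with 0x1f 0x8b */
    IMAGE_BZIP2   /* starts with 'B' 'Z' 'h' */
};

/* Classify a buffer by its magic number. Buffers too short to hold a given
 * signature simply do not match it. */
enum image_kind classify_image(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* The literal is split in two so that 'E' is not parsed as part of
     * the \x7f hex escape. */
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return IMAGE_ELF;
    if (len >= 3 && memcmp(buf, "BZh", 3) == 0)
        return IMAGE_BZIP2;
    if (len >= 2 && memcmp(buf, "MZ", 2) == 0)
        return IMAGE_PE;
    if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
        return IMAGE_GZIP;
    return IMAGE_UNKNOWN;
}
```

      A loader-style caller would read the first few bytes of a file (or
      of a region it has just decompressed), pass them to
      `classify_image`, and refuse to proceed on `IMAGE_UNKNOWN`. This
      mirrors the "check the magic number" step of execve described in
      part C.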
    - vmlinuz assumes that it is given the disk ID for the Linux root
      directory ('/') as an argument (as 'root=').

    - When executed:

        (a) The EFI stub loads and executes the decompression stub (item
            (ii) in the layout above.)

        (b) The decompression stub uncompresses the compressed data (item
            (iii)) into memory. As stated above, the decompressed data is
            a statically linked ELF executable containing the Linux
            kernel.

        (c) The decompression stub loads and executes the Linux kernel,
            passing in a pointer to the CONFIGURATION_TABLE and other
            arguments as input.

        (d) The kernel at this point terminates the UEFI firmware; it
            doesn't need to terminate the decompression stub because that
            stub jumped into the kernel, so there is not some separate
            execution context.

* Step 4: Kernel

    - The kernel begins by initializing the system. This includes

        * Switching away from an identity-mapped virtual address space.

        * Rewriting the interrupt descriptor table.

        * Loading and initializing device drivers using information
          contained in the CONFIGURATION_TABLE.

            + All device drivers run as a part of the kernel.

            + Device drivers communicate with user mode programs through
              files and directories mounted in `/dev`.

        * Mounting the root device.

            + If required, fsck is run at this point in the boot process.

    - After initialization the kernel forks and launches the `init`
      process.

* Step 5: init

    - `init` or PID 1 is the first process executed by the system.

        * Many alternative init implementations exist on Linux. Currently
          most distributions use `systemd`.

        * We are assuming a simpler `init.rc`-like init script here.

    - Executed as root (superuser)

    - `init` serves three main purposes

        * Finish initializing the system.
        * Launch the user login manager.
        * Relaunch user login manager whenever users logout.

    - Finish initializing the system

        * The kernel initialized device drivers, but many devices require
          additional initialization:

            + For network cards, need to use DHCP or other means to set
              IP address.
            + For GPU, need to set initial display resolution.

            + Set power management settings.

        * init launches a set of individual programs that handle this
          initialization.

            + Programs communicate with device drivers through files and
              directories in `/dev`.

            + Early initialization programs are also run as root.

        * Many systems run daemons that provide services. Examples
          include sshd (allows sshing into a machine), httpd, etc. init
          is responsible for launching these programs.

            + Many of these programs bind to network ports below 1024.
              From class 24, this is only allowed for programs run as
              root.

            + Most `init` implementations have mechanisms through which
              the programs do not have to directly bind to these ports.
              See tcpserve in handout 15
              (https://cs.nyu.edu/~mwalfish/classes/24fa/lectures/handout15.pdf)
              for an example of how this is done. When such functionality
              is available, init (or the daemon) will drop privileges and
              run the daemon as an unprivileged user (not root).

            + init is also responsible for detecting failures in daemon
              processes, that is, detecting cases where they die, and
              then restarting them if appropriate. It relies on `wait(2)`
              for detecting failures.

    - Launch the user session manager

        * Once the system has finished initializing, users are presented
          with a login prompt, which they can use to **create a session**
          on the machine.

        * Login prompt is created by a session manager.

            + Linux distributions provide many graphical session managers
              including slim, gnome-session, etc.

            + We are going to only focus on text based sessions. The
              canonical text-based session manager (and the one we focus
              on) is `login(1)`.

        * For the walkthrough in this class, we assume that after
          initialization, init launches login(1).

* Step 6: login(1), also the target of Ken Thompson's attack

    - login(1) must run as root.

        * This is because of limitations on the behavior of seteuid(2),
          setegid(2), etc. that were presented in class 24.
          ```
          > login
          login: Cannot possibly work without effective root
          ```

    - login(1) presents a prompt for a username and password.

        * Values typed in by the user are checked against information in
          `/etc/passwd` and `/etc/shadow`.

    - If user credentials match, login(1) runs the following steps:

        * Forks a new process. The parent uses `waitpid(2)` to wait for
          child to die, and then exits. All steps that follow are
          executed in the child process.

        * Calls setgid(2) to set the effective group ID to the primary
          group for the user. This has to come before the setuid(2) call
          below: once the process is no longer root, it can no longer
          switch to an arbitrary group ID.

        * Calls setuid(2) to set the effective user ID to that of the
          logged in user.

          [Detail:
            * Calls setsid(2) to create a new session.]

        * Changes directory to the user's home directory (using
          chdir(2)). `/etc/passwd` stores the user's home directory, and
          login(1) relies on this information to change directories in
          this case.

        * Uses exec() to launch the user's login shell (e.g., bash).
          Login shell information for each user is also stored in
          `/etc/passwd`.

        * The user now has a shell they can use as normal.

    - Users log out by killing their shell process (which was created by
      the login process that init created). This results in the parent
      login process exiting (as noted in the previous point).

    - `init` observes this exit using the same wait based mechanism used
      for daemons, and starts a new login process.

E. Remarks and observations

    - Observe that booting a machine involves at least two operating
      systems

        * The firmware, for which we considered UEFI, acts as a simple
          operating system providing a few services. It does not support
          multiple processes, and has only limited functionality.

        * The kernel, which provides a richer set of functionality,
          including schedulers, etc.

      This principle can extend at least two more layers: booting a
      virtual machine involves running a virtual machine's firmware
      implementation on the host operating system, and then having the
      virtual machine's firmware load and execute the guest operating
      system.
      Booting a computer is, from one perspective, many repeated
      iterations of fork and exec.

    - Recall that drivers are part of the kernel in Linux. Kernels where
      device drivers are a part of the kernel are referred to as
      **MONOLITHIC KERNELS**.

      An alternate architecture, referred to as **MICROKERNELS**, is one
      where device drivers and many other portions of the kernel (e.g.,
      scheduling, etc.) are run as independent processes.

      The main argument for microkernels is that they are more modular,
      which makes it easier to change kernel behavior. However, this
      modularity often carries a performance penalty, which has limited
      their wider adoption.

      Several current operating system kernels, including the ones for
      Windows and Mac OS X, started out as microkernels that were then
      converted to monolithic kernels due to performance concerns.

      See the Tanenbaum-Torvalds debates:
      https://cs.nyu.edu/~mwalfish/classes/ut/s10-cs372h/ref/ast-torvalds.html
      for a classic monolithic-versus-microkernels debate.

----

[credit: Aurojit Panda for this content]