Class 11
CS372H 23 February 2012

On the board
------------
1. Last time

2. Implementation of swtch()

3. Linking and loading
   --introduction
   --processes in memory
   --assembling
   --overview of linking
   --details of linking

----------------------------------------------------------------------------

1. Last time

  --alternatives to locking/concurrency
  --advice
  --therac-25 and software safety

2. Implementation of swtch()

A. Recall the thread interface:
   --thread_create, thread_join, thread_exit
   --thread_create allocates a stack
   --and a TCB (thread control block; analogy with PCB)

B. One way to understand a given implementation of threads is by
   answering three questions:

      * where is the TCB stored?
      * what does swtch() look like, and who implements it?
      * what is the level of true concurrency?

   Below, we answer these questions for kernel-level threads and for
   user-level threads.

C. Kernel-level threads

   [exercise: answer the questions above for kernel-level threads.
   below are answers, but we won't cover them in class.]

   --Kernel maintains TCBs
      --a TCB looks a lot like a PCB (Process Control Block)
   --thread_create() becomes a syscall
   --when do thread switches happen? with kernel-level threading,
     they can happen at any point.
   --basic game plan for dispatch/swtch():
      --thread is running
      --switch to kernel
      --save thread state (to TCB)
      --choose new thread to run
      --load its state (from TCB)
      --new thread is running
   --Can two kernel-level threads execute on two different processors?
     (Answer: yes.)
   --Disadvantages of kernel-level threading:
      --every thread operation (create, exit, join, synchronize, etc.)
        goes through the kernel
        --> 10x-30x slower than user-level threads
      --heavier-weight memory requirements (each thread gets a stack in
        user space *and* within the kernel. compare to user-level
        threads: each thread gets a stack in user space, and there's
        one stack within the kernel that corresponds to the process.)

D. User-level threads

   * where is the TCB stored? (in user-space memory, by the user-space
     threading library; note that the kernel is totally ignorant of
     user-level threads)
   * what is the level of true concurrency? (answer: none)
   * swtch(): implemented by the thread library (which is a run-time
     system)
      --see handout for the implementation; a rough sketch also
        appears just below

   * Basic idea: swtch() is called at "sane" moments, in response to a
     function call from a thread. That function is usually yield(),
     i.e., the call graph usually looks like this:

        fake_read()
            if read would block
                yield()
                    swtch()
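   [ASIDE, not from the handout: a minimal sketch of what the TCB and
   swtch() could look like if built on POSIX's swapcontext() instead
   of the handout's hand-written assembly. The names here (struct
   tcb, RUNNABLE, etc.) are made up for illustration.]

        #include <ucontext.h>

        /* Hypothetical TCB: the run-time keeps, in ordinary
           user-space memory, everything the kernel would keep in a
           PCB -- saved registers, stack pointer, PC. Here, ucontext_t
           plays that role. */
        struct tcb {
            ucontext_t ctx;          /* saved registers, SP, PC */
            char       stack[8192];  /* this thread's user-space stack */
            int        state;        /* e.g., RUNNABLE or BLOCKED */
        };

        /* swtch(): save the current thread's registers into its TCB,
           and load the next thread's registers from its TCB.
           swapcontext() does both in one call; the handout's version
           instead pushes registers, swaps stack pointers, and pops
           registers explicitly. */
        void swtch(struct tcb* current, struct tcb* next)
        {
            swapcontext(&current->ctx, &next->ctx);
            /* when some thread later swtch()es back to us, we resume
               here */
        }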
E. More about threads

   (1) Example uses (for user-level or kernel-level threads)
   (2) More about user-level threads

   (1) Example uses

   --EXAMPLE #1:

        int main(int argc, char** argv)
        {
            thread_create(stage1_processing, NULL);
            thread_create(stage2_processing, NULL);
            /* real code would now join the threads (or otherwise
               wait) instead of returning immediately */
        }

        void stage1_processing(void* arg)
        {
            while (1) {
                do_some_CPU_intensive_things();
                /* when done, enqueue to some task list */
            }
        }

        void stage2_processing(void* arg)
        {
            while (1) {
                /* dequeue a task from some task list */
                /* do some processing */
                /* print some output to the terminal */
            }
        }

     above, threading serves to overlap computation (the CPU-intensive
     things) and I/O (the printing to the terminal): while the second
     thread sleeps waiting for the data to go to the terminal, the
     first thread can do CPU-intensive things.

   --EXAMPLE #2: a threaded Web server services clients
     simultaneously:

        for (;;) {
            fd = accept_client();
            /* hand the thread its own copy of the descriptor;
               passing &fd would race with the next accept_client(),
               which overwrites fd */
            int* fd_copy = malloc(sizeof(int));
            *fd_copy = fd;
            thread_create(service_client, fd_copy);
        }

        void service_client(void* arg)
        {
            int fd = *(int*)arg;
            free(arg);

            while (client_request_not_read_in) {
                read(fd, ....);      /* [+] */
            }
            do_work_for_client();
            while (response_to_client_not_fully_written_out) {
                write(fd, ...);
            }
            thread_exit();
        }

     the point of the above example is that all of the work for a
     single client is encapsulated. imagine if all of that work had to
     happen within a single thread of control; it could be done, but
     it would not be as convenient.

     Note that, to the thread, the read() and write() look to be
     *blocking*: the thread continues past the read() or write() only
     if there is data for it or, respectively, if the output channel
     can accommodate data. However, to the module that *implements*
     threading, the read() and write() are non-blocking (we define
     these terms below).

   (2) More about user-level threads

   --kernel is totally ignorant of user-level threads
   --thread_create() allocates a new stack
      --do we need memory space for registers?
   --keep a queue of runnable threads
   --run-time system:
      --provides a layer above system calls: if a call would block,
        switch, and run a different thread
      --does scheduling:
          --thread is running
          --save thread state (to TCB)
          --choose new thread to run
          --load its state (from TCB)
          --new thread is running
   --when do the above steps happen? Two options:

      1. Only when a thread calls yield() or would block on I/O.
          --This is called *cooperative multithreading* or
            *non-preemptive multithreading*.
          --Upside: makes it pretty easy to avoid errors from
            concurrency
          --Downside: harder to program because now the threads have
            to be good about yielding, and you might have forgotten to
            yield inside a CPU-bound task.

      2. What if we wanted to make user-level threads switch
         non-deterministically?
          --deliver a periodic timer interrupt or signal to a thread
            scheduler [setitimer()]. when the scheduler gets its
            interrupt, it swaps the thread out.
          --makes it more complex to program with user-level threads
          --in practice, systems aren't usually built this way, but
            sometimes it is what you want (e.g., if you're simulating
            some OS-like thing inside a process, and you want to
            simulate the non-determinism that arises from hardware
            timer interrupts).

   --Before continuing, we need to clarify *blocking* versus
     *non-blocking* I/O calls.

      --Blocking means that the entity making the call (the thread in
        this case) does not progress past the I/O call (often a read()
        or write()) unless there is data for the thread (or, in the
        case of a write, unless the output channel can accommodate the
        data).
      --Non-blocking means that if the call *would* block, the call
        instead returns an error, and the thread keeps going.
      --(This idea also pertains to the read/write system calls
        exposed by the kernel for the use of a process.)
      --Usually, the *thread* is supposed to see the call as blocking.
        However, there is an important subtlety: the other side of
        that call (e.g., the run-time that created the thread
        abstraction) makes the corresponding system call in
        *non-blocking* mode. That is because, in this scenario of
        user-level threads, if the run-time *did* block, it wouldn't
        be able to run another thread.
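   [ASIDE: to make "non-blocking mode" concrete, here is how a
   run-time might mark a descriptor non-blocking with the standard
   POSIX calls (error handling abbreviated; set_nonblocking is a made-up
   name). Afterwards, a read() or write() on fd that would block
   instead returns -1 with errno set to EAGAIN.]

        #include <errno.h>
        #include <fcntl.h>

        /* Put fd into non-blocking mode. The run-time would do this
           to every descriptor it later uses inside fake_read() and
           fake_write(). */
        int set_nonblocking(int fd)
        {
            int flags = fcntl(fd, F_GETFL, 0);
            if (flags < 0)
                return -1;
            return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
        }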
   --As an aside, note that the relationship between the run-time and
     the thread is very similar to the relationship between the kernel
     and a process. When a process makes a blocking I/O call (most of
     you have done this at some point in your life -- pretty much
     whenever you have called read() to get the data in some file),
     the kernel puts the process to sleep until the data arrives from
     the disk. But just as the run-time issues the I/O syscall to the
     kernel in non-blocking mode, the kernel issues the I/O request to
     the disk in non-blocking mode. The reason is that if the kernel
     went to sleep every time it waited on data from the disk, then
     the kernel wouldn't be able to run other processes. Put
     differently, the abstraction of "sleeping until there is data
     available" is presented to the higher layer, and the lower layer
     implements that abstraction by simply not running the higher
     layer until the data is available.

   --To return to our multi-threaded Web server example from above:

      --Recall that the thread calls read() to get data from the
        remote Web browser.
      --Let's assume that the Web server is using user-level
        threading. Then the read() in the Web server example (marked
        with "[+]") is actually a "fake" call implemented by the
        threading run-time. The run-time makes the true read() syscall
        (exposed by the kernel) in non-blocking mode. (*)

          --> subtlety/exception: read/write syscalls for disk I/O
              cannot be issued in non-blocking mode, but you can
              ignore this point for now. we'll come back to it.

      --If the kernel has no data for the run-time, the run-time makes
        the calling thread yield() and schedules another thread, one
        that itself had previously not been running.
      --When the run-time is idle (or on a timer), it checks which
        connections have new data and swtch()es to one of them.

   --Let's look at how the above process is implemented, focusing on
     the register/EIP/stack switching. We will further focus on the
     case of *cooperative* user-level multithreading.

     REVIEW: as mentioned above, swtch() is called at "sane" moments,
     in response to a function call from a thread. That function is
     usually yield(), i.e., the call graph usually looks like this:

        fake_read()
            if read would block
                yield()
                    swtch()

     and the pseudocode looks something like this:

        int fake_read(int fd, char* buf, int num)
        {
            int nread = -1;
            while (nread == -1) {
                /* this is a non-blocking read() syscall */
                nread = read(fd, buf, num);
                if (nread == -1) {
                    /* read would block */
                    yield();
                }
            }
            return nread;
        }

        void yield()
        {
            tid next = pick_next_thread(); /* get a runnable thread */
            tid current = get_current_thread();
            swtch(current, next);
        }

   --to repeat, what "would block" means:
      --in the read direction, it means that there's no data to read
      --in the write direction, it means that the output buffers are
        full, so the write cannot happen yet

   --How to switch threads in a non-cooperative context? In a
     non-cooperative context, a thread could be switched out at any
     moment, so its state is not neatly arranged on the stack, per the
     call graph. But in that case, the OS would have put some of the
     thread's registers in a trap frame, and the run-time can yank
     those registers, save them (and the other registers) in the TCB
     or on the thread's regular stack, and then restore them later.

     Said differently, thread switching by the user-level run-time
     looks a lot like process switching by the kernel.
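   [ASIDE: for the non-cooperative case, the setitimer() mechanism
   mentioned earlier might be wired up roughly as follows. This is a
   sketch only (preempt and start_preemption are made-up names); a
   real run-time must also worry about async-signal safety and about
   masking the signal while it updates its own data structures.]

        #include <signal.h>
        #include <string.h>
        #include <sys/time.h>

        extern void yield(void);  /* the run-time's yield(), as above */

        /* By the time this handler runs, the kernel has already saved
           the interrupted thread's registers; the run-time can
           therefore force a switch even though the thread never asked
           for one. */
        static void preempt(int signo)
        {
            yield();
        }

        void start_preemption(void)
        {
            struct sigaction sa;
            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = preempt;
            sigaction(SIGVTALRM, &sa, NULL);

            /* ask for SIGVTALRM every 10 ms of this process's CPU
               time */
            struct itimerval it;
            it.it_interval.tv_sec  = 0;
            it.it_interval.tv_usec = 10000;
            it.it_value = it.it_interval;
            setitimer(ITIMER_VIRTUAL, &it, NULL);
        }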
   Notes/questions:

   --In the kernel's PCB, only one set of registers is stored.....
   --QUESTION: where are the other registers, for the other threads?

   Disadvantages of user-level threads:

   --Can we imagine having two user-level threads truly executing at
     once, that is, on two different processors? (Answer: no. why?)
   --What if the OS handles page faults for the process? (then a page
     fault in one thread blocks all threads)
      --(not a huge issue in practice)
   --Similarly, if a thread needs to go to disk, then that actually
     blocks *all* threads (since the kernel won't allow the run-time
     to make a non-blocking read() call to the disk). so what do we do
     about this?
      --extend the API
      --live with it
      --use elaborate hacks with memory-mapped files (e.g., files are
        all memory-mapped, and the run-time asks to handle its own
        page faults, if the OS allows it)

----------------------------------------------------------------------------
[This material between the dashed lines is not going to be covered in
class. It is for your own reference. It may or may not be helpful in
studying.]

Quick comparison between user-level threading and kernel-level
threading:

   (i)  high-level choice: user-level or kernel-level (but one can
        have N:M threading, in which N user-level threads are
        multiplexed over M kernel threads, so the choice is a bit
        fuzzier)

   (ii) if user-level, there's another choice: non-preemptive (also
        known as cooperative) or preemptive

        [be able to answer: why are kernel-level threads always
        preemptive?]

   --*Only* the presence of multiple kernel-level threads can give:
      --true multiprocessing (i.e., different threads running on
        different processors)
      --asynchronous disk I/O using the POSIX interface [because
        read() blocks and causes the *kernel* scheduler to be invoked]
   --but many modern operating systems provide interfaces for
     asynchronous disk I/O, at least as an extension
      --Windows has one
      --Linux has AIO extensions
      --thus, even user-level threads can get asynchronous disk I/O,
        by having the run-time translate calls that *appear* blocking
        to the thread [e.g., thread_read()] into a series of
        instructions that: register interest in an I/O event, put the
        thread to sleep, and swtch() to another thread
      --[moral of the story: if you find yourself needing async disk
        I/O from user-level threads, use one of the non-POSIX
        interfaces!]

Quick terminology note:

   --The kernel itself uses threads internally, when executing in
     kernel mode. Such threads-in-the-kernel are related to, but not
     the same thing as, the kernel-level threading mentioned above.
   --We'll try to keep these concepts distinct in this class, but we
     may not always succeed.

Historical notes: a classification:

                             # address spaces:
                             one                 many
     # threads per
     address space:
     one                     MS-DOS, Palm OS     traditional Unix
     many                    embedded systems,   VMS, Mach, NT,
                             Pilot               Solaris, HP-UX, ...

   (Pilot was the OS on the first personal computer ever built -- the
   Alto. the idea was that there was no need for protection if there
   was only one user.)
----------------------------------------------------------------------------

F. [SKIP IN CLASS] Old debates about user-level threading vs.
   kernel-level threading. The "Scheduler Activations" paper, by
   Anderson et al. [ACM Transactions on Computer Systems 10, 1
   (February 1992), pp. 53-79], proposes an abstraction that is a
   hybrid of the two:

   --basically the OS tells the process: "I'm ready to give you
     another virtual CPU (or to take one away from you); which of your
     user-level threads do you want me to run?"
   --so the user-level scheduler decides which threads run, but the
     kernel takes care of multiplexing them

3. Linking and loading

A. Introduction

   [draw picture]

              gcc          as
        foo.c --> foo.s --> foo.o  \
                                    ld --> a.out
        bar.c --> bar.s --> bar.o  /

   interesting questions here:
   --How to name and refer to things that don't exist yet?
   --How to merge separate name spaces into a cohesive whole?
     (a tiny concrete example of the first question follows below)
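   [ASIDE: a tiny, hypothetical instance of the first question. foo.c
   below names a function and a variable that exist nowhere in foo.c;
   the compiler and assembler emit *references* for them (see section
   C below), and ld later resolves those references against the
   *definitions* in bar.o. The two files are shown together here for
   brevity.]

        /* foo.c: refers to things that don't exist yet (in this file) */
        extern int counter;      /* defined in some other .o */
        void bar(void);          /* ditto */

        int main(void)
        {
            counter += 1;        /* where does counter live? ld decides */
            bar();
            return 0;
        }

        /* bar.c: supplies the definitions */
        int counter = 0;
        void bar(void) { }

   compiling with "gcc -c foo.c bar.c" and then linking with
   "gcc foo.o bar.o" produces a working a.out.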
   naming

   --linking is an interesting case study of _naming_
   --_naming_ is a deep concept/theme/idea that is everywhere
   --at the highest level, a naming system maps names to values
   --examples:
      * virtual memory: a virtual address (name) is resolved to a
        physical address (value)
      * file systems: file and directory names are translated to disk
        locations
      * network names (www.cs.utexas.edu) are resolved to IP addresses
      * IP addresses are resolved to Ethernet addresses with ARP
      * addresses in the real world: 123 Elm Street gets translated to
        an actual location (note: this is easier when streets are
        named 1st street, 2nd street, etc.!)
   --linking: where is printf()? how can a piece of source code refer
     to it? what if it doesn't exist? what about synonyms?
   --the concept of an address:
      --one needs an address to use data
      --addresses locate things, and when the "things" move, their
        addresses need to change
      --linkers, URLs, computers, etc.

   basic question: when there's code like

        x += 1

   where does the "x" live? what is its address?

   gameboard:

   (a) the assembler takes a .s file and produces a .o file
   (b) the linker takes .o files and produces an executable
       --[draw picture of .o file]
   (c) the loader loads the executable into memory (you are doing this
       in labs 3 and 4)
       --reads code and data segments into memory (possibly into the
         buffer cache)
       --maps code read-only and initialized data R/W
       --or else fakes the process state to make it look like the
         process has been paged out, and then loads its pieces on
         demand. see below.
       --optimizations on the basic "load into memory":
          --zero-initialized data does not need to be read in
          --demand load: wait until code is used before getting it
            from disk
          --copies of the same program running? share the code
          --multiple programs using the same routines: share the code
            (harder)

   we will go through these in the order (c), (a), (b)

B. What does a process look like in memory?

   --running process: address space divided into _segments_:
     [draw picture]
   --how is all of this specified? executable files!
      --one way to look at an executable file is that it is the
        interface between the linker (and the tools upstream of it)
        and the OS
      --it contains a list of segments, where they should be loaded
        into memory, what type they are, etc.
   --who builds these components?
      --the heap: allocated and laid out at runtime by malloc
          --the compiler and linker are not involved except to say
            where the heap starts
          --the name space is constructed dynamically and managed by
            the programmer (the names are stored in pointers and
            organized using data structures)
      --the stack: allocated at runtime (every time a procedure call
        happens), and laid out by the compiler
          --names are relative to the stack (or frame) pointer
            (remember class 2....)
          --managed by the compiler
          --the linker isn't involved because the name space is
            entirely local: the compiler has enough information to
            manage it
      --global data and code: allocated by the compiler, laid out by
        the linker
          --the compiler names them with symbolic references
          --the linker lays them out and translates references to
            actual virtual addresses

C. What does the assembler do?

   --First, see what the compiler does before the assembler runs:
     [SEE HANDOUT.]

     ASK: what are the three symbolic references?
     ANSWER: to printf, to main, and to the string "hello world\n"

   --Here's what the assembler does with the above: [DRAW PICTURE.]
   --Note that the assembler doesn't know where data/code should be
     placed in the process's address space
      --it assumes everything starts at zero
      --it emits a symbol table that holds the name and offset of each
        created object
          --routines/variables exported by the file are recorded as
            *definitions*
   --There are also *references*. to understand these, ask how to
     represent the concept of "call procedure X" in an object file, or
     the concept of "use memory location Y as an address, and load
     from it"
      --answer: the assembler puts a bogus value into the instruction
        and emits an _external reference_ that tells the linker where
        the instruction is and what symbol needs to be patched in

        example:  printf: 4   [in the external refs section]

D. Overview of linking

   --linking is done by the 'ld' program on Unix. the name stands for
     "linkage editor".
   --it's usually hidden behind the compiler, but if you run gcc with
     "-v", you can watch gcc invoke the linker.
   --conceptually, the linker makes two passes over the .o files. in
     pass 1, it:

   * Coalesces like segments from all of the .o files (all the text is
     merged, all the data is merged, etc.)
   * Records all *definitions* in a global symbol table:
     symbol --> offset
   * Records all *references* in the global symbol table
      --this requires storing the following information with each
        reference in the fragment of the symbol table:
          --the identifier of the coalesced segment where the
            reference is used
          --the offset in that coalesced segment

   [at this point, we have the following:
      --a group of coalesced segments, all starting at address 0
      --a list of definitions (offsets where the object "lives")
      --a list of references (offsets where the object is "used")]

   * Translates all segments to give them virtual addresses

   [move to pass 2.]

   * Enumerates all references and "fixes" them
      --for each reference, the linker inserts the symbol's virtual
        address into the reference's instruction or data location
      --this can be done because:
          --the symbol table tells the linker where the reference is
          --the symbol table tells the linker what the symbol's
            ultimate virtual address will be
      --NOTE: there is enough information in the symbol table that the
        patched calls can use *PC-relative* addresses (a toy
        illustration of this patching appears just below)

   * Emits the binary
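   [ASIDE: the following toy program mimics what "fixing" one
   reference might look like for an x86 call instruction. All
   addresses are made-up illustrative values; a real linker reads them
   from its symbol table.]

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* a 5-byte x86 "call rel32" as the assembler emitted it:
               opcode 0xe8 followed by a bogus 4-byte operand */
            uint8_t text[5] = { 0xe8, 0, 0, 0, 0 };

            uint32_t call_site   = 0x08048010; /* VA chosen for this call */
            uint32_t printf_addr = 0x08048ff0; /* VA assigned to printf */

            /* the call operand is PC-relative: target minus the
               address of the *next* instruction (call_site + 5) */
            uint32_t rel = printf_addr - (call_site + 5);
            memcpy(&text[1], &rel, sizeof(rel)); /* little-endian host
                                                    assumed */

            printf("patched operand: 0x%08x\n", (unsigned) rel);
            return 0;
        }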
   * [SKIP IN CLASS] name mangling

     in C++, we can have multiple functions with the same name:

        int foo(int a);
        int foo(int a, int b);

     the compiler _mangles_ the symbols to get a unique name for each
     function. (the compiler does a similar thing for
     methods/namespaces [such as obj::fun], template instantiations,
     and special functions such as "operator new".)
     [unfortunately, the mangling is not compatible across compiler
     versions.]

     to "unmangle":

        % nm foo.o
        0000000 T _Z3fooi
        000000e T _Z3fooii
                U __gxx_personality_v0

        % nm foo.o | c++filt
        0000000 T foo(int)
        000000e T foo(int, int)
                U __gxx_personality_v0

   further reading:
   --man pages for a.out, elf
   --run 'nm' and 'objdump' on .o and executable files

[thanks to David Mazieres]
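   [ASIDE: for instance, running nm on the hello-world object from
   section C shows one definition and one reference. the offsets are
   illustrative, exact output varies by platform, and the arrow
   annotations are added:]

        % gcc -c hello.c
        % nm hello.o
        0000000000000000 T main     <-- definition: main at offset 0 in text
                         U printf   <-- reference: undefined here;
                                        ld patches it in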