Class 11
CS372H 23 February 2012

On the board
------------
1. Last time

2. Implementation of swtch()

3. Linking and loading
   --introduction
   --processes in memory
   --assembling
   --overview of linking
   --details of linking

----------------------------------------------------------------------------

1. Last time

  --alternatives to locking/concurrency
  --advice
  --therac-25 and software safety

2. Implementation of swtch()

A. Recall the thread interface:
   --thread_create, thread_join, thread_exit
   --thread_create allocates a stack
   --and a TCB (thread control block; analogy with PCB)

B. One way to understand a given implementation of threads is by
   answering three questions:

      * where is the TCB stored?
      * what does swtch() look like, and who implements it?
      * what is the level of true concurrency?

   Below, we answer these questions for kernel-level threads and for
   user-level threads.

C. Kernel-level threads

   [exercise: answer the questions above for kernel-level threads.
   below are answers, but we won't cover them in class.]

   --Kernel maintains TCBs
      --a TCB looks a lot like a PCB (Process Control Block)
   --thread_create() becomes a syscall
   --when do thread switches happen? with kernel-level threading,
     they can happen at any point.
   --basic game plan for dispatch/swtch():
      --thread is running
      --switch to kernel
      --save thread state (to TCB)
      --choose new thread to run
      --load its state (from TCB)
      --new thread is running
   --Can two kernel-level threads execute on two different processors?
     (Answer: yes.)
   --Disadvantages of kernel-level threading:
      --every thread operation (create, exit, join, synchronize, etc.)
        goes through the kernel
        --> 10x-30x slower than user-level threads
      --heavier-weight memory requirements (each thread gets a stack in
        user space *and* within the kernel. compare to user-level
        threads: each thread gets a stack in user space, and there's
        one stack within the kernel that corresponds to the process.)

D. User-level threads

   * where is the TCB stored? (in user-space memory, by the user-space
     threading library; note that the kernel is totally ignorant of
     user-level threads)
   * what is the level of true concurrency? (answer: none)
   * swtch(): implemented by the thread library (which is a run-time
     system)
      --see handout for the implementation; a rough sketch also
        appears just below

   * Basic idea: swtch() is called at "sane" moments, in response to a
     function call from a thread. That function is usually yield(),
     i.e., the call graph usually looks like this:

        fake_read()
            if read would block
                yield()
                    swtch()
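   [ASIDE, not from the handout: a minimal sketch of what the TCB and
   swtch() could look like if built on POSIX's swapcontext() instead
   of the handout's hand-written assembly. The names here (struct
   tcb, RUNNABLE, etc.) are made up for illustration.]

        #include <ucontext.h>

        /* Hypothetical TCB: the run-time keeps, in ordinary
           user-space memory, everything the kernel would keep in a
           PCB -- saved registers, stack pointer, PC. Here, ucontext_t
           plays that role. */
        struct tcb {
            ucontext_t ctx;          /* saved registers, SP, PC */
            char       stack[8192];  /* this thread's user-space stack */
            int        state;        /* e.g., RUNNABLE or BLOCKED */
        };

        /* swtch(): save the current thread's registers into its TCB,
           and load the next thread's registers from its TCB.
           swapcontext() does both in one call; the handout's version
           instead pushes registers, swaps stack pointers, and pops
           registers explicitly. */
        void swtch(struct tcb* current, struct tcb* next)
        {
            swapcontext(&current->ctx, &next->ctx);
            /* when some thread later swtch()es back to us, we resume
               here */
        }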
E. More about threads

   (1) Example uses (for user-level or kernel-level threads)
   (2) More about user-level threads

   (1) Example uses

   --EXAMPLE #1:

        int main(int argc, char** argv)
        {
            thread_create(stage1_processing, NULL);
            thread_create(stage2_processing, NULL);
            /* real code would now join the threads (or otherwise
               wait) instead of returning immediately */
        }

        void stage1_processing(void* arg)
        {
            while (1) {
                do_some_CPU_intensive_things();
                /* when done, enqueue to some task list */
            }
        }

        void stage2_processing(void* arg)
        {
            while (1) {
                /* dequeue a task from some task list */
                /* do some processing */
                /* print some output to the terminal */
            }
        }

     above, threading serves to overlap computation (the CPU-intensive
     things) and I/O (the printing to the terminal): while the second
     thread sleeps waiting for the data to go to the terminal, the
     first thread can do CPU-intensive things.

   --EXAMPLE #2: a threaded Web server services clients
     simultaneously:

        for (;;) {
            fd = accept_client();
            /* hand the thread its own copy of the descriptor;
               passing &fd would race with the next accept_client(),
               which overwrites fd */
            int* fd_copy = malloc(sizeof(int));
            *fd_copy = fd;
            thread_create(service_client, fd_copy);
        }

        void service_client(void* arg)
        {
            int fd = *(int*)arg;
            free(arg);

            while (client_request_not_read_in) {
                read(fd, ....);      /* [+] */
            }
            do_work_for_client();
            while (response_to_client_not_fully_written_out) {
                write(fd, ...);
            }
            thread_exit();
        }

     the point of the above example is that all of the work for a
     single client is encapsulated. imagine if all of that work had to
     happen within a single thread of control; it could be done, but
     it would not be as convenient.

     Note that, to the thread, the read() and write() look to be
     *blocking*: the thread continues past the read() or write() only
     if there is data for it or, respectively, if the output channel
     can accommodate data. However, to the module that *implements*
     threading, the read() and write() are non-blocking (we define
     these terms below).

   (2) More about user-level threads

   --kernel is totally ignorant of user-level threads
   --thread_create() allocates a new stack
      --do we need memory space for registers?
   --keep a queue of runnable threads
   --run-time system:
      --provides a layer above system calls: if a call would block,
        switch, and run a different thread
      --does scheduling:
          --thread is running
          --save thread state (to TCB)
          --choose new thread to run
          --load its state (from TCB)
          --new thread is running
   --when do the above steps happen? Two options:

      1. Only when a thread calls yield() or would block on I/O.
          --This is called *cooperative multithreading* or
            *non-preemptive multithreading*.
          --Upside: makes it pretty easy to avoid errors from
            concurrency
          --Downside: harder to program because now the threads have
            to be good about yielding, and you might have forgotten to
            yield inside a CPU-bound task.

      2. What if we wanted to make user-level threads switch
         non-deterministically?
          --deliver a periodic timer interrupt or signal to a thread
            scheduler [setitimer()]. when the scheduler gets its
            interrupt, it swaps the thread out.
          --makes it more complex to program with user-level threads
          --in practice, systems aren't usually built this way, but
            sometimes it is what you want (e.g., if you're simulating
            some OS-like thing inside a process, and you want to
            simulate the non-determinism that arises from hardware
            timer interrupts).

   --Before continuing, we need to clarify *blocking* versus
     *non-blocking* I/O calls.

      --Blocking means that the entity making the call (the thread in
        this case) does not progress past the I/O call (often a read()
        or write()) unless there is data for the thread (or, in the
        case of a write, unless the output channel can accommodate the
        data).
      --Non-blocking means that if the call *would* block, the call
        instead returns an error, and the thread keeps going.
      --(This idea also pertains to the read/write system calls
        exposed by the kernel for the use of a process.)
      --Usually, the *thread* is supposed to see the call as blocking.
        However, there is an important subtlety: the other side of
        that call (e.g., the run-time that created the thread
        abstraction) makes the corresponding system call in
        *non-blocking* mode. That is because, in this scenario of
        user-level threads, if the run-time *did* block, it wouldn't
        be able to run another thread.
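   [ASIDE: to make "non-blocking mode" concrete, here is how a
   run-time might mark a descriptor non-blocking with the standard
   POSIX calls (error handling abbreviated; set_nonblocking is a made-up
   name). Afterwards, a read() or write() on fd that would block
   instead returns -1 with errno set to EAGAIN.]

        #include <errno.h>
        #include <fcntl.h>

        /* Put fd into non-blocking mode. The run-time would do this
           to every descriptor it later uses inside fake_read() and
           fake_write(). */
        int set_nonblocking(int fd)
        {
            int flags = fcntl(fd, F_GETFL, 0);
            if (flags < 0)
                return -1;
            return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
        }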
   --As an aside, note that the relationship between the run-time and
     the thread is very similar to the relationship between the kernel
     and a process. When a process makes a blocking I/O call (most of
     you have done this at some point in your life -- pretty much
     whenever you have called read() to get the data in some file),
     the kernel puts the process to sleep until the data arrives from
     the disk. But just as the run-time issues the I/O syscall to the
     kernel in non-blocking mode, the kernel issues the I/O request to
     the disk in non-blocking mode. The reason is that if the kernel
     went to sleep every time it waited on data from the disk, then
     the kernel wouldn't be able to run other processes. Put
     differently, the abstraction of "sleeping until there is data
     available" is presented to the higher layer, and the lower layer
     implements that abstraction by simply not running the higher
     layer until the data is available.

   --To return to our multi-threaded Web server example from above:

      --Recall that the thread calls read() to get data from the
        remote Web browser.
      --Let's assume that the Web server is using user-level
        threading. Then the read() in the Web server example (marked
        with "[+]") is actually a "fake" call implemented by the
        threading run-time. The run-time makes the true read() syscall
        (exposed by the kernel) in non-blocking mode. (*)

          --> subtlety/exception: read/write syscalls for disk I/O
              cannot be issued in non-blocking mode, but you can
              ignore this point for now. we'll come back to it.

      --If the kernel has no data for the run-time, the run-time makes
        the calling thread yield() and schedules another thread, one
        that itself had previously not been running.
      --When the run-time is idle (or on a timer), it checks which
        connections have new data and swtch()es to one of them.

   --Let's look at how the above process is implemented, focusing on
     the register/EIP/stack switching. We will further focus on the
     case of *cooperative* user-level multithreading.

     REVIEW: as mentioned above, swtch() is called at "sane" moments,
     in response to a function call from a thread. That function is
     usually yield(), i.e., the call graph usually looks like this:

        fake_read()
            if read would block
                yield()
                    swtch()

     and the pseudocode looks something like this:

        int fake_read(int fd, char* buf, int num)
        {
            int nread = -1;
            while (nread == -1) {
                /* this is a non-blocking read() syscall */
                nread = read(fd, buf, num);
                if (nread == -1) {
                    /* read would block */
                    yield();
                }
            }
            return nread;
        }

        void yield()
        {
            tid next = pick_next_thread(); /* get a runnable thread */
            tid current = get_current_thread();
            swtch(current, next);
        }

   --to repeat, what "would block" means:
      --in the read direction, it means that there's no data to read
      --in the write direction, it means that the output buffers are
        full, so the write cannot happen yet

   --How to switch threads in a non-cooperative context? In a
     non-cooperative context, a thread could be switched out at any
     moment, so its state is not neatly arranged on the stack, per the
     call graph. But in that case, the OS would have put some of the
     thread's registers in a trap frame, and the run-time can yank
     those registers, save them (and the other registers) in the TCB
     or on the thread's regular stack, and then restore them later.

     Said differently, thread switching by the user-level run-time
     looks a lot like process switching by the kernel.
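   [ASIDE: for the non-cooperative case, the setitimer() mechanism
   mentioned earlier might be wired up roughly as follows. This is a
   sketch only (preempt and start_preemption are made-up names); a
   real run-time must also worry about async-signal safety and about
   masking the signal while it updates its own data structures.]

        #include <signal.h>
        #include <string.h>
        #include <sys/time.h>

        extern void yield(void);  /* the run-time's yield(), as above */

        /* By the time this handler runs, the kernel has already saved
           the interrupted thread's registers; the run-time can
           therefore force a switch even though the thread never asked
           for one. */
        static void preempt(int signo)
        {
            yield();
        }

        void start_preemption(void)
        {
            struct sigaction sa;
            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = preempt;
            sigaction(SIGVTALRM, &sa, NULL);

            /* ask for SIGVTALRM every 10 ms of this process's CPU
               time */
            struct itimerval it;
            it.it_interval.tv_sec  = 0;
            it.it_interval.tv_usec = 10000;
            it.it_value = it.it_interval;
            setitimer(ITIMER_VIRTUAL, &it, NULL);
        }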
   Notes/questions:

   --In the kernel's PCB, only one set of registers is stored.....
   --QUESTION: where are the other registers, for the other threads?

   Disadvantages of user-level threads:

   --Can we imagine having two user-level threads truly executing at
     once, that is, on two different processors? (Answer: no. why?)
   --What if the OS handles page faults for the process? (then a page
     fault in one thread blocks all threads)
      --(not a huge issue in practice)
   --Similarly, if a thread needs to go to disk, then that actually
     blocks *all* threads (since the kernel won't allow the run-time
     to make a non-blocking read() call to the disk). so what do we do
     about this?
      --extend the API
      --live with it
      --use elaborate hacks with memory-mapped files (e.g., files are
        all memory-mapped, and the run-time asks to handle its own
        page faults, if the OS allows it)

----------------------------------------------------------------------------
[This material between the dashed lines is not going to be covered in
class. It is for your own reference. It may or may not be helpful in
studying.]

Quick comparison between user-level threading and kernel-level
threading:

   (i)  high-level choice: user-level or kernel-level (but one can
        have N:M threading, in which N user-level threads are
        multiplexed over M kernel threads, so the choice is a bit
        fuzzier)

   (ii) if user-level, there's another choice: non-preemptive (also
        known as cooperative) or preemptive

        [be able to answer: why are kernel-level threads always
        preemptive?]

   --*Only* the presence of multiple kernel-level threads can give:
      --true multiprocessing (i.e., different threads running on
        different processors)
      --asynchronous disk I/O using the POSIX interface [because
        read() blocks and causes the *kernel* scheduler to be invoked]
   --but many modern operating systems provide interfaces for
     asynchronous disk I/O, at least as an extension
      --Windows has one
      --Linux has AIO extensions
      --thus, even user-level threads can get asynchronous disk I/O,
        by having the run-time translate calls that *appear* blocking
        to the thread [e.g., thread_read()] into a series of
        instructions that: register interest in an I/O event, put the
        thread to sleep, and swtch() to another thread
      --[moral of the story: if you find yourself needing async disk
        I/O from user-level threads, use one of the non-POSIX
        interfaces!]

Quick terminology note:

   --The kernel itself uses threads internally, when executing in
     kernel mode. Such threads-in-the-kernel are related to, but not
     the same thing as, the kernel-level threading mentioned above.
   --We'll try to keep these concepts distinct in this class, but we
     may not always succeed.

Historical notes: a classification:

                             # address spaces:
                             one                 many
     # threads per
     address space:
     one                     MS-DOS, Palm OS     traditional Unix
     many                    embedded systems,   VMS, Mach, NT,
                             Pilot               Solaris, HP-UX, ...

   (Pilot was the OS on the first personal computer ever built -- the
   Alto. the idea was that there was no need for protection if there
   was only one user.)
----------------------------------------------------------------------------

F. [SKIP IN CLASS] Old debates about user-level threading vs.
   kernel-level threading. The "Scheduler Activations" paper, by
   Anderson et al. [ACM Transactions on Computer Systems 10, 1
   (February 1992), pp. 53-79], proposes an abstraction that is a
   hybrid of the two:

   --basically the OS tells the process: "I'm ready to give you
     another virtual CPU (or to take one away from you); which of your
     user-level threads do you want me to run?"
   --so the user-level scheduler decides which threads run, but the
     kernel takes care of multiplexing them

3. Linking and loading

A. Introduction

   [draw picture]

              gcc          as
        foo.c --> foo.s --> foo.o  \
                                    ld --> a.out
        bar.c --> bar.s --> bar.o  /

   interesting questions here:
   --How to name and refer to things that don't exist yet?
   --How to merge separate name spaces into a cohesive whole?
     (a tiny concrete example of the first question follows below)
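   [ASIDE: a tiny, hypothetical instance of the first question. foo.c
   below names a function and a variable that exist nowhere in foo.c;
   the compiler and assembler emit *references* for them (see section
   C below), and ld later resolves those references against the
   *definitions* in bar.o. The two files are shown together here for
   brevity.]

        /* foo.c: refers to things that don't exist yet (in this file) */
        extern int counter;      /* defined in some other .o */
        void bar(void);          /* ditto */

        int main(void)
        {
            counter += 1;        /* where does counter live? ld decides */
            bar();
            return 0;
        }

        /* bar.c: supplies the definitions */
        int counter = 0;
        void bar(void) { }

   compiling with "gcc -c foo.c bar.c" and then linking with
   "gcc foo.o bar.o" produces a working a.out.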
   naming

   --linking is an interesting case study of _naming_
   --_naming_ is a deep concept/theme/idea that is everywhere
   --at the highest level, a naming system maps names to values
   --examples:
      * virtual memory: a virtual address (name) is resolved to a
        physical address (value)
      * file systems: file and directory names are translated to disk
        locations
      * network names (www.cs.utexas.edu) are resolved to IP addresses
      * IP addresses are resolved to Ethernet addresses with ARP
      * addresses in the real world: 123 Elm Street gets translated to
        an actual location (note: this is easier when streets are
        named 1st street, 2nd street, etc.!)
   --linking: where is printf()? how can a piece of source code refer
     to it? what if it doesn't exist? what about synonyms?
   --the concept of an address:
      --one needs an address to use data
      --addresses locate things, and when the "things" move, their
        addresses need to change
      --linkers, URLs, computers, etc.

   basic question: when there's code like

        x += 1

   where does the "x" live? what is its address?

   gameboard:

   (a) the assembler takes a .s file and produces a .o file
   (b) the linker takes .o files and produces an executable
       --[draw picture of .o file]
   (c) the loader loads the executable into memory (you are doing this
       in labs 3 and 4)
       --reads code and data segments into memory (possibly into the
         buffer cache)
       --maps code read-only and initialized data R/W
       --or else fakes the process state to make it look like the
         process has been paged out, and then loads its pieces on
         demand. see below.
       --optimizations on the basic "load into memory":
          --zero-initialized data does not need to be read in
          --demand load: wait until code is used before getting it
            from disk
          --copies of the same program running? share the code
          --multiple programs using the same routines: share the code
            (harder)

   we will go through these in the order (c), (a), (b)

B. What does a process look like in memory?

   --running process: address space divided into _segments_:
     [draw picture]
   --how is all of this specified? executable files!
      --one way to look at an executable file is that it is the
        interface between the linker (and the tools upstream of it)
        and the OS
      --it contains a list of segments, where they should be loaded
        into memory, what type they are, etc.
   --who builds these components?
      --the heap: allocated and laid out at runtime by malloc
          --the compiler and linker are not involved except to say
            where the heap starts
          --the name space is constructed dynamically and managed by
            the programmer (the names are stored in pointers and
            organized using data structures)
      --the stack: allocated at runtime (every time a procedure call
        happens), and laid out by the compiler
          --names are relative to the stack (or frame) pointer
            (remember class 2....)
          --managed by the compiler
          --the linker isn't involved because the name space is
            entirely local: the compiler has enough information to
            manage it
      --global data and code: allocated by the compiler, laid out by
        the linker
          --the compiler names them with symbolic references
          --the linker lays them out and translates references to
            actual virtual addresses

C. What does the assembler do?

   --First, see what the compiler does before the assembler runs:
     [SEE HANDOUT.]

     ASK: what are the three symbolic references?
     ANSWER: to printf, to main, and to the string "hello world\n"

   --Here's what the assembler does with the above: [DRAW PICTURE.]
   --Note that the assembler doesn't know where data/code should be
     placed in the process's address space
      --it assumes everything starts at zero
      --it emits a symbol table that holds the name and offset of each
        created object
          --routines/variables exported by the file are recorded as
            *definitions*
   --There are also *references*. to understand these, ask how to
     represent the concept of "call procedure X" in an object file, or
     the concept of "use memory location Y as an address, and load
     from it"
      --answer: the assembler puts a bogus value into the instruction
        and emits an _external reference_ that tells the linker where
        the instruction is and what symbol needs to be patched in

        example:  printf: 4   [in the external refs section]

D. Overview of linking

   --linking is done by the 'ld' program on Unix. the name stands for
     "linkage editor".
   --it's usually hidden behind the compiler, but if you run gcc with
     "-v", you can watch gcc invoke the linker.
   --conceptually, the linker makes two passes over the .o files. in
     pass 1, it:

   * Coalesces like segments from all of the .o files (all the text is
     merged, all the data is merged, etc.)
   * Records all *definitions* in a global symbol table:
     symbol --> offset
   * Records all *references* in the global symbol table
      --this requires storing the following information with each
        reference in the fragment of the symbol table:
          --the identifier of the coalesced segment where the
            reference is used
          --the offset in that coalesced segment

   [at this point, we have the following:
      --a group of coalesced segments, all starting at address 0
      --a list of definitions (offsets where the object "lives")
      --a list of references (offsets where the object is "used")]

   * Translates all segments to give them virtual addresses

   [move to pass 2.]

   * Enumerates all references and "fixes" them
      --for each reference, the linker inserts the symbol's virtual
        address into the reference's instruction or data location
      --this can be done because:
          --the symbol table tells the linker where the reference is
          --the symbol table tells the linker what the symbol's
            ultimate virtual address will be
      --NOTE: there is enough information in the symbol table that the
        patched calls can use *PC-relative* addresses (a toy
        illustration of this patching appears just below)

   * Emits the binary
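   [ASIDE: the following toy program mimics what "fixing" one
   reference might look like for an x86 call instruction. All
   addresses are made-up illustrative values; a real linker reads them
   from its symbol table.]

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* a 5-byte x86 "call rel32" as the assembler emitted it:
               opcode 0xe8 followed by a bogus 4-byte operand */
            uint8_t text[5] = { 0xe8, 0, 0, 0, 0 };

            uint32_t call_site   = 0x08048010; /* VA chosen for this call */
            uint32_t printf_addr = 0x08048ff0; /* VA assigned to printf */

            /* the call operand is PC-relative: target minus the
               address of the *next* instruction (call_site + 5) */
            uint32_t rel = printf_addr - (call_site + 5);
            memcpy(&text[1], &rel, sizeof(rel)); /* little-endian host
                                                    assumed */

            printf("patched operand: 0x%08x\n", (unsigned) rel);
            return 0;
        }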
   * [SKIP IN CLASS] name mangling

     in C++, we can have multiple functions with the same name:

        int foo(int a);
        int foo(int a, int b);

     the compiler _mangles_ the symbols to get a unique name for each
     function. (the compiler does a similar thing for
     methods/namespaces [such as obj::fun], template instantiations,
     and special functions such as "operator new".)
     [unfortunately, the mangling is not compatible across compiler
     versions.]

     to "unmangle":

        % nm foo.o
        0000000 T _Z3fooi
        000000e T _Z3fooii
                U __gxx_personality_v0

        % nm foo.o | c++filt
        0000000 T foo(int)
        000000e T foo(int, int)
                U __gxx_personality_v0

   further reading:
   --man pages for a.out, elf
   --run 'nm' and 'objdump' on .o and executable files

[thanks to David Mazieres]
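   [ASIDE: for instance, running nm on the hello-world object from
   section C shows one definition and one reference. the offsets are
   illustrative, exact output varies by platform, and the arrow
   annotations are added:]

        % gcc -c hello.c
        % nm hello.o
        0000000000000000 T main     <-- definition: main at offset 0 in text
                         U printf   <-- reference: undefined here;
                                        ld patches it in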