Class 9
CS 439
12 February 2013

On the board
------------

1. Last time

2. revisiting threads
    --review
    --classification
    --implementation

---------------------------------------------------------------------------

1. Last time

(video; sorry about the problems with it.)
    --PC architecture
    --x86 programming
    --gcc calling convention

today:
    --implementation of concurrency primitives

2. Thread abstraction

(emphasis on how threads are implemented, and on the variants.)

A. Review

    --recall: *threads* are a very natural way to do multiple tasks that
      operate on the same memory state. there are two fundamental
      motivations for threads, though not both apply to every instance:

        (1) desire to have a single process take advantage of multiple
            CPUs

            (*) --> but we'll see that whether the process can in fact
                take advantage of multiple CPUs depends on the
                implementation of threads

        (2) often very natural to structure some computation (or task or
            job or whatever) as multiple units of control that see the
            same memory

            (*) --> but we'll see that this motivation depends on the
                computation itself

    --abstraction/illusion: a sequential set of instructions that
      executes within the address space of a process

        (i) a thread *is* a set of registers (including a PC/IP) and a
            stack.

        (ii) multiple threads within the same process share the same
             memory. (they can even read and write each other's stacks,
             but if there are no bugs, that should not happen. generally,
             the memory that they both look at is heap memory or
             statically initialized memory.)

            --another way to put this: a thread does not have its own
              page directory. so on the x86, two threads share the same
              value of %cr3 (the virtual memory lab will make clear what
              that means)

        (iii) multiple threads within the same process are executing at
              once

            (*) --> but we'll see that this only actually happens
                sometimes

    [Note for your studying: if you truly understand why each of the
    three counterpoints marked "(*) --> but" above is true, then you
    have a good handle on the true motivations for threads and on what
    problems threads are solving.]

B. Classification

    this abstraction can be implemented at multiple levels:

        a. in-kernel, for the kernel
        b. in-kernel, for processes
        c. in-process, for user-level threads (examples: Java virtual
           machine, Flash player, lots of applications!)

    recall that multiple threads share memory but not registers
        --this means, to a first approximation, that they see each
          other's heaps but not each other's stacks.

    different kinds of threads: (review)

        --non-preemptive: a thread executes exclusively until it makes a
          blocking call (e.g., a read() on a file).

        --preemptive: between any two instructions, another thread can
          run
            [how is this implemented? answer: with interrupts and
            context switches]

C. Implementation

    --one way to understand a given implementation of threads is by
      answering three questions:

        * where is the TCB stored?
        * what does swtch() look like, and who implements it?
        * what is the level of true concurrency?

    (1) kernel-level threading

        --the TCB looks a lot like a PCB
        --[draw picture]
        --thread_create() becomes a syscall
        --swtch() is like a context switch
        --what is the level of true concurrency?
        --when do thread switches happen?
            --with kernel-level threading, they can happen at any point.
        --multiple kernel-level threads can run on multiple processors
          (because it's the kernel that decides what runs on which
          processors, and when)
        --basic game plan for dispatch/swtch():
            --thread is running
            --switch to kernel
            --save thread state (to TCB)
            --Choose new thread to run
            --Load its state (from TCB)
            --new thread is running
        --Can two kernel-level threads execute on two different
          processors? (Answer: yes. a concrete sketch follows.)
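        --[aside; not from the handout] to make the "yes" concrete: on
          Linux, the Posix threads (pthreads) library is implemented 1:1
          over kernel-level threads, so a program like the minimal
          sketch below really can occupy two processors at once. the
          work() function, the loop count, and the counters are invented
          for the example:

            /* sketch, not from the handout: kernel-level threading via
               pthreads. build with: gcc demo.c -pthread */
            #include <pthread.h>
            #include <stdio.h>

            static volatile long counts[2]; /* shared memory: both threads see it */

            void* work(void* arg) {
                long id = (long)arg;
                for (long i = 0; i < 100000000; i++)
                    counts[id]++;   /* CPU-bound loop; the two instances
                                       can run truly in parallel on two
                                       processors */
                return NULL;
            }

            int main(void) {
                pthread_t t1, t2;
                /* each pthread_create() enters the kernel
                   (clone() on Linux) */
                pthread_create(&t1, NULL, work, (void*)0);
                pthread_create(&t2, NULL, work, (void*)1);
                pthread_join(t1, NULL);
                pthread_join(t2, NULL);
                printf("%ld %ld\n", counts[0], counts[1]);
                return 0;
            }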
        --Disadvantages to kernel-level threading:

            --every thread operation (create, exit, join, synchronize,
              etc.) goes through the kernel
                --> 10x-30x slower than user-level threads

            --heavier-weight memory requirements (each thread gets a
              stack in user space *and* within the kernel. compare to
              user-level threads: each thread gets a stack in user
              space, and there's one stack within the kernel that
              corresponds to the process.)

    (2) user-level threading

        --the kernel is totally ignorant of user-level threads. so where
          is the TCB stored?

        --thread_create() allocates a new stack

        --do we need memory space for registers?

        --run-time system:
            --keeps a queue of runnable threads
            --provides a layer above system calls: if a call would
              block, switch and run a different thread instead
            --does the scheduling:
                --thread is running
                --save thread state (to TCB)
                --Choose new thread to run
                --Load its state (from TCB)
                --new thread is running

        --what does swtch() look like?
            --see handout.....

        --what is the level of true concurrency?
            --answer: none. given a process that is using user-level
              threading, **only one instruction in that process can
              execute at a time**.

        --when does swtch() happen? Two options:

            1. Only when a thread calls yield() or would block on I/O

                --This is called *cooperative multithreading* or
                  *non-preemptive multithreading*.

                --Upside: makes it easier to avoid errors from
                  concurrency

                --Downside: harder to program because now the threads
                  have to be good about yielding, and you might have
                  forgotten to yield inside a CPU-bound task.

            2. What if we wanted to make user-level threads switch
               non-deterministically?

                --deliver a periodic timer interrupt or signal to a
                  thread scheduler [setitimer()]. when the scheduler
                  gets its interrupt, swap out the thread.

                --makes it more complex to program with user-level
                  threads

                --in practice, systems aren't usually built this way,
                  but sometimes it is what you want (e.g., if you're
                  simulating some OS-like thing inside a process, and
                  you want to simulate the non-determinism that arises
                  from hardware timer interrupts).

        --Before continuing, we need to clarify *blocking* versus
          *non-blocking* I/O calls.

            --Blocking means that the entity making the call (the thread
              in this case) does not progress past the I/O call (often a
              read() or write()) unless there is data for the thread
              (or, in the case of a write, unless the output channel can
              accommodate the data)

            --Non-blocking means that if the call *would* block, the
              call instead returns with an error, and the thread keeps
              going.

            --(This idea also pertains to the read/write system calls
              exposed by the kernel for the use of a process.)

            --Usually, the *thread* is supposed to see the call as
              blocking. However, there is an important subtlety: the
              other side of that call (e.g., the run-time that created
              the thread abstraction) makes the corresponding system
              call in *non-blocking* mode. That is because, in this
              scenario of user-level threads, if the run-time *did*
              block, it wouldn't be able to run another thread.

        --As an aside, note that the relationship between the run-time
          and the thread is very similar to the relationship between the
          kernel and a process. When a process makes a blocking I/O call
          (most of you have done this at some point in your life --
          pretty much whenever you called read() to get the data in some
          file), the kernel puts the process to sleep until the data
          arrives from the disk. But just as the run-time issues the I/O
          syscall to the kernel in non-blocking mode, the kernel issues
          the I/O request to the disk in non-blocking mode. The reason
          is that if the kernel went to sleep every time it waited on
          data from the disk, then the kernel wouldn't be able to run
          other processes. Put differently, the abstraction of "sleeping
          until there is data available" is presented to the higher
          layer, and the lower layer implements that abstraction by
          simply not running the higher layer until the data is
          available.
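        --[aside; not from the handout] concretely, a run-time on a
          Posix system puts a descriptor in non-blocking mode with
          fcntl() and O_NONBLOCK; after that, a read() that would block
          returns -1 with errno set to EAGAIN (or EWOULDBLOCK) instead
          of putting the caller to sleep. a minimal sketch:

            /* sketch, not from the handout */
            #include <errno.h>
            #include <fcntl.h>
            #include <unistd.h>

            /* flip an already-open descriptor into non-blocking mode */
            int set_nonblocking(int fd) {
                int flags = fcntl(fd, F_GETFL, 0);  /* fetch current flags */
                if (flags == -1)
                    return -1;
                return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
            }

            /* after set_nonblocking(fd):

                 ssize_t n = read(fd, buf, num);
                 if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
                     ... the read *would* have blocked: no data yet.
                     this is exactly the spot where a user-level
                     run-time switches to another thread ...
            */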
        --Let's look at how the above approach is implemented, focusing
          on the register/EIP/stack switching. We will further focus on
          the case of *cooperative* user-level multithreading.

          Basic idea: swtch() is called at "sane" moments, in response
          to a function call from a thread. That function is usually
          yield(), i.e., the call graph usually looks like this:

            fake_read()
                if read would block
                    yield()
                        swtch()

          and the pseudocode looks something like this:

            int fake_read(int fd, char* buf, int num) {
                int nread = -1;
                while (nread == -1) {
                    /* this is a non-blocking read() syscall */
                    nread = read(fd, buf, num);
                    if (nread == -1) { /* read would block */
                        yield();
                    }
                }
                return nread;
            }

            void yield() {
                tid next = pick_next_thread(); /* get a runnable thread */
                tid current = get_current_thread();
                swtch(current, next);
            }

        --to repeat, what "would block" means:
            --in the read direction, it means that there's no data to
              read
            --in the write direction, it means that the output buffers
              are full, so the write cannot happen yet

        --how is swtch() implemented?
            --see handout.....
            --[draw picture of the two stacks]
            --make sure you understand what is going on
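        --[aside; for those reading without the handout] one plausible
          version of a cooperative swtch() on the 32-bit x86, under the
          gcc calling convention from last class, looks like the sketch
          below. this is a guess at the flavor of the handout's code,
          not a copy of it; the sketch assumes each TCB's first field
          holds the thread's saved stack pointer:

            # sketch, not the handout's code.
            # void swtch(struct tcb* current, struct tcb* next);
            # assumption: offset 0 of a tcb is the saved stack pointer.
            .globl swtch
            swtch:
                movl 4(%esp), %eax      # %eax = current's TCB
                movl 8(%esp), %edx      # %edx = next's TCB

                # save the callee-saved registers on the current stack.
                # (the gcc convention makes the callee preserve these;
                #  the caller-saved registers and the return %eip were
                #  already saved by the call itself.)
                pushl %ebp
                pushl %ebx
                pushl %esi
                pushl %edi

                movl %esp, (%eax)       # current->sp = %esp
                movl (%edx), %esp       # %esp = next->sp

                # we are now on next's stack; restore its registers...
                popl %edi
                popl %esi
                popl %ebx
                popl %ebp

                ret                     # ...and pop next's saved %eip

          the key point: after the two movl instructions that swap
          %esp, execution is on the new thread's stack, so the ret
          returns into whatever the *new* thread was doing when it last
          called swtch().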
        --How do we switch threads in a non-cooperative context?

          In a non-cooperative context, a thread could be switched out
          at any moment, so its state is not neatly arranged on the
          stack per the call graph. But in that case, the OS would have
          put some of the thread's registers in a trap frame, and the
          run-time can yank those registers, save them (and the other
          registers) in the TCB or on the thread's regular stack, and
          then restore them later.

          Said differently, thread switching by the user-level run-time
          looks a lot like process switching by the kernel.

        Notes/questions:

            --In the kernel's PCB, only one set of registers is
              stored.....
            --QUESTION: where are the other registers, for the other
              threads?

        Disadvantages to user-level threads:

            --Can we imagine having two user-level threads truly
              executing at once, that is, on two different processors?
              (Answer: no. why?)

            --What if the OS handles page faults for the process? (then
              a page fault in one thread blocks all threads).
                --(not a huge issue in practice)

            --Similarly, if a thread needs to go to disk, then that
              actually blocks *all* threads (since the kernel won't
              allow the run-time to make a non-blocking read() call to
              the disk). So what do we do about this?
                --extend the API; or
                --live with it; or
                --use elaborate hacks with memory-mapped files (e.g.,
                  files are all memory-mapped, and the run-time asks to
                  handle its own page faults, if the OS allows it)

        --[SKIP IN CLASS] Old debates about user-level threading vs.
          kernel-level threading. The "Scheduler Activations" paper, by
          Anderson et al. [ACM Transactions on Computer Systems 10, 1
          (February 1992), pp. 53--79], proposes an abstraction that is
          a hybrid of the two.

            --basically, the OS tells the process: "I'm ready to give
              you another virtual CPU (or to take one away from you);
              which of your user-level threads do you want me to run?"

            --so the user-level scheduler decides which threads run, but
              the kernel takes care of multiplexing them

        --[COVER LATER] Some people think that threads, i.e., concurrent
          applications, shouldn't be used at all (because of the many
          bugs and difficult cases that come up, as we'll discuss).
          However, that position is becoming increasingly less tenable,
          given multicore computing.

          The fundamental reason is this: if you have a
          computation-intensive job that wants to take advantage of all
          of the hardware resources of a machine, you either need to
          (a) structure the job as different processes, or (b) use
          kernel-level threading. There is no other way, given
          mainstream OS abstractions, to take advantage of a machine's
          parallelism. (a) winds up being inconvenient (in order to
          share data, the processes either have to separately set up
          shared memory regions, or else pass messages). So people use
          (b).

        Quick comparison between user-level threading and kernel-level
        threading:

            (i) high-level choice: user-level or kernel-level (but one
                can have N:M threading, in which N user-level threads
                are multiplexed over M kernel threads, so the choice is
                a bit fuzzier)

            (ii) if user-level, there's another choice: non-preemptive
                 (also known as cooperative) or preemptive

                 [be able to answer: why are kernel-level threads
                 always preemptive?]

            --*Only* the presence of multiple kernel-level threads can
              give:
                --true multiprocessing (i.e., different threads running
                  on different processors)
                --asynchronous disk I/O using the Posix interface
                  [because read() blocks and causes the *kernel*
                  scheduler to be invoked]

            --but many modern operating systems provide interfaces for
              asynchronous disk I/O, at least as an extension
                --Windows
                --Linux has AIO extensions

                --thus, even user-level threads can get asynchronous
                  disk I/O, by having the run-time translate calls that
                  *appear* blocking to the thread [e.g., thread_read()]
                  into a series of instructions that: register interest
                  in an I/O event, put the thread to sleep, and swtch()
                  to another thread

                --[moral of the story: if you find yourself needing
                  async disk I/O from user-level threads, use one of
                  the non-Posix interfaces!]

        Historical notes -- a classification:

                                    # address spaces:
                                    one                 many
            # threads per
            addr space:
            one                     MS-DOS,             traditional Unix
                                    Palm OS

            many                    Embedded systems,   VMS, Mach, NT,
                                    Pilot               Solaris, HP-UX, ...

            (Pilot was the OS on the first personal computer ever built
            -- the Alto. the idea was that there was no need for
            protection if there was only one user.)

D. The use of threads

[thanks to David Mazieres for content in portions of this lecture.]