Class 9
CS 372H  15 February 2011

On the board
------------

(One handout)

1. Review and clarify threads
        A. Kernel-level
        B. User-level
        C. Scheduling/interleaving

2. Concurrency
        --What is it?
        --What makes it hard?
        --How can we deal with races?

3. Protecting critical sections
        --Peterson
        --Next time: Spinlocks, Mutexes, Turning off interrupts

---------------------------------------------------------------------------

1. Review/clarify threads

    --abstraction. high-level motivation:
        --want to have a single process taking advantage of multiple
          CPUs; and/or
        --sometimes natural to structure a computation in terms of
          separate units of control that share memory

    --thread interface:
        --thread_create, thread_join, thread_exit
        --allocates a stack
        --and a TCB (thread control block; analogy with PCB)
        --(a minimal pthreads sketch of this interface appears at the end
          of section A below)

    --one way to understand a given implementation of threads is by
      answering:
        * where is the TCB stored?
        * what does swtch() look like, and who implements it?
        * what is the level of true concurrency?

    --please see notes from last time regarding user-level threading

    --today, we'll quickly go over what happens when the kernel implements
      threads, and then briefly review the stack switching in user-level
      threads

 A. kernel-level threading

    --Kernel maintains TCBs
        --looks a lot like a PCB
        --[Draw picture]

    --thread_create() becomes a syscall

    --when do thread switches happen?
        --with kernel-level threading, they can happen at any point

    --basic game plan for dispatch/swtch:
        --thread is running
        --switch to kernel
        --save thread state (to TCB)
        --choose new thread to run
        --load its state (from TCB)
        --new thread is running

    --Can two kernel-level threads execute on two different processors?
      (Answer: yes.)

    --Disadvantages of kernel-level threading:

        --every thread operation (create, exit, join, synchronize, etc.)
          goes through the kernel
            --> 10x-30x slower than user-level threads

        --heavier-weight memory requirements (each thread gets a stack in
          user space *and* within the kernel. compare to user-level
          threads: each thread gets a stack in user space, and there's
          one stack within the kernel that corresponds to the process.)

    --[SKIP IN CLASS] Old debates about user-level threading vs.
      kernel-level threading. The "Scheduler Activations" paper, by
      Anderson et al. [ACM Transactions on Computer Systems 10, 1
      (February 1992), pp. 53-79], proposes an abstraction that is a
      hybrid of the two.

        --basically the OS tells the process: "I'm ready to give you
          another virtual CPU (or to take one away from you); which of
          your user-level threads do you want me to run?"

        --so the user-level scheduler decides which threads run, but the
          kernel takes care of multiplexing them

    --[COVER LATER] Some people think that threads, i.e., concurrent
      applications, shouldn't be used at all (because of the many bugs
      and difficult cases that come up, as we'll discuss). However, that
      position is becoming increasingly less tenable, given multicore
      computing.

        --The fundamental reason is this: if you have a
          computation-intensive job that wants to take advantage of all
          of the hardware resources of a machine, you either need to
          (a) structure the job as different processes, or (b) use
          kernel-level threading. There is no other way, given mainstream
          OS abstractions, to take advantage of a machine's parallelism.
          (a) winds up being inconvenient (in order to share data, the
          processes either have to separately set up shared memory
          regions, or else pass messages). So people use (b).

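    [Aside, added to these notes: a minimal sketch of the thread interface
    above, using the POSIX threads API. pthread_create / pthread_join /
    pthread_exit play the roles of thread_create / thread_join /
    thread_exit; on Linux, each pthread is backed by a kernel-level
    thread, so this is an instance of case A.]

        /* threads_demo.c -- compile with: gcc threads_demo.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>

        static void *worker(void *arg)
        {
            /* each thread runs on its own stack; 'arg' is the last
               argument passed to pthread_create() */
            long id = (long)arg;
            printf("hello from thread %ld\n", id);
            return NULL;            /* same effect as pthread_exit(NULL) */
        }

        int main(void)
        {
            pthread_t t[2];

            /* "thread_create": allocate a stack and a TCB, start worker() */
            for (long i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);

            /* "thread_join": wait for each thread to exit */
            for (int i = 0; i < 2; i++)
                pthread_join(t[i], NULL);

            return 0;
        }
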
 B. User-level threading

    go over:
        * where is the TCB stored?
        * what does swtch() look like, and who implements it?
        * what is the level of true concurrency?

    --notes from last time tell you under what circumstances swtch() is
      called.

    --we'll quickly go over the implementation
        --see handout.....

    --what is the level of true concurrency?
        --answer: none. given a process that is using user-level
          threading, **only one instruction in that process can execute
          at a time**.

 C. Scheduling/interleaving threads

    --Dispatcher can choose:
        --to run each thread to completion
        --to time-slice in big chunks
        --to time-slice so that each thread executes only one instruction
          at a time

    --Programs must work in all cases, for all interleavings

    --So how can you know if your concurrent program works? Whether *all*
      interleavings work?

        1. Enumerate and test all possibilities? (Not feasible.)

        2. Instead, maintain *invariants* on program state; structure the
           program carefully to maintain these invariants

    --General strategy for dealing with concurrency:

        --use *atomic actions* [meaning the action is indivisible,
          regardless of how things are interleaved] to....

        --....build higher-level abstractions....
            --example: mutexes

        --....that provide invariants we can reason about....
            --example: only one thread of control is modifying a linked
              list at once (a sketch of this appears at the end of this
              section)

    --This is our transition to the general topic of concurrency, which
      will occupy us for a little while

    --Note that the issues with concurrency that we're going to discuss
      are relevant to all of the threading cases above. (A possible
      exception is non-preemptive user-level threads, which yield only
      when the programmer says yield(). However, it's easy to make
      mistakes, so it is best to assume that the issues of concurrency
      that we're going to discuss *always* apply if there are multiple
      execution contexts that share memory.)

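    [Aside, added to these notes: a sketch of the "only one thread
    modifies the linked list at a time" invariant, built from a mutex.
    pthread_mutex_* is the POSIX version of the mutex_lock/mutex_unlock
    interface mentioned at the end of these notes (and covered next
    time); list_push and node are made-up names for illustration. The
    handout's example 3, "incorrect list structure", is presumably a race
    of this kind.]

        /* list_mutex.c -- compile with: gcc list_mutex.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct node {
            int value;
            struct node *next;
        };

        static struct node *head;       /* shared list */
        static pthread_mutex_t list_mu = PTHREAD_MUTEX_INITIALIZER;

        static void list_push(int value)
        {
            struct node *n = malloc(sizeof(*n));
            n->value = value;

            pthread_mutex_lock(&list_mu);
            n->next = head;      /* critical section: without the mutex, */
            head = n;            /* two concurrent pushes can interleave */
            pthread_mutex_unlock(&list_mu);  /* here and lose a node     */
        }

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 100000; i++)
                list_push(i);
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            int count = 0;
            for (struct node *n = head; n != NULL; n = n->next)
                count++;
            printf("list has %d nodes (expect 200000)\n", count);
            return 0;
        }
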
---------------------------------------------------------------------------

[This material between the dashed lines is not going to be covered in
class. It is for your own reference. It may or may not be helpful in
studying.]

Quick comparison between user-level threading and kernel-level threading:

    (i). high-level choice: user-level or kernel-level
         (but can have N:M threading, in which N user-level threads are
         multiplexed over M kernel threads, so the choice is a bit
         fuzzier)

    (ii). if user-level, there's another choice: non-preemptive (also
          known as cooperative) or preemptive

          [be able to answer: why are kernel-level threads always
          preemptive?]

    --*Only* the presence of multiple kernel-level threads can give:
        --true multiprocessing (i.e., different threads running on
          different processors)
        --asynchronous disk I/O using the POSIX interface [because read()
          blocks and causes the *kernel* scheduler to be invoked]

        --but many modern operating systems provide interfaces for
          asynchronous disk I/O, at least as an extension
            --Windows
            --Linux has AIO extensions

            --thus, even user-level threads can get asynchronous disk
              I/O, by having the run-time translate calls that *appear*
              blocking to the thread [e.g., thread_read()] into a series
              of instructions that: register interest in an I/O event,
              put the thread to sleep, and swtch() to another thread

            --[moral of the story: if you find yourself needing async
              disk I/O from user-level threads, use one of the non-POSIX
              interfaces!]

Quick terminology note:

    --The kernel itself uses threads internally, when executing in kernel
      mode. Such threads-in-the-kernel are related to, but not the same
      thing as, the kernel-level threading mentioned above.

    --We'll try to keep these concepts distinct in this class, but we may
      not always succeed.

Historical notes: classification:

                            # address spaces
                            one                     many
    # threads per
    addr space

        one                 MS-DOS                  traditional Unix
                            Palm OS

        many                Embedded systems,       VMS, Mach, NT,
                            Pilot (OS on first      Solaris, HP-UX, ...
                            personal computer
                            ever built -- the
                            Alto. idea was there
                            was no need for
                            protection if there
                            was only one user.)

---------------------------------------------------------------------------

2. Concurrency

 A. What is it?

    --Stuff happening at the same time

    --Arises in many ways
        --pseudo-concurrency: from scheduling
        --real concurrency: multiple processors

    --Examples:
        --multiple kernel threads within a process
        --multiple processes sharing memory
        --what about multiple hosts distributed across a network?
          (conceptually, the issues are the same, but the needed
          mechanisms are different)

    --We're going to treat the issues in general...they apply to
      processes sharing memory pages, kernel threads sharing memory
      spaces, user-level threads that are preemptible, etc.

    --so for the rest of today, we're going to talk about two threads,
      but this could mean:
        --threads inside a single process
        --threads inside the kernel
        --even two separate processes that share memory

 B. What makes it hard?

    --lots of things can go wrong.....
        --we will see others later (deadlock, priority inversion, etc.)
        --for now, look at data races....

    --some examples; see handout:
        2a: x = 1 or x = 2.
        2b: x = 13 or x = 25.
        2c: x = 1 or x = 2 or x = 3
        3:  incorrect list structure
        4:  incorrect count in buffer

    --all of these are called *race conditions*; not all are errors,
      though.

    --the worst part of errors from race conditions is that a program may
      work fine most of the time but only occasionally show problems.
      why? (because the instructions of the various threads or processes
      or whatever get interleaved in a non-deterministic order.)

    --and it's worse than that, because inserting debugging code may
      change the timing so that the bug doesn't show up

 C. How can we deal with races?

    --make the needed operations atomic

    --how?

    1. A single-instruction add?

        'count' is in memory (that is what the example in #4 stipulates).
        assume that %ecx holds the address of 'count'.

        --Then, can we use the x86 instruction addl? For instance:

                addl $1, (%ecx)         ; count++

        --So it looks like we can implement count++/-- with one
          instruction?

        --So we're safe?

        --No: not atomic on a multiprocessor!
            --Will experience the same race condition at the hardware
              level

    2. How about using the x86 LOCK prefix?

        --can make read-modify-write instructions atomic by preceding
          them with "LOCK". examples of such instructions are: XADD,
          CMPXCHG, INC, DEC, NOT, NEG, ADD, SUB... (when their
          destination operand refers to memory)

        --but using LOCK is very expensive (flushes processor caches) and
          is not a "general-purpose abstraction"
            --it applies to only one instruction: what if we need to
              execute three or four instructions as a unit?
            --the compiler won't generate it by default; it assumes you
              don't want the penalty
            --(a small demo of a LOCK-prefixed increment appears at the
              end of this section)

    3. Critical sections

        --Place count++ and count-- in a critical section

        --Protect critical sections from concurrent execution

        --Now we need a solution to the _critical section_ problem

        --The solution must satisfy 3 rules:

            1. mutual exclusion
                only one thread can be in the c.s. at a time

            2. progress
                if no thread is executing in the c.s., one of the threads
                trying to enter a given c.s. will eventually get in

            3. bounded waiting
                once a thread T starts trying to enter the critical
                section, there is a bound on the number of other threads
                that may enter the critical section before T enters

        --Note progress vs. bounded waiting
            --If no thread can enter the C.S., we don't have progress
            --If thread A is waiting to enter the C.S. while B repeatedly
              leaves and re-enters the C.S. ad infinitum, we don't have
              bounded waiting

    --Game plan: we're now going to build primitives to protect critical
      sections

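    [Aside, added to these notes: a small demo contrasting a plain
    count++ with an increment that compiles to a LOCK-prefixed
    instruction. __sync_fetch_and_add() is a GCC/Clang builtin; on x86 it
    emits something like "lock addl" / "lock xadd". The final value of
    the racy counter varies from run to run on a multiprocessor.]

        /* atomic_add.c -- compile with: gcc -O2 atomic_add.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>

        #define N 1000000

        static int racy_count;      /* incremented with plain count++    */
        static int atomic_count;    /* incremented with a LOCKed add     */

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < N; i++) {
                racy_count++;       /* load; add; store -- a data race   */
                __sync_fetch_and_add(&atomic_count, 1);  /* atomic r-m-w */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            printf("racy_count   = %d\n", racy_count);
            printf("atomic_count = %d (expect %d)\n", atomic_count, 2 * N);
            return 0;
        }
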
3. Protecting critical sections

    --Peterson's algorithm....
        --see book
        --does satisfy mutual exclusion, progress, and bounded waiting
        --But expensive and not encapsulated
        --(a sketch of the classic two-thread version appears at the end
          of these notes)

    --High-level:

        --want: lock()/unlock() or enter()/leave() or acquire()/release()

        --lots of names for the same idea

            --mutex_init(mutex_t* m), mutex_lock(mutex_t* m),
              mutex_unlock(mutex_t* m), ....

            --pthread_mutex_init(), pthread_mutex_lock(), ...

        --in each case, the semantics are that once a thread of execution
          is executing inside the critical section, no other thread of
          execution is executing there

    --How to implement locks/mutexes/etc.?
        --Next time....

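    [Aside, added to these notes: a sketch of the classic two-thread
    Peterson's algorithm. The flag/turn protocol is the textbook version;
    the __sync_synchronize() fences are an addition that the classic
    presentation omits but that real x86 hardware and optimizing
    compilers require (stores may otherwise be reordered past later
    loads), which is one more illustration of "expensive and not
    encapsulated".]

        /* peterson.c -- compile with: gcc -O2 peterson.c -lpthread */

        #include <pthread.h>
        #include <stdio.h>

        static volatile int flag[2];  /* flag[i] == 1: thread i wants in  */
        static volatile int turn;     /* who defers if both want in       */
        static int count;             /* the shared state being protected */

        static void peterson_lock(int i)
        {
            int other = 1 - i;
            flag[i] = 1;              /* announce intent                  */
            turn = other;             /* give the other thread priority   */
            __sync_synchronize();     /* fence: the stores above must be
                                         visible before the loads below   */
            while (flag[other] && turn == other)
                ;                     /* spin while the other thread is
                                         interested and it is its turn    */
            __sync_synchronize();     /* keep the critical section from
                                         being hoisted above the spin     */
        }

        static void peterson_unlock(int i)
        {
            __sync_synchronize();     /* flush critical-section writes    */
            flag[i] = 0;
        }

        static void *worker(void *arg)
        {
            int i = (int)(long)arg;   /* thread id: 0 or 1                */
            for (int n = 0; n < 1000000; n++) {
                peterson_lock(i);
                count++;              /* critical section                 */
                peterson_unlock(i);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t0, t1;
            pthread_create(&t0, NULL, worker, (void *)0L);
            pthread_create(&t1, NULL, worker, (void *)1L);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);
            printf("count = %d (expect 2000000)\n", count);
            return 0;
        }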