Class 23
CS 372H
17 April 2012

On the board
------------

1. Last time

2. Transactions: intro

3. Transactions: crash recovery

4. Transactions: isolation

5. Discuss TxOS paper

---------------------------------------------------------------------------

1. Last time

    --finished crash course in networking
    --discussed Clark network architecture paper

2. Transactions: intro

    --consider a system with some complex data structures in memory and
    on the disk

        what kind of systems are we talking about? well, these ideas
        apply to file systems, which may have complex on-disk structures.
        but they apply far more widely, and in fact many of these ideas
        were developed in the context of databases. one confusing detail
        is that databases often request a raw block interface to the
        disk, thereby bypassing the file system.

        so one way to think about this mini-unit is that the "system"
        identified above could be a file system (often running inside the
        kernel) or else a database (often running in user space). in
        fact, such a "system" could be the state of the Linux kernel, as
        in TxOS.

    --want to group operations and provide a programming model like:

        begin_tx()
            deposit_account(acct 2, $30);
            withdraw_account(acct 1, $30);
        end_tx()

      probably okay to do neither deposit nor withdrawal, definitely okay
      to do both. not okay to do one or the other.

    --most of you will run into this idea if you do database programming
    --but it's bigger than DBs
        --arguably, LFS is using transactions (but why is a bit subtle
        and has to do with the precise use of the segment summary block)
    --we can even have the kernel expose transactions (regarding [a
    subset of] its own state) to applications:

        sys_begin_transaction()
            syscall1();
            syscall2();
        sys_end_transaction();

      (today's paper!)

        --very nice for, say, software install: wrap the entire install()
        program in a transaction. so all-or-nothing.

    --[aside: could imagine having the hardware export a transactional
    interface to the OS:

        begin_transaction()
            write_mem();
            write_mem();
            read_mem();
            write_mem();
        end_transaction();

        --this is called *transactional memory*. lots of research on this
        in the last 10 years.
        --cleaner way of handling concurrency than using locking
        --but nasty interactions with I/O devices, since you need locks
        for I/O devices (can't roll back once you've emitted output or
        taken input)]

    --what do we want out of this programming model?

        --informally, transactions are typically described as providing
        four properties, known as "ACID semantics":

            A: atomicity
            C: consistency
            I: isolation
            D: durability

        --but there are different levels of isolation, consistency, etc.,
        so "ACID" alone doesn't tell the whole story

        (--we're only going to scratch the surface of this material. a
        course on DBs will give you far more info.)

        --what the heck is the difference between "A" and "I"? don't they
        sound similar?

            --A: atomicity. think of it as "all-or-nothing atomicity"
            --I: isolation. think of it as "before-or-after atomicity"

            --A means: "if there's a crash, it looks to everyone after
            the crash as if the transaction either fully completed or
            didn't start"
            --I is a response to concurrency. it means: "the transactions
            should appear as if they executed in serial order"

        --"C" is not hard to provide from our perspective: just don't do
        anything dumb inside the transaction (gross oversimplification)
        --"D" is also not too hard, if we're logging our changes

    --two problems that we'll focus on:

        (1) how can the system provide crash recovery? (the A in ACID)
        (2) how can the system provide isolation? (the I in ACID)

        --we'll discuss (1) in the context of a sketched crash-recovery
        protocol, and (2) in the context of TxOS
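    --before moving on, here is the deposit/withdraw example above as a
    minimal sketch in C. the tx_begin()/tx_abort()/tx_commit() API is
    hypothetical (illustrative names, not from any particular system;
    TxOS's actual interface appears later):

        /* hypothetical transaction API -- illustrative names only */
        void tx_begin(void);
        void tx_commit(void);  /* commit point: effects survive a crash */
        void tx_abort(void);   /* undo everything since tx_begin() */

        typedef struct { long balance; } account_t;

        int withdraw(account_t *a, long amt) {
            if (a->balance < amt)
                return -1;     /* an illegal value: a reason to abort */
            a->balance -= amt;
            return 0;
        }

        void deposit(account_t *a, long amt) {
            a->balance += amt;
        }

        /* all-or-nothing: both updates happen, or neither does */
        void transfer(account_t *from, account_t *to, long amt) {
            tx_begin();
            if (withdraw(from, amt) < 0) {
                tx_abort();    /* "okay to do neither" */
                return;
            }
            deposit(to, amt);
            tx_commit();       /* "definitely okay to do both" */
        }

      note that the crash-recovery and isolation machinery below is what
      makes tx_commit() and tx_abort() actually mean something.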
    --recap: of the four classical ACID properties listed above, we'll
    focus on the "A" and the "I", and briefly describe how to provide
    these two types of atomicity

3. Transactions: crash recovery ("Atomicity")

    setting: DBs
        --basically, a bunch of tables implemented with complex on-disk
        structures
        --want to be able to make sensible modifications to those
        structures
        --can crash at any point

    assume no concurrency for now.....

    our challenge is that a crash can happen at any point, but we want
    the state to always look consistent

    the log is the authoritative copy. it helps get the on-disk
    structures to a consistent state after a crash.

    log record types:

        BEGIN_TRANS(trans_id)
        END_TRANS(trans_id)
        CHANGE(trans_id, "redo action", "undo action")
        OUTCOME(trans_id, COMMIT|ABORT)

    entries:

        SEQ #:
        TYPE: [begin/end/change/abort/commit]
        TRANS ID:
        PREV_LSN:
        REDO action:
        UNDO action:

    [DRAW PICTURE OF THE SOFTWARE STRUCTURE:

        APPLICATION OR USER
          (what is the interface to the transaction system?)
        ----------------------------------
        TRANSACTION SYSTEM
          (what is the interface to the next layer down?)
        ----------------------------------
        DISK = LOG + CELL/STABLE/NV STORAGE]

    example: application does:

        BEGIN_TRANS
        CHANGE_STREET
        CHANGE_ZIPCODE
        END_TRANS

      or:

        BEGIN_TRANS
        DEBIT_ACCOUNT 123, $100
        CREDIT_ACCOUNT 456, $100
        END_TRANS

    [SHOW HOW LOTS OF TRANSACTIONS ARE INTERMINGLED IN THE LOG.]

    --why do aborts happen?
        --say because at the end of the transaction, the transaction
        system realizes that some of the values were illegal
        --(in the isolation discussion, other reasons will surface)

    --why have a separate END record (instead of using OUTCOME)? (will
    see below)

    --why BEGIN_TRANS != first CHANGE record? (makes it easier to
    explain, and may make recovery simpler, but no fundamental reason.)

    --concept: commit point: the point at which there's no turning back.

        --actions always look like this:

            --first step
            ....            [can back out, leaving no trace]
            --commit point
            ....            [completion is inevitable]
            --last step

        --what's the commit point when buying a house? when buying a pair
        of shoes? when getting married?

        --what's the commit point here? (when the OUTCOME(COMMIT) record
        is in the log. so, better log the commit record *on disk* before
        you tell the user of the transaction that it committed! get that
        wrong, and the user of the transaction would proceed on a false
        premise, namely that the so-called committed action will be
        visible after a crash.)

    --note: maintain some invariants:

        --always, always, always log the change before modifying the
        non-volatile storage, also known as cell storage (this is what we
        mean by "write-ahead log")

            [per Saltzer and Kaashoek (see the end of the notes for the
            citation), the golden rule of atomicity: "never modify the
            only copy!". variant: "log the update *before* installing
            it"]

        --END means that the changes have been installed

        --[*] observe: no required ordering between the commit record and
        installation (can install at any time for a long-running
        transaction, or can write the commit record and very lazily
        update the non-volatile storage.)

            --wait, what IS the required ordering? (just that the
            change/update record has to be logged before the cell storage
            is changed)

            --in fact, the changes to cell storage do not even have to be
            propagated in the order of the log, as long as cell storage
            eventually reflects what's in RAM and the order in the log.
            that's because during normal operation, RAM has the right
            answers, and because during crash recovery, the log is
            replayed in order. so the order of transactions with respect
            to each other is preserved in cell storage in the long run.
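    --to make the write-ahead rule concrete, here is a minimal sketch in
    C of a CHANGE record and the logging discipline. the layout and the
    helpers (next_lsn(), log_append(), log_force(), cell_write(), etc.)
    are illustrative assumptions, not from any real system; redo/undo are
    blind writes, a point we return to below:

        #include <stdint.h>

        enum rec_type { REC_BEGIN, REC_CHANGE, REC_OUTCOME, REC_END };

        /* one log entry, with the fields listed above */
        struct log_record {
            uint64_t      lsn;       /* SEQ # */
            enum rec_type type;
            uint32_t      tid;       /* TRANS ID */
            uint64_t      prev_lsn;  /* prior record of this transaction */
            uint32_t      cell;      /* which cell the CHANGE touches */
            int64_t       redo_val;  /* REDO: "put redo_val in cell" */
            int64_t       undo_val;  /* UNDO: "put undo_val in cell" */
        };

        /* hypothetical helpers */
        uint64_t next_lsn(void);
        uint64_t last_lsn_of(uint32_t tid);
        void log_append(struct log_record *r); /* append to on-disk log */
        void log_force(void);                  /* force the log to disk */
        void cell_write(uint32_t cell, int64_t val); /* touch cell storage */

        /* the golden rule in code: the CHANGE record reaches the disk
           *before* the install touches cell storage */
        void change(uint32_t tid, uint32_t cell, int64_t newv, int64_t oldv) {
            struct log_record r = {
                .lsn = next_lsn(), .type = REC_CHANGE, .tid = tid,
                .prev_lsn = last_lsn_of(tid),
                .cell = cell, .redo_val = newv, .undo_val = oldv,
            };
            log_append(&r);
            log_force();
            cell_write(cell, newv); /* install; per the note above, this
                                       step could also be deferred */
        }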
    --***NOTE: below we are going to make the unrealistic assumption
    that, if the disk blocks that correspond to cell storage are cached,
    then that cache is write-through

        --this is different from whatever in-memory structures the
        database (or more generally, the transactional system) uses.
        those in-memory structures don't matter for the purposes of this
        discussion

    --make sure it's okay if you crash during recovery
        --the procedure needs to be idempotent
        --how? (answer: records are expressed as "blind writes" (for
        example, "put 3 in cell 5", rather than something like "increment
        the value of cell 5"). in other words, records shouldn't make
        reference to the previous value of the modified cell.)

    --now say a crash happens. a (relatively) simple recovery protocol
    goes like this:

        --scan backward looking for "losers" (actions that are part of
        transactions that don't have an END record)

            --if you encounter a CHANGE record that corresponds to a
            losing transaction, then apply its UNDO [even if it was
            committed]

        --then scan forward, starting at the beginning

            --if you encounter a CHANGE record that corresponds to a
            committed transaction (which you learned about from the
            backward scan) for which there is no END record, then REDO
            the action

        --subtle detail: for all losing transactions, log END_TRANS

            --why? consider the following sequence, with time increasing
            from left to right:

                R1      A1 A2 A3        R2

              assume:
                R1 is an instance of recovery
                A1..A3 are some committed, propagated actions (that have
                  END records). these actions affect the same cells in
                  cell storage as some transactions that are UNDOne in R1
                R2 is an instance of recovery

              if there were no END_TRANS logged for the transactions
              UNDOne in R1, then it's possible that UNDOs would be
              processed in R2 that would undo legitimate changes to cell
              storage by A1..A3, and meanwhile A1..A3 won't get redone
              because they have END records logged (by assumption).

        [NOTE: the above is not a typo. the reason that it's okay to UNDO
        a committed action in the backward scan is that the forward scan
        will REDO the committed actions. this is actually exactly what we
        want: the state of cell/NV storage is unwound in LIFO order to
        make it as though the losing transactions never happened. then,
        the ones that actually did commit get applied in order. the
        sketch below walks through both passes.]
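    --here is the two-pass protocol as a minimal sketch in C. the record
    layout is simplified from the one above, and the traversal and
    bookkeeping helpers (log_last(), mark_ended(), etc.) are illustrative
    assumptions; because CHANGE records are blind writes, running this
    twice is harmless (idempotent):

        struct rec {
            int  type;      /* BEGIN, CHANGE, OUTCOME, END */
            int  tid;
            int  outcome;   /* COMMIT or ABORT, for OUTCOME records */
            int  cell;      /* for CHANGE records */
            long redo_val;  /* blind write: "put redo_val in cell" */
            long undo_val;  /* blind write: "put undo_val in cell" */
        };

        /* hypothetical log traversal and bookkeeping helpers */
        struct rec *log_first(void);
        struct rec *log_last(void);
        struct rec *log_next(struct rec *r);
        struct rec *log_prev(struct rec *r);
        void mark_ended(int tid);
        void mark_committed(int tid);
        int  ended(int tid);
        int  committed(int tid);
        void cell_write(int cell, long val);
        void log_end_for_losers(void); /* the "subtle detail" above */

        void recover(void) {
            /* pass 1: scan backward; undo CHANGEs of losers (no END
               record) -- even committed ones. END and OUTCOME records
               come after a transaction's CHANGEs in the log, so the
               backward scan sees them first. */
            for (struct rec *r = log_last(); r != 0; r = log_prev(r)) {
                if (r->type == END)
                    mark_ended(r->tid);
                else if (r->type == OUTCOME && r->outcome == COMMIT)
                    mark_committed(r->tid);
                else if (r->type == CHANGE && !ended(r->tid))
                    cell_write(r->cell, r->undo_val);
            }
            /* pass 2: scan forward; redo CHANGEs of committed
               transactions that lack an END record */
            for (struct rec *r = log_first(); r != 0; r = log_next(r)) {
                if (r->type == CHANGE && committed(r->tid) && !ended(r->tid))
                    cell_write(r->cell, r->redo_val);
            }
            /* log END_TRANS for each loser, so a later recovery (R2
               above) doesn't re-undo what pass 2 just restored */
            log_end_for_losers();
        }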
    --observe: recovery has multiple stages, involving both undo and
    redo. logging also requires both undo and redo. what if we wanted
    only one of these two types of logging?

        (1) say we wanted only undo logging. what requirement would we
        have to place on the application?

            --perform all installs *before* logging the OUTCOME record
            --in that case, there is never a need to redo
            --recovery just consists of undoing all of the actions for
            which there is no OUTCOME record....and
            --logging an OUTCOME(abort) record

            this scheme is called *undo logging* or *rollback recovery*

        (2) say we wanted only redo logging. what requirement would we
        have to place on the application?

            --perform installs only *after* logging the OUTCOME record
            --in that case, there is never a need to undo
            --recovery just consists of scanning the log and applying all
            of the redo actions for actions that:
                --committed; but
                --do not have an END record

            this scheme is called *redo logging* or *roll-forward
            recovery*

    --checkpoints make recovery faster
        --but they introduce complexity
        --a non-write-through cache of disk pages makes things even more
        complex

4. Transactions: isolation

    --easiest approach: one giant lock. only one transaction active at a
    time. so everything really is serialized.
        --advantage: easy to reason about
        --disadvantage: no concurrency

    --next approach: fine-grained locks (e.g., one per cell, or
    per-table, or whatever), and acquire all needed locks at
    begin_transaction and release all of them at end_transaction
        --advantage: easy to reason about. works great if it can be
        implemented.
        --disadvantage: requires the transaction to know all of its
        needed locks in advance

    --actual approach: two-phase locking. gradually acquire locks as
    needed inside the transaction manager (phase 1), and then release all
    of them together at the commit point (phase 2). (a sketch follows at
    the end of this section.)

        --your intuition from the concurrency unit will tell you that
        this creates a problem....namely deadlock. we'll come back to
        that in a second.

        --why does this actually preserve a serial ordering? here's an
        informal argument:

            --consider the lock point, that is, the point in time at
            which the transaction owns all of the locks it will ever
            acquire.

            --consider any lock that is acquired. call it l. from the
            point that l is acquired to the lock point, the application
            always sees the same values for the data that lock l protects
            (because no other transaction or thread can get the lock).

            --so regard the application as having done all of its reads
            and writes instantly at the lock point.

            --the lock points create the needed serialization. here's
            why, informally. regard the transactions as having taken
            place in the order given by their lock points. okay, but how
            do we know that the lock points serialize? answer: each lock
            point takes place at an instant, and any lock points with
            intersecting lock sets must be serialized with respect to
            each other as a result of the mutual exclusion given by
            locks.

        --to fix deadlock, several possibilities:
            --one of them is to remove the no-preempt condition: have the
            transaction manager abort transactions (roll them back) after
            a timeout
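    --to make the two phases concrete, here is a minimal sketch in C of a
    transaction manager's lock handling. all names (tx_touch(),
    lock_acquire(), write_commit_record(), etc.) are illustrative
    assumptions, and the lock set is a fixed-size array for brevity:

        struct lock;                      /* opaque; hypothetical helpers */
        void lock_acquire(struct lock *l);
        void lock_release(struct lock *l);
        void write_commit_record(void);   /* the commit point (section 3) */

        #define MAX_LOCKS 64

        struct tx {
            struct lock *held[MAX_LOCKS]; /* the growing lock set */
            int nheld;                    /* assume < MAX_LOCKS */
        };

        static int holds(struct tx *t, struct lock *l) {
            for (int i = 0; i < t->nheld; i++)
                if (t->held[i] == l)
                    return 1;
            return 0;
        }

        /* phase 1: called on every access; locks are acquired as needed
           and never released before the commit point */
        void tx_touch(struct tx *t, struct lock *l) {
            if (!holds(t, l)) {
                lock_acquire(l);          /* may block; deadlock handled
                                             by timeout + abort, as above */
                t->held[t->nheld++] = l;
            }
        }

        /* phase 2: at the commit point, release everything at once */
        void tx_commit(struct tx *t) {
            write_commit_record();
            for (int i = 0; i < t->nheld; i++)
                lock_release(t->held[i]);
            t->nheld = 0;
        }

      the lock point in the argument above is wherever the last
      lock_acquire() in tx_touch() happened to occur.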
    Loose end: checkpoints

        --we haven't incorporated checkpoints into our recovery protocol.
        they help performance (less work to do on crash recovery), but,
        in their full generality, they add complexity. however, it's
        usually possible to reduce the complexity by using redo logging.
        maintain a trailing pointer in the W.A.L. to mean: "everything
        before here has been propagated". that trailing pointer
        identifies the current checkpoint state.

    Source for a lot of this material: J. H. Saltzer and M. F. Kaashoek,
    Principles of Computer System Design: An Introduction, Morgan
    Kaufmann, Burlington, MA, 2009. Chapter 9. Available online.

5. Discuss TxOS paper

    --good case study of reading a technical and involved systems paper.
    here's one way to read a paper like this. answer the following
    questions for yourself:

        --what is the top-level motivation?
        --what are their goals?
        --what are their non-goals?
        --what is the interface that they provide, and what are its
        semantics?
        --what happens on the other side of that interface?
        --what are some interesting issues that arose?

    --let's go through these questions in turn.

    --top-level motivation
        --give programmers a way to group system calls.
        --moreover, their system, plus a transactional memory system in
        user space, would permit an arguably cleaner approach to
        concurrent programming:
            --provided we're not dealing with output that leaves the
            system, mutexes/locks/CVs can be ditched in favor of
            begin_tx() and end_tx()

    --what are the goals?
        --main goal: provide the I in ACID to applications but concerning
        _kernel_ state (this is most of the paper's focus).
            (--again, the application's state does not get rolled back if
            the transaction aborts.
            --if the application wants such semantics, then it needs to
            use a user-level transactional memory platform. the authors
            discuss this integration (see section 5.6))
        --performance must be good
        --it must interoperate with threads of control (processes,
        threads) that are not inside a transaction
        --durability (seemed to be required to get the paper accepted,
        report the authors, but not something they started with)

    --what are the non-goals?
        --making the entire OS interface transactional
        --providing a complete transactional package (it was hard enough
        to do what they did). they do describe how to integrate with
        user-level transactions.

    --what interface do they provide, and what are its semantics?

        --sys_xbegin(), sys_xend(), sys_xabort()

            sys_xbegin()
            ....
            [sys_xabort()]
            rc = sys_xend()

        --rc indicates whether the transaction succeeded
        --abort can happen in the middle

    --what's on the other side of that interface?

        --top-level idea:
            --the kernel ensures that, per object, only one writer at a
            time is in the middle of a transaction
            --if there are two concurrent writers, that is called a
            conflict, and one of the transactions aborts at sys_xend()
            [or in the middle]
            --an exception is containers

        --what happens in response to sys_xbegin()? a transactional
        system call? sys_xend()? [the commit protocol; we'll discuss in a
        moment]

        --more detail: their core approach to isolation is what they call
        lazy version management.

            --rather than use two-phase locking plus an undo log (they
            use the term "eager version management" for that), they do
            something else.

            --what's the something else?
                --transactions operate on private structures; the only
                locking required is to make the shadow copy.
                --what prevents two transactions from each making private
                copies of the same structure?
                    --answer: that's a conflict, and their implementation
                    will prevent it

            --there is an inversion here:
                --in most systems, aborting a transaction is a bit
                slower, and committing is fast.
                --in their system, abort is very fast (just throw away
                the shadow copies) while commit is a bit slower.
                --why do they need to make transaction abort fast?
                    --answer: an interrupt handler may quickly need to
                    cause another transaction to abort

            --so their commit step is a bit slower: copy the shadow data
            into the real copy. how do they optimize this? by splitting
            the data structure into two and doing a pointer swap. this
            avoids a memcpy and replaces it with a write. (see the sketch
            at the end of this subsection.)

                --so what's the disadvantage? now they have to clean up
                the original copy. they use the RCU (read-copy update)
                technique to make sure that the original copy isn't in
                use when it's freed. this requires keeping track of
                active readers and writers.

        --special cases:

            --linked lists. they handle these differently. how? (two
            transactions can concurrently work on the same list. this
            does _not_ mean that both are manipulating the list pointers
            at once; the integrity of the list is still protected with a
            spinlock. what it means is that two different transactions
            can modify the same list between begin_tx() and end_tx()
            without one of the transactions having to abort.)

            --container objects (the exception mentioned above)
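    --before the commit protocol, here is the commit-by-pointer-swap idea
    from above as a minimal sketch in C. this is NOT TxOS's actual code:
    the split into a stable header and a swappable data part is the only
    thing taken from the paper's description, and copy_data(),
    rcu_defer_free(), and free_data() are hypothetical stand-ins (the
    real system uses Linux's RCU machinery):

        struct data { long fields[8]; };  /* stand-in for mutable state */

        struct obj {
            struct data *d;               /* readers follow this pointer */
        };

        /* hypothetical helpers */
        struct data *copy_data(struct data *old);
        void rcu_defer_free(struct data *old); /* free only once no
                                                  reader can still hold it */
        void free_data(struct data *shadow);

        /* during the transaction: work on a private shadow copy; making
           the copy is the only step that needs a lock. two writers
           shadowing the same object = a conflict; one of them aborts. */
        struct data *tx_open_write(struct obj *o) {
            return copy_data(o->d);
        }

        /* commit: install the shadow with one pointer write, not a
           memcpy; clean up the original via RCU-style deferred free */
        void tx_commit_obj(struct obj *o, struct data *shadow) {
            struct data *old = o->d;
            o->d = shadow;
            rcu_defer_free(old);
        }

        /* abort: just discard the shadow -- this is why abort is cheap */
        void tx_abort_obj(struct data *shadow) {
            free_data(shadow);
        }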
    --commit protocol:
        --the kernel acquires locks on every object the transaction
        touched, in a global order.
        --if the transaction hasn't been aborted by another one, then do
        a compare-and-swap to put COMMIT into the transaction status
        field (this means that any other transaction concurrently
        committing has to look at our field, and will see that our
        transaction committed).
            --> this is both the commit point and the point that
            determines what order this transaction has in the global
            order of transactions.
        --if we see ABORT in the status field, it means that the
        contention manager aborted us because of a conflict with some
        other transaction.

    --how do they integrate with software transactional memory?
        --answer: two-phase commit (not the same thing as two-phase
        locking). if you haven't seen the term "two-phase commit" before,
        don't worry.
        --the user-level transaction gets to the "prepared" state in user
        space. commit is now pending on whatever happens to the OS
        transaction.
        --now, the STM issues sys_xend().
            --if that commits, then the STM system commits the
            higher-level transaction.
            --if not, then the STM rolls back the higher-level
            transaction.
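    --here is that handshake from the STM's side as a minimal sketch in
    C. sys_xend() is TxOS's actual syscall, but the stm_* helpers and the
    return-code convention are assumptions for illustration:

        typedef struct stm_tx stm_tx;

        /* hypothetical STM internals */
        void stm_prepare(stm_tx *t);   /* reach "prepared": the user-level
                                          tx can now go either way */
        void stm_finalize(stm_tx *t);  /* make the user-level tx visible */
        void stm_rollback(stm_tx *t);

        int sys_xend(void);            /* assume nonzero = kernel tx
                                          committed (convention assumed) */

        /* two-phase commit: phase 1 prepares the user-level tx; phase 2
           lets the kernel transaction's outcome decide both fates */
        int commit_both(stm_tx *t) {
            stm_prepare(t);
            if (sys_xend()) {
                stm_finalize(t);       /* kernel committed -> commit
                                          the higher-level tx */
                return 1;
            }
            stm_rollback(t);           /* kernel aborted -> roll back
                                          the higher-level tx */
            return 0;
        }

      the key property: neither side's changes become visible unless both
      the kernel transaction and the user-level transaction commit.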