Class 19
CS 202
16 April 2015

On the board
------------

1. Last time / clarifications

2. Transactions: intro

3. Transactions: crash recovery

---------------------------------------------------------------------------

1. Last time

Studied crash recovery in file systems.

Skipping Log-structured file system (LFS) for now; may return to it. If we
don't, it's covered in the reading. A point to keep in mind is that, while
LFS provides some crash recovery, the real motivation for logging in that
system is performance.

Some points of clarification from last time:

(1) The _log_ also lives on the disk. It has to be stored persistently.

(2) Why do we want logging? The high-level goal is to perform a group of
related operations on a set of on-disk data structures _atomically_: they
should either all happen or not (and the reason they might not is a crash).

    --> Imagine if we didn't have the log ... a system starts overwriting
        various data structures (say to create a file). If there is a
        crash in the middle, then the data structures could be left in an
        inconsistent state.

    --> Instead, the system works as follows (a code sketch of this
        discipline appears at the end of this section):

        --it logs the intended operations (each of the operations on the
          various data structures); the log, remember, is in persistent
          storage.

        --there is a concept of _committing_ the operation: this happens
          when the system writes a single disk sector (atomically) saying,
          "yes, this operation has been fully logged."

        --only when the operation has committed (which implies that all
          intentions have been logged) does the system begin the process
          of actually overwriting the data structures on the disk (*)

        --now look what happens:

            --if there is a crash in the middle of logging the intended
              operations (i.e., in the middle of the operation), then
              there was no commit entry for that operation in the log, and
              hence -- by the point marked (*) -- no mangling of the data
              structures.

            --if there is a crash during the overwriting, no problem. The
              crash recovery procedure _replays_ the log and finishes the
              operations.

            --if there is a crash during the recovery procedure, no
              problem. The system again replays the log. (The log entries
              have to be written in such a way that it's safe to apply
              them multiple times.)

        In order to make all of this efficient, log entries are cleaned up
        once they have been acted on. This is called _checkpointing_.

            "true state" = checkpoint + L,

        where L refers to the not-yet-applied-but-committed log entries.

    --> Zoom out to observe the following points:

        (a) Notice that the discipline above abides by the golden rule of
            crash consistency: "don't modify the only copy."

        (b) We have arranged for atomicity. We have reduced the
            modification of multiple areas on the disk to a single atomic
            operation: the writing of the "commit" entry.

        (c) In a very real sense, if that record gets written, then it's
            as if the operation happened, whereas if that record did not
            get written, then it's as if the operation never happened (and
            the system is built so that the user does not get confirmation
            that the operation completed until after the record is
            written).

(3) How do we get atomic disk writes? Like, if power is cut while the disk
is writing a sector, how could the write possibly be atomic? There are
multiple ways to solve this problem in software, but modern disks solve it
for us:

    "disks these days actually make these guarantees. If you start a
    write operation to a disk, then even if the power fails in the middle
    of that sector write, the disk has enough power available, and it can
    actually steal power from the rotational energy of the spindle; it
    has enough power to complete the write of the sector that's being
    written right now. In all cases, the disks make that guarantee."

        "EXT3, Journaling Filesystem", Stephen Tweedie
        http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
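Here is a minimal sketch, in C, of the discipline in point (2). This is not
the code of any real system: every name and layout is invented for
illustration, the "disk" and "log" are in-memory arrays standing in for
persistent storage, and checkpointing is omitted.

    /* sketch: log all intentions, then commit, then install.
       all names invented; arrays stand in for persistent storage. */

    #include <string.h>

    #define NSECTORS   128
    #define SECTORSIZE 64
    #define LOGCAP     1024

    enum fsrectype { FS_CHANGE, FS_COMMIT };

    struct fsrec {
        enum fsrectype type;
        int op_id;              /* which logical operation */
        int sector;             /* FS_CHANGE only: target sector */
        char data[SECTORSIZE];  /* FS_CHANGE only: full new contents */
    };

    static char disk[NSECTORS][SECTORSIZE]; /* the real data structures */
    static struct fsrec logarea[LOGCAP];    /* the on-disk log */
    static int lognext;

    /* normal operation */
    void do_operation(int op_id, int *sectors,
                      char (*data)[SECTORSIZE], int n)
    {
        for (int i = 0; i < n; i++) {       /* 1. log the intentions */
            struct fsrec r = { FS_CHANGE, op_id, sectors[i], "" };
            memcpy(r.data, data[i], SECTORSIZE);
            logarea[lognext++] = r;
        }
        struct fsrec c = { FS_COMMIT, op_id, 0, "" };
        logarea[lognext++] = c;             /* 2. the one atomic commit write */
        for (int i = 0; i < n; i++)         /* 3. only now install for real */
            memcpy(disk[sectors[i]], data[i], SECTORSIZE);
    }

    /* crash recovery: replay the FS_CHANGEs of every committed operation,
       in log order. safe to run repeatedly, since each FS_CHANGE is a
       "blind write" of a sector's full contents. */
    void recover(void)
    {
        for (int i = 0; i < lognext; i++) {
            if (logarea[i].type != FS_CHANGE)
                continue;
            for (int j = i + 1; j < lognext; j++)  /* was op committed? */
                if (logarea[j].type == FS_COMMIT
                    && logarea[j].op_id == logarea[i].op_id) {
                    memcpy(disk[logarea[i].sector],
                           logarea[i].data, SECTORSIZE);
                    break;
                }
        }
    }

Note that recover() is idempotent: crashing during recovery and running it
again just rewrites the same sector contents.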
2. Transactions: intro

Transactions are an abstraction: a bunch of operations that have to be
performed together (or not at all). Transactions generalize what we saw
just above with metadata journaling in file systems.

Transactions require us to put together _both_ concurrency and
crash-consistency. So we'll be seeing both locks and logging.

--consider a system with some complex data structures in memory and on
  the disk

    what kind of systems are we talking about? well, these ideas apply to
    file systems, which may have complex on-disk structures. but they
    apply far more widely, and in fact many of these ideas were developed
    in the context of databases. one confusing detail is that databases
    often request a raw block interface to the disk, thereby bypassing
    the file system. So one way to think about this mini unit here is
    that the "system" identified above could be a file system (often
    running inside the kernel) or else a database (often running in user
    space)

    [DRAW PICTURE]

--want to group operations and provide a programming model like the
  following (a C sketch of this interface appears at the end of this
  section):

    /* classical transaction example */
    begin_tx()
        deposit_account(acct 2, $30);
        withdraw_account(acct 1, $30);
    end_tx()

    probably okay to do neither deposit nor withdrawal; definitely okay
    to do both; not okay to do one or the other.

--most of you will run into this idea if you do database programming

--but it's bigger than DBs. file systems with journaling are using a kind
  of transaction:

    /* example from file systems: creating a file */
    begin_tx()
        write data block
        update/write inode
        mark inode "allocated"...
        ....
    end_tx()

--[aside: could imagine having the hardware export a transactional
  interface to the OS:

    begin_tx()
        write_mem();
        write_mem();
        read_mem();
        write_mem();
    end_tx();

    --this is called *transactional memory*. lots of research on this in
      the last 10 years.
    --cleaner way of handling concurrency than using locking
    --but nasty interactions with I/O devices, since you need locks for
      I/O devices (can't roll back once you've emitted output or taken
      input)]

--okay, back to DBs:

    --basically, a bunch of tables implemented with complex on-disk
      structures
    --want to be able to make sensible modifications to those structures
    --can crash at any point

--we're only going to scratch the surface of this material. a course on
  DBs will give you far more info.

--two problems that we'll focus on:

    (1) how can the system provide crash recovery?
    (2) how can the system provide isolation?

    --discuss (1) in the context of a sketched crash-recovery protocol
    --discuss (2) in less detail
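To make the programming model above slightly more concrete, here is one
way the application-facing interface might look in C. All of the tx_*
names are invented for illustration (declarations only; the transaction
system would supply the bodies), and real systems expose this differently
(for example, SQL's BEGIN/COMMIT/ROLLBACK):

    /* hypothetical client-side interface to a transaction system */

    typedef struct tx tx_t;

    tx_t *tx_begin(void);                       /* start a transaction */
    int   tx_update(tx_t *, int acct, long d);  /* buffer: acct += d cents */
    int   tx_commit(tx_t *);                    /* 0 = committed, -1 = aborted */
    void  tx_abort(tx_t *);                     /* back out, leaving no trace */

    /* the classical example, with the abort path made explicit */
    int transfer(int from, int to, long cents)
    {
        tx_t *tx = tx_begin();
        if (tx_update(tx, to, +cents) < 0 ||
            tx_update(tx, from, -cents) < 0) {
            tx_abort(tx);      /* e.g., a balance would go negative */
            return -1;
        }
        return tx_commit(tx);  /* all-or-nothing: both updates or neither */
    }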
---------------------------------------------------------------------------

admin announcement: no class a week from now. will have a makeup

---------------------------------------------------------------------------

3. Transactions: crash recovery ("Atomicity")

assume no concurrency for now.....

[DRAW PICTURE OF THE SOFTWARE STRUCTURE:

    APPLICATION OR USER

      (what is interface to transaction system?)
    ----------------------------------
    TRANSACTION SYSTEM

      (what is interface to next layer down?)
    ----------------------------------
    DISK = LOG + CELL/STABLE/NV STORAGE]

our challenge is that a crash can happen at any point, but we want the
state to always look consistent:

    --after a crash, "finished" (trans)actions should appear
    --after a crash, "unfinished" (trans)actions should not appear

How do we implement this?

    --logs and transactions during normal operation
    --undo/redo after a crash

the log is the authoritative copy; it helps get the on-disk structures to
a consistent state after a crash. (People pay a lot for DBs; they (the
DBs) need to work!)

What's in the log? (A sequence of records.)

--will need to maintain some rules about the log (we'll see these below)

log record types:

    BEGIN_TRANS(trans_id)
    END_TRANS(trans_id)
    CHANGE(trans_id, "redo action", "undo action")
    OUTCOME(trans_id, COMMIT|ABORT)

entries (see the struct sketch just below):

    SEQ #
    TYPE:        [begin/end/change/abort/commit]
    TRANS ID:
    PREV_LSN:
    REDO action:
    UNDO action:
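In code, one plausible layout for such a record might look as follows.
This is only a sketch that mirrors the field list above; real record
formats are more involved, and the names here are invented:

    /* sketch of one log record. the redo and undo actions are "blind
       writes" -- a cell number plus a complete value -- so applying
       them more than once is harmless (this matters for recovery,
       below). abort/commit are folded into the OUTCOME record type. */

    enum rectype { BEGIN_TRANS, END_TRANS, CHANGE, OUTCOME };
    enum outcome { COMMIT, ABORT };

    struct action {
        int  cell;   /* which cell of cell storage */
        long value;  /* the complete value to put there */
    };

    struct logrec {
        long          lsn;       /* SEQ #: log sequence number */
        enum rectype  type;      /* begin/end/change/outcome */
        int           trans_id;  /* which transaction this belongs to */
        long          prev_lsn;  /* previous record of this transaction */
        struct action redo;      /* CHANGE only: write the new value */
        struct action undo;      /* CHANGE only: write back the old value */
        enum outcome  result;    /* OUTCOME only: COMMIT or ABORT */
    };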
example: application does:

    BEGIN_TRANS
    CHANGE_STREET
    CHANGE_ZIPCODE
    END_TRANS

or:

    BEGIN_TRANS
    DEBIT_ACCOUNT 123, $100
    CREDIT_ACCOUNT 456, $100
    END_TRANS

[SHOW HOW LOTS OF TRANSACTIONS ARE INTERMINGLED IN THE LOG.]

--Why do aborts happen?

    --say because at the end of the transaction, the transaction system
      realizes that some of the values were illegal.
    --(in the isolation discussion, other reasons will surface)

--why have a separate END record (instead of using OUTCOME)? (will see
  below)

--why BEGIN_TRANS != first CHANGE record? (makes it easier to explain,
  and may make recovery simpler, but no fundamental reason.)

--concept: commit point: the point at which there's no turning back.

    --actions always look like this:

        --first step
          ....              [can back out, leaving no trace]
        --commit point
          .....             [completion is inevitable]
        --last step

    --what's the commit point when buying a house? when buying a pair of
      shoes? when getting married?

    --what's the commit point here? (when the OUTCOME(COMMIT) record is
      in the log. So, better log the commit record *on disk* before you
      tell the user of the transaction that it committed! Get that
      wrong, and the user of the transaction would proceed on a false
      premise, namely that the so-called committed action will be
      visible after a crash.)

--note: maintain some rules:

    --log records go in time order

    --always, always, always log the change before modifying non-volatile
      storage, also known as cell storage (this is why it is called a
      "write-AHEAD log")

        [Per Saltzer and Kaashoek (see end of notes for citation), the
        golden rule of atomicity: "never modify the only copy!" Variant:
        "log the update *before* installing it"]

    --END means that the changes have been installed

    --[*] observe: no required ordering between the commit record and
      installation (can install at any time for a long-running
      transaction, or can write the commit record and very lazily update
      the non-volatile storage.)

        --wait, what IS the required ordering? (just that the
          change/update record has to be logged before the cell storage
          is changed)

    --in fact, the changes to cell storage do not even have to be
      propagated in the order of the log, as long as cell storage
      eventually reflects what's in RAM and the order in the log. that's
      because during normal operation, RAM has the right answers, and
      because during crash recovery, the log is replayed in order. So
      the order of transactions with respect to each other is preserved
      in cell storage in the long run.

--***NOTE: Below, we are going to make the unrealistic assumption that,
   if the disk blocks that correspond to cell storage are cached, then
   that cache is write-through.

    --this is different from whatever in-memory structures the database
      (or, more generally, the transactional system) uses. those
      in-memory structures are the ones that actually hold the updates.

--make sure it's okay if you crash during recovery

    --the procedure needs to be idempotent
    --how? (answer: records are expressed as "blind writes" (for
      example, "put 3 in cell 5", rather than something like "increment
      the value of cell 5"). in other words, records shouldn't make
      reference to the previous value of the modified cell.)

--now say a crash happens. a (relatively) simple recovery protocol goes
  like this (see the code sketch at the end of this section):

    --scan backward looking for "losers" (actions that are part of
      transactions that don't have an END record)

        --if you encounter a CHANGE record that corresponds to a losing
          transaction, then apply its UNDO [even if it was committed]

    --then scan forward, starting at the beginning of the log

        --if you encounter a CHANGE record that corresponds to a
          committed transaction (which you learned about from the
          backward scan) for which there is no END record, then REDO the
          action

    --subtle detail: for all losing transactions, log END_TRANS

        --why? consider the following sequence, with time increasing
          from left to right:

            R1   A1 A2 A3   R2

          assume:
            R1 is an instance of recovery
            A1..A3 are some committed, propagated actions (that have END
                records). these actions affect the same cells in cell
                storage as some transactions that are UNDOne in R1
            R2 is an instance of recovery

          If there were no END_TRANS logged for the transactions UNDOne
          in R1, then it's possible that UNDOs would be processed in R2
          that would undo legitimate changes to cell storage by A1..A3,
          and meanwhile A1..A3 won't get redone, because they have END
          records logged (by assumption).

    [NOTE: the above is not typo'ed. the reason that it's okay to UNDO a
    committed action in the backward scan is that the forward scan will
    REDO the committed actions. This is actually exactly what we want:
    the state of cell/NV storage is unwound in LIFO order to make it as
    though the losing transactions never happened. Then, the ones that
    actually did commit get applied in order.]
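Here is a sketch of that recovery protocol in C, using the struct logrec
layout sketched earlier. The helpers cell_write() and log_end() are
invented, assumed names: cell_write() installs a blind write into cell
storage, and log_end() appends an END_TRANS record to the log.

    /* sketch of the recovery protocol above. log[0..n-1] holds the
       records in time order; MAXTRANS bounds trans_id for simplicity.
       uses struct logrec / enum rectype from the sketch above. */

    #include <stdbool.h>

    #define MAXTRANS 1024

    void cell_write(int cell, long value);  /* assumed provided */
    void log_end(int trans_id);             /* assumed provided */

    void recover(struct logrec *log, int n)
    {
        bool seen[MAXTRANS] = { false };
        bool ended[MAXTRANS] = { false };
        bool committed[MAXTRANS] = { false };

        /* backward scan: learn outcomes; UNDO every CHANGE belonging to
           a transaction with no END record ("losers"), even if that
           transaction committed */
        for (int i = n - 1; i >= 0; i--) {
            struct logrec *r = &log[i];
            seen[r->trans_id] = true;
            if (r->type == END_TRANS)
                ended[r->trans_id] = true;
            else if (r->type == OUTCOME && r->result == COMMIT)
                committed[r->trans_id] = true;
            else if (r->type == CHANGE && !ended[r->trans_id])
                cell_write(r->undo.cell, r->undo.value); /* LIFO unwind */
        }

        /* forward scan: REDO every CHANGE of a committed transaction
           that has no END record */
        for (int i = 0; i < n; i++) {
            struct logrec *r = &log[i];
            if (r->type == CHANGE && committed[r->trans_id]
                && !ended[r->trans_id])
                cell_write(r->redo.cell, r->redo.value);
        }

        /* the subtle detail: log END_TRANS for every losing transaction,
           so a later recovery (R2 above) won't re-undo cells that other,
           ENDed transactions (A1..A3) have since rewritten */
        for (int t = 0; t < MAXTRANS; t++)
            if (seen[t] && !ended[t])
                log_end(t);
    }

Because the backward pass applies UNDOs as it encounters them, the
unwinding happens in reverse (LIFO) order, matching the NOTE above.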
--observe: recovery has multiple stages and involves both undo and redo;
  the logging likewise requires both undo and redo actions. what if we
  wanted only one of these two types of logging? what rules can we add
  to the ones above?

    (1) say we wanted only undo logging. what requirement would we have
        to place on the application?

        --perform all installs *before* logging the OUTCOME record.
        --in that case, there is never a need to redo.
        --recovery just consists of undoing all of the actions for which
          there is no OUTCOME record....and
        --logging an OUTCOME(abort) record

        this scheme is called *undo logging* or *rollback recovery*

    (2) say we wanted only redo logging. what requirement would we have
        to place on the application?

        --perform installs only *after* logging the OUTCOME record
        --in that case, there is never a need to undo
        --recovery just consists of scanning the log and applying all of
          the redo actions for actions that:
            --committed; but
            --do not have an END record

        this scheme is called *redo logging* or *roll-forward recovery*

    [notice that the logging discipline that we saw in lecture 18 is
    really redo logging. in that discipline, there was no END record,
    only the equivalent of OUTCOME(commit). that approach also works; it
    means that the system does not record in the log which operations
    have been installed (which we called "modifying the on-disk data
    structures"). In that context, it's fine not to record this info,
    because the location of the beginning of the log implicitly records
    it; this in turn works because in the FS context we assumed no
    intermingling of logical operations in the log. In this (more
    complicated) context, there are multiple transactions intermingled
    in the log, and some of them might have been installed and others
    not, so the log itself records which operations have been installed.
    This makes recovery faster and helps with the checkpointing process.]

--checkpoints make recovery faster

    --but introduce complexity
    --the complexity can be managed if the system uses only redo
      logging. (Maintain a trailing pointer in the W.A.L. (write-ahead
      log) to mean: "everything before here has been propagated". that
      trailing pointer identifies the current checkpoint state. OSTEP
      chapter 42 demonstrates this idea.)
    --a non-write-through cache of disk pages makes things even more
      complex

4. Transactions: isolation

[next time]