Class 20
CS 372H
7 April 2011

On the board
------------

1. Last time

2. LFS, continued

3. Transactions
    --Atomicity (all-or-nothing atomicity)
    --Isolation (before-or-after atomicity)

---------------------------------------------------------------------------

1. Last time

  --crash recovery in classical file systems (ad-hoc approaches, ordered updates, journaling)
  --LFS (another use for a log)

2. LFS

  A. [last time] Intro
  B. [last time] Finding data
  C. [last time] Crash recovery in LFS
  D. Garbage collection (cleaning)
  E. Discussion

  C. Clarify crash recovery in LFS

    --need to figure out where the last checkpoint was, using the checkpoint region (which is in a fixed and well-known place on the disk)
    --checkpoint region contains information needed to reconstruct the inode map and segment usage table
      --note: the inode map is big and may take more than one disk location. blocks of the inode map may be strewn throughout the disk
    --checkpoint region also contains a pointer to the last segment written
    --recovery:
      --read in the inode map and segment usage table
      --start at the last segment written and scan forward from that segment, adjusting the inode map
      --stop when we arrive at the last inode written in the log
        --we know that we're at the last inode written because we look at every segment's segment summary block, and if the segment summary block indicates that the segment is old, we're done. (the segment summary block has a checksum and timestamp, which is enough. plus, the segment summary block is written periodically even when the segment is not full.)
    --[maybe mention the backwards hack]

  D. Garbage collection (cleaning)

    --what if the log fills up? then we're in trouble. to avoid this, need to do *cleaning*.
    --approach: basically, compress the log, and leave some free space on the disk
    --use segments: contiguous regions of the log (1 MB in their implementation)
    --two data structures they maintain:
      --in-memory: segment usage table: [ <# free bytes> <most recent modification time of any block in the segment> ]
      --on disk, per-segment: segment summary (a table indexed by entry in the segment, so the first entry in the table gives info about the first entry in the segment): [ <file the block belongs to (inode #)> <offset of the block within that file> ]
    --okay, which segments should we clean? and do you want the utilization to be high or low when you clean?
      --observe: if the utilization of a segment is 0 (as indicated by the segment usage table), cleaning it is really easy: just overwrite the segment!
      --if utilization is very low, that's a good sign: clean that segment (rewriting it is very little work)
      --but what if utilization is high? can we clean the segment?
        --insight: yes! provided what? (provided that the segment generally has a lot of "cold", that is, unchanging, data). the insight is that:
          --because the segment is cold, it's never going to get to a low utilization
          --at the same time, because it's cold, you're not wasting work by compressing it. the segment will "stay compressed".
      --they analyze bang for the buck: how long after compressing will the data stick around to justify the work we did to clean and compress it?

            (benefit/cost) = (1-u)*age/(1+u)

        --cost: 1+u (1 to read the segment in, u to write the live data back)
        --benefit: "1-u" are the blocks we're taking back, times "age". The "age" is an estimate of how long the compacted blocks will stay compact. It may seem counter-intuitive to multiply these together, but this particular metric is just a rough guide anyway. The idea is that there are two factors that matter: how long the compacted blocks will stay compact (captured by age) and how many blocks we actually got by compacting the segment (captured by 1-u). The notion that we "keep" the blocks for their "age" (as stated in the paper) isn't really right in literal terms, because we don't know anything about what will happen once the blocks are pressed back into service. However, this metric captures the idea that it's worthwhile to compact old data, even if it's in a segment that is highly utilized, because it will be a while before the old blocks need to be re-compacted.
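      --to make the policy concrete, here is a minimal sketch in Python of ranking segments by the benefit/cost metric. the names (SegmentUsage, pick_segments_to_clean) are invented for illustration; this is not LFS's actual code, just the formula above applied to a hypothetical segment usage table.

            import time

            class SegmentUsage:
                """One row of a hypothetical segment usage table."""
                def __init__(self, seg_id, utilization, last_modified):
                    self.seg_id = seg_id
                    self.utilization = utilization      # u: fraction of the segment still live, in [0, 1]
                    self.last_modified = last_modified  # timestamp of the newest live block

            def benefit_cost(seg, now):
                # (1 - u) * age / (1 + u): space reclaimed, weighted by how long it
                # should stay reclaimed, divided by the I/O to read and rewrite.
                u = seg.utilization
                age = now - seg.last_modified
                return (1.0 - u) * age / (1.0 + u)

            def pick_segments_to_clean(usage_table, how_many):
                """Clean the segments with the highest benefit/cost ratio."""
                now = time.time()
                ranked = sorted(usage_table, key=lambda s: benefit_cost(s, now), reverse=True)
                return ranked[:how_many]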
  E. Discussion

    --you should understand the authors' points in the last paragraph of section 5.1
    --where does LFS fall down?
      --if the write workload is random access and the read workload is sequential, performance will be terrible: there will be lots of disk seeks.
        --see the rightmost set of bars in figure 9
    --are people using LFS today? why not?
      --not using it in its proposed form
      --but journaling is everywhere (for example, the ext3 file system on Linux)
      --people might not be using it for any number of reasons:
        --added performance not needed
        --performance so bad in pathological cases that people don't want to use it

[note: in LFS, the log *is* the file system. next we're going to briefly describe transactional systems, which in some ways are closer to journaling: the log is the authoritative record, but it's not the copy of the data used for read operations.]

3. Transactions

consider a system with: (log) + (complex on-disk structures)

what kind of systems are we talking about? well, these ideas apply to file systems, which may have complex on-disk structures. but they apply far more widely, and in fact many of these ideas were developed in the context of databases. one confusing detail is that databases often request a raw block interface to the disk, thereby bypassing the file system. So one way to think about this mini-unit is that the "system" identified above could be a file system (often running inside the kernel) or else a database (often running in user space).

want to build *transactions*:

  --a way of grouping a bunch of operations. but what's a transaction?
  --intuition:

        begin_transaction();
        deposit_account(acct 2, $30);
        withdraw_account(acct 1, $30);
        end_transaction();

    probably okay to do neither deposit nor withdrawal; definitely okay to do both. not okay to do one or the other. (a toy code sketch of this appears just below, at the end of this intro.)
  --most of you will run into this idea if you do database programming
  --but it's bigger than DBs
    --arguably, LFS is using transactions (but why is a bit subtle and has to do with the precise use of the segment summary block)
    --could even imagine having the kernel export a transactional interface:

        sys_begin_transaction();
        syscall1();
        syscall2();
        sys_end_transaction();

      --there is research in this department that proposes exactly this
      --very nice for, say, software install: wrap the entire install() program in a transaction. so all-or-nothing
    --[aside: could imagine having the hardware export a transactional interface to the OS:

        begin_transaction();
        write_mem();
        write_mem();
        read_mem();
        write_mem();
        end_transaction();

      --this is called *transactional memory*. lots of research on this in the last 10 years.
      --cleaner way of handling concurrency than using locking
      --but nasty interactions with I/O devices, since you need locks for I/O devices (can't roll back once you've emitted output or taken input)]
  --okay, back to DBs:
    --basically, a bunch of tables implemented with complex on-disk structures
    --want to be able to make sensible modifications to those structures
    --can crash at any point
  --we're only going to scratch the surface of this material. a course on DBs will give you far more info.
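  --toy sketch of the all-or-nothing intuition above, written against a hypothetical transaction interface (begin_transaction/commit/abort and the account operations are invented names, not a real DB API):

        class AbortTransaction(Exception):
            pass

        def transfer(txn_system, accounts, src, dst, amount):
            """Move money between accounts; either both updates happen or neither does."""
            t = txn_system.begin_transaction()
            try:
                if accounts.balance(t, src) < amount:
                    raise AbortTransaction("insufficient funds")
                accounts.withdraw(t, src, amount)
                accounts.deposit(t, dst, amount)
                t.commit()   # after this point, both updates are durable
            except Exception:
                t.abort()    # neither update takes effect
                raise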
  --classically, transactions have been defined to have four properties, ACID:

        A: atomicity
        C: consistency
        I: isolation
        D: durability

  --we'll focus on the "A" and the "I"
    --"C" is not hard to provide from our perspective: just don't do anything dumb inside the transaction (gross oversimplification)
    --"D" is also not too hard, if we're logging our changes
  --What the heck is the difference between "A" and "I"?
    --A: atomicity. think of it as "all-or-nothing atomicity"
    --I: isolation. think of it as "before-or-after atomicity"
    --A means: "if there's a crash, it looks to everyone after the crash as if the transaction either fully completed or never started"
    --I is a response to concurrency. it means: "the transactions should appear as if they executed in serial order"
  --briefly going to describe how to provide these two types of atomicity

  A. Atomicity ("all-or-nothing atomicity")

    assume no concurrency for now.....

    our challenge is that a crash can happen at any point, but we want the state to always look consistent

    the log is the authoritative copy. it helps get the on-disk structures to a consistent state after a crash.

    log record types:

        BEGIN_TRANS(trans_id)
        END_TRANS(trans_id)
        CHANGE(trans_id, "redo action", "undo action")
        OUTCOME(trans_id, COMMIT|ABORT)

    entries:

        SEQ #:
        TYPE: [begin/end/change/abort/commit]
        TRANS ID:
        PREV_LSN:
        REDO action:
        UNDO action:

    [DRAW PICTURE OF THE SOFTWARE STRUCTURE:

        APPLICATION OR USER
        (what is the interface to the transaction system?)
        ----------------------------------
        TRANSACTION SYSTEM
        (what is the interface to the next layer down?)
        ----------------------------------
        DISK = LOG + CELL/STABLE/NV STORAGE]

    example: application does:

        BEGIN_TRANS
        CHANGE_STREET
        CHANGE_ZIPCODE
        END_TRANS

    or:

        BEGIN_TRANS
        DEBIT_ACCOUNT 123, $100
        CREDIT_ACCOUNT 456, $100
        END_TRANS

    [SHOW HOW LOTS OF TRANSACTIONS ARE INTERMINGLED IN THE LOG.]

    --Why do aborts happen?
      --say because at the end of the transaction, the transaction system realizes that some of the values were illegal
      --(in the isolation discussion, other reasons will surface)
    --why have a separate END record (instead of using OUTCOME)? (will see below)
    --why is BEGIN_TRANS not just the first CHANGE record? (makes it easier to explain, and may make recovery simpler, but no fundamental reason.)
    --concept: commit point: the point at which there's no turning back.
      --actions always look like this:

            --first step
            ....            [can back out, leaving no trace]
            --commit point
            ....            [completion is inevitable]
            --last step

      --what's the commit point when buying a house? when buying a pair of shoes? when getting married?
      --what's the commit point here? (when the OUTCOME(COMMIT) record is in the log. So, better log the commit record *on disk* before you tell the user of the transaction that it committed! Get that wrong, and the user of the transaction would proceed on a false premise, namely that the so-called committed action will be visible after a crash.)
    --note: maintain some invariants:
      --always, always, always log the change before modifying the non-volatile storage, also known as cell storage (this is what we mean by "write-ahead log")
        [Per Saltzer and Kaashoek (see end of notes for citation), the golden rule of atomicity: "never modify the only copy!". Variant: "log the update *before* installing it"]
      --END means that the changes have been installed
      --[*] observe: no required ordering between the commit record and installation (can install at any time for a long-running transaction, or can write the commit record and very lazily update the non-volatile storage)
      --wait, what IS the required ordering? (just that the change/update record has to be logged before the cell storage is changed)
      --in fact, the changes to cell storage do not even have to be propagated in the order of the log, as long as cell storage eventually reflects what's in RAM and the order in the log. that's because during normal operation, RAM has the right answers, and because during crash recovery, the log is replayed in order. So the order of transactions with respect to each other is preserved in cell storage in the long run.
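      --a minimal sketch (in Python, with invented names; a real transaction system has much more machinery) of the invariants above: log the CHANGE record, with redo and undo actions expressed as blind writes, before installing into cell storage; force the OUTCOME(COMMIT) record to disk at the commit point; log END once the changes have been installed.

            class TransactionSystem:
                def __init__(self, log, cell):
                    self.log = log      # append-only log on disk; flush() forces it to disk
                    self.cell = cell    # cell (non-volatile) storage

                def change(self, trans_id, addr, new_value):
                    old_value = self.cell.read(addr)
                    # golden rule: log the update (redo AND undo info) *before* installing it
                    self.log.append(("CHANGE", trans_id,
                                     ("put", addr, new_value),   # redo action: blind write
                                     ("put", addr, old_value)))  # undo action: blind write
                    self.log.flush()
                    self.cell.write(addr, new_value)             # install (could also happen later)

                def commit(self, trans_id):
                    # commit point: OUTCOME(COMMIT) must be on disk before we report success
                    self.log.append(("OUTCOME", trans_id, "COMMIT"))
                    self.log.flush()

                def end(self, trans_id):
                    # END means all of this transaction's changes have been installed
                    self.log.append(("END", trans_id))
                    self.log.flush()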
    --***NOTE: below we are going to make the unrealistic assumption that, if the disk blocks that correspond to cell storage are cached, then that cache is write-through
      --this is different from whatever in-memory structures the database (or, more generally, the transactional system) uses. those in-memory structures don't matter for the purposes of this discussion
    --make sure it's okay if you crash during recovery
      --the procedure needs to be idempotent
      --how? (answer: records are expressed as "blind writes" (for example, "put 3 in cell 5", rather than something like "increment the value of cell 5"). in other words, records shouldn't make reference to the previous value of the modified cell.)
    --now say a crash happens. a (relatively) simple recovery protocol goes like this (a code sketch of this scan appears below, after the undo/redo logging discussion):
      --scan backward looking for "losers" (actions that don't have an END record)
        --if you encounter a CHANGE record that corresponds to a losing transaction, then apply its UNDO
      --then scan forward, starting at the beginning of the log
        --if you encounter a CHANGE record that corresponds to a committed transaction (which you learned about from the backward scan) for which there is no END record, then REDO the action
      --subtle detail: for all losing transactions, log END_TRANS
        --why? consider the following sequence, with time increasing from left to right:

              R1    A1 A2 A3    R2

          assume:
            R1 is an instance of recovery
            A1..A3 are some committed, propagated actions (that have END records). these actions affect the same cells in cell storage as some transactions that are UNDOed in R1
            R2 is an instance of recovery

          If there were no END_TRANS logged for the transactions UNDOed in R1, then it's possible that UNDOs would be processed in R2 that would undo legitimate changes to cell storage by A1...A3, and meanwhile A1...A3 won't get redone because they have END records logged (by assumption).

          [NOTE: the above is not typo'ed. the reason that it's okay to UNDO a committed action in the backward scan is that the forward scan will REDO the committed actions. This is actually exactly what we want: the state of cell/NV storage is unwound in LIFO order to make it as though the losing transactions never happened. Then, the ones that actually did commit get applied in order.]

    --observe: recovery has multiple stages; it involves both undo and redo. logging also requires both undo and redo. what if we wanted only one of these two types of logging?

      (1) say we wanted only undo logging? what requirement would we have to place on the application?
        --perform all installs *before* logging the OUTCOME record
        --in that case, there is never a need to redo
        --recovery just consists of:
            --undoing all of the actions for which there is no OUTCOME record; and
            --logging an OUTCOME(ABORT) record
        --this scheme is called *undo logging* or *rollback recovery*

      (2) say we wanted only redo logging? what requirement would we have to place on the application?
        --perform installs only *after* logging the OUTCOME record
        --in that case, there is never a need to undo
        --recovery just consists of scanning the log and applying all of the redo actions for transactions that:
            --committed; but
            --do not have an END record
        --this scheme is called *redo logging* or *roll-forward recovery*
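    --a sketch of the recovery scan described above (invented helper names; assumes log records shaped like the earlier sketch: ("CHANGE", tid, redo, undo), ("OUTCOME", tid, "COMMIT"|"ABORT"), ("END", tid), and an apply() that performs blind writes, so replaying is idempotent):

            def recover(log_records, apply, append_to_log):
                committed = set()
                ended = set()

                # backward scan: because END follows a transaction's CHANGEs in the log,
                # by the time we reach a CHANGE we already know whether its transaction
                # has an END record. undo every CHANGE of a transaction without one.
                for rec in reversed(log_records):
                    if rec[0] == "OUTCOME" and rec[2] == "COMMIT":
                        committed.add(rec[1])
                    elif rec[0] == "END":
                        ended.add(rec[1])
                    elif rec[0] == "CHANGE" and rec[1] not in ended:
                        apply(rec[3])        # undo action (even if committed; see NOTE above)

                # forward scan: redo, in log order, every CHANGE of a committed
                # transaction that has no END record
                for rec in log_records:
                    if rec[0] == "CHANGE" and rec[1] in committed and rec[1] not in ended:
                        apply(rec[2])        # redo action

                # subtle detail: log END for the losers so a later recovery does not
                # undo them again on top of newer, legitimate changes
                for rec in log_records:
                    tid = rec[1]
                    if tid not in committed and tid not in ended:
                        append_to_log(("END", tid))
                        ended.add(tid)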
    --checkpoints make recovery faster
      --but they introduce complexity
    --a non-write-through cache of disk pages makes things even more complex

  B. Isolation ("before-or-after atomicity")

    [cover next time, but including the notes here]

    --easiest approach: one giant lock. only one transaction is active at a time. so everything really is serialized
      --advantage: easy to reason about
      --disadvantage: no concurrency
    --next approach: fine-grained locks (e.g., one per cell, or per table, or whatever), and acquire all needed locks at begin_transaction and release all of them at end_transaction
      --advantage: easy to reason about. would work great if it could be implemented
      --disadvantage: requires the transaction to know all of its needed locks in advance
    --actual approach: two-phase locking. gradually acquire locks as you need them (phase 1), and then release all of them together at the commit point (phase 2). (a code sketch appears at the end of these notes.)
      --your intuition from the concurrency unit will tell you that this creates a problem....namely, deadlock. we'll come back to that in a second.
      --why does this actually preserve a serial ordering? here's an informal argument:
        --consider the lock point, that is, the point in time at which the transaction owns all of the locks it will ever acquire
        --consider any lock that is acquired. from that point to the lock point, the application always sees the same data
        --so regard the application as having done all of its reads and writes instantly at the lock point
        --the lock points create the needed serialization. Here's why, informally. Regard the transactions as having taken place in the order given by their lock points. Okay, but how do we know that the lock points serialize? Answer: each lock point takes place at an instant, and any lock points with intersecting lock sets must be serialized with respect to each other as a result of the mutual exclusion given by the locks.
    --to fix deadlock, several possibilities:
      --one of them is to remove the no-preempt condition: have the transaction manager abort transactions (roll them back) after a timeout

Source for a lot of this material: J. H. Saltzer and M. F. Kaashoek, Principles of Computer System Design: An Introduction. Morgan Kaufmann, Burlington, MA, 2009. Chapter 9. Available online.
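[sketch referenced above: a minimal illustration of two-phase locking in Python. the names are invented; a real lock manager would also handle lock modes, deadlock detection, and rollback of aborted transactions.]

        import threading

        class Transaction:
            """Two-phase locking: acquire locks as cells are first touched (growing
            phase); release them all together at commit (shrinking phase)."""

            def __init__(self, lock_table):
                self.lock_table = lock_table   # maps cell id -> threading.Lock
                self.held = set()

            def _acquire(self, cell_id):
                # phase 1: take each lock the first time we need it; never release early
                if cell_id not in self.held:
                    self.lock_table[cell_id].acquire()
                    self.held.add(cell_id)

            def read(self, storage, cell_id):
                self._acquire(cell_id)
                return storage[cell_id]

            def write(self, storage, cell_id, value):
                self._acquire(cell_id)
                storage[cell_id] = value

            def commit(self):
                # phase 2: the lock point was reached at the last _acquire();
                # release everything together at the commit point
                for cell_id in self.held:
                    self.lock_table[cell_id].release()
                self.held.clear()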