Lecture 21
CS 202

1. Last class
2. Recovery: journaling
3. Unix security

1. Last class

- Crash recovery
    + Desirable properties
    + Available mechanisms
        - Ad hoc recovery
        - Copy on write

2. Journaling

-- Copy on write showed that crash consistency is achievable when
   modifications **do not** modify (or destroy) the current copy. Golden
   rule of atomicity, per Saltzer-Kaashoek: "never modify the only copy"

-- Problem is that copy-on-write carries significant write and space
   overheads. Want to do better without violating the golden rule of
   atomicity.

-- Going to do so by borrowing ideas from how transactions are implemented
   in databases.

-- Core idea: treat file system operations as transactions. Concretely,
   this means that after a crash, failure recovery ensures that:

    * Committed file system operations are reflected in on-disk data
      structures.
    * Uncommitted file system operations are not visible after crash
      recovery.

-- Core mechanism: record enough information to finish applying committed
   operations (*redo operations*) and/or roll back uncommitted operations
   (*undo operations*). This information is stored in a redo log or undo
   log. Discuss this in detail next.

-- Concept: commit point: the point at which there's no turning back.

-- Actions always look like this:

    --first step
    ....            [can back out, leaving no trace]
    --commit point
    .....           [completion is inevitable]
    --last step

-- What's the commit point when buying a house? when buying a pair of
   shoes? when getting married?

-- Ask yourself: what's the commit point in the protocols below (and the
   copy-on-write protocol above)?

-- Redo logging

    * Used by Ext3 and Ext4 on Linux; going to discuss in that context.
    * Log is a fixed-length ring buffer placed at the beginning of the
      disk (see handout).
    * Basic operations

        Step 1: the filesystem computes what would change due to an
        operation. For instance, creating a new file involves changes to
        directory inodes; appending to a file involves changes to the
        file's inode and data blocks.
        Step 2: the file system computes where in the log it can write
        this transaction, and writes a transaction begin record there
        (TxnBegin in the handout). This record contains a transaction ID,
        which needs to be unique. The file system **does not** need to
        wait for this write to finish and can immediately proceed to the
        next step.

        Step 3: the file system writes a record or records detailing all
        the changes it computed in step 1 to the log. The file system
        **must** now wait for these log changes and the TxnBegin record
        (step 2) to finish being written to disk.

        Step 4: once the TxnBegin record and all the log records from
        step 3 have been written, the system writes a transaction end
        record (TxnEnd in the handout). This record contains the same
        transaction ID as was written in step 2, and the transaction is
        considered committed once the TxnEnd record has been successfully
        written to disk.

        Step 5: once the TxnEnd record has been written, the filesystem
        asynchronously performs the actual file system changes; this
        process is called **checkpointing**. While the system is free to
        perform checkpointing whenever it is convenient, the checkpoint
        rate dictates the size of the log that the system must reserve.

    * Crash recovery

        During crash recovery, the filesystem needs to read through the
        log, determine the set of **committed** operations, and then
        apply them. Observe that:

        -- The filesystem can determine whether a transaction is
           committed by matching the transaction IDs in TxnBegin and
           TxnEnd records.
        -- It is safe to apply the same redo log multiple times (applying
           redo records is idempotent).

        Operationally, when the system is recovering from a crash, it
        does the following:

        Step 1: the file system starts scanning from the beginning of
        the log.

        Step 2: every time it finds a TxnBegin entry, it searches for a
        corresponding TxnEnd entry.

        Step 3: if matching TxnBegin and TxnEnd entries are found --
        indicating that the transaction is committed -- the file system
        applies (checkpoints) the changes.
        Step 4: recovery is completed once the entire log is scanned.

        Note: for redo logs, filesystems generally begin scanning from
        the **start of the log**.

    * What to log?

        Observe that logging can double the amount of data written to
        disk. To improve performance, Ext3 and Ext4 allow users to choose
        what to log.

        * Default is to log only metadata. The idea here is that many
          people are willing to accept data loss/corruption after a
          crash, but keeping metadata consistent is important: if
          metadata is inconsistent, the FS may become unusable, as the
          data structures no longer have integrity.

        * Can change settings to force data to be logged along with
          metadata. This incurs additional overheads, but prevents data
          loss on crash.

-- Undo logging

    * Not used in isolation by any file system.

    * Key idea: the log contains information on how to roll back any
      changes made to data. Mechanically, during normal operation:

        Step 1: write a TxnBegin entry to the log.

        Step 2: for each operation, write instructions for how to undo
        any updates made to a block. These instructions might include the
        original data in the block. In-place changes to the block can be
        made right after these instructions have been persisted.

        Step 3: wait for in-place changes (what we referred to as
        checkpointing) to finish for all blocks.

        Step 4: write a TxnEnd entry to the log, thereby committing the
        transaction.

        *Note* this implies that if a transaction is committed, then all
        changes have been written to the actual data structures of the
        file system.

      During crash recovery:

        Step 1: scan the log to find all uncommitted transactions; these
        are ones where a TxnBegin entry is present but no TxnEnd entry is
        found.

        Step 2: for each such transaction, check whether the undo entry
        is valid. This is usually done through the use of a checksum. Why
        do we need this? Remember, a crash might occur before the undo
        entry has been successfully written.
        If that happened, then (by the procedure described above) the
        actual changes corresponding to this undo entry have not been
        written to disk, so ignoring this entry is safe. On the other
        hand, trying to undo using a partially complete entry might
        result in data corruption, so using this entry would be
        **unsafe**.

        Step 3: apply all valid undo entries found, in order to restore
        the disk to a consistent state.

        Note: for undo logs, logs are generally scanned from the **end of
        the log**.

    * Advantage: changes can be checkpointed to disk as soon as the undo
      log has been updated. This is beneficial when the amount of buffer
      cache is low.

    * Disadvantage: a transaction is not committed until all dirty blocks
      have been flushed to their in-place targets.

-- Redo logging vs. undo logging

    This is just a recap of the advantages and disadvantages.

    **Redo logging**

    * Advantage: a transaction can commit without all in-place updates
      (writes to actual disk locations) being completed. Updating the
      journal is sufficient. Why is this useful? In-place updates might
      be scattered all over the disk, so the ability to delay them can
      help improve performance.

    * Disadvantage: a transaction's dirty blocks need to be kept in the
      buffer cache until the transaction commits and all of the
      associated journal entries have been flushed to disk. This might
      increase memory pressure.

    **Undo logging**

    * Advantage: a dirty block can be written to disk as soon as the
      undo-log entry has been flushed to disk. This reduces memory
      pressure.

    * Disadvantage: a transaction cannot commit until all dirty blocks
      have been flushed to disk. This imposes additional constraints on
      the disk scheduler and might result in worse performance.

-- Combining redo and undo logging

    * Done by NTFS.

    * Goals:

        - Allow dirty buffers to be flushed as soon as their associated
          journal entries are written. This can reduce memory pressure
          when necessary.
        - Allow transactions to commit as soon as logging is done, so the
          system has greater flexibility when scheduling disk writes.

    * How does this work?

    * Basic operations

        Step 1: the filesystem computes what would change due to an
        operation. For instance, creating a new file involves changes to
        directory inodes; appending to a file involves changes to the
        file's inode and data blocks.

        Step 2: the file system computes where in the log it can write
        this transaction, and writes a transaction begin record there
        (TxnBegin in the handout). This record contains a transaction ID,
        which needs to be unique. The file system **does not** need to
        wait for this write to finish and can immediately proceed to the
        next step.

        Step 3: the file system writes both a redo log entry and an undo
        log entry for each of the changes it computed in step 1. These
        live together. The filesystem can begin making in-place changes
        (checkpointing) the moment this undo + redo log information has
        been written.

        Step 4: once the TxnBegin record and all the log records from
        step 3 have been written, the system writes a transaction end
        record (TxnEnd in the handout). This record contains the same
        transaction ID as was written in step 2, and the transaction is
        considered committed once the TxnEnd record has been successfully
        written to disk.

        Step 5: similar to the redo logging case, the filesystem
        asynchronously continues to checkpoint/perform in-place writes
        whenever it is convenient.

    * Crash recovery

        First, the "redo pass": the filesystem goes through the log,
        finding all committed transactions and using the redo entries
        within them to apply committed changes.

        Next, the "undo pass": it scans through the log (backwards),
        finding all uncommitted transactions, and uses the undo entries
        associated with these to undo any in-place updates.

    * Why? Designed for a time when the same operating system ran on
      machines with very little memory (8-32 MB) and also on "big-iron"
      servers with lots of memory (1 GB+).
      This was an attempt to get the best of both worlds.

Here's a recent paper about a production storage system that incorporates
several of the concepts that we've studied:
https://assets.amazon.science/77/5e/4a7c238f4ce890efdc325df83263/using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-amazon-s3-2.pdf

3. Protection and security in Unix

A. Intro

    UIDs and GIDs

    processes have a user ID and one or more group IDs

    files and directories are access-controlled. you saw this in the ls
    lab

    system stores with each file who owns it. where's the info stored?
    (answer: inode.)

    special user: uid 0, called root, treated specially by the kernel as
    administrator

    uid 0 has all permissions: can read any file, do anything

    certain ops only root can do:
        --bind to ports less than 1024
        --change the current process's user or group ID
        --mount or unmount file systems
        --open raw sockets (so you can do something like ping remote
          machines, for example)
        --set the clock
        --halt or reboot the machine
        --change UIDs (so the login program needs to run as root)

B. Setuid

    --Some legitimate actions require more privileges than the user's UID
      grants
    --E.g., how should users change their passwords?
        --Passwords are stored in the root-owned /etc/passwd and
          /etc/shadow files (see above)
    --going to go into a bit of detail. why? because setuid/setgid are
      the sole means on Unix to *raise* a process's privilege level
    --Solution: setuid/setgid programs
        idea: a way for root -- or another user -- to delegate its
        ability to do something.
        --special "setuid" bit in the permissions of a file
        --Run with privileges of the file's owner or group
        --Each process has _real_ and _effective_ UID/GID
            -- _real_ is the user who launched the program
            -- _effective_ is the owner/group of the file, used in access
               checks
        --for a program marked "setuid", on exec() of the binary, the
          kernel sets effective uid = file uid. NOTE: for a non-setuid
          binary, the kernel sets effective uid = real uid.
    --Examples:
        --/usr/bin/passwd: change a user's password.
          User needs to be able to run this, but only root can modify the
          password file.
        --/bin/su: change to a new user ID if the correct password is
          typed.

        cs202-user@d1d26b015839:~/cs202-labs$ ls -l `which passwd`
        -rwsr-xr-x 1 root root 59976 Nov 24 12:05 /usr/bin/passwd
        cs202-user@d1d26b015839:~/cs202-labs$ ls -l `which su`
        -rwsr-xr-x 1 root root 55672 Feb 21  2022 /usr/bin/su
        cs202-user@d1d26b015839:~/cs202-labs$ ls -l `which ping`
        -rwsr-xr-x 1 root root 78480 Sep  5  2021 /usr/bin/ping

        [note the 's']

    --Need to own the file to set the setuid bit (makes sense; by setting
      the bit, a user is implicitly granting their privilege to others)
    --Need to own the file and be in the group to set the setgid bit

    --Here's an example for intuition. Imagine you leave your terminal
      unattended, and some other user ("attacker") sits down and types:

        % cp /bin/sh /tmp/break-acct
        % chmod 4755 /tmp/break-acct

      the leading 4 sets the setuid bit. the 755 means "rwxr-xr-x"

      Attacker later runs (from their own account):

        $ /tmp/break-acct -p

      result: the attacker now has a shell with your privileges and can
      do anything you can do (read your private files, remove them,
      overwrite them, etc.). in fact anyone on the system can run
      break-acct to get the same effect (since it's world-executable).

      More generally, imagine that you are writing a program on a shared
      system, you are the owner, and you set the setuid bit. What you are
      doing is letting that program run with *your* privileges.

    --Of course that was an attack. Sometimes people intentionally
      install setuid-root binaries. When you do that, as a system
      administrator or packager, you have to be extremely careful. You're
      saying in essence that everyone on the system should be able to run
      the binary with root's privileges.
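    The chmod 4755 mechanics above can be sketched in a few lines. This
    is a minimal illustration (assuming a POSIX system; the temp file is
    a throwaway stand-in, not the actual attack binary): it shows that a
    file's owner may set the setuid bit, and how real vs. effective UID
    are reported for an ordinary (non-setuid) process.

```python
import os, stat, tempfile

# Create a throwaway file that we own (stand-in for /tmp/break-acct).
fd, path = tempfile.mkstemp()
os.close(fd)

# We own the file, so we may set its setuid bit -- exactly the rule
# "need to own the file to set the setuid bit".
os.chmod(path, 0o4755)        # leading 4 = setuid bit; 755 = rwxr-xr-x

mode = os.stat(path).st_mode
print("setuid bit set:", bool(mode & stat.S_ISUID))   # True
print("mode bits:", oct(stat.S_IMODE(mode)))          # 0o4755

# On exec() of a setuid *binary*, the kernel sets effective uid = file
# owner's uid, while real uid stays the invoking user's uid. For a
# normal process like this one, effective uid = real uid:
print("real uid:", os.getuid(), "effective uid:", os.geteuid())
os.unlink(path)
```

    (Note: the kernel honors the setuid bit only on binaries at exec()
    time; setting the bit here just records it in the inode's mode bits.)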
    --Fundamental reason you need to be careful: it is very difficult to
      anticipate exactly how and in what environment the code will be
      run....yet when it runs, it runs with *your* privileges (where
      "your" equals "root" or "whoever set the setuid bit on some code
      they wrote")
    --NOTE: attackers can run setuid programs any time (no need to wait
      for root to run a vulnerable job)
    --FURTHER NOTE: the attacker controls many aspects of the program's
      environment

    EXAMPLE ATTACKS that exploit setuid:

    --Close fd 2 before exec()ing the program
        --now, the setuid program opens a file, for example the password
          file.... (normally, it would get fd=3, but because fd 2 was
          closed, the file will be given fd 2).
        --then, the program later encounters an error and does
          fprintf(stderr, "some error msg").
        --result: the error message goes into the password file!
        --fix: for setuid programs, the kernel will open dummy fds for
          0, 1, 2 if they are not already open

    --Set the maximum file size to zero (if, say, the setuid program
      changes a password and then rebuilds some password database), which
      means the setuid program is now running in an adverse environment

    --IFS attack: a program called "preserve" was installed as setuid
      root; it was used by old editors (like the old vi) to make a backup
      of files in a root-accessible directory.
        --preserve runs system("/bin/mail").
          [it does this to send email notifying the user that there is a
          backup, for example after a crash/restart]
        --"system" uses the shell to parse its argument
        --now if IFS (internal field separator) is set to "/" before
          running vi, then we get the following:
            --vi forks and execs /usr/lib/preserve (IFS is still set to
              '/', but the exec() call doesn't care)
            --preserve invokes system("/bin/mail"), but this causes the
              shell to parse the argument as:

                    bin mail

            --which means that if the attacker locally had a malicious
              binary called 'bin', then that binary could do:

                cd /homes/mydir/bin
                cp /bin/sh ./sh
                chown root sh   # succeeds because 'bin' is running as root
                chmod 4755 sh   # succeeds because 'bin' is running as root
                                # (the leading 4 means "set the setuid bit")

            --result is that there is now a copy of the shell executable
              that is owned by root and setuid root
            --anyone who runs this shell has a root shell on the machine
        --fix: the shell has to ignore IFS if the shell is running as
          root or if EUID != UID.
          (also, "preserve" should not have been setuid root; there
          should have been a special user/group just for this purpose.)
        --also, modern shells refuse to run scripts that are setuid. (the
          issue there is a bit different, but it is related.)

    More reading about the setuid bit and the classic example above:
    http://web.deu.edu.tr/doc/oreily/networking/puis/ch05_05.htm

C. Ptrace

    --ptrace() examples

    Attack 1:
        --attacker ptraces setuid program P
        --P runs with root's privileges
        --attacker can now manipulate P's memory and get arbitrary
          privilege on the machine. this is bad.
        --fix: don't let a process ptrace a more privileged process or
          another user's process; for example, require the tracer to
          match the real and effective UID of the target

    Attack 2:
        --attacker owns two unprivileged processes A and B.
        --A ptraces B. so far, so good. no violation of the rule above.
        --Then B execs a setuid program (for example, "su whatever"),
          which causes B's privilege to be raised. (recall that the "su"
          program is setuid root.
"su jo" becomes user "jo" if someone types jo's password.) --Now A is connected to a process that is running with root's privileges. A can use B's elevated privileges. This is bad. --fix: disable/ignore setuid bit on binary if ptraced target calls exec() --> but let root ptrace anyone Attack 3: --now, say that A and B are unprivileged processes owned by attacker --say A ptraces B. so far, so good. no violation of prior two rules. --say A executes "su attacker", i.e., it's su'ing to its own identity --While su is superuser, B execs "su root" --remember, the attacker programmed B, and can arrange for it to exec the command just above. --BUT! remembering the ptrace rules above, the exec succeeds with the setuid bit NOT disabled/ignored. the reason is that at this moment A is the superuser, so no problem with B's exec() honoring the setuid. --attacker types password into A, gets shell, and now this (unprivileged) shell is attached to "su root". --the attacker can now manipulate B's memory (disable password checks, etc.) so that the "su root" succeeds, at which point A is connected to a root shell See Linux Yama module as a partial defense: https://www.kernel.org/doc/Documentation/security/Yama.txt additionally, Linux's capability system (`man 7 capabilities`) also provides a mechanism to limit user's ability to attach to processes using the CAP_SYS_PTRACE capability. A user who has not been granted this capability cannot attach a debugger to an arbitrary process. However, by default, debuggers run by users without this capability are still allowed to attach to child processes, that is any process that the debugger forks. This means that "$ gdb " just works. Another issue: --consider a setuid process that does a bunch of privileged things and then drops privileges to become user again --should be okay, right? *****--NO. 
          Once the process has seen something privileged and then become
          the user again, it can be ptrace()d, and the confidential
          things it has seen (or the privileged resources that it holds)
          can be manipulated by an unprivileged user.
        --fix? make the software much more complicated. separate a
          single process into separate ones, for example.

D. TOCTTOU attacks (time-of-check-to-time-of-use)

    --very common attack
    --say there's a setuid program that needs to log events to a file,
      specified by the caller. The code might look like this, where
      logfile comes from user input:

        fd = open(logfile, O_CREAT|O_WRONLY|O_TRUNC, 0666);

    --what's the problem?
    --the setuid program shouldn't be able to write to a file that the
      user can't. thus:

        if (access(logfile, W_OK) < 0)
            return ERROR;
        fd = open(logfile, ....);

      should fix it, right? NO!
    --here's the attack........

        attacker runs the setuid program, passing it "/tmp/X"

        setuid program                      attacker
                                            creat("/tmp/X");
        check access("/tmp/X") --> OK
                                            unlink("/tmp/X");
                                            symlink("/etc/passwd", "/tmp/X")
        open("/tmp/X")

    --from the BSD man pages: "access() is a potential security hole and
      should never be used."
    --the issue is that the access check and the open are non-atomic
    --to fix this, you have to jump through hoops: manually traverse
      paths, checking at each point that the dir you're in is the one
      you expected to be in (i.e., that you didn't accidentally follow a
      symbolic link). maybe check that the path hasn't been modified.
      also need to use APIs that are relative to an opened directory fd:
        -- openat, renameat, unlinkat, symlinkat, faccessat
        -- fchown, fchownat, fchmod, fchmodat, fstat, fstatat
      Or wrap groups of operations in OS transactions:
        --Microsoft supported transactions on Windows Vista and newer:
          https://msdn.microsoft.com/en-us/library/windows/desktop/bb986748%28v=vs.85%29.aspx
        --research papers:
          http://www.fsl.cs.sunysb.edu/docs/valor/valor_fast2009.pdf
          http://www.sigops.org/sosp/sosp09/papers/porter-sosp09.pdf

E. Thoughts / editorial

    --at a high level, the real issue is not ptrace.
      it's not even buggy code. the real issue is that the correct
      version of the code is way harder to write than the incorrect
      version:
        --the correct version has to traverse the path manually
        --it has to be super-careful when running as setuid
    --cannot just blame application writers; must also blame the
      interfaces with which they're presented.
        --the rules are incoherent. it's not clear how permissions
          compose
    --for all that, Unix security is actually quite inflexible:
        --can't pass privileges to other processes
        --can't have multiple privileges at once
        --not a very general mechanism (cannot create a user or group
          unless root)

[thanks to Mike Walfish, David Mazières, Nickolai Zeldovich, Robert Morris]
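Coda: the TOCTTOU discussion in section D. One commonly used partial fix
is to open first and then apply checks to the *opened* fd: fstat() on an
fd examines the object the fd already refers to, so the check and the
use cannot be raced the way two path lookups can. A minimal sketch,
assuming Linux/POSIX; `open_log_safely` and the demo path are made-up
names, and a real setuid program would additionally drop privileges
(seteuid(getuid())) before the open so the kernel's own permission check
runs as the real user -- this sketch shows only the fd-pinning pattern.

```python
import os, stat, tempfile

def open_log_safely(logfile):
    # Hypothetical helper. Open first; O_NOFOLLOW makes open() fail if
    # the final path component is a symlink, defeating the
    # unlink-then-symlink("/etc/passwd", "/tmp/X") swap shown above.
    fd = os.open(logfile, os.O_CREAT | os.O_WRONLY | os.O_NOFOLLOW, 0o666)
    try:
        st = os.fstat(fd)               # checks the object fd refers to;
        if not stat.S_ISREG(st.st_mode):    # no check/use race possible
            raise PermissionError("refusing non-regular file")
    except Exception:
        os.close(fd)
        raise
    return fd

log = os.path.join(tempfile.gettempdir(), "demo-log")  # made-up path
fd = open_log_safely(log)
os.write(fd, b"log entry\n")
os.close(fd)
os.unlink(log)
```

Contrast with the broken access()-then-open() version: there, the check
and the open are two separate path lookups, and the attacker wins by
swapping the path between them. Here the only path lookup is the open
itself, and every subsequent check goes through the fd.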