Lecture 21
CS 202

1. Last class
2. Recovery: journaling
3. Unix security

1. Last class

- Crash recovery
    + Desirable properties
    + Available mechanisms
        - Ad hoc recovery
        - Copy on write

2. Journaling

-- Copy on write showed that crash consistency is achievable when
   modifications **do not** modify (or destroy) the current copy. Golden
   rule of atomicity, per Saltzer-Kaashoek: "never modify the only copy"

-- Problem is that copy-on-write carries significant write and space
   overheads. Want to do better without violating the golden rule of
   atomicity.

-- Going to do so by borrowing ideas from how transactions are implemented
   in databases.

-- Core idea: treat file system operations as transactions. Concretely,
   this means that after a crash, failure recovery ensures that:

    * Committed file system operations are reflected in on-disk data
      structures.
    * Uncommitted file system operations are not visible after crash
      recovery.

-- Core mechanism: record enough information to finish applying committed
   operations (*redo operations*) and/or roll back uncommitted operations
   (*undo operations*). This information is stored in a redo log or undo
   log. Discuss this in detail next.

-- Concept: commit point: the point at which there's no turning back.

-- Actions always look like this:

    --first step
    ....            [can back out, leaving no trace]
    --commit point
    .....           [completion is inevitable]
    --last step

-- What's the commit point when buying a house? when buying a pair of
   shoes? when getting married?

-- Ask yourself: what's the commit point in the protocols below (and the
   copy-on-write protocol above)?

-- Redo logging

    * Used by Ext3 and Ext4 on Linux; going to discuss in that context.
    * Log is a fixed-length ring buffer placed at the beginning of the
      disk (see handout).
    * Basic operations

        Step 1: the filesystem computes what would change due to an
        operation. For instance, creating a new file involves changes to
        directory inodes; appending to a file involves changes to the
        file's inode and data blocks.
        Step 2: the file system computes where in the log it can write
        this transaction, and writes a transaction begin record there
        (TxnBegin in the handout). This record contains a transaction ID,
        which needs to be unique. The file system **does not** need to
        wait for this write to finish and can immediately proceed to the
        next step.

        Step 3: the file system writes a record or records detailing all
        the changes it computed in step 1 to the log. The file system
        **must** now wait for these log changes and the TxnBegin record
        (step 2) to finish being written to disk.

        Step 4: once the TxnBegin record and all the log records from
        step 3 have been written, the system writes a transaction end
        record (TxnEnd in the handout). This record contains the same
        transaction ID as was written in step 2, and the transaction is
        considered committed once the TxnEnd record has been successfully
        written to disk.

        Step 5: once the TxnEnd record has been written, the filesystem
        asynchronously performs the actual file system changes; this
        process is called **checkpointing**. While the system is free to
        perform checkpointing whenever it is convenient, the checkpoint
        rate dictates the size of the log that the system must reserve.

    * Crash recovery

        During crash recovery, the filesystem needs to read through the
        log, determine the set of **committed** operations, and then
        apply them. Observe that:

        -- The filesystem can determine whether a transaction is
           committed by matching the transaction IDs in TxnBegin and
           TxnEnd records.
        -- It is safe to apply the same redo log multiple times (applying
           redo records is idempotent).

        Operationally, when the system is recovering from a crash, it
        does the following:

        Step 1: the file system starts scanning from the beginning of
        the log.

        Step 2: every time it finds a TxnBegin entry, it searches for a
        corresponding TxnEnd entry.

        Step 3: if matching TxnBegin and TxnEnd entries are found --
        indicating that the transaction is committed -- the file system
        applies (checkpoints) the changes.
        Step 4: recovery is completed once the entire log is scanned.

        Note: for redo logs, filesystems generally begin scanning from
        the **start of the log**.

    * What to log?

        Observe that logging can double the amount of data written to
        disk. To improve performance, Ext3 and Ext4 allow users to choose
        what to log.

        * Default is to log only metadata. The idea here is that many
          people are willing to accept data loss/corruption after a
          crash, but keeping metadata consistent is important: if
          metadata is inconsistent, the FS may become unusable, as the
          data structures no longer have integrity.

        * Can change settings to force data to be logged along with
          metadata. This incurs additional overheads, but prevents data
          loss on crash.

-- Undo logging

    * Not used in isolation by any file system.

    * Key idea: the log contains information on how to roll back any
      changes made to data. Mechanically, during normal operation:

        Step 1: write a TxnBegin entry to the log.

        Step 2: for each operation, write instructions for how to undo
        any updates made to a block. These instructions might include the
        original data in the block. In-place changes to the block can be
        made right after these instructions have been persisted.

        Step 3: wait for in-place changes (what we referred to as
        checkpointing) to finish for all blocks.

        Step 4: write a TxnEnd entry to the log, thereby committing the
        transaction.

        *Note* this implies that if a transaction is committed, then all
        changes have been written to the actual data structures of the
        file system.

      During crash recovery:

        Step 1: scan the log to find all uncommitted transactions; these
        are ones where a TxnBegin entry is present but no TxnEnd entry is
        found.

        Step 2: for each such transaction, check whether the undo entry
        is valid. This is usually done through the use of a checksum. Why
        do we need this? Remember, a crash might occur before the undo
        entry has been successfully written.
        If that happened, then (by the procedure described above) the
        actual changes corresponding to this undo entry have not been
        written to disk, so ignoring this entry is safe. On the other
        hand, trying to undo using a partially complete entry might
        result in data corruption, so using this entry would be
        **unsafe**.

        Step 3: apply all valid undo entries found, in order to restore
        the disk to a consistent state.

        Note: for undo logs, logs are generally scanned from the **end of
        the log**.

    * Advantage: changes can be checkpointed to disk as soon as the undo
      log has been updated. This is beneficial when the amount of buffer
      cache is low.

    * Disadvantage: a transaction is not committed until all dirty blocks
      have been flushed to their in-place targets.

-- Redo logging vs. undo logging

    This is just a recap of the advantages and disadvantages.

    **Redo logging**

    * Advantage: a transaction can commit without all in-place updates
      (writes to actual disk locations) being completed. Updating the
      journal is sufficient. Why is this useful? In-place updates might
      be scattered all over the disk, so the ability to delay them can
      help improve performance.

    * Disadvantage: a transaction's dirty blocks need to be kept in the
      buffer cache until the transaction commits and all of the
      associated journal entries have been flushed to disk. This might
      increase memory pressure.

    **Undo logging**

    * Advantage: a dirty block can be written to disk as soon as the
      undo-log entry has been flushed to disk. This reduces memory
      pressure.

    * Disadvantage: a transaction cannot commit until all dirty blocks
      have been flushed to disk. This imposes additional constraints on
      the disk scheduler and might result in worse performance.

-- Combining redo and undo logging

    * Done by NTFS.

    * Goals:

        - Allow dirty buffers to be flushed as soon as their associated
          journal entries are written. This can reduce memory pressure
          when necessary.
        - Allow transactions to commit as soon as logging is done, so the
          system has greater flexibility when scheduling disk writes.

    * How does this work?

    * Basic operations

        Step 1: the filesystem computes what would change due to an
        operation. For instance, creating a new file involves changes to
        directory inodes; appending to a file involves changes to the
        file's inode and data blocks.

        Step 2: the file system computes where in the log it can write
        this transaction, and writes a transaction begin record there
        (TxnBegin in the handout). This record contains a transaction ID,
        which needs to be unique. The file system **does not** need to
        wait for this write to finish and can immediately proceed to the
        next step.

        Step 3: the file system writes both a redo log entry and an undo
        log entry for each of the changes it computed in step 1. These
        live together. The filesystem can begin making in-place changes
        (checkpointing) the moment this undo + redo log information has
        been written.

        Step 4: once the TxnBegin record and all the log records from
        step 3 have been written, the system writes a transaction end
        record (TxnEnd in the handout). This record contains the same
        transaction ID as was written in step 2, and the transaction is
        considered committed once the TxnEnd record has been successfully
        written to disk.

        Step 5: similar to the redo logging case, the filesystem
        asynchronously continues to checkpoint/perform in-place writes
        whenever it is convenient.

    * Crash recovery

        First, the "redo pass": the filesystem goes through the log,
        finding all committed transactions and using the redo entries
        within them to apply committed changes.

        Next, the "undo pass": it scans through the log (backwards),
        finding all uncommitted transactions, and uses the undo entries
        associated with these to undo any in-place updates.

    * Why? Designed for a time when the same operating system ran on
      machines with very little memory (8-32 MB) and also on "big-iron"
      servers with lots of memory (1 GB+).
      This was an attempt to get the best of both worlds.

Here's a recent paper about a production storage system that incorporates
several of the concepts that we've studied:
https://assets.amazon.science/77/5e/4a7c238f4ce890efdc325df83263/using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-amazon-s3-2.pdf

3. Protection and security in Unix

A. Intro

    UIDs and GIDs

    processes have a user ID and one or more group IDs

    files and directories are access-controlled. you saw this in the ls
    lab

    system stores with each file who owns it. where's the info stored?
    (answer: inode.)

    special user: uid 0, called root, treated specially by the kernel as
    administrator

    uid 0 has all permissions: can read any file, do anything

    certain ops only root can do:
        --bind to ports less than 1024
        --change the current process's user or group ID
        --mount or unmount file systems
        --open raw sockets (so you can do something like ping remote
          machines, for example)
        --set the clock
        --halt or reboot the machine
        --change UIDs (so the login program needs to run as root)

B. Setuid

    --Some legitimate actions require more privileges than the user's UID
      grants
    --E.g., how should users change their passwords?
        --Passwords are stored in the root-owned /etc/passwd and
          /etc/shadow files (see above)
    --going to go into a bit of detail. why? because setuid/setgid are
      the sole means on Unix to *raise* a process's privilege level
    --Solution: setuid/setgid programs
        idea: a way for root -- or another user -- to delegate its
        ability to do something.
        --special "setuid" bit in the permissions of a file
        --Run with privileges of the file's owner or group
        --Each process has _real_ and _effective_ UID/GID
            -- _real_ is the user who launched the program
            -- _effective_ is the owner/group of the file, used in access
               checks
        --for a program marked "setuid", on exec() of the binary, the
          kernel sets effective uid = file uid. NOTE: for a non-setuid
          binary, the kernel sets effective uid = real uid.
    --Examples:
        --/usr/bin/passwd: change a user's password.
          User needs to be able to run this, but only root can modify the
          password file.
        --/bin/su: change to a new user ID if the correct password is
          typed.

        cs202-user@d1d26b015839:~/cs202-labs$ ls -l `which passwd`
        -rwsr-xr-x 1 root root 59976 Nov 24 12:05 /usr/bin/passwd
        cs202-user@d1d26b015839:~/cs202-labs$ ls -l `which su`
        -rwsr-xr-x 1 root root 55672 Feb 21  2022 /usr/bin/su
        cs202-user@d1d26b015839:~/cs202-labs$ ls -l `which ping`
        -rwsr-xr-x 1 root root 78480 Sep  5  2021 /usr/bin/ping

        [note the 's']

    --Need to own the file to set the setuid bit (makes sense; by setting
      the bit, a user is implicitly granting their privilege to others)
    --Need to own the file and be in the group to set the setgid bit

    --Here's an example for intuition. Imagine you leave your terminal
      unattended, and some other user ("attacker") sits down and types:

        % cp /bin/sh /tmp/break-acct
        % chmod 4755 /tmp/break-acct

      the leading 4 sets the setuid bit. the 755 means "rwxr-xr-x"

      Attacker later runs (from their own account):

        $ /tmp/break-acct -p

      result: the attacker now has a shell with your privileges and can
      do anything you can do (read your private files, remove them,
      overwrite them, etc.). in fact anyone on the system can run
      break-acct to get the same effect (since it's world-executable).

      More generally, imagine that you are writing a program on a shared
      system, you are the owner, and you set the setuid bit. What you are
      doing is letting that program run with *your* privileges.

    --Of course that was an attack. Sometimes people intentionally
      install setuid-root binaries. When you do that, as a system
      administrator or packager, you have to be extremely careful. You're
      saying in essence that everyone on the system should be able to run
      the binary with root's privileges.
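    The chmod 4755 mechanics above can be sketched in a few lines. This
    is a minimal illustration (assuming a POSIX system; the temp file is
    a throwaway stand-in, not the actual attack binary): it shows that a
    file's owner may set the setuid bit, and how real vs. effective UID
    are reported for an ordinary (non-setuid) process.

```python
import os, stat, tempfile

# Create a throwaway file that we own (stand-in for /tmp/break-acct).
fd, path = tempfile.mkstemp()
os.close(fd)

# We own the file, so we may set its setuid bit -- exactly the rule
# "need to own the file to set the setuid bit".
os.chmod(path, 0o4755)        # leading 4 = setuid bit; 755 = rwxr-xr-x

mode = os.stat(path).st_mode
print("setuid bit set:", bool(mode & stat.S_ISUID))   # True
print("mode bits:", oct(stat.S_IMODE(mode)))          # 0o4755

# On exec() of a setuid *binary*, the kernel sets effective uid = file
# owner's uid, while real uid stays the invoking user's uid. For a
# normal process like this one, effective uid = real uid:
print("real uid:", os.getuid(), "effective uid:", os.geteuid())
os.unlink(path)
```

    (Note: the kernel honors the setuid bit only on binaries at exec()
    time; setting the bit here just records it in the inode's mode bits.)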
    --Fundamental reason you need to be careful: it is very difficult to
      anticipate exactly how and in what environment the code will be
      run....yet when it runs, it runs with *your* privileges (where
      "your" equals "root" or "whoever set the setuid bit on some code
      they wrote")
    --NOTE: attackers can run setuid programs any time (no need to wait
      for root to run a vulnerable job)
    --FURTHER NOTE: the attacker controls many aspects of the program's
      environment

    EXAMPLE ATTACKS that exploit setuid:

    --Close fd 2 before exec()ing the program
        --now, the setuid program opens a file, for example the password
          file.... (normally, it would get fd=3, but because fd 2 was
          closed, the file will be given fd 2).
        --then, the program later encounters an error and does
          fprintf(stderr, "some error msg").
        --result: the error message goes into the password file!
        --fix: for setuid programs, the kernel will open dummy fds for
          0, 1, 2 if they are not already open

    --Set the maximum file size to zero (if, say, the setuid program
      changes a password and then rebuilds some password database), which
      means the setuid program is now running in an adverse environment

    --IFS attack: a program called "preserve" was installed as setuid
      root; it was used by old editors (like the old vi) to make a backup
      of files in a root-accessible directory.
        --preserve runs system("/bin/mail").
          [it does this to send email notifying the user that there is a
          backup, for example after a crash/restart]
        --"system" uses the shell to parse its argument
        --now if IFS (internal field separator) is set to "/" before
          running vi, then we get the following:
            --vi forks and execs /usr/lib/preserve (IFS is still set to
              '/', but the exec() call doesn't care)
            --preserve invokes system("/bin/mail"), but this causes the
              shell to parse the argument as:

                    bin mail

            --which means that if the attacker locally had a malicious
              binary called 'bin', then that binary could do:

                cd /homes/mydir/bin
                cp /bin/sh ./sh
                chown root sh   # succeeds because 'bin' is running as root
                chmod 4755 sh   # succeeds because 'bin' is running as root
                                # (the leading 4 means "set the setuid bit")

            --result is that there is now a copy of the shell executable
              that is owned by root and setuid root
            --anyone who runs this shell has a root shell on the machine
        --fix: the shell has to ignore IFS if the shell is running as
          root or if EUID != UID.
          (also, "preserve" should not have been setuid root; there
          should have been a special user/group just for this purpose.)
        --also, modern shells refuse to run scripts that are setuid. (the
          issue there is a bit different, but it is related.)

    More reading about the setuid bit and the classic example above:
    http://web.deu.edu.tr/doc/oreily/networking/puis/ch05_05.htm

C. Ptrace

    --ptrace() examples

    Attack 1:
        --attacker ptraces setuid program P
        --P runs with root's privileges
        --attacker can now manipulate P's memory and get arbitrary
          privilege on the machine. this is bad.
        --fix: don't let a process ptrace a more privileged process or
          another user's process; for example, require the tracer to
          match the real and effective UID of the target

    Attack 2:
        --attacker owns two unprivileged processes A and B.
        --A ptraces B. so far, so good. no violation of the rule above.
        --Then B execs a setuid program (for example, "su whatever"),
          which causes B's privilege to be raised. (recall that the "su"
          program is setuid root.
"su jo" becomes user "jo" if someone types jo's password.) --Now A is connected to a process that is running with root's privileges. A can use B's elevated privileges. This is bad. --fix: disable/ignore setuid bit on binary if ptraced target calls exec() --> but let root ptrace anyone Attack 3: --now, say that A and B are unprivileged processes owned by attacker --say A ptraces B. so far, so good. no violation of prior two rules. --say A executes "su attacker", i.e., it's su'ing to its own identity --While su is superuser, B execs "su root" --remember, the attacker programmed B, and can arrange for it to exec the command just above. --BUT! remembering the ptrace rules above, the exec succeeds with the setuid bit NOT disabled/ignored. the reason is that at this moment A is the superuser, so no problem with B's exec() honoring the setuid. --attacker types password into A, gets shell, and now this (unprivileged) shell is attached to "su root". --the attacker can now manipulate B's memory (disable password checks, etc.) so that the "su root" succeeds, at which point A is connected to a root shell See Linux Yama module as a partial defense: https://www.kernel.org/doc/Documentation/security/Yama.txt additionally, Linux's capability system (`man 7 capabilities`) also provides a mechanism to limit user's ability to attach to processes using the CAP_SYS_PTRACE capability. A user who has not been granted this capability cannot attach a debugger to an arbitrary process. However, by default, debuggers run by users without this capability are still allowed to attach to child processes, that is any process that the debugger forks. This means that "$ gdb " just works. Another issue: --consider a setuid process that does a bunch of privileged things and then drops privileges to become user again --should be okay, right? *****--NO. 
          Once the process has seen something privileged and then become
          the user again, it can be ptrace()d, and the confidential
          things it has seen (or the privileged resources that it holds)
          can be manipulated by an unprivileged user.
        --fix? make the software much more complicated. separate a
          single process into separate ones, for example.

D. TOCTTOU attacks (time-of-check-to-time-of-use)

    --very common attack
    --say there's a setuid program that needs to log events to a file,
      specified by the caller. The code might look like this, where
      logfile comes from user input:

        fd = open(logfile, O_CREAT|O_WRONLY|O_TRUNC, 0666);

    --what's the problem?
    --the setuid program shouldn't be able to write to a file that the
      user can't. thus:

        if (access(logfile, W_OK) < 0)
            return ERROR;
        fd = open(logfile, ....);

      should fix it, right? NO!
    --here's the attack........

        attacker runs the setuid program, passing it "/tmp/X"

        setuid program                      attacker
                                            creat("/tmp/X");
        check access("/tmp/X") --> OK
                                            unlink("/tmp/X");
                                            symlink("/etc/passwd", "/tmp/X")
        open("/tmp/X")

    --from the BSD man pages: "access() is a potential security hole and
      should never be used."
    --the issue is that the access check and the open are non-atomic
    --to fix this, you have to jump through hoops: manually traverse
      paths, checking at each point that the dir you're in is the one
      you expected to be in (i.e., that you didn't accidentally follow a
      symbolic link). maybe check that the path hasn't been modified.
      also need to use APIs that are relative to an opened directory fd:
        -- openat, renameat, unlinkat, symlinkat, faccessat
        -- fchown, fchownat, fchmod, fchmodat, fstat, fstatat
      Or wrap groups of operations in OS transactions:
        --Microsoft supported transactions on Windows Vista and newer:
          https://msdn.microsoft.com/en-us/library/windows/desktop/bb986748%28v=vs.85%29.aspx
        --research papers:
          http://www.fsl.cs.sunysb.edu/docs/valor/valor_fast2009.pdf
          http://www.sigops.org/sosp/sosp09/papers/porter-sosp09.pdf

E. Thoughts / editorial

    --at a high level, the real issue is not ptrace.
      it's not even buggy code. the real issue is that the correct
      version of the code is way harder to write than the incorrect
      version:
        --the correct version has to traverse the path manually
        --it has to be super-careful when running as setuid
    --cannot just blame application writers; must also blame the
      interfaces with which they're presented.
        --the rules are incoherent. it's not clear how permissions
          compose
    --for all that, Unix security is actually quite inflexible:
        --can't pass privileges to other processes
        --can't have multiple privileges at once
        --not a very general mechanism (cannot create a user or group
          unless root)

[thanks to Mike Walfish, David Mazières, Nickolai Zeldovich, Robert Morris]
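Coda: the TOCTTOU discussion in section D. One commonly used partial fix
is to open first and then apply checks to the *opened* fd: fstat() on an
fd examines the object the fd already refers to, so the check and the
use cannot be raced the way two path lookups can. A minimal sketch,
assuming Linux/POSIX; `open_log_safely` and the demo path are made-up
names, and a real setuid program would additionally drop privileges
(seteuid(getuid())) before the open so the kernel's own permission check
runs as the real user -- this sketch shows only the fd-pinning pattern.

```python
import os, stat, tempfile

def open_log_safely(logfile):
    # Hypothetical helper. Open first; O_NOFOLLOW makes open() fail if
    # the final path component is a symlink, defeating the
    # unlink-then-symlink("/etc/passwd", "/tmp/X") swap shown above.
    fd = os.open(logfile, os.O_CREAT | os.O_WRONLY | os.O_NOFOLLOW, 0o666)
    try:
        st = os.fstat(fd)               # checks the object fd refers to;
        if not stat.S_ISREG(st.st_mode):    # no check/use race possible
            raise PermissionError("refusing non-regular file")
    except Exception:
        os.close(fd)
        raise
    return fd

log = os.path.join(tempfile.gettempdir(), "demo-log")  # made-up path
fd = open_log_safely(log)
os.write(fd, b"log entry\n")
os.close(fd)
os.unlink(log)
```

Contrast with the broken access()-then-open() version: there, the check
and the open are two separate path lookups, and the attacker wins by
swapping the path between them. Here the only path lookup is the open
itself, and every subsequent check goes through the fd.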