Operating Systems
1999-2000 Fall
Mon 5-6:50
Ciww 109

Allan Gottlieb
gottlieb@nyu.edu
http://allan.ultra.nyu.edu/gottlieb
715 Broadway, Room 1001
212-998-3344
609-951-2707
email is best

Administrivia

Web Pages

There is a web page for the course. You can find it from my home page.

Can find these notes there. Let me know if you can't find it.
They will be updated as bugs are found.
Will also have each lecture available as a separate page. I will produce the page after the lecture is given. These individual pages might not get updated.

Textbook

Text is Tanenbaum, "Modern Operating Systems".

Available in bookstore.
We will do part 1, starting with chapter 1.

Computer Accounts and majordomo mailing list

You are entitled to a computer account, get it.
Sign up for majordomo mailing list for the course. But I use the web more. If you want to send mail to me, use gottlieb@nyu.edu not the mailing list.
You may do assignments on any system you wish, but ...
- You are responsible for the machine. I extend deadlines if the nyu machines are down, not if yours are.
- Please upload your assignments to the nyu systems (dates).

Homeworks and Labs

I make a distinction between homework and labs.

Labs are

Required
Due several lectures later (date given on assignment)
Graded and form part of your final grade
Penalized for lateness

Homeworks are

Optional
Due beginning of Next lecture
Not accepted late
Mostly from the book
Collected and returned
Can help, but not hurt, your grade

Upper left board for assignments and announcements.

Homework: Read Chapter 1 (Introduction)

1. Introduction

Levels of abstraction (virtual machines)

Software (and hardware, but that is not this course) is done in layers.
The higher layers use the facilities provided by lower layers.
Alternatively said, the upper layers are written using a more powerful and more abstract virtual machine than the lower layers.
Using a broad brush, the layers are.
1. Scripts (e.g. shell scripts)
2. Applications and utilities
3. Libraries
4. The OS proper (the kernel)
5. Hardware
The kernel itself is itself normally layered, e.g.
1. ...
2. Filesystems
3. Machine independent I/O
4. Machine dependent device drivers
The machine independent part is written assuming "virtual (i.e. idealized) hardware". Simply read a block from a "disk". But in reality one must deal with the specific disk controller.
Often the machine independent part is more than one layer.
The term OS is not well defined. Is it just the kernel? How about the libraries? The utilities? All these are certainly system software but not clear how much is part of the OS.

1.1: What is an operating system?

The kernel itself raises the level of abstraction and hides details. Can write to a file (a concept present in hardware) and ignore whether it is a floppy or hard disk.

The kernel is a resource manager (so users don't conflict).

How is an OS fundamentally different from a compiler (say)?

Answer: Concurrency! Per Brinch Hansen in Operating Systems Principles (Prentice Hall, 1973) writes.

The main difficulty of multiprogramming is that concurrent activities can interact in a time-dependent manner, which makes it practically impossibly to locate programming errors by systematic testing. Perhaps, more than anything else, this explains the difficulty of making operating systems reliable.

1.2 History of Operating Systems

Single user (no OS)
Batch, uniprogrammed, run to completion
Multiprogrammed
- Overlap CPU and I/O
- Multiple batches
  - IBM OS/MFT
  - IBM OS/MVT (then other names)
  - MVT is more ``efficient'' user of resources but is more difficult.
  - When we study memory management, we will see that with varying sizes questions like compaction and ``holes'' arise.
- Time sharing
  - This is multiprogramming with rapid switching.
  - We will study scheduling when we do processor management
Multiple computers
- Multiprocessors: Almost from the beginning of the computer age but now are not exotic.
- Network OS: Make use of the multiple PCs/workstations on a LAN.
- Distributed OS: A ``seamless'' version of above
Real time systems
- Often in embedded systems
- Soft vs hard real time. In the latter missing a deadline is a fatal error--sometimes literally.

Homework:1, 2, 5 (unless otherwise stated, problems numbers are from the end of the chapter in Tanenbaum.)

1.3: Operating System Concepts

This will be brief. Much of the rest of the course will consist in ``filling in the details''.

1.3.1: Processes

A program in execution.

Often one distinguishes the state or context (memory image, open file) from the thread of control. Then if one has many threads running in the same task, the result is a ``multithreaded processes''.

The OS keeps information about all processes in the process table. Indeed, the OS views the process as the entry. An example of an active entity being viewed as a data structure (cf. discrete event simulations).

The set of processes forms a tree via the fork system call. The forker is the parent of the forkee.

A signal can be sent to a process to cause it to execute a predefined function (the signal handler). This can be tricky to program since the programmer does not know when in his ``main'' program the signal handler will be invoked.

1.3.2: Files

Modern systems have a hierarchy of files. A file system tree.

In MSDOS the hierarchy is a forest not a tree. There is no file, or directory that is an ancestor if both a:\ and c:\.
In unix the existence of symbolic links weakens the tree to a DAG

Files and directories normally have permissions

Normally have at least rwx.
User, group, world
More general is access control lists
Often files have ``attributes'' as well. For example the ext2 filesystem in linux supports a d (dump) attribute that is a hint to the dump program not to dump this file.
When a file is opened, permissions are checked a file descriptor is returned that is used for subsequent operations

Devices (mouse, tape drive, cdrom) are often view as ``special files''. In a unix system these are normally found in the /dev directory. Some utilities that are normally applied to (ordinary) files can as well be applied to some special files. For example, when you do not have anything serious going on (i.e. as soon as you log in), type the following on unix

    cat /dev/mouse

and then move the mouse. You kill the cat by typing cntl-C. I tried this on my linux box and no damage occurred. Your mileage may vary.

Many systems have standard files that are automatically made available to a process upon startup. There (initial) file descriptors are fixed

standard input: fd=0
standard output: fd=1
standard error: fd=2

A convenience offered by some command interpretors is a pipe

  ls | wc

Will give the number of files. Homework: 3

1.3.3: System Calls

The way a user (i.e. program) directly interfaces with the OS. Often the component of the OS responsible for fielding system calls and dispatching them is called the envelope. Here is a picture showing some of the components and the external events for which they are the interface.

What happens when a user writes a function call like read?

Normal function call (in C, ada, etc.)
Library routine (in C)
Small assembler routine
1. Move arguments to predefined place (perhaps registers)
2. Poof (a trap instruction) and then the OS proper runs in supervisor mode
3. Fixup result (move to correct place)

Homework: 6

1.3.4: The shell

Assumed knowledge

Homework: 9.

1.4: OS Structure

I must note that tanenbaum is a big advocate of the so called microkernel approach in which as much as possible is moved out of the (protected) microkernel into usermode components.

In the early 90s this was popular. Digital Unix and windows NT were examples. Digital unix was based on Mach a research OS from carnegie mellon university. Lately, the growing popularity of linux has called this into question.

1.4.1: Monolithic approach

The previous picture: one big program

The system switches from user mode to kernel mode during the poof and then back when the OS does a ``return''.

But of course we can structure the system better, which brings us to.

1.4.2: Layered Systems

Some systems have more layers and are more strictly structured.

An early layered system was ``THE'' by dijkstra

The operator
User programs
I/O mgt
Operator-process communication
Memory and drum management

The layering was done by convention, i.e. there was no enforcement by hardware and the entire OS is linked together as one program. This is true of man modern OS systems as well (e.g., linux).

The multics system was layered in a more formal manner. The hardware provided several protection layers and the OS used them.

1.4.4: Virtual machines

Use a ``hypervisor'' (beyond supervisor) to switch between multiple Operating Systems

Each App/CMS is a virtual 370
CMS is a single user OS
A system call in some App traps to the corresponding cms
CMS believes it is running on the machine so issues I/O instructions but ...
I/O instructions in CMS trap to VM/370

================ Start Lecture 2 ================

14.4: Client Server

When done on one computer this is the microkernel approach in which the microkernel just supplies interprocess communication and the main OS functions are provided by a number of usermode processes.

This does have advantages. For example an error in the file server cannot corrupt memory in the process server. This makes errors easier to track down.

But it does mean that when a (real) user process makes a system call there are more switches from user to kernel mode and back. These are not free.

A distributed system can be thought of as an extension of the client server concept where the servers are remote.

Homework: 11

Chapter 2: Process Management

Tanenbaum's chapter title is ``processes''. I prefer process management. The subject matter is process scheduling, interrupt handling, and IPC (Interprocess communication--and coordination).

2.1: Processes

Definition: A process is a program in execution.

Even though in actuality there are many processes running at once, the OS gives each process the illusion that it is running alone.

We are assuming a multiprogramming OS that automatically switches from one process to another. Sometimes this is called pseudoparallelism since one has the illusion of a parallel processor.
Virtual time: The time used by just this processes. Progresses at a rate independent of other processes (almost, processes switches get charged a little).
Virtual memory: The amount of memory the process can ``think it has'' is independent of other processes and each processes believes it is loaded at location zero (say) and has contiguous memory (no holes).

Some systems have user processes and system processes. The latter act as servers satisfying requests from the former (which act as clients). The natural structure of such a system is to have process management (i.e. process switching, interrupt handling, and IPC) in the lowest layer and have the rest of the OS consist of system processes.

This is called the client-server model and is one tanenbaum likes. Indeed, there was reason to believe that it would dominate. But that hasn't happened as yet. One calls such an OS server based. Systems like traditional unix or linux can be called self-service in that the user process itself switches to kernel mode and performs the system call. That is, the same process changes back and forth from/to user<-->system mode and services itself.

Process Hierarchies

Modern general purpose operating systems permit a user to create (and distroy) processes. In unix this is done by the fork system call which creates a child process. Both parent and child keep running (indeed they have the same program text) and each can fork off other processes. A process tree results. The root of the tree is a special process created by the OS during startup.

Process states and transitions

This diagram contains a great deal of information.

Consider a running process P that issues an I/O request
- The process blocks
- At some later point, a disk interrupt occurs and the driver detects that P's request is satisfied.
- P is unblocked, i.e. is moved from from blocked to ready
- At some later time the processor looks for a ready job to run and picks P
A preemptive scheduler has the dotted line preempt;
A non-preemptive scheduler doesn't
The number of processes changes only for two arcs: create and terminate
Suspend and resume are medium term scheduling
- Done on a larger time scale
- Involves memory management as well
- Sometimes called two level scheduling

One can organize an OS around the scheduler.

Write a minimal ``kernel'' consisting of the scheduler, interrupt handlers, and IPC
The rest of the OS consists of kernel processes (e.g. memory, filesystem)
Tanenbaum likes this; his OS ``Minix'' is done this way
Client server organization

2.1.3: Implementation of Processes

Process table

One entry per process; often callet PTE (process table entry)
The central data structure for process management
A state transition is reflected by a change in the value of a field in the PTE
We have converted an active entity (process) into a data structure (PTE). Finkel calls this the level principle ``an active entity becomes a data structure when looked at from a lower level''
The PTE contains a great deal of information about the process
- Saved value of registers when process not running
- Stack pointer
- CPU time used
- Process id (PID)
- Process id of parent (PPID)
- User id (uid and euid)
- Group id (gid and egid)
- Pointer to text segment (memory for the program text)
- Pointer to data segment
- Pointer to stack segment
- UMASK (default permissions for new files)
- Current working directory (cwg)
- Many others

An aside on interupts

In a well defined location in memory (specified by the hardware) the OS store aninterrupt vector, which contains the address of the (first level) interrupt handler.

Tanenbaum calls this the interrupt service routine.
Actually one can have different priorities of interrupts and the interrupt contains one pointer for each level. This is why it is called a vector

Assume a process P is running and an interrupt occurs (say a disk interrupt for the completion of a disk read previously issued by process Q). Note that the interrupt is unlikely to be for process P.

The hardware stacks the program counter etc (possibly some registers)
Hardware loads new program counter from the interrupt vector.
- Loading program counter causes a jump
- Steps 1 and 2 are similar to a procedure call. But the interrupt is asynchronous
Assembly language routine saves registers
Assembly routine sets up new stack
- These last two steps can be called setting up the C environment
Assembly routine calls C procedure (tanenbaum forgot this one)
C procedure does the real work
- Determines what caused the interrupt (in this case a disk completed an I/O)
  - How does it figure out the cause?
  - Which priority interrupt was activated.
  - The controller can write data in memory before the interrupt
  - The OS can read registers in the controller
- Mark process Q. The code to run may well be OS code (e.g. the OS may need to copy the data just read from a buffer into the user's space).
- Now we have at least two processes to run: P and Q
- The scheduler decides which process to run (P or Q or something else). Lets assume that the decision is to run P.
The C procedure (that did the real work in the interrupt processing) continues and returns to the assembly code.
Assembly language starts P at the point it was when the interrupt occurred.

2.2: Interprocess Communication (IPC)

2.2.1: Race Conditions

A race condition occurs when two processes can interact and the outcome depends on the order in which the processes execute.

Imagine two processes both accessing x, which is initially 10.
- One process is to execute x <-- x+1
- The other is to execute x <-- x-1
- When both are finished x should be 10
- But we might get 9 and might get 11!
- Show how this can happen (x <-- x+1 is not atomic)
- Tanenbaum shows how this can lead to disaster for a printer spooler

Homework: 2

2.2.2: Critical sections

Prevent interleaving of sections of code that need to be atomic with respect to each other. That is, the conflicting sections need mutual exclusion. If process A is executing its critical section, it excludes process B from executing its critical section. Conversely if process B is executing is critical section, it excludes process A from executing its critical section.

Goals for a critical section implementation.

No two processes may be simultaneously inside their critical section
No assumption may be made about the speeds or the number of CPUs
No process outside its critical section may block other processes
No process should have to wait forever to enter its critical section
- I do NOTE make this last requirement.
- I just require that the system as a whole make progress (so not all processes are blocked)
- I refer to solutions that do not satisfy tanenbaum's last condition as unfair (but nonetheless correct) solutions

2.2.3 Mutual exclusion with busy waiting

The operating system can choose not to preempt itself. That is, no preemption for system processes (if the OS is client server) or for processes running in system mode (if the OS is self service)

But this is not adequate

Does work for user programs
Doesn't prevent conflicts between the main line OS and interrupt handlers
- This could be prevented by blocking interrupts while the mail line is in its critical section.
- Indeed, this is done.
- Don't want to block interrupts for too long or system seems unresponsivie
Doesn't work if you have several processors
- Both main lines can conflict
- One processor cannot block interrupts on the other

Software solutions for two processes

Initially P1wants=P2wants=false

Code for P1                             Code for P2

Loop forever {                          Loop forever {
    P1wants <-- true         ENTRY          P2wants <-- true
    while (P2wants) {}       ENTRY          while (P1wants) {}
    critical-section                        critical-section
    P1wants <-- false        EXIT           P2wants <-- false
    non-critical-section }                  non-critical-section }

Explain why this works.

But it is wrong! Why?

Initially turn=1

Code for P1                      Code for P2

Loop forever {                   Loop forever {
    while (turn = 2) {}              while (turn = 1) {}
    critical-section                 critical-section
    turn <-- 2                       turn <-- 1
    non-critical-section }           non-critical-section }

This one forces alternation, so is not general enough.

In fact, it took years (way back when) to find a correct solution. The first one was found by dekker. It is very clever, but I am skipping it (I cover it when I teach OS II).

Initially P1wants=P2wants=false  and  turn=1

Code for P1                        Code for P2

Loop forever {                     Loop forever {
    P1wants <-- true                   P2wants <-- true
    turn <-- 2                         turn <-- 1
    while (P2wants and turn=2) {}      while (P1wants and turn=1) {}
    critical-section                   critical-section
    P1wants <-- false                  P2wants <-- false
    non-critical-section               non-critical-section

This is Peterson's solution. When it was published, it was a surprise to see such a simple soluntion. In fact Peterson gave a solution for any number of processes. Subsequently, algorithms with better fairness properties were found (e.g. no task has to wait for another task to enter the CS twice). We will not cover these.

Hardware assist (test and set)

TAS(b) where b is a binary variable ATOMICALLY sets b<--true and returns the OLD value of b. Of course it would be silly to return the new value of b since we know the new value is true

Now implementing a critical section for any number of processes is trivial.

while (TAS(s)) {}   ENTRY
s<--false           EXIT

P and V and Semaphores

Note: Tanenbaum does both busy waiting (like above) and blocking (process switching) solutions. We will only do busy waiting.

Homework: 3

The entry code is often called P and the exit code V (tanenbaum only uses P and V for blocking, but we use it for busy waiting). So the critical section problem is to write P and V so that

loop forever
    P
    critical-section
    V
    non-critical-section

satisfies

Mutual exclusion
Forward progress (my weaken version of tanenbaum's condition
No speed assumptions
No blocking by processes in NCS

Note that I use indenting carefully and hence do not need (and sometimes omit) the braces {}

A binary semaphore abstracts the TAS solution we gave for the critical section problem.

A binary semaphore S takes on two possible values ``open'' and ``closed''
Two operations are supported

P(S) is

    while (S=closed)
    S<--closed     <== This is NOT the body of the while

where finding S=open and setting S<--closed is atomic

That is, wait until the gate is open, then run through and atomically close the gate
Said another way, it is not possible for two processes doing P(S) simultaneously to both see S=open (unless a V(S) intervenes).
V(S) is simply S<--open

So for any number of processes the critical section problem can be solved by

loop forever
    P(S)
    CS     <== critical-section
    V(S)
    NCS    <== non-critical-section

The only solution we have seen for arbitrary number of processes is with test and set.

To solve other coordination problems we want to extend binary semaphores

With binary semaphores, two consecutive Vs do not permit two subsequent Ps to succeed (the gate cannot be doubly opened).
We might want to limit the number of processes in the section to 3 or 4, not always just 1.

The solution to both of these shortcomings is to remove the restriction to a binary variable and define a generalized or counting semaphore.

A counting semaphore S takes on non-negative integer values
Two operations are supported
P(S) is
```
    while (S=0)
    S--
```
where finding S>0 and decrementing S is atomic
That is, wait until the gate is open (positive), then run through and atomically close the gate one unit
Said another way, it is not possible for two processes doing P(S) simultaneously to both see the same positive value of S unless a V(S) intervenes.
V(S) is simply S++

These counting semaphores can solve what I call the semi-critical-section problem, where you premit up to k processes in the section. When k=1 we have the original critical-section problem.

initially S=k

loop forever
    P(S)
    SCS   <== semi-critical-section
    V(S)
    NCS

Start of Lecture 3

Producer-consumer problem

Two classes of processes
- Producers, which produce times and insert them into a buffer
- Consumers, which remove items and consume them
Must worry about the producer encountering a full buffer
Must worry about the consumer encountering an empty buffer
Also called the bounded buffer problem.
- Another example of active entities being replaced by a data structure when viewed at a lower level (Finkel's level principle)

initially e=k, f=0 (counting semaphore); b=open (binary semaphore)

Producer                         Consumer

loop forever                     loop forever
    produce-item                     P(f)
    P(e)                             P(b); take item from buf; V(b)
    P(b); add item to buf; V(b)      V(e)
    V(f)                             consume-item

k is the size of the buffer
e represents the number of empty buffer slots
f represents the number of full buffer slots
We assume the buffer itself is only serially accessible. That is, only one operation at a time.
- This explains the P(b) V(b) around buffer operations
- I use ; and put on one line to suggest that this is simply one operation that is indivisible.
- Of course this writing style is only a convention, the enforcement of atomicity is done by the P/V
The P(e), V(f) motif is used to force bounded alternation. If k=1 it gives strict alternation.

Dining Philosophers

A classical problem from Dijkstra

5 philosophers sitting at a round table
each has a plate of spaghetti
there is a fork between each two
need two forks to eat

What algorithm do you use for access to the shared resource (the forks)?

The obvious soln (pick up right; pick up left) deadlocks
Big lock around everything serializes
Good code in the book.

The point of mentioning this without giving the solution is to give a feel of what coordination problems are like. The book gives others as well. We are skipping these (2nd semester, depending on instructor).

Homework: 14,15

Readers and writers

Two classes of processes
- Readers, which can work concurrently
- Writers, which need exclusive access
Must prevent 2 writers from being concurrent
Must prevent a reader and a writer from being concurrent
Must permit readers to be concurrent when no writer is active
Perhaps want fairness (i.e. freedom from starvation)
Variants
1. Writer-priority readers/writers
2. Reader-priority readers/writers

Quite useful in multiprocessor operating systems. The ``easy way out'' is to treat all as writers (i.e., give up reader concurrency).

2.4: Process Scheduling

Scheduling the processor is often called just scheduling or process scheduling.

The objectives of a good scheduling policy include

Fairness
Efficiency
Low response time (important for interactive jobs)
Low turnaround time (important for batch jobs)
High throughput [these are from tanenbaum]
Repeatability. Dartmouth (DTSS) ``wasted cycles'' and limited logins for repeatability.
Fair accross projects
- ``Cheating'' in unix by using multiple processes
- TOPS-10
- Fair share research project
Degrade gracefully under load

Recall the basic diagram about process states

For now we are discussing short-term scheduling running <--> ready.

Medium term scheduling is discussed a little later.

Preemption

This is an important distinction.

The ``preempt'' arc in the diagram
Needs a clock interrupt (or equivalent)
Needed to guarantee fairness
Found in all modern general purpose operating systems
Without preemption, the system implements ``run to completion (or yield)''

Deadline scheduling

This is used for real time systems. The objective of the scheduler is to find a schedule for all the tasks (there are a fixed set of tasks) so that each meets its deadline. You know how long each task executes

Actually it is more complicated.

Periodic tasks
What if can't schedule all of them (penalty function)?
What if task time is not constant but has a known probability distribution?

We do not cover deadline schedling in this course.

The name game

There is an amazing inconsistency in naming the different (short-term) scheduling algorithms. Over the years I have used primarily 4 books: In chronological order they are Finkel, Deitel, Silberschatz, and Tanenbaum. The table just below illustrates the name game for these three books. After the table we discuss each scheduling policy in turn.

Finkel  Deitel  Silbershatz Tanenbaum
-------------------------------------
FCFS    FIFO    FCFS        --    unnamed in tanenbaum
RR      RR      RR          RR      
SRR     **      SRR         **    not in tanenbaum
PS      **      PS          PS
SPN     SJF     SJF         SJF   
PSPN    SRT     PSJF/SRTF   --    unnamed in tanenbaum
HPRN    HRN     **          **    not in tanenbaum
**      **      MLQ         **    only in silbershatz
FB      MLFQ    MLFQ        MQ

First Come First Served (FCFS, FIFO, FCFS, --)

If you ``don't'' schedule, you still have to store the PTEs somewhere. If it is a queue you get FCFS. If it is a stack (strange), you get LCFS. Perhaps you could get some sort of random policy as well.

Only FCFS is considered
The simplist scheduling policy
Non-preemptive

Round Robbin (RR, RR, RR, RR)

An important preemptive policy
Essentially the preemptive version of FCFS
The key parameter is the quantum size q
When a process is put into the running state a timer is set to q.
If the timer goes off and the process is still running, the OS preempts the process.
- This process is moved to the ready state (the preempt arc in the diagram.
- The next job in the ready list (normally a queue) is selected to run
As q gets large, RR approaches FCFS
As q gets small, RR approaches PS
What q should we choose
- Tradeoff
- Small q makes system more responsive
- Large q makes system more efficient since less switching
State dependent RR
- Same as RR but q depends on system load
- Favor processes holding important resources
  - For example, non-swappable memory
  - Perhaps medium term scheduling
External priorities
- RR but can pay more for bigger q

Homework: 9, 19, 20, 21

Selfish RR (SRR, , SRR, )

Perhaps it could be called ``snobbish RR''
``Accepted processes'' run RR
New process waits until its priority reaches that of accepted processes.
New process starts at prio 0 and increases at rate a>=0
Accepted process have prio increas at rate b>=0
Note that at any time all accepted processes have same prio
If b>a, get FCFS (if b=a, essentially FCFS)
If b=0, get RR
If a>b>0, it is interesting

Processor Sharing (PS, **, PS, PS)

All n processes are running, each on a processor 1/n as fast as the real processor.

Of theoretical interest (easy to analyze)
Approximated by RR when the quantum is small

Homework: 18.

Shortest Job First (SPN, SJF, SJF, SJF)

Sort jobs by total execution time needed and run the shortest first.

Nonpreemptive
If we consider the extreme of run-to-completion, i.e. don't even switch on an I/O, then SJF has the shortest average waiting time
- Moving a short job before a long one decreases the wait for the short by the length of the long and increases the wait of the long by the length of the short.
- This decreases the total waiting time for these two
- Hence decreases the total waiting for all and hence decreases the average waiting time as well
In realistic case where a job has I/O and we switch then, should call the policy shortest next cpu burst first.
Difficulty is predicting the future (i.e. knowing the time required for job or burst).

Preemptive Shortest Job First (PSPN, SRT, PSJF/SRTF, --)

Preemptive version of above

Permit a job that enters the ready list to preempt the running job if the time for the new job (or for its first burst) is less than the remaining time for the running job (or for its current burst).
Will never happen that an existing job in ready list will require less time than remaining for the current job.
- Why?
Can starve jobs that have a long burst.

Priority aging

As a job is waiting, raise its priority and when it is time to choose, pick job with highest priority.

Can apply this to may policies
Indeed, many policies can be thought of as priority scheduling in which we run the job with the highest priority (with different notions of priority for different policies).

Homework: 22, 23

Highest Penalty Ratio Next (HPRN, HRN, , )

Run job that has been ``hurt'' the most.

For each job, let r = T/t; where T = wall clock time this job has been in system and t is running time of the job to date.
Run job with highest r
Normally defined to be nonpreemptive (only check r when a burst ends), but there is an preemptive analogue
- Do not worry about a job that just enters the system (its ratio is undefined)
- When putting process in run state commpute the time that it will no longer have the highest ratio and set a timer.
- When a job is moved into the ready state, compute its ratio and preempt if needed.
HRN stands for highest response ratio next
This is another example of priority scheduling

Multilevel Queues (, , MLQ, **)

Put different classes of jobs in different queues

Jobs do not move from one queue to another.
Can have different policies on the different queues.
For example, might have a background (batch) queue that is FCFS and one or more foreground queues that are RR.
Must also have a policy among the queues.
For example, might have two queues foreground and background and give the first absolute priority over the second
- Might apply aging to prevent background starvation

Multilevel Feedback Queues (FB, MFQ, MLFBQ, MQ)

Many queues and jobs move from queue to queue in an attempt to dynamically separate ``batch-like'' from interactive jobs.

Run jobs from highest nonempty queue with a quantum (like RR)
When job uses up full quanta (looks a like batch job), move it to a lower queue.
When job doesn't use a full quanta (looks like an interactive job), move it to a higher queue
Long interactive jobs (user generating spurious I/O) can keep job in the upper queues.
Might have bottom queue FCFS
Many variants
For example, might let job stay in top queue 1 quantum, next queue 2 quanta (i.e. put back in same queue, at end, after first quantum expires), next queue 4

Theoretical Issues

Much theory has been done (NP completeness results abound)
Queuing theory developed to predict performance

Medium Term scheduling

Decisions made at a coarser time scale.

Called two-level scheduling by tanenbaum
Suspend (swap out) some process if memory is overcommitted
Criteria for choosing a victim
- How long since previously suspended
- How much CPU time used recently
- How much memory does it use
- External priority (pay more, get swapped out less)
We will discuss again during next topic (memory management).

Long Term Scheduling

``Job scheduling''. Decide when to start jobs (do NOT start them when submitted.
Force user to log out and/or block logins if overcommitted
CTSS for decent interactive response time
Unix if out of processes (i.e. out of PTEs)
``LEM jobs during the day'' (Grumman).

==== Start Lecture #4 ====

Notes on lab1

If several processes are waiting on I/O, you may assume noninterference. For example, assume that on cycle 100 process A flips a coin and decides its wait is 6 units and next cycle (101) process B flips a coin and decides its wait is 3 units. You do NOT have to alter process A. That is, Process A will become ready after cycle 106 (100+6) so enters the ready list cycle 107 and process B becomes ready after cycle 104 (101+3) and enters ready list cycle 105.
For processor sharing (PS), which is part of the extra credit:
PS (processor sharing). Every cycle you see how many jobs are in the ready Q. Say there are 7. Then during this cycle (an exception will be described below) each process gets 1/7 of a cycle.
EXCEPTION: Assume there are exactly 2 jobs in RQ, one needs 1/3 cycle and one needs 1/2 cycle. The process needing only 1/3 gets only 1/3, i.e. it is finished after 2/3 cycle. So the other process gets 1/3 cycle during the first 2/3 cycle and then starts to get all the cpu. Hence it finishes after 2/3 + 1/6 = 5/6 cycle. The last 1/6 cycle is not used by any process.

Chapter 3: Memory Management

Also called storage management or space management.

Memory management must deal with the storage hierarchy present in modern machines.

Registers, cache, central memory, disk, tape (backup)
Move date from level to level of the hierarchy.
How decide when to move data up?
- Fetch on demand (e.g. demand paging, which is popular now)
- prefetch
  - Readahead for file I/O
  - Large cache lines and pages
  - Extreme example. Entire job present whenever running.

We will see in the next few lectures that there are three independent decision:

Segmentation (or no segmentation)
Paging (or no paging)
Fetch on demand (or no fetching on demand)

Memory management implements address translations.

Convert virtual addresses to physical addresses
aka logical to real
Virtual address is the address expressed in the program
Physical address is the address understood by the computer
The translation from virtual to physical addresses is performed by the Memory Management Unit or (MMU).
Also have question of absolute vs. relocatable/relative addresses.
The translation might be trivial (e.g. identity) but not in a modern general purpose OS
The translation might be difficult (i.e., slow)
- Often includes addition/shifts/mask--not too bad
- Often includes memory references
  - VERY serious
  - Solution is to cache translations in a Translation Lookaside Buffer (TLB). Sometimes called a translation buffer (TB)

Homework: 7.

When is the address translation performed?

At compilie time
- Primative
- Compiler generates physical addresses
- Requires knowledge of where compilation unit will be loaded
- Rarely used (MSDOS .COM files)
At link-edit time
- Compiler
  - generates relocatable addresses for each compilation unit
  - references external addresses
- Linkage editor
  - Converts the relocatable addr to absolute
  - resolves external references
  - Misnamed ld by unix
  - Also convert virtual to physical by knowing where the linked program will be loaded. Unix ld does not do this.
- Loader is simple
- Hardware requirements are small
- A program can be loaded only where specified and cannot move once loaded.
- Not used much any more.
At load time
- Same as linkage editor but do not fix the starting address
- Program can be loaded anywhere
- Program can move but cannot be split
- Need modest hardware: base/limit regs
At execution time
- Dynamically during execution
- Hardware to perform the virtual to physical address translation quickly
- Currently dominates
- Much more information later

Extensions

Dynamic Loading
- When executing a call check if loaded.
- If not loaded, call linking loader to load it and update tables.
- Slows down calls (indirection) unless you rewrite code dynamically
Dynamic Linking
- The traditional linking described above is now called statically linked
- Frequently used routines are not linked into the program. Instead, just a stub is linked.
- If/when a routine is called, the stub checks to see if the real routine is loaded (it may have been loaded by another program).
  - If not loaded, load it.
  - If already loaded, share it. This needs some OS help so that different jobs sharing the library don't overwrite each other's private memory.
- Advantages of dynamic linking
  - Saves space: Routine only in memory once even when used many times
  - Bug fix to dynamically linked library fixes all applications that use that library, without having to relink the application.
- Disadvantages of dynamic linking
  - New bugs in dynamically linked library infect all applications
  - Applications ``change'' even when they haven't changed.

Note: I will place ** before each memory management scheme.

3.1: Memory management without swapping or paging

Job remains in memory from start to finish

Sum of memory requirements of jobs in system cannot exceed size of physical memory.

** 3.1.1: Monoprogramming without swapping or paging (Single User)

The ``good old days'' when everything was easy.

No address translation done by OS (i.e. not done dynamically)
Either reload OS each job (or don't have an OS, which is almost the same) or protect the OS from the job.
- One way to protect (part of) the OS is to have it in ROM.
- Of course must have the data in ram
- Can have a separate OS address space only accessed in supervisor mode.
User overlays if job memory exceeds physical memory.
- Progammer breaks program into pieces.
- A ``root'' piece is always memory resident.
- The root contains calls to load and unload various pieces.
- Programmer's reponsibility to ensure that a piece is already loaded when it is called.
- No longer used, but we couldn't have gotten to the moon in the 60s without it (I think).
- Overlays replaced by dynamic address translation and other features (e.g. demand paging) that have the system support logical address sizes greater than physical address sizes.
- Fred Brooks (leader of IBM's OS/360 project, author of ``The mythical man month'') remarked that the OS/360 linkage editor was terrific, especially in its support for overlays, but by the time it came out, overlays were no longer used.

3.1.2: Multiprogramming

Goal is to improve CPU utilization, by overlapping CPU and I/O

If we let p = portion of time job is waiting for I/O if run alone.
Then for monoprogramming the CPU utilization is 1-p.
Note that p is often > .5 so cpu utilization is poor.
But with a multiprogramming level (MPL) of n, the cpu utilization is approximately 1-(p^n)
Idea is that if probability that a job is waiting for I/O is p and n jobs are in memory, then the probability that all n are waiting for I/O is p^n.
This is a crude model, but it is correct that increasing MPL does increase CPU cpu utilization.
- The limitation is memory, which is why we discuss it here instead of process management.
  - Some of the cpu utiilization is time spent in the OS executing context switches.

Homework: 1, 3.

3.1.3: Multiprogramming with fixed partitions

This was used by IBM for system 360 OS/MFT (multiprogramming with a fixed number of tasks)
Can have a single input queue instead of one for each partition
- So that if there are no big jobs can use big partition for little jobs
- But I don't think IBM did this
- Can think of the input queue(s) as the ready list(s)
The partition boundaries are not movable (reboot to move)
Each job has a single ``segment'' (we will discuss segments later)
- No sharing between jobs
- No dynamic address translation
- At load time must ``establish addressibility''
  - i.e. must get a base register set to the location at which job was loaded (the bottom of the partition). The base register is part of the programmer visible register set.
  - The base register is part of the programmer visible register set.
  - This is address translation during load time
- Also called relocation
- Storage keys are adequate for protection (IBM method)
- Alternative protection method is base/limit registers
- An advantage of base/limit is that it is easier to move a job
- But MVT didn't move jobs
Tanenbaum says jobs were ``run to completion''. This must be wrong.
He probably meant that jobs not swapped out and each queue is FCFS without preemption.
MFT can have large internal fragmentation, i.e. wasted space inside a region

3.2: Swapping

Moving entire jobs between disk and memory is called swapping.

3.2.1: Multiprogramming with variable partitions

Both the number and size of the partitions change with time.
IBM OS/MVT (multiprogramming with a varying number of tasks
Also early PDP-10 OS
Job still has only one segment (as with MFT) but now can be of any size
Single ready list
Job can move (might be swapped back in a different place
This is dynamic address translation (during run time).
Perform an addition on every memory reference (i.e. on every address translation)
Called a DAT box by IBM
Eliminates internal fragmentation
- Find a region the exact right size (leave a hole for the remainder
- Not quite true, can't get a piece with 10A755 bytes. Would get say 10A750. But internal fragmentation is much reduced compared to MFT.
Introduces external fragmentation, holes outside any region.
What do you do if no hole is big enough for request?
- Can compactify
  - Transition from bar 3 to bar 4 in diagram below
  - This is expensive
  - Not suitable for real time (MIT ping pong)
- Can swap out one process to bring in another
  - Bars 5-6 and 6-7 in diagram

Homework: 4

There are more processes than holes. Why?
- Because next to a process there might be a process or a hole but next to a hole there must be a process
- So can have ``runs'' of processes but not of holes
- Above actually shows that there are about twice as many processes as holes.
Base and limit registers are used
- Storage keys not good since compactifying would require changing many keys
- Storage keys might need a fine granularity to permit the boundaries move by small amts. Hence many keys would need to be changed

==== Start Lecture #5 ====

See announcements on course home page

Introduces the ``Placement Question'', which hole (partition) to choose

Best fit, worst fit, first fit, circular first fit, quick fit, Buddy
- Best fit doesn't waste big holes, but does leave slivers and is most expensive to run.
- Worst fit avoids slivers, but eliminates all big holes so a big job will require compaction.
- Quick fit keeps lists of some common sizes (but has other problems, see tanenbaum)
- Buddy system
  - Round request to next highest power of two (causes internal fragmentation)
  - Look in list of blocks this size (cf. quick fit)
  - If list empty, go higher and split into buddies
  - When returning coalesce with buddy
  - Do splitting and coalescing recursively, i.e. keep coalescing until can't and keep splitting until successful
  - See tanenbaum for more details (or an algorithms book).
A current favorite is circular first fit (aka next fit)
- Use the first hole that is big enough (first fit) but start looking where you left off last time.
- Doesn't waste time constantly trying to use small holes that have failed before, but does tend to use many of the big holes, which can be a problem.
Buddy comes with its own implementation. How about the others?
- Bit map
  - Only question is how much memory does one bit represent.
    - Big: Serious internal fragmentation
    - Small: Many bits to store and process
- Linked list
  - Each item on list says whether Hole or Process, length, starting location
  - The items on the list are not taken from the memory to be used by processes
  - Keep in order of starting addr
  - Double linked
- Boundary tag
  - Knuth
  - Use the same memory for list items as for processes
  - Don't need an entry in linked list for blocks in use, just the avail blocks are linked
  - For the in use blocks, just need a hole/process bit at each end and the length. Keep this in the block itself.
  - See knuth, the art of computer programming vol 1

Homework: 2, 5.

Also introduces the ``Replacement Question'', which victim to swap out

We will study this question more when we discuss demand paging

Considerations in choosing a victim

Cannot replace a job that is pinned, i.e. whose memory is tied down. For example, if Direct Memory Access (DMA) I/O is scheduled for this process, the job is pinned until the DMA is complete.
Victim selection is a medium term scheduling decision
- Job that has been in a wait state for a long time is a good candidate.
- Often choose as a victim a job that has been in memory for a long time.
- Another point is how long should it stay swapped out.
For demand paging, where swaping out a page is not as drastic as swapping out a job, victim is an important memory management decision and we shall study several policies

NOTEs:

So far the schemes have had two properties
1. Each job is stored contiguously in memory. That is, the job is contiguous in physical addresses.
2. Each job cannot use more memory than exists in the system. That is, the virtual addresses space cannot exceed the physical address space.
Tanenbaum now attacks the second item. I wish to do both and start with the first
Tanenbaum (and most of the world) uses the term ``paging'' to mean what I call demand paging. This is unfortunate as it mixes together two concepts
1. Paging (dicing the address space) to solve the placement problem and essentially eliminate external fragmentation.
2. On demand fetching, to permit the total memory requirements of all loaded jobs to exceed the size of physical memory.
Tanenbaum (and most of the world) uses the term virtual memory as a synonym for demand paging. Again I consider this unfortunate.
1. Demand paging is a fine term and is quite discriptive
2. Virtual memory ``should'' be used in contrast with physical memory to describe any virtual to physical address translation.

** (non-demand) Paging

Simplest scheme to remove the requirement of contiguous physical memory.

Chop the program into fixed size pieces called pages (invisible to the programmer).
Chop the real memory into fixed size pieces called page frames or simply frames (size of a page = size of a frame)
Sprinkle the pages into the frames
Keep a table (called the page table) having an entry for each page. The page table entry or PTE for page p contains the number of the frame f that contains page p.

Example: Assume a decimal machine, with pagesize=framesize=1000.
Assume PTE 3 contains 459.
Then virtual address 3372 corresponds to physical address 459372.

Properties of (non-demand) paging.

Entire job must be memory resident to run
No holes, i.e. no external fragmentation.
If there are 50 frames available and the pagesize is 4KB than a job requiring <= 200KB will fit, even if the available frames are scattered over memory.
Hence (non-demand) paging is useful
Introduces internal fragmentation approximately equal to 1/2 pagesize
Still can have a job unable to run due to insufficient memory and have some memory avail but not enough. This is not called external fragmentation since it is not due to memory being fragmented.
Eliminates the placement question. All pages are equally good since don't have external fragmentation.
Replacement question unchanged
Since page boundary occur at ``random'' points and can change from run to run (the page size can change with no effect on the program--other than performance, pages are not appropriate units of memory to use for protection and sharing. This is discussed further when we introduce segmentation.

Homework: 13

Address translation

Each memory reference turns into 2 memory references
1. Reference the page table
2. Reference central memory
This would be a disaster!
Hence the MMU caches page#-->frame# translations. This cache is kept near the processor and can be accessed rapidly
This cache is called a translation lookaside buffer (TLB) or translation buffer (TB)
For the above example, after referencing virtual address 3372, entry 3 in the TLB would contain 459.
Hence a subsequent access to virtual address 3881 would be translated to physical address 459881 without a memory reference.

Choice of page size is discuss below

Homework: 8, 13, 15.

3.2: Virtual Memory (meaning fetch on demand)

Idea is that a program can execute if only the active portion of its address space is memory resident. That is swap in and swap out portions of a program. In a crude sense this can be called ``automatic overlays''.

Advantages

Can run a program larger than the total physical memory.
Can increase the multiprogramming level since the total size of the active, i.e. loaded, programs (running + ready + blocked) can exceed the size of the physical memory.
Some portions of program are rarely if ever used so inefficient use of memory to have them loaded

3.2.1: Paging (meaning demand paging)

Fetch pages from disk to memory when they are referenced, with a hope of getting the most actively used pages in memory.

Very common: dominates modern operating systems
Started by the Atlas system at Manchester University in the 60s (Fortheringham)
Each PTE continues to have the frame number if the page is loaded.
But what if the page is not loaded (exists only on disk)?
- The PTE has a flag indicating if loaded (can think of the X in the diagram above as indicating that this flag is not set
- If not loaded, the location on disk is kept in the PTE (not shown in the diagram)
- When a reference is made to a non-loaded page (sometimes called a non-existent page, but that is a bad name), the system has a lot of work to do. We give more details below
  1. Choose a free frame if one exists
  2. If not
    1. choose a victim frame
      - More on how to choose later
      - Called the replacement question
    2. write victim back to disk if dirty
    3. update the victim PTE to show not loaded and where on disk it has been put (perhaps the disk location is already there)
  3. Copy the referenced page from disk to the free frame
  4. Update the PTE of the referenced page to show that it is loaded and give the frame number
  5. Do the std paging address translation (p#,off)-->(f#,off)
Really not done quite this way
- There is ``always'' a free frame because ...
- There is a deamon active that checks the number of free frames and if this is too low, chooses victims and ``pages them out'' (writing them back to disk if dirty).
Size of page table
- Considerations for plain paging apply
- Large pages are good for I/O (providing the data is used) since it is faster to do one 16KB I/O than four 4KB I/Os and much faster than sixteen 1KB I/Os

Homework: 11.

3.3.2: Page tables

A discussion of page tables is also appropriate for (non-demand) paging, but the issues are more acute with demand paging since the tables can be much larger. Why?
Ans: The total size of the active processes is no longer limited to the size of physical memory.

Want access to the page table to be very fast since it is needed for every memory access.

Unfortunate laws of hardware

Big and fast are essentially imcompatible
Big and fast and low cost is hopeless

So we can't just say, put the page table in fast processor registers and let it be huge and sell the system for $1500.

Put the (one-level) page table in main memory.

Too slow
TLB would help as mentioned above (discussed later in more detail).
Too big
- Currently we are considering contiguous virtual addresses ranges (i.e. the virtual addresss have no holes).
- Typically put the stack at one end of virtual address and the global (or static) data at the other end and let them grow towards each other
- The memory in between is unused.
- This unused memory can be huge (in address range) and hence the page table will mostly contain unneeded PTEs
Works fine if the virtual address range possible is small, which was once true but no longer (e.g. of PDP-11 in tanenbaum).

Protection bits

Can place protection bits on pages. For example can mark pages as execute only. This requires that boundaries between regions with different protection must be on page boundaries. Protection is more naturally done with segmentation.

Multilevel page tables

Idea, which is also used in unix inode-based file systems, is to add a level of indirection and have a page table containing pointers to page tables.

Imagine one big page table.
Call it the second level page table and cut it into pieces each the size of a page.
Note that you can get many PTEs in one page so you will have far fewer of these pages than PTEs
Now construct a first level page table containing PTEs that point to these pages.
This first level PT is small enough to store in memory.
But since we still have the 2nd level PT, we have made the world bigger not smaller!
Don't store in memory the 2nd level page tables all of whose PTEs point to unused memory. That is use demand paging on the (first level page table

For a two level page table the virtual address is divided into three pieces
```
+-----+-----+-------+
| P#1 | P#2 | Offset|
+-----+-----+-------+
```
P#1 gives the index into the first level page table.
Follow the pointer in the corresponding PTE to reach the frame containing the relevant 2nd level page table.
P#2 gives the index into this 2nd level page table
Follow the pointer in the corresponding PTE to reach the frame containing the (originally) requested frame.
Offset gives the offset in this fram where the requested word is located.

Do an example on the board

The VAX used a 2-level page table structure, but with some wrinkles (see tanenbaum for details).

Naturally, there is no need to stop at 2 levels. In fact the sparc has 3 levels and the motorola 68030 has 4 (and the number of bits of Virtual Address used for P#1, P#2, P#3, and P#4 can be varied).

3.3.4: Associative memory (TLBs)

Note: Tanenbaum suggests that ``associative memory'' and ``translation lookaside buffer'' are synonyms. This is wrong. Associative memory is a general structure and translation lookaside buffer is a special case.

An associative memory is a content addressable memory. That is you access the memory by giving the value of some field and the hardware searches all the records and returns the record whose field contains the requested value.

For example

Name  | Animal | Mood     | Color
======+========+==========+======
Moris | Cat    | Finicky  | Grey
Fido  | Dog    | Friendly | Black
Izzy  | Iguana | Quiet    | Brown
Bud   | Frog   | Smashed  | Green

If the index field is Animal and Iguana is given, the associative memory returns

Izzy  | Iguana | Quiet    | Brown

==== Start Lecture #6 ====

A Translation Lookaside Buffer or TLB is an associate memory where the index field is the page number. The other fields include the frame number, dirty bit, valid bit, and others.

A TLB is small and expensive but at least it is fast. When the page number is in the TLB, the frame number is returned very quickly delay.
On a miss, the page number is looked up in the page table. The record found is placed in the TLB and a victim is discarded. There is no placement question since all entries are accessed at the same time. But there is a replacement question.

Homework: 15.

3.3.5: Inverted page tables

Keep a table indexed by frame number with the entry f containg the number of the page currently loaded in frame f.

Since modern machine have a smaller physical address space than virtual address space, the table is smaller
But on a TLB miss, must search the inverted page table.
Would be hopelessly slow except that some tricks are employed.
The book mentions some but not all of the tricks, we are skipping this topic.

3.4: Page Replacement Algorithms

These are solns to the replacement question.

Good solutions take advantage of locality.

Temporal locality: If a word is referenced now, it is likely to be referenced in the near future.
- This argues for caching referenced words, i.e. keeping the referenced word near the processor for a while
Spacial locality: If a word is referenced now, nearby words are likely to be referenced in the near future.
- This argues for prefetching words around the currently referenced word.
These are lumped together into locality: If a page is referenced, it is likely to be referenced in the near future.
- So it is good to bring in the entire page on a miss and to keep the page in memory for a while.
When programs begin there is no history so nothing to base locality on. At this point the paging system is said to be undergoing a ``cold start''.
Programs exhibit ``phase changes'', when the set of pages referenced changes abruptly (similar to a cold start). At the point of a phase change, many page faults occur because locality is poor.

Pages belonging to processes that have terminated are of course perfect choices for victims.

Pages belonging to processes that have been blocked for a long time are good choices as well.

Random

A lower bound on performance. Any decent scheme should do better.

3.4.1: The optimal page replacement algorithm (opt PRA)

Replace the page whose next reference will be furthest in the future

Also called belady's min algoritm
Proveably optimal. That is generates the fewest number of page faults.
Unimplementable: Requires predicting the future.
Good upper bound on performance

3.4.2: The not recently used (NRU) PRA

Divide the frames into four classes and make a random selection from the lowest nonempty class.

Not referenced, not modified
Not referenced, modified
Referenced, not modified
Referenced, modified

Assumes that in each PTE there are two extra flags R (sometimes called U, for used) and M (often called D, for dirty).

Also assumes that a page in a lower number class is cheaper to evict

If not referenced, probably not referenced again soon so not so important.
If not modified, do not have to write it out so the cost of the eviction is lower

When a page is brought in, OS resets R and M (i.e. R=M=0)
On a read, hardware sets R
On a write, hardware sets R and M

We again have the prisoner problem, we do a good job of making little ones out of big ones, but not the reverse. Need more resets

Every k clock ticks, reset all R bits

Why not reset M?
Ans: Must have M accurate to know if victim must be written back
Could have two M bits one accurate and one reset, but I don't know of any system (or proposal) that does so.

What if hardware doesn't set these bits?

OS can use tricks
When the bits are reset, make the PTE indicate the page is not resident (i.e. lie). On the page fault, set the appropriate bit(s).

3..4.3: FIFO PRA

Simple but poor since usage of page is given no weight.

Belady's Anomaly: Can have more frames yet more faults. Example given later.

3.4.4: Second chance PRA

Fifo but when time to choose a victim if page at the head of the queue has been referenced (R bit), don't evict it but reset R move it to the rear of the queue (so it looks new). The page is being a second chance.

What if all frames have been referenced?
Becomes the same as fifo (but takes longer)

Might want to turn off the R bit more often (k clock ticks).

3.4.5: Clock PRA

Same algorithm as 2nd chance, but a better (and I would say obvious) implementation: Use a circular list.

Do an example.

LIFO PRA

This is terrible! Why?
You essentially use only one frame.

3.4.6:Least Recently Used (LRU) PRA

When a page fault occurs, choose as victim that page that has been unused for the longest time, i.e. that has been least recently used.

LRU is definitely

Implementable: The past is knowable
Good: Simulation studies
Difficult
- Essentially need to either
  1. Keep a time stamp in each PTE updated on each reference and scan all the PTEs when choosing a victim to find the PTE with the oldest timestamp.
  2. Keep the PTEs in a linked list in usage order, which means on each reference moving the PTE to the end of the list

Homework: 19, 20

A hardware cutsie in in tanenbaum

For n pages, keep an nxn bit matrix.
On a reference to page i, set row i to all 1s and col i to all 0s
At any time the 1 bits in the rows are ordered by inclusion. I.e. one row's 1s are a subset of another row's 1s which is a subset of a third. (Tanenbaum forgets to mention this)
So the row with the fewest 1s is a subset of all the others and is hence least recently used
Cute, but still impractical.

3.4.7: Simulating LRU in Software

The Not Frequently Used (NFU) PRA

Include a counter in each PTE (and have R in each PTE)
Set counter to zero when page is brought into memory
For each PTE, add R to counter every k clock ticks
Choose as victim the PTE with lowest count

The Aging PRA

NFU doesn't distinguish between old references and recent one. Modify NFU so that, for all PTEs, at every k clock ticks

Counter is shifted right one bit
R is inserted as the new high order bit (HOB)

R counter
1 10000000
0 01000000
1 10100000
1 11010000
0 01101000
0 00110100
1 10011010
1 11001101
0 01100110

R	counter
1	10000000
0	01000000
1	10100000
1	11010000
0	01101000
0	00110100
1	10011010
1	11001101
0	01100110

Homework: 21, 25

3.5: Modeling Paging Algorithms

3.5.1: Belady's anomaly

Consider the following ``reference string'' (sequence of pages referenced), which is assumed to occur on a system with no pages loaded initially that uses the FIFO PRU.

 0 1 2 3 0 1 4 0 1 2 3 4

If we have 3 frames this generates 9 page faults.

If we have 4 frames this generates 10 page faults.

Theory has been developed and certain PRA (so called ``stack algorithms'') cannot suffer this anomaly for any reference string. FIFO is clearly not a stack algorithm. LRU is.

Repeat the above for LRU.

3.6: Design issues for (demand) Paging

3.6.1 & 3.6.2: The Working Set Model and Local vs Global Policies

I will do these in the reverse order (which makes more sense). Also tanenbaum doesn't actually define the working set model, but I shall.

A local PRA is one is which a victim page is chosen among the pages of the same process that requires a new page. That is the number of pages for each process is fixed. So LRU means the page least recently used by this process.

Of course we can't have a purely local policy, why?
Ans: A new process has no pages and even if we didn't apply this for the first page loaded, the process would remain with only one page.

Perhaps wait until a process has been running a while.

A global policy is one in which the choice of victim is made among all pages of all processes

If we apply global LRU indiscrimanently with some sort of RR processor scheduling policy, and memory is somewhat over-committed, then by the time we get around to a process, all the others have run and have probably paged out this process.

If this happens each process will need to page fault at a high rate; this is called thrashing. It would therefore be good to get an idea of how many pages a process needs, so that we can balance the local and global desires.

==== Start Lecture #7 ====

Request Please include your student number and email address on homework #6, which you are handing in.

The working set policy (Peter Denning)

The goal is to specify which pages a given process needs to have memory resident.

Measure time in units of memory references, so t=1045 means the time when the 1045th memory reference is issued. Measure time separately for each process, so t=1045 really means the time when this process made its 1045th memory reference.
W(t,&omega) is the set of pages referenced (by the given process) from time t-omega to time t.
That is, it is the pages referenced during the window of size omega ending at time t.
W(t,&omega) is called the working set at time t (with window &omega).
(Netscape doesn't support &omega to give the greek letter, ouch)
w(t,omega) is the size of the set W(t,omega), i.e. is the number of pages referenced in the window.

The idea of the working set policy is to ensure that each process keeps its working set in memory.

One possibility is to allocate w(t,&omega) frames to each process (this number differs for each process and changes with time) and then use a local policy.
What if there aren't enough frames to do this?
Reduce the multiprogramming level (MPL)! That is, we have a connection between memory management and process management. This is the suspend/resume arcs we saw way back when.

Interesting questions incude:

What value should be used for &omega?
How should we calculate W(t,&omega)?

Various approximations to the working set frequency have been devised.

Wsclock
- Use the aging algorithm above to maintain a counter for each PTE and declare a page whose counter is above a certain threshold to be part of the working set.
- Apply the clock algorithm globally (i.e. to all pages) but refuse to page out any page in a working set, the resulting algorithm is called wsclock
- What if we find there are no pages we can page out?
- Ans: Reduce the multiprogramming level
Page Fault Frequency (PFF)
- For each process keep track of the page fault frequency, which is the number of faults divided by the number of references.
- Actually, must use a window or a weighted calculation since you are really interested in the recent page fault frequency
- If the pff is too low, allocate more frames to this process. Either
  1. Raise its number of frames and use local policy; or
  2. Bar its frames from eviction and use a global policy
What if not enough frames?
Ans: Lower MPL. (multiprogramming level)

3.6.3: Page size

Page size ``must'' be a multiple of the disk block size. Why?
Ans: When copying out a page if you have a partial disk block, you must do a read/modify/write (i.e. 2 I/Os).
Large page size
- Good for user I/O
  - If I/O done using physical addresses, then I/O crossing a page boundary is not contiguous and hence requires multiple I/Os
  - If I/O uses virtual addresses, then pagesize doesn't effect this aspect of I/O. That is the addresses are contigous in virtual address and hence one I/O is done.
- Good for demand paging I/O.
  - Better to swap in/out one big page than several small pages
  - But if page too big you will be swapping in data that is really not local and hence might well not be used.
- Large internal fragmentation (1/2 pagesize)
- Small page table
- Very few pages. Process will have many faults if using demand paging and the process references more regions than frames.
Small page size has the oposite properties

3.6.4: Implementation Issues

Don't worry about instruction backup. Very machine dependent and modern implementations tend to get it right.

Locking (pinning) pages

We discussed pinning jobs already. The same (mostly I/O) considerations apply to pages.

Shared pages

Really should share segments

Good idea to keep reference counts or something so that when a process terminates, pages it shares with another process are not automatically discarded
Similarly a reference count would make a widely shared page (correctly) look like a poor choice for a victim.
A good place to store the ref count would be in a structure pointed to by both PTEs.

Backing Store

The issue is where on disk do we put pages

For program text, which is presumably read only, a good choice is the file itself.
Data and stack grow so either be prepared to grow the space on disk (same issues as MVT) by copying if needed or allocate a bunch of regions on disk

Paging Daemons

Done earlier

Page Fault Handling

Hardware traps to the kernel (switches to supervisor mode; saves state)
Assembly language code save more state, establishes the C-language environment, calls the OS
OS determines that a fault occured and which page
If virt addr is invalid, shoot process. If valid, seek a free frame. If no free frames, select a victim.
If the victim frame is dirty, schedule an I/O write to copy the frame to disk. This process is blocked so the process scheduler is invoked to perform a context switch.
- Tanenbaum ``forgot'' some here
- Disk interrupt occurs when I/O complete
- Hardware trap / assembly code / OS determines I/O done
- Process moved from blocked to ready
- Some time later a context switch occurs to this ready process. Since this process is in kernel mode. Perhaps it was scheduled to run as soon as it was ready. I am using a ``self-service'' model where the process moves from user mode to kernel mode.
Now the frame is clean (this may be much later in wall clock time). Schedule an I/O to read to the desired page into this clean frame. This process is again blocked so the process scheduler is invoked to perform a context switch.
Disk interrupt occurs when I/O complete (trap / asm / OS determines) / made ready / starts running). PTE updated
Fix up process (e.g. reset PC)
Process put in ready queue and eventually runs. The OS returns to the first asm routine.
Asm routine restores regs, etc and returns to user mode.

Process is unaware that all this happened.

3.7: Segmentation

Up to now, the virtual address space has been contiguous.

Among other issues this makes it difficult when there are more that two dynamically growing regions
With two regions you start them on oposite sides of the virtual space.
Better is to have many virtual address spaces each starting at zero.
This split up is user visible.
Without segmentation (equivalently said with just one segment) all procedures must be packed together so if one changes all the virtual addresses change and the program must be relinked.
Eases flexible protection and sharing (share a segment). For example, can have a share library.

The following table mostly from tanenbaum compares demand paging with demand segmentation.

Consideration Demand
Paging Demand
Segmentation
Programmer aware No Yes

How many addr spaces 1 Many

VA size > PA size Yes Yes

Protect individual
procedures separately No Yes

Accomodate elements
with changing sizes No Yes

Ease user sharing No Yes

Why invented let VA size
exceed
PA size Sharing
Protection
indep addr spaces

Internal fragmentation Yes No, in principle

Extternal fragmentation No Yes

Placement question No Yes

Replacement question Yes Yes

Consideration	Demand Paging	Demand Segmentation
Programmer aware	No	Yes
How many addr spaces	1	Many
VA size > PA size	Yes	Yes
Protect individual procedures separately	No	Yes
Accomodate elements with changing sizes	No	Yes
Ease user sharing	No	Yes
Why invented	let VA size exceed PA size	Sharing Protection indep addr spaces

Internal fragmentation	Yes	No, in principle
Extternal fragmentation	No	Yes
Placement question	No	Yes
Replacement question	Yes	Yes

Homework: 29.

** Two Segments

Late PDP-10s and TOPS-10

One shared text segment, that can also contain shared (normally read only) data.
One (private) writable data segment
Permission bits on each segment.
Which kind of segment is better to evict?
- Swap out shared segment hurts many tasks
- The shared segment is read only (probably) so no writeback is needed.
``One segment'' is OS/MVT done above.

** Three Segments

Traditional unix

Shared text execute only
Data segment (global and static variables)
Stack segment (automatic variables)

** Four Segments

Just kidding.

** General (not necessarily demand) Segmentation

Permits fine grained sharing and protection
Visible division of program
Variable size segments
Address = (seg#, offset)
Does not mandate how stored in memory.
- One possibility is that the entire program must be in memory to run it. Use whole process swapping. Early unix did this
- Can also implement demand segmentation.
- Can combine with demand paging (done below)
Segment table with a base and limit value for each segment
(seg#, offset) --> base address + offset
External fragmentation as with whole program swapping. Since segments are smaller than programs (several segments make up one program), the external fragmentation is not as bad.

==== Start Lecture #8 ====

Don't forget the mirror site. My main website will be going down for an OS upgrade. Start at http://cs.nyu.edu/

** Demand Segmentation

Same idea as demand paging applied to segments

If segment is loaded, base and limit are stored in processor registers and used in each memory reference.
If segment is not loaded, the base and limit as well as the disk address of the segment is store in the STE (segment table entry).
Generate a segment fault (analogous to page fault) if reference a nonloaded segment
To load a segment have the placement question in addition to replacement question.
I am not sure if this was ever done; it is not done now

** 3.7.2: Segmentation with paging

Combines both segmentation and paging to get advantages of both at a cost in complexity.

Virtual address is now a triple (seg#, page#, offset)
Each segment table entry points to the page table for that segment (similar to multi-level paging).

The page# field in the addr gives the entry in the chosen page table and the offset gives the offset in the page.
Instead of a limit field, the ste contains the length of the segment in pages (which equals the size of the corresponding page table in PTEs)
Straightforward implementation requires 3 memory references so a TLB is crucial.
Sometimes say the segments are of fixed size. This is wrong. They are of variable size with a fixed maximum.
The first example of this was Multics
Keep protection and sharing information on segments
Do replacement and placement on pages (no external fragmentation)
Share a segment by sharing the page table for that segment.
- Thus there is no aliasing
- Good thing or eviction would be problematic

Homework: 30.

Some last words

Segmentation / Paging / Demand Loading
- Each is a yes or no alternative
- Gives 8 possibilities
Placement and Replacement
Internal and External Fragmentation
Page Size and locality of reference
Multiprogramming level and medium term scheduling

Chapter 4: File Systems

Requirements

Size: Store very large amounts of data
Persistence: Data survives the creating process
Access: Multiple processes can access the data concurrently

Solution: Store data in files that together form a file system

4.1: Files

4.1.1: File Naming

Very important. A major function of the filesystem.

Does each file have a unique name?
Ans: Often no. We will discuss this below when we study links.
Extensions, e.g. the ``html'' in ``class-notes.html''
1. Conventions just for humans: letter.teq (my convention)
2. Conventions giving default behavior for some programs
  - Emacs thinks .html files should be edited in html mode but can edit them in any mode and can edit any file in html mode
  - Netscape thinks .html means an html file but
    <html> ... </html> works as well
  - Gzip thinks .gz means a compresses file but accepts a --suffix flag
3. Required extensions for programs
  - The gnu C compiler (and probably others) requires C programs be named *.c and assembler programs be named *.a
4. Required extensions by operating systems
  - MS-DOS treats .com files specially
  - I think windows 95 requires shortcuts to end in .lnk
Case sensitive?

4.1.2: File structure

A file is a

Byte stream
- Unix, dos, windows (I think)
- Max flexibility
- Min structure
(fixed size) Record stream: Out of date
Varied and complicated beast
- Indexed sequential
- B-trees
- Supports rapidly finding a record with a specific key
- Supports retrieving (varying size) records in key order.
- Treated in depth in database courses

4.1.3: File types

(Regular) files
Directories: studied below
Special files (for devices)
- Uses the naming power of files to unify many actions
- dir # prints on screen
- dir > file # result put in a file
- dir > /dev/tape # results written to tape
``Symbolic'' Links (similar to ``shortcuts''): Also studied below.

``Magic number'': Indentifies an executable file. There can be several different magic numbers for different types of executables.
In unix: #!/usr/bin/perl
Strongly typed files: Easy (hopefully normal) case easier (and safer) hard case harder.

4.1.4: File access

Sequential access is most common. Why back when (the era (error?) of ``real programmers'' (tm) ), files were declared to be sequential or random.
Random

4.1.5: File attributes

A laundry list of properties that can be specified for a file, e.g.

hidden
do not dump
owner
key length (for keyed files)

4.1.6: File operations

Create: Essential if a system is to add files. Need not be a separate system call (can be merged with open).
Delete: Essential if a system is to delete files.
Open: Not essential. An optimization. Do the translation from file name to disk locations only once per file rather than once per access.
Close: Not essential. Free resources.
Read: Essential. Must specify filename, file location, number of bytes, where to put data read in. Several of these parameters can be set by other system calls and in many OS's they are.
Write: Essential if updates are to be supported. See read for parameters.
Seek: Not essential functionalily (could be in read/write). Specify the file offset of the next (read or write) access
Get attributes: Essential if attributes are to be used.
Set attributes: Essential if attributes are to be user settable.
Rename: Tannenbaum has strange words. Copy and delete is not acceptable for big files. Moreover copy-delete not atomic. Indeed link-delete is not atomic so even with link (discussed below renaming a file adds functionality.

Homework: 2, 3, 4.
Read and understand ``copyfile'' on page 155.

Notes on copyfile

Normally in unix one wouldn't call read and write directly.
Indeed, for copyfile, fgets/fputs would be nice.
fgets/fputs takes care of the buffering.
Tanenbaum is correct that the error reporting is atrocious.
The worst is exiting the loop on error and thus generating an exit(0) as if nothing happened.

4.1.7: Memory mapped files

Conceptually simple and elegant. Associate a segment with each file and then normal memory operations take the place of I/O.

Thus copyfile has not fgetc/fputc (or read/write). Instead it is just like memcopy

while ( (dest++)* = (src++)* );

The implmentation is via demand paging (on top of segmentation) but the backing store for the pages is the file. This all sounds great but ...

How do you tell the length of a newly created file?
What if same file is accessed by I/O and memory mapping.
What if the file is bigger than the size of virtual memory (will not be a problem in 5 years on modern systems).

==== Start Lecture #9 ====

Bug in lab2 writup. I originally had 6 replacement algorithms, but in a moment of weakness, removed NUR. Sadly, I didn't remove it completely. Change all 900s to 750s and delete the sentence ``For NRU, start the clock pointing to page frame 0 with all use bits off.''

When comparing two algorithms or page sizes or whatever, must use the same set of reference streams.

4.2: Directories

Unit of organization.

4.2.1: Hierarchical directory systems

Possibilities

One directory in the system
One per user
One tree
One tree per user
One forest
One forest per user

These are not as wildly different as they sound.

If the system only has one directory, but alows the character / in a file name. Then one could fake a tree by having the file
/allan/gottlieb/courses/arch/class-notes.html
rather than a directory allan, a subdirectory gottlieb, ..., a file class-notes.html.
Dos (windows) is a forest, unix a tree. In dos there is no common parent of a:\ and c:\.
But windows explorer makes the dos forest look quite a bit like a tree. Indeed, the gnome file manager, looks A LOT like windows explorer.
You can get an effect similar to (but not the same as) one per user by having just one in the system and having permissions that permits each user to visit only a subset. Of course if the system doesn't have permissions, this is irrelevant.
Today's systems have a tree per system or a forest per system.

4.2.2: Path Names

Absolute versus Relative file names.
The former is starts at the (or a) root, the latter starts at the current (a.k.a working) directory.
The special directories . and ..

Homework: 1, 8.

4.2.3: Directory operations

Create: Normally comes with . and ..
Delete: First empty the directory (except for . and ..)
Opendir: Same as with files (creates a ``handle'')
Closedir: Same as files
Readdir: In the old days (of unix) could read directories as files so there was no special readdir (or opendir/closedir). It was believed that the uniform treatment would make programming (or at least system understanding) as there was less to learn.
However, experience has taught that this was not a good idea since the structure of directories then becomes exposed. Early unix had a simple structure (and there was only one). Modern systems have more sophisticaed structures and more importantly they are not fixed accross implementations.
Rename: As with files
Link: Add a second name for a file; discussed below.
Unlink: Remove a directory entry. This is how a file is deleted. But if there are many links, the file remains. Discussed in more detail below.

4.3: File System Implementation

4.3.1; Implementing Files

A disk cannot read or write a single word. Instead it can read or write a sectore, which is often 512 bytes.
Disks are written in blocks whose size is a multiple of the sector size.
For memory management we dealt with words (or bytes). Here we deal with blocks.

Contiguous allocation

This is like OS/MVT.
The entire file is stored as one piece
Simple and fast for access, but ...
Problem with growing files
- Must either evict the file itself or the file it is bumping into.
- Same problem with an OS/MVT kind of system if jobs grow.
Problem with external fragmentation.
Not used for general purpose systems. Ideal for systems where files do not change size.

Homework: 7.

Linked allocation

The directory entry contains a pointer to the first block of the file.
Each block contains a pointer to the next.
Horrible for random access.
Not used.

FAT (file allocation table)

Used by dos / windows (but not windows/NT)
Directory entry points to first block (i.e. specifies the block number)
A FAT is maintained in memory having one (word) entry for each disk block. The entry for block N contains the block number of the next block in the same file as N.
This is linked but the links are store separately.
Still is linear in size of file but now all the entries are to this one table which is in memory. So it is bad but not horrible for random access.
Size of table is one word per disk block. If one writes all blocks of size 4K and uses 4-byte words, the table is one megabyte for each disk gigabyte. Large but not prohibitive.
If write blocks of size 512 bytes (the sector size of most disks) then the table is 8 megs per gig, which is prohibitive

Inodes

Used by unix
Directory entry points to inode (index-node)
Inode points to first few data blocks
Inode points to an indirect block, which points to disk blocks
Inode points to a double indirect, which points an indirect ...
For some implementations there are triple indirect as well
The inode is in memory for open files so many references just take one I/O
Big files take two (indirect + data)
Huge files take three

4.3.2; Implementing Directories

Maps file (or subdirectory) names to the files (or subdirectories) themselvesl.

Trivial filesys (CP/M)

Only one directory in the system
Directory entry contains pointers to disk blocks.
If need more blocks, get another directory entry.

MS-DOS

Subdirectories supported
Directory entry contains metatdata such as date and size as well as pointer to first block.

Unix

Entry contains name and pointer to inode.
Metadata is in inode
Only early unix had limit of 14 character names.
- Name field now is varying length.
- To go down a level indirectory takes two steps: get inode, get file (or subdirectory).
- Do on board steps for /allan/gottlieb/courses/os/class-notes.html

4.3.3: Shared files (links)

Sharing is tannenbaum's terminology.
More discriptive would ``multinamed files''
If a file exists, one can create another name for it (quite possibly in another directory)
Often called creating (another) link to the file.
Unix has two flavor, hard links and symbolic links or symlinks.
Dos/windows has symlinks, but I don't believe it has hard links.
Start with an empty filesystem (i.e. just root) and then
```
cd /
mkdir /A; mkdir /B
mkdir /A/X; mkdir /B/Y
```
Now we have
Note that names are on edges not nodes.
When there are no multinamed files, it doesn't much matter.
Now execute
```
ln /B/Y /A/New
```
This gives
At this point there are two equally valid name for the right hand yellow file, B/Y and A/New. The fact that B/Y was created first is NOT detectable.
Both point the the same inode
Only one owner (the one who created the file initially)
One date, one set of permissions, one ...
Assume alice created /A /A/X and /A/New and bob created /B and /B/Y
Bob executes
```
rm /B/Y
```
We have
The file /A/New is still fine. But it is owned by bob, who can't see it! If the system enforces quotas bob will likely be charged (as the owner) but he can neither find nor delete the file (since bob can unlink files from /A)

Since hard links are only permitted to files (not directories) the resulting filesystem is a dag (directed acyclic graph). That is there are no directed cycles. We will now proceed to give away this useful property by studing symlinks, which can point to directories.

Symlinks

Asymmetric multinamed files.
When a symlink is created another file is created, one that points to the original file.

Again start with an empty filesystem and this time execute

cd /
mkdir /A; mkdir /B
mkdir /A/X; mkdir /B/Y
ln -s /B/Y /A/New

We now have an additional file /A/New, which is a symlink to /B/Y.
The file /A/New has the name /B/Y as its data (not metadata).
The system notices that A/New is a diamond (symlink) so reading /A/New will return the contents of /B/Y (assuming the reader has read permission for /B/Y).
If /B/Y is removed /A/New becomes invalid.
If a new /B/Y is created, A/New is once again valid.
Removing /A/New has no effect of /B/Y.
Writing /A/New is possible if the writer has write permission on /B/Y and in that case writes /B/Y

4.3.4: Disk space management

All general purpose systems use a sort of (non-demand) paging algorithm for file storage. Files are broken into fixed size pieces, called blocks that can be scattered over the disk.

The file is completely stored on the disk (one can imagine system that only stores parts of file on disk the rest of tertiary storage). Perhaps NASA does this with their huge datasets.

Choice of block size

We discussed this before when studying pagesize
Current commodity disk characteristics (not for laptops)
- Rotation at 5400, 7600, or 10,000 RPM. Recall that 6000 RPM is 100 rev/sec or one rev per 10ms. So half a rev is 5ms
- Transfer rates around 10MB/sec = 10KB/ms
- Seek around 10ms
So it takes about 15ms to start and then you get 10KB each additional ms.
This favors large blocks, 100KB or more.
But the internal fragmentation would be severe since many files are small.
Multiple block sizes have been tried as have techniques to try to have consecutive blocks of a given file near each other.
Typical block sizes are 4KB-8KB

Storing freeblocks

In memory bit map.
- One bit per block
- If blocksize=4K, 1 bit per 32K bits
- So 32GB disk (potentially all free) needs 1MB ram
Bit map paged in
Linked list with each free block pointing to next: Extra disk access per block
Linked list with links stored contiguously, i.e. an array of pointers to free blocks. Store this in free blocks and keep one in memory.

4,3.5: File System reliability

Bad blocks on disks

Not so much of a problem now. Disks more reliable and more importantly, disks take care of the bad blocks themselves

Backups

Full vs incremental (levels)
The no-dump file attribute in ms-dos and linux (and others)

Consistency

Fsck (file system check) and chkdsk
- Scan all inodes (or fat), see that each block is in one file or on free list
- Also check that the number of links to each file is correct (by looking at all directories)
- Other checks as well
``Journaling'' filesystems

==== Start Lecture #10 ====

4.3.6 File System Performance

Buffer cache or block cache

An in memory cache of disk blocks

What about writes?
- You could use write through, i.e. all writes performed to the disk before declared complete.
  - Can remove floppy as soon as operation done.
  - Heavy I/O write traffic with write through.
  - MS-Dos
- Write back: Only write when eviction occurs
  - Much less write traffic than write throuth
  - Trouble if a crash occurs.
  - Unix
  - Write dirty blocks periodically, say every minute
  - Ordered writes. Do not write a block containing pointers until the block pointed to has been written. Especially if the block pointed to contains pointers since the version of these pointers on disk may be wrong and you are giving a file pointers to some random blocks.
Research in ``log-structured'' file systems

4.4: Security

Very serious subject. Could easily be a course in itself. My treatment is very brief.

4.4.1: Security environment

Accidental data loss
- Fires, floods, etc
- System errors
- Human errors

Intruders

Sadly an enormous problem now.
NYU ``greeting'' now doesn't include ``welcome'' (that was interpreted as some sort of license to break in).
Indeed, the greeting is not friendly. It once was.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           WARNING:  UNAUTHORIZED PERSONS ........ DO NOT PROCEED
           ~~~~~~~   ~~~~~~~~~~~~~~~~~~~~          ~~~~~~~~~~~~~~
 This computer system is operated by New York University (NYU) and may be
 accessed only by authorized users.  Authorized users are granted specific,
 limited privileges in their use of the system.  The data and programs
 in this system may not be accessed, copied, modified, or disclosed without
 prior approval of NYU.  Access and use, or causing access and use, of this
 computer system by anyone other than as permitted by NYU are strictly pro-
 hibited by NYU and by law and may subject an unauthorized user, including
 unauthorized employees, to criminal and civil penalties as well as NYU-
 initiated disciplinary proceedings.  The use of this system is routinely
 monitored and recorded, and anyone accessing this system consents to such
 monitoring and recording.  Questions regarding this access policy or other
 topics should be directed (by e-mail) to comment@nyu.edu or (by phone) to 
 212-998-3333.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Privacy
An enormously serious (societal) subject

4.4.2: Famous flaws

Good bathroom reading
Trojan horse attack: Planing a program of your choosing in place of a well known program and having an unsuspecting user execute.
1. For example, put in a new version of login that does everything normal but mails the username and plaintext password to gottlieb@nyu.edu.
2. Put a new version of ls in your directory and ask the sysadmin for help. ``Hopefully he types ls while in your directory and has . early in his path''

4.4.3: The internet worm

Not a virus.
A worm divides itself and sends one portion to another machine
The famous internet (Morris) worm exploited silly bugs in unix to crack systems automatically.
Specifically gets not checking lengths.
Attacked Sun and Vax unix systems.
NYU was hit hard

4.4.4: Generic Security attacks

Viruses

Attach to an existing program or to a portion of the disk that is used for booting.
When run tries to attach itself to other files
Like a binary patch: Change first instruction to jump to somewhere where you put the real first instruction, then your patch, then a jump back to the second instruction

4.4.5: Design principles for security

Bathroom

4.4.6: User authentication

Passwords

Publically available cracking software
Use this software for prevention
One time passwords: e.g. SecurId.
Current practice here and elsewhere is that when you telnet your password is sent in the clear along the ethernet.
So maybe .rhosts is aren't that bad after all.
Physical identification: Opens up a bunch of privacy questions. Should we require fingerprints for entering the subway?

Homework: 15, 16, 19, 24.

4.5: Protection mechanisms

4.5.1: Protection domains

Objects (passive) vs subjects (active)
For example processes examine files
Protection domain: A collection of (object, rights) pairs
At any given time a subject is given a protection domain that specifies its rights.
In unix a subjects domain is determined by its (uid, gid) (and whether it is in kernel mode).
Generates a matrix called the protection or permission matrix
- Each row a domain (i.e. a subject at some time)
- Each column an object (e.g. file, device)
- Each entry the right the domain/subject has on this object
Can model unix suid/sgid by permitting columns whose headings are domains and the only right possible in the corresponding entries is entry. If this right is there the subject can s[ug]id to this new domain.

4.5.2: Access Control Lists (ACLs)

Keep the columns of the matrix separate and only keep the nonnull entries.

4.5.3: Capabilities

Keep the rows of the matrix separate and only keep the nonnull entries.

4.5.4: Protection models

Give objects and subjects security levels and enforce

A subject may read only those objects whose level is at or below her own.
A subject may write only those objects whose level is at or above her own.

4.5.5: Covert channels

The bad guys are getting smart and use other means of getting out information. For example give good service for a zero and bad for a one. The figure of merit is what rate can bits be sent, i.e. the bandwidth of the covert channel.

Homework: 20.

==== Start Lecture #11 ====

Chapter 5: Input/Output

5.1: Principles of I/O Hardware

5.1.1: I/O Devices

Not much to say. Devices are varied
Block versus character devices:
- Devices, like disks or CD-roms, with addressible chunks (sectors in this case) are called block devices.
  These devices do not support seeking.
- Devices, like an ethernet or modem connection, that are a stream of characters are called character devices
  These devices do not support seeking
- Some cases, like tapes, are not so clear

5.1.2: Device Controllers

These are the ``real devices'' as far as the OS is concerned. That is the OS code is written with the controller spec in hand not with the device spec.

The figure in the book is so oversimplfied as to be borderline false. The following picture is closer to the truth (but really there are several I/O buses of different speeds).

The controller abstracts away some of the low level features of the device.
For disks, the controler does error checking, buffering and handles interleaving of sectors. Sectors are interleaved if the controler or cpu cannot handle the data rate and would otherwise have to wait a full revolution. I do not believe this is a concern with modern machines where the electronics have increased in speed faster than the devices.
For analog monitors (aka CTRs) the controler does a great deal. Analog video is far from a bunch of ones and zeros.
Controllers are also called adaptors.
Typically the interface the OS sees consists of some device registers located on the controler (address to access, read vs write, length, data value, etc).
Memory-mapped I/O vs. I/O space. In memory mapped I/O the device registers are mapped into normal memory space. So you need to know that
```
load 8888<--4
```
tells the disk read the sector whose address you previously loaded into 8800. In the I/O space you used different instructions (not ordinary loads and stores) to do essentially the same thing.

Homework: 2

5.1.3: Direct Memory Access (DMA)

The disk controler gets the desired data from the disk to its buffer
For programmed I/O (PIO) the cpu then does loads and stores (or I/O instructions) to copy the data from the buffer to the desired memory location.
With a DMA controller, the controller writes the memory without intervention of the CPU.
Clearly DMA saves CPU work. But this might not be important if the CPU is limited by the memory or by system buses.
Very important is that there is less data movement so the buses are used less and the entire operation takes less time.
Since PIO is pure software it is easier to change, which is an advantage.
DMA does need a number of bus transfers from the CPU to the controler to specify the dma. So dma is most effective for large transfers where the setup is amortized.
Why have the buffer? Why not just go from the disk straight to the memory.
Ans: Speed matchine. The disk gives data a fixed rate which might exceed the rate the memory can accept it. Also might have two dma controlers that have been given requests.

Homework: 5

5.2: Principles of I/O Software

As with any large software system, good design and layering is important.

5.2.1: Goals of the I/O Software

Device independence

We want to have most of the OS, unaware of the characteristics of the specific devices attached to the system. Indeed we also want the OS to be largely unaware of the cpu type itself.

It is thanks to this device independence that programs can be written to write and read generic devices and then at run time specific devices are assigned. Writing to a disk has differences from writing to a terminal, but unix cp (dos copy) doesn't see these differences. Indeed, most of the OS, including the filesystem code, Is unaware of whether the device is a floppy or hard disk.

Uniform naming

Recall that we discussed the value of the name space implemented by filesystems. There is no dependence between the name of the file and the device on which it is stored.

Error handling

There are several aspects to error handling including: detection, correction (if possible) and reporting.

Detection should be done as close to where the error occurred as possible before more damage is done (fault containment). This is not trivial.
Correction is sometimes easy, for example ECC memory does this automatically (but the OS wants to know about it and schedule replacement of the faulty chips before unrecoverable double errors occur).
Other easy cases include successful retries for failed ethernet transmissions. In this example, while logging is appropriate, it is quite reasonable for no action to be taken.
Error reporting tends to be awful. The trouble is that the error occurs occurs at a low level but by the time it is reported the context is lost. Unix/linux in particular is horrible in this area.

Creating the illusion of synchronous I/O

I/O must be asynchronous for performance. That is the OS cannot simply wait for an I/O to complete. Instead, it proceeds with other activities and responds to the notification when the I/O has finished.
Users (mostly) want no part of this. The code sequence
```
Read X
Y <-- X+1
Print Y
```
should print a value one greater than that read. But if the assignment is performed before the read completes ...
Performance junkies sometimes do want the asynchrony as they want to do what the OS does.

Sharable vs dedicated devices

For devices like printers and tape drives, only one user at a time is permitted. These are called serially resuable devices and are studied next chapter.

Layering

Layers of abstraction as usual prove to be effective. Most systems are believed to use the following layers (but for many systems, the OS code is not available for inspection).

User level I/O routines
Device independent I/O software
Device drivers
Interrupt handlers

We give a bottom up explanation.

5.2.2: Interrupt Handlers

We discussed an interrupt handler before when studying page faults. Then it was called ``assembly language code''.

In the present case, we have a process (actually the device driver OS code running in behalf of a user process) blocked on I/O and the I/O event has just completed. So the goal is to make the process ready. Possible methods are.

Releasing a semaphore on which the process is waiting
Sending a message to the process.
Inserting the process table entry onto the ready list.

5.2.3: Device Drivers

The portion of the OS that ``knows'' the characteristics of the controler.

The driver has two ``parts'' corresponding to its two access points. Recall the following figure from the beginning of the course.

Access by the main line OS with an I/O request.
Accessed by the interrupt handler when the I/O completes (this completion is signaled by an interrupt).

Tanenbaum describes the actions of the driver assuming it is implemented as a process (which he recommends). I give both that view point and the self-service paradigm in which the driver is invoked by the OS acting in behalf of a user process (more precisely the process shifts into kernel mode).

Driver as a process (tanenbaum)

The user issues an I/O request. The main line OS prepares a generic (e.g. read, not read using buslogic scsi controler) request for the driver and the driver is awakened (perhaps a message is sent to the driver to do both jobs).
1. The driver wakes up
  1. If the driver was idle (i.e., the controler is idle), the driver writes device registers on the controler ending with a command for the controler to begin the actual I/O.
  2. If the controler is busy (doing work the driver gave it), the driver simply queues the current request (the driver dequeues this below).
2. The driver blocks waiting for an interrupt or for more requests.
An interrupt arrives (i.e. an I/O has been completed)
1. The driver informs the main line perhaps passing data and surely passing status (error, OK).
2. Find next work item or block
  1. If the queue of requests is non-empty dequeue one and proceed as if just received a request from the main line.
  2. If queue is empty, the driver blocks waiting for an interrupt or a request from the main line.

Driver in a self-service paradigm

The user issues an I/O request. The main line OS prepares a generic request for the driver and calls the driver.
1. The driver is called
  1. If the driver was idle (i.e., the controler is idle), the driver writes device registers on the controler ending with a command for the controler to begin the actual I/O.
  2. If the controler is busy (doing work the driver gave it), the driver simply queues the current request (the driver dequeues this below).
2. The driver returns to the main line
An interrupt arrives (i.e. an I/O has been completed)
1. The driver informs the main line perhaps passing data and surely passing status (error, OK).
2. Find next work item or return
  1. If the queue of requests is non-empty dequeue one and proceed as if just received a request from the main line.
  2. If queue is empty, the driver returns to the main line.

5.2.4: Device-Independent I/O Software

Most of the functionality. But not necessarily most of the code since there can be many drivers all doing essentially the same thing is slightely different way due to slightly different controllers.

Naming. Again an important O/S functionality. Must offer a consistent interface to the device drivers.
In unix this is done by associating each device with a (special) file in the /dev directory. The inodes for these files contain an indication that these are special files and also contain so called major and minor device numbers. The major device number gives the number of the driver. (These numbers are rather ad hoc, they correspond to the position of the function pointer to the driver in a table of function pointers.)
Protection. A wide range of possibilities are actually done in real systems. Including both extreme examples of everything is permitted and nothing is permitted (directly).
- In ms-dos any process can write to any file. Presumably our offensive nuclear missile launchers do not run dos.
- In IBM and other mainframe OS's, normal processors do not access devices. Indeed the main CPU doesn't issue the I/O requests. Instead an I/O channel is used and the mainline constructs a channel program and tells the channel to invoke it.
- Unix uses normal rwx bits on files in /dev (I don't believe x is used).
Buffering is necessary since requests come in a size specified by the user and data is delivered in a size specified by the device.
Enforce exclusive access for non-shared devices like tapes.

5.2.5: User-Space Software

A good deal of I/O code is actually executed in user space. Some is in library routines linked into user programs and some is in daemon processes.

Some library routines are trivial and just move their arguments int the corrct place (e.g. registers) and then issue a trap to the correct system call.
Some, notably standard-IO (stdio) in unix, are definitely not trivial. For example consider the formatting of floating point numbers done in printf and the reverse in scanf.
Printing to a local printer is often in part by a regular program (lpr in unix) and part by a deamon (lpd in unix). The daemon might be started when the system boots or might be started on demand. I guess it is called a daemon because it is not under the control of any user.
Pringing uses spooling, i.e., the file to be printed is copied somewhere by lpr and then the daemon works with this copy. Mail uses a similar technique (but generally it is not called spooling but queuing).

Homework:

Homework: 6, 7, 8.

5.3: Disks

The ideal storage device is

Fast
Big (in capacity)
Cheap
Impossible

Disks are big and cheap, but slow.

5.3.1: Disk Hardware

Show a real disk opened up and illustrate the components

Platter
Surface
Head
Track
Sector
Cylinder
Seek time
Rotational latency
Transfer time

Overlapping I/O operations is important. Many controllers can do overlapped seeks, i.e. issue a seek to one disk while another is already seeking.

Despite what Tannenbaum says, modern disks cheat and do not have the same number of sectors on outer cylinders as on inner one. Often the controller ``cover for them'' and protect the lie.

Again contrary to Tannenbaum, it is not true that when one head is reading from cylinder C, all the heads can read from cylinder C with no penalty.

==== Start Lecture #12 ====

5.3.2: Disk Arm Scheduling Algorithms

These algorithms are relevant only if there are several I/O requests pending. For many PCs this is not the case. For most commercial applications, I/O is crucial.

FCFS (First Come First Served): Simple but has long delays.
Pick: Same as FCFS but pick up requests for cylinders that are passed on the way to the next FCFS request
SSTF (Shortest Seek Time First): Greedy algorithm. Can starve requests for outer cylinders and almost always favors middle requests.
Scan (Look, Elevator): The method used by an old fashioned jukebox (rember ``Happy Days'') and by elevators. The disk arm proceeds in one direction picking up all requests until there are no more requests in this direction at which point it goes back the other direction. This favors requests in the middle, but can't starve any requests.
C-Scan (C-look, Circular Scan/Look): Similar to Scan but only service requests when moving in one direction. When going in the other direction, go directly to the furthest away request. This doesn't favor any spot on the disk. Indeed, it treats the cylinders as though they were a clock, i.e. after the highest numbered cylinder comes cylinder 0.
N-step Scan: This is what the natural implementation of Scan gives.
- While the disk is servicing a Scan direction, the controller gathers up new requests and sorts them.
- At the end of the current sweep, the new list becomes the next sweep.

Minimizing Rotational Latency

Use Scan, which is the same as C-Scan. Why?
Because the disk only rotates in one direction.

Homework: 9, 10.

RAID (Redundant Array of Inexpensive Disks)

Tannenbaum's treatment is not very good.

This name is from Berkeley.
IBM changed the name to Redundant Array of Independent Disks
A simple form is mirroring, where two disks contain the same data.
Another simple form striping (interleaving) where consecutive blocks are spread across multiple disks. This helps bandwidth but is not redundant so shouldn't be called RAID but it sometimes is.
One of the normal RAIDs is to have N (say 4) data disks and one parity disk. Data is striped across the data disks and the bitwise parity of these sectors is written in the corresponding sector of the parity disk.
A variation is to rotate the parity. That is, for some stripes disk 1 has the parity, for others disk 2, etc.
A serious concern is the small write problem. Writing a sector requires 4 I/O. Read the old data sector, compute the change, read the parity, compute the new parity, write the new parity and the new data sector. Hence one sector I/O became 4, which is 300% overhead.
Writing a full stripe is not bad. Compute the parity of the N (say 4) data sectors to be written and then write the data sectors and the parity sector. Thus 4 sector I/Os became 5, which is only a 20% penalty and is smaller for larger N, i.e., larger stripes.

5.3.3: Error Handling

Disks error rates have dropped in recent years. Moreover, bad block forwarding is done by the controller (or disk electronic).

5.3.4: Track Caching

Often the disk/controller caches a track, since the seek penalty has already been paid. In fact modern disks have megabyte caches that hold recently read blocks. Since modern disks cheat and don't have the same number of blocks on each track, it is better for the disk electronics to do the caching since it is the only part of the system to know the true geometry.

5.3.5: Ram Disks

Fairly clear. Organize a region of memory as a set of blocks and pretend it is a disk.
A problem is that memory is volatile.
Often used during OS installation, before disk drivers are available (there are many types of disk but memory all memory looks the same so only one ram disk driver is needed).

5.4: Clocks

Also called timers.

5.4.1: Clock Hardware

Generates an interrupt when timer goes to zero
Counter reload can be automatic or under software (OS) control.
If done automatically, the interrupt occurs periodically and thus is perfect for a time of day (TOD) clock (assuming time of day initialized somehow).

5.4.2: Clock Software

TOD: Bump a counter each tick (clock interupt). If counter is only 32 bits must worry about overflow so keep two counters: low order and high order.
Time quantum for RR: Decrement a counter at each tick quantum expires when counter is zero. Load this counter when the scheduler runs a process.
Accounting: At each tick, bump a counter in the process table entry for the currently running process.
Alarm system call and system alarms:
- Users can request an alarm at some future time and the system also needs to do things specific future times (e.g. turn off floppy motor).
- The conceptually simplist solution is to have one timer for each event. Instead, we simulate many timers with just one.
- The data structure on the right works well.
- The time in each list entry is the time after the preceeding entry that this entry's alarm is to ring. The other entry is a pointer to the action to perform.
- At each tick, decrement next-signal.
- When next-signal goes to zero. process the first entry on the list and any others following immediately after with a time of zero (which means they are to be simultaneous with this alarm. Then set next-signal to the value in the next alarm
Profiling
- Want a histogram giving how much time was spent in each 1KB (say) block of code.
- At each tick check the PC and bump the appropriate counter.
- At the end of the run can assign the 1K blocks to software modules.
- If use fine grainularity (say 10B instead of 1KB) get higher accuracy but more memory overhead.

Homework: 12

5.5: Terminals

5.5.1: Terminal Hardware

Quite dated. It is true that modern systems can communicate to a hardwired ascii terminal, but most don't. Serial ports are used, but they are normally connected to modems and then some protocol (SLIP, PPP) is used not just a stream of ascii characters.

5.5.2: Memory-Mapped Terminals

Less dated. But it still discusses the character not graphics interface.
Today, the idea is to have the software write into video memory the bits to be put on the screen and then the graphics controller converts these bits to analog signals for the monitor (actually laptop displays and very modern monitors are digital).
But it is much more complicated than this. The graphics controllers can do a great deal of video themselves (like filling).
This is a subject that would take many lectures to do well.
The tannenbaum description of keyboards is, however, correct.
- At each key press and key release a code is written into the keyboard controller and the computer is interrupted.
- By remembering which keys have been depressed and not released the software can determine Cntl-A, Shift-B, etc.

==== Start Lecture #13 ====

5.5.3: Input Software

We are just looking at keyboard input. Once again graphics is too involved to be treated here.
Two fundamental modes of input raw and cooked.
In raw mode the application sees every ``character'' the user types. Indeed, raw mode is character oriented.
- All the OS does is convert the keyboard ``scan codes'' to ``characters'' and and pass these characters to the application. to w
- Some examples
  1. down-cntl down-x up-x up-cntl is converted to cntl-x
  2. down-cntl up-cntl down-x up-x is converted to nothing
  3. down-cntl down-x up-cntl up-x is converted to cntl-x (I just tried it to be sure).
  4. down-x down-cntl up-x up-cntl is converted to x
- Full screen editors use this mode.
Cooked mode is line oriented. The OS delivers lines to the application program.
- Erased characters are not seen by the application but are erased by the keyboard driver.
- Special characters are interpreted as editing characters (erase-previous-character, erase-previous-word, kill-line, etc).
- Need an escape character so that the editing characters can be passed to the application if desired.
- The cooked characters must be echoed (what should one do if the application is also generating output at this time?)
The (possibly cooked) characters must be buffered until the application issues a read (and an end-of-line EOL has been received for cooked mode).

5.5.4: Output Software

Again too dated and the truth is too complicated to deal with in a few minutes.

Homework: 16.

Chapter 6: Deadlocks

A deadlock occurs when a every member of a set of processes is waiting for an event that can only be caused by a member of the set. Often the event waited for is the release of a resource.

In the automotive world deadlocks are called gridlocks.

The processes are the cars.
The resources are the spaces occupied by the cars

Reward: One point extra credit on the final exam for anyone who brings a real (e.g., newspaper) picture of an automotive deadlock. You must bring the clipping to the final and it must be in good condition. Hand it in with your exam paper.

For a computer science example consider two processes A and B that each want to print a file currently on tape.

A has obtained ownership of the printer (but will release it when it finished printing).
B has obtained ownership of the tape drive (but will release when it has finished reading the tape).
A tries to get ownership of the tape drive, but is told to wait for B to release it.
B reads a block from the tape and then tries to get ownership of the printer to print the block; naturally B is told to wait for A to release the printer.

Bingo: deadlock!

6.1: Resources:

The resource is the object granted to a process.

Resources come in two types
1. Preemptable, meaning that the resource can be taken away from its current owner. An example is memory.
2. Non-preemptable, meaning that the resource cannot be taken away. An example is a printer.
The interesting issues arise with non-preemptable resources so those are the ones we study.
Life history of a resource is a sequence of
1. Request
2. Allocate
3. Use
4. Release
Processes make requests, use the resourse, and release the resourse. The allocate decisions are made by the system and we will study some policies used to make these deci

6.2: Deadlocks

To repeat: A deadlock occurs when a every member of a set of processes is waiting for an event that can only be caused by a member of the set. Often the event waited for is the release of a resource.

6.3: Necessary conditions for deadlock

The following four conditions (Coffman; Havender) are necessary but not sufficient for deadlock. Repeat: They are not sufficient.

Mutual exclusion: A resource can be assigned to at most one process at a time (no sharing).
Hold and wait: A processing holding a resource is permitted to request another.
No preemption: A process must release its resources; they cannot be taken away.
Circular wait: There must be a chain of processes such that each member of the chain is waiting for a resource held by the next member of the chain.

6.2.2: Deadlock Modeling

On the right is the Resource Allocation Graph, also called the Resourse Graph.

The processes are circles.
The resourses are squares.
An arc (directed line) from a process P to a resourse R signifies that process P has requested (but not yet been allocated) resourse R.
An arc from a resourse R to a process P indicates that process P has been allocated resource R.

Homework: 1.

Consider two concurrent processes P1 and P2 whose programs are.

P1: request R1       P2: request R2
    request R2           request R1
    release R2           release R1
    release R1           release R2

On the board draw the resource allocation graph for various possible executions of the processes, indicating when deadlock occurs and when deadlock is no longer avoidable.

There are four strategies used for dealing with deadlocks.

Ignore the problem
Detect deadlocks and recover from them
Avoid deadlocks by carefully deciding when to allocate resources.
Prevent deadlocks by violating one of the 4 necessary conditions.

6.3: Ignoring the problem--The Ostrich Algorithm

The ``put your head in the sand approach''.

If the likelihood of a deadlock is sufficiently small and the cost of avoiding a deadlock is sufficiently high it might be better to ignore the problem. For example if each PC deadlocks once per 100 years, the one reboot may be less painful that the restrictions needed to prevent it.
Clearly not a good philosophy for nuclear missile launchers.
For embedded systems (e.g., missile launchers) the programs run are fixed in advance so many of the questions tannenbaum raises (such as many processes wanting to fork at the same time) don't occur.

6.4: Detecting Deadlocks and Recovering from them

6.4.1: Detecting Deadlocks with single unit resources

Consider the case in which there is only one instance of each resource.

So a request can be satisfied by only one specific resourse.
In this case the 4 necessary conditions for deadlock are also sufficient.
Remember we are making an assumption (single unit resources) that is often invalid. For example, many systems have several printers and a request is given for ``a printer'' not a specific printer. Similarly, one can have many tape drives.
So the problem comes down to finding a directed cycle in the resource allocation graph. Why?
Ans: Because the other three conditions are either satisfied by the system we are studying or are not in which case deadlock is not a question. That is, conditions 1,2,3 are contitions on the system in general not on what is happening right now.

To find a directed in a directed graph is not hard. The algorithm is in the book. The idea is simple.

For each node in the graph do a depth first traversal (hoping the graph is a DAG (directed acyclic graph), building a list as you go down the DAG.
If you ever find the same node twice on your list, you have found a directed cycle and the graph is not a DAG and deadlock exists among the processes in your current list.
If you never find the same node twice, the graph is a DAG and no deadlock occurs.
The searches are finite since the list size is bounded by the number of nodes.

6.4.2: Detecting Deadlocks with multiple unit resources

This is more difficult.

The figure on the right shows a resource allocation graph with multiple unit resources.
Each unit is represented by a dot in the box.
Request edges are drawn to the box since they represent a request for any dot in the box.
Allocation edges are drawn from the dot to represent that this unit of the resource has been assigned (but all units of a resource are equivalent and the choice of which one to assign is arbitrary).
Note that there is a directed cycle in black, but there is no deadlock. Indeed the middle process might finish, erasing the magenta arc and satisfying the process on the right.
The book gives an algorithm for detecting deadlocks in this more general setting. The idea is as follows.
1. look for process that might be able to terminate (i.e. all its request arcs can be satisfied).
2. If one is found pretend that it does terminate (erase all its arcs) and repeat step 1.
3. If any processes remain, they are deadlocked.
We will really do an algorithm later (the Banker's algorithm) that has some of this flavor.

6.4.3: Recovery from deadlock

Preemption

Perhaps you can temporarily preempt a process. Not likely.

Rollback

Database (and other) systems take periodic checkpoints. If the system does take checkpoints, one could roll back to one.

Kill processes

Can always be done but might be painful. For example some processes have had effects that can't be simply undone. Print, launch a missle, etc.

6.5: Deadlock Avoidance

Let's see if we can tiptoe through the tulips and avoid deadlock states even though our system does permit all four of the necessary conditions for deadlock.

An optimistic resource manager is one that grants every request as soon as it can. To avoid deadlocks with all four conditions present, the manager must be smart not optimistic.

6.5.1 Resource Trajectories

I believe this is a tannenbaumism. I have never tried to teach it before but the possibility of color might make this understandable.

We have two processes H (horizontal) and V.
The origin represents them both starting.
Their combined state is a point on the graph.
The parts where the printer and plotter are needed by each process are indicated.
The dark green is where both processes have the plotter and hence execution cannot reach this point.
Light green represents both having the printer; also impossible.
Pink is both having both printer and plotter; impossible.
Gold is possible (H has plotter, V has printer), but you can't get there.
The upper right corner is the goal both processes finished.
The red dot is ... (cymbals) deadlock. We don't want to go there.
The cyan is safe. From anywhere in the cyan we have horizontal and vertical moves to the finish line without hitting any impossible area.
The magenta interior is very interesting. It is
- Possible: each processor has a different resource
- Not deadlocked: each processor can move within the magenta
- Deadly: deadlock is unavoidable. You will hit a magenta-green boundrary and then will no choice but to turn and go to the red dot.
The cyan-magenta border is the danger zone
The dashed line represents a possible execution pattern.
With a uniprocessor no diagonals are possible. We either move to the right meaning H is executing or move up indicating V.
The trajectory shown represents.
1. H excuting a little
2. V excuting a little
3. H executes; requests the printer; gets it; executes some more
4. V executes; requests the plotter
The crisis is at hand!
If the resource manager gives V the plotter, the magenta has been entered and all is lost. ``Abandon all hope yee who enter here'' --dante.
The right thing to do is to deny the request, let H execute moving horizontally under the magenta and dark green. At the end of the dark green, no danger remains, both processes will terminate.

==== Start Lecture #14 ====

Remark: An optimistic resource manager is one that grants every request as soon as it can. To avoid deadlocks with all four conditions present, the manager must be smart not optimistic.

6.5.2: Safe States

Avoiding deadlocks given some extra knowledge.

Not surprisingly the resource manager knows how many units of each resource it had to begin with.
Also it knows how many units of each reasource it has given to each process
It would be great to see all the programs in advance and thus know all future requests, but that is asking for too much.
Instead, each process when it starts gives its maximum usage. That is each process at startup gives for each resource the maximum number of units it can possibly ask for.
- If during the run the process asks for more than its claim, abort it.
- If it claims more than it needs, the result is that the resource manager will be more conservative than need be and there will be more waiting.

Definition: A state is safe if there one can find an ordering of the processes such that if the processes are run in this order, they will all terminate (assuming none exceeds its claim).

A manager can determine if a state is safe.

Since the manager know all the claims, it can determine the maximum amount of additional resources each process can request.
The manager knows how many units of each resource it has left.
If there is a process whose max additional requests is less than what remains (for each resource), if this process is run first it will terminate.
Once this process has terminated, the supply of resources is increased by how many the process has now.
The manager then determines if, with this additional pile of resources the manager can satisfy the maximum possible future requests for another process.
If this proceedure results in all processes terminating, the state is safe.
If the procedure comes to a point where it cannot satisfy any of the remaining processes the state is not safe.

Example 1

One resource type R with 22 unit
Three processes X, Y, and Z with claims 3, 11, and 19 respectively.
Currently the processes have 1, 5, and 10 units respectively.
Hence the manager currently has 6 units left.
Also max additional needs for processes are 2, 6, 9
This state is safe
1. Use 2 units to satisfy X; now the manager has 7 units.
2. Use 6 units to satisfy Y; now the manager has 12 units.
3. Use 9 units to satisfy Z; done!

Example 2

Assume that Z now requests 2 units and we grant them.

Currently the processes have 1, 5, and 12 units respectively.
The manager has 4 units.
The max additional needs are 2, 6, and 7.
This state is unsafe
1. Use 2 unit to satisfy X; now the manager has 5 units.
2. Y needs 6 and Z needs 7 so we can't guarantee satisfying either

Remark: An unsafe state is not necessarily a deadlocked state. Indeed, if one gets lucky all processes may terminate successfully. A safe state means that the manager can guarantee that no deadlock will occur.

6.5.3: The Banker's Algorithm (Dijkstra) for a Single Resource

The algorithm is simple: Stay in safe states.

Check before any process starts that the state is safe (this means that no process claims more than the manager has). If not, then this process is trying to claim more than the system has so cannot be run.
When the manager receives a request, it pretends to grant it and checks if the resulting state is safe. If it is safe the request is granted, if not the process is blocked.
When a resource is returned, the manager checks to see if any of the pending requests can be granted (i.e., if the result would now be safe). If so the request is granted and the manager checks to see if another can be granted.

6.5.4: The Banker's Algorithm for Multiple Resources

At a high level the algorithm is identical: Stay in safe states.

What is a safe state?
The same definition (if processes are run in a certain order they will all terminate).
Checking for safety is the same idea as above. The difference is that to tell if there enough free resources for a processes to terminate, the manager must check that for all resources, the number of free units is at least equal to the max additional need of the process.

Limitations of the banker's algorithm

Often processes don't know their max needs. They can estimate conservatively (i.e. big numbers) but then the manager becomes very conservative.
New processes arriving cause a problem (but not so bad as tannenbaum suggests).
- Assume the process's claims is less than the total number in the system (else reject the process as hopeless).
- Since the state without the new process is safe, so is the state with the new process! Just use the order you had originally and put the new process at the end.
- Insuring fairness (starvation freedom) needs a little more work, but isn't too hard either (once an hour stop taking new processes until all current processes finish).
A process becoming unavailable (tape drive breaking), can result in an unsafe state.

Homework: 11, 14 (not to be handed in).

6.6: Deadlock Prevention

Attack one of the coffman/havender conditions

6.6.1: Attacking Mutual Exclusion

Idea is to try to use spooling instead of mutual exclusion. Not possible for many kinds of resources

6.6.2: Attacking Hold and Wait

Require processes to ask for all resources in the beginning (or first release what they have and ask for it back plus new resources all at once). This is often called One Shot.

6.6.3: Attacking No Preempt

Normally not possible.

6.6.4: Attacking Circular Wait

Establish a fixed ordering of the resources and require that they be requested in this order. So if a process holds resources #34 and #54, it can request only resources #55 and higher.

It is easy to see that a cycle is now not possible.

6.7: Other Issues

6.7.1: Two-phase locking

This is covered (MUCH better) in a database text. We will skip it.

6.7.2: Non-resource deadlocks

You can get deadlock from semaphores as well as resources. This is trivial. Semaphores can be considered resources. P(S) is request S and V(S) is release S. The manager is the module implementing P and V, when the manager returns from P(S), it has granted the resource S.

6.7.3: Starvation

As usual FIFS is a good cure. Often this is done by priority aging and picking the highest priority process to get the resource. Also can periodically stop accepting new processes until all old ones get the resources.

Administrivia

Web Pages

Textbook

Computer Accounts and majordomo mailing list

Homeworks and Labs

1. Introduction

1.1: What is an operating system?

1.2 History of Operating Systems

1.3: Operating System Concepts

1.3.1: Processes

1.3.2: Files

1.3.3: System Calls

1.3.4: The shell

1.4: OS Structure

1.4.1: Monolithic approach

1.4.2: Layered Systems

1.4.4: Virtual machines

14.4: Client Server

Chapter 2: Process Management

2.1: Processes

Process Hierarchies

Process states and transitions

2.1.3: Implementation of Processes

An aside on interupts

2.2: Interprocess Communication (IPC)

2.2.1: Race Conditions

2.2.2: Critical sections

2.2.3 Mutual exclusion with busy waiting

Software solutions for two processes

Hardware assist (test and set)

P and V and Semaphores

Producer-consumer problem

Dining Philosophers

Readers and writers

2.4: Process Scheduling

Preemption

Deadline scheduling

The name game

First Come First Served (FCFS, FIFO, FCFS, --)

Round Robbin (RR, RR, RR, RR)

Selfish RR (SRR, **, SRR, **)

Processor Sharing (PS, **, PS, PS)

Shortest Job First (SPN, SJF, SJF, SJF)

Preemptive Shortest Job First (PSPN, SRT, PSJF/SRTF, --)

Priority aging

Highest Penalty Ratio Next (HPRN, HRN, **, **)

Multilevel Queues (**, **, MLQ, **)

Multilevel Feedback Queues (FB, MFQ, MLFBQ, MQ)

Theoretical Issues

Medium Term scheduling

Long Term Scheduling

Chapter 3: Memory Management

3.1: Memory management without swapping or paging

** 3.1.1: Monoprogramming without swapping or paging (Single User)

3.1.2: Multiprogramming

3.1.3: Multiprogramming with fixed partitions

3.2: Swapping

3.2.1: Multiprogramming with variable partitions

Introduces the ``Placement Question'', which hole (partition) to choose

Also introduces the ``Replacement Question'', which victim to swap out

Considerations in choosing a victim

** (non-demand) Paging

3.2: Virtual Memory (meaning fetch on demand)

3.2.1: Paging (meaning demand paging)

3.3.2: Page tables

Protection bits

Multilevel page tables

3.3.4: Associative memory (TLBs)

3.3.5: Inverted page tables

3.4: Page Replacement Algorithms

Random

3.4.1: The optimal page replacement algorithm (opt PRA)

3.4.2: The not recently used (NRU) PRA

3..4.3: FIFO PRA

3.4.4: Second chance PRA

3.4.5: Clock PRA

LIFO PRA

3.4.6:Least Recently Used (LRU) PRA

3.4.7: Simulating LRU in Software

The Not Frequently Used (NFU) PRA

Selfish RR (SRR, , SRR, )

Highest Penalty Ratio Next (HPRN, HRN, , )

Multilevel Queues (, , MLQ, **)