Operating Systems

Start Lecture #10

Remark: The new homework password is the first name of my first academic son.

3.4.8: The Working Set Page Replacement Algorithm (Peter Denning)

The working set policy

The goals are first to specify which pages a given process needs to have memory resident in order for the process to run without too many page faults and second to ensure that these pages are indeed resident.

But this is impossible since it requires predicting the future. So we again make the assumption that the near future is well approximated by the immediate past.

We measure time in units of memory references, so t=1045 means the time when the 1045th memory reference is issued. In fact we measure time separately for each process, so t=1045 really means the time when this process made its 1045th memory reference.

Definition: w(k,t), the working set at time t (with window k) is the set of pages referenced by the last k memory references ending at reference t.
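
The definition can be made concrete with a small sketch. It assumes the reference string is held in a Python list (so reference number 1 is `refs[0]`); the function name `working_set` and the sample string are illustrative, not part of the notes.

```python
def working_set(refs, k, t):
    """Return w(k,t): the pages touched by the last k references ending at reference t.

    refs is a list of page numbers; reference number 1 is refs[0].
    """
    if not 1 <= t <= len(refs):
        raise ValueError("t must be between 1 and len(refs)")
    start = max(0, t - k)        # early on, fewer than k references exist
    return set(refs[start:t])


refs = [0, 1, 2, 0, 1, 3, 0, 3, 3, 3]    # hypothetical reference string
print(working_set(refs, k=4, t=6))        # pages used by references 3..6
print(working_set(refs, k=4, t=10))       # pages used by references 7..10
```

Note that the size of the set shrinks at the end of this string even though the window k is unchanged, because the last four references touch only two distinct pages.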

The idea of the working set policy is to ensure that each process keeps its working set in memory.

Allocate |w(k,t)| frames to each process. This number differs for each process and changes with time.
On a fault, evict a page not in the working set. But it is not easy to find such a page quickly. Indeed, determining w(k,t) precisely is time consuming and difficult, and is never done in real systems.
If a process is suspended and swapped out, its working set can be used to decide which pages should be brought back when the process is resumed.

Homework: Describe a process (i.e., a program) that runs for a long time (say hours) and always has a working set size less than 10. Assume k=100,000 and the page size is 4KB. The program need not be practical or useful.

Homework: Describe a process that runs for a long time and (except for the very beginning of execution) always has a working set size greater than 1000. Again assume k=100,000 and the page size is 4KB. The program need not be practical or useful.

The definition of working set is local to a process. That is, each process has a working set; there is no system-wide working set other than the union of the working sets of all processes.

However, the working set of a single process has effects on the demand paging behavior and victim selection of other processes. If a process's working set is growing in size, i.e., |w(k,t)| is increasing as t increases, then we need to obtain new frames from other processes. A process with a working set decreasing in size is a source of free frames. We will see below that this is an interesting amalgam of local and global replacement policies.

Interesting questions concerning the working set include:

What value should be used for k?
Experiments have been done and k is surprisingly robust (i.e., for a given system, a fixed value works reasonably for a wide variety of job mixes).
How should we calculate w(k,t)?
Hard to do exactly so ...

... Various approximations to the working set have been devised. We will study two: using virtual time instead of memory references (immediately below), and Page Fault Frequency (part of section 3.5.1). In 3.4.9 we will see the popular WSClock algorithm that includes an approximation of the working set as well as several other ideas.

Using Virtual Time

Instead of counting memory references and declaring a page in the working set if it was used within k references, we keep track of time, which the system does anyway, and declare a page in the working set if it was used in the past τ seconds. Note that the time is measured only while this process is running, i.e., we are using virtual time.

Add a field time of last use to the PTE. The procedure for setting this field is in item 3 below.
Clear the reference (R) bit every m milliseconds and set it on every reference; the latter is done by the hardware.
To choose a victim when a page fault occurs, we proceed as follows (also setting the time of last use field). Scan the page table one PTE at a time (actually we are only interested in resident pages so we would rather look at a page frame table).
- If the R bit is 1, the page is in the working set so set the time of last use to the current (virtual) time. (We are assuming τ seconds is bigger than m milliseconds.)
- If the R bit is 0 and the last use was more than τ seconds ago, the page is not in the working set and is evicted (but we keep scanning).
- If the R bit is 0 but the last use is less than τ seconds ago, the page is in the working set so is not evicted.
- If no page was chosen for eviction, evict the LRU page.
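
The scan just described can be sketched as follows. This is only an illustration under stated assumptions: the `Frame` record, the function name `choose_victims`, and the use of a plain list as the page frame table are hypothetical, and times are virtual seconds.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    page: int
    r: bool            # reference bit: set by hardware, cleared every m ms
    last_use: float    # virtual time of last use, in seconds


def choose_victims(frames, now, tau):
    """Scan every resident page once and return the pages to evict."""
    victims = []
    for f in frames:
        if f.r:
            f.last_use = now            # referenced recently: in the working set
        elif now - f.last_use > tau:
            victims.append(f.page)      # unreferenced and old: evict, keep scanning
        # otherwise R = 0 but young: still in the working set, so keep it
    if not victims:
        # every page is in the working set: fall back to the LRU page
        victims.append(min(frames, key=lambda f: f.last_use).page)
    return victims


table = [Frame(0, True, 1.0), Frame(1, False, 2.0), Frame(2, False, 9.5)]
print(choose_victims(table, now=10.0, tau=5.0))   # [1]
```

In the example only page 1 is both unreferenced and older than τ, so it is the single victim; page 0 is rescued by its R bit and page 2 by its age.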

3.4.9 The WSClock Page Replacement Algorithm

The WSClock algorithm combines aspects of the working set algorithm (with virtual time) and the clock implementation of second chance. It also distinguishes clean from dirty and referenced from non-referenced in the spirit of NRU.

Like clock we create a circular list of nodes with a hand pointing to the next node to examine. There is one such node for every resident page of this process; thus the nodes can be thought of as a list of frames or a kind of inverted page table.

Like working set we store in each node the referenced and modified bits R and M and the time of last use. R and M are cleared when the page is read in. R is set by the hardware on a reference and cleared periodically by the OS (perhaps at the end of each page fault or perhaps every m milliseconds). M is set by the hardware on a write. We indicate below the setting of the time of last use and the clearing of M.

As with working set, we use virtual time and declare a page old if its last reference is more than τ seconds in the past. Other pages are declared young (i.e., in the working set).

As with clock, on every page fault a victim is found by scanning the list starting with the node indicated by the clock hand.

If R=1, the page has been recently referenced. R is set to 0, the time of last use is set to now, and the hand advances.
If R=0 and M=1, clear M, (schedule an I/O to) write the dirty page, and advance the hand.
If R=M=0 and the page is young, advance the hand.
If R=M=0 and the page is old, this is our victim; done.

It is possible to go all around the clock without finding a victim. In that case

If writes were scheduled on old pages, one of these will become the victim once it becomes clean.
If no writes were scheduled, we will never find a victim (since nothing will change) so pick a page at random (or perhaps the oldest of the young pages).
If only young pages are scheduled for writing, no old victim will be chosen so we will need to again pick a page at random (if every page is dirty, we must wait for one to become clean).
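
A minimal sketch of the WSClock scan, under a loud simplifying assumption: a scheduled write is treated as completing at once, so clearing M stands in for "the page becomes clean later". The `Node` record and the function name `wsclock_victim` are hypothetical, and when no old page exists the sketch picks the oldest of the young pages, which is one of the two fallbacks mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    page: int
    r: bool            # referenced bit
    m: bool            # modified (dirty) bit
    last_use: float    # virtual time of last use, in seconds


def wsclock_victim(nodes, hand, now, tau):
    """Return (index of the victim node, pages whose write was scheduled).

    Assumption: writes finish immediately, so M is simply cleared.
    """
    n = len(nodes)
    written = []
    for _ in range(2 * n):                  # at most two trips around the clock
        node = nodes[hand]
        if node.r:                          # recently referenced
            node.r = False
            node.last_use = now
        elif node.m:                        # dirty: schedule the write, move on
            node.m = False
            written.append(node.page)
        elif now - node.last_use > tau:     # clean and old: this is the victim
            return hand, written
        # clean and young: just advance the hand
        hand = (hand + 1) % n
    # no old page anywhere: take the oldest of the young pages
    return min(range(n), key=lambda i: nodes[i].last_use), written


clock = [Node(0, True, False, 1.0), Node(1, False, True, 2.0),
         Node(2, False, False, 9.0), Node(3, False, False, 3.0)]
print(wsclock_victim(clock, hand=0, now=10.0, tau=5.0))   # (3, [1])
```

In the example page 0 is spared by its R bit, page 1 is written out, page 2 is young, and page 3 (clean and old) becomes the victim.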

An alternative treatment of WSClock, including more details of its interaction with the I/O subsystem, can be found here.

3.4.10 Summary of Page Replacement Algorithms

Algorithm	Comment
Random	Poor, used for comparison
Optimal	Unimplementable, used for comparison
NRU	Crude
FIFO	Not good; ignores frequency of use
Second Chance	Improvement over FIFO
Clock	Better implementation of Second Chance
LIFO	Horrible, useless
LRU	Great but impractical
NFU	Crude LRU approximation
Aging	Better LRU approximation
Working Set	Good, but expensive
WSClock	Good approximation to working set

3.4.A Belady's Anomaly

Consider a system that has no pages loaded and that uses the FIFO PRA.
Consider the following reference string (sequence of pages referenced).

    0 1 2 3 0 1 4 0 1 2 3 4

If we have 3 frames this generates 9 page faults (do it).

If we have 4 frames this generates 10 page faults (do it).

Theory has been developed showing that certain PRAs (so-called stack algorithms) cannot suffer this anomaly for any reference string. FIFO is clearly not a stack algorithm. LRU is.

Repeat the above calculations for LRU.
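
The hand calculations can be checked with a short simulation. This is a sketch, not a model of any real kernel; the function names `fifo_faults` and `lru_faults` are made up for the illustration, and memory starts empty as in the text.

```python
from collections import OrderedDict, deque


def fifo_faults(refs, nframes):
    """Count page faults under FIFO replacement, starting with empty memory."""
    resident = deque()                  # oldest loaded page at the left
    faults = 0
    for page in refs:
        if page in resident:
            continue                    # a hit does not change FIFO order
        faults += 1
        if len(resident) == nframes:
            resident.popleft()          # evict the page loaded longest ago
        resident.append(page)
    return faults


def lru_faults(refs, nframes):
    """Count page faults under LRU replacement, starting with empty memory."""
    resident = OrderedDict()            # least recently used page first
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)  # a hit makes the page most recent
            continue
        faults += 1
        if len(resident) == nframes:
            resident.popitem(last=False)
        resident[page] = True
    return faults


refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
for n in (3, 4):
    print(n, "frames: FIFO", fifo_faults(refs, n), "LRU", lru_faults(refs, n))
```

For FIFO this reproduces the 9 and 10 faults stated above. For LRU it gives 10 faults with 3 frames and 8 with 4, so adding a frame does not hurt, as expected of a stack algorithm.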

3.5 Design Issues for (Demand) Paging Systems

3.5.1 Local vs Global Allocation Policies

A local PRA is one in which a victim page is chosen among the pages of the same process that requires a new frame. That is, the number of frames for each process is fixed. So LRU for a local policy means the page least recently used by this process. A global policy is one in which the choice of victim is made among all pages of all processes.

Of course we can't have a purely local policy. Why?
Answer: A new process has no pages and, even if we didn't restrict the frame needed to a local one for the first page loaded, the process would remain with only one page.
Perhaps wait until a process has been running a while before restricting it to existing frames or give the process an initial allocation of frames based on the size of the executable.

In general a global policy seems to work better. For example, consider LRU. With a local policy, the local LRU page might have been more recently used than many resident pages of other processes. A global policy needs to be coupled with a good method to decide how many frames to give to each process. By the working set principle, each process should be given |w(k,t)| frames at time t, but this value is hard to calculate exactly.

If a process is given too few frames (i.e., well below |w(k,t)|), its faulting rate will rise dramatically. If this occurs for many or all the processes, the resulting situation in which the system is doing very little useful work due to the high I/O requirements for all the page faults is called thrashing.

Page Fault Frequency (PFF)

An approximation to the working set policy that is useful for determining how many frames a process needs (but not which pages) is the Page Fault Frequency algorithm.

For each process keep track of the page fault frequency, which is the number of faults divided by the number of references.
Actually, must use a window or a weighted calculation since you are interested in the recent page fault frequency.
Actually, it is too expensive to calculate the number of references so, as above, we approximate this by the amount of (virtual) time.
If the PFF is exceptionally low, free some of this process's frames (e.g., limit victim selection to this process for a while).
If the PFF is too high, allocate more frames to this process. Either
1. Raise its number of frames if using a local policy; or
2. Bar its frames from eviction (for a while) if using a global policy.
What if there are not enough frames in the entire system? That is, what if the PFF is too high for all processes?
Answer: Reduce the MPL as we now discuss.
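
The PFF rule can be sketched as a small adjustment function. Everything numeric here is an assumption made for illustration: the thresholds `low` and `high` (in faults per second of virtual time), the step size, and the function name `pff_adjust` do not come from the notes.

```python
def pff_adjust(frames, faults, virtual_secs, low=1.0, high=20.0, step=1):
    """Return the new frame allocation for one process.

    faults        page faults taken during the most recent window
    virtual_secs  virtual (process-running) time covered by that window
    low, high     hypothetical thresholds in faults per virtual second
    """
    if virtual_secs <= 0:
        return frames                    # no running time yet, nothing to judge
    pff = faults / virtual_secs
    if pff > high:
        return frames + step             # faulting too often: give it more frames
    if pff < low:
        return max(1, frames - step)     # hardly faulting: reclaim a frame
    return frames


print(pff_adjust(10, faults=50, virtual_secs=1.0))   # 11
print(pff_adjust(10, faults=0, virtual_secs=1.0))    # 9
```

Using only the most recent window is one simple way to make the frequency "recent"; a weighted average over several windows would serve the same purpose.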

3.5.2: Load Control

To reduce the overall memory pressure, we must reduce the multiprogramming level (or install more memory while the system is running, which is not possible with current technology). That is, we have a connection between memory management and process management. These are the suspend/resume arcs we saw way back when and are shown again in the diagram on the right.

When the PFF (or another indicator) is too high, we choose a process and suspend it, thereby swapping it to disk and releasing all its frames. When the frequency gets low, we can resume one or more suspended processes. We also need a policy to decide when a suspended process should be resumed even at the cost of suspending another.

This is called medium-term scheduling. Since suspending or resuming a process can take seconds, we clearly do not perform this scheduling decision every few milliseconds as we do for short-term scheduling. A time scale of minutes would be more appropriate.

3.5.3: Page Size

Page size must be a multiple of the disk block size. Why?
Answer: When copying out a page if you have a partial disk block, you must do a read/modify/write (i.e., 2 I/Os).

Characteristics of a large page size.

Good for demand paging I/O:
We will learn later this term that the total time for performing 8 I/O operations each of size 1KB is much larger than the time for a single 8KB I/O. Hence it is better to swap in/out one big page than several small pages.
But if the page is too big you will be swapping in data that are not local and hence might well not be used.
Large internal fragmentation (1/2 page size).
Small page table (process size / page size * size of PTE).
These last two can be analyzed together by setting the derivative of the sum equal to 0. The minimum overhead occurs at a page size of
```
        sqrt(2 * process size * size of PTE)
```
Since the term inside the sqrt is typically in the millions (a process size of megabytes times a PTE size of a few bytes), we see that the modern practice of having the page size a few kilobytes is near the minimum point.
A very large page size leads to very few pages. A process will have many faults if it references more regions than the number of (large) frames that the process has been allocated.
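
The trade-off between page table size and internal fragmentation can be checked numerically. The sketch below assumes a 1MB process and 8-byte PTEs, values chosen only for illustration; the function names are likewise made up.

```python
import math


def overhead(page_size, process_size, pte_size):
    """Bytes spent on the page table plus expected internal fragmentation."""
    return process_size * pte_size / page_size + page_size / 2


def best_page_size(process_size, pte_size):
    """Page size at which the derivative of overhead() is zero."""
    return math.sqrt(2 * process_size * pte_size)


s, e = 1 << 20, 8                     # assumed: 1MB process, 8-byte PTE
p = best_page_size(s, e)
print(p)                              # 4096.0
for size in (2048, 4096, 8192):
    print(size, overhead(size, s, e))
```

With these numbers the minimum falls at 4KB, where the overhead is 4096 bytes; halving or doubling the page size raises it to 5120 bytes.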

A small page size has the opposite characteristics.

Homework: Consider a 32-bit address machine using paging with 8KB pages and 4 byte PTEs. How many bits are used for the offset and what is the size of the largest page table? Repeat the question for 128KB pages.

3.5.4: Separate Instruction and Data (I and D) Spaces

This was used when machines had very small virtual address spaces. Specifically the PDP-11, with 16-bit addresses, could address only 2^16 bytes or 64KB, a severe limitation. With separate I and D spaces there could be 64KB of instructions and 64KB of data.

Separate I and D are no longer needed with modern architectures having large address spaces.

3.5.5 Shared pages

Permit several processes to each have the same page loaded in the same frame. Of course this can only be done if the processes are using the same program and/or data.

Really should share segments.
Must keep reference counts or something so that, when a process terminates, pages it shares with another process are not automatically discarded.
Similarly, a reference count would make a widely shared page (correctly) look like a poor choice for a victim.
A good place to store the reference count would be in a structure pointed to by both PTEs. If stored in the PTEs themselves, we must somehow keep the count consistent between processes.
If you want the pages to be initially shared for reading but want each process's updates to be private, then use so called copy on write techniques.
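
The reference-count idea can be sketched as follows, with the count stored in a structure that every sharing PTE points to. The class and function names (`SharedFrame`, `PTE`, `unmap`) are hypothetical and the sketch ignores locking.

```python
class SharedFrame:
    """One physical frame plus the number of PTEs that currently map it."""

    def __init__(self, frame_no):
        self.frame_no = frame_no
        self.refcount = 0


class PTE:
    """A page table entry that points at the shared frame descriptor."""

    def __init__(self, shared):
        self.shared = shared
        shared.refcount += 1             # one more process maps this frame


def unmap(pte, free_frames):
    """Drop one mapping; free the frame only when no other PTE maps it."""
    pte.shared.refcount -= 1
    if pte.shared.refcount == 0:
        free_frames.append(pte.shared.frame_no)
    pte.shared = None


frame = SharedFrame(7)
a, b = PTE(frame), PTE(frame)            # two processes share frame 7
free = []
unmap(a, free)
print(free)                              # [] : process b still uses the frame
unmap(b, free)
print(free)                              # [7]
```

Because the count lives in one place, there is nothing to keep consistent between the two page tables.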

Homework: Can a page shared between two processes be read-only for one process and read-write for the other?

3.5.6 Shared Libraries (Dynamic-Linking)

In addition to sharing individual pages, processes can share entire library routines. The technique used is called dynamic linking and the objects produced are called shared libraries or dynamically-linked libraries (DLLs). (The traditional linking you did in lab1 is today often called static linking.)

With dynamic linking, frequently used routines are not linked into the program. Instead, just a stub is linked.
When the routine is called (or when the process begins), the stub checks to see if the real routine has been loaded by another program.
- If it has not been loaded, load it (really page it in as needed).
- If it is already loaded, share it. The read-write data must be shared copy-on-write.
Advantages of dynamic linking.
- Saves RAM: Only one copy of a routine is in memory even when it is used concurrently by many processes. For example even a big server with hundreds of active processes will have only one copy of printf in memory. (In fact with demand paging only part of the routine will be in memory.)
- Saves disk space: Files containing executable programs no longer contain copies of the shared libraries.
- A bug fix to a dynamically linked library fixes all applications that use that library, without having to relink these applications.
Disadvantages of dynamic linking.
- New bugs in a dynamically linked library infect all applications.
- Applications change even when they haven't changed.
A Technical Difficulty with dynamic linking. The shared library has different virtual addresses in each process so addresses relative to the beginning of the module cannot be used (they would need to be relocated to different addresses in the multiple copies of the module). Instead position-independent code must be used. For example, jumps within the module would use PC-relative addresses.

3.5.7: Mapped Files

The idea of memory-mapped files is to use the mechanisms in place for demand paging (and segmentation, if present) to implement I/O.

A system call is used to map a file into a portion of the address space. (No page can be part of a file and part of regular memory; the mapped file would be a complete segment if segmentation is present).

The implementation of demand paging we have presented assumes that the entire process is stored on disk. This portion of secondary storage is called the backing store for the pages. Sometimes it is called a paging disk. For memory-mapped files, the file itself is the backing store.

Once the file is mapped into memory, reads and writes become loads and stores.
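
As an illustration, Python's standard `mmap` module exposes this facility. The sketch below maps a small temporary file and changes it with an ordinary slice assignment instead of an explicit write call; the helper name `uppercase_in_place` and the file contents are made up for the example.

```python
import mmap
import os
import tempfile


def uppercase_in_place(path):
    """Map the whole file and modify it with ordinary memory operations."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the entire file
            m[:] = m[:].upper()               # a load followed by a store
            m.flush()                         # push dirty pages to the file


fd, path = tempfile.mkstemp()
os.write(fd, b"hello, mapped file\n")
os.close(fd)
uppercase_in_place(path)
with open(path, "rb") as f:
    print(f.read())                           # b'HELLO, MAPPED FILE\n'
os.remove(path)
```

The file itself is the backing store here: the change reaches the file without any explicit write to it.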

3.5.8: Cleaning Policy (Paging Daemons)

Done earlier

The only point to add is that, now that we know replacement algorithms, one can suggest an implementation. If a clock-like algorithm is used for victim selection, one can have a two-handed clock with one hand (the paging daemon) staying ahead of the other (the one invoked by the need for a free frame).

The front hand simply writes out any page it hits that is dirty and thus the trailing hand is likely to see clean pages and hence is more quickly able to find a suitable victim.

Unless specifically requested, you may ignore paging daemons when answering exam questions.

3.5.9: Virtual Memory Interface

Skipped.

3.6: Implementation Issues

3.6.1: Operating System Involvement with Paging

When must the operating system be involved with paging?

During process creation. The OS must guess at the size of the process and then allocate a page table and a region on disk to hold the pages that are not memory resident. A few pages of the process must be loaded.
The Ready→Running transition. Real memory must be allocated for the page table if the table has been swapped out (which is permitted when the process is not running).
Some hardware register(s) must be set to point to the page table. There can be many page tables resident, but the hardware must be told the location of the page table for the running process—the active page table.
The TLB must be cleared (unless it contains a process id field).
Processing a page fault. Lots of work is needed; see 3.6.2 just below.
Process termination. Free the page table and the disk region for swapped out pages.

3.6.2 Page Fault Handling

What happens when a process, say process A, gets a page fault? Compare the following with the processing for a trap command and for an interrupt.

The hardware detects the fault and traps to the kernel (switches to supervisor mode and saves state).
Some assembly language code saves more state, establishes the C-language (or another programming language) environment, and calls the OS.
The OS determines that a page fault occurred and which page was referenced.
If the virtual address is invalid, process A is killed. If the virtual address is valid, the OS must find a free frame. If there are no free frames, the OS selects a victim frame. (Really, the paging daemon does this prior to the fault occurring, but it is easier to pretend that it is done here.) Call the process owning the victim frame, process B. (If the page replacement algorithm is local, then B=A.)
The PTE of the victim page is updated to show that the page is no longer resident.
If the victim page is dirty, the OS schedules an I/O write to copy the frame to disk and blocks A waiting for this I/O to occur.
Assuming process A needed to be blocked (i.e., the victim page is dirty) the scheduler is invoked to perform a context switch.
- Tanenbaum forgot some here.
- The process selected by the scheduler (say process C) runs.
- Perhaps C is preempted for D or perhaps C blocks and D runs and then perhaps D is blocked and E runs, etc.
- When the I/O to write the victim frame completes, a disk interrupt occurs. Assume process C is running at the time.
- Hardware trap / assembly code / OS determines I/O done.
- The scheduler marks A as ready.
- The scheduler picks a process to run, maybe A, maybe B, maybe C, maybe another process.
- At some point the scheduler does pick process A to run. Recall that at this point A is still executing OS code.
Now the O/S has a free frame (this may be much later in wall clock time if a victim frame had to be written). The O/S schedules an I/O to read the desired page into this free frame. Process A is blocked (perhaps for the second time) and hence the process scheduler is invoked to perform a context switch.
Again, another process is selected by the scheduler as above and eventually a disk interrupt occurs when the I/O completes (trap / asm / OS determines I/O done). The PTE in process A is updated to indicate that the page is in memory.
The O/S may need to fix up process A (e.g., reset the program counter to re-execute the instruction that caused the page fault).
Process A is placed on the ready list and eventually is chosen by the scheduler to run. Recall that process A is executing O/S code.
The OS returns to the first assembly language routine.
The assembly language routine restores registers, etc. and returns to user mode.

The user's program running as process A is unaware that all this happened (except for the time delay).
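
The bookkeeping part of these steps (not the trapping, blocking, or scheduling) can be sketched as below. It assumes a single process with a local policy (so B=A), a page table held as a dictionary, and a caller-supplied victim chooser; all names are hypothetical. The function returns the disk operations the fault requires, each of which is a point where process A would block.

```python
class InvalidAddress(Exception):
    """Raised where a real OS would kill the faulting process."""


def handle_fault(page, page_table, free_frames, pick_victim):
    """Return the disk operations needed to make `page` resident.

    page_table maps page -> {"frame": int or None, "dirty": bool};
    a missing key means the virtual address is invalid.
    """
    if page not in page_table:
        raise InvalidAddress(page)
    disk_ops = []
    if free_frames:
        frame = free_frames.pop()
    else:
        victim = pick_victim(page_table)        # replacement policy decides
        entry = page_table[victim]
        frame = entry["frame"]
        entry["frame"] = None                   # victim is no longer resident
        if entry["dirty"]:
            disk_ops.append(("write", victim))  # A blocks until this completes
            entry["dirty"] = False
    disk_ops.append(("read", page))             # A blocks (again) for the read
    page_table[page]["frame"] = frame           # page is now in memory
    page_table[page]["dirty"] = False
    return disk_ops


table = {0: {"frame": 0, "dirty": True}, 1: {"frame": None, "dirty": False}}
print(handle_fault(1, table, [], lambda pt: 0))   # [('write', 0), ('read', 1)]
```

A dirty victim costs two I/Os and a clean one costs only the read, which is why a paging daemon that cleans pages ahead of time shortens fault handling.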

3.6.3 Instruction Backup

A cute horror story. The hardware support for page faults in the original Motorola 68000 (the first microprocessor with a large address space) was so bad that an early demand paging system for the 68000 used two processors, one running one instruction behind the other. If the first got a page fault, there wasn't always enough information to figure out what to do (for example, did a register pre-increment occur?), so the system switched to the second processor after bringing in the faulting page. The next generation machine, the 68010, provided extra information on the stack so the horrible 2-processor kludge was no longer necessary.

Don't worry about instruction backup; it is very machine dependent and modern implementations tend to get it right.