Start Lecture #1
I start with chapter -1 so that when we get to chapter 1, the numbering will agree with the text.
There is a web site for the course. You can find it from my home page, which is listed above.
The "Start Lecture #1" marker above can be thought of as "End Lecture #0".
The course text is Tanenbaum, "Modern Operating Systems", Fourth
Edition (4e).
We will cover nearly all of the first six chapters, plus some
material from later chapters.
I once used a book by Finkel, which is now out of print, but is available on the web.
Use Reply to contribute to the current thread, but NOT to start another topic.
Grades are based on the labs and exams; the weighting will be
approximately
25%*LabAverage + 30%*MidtermExam + 45%*FinalExam
(but see homeworks below).
I use the upper left board for lab/homework assignments and announcements. I should never erase that board. Viewed as a file it is group readable (the group is those in the room), appendable by just me, and (re-)writable by no one. If you see me start to erase an announcement, please let me know.
I try very hard to remember to write all announcements on the upper left board and I am normally successful. If, during class, you see that I have forgotten to record something, please let me know. HOWEVER, if I forgot and no one reminds me, the assignment has still been given.
I make a distinction between homeworks and labs.
Labs are
Homeworks are
Homeworks are numbered by the class in which they are assigned. So any homework given today is homework #1. Even if I do not give homework today, the homework assigned next class will be homework #2. Unless I explicitly state otherwise, all homework assignments can be found in the class notes. So the homework present in the notes for lecture #n is homework #n (even if I inadvertently forgot to write it to the upper left board).
You may develop (i.e., write and test) lab assignments on any system you wish, e.g., your laptop. However, ...
NYU Classes.
I feel it is important for CS majors to be familiar with basic client-server computing (related to cloud computing) in which one develops on a client machine (for us, most likely one's personal laptop), but runs on a remote server (for us, mauler.cims.nyu.edu).
This requires three steps.
I have supposedly given you each an account on mauler (and access), which takes care of step 1. Accessing mauler and access is different for different client (laptop) operating systems.
If you receive a message from mauler about an authentication failure, please follow the advice below from the systems group.
The first line of defense in all cases of authentication failure is to attempt a password reset. Please visit https://cims.nyu.edu/webapps/password/reset to do so. Within 15 minutes of a password reset submission, instructions to retrieve the new password will be sent to xyz123@nyu.edu. Please e-mail helpdesk@cims.nyu.edu in the event that the password reset either fails, or that the new password does not work (be sure to preface your ssh command with your username, e.g. ssh xyz123@access.cims.nyu.edu).
Good methods for obtaining help include
You may write your lab in Java, C, or C++.
Incomplete
The rules for incompletes and grade changes are set by the school and not the department or individual faculty member.
The rules set by CAS can be found here. They state:
The grade of I (Incomplete) is a temporary grade that indicates that the student has, for good reason, not completed all of the course work but that there is the possibility that the student will eventually pass the course when all of the requirements have been completed. A student must ask the instructor for a grade of I, present documented evidence of illness or the equivalent, and clarify the remaining course requirements with the instructor.
The incomplete grade is not awarded automatically. It is not used when there is no possibility that the student will eventually pass the course. If the course work is not completed after the statutory time for making up incompletes has elapsed, the temporary grade of I shall become an F and will be computed in the student's grade point average.
All work missed in the fall term must be made up by the end of the following spring term. All work missed in the spring term or in a summer session must be made up by the end of the following fall term. Students who are out of attendance in the semester following the one in which the course was taken have one year to complete the work. Students should contact the College Advising Center for an Extension of Incomplete Form, which must be approved by the instructor. Extensions of these time limits are rarely granted.
Once a final (i.e., non-incomplete) grade has been submitted by the instructor and recorded on the transcript, the final grade cannot be changed by turning in additional course work.
This email from the assistant director describes the departmental policy.
Dear faculty,
The vast majority of our students comply with the department's academic integrity policies; see
www.cs.nyu.edu/web/Academic/Undergrad/academic_integrity.html
www.cs.nyu.edu/web/Academic/Graduate/academic_integrity.html
Unfortunately, every semester we discover incidents in which students copy programming assignments from those of other students, making minor modifications so that the submitted programs are extremely similar but not identical.
To help in identifying inappropriate similarities, we suggest that you and your TAs consider using Moss, a system that automatically determines similarities between programs in several languages, including C, C++, and Java. For more information about Moss, see: http://theory.stanford.edu/~aiken/moss/
Feel free to tell your students in advance that you will be using this software or any other system. And please emphasize, preferably in class, the importance of academic integrity.
Rosemary Amico
Assistant Director, Computer Science
Courant Institute of Mathematical Sciences
The university-wide policy is described here
Remark: For Fall 2016 the final exam is Tuesday, 20 December at 2PM in CIWW 109. Check out the official list
Originally called a linkage editor by IBM.
A linker is an example of a utility program included with an operating system distribution. Like a compiler, the linker is not part of the operating system per se, i.e., it does not run in supervisor mode. Unlike a compiler it is OS dependent (what object/load file format is used) and is not (normally) language dependent.
Link, of course.
When the compiler and assembler have finished processing a module, they produce an object module that is almost runnable. There are two remaining tasks to be accomplished before object modules can be run. Both are involved with linking (that word, again) together multiple object modules. The tasks are relocating relative addresses and resolving external references; each is described just below.
The output of a linker is called a load module because, with relative addresses relocated and the external addresses resolved, the module is ready to be loaded and run.
The compiler and assembler treat each module as if it will be loaded at location zero. For example, the machine instruction
jump 120
is used to indicate a jump to location 120 of the current module.
To convert this relative address to an absolute address, the linker adds the base address of the module to the relative address. The base address is the address at which this module will be loaded.
For example, assume a module is to be loaded starting at location
2300 and contains the above instruction
jump 120
The linker changes this instruction to
jump 2420
How does the linker know that the module is to be loaded starting at location 2300?
If a C (or Java, or Pascal, or Ada, etc) program contains a function call f(x) to a function f() that is compiled separately, the resulting object module must contain some kind of jump to the beginning of f.
To see how a linker works, let's consider the following example, which is the first dataset from lab #1. The description in lab1 is more detailed.
The target machine is word addressable and each word consists of 4 decimal digits. The first (leftmost) digit is the opcode and the remaining three digits form an address.
Each object module contains three parts, a definition list, a use list, and the program text itself. Each definition is a pair (sym, loc) signifying that this symbol is defined at this location.
Each use is a symbol that will be pointed to by the program text.
The program text consists of a count N followed by N 5-digit numbers, where the first four digits are an instruction word as described above and the last digit gives the type of the address component, 1=immediate, 2=absolute, 3=relative, and 4=external.
The actions taken by the linker depend on the type of the address, as we now illustrate. Consider the first input set from the lab.
Input set #1
4
1 xy 2
2 z xy
5 10043 56781 20004 80023 70014
0
1 z
6 80013 10004 10004 30004 10023 10102
0
1 z
2 50013 40004
1 z 2
2 xy z
3 80002 10014 20004
The first pass simply finds the base address of each module and produces the symbol table giving the values for xy and z (2 and 15 respectively). The second pass does the real work using the symbol table and base addresses produced in pass one.
The resulting output (shown below) is more detailed than I expect you to produce. The detail is there to help me explain what the linker is doing. All I would expect from you is the symbol table and the rightmost column of the memory map.
Symbol Table
xy=2
z=15
Memory Map
+0
0:       10043  1004+0 = 1004
1:       56781           5678
2: xy:   20004  ->z      2015
3:       80023  8002+0 = 8002
4:       70014  ->xy     7002
+5
0        80013  8001+5 = 8006
1        10004  ->z      1015
2        10004  ->z      1015
3        30004  ->z      3015
4        10023  1002+5 = 1007
5        10102           1010
+11
0        50013  5001+11= 5012
1        40004  ->z      4015
+13
0        80002           8000
1        10014  ->z      1015
2 z:     20004  ->xy     2002
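The two passes can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the lab solution: it assumes error-free input, and it assumes that the address field of an external reference is an index into the module's use list (this is consistent with the output above, but the lab description is the authority). All function names are hypothetical.

```python
IMMEDIATE, ABSOLUTE, RELATIVE, EXTERNAL = 1, 2, 3, 4

INPUT_SET_1 = """4
1 xy 2   2 z xy   5 10043 56781 20004 80023 70014
0        1 z      6 80013 10004 10004 30004 10023 10102
0        1 z      2 50013 40004
1 z 2    2 xy z   3 80002 10014 20004"""


def parse_modules(text):
    """Split the input into (definitions, uses, words) triples, one per module."""
    tokens = iter(text.split())
    modules = []
    for _ in range(int(next(tokens))):
        defs = [(next(tokens), int(next(tokens))) for _ in range(int(next(tokens)))]
        uses = [next(tokens) for _ in range(int(next(tokens)))]
        words = [next(tokens) for _ in range(int(next(tokens)))]
        modules.append((defs, uses, words))
    return modules


def pass_one(modules):
    """Find each module's base address and build the symbol table."""
    symbols, bases, base = {}, [], 0
    for defs, _uses, words in modules:
        bases.append(base)
        for sym, loc in defs:
            symbols[sym] = base + loc      # relocate the definition
        base += len(words)
    return symbols, bases


def pass_two(modules, symbols, bases):
    """Relocate relative addresses and resolve external references."""
    memory = []
    for (_defs, uses, words), base in zip(modules, bases):
        for word in words:
            opcode, addr, kind = int(word[0]), int(word[1:4]), int(word[4])
            if kind == RELATIVE:
                addr += base
            elif kind == EXTERNAL:
                # Assumption: addr is an index into this module's use list.
                addr = symbols[uses[addr]]
            # IMMEDIATE and ABSOLUTE addresses are left unchanged.
            memory.append(opcode * 1000 + addr)
    return memory


def link(text):
    modules = parse_modules(text)
    symbols, bases = pass_one(modules)
    return symbols, pass_two(modules, symbols, bases)


if __name__ == "__main__":
    symbols, memory = link(INPUT_SET_1)
    print(symbols)
    for address, word in enumerate(memory):
        print(f"{address:2}: {word}")
```

Run on input set #1, the sketch reproduces the symbol table and the rightmost column of the memory map shown above.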
Note: It is faster (less I/O) to do a one pass approach, but is harder since you need fix-up code whenever a use occurs in a module that precedes the module with the definition.
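One possible shape for such fix-up code is sketched below in Python. It is only an illustration: it makes the same assumptions as the two-pass description (error-free input, and an external address field that indexes the module's use list), and the name link_one_pass and the fixups list are hypothetical. A forward reference is emitted with a placeholder address of 0 and patched once all modules have been read.

```python
INPUT_SET_1 = """4
1 xy 2   2 z xy   5 10043 56781 20004 80023 70014
0        1 z      6 80013 10004 10004 30004 10023 10102
0        1 z      2 50013 40004
1 z 2    2 xy z   3 80002 10014 20004"""


def link_one_pass(text):
    """Link in a single pass over the input, patching forward references last."""
    tokens = iter(text.split())
    symbols = {}   # symbol -> absolute address
    memory = []    # resolved words, indexed by absolute address
    fixups = []    # (index into memory, symbol) for symbols not yet defined
    for _ in range(int(next(tokens))):
        base = len(memory)                       # this module's base address
        for _ in range(int(next(tokens))):       # definition list
            sym = next(tokens)
            symbols[sym] = base + int(next(tokens))
        uses = [next(tokens) for _ in range(int(next(tokens)))]
        for _ in range(int(next(tokens))):       # program text
            word = next(tokens)
            opcode, addr, kind = int(word[0]), int(word[1:4]), int(word[4])
            if kind == 3:                        # relative: relocate
                addr += base
            elif kind == 4:                      # external: resolve or defer
                sym = uses[addr]
                if sym in symbols:
                    addr = symbols[sym]
                else:
                    fixups.append((len(memory), sym))
                    addr = 0                     # placeholder, patched below
            memory.append(opcode * 1000 + addr)
    for index, sym in fixups:                    # the fix-up step
        memory[index] += symbols[sym]
    return symbols, memory, fixups


if __name__ == "__main__":
    symbols, memory, fixups = link_one_pass(INPUT_SET_1)
    print(symbols, memory, len(fixups))
```

On input set #1 every use of z before the last module needs a fix-up, whereas xy never does because it is defined in the first module.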
Historical note: The linker on unix was mistakenly called ld (for loader), which is unfortunate since it links but does not load.
Unix was originally developed at Bell Labs; the seventh edition of unix was made publicly available (perhaps earlier ones were somewhat available). The 7th ed man page for ld begins (see http://cm.bell-labs.com/7thEdMan).
.TH LD 1
.SH NAME
ld \- loader
.SH SYNOPSIS
.B ld
[ option ] file ...
.SH DESCRIPTION
.I Ld
combines several object programs into one, resolves external references, and searches libraries.
By the mid 80s the Berkeley version (4.3BSD) man page referred to ld as "link editor" and this more accurate name is now standard in unix/linux distributions.
During the 2004-05 fall semester a student wrote to me:
BTW - I have meant to tell you that I know the lady who wrote ld. She told me that they called it loader, because they just really didn't have a good idea of what it was going to be at the time.
Lab #1: Implement a two-pass linker. See the class home page and NYU Classes for details.
Remark: Halley Young (acm) visit.
Homework: Read Chapter 1 (Introduction)
Software is often implemented in layers (so is hardware, but that is not the subject of this course). The higher layers use the facilities provided by lower layers.
Alternatively said, the upper layers are written using a more powerful and more abstract virtual machine than the lower layers.
In yet other words, each layer is written as though it runs on the virtual machine supplied by the lower layers and in turn provides a more abstract (pleasant) virtual machine for the higher layers to run on.
Using a broad brush, the layers are.
An important distinction is that the kernel runs in privileged/kernel/supervisor mode; whereas your programs, as well as compilers, editors, shell, linkers, browsers, etc. run in user mode. This means that the OS can access all the hardware and execute all possible instructions.
In contrast user programs cannot directly modify the hardware; for example I/O instructions are normally privileged. So the programs you and I write cannot perform I/O; but they can ask the OS to perform the I/O for them.
The kernel is itself normally layered, e.g.
The machine independent I/O layer is written assuming virtual (i.e. idealized) hardware. For example, the machine independent I/O portion can access a certain byte in a given file. In reality, I/O devices, e.g., disks, have no support or knowledge of files; these devices support only blocks. Lower levels of the software implement files in terms of blocks.
Often the machine independent part is itself more than one layer.
The term Operating System is not well defined. Is it just the kernel, i.e., the portion run in supervisor mode? How about the libraries? The utilities? All these are certainly system software but it is not clear how much is part of the OS.
Start Lecture #2
As mentioned above, the OS raises the abstraction level by providing a higher level virtual machine. A second related key objective for the OS is to manage the resources provided by this virtual machine.
The kernel itself raises the level of abstraction and hides details. For example a user (of the kernel) can write() to a file (a concept not present in hardware) without knowing whether the file resides on a solid-state disk (SSD), an internal SCSI disk, or a local USB disk. The user can also ignore issues such as whether the file is stored contiguously or is broken into blocks.
Well designed abstractions are a key to managing complexity.
The kernel must manage the resources to resolve conflicts between users. Note that when we say users, we are not referring directly to humans, but instead to processes running on behalf of humans.
Often the resource is shared or multiplexed between the users. This can take the form of time-multiplexing, where the users take turns (e.g., the processor resource) or space-multiplexing, where each user gets a part of the resource (e.g., a disk drive).
With sharing come various issues such as protection, privacy, fairness, etc.
Answer: Concurrency! Per Brinch Hansen in Operating Systems Principles (Prentice Hall, 1973) writes:
The main difficulty of multiprogramming is that concurrent activities can interact in a time-dependent manner, which makes it practically impossible to locate programming errors by systematic testing. Perhaps, more than anything else, this explains the difficulty of making operating systems reliable.
Homework: 1. (Unless otherwise stated, problem numbers are from the end of the current chapter in Tanenbaum.) What are the two main functions of an operating system?
The subsection headings describe the hardware as well as the OS; we are naturally more interested in the latter. These two development paths are related as the improving hardware enabled the more advanced OS features.
One user (program; perhaps several humans) at a time. Any operating-system-like functionality that was needed was part of the user's program.
Although this time frame predates my own usage, computers without serious operating systems existed during the second generation and were then available to a wider (but still very select) audience.
I have fond memories of the Bendix G-15 (paper tape) and the IBM 1620 (cards; typewriter; decimal). During the short time you had the machine, it was truly a personal computer.
Many jobs were batched together, but the systems were still uniprogrammed: a job, once started, was run to completion without interruption and then flushed from the system.
A change from the previous generation is that the OS was not reloaded for each job and hence needed to be protected from the user's execution. As mentioned above, in the first generation, the beginning of a job contained the trivial OS-like support features used.
The batches of user jobs were prepared offline (cards to magnetic tape) using a separate computer (an IBM 1401 with a 1402 card reader/punch). The tape was brought to the main computer (an IBM 7090/7094) where the output to be printed was written on another tape. This tape went back to the service machine (1401) and was printed (on a 1403).
In my opinion Multiprogramming is the biggest change from the OS point of view. It is with multiprogramming (many processes executing concurrently) that we have the operating system fielding requests whose arrival order is non-deterministic. At this point operating systems become notoriously hard to get right due to the inability to test a significant percentage of the possible interactions and the inability to reproduce bugs on request.
Since multiple jobs are in memory at the same time, one job's memory must be protected from the other jobs.
The purpose of multiprogramming is to overlap CPU and I/O activity and thus greatly improve CPU utilization. Recall that these computers, in particular the processors, were very expensive.
efficient user of resources, but is more difficult to implement.
holes, and fragmentation arise.
With multiprogramming, one job could be loading (from say cards), another job could be printing, and a third computing. So when a card deck was submitted by the user it could be read directly into an on-disk queue. Then when the system is ready to run another job, it is already there. Similarly, jobs would "print" to disk and later another task would really print these disk files onto paper.
This is multiprogramming with rapid switching between jobs (processes) so that, to the user, it appears that their job is always running (but at a slower rate). Also, individual users "spool" their own printed output onto a remote terminal. Deciding when to switch and which process to switch to is called scheduling.
We will study scheduling when we cover processor management a few weeks from now.
MIT and Dartmouth were pioneers in timesharing. Since I went to MIT, I naturally believe MIT was first. In particular, during my second semester (Jan-May 1964), I took a course that by luck was chosen to be the first one on the MIT timesharing system CTSS. I do believe that I was in the first group of undergraduates (about 15 or 20 students I guess) to use timesharing. Tanenbaum also asserts that MIT was first, but again he was a student there (in physics).
Serious PC Operating systems such as Unix/Linux, Windows NT/2000/XP/Vista/etc and (the newer versions of) MacOS are multiprogrammed.
GUIs have become important. What is not clear is whether the GUI should be part of the kernel.
Early PC operating systems were uniprogrammed and their direct descendants lasted for quite some time (e.g., Windows ME), but now all (non-embedded) OS are multiprogrammed.
Note: I very much recommend reading all of 1.2, not for this course especially, but for general interest. Tanenbaum writes well and is my age so lived through much of the history himself.
Homework:
Primarily hardware changes.
The picture above is very simplified. (For one thing, today separate buses are used to Memory and Video.)
A bus is a set of wires that connects two or more devices. Only one message can be on the bus at a time. All the devices "receive" the message: there are no switches in between to steer the message to the desired destination, but often some of the wires form an address that indicates which devices should actually process the message.
Only at a few points will we need to understand the various processor registers such as the program counter (a.k.a. instruction pointer), the stack pointer, and the Program Status Words (PSWs). We will ignore computer design issues such as pipelining and superscalar.
Many of these issues are mentioned in 201 and nearly all of them are covered in 436, the computer architecture elective.
We do, however, need the notion of a trap, that is an instruction that atomically switches the processor into privileged mode and jumps to a pre-defined physical address. This is the key for system calls in which a user program enters the operating system. We will have much more to say about traps at the end of this chapter.
Many of the OS issues introduced by multi-processors of any flavor are also found in a uni-processor, multi-programmed system. In particular, successfully handling the concurrency offered by the second class of systems goes a long way toward preparing for the first class. The remaining multi-processor issues are not covered in this course.
We will ignore caches (which are covered in 201 and 436; you can see my online class notes for the latter, if you are interested), but we will later discuss demand paging, which is very similar. Despite their similarity, demand paging and caches use largely disjoint terminology. In both cases, the goal is to combine large, slow memory with small, fast memory to achieve the effect of large, fast memory.
The central memory in a system is called RAM (Random Access Memory). A key point is that it is volatile, i.e. the memory loses its data if power is turned off.
ROM (Read Only Memory) is used for (low-level control) software that often comes with devices on general purpose computers, and for the entire software system on non-user-programmable devices such as microwaves and dumb wristwatches. It is also used for non-changing data. A modern, familiar ROM is CD-ROM (or the denser DVD, or the even denser Blu-ray). ROM is non-volatile.
But often this unchangeable data needs to be changed (e.g., to fix bugs). This gives rise first to PROM (Programmable ROM), which, like a CD-R, can be written once (as opposed to being mass produced already written like a CD-ROM), and then to EPROM (Erasable PROM), which is like a CD-RW. Early EPROMs needed UV light for erasure; EEPROM (Electrically EPROM, or Flash RAM) can be erased by normal circuitry, which is much more convenient.
As mentioned above when discussing OS/MFT and OS/MVT, multiprogramming requires that we protect one process from another. That is, we need to translate the virtual addresses (a virtual address is the address as written in the program) into physical addresses (a physical address is the actual memory address in the computer) such that, at any point in time, the physical addresses of the different processes are disjoint. The hardware that performs this translation is called the MMU or Memory Management Unit. (There are occasions when two processes wish to share memory.)
Note the similarity between (1) translating virtual to physical addresses by the OS and (2) relocating relative addresses (into absolute addresses) in your lab 1 linker.
When context switching from one process to another, the translation must change, which can be an expensive operation.
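To make the analogy with relocation concrete, here is a minimal Python sketch of about the simplest translation scheme imaginable: a base register plus a limit check. It is an assumption made purely for illustration; the notes do not say which scheme is used, real MMUs are implemented in hardware, and the class name BaseLimitMMU is hypothetical.

```python
class BaseLimitMMU:
    """Toy model of address translation with a base and a limit register."""

    def __init__(self, base, limit):
        self.base = base      # physical address where the process is loaded
        self.limit = limit    # size of the process's virtual address space

    def translate(self, virtual_address):
        """Map a virtual address to a physical one, rejecting stray accesses."""
        if not 0 <= virtual_address < self.limit:
            raise ValueError(f"virtual address {virtual_address} is out of range")
        return self.base + virtual_address


if __name__ == "__main__":
    # Both processes use virtual address 120, yet touch different memory.
    process_a = BaseLimitMMU(base=2300, limit=1000)
    process_b = BaseLimitMMU(base=5000, limit=1000)
    print(process_a.translate(120), process_b.translate(120))
```

In this toy model a context switch amounts to loading a different (base, limit) pair, and the limit check is what keeps one process out of another's memory.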
When we do I/O for real, I will show a real disk opened up and illustrate the components.
Devices are often quite difficult to manage and a separate computer, called a controller, is used to translate OS commands into what the device requires.
This is flash RAM organized in sector-like blocks as is a disk. Unlike RAM, SSD is non-volatile; unlike a disk it has no moving parts (and so is faster). The blocks can be written a large number of times. However, the "large number" is not large enough to be completely ignored.
The bottom of the memory hierarchy, tapes have large capacities, tiny cost per byte, and very long access times. Tapes are becoming less important since their technology improvement has not kept up with the improvement in disks. We will not study tapes in this course.
In addition to the disks and tapes just mentioned, I/O devices include monitors (and graphics controllers), NICs (Network Interface Controllers), Modems, Keyboards, Mice, etc.
The OS communicates with the device controller, not with the device itself. For each different controller, a corresponding device driver is included in the OS. Note that, for example, many different graphics controllers are capable of controlling a standard monitor, and hence the OS needs many graphics device drivers.
In theory any SCSI (Small Computer System Interface) controller can control any SCSI disk. In practice this is not true as SCSI gets improved to wide SCSI, ultra SCSI, etc. The newer controllers can still control the older disks and often the newer disks can run in degraded mode with an older controller.
Three methods are employed.
We discuss these alternatives more in chapter 5. In particular, we explain the last point about halving bus accesses.
On the right is a figure showing the specifications for an Intel chip set introduced in 2000. The terminology used is not standardized, e.g., hubs are often called bridges. Most likely due to their location on the diagram to the right, the Memory Controller Hub is often called the Northbridge and the I/O Controller Hub the Southbridge.
As shown on the right this chip set has two different width PCI buses. This particular chip set supplies USB. An alternative is to have a PCI USB controller.
Unlike the situation in the previous diagram with a single bus, now several pairs of components can be communicating simultaneously, giving a significant improvement in performance (and complexity).
When the power button is pressed, control starts at the BIOS, a PROM (typically flash) in the system. Control is then passed to (the tiny program stored in) the MBR (Master Boot Record), which is the first 512-byte block on the primary disk. Control then proceeds to the first block in the active partition and from there the OS is finally invoked (normally via an OS loader).
The above assumes that the boot medium selected by the BIOS was the hard disk. Other possibilities include a CD-ROM or the network.
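As a rough illustration of how the active partition can be found, the Python sketch below examines a 512-byte MBR held in memory. The layout constants are those of the classic PC MBR convention (a four-entry partition table at offset 446, a status byte of 0x80 marking the active entry, and the signature bytes 0x55 0xAA at the end); none of this detail comes from these notes. The helper make_demo_mbr and its partition values (type 0x83, start sector 2048, length 409600) are made up for the demonstration.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446      # the partition table follows the boot code
ENTRY_SIZE = 16         # four 16-byte entries
ACTIVE = 0x80           # status byte of the active (bootable) partition
SIGNATURE = b"\x55\xaa" # last two bytes of a valid MBR


def find_active_partition(mbr):
    """Return (table slot, first sector) of the active partition, or None."""
    if len(mbr) != SECTOR_SIZE or mbr[510:512] != SIGNATURE:
        raise ValueError("not a valid MBR")
    for slot in range(4):
        start = TABLE_OFFSET + slot * ENTRY_SIZE
        entry = mbr[start:start + ENTRY_SIZE]
        status = entry[0]
        # Bytes 8..11 hold the first sector of the partition (little endian).
        (first_sector,) = struct.unpack_from("<I", entry, 8)
        if status == ACTIVE:
            return slot, first_sector
    return None


def make_demo_mbr():
    """Build a fake MBR whose second table entry is marked active."""
    mbr = bytearray(SECTOR_SIZE)
    # status, CHS start (ignored), type, CHS end (ignored), first sector, length
    entry = struct.pack("<B3sB3sII", ACTIVE, b"\0\0\0", 0x83, b"\0\0\0", 2048, 409600)
    mbr[TABLE_OFFSET + ENTRY_SIZE:TABLE_OFFSET + 2 * ENTRY_SIZE] = entry
    mbr[510:512] = SIGNATURE
    return bytes(mbr)


if __name__ == "__main__":
    print(find_active_partition(make_demo_mbr()))
```

The sketch works on bytes in memory rather than reading a real disk, which normally requires administrator privileges.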
There is not much difference between mainframe, server, multiprocessor, and PC OS's. Indeed the 3e considerably softened the differences given in the 2e and this continues in the 4e. For example Unix/Linux and Windows run on all of them.
This course covers all four of those classes, which perhaps should be considered just one class.
Used in data centers, these systems offer tremendous I/O capabilities and extensive fault tolerance.
Perhaps the most important servers today are web servers. Again I/O (and network) performance are critical.
A multiprocessor (as opposed to a multi-computer or multiple computers or computer network or grid) means multiple processors sharing memory and controlled by a single instance of the OS, which typically can run on any of the processors. Often it can run on several simultaneously.
Multiprocessors existed almost from the beginning of the computer age, but now are not exotic. Indeed, even my current laptop is a multiprocessor.
The operating system(s) controlling a system of multiple computers often are classified as either a Network OS or a Distributed OS. The former is basically a collection of ordinary PCs on a LAN that use the network facilities available on PC operating systems. Some extra utilities are often present to ease running jobs on remote processors.
A Distributed OS is a more sophisticated and "seamless" version of the above where the boundaries between the processors are made nearly invisible to users (except for performance).
This subject is not part of our course (but often is covered in 2251).
In the recent past some operating systems (e.g., ME) were claimed to be tailored to client operation. Others felt that they were restricted to client operation. This seems to be gone now; a modern PC OS is fully functional. I guess for marketing reasons some of the functionality can be disabled.
This includes phones.
The only real difference between this class and the above is the restriction to very modest memory and very low power. However, the "very modest memory" keeps getting bigger and some phones now include a stripped-down linux.
The OS is part of the device, e.g., microwave ovens and cardiac monitors. The OS is on a ROM so is not changed.
Since no user code is run, protection is not as important. In that respect the OS is similar to the very earliest computers. Embedded OS are very important commercially, but not covered much in this course.
These are embedded systems that also contain sensors and communication devices so that the systems in an area can cooperate.
As the name suggests, time (more accurately timeliness) is an important consideration. There are two classes: soft vs hard real time. In the latter, missing a deadline is a fatal error—sometimes literally. Very important commercially, but not covered much in this course.
Very limited in power (both meanings of the word).
Note: Professor Marsha Berger, the department's director of undergraduate studies, personally told me to mention in my classes that in a recent academic year she had to refer 45 students to the dean for problems with academic integrity. As I mentioned before, the department is very serious about this subject.
This will be very brief. Much of the rest of the course will consist of "filling in the details".
A process is a program in execution. If you run the same program twice, you have created two processes. For example if you have two programs compiling in two windows, each instance of the compiler is a separate process.
Often one distinguishes the state or context of a process (its address space, which is roughly its memory image, its open files, etc.) from the thread of control. If one has many threads running in the same task, the result is a multithreaded process.
The OS keeps information about all processes in the process table. Indeed, the OS views the process as the entry. This is an example of an active entity being viewed as a data structure (cf. discrete event simulations), an observation I first encountered in the (out of print) OS textbook by Finkel I mentioned previously.
The data contained in a process table entry has many uses. For example, it enables a process that is currently blocked or suspended to resume execution in the future.
The set of processes forms a tree via the fork system call. The forker is the parent of the forkee, which is called a child. If the system always blocks the parent until the child finishes, the "tree" is quite simple, just a line.
However, in modern OSes, the parent is free to continue executing and in particular is free to fork again, thereby producing another child. This produces a process tree as shown on the far right.
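The following Python sketch shows a parent forking several children before waiting for any of them, using the os.fork and os.waitpid wrappers around the corresponding Unix system calls. It is a minimal illustration that runs only on Unix-like systems; the function name fork_children and the choice of exit codes are arbitrary.

```python
import os


def fork_children(count):
    """Fork `count` children from one parent and collect their exit codes."""
    child_pids = []
    for i in range(count):
        pid = os.fork()
        if pid == 0:
            # In the child: fork returned 0. Exit at once, using the
            # child's index as its exit code.
            os._exit(i)
        # In the parent: fork returned the child's pid. The parent is not
        # blocked, so it goes around the loop and forks again.
        child_pids.append(pid)
    exit_codes = []
    for pid in child_pids:
        _, status = os.waitpid(pid, 0)        # now wait for each child
        exit_codes.append(os.WEXITSTATUS(status))
    return child_pids, exit_codes


if __name__ == "__main__":
    pids, codes = fork_children(3)
    print("parent", os.getpid(), "had children", pids, "with exit codes", codes)
```

Here the tree has depth one: a single parent with three children. Had each child forked in turn, the tree would have grown deeper.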
A process can send a signal to another process to cause the latter to execute a predefined function (the signal handler). It can be tricky to write a program with a signal handler since the programmer does not know when in the "mainline" program the signal handler will be invoked.
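A small Python sketch of installing and triggering a handler follows. To stay self-contained, the process sends the signal to itself rather than to another process; the choice of SIGUSR1 and the name demo_signal are arbitrary, and the code assumes a Unix-like system and must run in the main thread (a restriction of Python's signal module).

```python
import os
import signal
import time


def demo_signal():
    """Install a handler for SIGUSR1, signal ourselves, and report what ran."""
    received = []

    def handler(signum, frame):
        # Runs asynchronously with respect to the mainline code below.
        received.append(signum)

    previous = signal.signal(signal.SIGUSR1, handler)
    try:
        os.kill(os.getpid(), signal.SIGUSR1)   # send the signal to ourselves
        # The mainline does not know exactly when the handler runs, so it
        # polls (for at most one second) until the handler has left its mark.
        deadline = time.monotonic() + 1.0
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        # Restore whatever handler was installed before the demonstration.
        if previous is None:
            previous = signal.SIG_DFL
        signal.signal(signal.SIGUSR1, previous)
    return received


if __name__ == "__main__":
    print(demo_signal())
```

The polling loop is exactly the kind of awkwardness mentioned above: the mainline can only observe the handler's side effects, since it cannot predict where it will be interrupted.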
Each user is assigned a User IDentification (UID) and all processes created by that user have this UID. A child has the same UID as its parent. It is sometimes possible to change the UID of a running process. A group of users can be formed and given a Group IDentification, GID. One UID is special (the superuser or administrator) and has extra privileges.
Access to files and devices can be limited to a given UID or GID.
A set of processes is deadlocked if each of the processes is blocked by a process in the set. The automotive equivalent, shown below, is called gridlock. (The photograph was sent to me by Laurent Laor, a former 2250 student.)
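The definition can be checked mechanically, as in this short Python sketch. The representation is an assumption made for illustration: blocked_by maps each process to the set of processes it is waiting for, and the names are hypothetical.

```python
def is_deadlocked(processes, blocked_by):
    """True if every process in the set is blocked by some process in the set."""
    members = set(processes)
    return bool(members) and all(
        blocked_by.get(process, set()) & members for process in members
    )


if __name__ == "__main__":
    # Gridlock: four cars at an intersection, each waiting for the next one.
    waits = {"north": {"east"}, "east": {"south"},
             "south": {"west"}, "west": {"north"}}
    print(is_deadlocked(waits.keys(), waits))
```

If any member waits only for a process outside the set, that outside process may yet run and unblock it, so the set is not (yet) deadlocked.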
Clearly, each process requires memory, but there are other issues as well. For example, your linkers (will) produce a load module that assumes the process is loaded at location 0. The result is that every load module has the same (virtual) address space. The operating system must ensure that the virtual addresses of concurrently executing processes are assigned disjoint physical memory.
For another example note that current operating systems permit each process to be given more (virtual) memory than the total amount of (real) memory on the machine.
Modern systems have a hierarchy of files. A file system tree.
looks like the parent of a:\ and c:\, but that is a feature of the UI not the file systems.
You can name a file via an absolute path starting at the root directory (or in Windows a root directory) or via a relative path starting at the current working directory. One requirement for this functionality is that the OS must know the current working directory for each process.
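The role of the current working directory can be sketched in a few lines of Python. The function resolve is hypothetical and uses Unix-style paths via the standard posixpath module; it works purely on the text of the path and ignores complications such as symbolic links.

```python
import posixpath


def resolve(path, cwd):
    """Turn a path into an absolute one, as seen by a process with this cwd."""
    if posixpath.isabs(path):
        return posixpath.normpath(path)             # already starts at the root
    return posixpath.normpath(posixpath.join(cwd, path))


if __name__ == "__main__":
    print(resolve("notes/lecture1.txt", "/home/student"))
    print(resolve("../labs/lab1.c", "/home/student/os"))
    print(resolve("/etc/passwd", "/home/student"))
```

The same relative path names different files for processes with different working directories, which is why the OS has to record the working directory per process.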
In addition to regular files and directories, Unix also uses the
file system namespace for devices (called
special files), which are typically found in
the /dev directory.
That is, in some ways you can treat the device as a file.
In particular some utilities that are normally applied to (ordinary)
files can be applied as well to some special files.
For example, when you are accessing a unix system using a mouse,
type the following command
cat /dev/mouse
and then move the mouse.
On my more modern system the command is
cat /dev/input/mice
You kill the cat (sorry) by typing Ctrl-C.
I tried this on my linux box (using a text console) and no damage occurred.
Your mileage may vary.
Start Lecture #3
Before a file can be accessed, it is normally opened and a file descriptor obtained. Subsequent I/O system calls (e.g., read and write) use the file descriptor rather than the file name. This is an optimization that enables the OS to find the file once and save the information in a file table accessed by the file descriptor.
Many systems have standard files that are automatically made available to a process upon startup. These (initial) file descriptors are fixed.
A convenience offered by some command interpreters is a pipe or pipeline. For example the following command, which pipes the output of dir into a character/word/line counter, will give the number of files in the directory.
dir | wc -w
There are a wide variety of I/O devices that the OS must manage. Some of these require special treatment; for example, if two processes are printing at the same time, the OS must not interleave the output.
The OS contains device specific code (drivers) for each device (really each controller) as well as device-independent I/O code.
Files and directories have associated permissions.
Files can have other attributes as well. For example, the Linux ext2/3/4 file systems support a d attribute that is a hint to the dump program not to back up this file.
Memory assigned to a process, i.e., an address space, must also be protected so that unrelated processes do not read and write each other's memory.
Security has of course sadly become a very serious concern. The topic is quite deep mathematically and I do not feel that the necessarily superficial coverage that time would permit is useful so we are not covering the topic at all.
The shell presents the command line interface to the operating system and offers several convenient features.
One such feature is pipelining (e.g., dir | wc).
Instead of a shell, one can have a more graphical interface.
Some concepts become obsolete and then reemerge due in both cases to technology changes. Several examples follow. Perhaps the cycle will repeat with smart card OS.
The use of assembly languages greatly decreases when memories get larger. When minicomputers and microcomputers (early PCs) were first introduced, they each had small memories and for a while assembly language again became popular.
Multiprogramming requires protection hardware. Once the hardware becomes available monoprogramming becomes obsolete. Again when minicomputers and microcomputers were introduced, they had no such hardware so monoprogramming revived.
When disks are small, they hold few files and a flat (single directory) file system is adequate. Once disks get large a hierarchical file system is necessary. When minicomputers and microcomputers were introduced, they had tiny disks and the corresponding file systems were flat.
Virtual memory, discussed in great detail later, permits a single program to address more memory than is present in the computer (the latter is called physical memory). The ability to dynamically remap addresses also permits programs to link to libraries during runtime. Hence, when VM hardware becomes available, so does dynamic linking.
Homework: 12. Which of the following instructions should be allowed only in kernel mode?
A system call is the mechanism by which a user (i.e., a process running in user mode) directly interfaces with the OS. Some textbooks use the term envelope for the component of the OS responsible for fielding system calls and dispatching them to the appropriate component of the OS. On the right is a picture showing some of the OS components and the external events for which they are the interface.
Note that the OS serves two masters. The hardware (at the bottom) asynchronously sends interrupts and the user (at the top) synchronously invokes system calls and generates page faults.
There is an important difference between these two cases.
What happens when a user executes a system call such as read()? We show a detailed picture below, but at a high level, the following actions occur.
Before considering the read() system call, it might be good to review a typical call/return sequence for a more familiar situation, a method call in a high-level language. On the right we see the assembler-like instructions that would appear when a method invokes sin(x).
The numbers on the left represent memory locations. As in the linker lab, we assume the machine is word addressable and each instruction occupies one word. The sin() method is in words 1041-1102 and the caller is from 40-300. We are interested in the call itself, which occupies 59-61.
The problems we need to solve are first to transfer control from the caller to the callee passing the value of x and second to transfer control back returning the calculated value of sin(x).
The key data structure is a run-time stack whose changing contents we will show on the board.
A typical invocation of the (Unix) read system call is:
count = read(fd,buffer,nbytes)
This invocation reads up to nbytes from the file specified by the file descriptor fd into the character array buffer. The actual number of bytes read is returned (it might be less than nbytes if, for example, an end-of-file was encountered). In more detail, the steps performed are as follows.
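The "might be less than nbytes" point can be seen in a small Python sketch. Two caveats: Python's os.read returns the bytes themselves rather than a count (the count is the length of the result), and the scratch file and its contents are invented for the example.

```python
import os
import tempfile

# Create a small scratch file holding 5 bytes (hypothetical contents).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

fd = os.open(path, os.O_RDONLY)   # open the file and obtain a file descriptor
first = os.read(fd, 3)            # asks for 3 bytes and gets 3
second = os.read(fd, 100)         # asks for 100 bytes, but only 2 remain
third = os.read(fd, 100)          # at end-of-file: nothing is returned
os.close(fd)
os.remove(path)
```

The second read returns fewer bytes than requested, and the third returns none, which is how a caller detects end-of-file.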
A major complication is that the system call handler may block. Indeed, the read system call handler is likely to block. In that case the operating system will probably switch to another process. Such process switching is far from trivial and is discussed later in the course.
Posix | Win32 | Description |
---|---|---|
Process Management | ||
fork | CreateProcess | Clone current process |
exec(ve) | (none) | Replace current process |
wait(pid) | WaitForSingleObject | Wait for a child to terminate. |
exit | ExitProcess | Terminate process & return status |
File Management | ||
open | CreateFile | Open a file & return descriptor |
close | CloseHandle | Close an open file |
read | ReadFile | Read from file to buffer |
write | WriteFile | Write from buffer to file |
lseek | SetFilePointer | Move file pointer |
stat | GetFileAttributesEx | Get status info |
Directory and File System Management | ||
mkdir | CreateDirectory | Create new directory |
rmdir | RemoveDirectory | Remove empty directory |
link | (none) | Create a directory entry |
unlink | DeleteFile | Remove a directory entry |
mount | (none) | Mount a file system |
umount | (none) | Unmount a file system |
Miscellaneous | ||
chdir | SetCurrentDirectory | Change the current working directory |
chmod | (none) | Change permissions on a file |
kill | (none) | Send a signal to a process |
time | GetLocalTime | Elapsed time since 1 Jan 1970 |
We describe very briefly some of the Unix (Posix) system calls. A short description of the Windows interface is in the book.
To show how the four process management calls enable much of process management, consider the following highly simplified shell.
while (true)
    display_prompt()
    read_command(command)
    if (fork() != 0)
        waitpid(...)
    else
        execve(command)
    endif
endwhile
The fork() system call duplicates the process. That is we now have a second process, which is a child of the process that actually executed the fork(). The parent and child are very, VERY nearly identical. For example they have the same instructions, they have the same data, and they are both currently executing the fork() system call.
But there is a difference!
The fork() system call returns a zero in the child process and returns a positive integer in the parent. In fact the value returned to the parent is the PID (process ID) of the child.
Thus, the parent and child execute different branches of the if-then-else in the code above.
Note that simply removing the waitpid(...) gives background jobs.
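The body of the simplified shell loop can be sketched in runnable form. The version below is in Python rather than the pseudocode used above, the helper name run_command is hypothetical, and the exit code 127 on a failed exec is only a common shell convention that I am assuming here.

```python
import os
import sys

def run_command(argv):
    """Fork a child, exec argv in it, and return the child's exit status."""
    pid = os.fork()
    if pid != 0:
        # Parent: fork() returned the child's PID, so wait for that child.
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status)
    else:
        # Child: fork() returned 0, so replace this process with the command.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # exec failed (assumed convention)

# Usage: run a child that exits with status 3.
status = run_command([sys.executable, "-c", "raise SystemExit(3)"])
```

Skipping the os.waitpid call in the parent branch would correspond to running the command as a background job.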
Most files are accessed sequentially from beginning to end.
In this case the operations performed are
open() -- possibly creating the file
multiple read()s and write()s
close()
For non-sequential access, lseek is used to move the file pointer, which is the location in the file where the next read or write will take place.
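A short Python sketch of moving the file pointer follows; the scratch file and its ten-byte contents are assumptions made for the example.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()             # opened for reading and writing
os.write(fd, b"0123456789")               # the file pointer is now at offset 10
os.lseek(fd, 4, os.SEEK_SET)              # move the file pointer to offset 4
middle = os.read(fd, 3)                   # reads bytes 4..6; pointer moves to 7
position = os.lseek(fd, 0, os.SEEK_CUR)   # a zero-length seek reports the offset
os.close(fd)
os.remove(path)
```

Each read or write advances the file pointer by the number of bytes transferred, which is why sequential access needs no lseek at all.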
Directories are created and destroyed by mkdir and rmdir. Directories are changed by the creation, modification, and deletion of files. As mentioned, open can create files. Files can have several names: link gives another name to an existing file and unlink removes a name. When the last name is gone (and the file is no longer open by any process), the file data is destroyed.
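The several-names idea can be tried directly. This is a minimal Python sketch, assuming a file system that supports hard links; the file names first and second are invented for the example.

```python
import os
import tempfile

directory = tempfile.mkdtemp()                 # mkdir, in effect
first = os.path.join(directory, "first")       # hypothetical names
second = os.path.join(directory, "second")

with open(first, "w") as f:                    # open() creates the file
    f.write("data")

os.link(first, second)                         # a second name for the same file
links_after_link = os.stat(first).st_nlink     # number of names for the file
os.unlink(first)                               # remove one name; the data survives
with open(second) as f:
    surviving = f.read()
links_after_unlink = os.stat(second).st_nlink
os.unlink(second)                              # last name gone: the data is freed
os.rmdir(directory)                            # rmdir requires an empty directory
```

The link count reported by stat is what lets the system know when the last name has been removed.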
In Unix, one file system can be mounted on (attached to) another. When this is done, access to an existing directory on the second filesystem is temporarily replaced by the entire first file system. Most often the directory chosen is empty before the mount so no files become (temporarily) invisible.
The top picture shows two file systems; the second row shows the result when the right hand file system is mounted on /y. In both cases squares represent directories and circles represent regular files.
This is how a Unix system can enable all files, even those on different physical disks and using different filesystems, to be descendants of a single root.
Skipped
Skipped
The transfer of control between user processes and the operating system kernel can be quite complicated, especially in the case of blocking system calls, hardware interrupts, and page faults. Before tackling these issues later, we begin with the familiar example of a procedure call within a user-mode process.
An important OS objective is that, even in the more complicated cases of page faults and blocking system calls requiring device interrupts, simple procedure call semantics are observed from a user process viewpoint. The complexity is hidden inside the kernel itself, yet another example of the operating system providing a more abstract, i.e., simpler, virtual machine to the user processes.
More details will be added when we study memory management (and know officially about page faults) and more again when we study I/O (and know officially about device interrupts).
A number of the points below are far from standardized.
Such items as where to place parameters, which routine saves the
registers, exact semantics of trap, etc, vary as one changes
language/compiler/OS.
Indeed some of these are referred to as calling conventions, i.e., their implementation is a matter of convention rather than logical requirement.
The presentation below is, we hope, reasonable, but must be viewed
as a generic description of what could happen, rather than a real
description of what does happen with, say, C compiled by the
Microsoft compiler running on Windows 7.
Procedure f calls g(a,b,c) in process P. An example is above where a user program calls read(fd,buffer,nbytes). Note that both f() and g() are in the same process P and no action goes outside P. Thus we will not mention the process again in this description.
Procedure calls have a nested (stack-like) structure of control transfer: we can be sure that control will return to f() when the call to g() exits. The above statement holds even if g() calls h() and then h() calls d(). In fact it even holds if, via recursion, g() calls f(). We are ignoring language features such as throwing and catching exceptions, and the use of unstructured assembly coding. In the latter cases all bets are off.
We mean one procedure running in kernel mode calling another procedure, which will also be run in kernel mode. Later, we will discuss switching from user mode to kernel mode and back.
There is not much difference between the actions taken during a kernel-mode procedure call and during a user-mode procedure call. The procedures executing in kernel-mode are permitted to issue privileged instructions, but the instructions used for transferring control are all unprivileged so there is no change in that respect.
One difference is that often a different stack is used in kernel mode, but that simply means that the stack pointer must be set to the kernel stack when switching from user to kernel mode. But we are not switching modes in this section; the stack pointer already points to the kernel stack. Often there are two stack pointers, one for kernel mode and one for user mode.
The trap instruction, like a procedure call, is a synchronous transfer of control: We can see where, and hence when, it is executed. In this respect, there are no surprises. Although not surprising, the trap instruction does have an unusual effect: processor execution is switched from user-mode to kernel-mode. That is, the trap instruction normally is itself executed in user-mode (it is naturally an UNprivileged instruction), but the next instruction executed (which is NOT the instruction written after the trap) is executed in kernel-mode.
Process P, running in unprivileged (user) mode, executes a trap.
The code being executed is written in assembler since there are no
high level languages that generate a trap instruction.
There is no need for us to name the function that is executing.
Compare the following example to the explanation of f calls g
given above.
name of the code-sequence to which the processor will jump rather than as an argument to trap.
interrupt appears because an RTI is also used when the kernel is returning from an interrupt as well as in the present case when it is returning from a trap. Actually, an RTI doesn't always go back to user mode. Instead it returns to the mode in effect before the trap or interrupt.
Note: A good way to use the material in the addendum is to compare the first case (user-mode f calls user-mode g) to the TRAP/RTI case line by line so that you can see the similarities and differences.
Start Lecture #4
Remark: For anyone getting an authentication failure when trying to connect to mauler or access, the following advice from the systems group is likely to be helpful.
The first line of defense in all cases of authentication failure is to attempt a password reset. Please visit https://cims.nyu.edu/webapps/password/reset to do so. Within 15 minutes of a password reset submission, instructions to retrieve the new password will be sent to xyz1233@nyu.edu. Please e-mail helpdesk@cims.nyu.edu in the event that the password reset either fails, or that the new password does not work (be sure to preface your ssh command with your username, e.g. ssh xyz123@access.cims.nyu.edu).
I must note that Tanenbaum is a big advocate of the so called microkernel approach in which as much as possible is moved out of the (supervisor mode) kernel into separate processes (see his article in CACM March 2016). The (hopefully small) portion left in supervisor mode is called a microkernel.
In the early 90s this was popular.
Digital Unix (subsequently called Tru64) and Windows NT were microkernel based. Digital Unix was based on Mach, a research OS from Carnegie Mellon University. However, for performance reasons, subsequent versions of Windows were hybrid designs, as was OS X for the Mac (see the Tanenbaum CACM article referenced above). Lately, the growing popularity of Linux and Android has called into question the once-felt belief that all new operating systems will be microkernel based.
The previous picture: one big program.
The system switches from user mode to kernel mode during the trap and then back when the OS does an RTI (return from interrupt).
While in supervisor mode, the OS naturally includes procedure calls and returns.
Modern monolithic systems, such as Linux, are not completely monolithic in that during execution, they can load code modules as needed. This load on demand capability is mainly used for device drivers.
We can structure the system better than above might suggest, which brings us to ...
Some systems have more layers than shown on the right and are more strictly structured.
An early layered system was the THE operating system by Dijkstra and his students at Technische Hogeschool Eindhoven. This was a simple batch system so the operator was the user.
The actual layers were
The layering was done by convention, i.e., there was no enforcement by hardware and the entire OS was linked together as one program. This is true of many modern operating systems as well (e.g., Linux).
The MULTICS system was layered in a more formal manner. The hardware provided several protection layers and the OS used them. That is, arbitrary code could not jump into or access data in a more protected layer.
The idea is to have the kernel, i.e. the portion running in supervisor mode, as small as possible and to have most of the operating system functionality provided by separate processes. The microkernel provides just enough to implement processes.
This has significant advantages. For example an error in the file server cannot corrupt memory in the process server since they have separate address spaces (they are after all separate processes). Confining the effect of errors makes them easier to track down. Also an error in the Ethernet driver can corrupt or stop network communication, but it cannot crash the system as a whole.
But the microkernel approach does mean that when a (real) user process makes a system call there are more process switches. These are expensive and have hindered the adoption of pure microkernel operating systems.
Related to microkernels is the idea of putting the mechanism in the kernel, but not the policy. For example, the kernel would know how to select the highest priority process and run it, but some external process would assign the priorities. In this way changing the priority scheme could become a relatively minor event compared to the situation in monolithic systems where the entire kernel must be relinked and rebooted.
Dennis Ritchie, the inventor of the C programming language and co-inventor, with Ken Thompson, of Unix was interviewed in February 2003. The following is from that interview.
What's your opinion on microkernels vs. monolithic?
They're not all that different when you actually use them. Microkernels tend to be pretty large these days, monolithic kernels with loadable device drivers are taking up more of the advantages claimed for microkernels.
I should note, however, that Tanenbaum's Minix microkernel (excluding the processes) is quite small, about 13,000 lines.
When implemented on one computer, a client-server OS often uses the microkernel approach shown above in which the microkernel just handles communication between clients and servers, and the main OS functions are provided by a number of separate processes.
A distributed system can be thought of as an extension of the client server concept where the servers are remote.
The figure on the right would describe a distributed system of yesteryear, where memory was scarce and it would be considered lavish to have full systems on each machine.
Today with plentiful memory, each machine would have all the standard servers and specific servers for every device on that system. So the only reason an OS-internal message would go to another computer is if the originating process wished to communicate with a specific process running on that computer or a specific device attached to that computer.
Distributed systems are becoming increasingly
important for application programs.
Perhaps the program needs data found only on certain machines (no one
machine has all the data).
For example, think of (legal, of course) file sharing
programs.
Distributed systems are also used to reduce the time required by an
application.
You do this by dividing the program into pieces, which are run
concurrently on separate computers.
Homework: The client-server model is popular in distributed systems. Can it also be used in a single-computer system?
Use a hypervisor (i.e., beyond supervisor, i.e., beyond a normal OS) to switch between multiple operating systems. The modern name for a hypervisor is a Virtual Machine Monitor (VMM).
The hypervisor idea was made popular by IBM's CP/CMS (now VM/370). CMS stood for Cambridge Monitor System since it was developed at IBM's Cambridge (MA) Science Center. It was renamed, with the same acronym (an IBM specialty, cf. RAID) to Conversational Monitor System.
Recently, virtual machine technology has moved to machines (notably
x86) that are not fully virtualizable.
Recall that when CMS executed a privileged instruction, the hardware
trapped to the real operating system.
On x86, privileged instructions are ignored when executed
in user mode, so running the guest OS in user mode won't work.
Bye bye (traditional) hypervisor.
But a new style emerged where the hypervisor runs, not on the
hardware, but on the host operating system.
See the text for a sketch of how this (and another idea, paravirtualization) works.
An important research advance was Disco from Stanford University
that led to the successful commercial product VMware.
Both AMD and Intel have extended the x86 architecture to better support virtualization. The newest processors produced today (2008) by both companies now support an additional (higher) privilege mode for the VMM. The guest OS now runs in the old privileged mode (for which it was designed) and the hypervisor/VMM runs in the new higher privileged mode from which it is able to monitor the usage of hardware resources by the guest operating system(s).
The idea is that a new (rather simple) computer architecture called the Java Virtual Machine (JVM) was invented but not built (in hardware). Instead, interpreters for this architecture are implemented in software on many different hardware platforms. Each interpreter is also called a JVM. The Java compiler transforms Java into instructions for this new architecture, which then can be interpreted on any machine for which a JVM exists.
This has portability as well as security advantages, but at a cost in performance.
Of course Java can also be compiled to native code for a particular hardware architecture and other languages can be compiled into instructions for a software-implemented virtual machine (e.g., Pascal with its p-code).
Similar to VM/CMS but the virtual machines have disjoint resources (e.g., distinct disk blocks) so less remapping is needed.
Assumed knowledge.
Assumed knowledge.
Mostly assumed knowledge. Linkers are very briefly discussed. Our earlier discussion was much more detailed.
Extremely brief treatment with only a few points made about the running of the operating system itself.
Skipped
Skipped
Assumed knowledge. Note that what is covered is just the prefixes, i.e. the names and abbreviations for various powers of 10.
Skipped, but you should read and be sure you understand it (about 2/3 of a page).
Tanenbaum's chapter title is Processes and Threads. I prefer to add the word management.
The subject matter is processes, threads, scheduling, interrupt
handling, and IPC (Inter-Process Communication—and
Coordination).
Definition: A process is a program in execution.
parallel processor (a computer with multiple independent processors).
Even though in actuality there are many processes running at once, the OS gives each process the illusion that it is running alone.
Virtual time is the time used by just this process.
Virtual time progresses at a rate independent of other processes.
(Actually, this is false: the virtual time is typically incremented a little during the system calls used for process switching, so if there are other processes, overhead virtual time occurs.)
Virtual memory is the memory as viewed by the process.
Each process typically believes it has a contiguous chunk of memory
starting at location zero.
Of course this can't be true of all processes (or they would be
using the same memory) and in modern systems it is actually true of
no processes (the memory assigned to a single process is not
contiguous and does not include location zero).
Think of the individual modules that are input to the linker.
Each numbers its addresses from zero; the linker eventually
translates these relative addresses into absolute addresses.
That is the linker provides to the assembler a virtual memory in
which addresses start at zero.
Virtual time and virtual memory are examples of abstractions provided by the operating system to the user processes so that the latter experience a more pleasant virtual machine than actually exists.
Note: Please be aware that the homework problem numbers are from the fourth edition. They are not the same problems in older editions.
From the users' or external viewpoint there are several mechanisms for creating a process.
But looked at internally, from the system's viewpoint, the second
method dominates.
Indeed, in early versions of Unix only one process
(called init) is created at system initialization; all the
others are created by the fork()
system call.
Question: Why have init?
That is why not have all processes created via method 2?
Answer: Because without init there would be no
running process to create any others.
Many systems have daemon processes lurking around to perform tasks when they are needed.
I was pretty sure the terminology was related to mythology, but
didn't have a reference until a student found
The {Searchable} Jargon Lexicon
at http://developer.syndetic.org/query_jargon.pl?term=demon
daemon: /day'mn/ or /dee'mn/ n. [from the mythological meaning, later rationalized as the acronym `Disk And Execution MONitor'] A program that is not invoked explicitly, but lies dormant waiting for some condition(s) to occur. The idea is that the perpetrator of the condition need not be aware that a daemon is lurking (though often a program will commit an action only because it knows that it will implicitly invoke a daemon). For example, under ITS (a very early OS), writing a file on the LPT spooler's directory would invoke the spooling daemon, which would then print the file. The advantage is that programs wanting (in this example) files printed need neither compete for access to nor understand any idiosyncrasies of the LPT. They simply enter their implicit requests and let the daemon decide what to do with them. Daemons are usually spawned automatically by the system, and may either live forever or be regenerated at intervals. Daemon and demon are often used interchangeably, but seem to have distinct connotations. The term `daemon' was introduced to computing by CTSS people (who pronounced it /dee'mon/) and used it to refer to what ITS called a dragon; the prototype was a program called DAEMON that automatically made tape backups of the file system. Although the meaning and the pronunciation have drifted, we think this glossary reflects current (2000) usage.
As is often the case, wikipedia.org proved useful. Here is the first paragraph of a much larger entry. The wikipedia also has entries for other uses of daemon.
In Unix and other computer multitasking operating systems, a daemon is a computer program that runs in the background, rather than under the direct control of a user; they are usually instantiated as processes. Typically daemons have names that end with the letter "d"; for example, syslogd is the daemon which handles the system log.
Again from the outside there appear to be several termination mechanisms.
And again, internally the situation is simpler.
In Unix terminology, there are two system calls kill() and
exit() that are used.
kill() (poorly named in my view) sends a signal to
another process.
For many types of signals, if the signal is not caught (via
the signal() or sigaction() system call) the
process is terminated.
There is also an uncatchable signal.
The exit() system call is used for self termination and can
indicate success or failure.
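Both termination paths can be observed from the parent. The Python sketch below assumes SIGKILL as the uncatchable signal (the notes do not name one) and uses a 60 second sleep merely to keep the child alive long enough to be killed.

```python
import os
import signal
import time

pid = os.fork()
if pid == 0:
    time.sleep(60)        # child: would run for a minute if left alone
    os._exit(0)           # self termination via exit (never reached here)

os.kill(pid, signal.SIGKILL)                # SIGKILL cannot be caught
_, status = os.waitpid(pid, 0)              # collect the child's status
killed_by_signal = os.WIFSIGNALED(status)   # True if a signal ended the child
signal_number = os.WTERMSIG(status)         # which signal it was
```

Had the child reached os._exit(0) instead, os.WIFEXITED(status) would be true and os.WEXITSTATUS(status) would give the value passed to exit.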
Modern general purpose operating systems permit a user to create and destroy processes.
Old or primitive operating systems like MS-DOS are not fully multiprogrammed, so when one process starts another, the first process is automatically blocked and waits until the second is finished. This implies that the process tree degenerates into a line.
The diagram on the right contains much information. I often include it on exams. Be sure to accept this gift if it is offered.
Consider a running process P that issues an I/O request.
A preemptive scheduler has the dotted line preempt; a non-preemptive scheduler doesn't.
The number of processes changes only for two arcs: create and terminate.
Suspend and resume are medium term scheduling.
As mentioned previously, one can organize an OS around the scheduler.
kernel (a micro-kernel) consisting of the scheduler, interrupt handlers, and IPC (interprocess communication).
Minix operating system works this way.
The OS organizes the data about each process in a table naturally called the process table. Each entry in this table is called a process table entry or process control block.
Characteristics of the process table.
an active entity becomes a data structure when looked at from a lower level.
This should be compared with the addenda on transfer of control and trap.
In a well defined location in memory (specified by the hardware) the OS stores an interrupt vector, which contains the address of the interrupt handler.
Assume a process P is running and a disk interrupt occurs indicating the completion of a disk read previously issued by process Q, which is currently blocked. Note that disk interrupts are unlikely to be for the currently running process (because the process that initiated the disk access is likely to be blocked).
this instruction caused the interrupt.
this instruction immediately preceded the interrupt.
program, namely the OS, and hence might well be using the same variables. We will soon see how this can cause great problems even in what appear to be trivial cases.
Start Lecture #5
As a User Process
In traditional Unix and Linux, if an interrupt occurs while a user process with PID=P is running, the system switches to kernel mode and OS code is executed, but the PID is still P. The owner of process P is charged for this execution. Try running the time program on one of the Unix systems and noting the output.
Consider a job that is unable to compute (i.e., it is waiting for I/O) a fraction p of the time.
There are at least two causes of inaccuracy in the above modeling procedure.
Nonetheless, it is correct that increasing MPL does increase CPU utilization (up to a point).
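The effect of raising the MPL can be tabulated with a few lines of Python. I am assuming the usual simple probabilistic model here: each of n jobs independently waits for I/O a fraction p of the time, so the CPU is idle only when all n are waiting and the utilization is 1 - p^n. The function name and the sample value p = 0.8 are mine.

```python
def cpu_utilization(p, n):
    """Probability that at least one of n jobs is ready to compute.

    Assumes each job independently waits for I/O a fraction p of the time.
    """
    return 1 - p ** n

# Jobs that wait for I/O 80% of the time, at increasing MPL.
for n in (1, 2, 4, 8):
    print(n, round(cpu_utilization(0.8, n), 3))
```

Under these assumptions utilization rises with each added job, but by less each time, consistent with the "up to a point" caveat.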
An important limitation is memory. That is, we assumed that we have many jobs loaded at once, which means we must have enough memory for them. There are other memory-related issues as well and we will discuss them later in the course.
Homework:
A crucial feature of processes is their independence or isolation: It is important that when one program executes x++ the value of x in other processes running at that time is not increased.
Sometimes, however, this feature is a bug.
The idea behind threads is to have multiple threads of control (hence the name) running in the address space of a single process as shown in the diagram to the right. An address space is a memory management concept. For now think of an address space as the memory in which a process runs. (In reality it also includes the mapping from virtual addresses, i.e., addresses in the program, to physical addresses, i.e., addresses in the machine).
Each thread is somewhat like a process (e.g., it shares the processor with other threads), but a thread contains less state than a process (e.g., the address space belongs to the process in which the thread runs.)
Often, when a process P executing an application is blocked (say for I/O), there is still computation that can be done for the application. Another process can't do this computation since it doesn't have access to P's memory. But two threads in the same process do share memory so that problem doesn't occur.
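The difference between sharing and isolation can be seen in a small Python experiment. The one-element list counter stands in for the variable x, and all names are illustrative; this is a sketch of the idea, not of how any particular application is structured.

```python
import os
import threading

counter = [0]             # stands in for the variable x

def increment():
    counter[0] += 1       # the x++ of the discussion above

# A thread runs in our address space, so its increment is visible to us.
t = threading.Thread(target=increment)
t.start()
t.join()
after_thread = counter[0]

# A forked child gets its own copy of the address space.
pid = os.fork()
if pid == 0:
    increment()           # changes only the child's copy
    os._exit(0)
os.waitpid(pid, 0)
after_fork = counter[0]   # the parent's value is unaffected by the child
```

The thread's increment shows up in the parent, while the child process's increment does not.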
The downside of this memory sharing among threads is that each thread is not protected from the others in its process. We will see in section 2.3 that having multiple threads concurrently accessing the same memory can cause subtle bugs in programs that look too simple to be wrong.
So it is often a performance/simplicity trade-off.
Although there are many differences, we will be primarily interested in just two.
    loop
        Read 10KB from disk1 to inBuffer
        Compute from inBuffer to outBuffer
        Write outBuffer to disk2
    end loop

    // process 1
    loop
        Read data from disk1 to inBuffer
    end loop

    // process 2
    loop
        Compute from inBuffer to outBuffer
    end loop

    // process 3
    loop
        Write outBuffer to disk2
    end loop
Consider the first frame of code on the right. Assume for simplicity each line takes 10ms so the entire loop processes 10KB every 30ms. However, the CPU is busy only during the second line and, if the I/O system is sophisticated, the first and third lines use separate hardware.
Hence in principle the three lines could all proceed at the same time. That is we could turn the three steps into a pipeline so that after the startup phase, the loop would process 10KB every 10ms, a 3X speed improvement.
The second frame shows an attempt to speed up the application by splitting it into three processes: a reader, a computer, and a writer. However this can't work since the two inBuffers are not the same so processes 1 and 2 aren't communicating. Similarly for processes 2 and 3 with outBuffer.
If instead the three loops were each a thread within the same process, then the two uses of inBuffer would refer to the same variable and similarly for outBuffer. Hence our desired speedup would occur.
Another advantage of the threaded solution over the separate-process non-solution is that the system can switch between threads in the same process faster than it can switch between separate processes.
The solution above is simplistic and would fail. Users of the same buffer must coordinate their actions, and you need at least two inBuffers and two outBuffers as shown in the solution immediately following.
The diagram on the right shows the actions during the first four time steps. The disk on the right contains the input and the one on the left will contain the output. The two circles on the right are input buffers and the two on the left are output buffers. Initially, all the buffers are invalid (i.e., contain no data).
The above example is a simplification of what is really done. The threads must be coordinated/synchronized so that one thread does not either read data that is not yet completely written or write data before the next thread has read it.
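A minimal sketch of the three-thread pipeline, assuming Python's standard `queue.Queue` as the coordination mechanism (the notes do not prescribe one). Each bounded queue plays the role of a pair of buffers: a producer blocks when both slots are full and a consumer blocks when both are empty. The names `run_pipeline`, `in_buffers`, and `out_buffers`, and the upper-casing "computation", are placeholders.

```python
import queue
import threading


def run_pipeline(blocks):
    """Reader -> computer -> writer, one thread each, sharing bounded buffers."""
    in_buffers = queue.Queue(maxsize=2)    # two input buffers
    out_buffers = queue.Queue(maxsize=2)   # two output buffers
    written = []                           # stands in for disk2
    DONE = object()                        # end-of-input marker

    def reader():
        for block in blocks:               # stands in for reading disk1
            in_buffers.put(block)          # waits while both buffers are full
        in_buffers.put(DONE)

    def computer():
        while (block := in_buffers.get()) is not DONE:
            out_buffers.put(block.upper())  # placeholder computation
        out_buffers.put(DONE)

    def writer():
        while (block := out_buffers.get()) is not DONE:
            written.append(block)

    threads = [threading.Thread(target=f) for f in (reader, computer, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written


print(run_pipeline(["a", "b", "c"]))
```

Because all three threads live in one process, `in_buffers` and `out_buffers` really are the same objects for every thread, which is exactly what the separate-process version lacked.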
An important modern example of threading is a multithreaded web server. Each thread is responding to a single WWW connection. While one thread is blocked on I/O, additional threads can process other WWW connections.
Question: Why not use separate processes, i.e., what is the shared memory?
Answer: The cache of frequently referenced pages.
A common organization for a multithreaded application is to have a dispatcher thread that fields requests and then passes each request on to an idle worker thread. Since the dispatcher and workers share memory, passing the request is very low overhead.
A multithreaded web server can be organized this way.
A final (related) example occurs when a main line task interfaces with the user and sometimes needs to perform a lengthy task that does not directly affect the user-interface.
Tanenbaum considers a word processor currently editing a large file (say a book the user is writing) and the user deletes a word early in the book. This can cause changes on all subsequent pages. Hence, reformatting the book could cause a detectable delay in the user interface. With a threaded implementation, a second thread can be assigned the reformatting task while the primary thread continues to interface with the user. It is only when the user wishes to examine a page near the end of the book that they must wait for the second thread to finish. Hopefully, the user has been doing other editing in the beginning of the book so that the second thread is finished prior to the user needing to access pages near the end. Even if the second thread is not finished, it will have accomplished some of the work while the user is still editing near the book's beginning.
In this same example, the word processor may wish to perform automatic backups. Again, another thread can be assigned to do this. In this way the thread that interfaces with the user is not blocked during the backup. However, some coordination between threads may be needed so that the backup is of a consistent state.
Per process items | Per thread items |
---|---|
Address space | Program counter |
Global variables | Machine registers |
Open files | Stack |
Child processes | |
Pending alarms | |
Signals and signal handlers | |
Accounting information | |
A process contains a number of resources such as address space, open files, accounting information, etc. In addition to these resources, a process has a thread of control, e.g., program counter, register contents, stack. The idea of threads is to permit multiple threads of control to execute within one process. This is often called multithreading and threads are sometimes called lightweight processes. Because threads in the same process share so much state, switching between them is much less expensive than switching between separate processes. The table on the right shows which properties are common to all threads in a given process and which properties are thread specific.
Individual threads within the same process are not completely independent. For example there is no memory protection between them. This is typically not a security problem as the threads are cooperating and all are from the same user (indeed the same process). However, the shared resources do make debugging harder. For example one thread can easily overwrite data needed by another thread in the process and when the second thread fails, the cause may be hard to determine because the tendency is to assume that the failed thread caused the failure.
You may recall that a serious advantage of microkernel OS design was that the separate OS processes could not, even if buggy, damage each other's data structures.
A new thread in the same process is created by a routine named something like thread_create; similarly there is thread_exit. The analogue to waitpid is thread_join (the name presumably comes from the fork-join model of parallel execution).
The routine thread_yield, which relinquishes the processor, does not have a direct analogue for processes. The corresponding system call (if it existed) would move the process from running to ready. It would be as if the process preempted itself.
Homework: 15. Why would a thread ever voluntarily give up the CPU by calling thread_yield? After all, since there is no periodic clock interrupt, it may never get back the CPU?
Assume a process has several threads. What should we do if one of these threads
POSIX threads (pthreads) is an IEEE standard specification that is supported by many Unix and Unix-like systems. Pthreads follows the classical thread model above and specifies routines such as pthread_create, pthread_yield, etc.
An alternative to the classical model is the so-called Linux threads, which are discussed in section 10.3 of the 4e.
Write a (threads) library that acts as a mini-scheduler and implements thread_create, thread_exit, thread_wait, thread_yield, etc. This library acts as a run-time system for the threads in this process. The central data structure maintained and used by this library is a thread table, the analogue of the process table in the operating system itself.
There is a thread table and an instance of the threads library in each multithreaded process.
Advantages of User-Mode Threads:
Disadvantages
For a uniprocessor, which is all we are officially considering, there is little gain in splitting pure computation into pieces. If the CPU is to be active all the time for all the threads, it is simpler to just have one (unithreaded) process.
But this changes for multiprocessors/multicores. Now it is very useful to split computation into threads and have each executing on a separate processor/core. In this case, user-mode threads are wonderful, there are no system calls and the extremely low overhead is beneficial.
However, there are serious issues involved in programming applications for this environment.
Modern operating systems have direct support for threads, i.e., the thread operations are implemented in the kernel itself. This naturally required that the operating system be (significantly) modified and was not a trivial undertaking.
One can write a (user-level) thread library even if the kernel also has threads. This is sometimes called the N:M model since N user-mode threads run on M kernel threads. In this scheme, the kernel threads cooperate to execute the user-level threads.
An offshoot of the N:M terminology is that kernel-level threading (without user-level threading) is sometimes referred to as the 1:1 model since one can think of each thread as being a user level thread executed by a dedicated kernel-level thread.
Homework:.
Skipped
The idea is to automatically issue a thread-create system call upon message arrival. (The alternative is to have a thread or process blocked on a receive system call.) If implemented well, the latency between message arrival and thread execution can be very small since the new thread does not have state to restore.
Definitely NOT for the faint of heart.
Note: We shall do section 2.4 before section 2.3 for two reasons.
Scheduling processes on the processor is often called processor scheduling or process scheduling or simply scheduling. As we shall see later in the course, a more precise name would be short-term, processor scheduling.
At this point we are discussing the two arcs connecting running↔ready in the diagram on the right, which shows the various states of a process and the transitions between those states. Medium term scheduling is discussed later (as is disk-arm scheduling).
As you would expect, the part of the OS responsible for (short-term, processor) scheduling is called the (short-term, processor) scheduler and the algorithm used is called the (short-term, processor) scheduling algorithm.
Early computer systems were monoprogrammed and, as a result, scheduling was a non-issue.
For many current personal computers, which are definitely multiprogrammed, there is in fact very rarely more than one runnable process. As a result, scheduling is not critical.
For servers, scheduling is indeed important and these are the systems you should think of.
A process alternates between CPU activity and I/O activity, which I often refer to as CPU bursts and I/O bursts. In particular the Scheduling lab will use that terminology.
Since (as we shall see when we study I/O) the time required for a disk access depends only weakly on the size of the request, the key distinguishing factor between compute-bound (aka CPU-bound) and I/O-bound jobs is the length of the CPU bursts.
The trend over the past decade or two has been for more and more jobs to become I/O-bound since the CPU speed has increased much faster than I/O speed.
An obvious point, which is often forgotten (I don't think 4e mentions it), is that the scheduler cannot run unless the OS is running. In particular, for the uniprocessor systems we are considering, no scheduling can occur when a user process is running. (In the multiprocessor situation, no scheduling can occur when all processors are running user jobs.)
We refer to the arcs in the state transition diagram above (especially the top triangle) and discuss those transitions where scheduling is desirable and those where it is mandatory.
It is important to distinguish preemptive from non-preemptive scheduling algorithms.
run until completion, or block (or yield, if there is threading).
The preempt arc in the diagram is present for preemptive scheduling algorithms.
We distinguish three categories of scheduling algorithms with regard to the importance of preemption.
For multiprogrammed batch systems (we do not consider uniprogrammed systems, which have no need for schedulers) the primary concern is efficiency. Since no user is waiting at a terminal, preemption is not crucial and if it is used, it is performed rarely, i.e., each process is given a long time period before being preempted.
For interactive systems (and multiuser servers), preemption is crucial for fairness and rapid response time to short requests.
We don't study real time systems in this course, but will say that preemption is typically not important since all the processes are cooperating and are programmed to do their task in a prescribed time window.
There are numerous objectives, several of which conflict, that a scheduler tries to achieve. These include.
job to its termination. This is important for batch jobs.
shortest job first.
wasted cycles and limited logins for repeatability.
This is used for real time systems. The objective of the scheduler is to find a schedule for all the tasks (there are a fixed set of tasks) so that each meets its deadline. The run time of each task is known in advance.
Actually it is more complicated.
There is an amazing inconsistency in naming the different (short-term, processor) scheduling algorithms. Over the years I have used primarily 4 books: In chronological order they are Finkel, Deitel, Silberschatz, and Tanenbaum. The table just below illustrates the name game for these four books. After the table we discuss several scheduling policies in some detail.
    Finkel  Deitel  Silberschatz  Tanenbaum
    ---------------------------------------
    FCFS    FIFO    FCFS          FCFS
    RR      RR      RR            RR
    PS      **      PS            PS
    SRR     **      SRR           not in Tanenbaum
    SPN     SJF     SJF           SJF/SPN
    PSPN    SRT     PSJF/SRTF     SRTN
    HPRN    HRN     **            not in Tanenbaum
    **      **      MLQ           only in Silberschatz
    FB      MLFQ    MLFQ          MQ
Note: For an alternate organization of the scheduling algorithms (due to my former PhD student Eric Freudenthal and presented by him Fall 2002) click here.
If the OS doesn't schedule, it still needs to store the list of ready processes in some manner. If it is a queue you get FCFS. If it is a stack, you get LCFS. Perhaps you could get some sort of random policy as well.
Sort jobs by execution time needed and run the shortest first. This is a non-preemptive algorithm.
First consider a static (overly simple, non-realistic) situation where all jobs are available in the beginning and we know how long each one will take to run. For simplicity let's consider run-to-completion, also called uniprogrammed (i.e., we don't even switch to another process on I/O).
In this situation, uniprogrammed SJF has the shortest average waiting time. Here's why.
The above argument illustrates an advantage of favoring short jobs: the average waiting time is reduced. For example, we will soon learn about RR. An argument for making the RR quantum small is that short jobs are favored.
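The claim that uniprogrammed SJF minimizes the average waiting time can be checked by brute force on a small example. The job lengths below are made up for illustration; the sketch tries every run order and reports the best one.

```python
from itertools import permutations


def average_waiting_time(run_times):
    """Average time jobs wait before starting, when run to completion in order."""
    waited, clock = 0, 0
    for t in run_times:
        waited += clock    # this job waited for everything that ran before it
        clock += t
    return waited / len(run_times)


jobs = [8, 3, 5, 1]        # hypothetical run times, all available at time 0
best = min(permutations(jobs), key=average_waiting_time)
print(best, average_waiting_time(best))
```

The best order is the sorted one: a job placed early in the order delays every job after it, so the shortest jobs should go first.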
In the more realistic case of true SJF where the scheduler switches to a new process when the currently running process blocks (say for I/O), I would call the policy shortest next-CPU-burst first. However, I have never heard anyone (except me) call it that.
The real difficulty is predicting the future (i.e., knowing in advance the time required for the job or the job's next-CPU-burst).
One way to estimate the duration of the next CPU burst is to calculate a weighted average of the duration of recent CPU bursts. Tanenbaum calls this Shortest Process Next.
Shortest Job First can starve a process that requires a long burst.
Starvation can be prevented by the standard technique.
Question: What is that technique?
Answer: Priority aging (see below).
Start Lecture #6
Preemptive version of above. Indeed some authors call it preemptive shortest job first.
Permit a process that enters the ready list to preempt the running process if the time for the new process (or for its next burst) is less than the remaining time for the running process (or for its current burst).
It will never happen that a process already in the ready list will require less time than the remaining time for the currently running process.
Question: Why?
Answer: When the process joined the ready list it would have started running if the current process had more time remaining. Since that didn't happen the currently running job then had less time remaining and now it has even less.
SRTN can starve a process that requires a long burst.
Starvation can be prevented by the standard technique.
Question: What is that technique?
Answer: Priority aging (see below).
The following algorithms can also be used for batch systems, but in that case, the gain may not justify the extra complexity.
Round Robin (RR) is an important preemptive policy. It is essentially the preemptive version of FCFS. One property of RR is that it loosely approximates SJF without knowing in advance how long each process will require.
When a process is put into the running state a timer is set to q milliseconds, a key parameter of the policy, called the quantum. If the timer goes off and the process is still running, the OS preempts the process.
Note that, as in FCFS, the ready list is being treated as a queue. Indeed many always call the list of ready processes the ready queue. But I don't. For other scheduling algorithms the ready list is not accessed in a FIFO manner so I find the term queue misleading. For FCFS and RR, the term ready queue is appropriate.
When a process is created or unblocked, it is likewise placed at the rear of the ready list.
Note that RR with a quantum of, say, 10ms works well if you have a 1 hr job and then a 1 second job. This is the sense in which it approximates SJF.
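The 1 hr / 1 second example can be checked with a small simulation. This is a sketch under simplifying assumptions: both jobs arrive at time 0 (the long one first), neither blocks, and context switches are free. Times are in milliseconds and the function name is illustrative.

```python
from collections import deque


def round_robin(cpu_times, q):
    """Finishing time of each job under RR with quantum q (no arrivals, no blocking)."""
    ready = deque(cpu_times.items())    # (name, remaining time), in FIFO order
    clock, finish = 0, {}
    while ready:
        name, remaining = ready.popleft()
        run = min(q, remaining)
        clock += run
        if remaining > run:
            ready.append((name, remaining - run))   # preempted: go to the rear
        else:
            finish[name] = clock
    return finish


jobs_ms = {"long": 3_600_000, "short": 1_000}   # a 1 hr job and a 1 second job
print(round_robin(jobs_ms, q=10))               # 10ms quantum
print(round_robin(jobs_ms, q=10**9))            # enormous quantum: same as FCFS
```

With q = 10 the short job finishes after about 2 seconds instead of waiting an hour, while the long job finishes at the same time under both policies.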
As q gets large, RR approaches FCFS. Indeed if q is larger than the longest time any process will run before terminating or blocking, then RR is FCFS. A good way to see this is to look at my favorite diagram and note the three arcs leaving running. They are triggered by three conditions: process terminating, process blocking, and process preempted. If the first trigger condition to arise is never preemption, we can erase that arc and then RR becomes FCFS.
As q gets small, RR approaches PS (Processor Sharing, described next).
Question: What value of q should we choose?
Answer: A trade-off exists.
A student found the following reference for the name Round Robin in the Encyclopedia of Word and Phrase Origins by Robert Hendrickson (Facts on File, New York, 1997). A similar, but less detailed, citation can be found in wikipedia.
The round robin was originally a petition, its signatures arranged in a circular form to disguise the order of signing. Most probably it takes its name from the ruban rond, (round ribbon), in 17th-century France, where government officials devised a method of signing their petitions of grievances on ribbons that were attached to the documents in a circular form. In that way no signer could be accused of signing the document first and risk having his head chopped off for instigating trouble. Ruban rond later became round robin in English and the custom continued in the British navy, where petitions of grievances were signed as if the signatures were spokes of a wheel radiating from its hub. Today round robin usually means a sports tournament where all of the contestants play each other at least once and losing a match doesn't result in immediate elimination.
Homework: Round-robin schedulers normally maintain a list of all ready processes, with each process occurring exactly once in the list. What would happen if a process occurred more than once in the list? Can you think of any reason for allowing this?
Homework: Give an argument favoring a large quantum; give an argument favoring a small quantum.
Process | CPU Time | Creation Time |
---|---|---|
P1 | 20 | 0 |
P2 | 3 | 3 |
P3 | 2 | 5 |
Homework:
lab 2 tie-breaking rule.
Homework: Redo the previous homework for q=2 with the following changes. After process P1 runs for 3ms (milliseconds), it blocks for 2ms. P1 never blocks again. That is, P1 begins with a CPU burst of 3ms, then has an I/O burst of 2ms, and finally it has a CPU burst of 20-3 = 17ms. P2 never blocks. After P3 runs for 1ms it blocks for 1ms. Assume the context switch time is zero. Remind me to answer this problem in class next lecture. A student in fall 2016-17 found this video helpful (thank you, Kevin). https://www.youtube.com/watch?v=aWlQYllBZDs
Merge the ready and running states and permit all ready jobs to be run at once. However, the processor slows down so that when n jobs are running at once, each progresses at a speed 1/n as fast as it would if it were running alone.
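A minimal sketch of the PS model, assuming all jobs arrive at time 0 and never block. Since every active job advances at speed 1/n, the jobs finish in order of length, and each finishing time follows from how many jobs were still sharing the processor. The function name and the sample jobs are illustrative.

```python
def processor_sharing(cpu_times):
    """Finishing times under PS for jobs that all arrive at time 0 and never block."""
    clock, done_work, finish = 0.0, 0.0, {}
    by_length = sorted(cpu_times.items(), key=lambda kv: kv[1])
    n = len(by_length)
    for i, (name, need) in enumerate(by_length):
        active = n - i                          # jobs still sharing the processor
        clock += (need - done_work) * active    # each advances at speed 1/active
        done_work = need                        # CPU time every active job has received
        finish[name] = clock
    return finish


print(processor_sharing({"A": 1, "B": 2, "C": 3}))
```

For these three jobs, A has received its 1 unit of CPU time at time 3 (three jobs sharing), B finishes at 5, and C at 6.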
Homework: 38.
Each job is assigned a priority (externally, perhaps by charging more for higher priority) and the highest priority ready job is run.
External priorities above.
standard technique, which is right below.
As a job is waiting, increase its priority; hence it will eventually have the highest priority.
No job can remain in the ready state forever.
standard technique used to prevent starvation (assuming all jobs terminate or the policy is preemptive).
Homework: 44, 45. Note that when the book says RR with each process getting its fair share, it means Processor Sharing.
SRR is a preemptive policy in which unblocked (i.e., ready and running) processes are divided into two classes: the Accepted processes, which are scheduled using RR, and the others, which are not run until they become accepted. (Perhaps SRR really stands for snobbish RR.)
The behavior of SRR depends on the relationships between a, b, and zero. There are four cases.
batches. This is similar to n-step scan for disk I/O.
It is not clear what to do to the priority when a process blocks. There are several possibilities.
The third possibility seems a little weird. We shall adopt the first possibility (reset to zero) since it seems the simplest.
Start Lecture #7
Recall that SJF/PSJF do a good job of minimizing the average waiting time. The problem with them is the difficulty in finding the job whose next CPU burst is minimal. We now learn three scheduling algorithms that attempt to do this. The first algorithm does it statically, presumably with some manual help; the other two are dynamic and fully automatic.
Put different classes of processes in different queues
As with multilevel queues above, we have many queues, but now processes move from queue to queue in an attempt to dynamically separate batch-like from interactive processes so that we can favor the latter.
Remember that low average waiting time is achieved by SJF. Multiple Queues is an attempt to determine dynamically those processes that are interactive, which means they have very short CPU bursts.
Shortest process next (mentioned previously) is an attempt to apply SJF to interactive scheduling. What is needed is an estimate of how long the process will run until it blocks again. One method is to choose some initial estimate when the process starts and then, whenever the process blocks, choose a new estimate via

    NewEstimate = A*OldEstimate + (1-A)*LastBurst

where 0<A<1 and LastBurst is the actual time used during the burst that just ended.
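The update rule is easy to try out. In the sketch below the initial estimate, the value A = 0.5, and the observed bursts are all made-up numbers chosen only to show how the estimate tracks recent behavior.

```python
def next_estimate(old_estimate, last_burst, a=0.5):
    """NewEstimate = A*OldEstimate + (1-A)*LastBurst, with 0 < A < 1."""
    return a * old_estimate + (1 - a) * last_burst


estimate = 10.0                          # arbitrary initial guess
for burst in (6.0, 4.0, 4.0, 12.0):      # hypothetical observed CPU bursts
    estimate = next_estimate(estimate, burst)
    print(burst, estimate)
```

A value of A near 1 makes the estimate change slowly (long memory), while a value near 0 makes it follow the most recent burst almost exactly.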
Run the process that has been hurt the most.
A variation on HPRN. The penalty ratio is a little different. It is nearly the reciprocal of the above, namely

    t / (T/n)

where n is the multiprogramming level. So if n is constant, this ratio is a constant times 1/r.
Each process gets a fixed number of tickets and at each scheduling event a random ticket is drawn (with replacement) and the process holding that ticket runs for the next interval (probably a RR-like quantum q).
On the average a process with P percent of the tickets will get P percent of the CPU (assuming no blocking, i.e., full quanta).
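The proportional-share property can be observed with a short simulation. This is a sketch assuming full quanta (no blocking); the ticket counts, the number of draws, and the seed are arbitrary choices.

```python
import random
from collections import Counter


def lottery_shares(tickets, draws, seed=0):
    """Fraction of quanta each process receives when each quantum is one ticket draw."""
    rng = random.Random(seed)                  # seeded so the run is repeatable
    names = list(tickets)
    weights = [tickets[name] for name in names]
    winners = Counter(rng.choices(names, weights=weights, k=draws))
    return {name: winners[name] / draws for name in names}


print(lottery_shares({"A": 75, "B": 25}, draws=100_000))
```

The observed shares are close to 75% and 25%, but only on average: over a short run of quanta a process can be noticeably lucky or unlucky.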
If you treat processes fairly you may not be treating users fairly since users with many processes will get more service than users with few processes. The scheduler can group processes by user and only give one of a user's processes a time slice before moving to another user.
For example, Linux has cgroups for a related purpose. The scheduler first schedules across cgroups so if a big job has many processes in the same cgroup, it will not get more time than a small job with just one process.
Fancier methods have been implemented that give some fairness to groups of users. Say one group paid 30% of the cost of the computer. That group would be entitled to 30% of the CPU cycles provided it had at least one process active. Furthermore a group earns some credit when it has no processes active.
Considerable theory has been developed.
In addition to the short-term scheduling we have discussed, we add medium-term scheduling in which decisions are made at a coarser time scale.
Recall my favorite diagram, shown again on the right. Medium term scheduling determines the transitions from the top triangle to the bottom row. We suspend (swap out) some process if memory is over-committed, dropping the (ready or blocked) process down. We also need resume transitions to return a process to the top triangle.
Criteria for choosing a victim to suspend include:
We will discuss medium term scheduling again when we study memory management and understand what is meant by saying memory is over-committed.
Process | CPU Time | Start Time | Blocks after | Blocks for |
---|---|---|---|---|
P0 | 10 | 0 | 5 | 9 |
P1 | 11 | 4 | 4 | 6 |
P2 | 9 | 4 | never | |
Consider the following problem of the same genre as those in the homework. In this example we have RR scheduling with q=3 and zero context switch time.
The system contains three processes. Their relevant characteristics are given in the table on the right.
The diagram below presents a detailed solution. The numbers above the horizontal lines give the CPU time remaining at the beginning and end of the execution interval. The numbers below the horizontal lines give the length of the execution interval. The red lines indicate a blocked process.
We see that P2 finishes at time 21, P0 at time 23, and P1 at time 30.
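The finishing times can be reproduced with a millisecond-by-millisecond simulation. The sketch below makes two assumptions that are mine rather than stated in the problem: each process blocks at most once (after the given amount of CPU time), and processes that join the ready list at the same instant are ordered by their position in the table. The function and variable names are illustrative.

```python
from collections import deque


def rr_with_blocking(procs, q):
    """procs: name -> (cpu_time, start, block_after, block_for); block_after None = never."""
    order = list(procs)
    remaining = {p: procs[p][0] for p in order}
    used = {p: 0 for p in order}               # CPU time consumed so far
    wake = {p: procs[p][1] for p in order}     # time at which p next becomes ready
    waiting = set(order)                       # not yet arrived, or blocked
    ready, finish = deque(), {}
    clock, running, slice_left = 0, None, 0
    while len(finish) < len(order):
        # Arrivals and unblocked processes join the rear of the ready list.
        newly = [p for p in order if p in waiting and wake[p] <= clock]
        if running is not None and slice_left == 0:    # quantum expired: preempt
            newly.append(running)
            running = None
        for p in sorted(newly, key=order.index):       # assumed tie-breaking rule
            waiting.discard(p)
            ready.append(p)
        if running is None and ready:
            running, slice_left = ready.popleft(), q
        clock += 1
        if running is None:
            continue                                   # CPU idle this millisecond
        remaining[running] -= 1
        used[running] += 1
        slice_left -= 1
        if remaining[running] == 0:
            finish[running] = clock
            running = None
        elif used[running] == procs[running][2]:       # the single I/O burst starts
            wake[running] = clock + procs[running][3]
            waiting.add(running)
            running = None
    return finish


table = {"P0": (10, 0, 5, 9), "P1": (11, 4, 4, 6), "P2": (9, 4, None, 0)}
print(rr_with_blocking(table, q=3))
```

Under these assumptions the simulation gives the same finishing times as the diagram: P2 at 21, P0 at 23, and P1 at 30. A different tie-breaking rule could change the order in which P0 and P1 run after time 18.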
This is sometimes called Job scheduling.
A similar idea (but more drastic and not always so well coordinated) is to force some users to log out, kill processes, and/or block logins if over-committed.
Only LEM jobs during the day (Grumman).
    // A general program
    // that alternates
    // computing with I/O
    //
    // Compute with no I/O
    // I/O with no computing
    // Compute with no I/O
    // I/O with no computing
    // ...
    // Compute with no I/O
    // I/O with no computing
On the right is the general form of many programs, compute, I/O, compute, I/O, ..., compute, I/O. In lab 2 we characterize a program of this type by a tuple of four nonnegative integers (A, B, C, M), only A can be zero.
A is the arrival (or start) time and C is the (total) CPU time. These two were used in the problems I did on the board.
B, the CPU-burst time, and M, the I/O-burst time, generalize the blocks after and blocks for in the previous example. They are used to calculate the times for each of the compute and I/O sections of the program on the right.
Show the detailed output
Remark: Lab 2 assigned.
Start Lecture #8
A race condition occurs when
In other words, there is a race between A and B and the program result differs depending on which one wins the race.
Notes:
interesting case is when one ordering, which occurs most frequently, gives the expected result, and another, rarely occurring, ordering gives an unexpected result.
Imagine two processes both accessing x, which is initially 10.
    A1: LOAD  r1,x        B1: LOAD  r2,x
    A2: ADD   r1,1        B2: SUB   r2,1
    A3: STORE r1,x        B3: STORE r2,x
We must prevent interleaving sections of code that need to be atomic with respect to each other. That is, the conflicting sections need mutual exclusion. If process A is executing its critical section, it excludes process B from executing its critical section. Conversely if process B is executing its critical section, it excludes process A from executing its critical section.
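To see why the two three-instruction sequences need mutual exclusion, one can enumerate every interleaving that preserves each process's own order and record the final value of x. The sketch below models the LOAD/ADD/SUB/STORE steps directly; the encoding of instructions as tuples is just a convenience.

```python
from itertools import combinations

A = [("load", "r1"), ("add", "r1"), ("store", "r1")]   # A1..A3: x <- x+1
B = [("load", "r2"), ("sub", "r2"), ("store", "r2")]   # B1..B3: x <- x-1


def final_values(x0=10):
    """Set of final values of x over all interleavings of A1..A3 with B1..B3."""
    results = set()
    for a_slots in combinations(range(6), 3):   # positions taken by A's steps
        a_iter, b_iter = iter(A), iter(B)
        steps = [next(a_iter) if i in a_slots else next(b_iter) for i in range(6)]
        x, regs = x0, {}
        for op, r in steps:
            if op == "load":
                regs[r] = x
            elif op == "add":
                regs[r] += 1
            elif op == "sub":
                regs[r] -= 1
            else:                               # store
                x = regs[r]
        results.add(x)
    return results


print(sorted(final_values()))
```

Starting from 10, the only correct result is 10, yet interleavings in which both loads precede both stores leave x at 9 or 11, depending on which store happens last.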
Tanenbaum gives four requirements for a critical section implementation.
    loop forever              loop forever
        "ordinary" code           "ordinary" code
        ENTRY code                ENTRY code
        critical section          critical section
        EXIT code                 EXIT code
        ordinary code             ordinary code

We will study only solutions of this kind. Note that higher level solutions, e.g., having one process block when it cannot enter its critical section, are implemented using busy waiting algorithms.
The operating system can choose not to preempt itself. That is, we could choose not to preempt system processes (if the OS is client server) or processes running in system mode (if the OS is self service). Forbidding preemption within the operating system would prevent the problem above where x<--x+1 not being atomic crashed the printer spooler (assume the spooler is part of the OS).
The way to prevent preemption of kernel-mode code is to disable interrupts. Indeed, disabling (i.e., temporarily preventing) interrupts is often done for exactly this reason. This is not, however, a complete solution.
    Initially: P1wants = P2wants = false

    Code for P1                            Code for P2

    Loop forever {                         Loop forever {
        P1wants <-- true         ENTRY         P2wants <-- true
        while (P2wants) {}       ENTRY         while (P1wants) {}
        critical-section                       critical-section
        P1wants <-- false        EXIT          P2wants <-- false
        non-critical-section }                 non-critical-section }
Explain why this works.
But it is wrong!
Why?
Let's try again. The trouble was that setting want before the loop permitted us to get stuck. We had them in the wrong order!
    Initially P1wants=P2wants=false

    Code for P1                            Code for P2

    Loop forever {                         Loop forever {
        while (P2wants) {}       ENTRY         while (P1wants) {}
        P1wants <-- true         ENTRY         P2wants <-- true
        critical-section                       critical-section
        P1wants <-- false        EXIT          P2wants <-- false
        non-critical-section }                 non-critical-section }
Explain why this works.
But it is wrong again!
Why?
Now let's try being polite and really take turns. None of this wanting stuff.
    Initially turn=1

    Code for P1                      Code for P2

    Loop forever {                   Loop forever {
        while (turn = 2) {}              while (turn = 1) {}
        critical-section                 critical-section
        turn <-- 2                       turn <-- 1
        non-critical-section }           non-critical-section }
This one forces alternation, so is not general enough. Specifically, it does not satisfy condition three, which requires that no process in its non-critical section can stop another process from entering its critical section. With alternation, if one process is in its non-critical section (NCS) then the other can enter the CS once but not again.
The first example violated rule 4 (the whole system blocked). The second example violated rule 1 (both in the critical section). The third example violated rule 3 (one process in the NCS stopped another from entering its CS).
In fact, it took years (way back when) to find a correct solution. Many earlier solutions were found and several were published, but all were wrong. The first correct solution was found by a mathematician named Dekker, who combined the ideas of turn and wants. The basic idea is that you take turns when there is contention, but when there is no contention, the requesting process can enter. It is very clever, but I am skipping it (I cover it when I teach distributed operating systems in CSCI-GA.2251). Subsequently, algorithms with better fairness properties were found (e.g., no task has to wait for another task to enter the CS twice).
What follows is Peterson's solution, which also combines wants and turn to force alternation only when there is contention. When Peterson's algorithm was published, it was a surprise to see such a simple solution. In fact Peterson gave a solution for any number of processes. A proof that the algorithm satisfies our properties (including a strong fairness condition) for any number of processes can be found in Operating Systems Review Jan 1990, pp. 18-22.
    Initially P1wants=P2wants=false and turn=1

    Code for P1                                Code for P2

    Loop forever {                             Loop forever {
        P1wants <-- true                           P2wants <-- true
        turn <-- 2                                 turn <-- 1
        while (P2wants and turn=2) {}              while (P1wants and turn=1) {}
        critical-section                           critical-section
        P1wants <-- false                          P2wants <-- false
        non-critical-section }                     non-critical-section }
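Mutual exclusion for the two-process version can be checked mechanically by exploring every reachable state. The sketch below is a tiny hand-rolled state explorer, not a proof for the general case; it assumes each line of the algorithm is one atomic step, numbers the processes 0 and 1 instead of 1 and 2, and folds the non-critical section into the start of the loop. The encoding of the state as (program counters, wants flags, turn) is my own.

```python
def peterson_step(state, i):
    """State after process i (0 or 1) executes one atomic step."""
    pc, want, turn = list(state[0]), list(state[1]), state[2]
    other = 1 - i
    if pc[i] == 0:                      # wants <-- true
        want[i], pc[i] = True, 1
    elif pc[i] == 1:                    # turn <-- other
        turn, pc[i] = other, 2
    elif pc[i] == 2:                    # while (other wants and turn = other) {}
        if not (want[other] and turn == other):
            pc[i] = 3                   # enter the critical section
    else:                               # pc 3: leave the CS, wants <-- false
        want[i], pc[i] = False, 0
    return (tuple(pc), tuple(want), turn)


def reachable_states():
    """All states reachable under any interleaving of the two processes."""
    start = ((0, 0), (False, False), 0)
    seen, frontier = {start}, [start]
    while frontier:
        state = frontier.pop()
        for i in (0, 1):
            nxt = peterson_step(state, i)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


states = reachable_states()
both_in_cs = [s for s in states if s[0] == (3, 3)]
print("states explored:", len(states), "violations:", both_in_cs)
```

No reachable state has both processes at the critical section, and in every reachable state at least one process can make progress, so the deadlock of the first attempt does not occur either.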
Start Lecture #9
Tanenbaum calls this instruction "test and set lock" and writes it TSL.
I believe most computer scientists call it more simply "test and set" and write it TAS.
Everyone agrees on the definition: TAS(b), where b is a binary variable,
ATOMICALLY sets b←true and returns the OLD value of b.
(It would be silly to return the new value of b since we know the new value is true).
The word atomically means that the two actions
performed by TAS(x), testing
x
(i.e., returning its old value)
and setting x (i.e., giving it the value true)
are inseparable.
Specifically it is not possible for two concurrent
TAS(x) operations to both return false (unless
there is also another concurrent statement that sets x to
false).
With TAS available, implementing a critical section for any number of processes is easy.
loop forever {
    while (TAS(s)) {}    ENTRY
    CS
    s<--false            EXIT
    NCS }
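The loop above can be tried out in Python. This is only a sketch under a loud assumption: Python has no test-and-set instruction, so the class TasBit (a hypothetical name) emulates the hardware atomicity with an internal guard lock. The worker function follows the ENTRY, CS, EXIT, NCS structure, with a shared counter increment standing in for the critical section.

```python
import threading

class TasBit:
    """A binary variable with an EMULATED atomic test-and-set (assumption)."""

    def __init__(self):
        self._value = False
        self._guard = threading.Lock()    # stands in for hardware atomicity

    def tas(self):
        """Atomically set the bit to True and return its OLD value."""
        with self._guard:
            old = self._value
            self._value = True
            return old

    def clear(self):
        self._value = False               # s <-- false


def run_workers(num_workers=4, rounds=200):
    """Each worker enters the critical section `rounds` times; return the counter."""
    s = TasBit()
    counter = [0]

    def worker():
        for _ in range(rounds):
            while s.tas():                # ENTRY: spin until TAS returns false
                pass
            counter[0] += 1               # CS
            s.clear()                     # EXIT
            # NCS (empty in this sketch)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

If mutual exclusion holds, run_workers(4, 200) returns 800, since no increment is lost.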
Note: Tanenbaum presents both busy waiting (as above) and blocking (process switching) solutions. We study only busy waiting solutions, which are easier and are used in the blocking solutions. Sleep and Wakeup are the simplest blocking primitives. Sleep voluntarily blocks the process and wakeup unblocks a sleeping process. However, it is far from clear how sleep and wakeup are implemented. Indeed, deep inside, they typically use TAS or some similar primitive. We will not cover these solutions.
Homework: Explain the difference between busy waiting and blocking process synchronization.
Terminology note: Tanenbaum uses the term semaphore only for blocking solutions. I will use the term for our busy waiting solutions (as well as for blocking solutions, which we do not cover). Others call our busy waiting solutions spin locks.
The entry code is often called P and the exit code V. Thus the critical section problem is to write P and V so that the loop on the right satisfies the conditions on the right.
loop forever
    P
    critical-section
    V
    non-critical-section
We have just seen a solution to the critical section problem, namely:
P is    while (TAS(s)) {}
V is    s<--false
Note: When writing pseudo-code, I use indenting carefully (cf. Python) and hence do not need (and sometimes omit) the braces {} used in languages like C or Java.
A binary semaphore abstracts the TAS solution we gave for the critical section problem.
open and
closed (think of S as a gate to a castle).
while (S==closed) {}
S←closed     -- This is NOT the body of the while
where finding S=open and setting S←closed is a single atomic operation.
runs through and closes the gate.
The above code is not real, i.e., it is not an implementation of P. It requires a sequence of two instructions to be atomic and that is, after all, what we are trying to implement in the first place. The above code is, instead, a definition of the effect P is to have.
To repeat: for any number of processes, the critical section problem can be solved using P and V as follows.
loop forever
    P(S)
    CS
    V(S)
    NCS
The only solution we have seen for an arbitrary number of processes is the one just before 2.3.4 with P(S) implemented via test and set.
Note: Peterson's software solution requires each process to know its process number; the TAS solution does not. Moreover the definition of P and V does not permit use of the process number. Thus, strictly speaking, Peterson did not provide an implementation of P and V. He did, however, solve the critical section problem.
To solve other coordination problems we want to extend binary semaphores.
Both of these (related) shortcomings can be overcome by not restricting ourselves to a binary variable, but instead defining a generalized or counting semaphore.
while (S==0) {}
S--
where finding S>0 and decrementing S is atomic
run through and partially close the gate.
Counting semaphores can solve what I call the semi-critical-section problem, where you permit up to k processes in the section. When k=1 we have the original critical-section problem.
initially S=k

loop forever
    P(S)
    SCS     -- semi-critical-section
    V(S)
    NCS
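To see the semi-critical-section pattern run, here is a small Python sketch. It is not an implementation from these notes: the class SpinCountingSemaphore and the function peak_occupancy are hypothetical names, and the atomicity of "find S>0 and decrement" is emulated with an internal guard lock rather than built from binary semaphores. The helper records how many workers are inside the section at once, which should never exceed k.

```python
import threading

class SpinCountingSemaphore:
    """Busy-waiting counting semaphore; atomicity is EMULATED with a lock."""

    def __init__(self, initial):
        self._count = initial
        self._guard = threading.Lock()

    def P(self):
        while True:                       # busy wait
            with self._guard:
                if self._count > 0:       # finding S>0 and decrementing S
                    self._count -= 1      # happen as one atomic action
                    return

    def V(self):
        with self._guard:
            self._count += 1


def peak_occupancy(k=2, num_workers=5, rounds=50):
    """Run workers through the section; return the largest occupancy seen."""
    sem = SpinCountingSemaphore(k)
    stats_guard = threading.Lock()        # bookkeeping only, not part of P/V
    stats = {"inside": 0, "peak": 0}

    def worker():
        for _ in range(rounds):
            sem.P()
            with stats_guard:             # start of SCS
                stats["inside"] += 1
                stats["peak"] = max(stats["peak"], stats["inside"])
            with stats_guard:             # end of SCS
                stats["inside"] -= 1
            sem.V()
            # NCS (empty in this sketch)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["peak"]
```

With k=1 the same code reduces to the ordinary critical-section problem.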
Recall that my definition of semaphore differs from Tanenbaum's (busy waiting vs. blocking); hence it is not surprising that my solutions to various coordination problems also differ from his.
Unlike the previous problems of mutual exclusion where all processes are the same, the producer-consumer problem has two classes of processes
Question: What happens if a producer encounters a
full buffer?
Answer: It must wait for the buffer to become
non-full.
Question: What if a consumer encounters an empty
buffer?
Answer: It must wait for the buffer to become
non-empty.
The producer-consumer problem is also called the bounded buffer problem, which is another example of active entities being replaced by a data structure when viewed at a lower level (Finkel's level principle).
Let k be the size of the buffer (the number of slots). Let e be a counting semaphore (representing the number of empty slots), and let f be a counting semaphore (representing the number of full slots).
Initially e=k, f=0 (counting semaphores)   b=open (binary semaphore)

Producer                           Consumer

loop forever                       loop forever
    produce-item                       P(f)
    P(e)                               P(b); take item from buf; V(b)
    P(b); add item to buf; V(b)        V(e)
    V(f)                               consume-item
We assume the buffer itself is only serially accessible. That is, only one operation can be done at a time. This explains the P(b) V(b) around buffer operations.
I use ; and put three statements on one line to suggest that a buffer insertion or removal is viewed as one atomic operation. Of course this writing style is only a convention; the enforcement of atomicity is done by the P/V.
The P(e), V(f) motif is used to force "bounded alternation".
If k=1 it gives strict alternation.
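The producer and consumer loops can be sketched in Python as follows. Two assumptions should be stated loudly: Python's threading.Semaphore and threading.Lock are blocking primitives, not the busy-waiting ones used in these notes, and the function name run_bounded_buffer and the choice of integers as items are made up for this illustration. The P/V structure is otherwise the same.

```python
import threading
from collections import deque

def run_bounded_buffer(k=3, num_items=20):
    """Pass num_items through a k-slot buffer; return (consumed items, peak size)."""
    e = threading.Semaphore(k)            # counts empty slots, initially k
    f = threading.Semaphore(0)            # counts full slots, initially 0
    b = threading.Lock()                  # binary semaphore guarding the buffer
    buf = deque()
    consumed = []
    peak = [0]

    def producer():
        for item in range(num_items):     # produce-item
            e.acquire()                   # P(e)
            with b:                       # P(b); add item to buf; V(b)
                buf.append(item)
                peak[0] = max(peak[0], len(buf))
            f.release()                   # V(f)

    def consumer():
        for _ in range(num_items):
            f.acquire()                   # P(f)
            with b:                       # P(b); take item from buf; V(b)
                item = buf.popleft()
            e.release()                   # V(e)
            consumed.append(item)         # consume-item

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, peak[0]
```

The peak buffer size never exceeds k because a producer must succeed at P(e) before adding an item.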
Note: Whereas we use the term semaphore to mean binary semaphore and explicitly say generalized or counting semaphore for the positive integer version, Tanenbaum uses semaphore for the positive integer solution and mutex for the binary version. Also, as indicated above, for Tanenbaum semaphore/mutex implies a blocking primitive; whereas I use binary/counting semaphore for both busy-waiting and blocking implementations. Finally, remember that in this course our only solutions are busy-waiting.
Our terminology
 | Busy wait | Block/switch |
---|---|---|
critical | (binary) semaphore | (binary) semaphore |
semi-critical | counting semaphore | counting semaphore |
Tanenbaum's terminology
 | Busy wait | Block/switch |
---|---|---|
critical | enter/leave region | mutex |
semi-critical | no name | semaphore |
You can find some information on barriers in my lecture notes for a follow-on course (see in particular lecture number 16).
We did this previously.
Start Lecture #10
A classic problem from Dijkstra concerning philosophers each of
whose life consists of
loop forever
Eating consists of the following
What algorithm do you use for access to the shared resource (the forks)?
The purpose of mentioning the Dining Philosophers problem without giving the solution is to give a feel of what coordination problems are like. The book gives others as well. The solutions would be covered in a sequel course. If you are interested, look, for example, here.
Homework: In the solution to the dining philosophers problem, why is the state variable set to HUNGRY in the procedure take_forks?
Homework: Consider the procedure put_forks. Suppose that the variable state[i] was set to THINKING after the two calls to test, rather than before. How would this change affect the solution?
As in the producer-consumer problem we have two classes of processes.
The problem is to
Variants
Solutions to the readers-writers problem are quite useful in
multiprocessor operating systems and database systems.
The easy way out
is to treat all processes as writers in
which case the problem reduces to mutual exclusion (P and V).
The disadvantage of the easy way out is that you give up reader
concurrency.
Again for more information see the web page referenced above.
Critical Sections have a form of atomicity, in some ways similar to transactions. But there is a key difference: With critical sections you have certain blocks of code, say A, B, and C, that are mutually exclusive (i.e., are atomic with respect to each other) and other blocks, say D and E, that are mutually exclusive; but blocks from different critical sections, say A and D, are not mutually exclusive.
The day after giving this lecture in 2006-07-spring, I found a
modern reference to the same question.
The quote below is from
Subtleties of Transactional Memory Atomicity Semantics
by Blundell, Lewis, and Martin in
Computer Architecture Letters
(volume 5, number 2, July-Dec. 2006, pp. 65-66).
As mentioned above, busy-waiting (binary) semaphores are often
called locks (or spin locks).
... conversion (of a critical section to a transaction) broadens the scope of atomicity, thus changing the program's semantics: a critical section that was previously atomic only with respect to other critical sections guarded by the same lock is now atomic with respect to all other critical sections.
We began with a subtle bug (wrong answer for x++ and x--) and used it to motivate the Critical Section Problem for which we provided a (software) solution due to Peterson.
We then defined (binary) Semaphores and showed that a Semaphore easily solves the critical section problem and doesn't require knowledge of how many processes are competing for the critical section. We gave an implementation of a binary semaphore using Test-and-Set.
We then gave a definition of a Semaphore (which was not an implementation) and morphed this definition to obtain a definition for a Counting (or Generalized) Semaphore, for which we gave NO implementation. I asserted that a counting semaphore can be implemented using 2 binary semaphores and gave a reference.
We defined the Producer-Consumer (or Bounded Buffer) Problem and showed that it can be solved using counting semaphores (and binary semaphores, which are a special case of counting semaphores).
Finally we briefly discussed some classical problems, but did not give (full) solutions.
Skipped, but you should read.
Note: Deadlocks are closely related to process
management so belong
here, right after chapter 2.
It was here in 2e.
A goal of 3e was to make sure that the basic material gets covered
in one semester.
But I know we will do the first 6 chapters so there is no need for
us to postpone the study of deadlock.
Definition: A deadlock occurs when every member of a set of processes is waiting for an event that can only be caused by a member of the set.
Often the event waited for is the release of a resource.
In the automotive world deadlocks are called gridlocks.
For a computer science example consider two processes A and B that each want to copy a file from a CD to a blank CD-R. Each process needs exclusive access to a CD reader and to a CD burner. Assume the system has exactly one of each device.
It is quite possible for this to work perfectly. If A goes first, gets both devices, does the copy, releases both devices, and then B does the same, all is well.
However, the following problematic scenario is also possible.
Bingo: deadlock!
Definition: A resource is an object that can be granted to a process.
Resources come in two types
The interesting issues arise with non-preemptable resources so those are the ones we study.
The life history of a resource is a sequence of
Processes request the resource, use the resource, and release the resource. The allocate decisions are made by the system and we will study policies used to make these decisions.
A simple example of the trouble you can get into.
P(S); P(T); <regular instructions> V(T); V(S)
all is well.
P(T); P(S); <regular instructions> V(S); V(T)
disaster (deadlock can occur)!
This was the CD-burner/CD-reader example just above.
Recall from the semaphore/critical-section treatment in the last chapter that it is easy to cause trouble if a process dies or stays forever inside its critical section. We assumed processes do not do this. Similarly, we assume that no process retains a resource forever. It may obtain the resource an unbounded number of times (i.e., it can have a loop with a resource request inside), but each time it gets the resource, it must release it eventually.
Definition: A deadlock occurs when every member of a set of processes is waiting for an event that can only be caused by a member of the set.
Often the event waited for is the release of a resource.
The following four conditions (Coffman; Havender) are necessary but not sufficient for deadlock. Repeat: They are not sufficient.
One can say "If you want a deadlock, you must have these four conditions."
But of course you don't actually want a deadlock, so you would more
likely say "If you want to prevent deadlock, you need only violate
one or more of these four conditions."
The first three are static characteristics of the system and resources. That is, for a given system with a fixed set of resources, the first three conditions are either always true or always false: They don't change with time. The truth or falsehood of the last condition does indeed change with time as the resources are requested/allocated/released.
On the right are several examples of a Resource Allocation Graph, also called a Reusable Resource Graph.
Homework: 9. Are all such reusable resource graphs legal?
Consider two concurrent processes P1 and P2 whose programs are.
P1              P2
request R1      request R2
request R2      request R1
release R2      release R1
release R1      release R2
On the board draw the resource allocation graph for various possible executions of the processes, indicating when deadlock occurs and when deadlock is no longer avoidable.
There are four strategies used for dealing with deadlocks.
The "put your head in the sand" approach.
Start Lecture #11
Consider the case in which there is only one instance of each resource.
a printer not a specific printer. Similarly, one can have many CD-ROM drives.
To find a directed cycle in a directed graph is not hard. The algorithm is in the book. The idea is simple.
The searches are finite since there are a finite number of nodes and you stop if you hit a node twice.
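As an illustration of the idea (this is a standard depth-first search, not necessarily the exact algorithm in the book), the following Python sketch looks for a directed cycle. The function name has_cycle, the dictionary representation of the graph, and the sample graphs are assumptions made for this example. A node is marked GRAY while it is on the current search path; reaching a GRAY node again means a cycle has been found.

```python
def has_cycle(graph):
    """graph maps each node to a list of its successor nodes."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            state = color.get(succ, WHITE)
            if state == GRAY:             # hit a node already on this path
                return True
            if state == WHITE and visit(succ):
                return True
        color[node] = BLACK               # every path from node is cycle free
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


# Hypothetical resource allocation graphs. An arc from a process to a
# resource is a request; an arc from a resource to a process means "held by".
deadlocked = {"A": ["S"], "S": ["B"], "B": ["R"], "R": ["A"]}
fine = {"A": ["S"], "S": ["B"], "B": [], "R": ["A"]}
```

Here has_cycle(deadlocked) is True and has_cycle(fine) is False. Each node is finished at most once, so the search is finite.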
This is more difficult.
can possibly). If, even with such demanding processes, the resource manager can ensure that all processes terminate, then the manager can ensure that deadlock is avoided.
Perhaps you can temporarily preempt a resource from a process. Not likely.
Database (and other) systems take periodic checkpoints. If the system does take checkpoints, one can roll back to a checkpoint whenever a deadlock is detected. You must somehow guarantee forward progress.
Can always be done but might be painful. For example some processes have had effects that can't be simply undone. Print, launch a missile, etc.
Note: We are doing 6.6 before 6.5 since 6.6 is easier and I believe serves as a good warm-up.
Attack one of the Coffman/Havender conditions.
The idea is to use spooling instead of mutual exclusion. Not possible for many kinds of resources.
Require each process to request all resources at the beginning of the run. This is often called One Shot.
Normally not possible.
That is, some resources are inherently pre-emptable (e.g., memory).
For those, deadlock is not an issue.
Other resources are non-preemptable, such as a robot arm.
It is often not possible to find a way to preempt one of these
latter resources.
One modern exception, which we shall not study, is if the resource
(say a CD-ROM drive) can be virtualized
(recall
hypervisors).
Establish a fixed ordering of the resources and require that they be requested in this order. So if a process holds resources #34 and #54, it can request only resources #55 and higher.
It is easy to see that a cycle is no longer possible.
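The ordering rule is simple enough to state as a one-line check. In this Python sketch the function name may_request and the use of a set of resource numbers are assumptions made for illustration. The reason a cycle cannot form is that every waiting process waits for a resource numbered higher than everything it holds, so following the waits around a would-be cycle would require the numbers to increase forever.

```python
def may_request(held, wanted):
    """True if requesting resource number `wanted` respects the fixed ordering.

    held   -- set of resource numbers the process currently holds
    wanted -- the resource number being requested
    """
    return not held or wanted > max(held)


def request_order(wanted_resources):
    """Order in which a process should request a batch of resources."""
    return sorted(wanted_resources)
```

For a process holding #34 and #54, may_request({34, 54}, 55) is True while may_request({34, 54}, 40) is False.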
Homework: 10. Consider Figure 6-4. Suppose that in step (o) C requested S instead of requesting R. Would this lead to deadlock? Suppose that it requested both S and R.
Let's see if we can tiptoe through the tulips and avoid deadlock states even though our system does permit all four of the necessary conditions for deadlock.
An optimistic resource manager is one that grants every request as soon as it can. To avoid deadlocks with all four conditions present, the manager must be smart not optimistic.
In this section we assume knowledge of the entire request and release pattern of the processes in advance. Thus we are not presenting a practical solution. I believe this material is useful as motivation for the more nearly practical solution that follows, the Banker's Algorithm.
The diagram below depicts two processes H (horizontal) and V (vertical) executing the programs shown on the right.
H                 V
<reg code>        <reg code>
P(print)          P(plot)
<reg code>        <reg code>
P(plot)           P(print)
<reg code>        <reg code>
V(print)          V(plot)
<reg code>        <reg code>
V(plot)           V(print)
<reg code>        <reg code>
We plot progress of each process along an axis. In the example we show, there are two processes, hence two axes, i.e., planar.
The time periods where the printer and plotter are needed by each process are indicated along the axes and their combined effect is represented by the colors of the squares.
The dashed line represents a possible execution pattern.
The crisis is at hand!
Abandon all hope ye who enter here—Dante.
This procedure is not practical for a general purpose OS since it requires knowing the programs in advance. That is, the resource manager knows in advance what requests each process will make and in what order.
Homework: 17. All the trajectories in the Figure are horizontal or vertical. Under what conditions is it possible for a trajectory to be a diagonal?
Homework: 18, 19.
Avoiding deadlocks given some extra knowledge.
Definition: A state is safe if there is an ordering of the processes such that: if the processes are run in this order, they will all terminate (assuming none exceeds its claim, and assuming each would terminate if all its requests are granted).
Recall the comparison made above between detecting deadlocks with multi-unit resources and the banker's algorithm.
Start Lecture #12
In the definition of a safe state no assumption is made about the running processes. That is, for a state to be safe, termination must occur no matter what the processes do (providing each would terminate if run alone and each never exceeds its claims). Making no assumption on a process's behavior is the same as making the most pessimistic assumption.
Note: When I say pessimistic
I am
speaking from the point of view of the resource manager.
From the manager's viewpoint, the worst thing a process can do is
request resources.
Give an example of each of the following four possibilities. A state that is
Is the figure on the right safe or not?
You can NOT tell until I give you the initial claims of the process.
For the figure on the right, if the initial claims are:
But if the initial claims are instead:
Explain why this is so.
Please do not make the unfortunately common exam mistake of giving an example involving safe states without giving the claims. So if I ask you to draw a resource allocation graph that is safe or if I ask you to draw one that is unsafe, you MUST include the initial claims for each process. I often, but not always, ask such a question and every time I have done so, several students forgot to give the claims and hence lost points.
Remark: Vote on midterm date.
A manager can determine if a state is safe.
The manager then follows the following procedure, which is part of the Banker's Algorithm discovered by Dijkstra, to determine if the state is safe.
Consider the example shown in the table on the right.
process | initial claim | current alloc | max add'l |
---|---|---|---|
X | 3 | 1 | 2 |
Y | 11 | 5 | 6 |
Z | 19 | 10 | 9 |
Total | | 16 | |
Available | | 6 | |
This example is a continuation of example 1 in which Z requested 2 units and the manager (foolishly?) granted the request.
process | initial claim | current alloc | max add'l |
---|---|---|---|
X | 3 | 1 | 2 |
Y | 11 | 5 | 6 |
Z | 19 | 12 | 7 |
Total | | 18 | |
Available | | 4 | |
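The safety check on the two tables above can be carried out mechanically. The Python sketch below is an illustration written for these two examples; the function name is_safe and the dictionary layout are assumptions. It repeatedly looks for a process whose maximum additional need fits in the free units, pretends that process runs to termination and returns its allocation, and declares the state unsafe if it ever gets stuck.

```python
def is_safe(available, processes):
    """Single-resource safety check.

    available -- number of free units
    processes -- maps name -> (current_alloc, max_additional)
    Returns (safe?, order in which processes were shown able to finish).
    """
    remaining = dict(processes)
    free = available
    order = []
    while remaining:
        # Processes whose worst-case additional request can be met right now.
        runnable = [name for name, (_, need) in remaining.items() if need <= free]
        if not runnable:
            return False, order           # nobody is guaranteed to finish
        name = runnable[0]
        alloc, _ = remaining.pop(name)
        free += alloc                     # it terminates and returns its units
        order.append(name)
    return True, order


example1 = {"X": (1, 2), "Y": (5, 6), "Z": (10, 9)}    # 6 units available
example2 = {"X": (1, 2), "Y": (5, 6), "Z": (12, 7)}    # 4 units available
```

For the first table is_safe(6, example1) finds the order X, Y, Z, so the state is safe. For the second, is_safe(4, example2) lets X finish, leaving 5 free units, which is enough for neither Y (needs 6) nor Z (needs 7), so the state is unsafe.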
Notes:
The algorithm is simple: Stay in safe states. For now, we assume that, before execution begins, all processes are present and all initial claims have been given. We will relax these assumptions at the end of the chapter.
In a little more detail the banker's algorithm is as follows.
Homework: 21.
At a high level the algorithm is identical to the one for a single resource type: Stay in safe states.
But what is a safe state in this new setting?
The same definition (if processes are run in a certain order they will all terminate).
Checking for safety is the same idea as above. The difference is that to tell if there are enough free resources for a process to terminate, the manager must check that, for all resource types, the number of free units is at least equal to the max additional need of the process.
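A sketch of the multi-resource check in Python follows. The function name is_safe_multi, the tuple-per-process layout, and the sample numbers are all hypothetical and chosen only to illustrate the "for all resource types" comparison; they are not taken from the homework.

```python
def is_safe_multi(available, alloc, need):
    """Multi-resource safety check.

    available -- tuple of free units, one entry per resource type
    alloc     -- maps process -> tuple of units currently held
    need      -- maps process -> tuple of max additional units it may request
    """
    free = list(available)
    unfinished = set(alloc)
    progress = True
    while unfinished and progress:
        progress = False
        for p in sorted(unfinished):
            # p can finish only if EVERY resource type has enough free units.
            if all(n <= f for n, f in zip(need[p], free)):
                free = [f + a for f, a in zip(free, alloc[p])]
                unfinished.remove(p)
                progress = True
                break                     # rescan with the enlarged free pool
    return not unfinished


# Hypothetical two-resource example.
alloc = {"A": (1, 0), "B": (0, 1)}
need = {"A": (1, 1), "B": (2, 0)}
```

With these made-up numbers, is_safe_multi((1, 1), alloc, need) is True (A can finish, after which B can), while is_safe_multi((0, 1), alloc, need) is False because neither process can be guaranteed its first resource type.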
Homework: Consider a system containing a total of 12 units of resource R and 24 units of resource S managed by the banker's algorithm. There are three processes P1, P2, and P3. P1's claim is 0 units of R and 12 units of S, written (0,12). P2's claim is (8,15). P3's claim is (8,20). Currently P1 has 4 units of S, P2 has 8 units of R, P3 has 8 units of S, and there are no outstanding requests.
Homework: 26, 29, and 38. There is an interesting typo in 26. A has claimed 3 units of resource 5, but there are only 2 units in the entire system. Change the problem by having B both claim and be allocated 1 unit of resource 5.
This is covered extensively in a database text. We will skip it.
We have mostly considered actual hardware resources such as printers, but have also considered more abstract resources such as semaphores.
There are other possibilities. For example a server often waits for a client to make a request. But if the request msg is lost the server is still waiting for the client and the client is waiting for the server to respond to the (lost) last request. Each will wait for the other forever, a deadlock.
A solution
to this communication deadlock would be to use a
timeout so that the client eventually determines that the msg was
lost and sends another.
But it is not nearly that simple: The msg might have been greatly delayed and now the server will get two requests, which could be bad, and is likely to send two replies, which also might be bad.
This gives rise to the serious subject of communication protocols.
Instead of blocking when a resource is not available, a process may (wait and then) try again to obtain it. Now assume process A has the printer, and B the CD-ROM, and each process wants the other resource as well. A will repeatedly request the CD-ROM and B will repeatedly request the printer. Neither can ever succeed since the other process holds the desired resource. Since no process is blocked, this is not technically deadlock, but a related concept called livelock.
As usual FCFS is a good cure. Often this is done by priority aging and picking the highest priority process to get the resource. Also can periodically stop accepting new processes until all old ones get their resources.
Read.
Note: End of material on midterm. Also included are labs 1 and 2. Not the programs in the labs but the OS concepts. Indeed both are on the practice exam.
Start Lecture #13
Remark: Solutions to chapter 1 and chapter 2 homework are on nyu classes (they are "resources").
Remark: Midterm exam 27 Oct 2016.
Also called storage management or space management.
The memory manager must deal with the storage hierarchy present in modern machines.
The same questions are asked about the cache ↔ central memory boundary when one studies computer architecture. Surprisingly, the terminology is almost completely different!
We will see in the next few weeks that there are three independent decisions:
Memory management implements address translation.
Homework: What is the difference between a physical address and a virtual address?
linker lab)
does this, but it is trivial since we assume the linked program will be loaded at 0.
Note: I will place ** before each memory management scheme.
Remark: Do HW problem from last time.
The entire process remains in memory from start to finish and does not move.
The sum of the memory requirements of all jobs in the system cannot exceed the size of physical memory.
The good old days
when everything was easy (for the OS).
root piece is always memory resident.
The mythical man month) remarked that the OS/360 linkage editor was terrific, especially in its support for overlays, but by the time it came out, overlays were no longer used.
This can be done via swapping if you have only one program loaded at a time. A more general version of swapping is discussed below.
One can also support a limited form of multiprogramming, similar to MFT (which is described next). In this limited version, the loader relocates all relative addresses, thus permitting multiple processes to coexist in physical memory the way your linker permitted multiple modules in a single process to coexist.
Two goals of multiprogramming are to improve CPU utilization, by overlapping CPU and I/O, and to permit short jobs to finish quickly.
list instead of one queue for each partition.
segment(i.e., the virtual address space is contiguous). We will discuss segments later.
establish addressability.
Just as the process concept creates a kind of abstract CPU to run programs (each process acts as though it is the only one running), the address space creates a kind of abstract memory for programs to live in.
Address spaces do for processes what you so kindly did for modules in the linker lab. That is, address spaces permit each process to believe it has its own memory starting at address zero.
Base and limit registers are additional hardware, invisible to the programmer, that supports multiprogramming by automatically adding the base address (i.e., the value in the base register) to every relative address when that address is accessed at run time.
In addition the relative address is compared against the value in the limit register and, if larger, the process is aborted since it has exceeded its memory bound. Compare this to your error checking in the linker lab.
The base and limit registers are set by the OS when the process starts.
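What the hardware does on each access can be written as a few lines of Python. This is only a software sketch of the check described above; the function name translate, the exception MemoryBoundError, and the sample register values are assumptions.

```python
class MemoryBoundError(Exception):
    """Raised when a relative address exceeds the limit register (hypothetical name)."""


def translate(relative_address, base, limit):
    """Return the physical address for a relative address.

    base  -- value of the base register (where the process was loaded)
    limit -- value of the limit register
    """
    if relative_address > limit:          # compared against the limit register
        raise MemoryBoundError(
            f"address {relative_address} exceeds limit {limit}")
    return base + relative_address        # base added on every access
```

For a process loaded at 5000 with limit 1000, translate(100, base=5000, limit=1000) gives 5100, and an access to relative address 1001 aborts the process (here, raises the exception).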
Moving an entire process back and forth between disk and memory is called swapping.
Both the number and size of the partitions change with time.
runs of processes, but not of holes.
Homework: 3. A swapping system eliminates holes by compaction. Assume a random distribution of holes and data segments, assume the data segments are much bigger than the holes, and assume a time to read or write a 32-bit memory word of 4ns. About how long does it take to compact 4 GB? For simplicity, assume that word 0 is part of a hole and the highest word in memory contains valid data.
MVT introduces the Placement Question. That is, into which hole (partition) should we place the process when several holes are big enough?
There are several possibilities, including best fit, worst fit, first fit, circular first fit, quick fit, next fit, and Buddy.
A current favorite is circular first fit, also known as next fit.
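Several of these policies differ only in how they scan the list of holes, which the following Python sketch tries to make concrete. The function names, the representation of memory as a plain list of hole sizes, and the sample sizes are assumptions made for illustration; each function returns the index of the chosen hole, or None if no hole is big enough.

```python
def first_fit(holes, size):
    """Choose the first hole (in memory order) that is big enough."""
    for i, hole in enumerate(holes):
        if hole >= size:
            return i
    return None


def next_fit(holes, size, start):
    """Circular first fit: like first fit, but begin scanning at index `start`."""
    n = len(holes)
    for offset in range(n):
        i = (start + offset) % n
        if holes[i] >= size:
            return i
    return None


def best_fit(holes, size):
    """Choose the smallest hole that is big enough."""
    candidates = [(hole, i) for i, hole in enumerate(holes) if hole >= size]
    return min(candidates)[1] if candidates else None


def worst_fit(holes, size):
    """Choose the largest hole."""
    candidates = [(hole, i) for i, hole in enumerate(holes) if hole >= size]
    return max(candidates)[1] if candidates else None


holes = [8, 3, 12, 5]     # hypothetical hole sizes, in memory order
```

For a request of size 4, first fit picks index 0 (the 8), best fit picks index 3 (the 5), worst fit picks index 2 (the 12), and next fit starting at index 1 picks index 2.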
Homework: 4. Consider a swapping system in which memory consists of the following hole sizes in memory order: 10MB, 4MB, 20MB, 18MB, 7MB, 9MB, 12MB, and 15MB. Using first fit, which hole is taken for successive segment requests of
Buddy comes with its own implementation. How about the others?
Divide memory into blocks and associate a bit with each block, used to indicate if the corresponding block is free or allocated. To find a chunk of N blocks, one needs to find N consecutive bits each indicating a free block.
The only design question is how much memory does one bit represent.
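The search for N consecutive free blocks is a linear scan. In this Python sketch the function name find_free_run and the convention that 0 means free and 1 means allocated are assumptions; the bit map is represented as a list of integers for readability rather than as packed bits.

```python
def find_free_run(bitmap, n):
    """Return the index of the first run of n free blocks, or None.

    bitmap -- list of bits, 0 = free block, 1 = allocated block (assumed convention)
    """
    run_start, run_len = 0, 0
    for i, bit in enumerate(bitmap):
        if bit == 0:
            if run_len == 0:
                run_start = i             # a new run of free blocks begins here
            run_len += 1
            if run_len == n:
                return run_start
        else:
            run_len = 0                   # an allocated block ends the run
    return None


bitmap = [1, 0, 0, 1, 0, 0, 0, 1]         # hypothetical 8-block memory
```

Here a request for 2 blocks is satisfied at block 1, a request for 3 blocks at block 4, and a request for 4 blocks fails.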
Instead of a bit map, use a linked list of nodes where each node corresponds to a region of memory either allocated to a process or still available (a hole).
See Knuth, The Art of Computer Programming vol 1.
MVT Introduces the Replacement Question. That is, which victim should we swap out when we need to free up some memory?
This is an example of the suspend arc mentioned in process scheduling.
We will study this question more when we discuss demand paging in which case we swap out only part of a process.
Considerations in choosing a victim
Notes:
pagingas a synonym for what I call demand paging. This is unfortunate as it mixes together two concepts.
shouldbe used in contrast with physical memory to describe any virtual to physical address translation.
Paging is the simplest scheme to remove the requirement of contiguous physical memory and the potentially large external fragmentation that it causes.
Start Lecture #14
The figure on the right shows how to translate a virtual address into the corresponding physical address, i.e. how to find where in physical memory a given virtual address resides.
Example: Given a machine with page size (PS) =
frame size = 1000.
What is the physical address (PA) corresponding to the virtual
address (VA) = 3372?
Properties of (non-demand) paging (without segmentation).
random points in the program and can change from run to run (the page size can change with no effect on the program other than performance), pages are not appropriate units of memory to use for protection and sharing. Segmentation, which is discussed later, is sometimes more appropriate for protection and sharing.
If all you have is a hammer, everything looks like a nail.
There seems to be a bunch of arithmetic and an extra memory reference (to the page table) required, which would make the scheme totally impractical.
The arithmetic is not needed. Indeed you and I can divide by 1000 (the page size) or take mod 1000 in our heads by separating the rightmost three digits from the rest. That is because 1000 is 10^3 and we write numbers in decimal.
Computers use binary so designers choose the page size to be a power of two and again dividing by the page size and calculating mod page size simply require separating the leftmost bits from the rightmost.
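In code, "separating the bits" is a shift and a mask. The Python sketch below assumes a page size of 4096 bytes purely for illustration; the names PAGE_SIZE, OFFSET_BITS, and split are made up for this example.

```python
PAGE_SIZE = 4096                              # assumed; must be a power of two
OFFSET_BITS = PAGE_SIZE.bit_length() - 1      # 12 bits of offset for 4096


def split(virtual_address):
    """Return (page number, offset) using only a shift and a mask."""
    page = virtual_address >> OFFSET_BITS         # leftmost bits
    offset = virtual_address & (PAGE_SIZE - 1)    # rightmost OFFSET_BITS bits
    return page, offset
```

For example, split(0x3A7F) gives page 3 and offset 0xA7F, the same answer as dividing by 4096 and taking the remainder, but with no division performed.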
It does seem as though each memory reference turns into 2 memory references.
This would indeed be a disaster! But it isn't done that way. Instead, the MMU caches page#→frame# translations. This cache is kept near the processor and can be accessed rapidly.
This cache is called a translation lookaside buffer (TLB) or translation buffer (TB).
For the above example, after referencing virtual address 3372, there would be an entry in the TLB containing the mapping 3→459.
Hence a subsequent access to virtual address 3881 would be translated to physical address 459881 without an extra memory reference. Naturally, a memory reference for location 459881 itself would be required.
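The effect of the TLB in this example can be mimicked with a small Python sketch. It is a toy model under loud assumptions: the class name Mmu is hypothetical, the TLB is an unbounded dictionary (a real TLB is small and must evict entries), and the decimal page size 1000 is kept only to match the example. The counter page_table_reads records how many extra memory references to the page table were needed.

```python
PAGE_SIZE = 1000                  # decimal page size, as in the running example


class Mmu:
    """Toy MMU that caches page# -> frame# translations (assumed model)."""

    def __init__(self, page_table):
        self.page_table = page_table      # page number -> frame number
        self.tlb = {}                     # cached translations
        self.page_table_reads = 0         # extra memory references made

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.tlb:          # TLB miss: consult the page table
            self.page_table_reads += 1
            self.tlb[page] = self.page_table[page]
        return self.tlb[page] * PAGE_SIZE + offset


mmu = Mmu({3: 459})
first = mmu.translate(3372)       # TLB miss, reads the page table once
second = mmu.translate(3881)      # TLB hit, no page table reference
```

After these two calls, first is 459372, second is 459881, and mmu.page_table_reads is still 1.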
Choosing the page size is discussed below.
Homework: 7. Using the page table of Fig. 3.9, give the physical address corresponding to each of the following virtual addresses.
The idea is to enable a program to execute even if only the active portion of its address space is memory resident. That is, we are to swap in and swap out portions of a program and can run a program even if some (perhaps most) of the program is not in memory.
In a crude sense this could be called "automatic overlays".
Advantages
Disadvantages
The memory management unit is a piece of hardware in the processor that, together with the OS, translates virtual addresses (i.e., the addresses in the program) into physical addresses (i.e., real hardware addresses in the memory). The memory management unit is abbreviated as and normally referred to as the MMU.
(The idea of an MMU and virtual to physical address translation applies equally well to non-demand paging and in olden days the meaning of paging and virtual memory included that case as well. Sadly, in my opinion, modern usage of the terms paging and virtual memory is limited to fetch-on-demand memory systems, typically some form of demand paging.)
The idea is to fetch pages from disk to memory when they are referenced, hoping to get the most actively used pages in memory. The choice of page size is discussed below.
Demand paging is very common: More complicated variants, multilevel-level paging and paging plus segmentation (both of which we will discuss), have been used and the former dominates modern operating systems.
Started by the Atlas system at Manchester University in the 60s (paper by Fotheringham).
Each PTE continues to contain the frame number if the page is loaded. But what if the page is not loaded (i.e., the page exists only on disk)?
The PTE has a flag indicating if the page is loaded (can think of the X in the diagram on the right as indicating that this flag is not set). If the page is not loaded, the location on disk could be kept in the PTE, but normally it is not (discussed below).
When a reference is made to a non-loaded page (sometimes called a non-existent page, but that is a bad name), the system has a lot of work to do. (We give more details below.)
Really not done quite this way as we shall see later
Homework: 14. A machine has a 32-bit address space and an 8-KB page. The page table is entirely in hardware, with one 32-bit word per entry. When a process starts, the page table is copied to the hardware from memory, at one word every 100 nsec. If each process runs for 100 msec (including the time to load the page table), what fraction of the CPU time is devoted to loading the page tables?
A discussion of page tables is also appropriate for (non-demand) paging, but the issues are more important with demand paging for at least two reasons.
We must be able to access the page table very quickly since it is needed for every memory access.
Unfortunate laws of hardware.
So we can't just say, put the page table in fast processor registers, and let it be huge, and sell the system for $1000.
The simplest solution is to put the page table in main memory as shown on the right. However this solution seems to be both too slow and too big.
The fix is to use multiple levels of mapping. We will see two examples below: multilevel page tables and segmentation plus paging.
Each page has a corresponding page table entry (PTE). The information in a PTE is used by the hardware and its format is machine dependent; thus the OS routines that access PTEs are not portable. Information set by and used by the OS is normally kept in other OS tables.
(Actually some systems, those with software TLB reload, do not require hardware access to the page table.)
The page table is indexed by the page number; thus the page number is not stored in the table.
The following fields are often present in a PTE.
recently). It is used to select a victim: unreferenced pages make good victims due to the locality property (discussed below).
Question: Why not store the disk addresses of
non-resident pages in the PTE?
Answer: On most systems the PTEs are accessed by
the hardware automatically on a TLB miss (see immediately below).
Thus the format of the PTEs is determined by the hardware and
contains only information used on page hits.
Hence the disk address, which is only used on page faults, is not
present.
As mentioned above, the simple scheme of storing the page table in its entirety in central memory alone appears to be both too slow and too big. We address both these issues here, but note that a second solution (segmentation) to the size question is discussed later.
Note: Tanenbaum suggests that "associative memory" and "translation lookaside buffer" are synonyms. This is wrong. Associative memory is a general concept of which translation lookaside buffer is a specific example.
An associative memory is a content addressable memory. That is, you access the memory by giving the value of some field (called the index) and the hardware searches all the records and returns the record whose index field contains the requested value.
For example
Name  | Animal | Mood     | Color
======+========+==========+======
Moris | Cat    | Finicky  | Grey
Fido  | Dog    | Friendly | Black
Izzy  | Iguana | Quiet    | Brown
Bud   | Frog   | Smashed  | Green
If the index field is Animal and Iguana is given, the associative memory returns
Izzy | Iguana | Quiet | Brown
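A rough software model of this lookup, using the table above, might look as follows. The function name associative_lookup is hypothetical, and the loop only models the result: real associative hardware compares all records at once.

```python
records = [
    {"Name": "Moris", "Animal": "Cat", "Mood": "Finicky", "Color": "Grey"},
    {"Name": "Fido", "Animal": "Dog", "Mood": "Friendly", "Color": "Black"},
    {"Name": "Izzy", "Animal": "Iguana", "Mood": "Quiet", "Color": "Brown"},
    {"Name": "Bud", "Animal": "Frog", "Mood": "Smashed", "Color": "Green"},
]


def associative_lookup(index_field, value):
    """Return the record whose index field holds value, or None if absent."""
    for record in records:          # hardware would search all records in parallel
        if record[index_field] == value:
            return record
    return None


print(associative_lookup("Animal", "Iguana"))
```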
A Translation Lookaside Buffer or TLB is an associative memory where the index field is the page number. The other fields include the frame number, dirty bit, valid bit, etc.
Note that, unlike the situation with the page table, the page number is stored in the TLB; indeed it is the index field.
A TLB is small and expensive but at least it is fast. When the page number is in the TLB, the frame number is returned very quickly.
On a miss, a TLB reload is performed. The page number is looked up in the page table. The record found is placed in the TLB and a victim is discarded (not really discarded, dirty and referenced bits are copied back to the PTE). There is no placement question since all TLB entries are accessed at once and hence are equally suitable. But there is a replacement question.
Homework: 22. A computer whose processes have 1024 pages in their address spaces keeps its page tables in memory. The overhead required for reading a word from the page table is 5 nsec. To reduce this overhead, the computer has a TLB, which holds 32 (virtual page, physical page frame) pairs, and can do a look up in 1 nsec. What hit rate is needed to reduce the mean overhead to 2 nsec?
As the size of the TLB has grown, some processors have switched from single-level, fully-associative, unified TLBs to multi-level, set-associative, separate instruction and data, TLBs.
We are actually discussing caching, but using different terminology.
The words above assume that, on a TLB miss, the MMU (i.e., hardware and not the OS) loads the TLB with the needed PTE and then performs the virtual to physical address translation.
Some newer systems do this in software, i.e., the OS is involved.
Recall the diagram above showing the data and stack growing towards each other. Most of the virtual memory is the unused space between the data and stack regions. However, with demand paging this space does not waste real memory. But the single large page table does waste real memory.
The idea of multi-level page tables (a similar idea is used in Unix i-node-based file systems, which we study later when we do I/O) is to add a level of indirection and have a page table containing pointers to page tables.
This idea can be extended to three or more levels. The largest I know of has four levels. We will be content with two levels.
For a two level page table the virtual address is divided into three pieces
+-----+-----+-------+
| P#1 | P#2 | Offset|
+-----+-----+-------+
Do an example on the board.
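In lieu of the board, here is a small sketch of a two-level lookup. The field widths (10 bits for P#1, 10 bits for P#2, 12 bits for the offset, i.e., a 32-bit address with 4KB pages) and the table contents are assumptions chosen for illustration, not values taken from a particular machine.

```python
P1_BITS, P2_BITS, OFFSET_BITS = 10, 10, 12   # assumed split of a 32-bit address


def split(vaddr):
    """Divide a virtual address into (P#1, P#2, offset)."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    p2 = (vaddr >> OFFSET_BITS) & ((1 << P2_BITS) - 1)
    p1 = vaddr >> (OFFSET_BITS + P2_BITS)
    return p1, p2, offset


# Hypothetical tables: the top-level table maps P#1 to a second-level table,
# and each second-level table maps P#2 to a frame number.
top_level = {1: {3: 7}}


def translate_two_level(vaddr):
    p1, p2, offset = split(vaddr)
    second_level = top_level[p1]     # a KeyError here models an unmapped region
    frame = second_level[p2]
    return (frame << OFFSET_BITS) | offset


vaddr = (1 << 22) | (3 << 12) | 4
print(split(vaddr))                      # (1, 3, 4)
print(hex(translate_two_level(vaddr)))   # 0x7004
```

Note that second-level tables exist only for the P#1 values actually in use, which is where the space saving comes from.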
The VAX used a 2-level page table structure, but with some wrinkles (see Tanenbaum for details).
Naturally, there is no need to stop at 2 levels. In fact the SPARC has 3 levels and the Motorola 68030 has 4 (and the number of bits of Virtual Address used for P#1, P#2, P#3, and P#4 can be varied). More recently, x86-64 also has 4-levels.
For many systems the virtual address range is much bigger than the size of physical memory. In particular, with 64-bit addresses, the range is 2^64 bytes, which is 16 million terabytes. If the page size is 4KB and a PTE is 4 bytes, a full page table would be 16 thousand terabytes.
A two level table would still need 16 terabytes for the first level table, which is stored in memory. A three level table reduces this to 16 gigabytes, which is still large and only a 4-level table gives a reasonable memory footprint of 16 megabytes.
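The arithmetic behind these figures can be checked with a short calculation. The sketch assumes that every lower-level table occupies exactly one 4KB page (1024 four-byte PTEs) and that sizes are quoted in binary units; both assumptions are mine, chosen because they reproduce the numbers above.

```python
ADDRESS_BITS = 64
PAGE_SIZE = 4 * 1024                      # 4KB pages
PTE_SIZE = 4                              # 4-byte PTEs
PTES_PER_TABLE = PAGE_SIZE // PTE_SIZE    # assumption: lower-level tables are one page each


def top_level_bytes(levels):
    """Size in bytes of the top-level table when the full address space is mapped."""
    pages = 2 ** ADDRESS_BITS // PAGE_SIZE
    entries = pages // PTES_PER_TABLE ** (levels - 1)
    return entries * PTE_SIZE


for levels, unit, name in [(1, 2 ** 40, "TB"), (2, 2 ** 40, "TB"),
                           (3, 2 ** 30, "GB"), (4, 2 ** 20, "MB")]:
    print(levels, "level(s):", top_level_bytes(levels) // unit, name)
```

Each added level divides the top-level table by 1024, which is why four levels are needed before the footprint becomes reasonable.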
An alternative is to instead keep a table indexed by frame number. The content of entry f contains the number of the page currently loaded in frame f. This is often called a frame table as well as an inverted page table.
Now there is one entry per frame. Again using 4KB pages and 4 byte PTEs, we see that the table would be a constant 0.1% of the size of real memory.
But on a TLB miss, the system must search the inverted page table, which would be hopelessly slow except that some tricks are employed. Specifically, hashing is used.
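A minimal sketch of the idea, assuming a hash table keyed by (process id, page number) sits alongside the frame table. The names frame_table, hash_index, load, and lookup are hypothetical, and Python's dict stands in for whatever hashing scheme the hardware or OS actually uses.

```python
NUM_FRAMES = 8
frame_table = [None] * NUM_FRAMES   # entry f holds the (pid, page) loaded in frame f
hash_index = {}                     # (pid, page) -> frame, so no linear search is needed


def load(pid, page, frame):
    """Record that (pid, page) now occupies the given frame."""
    old = frame_table[frame]
    if old is not None:
        del hash_index[old]         # the previous occupant is no longer resident
    frame_table[frame] = (pid, page)
    hash_index[(pid, page)] = frame


def lookup(pid, page):
    """Return the frame holding (pid, page), or None (a page fault)."""
    return hash_index.get((pid, page))


load(1, 3, 5)
print(lookup(1, 3))   # 5
load(2, 9, 5)         # frame 5 is reused for another process's page
print(lookup(1, 3))   # None
```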
Also it is often convenient to have an inverted table as we will see when we study global page replacement algorithms. Some systems keep both page and inverted page tables.
Start Lecture #15
These are solutions to the replacement question. Good solutions take advantage of locality when choosing the victim page to replace.
likely to be referenced. So it is good to bring in the entire page on a miss and to keep the page in memory for a while.
When programs begin there is no history so nothing to base locality on. At this point the paging system is said to be undergoing a "cold start".
Programs exhibit "phase changes" in which the set of pages referenced changes abruptly (similar to a cold start). An example occurs in your linker lab when you finish pass 1 and start pass 2. At the point of a phase change, many page faults occur because locality is poor.
Pages belonging to processes that have terminated are of course perfect choices for victims.
Pages belonging to processes that have been blocked for a long time are good choices as well.
A lower bound on performance. Any decent scheme should do better.
Replace the page whose next reference will be furthest in the future.
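Although this policy cannot be implemented in a real system (it needs the future), it is easy to simulate when the whole reference string is known in advance. The following is a sketch under that assumption; the function name optimal_faults and the reference string are chosen only for illustration.

```python
def optimal_faults(refs, num_frames):
    """Count page faults when the victim is the page whose next use is furthest away."""
    frames = []
    faults = 0
    for i, page in enumerate(refs):
        if page in frames:
            continue
        faults += 1
        if len(frames) < num_frames:
            frames.append(page)
            continue
        future = refs[i + 1:]
        # Pages never referenced again are treated as being used infinitely far away.
        victim = max(frames,
                     key=lambda p: future.index(p) if p in future else len(future))
        frames[frames.index(victim)] = page
    return faults


refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(optimal_faults(refs, 3))   # 7
print(optimal_faults(refs, 4))   # 6
```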
Divide the frames into four classes and make a random selection from the lowest nonempty class.
Assumes that in each PTE there are two extra flags R (for referenced; sometimes called U, for used) and M (for modified; often called D, for dirty).
NRU is based on the belief that a page in a lower priority class is a better victim.
Implementation
Old cartoons often had prisoners wearing broad horizontal stripes and using sledge hammers to break up rocks.
This gives what I sometimes call the prisoner problem: If you do a good job of making little ones out of big ones, but a poor job of the reverse, you soon wind up with all little ones.
In this case we do a great job setting R but rarely reset it. We need more resets. Therefore, every k clock ticks, the OS resets all R bits.
Question: Why not reset M as well?
Answer: If a dirty page has a clear M, we will not
copy the page back to disk when it is evicted, and thus the only
accurate version of the page will be lost!
What if the hardware doesn't set these bits?
Answer: The OS can use tricks.
When the bits are reset, the PTE is made to indicate that the page
is not resident (which is a lie).
On the ensuing page fault, the OS sets the appropriate bit(s).
The R and M bits determine the NRU class
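One way to sketch the selection is below. The numbering class = 2*R + M follows the usual textbook convention (class 0 is unreferenced and clean, class 3 is referenced and dirty); the page names and function names are hypothetical.

```python
import random


def nru_class(r, m):
    """Class 0: R=0,M=0; class 1: R=0,M=1; class 2: R=1,M=0; class 3: R=1,M=1."""
    return 2 * r + m


def choose_victim(pages, rng=random):
    """pages maps page -> (R, M). Pick at random from the lowest nonempty class."""
    lowest = min(nru_class(r, m) for r, m in pages.values())
    candidates = [p for p, (r, m) in pages.items() if nru_class(r, m) == lowest]
    return rng.choice(candidates)


pages = {"A": (1, 0), "B": (0, 1), "C": (1, 1)}
print(choose_victim(pages))   # B, the only page in class 1
```

Note that an unreferenced dirty page (class 1) is preferred as a victim over a referenced clean page (class 2).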
Simple algorithm.
Basically, we try to be "fair" to the pages: the first one loaded is the first one evicted.
The natural implementation is to have a queue of nodes each referring to a resident page (i.e., pointing to a frame).
This sounds reasonable at first, but it is not a good policy. The trouble is that a page referenced say every other memory reference and thus very likely to be referenced soon will be evicted because we only look at the first reference to a page, when we should be particularly interested in recent references to the page.
Similar to the FIFO PRA, but altered so that a page recently referenced is given a second chance.
Same algorithm as 2nd chance, but uses a better implementation, namely a circular list with a single pointer serving as both head and tail pointer.
We assume that the most common operation is to choose a victim and replace it by a new page.
clock PRA.)
oldest, unreferenced page by a given new page.
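A minimal simulation of the clock implementation might look like this. It is a sketch, not a description of any real kernel: the class name Clock is hypothetical, R bits are set only by the simulated references, and a newly loaded page is given R=1 because the faulting reference itself touches it.

```python
class Clock:
    def __init__(self, num_frames):
        self.pages = [None] * num_frames   # circular list of resident pages
        self.r = [0] * num_frames          # reference bit for each frame
        self.hand = 0                      # single pointer, both head and tail

    def reference(self, page):
        """Simulate one memory reference; return True if it page faults."""
        if page in self.pages:
            self.r[self.pages.index(page)] = 1   # hardware would set R
            return False
        # Sweep past referenced pages, clearing R (the second chance).
        while self.pages[self.hand] is not None and self.r[self.hand]:
            self.r[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page       # replace the oldest unreferenced page
        self.r[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.pages)
        return True


clock = Clock(3)
faults = [clock.reference(p) for p in [0, 1, 2, 3, 1, 4]]
print(clock.pages, sum(faults))   # [3, 1, 4] 5
```

In this run page 1 is older than page 2, but it was referenced after the previous sweep, so the hand skips it and evicts page 2 instead; plain FIFO would have evicted page 1.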
Question: Why is this terrible?
Answer: All but the last frame are frozen once
loaded so you can replace only one frame.
This is especially bad after a phase shift in the program as now the
program is referencing mostly new pages but only one frame is
available to hold them.
When a page fault occurs, choose as victim that page that has been unused for the longest time, i.e. the one that has been least recently used.
LRU is definitely
Homework: 28. If FIFO page replacement is used with four page frames and eight pages, how many page faults will occur with the reference string 0172327103 if the four frames are initially empty? Now repeat this problem for LRU.
Page | Loaded | Last ref. | R | M |
---|---|---|---|---|
0 | 126 | 280 | 1 | 0 |
1 | 230 | 265 | 0 | 1 |
2 | 140 | 270 | 0 | 0 |
3 | 110 | 285 | 1 | 1 |
Homework: 36. A computer has four page frames. The time of loading, time of last access, and the R and M bits for each page are shown on the right (the times in clock ticks).
A clever hardware method to determine the LRU page.
Keep a count of how frequently each page is used and evict the one that has the lowest score. Specifically:
R | counter |
---|---|
1 | 10000000 |
0 | 01000000 |
1 | 10100000 |
1 | 11010000 |
0 | 01101000 |
0 | 00110100 |
1 | 10011010 |
1 | 11001101 |
0 | 01100110 |
NFU doesn't distinguish between old references and recent ones. The following modification does distinguish.
Aging does indeed give more weight to later references, but an n bit counter maintains data for only n time intervals; whereas NFU maintains data for at least 2^n intervals.
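The R/counter table shown above can be reproduced with a one-line update rule: shift the counter right one bit and insert the current R bit at the left. The sketch below assumes an 8-bit counter, matching the table; the function name age is hypothetical.

```python
def age(counter, r, bits=8):
    """Shift the counter right one bit and insert R as the new leftmost bit."""
    return (counter >> 1) | (r << (bits - 1))


counter = 0
for r in [1, 0, 1, 1, 0, 0, 1, 1, 0]:      # the R column of the table
    counter = age(counter, r)
    print(r, format(counter, "08b"))
```

After eight ticks the oldest R bit falls off the right end, which is exactly why an n bit counter remembers only n intervals.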
Homework: 30. A small computer on a smart card has four page frames. At the first clock tick, the R bits are 0111 (page 0 is 0, the rest are 1). At subsequent clock ticks, the values are 1011, 1010, 1101, 0010, 1010, 1100 and 0001. If the aging algorithm is used with an 8-bit counter, give the values of the four counters after the last tick.
Homework: 42. It has been observed that the number of instructions executed between page faults is directly proportional to the number of page frames allocated to a program. If the available memory is doubled, the mean interval between page faults is also doubled. Suppose that a normal instruction takes 1 us, but if a page fault occurs, it takes 2001 us. If a program takes 60 sec to run, during which time it gets 15,000 page faults, how long would it take to run if twice as much memory were available?
The goals of the working set policy are first to determine which pages a given process needs to have memory resident in order for the process to run without too many page faults and second to ensure that these pages are indeed resident.
But this is impossible since it requires predicting the future. So we again make the assumption that the near future is well approximated by the immediate past.
We measure time in units of memory references, so t=1045 means the time when the 1045th memory reference is issued. In fact we measure time separately for each process, so t=1045 really means the time when this process made its 1045th memory reference.
Definition: w(k,t), the working set at time t (with window k) is the set of pages referenced by the last k memory references ending at reference t.
The idea of the working set policy is to ensure that each process keeps its working set in memory.
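Given a recorded reference string, the definition translates directly into code. This is only a sketch of the definition, not of how a system would track the working set; the function name working_set and the sample string are hypothetical, and t is 1-based as in the text.

```python
def working_set(refs, k, t):
    """w(k,t): the set of pages referenced by the last k references ending at reference t."""
    start = max(0, t - k)          # early in execution fewer than k references exist
    return set(refs[start:t])


refs = [1, 2, 1, 3, 4, 4, 4, 2]
print(working_set(refs, 4, 5))     # {1, 2, 3, 4}
print(working_set(refs, 3, 7))     # {4}
```

The second call shows the working set shrinking when the process settles into a tight loop over a single page.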
Unfortunately, determining w(k,t) precisely is quite time consuming. It is never done in real systems. Instead approximations are used, as we shall see.
Homework: Describe a program that runs for a long time (say hours) and always has a working set size less than 10. Assume k=100,000 and the page size is 4KB. The program need not be practical or useful.
Homework: Describe a program that runs for a long time and (except for the very beginning of execution) always has a working set size greater than 1000. Again assume k=100,000 and the page size is 4KB. The program need not be practical or useful.
The definition of Working Set is local to a process. That is, each process has a working set; there is no system wide working set other than the union of all the working sets of each process.
However, the working set of a single process has effects on the demand paging behavior and victim selection of other processes. If a process's working set is growing in size, i.e., |w(k,t)| is increasing as t increases, then we need to obtain new frames from other processes. A process with a working set decreasing in size is a source of free frames. We will see below that this is an interesting amalgam of local and global replacement policies.
Interesting questions concerning the working set include:
... Various approximations to the working set have been devised. We will study two: Using virtual time instead of memory references (immediately below), and Page Fault Frequency (part of section 3.5.1). In 3.4.9 we will introduce the popular WSClock algorithm that includes an approximation of the working set as well as several other ideas.
Start Lecture #16
Start Lecture #17
Remarks:
Instead of counting memory references and declaring a page in the working set if it was used within k references, we declare a page in the working set if it was used in the past τ seconds. This is easier to do since the system already keeps track of time for scheduling (and perhaps accounting). Note that the time is measured only while this process is running, i.e., we are using virtual time.
What follows is a possible working-set algorithm using virtual time.
Use reference and modify bits R and M as described above. As usual, the OS clears both bits when a page is loaded and clears R every m milliseconds. The hardware sets M on writes and sets R on every access.
Add a field time of last use
to the PTE.
The procedure for setting this field is below.
If the reference bit is 1, the page has been referenced within the last m milliseconds, which we assume is significantly shorter than τ seconds. Hence a page with R=1 is in the working set.
To choose a victim when a page fault occurs (also setting the time of last use field) we scan the page table and treat each resident page as follows. Since we are interested only in resident pages, we would rather scan a page frame table.
Start Lecture #18
Remark: Lab #3 (Banker) assigned. Note that the cutoff for submissions is 29 November.
The WSClock algorithm combines aspects of the working set algorithm (with virtual time) and the clock implementation of second chance. It also distinguishes clean from dirty and referenced from non-referenced in the spirit of NRU.
As in clock we create a circular list of nodes with a "hand" pointing to the next node to examine. There is one such node for every resident page of this process; thus the nodes can be thought of as a "list of frames" or a kind of inverted page table.
As in working set we store in each node the referenced and modified bits R and M and the time of last use. R and M are treated as above.
We discuss below the setting of the time of last use and the clearing of M.
We use virtual time and declare a page old (i.e., not in the working set) if its last reference is more than τ seconds in the past. We again assume τ seconds is much longer than m milliseconds. Other pages are declared young (i.e., in the working set).
As with clock, on every page fault a victim is found by scanning the list of resident pages starting with the page indicated by the clock hand.
It is possible to go all around the clock without finding a victim. In that case
An alternative treatment of WSClock, including more details of its interaction with the I/O subsystem, can be found here.
Algorithm | Comment |
---|---|
Random | Poor, used for comparison |
Optimal | Unimplementable, used for comparison |
NRU | Crude |
FIFO | Not good; ignores timeliness of use |
Second Chance | Improvement over FIFO |
Clock | Better implementation of Second Chance |
LIFO | Horrible, useless |
LRU | Great but impractical |
NFU | Crude LRU approximation |
Aging | Better LRU approximation |
Working Set | Good, but expensive |
WSClock | Good approximation to working set |
Consider a system that has no pages loaded and that uses the FIFO PRA. Consider the following "reference string" (sequence of pages referenced by a given process).
0 1 2 3 0 1 4 0 1 2 3 4
What happens if we run the process on a tiny machine with only 3 frames? What if we run it on a bigger (but still tiny) machine with 4 frames?
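The question can be answered by hand, but a short simulation makes the surprise easy to check. The sketch below assumes a plain FIFO queue of resident pages; the function name fifo_faults is hypothetical.

```python
from collections import deque


def fifo_faults(refs, num_frames):
    """Count page faults under FIFO replacement starting with no pages loaded."""
    queue = deque()                 # resident pages, oldest at the left
    faults = 0
    for page in refs:
        if page in queue:
            continue                # a hit does not change the FIFO order
        faults += 1
        if len(queue) == num_frames:
            queue.popleft()         # evict the page that was loaded first
        queue.append(page)
    return faults


refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(fifo_faults(refs, 3))   # 9
print(fifo_faults(refs, 4))   # 10
```

The bigger machine suffers more faults than the smaller one on this string.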
Theory has been developed and certain PRAs (so called "stack algorithms") cannot suffer this anomaly for any reference string.
FIFO is clearly not a stack algorithm.
LRU is.
Repeat the above calculations for LRU.
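To check a hand calculation, LRU can be simulated by keeping the resident pages ordered by recency of use. This is a sketch for verification only; a list-based simulation like this says nothing about how LRU would be implemented in hardware, and the function name lru_faults is hypothetical.

```python
def lru_faults(refs, num_frames):
    """Count page faults under LRU replacement starting with no pages loaded."""
    frames = []                     # resident pages, least recently used first
    faults = 0
    for page in refs:
        if page in frames:
            frames.remove(page)     # a hit moves the page to the most recent end
        else:
            faults += 1
            if len(frames) == num_frames:
                frames.pop(0)       # evict the least recently used page
        frames.append(page)
    return faults


refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(lru_faults(refs, 3))   # 10
print(lru_faults(refs, 4))   # 8
```

With LRU the extra frame helps, as expected of a stack algorithm.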
Note: A former OS student, Alec Jacobson, has extended the above to a repeating string so that, if N cycles of the repeating pattern are included, a FIFO replacement policy with 4 frames has N more faults than one with only 3 frames. His blog entry is here and a local ("de-blogged") copy is here.
A local PRA is one in which a victim page is chosen among the pages of the same process that requires a new frame. That is, the number of frames for each process is fixed. So LRU for a local policy means that, on a page fault, we evict the page least recently used by this process. A global policy is one in which the choice of victim is made among all pages of all processes.
Question: Why is a strictly local policy impractical/impossible?
Answer: A new process has zero frames.
With a purely local policy, the new process would never get a frame.
More realistically, if you arranged for the first fault to be
satisfied before restricting to purely local, a new process would
remain with only one frame.
A more reasonable local policy would be to wait until a process has been running a while before restricting it to existing frames or give the process an initial allocation of frames based on the size of the executable.
In general, a global policy seems to work better. For example, consider LRU. With a local policy, the local LRU page might have been more recently used than many resident pages of other processes. A global policy needs to be coupled with a good method to decide how many frames to give to each process. By the working set principle, each process should be given |w(k,t)| frames at time t, but this value is hard to calculate exactly.
If a process is given too few frames (i.e., well below |w(k,t)|), its faulting rate will rise dramatically. If this occurs for many or all the processes, the resulting situation in which the system is doing very little useful work due to the high I/O requirements for all the page faults is called thrashing.
An approximation to the working set policy that is useful for determining how many frames a process needs (but not which pages) is the Page Fault Frequency algorithm.
Question: What if there are not enough frames in
the entire system?
That is, what if the PFF is too high for all processes?
Answer: Reduce the MPL as we now discuss.
To reduce the overall memory pressure, we must reduce the multiprogramming level (MPL) or install more memory while the system is running, which is not possible with current technology. Actually it is becoming possible with virtual machine monitors, but we don't consider this possibility.
Lowering the MPL is a connection between memory management and process management. We adjust the MPL by the suspend/resume arcs we saw way back when and are shown again in the diagram on the right.
When the PFF (or another indicator) is too high, we choose a process and suspend it, thereby swapping it to disk and releasing all its frames. When the frequency gets low, we can resume one or more suspended processes. We also need a policy to decide when a suspended process should be resumed even at the cost of suspending another.
This is called medium-term scheduling. Since suspending or resuming a process can take seconds, we clearly do not perform this scheduling decision every few milliseconds as we do for short-term scheduling. A time scale of minutes would be more appropriate.
Question: Why must the page size be a multiple of the disk block size?
Answer: When copying out a page, if you have a partial disk block, you must do a read/modify/write (i.e., 2 I/Os).
Characteristics of a large page size.
startup costs, the total time for performing 8 I/O operations each of size 1KB is much larger than the time for a single 8KB I/O. Hence it is better to swap in/out one big page than several small pages.
sqrt(2 * process size * size of PTE). Since the term inside the sqrt is typically millions of bytes^2, we see that the modern practice of having the page size a few kilobytes is near the minimum point.
regions than the number of (large) frames that the process has been allocated.
A small page size has the opposite characteristics.
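The sqrt formula quoted above can be evaluated for a concrete case. The overhead function below is an assumption on my part: it is the usual textbook cost model (bytes of page table plus, on average, half a page lost to internal fragmentation) whose minimum the sqrt formula gives. The 1MB process and 8-byte PTE are illustrative values only.

```python
import math


def optimal_page_size(process_size, pte_size):
    """Page size given by sqrt(2 * process size * size of PTE)."""
    return math.sqrt(2 * process_size * pte_size)


def overhead(process_size, pte_size, page_size):
    """Assumed model: page table bytes plus half a page of internal fragmentation."""
    return process_size * pte_size / page_size + page_size / 2


size, pte = 2 ** 20, 8                       # hypothetical 1MB process, 8-byte PTEs
print(optimal_page_size(size, pte))          # 4096.0
for page in (2048, 4096, 8192):
    print(page, overhead(size, pte, page))   # smallest overhead at 4096
```

Halving or doubling the page size from 4096 raises the overhead in this model, consistent with a page size of a few kilobytes being near the minimum.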
Homework: Consider a 32-bit address machine using paging with 8KB pages and 4 byte PTEs. How many bits are used for the offset and what is the size of the largest page table? Repeat the question for 128KB pages.
Remind me to do this problem next time.
Start Lecture #19
Remarks:
This was used when machines had very small virtual address spaces. Specifically the PDP-11, with 16-bit addresses, could address only 2^16 bytes or 64KB, a severe limitation. With separate I and D spaces there could be 64KB of instructions and 64KB of data.
Separate I and D are no longer needed with modern architectures having large address spaces.
Permit several processes to each have the same page loaded in the same frame. This is particularly useful if the processes are running the same program.
Must keep reference counts or something so that, when a process terminates, pages it shares with another process are not automatically discarded. This reference count would make a widely shared page (correctly) look like a poor choice for a victim.
copy on write techniques.
Homework: Can a page shared between two processes be read-only for one process and read-write for the other?
In addition to sharing individual pages, processes can share entire library routines. The technique used is called dynamic linking and the objects produced are called shared libraries or dynamically-linked libraries (DLLs). (The traditional linking you did in lab1 is today often called static linking.)
change even when they haven't changed.
copies of the module). Instead position-independent code must be used. For example, jumps within the module would use PC-relative addresses.
The idea of memory-mapped files is to use the mechanisms in place for demand paging (and segmentation, if present) to implement I/O.
A system call is used to map a file into a portion of the address space. (No page can be part of a file and part of regular memory; the mapped file would be a complete segment if segmentation is present.)
The implementation of demand paging we have presented assumes that the entire process is stored on disk. This portion of secondary storage is called the backing store for the pages. Sometimes it is called a paging disk. For memory-mapped files, the file itself is the backing store.
Once the file is mapped into memory, reads and writes become loads and stores.
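Python's standard mmap module exposes this facility, so the idea can be tried directly. The sketch below creates a scratch file (a temporary file, chosen only for illustration), maps it, and then reads and modifies it with ordinary indexing rather than read/write calls.

```python
import mmap
import os
import tempfile

# Create a small scratch file that will act as the backing store.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the whole file
        first = m[0:5]                    # a "read" is now a load from memory
        m[0:5] = b"HELLO"                 # a "write" is now a store to memory
        m.flush()                         # push the dirty page back to the file

with open(path, "rb") as f:
    contents = f.read()
os.remove(path)

print(first, contents)   # b'hello' b'HELLO world'
```

The store into the mapping changed the file itself, since the file is the backing store for those pages.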
In practice there is rarely no free frame because many systems use a paging daemon, a process that, whenever active, evicts pages to increase the number of free frames. The daemon is activated when the number of free pages falls below a low water mark and suspended when the number rises above a high water mark.
Some replacement algorithm must be chosen, and naturally dirty pages must be written back to disk prior to eviction. Indeed, writing the dirty page to disk is the most important aspect of cleaning since that is the most time consuming part of finding a free frame.
Since we have studied replacement algorithms, we can suggest an implementation of the daemon. If a clock-like algorithm is used for victim selection, one can have a two handed clock with one hand (the cleaning daemon) staying ahead of the other (the one invoked by the need for a free frame).
The front hand simply writes out any page it hits that is dirty and thus the trailing hand (the one responsible for finding a victim) is likely to see clean pages and hence is more quickly able to find a suitable victim.
Note that our WSClock implementation had a page cleaner built in (look at the implementation when R=0 and M=1).
Unless specifically requested, you may ignore paging daemons when answering exam questions.
Skipped.
When must the operating system be involved with demand paging?
active page table. The TLB must be cleared (unless it contains a process id field).
What happens when a process, say process A, gets a page fault? Compare the following with the processing for a trap command and for an interrupt.
calls the OS.
forgot some here.
returns to user mode.
The user's program running as process A is unaware that any of this occurred.
A cute horror story.
The hardware support for page faults in the original Motorola 68000 (the first microprocessor with a large address space) was flawed. When a processor encountered a page fault, the hardware didn't always store enough information to enable the OS to figure out what to do (for example, did a register pre-increment occur). That is, one could not safely restart an instruction. This was thought to make demand paging impossible, or at least very difficult and tricky.
However, one clever system for the 68000 used two processors, one executing the program and a second processor executing one instruction behind. When a page fault occurs the executing processor brings in the page and switches to the second processor, which had not yet executed the instruction, thus eliminating instruction restart and thereby supporting demand paging.
The next generation machine, the 68010, pushed extra information on the stack when encountering a page fault so the horrible/clever 2-processor kludge/hack was no longer necessary.
Don't worry about instruction backup; it is very machine dependent and all modern implementations get it right.
We discussed pinning jobs already. The same (mostly I/O) considerations apply to pages.
The issue is where on disk do we put pages that are not in frames.
memory page table (the one we have studied up to now) is determined by the hardware since the hardware modifies/accesses it. It is machine dependent.
disk page table is decided by the OS designers and is machine independent.
Homework: Assume a program requires a billion memory references be executed. Assume every memory reference takes 0.1 microseconds to execute providing the reference page is memory resident. Assume each page fault takes an additional 10 milliseconds to service.
Question: What should we do if we felt disk space was too expensive and wanted to put some of these disk pages on say tape or holographic storage?
Answer: We use demand paging of the disk blocks! That way "unimportant" disk blocks will migrate out to tape and are brought back in if and when needed.
Since a tape read requires seconds to complete (because the request is not likely to be for the sequentially next tape block), it is crucial that we get very few disk block faults.
I don't know of any systems that actually did this.
Homework: Assume a program requires a billion memory references be executed. Assume every memory reference takes 0.1 microseconds to execute providing the reference page is memory resident. Assume a page fault takes 10 milliseconds to service providing the necessary disk block is actually on the disk. Assume a disk block fault takes 10 seconds to service. So the worst case time for a memory reference is 10.0100001 seconds.
Skipped.
Up to now, the virtual address space has been contiguous. In segmentation the virtual address space is divided into a number of variable-size pieces called segments. One can view the designs we have studied so far as having just one segment, the entire address space of the process.
With just one segment (i.e., with all virtual addresses contiguous) memory management is difficult when there are more than two dynamically growing regions.
Imagine a program with several large, dynamically-growing, data structures. The same problem we mentioned for the OS when there are more than two growing regions, occurs as well here for user programs.
This division of the address space is user visible.
Recall that "user visible" really means "visible to user-mode programs".
Unlike (user-invisible) page boundaries, segment boundaries are specified by the user and thus can be made to occur at logical points, e.g., one large array per segment.
Unlike fixed-size pages and frames, segments are variable-size and can grow during execution.
Segmentation eases flexible protection and sharing: One places in a single segment a unit that is logically shared. This would be a natural method to implement shared libraries.
When shared libraries are implemented on paging systems, the design essentially mimics segmentation by treating a collection of pages as a segment. This requires that the end of the unit to be shared occurs on a page boundary (which is done by padding).
Without segmentation (equivalently, with just one segment) all procedures are packed together so, if procedure A changes in size, all the virtual addresses following this procedure are changed and the program must be re-linked. With each procedure in a separate segment, the virtual addresses outside procedure A are unchanged so the relinking would be limited to the symbols defined or used in procedure A.
Homework: Explain the difference between internal fragmentation and external fragmentation. Which one occurs in paging systems? Which one occurs in systems using pure segmentation?
Late PDP-10s and TOPS-10
Traditional (early) Unix had three segments as shown on the right.
Since the text doesn't grow, this was sometimes treated as 2 segments by combining text and data into one segment. But then the text could not be shared.
Segmentation is a user-visible division of a process into multiple variable-size segments, whose sizes can change during execution. It enables fine-grained sharing and protection. For example, one can share the text segment, as was done in early Unix.
With segmentation, the virtual address has two components: the segment number and the offset in the segment.
Segmentation does not mandate how the program is stored in memory. Possibilities include
All segmentation implementations employed a segment table with one entry for each segment.
Question: Why is there no limit value in a PTE?
Answer: All pages are the same size so the limit is obvious.
The address translation for segmentation is
(seg#, offset) --> if (offset<limit) base+offset else error.
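The translation rule above can be sketched in a few lines of Python. This is only an illustration: the table contents and the names `translate` and `SegmentFault` are assumptions made for the example, and real hardware performs the check on every reference.

```python
class SegmentFault(Exception):
    """Raised when the segment number or the offset is out of range."""


def translate(segment_table, seg, offset):
    """Map (seg#, offset) to a physical address using (base, limit) pairs."""
    if seg not in range(len(segment_table)):
        raise SegmentFault(f"no such segment: {seg}")
    base, limit = segment_table[seg]
    if not 0 <= offset < limit:
        raise SegmentFault(f"offset {offset} is outside segment {seg} (limit {limit})")
    return base + offset


# Hypothetical segment table: one (base, limit) entry per segment.
segment_table = [(1000, 400), (5000, 100), (2000, 50)]

print(translate(segment_table, 0, 399))   # 1399
print(translate(segment_table, 1, 0))     # 5000
```

An offset equal to or larger than the limit (for example offset 100 in segment 1) raises `SegmentFault`, which plays the role of the error branch in the rule.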
Segmentation, like whole program swapping, exhibits external fragmentation, sometimes called checkerboarding. (See the treatment of OS/MVT for a review of external fragmentation and whole program swapping.) Since segments are smaller than programs (several segments make up one program), the external fragmentation is not as bad as with whole program swapping. But it is still a serious problem.
As with whole program swapping, compaction can be employed.
Consideration | Demand Paging | Demand Segmentation |
---|---|---|
User (mode) aware | No | Yes |
How many addr spaces | 1 | Many |
VA size > PA size | Yes | Yes |
Protect individual procedures separately | No | Yes |
Accommodate elements with changing sizes | No | Yes |
Ease user sharing | No | Yes |
Why invented | Let the VA size exceed the PA size | Sharing, Protection, Independent addr spaces |
Internal fragmentation | Yes | No, in principle |
External fragmentation | No | Yes |
Placement question | No | Yes |
Replacement question | Yes | Yes |
Same idea as demand paging, but applied to segments.
The table on the right compares demand paging with demand segmentation. The portion above the double line is from Tanenbaum.
These two sections of the book cover segmentation combined with demand paging in two different systems. Section 3.7.2 covers the historic Multics system of the 1960s (it was coming up at MIT when I was an undergraduate there). Multics was complicated and revolutionary. Indeed, Thompson and Ritchie developed (and named) Unix partially in rebellion against the complexity of Multics. Multics is no longer used.
Section 3.7.3 covers the Intel Pentium hardware, which offers a segmentation+demand-paging scheme that is not used by any of the current operating systems (OS/2 used it in the past). The Pentium design permits one to convert the system into a pure demand-paging scheme and that is the common usage today. (Moreover, this hardware is not in the x86-64 architecture.)
I will present the material in the following order.
One can combine segmentation and paging to get advantages of both at a cost in complexity. In particular, user-visible, variable-size segments are the most appropriate units for protection and sharing; the addition of (non-demand) paging eliminates the placement question and external fragmentation (at the small average cost of 1/2-page internal fragmentation per segment).
The basic idea is to employ (non-demand) paging on each segment. A segmentation plus paging scheme has the following properties.
Although it is possible to combine segmentation with non-demand paging, I do not know of any system that did this.
Homework: Consider a 32-bit address machine using paging with 8KB pages and 4 byte PTEs. How many bits are used for the offset and what is the size of the largest page table? Repeat the question for 128KB pages. So far this question has been asked before. Repeat both parts assuming the system also has segmentation with at most 128 segments. Remind me to do this in class next time.
Homework: (Ask me about this one next class.) Consider a system with 36-bit addresses that employs both segmentation and paging. Assume each PTE and STE is 4-bytes in size.
Start Lecture #20
Remark: I believe last time, when doing a homework problem, I did a 1KB page size instead of 8KB and 128KB. Let's do the correct sizes and then the homework due today.
There is very little to say. The previous section employed (non-demand) paging on each segment. For the present scheme, we employ demand paging on each segment, that is we perform fetch-on-demand for the pages of each segment. As with pure demand paging, we introduce a valid bit in each PTE and probably other bits as well (e.g., Referenced and Modified).
Homework: 46.
When segmentation and paging are both being used, first the segment
descriptor is looked up, then the page descriptor.
Does the TLB also work this way, with two levels of lookup?
Multics was the first system to employ segmentation plus demand paging. The implementation was as described above with just a few wrinkles.
The Pentium design implements a trifecta: Depending on the settings of various control bits, the Pentium scheme can be pure demand-paging (current OSes use this mode), pure segmentation, or segmentation with demand-paging.
The Pentium supports 2^14 = 16K segments, each of size up to 2^32 bytes.
"code segment" and "data segment" (and other less important segments). Technically, the CS register is loaded with the "selector" of the active code segment and the DS register is loaded with the "selector" of the active data segment.
Once the 32-bit segment base and the segment limit are determined, the 32-bit address from the instruction itself is compared with the limit and, if valid, is added to the base and the sum is called the 32-bit linear address.
Now we have three possibilities depending on whether the system is
running in pure segmentation, pure demand-paging, or segmentation
plus demand-paging mode.
Current operating systems for the Pentium use the pure demand-paging mode.
Skipped
Read
We have studied the following concepts.
There are three basic requirements for file systems.
High level solution: Store data in files that together form a file system.
Very important. A major function of the file system is to supply uniform naming. As with files themselves, important characteristics of the file name space are that it is persistent and concurrently accessible.
Unix-like operating systems extend the file name space to encompass devices as well.
Does each name refer to a unique file?
Answer: Yes, providing the names start at the same place
(absolute or relative to the same directory).
Does each file have a unique name?
Answer: Often no.
We will discuss this below when we study links.
The extensions are suffixes attached to the file names and are intended to in some way describe the high-level structure of the file's contents.
For example, consider the .html extension in class-notes.html, the name of the file we are viewing.
Depending on the system and application, these extensions can have
little or great significance.
The extensions can be
Should file names be case sensitive? For example, do x.y, X.Y, x.Y all name the same file? There is no clear answer.
String string; (case sensitive). However an Ada programmer would know that
A:=a+1; is a simple increment (case insensitive).
How should the file be structured? Said another way, how does the OS interpret the contents of a file.
A file can be interpreted as a
We distinguish four types of files.
file is simply a collection of data that forms a unit of sharing for processes, even concurrent processes. These are called regular files.
Some regular files contain lines of text and are
called (not surprisingly) text files or ascii files (the latter name
is of decreasing popularity as latin-1 and unicode become more
important).
Each text line concludes with some end of line indication: on unix and recent MacOS this is a newline (a.k.a. line feed), in MS-DOS and Windows it is the two character sequence carriage return followed by newline, and in earlier MacOS it was carriage return.
Ascii, with only 7 bits per character, is poorly suited for most human languages other than English. Indeed, ascii is an acronym for American Standard Code for Information Interchange. Latin-1 (8 bits) is a little better with support for most Western European Languages.
Perhaps, with growing support for more varied character sets, ascii files will be replaced by unicode files (commonly encoded as UTF-8 or UTF-16). The Java and Ada programming languages (and perhaps others) already support unicode.
An advantage of all these formats is that they can be directly printed on a terminal or printer.
Other regular files, often referred to as binary files, do not represent a sequence of characters. For example, a four-byte, twos-complement representation of integers in the range from roughly -2 billion to +2 billion is definitely not to be thought of as 4 latin-1 characters, one per byte.
Just because a file is unstructured (i.e., is a byte stream) from the OS perspective does not mean that applications cannot impose structure on the bytes. So a document written without any explicit formatting in MS Word is not simply a sequence of ascii (or latin-1 or unicode) characters.
On unix, an executable file must begin with one of certain "magic numbers" in the first few bytes.
For a native executable, the remainder of the file has a well
defined format.
Another option is for the magic number to be the ascii representation of the two characters #! in which case the next several characters specify the location of the executable program that is to be run with the current file fed in as input.
That is how some interpreted (as opposed to compiled) languages work
in unix.
#!/usr/bin/perl
perl script
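As a rough illustration of the magic-number check, the following Python sketch inspects the first bytes of a file. It is not how any particular kernel does it: the function name `classify` is made up for the example, and the only native format it recognizes is ELF (whose magic number is the four bytes 0x7F 'E' 'L' 'F'), which is an assumption about the system in use.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"   # magic number of native ELF executables


def classify(path):
    """Guess how an executable would be started from its first few bytes."""
    with open(path, "rb") as f:
        head = f.read(128)
    if head.startswith(b"#!"):
        # The rest of the first line gives the location of the interpreter.
        first_line = head[2:].split(b"\n", 1)[0].strip()
        return "script", first_line.decode()
    if head.startswith(ELF_MAGIC):
        return "native (ELF)", None
    return "unknown", None


def demo():
    """Create a tiny perl script in a temporary directory and classify it."""
    with tempfile.TemporaryDirectory() as d:
        script = os.path.join(d, "hello.pl")
        with open(script, "w") as f:
            f.write("#!/usr/bin/perl\nprint \"hello\\n\";\n")
        return classify(script)


print(demo())   # ('script', '/usr/bin/perl')
```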
Start Lecture #21
In some systems the type of the file (which is often specified by the extension) determines what you can do with the file. This makes the common case easier and, more importantly, safer.
However, it tends to make the unusual case harder. For example, assume you have a program that turns out data (.data) files. Now you want to use it to turn out a java file, but the type of the output is data and cannot be easily converted to type java and hence cannot be given to the java compiler.
We will discuss several file types that are not called "regular".
dir                  # prints on screen
dir > file           # result put in a file
dir > /dev/audio1    # results sent to speaker (sounds awful)
There are two possibilities, sequential access and random access (a.k.a. direct access).
With sequential access, each access to a given
file starts where the previous access to that file finished (the
first access to the file starts at the beginning of the file).
Sequential access is the most common and gives the highest
performance.
For some devices (e.g., magnetic or paper tape) access must be sequential.
With random access, the bytes are accessed in any order. Thus each access must specify which bytes are desired. This is done either by having each read/write specify the starting location or by defining another system call (often named seek) that specifies the starting location for the next read/write.
In unix, if no seek occurs between two read/write operations, then the second begins where the first finished. That is, unix treats a sequence of reads and writes as sequential, but supports seeking to achieve random access.
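The unix behavior just described can be observed with a short Python sketch that uses the `os` module's thin wrappers around the read, write, and lseek system calls. The file contents and the function name `demo` are assumptions chosen for the illustration.

```python
import os
import tempfile


def demo():
    """Read sequentially, then seek to read from an arbitrary position."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789ABCDEF")
        os.lseek(fd, 0, os.SEEK_SET)      # go back to the start of the file
        first = os.read(fd, 4)            # bytes 0..3
        second = os.read(fd, 4)           # no seek: continues at byte 4
        os.lseek(fd, 12, os.SEEK_SET)     # random access: jump to byte 12
        third = os.read(fd, 4)
        return first, second, third
    finally:
        os.close(fd)
        os.remove(path)


print(demo())   # (b'0123', b'4567', b'CDEF')
```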
Previously, files were declared to be sequential or random. Modern systems do not do this. Instead, all files support random access, and optimizations are applied if the system determines that a file is (probably) being accessed sequentially.
Question: Why the strange name "random"?
Answer: From the OS point of view the accesses are random (they were chosen by the user).
Attributes are various properties that can be specified for a file, for example:
Homework: 4. Is the open system call in UNIX absolutely essential? What would be the consequences of not having it?
Homework: 5. Systems that support sequential files always have an operation to rewind files. Do systems that support random-access files need this, too?
Homework: 6. Some operating systems provide a system call RENAME to give a file a new name. Is there any difference at all between using the call to rename a file and just copying the file to a new file with the new name, followed by deleting the old one?
Let's look at copyfile.c to see the use of file descriptors and error checks. Note specifically, the code checks the return value from each I/O system call. It is a common error to assume that
Directories form the primary unit of organization for the filesystem.
One often refers to the level structure of a directory system. It is easy to be fooled by the names given. A single level directory structure results in a file system tree with two levels: the single root directory and the files in this directory. That is, there is one level of directories and another level of files so the full file system tree has two levels.
Possibilities.
These possibilities are not as wildly different as they sound or as the pictures suggest.
/ is allowed in a file name. Then one could fake a tree by having a file named
links, which we will study soon.
You can specify the location of a file in the file hierarchy by using either an absolute or a relative path to the file.
one of the roots, if we have a forest).
Homework: Give 8 different path names for the file /etc/passwd.
Homework: 8. A simple operating system supports only a single directory but allows it to have arbitrarily many files with arbitrarily long file names. Can something approximating a hierarchical file system be simulated? How?
Remember that the job of a directory is to map names (of its children) to the files represented by those names.
"empty" directory. Normally the directory created actually contains . and .., so is not really empty.
"handle" for the directory that speeds future access by eliminating the need to process the name of the directory.
information hiding.
I have mentioned that in Unix all files can be accessed from the single root. This does not seem possible when you have two disks (or two partitions on one disk, which is nearly the same thing). How can you get from the root of one partition to anywhere in the other?
One solution might be to make a super-root having the two original roots as children. However, this is not done. Instead, one of the devices is mounted on the other, as is illustrated in the figures on the right.
The top row shows two filesystems (on separate devices). From either root, you can easily get to all the files in that filesystem. Filesystems on Windows machines leave the situation as shown and have syntax to change our focus from one filesystem to another.
Unix uses a different approach. In the normal case (the only one we will consider), one mounts one filesystem on an empty directory of the other. For example, the bottom row shows the result of mounting the right filesystem on the directory /y of the left filesystem.
In this way, the forest of the top row becomes a tree in the bottom. However, we shall see below that the introduction of (hard and symbolic) links in Unix results in filesystems that are not trees. It is true, however, that you can name any file starting from the (single) root.
Now that we understand how the file system looks to a user, we turn our attention to how it is implemented.
We first look at how file systems are laid out on a disk in modern PCs. Much of this is required by the BIOS so all PC operating systems have the same lowest level layout. I do not know the corresponding layout for mainframe systems or supercomputers.
A system often has more than one physical disk. The first disk is the boot disk.
How do we determine which is the first disk?
The BIOS reads the first sector of the boot disk into memory and transfers control to it. This particular sector is called the MBR, or master boot record. (A sector is the smallest addressable unit of a disk; we assume it contains 512 bytes.)
The MBR contains two key components: the partition table and the first-level loader.
logical disk. Normally each partition holds a complete file system. Each entry in the partition table is like an STE; it gives the starting point and length of the partition.
The contents of a filesystem vary from one file system to another but there is some commonality.
A fundamental property of disks is that they cannot read or write single bytes. The smallest unit that can be read or written is called a sector and is normally 512 bytes (plus error correction/detection bytes). This is a property of the hardware, not the operating system. Recently some drives have much bigger sectors, but we will ignore this fact.
The operating system reads or writes disk blocks. The size of a block is a multiple (we assume a power of 2) of the size of a sector. Since sectors are (for us) always 512 bytes, the block size can be 512, 1024=1K, 2K, 4K, 8K, 16K, etc. The most common block sizes today are 4K and 8K.
So files are composed of blocks.
When we studied memory management, we had to worry about fragmentation, processes growing and shrinking, compaction, etc. Many of these same considerations apply to files; the difference (largely in nomenclature) is that instead of a memory region being composed of bytes, a file is composed of blocks.
Recall the simplest form of memory management beyond uniprogramming was OS/MFT where memory was divided into a very few regions and each process was given one of these regions. The analogue for disks would be to give each file an entire partition. This is too inflexible and is not used for files.
The next simplest memory management scheme was the one used in OS/MVT (a swapping system), where the memory for an entire process was contiguous.
The analogous scheme for files, in which each file is stored contiguously, is called contiguous allocation. It is simple, and fast for access since, as we shall see next chapter, disks give much better performance when accessed sequentially.
However contiguous allocation is problematic, especially for growing files.
As a result contiguous allocation is not used for general purpose, rewritable file systems. However, it is ideal for file systems where files do not move or change size.
"almost" is that the terminology used is that the movie is one file stored as a sequence of extents and only the extents are contiguous.
Homework: 11. (There is a typo: the first sentence should end at the first comma.) Contiguous allocation of files leads to disk fragmentation. Is this internal fragmentation or external fragmentation? Make an analogy with something discussed in the previous chapter.
Start Lecture #22
Remark: Lab 4 assigned. It is due in 2 NYU weeks (i.e., 4 lectures) and has a firm cutoff date 1 week later (see NYU Classes).
A file is an ordered sequence of blocks. We just considered storing the blocks one right after the other (i.e., contiguously) the same way that one can store an in-memory list as an array. The other common method for in-memory lists is to link the elements together via pointers. This can also be done for files as follows.
However, this scheme gives horrible performance for random access: N disk reads are needed to access block N.
As a result this implementation of linked allocation is not used.
Consider the following two code segments, which store the same data but in different orders. The left is analogous to the horrible linked-list file organization above and the right is analogous to the ms-dos FAT file system we study next.
struct node_type {
    float data;                  float node_data[100];
    int   next;                  int   node_next[100];
} node[100];
With the second arrangement the data can be stored far away from the next pointers. In the FAT file system this idea is taken to an extreme: The data, which are large (each is a disk block), are stored on disk; whereas, the next pointers, which are small (each is an integer) are stored in memory in a File Allocation Table or FAT. (When the system is shut down the FAT is copied to disk and when the system is booted, the FAT is copied to memory.)
The FAT file system stores each file as a linked list of disk blocks. The blocks, which contain file data only (not the linked list structure) are stored on disk. The pointers implementing the linked list are stored in memory.
There is a long lineage of FAT file systems (FAT-12, FAT-16, vfat, ...) all of which use the file allocation table. The following description of FAT is fairly generic and applies to all of them.
FAT is an essentially ubiquitous file system.
The FAT itself (i.e., the table) is maintained in memory. It contains one (1-int) entry for each disk block. Finding a block is a standard O(N) linked list traversal.
An example FAT is on the right. The directory entry for file A contains a 4, the directory entry for file B contains a 6.
Let's trace in class the steps to find all the blocks in each file.
FAT implements linked allocation but the links are stored separately from the data. Thus, the time needed to access a random block is still linear in the size of the file, but now all the references are to the FAT, which is in memory. So it is bad for random accesses, but not nearly as horrible as plain linked allocation.
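The chain-following just described can be sketched in Python. The table contents below are hypothetical (the figure is not reproduced here); only the starting blocks, 4 for file A and 6 for file B, come from the text. The end-of-chain marker and the function names are likewise assumptions of the sketch.

```python
END = -1   # hypothetical end-of-chain marker

# Hypothetical 16-entry FAT: fat[b] is the block that follows block b in its file.
fat = [None] * 16
for block, nxt in [(4, 7), (7, 2), (2, 10), (10, 12), (12, END),
                   (6, 3), (3, 11), (11, 14), (14, END)]:
    fat[block] = nxt

directory = {"A": 4, "B": 6}   # each directory entry holds the file's first block


def blocks_of(name):
    """Follow the chain in the FAT; each step is a memory reference, not a disk read."""
    chain = []
    block = directory[name]
    while block != END:
        chain.append(block)
        block = fat[block]
    return chain


def nth_block(name, n):
    """Disk block holding file block n (counting from 0); needs n FAT lookups."""
    block = directory[name]
    for _ in range(n):
        block = fat[block]
        if block == END:
            raise IndexError("file is shorter than that")
    return block


print(blocks_of("A"))      # [4, 7, 2, 10, 12]
print(blocks_of("B"))      # [6, 3, 11, 14]
print(nth_block("A", 3))   # 10
```

Note that `nth_block` is the O(N) traversal: reaching file block n costs n table lookups, but no disk reads until the data block itself is fetched.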
If the partition in question contains N disk blocks, then the FAT contains N pointers. Hence the ratio of the disk space supported to the memory space needed is
(size of a disk block) / (size of a pointer)
If the block size is 8KB and a pointer is 2B, the memory requirement is 1/4 megabyte for each disk gigabyte. Large but not prohibitive. (While 8KB is reasonable today for blocksize, 2B pointers is not since that would mean the largest partition could be 2^16 blocks = 2^16×2^13 bytes = 2^29 bytes = 1/2 GB, which is much too small.)
If the block size is 512B (the sector size of many disks) and a pointer is 8B then the memory requirement is 16 megabytes for each disk gigabyte, which would most likely be prohibitive.
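The arithmetic in the last two paragraphs can be checked with a few lines of Python. The function names are invented for this sketch, and a gigabyte is taken to be 2^30 bytes.

```python
def fat_bytes_per_disk_gb(block_size, pointer_size):
    """Bytes of in-memory FAT needed for each gigabyte (2**30 bytes) of disk."""
    blocks_per_gb = 2**30 // block_size
    return blocks_per_gb * pointer_size


def max_partition_bytes(block_size, pointer_size):
    """Largest partition addressable when a pointer has 8*pointer_size bits."""
    return 2**(8 * pointer_size) * block_size


KB, MB, GB = 2**10, 2**20, 2**30
print(fat_bytes_per_disk_gb(8 * KB, 2) / MB)   # 0.25
print(fat_bytes_per_disk_gb(512, 8) / MB)      # 16.0
print(max_partition_bytes(8 * KB, 2) / GB)     # 0.5
```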
More details on the FAT file system can be found in this lecture from Igor Kholodov. (A local copy is here).
Continuing the idea of adapting storage schemes for memory to file storage, why don't we mimic the idea of (non-demand) paging and have a table giving, for each block of the file, where on the disk that file block is stored? In other words a "file block table" mapping each file block to its corresponding disk block. This is the idea of (the first part of) the unix i-node solution, which we study next.
Although Linux and other Unix and Unix-like operating systems have a variety of file systems, the most widely used Unix file systems are i-node based, as was the original Unix file system from Bell Labs. As we shall see, i-node file systems are more complicated than straight paging; they have aspects of multilevel paging as well.
Inode based systems have the following properties.
How many disk accesses are needed to access one block in an i-node based file system?
Given a block number (= byte number / block size), how do you find the block? Specifically, if we assume
"first" block.
then the following algorithm can be used to find block N.
If N < D            // This block is pointed to by the i-node
    use direct pointer N in the i-node
else if N < D + K   // The single indirect block points to this block
    use pointer D in the inode to get the indirect block
    then use pointer N-D in the indirect block to get block N
else                // This is one of the K*K blocks obtained via the double indirect block
    use pointer D+1 in the inode to get the double indirect block
    let P = (N-(D+K)) DIV K   // Which single indirect block to use
    use pointer P to get the indirect block B
    let Q = (N-(D+K)) MOD K   // Which pointer in B to use
    use pointer Q in B to get block N
For example, let D=12, assume all blocks are 1000B, assume all pointers are 4B. Retrieve the block containing byte 1,000,000.
With a triple indirect block, the ideas are the same, but there is more work.
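The lookup algorithm and the example parameters above (D=12, 1000B blocks, 4B pointers, hence K=250) can be written as a short Python sketch. The function name `locate` and the labels in the returned list are assumptions of the sketch; it stops at the double indirect block and does not handle the triple indirect case.

```python
def locate(byte_number, block_size=1000, pointer_size=4, D=12):
    """Return the chain of pointers to follow to reach the block holding byte_number."""
    K = block_size // pointer_size      # pointers per indirect block
    N = byte_number // block_size       # file block number, counting from 0
    if N < D:
        return [("inode", N)]
    if N < D + K:
        return [("inode", D), ("single indirect", N - D)]
    if N < D + K + K * K:
        P, Q = divmod(N - (D + K), K)   # P: which indirect block, Q: which pointer in it
        return [("inode", D + 1), ("double indirect", P), ("single indirect", Q)]
    raise ValueError("block needs a triple indirect block (not handled in this sketch)")


for step in locate(1_000_000):
    print(step)
# ('inode', 13)
# ('double indirect', 2)
# ('single indirect', 238)
```

For byte 1,000,000 the block number is N=1000, so N-(D+K)=738, giving P=2 and Q=238.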
Homework: Consider an inode-based system with the same parameters as just above, D=12, K=250, etc.
There is some question of what the "i" stands for in i-node; the consensus seems to be index. Now, however, people often write inode (not i-node) and don't view the "i" as standing for anything. For example Dennis Ritchie, a co-author of unix, doesn't remember why the name was chosen.
In truth, I don't know either. It was just a term that we started to use. "Index" is my best guess, because of the slightly unusual file system structure that stored the access information of files as a flat array on the disk, with all the hierarchical directory information living aside from this. Thus the i-number is an index in this array, the i-node is the selected element of the array. (The "i-" notation was used in the 1st edition manual; its hyphen became gradually dropped).(lkml.indiana.edu/hypermail/linux/kernel/0207.2/1182.html).
Recall that the primary function of a directory is to map the file name (in ASCII, Unicode, or some other text-based encoding) to whatever is needed to retrieve the data of the file itself.
There are several ways to do this depending on how files are stored.
Another important function of a directory is to enable the retrieval of the various attributes (e.g., length, owner, size, permissions, etc.) associated with a given file.
Homework: 30. It has been suggested that the first part of each Unix file be kept in the same disk block as its i-node. What good would this do?
It is convenient to view the directory as an array of entries, one per file. This view tacitly assumes that all entries are the same size and, in early operating systems, they were. Most of the contents of a directory are inherently of a fixed size. The primary exception is the file name.
Early systems placed a severe limit on the maximum length of a file name and allocated this much space for all names. DOS used an 8+3 naming scheme (8 characters before the dot and 3 after). Unix version 7 limited names to 14 characters.
Later systems raised the limit considerably (255, 1023, etc) and thus allocating the maximum amount for each entry was inefficient and other schemes were used. Since we are storing variable size quantities, a number of the considerations that we saw for non-paged memory management arise here as well.
The simplest method to find a file name in a directory is to search the list of directory entries one at a time. This scheme becomes inefficient for very large directories containing hundreds or thousands of files. In such cases a more sophisticated technique (such as hashing or B-trees) is used.
We often think of the files and directories in a file system as forming a tree (or forest). However in most modern systems this is not always the case: the same file can appear in two (or more) different directories (not two copies of the file, but the same file). The same file can also appear multiple times in the same directory, having different names each time.
I like to say that the same file has two different names. One can also think of the file as being shared by the two directories (but those words don't work so well for a file with two names in the same directory).
"Shared files" is Tanenbaum's terminology.
With unix hard links there are multiple names for the same file and each name has equal status. The directory entries for both names point to the same inode.
"real name" and the other one is "just a link".
For example, the diagram on the right illustrates the result that occurs when, starting with an empty file system (i.e., just the root directory), one executes
cd /
mkdir /A; mkdir /B
touch /A/X; touch /B/Y
If the file named in the touch command does not exist, an empty ordinary file with that name is created. Touch has other uses, but we won't need them.
The diagrams in this section use the following conventions
Now we execute
ln /B/Y /A/New
which leads to the next diagram on the right.
Note that there are still exactly 5 inodes and 5 files: two regular files and three directories. All that has changed is that there is another name for one of the regular files. At this point there are two equally valid names for the right hand yellow file, /B/Y and /A/New. The fact that /B/Y was created first is NOT detectable.
"the file nameS" (plural) vs "the file" (singular).
ln /B /A/NewDir
Next assume Bob created /B and /B/Y and Alice created /A, /A/X, and /A/New. Later Bob tires of /B/Y and removes it by executing
rm /B/Y
The file /A/New is still fine (see the diagram on the right). But it is owned by Bob, who can't find it! If the system enforces quotas Bob will likely be charged (as the owner), but he can neither find nor delete the file (since Bob cannot unlink, i.e. remove, files from /A).
If, prior to removing /B/Y, Bob had examined its link count
(an attribute of the file), he would have noticed that there is
another (hard) link to the file, but would not have been able to
determine in which directory the hard link was located (/A in this
case) or what is the name of the file in that directory (New in this
case).
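The hard-link scenario can be replayed in a scratch directory with Python's `os.link`, which wraps the same system call that ln uses. This is a sketch that assumes a Unix-like file system; the directories A and B are created under a temporary directory rather than under /, and the ownership aspect (Bob versus Alice) is not modeled.

```python
import os
import tempfile


def demo():
    """Create B/Y, hard-link it as A/New, remove B/Y, and inspect the result."""
    with tempfile.TemporaryDirectory() as root:
        a, b = os.path.join(root, "A"), os.path.join(root, "B")
        os.mkdir(a)
        os.mkdir(b)
        y, new = os.path.join(b, "Y"), os.path.join(a, "New")
        with open(y, "w") as f:
            f.write("data")
        os.link(y, new)                       # same effect as: ln /B/Y /A/New
        same_inode = os.stat(y).st_ino == os.stat(new).st_ino
        links_before = os.stat(y).st_nlink    # both names refer to one inode
        os.remove(y)                          # same effect as: rm /B/Y
        with open(new) as f:
            survivor = f.read()               # the file is still reachable as A/New
        links_after = os.stat(new).st_nlink
        return same_inode, links_before, survivor, links_after


print(demo())   # (True, 2, 'data', 1)
```

The link count reveals that a second name exists, but nothing in the stat result says in which directory that name lives.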
Since hard links are only permitted to files (not directories) the resulting file system is a dag (directed acyclic graph). That is, there are no directed cycles. We will now proceed to give away this useful property by studying symlinks, which can point to directories.
Start Lecture #23
As just noted, hard links do NOT create a new file, just another name for an existing file. Once the hard link is created the two names have equal status.
A symlink, on the other hand, DOES create another file, a non-regular file, that itself serves as another name for the original file. Specifically
Again start with an empty file system and this time execute the following code sequence (the only difference from the above is the addition of a -s).
cd /
mkdir /A; mkdir /B
touch /A/X; touch /B/Y
ln -s /B/Y /A/New
We now have an additional file /A/New, which is a symlink to /B/Y.
/B/Y).
The bottom line is that, with a hard link, a new name is created for the file. This new name has equal status with the original name. This can cause some surprises (e.g., you create a link but I own the file). With a symbolic link a new file is created (owned by the creator naturally) that contains the name of the original file. We often say the new file points to the original file.
Question: Consider the hard link setup above.
If Bob removes /B/Y and then creates another /B/Y, what happens to
/A/New?
Answer: Nothing.
/A/New is still a file owned by Bob having the same
contents, creation time, etc. as the original /B/Y.
Question: What about with a symlink?
Answer: /A/New becomes invalid and then valid
again, this time pointing to the new /B/Y.
(It can't point to the old /B/Y as that is completely gone.)
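The symlink answer can be checked the same way with `os.symlink`. As before this is only a sketch on a Unix-like system, with A and B created under a temporary directory instead of /.

```python
import os
import tempfile


def demo():
    """Symlink A/New to B/Y, then remove and re-create B/Y."""
    with tempfile.TemporaryDirectory() as root:
        a, b = os.path.join(root, "A"), os.path.join(root, "B")
        os.mkdir(a)
        os.mkdir(b)
        y, new = os.path.join(b, "Y"), os.path.join(a, "New")
        with open(y, "w") as f:
            f.write("old")
        os.symlink(y, new)                    # same effect as: ln -s /B/Y /A/New
        stored_name = os.readlink(new) == y   # the symlink's content is a path name
        os.remove(y)
        # The symlink itself still exists, but following it now fails.
        dangling = os.path.islink(new) and not os.path.exists(new)
        with open(y, "w") as f:               # a brand new file with the old name
            f.write("new")
        with open(new) as f:
            seen = f.read()                   # the symlink now leads to the new file
        return stored_name, dangling, seen


print(demo())   # (True, True, 'new')
```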
Notes:
What happens if the target of the symlink is an existing directory? For example, consider the code below (again starting with an empty file system), which gives rise to the diagram on the right.
cd /
mkdir /A; mkdir /B
touch /A/X; touch /B/Y
ln -s /B /A/New
Questions:
This research project of the early 1990s was inspired by the key observation that systems are becoming limited in speed by small writes. The factors contributing to this phenomenon were (and still are).
"preparation" before any data is transferred, and then can transfer a block in less than 1ms. Thus, a one block "transfer" spends most of its time "getting ready" to transfer.
The goal of the log-structured file system project was to design a file system in which all writes are large and sequential (most of the preparation is eliminated when writes are sequential). These writes can be thought of as being appended to a log, which gave the project its name.
A "cleaner" process runs in the background and examines segments starting from the beginning. It removes overwritten blocks and then adds the remaining blocks to the segment buffer. (This is very much not trivial.)
Despite the advantages given, log-structured file systems have not caught on. They are incompatible with existing file systems and the cleaner has proved to be difficult.
Many seemingly simple I/O operations are actually composed of sub-actions. For example, deleting a file on an i-node based system (really this means deleting the last link to the i-node) requires removing the entry from the directory, placing the i-node on the free list, and placing the file blocks on the free list.
What happens if the system crashes during a delete and some, but not all three, of the above actions occur?
A journaling file system prevents these problems by using an idea from database theory, namely transaction logs. To ensure that the multiple sub-actions are all performed, the larger I/O operation (delete in the example) is broken into 3 steps.
After a crash, the log (called a journal) is examined and if there are pending sub-actions, they are done before the system is made available to users.
Since sub-actions may be repeated (once before the crash, and once after), it is required that they all be idempotent (applying the action twice is the same as applying it once).
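Here is a toy Python sketch of the idea, not taken from the text. It assumes the usual three steps are: write a journal entry, perform the sub-actions, erase the entry. The class TinyFS, its fields, and the crash_after parameter (which simulates a crash part way through) are all invented for illustration; a real journal lives on disk, not in a Python list.

```python
class TinyFS:
    """In-memory model: one directory, an i-node free list, a block free list."""

    def __init__(self):
        self.directory = {"f": 7}            # name -> i-node number
        self.inode_blocks = {7: (20, 21)}    # i-node -> its data blocks
        self.free_inodes = set()
        self.free_blocks = set()
        self.journal = []                    # pending delete records

    def _sub_actions(self, name, ino, blocks):
        # Each sub-action is idempotent: doing it twice equals doing it once.
        return [
            lambda: self.directory.pop(name, None),   # remove directory entry
            lambda: self.free_inodes.add(ino),        # free the i-node
            lambda: self.free_blocks.update(blocks),  # free the file blocks
        ]

    def delete(self, name, crash_after=None):
        ino = self.directory[name]
        record = (name, ino, self.inode_blocks[ino])
        self.journal.append(record)                   # step 1: log the intent
        for action in self._sub_actions(*record)[:crash_after]:
            action()                                  # step 2: do the work
        if crash_after is None:
            self.journal.remove(record)               # step 3: erase the entry

    def recover(self):
        """After a crash, redo every pending record from its first sub-action."""
        while self.journal:
            for action in self._sub_actions(*self.journal[0]):
                action()
            self.journal.pop(0)
```

Calling delete("f", crash_after=1) leaves the directory entry removed but nothing freed; recover() then repeats all three sub-actions, which is safe only because they are idempotent.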
Some history.
A single operating system needs to support a variety of file systems. The software support for each file system would have to handle the various I/O system calls defined.
Not surprisingly the various file systems often have a great deal in common and large parts of the implementations would be essentially the same. Thus for software engineering reasons one would like to abstract out the common part.
This was done by Sun Microsystems when they introduced NFS, the Network File System for Unix, and by now most Unix-like operating systems have adopted this idea. The common code is called the VFS layer and is illustrated on the right.
The original motivation for Sun was to support NFS (Network File System), which permits a file system residing on machine A to be mounted onto a file system residing on machine B. The result is that by cd'ing to the appropriate directory on machine B, a user with sufficient privileges can read/write/execute the files in the machine A file system.
Note that mounting one file system onto another (whether they are on different machines or not) does not require that the two file systems be the same type. For example, I routinely mount FAT file systems (from MP3 players, cameras, etc.) onto my Linux inode-based file system. The involvement of multiple file system software components for a single operation is another point in VFS's favor.
Nonetheless, I consider the idea of VFS to be mainly good (perhaps superb) software engineering more than OS design. The details are naturally OS specific.
Since I/O operations can dominate the time required for complete user processes, considerable effort has been expended to improve the performance of these operations.
All general purpose file systems use a paging-like algorithm for file storage (read-only systems, which often use contiguous allocation, are the major exception). Files are broken into fixed size pieces, called blocks, that are scattered over the disk.
Note that although this algorithm is similar to paging, it is not called paging and often does not have an explicit page table.
Note also that all the blocks of the file are present at all times, i.e., this system is not demand paging.
One can imagine systems that do utilize demand-paging-like algorithms for disk block storage. In such a system only some of the file blocks would be stored on disk with the rest on tertiary storage (some kind of tape, or holographic storage perhaps). NASA might do this with their huge datasets.
We discussed a similar question before when studying page size. The sizes chosen are similar, both blocks and pages are measured in kilobytes. Common choices are 4KB and 8KB, for each.
There are two conflicting goals, performance and efficient memory utilization.
"startup time" required before any bytes are transferred. This favors a large block size.
For some systems, the vast majority of the space used is consumed by the very largest files. For example, it would be easy to have a few hundred gigabytes of video. In that case the space efficiency of small files is largely irrelevant since most of the disk space is used by very large files.
There are basically two possibilities, a bit map and a linked list.
A region of kernel memory is dedicated to keeping track of the free blocks. One bit is assigned to each block of the file system. The bit is 1 if the block is free.
If the block size is 8KB the bitmap uses 1 bit for every 64K bits of disk space. Thus a 64GB disk would require 1MB of RAM to hold its bitmap.
One can break the bitmap into (fixed size) pieces and apply demand paging. This saves RAM at the cost of increased I/O.
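The arithmetic above, and the bitmap itself, can be sketched in a few lines of Python. This is an illustration only: the names bitmap_bytes and FreeBitmap are made up, and a real kernel would search the bitmap a word at a time rather than bit by bit.

```python
def bitmap_bytes(disk_bytes, block_bytes):
    """RAM needed for a free-block bitmap using one bit per disk block."""
    blocks = disk_bytes // block_bytes
    return (blocks + 7) // 8


class FreeBitmap:
    """One bit per block; as in the notes, a 1 bit means the block is free."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray(b"\xff" * ((nblocks + 7) // 8))

    def is_free(self, n):
        return bool((self.bits[n // 8] >> (n % 8)) & 1)

    def allocate(self):
        """Return the lowest-numbered free block and mark it in use."""
        for n in range(self.nblocks):
            if self.is_free(n):
                self.bits[n // 8] &= ~(1 << (n % 8)) & 0xFF
                return n
        raise OSError("no free blocks")

    def release(self, n):
        self.bits[n // 8] |= 1 << (n % 8)
```

For the example in the notes, bitmap_bytes(64 * 2**30, 8 * 2**10) gives 2**20 bytes, i.e., 1MB for a 64GB disk with 8KB blocks.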
A naive implementation would simply link the free blocks together and just keep a pointer to the head of the list. Although it wastes no space, this simple scheme has poor performance since it requires an I/O for every acquisition or return of a free block.
In the naive scheme a free disk block contains just one pointer; even though it could hold a thousand pointers. The improved scheme, shown on the right, has only a small number of the blocks on the list. Those blocks point not only to the next block on the list, but also to many other free blocks that are not directly on the list.
As a result only one in about 1000 requests for a free block requires an I/O, a great improvement.
Unfortunately, a bad case still remains. Assume the head block on the list is exhausted, i.e. points only to the next block on the list. A request for a free block will receive this block, and the next one on the list is brought in. It is full of pointers to free blocks not on the list (so far so good).
If a free block is now returned we repeat the process and get back to the in-memory block being exhausted. This can repeat forever, with one extra I/O per request.
Tanenbaum shows an improvement where you try to keep the one in-memory free block half full of pointers. Similar considerations apply when splitting and coalescing nodes in a B-tree.
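The claim that only about one request in 1000 needs an I/O can be illustrated with a simplified Python model of the improved list. This is a sketch under assumptions of my own: the class name GroupedFreeList is invented, only allocation is modeled (so the bad case and Tanenbaum's half-full fix are not shown), and the fact that a pointer block is itself a free block is ignored.

```python
class GroupedFreeList:
    """Free list whose on-disk blocks each hold many pointers to free blocks."""

    def __init__(self, free_blocks, pointers_per_block):
        # "Disk": the free-block numbers packed into groups of pointers.
        self.on_disk = [
            free_blocks[i:i + pointers_per_block]
            for i in range(0, len(free_blocks), pointers_per_block)
        ]
        self.in_memory = []   # the one pointer block currently in RAM
        self.reads = 0        # number of simulated disk reads

    def allocate(self):
        if not self.in_memory:
            if not self.on_disk:
                raise OSError("no free blocks")
            self.in_memory = self.on_disk.pop()   # one I/O refills many pointers
            self.reads += 1
        return self.in_memory.pop()
```

With 1000 pointers per block, 3000 allocations cost only 3 simulated reads.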
Two limits can be placed on disk blocks owned by a given user, the so-called "soft" and "hard" limits.
A user is never permitted to exceed the hard limit.
This limitation is enforced by having system calls such
as write return failure if the user is
already at the hard limit.
A user is permitted to exceed the soft limit during a login session provided it is corrected prior to logout. This limitation is enforced by forbidding logins (or issuing a warning) if the user is above the soft limit.
Often files in directories such as /tmp are not counted towards either limit since the system is permitted to delete these files when needed.
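The two checks can be sketched as follows. This is a hypothetical model, not how any particular system implements quotas: the class Quota and its method names are invented, and the write is modeled as failing whenever it would push the user past the hard limit.

```python
class Quota:
    """Per-user block quota with a soft and a hard limit (illustrative only)."""

    def __init__(self, soft, hard):
        self.soft = soft
        self.hard = hard
        self.used = 0

    def write_blocks(self, n):
        """Model of a write needing n new blocks; fails at the hard limit."""
        if self.used + n > self.hard:
            return False            # the system call would return failure
        self.used += n
        return True

    def over_soft_limit(self):
        """Checked at login: warn the user, or refuse the login, if True."""
        return self.used > self.soft
```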
Start Lecture #24
Remark: Please remember that the cutoff dates on NYU classes for the labs are firm.
A physical backup simply copies every block in order onto a tape or second disk or the cloud (or other backup media). It is simple and useful for disaster protection, but not useful for retrieving individual files.
We will study logical backups, i.e., dumps that are file and directory based not simply block based.
Tanenbaum describes the (four phase) unix dump algorithm.
All modern systems support full and incremental dumps.
Traditionally, disks were dumped onto tapes since the latter were cheaper per byte. Since tape densities are increasing more slowly than disk densities, an ever larger number of tapes are needed to dump a disk. This has led to the importance of disk-to-disk dumps.
Another possibility is to utilize RAID, which we study next chapter.
Modern systems have utility programs that check the consistency of
a file system.
A different utility is needed for each file system type in the system, but a "wrapper" program is often created so that the user is unaware of the different utilities.
The unix utility is called fsck (file system check) and the windows utility is called chkdsk (check disk).
"fix" the errors found (for most errors).
Less of an issue now than previously. Disks are more reliable and, more importantly, disks and disk controllers take care of most bad blocks themselves.
Demand paging again!
Demand paging is a form of caching: Conceptually, the process resides on disk (the big and slow medium) and only a portion of the process (hopefully a small portion that is heavily accessed) resides in memory (the small and fast medium).
The same idea can be applied to files. The file resides on disk but a portion is kept in memory. The area in memory used for those file blocks is called the buffer cache or block cache.
Some form of LRU replacement is used.
The buffer cache is clearly good and simple for blocks that are only read.
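For the read-only case, a buffer cache with strict LRU replacement can be sketched in Python as below. The class name BufferCache and the read_block parameter (a stand-in for a real disk read) are assumptions made for illustration; real systems use approximations of LRU and must also handle writes.

```python
from collections import OrderedDict

class BufferCache:
    """Read-only block cache with LRU replacement (illustrative sketch)."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block    # hypothetical: block number -> data
        self.blocks = OrderedDict()     # least recently used entry is first
        self.hits = 0
        self.misses = 0

    def get(self, n):
        if n in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(n)            # now the most recently used
        else:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # evict least recently used
            self.blocks[n] = self.read_block(n)   # simulated disk read
        return self.blocks[n]
```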
What about writes?
"write-allocate" policy. Although "no-write-allocate" is possible and sometimes used for memory caches, it performs poorly for disk caching.
needed.
Homework: 32. The performance of a file system depends upon the cache hit rate (fraction of blocks found in the cache). If it takes 1 msec to satisfy a request from the cache, but 40 msec to satisfy a request if a disk read is needed, give a formula for the mean time required to satisfy a request if the hit rate is h. Plot this function for values of h ranging from 0 to 1.
When the access pattern looks
sequential, read ahead is
employed.
This means that when processing a read() request for
block n of a file, the system guesses that a read() request
for block n+1 will shortly be issued and hence
automatically fetches block n+1.
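A minimal Python sketch of read ahead follows. It rests on my own simplifications: "looks sequential" is taken to mean that the current request is for the block right after the previous one, the class name ReadAhead and the fetch parameter (a stand-in for a disk read) are invented, and nothing is ever evicted.

```python
class ReadAhead:
    """Prefetch block n+1 when block n follows the previously read block."""

    def __init__(self, fetch):
        self.fetch = fetch    # hypothetical: block number -> data (disk read)
        self.cache = {}
        self.last = None      # block number of the previous request

    def read(self, n):
        if n not in self.cache:
            self.cache[n] = self.fetch(n)
        looks_sequential = self.last is not None and n == self.last + 1
        if looks_sequential and n + 1 not in self.cache:
            self.cache[n + 1] = self.fetch(n + 1)   # guess: n+1 is wanted next
        self.last = n
        return self.cache[n]
```

Reading blocks 0, 1, 2 in order causes block 2 to be fetched during the read of block 1, so the third read finds its block already in memory.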
Questions:
The idea is to try to place near each other blocks that are likely to be used together.
super-blocks, consisting of several contiguous blocks.
If clustering is not done, files can become spread out all over the disk and a utility (defrag on windows) can be run which makes files contiguous on the disk.
CP/M was a very early and simple OS. It ran on primitive hardware with very little ram and disk space. CP/M had only one directory in the entire system. The directory entry for a file contained pointers to the disk blocks of the file. If the file contained more blocks than could fit in a directory entry, a second entry was used.
File systems on cdroms do not need to support file addition or deletion and as a result have no need for free blocks. A CD-R (recordable) does permit files to be added, but they are always added at the end of the disk. The space allocated to a file is not recovered even when the file is deleted, so the (implicit) free list is simply the blocks after the last file recorded.
The result is that the file systems for these devices are quite simple.
This international standard forms the basis for essentially all file systems on data cdroms (music cdroms are different and are not discussed). Most Unix systems use iso9660 with the Rock Ridge extensions, and most windows systems use iso9660 with the Joliet extensions.
The ISO9660 standard permits a single physical CD to be partitioned and permits a cdrom file system to span many physical CDs. However, these features are rarely used and we will not discuss them.
Since files do not change, they are stored contiguously and each directory entry need only give the starting location and file length.
File names are 8+3 characters (directory names just 8) for iso9660-level-1 and 31 characters for -level-2. There is also a -level-3 in which a file is composed of extents which can be shared among files and even shared within a single file (i.e. a single physical extent can occur multiple times in a given file).
Directories can be nested only 8 deep.
The Rock Ridge extensions were designed by a committee from the unix community to permit a unix file system to be copied to a cdrom without information loss.
These extensions included.
special files, i.e. including devices in the file system name structure.
The Joliet extensions were designed by Microsoft to permit a windows file system to be copied to a cdrom without information loss.
These extensions included.
We discussed this linked-list, File-Allocation-Table-based file system previously. Here we add a little history. More details can be found in this lecture from Igor Kholodov. (A local copy is here).
The FAT file system has been supported since the first IBM PC (1981) and is still widely used. Indeed, considering the number of cameras, MP3 players, and other devices sold, it is very widely used.
Unlike CP/M, MS-DOS always had support for subdirectories and metadata such as date and size.
File names were restricted in length to 8+3.
As described previously, the directory entries point to the first block of each file and the FAT contains pointers to the remaining blocks.
The free list was supported by using a special code in the FAT for
free blocks.
You can think of this as a bitmap with a wide "bit".
The first version, FAT-12, used 12-bit block numbers so a partition could not exceed 2^12 blocks. A subsequent release went to FAT-16.
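The way a directory entry and the FAT together describe a file, and the way the FAT doubles as the free list, can be sketched in Python. The markers FREE and EOF below are placeholders chosen for readability, not the actual codes used by any FAT version, and the table contents are made up.

```python
FREE = "free"   # placeholder for the special "free block" code
EOF = "eof"     # placeholder for the "last block of the file" code

def file_blocks(fat, first_block):
    """Follow the chain starting at the block named in the directory entry."""
    blocks = []
    b = first_block
    while b != EOF:
        blocks.append(b)
        b = fat[b]          # the FAT entry for b names the next block
    return blocks

def free_blocks(fat):
    """The free list is implicit: every entry holding the FREE code."""
    return [i for i, entry in enumerate(fat) if entry == FREE]

# A made-up 8-entry table: one file occupies blocks 2 -> 5 -> 7.
example_fat = [EOF, EOF, 5, FREE, FREE, 7, FREE, EOF]
max_fat12_blocks = 2 ** 12   # 12-bit block numbers allow at most 4096 blocks
```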
Two changes were made: Long file names were supported and the file allocation table was switched from FAT-16 to FAT-32. These changes first appeared in the second release of Windows 95.
The hard part of supporting long names was keeping compatibility with the old 8+3 naming rule. That is, new file systems created with windows 98 using long file names must be accessible if the file system is subsequently used with an older version of windows that supported only 8+3 file names. The ability for old systems to read data from new systems was important since users often had both new and old systems and kept many files on floppy disks that were used on both systems.
This ability to access new objects on old systems is called "forward compatibility" and is often not achieved. For example, files produced by new versions of Microsoft Word may not be comprehended by old versions of Word. The reverse concept, "backward compatibility", the ability to read old files on new systems, is much easier to accomplish and is almost always achieved. For example, new versions of Microsoft Word could always read documents produced by older versions.
Forward compatibility of Windows file names was achieved by permitting a file to have two names: a long one and an 8+3 one. The primary directory entry for a file in windows 98 is the same format as it was in MS-DOS and contains the 8+3 file name. If the long name fits the 8+3 format, the story ends here.
If the long name does not fit in 8+3, an (often ugly) 8+3 alternate
name is produced and stored in the normal location.
The long name is stored in one or more "auxiliary" directory entries adjacent to the main entry.
These auxiliary entries are set up to appear invalid to the old OS,
which (surprisingly) ignores them.
FAT-32 used 32 bit words for the block numbers (actually, it used 28 bits) so the FAT could be huge (2^28 entries). Windows 98 kept only a portion of the FAT-32 table in memory at a time, a form of caching / demand-paging.
I presented the inode system in some detail above. Here we just describe a few properties of the filesystem beyond the inode structure itself.
touch 255-char-name is OK but touch 256-char-name is not.
Read
The most noticeable characteristic of the current ensemble of I/O devices is their great diversity.
"output only" device such as a printer supplies very little output to the computer (perhaps an out-of-paper indication) but receives voluminous input from the computer. Again it is better thought of as a transducer, converting electronic data from the computer to paper data for humans.
To perform I/O typically requires both mechanical and electronic activity. The mechanical component is called the device itself. The electronic component is called a controller or an adapter.
The controllers are the devices as far as the OS is concerned. That is, the OS code is written with the controller specification in hand not with the device specification.
Start Lecture #25
Consider a disk controller processing a read request. The goal is to copy data from the disk to some portion of the central memory. How is this to be accomplished?
The controller contains a microprocessor and memory, and is connected to the disk (by wires). When the controller requests a sector from the disk, the sector is transmitted to the controller via the wires and is stored by the controller in its own memory.
The separate processor and memory on the controller gives rise to two questions.
Typically the interface the controller presents to the OS consists of a few registers located on the controller board. (See the diagram on the right.)
go button.
So the first question above becomes, how does the OS read and write the device registers?
"I/O space" into which the registers are mapped. In this case special I/O space instructions are used to accomplish the loads and stores.
"elegant" solution in that it uses an existing mechanism to accomplish a second objective.
We now address the second question, moving data between the controller and the main memory. Recall that the disk controller, when processing a read request, pulls the desired data from the disk to its own buffer. (Similarly, it pushes data from the buffer to the disk when processing a write).
Without DMA, i.e., with programmed I/O (PIO), the read is completed by having the CPU issue loads and stores to copy the data from the buffer to the desired memory locations.
With DMA the controller writes the main memory itself, without intervention of the CPU.
Clearly DMA saves CPU work. However, the number of memory references remains the same so, even with DMA, the CPU may be delayed due to busy memory.
An important point is that there is less data movement with DMA so the buses are used less and the entire operation takes less time. Compare the two blue arrows vs. the single red arrow.
Since PIO is pure software it is easier to change, which is an advantage.
Initiating DMA requires a number of bus transfers from the CPU to the controller to write the device registers. So DMA is most effective for large transfers where the setup is amortized.
A serious complexity of DMA is that the bus must support "multiple masters" and hence requires arbitration, which leads to issues similar to those we faced with critical sections.
Why not just go from the disk straight to the main memory?
Homework: 15. A local area network is used as follows. The user issues a system call to write data packets to the network. The operating system then copies the data to a kernel buffer. Then it copies the data to the network controller board. When all the bytes are safely inside the controller, they are sent over the network at a rate of 10 megabits/sec. The receiving network controller stores each bit a microsecond after it is sent. When the last bit arrives, the destination CPU is interrupted, and the kernel copies the newly arrived packet to a kernel buffer to inspect it. Once it has figured out which user the packet is for, the kernel copies the data to the user space. If we assume that each interrupt and its associated processing takes 1 msec, that packets are 1024 bytes (ignore the headers), and that copying a byte takes 1 microsecond, what is the maximum rate at which one process can pump data to another? Assume that the sender is blocked until the work is finished at the receiving side and an acknowledgement comes back. For simplicity, assume that the time to get the acknowledgement back is so small it can be ignored.
As with any large software system, good design and layering are important.
We want most of the OS to be unaware of the characteristics of the specific devices attached to the system. (This principle of device independence is not limited to I/O; we also want the OS to be largely unaware of the specific CPU employed.)
This objective has been accomplished quite well for files stored on various devices. Most of the OS, including the file system code, and most applications can read or write a file without knowing if the file is stored on an internal SATA hard disk, an external USB SCSI disk, an external USB Flash Ram, a tape, or (for read-only applications) a CD-ROM.
This principle also applies for user programs reading or writing streams. A program reading from "standard input", which is normally the user's keyboard, can be told to instead read from a disk file with no change to the application program. Similarly, "standard output" can be redirected to a disk file. However, the low-level OS code dealing with disks is rather different from that dealing with keyboards and (character-oriented) terminals.
One can say that device independence permits programs to be implemented as if they will read and write generic or abstract devices, with the actual devices specified at run time. Although writing to a disk has differences from writing to a terminal, Unix cp, DOS copy, and many programs we compose need not be aware of these differences.
However, there are devices that really are special. The graphics interface to a monitor (that is, the graphics interface presented by the video controller, often called a "video card") does not resemble the "stream of bytes" we see for disk files.
Homework: What is device independence?
We have already discussed the value of the name space implemented by file systems. There is no dependence between the name of the file and the device on which it is stored. So a file called IAmStoredOnAHardDisk might well be stored on a thumb drive.
A more interesting example is that, once a device is mounted on a (Unix) directory, the device is named exactly the same as the directory was. So if a CD-ROM was mounted on the (existing) directory /x/y, a file named joe on the CD-ROM would now be accessible as /x/y/joe.
There are several aspects to error handling including: detection, correction (if possible) and reporting.
I/O must be asynchronous for good performance. That is, the OS cannot simply wait for an I/O to complete. Instead, it proceeds with other activities and responds to the interrupt that is generated when the I/O has finished.
Read X
Y = X+1
Print Y
Users (mostly) want no part of this. The code sequence on the right should print a value one greater than that read. But if the assignment is performed before the read completes, the wrong value can easily be printed.
Performance junkies sometimes do want the asynchrony so that they can have another portion of their program executed while the I/O is underway. That is, they implement a mini-scheduler in their application code.
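The Read X, Y = X+1, Print Y example can be mimicked in Python with a thread pool standing in for asynchronous I/O. This is only an analogy: slow_read is a made-up function that sleeps to imitate device latency, and the value 41 is arbitrary. The point is that the program may overlap other work with the read, but must wait for completion before using X.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_read():
    """Stand-in for a device read; the sleep imitates I/O latency."""
    time.sleep(0.05)
    return 41

def read_then_add():
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_read)   # start the "I/O" without waiting
        other_work = sum(range(1000))      # useful work overlapped with the read
        x = pending.result()               # block until the read has completed
        y = x + 1                          # only now is X safe to use
        return y, other_work
```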
See this message from linux kernel developer Ingo Molnar for his take on asynchronous IO and kernel/user threads/processes. You can find the entire discussion here.
Buffering is often needed to hold data for examination prior to sending it to its desired destination.
When two buffers are used the producer can deposit data into one buffer while previously deposited data is being consumed from the other buffer. We illustrated an example of double buffering early in the course when discussing threads. See also our discussion of the bounded buffer (a.k.a. producer-consumer) problem.
Since this involves copying the data, which can be expensive, modern systems try to avoid as much buffering as possible. This is especially noticeable in network transmissions, where the data could conceivably be copied many times.
I do not know if any systems actually do all seven.
For devices like printers and CD-ROM drives, only one user at a time is permitted. These are called serially reusable devices, which we studied in the deadlocks chapter. Devices such as disks and ethernet ports can, on the contrary, be shared by concurrent processes without any deadlock risk.
As mentioned just above, with programmed I/O the main processor (i.e., the one on which the OS runs) moves the data between memory and the device. This is the most straightforward method for performing I/O.
One question that arises is, how does the processor know when the device is ready to accept or supply new data?
while (device-not-available)
    do-other-useful-work
The simplest implementation is shown on the right. The processor, when it seeks to use a device, loops continually querying the device status, until the device reports that it is free. This is called polling or busy waiting.
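The polling loop can be written out against a simulated device. Everything here is hypothetical: FakeDevice simply reports "not ready" for a fixed number of status queries, and do_other_useful_work is whatever function the caller supplies.

```python
class FakeDevice:
    """Simulated device that is busy for a fixed number of status queries."""

    def __init__(self, busy_polls):
        self.remaining = busy_polls

    def ready(self):
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True

def poll_until_ready(device, do_other_useful_work):
    """Query the device status repeatedly; return how many polls found it busy."""
    busy_polls = 0
    while not device.ready():
        busy_polls += 1
        do_other_useful_work()
    return busy_polls
```

If do_other_useful_work does nothing, this degenerates into pure busy waiting.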
If we poll infrequently (and do useful work in between), there can be a significant delay from when the previous I/O is complete to when the OS detects the device availability.
If we poll frequently (and thus are able to do little useful work in between) and the device is (sometimes) slow, polling is clearly wasteful.
The extreme case is where the process does nothing between polls. For a slow device this can take the CPU out of service for a significant period. This bad situation leads us to ... .
As we have just seen, a difficulty with polling is determining the frequency with which to poll. Another problem is that the OS must continually return to the polling loop, i.e., we must arrange that do-other-useful-work takes the desired amount of time. Really we want the device to tell the CPU when it is available, which is exactly what an interrupt does.
The device interrupts the processor when it is ready and an interrupt handler (a.k.a. an interrupt service routine) then initiates transfer of the next datum.
Normally interrupt schemes perform better than polling, but not always since interrupts are expensive on modern machines. To minimize interrupts, better controllers often employ ...
We discussed DMA above.
An additional advantage of DMA, not mentioned above, is that the processor is interrupted only at the end of a command, not after each datum is transferred. Some devices present a character at a time, but with a DMA controller, an interrupt occurs only after a buffer has been transferred.
Layers of abstraction as usual prove to be effective. Most systems are believed to use the following layers.
We will give a bottom up explanation.
We discussed behavior similar to an interrupt handler before when studying page faults. Then it was called "assembly-language code". A difference is that page faults are caused by specific user instructions, whereas interrupts just "happen". However, the "assembly-language code" for a page fault accomplishes essentially the same task as the interrupt handler does for I/O.
In the present case, we have a process blocked on I/O and the I/O event has just completed. As we will soon see the process is currently executing the driver. So the goal is to unblock the process, mark it as ready, and then call the scheduler. Possible methods are.
Once the process is ready, it is up to the scheduler to decide when it should run.
We gave an overview before.
Device drivers form the portion of the OS that is tailored to the characteristics of individual controllers. They form the dominant portion of the source code of the OS since there are hundreds of drivers. Normally some mechanism is used so that the only drivers loaded on a given system are those corresponding to hardware actually present.
Indeed, modern systems often have "loadable device drivers", which are loaded dynamically when needed. This way if a user buys a new device, no manual configuration changes to the operating system are needed. Instead, after the device is installed it will be auto-detected during the boot process and the corresponding driver is loaded.
Sometimes an even fancier method is used and the device can be plugged in while the system is running (USB devices are like this). In this case it is the device insertion that is detected by the OS and that causes the driver to be loaded.
Finally, some systems can dynamically unload a driver, when the corresponding device is unplugged.
The driver has two "parts" corresponding to its two access points. The figure on the right, which we first encountered at the beginning of the course, helps explain the two parts (and possibly their names).
The driver is accessed by the main line OS via the envelope in response to an I/O system call. The portion of the driver accessed in this way is sometimes called the "top" part.
The driver is also accessed by the interrupt handler when the I/O completes (this completion is signaled by an interrupt). The portion of the driver accessed in this way is sometimes called the "bottom" part.
In some systems the drivers are implemented as user-mode processes. Indeed, Tanenbaum's MINIX system works that way, and in previous editions of the text, he describes such a scheme. However, most systems have the drivers in the kernel itself and the 4e describes this scheme. I previously included both descriptions, but have now grayed out the user-mode process description.
The three-part diagram to the right and below shows the high-level actions that occur. On the right we see the initial state, process A is running and is about to issue a read system call. Process B is ready to run; it is waiting to be scheduled. (Although only A and B are shown, there may be other ready and blocked processes as well.)
Below we see later states. The second diagram shows the situation after process A has issued its read system call. This process is now blocked waiting for the read to complete. The scheduler has chosen to run process B. In the third diagram, the read is complete and process A is now ready. Perhaps the scheduler will run it soon.
What follows is the Unix-like view in which the driver is invoked by the OS acting on behalf of a user process (alternatively stated, the process shifts into kernel mode). Thus one says that the scheme follows a "self-service" paradigm in that the process itself (now in kernel mode) executes the driver.
The numbers in the diagram to the right correspond to the numbered steps in the description that follows. The previous diagram showed the state of processes A and B at steps 1, 6, and 9 in the execution sequence.
"go button" mentioned previously).
Actions that occur when the user issues an I/O request.
Actions that occur when an interrupt arrives (i.e., when an I/O has been completed).
Start Lecture #26
Remark: A practice final is available on NYU classes.
The device-independent code contains most of the I/O functionality, but not most of the code since there are very many drivers. All drivers of the same class (say all hard disk drivers) do essentially the same thing in slightly different ways due to slightly different controllers.
As stated above the bulk of the OS code consists of device drivers and thus it is important that the task of driver writing not be made more difficult than needed. As a result each class of devices (e.g. the class of all disks) has a defined driver interface to which all drivers for that class of device conform. The device independent I/O portion processes user requests and calls the drivers.
Naming is again an important O/S functionality. In addition it offers a consistent interface to the drivers. The Unix method works as follows
"special" file in the /dev directory.
"special" files and also contain so-called major and minor device numbers.
allan dev # ls -l /dev/sd*
brw-r----- 1 root disk 8, 0 Apr 25 09:55 /dev/sda
brw-r----- 1 root disk 8, 1 Apr 25 09:55 /dev/sda1
brw-r----- 1 root disk 8, 2 Apr 25 09:55 /dev/sda2
brw-r----- 1 root disk 8, 3 Apr 25 09:55 /dev/sda3
brw-r----- 1 root disk 8, 4 Apr 25 09:55 /dev/sda4
brw-r----- 1 root disk 8, 5 Apr 25 09:55 /dev/sda5
brw-r----- 1 root disk 8, 6 Apr 25 09:55 /dev/sda6
brw-r----- 1 root disk 8, 16 Apr 25 09:55 /dev/sdb
brw-r----- 1 root disk 8, 17 Apr 25 09:55 /dev/sdb1
brw-r----- 1 root disk 8, 18 Apr 25 09:55 /dev/sdb2
brw-r----- 1 root disk 8, 19 Apr 25 09:55 /dev/sdb3
brw-r----- 1 root disk 8, 20 Apr 25 09:55 /dev/sdb4
allan dev #
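The two numbers printed before the date in each line of the listing (e.g., 8, 17 for /dev/sdb1) are the major and minor device numbers. On a Unix-like system they can be recovered from a program as sketched below; the helper names are my own, and device_numbers needs a path to a device special file that actually exists on the machine where it is run.

```python
import os

def split_device_number(rdev):
    """Split a combined device number into (major, minor)."""
    return os.major(rdev), os.minor(rdev)

def device_numbers(path):
    """Major and minor numbers of a device special file such as /dev/sda."""
    return split_device_number(os.stat(path).st_rdev)
```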
A wide range of possibilities are actually done in real systems. Including both extreme examples of everything is permitted and nothing is (directly) permitted. As mentioned, security is an enormous issue; one that we don't have the time to cover adequately.
Buffering is necessary since requests come in sizes specified by the user, whereas data is delivered by reads and accepted by writes in sizes specified by the device. Buffering is also important so that a user process using getchar() in C or the Scanner in Java is not blocked and unblocked for each character read.
The text describes double buffering, which we have discussed, and circular buffers, which we have not. They are important programming techniques, but are not specific to operating systems.
The system must enforce exclusive access for non-shared devices like CD-ROMs. We discussed the issues involved when studying deadlocks.
A good deal of I/O software is actually executed by unprivileged code running in user space. This code includes library routines linked into user programs, standard utilities, and daemon processes.
If one uses the strict definition that the operating system consists of the (supervisor-mode) kernel, then this I/O code is not part of the OS. However, very few use this strict definition.
Some library routines are very simple and just move their arguments into the correct place (e.g., a specific register) and then issue a trap to the kernel to do the real work.
I think everyone considers these routines to be part of the operating system. Indeed, they implement the published user interface to the OS. For example, when we specify the (Unix) read system call by
count = read (fd, buffer, nbytes)
as we did in chapter 1, we are really giving the parameters and return value of such a library routine.
Although users could write these routines (they are unprivileged), it would make their programs non-portable and would require them to write in assembly language since neither trap nor specifying individual registers is available in high-level languages.
Other library routines, notably standard I/O (stdio) in Unix, are definitely not trivial. For example consider the formatting of floating point numbers done in System.out.printf() and the reverse operation done by the Scanner in nextDouble().
In Unix-like systems (probably including MacOS) the extremely large and complex graphics libraries and the GUI itself are outside the kernel. In Windows, the GUI is inside the kernel.
Printing to a local printer is often performed in part by a regular program (lpr in Unix) that copies (or links) the file to a standard place, and in part by a daemon (lpd in Unix) that reads the copied files and sends them to the printer. The daemon might be started when the system boots.
Note that this implementation of printing uses spooling, i.e., the file to be printed is copied somewhere by lpr and then the daemon works with this copy. Mail uses a similar technique (but generally it is called queuing, not spooling).
The diagram on the right shows the various layers and some of the actions that are performed by each layer.
The arrows show the flow of control. The blue downward arrows show the execution path made by a request from user space eventually reaching the device itself. The red upward arrows show the response, beginning with the device supplying the result for an input request (or a completion acknowledgement for an output request) and ending with the initiating user process receiving its response.
Homework: 14. In which of the four I/O software layers is each of the following done?
Homework: 16. Why are output files for the printer normally spooled on disk before being printed?
The ideal storage device is
When compared to central memory, disks (i.e., hard drives) are big, cheap (per byte), and slow.
Show a real disk opened up and illustrate the components.
Consider the following characteristics of a disk.
Overlapping I/O operations is important when the system has more than one disk. Many disk controllers can do overlapped seeks, i.e. issue a seek to one disk while another disk is already seeking.
As technology improves the space taken to store a bit decreases, i.e., the bit density increases. This changes the number of cylinders per inch of radius (the cylinders are closer together) and the number of bits per inch along a given track.
Despite what Tanenbaum says later, it is not true that when one head is reading from cylinder C, all the heads can read from cylinder C with no penalty. It is, however, true that the penalty is very small.
Current commodity disks require a little less than 10ms before transferring the first byte and then transfer roughly 100KB per ms (if contiguous). Specifically
This is quite extraordinary. For a large sequential transfer, in the first 10ms, no bytes are transmitted; in the next 10ms, 1,000,000 bytes are transmitted. The analysis suggests using large disk blocks, 100KB or more.
But the internal fragmentation would be severe since many files are small. Moreover, transferring small files would take longer with a 100KB block size.
In practice typical block sizes are 4KB-8KB.
Multiple block sizes have been tried (e.g., blocks are 8KB but a file can also have fragments that are a fraction of a block, say 1KB).
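The arithmetic behind the block-size tradeoff can be sketched as follows. This is only an illustration using the round figures quoted above (about 10ms before the first byte, then about 100KB per ms); the function names are made up for this sketch.

```python
# Round figures from the notes, not measurements of any particular disk.
LATENCY_MS = 10.0          # seek + rotational latency before the first byte
BYTES_PER_MS = 100_000     # transfer rate once data starts flowing


def access_time_ms(nbytes):
    """Time to read nbytes stored contiguously: fixed latency plus transfer."""
    return LATENCY_MS + nbytes / BYTES_PER_MS


def effective_rate(nbytes):
    """Bytes delivered per ms when the initial latency is counted."""
    return nbytes / access_time_ms(nbytes)


for size in (4_000, 100_000, 1_000_000):
    print(size, access_time_ms(size), round(effective_rate(size)))
```

With these figures a 4KB transfer is almost entirely latency, which is why large contiguous transfers look so attractive despite the fragmentation cost.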
Some systems employ techniques to encourage consecutive blocks of a given file to be stored near each other. In the best case, logically sequential blocks are also physically sequential and then the performance advantage of large block sizes is obtained without the disadvantages mentioned.
In a similar vein, some systems try to cluster related files (e.g., files in the same directory).
Homework: Consider a disk with an average seek time of 5ms, an average rotational latency of 5ms, and a transfer rate of 40MB/sec.
Originally, a disk was implemented as a three dimensional array
Cylinder#, Head#, Sector#
The cylinder number determined the cylinder, the head number specified the surface (recall that there is one head per surface), i.e., the head number determined the track within the cylinder, and the sector number determined the sector within the track.
But there is something wrong here. An outer track is longer (in centimeters) than an inner track, but each stores the same number of sectors. Essentially some space on the outer tracks was wasted.
Eventually disks lied. They said they had a virtual geometry as above, but really had more sectors on outer tracks (like a ragged array). The electronics on the disk converted between the published virtual geometry and the real geometry.
Modern disks continue to lie for backwards compatibility, but also support Logical Block Addressing in which the sectors are treated as a simple one dimensional array with no notion of cylinders and heads.
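The relation between the two views can be sketched as the usual flattening of a three dimensional array into a one dimensional one. The geometry constants below are hypothetical, and the sketch assumes the classic PC convention that sectors are numbered from 1 while cylinders and heads are numbered from 0.

```python
# Hypothetical virtual geometry; a real disk publishes its own values.
HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK = 63


def chs_to_lba(cylinder, head, sector):
    """Flatten (cylinder, head, sector) into one logical block number.

    Assumes sectors are numbered from 1, cylinders and heads from 0.
    """
    track = cylinder * HEADS_PER_CYLINDER + head
    return track * SECTORS_PER_TRACK + (sector - 1)


def lba_to_chs(lba):
    """Inverse mapping: recover the virtual (cylinder, head, sector)."""
    track, sector0 = divmod(lba, SECTORS_PER_TRACK)
    cylinder, head = divmod(track, HEADS_PER_CYLINDER)
    return cylinder, head, sector0 + 1


print(chs_to_lba(1, 0, 1))
print(lba_to_chs(chs_to_lba(5, 3, 7)))
```

Note that this converts only between the published virtual geometry and block numbers; the real (ragged) geometry is known only to the disk electronics.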
Start Lecture #27
The name and its acronym RAID came from Dave Patterson's group at Berkeley. IBM kept the well-known acronym, but changed the name to Redundant Array of Independent Disks. I wonder why?
The basic idea is to utilize multiple drives to simulate a single larger drive, but with redundancy and increased performance (and, at the time, decreased price).
Definition: The parity or exclusive or of N bits is 1 if an odd number of the N bits are 1 and is 0 otherwise. Hence, the original N bits plus the parity bit always have an even number of 1 bits.
If you lose any one of the N+1 bits, it can be recovered as the exclusive or of the remaining N bits.
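The recovery property can be checked with a few lines of code. This is a minimal sketch on single bits; the names parity and recover are chosen for illustration.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """Exclusive or of the bits: 1 if an odd number of them are 1."""
    return reduce(xor, bits, 0)


def recover(stored, lost_index):
    """Rebuild the bit at lost_index as the XOR of all the other bits."""
    survivors = stored[:lost_index] + stored[lost_index + 1:]
    return parity(survivors)


data = [1, 0, 1, 1]               # N = 4 data bits
stored = data + [parity(data)]    # N + 1 bits, always an even number of 1s

for i in range(len(stored)):
    print(i, stored[i], recover(stored, i))
```

The same idea applies bitwise to whole disk blocks, with each drive playing the role of one bit position.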
The different RAID configurations are often called different levels, but this is not a good name since there is no hierarchy and it is not clear that higher levels are better than low ones. However, the terminology is commonly used so I will follow the trend and describe them level by level, but having very little to say about some levels.
The levels are
There are three components to disk response time: seek, rotational latency, and transfer time. Disk arm scheduling is concerned with minimizing seek time by reordering the requests.
These algorithms are relevant only if there are several I/O requests pending. For many PCs, the system is so underutilized that there are rarely multiple outstanding I/O requests and hence no scheduling is possible. At the other extreme, many large servers are I/O bound with significant queues of pending I/O requests. For these systems, effective disk arm scheduling is crucial.
Although disk scheduling algorithms are performed by the OS, they are also sometimes implemented in the electronics on the disk itself. The disks I brought to class were old so I suspect those didn't implement scheduling, but the then-current operating systems definitely did.
We study the following algorithms all of which are quite simple.
no scheduling, but I wouldn't.
Happy Days?) and by elevators.
stole coins since requesting an already requested song was a nop.
Assume the heads are at cylinder 50 and the following requests are present, in FIFO order: 75, 40, 200, 0, 55. Show on the board in what order the requests are satisfied if the disk arm scheduling algorithm is FIFO, Pick, SSTF, Elevator (going up), and Circular Elevator (going up).
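Four of these algorithms can be sketched in a few lines each. This is an illustrative sketch using the standard textbook definitions, not code from any real driver; Pick is omitted, and ties in SSTF are broken by arrival order, which is an assumption of this sketch.

```python
def fifo(head, requests):
    """Serve requests in arrival order."""
    return list(requests)


def sstf(head, requests):
    """Shortest seek time first: always go to the closest pending cylinder."""
    pending, order = list(requests), []
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))  # earliest wins ties
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order


def elevator_up(head, requests):
    """Sweep upward serving requests, then reverse and sweep downward."""
    up = sorted(c for c in requests if c >= head)
    down = sorted((c for c in requests if c < head), reverse=True)
    return up + down


def circular_elevator_up(head, requests):
    """Sweep upward, then jump back to the lowest request and sweep up again."""
    up = sorted(c for c in requests if c >= head)
    wrapped = sorted(c for c in requests if c < head)
    return up + wrapped


requests = [75, 40, 200, 0, 55]
for algorithm in (fifo, sstf, elevator_up, circular_elevator_up):
    print(algorithm.__name__, algorithm(50, requests))
```

For the example above (heads at 50) these give 75, 40, 200, 0, 55 for FIFO; 55, 40, 75, 0, 200 for SSTF; 55, 75, 200, 40, 0 for Elevator; and 55, 75, 200, 0, 40 for Circular Elevator.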
Start Lecture #28
Once the heads are on the correct cylinder, there may be several requests to service. All the systems I know use circular Elevator based on sector numbers to retrieve these requests.
Question: Why always Circular Elevator?
Answer: Because the disk rotates in only one direction.
The above is certainly correct for requests to the same track. If requests are for different tracks on the same cylinder, a question arises of how fast the disk can switch from reading one track to another on the same cylinder. There are two components to consider.
The electronic switching is very fast. I doubt that would be an issue.
The second point is more problematic. I know it was not true in the 1980s: I proposed a disk in which all tracks in a cylinder were read simultaneously and coupled this parallel readout disk with some network we had devised. Alas, a disk designer set me straight: The heads are not perfectly aligned with the tracks.
Homework: 31. Disk requests come in to the disk driver for cylinders 10, 22, 20, 2, 40, 6, and 38, in that order. A seek takes 6 ms per cylinder. How much seek time is needed for
Homework: 33. A salesman claimed that their version of Unix was very fast. For example, their disk driver used the elevator algorithm to reorder requests for different cylinders. In addition, the driver queued multiple requests for the same cylinder in sector order. Some hacker bought a version of the OS and tested it with a program that read 10,000 blocks randomly chosen across the disk. The new Unix was not faster than an old one that did FCFS for all requests. What happened?
Often the disk/controller caches (a significant portion of) the entire track whenever it accesses a block, since the seek and rotational latency penalties have already been paid. In fact modern disks have multi-megabyte caches that hold many recently read blocks. Since modern disks cheat and don't have the same number of blocks on each track, it is better for the disk electronics (and not the OS or controller) to do the caching since the disk is the only part of the system to know the true geometry.
Most disk errors are handled by the device/controller and not the OS itself. That is, disks are manufactured with more sectors than are advertised and spares are used when a bad sector is referenced. Older disks did not do this and the operating system would form a secret file of bad blocks that were never to be used.
The hardware is simple. It consists of
The counter reload can be automatic or under OS control. If it is done automatically, the interrupt occurs periodically (the frequency is the oscillator frequency divided by the value in the register).
The value in the register can be set by the operating system and thus this programmable clock can be configured to generate periodic interrupts at any desired frequency (providing that frequency divides the oscillator frequency).
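The relation between the register value and the interrupt frequency can be sketched with a tiny simulation. It assumes the usual arrangement in which the counter is decremented on every oscillator pulse, an interrupt is raised when it reaches zero, and the counter is then reloaded from the register automatically. The 1 MHz oscillator is a made-up figure.

```python
def reload_for(oscillator_hz, desired_hz):
    """Register value that yields desired_hz, if it divides the oscillator rate."""
    if oscillator_hz % desired_hz != 0:
        raise ValueError("desired frequency must divide the oscillator frequency")
    return oscillator_hz // desired_hz


def count_interrupts(oscillator_hz, reload_value, seconds=1):
    """Simulate the counter with automatic reload and count the interrupts."""
    counter = reload_value
    interrupts = 0
    for _ in range(oscillator_hz * seconds):   # one iteration per pulse
        counter -= 1
        if counter == 0:
            interrupts += 1                    # clock interrupt
            counter = reload_value             # automatic reload
    return interrupts


OSCILLATOR_HZ = 1_000_000                      # hypothetical oscillator
reload_value = reload_for(OSCILLATOR_HZ, 100)  # aim for 100 ticks per second
print(reload_value, count_interrupts(OSCILLATOR_HZ, reload_value))
```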
As we have just seen, the clock hardware simply generates a periodic interrupt, called the clock interrupt, at a set frequency. Using this interrupt, the OS software can accomplish a number of important tasks.
The basic idea is to increment a counter each clock tick (i.e., each clock interrupt). The simplest solution is to initialize this counter at boot time to the number of ticks since a fixed date (Unix traditionally uses midnight, 1 January 1970). Thus the counter always contains the number of ticks since that date and hence the current date and time is easily calculated. Two questions arise.
Three methods are used for initialization. The system can contact one or more known time sources (see the Wikipedia entry for NTP), a human operator can type in the date and time, or the system can have a battery-powered, backup clock. The last two methods only give an approximate time.
Overflow is a real problem if a 32-bit counter is used. In this case two counters are kept, the low-order and the high-order. Only the low order is incremented each tick; the high order is incremented whenever the low order overflows. That is, a counter with twice as many bits is simulated.
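The two-counter technique can be sketched as follows. Python integers do not overflow, so the sketch masks each half to 32 bits explicitly; the class name TickCounter is invented for the example.

```python
MASK32 = 0xFFFFFFFF


class TickCounter:
    """A 64-bit tick count simulated with two 32-bit counters."""

    def __init__(self, ticks=0):
        self.low = ticks & MASK32
        self.high = (ticks >> 32) & MASK32

    def tick(self):
        """Called once per clock interrupt."""
        self.low = (self.low + 1) & MASK32        # wraps to 0 on overflow
        if self.low == 0:                         # low-order half overflowed
            self.high = (self.high + 1) & MASK32

    def value(self):
        """The combined count, as if one counter with twice the bits."""
        return (self.high << 32) | self.low


clock = TickCounter(MASK32)    # low-order counter is about to overflow
clock.tick()
print(clock.high, clock.low, clock.value())
```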
The system decrements a counter at each tick. The quantum expires when the counter reaches zero. The counter is loaded when the scheduler runs a process (i.e., changes the state of the process from ready to running). This is what I (and I would guess you) did for the (processor) scheduling lab.
At each tick, bump a counter in the process table entry for the currently running process.
Users can request a signal at some future time (the Unix alarm system call). The system also on occasion needs to schedule some of its own activities to occur at specific times in the future (e.g., exercise a network time out).
The conceptually simplest solution is to have one timer for each event.
Instead, we simulate many timers with just one using the data structure on the right with one node for each event.
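One common way to simulate many timers with a single clock is a delta list: the pending events are kept in firing order and each node records how many ticks after the preceding node it fires, so only the first node needs to be decremented at each tick. The sketch below assumes that organization (the figure is not reproduced here), and the class and method names are hypothetical.

```python
class TimerList:
    """Pending events in firing order; each node holds ticks after the PRECEDING node."""

    def __init__(self):
        self.nodes = []                       # list of [delta_ticks, name]

    def add(self, ticks_from_now, name):
        """Schedule name to fire ticks_from_now (at least 1) ticks in the future."""
        remaining = ticks_from_now
        i = 0
        while i < len(self.nodes) and self.nodes[i][0] <= remaining:
            remaining -= self.nodes[i][0]     # skip events that fire no later
            i += 1
        self.nodes.insert(i, [remaining, name])
        if i + 1 < len(self.nodes):
            self.nodes[i + 1][0] -= remaining  # successor is now relative to us

    def tick(self):
        """Called at each clock interrupt; returns the events that fire now."""
        fired = []
        if self.nodes:
            self.nodes[0][0] -= 1             # only the head is decremented
            while self.nodes and self.nodes[0][0] <= 0:
                fired.append(self.nodes.pop(0)[1])
        return fired


timers = TimerList()
for when, name in [(5, "a"), (3, "b"), (5, "c"), (10, "d")]:
    timers.add(when, name)

schedule = {}
for t in range(1, 11):
    fired = timers.tick()
    if fired:
        schedule[t] = fired
print(schedule)
```

Insertion costs a walk down the list, but the work done at every tick is constant unless an event fires.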
The objective is to obtain a histogram giving how much time was spent in each software module of a given user program.
The program is logically divided into blocks of say 1KB and a counter is associated with each block. At each tick the profiled code checks the program counter and bumps the appropriate counter.
After the program is run, a (user-mode) utility program can determine the software module associated with each 1K block and present the fraction of execution time spent in each module.
If we use a finer granularity (say 10B instead of 1KB), we get increased precision but more memory overhead.
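The bookkeeping can be sketched as follows. The sampled program counter values are made up for illustration; a real profiler would record them at each clock tick.

```python
from collections import Counter

BLOCK_SIZE = 1024    # 1KB granularity, as in the text


def profile(sampled_pcs, block_size=BLOCK_SIZE):
    """Bump one counter per block for each sampled program counter value."""
    histogram = Counter()
    for pc in sampled_pcs:
        histogram[pc // block_size] += 1
    return histogram


def fractions(histogram):
    """Fraction of the samples (i.e., of execution time) that fell in each block."""
    total = sum(histogram.values())
    return {block: count / total for block, count in histogram.items()}


samples = [0x0040, 0x0410, 0x0420, 0x0800, 0x0430]   # hypothetical PC samples
histogram = profile(samples)
print(dict(histogram))
print(fractions(histogram))
```

Passing a smaller block_size gives the finer granularity mentioned above, at the cost of more counters.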
Homework: 37. The clock interrupt handler on a certain computer requires 2msec (including process switching overhead) per clock tick. The clock runs at 60 Hz. What fraction of the CPU is devoted to the clock?
At each key press and key release a scan code is written into the keyboard controller and the computer is interrupted. By remembering which keys have been depressed and not released the software can determine Cntl-A, Shift-B, etc.
For example.
There are two fundamental modes of input, traditionally called raw and cooked in Unix and now sometimes called noncanonical and canonical in POSIX.
In raw mode the application sees every character the user types. Indeed, raw mode is character oriented. All the OS does is convert the keyboard scan codes to characters and pass these characters to the application. Full screen editors use this mode.
Cooked mode is line oriented, that is, the application program only receives input when a newline (a.k.a. enter) is pressed. The OS delivers the line to the application program after cooking it as follows.
The characters must be buffered until the application issues a read and (for cooked mode) an end-of-line has been entered.
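The buffering in cooked mode can be sketched as follows. The only cooking shown is an erase character, which is assumed here to be backspace; the class CookedInput and its method names are invented for the example, and real terminal drivers do considerably more.

```python
class CookedInput:
    """Line-oriented input: the application sees nothing until newline."""

    def __init__(self):
        self.current = []     # characters of the line still being typed
        self.ready = []       # completed lines waiting for a read

    def key(self, ch):
        """Called for each character produced by the keyboard."""
        if ch == "\b":                    # assumed erase character
            if self.current:
                self.current.pop()
        elif ch == "\n":                  # end of line: the line is now complete
            self.ready.append("".join(self.current) + "\n")
            self.current = []
        else:
            self.current.append(ch)

    def read(self):
        """Return the next complete line, or None if a real read would block."""
        return self.ready.pop(0) if self.ready else None


tty = CookedInput()
for ch in "helo\bp":
    tty.key(ch)
before_newline = tty.read()    # no end-of-line yet, so nothing is delivered
tty.key("\n")
line = tty.read()
print(repr(before_newline), repr(line))
```

In raw mode the key method would instead hand each character to the application immediately, with no erase processing.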
Whenever the mouse is moved or a button is pressed, it sends a message to the computer consisting of Δx, Δy, and the status of the buttons. That is all the hardware does. Issues such as double click vs. two clicks are all handled by the software.
In the beginning these were essentially typewriters (and were often called glass ttys) and therefore simply received a stream of characters. Soon after, they accepted commands (called escape sequences) that would position the cursor, insert and delete characters, etc.
This is the window system on Unix machines. From the very beginning it was a client-server system in which the server (the display manager) could run on a separate machine from the clients (graphical applications such as pdf viewers, calendars, browsers, etc).
This is a large subject that would take many lectures to cover well. Both the hardware and the software are complex. On a high-powered game computer, the graphics hardware is more powerful and likely more expensive than the CPU on which the operating system runs.
Read.