Distributed Operating Systems
1997-98 Spring
Allan Gottlieb
Text: Tanenbaum, Modern Operating Systems
================ Lecture 1 ================
---------------- Some administrivia --------------
Reaching me
Office Hour Thurs 11:15--12:15 (i.e. after class)
email gottlieb@nyu.edu
x8-3344
715 Broadway room 1001
Midterm & Final & Labs & Homework
Lab != Homework Describe both
Labs can be run on home machine
... but you are responsible
Please PLEASE upload a copy to ACF
Upper Left board for announcements ... but you are responsible.
Handouts are low tech, not great (a feature!)
Web page: http://allan.ultra.nyu.edu/gottlieb/os
Comment on the text
The "right" text is Tannenbaum's "Distributed OS", but that book
has a high overlap with the second part of "Modern OS" and you
already own the latter (V22.202 used it).
---------------- Chapter 9, Intro to Distributed Systems ----------------
Enablers
Powerful micros: Inexpensive machines worth connecting
High-speed LANs: Move data quickly
Low latency: Small transfers in about a ms
High Bandwidth: 10 Megabits/sec (now 100 "soon" 1000)
Gave rise to Distributed (as opposed to Centralized) systems
(Somewhat vague) Definition of a distributed system
A distributed system is a collection of independent computers that
appears to the users of the system as a single computer.
Advantages over centralized
Repeal of Grosch's law--indeed the reverse occurred (diminishing
returns) so that now dist sys are more cost effective
Absolute performance beyond a single system
Big web engine
Weather prediction
Sometimes the natural soln since the problem is distributed
Branches of a bank
Computer supported cooperative work
Incremental growth
Reliability
In principle you have redundancy so fault tolerance
Advantages over independent machines
Data sharing
Device sharing
Sometimes more expensive than the computer
Communication (between people)--email
Flexibility
In addition to the symmetric soln of each user gets
pc/wkstation can have some more powerful and/or specialized
machines.
Disadvantages of distributed systems
Immature software
Network failure / saturation
Security
Reliability
Often with a dist sys you need ALL the components to work
HOMEWORK 9-1
Hardware configuration and concepts.
Book shows its age a little here
Flynn's classification of machines with multiple CPUs
SISD: Single Instruction stream Single Data stream
SIMD: Process arrays all at once
MISD: Perhaps doesn't exist; not important
MIMD: Each CPU capable of autonomy
================ Lecture 2 ================
Multiprocessors: Shared address space across all
processors
Multicomputer: Each processor has its own
Tightly Coupled vs. Loosely Coupled
Former means capable of cooperating closely on a single
problem
Fine grained communication: i.e. communicate often
Large amt of data transferred
Tightly coupled machines can normally emulate loosely coupled
ones but might give up some fault tolerance
Can solve a single large problem on loosely coupled hardware,
but the problem must be coarse grained.
Various cryptographic challenges solved over the internet
with email as the only communication
Bus-based multiprocessors
These are symmetric multiprocessors (SMPs): From the viewpoint
of one processor, all the others look the same. None are
logically closer than others. SMP also implies cache coherent
(see below).
Hardware is fairly simple. Recall a uniprocessor looks like
Proc
|
|
Cache(s) Mem
| |
| |
------------------------------------ Memory Bus
|
|
"Chipset"
|
|
------------------------------- I/O Bus (e.g. PCI)
| | | |
| | | |
video scsi serial etc
Complication. When a scsi disk writes to memory (i.e. a disk
read is performed), you need to keep the cache(s) up to date,
i.e. CONSISTENT with the data. If the cache had the old value
before the disk read, the system can either INVALIDATE the
cache entry or UPDATE it with the new value.
Facts about caches. They can be WRITE THROUGH: when the
processor issues a store the value goes into the cache and also
is sent to memory. Or they can be WRITE BACK: the value only
goes to the cache. In the latter case the cache line is marked
dirty and when it is evicted it must be written back to memory.
To make a bus-based MP just add more processor-caches
Proc Proc
| |
| |
Cache(s) Cache(s) Mem
| | |
| | |
------------------------------------ Memory Bus
|
|
"Chipset"
|
|
------------------------------- I/O Bus (e.g. PCI)
| | | |
| | | |
video scsi serial etc
A key question now is whether the caches are automatically
kept consistent (a.k.a. coherent) with each other. When the
processor on the left writes a word, you can't let the old
value hang around in the right hand cache(s) or the right hand
processor can read the wrong value.
If the cache is write back, then on an initial write the cache
must claim ownership of the cache line (invalidate all other
copies). If the cache is write through, you can either
invalidate other copies or update them.
Because of the broadcast nature of the bus it is not hard to
design snooping (a.k.a snoopy) caches that maintain
consistency. I call this the "Turkish Bath" mode of
communication because "everyone sees what everyone else has
got".
These machines are VERY COMMON.
DISADVANTAGE (really limitation) Cannot be used for a large
number of processors since the bandwidth needed on the bus
grows with the number of processors and this gets increasingly
difficult and expensive to supply. Moreover the latency grows
with the number of processors for both speed-of-light and more
complicated (electrical) reasons.
HOMEWORK 9-2 (ask me next time about this one)
Other SMPs
From the software viewpoint, the key property of the bus-based
MP is that they are SMPs, the bus itself is just an
implementation property.
To support larger numbers of processors, other interconnection
networks could be used. The book shows two possibilities
crossbars and omega networks. Ignore what is said there as it
is too superficial and occasionally more or less wrong.
Tanenbaum is not a fan of shared memory.
(The Ultracomputer in figure 9.4 is an NYU machine; my main
research activity from 1980-95)
Another name for SMP is UMA for Uniform Memory Access,
i.e. all the memories are equally distant from a given
processor.
NUMAs
For larger numbers of processors, NUMAs (NonUniform Memory
Access) MPs are being introduced. Typically some memory is
associated with each processor and that memory is close.
Others are further away.
P--C---M P--C---M
| |
| |
|----------------------------|
| |
| interconnection network |
| |
|----------------------------|
CC-NUMAs (Cache Coherent NUMAs) are programmed like SMPs BUT
to get good performance one must try to exploit the memory
hierarchy and have most references hit in your local cache and
most others in the part of the shared memory in your "node".
HOMEWORK 9-2' Will semaphores work here
NUMAs (not cache coherent) are yet harder to program as you
must maintain cache consistency manually (or with compiler
help).
HOMEWORK 9-2'' Will semaphores work here
Bus-based Multicomputers
The big difference is that it is an MC not an MP, NO shared
memory.
In some sense all the computers on the internet form one
enormous MC.
The interesting case is when there is some closer cooperation
between processors; say the workstations in one distributed
systems research lab cooperating on a single problem.
Application must tolerate long-latency communication and
modest bandwidth, using current state-of-the-practice
Recall the figure for a typical computer
Proc
|
|
Cache(s) Mem
| |
| |
------------------------------------ Memory Bus
|
|
"Chipset"
|
|
------------------------------- I/O Bus (e.g. PCI)
| | | |
| | | |
video scsi serial etc
The simple way to get a bus-based MC is to put an ethernet
controller on the I/O bus and then connect all these with a
bus called ethernet. (Draw this on the board).
Better (higher performance) is to put the Network Interface
(possibly ethernet possibly something else) on the memory bus
Other (bigger) multicomputers
Once again buses are limiting so switched networks are used.
Much could be said, but currently grids (SGI/Cray T3E) and
omega networks (IBM SP2) are popular. Hypercubes less popular
than in the 80s.
HOMEWORK 9-2''' Will semaphores work here, 9-3, 9-4
================ Lecture 3 ================
Software concepts
Tightly vs. loosely coupled
High autonomy and low communication characterize loose
coupling
Network Operating Systems
Loosely coupled software and hardware
Autonomous workstations on a LAN
Commands normally run locally
Remote login and remote execution possible (rlogin, rcp, rsh)
Next step is to have a shared file system.
Have file SERVERS that export filesystems to CLIENTS on
the LAN.
Client makes a request to server, which sends a response.
Example from unix: Unix file structure is a tree (really
dag). Dos is a forest (a: c: not connected). For local
disks, unix supplies mount to make one file system part of
another. To get the shared file system, the idea is to
mount remote filesystems locally. The server exports the
filesystems to the clients, who do the mount.
Client decides where to mount the fs. Hence the same
filesystem can be mounted at different places on different
clients.
This fails to give location transparency.
An example from my research group (a few years ago)
We had several machines all running unix. Named
after the owner: allan, eric, richard, jan, jim, susan
and one other called lab. We had several
filesystems that were for common material /a /b /c.
I remember that allan.ultra.nyu.edu had /e and
exported it to other machines. Our convention was
to mount /e on /nfs/allan/e but have a link
/e --> /nfs/allan/e so on all machines /e got you
to the same place.
However there were two system filesystems / and
/usr that were on all machines (a little different
on each). For sysadmin we wanted all these
accessible to all our admins. So on allan I would
have jim's /usr mounted on /nfs/jim/usr. I could
not link /usr --> /nfs/jim/usr. Why?
The name I used was /jim.usr but that is not
location transparent.
Another example from same place.
To save disk space we kept only one copy of
/usr/gnu/lib/emacs. It was stored on lab in
/d/usr.gnu.lib/emacs and everyone mounted /d and
linked /usr/gnu/lib/emacs -->
/d/usr.gnu.lib/emacs.
Problem: When lab was down, so was emacs!
So we went to several servers for this. All
controlled manually.
Reads and writes are caught by the client and requests
made to the server. For reads the server responds by
giving the requested data to the client, which is then
used to satisfy the read
Text gives detailed description of NFS implementation. We
will defer this until chapter 13 (distributed filesystems)
where it belongs.
Is a network OS a dist system?
True distributed systems (according to St. Andrew)
(Comparatively) tightly-coupled software on loose-coupled
hardware
Goal is to make the illusion that there is a single computer.
Single-system image
Virtual uniprocessor
HOMEWORK 9-6
Many scientists use the term single-system image or virtual
uniprocessor even if the hardware is tightly coupled.
Official (Modern OS) def: A DISTRIBUTED SYSTEM is one that runs
on a collection of machines that do not have shared memory,
yet looks to its users like a single computer.
The newer book (Dist OS) uses the def we gave earlier: A
distributed system is a collection of independent computers
that appears to the users of the system as a single computer.
Single IPC (interprocess communication) mechanism.
Any process (on any machine) can talk to ANY other process
(even those on another machine) using the same procedure.
Single protection scheme to protect all objects
Same scheme (ACL, unix rwx bits) on all machines.
Consistent process management across machines.
File system looks the same everywhere. Not only the same name
for a given file but, for example, the same rules for
filenames.
Easiest to do if all the machines are the same so the same
operating system can be run on each. However, there is
considerable work on "heterogeneous computing" where (in
principle at least) different computers are used.
For each program, have a separate binary for each
architecture.
What about floating point formats and wordsize?
What about byte order?
Autoconvert when doing ipc
Even when each system is the same architecture and runs
the same OS, one may not want a centralized master that
controls everything.
Paging is probably best done locally (assuming a local
disk).
Scheduling is not clear. Advantages both ways. Many
systems have local scheduling but communication and
process migration for load balancing.
================ Lecture 4 ================
Multiprocessor systems
Tanenbaum's "timesharing" term dates the book.
Easiest to get a single system image.
One example is my workstation at NECI (soon to move north).
Shared memory makes a single run queue (ready list) natural choice.
So scheduling is "trivial": Take uniprocessor code and add
semaphores.
The standard diagram for process states applies.
The diagram is available in
postscript (best) and in
html (most portable).
What about processor affinity?
Shared memory makes shared I/O buffer cache natural choice.
Gives true (i.e. uniprocessor) semantics for I/O with
little extra work.
Avoiding performance bottlenecks is serious.
Shared memory makes a uniprocessor file system the natural choice.
Summary (Tanenbaum figure 9-13)
Network Distributed Multiprocessor
Operating Operating Operating
Item System System System
--------------------------------------------------------------------
Virtual Uniprocessor No Yes Yes
Same OS No Yes Yes
Copies of OS N N 1
Communication Shared Files Messages Shared memory
Network Protocols Yes Yes No
Single Ready List No No Yes
Well def file sharing Rare Yes Yes
--------------------------------------------------------------------
Design issues
A big one is TRANSPARENCY, i.e. you don't "see" the multiplicity
of processors. Or you see through them.
(NOT always what is wanted. Might want to create as many (child)
processes as there are processors. Might choose a different
coordination algorithm for 2 processes than for 50)
Location transparency: cannot tell where the resources are
located.
So ``pr -Psam'' does not tell you where sam is located (but
cat /etc/printcap does).
I assume Tanenbaum means you need not know where the resources
are located. Presumably there are routing tables somewhere
saying where they are.
Migration transparency: moving a resource does not require a name
change.
Recall our research system. If you move a file from /e stored
on allan to lab it cannot be in /e (property of mount).
Symlinks can hide this slightly.
Replication transparency: Don't know how many copies of a
resource exist.
E.g. multithreaded (or multiprocess) server
Concurrency transparency.
Don't notice other users (except for slowdowns).
That is, same as for uniprocessor.
Can lock resource--but deadlocks possible
Again just like for uniprocessor.
HOMEWORK 9-11
Parallelism transparency
You write uniprocessor programs and poof it all works hundreds
of times faster on a dist sys with 500 processors.
Far from current state of the art
My wording above is that you don't always want transparency.
If you had parallelism transparency, you would always want it.
Flexibility--really monolithic vs micro kernel
Should the (entire) OS run in supervisor mode?
Tanenbaum is rather one sided here. He is a big microkernel
fan. Amoeba is his group's system.
Monolithic kernel is conventional, all services supplied by
kernel, OS. Often derived from uniprocessor OS
Microkernel only supplies
IPC
Some mem mgt
low level proc mgt and sched
low level I/O
Micro kernel does not supply
Filesystems
Most system calls
Full process management (deciding which process is highest
priority)
Instead these are supplied by servers
Advantages
Fairly easy to modify while the system is running.
Bug in server can bring down system but the bug will
appear in the server not in another part of the OS
Disadvantage
Performance: Tanenbaum says it's minor but not so clear.
Crossing more protection domains and having more transfers
of control, hurt cache performance and this is
increasingly important as speed ratios between processors
and DRAM grow.
HOMEWORK 9-9
Reliability
It's an improvement if you need just one of many possible
resources in order to work.
HOMEWORK 9-12
It's a negative if you need all the resources. (AND vs OR)
Avail with prob p^n if AND (p = prob one resource avail)
Avail with prob 1 - (1-p)^n if OR
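A quick worked example (the numbers are just illustrative): with p = .95
and n = 3, the AND case gives .95^3 ~ .86, while the OR case gives
1 - (.05)^3 = .999875. Replication only helps when any one copy suffices.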
Lamport's def of dist sys "One on which I cannot get any work
done because some machine I have never heard of has crashed"
Availability: percentage of time the system is up. So
increased replication increases availability
Consistency: Must keep copies consistent so data not garbled.
Increasing replication makes this worse (more expensive).
Performance
Coarse grained parallelism: Little, infrequent
communication/coordination. This is the easy case to get high
perf for. Sometimes called "embarrassingly parallel".
Fine grained parallelism: Tight coordination and/or much data
communication. Tough. Many msgs.
Scalability
Not trivial! Describes Minitel, where the French phone company is
"currently" (i.e. prior to 1992) installing a terminal in every
home. If successful, "other countries will inevitably adopt
similar systems". It is 1998 and we don't have it yet.
Is the web a dist sys?
Centralized components/tables/algorithms.
If the degree of parallelism is large any centralized
"thing" is a potential bottleneck.
Fault tolerance (single point of failure).
Centralized tables have fault tolerance and performance
bottleneck problems.
The perf problem can be solved, if the table is concurrently
accessible, by combining requests on the way to the
server of the centralized table.
It is often too expensive to get the entire (accurate) state
of the system to one computer to act on. Instead, one prefers
decentralized algorithms.
No machine has complete information
Decisions made based on locally available info (obvious?)
Tolerate some machine failures
Don't assume a global clock (exactly synchronized)
HOMEWORK 9-13
---------------- Chapter 10: Communication in Dist Sys ----------------
With no shared memory, communication is very different from that in a
uniprocessor (or a shared memory multiprocessor).
PROTOCOL: An agreement between communicating parties on how
communication is to proceed.
Error correction codes.
Blocksize.
Ack/Nak
LAYERED protocol: The protocol decisions concern very different things
How many volts is 1 or zero? How wide is the pulse? (LOW level)
Error correction
Routing
Sequencing
As a result you have many routines that work on the various
aspects. They are called layered.
Layer X of sender acts as if it is directly communicating with
layer X of the receiver but in fact it is communicating with layer X-1
of sender.
Similarly layer X of sender acts as a virtual layer X+1 of
receiver to layer X+1 of sender.
Famous example is the ISO OSI (International Standards Organization
Open Systems Interconnection).
First lets look at the OSI diagram just as an example of layering
The diagram is available in
postscript (best) and in
html (most portable).
So for example the network layer sends msgs intended for the other
network layer but in fact sends them to the data link layer
Also the network layer must accept msgs from the transport layer,
which it then sends to the other network layer (really its own
data link layer).
What a layer really does to a msg it receives is that it adds a
header (and maybe a trailer) that is to be interpreted by its
corresponding layer in the receiver.
So the network layer adds a header (in front of the transport
layer's header) and sends to the other network layer (really its
own data link layer that adds a header in front of the network
layer's--and a trailer--)
================ Lecture 5 ================
So headers get added as you go down the sender's layers (often
called the PROTOCOL STACK or PROTOCOL SUITE). They get used (and
stripped off) as the msg goes up the receiver's stack.
It all starts with process A sending a msg. By the time it
reaches the wire it has 6 headers (the physical layer doesn't add
one--WHY?) and one trailer.
Nice thing is that the layers are independent. You can change one
layer and not change the others.
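A tiny C sketch of the nesting (all field names and sizes here are invented
for illustration, not taken from OSI or any real protocol suite): each
layer's header wraps what it got from the layer above, and the data link
layer also adds a trailer.

    #include <stdio.h>

    struct transport_hdr { unsigned short src_port, dst_port; unsigned int seq; };
    struct network_hdr   { unsigned int src_addr, dst_addr; unsigned char hops; };
    struct datalink_hdr  { unsigned char dst_mac[6], src_mac[6]; };
    struct datalink_trl  { unsigned int crc; };

    struct frame {                       /* what actually goes on the wire */
        struct datalink_hdr  dl;         /* added last, stripped first     */
        struct network_hdr   net;
        struct transport_hdr trans;
        char                 data[1024]; /* the application's message      */
        struct datalink_trl  trl;        /* error detection bits           */
    };

    int main(void)
    {
        printf("frame is %lu bytes\n", (unsigned long) sizeof(struct frame));
        return 0;
    }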
HOMEWORK 10-1
Now let's understand the layers themselves.
Physical layer: hardware, i.e. voltages, speeds, duplexity,
connectors
Data link layer: Error correction and detection. "Group the bits
into units called frames".
Frames contain error detection (and correction) bits.
This is what the pair of data link layers do when viewed
as an extension of the physical.
But when being used, the sending DL layer gets a packet
from the network layer and breaks it into frames and adds
the error detection bits.
Network layer: Routing.
Connection oriented network-layer protocol: X.25. Send a
msg to destination and establish a route that will be used
for further msgs during this connection (a connection
number is given). Think telephone.
Connectionless: IP (Internet Protocol). Each PACKET
(message between the network layers) is routed separately.
Think post office.
Transport layer: make reliable and ordered (but not always).
Break incoming msg into packets and send to corresponding
transport layer (really send to ...). They are sequence
numbered.
Header contains info as to which packets have
been sent and received.
These sequence numbers are for the end to end msg.
I.e. if allan.ultra.nyu.edu sends msg to allan.nj.nec.com
the transport layer numbers the packets. But these
packets may take different routes. On any one hop the
data link layer keeps the frames ordered.
If you use X.25 for network there is little for transport
layer to do.
If you use IP for network layer, there is a lot to do.
Important transport layers are TCP (Transmission Control
Protocol) and UDP (User Datagram Protocol)
TCP is connection oriented and reliable as described
above.
UDP is basically just IP (a "dummy" transport layer) for when
you don't need that service (error rates are low already
and msgs almost always come in order).
TCP/IP is widely used.
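To make the contrast concrete, here is a minimal sketch of sending one UDP
datagram with the Berkeley sockets interface (the address 127.0.0.1 and port
5000 are arbitrary choices): no connection setup, no acks, no sequencing;
the datagram is simply handed to IP.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);      /* UDP socket */
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5000);
        dst.sin_addr.s_addr = inet_addr("127.0.0.1");

        const char *msg = "hello";
        /* Just hand the datagram to IP and hope it arrives
           (contrast with TCP's connection and retransmission). */
        sendto(s, msg, strlen(msg), 0, (struct sockaddr *) &dst, sizeof dst);
        close(s);
        return 0;
    }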
Session Layer: dialog and sync (we skip it)
Presentation layer: Describes "meaning" of fields (we skip it).
Application layer: For specific apps (e.g. mail, news, ftp)
(we skip it)
Another (newer) famous protocol is ATM (asynchronous transfer mode)
NOT in modern operating systems
The hardware is better (newer) and has much higher data rates than
previous long haul networks.
155Mbits/sec compared to T3=45Mb/s is the low end. 622Mb/sec
is midrange.
In pure circuit switching a circuit is established and held
(reserved) for the duration of the transmission.
In store and forward packet switching, go one hop at a time
with the entire packet
ATM establishes a circuit but it is not (exclusively)
reserved. Instead the packet is broken into smallish
fixed-sized CELLS. Cells from different transmissions can be
interleaved on same wire
Cute fact about high speed LONG haul network is that a LOT of bits
are on the wires. 15ms from coast to coast. So at 622Mb/sec,
have 10Mb on the wire. So if receiver says STOP, 20Mb will still
come. At 622Mb/sec it takes 1.6ms to push a Mb file out of the
computer. So if you wait for reply, most of the time the line is
idle.
HOMEWORK Assume one interrupt to generate 8-bits and the interrupt
takes 1us (us=microsecond) to process. Assume you are sending at a
rate of 155Mb/sec. What percent of the CPU time is being used to
process the interrupts?
Trouble in paradise
The problem with all these layered things (esp OSI) is that all
the layers add overhead.
Especially noticeable in systems with high speed interconnects
where the processing time counts (i.e. not completely wire
limited).
Clients and Servers
Users and providers of services
This provides more than communication. It gives a structuring to
the system.
Much simpler protocol.
Physical and datalink are as normal (hardware)
All the rest is request/reply protocol
Client sends a request; server sends a reply
Less overhead than full OSI
Minimal kernel support needed
send (destination, message)
receive (port, message)
variants (mailboxes, etc) discussed later
Message format (example; triv file server)
struct message
source
dest
opcode
count -- number of bytes to read or write
offset
result
filename
data
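The same message rendered as a C struct (the field widths, e.g. the filename
and 1K data sizes, are arbitrary guesses, not from any particular system):

    struct message {
        int  source;            /* sending process / port            */
        int  dest;              /* receiving process / port          */
        int  opcode;            /* e.g. READ or WRITE                */
        int  count;             /* number of bytes to read or write  */
        long offset;            /* position in the file              */
        int  result;            /* status / bytes actually moved     */
        char filename[256];
        char data[1024];
    };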
Servers are normally infinite loops
loop
receive (port, message)
switch message.opcode
-- handle various cases filling in replymessage
replymessage.result = result from case
send (message.source, replymessage)
Here is a client for copying A to B
loop
-- set (name=A, opcode=READ, count, position) in message
send (fileserver, message)
receive (port, replymessage) -- could use same message
-- set (name=B, opcode=WRITE, count=result, position) in message
send (fileserver, message)
position +=result
while result>0
-- return OK or error code.
Bug? If the read gets an error you send bogus (neg count) to write
loop
-- set (name=A, opcode=READ, count, position) in message
send (fileserver, message)
receive (port, replymessage) -- could use same message
exit when result <=0
-- set (name=B, opcode=WRITE, count=result, position) in message
send (fileserver, message)
position +=result
end
--return OK or error code
Addressing the server
Need to specify the machine server is on and the "process" number
Actually more common is to use the port number
Tanenbaum calls it local-id
Server tells kernel that it wants to listen on this port
Can we avoid giving the machine name (location transparency)?
Can have each server pick a random number from a large space
(so probability of duplicates is low).
When a client wants service S it broadcasts an "I need S" and the
server supplying S responds with its location. So now the
client knows the address.
This first broadcast and reply can be used to eliminate
duplicate servers (if desired) and different services
accidentally using the same number
Another method is to use a name server that has mapping from
service names to locations (and local ids, i.e. ports).
At startup servers tell the name server their location
Name server a bottleneck? Replicate and keep consistent
To block or not to block
Synchronous vs asynchronous
Send and receive synchronous is often called rendezvous
Async send
Do not wait for the msg to be received; return control immediately.
How can you re-use the message variable
Have the kernel copy the msg and then return. This costs
performance
Don't copy but send interrupt when msg sent. This makes
programming harder
Offer a syscall to tell when the msg has been sent.
Similar to above but "easier" to program. But difficult
to guess how often to ask if msg sent
================ Lecture 6 ================
Async Receive
Return control before have filled in msg variable with
received messages.
How can this be useful?
Wait syscall (until msg avail)
Test syscall (has msg arrived)
Conditional receive (receive or announce no msg yet)
interrupt
none are beautiful
Timeouts
If have blocking primitive, send or receive could wait
forever.
Some systems/languages offer timeouts.
HOMEWORK 10-3
To buffer or not to buffer (unbuffered)
If unbuffered, the receiver tells where to put the msg
Doesn't work if async send is before receive (where to put the msg).
For buffered, the kernel keeps the msg (in a MAILBOX) until the
receiver asks for it.
Raises buffer management questions
HOMEWORK 10-5
Acks
Can assert that msg delivery is not reliable
Checked at higher level
Kernel can ack every msg
Senders and repliers keep msg until receive ack
Kernel can use reply to ack every request but explicitly ack replies
Kernel can use reply as ack to every request but NOT ack replies
Client will resend request
Not always good (if server had to work hard to calculate reply)
Kernel at server end can deliver request and send ack if reply not
forthcoming soon enough. Again can either ack reply or not.
Large design space. Several multiway choices for acks, buffers, ...
Click for diagram in
postscript or
html.
Other messages (meta messages??)
Are you alive?
I am alive
I am not alive (joke)
Mailbox full
No process listening on this port
---------------- Remote Procedure Call (RPC) ----------------
Birrell and Nelson (1984)
Recall how different the client code for copying a file was from
normal centralized (uniprocessor) code.
Let's make client server request-reply look like a normal procedure
call and return.
Tanenbaum says C has call-by-value and call-by-reference. I don't
like this. In C you pass pointers by value (at least for scalars) and
I don't consider that the same. But "this is not a programming
languages course"
HOMEWORK 10-7
Recall that getchar in the centralized version turns into a read
syscall (I know about buffering). The following is for unix
Read looks like a normal procedure to its caller.
Read is a USER mode program (needs some assembler code)
Read plays around with registers and then does a poof (trap)
After the poof, the kernel plays with regs and then does a
C-language routine and lots gets done (drivers, disks, etc)
After the I/O the process gets unblocked, the kernel read plays
with regs, and does an unpoof. The user mode read plays with regs
and returns to the original caller
Let's do something similar with request reply
User (client) does subroutine call to getchar (or read)
Client knows nothing about messages
We link in a user mode program called the client stub (analogous
to the user mode read above).
Takes the parameters to read and converts them to a msg
(MARSHALLS the arguments)
Sends a msg to machine containing the server directed to a
SERVER STUB
Does a blocking receive (of the reply msg)
Server stub is linked with the server.
Receives the msg from the client stub.
Unmarshalls the arguments and calls the server (as a
subroutine)
The server procedure does what it does and returns (to the server
stub)
Server knows nothing about messages
Server stub now converts this to a reply msg sent to the client
stub
Marshalls the arguments
Client stub unblocks and receives the reply
Unmarshalls the arguments
RETURNS to the client
Client believes (correctly) that the routine it calls has returned
just like a normal procedure does.
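A toy client stub in C, using the send/receive primitives and message layout
assumed earlier in these notes (send_msg/recv_msg stand in for those kernel
calls; FILE_SERVER, MY_PORT, and the opcode are invented names, not a real
RPC library). The htonl/ntohl calls preview the byte-order issue discussed
next.

    #include <string.h>
    #include <arpa/inet.h>     /* htonl/ntohl: convert to/from network byte order */

    struct message { int opcode, count, offset, result; char name[256]; char data[1024]; };

    #define OP_READ 1
    extern void send_msg(int dest, struct message *m);   /* assumed kernel send    */
    extern void recv_msg(int port, struct message *m);   /* assumed kernel receive */
    extern int  FILE_SERVER, MY_PORT;                    /* assumed addresses      */

    /* Client stub: looks like an ordinary read to its caller. */
    int rpc_read(const char *name, int offset, char *buf, int nbytes)
    {
        struct message req, rep;

        /* Marshall the arguments into the request message. */
        req.opcode = htonl(OP_READ);
        req.count  = htonl(nbytes);
        req.offset = htonl(offset);
        strncpy(req.name, name, sizeof req.name - 1);
        req.name[sizeof req.name - 1] = '\0';

        send_msg(FILE_SERVER, &req);    /* send the request ...             */
        recv_msg(MY_PORT, &rep);        /* ... and block awaiting the reply */

        /* Unmarshall the reply and hand the data back to the caller. */
        int result = ntohl(rep.result);
        if (result > 0)
            memcpy(buf, rep.data, (size_t) result);
        return result;
    }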
Heterogeneity: Machines have different data formats
Previously discussed
Have conversions between all possibilities
Done during marshalling and unmarshalling
Adopt a std and convert to/from it.
HOMEWORK 10-8
Pointers
Avoid them for RPC!
Can put the object pointed to into the msg itself (assuming you
know its length).
Convert call-by-ref to copyin/copyout
If have in or out param (instead of in out) can elim one of
the copies
Gummy up the server to handle pointers special
Callback to client stub
Ugh
Registering and name servers
As we said before can use a name server.
This permits the server to move
deregister from the name server
move
reregister
Sometimes called dynamic binding
Client stub calls name server (binder) first time to get a HANDLE
to use for the future
Callback from binder to client stub if server deregisters or
have the attempt to use the handle fail so the client stub will
go to the binder again.
HOMEWORK 10-12
Failures
This gets hard and ugly
Can't find the server.
Need some sort of out-of-band response from client stub to
client
Ada exceptions
C signals
Multithread the CLIENT and start the "exception" thread
Loses transparency (centralized systems don't have this).
Lost request msg
This is easy if known. That is, if sure request was lost.
Also easy if idempotent and think might be lost.
Simply retransmit request
Assumes the client still knows the request
Lost reply msg
If it is known the reply was lost, have server retransmit
Assumes the server still has the reply
How long should the server hold the reply
Wait forever for the reply to be ack'ed--NO
Discard after "enough" time
Discard after receive another request from this client
Ask the client if the reply was received
Keep resending reply
What if not sure of whether lost request or reply?
If server stateless, it doesn't know and client can't tell.
If idempotent, simply retransmit the request
What if not idempotent and can't tell if lost request or reply?
Use sequence numbers so server can tell that this is a new
request not a retransmission of a request it has already done.
Doesn't work for stateless servers
HOMEWORK 10-13 10-14 Remind me to discuss these two next time
Server crashes
Did it crash before or after doing some nonidempotent action?
Can't tell from messages.
For databases, get idea of transactions and commits.
This really does solve the problem but is not cheap.
Fairly easy to get "at least once" (try request again if timer
expires) or "at most once (give up if timer expires)"
semantics. Hard to get "exactly once" without transactions.
To be more precise: A transaction either happens exactly
once or not at all (sounds like at most once) AND the
client knows which.
Client crashes
Orphan computations exist.
Again transactions work but are expensive
Can have rebooted client start another epoch and all
computations of previous epoch are killed and clients
resubmit.
Better is to let old computations whose owners can be
found continue.
Not wonderful
Orphan may hold locks or might have done something not
easily undone.
Serious programming needed.
Implementation
Protocol choice
Existing ones like UDP are designed for harder (more general)
cases so are not efficient.
Often developers of distributed systems invent their own
protocol that is more efficient.
But of course they are all different.
On a LAN one would like large messages since they are more
efficient and don't take so long considering the high
datarate.
HOMEWORK 10-15
================ Lecture 7 ================
Acks
One per pkt vs one per msg
Called stop-and-wait and blast
In former wait for each ack
In blast keep sending packets until msg finished
Could also do a hybrid
Blast but ack each packet
Blast but request only those missing instead of general
nak
Called selective repeat
Flow control
Buffer overrun problem
Internet worm caused by buffer overrun and rewriting non-
buffer space. This is not the problem here.
Can occur right at the interface chip, in which case the
(later) packet is lost.
More likely with blast but can occur with stop and wait if
have multiple senders
What to do
If chip needs a delay to do back to back receives have
sender delay that amt.
If can only buffer n pkts, have sender only send n then
wait for ack
The above fails when have simultaneous sends. But
hopefully that is not too common.
This tuning to the specific hardware present is one reason
why gen'l protocols don't work as well as specialized ones.
Why so slow? Lots to do!
Call stub
get msg buf
marshall params
If use std (UDP), compute checksum
fill in headers
Poof
Copy msg to kernel space (Unless special kernel)
Put in real destination addr
Start DMA to comm device
---------------- wire time
Process interrupt (or polling delay)
Check packet
Determine relevant stub
Copy to stub addr space (unless special kernel)
Unpoof
Unmarshall
Call server
On the Paragon (Intel large MPP of a few years ago), the above
(not exactly the same things) took 30us of which 1us was wire time
Eliminating copying
Message transmission is essentially a copy so min is 1
This requires the network device to do its DMA from the
user buffer (client stub) directly into the server stub.
Hard for receiver to know where to put the msg until it
arrives and is inspected
Sounds like a copy needed from receiving buffer to server
stub.
Can avoid this by fiddling with mem maps
Must be full pages (as that is what is mapped)
Normally there are two copies on the receiving side
From hardware buffer to a kernel buffer
From kernel buffer to user space (server stub)
Often two on sender side
User space (client stub) to kernel buffer
Kernel buffer to buffer on device
Then start the device
The sender ones can be reduced
Device can do DMA from the kernel buffer, which eliminates the 2nd
Doing DMA from user would eliminate the first but need
scatter gather (just gather here) since the header must be
in kernel space since the user is not allowed to set it.
To eliminate the two on the receiver side is harder
Can eliminate the first if device writes directly into
kernel buffer.
To eliminate the 2nd requires the remapping trick.
Timers and timeout values
Getting a good value for the timeouts is a black art.
Too small and many unneeded retransmissions
Too large and wait too long
Should be adaptive??
If find that sent an extra msg raise timeout for this
class of transmissions.
If timeout expires most of the time, lower value for
this class
How to keep timeout values
If you know that almost all timers of this class are going
to go off (alarms) and accuracy is important, then keep a
list sorted by time to alarm.
Only have to scan head for timer (so can do it
frequently)
Additions must search for place to add
Deletions (cancelled alarms) are presumed rare
If deletions are common and can afford not so accurate an
alarm, then sweep list of all processes (not so frequently
since accuracy not required).
Deletions and additions are easy since list is indexed by
process number
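A small C sketch of the first choice, a list kept sorted by expiration time,
so the frequent check only looks at the head (the struct and names are made
up for illustration):

    struct alarm {
        long          when;             /* expiration time (ticks) */
        void        (*fire)(void *arg);
        void         *arg;
        struct alarm *next;
    };

    static struct alarm *alarms;        /* head = earliest expiration */

    void add_alarm(struct alarm *a)
    {
        struct alarm **p = &alarms;
        while (*p && (*p)->when <= a->when)   /* search for insertion point */
            p = &(*p)->next;
        a->next = *p;
        *p = a;
    }

    void check_alarms(long now)         /* cheap, so can be called frequently */
    {
        while (alarms && alarms->when <= now) {
            struct alarm *a = alarms;
            alarms = a->next;
            a->fire(a->arg);
        }
    }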
Difficulties with RPC
Global variables like errno inherently have shared-variable
semantics so don't fit in a distributed system.
One (remote) procedure sets the vble and the local procedure
is supposed to see it.
But the setting is a normal store so is not seen by the
communication system.
So transparency is violated
Weak typing makes marshalling hard/impossible
How big is the object we should copy?
What is the conversion needed if heterogeneous system?
So transparency is violated
Doesn't fit all programming models
Unix pipes
pgm1 < f > g looks good
pgm1 is a client for stdin and stdout
RPCs to the file server for f and g
Similarly for pgm2 < x > y
But what about pgm1 < f | pgm2 > y
Both pgm1 and pgm2 are compiled to be clients but they
communicate so one must be a server
Could make pipes servers but this is not efficient and
somehow doesn't seem natural
Out of band (unexpected) msgs
Program reading from a terminal is a client and there is a
terminal server to supply the characters.
But what about ^C. Now the terminal driver is supposed to
be active and servers are passive
---------------- Group Communication ----------------
Groups give rise to ONE-TO-MANY communication.
A msg sent to the group is received by all (current) members.
We have had just send-receive which is point to point or
one-to-one
Some networks support MULTICASTING
Broadcast is special case where many=all
Multicasting gives the one-to-many semantics needed
Can use broadcast for multicast
Each machine checks to see if it is a part of the group
If don't even have broadcast, use multiple one-to-one
transmissions often called UNICASTING.
Groups are dynamic
Groups come and go
Members come and go
Need algorithms for this
Somehow the name of the group must be known to the
member-to-be
For an open group, just send a "I'm joining" msg
For a closed group, not possible
For a pseudo-closed (allows join msgs), just like open
To leave a group send good bye
If process dies members have to notice it
I am alive msgs
Piggyback on other msgs
Once noticed send a group msg removing member
This msg must be ordered w.r.t other group msgs (Why?)
If many go down or network severs, need to re-establish group
Message ordering
A big deal
Want consistent ordering
Will discuss in detail later this chapter
Note that an interesting problem is ordering goodbye and join with
regular msgs.
Closed and open groups
Difference is that in an open group an outsider can send to the group
Open for replicated server
Closed for group working on a single (internal) problem
================ Lecture 8 ================
Peer vs hierarchical groups
Peer has all members equal.
Common hierarchy is depth 2 tree
Called coordinator-workers
Fits problem solving in a master/slave manner
Can have more general hierarchies
Peer has better fault tolerance but it is harder to make decisions
since you need to run a consensus algorithm.
What to do if coord fails
Give up
Elect a new leader (Will discuss in section 11.3)
(This is last "soft" section. Algorithms will soon appear)
Group addressing
Have kernel know which processes on its own machine are in
which groups.
Hence must just get the msgs to the correct machines.
Again can use multicast, broadcast or multiple point to point
Group communication uses send-receive
RPC doesn't fit group comm.
Msg to group is NOT like a procedure call
Would have MANY replies
Atomicity
Either every member receives the msg or none does.
Clearly useful
How to implement?
Can't always guarantee that all kernels will receive msg
Use acks
What about crashes?
Expensive ALGORITHM:
Original sender sends to all with acks, timers,
retries
Each time anyone receives a msg for the first time it
does what the sender did
Eventually either all the running processes get msg or
none do.
Go through some scenarios.
HOMEWORK 10-16
Ordering revisited
Best is if have GLOBAL TIME ORDERING, i.e. msgs delivered in order sent
Requires exact clock sync
Too hard
Next best is CONSISTENT TIME ORDERING, i.e. delivered in same order
everywhere
Even this is hard and we will discuss causality and weaker
orders later this chapter
If processes are in multiple groups must order be preserved
between groups?
Often NOT required. Different groups have different functions
so no need for the msgs to arrive in the same order in the two
groups.
Only relevant when the groups overlap. Why?
If have an ethernet, get natural order since have just one msg on
ethernet and all receive it.
But if one is lost and must retransmit ???
Perhaps kill first msg
This doesn't work with more complicated networks (gateways)
HOMEWORK 10-18 (ask me in class next time)
---------------- Case study: Group communication in ISIS ----------------
Ken Birman Cornell
Synchrony
Synchronous System: Communication takes zero time so events
happen strictly sequentially
Can't build this
needs absolute clock sync
needs zero trans time
Loosely synchronous
Transmission takes nonzero time
Same msg may be received at different times at different
recipients
All events (send and receive) occur in the SAME ORDER at all
participants
Two events are CAUSALLY RELATED if the first might influence the
second.
The first "causes" (i.e. MIGHT cause) the second.
Events in the same process are causally related in program
order.
A send of msg M is causally related to the receive.
Events not causally related are CONCURRENT
Virtually synchronous
Causally related events occur in the correct order in all
processes.
In particular in the same order
No guarantees on concurrent events.
HOMEWORK 10-17 At least describe what is the main problem. Let's
discuss this one next time.
Click for diagram in
postscript or
html.
Not permitted to deliver C before B because they are causally
related
Message ordering (again)
Best (Synchronous) Msgs delivered in order sent.
Good (Loosely Synchronous): Messages delivered in the same order to all
members.
OK (Virtually synchronous) Delivery order respects causality.
ISIS algorithms for ordered multicast
ABCAST: Loosely Synchronous
Book is incomplete
I filled in some details that seem to fix it.
Learning to see these counterexamples is crucial for designing
CORRECT distributed systems
To send a msg
Pick a timestamp (presumably bigger than any you have seen)
Send timestamped msg to all group members
Wait for all acks, which are also timestamped
Send commit with ts = max { ts of acks }
When the system (not the application) receives a msg
Send ack with ts > any it has seen or sent
When receive a commit
Deliver committed msgs in timestamp order
Counterexample
4 processes A, B, C, D
Messages denoted (M,t) M is msg, t is ts
time = 1
A sends (Ma, 1) to B, C, D
B sends (Mb, 1) to A, C, D
time = 2
B receives Ma acks with ts=2
C receives Ma acks with ts=2
A receives Mb acks with ts=2
D receives Mb acks with ts=2
time = 3
A received ack with ts=2 from B & C
B received ack with ts=2 from A & D
D receives Ma acks with ts=3
C receives Mb acks with ts=3
time = 4
A received ack with ts=3 from D
B received ack with ts=3 from C
time = 5
A and B send commit with ts=3 to B,C,D and A,C,D
respectively
Now can have two violations
C and D receive both commits and since they have the same
ts, might break the tie differently.
Solution: Agree on tie breaking rule (say Process
number)
C might get A's commit and D might get B's commit. Since
each has only one commit, it is delivered to the program
Solution: Keep track of all msgs you have acked but
not received commit for. Can't deliver msg if a lower
numbered ack is outstanding.
CBCAST: Virtually Synchronous
Each process maintains a list L with n components n=groupsize
ith component is the number of the last msg received from i
all components are initialized to 0
Each msg contains a vector V with n components
For Pi to send msg
Bump ith component of L
Set V in msg equal to (new value of) L
For Pj to receive msg from Pi
Compare V to Lj. Must have
V(i) = Lj(i) + 1 (next msg received from i)
V(k) <= Lj(k) (j has seen every msg the current msg
depends on)
Bump ith component of Lj
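A small C sketch of the delivery test at receiver Pj (the group size N is
arbitrary; queuing of messages that are not yet deliverable is omitted):

    #define N 8                          /* group size -- arbitrary */

    /* V is the vector carried by the incoming msg from sender i;
       L is Pj's local vector (last msg delivered from each sender). */
    int deliverable(const int V[N], const int L[N], int i)
    {
        if (V[i] != L[i] + 1)            /* must be the next msg from i          */
            return 0;
        for (int k = 0; k < N; k++)
            if (k != i && V[k] > L[k])   /* Pj must already have seen everything */
                return 0;                /* this msg causally depends on         */
        return 1;
    }

    void on_receive(int V[N], int L[N], int i)
    {
        if (deliverable(V, L, i)) {
            /* ... deliver the msg to the application ... */
            L[i]++;                      /* bump the ith component of Lj */
        }
        /* else hold the msg until it becomes deliverable */
    }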
================ Lecture 9 ================
Class cancelled
================ Lecture 10 ================
Show how this works using the causality diagram
Click for diagram in
postscript or
html.
---------------- Shared Memory Coordination ----------------
See chapter 2 of Tanenbaum
We will be looking at processes coordination using SHARED MEMORY and
BUSY WAITING.
So we don't send messages but read and write shared variables.
When we need to wait, we loop and don't context switch
Can be wasteful of resources if must wait a long time.
Context switching primitives normally use busy waiting in
their implementation.
Mutual Exclusion
Consider adding one to a shared variable V.
When compiled, on many machines this becomes three instructions
load r1 <-- V
add r1 <-- r1+1
store r1 --> V
Assume V is initially 10 and one process begins the 3 inst seq
after the first instruction there is a context switch to another process
registers are of course saved
new process does all three instructions
context switch back
registers are of course restored
first process finishes
V has been incremented twice but has only reached 11
Problem is the 3 inst seq must be atomic, i.e. cannot be
interleaved with another execution of these instructions
That is, one execution excludes the possibility of another. So they
must exclude each other, i.e. mutual exclusion.
This was a RACE CONDITION.
Hard bugs to find since non-deterministic.
Can be more than two processes
The portion of code that requires mutual exclusion is often called
a CRITICAL SECTION.
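A small C/pthreads demonstration of the same race (compile with -pthread;
on most machines the printed value usually comes out well below 2*ITER):

    #include <stdio.h>
    #include <pthread.h>

    #define ITER 1000000
    volatile int V = 0;

    void *adder(void *unused)
    {
        for (int i = 0; i < ITER; i++)
            V = V + 1;                   /* load, add, store -- not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, adder, NULL);
        pthread_create(&t2, NULL, adder, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("V = %d (wanted %d)\n", V, 2 * ITER);
        return 0;
    }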
HOMEWORK 2-2 2-3 2-4
One approach is to prevent context switching
Do this for the kernel of a uniprocessor
Mask interrupts
Not feasible for user mode processes
Not feasible for multiprocessors
CRITICAL SECTION PROBLEM is to implement
loop
trying-part
critical-section
releasing-part
non-critical section
So that when many processes execute this you never have more than one
in the critical section
That is you must write trying-part and releasing-part
Trivial solution.
Let releasing part be simply "halt"
This shows we need to specify the problem better
Additional requirement
Assume that if a process begins execution of its critical section
and no other process enters the critical section, then the first
process will eventually exit the critical section
Then the requirement is "If a process is executing its
trying part, then SOME process will eventually enter the critical
section".
---------------- Software-only solutions to CS problem ----------------
We assume the existence of atomic loads and stores
Only up to wordlength
We start with the case of two processes
Easy if want tasks to alternate in CS and you know which one goes
first in CS
Shared int turn = 1
loop loop
while (turn=2) --EMPTY BODY!! while (turn=1)
CS CS
turn=2 turn=1
NCS NCS
But always alternating does not satisfy the additional requirement
above. Let NCS for process 1 be an infinite loop (or a halt).
Will get to a point when process 2 is in its trying part but
turn=1 and turn will not change. So some process enters its
trying part but neither proc will enter the CS.
Some attempts at a general soln follow
First idea. The trouble was the silly turn
Shared bool P1wants=false, P2wants=false
loop loop
P1wants <-- true P2wants <-- true
while (P2wants) while (P1wants)
CS CS
P1wants <-- false P2wants <-- false
NCS NCS
This fails. Why? This kind of question is very important and
makes a good exam question.
Next idea. It is easy to fix the above
Shared bool P1wants=false, P2wants=false
loop loop
while (P2wants) while (P1wants)
P1wants <-- true P2wants <-- true
CS CS
P1wants <-- false P2wants <-- false
NCS NCS
This also fails. Why?? Show on board what a scenario looks like.
Next idea need a second test.
Shared bool P1wants=false, P2wants=false
loop loop
while (P2wants) while (P1wants)
P1wants <-- true P2wants <-- true
while (P2wants) while (P1wants)
CS CS
P1wants <-- false P2wants <-- false
NCS NCS
Guess what, it fails again.
The first one that worked was discovered by a mathematician named
Dekker. Use turn only to resolve disputes.
Shared bool P1wants=false, P2wants=false
Shared int turn=1
loop loop
P1wants <-- true P2wants <-- true
while (P2wants) while (P1wants)
if turn=2 if turn=1
P1wants <-- false P2wants <-- false
while (turn==2) while (turn=1)
P1wants <-- true P2wants <-- true
CS CS
turn <-- 2 turn <-- 1
P1wants <-- false P2wants <-- false
NCS NCS
First two lines look like deadlock country. But it is not an
empty while loop.
The winner-to-be just loops waiting for the loser to give up
and then goes into the CS.
The loser-to-be
Gives up
Waits to see that the winner has finished
Starts over (knowing it will win)
Dijkstra extended dekker's soln for > 2 processes
Others improved the fairness of dijkstra's algorithm
These complicated methods remained the simplest known until 1981
when Peterson found a much simpler method
Keep dekker's idea of using turn only to resolve disputes, but
drop the complicated then body of the if.
Shared bool P1wants=false, P2wants=false
Shared int turn=1
loop loop
P1wants <-- true P2wants <-- true
while (P2wants and turn=2) while (P1wants and turn=1)
CS CS
turn <-- 2 turn <-- 1
P1wants <-- false P2wants <-- false
NCS NCS
This might be correct!
The standard solution from peterson just has the assignment to
turn moved from the trying to the releasing part
Shared bool P1wants=false, P2wants=false
Shared int turn=1
loop loop
P1wants <-- true P2wants <-- true
turn <-- 2 turn <-- 1
while (P2wants and turn=2) while (P1wants and turn=1)
CS CS
P1wants <-- false P2wants <-- false
NCS NCS
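Peterson's algorithm transcribed into C for two threads (thread 0 calls
enter(0)/leave(0), thread 1 uses 1). The lecture's model assumes loads and
stores are atomic and seen in program order; on a modern machine you would
need memory barriers or C11 atomics to guarantee that, so treat this as a
sketch of the algorithm, not production code.

    volatile int wants[2] = {0, 0};
    volatile int turn = 0;

    void enter(int me)
    {
        int other = 1 - me;
        wants[me] = 1;
        turn = other;                        /* let the other go first in a tie */
        while (wants[other] && turn == other)
            ;                                /* busy wait */
    }

    void leave(int me)
    {
        wants[me] = 0;
    }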
Remarks
Peterson's paper is in Inf. Proc. Lett. 12 #3 (1981) pp
115-116.
A fairly simple proof of its correctness is by Hofri in
OS Review Jan 1990 pp 18-22. I will put a copy in the Courant
library in the box for this course.
Peterson actually showed the result for n processes (not just
2).
Hofri's proof also shows that the algorithm satisfies a strong
fairness condition, namely LINEAR WAITING.
Assuming a process continually executes, others cannot pass
it (see Hofri for details).
================ Lecture 11 ================
Don't forget to go over 10-17 and 10-18
Hand out "Simulating Concurrency" notes for lab 1
These notes and lab 1, Process Coordination, are on the web
---------------- (binary) Semaphores ----------------
Trying and release often called ENTRY and EXIT, or WAIT and SIGNAL, or
DOWN and UP, or P and V (the latter are from Dutch words--Dijkstra).
Let's try to formalize the entry and exit parts.
To get mutual exclusion we need to ensure that no more than one task
can pass through P until a V has occurred. The idea is to keep trying
to walk through the gate and when you succeed ATOMICALLY close the
gate behind you so that no one else can enter.
Definition (NOT an implementation)
Let S be an enum with values closed and open (like a gate).
P(S) is
while S=closed
S <-- closed
The failed test and the assignment are a single atomic action.
P(S) is
label:
{[ --begin atomic part
if S=open
S <-- closed
else
}] --end atomic part
goto label
V(S) is
S <-- open
Note that this P and V (NOT yet implemented) can be used to solve the
critical section problem very easily
The entry part is P(S)
The exit part is V(S)
Make very sure the "atomic part" is understood.
Note that dekker and peterson do not give us a P and V since each
process has a unique entry and a unique exit
S is called a (binary) semaphore.
To implement binary semaphores we need some help from our hardware
friends.
Boolean in out X
TestAndSet(X) is
oldx <-- X
X <-- true
return oldx
Note that the name is a good one. This function tests the value of X
and sets it (i.e. sets it true; reset is to set false)
Boolean in out X, in e
FetchAndOr(X, e)
oldx <-- X
X <-- X OR e
return oldx
TestAndSet(X) is the same as FetchAndOr(X, true)
FetchAndOr is also a good name, X is fetched and OR'ed with e
Why is it better to return the old than the new value?
Now P/V for binary semaphores is trivial.
S is Boolean variable (false is open, true is closed)
P(S) is
while (TestAndSet(S))
V(S) is
S <-- false
HOMEWORK 2-5
Just because P and V are now simple to implement does not mean that
the implementation is trivial to understand.
Go over the implementation of P(S)
This works fine no matter how many processes are involved
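The same busy-waiting P/V written in C, using the GCC/Clang builtin
__sync_lock_test_and_set as the test-and-set (that builtin is an assumption
about your compiler; substitute your own atomic primitive):

    typedef volatile int bin_sem;            /* 0 = open, 1 = closed */

    void P(bin_sem *s)
    {
        while (__sync_lock_test_and_set(s, 1))   /* returns the OLD value   */
            ;                                     /* loop until it was open  */
    }

    void V(bin_sem *s)
    {
        __sync_lock_release(s);              /* store 0: open the gate */
    }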
---------------- Counting Semaphores ----------------
Now want to consider permitting a bounded number of processors into
what might be called a SEMI-CRITICAL SECTION.
loop
P(S)
SCS -- at most k processes can be here simultaneously
V(S)
NCS
A semaphore S with this property is called a COUNTING SEMAPHORE.
If k=1, get a binary semaphore so counting semaphore generalizes
binary semaphore.
How can we implement a counting semaphore given binary semaphores?
S is a nonnegative integer
Initialize S to k, the max number allowed in SCS
Use k=1 to get binary semaphore (hence the name binary)
We only ask for
Limit of k in SCS (analogue of mutual exclusion)
Progress: If process enters P and < k in SCS, a process
will enter the SCS
We do not ask for fairness, and don't assume it (for the binary
semaphore) either.
Idea: use bin sem to protect arith on S
binary semaphore q
P(S) is V(S) is
start-over: P(q); S++; V(q)
P(q)
if S<=0
V(q)
goto start-over
else
S--
V(q)
Explain how this works.
Let's make this a little more elegant (avoid goto)
binary semaphore q
P(S) is V(S) is
loop P(q); S++; V(q)
P(q)
exit when S>0 -- will exit loop holding the lock
V(q)
S--
V(q)
This so-called "n and a half" loop clearly shows what is going on.
An alternative is to "prime" the loop.
get-character -- priming read
while not eof
put-character
get-character
This is all very nice but the real trouble is that the above code is WRONG!
Distinguish between DEADLOCK, LIVELOCK, STARVATION
Deadlock: Cannot make progress
Livelock: Might not make progress (race condition)
Starvation: Some process does/might not make progress
The counting semaphore livelocks (no progress) if the binary
semaphore is unfair (starvation).
Talk about letting people off the subway when it is full and many
are waiting on the platform.
New idea: Release q sema early and use two others to force alternation
Remember that multiple Vs on a bin sem may not free multiple
processes waiting on Ps.
binary semaphore q,t initially open
binary semaphore r initially closed
integer NS; -- might be negative, keeps value of S
P(S) is V(S) is
P(q) P(q)
NS-- NS++
if NS < 0 if NS <= 0
V(q) V(q)
P(r) <-- these two lines --> P(t)
V(t) <-- force alternation --> V(r)
else else
V(q) V(q)
Explain how this works. Do some scenarios
"factor out" V(q) and move above if
P(S) is V(S) is
P(q) P(q)
NS-- NS++
V(q) V(q)
if NS < 0 if NS <= 0
P(r) <-- these two lines --> P(t)
V(t) <-- force alternation --> V(r)
It fails!!
You need to decrement/increment NS and test atomically.
If you release the lock between testing and doing something,
the information you found during the test is only a rumor
during the something.
Sneaky code optimization (applied to the correct code) gives
P(S) is V(S) is
P(q) P(q)
NS-- NS++
if NS < 0 if NS <= 0
V(q) V(r)
P(r) else
V(q) V(q)
================ Lecture 12 ================
HOMEWORK Add in variable S to unoptimized version so that
1. Outside P and V S is accurate, i.e. when no process is in P or V,
S = k -#P + #V
2. S >= 0 always
Note: NS satisfies 1 but not 2.
Remind me to go over this next class
These solutions all employ binary semaphores. Hence if N tasks try to
execute P/V on the COUNTING semaphore, there will be a part that is
serialized, i.e. will take time at least proportional to N.
Fetch-and-add
An NYU invention
Reference Gottlieb, Lubachevsky, and Rudolph, TOPLAS, apr 83
Good name (like fetch-and-or test-and-set)
fetch-and-add (in out X, in e) is
oldX <-- X
X <-- X+e
return oldX
Assume this is an atomic operation
NYU built hardware that did this atomically
Indeed could do N FAAs in the time of 1.
Interesting work ... but this is not a hardware course
Some hardware description is in the TOPLAS paper; a better one is in
IEEE Trans. Comp. Feb 83
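Aside (mine, not from the paper): modern C makes it easy to play with FAA,
since C11 <stdatomic.h> provides an atomic fetch-and-add whose return value
is the old value, exactly as defined above. A tiny sketch:

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void)
    {
        atomic_int x = 10;
        int oldx = atomic_fetch_add(&x, 3);   /* atomically x <-- x+3; returns old x */
        printf("old=%d new=%d\n", oldx, atomic_load(&x));   /* prints old=10 new=13 */
        return 0;
    }

Of course this only gives the semantics; it says nothing about the NYU
hardware claim that N concurrent FAAs can complete in the time of one.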
Just as test-and-set seems perfect for binary semaphores,
fetch-and-add seems perfect for counting semaphores.
NaiveP (in out S) is V (in out S) is
while FAA(S,-1) <= 0 FAA(S,+1)
FAA(S,+1)
Explain why it works.
But it fails!! Why?
Give history (dijkstra, wilson)
P (in out S) is V (in out S) is
L: while S<=0 FAA(S,+1)
if FAA (S,-1) <= 0
FAA(S,+1)
goto L
If you don't like gotos and labels, you can rewrite P as
ElegantP (in out S) is
while S<=0
if FAA (S,-1) <= 0
FAA(S,+1)
ElegantP(S)
A "good" compiler will turn out the identical code for P and ElegantP
"Good" means capable of elimnating TAIL RECURSION, but ...
This is not a programming languages course
This is not a compilers course
The real P looks VERY close to NaiveP, especially if you rewrite
NaiveP as
AlternateNaiveP (in out S) is
L:
if FAA (S,-1) <= 0
FAA(S,+1)
goto L
So all that P adds is the while loop
But the test in the while is "redundant" as the next if makes the
same test
Explain why (real) P and V work.
First show that the previous problem is gone.
Mutual exclusion because any process in CS found S positive
Progress a little harder to show
Idea is to look at all states of the system and show that once
you get to a state where a process has reached P, then any
process in the CS can leave and then at least one trying
process can get in.
See the TOPLAS article for a proof
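For concreteness, here is a C rendering (mine, not the TOPLAS code) of the
real P and V, using atomic_fetch_add from <stdatomic.h> as the FAA and plain
busy waiting; the type name csem is made up.

    #include <stdatomic.h>

    typedef atomic_int csem;            /* counting semaphore: initialize to k */

    void P(csem *S)
    {
        for (;;) {
            while (atomic_load(S) <= 0)         /* the "redundant" test */
                ;                               /* busy wait */
            if (atomic_fetch_add(S, -1) > 0)    /* old value positive: we got a unit */
                return;
            atomic_fetch_add(S, 1);             /* missed it: undo and retry */
        }
    }

    void V(csem *S)
    {
        atomic_fetch_add(S, 1);
    }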
HOMEWORK 2-8
If you believe the claim that concurrent FAA can be executed in
constant time (really it takes the time of one shared memory
reference, which must grow with machine size), then the counting
semaphore is a constant time operation with P/V (assuming the
semaphore is wide open).
We gave this "redundant" test a name test-decrement-retest or TDR
We can apply this TDR technique/trick to the bad soln we began with
(the one using one binary semaphore).
Binary semaphore q initially open
P (in out S) is V (in out S) is
L: While S <= 0 P(q); S++; V(q)
P(q)
if S > 0
S--
V(q)
else
V(q)
goto L
================ Lecture 13 ================
Midterm exam week from today (rule is must return by 13 march)
HOMEWORK Is the following code (contains an additional binary
semaphore) also correct? If it is correct, in what way is it
better than the original. Let's discuss next time.
Binary semaphore q,r initially open
P (in out S) is V (in out S) is
L: While S <= 0 P(r); S++; V(r)
P(q)
if S > 0
S--
V(q)
else
V(q)
goto L
Some authors, e.g. tanenbaum, reserve the term semaphore for
context-switching (a.k.a blocking) implementations and would not call
our busy-waiting algorithms semaphores.
HOMEWORK 2-2 2-6 (note 2-6 uses TANENBAUM's def of semaphore;
you may ignore the part about monitors).
P/V Chunk: With a counting semaphore, one might want to reduce the
semaphore by more than one.
If each value corresponds to one unit of a resource, this
reduction is reserving a chunk of resources. We call it P chunk.
Similarly V chunk.
Why can't you just do P chunk by doing multiple P's??
Assume there are 3 units of the resource available and 3 tasks
each need 2 units to proceed. If they do P;P, you can get the
case where no one gets what they need and none can proceed.
PChunk(in out S, in amt) is VChunk(in out S, in amt) is
while S < amt FAA(S,+amt)
if FAA(S,-amt) < amt
FAA(S,+amt)
PChunk(S, amt)
Let's look at the case amt=1
PChunk1(in out S) is VChunk1(in out S) is
while S < 1
if FAA(S,-1) < 1
FAA(S,+1)
PChunk1(S)
Since S<1 is the same as S<=0, the above is identical to ElegantP.
There will be more about this below.
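A C sketch of PChunk/VChunk in the same style as the P/V sketch above (again
<stdatomic.h>, busy waiting, and a made-up csem type that is just an
atomic_int initialized to the number of units):

    #include <stdatomic.h>

    typedef atomic_int csem;

    void PChunk(csem *S, int amt)                  /* reserve amt units at once */
    {
        for (;;) {
            while (atomic_load(S) < amt)           /* too few units: wait */
                ;
            if (atomic_fetch_add(S, -amt) >= amt)  /* old value had enough units */
                return;
            atomic_fetch_add(S, amt);              /* did not get them: undo, retry */
        }
    }

    void VChunk(csem *S, int amt)                  /* release amt units */
    {
        atomic_fetch_add(S, amt);
    }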
---------------- Producer Consumer (bounded buffer) ----------------
Problem: Producers produce items and consumers consume items (what a
surprise). We have a bounded buffer capable of holding k items and
want to use it to hold items produced but not yet consumed.
Special case k=1. What we want is alternation of producer adding
to buffer and consumer removing. Alternation, umm sounds familiar.
binary semaphore q initially open
binary semaphore r initially closed
producer is consumer is
loop loop
produce item P(r)
P(q) remove item from buffer
add item to buffer V(q)
V(r) consume item
Note that there can be many producers and many consumers
But the code guarantees that only one will be adding or
removing from the buffer at a time.
Hence can use normal code to remove from the buffer.
Now the general case for arbitrary k.
Will need to allow some slop so that a few (up to k) producers can
proceed before a consumer removes an item
Will need to put a binary semaphore around the add and remove code
(unless the code is special and can tolerate concurrency).
This is what counting semaphores are great for (semi-critical section)
counting semaphore e initially k -- num EMPTY slots
counting semaphore f initially 0 -- num FULL slots
binary semaphore b initially open
producer is consumer is
loop loop
produce item P(f)
P(e) P(b); rem item from buf; V(b)
P(b); add item to buf; V(b) V(e)
V(f) consume item
Normally want the buffer to be a queue.
HOMEWORK What would you do if you had two items that needed to be
consecutive on the buffer (assume the buffer is a queue)?
If we used FAA counting semaphores, there is no serial section in the
above except for the P(b).
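A self-contained C sketch of this bounded buffer (my code, not from the
text), using POSIX counting semaphores for e and f, a pthread mutex for b,
and a circular array of ints as the buffer; K is an arbitrary capacity.
Note the mutex is exactly the serial section P(b) mentioned above.

    #include <semaphore.h>
    #include <pthread.h>

    #define K 8                                /* buffer capacity */

    static int buf[K], in = 0, out = 0;
    static sem_t e, f;                         /* counts of empty and full slots */
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    void pc_init(void) { sem_init(&e, 0, K); sem_init(&f, 0, 0); }

    void produce(int item)
    {
        sem_wait(&e);                          /* P(e): wait for an empty slot */
        pthread_mutex_lock(&b);                /* P(b) */
        buf[in] = item;  in = (in + 1) % K;    /* add item to buffer */
        pthread_mutex_unlock(&b);              /* V(b) */
        sem_post(&f);                          /* V(f): one more full slot */
    }

    int consume(void)
    {
        sem_wait(&f);                          /* P(f): wait for a full slot */
        pthread_mutex_lock(&b);                /* P(b) */
        int item = buf[out];  out = (out + 1) % K;   /* remove item from buffer */
        pthread_mutex_unlock(&b);              /* V(b) */
        sem_post(&e);                          /* V(e): one more empty slot */
        return item;
    }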
NYU Ultracomputer Critical-section-free queue algorithms
Implement the queue as a circular array
FAA(tail,1) mod size gives the slot to use for insertions
FAA(head,1) mod size gives the slot to use for deletions
Will use I and D instead of tail and head below
type queue is array 1 .. size of record
natural phase -- will be explained later
some-type data -- the data to store
Since we do NOT want to have the critical section in the above
code, we have a thorny problem to solve.
The counting semaphores will guarantee that when an insert gets
past P(e), there is an empty slot. But it might NOT be at the
slot this insert was assigned!!
How can this be?
If this were all we could just force alternation at each queue
position with two semaphores at each slot.
But boris (lubachevsky) also found a scenario where two
inserts could be going after the same slot and thus you could
ruin fifo by having the first insert go second.
So we have a phase at each slot.
The first (zeroth) insert at this slot is phase 0
The first (zeroth) delete at this slot is phase 1.
Insert j is phase 2*j; delete j is phase 2*j+1
The phase for an insert is I div size
The slot is I mod size
Most hardware calculates both at once when doing a division
(quotient and remainder). If size is a power of 2 these
are just two parts of the number (i.e. mask and shift). I
use a (made up) function (Div, Mod) that returns both values
counting semaphore e initially size
counting semaphore f initially 0
Insert is
P(e)
(MyPhase, MyI) <-- (Div, Mod) (FAA(I,1), size)
while phase[MyI] < 2*MyPhase -- we are here too early, wait
data[MyI] <-- the-datum-to-insert
FAA(phase[MyI],1) -- this phase is over
V(f)
Delete is
P(f)
(MyPhase, MyD) <-- (Div, Mod) (FAA(D,1), size)
while phase[MyD] < 2*MyPhase+1 -- we are here too early, wait
extracted-data <-- data[MyD]
FAA(phase[MyD],1) -- this phase is over
V(e)
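Here is my C rendering of the Insert/Delete code above: <stdatomic.h>
fetch-and-adds play the role of FAA on I, D, and the per-slot phases, POSIX
semaphores play the role of the counting semaphores e and f, and the waits
are busy waits. SIZE and the datum type are placeholders.

    #include <stdatomic.h>
    #include <semaphore.h>

    #define SIZE 16
    typedef int datum;

    static datum      data[SIZE];
    static atomic_int phase[SIZE];             /* all start at phase 0 */
    static atomic_int I, D;                    /* insert and delete counters */
    static sem_t e, f;

    void q_init(void) { sem_init(&e, 0, SIZE); sem_init(&f, 0, 0); }

    void q_insert(datum x)
    {
        sem_wait(&e);                                    /* P(e) */
        int t = atomic_fetch_add(&I, 1);
        int myphase = t / SIZE, myi = t % SIZE;          /* (Div, Mod) */
        while (atomic_load(&phase[myi]) < 2 * myphase)   /* here too early: wait */
            ;
        data[myi] = x;
        atomic_fetch_add(&phase[myi], 1);                /* this phase is over */
        sem_post(&f);                                    /* V(f) */
    }

    datum q_delete(void)
    {
        sem_wait(&f);                                        /* P(f) */
        int t = atomic_fetch_add(&D, 1);
        int myphase = t / SIZE, myd = t % SIZE;
        while (atomic_load(&phase[myd]) < 2 * myphase + 1)   /* wait for the insert */
            ;
        datum x = data[myd];
        atomic_fetch_add(&phase[myd], 1);                    /* this phase is over */
        sem_post(&e);                                        /* V(e) */
        return x;
    }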
This code went through several variations. The current version was
discovered when I last taught this course in spring of 94, but was
never written before now.
Originally the insert and delete code were complicated and looked
quite different from each other. At some point I found a way to make
them look very symmetric (but still complicated). At first I was a
little proud of this discovery and rushed to show boris. He was not
impressed; indeed he remarked "it must be so". It was always hard to
translate his "must" so I didn't understand if he was saying that my
comment was trivial or important. His next, quite perceptive, remark
showed that it was the former and ended the discussion. Of course
they look the same, "Deletion is insertion of empty space.".
This queue algorithm uses size for two purposes
The maximum size of the queue
The maximum concurrency supported.
It would be natural for these two requirements to differ considerably
in the size required.
A system with 100 processors each running no more than 10 active
threads using the queue needs at most 1000 fold concurrency, but
if the traffic generation is bursty, many more than 1000 slots
would be desired.
One can have a (serially accessed, i.e. critical section) list
instead of a single slot associated with each MyI
One can further enhance this by implementing these serially accessed
lists as linked lists rather than arrays. This gives the usual
advantages of linked vs sequentially allocated lists (as well as the
usual disadvantages).
This enhanced version is used in our operating system (Symunix)
written by jan edler.
================ Lecture 14 ================
---------------- Readers and Writers ----------------
Problem: We have two classes of processes.
Readers, which can execute concurrently
Writers, which demand exclusive access
We are to permit concurrency among readers, but when a writer is
active, there are to be NO readers and NO OTHER writers active.
integer #w range 0 .. maxint initially 0
integer #r range 0 .. maxint initially 0
binary semaphore S
Reader() is Writer() is
loop loop
P(S) P(S)
exit when #w=0 exit when #r=0 and #w=0
V(S) V(S)
#r++ #w++
V(S) V(S)
Do-the-read Do-the-write
P(S); #r--; V(S) P(S); #w--; V(S)
Explain the idea behind the code and why it works.
(As usual) it doesn't work.
We fix it with our magic elixir (redundant test before lock)
integer #w range 0 .. maxint initially 0
integer #r range 0 .. maxint initially 0
binary semaphore S
Reader() is Writer() is
loop loop
while #w>0 while #r>0 or #w>0
P(S) P(S)
exit when #w=0 exit when #r=0 and #w=0
V(S) V(S)
#r++ #w++
V(S) V(S)
Do-the-read Do-the-write
P(S); #r--; V(S) P(S); #w--; V(S)
HOMEWORK What is the trying-part and what is the releasing-part?
HOMEWORK Show that #w only takes on values 0 and 1
Observation #1. Since #w is only 0 and 1, we use the following code
instead.
integer #w range 0 .. 1 initially 0
integer #r range 0 .. maxint initially 0
binary semaphore S
Reader() is Writer() is
loop loop
while #w>0 while #r>0 or #w>0
P(S) P(S)
exit when #w=0 exit when #r=0 and #w=0
V(S) V(S)
#r++ #w <-- 1
V(S) V(S)
Do-the-read Do-the-write
P(S); #r--; V(S) P(S); #w <--0 ; V(S)
Observation #2. Assigning a 1 to #w is atomic at the hardware level.
We don't need a semaphore to make it atomic. Also the other sections
protected by P(S)/V(S) don't need to be protected from #w<--0.
integer #w range 0 .. 1 initially 0
integer #r range 0 .. maxint initially 0
binary semaphore S
Reader() is Writer() is
loop loop
while #w>0 while #r>0 or #w>0
P(S) P(S)
exit when #w=0 exit when #r=0 and #w=0
V(S) V(S)
#r++ #w <-- 1
V(S) V(S)
Do-the-read Do-the-write
P(S); #r--; V(S) #w <-- 0
Observation #3. Since we don't grab a lock to set #w zero, the trying
readers can't prevent it. This means we don't need the elixir for
them. Similarly we don't need the #w part of the elixir for the
writers either. We do need the #r part otherwise we could prevent the
decrement of #r at the bottom of the Reader code.
integer #w range 0 .. 1 initially 0
integer #r range 0 .. maxint initially 0
binary semaphore S initially open
Reader() is Writer() is
loop loop
while #r>0
P(S) P(S)
exit when #w=0 exit when #r=0 and #w=0
V(S) V(S)
#r++ #w <-- 1
V(S) V(S)
Do-the-read Do-the-write
P(S); #r--; V(S) #w <-- 0
New approach to readers / writers. Assume we have B units of a
resource R, where B is at least the maximum possible number of
readers (e.g. B = max # processes). Make each reader grab one unit
of the resource and make each writer grab all B. Thus when a writer
is active, no other process is.
So how can we grab B units all at once?
PChunk!
counting semaphore S initially B
Reader() is Writer() is
P(S) PChunk(S,B)
Do-the-read Do-the-write
V(S) VChunk(S,B)
If we use the FAA implementations for P/V Chunk and run the code on
the NYU Ultracomputer Ultra III prototype, we get the property that
"during periods of no writer activity (or attempted activity), NO
critical sections are executed".
I know of no other implementation with this property.
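A C sketch of this resource-grabbing version, reusing the FAA-style P/V and
PChunk/VChunk from the earlier sketches (prototypes only here); B is a
made-up bound on the number of readers:

    #include <stdatomic.h>

    #define B 64                        /* assumed max number of readers */
    static atomic_int S = B;            /* FAA counting semaphore, initially B */

    /* from the P/V and PChunk/VChunk sketches above */
    void P(atomic_int *s);      void V(atomic_int *s);
    void PChunk(atomic_int *s, int amt);    void VChunk(atomic_int *s, int amt);

    void reader(void)
    {
        P(&S);                          /* grab one unit of the resource */
        /* do the read */
        V(&S);
    }

    void writer(void)
    {
        PChunk(&S, B);                  /* grab ALL B units: excludes everyone else */
        /* do the write */
        VChunk(&S, B);
    }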
The code above permits readers to easily starve writers. The
following writer-priority version does not (but writers can starve
readers).
This is a generic way to take two algorithms and give one priority.
That is, the above code for readers and writers is used as black
boxes.
counting semaphore S initially B
integer #ww initially 0 -- number of waiting writers
Reader() is Writer() is
while #ww>0 FAA(#ww,+1)
P(S) PChunk(S,B)
Do-the-read Do-the-write
V(S) VChunk(S,B)
FAA(#ww,-1)
HOMEWORK. Write the corresponding reader-priority readers/writers
code.
---------------- Fetch-and-increment and fetch-and-decrement ----------------
With the exception of P/V chunk (and hence readers/writers) all the
FAA algorithms used an addend of +1 or -1.
Fetch-and-increment(X) (written FAI(X)) is defined as FAA(X,+1)
Fetch-and-decrement(X) (written FAD(X)) is defined as FAA(X,-1)
It turns out that the hardware is easier for FAI/FAD than for general
FAA.
FAA looks like a store and a load
data sent to and from memory
FAI/FAD look like a load
data comes back from memory
it is a special load (memory must do something)
Somewhat surprisingly readers and writers can be solved with just
FAI/FAD.
integer #r initially 0
integer #w initially 0
Reader is Writer is
while #w > 0 while #w > 0
FAI(#r)                             if FAI(#w) > 0
if #w > 0                               FAD(#w)
    FAD(#r)                             Writer
    Reader                          while #r > 0
Do-the-read                         Do-the-write
FAD(#r)                             FAD(#w)
This is proved in freudenthal gottlieb asplos 1991.
================ Lecture 15 ================
---------------- Midterm Exam ----------------
================ Lecture 16 ================
return and go over midterm exam
---------------- Barriers ----------------
Many higher-level algorithms work in "phases". All processes execute
up to a point and then they must synchronize so that all processes know
that all other processes have gotten this far.
For example, each process might be responsible for updating a
specific entry in a matrix NEW using all the values of matrix
OLD. When NEW is fully updated the roles of NEW and OLD are
reversed. (Explicit relaxation method for PDEs)
while (not-done)
compute
barrier()
We will assume that N, the number of processes involved, is known at
compile time.
This is a common assumption.
To avoid it, use "group lock" (see freudenthal and gottlieb)
Idea is simple
Each process adds one to a counter.
Wait until counter is N
Want barrier self-cleaning, i.e. counter is restored to zero so next
execution of barrier can begin.
We use two counters and clean the "other one"
P: shared integer range 0 .. 1 {the "phase" mod 2}
C: shared array 0 .. 1 of integer range 0 .. N {the count for this phase}
barrier is
p = P
C[1-p] <-- 0 { clean for the next phase }
if fai(C[p]) = N-1 { last process of this phase }
P <-- 1-p { swap phases }
while (p = P) { wait for last to get here }
Wording bug in lab 1. Barrier uses only fai; does not need fad.
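My C translation of the barrier, with <stdatomic.h> supplying the
fetch-and-increment; N is assumed known at compile time as stated above:

    #include <stdatomic.h>

    #define N 4                           /* number of processes */

    static atomic_int P_phase;            /* the phase bit, 0 or 1 */
    static atomic_int C[2];               /* per-phase counters, start at 0 */

    void barrier(void)
    {
        int p = atomic_load(&P_phase);
        atomic_store(&C[1 - p], 0);                  /* clean for the next phase */
        if (atomic_fetch_add(&C[p], 1) == N - 1)     /* last process of this phase */
            atomic_store(&P_phase, 1 - p);           /* swap phases, freeing the rest */
        while (atomic_load(&P_phase) == p)           /* wait for the last to get here */
            ;
    }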
---------------- Chapter 11--Distributed System Coordination ----------------
No shared memory so
No semaphores
No test-and-set
No fetch-and-add
Basic "rules of the game"
The relevant information is scattered around the system
Each process makes decisions based on LOCAL information
Try to avoid a single point of failure
No common clock supplied by hardware
================ Lecture 17 ================
Our first goal is to deal with the last rule
Some common time is needed (e.g. make)
Hardware doesn't supply it.
So software must.
For isolated uniprocessor all that really matters is that the clock advances
monotonically (never goes backward).
This says that if one time is greater than another the second
happened later than the first.
For an isolated multiple processor system we need the clocks to agree so that
if one processor marks one file with a later time than a second processor marks
a second file, the first marking really did come after the second.
Processor has a timer that interrupts periodically. The interrupt is
called a clock tick.
Software keeps a count of # of ticks since a known time.
Initialized at boot time
Sadly the hardware used to generate clock ticks isn't perfect and some
run slightly faster than others. This causes CLOCK SKEW.
Clocks that agree with each other (i.e. are consistent, see below) are
called LOGICAL CLOCKS.
If the system must interact with the real world (say a human),
it is important that the system time is at least close to the real
time.
Clocks that agree with real time are called PHYSICAL CLOCKS
Logical Clocks
What does it mean for clocks to agree with each other?
For logical clocks we only care that if one event happens before
another, the first occurs at an earlier time ON THE LOCAL CLOCK.
We say a HAPPENS BEFORE b, written a-->b if one of the following holds
1. Events a and b are in the same process, and a occurs before b.
2. Event a is the sending of msg M and b is the receiving of M.
3. Transitivity, i.e. a-->c and c-->b implies a-->b
We say a HAPPENS AFTER b if b happens before a.
We say a and b are CONCURRENT if neither a-->b nor b-->a
HOMEWORK 11-1
We require for logical clocks that a-->b implies C(a) < C(b), where
C(x) is the time at which x occurs. That is C(x) is the value of the
local clock when event x occurs.
How do we ensure that this rule (a-->b implies C(a) < C(b)) holds? We start
by using the local clock value, but need some fixups.
Condition 1 is almost satisfied. If a process stays on a
processor, the clock value will never decrease. So we need two
fixups.
If a process moves from one processor to another, must save
the clock value on the first and make sure that the second is
no earlier (indeed it should probably be at least one tick
later).
If two events occur rapidly (say two msg sends), make sure
that there is at least one clock tick between.
Condition 2 is NOT satisfied. The clocks on different processORs
can definitely differ a little so a msg might arrive before it was
sent.
The fixup is to record in the msg the time of sending and when it
arrives, set local clock to max (local time, msg send time + 1)
The above is Lamport's algorithm for synchronizing logical
clocks.
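A per-process C sketch of Lamport's rule (no shared memory: each process owns
its clock); tick() is called for local events and msg sends, on_receive()
when a msg carrying the sender's timestamp arrives:

    static long lclock = 0;              /* this process's logical clock */

    long tick(void)                      /* local event or msg send */
    {
        return ++lclock;                 /* timestamp of this event */
    }

    long on_receive(long msg_ts)         /* msg_ts = send time carried in the msg */
    {
        if (msg_ts >= lclock)            /* fixup: the receive must come after */
            lclock = msg_ts;             /*   the send, so take the max ...    */
        return ++lclock;                 /*   ... plus at least one tick       */
    }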
More information on logical time can be found in Raynal and Singhal
IEEE Computer Feb 96 (we are NOT covering this).
The above is called SCALAR TIME.
They also have VECTOR TIME and MATRIX TIME
In vector time each process P keeps a vector (hence the name) of
clocks, entry i gives P's idea of the clock at Pi.
In matrix time each process keeps a (guess) matrix (correct!)
of clocks. Idea is that P keeps a record of what Q thinks R's
clock says.
================ Lecture 18 ================
Physical clocks
What is time in the physical world.
Earth revolves about the sun once per year
Earth rotates on its axis once per day
1/24 day is hour, etc for min and sec
But earth "day" isn't exactly the same all the time so
now use atomic clocks to define one second to be the time for a
cesium 133 atom to make 9,192,631,770 "transitions"
We don't care about any of this really. For us the right time is that
broadcast by NIST on WWV (short wave radio) or via the GEOS satellite.
How do we get the times on machines to agree with NIST?
Send a msg to NIST (or a surrogate) asking for the right time
and change yours to that one.
How often?
If you know
1. the max drift rate D (each machine drifts D from reality)
2. the max error you are willing to tolerate E
can calculate how long L you can wait between updates.
L = E / (2D)
    Why 2D: in the worst case the two clocks drift in opposite
    directions, so they can drift apart at rate 2D.
If only one machine drifts (other is NIST) L = E/D
HOMEWORK 11-3
Remind me to go over this next time.
How change the time?
Bad to have big jumps and bad to go backwards.
So make an adjustment in how much you add at each clock tick.
How do you send msg and get reply in zero (or even just "known
in advance") time?
You can't. See how long it took for reply to come back,
subtract service time at NIST and divide by 2. Add this to
the time NIST returns.
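The arithmetic as a one-liner in C (a sketch; all times in ms, t0/t1 read
from the local clock around the request, server_time and service_time
returned in the reply):

    long estimate_now(long t0, long t1, long server_time, long service_time)
    {
        long one_way = (t1 - t0 - service_time) / 2;   /* est. one-way network delay */
        return server_time + one_way;                  /* server's time, aged by transit */
    }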
What if you can't reach a machine known to be perfect (or within D)?
Ask "around" for time
Take average
Broadcast result
Eliminate outliers
Try to contact nodes who can contact NIST or nodes who can
contact nodes who can contact NIST.
---------------- Mutual Exclusion ----------------
Try to do mutual exclusion w/o shared memory
Centralized approach
Pick a process as coordinator (mutual-exclusion-server)
To get access to CS send msg to coord and await reply.
When leave CS send msg to coord.
When coord gets a msg requesting CS it
Replies if the CS is free
Otherwise enters the requester's name into the waiting Q
When coord gets a msg announcing departure from CS
Removes head entry from list of waiters and replies to it
The simplest soln and perhaps the best
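A sketch of the coordinator's loop in C. The message layer (recv_msg,
send_ok) is made up, standing in for whatever transport is used, and the
waiting Q is a fixed circular array:

    enum mtype { REQUEST, RELEASE };
    struct msg { enum mtype type; int from; };
    struct msg recv_msg(void);                 /* hypothetical: blocks for next msg */
    void send_ok(int to);                      /* hypothetical: grants the CS to 'to' */

    #define MAXW 100

    void coordinator(void)
    {
        int waiting[MAXW], head = 0, tail = 0; /* FIFO of blocked requesters */
        int busy = 0;                          /* is someone in the CS? */

        for (;;) {
            struct msg m = recv_msg();
            if (m.type == REQUEST) {
                if (!busy) {
                    busy = 1;
                    send_ok(m.from);               /* CS free: reply at once */
                } else {
                    waiting[tail] = m.from;        /* else queue the requester */
                    tail = (tail + 1) % MAXW;
                }
            } else {                               /* RELEASE from the CS holder */
                if (head != tail) {
                    send_ok(waiting[head]);        /* hand the CS to the head waiter */
                    head = (head + 1) % MAXW;
                } else {
                    busy = 0;                      /* nobody waiting: CS is now free */
                }
            }
        }
    }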
Distributed soln
When you want to get into CS
Send request msg to EVERYONE (except yourself)
Include timestamp (logical clock!)
Wait until receive OK from everyone
When receive request
If you are not in CS and don't want to be, say OK
If you are in CS, put requester's name on list
If you are not in CS but want to
If your TS is lower, put name on list
If your TS is higher, send OK
When leave CS, send OK to all on your list
Show why this works
HOMEWORK 11-7
Token Passing soln
Form logical ring
Pass token around ring
When you have the token can enter CS (hold token until exit)
Comparison
Centralized is best
Distributed of theoretical interest
Token passing good if HW is ring based (e.g. token ring)
---------------- Election ----------------
How do you get the coord for centralized alg above?
Bully Algorithm
Used initially and in general when any process notices that the
process it thinks is the coord is not responding.
Processes are numbered (e.g. by their addresses or names or
whatever)
The algorithm below determines the highest numbered process, which
then becomes the elected member.
Called the bully alg because the biggest wins.
When a process wants to hold an election,
It sends an election msg to all higher numbered processes
If any respond, orig process gives up
If none respond, it has won and announces election result to
all processors (called coordinator msg in book)
When receive an election msg
Send OK
Start election yourself
Show how this works
When a process comes back up, it starts an election.
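A sketch of the bully algorithm in C; the messaging calls and the timeout are
all made up (placeholders for the real transport), and processes are numbered
0..nprocs-1:

    /* hypothetical primitives, not a real API */
    void send_election(int p);  void send_ok(int p);  void send_coordinator(int p);
    int  wait_for_ok(int ms);               /* nonzero iff some OK arrived in time */

    #define TIMEOUT 1000                    /* ms; arbitrary */
    int me, nprocs;                         /* set at startup */

    void hold_election(void)
    {
        for (int p = me + 1; p < nprocs; p++)   /* ask every higher-numbered process */
            send_election(p);
        if (wait_for_ok(TIMEOUT))               /* someone bigger answered: give up */
            return;                             /* (it will run its own election)   */
        for (int p = 0; p < nprocs; p++)        /* nobody answered: I win */
            if (p != me)
                send_coordinator(p);            /* announce the result */
    }

    void on_election_msg(int from)              /* a lower-numbered process asked */
    {
        send_ok(from);                          /* tell it to give up */
        hold_election();                        /* and start my own election */
    }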
HOMEWORK 11-9
Ring algorithm (again biggest number wins)
Form a (logical) ring
When a proc wants a new leader
Send election msg to next proc (in ring) include its proc # in msg
If msg not ack'ed, send to next in ring
When msg returns
Election done
Send result msg around the ring
When req msg arrives
Add your # to msg and send to next proc
Only necessary if you are bigger than all there (I think)
If concurrent elections occur, same result for all
HOMEWORK 11-10
================ Lecture 19 ================
---------------- Atomic Transactions ----------------
This is a BIG topic; worthy of a course (databases)
Stable storage: Able to survive "almost" anything
For example can survive a (single) media failure
Transactions: ACID
Atomic
To the outside world it happens all at once.
Its effects are never visible while in progress
If it fails (aborts), none of its effects are ever visible
Consistent
Does not violate any system invariants when completed
Intermediate states are NOT visible (this follows from Atomic)
System dependent
Isolated
Normally called serializable
When multiple transactions are run and complete the effect is
as if they were run one after the other in some (unspecified)
order, i.e. you don't get a little of one and then a little of
another.
Durable
Often called permanence
Once a transaction commits, its effects are permanent
Stable storage often used
HOMEWORK 11-13
Nested Transactions
May want to have the child transactions run concurrently.
If a child commits, its effects are now available WITHIN the
parent transaction
So sibling children serialized AFTER that child see its effects
But if a child commits and then the parent (subsequently) aborts
the effects of the child are NOT visible outside.
---------------- The following is by eric freudenthal ----------------
Concurrency Control
The synchronization primitives required to build transactions
Locking
Use mutex, explore granularity issues.
Two phase locking
acquire all locks
do transaction
release locks
Quit if a lock is unavailable.
This scheme works well when contention is low, may lead to
starvation.
In-order lock acquisition avoids live/deadlock.
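A pthreads sketch of the in-order acquisition idea (mine): each transaction
sorts the indices of the locks it needs and acquires them in that global
order, so a cycle in the wait-for graph is impossible:

    #include <pthread.h>
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    /* locks[] protect the data items; need[] lists the items this transaction uses */
    void transaction(pthread_mutex_t locks[], int need[], int n, void (*body)(void))
    {
        qsort(need, n, sizeof(int), cmp_int);      /* growing phase, in global order */
        for (int i = 0; i < n; i++)
            pthread_mutex_lock(&locks[need[i]]);
        body();                                    /* do the transaction */
        for (int i = n - 1; i >= 0; i--)           /* shrinking phase: release all */
            pthread_mutex_unlock(&locks[need[i]]);
    }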
Optimistic
Resembles two-phase commitment - don't save files until all
mods made. May need to abort & retry.
indirection makes this faster (copy & mod reference
structure rather than the data itself)
Good parallelism when works.
Deadlock free.
Timestamp consistency
Data (files) marked with date of last transactions read &
write.
Transactions fail if they try to set times inconsistently.
Preventing Deadlock (while providing synchronization as building
blocks)
Both communication channel & resource deadlocks are possible.
Communication channel deadlock can be caused by running out of
buffers.
Important to get this right, for it's hard to negotiate
if the communication channels are blocked.
What to do
ostrich - works most of the time if contention is low
detect - easier
prevent - make structurally impossible
avoid - carefully write program so it can't happen
Detecting deadlock
centralized manager
risk of outdated information - false deadlock
fix with lamport's time stamps & interrogation
================ Lecture 20 ================
Lecture by eric fredenthal
Not in text format but it is available on the web as lecture-20
in the list of individual lectures
================ Lecture 21 ================
System Model (i.e. what to buy and configure)
We look at three models
Workstations (zero cost soln)
Clusters (aka NOW aka COW, aka LAMP, aka Beowulf, aka pool)
Hybrid--likely "winner"
Workstation model
Connect workstations in department via LAN
Includes personal workstations and public ones
Often have file servers
The workstations can be diskless
Tannenbaum seems to like this
Most users don't
Not so popular anymore (disks are cheap)
Maintenance is easy
Must have some startup code in rom
If have disk on workstation can use it for
1. Paging and temp files
2. 1 + (some) system executables
3. 2 + file caching
4. full file system
Case 1 is often called dataless
Just as easy to (software) maintain as diskless
Still need startup code in rom
Serious reduction in load on network and file servers
Case 2
Reduces load more and speeds up program start time
Adds maintenance since new releases of programs must be
loaded onto the workstations
HOMEWORK 12-7, 12-8
Case 3
Can have very few executables permanently on the disk
Must keep the caches consistent
Not trivial for data files with multiple writers
This issue comes up for NFS as well and is discussed
(much) later
Should you cache whole files or blocks?
Case 4
You can work if just your machine is up
Lose location transparency
Most maintenance
Using Idle workstations
Early systems did this manually via rsh
Still used today
Newer systems like Condor (Univ Wisc ?) try to automate this
How find idle workstations?
Idle = no mouse or keybd activity and low load avg
Workstation can announce it is idle and this is
recorded by all
Job looking for machine can inquire
Must worry about race conditions
HOMEWORK 12-10
Some jobs want a bunch of machines so look for many
idle machines
Can also have centralized soln, processor server
Usual tradeoffs
What about local environment?
Files on servers are no problem
Requests for local files must be sent home
... but not needed for tmp files
Syscalls for mem or proc mgt probably need to be
executed on the remote machine
Time is a bit of a mess unless have time synchronized
by a system like ntp
If program is interactive, must deal with devices
mouse, keybd, display
HOMEWORK 12-11
What if machine becomes non-idle (i.e. owner returns)?
Detect presence of user.
Kill off the guest processes.
Helpful if made checkpoints (or ran short jobs)
Erase files, etc.
Some NYU research unifies this entire evacuation
procedure with other failures (transaction oriented).
Could try to migrate the guest processes to other
hosts but this must be very fast or the owner will
object (at least I would).
Goal is to make owner not be aware of your presence.
May not be possible since you may have paged out
his basic environment (shell, editor, X server,
window manager) that s/he left running when s/he
stopped using the machine.
================ Lecture 22 ================
Clusters (pools, etc)
Bunch of workstations without displays in machine room connected
by a network.
Quite popular now.
Indeed some clusters are packaged by their manufacturer into a
serious compute engine.
IBM SP2 sold $1B in 1997.
VERY fast network
Used to solve large problems using many processors at one time
Pluses of large time sharing system vs small individual
machines.
Also the minus of timesharing
Can use easy queuing theory to show that large fast server
better in some cases than many slower personal machines
Tannenbaum suggests using X-terminals to access the cluster, but
X-terminals haven't caught on.
Personal workstations don't cost much more
Hybrid
Each user has a workstation and use the pool for big jobs
Tannenbaum calls this a possible compromise.
It is the dominant model for cluster based machines.
X-terminals haven't caught on
The cheapest workstations are already serious enough for
most interactive work freeing the cluster for serious
efforts.
---------------- Processor Allocation ----------------
Decide which processes should run on which processors
Could also be process allocation
Assume any process can run on any processor
Often the only difference between different processors is
CPU speed
CPU Speed and max Memory
What if the processors are not homogeneous?
Assume have binaries for all the different architectures.
What if not all machines are directly connected
Send process via intermediate machines
If all else fails view system as multiple subsystems
If have only alpha binaries, restrict to alphas
If need machines very close for fast comm, restrict to a group
of close machines.
Can you move a running process or are processor allocations done at
process creation time?
Migratory allocation algorithms vs nonmigratory
What is the figure of merit, i.e. what do we want to optimize?
Similar to CPU scheduling in OS 1.
Minimize response time
We are NOT assuming all machines equally fast.
Consider two processes: P1 executes 100 million instructions,
P2 executes 10 million instructions.
Both processes enter system at t=0
Consider two machines A executes 100 MIPS, B 10 MIPS
If run P1 on A and P2 on B each takes 1 second so avg
response time is 1 sec.
If run P1 on B and P2 on A, P1 takes 10 seconds P2 .1 sec
so avg response time is 5.05 sec
If run P2 then P1 both on A finish at times .1 and 1.1 so
avg resp time is .6 seconds!!
Do not assume machines are ready to run new jobs, i.e. there
can be backlogs.
Minimize response ratio.
Response ratio is the time to run on some machine divided by
time to run on a standardized (benchmark) machine, assuming
the benchmark machine is unloaded.
This takes into account the fact that long jobs should take
longer.
Do the P1 P2 A B example with response ratios
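One way to work it (my numbers): take the unloaded 10 MIPS machine B as the
benchmark, so P1's benchmark time is 10 sec and P2's is 1 sec.
    P1 on A, P2 on B: ratios 1/10 = .1 and 1/1 = 1, avg .55
    P1 on B, P2 on A: ratios 10/10 = 1 and .1/1 = .1, avg .55
    P2 then P1, both on A: ratios .1/1 = .1 and 1.1/10 = .11, avg about .1
So by response ratio the first two assignments now look equally good, and
running both jobs on the fast machine is still the clear winner.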
HOMEWORK 12-12
Maximize CPU utilization
NOT my favorite figure of merit.
Throughput
Jobs per hour
Weighted jobs per hour
If weighting is CPU time, get CPU utilization
This is the way to justify CPU utilization (user centric)
Design issues
Deterministic vs Heuristic
Use deterministic for embedded applications, when all
requirements are known a priori
Patient monitoring in hospital
Nuclear reactor monitoring
Centralized vs distributed
Usual tradeoff of accuracy vs fault tolerance and bottlenecks
Optimal vs best effort
Optimal normally requires off line processing.
Similar requirements as for deterministic.
Usual tradeoff of system effort vs result quality
Transfer policy
Does a process decide to shed jobs just based on its own load
or does it have (and use) knowledge of other loads?
Called local vs global algs by tannenbaum
Usual tradeoff of system effort (gather data) vs res quality
Location policy
Sender vs receiver initiated
Look for help vs look for work
Both are done
Tannenbaum asserts that clearly the decision can't be
local (else might send to even higher loaded machine)
NOT clear
The overloaded recipient will then send it again
Tradeoff of #sends vs effort to decide
Use random destination so will tend to spread load
evenly
"Better" might be to send probes first, but the
tradeoff is cost of msg (one more for probe in
normal case) vs cost of bigger msg (bouncing a job
around instead of tiny probes) vs likelihood of
overload at target.
================ Lecture 23 ================
Implementation issues
Determining local load
NOT difficult (despite what tannenbaum says)
The case he mentions is possible but not dominant
Do not ask for "perfect" information (not clear what it
means)
Normally use a weighted mean of recent loads with more
recent weighted higher
Normally uniprocessor stuff
Do not forget that complexity (i.e. difficulty) is bad
Of course
Do not forget the cost of getting additional information needed
for a better decision.
Of course
Example algorithms
Min cut deterministic algorithm
Define a graph with processes as nodes and IPC traffic as arcs
Goal: Cut the graph (i.e some arcs) into pieces so that
All nodes in one piece can be run on one processor
Mem constraints
Processor completion times
Values on cut arcs minimized
Min the max
min max traffic for a process pair
min the sum
min total traffic (tannenbaum's def)
min the sum to/from a piece
don't overload a processor
min the sum between pieces
min traffic for processor pair
Tends to get hard as you get more realistic
Want more than just processor constraints
Figures of merit discussed above
The same traffic between different processor pairs does
not cost the same
HOMEWORK 12-14
Up-down centralized algorithm
Centralized table that keeps "usage" data for a USER, the users
are defined to be the workstation owners. Call this the score
for the user.
Goal is to give each user a fair share.
When workstation requests a remote job, if wkstation avail it
is assigned
For each process user has running remotely, the user's score
increases a fixed amt each time interval.
When a user has an unsatisfied request pending (and none being
satisfied), the score decreases (can go negative).
If no requests pending and none being satisfied, score bumped
towards zero.
When a processor becomes free, assign it to a requesting user
with lowest score.
Hierarchical algorithm
This was used in a system where each processor could run at
most one process. But we describe a more general situation.
Quick idea of algorithm
Processors in tree
Requests go up tree until subtree has enough resources
Request is split and parts go back down tree
Arrange processors in a hierarchy (tree)
This is a logical tree independent of how physically
connected
Each node keeps (imperfect) track of how many available
processors are below it.
If a processor can run more than one process, must be more
sophisticated and must keep track of how many processes
can be allocated (without overload) in the subtree below.
If a new request appears in the tree, the current node sees if
it can be satisfied by the processors below (plus itself).
If so, do it.
If not pass the request up the tree
Actually since machines may be down or the data on
availability may be out of date, you actually try to find
more processors than requested
Once a request has gone high enough to be satisfied, the
current node splits request into pieces and sends each piece
to appropriate child.
What if a node dies?
Promote one of its children say C
Now C's children are peers with the previous peers of C
If this is considered too unbalanced, can promote one
of C's children to take C's place.
How decide which child C to promote?
Peers of dead node have an election
Children of dead node have an election
Parent of dead node decides
What if root dies?
Must use children since no peers or parent
If want to use peers, then do not have a single root
I.e. the top level of the hierarchy is a collection of
roots that communicate
This is a forest, not a tree
What if multiple requests are generated simultaneously?
Gets hard fast as information gets stale and potential
race conditions and deadlocks are possible.
See Van Tilborg and Wittie
Distributed heuristic algorithm
Send probe to random
If remote load is low, ship job
If remote load is high, try another random probe
After k (parameter of implementation) probes all say load is
too high, give up and run job locally.
Modelled analytically and seen to work fairly well
Bidding
View the dist system as a collection of competing service
providers and processes (users?) as potential customers.
Customers bid for service and computers pick the highest bid.
This is just a sketch. MUCH needs to be filled in.
Scheduling
General goal is to have processes that communicate frequently run
simultaneously
If not, and busy waiting is used for msgs, we will have a huge disaster.
Even if use context switching, may have a small disaster as only
one msg transfer can occur per time scheduling slot
Co-scheduling (aka gang scheduling). Processes belonging to a job
are scheduled together
Time slots are coordinated among the processors.
Some slots are for gangs; other slots are for regular
processes.
HOMEWORK 12-16 (use figure 12-27b)
Tannenbaum asserts "there is not much to say about scheduling in a
distributed system"
Wrong.
Concept of job scheduling
In many big systems (e.g. SP2, T3E) run job scheduler
User gives max time (say wall time) and number of
processors
System can use this information to figure out the latest
time a job can run (i.e. assumes all preceding jobs use
all their time).
If a job finishes early and the following job needs more
processors than avail, look for subsequent jobs that
Don't need more processors than available
Won't run past the time the scheduler promised to the
following job.
This is called backfilling
================ Lecture 24 ================
---------------- Chapter 13--Distributed File Systems ----------------
File service vs file server
File service is the specification
File server is an process running on a machine to implement the
file service for (some) files on that machine
In a normal distributed system you would have one file service
but perhaps many file servers
If have very different kinds of filesystems might not be able
to have a single file service as perhaps some services are not
available
File Server Design
File
Sequence of bytes
Unix
MS-Dos
Windows
Sequence of Records
Mainframes
Keys
We do not cover these filesystems. They are often
discussed in database courses
File attributes
rwx perhaps a (append)
This is really a subset of what is called
ACL -- access control list
or
Capability
Get ACLs and Capabilities by reading columns and rows of
the access matrix
owner, group, various dates, size
dump, autocompress, immutable
Upload/download vs remote access
Upload/download means only file services supplied are read
file and write file.
All mods done on local copy of file
Conceptually simple at first glance
Whole file transfers are efficient (assuming you are going
to access most of the file) when compared to multiple
small accesses
Not efficient use of bandwidth if you access only small
part of large file.
Requires storage on client
What about concurrent updates?
What if one client reads and "forgets" to write for a
long time and then writes back the "new" version
overwriting newer changes from others?
Remote access means direct individual reads and writes to the
remote copy of the file
File stays on server
Issue of (client) buffering
Good to reduce number of remote accesses.
What about semantics when a write occurs?
Note that meta-data is written even for a read, so if
you want faithful semantics, every client read
must mod metadata on the server, or all requests for
metadata (e.g. ls or dir commands) must go to the
server.
Cache consistency question
Directories
Mapping from names to files/directories
Contains rules for names of files and (sub)directories
Hierarchy i.e. tree
(hard) links
gives another name to an existing file
a new directory entry
The old and new name have equal status
cd ~
mkdir dir1
touch dir1/file1
ln dir1/file1 file2
Now ~/file2 is the SAME file as ~/dir1/file1
In unix-speak they have the same inode
Need to do rm twice to actually delete the file
The owner is NOT changed so
cd ~
ln ~joe/file1 file2
Gives me a link to a file of joe. Presumably joe set his
permissions so I can't write it.
Now joe does
rm ~/file1
But my file2 still exists and is owned by joe. Most
accounting programs would charge the file to joe (who
doesn't know it exists).
With hard links the filesystem becomes a DAG instead of a
simple tree.
Symlinks
Symbolic (NOT symmetric). Indeed asymmetric
Consider
cd ~
mkdir dir1
touch dir1/file1
ln -s dir1/file1 file2
file2 has a new inode it is a new type of file called a
symlink and its "contents" are the name of the file
dir1/file1
When accessed file2 returns the contents of file1, but it
is NOT equal to file1.
If file1 is deleted, file2 "exists" but is invalid
If a new dir1/file1 is created, file2 now points to it.
Symlinks can point to directories as well
With symlinks pointing to directories, the filesystem
becomes a general graph, i.e. directed cycles are
permitted.
HOMEWORK 13-1
================ Lecture 25 ================
Imagine hard links pointing to directories
(unix does NOT permit this).
cd ~
mkdir B; mkdir C
mkdir B/D; mkdir B/E
ln B B/D/oh-my
Now you have a loop with honest looking links.
Normally can't remove a directory (i.e. unlink it from its
parent) unless it is empty.
But when one can have multiple hard links to a directory, one should
permit removing (i.e. unlinking) a link even if the directory is
not empty.
So in the above example we could unlink B from its parent (the home directory).
Now you have garbage (unreachable, i.e. unnamable) directories
B, D, and E.
For a centralized system need a conventional garbage
collection.
For distributed system need a distributed garbage collector,
which is much harder.
Transparency
Location transparency
Path name (i.e. full name of file) does NOT say where the
file is located.
On our ultra setup, we have filesystems /a /b /c and
others exported and remote mounted. When we moved /e
from machine allan to machine decstation very little
had to change on other machines (just the file
/etc/fstab). More importantly, programs running
everywhere could still refer to /e/xy
But this was just because we did it that way. I.e.,
we could have mounted the same filesystem as /e on one
machine and /xyz on another.
Location Independence
Path name is independent of the server. Hence can move a
file from server to server without changing its name.
Have a namespace of files and then have some (dynamically)
assigned to certain servers. This namespace would be the
same on all machines in the system.
Not sure if any systems do this.
Root transparency
made up name
/ is the same on all systems
Would ruin some conventions like /tmp
Examples
Machine + path naming
/machine/path
machine:path
Mounting remote filesystem onto local hierarchy
When done intelligently get location transparency
Single namespace looking the same on all machines
Two level naming
Said above that a directory is a mapping from names to files
(and subdirectories).
More formally, the directory maps the user name
/home/gottlieb/course/os/class-notes.html to the OS name for
that file 143426 (the unix inode number).
These two names are sometimes called the symbolic and binary
names.
For some systems the binary names are available.
allan$ ls -i course/os/class-notes.html
143426 course/os/class-notes.html
The binary name could contain the server name so that could
directly reference files on other filesystems/machines
Unix doesn't do this
Could have symbolic links contain the server name
Unix doesn't do this either
I believe that vms did something like this. Symbolic name
was something like nodename::filename
It has been a while since I used VMS so I may have
this wrong.
Could have the name lookup yield MULTIPLE binary names.
Redundant storage of files for availability
Naturally must worry about updates
When visible?
Concurrent updates?
WHENEVER you hear of a system that keeps multiple
copies of something, an immediate question should be
"are these immutable?". If the answer is no, the next
question is "what are the update semantics?"
HOMEWORK 13-5
Sharing semantics
Unix semantics -- A read returns the value store by the last
write.
Probably unix doesn't quite do this.
If a write is large (several blocks) do seeks for each
During a seek, the process sleeps (in the kernel)
Another process can be writing a range of blocks that
intersects the blocks for the first write.
The result could be (depending on disk scheduling)
that the result does not have a last write.
Perhaps Unix semantics means -- A read returns the value
stored by the last write providing one exists.
Perhaps Unix semantics means -- A write syscall should be
thought of as a sequence of write-block syscalls and
similar for reads. A read-block syscall returns the value
of the last write-block syscall for that block
Easy to get this same semantics for systems with file servers
PROVIDING
No client side copies (Upload/download)
No client side caching
Session semantics
Changes to an open file are visible only to the process
(machine???) that issued the open. When the file is
closed the changes become visible to all
If using client caching CANNOT flush dirty blocks until
close. What if you run out of buffer space?
Messes up file-pointer semantics
The file pointer is shared across fork so all children
of a parent share it.
But if the children run on another machine with
session semantics, the file pointer can't be shared
since the other machine does not see the effect of the
writes done by the parent).
HOMEWORK 13-2, 13-4
Immutable files
Then there is "no problem"
Fine if you don't want to change anything
Can have "version numbers"
Book says old version becomes inaccessible (at least
under the current name)
With version numbers if use name without number get
highest numbered version so would have what book says.
But really you do have the old (full) name accessible
VMS definitely did this
Note that directories are still mutable
Otherwise no create-file is possible
HOMEWORK 13-4
Transactions
Clean semantics
Using transactions in OS is becoming more widely studied
Distributed File System Implementation
File Usage characteristics
Measured under unix at a university
Not obvious same results would hold in a different environment
Findings
1. Most files are small (< 10K)
2. Reading dominates writing
3. Sequential accesses dominate
4. Most files have a short lifetime
5. Sharing is unusual
6. Most processes use few files
7. File classes with different properties exist
Some conclusions
1 suggests whole-file transfer may be worthwhile (except
for really big files).
2+5 suggest client caching and dealing with multiple
writers somehow, even if the latter is slow (since it is
infrequent).
4 suggests doing creates on the client
Not so clear. Possibly the short lifetime files are
temporaries that are created in /tmp or /usr/tmp or
/somethingorother/tmp. These would not be on the
server anyway.
7 suggests having multiple mechanisms for the several classes.
================ Lecture 26 ================
FINAL EXAM IS THURSDAY 7 MAY IN THIS ROOM 10:00am -- 11:50am
Implementation choices
Servers & clients together?
Common unix+nfs: any machine can be a server and/or a
client
Separate modules: Servers for files and directories are
user programs so can configure some machines to offer
the services and others not to
Fundamentally different: Either the hardware or software
is fundamentally different for clients and servers.
Truth
In unix some server code is in the kernel but other
code is a user program (run as root) called nfsd
File and directory servers together?
If yes, less communication
If no, more modular "cleaner"
Looking up a/b/c when a, a/b, a/b/c are on different servers
Natural soln is for server-a to return name of server-a/b
Then client contacts server-a/b gets name of server-a/b/c
etc.
Alternatively server-a forwards request to server-a/b who
forwards to server-a/b/c.
Natural method takes 6 communications (3 RPCs)
Alternative is 4 communications but is not RPC
Name caching
The translation from a/b/c to the inode (i.e. symbolic to
binary name) is expensive even for centralize system.
Called namei in unix and was once measured to be a
significant percentage of all of kernel activity.
Later unices added "namei caching"
Potentially an even greater time saver for dist systems
since communication is expensive.
Must worry about obsolete entries.
Stateless vs Stateful
Should the server keep information BETWEEN requests from a
user, i.e. should the server maintain state?
What state?
Recall that the open returns an integer called a file
descriptor that is subsequently used in read/write.
With a stateless server, the read/write must be self
contained, i.e. cannot refer to the file descriptor.
Why?
Advantages of stateless
Fault tolerant--No state to be lost in a crash
No open/close needed (saves messages)
No space used for tables (state requires storage)
No limit on number of open files (no tables to fill
up)
No problem if client crashes (no state to be confused
by)
Advantages of stateful
Shorter read/write (descriptor shorter than name)
Better performance
Since keep track of what files are open, know to
keep those inodes in memory
But stateless could keep a memory cache of inodes
as well (evict via LRU instead of at close, not as
good)
Blocks can be read in advance (read ahead)
Of course stateless can read ahead.
Difference is that with stateful can better
decide when accesses are sequential.
Idempotency easier (keep seq numbers)
File locking possible (the lock is state)
Stateless can write a lock file by convention.
Stateless can call a lock server
HOMEWORK 13-6, 9-5, 9-7
Caching
There are four places to store a file supplied by a file
server (these are NOT mutually exclusive)
Server's disk
essentially always done
Server's main memory
normally done
Standard buffer cache
Clear performance gain
Little if any semantics problems
Client's main memory
Considerable performance gain
Considerable semantic considerations
The one we will study
Client's disk
Not so common now
Unit of caching
File vs block
Tradeoff of fewer accesses vs storage efficiency
What eviction algorithm?
Exact LRU feasible because can afford the time to do it
(via linked lists) since access rate is low
HOMEWORK 13-8
Where in client's memory to put cache
The user's process
The cache will die with the process
No cache reuse among distinct processes
Not done for normal OS.
Big deal in databases
Cache management is a well studied DB problem
The kernel (i.e. the client's kernel)
System call required for cache hit
Quite common
Another process
"Cleaner" than in kernel
Easier to debug
Slower
Might get paged out by kernel!
Look at figure 13-10 (handout)
Cache consistency
Big question
Write-through
All writes sent to the server (as well as the
client cache)
Hence does not lower traffic for writes
HOMEWORK 13-10
Does not by itself fix values in other caches
Need to invalidate or update other caches
Can have the client cache check with server
whenever supplying a block to ensure that the
block is not obsolete
Hence still need to reach server for all
accesses but at least the reads that hit in
the cache only need to send tiny msg
(timestamp not data).
I guess this would be called lazy
invalidation.
Delayed write
Wait a while (30 seconds is used in some NFS
implementations) and then send a bulk write msg.
This is more efficient than a bunch of small
write msgs
If file is deleted quickly, you might never write
it.
Semantics are now time dependent (and ugly).
HOMEWORK 13-11
Write on close
Session semantics
Fewer msgs since more writes than closes.
Not beautiful (think of two files
simultaneously opened)
Not much worse than normal (uniprocessor)
semantics. The difference is that it
(appears) to be much more likely to hit the
bad case. Really mean much less unlikely.
HOMEWORK 13-12
================ Lecture 27 ================
FINAL EXAM IS THURSDAY 7 MAY IN THIS ROOM 10:00am -- 11:50am
Delayed write on close
Combines the advantages and disadvantages of
delayed write and write on close.
Doing it "right"
Multiprocessor caching (of central memory) is well
studied and many solns are known.
We mentioned this at beginning of course
Cache consistency (aka cache coherence)
Book mentions a centralized soln.
Others are possible, but none are cheap
Interesting thought: IPC is more expensive than a
cache invalidate but disk I/O is much rarer than
mem refs. Might this balance out and might one of
the cache consistence algorithms perform OK to
manage distributed disk caches?
If so why not used?
Perhaps NFS is good enough and not enough
reason to change (NFS predates cache coherence
work).
Replication
Some issues similar to (client) caching
Why?
Because WHENEVER you have multiple copies of anything,
bells ring
Are they immutable?
What is update policy?
How do you keep copies consistent?
Purposes of replication
Reliability
A "backup" is available if data is corrupted on
one server.
Availability
Only need to reach ANY of the servers to access
the file (at least for queries)
NOT the same as reliability
Performance
Each server handles less than the full load (for a
query-only system MUCH less)
Can use closest server lowering network delays
NOT important for dist sys on one physical
network
VERY important for the web
mirror sites
Transparency
If can't tell files are replicated, say the system has
replication transparency
Creation can be completely opaque
i.e. fully manual
users use copy commands
if directory supports multiple binary names for a
single symbolic name,
use this when making copies
presumably subsequent opens will try the
binary names in order (so not opaque)
Creation can use lazy replication
User creates original
system later makes copies
subsequent opens can be (re)directed at any
copy
Creation can use group communication
User directs requests at a group
Hence creation happen to all copies in the
group at once
HOMEWORK 13-14
Update protocols
Primary copy
All updates done to the primary copy
This server writes the update to stable
storage and then updates all the other
(secondary) copies.
After a crash, the server looks at stable
storage and sees if there are any updates to
complete.
Reads are done from any copy.
This is good for reads (read any one copy).
Writes are not so good.
Can't write if primary copy is unavailable
Semantics
The update can take a long time (some of
the secondaries can be down)
While the update is in progress, reads are
concurrent with it. That is might get old
or new value depending which copy they
read.
Voting
All copies are equal (symmetric)
To write you must write at least WQ of the
copies (a write quorum). Set the version
number of all these copies to 1 + max of
current version numbers.
To read you must read at least RQ copies and
use the value with the highest version.
Require WQ+RQ > num copies
Hence any write quorum and read quorum
intersect.
Hence the highest version number in any
read quorum is the highest ver num there
is.
Hence always read the current version
Consider extremes (WQ=1 and RQ=1)
Fine points (omitted by tannenbaum)
To write, you must first read all the
copies in your WQ to get the ver num.
Must prevent races
Let N=2, WQ=2, RQ=1. Both copies (A
and B) have ver num 10
Two updates start. U1 wants to write
1234, U2 wants to write 6789.
Both read ver numbers and add 1 (get
11).
U1 writes A and U2 writes B at roughly
the same time.
Later U1 writes B and U2 writes A.
Now both are at version 11 but A=6789
and B=1234
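A C sketch of the read and (race-free) write rules; the copies are just
local arrays here, and lock_quorum/unlock_quorum are made-up placeholders
for whatever mechanism prevents the race described above:

    #define N  5                         /* number of copies */
    #define WQ 3                         /* write quorum size */
    #define RQ 3                         /* read quorum size: need WQ + RQ > N */

    static int value[N], version[N];     /* one (value, version) pair per copy */

    void lock_quorum(const int idx[], int n);    /* hypothetical: serialize writers */
    void unlock_quorum(const int idx[], int n);  /* so version numbers can't race */

    int quorum_read(const int idx[RQ])           /* idx = the RQ copies we reached */
    {
        int best = idx[0];
        for (int i = 1; i < RQ; i++)             /* take the highest version seen */
            if (version[idx[i]] > version[best])
                best = idx[i];
        return value[best];                      /* must be the current version */
    }

    void quorum_write(const int idx[WQ], int newval)
    {
        lock_quorum(idx, WQ);
        int maxv = 0;
        for (int i = 0; i < WQ; i++)             /* first read the version numbers */
            if (version[idx[i]] > maxv)
                maxv = version[idx[i]];
        for (int i = 0; i < WQ; i++) {           /* then install value and version+1 */
            value[idx[i]]   = newval;
            version[idx[i]] = maxv + 1;
        }
        unlock_quorum(idx, WQ);
    }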
HOMEWORK 13-15
Voting with ghosts
Often reads dominate writes so choose RQ=1 (or
at least RQ very small so WQ very large)
This makes it hard to write. E.g. RQ=1 so
WQ=n and hence can't update if any machine is
down.
When one detects that a server is down, a
ghost is created.
Ghost canNOT participate in read quorum, but
can in write quorum
write quorum must have at least one
non-ghost
Ghost throws away value written to it
Ghosts always have version 0 (tannenbaum forgot
this point)
When crashed server reboots, it accesses a read
quorum to update its value
================ Lecture 28 ================
FINAL EXAM IS THURSDAY 7 MAY IN THIS ROOM 10:00am -- 11:50am
NFS--SUN Microsystem's Network File System
In chapter 9 of your book. But it should be here in chapter 13.
In newer book, it is here! (Replaces AFS material, which we will
skip).
"Industry standard", dominant system
Machines can be (and often are) both clients and servers
HOMEWORK 9-8
Basic idea is servers EXPORT directories and clients mount them
When server exports a directory, the subtree rooted there is
exported.
In unix exporting is specified in /etc/exports
In unix mounting is specified in /etc/fstab
fstab = file system table
In unix w/o NFS what you mount are filesystems
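A hypothetical example of the two files (server named fileserver,
exported directory /home/proj; this is Linux-style /etc/exports syntax,
and the exact syntax varies among unix flavors):

    # /etc/exports on the server
    /home/proj    client1(rw)  client2(ro)

    # /etc/fstab on a client
    fileserver:/home/proj   /mnt/proj   nfs   rw,hard,intr   0  0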
Two Protocols
1. Mounting
Client sends server msg containing pathname (on server) of
the directory it wishes to mount
Server returns HANDLE for the directory
Subsequent read/write calls use the handle
Handle has data giving disk, inode #, et al (a sketch of
possible handle contents appears after this list)
Handle is NOT an index into table of actively
exported directories. Why not?
Because the table would be STATE and NFS is
stateless
Can do this mounting at any time, often done at client
boot time.
Automounting--we skip
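A C sketch of what might be inside a handle (field names and sizes are
illustrative; in the protocol the handle is just an opaque blob the
client hands back unchanged):

    #include <stdint.h>

    struct nfs_fhandle {
        uint32_t fsid;         /* which exported filesystem (which disk/partition) */
        uint32_t inode;        /* i-node number within that filesystem */
        uint32_t generation;   /* bumped when an i-node is reused, so the server
                                  can reject handles to files since deleted */
    };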
2. File and directory access
Most unix sys calls supported
Open/close NOT supported
NFS is stateless
Do have lookup, which returns a file handle. But this
handle is NOT an index into a table. Instead it contains
the data needed.
As indicated previously, stateless makes unix locking
semantics hard to achieve
Authentication
Client gives the rwx bits to server.
How does the server know the client is the machine it claims to be?
Various Crypto keys.
This and other stuff stored in NIS (net info svc) aka yellow
pages
Replicate NIS
Update master copy
master updates slaves
window of inconsistency
Implementation
(Figure: the NFS implementation layers; the description below follows it.)
Client system call layer processes I/O system calls and calls
the virtual file system layer (VFS).
VFS has a v-node (virtual i-node) for each open file
Like incore inodes for traditional unix
For local files v-node points to inode in local OS
For remote files v-node points to r-node (remote i-node) in
NFS client code.
For remote files r-node holds the file's handle (see below)
which is enough for a remote access
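A C sketch of these client data structures (names are illustrative, not
Sun's; struct nfs_fhandle is the handle sketch from the mounting
section above):

    struct rnode {
        struct nfs_fhandle fh;     /* the server's handle; by itself enough to
                                      name the file in every later request */
    };

    struct vnode {
        int is_remote;             /* local file or NFS file? */
        union {
            struct inode *local;   /* local file: the in-core i-node */
            struct rnode *remote;  /* remote file: the client's r-node */
        } u;
    };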
Blow by blow
Mount (remote directory, local directory)
First the mount PROGRAM goes to work
Contacts the server and obtains a handle for the remote
directory.
Makes mount sys call passing handle
Now the kernel takes over
Makes a v-node for the remote directory
Asks client code to construct an r-node
Has the v-node point to the r-node
Open system call
While parsing the name of the file, the kernel (VFS layer)
hits the local directory on which the remote is mounted
(this part is similar to ordinary mounts of local
filesystems).
Kernel gets v-node of the remote directory (just as would
get i-node if processing local files)
Kernel asks client code to open the file (given r-node)
Client code calls server code to look up remaining portion
of the filename
Server does this and returns a handle (but does NOT keep a
record of this). Presumably the server, via the VFS and
local OS, does an open and this data is part of the
handle. So the handle gives enough information for the
server code to determine the v-node on the server machine.
When client gets a handle for the remote file, it makes an
r-node for it. This is returned to the VFS layer, which
makes a v-node for the newly opened remote file. This
v-node points to the r-node. The latter contains the
handle information.
The kernel returns a file descriptor, which points to the
v-node.
Read/write
VFS finds v-node from the file descriptor it is given.
Realizes remote and asks client code to do the read/write
on the given r-node (pointed to by the v-node).
Client code gets the handle from its r-node table and
contacts the server code.
Server verifies the handle is valid (perhaps
authentication) and determines the v-node.
VFS (on server) called with the v-node and the read/write
is performed by the local (on server) OS.
Read ahead is implemented but as stated before it is
primitive (always read ahead).
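A C sketch of the client-side read dispatch just described (assumes the
struct vnode / struct rnode sketch above; nfs_client_read and
local_read are hypothetical stand-ins for the NFS RPC and the local
file system read):

    #include <sys/types.h>   /* ssize_t, size_t, off_t */

    ssize_t vfs_read(struct vnode *vn, void *buf, size_t n, off_t off)
    {
        if (vn->is_remote)
            /* Client code ships the handle to the server; the server maps
               the handle back to one of its own v-nodes and the server's
               local OS does the actual read. */
            return nfs_client_read(&vn->u.remote->fh, buf, n, off);
        return local_read(vn->u.local, buf, n, off);
    }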
Caching
Servers cache but not big deal
Clients cache
Potential problems of course so
Discard cached entries after some SECONDS
On open the server is contacted to see when the file was
last modified. If it is newer than the cached version, the
cached version is discarded.
After some SECONDS all dirty cache blocks are flushed back
to server.
All these bandaids still do not give proper semantics (or even
unix semantics).
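A C sketch of the two client-side freshness checks (the timeout value
and the field names are illustrative):

    #include <time.h>

    #define CACHE_TIMEOUT_SECS 30    /* illustrative; real clients use a tunable
                                        timeout of a few to a few tens of seconds */

    struct cache_entry {
        time_t cached_at;      /* when this block was cached */
        time_t server_mtime;   /* server's last-modified time seen at open */
        int    dirty;          /* must eventually be flushed back to the server */
    };

    /* Age test: a cached block is simply thrown away after some SECONDS. */
    int cache_entry_fresh(const struct cache_entry *e, time_t now)
    {
        return (now - e->cached_at) < CACHE_TIMEOUT_SECS;
    }

    /* Open-time test: if the server's copy is newer than what we cached,
       the cached copy is discarded. */
    int must_discard_on_open(const struct cache_entry *e, time_t server_mtime_now)
    {
        return server_mtime_now > e->server_mtime;
    }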
Lessons learned (from AFS, not covered, but applies in some generality)
Workstations, i.e. clients, have cycles to burn
So do as much as possible on client
Cache whenever possible
Exploit usage properties
Several classes of files (e.g. temporary)
Trades off simplicity for efficiency
Minimize system wide knowledge and change
Helps scalability
Favors hierarchies
Trust fewest possible entities
Good principle for life as well
Try not to depend on the "kindness of strangers"
Batch work where possible