Distributed Operating Systems
                            1997-98 Spring
                            Allan Gottlieb

              Text: Tanenbaum, Modern Operating Systems

================ Lecture 1 ================
---------------- Some administrivia --------------

Reaching me
    Office Hour Thurs 11:15--12:15 (i.e. after class)
    email gottlieb@nyu.edu
    715 Broadway room 1001

Midterm & Final & Labs & Homework

    Lab != Homework  Describe both

    Labs can be run on home machine
         ... but you are responsible
         Please PLEASE upload a copy to ACF

Upper Left board for announcements ... but you are responsible.

Handouts are low tech, not great (a feature!)

Web page: http://allan.ultra.nyu.edu/gottlieb/os

Comment on the text

    The "right" text is Tanenbaum's "Distributed OS", but that book
    has a high overlap with the second part of "Modern OS" and you
    already own the latter (V22.202 used it).

----------------  Chapter 9, Intro to Distributed Systems ----------------


    Powerful micros: Inexpensive machines worth connecting

    High-speed LANs: Move data quickly

       Low latency: Small transfers in about a ms
       High Bandwidth: 10 Megabits/sec (now 100 "soon" 1000)

    Gave rise to Distributed (as opposed to Centralized) systems

(Somewhat vague) Definition of a distributed system

    A distributed system is a collection of independent computers that
    appears to the users of the system as a single computer.

Advantages over centralized

    Repeal of Grosch's law--indeed the reverse occurred (diminishing
    returns) so that now dist sys are more cost efficient

    Absolute performance beyond a single system

        Big web engine

        Weather prediction

    Sometimes the natural soln since the problem is distributed

        Branches of a bank

        Computer supported cooperative work

    Incremental growth


        In principle you have redundancy so fault tolerance

Advantages over independent machines

    Data sharing

    Device sharing

        Sometimes more expensive than the computer

    Communication (between people)--email


        In addition to the symmetric soln of each user gets
        pc/wkstation can have some more powerful and/or specialized

Disadvantages of distributed systems

    Immature software

    Network failure / saturation



        Often with a dist sys you need ALL the components to work


Hardware configuration and concepts.

    Book shows its age a little here

    Flynn's classification of machines with multiple CPUs

        SISD: Single Instruction stream Single Data stream

        SIMD: Process arrays all at once

        MISD: Perhaps doesn't exist; not important

        MIMD: Each CPU capable of autonomy

================  Lecture 2 ================

            Multiprocessors: Shared address space across all

            Multicomputer: Each processor has its own

    Tightly Coupled vs. Loosely Coupled

        Former means capable of cooperating closely on a single
        problem

            Fine grained communication: i.e. communicate often

            Large amt of data transferred

        Tightly coupled machines can normally emulate loosely coupled
        ones but might give up some fault tolerance

        Can solve a single large problem on loosely coupled hardware,
        but the problem must be coarse grained.

            Various cryptographic challenges solved over the internet
            with email as the only communication

    Bus-based multiprocessors

        These are symmetric multiprocessors (SMPs): From the viewpoint
        of one processor, all the others look the same.  None are
        logically closer than others.  SMP also implies cache coherent
        (see below).

        Hardware is fairly simple.  Recall a uniprocessor looks like

               Cache(s)               Mem
                  |                    |
                  |                    |
            ------------------------------------ Memory Bus
                ------------------------------- I/O Bus (e.g. PCI)
                     |          |     |      |
                     |          |     |      |
                   video       scsi serial  etc

        Complication.  When a scsi disk writes to memory (i.e. a disk
        read is performed), you need to keep the cache(s) up to date,
        i.e. CONSISTENT with the data.  If the cache had the old value
        before the disk read, the system can either INVALIDATE the
        cache entry or UPDATE it with the new value.

        Facts about caches.  They can be WRITE THROUGH, when the
        processor issues a store the value goes in the cache and also
        is sent to memory, or they can be WRITE BACK, where the value
        only goes to the cache.  In the latter case the cache line is
        marked dirty and when it is evicted it must be written back to
        memory.

        To make a bus-based MP just add more processor-caches

                Proc       Proc
                  |          |
                  |          |
               Cache(s)   Cache(s)    Mem
                  |          |         |
                  |          |         |
            ------------------------------------ Memory Bus
                ------------------------------- I/O Bus (e.g. PCI)
                     |          |     |      |
                     |          |     |      |
                   video       scsi serial  etc

        A key question now is whether the caches are automatically
        kept consistent (a.k.a. coherent) with each other.  When the
        processor on the left writes a word, you can't let the old
        value hang around in the right hand cache(s) or the right hand
        processor can read the wrong value.

        If the cache is write back, then on an initial write the cache
        must claim ownership of the cache line (invalidate all other
        copies).  If the cache is write through, you can either
        invalidate other copies or update them.
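        A toy model of the write-through-with-invalidate idea, purely
        illustrative (the class and variable names are invented here;
        real coherence is done in hardware, not software):

```python
# Toy model of two write-through caches kept coherent by snooping.
class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}          # shared memory: address -> value

class SnoopyCache:
    def __init__(self, bus):
        self.lines = {}           # cached copies: address -> value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:               # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value, policy="invalidate"):
        self.lines[addr] = value
        self.bus.memory[addr] = value            # write through to memory
        for other in self.bus.caches:            # snoop: fix other copies
            if other is self or addr not in other.lines:
                continue
            if policy == "invalidate":
                del other.lines[addr]            # other cache re-fetches later
            else:                                # "update" policy
                other.lines[addr] = value

bus = Bus()
c1, c2 = SnoopyCache(bus), SnoopyCache(bus)
c2.read(100)                  # c2 now holds a copy of address 100
c1.write(100, 7)              # invalidate removes c2's stale copy
assert 100 not in c2.lines
assert c2.read(100) == 7      # c2's re-fetch sees the new value
```

        Switching `policy="update"` instead leaves c2's copy in place
        but overwrites it with the new value; both keep the caches
        consistent.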

        Because of the broadcast nature of the bus it is not hard to
        design snooping (a.k.a. snoopy) caches that maintain
        consistency.  I call this the "Turkish Bath" mode of
        communication because "everyone sees what everyone else has".

        These machines are VERY COMMON.

        DISADVANTAGE (really limitation) Cannot be used for a large
        number of processors since the bandwidth needed on the bus
        grows with the number of processors and this gets increasingly
        difficult and expensive to supply.  Moreover the latency grows
        with the number of processors for both speed of light and more
        complicated (electrical) reasons.

HOMEWORK 9-2 (ask me next time about this one)

    Other SMPs

        From the software viewpoint, the key property of bus-based
        MPs is that they are SMPs; the bus itself is just an
        implementation property.

        To support larger numbers of processors, other interconnection
        networks could be used.  The book shows two possibilities:
        crossbars and omega networks.  Ignore what is said there as it
        is too superficial and occasionally more or less wrong.
        Tannenbaum is not a fan of shared memory.

        (The Ultracomputer in figure 9.4 is an NYU machine; my main
        research activity from 1980-95)

        Another name for SMP is UMA for Uniform Memory Access,
        i.e. all the memories are equidistant from a given
        processor.

        For larger numbers of processors, NUMAs (NonUniform Memory
        Access) MPs are being introduced.  Typically some memory is
        associated with each processor and that memory is close.
        Others are further away.

          P--C---M   P--C---M
               |          |
               |          |
          |                            |
          |   interconnection network  |
          |                            |

        CC-NUMAs (Cache Coherent NUMAs) are programmed like SMPs BUT
        to get good performance must try to exploit the memory
        hierarchy and have most references hit in your local cache and
        most others in the part of the shared memory in your "node".

HOMEWORK 9-2' Will semaphores work here

        NUMAs (not cache coherent) are yet harder to program as you
        must maintain cache consistency manually (or with compiler
        help).

HOMEWORK 9-2'' Will semaphores work here

    Bus-based Multicomputers

        The big difference is that it is an MC not an MP, NO shared
        memory.

        In some sense all the computers on the internet form one
        enormous MC.

        The interesting case is when there is some closer cooperation
        between processors; say the workstations in one distributed
        systems research lab cooperating on a single problem.

        Application must tolerate long-latency communication and
        modest bandwidth, given the current state of the practice.

        Recall the figure for a typical computer

               Cache(s)               Mem
                  |                    |
                  |                    |
            ------------------------------------ Memory Bus
                ------------------------------- I/O Bus (e.g. PCI)
                     |          |     |      |
                     |          |     |      |
                   video       scsi serial  etc

        The simple way to get a bus-based MC is to put an ethernet
        controller on the I/O bus and then connect all these with a
        bus called ethernet.  (Draw this on the board).

        Better (higher performance) is to put the Network Interface
        (possibly ethernet, possibly something else) on the memory bus.

    Other (bigger) multicomputers

        Once again buses are limiting so switched networks are used.
        Much could be said, but currently grids (SGI/Cray T3E) and
        omega networks (IBM SP2) are popular.  Hypercubes less popular
        than in the 80s.

HOMEWORK 9-2'''  Will semaphores work here, 9-3, 9-4

================ Lecture 3 ================

Software concepts

    Tightly vs. loosely coupled

        High autonomy and low communication characterize loose
        coupling.

    Network Operating Systems

        Loosely coupled software and hardware

        Autonomous workstations on a LAN

        Commands normally run locally

        Remote login and remote execution possible (rlogin, rcp, rsh)

        Next step is to have a shared file system.

            Have file SERVERS that export filesystems to CLIENTS on
            the LAN.

            Client makes a request to server, which sends a response.

            Example from unix:  Unix file structure is a tree (really
            a dag).  DOS is a forest (a: c: not connected).  For local
            disks, unix supplies mount to make one file system part of
            another.  To get the shared file system, the idea is to
            mount remote filesystems locally.  The server exports the
            filesystems to the clients, who do the mount.

            Client decides where to mount the fs.  Hence the same
            filesystem can be mounted at different places on different
            clients.

            This fails to give location transparency.

            An example from my research group (a few years ago)

                We had several machines all running unix.  Named
                after the owner: allan, eric, richard, jan, jim, susan
                and one other called lab.  We had several
                filesystems that were for common material /a /b /c.
                I remember that allan.ultra.nyu.edu had /e and
                exported it to other machines.  Our convention was
                to mount /e on /nfs/allan/e but have a link
                /e --> /nfs/allan/e so on all machines /e got you
                to the same place.

                However there were two system filesystems / and
                /usr that were on all machines (a little different
                on each).  For sysadmin we wanted all these
                accessible to all our admins.  So on allan I would
                have jim's /usr mounted on /nfs/jim/usr.  I could
                not link /usr --> /nfs/jim/usr.  Why?
                The name I used was /jim.usr but that is not
                location transparent.

            Another example from same place.

                To save disk space we kept only one copy of
                /usr/gnu/lib/emacs.  It was stored on lab in
                /d/usr.gnu.lib/emacs and everyone mounted /d and
                linked /usr/gnu/lib/emacs --> /d/usr.gnu.lib/emacs.

                Problem: When lab was down, so was emacs!

                So we went to several servers for this.  All
                controlled manually.

            Reads and writes are caught by the client and requests
            made to the server.  For reads the server responds by
            giving the requested data to the client, which is then
            used to satisfy the read.

            Text gives detailed description of NFS implementation.  We
            will defer this until chapter 13 (distributed filesystems)
            where it belongs.

        Is a network OS a dist system?

    True distributed systems (according to St. Andrew)

        (Comparatively) tightly-coupled software on loosely-coupled
        hardware

        Goal is to create the illusion that there is a single computer.

        Single-system image

        Virtual uniprocessor


        Many scientists use the term single-system image or virtual
        uniprocessor even if the hardware is tightly coupled.

        Official (Modern OS) def: A DISTRIBUTED SYSTEM is one that runs
        on a collection of machines that do not have shared memory,
        yet looks to its users like a single computer.

        The newer book (Dist OS) uses the def we gave earlier: A
        distributed system is a collection of independent computers
        that appears to the users of the system as a single computer.

        Single IPC (interprocess communication) mechanism.

            Any process (on any machine) can talk to ANY other process
            (even those on another machine) using the same procedure.

        Single protection scheme to protect all objects

            Same scheme (ACL, unix rwx bits) on all machines.

        Consistent process management across machines.

        File system looks the same everywhere.  Not only the same name
        for a given file but, for example, the same rules for
        access permissions.
        Easiest to do if all the machines are the same so the same
        operating system can be run on each.  However, there is
        considerable work on "heterogeneous computing" where (in
        principle at least) different computers are used.

            For each program, have a separate binary for each
            architecture.

            What about floating point formats and wordsize?

            What about byte order?

            Autoconvert when doing ipc

        Even when each system is the same architecture and runs
        the same OS, one may not want a centralized master that
        controls everything.

            Paging is probably best done locally (assuming a local
            disk).

            Scheduling is not clear.  Advantages both ways.  Many
            systems have local scheduling but communication and
            process migration for load balancing.

================ Lecture 4 ================

    Multiprocessor systems

        Tanenbaum's "timesharing" term dates the book.

        Easiest to get a single system image.

        One example is my workstation at NECI (soon to move north).

        Shared memory makes a single run queue (ready list) natural choice.

            So scheduling is "trivial": Take uniprocessor code and add
            a lock protecting the shared ready list.
            The standard diagram for process states applies.
The diagram is available in postscript (best) and in html (most portable).
            What about processor affinity?
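        A minimal sketch of the single shared ready list, with Python
        threads standing in for processors (names invented here; a real
        scheduler dispatches to hardware, not a dictionary):

```python
# Minimal sketch (not a real scheduler): one shared ready list
# protected by a lock, served by several "processors" (threads).
import threading
from collections import deque

ready_list = deque(range(10))   # 10 runnable "processes"
lock = threading.Lock()
ran_on = {}                     # process -> processor that ran it

def processor(name):
    while True:
        with lock:              # uniprocessor code plus one lock
            if not ready_list:
                return
            proc = ready_list.popleft()
        ran_on[proc] = name     # "run" the process

workers = [threading.Thread(target=processor, args=(f"cpu{i}",))
           for i in range(4)]
for w in workers: w.start()
for w in workers: w.join()
assert len(ran_on) == 10        # every process ran exactly once
```

        Note that any idle processor grabs the next process; this is
        exactly why processor affinity (cache warmth) is a question.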

        Shared memory makes shared I/O buffer cache natural choice.

            Gives true (i.e. uniprocessor) semantics for I/O with
            little extra work.

            Avoiding performance bottlenecks is serious.

        Shared memory makes uniprocessor file system a natural choice.

    Summary (Tanenbaum figure 9-13)

                        Network        Distributed   Multiprocessor
                        Operating      Operating     Operating
Item                    System         System        System
Virtual Uniprocessor    No             Yes           Yes

Same OS                 No             Yes           Yes

Copies of OS            N              N             1

Communication           Shared Files   Messages      Shared memory

Network Protocols       Yes            Yes           No

Single Ready List       No             No            Yes

Well def file sharing   Rare           Yes           Yes


Design issues

    A big one is TRANSPARENCY, i.e. you don't "see" the multiplicity
    of processors.  Or you see through them.

    (NOT always what is wanted.  Might want to create as many (child)
    processes as there are processors.  Might choose a different
    coordination algorithm for 2 processes than for 50)

    Location transparency: cannot tell where the resources are

        So ``pr -Psam'' does not tell you where sam is located (but
        cat /etc/printcap does).

        I assume Tanenbaum means you need not know where the resources
        are located.  Presumably there are routing tables somewhere
        saying where they are.

    Migration transparency: moving a resource does not require a name
    change.

        Recall our research system.  If you move a file from /e stored
        on allan to lab it cannot be in /e (property of mount).
        Symlinks can hide this slightly.

    Replication transparency: Don't know how many copies of a
    resource exist.

        E.g. multithreaded (or multiprocess) server

    Concurrency transparency.

        Don't notice other users (except for slowdowns).

            That is, same as for uniprocessor.

        Can lock resource--but deadlocks possible

            Again just like for uniprocessor.


    Parallelism transparency

        You write uniprocessor programs and poof it all works hundreds
        of times faster on a dist sys with 500 processors.

        Far from current state of the art

        My wording above is that you don't always want transparency.
        If you had parallelism transparency, you would always want it.

    Flexibility--really monolithic vs micro kernel

        Should the (entire) OS run in supervisor mode?

        Tanenbaum is rather one sided here.  He is a big microkernel
        fan.  Amoeba is his group's system.

        Monolithic kernel is conventional, all services supplied by
        the kernel.  Often derived from a uniprocessor OS.

        Microkernel only supplies


            Some mem mgt

            low level proc mgt and sched

            low level I/O

        Micro kernel does not supply


            Most system calls

            Full process management (deciding which process is
            highest priority)

        Instead these are supplied by servers


            Fairly easy to modify while the system is running.

            Bug in server can bring down system but the bug will
            appear in the server not in another part of the OS


            Performance: Tanenbaum says it's minor but not so clear.
            Crossing more protection domains and having more transfers
            of control, hurt cache performance and this is
            increasingly important as speed ratios between processors
            and DRAM grow.



    Reliability

        It's an improvement if you need just one of many possible
        resources in order to work.


        It's a negative if you need all the resources.  (AND vs OR)

            Avail with prob p^n if AND (p = prob one resource avail)

            Avail with prob 1 - (1-p)^n if OR
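        A quick check of the two formulas (the values of p and n here
        are just illustrative):

```python
# Availability of a system of n components, each up with probability p.
p, n = 0.9, 5
need_all = p ** n               # AND: all n resources must be up
need_any = 1 - (1 - p) ** n     # OR: any one resource suffices
assert need_all < p < need_any  # replication hurts AND, helps OR
# need_all is about 0.59; need_any is about 0.99999
```

        So adding machines makes an OR-style system far more available
        and an AND-style system far less.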

        Lamport's def of dist sys "One on which I cannot get any work
        done because some machine I have never heard of has crashed"

        Availability: percentage of time the system is up.  So
        increased replication increases availability

        Consistency: Must keep copies consistent so data not garbled.
        Increasing replication makes this worse (more expensive).


    Performance

        Coarse grained parallelism:  Little, infrequent
        communication/coordination.  This is the easy case to get high
        perf for.  Sometimes called "embarrassingly parallel".

        Fine grained parallelism:  Tight coordination and/or much data
        communication.  Tough.  Many msgs.


    Scalability

        Not trivial!  Tanenbaum describes Minitel, where the French
        phone company is "currently" (i.e. prior to 1992) installing a
        terminal in every home.  If successful, "other countries will
        inevitably adopt similar systems".  It is 1998 and we don't
        have it yet.

            Is the web a dist sys?

        Centralized components/tables/algorithms.

            If the degree of parallelism is large any centralized
            "thing" is a potential bottleneck.

            Fault tolerance (single point of failure).

            Centralized tables have fault tolerance and performance
            bottleneck problems.

                The perf problem can be solved if the table is
                concurrently accessible, by combining requests on the
                way to the server holding the centralized table.

        It is often too expensive to get the entire (accurate) state
        of the system to one computer to act on.  Instead, one prefers
        decentralized algorithms.

            No machine has complete information

            Decisions made based on locally available info (obvious?)

            Tolerate some machine failures

            Don't assume a global clock (exactly synchronized)


---------------- Chapter 10: Communication in Dist Sys  ----------------

With no shared memory, communication is very different from that in a
uniprocessor (or a shared memory multiprocessor).

PROTOCOL: An agreement between communicating parties on how
communication is to proceed.

    Error correction codes.



LAYERED protocol: The protocol decisions concern very different things

    How many volts is a 1 or a 0?  How wide is the pulse? (LOW level)

    Error correction



    As a result you have many routines that work on the various
    aspects.  They are called layered.

    Layer X of sender acts as if it is directly communicating with
    layer X of receiver but in fact it is communicating with layer X-1
    of sender.

    Similarly layer X of sender acts as a virtual layer X+1 of
    receiver to layer X+1 of sender.

Famous example is the ISO OSI (International Standards Organization
Open Systems Interconnection) reference model

    First lets look at the OSI diagram just as an example of layering

The diagram is available in postscript (best) and in html (most portable).

    So for example the network layer sends msgs intended for the other
    network layer but in fact sends them to the data link layer

    Also the network layer must accept msgs from the transport layer,
    which it then sends to the other network layer (really its own
    data link layer).

    What a layer really does to a msg it receives is that it adds a
    header (and maybe a trailer) that is to be interpreted by its
    corresponding layer in the receiver.

    So the network layer adds a header (in front of the transport
    layer's header) and sends to the other network layer (really its
    own data link layer that adds a header in front of the network
    layer's--and a trailer--)

================ Lecture 5 ================

    So headers get added as you go down the sender's layers (often
    called the PROTOCOL STACK or PROTOCOL SUITE).  They get used (and
    stripped off) as the msg goes up the receiver's stack.

    It all starts with process A sending a msg.  By the time it
    reaches the wire it has 6 headers (the physical layer doesn't add
    one--WHY?) and one trailer.
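    The header game can be sketched in a few lines (the layer names
    follow the OSI figure; the string formats here are invented for
    illustration):

```python
# Sketch: each sender layer prepends a header as the msg goes down the
# stack; the receiver's layers strip them off in reverse order.  The
# physical layer adds no header, so 6 headers plus a data link trailer.
LAYERS = ["application", "presentation", "session",
          "transport", "network", "data-link"]

def send_down(msg):
    for layer in LAYERS:                 # top of the stack to the bottom
        msg = f"{layer}-hdr|" + msg      # each layer prepends its header
    return msg + "|dl-trailer"           # data link also adds a trailer

def receive_up(wire):
    wire = wire.removesuffix("|dl-trailer")
    for layer in reversed(LAYERS):       # bottom of the stack upward
        hdr = f"{layer}-hdr|"
        assert wire.startswith(hdr)      # each layer reads its own header
        wire = wire[len(hdr):]
    return wire

assert send_down("m").count("-hdr|") == 6
assert receive_up(send_down("hello")) == "hello"
```

    Each layer only touches its own header, which is exactly why the
    layers can be changed independently.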

    Nice thing is that the layers are independent.  You can change one
    layer and not change the others.


    Now let's understand the layers themselves.

        Physical layer: hardware, i.e. voltages, speeds, duplexity,
        etc.

        Data link layer: Error correction and detection.  "Group the bits
        into units called frames".

            Frames contain error detection (and correction) bits.

            This is what the pair of data link layers do when viewed
            as an extension of the physical.

            But when being used, the sending DL layer gets a packet
            from the network layer and breaks it into frames and adds
            the error detection bits.

        Network layer: Routing.

            Connection oriented network-layer protocol: X.25.  Send a
            msg to destination and establish a route that will be used
            for further msgs during this connection (a connection
            number is given).  Think telephone.

            Connectionless: IP (Internet Protocol).  Each PACKET
            (message between the network layers) is routed separately.
            Think post office.

        Transport layer: make reliable and ordered (but not always).

            Break incoming msg into packets and send to corresponding
            transport layer (really send to ...).  They are sequence
            numbered.

            Header contains info as to which packets have
            been sent and received.

            These sequence numbers are for the end to end msg.
            I.e. if allan.ultra.nyu.edu sends msg to allan.nj.nec.com
            the transport layer numbers the packets.  But these
            packets may take different routes.  On any one hop the
            data link layer keeps the frames ordered.
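            A sketch of why the end-to-end sequence numbers suffice
            (packet size and message are illustrative):

```python
# The transport layer numbers packets end to end, so even if the
# network delivers them out of order the message reassembles correctly.
def to_packets(msg, size=4):
    chunks = [msg[i:i + size] for i in range(0, len(msg), size)]
    return list(enumerate(chunks))       # (sequence number, data) pairs

def reassemble(packets):
    return "".join(data for _, data in sorted(packets))

pkts = to_packets("end to end message")
pkts.reverse()                 # network delivered them out of order
assert reassemble(pkts) == "end to end message"
```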

            If you use X.25 for network there is little for transport
            layer to do.

            If you use IP for network layer, there is a lot to do.

            Important transport layers are TCP (transmission control
            protocol) and UDP (user datagram protocol)

            TCP is connection oriented and reliable as described

            UDP is basically just IP (a "dummy" transport layer) for
            when you don't need that service (error rates are low
            already and msgs almost always come in order).

            TCP/IP is widely used.

        Session Layer: dialog and sync (we skip it)

        Presentation layer: Describes "meaning" of fields (we skip it).

        Application layer: For specific apps (e.g. mail, news, ftp)
        (we skip it)

Another (newer) famous protocol is ATM (asynchronous transfer mode)

    NOT in modern operating systems

    The hardware is better (newer) and has much higher data rates than
    previous long haul networks.

        155Mbits/sec compared to T3=45Mb/s is the low end.  622Mb/sec
        is midrange.

        In pure circuit switching a circuit is established and held
        (reserved) for the duration of the transmission.

        In store and forward packet switching, go one hop at a time
        with the entire packet

        ATM establishes a circuit but it is not (exclusively)
        reserved.  Instead the packet is broken into smallish
        fixed-sized CELLS.  Cells from different transmissions can be
        interleaved on the same wire.

    Cute fact about high speed LONG haul networks is that a LOT of
    bits are on the wires.  15ms from coast to coast.  So at
    622Mb/sec, have about 10Mb on the wire.  So if the receiver says
    STOP, 20Mb will still come.  At 622Mb/sec it takes 1.6ms to push a
    Mb file out of the computer.  So if you wait for a reply, most of
    the time the line is idle.
HOMEWORK Assume one interrupt to generate 8-bits and the interrupt
takes 1us (us=microsecond) to process.  Assume you are sending at a
rate of 155Mb/sec.  What percent of the CPU time is being used to
process the interrupts?

Trouble in paradise

    The problem with all these layered things (esp OSI) is that all
    the layers add overhead.

    Especially noticeable in systems with high speed interconnects
    where the processing time counts (i.e. not completely wire
    limited).

Clients and Servers

    Users and providers of services

    This provides more than communication.  It gives a structuring to
    the system.

    Much simpler protocol.

        Physical and datalink are as normal (hardware)

        All the rest is request/reply protocol

        Client sends a request; server sends a reply

        Less overhead than full OSI

    Minimal kernel support needed

        send (destination, message)

        receive (port, message)

        variants (mailboxes, etc) discussed later

Message format (example; triv file server)

    struct message
        source   -- sender, so the server knows where to reply
        name     -- name of the file
        opcode   -- e.g. READ or WRITE
        count    -- number of bytes to read or write
        position -- offset in the file
        result   -- result of the operation (in the reply)
        data     -- the bytes read or to be written

Servers are normally infinite loops

    loop
        receive (port, message)
        switch message.opcode
            -- handle various cases filling in replymessage
        replymessage.result = result from case
        send (message.source, replymessage)

Here is a client for copying A to B

    position = 0
    do
        -- set (name=A, opcode=READ, count, position) in message
        send (fileserver, message)
        receive (port, replymessage) -- could use same message
        -- set (name=B, opcode=WRITE, count=result, position) in message
        send (fileserver, message)
        position +=result
    while result>0
    -- return OK or error code.

    Bug? If the read gets an error you send a bogus (negative) count
    to write

    position = 0
    loop
        -- set (name=A, opcode=READ, count, position) in message
        send (fileserver, message)
        receive (port, replymessage) -- could use same message
    exit when result <=0
        -- set (name=B, opcode=WRITE, count=result, position) in message
        send (fileserver, message)
        position +=result
    --return OK or error code
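A runnable sketch of the whole request/reply copy, with the "kernel"
replaced by two in-memory queues.  The field and opcode names follow
the pseudocode above; everything else (queues, CHUNK size, the
shutdown message) is invented for illustration:

```python
# Trivial file server and copy client over a request/reply protocol.
import queue, threading

requests, replies = queue.Queue(), queue.Queue()
files = {"A": b"hello world", "B": b""}
CHUNK = 4

def file_server():                      # server: an infinite loop
    while True:
        m = requests.get()
        if m is None:                   # shutdown signal, not in the text
            return
        if m["opcode"] == "READ":
            data = files[m["name"]][m["position"]:m["position"] + m["count"]]
            replies.put({"result": len(data), "data": data})
        elif m["opcode"] == "WRITE":    # appends, since the client
            files[m["name"]] += m["data"]   # writes strictly in order
            replies.put({"result": len(m["data"])})

def copy(src, dst):                     # client for copying src to dst
    position = 0
    while True:
        requests.put({"opcode": "READ", "name": src,
                      "count": CHUNK, "position": position})
        r = replies.get()
        if r["result"] <= 0:            # EOF (or error): stop cleanly
            break
        requests.put({"opcode": "WRITE", "name": dst,
                      "data": r["data"], "position": position})
        replies.get()                   # consume the WRITE reply
        position += r["result"]

t = threading.Thread(target=file_server); t.start()
copy("A", "B")
requests.put(None); t.join()
assert files["B"] == files["A"]
```

Note the client stops before sending a bogus negative count to WRITE,
which is exactly the bug fixed in the second pseudocode version.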

Addressing the server

    Need to specify the machine the server is on and the "process"
    number

    Actually more common is to use the port number

        Tanenbaum calls it local-id

        Server tells kernel that it wants to listen on this port

    Can we avoid giving the machine name (location transparency)?

        Can have each server pick a random number from a large space
        (so probability of duplicates is low).

        When a client wants service S it broadcasts an "I need S" and
        the server supplying S responds with its location.  So now the
        client knows the address.

        This first broadcast and reply can be used to eliminate
        duplicate servers (if desired) and different services
        accidentally using the same number

        Another method is to use a name server that has mapping from
        service names to locations (and local ids, i.e. ports).

            At startup servers tell the name server their location

            Name server a bottleneck?  Replicate and keep consistent
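        A name server is, at heart, just a mapping; a minimal sketch
        (the service name and address below are illustrative, borrowed
        from examples earlier in these notes):

```python
# Sketch of a name server: servers register (machine, port) at startup;
# clients look the location up instead of hard-coding it.
name_server = {}                    # service name -> (machine, port)

def register(service, machine, port):
    name_server[service] = (machine, port)

def lookup(service):
    return name_server[service]     # real clients would cache this reply

register("fileserver", "allan.ultra.nyu.edu", 4096)
machine, port = lookup("fileserver")
assert (machine, port) == ("allan.ultra.nyu.edu", 4096)
```

        Every client request starting with a lookup is what makes the
        name server a potential bottleneck.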

To block or not to block

    Synchronous vs asynchronous

    Send and receive synchronous is often called rendezvous

    Async send

        Do not wait for the msg to be received, return control
        immediately.

        How can you re-use the message variable?

            Have the kernel copy the msg and then return.  This costs
            an extra copy.

            Don't copy but send interrupt when msg sent.  This makes
            programming harder

            Offer a syscall to tell when the msg has been sent.
            Similar to above but "easier" to program.  But difficult
            to guess how often to ask if msg sent
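    A sketch of the first option (the kernel copies, so the sender may
    reuse its buffer at once); the queue stands in for the kernel and
    all names are invented:

```python
# Async send where the "kernel" copies the msg before returning.
import queue

channel = queue.Queue()

def async_send(msg):
    channel.put(list(msg))      # kernel makes a copy; this is the cost
    # returns immediately -- the sender may now overwrite msg

buf = [1, 2, 3]
async_send(buf)
buf[0] = 99                     # reuse the message variable right away
assert channel.get() == [1, 2, 3]   # receiver still sees the old value
```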

================ Lecture 6 ================

    Async Receive

        Return control before the msg variable has been filled in
        with the received message.

        How can this be useful?

            Wait syscall (until msg avail)

            Test syscall (has msg arrived)

            Conditional receive (receive or announce no msg yet)


            none are beautiful


        If have blocking primitive, send or receive could wait

        Some systems/languages offer timeouts.


To buffer or not to buffer (unbuffered)

    If unbuffered, the receiver tells where to put the msg

        Doesn't work if the async send occurs before the receive
        (where to put the msg?).

    For buffered, the kernel keeps the msg (in a MAILBOX) until the
    receiver asks for it.

        Raises buffer management questions



    Can assert that msg delivery is not reliable

        Checked at higher level

    Kernel can ack every msg

        Senders and repliers keep the msg until they receive the ack

    Kernel can use the reply as the ack for every request and
    explicitly ack replies

    Kernel can use the reply as the ack for every request but NOT
    ack replies

        Client will resend request

        Not always good (if server had to work hard to calculate reply)

    Kernel at the server end can deliver the request and send an ack if
    the reply is not forthcoming soon enough.  Again can either ack the
    reply or not.

Large design space.  Several multiway choices for acks, buffers, ...

Click for diagram in postscript or html.

Other messages (meta messages??)

    Are you alive?

    I am alive

    I am not alive (joke)

    Mailbox full

    No process listening on this port

---------------- Remote Procedure Call (RPC) ----------------

Birrell and Nelson (1984)

Recall how different the client code for copying a file was from
normal centralized (uniprocessor) code.

Let's make client-server request-reply look like a normal procedure
call and return.

Tanenbaum says C has call-by-value and call-by-reference.  I don't
like this.  In C you pass pointers by value (at least for scalars) and
I don't consider that the same.  But "this is not a programming
languages course"


Recall that getchar in the centralized version turns into a read
syscall (I know about buffering).  The following is for Unix.

    Read looks like a normal procedure to its caller.

    Read is a USER mode program (needs some assembler code)

    Read plays around with registers and then does a poof (trap)

    After the poof, the kernel plays with regs and then does a
    C-language routine and lots gets done (drivers, disks, etc)

    After the I/O the process gets unblocked, the kernel read plays
    with regs, and does an unpoof.  The user mode read plays with regs
    and returns to the original caller

Let's do something similar with request-reply

    User (client) does subroutine call to getchar (or read)

        Client knows nothing about messages

    We link in a user mode program called the client stub (analogous
    to the user mode read above).

        Takes the parameters to read and converts them to a msg
        (MARSHALLS the arguments)

        Sends a msg to the machine containing the server directed to
        the server's port (local-id)

        Does a blocking receive (of the reply msg)

    Server stub is linked with the server.

        Receives the msg from the client stub.

        Unmarshalls the arguments and calls the server (as a
        normal procedure call)
    The server procedure does what it does and returns (to the server
    stub)

        Server knows nothing about messages

    Server stub now converts this to a reply msg sent to the client

        Marshalls the arguments

    Client stub unblocks and receives the reply

        Unmarshalls the arguments

        RETURNS to the client

    Client believes (correctly) that the routine it calls has returned
    just like a normal procedure does.
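The stub pipeline above can be sketched in Python for a hypothetical remote read(fd, count).  The wire format (32-bit ints in network byte order) and the helper names are assumptions for illustration, not any real RPC package:

```python
import struct

# Hypothetical stubs for a remote read(fd, count).  The wire format
# (32-bit ints, network byte order) is an assumption for illustration.

def marshall_request(fd, count):
    return struct.pack("!ii", fd, count)        # client stub: args -> msg

def server_stub(msg, do_read):
    fd, count = struct.unpack("!ii", msg)       # unmarshall the arguments
    data = do_read(fd, count)                   # call the (local) server proc
    return struct.pack("!i", len(data)) + data  # marshall the reply

def unmarshall_reply(reply):
    (n,) = struct.unpack("!i", reply[:4])
    return reply[4:4 + n]

# Fake server procedure standing in for the real read.
def fake_read(fd, count):
    return b"x" * count

reply = server_stub(marshall_request(3, 5), fake_read)
assert unmarshall_reply(reply) == b"xxxxx"
```

The transport (the send and blocking receive) is elided; only the marshalling round trip is shown.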

Heterogeneity: Machines have different data formats

    Previously discussed

    Have conversions between all possibilities

        Done during marshalling and unmarshalling

    Adopt a std and convert to/from it.



What about pointers?

    Avoid them for RPC!

    Can put the object pointed to into the msg itself (assuming you
    know its length).

        Convert call-by-ref to copyin/copyout

        If have in or out param (instead of in out) can elim one of
        the copies

    Gummy up the server to handle pointers special

        Callback to client stub


Registering and name servers

    As we said before can use a name server.

    This permits the server to move

        deregister from the name server



    Sometimes called dynamic binding

    Client stub calls name server (binder) first time to get a HANDLE
    to use for the future

        Callback from binder to client stub if server deregisters, or
        have the attempt to use the handle fail so the client stub
        will go to the binder again.



    This gets hard and ugly

    Can't find the server.

        Need some sort of out-of-band response from the client stub
        to the client

            Ada exceptions

            C signals

            Multithread the CLIENT and start the "exception" thread

        Loses transparency (centralized systems don't have this).

    Lost request msg

        This is easy if known.  That is, if sure request was lost.

        Also easy if idempotent and think might be lost.

        Simply retransmit request

            Assumes the client still knows the request

    Lost reply msg

        If it is known the reply was lost, have server retransmit

            Assumes the server still has the reply

            How long should the server hold the reply

                Wait forever for the reply to be ack'ed--NO

                Discard after "enough" time

                Discard after receive another request from this client

                Ask the client if the reply was received

                Keep resending reply

    What if not sure of whether lost request or reply?

        If server stateless, it doesn't know and client can't tell.

        If idempotent, simply retransmit the request

    What if not idempotent and can't tell if lost request or reply?

        Use sequence numbers so server can tell that this is a new
        request not a retransmission of a request it has already done.

        Doesn't work for stateless servers
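The sequence-number idea can be sketched with an invented nonidempotent withdraw operation; the (stateful) server remembers the last sequence number and reply per client, so a retransmission is answered from the cache instead of being re-executed:

```python
# Sketch: server remembers the last sequence number seen from each client
# so a retransmitted (duplicate) request is answered from a cached reply
# instead of being re-executed.  Names are illustrative.

class Server:
    def __init__(self):
        self.last_seq = {}    # client id -> last seq executed
        self.last_reply = {}  # client id -> cached reply
        self.balance = 100

    def withdraw(self, client, seq, amount):   # NOT idempotent
        if self.last_seq.get(client) == seq:   # a retransmission
            return self.last_reply[client]     # don't do it twice
        self.balance -= amount
        self.last_seq[client] = seq
        self.last_reply[client] = self.balance
        return self.balance

s = Server()
assert s.withdraw("c1", 1, 30) == 70
assert s.withdraw("c1", 1, 30) == 70   # duplicate request: no double withdraw
assert s.balance == 70
```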

HOMEWORK 10-13 10-14  Remind me to discuss these two next time

    Server crashes

        Did it crash before or after doing some nonidempotent action?

        Can't tell from messages.

        For databases, get idea of transactions and commits.

            This really does solve the problem but is not cheap.

        Fairly easy to get "at least once" (try request again if timer
        expires) or "at most once" (give up if timer expires)
        semantics.  Hard to get "exactly once" without transactions.

            To be more precise: a transaction either happens exactly
            once or not at all (sounds like at most once) AND the
            client knows which.
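The two timer policies can be sketched as follows; send() returning None stands in for a timeout, and everything here is simulated rather than a real transport:

```python
# Sketch of the two timer policies.  send() returning None models a
# lost reply / expired timer; everything is simulated.

def at_least_once(send, request, max_tries):
    for _ in range(max_tries):         # retry on timeout
        reply = send(request)
        if reply is not None:
            return reply
    raise TimeoutError("gave up")

def at_most_once(send, request):
    return send(request)               # one try; timeout means give up

losses = iter([None, None, "done"])    # first two sends are "lost"
def flaky(request):
    return next(losses)

assert at_least_once(flaky, "req", 5) == "done"
assert at_most_once(lambda request: None, "req") is None
```

Note that at_least_once may execute a nonidempotent request more than once; that is exactly the trade-off discussed above.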

    Client crashes

        Orphan computations exist.

        Again transactions work but are expensive

        Can have the rebooted client start another epoch; all
        computations of the previous epoch are killed
            Better is to let old computations whose owners can be
            found continue.

        Not wonderful

            Orphan may hold locks or might have done something not
            easily undone.

        Serious programming needed.


    Protocol choice

        Existing ones like UDP are designed for harder (more general)
        cases so are not efficient.

        Often developers of distributed systems invent their own
        protocol that is more efficient.

            But of course they are all different.

        On a LAN would like large messages since they are more
        efficient and, given the high bandwidth, don't take long
        to transmit.

================ Lecture 7 ================


    Acks

        One per pkt vs one per msg

        Called stop-and-wait and blast

            In former wait for each ack

            In blast keep sending packets until msg finished

        Could also do a hybrid

            Blast but ack each packet

            Blast but request only those pkts missing instead of a
            general retransmission

                Called selective repeat
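A toy simulation of blast with selective repeat; loss is just a set of dropped packet numbers, and retransmissions are assumed to arrive:

```python
# Sketch of blast with selective repeat: the sender blasts all packets
# of a msg, the receiver asks only for the missing ones.  Loss is
# simulated by a set of dropped packet numbers.

def blast(packets, lost):
    # receiver keeps whatever arrives, indexed by packet number
    received = {i: p for i, p in enumerate(packets) if i not in lost}
    while len(received) < len(packets):
        missing = [i for i in range(len(packets)) if i not in received]
        # selective repeat: resend only the missing packets
        # (retransmissions assumed to arrive)
        for i in missing:
            received[i] = packets[i]
    return b"".join(received[i] for i in range(len(packets)))

msg = [b"aa", b"bb", b"cc", b"dd"]
assert blast(msg, lost={1, 3}) == b"aabbccdd"
```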

    Flow control

        Buffer overrun problem

            Internet worm was caused by a buffer overrun overwriting
            non-buffer space.  This is not the problem here.

            Can occur right at the interface chip, in which case the
            (later) packet is lost.

            More likely with blast but can occur with stop and wait if
            have multiple senders

        What to do

            If the chip needs a delay to do back-to-back receives,
            have the sender delay that amount.

            If can only buffer n pkts, have sender only send n then
            wait for ack

            The above fails when have simultaneous sends.  But
            hopefully that is not too common.

            This tuning to the specific hardware present is one reason
            why gen'l protocols don't work as well as specialized ones.

    Why so slow?  Lots to do!

        Call stub

        get msg buf

        marshall params

        If use a std (UDP), compute checksum

        fill in headers


        Copy msg to kernel space (Unless special kernel)

        Put in real destination addr

        Start DMA to comm device

        ----------------  wire time

        Process interrupt (or polling delay)

        Check packet

        Determine relevant stub

        Copy to stub addr space (unless special kernel)



        Call server

    On the Paragon (Intel large MPP of a few years ago), the above
    (not exactly the same things) took 30us, of which 1us was wire time

    Eliminating copying

        Message transmission is essentially a copy so min is 1

            This requires the network device to do its DMA from the
            user buffer (client stub) directly into the server stub.

            Hard for receiver to know where to put the msg until it
            arrives and is inspected

            Sounds like a copy is needed from the receiving buffer to
            the server stub

            Can avoid this by fiddling with mem maps

                Must be full pages (as that is what is mapped)

        Normally there are two copies on the receiving side

            From hardware buffer to a kernel buffer

            From kernel buffer to user space (server stub)

        Often two on sender side

            User space (client stub) to kernel buffer

            Kernel buffer to buffer on device

            Then start the device

        The sender ones can be reduced

            Having the device do DMA from the kernel buffer eliminates
            the 2nd

            Doing DMA from user space would eliminate the first but
            needs scatter-gather (just gather here) since the header
            must be in kernel space since the user is not allowed to
            set it.

        To eliminate the two on the receiver side is harder

            Can eliminate the first if device writes directly into
            kernel buffer.

            To eliminate the 2nd requires the remapping trick.

    Timers and timeout values

        Getting a good value for the timeouts is a black art.

            Too small and many unneeded retransmissions

            Too large and wait too long

            Should be adaptive??

                If you find that you sent an extra msg, raise the
                timeout for this class of transmissions.

                If timeout expires most of the time, lower value for
                this class
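One way to make the timeout adaptive is a smoothed estimate of the round-trip time per class (similar in spirit to TCP's RTT estimation; the class name and constants here are arbitrary):

```python
# Sketch of an adaptive timeout: keep a smoothed estimate of the
# round-trip time for a class of transmissions and set the timeout to
# a multiple of it.  Constants are arbitrary, for illustration only.

class AdaptiveTimer:
    def __init__(self, initial=1.0, alpha=0.125, factor=2.0):
        self.rtt = initial        # smoothed round-trip estimate
        self.alpha = alpha        # weight given to each new measurement
        self.factor = factor      # safety margin over the estimate

    def observe(self, measured_rtt):
        # exponentially weighted moving average of measured RTTs
        self.rtt = (1 - self.alpha) * self.rtt + self.alpha * measured_rtt

    def timeout(self):
        return self.factor * self.rtt

t = AdaptiveTimer(initial=1.0)
for _ in range(50):
    t.observe(0.1)            # replies are coming back fast...
assert t.timeout() < 1.0      # ...so the timeout adapts downward
```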

        How to keep timeout values

            If you know that almost all timers of this class are going
            to go off (alarms) and accuracy is important, then keep a
            list sorted by time to alarm.

                Only have to scan the head for expired timers (so can
                do it often)

                Additions must search for place to add

                Deletions (cancelled alarms) are presumed rare

            If deletions are common and can afford not so accurate an
            alarm, then sweep list of all processes (not so frequently
            since accuracy not required).

            Deletions and additions are easy since list is indexed by
            process number
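The sorted-list scheme can be sketched with a min-heap keyed on alarm time, so checking for expired timers only looks at the head:

```python
import heapq

# Sketch of the sorted-list scheme: a min-heap keyed on alarm time,
# so checking for expired timers only examines the head.

class TimerList:
    def __init__(self):
        self.heap = []                     # (alarm_time, timer_id)

    def add(self, when, timer_id):
        # addition searches (logarithmically) for its place
        heapq.heappush(self.heap, (when, timer_id))

    def expired(self, now):
        # pop every timer whose alarm time has passed; peeking at the
        # head is O(1), so this can be done often
        out = []
        while self.heap and self.heap[0][0] <= now:
            out.append(heapq.heappop(self.heap)[1])
        return out

tl = TimerList()
tl.add(5, "a"); tl.add(2, "b"); tl.add(9, "c")
assert tl.expired(6) == ["b", "a"]   # fired in time order
assert tl.expired(6) == []           # "c" has not gone off yet
```

Cancelling an alarm (the deletion case above) would need an extra search or lazy invalidation, matching the note that deletions are presumed rare in this scheme.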

Difficulties with RPC

    Global variables like errno inherently have shared-variable
    semantics so don't fit in a distributed system.

        One (remote) procedure sets the variable and the local
        procedure is supposed to see it.

        But the setting is a normal store so is not seen by the
        communication system.

        So transparency is violated

    Weak typing makes marshalling hard/impossible

        How big is the object we should copy?

        What is the conversion needed if heterogeneous system?

        So transparency is violated

    Doesn't fit all programming models

        Unix pipes

            pgm1 < f > g looks good

                pgm1 is a client for stdin and stdout

                RPCs to the file server for f and g

            Similarly for pgm2 < x > y

            But what about pgm1 < f | pgm2 > y

                Both pgm1 and pgm2 are compiled to be clients but they
                communicate so one must be a server

                Could make pipes servers but this is not efficient and
                somehow doesn't seem natural

        Out of band (unexpected) msgs

            Program reading from a terminal is client and have a
            terminal server to supply the characters.

            But what about ^C?  Now the terminal driver is supposed to
            be active, and servers are passive

---------------- Group Communication ----------------

Groups give rise to ONE-TO-MANY communication.

    A msg sent to the group is received by all (current) members.

    We have had just send-receive, which is point-to-point or
    ONE-TO-ONE communication
    Some networks support MULTICASTING

        Broadcast is special case where many=all

        Multicasting gives the one-to-many semantics needed

        Can use broadcast for multicast

            Each machine checks to see if it is a part of the group

        If don't even have broadcast, use multiple one-to-one
        transmissions often called UNICASTING.

Groups are dynamic

    Groups come and go

    Members come and go

    Need algorithms for this

        Somehow the name of the group must be known to the
        process that wants to join
        For an open group, just send a "I'm joining" msg

        For a closed group, not possible

        For a pseudo-closed (allows join msgs), just like open

        To leave a group send goodbye

        If process dies members have to notice it

            I am alive msgs

            Piggyback on other msgs

        Once noticed send a group msg removing member

            This msg must be ordered w.r.t other group msgs (Why?)

        If many go down or network severs, need to re-establish group

Message ordering

    A big deal

    Want consistent ordering

        Will discuss in detail later this chapter

    Note that an interesting problem is ordering goodbye and join with
    regular msgs.

Closed and open groups

    Difference is that in an open group an outsider can send to the group

    Open for replicated server

    Closed for group working on a single (internal) problem

================ Lecture 8 ================

Peer vs hierarchical groups

    Peer has all members equal.

    Common hierarchy is depth 2 tree

        Called coordinator-workers

        Fits problem solving in a master/slave manner

    Can have more general hierarchies

    Peer has better fault tolerance but it is harder to make decisions
    since need to run a consensus algorithm.

    What to do if coord fails

        Give up

        Elect a new leader (Will discuss in section 11.3)

        (This is last "soft" section.  Algorithms will soon appear)

Group addressing

    Have kernel know which processes on its own machine are in
    which groups.

    Hence must just get the msgs to the correct machines.

    Again can use multicast, broadcast or multiple point to point

Group communication uses send-receive

    RPC doesn't fit group comm.

    Msg to group is NOT like a procedure call

    Would have MANY replies


Atomicity

    Either every member receives the msg or none does.

    Clearly useful

    How to implement?

        Can't always guarantee that all kernels will receive the msg

        Use acks

        What about crashes?

             Expensive ALGORITHM:

                Original sender sends to all, with acks, timers, and
                retransmissions
                Each time anyone receives a msg for the first time it
                does what the sender did

                Eventually either all the running processes get msg or
                none do.

                Go through some scenarios.


Ordering revisited

    Best is if have GLOBAL TIME ORDERING, i.e. msgs delivered in order sent

        Requires exact clock sync

        Too hard

    Next best is CONSISTENT TIME ORDERING, i.e. msgs delivered in the
    same order to all

        Even this is hard and we will discuss causality and weaker
        orders later this chapter

    If processes are in multiple groups must order be preserved
    between groups?

        Often NOT required.  Different groups have different functions,
        so no need for the msgs to arrive in the same order in the
        two groups.
        Only relevant when the groups overlap.  Why?

    If have an Ethernet, get a natural order since have just one msg on
    the Ethernet at a time and all receive it.

        But if one is lost and must retransmit ???

        Perhaps kill first msg

        This doesn't work with more complicated networks (gateways)

HOMEWORK 10-18 (ask me in class next time)

---------------- Case study:  Group communication in ISIS ----------------

Ken Birman Cornell


    Synchronous System:  Communication takes zero time so events
    happen strictly sequentially

        Can't build this

            needs absolute clock sync

            needs zero trans time

    Loosely synchronous

        Transmission takes nonzero time

        Same msg may be received at different times at different
        machines

        All events (send and receive) occur in the SAME ORDER at all
        machines
    Two events are CAUSALLY RELATED if the first might influence the
    second
        The first "causes" (i.e. MIGHT cause) the second.

        Events in the same process are causally related in program
        order
        A send of msg M is causally related to the receive of M.

    Events not causally related are CONCURRENT

    Virtually synchronous

        Causally related events occur in the correct order at all
        processes

            In particular in the same order

        No guarantees on concurrent events.

HOMEWORK 10-17  At least describe what is the main problem.  Let's
discuss this one next time.

Click for diagram in postscript or html.

        Not permitted to deliver C before B because they are causally
        related
Message ordering (again)

    Best (Synchronous)  Msgs delivered in order sent.

    Good (Loosely Synchronous): Messages delivered in the same order to all

    OK (Virtually synchronous) Delivery order respects causality.

ISIS algorithms for ordered multicast

    ABCAST: Loosely Synchronous

    Book is incomplete

        I filled in some details that seem to fix it.

        Learning to see these counterexamples is crucial for designing
        CORRECT distributed systems

    To send a msg

        Pick a timestamp (presumably bigger than any you have seen)

        Send timestamped msg to all group members

        Wait for all acks, which are also timestamped

        Send commit with ts = max { ts of acks }

    When (the system, not the application) receives a msg

        Send ack with ts > any it has seen or sent

    When receive a commit

        Deliver committed msgs in timestamp order


        4 processes  A, B, C, D

        Messages denoted (M,t) M is msg, t is ts

        time = 1

            A sends (Ma, 1) to B, C, D

            B sends (Mb, 1) to A, C, D

        time = 2

            B receives Ma acks with ts=2

            C receives Ma acks with ts=2

            A receives Mb acks with ts=2

            D receives Mb acks with ts=2

        time = 3

            A received ack with ts=2 from B & C

            B received ack with ts=2 from A & D

            D receives Ma acks with ts=3

            C receives Mb acks with ts=3

        time = 4

            A received ack with ts=3 from D

            B received ack with ts=3 from C

        time = 5

            A and B send commit with ts=3 to B,C,D and A,C,D

        Now can have two violations

            C and D receive both commits and since they have the same
            ts, might break the tie differently.

                Solution: Agree on a tie-breaking rule (say, the lower
                process number wins)
            C might get A's commit and D might get B's commit.  Since
            each has only one commit, it is delivered to the program

                Solution: Keep track of all msgs you have acked but
                not received commit for.  Can't deliver msg if a lower
                numbered ack is outstanding.
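The final delivery rule, with the tie-breaking fix, can be sketched as a sort on (timestamp, process id); the process ids are invented for illustration:

```python
# Sketch of the ABCAST delivery rule: committed msgs are delivered in
# timestamp order, ties broken by a fixed rule (here: the lower process
# id wins), so every member breaks ties the same way.

def delivery_order(committed):
    # committed: list of (ts, sender_id, msg); tuple sort orders by
    # ts first, then sender_id, giving the agreed tie-break
    return [m for ts, pid, m in sorted(committed)]

# A and B both committed with ts=3 (the scenario above): all members
# sort the same way, so C and D deliver in the same order.
committed = [(3, "B", "Mb"), (3, "A", "Ma")]
assert delivery_order(committed) == ["Ma", "Mb"]
```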

    CBCAST: Virtually Synchronous

        Each process maintains a list L with n components (n = group size)

            ith component is the number of the last msg received from i

            all components are initialized to 0

        Each msg contains a vector V with n components

        For Pi to send msg

            Bump ith component of L

            Set V in msg equal to (new value of) L

        For Pj to receive msg from Pi

            Compare V to Lj.  Must have

                V(i) = Lj(i) + 1 (next msg received from i)

                V(k) <= Lj(k)  (j has seen every msg the current msg
                depends on)

            Bump ith component of L
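The receive test can be sketched directly from the two conditions above (0-based indices here):

```python
# Sketch of the CBCAST delivery test: Pj may deliver a msg from Pi
# carrying vector V only if V[i] == L[i] + 1 (the next msg in sequence
# from i) and V[k] <= L[k] for all k != i (j has already seen every msg
# the current msg causally depends on).  Indices are 0-based here.

def can_deliver(V, L, i):
    if V[i] != L[i] + 1:
        return False
    return all(V[k] <= L[k] for k in range(len(L)) if k != i)

def deliver(V, L, i):
    L[i] += 1                  # bump ith component of the local list

L = [1, 0, 2]                  # Pj saw 1 msg from P0, 0 from P1, 2 from P2
assert can_deliver([2, 0, 2], L, 0)        # next msg from P0: OK
assert not can_deliver([2, 1, 2], L, 0)    # depends on a P1 msg j lacks
assert not can_deliver([3, 0, 2], L, 0)    # a P0 msg was skipped
deliver([2, 0, 2], L, 0)
assert L == [2, 0, 2]
```

A msg that fails the test is simply held back and retested as other msgs are delivered; that is what enforces the causal order.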

================ Lecture 9 ================

Class cancelled

================ Lecture 10 ================

        Show how this works using the causality diagram

Click for diagram in postscript or html.

---------------- Shared Memory Coordination ----------------

See chapter 2 of Tanenbaum

We will be looking at process coordination using SHARED MEMORY and
BUSY WAITING

    So we don't send messages but read and write shared variables.

    When we need to wait, we loop and don't context switch

        Can be wasteful of resources if must wait a long time.

        Context switching primitives normally use busy waiting in
        their implementation.

Mutual Exclusion

    Consider adding one to a shared variable V.

    When compiled onto many machines get three instructions

        load  r1 <-- V
        add   r1 <-- r1+1
        store r1 --> V

    Assume V is initially 10 and one process begins the 3 inst seq

        after the first instruction context to another process

            registers are of course saved

        new process does all three instructions

        context switch back

            registers are of course restored

        first process finishes

    V has been incremented twice but has only reached 11

    Problem is the 3 inst seq must be atomic, i.e. cannot be
    interleaved with another execution of these instructions

    That is, one execution excludes the possibility of another.  So
    they must exclude each other, i.e. mutual exclusion.

    This was a RACE CONDITION.

        Hard bugs to find since non-deterministic.
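The bad interleaving can be replayed deterministically in a few lines; each register is just a local variable:

```python
# Deterministic replay of the bad interleaving above: two "processes"
# each run load/add/store on V, but the first is interrupted right
# after its load.  Both increment, yet V only reaches 11.

V = 10

# process 1: load, then "context switch"
r1_p1 = V              # load  r1 <-- V

# process 2 runs all three instructions
r1_p2 = V              # load
r1_p2 = r1_p2 + 1      # add
V = r1_p2              # store (V is now 11)

# switch back: process 1 finishes with its stale register value
r1_p1 = r1_p1 + 1      # add
V = r1_p1              # store

assert V == 11         # two increments, but one update was lost
```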

    Can be more than two processes

    The portion of code that requires mutual exclusion is often called
    a CRITICAL SECTION
HOMEWORK 2-2 2-3 2-4

One approach is to prevent context switching

    Do this for the kernel of a uniprocessor

        Mask interrupts

    Not feasible for user mode processes

    Not feasible for multiprocessors


The problem is to write code of the form

    loop
        trying-part
        critical-section
        releasing-part
        non-critical section

So that when many processes execute this you never have more than one
in the critical section

That is you must write trying-part and releasing-part

Trivial solution.

    Let releasing part be simply "halt"

This shows we need to specify the problem better

Additional requirement

    Assume that if a process begins execution of its critical section
    and no other process enters the critical section, then the first
    process will eventually exit the critical section

    Then the requirement is "If a process is executing its
    trying part, then SOME process will eventually enter the
    critical section"
---------------- Software-only solutions to CS problem ----------------

We assume the existence of atomic loads and stores

    Only up to word length

We start with the case of two processes

Easy if want tasks to alternate in CS and you know which one goes
first in CS

            Shared int turn = 1

loop                                  loop
    while (turn=2)  --EMPTY BODY!!        while (turn=1)
    CS                                    CS
    turn=2                                turn=1
    NCS                                   NCS

But always alternating does not satisfy the additional requirement
above.  Let NCS for process 1 be an infinite loop (or a halt).
Will get to a point when process 1 is in its trying part but
turn=2 and turn will not change.  So some process enters its
trying part but neither proc will enter the CS.

Some attempts at a general soln follow

First idea.  The trouble was the silly turn

           Shared bool P1wants=false, P2wants=false

loop                                  loop
    P1wants <-- true                      P2wants <-- true
    while (P2wants)                       while (P1wants)
    CS                                    CS
    P1wants <-- false                     P2wants <-- false
    NCS                                   NCS

This fails.  Why?  This kind of question is very important and
makes a good exam question.

Next idea.  It is easy to fix the above

           Shared bool P1wants=false, P2wants=false

loop                                  loop
    while (P2wants)                       while (P1wants)
    P1wants <-- true                      P2wants <-- true
    CS                                    CS
    P1wants <-- false                     P2wants <-- false
    NCS                                   NCS

This also fails.  Why??  Show on board what a scenario looks like.

Next idea need a second test.

           Shared bool P1wants=false, P2wants=false

loop                                  loop
    while (P2wants)                       while (P1wants)
    P1wants <-- true                      P2wants <-- true
    while (P2wants)                       while (P1wants)
    CS                                    CS
    P1wants <-- false                     P2wants <-- false
    NCS                                   NCS

Guess what, it fails again.

The first one that worked was discovered by a mathematician named
Dekker.  Use turn only to resolve disputes.

           Shared bool P1wants=false, P2wants=false
           Shared int  turn=1

loop                                loop
    P1wants <-- true                    P2wants <-- true
    while (P2wants)                     while (P1wants)
        if turn=2                           if turn=1
            P1wants <-- false                   P2wants <-- false
            while (turn=2)                      while (turn=1)
            P1wants <-- true                    P2wants <-- true
    CS                                  CS
    turn <-- 2                          turn <-- 1
    P1wants <-- false                   P2wants <-- false
    NCS                                 NCS

First two lines look like deadlock country.  But it is not an
empty while loop.

    The winner-to-be just loops waiting for the loser to give up
    and then goes into the CS.

    The loser-to-be

        Gives up

        Waits to see that the winner has finished

        Starts over (knowing it will win)

Dijkstra extended Dekker's soln for > 2 processes

Others improved the fairness of Dijkstra's algorithm

These complicated methods remained the simplest known until 1981
when Peterson found a much simpler method

Keep Dekker's idea of using turn only to resolve disputes, but
drop the complicated then-body of the if.

           Shared bool P1wants=false, P2wants=false
           Shared int  turn=1

loop                                loop
    P1wants <-- true                    P2wants <-- true
    while (P2wants and turn=2)          while (P1wants and turn=1)
    CS                                  CS
    turn <-- 2                          turn <-- 1
    P1wants <-- false                   P2wants <-- false
    NCS                                 NCS

This might be correct!

The standard solution from Peterson just has the assignment to
turn moved from the releasing part to the trying part

           Shared bool P1wants=false, P2wants=false
           Shared int  turn=1

loop                                loop
    P1wants <-- true                    P2wants <-- true
    turn <-- 2                          turn <-- 1
    while (P2wants and turn=2)          while (P1wants and turn=1)
    CS                                  CS
    P1wants <-- false                   P2wants <-- false
    NCS                                 NCS
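Peterson's standard version transcribes directly into threaded Python (0/1 in place of 1/2).  This is a sketch: it leans on CPython's GIL for sequentially consistent loads and stores, so it is an illustration, not a multiprocessor-safe implementation.  The switch interval is shortened so the busy waits stay brief:

```python
import sys
import threading

sys.setswitchinterval(1e-4)    # switch threads often; keeps busy waits short

wants = [False, False]         # P1wants / P2wants from the pseudocode
turn = 0
count = 0                      # the shared variable the CS protects
N = 1000

def worker(me):
    global turn, count
    other = 1 - me
    for _ in range(N):
        wants[me] = True               # announce interest
        turn = other                   # defer to the other process
        while wants[other] and turn == other:
            pass                       # busy wait (empty body)
        count += 1                     # critical section (racy without CS)
        wants[me] = False              # release

t0 = threading.Thread(target=worker, args=(0,))
t1 = threading.Thread(target=worker, args=(1,))
t0.start(); t1.start()
t0.join(); t1.join()
assert count == 2 * N                  # no lost updates
```

Without the trying and releasing parts, `count += 1` (a load, add, store) can lose updates exactly as in the three-instruction example earlier.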


    Peterson's paper is in Inf. Proc. Lett. 12 #3 (1981) pp

    A fairly simple proof of its correctness is by Hofri in
    OS Review Jan 1990 pp 18-22.  I will put a copy in the Courant
    library in the box for this course.

    Peterson actually showed the result for n processes (not just
    2).
    Hofri's proof also shows that the algorithm satisfies a strong
    fairness condition, namely LINEAR WAITING.

        Assuming a process continually executes, others cannot pass
        it (see Hofri for details).

================ Lecture 11 ================

Don't forget to go over 10-17 and 10-18

Hand out "Simulating Concurrency" notes for lab 1

These notes and lab 1, Process Coordination, are on the web

---------------- (binary) Semaphores ----------------

Trying and release are often called ENTRY and EXIT, or WAIT and SIGNAL,
or DOWN and UP, or P and V (the latter are from Dutch words--Dijkstra).

Let's try to formalize the entry and exit parts.

To get mutual exclusion we need to ensure that no more than one task
can pass through P until a V has occurred.  The idea is to keep trying
to walk through the gate and, when you succeed, ATOMICALLY close the
gate behind you so that no one else can enter.

Definition (NOT an implementation)

    Let S be an enum with values closed and open (like a gate).

    P(S) is
        while S=closed    --empty body
        S <-- closed

    The failed test and the assignment are a single atomic action.

    P(S) is
      label:
        {[     --begin atomic part
        if S=open
            S <-- closed
            got-it <-- true
        else
            got-it <-- false
        }]     --end atomic part
        if not got-it
            goto label

    V(S) is
        S <-- open

Note that this P and V (NOT yet implemented) can be used to solve the
critical section problem very easily

    The entry part is P(S)

    The exit part is V(S)

Make very sure the "atomic part" is understood.

Note that dekker and peterson do not give us a P and V since each
process has a unique entry and a unique exit

S is called a (binary) semaphore.

To implement binary semaphores we need some help from our hardware

Boolean in out X

TestAndSet(X) is
    oldx <-- X
    X <-- true
    return oldx

Note that the name is a good one.  This function tests the value of X
and sets it (i.e. sets it true; reset is to set false)

Boolean in out X, in e

FetchAndOr(X, e)
    oldx <-- X
    X <-- X OR e
    return oldx

TestAndSet(X) is the same as FetchAndOr(X, true)

FetchAndOr is also a good name, X is fetched and OR'ed with e

Why is it better to return the old than the new value?

Now P/V for binary semaphores is trivial.

    S is a Boolean variable (false is open, true is closed)

    P(S) is
        while (TestAndSet(S))

    V(S) is
        S <-- false


Just because P and V are now simple to implement does not mean that
the implementation is trivial to understand.

    Go over the implementation of P(S)

This works fine no matter how many processes are involved

---------------- Counting Semaphores ----------------

Now want to consider permitting a bounded number of processes into
what might be called a SEMI-CRITICAL SECTION.

        SCS   -- at most k processes can be here simultaneously

A semaphore S with this property is called a COUNTING SEMAPHORE.

If k=1, get a binary semaphore so counting semaphore generalizes
binary semaphore.

How can we implement a counting semaphore given binary semaphores?

    S is a nonnegative integer

    Initialize S to k, the max number allowed in SCS

    Use k=1 to get binary semaphore (hence the name binary)

    We only ask for

        Limit of k in SCS (analogue of mutual exclusion)

        Progress:  If a process enters P and fewer than k are in the
        SCS, some process will enter the SCS

    We do not ask for fairness, and don't assume it (for the binary
    semaphore) either.

Idea: use bin sem to protect arith on S

                binary semaphore q
    P(S) is                         V(S) is
        start-over:                     P(q); S++; V(q)
        P(q)
        if S<=0
            V(q)
            goto start-over
        S--
        V(q)

    Explain how this works.

Let's make this a little more elegant (avoid goto)

                binary semaphore q
    P(S) is                         V(S) is
        loop                            P(q); S++; V(q)
            P(q)
        exit when S>0  -- will exit loop holding the lock
            V(q)
        S--
        V(q)

This so-called n-and-a-half loop clearly shows what is going on.

An alternative is to "prime" the loop.

    get-character   -- priming read
    while not eof

This is all very nice but the real trouble is that the above code is WRONG!

    Distinguish between DEADLOCK, LIVELOCK, STARVATION

        Deadlock: Cannot make progress

        Livelock: Might not make progress (race condition)

        Starvation: Some process does/might not make progress

    The counting semaphore livelocks (no progress) if the binary
    semaphore is unfair (starvation).

    Talk about letting people off subway when full and many waiting at
    the platform.

New idea: Release q sema early and use two others to force alternation

    Remember that multiple Vs on a bin sem may not free multiple
    processes waiting on Ps.

                 binary semaphore q,t initially open
                 binary semaphore r   initially closed
                 integer NS; -- might be negative, keeps value of S

    P(S) is                             V(S) is
        P(q)                                P(q)
        NS--                                NS++
        if NS < 0                           if NS <= 0
            V(q)                                V(q)
            P(r)  <--  these two lines -->      P(t)
            V(t)  <-- force alternation -->     V(r)
        else                                else
            V(q)                                V(q)

Explain how this works.  Do some scenarios
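One way to generate scenarios is to run the code with real threads and check that at most k are ever in the SCS. This sketch uses Python's threading.Semaphore as the binary semaphores (an assumption; the notes' semaphores are busy-waiting) with P = acquire, V = release:

```python
import threading

k = 2
q = threading.Semaphore(1)   # binary semaphore, initially open
t = threading.Semaphore(1)   # binary semaphore, initially open
r = threading.Semaphore(0)   # binary semaphore, initially closed
NS = k                       # may go negative, keeps the value of S

def P_S():
    global NS
    q.acquire()              # P(q)
    NS -= 1
    if NS < 0:
        q.release()          # V(q)
        r.acquire()          # P(r)  <--  these two lines
        t.release()          # V(t)  <-- force alternation
    else:
        q.release()          # V(q)

def V_S():
    global NS
    q.acquire()              # P(q)
    NS += 1
    if NS <= 0:
        q.release()          # V(q)
        t.acquire()          # P(t)  <--  these two lines
        r.release()          # V(r)  <-- force alternation
    else:
        q.release()          # V(q)

inside = 0                   # how many threads are in the SCS right now
peak = [0]
gauge = threading.Lock()     # only for the measurement, not the algorithm

def worker():
    global inside
    for _ in range(100):
        P_S()
        with gauge:
            inside += 1
            peak[0] = max(peak[0], inside)
        with gauge:
            inside -= 1
        V_S()

ws = [threading.Thread(target=worker) for _ in range(5)]
for w in ws: w.start()
for w in ws: w.join()
print(peak[0] <= k)   # True: never more than k in the SCS
```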

"factor out" V(q) and move above if

    P(S) is                             V(S) is
        P(q)                                P(q)
        NS--                                NS++
        V(q)                                V(q)
        if NS < 0                           if NS <= 0
            P(r)  <--  these two lines -->      P(t)
            V(t)  <-- force alternation -->     V(r)

It fails!!

You need to decrement/increment NS and test atomically.

If you release the lock between testing and doing something,
the information you found during the test is only a rumor
during the something.

Sneaky code optimization (applied to the correct code) gives

    P(S) is                             V(S) is
        P(q)                                P(q)
        NS--                                NS++
        if NS < 0                           if NS <= 0
            V(q)                                V(r)
            P(r)                            else
        V(q)                                    V(q)

================ Lecture 12 ================

HOMEWORK Add in variable S to unoptimized version so that

    1.  Outside P and V S is accurate, i.e. when no process is in P or V,
        S = k - #P + #V

    2.  S >= 0 always

    Note: NS satisfies 1 but not 2.

    Remind me to go over this next class

These solutions all employ binary semaphores.  Hence if N tasks try to
execute P/V on the COUNTING semaphore, there will be a part that is
serialized, i.e. will take time at least proportional to N.


---------------- Fetch-and-add ----------------

    An NYU invention

    Reference Gottlieb, Lubachevsky, and Rudolph, TOPLAS, apr 83

    Good name (like fetch-and-or  test-and-set)

    fetch-and-add (in out X, in e) is
        oldX <-- X
        X <-- X+e
        return oldX

    Assume this is an atomic operation

        NYU built hardware that did this atomically

            Indeed could do N FAAs in the time of 1.

            Interesting work ... but this is not a hardware course

            Some hardware description in TOPLAS better in
            IEEE Trans. Comp. Feb 83

Just as test-and-set seems perfect for binary semaphores,
fetch-and-add seems perfect for counting semaphores.

NaiveP (in out S) is                V (in out S) is
    while FAA(S,-1) <= 0                FAA(S,+1)

Explain why it works.

But it fails!!  Why?

Give history (dijkstra, wilson)

P (in out S) is                     V (in out S) is
    L: while S<=0                       FAA(S,+1)
    if FAA (S,-1) <= 0
        FAA (S,+1)    -- undo the failed decrement
        goto L
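A runnable sketch of this P/V, with FAA simulated by a lock and a short sleep where real code would just spin. Note that it compensates for a failed decrement with FAA(S,+1) before retrying; without some such compensation a waiter can spin forever on a semaphore that is actually free:

```python
import threading, time

class Atomic:
    """Integer cell; a lock simulates the hardware's atomic FAA."""
    def __init__(self, value):
        self._lock = threading.Lock()
        self.value = value
    def faa(self, e):
        with self._lock:
            old = self.value
            self.value += e
            return old

def P(S):
    while True:
        while S.value <= 0:      # the "redundant" pre-test
            time.sleep(1e-4)     # yield; real code would just spin
        if S.faa(-1) > 0:        # we really got a unit
            return
        S.faa(+1)                # lost the race: give it back, retry

def V(S):
    S.faa(+1)

S = Atomic(1)                    # k = 1: a binary semaphore
count = 0                        # protected by S

def worker():
    global count
    for _ in range(500):
        P(S)
        count += 1
        V(S)

ws = [threading.Thread(target=worker) for _ in range(4)]
for w in ws: w.start()
for w in ws: w.join()
print(count)   # 2000: no lost updates
```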

If you don't like gotos and labels, you can rewrite P as

ElegantP (in out S) is
    while S<=0
    if FAA (S,-1) <= 0
        FAA (S,+1)
        ElegantP(S)

A "good" compiler will turn out the identical code for P and ElegantP

    "Good" means capable of elimnating TAIL RECURSION, but ...

        This is not a programming languages course

        This is not a compilers course

The real P looks VERY close to NaiveP, especially if you rewrite
NaiveP as

AlternateNaiveP (in out S) is
    L: if FAA (S,-1) <= 0
        goto L

So all that P adds is the while loop and the compensating FAA(S,+1)
when the decrement fails

    But the test in the while is "redundant" as the next if makes the
    same test

Explain why (real) P and V work.

    First show that the previous problem is gone.

    Mutual exclusion because any process in CS found S positive

    Progress a little harder to show

        Idea is to look at all states of the system and show that once
        you get to a state where a process has reached P, then any
        process in the CS can leave and then at least one trying
        process can get in.

        See the TOPLAS article for a proof


If you believe the claim that concurrent FAAs can be executed in
constant time (really it takes the time of one shared memory
reference, which must grow with machine size), then P/V on the
counting semaphore is a constant time operation (assuming the
semaphore is wide open).

We gave this "redundant" test a name test-decrement-retest or TDR

We can apply this TDR technique/trick to the bad soln we began with
(the one using one binary semaphore).

                Binary semaphore q initially open

P (in out S) is                 V (in out S) is
    L: while S <= 0                 P(q); S++; V(q)
    P(q)
    if S <= 0
        V(q)
        goto L
    S--
    V(q)

================ Lecture 13 ================

Midterm exam one week from today (rule is it must be returned by 13 March)

HOMEWORK Is the following code (contains an additional binary
         semaphore) also correct?  If it is correct, in what way is it
         better than the original.  Let's discuss next time.

                Binary semaphore q,r initially open

P (in out S) is                 V (in out S) is
    L: while S <= 0                 P(r); S++; V(r)
    P(q)
    if S <= 0
        V(q)
        goto L
    S--
    V(q)

Some authors, e.g. Tanenbaum, reserve the term semaphore for
context-switching (a.k.a blocking) implementations and would not call
our busy-waiting algorithms semaphores.

HOMEWORK 2-2 2-6 (note 2-6 uses TANENBAUM's def of semaphore;
         you may ignore the part about monitors).

P/V Chunk:  With a counting semaphore, one might want to reduce the
semaphore by more than one.

    If each value corresponds to one unit of a resource, this
    reduction is reserving a chunk of resources.  We call it P chunk.

    Similarly V chunk.

    Why can't you just do P chunk by doing multiple P's??

        Assume there are 3 units of the resource available and 3 tasks
        each need 2 units to proceed.  If they do P;P, you can get the
        case where no one gets what they need and none can proceed.

    PChunk(in out S, in amt) is         VChunk(in out S, in amt) is
    while S < amt                           FAA(S,+amt)
    if FAA(S,-amt) < amt
        FAA(S,+amt)     -- undo, then retry
        PChunk(S, amt)

    Let's look at the case amt=1

    PChunk1(in out S) is                VChunk1(in out S) is
    while S < 1                             FAA(S,+1)
    if FAA(S,-1) < 1
        FAA(S,+1)
        PChunk1(S)

    Since S<1 is the same as S<=0, the above is identical to ElegantP.

    There will be more about this below.
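A sketch of the chunk operations in Python, run on the 3-units/3-tasks example above: because each task grabs both units with one atomic claim (and gives them back on failure), everyone eventually gets through, where P;P could deadlock. FAA is lock-simulated and the names are mine:

```python
import threading, time

class Atomic:
    """Integer cell; a lock simulates the hardware's atomic FAA."""
    def __init__(self, value):
        self._lock = threading.Lock()
        self.value = value
    def faa(self, e):
        with self._lock:
            old = self.value
            self.value += e
            return old

def p_chunk(S, amt):
    while True:
        while S.value < amt:          # pre-test
            time.sleep(1e-4)
        if S.faa(-amt) >= amt:        # claimed amt units in one atomic step
            return
        S.faa(+amt)                   # not enough after all: give back, retry

def v_chunk(S, amt):
    S.faa(+amt)

S = Atomic(3)                         # 3 units of the resource
done = []

def task(name):
    p_chunk(S, 2)                     # grab BOTH needed units at once
    time.sleep(0.001)                 # hold them briefly
    v_chunk(S, 2)
    done.append(name)

ts = [threading.Thread(target=task, args=(i,)) for i in range(3)]
for t in ts: t.start()
for t in ts: t.join()
print(sorted(done))   # [0, 1, 2]: everyone finished, no deadlock
```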

---------------- Producer Consumer (bounded buffer) ----------------

Problem: Producers produce items and consumers consume items (what a
surprise).  We have a bounded buffer capable of holding k items and
want to use it to hold items produced but not yet consumed.

Special case k=1.  What we want is alternation of producer adding
to buffer and consumer removing.  Alternation, umm sounds familiar.

            binary semaphore q initially open
            binary semaphore r initially closed

producer is                         consumer is
    loop                                loop
        produce item                        P(r)
        P(q)                                remove item from buffer
        add item to buffer                  V(q)
        V(r)                                consume item

Note that there can be many producers and many consumers

    But the code guarantees that only one will be adding or
    removing from the buffer at a time.

    Hence can use normal code to remove from the buffer.

Now the general case for arbitrary k.

    Will need to allow some slop so that a few (up to k) producers can
    proceed before a consumer removes an item

    Will need to put a binary semaphore around the add and remove code
    (unless the code is special and can tolerate concurrency).

This is what counting semaphores are great for (semi-critical section)

         counting semaphore e initially k  -- num EMPTY slots
         counting semaphore f initially 0  -- num FULL  slots
         binary   semaphore b initially open

producer is                          consumer is
    loop                                loop
        produce item                        P(f)
        P(e)                                P(b); rem item from buf; V(b)
        P(b); add item to buf; V(b)         V(e)
        V(f)                                consume item

Normally want the buffer to be a queue.
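The general-k producer/consumer maps directly onto Python's primitives (threading.Semaphore as the counting semaphores e and f, a Lock as b, and a deque as the queue-shaped buffer); a sketch:

```python
import threading
from collections import deque

k = 4
buf = deque()                # the bounded buffer, used as a queue
e = threading.Semaphore(k)   # counting semaphore: number of EMPTY slots
f = threading.Semaphore(0)   # counting semaphore: number of FULL  slots
b = threading.Lock()         # binary semaphore around the buffer ops

def producer(items):
    for item in items:
        e.acquire()                  # P(e)
        with b:                      # P(b) ... V(b)
            buf.append(item)
        f.release()                  # V(f)

def consumer(n, out):
    for _ in range(n):
        f.acquire()                  # P(f)
        with b:
            out.append(buf.popleft())
        e.release()                  # V(e)

out = []
prods = [threading.Thread(target=producer, args=(range(i*50, (i+1)*50),))
         for i in range(2)]
cons = [threading.Thread(target=consumer, args=(50, out)) for _ in range(2)]
for t in prods + cons: t.start()
for t in prods + cons: t.join()
print(sorted(out) == list(range(100)))   # True: nothing lost or duplicated
```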

HOMEWORK  What would you do if you had two items that needed to be
consecutive on the buffer (assume the buffer is a queue)?

If we used FAA counting semaphores, there is no serial section in the
above except for the P(b).

NYU Ultracomputer Critical-section-free queue algorithms

    Implement the queue as a circular array

    FAA(tail,1) mod size gives the slot to use for insertions

    FAA(head,1) mod size gives the slot to use for deletions

    Will use I and D instead of tail and head below

    type queue is array 1 .. size of record
        natural phase    -- will be explained later
        some-type data   -- the data to store

    Since we do NOT want to have the critical section in the above
    code, we have a thorny problem to solve.

        The counting semaphores will guarantee that when an insert gets
        past P(e), there is an empty slot.  But it might NOT be at the
        slot that FAA(I,1) mod size gives you!!

        How can this be?

        If this were all we could just force alternation at each queue
        position with two semaphores at each slot.

        But Boris (Lubachevsky) also found a scenario where two
        inserts could be going after the same slot and thus you could
        ruin FIFO by having the first insert go second.

        So we have a phase at each slot.

            The first (zeroth) insert at this slot is phase 0

            The first (zeroth) delete at this slot  is phase 1.

            Insert j is phase 2*j; delete j is phase 2*j+1

            The phase for an insert is I div size

            The slot is I mod size

            Most hardware calculates both at once if doing a division
            (quotient and remainder).  If size is a power of 2 these
            are just two parts of the number (i.e. mask and shift).  I
            use a (made up) function (Div, Mod) that returns both values

              counting semaphore e initially size
              counting semaphore f initially 0

Insert is
    P(e)                                  -- wait for an empty slot
    (MyPhase, MyI) <-- (Div, Mod) (FAA(I,1), size)
    while Phase[MyI] < 2*MyPhase          -- we are here too early, wait
    data[MyI] <-- the-datum-to-insert
    FAA(Phase[MyI],1)                     -- this phase is over
    V(f)                                  -- announce a full slot

Delete is
    P(f)                                  -- wait for a full slot
    (MyPhase, MyD) <-- (Div, Mod) (FAA(D,1), size)
    while Phase[MyD] < 2*MyPhase+1        -- we are here too early, wait
    extracted-data <-- data[MyD]
    FAA(Phase[MyD],1)                     -- this phase is over
    V(e)                                  -- announce an empty slot
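A Python transcription of the algorithm (FAA simulated with a lock; Python's divmod plays the role of the made-up (Div, Mod) function; threading.Semaphore stands in for the counting semaphores e and f). The check at the end shows FIFO order surviving wrap-around past the array size:

```python
import threading, time

class Atomic:
    """Integer cell with a lock-simulated fetch-and-add."""
    def __init__(self, value=0):
        self._lock = threading.Lock()
        self.value = value
    def faa(self, e):
        with self._lock:
            old = self.value
            self.value += e
            return old

size = 4
data = [None] * size
Phase = [Atomic(0) for _ in range(size)]   # per-slot phase counter
I = Atomic(0)                              # insert ticket counter (tail)
D = Atomic(0)                              # delete ticket counter (head)
e = threading.Semaphore(size)              # counting semaphore: empty slots
f = threading.Semaphore(0)                 # counting semaphore: full slots

def insert(x):
    e.acquire()                                  # P(e)
    my_phase, my_i = divmod(I.faa(1), size)      # (Div, Mod)
    while Phase[my_i].value < 2 * my_phase:      # here too early, wait
        time.sleep(1e-5)
    data[my_i] = x
    Phase[my_i].faa(1)                           # this phase is over
    f.release()                                  # V(f)

def delete():
    f.acquire()                                  # P(f)
    my_phase, my_d = divmod(D.faa(1), size)
    while Phase[my_d].value < 2 * my_phase + 1:  # wait for the insert
        time.sleep(1e-5)
    x = data[my_d]
    Phase[my_d].faa(1)                           # this phase is over
    e.release()                                  # V(e)
    return x

for v in range(size):                            # fill the array once
    insert(v)
outs = [delete(), delete()]
insert(4); insert(5)                             # wraps around to slots 0, 1
outs += [delete() for _ in range(4)]
print(outs)   # [0, 1, 2, 3, 4, 5]: FIFO across the wrap
```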

This code went through several variations.  The current version was
discovered when I last taught this course in spring of 94, but was
never written before now.

Originally the insert and delete code were complicated and looked
quite different from each other.  At some point I found a way to make
them look very symmetric (but still complicated).  At first I was a
little proud of this discovery and rushed to show boris.  He was not
impressed; indeed he remarked "it must be so".  It was always hard to
translate his "must" so I didn't understand if he was saying that my
comment was trivial or important.  His next, quite perceptive, remark
showed that it was the former and ended the discussion.  Of course
they look the same, "Deletion is insertion of empty space.".

This queue algorithm uses size for two purposes

    The maximum size of the queue

    The maximum concurrency supported.

It would be natural for these two requirements to differ considerably
in the size required.

    A system with 100 processors each running no more than 10 active
    threads using the queue needs at most 1000 fold concurrency, but
    if the traffic generation is bursty, many more than 1000 slots
    would be desired.

One can have a (serially accessed, i.e. critical section) list
instead of a single slot associated with each MyI

One can further enhance this by implementing these serially accessed
lists as linked lists rather than arrays.  This gives the usual
advantages of linked vs sequentially allocated lists (as well as the
usual disadvantages).

This enhanced version is used in our operating system (Symunix)
written by Jan Edler.

================ Lecture 14 ================

---------------- Readers and Writers ----------------

Problem:  We have two classes of processes.

    Readers, which can execute concurrently

    Writers, which demand exclusive access

We are to permit concurrency among readers, but when a writer is
active, there are to be NO readers and NO OTHER writers active.

              integer #w range 0 .. maxint initially 0
              integer #r range 0 .. maxint initially 0
              binary semaphore S

Reader() is                     Writer() is
    loop                            loop
        P(S)                            P(S)
    exit when #w=0                  exit when #r=0 and #w=0
        V(S)                            V(S)
    #r++                            #w++
    V(S)                            V(S)
    Do-the-read                     Do-the-write
    P(S); #r--; V(S)                P(S); #w--; V(S)

Explain the idea behind the code and why it works.

(As usual) it doesn't work.

We fix it with our magic elixir (redundant test before lock)

           integer #w range 0 .. maxint initially 0
           integer #r range 0 .. maxint initially 0
           binary semaphore S

Reader() is                     Writer() is
    loop                            loop
        while #w>0                      while #r>0 or #w>0
        P(S)                            P(S)
    exit when #w=0                  exit when #r=0 and #w=0
        V(S)                            V(S)
    #r++                            #w++
    V(S)                            V(S)
    Do-the-read                     Do-the-write
    P(S); #r--; V(S)                P(S); #w--; V(S)

HOMEWORK What is the trying-part and what is the releasing-part?

HOMEWORK Show that #w only takes on values 0 and 1

Observation #1.  Since #w is only 0 and 1, we use the following code

          integer #w range 0 .. 1      initially 0
          integer #r range 0 .. maxint initially 0
          binary semaphore S

Reader() is                     Writer() is
    loop                            loop
        while #w>0                      while #r>0 or #w>0
        P(S)                            P(S)
    exit when #w=0                  exit when #r=0 and #w=0
        V(S)                            V(S)
    #r++                            #w <-- 1
    V(S)                            V(S)
    Do-the-read                     Do-the-write
    P(S); #r--; V(S)                P(S); #w <--0 ; V(S)

Observation #2.  Assigning a 1 to #w is atomic at the hardware level.
We don't need a semaphore to make it atomic.  Also the other sections
protected by P(S)/V(S) don't need to be protected from #w<--0.

          integer #w range 0 .. 1      initially 0
          integer #r range 0 .. maxint initially 0
          binary semaphore S

Reader() is                     Writer() is
    loop                            loop
        while #w>0                      while #r>0 or #w>0
        P(S)                            P(S)
    exit when #w=0                  exit when #r=0 and #w=0
        V(S)                            V(S)
    #r++                            #w <-- 1
    V(S)                            V(S)
    Do-the-read                     Do-the-write
    P(S); #r--; V(S)                #w <-- 0

Observation #3.  Since we don't grab a lock to set #w zero, the trying
readers can't prevent it.  This means we don't need the elixir for
them.  Similarly we don't need the #w part of the elixir for the
writers either.  We do need the #r part otherwise we could prevent the
decrement of #r at the bottom of the Reader code.

          integer #w range 0 .. 1      initially 0
          integer #r range 0 .. maxint initially 0
          binary semaphore S initially open

Reader() is                     Writer() is
    loop                            loop
                                        while #r>0
        P(S)                            P(S)
    exit when #w=0                  exit when #r=0 and #w=0
        V(S)                            V(S)
    #r++                            #w <-- 1
    V(S)                            V(S)
    Do-the-read                     Do-the-write
    P(S); #r--; V(S)                #w <-- 0

New approach to readers / writers.  Assume we have B units of a
resource R, where B is at least the maximum possible number of
readers (e.g. B = max # processes).  Make each reader grab one unit
of the resource and make each writer grab all B.  Thus when a writer
is active, no other process is.

So how can we grab B units all at once?


     counting semaphore S initially B

Reader() is              Writer() is
    P(S)                     PChunk(S,B)
    Do-the-read              Do-the-write
    V(S)                     VChunk(S,B)
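A sketch of this approach with lock-simulated FAA: writers must hold all B units, so a reader holding even one unit excludes writers, and a reader can check that it never sees a half-done write. (pchunk includes a give-back on a failed claim, an assumption consistent with the chunk code earlier.)

```python
import threading, time

class Atomic:
    """Integer cell; a lock simulates the hardware's atomic FAA."""
    def __init__(self, value):
        self._lock = threading.Lock()
        self.value = value
    def faa(self, e):
        with self._lock:
            old = self.value
            self.value += e
            return old

B = 8                        # at least the max number of readers
S = Atomic(B)

def p_chunk(S, amt):
    while True:
        while S.value < amt:         # pre-test
            time.sleep(1e-4)
        if S.faa(-amt) >= amt:       # claimed amt units in one atomic step
            return
        S.faa(+amt)                  # not enough after all: give back, retry

def v_chunk(S, amt):
    S.faa(+amt)

pair = [0, 0]                # writers keep these two equal
bad = []                     # readers record any half-done write they see

def reader():
    for _ in range(200):
        p_chunk(S, 1)                # one unit per reader
        if pair[0] != pair[1]:
            bad.append(tuple(pair))
        v_chunk(S, 1)

def writer():
    for _ in range(50):
        p_chunk(S, B)                # ALL B units: excludes everyone else
        pair[0] += 1
        pair[1] += 1
        v_chunk(S, B)

ws = [threading.Thread(target=reader) for _ in range(3)]
ws += [threading.Thread(target=writer) for _ in range(2)]
for w in ws: w.start()
for w in ws: w.join()
print(bad == [] and pair == [100, 100])   # True
```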

If we use the FAA implementations for P/V Chunk and run the code on
the NYU Ultracomputer Ultra III prototype, we get the property that
"during periods of no writer activity (or attempted activity), NO
critical sections are executed".

I know of no other implementation with this property.

The code above permits readers to easily starve writers.  The
following writer-priority version does not (but writers can starve
readers).

This is a generic way to take two algorithms and give one priority.
That is, the above code for readers and writers is used as a black
box.

     counting semaphore S initially B
     integer #ww initially 0    -- number of waiting writers

Reader() is              Writer() is
    while #ww>0              FAA(#ww,+1)
    P(S)                     PChunk(S,B)
    Do-the-read              Do-the-write
    V(S)                     VChunk(S,B)
                             FAA(#ww,-1)

HOMEWORK.  Write the corresponding reader-priority readers/writers

---------------- Fetch-and-increment and fetch-and-decrement ----------------

With the exception of P/V chunk (and hence readers/writers) all the
FAA algorithms used an addend of +1 or -1.

Fetch-and-increment(X) (written FAI(X)) is defined as FAA(X,+1)

Fetch-and-decrement(X) (written FAD(X)) is defined as FAA(X,-1)

It turns out that the hardware is easier for FAI/FAD than for general FAA

    FAA looks like a store and a load

        data sent to and from memory

    FAI/FAD look like a load

        data comes back from memory

        it is a special load (memory must do something)

Somewhat surprisingly readers and writers can be solved with just FAI and FAD.

     integer  #r initially 0
     integer  #w initially 0

Reader is              Writer is
    while #w > 0           while #w > 0
    FAI(#r)                if FAI(#w) > 0
    if #w > 0                  FAD(#w)
        FAD(#r)                Writer
        Reader             while #r > 0
    Do-the-read            Do-the-write
    FAD(#r)                FAD(#w)

This is proved in Freudenthal and Gottlieb, ASPLOS 1991.

================ Lecture 15 ================

---------------- Midterm Exam ----------------

================ Lecture 16 ================

return and go over midterm exam

---------------- Barriers ----------------

Many higher-level algorithms work in "phases".  All processes execute
up to a point and then they must synchronize so that all processes
know that all other processes have gotten this far.

    For example, each process might be responsible for updating a
    specific entry in a matrix NEW using all the values of matrix
    OLD.  When NEW is fully updated the roles of NEW and OLD are
    reversed.  (Explicit relaxation method for PDEs)

    while (not-done)
        update my entries of NEW using OLD
        barrier
        swap the roles of NEW and OLD

We will assume that N, the number of processes involved, is known at
compile time.

    This is a common assumption.

    To avoid it, use "group lock" (see freudenthal and gottlieb)

Idea is simple

    Each process adds one to a counter.

    Wait until counter is N

We want the barrier to be self-cleaning, i.e. the counter is restored
to zero so the next execution of the barrier can begin.

    We use two counters and clean the "other one"

P: shared integer range 0 .. 1                   {the "phase" mod 2}
C: shared array 0 .. 1 of integer range 0 .. N   {the count for this phase}
barrier is
    p <-- P
    C[1-p] <-- 0            { clean for the next phase }
    if fai(C[p]) = N-1      { last process of this phase }
        P <-- 1-p           { swap phases }
    while (p = P)           { wait for last to get here }
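The same barrier, transcribed into Python (fai is simulated with a lock, there is a short sleep in the wait loop where real code would spin, and container names like P_phase are mine):

```python
import threading, time

class Atomic:
    """Counter with a lock-simulated fetch-and-increment."""
    def __init__(self, value=0):
        self._lock = threading.Lock()
        self.value = value
    def fai(self):
        with self._lock:
            old = self.value
            self.value += 1
            return old

N = 4
P_phase = [0]                     # the notes' P: current phase, 0 or 1
C = [Atomic(0), Atomic(0)]        # the notes' C: one counter per phase

def barrier():
    p = P_phase[0]
    C[1 - p].value = 0            # clean the counter for the next phase
    if C[p].fai() == N - 1:       # last process of this phase
        P_phase[0] = 1 - p        # swap phases
    while p == P_phase[0]:        # wait for the last to get here
        time.sleep(1e-5)

arrived = [Atomic(0) for _ in range(3)]
errors = []

def worker():
    for r in range(3):
        arrived[r].fai()
        barrier()                 # after this, everyone did round r
        if arrived[r].value != N:
            errors.append(r)

ws = [threading.Thread(target=worker) for _ in range(N)]
for w in ws: w.start()
for w in ws: w.join()
print(errors)   # []: after each barrier, all N had arrived
```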

Wording bug in lab 1.  Barrier uses only fai; does not need fad.

---------------- Chapter 11--Distributed System Coordination ----------------

No shared memory so

    No semaphores

    No test-and-set

    No fetch-and-add

Basic "rules of the game"

    The relevant information is scattered around the system

    Each process makes decisions based on LOCAL information

    Try to avoid a single point of failure

    No common clock supplied by hardware

================ Lecture 17 ================

Our first goal is to deal with the last rule

    Some common time is needed (e.g. make)

    Hardware doesn't supply it.

    So software must.

For isolated uniprocessor all that really matters is that the clock advances
monotonically (never goes backward).

    This says that if one event's time is greater than another's, then
    that event really happened later.

For isolated multiple processor system need the clocks to agree so
if one processor marks one file as later than a second processor marks
a second file, the second marking really did come after the first.

Each processor has a timer that interrupts periodically.  The interrupt
is called a clock tick.

Software keeps a count of # of ticks since a known time.

    Initialized at boot time

Sadly the hardware used to generate clock ticks isn't perfect and some
run slightly faster than others.  This causes CLOCK SKEW.

Clocks that agree with each other (i.e. are consistent, see below) are
called LOGICAL CLOCKS.

If the system must interact with the real world (say a human),
it is important that the system time is at least close to the real
time.
Clocks that agree with real time are called PHYSICAL CLOCKS

Logical Clocks

    What does it mean for clocks to agree with each other?

    For logical clocks we only care that if one event happens before
    another, the first occurs at an earlier time ON THE LOCAL CLOCK.

We say a HAPPENS BEFORE b, written a-->b if one of the following holds

    1.  Events a and b are in the same process, and a occurs before b.

    2.  Event a is the sending of msg M and b is the receiving of M.

    3.  Transitivity, i.e. a-->c and c-->b implies a-->b.

We say a HAPPENS AFTER b if b happens before a.

We say a and b are CONCURRENT if neither a-->b nor b-->a


We require for logical clocks that  a-->b implies C(a) < C(b), where
C(x) is the time at which x occurs.  That is C(x) is the value of the
local clock when event x occurs.

How do we ensure that this rule (a-->b implies C(a) < C(b)) holds?
We start by using the local clock value, but need some fixups.

    Condition 1 is almost satisfied.  If a process stays on a
    processor, the clock value will never decrease.  So we need two
    fixups.

        If a process moves from one processor to another, must save
        the clock value on the first and make sure that the second is
        no earlier (indeed it should probably be at least one tick
        later).

        If two events occur rapidly (say two msg sends), make sure
        that there is at least one clock tick between them.

    Condition 2 is NOT satisfied.  The clocks on different processORs
    can definitely differ a little so a msg might arrive before it was
    sent.
    The fixup is to record in the msg the time of sending and when it
    arrives, set local clock to max (local time, msg send time + 1)

    The above is Lamport's algorithm for synchronizing logical clocks.
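A minimal sketch of Lamport's scalar clocks: each local event ticks, a send carries the sender's time, and the receive fixup sets the local clock to max(local time, msg send time + 1). (Here the receive is also counted as an event, so it ticks too; an assumption so consecutive local events stay distinct.)

```python
class LamportClock:
    """Scalar logical clock with the fixups described above."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1                    # at least one tick per event
        return self.time

    def send(self):
        return self.local_event()         # the msg carries this timestamp

    def receive(self, msg_time):
        # fixup: set local clock to max(local, msg send time + 1);
        # the receive itself also counts as an event, hence local + 1
        self.time = max(self.time + 1, msg_time + 1)
        return self.time

a, b = LamportClock(), LamportClock()
a.local_event()            # an event at a: C = 1
ts = a.send()              # the send is stamped 2
recv = b.receive(ts)       # b's clock (0) jumps past the send time: 3
print(ts < recv)           # True: C(send) < C(receive)
```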

More information on logical time can be found in Raynal and Singhal
IEEE Computer Feb 96 (we are NOT covering this).

    The above is called SCALAR TIME.

    They also have VECTOR TIME and MATRIX TIME

        In vector time each process P keeps a vector (hence the name) of
        clocks, entry i gives P's idea of the clock at Pi.

        In matrix time each process keeps a (guess) matrix (correct!)
        of clocks.  Idea is that P keeps a record of what Q thinks R's
        clock says.

================ Lecture 18 ================

Physical clocks

What is time in the physical world?

    Earth revolves about the sun once per year

    Earth rotates on its axis once per day

    1/24 of a day is an hour, etc. for minutes and seconds

    But earth "day" isn't exactly the same all the time so
    now use atomic clocks to define one second to be the time for a
    cesium 133 atom to make 9,192,631,770 "transitions"

We don't care about any of this really.  For us the right time is that
broadcast by NIST on WWV (short wave radio) or via the GOES satellites.

How do we get the times on machines to agree with NIST?

    Send a msg to NIST (or a surrogate) asking for the right time
    and change yours to that one.

        How often?

        If you know 
            1. the max drift rate D (each machine drifts D from reality)
            2. the max error you are willing to tolerate E
        you can calculate how long L you can wait between updates.

        L = E / 2D

        If only one machine drifts (other is NIST)  L = E/D

Remind me to go over this next time.

        How change the time?

        Bad to have big jumps and bad to go backwards.

        So make an adjustment in how much you add at each clock tick.

        How do you send msg and get reply in zero (or even just "known
        in advance") time?

        You can't.  See how long it took for reply to come back,
        subtract service time at NIST and divide by 2.  Add this to
        the time NIST returns.
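That adjustment as arithmetic (this is essentially Cristian's algorithm; the function name and parameters are mine):

```python
def estimate_time(t_send, t_recv, server_time, server_service=0.0):
    """See how long the reply took, subtract the server's service time,
    halve the remainder for the one-way delay, and add that to the
    time the server reported."""
    one_way = ((t_recv - t_send) - server_service) / 2
    return server_time + one_way

# request left at 100.0, reply arrived at 100.8, server spent 0.2 on it
print(estimate_time(100.0, 100.8, 500.0, 0.2))   # about 500.3
```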

What if you can't reach a machine known to be perfect (or within D)?

    Ask "around" for time

    Take average

    Broadcast result

    Eliminate outliers

    Try to contact nodes who can contact NIST or nodes who can
    contact nodes who can contact NIST.            

---------------- Mutual Exclusion ----------------

Try to do mutual exclusion w/o shared memory

Centralized approach

    Pick a process as coordinator (mutual-exclusion-server)

    To get access to CS send msg to coord and await reply.

    When leave CS send msg to coord.

    When coord gets a msg requesting CS it

        Replies if the CS is free

        Enters the requester's name into the waiting Q

    When coord gets a msg announcing departure from CS

        Removes head entry from list of waiters and replies to it

    The simplest soln and perhaps the best

Distributed soln

    When you want to get into CS 

        Send request msg to EVERYONE (except yourself)

            Include timestamp (logical clock!)

        Wait until receive OK from everyone

    When receive request

        If you are not in CS and don't want to be, say OK
        If you are in CS, put requester's name on list

        If you are not in CS but want to

            If your TS is lower, put name on list
            If your TS is higher, send OK

    When leave CS, send OK to all on your list

    Show why this works


Token Passing soln

    Form logical ring

    Pass token around ring

    When you have the token can enter CS (hold token until exit)


    Centralized is best

    Distributed of theoretical interest

    Token passing good if HW is ring based (e.g. token ring)

---------------- Election ----------------

How do you get the coord for centralized alg above?

Bully Algorithm

    Used initially and in general when any process notices that the
    process it thinks is the coord is not responding.

    Processes are numbered (e.g. by their addresses or names or some
    other unique id).
    The algorithm below determines the highest numbered process, which
    then becomes the elected member.

    Called the bully alg because the biggest wins.

    When a process wants to hold an election,

        It sends an election msg to all higher numbered processes

        If any respond, orig process gives up

        If none respond, it has won and announces election result to
        all processors (called coordinator msg in book)

    When receive an election msg

        Send OK

        Start election yourself

    Show how this works

    When a process comes back up, it starts an election.
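A toy model of the bully algorithm that collapses the message passing into set membership (a process "responds" iff it is in the alive set; names are mine). The winner is always the largest live id, which is the point of the algorithm:

```python
def bully_election(alive, starter):
    """Simplified synchronous sketch: the starter 'sends' ELECTION to
    every higher-numbered process; any live one answers OK and holds
    its own election; if nobody higher answers, the starter wins and
    would announce itself coordinator."""
    responders = [p for p in alive if p > starter]
    if not responders:
        return starter          # no one bigger is up: I win
    # some responder takes over; the final winner is the same
    # whichever responder we follow, so follow the smallest
    return bully_election(alive, min(responders))

print(bully_election({2, 3, 5, 8}, 3))   # 8: the biggest bullies its way in
print(bully_election({4}, 4))            # 4: alone, you win by default
```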


Ring algorithm (again biggest number wins)

    Form a (logical) ring

    When a proc wants a new leader

        Send election msg to next proc (in ring) include its proc # in msg

        If msg not ack'ed, send to next in ring

        When msg returns

            Election done

            Send result msg around the ring

    When an election msg arrives

        Add your # to msg and send to next proc

            Only necessary if you are bigger than all there (I think)

    If concurrent elections occur, same result for all


================ Lecture 19 ================

---------------- Atomic Transactions ----------------

This is a BIG topic; worthy of a course (databases)

Stable storage: Able to survive "almost" anything

    For example can survive a (single) media failure

Transactions:    ACID


    Atomic

        To the outside world it happens all at once.

        Its effects are never visible while in progress

        Can fail (abort) then none of its effects are ever visible


    Consistent

        Does not violate any system invariants when completed

            Intermediate states are NOT visible (cf. Atomic)

        System dependent


    Isolated

        Normally called serializable

        When multiple transactions are run and complete the effect is
        as if they were run one after the other in some (unspecified)
        order, i.e. you don't get a little of one and then a little of
        another.


    Durable

        Often called permanence

        Once a transaction commits, its effects are permanent

        Stable storage often used


Nested Transactions
    May want to have the child transactions run concurrently.

    If a child commits, its effects are now available WITHIN the
    parent transaction

        So sibling transactions serialized AFTER the child see the
        effects

    But if a child commits and then the parent (subsequently) aborts
    the effects of the child are NOT visible outside.

---------------- The following is by Eric Freudenthal ----------------

Concurrency Control

    The synchronization primitives required to build transactions


    Locking

        Use mutex, explore granularity issues.

        Two phase locking

            acquire all locks

            do transaction

            release locks

            Quit if a lock is unavailable.

        This scheme works well when contention is low, but may lead to
        livelock or deadlock when it is high.

        In-order lock acquisition avoids live/deadlock.
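A minimal sketch of the scheme above (locks are modeled as names in a shared set rather than real mutexes, and `run_transaction` is an invented helper, not an API from the text):

```python
# Two-phase scheme: acquire every lock in a fixed global order (which
# avoids deadlock), do the transaction, then release.  If any lock is
# already held, quit immediately; the caller may retry.
def run_transaction(needed, held, body):
    """needed: lock names this transaction wants; held: set of locks
    currently held (by anyone); body: the actual work."""
    acquired = []
    for lock in sorted(needed):      # phase 1: in-order acquisition
        if lock in held:             # unavailable: quit (abort)
            for l in acquired:
                held.discard(l)
            return False
        held.add(lock)
        acquired.append(lock)
    body()                           # do the transaction
    for l in acquired:               # phase 2: release everything
        held.discard(l)
    return True

held = {"y"}
print(run_transaction({"x", "y"}, held, lambda: None))   # False: aborts
print(run_transaction({"x"}, held, lambda: None))        # True
```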


    Optimistic concurrency control

        Resembles two-phase commitment - don't save files until all
        mods made.  May need to abort & retry.

            indirection makes this faster (copy & mod reference
            structure rather than the data itself)

        Good parallelism when it works.

        Deadlock free.

    Timestamp consistency

        Data (files) marked with the times of the last transactions
        that read & wrote them.
        Transactions fail if they try to set times inconsistently.

Preventing Deadlock (while providing synchronization as a building
block)

    Both communication channel & resource deadlocks are possible.

        Communication channel deadlock can be caused by running out of
        buffers.

            Important to get this right, for it's hard to negotiate 
            if the communication channels are blocked.

    What to do

        ostrich - works most of the time if contention is low

        detect - easier

        prevent - make structurally impossible

        avoid - carefully write program so it can't happen

    Detecting deadlock

        centralized manager

            risk of outdated information - false deadlock

                fix with Lamport's time stamps & interrogation

================ Lecture 20 ================

Lecture by Eric Freudenthal

Not in text format but it is available on the web as lecture-20
in the list of individual lectures


================ Lecture 21 ================

System Model (i.e. what to buy and configure)

    We look at three models

        Workstations (zero cost soln)

        Clusters (aka NOW aka COW, aka LAMP, aka Beowulf, aka pool)

        Hybrid--likely "winner"

Workstation model

    Connect workstations in department via LAN

    Includes personal workstations and public ones

    Often have file servers

    The workstations can be diskless

        Tannenbaum seems to like this

        Most users don't

        Not so popular anymore (disks are cheap)

        Maintenance is easy

        Must have some startup code in rom

    If have disk on workstation can use it for

        1. Paging and temp files

        2. 1 + (some) system executables

        3. 2 + file caching

        4. full file system

    Case 1 is often called dataless

        Just as easy to (software) maintain as diskless

        Still need startup code in rom

        Serious reduction in load on network and file servers

    Case 2

        Reduces load more and speeds up program start time

        Adds maintenance since new releases of programs must be
        loaded onto the workstations

HOMEWORK 12-7, 12-8

    Case 3

        Can have very few executables permanently on the disk

        Must keep the caches consistent

            Not trivial for data files with multiple writers

                This issue comes up for NFS as well and is discussed
                (much) later

            Should you cache whole files or blocks?

    Case 4

        You can work if just your machine is up

        Lose location transparency

        Most maintenance

    Using Idle workstations

        Early systems did this manually via rsh

            Still used today

        Newer systems like Condor (Univ Wisconsin) try to automate this

            How find idle workstations?

                Idle = no mouse or keybd activity and low load avg

                Workstation can announce it is idle and this is
                recorded by all

                Job looking for machine can inquire

                Must worry about race conditions


                Some jobs want a bunch of machines so look for many
                idle machines

                Can also have centralized soln, processor server

                    Usual tradeoffs

            What about local environment?

                Files on servers are no problem

                Requests for local files must be sent home

                    ... but not needed for tmp files

                Syscalls for mem or proc mgt probably need to be
                executed on the remote machine

                Time is a bit of a mess unless have time synchronized
                by a system like ntp

                If program is interactive, must deal with devices

                    mouse, keybd, display


            What if machine becomes non-idle (i.e. owner returns)?

                Detect presence of user.

                Kill off the guest processes.

                    Helpful if made checkpoints (or ran short jobs)

                Erase files, etc.

                Some NYU research unifies this entire evacuation
                procedure with other failures (transaction oriented).

                Could try to migrate the guest processes to other
                hosts but this must be very fast or the owner will
                object (at least I would).

                Goal is to make owner not be aware of your presence.

                    May not be possible since you may have paged out
                    his basic environment (shell, editor, X server,
                    window manager) that s/he left running when s/he
                    stopped using the machine.

================ Lecture 22 ================

Clusters (pools, etc)

    Bunch of workstations without displays in machine room connected
    by a network.

    Quite popular now.

    Indeed some clusters are packaged by their manufacturer into a
    serious compute engine.

        IBM SP2 sold $1B in 1997.

            VERY fast network

    Used to solve large problems using many processors at one time

        Pluses of large time sharing system vs small individual
        machines

        Also the minus of timesharing

        Can use easy queuing theory to show that large fast server
        better in some cases than many slower personal machines
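The queueing-theory argument can be made concrete with the M/M/1 mean response time formula T = 1/(mu - lambda); the arrival and service rates below are made-up numbers, and the pooled-server model is the standard textbook comparison, not something the text works out:

```python
# Compare n slow personal machines, each with its own arrival stream,
# against one shared server n times faster that receives all n streams.
def mm1_response(mu, lam):
    assert lam < mu, "queue must be stable"
    return 1.0 / (mu - lam)          # M/M/1 mean response time

n, mu, lam = 10, 1.0, 0.5                   # 10 machines, 50% utilized
t_personal = mm1_response(mu, lam)          # each slow machine: 2.0 sec
t_shared = mm1_response(n * mu, n * lam)    # pooled fast server: 0.2 sec
print(t_personal, t_shared)   # the shared server is n times better
```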

    Tannenbaum suggests using X-terminals to access the cluster, but
    X-terminals haven't caught on.

        Personal workstations don't cost much more


Hybrid model

    Each user has a workstation and uses the pool for big jobs

    Tannenbaum calls this a possible compromise.

        It is the dominant model for cluster based machines.

            X-terminals haven't caught on

            The cheapest workstations are already serious enough for
            most interactive work, freeing the cluster for serious
            computing
---------------- Processor Allocation ----------------

Decide which processes should run on which processors

Could also be process allocation

Assume any process can run on any processor

    Often the only difference between different processors is

        CPU speed

        CPU Speed and max Memory

    What if the processors are not homogeneous?
        Assume have binaries for all the different architectures.

    What if not all machines are directly connected    

        Send process via intermediate machines

    If all else fails view system as multiple subsystems

        If have only alpha binaries, restrict to alphas

        If need machines very close for fast comm, restrict to a group
        of close machines.

Can you move a running process or are processor allocations done a
process creation time?

    Migratory allocation algorithms vs nonmigratory

What is the figure of merit, i.e. what do we want to optimize?

    Similar to CPU scheduling in OS 1.

    Minimize response time

        We are NOT assuming all machines equally fast.

            Consider two processes: P1 executes 100 million
            instructions, P2 executes 10 million instructions.

            Both processes enter system at t=0

            Consider two machines A executes 100 MIPS, B 10 MIPS

            If run P1 on A and P2 on B each takes 1 second so avg
            response time is 1 sec.

            If run P1 on B and P2 on A, P1 takes 10 seconds P2 .1 sec
            so avg response time is 5.05 sec

            If run P2 then P1 both on A finish at times .1 and 1.1 so
            avg resp time is .6 seconds!!

        Do not assume machines are ready to run new jobs, i.e. there
        can be backlogs.

    Minimize response ratio.

        Response ratio is the time to run on some machine divided by
        time to run on a standardized (benchmark) machine, assuming
        the benchmark machine is unloaded.

        This takes into account the fact that long jobs should be
        expected to take longer.

        Do the P1 P2 A B example with response ratios
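The arithmetic of the P1/P2 example, plus the response-ratio variant, can be checked in a few lines. Choosing machine A (100 MIPS, unloaded) as the benchmark machine is my assumption; the text leaves the benchmark unspecified:

```python
MIPS = {"A": 100, "B": 10}
INSTR = {"P1": 100, "P2": 10}     # millions of instructions

def runtime(proc, mach):          # seconds = Minstr / MIPS
    return INSTR[proc] / MIPS[mach]

# The three allocations from the example (both processes arrive at t=0)
t1 = (runtime("P1", "A") + runtime("P2", "B")) / 2   # = 1.0 sec
t2 = (runtime("P1", "B") + runtime("P2", "A")) / 2   # = 5.05 sec
# P2 then P1, both on A: P2 done at .1, P1 done at .1 + 1.0 = 1.1
t3 = (runtime("P2", "A") + runtime("P2", "A") + runtime("P1", "A")) / 2  # = .6

# response ratio = achieved time / time on the unloaded benchmark machine
bench = {p: INSTR[p] / MIPS["A"] for p in INSTR}     # P1: 1 sec, P2: .1 sec
ratio_P1_on_B = runtime("P1", "B") / bench["P1"]     # 10.0
print(t1, t2, t3, ratio_P1_on_B)
```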


    Maximize CPU utilization

        NOT my favorite figure of merit.


    Maximize throughput

        Jobs per hour

        Weighted jobs per hour

            If weighting is CPU time, get CPU utilization

            This is the way to justify CPU utilization (user centric)

Design issues

    Deterministic vs Heuristic

        Use deterministic for embedded applications, when all
        requirements are known a priori

            Patient monitoring in hospital

            Nuclear reactor monitoring

    Centralized vs distributed

        Usual tradeoff of accuracy vs fault tolerance and bottlenecks

    Optimal vs best effort

        Optimal normally requires off line processing.

        Similar requirements as for deterministic.

        Usual tradeoff of system effort vs result quality

    Transfer policy

        Does a process decide to shed jobs just based on its own load
        or does it have (and use) knowledge of other loads?

        Called local vs global algs by Tannenbaum

        Usual tradeoff of system effort (gather data) vs res quality

    Location policy
        Sender vs receiver initiated 

            Look for help vs look for work

            Both are done

            Tannenbaum asserts that clearly the decision can't be
            local (else might send to even higher loaded machine)

                NOT clear

                The overloaded recipient will then send it again

                    Tradeoff of #sends vs effort to decide

                    Use random destination so will tend to spread load

                    "Better" might be to send probes first, but the
                    tradeoff is cost of msg (one more for probe in
                    normal case) vs cost of bigger msg (bouncing a job
                    around instead of tiny probes) vs likelihood of
                    overload at target.

================ Lecture 23 ================

Implementation issues

    Determining local load

        NOT difficult (despite what Tannenbaum says)

            The case he mentions is possible but not dominant

        Do not ask for "perfect" information (not clear what that
        would even mean)

            Normally use a weighted mean of recent loads with more
            recent weighted higher

            Normally uniprocessor stuff
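One common realization of such a weighted mean is an exponential moving average; the smoothing constant alpha here is an arbitrary choice, not something the text specifies:

```python
# Exponentially weighted mean of recent load samples: each newer sample
# gets weight alpha, and older samples decay geometrically.
def smoothed_load(samples, alpha=0.5):
    """samples: recent load averages, oldest first."""
    est = samples[0]
    for s in samples[1:]:
        est = alpha * s + (1 - alpha) * est   # newer samples weigh more
    return est

print(smoothed_load([0.0, 0.0, 4.0]))   # 2.0: the recent spike dominates
```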

    Do not forget that complexity (i.e. difficulty) is bad

        Of course

    Do not forget the cost of getting additional information needed
    for a better decision.

        Of course

Example algorithms

    Min cut deterministic algorithm

        Define a graph with processes as nodes and IPC traffic as arcs

        Goal: Cut the graph (i.e. some arcs) into pieces so that

            All nodes in one piece can be run on one processor

                Mem constraints

                Processor completion times

            Values on cut arcs minimized

                Min the max

                    min max traffic for a process pair

                min the sum

                    min total traffic (Tannenbaum's def)

                min the sum to/from a piece

                    don't overload a processor

                min the sum between pieces

                    min traffic for processor pair

        Tends to get hard as you get more realistic

            Want more than just processor constraints

                Figures of merit discussed above

            The same traffic between different processor pairs does
            not cost the same


    Up-down centralized algorithm

        Centralized table that keeps "usage" data for a USER, the users
        are defined to be the workstation owners.  Call this the score
        for the user.

        Goal is to give each user a fair share.

        When workstation requests a remote job, if wkstation avail it
        is assigned

        For each process user has running remotely, the user's score
        increases a fixed amt each time interval.

        When a user has an unsatisfied request pending (and none being
        satisfied), the score decreases (can go negative).

        If no requests pending and none being satisfied, score bumped
        towards zero.

        When a processor becomes free, assign it to a requesting user
        with lowest score.
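The up-down bookkeeping can be sketched as follows (the class name, the per-interval constants, and the tick interface are all invented for illustration):

```python
# Sketch of the up-down scoring described above.
class UpDown:
    def __init__(self, users):
        self.score = {u: 0 for u in users}   # usage score per user

    def tick(self, running, pending):
        """One time interval.  running/pending: per-user counts of
        remote processes running / unsatisfied requests."""
        for u in self.score:
            if running.get(u, 0) > 0:
                self.score[u] += running[u]  # charged per remote process
            elif pending.get(u, 0) > 0:
                self.score[u] -= 1           # waiting: score drops
            elif self.score[u] > 0:          # idle: bump toward zero
                self.score[u] -= 1
            elif self.score[u] < 0:
                self.score[u] += 1

    def next_user(self, requesters):
        # a freed processor goes to the requesting user with lowest score
        return min(requesters, key=lambda u: self.score[u])

table = UpDown(["ann", "bob"])
table.tick(running={"ann": 2}, pending={"bob": 1})
print(table.next_user(["ann", "bob"]))   # bob (score -1 beats ann's +2)
```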

    Hierarchical algorithm

        This was used in a system where each processor could run at
        most one process.  But we describe a more general situation.

        Quick idea of algorithm

            Processors in tree

            Requests go up tree until subtree has enough resources

            Request is split and parts go back down tree

        Arrange processors in a hierarchy (tree)

            This is a logical tree independent of how the machines are
            physically connected
        Each node keeps (imperfect) track of how many available
        processors are below it.

            If a processor can run more than one process, must be more
            sophisticated and must keep track of how many processes
            can be allocated (without overload) in the subtree below.

        If a new request appears in the tree, the current node sees if
        it can be satisfied by the processors below (plus itself).

            If so, do it.

            If not pass the request up the tree

            Actually since machines may be down or the data on
            availability may be out of date, you actually try to find
            more processors than requested

        Once a request has gone high enough to be satisfied, the
        current node splits request into pieces and sends each piece
        to appropriate child.
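The request flow can be sketched as follows. The `Node`/`request` names and the policy of taking a node's own processors before its children's are my own simplifications of the algorithm described above:

```python
# Each tree node tracks how many processors its subtree can supply; a
# request climbs the tree until some subtree is big enough, then is
# split into pieces sent down to the children.
class Node:
    def __init__(self, own=0, children=()):
        self.own = own                      # processors at this node
        self.children = list(children)

    def total(self):
        # (imperfect in real life) count of processors in this subtree
        return self.own + sum(c.total() for c in self.children)

    def grant(self, want):
        """Carve `want` processors out of this subtree; the caller has
        already checked total() >= want.  Returns the number granted."""
        take = min(self.own, want)
        self.own -= take
        for c in self.children:
            if take == want:
                break
            extra = min(c.total(), want - take)   # split the request
            c.grant(extra)                        # piece goes to child
            take += extra
        return take

def request(path_to_root, want):
    # the request climbs the tree until some subtree suffices
    for node in path_to_root:
        if node.total() >= want:
            return node.grant(want)
    return 0            # even the root can't satisfy it

leaf = Node(own=2)
middle = Node(own=1, children=[leaf, Node(own=3)])
root = Node(own=0, children=[middle, Node(own=8)])
print(request([leaf, middle, root], 5))   # 5 (satisfied at the middle node)
```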

        What if a node dies?

            Promote one of its children say C

            Now C's children are peers with the previous peers of C

                If this is considered too unbalanced, can promote one
                of C's children to take C's place.

        How decide which child C to promote?

            Peers of dead node have an election

            Children of dead node have an election

            Parent of dead node decides

        What if root dies?

            Must use children since no peers or parent

            If want to use peers, then do not have a single root

                I.e. the top level of the hierarchy is a collection of
                roots that communicate

                This is a forest, not a tree

        What if multiple requests are generated simultaneously?

            Gets hard fast as information gets stale and potential
            race conditions and deadlocks are possible.

            See Van Tilborg and Wittie

    Distributed heuristic algorithm

        Send probe to random machine

        If remote load is low, ship job

        If remote load is high, try another random probe

        After k (parameter of implementation) probes all say load is
        too high, give up and run job locally.

        Modelled analytically and seen to work fairly well
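The probe loop is easy to state precisely. As the text says, k and the load threshold are parameters of the implementation; the values below are arbitrary, and `place_job` is an invented name:

```python
import random

# Probe up to k random machines; ship the job to the first one
# reporting load below the threshold, else run it locally.
def place_job(my_id, load, k=3, threshold=2, rng=random):
    """load: dict mapping machine id -> current load."""
    candidates = [m for m in load if m != my_id]
    for _ in range(k):
        target = rng.choice(candidates)   # probe one machine at random
        if load[target] < threshold:
            return target                 # low load there: ship the job
    return my_id                          # every probe said "too busy"

load = {"a": 5, "b": 5, "c": 0}
print(place_job("a", load, rng=random.Random(1)))  # "c" if probed, else "a"
```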


    Bidding

        View the dist system as a collection of competing service
        providers and processes (users?) as potential customers.
        Customers bid for service and computers pick the highest bid.

        This is just a sketch.  MUCH needs to be filled in.


---------------- Scheduling ----------------

    General goal is to have processes that communicate frequently run
    simultaneously

    If not and use busy waiting for msgs will have a huge disaster.

    Even if use context switching, may have a small disaster as only
    one msg transfer can occur per time scheduling slot

    Co-scheduling (aka gang scheduling).  Processes belonging to a job
    are scheduled together

        Time slots are coordinated among the processors.

        Some slots are for gangs; other slots are for regular
        processes

HOMEWORK 12-16 (use figure 12-27b)

    Tannenbaum asserts "there is not much to say about scheduling in a
    distributed system"


        But there is at least the concept of job scheduling

        In many big systems (e.g. SP2, T3E) run job scheduler

            User gives max time (say wall time) and number of
            processors needed

            System can use this information to figure out the latest
            time a job can run (i.e. assumes all preceding jobs use
            all their time).

            If a job finishes early and the following job needs more
            processors than avail, look for subsequent jobs that

                Don't need more processors than available

                Won't run past the time the scheduler promised to the
                following job.

            This is called backfilling
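Backfilling can be sketched as a single pass over the waiting queue. This is a simplification (real schedulers recompute reservations as jobs finish), but it shows the two conditions described above:

```python
# Start any waiting job that both fits the free processors AND will
# finish before the promised start time of the job at the queue head.
def backfill(free, head_need, head_start, waiting, now=0):
    """waiting: (name, processors, max_time) jobs behind the head job,
    which needs head_need processors and was promised time head_start."""
    assert head_need > free          # else the head job would just run
    started = []
    for name, procs, max_time in waiting:
        fits = procs <= free
        done_in_time = now + max_time <= head_start
        if fits and done_in_time:    # won't delay the promised start
            started.append(name)
            free -= procs
    return started

# 4 processors free; the head job needs 8 and was promised a start at t=10
print(backfill(4, 8, 10, [("j1", 2, 5), ("j2", 4, 20), ("j3", 2, 8)]))
# ['j1', 'j3']
```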

================ Lecture 24 ================

---------------- Chapter 13--Distributed File Systems ----------------

File service vs file server

    File service is the specification

    File server is a process running on a machine to implement the
    file service for (some) files on that machine

    In a normal distributed system you would have one file service
    but perhaps many file servers

        If have very different kinds of filesystems might not be able
        to have a single file service as perhaps some services are not
        available on all of them
File Server Design


    File structure

        Sequence of bytes

        Sequence of Records

            We do not cover these filesystems.  They are often
            discussed in database courses

    File attributes

        rwx perhaps a (append)

            This is really a subset of what is called
            ACL -- access control list

            Get ACLs and Capabilities by reading columns and rows of
            the access matrix

        owner, group, various dates, size

        dump, autocompress, immutable

    Upload/download  vs  remote access

        Upload/download means only file services supplied are read
        file and write file.

            All mods done on local copy of file

            Conceptually simple at first glance

            Whole file transfers are efficient (assuming you are going
            to access most of the file) when compared to multiple
            small accesses

            Not efficient use of bandwidth if you access only small
            part of large file.

            Requires storage on client

            What about concurrent updates?

                What if one client reads and "forgets" to write for a
                long time and then writes back the "new" version
                overwriting newer changes from others?

        Remote access means direct individual reads and writes to the
        remote copy of the file

            File stays on server

            Issue of (client) buffering

                Good to reduce number of remote accesses.

                What about semantics when a write occurs?

                    Note that meta-data is written even for a read, so
                    if you want faithful semantics, every client read
                    must mod metadata on the server, or all requests
                    for metadata (e.g. ls or dir commands) must go to
                    the server
                Cache consistency question


    Directories

        Mapping from names to files/directories

        Contains rules for names of files and (sub)directories

        Hierarchy i.e. tree

        (hard) links

            gives another name to an existing file

            a new directory entry

            The old and new name have equal status

                cd ~
                mkdir dir1
                touch dir1/file1
                ln dir1/file1 file2

                Now ~/file2 is the SAME file as ~/dir1/file1

                    In unix-speak they have the same inode

                Need to do rm twice to actually delete the file

            The owner is NOT changed so

                cd ~
                ln ~joe/file1 file2

            Gives me a link to a file of joe.  Presumably joe set his
            permissions so I can't write it.

            Now joe does

                rm ~/file1

            But my file2 still exists and is owned by joe.  Most
            accounting programs would charge the file to joe (who
            doesn't know it exists).

            With hard links the filesystem becomes a DAG instead of a
            simple tree.


        Symbolic links

            Symbolic (NOT symmetric).  Indeed asymmetric


                cd ~
                mkdir dir1
                touch dir1/file1
                ln -s dir1/file1 file2

            file2 has a new inode; it is a new type of file called a
            symlink and its "contents" are the name of the target file

            When accessed file2 returns the contents of file1, but it
            is NOT equal to file1.

                If file1 is deleted, file2 "exists" but is invalid

                If a new file1 is created, file2 now points to it.

            Symlinks can point to directories as well

            With symlinks pointing to directories, the filesystem
            becomes a general graph, i.e. directed cycles are
            permitted.
================ Lecture 25 ================

        Imagine hard links pointing to directories
        (unix does NOT permit this).

            cd ~
            mkdir B;   mkdir C
            mkdir B/D; mkdir B/E
            ln B B/D/oh-my

        Now you have a loop with honest looking links.

        Normally can't remove a directory (i.e. unlink it from its
        parent) unless it is empty.

        But when can have multiple hard links to a directory, should
        permit removing (i.e. unlinking) one even if the directory is
        not empty.

        So in above example could unlink B from your home directory.

        Now you have garbage (unreachable, i.e. unnamable) directories
        B, D, and E.

        For a centralized system need a conventional garbage
        collector.

        For distributed system need a distributed garbage collector,
        which is much harder.


    Naming

        Location transparency

            Path name (i.e. full name of file) does NOT say where the
            file is located.

                On our ultra setup, we have filesystems /a /b /c and
                others exported and remote mounted.  When we moved /e
                from machine allan to machine decstation very little
                had to change on other machines (just the file
                /etc/fstab).  More importantly, programs running
                everywhere could still refer to /e/xy

                But this was just because we did it that way.  I.e.,
                we could have mounted the same filesystem as /e on one
                machine and /xyz on another.

        Location Independence

            Path name is independent of the server.  Hence can move a
            file from server to server without changing its name.

            Have a namespace of files and then have some (dynamically)
            assigned to certain servers.  This namespace would be the
            same on all machines in the system.

            Not sure if any systems do this.

        Root transparency

            made up name

            / is the same on all systems

            Would ruin some conventions like /tmp


        Naming schemes

            Machine + path naming



            Mounting remote filesystem onto local hierarchy

                When done intelligently get location transparency

            Single namespace looking the same on all machines

    Two level naming

        Said above that a directory is a mapping from names to files
        (and subdirectories).

        More formally, the directory maps the user name
        /home/gottlieb/course/os/class-notes.html to the OS name for
        that file 143428 (the unix inode number).

        These two names are sometimes called the symbolic and binary
        names.
        For some systems the binary names are available.

            allan$ ls -i course/os/class-notes.html
            143428 course/os/class-notes.html

        The binary name could contain the server name so that could
        directly reference files on other filesystems/machines

            Unix doesn't do this

        Could have symbolic links contain the server name

            Unix doesn't do this either

            I believe that vms did something like this.  Symbolic name
            was something like nodename::filename

                It has been a while since I used VMS so I may have
                this wrong.

        Could have the name lookup yield MULTIPLE binary names.

            Redundant storage of files for availability

            Naturally must worry about updates

                When visible?

                Concurrent updates?

                WHENEVER you hear of a system that keeps multiple
                copies of something, an immediate question should be
                "are these immutable?".  If the answer is no, the next
                question is "what are the update semantics?"


    Sharing semantics

        Unix semantics -- A read returns the value stored by the last
        write.

            Probably unix doesn't quite do this.

                If a write is large (several blocks) do seeks for each
                block

                During a seek, the process sleeps (in the kernel)

                Another process can be writing a range of blocks that
                intersects the blocks for the first write.

                The result could be (depending on disk scheduling)
                that the result does not have a last write.

            Perhaps Unix semantics means -- A read returns the value
            stored by the last write providing one exists.

            Perhaps Unix semantics means -- A write syscall should be
            thought of as a sequence of write-block syscalls and
            similar for reads.  A read-block syscall returns the value
            of the last write-block syscall for that block

        Easy to get this same semantics for systems with file servers

            No client side copies (Upload/download)

            No client side caching

        Session semantics

            Changes to an open file are visible only to the process
            (machine???) that issued the open.  When the file is
            closed the changes become visible to all

            If using client caching CANNOT flush dirty blocks until
            close.  What if you run out of buffer space?

            Messes up file-pointer semantics

                The file pointer is shared across fork so all children
                of a parent share it.

                But if the children run on another machine with
                session semantics, the file pointer can't be shared
                since the other machine does not see the effect of the
                writes done by the parent.

HOMEWORK 13-2, 13-4

        Immutable files

            Then there is "no problem"

            Fine if you don't want to change anything

            Can have "version numbers"

                Book says old version becomes inaccessible (at least
                under the current name)

                With version numbers if use name without number get
                highest numbered version so would have what book says.

                But really you do have the old (full) name accessible

                VMS definitely did this

            Note that directories are still mutable

                Otherwise no create-file is possible



        Transactions

            Clean semantics

            Using transactions in OS is becoming more widely studied

Distributed File System Implementation

    File Usage characteristics

        Measured under unix at a university

        Not obvious same results would hold in a different environment


            1. Most files are small (< 10K)
            2. Reading dominates writing
            3. Sequential accesses dominate
            4. Most files have a short lifetime
            5. Sharing is unusual
            6. Most processes use few files
            7. File classes with different properties exist

        Some conclusions

            1 suggests whole-file transfer may be worthwhile (except
            for really big files).

            2+5 suggest client caching and dealing with multiple
            writers somehow, even if the latter is slow (since it is
            rare).
            4 suggests doing creates on the client

                Not so clear.  Possibly the short lifetime files are
                temporaries that are created in /tmp or /usr/tmp or
                /somethingorother/tmp.  These would not be on the
                server anyway.

            7 suggests having multiple mechanisms for the several classes.

================ Lecture 26 ================


    Implementation choices

        Servers & clients together?

            Common unix+nfs: any machine can be a server and/or a
            client
            Separate modules:  Servers for files and directories are
            user programs so can configure some machines to offer
            the services and others not to

            Fundamentally different:  Either the hardware or software
            is fundamentally different for clients and servers.


                In unix some server code is in the kernel but other
                code is a user program (run as root) called nfsd

        File and directory servers together?

            If yes, less communication

            If no, more modular "cleaner"

        Looking up a/b/c when a, a/b, and a/b/c are on different servers

            Natural soln is for server-a to return name of server-a/b.
            Then client contacts server-a/b, gets name of server-a/b/c,
            and finally contacts server-a/b/c.

            Alternatively server-a forwards request to server-a/b who
            forwards to server-a/b/c.

            Natural method takes 6 communications (3 RPCs)

            Alternative is 4 communications but is not RPC

        Name caching

            The translation from a/b/c to the inode (i.e. symbolic to
            binary name) is expensive even for a centralized system.

            Called namei in unix and was once measured to be a
            significant percentage of all kernel activity.

            Later unices added "namei caching"

            Potentially an even greater time saver for dist systems
            since communication is expensive.

            Must worry about obsolete entries.

        Stateless vs Stateful

            Should the server keep information BETWEEN requests from a
            user, i.e. should the server maintain state?

            What state?

                Recall that the open returns an integer called a file
                descriptor that is subsequently used in read/write.

                With a stateless server, the read/write must be self
                contained, i.e. cannot refer to the file descriptor.


            Advantages of stateless

                Fault tolerant--No state to be lost in a crash

                No open/close needed (saves messages)

                No space used for tables (state requires storage)

                No limit on number of open files (no tables to fill up)

                No problem if client crashes (no state to be confused)

            Advantages of stateful

                Shorter read/write (descriptor shorter than name)

                Better performance

                    Since keep track of what files are open, know to
                    keep those inodes in memory

                    But stateless could keep a memory cache of inodes
                    as well (evict via LRU instead of on close, not as
                    accurate)

                    Blocks can be read in advance (read ahead)

                        Of course stateless can read ahead.

                        Difference is that with stateful can better
                        decide when accesses are sequential.

                    Idempotency easier (keep seq numbers)

                    File locking possible (the lock is state)

                        Stateless can write a lock file by convention.

                        Stateless can call a lock server
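
The contrast can be made concrete by sketching the two request shapes (all field names are made up for illustration):

```python
# Sketch: what a read request must carry under each design.

def stateful_read(fd, nbytes):
    # Server looks up fd in its open-file table (state) to find the
    # file and the current offset; the request itself can be short.
    return {"op": "read", "fd": fd, "nbytes": nbytes}

def stateless_read(path, offset, nbytes):
    # Self-contained: names the file and the offset explicitly, so
    # the server needs no table between requests.
    return {"op": "read", "path": path, "offset": offset, "nbytes": nbytes}

assert "offset" in stateless_read("/a/b/c", 4096, 1024)
assert "offset" not in stateful_read(3, 1024)
```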

HOMEWORK 13-6, 9-5, 9-7


        Caching

            There are four places to store a file supplied by a file
            server (these are NOT mutually exclusive)

                Server's disk

                    essentially always done

                Server's main memory

                    normally done

                    Standard buffer cache

                    Clear performance gain

                    Little if any semantics problems

                Client's main memory

                    Considerable performance gain

                    Considerable semantic considerations

                    The one we will study

                Clients disk

                    Not so common now

            Unit of caching

                File vs block

                Tradeoff of fewer accesses vs storage efficiency

            What eviction algorithm?

                Exact LRU feasible because can afford the time to do it
                (via linked lists) since access rate is low
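
A sketch of exact LRU, using Python's OrderedDict in place of the linked list (block ids and the fetch callback are illustrative):

```python
# Exact LRU is affordable here because cache operations happen at
# file-access rate, not memory-reference rate.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # ordered oldest-first

    def access(self, block_id, fetch):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)    # now most recently used
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict the true LRU block
            self.blocks[block_id] = fetch(block_id)
        return self.blocks[block_id]

cache = LRUCache(2)
cache.access("A", lambda b: b.lower())
cache.access("B", lambda b: b.lower())
cache.access("A", lambda b: b.lower())   # A is now most recent
cache.access("C", lambda b: b.lower())   # evicts B, the LRU block
assert list(cache.blocks) == ["A", "C"]
```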


            Where in client's memory to put cache

                The user's process

                    The cache will die with the process

                    No cache reuse among distinct processes

                    Not done for normal OS.

                    Big deal in databases

                        Cache management is a well studied DB problem

                The kernel (i.e. the client's kernel)

                    System call required for cache hit

                    Quite common

                Another process

                    "Cleaner" than in kernel

                    Easier to debug


                    Might get paged out by kernel!

                Look at figure 13-10 (handout)

            Cache consistency

                Big question


                Write through

                    All writes sent to the server (as well as the
                    client cache)

                    Hence does not lower traffic for writes


                    Does not by itself fix values in other caches

                    Need to invalidate or update other caches

                        Can have the client cache check with server
                        whenever supplying a block to ensure that the
                        block is not obsolete

                        Hence still need to reach server for all
                        accesses but at least the reads that hit in
                        the cache only need to send tiny msg
                        (timestamp not data).

                        I guess this would be called lazy

                Delayed write

                    Wait a while (30 seconds is used in some NFS
                    implementations) and then send a bulk write msg.

                    This is more efficient than a bunch of small
                    write msgs

                    If file is deleted quickly, you might never write

                    Semantics are now time dependent (and ugly).


                Write on close

                    Session semantics

                        Fewer msgs since more writes than closes.

                        Not beautiful (think of two files
                        simultaneously opened)

                        Not much worse than normal (uniprocessor)
                        semantics.  The difference is that it appears
                        much more likely to hit the bad case; really
                        it is just much less unlikely.
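
The policies above can be sketched in one toy client cache (the API, the 30-second delay, and the dict standing in for the server are all hypothetical):

```python
# Sketch contrasting write-through, delayed write, and write-on-close.
# Only write-through keeps the server current on every write.

class ClientCache:
    def __init__(self, server, policy, delay=30.0):
        self.server, self.policy, self.delay = server, policy, delay
        self.dirty = {}              # buffered writes: block -> data
        self.last_flush = 0.0

    def write(self, block, data, now):
        if self.policy == "write-through":
            self.server[block] = data    # every write hits the server
        else:
            self.dirty[block] = data     # buffer locally
            if self.policy == "delayed" and now - self.last_flush >= self.delay:
                self.flush(now)          # periodic bulk write

    def close(self, now):
        if self.policy in ("delayed", "write-on-close"):
            self.flush(now)              # session semantics

    def flush(self, now):
        self.server.update(self.dirty)   # one bulk msg, not many small ones
        self.dirty.clear()
        self.last_flush = now

server = {}
c = ClientCache(server, "delayed")
c.write("b1", "x", now=0.0)        # buffered only
assert server == {}
c.write("b2", "y", now=31.0)       # 30s elapsed: bulk flush
assert server == {"b1": "x", "b2": "y"}
```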


================ Lecture 27 ================


                Delayed write on close

                    Combines the advantages and disadvantages of
                    delayed write and write on close.

                Doing it "right"

                    Multiprocessor caching (of central memory) is well
                    studied and many solns are known.

                        We mentioned this at beginning of course

                        Cache consistency (aka cache coherence)

                    Book mentions a centralized soln.

                    Others are possible, but none are cheap

                    Interesting thought: IPC is more expensive than a
                    cache invalidate but disk I/O is much rarer than
                    mem refs.  Might this balance out and might one of
                    the cache consistency algorithms perform OK to
                    manage distributed disk caches?

                        If so why not used?

                        Perhaps NFS is good enough and not enough
                        reason to change (NFS predates cache coherence)


        Replication

            Some issues similar to (client) caching


            Because WHENEVER you have multiple copies of anything,
            bells ring

                Are they immutable?

                What is update policy?

                How do you keep copies consistent?

            Purposes of replication


                Reliability

                    A "backup" is available if data is corrupted on
                    one server.


                Availability

                    Only need to reach ANY of the servers to access
                    the file (at least for queries)

                    NOT the same as reliability


                Performance

                    Each server handles less than the full load (for a
                    query-only system MUCH less)

                    Can use closest server lowering network delays

                        NOT important for dist sys on one physical network

                        VERY important for the web

                            mirror sites


                If can't tell files are replicated, say the system has
                replication transparency

                Creation can be completely opaque

                    i.e. fully manual

                    users use copy commands

                    if directory supports multiple binary names for a
                    single symbolic name,

                        use this when making copies

                        presumably subsequent opens will try the
                        binary names in order (so not opaque)

                Creation can use lazy replication

                    User creates original

                        system later makes copies

                        subsequent opens can be (re)directed at any
                        copy

                Creation can use group communication

                    User directs requests at a group

                        Hence creation happens at all copies in the
                        group at once


                Update protocols

                    Primary copy

                        All updates done to the primary copy

                        This server writes the update to stable
                        storage and then updates all the other
                        (secondary) copies.

                        After a crash, the server looks at stable
                        storage and sees if there are any updates to
                        complete

                        Reads are done from any copy.

                        This is good for reads (read any one copy).

                        Writes are not so good.

                            Can't write if primary copy is unavailable


                            The update can take a long time (some of
                            the secondaries can be down)

                            While the update is in progress, reads are
                            concurrent with it.  That is, a read might
                            get the old or new value depending on which
                            copy it reaches


                    Voting

                        All copies are equal (symmetric)

                        To write you must write at least WQ of the
                        copies (a write quorum).  Set the version
                        number of all these copies to 1 + max of
                        current version numbers.

                        To read you must read at least RQ copies and
                        use the value with the highest version.

                        Require WQ+RQ > num copies

                            Hence any write quorum and read quorum intersect

                            Hence the highest version number in any
                            read quorum is the highest ver num there is

                            Hence always read the current version

                        Consider the extremes (RQ=1, WQ=N   and   WQ=1, RQ=N)
                        Fine points (omitted by Tannenbaum)

                            To write, you must first read all the
                            copies in your WQ to get the ver num.
                            Must prevent races

                                Let N=2, WQ=2, RQ=1.  Both copies (A
                                and B) have ver num 10

                                Two updates start.  U1 wants to write
                                1234, U2 wants to write 6789.

                                Both read ver numbers and add 1 (get 11)

                                U1 writes A and U2 writes B at roughly
                                the same time.

                                Later U1 writes B and U2 writes A.

                                Now both are at version 11 but A=6789
                                and B=1234
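
A sketch of the quorum rules (Copy objects and quorum selection are illustrative; a real system must also serialize the read-increment-write of version numbers to avoid exactly the race just shown):

```python
# Gifford-style voting as described above.  Requiring WQ + RQ > N
# guarantees every read quorum intersects every write quorum, so the
# highest version a reader sees is the current one.

class Copy:
    def __init__(self):
        self.version, self.value = 0, None

def quorum_write(copies, wq, value):
    chosen = copies[:wq]                    # any WQ copies would do
    new_version = max(c.version for c in chosen) + 1
    for c in chosen:
        c.version, c.value = new_version, value

def quorum_read(copies, rq):
    chosen = copies[-rq:]                   # any RQ copies would do
    return max(chosen, key=lambda c: c.version).value

N = 5
copies = [Copy() for _ in range(N)]
WQ, RQ = 3, 3                               # WQ + RQ = 6 > N = 5
quorum_write(copies, WQ, "v1")
quorum_write(copies, WQ, "v2")
# Read quorum (last 3 copies) overlaps write quorum (first 3) at one
# copy, which carries the newest version.
assert quorum_read(copies, RQ) == "v2"
```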


                    Voting with ghosts

                        Often reads dominate writes so choose RQ=1 (or
                        at least RQ very small so WQ very large)

                        This makes it hard to write.  E.g. RQ=1 so
                        WQ=n and hence can't update if any machine is
                        down

                        When one detects that a server is down, a
                        ghost is created.

                        Ghost canNOT participate in read quorum, but
                        can in write quorum

                            write quorum must have at least one
                            non-ghost

                        Ghost throws away value written to it

                        Ghosts always have version 0 (Tannenbaum forgot
                        this point)

                        When crashed server reboots, it accesses a read
                        quorum to update its value
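
A companion sketch of the ghost rules (class names mine): a ghost may count toward a write quorum, discards what is written, and stays at version 0 so it never supplies a value.

```python
# Minimal sketch of "voting with ghosts".

class Ghost:
    version = 0          # always 0: can never win a read
    value = None
    def write(self, version, value):
        pass             # ghost throws away the value written to it

class Live:
    def __init__(self):
        self.version, self.value = 0, None
    def write(self, version, value):
        self.version, self.value = version, value

def ghost_write(quorum, value):
    # A write quorum must contain at least one non-ghost copy.
    assert any(isinstance(c, Live) for c in quorum)
    new_version = max(c.version for c in quorum) + 1
    for c in quorum:
        c.write(new_version, value)

live, ghost = Live(), Ghost()
ghost_write([live, ghost], "data")
assert live.version == 1 and live.value == "data"
assert ghost.version == 0 and ghost.value is None   # ghost kept nothing
```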

================ Lecture 28 ================


NFS--SUN Microsystem's Network File System

    In chapter 9 of your book.  But it should be here in chapter 13.
    In newer book, it is here!  (Replaces AFS material, which we will
    not cover.)

    "Industry standard", dominant system

    Machines can be (and often are) both clients and servers


    Basic idea is servers EXPORT directories and clients mount them

        When a server exports a directory, the subtree rooted there is
        exported

        In unix exporting is specified in /etc/exports

        In unix mounting is specified in /etc/fstab

            fstab = file system table

            In unix w/o NFS what you mount are filesystems

    Two Protocols

        1.  Mounting

            Client sends server msg containing pathname (on server) of
            the directory it wishes to mount

            Server returns HANDLE for the directory

                Subsequent read/write calls use the handle
                Handle has data giving disk, inode#, et al
                Handle is NOT an index into table of actively
                exported directories.  Why not?
                Because the table would be STATE and NFS is
                stateless

            Can do this mounting at any time, often done at client
            boot time.

            Automounting--we skip
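
A sketch of why the handle can avoid being server state (fields are illustrative; real NFS handles are opaque byte strings):

```python
# The handle carries everything needed to find the file again, so the
# server keeps no per-client record between requests.

from collections import namedtuple

Handle = namedtuple("Handle", ["filesystem_id", "inode", "generation"])

def mount_request(server_fs, path):
    """Server side of the mount protocol: return a self-describing
    handle and remember nothing."""
    inode, generation = server_fs[path]
    return Handle(filesystem_id=id(server_fs), inode=inode,
                  generation=generation)

server_fs = {"/export/home": (2, 7)}   # path -> (inode, generation)
h = mount_request(server_fs, "/export/home")
assert h.inode == 2                    # the handle carries the data itself
```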

        2.  File and directory access

            Most unix sys calls supported

            Open/close NOT supported

                NFS is stateless

            Do have lookup, which returns a file handle.  But this
            handle is NOT an index into a table.  Instead it contains
            the data needed.

            As indicated previously, stateless makes unix locking
            semantics hard to achieve


    Authentication

        Client gives the rwx bits to server.

        How does server know the client is machine it claims to be?

        Various Crypto keys.

        This and other stuff stored in NIS (net info svc) aka yellow pages

        Replicate NIS

        Update master copy

            master updates slaves

            window of inconsistency


        See figure (handout).

        Client system call layer processes I/O system calls and calls
        the virtual file system layer (VFS).

        VFS has a v-node (virtual i-node) for each open file

        Like incore inodes for traditional unix

        For local files v-node points to inode in local OS

        For remote files v-node points to r-node (remote i-node) in
        NFS client code.

        For remote files r-node holds the file's handle (see below)
        which is enough for a remote access
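
The layering can be sketched as data structures (all names illustrative):

```python
# A v-node points either at a local i-node or at an r-node that
# holds the NFS handle for a remote file.

class RNode:
    def __init__(self, handle):
        self.handle = handle             # enough for a remote access

class VNode:
    def __init__(self, local_inode=None, rnode=None):
        self.local_inode = local_inode   # set for local files
        self.rnode = rnode               # set for remote files

def read(vnode, offset, nbytes, local_read, remote_read):
    if vnode.rnode is not None:
        # Remote file: client code sends the handle to the server.
        return remote_read(vnode.rnode.handle, offset, nbytes)
    # Local file: hand off to the local OS via the i-node.
    return local_read(vnode.local_inode, offset, nbytes)

local = VNode(local_inode=42)
remote = VNode(rnode=RNode(handle=("fs0", 7)))
assert read(local, 0, 10, lambda i, o, n: "local", None) == "local"
assert read(remote, 0, 10, None, lambda h, o, n: h[0]) == "fs0"
```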

    Blow by blow

        Mount (remote directory, local directory)

            First the mount PROGRAM goes to work

                Contacts the server and obtains a handle for the remote
                directory

                Makes mount sys call passing handle

            Now the kernel takes over

                Makes a v-node for the remote directory

                Asks client code to construct an r-node

                Has the v-node point to the r-node

        Open system call

            While parsing the name of the file, the kernel (VFS layer)
            hits the local directory on which the remote is mounted
            (this part is similar to ordinary mounts of local
            filesystems)

            Kernel gets v-node of the remote directory (just as would
            get i-node if processing local files)

            Kernel asks client code to open the file (given r-node)

            Client code calls server code to look up remaining portion
            of the filename

            Server does this and returns a handle (but does NOT keep a
            record of this).  Presumably the server, via the VFS and
            local OS, does an open and this data is part of the
            handle.  So the handle gives enough information for the
            server code to determine the v-node on the server machine.

            When client gets a handle for the remote file, it makes an
            r-node for it.  This is returned to the VFS layer, which
            makes a v-node for the newly opened remote file.  This
            v-node points to the r-node.  The latter contains the
            handle information.

            The kernel returns a file descriptor, which points to the
            v-node


        Read/write system calls

            VFS finds v-node from the file descriptor it is given.

            Realizes remote and asks client code to do the read/write
            on the given r-node (pointed to by the v-node).

            Client code gets the handle from its r-node table and
            contacts the server code.

            Server verifies the handle is valid (perhaps
            authentication) and determines the v-node.

            VFS (on server) called with the v-node and the read/write
            is performed by the local (on server) OS.

            Read ahead is implemented but as stated before it is
            primitive (always read ahead).


    Caching

        Servers cache but not big deal

        Clients cache

        Potential problems of course so

            Discard cached entries after some SECONDS

            On open the server is contacted to see when the file was
            last modified.  If newer than the cached version, the
            cached version is discarded

            After some SECONDS all dirty cache blocks are flushed back
            to server.

        All these bandaids still do not give proper semantics (or even
        unix semantics).
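
The bandaids can be sketched as follows (the intervals and the mtime interface are hypothetical, though real NFS clients used values in this range):

```python
# Sketch of NFS-style client cache bandaids: time-limited cache
# entries plus an mtime check against the server on open.

CACHE_TTL = 3.0      # seconds a cached block is trusted

class NFSClientCache:
    def __init__(self, server_mtime):
        self.server_mtime = server_mtime   # stand-in for asking the server
        self.cached_at = {}                # block -> time it was cached

    def open(self, path, local_mtime, now):
        # On open, contact the server; discard the cache if the
        # server's copy is newer than ours.
        if self.server_mtime(path) > local_mtime:
            self.cached_at.clear()
            return "discard-cache"
        return "use-cache"

    def fresh(self, block, now):
        # A cached block is trusted only for CACHE_TTL seconds.
        return (block in self.cached_at
                and now - self.cached_at[block] <= CACHE_TTL)

cache = NFSClientCache(server_mtime=lambda p: 100.0)
cache.cached_at["b1"] = 0.0
assert cache.fresh("b1", now=2.0)          # within TTL
assert not cache.fresh("b1", now=10.0)     # expired: must recheck
assert cache.open("/f", local_mtime=50.0, now=10.0) == "discard-cache"
```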

Lessons learned (from AFS, not covered, but applies in some generality)

    Workstations, i.e. clients, have cycles to burn

        So do as much as possible on client

    Cache whenever possible

    Exploit usage properties

        Several classes of files (e.g. temporary)

        Trades off simplicity for efficiency

    Minimize system wide knowledge and change

        Helps scalability

        Favors hierarchies

    Trust fewest possible entities

        Good principle for life as well

            Try not to depend on the "kindness of strangers"

    Batch work where possible

Diagram available in postscript or html.