Multiprocessors: shared address space across all processors.
Multicomputers: each processor has its own address space.

Tightly Coupled vs. Loosely Coupled

Tightly coupled means capable of cooperating closely on a single problem:
fine-grained communication (i.e. the processors communicate often) and large
amounts of data transferred.

Tightly coupled machines can normally emulate loosely coupled ones, but might
give up some fault tolerance.

One can solve a single large problem on loosely coupled hardware, but the
problem must be coarse grained. For example, various cryptographic challenges
have been solved over the internet with email as the only communication.

Bus-based multiprocessors

These are symmetric multiprocessors (SMPs): from the viewpoint of one
processor, all the others look the same; none are logically closer than
others. SMP also implies cache coherent (see below).

The hardware is fairly simple. Recall that a uniprocessor looks like

      Proc
       |
    Cache(s)                        Mem
       |                             |
    ------------------------------------  Memory Bus
                   |
               "Chipset"
                   |
    ------------------------------------  I/O Bus (e.g. PCI)
       |        |        |        |
     video    scsi    serial     etc

Complication: when a scsi disk writes to memory (i.e. a disk read is
performed), you need to keep the cache(s) up to date, i.e. CONSISTENT with the
data. If a cache had the old value before the disk read, the system can either
INVALIDATE the cache entry or UPDATE it with the new value.

Facts about caches. They can be WRITE THROUGH: when the processor issues a
store, the value goes into the cache and is also sent to memory. Or they can
be WRITE BACK: the value goes only into the cache. In the latter case the
cache line is marked dirty, and when it is evicted it must be written back to
memory.

To make a bus-based MP, just add more processor-cache pairs.

      Proc        Proc
       |           |
    Cache(s)    Cache(s)            Mem
       |           |                 |
    ------------------------------------  Memory Bus
                   |
               "Chipset"
                   |
    ------------------------------------  I/O Bus (e.g. PCI)
       |        |        |        |
     video    scsi    serial     etc

A key question now is whether the caches are automatically kept consistent
(a.k.a. coherent) with each other. When the processor on the left writes a
word, you can't let the old value hang around in the right-hand cache(s), or
the right-hand processor can read the wrong value. If the caches are write
back, then on an initial write a cache must claim ownership of the cache line
(invalidate all other copies). If the caches are write through, you can either
invalidate other copies or update them.

Because of the broadcast nature of the bus, it is not hard to design snooping
(a.k.a. snoopy) caches that maintain consistency. I call this the "Turkish
Bath" mode of communication because "everyone sees what everyone else has
got". (A toy simulation of write-invalidate snooping is sketched below.)

These machines are VERY COMMON.

DISADVANTAGE (really a limitation): they cannot be used for a large number of
processors, since the bandwidth needed on the bus grows with the number of
processors and becomes increasingly difficult and expensive to supply.
Moreover, the latency grows with the number of processors, for both
speed-of-light and more complicated (electrical) reasons.

Homework 9-2 (ask me next time about this one)

Other SMPs

From the software viewpoint, the key property of the bus-based MP is that it
is an SMP; the bus itself is just an implementation property. To support
larger numbers of processors, other interconnection networks can be used. The
book shows two possibilities, crossbars and omega networks. Ignore what is
said there as it is too superficial and occasionally more or less wrong.
Tannenbaum is not a fan of shared memory.
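To make the snoopy-cache idea above concrete, here is a minimal, single-
threaded toy simulation of write-invalidate snooping in C. It is not real
hardware and not any particular commercial protocol: the one-line "caches",
the names proc_read/proc_write, and the write-through policy are simplifying
assumptions made just for illustration.

    /* Toy, single-threaded simulation of write-invalidate snooping.
     * Each "cache" holds one line, either INVALID or VALID, for a single
     * shared memory word.  A write by one processor broadcasts an
     * invalidate to the other caches (the loop plays the role of the bus).
     * Write-through is assumed for simplicity. */
    #include <stdio.h>

    #define NCACHES 4
    enum state { INVALID, VALID };
    struct cache { enum state st; int value; };

    static struct cache caches[NCACHES];   /* all start INVALID (zeroed) */
    static int memory = 0;                 /* the single shared word     */

    static int proc_read(int p) {
        if (caches[p].st == INVALID) {     /* miss: fetch from memory    */
            caches[p].value = memory;
            caches[p].st = VALID;
        }
        return caches[p].value;
    }

    static void proc_write(int p, int v) {
        for (int q = 0; q < NCACHES; q++)  /* snoop: kill other copies   */
            if (q != p) caches[q].st = INVALID;
        caches[p].value = v;
        caches[p].st = VALID;
        memory = v;                        /* write through to memory    */
    }

    int main(void) {
        proc_write(0, 42);                      /* P0 stores 42              */
        printf("P1 reads %d\n", proc_read(1));  /* P1 misses, gets 42        */
        proc_write(2, 7);                       /* P2 invalidates P0 and P1  */
        printf("P1 reads %d\n", proc_read(1));  /* P1 misses again, gets 7   */
        return 0;
    }

When P2 writes, the loop over the other caches plays the role of the bus
broadcast: P1's copy is invalidated, so its next read misses and fetches the
new value, which is exactly the behavior a coherent SMP guarantees.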
(The Ultracomputer in figure 9.4 is an NYU machine; my main research activity
from 1980-95.)

Another name for SMP is UMA, for Uniform Memory Access, i.e. all the memories
are equally distant from a given processor.

NUMAs

For larger numbers of processors, NUMA (NonUniform Memory Access) MPs are
being introduced. Typically some memory is associated with each processor, and
that memory is close; the rest is further away.

    P--C---M        P--C---M
       |               |
    |----------------------------|
    |   interconnection network  |
    |----------------------------|

CC-NUMAs (Cache Coherent NUMAs) are programmed like SMPs, BUT to get good
performance you must try to exploit the memory hierarchy: have most references
hit in your local cache, and most of the others hit in the part of the shared
memory in your "node".

HOMEWORK 9.2' Will semaphores work here?

NUMAs that are not cache coherent are yet harder to program, since you must
maintain cache consistency manually (or with compiler help).

HOMEWORK 9.2'' Will semaphores work here?

Bus-based Multicomputers

The big difference is that this is an MC, not an MP: NO shared memory. In some
sense all the computers on the internet form one enormous MC. The interesting
case is when there is closer cooperation between processors, say the
workstations in one distributed systems research lab cooperating on a single
problem. With the current state of the practice, the application must tolerate
long-latency communication and modest bandwidth.

Recall the figure for a typical computer.

      Proc
       |
    Cache(s)                        Mem
       |                             |
    ------------------------------------  Memory Bus
                   |
               "Chipset"
                   |
    ------------------------------------  I/O Bus (e.g. PCI)
       |        |        |        |
     video    scsi    serial     etc

The simple way to get a bus-based MC is to put an ethernet controller on the
I/O bus of each machine and then connect them all with a bus called ethernet.
(Draw this on the board.) Better (higher performance) is to put the Network
Interface (possibly ethernet, possibly something else) on the memory bus. In
either case the processors cooperate only by sending explicit messages over
the network (a minimal sketch of such message passing appears at the end of
these notes).

Other (bigger) multicomputers

Once again buses are limiting, so switched networks are used. Much could be
said, but currently grids (SGI/Cray T3E) and omega networks (IBM SP2) are
popular.

HOMEWORK 9.2''' Will semaphores work here? Also 9.3.
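Since a multicomputer has no shared memory, the only way its nodes can
cooperate is by sending explicit messages over the interconnect (ethernet or
otherwise). Below is a minimal sketch of such message passing using ordinary
TCP sockets in C; the port number, the payload value 42, and the
server/client command-line convention are assumptions made only for
illustration, and error checking is omitted.

    /* Minimal sketch of multicomputer-style cooperation: the two "nodes"
     * share no memory, so they exchange an integer over a TCP connection.
     * The port number (5555) and program structure are invented for
     * illustration; error checking is omitted for brevity. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define PORT 5555

    static void server(void) {                  /* run on one machine */
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = { 0 };
        a.sin_family = AF_INET;
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        a.sin_port = htons(PORT);
        bind(s, (struct sockaddr *)&a, sizeof a);
        listen(s, 1);
        int c = accept(s, NULL, NULL);
        int32_t x;
        if (read(c, &x, sizeof x) == sizeof x)  /* receive the message */
            printf("server received %d\n", ntohl(x));
        close(c);
        close(s);
    }

    static void client(const char *host) {      /* run on another machine */
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct hostent *h = gethostbyname(host);
        struct sockaddr_in a = { 0 };
        a.sin_family = AF_INET;
        memcpy(&a.sin_addr, h->h_addr_list[0], h->h_length);
        a.sin_port = htons(PORT);
        connect(s, (struct sockaddr *)&a, sizeof a);
        int32_t x = htonl(42);
        write(s, &x, sizeof x);                 /* send the message */
        close(s);
    }

    int main(int argc, char *argv[]) {
        if (argc >= 2 && strcmp(argv[1], "server") == 0)
            server();
        else if (argc >= 3 && strcmp(argv[1], "client") == 0)
            client(argv[2]);
        else
            fprintf(stderr, "usage: %s server | client <host>\n", argv[0]);
        return 0;
    }

Run one copy as "server" on one machine and another as "client <host>" on a
second machine. Contrast this with the SMP case, where the same cooperation is
just a store and a load to shared memory plus some synchronization (e.g. a
semaphore); that contrast is the heart of the "will semaphores work here"
homework questions.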