.\" to print this file, use: .\" eqn filename | itroff -me -Pprinter .\" .\" Paragraphs inside (.q .)q .\" Don't change pointsize or indentation. .\" Instead use .qP .po 1i .ll 6.5i .sz 11 .nr eN 0 1 \" Entry number .de qP .ie t .sp 0.35v .el .sp .@F \\n(pf .ne 2u*\\n(.Lv .. .de nE .sp .ne \\n(.Lv*3u .ie '\\$1'' .b \\n+(eN. .el .b \\$1. .. .fo 'Ultracomputer Notes Abstracts''Page %' .nr SI 0 .EQ delim $$ gfont I gsize 11 .EN .(b C \fBNEW YORK UNIVERSITY Ultracomputer Research Laboratory .sp Courant Institute of Mathematical Sciences 251 Mercer Street New York, NY 10012 .sp ULTRACOMPUTER NOTE ABSTRACTS Revised: January, 1990 .)b .nE UC Jack Schwartz, .q "Ultracomputers" .(q A class of parallel processors potentially involving thousands of individual processing elements is described. The architecture is based on the perfect shuffle connection and has two favorable characteristics: (1) Each processor communicates with a fixed number of other processors. (2) Important communication functions can be accomplished in time proportional to the logarithm of the number of processors. A number of basic algorithms for these .q "ultracomputers" are presented, and physical design considerations are discussed in a preliminary fashion. .)q .nE 1 Lambert Meertens, .q "Bitonic Sort on Ultracomputers" .(q Batcher's .i "bitonic sort" is a sorting network, capable of sorting n inputs in \(*H((log n)\*{2\*}) stages. When adapted to conventional computers, it gives rise to an algorithm that runs in time \(*H(n(log n)\*{2\*}). The method can also be adapted to .i ultracomputers to exploit their high degree of parallelism. The resulting algorithm will take time \(*H((log N)\*{2\*}) for ultracomputers of .q size N. The implicit constant factor is low, so that even for moderate values of N the ultracomputer architecture performs faster than the \(*H(N log N) time a conventional architecture can achieve. The purpose of this note is to describe the adapted algorithm.
After some preliminaries a first version of the algorithm is given whose correctness is easily shown. Next, this algorithm is transformed to make it suitable for an ultracomputer. .)q .nE 2 Lambert Meertens, .q "Recurrent Thoughts on Ultracomputer Programming Style" .(q \fIUltracomputers\fR are assemblages of processors that are able to operate concurrently and can exchange data through communication lines in, say, one cycle of operation. For physical reasons, the fan in/out of the processors must be limited. This imposes restrictions on the possible communication schemes. In order to have the ultracomputer operate efficiently as a whole, it is desirable that arbitrary exchanges of information between the processors can be effected in a small number of data shifts. .qP If a really huge ultracomputer is built, it would be nice if it could be constructed by coupling smaller ultracomputers, which in turn are assembled from still smaller ultracomputers, and so on. It will be shown that the latter desire conflicts to a certain extent with the earlier one. .)q .nE 3 Jack Schwartz, .q "Preliminary Thoughts on Ultracomputer Programming Style" .(q This note describes a simple low-level language which can be used to write ultracomputer programs of the synchronous type, i.e. programs oriented toward the coordinated use of all the processors of an ultracomputer. .)q .nE 4 Jack Schwartz, .q "A Remark on Nearly Planar Embeddings of Small Ultracomputers" .(q For efficient VLSI realizations of ultracomputers, planar or nearly planar realizations of the shuffle interconnection are desirable. In the N = * case this is easily done. In the N = 16 case a nearly planar layout is given and shown to be optimal.
.)q .nE 5 Jack Schwartz, .q "The Burroughs FMP Machine" .(q This note comments on various interesting points which appear in the Burroughs Corporation final technical report NAS2-9897, March 1979, entitled \fINumerical Aerodynamic Simulation Facility Feasibility Study.\fR The points covered include: The data movement algorithm of the interconnection network, physical structure of the interconnection network and timings, programming style, applications benchmarked, and reliability considerations. .)q .nE 6 Clyde P. Kruskal and Larry Rudolph, .q "Observations Concerning Multidimensional Ultracomputers" .(q Schwartz introduced the idea of an ultracomputer and reviewed various basic algorithms for such an ensemble of processors containing .q "shuffle" interconnections. Later, Harrison and Kalos introduced the idea of an N-dimensional ultracomputer (N-D UC), which is well suited for implementing many algorithms. The main result of this note is that an N-D UC can simulate the right/left and shuffle/unshuffle connections of its 1-dimensional counterpart in N and 2N steps, respectively. Thus, for each N, an N-D UC is asymptotically as fast as a 1-D UC of the same size. .)q .nE 7 Allan Gottlieb and Clyde P. Kruskal, .q "A Data-Motion Algorithm" .(q We desire to permute N items w\*<0\*>,...,w\*<N\(mi1\*>, in an ultracomputer containing P processing elements (PEs), PE\*<0\*>,...,PE\*<P\(mi1\*>. Under the assumption that N=P and that w\*<i\*> \(*e PE\*<i\*>, Schwartz gives the following worst case analyses: The static permutation algorithm requires 4 log P - 3 data communication steps, and the dynamic permutation problem requires \(*H((log P)\*{2\*}) data communication steps. It is easily seen that for both algorithms the average case behavior closely approximates the worst case. .qP Here we present a data motion algorithm oriented toward average case rather than worst case performance, and supply an argument suggesting that the average number of data communication steps required is approximately 3 log P.
.)q .nE 8 Allan Gottlieb, .q "A Remark on the Planarity of the Shuffle-Exchange Networks of Sizes 16 and 32" .(q The shuffle-exchange network omits some of the connections that are present in a full ultracomputer. This note demonstrates that a 16 processor network of this sort can be imbedded in the plane without crossings. In addition, we show that if just 6 connections are allowed to pass through the .q "active" area occupied by another processor then the 32 processor case is also planar. Connections of this sort may be feasible in a VLSI technology. .)q .nE 9 Allan Gottlieb, .q "Another Remark on the Planarity of Shuffle-Exchange Networks of Sizes 16 and 32" .(q Rudolph presented an example of a 32 processor shuffle-exchange network in which 6 connections pass through the area occupied by another processor. In this note we reduce the number of intersections from 6 to 4. Since none of the exchange connections pass through a processor, the exchange operation can be performed in one cycle. The shuffle operation requires 2 cycles. .)q .nE 10 Allan Gottlieb, .q "PLUS \- A PL/I Based Ultracomputer Simulator, I" .(q This note describes a (presently implemented) ultracomputer emulator called PLUS which uses the multitasking and preprocessing features of PL/I to support a recursive ultracomputer programming style. Since the emulation to be described is written in PL/I, the powerful debugging features of the PL/I checkout compiler, as well as PL/I's separate compilation facility, are available. This note also describes the emulation system and furnishes a \*(lqUser's Guide\*(rq. In a subsequent part II we will discuss the system's implementation and prove both correctness and freedom from deadlock. .)q .nE 11 Allan Gottlieb and Clyde P. Kruskal, .q "Supersaturated Ultracomputer Algorithms" .(q For a wide class of problems, we obtain lower bounds for algorithms executed on certain parallel processors.
These bounds show that for sufficiently large problems many known algorithms are optimal. The central result of the paper is the following sharper lower bound for permutation algorithms. Any permutation algorithm for \fIN\fP data items on a \fIP\fP processor parallel machine without shared memory requires time on the order of (N/P) log\*<K\*>P where \fIK\fP is the maximum number of processors directly connected to a single processor. In particular, a speedup on the order of \fIP\fP is impossible if \fIK\fP is bounded. .)q .nE 12 Allan Gottlieb, .q "WASHCLOTH \- The Logical Successor to SOAPSUDS" .(q Schwartz has proposed the .q "paracomputer" model of parallel computation, whose performance more realistic architectures could try to approach. This note describes the WASHCLOTH paracomputer simulator and is organized as follows: Section II reviews the paracomputer model; section III presents the WASHCLOTH design and the machine operations provided; section IV describes the FORTRAN interface; and section V is a users' guide to WASHCLOTH. .)q .nE 13 Allan Gottlieb, .q "MOP \- A (Minimal) Multiprocessor Operating System Extending WASHCLOTH" .(q This note describes MOP, a skeletal multiprocessor operating system for the WASHCLOTH virtual machine. The note is organized as follows: Chapter II describes the WASHCLOTH virtual machine, chapter III presents the MOP user interface, and chapter IV proposes an initial implementation. .)q .nE 14 Allan Gottlieb, .q "PLUS: A PL/I Based Ultracomputer Simulator, II" .(q Part 1 of this two part report introduced a (presently implemented) ultracomputer simulator called PLUS which uses the multitasking and preprocessing features of PL/I to support a recursive ultracomputer programming style. Since the simulation is written in PL/I, the powerful debugging features of the PL/I checkout compiler, as well as PL/I's separate compilation facility, are available.
In this note we discuss the system's implementation and prove both correctness and freedom from deadlock. .)q .nE 15 Allan Gottlieb and Clyde P. Kruskal, .q "MULT \- A Multitasking Ultracomputer Language with Timing, I & II" .(q [UC] introduced the idea of an ultracomputer, a parallel processor based on the .q "shuffle" connections; reviewed various basic algorithms for such an ensemble of processors; and analyzed their asymptotic time complexities. However, one will want to know exact, rather than merely asymptotic, efficiencies when designing and using an ultracomputer. Moreover, some algorithms are difficult to analyze analytically (e.g. linear programming and backtracking) and only actual practice gives us a feeling for their efficiencies. The MULT (\fIM\fRultitasking \fIU\fRltracomputer \fIL\fRanguage with \fIT\fRiming) system described in this note is an attempt to deal with these issues by timing simulated executions of ultracomputer programs. .)q .nE 16 Allan Gottlieb, B.D. Lubachevsky, and Larry Rudolph, .q "Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors" .(q In this paper we implement several basic operating system primitives by using a .q "replace-add" operation, which can supersede the standard .q "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously directs a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
.)q .nE 17 Allan Gottlieb, .q "Comments on Concurrent Search and Insertion in AVL Trees" .(q Ellis's concurrent AVL insertion algorithm is discussed in this correspondence. We note that obtaining a block of storage for the new AVL leaf may become a serial bottleneck for the entire insertion algorithm. We indicate a potential solution and refer the reader to another paper in which the full details are given. .)q .nE 18 Charles S. Peskin and Olof B. Widlund, .q "Remarks on Efficient Numerical Algorithms for Ultracomputers" .(q This note gives an outline of a few numerical algorithms for some important problems arising in continuum mechanics which can take full advantage of the special features of ultracomputers. We discuss fast Poisson solvers, capacitance matrix or imbedding methods for special elliptic problems on general bounded domains, and a finite difference approximation to the time dependent three dimensional Navier-Stokes equations. We show that algorithms exist which are within a logarithmic factor of achieving optimal speedup and which are relatively easy to implement. We conclude with a few general comments on the parallel solution of linear systems of algebraic equations. .)q .nE 19 Charles S. Peskin, .q "Ultracomputer Implementation of Odd-Even Reduction" .(q The Odd-Even Reduction Algorithm has been widely used to solve tridiagonal systems of equations on serial computers, especially in the context of fast Poisson solvers. The parallel form of the algorithm is also discussed by Heller. The purpose of this note is to show that the structure of the algorithm is just right for implementation on an ultracomputer. .)q .nE 20 Charles S.
Peskin, .q "A Comparison of Ultracomputer Architecture and Lattice Architecture for Problems on Lattices" .(q We conclude that for most numerical problems on lattices, an ultracomputer will be faster than a lattice machine, despite the fact that the lattice machine seems to have precisely the right architecture for the problem. There is a special class of problems where the lattice machine is faster, but only by a factor of log N. In either case, the ultracomputer will be a more flexible machine, adaptable to a variety of problems without rewiring. .)q .nE 21 Allan Gottlieb, .q "WASHCLOTH 81" .(q This note describes several enhancements that have been made to the WASHCLOTH simulator. .)q .nE 22 Norman Rushfield, .q "Atmospheric Computations on Highly Parallel MIMD Computers" .(q A FORTRAN program representing a two-dimensional vertical plane model of the atmosphere has been modified to make use of the WASHCLOTH simulator of a parallel processor. To determine the suitability of this code for execution on very large scale parallel computers of the proposed paracomputer class, we analyzed the performance of three codes: (1) the original code; (2) code (1) as modified to run under the WASHCLOTH parallel processor simulator; and (3) a fully parallel code that partitions work among multiple PEs. .)q .nE 23 David Korn, .q "Converting Scientific Codes to Run Under the WASHCLOTH Simulator" .(q This note describes a set of subroutines that have been written to aid in preparing scientific FORTRAN code for WASHCLOTH simulation. In addition to making code conversion simpler, these routines count machine instructions spent waiting for processor synchronizations. Finally, the SOFTSOAP package that enables FORTRAN code prepared for WASHCLOTH simulation to run with one processor in a simple, efficient manner is presented.
.)q .nE 24 David Korn, .q "Timing Analysis for Scientific Codes Run Under WASHCLOTH Simulation" .(q This note details the timing measurements made for scientific programs run under the WASHCLOTH simulator. Processing time, as a function of the number of processing elements and the problem size, is analyzed. The aim is to predict how the number of processors that can be used efficiently varies as a function of the problem size. In addition, the analysis sheds light on the problem of designing and coding efficient programs. .)q .nE 25 Allan Gottlieb and Jack Schwartz, .q "Networks and Algorithms for Very Large Scale Parallel Computation" .(q After discussing interconnection networks in general, this report illustrates the operation of three well-known examples: the cube, the shuffle, and the cube-connected cycles. The important .q "dynamic permutation" problem (similar to the problem of routing memory requests to the appropriate memory module) is then described and analyzed. These results are particularly applicable to machine designs similar to Burroughs' FMP. The paracomputer model of computation and the replace-add primitive are reviewed, and hardware issues are discussed. We show how this permits queues to be managed in a highly parallel manner, which in turn gives a concurrent implementation of a dataflow language. .)q .nE 26 Clyde P. Kruskal, .q "Supersaturated Paracomputer Algorithms" .(q This paper describes the paracomputer model of parallel computation and presents paracomputer algorithms for solving several important problems including sorting. A paracomputer is essentially a traditional shared memory machine augmented with an extra primitive \- the .q "replace-add". We show that this primitive enhances the model by exhibiting algorithms that are faster than any possible algorithm for solving the same problem on a shared memory machine.
In addition, several of the algorithms are faster than any known algorithm for solving the same problem on a shared memory machine. .)q .nE 27 Malvin H. Kalos, Gabi Leshem, and B.D. Lubachevsky, .q "Molecular Simulations of Equilibrium Properties" .(q A Monte Carlo algorithm for sampling representative configurations of a classical many-body system in equilibrium was implemented as a program for an asynchronous shared memory parallel computer. The algorithm is the iterative repetition of a cycle involving serial initialization and termination phases and a dominant computation in the middle that is suitable for parallelization. The efficiency of the parallel implementation is increased by allowing more than one such cycle to be treated concurrently. Two versions of the implementation are discussed. The first one uses only one parallel feature, the basic indivisible operation Fetch-and-Add, and has the necessary service subroutines built in. The second one employs the primitives Request, Release, and Suspend of a possible parallel operating system. Timing analysis of simulated executions predicts high efficiency of this algorithm for parallel computers that include up to thousands of processing elements. .)q .nE 28 Marc Snir, .q "`NETSIM' Network Simulator for the Ultracomputer" .(q This note describes an extension to WASHCLOTH that introduces extra delays for references to shared variables. The simulator is not intended to yield accurate measurements of the performance of an actual machine. A precise simulation of the communication network we intend to implement would be extremely time consuming. We made several simplifying assumptions, and used some rough estimates on the relative performance of different hardware components of the system. .qP In section two, we present the probabilistic model used. In section three, we detail the information used to drive the simulation, and then discuss the simplifying assumptions made.
A summary of results obtained follows, along with a discussion of these, and suggestions for future work. .)q .nE 29 Marc Snir, .q "Lower Bounds on VLSI Implementations of Communication Networks" .(q An O(n\*{2\*}) lower bound on the area required for VLSI implementations of permutation networks and concentrators is given. A tradeoff result for the bandwidth of packet switching networks and the area required to implement them is proven. .)q .nE 30 Malvin H. Kalos, .q "Scientific Computations on the Ultracomputer" .(q This note reviews very briefly the architecture of the NYU Ultracomputer, a highly parallel, shared memory MIMD machine. The use of the replace-add instruction for scheduling parallel tasks is also sketched. Finally, some experience in using the WASHCLOTH simulator to study some representative numerical programs is summarized. These include a Monte Carlo program for statistical mechanics; a program for solving the Navier-Stokes equation for blood flow; parallelization of Householder's algorithm for reduction of a symmetric matrix to tridiagonal form; and a program for modeling atmospheric flow in two dimensions. .)q .nE 31 David Korn, .q "Timing Simulations for Elliptic PDE's Run Under WASHCLOTH" .(q Two methods for solving three dimensional elliptic partial differential equations have been modified for parallel execution on a paracomputer and run under WASHCLOTH simulation. A brief discussion of these methods and timing results under simulation are presented in this note. .)q .nE 32 UCN 32 IS OUT OF PRINT. .nE 33 B.D. Lubachevsky, .q "Verification of Several Parallel Coordination Primitives Based on Descriptions of Their Reachability Sets" .(q A method for verifying parallel programs is applied to several examples (PV-semaphore, .q "busy-wait" synchronization, .q "cessation of activity" synchronization, .q "readers-writers").
The graph of all program states and state transitions is represented in a special compact form independent of the number N of processing elements. This representation aids in verifying certain correctness properties that cannot be easily expressed in the form .q "predicate(state)". In each of the above-mentioned examples a special .q "reachability tree" is developed whose nodes are some subsets of the set of all reachable states. The root is the initial state and moving down the tree corresponds to some processors advancing their execution. In the presented examples the size of this tree is independent of N. The notion of compact program is introduced: roughly speaking, a parallel program is compact if there exists a bound, independent of N, on the time required to reach any state. Examples of non-compact programs are presented. .)q .nE 34 Allan Gottlieb and Clyde P. Kruskal, .q "Coordinating Parallel Processors: A Partial Unification" .(q A variety of coordination primitives have been proposed for memory-sharing MIMD parallel processors. The most familiar of these primitives are probably test-and-set and swap. In this report, we show how a simple modification of the fetch-and-add coordination primitive recently proposed for the .q "NYU Ultracomputer" yields a hardware implementation technique that may also be used for test-and-set and swap. This technique enables one to support a wide range of operations on any network-based multiprocessor in a highly parallel manner. .)q .nE 35 B.D. Lubachevsky, .q "Review of Soviet Publications on Parallel Data Processing" .(q The aim of this paper is to review the available Soviet publications on the topic of parallel data processing published during 1978-1980. (A few articles published outside of this interval are also included as they were relevant.)
.)q .nE 36 Allan Gottlieb, .q "A Historical Guide to the Ultracomputer Literature" .(q When reading our reports, one should note that three distinct parallel processors are discussed: the .q "ultracomputer", a message passing machine; the .q "paracomputer", an idealized shared memory machine; and the .q "NYU Ultracomputer", a network-based shared memory machine that approximates a paracomputer. Moreover, since the NYU Ultracomputer has replaced the original ultracomputer as our preferred design, our later reports often .q "abbreviate" NYU Ultracomputer as Ultracomputer. Finally, our basic coordination primitive has been called NEWVAL, replace-add, and fetch-and-add. We realize that such highly context sensitive language may be confusing, and this guide is a (long overdue) attempt to help clarify the situation. In addition, by presenting a historical overview of the project I hope to group together those reports reflecting work done at roughly the same time (and hence using the same terminology). .)q .nE 37 B.D. Lubachevsky, .q "A Parallel Computer Implementation of the Ascend/Descend Types of Vector Algorithms" .(q An implementation of the ASCEND/DESCEND types of vector algorithms (like FFT) is adapted to the current model of the .q "NYU-Ultracomputer", a shared memory asynchronous parallel processor, including basic features of the proposed hardware and software. .)q .nE 38 Marc Snir, .q "Comments on Lens and Hypertrees \- or the Perfect-Shuffle Again" .(q Finkel and Solomon introduce the lens interconnection strategy, and claim it is the only one known to have small diameter, fixed degree, a simple routing algorithm, uniform traffic load, and redundancy. We show that the .q "de Bruijn" network is nothing less than the shuffle-exchange network of Stone in disguise.
.)q .nE 39 Marc Snir and Jon Solworth, .q "The Ultraswitch \- A VLSI Network Node for Parallel Processing" .(q A detailed description is given of the network connecting processors (PE's) to memory modules (MM's) in the NYU Ultracomputer, and of the switches used to realize this network. .qP The network is a packet switched multi-stage shuffle-exchange network. All the switches are identical, and built of a small number of chips: two monolithic integrated circuits and one small arbiter chip. .qP In addition to routing, these chips can queue information when the network is congested. A third function is the combining of like requests to the same memory address. This capability allows efficient concurrent access by many PE's to the same location in memory, and effectively distributes the support of synchronization primitives (Fetch-and-Add) over the entire network. .qP The logic of each individual switch is pipelined, and several systolic structures are used to achieve a high throughput. .)q .nE 40 UCN 40 HAS BEEN SUPERSEDED BY UCN 100. .nE 41 Clyde P. Kruskal and Marc Snir, .q "Some Results on Multistage Interconnection Networks for Multiprocessors" .(q This paper studies the performance of unbuffered and buffered, packet-switching, multistage interconnection networks. We begin by reviewing the definition of banyan networks and introducing some generalizations of them. We then present an asymptotic analysis of the performance of unbuffered banyan networks, thereby solving a problem left open by Patel. We analyze the performance of the unbuffered generalized banyan networks, and compare networks with approximately equivalent hardware complexity. Finally, we analyze the performance of buffered banyan networks and again compare networks with approximately equivalent hardware complexity. .)q .nE 42 B.D. Lubachevsky, .q "Parallelizing an Algorithm of Charles S.
Peskin for Describing Incompressible Flow of a Fluid Coupled to an Elastic Membrane" .(q An algorithm of Charles S. Peskin for simulation of incompressible flow of a fluid coupled to an elastic membrane was implemented as a program for an asynchronous shared memory parallel computer, the .q "NYU-Ultracomputer". The algorithm is an iterative repetition of a cycle consisting of recomputing the position of the membrane and the velocity of the fluid for successive discrete times. Each iteration requires solving the Navier-Stokes equations for the fluid, using an FFT and algorithms for solving linear equations, and also includes an iterative minimization of a tension functional for recomputation of the new position of the membrane. All these phases are parallelized by spawning a number of independent tasks and waiting for all of them to terminate. The program includes subroutines REQUEST (INDEX) and SYNCH which provide the coordination necessary for such a spawning. Timing analysis of simulated executions predicts high efficiency of this algorithm for parallel computers that include up to hundreds of processing elements in the 2-D case and up to thousands of processing elements in the 3-D case. .)q .nE 43 Ronald Bianchini and Ronald Bianchini, Jr., .q "Wireability of an Ultracomputer" .(q We analyze the wireability of an Ultracomputer, a shared-memory MIMD parallel machine with thousands of processing elements, wired using an Omega connection network. We show a way of structuring hardware involving large numbers of processors into smaller subsets of processors and switches. A design with 4096 processors, based on present-day wiring technology and electrical power considerations, is presented. This design assumes that a 2 by 2 switch on a chip is available, as well as a 1 megabit memory chip and a high speed microprocessor chip with 32 bit addressing.
.)q .nE 44 Marc Snir, .q "On Parallel Searching" .(q We investigate the complexity of searching a table of n elements by comparisons on a synchronous, shared memory parallel computer with p processors. We show that \(*h(lg\|n) steps are required if concurrent accesses to the same memory cell are not allowed, whereas only \(*h(lg\|n\|/lg\|p) steps are required if simultaneous reads are allowed. We next show that these lower bounds are still valid even if only communication steps are counted, and even if polynomial comparisons are allowed. .)q .nE 45 B.D. Lubachevsky, .q "An Approach to Automating the Verification of Compact Parallel Coordination Programs" .(q A class of parallel coordination programs for a shared memory asynchronous parallel processor is considered. They use the primitive operation Fetch-and-Add and differ in various respects from those which use the basic P and V operations. The F&A is the basic primitive for the .q "NYU-Ultracomputer". A correctness proof for the considered programs must be done for an arbitrary number N of processing elements, since the Ultracomputer design includes thousands of PEs. .)q .nE 46 Malvin H. Kalos and Y-Q. Zhong, .q "Monte Carlo Transports on an Ultracomputer" .(q This report summarizes the results of experiments in adapting transport Monte Carlo programs to an NYU Ultracomputer. Two fixed source (one in adjoint mode) and two eigenvalue problems were studied. In general, the required program is very straightforward: the shared memory MIMD architecture seems ideal for this application. Certain secondary questions related to efficiency loss owing to the straggling of histories are analyzed, and practical remedies are suggested and tested. The issue of repeatability of the Monte Carlo is solved with the device of having each process seed its successors. Finally, the probability that different pseudo-random sequences overlap in part is analyzed and recommendations are made for acceptable periods.
.)q .nE 47 Herbert Bernstein and Max Goldstein, .q "RINSE \- What Follows WASHCLOTH" .(q This note describes a simulator for a parallel processor based on Gottlieb's WASHCLOTH simulator. RINSE includes both the replace-add synchronization primitives of WASHCLOTH and the full-empty synchronization primitives of the Denelcor HEP, to allow an exploration of the actual effect of the choice of synchronization techniques. A linear equation solver was tested. In this case the degradation due to use of full-empty synchronization was found to occur with a number of PEs at which overall efficiency was also declining sharply. .)q .nE 48 Malvin H. Kalos, .q "The NYU Ultracomputer" .(q The architecture of the NYU Ultracomputer is outlined. This is a proposed MIMD parallel machine comprising thousands of independently programmable fast processing elements connected to a large shared memory by a switching network. The latter has the geometry of an Omega-network, with switches each having enough memory and processing power to yield a bandwidth proportional to the number of processors. The network, which is pipelined, also supports the parallel execution of the .q "fetch-and-add" primitive important for process coordination. Questions relating to operating systems, programmability, and efficiency of applications codes are also discussed. .)q .nE 49 B.D. Lubachevsky, .q "An Approach to Automating the Verification of Compact Parallel Coordination Programs, II" .(q This paper describes an algorithm for building reachability set descriptions for compact parallel programs. The notion of a controlled vector addition system (CVAS), which generalizes that of a vector addition system, is introduced. An algorithm which exhausts all the reachable states of a CVAS is described. It is proved that if the program to which the CVAS corresponds is normal, then termination of this algorithm is equivalent to the compactness of the program.
.)q .nE 50 Uzi Vishkin, .q "On Choice of A Model of Parallel Computation" .(q The problem of choosing a theoretical abstract model of parallel computation to be simulated by an exclusive-read, exclusive-write parallel RAM or a synchronous distributed machine is discussed. The principle of choosing the most permissible model of parallel computation as long as the cost of computational resources does not increase is applied. .qP We prove two theorems. The first (resp. second) theorem asserts that for every exclusive-read exclusive-write parallel RAM (resp. synchronous distributed machine) and every simulation of the concurrent-read exclusive-write parallel RAM model of computation into this machine, there exists a simulation of the Fetch-and-Add parallel RAM into the same machine that uses the same order of computational resources. This implies the choice of a Fetch-and-Add parallel RAM model of computation. .qP A promising methodological notion, called Execution Reducibilities, is exercised in the proof of the main theorems. .)q .nE 51 Robert Tarjan and Uzi Vishkin, .q "O(log n) and Optimal Parallel Biconnectivity" .(q In this paper we propose a new algorithm for finding the blocks (biconnected components) of an undirected graph. A serial implementation runs in O(n + m) time and space on a graph of n vertices and m edges. A parallel implementation runs in O(log n) time and O(n + m) space using O(n + m) processors on a concurrent-read, concurrent-write parallel RAM. An alternative implementation runs in $O(n sup 2 /p)$ time and $O(n sup 2 )$ space using a number $p^<=^ n sup 2 /log sup 2 n$ of processors, on a concurrent-read, exclusive-write parallel RAM. The latter algorithm has optimal speed-up, assuming an adjacency matrix representation of the input. .)q .nE 52 W. Paul, Uzi Vishkin and H.
Wagener, .q "Parallel Computation on 2-3 Trees" .(q Our model of computation is a parallel computer with k synchronized processors $P sub 1 ,..., P sub k$ sharing a common random access storage, where simultaneous access to the same storage location by two or more processors is not allowed. Suppose a 2-3 tree T with n leaves is implemented in the storage, suppose $a sub 1 ,..., a sub k$ are data that may or may not be sorted in the leaves, and for all i, 1 p) shared memory locations is given. The problem of simulating this algorithm by p processors and only p shared memory locations without increasing the running time by more than a constant factor is considered. A solution for a family of such parallel algorithms is given. The solution utilizes the idea of dynamically changing locations of the addresses of the algorithm throughout the simulation. .)q .nE 55 David Korn and Norman Rushfield, .q "WASHCLOTH Simulation of Three-Dimensional Weather Forecasting Codes" .(q This note describes the method used, and the results obtained, for simulating an existing weather forecast program of Rivas with the WASHCLOTH simulator. The WASHCLOTH simulator is used to investigate how FORTRAN programs would behave on an idealized MIMD machine (the paracomputer) which will be approximated by the NYU Ultracomputer. This memo sheds light on the problems encountered in taking existing codes and running them on the Ultracomputer. .)q .nE 56 Uzi Vishkin and Avi Wigderson, .q "Trade-Offs Between Depth and Width in Parallel Computation" .(q A new technique for proving lower bounds for parallel computation is introduced. This technique enables us to obtain, for the first time, non-trivial tight lower bounds for shared-memory models of parallel computation that allow several processors to have simultaneous access to the same memory location. Specifically, we use a concurrent-read, concurrent-write model of parallel computation. 
It has \fIp\fP processors, each with access to a common memory of size \fIm\fP (also called communication width, or width in short). The input to the problem is located in an additional read-only portion of the common memory. .qP For a wide variety of problems (including parity, majority and summation) we show that the time complexity \fIT\fP (depth) and the communication width \fIm\fP are related by the trade-off curve $mT sup 2~=~ OMEGA (n)$ (where \fIn\fP is the size of the input), regardless of the number of processors. Moreover, for every point on this curve with $m~=~O(n/log sup 2 n)$ we give a matching upper bound with the optimal number of processors. .qP We extend our technique to prove an $mT sup 3 ~=~ OMEGA (n)$ trade-off for a class of .q "simpler" functions (including Boolean OR) on a weaker model that forbids simultaneous write access. We also state and give proof of a new result by Beame that achieves a tight lower bound for the OR in this model, namely $mT sup 2 ~=~ OMEGA (n)$. These results improve the lower bound of Cook and Dwork when communication is limited. .)q .nE 57 Malcolm C. Harrison, .q "An Ultracomputer Switch Design Using Circuit and Packet Switching" .(q In this note we show that an 8 by 8 switch, which will perform the function of 12 2 by 2 switches, can be built using 16 59-pin chips, 18 58-pin chips and 4 49-pin chips. This is better than 3 59-pin chips per 2 by 2 switch. If 296-pin chips are available, only 4 chips would be required for the 8 by 8 switch, less than 1/3 chip per 2 by 2 switch. .qP The 8 by 8 switch discussed is compatible with the proposed Ultracomputer physical design, and also with currently available chip packaging. It also appears to be competitive with pin counts up to several hundred, though with substantial increases in pin counts, 16 by 16 or larger switches may also be feasible. 
.)q .nE 58 Uzi Vishkin, .q "A Parallel Design Distributed Implementation (PDDI) General Purpose Computer" .(q A scheme of an efficient general-purpose parallel computer is introduced. Its design space (i.e., the model for which parallel programs are written) is a permissive parallel RAM model of computation. The implementation space is presented as a scheme of a synchronous distributed machine which is not more involved than a sorting network followed by a merging network. An efficient translation from the design space into the implementation space is given. Suppose for some t and x there is a parallel algorithm in the design space which has depth (i.e., parallel time) O(t/p) using p processors for all $p <= x$. Lower bounds showing that for $p >= n$, the time complexity of this problem is $ THETA ( {log ~n} over {log (1+p/n)} ) $ are presented. This complements [AKS-83] in settling the problem since the AKS sorting network established that for $p <= n$ the time complexity is $ THETA ( {n log n} over p )$. To prove the lower bounds we show that to achieve $k ~<=~ log n$ parallel time, we need $ OMEGA ( n sup 1+1/k )~~ $ processors. .)q .nE 104 Malcolm C. Harrison, .q "The Add-and-Lambda Operation: An Extension of F&A" .(q This note describes a proposed extension to the architecture of shared memory multiprocessors with combining fetch-and-add operations, such as the NYU Ultracomputer and the IBM RPn. The extension involves addition of a small amount of hardware between the network and the memory, which permits the efficient implementation of a number of parallel operations. Examples are given. .)q .nE 105 Clyde P. Kruskal, Larry Rudolph, and Marc Snir, .q "Efficient Synchronization on Multiprocessors with Shared Memory" .(q A new formalism is given for read-modify-write (RMW) synchronization operations. This formalism is used to extend the memory reference combining mechanism, introduced in the NYU Ultracomputer, to arbitrary RMW operations. A formal correctness proof of this combining mechanism is given. 
General requirements for the practicality of combining are discussed. Combining is shown to be practical for many useful memory access operations. This includes memory updates of the form \fImem_val := mem_val op val\fP, where $op$ need not be associative, and a variety of synchronization primitives. The computation involved is shown to be closely related to parallel prefix evaluation. .)q .nE 106 Clyde P. Kruskal and Marc Snir, .q "A Unified Theory of Interconnection Network Structure" .(q The relationship between the topology of interconnection networks and their functional properties is examined. Graph theoretical characterizations are derived for delta networks, which have a simple routing scheme, and for bidelta networks, which have the delta property in both directions. Delta networks are shown to have a recursive structure. Bidelta networks are shown to have a unique topology. The definition of bidelta network is used to derive in a uniform manner the labeling schemes that define the omega networks, indirect binary cube networks, flip networks, baseline networks, modified data manipulators, and two new networks; these schemes are generalized to arbitrary radices. .br .sp .5 The labeling schemes are used to characterize networks with simple routing. In another paper, we characterize the networks with optimal performance/cost ratio. Only the multistage shuffle-exchange networks have both optimal performance/cost ratio and simple routing. This helps explain why few fundamentally different geometries have been proposed. .)q .nE 107 Susan Flynn, Edith Schonberg, and Edmond Schonberg, .q "The Efficient Termination of Ada Tasks in a Distributed Environment" .(q An efficient implementation of Ada task termination for shared memory MIMD architectures is presented. The solution is highly parallel and distributed; termination is not synchronized by a centralized supervisor, and there are no critical sections. 
The handling of block and subprogram termination that depends on nested tasks is also included. This implementation can be easily adapted to message based systems. .)q .nE 108 Richard Cole and Uzi Vishkin, .q "The Accelerated Centroid Decomposition Technique for Optimal Parallel Tree Evaluation In Logarithmic Time" .(q A new general parallel algorithmic technique for computations on trees is presented. The new technique performs a reduction of the tree expression evaluation problem to list ranking; then, the list ranking provides a schedule for evaluating the tree operations. The technique needs logarithmic time using an optimal number of processors and has applications to other tree problems. .br This new technique enables us to systematically order four basic ideas and techniques for parallel algorithms on trees: (1) The list ranking problem. (2) The Euler tour technique on trees. (3) The centroid decomposition technique. (4) The new \fIaccelerated centroid decomposition (ACD)\fR technique. .)q .nE 109 Richard Cole and Colm O'Dunlaing, .q "Notes on the AKS Sorting Network" .(q The Ajtai-Komlo\*'s-Szemere\*'di sorting network, as improved and simplified by Michael S. Paterson, is described. .)q .nE 110 Richard Cole and Uzi Vishkin, .q "Approximate Parallel Scheduling: Part I: The Basic Technique with Applications to Optimal Parallel List Ranking in Logarithmic Time" .(q We define a novel scheduling problem; it is solved in parallel by repeated, rapid, approximate reschedulings. This leads to the first optimal logarithmic time PRAM algorithm for list ranking. Companion papers show how to apply these results to obtain improved PRAM upper bounds for a variety of problems on graphs, including: connectivity, biconnectivity, minimum spanning tree, Euler tour and st-numbering, and a number of problems on trees. .)q .nE 111 Gad M. 
Landau, Baruch Schieber and Uzi Vishkin, .q "Parallel Construction of a Suffix Tree" .(q Weiner's [We-73] suffix tree is known to be a powerful tool for string manipulations. We present a parallel algorithm for constructing a suffix tree. The algorithm runs in O(log n) time and uses n processors. We also present applications for designing efficient parallel algorithms for several string problems. .)q .nE 112 Isaac Dimitrovsky, .q "A Group Lock Algorithm, With Applications" .(q In this note I introduce the concept of a group lock. This is a generalization of P and V that can be used in writing asynchronous parallel algorithms. I then present an algorithm for group locking that has a nice time complexity property. This algorithm is used to define several fast and simple parallel algorithms. Among these are algorithms for a parallel queue, a parallel stack, atomic read/writes on multiword records, a parallel heap, and a parallel dictionary. .)q .nE 113 Ora E. Percus, .q "A Note on Random Number Generator of Chung et al." .(q We derive the probability distribution $P "{" X sub n ~=~ x "}"$ for a random walk satisfying $X sub 0 ~=~ 0;~ P "{" X sub {n + 1} ~=~ (2X sub n ~+~ 1 )~roman mod ~ p "}" ~=~ P "{" X sub {n+1} ~=~ (2X sub n - 1) ~roman mod~p "}" ~=~ 1 over 2 ~for ~ n ~ >= ~ 0$ and $p$ an odd integer. .)q .nE 114 Ora E. Percus and Malvin H. Kalos, .q "Random Number Generators for Ultracomputers" .(q We discuss and analyze issues related to the design of pseudorandom number generators (prn's) for MIMD (multiple instruction stream/multiple data stream) parallel processors, which are very well suited for Monte Carlo calculations. We are concerned to ensure reproducibility of runs, to provide very long sequences, and to assure an adequate degree of independence of the parallel streams. 
We consider the class of linear congruential generators .EQ x sub { n + 1, i } ~==~ a x sub {n, i} + b sub i ~~ roman mod~ m .EN and analyze the effect that different choices of $b sub i $ have on the correlation properties between such streams. We derive a spectral test $nu sub t$ for $t$ parallel linear congruential generators, a modification of Knuth's Algorithm \fBS\fP. From this, we prove a good lower bound for $nu sub 2 sup * = roman min from { roman "all pairs" (i,j) }~~nu sub {2} (i,j)$ for certain choices of $b sub {i} 's$. The set of the largest $r$ primes $p sub i $, $ i=1, ...,r$ satisfying $p sub {i}~<~sqrt {m/2} $, where $m$ is the period length of each generator gives a lower bound $ O ( m sup half )$ to the correlation between a pair of corresponding elements in any two streams. An alternative choice, $b sub i~=~ d sup i ~roman mod~ m$ for $ d ~=~ m sup half + 1$ gives a bound $ O (m sup half / (t-1) ) $ which will be satisfactory for small numbers of streams. Finally, we construct a spectral test for correlations between $x sub n,i$ and $x sub {n+k, i+l}$, but derive no analytic prescriptions from it. .)q .nE 115 Richard Cole, .q "Parallel Merge Sort" .(q We give a parallel implementation of merge sort on a CREW PRAM that uses $n$ processors and $O( log^n)$ time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses $n$ processors and $O( log^n)$ time. The constant in the running time is still moderate, though not as small. .)q .nE 116 Malcolm C. Harrison, Nikolaos Markantonatos and Georgios Papadopoulos, .q "UltraProlog" .(q UltraProlog is a dialect of Prolog designed for shared memory parallel computers such as the NYU Ultracomputer and the IBM RPx, and is being implemented on the NYU Ultracomputer. UltraProlog is similar in spirit to Concurrent Prolog, Parlog and Guarded Horn Clauses. This report defines the language, and describes the implementation approach. 
.)q .nE 117 Richard Cole and Uzi Vishkin, .q "Faster Optimal Parallel Prefix Sums and List Ranking" .(q We present a parallel algorithm for the prefix sums problem which runs in time $O( log ~n ~/~ log log ~n)$ using $n log log ~n ~/~ log ~n$ processors (optimal speed up). This algorithm leads to a parallel list ranking algorithm which runs in $O( log~n)$ time using $n/ log~n$ processors (optimal speed up). .)q .nE 118 Baruch Schieber and Uzi Vishkin, .q "On Finding Lowest Common Ancestors: Simplification and Parallelization" .(q We consider the following problem. Suppose a rooted tree $T$ is available for preprocessing. Answer on-line queries requesting the lowest common ancestor for any pair of vertices in $T$. We present a linear time and space preprocessing algorithm which enables us to answer each query in $O(1)$ time, as in Harel and Tarjan [HT-84]. Our algorithm has the advantage of being simple and easily parallelizable. The resulting parallel preprocessing algorithm runs in logarithmic time using an optimal number of processors on an EREW PRAM. Each query is then answered in $O(1)$ time using a single processor. .)q .nE 119 Richard Cole and Uzi Vishkin, .q "Approximate Parallel Scheduling. Part II: Applications to Optimal Parallel Graph Algorithms in Logarithmic Time" .(q Part I of this paper presented a novel technique for approximate parallel scheduling and a new logarithmic time optimal parallel algorithm for the list ranking problem. In this part, we give a new logarithmic time parallel (PRAM) algorithm for computing the connected components of undirected graphs which uses this scheduling technique. The connectivity algorithm is optimal unless $m~=~o(n~log sup * n)$, in graphs of $n$ vertices and $m$ edges. Using known results, this new algorithm implies logarithmic time optimal parallel algorithms for a number of other graph problems, including: biconnectivity, Euler tours, strong orientation and $st$-numbering. 
Another contribution of the present paper is a parallel union/find algorithm. .qP To the best of our knowledge, this paper presents for the first time optimal, logarithmic time, parallel algorithms for problems on graphs. .)q .nE 120 Vladimir Fleygakker, .q "On Some Parallel Sieve of Prime Number Generation" .(q This short report addresses the question of using the NYU Ultracomputer for the generation of prime numbers. .)q .nE 121 Ora E. Percus and Jerome K. Percus, .q "Long Range Correlations in Linear Congruential Generators" .(q Every linear congruential random number generator that uses a power of two as modulus introduces correlations among elements separated sequentially by powers of two. In particular, we have the following Theorem: .qP Let .EQ x sub {n+1} ~=~ ax sub n + b ~~ roman mod~ 2 sup beta .EN where .EQ a ~==~ 1~ roman mod~ 2 sup alpha~~ roman and ~~a ~!=~ 1 ~roman mod ~2 sup {alpha +1} ~~ roman and ~~beta > alpha > 1. .EN .qP Then for every $k$ the linear relation .EQ sum from j=0 to p ~ c sub j x sub {n + j 2 sup k} ~==~ 0 ~ roman mod~ 2 sup beta .EN holds, where $c sub j ~=~ (-1) sup j ~left ( pile { p above j } right )$ and $p$ is the smallest integer $>= lpile { {( beta + alpha )/( k + alpha )} above {~beta / k + alpha} } $ $~~rpile { { roman if~ b ~roman "is odd" } above {roman if~ b~=~0} } $ .)q .nE 122 Kaizhong Zhang and Dennis Shasha, .q "On the Editing Distance Between Trees and Related Problems" .(q Since a tree can represent a scene description, a grammar parse, a structural description, and many other phenomena, comparing trees is a way to compare scenes, parses and so on. We consider the distance between two trees to be the (weighted) number of edit operations (insert, delete, and modify) to transform one tree to another. Then, we consider the following kinds of questions: .br 1. What is the distance between two trees? .br 2. What is the minimum distance between $ T sub 1 $ and $ T sub 2 $ with an arbitrary subtree removed? 
What if zero or more subtrees can be removed from $ T sub 2 $? A specialization of these algorithms solves the problem of approximate tree matching. .br 3. Given k, are $ T sub 1 $ and $ T sub 2 $ within distance k of one another and if so what is their distance? .qP We present a postorder dynamic programming algorithm to solve question 1 in sequential time $O( |T sub 1 | \(mu | T sub 2 | \(mu depth(T sub 1 ) \(mu depth(T sub 2 ))$ compared with $O( |T sub 1 | \(mu | T sub 2 | \(mu (depth(T sub 1 )) sup 2 \(mu (depth(T sub 2 )) sup 2 )$ for the best previous algorithm due to Tai. Further, our algorithm can be parallelized to give time $O( |T sub 1 | + | T sub 2 | ).$ Our approach extends to answer question 2 giving an algorithm of the same complexity. For question 3, a variant of our distance algorithm yields an $O( k sup 2 \(mu ~ min (|T sub 1 |, | T sub 2 | ) \(mu ~ min ( depth(T sub 1 ), depth(T sub 2 ) ))$ algorithm. .)q .nE 123 Robert A. Hummel and Kaizhong Zhang, .q "Dynamic Processor Allocation for Parallel Algorithms in Image Processing" .(q Because image processing is numerically intensive, there has been much interest in parallel processing for image analysis applications. While much of low-level vision can be attacked by SIMD mesh-connected architectures, intermediate and high-level vision applications might be able to make effective use of MIMD and distributed architectures. We have taken a standard parallel connected components algorithm, and applied it to image segmentation using an MIMD architecture. The resulting version of the Shiloach/Vishkin algorithm runs on the prototype NYU Ultracomputer. We will describe the implementation and the results of some experiments. We take note of the lesson learned from this implementation: that processor power should be focused dynamically on those portions of the image requiring greatest attention. We then consider the implications of this lesson for other image processing tasks. 
.)q .nE 124 Dennis Shasha, Vladimir Lanin and Jeanette Schmidt, .q "An Analytical Model for the Performance of Concurrent B Tree Algorithms" .(q A dictionary is an abstract data type supporting the actions search, insert, and delete. Search structures are data structures used to implement a dictionary, e.g. B trees, hash structures, grid files, etc. Many concurrent algorithms for search structures have been proposed, but relatively little work has addressed their performance. This paper proposes an analytic performance model for a range of concurrent algorithms on B+ trees. It uses this model to propose a high performance, simple-to-program variant of lock-coupling algorithms. .qP The model actually consists of two mutually exclusive models: a .q "constant flow" model in which the throughput is assumed to remain at its average value; and a .q "bursty flow" model in which the throughput is predicted to oscillate with a certain frequency and amplitude based on the number of processes, the height of the tree, and the size of the nodes. We find that the data points from simulation fall between these two predictive models, tending towards the bursty flow one as the number of processors increases. Inasmuch as the analytical model characterizes B+ trees in terms of fanout, height, probability of restructuring, and node size, it does not depend on the details of B+ trees and so can be extended, at least in spirit, to other structures. .)q .nE 125 Susan Dickey, Allan Gottlieb, Richard Kenner, and Yue-Sheng Liu, .q "Designing VLSI Network Nodes To Reduce Memory Traffic in a Shared Memory Parallel Computer" .(q Serialization of memory access can be a critical bottleneck in shared memory parallel computers. The NYU Ultracomputer, a large-scale MIMD (multiple instruction stream, multiple data stream) shared memory architecture, may be viewed as a column of processors and a column of memory modules connected by a rectangular network of enhanced 2 x 2 buffered crossbars. 
These VLSI nodes enable the network to combine multiple requests directed at the same memory location. Such requests include a new coordination primitive, fetch-and-add, which permits task coordination to be achieved in a highly parallel manner. Processing within the network is used to reduce serialization at the memory modules. .qP To avoid large network latency, the VLSI network nodes must be high-performance components. Design tradeoffs between architectural features, asymptotic performance requirements, cycle time, and packaging limitations are complex. This report sketches the Ultracomputer architecture and discusses the issues involved in the design of the VLSI enhanced buffered crossbars which are the key element in reducing serialization. .)q .nE 126 Anne Greenbaum, Congming Li, and Han Zheng Chao, .q "Comparison of Linear System Solvers Applied to Diffusion-Type Finite Element Equations" .(q Various iterative methods for solving the linear systems associated with finite element approximations to self-adjoint elliptic differential operators are compared based on their performance on serial and parallel machines. The methods studied are all preconditioned conjugate gradient methods, differing only in the choice of preconditioner. The preconditioners considered arise from diagonal scaling, incomplete Cholesky decomposition, hierarchical basis functions, and a Neumann-Dirichlet domain decomposition technique. The hierarchical basis function idea is shown to be especially effective on both serial and parallel architectures. .)q .nE 127 Mikhail J. Atallah, Richard Cole, and Michael T. Goodrich, .q "Cascading Divide-and-Conquer: A Technique for Designing Parallel Algorithms" .(q We present techniques for parallel divide-and-conquer, resulting in improved parallel algorithms for a number of problems. The problems for which we give improved algorithms include intersection detection, trapezoidal decomposition, and planar point location. 
We also give efficient parallel algorithms for fractional cascading, 3-dimensional maxima, 2-set dominance counting, and visibility from a point. All of our algorithms run in \fIO\fR(log \fIn\fP) time with either a linear or sub-linear number of processors in the CREW PRAM model. .)q .nE 128 Richard Cole and Ofer Zajicek, .q "An Optimal Parallel Algorithm for Building A Data Structure for Planar Point Location" .(q Using a generalization of the Approximate Parallel Scheduling of Cole and Vishkin [CoV87a], we give an optimal parallel algorithm for generating a linear size data structure for planar point location that supports an \fIO\fR(log\fI n\fR) query time. Our algorithm runs on an EREW PRAM in time \fIO\fR(log\fI n\fR) using \fIn\fR/log \fIn\fR processors. .)q .nE 129 Patricia J. Teller, Richard Kenner, and Marc Snir, .q "TLB Consistency on Highly-Parallel Shared-Memory Multiprocessors" .(q Multiprocessors that store the same shared data in different private caches must ensure these caches have consistent copies. Almost all known solutions to this \fIcache consistency\fR problem are only suitable for architectures with a few tens of processors (PEs). Efficient solutions to the \fITLB (translation lookaside buffer) consistency problem\fR, a special case of the cache consistency problem, can be found for \fIhighly-parallel, shared-memory multiprocessors\fR (HPSMMs) with many hundreds of PEs for the following reasons: the number of references to address translation information per modification is very large; the cache for storing translation information can be present anywhere on the path from the PEs to memory; when the memory mapping needs to be modified, one can often select which translation information to change; and obsolete mapping information can be used until permanent changes must be made. We present three general methods that exploit these features and can be used on HPSMMs to maintain TLB consistency. 
Tradeoffs are discussed and are related to overall system performance. Some interesting issues inherent to the TLB consistency problem and the support of a demand-paged virtual memory system on an HPSMM are also presented. Our methods demonstrate the use of alternative implementations of the classical readers-writers algorithm. .)q .nE 130 Marsha Berger, .q "Adaptive Mesh Refinement for Parallel Processors" .nE 131 Maksymilian Dryja and Olof Widlund, .q "An Additive Variant of the Schwarz Alternating Method for the Case of Many Subregions" .(q In recent years, there has been a revival of interest in the Schwarz alternating method. Other domain decomposition algorithms, in particular the so-called iterative substructuring methods, have also been developed to solve elliptic finite element problems. In this paper, we present an additive variant of the Schwarz method for elliptic problems, which shows great promise for parallel computers. By using techniques previously developed for iterative substructuring methods, we are able to show that this method converges quite rapidly even when the region is divided into many subregions. We note that, as is the case with other fast iterative methods for many subregions, a mechanism for the global transportation of information is necessary in order to obtain fast convergence. .)q .nE 132 Yue-Sheng Liu, .q "Delta Network Performance and \*(lqHot Spot\*(rq Traffic" .(q In this paper, we review previous work on both buffered and unbuffered delta network performance analysis. We then analyze the delta network performance under .q "hot spot" traffic and give a solution to the single .q "hot spot". .)q .nE 133 Theodore Johnson, .q "Modifying Two-Phase Locking to Improve Performance" .(q Studies of concurrency control performance suggest that Two-Phase Locking (2PL) suffers from data-contention thrashing because transaction blocking is uncontrolled, allowing blocked transactions to block other transactions. 
Our algorithm improves the performance of 2PL in certain situations by aborting blocked transactions that block other transactions. This paper presents an analysis that validates this claim. The new algorithm gives better performance if aborting transactions is not too expensive. .)q .nE 134 Yehuda Afek, Gad M. Landau, Baruch Schieber, and Moti Yung, .q "The Power of Multimedia: Combining Point-to-Point and Multiaccess Networks" .(q In this paper we introduce a new network model called a .i "multimedia network". It combines the point-to-point message passing network and the multiaccess channel. To benefit from the combination we design algorithms which consist of two stages: a local stage which utilizes the parallelism of the point-to-point network and a global stage which utilizes the broadcast capability of the multiaccess channel. As a reasonable approach, one wishes to balance the complexities of the two stages by obtaining an efficient partition of the network into $O( sqrt n )$ connected components each of radius $O ( sqrt n )$. To this end we present efficient deterministic and randomized partitioning algorithms. The deterministic algorithm runs in $O ( sqrt n ~ log sup * ~n ) $ time and $ O (m ~+~ n ~ log n ~log sup * n )$ messages, where $m$ and $n$ are the number of links and nodes in the point-to-point part of the network. The randomized algorithm runs in the same time but sends $O ( m ~+~ n ~ log sup * n )$ messages. The partitioning algorithms are then used to obtain $O ( sqrt n ~ log~ n )$ time deterministic and $O ( sqrt n ~ log~n ) $ time randomized algorithms, for computing .i "global sensitive" functions and a minimum spanning tree. .sp .5 An $OMEGA (n)$ time lower bound for computing global sensitive functions in both point-to-point and multiaccess networks is given, thus showing that the multimedia network is more powerful than both its separate components. 
Furthermore, we prove an $OMEGA ( sqrt n ) $ time lower bound for multimedia networks, thus leaving a small gap between our upper and lower bounds. .)q .nE 135 Jan Edler, Jim Lipkis, and Edith Schonberg, .q "Memory Management in Symunix II: A Design for Large-Scale Shared Memory Multiprocessors" .(q While various vendors and independent research groups have adapted \s-2UNIX\s0 and other operating systems for multiprocessor architectures, relatively little work has been done in anticipation of the software requirements of very large-scale shared memory machines containing thousands of processors. Programming environments for these machines must exploit multi-level memory and cache hierarchies so as to obtain the maximum performance available from the hardware architecture. Several years of experience at New York University with parallel systems and programming environments have led to a memory management design emphasizing flexible control over arbitrarily complex patterns of memory sharing and caching. This design integrates user memory management, virtual memory management (swapping/paging), program debugging and monitoring, the file system and buffer cache, and checkpoint/restart in a synergistic manner that extends the existing strengths of \s-2UNIX\s0. It is being implemented for two research multiprocessor prototypes, the NYU Ultracomputer and IBM RP3. .)q .nE 136 Jan Edler, Jim Lipkis, and Edith Schonberg, .q "Process Management for Highly Parallel \s-2UNIX\s0 Systems" .(q Despite early development exclusively on uniprocessors, a growing number of \s-2UNIX\s0 systems are now available for shared memory (MIMD) multiprocessors. While much of this trend has been driven by the general success of the \s-2UNIX\s0 interface as an emerging industry standard, experience has shown that the basic \s-2UNIX\s0 design is amenable to such environments. Relatively simple extensions such as shared memory and synchronization mechanisms suffice for many parallel programs. 
.sp .5 While simple needs can be satisfied in a simple fashion, the desire to support more sophisticated applications has created pressure for ever more complex extensions. Is there a better way to meet such needs? Although some argue that it is time to abandon the \s-2UNIX\s0 model completely, we believe that viable alternatives exist within the traditional framework. .sp .5 In this paper we propose several modifications to the process management facilities of the \s-2UNIX\s0 kernel. Some of them are primarily of interest for parallel processing, such as a generalized \fIfork\fP system call that can efficiently create many processes at once, while others are equally attractive in other contexts, such as mechanisms for improved I/O and IPC performance. While the primary goals are improved performance and reliability, a strong aesthetic judgement is applied to create a total design that is cohesively integrated. .sp .5 While the concepts presented here are applicable to any \s-2UNIX\s0 environment, they have been conceived in the context of very large scale parallel computing, with hundreds or thousands of processors. An initial implementation of these extensions is currently underway for the NYU Ultracomputer prototype and the IBM RP3. .)q .nE 137 Wayne Berke, .q "\fBParFOR\fP \- A Structured Environment for Parallel FORTRAN" .(q We present an extension to the FORTRAN language that allows the user to specify parallelism by means of clearly defined, nestable blocks. The implementation achieves compiler-independence through a portable preprocessor. High performance is obtained by \fIprespawning\fP processes and relying on a set of run-time routines to manage a self-scheduling allocation scheme. The resulting system, which we call ParFOR, lends itself to the exploitation of fine-grained parallelism because of its low scheduling overhead. It encourages the elimination of explicit process synchronization, thereby enhancing the readability of the source program. 
In addition, ParFOR provides a variety of modes and compile-time options that are useful for performance measurement and debugging. Finally, we present an evaluation of system efficiency including timing results for several parallel applications running on the eight-processor Ultracomputer prototype. .)q .nE 138 Ora E. Percus and Jerome K. Percus, .q "Elementary Properties of Clock-regulated Queues" .(q A single clock-regulated queue is fed by Bernoulli streams of messages. We obtain the distributions of queue length, of waiting time, and of output for such a system. This is generalized as well to the case of a queue of finite capacity. .)q .nE 139 Malcolm C. Harrison, .q "Add-and-Lambda II: Eliminating Busy Waits" .(q This paper describes a further extension of the combining fetch-and-add used in shared memory multiprocessors such as the NYU Ultracomputer and the IBM RPn. This extension permits the elimination of busy-waiting in a number of important parallel operations. These operations include barrier synchronization, stack and queue operations, full/empty bits, and group lock; most of these operations become one or two single-instruction operations for the PE. The hardware necessary is compatible with the design of the switches which are used in the Ultracomputer; in fact the operations are implemented as fetch-and-add instructions with an increment of unity. .)q .nE 140 Lori S. Grob, .q "Automatic Exploitation of Concurrency in C: Is it Really So Hard?" .(q The vast majority of work being done today on the automatic exploitation of concurrency, for both multiprocessor and vector machines, is not being done for C. Yet there are many important applications, written in C, which would benefit immensely from the speedup produced by such techniques. Many more applications would be written in C were there optimizing, vectorizing and parallelizing compilers for the language.
.sp .5 In this paper we consider whether it is possible to have compilers that will automatically exploit concurrency in C. We discuss the relationship between automatic exploitation of concurrency for the purposes of vectorizing and multiprocessing. We review the basic techniques and transformations used and examine the necessary conditions to perform these transformations, with examples in C. Several elements of the C language and programming style, such as pointers and recursion, make it difficult to do the necessary data flow analysis. There are two possible approaches to this problem: to bypass code or blocks of code that contain \*(lqdifficult\*(rq features and be unable to apply optimizations to these fragments, or to suitably restrict the language. We examine the choices made by the few available vectorizing and parallelizing C compilers and consider what the future may hold in the light of current research. .)q .nE 141 Yue-Sheng Liu and Susan Dickey, .q "Simulation Analysis of Different Switch Architectures for Interconnection Networks in MIMD Shared Memory Machines" .(q Differences in switch architectures can have a significant effect on both latency and throughput in interconnection networks, and thus can affect overall performance in shared memory parallel systems with processor to memory interconnection networks. In this paper, we describe several switch architectures which differ in the presence or absence of buffers, in the location and functionality of the buffers, and in the width of the data path. .sp .5 Since the addition of an architectural feature may have a significant cost in the complexity or cycle speed of a switching component, it is important to have an accurate assessment of the benefit provided. Therefore, we make quantitative comparisons of switch performance using both simulation and analytical methods. .)q .nE 142 Jeanette P. 
Schmidt and Alan Siegel, .q "The Spatial Complexity of Oblivious \fIk\fP-probe Hash Functions" .(q We consider the problem of constructing a dense static hash-based lookup table $T$ for a set $S$ of $n$ elements belonging to a universe $U= "{" 0,1,2,...,m-1 "}" $. We provide nearly tight bounds on the spatial complexity of oblivious $O(1)$-probe hash functions, which are defined to depend solely on their search key argument. This establishes a significant gap between oblivious and non-oblivious search. In particular, our results include the following: .ip \(bu Oblivious $k$-probe hash functions require $ OMEGA ( {n over k sup 2} e sup -k ~+~ loglog m )$ bits. .ip \(bu A probabilistic construction of a family of oblivious $k$-probe hash functions that can be specified in $ O ( n e sup -k ~+~ loglog m )$ bits, which nearly matches the above lower bound. .ip \(bu A variation of an $O(1)$ time 1-probe (perfect) hash function that can be specified in $O(n ~ + ~loglog m)$ bits, which is tight to within a constant factor of the lower bound. .ip \(bu Upper and lower bounds for related families of hash functions. .)q .nE 143 Seymour V. Parter and Thomas A. Manteuffel, .q "Preconditioning and Boundary Conditions" .(q Consider the large systems of linear equations $A sub h u sub h eq f sub h$ which arise from the discretization of a second-order elliptic boundary value problem. Consider also the preconditioned systems (i) $B sub h sup -1 A sub h u sub h eq B sub h sup -1 f sub h$; and (ii) $A sub h B sub h sup -1 v sub h eq f sub h$, $u sub h eq B sub h sup -1 v sub h$ where $B sub h$ is itself a matrix which arises from the discretization of another elliptic operator. We discuss the effect of boundary conditions (of $A$ and $B$) on the $L2$ and $H sub 1$ condition of $B sub h sup -1 A sub h$, $A sub h B sub h sup -1$.
In particular, in the case of $H sub 2$ regularity one finds that $|| B sub h sup -1 A sub h || sub L2$ is uniformly bounded if and only if $A sup *$ and $B sup *$ have the same boundary conditions, while $|| A sub h B sub h sup -1 || sub L2$ is uniformly bounded if and only if $A$ and $B$ have the same boundary conditions. Similarly, $|| B sub h sup -1 A sub h || sub {H sub 1}$ is uniformly bounded only if $A$ and $B$ have homogeneous Dirichlet boundary conditions on the same portion of the boundary. This latter result does not depend upon $H sub 2$ regularity. .)q .nE 144 Jerome Chiabaut, Vladimir Fleyshgakker, and Anne Greenbaum, .q "Porting Scientific Applications to the NYU Ultracomputer" .(q We report our experience in porting six different scientific codes to the Ultracomputer prototype and present promising results on speedups that can be achieved on shared memory MIMD architectures. .)q .nE 145 Olof Widlund, .q "Some Domain Decomposition and Iterative Refinement Algorithms for Elliptic Finite Element Problems" .(q In this contribution, we report on some results recently obtained in joint work with Maksymilian Dryja. We first study an additive variant of Schwarz' alternating algorithm and establish that a fast method of this kind can be devised which is optimal in the sense that the number of conjugate gradient iterations required, to reach a certain tolerance, is independent of the mesh size as well as the number of subregions. An interesting feature of this algorithm is that all the subproblems can be solved at the same time. The method is therefore quite promising for parallel computers. Using a similar mathematical framework, we then consider the solution of elliptic finite element problems on composite meshes. Such problems can be built up systematically by introducing a basic finite element approximation on the entire region and then repeatedly selecting subregions, and subregions of subregions, where the finite element model is further refined.
We consider conjugate gradient algorithms where, in each iteration, problems on the subregions representing finite element models prior to further refinement are solved. This makes it possible to use solvers for problems with uniform or relatively uniform mesh sizes, while the composite mesh can be strongly graded. We remark that this work is technically quite closely related to our previous work on iterative substructuring methods, which are domain decomposition algorithms using non-overlapping subregions. .)q .nE 146 Olof Widlund, .q "Optimal Iterative Refinement Methods" .(q We consider the solution of the linear systems of algebraic equations which arise from elliptic finite element problems defined on composite meshes. Such problems can systematically be built up by introducing a basic finite element approximation on the entire region and then repeatedly selecting subregions, and subregions of subregions, where the finite element model is further refined in order to gain higher accuracy. We consider conjugate gradient algorithms, and other acceleration procedures, where, in each iteration, problems representing finite element models on the original region and the subregions prior to further refinement are solved. We can therefore use solvers for problems with uniform or relatively uniform mesh sizes, while the composite mesh can be strongly graded. .qP In this contribution to the theory, we report on new results recently obtained in joint work with Maksymilian Dryja. We use a basic mathematical framework recently introduced in a study of a variant of Schwarz' alternating algorithm. We establish that several fast methods can be devised which are optimal in the sense that the number of iterations required to reach a certain tolerance is independent of the mesh size as well as the number of refinement levels.
This work is also technically quite closely related to previous work on iterative substructuring methods, which are domain decomposition algorithms using non-overlapping subregions. .)q .nE 147 Yosi Ben-Asher, David Egozi and Assaf Schuster, .q "SIMD \ Algorithms For 2-D Arrays In Shuffle Networks" .(q This paper studies a set of 2-D basic algorithms for SIMD perfect shuffle networks. These algorithms were studied in [1] and [SN], but for the 1-D case, where the size of the problem $N$ is the same as the number of processors $P$. For the 2-D case of $N = L times P$, we improve some algorithms to be .q "2-D efficient" , by improving their running time from $THETA ( L log P )$ to $THETA ( L + log P )$, as $N$ exceeds $P$. We give nontrivial algorithms for the following operations: Row-Reduction, Parallel-Prefix, Transpose, Packing, Smoothing and Cartesian-Product. .)q .nE 148 Eric Freudenthal and Olivier Pez\*'e, .q "Efficient Synchronization Algorithms Using Fetch & Add on Multiple Bitfield Integers" .(q Efficient interprocess synchronization tools for MIMD shared memory computers are critically important for the success of such systems. As members of the Ultracomputer Project at NYU, the authors have been participating in the continuing development of such algorithms. From their work, partially represented in this paper, they have observed that many of the more efficient algorithms utilize the storage and manipulation of more than one state variable in a single machine integer using fetch-and-add (see [GLR 83]). This paper describes this technique and presents several new implementations of busy-waiting synchronization functions which utilize this facility. .)q .nE 149 Edith Schonberg, .q "On-The-Fly Detection Of Access Anomalies" .(q Access anomalies are a common class of bugs in shared-memory parallel programs. An access anomaly occurs when two concurrent execution threads both write (or one thread reads and the other writes) the same shared memory location.
Approaches to the detection of access anomalies include static analysis, post-mortem trace analysis, and on-the-fly monitoring. .sp A general on-the-fly algorithm for access anomaly detection is presented, which can be applied to programs with both nested fork-join and synchronization operations. The advantage of on-the-fly detection over post-mortem analysis is that the amount of storage used can be greatly reduced by data compression techniques and by discarding information as soon as it becomes obsolete. In the algorithm presented, the amount of storage required at any time depends only on the number $V$ of shared variables being monitored and the number $N$ of threads, not on the number of synchronizations. Data compression is achieved by the use of two techniques called \fImerging\fP and \fIsubtraction\fP. Upper bounds on storage are shown to be $V times N sup 2 $ for merging and $V times N$ for subtraction. .)q .nE 150 David Wood, .q "Parallel Queues and Pools, An Evaluation" .(q In this paper I examine a fairly broad range of parallel queue and pool (unordered queue) algorithms and present data indicating their relative strengths and weaknesses. The algorithms have been implemented on the NYU Ultracomputer prototype, an eight-processor MIMD, shared-memory, single-bus machine. The data have been normalized to remove the effects of bus contention. I discuss the implementation of a few of the basic synchronization primitives and the implementation of each algorithm to be tested. Two programs were written: one, \fIcqtst\fP, provided a generalized arena for timing the different algorithms under different circumstances. The second program, \fIfifo\fP, was used to test the parallel fifo-ness of the queues, that is, to test whether the given algorithm truly represented a queue.
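As a rough editorial sketch (not taken from the note, whose algorithms are more elaborate), queue algorithms in the fetch-and-add style have each enqueuer and dequeuer claim a distinct slot with a single atomic counter update. Here F&A is emulated with a lock, and over/underflow handling is omitted; names are illustrative:

```python
import threading

class FAAQueue:
    """Bounded FIFO in the fetch-and-add style (simplified sketch:
    assumes the queue is never over-full or over-empty)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self._counters = {"head": 0, "tail": 0}
        self._lock = threading.Lock()  # emulates the atomicity of hardware F&A

    def _faa(self, name, inc):
        # fetch-and-add: atomically return the old value and add inc
        with self._lock:
            old = self._counters[name]
            self._counters[name] = old + inc
        return old

    def enqueue(self, item):
        # each enqueuer claims a distinct slot with one F&A on "tail"
        self.slots[self._faa("tail", 1) % self.size] = item

    def dequeue(self):
        # each dequeuer claims a distinct slot with one F&A on "head"
        slot = self._faa("head", 1) % self.size
        item, self.slots[slot] = self.slots[slot], None
        return item

q = FAAQueue(8)
for x in (10, 20, 30):
    q.enqueue(x)
print([q.dequeue() for _ in range(3)])  # [10, 20, 30]
```

On combining hardware each `_faa` call would be one instruction, so concurrent enqueuers never serialize on a common lock; the lock here only stands in for that atomicity.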
.)q .nE 151 Wayne Berke, .q "A Cache Technique for Synchronization Variables in Highly Parallel, Shared Memory Systems" .(q Caches have traditionally been used to lower the average latency of memory access. When paired with the individual CPU's of a multiprocessor, they have the additional benefit of reducing the overall load on the processor-memory interconnection. Since synchronization variables have been identified as centers of memory contention, we have looked at methods of utilizing the cache to minimize this effect. A technique of "polling the cache" is proposed to deal with this problem. Consistency within the caches is maintained with a bottleneck-free update facility that exploits the topology of the multistage network. Since an indiscriminate broadcast-on-write policy can lead to severe network congestion at high levels of parallelism, we selectively invoke these updates from the software. We illustrate our methods with a number of useful synchronization algorithms and present simulation results that support the feasibility of our design. In addition to providing support for basic synchronization operations, our methodology is generally applicable to all parallel algorithms that utilize polling. .)q .nE 152 Ronald Bianchini, .q "Ultracomputer Packaging and Prototypes" .(q This paper reviews interconnection networks and describes packaging of the NYU Ultracomputer. The packaging is based on the `button' technology developed by TRW and adopted by MCC for its ES-Kit modules. This paper presents the basic ES-Kit modules needed to assemble several sizes of Ultracomputers. It also reviews Ultracomputer prototypes and previews future prototypes. .)q .nE 153 Ora E. Percus and J. K. Percus, .q "Models for Queue Length in Clocked Queuing Networks" .(q We analyze a discrete-time network of queues. The unit element of the network is a $ 2 times 2$ buffered switch, which we regard as a system of two queues working in parallel.
We show how to transform transition probability information from the output of one switch, or network stage, to the input of the next one. This is used to carry out a Markov time series input model to predict mean queue length at every stage of the system. Another model considered is a renewal process time series model, which we use to find the mean queue length of the second stage of the network. Numerical simulations fall within the narrow band spanned by the two models. .)q .nE 154 Anne Dinning and B. Mishra, .q "Fully Parallel Algorithm for Implementing Path Expressions" .(q Current research efforts in parallel architecture usually evolve from a design of a processor-memory-connection architecture with rudimentary facilities for synchronization and communications, which are often found to be inadequate to support the more sophisticated usages. In order to remedy this problem, we favor a top-down design, where the primitive operations are selected via a careful and thorough analysis of the high-level general purpose applications to be supported by the architecture. .sp .5 In this paper, we study how a high-level process-synchronization specification, presented in the path expression language, can be translated into a fully parallel implementation on an MIMD shared memory architecture and what primitive support facilities are necessary to achieve this. Primary emphasis is obviously on the simplicity of the system, low synchronization overhead of the algorithm, efficiency of the primitives and the relative ease of system implementation. .)q .nE 155 Maksymilian Dryja and Olof B. Widlund, .q "Some Domain Decomposition Algorithms for Elliptic Problems" .(q We discuss domain decomposition methods with which the often very large linear systems of algebraic equations, arising when elliptic problems are discretized by finite differences or finite elements, can be solved with the aid of exact or approximate solvers for the same equations restricted to subregions.
The interaction between the subregions, to enforce appropriate continuity requirements, is handled by an iterative method, often a preconditioned conjugate gradient method. Much of the work is local and can be carried out in parallel. We first explore how ideas from structural engineering computations naturally lead to certain matrix splittings. In preparation for the detailed design and analysis of related domain decomposition methods, we then consider the Schwarz alternating algorithm, discovered as early as 1869. That algorithm can conveniently be expressed in terms of certain projections. We develop these ideas further and discuss an interesting additive variant of the Schwarz method. This also leads to the development of a general framework, which already has proven quite useful in the study of a variety of domain decomposition methods and certain related algorithms. We demonstrate this by developing several algorithms and by showing how their rates of convergence can be estimated. One of them is a Schwarz-type method, for which the subregions overlap, while the others are so-called iterative substructuring methods, where the subregions do not overlap. Compared to previous studies of iterative substructuring methods, our proof is simpler and in one case it can be completed without using a finite element extension theorem. Such a theorem has, to our knowledge, always been used in previous analyses of all but the very simplest cases. .)q .nE 156 Maksymilian Dryja and Olof B. Widlund, .q "On the Optimality of an Additive Iterative Refinement Method" .(q We consider the solution of the linear systems of algebraic equations which arise from elliptic finite element problems defined on composite meshes.
Such problems can systematically be constructed by introducing a basic finite element approximation on the entire region and then selecting subregions, and subregions of subregions, etc., where the finite element model is further refined in order to gain higher accuracy. We consider conjugate gradient algorithms where, in each iteration, problems representing finite element models on the original region and the subregions prior to further refinement are solved. We can therefore use solvers for problems with uniform or relatively uniform mesh sizes, while the composite mesh can be strongly graded. .sp .5 The central theoretical issue concerning these so-called iterative refinement algorithms is the design and analysis of algorithms for which the rate of convergence is independent of the number of subproblems, i.e. the number of refinement levels, as well as the mesh sizes. Algorithms for which such bounds can be obtained are called optimal. In this paper, we extend our previous results and show that an additional important algorithm is optimal. .sp .5 As in our previous studies, we use a basic mathematical framework introduced in a recent study of a variant of Schwarz' alternating algorithm. Some tools from older work on iterative substructuring methods are also used. .)q .nE 157 Richard Kenner, Susan Dickey and Patricia Teller, .q "The Design of Processing Elements on a Multiprocessor System With A High-Bandwidth, High-Latency Interconnection Network" .(q The environment of a highly-parallel, shared-memory multiprocessor system with a high-bandwidth, high-latency interconnection network is different in many ways from that encountered in a uniprocessor system. We describe these differences and the impact they have on the design of processing elements. In addition, this paper describes methods that can be used to evaluate the impact of architectural choices on the performance of any system that uses a similar network.
Two detailed designs of processing elements, one using a conventional processor and the other using a RISC, are given as examples. .)q .nE 158 Richard Kenner, Ronald Bianchini, Susan Dickey, and Patricia Teller, .q "A Compact Design For A Highly-Parallel Shared-Memory MIMD Computer" .(q The Ultracomputer architecture connects hundreds or possibly thousands of processing elements (PEs), each containing a cache but no local memory, to an equal number of memory modules (MMs) via an interconnection network constructed of custom VLSI components that combine (merge) requests from different PEs destined for the same memory location. The network provides a high-bandwidth path from the PEs to MMs, but the memory latency is significantly larger than that encountered in either a uniprocessor or small bus-based multiprocessor. The PE must exploit the high available bandwidth and minimize the effect of network latency. .)q .nE 159 Eric Freudenthal and Allan Gottlieb, .q "Process Coordination with Fetch & Increment" .(q The Fetch-and-add (F&A) operation has been used effectively in a number of process coordination algorithms. In this paper we assess the power of Fetch-and-increment (F&I) and Fetch-and-decrement (F&D), which we view as restricted forms of F&A in which the only addends permitted are $+- 1$. F&A-based algorithms that use only unit addends are thus trivially expressed with just F&I and F&D. Our primary contribution is a bottleneck-free readers/writers algorithm that uses only F&I and F&D. We prove that this readers/writers algorithm is correct and indicate how existing F&A-based algorithms for queues-with-multiplicity can be simplified to yield algorithms using just F&I and F&D. Finally, we discuss a general technique for replacing F&A by F&I/F&D at a cost logarithmic in the number of processors.
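An editorial sketch of the unit-addend idea, not the note's bottleneck-free algorithm: many F&A uses only ever add $+- 1$, so they can be phrased directly with F&I and F&D. For instance, a simple reader count in which the first arriving reader excludes writers and the last departing reader readmits them (the atomics are emulated with a lock; all names are illustrative):

```python
import threading

class Counter:
    """Fetch-and-increment / fetch-and-decrement, emulated with a lock;
    on combining hardware each call would be one atomic instruction."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def fetch_and_increment(self):
        with self._lock:
            old = self.value
            self.value = old + 1
        return old

    def fetch_and_decrement(self):
        with self._lock:
            old = self.value
            self.value = old - 1
        return old

# A unit-addend reader count: the first reader to arrive (old count 0)
# is responsible for excluding writers; the last reader to leave
# (old count 1) readmits them.
readers = Counter()

def reader_enter():
    return readers.fetch_and_increment() == 0   # True for the first reader

def reader_exit():
    return readers.fetch_and_decrement() == 1   # True for the last reader

print(reader_enter())  # True  (first reader)
print(reader_enter())  # False
print(reader_exit())   # False
print(reader_exit())   # True  (last reader)
```

Because only the returned old value is tested, no reader ever busy-waits on the count itself; the writer-exclusion step hinted at in the comments is where the note's actual algorithm does the harder work.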
.)q .nE 160 Steven Jaffe, .q "Equilibrium Results for a Pair of Coupled Discrete-time Queues" .(q A pair of discrete-time queues is coupled by the fact that the arrivals on the two are not independent. We derive a functional equation for the bivariate generating function of the joint distribution of queue lengths at equilibrium. This is solved, and the result is used to find exact expressions for the mean and variance of the sum of the queue lengths and the covariance of the pair of queue lengths. We also find asymptotic formulae for the equilibrium probabilities $ p sub ij $ as $i, j -> inf $ , as well as some asymptotic results when the arrival rate approaches 0 or 1. .)q .nE 161 O. E. Percus and J. K. Percus, .q "Queue Length Distributions in a Markov Model of a Multistage Clocked Queueing Network" .(q In [4], we treated the problem of passage through a discrete-time clock-regulated multistage queueing network by modeling the input time series $a sub n$ to each queue as a Markov chain. We showed how to transform probability transition information from the input of one queue to the input of the next one in order to predict mean queue length at each stage. The Markov approximation is very good for $rho ~=~ E (a sub n ) <= 1 over 2$, which is in fact the range of practical utility. Here we carry out a Markov time series input analysis to predict the stage-to-stage change in the probability distribution of queue length. The main reason for estimating the queue length distribution at each stage is to locate .q "hot spots", loci where unrestricted queue length would exceed queue capacity, and a quite simple expression is obtained for this purpose. .)q .nE 162 A. Greenbaum, .q "Parallelizing the Adaptive Fast Multipole Method on a Shared Memory MIMD Machine" .(q The fast multipole method is a recently developed scheme for evaluating the potential and force fields in systems of particles whose interactions are Coulombic or gravitational in nature.
When the distribution of particles is highly nonuniform, an adaptive version of the algorithm is needed. While the nonadaptive algorithm can be parallelized on an SIMD machine, it is shown that the adaptive version is equally well parallelized on an MIMD machine. Shared memory and an atomic fetch-and-add instruction make implementation especially easy. Results are presented on the 8-processor NYU Ultracomputer Prototype. .)q .nE 163 Anne Dinning and Edith Schonberg, .q "An Evaluation of Monitoring Algorithms for Access Anomaly Detection" .(q One of the major disadvantages of parallel programming with shared memory is the nondeterministic behavior caused by uncoordinated access to shared variables, known as .i access anomalies. Monitoring program execution to detect access anomalies is a promising and relatively unexplored approach to this problem. We present a new algorithm, referred to as .i task recycling, for detecting anomalies, and compare it with two other monitoring algorithms. Both analytic and empirical results indicate that task recycling is more efficient in terms of space requirements and performance. Significant conclusions are: (i) Space requirements are in the worst case bounded by O(T x V), where T is the maximum number of threads that may potentially execute in parallel and V is the number of variables monitored. For typical programs, however, space requirements are on average O(V). (ii) Most of the actual performance cost of monitoring is at synchronization operations - the cost per variable access is a small constant. (iii) The general approach of monitoring to detect access anomalies is practical. Various implementation optimizations, as well as the use of complementary tools, can enhance monitoring to effectively solve the access anomaly problem.
First, the field has matured from emphasizing proof of concept to pragmatic concerns needed to obtain high performance. Second, dataflow research is moving closer to mainstream parallel processing. Several examples of dataflow research supporting these observations are given. They include two-stage enabling rules and other eclectic scheduling disciplines, hardware and software throttling of parallelism, non-functional and (possibly) non-deterministic execution semantics, traditional memory management, and von Neumann compatibility. .)q .nE 165 Barry F. Smith and Olof B. Widlund, .q "A Domain Decomposition Algorithm Based on a Change to a Hierarchical Basis" .(q Over the last five years, several new fast algorithms have been developed for the solution of elliptic finite element problems with many degrees of freedom. Among them are Yserentant's hierarchical basis multigrid method and a number of domain decomposition algorithms for problems divided into many subproblems. The condition number of the relevant operators, for the best of these preconditioned conjugate gradient methods, grows only slowly with the size of the problems; the growth typically is quadratic in the logarithm of the number of degrees of freedom associated with a subregion or an element of the coarsest mesh. Important components of many domain decomposition preconditioners are subproblems related to the sets of variables on the interfaces between neighboring subregions. In this paper, we demonstrate that, for problems in two dimensions, a very simple change of basis for these one-dimensional problems leads to a preconditioner which is as effective as those previously considered. In our proof, we use only tools of linear algebra and a result of Yserentant's. Our numerical experiments confirm that the new algorithm is better conditioned than the original hierarchical basis multigrid method. 
.)q .nE 166 Anne Dinning and Edith Schonberg, .q "An Empirical Comparison of Monitoring Algorithms for Access Anomaly Detection" .(q One of the major disadvantages of parallel programming with shared memory is the nondeterministic behavior caused by uncoordinated access to shared variables, known as .i "access anomalies". Monitoring program execution to detect access anomalies is a promising and relatively unexplored approach to this problem. We present a new algorithm, referred to as .i "task recycling", for detecting anomalies, and compare it with an existing algorithm. Empirical results indicate several significant conclusions: (i) While space requirements are in the worst case bounded by O(T x V), where T is the maximum number of threads that may potentially execute in parallel and V is the number of variables monitored, for typical programs space requirements are on average O(V). (ii) Task recycling is more efficient in terms of space requirements and often in terms of performance. (iii) The general approach of monitoring to detect access anomalies is practical. .)q .nE 167 Maksymilian Dryja and Olof B. Widlund, .q "Towards a Unified Theory of Domain Decomposition Algorithms for Elliptic Problems" .(q A distinction is often made between domain decomposition algorithms for elliptic partial differential equations which use overlapping subregions and those which do not. Schwarz's alternating method, the oldest of them all, belongs to the first category. It has recently been discovered that many of the algorithms that belong to the second category can also be regarded as generalizations of the classical Schwarz method or an additive variant thereof. This new approach provides new tools for the analysis and development of domain decomposition algorithms. In this paper, we introduce an abstract additive Schwarz method and develop a simple framework for the analysis of its rate of convergence.
We show that it is possible and convenient to specify algorithms in terms of a set of subspaces and related orthogonal projections. This family of algorithms is further expanded by replacing the linear systems of equations, which correspond to the projections, by suitable preconditioners. We illustrate the usefulness of this approach by considering additive Schwarz methods of a relatively conventional type, iterative substructuring methods, domain decomposition methods defined on the curves or surfaces which subdivide the region, and iterative refinement methods. Throughout we work in a framework of conforming finite elements and self-adjoint problems, but we also mention some new results for more general elliptic finite element problems. .)q .nE 168 O. E. Percus and J. K. Percus, .q "Time Series Transformations in Clocked Queueing Networks" .(q We consider a network in which message packets arriving on a clock cycle are directed along paths which may have segments in common. When they do, the messages involved are queued and transmitted, one per cycle. The statistics of queue length are investigated by an expansion of the time series in a channel about the regime of independent arrivals. Results for the second stage of such a network are in good agreement with computer simulations. .)q .nE 169 Ora E. Percus and Susan R. Dickey, .q "Performance Analysis of Clock-Regulated Queues with Output Multiplexing In Three Different Types of 2 by 2 Crossbar Switch Architectures" .(q Switches in interconnection networks for highly parallel shared memory computer systems may be implemented with different internal buffer structures. For a $2 times 2$ synchronous switch, previous studies have often assumed a switch composed of two queues, one at each output, each of which has unbounded size and may accept two inputs in one clock cycle. Hardware implementations may actually use simpler queue designs and will have bounded size.
.sp .5v Two additional models for a $2 times 2$ switch using queues that may accept only one input at a time are analyzed. The first uses four queues, one for each input/output pair. The second uses two queues, one at each input. In each case, a multiplexer blocks one queue if two queues desire the same output, making these models more difficult to analyze than the previous model. Maximum bandwidth, expected queue length, expected waiting time, and queue length distribution are presented for both models, with unbounded queue size and with queue size equal to 1. .)q .nE 170 Ron Cytron, Jim Lipkis and Edith Schonberg, .q "A Compiler-Assisted Approach to SPMD Execution" .(q Today, two styles of scientific parallel programming prevail. In the SPMD style, all processors execute the same program, with sequential code executed redundantly and parallel code executed cooperatively. In the fork-join style, a sequential thread of control spawns multiple threads to execute a portion of the code concurrently. In this paper we show how to automatically obtain execution efficiency approaching that of SPMD-style execution for programs written in the more structured fork-join style. .)q .nE 171 Yue-Sheng Liu, .q "Architecture and Performance of Processor-Memory Interconnection Networks for MIMD Shared Memory Parallel Processing Systems" .(q In this thesis we study interconnection networks and their switch architectures, evaluating the performance of different architectures in the MIMD shared memory environment using both simulation and analytical methods. .sp .5v The networks we study are constructed from a single basic building block, a switch element. A number of alternatives are proposed in designing the switch element. We study the performance of different switches with both analytical methods and extensive simulations. We also propose a switch architecture with multiple handshaking signals (also referred to as a 2-CTS switch in the 2 by 2 case) that gives improved performance. 
.sp .5v Various interconnection networks, and issues of interest related to them, are reviewed in the thesis. We propose a new class of interconnection networks called F networks. In comparison to traditional multi-stage networks, F networks provide faster communication among nodes within a cluster. In addition, the extra routes available in an F network make it fault-tolerant. .sp .5v Under the uniform traffic model, Kruskal & Snir discovered that different stages of a buffered omega network experience about the same queueing delay under moderate load. Based on simulations and analysis, Kruskal, Snir & Weiss established a formula for calculating network delays under moderate traffic. From our simulations we discovered that the Kruskal-Snir-Weiss formula holds only for the forward path delay of the networks; the return path delay is actually substantially less than the delay of the forward path. We complete the network performance formula by extending the Kruskal-Snir-Weiss formula to include the return path. We also analyze network performance under "hot spot" traffic and obtain analytic results on the performance attained. .)q .nE 172 Ora E. Percus and J. K. Percus, .q "Intrinsic Relations in the Structure of Linear Congruential Generators modulo $2 sup beta$" .(q We show two different general relationships that exist among terms of the linear congruential generator .sp .ce $x sub {i + 1} = a x sub i + b~$mod$~2 sup beta$ .sp that are separated by powers of two. These two relationships are not equivalent except in one special case, which turns out to be Marsaglia's result. .)q .nE 173 Patricia J. Teller and Allan Gottlieb, .q "TLB Performance in Multiprocessors" .(q This paper compares the performance, in highly-parallel shared-memory multiprocessors, of locating translation-lookaside buffers (TLBs) at processors with that of locating TLBs at memory. 
Our performance comparison is based on results of trace-driven simulations of multiprocessors with $log N$-stage networks interconnecting $N$ processors and $N$ memory modules. .sp .5v For the systems and workloads studied, memory-based TLBs perform better than processor-based TLBs, provided that memory is organized as multiple \fIpaging arenas\fR, where the mapping of pages to arenas is fixed. The cost of a processor-based TLB reload is at least $log N$ because of network transit. The cost of a memory-based TLB reload can be made smaller than that of a processor-based TLB reload, since network transits are not required. Furthermore, with multiple paging arenas, the number of reloads is smaller with memory-based TLBs. For memory-based TLBs to continue to outperform processor-based TLBs for large $N$, it is likely that the number of paging arenas must grow with $N$. .)q .nE 174 Torbj\(:orn Granlund and Richard Kenner, .q "Eliminating Branches using a Superoptimizer and the GNU C Compiler" .(q This paper presents a superoptimizer, GSO, which searches for an instruction sequence that implements a specified operation on one of a set of microprocessors. The basic algorithms used by GSO are described, along with the techniques used to allow GSO to be easily configured to support many microprocessors. We show sequences for generating various comparison operations on the IBM RS/6000 and describe how patterns based on these sequences were added to the GNU C compiler to allow it to generate compact, branchless code for many common conditional expressions. .sp .5v Published in this year's ACM SIGPLAN Conference on Programming Language Design and Implementation. .)q .nE 175 Susan Dickey and Richard Kenner, .q "Using Qualified Clocks in the NORA Clocking Methodology to Implement a Systolic Queue Design" .(q A systolic queue design has been implemented with the NORA clocking methodology using qualified clocks. 
NORA allows the use of compact CMOS circuits with high tolerance for clock skew. Qualified clocks provide a natural way to implement local data movement in a systolic design. Their use with NORA, however, involves certain complications. .sp .5v A circuit to produce a qualified clock for use in the NORA methodology is presented, and its maintenance of NORA assumptions is proved. Charge-sharing and noise problems which can arise are detailed. The performance of chips we have fabricated using this methodology is described. .)q .nE 176 Guoying Chen, .q "A Wiring Tool for Ultracomputer" .(q During the final stage of computer system design and implementation, flexible and convenient tools can save a great deal of time and effort. For example, once all subcomponents of a design have been carefully placed on a board, and the design schematics and other necessary data are given, a wiring tool is needed so that the wiring can be done easily and automatically. However, routing each signal wire for .i many components on .i any given board, under various constraints and in complicated situations, is a difficult practical problem. .sp .5v This report describes a general-purpose wiring tool which meets this demand and is used particularly for Ultracomputer prototyping and implementation. The most important part of the wiring tool is the solution of the routing problem for a signal net. We reduce it to a Traveling Salesman Problem (TSP) with various constraints. The wiring tool employs a fast heuristic TSP algorithm that produces optimal results with high probability. .sp .5v The wiring tool program consists of three parts: .i "Preprocessor, wiring-man," and .i "output generator." The tool package offers a flexible interface to users, and is able to handle new boards, chips, and other technologies semi-automatically with minimal manual intervention. 
.sp .5v In addition to describing the features and implementation of the tool, we also provide useful information and discuss several issues for its future upgrades. .)q .nE 177 Ron Bianchini .nE 178 Susan R. Dickey and Richard Kenner, .q "A Combining Switch for the NYU Ultracomputer" .(q This report describes the combining switch that we have implemented for use in the $16 times 16$ processor/memory interconnection network of the NYU Ultracomputer prototype. Packaging, message types and message formats are described. Details are given about the internal logic of the two component types used in the network, including the systolic combining queue, the wait buffer and the ALU. Systolic combining queue designs with greater combining capability are sketched. .)q .nE 179 Ora Engelberg Percus and J. K. Percus, .q "An Expanded Set of Correlation Tests" .(q An analytic study is made of the correlation structure of Tausworthe and linear congruential random number generators. The former case is analyzed by the bit mask correlations recently introduced by Compagner. The latter is studied first by an extension to word masks, which includes spectral test coefficients as special cases, and then by the bit mask procedure. Although low order bit mask coefficients vanish in both cases, the Tausworthe generator appears to produce a substantially smaller non-vanishing correlation set for large masks\*-but with larger correlation values\*-than does the linear congruential. .)q .nE 180 Susan R. Dickey and Ora E. Percus, .q "Performance Differences among Combining Switch Architectures" .(q Combining switches can be used in multiprocessor interconnection networks to prevent tree saturation due to hot spot access. Small differences in the capabilities of combining switch architectures can make a significant difference in their performance. We have studied several such architectures using both analytical techniques and simulation. 
.sp .5v A new, more accurate method for specifying the queue-length dependent probability of combining is described. This method was used to construct recurrence relations for the queue length probability distribution of four different combining switch architectures. These recurrence relations were used to compute average queue lengths and output rates; results were validated with simulations. .)q .nE 181 Guoying Chen, .q "Primary-node Scheme for Cache Coherence in Large Scale Shared-Memory Multiprocessors" .(q In this paper we propose a new cache coherence scheme called the .i "primary-node" method, capitalizing on the observation in [17] that there is usually a primary processor accessing a data block with much higher frequency than the other sharing processors (secondary nodes). We associate each memory block with a single pointer that points to the "primary" node responsible for the majority of consistency operations for that memory block, while the remaining operations are performed collaboratively by the secondary nodes and the primary node. As expected, an overwhelmingly large percentage of consistency operations for a block can be performed locally at the primary node, reducing both the access latency and the network traffic related to coherence actions. The overall memory consumption for this scheme is $O(C log sub 2 N + M log sub 2 N)$, almost the same as for the chained pointer scheme. Unlike chained pointers, our scheme eliminates the large latency problem caused by sequential invalidations. Performance analysis shows that our scheme is simple, efficient and scalable. .)q .nE 182 Susan R. Dickey and Richard Kenner, .q "Hardware Combining and Scalability" .(q Scalability refers to the extent to which a system approaches perfect (linear) speedup with large numbers of processors. 
Fundamental physical laws limit scalability, but in practical terms, bottlenecks due to serialization points at critical sections are often the most immediate limitation on the scalability of an architecture. The NYU Ultracomputer project has developed algorithms without software critical sections that depend on an interconnection network with hardware combining. Such a network has the ability to merge concurrent hot-spot requests into a single memory reference without introducing significant additional delay. .sp .5v Hardware combining is not without cost, but our experience in implementing a combining switch indicates that the cost is much less than is widely believed. We describe the design choices made in implementing the switch to keep network bandwidth high and latency low. We compare the cost of a combining switch to that of a non-combining switch and discuss the scalability of the design we have implemented to large numbers of processors. .)q .nE 183 Ora E. Percus, .q "A Multistage Clocked Queueing Network" .(q The networks of interest are characterized by synchronous pulsed operation, and are substantially more difficult to analyze than traditional continuous time queueing systems. The control parameter in such a network is the probability $rho$ of a packet entering a channel per clock cycle; in steady state, with nonsaturating queues, $rho$ is constant throughout the network. The quantity of practical interest is the distribution of queue length throughout the network, since a queue which does saturate must enter another mode to allow continued operation. It is important to have a concise form for the time series transformation engendered by units of queueing networks. .sp .5v We treat the problem of passage through a discrete-time, clock-regulated multistage queueing network by modeling the input time series to each queue as a Markov chain. 
The Markov approximation is very good for $rho = E{a sub n} <= {1 over 2}$, which is in fact the range of practical utility, and it compares favorably with numerical simulation. .sp .5v Since these investigations point strongly to the fact that the nearly independent "light traffic" assumption is a surprisingly decent zeroth approximation up to half of the saturation flow, we continue by investigating the expected queue length by a correlation expansion of the time series in a channel about the regime of independent arrivals. .sp .5v In both the Markov model and the correlation expansion, the Kruskal, Snir, and Weiss (KSW) conjecture, that the average delay at successive stages can be bounded independently of the network size, is verified. .)q .nE 184 Ora E. Percus, .q "Why Use Chen-Stein for Sequence Comparison?" .(q The bounds found on the distribution of the longest match between two sequences, using the Chen-Stein method, are adequate for long matches but poor for short ones. They are greatly improved if exact estimates due to Uspensky and others are employed. .)q .nE 185 Ronald Bianchini, Susan Dickey, Gabriel Goodman, Allan Gottlieb, Richard Kenner and Jiarui Wang, .q "The Ultra III Prototype" .(q .)q .nE 186 Guoying Chen, .q "SLiD \(em A Cost-Effective and Scalable Limited-Directory Scheme for Cache Coherence" .(q In this paper, we propose a hybrid scheme, dubbed SLiD (\fIS\fRcalable \fILi\fRmited \fID\fRirectory), combining the good features of the \fIchained directory\fR scheme and the \fIlimited directory\fR scheme. Instead of simply putting the two schemes together, we attempt, in the SLiD scheme, to attack the weak points of each individual scheme, namely the performance penalty in the limited directory scheme when pointer overflow occurs, and the linear latency in the chained directory scheme when performing invalidations due to a write. 
Moreover, the SLiD scheme framework provides designers with three structural options with different performance characteristics and implementation costs. .)q .nE 187 Susan Dickey, .q "Interconnection Network Switch Architectures and Combining Strategies" .(q We have studied a variety of switch architectures for use in the multistage interconnection network of a shared memory multiprocessor. In this paper, we describe a taxonomy of buffered switch architectures and give performance results for the different types we have defined. We investigate two techniques for improving network performance in the presence of contention: the use of multiple handshaking signals and the use of message combining. Simulation results are provided for implementation alternatives using switches of varying capabilities, in order to compare the effectiveness of different methods as system size grows. We show that a practical combining switch design, the two-and-a-half-way combining switch, provides performance as good as that of other, more expensive designs. .)q .nE 188 Patricia J. Teller and Allan Gottlieb, .q "Locating Multiprocessor TLBs at Memory" .(q This paper compares the performance, in shared memory multiprocessors, of locating translation-lookaside buffers (TLBs) at processors with that of locating TLBs at memory. Our comparison is based on trace-driven simulations of multiprocessors with $log N$-stage networks interconnecting $N$ processors and $N$ memory modules. .sp .5v For the systems and workloads studied, memory-based TLBs perform noticeably better than processor-based TLBs, provided that memory is organized as multiple paging arenas, i.e., multiple clusters of memory modules where the mapping of a page to a cluster is fixed. The cost of a processor-based TLB reload is at least $log N$ because of network transit. In contrast, the cost of a memory-based TLB reload can be smaller, since network transits are not required. 
Furthermore, with multiple paging arenas, the number of reloads is smaller with memory-based TLBs. .)q .nE 189 Jan Edler .q "Molasses: An Ultracomputer Simulator (Preliminary Version)" .(q Molasses is a simulator for a complete Ultracomputer system, containing processors, memory modules, and a multi-stage interconnection network supporting combining. The simulated machine is quite close to the Ultra-3 prototype. A large number of parameters may be set by the user to alter the characteristics of the simulated machine and the data to be collected. .sp .5v Molasses relies on careful implementation, parallelism, checkpoint/restart, and specialized modes of simulation to achieve high performance. Performance may be maximized by automatically building a custom version of Molasses with most parameters fixed to constant values; this allows the compiler to generate superior code, typically doubling performance. .sp .5v Although the preliminary version of Molasses does not yet simulate the Ultra-3 processors, its good performance and flexibility have already made it useful for network studies. .)q