On Compiling Regular Loops for Efficient Parallel Execution
Candidate: Pei Ouyang
Advisor: Zvi Kedem and Krishna Palem
12:00 p.m., Wednesday, December 18, 1991
room 101, Warren Weaver Hall


In this thesis, we study the problem of mapping regular loops onto multiprocessors. We develop mapping schemes that yield very efficient executions of regular loops on shared and distributed memory architectures. We also develop novel analysis techniques, using which we argue about the efficiency of these resulting executions. The quality of the execution of these regular loops in the distributed memory setting, relies heavily on implementing cyclic shifts efficiently. Effectively, cyclic shifts are used to communicate results between individual processors, to which different interdependent iterations are assigned. Therefore, in order to achieve efficient executions of regular loops on distributed memory architectures, we also develop and analyze algorithms for solving the cyclic shift problem. In order to analyze the executions of regular loops that result from any specific mapping, we need to characterize the important parameters that determine its efficiency. We formally characterize a basic set of such parameters. These parameters facilitate the analysis of the memory and the processor requirements of a given execution, as well as its running time. Using these parameters, we analyze a greedy execution scheme, in the shared memory model. For example, we can determine the limit on the number of processors beyond which no speedup can be attained by the greedy method, for regular loops. The greedy scheme is of interest because it exploits the maximal possible parallelism in a natural way.

We then address the mapping scheme of regular loops onto distributed memory machines. Unfortunately, we show that the problem of finding an optimal mapping is computationally intractable in this case. In order to provide schemes that can be actually applied to regular loops at compile-time, we relax the requirement that the resulting executions be optimum. Instead, we design a heuristic mapping algorithm and validate it through experiments. This heuristic mapping scheme relies heavily on the use of efficient algorithms for realizing cyclic shifts. Therefore, we also study the problem of realizing cyclic shifts on hypercube architectures.