Candidate: Tsong-Li Wang
Advisor: Dennis Shasha
3:30 p.m., Thursday, January 24, 1991
12th floor conference room, 719 Broadway


Recently, several prototype and commercial systems based on a loosely-coupled shared-nothing architecture have been proposed and built for database applications. To achieve speed-ups proportional to the number of processors for operations such as selections and joins, such systems often distribute data across storage units using a hashing function. In the first part of this thesis, we investigate ways of minimizing response time for various multi-join queries in such systems. We develop a dynamic programming algorithm for queries whose closures are chains. We next prove the NP-completeness for more general queries and propose four heuristics for them. We then evaluate experimentally the relative performance of these heuristics and their performance relative to optimums. The empirical results show that a hybrid heuristic combining our chain algorithm with a heuristic related to Kruskal's spanning tree algorithm performs well.

In the second part of the thesis, we present a scheme to answer best-match queries from a file containing a collection of objects. A best-match query is to find the objects in the file which are closest (according to some (dis)similarity measure) to a given target.

Previous work suggested that one can reduce the computational effort required to achieve the desired results using the triangle inequality when starting with a data structure for the file which reflects some precomputed intrafile distances. We generalize the technique to allow the optimum use of any given set of precomputed intrafile distances. We then extend our scheme to a class of queries for retrieving similar or dissimilar objects that commonly arise in vision and molecular biology. Artificial data and actual protein sequences are used to illustrate the effectiveness of our scheme for different queries, and to compare its performance with previous algorithms.

Finally, we implement our techniques into a tree information system that enables users to retrieve and extract information from trees based on approximate comparison. We expect this system to have applications in pattern recognition, biology, linguistics, and programming languages. The system is implemented in C and X-windows, and is fully operational on SUN workstations.