Fault-tolerant parallel computing on networks of non-dedicated workstations

Candidate: Wyckoff,Peter


This thesis addresses fault tolerance issues in parallel computing on loosely-coupled networks of non-dedicated, heterogeneous workstations. The efficiency of fault tolerance mechanisms is dictated by network and failure characteristics. Traditional approaches to fault tolerance are efficient when network and failure characteristics are identical across workstations, such as in a local area network of homogeneous workstations; however, a loosely coupled network of non-dedicated workstations has non-uniform network and failure characteristics. This thesis presents the design and implementation of a flexible fault tolerance runtime system that allows each process in a parallel application to use one of three rollback recovery mechanisms. Rollback recovery is achieved using a lightweight form of transaction, which performance results show incurs almost no overhead. The system is built on top of the Linda coordination language and runs on Alpha, Linux, Solaris and SGI workstations and Java-enabled browsers. For barrier synchronous parallel applications, a new equi-distant checkpointing interval selection method, the expected maximum heuristic, is presented. The method is applicable to any rollback recovery system in which processes recover from failure independently and communicate through a reliable third party. Simulation results show that the expected maximum heuristic has near optimal performance under a variety of different failure rates and barrier lengths.