Abstract:
This paper presents the design and implementation of a resource
management system for monitoring computing resources on a network and
for dynamically allocating them to concurrently executing jobs. In
particular, the system is designed to support adaptive parallel
computations---computations that benefit from the addition of new
machines and can tolerate the removal of machines while executing. The
challenge for such a resource manager is to communicate the
availability of resources to running programs even when the programs
were not developed to work with external resource managers. Our main
contribution is a novel mechanism addressing this issue, built on
low-level features common to popular parallel programming systems.
Existing resource management systems for adaptive computations either
require tight integration with the operating system (DRMS) or require
integration with a programming system that is aware of external
resource managers (e.g., Condor/CARMI, MPVM, Piranha). Thus, in each
case, their support is limited to a single type of programming system.
In contrast, our resource management system is unique in supporting
several unmodified parallel programming systems. Furthermore, the
system runs with user-level privilege, and thus cannot compromise the
security of the network.
The underlying mechanism and the overall system have been validated on
a dynamically changing mix of jobs, some sequential, some PVM, some
MPI, and some Calypso computations. We demonstrate the feasibility
and the usefulness of our approach, thus showing how to construct a
middleware resource management system to enhance the utilization of
distributed systems.