Research interests

My research concerned how to improve the performance of a system by understanding, in a detailed fashion, the storage system. Detailed areas: random bit generation, file systems, web searching, payment schemes, improving performance (task scheduling, parallel I/O, parallel file systems, storage systems, hierarchical memory models).

Recent work

Web searching

User navigation is a powerful, yet underused source of data. We apply user navigation to web searching, with the desire to reduce the number of URLs visited when searching. Our source of user navigation information is proxy logs. We tease out search sessions and topic sessions. A search session is the query, and the set of URLs that the user visited when searching. Topic sessions which are groups of search sessions on the same topic.

Details on how we use search and topic sessions to improve web searching can be found at the user navigation home page. Two projects that are part of this effort are SearchLight and Hyponym.

Random bit generation

We have developed a random bit generator that uses disk accesses. Our generator is practical and economical. It requires no additional hardware (given a system with a disk), and no user involvement. As a concrete example of performance, on a Sun Ultra-1 with a Seagate Cheetah disk, it generates bits at a rate of either 5 bits per minute or 577 bits per minute depending on the physical phenomena that we use as a source of randomness. The generated bits are random by a theoretical argument, and also pass a severe battery of statistical tests. More information can be found here.

File systems

Application-specifc file systems

We have been working on developing high-performance application-specific file systems. We have some very promising results for a file system for caching web proxies. More details are on our Hummingbird web page.

File system prefetching

We developed an analytic model of read performance in a file system which performs prefetching. The model helps one understand why and when prefetching works. We also have some suggesions on how to improve file system performance for UNIX-like workloads. Our paper was presented at USENIX, June 6-11, 1999. More information can be found here.

Previous work

Storage systems

Single disk model

I developed an analytic performance model of a SCSI disk that determines the service time of a request in a workload-specific method. The inputs into the model are the workload attributes (e.g., mean request size, arrival rate, and fraction of read requests) and a description of the disk. More information can be found here.

Multiple disk model

We have developed an analytic model for multiple disks on a SCSI bus, along with a technique to increase the performance of the disk system when the workload has large requests and random access. More information can be found here.

Task scheduling on tape-resident jobs

We are interested in developing efficient scheduling algorithms for jobs where the data set is so large that it is stored on tape. For example, the data for the tape-resident NASA EOS jobs must be staged from tape into disk before the processing can begin. Our current results were presented at PODS, May 31, 1999. More information can be found here.

Parallel I/O

Jeff Vitter and I developed a computation model (called the Parallel Disk Model) and a number of out-of-core algorithms for problems such as sorting and FTT. This work is published as my master's thesis and in a STOC '90 paper. (See published papers section.)

Parallel file systems

Len Wisniewski and I designed an application programmer interface for multiprocessor multi-disk file systems based on the Parallel Disk Model. (See the published papers section for our TR describing our work.)


Bibliography files

My .bib files have many references for the above areas: