Computer Science Colloquium

The Human Proteome Folding Project and the Inferelator: two examples of computational systems biology.

Richard Bonneau
Institute for Systems Biology

Wednesday, April 27, 2005 11:15 A.M.
Room 1302 Warren Weaver Hall
251 Mercer Street
New York, NY 10012-1185

Colloquium Information: Richard Cole, (212) 998-3119

I will survey two major computational efforts currently underway to 1) determine the structure of a large fraction of all proteins of unknown function using Rosetta structure prediction on the World Community Grid and 2) infer the regulatory networks of several organisms de novo from systems-biology data.

Human Proteome Folding Project: Large fractions of all fully sequenced genomes code for proteins of unknown function. Annotating these proteins remains a critical bottleneck for systems biology and is crucial to understanding the biological relevance of genome-wide changes in mRNA and protein expression, protein-protein interactions, and protein-DNA interactions. I have previously shown that Rosetta de novo structure prediction can be used to predict three-dimensional structures for proteins of unknown function, and that those predicted structures can be used in a systems biology context to glean biological insight into protein function. Rosetta de novo structure prediction is computationally intensive, so we have implemented a distributed computing strategy that currently employs over 3 million devices globally, in collaboration with United Devices and IBM. We have begun folding on the grid, using Rosetta, all relevant protein domains in all fully sequenced genomes (including the human genome and the genomes of all sequenced major human pathogens). The results from this effort will be publicly available.

Regulatory Network Inference: I will describe a transparent, modular method for inferring gene regulatory influences on a genome-wide scale. At the core of this statistical learning framework are two methods that work in tandem to infer regulatory networks, simultaneously cluster genes and experimental conditions (bicluster), and detect cis-acting regulatory motifs. I will describe the results of our initial application of this method to Halobacterium NRC-1 and H. pylori.

The first part of the procedure (biclustering) detects groupings of genes that are coherent across subsets of conditions (based on microarray data and upstream sequence), resulting in sets of gene/condition groupings and regulatory motifs for a given organism. The biclustering procedure uses motif detection and functional information, in the form of predicted association networks, to guide an iterative search for coherent gene/condition groupings, effectively integrating three major biological data types (co-expression, co-occurrence of transcription factor binding motifs, and previously known or predicted functional networks). The second phase of the regulatory network inference procedure, the Inferelator, then determines the network of regulatory influences that control the expression of each bicluster (based on microarray data, ChIP-chip data, and the results of the Biclusterer).
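
As a rough illustration of what "coherent across a subset of conditions" can mean, the sketch below scores a candidate gene/condition grouping with the mean squared residue of Cheng and Church. This is a swapped-in, expression-only criterion chosen for brevity. It is not the biclustering procedure described in this talk, which also draws on motifs and association networks. The function name and the toy expression matrix are hypothetical.

```python
from statistics import mean


def mean_squared_residue(expr, genes, conditions):
    """Cheng-Church mean squared residue of the submatrix expr[genes][conditions].

    Lower values mean the selected genes vary more coherently across the
    selected conditions; a purely additive pattern scores exactly 0.
    """
    sub = [[expr[g][c] for c in conditions] for g in genes]
    row_means = [mean(row) for row in sub]
    col_means = [mean(col) for col in zip(*sub)]
    overall = mean(row_means)
    residues = [
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(len(sub))
        for j in range(len(sub[0]))
    ]
    return mean(residues)


# Toy expression matrix (made-up values): rows are genes, columns are conditions.
expr = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [3.0, 4.0, 5.0, 7.0],
    [8.0, 1.0, 6.0, 2.0],
]

# Genes 0-2 rise together across conditions 0-2, so this grouping is coherent.
coherent = mean_squared_residue(expr, genes=[0, 1, 2], conditions=[0, 1, 2])
# Using every gene and every condition mixes in unrelated behaviour.
mixed = mean_squared_residue(expr, genes=[0, 1, 2, 3], conditions=[0, 1, 2, 3])
```

An iterative search of the kind described above would add or drop genes and conditions to improve such a score, with additional terms (here omitted) for motif co-occurrence and network association.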
