Trait to Gene -- software

Programmer: Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Paper citation and collaborators: Trait-To-Gene: A Computational Method for Predicting the Function of Uncharacterized Genes. Mitchell Levesque, Dennis Shasha, Wook Kim, Michael G. Surette, and Philip Benfey.

Motivation

An important use of homology in genomics is to infer the function of unknown genes from well-characterized homologs. However, more refined uses of homology are necessary to identify and annotate the large number of homologous but uncharacterized genes. We have designed a series of algorithms that make functional predictions based on the principles of set theory. Our first analysis detects perfect matches between sets of organisms with a particular trait and sets of homologous genes. Our second set of algorithms uses Boolean combinations of genomes and traits to infer gene function. By finding sets of orthologous genes that complement each other in evolutionary lineages, these algorithms predict that other previously uncharacterized, highly conserved genes may be involved in a well-studied pathway. This software is available to the community.

Installation, Brief Explanation, and Execution

Our software runs in a high performance interpreted environment called K.

To begin with, therefore, please download trial K from for a Sun version and a pair of files for Windows consisting of k.exe and k20.dll and a . K and our program run equally well on Linux and Windows.
You must also download our software: bac.k, cog.k, and getcogs.k.
The data we used are in two files: microorganisms, the binary character matrix in comma-delimited format, and cogtxt, the list of COGs from Koonin et al. that is archived annually at NCBI.
If you care to describe your application, send email to shasha@cs.nyu.edu. We'd be glad to hear about it.

Description of Software

bac.k--the main analysis software. It has one internal parameter, similarmin, which sets the threshold for the similarity measure. To adjust it, search bac.k for the current similarmin value and replace it with the new one.
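
One way to locate the setting before editing it by hand is sketched below, assuming a POSIX shell. The function name show_similarmin is hypothetical and not part of the software; the sketch only lists lines and does not change bac.k.

```shell
# Sketch: print every line of bac.k that mentions similarmin, with its
# line number, so the threshold can be found and edited by hand.
# show_similarmin is a hypothetical helper name, not part of the software.
show_similarmin() {
    file=${1:-bac.k}    # file to search; defaults to bac.k
    grep -n -- "similarmin" "$file"
}
```

Calling show_similarmin with no argument searches bac.k in the current directory.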

cog.k--read by bac.k; it parses the COG file.

getcogs.k--reads COG IDs from a constructed file T (which might hold all the COGs having a given trait) and writes them to two files:
Trows maintains COG arrangements that correspond to the trait.
Tunique is a duplicate-free list of all the COGs that are in Trows.

Executing the Software

Run k bac.k (this includes cog.k). It produces tempconclude and temptraitmap, the first of which constitutes hypotheses about associations between traits and genes on genomes. Post-processing depends on the trait of interest. Suppose we are interested in flagella:

grep "+flagella" temptraitcogs > flagella
grep "perfect" flagella > perfect
grep "similar" flagella > similar
grep "oneneeded" flagella > oneneeded
cp perfect T
k getcogs.k

This last step produces the files Trows and Tunique corresponding to perfect correspondences between COGs and flagella. Repeat the last two steps for similar matches as follows:
cp similar T
k getcogs.k

Repeat the last two steps again for oneneeded matches as follows:
cp oneneeded T
k getcogs.k
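
The post-processing steps above can be collected into one script. The following is a minimal sketch, assuming a POSIX shell. The function name postprocess_trait, the K variable, and the saved copies (Trows.perfect, Tunique.perfect, and so on) are illustrative and not part of the original software. The sketch also assumes that getcogs.k overwrites Trows and Tunique on every run and that the K interpreter returns to the shell when getcogs.k finishes.

```shell
#!/bin/sh
# Sketch: post-process bac.k output for one trait.
# Assumed names (not from the original software): postprocess_trait, K,
# and the per-category copies Trows.<kind> and Tunique.<kind>.

K=${K:-k}    # command that starts the K interpreter; override if needed

postprocess_trait() {
    trait=$1    # trait of interest, e.g. flagella
    input=$2    # trait/COG output file to filter, e.g. temptraitcogs

    # Keep only the lines for this trait; -F makes "+" a literal character.
    grep -F -- "+$trait" "$input" > "$trait" || return 1

    for kind in perfect similar oneneeded; do
        # A category with no matches simply yields an empty file.
        grep -F -- "$kind" "$trait" > "$kind" || true
        cp "$kind" T
        "$K" getcogs.k || return 1
        # Assumption: getcogs.k overwrites Trows and Tunique each time,
        # so keep a copy of each before the next pass.
        cp Trows "Trows.$kind"
        cp Tunique "Tunique.$kind"
    done
}
```

For the flagella example, the call would be postprocess_trait flagella temptraitcogs, run in the directory that holds getcogs.k and the output of bac.k.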

Caveats: The programs are set to evaluate a specific set of bacteria with their relevant abbreviations. To update the analysis, it is necessary to download a new cogtxt file from NCBI and then to change the organism abbreviations in bac.k, microorganisms, and cog.k so that they all agree. In addition, when updating the trait table, it is better to fill in trait information for as many genomes as possible.

Support

This material is based upon work partly supported by the United States National Science Foundation under grants IIS-9988636, 0115586, and MCB-0209754. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This support is greatly appreciated.