CisTF: Software to find Transcription Factor/Binding Site Pairs

Philip Benfey
Department of Biology
New York University
philip.benfey@nyu.edu

Ken Birnbaum
Department of Biology
New York University
kdb4348@is.nyu.edu

Dennis Shasha (code author)
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Motivation

Microarray experiments generate expression data. Entire genomic sequences are known. Many transcription factors are known. Wouldn't it be nice to make good guesses about transcription factor binding sites from this information? This suite of software is meant to do that.

Description

The cistf programs run in a high-performance interpreted environment called K.

To begin, please download the trial version of K from kx.com. Both K and our programs run equally well on Linux and Windows. If you need more space than the trial version of K provides, contact us.
Send email to shasha@cs.nyu.edu. If you care to describe your application, we'd be glad to hear about it. In any case, we will send you instructions for downloading the cistf program files. Here is a brief description of the programs.
- The parser converts each sequence into a set of candidate cis-elements. These may contain don't-care positions and may vary in length, depending on parameters documented at the head of the parser.
- The helper programs middle and later reorganize the data so that each candidate cis-element is associated with the genes in whose promoters it occurs. This is done for memory scalability.
- The main program implements cistf, correlating transcription factors with candidate cis-elements to find the highest correlations. You can find a description of the algorithm in the paper "cis Element/Transcription Factor Analysis (cis/TF): A Method for Discovering Transcription Factor/cis Element Relationships," Kenneth Birnbaum, Philip N. Benfey, and Dennis E. Shasha, Genome Res. 2001 11: 1567-1573.
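The three steps above can be illustrated with a minimal Python sketch. It is not the K code and not necessarily the statistic used in the paper. It assumes fixed-length words, at most one don't-care position (written N here), and a Pearson correlation between a transcription factor's expression profile and the mean profile of the genes carrying an element. All function names, gene names, and values are hypothetical.

```python
from collections import defaultdict
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def candidate_elements(seq, length):
    """Collect every word of the given length on both strands, plus
    variants with one interior position replaced by the don't-care 'N'."""
    found = set()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - length + 1):
            word = strand[i:i + length]
            found.add(word)
            for j in range(1, length - 1):
                found.add(word[:j] + "N" + word[j + 1:])
    return found

def index_by_element(promoters, length):
    """Map each candidate element to the set of genes whose promoter holds it."""
    index = defaultdict(set)
    for gene, seq in promoters.items():
        for element in candidate_elements(seq, length):
            index[element].add(gene)
    return index

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def score(tf_profile, element, index, expression):
    """ASSUMED statistic: correlate the TF profile with the mean profile
    of the genes whose promoter carries the element."""
    genes = [g for g in index.get(element, ()) if g in expression]
    if not genes:
        return 0.0
    mean_profile = [sum(expression[g][k] for g in genes) / len(genes)
                    for k in range(len(tf_profile))]
    return pearson(tf_profile, mean_profile)

# Hypothetical toy data: two short promoters, three experiments.
promoters = {"g1": "ACGTT", "g2": "AACGA"}
expression = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.0]}
tf_profile = [10.0, 20.0, 30.0]

index = index_by_element(promoters, 4)
best = score(tf_profile, "AACG", index, expression)
```

Building the element-to-genes index once, before any correlation is computed, mirrors the role described for middle and later: the main loop then looks up only gene sets and never rescans sequences.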
In addition, you will need to prepare three data files.
- measurement -- three columns: gene|experimentid|value
- transfac -- single-column list of genes that encode transcription factors
- sequence -- two columns: geneid|sequence. We currently take an upstream region of about 2000 bases. The parsing program will generate reverse complements.
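As a rough illustration of these three layouts, the following Python sketch reads each file into a simple structure. It assumes that fields are separated by a literal "|", that there is one record per line, and that there is no header line. None of this is confirmed beyond the column descriptions above. The function names and sample records are hypothetical.

```python
from collections import defaultdict

def read_measurements(lines):
    """Parse gene|experimentid|value rows into {gene: {experiment: value}}."""
    table = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        gene, experiment, value = line.split("|")
        table[gene][experiment] = float(value)
    return dict(table)

def read_transfac(lines):
    """Parse the single-column list of transcription factor genes."""
    return [line.strip() for line in lines if line.strip()]

def read_sequences(lines):
    """Parse geneid|sequence rows into {geneid: sequence}."""
    sequences = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        gene, seq = line.split("|", 1)
        sequences[gene] = seq
    return sequences

# Hypothetical sample records; real files would be opened with open(path).
measurements = read_measurements(["geneA|exp1|0.5", "geneA|exp2|1.5", "geneB|exp1|2.0"])
factors = read_transfac(["geneA", ""])
sequences = read_sequences(["geneB|ACGTACGT"])
```

Each reader accepts any iterable of lines, so the same functions work on an open file object or on a list of strings.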
Each line of output has the form: transcription factor|promising binding site|correlation|significance.
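Output in this form can be ranked with a few lines of Python. The sketch below assumes "|"-separated fields with numeric correlation and significance columns. The exact number format written by cistf is not documented here, and the sample rows are invented for illustration.

```python
def read_results(lines):
    """Parse tf|site|correlation|significance rows into tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tf, site, correlation, significance = line.split("|")
        rows.append((tf, site, float(correlation), float(significance)))
    return rows

def top_pairs(rows, n):
    """Return the n rows with the highest correlation."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]

# Hypothetical output lines, not real cistf results.
rows = read_results([
    "tfA|ACGTNA|0.62|0.01",
    "tfB|TTGACC|0.91|0.001",
    "tfA|GGNCCA|0.40|0.05",
])
best = top_pairs(rows, 2)
```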