CisTF: Software to Find Transcription Factor/Binding Site Pairs
Philip Benfey
Department of Biology
New York University
philip.benfey@nyu.edu
Ken Birnbaum
Department of Biology
New York University
kdb4348@is.nyu.edu
Dennis Shasha (code author)
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Motivation
Microarray experiments generate expression data.
Entire genomic sequences are known.
Many transcription factors are known.
Wouldn't it be nice to make good guesses about transcription factor
binding sites from this information?
This suite of software is meant to do that.
Description
The cistf programs run in K, a high-performance
interpreted environment.
- To begin, please download trial K from
kx.com.
Both K and our program run equally well on Linux and Windows.
If you need more space than is available in trial K, contact us.
- Send email to shasha@cs.nyu.edu.
If you care to describe your application,
we'd be glad to hear about it.
In any case, we will send you instructions
for downloading the cistf program files.
Here is a brief description.
- The parser converts the sequence into a set of candidate cis-elements.
These may contain don't-care positions and may vary in length, depending
on parameters documented at the head of the parser.
- The helper programs middle and later reshape the data so that each
candidate cis-element is associated with the genes in whose promoters
it occurs.
This is done for memory scalability.
- The main program implements cistf, correlating transcription factors
with candidate cis-elements to find the highest correlations.
You can find a description of the algorithm in the paper
"cis Element/Transcription Factor Analysis (cis/TF):
A Method for Discovering Transcription Factor/cis Element Relationships,"
Kenneth Birnbaum, Philip N. Benfey, and Dennis E. Shasha,
Genome Res. 2001 11: 1567-1573.
- In addition, you will need to prepare three data files.
  - measurement -- three columns: gene|experimentid|value
  - transfac -- single-column list of genes that encode transcription
    factors
  - sequence -- two columns: geneid|sequence
We currently use an upstream region of about 2000 bases for each gene.
The parsing program will generate reverse complements.
- The output is of the form:
transcription factor|promising binding site|correlation|significance
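
To make the data flow above concrete, here is a minimal Python sketch of the pipeline. It is not the cistf code (which is written in K) and it does not reproduce the published algorithm. All function names are hypothetical, and several details are assumptions: a single fixed element length, at most one don't-care position written as N, a complete measurement for every gene in every experiment, and a plain Pearson correlation between a transcription factor's profile and the mean profile of the genes carrying an element. Significance is not computed.

```python
from collections import defaultdict
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A/C/G/T."""
    return seq.translate(COMPLEMENT)[::-1]


def candidates(seq, length=6, wildcard="N"):
    """Yield each window of `length` bases, plus every variant in which one
    interior position is replaced by a don't-care symbol (an assumption)."""
    for i in range(len(seq) - length + 1):
        word = seq[i:i + length]
        yield word
        for j in range(1, length - 1):
            yield word[:j] + wildcard + word[j + 1:]


def parse_sequences(lines):
    """Parse geneid|sequence rows into a dict gene -> upstream sequence."""
    return {gene: seq.upper()
            for gene, seq in (line.strip().split("|") for line in lines)}


def parse_measurements(lines):
    """Parse gene|experimentid|value rows into gene -> {experiment: value}."""
    table = defaultdict(dict)
    for line in lines:
        gene, experiment, value = line.strip().split("|")
        table[gene][experiment] = float(value)
    return table


def element_to_genes(sequences, length=6):
    """Invert gene -> sequence into candidate element -> set of genes,
    scanning both the given strand and its reverse complement."""
    index = defaultdict(set)
    for gene, seq in sequences.items():
        for strand in (seq, reverse_complement(seq)):
            for element in candidates(strand, length):
                index[element].add(gene)
    return index


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_pairs(measurements, tf_genes, index, min_genes=2):
    """Return (tf, element, correlation) rows, strongest correlation first."""
    experiments = sorted({e for row in measurements.values() for e in row})
    rows = []
    for element, genes in index.items():
        for tf in tf_genes:
            if tf not in measurements:
                continue
            # Genes carrying the element, excluding the factor itself.
            targets = [g for g in genes if g in measurements and g != tf]
            if len(targets) < min_genes:
                continue
            mean_profile = [sum(measurements[g][e] for g in targets) / len(targets)
                            for e in experiments]
            tf_profile = [measurements[tf][e] for e in experiments]
            rows.append((tf, element, pearson(tf_profile, mean_profile)))
    rows.sort(key=lambda row: -abs(row[2]))
    return rows


if __name__ == "__main__":
    sequences = parse_sequences(["g1|ACGTAA", "g2|ACGTCC"])
    measurements = parse_measurements([
        "tf1|e1|1", "tf1|e2|2", "tf1|e3|3",
        "g1|e1|2", "g1|e2|4", "g1|e3|6",
        "g2|e1|4", "g2|e2|8", "g2|e3|12",
    ])
    index = element_to_genes(sequences, length=4)
    for tf, element, r in rank_pairs(measurements, ["tf1"], index):
        print(f"{tf}|{element}|{r:.3f}")
```

Building the element-to-genes index first, as the helper programs are described as doing, means each candidate element is visited once during the correlation step rather than once per gene.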