Shasha Lab
Biological Computing Projects
I've worked on several projects over the last few years,
notably with the plant biology labs of Gloria Coruzzi, Ken Birnbaum,
Philip Benfey, and Rodrigo Gutierrez.
More recently I'm working
with Rich Bonneau's computational group.
-
Software to contribute to the visualization of the intersections
and unions of collections of
multiple experiments, multiple genomes or even multiple
baseball players.
Sungear
has been called a Venn diagram on steroids.
It should be useful for social scientists, cancer researchers
and sports fanatics -- anyone concerned with trying
to derive interesting information from several long lists of
items (genes, proteins, people, players).
Besides supporting set intersection and unions on items,
Sungear relates those items to functional categories.
This is joint work with Chris Poultney, Rodrigo Gutierrez, Manpreet Katari,
Miriam Gifford, Brad Paley, and Gloria Coruzzi.
-
Combinatorial
design software
to specify the design of experiments over
several input variables where most of those variables
are considered to be unimportant.
The goal is to explore a large search space with few experiments
while guaranteeing certain properties.
Joint work with Gloria Coruzzi, Peter Palenchar, Rodrigo Gutierrez,
Andrei Kouranov, Laurence Lejay,
and Michael Chou.
-
Analysis software to find analog circuits given the results
of experiments.
This is also joint work with the Coruzzi lab.
-
Machine learning work to infer the function of genes, the redundancy
of genes, and regulatory networks.
This is joint work with Gloria Coruzzi, Ken Birnbaum, Huangwen Chen,
Aris Tsirigos, Lee Parnell, as well as help from Mehryar Mohri and
Corrina Cortez.
-
Work to aid the first stages of protein docking.
We call this protein speed-dating.
This is joint work with Chris Poultney and the Bonneau lab.
-
Software to find the functionality of genes based on
species traits.
This has discovered flagella genes already.
Please
see a description here.
Joint work with Mitchell Levesque,
Wook Kim, Michael G. Surette, and Philip Benfey.
-
Software to try to infer binding sites of transcription factors
as well as causality and repression among transcription factors.
We have started with yeast, but the software is quite general.
It uses a mix of pattern analysis and heuristics based on statistics.
Joint work with Philip Benfey and Ken Birnbaum and Aris Tsirigos.
Ken Birnbaum has led a project concerned with
cell-sorting and analysis for Arabidopsis that was published in
Science.
Ken, Sunayan Bandyopadhyay, Yuan-Chien Yang,
and I have also embarked on a project
to predict which genes are likely to
be redundant in a collection.
-
Algorithms for phylogenetic reconstruction based on graphs (a.k.a., networks)
instead of trees.
Our
algorithm
takes as input the best phylogenetic trees for
some set of genes and forms a parsimonious graph.
Joint work primarily with Ken Birnbaum, Matt Olim, and Chien-I Liao,
though with helpful summer projects by
Chandni Rajan Valiathan
and David Almond.
Related Pattern Recognition Projects
-
Software and algorithms to find
the highest correlated streams
among thousands of streams extremely efficiently.
That is joint work with
award-winning
Yunyue Zhu, Xiaojian Zhao,
and Zhihua Wang.
We have a paper in VLDB 2002 describing the algorithms.
More recent work has involved Richard Cole,
Tyler Neylon, and Lee-Ad Gotlieb.
You can find Zhihua's and Xiaojian's theses on the
CS department's thesis web site
.
Work on "uncooperative" time series (time series in which the
power of the time series
is spread over all Fourier coefficients) that uses random projections
is discussed
here.
Tyler Neylon has extended the pair-wise correlation work
to finding approximate linear dependencies among multiple
time series as you can see in his
thesis
Further work along those lines allows us to do query by humming
We have a paper in SIGMOD 2003 describing the algorithms.
Our demo can be found online
here.
Current students working on query by humming are:
Zhihua Wang, Steve Toub, Michael Schidlowsky, Kevin Cox, and Megan McNulty.
These algorithms are summarized in the book
High Performance Discovery in Time Series: techniques and case studies
published by Springer Verlag.
There are a few
>
errata
in the book.
We currently have more than 1000 songs, including most of the Beatles
and other popular folk and melodious rock songs.
The demo makes use of dynamic time warping, statistics-based filters,
and adaptive boosting.
You can download both midi files and lyrics.
If you hum a tune that has a match in our database
and you upload both your humming and the match information,
we will pay $2 to Nature Conservancy.
-
Software and algorithms to
find bursts in time series data.
We have a paper in KDD 2003 describing the algorithm.
That is joint work with Zhihua Wang, Xiaojian Zhao, and Yunyue Zhu.
Xin Zhang is working on a very nice thesis to improve this further.
-
Software to make tree, graph and structure searching as fast
as keyword searching.
A paper describing that was published in ACM Pods 2002 as an invited
tutorial
in pdf.
This uses a combination of geometric hashing, combinatorial
techniques from approximate tree matching, and generalizations
of suffix trees.
To download software for the current version
of the graph search software, please see
graphblast (an improvement on graphgrep).
We have deployed a similar module for use within the cytoscape software
called
To cluster graphs, please see
GraphClust.
If you need to compare schedules or partial orders please see
SchedMatch.
If you need tree comparison for ordered trees (where
sibling order matters), please see
treegrep .
If you need searching among unordered trees
(with applications to XML among others), please
see
unordered tree searcher .
If you want to compare unordered trees, please see
our cousin-distance based unordered comparison software
You can find our PODS 2002 presentation
here.
An important application of this work has been to perform
structural searches in phylogenetic databases. As of November 2003,
about 500 users worldwide have accessed the tools over 7000 times.
The
tools
have been integrated into
Joint work with Jason Wang, Kaizhong Zhang, Rosalba Giugno,
Diego Reforgiato Recupero
and Alfredo Ferro.
Here is
Rosalba Giugno's very nice thesis.
In addition, we have written several papers:
1. A review of approximate tree matching algorithms by
(the final version is in the book
Pattern Matching in Strings, Trees, and Arrays
by Apostolico
and Galil published by Oxford University Press)
paper in reverse order, just for fun.
2. Approximate graph matching for acyclic graphs
postscript.
3. Discovering patterns in protein sequences
postscript.
-
With the group of Alfredo Ferro (including Rosalba Giugno for
GraphGrep, and
Domenico Cantone, Alfredo Pulvirenti, Tarcisio
Maugeri, and Giuseppe Piqola for the computer science
side of clustering), we have developed
a set of tools for
finding multiple alignments that we believe improve on the
Clustal package as well as innovative clustering algorithms.
-
Our goal is to make it possible to discover patterns (i.e.
do data mining) in strings,
trees, and graphs given a pattern metric.
On the way we have worked out or borrowed algorithms
to match a pattern against data and find a distance between the pattern
and the data.
We are really good on trees
and are getting good on graphs.
We have software available by anonymous ftp
and certain experimental software is available from me.
Collaborators: Kaizhong Zhang (U of Western Ontario),
Jason Wang (NJ Institute of Technology),
and Bruce Shapiro (National Cancer Institute).
We have edited a book on this subject:
Pattern Discovery in Biomolecular Data:
Tools, Techniques, and Applications
Jason Wang, Bruce Shapiro, and Dennis Shasha (Eds.)
Oxford University Press, 1999.
Partly on the strength of that book, I have become
the editor of a book series in Genomics
and Bioinformatics . The intent of the series
is to publish graduate level texts for working researchers
in the field.
We are honored to have a superb advisory board:
Michael Ashburner,
Amos Bairoch,
David Botstein,
Charles Cantor,
Lee Hood,
Minoru Kanehisa,
Raju Kucherlapati, and
Craig Venter.
More recently, we have edited a second book on the subject of data mining
Data Mining in Bioinformatics
by J. T. L. Wang, M. J. Zaki, H. T. T. Toivonen and D. Shasha.
published by Springer-Verlag in 2005.
Software
Statstream (incremental correlation in time series)
software description
Tree searching (finding a small tree in a database of large trees;
where the order among siblings doesn't matter)
software description
Tree difference (finding the difference between trees, where
the order among siblings does matter)
software description
Graph comparison (heuristic techniques for comparing graphs)
software description
Graph clustering (finding interesting motifs in graphs)
software description
StratPal (software for designing strategies)
software description
Last modified January 2008.