Shasha Lab

Biological Computing Projects

I've worked on several projects over the last few years, notably with the plant biology labs of Gloria Coruzzi, Ken Birnbaum, Philip Benfey, and Rodrigo Gutierrez. More recently I'm working with Rich Bonneau's computational group.

Software to contribute to the visualization of the intersections and unions of collections of multiple experiments, multiple genomes or even multiple baseball players. Sungear has been called a Venn diagram on steroids. It should be useful for social scientists, cancer researchers and sports fanatics -- anyone concerned with trying to derive interesting information from several long lists of items (genes, proteins, people, players). Besides supporting set intersection and unions on items, Sungear relates those items to functional categories. This is joint work with Chris Poultney, Rodrigo Gutierrez, Manpreet Katari, Miriam Gifford, Brad Paley, and Gloria Coruzzi.
Combinatorial design software to specify the design of experiments over several input variables where most of those variables are considered to be unimportant. The goal is to explore a large search space with few experiments while guaranteeing certain properties. Joint work with Gloria Coruzzi, Peter Palenchar, Rodrigo Gutierrez, Andrei Kouranov, Laurence Lejay, and Michael Chou.
Analysis software to find analog circuits given the results of experiments. This is also joint work with the Coruzzi lab.
Machine learning work to infer the function of genes, the redundancy of genes, and regulatory networks. This is joint work with Gloria Coruzzi, Ken Birnbaum, Huangwen Chen, Aris Tsirigos, Lee Parnell, as well as help from Mehryar Mohri and Corrina Cortez.
Work to aid the first stages of protein docking. We call this protein speed-dating. This is joint work with Chris Poultney and the Bonneau lab.
Software to find the functionality of genes based on species traits. This has discovered flagella genes already. Please see a description here. Joint work with Mitchell Levesque, Wook Kim, Michael G. Surette, and Philip Benfey.
Software to try to infer binding sites of transcription factors as well as causality and repression among transcription factors. We have started with yeast, but the software is quite general. It uses a mix of pattern analysis and heuristics based on statistics. Joint work with Philip Benfey and Ken Birnbaum and Aris Tsirigos. Ken Birnbaum has led a project concerned with cell-sorting and analysis for Arabidopsis that was published in Science. Ken, Sunayan Bandyopadhyay, Yuan-Chien Yang, and I have also embarked on a project to predict which genes are likely to be redundant in a collection.
Algorithms for phylogenetic reconstruction based on graphs (a.k.a., networks) instead of trees. Our algorithm takes as input the best phylogenetic trees for some set of genes and forms a parsimonious graph. Joint work primarily with Ken Birnbaum, Matt Olim, and Chien-I Liao, though with helpful summer projects by Chandni Rajan Valiathan and David Almond.

Related Pattern Recognition Projects

Software and algorithms to find the highest correlated streams among thousands of streams extremely efficiently. That is joint work with award-winning Yunyue Zhu, Xiaojian Zhao, and Zhihua Wang. We have a paper in VLDB 2002 describing the algorithms. More recent work has involved Richard Cole, Tyler Neylon, and Lee-Ad Gotlieb. You can find Zhihua's and Xiaojian's theses on the CS department's thesis web site . Work on "uncooperative" time series (time series in which the power of the time series is spread over all Fourier coefficients) that uses random projections is discussed here. Tyler Neylon has extended the pair-wise correlation work to finding approximate linear dependencies among multiple time series as you can see in his thesis
Further work along those lines allows us to do query by humming We have a paper in SIGMOD 2003 describing the algorithms. Our demo can be found online here. Current students working on query by humming are: Zhihua Wang, Steve Toub, Michael Schidlowsky, Kevin Cox, and Megan McNulty.
These algorithms are summarized in the book High Performance Discovery in Time Series: techniques and case studies published by Springer Verlag. There are a few > errata in the book. We currently have more than 1000 songs, including most of the Beatles and other popular folk and melodious rock songs. The demo makes use of dynamic time warping, statistics-based filters, and adaptive boosting. You can download both midi files and lyrics. If you hum a tune that has a match in our database and you upload both your humming and the match information, we will pay $2 to Nature Conservancy.
Software and algorithms to find bursts in time series data. We have a paper in KDD 2003 describing the algorithm. That is joint work with Zhihua Wang, Xiaojian Zhao, and Yunyue Zhu. Xin Zhang is working on a very nice thesis to improve this further.
Software to make tree, graph and structure searching as fast as keyword searching. A paper describing that was published in ACM Pods 2002 as an invited tutorial in pdf. This uses a combination of geometric hashing, combinatorial techniques from approximate tree matching, and generalizations of suffix trees. To download software for the current version of the graph search software, please see graphblast (an improvement on graphgrep). We have deployed a similar module for use within the cytoscape software called To cluster graphs, please see GraphClust. If you need to compare schedules or partial orders please see SchedMatch. If you need tree comparison for ordered trees (where sibling order matters), please see treegrep . If you need searching among unordered trees (with applications to XML among others), please see unordered tree searcher . If you want to compare unordered trees, please see our cousin-distance based unordered comparison software You can find our PODS 2002 presentation here.
An important application of this work has been to perform structural searches in phylogenetic databases. As of November 2003, about 500 users worldwide have accessed the tools over 7000 times. The tools have been integrated into Joint work with Jason Wang, Kaizhong Zhang, Rosalba Giugno, Diego Reforgiato Recupero and Alfredo Ferro. Here is Rosalba Giugno's very nice thesis.
In addition, we have written several papers:
1. A review of approximate tree matching algorithms by (the final version is in the book Pattern Matching in Strings, Trees, and Arrays by Apostolico and Galil published by Oxford University Press) paper in reverse order, just for fun.
2. Approximate graph matching for acyclic graphs postscript.
3. Discovering patterns in protein sequences postscript.
With the group of Alfredo Ferro (including Rosalba Giugno for GraphGrep, and Domenico Cantone, Alfredo Pulvirenti, Tarcisio Maugeri, and Giuseppe Piqola for the computer science side of clustering), we have developed a set of tools for finding multiple alignments that we believe improve on the Clustal package as well as innovative clustering algorithms.
Our goal is to make it possible to discover patterns (i.e. do data mining) in strings, trees, and graphs given a pattern metric. On the way we have worked out or borrowed algorithms to match a pattern against data and find a distance between the pattern and the data. We are really good on trees and are getting good on graphs. We have software available by anonymous ftp and certain experimental software is available from me. Collaborators: Kaizhong Zhang (U of Western Ontario), Jason Wang (NJ Institute of Technology), and Bruce Shapiro (National Cancer Institute).
We have edited a book on this subject: Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications Jason Wang, Bruce Shapiro, and Dennis Shasha (Eds.) Oxford University Press, 1999.
Partly on the strength of that book, I have become the editor of a book series in Genomics and Bioinformatics . The intent of the series is to publish graduate level texts for working researchers in the field. We are honored to have a superb advisory board: Michael Ashburner, Amos Bairoch, David Botstein, Charles Cantor, Lee Hood, Minoru Kanehisa, Raju Kucherlapati, and Craig Venter.
More recently, we have edited a second book on the subject of data mining Data Mining in Bioinformatics by J. T. L. Wang, M. J. Zaki, H. T. T. Toivonen and D. Shasha. published by Springer-Verlag in 2005.

Software

Statstream (incremental correlation in time series) software description

Tree searching (finding a small tree in a database of large trees; where the order among siblings doesn't matter) software description

Tree difference (finding the difference between trees, where the order among siblings does matter) software description

Graph comparison (heuristic techniques for comparing graphs) software description

Graph clustering (finding interesting motifs in graphs) software description

StratPal (software for designing strategies) software description

Last modified January 2008.