Dennis Shasha's Research Summary

Goals

I work on quite a few different projects -- most having to do with large data and pattern matching or machine learning. Area interests include computational biology and biomedicine (data analysis, visualization, experimental design), time series (fast algorithms for fundamental problems such as correlation and burst detection as well as applications like query by humming), and pattern matching in trees and graphs. A general pattern? I like puzzles. The other general pattern is that I program a lot in a fast, extremely exressive language called K. The third pattern is that I work with excellent people -- undergraduates, master's students, doctoral students, post-docs, other profs, and non-academics. If you have skill, energy, and initiative and if you like what you see below, then drop me a line. I might have a project for you.

Biological Computing

I've worked on several projects over the last few years. The first five projects below are new:

Software to try to infer binding sites of transcription factors as well as causality and repression among transcription factors. We have started with yeast, but the software is quite general. It uses a mix of pattern analysis and heuristics based on statistics. Joint work with Philip Benfey and Ken Birnbaum and Aris Tsirigos. Ken Birnbaum has led a project concerned with cell-sorting and analysis for Arabidopsis that was published in Science.

Combinatorial design software to specify the design of experiments over several input variables where most of those variables are considered to be unimportant. The goal is to explore a large search space with few experiments while guaranteeing certain properties. Joint work with Gloria Coruzzi, Peter Palenchar, Andrei Kouranov, Laurence Lejay, and Michael Chou.
Analysis software to find analog circuits given the results of experiments. Also, joint work with the Coruzzi lab.
Analysis software and a database called Pathexplore to issue various queries about metabolic, proteomic, or other pathways. Play with it here. Joint work with Peter Palenchar also formerly of the Coruzzi lab.
Software to find the functionality of genes based on species traits. This has discovered flagella genes already. Please see a description here. Joint work with Mitchell Levesque, Wook Kim, Michael G. Surette, and Philip Benfey.
Algorithms for phylogenetic reconstruction based on graphs (a.k.a., networks) instead of trees. Our algorithm takes as input the best phylogenetic trees for some set of genes and forms a parsimonious graph. Joint work with Ken Birnbaum and Matt Olim. Some comparisons with other approaches done by Chandni Rajan Valiathan. New algorithms being developed by Chien-I Liao.
Visualization software called Sungear to help relate the results of many different experiments to one another as well as to functional categories. By results, one could mean gene collections (e.g., of genes induced in an experiment), edge collections (e.g., of protein interactions), or any collections. Functional categories could be, for example, gene ontologies or pathway information. This software is available for use now.

Pattern Recognition

Software and algorithms to find the highest correlated streams among thousands of streams extremely efficiently. That is joint work with award-winning Yunyue Zhu and Zhihua Wang. We have a paper in VLDB 2002 describing the algorithms. Further work along those lines allows us to do query by humming We have a paper in SIGMOD 2003 describing the algorithms. Our demo can be found online here. We currently have more than 1000 songs, including most of the Beatles and other popular folk and melodious rock songs. The demo makes use of dynamic time warping, statistics-based filters, and adaptive boosting. You can download both midi files and lyrics. If you hum a tune that has a match in our database and you upload both your humming and the match information, we will pay $2 to Nature Conservancy.
Software and algorithms to find bursts in time series data. We have a paper in KDD 2003 describing the algorithm. That is joint work with Zhihua Wang, Xiaojian Zhao, and Yunyue Zhu.
Software to make tree, graph and structure searching as fast as keyword searching. A paper describing that was published in ACM Pods 2002 as an invited tutorial in pdf. This uses a combination of geometric hashing, combinatorial techniques from approximate tree matching, and generalizations of suffix trees. To download software for the current version of the graph search software, please see graphblast (an improvement on graphgrep). To cluster graphs, please see GraphClust. If you need tree comparison for ordered trees (where sibling order matters), please see treegrep . If you need searching among unordered trees (with applications to XML among others), please see unordered tree searcher . If you want to compare unordered trees, please see our cousin-distance based unordered comparison software You can find our PODS 2002 presentation here.
An important application of this work has been to perform structural searches in phylogenetic databases. As of November 2003, about 500 users worldwide have accessed the tools over 7000 times. The tools have been integrated into Joint work with Jason Wang, Kaizhong Zhang, Rosalba Giugno, and Diego Reforgiato Recupero. Here is Rosalba Giugno's very nice thesis.
In addition, we have written several papers:
1. A review of approximate tree matching algorithms by (the final version is in the book Pattern Matching in Strings, Trees, and Arrays by Apostolico and Galil published by Oxford University Press) postscript.
2. Approximate graph matching for acyclic graphs postscript.
3. Discovering patterns in protein sequences postscript.
With the group of Alfredo Ferro (including Rosalba Giugno for GraphGrep, and Domenico Cantone, Alfredo Pulvirenti, Tarcisio Maugeri, and Giuseppe Piqola for the computer science side of clustering), we have developed a set of tools for finding multiple alignments that we believe improve on the Clustal package as well as innovative clustering algorithms.
Our goal is to make it possible to discover patterns (i.e. do data mining) in strings, trees, and graphs given a pattern metric. On the way we have worked out or borrowed algorithms to match a pattern against data and find a distance between the pattern and the data. We are really good on trees and are getting good on graphs. We have software available by anonymous ftp and certain experimental software is available from me. Collaborators: Kaizhong Zhang (U of Western Ontario), Jason Wang (NJ Institute of Technology), and Bruce Shapiro (National Cancer Institute).
We have edited a book on this subject: Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications Jason Wang, Bruce Shapiro, and Dennis Shasha (Eds.) Oxford University Press, 1999.
Partly on the strength of that book, I have become the editor of a book series in Genomics and Bioinformatics . The intent of the series is to publish graduate level texts for working researchers in the field. We are honored to have a superb advisory board: Michael Ashburner, Amos Bairoch, David Botstein, Charles Cantor, Lee Hood, Minoru Kanehisa, Raju Kucherlapati, and Craig Venter.

Tamper-resistant file systems

With David Mazieres, I have been working on a network file system that supports the following scenario (among others): a group of people work together in a distributed fashion but trust neither one another nor their system administrator. For example, they might outsource their system administration to an organization that they don't necessarily trust. Historically, that organization could do subtle changes to their data, read their data, and so on. The system we have designed makes makes tampering changes quickly detectable. (Secrecy can be achieved with straightforward cryptographic techniques.) The only assumption we make is that each client has a secret signature key. Please see our PODC paper here . Some promising performance results due to the great efforts of David, Jinyuan Li, and Maxwell Krohn appear in OSDI.

AQuery: a database system for querying ordered data

An order-dependent query is one whose result (interpreted as a multi-set) changes if the order of the input records is changed. In a stock-quotes database, for instance, retrieving all quotes concerning a given stock for a given day does not depend on order, because the collection of quotes does not depend on order. By contrast, finding the five price moving average in a trade table gives a result that depends on the order of the table. Query languages based on the relational data model can handle order-dependent queries only through add-ons. SQL:1999, for example, permits the use of a data ordering mechanism called a ``window'' in limited parts of a query. As a result, order-dependent queries become difficult to write in those languages and optimization techniques for these features, applied as pre- or post-enumerating phases, are generally crude. The goal of our work is to show that when order is a property of the underlying data model and algebra, writing order-dependent queries in a language can be natural as is their optimization. We introduce AQuery, an SQL-like query language and algebra t has from-the-ground-up support for order. We also present a framework for optimization of the order-dependent queries categories it expresses. The framework is able to take advantage of the large body of query transformations on relational systems while incorporating new ones described here. We show by experiment that the resulting system is orders of magnitude faster than current SQL:1999 systems on many natural order-dependen You can see our paper here. You can see a power point presentation here. Joint work with Alberto Lerner. We are looking for Beta users.

AJAX: a data cleaning system

Ajax is a framework for data cleaning. It includes an implementation of comparison, clustering, and schema tracking for all aspects of data cleaning. It's also a framework for future additions to data cleaning. Joint work with Helena Galhardas, Dana Florescu, and Eric Simon of Inria. Helena is currently implementing furiously.

Le Subscribe: a publish-subscribe system

We (this is joint work with Francoise Fabret and Francois Llirbat, and Joao Pereira at INRIA) have implemented a publish-subscribe system for extremely high performance and distributed functionality.
Our subscriptions are conjunctions of the form (attribute; value; relop) e.g.
(movie; toy story II; =), (price; < ; $10)
Our events are also conjunctions but of the form (attribute; value) and implicitly on equality, e.g.
(movie; toy story II), (city; paris)
Our performance is the following: For 400,000 subscriptions having 5 attributes of which one is inequality and four are equality, and events with 5 attributes, we can process events at 5 milliseconds per event on a machine with Linux SO, i686 CPU at 500MHz with 1G of RAM.

Fault Tolerant Parallel Programming

Our Persistent Linda project extends the Linda system developed primarily by Dave Gelernter and Nick Carriero at Yale. We use a slightly weakened form of transaction combined with checkpoints to support fault tolerance (appeared in Proc of 13th Symp on Fault Tolerant Distributed Systems) postscript. You can get a copy of PLinda from our web site. Collaborators: Brian Anderson, Karp Jeong, Suren Talla, Peter Wyckoff, Bin Li, all students or former students at NYU In addition, Ekkart Kindler has done a formal proof method for verifying long-running parallel computations.

Database Internals Work

Buffer paging algorithms that beat LRU in several application areas. Collaborators: Ted Johnson (U. of Florida), industrial collaborators notably at database companies (appeared in VLDB 94) postscript. By the way, part of Johnson's thesis showed that it is a good idea to be lazy when you're designing B-trees: as long as there are more inserts than deletes, free-at-empty is a better strategy than merge-at-half (appeared in Journal of Computer Science and Systems, Aug. 1993) postscript.
Ted and I have done another piece of work about a data structure for decision support called ``Hierarchically Split Cube Forests''. postscript .

Database Tuning and Wall Street

Database tuning is the activity of making your database system run faster. Though each vendor will tell you a different story about the subject, it turns out that the underlying principles are the same. (As a consultant, I've applied these principles for companies in telecommunications, finance, on-line travel companies, and on-line gaming.) If you are interested in notes on the subject, then please go ahead and download My latest book on the subject, co-authored with Philippe Bonnet, is called Database Tuning: principles, experiments, and troublieshooting techniques You can find the experiments from that work here.
With Arthur Whitney, Steve Apter, and the rest of the K community, I have been working on simplifying transaction processing in a large main memory setting. The technique uses a very fast, interpreted vector-processing language. It avoids concurrency control but allows concurrency. It is tailored to financial applications. postscript version and pdf version .
In Sigmod 1997, I present some lessons learned from my experience on Wall Street. The lessons have to do with configuration for global distributed systems, tuning, and language issues. Lessons from Wall Street (postscript) .
In Sigmod 2001, Arthur Whitney and I have a paper describing time series databases for finance that he built. Lots o' Ticks: real-time high performance time series queries on billions of trades and quotes (pdf) .

Real-Time Scheduling

We have developed algorithms for scheduling overloaded sporadic real-time tasks in a uniprocessor setting (in Siam J. comp) postscript and in a multi-processor setting (appeared in Theoretical Computer Science, July 1994) postscript.
We have found algorithms and bounds for scheduling sporadically arriving periodic tasks. That is, the instances of each task arrive at regular periods but the task can arrive at any instant. Our twist on this problem is that we allow certain instances of such tasks to be skipped. We look at schedulability in this context. pdf .
Collaborator: Gilad Koren

Thinksheet

A tool to tailor information flow for readers of complex (or boring) documents such as laws and a problem-solving tool generalizing spreadsheets. The tool integrates spreadsheets, rules, databases, and hypermedia. Collaborators: Roman Yangarber, Peter Piatko, Daoi Lin, Minna Cha, Dave Tanzer, Alex Shenker, Mike Leder, Julia Tolpin, Mirella Shannon, and Chris Jones, all students at NYU.

Dave Tanzer's thesis is on efficient backwards reasoning in the thinksheet context. The work has been completed.

Recently, we (Stacey Kuznetsov and I) have designed a new system called Stratpal. Stratpal is a simplified thinksheet that can be used to model laws and strategies. Its key features is that given a linear document, it is easy to create a StratPal application that can be improved incrementally over time.

For Kids

Please see the earthtime project not yet functional

To learn arithmetic and elementary algebra in a fun way. Look at superply

Fun Stuff

Dr. Ecco, a mathematical detective cracks mysteries by solving puzzles. Some are combinatoric, e.g. what is the smallest number of people who could be at a party in which everyone has shaken hands with three other people, except one person who has only shaken hands with one other person? Others involve algorithmic aspects, including the simplest zero knowledge protocols known to (wo)man.
The first books about him were first published by W. H. Freeman, (1-)212-576-9400:
The Puzzling Adventures of Dr. Ecco in 1988 and
Codes, Puzzles, and Conspiracy in 1992.
Dover has republished The Puzzling Adventures of Dr. Ecco in 1998. You can get all my books from online bookstores such as amazon.com.
I have had the pleasure of writing a mathematical puzzle column for Dr. Dobb's Journal and currently write the monthly puzzle column for Scientific American . These puzzles are good to attack with a computer. The best of those puzzles, somewhat improved, have appeared in Dr. Ecco's Cyberpuzzles published by W. W. Norton in July, 2002.
You can see a talk called Upstart Puzzles that I gave at the Canadian Mathematical Society summer meeting at Edmonton in June 2003. A collection of the early Scientific American puzzles appeared as Puzzling Adventures published by W. W. Norton in 2005. A new book called The Puzzler's Elusion will be published by Avalon Press in 2006.
Out of Their Minds: the lives and discoveries of 15 great computer scientists is a book of biographies of 15 great computer scientists. You can see me on realplayer video talking about the book.
Finally, all the smart Russian students I've had has inspired me to collaborate with playwright Marina Shron in a book about recent Russian immigrants entitled Red Blues: voices from the last wave of Rusian immigrants. You can find excerpts of the book here.

Last modified February 2006.