I
work on quite a few different projects -- most having
to do with large data and pattern matching or machine learning.
Area interests include computational biology and
biomedicine (data analysis,
visualization, experimental design), time series (fast algorithms
for fundamental problems such as correlation and burst detection
as well as applications like query by humming),
and pattern matching in trees and graphs.
A general pattern? I like puzzles.
A second general pattern is that I program a lot in a fast,
extremely exressive language
called
K.
A third pattern is that I work with excellent people -- undergraduates,
master's students, doctoral students, post-docs, other profs, and
non-academics.
If you have skill, energy, and initiative and if you like what you see below,
then drop me a line.
I might have a project for you.
Look at
Marianne Winslett's interview of me
in which I try to express my research philosophy.
I've worked on several projects over the last few years,
notably with the plant biology labs of Gloria Coruzzi, Ken Birnbaum,
Philip Benfey, and Rodrigo Gutierrez.
More recently with Rich Bonneau's computational group.
Chris Collins and Richard Kayne from NYU Linguistics, along with Linguistics
doctoral student Michael Taylor, have worked
with the computer science team consisting of Sangeeta Vishwanath,
Hiral Rajani, Jillian Kozyra, and I to build a
website to enable
the syntactic comparison of the world's languages.
The design ensures flexibility and high functionality
and already permits questions to be answered on a scale never before possible
in linguistics.
Further work along those lines allows us to do query by humming
We have a paper in SIGMOD 2003 describing the algorithms.
Our demo was attacked by hackers but will hopefully return.
Students who have worked on query by humming are:
Zhihua Wang, Steve Toub, Michael Schidlowsky, Kevin Cox, and Megan McNulty.
These algorithms are summarized in the book
High Performance Discovery in Time Series: techniques and case studies
published by Springer Verlag.
There are a few
>
errata
in the book.
We currently have more than 1000 songs, including most of the Beatles
and other popular folk and melodious rock songs.
The demo makes use of dynamic time warping, statistics-based filters,
and adaptive boosting.
You can download both midi files and lyrics.
If you hum a tune that has a match in our database
and you upload both your humming and the match information,
we will pay $2 to Nature Conservancy.
An important application of this work has been to perform
structural searches in phylogenetic databases. As of November 2003,
about 500 users worldwide have accessed the tools over 7000 times.
The
tools
have been integrated into
Joint work with Jason Wang, Kaizhong Zhang, Rosalba Giugno,
Diego Reforgiato Recupero
and Alfredo Ferro.
Here is
Rosalba Giugno's very nice thesis.
In addition, we have written several papers:
We have edited a book on this subject:
Pattern Discovery in Biomolecular Data:
Tools, Techniques, and Applications
Jason Wang, Bruce Shapiro, and Dennis Shasha (Eds.)
Oxford University Press, 1999.
Partly on the strength of that book, I have become
the editor of a book series in Genomics
and Bioinformatics . The intent of the series
is to publish graduate level texts for working researchers
in the field.
We are honored to have a superb advisory board:
Michael Ashburner,
Amos Bairoch,
David Botstein,
Charles Cantor,
Lee Hood,
Minoru Kanehisa,
Raju Kucherlapati, and
Craig Venter.
More recently, we have edited a second book on the subject of data mining
Data Mining in Bioinformatics
by J. T. L. Wang, M. J. Zaki, H. T. T. Toivonen and D. Shasha.
published by Springer-Verlag in 2005.
With David Mazieres, I have been working on a network file system
that supports the following scenario (among others): a group of
people work together in a distributed fashion but trust neither one
another nor their system administrator.
For example, they might outsource their
system administration to an organization that
they don't necessarily trust.
Historically, that organization could do subtle changes to their
data, read their data, and so on.
The system we have designed makes
makes tampering changes quickly
detectable.
(Secrecy can be achieved with straightforward cryptographic techniques.)
The only assumption we make is that each client has a secret signature key.
Please
see our PODC
paper here .
Some promising performance results
due to the great efforts of David, Jinyuan Li, and Maxwell Krohn
appear
in OSDI.
With Radu Sion and Peter Worth of Stony Brook, we are doing related
work
having to do with provably secure database outsourcing.
Suppose a mutually trusting group of clients want to use
the software provided by an outsourcer.
The guarantee is that the outsourcer will not be able to understand
the client data (because it is encrypted when
the outsourcer sees it), nor will the outsourcer know which
data any client accesses and the clients will enjoy full transactional
guarantees.
An order-dependent query is one whose result
(interpreted as a multi-set)
changes if the order of the input records is changed.
In a stock-quotes database, for instance, retrieving all quotes
concerning a given stock for a given day does not depend on order, because
the collection of quotes does not depend on order.
By contrast, finding the five price moving average in a trade table
gives a result that depends on the order of the table.
Query languages based on the relational data model
can handle order-dependent queries only through add-ons.
SQL:1999, for example,
permits the use of a data ordering mechanism called
a ``window''
in limited parts of a query.
As a result, order-dependent queries
become difficult to write in those languages and
optimization techniques for these features,
applied as pre- or post-enumerating phases, are generally crude.
The goal of our work is to show that when order
is a property of the underlying data model and algebra, writing
order-dependent queries in a language can be natural as is their optimization.
We introduce AQuery, an SQL-like query language and algebra
t has from-the-ground-up support for order.
We also present a framework for optimization of the order-dependent
queries categories it expresses.
The framework is able to take advantage of the large body of
query transformations on relational systems
while incorporating new ones
described here.
We show by experiment that the resulting system is orders of magnitude
faster than current SQL:1999 systems on many natural order-dependen
You can see our paper here.
You can see a power point presentation here.
Joint work with Alberto Lerner.
Ted and I have done
another piece of work about a data structure for decision support
called ``Hierarchically Split Cube Forests''.
postscript .
With Arthur Whitney, Steve Apter, and the
rest of the K community,
I have been working on simplifying transaction
processing in a large main memory setting.
The technique uses a very fast, interpreted vector-processing language.
It avoids concurrency control but allows concurrency.
It is tailored to financial applications.
postscript version and
pdf version .
In Sigmod 1997, I present some lessons learned from
my experience on Wall Street.
The lessons have to do with configuration for global
distributed systems, tuning, and language issues.
Lessons from Wall Street (postscript) .
In Sigmod 2001, Arthur Whitney and I have a paper describing
time series databases for finance that he built.
Lots o' Ticks: real-time high
performance time series queries on billions of trades and quotes
(pdf) .
We have found algorithms and bounds for scheduling
sporadically arriving periodic tasks.
That is, the instances of
each task arrive at regular periods but the task
can arrive at any instant.
Our twist on this problem is that we allow certain instances of such tasks
to be skipped. We look at schedulability in this context.
pdf
.
Collaborator: Gilad Koren
Recently, we (Stacey Kuznetsov and I) have designed a new system called
Stratpal.
Stratpal is a simplified thinksheet that can be used to model laws
and strategies. Its key features is that given a linear document,
it is easy to create a StratPal application that can be improved
incrementally over time.
K. Jacob of Morgan Stanley and I have designed a benchmark
for financial time series queries, called
FinTime
which database vendors and customers may find interesting.
Also on the subject of benchmarks, Yunyue Zhu
and I have designed a benchmark for bitemporal database management
systems, called
SpyTime .
Statstream (incremental correlation in time series)
software description
Tree searching (finding a small tree in a database of large trees;
where the order among siblings doesn't matter)
software description
Tree difference (finding the difference between trees, where
the order among siblings does matter)
software description
Graph comparison (heuristic techniques for comparing graphs)
software description
Graph clustering (finding interesting motifs in graphs)
software description
SchedMatch (find the differences between partial orders)
software description
Chris Poultney is the primary author of a game I designed
called the
Voronoi game.
It was written about in France because it appeared in many
science fairs.
A game to teach statistics in a fun way
requires a player to find the causes of a pandemic
so it's called the
Pandemic Game
The first books about him were first published by
W. H. Freeman, (1-)212-576-9400:
Last modified May 2006.
Biological Computing
Linguistics
Pattern Recognition
1. A review of approximate tree matching algorithms by
(the final version is in the book
Pattern Matching in Strings, Trees, and Arrays
by Apostolico
and Galil published by Oxford University Press)
paper in reverse order, just for fun.
2. Approximate graph matching for acyclic graphs
postscript.
3. Discovering patterns in protein sequences
postscript.
Tamper-resistant file systems
AQuery: a database system for querying ordered data
AJAX: a data cleaning system
Le Subscribe: a publish-subscribe system
(movie; toy story II; =),
(price; < ; $10)
Our events are also conjunctions but of the form (attribute; value)
and implicitly on equality, e.g.
(movie; toy story II), (city; paris)
Fault Tolerant Parallel Programming
Database Internals Work
Database Tuning and Wall Street
Real-Time Scheduling
Thinksheet and StratPal
Dave Tanzer's thesis is on efficient
backwards reasoning in the thinksheet context.
The work has been completed.
Benchmarks
Software
Fun Stuff
The Puzzling Adventures of Dr. Ecco in 1988
(republished by Dover in 1998) and
Codes, Puzzles, and Conspiracy in 1992 (now retitled
Dr. Ecco: mathematical detective in the Dover edition).
See
Professor Scarlet's Notebook
a companion book to teach real mathematics through puzzles.
Dr. Ecco's Cyberpuzzles : 36 Puzzles for Hackers
and Other Mathematical Detectives
published by W. W. Norton in 2002.
This had the first collection of puzzles from Dr. Dobb's Journal
Puzzling Adventures
published by W. W. Norton in January 2005.
This had the first collection of puzzles from my Scientific American column.
The Puzzler's Elusion
published by Avalon Press, March 2006.
A combination of puzzles
from Scientific American and Dr. Dobb's Journal.
Puzzles for Programmers and Pros
published by Wiley in May 2007.