The third pattern is that I work with excellent people -- undergraduates,
master's students, doctoral students, post-docs, other profs, and
non-academics.
If you have skill, energy, and initiative and if you like what you see below,
then drop me a line.
I might have a project for you.
Biological Computing
I've worked on several projects over the last few years.
The first five projects below are new:
-
Software to try to infer binding sites of transcription factors
as well as causality and repression among transcription factors.
We have started with yeast, but the software is quite general.
It uses a mix of pattern analysis and heuristics based on statistics.
Joint work with Philip Benfey and Ken Birnbaum and Aris Tsirigos.
Ken Birnbaum has led a project concerned with
cell-sorting and analysis for Arabidopsis that was published in Science.
-
Combinatorial
design software
to specify the design of experiments over
several input variables where most of those variables
are considered to be unimportant.
The goal is to explore a large search space with few experiments
while guaranteeing certain properties.
Joint work with Gloria Coruzzi, Peter Palenchar,
Andrei Kouranov, Laurence Lejay,
and Michael Chou.
-
Analysis software to find analog circuits given the results
of experiments.
Also, joint work with the Coruzzi lab.
-
Analysis software and a database called Pathexplore
to issue various queries about metabolic,
proteomic, or other pathways.
Play with it here.
Joint work with Peter Palenchar also formerly of the Coruzzi lab.
-
Software to find the functionality of genes based on
species traits.
This has discovered flagella genes already.
Please
see a description here.
Joint work with Mitchell Levesque,
Wook Kim, Michael G. Surette, and Philip Benfey.
-
Algorithms for phylogenetic reconstruction based on graphs (a.k.a., networks)
instead of trees.
Our
algorithm
takes as input the best phylogenetic trees for
some set of genes and forms a parsimonious graph.
Joint work with Ken Birnbaum and Matt Olim.
Some
comparisons
with other approaches done by Chandni Rajan Valiathan.
New algorithms being developed by
Chien-I Liao.
-
Visualization software called Sungear to help relate the results of
many different experiments to one another as well as to functional
categories.
By results, one could mean gene collections (e.g.,
of genes induced in an experiment), edge collections (e.g., of
protein interactions), or any
collections. Functional categories could be, for example, gene ontologies
or pathway information.
This software is available for use now.
Pattern Recognition
-
Software and algorithms to find
the highest correlated streams
among thousands of streams extremely efficiently.
That is joint work with
award-winning
Yunyue Zhu
and Zhihua Wang.
We have a paper in VLDB 2002 describing the algorithms.
Further work along those lines allows us to do query by humming
We have a paper in SIGMOD 2003 describing the algorithms.
Our demo can be found online
here.
We currently have more than 1000 songs, including most of the Beatles
and other popular folk and melodious rock songs.
The demo makes use of dynamic time warping, statistics-based filters,
and adaptive boosting.
You can download both midi files and lyrics.
If you hum a tune that has a match in our database
and you upload both your humming and the match information,
we will pay $2 to Nature Conservancy.
-
Software and algorithms to
find bursts in time series data.
We have a paper in KDD 2003 describing the algorithm.
That is joint work with Zhihua Wang, Xiaojian Zhao, and Yunyue Zhu.
-
Software to make tree, graph and structure searching as fast
as keyword searching.
A paper describing that was published in ACM Pods 2002 as an invited
tutorial
in pdf.
This uses a combination of geometric hashing, combinatorial
techniques from approximate tree matching, and generalizations
of suffix trees.
To download software for the current version
of the graph search software, please see
graphblast (an improvement on graphgrep).
To cluster graphs, please see
GraphClust.
If you need tree comparison for ordered trees (where
sibling order matters), please see
treegrep .
If you need searching among unordered trees
(with applications to XML among others), please
see
unordered tree searcher .
If you want to compare unordered trees, please see
our cousin-distance based unordered comparison software
You can find our PODS 2002 presentation
here.
An important application of this work has been to perform
structural searches in phylogenetic databases. As of November 2003,
about 500 users worldwide have accessed the tools over 7000 times.
The
tools
have been integrated into
Joint work with Jason Wang, Kaizhong Zhang, Rosalba Giugno, and
Diego Reforgiato Recupero.
Here is
Rosalba Giugno's very nice thesis.
In addition, we have written several papers:
1. A review of approximate tree matching algorithms by
(the final version is in the book
Pattern Matching in Strings, Trees, and Arrays
by Apostolico
and Galil published by Oxford University Press)
postscript.
2. Approximate graph matching for acyclic graphs
postscript.
3. Discovering patterns in protein sequences
postscript.
-
With the group of Alfredo Ferro (including Rosalba Giugno for
GraphGrep, and
Domenico Cantone, Alfredo Pulvirenti, Tarcisio
Maugeri, and Giuseppe Piqola for the computer science
side of clustering), we have developed
a set of tools for
finding multiple alignments that we believe improve on the
Clustal package as well as innovative clustering algorithms.
-
Our goal is to make it possible to discover patterns (i.e.
do data mining) in strings,
trees, and graphs given a pattern metric.
On the way we have worked out or borrowed algorithms
to match a pattern against data and find a distance between the pattern
and the data.
We are really good on trees
and are getting good on graphs.
We have software available by anonymous ftp
and certain experimental software is available from me.
Collaborators: Kaizhong Zhang (U of Western Ontario),
Jason Wang (NJ Institute of Technology),
and Bruce Shapiro (National Cancer Institute).
We have edited a book on this subject:
Pattern Discovery in Biomolecular Data:
Tools, Techniques, and Applications
Jason Wang, Bruce Shapiro, and Dennis Shasha (Eds.)
Oxford University Press, 1999.
Partly on the strength of that book, I have become
the editor of a book series in Genomics
and Bioinformatics . The intent of the series
is to publish graduate level texts for working researchers
in the field.
We are honored to have a superb advisory board:
Michael Ashburner,
Amos Bairoch,
David Botstein,
Charles Cantor,
Lee Hood,
Minoru Kanehisa,
Raju Kucherlapati, and
Craig Venter.
Tamper-resistant file systems
With David Mazieres, I have been working on a network file system
that supports the following scenario (among others): a group of
people work together in a distributed fashion but trust neither one
another nor their system administrator.
For example, they might outsource their
system administration to an organization that
they don't necessarily trust.
Historically, that organization could do subtle changes to their
data, read their data, and so on.
The system we have designed makes
makes tampering changes quickly
detectable.
(Secrecy can be achieved with straightforward cryptographic techniques.)
The only assumption we make is that each client has a secret signature key.
Please
see our PODC
paper here .
Some promising performance results
due to the great efforts of David, Jinyuan Li, and Maxwell Krohn
appear
in OSDI.
AQuery: a database system for querying ordered data
An order-dependent query is one whose result
(interpreted as a multi-set)
changes if the order of the input records is changed.
In a stock-quotes database, for instance, retrieving all quotes
concerning a given stock for a given day does not depend on order, because
the collection of quotes does not depend on order.
By contrast, finding the five price moving average in a trade table
gives a result that depends on the order of the table.
Query languages based on the relational data model
can handle order-dependent queries only through add-ons.
SQL:1999, for example,
permits the use of a data ordering mechanism called
a ``window''
in limited parts of a query.
As a result, order-dependent queries
become difficult to write in those languages and
optimization techniques for these features,
applied as pre- or post-enumerating phases, are generally crude.
The goal of our work is to show that when order
is a property of the underlying data model and algebra, writing
order-dependent queries in a language can be natural as is their optimization.
We introduce AQuery, an SQL-like query language and algebra
t has from-the-ground-up support for order.
We also present a framework for optimization of the order-dependent
queries categories it expresses.
The framework is able to take advantage of the large body of
query transformations on relational systems
while incorporating new ones
described here.
We show by experiment that the resulting system is orders of magnitude
faster than current SQL:1999 systems on many natural order-dependen
You can see our paper here.
You can see a power point presentation here.
Joint work with Alberto Lerner.
We are looking for Beta users.
AJAX: a data cleaning system
-
Ajax is a framework for data cleaning.
It includes an implementation of comparison, clustering, and
schema tracking for all aspects of data cleaning.
It's also a framework for future additions to data cleaning.
Joint work with Helena Galhardas, Dana Florescu, and Eric Simon of Inria.
Helena is currently implementing furiously.
Le Subscribe: a publish-subscribe system
-
We (this is joint work with Francoise Fabret and Francois
Llirbat, and Joao Pereira at INRIA)
have implemented a publish-subscribe system for extremely high
performance and distributed functionality.
-
Our subscriptions are conjunctions of the form (attribute; value; relop)
e.g.
(movie; toy story II; =),
(price; < ; $10)
Our events are also conjunctions but of the form (attribute; value)
and implicitly on equality, e.g.
(movie; toy story II), (city; paris)
-
Our performance is the following:
For 400,000 subscriptions having
5 attributes of which one is inequality and four are equality,
and
events with 5 attributes, we
can process events at 5 milliseconds per event
on a machine with Linux SO, i686 CPU at 500MHz with 1G of RAM.
Fault Tolerant Parallel Programming
-
Our Persistent Linda project extends the Linda system
developed primarily by Dave Gelernter and Nick Carriero at Yale.
We use a slightly weakened form of transaction combined with
checkpoints to support fault tolerance (appeared in Proc of 13th
Symp on Fault Tolerant Distributed Systems)
postscript.
You can get a copy of PLinda from
our web site.
Collaborators: Brian Anderson, Karp Jeong, Suren Talla, Peter Wyckoff, Bin Li,
all students or former students at NYU
In addition, Ekkart Kindler has done
a formal proof method
for verifying long-running parallel computations.
Database Internals Work
Database Tuning and Wall Street
-
Database tuning is the activity of making your database system run faster.
Though each vendor will tell you a different story about the subject,
it turns out that the underlying principles are the same.
(As a consultant, I've applied these principles for companies in
telecommunications, finance, on-line travel companies, and
on-line gaming.)
If you are interested in
notes on the subject,
then please go ahead and download
My latest book on the subject, co-authored with Philippe Bonnet,
is called
Database Tuning: principles, experiments, and troublieshooting
techniques
You can find the experiments from that work
here.
With Arthur Whitney, Steve Apter, and the
rest of the K community,
I have been working on simplifying transaction
processing in a large main memory setting.
The technique uses a very fast, interpreted vector-processing language.
It avoids concurrency control but allows concurrency.
It is tailored to financial applications.
postscript version and
pdf version .
In Sigmod 1997, I present some lessons learned from
my experience on Wall Street.
The lessons have to do with configuration for global
distributed systems, tuning, and language issues.
Lessons from Wall Street (postscript) .
In Sigmod 2001, Arthur Whitney and I have a paper describing
time series databases for finance that he built.
Lots o' Ticks: real-time high
performance time series queries on billions of trades and quotes
(pdf) .
Real-Time Scheduling
-
We have developed algorithms for scheduling overloaded
sporadic real-time tasks in a uniprocessor setting (in Siam J. comp)
postscript
and in a multi-processor setting (appeared
in Theoretical Computer Science, July 1994)
postscript.
We have found algorithms and bounds for scheduling
sporadically arriving periodic tasks.
That is, the instances of
each task arrive at regular periods but the task
can arrive at any instant.
Our twist on this problem is that we allow certain instances of such tasks
to be skipped. We look at schedulability in this context.
pdf
.
Collaborator: Gilad Koren
Thinksheet
-
A tool to tailor information flow for readers of complex (or boring) documents
such as laws and a problem-solving tool generalizing spreadsheets.
The tool integrates spreadsheets, rules, databases, and hypermedia.
Collaborators: Roman Yangarber, Peter Piatko, Daoi Lin, Minna Cha,
Dave Tanzer, Alex Shenker, Mike Leder, Julia Tolpin, Mirella Shannon,
and Chris Jones,
all students at NYU.
Dave Tanzer's thesis is on efficient
backwards reasoning in the thinksheet context.
The work has been completed.
Recently, we (Stacey Kuznetsov and I) have designed a new system called
Stratpal.
Stratpal is a simplified thinksheet that can be used to model laws
and strategies. Its key features is that given a linear document,
it is easy to create a StratPal application that can be improved
incrementally over time.
For Kids
Please see the
earthtime project not yet functional
To learn arithmetic and elementary algebra in a fun way.
Look at
superply
Fun Stuff
-
Dr. Ecco, a mathematical detective cracks
mysteries by solving puzzles.
Some are combinatoric, e.g. what is the smallest number of people
who could be at a party in which everyone has shaken hands with three
other people, except one person who has only shaken hands with one other
person?
Others involve algorithmic aspects, including the
simplest zero knowledge protocols known to (wo)man.
The first books about him were first published by
W. H. Freeman, (1-)212-576-9400:
The Puzzling Adventures of Dr. Ecco in 1988 and
Codes, Puzzles, and Conspiracy in 1992.
Dover has republished
The Puzzling Adventures of Dr. Ecco in 1998.
You can get all my books from online bookstores such
as amazon.com.
- I have had the pleasure of writing a mathematical puzzle
column for Dr. Dobb's Journal
and currently
write the monthly puzzle
column for Scientific American
.
These puzzles are good to attack with a computer.
The best of those puzzles, somewhat improved, have appeared
in Dr. Ecco's Cyberpuzzles published by W. W. Norton in July, 2002.
-
You can see a talk called
Upstart Puzzles
that I gave at the Canadian Mathematical Society
summer meeting at Edmonton in June 2003.
A collection of the early Scientific American puzzles appeared
as
Puzzling Adventures published by W. W. Norton in 2005.
A new book called The Puzzler's Elusion will be
published by Avalon Press in 2006.
-
Out of Their Minds: the lives and discoveries of 15
great computer scientists is a book of biographies
of 15 great computer scientists.
You can see me on realplayer video
talking about the book.
-
Finally, all the smart Russian students I've had has inspired
me to collaborate with playwright Marina Shron
in a book about recent Russian immigrants entitled
Red Blues: voices from the last wave of Rusian immigrants.
You can find excerpts of the book
here.
Last modified February 2006.