Dennis Shasha's Research Summary

Goals

I work on quite a few different projects. Most end up resembling puzzles on large data and pattern matching or machine learning. Areas of interest include computational biology (mostly on plants) and biomedicine (data analysis, experimental design), time series (fast algorithms for fundamental problems such as correlation and burst detection as well as applications like time series forecasting), and pattern matching in trees and labeled graphs.

In collaboration with Caspar Lant, Daniele Panozzo, and Denis Zorin, I have worked on a device called SnailGate that blocks floods by unfolding and allowing stacking. We are also working on a device called CorrectConsumer to ensure the right person is taking the right pill, and on an alarm system for pedestrians who might be struck by a vehicle.

Other recent work is an evidence-based diet site called dietnerd. We plan to extend the "nerd family" to applications in wireless, general engineering, contracts, and financial advice.

Other recent work

Since 2013, I've become interested in millimeter-wave wireless problems under the auspices of NYU WIRELESS and in close collaboration with Sundeep Rangan, Aditya Dhananjay, and Marco Mezzavilla, especially in the context of Pi-Radio.

Meta-Algorithms for Machine Learning to Improve Accuracy

In addition to applied machine learning work, I have worked on the problem of designing a meta-algorithm called SafePredict that is allowed to refuse to accept the predictions of an underlying machine learning algorithm if it determines the predictions will incur an error rate higher than a user-specified bound. Our SafePredict framework can asymptotically achieve a higher correctness rate for any machine learning algorithm on the non-refused predictions. This is the thesis work of Anil Kocak, whom I co-advised with Elza Erkip. David Ramirez is also a collaborator in that work, and Shantanu Jain is working on making the software usable by users of machine learning packages. With Ben Peherstorfer, we are extending that work to numerical problems. Alexis Joly, Titouan Lorieul and I are applying similar reasoning to inferring sets of labels for an object when finding the exact label of that object is too uncertain.
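As a rough illustration of the refusal idea only, and not the published SafePredict algorithm, the Python sketch below keeps a "trust" weight for the base predictor and passes a prediction through with that probability. The class name, the exponential-weights update, and all parameter names are assumptions made for this example. It also assumes the true label becomes known after each round, so the wrapper can tell whether the base predictor was wrong.

```python
import math
import random


class RefusingWrapper:
    """Hypothetical refusal wrapper: trust the base predictor with probability `trust`."""

    def __init__(self, target_error, learning_rate=0.5, initial_trust=0.5, seed=0):
        self.target_error = target_error    # user-specified error bound
        self.learning_rate = learning_rate  # assumed step size for the update
        self.trust = initial_trust          # probability of accepting a prediction
        self.rng = random.Random(seed)

    def accept(self):
        """Return True to pass the base prediction through, False to refuse."""
        return self.rng.random() < self.trust

    def update(self, base_was_wrong):
        """Exponential-weights update between 'accept' and 'refuse'.

        Accepting is charged the observed 0/1 loss of the base predictor;
        refusing is charged the target error rate.
        """
        loss = 1.0 if base_was_wrong else 0.0
        accept_weight = self.trust * math.exp(-self.learning_rate * loss)
        refuse_weight = (1.0 - self.trust) * math.exp(-self.learning_rate * self.target_error)
        trust = accept_weight / (accept_weight + refuse_weight)
        # Clamp so the weight can always recover if the base predictor changes.
        self.trust = min(max(trust, 1e-6), 1.0 - 1e-6)
```

In this sketch, a base predictor whose error rate stays above the target sees its trust shrink toward zero, so most of its predictions are refused, while one that stays below the target is accepted almost always.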

Biological Computing

I've worked on several molecular biology projects since the early 2000s, notably with the plant biology labs of Gloria Coruzzi, Ken Birnbaum, Rich Bonneau, Philip Benfey, and Rodrigo Gutierrez.

Analysis software to find causal relationships given the results of RNA expression experiments. This is also joint work with the Coruzzi lab and soon-to-be-doctor Jacopo Cirrone and incoming student Bingran Shen. Previous students involved with that include Jesse Lingeman and Piotr Mirowski.
Software to contribute to the visualization of the intersections and unions of collections of multiple experiments, multiple genomes or even multiple baseball players. Sungear has been called a Venn diagram on steroids. It should be useful for social scientists, cancer researchers and sports fanatics -- anyone concerned with trying to derive interesting information from several long lists of items (genes, proteins, people, players). Besides supporting set intersection and unions on items, Sungear relates those items to functional categories. This is joint work with Chris Poultney, Rodrigo Gutierrez, Manpreet Katari, Miriam Gifford, Brad Paley, and Gloria Coruzzi. Chris made the software what it is.
Combinatorial design software to specify the design of experiments over several input variables where most of those variables are considered to be unimportant. The goal is to explore a large search space with few experiments while guaranteeing certain properties. (A follow-on paper with Charles Colbourn gave me an Erdős number of 2.)
Work to aid the first stages of protein docking. We call this protein speed-dating. This is joint work with Noah Youngs, Tian Jiang, Doug Renfrew, Glenn Butterfoss and Rich Bonneau.

Graph Algorithms

Alfredo Ferro, Rosalba Giugno, Alfredo Pulvirenti, Giovanni Micale, Vincenzo Bonnici, Antonio Di Maria, and I have worked on two kinds of graph matching problems: (i) Given a small graph, find all instances of that subgraph in a large graph, a well-known NP-complete problem but one with many good heuristics. Lately we have been able to deal with graphs in which nodes can have zero or more labels and so can edges. We have dealt with both exact and inexact matching. (ii) Given a large graph, find the patterns in that graph that appear "unusually often". What this means can depend on the graph generating process. This ongoing project has had the delightful side effects of my visiting the beautiful island of Lipari and keeping me from forgetting all my Italian.

Finding Bugs in Black Box Workflows

Raoni Lourenco, Juliana Freire and I have worked on the problem of finding the root causes of bugs in complex workflows. Given a setting where an execution configuration can be characterized by parameter-value settings, new configurations can be executed at will, and there is some application-based notion of success or failure, our method can find minimal root causes consisting of disjunctions of conjunctions of parameter-value pairs, often in linear time in the number of parameters.
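To make the setting concrete, here is a simplified Python sketch. It is not the method described above: it recovers only a single conjunction (no disjunctions), and it assumes a known passing configuration, deterministic runs, and that replacing a setting with its passing value never introduces a new failure. The function name and the `fails` callback are hypothetical. It uses at most one execution per parameter, which is where a linear number of runs can come from.

```python
def minimal_failing_conjunction(failing, passing, fails):
    """Shrink a failing configuration to the settings that are needed for it to fail.

    failing, passing: dicts mapping parameter name -> value (same keys).
    fails: callable that executes a configuration and returns True on failure.
    Returns a dict of parameter-value pairs that together trigger the failure.
    """
    current = dict(failing)  # invariant: `current` always fails
    cause = {}
    for param, bad_value in failing.items():
        if passing[param] == bad_value:
            continue  # both configurations agree, so this setting cannot matter
        current[param] = passing[param]  # try the known-good value
        if fails(current):
            continue  # still fails without this setting: it is not part of the cause
        current[param] = bad_value       # the failure disappeared: restore and record
        cause[param] = bad_value
    return cause


# Hypothetical workflow that fails exactly when a == 1 and b == 2.
runs = []

def fails(config):
    runs.append(dict(config))
    return config["a"] == 1 and config["b"] == 2

cause = minimal_failing_conjunction(
    {"a": 1, "b": 2, "c": 3}, {"a": 0, "b": 0, "c": 0}, fails
)
```

Here `cause` contains the settings for `a` and `b` only, found with three executions of the workflow.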

Version Climber

Christophe Pradal, Sarah Cohen-Boulakia, Patrick Valduriez and I have worked on the problem of updating software configurations. Imagine a software system in which there are many packages. Each has a certain version, e.g. some version of Python 3, some version of a graphics package, etc. Now, we would like to update one or more of these packages to a newer version. This can be a time-intensive trial-and-error process. Version Climber does an automatic systematic exploration, using heuristics and parallelism to do version upgrades "without tears". We build this on top of Conda.

Formal Verification of Concurrent Search Structures

Thomas Wies, Siddharth Krishna, Nisarg Patel, and I have worked on formal verification of concurrent search structure algorithms using separation logic and a framework I developed some time around the stone age. The approach can already automatically verify a lot of concurrent algorithms that are now verified by hand.

Data Science Support for Linguistics

Chris Collins and Richard Kayne from NYU Linguistics, along with Linguistics doctoral student Michael Taylor, worked with the computer science team consisting of Sangeeta Vishwanath, Hiral Rajani, Jillian Kozyra, and me to build a system called Syntactic Structures of the World's Languages (SSWL) several years ago. The system allows the comparison of the syntax of hundreds or even thousands of languages and already permits questions to be answered on a scale never before possible in linguistics. Hilda Koopman has led and greatly expanded the linguistics effort for the last few years, bringing the number of participating linguists to the hundreds. Ross Affenberger and Alex Lobascio, then Marco Liberati, then Hannan Butt along with Shailesh Vasandani have worked on the implementation, resulting in Terraling, a more flexible version of the software.

Acronym Expansion

Haven't we all had the problem that we cannot understand an acronym in an article even when it might have been defined earlier? Our acronym expander project uses parsing to find acronym definitions within a paper p and document similarity to find the definitions in other papers p' that are similar to p. We are working on an end-to-end system that I hope will be used by millions. This is joint work currently with Helena Galhardas and Joao Pereira, and earlier with Kshitiz Sethia and Ben Turtel.

Finding Correlated Time Series

Alexandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas, Patrick Valduriez, and I have created a system that, given a query time series, finds the closest time series to that one in a (possibly large) collection of time series. In order to avoid a linear scan of that collection, the system BestNeighbor uses either a derivation of the iSAX system or a random projection approach. It turns out that random projections work best when the time series have a lot of energy in their high frequency components.
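The random projection idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the BestNeighbor implementation: the function names, the Gaussian projection, and the shortlist size are all choices made for the example, and the sketches are compared by a plain scan here, whereas a real system would precompute and index them to avoid touching every series.

```python
import math
import random


def make_projection(length, sketch_size, seed=0):
    """Random Gaussian vectors used to map a series of `length` points to a short sketch."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(sketch_size)]


def sketch(series, projection):
    """Inner product of the series with each random vector."""
    return [sum(r * x for r, x in zip(row, series)) for row in projection]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_series(query, collection, projection, shortlist_size=3):
    """Shortlist candidates by sketch distance, then verify with exact distances."""
    q = sketch(query, projection)
    sketches = [sketch(s, projection) for s in collection]  # precomputed in practice
    by_sketch = sorted(range(len(collection)), key=lambda i: euclidean(q, sketches[i]))
    shortlist = by_sketch[:shortlist_size]
    return min(shortlist, key=lambda i: euclidean(query, collection[i]))


collection = [
    [0, 1, 0, -1, 0, 1, 0, -1],
    [5, 5, 5, 5, 5, 5, 5, 5],
    [1, 2, 3, 4, 5, 6, 7, 8],
]
projection = make_projection(length=8, sketch_size=4)
best = nearest_series([1, 2, 3, 4, 5, 6, 7, 9], collection, projection)
```

Because sketch distances only approximate true distances, the exact check on the shortlist is what keeps the final answer honest. A larger shortlist trades speed for a lower chance of missing the true nearest neighbor.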

Short Review of Other Projects

Queries at multiple scales in astronomy (with Fabio Porto) especially to detect patterns of points.
Fast Monte Carlo approaches to epidemic modeling (with Azza Abouzied, Anh Mai, and Whitney Bagge).
The inference of medical treatments using the treatment, measurement, and discharge information from the MIMIC dataset (with Azza Abouzied, Rosalba Giugno, Abbas Shojaee, YiFan Li, Elisa Sorrentino, Farah Shamout, Hashim Hayat, and Rick Hull).
A system for supporting triggers for streaming data (with Azza Abouzied, Ahmad Chihabi, and Kostas Zoumpatianos).
A system for detection of copyright violation for logos and other designs (with Karan Chadha and Tolga Yenisey).
Resampling statistics for computer and natural scientists. With Manda Wilson, I wrote a short monograph called Statistics is Easy a few years ago. Now with Manny Katari and Sudarshini Tyagi, we are completing a companion book to be entitled something like: Statistics is Easy: case studies on real scientific datasets. The idea is to use the techniques of resampling statistics to normalize, analyze, and then statistically evaluate scientific questions. All code will be available in Python and in R.
With Marcelo Jose Sandoval, a project to do automatic metadata tagging of video sequences. The metadata tagging will say whether the sequence is primarily a close-up, medium or faraway shot and will use the speech to associate the sequence with the proper scene for scripted works of art.
Energy-efficient and secure blockchain.
A project to ensure that the right person takes a particular medicine (e.g. opioids) using facial recognition and an electro-mechanical dispenser.
Fun projects having to do with good tasting diets and tango choreography.
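For readers unfamiliar with the resampling statistics mentioned in the list above, here is a minimal Python sketch of one such technique, a permutation (shuffle) test for a difference in means. It is written for this summary and is not taken from Statistics is Easy; the function names, the default number of shuffles, and the seed are assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, shuffles=2000, seed=0):
    """Estimate how often a random relabeling gives a mean difference
    at least as large as the one observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)  # pretend the group labels carry no information
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point noise
            hits += 1
    return hits / shuffles


p_separated = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
```

A small value suggests the observed difference is unlikely to arise from labels assigned at random. When the two groups are identical, every shuffle matches the observed difference of zero, so the estimate is 1.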

A general pattern? I like puzzles. A second general pattern is that I program a lot in a fast, extremely expressive language called K. A third pattern is that I work with excellent people -- undergraduates, master's students, doctoral students, post-docs, other profs, and non-academics.

If you have skill, energy, and initiative and if you like what you have seen here, then drop me a line. I might have a project for you. My philosophy is to try to find something that is close to your heart and close to mine. All projects take work. You have to believe in the goal and take pleasure in the means to get there.

Look at Marianne Winslett's interview of me in which I try to explain the way I work.

Older Work

Pattern Recognition

Software and algorithms to find the highest correlated streams among thousands of streams extremely efficiently. That is joint work with award-winning Yunyue Zhu, Xiaojian Zhao, and Zhihua Wang. We had a paper in VLDB 2002 describing the algorithms. Later work has involved Richard Cole and Tyler Neylon. You can find Zhihua's and Xiaojian's theses on the CS department's thesis web site. Work on "uncooperative" time series (time series in which the power of the time series is spread over all Fourier coefficients) that uses random projections is discussed here. Tyler Neylon has extended the pair-wise correlation work to finding approximate linear dependencies among multiple time series, as you can see in his thesis.
Further work along those lines allows us to do query by humming. We had a paper in SIGMOD 2003 describing the algorithms, though others have continued this work. Students who have worked on query by humming are: Yunyue Zhu, Zhihua Wang, Steve Toub, Michael Schidlowsky, Kevin Cox, and Megan McNulty.
Many of the basic algorithms are summarized in the book High Performance Discovery in Time Series: techniques and case studies published by Springer Verlag (by Yunyue Zhu and me). There are a few errata in the book.
Software and algorithms to find bursts in time series data. We have a paper in KDD 2003 describing the algorithm. That is joint work with Zhihua Wang, Xiaojian Zhao, and Yunyue Zhu. Xin Zhang wrote a very nice thesis to improve this further.
Software to make tree, graph and structure searching as fast as keyword searching. A paper describing that was published in ACM PODS 2002 as an invited tutorial in pdf. This uses a combination of geometric hashing, combinatorial techniques from approximate tree matching, and generalizations of suffix trees. The work is with my dear friends in Catania: Rosalba Giugno, Alfredo Ferro, Alfredo Pulvirenti, and various wonderful students who are the lead authors of the papers. To download software for the current version of the graph search software, please see graphgrep family. We have deployed a similar module for use within the Cytoscape software. To cluster graphs (with Diego Reforgiato Recupero), please see GraphClust. If you need to compare schedules or partial orders, please see SchedMatch. If you need tree comparison for ordered trees (where sibling order matters), please see treegrep. If you need searching among unordered trees (with applications to XML among others), please see unordered tree searcher. If you want to compare unordered trees, please see our cousin-distance based unordered comparison software. You can find our PODS 2002 presentation here.
An important application of this work has been to perform structural searches in phylogenetic databases. As of November 2003, about 500 users worldwide have accessed the tools over 7000 times. The tools have been integrated into Joint work with Jason Wang, Kaizhong Zhang, Rosalba Giugno, Diego Reforgiato Recupero and Alfredo Ferro. Here is Rosalba Giugno's very nice thesis.
In addition, we have written several papers:
1. A review of approximate tree matching algorithms by (the final version is in the book Pattern Matching in Strings, Trees, and Arrays by Apostolico and Galil published by Oxford University Press) paper in reverse order, just for fun.
2. Approximate graph matching for acyclic graphs postscript.
3. Discovering patterns in protein sequences postscript.
With the group of Alfredo Ferro (including Rosalba Giugno for GraphGrep, and Domenico Cantone, Alfredo Pulvirenti, Tarcisio Maugeri, and Giuseppe Pigola for the computer science side of clustering), we have developed a set of tools for finding multiple alignments that we believe improve on the Clustal package as well as innovative clustering algorithms.
Our goal is to make it possible to discover patterns (i.e. do data mining) in strings, trees, and graphs given a pattern metric. On the way we have worked out or borrowed algorithms to match a pattern against data and find a distance between the pattern and the data. We are really good on trees and are getting good on graphs. We have software available by anonymous ftp and certain experimental software is available from me. Collaborators: Kaizhong Zhang (U of Western Ontario), Jason Wang (NJ Institute of Technology), and Bruce Shapiro (National Cancer Institute).
We have edited a book on this subject: Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications Jason Wang, Bruce Shapiro, and Dennis Shasha (Eds.) Oxford University Press, 1999.
Partly on the strength of that book, I have become the editor of a book series in Genomics and Bioinformatics. The intent of the series is to publish graduate level texts for working researchers in the field. We are honored to have a superb advisory board: Michael Ashburner, Amos Bairoch, David Botstein, Charles Cantor, Lee Hood, Minoru Kanehisa, Raju Kucherlapati, and Craig Venter.
More recently, we have edited a second book on the subject of data mining: Data Mining in Bioinformatics by J. T. L. Wang, M. J. Zaki, H. T. T. Toivonen and D. Shasha, published by Springer-Verlag in 2005.

Tamper-resistant file systems

With David Mazieres, I worked on a network file system that supports the following scenario (among others): a group of people work together in a distributed fashion but trust neither one another nor their system administrator. For example, they might outsource their system administration to an organization that they don't necessarily trust. Historically, that organization could make subtle changes to their data, read their data, and so on. The system we have designed makes tampering changes quickly detectable. (Secrecy can be achieved with straightforward cryptographic techniques.) The only assumption we make is that each client has a secret signature key. Please see our PODC paper here. Some promising performance results due to the great efforts of David, Jinyuan Li, and Maxwell Krohn appear in OSDI. With Radu Sion and Peter Worth of Stony Brook and then with Arthur Meacham then at NYU, we did work having to do with provably secure database outsourcing. Suppose a mutually trusting group of clients want to use the software provided by an outsourcer. The guarantee is that the outsourcer will not be able to understand the client data (because it is encrypted when the outsourcer sees it), nor will the outsourcer know which data any client accesses, and the clients will enjoy full transactional guarantees.

AQuery: a database system for querying ordered data

An order-dependent query is one whose result (interpreted as a multi-set) changes if the order of the input records is changed. In a stock-quotes database, for instance, retrieving all quotes concerning a given stock for a given day does not depend on order, because the collection of quotes does not depend on order. By contrast, finding the five-price moving average in a trade table gives a result that depends on the order of the table. Query languages based on the relational data model can handle order-dependent queries only through add-ons. SQL:1999, for example, permits the use of a data ordering mechanism called a "window" in limited parts of a query. As a result, order-dependent queries become difficult to write in those languages and optimization techniques for these features, applied as pre- or post-enumerating phases, are generally crude. The goal of our work is to show that when order is a property of the underlying data model and algebra, writing order-dependent queries in a language can be natural, as can their optimization. We introduce AQuery, an SQL-like query language and algebra that has from-the-ground-up support for order. We also present a framework for optimization of the categories of order-dependent queries it expresses. The framework is able to take advantage of the large body of query transformations on relational systems while incorporating new ones described here. We show by experiment that the resulting system is orders of magnitude faster than current SQL:1999 systems on many natural order-dependent queries. You can see our paper here. You can see a power point presentation here. Joint work with Alberto Lerner.
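The distinction between order-independent and order-dependent queries can be seen in a small Python example. This is plain Python rather than AQuery syntax, and the table contents, function names, and the choice to report only full five-price windows are assumptions made for the illustration.

```python
from collections import Counter


def quotes_for(trades, symbol):
    """Order-independent: select every trade for one symbol."""
    return [trade for trade in trades if trade[0] == symbol]


def moving_average(prices, window=5):
    """Order-dependent: average of each run of `window` consecutive prices."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


trades = [("IBM", 10), ("IBM", 11), ("IBM", 12), ("IBM", 13), ("IBM", 14), ("IBM", 15)]
reordered = [trades[-1]] + trades[:-1]  # same rows, last row moved to the front

# The selection returns the same multi-set whatever the input order.
same_selection = Counter(quotes_for(trades, "IBM")) == Counter(quotes_for(reordered, "IBM"))

# The moving average does not: the two result multi-sets differ.
avg_original = moving_average([price for _, price in trades])
avg_reordered = moving_average([price for _, price in reordered])
same_average = Counter(avg_original) == Counter(avg_reordered)
```

Here `same_selection` is true and `same_average` is false, which is exactly the property that makes the moving average an order-dependent query in the sense defined above.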

AJAX: a data cleaning system

Ajax is a framework for data cleaning. It includes an implementation of comparison, clustering, and schema tracking for all aspects of data cleaning. It's also a framework for future additions to data cleaning. Joint work with Helena Galhardas, Dana Florescu, and Eric Simon of Inria.

Le Subscribe: a publish-subscribe system

We (this is joint work with Francoise Fabret and Francois Llirbat, and Joao Pereira at INRIA) have implemented a publish-subscribe system for extremely high performance and distributed functionality.
Our subscriptions are conjunctions of the form (attribute; value; relop) e.g.
(movie; toy story II; =), (price; $10; <)
Our events are also conjunctions but of the form (attribute; value) and implicitly on equality, e.g.
(movie; toy story II), (city; paris)
Our performance is the following: For 400,000 subscriptions having 5 attributes of which one is inequality and four are equality, and events with 5 attributes, we can process events at 5 milliseconds per event on a Linux machine with an i686 CPU at 500MHz and 1G of RAM.
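The matching semantics described above can be written down directly in Python. This sketch shows only what it means for an event to satisfy a subscription, using a brute-force scan. It says nothing about how Le Subscribe achieves its performance, which depends on indexing techniques not shown here. The function names, the dictionary representation of events, and the assumption that an event missing a subscribed attribute does not match are choices made for this example.

```python
import operator

# Relational operators a subscription predicate may use.
RELOPS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(subscription, event):
    """A subscription is a conjunction of (attribute, value, relop) predicates.

    An event maps attribute -> value. Every predicate must hold for a match.
    """
    for attribute, value, relop in subscription:
        if attribute not in event:
            return False  # assumption: a missing attribute fails the predicate
        if not RELOPS[relop](event[attribute], value):
            return False
    return True


def matching_subscriptions(subscriptions, event):
    """Brute-force scan returning the names of all subscriptions the event satisfies."""
    return [name for name, sub in subscriptions.items() if matches(sub, event)]


subscriptions = {
    "cheap toy story": [("movie", "toy story II", "="), ("price", 10, "<")],
    "anything in paris": [("city", "paris", "=")],
}
event_with_price = {"movie": "toy story II", "price": 8}
event_in_paris = {"movie": "toy story II", "city": "paris"}
```

With these definitions, `event_with_price` satisfies only the first subscription, and `event_in_paris` satisfies only the second because it carries no price.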

Fault Tolerant Parallel Programming

Our Persistent Linda project extends the Linda system developed primarily by Dave Gelernter and Nick Carriero at Yale. We use a slightly weakened form of transaction combined with checkpoints to support fault tolerance (appeared in Proc of 13th Symp on Fault Tolerant Distributed Systems) postscript. You can get a copy of PLinda from our web site. Collaborators: Brian Anderson, Karpjoo Jeong, Suren Talla, Peter Wyckoff, Bin Li, all students or former students at NYU. In addition, Ekkart Kindler has done a formal proof method for verifying long-running parallel computations.

Database Internals Work

Buffer paging algorithms that beat LRU in several application areas. Collaborators: Ted Johnson (U. of Florida), industrial collaborators notably at database companies (appeared in VLDB 94) with recent deployment at a large search engine company postscript. By the way, part of Johnson's thesis showed that it is a good idea to be lazy when you're designing B-trees: as long as there are more inserts than deletes, free-at-empty is a better strategy than merge-at-half (appeared in Journal of Computer and System Sciences, Aug. 1993) postscript.
Ted and I have done another piece of work about a data structure for decision support called "Hierarchically Split Cube Forests". postscript.

Database Tuning and Wall Street

Database tuning is the activity of making your database system run faster. Though each vendor will tell you a different story about the subject, it turns out that the underlying principles are the same. (As a consultant, I've applied these principles for companies in telecommunications, finance, on-line travel, and on-line gaming.) If you are interested in notes on the subject, then please go ahead and download them. My book on the subject, co-authored with Philippe Bonnet, is called Database Tuning: principles, experiments, and troubleshooting techniques, published by Morgan Kaufmann in 2002. Whereas the numbers published in that book no longer apply, most of the tuning principles still do. Wei Cao and I worked out some ways to detect application-programming "delinquent" design patterns that would cause slow performance.
With Arthur Whitney, Steve Apter, and the rest of the K community, I have been working on simplifying transaction processing in a large main memory setting. The technique uses a very fast, interpreted vector-processing language. It avoids concurrency control but allows concurrency. It is tailored to financial applications. postscript version and pdf version.
In SIGMOD 1997, I presented some lessons learned from my experience on Wall Street. The lessons have to do with configuration for global distributed systems, tuning, and language issues. Lessons from Wall Street (postscript).

Real-Time Scheduling

We developed algorithms for scheduling overloaded sporadic real-time tasks in a uniprocessor setting (in Siam J. comp) postscript and in a multi-processor setting (appeared in Theoretical Computer Science, July 1994) postscript.
We have found algorithms and bounds for scheduling sporadically arriving periodic tasks. That is, the instances of each task arrive at regular periods but the task can arrive at any instant. Our twist on this problem is that we allow certain instances of such tasks to be skipped. We look at schedulability in this context. pdf .
Collaborator: Gilad Koren

Thinksheet and StratPal

A tool to tailor information flow for readers of complex (or boring) documents such as laws and a problem-solving tool generalizing spreadsheets. The tool integrates spreadsheets, rules, databases, and hypermedia. Collaborators: Roman Yangarber, Peter Piatko, Daoi Lin, Minna Cha, Dave Tanzer, Alex Shenker, Mike Leder, Julia Tolpin, Mirella Shannon, and Chris Jones, all students at NYU.

Dave Tanzer's thesis is on efficient backwards reasoning in the thinksheet context. The work has been completed.

Recently, we (Stacey Kuznetsov and I) have designed a new system called StratPal. StratPal is a simplified thinksheet that can be used to model laws and strategies. Its key feature is that given a linear document, it is easy to create a StratPal application that can be improved incrementally over time.

Benchmarks

K. Jacob of Morgan Stanley and I have designed a benchmark for financial time series queries, called FinTime, which database vendors and customers may find interesting. Also on the subject of benchmarks, Yunyue Zhu and I have designed a benchmark for bitemporal database management systems, called SpyTime.

Software

Statstream (incremental correlation in time series) software description

Tree searching (finding a small tree in a database of large trees; where the order among siblings doesn't matter) software description

Tree difference (finding the difference between trees, where the order among siblings does matter) software description

Graph comparison (heuristic techniques for comparing graphs) software description

Graph clustering (finding interesting motifs in graphs) software description

SchedMatch (find the differences between partial orders) software description

Fun Stuff

I have written a game to teach children arithmetic and elementary algebra called Superply. One of my kids beats me at it regularly, alas.
Chris Poultney is the primary author of a game I designed called the Voronoi game. It was written about in France because it appeared in many science fairs.
A game to teach statistics in a fun way requires a player to find the causes of a pandemic, so it's called the Pandemic Game.
Dr. Ecco, a mathematical detective, cracks mysteries by solving puzzles. Some are combinatoric, e.g. what is the smallest number of people who could be at a party in which everyone has shaken hands with three other people, except one person who has only shaken hands with one other person? Others involve algorithmic aspects, including the simplest zero knowledge protocols known to (wo)man.
The first books about him were published by W. H. Freeman, (1-)212-576-9400:
The Puzzling Adventures of Dr. Ecco in 1988 (republished by Dover in 1998) and
Codes, Puzzles, and Conspiracy in 1992 (now retitled Dr. Ecco: mathematical detective in the Dover edition). See Professor Scarlet's Notebook, a companion book to teach real mathematics through puzzles.
Dr. Ecco's Cyberpuzzles: 36 Puzzles for Hackers and Other Mathematical Detectives published by W. W. Norton in 2002. This had the first collection of puzzles from Dr. Dobb's Journal.
Puzzling Adventures published by W. W. Norton in January 2005. This had the first collection of puzzles from my Scientific American column.
The Puzzler's Elusion published by Avalon Press, March 2006. A combination of puzzles from Scientific American and Dr. Dobb's Journal.
Puzzles for Programmers and Pros published by Wiley in May 2007.
As suggested by these books, I have had the pleasure of writing a mathematical puzzle column for Dr. Dobb's Journal and currently write the monthly puzzle column for Scientific American (look under recreations).
You can see a talk called Upstart Puzzles that I gave at the Canadian Mathematical Society summer meeting at Edmonton in June 2003.
You can also hear puzzles on the radio in Arkansas.
Out of Their Minds: the lives and discoveries of 15 great computer scientists is a book of biographies of 15 great computer scientists. You can see me on realplayer video talking about the book.
My latest book Natural Computing is a book about the work of computer scientists, roboticists, and other innovators about the future of computing. Here are some of the reviews.
All the smart Russian students I've had have inspired me to collaborate with playwright Marina Shron on a book about recent Russian immigrants entitled Red Blues: voices from the last wave of Russian immigrants. You can find excerpts of the book here.
Here is a list of the publishers who have translated my puzzle books in several countries:
- China, People's Republic: Hunan Science and Technology press.
- Czech Republic: Mlada
- France: Odile Jacob
- Germany: MVG Verlag
- Hungary: Typotex
- Japan: Nikkei Scientific
- Korea: Kyungmoon
- Poland: Spoldzielnia Wydawniczo-Handlowa Ksiazka i Wiedza
- Portugal: Gradiva
- Spain: Labor and Gredisa
- Slovenia: Drzavna Zalozba Slovenije, zbirka z logiko
- Taiwan, Republic of China: The Eurasian Publishing Group and Chiu Chang Math. Books & Puzzles Co.
- Turkey: Tubitak
These publishers have translated Database Tuning: principles, experiments, and troubleshooting techniques by Philippe Bonnet and me:
- China, People's Republic: Publishing House of Electronics
- Korea: KCC (Brain Korea Publishing Co.)
- Russia: Kudits Obraz
These publishers translated Out of Their Minds: the lives and discoveries of 15 great computer scientists by Cathy Lazere and me:
- China, People's Republic: Hebei University Publishers
- Japan: Nikkei business publications
- Korea: Sejong
- Taiwan: Yuan-Liou

Last modified July 2020. The November 2018 version was translated to French by Deepak Khanna. The January 2015 version was translated to Estonian by Weronika Pawlak.