Advanced Database Systems

CSCI-GA.2434-001
Fall 2020
Tuesdays 5:00 - 7, Warren Weaver 109

Prof Dennis Shasha

Instructor: Dennis Shasha (shasha@cs.nyu.edu)
Zoom: Join at class time the following URL: https://nyu.zoom.us/j/96365781363

Except for the first class which had pre-recorded lectures, all lectures are live and recorded here. The classes will be live but also on zoom.

Office hours by zoom on Tuesdays 2 PM to 3 PM, no appointment necessary. Starts on September 8, 2020. https://nyu.zoom.us/j/95764190139
Recorded lectures will be on the classes page on the left side labeled Panopto.

Graders:
Nandhitha (nr2229@nyu.edu, point of contact for questions regarding mySQL and Reprozip, but first look at documents below) and
Nishchitha ( nhv215@nyu.edu, point of contact for questions regarding Aquery, but first look at documents below)

GOALS

To study the internals of database systems as an introduction to research and as a basis for rational performance tuning.

The study of internals will concern topics at the intersection of database system, operating system, and distributed computing research and development. Specific to databases is the support of the notion of transaction: a multi-step atomic unit of work that must appear to execute in isolation and in an all-or-nothing manner. The theory and practice of transaction processing is the problem of making this happen efficiently and reliably.

Tuning is the activity of making your database system run faster. The capable tuner must understand the internals and externals of a database system well enough to understand what could be affecting the performance of a database application. We will see that interactions between different levels of the system, e.g., index design and concurrency control, are extremely important, so will require a new optic on database management design as well as introduce new research issues. Our discussion of tuning will range from the hardware to conceptual design, touching on operating systems, transactional subcomponents, index selection, query reformulation, normalization decisions, and the comparative advantage of object-oriented database systems. This portion of the course will be heavily sprinkled with case studies from database tuning in biotech, telecommunications, and finance. Also, since the book that Philippe Bonnet and I have written has many tests associated with it, you will get the benefit of those tests.

Because I do a lot of work on ordered data (such as financial time series), we will explore databases that support ordered queries such as those found in finance and science and we will do compare two freely downloadable systems kdb and mysql.

Class materials

Here is the syllabus in pdf.
Please find homework 1 here. Here is the ticks data
Please find homework 2 here. This one includes a tuning practicum so give yourself extra time. Here are sample like and friends tables. We may use different ones in our tests.
In addition to your learning aquery for the first assignment, you may want to use AQuery for the tuning practicum for extra credit. Here is a presentation of the system developed by Jose Pablo Cambronero.
Here is the github repository. Here is detailed installation documentation.
And here are some additional notes written by Devina Bisen in 2019.
Here are additional installation guidelines written by Nishchitha Prasad if you have MacOS Catalina.
Here is a way to install using a virtual machine (you'll need docker).
Here is a worked example with a subset of a plant database application. Here is a simplified worked example extending Jose's notes along with compilation instructions. and data for that worked example.
The appendix to this paper has other examples.
Here is a directory showing (start from the file lect5) how to generate indexes in Q.
If you like movies, here is a video explaining how it can be used. Here are some basic notes about the underlying system q. and an introduction to the arrable model in q (everything you need to know for the homework). Here is a place to get a kdb.
Because reproducibility is an important discipline to acquire, consider several alternatives, e.g. the jupyter notebook approach. Alternatively, find notes on Reprozip written by Devina Bisen. Here are notes on using bit bucket.
The main questions I get about the distributed transaction project are:
(i) Which programming language should be use? Answer: It's up to you. We just have to be to run your program on the NYU servers.
(ii) Can we work in teams? Answer: Yes, team of one or two. No more than two.
(iii) Are the deadlines firm? Answer: Yes. You have plenty of warning so no reason to be late.
(iv) What does the design document have to contain? Answer: The name(s) of the team member(s), the major modules, what they do, the programming language you will use.
(v) The project imagines a multi-server system. Do we have to use a multi-threaded simulator? Answer: No. You merely have to create the correct input-output behavior based on the dump(), R, W, begin, end, fail, recover directives.
Here is a further set of frequently asked questions and their answers. This should be read in addition to the specification.
Also, here is a set of sample tests for that project.
There are two main lecture slide decks: transaction processing in pdf and database tuning slides. However there are ancillary lecture notes.
Here is a lecture on concurrent search structure algorithms.
Here is a lecture on AQuery.
Here find a brief lecture traitorous failures from the paper "Easy Impossibility Proofs for Distributed Consensus Problems" by Fischer, M. and Lynch, N. and Merritt, M. Here is a relevant picture. Network partitions alone can lead to problems as shown in these notes that are derived from Lynch, Fisher, and Patterson A related topic is the so-called CAP theorem which states that achieving consistency (up-to-date information is always available), availability (if some site is up, it will have the most up-to-date information), and partition tolerance (this all works even if the network is broken) cannot all be done simultaneously.
On the positive side, here is a simplified discussion of Castro and Liskov's Practical Byzantine Fault Tolerance.
Here are Jake Loveless's notes on high frequency trading system considerations and his article in CACM. and my notes on his article.
The system that aquery is most similar to has a nice way of scaling data. In fact, here is an argument for why it should be used for Big Data .
Here are Alberto Lerner's lecture notes on partitioning in large data systems and the accompanying prose notes
Here is a blog about cockroachdb built on top of rocksdb Spoiler alert: it uses almost all the transaction processing techniques we've discussed in this course. While they don't specify how they do it, they are likely to use this algorithm to render snapshot isolation serializable (by aborting certain transactions that would likely lead to serialization violations using snapshot isolation as shown in figures 6, 7, 8).
Here is some information about log structured merge trees and consistent hashing.
Here are Manos Athanassoulis's notes on indexes.
Here are Stratos Idreos's and Mark Callaghan's notes about key-value store (really about navigating the data structure design space) Here is the accompanying video.
Here is a nice talk about Not Only SQL databases (nosql). along with my notes on that talk. Here is a link to a course on NoSQL by Shahram Ghandeharizadeh. Here is a taxonomy of NoSQL systems. Here is one of many NoSQL benchmarks.
Here are notes on consistent hashing for distribution and replication.
Here are Alberto Lerner's notes on parallelizing locking, logging, accessing and buffer management
Here are notes about the successes and failures of big data analysis.
Here is an article about a very fast transactional system called VoltDB
Here is a discussion about the merits of eventual consistency vs. transactions.
Here is a tuning case study from a travel application. Here is a description of AppSleuth
Here is a lecture about a database-inspired language called pig. Here is a recent video.
Installing mysql are on PC, Linux, and Mac.
Eric Hielscher's notes (brought up to date by Nandhitha Raghuram) on running mysql are here.
Another lecture will be on an approach to allow database outsourcing
Here is an approximation to an n-server capacity planner.
We will discuss information gain, decision trees and clustering, perhaps among other topics. Here is a web site of tutorials on data mining by Andrew Moore
Here is a very nice review paper showing how relational query processing can vastly accelerate machine learning.
This paper discusses the relationship between untrusted anonymous transactions and notions like serializability and fault tolerance.
Strict serializability and two phase commit work nicely on high bandwidth networks.
Here is a lecture on biosurveillance by Andrew Moore and colleagues.
Here is a lecture on time series analysis by my colleagues and me. Here is a site of open source sketch algorithms from Yahoo! by Lee Rhodes and colleagues. Here is a cartoon concerning a related puzzle about sketches.
Here are some notes about clustering.
We will also discuss, if we have time several related topics found at the end of these lecture notes (i) high performance replicated database systems without the use of two phase commit, (ii) case studies from Wall Street, (iii) a self-tuning technique that improves on LRU, and (iv) a data structure for data warehouses.
Here is a nice poster showing the genealogy of relational databases.
Here are Sonali Malik's and Rasika Jangle's excellent notes on git and bitbucket.
Here are Tristan Allard's general slides on privacy and his nice informal critique of naive methods.
Here are Patrick Valduriez's gentle introductory notes on data science.
The very real-life consequences of default values.

Course Videos

To see the videos of previous lectures: registered students should be able to go to newclasses.nyu.edu, click on "Advanced Database Systems", and then click on a Mediasite link on the left-hand side of the page. If you find the videos choppy, then (from the 2016 Teaching Assistant Nicholas Souris): "you can use a plugin for Firefox called flashgot to download the videos and I'm pretty sure there's similar plugins for other browsers as well. Though keep in mind that the downloads will take some time to finish and the files are pretty large."

Books

Concurrency Control and Recovery in Database Systems by Bernstein, Hadzilacos, and Goodman, Addison-Wesley, 1987. ISBN 0-201-10715-5 Here is a local copy in zip format.
Your reading in this book should focus on the topics we hit in lecture. Note that the available copies algorithm I present is a slightly different version from tha presented in the book. Please use mine for the project.
Database Tuning: principles, experiments, and troubleshooting techniques by Dennis Shasha and Philippe Bonnet 2002 Morgan Kaufmann Publishers; ISBN: 1558607536
We will go through essentially this whole book.
Additional books can be found in the syllabus.

Here are some experiments having to do with database tuning .

Here is how to call C from K: Don Orth's description of how to call C from K.

Here are Alan Fekete's slides on snapshot isolation Snapshot Isolation and Fixes to It and the even better fixes (but in a paper) due to Michael Cahill, Uwe Roehm, and Alan Fekete.

Here is Joe Conron's nice paper on indexes (from when he was a master's student).

Some results from database tuning projects. Presented as rules of thumb. Here is one very nice tuning project by Yuhong Chen. Here is another by Ilya Finkelshteyn Here is a third by Marina Balina Here is a fourth by Pratik Daga.

Here are Alberto Lerner's excellent notes on performance monitoring. Here you can find his thesis.

Here are notes about materialized views in Oracle.

Here is a call to a new organization of databases by the Turing Award winner Jim Gray

Here are excellent notes on problems, systems, and algorithms having to do with social media queries by Sara Cohen. Typical questions: What does it mean to be a central personality in graph databases? What is the best way to find a group of people with the right interests who are likely to be compatible? Here are notes on random graphs.

Finally, here are the rules about academic honesty.