Advanced Database Systems
CSCI-GA.2434-001
Fall 2020
Tuesdays 5:00 - 7, Warren Weaver 109
Prof Dennis Shasha
Instructor: Dennis Shasha (shasha@cs.nyu.edu)
Zoom: Join at class time the following
URL: https://nyu.zoom.us/j/96365781363
Except for the first class which had pre-recorded lectures, all
lectures are live and recorded
here. The classes will be live but also on zoom.
Office hours by zoom on Tuesdays 2 PM to 3 PM, no appointment
necessary. Starts on September 8, 2020.
https://nyu.zoom.us/j/95764190139
Recorded lectures will be on the classes page on the left side labeled Panopto.
Graders:
Nandhitha (nr2229@nyu.edu, point of contact for questions
regarding mySQL and Reprozip, but first look at documents below) and
Nishchitha ( nhv215@nyu.edu, point of contact for questions regarding Aquery,
but first look at documents below)
GOALS
To study the internals of database systems
as an introduction to research and as a basis
for rational performance tuning.
The study of internals will concern topics at the intersection
of database system, operating system, and distributed computing
research and development.
Specific to databases is the support of the notion of transaction:
a multi-step atomic unit of work that must appear to execute
in isolation and in an all-or-nothing manner.
The theory and practice of transaction processing is the problem
of making this happen efficiently and reliably.
Tuning is the activity of making your database system run faster.
The capable tuner must understand the internals and externals
of a database system well enough to understand what could be
affecting the performance of a database application.
We will see that interactions between different levels of the system, e.g.,
index design and concurrency control, are extremely important,
so will require a new optic on database management design
as well as introduce new research issues.
Our discussion of tuning will range from the hardware to conceptual
design, touching on operating systems, transactional subcomponents,
index selection, query reformulation, normalization decisions,
and the comparative advantage of object-oriented database systems.
This portion of the course will be heavily sprinkled with case
studies from database tuning in biotech, telecommunications,
and finance.
Also, since the book that Philippe Bonnet and I have written
has many tests associated with it,
you will get the benefit of those tests.
Because I do a lot of work on ordered data (such as financial time
series), we will explore
databases that support ordered queries such as those found in finance
and science and we will do compare two freely downloadable
systems kdb and mysql.
Class materials
-
Here is the syllabus in
pdf.
-
Please find homework 1 here.
Here is the
ticks data
-
Please find homework 2 here.
This one includes a tuning practicum so give yourself
extra time.
Here are sample like
and friends tables.
We may use different ones in our tests.
-
-
In addition to your learning aquery for the first assignment,
you may want to use
AQuery
for the tuning practicum for extra credit.
Here is a
presentation
of the system developed by Jose Pablo Cambronero.
Here is the github repository.
Here is detailed
installation documentation.
And here are some
additional notes
written by Devina Bisen in 2019.
Here are additional installation guidelines written by
Nishchitha Prasad if you have
MacOS Catalina.
Here
is a way to
install using a virtual machine
(you'll need docker).
Here is a worked
example
with a subset of a plant database application.
Here is a simplified
worked example extending Jose's notes along with compilation instructions.
and
data for that worked example.
The
appendix
to this paper has other examples.
Here is a directory showing (start from the file lect5)
how to generate indexes in Q.
If you like movies,
here is a video explaining how it can be used.
Here are
some basic notes about the underlying system q.
and
an introduction to the arrable model in q
(everything you need to know for the homework).
Here is a
place
to get a kdb.
-
Because reproducibility is an important discipline to acquire,
consider several alternatives,
e.g. the
jupyter notebook approach.
Alternatively, find notes on
Reprozip written by Devina Bisen.
Here
are notes on using
bit bucket.
-
The main questions I get about the distributed transaction project are:
(i) Which programming language should be use? Answer: It's up to you.
We just have to be to run your program on the NYU servers.
(ii) Can we work in teams? Answer: Yes, team of one or two. No more than two.
(iii) Are the deadlines firm? Answer: Yes. You have plenty of warning
so no reason to be late.
(iv) What does the design document have to contain? Answer: The
name(s) of the team member(s), the major modules, what they do,
the programming language you will use.
(v) The project imagines a multi-server system. Do we have to use
a multi-threaded simulator? Answer: No. You merely have to create the
correct input-output behavior based on the dump(), R, W, begin, end,
fail, recover directives.
Here is a
further set of frequently asked questions and their answers.
This should be read in addition to the specification.
Also, here is a set of
sample tests
for that project.
-
There are two main lecture slide decks:
transaction processing in pdf
and
database tuning
slides.
However there are ancillary lecture notes.
-
Here is a lecture on
concurrent search structure algorithms.
-
Here is a lecture on
AQuery.
-
Here
find a brief lecture traitorous failures
from the paper
"Easy Impossibility Proofs for Distributed Consensus Problems"
by Fischer, M. and Lynch, N. and Merritt, M.
Here
is a relevant picture.
Network partitions alone can lead to problems
as shown
in these notes that are derived from Lynch, Fisher, and Patterson
A related topic is the so-called
CAP theorem
which states that achieving consistency (up-to-date information is always
available), availability (if some site is up, it will have the most up-to-date
information), and partition tolerance (this all works even if the network
is broken) cannot all be done simultaneously.
-
On the positive side, here is a simplified discussion
of
Castro and Liskov's Practical Byzantine Fault Tolerance.
-
Here are
Jake Loveless's notes on
high frequency trading system
considerations
and his
article in CACM.
and
my notes on his article.
-
The system that aquery is most similar to
has
a nice way of scaling data.
In fact, here is an
argument for why it should be used for Big Data .
-
Here are Alberto Lerner's lecture notes
on
partitioning in large data systems
and the
accompanying prose notes
-
Here is a blog
about cockroachdb built on top of rocksdb
Spoiler alert: it uses almost all the transaction processing techniques we've discussed
in this course.
While they don't specify how they do it, they are likely to use
this algorithm
to render snapshot isolation serializable
(by aborting certain transactions
that would likely lead to serialization violations using snapshot isolation
as shown in figures 6, 7, 8).
Here is some information about
log structured merge trees and consistent hashing.
-
Here are
Manos Athanassoulis's notes on indexes.
-
Here are
Stratos Idreos's and Mark Callaghan's notes about key-value store
(really about navigating the data structure design space)
Here is the
accompanying video.
-
Here is a nice
talk about Not Only SQL databases (nosql).
along with my
notes on that talk.
Here is a link to
a course on NoSQL by Shahram Ghandeharizadeh.
Here is a
taxonomy of NoSQL systems.
Here is one of many
NoSQL benchmarks.
-
Here are notes on
consistent hashing for distribution and replication.
-
Here are Alberto Lerner's notes on
parallelizing locking, logging, accessing and buffer management
-
Here are notes about
the successes and failures of big data analysis.
-
Here is an article about a very
fast transactional system called VoltDB
-
Here is a
discussion about the merits of eventual consistency vs. transactions.
-
Here is a
tuning case study from a travel application.
Here is a description of
AppSleuth
-
Here is a lecture about a
database-inspired language called pig.
Here is a
recent video.
-
Installing mysql
are
on PC, Linux, and Mac.
-
Eric Hielscher's notes (brought up to date by Nandhitha Raghuram)
on running mysql
are
here.
-
Another lecture will be on
an approach to allow database outsourcing
-
Here is an approximation to an n-server
capacity planner.
-
We will discuss information gain, decision trees and clustering,
perhaps among other topics.
Here is a web site of tutorials on
data mining by Andrew Moore
-
Here is a very nice
review paper
showing how relational query processing
can vastly accelerate machine learning.
-
This paper discusses the relationship between
untrusted anonymous transactions
and notions like serializability and fault tolerance.
-
Strict serializability and two phase commit
work nicely on high bandwidth networks.
-
Here is a lecture on
biosurveillance
by Andrew Moore and colleagues.
-
Here is a lecture
on time series analysis
by my colleagues and me.
Here is a
site of open source sketch algorithms
from Yahoo! by Lee Rhodes and colleagues.
Here
is a cartoon concerning a related puzzle about sketches.
-
Here are some notes about
clustering.
-
We will also discuss, if we have time
several related topics found at the end of these lecture notes
(i) high performance replicated
database systems without the use of two phase commit, (ii) case studies from
Wall Street, (iii) a self-tuning technique that improves on LRU,
and (iv) a data structure for data warehouses.
-
Here is a nice
poster showing the genealogy of relational databases.
-
Here are Sonali Malik's and Rasika Jangle's excellent notes on
git and bitbucket.
-
Here are Tristan Allard's
general slides on privacy
and his nice
informal critique of naive methods.
-
Here are Patrick Valduriez's
gentle introductory notes on data science.
-
The very real-life
consequences
of default values.
Course Videos
To see the videos of previous lectures:
registered students should be able
to go to newclasses.nyu.edu, click on "Advanced Database
Systems", and then click on a Mediasite link on the left-hand side of
the page.
If you find the videos choppy, then
(from the 2016 Teaching Assistant Nicholas Souris):
"you can use a plugin for Firefox called flashgot to download the videos and
I'm pretty sure there's similar plugins for other browsers as well. Though
keep in mind that the downloads will take some time to finish and the files
are pretty large."
Books
-
Concurrency Control and Recovery in Database Systems
by Bernstein, Hadzilacos, and Goodman, Addison-Wesley, 1987.
ISBN 0-201-10715-5
Here is a local copy in zip format.
Your reading in this book should focus on the topics we hit in lecture.
Note that the available copies algorithm I present is a slightly different
version from tha presented in the book. Please use mine for the project.
-
Database Tuning: principles, experiments, and troubleshooting
techniques
by Dennis Shasha and Philippe Bonnet 2002
Morgan Kaufmann Publishers; ISBN: 1558607536
We will go through essentially this whole book.
-
Additional books can be found in the syllabus.
Here are some
experiments having to do with database tuning
.
Here is how to call C from K:
Don Orth's description of how
to call C from K.
Here are Alan Fekete's slides on snapshot isolation
Snapshot Isolation and Fixes to It
and the even better fixes (but in a paper)
due to Michael Cahill, Uwe Roehm, and Alan Fekete.
Here is Joe Conron's nice paper on indexes
(from when he was a master's student).
Some results from database tuning projects.
Presented as rules of thumb.
Here is one very nice tuning project by
Yuhong Chen.
Here is another by
Ilya Finkelshteyn
Here is a third by
Marina Balina
Here is a fourth by
Pratik Daga.
Here are Alberto Lerner's excellent notes on performance monitoring.
Here you can find
his thesis.
Here are notes about
materialized views in Oracle.
Here is a call to a new organization of
databases by the Turing Award winner Jim Gray
Here are excellent notes on problems, systems, and algorithms
having to do with
social media queries
by Sara Cohen.
Typical questions: What does it mean
to be a central personality in graph databases?
What is the best way to find a group of people with the right
interests who are likely to be compatible?
Here are notes on
random graphs.
Finally, here are the rules about
academic honesty.