Advanced Database Systems
CSCI-GA.2434-001
Fall 2012
Tuesdays 5:00 - 7, Warren Weaver 101.
Prof Dennis Shasha
Instructor: Dennis Shasha (shasha@cs.nyu.edu)
Office hours: Tuesdays at 9 pm in Warren Weaver 101
Graders:
Ashish Walia ashish.walia@nyu.edu
Meehir Nagda mn1292@nyu.edu
GOALS
To study the internals of database systems
as an introduction to research and as a basis
for rational performance tuning.
The study of internals will concern topics at the intersection
of database system, operating system, and distributed computing
research and development.
Specific to databases is the support of the notion of transaction:
a multi-step atomic unit of work that must appear to execute
in isolation and in an all-or-nothing manner.
The theory and practice of transaction processing is the problem
of making this happen efficiently and reliably.
Tuning is the activity of making your database system run faster.
The capable tuner must understand the internals and externals
of a database system well enough to understand what could be
affecting the performance of a database application.
We will see that interactions between different levels of the system, e.g.,
index design and concurrency control, are extremely important,
so will require a new optic on database management design
as well as introduce new research issues.
Our discussion of tuning will range from the hardware to conceptual
design, touching on operating systems, transactional subcomponents,
index selection, query reformulation, normalization decisions,
and the comparative advantage of object-oriented database systems.
This portion of the course will be heavily sprinkled with case
studies from database tuning in biotech, telecommunications,
and finance.
Also, since the book that Philippe Bonnet and I have written
has many tests associated with it,
you will get the benefit of those tests.
Because of my recent research (and product) interests,
this year will include frequent discussions of
-
databases that support ordered queries such as those found in finance
and science and
-
technology-enhanced privacy.
Class materials
-
Here is the syllabus in
pdf.
-
Please find homework 1 here.
-
Please find homework 2 here.
This one includes a tuning practicum so give yourself
extra time.
Homework 2 also includes a second part, called the supplement,
but that is required.
Please find that
here
-
A
frequently asked question write-up
about the replicated concurrency control project.
Also, here is a set of
sample tests
for that project.
-
There are two principal sets of class notes:
transaction processing in pdf
and
database tuning
slides.
However there are ancillary lecture notes.
-
Here is a lecture on
concurrent search structure algorithms.
-
Here
find a brief lecture traitorous failures
from the paper
"Easy Impossibility Proofs for Distributed Consensus Problems"
by Fischer, M. and Lynch, N. and Merritt, M.
Here
is a relevant picture.
-
One lecture will be on
array query languages.
The system that aquery is most similar to
has
a nice way of scaling data.
-
Here are Alberto Lerner's lecture notes
on
partitioning in large data systems
and the
accompanying prose notes
-
Here is a nice
talk about Not Only SQL databases (nosql).
along with my
notes on that talk.
Here is a link to
a course on NoSQL by Shahram Ghandeharizadeh.
Here is a
taxonomy of NoSQL systems.
Here is one of many
NoSQL benchmarks.
-
Here are Alberto Lerner's notes on
parallelizing locking, logging, accessing and buffer management
-
Here is an article about a very
fast transactional system called VoltDB
-
Here is a
discussion about the merits of eventual consistency vs. transactions.
-
Here is a
tuning case study from a travel application.
Here is a description of
AppSleuth
-
Here is a lecture about a
database-inspired language called pig.
Here is a
recent video.
-
Installing mysql
are
on PC, Linux, and Mac.
-
Eric Hielscher's notes on running mysql
are
here.
-
Another lecture will be on
an approach to allow database outsourcing
-
Here is an approximation to an n-server
capacity planner.
-
We will discuss information gain, decision trees and clustering,
perhaps among other topics.
Here is a web site of tutorials on
data mining by Andrew Moore
-
Here is a lecture on
biosurveillance
by Andrew Moore and colleagues.
-
Here is a lecture
on time series analysis
by my colleagues and me.
Here
is a cartoon concerning a related puzzle about sketches.
-
Here are some notes about
clustering.
-
We will also discuss, if we have time
several related topics found at the end of these lecture notes
(i) high performance replicated
database systems without the use of two phase commit, (ii) case studies from
Wall Street, (iii) a self-tuning technique that improves on LRU,
and (iv) a data structure for data warehouses.
Books
-
Concurrency Control and Recovery in Database Systems
by Bernstein, Hadzilacos, and Goodman, Addison-Wesley, 1987.
ISBN 0-201-10715-5
Here is a local copy in zip format.
Your reading in this book should focus on the topics we hit in lecture.
Note that the available copies algorithm I present is a slightly different
version from tha presented in the book. Please use mine for the project.
-
Database Tuning: principles, experiments, and troubleshooting
techniques
by Dennis Shasha and Philippe Bonnet 2002
Morgan Kaufmann Publishers; ISBN: 1558607536
We will go through essentially this whole book.
-
Additional books can be found in the syllabus.
Here are some
experiments having to do with database tuning
.
Here is how to call C from K:
Don Orth's description of how
to call C from K.
Here are Alan Fekete's slides on snapshot isolation
Snapshot Isolation and Fixes to It
and the even better fixes (but in a paper)
due to Michael Cahill, Uwe Roehm, and Alan Fekete.
Here is Joe Conron's nice paper on indexes
(from when he was a master's student).
Some results from database tuning projects.
Presented as rules of thumb.
Here is one very nice tuning project by
Yuhong Chen.
Here is another by
Ilya Finkelshteyn
Here is a third by
Marina Balina
Here is a fourth by
Pratik Daga.
Here are Alberto Lerner's excellent notes on performance monitoring.
Here you can find
his thesis.
Here are notes about
materialized views in Oracle.
Here is a call to a new organization of
databases by the Turing Award winner Jim Gray
Here are two very nice and very practical papers about
data warehousing by one of the foremost practitioners in the field:
Ralph Kimball on constructing an
enterprise data warehouse
and
Ralph Kimball on why to use star schemas in dimensional modeling.
Finally, here are the rules about
academic honesty.