Time Series in Finance: the array database approach
Prof. Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Outline

What are time series as
used in business and finance?

What do typical systems (FAME, SPLUS, SAS, KSQL)
do to support them?
I include challenge queries for you to try against your
favorite SQL or alternative database management system.

What are the relevant strengths of each system?

Is there an ideal timeseries language?

Fintime, a time series benchmark
http://cs.nyu.edu/cs/faculty/shasha/fintime.html

Which research in temporal data mining might help
finance?

Time series bibliography.

Brief glossary of statistical concepts.
Scenario

Group discovers the desirability of ``pairs trading.''

The goal is to identify pairs (or in general groups) of stocks
whose prices track one another after factoring in dividends.
One can make money (lots was made in the 1980s), because, for example,
if you know that the two banks Chase and Citibank track
one another (their difference is a stationary process) and Chase goes up
but Citibank doesn't, then buy Citibank and sell Chase.
Unless there is a good external reason for the difference, that is.
(This is simplistic: one needs a linear
combination of the two price series (once the market factor is accounted =
for (removed) and dividends included) to be stationary.
But this is the idea.)

Typical challenge queries from such an application:

Correlate the price histories of two stocks
or in general among many stocks and options.
(For most traders, returns are more interesting than prices, because
they have better statistics: a stock that trends up
over the years has an unstationary mean, but perhaps
a stationary return.
So, one performs correlations over ``returns.''
The return at time t is ln(price(t)/price(t1)).)

Perform the correlation over certain time intervals
to evaluate the stationarity.

The correlation might be weighted: recent history
counts more than distant history
So, What's the Database Problem?

The raw data comes in the form of ticks (stock, quantity, price)
and can be stored in a relational database without a problem.

The fundamental difficulty is that the relational model
does not take advantage of the order of rows.
Whereas one can perform an ``order by'' query and
manipulate the data in some other language, one cannot natively
manipulate the ordered data using select, from, and where.
Arguably, this is good for data independence, but it is bad
for time series.

Realizing this, the traders curse a lot and tell their programmers
to cobble something together.
The programmers do so and create a piece of software that is part
spreadsheet, part special purpose database, with lots of C++ code.
Employment goes up.

Note 1:
Joe Celko shows how to bend
SQL to the task of simulating order in his popular and excellent book
The SQL Puzzle Book, published by Morgan Kaufmann.
Usually, the bending results in a loss of efficiency.
It also works only for special cases.

Note 2:
Objectrelational systems address this
issue by providing special data types
and userdefined functions.
My goal is to show the array database approach.
The two are converging, but the array people
have been at it longer and have some good ideas.
What Are Time Series

Time series = sequence of values usually recorded at regular
increasing intervals
(yearly, monthly, weekly, ... secondly).
Regularity is critical: without regularity, moving
averages, autocorrelations, and volatility would not make sense
(e.g. if I have a sequence of daily
price closings and then 1,000 values within one day, the moving
average covering the entire sequence doesn't make much sense).
Nonregular time series are also of interest (e.g. the history
of stock splits), but we can say less about them.

Time series also exhibit historicity: the past
is an indicator of the future.
That is why autoregression can be used to predict the future
of sales and why the past volatility may predict future volatility.

Notice that temporal logics, for example, use the fact that
the i+1st value in a sequence is at a later time than the ith value,
but assume neither regularity nor historicity.
Temporal query languages are equally agnostic about this question.
System Support for Time Series

We want to be able to create time series, manipulate them,
keep them in persistent storage, and display them in reports.

Time series have frequencies, but may not have values
for every time instance at the stated frequency,
e.g. business day has the frequency of a day but has no values on
holidays or weekends.
On the other hand, time frequencies with gaps can present problems
in international markets.
For example,
some Asian stock markets are open on Saturdays.
Different markets don't have the same holidays in general.
One solution is to store values without gaps everywhere
(i.e. every day).

Then the question becomes: How to fill the gaps?
The answer has to do with the kind of value stored.

The values associated with each time are of two general types
(we borrow this distinction from the FAME system):

level values stay the same
from one period to the next in the absence
of activity.
For example, inventory is a level value, because
inventory stays the same if you neither buy nor sell.

flow values are zero in the absence of activity.
For example,
expenses go to zero if you buy nothing.

This distinction turns out to be important when interpolating
missing values and for time scale conversion.
Operations on Time Series Data

A typical framework
is that of the FAME system, since it embodies an excellent
understanding of the special properties of time series.
FAME stands for forecasting, analysis and modeling environment
FAME information systems, Ann Arbor Michigan.
www.fame.com

Data Preparation (i.e. interpolating and time scale conversion)  curvefitting

Queries (e.g. moving averages and sums)  aggregates over time.

Forecasting (e.g. statistical or data miningbased extrapolation)
 regression, correlation, Fourier analysis, and patternfinding.
Data Preparation

Sometimes it is necessary to relate time series that don't
have the same time frequencies, e.g. mine is days and yours is weeks.

Converting one to the other depends on the kind of value one has.
For example, if the daily time series denotes inventory level,
then converting from daily to weekly simply entails taking
the inventory level at the end of each week.
On the other hand, if the daily time series denotes revenues (a flow
type of value),
then one must sum them up to get weekly revenues.

Time conversion can force interpolation too, especially when
graphing values.
Typically, systems use various spline techniques
such as a cubic spline to interpolate missing values.

Interpolation can be more involved than mere curvefitting, however
as in the BlackDermanToy interpolation of the yield curve.
So, users should be able to add in their own interpolation functions.
Query Types  try these on your Database

cumulative sum, e.g. year to date sales.

moving averages, e.g. 30 day average of stock prices.

nth best, e.g. 5th best sales region.
median  one in the middle
rank  associate ordinal to each value based on its sort order.

discretize  e.g. rank the revenues by whether they
are in the top third, the middle third, or the bottom third.
This implies discovering the boundaries and then using them
in an update query.

yeartoyear comparisons  e.g. balance of trade of this year vs. last.

accounting functions  e.g. average growth rate, amortization,
internal rate of return and so on.

statistical functions  e.g. autocorrelation, and correlation between
two series.
Forecasting

Before the 1920s, forecasting meant
drawing lines through clouds of data values.
Yule
invented the autoregressive technique in 1927,
so he
could predict the annual number of sunspots.
This was a linear model and the basic approach
was to assume a linear underlying process modified by noise.
That model is often used in marketing (e.g., what will
my sales of wheat be next month).

Autoregression uses a weighted sum of previous
values to predict future ones.
There are also seasonal autoregressive models.

These and other models are incorporated
in time series products such as FAME, SAS and SPLUS.

In options finance, the basic approach is to assume that the price
of an equity is based on a random walk (Brownian motion)
around a basic slope.
The magnitude of the randomness is called the volatility.
In a result due to Norbert Wiener (he worked it out to shoot down
bombers over London), for this model, the standard deviation of the
difference between the initial price and the price
at a certain time t
rises as the square root of time t.
Steps in a Typical FAME Session

Specify frequency.
Say monthly, starting at January 1, 1996
and ending at the current time.

Create
sales and expenses time series by importing
these from a file or typing them in.
Specify that these are flow type time series.

Create a new time series:
formula profit = sales  expenses.

Create a fourth time series with weekly frequency
on inventory.
Specify that inventory is a level type time series.

Convert the first three time series to a weekly
frequency (by dividing the monthly values by 4.2
or by constructing a cubic spline to make
the sales, expenses, and profits curve look smooth).
This interpolation depends on knowing that sales and expenses
are flowtype values.

Now, use autoregression to predict future time series values.
SPlus
SPlus, some details

A nice reference to the language on the web has been written by
Carlos Alzola and Frank Harrell.
It is entitled
An Introduction to SPlus and the Hmisc
and Design Libraries )
at the University of Virginia.
(Hmisc is a set of addon functions that are truly
miscellaneous.)
http://fharrell.biostat.virginia.edu/s/index.html

The SPlus programming model is vectororiented.
Here are some typical statements:

sum(age < mean(age)) gives the count of people
who are younger than the mean.

ageg < ifelse(age < 16, 'Young', 'Old')
assigns to the vector ageg, a sequence of Young/Old values
thus discretizing the data.

find positions of elements
in a table that have the values of x,
thus performing a select while maintaining order.

So, one can create vectors, do statistics on them,
and perform selectionstyle operations on them.

The SPlus implementation is also vectororiented.
In their case, this creates a performance discontinuity.
SPlus is slow for large databases,
even for data that exceeds RAM size.
Below RAM size, it is very fast.
For this reason,
SPLUS is often used with SAS, because SAS has some data management
capabilities as we'll see.
SAS
KDB

KDB is a database system
implemented on top of the K language environment (produced by Kx systems
www.kx.com), an array language.
Data structures (e.g. tables) can be interchanged between
the two and functions can be called in both directions.
A free trial version can be downloaded.

KDB supports an SQL dialect called KSQL.
KSQL is easy to learn (for anyone fluent in SQL) and carries
over the speed and functionality of K to large data manipulation.
KDB also supports most of standard SQL.

The basic data structure in KSQL is the
arrable (arraytable) which is a table whose order can be exploited.
In this way, it is very similar to SPlus.

Arrables are nonfirstnormal form objects: a field of a record
can be an array.
For example, an entire time series can be stored in a field.

Like most modern SQLs, KSQL allows the inclusion of
userdefined functions inside database statements.
Unlike other SQL's KSQL allows functions to be defined over
arrays as well as scalars.

Like classical SQL, KSQL has aggregates, grouping,
selections and string matching, and so on.

KSQL adds many useful functions to SQL, permitting
economical expression and often better performance by exploiting order.
For example, finding the fifth highest value is a linear time
operation in KSQL but requires a selfjoin in SQL, which
is only sometimes linear time.

KDB can function as a high performance distributed server
with full recovery and distributed consistency.
(KDB guarantees consistency by using ordered atomic
broadcast and a replicated state machine
design rather than two phase commit.)

Some performance notes

KDB can do
40,000+ TPC/B transactions per second against a 100
megabyte database
on a Pentium 2.

Typical realworld application (at Lehman Brothers):
time series: keyed by instrument id, date.
Data: 11 million rows, couple of hundred
columns, typical query returns 7500 rows
with 15 columns.
10 gigabytes.
Query: is find all information having
to do with a given set of instruments at a given date.
This returns 7500 rows in less than a second.
KSQL Basics  an extended example

Create tables (arrables, really, but we'll conform to standard
terminology, even though it's slightly inexact)
either through K or within .t scripts.
We will use .t scripts for now and will set up
a simple trade database and then go through it line
by line (you can copy out this file and run it, provided
you also copy out newstat.k which will be given later).
'trade.t'
n:1000 # number of trades
'generate'
stock:([name:('Alcatel','Alstom','AirFrance','FranceTelecom','Snecma','Supra')]
industry:('telecom','engineering','aviation','telecom','aviation','consulting'))
trade:([]
stock:n rand stock.name,
price:50 + .25 * n rand 200,
amount:100 * 10 + n rand 20,
date:date['19980101'] + n rand 365)
trade:'date' asc trade
show'trade'
'last price (in time) by stock, month'
'we get a vector and then take the last'
t1:select last price by stock, date.month from trade
'trade volume by month and industry'
show select volume:sum amount * price
by date.month, stock.industry from trade
'5 day moving avg price by stock, month'
t3:select 5 avgs price
by stock, date.month
from trade
\\l newstat.k
'5 day moving max price by stock, month'
t4:select movingmax[5,price] by stock, date.month from trade

Here is the newstat.k file.
/ newstat.k
/ moving max of last num elements
movingmax:{[num; vec]
minel: &/vec
vec: ((num1) # minel), vec
size: #vec
indexes: (num1) + !(size(num1))
findmax:{[n;v;i] /v[(i(n1))+!n]}
:findmax[num;vec]'indexes
}
Commentary on Part of trade.t

The line
t1:select last price by stock, date.month from trade
performs a group by stock and yearmonth in the trade table.
Mathematically, the by clause partitions the
records based on distinct stock/yearmonth
values, like
a group by in SQL.
What is different is that
each partition is guaranteed to
be ordered in the same way as in the table (arrable),
in this case, by ascending order of date.
For each partition
corresponding to stock s and yearmonth x,
the select last price
part will return
s, x, and p where p is the price of the last record in the partition
corresponding to s and x.
Because the table is ordered by date, the last record will be
the one with the most recent date.

In the select statement
show select volume:sum amount * price
by date.month, stock.industry from trade
(i) the date.month expression in the by
clause gives each distinct yearmonth value in the date column.
This is a big convenience in time series applications.
The by clause groups by these yearmonth values.

The lines
'5 day moving avg price by stock, month'
t3:select 5 avgs price by stock, date.month from trade
use the avgs function which, given a vector (price for
each stock in this case), returns a vector
of the same length, computing a moving average.
5 avgs computes the five day moving average.

The line
\\l newstat.k
loads one or more k functions from the file
newstat.k, in this case a single moving
maximum taking two arguments, a number and a vector.

The line
t4:select movingmax[5,price]
by stock, date.month from trade
uses the movingmax function defined in the K file
newstat.k to compute the 5 day moving
maximum of prices grouped by stock and yearmonth.
The movingmax function takes a scalar and a vector and that's
exactly what the by clause delivers.
Two of our Challenge Queries using Vectors

Dot product of prices offset by 10 days for each stock
\dotprod:{[x;y] +/(x*y)}
t4: select dotprod[(10 drop price), (10 drop price)]
by stock from trade

The 10th highest price for each stock
t5: select last 10 first price
by stock from 'price' desc trade

You should be able to do the rest assuming that you can put in arbitrary
functions.
Is There an Ideal Time Series Language?

The ability to treat sequences as first class objects
on which one can do useful operations within
the database system.
FAME, SPlus, SAS, and KSQL.

The ability to treat multiple sequences together
for correlations and other purposes.
FAME, SPlus, SAS, and KSQL.

A basic collection of functions, e.g.
aggregates, moving aggregates, statistics, crosscorrelations,
interpolation, etc.
FAME, SPlus, SAS, and KSQL.

The ability to integrate userdefined functions into the query
engine.
My belief: userdefined functions are
necessary for time series. There is no analogue to relational completeness
that will satisfy all (or even most) time series applications.
Prove me wrong if you can.
KSQL and SPlus; FAME and SAS to some extent.

The main helpful database amenities include
a rich relational vocabulary and
the ability to work efficiently with disk as well as RAMresident
data.
KSQL, FAME and SAS.

Treat time
specially, e.g. be able to group by date.month or date.year of KSQL.
FAME, SAS, SPlus with libraries, and KSQL.

Treat values appropriately, e.g. with the level and flow concepts of FAME.
FAME, natively.
The others require libraries.

Give support for bitemporality (discussed next) at least as an extension.
No system, natively.

Objectrelational database vendors have begun to field
products with timeseries extensions.
It will be interesting to see how well they measure up.
The Bitemporal Challenge

Even in finance, there is more to life than looking at flows
and levels.
Sometimes, one must do historical searches and understand the
state of the enterprise in the past or future.

Bitemporal = two times (Snodgrass, Ahn, Jensen, and others)
(A very nice review paper was published in Information Systems:
``Semantics of TimeVarying Information
Christian S. Jensen, Richard T. Snodgrass
Information Systems, 21(4): 311352 (1996) )
The two times are valid time (when a fact holds)
and transaction time (when the fact was asserted).

Using this model, one can ask:
What was Rick's salary on April 1, 1998?
Suppose the answer was $1,500 per week, given
the information we know now (i.e. according to
the latest transaction).
But somehow we sent Rick a check for much more.
This gives rise to a new question:
What did we believe on March 28, 1998 (when payroll printed out Rick's
100,000 dollar weekly check) about Rick's salary valid on April 1, 1998?
Thus, we look at the transactions concerning Rick's salary
that preceded March 29, 1998 and see what was asserted
about Rick's salary as of April 1, 1998.
If the asserted value was 100,000 dollars per week, then
we know that the mistaken payment was due to a database error,
not a payroll processing error.
So, bitemporal databases are useful to track the cause of failures
in operations.

This example also
shows their general usefulness for human resources applications,
What is the job history of this candidate within the organization?

Or the more challenging:
How has the job history of this candidate been corrected?

For trading applications, we can use bitemporality to track the correlations
between predictions of earnings and price,
What did we believe as of January of this year that first
quarter earnings would be?

Unitemporal example for those with SQL skill:
Rick Snodgrass suggested the following example
to test your system's temporal acumen.
The salary of employees changes over time, usually in a positive direction.
Each salary for an employee therefore has a time when it is valid,
marked by starttime and endtime, the notion being that the salary
becomes valid at starttime and ceases being valid at endtime.
Your task is to compute the average salary over time as a set of rows
having the average salary as well as the start and end times.
Two consecutive intervals with the same average salary should be merged.
Send me mail if you want to see a KSQL version.
FinTime a Financial Time Series Databases
The design of this benchmark is joint work
with Kaippallim Jacob of Morgan Stanley (kjacob@ms.com)
and its full description can be found at
http://cs.nyu.edu/cs/faculty/shasha/fintime.html.
Here we present a summary of that benchmark.
Goal
The FinTime benchmark tries to model the practical uses
of timeseries databases in financial applications.
The
models suggested in FinTime reflect two frequently occurring cases in the
financial industry, namely, a historical market data system and realtime price
tick database. These models are quite similar to two wellstudied models in the
relational world, namely, decision support systems and OLTP systems. FinTime
also suggests and defines metrics that capture three useful dimensions of any
timeseries system, namely, performance in a singleuser mode, performance in a
multiuser mode and the price to performance ratio.
Models for a
Timeseries Benchmark
Before deciding on a model, we have to
examine the different parameters that determine a model for timeseries system.
The most important parameters that influence a timeseries database system
are:
 Periodicity of data (Regular/irregular)
 Density of data
(Dense/Sparse)
 Schedule of updates (periodic, continuous)
 Types
of queries (Simple/Complex)
 Time interval between queries (Ad
hoc/Batch)
 Number of concurrent users
(Few/Many)
Combinations of these factors will give rise to 64
possible models but for simplicity we can focus on the following commonly
occurring cases in the financial industry
Model 1:
Historical market Information
Attribute  Specification 
Periodicity of
Data  Periodic 
Density of Data  Dense 
Schedule of
updates  Periodic updates (e.g. at the end of a business
day) 
Complexity of queries  Complex (e.g. Decision support
queries) 
Nature of query arrival  Batch 
Concurrency of
users  Low (e.g. Few concurrent users) 
Model 2: Tick
databases for financial instruments
Attribute  Specification 
Periodicity of
Data  Nonperiodic 
Density of Data  Sparse to moderately
dense 
Schedule of updates  Continuous 
Complexity of
queries  Simple 
Nature of query arrival  Ad hoc 
Concurrency of
users  High (e.g. Many concurrent users) 
Let us now discuss the characteristics of
the two models in some detail:
Model 1: Historical Market
Information
Historical Market data systems are closely
related to decision support systems of the traditional relational database
world. The various elements of this model are:
 Data Model:
In the case of historical timeseries databases for market information,
the model consists of a few relational tables that typically contain
infrequently changing (static) information and a number of timeseries tables.
The following tables describe the data in this
benchmark
Base
Information Table
Field
Name  Type  Comments 
Id  Char(30)  Unique Identifier for the financial
instrument e.g. IBM.N 
Ex  Char(3)  Exchange on which the instrument is traded e.g.
"N" for the New York Stock Exchange 
Descr  Varchar(256)  A brief description of the financial security e.g.
"International Business Machines, Armonk, NY" 
SIC  Char(10)  Standard Industry Code
e.g. "COMPUTERS" or "CHEMICAL" 
SPR  Char (4)  S&P Rating for the
company e.g. "AAA" 
Cu  Char(5)  The base currency in which the
security is traded e.g. "USD" for US Dollars 
CreateDate  Date  Date this security came
into existence 
Stock Splits (Irregular
Timeseries)
Field Name  Type  Comments 
Id  Char(30)  Identifier 
SplitDate  Date  Date the split is actually
effective 
EntryDate  Date  Date the split is announced 
SplitFactor  Float  The split factor
expressed as a decimal value. E.g. a 2 for 1 split is expressed as 0.5 (ie
1/2), a 4 for 3 is expressed as 0.75(ie 3/4) 
Dividends (Irregular
Timeseries)
FieldName  Type  Comments 
Id  Char(30)  Identifier 
XdivDate  Date  Date on which dividend is
disbursed 
DivAmt  Float  Dividend Amount In the base currency of the
instrument 
AnnounceDate  Date  Date on which the dividend is
announced 
Market Data (Regular
Timeseries)
FieldName  Type  Comments 
Id  Char(30)  Identifier 
Tradedate  Date  Trade Date 
HightPrice  Float  High price for the day

LowPrice  Float  Low price for the day 
ClosePrice  Float  Closing price for the
day 
OpenPrice  Float  Open price for the day 
Volume  Long  Volume of shares
traded 

Market information is typically provided as a set of input files by a market
data vendor at the end of each trading day. These files correspond to
the tables defined above. While the data for the Base
Information table, Split Adjustment Table and Dividend
Table is irregular (ie an external event triggers an entry into these tables),
the Market Information Table has an entry for every trading day. For the
purposes of this benchmark the implementation can select a scale factor
combination of the number securities under consideration and the number of
events for those securities. We suggest 3 scale factors, namely, 50,000
securities, 100,000 securities,
and 1,000,000 securities, all for 4,000 days.
These roughly correspond to all equity
securities in the US, all equity securities in the G7 countries and all equity
securities in the world. Programs to generate the data can be found
on the benchmark pages.
 Query characteristics.
The queries model:
 Join of relational data and
timeseries information
 Access of long depth timeseries information
for a few keys (deep history query)
 Access of a short depth
timeseries for a large number of keys (crosssectional
queries)
 Sorting
 Grouping and
Aggregation
Following are the
benchmark queries (we don't require any special query language).
Many of them include a notion of ``specified set of securities''
or ``specified period''. These notions are defined
with respect to a simple random model.
1  Get the closing price of a set of 10 stocks for a 10year period and
group into weekly, monthly and yearly aggregates. For each aggregate period
determine the low, high and average closing price value. The output should be
sorted by id and trade date. 
2  Adjust all prices
and volumes (prices are multiplied by the split factor and volumes are divided
by the split factor) for a set of 1000 stocks to reflect the split events
during a specified 300 day period,
assuming that events occur before the first trade of
the split date. These are called splitadjusted prices and volumes.

3  For each stock in a specified list of 1000 stocks, find the
differences between the daily high and daily low on the day of each split event
during a specified period. 
4  Calculate the value
of the S&P500 and Russell 2000 index for a specified day using unadjusted
prices and the index composition of the 2 indexes (assume to be given) on the
specified day 
5  Find the 21day and 5day moving average price for a
specified list of 1000 stocks during a 6month period. (Use split adjusted
prices) 
6  (Based on the previous query)
Find the points (specific days) when the 5month moving average intersects the
21day moving average for these stocks. The output is to be sorted by id and
date. 
7  Determine the value of $100,000 now if 1 year ago it was invested
equally in 10 specified stocks (i.e. allocation for each stock is $10,000). The
trading strategy is: When the 20day moving average crosses over the
5month moving average the complete allocation for that stock is invested and
when the 20day moving average crosses below the 5month moving average the
entire position is sold. The trades happen on the closing price of the trading
day. 
8  Find the pairwise coefficients of correlation in a set of 10
securities for a 2 year period. Sort the securities by the coefficient of
correlation, indicating the pair of securities corresponding to that row.
[Note: coefficient of correlation defined in appendix] 
9  Determine the yearly dividends and annual yield
(dividends/average closing price) for the past 3 years for all the stocks in
the Russell 2000 index that did not split during that period. Use unadjusted
prices since there were no splits to adjust for. 
Since the benchmark model assumes that the
level of concurrency is small, the number of users should be 5. The benchmark
methodology should be that each of the five users will pick a query at random
from the set above without replacement
and submit it to the database. The five users will be
simultaneously active at any point in
time. Each user should do every query and then stop.
Model 2: Tick databases for
financial instruments
This second benchmark
models the
case where the database (consisting
of ticks) is expected to keep up with a very high rate of
updates while responding to several users issuing fairly simple queries.
Ticks are price
quotation or trade (transaction) prices and associated attributes for
individual securities that occur either on the floor of a stock exchange or in
an electronic trading system, such as the NASDAQ market system. Ticks include 2
basic kinds of data (1) Trades are transactions between buyers and a sellers at
a fixed price and quantity (2) Quotes are price quotations offered by buyers
and sellers. Quotes can have the ask quote, the bid quote or both along with
their associate attributes such as quantity offered.
Let us now define
the different elements of this benchmark. They are:
 Data model.
The data model for this benchmark is simple. It consists of the following 2
tables:
Base Information
Table
Field Name  Type  Comments 
Id  Char(30)  Identifier for a security 
Ex  Char(3)  Exchange on which the
instrument is traded e.g. "N" for the New York Stock
Exchange 
Descr  Varchar (256)  A short description of the
security 
SIC  Char(10)  The standard industry code for the security 
Cu  Char(10)  Base currency for the
security 
Trades/Quotes
Table
FieldName  Type  Comments 
Id  Char(30)  Identifier 
SeqNo  Integer  A unique sequence number for each trade or
quote 
TradeDate  Date  The trading date 
TimeStamp  Time  The time a particular trade was
executed or Quote was generated 
TradePrice  Float  The price at which the trade was
executed 
TradeSize  Integer  The number of shares traded in a transaction 
AskPrice  Float  The Price at which a
seller would sell a security 
AskSize  Float  The size of the transaction offered
at the specified ask price. 
BidPrice  Float  The price at which a buyer would buy
the security 
BidSize  Float  The size of the transaction offered at the specified bid
price 
Type  Char  An indication if this is a quoteor a trade

 Data population,
Update frequency and volume. Tick databases are generally populated by
adapters that receive data from realtime feeds. The frequency of updates is
constantly increasing but on the average we can assume that each security ticks
(each corresponding to a record in the database) about 100 times during an
8hour trading day. Additionally, we can assume that the set of securities
being tracked is traded around the world and hence there is no quiescent period.
For the purposes of this benchmark we will assume that the system monitors
ticks on 1,000, 10,000 or 100,000 securities, where each represents a different
scale factor.
A very important consideration in tick
databases is the ability to quickly apply "cancel/correct".
Occasionally an incorrect quote or trade record is published. The vendor will
then send a correction record with the identifier of the security and its
sequence number. The record will either be corrected according to the new
published or simply deleted.
 Query characteristics. The
kinds of queries issued against an online tick databases are usually simple and
predetermined. Listed below are the set of queries to be used in the
benchmark.
1  Get all ticks a specified set of 100
securities for a specified three hour time period on a specified trade
date. 
2.  Determine the volume weighted
price of a security considering only the ticks in a specified
three hour interval 
3.  Determine the top 10 percentage losers for the specified date on the
specified exchanges sorted by percentage loss. The loss is calculated as a
percentage of the last trade price of the previous day. 
4.
 Determine the top 10 most active stocks for a specified date sorted
by cumulative trade volume by considering all trades 
5.  Find the most active
stocks in the "COMPUTER" industry (use SIC
code) 
6.  Find the 10 stocks with the
highest percentage spreads. Spread is the difference between the last askprice
and the last bidprice. Percentage spread is calculated as a percentage of the
midpoint price (average of ask and bid price). 
 Operational parameters. Since
usually tick systems have a relatively large number of concurrent users, we
will assume that there are 50 concurrent users for the benchmark. These users
randomly pick queries from the list above and submit them for execution. Each
user will submit each query only once and run through the complete list of
queries.
Finding Patterns in Temporal Data

Financial traders have done data mining for many years.
One trader described his work to me as follows:
I think about an arbitrage trick (pairs trading
is such a trick). Program for a few months.
Try the trick and either it works or it doesn't.
If it doesn't, I try something new.
If it works, I enjoy it until the arbitrage disappears.

What does the research community have to offer to
such traders?
I present some research that I think might be most relevant.
I will be updating this as time goes on.

U.M. Fayyad, G. PiatetskyShapiro, P. Smyth, and R. Uthurusamy, editors
Advances in Knowledge Discovery and Data mining
AAAI Press/ The MIT Press, 1996.
The article by Berndt and Clifford about
finding patterns in time series is particularly relevant to finance.

Temporal Databases  Research and Practice
Editors: Opher Etzion, Sushil Jajodia, Sury Sripada. (SpringerVerlag,
1998).
There, you will find articles about finding unexpected patterns (e.g. fraud) and
multigranularity data mining.

Christos Faloutsos
Searching Multimedia Databases by Content
Kluwer Academic Publishers.
This book shows how to do signal processing analysis on time series
to solve problems such as:

Discovering whether two time series have similar shapes:
the basic idea is to
store the first few Fourier coefficients of a time sequence in a
database and assert that two time sequences are similar if
their Fourier coefficients are close.
(Remarkably this works well because the energy spectrum for
stock prices declines with
the power 2 with increasing coefficients.)
Joint work with Rakesh Agrawal and Arun Swami.
The efficiency of this technique has been improved by
Davood Rafiei and Alberto Mendelzon of the University of Toronto.

Subsequence matching (is this sequence
close to some subsequence of that sequence?).
Faloutsos
uses a special data structure called Fastmap to make this performant.

Other papers explore
the question of similarity search when time scaling and inversion
is possible:

R. Agrawal, KI Lin, H.S. Sawhney and K. Shim.
``Fast similarity search in the presence of noise, scaling
and translation in timeseries databases.''
Proc of the 21st VLDB Conference, 1995

D. Q. Goldin and P. C. Kanellakis.
``On similarity queries for timeseries data:
constraint specification and implementation.''
1st International Conference on the Principles
and Practice of Constraint Programming. pp. 137153.
SpringerVerlag, LNCS 976. September 1995.

Davood Rafiei and Alberto Mendelzon.
``Similaritybased queries for time series data''
ACM Sigmod, pp. 1324. May 1997

Yi,
Efficient Retrieval of Similar Time
Sequences Under Time Warping.
Data Engineering, 1998.

Excellent work has also been done on data structures
by many researchers at Brown, Polytechnic, and the
University of Maryland, but that falls outside the data mining
purview.

As an alternative to seeing whether two sequences or subsequences match,
one might want to describe a desirable sequence (e.g. a headandshoulders
movement of stock prices) and see whether it is present.
Relevant papers about this include:

H.V. Jagadish, A. O. Mendelzon and T. Milo.
Similaritybased queries. PODS 1995.

R. Agrawal, G. Psaila, E. L. Wimmers and M. Zait.
Querying shapes of histories.
Proceedings of the 21st VLDB Conference.
pp. 502514. 1995.

P. Seshadri, M. Livny and R. Ramakrishnan.
Sequence query processing.
ACM SIGMOD, pp. 430441, 1994
Data model and query language for sequences in general,
with time series as a special case.

Arie Shoshani, Kyoji Kawagoe: Temporal Data Management. VLDB 1986: 7988
One of the first papers in the literature.

Snodgrass, R.~T., editor, The TSQL2 Temporal Query Language , Kluwer
Academic Publishers, 1995, 674+xxiv pages. The TSQL2 Language Design
Committee consisted of Richard Snodgrass (chair),
Ilsoo Ahn, Gad Ariav, Don Batory, James
Clifford, Curtis E. Dyreson, Ramez Elmasri, Fabio Grandi, Christian
S. Jensen, Wolfgang Kaefer, Nick Kline, Krishna Kulkarni,
T. Y. Cliff Leung, Nikos Lorentzos, John F. Roddick, Arie Segev,
Michael D. Soo and Suryanarayana M. Sripada.
TSQL2 has timevarying aggregates, including moving window aggregates,
aggregates over different time granularities, and weighted over time.

Munir Cochinwala, John Bradley: A Multidatabase System for Tracking and
Retrieval of Financial Data. VLDB 1994: 714721
A paper discussing the implementation of a tick capture
and query system  for those brave enough to roll their own.

Raghu Ramakrishnan,
Donko Donjerkovic,
Arvind Ranganathan,
Keven S. Beyer, and
Muralidhar Krishnaprasad:
SRQL: sorted relational query language
SSDBM 98
A paper discussing a model in which relations are tables
that can be ordered.
This allows one to do moving averages, find ten cheapest, preceding
fifteen, etc.
The strategy is to extend SQL with order and special operators.

Leonid Libkin and colleagues:
An optimizable arrayoriented language based on comprehensions.
The basic primitives are tabulation (analogous to selection),
subscripting (remove elements from arrays), dimension
reduction (like count of an array), and interaction
between sets and arrays.
Optimizations are analogous to pushing selects into expressions
and techniques that reduce the complexity of expressions.
Books on Time Series for Computer Scientists

C. Chatfield,
The Analysis of Time Series: Theory and Practice
Chapman & Hall fourth edition 1984.
Good general introduction, especially for those completely
new to time series.

P.J. Brockwell and R.A. Davis,
Time Series: Theory and Methods,
Springer Series in Statistics (1986).

B.D. Ripley and W.N. Venables,
Modern Applied Statistics with SPlus, Springer (1994)
Chapter 14 has a good discussion of time series.
http://www.stats.ox.ac.uk/~ripley/
has a lot of useful functions.

FinTime, a time series benchmark for finance
http://cs.nyu.edu/cs/faculty/shasha/fintime.html
Appendix: Informal Review of Statistical Concepts

Recall that the goal of probability theory is to determine the likelihood
of a given event given a probability distribution (e.g. how likely
is it to get 5,300 heads in 10,000 flips of a fair coin?).
The goal of statistics is to determine a probability distribution
given a series of observations or at least
to disprove a null hypothesis (e.g. is a fair coin a reasonable
model if I get 8,000 heads in 10,000 flips?).

In parametric statistics, one knows the form of the target probability
distribution but not the value of certain parameters, e.g.
coin flips are binomial but the probability of a head may be unknown.
In nonparametric statistics, one does not know the form
of the target probability distribution.
In finance, most models are parametric (autoregression, option pricing).
When models aren't, people use queries and eyeballs to figure
out what to do.

Stationary process : one whose statistics (mean and variance)
do not vary with time.
Stationarity is a fundamental assumption of pairs trading
and options pricing.

Correlation: a measure of the association between two series,
e.g. the option open interest and the price of a security
5 days later.
If cov(x,y) represents the covariance between x and y
and sigma(x) is the standard deviation of x, then
correlation(x,y) = cov(x,y)/(sigma(x)*sigma(y))
so is entirely symmetric and lies always between 1 and 1.

Partial correlation :
suppose you are looking at the one day returns of Merck and Pfizer
(two drug companies). You can look at them as raw data or you
can subtract out the market influence via a least squares estimate
and use the correlation of the residuals.

Volatility : a measure of the standard deviation
of the value of a variable over a specific time,
e.g. the annualized standard deviation of the returns.
The return at time t is ln(p(t)/p(t1)).
This is a critical parameter in options pricing, because it determines
the probability that a price will exceed a certain price range.

Alpha, Beta, and Regression: suppose we estimate the relationship
between the percentage change in price of some stock S vs.
the percentage change in some market index M using a best fit
(least squares)
linear relationship:
s = a + bm
Then the parameter alpha (a) is the change in S independent of M
and beta (b) is the slope of the best fit line.
A riskless investment has a positive alpha and a zero beta,
but most investments have a zero alpha and a positive beta.
If beta is greater than 1, then for a given change in the market,
you can expect a greater change in S.
If beta is negative, then S moves in the opposite direction
from the market.
Note that beta is different from correlation (and can
be arbitrarily large or small) because it is not symmetric:
beta = cov(S,M)/(sigma(M)*sigma(M))

ANOVA: analysis of variance in cases when there is no missing data.
This is used to model situations in which several factors can play
a role and one wants to tease out a probabilistic model that describes
their interaction.
For example, product, location and customer income may be factors
that influence buying behavior.
ANOVA helps to figure out how to weight each one.
More significant variants of this include
principal components analysis
and factor analysis .
In finance, one might use one of these to figure out what determines
the price movement of a stock (perhaps half general market movement,
one third interest rates, etc.).
In psychology, one can ask a person 100 questions and then
categorize the person according to a weighted sum of a few questions.

Autoregression: a statistical model which predicts future
values from one or more previous ones.
This generalizes trend forecasting as used to predict sales.
Financial traders use this sparingly since models that look
at the recent past often just follow a short term trend.
As one trader put it:
``they
follow a trend and are always a day late and many dollars short.''
In general, regression of y on x is a determination
of how y depends on x.

Maximum likelihood method: suppose you are given a training set
consisting of observations and the categories to which the observations
belong.
The maximum likelihood method selects the probability distribution
that best explains the training set.
For example, if you toss a coin 10,000 times and observe
that heads comes up 8,000, you assign a probability to the heads
that maximizes the probability of this event.
This will make the probability of heads be greater than 1/2.
In finance, the maximum likelihood method is often used for forecasting
based on previously seen patterns.

Regularization A technique for smoothing a function
to make it have nice mathematical properties such as differentiability.
Moving averages are an example of regularization.

Bootstrapping (i) Divide the training set (set
of (observation, category) pairs) into
pieces. (ii) Infer the model from some pieces.
(iii) Test it
on the other pieces.
Acknowledgments
Lory Molesky,
Judy Smith, David Rothman,
and Rick Snodgrass
made several suggestions that have contributed
to this presentation.
All errors are mine.
This work was partly supported by grant 9531554
of the United States National Science Foundation.
This support is greatly appreciated.
Thank you for your attention.