Time Series in Finance: the array database approach
Prof. Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Outline
-
What are time series as
used in business and finance?
-
What do typical systems (FAME, S-PLUS, SAS, KSQL)
do to support them?
I include challenge queries for you to try against your
favorite SQL or alternative database management system.
-
What are the relevant strengths of each system?
-
Is there an ideal time-series language?
-
FinTime, a time series benchmark
http://cs.nyu.edu/cs/faculty/shasha/fintime.html
-
Which research in temporal data mining might help
finance?
-
Time series bibliography.
-
Brief glossary of statistical concepts.
Scenario
-
Group discovers the desirability of ``pairs trading.''
-
The goal is to identify pairs (or in general groups) of stocks
whose prices track one another after factoring in dividends.
One can make money (lots was made in the 1980s), because, for example,
if you know that the two banks Chase and Citibank track
one another (their difference is a stationary process) and Chase goes up
but Citibank doesn't, then buy Citibank and sell Chase.
Unless there is a good external reason for the difference, that is.
(This is simplistic: one needs a linear
combination of the two price series (once the market factor is accounted
for (i.e. removed) and dividends are included) to be stationary.
But this is the idea.)
-
Typical challenge queries from such an application:
-
Correlate the price histories of two stocks
or in general among many stocks and options.
(For most traders, returns are more interesting than prices, because
they have better statistics: a stock that trends up
over the years has a non-stationary mean, but perhaps
a stationary return.
So, one performs correlations over ``returns.''
The return at time t is ln(price(t)/price(t-1)).)
-
Perform the correlation over certain time intervals
to evaluate the stationarity.
-
The correlation might be weighted: recent history
counts more than distant history (see the sketch below).
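A minimal sketch of these two computations in Python (not one of the systems discussed in this talk; numpy is assumed, and the price histories and the halflife parameter are made up):
import numpy as np
def log_returns(prices):
    # return at time t is ln(price(t)/price(t-1))
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])
def weighted_correlation(x, y, halflife=20):
    # correlation in which recent observations count more than distant ones;
    # weights decay exponentially going back in time (most recent weight = 1)
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 0.5 ** (np.arange(len(x))[::-1] / halflife)
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    sx = np.sqrt(np.sum(w * (x - mx) ** 2))
    sy = np.sqrt(np.sum(w * (y - my) ** 2))
    return cov / (sx * sy)
# hypothetical price histories of two banks that roughly track one another
chase = [50.0, 50.5, 51.2, 50.8, 51.5, 52.0]
citi = [30.0, 30.2, 30.9, 30.5, 31.1, 31.4]
print(weighted_correlation(log_returns(chase), log_returns(citi)))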
So, What's the Database Problem?
-
The raw data comes in the form of ticks (stock, quantity, price)
and can be stored in a relational database without a problem.
-
The fundamental difficulty is that the relational model
does not take advantage of the order of rows.
Whereas one can perform an ``order by'' query and
manipulate the data in some other language, one cannot natively
manipulate the ordered data using select, from, and where.
Arguably, this is good for data independence, but it is bad
for time series.
-
Realizing this, the traders curse a lot and tell their programmers
to cobble something together.
The programmers do so and create a piece of software that is part
spreadsheet, part special purpose database, with lots of C++ code.
Employment goes up.
-
Note 1:
Joe Celko shows how to bend
SQL to the task of simulating order in his popular and excellent book
The SQL Puzzle Book, published by Morgan Kaufmann.
Usually, the bending results in a loss of efficiency.
It also works only for special cases.
-
Note 2:
Object-relational systems address this
issue by providing special data types
and user-defined functions.
My goal is to show the array database approach.
The two are converging, but the array people
have been at it longer and have some good ideas.
What Are Time Series
-
Time series = sequence of values usually recorded at regular
increasing intervals
(yearly, monthly, weekly, ... secondly).
Regularity is critical: without regularity, moving
averages, autocorrelations, and volatility would not make sense
(e.g. if I have a sequence of daily
price closings and then 1,000 values within one day, the moving
average covering the entire sequence doesn't make much sense).
Non-regular time series are also of interest (e.g. the history
of stock splits), but we can say less about them.
-
Time series also exhibit historicity: the past
is an indicator of the future.
That is why autoregression can be used to predict the future
of sales and why the past volatility may predict future volatility.
-
Notice that temporal logics, for example, use the fact that
the i+1st value in a sequence is at a later time than the ith value,
but assume neither regularity nor historicity.
Temporal query languages are equally agnostic about this question.
System Support for Time Series
-
We want to be able to create time series, manipulate them,
keep them in persistent storage, and display them in reports.
-
Time series have frequencies, but may not have values
for every time instance at the stated frequency,
e.g. business day has the frequency of a day but has no values on
holidays or weekends.
On the other hand, time frequencies with gaps can present problems
in international markets.
For example,
some Asian stock markets are open on Saturdays.
Different markets don't have the same holidays in general.
One solution is to store values without gaps everywhere
(i.e. every day).
-
Then the question becomes: How to fill the gaps?
The answer has to do with the kind of value stored.
-
The values associated with each time are of two general types
(we borrow this distinction from the FAME system):
-
level values stay the same
from one period to the next in the absence
of activity.
For example, inventory is a level value, because
inventory stays the same if you neither buy nor sell.
-
flow values are zero in the absence of activity.
For example,
expenses go to zero if you buy nothing.
-
This distinction turns out to be important when interpolating
missing values and for time scale conversion (see the sketch below).
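To make the distinction concrete, here is a minimal Python sketch (not FAME; the daily values are hypothetical): a level series is carried forward across a gap, while a flow series is filled with zeros.
def fill_gaps(values, kind):
    # values: list with None where the series has no observation
    # kind: 'level' carries the previous value forward; 'flow' fills with 0
    filled, previous = [], None
    for v in values:
        if v is None:
            filled.append(previous if kind == 'level' else 0)
        else:
            filled.append(v)
            previous = v
    return filled
inventory = [100, None, None, 90]   # level: nothing bought or sold on the gap days
expenses = [20, None, None, 35]     # flow: nothing spent on the gap days
print(fill_gaps(inventory, 'level'))   # [100, 100, 100, 90]
print(fill_gaps(expenses, 'flow'))     # [20, 0, 0, 35]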
Operations on Time Series Data
-
A typical framework
is that of the FAME system, since it embodies an excellent
understanding of the special properties of time series.
FAME stands for Forecasting, Analysis and Modeling Environment
(FAME Information Systems, Ann Arbor, Michigan; www.fame.com).
-
Data Preparation (i.e. interpolating and time scale conversion) -- curve-fitting
-
Queries (e.g. moving averages and sums) -- aggregates over time.
-
Forecasting (e.g. statistical or data mining-based extrapolation)
-- regression, correlation, Fourier analysis, and pattern-finding.
Data Preparation
-
Sometimes it is necessary to relate time series that don't
have the same time frequencies, e.g. mine is days and yours is weeks.
-
Converting one to the other depends on the kind of value one has.
For example, if the daily time series denotes inventory level,
then converting from daily to weekly simply entails taking
the inventory level at the end of each week.
On the other hand, if the daily time series denotes revenues (a flow
type of value),
then one must sum them up to get weekly revenues
(see the sketch at the end of this section).
-
Time conversion can force interpolation too, especially when
graphing values.
Typically, systems use various spline techniques
such as a cubic spline to interpolate missing values.
-
Interpolation can be more involved than mere curve-fitting, however,
as in the Black-Derman-Toy interpolation of the yield curve.
So, users should be able to add in their own interpolation functions.
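Here is a rough Python sketch of both preparation steps (illustrative only, not how FAME does it; pandas and scipy are assumed to be available, and the 28-day series are made up):
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline
days = pd.date_range('1998-01-01', periods=28, freq='D')
inventory = pd.Series(np.linspace(100.0, 73.0, 28), index=days)            # level series
revenue = pd.Series(50.0 + 5.0 * np.sin(np.arange(28) / 3.0), index=days)  # flow series
revenue.iloc[[5, 6, 12]] = np.nan   # pretend a few daily values are missing
# time scale conversion: level -> last value of each week, flow -> weekly sum
weekly_inventory = inventory.resample('W').last()
weekly_revenue = revenue.resample('W').sum()
# interpolation: fit a cubic spline through the observed points and fill the gaps
day_numbers = np.arange(28)
observed = revenue.notna().values
spline = CubicSpline(day_numbers[observed], revenue.values[observed])
revenue_filled = pd.Series(spline(day_numbers), index=days)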
Query Types -- try these on your Database
-
cumulative sum, e.g. year to date sales.
-
moving averages, e.g. 30 day average of stock prices.
-
nth best, e.g. 5th best sales region;
median --- the one in the middle;
rank --- associate an ordinal to each value based on its sort order
(a sketch of several of these query types appears at the end of this list).
-
discretize --- e.g. rank the revenues by whether they
are in the top third, the middle third, or the bottom third.
This implies discovering the boundaries and then using them
in an update query.
-
year-to-year comparisons --- e.g. balance of trade of this year vs. last.
-
accounting functions --- e.g. average growth rate, amortization,
internal rate of return and so on.
-
statistical functions --- e.g. autocorrelation and correlation between
two series.
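A rough Python/numpy sketch of several of these query types, just to pin down the definitions (the sales vector is hypothetical, and a real database would execute these quite differently):
import numpy as np
sales = np.array([12.0, 7.0, 22.0, 15.0, 9.0, 30.0, 18.0])
cumulative = np.cumsum(sales)                                   # year-to-date style running sum
moving_avg = np.convolve(sales, np.ones(3) / 3, mode='valid')   # 3-period moving average
nth_best = np.sort(sales)[::-1][4]                              # 5th best value
median = np.median(sales)                                       # the one in the middle
rank = sales.argsort().argsort() + 1                            # ordinal rank (1 = smallest)
thirds = np.digitize(sales, np.quantile(sales, [1/3, 2/3]))     # bottom/middle/top third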
Forecasting
-
Before the 1920s, forecasting meant
drawing lines through clouds of data values.
Yule
invented the autoregressive technique in 1927,
so he
could predict the annual number of sunspots.
This was a linear model and the basic approach
was to assume a linear underlying process modified by noise.
That model is often used in marketing (e.g., what will
my sales of wheat be next month).
-
Autoregression uses a weighted sum of previous
values to predict future ones
(see the sketch at the end of this section).
There are also seasonal autoregressive models.
-
These and other models are incorporated
in time series products such as FAME, SAS and SPLUS.
-
In options finance, the basic approach is to assume that the price
of an equity is based on a random walk (Brownian motion)
around a basic slope.
The magnitude of the randomness is called the volatility.
In a result due to Norbert Wiener (he worked it out to shoot down
bombers over London), for this model, the standard deviation of the
difference between the initial price and the price
at a certain time t
rises as the square root of time t.
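To make the autoregression idea concrete, here is a small illustrative sketch that fits the weights by ordinary least squares on lagged values (numpy assumed; the order p = 2 and the monthly sales figures are made up, and products such as FAME, SAS, or S-Plus fit far richer models):
import numpy as np
def fit_ar(series, p=2):
    # fit x(t) ~ c + a1*x(t-1) + ... + ap*x(t-p) by least squares
    x = np.asarray(series, dtype=float)
    rows = [np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, len(x))]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), x[p:], rcond=None)
    return coeffs   # [c, a1, ..., ap]
def predict_next(series, coeffs):
    p = len(coeffs) - 1
    lags = np.asarray(series, dtype=float)[-p:][::-1]   # most recent value first
    return coeffs[0] + np.dot(coeffs[1:], lags)
monthly_sales = [100, 104, 103, 108, 112, 111, 115, 119, 118, 123]
coeffs = fit_ar(monthly_sales, p=2)
print(predict_next(monthly_sales, coeffs))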
Steps in a Typical FAME Session
-
Specify frequency.
Say monthly, starting at January 1, 1996
and ending at the current time.
-
Create
sales and expenses time series by importing
these from a file or typing them in.
Specify that these are flow type time series.
-
Create a new time series:
formula profit = sales - expenses.
-
Create a fourth time series with weekly frequency
on inventory.
Specify that inventory is a level type time series.
-
Convert the first three time series to a weekly
frequency (by dividing the monthly values by 4.2
or by constructing a cubic spline to make
the sales, expenses, and profits curve look smooth).
This interpolation depends on knowing that sales and expenses
are flow-type values.
-
Now, use autoregression to predict future time series values.
S-Plus
S-Plus, some details
-
A nice reference to the language on the web has been written by
Carlos Alzola and Frank Harrell at the University of Virginia.
It is entitled
An Introduction to S-Plus and the Hmisc
and Design Libraries.
(Hmisc is a set of add-on functions that are truly
miscellaneous.)
http://fharrell.biostat.virginia.edu/s/index.html
-
The S-Plus programming model is vector-oriented.
Here are some typical statements:
-
sum(age < mean(age)) gives the count of people
who are younger than the mean.
-
ageg <- ifelse(age < 16, 'Young', 'Old')
assigns to the vector ageg a sequence of Young/Old values,
thus discretizing the data.
-
match(x, table) finds the positions of elements
in a table that have the values of x,
thus performing a select while maintaining order.
-
So, one can create vectors, do statistics on them,
and perform selection-style operations on them.
-
The S-Plus implementation is also vector-oriented.
In their case, this creates a performance discontinuity.
S-Plus is slow for large databases,
i.e. for data that exceeds RAM size.
Below RAM size, it is very fast.
For this reason,
S-PLUS is often used with SAS, because SAS has some data management
capabilities as we'll see.
SAS
KDB
-
KDB is a database system
implemented on top of the K language environment (produced by Kx systems
www.kx.com), an array language.
Data structures (e.g. tables) can be interchanged between
the two and functions can be called in both directions.
A free trial version can be downloaded.
-
KDB supports an SQL dialect called KSQL.
KSQL is easy to learn (for anyone fluent in SQL) and carries
over the speed and functionality of K to large data manipulation.
KDB also supports most of standard SQL.
-
The basic data structure in KSQL is the
arrable (array-table) which is a table whose order can be exploited.
In this way, it is very similar to S-Plus.
-
Arrables are non-first-normal form objects: a field of a record
can be an array.
For example, an entire time series can be stored in a field.
-
Like most modern SQLs, KSQL allows the inclusion of
user-defined functions inside database statements.
Unlike other SQLs, KSQL allows functions to be defined over
arrays as well as scalars.
-
Like classical SQL, KSQL has aggregates, grouping,
selections and string matching, and so on.
-
KSQL adds many useful functions to SQL, permitting
economical expression and often better performance by exploiting order.
For example, finding the fifth highest value is a linear time
operation in KSQL but requires a self-join in SQL, which
is only sometimes linear time.
-
KDB can function as a high performance distributed server
with full recovery and distributed consistency.
(KDB guarantees consistency by using ordered atomic
broadcast and a replicated state machine
design rather than two phase commit.)
-
Some performance notes
-
KDB can do
40,000+ TPC/B transactions per second against a 100
megabyte database
on a Pentium 2.
-
Typical real-world application (at Lehman Brothers):
time series keyed by instrument id and date.
Data: 11 million rows, a couple of hundred
columns, 10 gigabytes; a typical query returns 7500 rows
with 15 columns.
Query: find all information having
to do with a given set of instruments at a given date.
This returns 7500 rows in less than a second.
KSQL Basics -- an extended example
-
Create tables (arrables, really, but we'll conform to standard
terminology, even though it's slightly inexact)
either through K or within .t scripts.
We will use .t scripts for now and will set up
a simple trade database and then go through it line
by line (you can copy out this file and run it, provided
you also copy out newstat.k which will be given later).
'trade.t'
n:1000 # number of trades
'generate'
stock:([name:('Alcatel','Alstom','AirFrance','FranceTelecom','Snecma','Supra')]
industry:('telecom','engineering','aviation','telecom','aviation','consulting'))
trade:([]
stock:n rand stock.name,
price:50 + .25 * n rand 200,
amount:100 * 10 + n rand 20,
date:date['1998-01-01'] + n rand 365)
trade:'date' asc trade
show'trade'
'last price (in time) by stock, month'
'we get a vector and then take the last'
t1:select last price by stock, date.month from trade
'trade volume by month and industry'
show select volume:sum amount * price
by date.month, stock.industry from trade
'5 day moving avg price by stock, month'
t3:select 5 avgs price
by stock, date.month
from trade
\l newstat.k
'5 day moving max price by stock, month'
t4:select movingmax[5,price] by stock, date.month from trade
-
Here is the newstat.k file.
/ newstat.k
/ moving max of last num elements
movingmax:{[num; vec]
minel: &/vec
vec: ((num-1) # minel), vec
size: #vec
indexes: (num-1) + !(size-(num-1))
findmax:{[n;v;i] |/v[(i-(n-1))+!n]}
:findmax[num;vec]'indexes
}
Commentary on Part of trade.t
-
The line
t1:select last price by stock, date.month from trade
performs a group by stock and year-month in the trade table.
Mathematically, the by clause partitions the
records based on distinct stock/year-month
values, like
a group by in SQL.
What is different is that
each partition is guaranteed to
be ordered in the same way as in the table (arrable),
in this case, by ascending order of date.
For each partition
corresponding to stock s and year-month x,
the select last price
part will return
s, x, and p where p is the price of the last record in the partition
corresponding to s and x.
Because the table is ordered by date, the last record will be
the one with the most recent date.
-
In the select statement
show select volume:sum amount * price
by date.month, stock.industry from trade
the date.month expression in the by
clause gives each distinct year-month value in the date column.
This is a big convenience in time series applications.
The by clause groups by these year-month values.
-
The lines
'5 day moving avg price by stock, month'
t3:select 5 avgs price by stock, date.month from trade
use the avgs function which, given a vector (price for
each stock in this case), returns a vector
of the same length, computing a moving average.
5 avgs computes the five day moving average
(an illustrative Python rendering appears after this commentary).
-
The line
\l newstat.k
loads one or more k functions from the file
newstat.k, in this case a single moving
maximum taking two arguments, a number and a vector.
-
The line
t4:select movingmax[5,price]
by stock, date.month from trade
uses the movingmax function defined in the K file
newstat.k to compute the 5 day moving
maximum of prices grouped by stock and year-month.
The movingmax function takes a scalar and a vector and that's
exactly what the by clause delivers.
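For readers who do not know K, here is a rough Python rendering of what the avgs and movingmax operations compute (illustrative only; the K code remains the reference, and the treatment of the first num-1 positions in moving_avg is my assumption about the edge behavior):
def moving_avg(num, vec):
    # num-period moving average; early entries average whatever values exist so far
    out, running = [], 0.0
    for i, v in enumerate(vec):
        running += v
        if i >= num:
            running -= vec[i - num]
        out.append(running / min(i + 1, num))
    return out
def moving_max(num, vec):
    # num-period moving maximum; mirrors newstat.k, which pads the front of the
    # vector with its minimum so the output has the same length as the input
    minel = min(vec)
    padded = [minel] * (num - 1) + list(vec)
    return [max(padded[i:i + num]) for i in range(len(vec))]
prices = [50.0, 52.0, 51.0, 55.0, 54.0, 53.0]
print(moving_avg(5, prices))
print(moving_max(5, prices))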
Two of our Challenge Queries using Vectors
-
Dot product of prices offset by 10 days for each stock
dotprod:{[x;y] +/(x*y)}
t4: select dotprod[(10 drop price), (-10 drop price)]
by stock from trade
-
The 10th highest price for each stock
t5: select last 10 first price
by stock from 'price' desc trade
-
You should be able to do the rest assuming that you can put in arbitrary
functions.
Is There an Ideal Time Series Language?
-
The ability to treat sequences as first class objects
on which one can do useful operations within
the database system.
FAME, S-Plus, SAS, and KSQL.
-
The ability to treat multiple sequences together
for correlations and other purposes.
FAME, S-Plus, SAS, and KSQL.
-
A basic collection of functions, e.g.
aggregates, moving aggregates, statistics, cross-correlations,
interpolation, etc.
FAME, S-Plus, SAS, and KSQL.
-
The ability to integrate user-defined functions into the query
engine.
My belief: user-defined functions are
necessary for time series. There is no analogue to relational completeness
that will satisfy all (or even most) time series applications.
Prove me wrong if you can.
KSQL and S-Plus; FAME and SAS to some extent.
-
The main helpful database amenities include
a rich relational vocabulary and
the ability to work efficiently with disk- as well as RAM-resident
data.
KSQL, FAME and SAS.
-
Treat time
specially, e.g. be able to group by date.month or date.year as in KSQL.
FAME, SAS, S-Plus with libraries, and KSQL.
-
Treat values appropriately, e.g. with the level and flow concepts of FAME.
FAME, natively.
The others require libraries.
-
Give support for bitemporality (discussed next) at least as an extension.
No system, natively.
-
Object-relational database vendors have begun to field
products with time-series extensions.
It will be interesting to see how well they measure up.
The Bitemporal Challenge
-
Even in finance, there is more to life than looking at flows
and levels.
Sometimes, one must do historical searches and understand the
state of the enterprise in the past or future.
-
Bitemporal = two times (Snodgrass, Ahn, Jensen, and others).
(A very nice review paper: Christian S. Jensen and Richard T. Snodgrass,
``Semantics of Time-Varying Information,''
Information Systems, 21(4): 311-352, 1996.)
The two times are valid time (when a fact holds)
and transaction time (when the fact was asserted).
-
Using this model, one can ask:
What was Rick's salary on April 1, 1998?
Suppose the answer was $1,500 per week, given
the information we know now (i.e. according to
the latest transaction).
But somehow we sent Rick a check for much more.
This gives rise to a new question:
What did we believe on March 28, 1998 (when payroll printed out Rick's
100,000 dollar weekly check) about Rick's salary valid on April 1, 1998?
Thus, we look at the transactions concerning Rick's salary
that preceded March 29, 1998 and see what was asserted
about Rick's salary as of April 1, 1998.
If the asserted value was 100,000 dollars per week, then
we know that the mistaken payment was due to a database error,
not a payroll processing error.
So, bitemporal databases are useful to track the cause of failures
in operations.
-
This example also
shows their general usefulness for human resources applications:
What is the job history of this candidate within the organization?
-
Or the more challenging:
How has the job history of this candidate been corrected?
-
For trading applications, we can use bitemporality to track the correlations
between predictions of earnings and price:
What did we believe as of January of this year that first
quarter earnings would be?
-
Unitemporal example for those with SQL skill:
Rick Snodgrass suggested the following example
to test your system's temporal acumen.
The salary of employees changes over time, usually in a positive direction.
Each salary for an employee therefore has a time when it is valid,
marked by starttime and endtime, the notion being that the salary
becomes valid at starttime and ceases being valid at endtime.
Your task is to compute the average salary over time as a set of rows
having the average salary as well as the start and end times.
Two consecutive intervals with the same average salary should be merged.
Send me mail if you want to see a KSQL version.
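Here is a rough sketch of the interval logic in Python (not the KSQL version mentioned above; the salary rows are hypothetical and end times are treated as exclusive). It sweeps the interval endpoints, averages the salaries valid over each elementary interval, and merges consecutive intervals with equal averages:
def average_salary_over_time(rows):
    # rows: list of (salary, starttime, endtime) with endtime exclusive
    points = sorted({t for _, s, e in rows for t in (s, e)})
    result = []
    for start, end in zip(points, points[1:]):
        valid = [sal for sal, s, e in rows if s <= start and end <= e]
        if not valid:
            continue
        avg = sum(valid) / len(valid)
        # merge with the previous interval if adjacent and equal in average
        if result and result[-1][0] == avg and result[-1][2] == start:
            result[-1] = (avg, result[-1][1], end)
        else:
            result.append((avg, start, end))
    return result
rows = [(1000, 1, 10), (2000, 5, 10), (3000, 10, 15)]
print(average_salary_over_time(rows))   # [(1000.0, 1, 5), (1500.0, 5, 10), (3000.0, 10, 15)]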
FinTime: a Financial Time Series Benchmark
The design of this benchmark is joint work
with Kaippallim Jacob of Morgan Stanley (kjacob@ms.com)
and its full description can be found at
http://cs.nyu.edu/cs/faculty/shasha/fintime.html.
Here we present a summary of that benchmark.
Goal
The FinTime benchmark tries to model the practical uses
of time-series databases in financial applications.
The
models suggested in FinTime reflect two frequently occurring cases in the
financial industry, namely, a historical market data system and a real-time price
tick database. These models are quite similar to two well-studied models in the
relational world, namely, decision support systems and OLTP systems. FinTime
also suggests and defines metrics that capture three useful dimensions of any
time-series system, namely, performance in a single-user mode, performance in a
multi-user mode and the price to performance ratio.
Models for a
Time-series Benchmark
Before deciding on a model, we have to
examine the different parameters that determine a model for a time-series system.
The most important parameters that influence a time-series database system
are:
- Periodicity of data (Regular/Irregular)
- Density of data (Dense/Sparse)
- Schedule of updates (Periodic/Continuous)
- Types of queries (Simple/Complex)
- Time interval between queries (Ad hoc/Batch)
- Number of concurrent users (Few/Many)
Combinations of these factors will give rise to 64
possible models, but for simplicity we can focus on the following commonly
occurring cases in the financial industry.
Model 1: Historical Market Information
Attribute | Specification |
Periodicity of Data | Periodic |
Density of Data | Dense |
Schedule of updates | Periodic updates (e.g. at the end of a business day) |
Complexity of queries | Complex (e.g. decision support queries) |
Nature of query arrival | Batch |
Concurrency of users | Low (few concurrent users) |
Model 2: Tick databases for financial instruments
Attribute | Specification |
Periodicity of Data | Non-periodic |
Density of Data | Sparse to moderately dense |
Schedule of updates | Continuous |
Complexity of queries | Simple |
Nature of query arrival | Ad hoc |
Concurrency of users | High (many concurrent users) |
Let us now discuss the characteristics of
the two models in some detail:
Model 1: Historical Market
Information
Historical Market data systems are closely
related to decision support systems of the traditional relational database
world. The various elements of this model are:
- Data Model:
In the case of historical time-series databases for market information,
the model consists of a few relational tables that typically contain
infrequently changing (static) information and a number of time-series tables.
The following tables describe the data in this
benchmark.
Base Information Table
Field Name | Type | Comments |
Id | Char(30) | Unique identifier for the financial instrument, e.g. IBM.N |
Ex | Char(3) | Exchange on which the instrument is traded, e.g. "N" for the New York Stock Exchange |
Descr | Varchar(256) | A brief description of the financial security, e.g. "International Business Machines, Armonk, NY" |
SIC | Char(10) | Standard Industry Code, e.g. "COMPUTERS" or "CHEMICAL" |
SPR | Char(4) | S&P rating for the company, e.g. "AAA" |
Cu | Char(5) | The base currency in which the security is traded, e.g. "USD" for US Dollars |
CreateDate | Date | Date this security came into existence |
Stock Splits (Irregular Time-series)
Field Name | Type | Comments |
Id | Char(30) | Identifier |
SplitDate | Date | Date the split is actually effective |
EntryDate | Date | Date the split is announced |
SplitFactor | Float | The split factor expressed as a decimal value, e.g. a 2 for 1 split is expressed as 0.5 (i.e. 1/2), a 4 for 3 as 0.75 (i.e. 3/4) |
Dividends (Irregular Time-series)
FieldName | Type | Comments |
Id | Char(30) | Identifier |
XdivDate | Date | Date on which the dividend is disbursed |
DivAmt | Float | Dividend amount in the base currency of the instrument |
AnnounceDate | Date | Date on which the dividend is announced |
Market Data (Regular Time-series)
FieldName | Type | Comments |
Id | Char(30) | Identifier |
Tradedate | Date | Trade date |
HighPrice | Float | High price for the day |
LowPrice | Float | Low price for the day |
ClosePrice | Float | Closing price for the day |
OpenPrice | Float | Open price for the day |
Volume | Long | Volume of shares traded |
-
Market information is typically provided as a set of input files by a market
data vendor at the end of each trading day. These files correspond to
the tables defined above. While the data for the Base
Information table, Split Adjustment Table and Dividend
Table is irregular (i.e. an external event triggers an entry into these tables),
the Market Information Table has an entry for every trading day. For the
purposes of this benchmark the implementation can select a scale factor,
a combination of the number of securities under consideration and the number of
events for those securities. We suggest 3 scale factors, namely, 50,000
securities, 100,000 securities,
and 1,000,000 securities, all for 4,000 days.
These roughly correspond to all equity
securities in the US, all equity securities in the G7 countries and all equity
securities in the world. Programs to generate the data can be found
on the benchmark pages.
- Query characteristics.
The queries model:
- Join of relational data and time-series information
- Access of long-depth time-series information for a few keys (deep history queries)
- Access of short-depth time-series for a large number of keys (cross-sectional queries)
- Sorting
- Grouping and aggregation
Following are the
benchmark queries (we don't require any special query language).
Many of them include a notion of ``specified set of securities''
or ``specified period''. These notions are defined
with respect to a simple random model.
1 | Get the closing price of a set of 10 stocks for a 10-year period and
group into weekly, monthly and yearly aggregates. For each aggregate period
determine the low, high and average closing price value. The output should be
sorted by id and trade date. (See the sketch after this list.) |
2 | Adjust all prices
and volumes (prices are multiplied by the split factor and volumes are divided
by the split factor) for a set of 1000 stocks to reflect the split events
during a specified 300 day period,
assuming that events occur before the first trade of
the split date. These are called split-adjusted prices and volumes.
|
3 | For each stock in a specified list of 1000 stocks, find the
differences between the daily high and daily low on the day of each split event
during a specified period. |
4 | Calculate the value
of the S&P500 and Russell 2000 index for a specified day using unadjusted
prices and the index composition of the 2 indexes (assumed to be given) on the
specified day |
5 | Find the 21-day and 5-day moving average price for a
specified list of 1000 stocks during a 6-month period. (Use split adjusted
prices) |
6 | (Based on the previous query)
Find the points (specific days) when the 5-day moving average intersects the
21-day moving average for these stocks. The output is to be sorted by id and
date. |
7 | Determine the value of $100,000 now if 1 year ago it was invested
equally in 10 specified stocks (i.e. allocation for each stock is $10,000). The
trading strategy is: When the 20-day moving average crosses over the
5-month moving average the complete allocation for that stock is invested and
when the 20-day moving average crosses below the 5-month moving average the
entire position is sold. The trades happen on the closing price of the trading
day. |
8 | Find the pair-wise coefficients of correlation in a set of 10
securities for a 2 year period. Sort the securities by the coefficient of
correlation, indicating the pair of securities corresponding to that row.
[Note: coefficient of correlation defined in appendix] |
9 | Determine the yearly dividends and annual yield
(dividends/average closing price) for the past 3 years for all the stocks in
the Russell 2000 index that did not split during that period. Use unadjusted
prices since there were no splits to adjust for. |
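As a rough illustration of what query 1 asks for, here is a pandas sketch (pandas is an assumption, not part of the benchmark; market_data is a hypothetical DataFrame with Id, Tradedate, and ClosePrice columns):
import pandas as pd
def query1(market_data, ids):
    # low, high, and average closing price per stock, grouped weekly, monthly,
    # and yearly, sorted by id and trade date
    df = market_data[market_data['Id'].isin(ids)].copy()
    df['Tradedate'] = pd.to_datetime(df['Tradedate'])
    out = {}
    for label, freq in [('weekly', 'W'), ('monthly', 'M'), ('yearly', 'Y')]:
        out[label] = (df.set_index('Tradedate')
                        .groupby('Id')['ClosePrice']
                        .resample(freq)
                        .agg(['min', 'max', 'mean'])
                        .sort_index())
    return out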
Since the benchmark model assumes that the
level of concurrency is small, the number of users should be 5. The benchmark
methodology should be that each of the five users will pick a query at random
from the set above without replacement
and submit it to the database. The five users will be
simultaneously active at any point in
time. Each user should do every query and then stop.
Model 2: Tick databases for
financial instruments
This second benchmark
models the
case where the database (consisting
of ticks) is expected to keep up with a very high rate of
updates while responding to several users issuing fairly simple queries.
Ticks are price
quotations or trade (transaction) prices and associated attributes for
individual securities that occur either on the floor of a stock exchange or in
an electronic trading system, such as the NASDAQ market system. Ticks include 2
basic kinds of data: (1) trades, which are transactions between buyers and sellers at
a fixed price and quantity, and (2) quotes, which are price quotations offered by buyers
and sellers. Quotes can have the ask quote, the bid quote, or both, along with
their associated attributes such as quantity offered.
Let us now define
the different elements of this benchmark. They are:
- Data model.
The data model for this benchmark is simple. It consists of the following 2
tables:
Base Information Table
Field Name | Type | Comments |
Id | Char(30) | Identifier for a security |
Ex | Char(3) | Exchange on which the instrument is traded, e.g. "N" for the New York Stock Exchange |
Descr | Varchar(256) | A short description of the security |
SIC | Char(10) | The standard industry code for the security |
Cu | Char(10) | Base currency for the security |
Trades/Quotes Table
FieldName | Type | Comments |
Id | Char(30) | Identifier |
SeqNo | Integer | A unique sequence number for each trade or quote |
TradeDate | Date | The trading date |
TimeStamp | Time | The time a particular trade was executed or quote was generated |
TradePrice | Float | The price at which the trade was executed |
TradeSize | Integer | The number of shares traded in a transaction |
AskPrice | Float | The price at which a seller would sell the security |
AskSize | Float | The size of the transaction offered at the specified ask price |
BidPrice | Float | The price at which a buyer would buy the security |
BidSize | Float | The size of the transaction offered at the specified bid price |
Type | Char | An indication of whether this is a quote or a trade |
- Data population,
Update frequency and volume.
Tick databases are generally populated by
adapters that receive data from real-time feeds. The frequency of updates is
constantly increasing but on the average we can assume that each security ticks
(each corresponding to a record in the database) about 100 times during an
8-hour trading day. Additionally, we can assume that the set of securities
being tracked is traded around the world and hence there is no quiescent period.
For the purposes of this benchmark we will assume that the system monitors
ticks on 1,000, 10,000 or 100,000 securities, where each represents a different
scale factor.
A very important consideration in tick
databases is the ability to quickly apply "cancel/correct".
Occasionally an incorrect quote or trade record is published. The vendor will
then send a correction record with the identifier of the security and its
sequence number. The record will either be corrected according to the newly
published values or simply deleted.
- Query characteristics. The
kinds of queries issued against an online tick database are usually simple and
pre-determined. Listed below are the set of queries to be used in the
benchmark.
1 | Get all ticks for a specified set of 100
securities for a specified three hour time period on a specified trade
date. |
2 | Determine the volume weighted
price of a security considering only the ticks in a specified
three hour interval. (A rough sketch of this query appears at the end of this section.) |
3 | Determine the top 10 percentage losers for the specified date on the
specified exchanges, sorted by percentage loss. The loss is calculated as a
percentage of the last trade price of the previous day. |
4 | Determine the top 10 most active stocks for a specified date, sorted
by cumulative trade volume, considering all trades. |
5 | Find the most active
stocks in the "COMPUTER" industry (use the SIC
code). |
6 | Find the 10 stocks with the
highest percentage spreads. Spread is the difference between the last ask-price
and the last bid-price. Percentage spread is calculated as a percentage of the
mid-point price (average of ask and bid price). |
- Operational parameters. Since
tick systems usually have a relatively large number of concurrent users, we
will assume that there are 50 concurrent users for the benchmark. These users
randomly pick queries from the list above and submit them for execution. Each
user will submit each query only once and run through the complete list of
queries.
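As a rough illustration of query 2 above, here is a Python sketch of the volume weighted price over a time window (the list-of-dicts representation of ticks is hypothetical and no real implementation would scan ticks this way; it only pins down the arithmetic):
def vwap(ticks, security_id, trade_date, start, end):
    # volume-weighted average trade price for one security over [start, end)
    # on one trade date; quote records (which carry no TradePrice) are skipped
    trades = [(t['TradePrice'], t['TradeSize'])
              for t in ticks
              if t['Id'] == security_id
              and t['TradeDate'] == trade_date
              and start <= t['TimeStamp'] < end
              and t.get('TradePrice') is not None]
    total = sum(size for _, size in trades)
    return None if total == 0 else sum(p * s for p, s in trades) / total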
Finding Patterns in Temporal Data
-
Financial traders have done data mining for many years.
One trader described his work to me as follows:
I think about an arbitrage trick (pairs trading
is such a trick). Program for a few months.
Try the trick and either it works or it doesn't.
If it doesn't, I try something new.
If it works, I enjoy it until the arbitrage disappears.
-
What does the research community have to offer to
such traders?
I present some research that I think might be most relevant.
I will be updating this as time goes on.
-
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors.
Advances in Knowledge Discovery and Data Mining.
AAAI Press/The MIT Press, 1996.
The article by Berndt and Clifford about
finding patterns in time series is particularly relevant to finance.
-
Temporal Databases -- Research and Practice
Editors: Opher Etzion, Sushil Jajodia, Sury Sripada. (Springer-Verlag,
1998).
There, you will find articles about finding unexpected patterns (e.g. fraud) and
multi-granularity data mining.
-
Christos Faloutsos
Searching Multimedia Databases by Content
Kluwer Academic Publishers.
This book shows how to do signal processing analysis on time series
to solve problems such as:
-
Discovering whether two time series have similar shapes:
the basic idea is to
store the first few Fourier coefficients of a time sequence in a
database and assert that two time sequences are similar if
their Fourier coefficients are close.
(Remarkably, this works well because the energy spectrum of
stock prices declines with
the power 2 as the frequency increases, so the first few coefficients
carry most of the energy.)
Joint work with Rakesh Agrawal and Arun Swami.
(A rough numeric sketch of this idea appears after this item.)
The efficiency of this technique has been improved by
Davood Rafiei and Alberto Mendelzon of the University of Toronto.
-
Subsequence matching (is this sequence
close to some subsequence of that sequence?).
Faloutsos
uses a technique called FastMap to make this efficient.
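A rough numpy sketch of the whole-sequence matching idea just described (the number of coefficients and the threshold are arbitrary here; the cited work gives the principled versions and the indexing machinery):
import numpy as np
def dft_signature(series, num_coeffs=4):
    # keep only the first few Fourier coefficients of a normalized series
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    return np.fft.fft(x)[:num_coeffs] / len(x)
def possibly_similar(a, b, num_coeffs=4, threshold=0.05):
    # cheap filter: two equal-length sequences are candidate matches if their
    # truncated signatures are close; candidates are then checked against the
    # full sequences, and the papers show how to avoid false dismissals
    d = np.linalg.norm(dft_signature(a, num_coeffs) - dft_signature(b, num_coeffs))
    return d <= threshold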
-
Other papers explore
the question of similarity search when time scaling and inversion
is possible:
-
R. Agrawal, K-I Lin, H.S. Sawhney and K. Shim.
``Fast similarity search in the presence of noise, scaling
and translation in time-series databases.''
Proc of the 21st VLDB Conference, 1995
-
D. Q. Goldin and P. C. Kanellakis.
``On similarity queries for time-series data:
constraint specification and implementation.''
1st International Conference on the Principles
and Practice of Constraint Programming. pp. 137-153.
Springer-Verlag, LNCS 976. September 1995.
-
Davood Rafiei and Alberto Mendelzon.
``Similarity-based queries for time series data''
ACM Sigmod, pp. 13-24. May 1997
-
B-K. Yi, H.V. Jagadish, and C. Faloutsos.
Efficient Retrieval of Similar Time
Sequences Under Time Warping.
Proc. of the Int. Conference on Data Engineering (ICDE), 1998.
-
Excellent work has also been done on data structures
by many researchers at Brown, Polytechnic, and the
University of Maryland, but that falls outside the data mining
purview.
-
As an alternative to seeing whether two sequences or subsequences match,
one might want to describe a desirable sequence (e.g. a head-and-shoulders
movement of stock prices) and see whether it is present.
Relevant papers about this include:
-
H.V. Jagadish, A. O. Mendelzon and T. Milo.
Similarity-based queries. PODS 1995.
-
R. Agrawal, G. Psaila, E. L. Wimmers and M. Zait.
Querying shapes of histories.
Proceedings of the 21st VLDB Conference.
pp. 502-514. 1995.
-
P. Seshadri, M. Livny and R. Ramakrishnan.
Sequence query processing.
ACM SIGMOD, pp. 430-441, 1994
Data model and query language for sequences in general,
with time series as a special case.
-
Arie Shoshani, Kyoji Kawagoe: Temporal Data Management. VLDB 1986: 79-88
One of the first papers in the literature.
-
Snodgrass, R.~T., editor, The TSQL2 Temporal Query Language , Kluwer
Academic Publishers, 1995, 674+xxiv pages. The TSQL2 Language Design
Committee consisted of Richard Snodgrass (chair),
Ilsoo Ahn, Gad Ariav, Don Batory, James
Clifford, Curtis E. Dyreson, Ramez Elmasri, Fabio Grandi, Christian
S. Jensen, Wolfgang Kaefer, Nick Kline, Krishna Kulkarni,
T. Y. Cliff Leung, Nikos Lorentzos, John F. Roddick, Arie Segev,
Michael D. Soo and Suryanarayana M. Sripada.
TSQL2 has time-varying aggregates, including moving window aggregates,
aggregates over different time granularities, and aggregates weighted over time.
-
Munir Cochinwala, John Bradley: A Multidatabase System for Tracking and
Retrieval of Financial Data. VLDB 1994: 714-721
A paper discussing the implementation of a tick capture
and query system --- for those brave enough to roll their own.
-
Raghu Ramakrishnan,
Donko Donjerkovic,
Arvind Ranganathan,
Kevin S. Beyer, and
Muralidhar Krishnaprasad:
SRQL: sorted relational query language
SSDBM 98
A paper discussing a model in which relations are tables
that can be ordered.
This allows one to do moving averages, find ten cheapest, preceding
fifteen, etc.
The strategy is to extend SQL with order and special operators.
-
Leonid Libkin and colleagues:
An optimizable array-oriented language based on comprehensions.
The basic primitives are tabulation (analogous to selection),
subscripting (retrieve elements from arrays), dimension
reduction (like count of an array), and interaction
between sets and arrays.
Optimizations are analogous to pushing selects into expressions
and techniques that reduce the complexity of expressions.
Books on Time Series for Computer Scientists
-
C. Chatfield,
The Analysis of Time Series: Theory and Practice
Chapman & Hall, fourth edition, 1984.
Good general introduction, especially for those completely
new to time series.
-
P.J. Brockwell and R.A. Davis,
Time Series: Theory and Methods,
Springer Series in Statistics (1986).
-
B.D. Ripley and W.N. Venables,
Modern Applied Statistics with S-Plus, Springer (1994)
Chapter 14 has a good discussion of time series.
http://www.stats.ox.ac.uk/~ripley/
has a lot of useful functions.
-
FinTime, a time series benchmark for finance
http://cs.nyu.edu/cs/faculty/shasha/fintime.html
Appendix: Informal Review of Statistical Concepts
-
Recall that the goal of probability theory is to determine the likelihood
of a given event given a probability distribution (e.g. how likely
is it to get 5,300 heads in 10,000 flips of a fair coin?).
The goal of statistics is to determine a probability distribution
given a series of observations or at least
to disprove a null hypothesis (e.g. is a fair coin a reasonable
model if I get 8,000 heads in 10,000 flips?).
-
In parametric statistics, one knows the form of the target probability
distribution but not the value of certain parameters, e.g.
coin flips are binomial but the probability of a head may be unknown.
In non-parametric statistics, one does not know the form
of the target probability distribution.
In finance, most models are parametric (autoregression, option pricing).
When models aren't, people use queries and eyeballs to figure
out what to do.
-
Stationary process : one whose statistics (mean and variance)
do not vary with time.
Stationarity is a fundamental assumption of pairs trading
and options pricing.
-
Correlation: a measure of the association between two series,
e.g. the option open interest and the price of a security
5 days later.
If cov(x,y) represents the covariance between x and y
and sigma(x) is the standard deviation of x, then
correlation(x,y) = cov(x,y)/(sigma(x)*sigma(y))
so it is entirely symmetric and always lies between -1 and 1.
(A small numeric sketch of this and the volatility and beta formulas below
appears at the end of this appendix.)
-
Partial correlation :
suppose you are looking at the one day returns of Merck and Pfizer
(two drug companies). You can look at them as raw data or you
can subtract out the market influence via a least squares estimate
and use the correlation of the residuals.
-
Volatility : a measure of the standard deviation
of the value of a variable over a specific time,
e.g. the annualized standard deviation of the returns.
The return at time t is ln(p(t)/p(t-1)).
This is a critical parameter in options pricing, because it determines
the probability that a price will exceed a certain price range.
-
Alpha, Beta, and Regression: suppose we estimate the relationship
between the percentage change in price of some stock S vs.
the percentage change in some market index M using a best fit
(least squares)
linear relationship:
s = a + bm
Then the parameter alpha (a) is the change in S independent of M
and beta (b) is the slope of the best fit line.
A riskless investment has a positive alpha and a zero beta,
but most investments have a zero alpha and a positive beta.
If beta is greater than 1, then for a given change in the market,
you can expect a greater change in S.
If beta is negative, then S moves in the opposite direction
from the market.
Note that beta is different from correlation (and can
be arbitrarily large or small) because it is not symmetric:
beta = cov(S,M)/(sigma(M)*sigma(M))
-
ANOVA: analysis of variance in cases when there is no missing data.
This is used to model situations in which several factors can play
a role and one wants to tease out a probabilistic model that describes
their interaction.
For example, product, location and customer income may be factors
that influence buying behavior.
ANOVA helps to figure out how to weight each one.
More significant variants of this include
principal components analysis
and factor analysis.
In finance, one might use one of these to figure out what determines
the price movement of a stock (perhaps half general market movement,
one third interest rates, etc.).
In psychology, one can ask a person 100 questions and then
categorize the person according to a weighted sum of a few questions.
-
Autoregression: a statistical model which predicts future
values from one or more previous ones.
This generalizes trend forecasting as used to predict sales.
Financial traders use this sparingly since models that look
at the recent past often just follow a short term trend.
As one trader put it:
``they
follow a trend and are always a day late and many dollars short.''
In general, regression of y on x is a determination
of how y depends on x.
-
Maximum likelihood method: suppose you are given a training set
consisting of observations and the categories to which the observations
belong.
The maximum likelihood method selects the probability distribution
that best explains the training set.
For example, if you toss a coin 10,000 times and observe
that heads comes up 8,000, you assign a probability to the heads
that maximizes the probability of this event.
This will make the probability of heads be greater than 1/2.
In finance, the maximum likelihood method is often used for forecasting
based on previously seen patterns.
-
Regularization: a technique for smoothing a function
to make it have nice mathematical properties such as differentiability.
Moving averages are an example of regularization.
-
Bootstrapping: (i) Divide the training set (set
of (observation, category) pairs) into
pieces. (ii) Infer the model from some pieces.
(iii) Test it
on the other pieces.
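For concreteness, here is a small numeric sketch of the correlation, volatility, and beta formulas above (numpy assumed; the return series are made up, and 252 is the usual count of trading days in a year):
import numpy as np
returns_stock = np.array([0.010, -0.004, 0.007, 0.002, -0.006, 0.011])
returns_market = np.array([0.008, -0.002, 0.005, 0.001, -0.004, 0.009])
def correlation(x, y):
    return np.cov(x, y, ddof=0)[0, 1] / (x.std() * y.std())
# annualized volatility of daily returns
volatility = returns_stock.std() * np.sqrt(252)
# beta and alpha from the least squares fit s = a + b*m
beta = np.cov(returns_stock, returns_market, ddof=0)[0, 1] / returns_market.var()
alpha = returns_stock.mean() - beta * returns_market.mean()
print(correlation(returns_stock, returns_market), volatility, beta, alpha)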
Acknowledgments
Lory Molesky,
Judy Smith, David Rothman,
and Rick Snodgrass
made several suggestions that have contributed
to this presentation.
All errors are mine.
Thank you for your attention.