Time Series in Finance: the array database approach

Prof. Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Outline

What are time series as used in business and finance?
What do typical systems (e.g. FAME) do to support them?
I include challenge queries for you to try against your favorite SQL or alternative database management system.
Fintime, a time series benchmark
http://cs.nyu.edu/cs/faculty/shasha/fintime.html
Which research in temporal data mining might help finance?
Time series bibliography.
Brief glossary of statistical concepts.

Scenario

Group discovers the desirability of ``pairs trading.''
The goal is to identify pairs (or in general groups) of stocks whose prices track one another after factoring in dividends.
One can make money this way (a lot was made in the 1980s). For example, if you know that the two banks Chase and Citibank track one another (their difference is a stationary process) and Chase goes up but Citibank doesn't, then buy Citibank and sell Chase, unless there is a good external reason for the difference. (This is simplistic: one needs a linear combination of the two price series, once the market factor is removed and dividends are included, to be stationary. But this is the idea.)
Typical challenge queries from such an application:
- Correlate the price histories of two stocks, or more generally of many stocks and options, perhaps with delays. (For most traders, returns are more interesting than prices because they have better statistical properties: a stock that trends up over the years has a nonstationary mean but perhaps a stationary return. So one performs correlations over ``returns.'' The return at time t is ln(price(t)/price(t-1)).)
- Perform the correlation over certain time intervals to evaluate the stationarity.
- The correlation might be weighted: recent history counts more than distant history.
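
The three challenge queries above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the exponential weighting scheme, and the decay parameter are mine, not part of any product discussed here.

```python
import math

def log_returns(prices):
    """Return ln(p[t] / p[t-1]) for each consecutive pair of prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def weighted_correlation(x, y, weights=None):
    """Pearson correlation of x and y; weights default to all ones."""
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(weights)
    mx = sum(w * a for w, a in zip(weights, x)) / total
    my = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mx) * (b - my) for w, a, b in zip(weights, x, y))
    vx = sum(w * (a - mx) ** 2 for w, a in zip(weights, x))
    vy = sum(w * (b - my) ** 2 for w, b in zip(weights, y))
    return cov / math.sqrt(vx * vy)

def lagged_correlation(x, y, lag):
    """Correlate x[t] with y[t + lag], for lag >= 0 and equal lengths."""
    n = len(x) - lag
    return weighted_correlation(x[:n], y[lag:lag + n])

def decay_weights(n, decay=0.9):
    """Assumed scheme: the newest observation gets weight 1, the one
    before it `decay`, the one before that decay**2, and so on."""
    return [decay ** (n - 1 - i) for i in range(n)]

# Hypothetical prices, oldest first.
chase = [100, 102, 101, 105, 107, 106]
citi = [50, 51, 50.5, 52.5, 53.5, 53]
r1, r2 = log_returns(chase), log_returns(citi)
print(weighted_correlation(r1, r2))
print(weighted_correlation(r1, r2, decay_weights(len(r1))))
```

To evaluate stationarity over certain time intervals, one would apply the same functions to slices of the return series and compare the results.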

So, What's the Database Problem?

The raw data comes in the form of ticks (stock, quantity, price) and can be stored in a relational database without a problem.
The fundamental difficulty is that the relational model does not take advantage of the order of rows. Whereas one can perform an ``order by'' query and manipulate the data in some other language, one cannot natively manipulate the ordered data using select, from, and where.
Arguably, this is good for data independence, but it is bad for time series.
Realizing this, the traders curse a lot and tell their programmers to cobble something together. The programmers do so and create a piece of software that is part spreadsheet, part special purpose database, with lots of C++ code.
Employment goes up.
Note 1: Joe Celko shows how to bend SQL to the task of simulating order in his popular and excellent book The SQL Puzzle Book, published by Morgan Kaufmann. Usually, the bending results in a loss of efficiency. It also works only for special cases.
Note 2: Object-relational systems address this issue by providing special data types and user-defined functions. My goal is to show the array database approach. The two are converging, but the array people have been at it longer and have some good ideas.

What Are Time Series

Time series = sequence of values usually recorded at regular increasing intervals
(yearly, monthly, weekly, ... secondly).
Time series also exhibit historicity: the past is an indicator of the future. That is why autoregression can be used to predict the future of sales and why the past volatility may predict future volatility.

Operations on Time Series Data

A typical framework is that of the FAME system, since it embodies an excellent understanding of the special properties of time series. FAME stands for Forecasting, Analysis and Modeling Environment.
FAME Information Systems, Ann Arbor, Michigan.
www.fame.com
Data Preparation (i.e. interpolating and time scale conversion) -- curve-fitting
Queries (e.g. moving averages and sums) -- aggregates over time.
Forecasting (e.g. statistical or data mining-based extrapolation) -- regression, correlation, Fourier analysis, and pattern-finding.
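
Aggregates over time such as moving sums and averages depend on row order. Here is a minimal Python sketch of the idea; the function names are illustrative and the one-pass running total is just one reasonable way to do it.

```python
from collections import deque

def moving_sums(values, window):
    """Sum of each run of `window` consecutive values, in one pass."""
    out, running, recent = [], 0.0, deque()
    for v in values:
        recent.append(v)
        running += v
        if len(recent) > window:
            running -= recent.popleft()   # drop the value leaving the window
        if len(recent) == window:
            out.append(running)
    return out

def moving_averages(values, window):
    """Average of each run of `window` consecutive values."""
    return [s / window for s in moving_sums(values, window)]

print(moving_averages([1, 2, 3, 4, 5], 3))
```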

Data Preparation

Sometimes it is necessary to relate time series that don't have the same time frequencies, e.g. mine is days and yours is weeks.
Converting one to the other depends on the kind of value one has.
For example, if the daily time series denotes inventory level, then converting from daily to weekly simply entails taking the inventory level at the end of each week.
On the other hand, if the daily time series denotes revenues (a flow type of value), then one must sum them up to get weekly revenues.
Time conversion can force interpolation too, especially when graphing values. Typically, systems use various spline techniques such as a cubic spline to interpolate missing values.
Interpolation can be more involved than mere curve-fitting, however, as in the Black-Derman-Toy interpolation of the yield curve. So users should be able to add their own interpolation functions.
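
The level/flow distinction and gap filling can be illustrated with a short Python sketch. The assumptions are mine: a five-day trading week, a trailing partial week treated as a short week, and plain linear interpolation standing in for the cubic spline a real system would offer.

```python
def to_weekly(daily, kind, days_per_week=5):
    """Convert a daily series to weekly.
    kind='level' keeps the last value of each week (e.g. inventory);
    kind='flow' sums the values in each week (e.g. revenue)."""
    weeks = [daily[i:i + days_per_week]
             for i in range(0, len(daily), days_per_week)]
    if kind == "level":
        return [w[-1] for w in weeks]
    if kind == "flow":
        return [sum(w) for w in weeks]
    raise ValueError("kind must be 'level' or 'flow'")

def linear_fill(values):
    """Fill interior None gaps by linear interpolation between the
    nearest known neighbours; leading/trailing gaps are left alone."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            out[i] = out[a] + (out[b] - out[a]) * (i - a) / (b - a)
    return out

daily = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(to_weekly(daily, "level"), to_weekly(daily, "flow"))
```

Passing the interpolation function as a parameter would be one way to let users plug in their own method, as the text suggests.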

Forecasting

Before the 1920s, forecasting meant drawing lines through clouds of data values. Yule invented the autoregressive technique in 1927, so he could predict the annual number of sunspots. This was a linear model and the basic approach was to assume a linear underlying process modified by noise. That model is often used in marketing (e.g., what will my sales of wheat be next month).
Autoregression uses a weighted sum of previous values to predict future ones. There are also seasonal autoregressive models.
These and other models are incorporated in time series products such as FAME, SAS and SPLUS.
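
As an illustration of the autoregressive idea, here is a minimal Python sketch of the simplest case, a first-order model fitted by least squares. The function names are mine, and products such as FAME, SAS, and SPLUS support far richer models (higher orders, seasonality) than this.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1] + noise."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mp, mc = sum(prev) / n, sum(curr) / n
    sxx = sum((p - mp) ** 2 for p in prev)
    sxy = sum((p - mp) * (q - mc) for p, q in zip(prev, curr))
    phi = sxy / sxx
    c = mc - phi * mp
    return c, phi

def forecast(series, steps):
    """Iterate the fitted model forward `steps` periods."""
    c, phi = fit_ar1(series)
    x, out = series[-1], []
    for _ in range(steps):
        x = c + phi * x
        out.append(x)
    return out
```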
In options finance, the basic approach is to assume that the price of an equity is based on a random walk (Brownian motion) around a basic slope. The magnitude of the randomness is called the volatility. In a result due to Norbert Wiener (he worked it out to shoot down bombers over London), for this model, the standard deviation of the difference between the initial price and the price at a certain time t rises as the square root of time t.
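
The square-root-of-time behaviour can be checked exactly on a discrete stand-in for Brownian motion. The sketch below assumes a walk with no drift that moves +1 or -1 with equal probability at each step; that simplification is mine, chosen so the answer can be computed without simulation.

```python
import math

def walk_std(t):
    """Exact standard deviation of the position after t steps of a
    walk that moves +1 or -1 with equal probability each step."""
    var = 0.0
    for k in range(t + 1):                 # k = number of up-moves
        prob = math.comb(t, k) / 2 ** t    # binomial probability of k ups
        var += prob * (2 * k - t) ** 2     # the mean position is 0
    return math.sqrt(var)

for t in (1, 4, 16, 64):
    print(t, walk_std(t), math.sqrt(t))
```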

Steps in a Typical FAME Session

Specify frequency. Say monthly, starting at January 1, 1996 and ending at the current time.
Create sales and expenses time series by importing these from a file or typing them in. Specify that these are flow type time series.
Create a new time series:
formula profit = sales - expenses.
Create a fourth time series with weekly frequency on inventory. Specify that inventory is a level type time series.
Convert the first three time series to a weekly frequency (by dividing the monthly values by about 4.3, the average number of weeks in a month, or by constructing a cubic spline to make the sales, expenses, and profit curves look smooth).
This interpolation depends on knowing that sales and expenses are flow-type values.
Now, use autoregression to predict future time series values.

KDB

KDB is a database system implemented on top of the K language environment (produced by Kx systems www.kx.com), an array language. Data structures (e.g. tables) can be interchanged between the two and functions can be called in both directions. A free trial version can be downloaded.
KDB supports an SQL dialect called KSQL. KSQL is easy to learn (for anyone fluent in SQL) and carries over the speed and functionality of K to large data manipulation. KDB also supports most of standard SQL.
The basic data structure in KSQL is the arrable (array-table) which is a table whose order can be exploited. In this way, it is very similar to S-Plus.
Arrables are non-first-normal form objects: a field of a record can be an array. For example, an entire time series can be stored in a field.
Like most modern SQLs, KSQL allows the inclusion of user-defined functions inside database statements. Unlike other SQLs, KSQL allows functions to be defined over arrays as well as scalars.
Like classical SQL, KSQL has aggregates, grouping, selections and string matching, and so on.
KSQL adds many useful functions to SQL, permitting economical expression and often better performance by exploiting order. For example, finding the fifth highest value is a linear time operation in KSQL but requires a self-join in SQL, which is only sometimes linear time.
KDB can function as a high performance distributed server with full recovery and distributed consistency. (KDB guarantees consistency by using ordered atomic broadcast and a replicated state machine design rather than two phase commit.)

Two of our Challenge Queries using Vectors

Dot product of prices offset by 10 days for each stock

\dotprod:{[x;y] +/(x*y)}

t4: select dotprod[(10 drop price), (-10 drop price)] 
 by stock from trade

The 10th highest price for each stock

t5: select last 10 first price 
 by stock from 'price' desc trade
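
For readers who do not know KSQL, here is a rough Python rendering of what the two queries compute. It is a sketch, not a translation of KSQL semantics: the trade table is assumed to be a list of (stock, price) rows already in time order, the function names are mine, and a stock with fewer than n rows yields None in this version.

```python
import heapq
from collections import defaultdict

def prices_by_stock(trades):
    """Group prices by stock, preserving the order of the rows."""
    groups = defaultdict(list)
    for stock, price in trades:
        groups[stock].append(price)
    return groups

def offset_dot_products(trades, offset=10):
    """For each stock, dot product of the price series with itself
    shifted by `offset` rows (first `offset` dropped vs. last `offset`
    dropped). Assumes offset > 0."""
    return {s: sum(a * b for a, b in zip(p[offset:], p[:-offset]))
            for s, p in prices_by_stock(trades).items()}

def nth_highest(trades, n=10):
    """For each stock, the n-th highest price, or None if the stock
    has fewer than n rows."""
    out = {}
    for s, p in prices_by_stock(trades).items():
        top = heapq.nlargest(n, p)
        out[s] = top[-1] if len(top) == n else None
    return out
```

Using heapq.nlargest avoids sorting the whole column, which is in the spirit of the order-aware operators described above.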

Very Large Databases in KDB

Basic procedure: Specify the number of partitions of large tables in the build script. Within each partition, all necessary joins should be possible. This means that dimension tables should be replicated or should be partitioned in a way that permits all joins to be done, e.g. denormalized.
Layout: partitions should be on different disks.
Varchars should become integers if they come from an enumerated set, e.g. days of the week. Description fields can be left as is.
For access: the number of (process) slaves, -S, should divide the number of partitions and should be a multiple of the number of CPUs (e.g. 1 or 2 per CPU). The goal is for each slave to have the same amount of work and each CPU to have the same number of slaves. Two slaves per CPU is good for overlapping computation with reads.
To map on demand and unmap, use -s. Won't be necessary with a 64 bit version.
Linguistic restrictions: no use of order, e.g. avgs.
Example:
k db census.t to load
k db census to play
k db census -P 80 -S 2
Browser interface
Rule of thumb speeds for the data warehousing benchmark TPC-H (unaudited):
90 seconds per GB per 500 MHz Pentium.
tpch100 with 20 CPUs is 450 seconds.
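
The two rules of thumb in this section are simple arithmetic, sketched below in Python. The function names are hypothetical, the estimate assumes the work divides evenly across CPUs of the quoted speed, and I read tpch100 as the 100 GB scale factor.

```python
def estimated_seconds(gigabytes, cpus, seconds_per_gb_per_cpu=90):
    """Rule-of-thumb elapsed time, assuming work divides evenly."""
    return seconds_per_gb_per_cpu * gigabytes / cpus

def slave_count_ok(slaves, partitions, cpus):
    """Check the stated rule: the slave count divides the number of
    partitions and is a multiple of the number of CPUs."""
    return partitions % slaves == 0 and slaves % cpus == 0

print(estimated_seconds(100, 20))   # 100 GB on 20 CPUs
```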

References for Time Series

Financial traders have done data mining for many years. One trader described his work to me as follows: I think about an arbitrage trick (pairs trading is such a trick). Program for a few months. Try the trick and either it works or it doesn't. If it doesn't, I try something new. If it works, I enjoy it until the arbitrage disappears.
What does the research community have to offer to such traders?
I present some research that I think might be most relevant. I will be updating this as time goes on.
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors Advances in Knowledge Discovery and Data mining AAAI Press/ The MIT Press, 1996. The article by Berndt and Clifford about finding patterns in time series is particularly relevant to finance.
Temporal Databases -- Research and Practice Editors: Opher Etzion, Sushil Jajodia, Sury Sripada. (Springer-Verlag, 1998). There, you will find articles about finding unexpected patterns (e.g. fraud) and multi-granularity data mining.
Christos Faloutsos Searching Multimedia Databases by Content Kluwer Academic Publishers.
This book shows how to do signal processing analysis on time series to solve problems such as:
- Discovering whether two time series have similar shapes: the basic idea is to store the first few Fourier coefficients of a time sequence in a database and assert that two time sequences are similar if their Fourier coefficients are close. (Remarkably, this works well because the energy spectrum of stock prices falls off as the inverse square of the frequency, so the first few coefficients carry most of the energy.) Joint work with Rakesh Agrawal and Arun Swami.
  The efficiency of this technique has been improved by Davood Rafiei and Alberto Mendelzon of the University of Toronto.
- Subsequence matching (is this sequence close to some subsequence of that sequence?). Faloutsos uses a special data structure called Fastmap to make this performant.
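
The Fourier-coefficient idea can be sketched in a few lines of Python. This is an illustration only: the function names and the choice of three coefficients are mine, the transform is a naive O(n*k) computation rather than an FFT, and no index structure is shown.

```python
import cmath
import math

def dft_prefix(series, k):
    """First k coefficients of the normalized discrete Fourier
    transform (the 1/sqrt(n) scaling preserves Euclidean distance)."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * math.pi * f * t / n)
                for t, x in enumerate(series)) / math.sqrt(n)
            for f in range(k)]

def prefix_distance(a, b, k=3):
    """Euclidean distance between the first k coefficients."""
    fa, fb = dft_prefix(a, k), dft_prefix(b, k)
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(fa, fb)))

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
```

Because the normalized transform preserves distance, the distance over a prefix of coefficients never exceeds the true distance. A filter of the form prefix_distance <= threshold can therefore return false alarms, which are removed by checking the full series, but never misses a true match.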
Other papers explore the question of similarity search when time scaling and inversion is possible:
- R. Agrawal, K-I Lin, H.S. Sawhney and K. Shim. ``Fast similarity search in the presence of noise, scaling and translation in time-series databases.'' Proc of the 21st VLDB Conference, 1995
- D. Q. Goldin and P. C. Kanellakis. ``On similarity queries for time-series data: constraint specification and implementation.'' 1st International Conference on the Principles and Practice of Constraint Programming. pp. 137-153. Springer-Verlag, LNCS 976. September 1995.
- Davood Rafiei and Alberto Mendelzon. ``Similarity-based queries for time series data'' ACM Sigmod, pp. 13-24. May 1997
- Yi, Efficient Retrieval of Similar Time Sequences Under Time Warping. Data Engineering, 1998.
- Excellent work has also been done on data structures by many researchers at Brown, Polytechnic, and the University of Maryland, but that falls outside the data mining purview.
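
To make the notion of time warping concrete, here is the textbook dynamic-programming distance in Python. It is not the retrieval method of any paper cited above (those are about making such searches fast); it only shows what is being computed, with absolute difference assumed as the local cost.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance
    using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```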
As an alternative to seeing whether two sequences or subsequences match, one might want to describe a desirable sequence (e.g. a head-and-shoulders movement of stock prices) and see whether it is present. Relevant papers about this include:
- H.V. Jagadish, A. O. Mendelzon and T. Milo. Similarity-based queries. PODS 1995.
- R. Agrawal, G. Psaila, E. L. Wimmers and M. Zait. Querying shapes of histories. Proceedings of the 21st VLDB Conference. pp. 502-514. 1995.
- P. Seshadri, M. Livny and R. Ramakrishnan. Sequence query processing. ACM SIGMOD, pp. 430-441, 1994
  Data model and query language for sequences in general, with time series as a special case.
- Arie Shoshani, Kyoji Kawagoe: Temporal Data Management. VLDB 1986: 79-88
  One of the first papers in the literature.
- Snodgrass, R. T., editor, The TSQL2 Temporal Query Language, Kluwer Academic Publishers, 1995, 674+xxiv pages. The TSQL2 Language Design Committee consisted of Richard Snodgrass (chair), Ilsoo Ahn, Gad Ariav, Don Batory, James Clifford, Curtis E. Dyreson, Ramez Elmasri, Fabio Grandi, Christian S. Jensen, Wolfgang Kaefer, Nick Kline, Krishna Kulkarni, T. Y. Cliff Leung, Nikos Lorentzos, John F. Roddick, Arie Segev, Michael D. Soo and Suryanarayana M. Sripada.
  TSQL2 has time-varying aggregates, including moving window aggregates, aggregates over different time granularities, and weighted over time.
- Munir Cochinwala, John Bradley: A Multidatabase System for Tracking and Retrieval of Financial Data. VLDB 1994: 714-721
  A paper discussing the implementation of a tick capture and query system --- for those brave enough to roll their own.
- Raghu Ramakrishnan, Donko Donjerkovic, Arvind Ranganathan, Kevin S. Beyer, and Muralidhar Krishnaprasad: SRQL: Sorted Relational Query Language. SSDBM 1998.
  A paper discussing a model in which relations are tables that can be ordered. This allows one to do moving averages, find ten cheapest, preceding fifteen, etc. The strategy is to extend SQL with order and special operators.
- Leonid Libkin and colleagues: An optimizable array-oriented language based on comprehensions. The basic primitives are tabulation (analogous to selection), subscripting (remove elements from arrays), dimension reduction (like count of an array), and interaction between sets and arrays.
  Optimizations are analogous to pushing selects into expressions and techniques that reduce the complexity of expressions.

Books on Time Series for Computer Scientists

C. Chatfield, The Analysis of Time Series: Theory and Practice Chapman & Hall fourth edition 1984. Good general introduction, especially for those completely new to time series.
P.J. Brockwell and R.A. Davis, Time Series: Theory and Methods, Springer Series in Statistics (1986).
W.N. Venables and B.D. Ripley, Modern Applied Statistics with S-Plus, Springer (1994). Chapter 14 has a good discussion of time series.
http://www.stats.ox.ac.uk/~ripley/ has a lot of useful functions.
FinTime, a time series benchmark for finance
http://cs.nyu.edu/cs/faculty/shasha/fintime.html

Appendix: Informal Review of Statistical Concepts

Recall that the goal of probability theory is to determine the likelihood of a given event given a probability distribution (e.g. how likely is it to get 5,300 heads in 10,000 flips of a fair coin?).
The goal of statistics is to determine a probability distribution given a series of observations or at least to disprove a null hypothesis (e.g. is a fair coin a reasonable model if I get 8,000 heads in 10,000 flips?).
In parametric statistics, one knows the form of the target probability distribution but not the value of certain parameters, e.g. coin flips are binomial but the probability of a head may be unknown.
In non-parametric statistics, one does not know the form of the target probability distribution.
In finance, most models are parametric (autoregression, option pricing). When models aren't, people use queries and eyeballs to figure out what to do.
Stationary process: one whose statistics (mean and variance) do not vary with time. Stationarity is a fundamental assumption of pairs trading and options pricing.
Correlation: a measure of the association between two series, e.g. the option open interest and the price of a security 5 days later. If cov(x,y) represents the covariance between x and y and sigma(x) is the standard deviation of x, then
correlation(x,y) = cov(x,y)/(sigma(x)*sigma(y))
so it is entirely symmetric and always lies between -1 and 1.
Partial correlation: suppose you are looking at the one day returns of Merck and Pfizer (two drug companies). You can look at them as raw data or you can subtract out the market influence via a least squares estimate and use the correlation of the residuals.
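
Here is a minimal Python sketch of that residual-based calculation. The function names are mine, and the market influence is assumed to be captured by a single least-squares line per stock.

```python
import math

def residuals(y, x):
    """Residuals of the least-squares line y = a + b*x."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(x, y))
         / sum((u - mx) ** 2 for u in x))
    a = my - b * mx
    return [v - (a + b * u) for u, v in zip(x, y)]

def correlation(x, y):
    """Pearson correlation: cov(x,y) / (sigma(x) * sigma(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    return cov / math.sqrt(sum((u - mx) ** 2 for u in x)
                           * sum((v - my) ** 2 for v in y))

def partial_correlation(r1, r2, market):
    """Correlation of r1 and r2 after removing the market's
    least-squares contribution from each."""
    return correlation(residuals(r1, market), residuals(r2, market))
```

Two stocks can look highly correlated in the raw data simply because both follow the market; the partial correlation shows what is left once that common factor is removed.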
Volatility: a measure of the standard deviation of the value of a variable over a specific time, e.g. the annualized standard deviation of the returns. The return at time t is ln(p(t)/p(t-1)). This is a critical parameter in options pricing, because it determines the probability that a price will exceed a certain price range.
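
A short Python sketch of the annualized figure follows. The assumptions are mine: prices are daily closes, there are 252 trading days in a year, and annualizing scales the per-period standard deviation by the square root of the number of periods.

```python
import math

def annualized_volatility(prices, periods_per_year=252):
    """Sample standard deviation of log returns, scaled by the
    square root of the number of periods in a year."""
    rets = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var * periods_per_year)

print(annualized_volatility([100, 101, 99, 102, 103]))
```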
Alpha, Beta, and Regression: suppose we estimate the relationship between the percentage change in price of some stock S vs. the percentage change in some market index M using a best fit (least squares) linear relationship:
s = a + bm
Then the parameter alpha (a) is the change in S independent of M and beta (b) is the slope of the best fit line. A riskless investment has a positive alpha and a zero beta, but most investments have a zero alpha and a positive beta. If beta is greater than 1, then for a given change in the market, you can expect a greater change in S. If beta is negative, then S moves in the opposite direction from the market. Note that beta is different from correlation (and can be arbitrarily large or small) because it is not symmetric:
beta = cov(S,M)/(sigma(M)*sigma(M))
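
The least-squares fit behind alpha and beta is small enough to write out. This Python sketch assumes s and m are equal-length lists of percentage changes for the stock and the index; the function name is mine.

```python
def alpha_beta(s, m):
    """Least-squares fit of s = a + b*m; returns (alpha, beta)."""
    n = len(s)
    ms, mm = sum(s) / n, sum(m) / n
    cov = sum((x - ms) * (y - mm) for x, y in zip(s, m)) / n
    var_m = sum((y - mm) ** 2 for y in m) / n
    beta = cov / var_m          # cov(S,M) / (sigma(M) * sigma(M))
    alpha = ms - beta * mm      # the fitted line passes through the means
    return alpha, beta
```

Swapping the roles of s and m changes the denominator, which is why beta, unlike correlation, is not symmetric.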
ANOVA: analysis of variance in cases when there is no missing data. This is used to model situations in which several factors can play a role and one wants to tease out a probabilistic model that describes their interaction. For example, product, location and customer income may be factors that influence buying behavior. ANOVA helps to figure out how to weight each one. More significant variants of this include principal components analysis and factor analysis. In finance, one might use one of these to figure out what determines the price movement of a stock (perhaps half general market movement, one third interest rates, etc.). In psychology, one can ask a person 100 questions and then categorize the person according to a weighted sum of a few questions.
Autoregression: a statistical model which predicts future values from one or more previous ones. This generalizes trend forecasting as used to predict sales. Financial traders use this sparingly since models that look at the recent past often just follow a short term trend. As one trader put it: ``they follow a trend and are always a day late and many dollars short.'' In general, regression of y on x is a determination of how y depends on x.
Maximum likelihood method: suppose you are given a training set consisting of observations and the categories to which the observations belong. The maximum likelihood method selects the probability distribution that best explains the training set. For example, if you toss a coin 10,000 times and observe that heads comes up 8,000 times, you assign to heads the probability that maximizes the probability of this event. Here that estimate is 8,000/10,000 = 0.8, well above 1/2. In finance, the maximum likelihood method is often used for forecasting based on previously seen patterns.
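
For the coin example, the maximizing probability has a closed form (heads divided by flips), and a brute-force search over candidate probabilities agrees with it. The Python sketch below shows both; the function names and the grid of 999 candidates are my own choices.

```python
import math

def log_likelihood(p, heads, flips):
    """Binomial log-likelihood, omitting the constant
    ln C(flips, heads), which does not depend on p."""
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

def mle_heads(heads, flips):
    """Closed-form maximizer of the log-likelihood."""
    return heads / flips

# Grid search over p = 0.001, 0.002, ..., 0.999 for the example above.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: log_likelihood(p, 8000, 10000))
print(best, mle_heads(8000, 10000))
```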
Regularization: a technique for smoothing a function to make it have nice mathematical properties such as differentiability. Moving averages are an example of regularization.
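Bootstrapping: (i) Divide the training set (set of (observation, category) pairs) into pieces. (ii) Infer the model from some pieces. (iii) Test it on the other pieces. (Statisticians usually call this hold-out procedure cross-validation and reserve ``bootstrapping'' for resampling with replacement.)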
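
The three steps above can be sketched in Python as a generator of train/test splits. The function name and the interleaved way of cutting the pieces are assumptions for illustration; fitting and scoring the model are left to the caller.

```python
def k_fold_splits(pairs, k):
    """Yield (training, held_out) lists; each of the k pieces is
    held out exactly once."""
    folds = [pairs[i::k] for i in range(k)]   # piece i takes every k-th pair
    for i, held_out in enumerate(folds):
        training = [p for j, f in enumerate(folds) if j != i for p in f]
        yield training, held_out

for training, held_out in k_fold_splits(list(range(10)), 5):
    print(held_out, training)
```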

Thank you for your attention.