Time Series in Finance: the array database approach
Prof. Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Outline
-
What are time series as
used in business and finance?
-
What do typical systems (FAME, S-PLUS, SAS, KSQL)
do to support them?
I include challenge queries for you to try against your
favorite SQL or alternative database management system.
-
What are the relevant strengths of each system?
-
Is there an ideal time-series language?
-
FinTime, a time series benchmark
http://cs.nyu.edu/cs/faculty/shasha/fintime.html
-
Which research in temporal data mining might help
finance?
-
Time series bibliography.
-
Brief glossary of statistical concepts.
Scenario
-
Group discovers the desirability of ``pairs trading.''
-
The goal is to identify pairs (or in general groups) of stocks
whose prices track one another after factoring in dividends.
One can make money (lots was made in the 1980s), because, for example,
if you know that the two banks Chase and Citibank track
one another (their difference is a stationary process) and Chase goes up
but Citibank doesn't, then buy Citibank and sell Chase.
Unless there is a good external reason for the difference, that is.
(This is simplistic: one needs a linear
combination of the two price series (once the market factor is accounted
for (i.e. removed) and dividends are included) to be stationary.
But this is the idea.)
-
Typical challenge queries from such an application:
-
Correlate the price histories of two stocks
or in general among many stocks and options.
(For most traders, returns are more interesting than prices, because
they have better statistics: a stock that trends up
over the years has a non-stationary mean, but perhaps
a stationary return.
So, one performs correlations over ``returns.''
The return at time t is ln(price(t)/price(t-1)).)
-
Perform the correlation over certain time intervals
to evaluate the stationarity.
-
The correlation might be weighted: recent history
counts more than distant history (see the sketch below).
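A minimal sketch of these two computations in Python (not one of the systems discussed in this talk; numpy is assumed, and the price histories and the halflife parameter are made up):
import numpy as np
def log_returns(prices):
    # return at time t is ln(price(t)/price(t-1))
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])
def weighted_correlation(x, y, halflife=20):
    # correlation in which recent observations count more than distant ones;
    # weights decay exponentially going back in time (most recent weight = 1)
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 0.5 ** (np.arange(len(x))[::-1] / halflife)
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    sx = np.sqrt(np.sum(w * (x - mx) ** 2))
    sy = np.sqrt(np.sum(w * (y - my) ** 2))
    return cov / (sx * sy)
# hypothetical price histories of two banks that roughly track one another
chase = [50.0, 50.5, 51.2, 50.8, 51.5, 52.0]
citi = [30.0, 30.2, 30.9, 30.5, 31.1, 31.4]
print(weighted_correlation(log_returns(chase), log_returns(citi)))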
So, What's the Database Problem?
-
The raw data comes in the form of ticks (stock, quantity, price)
and can be stored in a relational database without a problem.
-
The fundamental difficulty is that the relational model
does not take advantage of the order of rows.
Whereas one can perform an ``order by'' query and
manipulate the data in some other language, one cannot natively
manipulate the ordered data using select, from, and where.
Arguably, this is good for data independence, but it is bad
for time series.
-
Realizing this, the traders curse a lot and tell their programmers
to cobble something together.
The programmers do so and create a piece of software that is part
spreadsheet, part special purpose database, with lots of C++ code.
Employment goes up.
-
Note 1:
Joe Celko shows how to bend
SQL to the task of simulating order in his popular and excellent book
The SQL Puzzle Book, published by Morgan Kaufmann.
Usually, the bending results in a loss of efficiency.
It also works only for special cases.
-
Note 2:
Object-relational systems address this
issue by providing special data types
and user-defined functions.
My goal is to show the array database approach.
The two are converging, but the array people
have been at it longer and have some good ideas.
What Are Time Series
-
Time series = sequence of values usually recorded at regular
increasing intervals
(yearly, monthly, weekly, ... secondly).
Regularity is critical: without regularity, moving
averages, autocorrelations, and volatility would not make sense
(e.g. if I have a sequence of daily
price closings and then 1,000 values within one day, the moving
average covering the entire sequence doesn't make much sense).
Non-regular time series are also of interest (e.g. the history
of stock splits), but we can say less about them.
-
Time series also exhibit historicity: the past
is an indicator of the future.
That is why autoregression can be used to predict the future
of sales and why the past volatility may predict future volatility.
-
Notice that temporal logics, for example, use the fact that
the i+1st value in a sequence is at a later time than the ith value,
but assume neither regularity nor historicity.
Temporal query languages are equally agnostic about this question.
System Support for Time Series
-
We want to be able to create time series, manipulate them,
keep them in persistent storage, and display them in reports.
-
Time series have frequencies, but may not have values
for every time instance at the stated frequency,
e.g. business day has the frequency of a day but has no values on
holidays or weekends.
On the other hand, time frequencies with gaps can present problems
in international markets.
For example,
some Asian stock markets are open on Saturdays.
Different markets don't have the same holidays in general.
One solution is to store values without gaps everywhere
(i.e. every day).
-
Then the question becomes: How to fill the gaps?
The answer has to do with the kind of value stored.
-
The values associated with each time are of two general types
(we borrow this distinction from the FAME system):
-
level values stay the same
from one period to the next in the absence
of activity.
For example, inventory is a level value, because
inventory stays the same if you neither buy nor sell.
-
flow values are zero in the absence of activity.
For example,
expenses go to zero if you buy nothing.
-
This distinction turns out to be important when interpolating
missing values and for time scale conversion (see the sketch below).
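To make the distinction concrete, here is a minimal Python sketch (not FAME; the daily values are hypothetical): a level series is carried forward across a gap, while a flow series is filled with zeros.
def fill_gaps(values, kind):
    # values: list with None where the series has no observation
    # kind: 'level' carries the previous value forward; 'flow' fills with 0
    filled, previous = [], None
    for v in values:
        if v is None:
            filled.append(previous if kind == 'level' else 0)
        else:
            filled.append(v)
            previous = v
    return filled
inventory = [100, None, None, 90]   # level: nothing bought or sold on the gap days
expenses = [20, None, None, 35]     # flow: nothing spent on the gap days
print(fill_gaps(inventory, 'level'))   # [100, 100, 100, 90]
print(fill_gaps(expenses, 'flow'))     # [20, 0, 0, 35]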
Operations on Time Series Data
-
A typical framework
is that of the FAME system, since it embodies an excellent
understanding of the special properties of time series.
FAME stands for Forecasting, Analysis and Modeling Environment
(FAME Information Systems, Ann Arbor, Michigan; www.fame.com).
-
Data Preparation (i.e. interpolating and time scale conversion) -- curve-fitting
-
Queries (e.g. moving averages and sums) -- aggregates over time.
-
Forecasting (e.g. statistical or data mining-based extrapolation)
-- regression, correlation, Fourier analysis, and pattern-finding.
Data Preparation
-
Sometimes it is necessary to relate time series that don't
have the same time frequencies, e.g. mine is days and yours is weeks.
-
Converting one to the other depends on the kind of value one has.
For example, if the daily time series denotes inventory level,
then converting from daily to weekly simply entails taking
the inventory level at the end of each week.
On the other hand, if the daily time series denotes revenues (a flow
type of value),
then one must sum them up to get weekly revenues
(see the sketch at the end of this section).
-
Time conversion can force interpolation too, especially when
graphing values.
Typically, systems use various spline techniques
such as a cubic spline to interpolate missing values.
-
Interpolation can be more involved than mere curve-fitting, however,
as in the Black-Derman-Toy interpolation of the yield curve.
So, users should be able to add in their own interpolation functions.
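Here is a rough Python sketch of both preparation steps (illustrative only, not how FAME does it; pandas and scipy are assumed to be available, and the 28-day series are made up):
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline
days = pd.date_range('1998-01-01', periods=28, freq='D')
inventory = pd.Series(np.linspace(100.0, 73.0, 28), index=days)            # level series
revenue = pd.Series(50.0 + 5.0 * np.sin(np.arange(28) / 3.0), index=days)  # flow series
revenue.iloc[[5, 6, 12]] = np.nan   # pretend a few daily values are missing
# time scale conversion: level -> last value of each week, flow -> weekly sum
weekly_inventory = inventory.resample('W').last()
weekly_revenue = revenue.resample('W').sum()
# interpolation: fit a cubic spline through the observed points and fill the gaps
day_numbers = np.arange(28)
observed = revenue.notna().values
spline = CubicSpline(day_numbers[observed], revenue.values[observed])
revenue_filled = pd.Series(spline(day_numbers), index=days)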
Query Types -- try these on your Database
-
cumulative sum, e.g. year to date sales.
-
moving averages, e.g. 30 day average of stock prices.
-
nth best, e.g. 5th best sales region;
median --- the one in the middle;
rank --- associate an ordinal to each value based on its sort order
(a sketch of several of these query types appears at the end of this list).
-
discretize --- e.g. rank the revenues by whether they
are in the top third, the middle third, or the bottom third.
This implies discovering the boundaries and then using them
in an update query.
-
year-to-year comparisons --- e.g. balance of trade of this year vs. last.
-
accounting functions --- e.g. average growth rate, amortization,
internal rate of return and so on.
-
statistical functions --- e.g. autocorrelation and correlation between
two series.
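A rough Python/numpy sketch of several of these query types, just to pin down the definitions (the sales vector is hypothetical, and a real database would execute these quite differently):
import numpy as np
sales = np.array([12.0, 7.0, 22.0, 15.0, 9.0, 30.0, 18.0])
cumulative = np.cumsum(sales)                                   # year-to-date style running sum
moving_avg = np.convolve(sales, np.ones(3) / 3, mode='valid')   # 3-period moving average
nth_best = np.sort(sales)[::-1][4]                              # 5th best value
median = np.median(sales)                                       # the one in the middle
rank = sales.argsort().argsort() + 1                            # ordinal rank (1 = smallest)
thirds = np.digitize(sales, np.quantile(sales, [1/3, 2/3]))     # bottom/middle/top third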
Forecasting
-
Before the 1920s, forecasting meant
drawing lines through clouds of data values.
Yule
invented the autoregressive technique in 1927,
so he
could predict the annual number of sunspots.
This was a linear model and the basic approach
was to assume a linear underlying process modified by noise.
That model is often used in marketing (e.g., what will
my sales of wheat be next month).
-
Autoregression uses a weighted sum of previous
values to predict future ones
(see the sketch at the end of this section).
There are also seasonal autoregressive models.
-
These and other models are incorporated
in time series products such as FAME, SAS and SPLUS.
-
In options finance, the basic approach is to assume that the price
of an equity is based on a random walk (Brownian motion)
around a basic slope.
The magnitude of the randomness is called the volatility.
In a result due to Norbert Wiener (he worked it out to shoot down
bombers over London), for this model, the standard deviation of the
difference between the initial price and the price
at a certain time t
rises as the square root of time t.
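To make the autoregression idea concrete, here is a small illustrative sketch that fits the weights by ordinary least squares on lagged values (numpy assumed; the order p = 2 and the monthly sales figures are made up, and products such as FAME, SAS, or S-Plus fit far richer models):
import numpy as np
def fit_ar(series, p=2):
    # fit x(t) ~ c + a1*x(t-1) + ... + ap*x(t-p) by least squares
    x = np.asarray(series, dtype=float)
    rows = [np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, len(x))]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), x[p:], rcond=None)
    return coeffs   # [c, a1, ..., ap]
def predict_next(series, coeffs):
    p = len(coeffs) - 1
    lags = np.asarray(series, dtype=float)[-p:][::-1]   # most recent value first
    return coeffs[0] + np.dot(coeffs[1:], lags)
monthly_sales = [100, 104, 103, 108, 112, 111, 115, 119, 118, 123]
coeffs = fit_ar(monthly_sales, p=2)
print(predict_next(monthly_sales, coeffs))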
Steps in a Typical FAME Session
-
Specify frequency.
Say monthly, starting at January 1, 1996
and ending at the current time.
-
Create
sales and expenses time series by importing
these from a file or typing them in.
Specify that these are flow type time series.
-
Create a new time series:
formula profit = sales - expenses.
-
Create a fourth time series with weekly frequency
on inventory.
Specify that inventory is a level type time series.
-
Convert the first three time series to a weekly
frequency (by dividing the monthly values by 4.2
or by constructing a cubic spline to make
the sales, expenses, and profits curve look smooth).
This interpolation depends on knowing that sales and expenses
are flow-type values.
-
Now, use autoregression to predict future time series values.
S-Plus
S-Plus, some details
-
A nice reference to the language on the web has been written by
Carlos Alzola and Frank Harrell at the University of Virginia.
It is entitled
An Introduction to S-Plus and the Hmisc
and Design Libraries.
(Hmisc is a set of add-on functions that are truly
miscellaneous.)
http://fharrell.biostat.virginia.edu/s/index.html
-
The S-Plus programming model is vector-oriented.
Here are some typical statements:
-
sum(age < mean(age)) gives the count of people
who are younger than the mean.
-
ageg <- ifelse(age < 16, 'Young', 'Old')
assigns to the vector ageg a sequence of Young/Old values,
thus discretizing the data.
-
match(x, table) finds the positions of elements
in a table that have the values of x,
thus performing a select while maintaining order.
-
So, one can create vectors, do statistics on them,
and perform selection-style operations on them.
-
The S-Plus implementation is also vector-oriented.
In their case, this creates a performance discontinuity.
S-Plus is slow for large databases,
i.e. for data that exceeds RAM size.
Below RAM size, it is very fast.
For this reason,
S-PLUS is often used with SAS, because SAS has some data management
capabilities as we'll see.
SAS
KDB
-
KDB is a database system
implemented on top of the K language environment (produced by Kx systems
www.kx.com), an array language.
Data structures (e.g. tables) can be interchanged between
the two and functions can be called in both directions.
A free trial version can be downloaded.
-
KDB supports an SQL dialect called KSQL.
KSQL is easy to learn (for anyone fluent in SQL) and carries
over the speed and functionality of K to large data manipulation.
KDB also supports most of standard SQL.
-
The basic data structure in KSQL is the
arrable (array-table) which is a table whose order can be exploited.
In this way, it is very similar to S-Plus.
-
Arrables are non-first-normal form objects: a field of a record
can be an array.
For example, an entire time series can be stored in a field.
-
Like most modern SQLs, KSQL allows the inclusion of
user-defined functions inside database statements.
Unlike other SQLs, KSQL allows functions to be defined over
arrays as well as scalars.
-
Like classical SQL, KSQL has aggregates, grouping,
selections and string matching, and so on.
-
KSQL adds many useful functions to SQL, permitting
economical expression and often better performance by exploiting order.
For example, finding the fifth highest value is a linear time
operation in KSQL but requires a self-join in SQL, which
is only sometimes linear time.
-
KDB can function as a high performance distributed server
with full recovery and distributed consistency.
(KDB guarantees consistency by using ordered atomic
broadcast and a replicated state machine
design rather than two phase commit.)
-
Some performance notes
-
KDB can do
40,000+ TPC/B transactions per second against a 100
megabyte database
on a Pentium 2.
-
Typical real-world application (at Lehman Brothers):
time series keyed by instrument id and date.
Data: 11 million rows, a couple of hundred
columns, 10 gigabytes; a typical query returns 7500 rows
with 15 columns.
Query: find all information having
to do with a given set of instruments at a given date.
This returns 7500 rows in less than a second.
KSQL Basics -- an extended example
-
Create tables (arrables, really, but we'll conform to standard
terminology, even though it's slightly inexact)
either through K or within .t scripts.
We will use .t scripts for now and will set up
a simple trade database and then go through it line
by line (you can copy out this file and run it, provided
you also copy out newstat.k which will be given later).
'trade.t'
n:1000 # number of trades
'generate'
stock:([name:('Alcatel','Alstom','AirFrance','FranceTelecom','Snecma','Supra')]
industry:('telecom','engineering','aviation','telecom','aviation','consulting'))
trade:([]
stock:n rand stock.name,
price:50 + .25 * n rand 200,
amount:100 * 10 + n rand 20,
date:date['1998-01-01'] + n rand 365)
trade:'date' asc trade
show'trade'
'last price (in time) by stock, month'
'we get a vector and then take the last'
t1:select last price by stock, date.month from trade
'trade volume by month and industry'
show select volume:sum amount * price
by date.month, stock.industry from trade
'5 day moving avg price by stock, month'
t3:select 5 avgs price
by stock, date.month
from trade
\l newstat.k
'5 day moving max price by stock, month'
t4:select movingmax[5,price] by stock, date.month from trade
-
Here is the newstat.k file.
/ newstat.k
/ moving max of last num elements
movingmax:{[num; vec]
minel: &/vec
vec: ((num-1) # minel), vec
size: #vec
indexes: (num-1) + !(size-(num-1))
findmax:{[n;v;i] |/v[(i-(n-1))+!n]}
:findmax[num;vec]'indexes
}
Commentary on Part of trade.t
-
The line
t1:select last price by stock, date.month from trade
performs a group by stock and year-month in the trade table.
Mathematically, the by clause partitions the
records based on distinct stock/year-month
values, like
a group by in SQL.
What is different is that
each partition is guaranteed to
be ordered in the same way as in the table (arrable),
in this case, by ascending order of date.
For each partition
corresponding to stock s and year-month x,
the select last price
part will return
s, x, and p where p is the price of the last record in the partition
corresponding to s and x.
Because the table is ordered by date, the last record will be
the one with the most recent date.
-
In the select statement
show select volume:sum amount * price
by date.month, stock.industry from trade
the date.month expression in the by
clause gives each distinct year-month value in the date column.
This is a big convenience in time series applications.
The by clause groups by these year-month values.
-
The lines
'5 day moving avg price by stock, month'
t3:select 5 avgs price by stock, date.month from trade
use the avgs function which, given a vector (price for
each stock in this case), returns a vector
of the same length, computing a moving average.
5 avgs computes the five day moving average
(an illustrative Python rendering appears after this commentary).
-
The line
\l newstat.k
loads one or more k functions from the file
newstat.k, in this case a single moving
maximum taking two arguments, a number and a vector.
-
The line
t4:select movingmax[5,price]
by stock, date.month from trade
uses the movingmax function defined in the K file
newstat.k to compute the 5 day moving
maximum of prices grouped by stock and year-month.
The movingmax function takes a scalar and a vector and that's
exactly what the by clause delivers.
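For readers who do not know K, here is a rough Python rendering of what the avgs and movingmax operations compute (illustrative only; the K code remains the reference, and the treatment of the first num-1 positions in moving_avg is my assumption about the edge behavior):
def moving_avg(num, vec):
    # num-period moving average; early entries average whatever values exist so far
    out, running = [], 0.0
    for i, v in enumerate(vec):
        running += v
        if i >= num:
            running -= vec[i - num]
        out.append(running / min(i + 1, num))
    return out
def moving_max(num, vec):
    # num-period moving maximum; mirrors newstat.k, which pads the front of the
    # vector with its minimum so the output has the same length as the input
    minel = min(vec)
    padded = [minel] * (num - 1) + list(vec)
    return [max(padded[i:i + num]) for i in range(len(vec))]
prices = [50.0, 52.0, 51.0, 55.0, 54.0, 53.0]
print(moving_avg(5, prices))
print(moving_max(5, prices))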
Two of our Challenge Queries using Vectors
-
Dot product of prices offset by 10 days for each stock
dotprod:{[x;y] +/(x*y)}
t4: select dotprod[(10 drop price), (-10 drop price)]
by stock from trade
-
The 10th highest price for each stock
t5: select last 10 first price
by stock from 'price' desc trade
-
You should be able to do the rest assuming that you can put in arbitrary
functions.
Is There an Ideal Time Series Language?
-
The ability to treat sequences as first class objects
on which one can do useful operations within
the database system.
FAME, S-Plus, SAS, and KSQL.
-
The ability to treat multiple sequences together
for correlations and other purposes.
FAME, S-Plus, SAS, and KSQL.
-
A basic collection of functions, e.g.
aggregates, moving aggregates, statistics, cross-correlations,
interpolation, etc.
FAME, S-Plus, SAS, and KSQL.
-
The ability to integrate user-defined functions into the query
engine.
My belief: user-defined functions are
necessary for time series. There is no analogue to relational completeness
that will satisfy all (or even most) time series applications.
Prove me wrong if you can.
KSQL and S-Plus; FAME and SAS to some extent.
-
The main helpful database amenities include
a rich relational vocabulary and
the ability to work efficiently with disk- as well as RAM-resident
data.
KSQL, FAME and SAS.
-
Treat time
specially, e.g. be able to group by date.month or date.year as in KSQL.
FAME, SAS, S-Plus with libraries, and KSQL.
-
Treat values appropriately, e.g. with the level and flow concepts of FAME.
FAME, natively.
The others require libraries.
-
Give support for bitemporality (discussed next) at least as an extension.
No system, natively.
-
Object-relational database vendors have begun to field
products with time-series extensions.
It will be interesting to see how well they measure up.
The Bitemporal Challenge
-
Even in finance, there is more to life than looking at flows
and levels.
Sometimes, one must do historical searches and understand the
state of the enterprise in the past or future.
-
Bitemporal = two times (Snodgrass, Ahn, Jensen, and others).
(A very nice review paper: Christian S. Jensen and Richard T. Snodgrass,
``Semantics of Time-Varying Information,''
Information Systems, 21(4): 311-352, 1996.)
The two times are valid time (when a fact holds)
and transaction time (when the fact was asserted).
-
Using this model, one can ask:
What was Rick's salary on April 1, 1998?
Suppose the answer was $1,500 per week, given
the information we know now (i.e. according to
the latest transaction).
But somehow we sent Rick a check for much more.
This gives rise to a new question:
What did we believe on March 28, 1998 (when payroll printed out Rick's
100,000 dollar weekly check) about Rick's salary valid on April 1, 1998?
Thus, we look at the transactions concerning Rick's salary
that preceded March 29, 1998 and see what was asserted
about Rick's salary as of April 1, 1998.
If the asserted value was 100,000 dollars per week, then
we know that the mistaken payment was due to a database error,
not a payroll processing error.
So, bitemporal databases are useful to track the cause of failures
in operations.
-
This example also
shows their general usefulness for human resources applications:
What is the job history of this candidate within the organization?
-
Or the more challenging:
How has the job history of this candidate been corrected?
-
For trading applications, we can use bitemporality to track the correlations
between predictions of earnings and price:
What did we believe as of January of this year that first
quarter earnings would be?
-
Unitemporal example for those with SQL skill:
Rick Snodgrass suggested the following example
to test your system's temporal acumen.
The salary of employees changes over time, usually in a positive direction.
Each salary for an employee therefore has a time when it is valid,
marked by starttime and endtime, the notion being that the salary
becomes valid at starttime and ceases being valid at endtime.
Your task is to compute the average salary over time as a set of rows
having the average salary as well as the start and end times.
Two consecutive intervals with the same average salary should be merged.
Send me mail if you want to see a KSQL version.
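Here is a rough sketch of the interval logic in Python (not the KSQL version mentioned above; the salary rows are hypothetical and end times are treated as exclusive). It sweeps the interval endpoints, averages the salaries valid over each elementary interval, and merges consecutive intervals with equal averages:
def average_salary_over_time(rows):
    # rows: list of (salary, starttime, endtime) with endtime exclusive
    points = sorted({t for _, s, e in rows for t in (s, e)})
    result = []
    for start, end in zip(points, points[1:]):
        valid = [sal for sal, s, e in rows if s <= start and end <= e]
        if not valid:
            continue
        avg = sum(valid) / len(valid)
        # merge with the previous interval if adjacent and equal in average
        if result and result[-1][0] == avg and result[-1][2] == start:
            result[-1] = (avg, result[-1][1], end)
        else:
            result.append((avg, start, end))
    return result
rows = [(1000, 1, 10), (2000, 5, 10), (3000, 10, 15)]
print(average_salary_over_time(rows))   # [(1000.0, 1, 5), (1500.0, 5, 10), (3000.0, 10, 15)]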
FinTime: a Financial Time Series Benchmark
The design of this benchmark is joint work
with Kaippallim Jacob of Morgan Stanley (kjacob@ms.com)
and its full description can be found at
http://cs.nyu.edu/cs/faculty/shasha/fintime.html.
Here we present a summary of that benchmark.
Goal
The FinTime benchmark tries to model the practical uses
of time-series databases in financial applications.
The
models suggested in FinTime reflect two frequently occurring cases in the
financial industry, namely, a historical market data system and a real-time price
tick database. These models are quite similar to two well-studied models in the
relational world, namely, decision support systems and OLTP systems. FinTime
also suggests and defines metrics that capture three useful dimensions of any
time-series system, namely, performance in a single-user mode, performance in a
multi-user mode and the price to performance ratio.
Models for a
Time-series Benchmark
Before deciding on a model, we have to
examine the different parameters that determine a model for a time-series system.
The most important parameters that influence a time-series database system
are:
- Periodicity of data (Regular/Irregular)
- Density of data (Dense/Sparse)
- Schedule of updates (Periodic/Continuous)
- Types of queries (Simple/Complex)
- Time interval between queries (Ad hoc/Batch)
- Number of concurrent users (Few/Many)
Combinations of these factors will give rise to 64
possible models, but for simplicity we can focus on the following commonly
occurring cases in the financial industry.
Model 1: Historical Market Information
Attribute | Specification |
Periodicity of Data | Periodic |
Density of Data | Dense |
Schedule of updates | Periodic updates (e.g. at the end of a business day) |
Complexity of queries | Complex (e.g. decision support queries) |
Nature of query arrival | Batch |
Concurrency of users | Low (few concurrent users) |
Model 2: Tick databases for financial instruments
Attribute | Specification |
Periodicity of Data | Non-periodic |
Density of Data | Sparse to moderately dense |
Schedule of updates | Continuous |
Complexity of queries | Simple |
Nature of query arrival | Ad hoc |
Concurrency of users | High (many concurrent users) |
Let us now discuss the characteristics of
the two models in some detail:
Model 1: Historical Market
Information
Historical Market data systems are closely
related to decision support systems of the traditional relational database
world. The various elements of this model are:
- Data Model:
In the case of historical time-series databases for market information,
the model consists of a few relational tables that typically contain
infrequently changing (static) information and a number of time-series tables.
The following tables describe the data in this
benchmark.
Base Information Table
Field Name | Type | Comments |
Id | Char(30) | Unique identifier for the financial instrument, e.g. IBM.N |
Ex | Char(3) | Exchange on which the instrument is traded, e.g. "N" for the New York Stock Exchange |
Descr | Varchar(256) | A brief description of the financial security, e.g. "International Business Machines, Armonk, NY" |
SIC | Char(10) | Standard Industry Code, e.g. "COMPUTERS" or "CHEMICAL" |
SPR | Char(4) | S&P rating for the company, e.g. "AAA" |
Cu | Char(5) | The base currency in which the security is traded, e.g. "USD" for US Dollars |
CreateDate | Date | Date this security came into existence |
Stock Splits (Irregular Time-series)
Field Name | Type | Comments |
Id | Char(30) | Identifier |
SplitDate | Date | Date the split is actually effective |
EntryDate | Date | Date the split is announced |
SplitFactor | Float | The split factor expressed as a decimal value, e.g. a 2 for 1 split is expressed as 0.5 (i.e. 1/2), a 4 for 3 as 0.75 (i.e. 3/4) |
Dividends (Irregular Time-series)
FieldName | Type | Comments |
Id | Char(30) | Identifier |
XdivDate | Date | Date on which the dividend is disbursed |
DivAmt | Float | Dividend amount in the base currency of the instrument |
AnnounceDate | Date | Date on which the dividend is announced |
Market Data (Regular Time-series)
FieldName | Type | Comments |
Id | Char(30) | Identifier |
Tradedate | Date | Trade date |
HighPrice | Float | High price for the day |
LowPrice | Float | Low price for the day |
ClosePrice | Float | Closing price for the day |
OpenPrice | Float | Open price for the day |
Volume | Long | Volume of shares traded |
-
Market information is typically provided as a set of input files by a market
data vendor at the end of each trading day. These files correspond to
the tables defined above. While the data for the Base
Information table, Split Adjustment Table and Dividend
Table is irregular (i.e. an external event triggers an entry into these tables),
the Market Information Table has an entry for every trading day. For the
purposes of this benchmark the implementation can select a scale factor,
a combination of the number of securities under consideration and the number of
events for those securities. We suggest 3 scale factors, namely, 50,000
securities, 100,000 securities,
and 1,000,000 securities, all for 4,000 days.
These roughly correspond to all equity
securities in the US, all equity securities in the G7 countries and all equity
securities in the world. Programs to generate the data can be found
on the benchmark pages.
- Query characteristics.
The queries model:
- Join of relational data and time-series information
- Access of long-depth time-series information for a few keys (deep history queries)
- Access of short-depth time-series for a large number of keys (cross-sectional queries)
- Sorting
- Grouping and aggregation
Following are the
benchmark queries (we don't require any special query language).
Many of them include a notion of ``specified set of securities''
or ``specified period''. These notions are defined
with respect to a simple random model.
1 | Get the closing price of a set of 10 stocks for a 10-year period and
group into weekly, monthly and yearly aggregates. For each aggregate period
determine the low, high and average closing price value. The output should be
sorted by id and trade date. (See the sketch after this list.) |
2 | Adjust all prices
and volumes (prices are multiplied by the split factor and volumes are divided
by the split factor) for a set of 1000 stocks to reflect the split events
during a specified 300 day period,
assuming that events occur before the first trade of
the split date. These are called split-adjusted prices and volumes.
|
3 | For each stock in a specified list of 1000 stocks, find the
differences between the daily high and daily low on the day of each split event
during a specified period. |
4 | Calculate the value
of the S&P500 and Russell 2000 index for a specified day using unadjusted
prices and the index composition of the 2 indexes (assumed to be given) on the
specified day |
5 | Find the 21-day and 5-day moving average price for a
specified list of 1000 stocks during a 6-month period. (Use split adjusted
prices) |
6 | (Based on the previous query)
Find the points (specific days) when the 5-day moving average intersects the
21-day moving average for these stocks. The output is to be sorted by id and
date. |
7 | Determine the value of $100,000 now if 1 year ago it was invested
equally in 10 specified stocks (i.e. allocation for each stock is $10,000). The
trading strategy is: When the 20-day moving average crosses over the
5-month moving average the complete allocation for that stock is invested and
when the 20-day moving average crosses below the 5-month moving average the
entire position is sold. The trades happen on the closing price of the trading
day. |
8 | Find the pair-wise coefficients of correlation in a set of 10
securities for a 2 year period. Sort the securities by the coefficient of
correlation, indicating the pair of securities corresponding to that row.
[Note: coefficient of correlation defined in appendix] |
9 | Determine the yearly dividends and annual yield
(dividends/average closing price) for the past 3 years for all the stocks in
the Russell 2000 index that did not split during that period. Use unadjusted
prices since there were no splits to adjust for. |
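As a rough illustration of what query 1 asks for, here is a pandas sketch (pandas is an assumption, not part of the benchmark; market_data is a hypothetical DataFrame with Id, Tradedate, and ClosePrice columns):
import pandas as pd
def query1(market_data, ids):
    # low, high, and average closing price per stock, grouped weekly, monthly,
    # and yearly, sorted by id and trade date
    df = market_data[market_data['Id'].isin(ids)].copy()
    df['Tradedate'] = pd.to_datetime(df['Tradedate'])
    out = {}
    for label, freq in [('weekly', 'W'), ('monthly', 'M'), ('yearly', 'Y')]:
        out[label] = (df.set_index('Tradedate')
                        .groupby('Id')['ClosePrice']
                        .resample(freq)
                        .agg(['min', 'max', 'mean'])
                        .sort_index())
    return out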
Since the benchmark model assumes that the
level of concurrency is small, the number of users should be 5. The benchmark
methodology should be that each of the five users will pick a query at random
from the set above without replacement
and submit it to the database. The five users will be
simultaneously active at any point in
time. Each user should do every query and then stop.
Model 2: Tick databases for
financial instruments
This second benchmark
models the
case where the database (consisting
of ticks) is expected to keep up with a very high rate of
updates while responding to several users issuing fairly simple queries.
Ticks are price
quotations or trade (transaction) prices and associated attributes for
individual securities that occur either on the floor of a stock exchange or in
an electronic trading system, such as the NASDAQ market system. Ticks include 2
basic kinds of data: (1) trades, which are transactions between buyers and sellers at
a fixed price and quantity, and (2) quotes, which are price quotations offered by buyers
and sellers. Quotes can have the ask quote, the bid quote, or both, along with
their associated attributes such as quantity offered.
Let us now define
the different elements of this benchmark. They are:
- Data model.
The data model for this benchmark is simple. It consists of the following 2
tables:
Base Information Table
Field Name | Type | Comments |
Id | Char(30) | Identifier for a security |
Ex | Char(3) | Exchange on which the instrument is traded, e.g. "N" for the New York Stock Exchange |
Descr | Varchar(256) | A short description of the security |
SIC | Char(10) | The standard industry code for the security |
Cu | Char(10) | Base currency for the security |
Trades/Quotes Table
FieldName | Type | Comments |
Id | Char(30) | Identifier |
SeqNo | Integer | A unique sequence number for each trade or quote |
TradeDate | Date | The trading date |
TimeStamp | Time | The time a particular trade was executed or quote was generated |
TradePrice | Float | The price at which the trade was executed |
TradeSize | Integer | The number of shares traded in a transaction |
AskPrice | Float | The price at which a seller would sell the security |
AskSize | Float | The size of the transaction offered at the specified ask price |
BidPrice | Float | The price at which a buyer would buy the security |
BidSize | Float | The size of the transaction offered at the specified bid price |
Type | Char | An indication of whether this is a quote or a trade |
- Data population,
Update frequency and volume.
Tick databases are generally populated by
adapters that receive data from real-time feeds. The frequency of updates is
constantly increasing but on the average we can assume that each security ticks
(each corresponding to a record in the database) about 100 times during an
8-hour trading day. Additionally, we can assume that the set of securities
being tracked is traded around the world and hence there is no quiescent period.
For the purposes of this benchmark we will assume that the system monitors
ticks on 1,000, 10,000 or 100,000 securities, where each represents a different
scale factor.
A very important consideration in tick
databases is the ability to quickly apply "cancel/correct".
Occasionally an incorrect quote or trade record is published. The vendor will
then send a correction record with the identifier of the security and its
sequence number. The record will either be corrected according to the newly
published values or simply deleted.
- Query characteristics. The
kinds of queries issued against an online tick database are usually simple and
pre-determined. Listed below are the set of queries to be used in the
benchmark.
1 | Get all ticks for a specified set of 100
securities for a specified three hour time period on a specified trade
date. |
2 | Determine the volume weighted
price of a security considering only the ticks in a specified
three hour interval. (A rough sketch of this query appears at the end of this section.) |
3 | Determine the top 10 percentage losers for the specified date on the
specified exchanges, sorted by percentage loss. The loss is calculated as a
percentage of the last trade price of the previous day. |
4 | Determine the top 10 most active stocks for a specified date, sorted
by cumulative trade volume, considering all trades. |
5 | Find the most active
stocks in the "COMPUTER" industry (use the SIC
code). |
6 | Find the 10 stocks with the
highest percentage spreads. Spread is the difference between the last ask-price
and the last bid-price. Percentage spread is calculated as a percentage of the
mid-point price (average of ask and bid price). |
- Operational parameters. Since
tick systems usually have a relatively large number of concurrent users, we
will assume that there are 50 concurrent users for the benchmark. These users
randomly pick queries from the list above and submit them for execution. Each
user will submit each query only once and run through the complete list of
queries.
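As a rough illustration of query 2 above, here is a Python sketch of the volume weighted price over a time window (the list-of-dicts representation of ticks is hypothetical and no real implementation would scan ticks this way; it only pins down the arithmetic):
def vwap(ticks, security_id, trade_date, start, end):
    # volume-weighted average trade price for one security over [start, end)
    # on one trade date; quote records (which carry no TradePrice) are skipped
    trades = [(t['TradePrice'], t['TradeSize'])
              for t in ticks
              if t['Id'] == security_id
              and t['TradeDate'] == trade_date
              and start <= t['TimeStamp'] < end
              and t.get('TradePrice') is not None]
    total = sum(size for _, size in trades)
    return None if total == 0 else sum(p * s for p, s in trades) / total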
Finding Patterns in Temporal Data
-
Financial traders have done data mining for many years.
One trader described his work to me as follows:
I think about an arbitrage trick (pairs trading
is such a trick). Program for a few months.
Try the trick and either it works or it doesn't.
If it doesn't, I try something new.
If it works, I enjoy it until the arbitrage disappears.
-
What does the research community have to offer to
such traders?
I present some research that I think might be most relevant.
I will be updating this as time goes on.
-
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors.
Advances in Knowledge Discovery and Data Mining.
AAAI Press/The MIT Press, 1996.
The article by Berndt and Clifford about
finding patterns in time series is particularly relevant to finance.
-
Temporal Databases -- Research and Practice
Editors: Opher Etzion, Sushil Jajodia, Sury Sripada. (Springer-Verlag,
1998).
There, you will find articles about finding unexpected patterns (e.g. fraud) and
multi-granularity data mining.
-
Christos Faloutsos
Searching Multimedia Databases by Content
Kluwer Academic Publishers.
This book shows how to do signal processing analysis on time series
to solve problems such as:
-
Discovering whether two time series have similar shapes:
the basic idea is to
store the first few Fourier coefficients of a time sequence in a
database and assert that two time sequences are similar if
their Fourier coefficients are close.
(Remarkably, this works well because the energy spectrum of
stock prices declines with
the power 2 as the frequency increases, so the first few coefficients
carry most of the energy.)
Joint work with Rakesh Agrawal and Arun Swami.
(A rough numeric sketch of this idea appears after this item.)
The efficiency of this technique has been improved by
Davood Rafiei and Alberto Mendelzon of the University of Toronto.
-
Subsequence matching (is this sequence
close to some subsequence of that sequence?).
Faloutsos
uses a technique called FastMap to make this efficient.
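A rough numpy sketch of the whole-sequence matching idea just described (the number of coefficients and the threshold are arbitrary here; the cited work gives the principled versions and the indexing machinery):
import numpy as np
def dft_signature(series, num_coeffs=4):
    # keep only the first few Fourier coefficients of a normalized series
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    return np.fft.fft(x)[:num_coeffs] / len(x)
def possibly_similar(a, b, num_coeffs=4, threshold=0.05):
    # cheap filter: two equal-length sequences are candidate matches if their
    # truncated signatures are close; candidates are then checked against the
    # full sequences, and the papers show how to avoid false dismissals
    d = np.linalg.norm(dft_signature(a, num_coeffs) - dft_signature(b, num_coeffs))
    return d <= threshold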
-
Other papers explore
the question of similarity search when time scaling and inversion
is possible:
-
R. Agrawal, K-I Lin, H.S. Sawhney and K. Shim.
``Fast similarity search in the presence of noise, scaling
and translation in time-series databases.''
Proc of the 21st VLDB Conference, 1995
-
D. Q. Goldin and P. C. Kanellakis.
``On similarity queries for time-series data:
constraint specification and implementation.''
1st International Conference on the Principles
and Practice of Constraint Programming. pp. 137-153.
Springer-Verlag, LNCS 976. September 1995.
-
Davood Rafiei and Alberto Mendelzon.
``Similarity-based queries for time series data''
ACM Sigmod, pp. 13-24. May 1997
-
B-K. Yi, H.V. Jagadish, and C. Faloutsos.
Efficient Retrieval of Similar Time
Sequences Under Time Warping.
Proc. of the Int. Conference on Data Engineering (ICDE), 1998.
-
Excellent work has also been done on data structures
by many researchers at Brown, Polytechnic, and the
University of Maryland, but that falls outside the data mining
purview.
-
As an alternative to seeing whether two sequences or subsequences match,
one might want to describe a desirable sequence (e.g. a head-and-shoulders
movement of stock prices) and see whether it is present.
Relevant papers about this include:
-
H.V. Jagadish, A. O. Mendelzon and T. Milo.
Similarity-based queries. PODS 1995.
-
R. Agrawal, G. Psaila, E. L. Wimmers and M. Zait.
Querying shapes of histories.
Proceedings of the 21st VLDB Conference.
pp. 502-514. 1995.
-
P. Seshadri, M. Livny and R. Ramakrishnan.
Sequence query processing.
ACM SIGMOD, pp. 430-441, 1994
Data model and query language for sequences in general,
with time series as a special case.
-
Arie Shoshani, Kyoji Kawagoe: Temporal Data Management. VLDB 1986: 79-88
One of the first papers in the literature.
-
Snodgrass, R.~T., editor, The TSQL2 Temporal Query Language , Kluwer
Academic Publishers, 1995, 674+xxiv pages. The TSQL2 Language Design
Committee consisted of Richard Snodgrass (chair),
Ilsoo Ahn, Gad Ariav, Don Batory, James
Clifford, Curtis E. Dyreson, Ramez Elmasri, Fabio Grandi, Christian
S. Jensen, Wolfgang Kaefer, Nick Kline, Krishna Kulkarni,
T. Y. Cliff Leung, Nikos Lorentzos, John F. Roddick, Arie Segev,
Michael D. Soo and Suryanarayana M. Sripada.
TSQL2 has time-varying aggregates, including moving window aggregates,
aggregates over different time granularities, and aggregates weighted over time.
-
Munir Cochinwala, John Bradley: A Multidatabase System for Tracking and
Retrieval of Financial Data. VLDB 1994: 714-721
A paper discussing the implementation of a tick capture
and query system --- for those brave enough to roll their own.
-
Raghu Ramakrishnan,
Donko Donjerkovic,
Arvind Ranganathan,
Kevin S. Beyer, and
Muralidhar Krishnaprasad:
SRQL: sorted relational query language
SSDBM 98
A paper discussing a model in which relations are tables
that can be ordered.
This allows one to do moving averages, find ten cheapest, preceding
fifteen, etc.
The strategy is to extend SQL with order and special operators.
-
Leonid Libkin and colleagues:
An optimizable array-oriented language based on comprehensions.
The basic primitives are tabulation (analogous to selection),
subscripting (retrieve elements from arrays), dimension
reduction (like count of an array), and interaction
between sets and arrays.
Optimizations are analogous to pushing selects into expressions
and techniques that reduce the complexity of expressions.
Books on Time Series for Computer Scientists
-
C. Chatfield,
The Analysis of Time Series: Theory and Practice
Chapman & Hall, fourth edition, 1984.
Good general introduction, especially for those completely
new to time series.
-
P.J. Brockwell and R.A. Davis,
Time Series: Theory and Methods,
Springer Series in Statistics (1986).
-
B.D. Ripley and W.N. Venables,
Modern Applied Statistics with S-Plus, Springer (1994)
Chapter 14 has a good discussion of time series.
http://www.stats.ox.ac.uk/~ripley/
has a lot of useful functions.
-
FinTime, a time series benchmark for finance
http://cs.nyu.edu/cs/faculty/shasha/fintime.html
Appendix: Informal Review of Statistical Concepts
-
Recall that the goal of probability theory is to determine the likelihood
of a given event given a probability distribution (e.g. how likely
is it to get 5,300 heads in 10,000 flips of a fair coin?).
The goal of statistics is to determine a probability distribution
given a series of observations or at least
to disprove a null hypothesis (e.g. is a fair coin a reasonable
model if I get 8,000 heads in 10,000 flips?).
-
In parametric statistics, one knows the form of the target probability
distribution but not the value of certain parameters, e.g.
coin flips are binomial but the probability of a head may be unknown.
In non-parametric statistics, one does not know the form
of the target probability distribution.
In finance, most models are parametric (autoregression, option pricing).
When models aren't, people use queries and eyeballs to figure
out what to do.
-
Stationary process : one whose statistics (mean and variance)
do not vary with time.
Stationarity is a fundamental assumption of pairs trading
and options pricing.
-
Correlation: a measure of the association between two series,
e.g. the option open interest and the price of a security
5 days later.
If cov(x,y) represents the covariance between x and y
and sigma(x) is the standard deviation of x, then
correlation(x,y) = cov(x,y)/(sigma(x)*sigma(y))
so it is entirely symmetric and always lies between -1 and 1.
(A small numeric sketch of this and the volatility and beta formulas below
appears at the end of this appendix.)
-
Partial correlation :
suppose you are looking at the one day returns of Merck and Pfizer
(two drug companies). You can look at them as raw data or you
can subtract out the market influence via a least squares estimate
and use the correlation of the residuals.
-
Volatility : a measure of the standard deviation
of the value of a variable over a specific time,
e.g. the annualized standard deviation of the returns.
The return at time t is ln(p(t)/p(t-1)).
This is a critical parameter in options pricing, because it determines
the probability that a price will exceed a certain price range.
-
Alpha, Beta, and Regression: suppose we estimate the relationship
between the percentage change in price of some stock S vs.
the percentage change in some market index M using a best fit
(least squares)
linear relationship:
s = a + bm
Then the parameter alpha (a) is the change in S independent of M
and beta (b) is the slope of the best fit line.
A riskless investment has a positive alpha and a zero beta,
but most investments have a zero alpha and a positive beta.
If beta is greater than 1, then for a given change in the market,
you can expect a greater change in S.
If beta is negative, then S moves in the opposite direction
from the market.
Note that beta is different from correlation (and can
be arbitrarily large or small) because it is not symmetric:
beta = cov(S,M)/(sigma(M)*sigma(M))
-
ANOVA: analysis of variance in cases when there is no missing data.
This is used to model situations in which several factors can play
a role and one wants to tease out a probabilistic model that describes
their interaction.
For example, product, location and customer income may be factors
that influence buying behavior.
ANOVA helps to figure out how to weight each one.
More significant variants of this include
principal components analysis
and factor analysis.
In finance, one might use one of these to figure out what determines
the price movement of a stock (perhaps half general market movement,
one third interest rates, etc.).
In psychology, one can ask a person 100 questions and then
categorize the person according to a weighted sum of a few questions.
-
Autoregression: a statistical model which predicts future
values from one or more previous ones.
This generalizes trend forecasting as used to predict sales.
Financial traders use this sparingly since models that look
at the recent past often just follow a short term trend.
As one trader put it:
``they
follow a trend and are always a day late and many dollars short.''
In general, regression of y on x is a determination
of how y depends on x.
-
Maximum likelihood method: suppose you are given a training set
consisting of observations and the categories to which the observations
belong.
The maximum likelihood method selects the probability distribution
that best explains the training set.
For example, if you toss a coin 10,000 times and observe
that heads comes up 8,000, you assign a probability to the heads
that maximizes the probability of this event.
This will make the probability of heads be greater than 1/2.
In finance, the maximum likelihood method is often used for forecasting
based on previously seen patterns.
-
Regularization: a technique for smoothing a function
to make it have nice mathematical properties such as differentiability.
Moving averages are an example of regularization.
-
Bootstrapping: (i) Divide the training set (set
of (observation, category) pairs) into
pieces. (ii) Infer the model from some pieces.
(iii) Test it
on the other pieces.
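For concreteness, here is a small numeric sketch of the correlation, volatility, and beta formulas above (numpy assumed; the return series are made up, and 252 is the usual count of trading days in a year):
import numpy as np
returns_stock = np.array([0.010, -0.004, 0.007, 0.002, -0.006, 0.011])
returns_market = np.array([0.008, -0.002, 0.005, 0.001, -0.004, 0.009])
def correlation(x, y):
    return np.cov(x, y, ddof=0)[0, 1] / (x.std() * y.std())
# annualized volatility of daily returns
volatility = returns_stock.std() * np.sqrt(252)
# beta and alpha from the least squares fit s = a + b*m
beta = np.cov(returns_stock, returns_market, ddof=0)[0, 1] / returns_market.var()
alpha = returns_stock.mean() - beta * returns_market.mean()
print(correlation(returns_stock, returns_market), volatility, beta, alpha)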
Acknowledgments
Lory Molesky,
Judy Smith, David Rothman,
and Rick Snodgrass
made several suggestions that have contributed
to this presentation.
All errors are mine.
Thank you for your attention.