Time Series in Finance: the array database approach


Prof. Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html


Outline


Scenario


So, What's the Database Problem?


What Are Time Series


System Support for Time Series


Operations on Time Series Data


Data Preparation


Query Types -- try these on your Database


Forecasting


Steps in a Typical FAME Session


S-Plus


S-Plus, some details


SAS


KDB


KSQL Basics -- an extended example




Commentary on Part of trade.t



Two of our Challenge Queries using Vectors


Is There an Ideal Time Series Language?


The Bitemporal Challenge


FinTime a Financial Time Series Databases

The design of this benchmark is joint work with Kaippallim Jacob of Morgan Stanley (kjacob@ms.com) and its full description can be found at http://cs.nyu.edu/cs/faculty/shasha/fintime.html.
Here we present a summary of that benchmark.

Goal

The FinTime benchmark tries to model the practical uses of time-series databases in financial applications. The models suggested in FinTime reflect two frequently occurring cases in the financial industry, namely, a historical market data system and real-time price tick database. These models are quite similar to two well-studied models in the relational world, namely, decision support systems and OLTP systems. FinTime also suggests and defines metrics that capture three useful dimensions of any time-series system, namely, performance in a single-user mode, performance in a multi-user mode and the price to performance ratio.

Models for a Time-series Benchmark

Before deciding on a model, we have to examine the different parameters that determine a model for time-series system. The most important parameters that influence a time-series database system are:

  1. Periodicity of data (Regular/irregular)
  2. Density of data (Dense/Sparse)
  3. Schedule of updates (periodic, continuous)
  4. Types of queries (Simple/Complex)
  5. Time interval between queries (Ad hoc/Batch)
  6. Number of concurrent users (Few/Many)

Combinations of these factors will give rise to 64 possible models but for simplicity we can focus on the following commonly occurring cases in the financial industry

Model 1: Historical market Information

 

Attribute

Specification

Periodicity of Data

Periodic

Density of Data

Dense

Schedule of updates

Periodic updates (e.g. at the end of a business day)

Complexity of queries

Complex (e.g. Decision support queries)

Nature of query arrival

Batch

Concurrency of users

Low (e.g. Few concurrent users)

 

Model 2: Tick databases for financial instruments

 

Attribute

Specification

Periodicity of Data

Non-periodic

Density of Data

Sparse to moderately dense

Schedule of updates

Continuous

Complexity of queries

Simple

Nature of query arrival

Ad hoc

Concurrency of users

High (e.g. Many concurrent users)

 

Let us now discuss the characteristics of the two models in some detail:

Model 1: Historical Market Information

Historical Market data systems are closely related to decision support systems of the traditional relational database world. The various elements of this model are:

 

 

Base Information Table

Field Name

Type

Comments

Id

Char(30)

Unique Identifier for the financial instrument e.g. IBM.N

Ex

Char(3)

Exchange on which the instrument is traded e.g. "N" for the New York Stock Exchange

Descr

Varchar(256)

A brief description of the financial security e.g. "International Business Machines, Armonk, NY"

SIC

Char(10)

Standard Industry Code e.g. "COMPUTERS" or "CHEMICAL"

SPR

Char (4)

S&P Rating for the company e.g. "AAA"

Cu

Char(5)

The base currency in which the security is traded e.g. "USD" for US Dollars

CreateDate

Date

Date this security came into existence

 

 

Stock Splits (Irregular Time-series)

Field Name

Type

Comments

Id

Char(30)

Identifier

SplitDate

Date

Date the split is actually effective

EntryDate

Date

Date the split is announced

SplitFactor

Float

The split factor expressed as a decimal value. E.g. a 2 for 1 split is expressed as 0.5 (ie 1/2), a 4 for 3 is expressed as 0.75(ie 3/4)

 

 

Dividends (Irregular Timeseries)

FieldName

Type

Comments

Id

Char(30)

Identifier

XdivDate

Date

Date on which dividend is disbursed

DivAmt

Float

Dividend Amount In the base currency of the instrument

AnnounceDate

Date

Date on which the dividend is announced

 

 

Market Data (Regular Timeseries)

FieldName

Type

Comments

Id

Char(30)

Identifier

Tradedate

Date

Trade Date

HightPrice

Float

High price for the day

LowPrice

Float

Low price for the day

ClosePrice

Float

Closing price for the day

OpenPrice

Float

Open price for the day

Volume

Long

Volume of shares traded

    1. Join of relational data and time-series information
    2. Access of long depth time-series information for a few keys (deep history query)
    3. Access of a short depth time-series for a large number of keys (cross-sectional queries)
    4. Sorting
    5. Grouping and Aggregation

Following are the benchmark queries (we don't require any special query language). Many of them include a notion of ``specified set of securities'' or ``specified period''. These notions are defined with respect to a simple random model.

1

Get the closing price of a set of 10 stocks for a 10-year period and group into weekly, monthly and yearly aggregates. For each aggregate period determine the low, high and average closing price value. The output should be sorted by id and trade date.

2

Adjust all prices and volumes (prices are multiplied by the split factor and volumes are divided by the split factor) for a set of 1000 stocks to reflect the split events during a specified 300 day period, assuming that events occur before the first trade of the split date. These are called split-adjusted prices and volumes.

3

For each stock in a specified list of 1000 stocks, find the differences between the daily high and daily low on the day of each split event during a specified period.

4

Calculate the value of the S&P500 and Russell 2000 index for a specified day using unadjusted prices and the index composition of the 2 indexes (assume to be given) on the specified day

5

Find the 21-day and 5-day moving average price for a specified list of 1000 stocks during a 6-month period. (Use split adjusted prices)

6

(Based on the previous query) Find the points (specific days) when the 5-month moving average intersects the 21-day moving average for these stocks. The output is to be sorted by id and date.

7

Determine the value of $100,000 now if 1 year ago it was invested equally in 10 specified stocks (i.e. allocation for each stock is $10,000). The trading strategy is:

When the 20-day moving average crosses over the 5-month moving average the complete allocation for that stock is invested and when the 20-day moving average crosses below the 5-month moving average the entire position is sold. The trades happen on the closing price of the trading day.

8

Find the pair-wise coefficients of correlation in a set of 10 securities for a 2 year period. Sort the securities by the coefficient of correlation, indicating the pair of securities corresponding to that row. [Note: coefficient of correlation defined in appendix]

9

Determine the yearly dividends and annual yield (dividends/average closing price) for the past 3 years for all the stocks in the Russell 2000 index that did not split during that period. Use unadjusted prices since there were no splits to adjust for.

 

Since the benchmark model assumes that the level of concurrency is small, the number of users should be 5. The benchmark methodology should be that each of the five users will pick a query at random from the set above without replacement and submit it to the database. The five users will be simultaneously active at any point in time. Each user should do every query and then stop.

 

 

Model 2: Tick databases for financial instruments

This second benchmark models the case where the database (consisting of ticks) is expected to keep up with a very high rate of updates while responding to several users issuing fairly simple queries.

Ticks are price quotation or trade (transaction) prices and associated attributes for individual securities that occur either on the floor of a stock exchange or in an electronic trading system, such as the NASDAQ market system. Ticks include 2 basic kinds of data (1) Trades are transactions between buyers and a sellers at a fixed price and quantity (2) Quotes are price quotations offered by buyers and sellers. Quotes can have the ask quote, the bid quote or both along with their associate attributes such as quantity offered.

Let us now define the different elements of this benchmark. They are:

Base Information Table

Field Name

Type

Comments

Id

Char(30)

Identifier for a security

Ex

Char(3)

Exchange on which the instrument is traded e.g. "N" for the New York Stock Exchange

Descr

Varchar (256)

A short description of the security

SIC

Char(10)

The standard industry code for the security

Cu

Char(10)

Base currency for the security

 

Trades/Quotes Table

FieldName

Type

Comments

Id

Char(30)

Identifier

SeqNo

Integer

A unique sequence number for each trade or quote

TradeDate

Date

The trading date

TimeStamp

Time

The time a particular trade was executed or Quote was generated

TradePrice

Float

The price at which the trade was executed

TradeSize

Integer

The number of shares traded in a transaction

AskPrice

Float

The Price at which a seller would sell a security

AskSize

Float

The size of the transaction offered at the specified ask price.

BidPrice

Float

The price at which a buyer would buy the security

BidSize

Float

The size of the transaction offered at the specified bid price

Type

Char

An indication if this is a quoteor a trade

A very important consideration in tick databases is the ability to quickly apply "cancel/correct". Occasionally an incorrect quote or trade record is published. The vendor will then send a correction record with the identifier of the security and its sequence number. The record will either be corrected according to the new published or simply deleted.

 

1

Get all ticks a specified set of 100 securities for a specified three hour time period on a specified trade date.

2.

Determine the volume weighted price of a security considering only the ticks in a specified three hour interval

3.

Determine the top 10 percentage losers for the specified date on the specified exchanges sorted by percentage loss. The loss is calculated as a percentage of the last trade price of the previous day.

4.

Determine the top 10 most active stocks for a specified date sorted by cumulative trade volume by considering all trades

5.

Find the most active stocks in the "COMPUTER" industry (use SIC code)

6.

Find the 10 stocks with the highest percentage spreads. Spread is the difference between the last ask-price and the last bid-price. Percentage spread is calculated as a percentage of the mid-point price (average of ask and bid price).

 

 


Finding Patterns in Temporal Data



Books on Time Series for Computer Scientists


Appendix: Informal Review of Statistical Concepts


Acknowledgments

Lory Molesky, Judy Smith, David Rothman, and Rick Snodgrass made several suggestions that have contributed to this presentation. All errors are mine.

Thank you for your attention.