Fast time series classes in Java

We describe our space- and time-efficient daily time series implementation.

Our system is expressly built to handle investing concepts well, just like the programming language R is particularly good for statistics vs., say, creating video games or word processors.

A common such concept is a time series, which is a collection of data points in time order. Particularly useful to us are daily time series (henceforth DTS), where we have values for consecutive market (NOT calendar) days1. DTS objects are central to our backtest subsystem, and we run thousands of such backtests in our automated test suite. Therefore, we created our own space- and time-efficient optimized implementation.

Sharing a mapping

A DTS can store any object type; however, the following example refers to the DTS of an account’s value.

A common implementation of a map with O(1) lookup time, such as java.util.HashMap,  is (conceptually) an unordered set of {key, value} entries, where the keys implement hashCode and equals. Example:

Fri 2010-01-08 → $1,008
Mon 2010-01-11 → $1,011
Tue 2010-01-12 → $1,012

Barring a few exceptions, data classes in our codebase are immutable2. A corollary of this is that map objects cannot have entries added to or removed from them. This allows us to represent the above in two pieces, without any efficiency losses from having to enlarge the array, or moving objects in it, since the object is immutable:

  • a mapping of market day to array index:
    Fri 2010-01-08 → 0
    Mon 2010-01-11 → 1
    Tue 2010-01-12 → 2
    
  • an array with the data points appearing consecutively:
    { $1008 , $1011, $1012 }

This allows us to improve on a simple java.util.HashMap in multiple ways, which improve both running time or space (not in the big-O sense, but still).

  1. Less space
    1. No hash table gaps: data items can be stored consecutively in memory, so we will not have any gaps – whereas a hash map’s underlying array will need to be larger (typically at least 2x) to reduce collisions.
    2. Mapping can be shared: if there are many time series for the same market date range, the date -> array index mapping can be shared, and only the actual data array needs to be different. [Note: further down, we describe a trick to achieve savings even across DTS objects that cover different date ranges.]
  2. Fast ordered iteration: if we want to iterate over all data points in sequential day order, we just sequentially access the array. With a hash map3 we would have to enumerate all market days in order, and then look up the corresponding values, which would result in a hash function calculation per item.

Here is the (condensed) code. We call this ArrayIndexMapping; it is actually a more general concept, and not specific to market days.

public interface ArrayIndexMapping {
  T getKey(int index); // throws if index not in mapping
  int getIndex(T key); // throws if key not in mapping
  int size();
}

DailyTimeSeries objects have an ArrayIndexMapping member; more details below.

Hash function speedup

A naive implementation of ArrayIndexMapping would be using a simple hash map under the hood, and there would be no improvements vs. just using a plain hash map.

However, with market days, there’s a trick we do.

We use a hash function:

f(date) = day of month + 32 * (month of year + 13 * year)

This is guaranteed to give us increasing values for later days (i.e. f(x) is monotonically increasing)4.

Then, we pick a date range that we want to support. It has to be early enough to accommodate backtests that go far into the past, but late enough so that we know the market holiday schedule. We chose the first market day5 in 2000, which is Monday 2000-01-03.

We create a ‘master array’ whose size is the number of calendar days between 2000-01-03 and today. Then, for the n-th calendar day D after 2000-01-03, we store f(D) – f(2000-01-03) in the n-th location of the array – but only if it is a market day. If not, it will have a value of 0, since Java initializes int arrays this way, but it doesn’t matter, because that value will never get looked at.

The master array will look like this – using empty to denote ‘value which never gets read’:

Array index (not stored) n, for n-th market day since 2000-01-03 Calendar date (not stored)
0 0 2000-01-03 Monday
1 1 2000-01-04 Tuesday
2 2 2000-01-05 Wednesday
3 3 2000-01-06 Thursday
4 4 2000-01-07 Friday
5 2000-01-08 Saturday
6 2000-01-09 Sunday
7 5 2000-01-10 Monday
8 6 2000-01-11 Tuesday

The first 5 days are for the weekdays (Mon – Fri). Then, there are two empty values for the weekend, and then there are more weekdays, and so on. [Note: The grayed-out row is used in the example further down.]

The memory requirements of the master array are trivial: e.g. when we run our code in 2018, this would be approximately 18 * 365 * 4 bytes per int ~= 26 Kb.

Here’s an example. For simplicity, let us assume that the account value on the n-th market day after the first one is $1000 + n.

The DTS class will store this array:

Array index (not stored) $ account value Market date (not stored)
0 $1,000 2000-01-03 Monday
1 $1,001 2000-01-04 Tuesday
2 $1,002 2000-01-05 Wednesday
3 $1,003 2000-01-06 Thursday
4 $1,004 2000-01-07 Friday
5 $1,005 2000-01-10 Monday
6 $1,006 2000-01-11 Tuesday

Say we want to look up that DTS’s value for Monday 2000-01-10 (gray row). Here are the steps:

  1. We compute f(2000-01-10) – f(2000-01-03), which is 7; i.e. there are 7 calendar days since the ‘beginning of time’.
  2. We index the master array on 7, which gives us 5.
  3. We index the DTS’s internal array on 5, which gives us $1,005.

The lookup time consists of 2 array lookups, and the calculation of f(2000-01-10) = 10 + 32 * (1 + 13 * 2000). [Note: although trivial, we precompute once and save the value of f(2000-01-03)].

In short, a DTS lookup by date will take almost zero time.

An additional advantage is improved spatial locality, which should improve cache performance, since we typically go through a DTS in time order.

Reusing the master array

The previous solution works if all time series start on 2000-01-03. However, different backtests use different starting dates; some may start in 2008 and some in 2009. We use a simple trick so that all DTS lookups can be fast, regardless of the date range they cover.

Each DTS includes a specialized ArrayIndexMapping implementation that stores:

A. a reference to the ‘master array’

B. the calendar day index of the starting day in the master array. In other words, f(starting day) – f(2000-01-03).

The space requirements per DTS are trivial: 8 bytes for the reference (on a 64-bit machine) and 4 for the int that stores the starting index.

Let’s take the example of a DTS that starts on Wednesday 2008-01-02, the first market day in 2008. Value (B) is:

f(2008-01-02) – f(2000-01-03)

= (2 + 32 * (1 + 13 * 2008)) – (3 + 32 * (1 + 13 * 2000))

= 32 * 13 * 8 – 1

= 3327

Let us assume that account values for n-th market day in 2008 are $8000 + n. The DTS will store this array:

Array index (not stored) $ account value Market date (not stored)
0 $8,000 2008-01-02 Wednesday
1 $8,001 2008-01-03 Thursday
2 $8,002 2008-01-04 Friday

Say we want to look up this DTS’s value for 2008-01-04. The steps are simple:

  • Compute f(2008-01-04) – f(2000-01-03) = 3329
  • Subtract 3327 from it to get 2
  • Index the internal DTS array at position 2 to get the account value of $8,002

This shows that the space and time benefits extend to all DTS objects in the system, regardless of their market date range.

Summary

We have described a way to store daily time series objects in a space-efficient way, and where lookups based on market day are lightning fast. We do thousands of backtests in our test suite, so even small savings matter.


Notes

1. We use “market day” to mean “day when stock markets were open.” This excludes weekends and equity market holidays (Martin Luther King Day, Labor Day, etc.) All equity exchanges in the US have shared the same holiday calendar in the recent past. Some ‘half market days’ exist, e.g. the Thursday before Thanksgiving, but we treat those as regular days.

2. There are several benefits beyond the scope of this post. Quickly: immutability of data objects simplifies the code and reduces bugs, because it enforces value semantics for non-primitives, instead of the default reference semantics Java gives us. In other words, it guarantees that a callee cannot change the contents of an object that is passed in.

3. There are ways around this (e.g. java.util.TreeMap) but they come with their own baggage over a plain hash map (e.g. extra internal state).

4. We could also have used:

(day of month - 1) + 31 * ((month of year - 1) + 12 * (year - 1))

However, we wanted to be stingy even about the minus-one subtraction.

5. In the last few years, if a market holiday (such as New Year’s Day) falls on a weekend, it is usually observed on the next weekday. That was not the case back in 2000, so Monday 2000-01-03 was the first market day in 2000, even though New Year’s fell on a Saturday.

Author: Rowboat Advisors

Rowboat Advisors builds software for sophisticated and fully automated portfolio management for the financial advisor industry.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s