Modelling Time-Series Data for Top-Performing Stocks in Aerospike using Sorted Maps
With the new Sorted Map API in Aerospike v3.8.4, it is now possible to store Sorted Map data in Aerospike. This is an experimental feature intended to help make Aerospike easier to use; we look forward to our community’s feedback on the API, which we expect to make GA this autumn. With this feature, it is now possible to retrieve partial data based on certain criteria, such as a List of Top Ten Values, or a Portfolio Stock Position. The Sorted Map API adds to the existing features in Aerospike to store Lists and Maps and create Secondary Indexes on these structures.
As part of this blog post, we will look at ways in which data can be efficiently stored and retrieved in Aerospike. As an example, let us consider a Financial Tick Data use case for which real-time stock tick data is stored. This data can then be retrieved efficiently to find out the average tick price and the max and min values for the stock, as well as the overall stock position for the period.
Data Model
To be able to handle rapidly growing tick data within Aerospike in an efficient way, this data needs to be grouped based on the stock ticker and a specific time scale (such as hours, days, or weeks). In this example, let us consider a data model where data is organized by ticker and by the specific day for which the stock data is processed. In essence, for every tick data point being read for a particular stock in a day, there would be a Sorted Map to store every single tick data value.
For example, the ticker data for a day for a particular stock – say, AAPL (Apple) – could be as follows:
{0:99.19, 1:99.34, 2:100.23, 3: 98.24, 4: 99.38,… 390: 101.73}
Here, the map key is the index of the timestamp within a day. In essence, 0 stands for 9:30a.m., and 390 stands for 4p.m.
The primary key for this data point can be a concatenated string of the ticker id and the timestamp of the particular day, such as AAPL1465237800000.
Here, there are 391 data points read in the day for a specific stock (where the read time period is 1 second); each ticker value is stored as part of a Sorted Map in Aerospike. Sorted Maps in Aerospike are great for storing sorted data for which querying for top or bottom values can be done very efficiently based on the rank of the specific data point. The index position of this value provides the specific time period for which the data value was recorded. Likewise, it is also possible to query for a sorted order of the key by using an index.
All this can be directly done on the server, without having to bring the entire data into memory and then having to sort it.
As an example, a query to find the nth tallest or shortest person in a map containing the height of people would be directly operated on the server. Also, when a new data point is added into a Sorted Map, for the existing data points, the only things that change are the rank and index of each entry in the map.
Figure 1. Sorting people by height
Alternatively, by using a generic map API, you would query the record, return it to the application code, and use a sorting algorithm to rank order the data client-side. Because this is costly and time-consuming, it makes Sorted Maps a great way to store and retrieve ordered data in real time.
To be able to easily calculate average stock value for the day at any given point in time, one can create a separate bin (column) to store the sum of all inserted data points for a particular stock on that day.
This is how the data would look in Aerospike:
aql> select * from test.timeseries where pk ='IBM1464892200000'
+----------------------------------------------------------------------------------+
| stock | sum |
+----------------------------------------------------------------------------------+
| MAP’{0:152.5, 1:152.525, 2:152.92, 3:152.6001, 4:152.82, ...}’) | 59565.357 |
+----------------------------------------------------------------------------------+
1 row in set (0.001 secs)
Both the insertion of a new data point into the Sorted Map and the addition to the “Sum” Bin can be done in one atomic operation using the operate command.
For example, in Java:
record = client.operate(wPolicy,
key,
MapOperation.put(mPolicy,
Stock",
Value.get(index),
Value.get(stockTickVal)),
Operation.add(sumBin));
In order to calculate averages, we would also need the count of data points. The size() method can be called on the Sorted Map to get this information. In this way, averages can easily be calculated at any given point in time.
Reporting
Once this data is loaded into Aerospike, it can be retrieved at query time based on the sorted order of the tick data. Again, the operate command can be used to get the Sorted Map data and the other bin data.
An example appears below (where stock is the name of the bin):
Record record = client.operate(wPolicy,
key,
MapOperation.getByRank("stock",
-1,
MapReturnType.VALUE),
MapOperation.getByRank("stock",
-1,
MapReturnType.INDEX),
MapOperation.getByRank("stock",
0,
MapReturnType.VALUE),
MapOperation.getByRank("stock",
0,
MapReturnType.INDEX),
MapOperation.getByIndex("stock",
0,
MapReturnType.VALUE),
MapOperation.getByIndex("stock",
-1,
MapReturnType.VALUE),
Operation.get("sum"),
MapOperation.size("stock"));
For a particular stock ticker, this operation gets the lowest stock tick data for the day and its index position, the highest stock tick data for the day and its index position, the first data point of the day, the last data point for the day, the sum of all the Sorted Map entries, and the number of recorded data points.
Since we have designed the data model based on stock tick data for a day, we could use the following approach to summarize data for a time period: create a separate set (table) to store the summarized data, and use Sorted Maps to store this summary data for a particular stock across days. This makes it very easy to query for things like the maximum or minimum value of the stock across a given time period.
To get the top-performing stocks, one approach could be to use another set where the difference between the initial and final values for each of the stocks can be stored. This data can then be used to retrieve top-performing stocks.
Demo
To showcase the power of Aerospike’s Sorted Map, a time series demo has been created. In it, data for a particular stock across a time period is read from Google Finance in real time and stored in Aerospike using a SortedMap. Each record in the database stores all the tick data for a ticker ID for a particular day.
The output of the demo is a summarized view of the maximum, minimum, and average values for the stock, along with the stock position at the time.
For example, below is the demo output for IBM, Apple, GE, Cisco and Oracle from July 1st to July 5th, 2016. The demo loads data into Aerospike for the period mentioned by pulling data from Google Finance. Once the data is loaded, for each day in the period, it uses the Sorted Map API to find the Maximum and Minimum values, along with the time of the day when this was recorded. Also, summary information for the entire period is retrieved by storing this daily information into another set. The overall stock position for all the stocks is shown, along with the top 5 performing stocks for the time period. If less than 5 stocks are being analysed, then only the best- and worst-performing stocks during the period are shown. The demo also has the option to be run in offline mode by loading data into a text file.
Any number of stocks can be loaded and analysed using the -t option by supplying a comma-separated list, as shown below, where 5 stocks for the last 2 days (using the -d option) are loaded and then analysed:
$ java -jar target/AeroTimeSeries-1.0.jar -o LR -t AAPL,IBM,ORCL,MSFT,CSCO -h 127.0.0.1 -d 2
****************************************
Loading Data
****************************************
Inserting Data for 01/07/2016 with Primary Key: AAPL1467311400000
Inserting Data for 05/07/2016 with Primary Key: AAPL1467657000000
Inserting Data for 01/07/2016 with Primary Key: IBM1467311400000
Inserting Data for 05/07/2016 with Primary Key: IBM1467657000000
Inserting Data for 01/07/2016 with Primary Key: ORCL1467311400000
Inserting Data for 05/07/2016 with Primary Key: ORCL1467657000000
Inserting Data for 01/07/2016 with Primary Key: MSFT1467311400000
Inserting Data for 05/07/2016 with Primary Key: MSFT1467657000000
Inserting Data for 01/07/2016 with Primary Key: CSCO1467311400000
Inserting Data for 05/07/2016 with Primary Key: CSCO1467657000000
****************************************
Loading Complete
****************************************
Reading Data for 01/07/2016 with Primary Key: AAPL1467311400000
: MaxValue: 96.45 Time of Day: 11:5
: MinValue: 95.45 Time of Day: 9:33
Reading Data for 05/07/2016 with Primary Key: AAPL1467657000000
: MaxValue: 95.34 Time of Day: 9:30
: MinValue: 94.51 Time of Day: 10:20
****************************************
*********** AAPL Summary ***************
To get the following report in AQL, run - select * from test.tickersummary where pk= AAPLSummary1467781064761
****************************************
Sum: 74600.13
Count: 782.0
Average Value of Stock for the Period: 95.4
Starting Price: 95.48
Ending Price: 94.99
Maximum Price on 01/07/2016 of Stock Price: 96.45
Minimum Price on 05/07/2016 of Stock Price: 94.51
****************************************
Reading Data for 01/07/2016 with Primary Key: IBM1467311400000
: MaxValue: 152.95 Time of Day: 13:10
: MinValue: 151.45 Time of Day: 9:53
Reading Data for 05/07/2016 with Primary Key: IBM1467657000000
: MaxValue: 151.99 Time of Day: 9:44
: MinValue: 150.83 Time of Day: 11:55
****************************************
*********** IBM Summary ***************
To get the following report in AQL, run - select * from test.tickersummary where pk= IBMSummary1467781064761
****************************************
Sum: 118639.72
Count: 781.0
Average Value of Stock for the Period: 151.91
Starting Price: 152.27
Ending Price: 151.65
Maximum Price on 01/07/2016 of Stock Price: 152.95
Minimum Price on 05/07/2016 of Stock Price: 150.83
****************************************
Reading Data for 01/07/2016 with Primary Key: ORCL1467311400000
: MaxValue: 41.05 Time of Day: 11:5
: MinValue: 40.7 Time of Day: 9:29
Reading Data for 05/07/2016 with Primary Key: ORCL1467657000000
: MaxValue: 40.58 Time of Day: 9:36
: MinValue: 40.16 Time of Day: 13:43
****************************************
*********** ORCL Summary ***************
To get the following report in AQL, run - select * from test.tickersummary where pk= ORCLSummary1467781064761
****************************************
Sum: 31781.15
Count: 782.0
Average Value of Stock for the Period: 40.64
Starting Price: 40.7
Ending Price: 40.39
Maximum Price on 01/07/2016 of Stock Price: 41.05
Minimum Price on 05/07/2016 of Stock Price: 40.16
****************************************
Reading Data for 01/07/2016 with Primary Key: MSFT1467311400000
: MaxValue: 51.71 Time of Day: 10:22
: MinValue: 51.11 Time of Day: 14:42
Reading Data for 05/07/2016 with Primary Key: MSFT1467657000000
: MaxValue: 51.27 Time of Day: 15:45
: MinValue: 50.76 Time of Day: 9:50
****************************************
*********** MSFT Summary ***************
To get the following report in AQL, run - select * from test.tickersummary where pk= MSFTSummary1467781064761
****************************************
Sum: 40039.74
Count: 782.0
Average Value of Stock for the Period: 51.2
Starting Price: 51.13
Ending Price: 51.17
Maximum Price on 01/07/2016 of Stock Price: 51.71
Minimum Price on 05/07/2016 of Stock Price: 50.76
****************************************
Reading Data for 01/07/2016 with Primary Key: CSCO1467311400000
: MaxValue: 28.93 Time of Day: 11:18
: MinValue: 28.61 Time of Day: 9:33
Reading Data for 05/07/2016 with Primary Key: CSCO1467657000000
: MaxValue: 28.61 Time of Day: 10:43
: MinValue: 28.24 Time of Day: 9:29
****************************************
*********** CSCO Summary ***************
To get the following report in AQL, run - select * from test.tickersummary where pk= CSCOSummary1467781064761
****************************************
Sum: 22358.22
Count: 782.0
Average Value of Stock for the Period: 28.59
Starting Price: 28.77
Ending Price: 28.33
Maximum Price on 01/07/2016 of Stock Price: 28.93
Minimum Price on 05/07/2016 of Stock Price: 28.24
****************************************
****************************************
*********** Top Performing Stocks ***************
To get the following report in AQL, run - select * from test.overallsummary where pk= 1467781064761
****************************************
1: MSFT with net position: 0.04
2: ORCL with net position: -0.31
3: CSCO with net position: -0.44
4: AAPL with net position: -0.49
5: IBM with net position: -0.62
****************************************
The demo can be downloaded from Github. It can be run by specifying a time period during which data needs to be analysed, along with ticker IDs. For further details on how to set up the demo and use it, please refer to the Readme file on Github.
Summary
The Aerospike Sorted Maps demo can be extended to support a complete financial portfolio, where the top-performing analysts and their top-performing stocks can be retrieved at run time. This has applications in a number of scenarios, industries and use cases, such as Leaderboards (Top Stocks, Top Advertisers, Top Customers, etc.), Time-Series Applications, and Summary Reporting.
In addition to trying out the demo, we encourage you, the Aerospike community, to try the new Sorted Maps API and to build your own applications on it. We look forward to your feedback on our user forum.