Data Sources

quantammsim supports loading price data from multiple exchanges and data providers, with automatic gap-filling across sources to produce complete minute-level time series. This guide covers the data pipeline, supported sources, format requirements, and how to supply your own price data.

Data Pipeline Overview

All data loading flows through get_data_dict(), called by both train_on_historic_data() and do_run_on_historic_data().

from quantammsim.utils.data_processing.historic_data_utils import get_data_dict

data_dict = get_data_dict(
    list_of_tickers=["BTC", "ETH", "USDC"],
    run_fingerprint=run_fingerprint,
    data_kind="historic",
    root="/path/to/data/",
    max_memory_days=365.0,
    start_date_string="2024-01-01 00:00:00",
    end_time_string="2024-06-01 00:00:00",
)

The returned dictionary contains:

prices – Minute-level close prices, numpy array of shape (T, n_assets)
unix_values – Millisecond unix timestamps, shape (T,)
start_idx / end_idx – Indices bounding the simulation period
bout_length – Timesteps in the simulation period (end_idx - start_idx)
max_memory_days – Burn-in lookback before start_idx (clamped if data is short)
n_chunks – Number of chunk_period-sized blocks in the price array

When a test period is specified, the dictionary also includes prices_test, start_idx_test, end_idx_test, bout_length_test, and unix_values_test.
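To make the relationship between these fields concrete, here is a minimal sketch in plain NumPy. It is not the library's code: the timestamps, the window bounds, and the clamping rule (limit the burn-in to whatever data exists before start_idx) are all assumptions chosen for illustration.

```python
import numpy as np

MINUTE_MS = 60_000
MINUTES_PER_DAY = 1440

# Hypothetical minute-level series: 10 days of timestamps for 2 assets.
unix_values = 1_704_067_200_000 + MINUTE_MS * np.arange(10 * MINUTES_PER_DAY, dtype=np.int64)
prices = np.ones((unix_values.size, 2))

# Illustrative simulation window: starts 4 days in, ends 7 days in (end exclusive).
start_ms = unix_values[0] + 4 * MINUTES_PER_DAY * MINUTE_MS
end_ms = unix_values[0] + 7 * MINUTES_PER_DAY * MINUTE_MS

start_idx = int(np.searchsorted(unix_values, start_ms))
end_idx = int(np.searchsorted(unix_values, end_ms))
bout_length = end_idx - start_idx

# Assumed clamping rule: the burn-in cannot exceed the data before start_idx.
requested_memory_days = 365.0
max_memory_days = min(requested_memory_days, start_idx / MINUTES_PER_DAY)
burn_in_steps = int(max_memory_days * MINUTES_PER_DAY)

# Burn-in rows followed by the simulation period.
window = prices[start_idx - burn_in_steps : end_idx]
```

Here only 4 days of data precede the window, so the requested 365-day burn-in is reduced to 4 days.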

The data_kind parameter selects the loading strategy:

"historic" – Load from parquet via get_historic_parquet_data() (default).
"mc" – Monte Carlo price paths (available for BTC and ETH only).
"step" – Step-function price pattern for debugging strategy responses.

Supported Data Sources

Historic Parquet (primary)

Per-asset parquet files (e.g. BTC_USD.parquet) are loaded and joined on their unix index. When root=None, data is loaded from the bundled quantammsim/data/ directory. update_historic_data() builds these files by amalgamating the sources described below into a gap-free minute-level series per token.
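The join on the unix index can be pictured with a small pandas sketch. The in-memory frames below stand in for files that would normally be read with pd.read_parquet(); the inner join and the renaming step are assumptions about how such a join might be done, not a description of get_historic_parquet_data().

```python
import pandas as pd

# Hypothetical per-asset frames standing in for BTC_USD.parquet and ETH_USD.parquet.
btc = pd.DataFrame(
    {"close": [42000.0, 42010.0, 42005.0]},
    index=pd.Index([0, 60_000, 120_000], name="unix"),
)
eth = pd.DataFrame(
    {"close": [2300.0, 2301.0, 2299.0]},
    index=pd.Index([60_000, 120_000, 180_000], name="unix"),
)
frames = {"BTC": btc, "ETH": eth}

# Rename each close column to close_{TICKER}, then keep only shared timestamps.
joined = pd.concat(
    [df[["close"]].rename(columns={"close": f"close_{ticker}"}) for ticker, df in frames.items()],
    axis=1,
    join="inner",
)
```

Only the two timestamps present in both frames survive the join.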

Binance

quantammsim.utils.data_processing.binance_data_utils – Handles yearly CSVs from CryptoDataDownload; concat_csv_files() joins them into a single DataFrame. get_binance_vision_data() downloads directly from binance.vision via the binance_historical_data package.

Coinbase

quantammsim.utils.data_processing.coinbase_data_utils – Uses the Historic_Crypto package. fill_missing_rows_with_coinbase_data() fills gaps from pre-downloaded Coinbase Pro CSVs.

CoinMarketCap

quantammsim.utils.data_processing.cmc_data_utils – 3-hour interval data. fill_missing_rows_with_cmc_historical_data() fills gaps in the primary series.

Crypto Historical Dataset

quantammsim.utils.data_processing.amalgamated_data_utils – 1-minute text files ({TOKEN}_full_1min.txt). forward_fill_ohlcv_data() creates a complete minute-level index, forward-filling close prices, setting OHLC to previous close, and volume to zero for missing rows.
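The forward-fill rule described above can be sketched as follows. This is an illustrative stand-in, not the library's forward_fill_ohlcv_data(); the function name and the generic "volume" column are assumptions.

```python
import numpy as np
import pandas as pd

MINUTE_MS = 60_000

def forward_fill_ohlcv_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in, not the library's forward_fill_ohlcv_data()."""
    # Complete minute-level index between the first and last timestamps.
    full_index = pd.Index(
        np.arange(df.index[0], df.index[-1] + MINUTE_MS, MINUTE_MS, dtype=np.int64),
        name="unix",
    )
    out = df.reindex(full_index)
    missing = out["close"].isna()
    out["close"] = out["close"].ffill()
    # Rows that were missing get a flat candle at the previous close.
    for col in ("open", "high", "low"):
        out.loc[missing, col] = out.loc[missing, "close"]
    out.loc[missing, "volume"] = 0.0
    return out

raw = pd.DataFrame(
    {"open": [10.0, 12.0], "high": [11.0, 13.0], "low": [9.0, 11.5],
     "close": [10.5, 12.5], "volume": [5.0, 7.0]},
    index=pd.Index([0, 3 * MINUTE_MS], name="unix"),
)
filled = forward_fill_ohlcv_sketch(raw)
```

The two missing minutes become flat candles at 10.5 with zero volume, while the original rows are left unchanged.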

st0x

quantammsim.utils.data_processing.st0x_data_utils – Non-crypto assets (e.g. TSLA, JNJ). fill_missing_rows_with_st0x_historical_data() fills gaps with st0x data.

Aerodrome DEX

quantammsim.utils.data_processing.aerodrome_data_utils – On-chain data from Aerodrome on Base. Last source in the gap-filling cascade; useful for tokens with limited centralised exchange coverage.

Treasury Bill Rates

quantammsim.utils.data_processing.dtb3_data_utils – 3-month T-bill rates from FRED, used as a risk-free rate benchmark. Not part of the gap-filling cascade. Returns daily rates as decimals (percentage / 100), forward- then back-filled for missing dates.

from quantammsim.utils.data_processing.dtb3_data_utils import filter_dtb3_values
rates = filter_dtb3_values("DTB3.csv", "2024-01-01", "2024-06-01")
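The percentage-to-decimal conversion and the fill order can be illustrated with a small pandas sketch. The sample values and dates are made up, and this is not the implementation of filter_dtb3_values().

```python
import pandas as pd

# Hypothetical FRED-style rows: percent values with a leading gap and a weekend gap.
raw = pd.Series(
    [5.20, 5.25],
    index=pd.to_datetime(["2024-01-05", "2024-01-08"]),
    name="DTB3",
)

daily_index = pd.date_range("2024-01-04", "2024-01-09", freq="D")

# Percent -> decimal, then forward-fill, then back-fill whatever is still missing.
rates = (raw / 100.0).reindex(daily_index).ffill().bfill()
```

Forward-filling covers the weekend; back-filling is needed only for 2024-01-04, which has no earlier observation to carry forward.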

Synthetic Data

quantammsim.utils.data_processing.synthetic_data_utils – Deterministic sinusoidal prices for testing, with no external data dependency.

from quantammsim.utils.data_processing.synthetic_data_utils import make_sinuisoid_data
prices = make_sinuisoid_data(n_time_steps=2880, n_tokens=3, n_periods=3, noise=True)

The composite_run flag interleaves fast and slow cycles for multi-frequency tests.

Data Format Requirements

All sources are normalised to a common DataFrame format before writing to parquet:

Column	Type	Description
`unix`	`int64`	Millisecond Unix timestamp (index or column)
`date`	`str`	`"YYYY-MM-DD HH:MM:SS"`
`symbol`	`str`	Trading pair, e.g. `"BTC/USD"`
`open` / `high` / `low` / `close`	`float64`	OHLC minute candle prices
`Volume USD`	`float64`	Dollar volume
`Volume {TOKEN}`	`float64`	Token-denominated volume

Key constraints:

1-minute resolution required (60000 ms between consecutive rows). update_historic_data() validates this and raises on any remaining gap.
Timestamps use millisecond precision. Source data in seconds is multiplied by 1000; nanosecond data is divided by 10⁶.
Only close is consumed by the default simulation path (cols=["close"]). Full OHLCV is needed only with return_slippage=True.
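The first two constraints can be checked before data reaches the pipeline. The helpers below are hypothetical (to_milliseconds and check_minute_spacing are not quantammsim functions); they only illustrate the unit conversion and the spacing rule stated above.

```python
import numpy as np

MINUTE_MS = 60_000

def to_milliseconds(ts, unit: str) -> np.ndarray:
    """Convert integer timestamps in 's', 'ms' or 'ns' to milliseconds."""
    ts = np.asarray(ts, dtype=np.int64)
    if unit == "s":
        return ts * 1000
    if unit == "ns":
        return ts // 1_000_000
    if unit == "ms":
        return ts
    raise ValueError(f"unsupported unit: {unit}")

def check_minute_spacing(unix_ms: np.ndarray) -> None:
    """Raise if any consecutive pair is not exactly one minute apart."""
    bad = np.flatnonzero(np.diff(unix_ms) != MINUTE_MS)
    if bad.size:
        raise ValueError(f"{bad.size} irregular step(s), first after index {bad[0]}")

seconds = np.array([1_704_067_200, 1_704_067_260, 1_704_067_320])
unix_ms = to_milliseconds(seconds, "s")
check_minute_spacing(unix_ms)  # passes silently
```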

Using Custom Price Data

Bypass all file loading by passing price_data directly:

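import numpy as np
import pandas as pd

# start_ms / end_ms: millisecond unix bounds of your data (user-supplied).
# btc_prices, eth_prices, usdc_prices: your own arrays, each of length len(unix_ms).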

unix_ms = np.arange(start_ms, end_ms, 60_000, dtype=np.int64)
df = pd.DataFrame(
    {"close_BTC": btc_prices, "close_ETH": eth_prices, "close_USDC": usdc_prices},
    index=pd.Index(unix_ms, name="unix"),
)

data_dict = get_data_dict(
    list_of_tickers=["BTC", "ETH", "USDC"],
    run_fingerprint=run_fingerprint,
    price_data=df,
    start_date_string="2024-01-01 00:00:00",
    end_time_string="2024-06-01 00:00:00",
)

The index must be named "unix" with millisecond timestamps. Columns must follow the close_{TICKER} convention. Tickers are sorted alphabetically internally, so column order is irrelevant.
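A quick pre-flight check can catch a badly shaped frame before it is passed as price_data. The helper below is hypothetical and not part of quantammsim; the 60000 ms spacing check assumes that custom data must meet the same 1-minute requirement as the parquet files.

```python
import numpy as np
import pandas as pd

def check_price_frame(df: pd.DataFrame, tickers: list[str]) -> pd.DataFrame:
    """Hypothetical pre-flight check; not part of quantammsim."""
    if df.index.name != "unix":
        raise ValueError("index must be named 'unix'")
    expected = [f"close_{t}" for t in sorted(tickers)]
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Assumption: custom data needs the same 1-minute spacing as parquet data.
    if len(df) > 1 and not (np.diff(df.index.to_numpy()) == 60_000).all():
        raise ValueError("rows must be exactly 60000 ms apart")
    # Return the columns in alphabetical ticker order.
    return df[expected]

unix_ms = np.arange(0, 5 * 60_000, 60_000, dtype=np.int64)
df = pd.DataFrame(
    {"close_USDC": np.ones(5), "close_ETH": np.full(5, 2300.0), "close_BTC": np.full(5, 42000.0)},
    index=pd.Index(unix_ms, name="unix"),
)
checked = check_price_frame(df, ["BTC", "ETH", "USDC"])
```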

Gap Filling and Amalgamation

update_historic_data() tries each source in the order listed below, filling only the timestamps that are still missing:

Binance Vision – primary, from binance.vision
Binance CDD – CryptoDataDownload yearly CSVs
Coinbase – pre-downloaded Coinbase Pro CSVs
Gemini – Gemini exchange CSVs
Bitstamp – Bitstamp exchange CSVs
Crypto Historical Dataset – 1-minute text files
CoinMarketCap – 3-hour data (selected tokens)
st0x – non-crypto assets (selected tokens)
Candles parquet – DeFi tokens via Trading Strategy candles
Aerodrome DEX – on-chain data from Base

At each step the fill function computes the index set difference, concatenates missing rows, sorts, and deduplicates. After all sources, forward_fill_ohlcv_data() produces a gapless series. The pipeline validates that all consecutive timestamp differences are exactly 60000 ms.
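A single fill step (index set difference, concatenate, sort, deduplicate) can be sketched in pandas as follows. The function name is hypothetical and the sketch stands in for the fill_missing_rows_with_*() helpers rather than reproducing them.

```python
import pandas as pd

def fill_missing_rows_sketch(primary: pd.DataFrame, fallback: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in for the fill_missing_rows_with_*() helpers."""
    # Timestamps present in the fallback source but absent from the primary one.
    missing_idx = fallback.index.difference(primary.index)
    combined = pd.concat([primary, fallback.loc[missing_idx]]).sort_index()
    # Keep the first occurrence so existing rows are never overwritten.
    return combined[~combined.index.duplicated(keep="first")]

primary = pd.DataFrame(
    {"close": [100.0, 101.0]},
    index=pd.Index([0, 120_000], name="unix"),
)
fallback = pd.DataFrame(
    {"close": [99.0, 100.5, 102.0]},
    index=pd.Index([0, 60_000, 180_000], name="unix"),
)
filled = fill_missing_rows_sketch(primary, fallback)
```

The fallback's value at timestamp 0 is ignored because the primary source already covers it. Applying the same step once per source, in priority order, gives the cascade.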

Frequency Conversion

quantammsim.utils.data_processing.minute_daily_conversion_utils provides:

expand_daily_to_minute_data() – reindex daily data to minute frequency via forward-fill.
resample_minute_level_OHLC_data_to_daily() – aggregate minute OHLC into daily candles (first open, max high, min low, last close, summed volume).
calculate_annualised_daily_volatility_from_minute_data() – daily log-return std from minute prices, annualised by \(\sqrt{365.25}\).
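The daily aggregation and the annualisation factor can be illustrated with plain pandas. This is a sketch on made-up data, not the library functions. Taking log returns of the daily closes is one reading of "daily log-return std from minute prices" and is an assumption here.

```python
import numpy as np
import pandas as pd

# Hypothetical five days of minute candles with a linearly rising price.
n = 5 * 1440
unix_ms = 1_704_067_200_000 + 60_000 * np.arange(n, dtype=np.int64)
close = 100.0 + 0.01 * np.arange(n)
minute = pd.DataFrame(
    {"open": close, "high": close + 0.5, "low": close - 0.5, "close": close, "volume": 1.0},
    index=pd.to_datetime(unix_ms, unit="ms"),
)

# Daily candles: first open, max high, min low, last close, summed volume.
daily = minute.resample("1D").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)

# Std of daily log returns, annualised by sqrt(365.25).
log_returns = np.log(daily["close"]).diff().dropna()
annualised_vol = log_returns.std() * np.sqrt(365.25)
```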

quantammsim.utils.data_processing.volume_data_utils provides calculate_daily_volume_from_minute_data(), which computes daily token volume as the summed dollar volume divided by the daily mean close price.
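As a worked example of that ratio, the following pandas sketch uses made-up minute data. It does not call calculate_daily_volume_from_minute_data(); the column names follow the format table above.

```python
import numpy as np
import pandas as pd

# Hypothetical two days of minute data: price 50 on day one, 100 on day two,
# with a constant 1000 USD traded every minute.
n = 2 * 1440
index = pd.to_datetime(1_704_067_200_000 + 60_000 * np.arange(n, dtype=np.int64), unit="ms")
minute = pd.DataFrame(
    {"close": np.repeat([50.0, 100.0], 1440), "Volume USD": 1000.0},
    index=index,
)

daily = minute.resample("1D").agg({"Volume USD": "sum", "close": "mean"})

# Token volume = summed dollar volume / daily mean close.
daily_token_volume = daily["Volume USD"] / daily["close"]
```

The dollar volume is the same on both days (1,440,000 USD), so doubling the price halves the token volume from 28,800 to 14,400.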