Walk-Forward Analysis
Robust Walk-Forward Utilities
Core utilities for walk-forward validation: Rademacher complexity, Walk-Forward Efficiency (WFE), and cycle generation.
Robust Walk-Forward Training Utilities
This module provides core utilities for walk-forward analysis:
Rademacher Complexity (Paleologo)
- Compute empirical Rademacher complexity from checkpoint returns
- Apply haircut to OOS performance estimates
Walk-Forward Efficiency (Pardo)
- Compute WFE = OOS performance / IS performance
- Standard metric for assessing robustness
Cycle Generation
- Generate walk-forward train/test splits
- Support for rolling and expanding windows
Key References:
- Pardo, “The Evaluation and Optimization of Trading Strategies” (2008)
- Paleologo, “The Elements of Quantitative Investing” (2024), Ch. 6
- class WalkForwardCycle(cycle_number, train_start_date, train_end_date, test_start_date, test_end_date, train_start_idx=0, train_end_idx=0, test_start_idx=0, test_end_idx=0)[source]
Bases: object
Specification for a single walk-forward train/test cycle.
Defines one segment of a walk-forward analysis: a contiguous training window followed by a contiguous test window. Date fields are set at cycle-generation time; index fields are populated later once the price data has been loaded and aligned.
- Parameters:
- train_start_idx
Row index into the price array for the start of training. Default 0; populated after data loading.
- Type:
- __init__(cycle_number, train_start_date, train_end_date, test_start_date, test_end_date, train_start_idx=0, train_end_idx=0, test_start_idx=0, test_end_idx=0)
- compute_empirical_rademacher(returns_matrix, n_samples=1000, seed=42)[source]
Compute empirical Rademacher complexity of a set of strategies.
The Rademacher complexity measures how well the strategy class can “fit” random noise. Higher complexity = more overfitting risk.
- Parameters:
- Returns:
Empirical Rademacher complexity R̂
- Return type:
Notes
R̂ = E_σ[sup_s (1/T) Σ_t σ_t r_s(t)]
where σ_t are random Rademacher variables (±1 with prob 0.5)
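The expectation in the formula above can be approximated by Monte Carlo: draw random sign vectors, correlate each with every strategy's return series, take the best correlation per draw, and average. The following is a minimal NumPy sketch of that idea, not the library's implementation. The function name `empirical_rademacher_sketch` is hypothetical, and it assumes `returns_matrix` has shape `(n_strategies, T)`, matching the `(n_checkpoints, T)` layout used for checkpoint returns.

```python
import numpy as np


def empirical_rademacher_sketch(returns_matrix, n_samples=1000, seed=42):
    """Monte Carlo estimate of E_sigma[max_s (1/T) sum_t sigma_t r_s(t)].

    Hypothetical helper for illustration; assumes shape (n_strategies, T).
    """
    returns = np.asarray(returns_matrix, dtype=float)
    if returns.ndim != 2:
        raise ValueError("returns_matrix must have shape (n_strategies, T)")
    _, T = returns.shape

    rng = np.random.default_rng(seed)
    # Each row of sigma is one draw of T independent +/-1 signs.
    sigma = rng.choice([-1.0, 1.0], size=(n_samples, T))

    # correlations[k, s] = (1/T) * sum_t sigma[k, t] * returns[s, t]
    correlations = sigma @ returns.T / T

    # Supremum over strategies for each draw, then average over draws.
    return float(correlations.max(axis=1).mean())
```

With a single strategy the estimate sits near zero, since one fixed series cannot track random signs. It grows as more distinct strategies are added, which is the sense in which a larger search space carries more overfitting risk.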
- compute_rademacher_haircut(observed_sharpe, rademacher_complexity, T, delta=0.05)[source]
Compute Rademacher-adjusted performance bound.
From Paleologo (2024): θ_n ≥ θ̂_n - 2R̂ - estimation_error
- Parameters:
- Returns:
(adjusted_sharpe, haircut_magnitude)
- Return type:
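To make the shape of the bound concrete, here is a small sketch that subtracts `2R̂` plus an estimation-error term from the observed Sharpe ratio. The text above does not specify the estimation-error term. The sketch assumes a generic concentration term `sqrt(2 ln(2/δ) / T)` purely for illustration, and the library's actual term may differ. The name `rademacher_haircut_sketch` is hypothetical.

```python
import math


def rademacher_haircut_sketch(observed_sharpe, rademacher_complexity, T, delta=0.05):
    """Return (adjusted_sharpe, haircut_magnitude) for an illustrative bound."""
    if T <= 0:
        raise ValueError("T must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")

    # ASSUMPTION: illustrative concentration term, not taken from the source.
    estimation_error = math.sqrt(2.0 * math.log(2.0 / delta) / T)

    # Haircut = complexity penalty (2 * R_hat) + estimation error.
    haircut = 2.0 * rademacher_complexity + estimation_error
    return observed_sharpe - haircut, haircut
```

Under this assumed form, the haircut shrinks as `T` grows and widens as `delta` is made smaller, that is, as a higher confidence level is requested.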
- compute_walk_forward_efficiency(is_sharpe, oos_sharpe, is_days, oos_days)[source]
Compute Walk-Forward Efficiency (WFE) as per Pardo.
WFE = (Annualized OOS Performance) / (Annualized IS Performance)
A WFE of 0.5 or higher suggests robustness. A WFE near 1.0 is ideal (OOS ≈ IS). A WFE > 1.0 means OOS outperformed IS (unusual but possible).
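As a rough illustration of the ratio and the thresholds described above, the sketch below assumes both inputs are already annualized and returns NaN when the IS figure is non-positive or non-finite. That handling is a choice made for this sketch, not documented library behaviour. Both function names and the label strings are hypothetical.

```python
import math


def wfe_sketch(is_perf, oos_perf):
    """Ratio of annualized OOS performance to annualized IS performance."""
    # ASSUMPTION: a non-positive or non-finite IS value makes the ratio
    # hard to interpret, so this sketch reports NaN in that case.
    if not (math.isfinite(is_perf) and math.isfinite(oos_perf)) or is_perf <= 0:
        return math.nan
    return oos_perf / is_perf


def interpret_wfe(wfe):
    """Map a WFE value onto the rule-of-thumb bands quoted in the text."""
    if math.isnan(wfe):
        return "undefined"
    if wfe > 1.0:
        return "OOS outperformed IS"
    if wfe >= 0.5:
        return "suggests robustness"
    return "below 0.5"
```

For example, an IS Sharpe of 2.0 with an OOS Sharpe of 1.0 gives a WFE of 0.5, which sits right at the robustness threshold.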
- generate_walk_forward_cycles(start_date, end_date, n_cycles, keep_fixed_start=False)[source]
Generate walk-forward cycle specifications with equal-length test periods.
Divides [start_date, end_date] into (n_cycles + 1) equal segments. Each cycle trains on segment i and tests on segment i+1.
- Parameters:
start_date (str) – Start date (format: “YYYY-MM-DD HH:MM:SS”)
end_date (str) – End date of walk-forward analysis (end of final test period)
n_cycles (int) – Number of training/test cycles
keep_fixed_start (bool) – If True, training always starts from start_date (expanding window). If False (default), training window rolls forward (rolling window).
- Return type:
List[WalkForwardCycle]
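The segment scheme described above can be sketched with the standard library alone. This is an illustrative reimplementation, not the module's code. `CycleSketch` and `generate_cycles_sketch` are hypothetical names, cycle numbering is assumed to start at 0, and the test window is assumed to begin exactly where the training window ends.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class CycleSketch:
    cycle_number: int
    train_start_date: str
    train_end_date: str
    test_start_date: str
    test_end_date: str


def generate_cycles_sketch(start_date: str, end_date: str, n_cycles: int,
                           keep_fixed_start: bool = False) -> List[CycleSketch]:
    if n_cycles < 1:
        raise ValueError("n_cycles must be at least 1")
    start = datetime.strptime(start_date, DATE_FORMAT)
    end = datetime.strptime(end_date, DATE_FORMAT)
    if end <= start:
        raise ValueError("end_date must be after start_date")

    # Split the full span into (n_cycles + 1) equal segments.
    segment = (end - start) / (n_cycles + 1)
    boundaries = [start + i * segment for i in range(n_cycles + 2)]
    boundaries[-1] = end  # guard against rounding in the last boundary

    cycles = []
    for i in range(n_cycles):
        # Expanding window: always train from the first boundary.
        # Rolling window: train on segment i only.
        train_start = boundaries[0] if keep_fixed_start else boundaries[i]
        cycles.append(CycleSketch(
            cycle_number=i,
            train_start_date=train_start.strftime(DATE_FORMAT),
            train_end_date=boundaries[i + 1].strftime(DATE_FORMAT),
            test_start_date=boundaries[i + 1].strftime(DATE_FORMAT),
            test_end_date=boundaries[i + 2].strftime(DATE_FORMAT),
        ))
    return cycles
```

With a six-day span and `n_cycles=2`, each segment is two days long. The rolling variant trains on days 1 to 3 and then 3 to 5, while the expanding variant trains on days 1 to 3 and then 1 to 5. The test windows are the same in both cases.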
Training Evaluator
Walk-forward evaluation framework with pluggable trainer wrappers, per-cycle IS/OOS metric extraction, and aggregate robustness diagnostics.
Training Evaluator: A Meta-Runner for Assessing Training Effectiveness.
Wrap any training approach and evaluate whether it’s effective using:
Walk-Forward Efficiency (Pardo)
Rademacher Complexity (Paleologo) — requires checkpoint tracking, see below
OOS performance metrics
Usage:
from quantammsim.runners.training_evaluator import TrainingEvaluator, compare_trainers

# Option 1: Wrap existing runner
evaluator = TrainingEvaluator.from_runner("train_on_historic_data", max_iterations=500)
results = evaluator.evaluate(run_fingerprint, n_cycles=5)

# Option 2: Wrap custom function
def my_trainer(data_dict, train_start_idx, train_end_idx, pool, run_fp, warm_start=None):
    # ... your logic ...
    return params, {"epochs": n}

evaluator = TrainingEvaluator.from_function(my_trainer)

# Option 3: Compare approaches
comparison = compare_trainers(
    run_fingerprint,
    trainers={
        "sgd": TrainingEvaluator.from_runner("train_on_historic_data"),
        "random": TrainingEvaluator.random_baseline(),
    },
)
Rademacher Complexity
Rademacher complexity measures overfitting risk by tracking the “search space”
explored during optimization. To compute Rademacher complexity, the trainer
must return checkpoint_returns in metadata:
import numpy as np

def my_trainer_with_checkpoints(...):
    checkpoint_returns = []
    for epoch in range(n_epochs):
        params = update(params)
        if epoch % checkpoint_interval == 0:
            returns = evaluate(params)  # Returns array of shape (T,)
            checkpoint_returns.append(returns)
    return params, {
        "epochs_trained": n_epochs,
        "checkpoint_returns": np.stack(checkpoint_returns),  # (n_checkpoints, T)
    }

evaluator = TrainingEvaluator.from_function(
    my_trainer_with_checkpoints,
    compute_rademacher=True,  # Enable Rademacher computation
)
The built-in wrapper for train_on_historic_data supports checkpoint tracking.
Enable it by passing compute_rademacher=True to from_runner():
evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    compute_rademacher=True,  # Enable checkpoint tracking
    checkpoint_interval=10,   # Optional: checkpoint every N iterations
)
For multi_period_sgd or custom trainers, you can implement checkpoint
tracking manually by returning checkpoint_returns in metadata (as shown
above).
- class CycleEvaluation(cycle_number, is_sharpe, is_returns_over_hodl, oos_sharpe, oos_returns_over_hodl, walk_forward_efficiency, is_oos_gap, epochs_trained=0, rademacher_complexity=None, adjusted_oos_sharpe=None, is_calmar=None, oos_calmar=None, is_sterling=None, oos_sterling=None, is_ulcer=None, oos_ulcer=None, is_returns=None, oos_returns=None, is_daily_log_sharpe=None, oos_daily_log_sharpe=None, trained_params=None, train_start_date=None, train_end_date=None, test_start_date=None, test_end_date=None, run_location=None, run_fingerprint=None)[source]
Bases: object
Evaluation results for a single walk-forward cycle.
Captures in-sample (IS) and out-of-sample (OOS) performance metrics for one train/test window, plus robustness diagnostics.
- Parameters:
cycle_number (int)
is_sharpe (float)
is_returns_over_hodl (float)
oos_sharpe (float)
oos_returns_over_hodl (float)
walk_forward_efficiency (float)
is_oos_gap (float)
epochs_trained (int)
rademacher_complexity (float | None)
adjusted_oos_sharpe (float | None)
is_calmar (float | None)
oos_calmar (float | None)
is_sterling (float | None)
oos_sterling (float | None)
is_ulcer (float | None)
oos_ulcer (float | None)
is_returns (float | None)
oos_returns (float | None)
is_daily_log_sharpe (float | None)
oos_daily_log_sharpe (float | None)
train_start_date (str | None)
train_end_date (str | None)
test_start_date (str | None)
test_end_date (str | None)
run_location (str | None)
- rademacher_complexity
Empirical Rademacher complexity from training checkpoints.
- Type:
float or None
- is_calmar, oos_calmar
Calmar ratio (return / max drawdown) for IS and OOS.
- Type:
float or None
- is_sterling, oos_sterling
Sterling ratio for IS and OOS.
- Type:
float or None
- is_ulcer, oos_ulcer
Ulcer index for IS and OOS.
- Type:
float or None
- is_returns, oos_returns
Cumulative returns for IS and OOS.
- Type:
float or None
- is_daily_log_sharpe, oos_daily_log_sharpe
Daily-log-return Sharpe for IS and OOS.
- Type:
float or None
- train_start_date, train_end_date
IS window date boundaries.
- Type:
str or None
- test_start_date, test_end_date
OOS window date boundaries.
- Type:
str or None
- __init__(cycle_number, is_sharpe, is_returns_over_hodl, oos_sharpe, oos_returns_over_hodl, walk_forward_efficiency, is_oos_gap, epochs_trained=0, rademacher_complexity=None, adjusted_oos_sharpe=None, is_calmar=None, oos_calmar=None, is_sterling=None, oos_sterling=None, is_ulcer=None, oos_ulcer=None, is_returns=None, oos_returns=None, is_daily_log_sharpe=None, oos_daily_log_sharpe=None, trained_params=None, train_start_date=None, train_end_date=None, test_start_date=None, test_end_date=None, run_location=None, run_fingerprint=None)
- Parameters:
cycle_number (int)
is_sharpe (float)
is_returns_over_hodl (float)
oos_sharpe (float)
oos_returns_over_hodl (float)
walk_forward_efficiency (float)
is_oos_gap (float)
epochs_trained (int)
rademacher_complexity (float | None)
adjusted_oos_sharpe (float | None)
is_calmar (float | None)
oos_calmar (float | None)
is_sterling (float | None)
oos_sterling (float | None)
is_ulcer (float | None)
oos_ulcer (float | None)
is_returns (float | None)
oos_returns (float | None)
is_daily_log_sharpe (float | None)
oos_daily_log_sharpe (float | None)
train_start_date (str | None)
train_end_date (str | None)
test_start_date (str | None)
test_end_date (str | None)
run_location (str | None)
- Return type:
None
- class EvaluationResult(trainer_name, trainer_config, cycles, mean_wfe, mean_oos_sharpe, std_oos_sharpe, worst_oos_sharpe, mean_is_oos_gap, aggregate_rademacher=None, adjusted_mean_oos_sharpe=None, is_effective=False, effectiveness_reasons=<factory>)[source]
Bases: object
Complete evaluation results across all walk-forward cycles.
Aggregates per-cycle metrics into summary statistics and provides an effectiveness verdict based on configurable thresholds.
- Parameters:
- cycles
Per-cycle evaluation results.
- Type:
List[CycleEvaluation]
- is_effective
Whether the strategy passes the effectiveness criteria (positive mean OOS Sharpe, WFE > threshold, etc.).
- Type:
- __init__(trainer_name, trainer_config, cycles, mean_wfe, mean_oos_sharpe, std_oos_sharpe, worst_oos_sharpe, mean_is_oos_gap, aggregate_rademacher=None, adjusted_mean_oos_sharpe=None, is_effective=False, effectiveness_reasons=<factory>)
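To show how per-cycle figures might roll up into the summary fields listed above, here is a minimal sketch over plain dictionaries. It is not the library's aggregation logic. The function name, the use of a population standard deviation, the 0.5 WFE threshold, and the choice to record failure reasons are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def aggregate_cycles_sketch(cycles, wfe_threshold=0.5):
    """Summarize a list of per-cycle dicts into aggregate statistics.

    Each dict is assumed to carry 'oos_sharpe', 'walk_forward_efficiency'
    and 'is_oos_gap' keys, mirroring the CycleEvaluation field names.
    """
    if not cycles:
        raise ValueError("need at least one cycle")

    oos = [c["oos_sharpe"] for c in cycles]
    wfe = [c["walk_forward_efficiency"] for c in cycles]
    gap = [c["is_oos_gap"] for c in cycles]

    summary = {
        "mean_wfe": mean(wfe),
        "mean_oos_sharpe": mean(oos),
        "std_oos_sharpe": pstdev(oos),  # ASSUMPTION: population std
        "worst_oos_sharpe": min(oos),
        "mean_is_oos_gap": mean(gap),
    }

    # ASSUMPTION: two illustrative criteria; the real verdict may use more.
    reasons = []
    if summary["mean_oos_sharpe"] <= 0:
        reasons.append("mean OOS Sharpe is not positive")
    if summary["mean_wfe"] <= wfe_threshold:
        reasons.append(f"mean WFE does not exceed {wfe_threshold}")

    summary["is_effective"] = not reasons
    summary["effectiveness_reasons"] = reasons
    return summary
```

Reporting the worst cycle alongside the mean is a useful habit: a strategy with a healthy average but one badly negative cycle deserves a closer look before it is called effective.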
- class TrainerWrapper(name='trainer', config=None)[source]
Bases: object
Base class for wrapping training functions.
- A trainer must implement:
- train(data_dict, train_start_idx, train_end_idx, pool, run_fp, warm_start, …) -> (params, metadata)
- train(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params=None, warm_start_weights=None, train_start_date=None, train_end_date=None, test_end_date=None)[source]
Train and return (params, metadata).
- Parameters:
warm_start_params (dict, optional) – Strategy parameters from previous cycle to use as initialization.
warm_start_weights (array-like, optional) – Final weights from previous cycle. Pool starts with fresh initial_pool_value but distributed according to these weights (simulating continuous operation).
data_dict (dict)
train_start_idx (int)
train_end_idx (int)
pool (Any)
run_fingerprint (dict)
n_assets (int)
train_start_date (str | None)
train_end_date (str | None)
test_end_date (str | None)
- Return type:
- class FunctionWrapper(fn, name='custom', config=None)[source]
Bases: TrainerWrapper
Wrap a plain (run_fingerprint, **kwargs) -> (params, metrics) function as a trainer.
Use via TrainingEvaluator.from_function() rather than constructing directly.
- class ExistingRunnerWrapper(runner_name, runner_kwargs=None, compute_rademacher=False, root=None)[source]
Bases: TrainerWrapper
Wrap an existing runner (train_on_historic_data, etc).
This creates a thin adapter that calls the existing runner with appropriate parameters.
- train(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params=None, warm_start_weights=None, train_start_date=None, train_end_date=None, test_end_date=None)[source]
Call the existing runner.
Note: This adapts the cycle-based interface to the existing runners which expect full run_fingerprint with date strings. The date strings are used to modify the fingerprint so each cycle trains on different data.
- Parameters:
- Return type:
- class RandomBaselineWrapper(seed=42)[source]
Bases: TrainerWrapper
Baseline: Random parameters.
Use to check if your trainer beats random chance.
- Parameters:
seed (int)
- class TrainingEvaluator(trainer, n_cycles=5, keep_fixed_start=False, compute_rademacher=False, verbose=True, root=None, wfe_metric='sharpe')[source]
Bases: object
Evaluates whether a training approach is effective.
Wraps any trainer and runs walk-forward evaluation to assess effectiveness using WFE and Rademacher metrics.
Pruning
This evaluator yields CycleEvaluation results via evaluate_iter(), allowing the consumer (e.g., HyperparamTuner) to decide when to prune. The evaluator itself does not prune - it evaluates all cycles unless the consumer stops iterating. This design keeps pruning logic in one place (the Optuna integration) rather than duplicating it here.
- __init__(trainer, n_cycles=5, keep_fixed_start=False, compute_rademacher=False, verbose=True, root=None, wfe_metric='sharpe')[source]
- classmethod from_runner(runner_name, n_cycles=5, keep_fixed_start=False, verbose=True, compute_rademacher=False, root=None, wfe_metric='sharpe', **runner_kwargs)[source]
Create evaluator from an existing runner.
- Parameters:
runner_name (str) – One of: “train_on_historic_data”, “multi_period_sgd”
n_cycles (int) – Number of walk-forward cycles
verbose (bool) – Print progress
compute_rademacher (bool) – Enable Rademacher complexity computation. This enables checkpoint tracking in the trainer, which saves intermediate returns during training for Rademacher estimation. Default False.
root (str, optional) – Root directory for data files. If None, uses default data location.
wfe_metric (str) – Metric to use for WFE and IS-OOS gap computation. Default “sharpe” (per Pardo). Can be any metric from calculate_period_metrics (sharpe, calmar, sterling, etc.)
**runner_kwargs – Arguments passed to the runner (e.g., max_iterations=500)
keep_fixed_start (bool)
- Return type:
Example
>>> evaluator = TrainingEvaluator.from_runner(
...     "train_on_historic_data",
...     max_iterations=500,
...     compute_rademacher=True,  # Enable Rademacher complexity
... )
- classmethod from_function(fn, name='custom', n_cycles=5, keep_fixed_start=False, verbose=True, root=None, wfe_metric='sharpe', **config)[source]
Create evaluator from a custom training function.
- Parameters:
fn (Callable) – Function with signature fn(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params) -> (params, metadata).
name (str) – Name for this trainer
n_cycles (int) – Number of walk-forward cycles
keep_fixed_start (bool) – If True, expanding window (train always starts from beginning). If False, rolling window (train window moves forward).
root (str, optional) – Root directory for data files. If None, uses default data location.
wfe_metric (str) – Metric to use for WFE and IS-OOS gap computation. Default “sharpe”.
**config – Config dict for reporting
verbose (bool)
- Return type:
Example
>>> def my_trainer(data_dict, train_start_idx, train_end_idx, pool,
...                run_fingerprint, n_assets, warm_start_params=None):
...     # Your training logic
...     return params, {"epochs": 100}
>>>
>>> evaluator = TrainingEvaluator.from_function(my_trainer)
- classmethod random_baseline(seed=42, n_cycles=5, keep_fixed_start=False, verbose=True, root=None, wfe_metric='sharpe')[source]
Create evaluator that uses random parameters.
Use this as a baseline to verify your trainer beats random chance.
- Parameters:
seed (int) – Random seed for reproducibility
n_cycles (int) – Number of walk-forward cycles
keep_fixed_start (bool) – If True, expanding window. If False, rolling window.
verbose (bool) – Print progress
root (str, optional) – Root directory for data files. If None, uses default data location.
wfe_metric (str) – Metric to use for WFE and IS-OOS gap computation. Default “sharpe”.
- Return type:
- evaluate_iter(run_fingerprint)[source]
Generator that yields CycleEvaluation after each cycle completes.
This allows callers to inspect intermediate results and potentially stop early (e.g., for Optuna pruning).
- Yields:
CycleEvaluation – Results from each completed cycle
- Returns:
Final aggregated results (carried in the value attribute of the StopIteration exception raised once the generator is exhausted)
- Return type:
- Parameters:
run_fingerprint (dict)
Example
>>> evaluator = TrainingEvaluator.from_runner("train_on_historic_data")
>>> gen = evaluator.evaluate_iter(run_fingerprint)
>>> for cycle_eval in gen:
...     print(f"Cycle {cycle_eval.cycle_number}: OOS Sharpe = {cycle_eval.oos_sharpe}")
...     if cycle_eval.oos_sharpe < -1.0:
...         break  # Stop early if terrible
>>> # A for loop discards the generator's return value. To read the final
>>> # result, drive the generator with next() and catch StopIteration
>>> # (the result is its .value), or call evaluate() instead.
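In Python, a generator's return value is delivered on the `StopIteration` exception, which a plain `for` loop swallows. The sketch below shows one way a consumer could both stop early and still capture the final result when the run completes. `evaluate_iter_sketch` is a hypothetical stand-in that yields plain floats rather than CycleEvaluation objects, and `consume` is likewise illustrative.

```python
def evaluate_iter_sketch(scores):
    """Stand-in generator: yields each score, returns their mean at the end."""
    seen = []
    for score in scores:
        seen.append(score)
        yield score
    return sum(seen) / len(seen)


def consume(gen, stop_below=-1.0):
    """Drive the generator by hand so its return value is not lost.

    Returns (yielded_items, final_result); final_result is None when the
    consumer stops early, because the generator never reaches its return.
    """
    yielded = []
    while True:
        try:
            item = next(gen)
        except StopIteration as stop:
            # Generator finished normally: its return value is on the exception.
            return yielded, stop.value
        yielded.append(item)
        if item < stop_below:
            gen.close()  # prune: abandon the remaining cycles
            return yielded, None
```

This keeps the pruning decision entirely on the consumer side, consistent with the design described for the evaluator: the generator only reports, and the caller decides when to stop.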
- evaluate(run_fingerprint)[source]
Run walk-forward evaluation.
- Parameters:
run_fingerprint (dict) – Run configuration
- Returns:
Comprehensive evaluation results
- Return type:
- print_report(result)[source]
Print a human-readable evaluation report to stdout.
Shows per-cycle IS/OOS metrics in a tabular layout, aggregate statistics, Rademacher diagnostics (if available), and the effectiveness verdict.
- Parameters:
result (EvaluationResult) – Completed evaluation result to display.
- compare_trainers(run_fingerprint, trainers, verbose=True)[source]
Compare multiple trainers on the same data.
- Parameters:
run_fingerprint (dict) – Run configuration
trainers (Dict[str, TrainingEvaluator]) – Dictionary of name -> evaluator
verbose (bool) – Print progress and summary
- Returns:
Results keyed by trainer name
- Return type:
Dict[str, EvaluationResult]
Example
>>> results = compare_trainers(
...     run_fingerprint,
...     trainers={
...         "sgd_500": TrainingEvaluator.from_runner(
...             "train_on_historic_data", max_iterations=500
...         ),
...         "sgd_100": TrainingEvaluator.from_runner(
...             "train_on_historic_data", max_iterations=100
...         ),
...         "random": TrainingEvaluator.random_baseline(),
...     },
... )