Walk-Forward Analysis
Walk-forward analysis (WFA) is the gold standard for assessing whether a trained strategy generalises beyond its training data. Rather than training once and hoping for the best, WFA trains on rolling windows and evaluates on subsequent out-of-sample (OOS) periods — the closest thing to a live test you can run on historical data.
This tutorial covers:
Why walk-forward analysis matters
Running a basic walk-forward evaluation
Interpreting WFE and IS-OOS gap
Adding Rademacher complexity estimation
Comparing training approaches
Walk-forward with early stopping and regularisation
Why Walk-Forward Analysis?
Standard backtests train on all the data and report the result. This tells you how well the strategy fits history, not how it would have performed in real time. WFA addresses this by mimicking the deployment cycle:
Train on data up to time t
Deploy (evaluate) on the next unseen period t to t + Δ
Roll forward and repeat
The key insight from Pardo (2008): if a strategy’s OOS performance is consistently close to its in-sample performance across multiple cycles, you have evidence of genuine generalisation rather than overfitting.
Basic Walk-Forward Evaluation
The TrainingEvaluator
orchestrates the full workflow: generating cycles, training, evaluating, and
aggregating results.
from quantammsim.runners.training_evaluator import TrainingEvaluator
# Configure the base run fingerprint
run_fingerprint = {
"tokens": ["BTC", "ETH"],
"rule": "mean_reversion_channel",
"startDateString": "2023-01-01 00:00:00",
"endDateString": "2024-07-01 00:00:00",
"initial_pool_value": 1_000_000.0,
"fees": 0.003,
"do_arb": True,
"return_val": "daily_log_sharpe",
"chunk_period": 1440,
"bout_offset": 1440 * 14, # 2-week offset
"optimisation_settings": {
"method": "gradient_descent",
"optimiser": "adam",
"base_lr": 0.05,
"n_iterations": 300,
"batch_size": 16,
"n_parameter_sets": 4,
"use_gradient_clipping": True,
"clip_norm": 10.0,
},
}
# Create evaluator wrapping the standard trainer
evaluator = TrainingEvaluator.from_runner(
"train_on_historic_data",
n_cycles=4, # 4 train/test cycles
verbose=True,
)
# Run the full walk-forward evaluation
result = evaluator.evaluate(run_fingerprint)
# Print summary report
evaluator.print_report(result)
The evaluator divides the date range into n_cycles + 1 equal segments.
Each cycle trains on one segment and tests on the next:
|--- Seg 0 ---|--- Seg 1 ---|--- Seg 2 ---|--- Seg 3 ---|--- Seg 4 ---|
| Train C0 | Test C0 | | | |
| | Train C1 | Test C1 | | |
| | | Train C2 | Test C2 | |
| | | | Train C3 | Test C3 |
Rolling vs Expanding Windows
By default, training windows roll forward — each cycle trains only on its
own segment. Set keep_fixed_start=True for expanding windows, where
training always starts from the beginning:
# Expanding window: later cycles see more data
evaluator = TrainingEvaluator.from_runner(
"train_on_historic_data",
n_cycles=4,
keep_fixed_start=True,
)
Rolling windows test how the strategy adapts to regime changes. Expanding windows test whether more data always helps (and can reveal when old data hurts).
Interpreting Results
The EvaluationResult
contains per-cycle evaluations and aggregate statistics.
Walk-Forward Efficiency (WFE)
WFE is the ratio of OOS to IS performance:
Rules of thumb (Pardo, 2008):
WFE > 0.5 — Suggests robustness; the strategy retains at least half its in-sample edge out of sample.
WFE ≈ 1.0 — Ideal. OOS performance matches IS (no overfitting).
WFE > 1.0 — OOS outperformed IS. Possible with mean-reverting strategies in favorable market conditions.
WFE < 0.3 — Red flag. Strategy likely overfits to training data.
print(f"Mean WFE: {result.mean_wfe:.3f}")
print(f"Mean OOS Sharpe: {result.mean_oos_sharpe:.3f}")
print(f"Worst OOS Sharpe: {result.worst_oos_sharpe:.3f}")
# Per-cycle breakdown
for c in result.cycles:
print(
f"Cycle {c.cycle_number}: "
f"IS={c.is_sharpe:.2f}, OOS={c.oos_sharpe:.2f}, "
f"WFE={c.walk_forward_efficiency:.2f}"
)
IS-OOS Gap
The gap IS Sharpe - OOS Sharpe directly measures overfitting. A large
positive gap means the strategy performed much better in-sample than out.
print(f"Mean IS-OOS gap: {result.mean_is_oos_gap:.3f}")
# Investigate cycle-level gaps
for c in result.cycles:
flag = " ⚠ OVERFIT" if c.is_oos_gap > 0.5 else ""
print(f"Cycle {c.cycle_number}: gap = {c.is_oos_gap:.3f}{flag}")
Rademacher Complexity
Rademacher complexity (Paleologo, 2024) provides a data-dependent upper bound on overfitting. It measures how well the set of strategies explored during training can fit random noise — a strategy class with high Rademacher complexity can fit anything, which means observed performance may be spurious.
Enable checkpoint tracking and Rademacher computation:
evaluator = TrainingEvaluator.from_runner(
"train_on_historic_data",
n_cycles=4,
compute_rademacher=True, # Track parameter checkpoints
)
result = evaluator.evaluate(run_fingerprint)
print(f"Rademacher complexity: {result.aggregate_rademacher:.4f}")
print(f"Adjusted OOS Sharpe: {result.adjusted_mean_oos_sharpe:.3f}")
The Rademacher haircut adjusts observed OOS performance downward:
where \(\hat{R}\) is the empirical Rademacher complexity and \(T\) is the number of test periods. If the adjusted Sharpe is still positive, you have stronger evidence that performance is genuine.
for c in result.cycles:
if c.rademacher_complexity is not None:
print(
f"Cycle {c.cycle_number}: "
f"OOS Sharpe={c.oos_sharpe:.2f}, "
f"R̂={c.rademacher_complexity:.4f}, "
f"Adjusted={c.adjusted_oos_sharpe:.2f}"
)
Comparing Training Approaches
Use compare_trainers() to
benchmark different training configurations side-by-side:
from quantammsim.runners.training_evaluator import (
TrainingEvaluator,
compare_trainers,
)
comparison = compare_trainers(
run_fingerprint,
trainers={
"sgd_conservative": TrainingEvaluator.from_runner(
"train_on_historic_data",
n_cycles=4,
max_iterations=200,
),
"sgd_aggressive": TrainingEvaluator.from_runner(
"train_on_historic_data",
n_cycles=4,
max_iterations=2000,
),
"random_baseline": TrainingEvaluator.random_baseline(n_cycles=4),
},
)
# comparison is a dict of {name: EvaluationResult}
for name, res in comparison.items():
print(f"{name}: WFE={res.mean_wfe:.2f}, OOS Sharpe={res.mean_oos_sharpe:.3f}")
The random baseline is essential — it trains with random parameters and tells you how much of the OOS performance comes from the strategy class structure versus the optimisation itself. If your tuned strategy barely beats random, the optimisation is adding noise, not signal.
Walk-Forward with Regularisation
WFA benefits enormously from regularisation features. Here’s a complete example combining early stopping, price noise, and turnover penalty:
run_fingerprint = {
"tokens": ["BTC", "ETH", "SOL"],
"rule": "mean_reversion_channel",
"startDateString": "2022-06-01 00:00:00",
"endDateString": "2024-06-01 00:00:00",
"initial_pool_value": 1_000_000.0,
"fees": 0.003,
"do_arb": True,
"return_val": "daily_log_sharpe",
"chunk_period": 1440,
"bout_offset": 1440 * 14,
# Regularisation
"price_noise_sigma": 0.001, # Multiplicative log-normal noise
"turnover_penalty": 0.01, # Penalise excessive weight changes
"include_flipped_training_data": True, # Time-reversed augmentation
"optimisation_settings": {
"method": "gradient_descent",
"optimiser": "adam",
"base_lr": 0.05,
"n_iterations": 1000,
"batch_size": 16,
"n_parameter_sets": 4,
"use_gradient_clipping": True,
"clip_norm": 10.0,
# Early stopping on validation set
"early_stopping": True,
"early_stopping_patience": 100,
"early_stopping_metric": "daily_log_sharpe",
"val_fraction": 0.2,
# SWA for flatter optima
"use_swa": True,
"swa_start_frac": 0.75,
"swa_freq": 5,
},
}
evaluator = TrainingEvaluator.from_runner(
"train_on_historic_data",
n_cycles=5,
compute_rademacher=True,
)
result = evaluator.evaluate(run_fingerprint)
evaluator.print_report(result)
Choosing the WFE Metric
By default, WFE is computed from annualised Sharpe ratios (Pardo’s original definition). You can change this to any metric from Metrics Reference:
# Use Calmar ratio for drawdown-sensitive WFE
evaluator = TrainingEvaluator.from_runner(
"train_on_historic_data",
n_cycles=4,
wfe_metric="calmar",
)
Warm Starting
Parameters from each cycle can seed the next cycle’s training. This is
enabled by default in the evaluator — the warm_start_params from cycle
n become the initial parameters for cycle n + 1. This mirrors deployment
where you’d warm-start retraining from the last known good parameters.
Generating Cycles Manually
For advanced use, generate cycle specifications directly:
from quantammsim.runners.robust_walk_forward import (
generate_walk_forward_cycles,
)
cycles = generate_walk_forward_cycles(
start_date="2022-01-01 00:00:00",
end_date="2024-01-01 00:00:00",
n_cycles=6,
keep_fixed_start=False, # Rolling windows
)
for c in cycles:
print(
f"Cycle {c.cycle_number}: "
f"Train {c.train_start_date} → {c.train_end_date}, "
f"Test {c.test_start_date} → {c.test_end_date}"
)
Decision Framework
Diagnostic |
Healthy |
Action if Unhealthy |
|---|---|---|
Mean WFE |
> 0.5 |
Add regularisation, reduce model complexity |
IS-OOS gap |
< 0.3 |
Enable early stopping, add price noise |
Rademacher complexity |
< 0.05 |
Reduce training iterations, use SWA |
Worst OOS Sharpe |
> 0.0 |
Check for regime sensitivity, use expanding windows |
|
|
Review |
See Also
Robustness Features — Full robustness feature guide
Metrics Reference — Available training and evaluation metrics
Hyperparameter Tuning — Optimise training hyperparameters using WFA as the objective
Ensemble Training — Ensemble averaging for implicit regularisation
quantammsim.runners.training_evaluator— API referencequantammsim.runners.robust_walk_forward— Rademacher and WFE utilities