Walk-Forward Analysis

Walk-forward analysis (WFA) is the gold standard for assessing whether a trained strategy generalises beyond its training data. Rather than training once and hoping for the best, WFA trains on rolling windows and evaluates on subsequent out-of-sample (OOS) periods — the closest thing to a live test you can run on historical data.

This tutorial covers:

Why walk-forward analysis matters
Running a basic walk-forward evaluation
Interpreting WFE and IS-OOS gap
Adding Rademacher complexity estimation
Comparing training approaches
Walk-forward with early stopping and regularisation

Why Walk-Forward Analysis?

Standard backtests train on all the data and report the result. This tells you how well the strategy fits history, not how it would have performed in real time. WFA addresses this by mimicking the deployment cycle:

Train on data up to time t
Deploy (evaluate) on the next unseen period t to t + Δ
Roll forward and repeat

The key insight from Pardo (2008): if a strategy’s OOS performance is consistently close to its in-sample performance across multiple cycles, you have evidence of genuine generalisation rather than overfitting.
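
The train, deploy, roll cycle above can be sketched without any library. This is a minimal toy, not quantammsim's implementation: the `walk_forward` function, the sign-of-the-mean "strategy", and the synthetic returns are all assumptions made for illustration.

```python
import numpy as np

def walk_forward(returns, train_len, test_len):
    """Toy walk-forward loop: fit on [t - train_len, t), evaluate on [t, t + test_len)."""
    results = []
    t = train_len
    while t + test_len <= len(returns):
        train = returns[t - train_len:t]
        test = returns[t:t + test_len]
        # "Training": go long (+1) or short (-1) based on the in-sample mean return
        position = 1.0 if train.mean() >= 0 else -1.0
        results.append({
            "t": t,
            "is_mean": position * train.mean(),   # in-sample result, never negative
            "oos_mean": position * test.mean(),   # result on the unseen period
        })
        t += test_len  # roll forward by one test period
    return results

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=500)  # synthetic returns, pure noise
cycles = walk_forward(returns, train_len=200, test_len=100)
for c in cycles:
    print(c)
```

Because the data is pure noise, the in-sample figure is non-negative by construction while the OOS figure can take either sign, which is exactly the gap that WFA is designed to expose.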

Basic Walk-Forward Evaluation

The TrainingEvaluator orchestrates the full workflow: generating cycles, training, evaluating, and aggregating results.

from quantammsim.runners.training_evaluator import TrainingEvaluator

# Configure the base run fingerprint
run_fingerprint = {
    "tokens": ["BTC", "ETH"],
    "rule": "mean_reversion_channel",
    "startDateString": "2023-01-01 00:00:00",
    "endDateString": "2024-07-01 00:00:00",
    "initial_pool_value": 1_000_000.0,
    "fees": 0.003,
    "do_arb": True,
    "return_val": "daily_log_sharpe",
    "chunk_period": 1440,
    "bout_offset": 1440 * 14,  # 2-week offset
    "optimisation_settings": {
        "method": "gradient_descent",
        "optimiser": "adam",
        "base_lr": 0.05,
        "n_iterations": 300,
        "batch_size": 16,
        "n_parameter_sets": 4,
        "use_gradient_clipping": True,
        "clip_norm": 10.0,
    },
}

# Create evaluator wrapping the standard trainer
evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    n_cycles=4,          # 4 train/test cycles
    verbose=True,
)

# Run the full walk-forward evaluation
result = evaluator.evaluate(run_fingerprint)

# Print summary report
evaluator.print_report(result)

The evaluator divides the date range into n_cycles + 1 equal segments. Each cycle trains on one segment and tests on the next:

|--- Seg 0 ---|--- Seg 1 ---|--- Seg 2 ---|--- Seg 3 ---|--- Seg 4 ---|
|  Train C0   |  Test C0    |             |             |             |
|             |  Train C1   |  Test C1    |             |             |
|             |             |  Train C2   |  Test C2    |             |
|             |             |             |  Train C3   |  Test C3    |

Rolling vs Expanding Windows

By default, training windows roll forward — each cycle trains only on its own segment. Set keep_fixed_start=True for expanding windows, where training always starts from the beginning:

# Expanding window: later cycles see more data
evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    n_cycles=4,
    keep_fixed_start=True,
)

Rolling windows test how the strategy adapts to regime changes. Expanding windows test whether more data always helps (and can reveal when old data hurts).
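
To make the segment layout concrete, here is a minimal sketch of how the date range could be split under both modes. `segment_cycles` is a hypothetical helper written for this tutorial, not part of quantammsim; the library's own boundaries may differ (for example, if it aligns them to chunk periods).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def segment_cycles(start, end, n_cycles, keep_fixed_start=False):
    """Split [start, end] into n_cycles + 1 equal segments and pair them into cycles."""
    t0 = datetime.strptime(start, FMT)
    t1 = datetime.strptime(end, FMT)
    seg = (t1 - t0) / (n_cycles + 1)  # length of one segment
    cycles = []
    for k in range(n_cycles):
        cycles.append({
            "cycle": k,
            # Expanding windows always start at t0; rolling windows start at segment k
            "train_start": t0 if keep_fixed_start else t0 + k * seg,
            "train_end": t0 + (k + 1) * seg,
            "test_start": t0 + (k + 1) * seg,
            "test_end": t0 + (k + 2) * seg,
        })
    return cycles

start, end = "2023-01-01 00:00:00", "2024-07-01 00:00:00"
rolling = segment_cycles(start, end, n_cycles=4)
expanding = segment_cycles(start, end, n_cycles=4, keep_fixed_start=True)

for r, e in zip(rolling, expanding):
    print(
        f"Cycle {r['cycle']}: rolling train length = {r['train_end'] - r['train_start']}, "
        f"expanding train length = {e['train_end'] - e['train_start']}"
    )
```

In this sketch the test windows are identical in both modes; only the amount of training history differs.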

Interpreting Results

The EvaluationResult contains per-cycle evaluations and aggregate statistics.

Walk-Forward Efficiency (WFE)

WFE is the ratio of OOS to IS performance:

\[\text{WFE} = \frac{\text{OOS Sharpe}}{\text{IS Sharpe}}\]

Rules of thumb (Pardo, 2008):

WFE > 0.5: Suggests robustness; the strategy retains at least half its in-sample edge out of sample.
WFE ≈ 1.0: Ideal. OOS performance matches IS (no overfitting).
WFE > 1.0: OOS outperformed IS. Possible with mean-reverting strategies in favourable market conditions.
WFE < 0.3: Red flag. Strategy likely overfits to training data.

print(f"Mean WFE: {result.mean_wfe:.3f}")
print(f"Mean OOS Sharpe: {result.mean_oos_sharpe:.3f}")
print(f"Worst OOS Sharpe: {result.worst_oos_sharpe:.3f}")

# Per-cycle breakdown
for c in result.cycles:
    print(
        f"Cycle {c.cycle_number}: "
        f"IS={c.is_sharpe:.2f}, OOS={c.oos_sharpe:.2f}, "
        f"WFE={c.walk_forward_efficiency:.2f}"
    )
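
For intuition, the ratio and the rules of thumb above can be reproduced by hand. This is a sketch under two assumptions that do not come from the library: a non-positive IS Sharpe is treated as making the ratio undefined (returned as NaN), and values between 0.3 and 0.5, which the rules above do not cover, are labelled "inconclusive". Both function names are hypothetical.

```python
import math

def walk_forward_efficiency(is_sharpe, oos_sharpe):
    """Return OOS / IS, or NaN when the IS Sharpe is not positive (assumption)."""
    if is_sharpe <= 0:
        return math.nan
    return oos_sharpe / is_sharpe

def classify_wfe(wfe):
    """Map a WFE value onto the rules of thumb from the text."""
    if math.isnan(wfe):
        return "undefined"
    if wfe < 0.3:
        return "red flag"
    if wfe > 0.5:
        return "robust"
    return "inconclusive"  # 0.3 to 0.5: not covered by the rules, label is an assumption

for is_s, oos_s in [(2.0, 1.2), (1.5, 0.3), (2.0, 0.8), (-0.5, 0.4)]:
    wfe = walk_forward_efficiency(is_s, oos_s)
    print(f"IS={is_s:.2f}, OOS={oos_s:.2f}, WFE={wfe:.2f} -> {classify_wfe(wfe)}")
```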

IS-OOS Gap

The gap IS Sharpe - OOS Sharpe directly measures overfitting. A large positive gap means the strategy performed much better in-sample than out.

print(f"Mean IS-OOS gap: {result.mean_is_oos_gap:.3f}")

# Investigate cycle-level gaps
for c in result.cycles:
    flag = " ⚠ OVERFIT" if c.is_oos_gap > 0.5 else ""
    print(f"Cycle {c.cycle_number}: gap = {c.is_oos_gap:.3f}{flag}")

Rademacher Complexity

Rademacher complexity (Paleologo, 2024) provides a data-dependent upper bound on overfitting. It measures how well the set of strategies explored during training can fit random noise — a strategy class with high Rademacher complexity can fit anything, which means observed performance may be spurious.

Enable checkpoint tracking and Rademacher computation:

evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    n_cycles=4,
    compute_rademacher=True,  # Track parameter checkpoints
)

result = evaluator.evaluate(run_fingerprint)

print(f"Rademacher complexity: {result.aggregate_rademacher:.4f}")
print(f"Adjusted OOS Sharpe: {result.adjusted_mean_oos_sharpe:.3f}")

The Rademacher haircut adjusts observed OOS performance downward:

\[\theta_n \geq \hat{\theta}_n - 2\hat{R} - 3\sqrt{\frac{2\log(2/\delta)}{T}}\]

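where \(\hat{\theta}_n\) is the observed performance of strategy \(n\), \(\theta_n\) is its true expected performance, \(\hat{R}\) is the empirical Rademacher complexity, \(T\) is the number of test periods, and the bound holds with probability at least \(1 - \delta\). If the adjusted Sharpe is still positive, you have stronger evidence that performance is genuine.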

for c in result.cycles:
    if c.rademacher_complexity is not None:
        print(
            f"Cycle {c.cycle_number}: "
            f"OOS Sharpe={c.oos_sharpe:.2f}, "
            f"R̂={c.rademacher_complexity:.4f}, "
            f"Adjusted={c.adjusted_oos_sharpe:.2f}"
        )
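
The tutorial does not show how the evaluator turns parameter checkpoints into \(\hat{R}\), so the following is only a sketch of the textbook definition: \(\hat{R}\) is the expected best correlation between any candidate's per-period performance and a vector of random signs. The function names, the Monte Carlo estimator, and the synthetic performance matrices are assumptions for illustration.

```python
import numpy as np

def empirical_rademacher(perf, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_hat for a (T, N) performance matrix.

    perf[t, s] is the performance of candidate strategy s in period t.
    """
    rng = np.random.default_rng(seed)
    T = perf.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, T))  # random sign vectors
    # For each sign vector, take the best fit to noise achieved by any candidate
    best_fit = (sigma @ perf / T).max(axis=1)
    return float(best_fit.mean())

def haircut_lower_bound(theta_hat, r_hat, T, delta=0.05):
    """Apply the haircut formula from the text to an observed performance value."""
    return theta_hat - 2.0 * r_hat - 3.0 * np.sqrt(2.0 * np.log(2.0 / delta) / T)

rng = np.random.default_rng(1)
T = 100
one = rng.normal(size=(T, 1))     # a single candidate strategy
many = rng.normal(size=(T, 50))   # fifty candidates explored during training

r_one = empirical_rademacher(one)
r_many = empirical_rademacher(many)
print(f"R_hat with 1 candidate:   {r_one:.4f}")
print(f"R_hat with 50 candidates: {r_many:.4f}")
print(f"Lower bound for theta_hat=1.0: {haircut_lower_bound(1.0, r_many, T):.4f}")
```

The more candidates the search explores, the larger \(\hat{R}\) becomes, and the larger the haircut applied to the observed result.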

Comparing Training Approaches

Use compare_trainers() to benchmark different training configurations side-by-side:

from quantammsim.runners.training_evaluator import (
    TrainingEvaluator,
    compare_trainers,
)

comparison = compare_trainers(
    run_fingerprint,
    trainers={
        "sgd_conservative": TrainingEvaluator.from_runner(
            "train_on_historic_data",
            n_cycles=4,
            max_iterations=200,
        ),
        "sgd_aggressive": TrainingEvaluator.from_runner(
            "train_on_historic_data",
            n_cycles=4,
            max_iterations=2000,
        ),
        "random_baseline": TrainingEvaluator.random_baseline(n_cycles=4),
    },
)

# comparison is a dict of {name: EvaluationResult}
for name, res in comparison.items():
    print(f"{name}: WFE={res.mean_wfe:.2f}, OOS Sharpe={res.mean_oos_sharpe:.3f}")

The random baseline is essential: it uses random parameters in place of optimised ones, which tells you how much of the OOS performance comes from the structure of the strategy class and how much from the optimisation itself. If your tuned strategy barely beats random, the optimisation is adding noise, not signal.

Walk-Forward with Regularisation

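WFA pairs naturally with the regularisation features, since it measures the overfitting they are meant to reduce. Here’s a complete example combining price noise, a turnover penalty, time-reversed data augmentation, early stopping, and stochastic weight averaging (SWA):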

run_fingerprint = {
    "tokens": ["BTC", "ETH", "SOL"],
    "rule": "mean_reversion_channel",
    "startDateString": "2022-06-01 00:00:00",
    "endDateString": "2024-06-01 00:00:00",
    "initial_pool_value": 1_000_000.0,
    "fees": 0.003,
    "do_arb": True,
    "return_val": "daily_log_sharpe",
    "chunk_period": 1440,
    "bout_offset": 1440 * 14,

    # Regularisation
    "price_noise_sigma": 0.001,          # Multiplicative log-normal noise
    "turnover_penalty": 0.01,            # Penalise excessive weight changes
    "include_flipped_training_data": True,  # Time-reversed augmentation

    "optimisation_settings": {
        "method": "gradient_descent",
        "optimiser": "adam",
        "base_lr": 0.05,
        "n_iterations": 1000,
        "batch_size": 16,
        "n_parameter_sets": 4,
        "use_gradient_clipping": True,
        "clip_norm": 10.0,

        # Early stopping on validation set
        "early_stopping": True,
        "early_stopping_patience": 100,
        "early_stopping_metric": "daily_log_sharpe",
        "val_fraction": 0.2,

        # SWA for flatter optima
        "use_swa": True,
        "swa_start_frac": 0.75,
        "swa_freq": 5,
    },
}

evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    n_cycles=5,
    compute_rademacher=True,
)

result = evaluator.evaluate(run_fingerprint)
evaluator.print_report(result)
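
As an illustration of one of these settings, the comment on price_noise_sigma describes multiplicative log-normal noise. A minimal sketch of that idea follows; `add_price_noise` is a hypothetical helper, and whether the library draws noise independently per timestep and token, or resamples it on every iteration, is not stated in this tutorial.

```python
import numpy as np

def add_price_noise(prices, sigma, seed=None):
    """Multiply each price by exp(sigma * z) with z ~ N(0, 1) drawn independently."""
    rng = np.random.default_rng(seed)
    noise = np.exp(sigma * rng.standard_normal(prices.shape))
    return prices * noise

prices = np.linspace(100.0, 110.0, 1000)  # toy price path
noisy = add_price_noise(prices, sigma=0.001, seed=0)

# With sigma = 0.001 the perturbation is on the order of 0.1% per observation
print(f"Max relative deviation: {np.max(np.abs(noisy / prices - 1)):.5f}")
```

Because the noise is multiplicative, perturbed prices stay strictly positive, which an additive Gaussian perturbation would not guarantee.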

Choosing the WFE Metric

By default, WFE is computed from annualised Sharpe ratios (Pardo’s original definition). You can change this to any metric listed in the Metrics Reference:

# Use Calmar ratio for drawdown-sensitive WFE
evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    n_cycles=4,
    wfe_metric="calmar",
)

Warm Starting

Parameters from each cycle can seed the next cycle’s training. This is enabled by default in the evaluator — the warm_start_params from cycle n become the initial parameters for cycle n + 1. This mirrors deployment where you’d warm-start retraining from the last known good parameters.
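
The effect of warm starting can be seen in a toy loop. Everything here is a stand-in: `toy_train` merely moves a scalar parameter part of the way towards a target, and `run_cycles` is a hypothetical driver, not the evaluator's internals.

```python
def toy_train(target, init, lr=0.5):
    """Stand-in for a trainer: take one step from `init` towards `target`."""
    return init + lr * (target - init)

def run_cycles(targets, init, warm_start=True):
    """Train once per cycle, optionally seeding each cycle with the previous result."""
    params, history = init, []
    for target in targets:
        start = params if warm_start else init
        params = toy_train(target, start)
        history.append(params)
    return history

warm = run_cycles([1.0, 1.0, 1.0], init=0.0)                    # [0.5, 0.75, 0.875]
cold = run_cycles([1.0, 1.0, 1.0], init=0.0, warm_start=False)  # [0.5, 0.5, 0.5]
print("warm:", warm)
print("cold:", cold)
```

With a fixed training budget per cycle, the warm-started run keeps building on earlier progress, while the cold-started run repeats the same first step every time.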

Generating Cycles Manually

For advanced use, generate cycle specifications directly:

from quantammsim.runners.robust_walk_forward import (
    generate_walk_forward_cycles,
)

cycles = generate_walk_forward_cycles(
    start_date="2022-01-01 00:00:00",
    end_date="2024-01-01 00:00:00",
    n_cycles=6,
    keep_fixed_start=False,  # Rolling windows
)

for c in cycles:
    print(
        f"Cycle {c.cycle_number}: "
        f"Train {c.train_start_date} → {c.train_end_date}, "
        f"Test {c.test_start_date} → {c.test_end_date}"
    )

Decision Framework

Diagnostic	Healthy	Action if Unhealthy
Mean WFE	> 0.5	Add regularisation, reduce model complexity
IS-OOS gap	< 0.3	Enable early stopping, add price noise
Rademacher complexity	< 0.05	Reduce training iterations, use SWA
Worst OOS Sharpe	> 0.0	Check for regime sensitivity, use expanding windows
`is_effective`	`True`	Review `effectiveness_reasons` for specifics
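
The threshold rows of the table can be turned into a simple check. This is a sketch only: `diagnose` is a hypothetical helper that takes plain numbers, and it does not reproduce whatever logic the library uses to set is_effective.

```python
def diagnose(mean_wfe, mean_is_oos_gap, rademacher, worst_oos_sharpe):
    """Return the suggested action for each diagnostic outside its healthy range."""
    checks = [
        ("Mean WFE", mean_wfe > 0.5,
         "Add regularisation, reduce model complexity"),
        ("IS-OOS gap", mean_is_oos_gap < 0.3,
         "Enable early stopping, add price noise"),
        ("Rademacher complexity", rademacher < 0.05,
         "Reduce training iterations, use SWA"),
        ("Worst OOS Sharpe", worst_oos_sharpe > 0.0,
         "Check for regime sensitivity, use expanding windows"),
    ]
    return {name: action for name, healthy, action in checks if not healthy}

# Illustrative numbers, not results from a real run
issues = diagnose(mean_wfe=0.4, mean_is_oos_gap=0.1, rademacher=0.02, worst_oos_sharpe=-0.1)
for name, action in issues.items():
    print(f"{name}: {action}")
```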