Walk-Forward Analysis
=====================

Walk-forward analysis (WFA) is the gold standard for assessing whether a
trained strategy generalises beyond its training data.  Rather than training
once and hoping for the best, WFA trains on rolling windows and evaluates on
subsequent out-of-sample (OOS) periods — the closest thing to a live test you
can run on historical data.

This tutorial covers:

1. Why walk-forward analysis matters
2. Running a basic walk-forward evaluation
3. Interpreting WFE and IS-OOS gap
4. Adding Rademacher complexity estimation
5. Comparing training approaches
6. Walk-forward with early stopping and regularisation


Why Walk-Forward Analysis?
--------------------------

Standard backtests train on *all* the data and report the result.  This tells
you how well the strategy fits history, not how it would have performed in
real time.  WFA addresses this by mimicking the deployment cycle:

1. Train on data up to time *t*
2. Deploy (evaluate) on the next unseen period *t* to *t + Δ*
3. Roll forward and repeat

The key insight from Pardo (2008): if a strategy's OOS performance is
consistently close to its in-sample performance across multiple cycles, you
have evidence of genuine generalisation rather than overfitting.


Basic Walk-Forward Evaluation
-----------------------------

The :class:`~quantammsim.runners.training_evaluator.TrainingEvaluator`
orchestrates the full workflow: generating cycles, training, evaluating, and
aggregating results.

.. code-block:: python

    from quantammsim.runners.training_evaluator import TrainingEvaluator

    # Configure the base run fingerprint
    run_fingerprint = {
        "tokens": ["BTC", "ETH"],
        "rule": "mean_reversion_channel",
        "startDateString": "2023-01-01 00:00:00",
        "endDateString": "2024-07-01 00:00:00",
        "initial_pool_value": 1_000_000.0,
        "fees": 0.003,
        "do_arb": True,
        "return_val": "daily_log_sharpe",
        "chunk_period": 1440,
        "bout_offset": 1440 * 14,  # 2-week offset
        "optimisation_settings": {
            "method": "gradient_descent",
            "optimiser": "adam",
            "base_lr": 0.05,
            "n_iterations": 300,
            "batch_size": 16,
            "n_parameter_sets": 4,
            "use_gradient_clipping": True,
            "clip_norm": 10.0,
        },
    }

    # Create evaluator wrapping the standard trainer
    evaluator = TrainingEvaluator.from_runner(
        "train_on_historic_data",
        n_cycles=4,          # 4 train/test cycles
        verbose=True,
    )

    # Run the full walk-forward evaluation
    result = evaluator.evaluate(run_fingerprint)

    # Print summary report
    evaluator.print_report(result)

The evaluator divides the date range into ``n_cycles + 1`` equal segments.
Each cycle trains on one segment and tests on the next:

.. code-block:: text

    |--- Seg 0 ---|--- Seg 1 ---|--- Seg 2 ---|--- Seg 3 ---|--- Seg 4 ---|
    |  Train C0   |  Test C0    |             |             |             |
    |             |  Train C1   |  Test C1    |             |             |
    |             |             |  Train C2   |  Test C2    |             |
    |             |             |             |  Train C3   |  Test C3    |


Rolling vs Expanding Windows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, training windows *roll forward* — each cycle trains only on its
own segment.  Set ``keep_fixed_start=True`` for *expanding windows*, where
training always starts from the beginning:

.. code-block:: python

    # Expanding window: later cycles see more data
    evaluator = TrainingEvaluator.from_runner(
        "train_on_historic_data",
        n_cycles=4,
        keep_fixed_start=True,
    )

Rolling windows test how the strategy adapts to regime changes.  Expanding
windows test whether more data always helps (and can reveal when old data
hurts).


Interpreting Results
--------------------

The :class:`~quantammsim.runners.training_evaluator.EvaluationResult`
contains per-cycle evaluations and aggregate statistics.

Walk-Forward Efficiency (WFE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

WFE is the ratio of OOS to IS performance:

.. math::

   \text{WFE} = \frac{\text{OOS Sharpe}}{\text{IS Sharpe}}

Rules of thumb (Pardo, 2008):

* **WFE > 0.5** — Suggests robustness; the strategy retains at least half
  its in-sample edge out of sample.
* **WFE ≈ 1.0** — Ideal. OOS performance matches IS (no overfitting).
* **WFE > 1.0** — OOS outperformed IS. Possible with mean-reverting strategies
  in favorable market conditions.
* **WFE < 0.3** — Red flag. Strategy likely overfits to training data.

.. code-block:: python

    print(f"Mean WFE: {result.mean_wfe:.3f}")
    print(f"Mean OOS Sharpe: {result.mean_oos_sharpe:.3f}")
    print(f"Worst OOS Sharpe: {result.worst_oos_sharpe:.3f}")

    # Per-cycle breakdown
    for c in result.cycles:
        print(
            f"Cycle {c.cycle_number}: "
            f"IS={c.is_sharpe:.2f}, OOS={c.oos_sharpe:.2f}, "
            f"WFE={c.walk_forward_efficiency:.2f}"
        )


IS-OOS Gap
~~~~~~~~~~

The gap ``IS Sharpe - OOS Sharpe`` directly measures overfitting.  A large
positive gap means the strategy performed much better in-sample than out.

.. code-block:: python

    print(f"Mean IS-OOS gap: {result.mean_is_oos_gap:.3f}")

    # Investigate cycle-level gaps
    for c in result.cycles:
        flag = " ⚠ OVERFIT" if c.is_oos_gap > 0.5 else ""
        print(f"Cycle {c.cycle_number}: gap = {c.is_oos_gap:.3f}{flag}")


Rademacher Complexity
---------------------

Rademacher complexity (Paleologo, 2024) provides a data-dependent upper
bound on overfitting.  It measures how well the *set of strategies explored
during training* can fit random noise — a strategy class with high Rademacher
complexity can fit anything, which means observed performance may be spurious.

Enable checkpoint tracking and Rademacher computation:

.. code-block:: python

    evaluator = TrainingEvaluator.from_runner(
        "train_on_historic_data",
        n_cycles=4,
        compute_rademacher=True,  # Track parameter checkpoints
    )

    result = evaluator.evaluate(run_fingerprint)

    print(f"Rademacher complexity: {result.aggregate_rademacher:.4f}")
    print(f"Adjusted OOS Sharpe: {result.adjusted_mean_oos_sharpe:.3f}")

The **Rademacher haircut** adjusts observed OOS performance downward:

.. math::

   \theta_n \geq \hat{\theta}_n - 2\hat{R}
   - 3\sqrt{\frac{2\log(2/\delta)}{T}}

where :math:`\hat{R}` is the empirical Rademacher complexity and :math:`T` is
the number of test periods.  If the adjusted Sharpe is still positive, you have
stronger evidence that performance is genuine.

.. code-block:: python

    for c in result.cycles:
        if c.rademacher_complexity is not None:
            print(
                f"Cycle {c.cycle_number}: "
                f"OOS Sharpe={c.oos_sharpe:.2f}, "
                f"R̂={c.rademacher_complexity:.4f}, "
                f"Adjusted={c.adjusted_oos_sharpe:.2f}"
            )


Comparing Training Approaches
------------------------------

Use :func:`~quantammsim.runners.training_evaluator.compare_trainers` to
benchmark different training configurations side-by-side:

.. code-block:: python

    from quantammsim.runners.training_evaluator import (
        TrainingEvaluator,
        compare_trainers,
    )

    comparison = compare_trainers(
        run_fingerprint,
        trainers={
            "sgd_conservative": TrainingEvaluator.from_runner(
                "train_on_historic_data",
                n_cycles=4,
                max_iterations=200,
            ),
            "sgd_aggressive": TrainingEvaluator.from_runner(
                "train_on_historic_data",
                n_cycles=4,
                max_iterations=2000,
            ),
            "random_baseline": TrainingEvaluator.random_baseline(n_cycles=4),
        },
    )

    # comparison is a dict of {name: EvaluationResult}
    for name, res in comparison.items():
        print(f"{name}: WFE={res.mean_wfe:.2f}, OOS Sharpe={res.mean_oos_sharpe:.3f}")

The random baseline is essential — it trains with random parameters and tells
you how much of the OOS performance comes from the strategy class structure
versus the optimisation itself.  If your tuned strategy barely beats random,
the optimisation is adding noise, not signal.


Walk-Forward with Regularisation
---------------------------------

WFA benefits enormously from regularisation features.  Here's a complete
example combining early stopping, price noise, and turnover penalty:

.. code-block:: python

    run_fingerprint = {
        "tokens": ["BTC", "ETH", "SOL"],
        "rule": "mean_reversion_channel",
        "startDateString": "2022-06-01 00:00:00",
        "endDateString": "2024-06-01 00:00:00",
        "initial_pool_value": 1_000_000.0,
        "fees": 0.003,
        "do_arb": True,
        "return_val": "daily_log_sharpe",
        "chunk_period": 1440,
        "bout_offset": 1440 * 14,

        # Regularisation
        "price_noise_sigma": 0.001,          # Multiplicative log-normal noise
        "turnover_penalty": 0.01,            # Penalise excessive weight changes
        "include_flipped_training_data": True,  # Time-reversed augmentation

        "optimisation_settings": {
            "method": "gradient_descent",
            "optimiser": "adam",
            "base_lr": 0.05,
            "n_iterations": 1000,
            "batch_size": 16,
            "n_parameter_sets": 4,
            "use_gradient_clipping": True,
            "clip_norm": 10.0,

            # Early stopping on validation set
            "early_stopping": True,
            "early_stopping_patience": 100,
            "early_stopping_metric": "daily_log_sharpe",
            "val_fraction": 0.2,

            # SWA for flatter optima
            "use_swa": True,
            "swa_start_frac": 0.75,
            "swa_freq": 5,
        },
    }

    evaluator = TrainingEvaluator.from_runner(
        "train_on_historic_data",
        n_cycles=5,
        compute_rademacher=True,
    )

    result = evaluator.evaluate(run_fingerprint)
    evaluator.print_report(result)


Choosing the WFE Metric
~~~~~~~~~~~~~~~~~~~~~~~~

By default, WFE is computed from annualised Sharpe ratios (Pardo's original
definition).  You can change this to any metric from
:doc:`../user_guide/metrics_reference`:

.. code-block:: python

    # Use Calmar ratio for drawdown-sensitive WFE
    evaluator = TrainingEvaluator.from_runner(
        "train_on_historic_data",
        n_cycles=4,
        wfe_metric="calmar",
    )


Warm Starting
~~~~~~~~~~~~~

Parameters from each cycle can seed the next cycle's training.  This is
enabled by default in the evaluator — the ``warm_start_params`` from cycle
*n* become the initial parameters for cycle *n + 1*.  This mirrors deployment
where you'd warm-start retraining from the last known good parameters.


Generating Cycles Manually
--------------------------

For advanced use, generate cycle specifications directly:

.. code-block:: python

    from quantammsim.runners.robust_walk_forward import (
        generate_walk_forward_cycles,
    )

    cycles = generate_walk_forward_cycles(
        start_date="2022-01-01 00:00:00",
        end_date="2024-01-01 00:00:00",
        n_cycles=6,
        keep_fixed_start=False,  # Rolling windows
    )

    for c in cycles:
        print(
            f"Cycle {c.cycle_number}: "
            f"Train {c.train_start_date} → {c.train_end_date}, "
            f"Test {c.test_start_date} → {c.test_end_date}"
        )


Decision Framework
------------------

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Diagnostic
     - Healthy
     - Action if Unhealthy
   * - Mean WFE
     - > 0.5
     - Add regularisation, reduce model complexity
   * - IS-OOS gap
     - < 0.3
     - Enable early stopping, add price noise
   * - Rademacher complexity
     - < 0.05
     - Reduce training iterations, use SWA
   * - Worst OOS Sharpe
     - > 0.0
     - Check for regime sensitivity, use expanding windows
   * - ``is_effective``
     - ``True``
     - Review ``effectiveness_reasons`` for specifics


See Also
--------

- :doc:`../user_guide/robustness_features` — Full robustness feature guide
- :doc:`../user_guide/metrics_reference` — Available training and evaluation metrics
- :doc:`hyperparameter_tuning` — Optimise training hyperparameters using WFA as the objective
- :doc:`ensemble_training` — Ensemble averaging for implicit regularisation
- :mod:`quantammsim.runners.training_evaluator` — API reference
- :mod:`quantammsim.runners.robust_walk_forward` — Rademacher and WFE utilities