Ensemble Training
=================

Ensemble training optimises multiple parameter sets ("members") simultaneously
and averages their weight outputs.  This provides implicit regularisation
through diversity: individual members may overfit in different ways, but their
average tends toward the robust core signal.

This tutorial covers:

1. Why ensemble averaging works
2. Basic ensemble setup
3. Initialisation methods and their trade-offs
4. Multi-hook chaining (ensemble + bounded weights)
5. Ensemble with walk-forward validation
6. Parameter shapes
7. Best practices


Why Ensemble Averaging?
-----------------------

Single-strategy training optimises one set of parameters.  If the optimisation
landscape has multiple local optima (common for financial strategies), the
training outcome depends strongly on initialisation.  Worse, a single solution
may overfit to idiosyncratic features of the training data.

Ensemble averaging mitigates both problems:

* **Exploration**: Members start from different positions in parameter space,
  increasing the chance that at least one finds a good basin.
* **Regularisation**: Averaging rule outputs smooths out member-specific
  overfitting.  The ensemble's effective hypothesis class is more constrained
  than any individual member's.
* **Gradient flow**: The average is a plain ``jnp.mean`` with no
  ``stop_gradient``, so gradients flow back to every member:

  .. math::

     \frac{\partial \mathcal{L}}{\partial \theta_i}
     = \frac{1}{N} \cdot \frac{\partial \mathcal{L}}{\partial \bar{w}}
     \cdot \frac{\partial w_i}{\partial \theta_i}

  Each member receives the same ``1/N`` share of the gradient at the mean,
  scaled by how sensitive its own output is to its parameters.
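
The chain rule above can be checked numerically.  The following is a minimal
sketch rather than library code: ``member_output`` is a hypothetical stand-in
for a member's rule, and the quadratic loss is chosen only for illustration.

.. code-block:: python

    import jax
    import jax.numpy as jnp

    def member_output(theta):
        # Hypothetical stand-in for one member's rule output w_i(theta_i).
        return jnp.tanh(theta)

    def loss(thetas):
        # thetas has shape (N, n_assets): one row per ensemble member.
        outputs = jax.vmap(member_output)(thetas)
        w_bar = jnp.mean(outputs, axis=0)  # plain mean, no stop_gradient
        return jnp.sum(w_bar ** 2)

    thetas = jnp.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.6]])
    autodiff_grads = jax.grad(loss)(thetas)

    # Manual chain rule: (1/N) * dL/dw_bar * dw_i/dtheta_i
    n_members = thetas.shape[0]
    w_bar = jnp.mean(jnp.tanh(thetas), axis=0)
    dloss_dwbar = 2.0 * w_bar
    dw_dtheta = 1.0 - jnp.tanh(thetas) ** 2
    manual_grads = dloss_dwbar[None, :] * dw_dtheta / n_members

    print(jnp.allclose(autodiff_grads, manual_grads, atol=1e-6))

Every row of ``autodiff_grads`` shares the same ``dloss_dwbar`` factor; the
rows differ only through each member's own Jacobian term.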


Basic Ensemble Setup
--------------------

Ensembles are enabled via the ``ensemble`` hook and the
``n_ensemble_members`` fingerprint key:

.. code-block:: python

    from quantammsim.runners.jax_runners import train_on_historic_data

    run_fingerprint = {
        "tokens": ["BTC", "ETH"],
        "rule": "ensemble__momentum",  # Hook prefix
        "startDateString": "2023-01-01 00:00:00",
        "endDateString": "2024-01-01 00:00:00",
        "initial_pool_value": 1_000_000.0,
        "fees": 0.003,
        "do_arb": True,
        "return_val": "daily_log_sharpe",
        "chunk_period": 1440,

        # Ensemble configuration
        "n_ensemble_members": 4,
        "ensemble_init_method": "lhs",    # Latin Hypercube Sampling
        "ensemble_init_scale": 0.5,       # Spread around initial values
        "ensemble_init_seed": 42,         # Reproducibility

        "optimisation_settings": {
            "method": "gradient_descent",
            "optimiser": "adam",
            "base_lr": 0.05,
            "n_iterations": 500,
            "batch_size": 16,
            "n_parameter_sets": 2,
        },
    }

    train_on_historic_data(run_fingerprint, verbose=True)

With ``n_parameter_sets=2`` and ``n_ensemble_members=4``, the per-member
parameter tensors have shape ``(2, 4, ...)``:

* **Outer dimension** (2): independent training runs (vmapped in the runner)
* **Inner dimension** (4): ensemble members that share gradients through
  averaging

The ensemble hook averages the *rule outputs* (weight changes), not the raw
parameters.  Each member maintains its own EWMA estimator state and produces
its own weight trajectory; the final weights are the arithmetic mean across
members.
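
A minimal sketch of this two-level layout, assuming a toy ``rule_output``
function in place of a real strategy rule (the name and body are illustrative
and not part of quantammsim):

.. code-block:: python

    import jax
    import jax.numpy as jnp

    n_parameter_sets, n_ensemble_members, n_assets = 2, 4, 3

    def rule_output(log_k):
        # Toy stand-in for a rule: one member's parameters -> weight changes.
        return jnp.tanh(log_k)

    def ensemble_output(member_params):
        # member_params has shape (n_ensemble_members, n_assets).
        per_member = jax.vmap(rule_output)(member_params)
        return jnp.mean(per_member, axis=0)  # average outputs, not parameters

    key = jax.random.PRNGKey(0)
    log_k = jax.random.normal(
        key, (n_parameter_sets, n_ensemble_members, n_assets)
    )

    # Outer vmap: independent parameter sets.  Inner vmap: ensemble members.
    averaged = jax.vmap(ensemble_output)(log_k)
    print(averaged.shape)  # (2, 3)

The member axis disappears after averaging, while the outer
``n_parameter_sets`` axis is preserved.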


Initialisation Methods
----------------------

How ensemble members are spread across parameter space at initialisation
significantly affects diversity and convergence.  Set the method via
``run_fingerprint["ensemble_init_method"]``.

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Method
     - Description
     - Best for
   * - ``"lhs"``
     - Latin Hypercube Sampling.  Each parameter dimension is divided into
       *N* equal strata, and exactly one sample is placed in each stratum.
     - General use. Good space coverage with low sample counts.  **Recommended
       default.**
   * - ``"centered_lhs"``
     - LHS with samples at stratum centres rather than random positions
       within each stratum.
     - When you want deterministic, evenly-spaced initialisation.
   * - ``"sobol"``
     - Sobol quasi-random sequence (low-discrepancy).  Provides more
       uniform coverage than pseudo-random sampling, especially at higher
       dimensions.
     - Larger ensembles (8+) or high-dimensional parameter spaces.
   * - ``"grid"``
     - Regular grid over the parameter space.  Deterministic and maximally
       uniform, but scales poorly with dimension.
     - Small ensembles (2-4 members) with few parameters.
   * - ``"gaussian"``
     - Independent Gaussian noise around initial values (the original,
       backwards-compatible approach).
     - Quick experiments.  Provides no space-coverage guarantees.
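
To make the stratification idea concrete, here is a small NumPy sketch of
Latin Hypercube Sampling.  It illustrates the technique only; the function
name ``latin_hypercube`` and its arguments are assumptions, not the library's
implementation.  In this sketch the centred variant still shuffles which
member lands in which stratum.

.. code-block:: python

    import numpy as np

    def latin_hypercube(n_members, n_dims, seed=0, centered=False):
        rng = np.random.default_rng(seed)
        # One stratum index per member in every dimension, shuffled
        # independently per dimension.
        strata = np.stack(
            [rng.permutation(n_members) for _ in range(n_dims)], axis=1
        )
        # Position inside each stratum: the centre, or a uniform offset.
        if centered:
            offsets = np.full((n_members, n_dims), 0.5)
        else:
            offsets = rng.random((n_members, n_dims))
        return (strata + offsets) / n_members

    samples = latin_hypercube(n_members=4, n_dims=2, seed=42)
    # Stratum index of each sample; every column covers 0..3 exactly once.
    stratum_ids = np.floor(samples * 4).astype(int)
    print(np.sort(stratum_ids, axis=0))

Independent Gaussian or uniform draws give no such guarantee: with four
members, two of them can easily land in the same quarter of a dimension.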


The ``ensemble_init_scale`` parameter controls the spread.  For structured
methods (LHS, Sobol, grid), samples are drawn in [0, 1] and mapped to:

.. code-block:: text

    value = base_value × ((1 - scale) + sample × 2 × scale)

So ``scale=0.5`` maps samples to [0.5×base, 1.5×base].  If the pool has a
:class:`~quantammsim.core_simulator.param_schema.ParamSpec` with Optuna
ranges, those ranges are used instead for tighter, schema-aware initialisation.
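
The mapping can be written out directly.  This sketch covers only the default
formula above; the ``ParamSpec`` branch is not modelled, and ``map_samples``
is an illustrative name.

.. code-block:: python

    import jax.numpy as jnp

    def map_samples(samples, base_value, scale):
        # samples in [0, 1] -> base_value * [(1 - scale), (1 + scale)]
        return base_value * ((1.0 - scale) + samples * 2.0 * scale)

    samples = jnp.array([0.0, 0.5, 1.0])
    mapped = map_samples(samples, base_value=2.0, scale=0.5)
    print(mapped)  # [1. 2. 3.]

A sample of 0.5 always maps back to the base value, so the spread is
symmetric around the initial parameters.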

Example: comparing LHS and Gaussian initialisation:

.. code-block:: python

    from quantammsim.runners.jax_runners import train_on_historic_data

    # Start from the fingerprint defined in "Basic Ensemble Setup"
    base_fingerprint = dict(run_fingerprint)

    # Train with LHS initialisation
    run_fp_lhs = {**base_fingerprint, "ensemble_init_method": "lhs"}
    result_lhs = train_on_historic_data(run_fp_lhs, verbose=True)

    # Train with Gaussian initialisation
    run_fp_gauss = {**base_fingerprint, "ensemble_init_method": "gaussian"}
    result_gauss = train_on_historic_data(run_fp_gauss, verbose=True)


Multi-Hook Chaining
-------------------

The ensemble hook composes with other hooks via the double-underscore syntax.
Hooks are applied left-to-right (leftmost = highest MRO priority):

.. code-block:: python

    # Ensemble + bounded weights + mean reversion channel
    run_fingerprint["rule"] = "ensemble__bounded__mean_reversion_channel"

    # Ensemble + LVR tracking + momentum
    run_fingerprint["rule"] = "ensemble__lvr__momentum"

For example, combining ensemble training with per-asset weight bounds:

.. code-block:: python

    import jax.numpy as jnp

    run_fingerprint = {
        "tokens": ["BTC", "ETH", "SOL"],
        "rule": "ensemble__bounded__mean_reversion_channel",
        "startDateString": "2023-01-01 00:00:00",
        "endDateString": "2024-01-01 00:00:00",
        "initial_pool_value": 1_000_000.0,
        "fees": 0.003,
        "do_arb": True,
        "return_val": "daily_log_sharpe",
        "chunk_period": 1440,

        # Ensemble config
        "n_ensemble_members": 4,
        "ensemble_init_method": "lhs",
        "ensemble_init_scale": 0.5,

        # Per-asset bounds (applied after ensemble averaging)
        "min_weights_per_asset": jnp.array([0.2, 0.2, 0.1]),
        "max_weights_per_asset": jnp.array([0.5, 0.5, 0.3]),

        "optimisation_settings": {
            "method": "gradient_descent",
            "optimiser": "adam",
            "base_lr": 0.05,
            "n_iterations": 500,
            "batch_size": 16,
            "n_parameter_sets": 4,
        },
    }

    train_on_historic_data(run_fingerprint, verbose=True)

The order matters: ``ensemble__bounded__rule`` means the ensemble hook has
higher priority than the bounded hook.  The ensemble averages raw rule outputs
*before* bounds are enforced — this is usually what you want, since bounds
should constrain the final output, not the individual member contributions.
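
A toy illustration of why the order matters.  It uses ``jnp.clip`` as a
deliberately simplified stand-in for the bounded hook, which is an assumption:
the real hook may do more than clip (for example, renormalise the weights).

.. code-block:: python

    import jax.numpy as jnp

    # Two members' proposed weights for three assets (each row sums to 1).
    member_weights = jnp.array([
        [0.1, 0.6, 0.3],
        [0.7, 0.2, 0.1],
    ])
    lower = jnp.array([0.2, 0.2, 0.1])
    upper = jnp.array([0.5, 0.5, 0.3])

    # Average first, then bound: the mean already satisfies the bounds.
    average_then_bound = jnp.clip(
        jnp.mean(member_weights, axis=0), lower, upper
    )
    # Bound each member first, then average: members are distorted
    # individually and the result differs.
    bound_then_average = jnp.mean(
        jnp.clip(member_weights, lower, upper), axis=0
    )

    print(average_then_bound)
    print(bound_then_average)

Here the averaged weights are approximately ``[0.4, 0.4, 0.2]`` in the first
case and ``[0.35, 0.35, 0.2]`` in the second, so the two orderings are not
interchangeable.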

You can also construct the hooked pool class manually:

.. code-block:: python

    from quantammsim.pools.creator import create_hooked_pool_instance
    from quantammsim.hooks.ensemble_averaging_hook import EnsembleAveragingHook
    from quantammsim.hooks.bounded_weights_hook import BoundedWeightsHook
    from quantammsim.pools.G3M.quantamm.mean_reversion_channel_pool import (
        MeanReversionChannelPool,
    )

    pool = create_hooked_pool_instance(
        MeanReversionChannelPool,
        BoundedWeightsHook,
        EnsembleAveragingHook,
    )
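
For intuition about "MRO priority", the following plain-Python sketch shows
how base-class order determines method resolution order.  The classes here
are hypothetical toys, and whether ``create_hooked_pool_instance`` builds its
class in exactly this way is an assumption.

.. code-block:: python

    class BaseRule:
        def describe(self):
            return ["rule"]

    class BoundedHook:
        def describe(self):
            # super() resolves against the combined class's MRO.
            return ["bounded"] + super().describe()

    class EnsembleHook:
        def describe(self):
            return ["ensemble"] + super().describe()

    # Leftmost base comes first in the MRO.
    Hooked = type("Hooked", (EnsembleHook, BoundedHook, BaseRule), {})

    mro_names = [cls.__name__ for cls in Hooked.__mro__]
    print(mro_names)
    # ['Hooked', 'EnsembleHook', 'BoundedHook', 'BaseRule', 'object']
    print(Hooked().describe())
    # ['ensemble', 'bounded', 'rule']

Inspecting ``type(pool).__mro__`` on a constructed pool is a quick way to
confirm which hook takes precedence.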


Ensemble + Walk-Forward Validation
-----------------------------------

Ensemble training is most powerful when combined with walk-forward analysis
to verify that the regularisation effect translates to OOS performance:

.. code-block:: python

    from quantammsim.runners.training_evaluator import TrainingEvaluator

    run_fingerprint = {
        "tokens": ["BTC", "ETH"],
        "rule": "ensemble__mean_reversion_channel",
        "startDateString": "2022-06-01 00:00:00",
        "endDateString": "2024-06-01 00:00:00",
        "initial_pool_value": 1_000_000.0,
        "fees": 0.003,
        "do_arb": True,
        "return_val": "daily_log_sharpe",
        "chunk_period": 1440,
        "bout_offset": 1440 * 14,

        # Ensemble
        "n_ensemble_members": 4,
        "ensemble_init_method": "lhs",
        "ensemble_init_scale": 0.5,

        # Optimiser settings, including early stopping
        "optimisation_settings": {
            "method": "gradient_descent",
            "optimiser": "adam",
            "base_lr": 0.05,
            "n_iterations": 1000,
            "batch_size": 16,
            "n_parameter_sets": 4,
            "early_stopping": True,
            "early_stopping_patience": 100,
            "early_stopping_metric": "daily_log_sharpe",
            "val_fraction": 0.2,
        },
    }

    evaluator = TrainingEvaluator.from_runner(
        "train_on_historic_data",
        n_cycles=4,
        compute_rademacher=True,
    )

    result = evaluator.evaluate(run_fingerprint)
    evaluator.print_report(result)
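
To visualise what ``n_cycles=4`` could mean for this date range, here is a
sketch of one common walk-forward scheme.  The splitting rule is an
assumption made for illustration (equal-length segments, train on one and
test on the next); the scheme ``TrainingEvaluator`` actually uses may differ.

.. code-block:: python

    from datetime import datetime

    def walk_forward_windows(start, end, n_cycles):
        # Assumed scheme: n_cycles + 1 equal segments; cycle i trains on
        # segment i and tests on segment i + 1.
        segment = (end - start) / (n_cycles + 1)
        windows = []
        for i in range(n_cycles):
            train_start = start + i * segment
            train_end = train_start + segment
            test_end = train_end + segment
            windows.append((train_start, train_end, test_end))
        return windows

    windows = walk_forward_windows(
        datetime(2022, 6, 1), datetime(2024, 6, 1), n_cycles=4
    )
    for train_start, train_end, test_end in windows:
        print(
            f"train {train_start:%Y-%m-%d} to {train_end:%Y-%m-%d}, "
            f"test to {test_end:%Y-%m-%d}"
        )

Under this scheme every test window lies strictly after its training window,
so each cycle's score is out-of-sample.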

Compare against a non-ensemble baseline to quantify the regularisation
benefit:

.. code-block:: python

    from quantammsim.runners.training_evaluator import TrainingEvaluator

    # Same config but without the ensemble hook
    run_fp_no_ensemble = {**run_fingerprint, "rule": "mean_reversion_channel"}
    run_fp_no_ensemble.pop("n_ensemble_members", None)

    # Identical evaluator settings, so only the fingerprint differs
    runs = {
        "ensemble_4": run_fingerprint,
        "no_ensemble": run_fp_no_ensemble,
    }
    for label, fingerprint in runs.items():
        evaluator = TrainingEvaluator.from_runner(
            "train_on_historic_data", n_cycles=4,
        )
        print(label)
        evaluator.print_report(evaluator.evaluate(fingerprint))


Parameter Shapes
----------------

Understanding the parameter tensor layout is important for debugging:

.. code-block:: text

    Without ensemble:
      params["log_k"]           shape: (n_parameter_sets, n_assets)
      params["logit_lamb"]      shape: (n_parameter_sets,)

    With 4 ensemble members:
      params["log_k"]           shape: (n_parameter_sets, 4, n_assets)
      params["logit_lamb"]      shape: (n_parameter_sets, 4)
      params["initial_weights_logits"]  shape: (n_parameter_sets, n_assets)
                                        ← SHARED, no ensemble dim

Note that ``initial_weights_logits`` is shared across ensemble members
because the ensemble is about the *strategy* (rule outputs), not the starting
allocation.  All members begin with the same initial weights and diverge
through their different rule parameters.
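
When debugging, it can help to print the shapes and to slice out a single
member.  The sketch below builds a dummy parameter dictionary with the layout
described above; ``select_member`` is a hypothetical helper, not a library
function.

.. code-block:: python

    import jax.numpy as jnp

    n_parameter_sets, n_members, n_assets = 2, 4, 3

    params = {
        "log_k": jnp.zeros((n_parameter_sets, n_members, n_assets)),
        "logit_lamb": jnp.zeros((n_parameter_sets, n_members)),
        # Shared across members: no ensemble axis.
        "initial_weights_logits": jnp.zeros((n_parameter_sets, n_assets)),
    }

    def select_member(params, member_index,
                      shared_keys=("initial_weights_logits",)):
        # Slice axis 1 (the ensemble axis) for per-member tensors and
        # pass shared tensors through unchanged.
        return {
            name: value if name in shared_keys else value[:, member_index]
            for name, value in params.items()
        }

    member_0 = select_member(params, 0)
    for name, value in member_0.items():
        print(name, value.shape)

The sliced dictionary has the same layout as the non-ensemble case, which
makes it convenient for inspecting one member in isolation.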


Best Practices
--------------

**Member count**: 4 members is a good starting point.  Below 3, the
diversity benefit is marginal.  Above 8, returns diminish while memory usage
grows linearly.  The compute cost is proportional to ``n_parameter_sets ×
n_ensemble_members``.

**Initialisation method**: Use ``"lhs"`` unless you have reason not to.  It
provides good space coverage without the pathologies of pure random sampling
(clumping, poor tail coverage).

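**Init scale**: Start with 0.5.  Too small (< 0.1) and members collapse to
the same solution.  Too large (> 2.0) and some members start in poor regions
and drag down the average.  For the structured methods, any scale above 1.0
already makes the lower end of the range, ``(1 - scale) × base``, change sign.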

**Combine with other regularisation**: Ensemble training is complementary to
early stopping, price noise, and SWA.  The strongest configs typically use
ensemble + early stopping + price noise:

.. code-block:: python

    run_fingerprint.update({
        "n_ensemble_members": 4,
        "ensemble_init_method": "lhs",
        "ensemble_init_scale": 0.5,
        "price_noise_sigma": 0.001,
        "optimisation_settings": {
            **run_fingerprint["optimisation_settings"],
            "early_stopping": True,
            "early_stopping_patience": 100,
            "val_fraction": 0.2,
        },
    })

**Seed control**: Set ``ensemble_init_seed`` for reproducibility.  Different
seeds with the same method will produce different member placements, which
can cause variance in results.  Pin the seed for production configs.
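
The effect of pinning the seed can be illustrated with a plain NumPy
generator.  ``member_offsets`` is a hypothetical stand-in for seeded member
placement and does not reproduce the library's sampling.

.. code-block:: python

    import numpy as np

    def member_offsets(seed, n_members=4, n_params=3):
        # Hypothetical stand-in for seeded ensemble initialisation.
        rng = np.random.default_rng(seed)
        return rng.random((n_members, n_params))

    same = np.array_equal(member_offsets(42), member_offsets(42))
    different = not np.array_equal(member_offsets(42), member_offsets(43))
    print(same, different)  # True True

Repeating a run over a handful of seeds is a simple way to gauge how much of
a result is due to member placement rather than the strategy itself.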


See Also
--------

- :doc:`../user_guide/hooks` — Hook system overview and custom hooks
- :doc:`../user_guide/robustness_features` — All regularisation techniques
- :doc:`walk_forward_analysis` — Walk-forward validation tutorial
- :doc:`../user_guide/per_asset_bounds` — Per-asset weight bounds (composable with ensemble)
- :mod:`quantammsim.hooks.ensemble_averaging_hook` — API reference