Ensemble Training

Ensemble training trains multiple parameter sets (“members”) simultaneously and averages their weight outputs. This provides implicit regularisation through diversity: individual members may overfit in different ways, but their average tends toward the robust core signal.

This tutorial covers:

Why ensemble averaging works
Basic ensemble setup
Initialisation methods and their trade-offs
Multi-hook chaining (ensemble + bounded weights)
Ensemble with walk-forward validation
Best practices

Why Ensemble Averaging?

Single-strategy training optimises one set of parameters. If the optimisation landscape has multiple local optima (common for financial strategies), the training outcome depends strongly on initialisation. Worse, a single solution may overfit to idiosyncratic features of the training data.

Ensemble averaging mitigates both problems:

Exploration: Members start from different positions in parameter space, increasing the chance that at least one finds a good basin.
Regularisation: Averaging rule outputs smooths out member-specific overfitting. The ensemble’s effective hypothesis class is more constrained than any individual member’s.
Gradient flow: Because the averaging uses jnp.mean (not stop_gradient), gradients flow back to all members proportionally:

\[\frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{1}{N} \cdot \frac{\partial \mathcal{L}}{\partial \bar{w}} \cdot \frac{\partial w_i}{\partial \theta_i}\]

Each member receives gradients weighted by how its output affected the mean.

Basic Ensemble Setup

Ensembles are enabled via the ensemble hook and the n_ensemble_members fingerprint key:

from quantammsim.runners.jax_runners import train_on_historic_data

run_fingerprint = {
    "tokens": ["BTC", "ETH"],
    "rule": "ensemble__momentum",  # Hook prefix
    "startDateString": "2023-01-01 00:00:00",
    "endDateString": "2024-01-01 00:00:00",
    "initial_pool_value": 1_000_000.0,
    "fees": 0.003,
    "do_arb": True,
    "return_val": "daily_log_sharpe",
    "chunk_period": 1440,

    # Ensemble configuration
    "n_ensemble_members": 4,
    "ensemble_init_method": "lhs",    # Latin Hypercube Sampling
    "ensemble_init_scale": 0.5,       # Spread around initial values
    "ensemble_init_seed": 42,         # Reproducibility

    "optimisation_settings": {
        "method": "gradient_descent",
        "optimiser": "adam",
        "base_lr": 0.05,
        "n_iterations": 500,
        "batch_size": 16,
        "n_parameter_sets": 2,
    },
}

train_on_historic_data(run_fingerprint, verbose=True)

With n_parameter_sets=2 and n_ensemble_members=4, the parameter tensors have shape (2, 4, ...):

Outer dimension (2): independent training runs (vmapped in the runner)
Inner dimension (4): ensemble members that share gradients through averaging

The ensemble hook averages the rule outputs (weight changes), not the raw parameters. Each member maintains its own EWMA estimator state and produces its own weight trajectory; the final weights are the arithmetic mean across members.

Initialisation Methods

How ensemble members are spread across parameter space at initialisation significantly affects diversity and convergence. Set the method via run_fingerprint["ensemble_init_method"].

Method	Description	Best for
`"lhs"`	Latin Hypercube Sampling. Each parameter dimension is divided into N equal strata, and exactly one sample is placed in each stratum.	General use. Good space coverage with low sample counts. Recommended default.
`"centered_lhs"`	LHS with samples at stratum centres rather than random positions within each stratum.	When you want deterministic, evenly-spaced initialisation.
`"sobol"`	Sobol quasi-random sequence (low-discrepancy). Provides more uniform coverage than pseudo-random sampling, especially at higher dimensions.	Larger ensembles (8+) or high-dimensional parameter spaces.
`"grid"`	Regular grid over the parameter space. Deterministic and maximally uniform, but scales poorly with dimension.	Small ensembles (2-4 members) with few parameters.
`"gaussian"`	Independent Gaussian noise around initial values (the original, backwards-compatible approach).	Quick experiments. Provides no space-coverage guarantees.

The ensemble_init_scale parameter controls the spread. For structured methods (LHS, Sobol, grid), samples are drawn in [0, 1] and mapped to:

value = base_value × ((1 - scale) + sample × 2 × scale)

So scale=0.5 maps samples to [0.5×base, 1.5×base]. If the pool has a ParamSpec with Optuna ranges, those ranges are used instead for tighter, schema-aware initialisation.

Example — comparing LHS and Gaussian:

import matplotlib.pyplot as plt

# Train with LHS initialisation
run_fp_lhs = {**base_fingerprint, "ensemble_init_method": "lhs"}
result_lhs = train_on_historic_data(run_fp_lhs, verbose=True)

# Train with Gaussian initialisation
run_fp_gauss = {**base_fingerprint, "ensemble_init_method": "gaussian"}
result_gauss = train_on_historic_data(run_fp_gauss, verbose=True)

Multi-Hook Chaining

The ensemble hook composes with other hooks via the double-underscore syntax. Hooks are applied left-to-right (leftmost = highest MRO priority):

# Ensemble + bounded weights + mean reversion channel
run_fingerprint["rule"] = "ensemble__bounded__mean_reversion_channel"

# Ensemble + LVR tracking + momentum
run_fingerprint["rule"] = "ensemble__lvr__momentum"

For example, combining ensemble training with per-asset weight bounds:

import jax.numpy as jnp

run_fingerprint = {
    "tokens": ["BTC", "ETH", "SOL"],
    "rule": "ensemble__bounded__mean_reversion_channel",
    "startDateString": "2023-01-01 00:00:00",
    "endDateString": "2024-01-01 00:00:00",
    "initial_pool_value": 1_000_000.0,
    "fees": 0.003,
    "do_arb": True,
    "return_val": "daily_log_sharpe",
    "chunk_period": 1440,

    # Ensemble config
    "n_ensemble_members": 4,
    "ensemble_init_method": "lhs",
    "ensemble_init_scale": 0.5,

    # Per-asset bounds (applied after ensemble averaging)
    "min_weights_per_asset": jnp.array([0.2, 0.2, 0.1]),
    "max_weights_per_asset": jnp.array([0.5, 0.5, 0.3]),

    "optimisation_settings": {
        "method": "gradient_descent",
        "optimiser": "adam",
        "base_lr": 0.05,
        "n_iterations": 500,
        "batch_size": 16,
        "n_parameter_sets": 4,
    },
}

train_on_historic_data(run_fingerprint, verbose=True)

The order matters: ensemble__bounded__rule means the ensemble hook has higher priority than the bounded hook. The ensemble averages raw rule outputs before bounds are enforced — this is usually what you want, since bounds should constrain the final output, not the individual member contributions.

You can also construct the hooked pool class manually:

from quantammsim.pools.creator import create_hooked_pool_instance
from quantammsim.hooks.ensemble_averaging_hook import EnsembleAveragingHook
from quantammsim.hooks.bounded_weights_hook import BoundedWeightsHook
from quantammsim.pools.G3M.quantamm.mean_reversion_channel_pool import (
    MeanReversionChannelPool,
)

pool = create_hooked_pool_instance(
    MeanReversionChannelPool,
    BoundedWeightsHook,
    EnsembleAveragingHook,
)

Ensemble + Walk-Forward Validation

Ensemble training is most powerful when combined with walk-forward analysis to verify that the regularisation effect translates to OOS performance:

from quantammsim.runners.training_evaluator import TrainingEvaluator

run_fingerprint = {
    "tokens": ["BTC", "ETH"],
    "rule": "ensemble__mean_reversion_channel",
    "startDateString": "2022-06-01 00:00:00",
    "endDateString": "2024-06-01 00:00:00",
    "initial_pool_value": 1_000_000.0,
    "fees": 0.003,
    "do_arb": True,
    "return_val": "daily_log_sharpe",
    "chunk_period": 1440,
    "bout_offset": 1440 * 14,

    # Ensemble
    "n_ensemble_members": 4,
    "ensemble_init_method": "lhs",
    "ensemble_init_scale": 0.5,

    # Early stopping
    "optimisation_settings": {
        "method": "gradient_descent",
        "optimiser": "adam",
        "base_lr": 0.05,
        "n_iterations": 1000,
        "batch_size": 16,
        "n_parameter_sets": 4,
        "early_stopping": True,
        "early_stopping_patience": 100,
        "early_stopping_metric": "daily_log_sharpe",
        "val_fraction": 0.2,
    },
}

evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    n_cycles=4,
    compute_rademacher=True,
)

result = evaluator.evaluate(run_fingerprint)
evaluator.print_report(result)

Compare against a non-ensemble baseline to quantify the regularisation benefit:

from quantammsim.runners.training_evaluator import compare_trainers

# Same config but without ensemble
run_fp_no_ensemble = {**run_fingerprint, "rule": "mean_reversion_channel"}
run_fp_no_ensemble.pop("n_ensemble_members", None)

comparison = compare_trainers(
    run_fingerprint,
    trainers={
        "ensemble_4": TrainingEvaluator.from_runner(
            "train_on_historic_data", n_cycles=4,
        ),
        "no_ensemble": TrainingEvaluator.from_runner(
            "train_on_historic_data", n_cycles=4,
        ),
    },
)

Parameter Shapes

Understanding the parameter tensor layout is important for debugging:

Without ensemble:
  params["log_k"]           shape: (n_parameter_sets, n_assets)
  params["logit_lamb"]      shape: (n_parameter_sets,)

With 4 ensemble members:
  params["log_k"]           shape: (n_parameter_sets, 4, n_assets)
  params["logit_lamb"]      shape: (n_parameter_sets, 4)
  params["initial_weights_logits"]  shape: (n_parameter_sets, n_assets)
                                    ← SHARED, no ensemble dim

Note that initial_weights_logits is shared across ensemble members because the ensemble is about the strategy (rule outputs), not the starting allocation. All members begin with the same initial weights and diverge through their different rule parameters.

Best Practices

Member count: 4 members is a good starting point. Below 3, the diversity benefit is marginal. Above 8, returns diminish while memory usage grows linearly. The compute cost is proportional to n_parameter_sets × n_ensemble_members.

Initialisation method: Use "lhs" unless you have reason not to. It provides good space coverage without the pathologies of pure random sampling (clumping, poor tail coverage).

Init scale: Start with 0.5. Too small (< 0.1) and members collapse to the same solution. Too large (> 2.0) and some members start in poor regions and drag down the average.

Combine with other regularisation: Ensemble training is complementary to early stopping, price noise, and SWA. The strongest configs typically use ensemble + early stopping + price noise:

run_fingerprint.update({
    "n_ensemble_members": 4,
    "ensemble_init_method": "lhs",
    "ensemble_init_scale": 0.5,
    "price_noise_sigma": 0.001,
    "optimisation_settings": {
        **run_fingerprint["optimisation_settings"],
        "early_stopping": True,
        "early_stopping_patience": 100,
        "val_fraction": 0.2,
    },
})

Seed control: Set ensemble_init_seed for reproducibility. Different seeds with the same method will produce different member placements, which can cause variance in results. Pin the seed for production configs.