Walk-Forward Analysis

Robust Walk-Forward Utilities

Core utilities for walk-forward validation: Rademacher complexity, Walk-Forward Efficiency (WFE), and cycle generation.

Robust Walk-Forward Training Utilities

This module provides core utilities for walk-forward analysis:

  1. Rademacher Complexity (Paleologo)
       - Compute empirical Rademacher complexity from checkpoint returns
       - Apply haircut to OOS performance estimates

  2. Walk-Forward Efficiency (Pardo)
       - Compute WFE = OOS performance / IS performance
       - Standard metric for assessing robustness

  3. Cycle Generation
       - Generate walk-forward train/test splits
       - Support for rolling and expanding windows

Key References:
  - Pardo, “The Evaluation and Optimization of Trading Strategies” (2008)
  - Paleologo, “The Elements of Quantitative Investing” (2024), Ch. 6

class WalkForwardCycle(cycle_number, train_start_date, train_end_date, test_start_date, test_end_date, train_start_idx=0, train_end_idx=0, test_start_idx=0, test_end_idx=0)[source]

Bases: object

Specification for a single walk-forward train/test cycle.

Defines one segment of a walk-forward analysis: a contiguous training window followed by a contiguous test window. Date fields are set at cycle-generation time; index fields are populated later once the price data has been loaded and aligned.

Parameters:
  • cycle_number (int)

  • train_start_date (str)

  • train_end_date (str)

  • test_start_date (str)

  • test_end_date (str)

  • train_start_idx (int)

  • train_end_idx (int)

  • test_start_idx (int)

  • test_end_idx (int)

cycle_number

Zero-based index of this cycle within the walk-forward sequence.

Type:

int

train_start_date

Training window start date ("YYYY-MM-DD HH:MM:SS").

Type:

str

train_end_date

Training window end date (inclusive).

Type:

str

test_start_date

Test window start date, typically equal to train_end_date.

Type:

str

test_end_date

Test window end date (inclusive).

Type:

str

train_start_idx

Row index into the price array for the start of training. Default 0; populated after data loading.

Type:

int

train_end_idx

Row index for the end of training. Default 0.

Type:

int

test_start_idx

Row index for the start of testing. Default 0.

Type:

int

test_end_idx

Row index for the end of testing. Default 0.

Type:

int

__init__(cycle_number, train_start_date, train_end_date, test_start_date, test_end_date, train_start_idx=0, train_end_idx=0, test_start_idx=0, test_end_idx=0)
Parameters:
  • cycle_number (int)

  • train_start_date (str)

  • train_end_date (str)

  • test_start_date (str)

  • test_end_date (str)

  • train_start_idx (int)

  • train_end_idx (int)

  • test_start_idx (int)

  • test_end_idx (int)

Return type:

None
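
The index fields are described above as being filled in after the price data is loaded. The following is a minimal sketch of how that alignment might look, not the library's actual logic: `CycleSketch`, `to_unix_utc` and `fill_indices` are hypothetical names, the timestamps are assumed to be sorted Unix seconds, and dates are parsed as UTC purely to keep the example deterministic.

```python
from bisect import bisect_left
from dataclasses import dataclass, replace
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


@dataclass(frozen=True)
class CycleSketch:
    """Hypothetical stand-in mirroring the documented WalkForwardCycle fields."""
    cycle_number: int
    train_start_date: str
    train_end_date: str
    test_start_date: str
    test_end_date: str
    train_start_idx: int = 0
    train_end_idx: int = 0
    test_start_idx: int = 0
    test_end_idx: int = 0


def to_unix_utc(date_string):
    """Parse a date string as UTC (an assumption made for determinism)."""
    parsed = datetime.strptime(date_string, FMT).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def fill_indices(cycle, timestamps):
    """Return a copy of `cycle` with each date boundary mapped to a row index.

    `timestamps` must be sorted ascending, one Unix-second value per price row.
    Each boundary maps to the first row at or after the boundary time.
    """
    def row(date_string):
        return bisect_left(timestamps, to_unix_utc(date_string))

    return replace(
        cycle,
        train_start_idx=row(cycle.train_start_date),
        train_end_idx=row(cycle.train_end_date),
        test_start_idx=row(cycle.test_start_date),
        test_end_idx=row(cycle.test_end_date),
    )


# Ten daily rows starting at 2024-01-01 00:00:00 UTC.
base = to_unix_utc("2024-01-01 00:00:00")
daily = [base + 86400.0 * i for i in range(10)]
cycle = CycleSketch(0, "2024-01-01 00:00:00", "2024-01-05 00:00:00",
                    "2024-01-05 00:00:00", "2024-01-09 00:00:00")
aligned = fill_indices(cycle, daily)
```

Because the sketch returns a new object rather than mutating the cycle, the date-only specification stays reusable across data sets with different sampling.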

compute_empirical_rademacher(returns_matrix, n_samples=1000, seed=42)[source]

Compute empirical Rademacher complexity of a set of strategies.

The Rademacher complexity measures how well the strategy class can “fit” random noise. Higher complexity = more overfitting risk.

Parameters:
  • returns_matrix (ndarray of shape (n_strategies, T)) – Returns time series for each strategy (checkpoint)

  • n_samples (int) – Number of random sign vectors to sample

  • seed (int) – Random seed for reproducibility

Returns:

Empirical Rademacher complexity R̂

Return type:

float

Notes

R̂ = E_σ[sup_s (1/T) Σ_t σ_t r_s(t)]

where σ_t are random Rademacher variables (±1 with prob 0.5)
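
The expectation in the formula above can be approximated by Monte Carlo sampling of sign vectors. The sketch below is one way to do that with NumPy; the function name `empirical_rademacher_sketch` is hypothetical and the library's own implementation may differ in details such as the sampling scheme.

```python
import numpy as np


def empirical_rademacher_sketch(returns_matrix, n_samples=1000, seed=42):
    """Monte Carlo estimate of E_sigma[ max_s (1/T) * sum_t sigma_t * r_s(t) ]."""
    returns = np.asarray(returns_matrix, dtype=float)
    if returns.ndim != 2:
        raise ValueError("returns_matrix must have shape (n_strategies, T)")
    n_strategies, T = returns.shape
    rng = np.random.default_rng(seed)
    # One row per sampled sign vector; entries are -1 or +1 with equal probability.
    sigma = rng.choice(np.array([-1.0, 1.0]), size=(n_samples, T))
    # correlations[k, s] = (1/T) * sum_t sigma[k, t] * returns[s, t]
    correlations = sigma @ returns.T / T
    # The supremum is a max over the finite strategy set, averaged over sign draws.
    return float(correlations.max(axis=1).mean())
```

With a fixed seed, adding more checkpoints can only increase the estimate, since the maximum is taken over a larger set for every sampled sign vector.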

compute_rademacher_haircut(observed_sharpe, rademacher_complexity, T, delta=0.05)[source]

Compute Rademacher-adjusted performance bound.

From Paleologo (2024): θ_n ≥ θ̂_n - 2R̂ - estimation_error

Parameters:
  • observed_sharpe (float) – Observed Sharpe ratio on test data

  • rademacher_complexity (float) – Empirical Rademacher complexity R̂

  • T (int) – Number of time periods in test data

  • delta (float) – Failure probability of the bound (default 0.05, i.e. the bound holds with 95% confidence)

Returns:

(adjusted_sharpe, haircut_magnitude)

Return type:

Tuple[float, float]
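
The bound above can be turned into a small calculation. This is a sketch only: this reference does not spell out the estimation-error term, so the Hoeffding-style term `sqrt(2 * log(2 / delta) / T)` used below is an assumption, and the function name `rademacher_haircut_sketch` is hypothetical. The Sharpe ratio and R̂ are assumed to be expressed in the same per-period units.

```python
import math


def rademacher_haircut_sketch(observed_sharpe, rademacher_complexity, T, delta=0.05):
    """Return (adjusted_sharpe, haircut_magnitude) for a lower performance bound."""
    if T <= 0:
        raise ValueError("T must be a positive number of periods")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    # ASSUMED estimation-error term; the term used by the library may differ.
    estimation_error = math.sqrt(2.0 * math.log(2.0 / delta) / T)
    # Haircut = 2 * R_hat + estimation error, as in the bound quoted above.
    haircut = 2.0 * rademacher_complexity + estimation_error
    return observed_sharpe - haircut, haircut
```

Under this assumed term the haircut shrinks as T grows and grows as delta shrinks, which matches the intuition that longer test windows and looser confidence requirements both tighten the bound.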

compute_walk_forward_efficiency(is_sharpe, oos_sharpe, is_days, oos_days)[source]

Compute Walk-Forward Efficiency (WFE) as per Pardo.

WFE = (Annualized OOS Performance) / (Annualized IS Performance)

A WFE of 0.5 or higher suggests robustness. A WFE near 1.0 is ideal (OOS ≈ IS). A WFE > 1.0 means OOS outperformed IS (unusual but possible).

Parameters:
  • is_sharpe (float) – In-sample Sharpe ratio

  • oos_sharpe (float) – Out-of-sample Sharpe ratio

  • is_days (int) – Number of days in IS period

  • oos_days (int) – Number of days in OOS period

Returns:

Walk-Forward Efficiency (returns NaN for undefined cases)

Return type:

float
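
As a rough illustration of the ratio, the sketch below assumes both Sharpe ratios are already annualised, so the day counts are used only for validation. Which inputs count as "undefined" is also an assumption here (non-positive day counts, non-finite inputs, and a non-positive IS Sharpe); the name `walk_forward_efficiency_sketch` is hypothetical.

```python
import math


def walk_forward_efficiency_sketch(is_sharpe, oos_sharpe, is_days, oos_days):
    """Return OOS / IS performance, or NaN when the ratio is not meaningful."""
    if is_days <= 0 or oos_days <= 0:
        return math.nan
    if not (math.isfinite(is_sharpe) and math.isfinite(oos_sharpe)):
        return math.nan
    if is_sharpe <= 0.0:
        # A ratio against a zero or negative IS baseline has no clear reading.
        return math.nan
    return oos_sharpe / is_sharpe
```

For example, an IS Sharpe of 2.0 and an OOS Sharpe of 1.0 give a WFE of 0.5, which sits at the robustness threshold mentioned above.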

datetime_to_timestamp(date_string)[source]

Convert a datetime string to a Unix timestamp.

Parameters:

date_string (str) – Date in "YYYY-MM-DD HH:MM:SS" format.

Returns:

Seconds since the Unix epoch, interpreted in the local timezone.

Return type:

float

timestamp_to_datetime(timestamp)[source]

Convert a Unix timestamp to a datetime string.

Parameters:

timestamp (float) – Seconds since the Unix epoch.

Returns:

Formatted date string in "YYYY-MM-DD HH:MM:SS" format.

Return type:

str
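
The two conversions can be sketched with the standard library alone. The function names below are hypothetical, and it is an assumption that the inverse conversion also uses the local timezone, which is what makes the pair round-trip.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def datetime_to_timestamp_sketch(date_string):
    # A naive datetime is interpreted in the local timezone by .timestamp().
    return datetime.strptime(date_string, FMT).timestamp()


def timestamp_to_datetime_sketch(timestamp):
    # fromtimestamp() without a tz argument also yields local time.
    return datetime.fromtimestamp(timestamp).strftime(FMT)


stamp = datetime_to_timestamp_sketch("2024-01-15 12:30:00")
round_trip = timestamp_to_datetime_sketch(stamp)
```

Because both directions depend on the machine's timezone, the numeric timestamp for a given string can differ between hosts even though the round trip is stable on each one.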

generate_walk_forward_cycles(start_date, end_date, n_cycles, keep_fixed_start=False)[source]

Generate walk-forward cycle specifications with equal-length test periods.

Divides [start_date, end_date] into (n_cycles + 1) equal segments. Each cycle trains on segment i and tests on segment i+1.

Parameters:
  • start_date (str) – Start date (format: “YYYY-MM-DD HH:MM:SS”)

  • end_date (str) – End date of walk-forward analysis (end of final test period)

  • n_cycles (int) – Number of training/test cycles

  • keep_fixed_start (bool) – If True, training always starts from start_date (expanding window). If False (default), training window rolls forward (rolling window).

Return type:

List[WalkForwardCycle]
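
The segmentation described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the hypothetical `generate_cycles_sketch` returns plain dicts instead of WalkForwardCycle objects, and it sets each test start equal to the preceding train end.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def generate_cycles_sketch(start_date, end_date, n_cycles, keep_fixed_start=False):
    """Split [start_date, end_date] into n_cycles + 1 equal segments."""
    if n_cycles < 1:
        raise ValueError("n_cycles must be at least 1")
    start = datetime.strptime(start_date, FMT)
    end = datetime.strptime(end_date, FMT)
    if end <= start:
        raise ValueError("end_date must be after start_date")
    segment = (end - start) / (n_cycles + 1)
    # boundaries[i] is the start of segment i; the last entry is end_date.
    boundaries = [start + i * segment for i in range(n_cycles + 1)] + [end]
    cycles = []
    for i in range(n_cycles):
        # Expanding window trains from the very beginning; rolling uses segment i.
        train_start = start if keep_fixed_start else boundaries[i]
        cycles.append({
            "cycle_number": i,
            "train_start_date": train_start.strftime(FMT),
            "train_end_date": boundaries[i + 1].strftime(FMT),
            "test_start_date": boundaries[i + 1].strftime(FMT),
            "test_end_date": boundaries[i + 2].strftime(FMT),
        })
    return cycles


rolling = generate_cycles_sketch("2024-01-01 00:00:00", "2024-01-31 00:00:00", 2)
expanding = generate_cycles_sketch("2024-01-01 00:00:00", "2024-01-31 00:00:00", 2,
                                   keep_fixed_start=True)
```

With a 30-day range and two cycles each segment is 10 days long, so the second rolling cycle trains on days 10 to 20 while the second expanding cycle trains on days 0 to 20.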

Training Evaluator

Walk-forward evaluation framework with pluggable trainer wrappers, per-cycle IS/OOS metric extraction, and aggregate robustness diagnostics.

Training Evaluator: A Meta-Runner for Assessing Training Effectiveness.

Wrap any training approach and evaluate whether it’s effective using:

  • Walk-Forward Efficiency (Pardo)

  • Rademacher Complexity (Paleologo) — requires checkpoint tracking, see below

  • OOS performance metrics

Usage:

from quantammsim.runners.training_evaluator import TrainingEvaluator, compare_trainers

# Option 1: Wrap existing runner (n_cycles is set on the evaluator, not on evaluate())
evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data", n_cycles=5, max_iterations=500
)
results = evaluator.evaluate(run_fingerprint)

# Option 2: Wrap custom function
def my_trainer(data_dict, train_start_idx, train_end_idx, pool,
               run_fingerprint, n_assets, warm_start_params=None):
    # ... your logic ...
    return params, {"epochs": n}

evaluator = TrainingEvaluator.from_function(my_trainer)

# Option 3: Compare approaches
comparison = compare_trainers(
    run_fingerprint,
    trainers={
        "sgd": TrainingEvaluator.from_runner("train_on_historic_data"),
        "random": TrainingEvaluator.random_baseline(),
    },
)

Rademacher Complexity

Rademacher complexity measures overfitting risk by tracking the “search space” explored during optimization. To compute Rademacher complexity, the trainer must return checkpoint_returns in metadata:

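import numpy as np

def my_trainer_with_checkpoints(data_dict, train_start_idx, train_end_idx, pool,
                                run_fingerprint, n_assets, warm_start_params=None):
    # update(), evaluate(), n_epochs and checkpoint_interval are placeholders
    # for your own training step, return evaluation, and settings.
    params = warm_start_params
    checkpoint_returns = []
    for epoch in range(n_epochs):
        params = update(params)
        if epoch % checkpoint_interval == 0:
            returns = evaluate(params)  # Returns array of shape (T,)
            checkpoint_returns.append(returns)

    return params, {
        "epochs_trained": n_epochs,
        "checkpoint_returns": np.stack(checkpoint_returns),  # (n_checkpoints, T)
    }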

evaluator = TrainingEvaluator.from_function(
    my_trainer_with_checkpoints,
    compute_rademacher=True,  # Enable Rademacher computation
)

The built-in wrapper for train_on_historic_data supports checkpoint tracking. Enable it by passing compute_rademacher=True to from_runner():

evaluator = TrainingEvaluator.from_runner(
    "train_on_historic_data",
    compute_rademacher=True,  # Enable checkpoint tracking
    checkpoint_interval=10,   # Optional: checkpoint every N iterations
)

For multi_period_sgd or custom trainers, you can implement checkpoint tracking manually by returning checkpoint_returns in metadata (as shown above).

class CycleEvaluation(cycle_number, is_sharpe, is_returns_over_hodl, oos_sharpe, oos_returns_over_hodl, walk_forward_efficiency, is_oos_gap, epochs_trained=0, rademacher_complexity=None, adjusted_oos_sharpe=None, is_calmar=None, oos_calmar=None, is_sterling=None, oos_sterling=None, is_ulcer=None, oos_ulcer=None, is_returns=None, oos_returns=None, is_daily_log_sharpe=None, oos_daily_log_sharpe=None, trained_params=None, train_start_date=None, train_end_date=None, test_start_date=None, test_end_date=None, run_location=None, run_fingerprint=None)[source]

Bases: object

Evaluation results for a single walk-forward cycle.

Captures in-sample (IS) and out-of-sample (OOS) performance metrics for one train/test window, plus robustness diagnostics.

Parameters:
  • cycle_number (int)

  • is_sharpe (float)

  • is_returns_over_hodl (float)

  • oos_sharpe (float)

  • oos_returns_over_hodl (float)

  • walk_forward_efficiency (float)

  • is_oos_gap (float)

  • epochs_trained (int)

  • rademacher_complexity (float | None)

  • adjusted_oos_sharpe (float | None)

  • is_calmar (float | None)

  • oos_calmar (float | None)

  • is_sterling (float | None)

  • oos_sterling (float | None)

  • is_ulcer (float | None)

  • oos_ulcer (float | None)

  • is_returns (float | None)

  • oos_returns (float | None)

  • is_daily_log_sharpe (float | None)

  • oos_daily_log_sharpe (float | None)

  • trained_params (Dict[str, Any] | None)

  • train_start_date (str | None)

  • train_end_date (str | None)

  • test_start_date (str | None)

  • test_end_date (str | None)

  • run_location (str | None)

  • run_fingerprint (Dict[str, Any] | None)

cycle_number

Zero-based index of this cycle.

Type:

int

is_sharpe

Annualised Sharpe ratio on the in-sample (training) window.

Type:

float

is_returns_over_hodl

Cumulative return relative to uniform HODL on the IS window.

Type:

float

oos_sharpe

Annualised Sharpe ratio on the out-of-sample (test) window.

Type:

float

oos_returns_over_hodl

Cumulative return relative to uniform HODL on the OOS window.

Type:

float

walk_forward_efficiency

WFE = OOS Sharpe / IS Sharpe (Pardo metric).

Type:

float

is_oos_gap

IS Sharpe minus OOS Sharpe (positive = overfitting).

Type:

float

epochs_trained

Number of gradient updates in this cycle’s training run.

Type:

int

rademacher_complexity

Empirical Rademacher complexity from training checkpoints.

Type:

float or None

adjusted_oos_sharpe

OOS Sharpe minus the Rademacher haircut.

Type:

float or None

is_calmar, oos_calmar

Calmar ratio (return / max drawdown) for IS and OOS.

Type:

float or None

is_sterling, oos_sterling

Sterling ratio for IS and OOS.

Type:

float or None

is_ulcer, oos_ulcer

Ulcer index for IS and OOS.

Type:

float or None

is_returns, oos_returns

Cumulative returns for IS and OOS.

Type:

float or None

is_daily_log_sharpe, oos_daily_log_sharpe

Daily-log-return Sharpe for IS and OOS.

Type:

float or None

trained_params

Strategy parameters at end of training for this cycle.

Type:

dict or None

train_start_date, train_end_date

IS window date boundaries.

Type:

str or None

test_start_date, test_end_date

OOS window date boundaries.

Type:

str or None

run_location

Filesystem path to the training output for this cycle.

Type:

str or None

run_fingerprint

Full run configuration used for this cycle.

Type:

dict or None

__init__(cycle_number, is_sharpe, is_returns_over_hodl, oos_sharpe, oos_returns_over_hodl, walk_forward_efficiency, is_oos_gap, epochs_trained=0, rademacher_complexity=None, adjusted_oos_sharpe=None, is_calmar=None, oos_calmar=None, is_sterling=None, oos_sterling=None, is_ulcer=None, oos_ulcer=None, is_returns=None, oos_returns=None, is_daily_log_sharpe=None, oos_daily_log_sharpe=None, trained_params=None, train_start_date=None, train_end_date=None, test_start_date=None, test_end_date=None, run_location=None, run_fingerprint=None)
Parameters:
  • cycle_number (int)

  • is_sharpe (float)

  • is_returns_over_hodl (float)

  • oos_sharpe (float)

  • oos_returns_over_hodl (float)

  • walk_forward_efficiency (float)

  • is_oos_gap (float)

  • epochs_trained (int)

  • rademacher_complexity (float | None)

  • adjusted_oos_sharpe (float | None)

  • is_calmar (float | None)

  • oos_calmar (float | None)

  • is_sterling (float | None)

  • oos_sterling (float | None)

  • is_ulcer (float | None)

  • oos_ulcer (float | None)

  • is_returns (float | None)

  • oos_returns (float | None)

  • is_daily_log_sharpe (float | None)

  • oos_daily_log_sharpe (float | None)

  • trained_params (Dict[str, Any] | None)

  • train_start_date (str | None)

  • train_end_date (str | None)

  • test_start_date (str | None)

  • test_end_date (str | None)

  • run_location (str | None)

  • run_fingerprint (Dict[str, Any] | None)

Return type:

None

class EvaluationResult(trainer_name, trainer_config, cycles, mean_wfe, mean_oos_sharpe, std_oos_sharpe, worst_oos_sharpe, mean_is_oos_gap, aggregate_rademacher=None, adjusted_mean_oos_sharpe=None, is_effective=False, effectiveness_reasons=<factory>)[source]

Bases: object

Complete evaluation results across all walk-forward cycles.

Aggregates per-cycle metrics into summary statistics and provides an effectiveness verdict based on configurable thresholds.

trainer_name

Identifier for the trainer wrapper that produced these results.

Type:

str

trainer_config

Configuration dict passed to the trainer.

Type:

Dict[str, Any]

cycles

Per-cycle evaluation results.

Type:

List[CycleEvaluation]

mean_wfe

Mean Walk-Forward Efficiency across cycles.

Type:

float

mean_oos_sharpe

Mean OOS Sharpe ratio across cycles.

Type:

float

std_oos_sharpe

Standard deviation of OOS Sharpe across cycles.

Type:

float

worst_oos_sharpe

Minimum OOS Sharpe across cycles.

Type:

float

mean_is_oos_gap

Mean IS–OOS Sharpe gap (positive = overfitting).

Type:

float

aggregate_rademacher

Mean Rademacher complexity across cycles (if computed).

Type:

float or None

adjusted_mean_oos_sharpe

Mean OOS Sharpe minus the mean Rademacher haircut.

Type:

float or None

is_effective

Whether the strategy passes the effectiveness criteria (positive mean OOS Sharpe, WFE > threshold, etc.).

Type:

bool

effectiveness_reasons

Human-readable explanations for the effectiveness verdict.

Type:

List[str]

__init__(trainer_name, trainer_config, cycles, mean_wfe, mean_oos_sharpe, std_oos_sharpe, worst_oos_sharpe, mean_is_oos_gap, aggregate_rademacher=None, adjusted_mean_oos_sharpe=None, is_effective=False, effectiveness_reasons=<factory>)
Return type:

None
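
The aggregation from per-cycle metrics to the summary fields above might look like the following sketch. The function name `aggregate_cycles_sketch`, the dict-based input, the use of a population standard deviation, the skipping of NaN WFE values, and the `min_wfe=0.5` threshold are all assumptions made for illustration; the actual effectiveness criteria may include further checks.

```python
import math
from statistics import fmean, pstdev


def aggregate_cycles_sketch(cycles, min_wfe=0.5):
    """Summarise a list of per-cycle dicts.

    Each dict needs "is_sharpe", "oos_sharpe" and "walk_forward_efficiency".
    """
    if not cycles:
        raise ValueError("at least one cycle is required")
    oos = [c["oos_sharpe"] for c in cycles]
    gaps = [c["is_sharpe"] - c["oos_sharpe"] for c in cycles]
    # NaN WFE values (undefined cycles) are skipped rather than propagated.
    wfes = [c["walk_forward_efficiency"] for c in cycles
            if not math.isnan(c["walk_forward_efficiency"])]
    summary = {
        "mean_wfe": fmean(wfes) if wfes else math.nan,
        "mean_oos_sharpe": fmean(oos),
        "std_oos_sharpe": pstdev(oos),  # population std; an assumption
        "worst_oos_sharpe": min(oos),
        "mean_is_oos_gap": fmean(gaps),
    }
    checks = [
        ("mean OOS Sharpe > 0", summary["mean_oos_sharpe"] > 0.0),
        (f"mean WFE > {min_wfe}", summary["mean_wfe"] > min_wfe),
    ]
    summary["is_effective"] = all(passed for _, passed in checks)
    summary["effectiveness_reasons"] = [
        ("PASS: " if passed else "FAIL: ") + label for label, passed in checks
    ]
    return summary
```

Keeping the checks as labelled pairs means the verdict and its human-readable reasons are derived from the same list and cannot drift apart.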

class TrainerWrapper(name='trainer', config=None)[source]

Bases: object

Base class for wrapping training functions.

A trainer must implement:

train(data_dict, train_start_idx, train_end_idx, pool, run_fp, warm_start, …) -> (params, metadata)

__init__(name='trainer', config=None)[source]
property name: str
property config: Dict[str, Any]
train(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params=None, warm_start_weights=None, train_start_date=None, train_end_date=None, test_end_date=None)[source]

Train and return (params, metadata).

Parameters:
  • warm_start_params (dict, optional) – Strategy parameters from previous cycle to use as initialization.

  • warm_start_weights (array-like, optional) – Final weights from the previous cycle. The pool starts with a fresh initial_pool_value, distributed according to these weights (simulating continuous operation).

  • data_dict (dict)

  • train_start_idx (int)

  • train_end_idx (int)

  • pool (Any)

  • run_fingerprint (dict)

  • n_assets (int)

  • train_start_date (str | None)

  • train_end_date (str | None)

  • test_end_date (str | None)

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]

class FunctionWrapper(fn, name='custom', config=None)[source]

Bases: TrainerWrapper

Wrap a plain (run_fingerprint, **kwargs) -> (params, metrics) function as a trainer.

Use via TrainingEvaluator.from_function() rather than constructing directly.

__init__(fn, name='custom', config=None)[source]
train(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params=None, warm_start_weights=None, train_start_date=None, train_end_date=None, test_end_date=None)[source]
Parameters:
  • data_dict (dict)

  • train_start_idx (int)

  • train_end_idx (int)

  • pool (Any)

  • run_fingerprint (dict)

  • n_assets (int)

  • warm_start_params (Dict | None)

  • warm_start_weights (Any | None)

  • train_start_date (str | None)

  • train_end_date (str | None)

  • test_end_date (str | None)

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]
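
The wrapper pattern described above, a base class with a `train` method plus an adapter around a plain function, can be sketched in isolation. `TrainerSketch` and `FunctionTrainerSketch` are hypothetical stand-ins, not the library classes, and they use a shortened argument list for brevity.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class TrainerSketch:
    """Hypothetical stand-in for a trainer base class."""

    def __init__(self, name: str = "trainer", config: Optional[Dict[str, Any]] = None):
        self._name = name
        self._config = dict(config or {})

    @property
    def name(self) -> str:
        return self._name

    @property
    def config(self) -> Dict[str, Any]:
        return self._config

    def train(self, data_dict, train_start_idx, train_end_idx, pool,
              run_fingerprint, n_assets, warm_start_params=None
              ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        raise NotImplementedError("subclasses must implement train()")


class FunctionTrainerSketch(TrainerSketch):
    """Adapter that forwards train() to a user-supplied function."""

    def __init__(self, fn: Callable, name: str = "custom", config=None):
        super().__init__(name, config)
        self._fn = fn

    def train(self, data_dict, train_start_idx, train_end_idx, pool,
              run_fingerprint, n_assets, warm_start_params=None):
        params, metadata = self._fn(data_dict, train_start_idx, train_end_idx,
                                    pool, run_fingerprint, n_assets,
                                    warm_start_params)
        # Copy the metadata so later bookkeeping cannot mutate the caller's dict.
        return params, dict(metadata)


def toy_trainer(data_dict, train_start_idx, train_end_idx, pool,
                run_fingerprint, n_assets, warm_start_params=None):
    # Toy logic: report the window length instead of training anything.
    return {"window": train_end_idx - train_start_idx}, {"epochs": 1}


wrapped = FunctionTrainerSketch(toy_trainer, config={"note": "demo"})
params, metadata = wrapped.train({}, 10, 60, None, {}, 2)
```

The adapter keeps the evaluator's calling convention in one place, so a custom function only has to return a `(params, metadata)` pair.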

class ExistingRunnerWrapper(runner_name, runner_kwargs=None, compute_rademacher=False, root=None)[source]

Bases: TrainerWrapper

Wrap an existing runner (train_on_historic_data, etc).

This creates a thin adapter that calls the existing runner with appropriate parameters.

Parameters:
  • runner_name (str)

  • runner_kwargs (Dict | None)

  • compute_rademacher (bool)

  • root (str | None)

__init__(runner_name, runner_kwargs=None, compute_rademacher=False, root=None)[source]
Parameters:
  • runner_name (str)

  • runner_kwargs (Dict | None)

  • compute_rademacher (bool)

  • root (str | None)

train(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params=None, warm_start_weights=None, train_start_date=None, train_end_date=None, test_end_date=None)[source]

Call the existing runner.

Note: This adapts the cycle-based interface to the existing runners, which expect a full run_fingerprint with date strings. The cycle's date strings are written into the fingerprint so that each cycle trains on different data.

Parameters:
  • data_dict (dict)

  • train_start_idx (int)

  • train_end_idx (int)

  • pool (Any)

  • run_fingerprint (dict)

  • n_assets (int)

  • warm_start_params (Dict | None)

  • warm_start_weights (Any | None)

  • train_start_date (str | None)

  • train_end_date (str | None)

  • test_end_date (str | None)

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]

class RandomBaselineWrapper(seed=42)[source]

Bases: TrainerWrapper

Baseline: Random parameters.

Use to check if your trainer beats random chance.

Parameters:

seed (int)

__init__(seed=42)[source]
Parameters:

seed (int)

train(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params=None, warm_start_weights=None, train_start_date=None, train_end_date=None, test_end_date=None)[source]

Return random parameters (ignores warm-start and date strings).

Parameters:
  • data_dict (dict)

  • train_start_idx (int)

  • train_end_idx (int)

  • pool (Any)

  • run_fingerprint (dict)

  • n_assets (int)

  • warm_start_params (Dict | None)

  • warm_start_weights (Any | None)

  • train_start_date (str | None)

  • train_end_date (str | None)

  • test_end_date (str | None)

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]

class TrainingEvaluator(trainer, n_cycles=5, keep_fixed_start=False, compute_rademacher=False, verbose=True, root=None, wfe_metric='sharpe')[source]

Bases: object

Evaluates whether a training approach is effective.

Wraps any trainer and runs walk-forward evaluation to assess effectiveness using WFE and Rademacher metrics.

Pruning

This evaluator yields CycleEvaluation results via evaluate_iter(), allowing the consumer (e.g., HyperparamTuner) to decide when to prune. The evaluator itself does not prune - it evaluates all cycles unless the consumer stops iterating. This design keeps pruning logic in one place (the Optuna integration) rather than duplicating it here.
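
The division of labour described above, where the evaluator yields and the consumer decides when to stop, can be sketched without Optuna. Everything below is hypothetical: `fake_evaluate_iter` stands in for the real generator, and the running-mean rule with a warm-up is just one possible pruning criterion.

```python
from statistics import fmean


def fake_evaluate_iter(oos_sharpes):
    """Hypothetical stand-in for evaluate_iter: yields one result per cycle."""
    for cycle_number, sharpe in enumerate(oos_sharpes):
        yield cycle_number, sharpe


def run_with_pruning(cycle_iter, min_running_mean=0.0, warmup_cycles=2):
    """Consume cycles until the running mean OOS Sharpe drops below a floor.

    Returns (sharpes_seen, pruned). Later cycles are never evaluated once the
    consumer stops iterating, because the generator is lazy.
    """
    seen = []
    for _cycle_number, sharpe in cycle_iter:
        seen.append(sharpe)
        if len(seen) >= warmup_cycles and fmean(seen) < min_running_mean:
            return seen, True
    return seen, False


pruned_run = run_with_pruning(fake_evaluate_iter([0.5, -2.0, 1.0, 1.0]))
full_run = run_with_pruning(fake_evaluate_iter([1.0, 1.0, 1.0]))
```

In the first run the mean after two cycles is -0.75, so the consumer stops and the remaining two cycles are never computed.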

__init__(trainer, n_cycles=5, keep_fixed_start=False, compute_rademacher=False, verbose=True, root=None, wfe_metric='sharpe')[source]
classmethod from_runner(runner_name, n_cycles=5, keep_fixed_start=False, verbose=True, compute_rademacher=False, root=None, wfe_metric='sharpe', **runner_kwargs)[source]

Create evaluator from an existing runner.

Parameters:
  • runner_name (str) – One of: “train_on_historic_data”, “multi_period_sgd”

  • n_cycles (int) – Number of walk-forward cycles

  • verbose (bool) – Print progress

  • compute_rademacher (bool) – Enable Rademacher complexity computation. This enables checkpoint tracking in the trainer, which saves intermediate returns during training for Rademacher estimation. Default False.

  • root (str, optional) – Root directory for data files. If None, uses default data location.

  • wfe_metric (str) – Metric to use for WFE and IS-OOS gap computation. Default “sharpe” (per Pardo). Can be any metric from calculate_period_metrics (sharpe, calmar, sterling, etc.)

  • **runner_kwargs – Arguments passed to the runner (e.g., max_iterations=500)

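  • keep_fixed_start (bool) – If True, expanding window (train always starts from beginning). If False, rolling window (train window moves forward).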

Return type:

TrainingEvaluator

Example

>>> evaluator = TrainingEvaluator.from_runner(
...     "train_on_historic_data",
...     max_iterations=500,
...     compute_rademacher=True,  # Enable Rademacher complexity
... )
classmethod from_function(fn, name='custom', n_cycles=5, keep_fixed_start=False, verbose=True, root=None, wfe_metric='sharpe', **config)[source]

Create evaluator from a custom training function.

Parameters:
  • fn (Callable) – Function with signature fn(data_dict, train_start_idx, train_end_idx, pool, run_fingerprint, n_assets, warm_start_params) -> (params, metadata).

  • name (str) – Name for this trainer

  • n_cycles (int) – Number of walk-forward cycles

  • keep_fixed_start (bool) – If True, expanding window (train always starts from beginning). If False, rolling window (train window moves forward).

  • root (str, optional) – Root directory for data files. If None, uses default data location.

  • wfe_metric (str) – Metric to use for WFE and IS-OOS gap computation. Default “sharpe”.

  • **config – Config dict for reporting

  • verbose (bool) – Print progress

Return type:

TrainingEvaluator

Example

>>> def my_trainer(data_dict, train_start_idx, train_end_idx, pool,
...                run_fingerprint, n_assets, warm_start_params=None):
...     # Your training logic
...     return params, {"epochs": 100}
>>>
>>> evaluator = TrainingEvaluator.from_function(my_trainer)
classmethod random_baseline(seed=42, n_cycles=5, keep_fixed_start=False, verbose=True, root=None, wfe_metric='sharpe')[source]

Create evaluator that uses random parameters.

Use this as a baseline to verify your trainer beats random chance.

Parameters:
  • seed (int) – Random seed for reproducibility

  • n_cycles (int) – Number of walk-forward cycles

  • keep_fixed_start (bool) – If True, expanding window. If False, rolling window.

  • verbose (bool) – Print progress

  • root (str, optional) – Root directory for data files. If None, uses default data location.

  • wfe_metric (str) – Metric to use for WFE and IS-OOS gap computation. Default “sharpe”.

Return type:

TrainingEvaluator

evaluate_iter(run_fingerprint)[source]

Generator that yields CycleEvaluation after each cycle completes.

This allows callers to inspect intermediate results and potentially stop early (e.g., for Optuna pruning).

Yields:

CycleEvaluation – Results from each completed cycle

Returns:

Final aggregated results (the generator's return value, available as StopIteration.value once the generator is exhausted)

Return type:

EvaluationResult

Parameters:

run_fingerprint (dict)

Example

>>> evaluator = TrainingEvaluator.from_runner("train_on_historic_data")
>>> gen = evaluator.evaluate_iter(run_fingerprint)
>>> for cycle_eval in gen:
...     print(f"Cycle {cycle_eval.cycle_number}: OOS Sharpe = {cycle_eval.oos_sharpe}")
...     if cycle_eval.oos_sharpe < -1.0:
...         break  # Stop early if terrible
>>> # A for loop discards the generator's return value. To capture the final
>>> # EvaluationResult, call next(gen) manually and read StopIteration.value.
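
In Python, a generator's return value travels on the StopIteration exception, so it has to be captured explicitly. The sketch below shows the general pattern with a toy generator; `toy_evaluate_iter` and `drain` are hypothetical helpers, not part of the library.

```python
def toy_evaluate_iter(scores):
    """Toy generator: yields each score, then returns their mean."""
    for score in scores:
        yield score
    return sum(scores) / len(scores)


def drain(gen):
    """Consume a generator and return (yielded_items, return_value)."""
    items = []
    while True:
        try:
            items.append(next(gen))
        except StopIteration as stop:
            # The generator's return value is carried on the exception.
            return items, stop.value


yielded, final = drain(toy_evaluate_iter([1.0, 2.0, 3.0]))
```

If the loop is abandoned early, the generator never reaches its return statement, so no aggregated result is produced.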
evaluate(run_fingerprint)[source]

Run walk-forward evaluation.

Parameters:

run_fingerprint (dict) – Run configuration

Returns:

Comprehensive evaluation results

Return type:

EvaluationResult

print_report(result)[source]

Print a human-readable evaluation report to stdout.

Shows per-cycle IS/OOS metrics in a tabular layout, aggregate statistics, Rademacher diagnostics (if available), and the effectiveness verdict.

Parameters:

result (EvaluationResult) – Completed evaluation result to display.

compare_trainers(run_fingerprint, trainers, verbose=True)[source]

Compare multiple trainers on the same data.

Parameters:
  • run_fingerprint (dict) – Run configuration

  • trainers (Dict[str, TrainingEvaluator]) – Dictionary of name -> evaluator

  • verbose (bool) – Print progress and summary

Returns:

Results keyed by trainer name

Return type:

Dict[str, EvaluationResult]

Example

>>> results = compare_trainers(
...     run_fingerprint,
...     trainers={
...         "sgd_500": TrainingEvaluator.from_runner(
...             "train_on_historic_data", max_iterations=500
...         ),
...         "sgd_100": TrainingEvaluator.from_runner(
...             "train_on_historic_data", max_iterations=100
...         ),
...         "random": TrainingEvaluator.random_baseline(),
...     },
... )