
Model comparison

This section documents model comparison utilities provided by eb-evaluation.

Model comparison utilities support side-by-side evaluation of multiple models across metrics, diagnostics, and operational outcomes.

eb_evaluation.model_selection.compare

Forecast comparison and cost-aware model selection helpers.

This module provides evaluation-oriented utilities built on top of eb_metrics.metrics:

  • compare_forecasts computes CWSL and related diagnostics for multiple forecast vectors against a common target series.
  • select_model_by_cwsl fits candidate estimators (using their native training objective) and selects the model with the lowest validation CWSL.
  • select_model_by_cwsl_cv performs K-fold cross-validation, selecting the model with the lowest mean CWSL and refitting it on the full dataset.

CWSL is evaluated with asymmetric costs for underbuild and overbuild, typically summarized by a cost ratio:

\[ R = \frac{c_u}{c_o} \]

where \(c_u\) is the cost per unit of shortfall and \(c_o\) is the cost per unit of excess.
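The exact CWSL formula lives in eb_metrics and is not reproduced here. As a rough illustration of the asymmetric-cost idea, the following sketch treats CWSL as a (possibly weighted) mean of under- and overbuild costs; the function name `cwsl_sketch` and this normalization are assumptions, not the library's actual definition:

```python
import numpy as np

def cwsl_sketch(y_true, y_pred, cu, co, sample_weight=None):
    """Illustrative cost-weighted loss (assumption: the real
    eb_metrics CWSL may normalize or scale differently)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    shortfall = np.maximum(y_true - y_pred, 0.0)  # underbuild amount
    excess = np.maximum(y_pred - y_true, 0.0)     # overbuild amount
    return float(np.average(cu * shortfall + co * excess,
                            weights=sample_weight))

# With cu=2 and co=1 (R = 2), each unit of shortfall costs twice
# as much as each unit of excess:
loss = cwsl_sketch([10.0, 10.0], [8.0, 12.0], cu=2.0, co=1.0)  # → 3.0
```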

compare_forecasts(y_true, forecasts, cu, co, sample_weight=None, tau=2.0)

Compare multiple forecast models on the same target series.

For each forecast vector, compute CWSL and a standard set of diagnostics:

  • CWSL
  • NSL
  • UD
  • wMAPE
  • HR@tau
  • FRS
  • MAE
  • RMSE
  • MAPE

Parameters:

  • y_true (array-like of shape (n_samples,), required):
    Actual (ground-truth) values.

  • forecasts (Mapping[str, array-like], required):
    Mapping from model name to forecast vector. Each forecast must have shape (n_samples,).

  • cu (float or array-like of shape (n_samples,), required):
    Underbuild (shortfall) cost per unit.

  • co (float or array-like of shape (n_samples,), required):
    Overbuild (excess) cost per unit.

  • sample_weight (array-like of shape (n_samples,), default None):
    Optional non-negative weights per interval. Passed to metrics that support sample_weight (CWSL, NSL, UD, HR@tau, FRS). Metrics that are currently unweighted in eb_metrics (e.g., wMAPE, MAE, RMSE, MAPE) are computed without weights.

  • tau (float or array-like, default 2.0):
    Tolerance parameter for HR@tau. May be scalar or per-interval.

Returns:

  • DataFrame:
    DataFrame indexed by model name with columns ["CWSL", "NSL", "UD", "wMAPE", "HR@tau", "FRS", "MAE", "RMSE", "MAPE"].

Raises:

  • ValueError:
    If y_true is not 1D, if forecasts is empty, or if any forecast length is incompatible with y_true.
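The following self-contained sketch mirrors the documented shape of compare_forecasts (one row per model, metrics as columns) for a small subset of the metrics. The CWSL formula here is an assumption, and `compare_forecasts_sketch` is an illustrative stand-in, not the eb_evaluation implementation:

```python
import numpy as np
import pandas as pd

def compare_forecasts_sketch(y_true, forecasts, cu, co):
    """Hedged sketch: per-model CWSL (assumed formula), MAE, RMSE."""
    y = np.asarray(y_true, dtype=float)
    if y.ndim != 1:
        raise ValueError("y_true must be 1D")
    if not forecasts:
        raise ValueError("forecasts must not be empty")
    rows = {}
    for name, f in forecasts.items():
        f = np.asarray(f, dtype=float)
        if f.shape != y.shape:
            raise ValueError(f"forecast {name!r} has incompatible shape")
        err = f - y  # positive = overbuild, negative = underbuild
        rows[name] = {
            "CWSL": float(np.mean(cu * np.maximum(-err, 0.0)
                                  + co * np.maximum(err, 0.0))),
            "MAE": float(np.mean(np.abs(err))),
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

df = compare_forecasts_sketch(
    [10.0, 20.0], {"a": [10, 20], "b": [12, 18]}, cu=2.0, co=1.0)
```

A perfect forecast ("a") scores CWSL 0; "b" pays co for its 2-unit excess and cu for its 2-unit shortfall.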

select_model_by_cwsl(models, X_train, y_train, X_val, y_val, *, cu, co, sample_weight_val=None)

Fit multiple models, then select the best by validation CWSL.

Each estimator is fit on (X_train, y_train) using its native objective (typically MSE/RMSE), then evaluated on the validation set via CWSL:

\[ \text{CWSL} = \mathrm{cwsl}(y_{\mathrm{val}}, \hat{y}_{\mathrm{val}}; c_u, c_o) \]

The model with the lowest CWSL is returned, along with a compact results table.

Parameters:

  • models (dict[str, Any], required):
    Mapping from model name to an unfitted estimator implementing fit(X, y) and predict(X).

  • X_train (required):
    Training features used to fit each model.

  • y_train (required):
    Training targets used to fit each model.

  • X_val (required):
    Validation features, used only for evaluation.

  • y_val (required):
    Validation targets, used only for evaluation.

  • cu (float, required):
    Underbuild (shortfall) cost per unit for CWSL.

  • co (float, required):
    Overbuild (excess) cost per unit for CWSL.

  • sample_weight_val (array-like or None, default None):
    Optional per-interval weights for the validation set, passed to CWSL.

Returns:

  • best_name (str):
    Name of the model with the lowest CWSL on the validation set.

  • best_model (Any):
    Fitted estimator corresponding to best_name.

  • results (DataFrame):
    DataFrame indexed by model name with columns ["CWSL", "RMSE", "wMAPE"].

Raises:

  • ValueError:
    If no models are evaluated.

Notes
  • RMSE and wMAPE are computed unweighted (consistent with current eb_metrics behavior).
  • This function is intentionally simple and does not handle time-series splitting; callers should ensure the split is appropriate.
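A minimal sketch of the fit-then-select flow, assuming the CWSL formula below (the real eb_metrics definition may differ) and using a toy `ConstantModel` estimator invented here for illustration:

```python
import numpy as np
import pandas as pd

class ConstantModel:
    """Toy estimator (illustrative only): predicts a fixed
    quantile of the training targets."""
    def __init__(self, q):
        self.q = q
    def fit(self, X, y):
        self.level_ = float(np.quantile(y, self.q))
        return self
    def predict(self, X):
        return np.full(len(X), self.level_)

def select_model_by_cwsl_sketch(models, X_train, y_train, X_val, y_val,
                                *, cu, co):
    """Hedged sketch: fit each candidate with its native objective,
    score validation CWSL (assumed formula), return the minimum."""
    rows, fitted = {}, {}
    y_val = np.asarray(y_val, dtype=float)
    for name, model in models.items():
        fitted[name] = model.fit(X_train, y_train)
        err = np.asarray(model.predict(X_val), dtype=float) - y_val
        rows[name] = {
            "CWSL": float(np.mean(cu * np.maximum(-err, 0.0)
                                  + co * np.maximum(err, 0.0))),
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
        }
    if not rows:
        raise ValueError("no models were evaluated")
    results = pd.DataFrame.from_dict(rows, orient="index")
    best_name = results["CWSL"].idxmin()
    return best_name, fitted[best_name], results

# With shortfalls 10x as costly as excess, the higher-quantile
# candidate tends to win on CWSL even if its RMSE is worse:
X, y = np.zeros((8, 1)), np.arange(8.0)
best, _, table = select_model_by_cwsl_sketch(
    {"median": ConstantModel(0.5), "p90": ConstantModel(0.9)},
    X[:6], y[:6], X[6:], y[6:], cu=10.0, co=1.0)
```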

select_model_by_cwsl_cv(models, X, y, *, cu, co, cv=5, sample_weight=None)

Select a model by cross-validated CWSL and refit on the full dataset.

This is a simple K-fold cross-validation loop:

  1. Split indices into cv folds.
  2. For each model and fold:
       • fit on the other (cv - 1) folds
       • evaluate on the held-out fold using CWSL, RMSE, and wMAPE
  3. Aggregate metrics across folds for each model.
  4. Choose the model with the lowest mean CWSL.
  5. Refit the chosen model once on all data (X, y).
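The per-model scoring loop can be sketched as follows, using numpy.array_split for contiguous folds as the Notes describe. The `model_factory` argument and the CWSL formula are assumptions for illustration; the real helper also tracks RMSE/wMAPE and refits the winner:

```python
import numpy as np

def cv_cwsl_sketch(model_factory, X, y, *, cu, co, cv=5):
    """Hedged sketch: mean/std of per-fold CWSL (assumed formula)
    for one candidate, over contiguous K folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if cv < 2:
        raise ValueError("cv must be >= 2")
    indices = np.arange(len(y))
    scores = []
    for held_out in np.array_split(indices, cv):
        train = np.setdiff1d(indices, held_out)
        model = model_factory()          # fresh, unfitted estimator
        model.fit(X[train], y[train])
        err = (np.asarray(model.predict(X[held_out]), dtype=float)
               - y[held_out])
        scores.append(float(np.mean(cu * np.maximum(-err, 0.0)
                                    + co * np.maximum(err, 0.0))))
    return float(np.mean(scores)), float(np.std(scores))
```

Running this per candidate and picking the lowest mean score reproduces steps 2 through 4 of the loop above.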

Parameters:

  • models (dict[str, Any], required):
    Mapping from model name to an unfitted estimator implementing fit and predict.

  • X (array-like of shape (n_samples, n_features), required):
    Feature matrix.

  • y (array-like of shape (n_samples,), required):
    Target vector.

  • cu (float, required):
    Underbuild (shortfall) cost per unit for CWSL.

  • co (float, required):
    Overbuild (excess) cost per unit for CWSL.

  • cv (int, default 5):
    Number of folds. Must be >= 2.

  • sample_weight (numpy.ndarray of shape (n_samples,), default None):
    Optional per-sample weights used only for CWSL metric calculation. RMSE and wMAPE remain unweighted.

Returns:

  • best_name (str):
    Model name with the lowest mean CWSL across folds.

  • best_model (Any):
    The chosen estimator refit on all data.

  • results (DataFrame):
    DataFrame indexed by model name with columns CWSL_mean, CWSL_std, RMSE_mean, RMSE_std, wMAPE_mean, wMAPE_std, and n_folds.

Raises:

  • ValueError:
    If X/y dimensions mismatch, cv < 2, sample_weight length mismatch, or no models are evaluated.

Notes

This function uses a naive split of indices into contiguous folds via numpy.array_split. For time-series problems, callers should prefer time-aware splitting outside this helper.