Run an Evaluation¶
Evaluation is the process of measuring and comparing the behavior of forecasting systems under explicit criteria. In Electric Barometer, evaluation is treated as a distinct and reproducible stage that informs decisioning without making decisions itself.
This guide describes how to run an evaluation in a way that is consistent, interpretable, and compatible with the Electric Barometer ecosystem.
What an evaluation is (and is not)¶
An evaluation answers the question:
How do alternative forecasting systems behave under defined measurement criteria?
An evaluation does not answer:
- Which system should be chosen
- What action should be taken
- What policy is preferred
Those questions are resolved later through Decisioning and governance.
Keeping this boundary clear is essential for transparency and reproducibility.
When you should run an evaluation¶
Run an evaluation when:
- Comparing multiple forecasting systems or configurations
- Assessing the impact of new features or transforms
- Introducing or validating new metrics
- Investigating tradeoffs or sensitivity
- Preparing inputs for selection or policy tuning
Evaluations should be repeatable and inspectable, not one-off experiments.
Prerequisites¶
Before running an evaluation, ensure that:
- Forecasting systems are clearly defined
- Inputs and data slices are fixed and documented
- Metrics and parameters are explicitly selected
- The evaluation window and granularity are appropriate
- Assumptions about cost and context are understood
For conceptual grounding, review Problem Framing and Asymmetric Cost before proceeding.
Step 1: Define the evaluation scope¶
Begin by defining what is being evaluated.
This includes:
- Which forecasting systems or variants are included
- Which entities, segments, or time periods are in scope
- The temporal resolution of the evaluation
- Any exclusions or filters applied
The scope should align with the decision context the evaluation is meant to support, as described in Concepts.
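One lightweight way to make the scope explicit is to record it as structured data that travels with the run. The sketch below is illustrative only; the structure and field names (`EvaluationScope`, `systems`, `segments`, and so on) are assumptions made for this guide, not part of any Electric Barometer API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationScope:
    """Illustrative, immutable record of what a single evaluation covers."""

    systems: tuple[str, ...]          # forecasting systems or variants under comparison
    segments: tuple[str, ...]         # entities or segments in scope
    window_start: str                 # ISO dates bounding the evaluation window
    window_end: str
    resolution: str                   # temporal resolution, e.g. "daily" or "hourly"
    exclusions: tuple[str, ...] = ()  # filters applied, recorded explicitly


# Hypothetical scope for a run comparing a baseline against one candidate.
scope = EvaluationScope(
    systems=("baseline", "candidate_v2"),
    segments=("region_a", "region_b"),
    window_start="2024-01-01",
    window_end="2024-03-31",
    resolution="daily",
    exclusions=("holiday_outage",),
)
```

Because the scope is frozen and serializable, the same object can be attached to the evaluation outputs later, which keeps the measured results traceable to what was actually in scope.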
Step 2: Select evaluation metrics¶
Choose metrics that reflect the behavior you want to measure.
Consider:
- Whether cost asymmetry is relevant
- Which tradeoffs are important to surface
- How metrics aggregate across entities or time
- Whether multiple metrics are required
Metrics should be selected intentionally. If a new measure is required, see Add a Metric.
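As a concrete illustration, a metric selection can be expressed as a small, named registry of measurement functions. The functions, names, and the specific asymmetric weighting below are hypothetical examples chosen for this sketch, not a prescribed metric set.

```python
import statistics


def mae(actuals, forecasts):
    """Mean absolute error: symmetric and scale-dependent."""
    return statistics.fmean(abs(a - f) for a, f in zip(actuals, forecasts))


def asymmetric_cost(actuals, forecasts, under_weight=2.0, over_weight=1.0):
    """Weighted absolute error that penalizes under-forecasting more heavily."""
    costs = []
    for actual, forecast in zip(actuals, forecasts):
        shortfall = actual - forecast
        # A positive shortfall means the forecast was too low (an under-forecast).
        costs.append(under_weight * shortfall if shortfall > 0 else over_weight * -shortfall)
    return statistics.fmean(costs)


# An explicit, intentional selection of metrics for this run.
selected_metrics = {"mae": mae, "asymmetric_cost": asymmetric_cost}
```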
Step 3: Configure metric parameters¶
Many metrics require parameters that encode assumptions.
Examples include:
- Relative weighting of different error types
- Thresholds or tolerances
- Normalization or scaling choices
Parameter choices should be explicit and recorded as part of the evaluation configuration. These assumptions often reflect policy and should be consistent with Governance principles.
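One way to keep these assumptions explicit is to record parameters as plain configuration data that can be versioned and re-run exactly. The keys and values below (`under_weight`, `normalization`, `tolerance`) are illustrative assumptions, not required settings.

```python
import json

# The evaluation configuration, including metric parameters, is recorded as
# data rather than buried in code, so it can be reviewed and reproduced.
evaluation_config = {
    "metrics": {
        "mae": {},
        "asymmetric_cost": {"under_weight": 2.0, "over_weight": 1.0},
    },
    "normalization": "per_segment_mean",
    "tolerance": 0.05,
}

# Writing the configuration to a file makes it an artifact of the run (see Step 6).
with open("evaluation_config.json", "w") as fh:
    json.dump(evaluation_config, fh, indent=2)
```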
Step 4: Execute the evaluation¶
Run the evaluation using the configured inputs and metrics.
Execution should:
- Be deterministic given the same inputs
- Produce structured outputs
- Capture metadata describing the run
- Avoid modifying source data or forecasts
Evaluation execution is a measurement process, not an optimization step.
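A minimal sketch of such a measurement step is shown below, assuming forecasts and actuals are already loaded as simple series; the function name, input layout, and metadata fields are illustrative choices, not an Electric Barometer interface.

```python
import datetime
import platform


def run_evaluation(forecasts_by_system, actuals, metrics):
    """Apply each metric to each system and return structured results plus run metadata.

    `forecasts_by_system` maps a system name to its forecast series, `actuals`
    is the observed series, and `metrics` maps metric names to callables.
    Inputs are only read, never modified.
    """
    results = {
        system: {name: fn(actuals, forecasts) for name, fn in metrics.items()}
        for system, forecasts in forecasts_by_system.items()
    }
    metadata = {
        "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "systems": sorted(forecasts_by_system),
        "metrics": sorted(metrics),
    }
    return {"results": results, "metadata": metadata}


# Hypothetical usage with toy inputs and a single inline metric.
actuals = [10.0, 12.0, 9.0]
forecasts_by_system = {
    "baseline": [11.0, 11.0, 10.0],
    "candidate_v2": [10.0, 12.5, 9.5],
}
metrics = {"mae": lambda a, f: sum(abs(x - y) for x, y in zip(a, f)) / len(a)}
report = run_evaluation(forecasts_by_system, actuals, metrics)
```

Because all variation is captured in the arguments and the function only reads its inputs, re-running it with the same inputs yields the same metric values.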
Step 5: Inspect evaluation outputs¶
After execution, review the evaluation results.
Inspection may involve:
- Comparing metrics across systems
- Examining distributions or segment-level behavior
- Identifying tradeoffs or sensitivities
- Checking for anomalies or unexpected patterns
At this stage, evaluation outputs should be interpreted—but not acted upon—until readiness and policy considerations are applied.
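For example, a simple side-by-side view can make tradeoffs visible without implying a decision. The helper below and the placeholder numbers it prints are purely illustrative; in this sketch one system does better on a symmetric metric while the other does better on an asymmetric one, which is exactly the kind of tension inspection should surface.

```python
def compare_systems(results):
    """Print a side-by-side comparison of metric values per system."""
    metric_names = sorted({m for scores in results.values() for m in scores})
    print(f"{'system':<15}" + "".join(f"{m:>18}" for m in metric_names))
    for system, scores in sorted(results.items()):
        print(f"{system:<15}" + "".join(f"{scores[m]:>18.4f}" for m in metric_names))


# Placeholder values only, used to illustrate the comparison layout.
compare_systems({
    "baseline":     {"mae": 0.91, "asymmetric_cost": 1.42},
    "candidate_v2": {"mae": 0.87, "asymmetric_cost": 1.55},
})
```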
Step 6: Preserve evaluation artifacts¶
Evaluation artifacts are essential for governance and reproducibility.
Artifacts may include:
- Metric outputs and summaries
- Configuration files or parameters
- Input identifiers or hashes
- Execution metadata
Preserving these artifacts enables auditability and supports later review under Governance.
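A minimal sketch of artifact preservation is shown below, assuming results and configuration are JSON-serializable and inputs are files on disk; the directory layout and file names are illustrative choices, not a mandated structure.

```python
import hashlib
import json
from pathlib import Path


def preserve_artifacts(run_dir, config, results, metadata, input_files):
    """Write configuration, results, metadata, and input hashes to a run directory."""
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hashing inputs records exactly what the evaluation was run against,
    # without copying or modifying the source data itself.
    input_hashes = {
        str(path): hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for path in input_files
    }

    (out / "config.json").write_text(json.dumps(config, indent=2))
    (out / "results.json").write_text(json.dumps(results, indent=2))
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))
    (out / "input_hashes.json").write_text(json.dumps(input_hashes, indent=2))
```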
What comes after evaluation¶
Evaluation produces information, not decisions.
Typical next steps include:
- Applying readiness adjustments (see Readiness and RAL)
- Comparing systems under explicit policy rules
- Performing sensitivity analysis
- Selecting or tuning systems
These steps belong to decisioning and optimization, not evaluation itself.
Governance considerations¶
Evaluations influence downstream decisions and must therefore be governed.
Good governance practices include:
- Documenting evaluation intent and scope
- Avoiding ad hoc metric selection
- Preserving historical evaluation results
- Ensuring consistency across comparable runs
For a full discussion, see Governance.
How evaluation fits into the Electric Barometer lifecycle¶
Within the Electric Barometer framework:
- Feature transforms define system inputs (see Add a Feature Transform)
- Forecasting systems produce candidate outputs
- Evaluation measures behavior under explicit criteria
- Readiness and policy layers contextualize results
- Decisioning selects or tunes systems
- Governance ensures traceability and accountability
Evaluation is the measurement backbone of this lifecycle.
Where to go next¶
- Add or refine metrics using Add a Metric
- Apply policy and tuning in Optimization
- Review Evaluation vs Decisioning to reinforce role boundaries
Running an evaluation is not about declaring winners. It is about producing reliable, decision-relevant information. Electric Barometer is designed to make that process explicit and reproducible.