A calibrated model’s predicted probabilities match observed frequencies — predict 70%, see 70% happen. Measure calibration with the Brier decomposition (Brier = Reliability - Resolution + Uncertainty), ECE, or the Hosmer-Lemeshow test. Fix miscalibration with isotonic regression or Platt scaling. An uncalibrated model destroys Kelly sizing and every downstream decision.

Why This Matters for Agents

An autonomous betting agent’s entire decision pipeline depends on one assumption: its probability estimates are accurate. Kelly sizing takes a probability as input. Expected value calculations require a probability. Bankroll growth models assume the input probabilities are real. If the agent’s model says 65% but the true frequency is 52%, every downstream calculation is wrong — Kelly oversizes the bet, EV is overestimated, and the agent bleeds money.

This is Layer 4 — Intelligence. Calibration evaluation sits at the output of the model pipeline, right before predictions flow into the decision engine. In the Agent Betting Stack, calibration is the quality gate between “model produced a number” and “agent acts on that number.” Polyseer uses calibration audits in its multi-agent Bayesian architecture to weight individual model outputs — models with worse calibration get lower weight in the ensemble. An agent that skips calibration monitoring is flying blind with a false sense of precision.

The Math

What Calibration Means Formally

A probabilistic model is calibrated if:

P(Y = 1 | f(X) = p) = p,  for all p in [0, 1]

where f(X) is the model’s predicted probability and Y is the binary outcome. In words: among all instances where the model predicts probability p, the observed frequency of the positive outcome equals p.

If your model says “70% chance the Lakers cover -3.5” for 200 different games, calibration demands that approximately 140 of those games result in a Lakers cover.
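That count check is easy to simulate (a minimal sketch; the 70% true cover rate is an assumption matching the model's claim, i.e., the calibrated case):

```python
import numpy as np

rng = np.random.default_rng(0)
preds = np.full(200, 0.70)         # 200 games where the model said 70%
covers = rng.random(200) < 0.70    # outcomes simulated at the true rate
print(covers.mean())               # observed frequency lands near 0.70
```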

The Reliability Diagram (Calibration Plot)

The reliability diagram is the primary visual diagnostic. Construction:

  1. Sort all N predictions by predicted probability.
  2. Bin them into K groups (typically K = 10, equal-width bins from 0 to 1).
  3. For each bin k, compute the mean predicted probability p_k and the observed frequency o_k.
  4. Plot p_k on the x-axis, o_k on the y-axis.

Perfect calibration is the 45-degree diagonal. Points above the diagonal indicate underconfidence (model says 40%, reality is 55%). Points below indicate overconfidence (model says 80%, reality is 65%).

Reliability Diagram

Observed  1.0 |                                        *  /
Frequency     |                                      /
              |                                    /
          0.8 |                              *   /
              |                                /
              |                          *   /     * overconfident
          0.6 |                        /       (below diagonal)
              |                  *   /
              |                    /
          0.4 |              *   /
              |                /
              |          *   /     * underconfident
          0.2 |            /       (above diagonal)
              |      *   /
              |        /
          0.0 |------/-----------------------------------
              0.0   0.2   0.4   0.6   0.8   1.0
                        Predicted Probability
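The four construction steps translate directly to a few lines of NumPy (a minimal sketch returning the (p_k, o_k) pairs to plot; the fuller compute_calibration in the Implementation section does the same with metrics added):

```python
import numpy as np

def reliability_table(preds, outcomes, n_bins=10):
    """Steps 1-4: bin predictions, return (mean predicted, observed freq, count) per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(preds, edges[1:-1])   # bin index 0..n_bins-1 per prediction
    rows = []
    for k in range(n_bins):
        m = idx == k
        if m.any():                          # skip empty bins
            rows.append((preds[m].mean(), outcomes[m].mean(), int(m.sum())))
    return rows

# Toy call: five predictions, two occupied bins
preds = np.array([0.62, 0.58, 0.71, 0.55, 0.68])
outs = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
for p_k, o_k, n_k in reliability_table(preds, outs, n_bins=5):
    print(f"p_k={p_k:.2f}  o_k={o_k:.2f}  n={n_k}")
```

Plotting each (p_k, o_k) pair against the diagonal reproduces the diagram above.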

Brier Score and Its Decomposition

The Brier score is the mean squared error of probabilistic predictions:

Brier = (1/N) * sum((f_i - y_i)^2),  i = 1..N

where f_i is the predicted probability for event i and y_i is the outcome (0 or 1). Lower is better. Range: 0 (perfect) to 1 (worst).

The Brier score decomposes into three components that isolate distinct model properties:

Brier = Reliability - Resolution + Uncertainty

Reliability (calibration error — lower is better):

Reliability = (1/N) * sum(n_k * (o_k - p_k)^2),  k = 1..K

where n_k is the number of predictions in bin k, o_k is the observed frequency in bin k, and p_k is the mean predicted probability in bin k. This is the weighted average squared distance from the diagonal on the reliability diagram.

Resolution (discrimination power — higher is better):

Resolution = (1/N) * sum(n_k * (o_k - o_bar)^2),  k = 1..K

where o_bar is the overall base rate (total positives / N). Resolution measures how much the model’s predictions vary from always predicting the base rate. A model that always says 50% has zero resolution.

Uncertainty (irreducible — property of the dataset):

Uncertainty = o_bar * (1 - o_bar)

This is the variance of a Bernoulli distribution at the base rate. You cannot reduce this term — it’s determined by how balanced your outcome set is.

The key insight: A good betting model needs low Reliability AND high Resolution. Low Reliability alone is trivial — predict the base rate for everything and you get Reliability = 0. Resolution is what separates a useful model from a useless one.
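The decomposition can be verified numerically. One caveat worth knowing: the identity is exact only when each distinct predicted value gets its own bin (Murphy's decomposition); with finite-width bins a small within-bin variance term appears. A minimal sketch on toy data with two distinct predicted values:

```python
import numpy as np

# Two distinct predicted values, so each value is its own bin and the identity is exact
preds = np.array([0.2] * 10 + [0.8] * 10)
outs = np.array([1] * 3 + [0] * 7 + [1] * 7 + [0] * 3, dtype=float)

brier = np.mean((preds - outs) ** 2)
o_bar = outs.mean()
rel = res = 0.0
for p in np.unique(preds):
    m = preds == p
    o_k = outs[m].mean()                          # observed frequency in this bin
    rel += m.sum() * (o_k - p) ** 2 / len(preds)
    res += m.sum() * (o_k - o_bar) ** 2 / len(preds)
unc = o_bar * (1 - o_bar)

print(round(brier, 2), round(rel - res + unc, 2))   # 0.22 0.22
```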

Expected Calibration Error (ECE)

ECE is the weighted average absolute calibration error across bins:

ECE = sum((n_k / N) * |o_k - p_k|),  k = 1..K

ECE is easier to interpret than Brier’s reliability component because it uses absolute error rather than squared error. Benchmarks for sports betting models:

ECE Range      Interpretation
< 0.02         Excellent calibration
0.02 - 0.05    Good — acceptable for production
0.05 - 0.10    Mediocre — recalibrate before deploying
> 0.10         Poor — model outputs are unreliable

Maximum Calibration Error (MCE)

MCE = max(|o_k - p_k|),  k = 1..K

MCE catches the worst-case bin. A model with ECE = 0.03 but MCE = 0.15 has a severe miscalibration pocket — dangerous if the agent trades heavily in that probability range.

Hosmer-Lemeshow Test

The Hosmer-Lemeshow (HL) test provides a formal statistical test of calibration. The test statistic:

H = sum(n_k * (o_k - p_k)^2 / (p_k * (1 - p_k))),  k = 1..K

Under the null hypothesis of perfect calibration, H follows a chi-squared distribution with K - 2 degrees of freedom. A p-value below 0.05 indicates statistically significant miscalibration.

Limitations: the HL test is sensitive to bin count K and sample size. With very large samples (N > 10,000), even trivially small miscalibration produces a significant result. Use it alongside ECE and the reliability diagram, not as a standalone verdict.

The Calibration-Sharpness Tradeoff

Sharpness is the variance of predicted probabilities:

Sharpness = (1/N) * sum((f_i - f_bar)^2)

where f_bar is the mean prediction. A model that always predicts 0.50 has zero sharpness. A model that predicts 0.05, 0.10, 0.85, 0.95 has high sharpness.

The tradeoff: increasing sharpness without maintaining calibration makes the model worse. A model predicting 0.95 and 0.05 (high sharpness) that’s actually 0.70 and 0.30 (poor calibration) is more dangerous than a model predicting 0.55 and 0.45 with correct calibration. The first model will oversize Kelly bets and amplify losses.
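The damage is quantifiable through Kelly's expected log-growth (a minimal sketch assuming an even-money payout; g below is the true per-bet growth rate, and g < 0 means the bankroll shrinks on average):

```python
import numpy as np

b = 1.0                                   # even-money payout (decimal odds 2.0)
p_model, p_true = 0.95, 0.70
f = (b * p_model - (1 - p_model)) / b     # Kelly fraction from the model's number
g = p_true * np.log(1 + f) + (1 - p_true) * np.log(1 - f)   # true expected log-growth
print(round(f, 2), round(g, 3))           # 0.9 -0.241 — the bankroll shrinks every bet
```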

The goal is maximum sharpness subject to calibration. This is formalized by the Brier decomposition: Resolution rewards sharpness, Reliability penalizes miscalibration. Optimize both simultaneously.

Post-Hoc Calibration: Fixing a Miscalibrated Model

Two standard methods recalibrate model outputs without retraining:

Platt Scaling fits a logistic sigmoid to map raw outputs to calibrated probabilities:

P_calibrated = 1 / (1 + exp(-(A * f + B)))

where f is the raw model output and A, B are parameters fit by maximum likelihood on a held-out calibration set. Two parameters, fast to fit, works well when miscalibration is approximately sigmoidal.

Isotonic Regression fits a non-parametric, monotonically increasing step function. No functional form assumption. Maps raw predictions to calibrated probabilities using a piecewise-constant function that minimizes squared error subject to the monotonicity constraint (higher raw score → higher calibrated probability).

Method               Parameters       Assumption               Best When
Platt Scaling        2 (A, B)         Sigmoid miscalibration   Small calibration set (< 1,000)
Isotonic Regression  Non-parametric   Monotonicity only        Large calibration set (> 5,000)

Both require a held-out calibration dataset — never calibrate on training data.

Log-Loss as an Alternative to Brier

Log-loss (binary cross-entropy) is another proper scoring rule:

LogLoss = -(1/N) * sum(y_i * log(f_i) + (1 - y_i) * log(1 - f_i))

Log-loss penalizes confident wrong predictions more severely than Brier. Predicting 0.99 for an event that doesn’t happen produces log-loss = -log(0.01) = 4.61, while Brier = (0.99)^2 = 0.98. This makes log-loss the preferred metric when the agent uses Kelly sizing, because Kelly’s logarithmic utility function aligns with log-loss’s penalty structure. The connection to information theory is direct: log-loss measures the excess bits of surprise in the model’s predictions.
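The asymmetry in the single-event example above checks out numerically (plain NumPy sketch):

```python
import numpy as np

f, y = 0.99, 0    # confident prediction; the event does not happen
log_loss = -(y * np.log(f) + (1 - y) * np.log(1 - f))
brier = (f - y) ** 2
print(round(log_loss, 2), round(brier, 2))   # 4.61 0.98
```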

Worked Examples

Example 1: NFL Game Prediction Model Calibration

An agent’s NFL model produces win probabilities for 256 regular-season games. The agent groups predictions into 10 bins:

Bin (Predicted)   Count   Wins   Observed Freq   |Error|
0.00 - 0.10          8      0        0.000         0.050
0.10 - 0.20         15      2        0.133         0.017
0.20 - 0.30         22      7        0.318         0.068
0.30 - 0.40         31     11        0.355         0.005
0.40 - 0.50         38     18        0.474         0.024
0.50 - 0.60         42     24        0.571         0.021
0.60 - 0.70         35     23        0.657         0.007
0.70 - 0.80         30     22        0.733         0.017
0.80 - 0.90         22     18        0.818         0.032
0.90 - 1.00         13     13        1.000         0.050

Base rate o_bar = 138/256 = 0.539

ECE = (8/256)(0.050) + (15/256)(0.017) + (22/256)(0.068) + (31/256)(0.005) + (38/256)(0.024) + (42/256)(0.021) + (35/256)(0.007) + (30/256)(0.017) + (22/256)(0.032) + (13/256)(0.050) = 0.024

MCE = 0.068 (in the 0.20-0.30 bin)

ECE = 0.024 falls in the “good” range. MCE = 0.068 suggests slight underconfidence in the 20-30% range — the model says ~25% but outcomes occur ~32% of the time. An agent betting NFL underdogs in the +250 to +350 range should note this bias.
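The weighted sum can be reproduced directly from the bin table (counts and |Error| columns copied as-is):

```python
import numpy as np

counts = np.array([8, 15, 22, 31, 38, 42, 35, 30, 22, 13])
errors = np.array([0.050, 0.017, 0.068, 0.005, 0.024, 0.021,
                   0.007, 0.017, 0.032, 0.050])
ece = np.sum(counts * errors) / counts.sum()
print(round(ece, 3), errors.max())   # 0.024 0.068
```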

Example 2: Polymarket Political Market Predictions

An agent tracks its predictions across 80 Polymarket political resolution events over a quarter. The agent’s model predicted probabilities at time of trade:

Agent predicted 0.85 for "Will X pass the Senate?" → Resolved YES
Agent predicted 0.30 for "Will Y win primary?"     → Resolved NO
Agent predicted 0.72 for "Will debt ceiling raise?" → Resolved YES
Agent predicted 0.15 for "Will Z resign by March?" → Resolved NO
Agent predicted 0.60 for "Will bill pass committee?"→ Resolved YES
(80 total resolved events across Q1 2026)

After 80 events, the agent computes Brier decomposition:

Brier Score:  0.187
Reliability:  0.012   (good calibration)
Resolution:   0.074   (moderate discrimination)
Uncertainty:  0.249   (base rate ~50%)

Reliability of 0.012 confirms calibration is tight. Resolution of 0.074 against an Uncertainty of 0.249 means the model explains ~30% of the reducible variance — meaningful edge but room to improve. The agent’s Brier of 0.187 beats the market baseline of 0.249 (always predicting 50%), confirming positive skill.

For comparison, naively using BetOnline’s standard -110/-110 NFL spread prices as a model — 52.4% implied on each side — produces a Brier score near 0.250, no better than the base rate.
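A quick sanity check on the reported components, plus the Brier skill score against the base-rate baseline (BSS = 1 - Brier / Brier_baseline — standard forecast-verification terminology, not a formula from the text above):

```python
brier, reliability, resolution, uncertainty = 0.187, 0.012, 0.074, 0.249

# The decomposition identity holds for the reported numbers
assert abs(brier - (reliability - resolution + uncertainty)) < 1e-9

bss = 1 - brier / uncertainty   # skill relative to always predicting the base rate
print(round(bss, 3))            # 0.249
```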

Implementation

import numpy as np
from dataclasses import dataclass
from scipy import stats


@dataclass
class CalibrationReport:
    """Complete calibration diagnostic for a prediction model."""
    ece: float
    mce: float
    brier: float
    reliability: float
    resolution: float
    uncertainty: float
    hl_statistic: float
    hl_pvalue: float
    bin_edges: np.ndarray
    bin_counts: np.ndarray
    bin_pred_means: np.ndarray
    bin_obs_freqs: np.ndarray


def compute_calibration(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10
) -> CalibrationReport:
    """
    Full calibration analysis of a binary prediction model.

    Args:
        predictions: Array of predicted probabilities in [0, 1].
        outcomes: Array of binary outcomes (0 or 1).
        n_bins: Number of equal-width bins for calibration analysis.

    Returns:
        CalibrationReport with all calibration metrics.
    """
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(predictions)
    o_bar = outcomes.mean()

    # Bin predictions
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_indices = np.digitize(predictions, bin_edges[1:-1])

    bin_counts = np.zeros(n_bins)
    bin_pred_means = np.zeros(n_bins)
    bin_obs_freqs = np.zeros(n_bins)

    for k in range(n_bins):
        mask = bin_indices == k
        bin_counts[k] = mask.sum()
        if bin_counts[k] > 0:
            bin_pred_means[k] = predictions[mask].mean()
            bin_obs_freqs[k] = outcomes[mask].mean()

    # Filter non-empty bins for metrics
    nonempty = bin_counts > 0
    n_k = bin_counts[nonempty]
    p_k = bin_pred_means[nonempty]
    o_k = bin_obs_freqs[nonempty]

    # ECE: weighted average absolute calibration error
    ece = np.sum((n_k / n) * np.abs(o_k - p_k))

    # MCE: maximum calibration error
    mce = np.max(np.abs(o_k - p_k))

    # Brier score
    brier = np.mean((predictions - outcomes) ** 2)

    # Brier decomposition
    reliability = np.sum(n_k * (o_k - p_k) ** 2) / n
    resolution = np.sum(n_k * (o_k - o_bar) ** 2) / n
    uncertainty = o_bar * (1 - o_bar)

    # Hosmer-Lemeshow test
    # Avoid division by zero for bins where p_k is 0 or 1
    hl_valid = (p_k > 0) & (p_k < 1)
    if hl_valid.sum() >= 3:
        hl_n = n_k[hl_valid]
        hl_p = p_k[hl_valid]
        hl_o = o_k[hl_valid]
        hl_stat = np.sum(
            hl_n * (hl_o - hl_p) ** 2 / (hl_p * (1 - hl_p))
        )
        hl_df = hl_valid.sum() - 2
        hl_pval = 1 - stats.chi2.cdf(hl_stat, df=max(hl_df, 1))
    else:
        hl_stat = np.nan
        hl_pval = np.nan

    return CalibrationReport(
        ece=ece,
        mce=mce,
        brier=brier,
        reliability=reliability,
        resolution=resolution,
        uncertainty=uncertainty,
        hl_statistic=hl_stat,
        hl_pvalue=hl_pval,
        bin_edges=bin_edges,
        bin_counts=bin_counts,
        bin_pred_means=bin_pred_means,
        bin_obs_freqs=bin_obs_freqs,
    )


def platt_scaling(
    predictions: np.ndarray,
    outcomes: np.ndarray
) -> tuple[float, float]:
    """
    Fit Platt scaling parameters A, B to map raw predictions
    to calibrated probabilities via P = 1 / (1 + exp(-(A*f + B))).

    Args:
        predictions: Raw model outputs (training/calibration set).
        outcomes: Binary outcomes (0 or 1).

    Returns:
        Tuple (A, B) for the sigmoid transform.
    """
    from scipy.optimize import minimize

    def neg_log_likelihood(params):
        a, b = params
        z = a * predictions + b
        # Clip for numerical stability
        z = np.clip(z, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return -np.sum(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p))

    result = minimize(neg_log_likelihood, x0=[1.0, 0.0], method="Nelder-Mead")
    return result.x[0], result.x[1]


def apply_platt(predictions: np.ndarray, a: float, b: float) -> np.ndarray:
    """Apply fitted Platt scaling to new predictions."""
    z = a * predictions + b
    z = np.clip(z, -30, 30)
    return 1.0 / (1.0 + np.exp(-z))


def isotonic_calibration(
    cal_predictions: np.ndarray,
    cal_outcomes: np.ndarray,
    new_predictions: np.ndarray
) -> np.ndarray:
    """
    Fit isotonic regression on calibration data and apply to new predictions.

    Args:
        cal_predictions: Predictions from calibration set.
        cal_outcomes: Outcomes from calibration set.
        new_predictions: New predictions to calibrate.

    Returns:
        Calibrated probabilities for new_predictions.
    """
    from sklearn.isotonic import IsotonicRegression

    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(cal_predictions, cal_outcomes)
    return iso.predict(new_predictions)


def walk_forward_backtest(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    timestamps: np.ndarray,
    window_size: int = 500,
    step_size: int = 50
) -> list[CalibrationReport]:
    """
    Walk-forward calibration backtest to prevent look-ahead bias.

    Slides a window forward through time, computing calibration
    metrics only on past data at each step.

    Args:
        predictions: Full array of predicted probabilities.
        outcomes: Full array of binary outcomes.
        timestamps: Sortable timestamps for temporal ordering.
        window_size: Number of past predictions to evaluate.
        step_size: Number of predictions to advance each step.

    Returns:
        List of CalibrationReport for each window position.
    """
    sort_idx = np.argsort(timestamps)
    predictions = predictions[sort_idx]
    outcomes = outcomes[sort_idx]
    timestamps = timestamps[sort_idx]

    reports = []
    start = window_size
    while start <= len(predictions):
        window_preds = predictions[start - window_size:start]
        window_outs = outcomes[start - window_size:start]
        report = compute_calibration(window_preds, window_outs)
        reports.append(report)
        start += step_size

    return reports


def daily_calibration_audit(
    predictions: np.ndarray,
    outcomes: np.ndarray,
    ece_threshold: float = 0.05,
    mce_threshold: float = 0.12,
    min_samples: int = 100
) -> dict:
    """
    Automated calibration audit for agent deployment.

    Returns a status dict with pass/fail and diagnostics.
    An agent should run this daily and halt trading if calibration
    degrades beyond thresholds.

    Args:
        predictions: Recent prediction history.
        outcomes: Corresponding resolved outcomes.
        ece_threshold: Maximum acceptable ECE.
        mce_threshold: Maximum acceptable MCE.
        min_samples: Minimum samples required for valid audit.

    Returns:
        Dict with status, metrics, and recommended action.
    """
    n = len(predictions)
    if n < min_samples:
        return {
            "status": "INSUFFICIENT_DATA",
            "samples": n,
            "required": min_samples,
            "action": "CONTINUE — accumulate more resolved predictions"
        }

    report = compute_calibration(predictions, outcomes)

    issues = []
    if report.ece > ece_threshold:
        issues.append(f"ECE {report.ece:.4f} exceeds threshold {ece_threshold}")
    if report.mce > mce_threshold:
        issues.append(f"MCE {report.mce:.4f} exceeds threshold {mce_threshold}")
    if not np.isnan(report.hl_pvalue) and report.hl_pvalue < 0.05:
        issues.append(
            f"Hosmer-Lemeshow p={report.hl_pvalue:.4f} — "
            f"statistically significant miscalibration"
        )

    if issues:
        return {
            "status": "FAIL",
            "issues": issues,
            "ece": report.ece,
            "mce": report.mce,
            "brier": report.brier,
            "hl_pvalue": report.hl_pvalue,
            "action": "HALT TRADING — recalibrate model via Platt or isotonic"
        }

    return {
        "status": "PASS",
        "ece": report.ece,
        "mce": report.mce,
        "brier": report.brier,
        "reliability": report.reliability,
        "resolution": report.resolution,
        "action": "CONTINUE — calibration within acceptable bounds"
    }


# --- Demo: Full calibration pipeline on synthetic NFL data ---
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate 500 NFL game predictions from a decent model
    # True probabilities drawn from Beta distribution (realistic spread)
    true_probs = np.random.beta(2.5, 2.5, size=500)
    outcomes = (np.random.rand(500) < true_probs).astype(float)

    # Model predictions: true probs + calibration noise
    # Simulates slight overconfidence (common in sports models)
    raw_preds = true_probs + 0.05 * (true_probs - 0.5)
    raw_preds = np.clip(raw_preds, 0.01, 0.99)

    # Run calibration analysis
    report = compute_calibration(raw_preds, outcomes)
    print("=== Calibration Report (Raw Model) ===")
    print(f"Brier Score:   {report.brier:.4f}")
    print(f"  Reliability: {report.reliability:.4f}")
    print(f"  Resolution:  {report.resolution:.4f}")
    print(f"  Uncertainty: {report.uncertainty:.4f}")
    print(f"ECE:           {report.ece:.4f}")
    print(f"MCE:           {report.mce:.4f}")
    print(f"HL statistic:  {report.hl_statistic:.2f}")
    print(f"HL p-value:    {report.hl_pvalue:.4f}")

    # Apply Platt scaling to fix calibration
    # Split: first 300 for calibration, last 200 for test
    a, b = platt_scaling(raw_preds[:300], outcomes[:300])
    calibrated_preds = apply_platt(raw_preds[300:], a, b)

    report_cal = compute_calibration(calibrated_preds, outcomes[300:])
    print("\n=== Calibration Report (After Platt Scaling) ===")
    print(f"Brier Score:   {report_cal.brier:.4f}")
    print(f"  Reliability: {report_cal.reliability:.4f}")
    print(f"ECE:           {report_cal.ece:.4f}")
    print(f"MCE:           {report_cal.mce:.4f}")

    # Run daily audit
    audit = daily_calibration_audit(calibrated_preds, outcomes[300:])
    print(f"\n=== Daily Audit ===")
    print(f"Status: {audit['status']}")
    print(f"Action: {audit['action']}")

Limitations and Edge Cases

Small sample sizes destroy calibration estimates. With 50 predictions, a reliability diagram has ~5 samples per bin. One lucky or unlucky outcome flips a bin’s observed frequency by 0.20. ECE computed on fewer than 200 resolved predictions is noise. The Hosmer-Lemeshow test needs at least 500 samples for reliable power.

Bin count sensitivity. ECE and reliability depend on the number of bins K. Ten equal-width bins is standard, but if most predictions cluster around 0.50-0.60 (common in NFL spreads at -110), most bins are empty and the occupied bins are coarse. Adaptive binning (equal-frequency instead of equal-width) helps — each bin gets N/K samples, reducing variance per bin.
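Equal-frequency edges are one np.quantile call (a minimal sketch; pass the result wherever equal-width edges were used):

```python
import numpy as np

def equal_frequency_edges(preds, n_bins=10):
    """Bin edges at quantiles of the predictions, so each bin holds ~N/K samples."""
    edges = np.quantile(preds, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0      # cover the full probability range
    return np.unique(edges)             # ties in clustered data can duplicate edges

# Predictions clustered near the NFL-spread region
rng = np.random.default_rng(7)
preds = np.clip(rng.normal(0.55, 0.03, 1000), 0.0, 1.0)
counts, _ = np.histogram(preds, equal_frequency_edges(preds))
print(counts)   # roughly 100 per bin, instead of one overloaded equal-width bin
```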

Calibration drift. Models calibrated on 2024 NFL data may miscalibrate in 2025 due to rule changes, roster turnover, or market structure shifts. An agent must monitor calibration continuously using the walk-forward approach, not just validate once at deployment. The daily audit function above exists for this reason.

Platt scaling assumes sigmoid miscalibration. If the model’s miscalibration is non-monotonic (e.g., well-calibrated at extremes but overconfident in the middle), Platt scaling cannot fix it. Use isotonic regression for non-sigmoid miscalibration patterns.

Overround complicates market calibration comparisons. When evaluating whether a sportsbook’s closing lines at BetOnline or Bovada are well-calibrated, you must first remove the vig using the techniques from Sports Betting Math 101. Raw implied probabilities from -110/-110 lines sum to 104.8% — comparing those directly to observed frequencies conflates vig with miscalibration.

Post-hoc calibration does not create edge. Isotonic regression and Platt scaling fix the probability mapping, but they cannot improve a model’s resolution. If the raw model has zero skill (its predictions don’t correlate with outcomes), calibration just maps everything toward the base rate. Resolution must come from the model itself — from better feature engineering and training data.

FAQ

What does it mean for a betting model to be calibrated?

A model is calibrated if its predicted probabilities match observed frequencies. If the model outputs 70% for 100 different events, exactly 70 of those events should occur. Calibration is necessary but not sufficient — a model that always predicts the base rate is perfectly calibrated but generates zero edge.

What is the Brier score decomposition for betting models?

The Brier score decomposes into three components: Brier = Reliability - Resolution + Uncertainty. Reliability measures calibration error (lower is better). Resolution measures how much predictions deviate from the base rate (higher is better). Uncertainty is the irreducible variance of outcomes. A good betting model has low reliability and high resolution.

How do you fix a miscalibrated prediction model?

Two standard post-hoc calibration methods exist. Platt scaling fits a logistic sigmoid to map raw model outputs to calibrated probabilities — fast and works well with two parameters. Isotonic regression fits a non-parametric monotone function — more flexible but requires more data to avoid overfitting. Both require a held-out calibration dataset.

What is Expected Calibration Error (ECE) in sports betting?

ECE is the weighted average of calibration error across probability bins: ECE = sum(n_k/N * |o_k - p_k|), where n_k is the count in bin k, o_k is the observed frequency, and p_k is the mean predicted probability. Lower ECE means better calibration. An ECE below 0.02 is excellent for sports betting models.

How does calibration connect to information theory and betting edge?

A calibrated model maximizes the information value of its predictions. Log-loss (cross-entropy) simultaneously penalizes miscalibration and rewards sharpness. The connection to information theory is direct: a well-calibrated model with high resolution extracts more bits of useful information from data, translating to larger edge in prediction markets. See the Information Theory guide for the formal entropy-based framework.

What’s Next

Calibration tells you whether your model’s probabilities are accurate. The next step is determining whether that accuracy is statistically significant or just luck: