A proper scoring rule forces honest probability reporting — you maximize your expected score only by stating your true belief. The Brier score BS = (1/N) * sum((f_i - o_i)^2) measures mean squared forecast error (lower is better). The log score LS = -(1/N) * sum(o_i * ln(f_i) + (1 - o_i) * ln(1 - f_i)) punishes confident wrong predictions far harder: the penalty grows without bound as a wrong forecast approaches certainty. Both are strictly proper, and every betting agent needs at least one to know if its model is improving.

Why This Matters for Agents

An autonomous betting agent that cannot measure its own forecast quality is flying blind. The agent might generate probability estimates all day — “Lakers win at 62%,” “Biden YES at 54%” — but without a rigorous scoring framework, it has no way to answer the critical question: are these estimates any good?

This is Layer 4 — Intelligence. Scoring rules sit in the agent’s feedback loop, downstream of probability generation and upstream of model retraining. After the agent produces forecasts and outcomes resolve, the scoring module computes Brier scores, log scores, and calibration diagnostics. Those metrics feed back into model selection, hyperparameter tuning, and ensemble weighting. Polyseer uses log-loss-weighted aggregation to combine multiple sub-models — the scoring rule determines how much each model’s opinion counts. Without proper scoring, the agent cannot learn, and without learning, the agent cannot maintain edge against an efficient market that prices in new information within minutes.
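Log-loss-weighted aggregation of the kind described can be sketched as follows (the softmax-over-negative-log-loss weighting, the `temperature` parameter, and the name `log_loss_weights` are this sketch's own illustrative assumptions, not Polyseer's documented scheme):

```python
import numpy as np

def log_loss_weights(model_log_losses, temperature=1.0):
    """Turn per-model log losses into ensemble weights (lower loss -> more weight)."""
    losses = np.asarray(model_log_losses, dtype=np.float64)
    w = np.exp(-losses / temperature)   # softmax over negative log loss
    return w / w.sum()                  # normalize so weights sum to 1

# Three sub-models with recent log losses of 0.30, 0.45, and 0.70:
weights = log_loss_weights([0.30, 0.45, 0.70])
print(weights.round(3))
```

The better-scoring model's opinion counts more, and `temperature` controls how aggressively the ensemble concentrates on the leader.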

The Agent Betting Stack places model evaluation squarely in Layer 4. Your agent’s decision engine is only as good as the feedback signal it receives from scoring.

The Math

What Makes a Scoring Rule “Proper”

A scoring rule S(f, o) assigns a score to a forecast f given outcome o. The scoring rule is proper if, for a forecaster with true belief p, the expected score is maximized (or loss minimized) when f = p. It is strictly proper if f = p is the unique optimum.

Formally, a scoring rule S is proper if for all true probabilities p in (0, 1):

E_p[S(p, o)] >= E_p[S(f, o)]   for all f

where E_p denotes expectation under the true probability p. For a strictly proper rule, the inequality is strict whenever f != p.

Why does this matter? If your scoring rule is not proper, a rational agent has an incentive to report something other than its true probability estimate. That corrupts the entire feedback loop. An agent optimizing an improper scoring rule learns to game the metric rather than improve its forecasts.
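The incentive claim is easy to check numerically: sweep candidate reports f against a fixed true belief p and confirm the expected Brier loss bottoms out at f = p (a minimal sketch; the grid and p = 0.7 are arbitrary choices):

```python
import numpy as np

p = 0.7                              # forecaster's true belief
f = np.linspace(0.01, 0.99, 99)      # candidate reported probabilities

# Expected Brier loss of reporting f when the event occurs with probability p:
# E[BS] = p*(1-f)^2 + (1-p)*f^2
expected_brier = p * (1 - f) ** 2 + (1 - p) * f ** 2

best_f = f[np.argmin(expected_brier)]
print(f"Expected Brier loss is minimized at f = {best_f:.2f}")  # f = 0.70
```

Any report other than the true belief raises the expected loss, which is exactly the property that keeps the feedback loop honest.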

The Brier Score

The Brier score, introduced by Glenn Brier in 1950, is the mean squared error between forecasts and outcomes:

BS = (1/N) * sum_{i=1}^{N} (f_i - o_i)^2

where f_i is the forecast probability for event i, o_i is the outcome (1 if the event occurred, 0 otherwise), and N is the number of forecasts.

Score interpretation:

Score     Meaning
──────────────────────────────────────────────────────────────────────
0.000     Perfect — every forecast was exactly right
0.250     Equivalent to forecasting 50% every time
0.500     Equivalent to being systematically wrong (forecasting 0% for
          events that happen 50% of the time)
1.000     Worst possible — predicted 100% confident, always wrong

Proof that Brier is strictly proper. For a single event with true probability p, the expected Brier score when reporting f is:

E_p[BS(f)] = p * (1 - f)^2 + (1 - p) * f^2
           = p - 2pf + pf^2 + f^2 - pf^2
           = f^2 - 2pf + p
           = (f - p)^2 + p - p^2
           = (f - p)^2 + p(1 - p)

The term p(1 - p) is constant with respect to f. The term (f - p)^2 is minimized uniquely when f = p. Therefore the Brier score is strictly proper — the expected loss is minimized only by reporting your true probability.
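The algebraic identity in the proof can also be verified numerically over random (p, f) pairs (a quick sanity-check sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=1000)  # random true probabilities
f = rng.uniform(0.01, 0.99, size=1000)  # random reported forecasts

lhs = p * (1 - f) ** 2 + (1 - p) * f ** 2   # expected Brier loss, direct form
rhs = (f - p) ** 2 + p * (1 - p)            # decomposed form from the proof

max_diff = float(np.max(np.abs(lhs - rhs)))
print(f"max |lhs - rhs| = {max_diff:.2e}")  # zero up to float rounding
```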

The Logarithmic Score

The logarithmic scoring rule (log score or log loss) is:

LS = -(1/N) * sum_{i=1}^{N} [o_i * ln(f_i) + (1 - o_i) * ln(1 - f_i)]

Lower is better. For a single correct prediction at confidence f, the cost is -ln(f). For a single incorrect prediction at confidence f, the cost is -ln(1 - f).

The asymmetry is the key feature. Compare what happens when you’re wrong at different confidence levels:

Forecast f    Outcome o    Brier (f - o)^2    Log -ln(1 - f)
─────────────────────────────────────────────────────────────
  0.60            0             0.36               0.92
  0.80            0             0.64               1.61
  0.90            0             0.81               2.30
  0.95            0             0.90               3.00
  0.99            0             0.98               4.61
  0.999           0             0.998              6.91

The Brier score goes from 0.36 to 0.998 — roughly 3x. The log score goes from 0.92 to 6.91 — roughly 7.5x. And as f approaches 1.0, the log penalty approaches infinity. This makes the log score the right diagnostic for agents that must avoid catastrophic overconfidence.
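The table values are straightforward to reproduce (each row assumes the outcome is o = 0, i.e., a wrong-way forecast):

```python
import numpy as np

# Penalties for a wrong prediction (o = 0) at increasing confidence levels.
for f in [0.60, 0.80, 0.90, 0.95, 0.99, 0.999]:
    brier = f ** 2              # (f - 0)^2
    log_pen = -np.log(1 - f)    # -ln(1 - f)
    print(f"f={f:.3f}  Brier={brier:.3f}  log={log_pen:.2f}")
```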

Proof that log score is strictly proper. For true probability p, the expected log loss when reporting f is:

E_p[LS(f)] = -[p * ln(f) + (1 - p) * ln(1 - f)]

Take the derivative with respect to f and set to zero:

dE/df = -[p/f - (1 - p)/(1 - f)] = 0
p/f = (1 - p)/(1 - f)
p(1 - f) = f(1 - p)
p - pf = f - fp
p = f

The second derivative of the expected loss is p/f^2 + (1 - p)/(1 - f)^2, which is always positive, confirming that f = p is a minimum of the loss. The log score is strictly proper.

Brier Skill Score

Raw Brier scores are hard to interpret in isolation. A score of 0.18 — is that good? Depends on the domain. The Brier Skill Score (BSS) benchmarks against a reference forecast:

BSS = 1 - BS_model / BS_reference

where BS_reference is typically the Brier score of a naive baseline (e.g., always predicting the base rate) or market consensus.

  • BSS = 1.0: Perfect forecast
  • BSS = 0.0: No better than the reference
  • BSS < 0.0: Worse than the reference

For a betting agent, the natural reference is the market closing price. If your agent scores BSS = 0.12 against Polymarket closing prices, your forecasts have a 12% lower Brier score than the market — that’s tradeable edge.

Murphy Decomposition

The Brier score decomposes into three components that diagnose why your agent’s score is what it is:

BS = Reliability - Resolution + Uncertainty

Reliability (calibration error): Groups forecasts into bins (e.g., all forecasts between 0.30 and 0.40) and measures how far the average forecast in each bin is from the actual frequency of events in that bin. Lower is better.

Reliability = (1/N) * sum_{k=1}^{K} n_k * (f_bar_k - o_bar_k)^2

where K is the number of bins, n_k is the count in bin k, f_bar_k is the mean forecast in bin k, and o_bar_k is the observed frequency in bin k.

Resolution: Measures how much your forecasts vary from the overall base rate. Higher is better — it means your forecasts are informative, not just the same number repeated.

Resolution = (1/N) * sum_{k=1}^{K} n_k * (o_bar_k - o_bar)^2

where o_bar is the overall base rate.

Uncertainty: The inherent unpredictability of the events, equal to o_bar * (1 - o_bar). You cannot control this.

This decomposition answers a critical diagnostic question: if your Brier score is bad, is it because your calibration is off (high reliability term — fixable by recalibrating) or because your forecasts are uninformative (low resolution — requires a better model)?
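A quick sanity check of the decomposition: a constant base-rate forecaster has zero reliability error and zero resolution, so its Brier score equals the uncertainty term exactly (a standalone sketch; the simulated base rate of 0.5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
outcomes = rng.binomial(1, 0.5, size=1000).astype(np.float64)
o_bar = outcomes.mean()

# Constant forecaster: always predicts the observed base rate. All forecasts
# fall in one bin where f_bar = o_bar (reliability term = 0) and
# o_bar_k = o_bar (resolution term = 0), so BS = uncertainty exactly.
forecasts = np.full_like(outcomes, o_bar)
bs = float(np.mean((forecasts - outcomes) ** 2))
uncertainty = o_bar * (1 - o_bar)
print(f"BS = {bs:.4f}, uncertainty = {uncertainty:.4f}")
```

Perfect calibration, zero information: this is the "uninformative" failure mode the decomposition exposes.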

Worked Examples

Example 1: Scoring an Agent’s Polymarket Forecasts

An agent tracked 10 Polymarket binary markets, generated probability forecasts before each resolved, and recorded the outcomes:

Market                            Agent Forecast    Polymarket Close    Outcome
──────────────────────────────────────────────────────────────────────────────
Trump wins 2028 GOP primary             0.85              0.78             1
Fed cuts rates June 2026                0.40              0.35             0
Lakers win NBA Finals                   0.12              0.08             0
Bitcoin > $100K by Dec 2026             0.65              0.58             1
UK exits WHO                            0.15              0.10             0
Gavin Newsom wins Dem primary           0.30              0.25             0
SpaceX Starship orbit success           0.70              0.72             1
India GDP growth > 7%                   0.55              0.50             1
Next pandemic declared by 2027          0.20              0.18             0
Argentina dollarization complete        0.25              0.22             0

Agent Brier score: BS = (1/10) * [(0.85-1)^2 + (0.40-0)^2 + (0.12-0)^2 + (0.65-1)^2 + (0.15-0)^2 + (0.30-0)^2 + (0.70-1)^2 + (0.55-1)^2 + (0.20-0)^2 + (0.25-0)^2]

= (1/10) * [0.0225 + 0.16 + 0.0144 + 0.1225 + 0.0225 + 0.09 + 0.09 + 0.2025 + 0.04 + 0.0625]
= (1/10) * 0.8269
= 0.0827

Market Brier score: BS = (1/10) * [(0.78-1)^2 + (0.35-0)^2 + (0.08-0)^2 + (0.58-1)^2 + (0.10-0)^2 + (0.25-0)^2 + (0.72-1)^2 + (0.50-1)^2 + (0.18-0)^2 + (0.22-0)^2]

= (1/10) * [0.0484 + 0.1225 + 0.0064 + 0.1764 + 0.01 + 0.0625 + 0.0784 + 0.25 + 0.0324 + 0.0484]
= (1/10) * 0.8354
= 0.0835

Brier Skill Score vs. market: BSS = 1 - 0.0827 / 0.0835 = 0.0096

The agent’s Brier score is 0.96% lower than Polymarket closing prices — marginal but positive. In principle even a sub-1% edge is tradeable at scale, but with only 10 resolved forecasts a gap this small is well within sampling noise (see Limitations and Edge Cases below); treat this as illustrative arithmetic rather than proof of skill.

Example 2: Log Score Comparison on High-Confidence Calls

Compare two agents on 5 markets where confident calls mattered:

Market                    Agent A    Agent B    Outcome
────────────────────────────────────────────────────────
Fed rate cut June 2026      0.40       0.25        0
Bitcoin > $100K             0.65       0.90        1
Lakers Finals               0.12       0.05        0
SpaceX orbit                0.70       0.95        1
Newsom primary              0.30       0.55        0

Per-event log loss (for o = 0 the cost is -ln(1 - f); for o = 1 it is -ln(f)):

Event 1 (o=0): -ln(1-0.40) = -ln(0.60) = 0.5108      Agent B: -ln(0.75) = 0.2877
Event 2 (o=1): -ln(0.65) = 0.4308                      Agent B: -ln(0.90) = 0.1054
Event 3 (o=0): -ln(1-0.12) = -ln(0.88) = 0.1278       Agent B: -ln(0.95) = 0.0513
Event 4 (o=1): -ln(0.70) = 0.3567                      Agent B: -ln(0.95) = 0.0513
Event 5 (o=0): -ln(1-0.30) = -ln(0.70) = 0.3567       Agent B: -ln(0.45) = 0.7985

Agent A total: 1.7828 / 5 = 0.3566
Agent B total: 1.2942 / 5 = 0.2588

Agent B wins on log score — its confident correct calls (0.90, 0.95) contributed very little loss, and even its one bad confident call (0.55 on Newsom, outcome 0) only cost 0.7985. If Agent B had predicted 0.95 on Newsom and been wrong, the penalty would have been -ln(0.05) = 2.9957 — roughly 4x worse. The log score’s steep penalty for confident misses is what makes it the preferred metric for risk-conscious agents.
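The per-event arithmetic above can be reproduced directly (a minimal sketch applying the log loss formula from The Math section):

```python
import numpy as np

outcomes = np.array([0, 1, 0, 1, 0], dtype=np.float64)
agent_a = np.array([0.40, 0.65, 0.12, 0.70, 0.30])
agent_b = np.array([0.25, 0.90, 0.05, 0.95, 0.55])

def log_loss(f, o):
    """Mean log loss: -(1/N) * sum(o*ln(f) + (1-o)*ln(1-f))."""
    return float(np.mean(-(o * np.log(f) + (1 - o) * np.log(1 - f))))

print(f"Agent A: {log_loss(agent_a, outcomes):.4f}")  # 0.3566
print(f"Agent B: {log_loss(agent_b, outcomes):.4f}")  # 0.2588
```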

Implementation

import numpy as np
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScoringResult:
    """Container for scoring rule outputs."""
    brier_score: float
    log_score: float
    brier_skill_score: Optional[float] = None
    reliability: Optional[float] = None
    resolution: Optional[float] = None
    uncertainty: Optional[float] = None
    n_forecasts: int = 0
    calibration_bins: dict = field(default_factory=dict)


def brier_score(forecasts: np.ndarray, outcomes: np.ndarray) -> float:
    """
    Compute the Brier score: BS = (1/N) * sum((f_i - o_i)^2).

    Args:
        forecasts: Array of probability forecasts in [0, 1].
        outcomes: Array of binary outcomes (0 or 1).

    Returns:
        Brier score (lower is better, 0 = perfect, 0.25 = coin flip baseline).
    """
    forecasts = np.asarray(forecasts, dtype=np.float64)
    outcomes = np.asarray(outcomes, dtype=np.float64)

    if len(forecasts) != len(outcomes):
        raise ValueError(f"Length mismatch: {len(forecasts)} forecasts vs {len(outcomes)} outcomes")
    if np.any((forecasts < 0) | (forecasts > 1)):
        raise ValueError("Forecasts must be in [0, 1]")
    if not np.all(np.isin(outcomes, [0, 1])):
        raise ValueError("Outcomes must be 0 or 1")

    return float(np.mean((forecasts - outcomes) ** 2))


def log_score(
    forecasts: np.ndarray,
    outcomes: np.ndarray,
    clip_eps: float = 1e-15
) -> float:
    """
    Compute the logarithmic score (log loss):
    LS = -(1/N) * sum(o_i * ln(f_i) + (1 - o_i) * ln(1 - f_i)).

    Args:
        forecasts: Array of probability forecasts in [0, 1].
        outcomes: Array of binary outcomes (0 or 1).
        clip_eps: Small epsilon to avoid ln(0). Default 1e-15.

    Returns:
        Log score (lower is better, 0 = perfect, ln(2) ≈ 0.693 = coin flip baseline).
    """
    forecasts = np.asarray(forecasts, dtype=np.float64)
    outcomes = np.asarray(outcomes, dtype=np.float64)

    # Clip to avoid log(0)
    f_clipped = np.clip(forecasts, clip_eps, 1 - clip_eps)

    per_event = -(outcomes * np.log(f_clipped) + (1 - outcomes) * np.log(1 - f_clipped))
    return float(np.mean(per_event))


def brier_skill_score(
    forecasts: np.ndarray,
    outcomes: np.ndarray,
    reference_forecasts: Optional[np.ndarray] = None
) -> float:
    """
    Compute the Brier Skill Score: BSS = 1 - BS_model / BS_reference.

    If no reference is provided, uses the sample base rate (climatological forecast).

    Args:
        forecasts: Agent's probability forecasts.
        outcomes: Binary outcomes.
        reference_forecasts: Reference forecasts (e.g., market closing prices).

    Returns:
        BSS (> 0 means agent beats reference, < 0 means worse).
    """
    bs_model = brier_score(forecasts, outcomes)

    if reference_forecasts is not None:
        bs_ref = brier_score(np.asarray(reference_forecasts), outcomes)
    else:
        # Climatological reference: always predict the base rate
        base_rate = np.mean(outcomes)
        bs_ref = base_rate * (1 - base_rate)

    if bs_ref == 0:
        return 0.0  # Reference is perfect — can't beat it

    return 1 - bs_model / bs_ref


def murphy_decomposition(
    forecasts: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10
) -> dict:
    """
    Decompose Brier score into reliability, resolution, and uncertainty.

    BS = reliability - resolution + uncertainty

    Args:
        forecasts: Probability forecasts in [0, 1].
        outcomes: Binary outcomes.
        n_bins: Number of bins for grouping forecasts.

    Returns:
        Dict with 'reliability', 'resolution', 'uncertainty', and 'bin_data'.
    """
    forecasts = np.asarray(forecasts, dtype=np.float64)
    outcomes = np.asarray(outcomes, dtype=np.float64)

    n = len(forecasts)
    o_bar = np.mean(outcomes)
    uncertainty = o_bar * (1 - o_bar)

    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(forecasts, bin_edges[1:-1])

    reliability = 0.0
    resolution = 0.0
    bin_data = []

    for k in range(n_bins):
        mask = bin_indices == k
        n_k = np.sum(mask)

        if n_k == 0:
            continue

        f_bar_k = np.mean(forecasts[mask])
        o_bar_k = np.mean(outcomes[mask])

        reliability += n_k * (f_bar_k - o_bar_k) ** 2
        resolution += n_k * (o_bar_k - o_bar) ** 2

        bin_data.append({
            "bin_center": (bin_edges[k] + bin_edges[k + 1]) / 2,
            "mean_forecast": float(f_bar_k),
            "observed_frequency": float(o_bar_k),
            "count": int(n_k)
        })

    reliability /= n
    resolution /= n

    return {
        "reliability": float(reliability),
        "resolution": float(resolution),
        "uncertainty": float(uncertainty),
        "brier_score": float(reliability - resolution + uncertainty),
        "bin_data": bin_data
    }


def calibration_curve(
    forecasts: np.ndarray,
    outcomes: np.ndarray,
    n_bins: int = 10
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute a calibration curve for reliability diagrams.

    Args:
        forecasts: Probability forecasts in [0, 1].
        outcomes: Binary outcomes.
        n_bins: Number of equal-width bins.

    Returns:
        Tuple of (mean_forecast_per_bin, observed_frequency_per_bin, bin_counts).
    """
    forecasts = np.asarray(forecasts, dtype=np.float64)
    outcomes = np.asarray(outcomes, dtype=np.float64)

    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(forecasts, bin_edges[1:-1])

    mean_forecasts = []
    observed_freqs = []
    counts = []

    for k in range(n_bins):
        mask = bin_indices == k
        n_k = np.sum(mask)

        if n_k == 0:
            continue

        mean_forecasts.append(np.mean(forecasts[mask]))
        observed_freqs.append(np.mean(outcomes[mask]))
        counts.append(n_k)

    return (
        np.array(mean_forecasts),
        np.array(observed_freqs),
        np.array(counts)
    )


def score_agent_vs_market(
    agent_forecasts: np.ndarray,
    market_prices: np.ndarray,
    outcomes: np.ndarray
) -> ScoringResult:
    """
    Full scoring comparison between an agent and market consensus.

    Args:
        agent_forecasts: Agent's probability estimates.
        market_prices: Market closing prices (consensus probabilities).
        outcomes: Binary outcomes (0 or 1).

    Returns:
        ScoringResult with all metrics.
    """
    agent_forecasts = np.asarray(agent_forecasts, dtype=np.float64)
    market_prices = np.asarray(market_prices, dtype=np.float64)
    outcomes = np.asarray(outcomes, dtype=np.float64)

    bs = brier_score(agent_forecasts, outcomes)
    ls = log_score(agent_forecasts, outcomes)
    bss = brier_skill_score(agent_forecasts, outcomes, market_prices)

    decomp = murphy_decomposition(agent_forecasts, outcomes)

    return ScoringResult(
        brier_score=bs,
        log_score=ls,
        brier_skill_score=bss,
        reliability=decomp["reliability"],
        resolution=decomp["resolution"],
        uncertainty=decomp["uncertainty"],
        n_forecasts=len(agent_forecasts),
        calibration_bins=decomp["bin_data"]
    )


# --- Example usage with the worked example data ---

if __name__ == "__main__":
    # Agent forecasts from Worked Example 1
    agent_f = np.array([0.85, 0.40, 0.12, 0.65, 0.15, 0.30, 0.70, 0.55, 0.20, 0.25])
    market_f = np.array([0.78, 0.35, 0.08, 0.58, 0.10, 0.25, 0.72, 0.50, 0.18, 0.22])
    outcomes = np.array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0])

    result = score_agent_vs_market(agent_f, market_f, outcomes)

    print(f"Agent Brier Score:       {result.brier_score:.4f}")
    print(f"Agent Log Score:         {result.log_score:.4f}")
    print(f"Brier Skill Score (vs market): {result.brier_skill_score:.4f}")
    print(f"Reliability:             {result.reliability:.4f}")
    print(f"Resolution:              {result.resolution:.4f}")
    print(f"Uncertainty:             {result.uncertainty:.4f}")
    print(f"Decomposition check:     {result.reliability - result.resolution + result.uncertainty:.4f}")
    print(f"\nMarket Brier Score:      {brier_score(market_f, outcomes):.4f}")
    print(f"Market Log Score:        {log_score(market_f, outcomes):.4f}")

Limitations and Edge Cases

Small sample sizes destroy signal. With 10 forecasts, the Brier score’s standard error is wide enough to render most BSS comparisons meaningless. You need a minimum of ~100 resolved forecasts before the scoring signal dominates noise. For agents operating on Polymarket where markets resolve weekly or monthly, this means months of data before scoring comparisons become reliable. The statistical significance guide covers the exact sample size calculations.

The log score is unbounded. A single forecast of f = 0.999 on an event that doesn’t happen costs -ln(0.001) = 6.91. One catastrophic miss dominates the entire score. This is a feature for risk management (it penalizes exactly the behavior you want to avoid) but a liability for leaderboard comparisons. An agent with 99 perfect forecasts and one disaster will lose to an agent with 100 mediocre forecasts on log score. Consider using a trimmed log score (drop the worst 1%) for rankings.
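A trimmed variant along these lines might look like the following (`trimmed_log_score` and `trim_frac` are this sketch's own names, not a standard metric implementation):

```python
import numpy as np

def trimmed_log_score(forecasts, outcomes, trim_frac=0.01):
    """Log loss after dropping the worst trim_frac fraction of per-event losses."""
    f = np.clip(np.asarray(forecasts, dtype=np.float64), 1e-15, 1 - 1e-15)
    o = np.asarray(outcomes, dtype=np.float64)
    per_event = -(o * np.log(f) + (1 - o) * np.log(1 - f))
    n_keep = max(1, int(np.ceil(len(per_event) * (1 - trim_frac))))
    return float(np.mean(np.sort(per_event)[:n_keep]))  # keep the best n_keep
```

With 99 correct 0.90 calls and one wrong 0.999 call, trimming 1% drops the single -ln(0.001) disaster and the score collapses back to -ln(0.90). Note that trimming sacrifices strict properness, which is why it belongs on leaderboards, not in the training loop.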

Calibration does not equal skill. A perfectly calibrated agent that always predicts the base rate (say 50% for every market) has perfect reliability but zero resolution. Its Brier score equals the uncertainty term — it adds no information. Calibration is necessary but insufficient. The Murphy decomposition separates these: high resolution with moderate reliability usually beats perfect reliability with low resolution.

Binary-only limitation. The formulations above handle binary outcomes only. For multi-outcome markets on Polymarket (e.g., “Who wins the 2028 election?” with 8 candidates), you need the ranked probability score (RPS) or the multi-class log loss. Multi-class log loss is a straightforward extension: LS = -(1/N) * sum(sum(o_ij * ln(f_ij))) where j indexes outcomes. See the multi-outcome markets guide for the full treatment.
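A minimal multi-class log loss can be sketched as below, assuming forecasts arrive as an (N, J) matrix of probability vectors and outcomes as winning-outcome indices (an illustrative sketch of the straightforward extension, not the RPS):

```python
import numpy as np

def multiclass_log_loss(forecasts, outcomes, eps=1e-15):
    """
    forecasts: (N, J) array of probability vectors (each row sums to 1).
    outcomes:  (N,) array of winning-outcome indices in [0, J).
    """
    f = np.clip(np.asarray(forecasts, dtype=np.float64), eps, 1.0)
    idx = np.asarray(outcomes, dtype=int)
    # o_ij is one-hot, so only the probability assigned to the realized
    # outcome contributes: LS = -(1/N) * sum_i ln(f_i[winner_i]).
    return float(np.mean(-np.log(f[np.arange(len(idx)), idx])))
```

For a two-market example — [0.5, 0.3, 0.2] with outcome 0 and [0.1, 0.8, 0.1] with outcome 1 — the loss is the mean of -ln(0.5) and -ln(0.8).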

Market consensus is a moving target. Using market closing prices as the BSS reference is convenient but imperfect. The market price at close reflects all available information, including information that arrived after your agent made its forecast. A fairer comparison uses the market price at the time the agent generated its forecast, not at resolution. Track both timestamps.

FAQ

What is a proper scoring rule in prediction markets?

A proper scoring rule is a function that assigns a numerical score to a probability forecast such that the forecaster maximizes their expected score by reporting their true believed probability. The Brier score and logarithmic score are both strictly proper — any deviation from your true estimate makes your expected score worse. This property is essential for agents because it ensures the scoring feedback loop incentivizes accurate probability estimation rather than gaming.

How do you calculate the Brier score for prediction market forecasts?

The Brier score is BS = (1/N) * sum((f_i - o_i)^2), where f_i is your forecast probability and o_i is the outcome (1 if the event occurred, 0 if not). Lower is better: 0.0 is a perfect score, 0.25 is the score of a constant 50% forecast, and 1.0 is the worst possible score. For an agent forecasting on Polymarket, f_i is the agent’s estimated probability and o_i is whether the YES contract paid out.

Why does the logarithmic scoring rule penalize confident wrong predictions more than the Brier score?

The log score uses -ln(f) for correct outcomes and -ln(1-f) for incorrect outcomes. As f approaches 1.0 for a wrong prediction, -ln(1-f) approaches infinity. The Brier score’s squared error is bounded at 1.0. A prediction of 0.99 that turns out wrong costs 4.61 on the log score but only 0.98 on the Brier score. This makes the log score the right diagnostic for agents that need to avoid catastrophic overconfidence — one disaster can wipe out months of profits.

What is the Brier skill score and how do agents use it?

The Brier skill score is BSS = 1 - BS_model / BS_reference. It benchmarks your agent’s Brier score against a reference forecast like market consensus or a 50% baseline. BSS > 0 means your agent outperforms the reference. An agent scoring BSS = 0.15 against Polymarket closing prices has a 15% lower Brier score than the market — strong enough edge to trade profitably after transaction costs.

How does the Brier score decompose into calibration and resolution?

The Murphy decomposition breaks the Brier score into three components: BS = reliability - resolution + uncertainty. Reliability measures calibration error (lower is better), resolution measures how much your forecasts vary from the base rate (higher is better), and uncertainty is the inherent unpredictability of the events. An agent with high reliability and low resolution is well-calibrated but uninformative — it needs a better model. An agent with low reliability and high resolution has good signal but needs recalibration — apply Platt scaling or isotonic regression.

What’s Next

Scoring rules tell an agent how good its current model is. The next step is improving that model through systematic calibration and model evaluation, which covers Platt scaling, isotonic regression, and ensemble calibration techniques.

Related guides in the series: