KL divergence D_KL(P || Q) = Sigma p_i log(p_i / q_i) is the mathematically correct measure of betting edge — it quantifies how much your probability distribution differs from the market’s in bits. The expected log-wealth growth under Kelly sizing equals D_KL(Agent || Market). Use entropy to find uncertain markets, KL divergence to rank bets, and mutual information to select features.

Why This Matters for Agents

An autonomous betting agent needs to answer three questions in every decision cycle: (1) which markets have the most opportunity, (2) how much edge does my model have over the market price, and (3) which data sources are worth processing. Information theory provides mathematically rigorous answers to all three.

This is Layer 4 — Intelligence. Entropy, KL divergence, and mutual information are the core metrics an agent’s intelligence module uses to allocate attention, rank opportunities, and evaluate its own models. An agent pulling odds from the Prediction Market API Reference and running them through Polyseer’s multi-agent Bayesian analysis needs a principled way to convert probability disagreements into expected profit — that’s KL divergence. It needs a way to decide which Polymarket markets are worth analyzing — that’s entropy. And it needs a way to decide whether adding a new data feed (weather, injury reports, polling aggregates) will improve predictions — that’s mutual information. These three tools form the information-theoretic backbone of the Agent Betting Stack.

The Math

Shannon Entropy: Measuring Market Uncertainty

Shannon entropy quantifies the uncertainty of a probability distribution in bits:

H(X) = -Sigma p_i log2(p_i)

where p_i is the probability of outcome i and log2 is the base-2 logarithm.

For a binary market with YES probability p and NO probability (1 - p):

H(p) = -p log2(p) - (1-p) log2(1-p)

Key properties:

  • Maximum entropy: H = 1.0 bit at p = 0.50 (maximum uncertainty — a coin flip).
  • Minimum entropy: H = 0 bits at p = 0 or p = 1 (no uncertainty — outcome is certain).
  • Asymmetric decay: H(0.90) = 0.469 bits. H(0.99) = 0.081 bits. Entropy drops faster as you approach certainty.

Binary Entropy Function

H(p)
1.0 |          ****
    |        **    **
    |      **        **
0.5 |    **            **
    |  **                **
    | *                    *
0.0 |*________________________*
    0    0.25  0.50  0.75   1.0
                p

For a multi-outcome market with n outcomes:

H(X) = -Sigma_{i=1}^{n} p_i log2(p_i)

Maximum entropy is log2(n) — a 4-outcome market with uniform probabilities (25% each) has H = 2.0 bits. A 4-outcome market where one candidate leads at 85% has much lower entropy, meaning less uncertainty and fewer opportunities for an edge-seeking agent.
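These values are easy to verify numerically. A minimal sketch in plain numpy (`entropy_bits` is an illustrative helper, not a library function):

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits; 0 * log2(0) treated as 0 by convention."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.50, 0.50]))   # 1.0 — binary maximum (coin flip)
print(entropy_bits([0.90, 0.10]))   # ~0.469
print(entropy_bits([0.25] * 4))     # 2.0 — uniform 4-outcome market
```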

Why agents care: High-entropy markets are where edge lives. If the market is already 95/5, there’s not much room for your model to disagree profitably. A 55/45 market with H = 0.993 bits has almost maximum uncertainty — your model’s 65/35 estimate represents a large information advantage. Entropy is a screening tool: scan all available markets, rank by entropy, focus analysis on the top tier.

KL Divergence: The Canonical Measure of Edge

Kullback-Leibler divergence measures how much one probability distribution differs from another:

D_KL(P || Q) = Sigma p_i log(p_i / q_i)

where P is the agent’s distribution (what you believe) and Q is the market’s distribution (what the price implies). We use natural logarithm (ln) when connecting to Kelly growth rates, and log2 when measuring in bits.

For a binary market where the agent believes p_agent and the market implies p_market:

D_KL(Agent || Market) = p_agent * ln(p_agent / p_market) + (1 - p_agent) * ln((1 - p_agent) / (1 - p_market))

Key properties of KL divergence:

  • Non-negative: D_KL >= 0 always. Equals zero only when P = Q (your model agrees with the market).
  • Asymmetric: D_KL(P || Q) != D_KL(Q || P). The direction matters — agent vs. market is what we want.
  • Not a metric: Doesn’t satisfy the triangle inequality. It’s a divergence, not a distance.
  • Units: Nats when using ln, bits when using log2.

The KL-Kelly Connection

This is the critical result. The expected growth rate of wealth under Kelly bet sizing equals the KL divergence between the agent’s distribution and the market’s distribution:

E[log(W_t / W_0)] = t * D_KL(P_agent || P_market)

where W_t is wealth after t bets, W_0 is initial wealth, and P_agent is the true probability distribution (assuming the agent is correctly calibrated).

Proof sketch for the binary case:

Under Kelly, the agent bets fraction f* = (p_agent - p_market) / (1 - p_market) on YES (when p_agent > p_market). A YES contract priced at p_market returns (1 - p_market) / p_market per dollar staked when it wins and loses the stake when it doesn't, so the expected log return per bet is:

G = p_agent * ln(1 + f* * (1 - p_market) / p_market) + (1 - p_agent) * ln(1 - f*)

Substituting the Kelly fraction gives 1 + f* * (1 - p_market) / p_market = p_agent / p_market and 1 - f* = (1 - p_agent) / (1 - p_market), so:

G = p_agent * ln(p_agent / p_market) + (1 - p_agent) * ln((1 - p_agent) / (1 - p_market))
G = D_KL(P_agent || P_market)

This is profound: the maximum rate at which your bankroll can grow is exactly equal to how much better your probability estimates are than the market’s, measured by KL divergence. A 0.01-nat edge grows your bankroll at 1% per bet (continuously compounded). A 0.05-nat edge grows it at 5% per bet. The Kelly Criterion guide covers the bet sizing formula itself; information theory tells you the ceiling on what that sizing can achieve.
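The identity can be sanity-checked by simulation. The sketch below assumes a binary contract with net odds (1 - p_market) / p_market, outcomes drawn at the agent's (assumed true) probability, and full Kelly sizing; `kelly_growth_mc` is an illustrative helper:

```python
import numpy as np

def kelly_growth_mc(p_agent, p_market, n_bets=200_000, seed=0):
    """Monte Carlo estimate of mean log-return per bet under full Kelly."""
    rng = np.random.default_rng(seed)
    f = (p_agent - p_market) / (1 - p_market)  # Kelly fraction on YES
    payout = (1 - p_market) / p_market         # net odds per dollar staked
    wins = rng.random(n_bets) < p_agent        # outcomes at the agent's (true) prob
    log_returns = np.where(wins, np.log(1 + f * payout), np.log(1 - f))
    return float(log_returns.mean())

p_a, p_m = 0.61, 0.52
kl = p_a * np.log(p_a / p_m) + (1 - p_a) * np.log((1 - p_a) / (1 - p_m))
print(f"analytic D_KL  : {kl:.5f} nats")
print(f"simulated Kelly: {kelly_growth_mc(p_a, p_m):.5f} nats")
```

The two numbers agree up to Monte Carlo noise, which is the KL-Kelly connection in action.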

Cross-Entropy: Model Evaluation

Cross-entropy measures how well a predicted distribution Q matches a true distribution P:

H(P, Q) = -Sigma p_i log(q_i)

The fundamental decomposition:

H(P, Q) = H(P) + D_KL(P || Q)

Since H(P) is fixed (it’s the irreducible entropy of outcomes), minimizing cross-entropy is equivalent to minimizing D_KL from the true distribution. This is why cross-entropy loss is the standard objective for training classification models — it directly optimizes for the thing that determines betting profit.
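A quick numeric check of the decomposition (numpy, natural log):

```python
import numpy as np

p = np.array([0.61, 0.39])   # "true" distribution
q = np.array([0.52, 0.48])   # predicted distribution

h_p  = -np.sum(p * np.log(p))      # entropy H(P)
kl   =  np.sum(p * np.log(p / q))  # D_KL(P || Q)
h_pq = -np.sum(p * np.log(q))      # cross-entropy H(P, Q)

print(f"H(P) + D_KL = {h_p + kl:.6f} nats")
print(f"H(P, Q)     = {h_pq:.6f} nats")   # identical by the decomposition
```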

Cross-entropy connects to other scoring rules:

Scoring Rule                Formula                  Relationship to Cross-Entropy
──────────────────────────────────────────────────────────────────────────────────
Log loss                    -Sigma y_i log(q_i)      Identical to cross-entropy for binary outcomes
Brier score                 Sigma (y_i - q_i)^2      Approximates 2 * cross-entropy for well-calibrated models near p = 0.5
Logarithmic scoring rule    log(q_outcome)           Equivalent to negative cross-entropy evaluated at the realized outcome
──────────────────────────────────────────────────────────────────────────────────

For the deep dive on scoring rules and their proper scoring properties, see the Prediction Market Scoring Rules guide.

Mutual Information: Feature Selection

Mutual information quantifies how much knowing one variable reduces uncertainty about another:

I(X; Y) = H(Y) - H(Y | X)

where H(Y) is the entropy of the outcome and H(Y|X) is the conditional entropy of the outcome given feature X.

Equivalently:

I(X; Y) = D_KL(P(X,Y) || P(X) * P(Y))

MI equals zero when X and Y are independent (the feature tells you nothing). MI equals H(Y) when X completely determines Y (the feature is perfectly predictive).

For agent feature selection: an agent considering which data sources to subscribe to can estimate MI between each candidate feature and the outcome. Rank features by MI, and keep the top-k. This is more rigorous than correlation-based feature selection because MI captures nonlinear dependencies.
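To see why MI beats correlation on nonlinear dependencies, consider an outcome that depends only on a feature's magnitude, not its sign. In this illustrative sketch, Pearson correlation comes out near zero while a binned plug-in MI estimate is clearly positive (the decile binning is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=50_000)
y = (np.abs(x) > 1).astype(int)   # outcome depends on |x|, not on x's sign

corr = np.corrcoef(x, y)[0, 1]    # near 0 — linear correlation misses it

# Plug-in MI from a decile-binned joint table
edges = np.quantile(x, np.linspace(0, 1, 11)[1:-1])
x_bin = np.digitize(x, edges)     # 10 bins, labels 0..9
joint = np.zeros((10, 2))
np.add.at(joint, (x_bin, y), 1)
pj = joint / joint.sum()
px, py = pj.sum(axis=1), pj.sum(axis=0)
nz = pj > 0
mi = float(np.sum(pj[nz] * np.log2(pj[nz] / np.outer(px, py)[nz])))

print(f"corr = {corr:+.3f}, MI = {mi:.3f} bits")   # MI detects the dependence
```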

Example MI values for NFL game outcome prediction:

Feature                          MI (bits)
───────────────────────────────────────────
Closing line movement (final 2h)    0.142
Weighted DVOA differential          0.118
Quarterback EPA per play            0.097
Rest advantage (days)               0.053
Home/away indicator                 0.031
Temperature at kickoff              0.009
Jersey color                        0.001
───────────────────────────────────────────

Closing line movement carries the most information about the outcome — consistent with the finding that CLV is the gold standard metric for sharp betting. Temperature and jersey color are noise. An agent paying for weather data to predict NFL outcomes is wasting compute on 0.009 bits of signal.

Worked Examples

Example 1: KL Divergence for a Polymarket Election Market

Polymarket prices the “Will Biden win the 2024 presidential election?” market at YES = $0.52 (52% implied). Your agent’s model, aggregating polls, fundamentals, and prediction market cross-references via Polyseer, outputs 61%.

Agent distribution P: {YES: 0.61, NO: 0.39}
Market distribution Q: {YES: 0.52, NO: 0.48}

D_KL(P || Q) = 0.61 * ln(0.61 / 0.52) + 0.39 * ln(0.39 / 0.48)
             = 0.61 * ln(1.1731) + 0.39 * ln(0.8125)
             = 0.61 * 0.15963 + 0.39 * (-0.20764)
             = 0.09737 - 0.08098
             = 0.01639 nats

Interpretation: if your model is correctly calibrated, Kelly sizing on this market grows your bankroll at about 1.64% per bet (continuously compounded). Over 100 independent bets with similar edges, expected log-wealth growth is 1.639 nats — roughly a 5.15x bankroll multiple (e^1.639 ≈ 5.15).
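The arithmetic can be checked in a couple of lines (numpy, natural log):

```python
import numpy as np

p = np.array([0.61, 0.39])   # agent
q = np.array([0.52, 0.48])   # market
kl_nats = float(np.sum(p * np.log(p / q)))
print(f"D_KL = {kl_nats:.4f} nats")   # ≈ 0.0164
```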

Example 2: Entropy Screening Across Kalshi Markets

An agent scans 5 Kalshi markets to decide where to focus analysis:

Market                              Implied Prob     H (bits)
──────────────────────────────────────────────────────────────
Fed rate cut in June                   0.55           0.993
S&P 500 above 5500 by EOY              0.72           0.855
Trump wins 2028 GOP primary            0.88           0.529
US GDP growth > 3% Q2                  0.50           1.000
Bitcoin above $100k by July            0.65           0.934
──────────────────────────────────────────────────────────────

The agent ranks by entropy and prioritizes US GDP growth (H = 1.000) and Fed rate cut (H = 0.993) for deep analysis. The Trump primary market (H = 0.529) has low uncertainty — the market is fairly confident, so the agent would need a much stronger model to find edge there.

Example 3: Sportsbook Edge Ranking

An agent pulls NFL Week 12 lines from BetOnline via The Odds API and compares against its Elo-regression model (see Elo Ratings and Regression Models):

Game                    Agent Prob   Market Prob   D_KL (nats)   Rank
──────────────────────────────────────────────────────────────────────
Chiefs @ Ravens           0.42         0.48         0.00725        4
Bills @ Dolphins          0.71         0.63         0.01423        1
Cowboys @ Eagles          0.35         0.40         0.00529        5
Packers @ Lions           0.58         0.55         0.00183        6
49ers @ Seahawks          0.67         0.60         0.01045        2
Broncos @ Raiders         0.62         0.55         0.01003        3
──────────────────────────────────────────────────────────────────────

The agent ranks by D_KL and allocates its bankroll accordingly using Kelly sizing. Bills @ Dolphins has the highest KL divergence — the largest disagreement weighted by the agent’s confidence — so it gets the largest Kelly fraction.

Implementation

"""
Information-theoretic edge analysis for autonomous betting agents.
Requires: numpy, scipy
Install: pip install numpy scipy
"""

import numpy as np
from scipy.stats import entropy as scipy_entropy
from dataclasses import dataclass


@dataclass
class EdgeAnalysis:
    """Result of information-theoretic edge analysis for a single market."""
    market_name: str
    agent_probs: np.ndarray
    market_probs: np.ndarray
    entropy_bits: float
    kl_divergence_nats: float
    kl_divergence_bits: float
    expected_kelly_growth: float
    edge_rank: int = 0


def shannon_entropy(probs: np.ndarray, base: int = 2) -> float:
    """
    Compute Shannon entropy of a probability distribution.

    Args:
        probs: Array of probabilities (must sum to 1.0).
        base: Logarithm base. 2 for bits, e for nats.

    Returns:
        Entropy in the specified base.
    """
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs[probs > 0]  # 0 * log(0) = 0 by convention
    if base == 2:
        return -np.sum(probs * np.log2(probs))
    return -np.sum(probs * np.log(probs))


def kl_divergence(p: np.ndarray, q: np.ndarray) -> tuple[float, float]:
    """
    Compute KL divergence D_KL(P || Q).

    Args:
        p: Agent's probability distribution.
        q: Market's probability distribution.

    Returns:
        Tuple of (kl_nats, kl_bits).
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)

    # Clip to avoid log(0) — minimum probability 1e-10
    p = np.clip(p, 1e-10, 1.0)
    q = np.clip(q, 1e-10, 1.0)

    # Normalize
    p = p / p.sum()
    q = q / q.sum()

    kl_nats = np.sum(p * np.log(p / q))
    kl_bits = np.sum(p * np.log2(p / q))
    return float(kl_nats), float(kl_bits)


def mutual_information(
    joint_counts: np.ndarray
) -> float:
    """
    Estimate mutual information I(X; Y) from a joint frequency table.

    Args:
        joint_counts: 2D array where joint_counts[i][j] is the count
                      of (X=i, Y=j) co-occurrences.

    Returns:
        Mutual information in bits.
    """
    joint_counts = np.asarray(joint_counts, dtype=np.float64)
    joint_prob = joint_counts / joint_counts.sum()

    # Marginals
    p_x = joint_prob.sum(axis=1)
    p_y = joint_prob.sum(axis=0)

    mi = 0.0
    for i in range(joint_prob.shape[0]):
        for j in range(joint_prob.shape[1]):
            if joint_prob[i, j] > 0 and p_x[i] > 0 and p_y[j] > 0:
                mi += joint_prob[i, j] * np.log2(
                    joint_prob[i, j] / (p_x[i] * p_y[j])
                )
    return float(mi)


def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """
    Compute cross-entropy H(P, Q) = -Sigma p_i log(q_i).

    Args:
        p: True distribution.
        q: Predicted distribution.

    Returns:
        Cross-entropy in nats.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    q = np.clip(q, 1e-10, 1.0)
    return float(-np.sum(p * np.log(q)))


def rank_markets_by_edge(
    markets: list[dict],
) -> list[EdgeAnalysis]:
    """
    Rank a set of markets by information-theoretic edge.

    Args:
        markets: List of dicts with keys:
            - name: str
            - agent_probs: list[float] (agent's probability distribution)
            - market_probs: list[float] (market's implied distribution)

    Returns:
        List of EdgeAnalysis sorted by KL divergence (highest edge first).
    """
    results = []
    for m in markets:
        agent_p = np.array(m["agent_probs"])
        market_p = np.array(m["market_probs"])

        h = shannon_entropy(market_p, base=2)
        kl_nats, kl_bits = kl_divergence(agent_p, market_p)

        results.append(EdgeAnalysis(
            market_name=m["name"],
            agent_probs=agent_p,
            market_probs=market_p,
            entropy_bits=h,
            kl_divergence_nats=kl_nats,
            kl_divergence_bits=kl_bits,
            expected_kelly_growth=kl_nats,
        ))

    # Sort by KL divergence descending
    results.sort(key=lambda x: x.kl_divergence_nats, reverse=True)
    for i, r in enumerate(results):
        r.edge_rank = i + 1

    return results


def feature_importance_mi(
    features: np.ndarray,
    outcomes: np.ndarray,
    feature_names: list[str],
    n_bins: int = 10,
) -> list[tuple[str, float]]:
    """
    Rank features by mutual information with the outcome variable.

    Args:
        features: 2D array (n_samples, n_features).
        outcomes: 1D array of discrete outcomes (n_samples,).
        feature_names: Names for each feature column.
        n_bins: Number of bins for discretizing continuous features.

    Returns:
        List of (feature_name, MI_bits) sorted descending.
    """
    results = []
    unique_outcomes = np.unique(outcomes)

    for col_idx, name in enumerate(feature_names):
        feat = features[:, col_idx]

        # Discretize continuous features into bins
        if len(np.unique(feat)) > n_bins:
            bins = np.percentile(feat, np.linspace(0, 100, n_bins + 1))
            feat_discrete = np.digitize(feat, bins[1:-1])
        else:
            feat_discrete = feat.astype(int)

        unique_feats = np.unique(feat_discrete)

        # Build joint count table
        joint = np.zeros((len(unique_feats), len(unique_outcomes)))
        feat_map = {v: i for i, v in enumerate(unique_feats)}
        out_map = {v: i for i, v in enumerate(unique_outcomes)}

        for f_val, o_val in zip(feat_discrete, outcomes):
            joint[feat_map[f_val], out_map[o_val]] += 1

        mi = mutual_information(joint)
        results.append((name, mi))

    results.sort(key=lambda x: x[1], reverse=True)
    return results


# --- Demo: Edge Dashboard ---

if __name__ == "__main__":
    print("=" * 70)
    print("INFORMATION-THEORETIC EDGE DASHBOARD")
    print("=" * 70)

    # Define markets: agent model vs. market implied
    markets = [
        {
            "name": "Biden YES @ $0.52 (Polymarket)",
            "agent_probs": [0.61, 0.39],
            "market_probs": [0.52, 0.48],
        },
        {
            "name": "Fed Rate Cut June YES @ $0.55 (Kalshi)",
            "agent_probs": [0.62, 0.38],
            "market_probs": [0.55, 0.45],
        },
        {
            "name": "Bills ML @ -180 (BetOnline)",
            "agent_probs": [0.71, 0.29],
            "market_probs": [0.643, 0.357],
        },
        {
            "name": "Lakers -3.5 @ -110 (BetOnline)",
            "agent_probs": [0.56, 0.44],
            "market_probs": [0.524, 0.476],
        },
        {
            "name": "BTC > $100k July YES @ $0.65 (Kalshi)",
            "agent_probs": [0.58, 0.42],
            "market_probs": [0.65, 0.35],
        },
    ]

    print("\n--- Market Entropy Scan ---\n")
    for m in markets:
        h = shannon_entropy(np.array(m["market_probs"]))
        print(f"  {m['name']:<50s}  H = {h:.3f} bits")

    print("\n--- Edge Ranking by KL Divergence ---\n")
    ranked = rank_markets_by_edge(markets)
    print(f"  {'Rank':<5} {'Market':<50s} {'D_KL (nats)':<14} {'D_KL (bits)':<14} {'Kelly G'}")
    print("  " + "-" * 95)
    for r in ranked:
        print(
            f"  {r.edge_rank:<5} {r.market_name:<50s} "
            f"{r.kl_divergence_nats:<14.5f} {r.kl_divergence_bits:<14.5f} "
            f"{r.expected_kelly_growth:.5f}"
        )

    # Mutual information demo with synthetic data
    print("\n--- Feature Importance (Mutual Information) ---\n")
    rng = np.random.default_rng(42)
    n_samples = 5000

    # Simulate: outcome depends on features 0 and 1, not on 2
    feat0 = rng.normal(0, 1, n_samples)  # strong signal
    feat1 = rng.normal(0, 1, n_samples)  # moderate signal
    feat2 = rng.normal(0, 1, n_samples)  # noise

    logit = 0.8 * feat0 + 0.4 * feat1 + 0.0 * feat2
    prob = 1 / (1 + np.exp(-logit))
    outcomes = (rng.random(n_samples) < prob).astype(int)

    features = np.column_stack([feat0, feat1, feat2])
    names = ["CLV_signal", "DVOA_diff", "jersey_color"]

    mi_ranked = feature_importance_mi(features, outcomes, names)
    for name, mi_val in mi_ranked:
        bar = "#" * int(mi_val * 200)
        print(f"  {name:<20s}  MI = {mi_val:.4f} bits  {bar}")

    # Cross-entropy demo
    print("\n--- Model Cross-Entropy Comparison ---\n")
    true_dist = np.array([0.61, 0.39])
    model_a = np.array([0.60, 0.40])  # good model
    model_b = np.array([0.52, 0.48])  # market-level model
    model_c = np.array([0.45, 0.55])  # wrong-direction model

    for label, q in [("Model A (good)", model_a),
                      ("Model B (market)", model_b),
                      ("Model C (wrong)", model_c)]:
        ce = cross_entropy(true_dist, q)
        kl_n, _ = kl_divergence(true_dist, q)
        print(f"  {label:<25s}  H(P,Q) = {ce:.4f} nats  D_KL = {kl_n:.4f} nats")

Limitations and Edge Cases

KL divergence requires calibrated probabilities. The entire framework assumes your agent’s probability estimates are closer to truth than the market’s. If your model is miscalibrated — systematically overconfident or underconfident — D_KL will overstate your edge. Use the Calibration and Model Evaluation guide to verify your model before trusting KL-based rankings. A 10% miscalibration error can turn a positive expected growth rate into a negative one.

KL divergence is undefined when q_i = 0 and p_i > 0. If the market assigns zero probability to an outcome your model considers possible, D_KL blows up to infinity. In practice, this means markets with zero-liquidity outcomes need probability floors. The implementation above clips at 1e-10, but this is a numerical bandage — the real fix is to only analyze markets with sufficient liquidity on all outcomes.

Mutual information estimation is data-hungry. Estimating MI from finite samples introduces upward bias — the naive plug-in estimator systematically overestimates MI, especially with many bins or rare feature values. For a betting agent with 500 historical observations, MI estimates for weak features will be unreliable. Use bias-corrected estimators (Miller-Madow correction: subtract (|X|-1)(|Y|-1) / (2N ln2) from the naive estimate) or permutation tests to establish significance thresholds.
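That correction can be sketched directly against a plug-in estimate from a joint count table (`plugin_mi_bits` and `mm_corrected_mi` are illustrative helpers). With an independent feature and outcome the true MI is exactly zero; the naive estimate still comes out positive, while the corrected estimate typically sits much closer to zero:

```python
import numpy as np

def plugin_mi_bits(joint_counts):
    """Naive plug-in MI (bits) from a joint count table."""
    pj = np.asarray(joint_counts, dtype=float)
    pj = pj / pj.sum()
    px, py = pj.sum(axis=1), pj.sum(axis=0)
    nz = pj > 0
    return float(np.sum(pj[nz] * np.log2(pj[nz] / np.outer(px, py)[nz])))

def mm_corrected_mi(joint_counts):
    """Miller-Madow: subtract (|X|-1)(|Y|-1) / (2N ln 2) bits of bias."""
    joint = np.asarray(joint_counts, dtype=float)
    n_x, n_y = joint.shape
    bias = (n_x - 1) * (n_y - 1) / (2 * joint.sum() * np.log(2))
    return plugin_mi_bits(joint) - bias

# Independent feature and outcome: true MI is exactly 0
rng = np.random.default_rng(0)
x = rng.integers(0, 10, 500)   # 10-category feature
y = rng.integers(0, 2, 500)    # binary outcome
joint = np.zeros((10, 2))
np.add.at(joint, (x, y), 1)

print(f"plug-in MI  : {plugin_mi_bits(joint):.4f} bits")   # > 0 (pure bias)
print(f"MM-corrected: {mm_corrected_mi(joint):.4f} bits")  # near 0
```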

The KL-Kelly connection assumes independent bets. The growth rate D_KL(P||Q) per bet assumes each wager is independent. Correlated bets (multiple NFL games on the same Sunday, or Polymarket contracts on related political outcomes) break this — actual growth rates will differ from the per-bet KL calculation. The Correlation and Portfolio Theory guide covers the multi-asset extension.

Cross-entropy is sensitive to extreme predictions. A model that assigns 0.01 to an outcome that occurs gets hit with -ln(0.01) = 4.6 nats of penalty for that single observation. One overconfident prediction can dominate the cross-entropy score. Use truncation (clip predictions to [0.01, 0.99]) or evaluate with Brier score alongside cross-entropy to get a more robust picture.

Entropy screening misses directional edge. A high-entropy market (50/50) where your model also says 50/50 has maximum entropy but zero edge. Entropy is a necessary condition for edge (low-entropy markets have less room for disagreement), not a sufficient one. Always follow entropy screening with KL divergence calculation.

FAQ

How do you use KL divergence to measure betting edge?

KL divergence D_KL(P || Q) measures how much your probability distribution P differs from the market’s distribution Q. If your model assigns 65% to an outcome the market prices at 55%, D_KL quantifies that disagreement in bits. Positive KL divergence means your model contains information the market hasn’t priced in — that’s edge. The expected log-wealth growth under Kelly sizing equals D_KL(Agent || Market).

What is Shannon entropy in prediction markets?

Shannon entropy H = -Sigma p_i log2(p_i) measures the uncertainty of a market in bits. A 50/50 binary market has maximum entropy of 1.0 bit. A 90/10 market has 0.469 bits. Agents use entropy to identify high-uncertainty markets where edge is most likely to exist — more uncertainty means more room for a better model to profit.

How does mutual information help select features for sports betting models?

Mutual information I(X;Y) = H(Y) - H(Y|X) measures how many bits of information a feature X provides about the outcome Y. A feature with high mutual information (like closing line movement) reduces your uncertainty about the outcome substantially. Features with near-zero mutual information (like jersey color) are noise. Agents use MI to rank and prune feature sets, keeping only the data sources that provide real predictive signal.

What is the connection between KL divergence and the Kelly Criterion?

Kelly Criterion maximizes expected log-wealth growth, which equals D_KL(Agent || Market) — the KL divergence between the agent’s probabilities and the market’s implied probabilities. This means the maximum growth rate an agent can achieve is bounded by how much better its probability estimates are than the market’s, measured in bits. The Kelly Criterion guide derives the sizing formula; information theory provides the theoretical ceiling.

How do you use cross-entropy to evaluate a betting model?

Cross-entropy H(P,Q) = -Sigma p_i log(q_i) measures how well your model’s predicted probabilities Q match the true outcome distribution P. Lower cross-entropy means better calibration. Cross-entropy equals Shannon entropy plus KL divergence: H(P,Q) = H(P) + D_KL(P || Q). Since H(P) is fixed, minimizing cross-entropy is equivalent to minimizing KL divergence from the true distribution — the same objective as maximizing betting profits under Kelly.

What’s Next

Information theory gives you the measurement tools — entropy to screen, KL to rank, MI to select features. The next step is ensuring those measurements are trustworthy.