An agent’s market selection problem is a multi-armed bandit. Each market or bet type is an arm with unknown expected return. Thompson Sampling — maintain Beta(alpha, beta) posteriors per arm, sample, pick the highest — is the best general-purpose solution. UCB1 selects the arm maximizing Q(a) + c * sqrt(ln(t) / N(a)) and guarantees O(K ln(T)) regret. Allocate 5-15% of bankroll to exploration.

Why This Matters for Agents

An autonomous betting agent faces hundreds of possible markets every day. Polymarket political contracts, Kalshi weather events, NBA spreads on BetOnline, MLB run totals on Bookmaker, UFC fight props on Bovada. The agent has a proven edge in some of these markets. It has no data on others. The question: should the agent keep betting where it knows it wins, or try new markets to find even better opportunities?

This is Layer 4 — Intelligence. The multi-armed bandit framework sits inside the agent’s decision engine, upstream of bet sizing (Kelly) and downstream of probability estimation. The Agent Betting Stack positions this as the market allocation module: given a bankroll and a universe of available markets, which markets does the agent trade in, and how much capital does each market get? The bandit algorithm answers both questions simultaneously — exploration budget determines capital allocation to uncertain markets, while exploitation concentrates capital where edge is proven.

Every betting agent that operates across multiple markets needs a bandit solution. Without one, the agent either over-exploits (missing profitable markets it never tried) or over-explores (burning bankroll on random markets with no edge). The math gives you the optimal tradeoff.

The Math

The K-Armed Bandit Formulation

Define the betting agent’s problem formally. There are K arms (markets/bet types). At each round t = 1, 2, …, T, the agent selects one arm a_t and receives a reward r_t drawn from that arm’s unknown reward distribution.

Arms:        a ∈ {1, 2, ..., K}
True mean:   μ(a) = E[r | arm = a]
Optimal arm: a* = argmax_a μ(a)

The agent’s goal is to maximize cumulative reward over T rounds. Equivalently, minimize cumulative regret:

Regret(T) = T × μ(a*) - Σ_{t=1}^{T} μ(a_t)

Regret measures the gap between what the agent earned and what it would have earned by always playing the optimal arm. Lower regret = better algorithm.

For a betting agent, each arm is a market category. The reward is the profit (or loss) from a bet in that category. The true mean μ(a) is the agent’s long-run expected profit per bet in market a — which depends on both the agent’s edge and the market’s characteristics (liquidity, vig, volatility).
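The regret definition above is just arithmetic over the (hidden) true means. A minimal sketch, with illustrative μ values that are not from any real data:

```python
# Hypothetical per-bet expected profits (units of stake) for three markets.
# These mu values are illustrative only.
mu = {"nba_totals": 0.04, "mlb_moneylines": -0.02, "polymarket_politics": 0.06}
mu_star = max(mu.values())  # mu(a*) = 0.06

# A pull sequence the agent actually played over 6 rounds
pulls = ["nba_totals", "mlb_moneylines", "polymarket_politics",
         "polymarket_politics", "nba_totals", "polymarket_politics"]

# Regret(T) = T * mu(a*) - sum over t of mu(a_t)
regret = len(pulls) * mu_star - sum(mu[a] for a in pulls)
print(f"Cumulative regret after {len(pulls)} rounds: {regret:.3f}")  # 0.120
```

Note that regret is computed from the true means, not the realized rewards — it measures the quality of the selection sequence, not luck.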

Epsilon-Greedy: The Baseline

The simplest bandit algorithm. With probability 1 - epsilon, play the arm with the highest estimated reward (exploit). With probability epsilon, play a uniformly random arm (explore).

Epsilon-Greedy Policy:
  With probability (1 - ε): select a_t = argmax_a Q(a)
  With probability ε:       select a_t ~ Uniform({1, ..., K})

Where Q(a) = (1/N(a)) × Σ r_i for all rounds where arm a was played
N(a) = number of times arm a has been pulled

Q(a) is the sample mean reward for arm a. N(a) is the pull count.

The problem with epsilon-greedy: it wastes exploration budget. With K = 20 arms and epsilon = 0.10, the agent spends 10% of its rounds picking random arms — including arms it already knows are terrible. It explores a market with -8% expected return just as eagerly as a market with unknown but potentially positive return.

Typical epsilon values: 0.05 to 0.10, often decayed over time as epsilon_t = epsilon_0 / (1 + decay_rate * t). Linear regret if epsilon is fixed; sublinear if decayed properly.
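The decay schedule above can be sketched directly; the rounds chosen below are illustrative:

```python
def epsilon_schedule(epsilon_0: float = 0.10, decay_rate: float = 0.001,
                     rounds=(1, 100, 1_000, 10_000)) -> dict[int, float]:
    # epsilon_t = epsilon_0 / (1 + decay_rate * t)
    return {t: epsilon_0 / (1 + decay_rate * t) for t in rounds}

for t, eps in epsilon_schedule().items():
    print(f"t={t:>6}: epsilon={eps:.4f}")
```

With these defaults, epsilon halves to 0.05 by round 1,000 and falls below 1% by round 10,000 — the agent explores heavily early and almost never late.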

UCB1: Upper Confidence Bound

UCB1 solves epsilon-greedy’s waste problem by exploring intelligently — it explores arms with high uncertainty, not random arms.

The UCB1 selection rule:

a_t = argmax_a [ Q(a) + c × sqrt(ln(t) / N(a)) ]

Where:

  • Q(a) = sample mean reward for arm a
  • t = total number of rounds played so far
  • N(a) = number of times arm a has been pulled
  • c = exploration constant (c = sqrt(2) is standard; smaller c means less exploration)

The term c * sqrt(ln(t) / N(a)) is the confidence bonus. It rewards under-explored arms: if N(a) is small relative to t, the bonus is large. As the agent pulls arm a more, N(a) grows, the bonus shrinks, and Q(a) dominates the selection.

Regret bound (Auer et al., 2002):

E[Regret(T)] ≤ 8 × Σ_{a: μ(a) < μ*} (ln(T) / Δ(a)) + (1 + π²/3) × Σ Δ(a)

Where Δ(a) = μ* - μ(a) is the suboptimality gap for arm a

The important result: UCB1 achieves O(K ln(T)) regret. This is logarithmic in T — the agent’s regret grows slower and slower over time. The Lai-Robbins lower bound proves that no algorithm can achieve asymptotic regret better than Ω(ln(T)) per suboptimal arm, so UCB1 is order-optimal.

For a betting agent with K = 15 market categories over T = 10,000 bets, UCB1’s cumulative regret grows on the order of K × ln(T) = 15 × ln(10,000) ≈ 138 units (the exact constant depends on the suboptimality gaps Δ(a)). Random selection, by contrast, incurs linear regret proportional to T itself: thousands of units over the same horizon.

Thompson Sampling: The Bayesian Solution

Thompson Sampling is empirically the best general-purpose bandit algorithm. It maintains a posterior distribution over each arm’s expected reward and samples from it to make decisions.

For binary outcomes (win/loss bets), the posterior is a Beta distribution:

Prior:      μ(a) ~ Beta(α₀, β₀)        (typically α₀ = β₀ = 1, i.e., uniform)
Update:     After a win on arm a:  α_a → α_a + 1
            After a loss on arm a: β_a → β_a + 1
Posterior:  μ(a) | data ~ Beta(α_a, β_a)

Thompson Sampling Policy:
  1. For each arm a, sample θ_a ~ Beta(α_a, β_a)
  2. Select a_t = argmax_a θ_a
  3. Observe reward r_t, update α or β for arm a_t

Why this works: arms with high uncertainty (few pulls) have wide posteriors. Wide posteriors produce high samples sometimes, causing the agent to explore those arms. Arms with consistently low rewards develop posteriors concentrated near zero — they almost never produce the highest sample. Arms with proven high rewards develop tight posteriors near their true mean — they get exploited most of the time.

Thompson Sampling achieves the same O(K ln(T)) regret bound as UCB1. In practice, it often outperforms because:

  1. It naturally adapts exploration intensity to uncertainty — no tuning parameter needed (unlike epsilon or c in UCB1).
  2. It explores “optimistically” — arms that could be good get explored, arms that are clearly bad do not.
  3. It is more robust to delayed feedback: because selection is randomized, a batch of pending, unresolved bets does not lock it onto a single arm the way a deterministic rule can (Chapelle & Li, 2011).

Contextual Bandits: Adding Features

Standard bandits treat each arm as having a fixed reward distribution. Contextual bandits extend this by incorporating a feature vector x_t at each round:

At round t:
  1. Observe context x_t ∈ R^d  (feature vector)
  2. Select arm a_t
  3. Receive reward r_t = x_t^T × θ_{a_t} + noise

Where θ_a ∈ R^d is the unknown parameter vector for arm a

For a betting agent, the context vector includes features like:

Feature                   Description                          Example
──────────────────────────────────────────────────────────────────────────────
Sport type                One-hot encoding of sport            [1, 0, 0, 0] = NBA
Market liquidity          Log of total volume                  ln($450,000) = 13.02
Time to close             Hours until market resolves          48.0
Odds movement velocity    Price change per hour                -0.003/hr
Day of week               Encoded as sine/cosine               sin(2π × 3/7), cos(2π × 3/7)
Model confidence          Agent’s internal confidence score    0.82

The canonical contextual bandit algorithm is LinUCB (Li et al., 2010):

a_t = argmax_a [ x_t^T × θ̂_a + α × sqrt(x_t^T × A_a^{-1} × x_t) ]

Where:
  θ̂_a = A_a^{-1} × b_a           (ridge regression estimate)
  A_a = I_d + Σ x_i × x_i^T      (design matrix for arm a)
  b_a = Σ r_i × x_i               (reward-weighted features for arm a)
  α = exploration parameter        (typically 1 + sqrt(ln(2/δ)/2))

This lets the agent learn that “NBA totals on back-to-back games” is profitable while “NBA totals on rest days” is not — a distinction standard bandits cannot make.
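A minimal sketch of the disjoint LinUCB model described above. The class names and the three-feature context are illustrative, not from any real agent; in practice the features would be normalized to comparable scales.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression state: A_a = I + sum(x x^T), b_a = sum(r x)."""
    def __init__(self, d: int):
        self.A = np.eye(d)
        self.b = np.zeros(d)

    def ucb(self, x: np.ndarray, alpha: float) -> float:
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b  # ridge regression estimate of theta_a
        # Mean prediction plus confidence-width bonus
        return float(x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

class LinUCB:
    def __init__(self, arm_names: list[str], d: int, alpha: float = 1.0):
        self.arms = {name: LinUCBArm(d) for name in arm_names}
        self.alpha = alpha

    def select(self, x: np.ndarray) -> str:
        scores = {name: arm.ucb(x, self.alpha) for name, arm in self.arms.items()}
        return max(scores, key=scores.get)

    def update(self, name: str, x: np.ndarray, reward: float) -> None:
        self.arms[name].update(x, reward)

# Hypothetical 3-feature context: [log liquidity, model confidence, hours to close]
bandit = LinUCB(["nba_totals", "nba_spreads"], d=3)
x = np.array([13.0, 0.82, 48.0])
arm = bandit.select(x)
bandit.update(arm, x, reward=1.0)
```

Because each arm shares the same feature space, the agent can generalize: a context it has never seen on one arm still produces a prediction from that arm’s fitted θ̂.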

Worked Examples

Example 1: Thompson Sampling Across Market Types

An agent has been trading across five market categories for 150 rounds. Here’s its current state:

Market Category          Wins   Losses   Win Rate   Posterior
─────────────────────────────────────────────────────────────
Polymarket politics       28      12      70.0%     Beta(29, 13)
Kalshi weather events      5       8      38.5%     Beta(6, 9)
NBA totals (BetOnline)    31      19      62.0%     Beta(32, 20)
NFL spreads (Bookmaker)   22      18      55.0%     Beta(23, 19)
MLB moneylines (Bovada)    3       4      42.9%     Beta(4, 5)

At round 151, Thompson Sampling draws one sample from each Beta posterior:

Polymarket politics:  sample from Beta(29, 13) → 0.714
Kalshi weather:       sample from Beta(6, 9)   → 0.528
NBA totals:           sample from Beta(32, 20) → 0.587
NFL spreads:          sample from Beta(23, 19) → 0.491
MLB moneylines:       sample from Beta(4, 5)   → 0.601

The agent selects Polymarket politics (highest sample: 0.714). But notice MLB moneylines drew 0.601 despite having the fewest pulls and a 42.9% observed win rate. The wide Beta(4, 5) posterior gives it a shot at producing high samples — that’s exploration happening naturally.

After 1,000 more rounds, MLB’s posterior will have narrowed. If its true win rate is below breakeven, it will almost never produce the winning sample again. No exploration budget wasted.
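The sampling step is one line per arm. A sketch reproducing the draw from Example 1’s posterior state — the seed is arbitrary, so the printed samples will differ from the illustrative draws in the table:

```python
import numpy as np

rng = np.random.default_rng(7)  # seed is arbitrary

# Posterior state from Example 1: (alpha, beta) per market
posteriors = {
    "polymarket_politics": (29, 13),
    "kalshi_weather": (6, 9),
    "nba_totals": (32, 20),
    "nfl_spreads": (23, 19),
    "mlb_moneylines": (4, 5),
}

samples = {m: rng.beta(a, b) for m, (a, b) in posteriors.items()}
selected = max(samples, key=samples.get)
for m, s in samples.items():
    print(f"{m:<22} sample = {s:.3f}")
print(f"selected: {selected}")
```

Run it repeatedly with different seeds and the wide Beta(4, 5) posterior will occasionally win the argmax — exploration emerging from the sampling itself.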

Example 2: UCB1 Market Selection

Same data, UCB1 with c = sqrt(2), total rounds t = 150:

Market Category          Q(a)     N(a)   Bonus                  UCB Score
─────────────────────────────────────────────────────────────────────────
Polymarket politics      0.700     40    sqrt(2) × sqrt(ln(150)/40) = 0.501    1.201
Kalshi weather           0.385     13    sqrt(2) × sqrt(ln(150)/13) = 0.878    1.263
NBA totals               0.620     50    sqrt(2) × sqrt(ln(150)/50) = 0.448    1.068
NFL spreads              0.550     40    sqrt(2) × sqrt(ln(150)/40) = 0.501    1.051
MLB moneylines           0.429      7    sqrt(2) × sqrt(ln(150)/7)  = 1.196    1.625

UCB1 selects MLB moneylines (UCB score 1.625) — not because it’s the best, but because it’s the most uncertain. With only 7 pulls, the confidence bonus dominates. After more pulls, if MLB’s Q(a) stays low, the bonus will shrink and the agent will shift to exploiting Polymarket politics.
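As a check, the scores can be recomputed directly from the selection rule, taking t as the total pull count across the five arms (40 + 13 + 50 + 40 + 7 = 150):

```python
import math

c = math.sqrt(2)
arm_stats = {  # market: (Q(a), N(a))
    "polymarket_politics": (0.700, 40),
    "kalshi_weather": (0.385, 13),
    "nba_totals": (0.620, 50),
    "nfl_spreads": (0.550, 40),
    "mlb_moneylines": (0.429, 7),
}
t = sum(n for _, n in arm_stats.values())  # total pulls = 150

scores = {m: q + c * math.sqrt(math.log(t) / n) for m, (q, n) in arm_stats.items()}
for m, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{m:<22} UCB = {s:.3f}")
```

The least-pulled arm tops the ranking despite the lowest Q(a), exactly because its bonus term dominates.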

Example 3: Exploration Budget Calculation

An agent has a $10,000 bankroll and uses Thompson Sampling across K = 8 market categories. How much capital goes to exploration?

Thompson Sampling doesn’t have an explicit exploration parameter, but you can estimate the exploration fraction empirically. In the first 100 rounds, roughly 30-40% of selections will be exploratory (non-optimal arm). By round 1,000, this drops to 5-10%. By round 10,000, it’s under 2%.

For explicit budgeting (e.g., with epsilon-greedy), a common approach:

Exploration budget = bankroll × epsilon × (1 / K_unexplored)
                   = $10,000 × 0.10 × (1 / 3)
                   = $333 per unexplored market category

Exploitation budget = bankroll × (1 - epsilon)
                    = $10,000 × 0.90 = $9,000
                    allocated proportionally to proven markets

With a $10,000 bankroll and epsilon = 0.10, the agent allocates $1,000 total to exploration across 3 untested market categories ($333 each) and $9,000 to exploitation across the 5 proven categories, weighted by estimated edge. If Polymarket politics has the highest estimated edge, it gets the largest share of that $9,000.
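The budget arithmetic is simple enough to script. The per-market edge estimates below are hypothetical placeholders for the agent's own estimates:

```python
bankroll = 10_000.0
epsilon = 0.10
k_unexplored = 3

exploration_total = bankroll * epsilon              # $1,000 total to exploration
per_unexplored = exploration_total / k_unexplored   # ~ $333 per untested market
exploitation_total = bankroll * (1 - epsilon)       # $9,000 to proven markets

# Split exploitation proportionally to estimated edge (hypothetical numbers)
edges = {"polymarket_politics": 0.08, "nba_totals": 0.05,
         "nfl_spreads": 0.03, "kalshi_weather": 0.01, "mlb_moneylines": 0.01}
total_edge = sum(edges.values())
exploitation = {m: exploitation_total * e / total_edge for m, e in edges.items()}

for m, amount in exploitation.items():
    print(f"{m:<22} ${amount:,.0f}")
```

Proportional-to-edge weighting is one reasonable choice; in a live system the split would come from Kelly fractions rather than raw edge ratios.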

Implementation

import numpy as np
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArmState:
    """Tracks the state of a single bandit arm (market category)."""
    name: str
    alpha: float = 1.0  # Beta prior successes + 1
    beta: float = 1.0   # Beta prior failures + 1
    total_reward: float = 0.0
    pull_count: int = 0

    @property
    def mean_reward(self) -> float:
        """Sample mean reward."""
        if self.pull_count == 0:
            return 0.0
        return self.total_reward / self.pull_count

    @property
    def win_rate(self) -> float:
        """Posterior mean (Beta distribution)."""
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: float) -> None:
        """Update arm state after observing a reward (1.0 = win, 0.0 = loss)."""
        self.pull_count += 1
        self.total_reward += reward
        if reward > 0:
            self.alpha += 1.0
        else:
            self.beta += 1.0


class ThompsonSamplingAllocator:
    """
    Multi-armed bandit market allocator using Thompson Sampling.
    Each arm is a market category (e.g., 'polymarket_politics', 'nba_totals').
    Binary reward: 1 if bet was profitable, 0 if not.
    """

    def __init__(self, arm_names: list[str], seed: Optional[int] = None):
        self.arms: dict[str, ArmState] = {
            name: ArmState(name=name) for name in arm_names
        }
        self.rng = np.random.default_rng(seed)
        self.round: int = 0
        self.history: list[dict] = []

    def select_arm(self) -> str:
        """
        Thompson Sampling: sample from each arm's Beta posterior,
        return the arm with the highest sample.
        """
        self.round += 1
        samples = {}
        for name, arm in self.arms.items():
            samples[name] = self.rng.beta(arm.alpha, arm.beta)

        selected = max(samples, key=samples.get)

        self.history.append({
            "round": self.round,
            "samples": dict(samples),
            "selected": selected,
        })
        return selected

    def update(self, arm_name: str, reward: float) -> None:
        """Update the selected arm with observed reward."""
        self.arms[arm_name].update(reward)

    def get_allocation_weights(self, n_samples: int = 10_000) -> dict[str, float]:
        """
        Estimate the probability each arm is optimal by Monte Carlo sampling.
        This gives the long-run allocation fraction for each market.
        """
        win_counts = {name: 0 for name in self.arms}

        for _ in range(n_samples):
            samples = {
                name: self.rng.beta(arm.alpha, arm.beta)
                for name, arm in self.arms.items()
            }
            winner = max(samples, key=samples.get)
            win_counts[winner] += 1

        total = sum(win_counts.values())
        return {name: count / total for name, count in win_counts.items()}

    def summary(self) -> str:
        """Print current state of all arms."""
        lines = [
            f"{'Market':<28} {'Pulls':>6} {'Win%':>7} "
            f"{'Alpha':>7} {'Beta':>7} {'Post.Mean':>10}"
        ]
        lines.append("-" * 72)
        for name, arm in sorted(self.arms.items(), key=lambda x: -x[1].win_rate):
            lines.append(
                f"{name:<28} {arm.pull_count:>6} "
                f"{arm.mean_reward * 100:>6.1f}% "
                f"{arm.alpha:>7.1f} {arm.beta:>7.1f} "
                f"{arm.win_rate:>9.1%}"
            )
        return "\n".join(lines)


class UCB1Allocator:
    """
    UCB1 market allocator. Selects arm maximizing
    Q(a) + c * sqrt(ln(t) / N(a)).
    """

    def __init__(self, arm_names: list[str], c: float = np.sqrt(2)):
        self.arms: dict[str, ArmState] = {
            name: ArmState(name=name) for name in arm_names
        }
        self.c = c
        self.round: int = 0

    def select_arm(self) -> str:
        """UCB1 selection. Pull each arm once first, then use UCB formula."""
        self.round += 1

        # Force exploration: pull each arm at least once
        for name, arm in self.arms.items():
            if arm.pull_count == 0:
                return name

        scores = {}
        for name, arm in self.arms.items():
            exploration_bonus = self.c * np.sqrt(
                np.log(self.round) / arm.pull_count
            )
            scores[name] = arm.mean_reward + exploration_bonus

        return max(scores, key=scores.get)

    def update(self, arm_name: str, reward: float) -> None:
        """Update the selected arm with observed reward."""
        self.arms[arm_name].update(reward)


class EpsilonGreedyAllocator:
    """
    Epsilon-greedy with optional decay.
    epsilon_t = epsilon_0 / (1 + decay_rate * t)
    """

    def __init__(
        self,
        arm_names: list[str],
        epsilon: float = 0.10,
        decay_rate: float = 0.001,
        seed: Optional[int] = None,
    ):
        self.arms: dict[str, ArmState] = {
            name: ArmState(name=name) for name in arm_names
        }
        self.epsilon_0 = epsilon
        self.decay_rate = decay_rate
        self.rng = np.random.default_rng(seed)
        self.round: int = 0

    def select_arm(self) -> str:
        """Epsilon-greedy with decay."""
        self.round += 1
        epsilon_t = self.epsilon_0 / (1 + self.decay_rate * self.round)

        if self.rng.random() < epsilon_t:
            return self.rng.choice(list(self.arms.keys()))

        # Exploit: pick the arm with highest observed mean
        best_arm = max(
            self.arms.items(),
            key=lambda x: x[1].mean_reward if x[1].pull_count > 0 else float("inf"),
        )
        return best_arm[0]

    def update(self, arm_name: str, reward: float) -> None:
        """Update the selected arm with observed reward."""
        self.arms[arm_name].update(reward)


def simulate_bandit(
    allocator,
    true_win_rates: dict[str, float],
    n_rounds: int = 5000,
    seed: int = 42,
) -> dict:
    """
    Simulate a bandit algorithm against arms with known true win rates.
    Returns cumulative regret and final arm statistics.
    """
    rng = np.random.default_rng(seed)
    optimal_rate = max(true_win_rates.values())
    cumulative_regret = 0.0
    regret_history = []

    for t in range(n_rounds):
        arm = allocator.select_arm()
        true_rate = true_win_rates[arm]
        reward = 1.0 if rng.random() < true_rate else 0.0
        allocator.update(arm, reward)

        instantaneous_regret = optimal_rate - true_rate
        cumulative_regret += instantaneous_regret
        regret_history.append(cumulative_regret)

    return {
        "cumulative_regret": cumulative_regret,
        "regret_history": regret_history,
        "final_round": n_rounds,
    }


# --- Run a full simulation ---
if __name__ == "__main__":
    # Define market categories with true (hidden) win rates
    # These represent the agent's actual edge in each market
    true_rates = {
        "polymarket_politics": 0.58,   # Known strong edge
        "kalshi_weather": 0.44,        # Slightly below breakeven after vig
        "nba_totals_betonline": 0.55,  # Moderate edge
        "nfl_spreads_bookmaker": 0.53, # Small edge
        "mlb_ml_bovada": 0.48,         # Negative edge (vig eats profit)
        "ufc_props_mybookie": 0.51,    # Marginal edge
        "polymarket_crypto": 0.56,     # Good edge, unexplored
        "kalshi_economics": 0.42,      # No edge
    }

    arm_names = list(true_rates.keys())

    # Compare all three algorithms
    print("=" * 60)
    print("MULTI-ARMED BANDIT SIMULATION: 5,000 ROUNDS")
    print("=" * 60)

    for name, AllocClass, kwargs in [
        ("Thompson Sampling", ThompsonSamplingAllocator, {"seed": 42}),
        ("UCB1 (c=sqrt(2))", UCB1Allocator, {"c": np.sqrt(2)}),
        ("Epsilon-Greedy (0.10)", EpsilonGreedyAllocator, {"epsilon": 0.10, "seed": 42}),
    ]:
        alloc = AllocClass(arm_names=arm_names, **kwargs)
        result = simulate_bandit(alloc, true_rates, n_rounds=5000, seed=42)
        print(f"\n{name}")
        print(f"  Cumulative regret: {result['cumulative_regret']:.1f}")
        print(f"  Avg regret/round:  {result['cumulative_regret'] / 5000:.4f}")

        if hasattr(alloc, "summary"):
            print(f"\n{alloc.summary()}")

    # Show Thompson Sampling allocation weights
    ts = ThompsonSamplingAllocator(arm_names=arm_names, seed=42)
    simulate_bandit(ts, true_rates, n_rounds=5000, seed=42)
    weights = ts.get_allocation_weights()
    print("\n\nThompson Sampling — Estimated Optimal Allocation:")
    print("-" * 45)
    for name, weight in sorted(weights.items(), key=lambda x: -x[1]):
        print(f"  {name:<30} {weight:>6.1%}")

Limitations and Edge Cases

Non-stationary rewards. Standard bandits assume each arm’s reward distribution is fixed. Betting markets are non-stationary — a model’s edge in NBA totals evaporates mid-season as sportsbooks adjust their lines. Solutions: use a sliding window for Q(a) (e.g., last 200 bets only), apply exponential decay to older observations, or use the Discounted UCB variant with discount factor gamma in [0.9, 0.99].
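Both fixes amount to changing how Q(a) is computed. A minimal sketch of a sliding-window estimator and an exponentially discounted one; the window size and gamma below are illustrative:

```python
from collections import deque

class SlidingWindowArm:
    """Q(a) over only the last `window` rewards; older bets age out."""
    def __init__(self, window: int = 200):
        self.rewards = deque(maxlen=window)  # deque drops oldest automatically

    def update(self, reward: float) -> None:
        self.rewards.append(reward)

    @property
    def q(self) -> float:
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

class DiscountedArm:
    """Exponentially discounted mean: each round, old evidence decays by gamma."""
    def __init__(self, gamma: float = 0.98):
        self.gamma = gamma
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, reward: float) -> None:
        self.weighted_sum = self.gamma * self.weighted_sum + reward
        self.weight = self.gamma * self.weight + 1.0

    @property
    def q(self) -> float:
        return self.weighted_sum / self.weight if self.weight else 0.0
```

Either estimator can be dropped into the UCB1 score in place of the plain sample mean; the discounted version is the core of the Discounted UCB variant mentioned above.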

Delayed rewards. Prediction market bets on Polymarket or Kalshi can take days, weeks, or months to resolve. The agent selects an arm today but doesn’t observe the reward until the market settles. Standard bandits assume immediate feedback. With delayed feedback, the agent over-explores because it hasn’t updated its posteriors with pending outcomes. Mitigation: maintain a “pending bets” queue and use the expected value of pending bets as a provisional update.
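One way to sketch the pending-bets mitigation: credit each unresolved bet with fractional Beta pseudo-counts equal to the model's estimated win probability, then replace the provisional credit with the real outcome at settlement. The class names and the crediting scheme are assumptions, not a standard algorithm:

```python
from dataclasses import dataclass, field

@dataclass
class PendingBet:
    arm: str
    est_win_prob: float  # model's win-probability estimate at bet time

@dataclass
class PendingQueue:
    pending: list = field(default_factory=list)

    def provisional_params(self, arm: str, alpha: float, beta: float):
        """Beta parameters for `arm` including provisional credit for
        unresolved bets: each adds est_win_prob to alpha, the rest to beta."""
        for bet in self.pending:
            if bet.arm == arm:
                alpha += bet.est_win_prob
                beta += 1.0 - bet.est_win_prob
        return alpha, beta
```

Thompson Sampling would then draw from the provisional Beta when selecting, which dampens the over-exploration caused by posteriors that lag behind placed bets.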

Correlated arms. Standard bandits assume arms are independent. In reality, “NBA totals on BetOnline” and “NBA totals on Bookmaker” are highly correlated — edge in one implies edge in the other. Ignoring this means the agent explores both separately when one would suffice. Contextual bandits partially solve this by sharing information across arms via the feature vector.

Vig erosion of reward signal. A bet that wins at 52% true probability against -110 odds has positive EV but the reward signal (win/loss) is noisy. The agent needs hundreds of pulls per arm to distinguish a 53% win rate from a 50% win rate. With K = 20 arms, that’s 4,000+ total rounds before the bandit algorithm converges — potentially months of live betting. Practical fix: use your model’s estimated EV as the reward signal instead of binary win/loss. This gives a continuous, less noisy signal.

Bankroll constraints. Bandit algorithms assume unlimited resources. A real agent has finite bankroll. A bad exploration run (10 losses in a row on a new market) can trigger drawdown limits and force the agent to reduce position sizes across all markets. Integrate the bandit allocator with Kelly sizing — arms with high uncertainty get smaller Kelly fractions automatically because the estimated edge has a wide confidence interval.

Cold start. With zero data on any arm, all algorithms degenerate to random selection for the first K rounds. If K is large (30+ markets), the cold start phase burns significant bankroll. Prior knowledge helps: initialize Beta posteriors with informative priors based on historical backtest data rather than uniform Beta(1, 1).
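A sketch of converting backtest results into an informative Beta prior. The key judgment call is how many live bets the backtest evidence is "worth" — usually well below the actual backtest sample size, since backtests overstate live edge:

```python
def prior_from_backtest(win_rate: float, confidence_n: float) -> tuple[float, float]:
    """Turn a backtest win rate into Beta(alpha, beta) pseudo-counts,
    treating the backtest as worth `confidence_n` live observations."""
    alpha = 1.0 + win_rate * confidence_n
    beta = 1.0 + (1.0 - win_rate) * confidence_n
    return alpha, beta

# Backtest shows 56% wins; trust it as worth ~25 live bets
print(prior_from_backtest(0.56, 25))  # ~ (15.0, 12.0)
```

Seeding `ArmState(alpha=15.0, beta=12.0)` instead of the uniform Beta(1, 1) lets the Thompson allocator skip most of the cold-start phase for markets with backtest history.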

FAQ

What is the multi-armed bandit problem in sports betting?

The multi-armed bandit problem in betting frames each market, sport, or bet type as an “arm” with unknown expected return. An agent must decide whether to exploit arms where it knows it has edge (e.g., NBA totals with a proven model) or explore new arms (e.g., Kalshi weather markets) to discover untapped edge. UCB and Thompson Sampling are the two most effective algorithms for this tradeoff.

How does Thompson Sampling work for betting agent market allocation?

Thompson Sampling maintains a Beta(alpha, beta) posterior distribution for each market’s expected win rate. Alpha counts successes (profitable bets), beta counts failures. Each round, the agent samples from every posterior and picks the market with the highest sample. Markets with high uncertainty get explored naturally because their posteriors are wide. As data accumulates, the posteriors narrow and the agent exploits the best markets.

What is the UCB1 formula for explore-exploit in betting?

UCB1 selects the arm maximizing Q(a) + c * sqrt(ln(t) / N(a)), where Q(a) is the estimated mean reward for arm a, t is the total round count, N(a) is how many times arm a has been tried, and c is an exploration constant (typically sqrt(2)). The second term is a confidence bonus that shrinks as an arm is pulled more, ensuring under-explored arms get tried.

How much bankroll should a betting agent allocate to exploration?

Typical exploration budgets are 5-15% of total bankroll. The exact amount depends on the agent’s confidence in its current best markets and the number of unexplored alternatives. Thompson Sampling handles this implicitly through posterior uncertainty — no explicit budget is needed. For epsilon-greedy, set epsilon to 0.05-0.10, decaying over time as the agent’s market map stabilizes.

What is the difference between contextual bandits and standard multi-armed bandits for betting?

Standard MABs treat each arm as having a fixed, unknown reward distribution. Contextual bandits incorporate a feature vector — sport type, market liquidity, time of day, odds movement velocity — to model reward as a function of context. This lets an agent learn that NBA totals are profitable on back-to-back games specifically, rather than treating “NBA totals” as a single undifferentiated arm.

What’s Next

The bandit framework is the first step in the agent intelligence trilogy. Next, the agent needs to reason about other agents’ strategies — that’s game theory for prediction market agents. Then it needs to optimize its bet timing across episodes — that’s reinforcement learning for bet timing.

  • Game theory extension: Game Theory for Prediction Market Agents — when other agents are adversarial, bandit rewards become non-stationary because competitors adapt. Nash equilibrium analysis tells you when to expect reward shifts.
  • Full RL formulation: Reinforcement Learning for Bet Timing — extends bandits to sequential decisions with state transitions. The agent’s bankroll, current positions, and market state form the state space.
  • Bankroll risk management: Drawdown Math and Variance in Betting — exploration bets increase variance. Understand how exploration budgets affect maximum drawdown probability.
  • Agent infrastructure: The Agent Betting Stack shows where the bandit allocator sits in the full four-layer architecture.
  • Sharp market dynamics: Sharp betting strategies explain why edges disappear — the non-stationarity problem that makes standard bandits insufficient for long-horizon deployment.