Frame bet timing and sizing as a Markov Decision Process. State = (odds, time, position, bankroll, edge). Action = bet size or wait. Train Q-learning or PPO on historical simulation, transfer to live markets with domain randomization. RL doesn’t replace your edge model — it optimizes when and how much to execute on edges your model finds.

Why This Matters for Agents

A betting agent that detects a +EV opportunity at 2:14 PM faces a non-trivial execution problem: should it fire the full Kelly allocation immediately, split it across the next 30 minutes as the line moves, wait for a sharper entry, or skip entirely because the line is about to close? Static rules — “bet immediately when edge > 2%” — leave money on the table. The optimal execution strategy depends on how odds evolve, how close the event is, what the current position is, and how much bankroll is at risk.

This is Layer 4 — Intelligence, specifically the execution sub-layer. The upstream model (Bayesian, regression, Elo — anything from earlier in this series) identifies edges. RL learns the execution policy: the mapping from market state to optimal action. It sits between edge detection and order placement in the Agent Betting Stack, consuming signals from Layer 4 intelligence and issuing instructions to Layer 3 trading infrastructure.

The multi-armed bandit approach handles the simpler problem of selecting which opportunity to pursue. Game theory models strategic interactions with other market participants. RL unifies both into a sequential decision framework where the agent learns from the consequences of its own actions over time.

The Math

The Betting MDP

A Markov Decision Process is defined by the tuple (S, A, P, R, γ):

State space S. The agent observes a feature vector at each timestep:

s_t = (odds_t, Δodds_t, time_to_event_t, position_t, bankroll_t, model_edge_t, volume_t, spread_t)
  • odds_t: current decimal odds (or Polymarket YES price)
  • Δodds_t: odds change over the last k intervals (momentum)
  • time_to_event_t: fraction of time remaining until event resolution, ∈ [0, 1]
  • position_t: current net exposure as fraction of bankroll
  • bankroll_t: current bankroll (normalized to starting bankroll = 1.0)
  • model_edge_t: the upstream model’s estimated edge = p_model - p_market
  • volume_t: recent trading volume (normalized)
  • spread_t: current bid-ask spread

Action space A. For continuous sizing:

a_t ∈ [-max_fraction, +max_fraction]

where a_t > 0 means buy YES (go long), a_t < 0 means buy NO (go short), and a_t = 0 means wait. The fraction is relative to current bankroll. A typical max_fraction is 0.05 (5% of bankroll per action).

For discrete action spaces (simpler to train): A = {-2x, -1x, 0, +1x, +2x} where x is a base unit size.

Transition function P(s’|s, a). The next state depends on:

  1. How odds evolve (stochastic, driven by the market)
  2. The agent’s action (changes position and bankroll)
  3. Time passing (time_to_event decreases)

The agent does not control odds evolution — only its own position. The market dynamics are exogenous and stochastic, and they depend on information the agent cannot see; this is what makes betting RL hard.

Reward function R(s, a, s’). The reward is realized profit or loss:

r_t = position_t × (odds_{t+1} - odds_t) - transaction_cost_t - opportunity_cost_t

where position_t is the exposure held over the interval, transaction_cost (charged on the change in position) includes spread crossing and fees, and opportunity_cost penalizes idle capital. At event resolution, the terminal reward is the final PnL on the position.
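The per-step reward is straightforward to compute from these definitions. This minimal sketch uses illustrative numbers and mirrors the simulator later in this section (the function name is ours):

```python
# Minimal numeric sketch of the per-step reward.
# Mark-to-market PnL accrues on the position held; transaction cost on the change.

def step_reward(position, position_change, odds_now, odds_next,
                bankroll, cost_rate=0.005, idle_penalty=0.0):
    """Reward = MTM PnL on held position - cost of changing it - idle penalty."""
    mtm_pnl = position * bankroll * (odds_next - odds_now)
    txn_cost = abs(position_change) * bankroll * cost_rate
    return mtm_pnl - txn_cost - idle_penalty

# Hold 2% of a 10,000 bankroll; odds move from 0.42 to 0.44; no new trade.
r = step_reward(position=0.02, position_change=0.0,
                odds_now=0.42, odds_next=0.44, bankroll=10_000)
# 0.02 * 10000 * 0.02 = 4.0
```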

Discount factor γ. Set γ = 0.99 for multi-step episodes (e.g., betting over hours or days before an event). For single-bet-then-resolve settings, γ = 1.0 is appropriate since there’s no infinite horizon.

Q-Learning: Learning the Optimal Action-Value Function

The optimal action-value function Q*(s, a) represents the maximum expected cumulative reward from taking action a in state s and following the optimal policy thereafter:

Q*(s, a) = E[r_t + γ max_{a'} Q*(s', a') | s_t = s, a_t = a]

This is the Bellman optimality equation. If we know Q*, the optimal policy is trivial: π*(s) = argmax_a Q*(s, a). The agent always picks the action with the highest expected cumulative value.

Tabular Q-learning updates the Q-table iteratively:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]

where α is the learning rate. The term in brackets is the temporal difference (TD) error — the discrepancy between the current Q-estimate and the bootstrapped target r + γ max Q(s’, a’).

Convergence is guaranteed under the Robbins-Monro conditions: (1) every state-action pair is visited infinitely often, and (2) the learning rate decays such that Σα_t = ∞ and Σα_t² < ∞. In practice, α = 0.001 with decay works.

Exploration uses ε-greedy: with probability ε, take a random action; otherwise, take argmax_a Q(s, a). Decay ε from 1.0 to 0.01 over training:

ε_t = max(0.01, 1.0 - t / N_decay)

where N_decay is the number of episodes over which to anneal.
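The update rule and ε-greedy schedule fit in a few lines. This toy sketch uses a random placeholder environment over a coarsely discretized state, just to show the mechanics (the state/action counts are assumptions, not recommendations):

```python
import numpy as np

# Tabular Q-learning sketch on a coarsely discretized betting state.
# Transitions and rewards here are random placeholders for a real simulator.

rng = np.random.default_rng(0)
n_states, n_actions = 50, 5          # e.g. (odds bucket x time bucket), {-2x..+2x}
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99
N_decay = 5_000

def epsilon(t):
    """Linear anneal from 1.0 to 0.01 over N_decay episodes."""
    return max(0.01, 1.0 - t / N_decay)

for t in range(10_000):
    s = rng.integers(n_states)
    # epsilon-greedy action selection
    if rng.random() < epsilon(t):
        a = rng.integers(n_actions)
    else:
        a = int(np.argmax(Q[s]))
    # toy transition and reward (placeholders for the simulator)
    s_next = rng.integers(n_states)
    r = rng.normal(0.0, 1.0)
    # TD update toward the bootstrapped target
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```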

Deep Q-Networks (DQN) for High-Dimensional States

When the state space is continuous (which it always is in betting), tabular Q-learning fails. DQN approximates Q*(s, a) with a neural network Q(s, a; θ):

Loss = E[(r + γ max_{a'} Q(s', a'; θ⁻) - Q(s, a; θ))²]

Two techniques stabilize training:

1. Experience replay. Store transitions (s, a, r, s’, done) in a replay buffer of size N (typically 100,000-1,000,000). Sample random minibatches of 64-256 transitions for each gradient update. This breaks temporal correlation and improves sample efficiency.

2. Target network. Maintain a separate target network with parameters θ⁻, updated via soft update:

θ⁻ ← τθ + (1 - τ)θ⁻,   τ = 0.005

This prevents the moving-target problem where Q-values chase their own updates. Use Huber loss instead of MSE to reduce sensitivity to outlier TD errors (large unexpected losses in betting).
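Both stabilizers are small in code. This framework-free sketch stands numpy arrays in for network weights, so it shows the mechanics without a deep-learning dependency:

```python
import random
from collections import deque
import numpy as np

# Sketch of the two DQN stabilizers; parameters are stand-in numpy arrays
# rather than a real Q-network, and transitions are dummies.

buffer = deque(maxlen=100_000)             # experience replay buffer
for _ in range(1_000):                     # fill with dummy transitions
    buffer.append((np.zeros(8), 0, 0.0, np.zeros(8), False))
batch = random.sample(buffer, 64)          # decorrelated minibatch

theta = np.ones(10)                        # online parameters
theta_target = np.zeros(10)                # target parameters
tau = 0.005
theta_target = tau * theta + (1 - tau) * theta_target  # soft update

def huber(td_error, delta=1.0):
    """Quadratic near zero, linear in the tails — robust to outlier TD errors."""
    a = np.abs(td_error)
    return np.where(a <= delta, 0.5 * td_error**2, delta * (a - 0.5 * delta))
```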

Policy Gradient Methods for Continuous Bet Sizing

DQN works for discrete actions but struggles with continuous action spaces. Policy gradient methods directly parameterize the policy π_θ(a|s) as a probability distribution.

For bet sizing, use a Gaussian policy:

π_θ(a|s) = N(μ_θ(s), σ_θ(s)²)

The network outputs mean μ and log-standard-deviation log(σ) for each state. The agent samples a bet size from this distribution. As training progresses, σ shrinks and the policy becomes more deterministic.
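A minimal sketch of the sampling step, with a stand-in linear "network" in place of a trained policy head (the weights and scaling constants here are illustrative assumptions):

```python
import numpy as np

# Gaussian policy head sketch: a stand-in linear map produces mu and
# log-sigma from the 8-dim state; the action is sampled, then clipped
# to the allowed bet-size range.

rng = np.random.default_rng(42)
W_mu, W_logsig = rng.normal(0, 0.1, 8), rng.normal(0, 0.1, 8)

def sample_action(state, max_frac=0.05):
    mu = float(np.tanh(state @ W_mu) * max_frac)     # mean bet size in bounds
    sigma = float(np.exp(state @ W_logsig) * 0.01)   # exploration noise scale
    a = rng.normal(mu, sigma)                        # sample from N(mu, sigma^2)
    return float(np.clip(a, -max_frac, max_frac))

s = np.array([0.42, -0.02, 0.54, 0.0, 1.0, 0.09, 0.5, 0.03])
a = sample_action(s)   # a bet fraction in [-0.05, 0.05]
```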

REINFORCE (vanilla policy gradient) updates:

∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a_t|s_t) × G_t]

where G_t = Σ_{k=0}^{T-t} γ^k r_{t+k} is the return-to-go. This has high variance. Subtract a baseline (typically the state-value V(s)):

∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a_t|s_t) × A_t]

where A_t = G_t - V(s_t) is the advantage function. A_t > 0 means “this action was better than average” — reinforce it. A_t < 0 means “worse than average” — discourage it.
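The return-to-go and advantage computations look like this in numpy (the baseline values below are placeholders, not a trained critic):

```python
import numpy as np

# Discounted returns-to-go G_t and advantages A_t = G_t - V(s_t)
# for a single episode, computed with a backward pass.

def returns_to_go(rewards, gamma=0.99):
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = np.array([0.0, 0.0, 1.0])      # sparse, terminal-style reward
G = returns_to_go(rewards)               # [0.9801, 0.99, 1.0]
V = np.array([0.5, 0.6, 0.7])            # placeholder baseline estimates
A = G - V                                # positive => reinforce that action
```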

PPO (Proximal Policy Optimization) clips the policy ratio to prevent catastrophic updates:

L^CLIP(θ) = E[min(r_t(θ) × A_t, clip(r_t(θ), 1-ε, 1+ε) × A_t)]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio between new and old policies, and ε = 0.2 is the clipping parameter. If the policy changes too much in one update (r_t far from 1), the clipped term limits the gradient. This is the standard algorithm for continuous-action RL in 2026.
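The clipped term can be checked on a single transition. This sketch evaluates L^CLIP for one (action, advantage) pair, using log-probabilities as inputs:

```python
import numpy as np

# The PPO clipped surrogate objective for a single transition.

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# Policy wants to boost a good action far beyond the trust region:
# ratio ~ 1.49, but the clipped term caps the objective at 1.2 * A.
val = ppo_clip_term(logp_new=-0.5, logp_old=-0.9, advantage=2.0)
# min(1.4918 * 2.0, 1.2 * 2.0) = 2.4
```

Note the min is pessimistic: for negative advantages the unclipped term can dominate, so the objective never rewards moving far from the old policy.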

A2C (Advantage Actor-Critic) uses two networks: an actor π_θ(a|s) and a critic V_φ(s). The critic estimates state values; the actor uses advantage estimates from the critic to update the policy. A2C is synchronous — it runs multiple environment copies in parallel and averages gradients.

Non-Stationarity: The Central Challenge

Betting markets violate the stationarity assumption that standard RL requires. Three mechanisms cause distribution shift:

1. Adaptive opponents. Other bettors and market makers observe the same data. If your RL agent exploits a pattern (e.g., late money on NFL favorites predicts covers), other participants adapt to close that pattern.

2. Seasonal regime changes. NFL line dynamics differ from NBA. Preseason markets differ from playoff markets. A Q-function trained on regular season data may be worthless in postseason.

3. Information regime shifts. A new injury reporting rule, a platform fee change, or a liquidity shock fundamentally alters the transition dynamics P(s’|s, a).

Mitigation strategies:

  • Sliding window training. Only train on the most recent N episodes (e.g., last 90 days). Old data ages out.
  • Contextual features. Include regime indicators in the state: sport, season phase, day of week, hours to event. Let the network learn regime-conditional policies.
  • Performance monitoring. Track rolling 30-day Sharpe ratio. When Sharpe drops below a threshold (e.g., < 0.5 after previously being > 1.5), trigger retraining.
  • Ensemble policies. Run 3-5 policies trained on different windows. Weight by recent performance. This hedges against any single policy going stale.
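The ensemble strategy can be sketched as a softmax weighting over recent Sharpe ratios; the inputs here are assumed values, not measurements:

```python
import numpy as np

# Sketch: blend several policies trained on different windows, weighting
# each by recent performance via a softmax over rolling Sharpe ratios.

def ensemble_action(actions, recent_sharpes, temperature=1.0):
    """Performance-weighted average of per-policy bet fractions."""
    s = np.asarray(recent_sharpes, dtype=float) / temperature
    w = np.exp(s - s.max())          # stable softmax
    w /= w.sum()
    return float(np.dot(w, actions)), w

actions = np.array([0.03, 0.01, -0.02])   # three policies' bet fractions
sharpes = np.array([1.8, 0.9, -0.4])      # the stale policy gets downweighted
blended, weights = ensemble_action(actions, sharpes)
```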

Worked Examples

Example 1: DQN Bet Timing on Polymarket

A Polymarket market “Will the Fed cut rates in June 2026?” is trading YES at $0.42. Your Bayesian model estimates a 51% probability. That’s 9 cents of edge on a $1.00 contract.

The question: buy now at $0.42, or wait? Historical data shows this market tends to drift toward model consensus 3-5 days before the FOMC meeting. It’s currently 14 days out.

State: s = (0.42, -0.02, 0.54, 0.0, 1.0, 0.09, 12500, 0.03)
         odds  Δodds  time  pos  bank  edge  vol   spread

DQN evaluates Q(s, a) for a ∈ {-0.02, -0.01, 0, +0.01, +0.02, +0.05}:

  a = 0.00 (wait):   Q = 0.041   ← DQN expects price will drop, better entry later
  a = +0.01 (small):  Q = 0.038
  a = +0.02 (medium): Q = 0.035
  a = +0.05 (large):  Q = 0.029   ← large position now suboptimal

Optimal action: wait (a = 0)

Three days later, YES has dipped to $0.38 on a hawkish Fed speech. Edge is now $0.51 - $0.38 = $0.13. Time remaining is 0.46.

State: s = (0.38, -0.04, 0.46, 0.0, 1.0, 0.13, 34000, 0.02)

  a = 0.00 (wait):   Q = 0.052
  a = +0.02 (medium): Q = 0.068   ← enter now
  a = +0.05 (large):  Q = 0.071   ← best action

Optimal action: buy 5% of bankroll at $0.38

The DQN learned from historical FOMC markets that entry after hawkish news overreaction, with > 10 days to resolution, yields the highest risk-adjusted return.

Example 2: PPO Execution on a Sportsbook Line

Lakers -3.5 is at -110 on BetOnline. Your regression model (from the NFL modeling guide — applicable to NBA with different features) estimates the Lakers should be -5.0. Edge exists but the line is moving toward -4.0 across the market.

The PPO agent has been trained on 50,000 simulated NBA game episodes:

State: s = (1.909, +0.015, 0.83, 0.0, 1.0, 0.034, 0.72, 0.03)
         odds   Δodds   time  pos  bank  edge  vol_norm  spread

PPO Gaussian policy output:
  μ = 0.028  (mean bet size = 2.8% of bankroll)
  σ = 0.008  (standard deviation — fairly confident)

Sampled action: a = 0.031 (3.1% of bankroll)

The agent places 3.1% of bankroll on Lakers -3.5 at -110. This is slightly below the Kelly-optimal fraction of 3.8% because the PPO agent has learned that -110 lines with small edges carry execution risk (line could move to -115 by game time, eroding edge).

Implementation

Custom Gymnasium Environment for Betting

import gymnasium as gym
import numpy as np
from gymnasium import spaces
from dataclasses import dataclass


@dataclass
class BettingConfig:
    """Configuration for the betting RL environment."""
    max_position_frac: float = 0.05  # max 5% of bankroll per action
    transaction_cost: float = 0.005  # 0.5% per trade (spread + fees)
    episode_length: int = 100  # timesteps per episode
    initial_bankroll: float = 10000.0
    odds_volatility: float = 0.02  # per-step odds std dev
    true_edge_mean: float = 0.03  # average model edge
    true_edge_std: float = 0.02  # edge uncertainty


class BettingEnv(gym.Env):
    """
    Simulated betting environment for RL training.

    State: [odds, delta_odds, time_remaining, position, bankroll_norm, model_edge, volume, spread]
    Action: continuous bet size in [-max_position_frac, +max_position_frac]
    Reward: realized PnL from position changes and settlement
    """

    metadata = {"render_modes": []}

    def __init__(self, config: BettingConfig | None = None):
        super().__init__()
        self.config = config or BettingConfig()  # avoid a shared mutable default

        # 8-dimensional continuous state
        self.observation_space = spaces.Box(
            low=np.array([0.01, -0.5, 0.0, -1.0, 0.0, -0.5, 0.0, 0.0]),
            high=np.array([0.99, 0.5, 1.0, 1.0, 5.0, 0.5, 1.0, 0.2]),
            dtype=np.float32
        )

        # Continuous bet sizing
        self.action_space = spaces.Box(
            low=-config.max_position_frac,
            high=config.max_position_frac,
            shape=(1,),
            dtype=np.float32
        )

        self.rng = np.random.default_rng()
        self.reset()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        if seed is not None:
            self.rng = np.random.default_rng(seed)

        self.step_count = 0
        self.odds = self.rng.uniform(0.20, 0.80)
        self.prev_odds = self.odds
        self.position = 0.0
        self.bankroll = self.config.initial_bankroll
        self.initial_bankroll = self.config.initial_bankroll

        # True outcome determined at episode start (agent doesn't know it)
        self.true_prob = self.rng.uniform(0.15, 0.85)
        self.outcome = 1 if self.rng.random() < self.true_prob else 0

        # Model edge: noisy estimate of true edge
        self.model_edge = (self.true_prob - self.odds) + self.rng.normal(0, 0.01)

        return self._get_obs(), {}

    def _get_obs(self) -> np.ndarray:
        time_remaining = 1.0 - (self.step_count / self.config.episode_length)
        delta_odds = self.odds - self.prev_odds
        volume = self.rng.uniform(0.1, 0.9)
        spread = self.rng.uniform(0.01, 0.05)

        return np.array([
            self.odds,
            np.clip(delta_odds, -0.5, 0.5),
            time_remaining,
            self.position / self.config.max_position_frac,  # normalized
            self.bankroll / self.initial_bankroll,
            np.clip(self.model_edge, -0.5, 0.5),
            volume,
            spread
        ], dtype=np.float32)

    def step(self, action):
        action_size = float(np.clip(action[0], -self.config.max_position_frac,
                                     self.config.max_position_frac))

        # Calculate transaction cost
        position_change = action_size - self.position
        txn_cost = abs(position_change) * self.bankroll * self.config.transaction_cost

        # Update position
        self.position = action_size

        # Evolve odds (mean-reverting toward true probability with noise)
        self.prev_odds = self.odds
        mean_reversion = 0.02 * (self.true_prob - self.odds)
        noise = self.rng.normal(0, self.config.odds_volatility)
        self.odds = np.clip(self.odds + mean_reversion + noise, 0.01, 0.99)

        # Update model edge with noise
        self.model_edge = (self.true_prob - self.odds) + self.rng.normal(0, 0.01)

        # Mark-to-market PnL from odds movement
        mtm_pnl = self.position * self.bankroll * (self.odds - self.prev_odds)

        self.step_count += 1
        terminated = self.step_count >= self.config.episode_length

        # Terminal reward: settlement
        settlement_pnl = 0.0
        if terminated:
            if self.outcome == 1:  # YES wins
                settlement_pnl = self.position * self.bankroll * (1.0 - self.odds)
            else:  # NO wins
                settlement_pnl = self.position * self.bankroll * (0.0 - self.odds)

        reward = mtm_pnl + settlement_pnl - txn_cost
        self.bankroll += reward

        return self._get_obs(), reward, terminated, False, {
            "bankroll": self.bankroll,
            "position": self.position,
            "odds": self.odds,
            "pnl": reward
        }

Training with Stable-Baselines3

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.callbacks import EvalCallback


def make_env(seed: int):
    """Factory function for vectorized environments."""
    def _init():
        env = BettingEnv(BettingConfig(
            odds_volatility=np.random.uniform(0.01, 0.04),  # domain randomization
            transaction_cost=np.random.uniform(0.003, 0.008),  # fee perturbation
            true_edge_mean=np.random.uniform(0.01, 0.05),
        ))
        env.reset(seed=seed)
        return env
    return _init


def train_betting_agent(
    n_envs: int = 8,
    total_timesteps: int = 2_000_000,
    save_path: str = "betting_ppo_agent"
) -> PPO:
    """
    Train a PPO agent for bet timing and sizing.
    Uses domain randomization for sim-to-real transfer.
    """
    # Parallel environments with domain randomization
    env = SubprocVecEnv([make_env(seed=i) for i in range(n_envs)])
    eval_env = SubprocVecEnv([make_env(seed=100 + i) for i in range(4)])

    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=3e-4,
        n_steps=2048,
        batch_size=256,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,  # entropy bonus for exploration
        vf_coef=0.5,
        max_grad_norm=0.5,
        policy_kwargs={
            "net_arch": {
                "pi": [256, 256, 128],  # actor
                "vf": [256, 256, 128],  # critic
            }
        },
        verbose=1,
    )

    eval_callback = EvalCallback(
        eval_env,
        best_model_save_path=f"./{save_path}_best/",
        log_path=f"./{save_path}_logs/",
        eval_freq=10_000,
        n_eval_episodes=50,
        deterministic=True,
    )

    model.learn(
        total_timesteps=total_timesteps,
        callback=eval_callback,
        progress_bar=True,
    )

    model.save(save_path)
    env.close()
    eval_env.close()

    return model

Sim-to-Real Deployment with Domain Randomization

import numpy as np
from dataclasses import dataclass
from typing import Optional


@dataclass
class LiveMarketState:
    """State observation from a live market."""
    current_odds: float
    odds_change: float
    time_to_event: float
    current_position: float
    bankroll: float
    model_edge: float
    volume_normalized: float
    spread: float


class RLBettingExecutor:
    """
    Deploys a trained RL agent to live betting markets.
    Wraps the trained PPO model with safety checks and logging.
    """

    def __init__(
        self,
        model_path: str,
        max_position_frac: float = 0.05,
        min_edge_threshold: float = 0.01,
        max_drawdown_halt: float = 0.15,
        sharpe_retrain_threshold: float = 0.5,
    ):
        from stable_baselines3 import PPO
        self.model = PPO.load(model_path)
        self.max_position_frac = max_position_frac
        self.min_edge_threshold = min_edge_threshold
        self.max_drawdown_halt = max_drawdown_halt
        self.sharpe_retrain_threshold = sharpe_retrain_threshold

        self.trade_log: list[dict] = []
        self.peak_bankroll: float = 0.0
        self.returns: list[float] = []

    def get_action(self, state: LiveMarketState) -> dict:
        """
        Get the RL agent's recommended action for the current market state.
        Returns action dict with safety checks applied.
        """
        # Safety check: halt if drawdown exceeds threshold
        if self.peak_bankroll > 0:
            drawdown = (self.peak_bankroll - state.bankroll) / self.peak_bankroll
            if drawdown > self.max_drawdown_halt:
                return {
                    "action": "HALT",
                    "reason": f"Drawdown {drawdown:.1%} exceeds {self.max_drawdown_halt:.1%} limit",
                    "bet_size": 0.0
                }

        self.peak_bankroll = max(self.peak_bankroll, state.bankroll)

        # Safety check: no action if edge below minimum
        if abs(state.model_edge) < self.min_edge_threshold:
            return {
                "action": "WAIT",
                "reason": f"Edge {state.model_edge:.4f} below threshold {self.min_edge_threshold}",
                "bet_size": 0.0
            }

        # Build observation vector
        obs = np.array([
            state.current_odds,
            np.clip(state.odds_change, -0.5, 0.5),
            state.time_to_event,
            state.current_position / self.max_position_frac,
            state.bankroll / (self.peak_bankroll if self.peak_bankroll > 0 else state.bankroll),
            np.clip(state.model_edge, -0.5, 0.5),
            state.volume_normalized,
            state.spread,
        ], dtype=np.float32)

        # Get deterministic action from trained policy
        action, _ = self.model.predict(obs, deterministic=True)
        bet_fraction = float(np.clip(action[0], -self.max_position_frac, self.max_position_frac))

        # Determine direction
        if bet_fraction > 0.001:
            action_type = "BUY_YES"
        elif bet_fraction < -0.001:
            action_type = "BUY_NO"
        else:
            action_type = "WAIT"

        result = {
            "action": action_type,
            "bet_fraction": bet_fraction,
            "bet_size": abs(bet_fraction) * state.bankroll,
            "raw_action": float(action[0]),
            "state_summary": {
                "odds": state.current_odds,
                "edge": state.model_edge,
                "time_remaining": state.time_to_event,
                "position": state.current_position,
            }
        }

        self.trade_log.append(result)
        return result

    def record_return(self, pnl_fraction: float) -> None:
        """Record a realized per-trade return; feeds the rolling Sharpe monitor."""
        self.returns.append(pnl_fraction)

    def rolling_sharpe(self, window: int = 30) -> Optional[float]:
        """Compute rolling Sharpe ratio over recent trades."""
        if len(self.returns) < window:
            return None
        recent = np.array(self.returns[-window:])
        if recent.std() == 0:
            return 0.0
        return float(recent.mean() / recent.std() * np.sqrt(252))

    def needs_retraining(self) -> bool:
        """Check if rolling Sharpe has degraded below threshold."""
        sharpe = self.rolling_sharpe()
        if sharpe is None:
            return False
        return sharpe < self.sharpe_retrain_threshold

Multi-Agent RL Coordination

For complex betting operations, decompose the RL problem into specialized roles. A scanner agent identifies opportunities, an executor agent times entries, and a risk manager enforces portfolio constraints. This decomposition maps naturally onto agent orchestration frameworks like CrewAI (available in the AgentBets Marketplace).

from dataclasses import dataclass
from typing import Protocol
import numpy as np


class BettingRole(Protocol):
    """Protocol for role-specialized betting agents."""
    def act(self, state: dict) -> dict: ...


@dataclass
class ScannerOutput:
    """Output from the scanner agent — identified opportunities."""
    market_id: str
    platform: str
    model_edge: float
    confidence: float
    kelly_fraction: float


@dataclass
class ExecutorOutput:
    """Output from the executor agent — timing and sizing decision."""
    action: str  # "ENTER", "WAIT", "EXIT"
    size_fraction: float
    urgency: float  # 0-1, how time-sensitive


@dataclass
class RiskOutput:
    """Output from the risk manager — position approval."""
    approved: bool
    adjusted_size: float
    reason: str


class MultiAgentBettingSystem:
    """
    Three-agent system for betting execution.
    Scanner → Executor → Risk Manager → Order.

    Each agent can be an RL policy or rule-based.
    The executor is the RL-trained component.
    """

    def __init__(self, executor_model_path: str, max_portfolio_exposure: float = 0.20):
        from stable_baselines3 import PPO
        self.executor_model = PPO.load(executor_model_path)
        self.max_portfolio_exposure = max_portfolio_exposure
        self.open_positions: dict[str, float] = {}

    def scan(self, market_data: list[dict]) -> list[ScannerOutput]:
        """
        Rule-based scanner: filter markets with model edge > 2%.
        In production, this could also be an RL agent trained to
        select which markets to even evaluate.
        """
        opportunities = []
        for m in market_data:
            edge = m["model_prob"] - m["market_prob"]
            if abs(edge) > 0.02:
                p = m["market_prob"]
                # Kelly for a binary contract at price p: f* = edge / (1 - p)
                # when buying YES (edge > 0), and |edge| / p when buying NO.
                kelly_f = abs(edge) / ((1.0 - p) if edge > 0 else p)
                opportunities.append(ScannerOutput(
                    market_id=m["id"],
                    platform=m["platform"],
                    model_edge=edge,
                    confidence=m.get("confidence", 0.5),
                    kelly_fraction=min(kelly_f, 0.05),
                ))
        return sorted(opportunities, key=lambda x: -abs(x.model_edge))

    def execute(self, opportunity: ScannerOutput, market_state: np.ndarray) -> ExecutorOutput:
        """
        RL-trained executor: decides timing and sizing.
        """
        action, _ = self.executor_model.predict(market_state, deterministic=True)
        size = float(np.clip(action[0], 0, opportunity.kelly_fraction))

        if size < 0.002:
            return ExecutorOutput(action="WAIT", size_fraction=0.0, urgency=0.0)

        return ExecutorOutput(
            action="ENTER",
            size_fraction=size,
            urgency=float(1.0 - market_state[2]),  # inverse of time_remaining
        )

    def risk_check(self, executor_output: ExecutorOutput, market_id: str) -> RiskOutput:
        """
        Rule-based risk manager: enforces portfolio constraints.
        """
        current_exposure = sum(abs(v) for v in self.open_positions.values())
        proposed_total = current_exposure + executor_output.size_fraction

        if proposed_total > self.max_portfolio_exposure:
            adjusted = max(0, self.max_portfolio_exposure - current_exposure)
            return RiskOutput(
                approved=adjusted > 0.002,
                adjusted_size=adjusted,
                reason=f"Portfolio cap: {current_exposure:.1%} current + {adjusted:.1%} approved of {executor_output.size_fraction:.1%} requested"
            )

        return RiskOutput(
            approved=True,
            adjusted_size=executor_output.size_fraction,
            reason="Within portfolio limits"
        )

Limitations and Edge Cases

Sample inefficiency in live markets. RL agents require thousands of episodes to learn effective policies. In simulation, this is free. In live markets, every exploratory action costs real money. A single exploration episode on BetOnline with a $1,000 bankroll and 50 bets could cost $50-200 in losses before the agent learns anything useful. Always pre-train in simulation and deploy with minimal exploration (ε ≤ 0.02).

Reward sparsity. Most betting episodes end with one terminal reward at event resolution. The agent may take 100 actions before learning whether its strategy worked. Mark-to-market intermediate rewards help but introduce noise. Hindsight experience replay (HER) can reframe failed episodes as “what if the objective had been different” — but this is tricky to adapt to betting.

Overfitting to simulation dynamics. If the simulator’s odds evolution model is wrong, the RL policy is wrong. Domain randomization mitigates this, but there’s no substitute for validating on out-of-sample historical data before live deployment. Test on 6 months of held-out data minimum.

Partial observability. The agent doesn’t see other participants’ orders, private information, or intent. The true state is partially hidden. Formally, this makes the problem a POMDP (Partially Observable MDP). Recurrent architectures (LSTM layers in the policy network) help by maintaining implicit beliefs over hidden state, but they don’t solve the fundamental information asymmetry.

Action latency. In live markets, there’s a delay between the agent observing state, computing an action, and the order executing. Polymarket CLOB orders may fill 200-500ms after submission. The state may have changed. Train with randomized latency (50-500ms) in simulation so the policy is robust to execution delays.
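One way to inject that latency in training is a wrapper that queues actions and applies them a random number of steps late. This sketch (class name is ours) uses step counts as a stand-in for 50-500ms at your simulator's tick rate:

```python
import random
from collections import deque

# Latency randomization sketch: actions are queued and applied 1-3 steps
# late, so the policy trains against stale execution.

class LatencyWrapper:
    """Wraps any env with .step(action); delays each action by a random lag."""

    def __init__(self, env, min_lag=1, max_lag=3, noop=0.0):
        self.env = env
        self.min_lag, self.max_lag = min_lag, max_lag
        self.queue = deque()          # (apply_at_step, action) pairs, in order
        self.t = 0
        self.noop = noop

    def step(self, action):
        lag = random.randint(self.min_lag, self.max_lag)
        self.queue.append((self.t + lag, action))
        # apply the oldest action whose latency has elapsed, else a no-op
        applied = self.noop
        if self.queue and self.queue[0][0] <= self.t:
            _, applied = self.queue.popleft()
        self.t += 1
        return self.env.step(applied)
```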

Catastrophic forgetting during retraining. When you retrain on new data to adapt to non-stationarity, the agent may forget strategies that still work. Use elastic weight consolidation (EWC) or progressive neural networks to retain useful past knowledge.

FAQ

How do you use reinforcement learning for sports betting?

Frame betting as a Markov Decision Process where the state includes current odds, time to event, bankroll, and model edge. The agent learns a policy mapping states to actions (bet size or wait) by maximizing cumulative profit through trial and error in simulation. Train on historical data using DQN or PPO, then deploy to live markets with conservative exploration.

What is the MDP formulation for a betting agent?

The state is s = (odds, time_to_event, position, bankroll, model_edge). Actions are continuous bet sizes from -max_bet to +max_bet, plus a wait action. The reward is realized PnL minus opportunity cost. The transition function encodes how odds evolve stochastically over time. The discount factor γ is typically 0.99 for event-horizon problems.

Why is non-stationarity a problem for RL betting agents?

Betting markets are non-stationary — the data distribution shifts as market participants adapt, seasons change, and information regimes evolve. A Q-function trained on last season’s NFL data may be worthless this season. Agents must use online learning with sliding windows, regime detection, and periodic retraining triggered by degrading rolling Sharpe ratios.

How does sim-to-real transfer work for betting agents?

Train the RL agent in a simulated environment built from historical orderbook snapshots with domain randomization — randomized latency, fee perturbation, and slippage noise. This forces the policy to be robust to execution uncertainty. Deploy with conservative bet sizes initially, then scale as live performance validates the simulation.

How do you combine RL with Kelly Criterion for bet sizing?

Use a two-stage pipeline: the model-based stage identifies +EV opportunities and computes Kelly-optimal fractions, then the RL execution stage learns when to enter (timing), how to split the Kelly allocation across multiple entries (execution), and when to exit early. RL optimizes the execution envelope around the Kelly target.

What’s Next

RL gives your agent a learned execution policy. The next step is building the features that feed into it.

  • Feature engineering: Feature Engineering for Sports Prediction Models covers how to construct the state representation — which features matter, how to encode them, and how to avoid lookahead bias.
  • Calibration: An RL agent is only as good as its model_edge input. Calibration and Model Evaluation for Agents shows how to verify that your upstream probability estimates are accurate.
  • Pipeline integration: The Odds API Edge Detection Pipeline connects live odds feeds to model inference to RL execution — the full end-to-end stack.
  • Bankroll risk: RL agents can blow up bankrolls if risk constraints aren’t enforced. Drawdown Math and Variance in Betting covers the math of ruin probability and maximum drawdown bounds.
  • Betting bots in production: The Betting Bots hub covers deployment patterns, monitoring, and operational concerns for live autonomous agents.