At -110 odds with a true 54% win rate, you need roughly 5,900 bets to confirm your edge at 95% confidence with 80% power. The required sample size formula is n = (z_alpha + z_beta)^2 * p0(1-p0) / (p - p0)^2. Most bettors never reach sufficient sample sizes, which is why most “winning systems” are noise.

Why This Matters for Agents

Every autonomous betting agent faces the same question after its first thousand bets: is my edge real, or am I just lucky?

This is Layer 4 — Intelligence. Statistical significance testing is the validation layer that sits between an agent’s model outputs and its capital allocation decisions. An agent pulls odds from The Odds API pipeline, runs its prediction model, and generates a track record. Before scaling up bet sizes via Kelly Criterion or deploying real capital through Layer 2 wallet infrastructure, the agent must answer a binary question: does the observed win rate exceed break-even by a statistically significant margin? Without this gate, agents scale noise — burning bankroll on edges that never existed. The Agent Betting Stack places significance testing as a mandatory checkpoint between backtesting and live deployment.

The Math

The Hypothesis Testing Framework

Frame edge detection as a classical hypothesis test.

Null hypothesis H0: The agent has no edge. Its true win rate equals the break-even rate p0.

Alternative hypothesis H1: The agent has an edge. Its true win rate p exceeds p0.

For standard -110 American odds (decimal 1.909), the break-even win rate is:

p0 = 110 / (110 + 100) = 110 / 210 = 0.5238 (52.38%)

An agent that wins 55 out of 100 bets at -110 has a 55% observed win rate. The question: is 55% sufficiently above 52.38% to reject H0?
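The break-even arithmetic generalizes to any American odds line. A minimal sketch (the helper name is illustrative, not part of the implementation later in this article):

```python
def break_even_rate(american_odds: float) -> float:
    """Win probability needed to break even at the given American odds."""
    if american_odds < 0:
        # Favorite: risk |odds| to win 100
        return -american_odds / (-american_odds + 100)
    # Underdog: risk 100 to win `odds`
    return 100 / (american_odds + 100)

print(break_even_rate(-110))  # 0.5238...
print(break_even_rate(150))   # 0.4
```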

The Z-Test for Proportions

Under H0, the number of wins in N bets follows a Binomial(N, p0) distribution. For large N, the Central Limit Theorem gives a normal approximation:

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / N)

where p_hat = W/N is the observed win rate, p0 is the break-even rate, and N is the number of bets.

For 55 wins in 100 bets at -110 odds:

z = (0.55 - 0.5238) / sqrt(0.5238 * 0.4762 / 100)
z = 0.0262 / sqrt(0.002494)
z = 0.0262 / 0.04994
z = 0.524

The p-value for z = 0.524 (one-tailed) is 0.300. Not significant. A 55% win rate over 100 bets is completely consistent with luck at -110 odds. The agent has no evidence of edge.
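The calculation above can be reproduced in a few lines (assuming scipy is available; `norm.sf` is the upper-tail probability, equivalent to 1 - CDF):

```python
import math
from scipy import stats

p0 = 110 / 210            # break-even at -110
p_hat, n = 0.55, 100      # 55 wins in 100 bets
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se
p_value = stats.norm.sf(z)  # one-tailed upper tail
print(f"z = {z:.3f}, p = {p_value:.3f}")  # z = 0.524, p = 0.300
```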

What a p-Value Actually Means

A p-value of 0.03 means: if the agent had no edge at all, there is a 3% chance of observing results this extreme or more extreme. It is not the probability that H0 is true. It is not the probability the agent has no edge. It is P(data | H0), not P(H0 | data). Confusing these two is the single most common statistical error in betting analysis.

Reject H0 when p < alpha (typically alpha = 0.05). This means you accept a 5% false positive rate: when no edge exists, 1 test in 20 will incorrectly conclude that one does.

Required Sample Size

The sample size needed to detect a true edge p at significance level alpha with power (1 - beta) is:

n = (z_alpha + z_beta)^2 * p0 * (1 - p0) / (p - p0)^2

where z_alpha is the critical value for significance (1.645 for alpha = 0.05 one-tailed), z_beta is the critical value for power (0.842 for 80% power), p is the true win rate, and p0 is the break-even rate.

True Win Rate   Edge Over Break-Even   Required N (95% conf, 80% power)
53% at -110     0.62%                  ~40,242
54% at -110     1.62%                  ~5,883
55% at -110     2.62%                  ~2,249
56% at -110     3.62%                  ~1,178
57% at -110     4.62%                  ~723
60% at -110     7.62%                  ~266

The message is stark. A realistic sharp edge of 54% at -110 requires nearly 6,000 bets to confirm at 80% power. Most recreational bettors place fewer than 500 bets per year. Most “systems” never accumulate enough data to distinguish signal from noise.

The Multiple Comparisons Problem

An agent testing 20 different strategies at alpha = 0.05 expects one false positive even if none of the strategies have real edge:

P(at least one false positive) = 1 - (1 - 0.05)^20 = 1 - 0.95^20 = 0.642

A 64.2% chance of at least one false positive across 20 tests. This is the look-elsewhere effect, and it destroys naive backtesting. If an agent scans across NFL, NBA, MLB, NHL, and soccer, testing four strategies per sport, it will almost certainly find something that “works” — by pure chance.
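The look-elsewhere arithmetic is worth checking directly:

```python
m, alpha = 20, 0.05
p_any_false_positive = 1 - (1 - alpha) ** m
print(round(p_any_false_positive, 3))  # 0.642

# At 14 tests, a false positive already becomes more likely than not
print(1 - (1 - alpha) ** 14 > 0.5)  # True
```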

Bonferroni Correction

The simplest fix: divide alpha by the number of tests.

alpha_adjusted = alpha / m

For 20 tests at alpha = 0.05:

alpha_adjusted = 0.05 / 20 = 0.0025

Now each individual test requires p < 0.0025 to be declared significant. This is conservative — it controls the family-wise error rate (FWER), the probability of even one false positive. The cost: reduced power. Real edges with marginal significance get missed.

Benjamini-Hochberg False Discovery Rate (FDR)

A less conservative alternative that controls the expected proportion of false discoveries among all discoveries:

  1. Rank all m p-values from smallest to largest: p(1) <= p(2) <= … <= p(m)
  2. Find the largest k such that p(k) <= (k/m) * alpha
  3. Reject all hypotheses with p-values <= p(k)

FDR control at q = 0.05 means: among all strategies you declare significant, you expect at most 5% to be false discoveries. This is less strict than Bonferroni but more realistic for multi-strategy agents.

Confidence Intervals for Win Rate

A point estimate of “55% win rate” is incomplete without uncertainty bounds. The Wilson score interval is preferred over the Wald interval for proportions because it doesn’t produce impossible values (negative or > 1) at extreme rates:

Wilson interval: (p_hat + z^2/(2n) +/- z * sqrt(p_hat*(1-p_hat)/n + z^2/(4n^2))) / (1 + z^2/n)

where z = 1.96 for 95% confidence.

For 550 wins in 1,000 bets:

p_hat = 0.55
Lower = 0.519
Upper = 0.581

The 95% CI is [51.9%, 58.1%]. Since this interval includes 52.38% (break-even at -110), you cannot reject H0 at the 95% level. The agent might have edge, or it might be breaking even.
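A quick check of the interval above, implemented directly from the Wilson formula (this duplicates the fuller implementation later in the article):

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

lo, hi = wilson_interval(550, 1000)
print(f"[{lo:.3f}, {hi:.3f}]")  # [0.519, 0.581]
```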

Statistical Power

Power = P(reject H0 | H1 is true). It’s the probability of detecting a real edge when it exists.

Power = P(z > z_alpha - (p - p0) / sqrt(p0 * (1 - p0) / N))

Sample Size   Power at 54% true rate (-110)   Power at 56% true rate (-110)
100           9.3%                            17.7%
500           17.8%                           49.0%
1,000         26.7%                           74.2%
2,500         49.0%                           97.7%
5,000         74.2%                           >99.9%

At 100 bets, an agent with a real 54% edge has only about a 9% chance of detecting it. Even at 1,000 bets, the power is only about 27%. This is why premature evaluation kills profitable systems — the agent shuts down a winning strategy because it hasn’t gathered enough data to prove it works.

Bayesian Edge Detection

The frequentist framework answers “how surprising is my data if I have no edge?” The Bayesian framework answers the question you actually care about: “given my data, what is the probability I have edge?”

Use the Beta-Binomial conjugate model:

Prior: Beta(alpha_prior, beta_prior). Use Beta(1, 1) (uniform) if you have no prior information, or Beta(52.38, 47.62) to encode the prior belief that most bettors are break-even.

Likelihood: Binomial(N, p) with W wins.

Posterior: Beta(alpha_prior + W, beta_prior + N - W).

The quantity of interest: P(p > p0 | data) — the posterior probability that the true win rate exceeds break-even.

P(p > 0.5238 | 550 wins in 1000 bets) = 1 - BetaCDF(0.5238; 551, 451)

With a uniform prior, this equals approximately 0.952 — a 95.2% posterior probability of having edge. The Bayesian approach provides a direct probability statement about edge existence, which is far more useful for an agent’s decision logic than a p-value.
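The posterior calculation is one line with scipy's beta distribution (`sf` is 1 - CDF; a sketch, duplicating the fuller Bayesian function later in the article):

```python
from scipy import stats

wins, n = 550, 1000
p0 = 110 / 210                                   # break-even at -110
posterior = stats.beta(1 + wins, 1 + n - wins)   # Beta(551, 451), uniform prior
p_edge = posterior.sf(p0)                        # P(true win rate > break-even | data)
print(f"P(edge) = {p_edge:.3f}")                 # roughly 0.95
```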

Survivorship Bias

Survivorship bias is the deadliest statistical trap in betting. You only observe systems that survived long enough to report results. The graveyard of failed systems is invisible.

If 1,000 people each flip a fair coin 10 times:

  • ~1 person will get 10 heads (appears to have a “system”)
  • ~11 people will get 9+ heads
  • ~55 people will get 8+ heads

Those ~55 people share their “winning method” on forums. The other ~945 stay silent. An agent crawling betting forums for strategies inherits this bias directly.
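The expected counts follow directly from binomial tail probabilities:

```python
from scipy import stats

n_flippers, flips = 1000, 10
for k in (10, 9, 8):
    tail = stats.binom.sf(k - 1, flips, 0.5)  # P(at least k heads in 10 fair flips)
    print(f"{k}+ heads: P = {tail:.4f}, expected ~{n_flippers * tail:.0f} of 1,000")
```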

The correction: evaluate any discovered strategy on out-of-sample data only. Split the historical record. Use the first half for discovery and the second half for validation. Never use in-sample performance as evidence of edge.

Worked Examples

Example 1: NFL Season ATS Record

A sharp betting agent goes 145-115 ATS (against the spread) over an NFL season at -110 odds across BetOnline and BookMaker.

N = 260, W = 145, p_hat = 145/260 = 0.5577
p0 = 0.5238 (break-even at -110)
z = (0.5577 - 0.5238) / sqrt(0.5238 * 0.4762 / 260) = 0.0339 / 0.03098 = 1.094
p-value = 0.137 (one-tailed)

Not significant at alpha = 0.05. A 55.77% ATS rate over 260 games is a profitable season (roughly +16.8 units at flat one-unit stakes: 145 × 10/11 − 115) but provides no statistical evidence of persistent edge. The 95% Wilson CI for win rate is [49.7%, 61.7%] — wide enough to include break-even.

Example 2: Polymarket Election Model — 2,000 Resolved Markets

An agent trading YES/NO contracts on Polymarket resolves 2,000 positions over 18 months. It bought YES contracts at an average price of $0.58 and the realized win rate is 63%.

The break-even rate for contracts bought at average price $0.58 (with 2% fee on profit):

break-even = 0.58 / (1 - 0.02 * (1 - 0.58)) = 0.58 / 0.9916 = 0.5849
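The formula is easier to see as an expected-value balance: a win nets (1 − price) minus the 2% fee on that profit, while a loss costs the price paid. A quick sketch:

```python
price, fee = 0.58, 0.02
net_win = (1 - price) * (1 - fee)       # profit per $1 contract after fee: 0.42 * 0.98
break_even = price / (price + net_win)  # solves p * net_win = (1 - p) * price
print(round(break_even, 4))  # 0.5849
```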

Test: is 63% significantly above 58.49%?

z = (0.63 - 0.5849) / sqrt(0.5849 * 0.4151 / 2000) = 0.0451 / 0.01102 = 4.09
p-value = 0.0000217 (one-tailed)

Highly significant: p < 0.001. The agent has strong statistical evidence of genuine edge in prediction market trading. The 95% Wilson CI for win rate is [60.9%, 65.1%] — entirely above break-even.

Example 3: Multiple Strategy Scan

An agent tests 15 MLB strategies across team totals, run lines, and first-five-inning lines using historical odds from The Odds API. Results for the top 3:

Strategy             N     Win Rate   Raw p-value   Bonferroni p   BH Threshold
Over 8.5 road dogs   312   58.8%      0.012         0.180          0.003
Under F5 AL West     198   59.8%      0.018         0.270          0.007
RL favorites -1.5    445   56.5%      0.041         0.615          0.010

Raw: all three strategies appear significant at alpha = 0.05. Bonferroni (alpha/15 = 0.0033): none survive. BH at q = 0.05: none survive (the smallest p-value, 0.012, exceeds its rank-1 threshold of 0.05 × 1/15 = 0.0033). After correction, zero strategies show significant edge. The agent avoids deploying capital on noise.

Implementation

import numpy as np
from scipy import stats
from dataclasses import dataclass


@dataclass
class SignificanceResult:
    """Result of a betting edge significance test."""
    n_bets: int
    wins: int
    win_rate: float
    break_even_rate: float
    z_statistic: float
    p_value: float
    ci_lower: float
    ci_upper: float
    is_significant: bool
    alpha: float


def test_betting_edge(
    wins: int,
    n_bets: int,
    break_even_rate: float = 0.5238,
    alpha: float = 0.05
) -> SignificanceResult:
    """
    One-sided z-test for betting edge.

    Tests H0: true win rate = break_even_rate
    vs.   H1: true win rate > break_even_rate

    Args:
        wins: Number of winning bets
        n_bets: Total number of bets
        break_even_rate: Break-even win rate (0.5238 for -110 odds)
        alpha: Significance level

    Returns:
        SignificanceResult with test statistics and confidence interval
    """
    p_hat = wins / n_bets
    se = np.sqrt(break_even_rate * (1 - break_even_rate) / n_bets)
    z = (p_hat - break_even_rate) / se
    p_value = 1 - stats.norm.cdf(z)

    # Wilson score interval
    z_ci = stats.norm.ppf(1 - alpha / 2)
    denominator = 1 + z_ci**2 / n_bets
    center = (p_hat + z_ci**2 / (2 * n_bets)) / denominator
    margin = (z_ci / denominator) * np.sqrt(
        p_hat * (1 - p_hat) / n_bets + z_ci**2 / (4 * n_bets**2)
    )
    ci_lower = center - margin
    ci_upper = center + margin

    return SignificanceResult(
        n_bets=n_bets,
        wins=wins,
        win_rate=p_hat,
        break_even_rate=break_even_rate,
        z_statistic=z,
        p_value=p_value,
        ci_lower=ci_lower,
        ci_upper=ci_upper,
        is_significant=p_value < alpha,
        alpha=alpha,
    )


def required_sample_size(
    true_rate: float,
    break_even_rate: float = 0.5238,
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Minimum bets needed to detect a true win rate at given confidence and power.

    Args:
        true_rate: Hypothesized true win rate (e.g., 0.54)
        break_even_rate: Break-even rate (0.5238 for -110)
        alpha: Significance level
        power: Desired statistical power (1 - beta)

    Returns:
        Required number of bets (rounded up)
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    effect = true_rate - break_even_rate
    variance = break_even_rate * (1 - break_even_rate)

    n = ((z_alpha + z_beta) ** 2 * variance) / (effect ** 2)
    return int(np.ceil(n))


def power_analysis(
    n_bets: int,
    true_rate: float,
    break_even_rate: float = 0.5238,
    alpha: float = 0.05
) -> float:
    """
    Calculate statistical power for a given sample size and true rate.

    Args:
        n_bets: Number of bets in the sample
        true_rate: Hypothesized true win rate
        break_even_rate: Break-even rate
        alpha: Significance level

    Returns:
        Power (probability of detecting the edge)
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    se_null = np.sqrt(break_even_rate * (1 - break_even_rate) / n_bets)
    se_alt = np.sqrt(true_rate * (1 - true_rate) / n_bets)

    threshold = break_even_rate + z_alpha * se_null
    z_power = (true_rate - threshold) / se_alt
    return stats.norm.cdf(z_power)


def bonferroni_correction(
    p_values: list[float],
    alpha: float = 0.05
) -> list[dict]:
    """
    Apply Bonferroni correction for multiple comparisons.

    Args:
        p_values: List of raw p-values from individual tests
        alpha: Family-wise significance level

    Returns:
        List of dicts with raw p-value, adjusted p-value, and significance
    """
    m = len(p_values)
    adjusted_alpha = alpha / m
    results = []
    for i, p in enumerate(p_values):
        adjusted_p = min(p * m, 1.0)
        results.append({
            "test_index": i,
            "raw_p": p,
            "adjusted_p": adjusted_p,
            "significant": p < adjusted_alpha,
            "adjusted_alpha": adjusted_alpha,
        })
    return results


def benjamini_hochberg(
    p_values: list[float],
    q: float = 0.05
) -> list[dict]:
    """
    Benjamini-Hochberg procedure for FDR control.

    Args:
        p_values: List of raw p-values
        q: Target false discovery rate

    Returns:
        List of dicts with rank, threshold, and rejection decision
    """
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda x: x[1])

    max_k = -1
    for rank, (orig_idx, p) in enumerate(indexed, 1):
        threshold = (rank / m) * q
        if p <= threshold:
            max_k = rank

    results = []
    for rank, (orig_idx, p) in enumerate(indexed, 1):
        results.append({
            "original_index": orig_idx,
            "rank": rank,
            "raw_p": p,
            "bh_threshold": (rank / m) * q,
            "rejected": rank <= max_k,
        })
    return sorted(results, key=lambda x: x["original_index"])


def bayesian_edge_probability(
    wins: int,
    n_bets: int,
    break_even_rate: float = 0.5238,
    prior_alpha: float = 1.0,
    prior_beta: float = 1.0
) -> dict:
    """
    Bayesian posterior probability of having edge using Beta-Binomial model.

    Args:
        wins: Number of winning bets
        n_bets: Total bets
        break_even_rate: Break-even win rate
        prior_alpha: Beta prior alpha (1.0 = uniform)
        prior_beta: Beta prior beta (1.0 = uniform)

    Returns:
        Dict with posterior parameters, P(edge), and credible interval
    """
    post_alpha = prior_alpha + wins
    post_beta = prior_beta + (n_bets - wins)

    p_edge = 1 - stats.beta.cdf(break_even_rate, post_alpha, post_beta)

    ci_lower = stats.beta.ppf(0.025, post_alpha, post_beta)
    ci_upper = stats.beta.ppf(0.975, post_alpha, post_beta)
    posterior_mean = post_alpha / (post_alpha + post_beta)

    return {
        "posterior_alpha": post_alpha,
        "posterior_beta": post_beta,
        "posterior_mean": posterior_mean,
        "p_edge": p_edge,
        "credible_interval_95": (ci_lower, ci_upper),
    }


# --- Demo: Run all analyses on a sample record ---
if __name__ == "__main__":
    # NFL season: 145-115 ATS at -110
    print("=" * 60)
    print("EXAMPLE 1: NFL Season 145-115 ATS at -110")
    print("=" * 60)
    result = test_betting_edge(wins=145, n_bets=260)
    print(f"Win rate:     {result.win_rate:.4f} ({result.win_rate:.1%})")
    print(f"Break-even:   {result.break_even_rate:.4f}")
    print(f"Z-statistic:  {result.z_statistic:.3f}")
    print(f"P-value:      {result.p_value:.4f}")
    print(f"95% CI:       [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
    print(f"Significant:  {result.is_significant}")

    bayes = bayesian_edge_probability(wins=145, n_bets=260)
    print(f"Bayesian P(edge): {bayes['p_edge']:.3f}")
    print(f"95% Credible:     [{bayes['credible_interval_95'][0]:.3f}, "
          f"{bayes['credible_interval_95'][1]:.3f}]")

    print(f"\n{'=' * 60}")
    print("SAMPLE SIZE TABLE")
    print("=" * 60)
    for rate in [0.53, 0.54, 0.55, 0.56, 0.57, 0.60]:
        n = required_sample_size(true_rate=rate)
        pwr = power_analysis(n_bets=n, true_rate=rate)
        print(f"True rate {rate:.0%}: need {n:>6,} bets (power = {pwr:.1%})")

    print(f"\n{'=' * 60}")
    print("MULTIPLE COMPARISONS: 15 MLB Strategies")
    print("=" * 60)
    raw_pvals = [0.012, 0.018, 0.041, 0.082, 0.11, 0.15, 0.22,
                 0.31, 0.38, 0.42, 0.55, 0.61, 0.72, 0.84, 0.93]

    bonf = bonferroni_correction(raw_pvals)
    print("\nBonferroni correction:")
    for r in bonf[:5]:
        print(f"  Test {r['test_index']}: p={r['raw_p']:.3f} -> "
              f"adj_p={r['adjusted_p']:.3f} "
              f"{'SIGNIFICANT' if r['significant'] else 'not sig'}")

    bh = benjamini_hochberg(raw_pvals)
    print("\nBenjamini-Hochberg (FDR = 0.05):")
    for r in bh[:5]:
        print(f"  Test {r['original_index']}: p={r['raw_p']:.3f} "
              f"threshold={r['bh_threshold']:.4f} "
              f"{'REJECTED' if r['rejected'] else 'not rejected'}")

Limitations and Edge Cases

Small sample distortion. The normal approximation for the z-test breaks down below ~30 bets. For small samples, use the exact binomial test: scipy.stats.binomtest(wins, n_bets, break_even_rate, alternative='greater') (the older scipy.stats.binom_test was removed in SciPy 1.12). The z-test is faster but unreliable for N < 50.
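A sketch of the exact-test fallback on a small record (18 wins in 25 bets is an illustrative sample; assumes SciPy 1.7+ for binomtest):

```python
from scipy import stats

p0 = 110 / 210  # break-even at -110
result = stats.binomtest(18, 25, p0, alternative="greater")
print(f"exact one-sided p = {result.pvalue:.4f}")

# The exact p-value is just the binomial upper tail
assert abs(result.pvalue - stats.binom.sf(17, 25, p0)) < 1e-12
```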

Non-independent bets. The z-test assumes each bet is independent. If an agent bets the same NBA game on the spread and the total, those bets are correlated. Treating correlated bets as independent overstates the effective sample size, making edge appear more significant than it is. Cluster-robust standard errors or a GEE model is required when bets share underlying events.

Changing odds and edge. The formulas above assume a fixed break-even rate. In practice, agents bet at varying odds (-105 to -120 across different sportsbooks). Use ROI-based significance testing instead: compute the t-test on per-bet returns rather than the z-test on win rate. This handles heterogeneous odds naturally.
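A sketch of the ROI-based test on simulated data (the payouts, win rate, and seed here are made up for illustration; assumes SciPy 1.6+ for the `alternative` keyword):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1000
# Mixed odds: profit per unit staked is 100/110 at -110 or 100/105 at -105; a loss is -1
payouts = rng.choice([100 / 110, 100 / 105], size=n)
wins = rng.random(n) < 0.55
returns = np.where(wins, payouts, -1.0)

# One-sided t-test on per-bet returns: H0 mean return = 0, H1 mean return > 0
result = stats.ttest_1samp(returns, popmean=0.0, alternative="greater")
print(f"mean ROI = {returns.mean():+.4f}, one-sided p = {result.pvalue:.4f}")
```

This handles heterogeneous odds because each bet contributes its actual return, not a binary win/loss.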

The prior matters in Bayesian analysis. A uniform Beta(1,1) prior says “all win rates are equally likely before seeing data.” A more realistic prior for sports betting is Beta(52, 48) — encoding the belief that most bettors cluster around break-even. The choice of prior matters most when sample sizes are small (< 200 bets). At 2,000+ bets, the posterior is dominated by the data regardless of prior.

Significance does not equal profitability. A strategy can be statistically significant but not worth trading if the edge is smaller than transaction costs, opportunity costs, or the time value of capital locked in pending bets. Always compute net ROI after all fees before declaring a strategy deployable. Use the AgentBets Vig Index to identify the lowest-vig platforms where marginal edges remain profitable.

FAQ

How many bets do you need to prove a sports betting edge is statistically significant?

At standard -110 odds with a true 54% win rate (a 1.62% edge over break-even), you need approximately 5,900 bets for 95% confidence and 80% power. Required sample size grows with the inverse square of the edge: a 53% true rate needs roughly 40,000 bets. Most bettors never reach these sample sizes, which is why most claimed edges are indistinguishable from luck.

What is a p-value in sports betting and how do you interpret it?

A p-value is the probability of observing results at least as extreme as yours if you had no edge at all. A p-value of 0.03 means there is a 3% chance of seeing your record (or better) by pure luck. It is not the probability that you have no edge. Rejecting the null at p < 0.05 means you have moderate evidence of edge, but it does not guarantee profitability.

Why do most sports betting systems fail even with a winning record?

The multiple comparisons problem. If you test 20 different strategies at the 0.05 significance level, you expect one false positive by chance alone. Survivorship bias compounds this: you only hear about the systems that happened to win, never the 19 that lost. Apply Bonferroni correction (divide alpha by the number of tests) or FDR control to guard against false discoveries.

What is the Bayesian approach to detecting a sports betting edge?

Instead of p-values, use a Beta-Binomial model. Start with a Beta(1,1) prior (uniform), observe W wins in N bets, and compute the posterior Beta(1+W, 1+N-W). The quantity P(p > break_even | data) gives the direct probability that your true win rate exceeds break-even. This is often more intuitive than frequentist hypothesis testing and naturally incorporates prior beliefs about edge rarity.

How does statistical significance connect to calibration and model evaluation for betting agents?

Statistical significance tells you whether observed edge is real. Calibration tells you whether your probability estimates match reality. Both are required: a significant but miscalibrated model will size bets incorrectly via Kelly Criterion. See the calibration and model evaluation guide for forecast accuracy assessment that complements significance testing.

What’s Next

Statistical significance is the validation gate. Once an agent confirms edge is real, it needs to measure the quality of its probability estimates and optimize its feature set for maximum predictive power.