At -110 odds with a true 54% win rate, you need roughly 5,900 bets to confirm your edge at 95% confidence with 80% power. The required sample size formula is n = (z_alpha + z_beta)^2 * p0(1 - p0) / (p - p0)^2. Most bettors never reach sufficient sample sizes, which is why most “winning systems” are noise.
Why This Matters for Agents
Every autonomous betting agent faces the same question after its first thousand bets: is my edge real, or am I just lucky?
This is Layer 4 — Intelligence. Statistical significance testing is the validation layer that sits between an agent’s model outputs and its capital allocation decisions. An agent pulls odds from The Odds API pipeline, runs its prediction model, and generates a track record. Before scaling up bet sizes via Kelly Criterion or deploying real capital through Layer 2 wallet infrastructure, the agent must answer a binary question: does the observed win rate exceed break-even by a statistically significant margin? Without this gate, agents scale noise — burning bankroll on edges that never existed. The Agent Betting Stack places significance testing as a mandatory checkpoint between backtesting and live deployment.
The Math
The Hypothesis Testing Framework
Frame edge detection as a classical hypothesis test.
Null hypothesis H0: The agent has no edge. Its true win rate equals the break-even rate p0.
Alternative hypothesis H1: The agent has an edge. Its true win rate p exceeds p0.
For standard -110 American odds (decimal 1.909), the break-even win rate is:
p0 = 110 / (110 + 100) = 110 / 210 = 0.5238 (52.38%)
An agent that wins 55 out of 100 bets at -110 has a 55% observed win rate. The question: is 55% sufficiently above 52.38% to reject H0?
The Z-Test for Proportions
Under H0, the number of wins in N bets follows a Binomial(N, p0) distribution. For large N, the Central Limit Theorem gives a normal approximation:
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / N)
where p_hat = W/N is the observed win rate, p0 is the break-even rate, and N is the number of bets.
For 55 wins in 100 bets at -110 odds:
z = (0.55 - 0.5238) / sqrt(0.5238 * 0.4762 / 100)
z = 0.0262 / sqrt(0.002494)
z = 0.0262 / 0.04994
z = 0.524
The p-value for z = 0.524 (one-tailed) is 0.300. Not significant. A 55% win rate over 100 bets is completely consistent with luck at -110 odds. The agent has no evidence of edge.
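The calculation above takes only a few lines. This is a minimal sketch assuming scipy is installed; the variable names are illustrative:

```python
from math import sqrt
from scipy import stats

p0, n, w = 0.5238, 100, 55          # break-even at -110; the 55-of-100 record
p_hat = w / n
se = sqrt(p0 * (1 - p0) / n)        # standard error under H0
z = (p_hat - p0) / se
p_value = 1 - stats.norm.cdf(z)     # one-tailed: z ≈ 0.52, p ≈ 0.30
```

With p ≈ 0.30, the record is unremarkable under the no-edge hypothesis.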
What a p-Value Actually Means
A p-value of 0.03 means: if the agent had no edge at all, there is a 3% chance of observing results this extreme or more extreme. It is not the probability that H0 is true. It is not the probability the agent has no edge. It is P(data | H0), not P(H0 | data). Confusing these two is the single most common statistical error in betting analysis.
Reject H0 when p < alpha (typically alpha = 0.05). This means you accept a 5% false positive rate — 1 in 20 times, you will incorrectly conclude an edge exists when it doesn’t.
Required Sample Size
The sample size needed to detect a true edge p at significance level alpha with power (1 - beta) is:
n = (z_alpha + z_beta)^2 * p0 * (1 - p0) / (p - p0)^2
where z_alpha is the critical value for significance (1.645 for alpha = 0.05 one-tailed), z_beta is the critical value for power (0.842 for 80% power), p is the true win rate, and p0 is the break-even rate.
| True Win Rate | Edge Over Break-Even | Required N (95% conf, 80% power) |
|---|---|---|
| 53% at -110 | 0.62% | ~40,119 |
| 54% at -110 | 1.62% | ~5,877 |
| 55% at -110 | 2.62% | ~2,247 |
| 56% at -110 | 3.62% | ~1,177 |
| 57% at -110 | 4.62% | ~723 |
| 60% at -110 | 7.62% | ~266 |
The message is stark. A realistic sharp edge of 54% at -110 requires nearly 6,000 bets to confirm. Most recreational bettors place fewer than 500 bets per year. Most “systems” never accumulate enough data to distinguish signal from noise.
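The table can be reproduced directly from the formula. A sketch, assuming scipy; `n_required` is an illustrative helper name:

```python
from math import ceil
from scipy import stats

def n_required(p_true, p0=0.5238, alpha=0.05, power=0.80):
    z_a = stats.norm.ppf(1 - alpha)    # 1.645 for one-tailed alpha = 0.05
    z_b = stats.norm.ppf(power)        # 0.842 for 80% power
    return ceil((z_a + z_b) ** 2 * p0 * (1 - p0) / (p_true - p0) ** 2)
```

Note the inverse-square dependence on the edge: halving the edge roughly quadruples the required number of bets.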
The Multiple Comparisons Problem
An agent testing 20 different strategies at alpha = 0.05 expects one false positive even if none of the strategies have real edge:
P(at least one false positive) = 1 - (1 - 0.05)^20 = 1 - 0.95^20 = 0.642
A 64.2% chance of at least one false positive across 20 tests. This is the look-elsewhere effect, and it destroys naive backtesting. If an agent scans across NFL, NBA, MLB, NHL, and soccer, testing four strategies per sport, it will almost certainly find something that “works” — by pure chance.
Bonferroni Correction
The simplest fix: divide alpha by the number of tests.
alpha_adjusted = alpha / m
For 20 tests at alpha = 0.05:
alpha_adjusted = 0.05 / 20 = 0.0025
Now each individual test requires p < 0.0025 to be declared significant. This is conservative — it controls the family-wise error rate (FWER), the probability of even one false positive. The cost: reduced power. Real edges with marginal significance get missed.
Benjamini-Hochberg False Discovery Rate (FDR)
A less conservative alternative that controls the expected proportion of false discoveries among all discoveries:
- Rank all m p-values from smallest to largest: p(1) <= p(2) <= … <= p(m)
- Find the largest k such that p(k) <= (k/m) * alpha
- Reject all hypotheses with p-values <= p(k)
FDR control at q = 0.05 means: among all strategies you declare significant, you expect at most 5% to be false discoveries. This is less strict than Bonferroni but more realistic for multi-strategy agents.
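The step-up procedure fits in a few vectorized lines. A sketch with illustrative p-values (m = 5 hypothetical tests):

```python
import numpy as np

pvals = np.array([0.003, 0.012, 0.041, 0.20, 0.65])   # hypothetical scan results
m, q = len(pvals), 0.05

order = np.argsort(pvals)                   # rank p-values ascending
thresholds = (np.arange(1, m + 1) / m) * q  # step-up thresholds k/m * q
passed = pvals[order] <= thresholds
k = int(passed.nonzero()[0].max()) + 1 if passed.any() else 0  # largest passing rank
reject = np.zeros(m, dtype=bool)
reject[order[:k]] = True                    # reject the k smallest p-values
```

Here only the two smallest p-values are declared discoveries; everything ranked above the largest passing rank is retained as null.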
Confidence Intervals for Win Rate
A point estimate of “55% win rate” is incomplete without uncertainty bounds. The Wilson score interval is preferred over the Wald interval for proportions because it doesn’t produce impossible values (negative or > 1) at extreme rates:
Wilson interval: (p_hat + z^2/(2n) +/- z * sqrt(p_hat*(1-p_hat)/n + z^2/(4n^2))) / (1 + z^2/n)
where z = 1.96 for 95% confidence.
For 550 wins in 1,000 bets:
p_hat = 0.55
Lower = 0.519
Upper = 0.581
The 95% CI is [51.9%, 58.1%]. Since this interval includes 52.38% (break-even at -110), you cannot reject H0 at the 95% level. The agent might have edge, or it might be breaking even.
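The Wilson bounds above can be checked directly. A sketch; `wilson_ci` is an illustrative helper, not a library function:

```python
from math import sqrt
from scipy import stats

def wilson_ci(wins, n, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)   # 1.96 for 95% confidence
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

lo, hi = wilson_ci(550, 1000)   # ≈ (0.519, 0.581): interval straddles 0.5238
```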
Statistical Power
Power = P(reject H0 | H1 is true). It’s the probability of detecting a real edge when it exists.
Power = P(z > z_alpha - (p - p0) / sqrt(p0 * (1 - p0) / N))
| Sample Size | Power at 54% true rate (-110) | Power at 56% true rate (-110) |
|---|---|---|
| 100 | 9.3% | 17.7% |
| 500 | 17.8% | 49.0% |
| 1,000 | 26.8% | 74.2% |
| 2,500 | 49.1% | 97.7% |
| 5,000 | 74.2% | >99.9% |
At 100 bets, an agent with a real 54% edge has only a 9.3% chance of detecting it. Even at 1,000 bets, the power is only 26.8%; reaching 80% power takes roughly 5,900 bets. This is why premature evaluation kills profitable systems: the agent shuts down a winning strategy because it hasn't gathered enough data to prove it works.
Bayesian Edge Detection
The frequentist framework answers “how surprising is my data if I have no edge?” The Bayesian framework answers the question you actually care about: “given my data, what is the probability I have edge?”
Use the Beta-Binomial conjugate model:
Prior: Beta(alpha_prior, beta_prior). Use Beta(1, 1) (uniform) if you have no prior information, or Beta(52.38, 47.62) to encode the prior belief that most bettors are break-even.
Likelihood: Binomial(N, p) with W wins.
Posterior: Beta(alpha_prior + W, beta_prior + N - W).
The quantity of interest: P(p > p0 | data) — the posterior probability that the true win rate exceeds break-even.
P(p > 0.5238 | 550 wins in 1000 bets) = 1 - BetaCDF(0.5238; 551, 451)
With a uniform prior, this equals approximately 0.952 — a 95.2% posterior probability of having edge. The Bayesian approach provides a direct probability statement about edge existence, which is far more useful for an agent’s decision logic than a p-value.
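The posterior computation is one call to scipy's Beta CDF. A sketch using the 550-of-1,000 record from above:

```python
from scipy import stats

wins, n, p0 = 550, 1000, 0.5238       # record and break-even rate at -110
post_a = 1.0 + wins                   # uniform Beta(1, 1) prior
post_b = 1.0 + (n - wins)             # posterior is Beta(551, 451)
p_edge = 1 - stats.beta.cdf(p0, post_a, post_b)   # P(p > p0 | data) ≈ 0.95
```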
Survivorship Bias
Survivorship bias is the deadliest statistical trap in betting. You only observe systems that survived long enough to report results. The graveyard of failed systems is invisible.
If 1,000 people each flip a fair coin 10 times:
- ~1 person will get 10 heads (appears to have a “system”)
- ~11 people will get 9+ heads
- ~56 people will get 8+ heads
Those 56 people share their “winning method” on forums. The other 944 stay silent. An agent crawling betting forums for strategies inherits this bias directly.
The correction: evaluate any discovered strategy on out-of-sample data only. Split the historical record. Use the first half for discovery and the second half for validation. Never use in-sample performance as evidence of edge.
Worked Examples
Example 1: NFL Season ATS Record
A sharp betting agent goes 145-115 ATS (against the spread) over an NFL season at -110 odds across BetOnline and BookMaker.
N = 260, W = 145, p_hat = 145/260 = 0.5577
p0 = 0.5238 (break-even at -110)
z = (0.5577 - 0.5238) / sqrt(0.5238 * 0.4762 / 260) = 0.0339 / 0.03098 = 1.094
p-value = 0.137 (one-tailed)
Not significant at alpha = 0.05. A 55.77% ATS rate over 260 games is a profitable season (+16.8 units at flat one-unit stakes) but provides no statistical evidence of persistent edge. The 95% Wilson CI for win rate is [49.7%, 61.7%] — wide enough to include break-even.
Example 2: Polymarket Election Model — 2,000 Resolved Markets
An agent trading YES/NO contracts on Polymarket resolves 2,000 positions over 18 months. It bought YES contracts at an average price of $0.58 and the realized win rate is 63%.
The break-even rate for contracts bought at average price $0.58 (with 2% fee on profit):
break-even = 0.58 / (1 - 0.02 * (1 - 0.58)) = 0.58 / 0.9916 = 0.5849
Test: is 63% significantly above 58.49%?
z = (0.63 - 0.5849) / sqrt(0.5849 * 0.4151 / 2000) = 0.0451 / 0.01102 = 4.09
p-value = 0.0000217 (one-tailed)
Highly significant. p < 0.001. The agent has strong statistical evidence of genuine edge in prediction market trading. The 95% Wilson CI for win rate is [61.0%, 65.0%] — entirely above break-even.
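This example is easy to reproduce. A sketch following the fee-adjusted break-even formula above, with the example's numbers:

```python
from math import sqrt
from scipy import stats

avg_price, fee = 0.58, 0.02
p0 = avg_price / (1 - fee * (1 - avg_price))   # fee-adjusted break-even ≈ 0.585
n, p_hat = 2000, 0.63
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # ≈ 4.09
p_value = 1 - stats.norm.cdf(z)                # ≈ 2e-5
```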
Example 3: Multiple Strategy Scan
An agent tests 15 MLB strategies across team totals, run lines, and first-five-inning lines using historical odds from The Odds API. Results for the top 3:
| Strategy | N | Win Rate | Raw p-value | Bonferroni p | BH Threshold |
|---|---|---|---|---|---|
| Over 8.5 road dogs | 312 | 58.8% | 0.012 | 0.180 | 0.003 |
| Under F5 AL West | 198 | 59.8% | 0.018 | 0.270 | 0.007 |
| RL favorites -1.5 | 445 | 56.5% | 0.041 | 0.615 | 0.010 |
Raw: All three strategies appear significant at alpha = 0.05. Bonferroni (alpha/15 = 0.0033): None survive. BH at q = 0.05: None survive (smallest p-value 0.012 exceeds BH threshold 0.05 * 1/15 = 0.003). After correction, zero strategies show significant edge. The agent avoids deploying capital on noise.
Implementation
```python
import numpy as np
from scipy import stats
from dataclasses import dataclass


@dataclass
class SignificanceResult:
    """Result of a betting edge significance test."""
    n_bets: int
    wins: int
    win_rate: float
    break_even_rate: float
    z_statistic: float
    p_value: float
    ci_lower: float
    ci_upper: float
    is_significant: bool
    alpha: float


def test_betting_edge(
    wins: int,
    n_bets: int,
    break_even_rate: float = 0.5238,
    alpha: float = 0.05,
) -> SignificanceResult:
    """
    One-sided z-test for betting edge.

    Tests H0: true win rate = break_even_rate
    vs.   H1: true win rate > break_even_rate

    Args:
        wins: Number of winning bets
        n_bets: Total number of bets
        break_even_rate: Break-even win rate (0.5238 for -110 odds)
        alpha: Significance level

    Returns:
        SignificanceResult with test statistics and confidence interval
    """
    p_hat = wins / n_bets
    se = np.sqrt(break_even_rate * (1 - break_even_rate) / n_bets)
    z = (p_hat - break_even_rate) / se
    p_value = 1 - stats.norm.cdf(z)

    # Wilson score interval
    z_ci = stats.norm.ppf(1 - alpha / 2)
    denominator = 1 + z_ci**2 / n_bets
    center = (p_hat + z_ci**2 / (2 * n_bets)) / denominator
    margin = (z_ci / denominator) * np.sqrt(
        p_hat * (1 - p_hat) / n_bets + z_ci**2 / (4 * n_bets**2)
    )
    ci_lower = center - margin
    ci_upper = center + margin

    return SignificanceResult(
        n_bets=n_bets,
        wins=wins,
        win_rate=p_hat,
        break_even_rate=break_even_rate,
        z_statistic=z,
        p_value=p_value,
        ci_lower=ci_lower,
        ci_upper=ci_upper,
        is_significant=p_value < alpha,
        alpha=alpha,
    )


def required_sample_size(
    true_rate: float,
    break_even_rate: float = 0.5238,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """
    Minimum bets needed to detect a true win rate at given confidence and power.

    Args:
        true_rate: Hypothesized true win rate (e.g., 0.54)
        break_even_rate: Break-even rate (0.5238 for -110)
        alpha: Significance level
        power: Desired statistical power (1 - beta)

    Returns:
        Required number of bets (rounded up)
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    effect = true_rate - break_even_rate
    variance = break_even_rate * (1 - break_even_rate)
    n = ((z_alpha + z_beta) ** 2 * variance) / (effect ** 2)
    return int(np.ceil(n))


def power_analysis(
    n_bets: int,
    true_rate: float,
    break_even_rate: float = 0.5238,
    alpha: float = 0.05,
) -> float:
    """
    Calculate statistical power for a given sample size and true rate.

    Args:
        n_bets: Number of bets in the sample
        true_rate: Hypothesized true win rate
        break_even_rate: Break-even rate
        alpha: Significance level

    Returns:
        Power (probability of detecting the edge)
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    se_null = np.sqrt(break_even_rate * (1 - break_even_rate) / n_bets)
    se_alt = np.sqrt(true_rate * (1 - true_rate) / n_bets)
    threshold = break_even_rate + z_alpha * se_null
    z_power = (true_rate - threshold) / se_alt
    return stats.norm.cdf(z_power)


def bonferroni_correction(
    p_values: list[float],
    alpha: float = 0.05,
) -> list[dict]:
    """
    Apply Bonferroni correction for multiple comparisons.

    Args:
        p_values: List of raw p-values from individual tests
        alpha: Family-wise significance level

    Returns:
        List of dicts with raw p-value, adjusted p-value, and significance
    """
    m = len(p_values)
    adjusted_alpha = alpha / m
    results = []
    for i, p in enumerate(p_values):
        adjusted_p = min(p * m, 1.0)
        results.append({
            "test_index": i,
            "raw_p": p,
            "adjusted_p": adjusted_p,
            "significant": p < adjusted_alpha,
            "adjusted_alpha": adjusted_alpha,
        })
    return results


def benjamini_hochberg(
    p_values: list[float],
    q: float = 0.05,
) -> list[dict]:
    """
    Benjamini-Hochberg procedure for FDR control.

    Args:
        p_values: List of raw p-values
        q: Target false discovery rate

    Returns:
        List of dicts with rank, threshold, and rejection decision
    """
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda x: x[1])
    max_k = -1
    for rank, (orig_idx, p) in enumerate(indexed, 1):
        threshold = (rank / m) * q
        if p <= threshold:
            max_k = rank
    results = []
    for rank, (orig_idx, p) in enumerate(indexed, 1):
        results.append({
            "original_index": orig_idx,
            "rank": rank,
            "raw_p": p,
            "bh_threshold": (rank / m) * q,
            "rejected": rank <= max_k,
        })
    return sorted(results, key=lambda x: x["original_index"])


def bayesian_edge_probability(
    wins: int,
    n_bets: int,
    break_even_rate: float = 0.5238,
    prior_alpha: float = 1.0,
    prior_beta: float = 1.0,
) -> dict:
    """
    Bayesian posterior probability of having edge using Beta-Binomial model.

    Args:
        wins: Number of winning bets
        n_bets: Total bets
        break_even_rate: Break-even win rate
        prior_alpha: Beta prior alpha (1.0 = uniform)
        prior_beta: Beta prior beta (1.0 = uniform)

    Returns:
        Dict with posterior parameters, P(edge), and credible interval
    """
    post_alpha = prior_alpha + wins
    post_beta = prior_beta + (n_bets - wins)
    p_edge = 1 - stats.beta.cdf(break_even_rate, post_alpha, post_beta)
    ci_lower = stats.beta.ppf(0.025, post_alpha, post_beta)
    ci_upper = stats.beta.ppf(0.975, post_alpha, post_beta)
    posterior_mean = post_alpha / (post_alpha + post_beta)
    return {
        "posterior_alpha": post_alpha,
        "posterior_beta": post_beta,
        "posterior_mean": posterior_mean,
        "p_edge": p_edge,
        "credible_interval_95": (ci_lower, ci_upper),
    }


# --- Demo: Run all analyses on a sample record ---
if __name__ == "__main__":
    # NFL season: 145-115 ATS at -110
    print("=" * 60)
    print("EXAMPLE 1: NFL Season 145-115 ATS at -110")
    print("=" * 60)
    result = test_betting_edge(wins=145, n_bets=260)
    print(f"Win rate: {result.win_rate:.4f} ({result.win_rate:.1%})")
    print(f"Break-even: {result.break_even_rate:.4f}")
    print(f"Z-statistic: {result.z_statistic:.3f}")
    print(f"P-value: {result.p_value:.4f}")
    print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
    print(f"Significant: {result.is_significant}")

    bayes = bayesian_edge_probability(wins=145, n_bets=260)
    print(f"Bayesian P(edge): {bayes['p_edge']:.3f}")
    print(f"95% Credible: [{bayes['credible_interval_95'][0]:.3f}, "
          f"{bayes['credible_interval_95'][1]:.3f}]")

    print(f"\n{'=' * 60}")
    print("SAMPLE SIZE TABLE")
    print("=" * 60)
    for rate in [0.53, 0.54, 0.55, 0.56, 0.57, 0.60]:
        n = required_sample_size(true_rate=rate)
        pwr = power_analysis(n_bets=n, true_rate=rate)
        print(f"True rate {rate:.0%}: need {n:>6,} bets (power = {pwr:.1%})")

    print(f"\n{'=' * 60}")
    print("MULTIPLE COMPARISONS: 15 MLB Strategies")
    print("=" * 60)
    raw_pvals = [0.012, 0.018, 0.041, 0.082, 0.11, 0.15, 0.22,
                 0.31, 0.38, 0.42, 0.55, 0.61, 0.72, 0.84, 0.93]

    bonf = bonferroni_correction(raw_pvals)
    print("\nBonferroni correction:")
    for r in bonf[:5]:
        print(f"  Test {r['test_index']}: p={r['raw_p']:.3f} -> "
              f"adj_p={r['adjusted_p']:.3f} "
              f"{'SIGNIFICANT' if r['significant'] else 'not sig'}")

    bh = benjamini_hochberg(raw_pvals)
    print("\nBenjamini-Hochberg (FDR = 0.05):")
    for r in bh[:5]:
        print(f"  Test {r['original_index']}: p={r['raw_p']:.3f} "
              f"threshold={r['bh_threshold']:.4f} "
              f"{'REJECTED' if r['rejected'] else 'not rejected'}")
```
Limitations and Edge Cases
Small sample distortion. The normal approximation behind the z-test breaks down for small samples and is unreliable below roughly 50 bets. In that regime, use the exact binomial test: scipy.stats.binomtest(wins, n_bets, break_even_rate, alternative='greater') (named binom_test in SciPy versions before 1.12). The z-test is faster, but the binomial test computes the tail probability exactly at any sample size.
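For example, the exact test on a hypothetical 18-7 record (the record is illustrative):

```python
from scipy import stats

# Small sample: 18 wins in 25 bets at -110 break-even
res = stats.binomtest(18, 25, 0.5238, alternative="greater")
small_p = res.pvalue   # exact tail probability, no normal approximation
```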
Non-independent bets. The z-test assumes each bet is independent. If an agent bets the same NBA game on the spread and the total, those bets are correlated. Treating correlated bets as independent overstates the effective sample size, making the edge appear more significant than it is. Cluster-robust standard errors or a GEE model are required when bets share underlying events.
Changing odds and edge. The formulas above assume a fixed break-even rate. In practice, agents bet at varying odds (-105 to -120 across different sportsbooks). Use ROI-based significance testing instead: compute the t-test on per-bet returns rather than the z-test on win rate. This handles heterogeneous odds naturally.
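A sketch of the ROI-based approach on simulated per-bet returns; the 55% true win rate and the -105 to -120 payout range (+0.833 to +0.952 per unit) are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1000
# Winning payouts per unit staked at odds between -105 (+0.952) and -120 (+0.833)
payouts = rng.uniform(0.833, 0.952, size=n)
wins = rng.random(n) < 0.55              # assumed 55% true win rate
returns = np.where(wins, payouts, -1.0)  # per-bet return: payout on win, -1 on loss

# One-sample t-test on per-bet returns: H1 is mean return > 0
res = stats.ttest_1samp(returns, 0.0, alternative="greater")
t_stat, p_val = res.statistic, res.pvalue
```

Because the test operates on realized returns rather than a binary win/loss count, a single break-even rate never needs to be assumed.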
The prior matters in Bayesian analysis. A uniform Beta(1,1) prior says “all win rates are equally likely before seeing data.” A more realistic prior for sports betting is Beta(52, 48) — encoding the belief that most bettors cluster around break-even. The choice of prior matters most when sample sizes are small (< 200 bets). At 2,000+ bets, the posterior is dominated by the data regardless of prior.
Significance does not equal profitability. A strategy can be statistically significant but not worth trading if the edge is smaller than transaction costs, opportunity costs, or the time value of capital locked in pending bets. Always compute net ROI after all fees before declaring a strategy deployable. Use the AgentBets Vig Index to identify the lowest-vig platforms where marginal edges remain profitable.
FAQ
How many bets do you need to prove a sports betting edge is statistically significant?
At standard -110 odds with a true 54% win rate (a 1.6% edge over break-even), you need approximately 5,900 bets for 95% confidence and 80% power. Required sample size scales with the inverse square of the edge: a 53% true rate needs roughly 40,000 bets. Most bettors never reach these sample sizes, which is why most claimed edges are indistinguishable from luck.
What is a p-value in sports betting and how do you interpret it?
A p-value is the probability of observing results at least as extreme as yours if you had no edge at all. A p-value of 0.03 means there is a 3% chance of seeing your record (or better) by pure luck. It is not the probability that you have no edge. Rejecting the null at p < 0.05 means you have moderate evidence of edge, but it does not guarantee profitability.
Why do most sports betting systems fail even with a winning record?
The multiple comparisons problem. If you test 20 different strategies at the 0.05 significance level, you expect one false positive by chance alone. Survivorship bias compounds this: you only hear about the systems that happened to win, never the 19 that lost. Apply Bonferroni correction (divide alpha by the number of tests) or FDR control to guard against false discoveries.
What is the Bayesian approach to detecting a sports betting edge?
Instead of p-values, use a Beta-Binomial model. Start with a Beta(1,1) prior (uniform), observe W wins in N bets, and compute the posterior Beta(1+W, 1+N-W). The quantity P(p > break_even | data) gives the direct probability that your true win rate exceeds break-even. This is often more intuitive than frequentist hypothesis testing and naturally incorporates prior beliefs about edge rarity.
How does statistical significance connect to calibration and model evaluation for betting agents?
Statistical significance tells you whether observed edge is real. Calibration tells you whether your probability estimates match reality. Both are required: a significant but miscalibrated model will size bets incorrectly via Kelly Criterion. See the calibration and model evaluation guide for forecast accuracy assessment that complements significance testing.
What’s Next
Statistical significance is the validation gate. Once an agent confirms edge is real, it needs to measure the quality of its probability estimates and optimize its feature set for maximum predictive power.
- Prerequisite connection: Information Theory and Betting covers entropy and KL divergence — the information-theoretic foundation for quantifying edge that this guide validates statistically.
- Natural follow-up: Calibration and Model Evaluation — once you know edge exists, assess whether your model’s probability outputs are well-calibrated for Kelly sizing.
- Build the pipeline: The Odds API Edge Detection Pipeline integrates significance testing into a production agent’s end-to-end workflow.
- Apply to features: Feature Engineering for Sports Prediction shows how to select and validate features using the significance framework from this guide.
- Find the sharpest lines: Sharp betting strategies rely on statistically validated CLV — see Closing Line Value for the metric that matters most.
