Expected goals (xG) models the probability that each shot scores: P(goal) = 1 / (1 + e^-(β₀ + β₁·distance + β₂·angle + …)). Sum shot-level xG to get match-level expected goals, feed that into a Poisson model for match outcome probabilities, and compare against market odds to find edge.

Why This Matters for Agents

An autonomous soccer betting agent needs a way to convert raw match data into win/draw/loss probabilities that are more accurate than the market. Raw goals are terrible for this — they’re high-variance Poisson events where a team averaging 1.3 goals per match scores zero 27% of the time. An agent relying on goal counts needs 50+ matches to get stable estimates. xG cuts that to 8-10 matches by measuring shot quality rather than binary outcomes.

This is Layer 4 — Intelligence. The xG model sits at the core of the agent’s prediction engine. The pipeline: pull historical shot data from StatsBomb, train a logistic regression xG model, aggregate to match-level xG, parameterize a Poisson scoring model, compute match outcome probabilities, compare against live odds from The Odds API, and flag positive EV bets. The output feeds directly into Kelly sizing for bankroll-optimal stake calculation. Every component of the Agent Betting Stack downstream depends on the quality of this Layer 4 model.

The Math

What xG Measures

Expected goals quantifies shot quality. For any shot i, xG_i is the probability that shot results in a goal, conditional on observable features:

xG_i = P(goal | distance_i, angle_i, body_part_i, assist_type_i, game_state_i, ...)

A penalty kick has xG ≈ 0.76. A header from 12 meters off a cross has xG ≈ 0.04. A shot from the edge of the 6-yard box with the keeper out of position has xG ≈ 0.45. The model assigns each shot a probability based on historical conversion rates from similar positions and contexts.

Match-level xG is the sum of all shot-level xG values:

match_xG = Σ xG_i    for all shots i in the match

If a team takes 14 shots with individual xG values of [0.03, 0.07, 0.45, 0.02, 0.11, 0.08, 0.04, 0.76, 0.06, 0.03, 0.12, 0.09, 0.05, 0.02], their match xG = 1.93. They “deserved” roughly 1.93 goals based on the quality of chances created.
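
In code, match-level xG is just a sum. A quick check of the example above:

```python
# Shot-level xG values from the example above
shot_xg = [0.03, 0.07, 0.45, 0.02, 0.11, 0.08, 0.04, 0.76,
           0.06, 0.03, 0.12, 0.09, 0.05, 0.02]
match_xg = sum(shot_xg)
print(f"Match xG: {match_xg:.2f}")  # Match xG: 1.93
```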

The Logistic Regression Model

The standard xG model is logistic regression. The target is binary: goal (1) or no goal (0). The model estimates:

P(goal) = 1 / (1 + e^-(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ))

Where the sigmoid function maps any linear combination of features to the (0, 1) probability range.
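
As a concrete sketch of the sigmoid in action — the coefficients and shot values below are illustrative placeholders, not fitted estimates:

```python
import numpy as np

def sigmoid(z: float) -> float:
    """Map the linear predictor onto the (0, 1) probability range."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: distance hurts, angle helps
beta_0, beta_dist, beta_angle = -1.4, -0.09, 1.1
dist_m, angle_rad = 11.0, 0.60  # 11 m out, 0.60 rad of goal visible

xg = sigmoid(beta_0 + beta_dist * dist_m + beta_angle * angle_rad)
print(f"xG = {xg:.3f}")  # roughly 0.15 for this shot
```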

The core features, ranked by predictive importance:

Feature       Variable   Description                                      Expected Sign
Distance      x₁         Meters from shot location to goal center         Negative (farther = less likely)
Angle         x₂         Radians subtended by the goalposts from the shot Positive (wider angle = more likely)
Body part     x₃         Binary: 1 = foot, 0 = head/other                 Positive (foot shots convert better)
Assist type   x₄-x₇      One-hot: through ball, cross, set piece, none    Through ball positive, cross negative
Game state    x₈         Goal difference at time of shot                  Slight negative (leading teams take lower-quality shots)
Fast break    x₉         Binary: 1 = counter-attack, 0 = settled play     Positive (disorganized defense)
Big chance    x₁₀        Binary: 1 = one-on-one or open goal              Strong positive

Computing Shot Angle and Distance

Shot angle — the angle subtended by the two goalposts from the shot location — is the single most important geometric feature. Given shot coordinates (x, y) on a pitch where the goal line is at x = 0 and the goal spans y ∈ [-3.66, 3.66] (standard 7.32m goal width):

Goal posts:  P₁ = (0, -3.66),  P₂ = (0, 3.66)

Distance to goal center:
  d = sqrt(x² + y²)

Angle (using the law of cosines):
  a = |P₁ - shot|,  b = |P₂ - shot|,  c = |P₁ - P₂| = 7.32m
  θ = arccos((a² + b² - c²) / (2ab))

Larger angles mean the shooter “sees” more of the goal. A shot from directly in front at 6 meters has a wide angle (θ ≈ 1.10 rad). A shot from a tight angle at the same distance might have θ ≈ 0.25 rad.
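
These two claims check out numerically (the tight-angle coordinates below are an assumed example position, roughly 6 m out near the byline):

```python
import math

def shot_angle(x: float, y: float, goal_width: float = 7.32) -> float:
    """Angle subtended by the goalposts, via the law of cosines."""
    half = goal_width / 2
    a = math.hypot(x, y - half)  # distance to one post
    b = math.hypot(x, y + half)  # distance to the other post
    cos_t = (a * a + b * b - goal_width ** 2) / (2 * a * b)
    return math.acos(max(-1.0, min(1.0, cos_t)))

print(round(shot_angle(6.0, 0.0), 2))    # straight on, 6 m out
print(round(shot_angle(0.75, 5.95), 2))  # ~6 m out, near the byline
```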

From Shot xG to Match Outcomes: The Poisson Bridge

Individual shot xG values aggregate to team-level scoring rates. The key insight: goals in soccer follow a Poisson distribution with parameter λ equal to the team’s xG rate.

For a match between team A (home) and team B (away):

λ_home = home_attack_strength × away_defense_weakness × league_home_avg
λ_away = away_attack_strength × home_defense_weakness × league_away_avg

Where attack strength = team xG per match / league average xG, and defense weakness = opponent xG conceded per match / league average xG conceded.

The probability of any specific scoreline (h goals for home, a goals for away):

P(home = h, away = a) = Poisson(h | λ_home) × Poisson(a | λ_away)

Where Poisson(k | λ) = (λ^k × e^(-λ)) / k!

This assumes independence between home and away scoring — a simplification that holds reasonably well empirically. For the full Poisson derivation and independence assumption analysis, see the Poisson Distribution in Sports Modeling guide.

Match Outcome Probabilities

Sum the scoreline probabilities across all outcomes where home wins, draws, or loses:

P(home win) = Σ P(h, a)  for all h > a
P(draw)     = Σ P(h, a)  for all h = a
P(away win) = Σ P(h, a)  for all h < a

In practice, truncate at 8 or 10 goals per side. For typical scoring rates (λ between 1 and 2), the probability of a team scoring 10+ goals is below 0.005%; even at λ = 3.5 it is only about 0.3%, so the truncation error is negligible for outcome probabilities.
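
To see what the truncation discards, compute the Poisson tail mass above 9 goals for a few scoring rates (the λ values are illustrative):

```python
from scipy.stats import poisson

# Mass above 9 goals for a range of plausible scoring rates
for lam in (1.0, 2.0, 3.5):
    tail = 1.0 - poisson.cdf(9, lam)  # P(team scores 10+ goals)
    print(f"lambda = {lam}: P(10+ goals) = {tail:.2e}")
```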

Worked Examples

Example 1: Building xG Rates from Season Data

Manchester City 2024-25 Premier League through 25 matches (data from FBref):

Man City:   xG for = 48.3,  xG against = 22.1  (25 matches)
Per match:  xGF = 1.93,  xGA = 0.88

League averages:  avg xGF = 1.35,  avg xGA = 1.35  (by definition equal)
Home advantage factor: 1.24 (home teams score ~24% more in the Premier League)

Attack strength  = 1.93 / 1.35 = 1.430
Defense weakness = 0.88 / 1.35 = 0.652

Liverpool 2024-25 through 25 matches:

Liverpool:  xG for = 52.8,  xG against = 24.5
Per match:  xGF = 2.11,  xGA = 0.98

Attack strength  = 2.11 / 1.35 = 1.563
Defense weakness = 0.98 / 1.35 = 0.726

For Man City (home) vs Liverpool (away):

λ_home = City_attack × Liverpool_def_weakness × home_avg
       = 1.430 × 0.726 × (1.35 × 1.24)
       = 1.430 × 0.726 × 1.674
       = 1.738

λ_away = Liverpool_attack × City_def_weakness × away_avg
       = 1.563 × 0.652 × (1.35 / 1.24)
       = 1.563 × 0.652 × 1.089
       = 1.109
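
A few lines reproduce these numbers (tiny differences in the third decimal come from the rounded intermediates above):

```python
league_avg = 1.35
home_adv = 1.24

city_attack = 1.93 / league_avg       # 1.430
liverpool_def = 0.98 / league_avg     # 0.726 (xG-conceded ratio)
liverpool_attack = 2.11 / league_avg  # 1.563
city_def = 0.88 / league_avg          # 0.652

lam_home = city_attack * liverpool_def * (league_avg * home_adv)
lam_away = liverpool_attack * city_def * (league_avg / home_adv)
print(f"lambda_home = {lam_home:.3f}, lambda_away = {lam_away:.3f}")
```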

Scoreline probabilities (Poisson):

Score   P(score)    Outcome category
0-0     0.0580      Draw
1-0     0.1008      Home win
0-1     0.0643      Away win
2-0     0.0876      Home win
1-1     0.1118      Draw
0-2     0.0357      Away win
2-1     0.0972      Home win
3-0     0.0508      Home win
...

P(Man City win)  = 0.521
P(Draw)          = 0.237
P(Liverpool win) = 0.242

Now compare against market odds. BetOnline might have Man City at -125 (implied 55.6%), Draw at +300 (25.0%), Liverpool at +380 (20.8%). The model says City at 52.1% — the market is overpricing City by roughly 3.5 percentage points — while Liverpool at 24.2% sits about 3.4 points above the implied price. The Draw (23.7% vs 25.0% implied) offers nothing, so an agent flags Liverpool as the potential value bet.

Example 2: Exploiting xG-to-Goals Divergence

Brighton 2024-25 through 20 matches:

Actual goals scored: 34  (1.70 per match)
xG:                  28.6 (1.43 per match)
Overperformance:     +5.4 goals (+0.27 per match)

Brighton is outscoring their xG by 18.9%. This kind of finishing overperformance regresses: teams overperforming xG by more than 15% over a 20-match window revert to within 5% of xG over the next 15 matches roughly 78% of the time.

An agent’s strategy: fade Brighton in totals markets. If the market sets Brighton’s match total at 1.75 based on actual goals, but xG says 1.43, there’s value on the Under. On a sportsbook like BetOnline, Brighton total goals Under 1.5 at +115 (implied 46.5%) is worth investigating when the xG-based Poisson model gives:

P(Brighton scores 0) = e^(-1.43) = 0.239
P(Brighton scores 1) = 1.43 × e^(-1.43) = 0.342
P(Under 1.5) = 0.239 + 0.342 = 0.581

Model says 58.1%, market says 46.5%. That’s +11.6 percentage points of edge — a strong bet by any standard.
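
The arithmetic, including the American-odds conversion, as a quick sketch:

```python
import math

lam = 1.43  # Brighton's xG per match, used as the Poisson rate

p_under_1_5 = math.exp(-lam) + lam * math.exp(-lam)  # P(0) + P(1)
implied = 100 / 215                                  # +115 -> implied probability

print(f"P(Under 1.5) = {p_under_1_5:.3f}")
print(f"Implied      = {implied:.3f}")
print(f"Edge         = {p_under_1_5 - implied:+.3f}")
```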

Implementation

import numpy as np
from scipy.stats import poisson
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from dataclasses import dataclass


@dataclass
class ShotFeatures:
    """Features for a single shot."""
    distance: float       # meters from goal center
    angle: float          # radians, angle subtended by goalposts
    is_foot: int          # 1 = foot, 0 = head/other
    is_through_ball: int  # 1 = assisted by through ball
    is_cross: int         # 1 = assisted by cross
    is_set_piece: int     # 1 = from set piece
    is_fast_break: int    # 1 = counter-attack
    game_state: int       # goal difference at time of shot (positive = winning)
    is_big_chance: int    # 1 = one-on-one or open goal


def compute_shot_angle(x: float, y: float, goal_width: float = 7.32) -> float:
    """
    Compute the angle subtended by the goalposts from shot location.

    Pitch coordinates: goal line at x=0, goal center at y=0.
    Goal posts at (0, -goal_width/2) and (0, +goal_width/2).

    Args:
        x: Distance from goal line (meters, positive into pitch)
        y: Lateral position (meters, 0 = center of goal)
        goal_width: Width of goal in meters (default 7.32)

    Returns:
        Angle in radians
    """
    half_w = goal_width / 2

    # Vectors from shot to each post
    a_sq = x**2 + (y - half_w)**2  # distance squared to right post
    b_sq = x**2 + (y + half_w)**2  # distance squared to left post
    c_sq = goal_width**2            # distance between posts squared

    a = np.sqrt(a_sq)
    b = np.sqrt(b_sq)

    # Law of cosines: c² = a² + b² - 2ab*cos(θ)
    cos_theta = (a_sq + b_sq - c_sq) / (2 * a * b)
    cos_theta = np.clip(cos_theta, -1.0, 1.0)  # numerical safety

    return np.arccos(cos_theta)


def compute_distance(x: float, y: float) -> float:
    """Distance from shot location to goal center (0, 0)."""
    return np.sqrt(x**2 + y**2)


class XGModel:
    """
    Expected Goals model using logistic regression.

    Train on shot-level data with binary target (1=goal, 0=no goal).
    Predict xG for new shots. Aggregate to match-level xG.
    """

    def __init__(self):
        self.model = LogisticRegression(
            max_iter=1000,
            C=1.0,  # regularization strength
            solver='lbfgs'
        )
        self.is_fitted = False

    def _features_to_array(self, shots: list[ShotFeatures]) -> np.ndarray:
        """Convert list of ShotFeatures to numpy array."""
        return np.array([
            [s.distance, s.angle, s.is_foot, s.is_through_ball,
             s.is_cross, s.is_set_piece, s.is_fast_break,
             s.game_state, s.is_big_chance]
            for s in shots
        ])

    def fit(self, shots: list[ShotFeatures], goals: list[int]) -> dict:
        """
        Train the xG model on historical shot data.

        Args:
            shots: List of ShotFeatures for each shot
            goals: List of 0/1 indicating if each shot was a goal

        Returns:
            Dict with training metrics
        """
        X = self._features_to_array(shots)
        y = np.array(goals)

        self.model.fit(X, y)
        self.is_fitted = True

        # Cross-validated log loss
        cv_scores = cross_val_score(
            self.model, X, y, cv=5, scoring='neg_log_loss'
        )

        feature_names = [
            'distance', 'angle', 'is_foot', 'through_ball',
            'cross', 'set_piece', 'fast_break', 'game_state', 'big_chance'
        ]

        return {
            'n_shots': len(goals),
            'goal_rate': np.mean(y),
            'cv_log_loss': -cv_scores.mean(),
            'coefficients': dict(zip(feature_names, self.model.coef_[0])),
            'intercept': self.model.intercept_[0]
        }

    def predict_xg(self, shots: list[ShotFeatures]) -> np.ndarray:
        """Predict xG for each shot. Returns array of probabilities."""
        if not self.is_fitted:
            raise RuntimeError("Model not fitted. Call fit() first.")
        X = self._features_to_array(shots)
        return self.model.predict_proba(X)[:, 1]

    def match_xg(self, shots: list[ShotFeatures]) -> float:
        """Sum of xG for all shots — the match-level expected goals."""
        return float(np.sum(self.predict_xg(shots)))


class PoissonMatchModel:
    """
    Match outcome model using Poisson-distributed goals.

    Takes team xG rates as lambda parameters.
    Computes scoreline and outcome probabilities.
    """

    def __init__(self, max_goals: int = 8):
        self.max_goals = max_goals

    def compute_lambda(
        self,
        team_xg_per_match: float,
        opponent_xg_conceded_per_match: float,
        league_avg_goals: float = 1.35,
        home_advantage: float = 1.24,
        is_home: bool = True
    ) -> float:
        """
        Compute expected goals (lambda) for a team in a specific match.

        Args:
            team_xg_per_match: Team's average xG scored per match
            opponent_xg_conceded_per_match: Opponent's avg xG conceded per match
            league_avg_goals: League average goals per team per match
            home_advantage: Multiplicative home advantage factor
            is_home: Whether this team is playing at home

        Returns:
            Lambda (expected goals) for the Poisson distribution
        """
        attack = team_xg_per_match / league_avg_goals
        defense = opponent_xg_conceded_per_match / league_avg_goals

        if is_home:
            base = league_avg_goals * home_advantage
        else:
            base = league_avg_goals / home_advantage

        return attack * defense * base

    def scoreline_matrix(
        self,
        lambda_home: float,
        lambda_away: float
    ) -> np.ndarray:
        """
        Compute probability matrix for all scorelines up to max_goals.

        Returns:
            2D array where [h][a] = P(home scores h, away scores a)
        """
        home_probs = poisson.pmf(range(self.max_goals + 1), lambda_home)
        away_probs = poisson.pmf(range(self.max_goals + 1), lambda_away)

        return np.outer(home_probs, away_probs)

    def match_probabilities(
        self,
        lambda_home: float,
        lambda_away: float
    ) -> dict:
        """
        Compute home win, draw, away win probabilities.

        Returns:
            Dict with 'home_win', 'draw', 'away_win' probabilities
            and top scorelines.
        """
        matrix = self.scoreline_matrix(lambda_home, lambda_away)
        n = self.max_goals + 1

        home_win = sum(
            matrix[h][a] for h in range(n) for a in range(n) if h > a
        )
        draw = sum(matrix[h][h] for h in range(n))
        away_win = sum(
            matrix[h][a] for h in range(n) for a in range(n) if h < a
        )

        # Top 5 most likely scorelines
        scorelines = []
        for h in range(n):
            for a in range(n):
                scorelines.append((h, a, matrix[h][a]))
        scorelines.sort(key=lambda x: -x[2])

        return {
            'home_win': home_win,
            'draw': draw,
            'away_win': away_win,
            'lambda_home': lambda_home,
            'lambda_away': lambda_away,
            'top_scorelines': [
                {'score': f"{h}-{a}", 'prob': p}
                for h, a, p in scorelines[:5]
            ]
        }

    def over_under(
        self,
        lambda_home: float,
        lambda_away: float,
        line: float = 2.5
    ) -> dict:
        """
        Compute over/under probabilities for total goals line.

        Args:
            lambda_home: Home team expected goals
            lambda_away: Away team expected goals
            line: Total goals line (e.g., 2.5)

        Returns:
            Dict with 'over' and 'under' probabilities
        """
        matrix = self.scoreline_matrix(lambda_home, lambda_away)
        n = self.max_goals + 1

        under = sum(
            matrix[h][a]
            for h in range(n) for a in range(n)
            if (h + a) < line
        )
        over = 1.0 - under

        return {'over': over, 'under': under, 'line': line}


def xg_divergence_analysis(
    actual_goals: list[int],
    match_xg: list[float],
    team_name: str = "Team"
) -> dict:
    """
    Analyze the divergence between actual goals and xG.

    Identifies overperformance/underperformance and estimates
    regression probability.

    Args:
        actual_goals: List of actual goals scored per match
        match_xg: List of match-level xG values
        team_name: Team name for reporting

    Returns:
        Analysis dict with divergence metrics
    """
    n = len(actual_goals)
    total_goals = sum(actual_goals)
    total_xg = sum(match_xg)

    goals_per_match = total_goals / n
    xg_per_match = total_xg / n
    divergence = total_goals - total_xg
    divergence_pct = (divergence / total_xg) * 100 if total_xg > 0 else 0

    # Probability of observed goals under xG-based Poisson model
    # For each match, compute P(actual goals | xG) under Poisson
    match_probs = [
        poisson.pmf(g, xg) for g, xg in zip(actual_goals, match_xg)
    ]
    log_likelihood = np.sum(np.log(np.array(match_probs) + 1e-10))

    return {
        'team': team_name,
        'matches': n,
        'total_goals': total_goals,
        'total_xg': round(total_xg, 2),
        'goals_per_match': round(goals_per_match, 2),
        'xg_per_match': round(xg_per_match, 2),
        'divergence': round(divergence, 2),
        'divergence_pct': round(divergence_pct, 1),
        'direction': 'overperforming' if divergence > 0 else 'underperforming',
        'log_likelihood': round(log_likelihood, 2),
        'regression_signal': abs(divergence_pct) > 15
    }


def find_value_bets(
    model_probs: dict,
    market_odds: dict,
    min_edge: float = 0.03
) -> list[dict]:
    """
    Compare model probabilities against market odds to find value.

    Args:
        model_probs: Dict with 'home_win', 'draw', 'away_win' probabilities
        market_odds: Dict with same keys, values are decimal odds
        min_edge: Minimum probability edge to flag (default 3%)

    Returns:
        List of value bet opportunities
    """
    value_bets = []

    for outcome in ['home_win', 'draw', 'away_win']:
        model_p = model_probs[outcome]
        decimal_odds = market_odds[outcome]
        implied_p = 1.0 / decimal_odds

        edge = model_p - implied_p
        ev = model_p * (decimal_odds - 1) - (1 - model_p)

        if edge >= min_edge:
            value_bets.append({
                'outcome': outcome,
                'model_prob': round(model_p, 4),
                'implied_prob': round(implied_p, 4),
                'edge': round(edge, 4),
                'decimal_odds': decimal_odds,
                'ev_per_unit': round(ev, 4),
                'kelly_fraction': round(edge / (decimal_odds - 1), 4)
            })

    value_bets.sort(key=lambda x: -x['edge'])
    return value_bets


# --- Example usage ---

if __name__ == "__main__":
    # Generate synthetic training data (in production, use StatsBomb)
    np.random.seed(42)
    n_shots = 5000

    x_coords = np.random.uniform(3, 35, n_shots)   # meters from the goal line
    laterals = np.random.uniform(-15, 15, n_shots)
    angles = np.array([
        compute_shot_angle(x, y) for x, y in zip(x_coords, laterals)
    ])
    # Distance feature = meters to goal center, matching ShotFeatures
    distances = np.array([
        compute_distance(x, y) for x, y in zip(x_coords, laterals)
    ])
    is_foot = np.random.binomial(1, 0.72, n_shots)
    is_through = np.random.binomial(1, 0.08, n_shots)
    is_cross = np.random.binomial(1, 0.15, n_shots)
    is_set_piece = np.random.binomial(1, 0.12, n_shots)
    is_fast_break = np.random.binomial(1, 0.06, n_shots)
    game_state = np.random.choice([-2, -1, 0, 1, 2], n_shots, p=[0.05, 0.15, 0.50, 0.20, 0.10])
    is_big_chance = np.random.binomial(1, 0.05, n_shots)

    # Simulate goals based on realistic xG relationship
    log_odds = (
        -1.5
        - 0.08 * distances
        + 1.2 * angles
        + 0.3 * is_foot
        + 0.5 * is_through
        - 0.2 * is_cross
        + 0.0 * is_set_piece
        + 0.4 * is_fast_break
        - 0.05 * game_state
        + 1.8 * is_big_chance
    )
    true_probs = 1 / (1 + np.exp(-log_odds))
    goals = np.random.binomial(1, true_probs)

    shots = [
        ShotFeatures(
            distance=distances[i], angle=angles[i], is_foot=int(is_foot[i]),
            is_through_ball=int(is_through[i]), is_cross=int(is_cross[i]),
            is_set_piece=int(is_set_piece[i]), is_fast_break=int(is_fast_break[i]),
            game_state=int(game_state[i]), is_big_chance=int(is_big_chance[i])
        )
        for i in range(n_shots)
    ]

    # Train xG model
    xg_model = XGModel()
    metrics = xg_model.fit(shots, goals.tolist())
    print("=== xG Model Training ===")
    print(f"Shots: {metrics['n_shots']}, Goal rate: {metrics['goal_rate']:.3f}")
    print(f"CV Log Loss: {metrics['cv_log_loss']:.4f}")
    print(f"Coefficients: {metrics['coefficients']}")

    # Match outcome prediction: Man City vs Liverpool
    match_model = PoissonMatchModel()

    lambda_city = match_model.compute_lambda(
        team_xg_per_match=1.93,
        opponent_xg_conceded_per_match=0.98,
        is_home=True
    )
    lambda_liverpool = match_model.compute_lambda(
        team_xg_per_match=2.11,
        opponent_xg_conceded_per_match=0.88,
        is_home=False
    )

    print(f"\n=== Man City (H) vs Liverpool (A) ===")
    print(f"Lambda City: {lambda_city:.3f}, Lambda Liverpool: {lambda_liverpool:.3f}")

    result = match_model.match_probabilities(lambda_city, lambda_liverpool)
    print(f"Home win: {result['home_win']:.3f}")
    print(f"Draw:     {result['draw']:.3f}")
    print(f"Away win: {result['away_win']:.3f}")
    print(f"Top scorelines: {result['top_scorelines'][:3]}")

    ou = match_model.over_under(lambda_city, lambda_liverpool, line=2.5)
    print(f"Over 2.5: {ou['over']:.3f}, Under 2.5: {ou['under']:.3f}")

    # Value bet detection
    market_odds = {
        'home_win': 1.80,  # -125 American
        'draw': 4.00,      # +300
        'away_win': 4.80   # +380
    }

    value = find_value_bets(result, market_odds, min_edge=0.02)
    print(f"\n=== Value Bets (min 2% edge) ===")
    for v in value:
        print(f"  {v['outcome']}: model={v['model_prob']:.1%}, "
              f"implied={v['implied_prob']:.1%}, edge={v['edge']:.1%}, "
              f"EV={v['ev_per_unit']:.3f}")

Limitations and Edge Cases

Small sample xG is unstable. A team’s xG per match stabilizes after roughly 8-10 league matches. For international teams that play 10-15 competitive matches per year, xG estimates carry wide confidence intervals. An agent betting on World Cup 2026 matches must use Bayesian priors from club-level data to supplement sparse international xG data — the World Cup 2026 Betting Math guide covers this approach.
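
One lightweight way to implement that prior is empirical-Bayes shrinkage: pull the small-sample xG rate toward a prior rate, with the prior's strength expressed in pseudo-matches. A minimal sketch — the prior_weight value and the example numbers are assumptions, not fitted constants:

```python
def shrunk_xg_rate(total_xg: float, n_matches: int,
                   prior_rate: float, prior_weight: float = 10.0) -> float:
    """
    Gamma-Poisson posterior mean: blend observed xG with a prior rate.
    prior_weight is the prior's strength in pseudo-matches -- a tuning
    choice, not a derived constant.
    """
    return (prior_weight * prior_rate + total_xg) / (prior_weight + n_matches)

# National team: 6 matches, 11.4 total xG (1.90/match), club-based prior 1.40
rate = shrunk_xg_rate(total_xg=11.4, n_matches=6, prior_rate=1.40)
print(f"Shrunk rate: {rate:.3f}")  # pulled from 1.90 toward 1.40
```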

xG ignores goalkeeper quality. Standard xG models condition on shot location and context but not on which goalkeeper is in net. A shot with xG = 0.15 against Thibaut Courtois converts at a different rate than the same shot against a League Two keeper. Post-shot xG (PSxG) models account for shot placement and keeper positioning but require video-derived data that most free sources lack.

xG does not capture non-shot situations. Dangerous attacks that don’t produce shots — a through ball where the striker slips, a 2-on-1 broken up at the last second — generate zero xG despite representing genuine goal threat. This means xG systematically underestimates the quality of teams that create high-danger situations but don’t get shots off.

The Poisson independence assumption fails in blowouts. When a team goes up 3-0, game dynamics change: the leading team defends deeper, the trailing team pushes forward desperately. The Poisson model assumes each team’s goals are independent, which breaks down in lopsided game states. A bivariate Poisson or Dixon-Coles adjustment (which inflates low-scoring draw probabilities) partially addresses this.
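
The Dixon-Coles correction can be sketched as a multiplicative adjustment to the four low-score cells of the scoreline matrix (Dixon & Coles 1997); ρ is a parameter fitted from data, and the value below is only illustrative:

```python
import numpy as np
from scipy.stats import poisson

def dixon_coles_matrix(lam_home: float, lam_away: float,
                       rho: float = -0.1, max_goals: int = 8) -> np.ndarray:
    """Independent-Poisson scoreline matrix with the Dixon-Coles
    low-score correction. Negative rho inflates 0-0 and 1-1."""
    h = poisson.pmf(np.arange(max_goals + 1), lam_home)
    a = poisson.pmf(np.arange(max_goals + 1), lam_away)
    m = np.outer(h, a)
    # tau corrections, applied to the four low-score cells
    m[0, 0] *= 1 - lam_home * lam_away * rho
    m[0, 1] *= 1 + lam_home * rho
    m[1, 0] *= 1 + lam_away * rho
    m[1, 1] *= 1 - rho
    return m

m = dixon_coles_matrix(1.74, 1.11)
print(f"P(0-0) = {m[0, 0]:.4f}, matrix sum = {m.sum():.4f}")
```

The four corrections cancel exactly, so the adjusted matrix still sums to (approximately) 1 — probability only shifts between the low-score cells.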

Regression timing is unpredictable. An xG overperformer will regress — but “when” is unknown. A team overperforming by +5 xG over 20 matches might continue for another 5 matches before regressing. The agent must size bets accounting for this variance. Blindly fading every overperformer immediately loses money in the short run roughly 40% of the time.

Data source inconsistencies. StatsBomb, Opta, and Understat use different xG models with different features. StatsBomb xG ≈ 0.08 for a given shot might be Opta xG ≈ 0.11 for the same shot. An agent must choose one source and stick with it — mixing xG from different providers creates systematic bias.

FAQ

What is expected goals (xG) in soccer betting?

Expected goals (xG) is the probability that a given shot results in a goal, based on features like shot distance, angle, body part, and assist type. A shot with xG = 0.12 scores roughly 12% of the time. Match-level xG sums all individual shot xG values to estimate how many goals a team “deserved” to score, providing a more predictive metric than actual goals.

How do you build an xG model with logistic regression?

An xG model uses logistic regression where the target variable is binary (goal or no goal) and features include shot distance from goal center, shot angle, body part, assist type, and game state. The formula is P(goal) = 1 / (1 + e^-(β₀ + β₁·distance + β₂·angle + …)). Train on 10,000+ labeled shots from StatsBomb or similar data sources to get reliable coefficients.

How does xG connect to Poisson models for match prediction?

Team-level xG rates become the lambda parameters in a Poisson distribution. If the home team’s expected scoring rate is 1.65 xG and the away team’s is 1.10 xG, you compute P(home=h, away=a) = Poisson(h|1.65) * Poisson(a|1.10) for each scoreline to get win/draw/loss probabilities. This bridges shot-level analytics to match outcome betting. The full Poisson derivation is in the Poisson Distribution guide.

Why does xG outperform raw goals for predicting future results?

Goals are high-variance Poisson events. A team averaging 1.3 xG per match will score 0 goals 27% of the time and 3+ goals 14% of the time — raw goal counts are noisy over small samples. xG stabilizes much faster (within 8-10 matches) because it measures shot quality rather than binary outcomes, making it a superior predictor of future scoring.

How can betting agents exploit xG-to-goals divergence?

When a team’s actual goals significantly exceed their xG (overperformance) or fall below it (underperformance), regression toward xG-predicted levels is statistically expected. An agent can fade teams on hot shooting streaks by betting unders or against them in Asian handicap markets, and back underperforming teams whose xG suggests better results are coming. See the sharp betting hub for more regression-based strategies.

What’s Next

The xG framework converts raw shot data into match outcome probabilities — the core intelligence an agent needs for soccer betting. From here, the math extends in two directions.

  • Tournament application: World Cup 2026 Betting Math adapts the xG-Poisson pipeline for tournament-specific challenges — sparse international data, group stage dynamics, and knockout round adjustments.
  • The Poisson foundation: If you need a deeper dive into Poisson distributions and their properties, the Poisson Distribution in Sports Modeling guide covers everything from PMF derivation to overdispersion corrections.
  • Feature engineering for models: The Feature Engineering for Sports Prediction guide covers how to select, transform, and validate features (including xG-derived features) for any sports prediction model.
  • Full agent architecture: See where the xG model fits in the four-layer pipeline in the Agent Betting Stack.
  • Live odds comparison: Use The Odds API pipeline to compare your xG-derived probabilities against live market odds across dozens of sportsbooks simultaneously.