Features are the signal your model sees. Raw stats are noise until you transform them — per-possession efficiency, EWMA rolling windows (alpha = 2/(span+1)), opponent adjustments, and interaction terms extract the predictive structure. LASSO selects the features that matter. Data leakage is the #1 model-killer in sports prediction. Get the features right and mediocre algorithms win; get them wrong and no algorithm saves you.

Why This Matters for Agents

An autonomous betting agent’s prediction quality is bounded by its features, not its model architecture. XGBoost on good features beats a neural network on bad features every time. This is Layer 4 — Intelligence — and feature engineering is the first operation in the intelligence pipeline: raw data enters, structured signal exits, and that signal feeds into whatever model the agent runs (logistic regression, gradient boosting, neural nets).

The agent pulls odds from The Odds API via the edge detection pipeline, pulls box-score data from sports APIs, transforms that data into features, runs predictions, compares against market-implied probabilities, and bets when it finds edge. Every step downstream of feature engineering inherits its quality. An agent operating within the Agent Betting Stack framework needs a feature pipeline that is automated, temporally safe (no leakage), and sport-aware. This guide builds that pipeline.

The Math

The Feature Hierarchy

Sports prediction features form a four-level hierarchy. Each level adds predictive power but also complexity and latency:

Level 4: Interaction Features     (matchup-specific)
         ↑
Level 3: Opponent-Adjusted        (context-aware)
         ↑
Level 2: Rolling / Windowed       (recency-weighted)
         ↑
Level 1: Raw Box-Score Stats      (per-game counts)

Level 1 — Raw features are direct box-score outputs: points scored, yards gained, shots on target, rebounds. They are noisy because they conflate team ability with opponent quality, pace, and game context.

Level 2 — Rolling/windowed features apply temporal aggregation. Instead of using a single game’s stats, you average over a window of recent games. This reduces noise.

Level 3 — Opponent-adjusted features subtract or normalize by opponent quality. Scoring 110 points against the worst defense in the NBA means less than scoring 100 against the best.

Level 4 — Interaction features capture matchup-specific dynamics. A pass-heavy NFL offense facing a weak pass defense is a different proposition than that same offense facing a strong one.

Exponentially Weighted Moving Averages (EWMA)

Flat (simple) rolling averages weight all games in the window equally. A 5-game simple moving average gives 20% weight to each game. This is suboptimal because team performance is non-stationary — injuries happen, lineups change, form fluctuates.

EWMA solves this by weighting recent games more heavily:

EWMA_t = alpha * x_t + (1 - alpha) * EWMA_{t-1}

where:
  EWMA_t = the smoothed value after observing game t
  x_t    = the raw stat from game t
  alpha  = smoothing factor = 2 / (span + 1)
  span   = the effective window size (e.g., 5 games)

For a 5-game span, alpha = 2/(5+1) = 0.333. The weight assigned to each past game decays geometrically:

Game t   (most recent): weight = 0.333
Game t-1:               weight = 0.333 * 0.667 = 0.222
Game t-2:               weight = 0.333 * 0.667^2 = 0.148
Game t-3:               weight = 0.333 * 0.667^3 = 0.099
Game t-4:               weight = 0.333 * 0.667^4 = 0.066

The most recent game gets 33% of the weight vs. 20% in a flat average. After 5 games, the cumulative weight is 86.8% — the tail beyond the window still contributes but diminishes rapidly.

Why EWMA outperforms flat averages: If a starting quarterback gets injured in game t-1 and the backup plays game t, EWMA assigns 33% weight to the backup’s performance immediately. A flat 5-game average still gives 80% weight to the starter’s games. EWMA adapts faster.
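The recursion and the weight table are two views of the same computation. A quick sketch (illustrative game scores; the seed convention keeps leftover weight on the oldest observation) unrolls the recursion into explicit geometric weights and confirms they agree:

```python
import numpy as np

def ewma_recursive(xs, span):
    """Recursive form, seeded with the first observation."""
    alpha = 2 / (span + 1)
    ewma = xs[0]
    for x in xs[1:]:
        ewma = alpha * x + (1 - alpha) * ewma
    return ewma

def ewma_unrolled(xs, span):
    """Explicit weights: alpha * (1 - alpha)^k on the k-th most recent game,
    with the leftover (1 - alpha)^(n-1) mass on the seed observation."""
    alpha = 2 / (span + 1)
    n = len(xs)
    weights = np.array([alpha * (1 - alpha) ** k for k in range(n - 1)])
    recent_first = np.array(xs[:0:-1], dtype=float)  # xs[-1], ..., xs[1]
    return float(weights @ recent_first + (1 - alpha) ** (n - 1) * xs[0])

points = [20, 17, 28, 31, 24]
print(ewma_recursive(points, span=5))  # ≈ 24.667
print(ewma_unrolled(points, span=5))   # identical by construction
```

The geometric weights here are exactly the 0.333, 0.222, 0.148, ... decay from the table, which is why the two functions return the same number.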

Opponent Adjustment

Raw stats conflate team ability with schedule difficulty. A team averaging 28 points per game against the five worst defenses in the league is not the same as a team averaging 24 points against average defenses.

The simplest opponent adjustment subtracts league-average opponent performance:

OppAdj_i = Raw_i - Opp_Avg_i + League_Avg

where:
  Raw_i      = team's raw stat in game i
  Opp_Avg_i  = opponent's season average for that stat (excluding this game)
  League_Avg = league-wide average for that stat

A team scores 31 points against an opponent that allows 28 points per game (league average is 23):

OppAdj = 31 - 28 + 23 = 26

The raw 31 points adjusts down to 26 — the opponent was weak, so the raw number overstated true offensive ability.

For a more rigorous approach, use iterative strength-of-schedule adjustment (SRS-style):

Rating_team = Avg_MOV_team + Avg_Rating_opponents

Solved iteratively until convergence (typically 10-20 iterations).

This accounts for the circular dependency: your opponents’ strength depends on their opponents’ strength, and so on.
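A minimal sketch of that iteration on an invented four-team round-robin league (team names, margins, and schedule are made up for illustration):

```python
import numpy as np

# Toy 4-team round-robin league; average margins of victory are invented.
teams = ["A", "B", "C", "D"]
avg_mov = np.array([8.0, 2.0, -3.0, -7.0])                # points per game
opponents = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]  # who each team played

ratings = avg_mov.copy()
for _ in range(50):
    # Rating_team = Avg_MOV_team + Avg_Rating_opponents
    new = np.array([
        avg_mov[i] + np.mean([ratings[j] for j in opponents[i]])
        for i in range(len(teams))
    ])
    converged = np.max(np.abs(new - ratings)) < 1e-9
    ratings = new
    if converged:
        break

for t, r in zip(teams, ratings):
    print(f"{t}: {r:+.2f}")
# Team A's +8.0 raw margin shrinks to +6.0 once its weak opponents are priced in.
```

The iteration converges because each pass shrinks the error by the fraction of schedule overlap; in this symmetric toy league it settles in about twenty passes, consistent with the 10-20 iterations quoted above.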

Interaction Features

Interaction features capture matchup-specific effects that individual team stats miss. The canonical example in NFL:

pass_offense_vs_pass_defense = Team_A_pass_yards_per_attempt * (Team_B_pass_yards_allowed_per_attempt / League_Avg_pass_yards_per_attempt)

This estimates expected production in the matchup: a defense that allows more than league average scales the offense's number up, a stingy defense scales it down. More generally, for any offensive stat X and the corresponding defensive stat allowed Y (same units, with higher Y meaning a weaker defense):

matchup_advantage = (Team_off_X - League_Avg_X) + (Team_def_Y - League_Avg_Y)

A positive matchup_advantage means the offense is above average, the defense allows more than average, or both — a compounding edge for the offense. A strong defense drives the second term negative and can erase an above-average offense's edge.

Feature Scaling

Different model types require different scaling:

Model Type                              Scaling Method               Formula                            When to Use
Linear regression, logistic regression  Standardization (z-score)    z = (x - mean) / std               Coefficients must be comparable
Neural networks                         Min-max normalization        x_norm = (x - min) / (max - min)   Activations need bounded inputs
LASSO, Ridge, Elastic Net               Standardization              z = (x - mean) / std               Regularization penalizes magnitude
XGBoost, Random Forest                  None required                --                                 Tree splits are scale-invariant
KNN, SVM                                Standardization or min-max   Either of the above                Distance metrics are scale-sensitive

Standardization uses training set statistics only. Applying test-set statistics introduces data leakage.
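The safe pattern, sketched with scikit-learn on synthetic data (the distributions are invented; the point is which statistics get reused):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=10.0, size=(200, 3))  # past games
X_test = rng.normal(loc=105.0, scale=12.0, size=(50, 3))    # future games (drifted)

scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)  # fit learns TRAIN mean/std only
X_test_z = scaler.transform(X_test)        # test reuses the train statistics

# Deliberately NOT re-fitting on the test set: if the future distribution
# drifted, the model should see that drift rather than have it normalized away.
print(X_train_z.mean(axis=0))  # ~0 by construction
print(X_test_z.mean(axis=0))   # nonzero: the drift is preserved
```

Calling `fit_transform` on the test set instead would silently leak test-set statistics into the features.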

Feature Selection: LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty to the loss function:

Loss_LASSO = Σ(y_i - X_i * beta)^2 + lambda * Σ|beta_j|

where:
  y_i    = target variable (1 = win, 0 = loss, or point spread)
  X_i    = feature vector for game i
  beta_j = coefficient for feature j
  lambda = regularization strength (higher = more aggressive selection)

The L1 penalty drives small coefficients exactly to zero, effectively removing those features. The features that survive LASSO with nonzero coefficients are the ones the model deems predictive after accounting for multicollinearity.

Cross-validate lambda on a temporal holdout (not random k-fold — temporal ordering matters in sports).
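A sketch of temporal cross-validation with scikit-learn's TimeSeriesSplit on synthetic, chronologically ordered data (the feature count and coefficients are invented):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
n = 300
X = rng.normal(size=(n, 8))  # rows assumed to be in chronological order
y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # every fold trains strictly on the past and validates on the future
    assert train_idx.max() < val_idx.min()

lasso = LassoCV(cv=tscv, max_iter=10000).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print("optimal lambda:", lasso.alpha_)
print("selected feature indices:", selected)  # should include 0 and 3
```

With random k-fold, each "validation" fold would contain games from the past of its training fold — exactly the leakage pattern temporal splitting exists to prevent.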

Mutual Information

For non-linear relationships (common in sports), mutual information (MI) quantifies how much knowing a feature reduces uncertainty about the target:

MI(X; Y) = Σ_x Σ_y p(x,y) * log(p(x,y) / (p(x) * p(y)))

where:
  X = feature
  Y = target (win/loss, spread outcome)
  p(x,y) = joint probability distribution
  p(x), p(y) = marginal distributions

MI = 0 means the feature and target are independent. Higher MI means more predictive power. Unlike correlation, MI captures non-linear dependencies.
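The definition is easy to verify by hand on a tiny discrete joint distribution (toy probabilities, not sports data):

```python
import numpy as np

def mutual_information(joint):
    """MI in nats from a 2-D joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = joint > 0                       # convention: 0 * log(0) = 0
    return float((joint[mask] * np.log(joint[mask] / (px * py)[mask])).sum())

# Independent feature: p(x, y) = p(x) p(y) everywhere -> MI = 0
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Deterministic feature: knowing X pins down Y -> MI = log(2) ≈ 0.693 nats
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # ≈ 0.6931
```

In practice you estimate MI from samples (e.g. scikit-learn's mutual_info_classif, used in the implementation below) rather than from a known joint table, but the quantity being estimated is this one.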

Worked Examples

Example 1: NFL Feature Construction

Building features for an NFL game prediction: Dallas Cowboys at Philadelphia Eagles, Week 14, 2025 season.

Raw box-score features (season averages through Week 13):

Dallas Cowboys (Offense):
  Points/game:              22.3
  Pass yards/attempt:        7.1
  Rush yards/carry:          4.4
  EPA/play (offense):       +0.04
  3rd-down conversion %:    39.8%
  Turnover margin:          -3

Philadelphia Eagles (Defense):
  Points allowed/game:      19.8
  Pass yards allowed/att:    6.2
  Rush yards allowed/carry:  3.8
  EPA/play (defense):       -0.08
  Sack rate:                 8.2%
  Takeaways:                18

Derived features (opponent-adjusted, using 5-game EWMA):

import numpy as np

# Dallas 5-game EWMA offensive points: [20, 17, 28, 31, 24]
dallas_pts = np.array([20, 17, 28, 31, 24])
alpha = 2 / (5 + 1)  # 0.333

ewma = dallas_pts[0]
for pt in dallas_pts[1:]:
    ewma = alpha * pt + (1 - alpha) * ewma
dallas_ewma_pts = ewma  # ≈ 24.7 (recent hot streak reflected)

# Opponent adjustment
philly_def_pts_allowed_avg = 19.8  # season average
league_avg_pts = 22.5

dallas_adj_pts = dallas_ewma_pts - philly_def_pts_allowed_avg + league_avg_pts
# ≈ 24.7 - 19.8 + 22.5 = 27.4
# Adjusted UP: Philly allows fewer points than the league average, so Dallas's
# raw number understates what its offense would do against an average defense.

print(f"Dallas raw EWMA points: {dallas_ewma_pts:.1f}")
print(f"Dallas opponent-adjusted points vs PHI: {dallas_adj_pts:.1f}")

The interaction feature for Cowboys pass offense vs. Eagles pass defense:

# Matchup advantage: positive = offense has the edge over this specific defense
dal_pass_ya_vs_avg = 7.1 - 6.8   # +0.3: offense above league average
phi_pass_ya_vs_avg = 6.2 - 6.8   # -0.6: defense allows less than average (good defense)

pass_matchup_advantage = dal_pass_ya_vs_avg + phi_pass_ya_vs_avg
# = 0.3 + (-0.6) = -0.3
# Negative: Philly's above-average pass defense more than offsets Dallas's
# above-average pass offense, so this matchup tilts slightly to the defense.

print(f"Pass matchup advantage (DAL off vs PHI def): {pass_matchup_advantage:.2f}")

Example 2: NBA Rolling Window Comparison

Comparing flat average vs. EWMA for the Boston Celtics’ offensive rating over a 10-game window. The Celtics traded for a new point guard after game 5, changing their offensive scheme:

import numpy as np

# Celtics offensive rating (points per 100 possessions), games 1-10
off_rtg = np.array([112.3, 109.8, 111.5, 110.1, 108.7,  # pre-trade
                     118.4, 121.2, 119.8, 122.1, 120.5])  # post-trade

# Flat 5-game average for prediction before game 11
flat_avg = np.mean(off_rtg[-5:])  # 120.4

# EWMA (span=5) for prediction before game 11
alpha = 2 / (5 + 1)
ewma = off_rtg[0]
for x in off_rtg[1:]:
    ewma = alpha * x + (1 - alpha) * ewma
ewma_val = ewma  # ≈ 119.3

# Both capture the post-trade improvement, but the EWMA still carries some
# pre-trade baseline: the flat average of the last 5 games (all post-trade)
# is 120.4, while the EWMA over all 10 games, recency-weighted, is ≈ 119.3.

print(f"Flat 5-game avg:  {flat_avg:.1f}")
print(f"EWMA (span=5):    {ewma_val:.1f}")
print(f"True post-trade avg: {np.mean(off_rtg[5:]):.1f}")

For a betting line of 118.5 total offensive rating, both features say "over," but the EWMA is more conservative because it still discounts toward the pre-trade baseline. The flat average matches the post-trade mean here only because the 5-game window happens to align exactly with the trade. In practice, agents prefer EWMA because it degrades gracefully when regime changes (trades, injuries) aren't perfectly delineated.

Example 3: Data Leakage in a Spread Model

A naive modeler builds an NFL spread prediction model and achieves 62% ATS accuracy on the test set. The features include:

LEAKED FEATURE: season_win_pct (uses full season record, including future games)
LEAKED FEATURE: final_power_ranking (published post-season)
LEAKED FEATURE: post_game_injury_report (uses info from AFTER the game)

When these are removed and replaced with temporally safe versions:

SAFE: rolling_win_pct (wins in last 5 games only, computed pre-game)
SAFE: pre_week_power_ranking (Elo rating as of game morning)
SAFE: pre_game_injury_report (listed status before kickoff)

Accuracy drops to 53.8% ATS — still above 50% but nowhere near the illusory 62%. The 8.2 percentage points were data leakage, not real signal. An agent deploying the leaked model would bet aggressively on phantom edge and hemorrhage bankroll.
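The mechanics of the rolling-average leak — a window that includes the target game — can be reproduced in a few lines (synthetic data; the point is the shift, not the numbers):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
games = pd.DataFrame({"points": rng.normal(24, 7, size=100).round()})

# LEAKY: the rolling window includes the current game, so the "feature"
# already contains the very value the model is asked to predict.
games["leaky_avg"] = games["points"].rolling(5, min_periods=1).mean()

# SAFE: shift(1) first, so the window ends at the PREVIOUS game.
games["safe_avg"] = games["points"].shift(1).rolling(5, min_periods=1).mean()

# The leaky feature correlates with the target by construction; the safe one
# can only correlate through genuine persistence (none here, by design).
print("leaky corr:", games["points"].corr(games["leaky_avg"]))
print("safe corr: ", games["points"].corr(games["safe_avg"]))
```

On pure noise the leaky feature still shows substantial correlation with the target — signal that exists only in backtests, never at prediction time.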

Implementation

"""
SportFeatureEngine: Feature engineering framework for sports prediction agents.
Layer 4 — Intelligence module of the Agent Betting Stack.

Requires: pip install pandas numpy scikit-learn
"""

import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler


class SportFeatureEngine:
    """
    Builds temporally safe features for sports prediction models.

    All features are computed using only data available BEFORE the game.
    No future data ever leaks into feature construction.
    """

    def __init__(self, ewma_span: int = 5):
        """
        Args:
            ewma_span: Effective window for exponentially weighted moving averages.
                       alpha = 2 / (ewma_span + 1).
        """
        self.ewma_span = ewma_span
        self.alpha = 2 / (ewma_span + 1)
        self.scaler = StandardScaler()

    def add_rolling_features(
        self,
        df: pd.DataFrame,
        stat_cols: list[str],
        group_col: str = "team",
        date_col: str = "game_date",
    ) -> pd.DataFrame:
        """
        Add EWMA rolling features for each stat column, grouped by team.
        Features are shifted by 1 to prevent leakage (uses only pre-game data).

        Args:
            df: DataFrame with one row per team-game. Must be sorted by date.
            stat_cols: Columns to compute rolling averages for.
            group_col: Column identifying the team.
            date_col: Column with game dates (used for sorting).

        Returns:
            DataFrame with new columns: {stat}_ewma_{span} for each stat.
        """
        df = df.sort_values([group_col, date_col]).copy()

        for col in stat_cols:
            ewma_col = f"{col}_ewma_{self.ewma_span}"
            # shift(1) ensures we only use data from BEFORE this game;
            # adjust=False matches the recursion EWMA_t = alpha*x_t + (1-alpha)*EWMA_{t-1}
            df[ewma_col] = (
                df.groupby(group_col)[col]
                .transform(
                    lambda x: x.shift(1)
                    .ewm(span=self.ewma_span, adjust=False, min_periods=1)
                    .mean()
                )
            )

        return df

    def add_opponent_adjusted(
        self,
        df: pd.DataFrame,
        stat_cols: list[str],
        opp_col: str = "opponent",
        group_col: str = "team",
        date_col: str = "game_date",
    ) -> pd.DataFrame:
        """
        Add opponent-adjusted features.

        Formula: OppAdj_i = Raw_i - Opp_Season_Avg + League_Avg
        Opponent averages exclude the current game (shift by 1).

        Args:
            df: DataFrame sorted by date with team and opponent columns.
            stat_cols: Columns to opponent-adjust.
            opp_col: Column identifying the opponent team.

        Returns:
            DataFrame with new columns: {stat}_opp_adj for each stat.
        """
        df = df.sort_values([group_col, date_col]).copy()

        for col in stat_cols:
            # Opponent's expanding average of this stat ALLOWED, using only
            # games before the current one (date-sorted within each opponent).
            opp_sorted = df.sort_values([opp_col, date_col])
            opp_season_avg = (
                opp_sorted.groupby(opp_col)[col]
                .transform(lambda x: x.shift(1).expanding().mean())
            )

            # League-wide expanding average, in chronological order.
            date_sorted = df.sort_values(date_col)
            league_avg = date_sorted[col].expanding().mean().shift(1)

            ewma_col = f"{col}_ewma_{self.ewma_span}"
            source_col = ewma_col if ewma_col in df.columns else col

            # pandas aligns Series on the index, so the date-sorted
            # intermediates map back to the correct rows here.
            df[f"{col}_opp_adj"] = df[source_col] - opp_season_avg + league_avg

        return df

    def add_interaction_features(
        self,
        df: pd.DataFrame,
        offense_col: str,
        defense_col: str,
        feature_name: str,
    ) -> pd.DataFrame:
        """
        Add a matchup interaction feature.

        matchup_advantage = (off_stat - league_avg) + (def_stat - league_avg)

        defense_col must be an "allowed" stat in the same units as offense_col
        (higher = weaker defense), taken from the opponent's defensive profile.

        Args:
            df: DataFrame with the offense stat and the opponent's defense stat.
            offense_col: Column for the team's offensive stat.
            defense_col: Column for the opponent's corresponding stat allowed.
            feature_name: Name for the new interaction column.

        Returns:
            DataFrame with the new interaction column.
        """
        df = df.copy()
        # Center each side on its overall mean. For strict temporal safety,
        # substitute pre-computed expanding league averages for these means.
        df[feature_name] = (
            (df[offense_col] - df[offense_col].mean())
            + (df[defense_col] - df[defense_col].mean())
        )
        return df

    def handle_missing(
        self,
        df: pd.DataFrame,
        stat_cols: list[str],
        strategy: str = "median",
    ) -> pd.DataFrame:
        """
        Impute missing values and add missingness indicator columns.

        Args:
            df: DataFrame with potential NaN values.
            stat_cols: Columns to impute.
            strategy: "median" or "zero".

        Returns:
            DataFrame with imputed values and {col}_missing indicator columns.
        """
        df = df.copy()
        for col in stat_cols:
            missing_mask = df[col].isna()
            df[f"{col}_missing"] = missing_mask.astype(int)

            if strategy == "median":
                df[col] = df[col].fillna(df[col].median())
            elif strategy == "zero":
                df[col] = df[col].fillna(0)

        return df

    def select_features_lasso(
        self,
        X_train: pd.DataFrame,
        y_train: pd.Series,
        cv_folds: int = 5,
    ) -> list[str]:
        """
        Select features using LASSO with cross-validated lambda.

        Uses TimeSeriesSplit so every fold trains on the past and validates
        on the future. Returns feature names with nonzero coefficients.

        Args:
            X_train: Training features in chronological row order
                     (unscaled; standardized internally).
            y_train: Training target (binary or continuous).
            cv_folds: Number of temporal CV folds.

        Returns:
            List of selected feature names.
        """
        from sklearn.model_selection import TimeSeriesSplit

        X_scaled = self.scaler.fit_transform(X_train)

        lasso = LassoCV(cv=TimeSeriesSplit(n_splits=cv_folds), max_iter=10000)
        lasso.fit(X_scaled, y_train)

        selected = X_train.columns[np.abs(lasso.coef_) > 1e-6].tolist()

        print(f"LASSO selected {len(selected)}/{X_train.shape[1]} features")
        print(f"Optimal lambda: {lasso.alpha_:.6f}")
        print(f"Selected: {selected}")

        return selected

    def select_features_mi(
        self,
        X_train: pd.DataFrame,
        y_train: pd.Series,
        top_k: int = 15,
    ) -> list[str]:
        """
        Select top-k features by mutual information with the target.

        Captures non-linear relationships that LASSO misses.

        Args:
            X_train: Training features.
            y_train: Training target (binary classification).
            top_k: Number of top features to select.

        Returns:
            List of top-k feature names sorted by MI score.
        """
        mi_scores = mutual_info_classif(X_train.fillna(0), y_train, random_state=42)
        mi_df = pd.DataFrame({
            "feature": X_train.columns,
            "mi_score": mi_scores
        }).sort_values("mi_score", ascending=False)

        selected = mi_df.head(top_k)["feature"].tolist()

        print(f"Top {top_k} features by mutual information:")
        for _, row in mi_df.head(top_k).iterrows():
            print(f"  {row['feature']:<35s} MI = {row['mi_score']:.4f}")

        return selected

    def temporal_split(
        self,
        df: pd.DataFrame,
        date_col: str = "game_date",
        test_start: str = "2025-11-01",
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        """
        Split data temporally. All games before test_start go to train,
        all games on or after go to test. Never use random splits for
        time-series sports data.

        Args:
            df: Full dataset sorted by date.
            date_col: Date column name.
            test_start: ISO date string for test set start.

        Returns:
            (train_df, test_df) tuple.
        """
        df = df.copy()
        df[date_col] = pd.to_datetime(df[date_col])
        cutoff = pd.Timestamp(test_start)

        train = df[df[date_col] < cutoff].copy()
        test = df[df[date_col] >= cutoff].copy()

        print(f"Train: {len(train)} games (before {test_start})")
        print(f"Test:  {len(test)} games (from {test_start})")

        return train, test


def build_nfl_features(games_df: pd.DataFrame) -> pd.DataFrame:
    """
    Build a complete NFL feature set from raw game data.

    Expected columns in games_df:
        team, opponent, game_date, season, week,
        points_scored, points_allowed, pass_yards, rush_yards,
        pass_attempts, rush_attempts, turnovers, sacks_allowed,
        opponent_pass_yards, opponent_rush_yards, result (1=win, 0=loss)

    Returns:
        DataFrame with engineered features ready for modeling.
    """
    engine = SportFeatureEngine(ewma_span=5)

    # Level 1: Derived per-play efficiency
    games_df = games_df.copy()
    games_df["yards_per_pass_att"] = games_df["pass_yards"] / games_df["pass_attempts"].clip(lower=1)
    games_df["yards_per_rush"] = games_df["rush_yards"] / games_df["rush_attempts"].clip(lower=1)
    games_df["point_differential"] = games_df["points_scored"] - games_df["points_allowed"]
    games_df["total_yards"] = games_df["pass_yards"] + games_df["rush_yards"]
    games_df["pass_ratio"] = games_df["pass_attempts"] / (
        games_df["pass_attempts"] + games_df["rush_attempts"]
    ).clip(lower=1)

    stat_cols = [
        "points_scored", "points_allowed", "yards_per_pass_att",
        "yards_per_rush", "point_differential", "total_yards",
        "turnovers", "sacks_allowed", "pass_ratio",
        "pass_yards", "opponent_pass_yards",
    ]

    # Level 2: Rolling EWMA
    games_df = engine.add_rolling_features(games_df, stat_cols)

    # Level 3: Opponent adjustment
    games_df = engine.add_opponent_adjusted(
        games_df,
        stat_cols=["points_scored", "yards_per_pass_att", "yards_per_rush"],
    )

    # Level 4: Interaction features.
    # opponent_pass_yards is the passing yardage a team ALLOWED, so its EWMA
    # tracks that team's pass-defense form. Merge each opponent's defensive
    # EWMA onto the row so the interaction pairs this offense with THAT defense.
    def_form = games_df[["team", "game_date", "opponent_pass_yards_ewma_5"]].rename(
        columns={
            "team": "opponent",
            "opponent_pass_yards_ewma_5": "opp_def_pass_yards_allowed_ewma_5",
        }
    )
    games_df = games_df.merge(def_form, on=["opponent", "game_date"], how="left")
    games_df = engine.add_interaction_features(
        games_df,
        offense_col="pass_yards_ewma_5",
        defense_col="opp_def_pass_yards_allowed_ewma_5",
        feature_name="pass_matchup_advantage",
    )

    # Handle missing (early-season games with insufficient history)
    feature_cols = [c for c in games_df.columns if "ewma" in c or "opp_adj" in c or "matchup" in c]
    games_df = engine.handle_missing(games_df, feature_cols)

    return games_df


# --- Sport-Specific Feature Libraries ---

NFL_FEATURES = {
    "raw": [
        "points_scored", "points_allowed", "pass_yards", "rush_yards",
        "pass_attempts", "rush_attempts", "completions", "interceptions",
        "fumbles_lost", "sacks_allowed", "penalties", "penalty_yards",
        "third_down_conv_pct", "redzone_td_pct", "time_of_possession",
    ],
    "derived": [
        "epa_per_play",               # Expected Points Added per play
        "cpoe",                        # Completion % Over Expected
        "yards_per_pass_attempt",      # pass_yards / pass_attempts
        "yards_per_rush",              # rush_yards / rush_attempts
        "dvoa_offense",                # DVOA offensive component
        "dvoa_defense",                # DVOA defensive component
        "point_differential",          # points_scored - points_allowed
        "turnover_margin",             # takeaways - giveaways
        "pass_ratio",                  # pass_att / total_plays
    ],
    "contextual": [
        "home_away",                   # 1 = home, 0 = away
        "rest_days",                   # days since last game
        "travel_distance_miles",       # distance traveled for away games
        "dome_indicator",              # 1 = dome/indoor, 0 = outdoor
        "temperature_f",              # game-time temperature
        "wind_speed_mph",             # game-time wind speed
    ],
}

NBA_FEATURES = {
    "raw": [
        "points", "rebounds", "assists", "steals", "blocks",
        "turnovers", "field_goals_made", "field_goals_attempted",
        "three_pointers_made", "three_pointers_attempted",
        "free_throws_made", "free_throws_attempted",
    ],
    "derived": [
        "offensive_rating",            # points per 100 possessions
        "defensive_rating",            # points allowed per 100 possessions
        "net_rating",                  # off_rtg - def_rtg
        "pace",                        # possessions per 48 minutes
        "true_shooting_pct",           # PTS / (2 * (FGA + 0.44 * FTA))
        "effective_fg_pct",            # (FGM + 0.5 * 3PM) / FGA
        "turnover_rate",               # TOV / possessions
        "offensive_rebound_pct",       # OREB / (OREB + OPP_DREB)
        "free_throw_rate",             # FTA / FGA
    ],
    "contextual": [
        "home_away", "rest_days", "back_to_back",
        "altitude_feet", "travel_distance_miles",
    ],
}

MLB_FEATURES = {
    "raw": [
        "runs", "hits", "home_runs", "walks_drawn", "strikeouts_batting",
        "runs_allowed", "hits_allowed", "walks_allowed", "strikeouts_pitching",
        "errors",
    ],
    "derived": [
        "wrc_plus",                    # Weighted Runs Created Plus (park/league adjusted)
        "fip",                         # Fielding Independent Pitching
        "barrel_rate",                 # barrels / batted ball events
        "hard_hit_rate",               # exit velo >= 95mph / BBE
        "whip",                        # (walks + hits) / innings pitched
        "k_rate",                      # strikeouts / plate appearances
        "bb_rate",                     # walks / plate appearances
        "babip",                       # batting avg on balls in play
        "run_differential",            # runs - runs_allowed
    ],
    "contextual": [
        "home_away", "park_factor", "starting_pitcher_fip",
        "bullpen_era_last_7", "rest_days_starter",
    ],
}

SOCCER_FEATURES = {
    "raw": [
        "goals", "goals_allowed", "shots", "shots_on_target",
        "corners", "fouls", "yellow_cards", "possession_pct",
    ],
    "derived": [
        "xg",                          # Expected Goals
        "xga",                         # Expected Goals Against
        "xg_difference",               # xG - xGA
        "xg_overperformance",          # goals - xG (positive = clinical finishing)
        "ppda",                        # Passes Per Defensive Action (pressing intensity)
        "shot_conversion_rate",        # goals / shots
        "big_chance_creation",         # high-xG chances created per game
    ],
    "contextual": [
        "home_away", "rest_days", "european_match_midweek",
        "derby_indicator", "league_position_diff",
    ],
}

Limitations and Edge Cases

Small sample sizes. NFL teams play 17 regular-season games. A 5-game EWMA window covers 29% of the season. Early-season predictions (weeks 1-3) rely on fewer than 3 data points — the EWMA is unstable. Use preseason priors (last season’s late-season performance, regressed toward league average) to stabilize early-season features.
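One way to sketch that stabilization — the half-strength constant k and the regression fraction below are tunable assumptions, not standards:

```python
def blended_estimate(ewma, games_played, prior, league_avg,
                     prior_regression=0.5, k=5):
    """Blend a preseason prior with the in-season EWMA.

    The prior (last season's late-season number) is first regressed toward
    league average by prior_regression; its weight then decays as games
    accumulate, with k games played giving 50/50 weight between prior
    and in-season data.
    """
    regressed_prior = prior_regression * prior + (1 - prior_regression) * league_avg
    w_season = games_played / (games_played + k)
    return w_season * ewma + (1 - w_season) * regressed_prior

# Week 1: no games yet -> pure regressed prior (0.5*28 + 0.5*22 = 25.0)
print(blended_estimate(ewma=0.0, games_played=0, prior=28.0, league_avg=22.0))
# Week 10: nine games played -> mostly in-season EWMA
print(blended_estimate(ewma=24.0, games_played=9, prior=28.0, league_avg=22.0))
```

The w = n/(n+k) form is the standard shrinkage weight; any monotone decay works as long as the prior's influence vanishes as the season accumulates.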

Regime changes are invisible to rolling averages. A mid-season coaching change, a major trade, or a scheme overhaul invalidates historical features. EWMA adapts faster than flat averages, but neither can react instantly. An agent should maintain a metadata layer that flags regime changes and optionally resets the rolling window.

Multicollinearity. Offensive rating and points scored are highly correlated. Net rating and point differential are nearly identical. Including both inflates coefficient variance in linear models and slows convergence. LASSO handles this automatically by zeroing out redundant features, but tree models may split on either arbitrarily. Prefer the more stable derived metric (offensive rating over raw points).

Feature drift across seasons. Rule changes (NFL’s expanded roughing-the-passer penalties, NBA’s 3-point revolution, MLB’s pitch clock) shift the distribution of features year over year. Features standardized on 2020 data may not transfer to 2025. Re-fit scalers each season.

Survivorship bias in feature selection. If you run LASSO on the full dataset and then evaluate on a subset, the selected features are contaminated. Always select features on the training set only, then evaluate on the temporal holdout.

Garbage in, garbage out. No feature engineering technique rescues bad source data. If the box-score provider has systematic errors (misattributed assists, incorrect play-by-play), every downstream feature inherits those errors. Validate source data against multiple providers.

FAQ

What features should I use for a sports betting prediction model?

Start with raw box-score stats (points, yards, shots), then build derived features (per-possession efficiency, strength of schedule), rolling windowed features (last-5-game EWMA), and opponent-adjusted metrics. The specific features depend on sport — NFL models use EPA per play and DVOA, NBA models use offensive/defensive rating and pace, MLB models use wRC+ and FIP. See the sport-specific feature libraries in the Implementation section for the complete lists.

How do you avoid data leakage in sports prediction models?

Never use information that would not have been available before the game. Common leaks include season-end stats in mid-season predictions, post-game injury reports, and calculating rolling averages that include the target game. Always split data temporally (train on past, test on future) and compute all features using only pre-game data. The shift(1) call in the rolling feature code above is the critical safeguard.

What is the best feature selection method for sports betting models?

LASSO (L1 regularization) is the standard first choice because it simultaneously fits the model and drives irrelevant feature coefficients to zero. For tree-based models, permutation importance and mutual information scoring work well. Always validate feature selection on a temporal holdout set to avoid overfitting. The regression models guide covers model fitting in detail.

Should I use exponentially weighted moving averages or simple averages for sports prediction?

Exponentially weighted moving averages (EWMA) outperform simple averages for capturing recent form. With EWMA, a 5-game span weights the most recent game at 33% vs. 20% for a flat average. This matters because team performance is non-stationary — injuries, trades, and tactical changes make recent games more predictive than games from months ago. The formula is EWMA_t = alpha * x_t + (1 - alpha) * EWMA_{t-1}, where alpha = 2/(span+1).

How does feature engineering connect to closing line value in sports betting?

Features that consistently generate closing line value (CLV) are validated as predictive. If your model’s pre-game predictions move in the same direction as the closing line, your features are capturing real signal. CLV is the gold standard for evaluating whether your feature engineering pipeline produces genuine edge vs. noise. Feature engineering is the upstream process; CLV measurement is the downstream validation.

What’s Next

Feature engineering feeds directly into the model training and deployment pipeline. The next steps in the series build on this foundation: