When your bot monitors five offshore sportsbooks, you’re dealing with five different data formats, five different naming conventions, and five different update schedules. One book gives you American odds, another gives decimal. One calls them the “LA Lakers,” another says “Los Angeles Lakers,” a third uses “LAL.” Timestamps arrive in different zones, at different intervals, with different levels of staleness.

Normalization is the unglamorous but essential step that makes everything else work. Without it, your arbitrage scanner compares garbage to garbage. Your +EV calculator produces meaningless output. Your line movement tracker can’t stitch together a coherent history.

This guide walks through the full normalization stack: odds format conversion, market matching, timestamp alignment, and a working Python pipeline that ties it all together.


Odds Format Conversion

The first normalization task is the simplest: get every odds value into the same format. The target is implied probability — a float between 0 and 1 that represents the price in a format-agnostic way.

American to implied probability

American odds are the standard for US-facing offshore books like Bovada, BetOnline, and MyBookie. Positive values represent the underdog, negative values represent the favorite.

The conversion:

  • Negative odds (e.g., -150): implied = abs(odds) / (abs(odds) + 100)
  • Positive odds (e.g., +130): implied = 100 / (odds + 100)

So -150 becomes 150 / 250 = 0.60, and +130 becomes 100 / 230 ≈ 0.4348.

Decimal to implied probability

Decimal odds are common on international-facing offshore books and some aggregator APIs. The conversion is straightforward:

implied = 1 / decimal_odds

Decimal 2.50 becomes 1 / 2.50 = 0.40. Decimal 1.67 becomes 1 / 1.67 ≈ 0.5988.

Fractional to implied probability

Fractional odds (e.g., 5/2, 7/4) appear less frequently in offshore sportsbook APIs, but some European-origin books still return them:

implied = denominator / (numerator + denominator)

5/2 becomes 2 / 7 ≈ 0.2857. 7/4 becomes 4 / 11 ≈ 0.3636.

The OddsConverter class

Wrap all conversions into a single utility:

class OddsConverter:
    @staticmethod
    def american_to_prob(odds: int) -> float:
        if odds == 0:
            raise ValueError("American odds cannot be zero")
        if odds < 0:
            return abs(odds) / (abs(odds) + 100)
        return 100 / (odds + 100)

    @staticmethod
    def decimal_to_prob(odds: float) -> float:
        if odds <= 1.0:
            raise ValueError(f"Decimal odds must be > 1.0, got {odds}")
        return 1 / odds

    @staticmethod
    def fractional_to_prob(numerator: int, denominator: int) -> float:
        if numerator <= 0 or denominator <= 0:
            raise ValueError("Fractional odds must be positive")
        return denominator / (numerator + denominator)

    @staticmethod
    def prob_to_american(prob: float) -> int:
        if prob <= 0 or prob >= 1:
            raise ValueError("Probability must be between 0 and 1")
        # round(), not int(): truncation would turn a computed 129.99 into +129
        if prob >= 0.5:
            return round(-prob * 100 / (1 - prob))
        return round((1 - prob) * 100 / prob)

Edge cases to handle: even money (American +100, decimal 2.0) converts cleanly to 0.50. Extremely long odds (e.g., +10000) produce very small probabilities — make sure your downstream code doesn’t choke on 0.0099.
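
A quick sanity check of those edge cases, using a standalone function that mirrors the class above so the snippet runs on its own:

```python
def american_to_prob(odds: int) -> float:
    # Mirrors OddsConverter.american_to_prob
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)

# Even money converts cleanly from either sign
assert american_to_prob(100) == 0.5
assert american_to_prob(-100) == 0.5

# A +10000 long shot is a tiny probability -- guard downstream division
print(round(american_to_prob(10000), 4))  # 0.0099
```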

Note that implied probabilities from a single book will sum to more than 1.0 for a two-way market. The excess is the vig/juice. Don’t “correct” this during normalization — that’s a separate calculation for your juice comparison pipeline.
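
To see the overround concretely, take the classic two-way spread priced at -110 on both sides (a sketch using the same conversion as above):

```python
def american_to_prob(odds: int) -> float:
    return abs(odds) / (abs(odds) + 100) if odds < 0 else 100 / (odds + 100)

# Both sides of a standard spread market at -110
book_total = american_to_prob(-110) + american_to_prob(-110)
print(round(book_total, 4))  # 1.0476 -- roughly 4.76 points of vig
```

That 0.0476 excess belongs to the juice comparison pipeline, not to normalization.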


Market Matching

Format conversion is mechanical. Market matching is where normalization gets messy.

The team name problem

Here’s what the same NBA game looks like across three books:

Source                   Team 1               Team 2
Book A (The Odds API)    Los Angeles Lakers   Boston Celtics
Book B (raw feed)        LA Lakers            Bos Celtics
Book C (raw feed)        LAL                  BOS

Your pipeline needs to understand that these three rows describe the same event. Multiply this across four major sports, hundreds of teams, and international leagues, and you’ve got a serious mapping problem.

Building a canonical name map

Start with a static dictionary that maps known variations to a canonical identifier:

TEAM_ALIASES = {
    "nba": {
        "Los Angeles Lakers": "LAL",
        "LA Lakers": "LAL",
        "L.A. Lakers": "LAL",
        "Lakers": "LAL",
        "Boston Celtics": "BOS",
        "Bos Celtics": "BOS",
        "Celtics": "BOS",
    }
}

def normalize_team(name: str, sport: str) -> str:
    sport_map = TEAM_ALIASES.get(sport, {})
    if name in sport_map:
        return sport_map[name]
    if not sport_map:
        return name  # Unknown sport: nothing to match against
    # Fuzzy fallback (extractOne returns None when nothing clears the cutoff)
    from rapidfuzz import process
    result = process.extractOne(name, sport_map.keys(), score_cutoff=85)
    if result is not None:
        match, _score, _ = result
        return sport_map[match]
    return name  # Return original if no confident match

The static map handles 90% of cases. The fuzzy fallback (using rapidfuzz for speed) catches typos and minor variations. Set a high threshold — a bad fuzzy match is worse than no match, because it silently corrupts your data.

Market type normalization

Different books use different terminology for the same market types:

Concept        Book A      Book B       Book C
Winner         moneyline   h2h          match_winner
Point spread   spread      handicap     line
Game total     total       over_under   totals

Map these to canonical market types in your pipeline:

MARKET_TYPE_MAP = {
    "moneyline": "moneyline",
    "h2h": "moneyline",
    "match_winner": "moneyline",
    "spread": "spread",
    "handicap": "spread",
    "line": "spread",
    "total": "total",
    "over_under": "total",
    "totals": "total",
}
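
A small helper (the name normalize_market is illustrative; the map is repeated so the snippet runs standalone) shows how it's meant to be applied — unknown types fall through unchanged rather than being dropped:

```python
MARKET_TYPE_MAP = {
    "moneyline": "moneyline", "h2h": "moneyline", "match_winner": "moneyline",
    "spread": "spread", "handicap": "spread", "line": "spread",
    "total": "total", "over_under": "total", "totals": "total",
}

def normalize_market(raw_type: str) -> str:
    # Lowercase first, then fall back to the raw value for unmapped types
    return MARKET_TYPE_MAP.get(raw_type.lower(), raw_type.lower())

print(normalize_market("H2H"))            # moneyline
print(normalize_market("player_points"))  # player_points (passed through)
```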

Timestamp Alignment

Different APIs report odds at different intervals. The Odds API updates every 30–60 seconds. OpticOdds streams in near real-time. A raw book feed might push updates only when lines actually move.

Converting to UTC

Every timestamp that enters your pipeline should be converted to UTC immediately. Don’t store local times. Don’t rely on timezone-naive datetimes.

from datetime import datetime, timezone
from dateutil import parser

def to_utc(ts: str) -> datetime:
    dt = parser.parse(ts)
    if dt.tzinfo is None:
        # Assume naive timestamps are already UTC -- verify this per source
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

Filling gaps between updates

When you’re comparing odds across books at a specific point in time, you’ll often find that one book reported at 14:30:00 and another at 14:30:45. You have two strategies:

  1. Last-known-value (forward fill): Use the most recent odds from each book. Simple and reliable for most use cases.
  2. Interpolation: Estimate intermediate values based on surrounding data points. Only useful for continuous analytics — not for trade decisions.

For arbitrage and +EV detection, last-known-value is the right choice. You’re making a trading decision based on what you know, not what you estimate.
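
A forward-fill sketch with pandas (column names and probability values are illustrative):

```python
import pandas as pd

# Each book reports on its own cadence; NaN marks "no update at this tick"
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime(["14:30:00", "14:30:45", "14:31:30"],
                                format="%H:%M:%S"),
    "book_a": [0.524, None, 0.512],
    "book_b": [None, 0.530, None],
})

# Last-known-value: carry each book's most recent quote forward
aligned = ticks.set_index("timestamp").ffill()
print(aligned)
```

Note that book_b's first row stays NaN — there is no earlier value to carry forward, which is exactly the honest answer.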

Staleness detection

Data that’s too old is worse than no data. Define a staleness threshold per source:

STALENESS_THRESHOLDS = {
    "the_odds_api": 120,   # 2 minutes
    "opticodds": 15,       # 15 seconds
    "raw_feed": 300,       # 5 minutes
}

def is_stale(last_update: datetime, source: str) -> bool:
    age = (datetime.now(timezone.utc) - last_update).total_seconds()
    return age > STALENESS_THRESHOLDS.get(source, 180)

When data is stale, flag it but don’t drop it. Downstream consumers can decide whether to use stale data or skip it. An arbitrage scanner should skip stale lines. A historical analysis tool probably wants them.


Building the Normalization Pipeline

Here’s the full pipeline — an OddsNormalizer class that accepts raw odds from multiple sources and outputs a clean, unified structure.

import pandas as pd
from datetime import datetime, timezone
from typing import Optional

class OddsNormalizer:
    def __init__(self):
        self.converter = OddsConverter()
        self.records = []

    def ingest(self, raw_data: list[dict], source: str, odds_format: str = "american"):
        for event in raw_data:
            sport = event.get("sport", "unknown")
            home = normalize_team(event["home_team"], sport)
            away = normalize_team(event["away_team"], sport)
            timestamp = to_utc(event["last_update"])
            stale = is_stale(timestamp, source)

            for market in event.get("markets", []):
                market_type = MARKET_TYPE_MAP.get(
                    market["type"].lower(), market["type"].lower()
                )
                for outcome in market.get("outcomes", []):
                    prob = self._convert(outcome["price"], odds_format)
                    self.records.append({
                        "sport": sport,
                        "home": home,
                        "away": away,
                        "market_type": market_type,
                        "outcome_name": outcome["name"],
                        "outcome_point": outcome.get("point"),
                        "implied_prob": prob,
                        "raw_price": outcome["price"],
                        "source": source,
                        "timestamp": timestamp,
                        "stale": stale,
                    })

    def _convert(self, price, fmt: str) -> Optional[float]:
        try:
            if fmt == "american":
                return self.converter.american_to_prob(int(price))
            elif fmt == "decimal":
                return self.converter.decimal_to_prob(float(price))
            elif fmt == "fractional":
                # Fractional prices typically arrive as strings like "5/2"
                num, den = str(price).split("/")
                return self.converter.fractional_to_prob(int(num), int(den))
            return None
        except (ValueError, ZeroDivisionError):
            return None

    def to_dataframe(self) -> pd.DataFrame:
        df = pd.DataFrame(self.records)
        if df.empty:
            return df
        df["event_key"] = df["sport"] + "|" + df["home"] + "|" + df["away"]
        return df.sort_values(["event_key", "market_type", "timestamp"])

    def compare(self, event_key: str, market_type: str = "moneyline") -> pd.DataFrame:
        df = self.to_dataframe()
        subset = df[(df["event_key"] == event_key) & (df["market_type"] == market_type)]
        return subset.pivot_table(
            index="outcome_name",
            columns="source",
            values="implied_prob",
            aggfunc="last",
        )

Example: normalizing NFL odds from three books

normalizer = OddsNormalizer()

# Book A returns American odds
normalizer.ingest(book_a_response, source="bovada", odds_format="american")

# Book B returns decimal odds
normalizer.ingest(book_b_response, source="betonline", odds_format="decimal")

# Book C returns American odds
normalizer.ingest(book_c_response, source="mybookie", odds_format="american")

df = normalizer.to_dataframe()
print(df[df["event_key"] == "nfl|KC|BUF"].head(10))

Output: a single DataFrame where every row has a normalized team identifier, a standardized market type, an implied probability, and a UTC timestamp — regardless of which book it came from.

Call compare() to get a cross-book comparison for a specific event:

normalizer.compare("nfl|KC|BUF", market_type="moneyline")

This returns a pivot table with outcome names as rows and books as columns, each cell containing the implied probability. You can see at a glance which book is offering the best price on each side.


Handling Edge Cases

The pipeline above covers the happy path. Real-world data will test you.

Missing markets. Not every book offers every market. When Book A has a player prop that Book B doesn’t carry, the normalized DataFrame should have NaN for Book B on that market — never a synthesized value.

Suspended markets. Books pull markets from their board when lines are moving fast or when there’s breaking news. Your pipeline should track market status and exclude suspended lines from active comparisons. A suspended line at -110 is not the same as an active line at -110.

Half-point differences. Book A offers Chiefs -3.5, Book B offers Chiefs -3. These are different markets and should not be directly compared. Group by the point value in your spread and total markets — outcome_point in the data model above serves this purpose.
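
One way to enforce that grouping with pandas (illustrative rows; in practice they come out of the normalizer above):

```python
import pandas as pd

# Three books quoting the Chiefs spread -- note the half-point difference
spreads = pd.DataFrame({
    "source": ["book_a", "book_b", "book_c"],
    "outcome_point": [-3.5, -3.0, -3.5],
    "implied_prob": [0.512, 0.524, 0.507],
})

# Only rows sharing the same point value are comparable markets
for point, group in spreads.groupby("outcome_point"):
    print(point, "->", sorted(group["source"]))
```

book_a and book_c (both at -3.5) can be compared head-to-head; book_b's -3 line is priced against a different outcome and stays in its own group.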

Props and non-standard markets. Player props, game props, and futures have far less naming consistency across books. For these, you’ll need a more sophisticated matching layer — potentially using a combination of player name normalization, prop type mapping, and manual review for initial setup.


What’s Next