When your bot monitors five offshore sportsbooks, you’re dealing with five different data formats, five different naming conventions, and five different update schedules. One book gives you American odds, another gives decimal. One calls them the “LA Lakers,” another says “Los Angeles Lakers,” a third uses “LAL.” Timestamps arrive in different zones, at different intervals, with different levels of staleness.
Normalization is the unglamorous but essential step that makes everything else work. Without it, your arbitrage scanner compares garbage to garbage. Your +EV calculator produces meaningless output. Your line movement tracker can’t stitch together a coherent history.
This guide walks through the full normalization stack: odds format conversion, market matching, timestamp alignment, and a working Python pipeline that ties it all together.
Odds Format Conversion
The first normalization task is the simplest: get every odds value into the same format. The target is implied probability — a float between 0 and 1 that represents the price in a format-agnostic way.
American to implied probability
American odds are the standard for US-facing offshore books like Bovada, BetOnline, and MyBookie. Positive values represent the underdog, negative values represent the favorite.
The conversion:
- Negative odds (e.g., -150): implied = abs(odds) / (abs(odds) + 100)
- Positive odds (e.g., +130): implied = 100 / (odds + 100)
So -150 becomes 150 / 250 = 0.60, and +130 becomes 100 / 230 ≈ 0.4348.
Decimal to implied probability
Decimal odds are common on international-facing offshore books and some aggregator APIs. The conversion is straightforward:
implied = 1 / decimal_odds
Decimal 2.50 becomes 1 / 2.50 = 0.40. Decimal 1.67 becomes 1 / 1.67 ≈ 0.5988.
Fractional to implied probability
Fractional odds (e.g., 5/2, 7/4) appear less frequently in offshore sportsbook APIs, but some European-origin books still return them:
implied = denominator / (numerator + denominator)
5/2 becomes 2 / 7 ≈ 0.2857. 7/4 becomes 4 / 11 ≈ 0.3636.
The OddsConverter class
Wrap all conversions into a single utility:
```python
class OddsConverter:
    @staticmethod
    def american_to_prob(odds: int) -> float:
        if odds == 0:
            raise ValueError("American odds cannot be zero")
        if odds < 0:
            return abs(odds) / (abs(odds) + 100)
        return 100 / (odds + 100)

    @staticmethod
    def decimal_to_prob(odds: float) -> float:
        if odds <= 1.0:
            raise ValueError(f"Decimal odds must be > 1.0, got {odds}")
        return 1 / odds

    @staticmethod
    def fractional_to_prob(numerator: int, denominator: int) -> float:
        if numerator <= 0 or denominator <= 0:
            raise ValueError("Fractional odds must be positive")
        return denominator / (numerator + denominator)

    @staticmethod
    def prob_to_american(prob: float) -> int:
        if prob <= 0 or prob >= 1:
            raise ValueError("Probability must be between 0 and 1")
        if prob >= 0.5:
            return int(-prob * 100 / (1 - prob))
        return int((1 - prob) * 100 / prob)
```
Edge cases to handle: even money (American +100, decimal 2.0) converts cleanly to 0.50. Extremely long odds (e.g., +10000) produce very small probabilities — make sure your downstream code doesn’t choke on 0.0099.
Note that implied probabilities from a single book will sum to more than 1.0 for a two-way market. The excess is the vig/juice. Don’t “correct” this during normalization — that’s a separate calculation for your juice comparison pipeline.
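To make the overround concrete, here is a quick sketch on a hypothetical two-way market (the -120/+100 prices are made up for illustration):

```python
# Hypothetical two-way moneyline: -120 on one side, +100 on the other.
def american_to_prob(odds: int) -> float:
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)

side_a = american_to_prob(-120)  # 120 / 220 ≈ 0.5455
side_b = american_to_prob(+100)  # 100 / 200 = 0.5000
overround = side_a + side_b      # ≈ 1.0455 — sums to more than 1.0
vig = overround - 1.0            # the book's juice, ≈ 4.55%
print(f"overround={overround:.4f}, vig={vig:.2%}")
```

The excess over 1.0 is exactly the vig that the juice comparison pipeline works with, which is why it should survive normalization untouched.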
Market Matching
Format conversion is mechanical. Market matching is where normalization gets messy.
The team name problem
Here’s what the same NBA game looks like across three books:
| Source | Team 1 | Team 2 |
|---|---|---|
| Book A (The Odds API) | Los Angeles Lakers | Boston Celtics |
| Book B (raw feed) | LA Lakers | Bos Celtics |
| Book C (raw feed) | LAL | BOS |
Your pipeline needs to understand that these three rows describe the same event. Multiply this across four major sports, hundreds of teams, and international leagues, and you’ve got a serious mapping problem.
Building a canonical name map
Start with a static dictionary that maps known variations to a canonical identifier:
```python
from rapidfuzz import process

TEAM_ALIASES = {
    "nba": {
        "Los Angeles Lakers": "LAL",
        "LA Lakers": "LAL",
        "L.A. Lakers": "LAL",
        "Lakers": "LAL",
        "Boston Celtics": "BOS",
        "Bos Celtics": "BOS",
        "Celtics": "BOS",
    }
}

def normalize_team(name: str, sport: str) -> str:
    sport_map = TEAM_ALIASES.get(sport, {})
    if name in sport_map:
        return sport_map[name]
    if not sport_map:
        return name  # Unknown sport: nothing to fuzzy-match against
    # Fuzzy fallback
    match, score, _ = process.extractOne(name, sport_map.keys())
    if score >= 85:
        return sport_map[match]
    return name  # Return original if no confident match
The static map handles 90% of cases. The fuzzy fallback (using rapidfuzz for speed) catches typos and minor variations. Set a high threshold — a bad fuzzy match is worse than no match, because it silently corrupts your data.
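If you want to avoid the rapidfuzz dependency, the same pattern can be sketched with the standard library's difflib — note that difflib scores on a 0–1 scale rather than 0–100, and the alias map below is a minimal illustration, not the full table:

```python
import difflib

# Minimal alias map for illustration only.
NBA_ALIASES = {
    "Los Angeles Lakers": "LAL",
    "LA Lakers": "LAL",
    "Boston Celtics": "BOS",
}

def normalize_team_stdlib(name: str, aliases: dict[str, str], cutoff: float = 0.85) -> str:
    if name in aliases:
        return aliases[name]
    # get_close_matches returns candidates scoring >= cutoff, best first.
    matches = difflib.get_close_matches(name, aliases.keys(), n=1, cutoff=cutoff)
    return aliases[matches[0]] if matches else name

print(normalize_team_stdlib("Los Angles Lakers", NBA_ALIASES))  # typo -> "LAL"
print(normalize_team_stdlib("Zzz United", NBA_ALIASES))         # no match -> unchanged
```

The same principle applies either way: keep the cutoff high and fall back to the original string, since a wrong match corrupts data silently.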
Market type normalization
Different books use different terminology for the same market types:
| Concept | Book A | Book B | Book C |
|---|---|---|---|
| Winner | moneyline | h2h | match_winner |
| Point spread | spread | handicap | line |
| Game total | total | over_under | totals |
Map these to canonical market types in your pipeline:
```python
MARKET_TYPE_MAP = {
    "moneyline": "moneyline",
    "h2h": "moneyline",
    "match_winner": "moneyline",
    "spread": "spread",
    "handicap": "spread",
    "line": "spread",
    "total": "total",
    "over_under": "total",
    "totals": "total",
}
```
Timestamp Alignment
Different APIs report odds at different intervals. The Odds API updates every 30–60 seconds. OpticOdds streams in near real-time. A raw book feed might push updates only when lines actually move.
Converting to UTC
Every timestamp that enters your pipeline should be converted to UTC immediately. Don’t store local times. Don’t rely on timezone-naive datetimes.
```python
from datetime import datetime, timezone
from dateutil import parser

def to_utc(ts: str) -> datetime:
    dt = parser.parse(ts)
    if dt.tzinfo is None:
        # Treat naive timestamps as already UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
```
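As a quick sanity check of the conversion (stdlib only — fromisoformat parses offset-aware ISO strings without dateutil):

```python
from datetime import datetime, timezone

# An Eastern-time update (UTC-5) converted to UTC.
dt = datetime.fromisoformat("2024-01-15T14:30:00-05:00")
utc = dt.astimezone(timezone.utc)
print(utc.isoformat())  # 2024-01-15T19:30:00+00:00
```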
Filling gaps between updates
When you’re comparing odds across books at a specific point in time, you’ll often find that one book reported at 14:30:00 and another at 14:30:45. You have two strategies:
- Last-known-value (forward fill): Use the most recent odds from each book. Simple and reliable for most use cases.
- Interpolation: Estimate intermediate values based on surrounding data points. Only useful for continuous analytics — not for trade decisions.
For arbitrage and +EV detection, last-known-value is the right choice. You’re making a trading decision based on what you know, not what you estimate.
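A forward-fill alignment can be sketched with pandas (the timestamps and probabilities below are made up for illustration):

```python
import pandas as pd

# Hypothetical implied-probability streams from two books,
# reported at different instants.
book_a = pd.Series(
    [0.52, 0.55],
    index=pd.to_datetime(["2024-01-15T14:30:00Z", "2024-01-15T14:32:00Z"]),
)
book_b = pd.Series(
    [0.50],
    index=pd.to_datetime(["2024-01-15T14:30:45Z"]),
)

# Align both books on the union of their timestamps, carrying each
# book's last known value forward.
aligned = pd.DataFrame({"book_a": book_a, "book_b": book_b}).ffill()
print(aligned)
# At 14:30:45, book_a still shows its 14:30:00 value (0.52).
```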
Staleness detection
Data that’s too old is worse than no data. Define a staleness threshold per source:
```python
STALENESS_THRESHOLDS = {
    "the_odds_api": 120,  # 2 minutes
    "opticodds": 15,      # 15 seconds
    "raw_feed": 300,      # 5 minutes
}

def is_stale(last_update: datetime, source: str) -> bool:
    age = (datetime.now(timezone.utc) - last_update).total_seconds()
    return age > STALENESS_THRESHOLDS.get(source, 180)
```
When data is stale, flag it but don’t drop it. Downstream consumers can decide whether to use stale data or skip it. An arbitrage scanner should skip stale lines. A historical analysis tool probably wants them.
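To illustrate the per-source thresholds, the same 60-second-old update is stale for one feed and fresh for another (a self-contained restatement of the helper above, with a trimmed threshold table):

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLDS = {"opticodds": 15, "raw_feed": 300}

def is_stale(last_update: datetime, source: str) -> bool:
    age = (datetime.now(timezone.utc) - last_update).total_seconds()
    return age > STALENESS_THRESHOLDS.get(source, 180)

one_minute_ago = datetime.now(timezone.utc) - timedelta(seconds=60)
print(is_stale(one_minute_ago, "opticodds"))  # True  (limit is 15s)
print(is_stale(one_minute_ago, "raw_feed"))   # False (limit is 300s)
```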
Building the Normalization Pipeline
Here’s the full pipeline — an OddsNormalizer class that accepts raw odds from multiple sources and outputs a clean, unified structure.
```python
import pandas as pd
from datetime import datetime, timezone
from typing import Optional

class OddsNormalizer:
    def __init__(self):
        self.converter = OddsConverter()
        self.records = []

    def ingest(self, raw_data: list[dict], source: str, odds_format: str = "american"):
        for event in raw_data:
            sport = event.get("sport", "unknown")
            home = normalize_team(event["home_team"], sport)
            away = normalize_team(event["away_team"], sport)
            timestamp = to_utc(event["last_update"])
            stale = is_stale(timestamp, source)
            for market in event.get("markets", []):
                market_type = MARKET_TYPE_MAP.get(
                    market["type"].lower(), market["type"].lower()
                )
                for outcome in market.get("outcomes", []):
                    prob = self._convert(outcome["price"], odds_format)
                    self.records.append({
                        "sport": sport,
                        "home": home,
                        "away": away,
                        "market_type": market_type,
                        "outcome_name": outcome["name"],
                        "outcome_point": outcome.get("point"),
                        "implied_prob": prob,
                        "raw_price": outcome["price"],
                        "source": source,
                        "timestamp": timestamp,
                        "stale": stale,
                    })

    def _convert(self, price, fmt: str) -> Optional[float]:
        try:
            if fmt == "american":
                return self.converter.american_to_prob(int(price))
            elif fmt == "decimal":
                return self.converter.decimal_to_prob(float(price))
            return None
        except (ValueError, ZeroDivisionError):
            return None

    def to_dataframe(self) -> pd.DataFrame:
        df = pd.DataFrame(self.records)
        if df.empty:
            return df
        df["event_key"] = df["sport"] + "|" + df["home"] + "|" + df["away"]
        return df.sort_values(["event_key", "market_type", "timestamp"])

    def compare(self, event_key: str, market_type: str = "moneyline") -> pd.DataFrame:
        df = self.to_dataframe()
        subset = df[(df["event_key"] == event_key) & (df["market_type"] == market_type)]
        return subset.pivot_table(
            index="outcome_name",
            columns="source",
            values="implied_prob",
            aggfunc="last",
        )
```
Example: normalizing NFL odds from three books
```python
normalizer = OddsNormalizer()

# Book A returns American odds
normalizer.ingest(book_a_response, source="bovada", odds_format="american")

# Book B returns decimal odds
normalizer.ingest(book_b_response, source="betonline", odds_format="decimal")

# Book C returns American odds
normalizer.ingest(book_c_response, source="mybookie", odds_format="american")

df = normalizer.to_dataframe()
print(df[df["event_key"] == "nfl|KC|BUF"].head(10))
```
Output: a single DataFrame where every row has a normalized team identifier, a standardized market type, an implied probability, and a UTC timestamp — regardless of which book it came from.
Call compare() to get a cross-book comparison for a specific event:
```python
normalizer.compare("nfl|KC|BUF", market_type="moneyline")
```
This returns a pivot table with outcome names as rows and books as columns, each cell containing the implied probability. You can see at a glance which book is offering the best price on each side.
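Since a lower implied probability means a better payout for the bettor, the best book per outcome can be read off such a pivot with idxmin. A sketch on made-up numbers (the books and probabilities are hypothetical):

```python
import pandas as pd

# Hypothetical output of compare(): implied probabilities by outcome and book.
pivot = pd.DataFrame(
    {"bovada": [0.55, 0.50], "betonline": [0.57, 0.48]},
    index=pd.Index(["KC", "BUF"], name="outcome_name"),
)

# Lower implied probability = better price for the bettor.
best_book = pivot.idxmin(axis=1)
print(best_book)
# KC  -> bovada    (0.55 beats 0.57)
# BUF -> betonline (0.48 beats 0.50)
```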
Handling Edge Cases
The pipeline above covers the happy path. Real-world data will test you.
Missing markets. Not every book offers every market. When Book A has a player prop that Book B doesn’t carry, the normalized DataFrame should have NaN for Book B on that market — never a synthesized value.
Suspended markets. Books pull markets from their board when lines are moving fast or when there’s breaking news. Your pipeline should track market status and exclude suspended lines from active comparisons. A suspended line at -110 is not the same as an active line at -110.
Half-point differences. Book A offers Chiefs -3.5, Book B offers Chiefs -3. These are different markets and should not be directly compared. Group by the point value in your spread and total markets — outcome_point in the data model above serves this purpose.
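Grouping by point value keeps the -3 and -3.5 lines separate. A sketch over the normalized record layout used above (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical normalized spread rows: same game, different point values.
df = pd.DataFrame([
    {"outcome_name": "KC", "outcome_point": -3.5, "source": "bovada",    "implied_prob": 0.524},
    {"outcome_name": "KC", "outcome_point": -3.0, "source": "betonline", "implied_prob": 0.524},
    {"outcome_name": "KC", "outcome_point": -3.5, "source": "mybookie",  "implied_prob": 0.512},
])

# Only rows sharing (outcome_name, outcome_point) describe the same market.
for (name, point), group in df.groupby(["outcome_name", "outcome_point"]):
    print(name, point, "->", sorted(group["source"]))
# KC -3.5 compares bovada vs mybookie; KC -3.0 stands alone.
```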
Props and non-standard markets. Player props, game props, and futures have far less naming consistency across books. For these, you’ll need a more sophisticated matching layer — potentially using a combination of player name normalization, prop type mapping, and manual review for initial setup.
What’s Next
- BetOnline API Guide — the most common starting point for offshore odds data
- Bovada API Guide — accessing Bovada’s data through third-party providers
- Offshore Sportsbook APIs Overview — the landscape of data access across major offshore books
- Juice Comparison — use your normalized data to compare vig across books
- +EV Betting Bots — feed normalized odds into an expected value calculator
- Sports Betting Arbitrage Bot — the primary consumer of cross-book normalized data