Developing a March Madness Prediction Model with XGBoost

In some of our earlier Talking Tech ventures, we explored using a random forest classifier to predict play calls in college football. I also have a particular fondness for building artificial neural networks to forecast college football outcomes. Today, we’re shifting gears to a different machine learning technique, one that also belongs to the family of ensemble methods. These methods combine many models to find strength in numbers. In a random forest, we grow numerous decision trees on random subsets of the data and aggregate their outputs into a single result. Our focus today is on a method that’s just a bit less… random.

Gradient boosting shares quite a bit of its DNA with random forests; think of them as siblings in the ensemble family. Both are built from decision trees, and both can be used for either classification or regression. So what sets them apart? A random forest grows a large number of independent decision trees and relies on averaging so that the errors of individual trees cancel out. Gradient boosting starts with a single tree, measures its errors, and fits the next tree to correct them. Each subsequent tree repeats the process on whatever error remains.

Eventually, this technique yields a sequence of trees, each building on its predecessors to improve accuracy. None of these trees is discarded; each contributes to the final prediction, which is what makes this another ensemble method. Gradient-boosted models often outperform random forests, and they frequently do well in competitions on platforms like Kaggle.
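To make the "each tree corrects its predecessors" idea concrete, here is a toy sketch of boosting for squared error. It is not how XGBoost is implemented: it uses one-split "stumps" on a single made-up feature, and the names fit_stump and boost are hypothetical helpers invented for this illustration.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand side of the split is never empty
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda values: np.where(values <= threshold, left_value, right_value)

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Toy boosting loop: every new stump is fit to the errors that remain."""
    base = y.mean()
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction          # what the ensemble still gets wrong
        tree = fit_stump(x, residuals)      # next tree targets those errors
        prediction = prediction + learning_rate * tree(x)
        trees.append(tree)                  # no tree is discarded
    return lambda values: base + learning_rate * sum(tree(values) for tree in trees)

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

model = boost(x, y)
baseline_mse = ((y - y.mean()) ** 2).mean()   # error of always predicting the mean
boosted_mse = ((y - model(x)) ** 2).mean()    # training error after 50 stumps
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each tree's contribution, which is the same role the learning_rate parameter plays in XGBoost later in this post.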

For gradient boosting in Python, XGBoost and LightGBM top my list of library recommendations. Both are solid choices. Today we’ll work with XGBoost, but I’d suggest checking out LightGBM when you have a moment.

Let’s get practical. Using the CBBD Python library, we’ll pull data from the CollegeBasketballData.com REST API. Make sure you have these packages installed: cbbd, pandas, scikit-learn (imported as sklearn), and xgboost. Install them via pip if necessary. We’ll start by importing everything we need and setting our CBBD API key in the configuration. Need a key? Head over to the CBBD website to grab one.

import cbbd
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

configuration = cbbd.Configuration(
    access_token = 'your_api_key_here'
)

Don’t worry about quota: we’ll only make 22 API calls to build the training data, comfortably within CBBD’s free allocation of 1,000 calls per month. That leaves plenty of room for repeated runs.

Next, let’s grab all NCAA tournament games from 2014 through 2024 (range(2024, 2013, -1) counts down from 2024 to 2014). You’re welcome to go further back if curiosity strikes. We’ll use the tournament="NCAA" parameter to retrieve all tournament games for a given season.

games = []
with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)
    for season in range(2024, 2013, -1):
        results = games_api.get_games(season=season, tournament="NCAA")
        games += results

This returns 686 games. Curious what the data for a single game looks like? Let’s peek at one.

GameInfo(id=12010, source_id='401638579', season_label="20232024", season=2024, season_type=, start_date=datetime.datetime(2024, 3, 19, 18, 40, tzinfo=datetime.timezone.utc), start_time_tbd=False, neutral_site=True, conference_game=False, game_type="TRNMNT", tournament="NCAA", game_notes="Men's Basketball Championship - West Region - First Four", status=, attendance=0, home_team_id=114, home_team='Howard', home_conference_id=18, home_conference="MEAC", home_seed=16, home_points=68, home_period_points=[27, 41], home_winner=False, away_team_id=341, away_team='Wagner', away_conference_id=21, away_conference="NEC", away_seed=16, away_points=71, away_period_points=[38, 33], away_winner=True, excitement=4.7, venue_id=76, venue="UD Arena", city='Dayton', state="OH")

Our next move is to pull team statistics to use as features in our model. For this, we’ll use the CBBD Stats API to fetch regular season stats for the same seasons we gathered tournament games for. Remember to specify season_type="regular". This is crucial: without it, the stats could include the very tournament games we’re trying to predict, and the model would be trained on information it wouldn’t have before the tournament.

Run the code below to fetch those team season stats.

stats = []
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    for season in range(2024, 2013, -1):
        results = stats_api.get_team_season_stats(season=season, season_type="regular")
        stats += results

A glance at these stats reveals the plethora of metrics at our disposal:

TeamSeasonStats(season=2024, season_label="20232024", team_id=1, team='Abilene Christian', conference="WAC", games=32, wins=15, losses=17, total_minutes=1325, pace=61.1, team_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=43.2, attempted=1877, made=811), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.4, attempted=1393, made=646), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=34.1, attempted=484, made=165), free_throws=TeamSeasonUnitStatsFieldGoals(pct=73.1, attempted=729, made=533), rebounds=TeamSeasonUnitStatsRebounds(total=1070, defensive=756, offensive=314), turnovers=TeamSeasonUnitStatsTurnovers(team_total=12, total=404), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=635), points=TeamSeasonUnitStatsPoints(fast_break=319, off_turnovers=466, in_paint=1138, total=2320), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=38.8, offensive_rebound_pct=29.3, turnover_ratio=0.2, effective_field_goal_pct=47.6), assists=405, blocks=65, steals=253, possessions=2028, rating=114.4, true_shooting=52.8), opponent_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.5, attempted=1792, made=833), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=52.6, attempted=1227, made=645), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=33.3, attempted=565, made=188), free_throws=TeamSeasonUnitStatsFieldGoals(pct=68.7, attempted=723, made=497), rebounds=TeamSeasonUnitStatsRebounds(total=1171, defensive=859, offensive=312), turnovers=TeamSeasonUnitStatsTurnovers(team_total=23, total=478), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=619), points=TeamSeasonUnitStatsPoints(fast_break=316, off_turnovers=411, in_paint=1120, total=2351), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=40.3, offensive_rebound_pct=26.6, turnover_ratio=0.2, effective_field_goal_pct=51.7), assists=388, blocks=108, steals=206, possessions=2023, rating=116.2, true_shooting=55.7))

Now we need to attach team statistics to each game record. We’ll build a list of dict objects combining this information, which can be easily loaded into pandas.

The loop below merges the two datasets:

records = []
for game in games:
    record = game.to_dict()
    home_stats = [stat for stat in stats if stat.team_id == game.home_team_id and stat.season == game.season][0]
    away_stats = [stat for stat in stats if stat.team_id == game.away_team_id and stat.season == game.season][0]
    record['home_pace'] = home_stats.pace
    record['home_o_rating'] = home_stats.team_stats.rating
    record['home_d_rating'] = home_stats.opponent_stats.rating
    record['home_free_throw_rate'] = home_stats.team_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate'] = home_stats.team_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio'] = home_stats.team_stats.four_factors.turnover_ratio
    record['home_efg'] = home_stats.team_stats.four_factors.effective_field_goal_pct
    record['home_free_throw_rate_allowed'] = home_stats.opponent_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate_allowed'] = home_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio_forced'] = home_stats.opponent_stats.four_factors.turnover_ratio
    record['home_efg_allowed'] = home_stats.opponent_stats.four_factors.effective_field_goal_pct
    record['away_pace'] = away_stats.pace
    record['away_o_rating'] = away_stats.team_stats.rating
    record['away_d_rating'] = away_stats.opponent_stats.rating
    record['away_free_throw_rate'] = away_stats.team_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate'] = away_stats.team_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio'] = away_stats.team_stats.four_factors.turnover_ratio
    record['away_efg'] = away_stats.team_stats.four_factors.effective_field_goal_pct
    record['away_free_throw_rate_allowed'] = away_stats.opponent_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate_allowed'] = away_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio_forced'] = away_stats.opponent_stats.four_factors.turnover_ratio
    record['away_efg_allowed'] = away_stats.opponent_stats.four_factors.effective_field_goal_pct
    records.append(record)
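The loop above scans the full stats list twice per game and raises an IndexError when a team has no stats for a season. One possible alternative, sketched here with stand-in objects rather than real API responses, indexes the stats by (team_id, season) and builds the home and away columns with one helper. The names index_stats and team_features are hypothetical; the attribute paths mirror the ones used above.

```python
from types import SimpleNamespace

def index_stats(stats):
    """Map (team_id, season) to its stats object for constant-time lookups."""
    return {(stat.team_id, stat.season): stat for stat in stats}

def team_features(stat, prefix):
    """Flatten the fields used in this post into '<prefix>_...' columns."""
    offense = stat.team_stats
    defense = stat.opponent_stats
    return {
        f'{prefix}_pace': stat.pace,
        f'{prefix}_o_rating': offense.rating,
        f'{prefix}_d_rating': defense.rating,
        f'{prefix}_free_throw_rate': offense.four_factors.free_throw_rate,
        f'{prefix}_offensive_rebound_rate': offense.four_factors.offensive_rebound_pct,
        f'{prefix}_turnover_ratio': offense.four_factors.turnover_ratio,
        f'{prefix}_efg': offense.four_factors.effective_field_goal_pct,
        f'{prefix}_free_throw_rate_allowed': defense.four_factors.free_throw_rate,
        f'{prefix}_offensive_rebound_rate_allowed': defense.four_factors.offensive_rebound_pct,
        f'{prefix}_turnover_ratio_forced': defense.four_factors.turnover_ratio,
        f'{prefix}_efg_allowed': defense.four_factors.effective_field_goal_pct,
    }

def fake_unit(rating):
    """Stand-in for a TeamSeasonUnitStats object, with made-up numbers."""
    four_factors = SimpleNamespace(
        free_throw_rate=30.0,
        offensive_rebound_pct=28.0,
        turnover_ratio=0.2,
        effective_field_goal_pct=50.0,
    )
    return SimpleNamespace(rating=rating, four_factors=four_factors)

# Stand-ins for the objects returned by the Stats API (values are invented)
fake_stats = [
    SimpleNamespace(team_id=1, season=2024, pace=68.0,
                    team_stats=fake_unit(110.0), opponent_stats=fake_unit(100.0)),
    SimpleNamespace(team_id=2, season=2024, pace=65.0,
                    team_stats=fake_unit(105.0), opponent_stats=fake_unit(98.0)),
]

lookup = index_stats(fake_stats)
record = {}
record.update(team_features(lookup[(1, 2024)], 'home'))
record.update(team_features(lookup[(2, 2024)], 'away'))

# .get() returns None for a missing team instead of raising
missing = lookup.get((3, 2024))
```

With the real data, the same lookup could be built once from stats and then queried with (game.home_team_id, game.season) and (game.away_team_id, game.season) inside the loop.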

With these records in place, we’ll load them into a pandas data frame and compute a new column for the final score margin (home points minus away points).

df = pd.DataFrame(records)
# Columns from game.to_dict() are camelCase (compare homeSeed and gameNotes used later)
df['margin'] = df['homePoints'] - df['awayPoints']

A quick df.head() lets us take stock of the data. Next comes feature selection: deciding which columns to include in model training.

Let’s list the columns in the data frame:

df.columns

We’ll extract particular columns to train our model and identify the output we aim to predict, the margin.

features = [
    'home_o_rating',
    'home_d_rating',
    'home_pace',
    'home_free_throw_rate',
    'home_offensive_rebound_rate',
    'home_turnover_ratio',
    'home_efg',
    'home_free_throw_rate_allowed',
    'home_offensive_rebound_rate_allowed',
    'home_turnover_ratio_forced',
    'home_efg_allowed',
    'away_o_rating',
    'away_d_rating',
    'away_pace',
    'away_free_throw_rate',
    'away_offensive_rebound_rate',
    'away_turnover_ratio',
    'away_efg',
    'away_free_throw_rate_allowed',
    'away_offensive_rebound_rate_allowed',
    'away_turnover_ratio_forced',
    'away_efg_allowed',
    'homeSeed',
    'awaySeed'
]

outputs = ['margin']

df[features + outputs]

Feel free to tinker with the feature list here and swap stats in or out to fit your goals. Next, we’ll split the dataset into training and testing sets, using earlier seasons to train the model and holding out the 2024 games as our test set.

training = df.query("season != 2024").copy()
testing = df.query("season == 2024").copy()

Splitting the training data further into training and validation subsets gives us held-out games for checking whether the model is overfitting before we ever touch the test set.

X_train, X_valid, y_train, y_valid = train_test_split(training[features], training[outputs], train_size=0.8, test_size=0.2, random_state=0)

And finally, the moment we’ve been building toward: training our model using XGBRegressor. If this were a classification problem, we would use XGBClassifier instead.

model = XGBRegressor(random_state=0)
model.fit(X_train, y_train)

It’s alive! Now, let’s put the model to work by generating predictions for our validation set.

predictions = model.predict(X_valid)
predictions

Since these games have already been played, we can measure the quality of our predictions with a metric such as mean absolute error (MAE).

mae = mean_absolute_error(y_valid, predictions)  # signature is (y_true, y_pred)
mae

A mean absolute error of ~7.96 emerges, meaning the predicted margin misses by about 8 points on average. That’s a respectable start, approximately on par with benchmark MAEs. What now? Fine-tuning always helps: adjusting model parameters or adding input features might yield better results.
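For context on what an MAE means, it can help to compare against a naive baseline that ignores the features entirely. The sketch below uses made-up margins (not the tournament data) and a hand-written MAE function, named mean_absolute_error_manual here only to avoid clashing with the sklearn import. The baseline always predicts the average training margin.

```python
import numpy as np

def mean_absolute_error_manual(actual, predicted):
    """Average of |actual - predicted|, the same quantity sklearn reports."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted).mean()

# Invented margins, purely for illustration
train_margins = np.array([12, -3, 7, -15, 4, 21, -8, 2])
valid_margins = np.array([5, -10, 3, 14])

# Naive baseline: predict the mean training margin (2.5) for every game
naive = np.full(len(valid_margins), train_margins.mean())
baseline_mae = mean_absolute_error_manual(valid_margins, naive)
print(baseline_mae)  # 6.75
```

On the real data, the equivalent check would be to compare the model's MAE against the MAE of predicting y_train's mean for every validation game. A model that cannot beat that number is not learning much from its features.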

For exploratory fine-tuning, consider adjusting hyperparameters such as the number of estimators or the learning rate.

model = XGBRegressor(n_estimators=100, learning_rate=0.05, n_jobs=4)
model.fit(X_train, y_train)
predictions = model.predict(X_valid)
mae = mean_absolute_error(y_valid, predictions)  # signature is (y_true, y_pred)

While my efforts didn’t push the MAE any lower, your alterations might! It’s a process of continuous improvement.

Now, over to our test dataset: let’s predict outcomes and compare them against the real 2024 NCAA tournament results.

predictions = model.predict(testing[features])
testing['prediction'] = predictions
testing[['homeSeed', 'homeTeam', 'awaySeed', 'awayTeam', 'margin', 'prediction']]

And here’s the fun part: measuring the percentage of games where our model picked the correct winner.

testing.query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing.shape[0]

At 64.3%, the model picks the winner in nearly two out of three games. Restricting to the first round gives a slightly better 69.7%. Here’s how to compute the first-round number:

# Filter once and reuse; na=False treats games without notes as non-matches
first_round = testing[testing['gameNotes'].str.contains('1st', na=False)]
first_round.query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / first_round.shape[0]
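Both accuracy calculations above repeat the same query string. A small helper can express the same check once. This is a sketch on a made-up data frame; pick_accuracy is a hypothetical name, and the column names match the ones used in this post.

```python
import numpy as np
import pandas as pd

def pick_accuracy(frame, margin_col='margin', prediction_col='prediction'):
    """Share of rows where the predicted margin has the same sign as the actual margin."""
    if len(frame) == 0:
        return float('nan')
    same_sign = np.sign(frame[margin_col]) == np.sign(frame[prediction_col])
    # A prediction of exactly 0 picks no winner, so it counts as a miss
    picked_a_side = frame[prediction_col] != 0
    return (same_sign & picked_a_side).mean()

# Invented rows, purely for illustration
toy = pd.DataFrame({
    'gameNotes': ['1st Round', '1st Round', '2nd Round', 'First Four'],
    'margin': [-3, 10, 5, -7],
    'prediction': [-1.5, 4.0, -2.0, 0.0],
})

overall = pick_accuracy(toy)
first_round_only = pick_accuracy(toy[toy['gameNotes'].str.contains('1st', na=False)])
print(overall, first_round_only)  # 0.5 1.0
```

Applied to the real data, pick_accuracy(testing) should reproduce the overall percentage, since it encodes the same sign comparison as the query.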

With results in hand, you can save the model for later reuse. The .json extension tells XGBoost which format to write:

model.save_model('xgboostmodel.json')

When you need predictions later, loading it back up is straightforward:

model = XGBRegressor()
model.load_model('xgboostmodel.json')

Want to predict hypothetical matchups? That’s perfect for bracket strategy. Here’s one way to do it:

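# Open a fresh client, since the earlier `with` block closed the previous one on exit
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    stats = stats_api.get_team_season_stats(season=2025, season_type="regular")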

def predict_game(model, stats, projected_home_seed, home_team, projected_away_seed, away_team):
    home_stats = [stat for stat in stats if stat.team == home_team][0]
    away_stats = [stat for stat in stats if stat.team == away_team][0]
    record = {
        'home_o_rating': home_stats.team_stats.rating,
        'home_d_rating': home_stats.opponent_stats.rating,
        'home_pace': home_stats.pace,
        'home_free_throw_rate': home_stats.team_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate': home_stats.team_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio': home_stats.team_stats.four_factors.turnover_ratio,
        'home_efg': home_stats.team_stats.four_factors.effective_field_goal_pct,
        'home_free_throw_rate_allowed': home_stats.opponent_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate_allowed': home_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio_forced': home_stats.opponent_stats.four_factors.turnover_ratio,
        'home_efg_allowed': home_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'away_o_rating': away_stats.team_stats.rating,
        'away_d_rating': away_stats.opponent_stats.rating,
        'away_pace': away_stats.pace,
        'away_free_throw_rate': away_stats.team_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate': away_stats.team_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio': away_stats.team_stats.four_factors.turnover_ratio,
        'away_efg': away_stats.team_stats.four_factors.effective_field_goal_pct,
        'away_free_throw_rate_allowed': away_stats.opponent_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate_allowed': away_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio_forced': away_stats.opponent_stats.four_factors.turnover_ratio,
        'away_efg_allowed': away_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'homeSeed': projected_home_seed,
        'awaySeed': projected_away_seed
    }
    return model.predict(pd.DataFrame([record]))[0]

predict_game(model, stats, 5, 'Michigan', 11, 'Dayton')

In this demonstration, we loaded data for the current season, wrote a function that builds a one-row DataFrame with the required features, and used it to generate a prediction. Here, the model predicts that Michigan, as a 5 seed, would beat Dayton, an 11 seed, by 6.1 points. Magnifique!
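Tournament games are played at neutral sites, so which team is labeled "home" in a hypothetical matchup is a judgment call, and the model may give slightly different answers depending on the order. One possible way to smooth that out, offered as an assumption rather than part of the original approach, is to predict both orderings and average them. The sketch below uses a stand-in predictor with invented numbers; neutral_margin and stub_predict are hypothetical names.

```python
def neutral_margin(predict, team_a, seed_a, team_b, seed_b):
    """Average the predicted margin over both home/away orderings.

    `predict(home_seed, home_team, away_seed, away_team)` is assumed to
    return the predicted home margin. The result is from team A's side.
    """
    a_as_home = predict(seed_a, team_a, seed_b, team_b)
    b_as_home = predict(seed_b, team_b, seed_a, team_a)
    # b_as_home is from team B's point of view, so flip its sign before averaging
    return (a_as_home - b_as_home) / 2

def stub_predict(home_seed, home_team, away_seed, away_team):
    """Stand-in model: seed difference plus an invented 1.5-point home bias."""
    return (away_seed - home_seed) * 1.0 + 1.5

margin = neutral_margin(stub_predict, 'Michigan', 5, 'Dayton', 11)
print(margin)  # 6.0
```

To use it with the function from this post, the stand-in could be replaced by a small wrapper such as lambda hs, ht, as_, at: predict_game(model, stats, hs, ht, as_, at).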

In closing, I’ll pass the baton to you. We’ve laid a sturdy foundation, but countless enhancements await: unused Stats API features, opponent-adjusted statistics, and data sources beyond the API entirely.

I’d love to hear your thoughts, whether on Twitter, Bluesky, Discord, or elsewhere. Enjoy building, and may the odds favor your brackets!