X
Data Analytics | Data Science | Databricks

You Have A 1 in 9.2 Quintillion Chance Of Building The Perfect NCAA Tournament Bracket

“Brackets, Brackets, and more Brackets”

It is that time of the year again. March Madness is here and people around the globe are rushing to fill out tournament brackets with the hope of picking the winning team each and every game and of course, rooting for their favorite team. Like many workplaces, Lovelytics employees get super excited about this event.  We are first and foremost data geeks, but we are also basketball fans.

Like all sports fans, we are cheering, booing, celebrating or crying the entire tournament.  But as data nerds, we are trying to find clarity in the chaos of the data, asking ourselves, “how can we leverage our knowledge and expertise to gain insights and use data to help us make better bracket decisions and remove the bias that is so prevalent in the bracket selection process each year.” And, is there a way to take this approach and apply it to our client’s business challenges?

Our March Madness challenge:  How can we leverage machine learning best practices using Databricks, a lakehouse platform,  to create successful bracket predictions? 

“The Madness in the Odds”

For context, there are 63 games in an NCAA tournament bracket, and each game has two possible winners.   All possible permutations of brackets will be 2^63 ≈ 9.2 quintillions.  That is a 9 with 18 zeros.  Said another way, the odds of predicting a perfect bracket  is 1 in 9.2 quintillions.

How Can We Improve Our Odds? – Deciding What To Predict

With a 64 team tournament bracket, it is challenging to formulate exactly how and what to predict, especially knowing the later brackets are only defined when a team has won their previous game.  Therefore we opted to predict the outcome of all possible matchups of the 64 final teams, this is 64*63/2 = 2016 possible matches. 

Our approach is to identify each team  with a unique numerical id that is the same every year. If a team is new to the tournament (i.e. has never played in the tournament before) they  are assigned new ids. So in each one of our 2016 possible matches,there is always a team with a lower numerical id than the other team.  From that, we will try to predict if the lower numerical id team wins and identify those matches as a Category 1.  If the team with a larger numerical id wins, we assign it a Category 0.

“Best Practices”

Like many modern data engineering approaches, we opted to follow a medallion architecture to tackle our challenge, aiming to provide our project with a cohesive and organized structure by dividing it into clearly identifiable layers. 

Bronze Layer:

First step, we need our raw data ingested. This step is considered the “Bronze Layer”.  We can divide this step into two phases: 

  1. Loading raw training data
  2. Loading raw testing data 

There are countless resources to pull NCAA historical data from. We obtained most of the previous seasons’ (1985-2020) results and statistics from Kaggle and we updated this data including last year’s results. The current season data serves as our testing data. It was obtained from https://basketball.realgm.com and https://masseyratings.com

Silver Layer:

Next we proceed to prepare our data to be consumed by the machine learning algorithms. In this step, known as the Silver Layer, we carry out necessary transformations and perform feature engineering (select and create new features, fill out missing values) on the raw training and testing sets in order to turn them into the source of truth for our models. 

Gold Layer:

 It is now time to build our model(s), commonly known as Gold Layer, where we can define business use cases. We opted to use three classical machine learning algorithms: 

  1. Logistic Regression, 
  2. Random Forest, and 
  3. Gradient Boosted Trees 

All three algorithms are part of Databricks’ Spark native machine learning library (MLib). We developed a voting system to combine the predictions of our models. 

As a final step, we can feed our test data to the models and obtain our predictions. By following these best practices, it  allows us to easily adapt our implementation to other use cases and serve as a good template for many machine learning projects.

“Why Spark and Databricks”

Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. As we reference Spark, we also need to reference Databricks, which is the perfect platform for our project. 

Databricks provides an optimized implementation of Spark.  From the flexibility of ingesting your data from multi-cloud sources, along with a way to combine highly performant data and machine learning pipelines, Databricks has everything we need. 

While building out our Silver and Gold Layers, we leveraged Spark on Databricks to carry out tasks in parallel in an optimized and efficient way. Trying to do this using a traditional system would take a long time and would eat up costly resources.  

Specifically, we carried out complex transformations during our Silver Layer and performed cross-validation for tuning model parameters during our Gold Layer. Databricks also has Managed MLflow, which allows us to run experiments with any ML library, framework or language, and automatically keeps track of parameters, metrics, code, and models from each experiment. 

Additionally, Databricks has a smart API to schedule and automate job executions. We consolidated our full project by creating a job that allows us to easily debug, tune and modify our machine learning pipeline from top to bottom. These features provide a tremendous value for our customers due to the ease of data ingestion and the ability to track results and make real-time adjustments to large-scale machine learning pipelines.

“Watching Our Model”

As I write this blog , we are about to start the Final Four Round of the tournament.  Our test bracket is performing at a 94 percentile according to the official NCAA website and we have a 91.6% accuracy so far. 

Similar to machine learning models we build for our clients, we know that as things change and new variables come about, our model will need to change and continue to learn from every game, every transaction, every data point. Our models will get smarter, more accurate and be able to improve our decision making as they evolve.  

We are looking forward to next year’s bracket, and thanks to Databricks and Spark, we will have a better model and will spend considerably less time and resources predicting who will win it all.  

So, what do you think? Is March Madness that crazy after all?
Randy is a data consultant for Lovelytics. To learn more about Randy’s work, our machine learning capabilities, Spark, Databricks or building models to help you predict the outcome of a business challenge, connect with us by email [email protected].

Author

Related Posts

Aug 04 2025

How Lovelytics and Databricks Partnered to Migrate and Automate Databricks’ Internal Reporting to AI/BI

Introduction: What is AI/BI and Why It’s a Game-Changer For years, BI tools have helped organizations analyze and visualize data, but the landscape has shifted....
Jun 23 2025

From Productivity Paradox to GenAI Acceleration: Key Takeaways from DAIS 2025

Historical Perspective on Innovation: From Dynamos to AI Agents In the late 19th century, the promise of electrification captured the imagination of industrialists....
Jun 02 2025

Sessions You Won’t Want to Miss at Data + AI Summit 2025

The Databricks Data + AI Summit 2025 (June 9–12, San Francisco) is right around the corner, bringing together some of the world’s most innovative companies to showcase...
May 20 2025

Powering the Agentic AI Era: What the Databricks-Neon Deal Means for GenAI

Introduction At Lovelytics, we are thrilled by the recent announcement of Databricks acquiring Neon. We believe this marks a pivotal advancement in the Generative AI...
May 15 2025

Driving Smarter Marketing with Databricks’ Data Intelligence Platform

Lovelytics accelerates marketing outcomes with Databricks’ latest industry solution Marketing teams are under more pressure than ever to deliver results—and fast. With...
Mar 01 2025

Lovelytics and Databricks Deliver Data Innovation for $28B World-Renowned Healthcare Provider and Insurer

A world-renowned $28 billion healthcare provider at the forefront of research, treatment, and clinical care set out to modernize its health informatics and reporting...
Feb 25 2025

3 Key Healthcare Data and AI Trends to Watch at HIMSS 2025

The HIMSS Global Health Conference and Exhibition is where healthcare, data, and AI converge to drive the next wave of industry transformation. As healthcare...
Jan 07 2025

The Future of Manufacturing: Navigating 2025 with AI, Data, and GenAI in a Challenging Global Environment

The manufacturing sector is at the forefront of a transformative wave fueled by innovations in data, analytics, machine learning (ML), and generative AI (GenAI). These...
Dec 10 2024

Decoding the Customer DNA with Customer 360

Imagine opening your mailbox to find a glossy, impersonal letter from your bank urging you to open a new account—as if your decade of loyalty as a customer didn’t...
Dec 03 2024

Revolutionizing Healthcare with Patient360 and Operational Efficiency: Lovelytics and Databricks Deliver Data Innovation for $8B Teaching Hospital and Research University

Amid the rapidly changing healthcare landscape, a renowned U.S. teaching hospital and biomedical research university that serves over 2.8 million patients annually...