Introduction
The goal of any team, regardless of sport, is winning the National Championship. In winning, the team claims victory and earns the right to call themselves better than all other competitors. In men’s college basketball, winning means being the last team standing after the March Madness Tournament. To qualify for the tournament, a team must either win a Division I conference (32 total Division I conferences), or be selected by the NCAA Men’s Division I Basketball Committee (36 teams are chosen). In all, 68 teams compete in the tournament. Four teams are eliminated in the “play-in” round, leaving 64 teams split into four regions. In each region, there are 16 teams, and each team is assigned a seed number 1 through 16. The tournament is single-elimination, and once each region has a winner, each winner plays against the others until only one team remains. The tournament is a daunting task for any team, and once the games begin, anything can happen.
In 1982, the term “March Madness” was popularized as a name for the tournament because of its characteristic upsets and nail-biting games. Today, attempting to predict the winner of each game has become a fun challenge for fans across the nation. The odds of correctly predicting the outcome of every game by guessing are roughly 1 in 9 quintillion, but viewers tackle the challenge regardless. Individuals implement unique strategies to pick the winners: some rely on statistics, others on gut feeling, and still others on mascots or team colors. While each method experiences a varying degree of success, improved technology, alongside the understanding and application of data, may make it possible to select the March Madness Tournament winners more accurately.
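The 1-in-9-quintillion figure can be reproduced with a short calculation. The sketch below assumes every game is an independent 50/50 guess and counts only the 63 games of the 64-team bracket (the play-in games are ignored); it is an illustration of the arithmetic, not a model of real picking odds.

```python
# A 64-team single-elimination bracket needs 63 games,
# because every game eliminates exactly one team.
teams = 64
games = teams - 1

# Assumption: each game is an independent coin flip, so the number of
# equally likely brackets is 2 raised to the number of games.
possible_brackets = 2 ** games

print(f"{games} games, {possible_brackets:,} possible brackets")
```

Any knowledge that makes a pick better than a coin flip shrinks these odds considerably, which is the motivation for the analysis that follows.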
For the past few years, Kaggle has partnered with Google to host a competition around predicting the NCAA March Madness Tournament using machine learning algorithms. Kaggle’s website provides two different competitions concerning the NCAA March Madness bracket predictions: men’s basketball and women’s basketball. The analysis in this report focuses on men’s basketball and identifying which factors, if any, could potentially lead to a team’s success in the NCAA Tournament. The primary objectives are as follows: 1) determine which regular-season statistics have the largest impact on tournament outcomes and 2) show if regular-season rankings should be factored into game-pick decisions. The analysis below will not guarantee a perfect bracket, but it could result in more educated predictions for various match-ups throughout the tournament. The full code used to compile this report can be found here.
Successful Teams
Before understanding the factors that make teams successful in the March Madness Tournament, the most successful teams must be identified. Of course, the ultimate success in the tournament is cutting down the nets after winning the national championship. Below is a table showing the last 15 teams crowned champions.
| Year | Champion |
|------|----------|
| 2005 | North Carolina |
| 2006 | Florida |
| 2007 | Florida |
| 2008 | Kansas |
| 2009 | North Carolina |
| 2010 | Duke |
| 2011 | Connecticut |
| 2012 | Kentucky |
| 2013 | Louisville |
| 2014 | Connecticut |
| 2015 | Duke |
| 2016 | Villanova |
| 2017 | North Carolina |
| 2018 | Villanova |
| 2019 | Virginia |
One thing to note is that several teams have won multiple championships. Teams like Duke, North Carolina, and Kentucky are known as powerhouse basketball schools, and it is not uncommon for schools like these to go on successful runs. Successful teams attract the best players because everyone wants the best chance to win a championship. Playing for these teams also gives those players more national recognition, which can mean a higher draft position (if their performance is exemplary). To better understand which teams have won the most championships since 1985, as well as which teams have lost the championship game, a simple bar graph is constructed to compare the success of the schools.
There are a few key details in the graphs above. First, Duke has clearly been a force to be reckoned with in the tournament. Duke has the most championship wins during the selected timeframe and the second-most losses in the championship game. This means that, over a span of 35 years, Duke has been in the championship game about 25% of the time. Second, the teams that have won the national championship are more likely to have appeared in at least one other championship game. In other words, out of all the Division I basketball teams, only a small portion has demonstrated success in the tournament. This could be helpful if the factors that make these teams successful can be identified. Next, the teams with the most total wins and losses in the tournament will be visualized.
The plot on the right, which shows the teams with the most tournament losses, communicates which teams have made the most tournaments without winning the national championship. A team like Kansas has over 80 wins but also over 30 losses since 1985. Earlier, it was revealed that Kansas has won two titles since 1985 but made it to five championship games. Knowing this can be extremely valuable because Kansas experiences success early on in the tournament but seems to struggle in the final rounds. Determining the similarities and differences between a team like Kansas and a team like Duke could help in making predictions throughout the entire bracket.
Top Winning Percentages in Tournament Play
Observing the total number of wins and losses is useful, but studying winning percentage may be more beneficial. A team’s winning percentage provides a quantitative measurement for the predictive process. This measurement can be computed over different time periods, each of which gives a different perspective. For example, the plot below shows the top ten winning percentages in the tournament since 1985. However, before using these results to make predictions, remember that teams that appear in this list could have had very successful tournament runs in the 1980s and have not been relevant recently. This is not to say that the results cannot be useful, but they should be used carefully.
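As a rough illustration of how a tournament winning percentage can be computed, the sketch below tallies wins and losses from a list of (winner, loser) pairs. The function name, the `min_games` filter, and the sample results are hypothetical and are not taken from the report's code or data.

```python
from collections import Counter

def winning_percentages(games, min_games=1):
    """Return {team: wins / games played} from (winner, loser) pairs.

    min_games is a hypothetical filter that drops teams with very few
    games, since a small sample can inflate a winning percentage.
    """
    wins, losses = Counter(), Counter()
    for winner, loser in games:
        wins[winner] += 1
        losses[loser] += 1
    result = {}
    for team in set(wins) | set(losses):
        played = wins[team] + losses[team]
        if played >= min_games:
            result[team] = wins[team] / played
    return result

# Made-up results for illustration only, not real tournament data.
sample_games = [("A", "B"), ("A", "C"), ("C", "B"), ("B", "A")]
pct = winning_percentages(sample_games)
pct_filtered = winning_percentages(sample_games, min_games=3)
```

Restricting the input to games from a chosen range of seasons gives the different time-period views described above.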
One surprise revealed in the graph above is that Connecticut, not Duke, has the greatest winning percentage in the March Madness Tournament. However, earlier it was shown that Duke had nearly twice as many wins in the tournament as Connecticut. Putting the information together communicates that Duke has been a top program for a long time and, more often than not, will have a winning record in the tournament. A school like Connecticut may not participate in the tournament year after year, but when it is in the tournament, it typically wins the majority of its games. Notice that Loyola-Chicago has come out of nowhere to arrive on the list of schools with the highest winning percentages. One logical explanation is that Loyola-Chicago has not made the tournament very often and, in at least one of the years it appeared, won quite a few games without making it to the championship. Again, knowing which schools have performed well historically is helpful, but it should only be used in conjunction with more recent statistics. The plot below displays the top ten winning percentages within the last ten years.
Interestingly, Connecticut is still the top school in terms of winning percentage. South Carolina is at number two, which again is most likely the result of at least one tournament in which they won the majority of their games. Also, Texas Tech is now on the list due to their run to the championship game in 2019. While a school’s winning percentage probably won’t carry much weight as a factor, it should at least be explored in a model, as it could help in predicting tough match-ups.
The objective of this report is to identify which statistics or characteristics the most successful teams in Division I basketball share. The plots above revealed which schools have historically experienced the most success in the tournament, but it is difficult to differentiate the schools that are consistently good from the schools with one or two good years. Thus, the plot below displays the number of wins versus the number of losses in order to identify which teams have consistently been good.
The plot above gives a better picture of which teams should be highlighted when trying to determine the statistical factors that matter when picking winners and losers. Ideally, teams that have appeared in several tournaments will be more informative than teams that have only had one or two tournament appearances. Also, the teams selected need to have impressive winning records. Thus, the following ten teams were chosen:
- Connecticut
- North Carolina
- Kentucky
- Duke
- Villanova
- Louisville
- Kansas
- Michigan St
- Florida
- Michigan
These ten teams will be used to gauge which statistics have been the biggest factors separating the winning teams from the losing teams.
Identifying Team Statistics
Before comparing these ten teams, key statistics need to be identified. Instead of looking at data from 1985 to 2019, only the last 10 years will be observed. This is essential since the game of basketball changes over the years. The game used to be played much slower, with a higher emphasis on defense, but that no longer seems to be the case. The statistics below are offense-based and range from 2009 to 2019. The plots also highlight the differences between the averages of the winning and losing teams.
Starting with the average score of the winning and losing teams, there is no big surprise. The winning team must average more points because a team has to score more to win. However, the difference in score is pretty substantial. Approximately ten points separate the winning and losing team on average over the years. This connects directly with the average field goals made. In order to score more, a team has to make more field goals. Note that since 2016, teams are making many more field goals compared to past seasons. Next, the number of three pointers made has increased significantly from 2016 on. This change in offensive strategy could be attributed to the emergence of the Golden State Warriors and the success they experienced in the NBA by shooting and making three pointers at a high percentage. This style of play has permeated college basketball, and it appears that the winning teams are doing a better job of making three pointers. The number of free throws made by both the winning and losing teams has decreased as the number of three pointers increased. This could be because the potential to be fouled on a shot goes up when teams drive towards the basket. Since teams are driving towards the basket less and instead shooting three pointers, the opportunity to be fouled is reduced. Next, offensive rebounds do not seem to be helpful. One hypothesis for this result is that the winning team is making more of its shots than the losing team, which means the losing team has more opportunities for offensive rebounds. Lastly, the average number of assists is much greater for the winning teams than for the losing teams. The number of assists has increased slightly since 2016, but not significantly. Next, the defensive statistics will be examined.
The first defensive statistic shows that defensive rebounds differ substantially between the winning and losing teams (almost 10 rebounds). This makes sense because the losing team is probably missing more shots, which gives the other team a chance to secure a defensive rebound. The next statistic is not exactly a defensive statistic (in fact, it is recorded for the offensive team); however, it shows that the winning team commits fewer turnovers while forcing the other team to turn the ball over more. One of the key statistics behind turnovers is steals, where the winning team does a better job. Blocks are another statistic that can lead to turnovers. While steals and blocks seem to favor the winning team heavily, a closer look at the x-axis shows the gap is actually small: only about one block and one steal separate the winner and loser. Still, this is an average over an entire season, so it makes sense that the averages are close. Lastly, the number of personal fouls committed is much greater for the losing team. While this is partially due to sloppy defense, it could also be because the losing team intentionally fouls late in the game in order to control the game clock, which results in the losing team committing more fouls than the winning team on average.
Comparing Team Statistics
Now that there is an idea of which key statistics the winning teams share during the season, it is time to see how the chosen teams compare with the rest of the league. Also, instead of taking the entire season into account, only the second half of the season will be considered. This captures the teams that started playing their best basketball right before the tournament. The idea is that teams that win in the tournament are probably playing better before the tournament starts, and their performance carries over into the actual tournament. This could explain why small, low-seeded teams sometimes upset higher-seeded teams. The plots below display the average score per game during the second half of the season for the years 2009, 2013, 2017, and 2019.
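One way to restrict a statistic to the second half of the season is sketched below. How the report defines "second half" is not stated, so the midpoint rule used here (games played after the midpoint between the first and last game day) is an assumption, as are the function name and the made-up rows.

```python
from collections import defaultdict

def second_half_avg_score(rows):
    """Average points per game per team, using only second-half games.

    rows: list of (day, team, score) tuples for one season.
    Assumption: the "second half" is every game played after the midpoint
    between the earliest and latest game day in the data.
    """
    days = [day for day, _, _ in rows]
    midpoint = (min(days) + max(days)) / 2
    totals = defaultdict(lambda: [0, 0])  # team -> [total points, games]
    for day, team, score in rows:
        if day > midpoint:
            totals[team][0] += score
            totals[team][1] += 1
    return {team: points / n for team, (points, n) in totals.items()}

# Made-up rows for illustration only.
rows = [(10, "A", 60), (20, "A", 70), (90, "A", 80), (100, "A", 90),
        (10, "B", 55), (95, "B", 65)]
avg = second_half_avg_score(rows)
```

The same filter can be applied to any per-game statistic before averaging, not only the score.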
Based on the four graphs above, there seems to be a correlation between the average score and the number of wins. The higher a team’s average score is, the more wins that team will have. Again, the teams labeled in the plots are the teams that have the highest winning percentage in the March Madness Tournament. It appears that while they may not have the most wins or the highest average score, they almost always seem to be clustered in the top third (except for 2009, where they appeared to be more scattered). Next, the number of three pointers made will be examined.
The plots above are interesting. In 2009 and 2013, there appeared to be a slight correlation between the number of three pointers made and the number of wins. However, 2017 and 2019 do not show a correlation. This could be due to the majority of college teams playing with a similar strategy. Most teams now focus on three point shooting, which evens the playing field.
The next statistic that will be examined is the Effective Field Goal Percentage (EFG%). This metric is similar to Field Goal Percentage, which measures made shots as a share of attempted shots. However, EFG% gives extra weight to made three pointers because they are worth more points than two point shots. This is useful because it was already seen earlier that more three pointers are being made every year (meaning more three point shots are being taken), and it is well known that three point shots are more difficult than two point shots.
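To make the difference between the two percentages concrete, the sketch below uses the standard EFG% formula, (field goals made + 0.5 × three pointers made) / field goals attempted. The function names and the sample box-score numbers are hypothetical.

```python
def fg_pct(fgm, fga):
    """Field Goal Percentage: made shots divided by attempted shots."""
    return fgm / fga

def effective_fg_pct(fgm, fgm3, fga):
    """Effective Field Goal Percentage.

    Each made three pointer counts as 1.5 made shots because it is worth
    1.5 times as many points as a made two pointer.
    """
    return (fgm + 0.5 * fgm3) / fga

# Made-up box score: 25 made shots (8 of them threes) on 60 attempts.
plain = fg_pct(25, 60)
effective = effective_fg_pct(25, 8, 60)
print(f"FG% = {plain:.3f}, EFG% = {effective:.3f}")
```

A team that makes no three pointers has identical FG% and EFG%, so the gap between the two reflects how much of a team's scoring comes from beyond the arc.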
There is a correlation between EFG% and number of wins experienced during the second half of the season. The higher a team’s EFG%, the more wins that team typically will have. Looking at the ten selected teams with a high winning percentage in the tournament shows that to be successful in the tournament, a team needs to have a decent EFG%. This should be a feature when picking winners in a March Madness Bracket.
Next, the number of field goals attempted will be examined to see if this could be a useful feature in determining winners. As seen below, the number of field goals taken does not appear to relate to who wins and loses. There is not really a pattern among the selected teams either. Thus, the number of field goals attempted would probably not be very useful for determining future winners.
Seeing how important offense is to the game of basketball at the moment, it is fair to hypothesize that offensive efficiency is probably an important statistic. Offensive efficiency is an advanced statistic which can be calculated using the following equation:
\[\text{Offensive Efficiency} = \frac{\text{Possessions}}{\text{Score}} \times 100\%\]
where the number of possessions can be found with the formula:
\[\text{Possessions} = (\text{Field Goals Attempted} - \text{Offensive Rebounds}) + \text{Turnovers} + (0.44 \times \text{Free Throws Attempted})\]
Looking at the plots below, it appears that the lower the Offensive Efficiency, the more wins a team will obtain. While this seems counterintuitive, it follows from the formula. Offensive Efficiency has Possessions in the numerator and Score in the denominator. This means that if the number of possessions is greater than the score, the Offensive Efficiency will be greater than 100%. The ten teams with the highest winning percentage in the tournament had Offensive Efficiencies less than 100% (closer to 90%). This means that those teams scored more points on fewer possessions, which is a characteristic of winning teams. Now the plots below make a little more sense.
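The two formulas above translate directly into code. The sketch below follows the report's definition of Offensive Efficiency (possessions per point, times 100); the function names and the sample numbers are hypothetical.

```python
def possessions(fga, off_reb, turnovers, fta):
    """Estimated possessions from box-score totals (formula above)."""
    return (fga - off_reb) + turnovers + 0.44 * fta

def offensive_efficiency(poss, score):
    """Offensive Efficiency as defined in this report, in percent.

    Possessions are in the numerator, so lower values mean the team
    needs fewer possessions per point scored.
    """
    return poss / score * 100

# Made-up game: 60 shot attempts, 10 offensive rebounds,
# 12 turnovers, 20 free throw attempts, 75 points scored.
poss = possessions(fga=60, off_reb=10, turnovers=12, fta=20)
eff = offensive_efficiency(poss, score=75)
print(f"possessions = {poss:.1f}, offensive efficiency = {eff:.1f}%")
```

In this example the team scores more points than it uses possessions, so its Offensive Efficiency falls below 100%, matching the pattern described for the ten selected teams.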
The next advanced statistic that will be analyzed is the Offensive Rating of a team the second half of the season. The formula for Offensive Rating is:
\[\text{Offensive Rating} = \frac{\text{Score}}{\text{Possessions}} \times 100\%\]
Before even looking at the graphs, it is pretty easy to know what the results will be. The formula for Offensive Rating is the reciprocal of the one for Offensive Efficiency: Score is now in the numerator and Possessions is now in the denominator. This subtle change will result in all the plots seen for Offensive Efficiency being flipped. Now the team with the highest Offensive Rating will also have the most wins, as seen below. Since these statistics communicate nearly the same thing, only one of them will be needed as a feature.
Now, Defensive Efficiency will be observed. Again, this is an advanced statistic that must be calculated with a formula:
\[\text{Defensive Efficiency} = \frac{\text{Possessions}}{\text{Opponent's Score}} \times 100\%\]
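The claim that the two statistics carry the same information can be checked numerically. In the sketch below, the team labels and the (possessions, score) pairs are made up; it shows that ranking teams by ascending Offensive Efficiency gives the same order as ranking them by descending Offensive Rating.

```python
# Made-up (possessions, score) pairs for three hypothetical teams.
teams = {"A": (70.8, 75), "B": (68.0, 60), "C": (72.0, 80)}

# Report definitions: efficiency = possessions / score * 100,
# rating = score / possessions * 100.
efficiency = {t: p / s * 100 for t, (p, s) in teams.items()}
rating = {t: s / p * 100 for t, (p, s) in teams.items()}

# Best teams first: lowest efficiency, highest rating.
by_efficiency = sorted(teams, key=efficiency.get)
by_rating = sorted(teams, key=rating.get, reverse=True)

# Because the two are reciprocals (scaled by 100 each),
# their product is always 10,000.
products = [efficiency[t] * rating[t] for t in teams]
print(by_efficiency, by_rating)
```

Since one statistic is a monotonic transformation of the other, including both in a model would add redundancy rather than information.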
Observing the plots below, there appears to be a positive correlation between Defensive Efficiency and wins. Seeing where the selected teams fall is also valuable because it suggests that this is a key metric in determining which teams will be more likely to win in the tournament.
Lastly, the Defensive Rating average will be visualized. Offensive Efficiency and Offensive Rating were shown to be inverses of each other, and the same can be said of Defensive Efficiency and Defensive Rating. This is confirmed by plotting the results, as shown below. Since Defensive Efficiency and Defensive Rating are inverses, only one of them should be considered in a predictive model.
The following plots average all of the statistics together from 2009 to 2019, instead of showing different years. All of the trends seen earlier continue, but now the selected teams are clustered towards higher wins. Field goals attempted still shows no correlation and thus should not be utilized in a predictive model. Remember, these are statistics and wins recorded in the second half of the season, before the tournament begins. These plots suggest that teams with strong values in these statistics tend to do better in the tournament.
Statistics Impact on Tournament
Now it is time to see if the statistics recorded in the second half of the season, before the tournament starts, have any impact on who wins and loses in each round of the tournament. Each round of last year’s March Madness Tournament will be analyzed by key statistics to see which statistics do a better job of predicting who wins and who loses. The first group of plots shows the impact each statistic had on which teams won and which teams lost. The bluish/green bars represent the winning teams and the red bars represent the losing teams. Judging by the results, it appears that Offensive Efficiency, Blocks, and Assists were the best statistics to use to predict the winners in the first round. The teams that ranked higher in those statistics won more often than the teams that ranked higher in the other statistics.
Moving on to the second round of the tournament, each of the statistics was more helpful in determining the winners. The games were more predictable because the higher ranked teams in the various statistics won more. Blocks, 3pt Percentage, and Offensive Efficiency seem to be the best at predicting who would win and lose since the bluish/green bars were more stacked towards the top.
The plots below show the statistical impacts on the Sweet 16 (3rd round). Again, the higher a team’s average for a statistic, the more likely the team was to win its game. It appears that these stats play a much larger role later on in the tournament than in the first round.
The key statistics were less helpful in the Elite 8 (4th round), as seen below. In this round it appears that 3pt percentage played the largest role in determining the winner. Overall, the impacts were scattered in this round.
In the Final Four (5th Round), the statistics that correctly predicted the winners were EFG% and Defensive Efficiency. 3pt Percentage also was helpful to an extent.
Lastly, all six of the key statistics were able to predict who the national champion would be. Overall, these statistics were helpful in some rounds but not in others. Thus, relying on statistics alone will not maximize the number of correct predictions. This is not to say that the statistics above are not useful, because they most certainly could be, but they should not be the sole reasoning behind a prediction.
Ranking Systems Impact on Tournament
To improve the accuracy of predictions about which team will win in the March Madness Tournament, features other than game statistics are needed. This next section will focus on whether end-of-season rankings play any part in determining who will win or lose in each round of the tournament. Since there are countless college basketball ranking systems, there needs to be some way to select which ones to test. In this case, the three ranking systems with the most rankings will be utilized. This metric was chosen because the ranking systems with the most rankings will have the most historical data to analyze. The table below shows the ranking systems with the most rankings.

| Ranking system | Number of rankings |
|----------------|--------------------|
| SAG | 105347 |
| MOR | 104357 |
| POM | 101866 |
| DOK | 88620 |
| WLK | 85405 |
| PIG | 84386 |
| DOL | 78318 |
| MAS | 78264 |
| COL | 76676 |
| WIL | 73619 |
The three ranking systems that will be analyzed are Jeff Sagarin’s (SAG), Sonny Moore’s (MOR), and Ken Pomeroy’s (POM) since those are the three ranking systems with the most rankings.
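Selecting the systems with the most rankings amounts to counting rows per system and keeping the top three. The sketch below assumes the rankings are available as (system, team, rank) rows; the row layout, the placeholder system "XYZ", and the sample values are hypothetical and far smaller than the real data.

```python
from collections import Counter

# Made-up rows for illustration: (system name, team, ordinal rank).
rankings = [
    ("SAG", "A", 1), ("SAG", "B", 2), ("SAG", "C", 3), ("SAG", "D", 4),
    ("MOR", "A", 2), ("MOR", "B", 1), ("MOR", "C", 3),
    ("POM", "A", 1), ("POM", "B", 2),
    ("XYZ", "A", 1),
]

# Count how many rankings each system has published.
counts = Counter(system for system, _, _ in rankings)

# Keep the three systems with the most rankings.
top_three = [system for system, _ in counts.most_common(3)]
print(top_three)
```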
The first thing that will be examined is which ranking system does the best job of predicting who will win each game in the tournament. The results for the past 10 years (11 seasons) are shown below. Notice that the three ranking systems alternate as to which one predicts the games best. Overall, it appears that SAG does the best job most of the time. However, POM is the most consistent of the three ranking systems. Remember, this is historical data. It is difficult to know which of the three ranking systems will perform best for the current tournament bracket, so consistency (POM) may be especially valuable to some people making predictions.
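A simple way to score a ranking system is the share of games in which the better-ranked team won. The sketch below assumes ordinal ranks where 1 is best; the function name, the tie handling, and the sample data are hypothetical, and the report's own scoring method may differ.

```python
def ranking_accuracy(games, ranks):
    """Share of games won by the better-ranked team.

    games: list of (winner, loser) pairs.
    ranks: {team: ordinal rank}, where 1 is the best rank (assumption).
    Games between equally ranked teams are skipped because the ranking
    makes no prediction for them.
    """
    correct = 0
    counted = 0
    for winner, loser in games:
        if ranks[winner] == ranks[loser]:
            continue
        counted += 1
        if ranks[winner] < ranks[loser]:
            correct += 1
    return correct / counted if counted else float("nan")

# Made-up end-of-season ranks and tournament results.
ranks = {"A": 1, "B": 5, "C": 12, "D": 30}
games = [("A", "D"), ("C", "B"), ("A", "C")]
accuracy = ranking_accuracy(games, ranks)
print(f"accuracy = {accuracy:.3f}")
```

Running the same function once per season and per ranking system would yield the kind of year-by-year comparison described above.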
Next, the performance of each ranking system will be visualized through each round for last year’s tournament, beginning in the first round. The perfect ranking system will have all red on top and all bluish/green on the bottom. This would indicate that all the higher ranked teams beat the lower ranked teams. The plots below show that POM does the best job at predicting the lowest ranked teams losing, but the other two (MOR & SAG) do a better job at predicting the highest ranked teams winning. However, analyzing the overall performance of all three systems in the first round, it appears they are pretty evenly matched.
In the second round, POM now does the best job at predicting the highest ranked teams winning, and SAG does the best at predicting the lowest ranked teams losing. This round definitely goes to POM, though, since it made only one incorrect decision. Overall, each ranking system does well at determining games through the first two rounds of the tournament.
Below are the plots for the Sweet 16 and it appears that SAG rankings were the most useful in this round. There were three incorrect decisions for SAG this round, compared to four for POM and five for MOR. None of the ranking systems were able to predict the upset of North Carolina.
Looking at the results of the Elite 8, SAG again appears to be more accurate. While all three ranking systems had the same number of wrong predictions, the SAG system had more bluish/green towards the bottom. This means SAG had more of its higher ranked teams beat the lower ranked teams.
The plots below show the results in the Final Four. POM and SAG performed evenly in this round as they had the same higher ranked teams lose and the same lower ranked teams win.
Lastly, all three ranking systems were able to determine the national champion by their end-of-season rankings. This section showed that MOR, POM, and SAG could be useful features in a predictive model trying to determine the outcomes in the March Madness Tournament. None of the ranking systems has ever been perfect, but the results do show that the higher ranked teams win the majority of the time.
Conclusion
Hopefully, the visuals provided in this report are useful to anybody trying to predict a March Madness bracket as accurately as possible. Key statistics are important, as the teams that have won the most in the tournament all share similar performances in their statistics. However, statistics alone are not enough to predict accurate brackets. This is where ranking systems proved extremely valuable in predicting winners and losers. By combining statistics and a ranking system (MOR, POM, or SAG), a person should be able to improve their ability to predict which teams will win and which will lose.