Last year it occurred to me that no one ever picks a perfect bracket. I asked myself what the probability of picking a perfect bracket would actually be. Starting in 1985, the NCAA Tournament consisted of 63 games, played by 64 teams (ignoring the play-in game created in 2001). A quick and dirty answer is 1/2^63, or roughly the probability of correctly predicting which side a coin would land on 63 times in a row. That’s roughly 1 in 9 quintillion, or 1 in 9 million trillion. I don’t care how many brackets you fill out. We’re not going to see that one anytime soon.
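As a quick check of that arithmetic, here is a minimal Python sketch. It assumes every one of the 63 games is an independent fair coin flip, which is exactly the quick and dirty assumption above and nothing more.

```python
# Number of equally likely ways 63 coin-flip games can turn out.
GAMES = 63
outcomes = 2 ** GAMES

# Chance that one randomly filled bracket matches all 63 results.
p_perfect = 1 / outcomes

print(f"{outcomes:,} possible brackets")
print(f"P(perfect bracket) = {p_perfect:.3e}")
```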
But surely that’s not right. Every year, someone gets decently close to filling out a perfect bracket. Is there a better way to predict winners? Sure. How about using the seeding system? After all, a number one team is almost unequivocally better than a number sixteen team. How much better? I went and found out. Over a surprisingly short few hours, I was able to record every match-up between seeded teams in NCAA Tournaments since 1985. 25 years of data, 63 games per year, 1,575 total games. What resulted is a table of the probabilities of each seed beating every other seed, so long as such match-ups occurred.
A few fun facts:
-A #1 seed has beaten the #16 seed every time they played (100 of 100). A #1 seed beats every other seed more often than it loses, except against #11 seeds. In the four times those seeds have matched up, the #1 seed only prevailed half of the time.
-#8 seeds often upset higher ranked teams. In the four match-ups against #2 teams, #8 teams won twice. They have also beaten #4 and #5 seeds more often than they lost to them.
-Surprisingly (to me), #11 teams have beaten #3 teams 29% of the time (8 of 28). That means this model predicts one upset by a #11 team in the first round of every tournament.
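For readers who want to see how counts like these become probabilities, here is a small Python sketch. It uses only the win and game counts quoted in the facts above; the dictionary layout and the names `records` and `win_probability` are my own illustration, not the layout of the Excel or Stata files.

```python
# (winning seed, losing seed): (wins, games played), using only the
# counts quoted in this post; the full table has many more match-ups.
records = {
    (1, 16): (100, 100),
    (1, 11): (2, 4),
    (8, 2): (2, 4),
    (11, 3): (8, 28),
}

def win_probability(wins, games):
    """Historical share of games won by the first seed in the pair."""
    return wins / games

probabilities = {pair: win_probability(w, g) for pair, (w, g) in records.items()}

for (seed_a, seed_b), p in probabilities.items():
    print(f"#{seed_a} beats #{seed_b}: {p:.2f}")
```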
Disclaimer about statistics: In real life the probability of one team beating another team is independent of what happened in the past. And the number of observations we have for each seed is limited. But I am not interested in causation, only prediction. And it is surely a better predictor than flipping a coin for each game.
Below are the historical probabilities. Match up row first and then column to see the probability of the row seed beating the column seed. A “2” in the column and a “One” in the row gives a score of .44, meaning a #2 team beats a #1 team 44% of the time.
I use these probabilities as part of a stochastic process to make my picks for the NCAA Tournament. For example, as mentioned above, #11 teams have historically beaten #3 teams 29% (.29 in decimal form) of the time. What I do is generate a random number between 0 and 1 from a uniform distribution, and if that number is below .29, I choose the #11 seed. If it is above .29, I choose the #3 team. Repeat for each game until you have a complete bracket.
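The single-game step described above can be sketched in a few lines of Python. This is an illustration only, not the C++ or Stata code I actually ran; the function name `pick_winner` and the fixed random seed are assumptions made for the example.

```python
import random

def pick_winner(seed_a, seed_b, p_a_wins, rng=random):
    """Return seed_a with probability p_a_wins, otherwise seed_b."""
    draw = rng.random()  # uniform on [0, 1)
    return seed_a if draw < p_a_wins else seed_b

# Example from the post: #11 has beaten #3 in 29% of their meetings.
rng = random.Random(2010)  # fixed seed (arbitrary) so the run is repeatable
picks = [pick_winner(11, 3, 0.29, rng) for _ in range(10_000)]
share_11 = picks.count(11) / len(picks)
print(f"#11 picked in {share_11:.1%} of simulated games")
```

Over many draws the #11 seed should be picked in roughly 29% of games, which is the whole point: the upsets show up about as often as they have historically.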
As for my bracket, well, I would not have chosen it if I had not used historical probabilities. My Final Four ended up as #1 Kansas v. #3 Pittsburgh and #8 Texas v. #2 Villanova, with Kansas and Villanova advancing, and Villanova winning the whole thing. I also have #12 Utah State playing #13 Siena in the South region and #11 Washington playing #14 Montana in the East region. Here it is:
Want to play with the probabilities? Click here for the STATA (.dta) file or click here for the Excel file. The Excel file shows the number of wins and total games played between each seed. The STATA file only shows probabilities. (I have to thank Avinash Vora for hosting them and encouraging me to do this. Thanks also to James Somers for being a sounding board.)
Information on President Obama’s bracket is here. We’ll see who does better.
Last year I coded this information in C++ and actually ran predictions for a few hours at a time on my computer. I could predict perfect first and second rounds (after several million iterations), but never got a perfect third round. Getting that would have required generating more than a few billion iterations. This year I put the probabilities into STATA and quickly spit out random numbers based on individual match-ups. There is a more efficient way to do this. In fact, I would be thrilled if someone would take this data and turn it into an easy-to-use bracket generator. I’ll leave that to the sports-minded hackers out there.
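To give those hackers a starting point, here is a rough Python sketch of a generator for a single 16-team region. Everything in it is hypothetical scaffolding: the names `REGION_ORDER`, `P_BEATS`, `p_beats` and `simulate_region` are mine, the table holds only the handful of figures quoted in this post, and the coin-flip fallback for match-ups with no data is an assumption you would want to replace with the full table.

```python
import random

# Standard first-round order within a 16-team region: 1 v 16, 8 v 9, ...
REGION_ORDER = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]

# P(first seed beats second seed). Only the figures quoted in the post are
# filled in; the real table from the Excel/Stata files would go here.
P_BEATS = {(1, 16): 1.00, (11, 3): 0.29, (2, 1): 0.44, (8, 2): 0.50, (1, 11): 0.50}

def p_beats(a, b):
    """Look up P(a beats b); fall back to a coin flip (an assumption)."""
    if (a, b) in P_BEATS:
        return P_BEATS[(a, b)]
    if (b, a) in P_BEATS:
        return 1.0 - P_BEATS[(b, a)]
    return 0.5

def simulate_region(rng):
    """Play four rounds and return the winners of each round."""
    teams, rounds = REGION_ORDER, []
    while len(teams) > 1:
        # Adjacent entries play each other; winners keep their bracket order.
        teams = [a if rng.random() < p_beats(a, b) else b
                 for a, b in zip(teams[::2], teams[1::2])]
        rounds.append(teams)
    return rounds

rounds = simulate_region(random.Random())
for number, winners in enumerate(rounds, start=1):
    print(f"Round {number} winners: {winners}")
```

Run it four times for the four regions, then play the regional winners against each other the same way, and you have a complete bracket. Each run will look different.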
There is much more to say about the mechanics of using these probabilities for picking games, but I’ll leave that for another time.


Oh Tom.. Only you could do this…
By: David on March 17, 2010
at 6:58 pm
First of all I appreciate the data crunching. Clearly you’ve done your homework.
Just a couple things I’d like to point out (or perhaps you can explain it to me.)
“Surprisingly (to me), #11 teams have beaten #3 teams 29% of the time (8 of 28). That means this model predicts one upset by a #11 team in the first round of every tournament.”
How do you extrapolate that? If a #11 team beats a #6 seed (a regular season top 25 team) and then a #3 seed (top 12) there’s a damn good chance that they’re severely underrated.
How does the performance of a great number 11 seed in round two have anything to do with how the average 11 seed plays in round 1?
Just looking at your bracket, I see some serious data driven mishaps IMO.
The one that really stands out to me is BYU. IMO they are by far the most under-seeded team in the tournament. I think they’re a good bet to surprise folks and end up in the Sweet 16 or Elite 8. Florida has more talent, so I guess BYU might lose. But they’re a great basketball team. Their sum is better than their parts and I think they go far.
By: Blur on March 18, 2010
at 1:21 am
this will just make it that much sweeter when i beat you this season.
By: Ryan Donohue on March 18, 2010
at 2:12 am
What if it takes 65% to get a passing score? All true would then be an automatic failure. Of course the method suggested is nonsense; I’m not defending it in general.
By: Donald on March 18, 2010
at 5:32 am
I just came across this on the Hacker News RSS feed. I know nothing at all about basketball, but for kicks I’m playing your bracket in our office pool. For the combined final game score I just averaged the last 10 years’ tournaments: 148
By: Don on March 18, 2010
at 7:24 am
Don,
Good luck with the bracket! You’d be surprised how different each one looks as you use this process. Eventually, I’ll get someone to make it a one-step process to generate new brackets.
-TC
By: Tom Church on March 18, 2010
at 8:53 am
You should really post the data in a more easily used format. The raw data would be useful. Maybe you’d weight the certainty of the prediction based on how often the matchup has occurred. Anything better than the stata or XLSX file you’ve posted?
By: Chris on March 18, 2010
at 10:30 am
This may be a naive question, but why did you use a uniform distribution as opposed to a normal distribution? To me it seems as if a normal distribution picks outliers more appropriately. Great post, I really enjoyed it.
By: Reid on March 18, 2010
at 11:16 am