Identifying some issues with my models that affected past results.
So, I screwed up.
This is not unexpected: this was year 1 of rolling out the new models, so there were bound to be some mistakes. But it's important that I'm transparent about what went wrong and what it affected, so that you all know you can trust me.
The issues were with how Minnow calculated its efficiency metrics. Since Branchy also uses those metrics, the issue affected Branchy as well. For those of you who don't care about the math, feel free to skip to the next section, where I break down how the fix affected the data. Essentially, I was failing to properly anchor my parameters, and I wasn't stabilizing my alpha hyperparameter. That's why the issue became apparent when I went to update my rankings for the Sweet 16: the values all shifted.
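I'm not going to reproduce Minnow's actual fitting code here, but for the curious, here's a minimal Python sketch of what "anchoring" and "stabilizing" mean in this context, assuming an iterative ratings solver. The function names and constants below are mine for illustration, not Minnow's actual internals:

```python
import numpy as np

def anchor_ratings(off, deff, league_avg):
    """Re-center offensive/defensive efficiency ratings each iteration so the
    league average stays fixed. Without an anchor like this, the whole rating
    scale can drift between fits, which is one way values "shift" on refresh."""
    off = off - off.mean() + league_avg
    deff = deff - deff.mean() + league_avg
    return off, deff

def stabilize_alpha(alpha, prior=0.5, shrink=0.9, lo=0.0, hi=1.0):
    """Shrink alpha toward a prior and clamp it to a sane range,
    so no single update can swing the hyperparameter wildly."""
    alpha = shrink * alpha + (1 - shrink) * prior
    return min(max(alpha, lo), hi)

# The anchored ratings keep a stable mean no matter the raw inputs:
off, deff = anchor_ratings(np.array([120.0, 100.0, 80.0]),
                           np.array([95.0, 105.0, 100.0]), 100.0)
print(off.mean(), deff.mean())  # → 100.0 100.0
```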
First, let’s look at how the changes affect the results of the 2025 MLBR (Machine Learning Battle Royale). Since Minnow and Branchy were both affected, the composite was naturally affected as well; every other value remains the same. Next to the rank, I include in parentheses the change in rank from the old evaluation: (+1) means the model moved up one spot, and (-1) means it moved down one (+ is a better rank, - is a worse rank).
Accuracy:

| Rank | Model | Old Score | New Score | Diff |
|---|---|---|---|---|
| 1 | Branchy Brackets | 88.89% | 90.48% | +1.59% |
| 2 (+1) | Composite | 82.54% | 85.71% | +3.17% |
| 3 (-1) | Torvik | 84.13% | NA | NA |
| 4 (-1) | Resumetric | 82.54% | NA | NA |
| T5 (+1) | Minnow | 79.37% | 80.95% | +1.58% |
| T5 | MNPI | 80.95% | NA | NA |
| 7 | Chalk | 77.78% | NA | NA |
Log loss (lower is better):

| Rank | Model | Old Score | New Score | Diff |
|---|---|---|---|---|
| 1 | Branchy Brackets | 0.3494 | 0.3509 | +0.0015 |
| 2 | Resumetric | 0.3628 | NA | NA |
| 3 | Composite | 0.3753 | 0.3821 | +0.0068 |
| 4 (+1) | Torvik | 0.4047 | NA | NA |
| 5 (-1) | Minnow | 0.3846 | 0.4072 | +0.0226 |
| 6 | MNPI | 0.4383 | NA | NA |
| 7 | Chalk | 2.0468 | NA | NA |
(For context, 0.693 is the log loss of a random classifier)
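That 0.693 is just the natural log of 2: a "model" that calls every game a coin flip pays -ln(0.5) per game no matter what happens. A quick sketch, with a from-scratch log loss rather than any library's version:

```python
import math

def log_loss(y_true, p_pred):
    """Mean negative log-likelihood of predicted win probabilities.
    y_true: 1 if the predicted team won, else 0. Lower is better."""
    eps = 1e-15  # guard against log(0) on hard 0/1 predictions
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A random classifier says 50% every game and scores ln(2) regardless of outcomes:
outcomes = [1, 0, 1, 1, 0]
print(round(log_loss(outcomes, [0.5] * len(outcomes)), 3))  # → 0.693
```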
Tourney points:

| Rank | Model | Old Score | New Score | Diff |
|---|---|---|---|---|
| 1 | Branchy Brackets | 1760 | 1790 | +30 |
| 2 (+1) | Composite | 1630 | 1760 | +130 |
| 3 (-1) | Resumetric | 1740 | NA | NA |
| 4 | Torvik | 1270 | NA | NA |
| 5 | Minnow | 1140 | 1150 | +10 |
| 6 | Chalk | 1090 | NA | NA |
| 7 | MNPI | 1060 | NA | NA |
Expected tourney points:

| Rank | Model | Old Score | New Score | Diff |
|---|---|---|---|---|
| 1 | Resumetric | 1114.85 | NA | NA |
| 2 | Branchy Brackets | 1079.35 | 1079.66 | +0.31 |
| 3 | Composite | 955.24 | 944.28 | -10.96 |
| 4 | Minnow | 904.86 | 865.30 | -39.56 |
| 5 | MNPI | 764.66 | NA | NA |
Most of the differences were subtle, but a couple stood out. In particular, the composite saw a moderate leap in accuracy and a huge leap in tourney points. I’m going to chalk that up to random chance more than anything analytically significant, since the difference comes down to a few predicted probabilities shifting just barely past 50%. We can confirm that suspicion by noting that the composite’s expected tourney points actually decreased (if only slightly). However, if the composite performs similarly well after this year’s tournament upon review, then we can start to sing its praises.
At first glance, the updated scores might seem a bit concerning: all three affected models scored worse in log loss, which I identified in my original post as likely the most important metric, and Minnow and the composite both lost expected tourney points. We have to remember that we’re looking at a relatively small sample size, and that the 2025 tournament was exceptionally chalky. The issues I fixed in Minnow had a few different practical effects, but one of them was that the old, buggy version was more confident in favorites. That served it well in a tourney as chalky as 2025, but I remain confident that a more balanced approach will benefit it in the long run, especially since the fixes are well-founded and principled. I’ll keep an eye on it moving forward, and if the old versions are routinely outperforming the new ones once we’ve built up a dataset of 3-5 years, then we can contemplate switching back, and perhaps do some research into why having untethered parameters benefits these models.
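To see why extra confidence in favorites pays off only when favorites keep winning, here's a quick per-game illustration (the 80%/60% probabilities are hypothetical, not Minnow's actual outputs):

```python
import math

def game_log_loss(y, p):
    """Log loss contribution of one game: outcome y (1 = predicted team won),
    predicted win probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident model (80%) vs cautious model (60%) on the same favorite:
print(round(game_log_loss(1, 0.80), 3))  # favorite wins: confident pays 0.223
print(round(game_log_loss(1, 0.60), 3))  # favorite wins: cautious pays 0.511
print(round(game_log_loss(0, 0.80), 3))  # upset: confident pays 1.609
print(round(game_log_loss(0, 0.60), 3))  # upset: cautious pays 0.916
```

In a chalky year with few upsets, the confident model collects the small penalty almost every game; the occasional upset is where the cautious model earns its keep.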
There are changes up and down the rankings, with Minnow in particular (and, to a lesser extent, Branchy) now slightly more favorable to underdogs, but the most obvious changes are at the top, so let’s focus on that:
| Team | Minnow | Branchy |
|---|---|---|
| Duke | 30% (-1%) | 42% (-10%) |
| Arizona | 12% (+0%) | 36% (+13%) |
| Michigan | 19% (-3%) | 10% (-2%) |
| The Field | 39% (+4%) | 12% (-1%) |
As expected, the models flattened out somewhat, with Minnow shifting some confidence away from the favorites and back to the field.
As for Branchy, honestly, who knows what that model is thinking. This might not inspire a lot of faith in me, but your guess is as good as mine when it comes to why that model decided to start liking Arizona more. The only inputs that changed for Branchy were the adjusted efficiency metrics provided by Minnow. My best guess is that those subtle differences taught Branchy new interactions that snowballed into a big change at the top.
Here’s a breakdown of the changes to Minnow rankings further down the list:
| # | Team | Old Rank | New Rank | Change |
|---|---|---|---|---|
| 1 | Maryland | 134 | 182 | 🔻 48 |
| 2 | Utah | 122 | 167 | 🔻 45 |
| 3 | Penn St. | 142 | 181 | 🔻 39 |
| 4 | Navy | 163 | 127 | 🔺 36 |
| 5 | Rutgers | 144 | 179 | 🔻 35 |
| 6 | Georgia Tech | 158 | 186 | 🔻 28 |
| 7 | East Tennessee St. | 150 | 123 | 🔺 27 |
| 8 | Oregon | 103 | 130 | 🔻 27 |
| 9 | Boston College | 149 | 176 | 🔻 27 |
| 10 | Austin Peay | 183 | 158 | 🔺 25 |
| # | Team | Old Rank | New Rank | Change |
|---|---|---|---|---|
| 1 | Howard | 225 | 192 | 🔺 33 |
| 2 | Siena | 186 | 154 | 🔺 32 |
| 3 | UMBC | 202 | 172 | 🔺 30 |
| 4 | North Dakota St. | 130 | 103 | 🔺 27 |
| 5 | Tennessee St. | 168 | 147 | 🔺 21 |
| 6 | High Point | 76 | 56 | 🔺 20 |
| 7 | LIU | 233 | 215 | 🔺 18 |
| 8 | Hawaii | 121 | 104 | 🔺 17 |
| 9 | Cal Baptist | 133 | 116 | 🔺 17 |
| 10 | Idaho | 177 | 160 | 🔺 17 |
| 11 | Wright St. | 151 | 135 | 🔺 16 |
| 12 | McNeese St. | 66 | 52 | 🔺 14 |
| 13 | Troy | 140 | 126 | 🔺 14 |
| 14 | Akron | 56 | 45 | 🔺 11 |
| 15 | UCF | 51 | 62 | 🔻 11 |
| 16 | Miami OH | 81 | 70 | 🔺 11 |
| 17 | Northern Iowa | 68 | 58 | 🔺 10 |
| 18 | Furman | 174 | 164 | 🔺 10 |
| 19 | Prairie View A&M | 323 | 313 | 🔺 10 |
| 20 | Queens | 203 | 194 | 🔺 9 |
| 21 | Hofstra | 79 | 71 | 🔺 8 |
| 22 | Lehigh | 281 | 273 | 🔺 8 |
| 23 | Penn | 129 | 136 | 🔻 7 |
| 24 | Saint Louis | 31 | 25 | 🔺 6 |
| 25 | Utah St. | 33 | 28 | 🔺 5 |
| 26 | Santa Clara | 40 | 35 | 🔺 5 |
| 27 | South Florida | 45 | 40 | 🔺 5 |
| 28 | Missouri | 54 | 59 | 🔻 5 |
| 29 | Connecticut | 17 | 13 | 🔺 4 |
| 30 | Virginia | 18 | 14 | 🔺 4 |
| 31 | VCU | 42 | 38 | 🔺 4 |
| 32 | Texas | 38 | 42 | 🔻 4 |
| 33 | Gonzaga | 11 | 8 | 🔺 3 |
| 34 | Arkansas | 13 | 16 | 🔻 3 |
| 35 | Saint Mary’s | 25 | 22 | 🔺 3 |
| 36 | Wisconsin | 27 | 30 | 🔻 3 |
| 37 | Ohio St. | 28 | 31 | 🔻 3 |
| 38 | SMU | 36 | 39 | 🔻 3 |
| 39 | Michigan St. | 9 | 11 | 🔻 2 |
| 40 | Texas Tech | 15 | 17 | 🔻 2 |
| 41 | Alabama | 16 | 18 | 🔻 2 |
| 42 | Kansas | 19 | 21 | 🔻 2 |
| 43 | Kentucky | 22 | 24 | 🔻 2 |
| 44 | North Carolina | 24 | 26 | 🔻 2 |
| 45 | N.C. State | 30 | 32 | 🔻 2 |
| 46 | Villanova | 39 | 41 | 🔻 2 |
| 47 | Purdue | 8 | 9 | 🔻 1 |
| 48 | Tennessee | 14 | 15 | 🔻 1 |
| 49 | St. John’s | 20 | 19 | 🔺 1 |
| 50 | Nebraska | 21 | 20 | 🔺 1 |
| 51 | Iowa | 26 | 27 | 🔻 1 |
| 52 | Miami FL | 34 | 33 | 🔺 1 |
| 53 | Clemson | 35 | 34 | 🔺 1 |
| 54 | UCLA | 37 | 36 | 🔺 1 |
| 55 | Texas A&M | 43 | 44 | 🔻 1 |
| 56 | TCU | 47 | 46 | 🔺 1 |
| 57 | Kennesaw St. | 164 | 165 | 🔻 1 |
| 58 | Duke | 1 | 1 | — 0 |
| 59 | Michigan | 2 | 2 | — 0 |
| 60 | Arizona | 3 | 3 | — 0 |
| 61 | Florida | 4 | 4 | — 0 |
| 62 | Houston | 5 | 5 | — 0 |
| 63 | Illinois | 6 | 6 | — 0 |
| 64 | Iowa St. | 7 | 7 | — 0 |
| 65 | Vanderbilt | 10 | 10 | — 0 |
| 66 | Louisville | 12 | 12 | — 0 |
| 67 | BYU | 23 | 23 | — 0 |
| 68 | Georgia | 29 | 29 | — 0 |