Having had some success with my NFL model, I was curious whether I could find something similar in the NBA. The NBA inputs did not parallel the NFL ones very closely, so I had to start from scratch. I could not come up with anything especially clever, so I just assembled a data set, ran some linear regressions, and tweaked them a bit to see if anything interesting turned up. Unfortunately, not much did.
Winning Drivers
My first step was to see which metrics did the best job of explaining winning percentage. The answer turned out to be pretty intuitive: net points (points scored per game minus points allowed per game).
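As a rough illustration of that first step, here is a minimal sketch that computes net points and fits a straight line to winning percentage. The team numbers are made up for the example, and `net_points` and `fit_line` are hypothetical helper names, not anything from my actual spreadsheet.

```python
from statistics import mean


def net_points(pts_scored_per_game, pts_allowed_per_game):
    """Net points: points scored per game minus points allowed per game."""
    return pts_scored_per_game - pts_allowed_per_game


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus r-squared."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Made-up team seasons: (points scored per game, points allowed per game, win pct).
teams = [
    (112.0, 104.0, 0.720),
    (108.5, 105.0, 0.610),
    (105.0, 105.5, 0.490),
    (101.0, 106.0, 0.340),
    (110.0, 109.0, 0.540),
]

net = [net_points(scored, allowed) for scored, allowed, _ in teams]
win_pct = [pct for _, _, pct in teams]

slope, intercept, r_squared = fit_line(net, win_pct)
print(f"win_pct = {intercept:.3f} + {slope:.4f} * net_points (r-squared {r_squared:.3f})")
```

With real data the fit would be noisier than this toy example, but the shape of the calculation is the same.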

The chart of winning percentage against net points says to me that net points explains winning percentage. Following that logic, I built a data set focused on net points. I took "total" net points per game and "conditional" net points (e.g. road net points for the road team, home net points for the home team). I ran a linear regression (using stepwise selection) and found that total net points was the only variable the software kept. Based on that coefficient plus the constant, I ran the following:


My r-squared is .147; the line's is .174. So I am not as good as the line (big surprise). But the difference is similar to what I saw in the NFL, where I was still able to isolate certain conditions that made a bet appealing. So I charted my winning percentage against the difference between my prediction and the line (aka the Line Gap), and found the following:

So, this looks both familiar and promising. I recently created these summaries from the largest data set I had, and was startled because they did not match my experience. My experience was that I could not isolate any population where I could reliably win. I ran a few months of mock bets and my winning percentage was almost exactly 50%. As I dug into this particular data set, I found errors. In a few of the games, the direction of the line was reversed: a heavily favored team would be incorrectly shown as a large underdog. This was not the case in all the games, but in enough of them, I believe, to make the winning percentage look better when my calculated number and the line diverge. This is a risk with cutting and pasting sports data, especially line data (historical data is hard to get). The data is sometimes just wrong. So, my next step is to clean this data set again. Based on my experience with my other, smaller data sets, I doubt the data is materially wrong overall, but I do think the errors explain the increasing winning rates for the games where my model and the line differ considerably.
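For anyone who wants to reproduce the Line Gap summary, here is a minimal sketch of the bucketing. It assumes every number is expressed as a home-team margin (my prediction, the line, and the final result), which is a convention I am picking for the example. The games are made up, and `bet_result` and `win_rate_by_gap` are hypothetical names.

```python
from collections import defaultdict


def bet_result(predicted_margin, line_margin, actual_margin):
    """Grade a bet on the side the model prefers: 'win', 'loss', 'push', or None."""
    gap = predicted_margin - line_margin
    if gap == 0:
        return None  # model agrees with the line, so there is no bet
    if actual_margin == line_margin:
        return "push"
    model_likes_home = gap > 0
    home_covered = actual_margin > line_margin
    return "win" if model_likes_home == home_covered else "loss"


def win_rate_by_gap(games, bucket_width=2.0):
    """Winning percentage per bucket of absolute Line Gap, ignoring pushes."""
    tallies = defaultdict(lambda: [0, 0])  # bucket start -> [wins, decided bets]
    for predicted, line, actual in games:
        result = bet_result(predicted, line, actual)
        if result in (None, "push"):
            continue
        bucket = int(abs(predicted - line) // bucket_width) * bucket_width
        tallies[bucket][1] += 1
        if result == "win":
            tallies[bucket][0] += 1
    return {bucket: wins / bets for bucket, (wins, bets) in sorted(tallies.items())}


# Made-up games: (my predicted home margin, line as home margin, actual home margin).
games = [
    (5.0, 4.0, 7),
    (3.0, 4.5, 6),
    (8.0, 5.0, 10),
    (-2.0, 1.0, -4),
    (6.0, 1.0, 0),
    (2.0, 2.0, 5),   # no gap, no bet
    (4.0, 3.0, 3),   # lands exactly on the line, push
]

rates = win_rate_by_gap(games)
for bucket, rate in rates.items():
    print(f"gap {bucket:.0f} to {bucket + 2:.0f} points: {rate:.0%} winners")
```

Note that a reversed line in the source data produces a huge gap, so bad rows pile up in exactly the buckets that look most attractive.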
One other line of reasoning I had was whether the population of the cities might influence the line. Think of a matchup of NY vs. Milwaukee. Assuming that a sizeable portion of the money that influences the line is "emotional" money, there ought to be more NY bettors than Milwaukee bettors, so the line should favor Milwaukee. To test this, I looked at the residual of the line vs. outcome against the population differences of the cities. If there was a bias, I would expect some slope to the trend line. Nope. None. Nada.
The chart above shows about as flat a trend line as is possible, with an r-squared of roughly zero. So, based on this, I don't see much opportunity to capitalize on any bias from the relative sizes of the markets playing each other.
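The market-size test boils down to a trend line of line residuals against population differences. Here is a minimal sketch of that check, assuming the residual is the actual home margin minus the line (as a home margin) and the population difference is home metro minus away metro, in millions. The rows are invented and deliberately built to show a flat result; `slope_and_r2` is a hypothetical helper name.

```python
from statistics import mean


def slope_and_r2(xs, ys):
    """Slope and r-squared of the least-squares trend line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Invented rows: (line as home margin, actual home margin,
#                 home metro population in millions, away metro population in millions).
rows = [
    (-3.0, -2.0, 1.5, 9.5),
    (2.0, 0.0, 2.0, 6.0),
    (1.0, 3.0, 4.0, 4.0),
    (5.0, 3.0, 6.0, 2.0),
    (6.0, 7.0, 9.5, 1.5),
]

residuals = [actual - line for line, actual, _, _ in rows]
pop_diffs = [home_pop - away_pop for _, _, home_pop, away_pop in rows]

slope, r_squared = slope_and_r2(pop_diffs, residuals)
print(f"slope {slope:.3f} points per million people, r-squared {r_squared:.3f}")
```

A bias toward big-market teams would show up as a slope clearly different from zero; a flat line like this one is what "no bias" looks like.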
Next Steps
- Clean up the data, re-check winning rates.
- Develop a panel of quantitative measures of teams, weighted by predictive value.
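For the data clean-up, one way to hunt for the reversed lines described earlier is to flag games where the line and the teams' net points disagree strongly about who the favorite is. This is only a sketch under my own assumptions: both numbers are home-minus-away, the 6-point threshold is arbitrary, the games are made up, and `flag_suspect_lines` is a hypothetical name. Flagged games still need to be checked by hand against another source.

```python
def flag_suspect_lines(games, min_points=6.0):
    """Return ids of games where net points and the line point in opposite directions.

    Each game is (game_id, net_points_diff, line_margin), both home minus away.
    A game is flagged only when both numbers are large and have opposite signs,
    e.g. a heavy favorite by net points that the data shows as a big underdog.
    """
    suspects = []
    for game_id, net_points_diff, line_margin in games:
        opposite_sign = net_points_diff * line_margin < 0
        both_large = abs(net_points_diff) >= min_points and abs(line_margin) >= min_points
        if opposite_sign and both_large:
            suspects.append(game_id)
    return suspects


# Made-up games: (id, home net points minus away net points, line as home margin).
games = [
    ("g1", 9.0, 8.5),     # consistent: strong home team, big home favorite
    ("g2", 8.0, -9.0),    # suspicious: strong home team shown as big underdog
    ("g3", -7.5, 7.0),    # suspicious: weak home team shown as big favorite
    ("g4", 1.0, -2.0),    # signs differ, but both small, so not flagged
    ("g5", -10.0, -11.0), # consistent: weak home team, big home underdog
]

suspects = flag_suspect_lines(games)
print("check these by hand:", suspects)
```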