where jay's nerdiness & the world meet  
  

 
 
 

 
 
Sports Betting: MLB

Of all the sports I have studied, baseball is probably the one I understand best. It is also the most data intensive, so it seemingly creates the best opportunity to really break down the sport. However, this data intensity also creates the need for robust datasets, the kind beyond my abilities in Excel. I did assemble a fairly thorough dataset (data fields summarized below), but it covered only 220 games. I hope someday to assemble a season's worth of data and see if I can find the relationships I need to outperform the line. I did get a friend to do some database programming to get the data I need, but he took a job in the middle of the project, so it stalled.

Winning Drivers

Based on my analysis of baseball, I know OPS is a very helpful metric for determining the quality of a baseball team. My basic logic was to compare each team's hitting OPS and the OPS allowed by its pitching, and see if the net result had any predictive value for outcomes. For the pitching, I used the data for the specific starting pitcher and tallied the cumulative data for each team's relief corps. If I could find some useful relationships, I could then compare the predicted values against the implied expected winning percentage in the line. I expanded these variables to include winning percentage and the opinions of people who track the games.
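As an illustration of what "implied expected winning percentage in the line" can mean, here is a minimal sketch. It assumes the line is quoted as an American moneyline (for example -150 / +130), which the text above does not specify; the function names and the example prices are hypothetical.

```python
def implied_probability(moneyline: int) -> float:
    """Convert an American moneyline to its break-even win probability."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100
        return -moneyline / (-moneyline + 100)
    # Underdog: risk 100 to win moneyline
    return 100 / (moneyline + 100)


def no_vig_probabilities(home_line: int, away_line: int) -> tuple[float, float]:
    """Rescale both implied probabilities so they sum to 1.

    The raw probabilities add up to more than 1 because of the
    bookmaker's margin; dividing by their total is one simple
    (assumed) way to strip that margin out.
    """
    home = implied_probability(home_line)
    away = implied_probability(away_line)
    total = home + away
    return home / total, away / total


# Hypothetical line: home team -150, visitor +130
home_p, away_p = no_vig_probabilities(-150, 130)
print(f"home {home_p:.3f}, visitor {away_p:.3f}")
```

A model would only be useful for betting if its predicted home win probability differed from this number often enough, and in the right direction, to overcome the margin.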

Assembling this data set was very labor intensive. Each day, it involved cutting and pasting the OPS data (team hitting, individual pitching) from the web. The table below summarizes all the variables I created, assembled into a data set and ran against the "net OPS" outcome of the game. My logic was to assemble an intuitive set of variables and then run the data set to see which ones provided the best model.

Pitching Variables

OPS Starting Pitching, Basic: OPS allowed by the starting pitching. This was driven by the statistics of the specific starter for both teams. Overall, not adjusted for road/home or park factors.
OPS Relief Pitching, Basic: OPS allowed by the relief pitching. This took an aggregate of all the relief pitching by the team. Overall, not adjusted for road/home or park factors.
Expected OPS of Pitching, Basic: The weighted average of the first two variables, weighted by the average number of innings the starter pitched per start. Overall, not adjusted for road/home or park factors.

The variables above were further adjusted to account for home vs. road splits as well as the effects of the ballpark where the game was being played. I would do the same calculations for both home and visitor, and derive a "net OPS pitching" of the two teams in three versions: basic, split adjusted, and split adjusted with the ballpark factored in.
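The expected pitching OPS calculation can be sketched as follows. This is only one reading of the description: it assumes a nine-inning game, so the starter's weight is his average innings divided by nine and the relief corps gets the remainder, and it assumes the net figure is home minus visitor. All names and numbers are hypothetical.

```python
GAME_INNINGS = 9.0  # assumption: regulation game length


def expected_pitching_ops(starter_ops: float, relief_ops: float,
                          starter_avg_innings: float) -> float:
    """Innings-weighted average of starter and relief OPS allowed."""
    # Cap the starter's share at a full game
    starter_share = min(starter_avg_innings, GAME_INNINGS) / GAME_INNINGS
    return starter_share * starter_ops + (1.0 - starter_share) * relief_ops


def net_pitching_ops(home: dict, visitor: dict) -> float:
    """Home expected OPS allowed minus visitor expected OPS allowed.

    Assumed sign convention: a negative value means the home pitching
    is expected to allow a lower OPS than the visiting pitching.
    """
    return expected_pitching_ops(**home) - expected_pitching_ops(**visitor)


# Hypothetical inputs for one game
home = {"starter_ops": 0.700, "relief_ops": 0.750, "starter_avg_innings": 6.0}
visitor = {"starter_ops": 0.780, "relief_ops": 0.720, "starter_avg_innings": 5.4}
print(round(net_pitching_ops(home, visitor), 4))
```

The split-adjusted and park-adjusted versions would feed different OPS inputs into the same calculation.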

Hitting Variables

OPS Hitting, Basic: OPS of the home team's hitting - OPS of the visitors' hitting, no adjustments for road/home or park factors.
OPS Hitting, Split Adjusted: OPS of the home team when they play at home - OPS of the visiting team when they play on the road.
OPS Hitting, Park Factor: OPS of the home team when they play at home (no change from above) - the visiting team's road OPS, adjusted for the home team's ballpark.

Winning Percentage/Winning Metrics

Winning Percentage: Home team's winning percentage - visiting team's winning percentage
Winning Percentage, Split Adjusted: Home team's home winning percentage - visiting team's road winning percentage
Winning Percentage, L10: Home team's winning percentage, last 10 games - visiting team's winning percentage in last 10 games
Winning Percentage, L30: Home team's winning percentage, last 30 games - visiting team's winning percentage in last 30 games
Winning Percentage, Road/Home, L10: Home team's winning percentage in L10 games at home - visiting team's winning percentage in L10 road games
Winning Percentage, Road/Home, L30: Home team's winning percentage in L30 games at home - visiting team's winning percentage in L30 road games
Net Runs: Home team's net runs - visiting team's net runs
Net Runs, Road/Home: Home team's net runs at home - visiting team's net runs on the road
Net Runs, L10: Home team's net runs in L10 - visiting team's net runs in L10
Net Runs, L30: Home team's net runs in L30 - visiting team's net runs in L30
Net Runs, Road/Home, L10: Home team's net runs in L10 at home - visiting team's net runs in L10 on road

 

Third Party Variables

A new site launched recently (www.wagerline.com) that is a very helpful source for game information, especially odds. In addition, it tracks the picks of its users. I decided to track this data to see if it would help the model as well.

Wagerline, All: The % of all Wagerline subscribers favoring the home team.
Wagerline, Top 10: The % of Wagerline subscribers ranked in the top 10% at predicting the team's outcomes who favored the home team.
Wagerline, Experts: The % of Wagerline subscribers qualifying as "team experts" who favored the home team.

Results

For all this work, I did not get much: counter-intuitive relationships and little explanatory value. My data set was only 220 games. One of my ideas was to build this data set out for an entire season and see if I can find some meaningful relationships.
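For reference, "explanatory value" for a single variable can be checked with a one-variable least-squares fit and its R-squared. The sketch below is not the method used for the analysis above, which was done in Excel; the function name, the choice of run margin as the outcome, and the sample numbers are all hypothetical.

```python
def fit_line(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y explained by the fitted line
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_resid / ss_total


# Hypothetical sample: net hitting OPS (home - visitor) vs. home run margin
net_ops = [0.050, -0.020, 0.010, -0.060, 0.030]
run_margin = [2.0, -1.0, 3.0, -2.0, -1.0]
slope, intercept, r2 = fit_line(net_ops, run_margin)
print(f"slope={slope:.2f}, R^2={r2:.2f}")
```

A slope with the wrong sign would be a "counter-intuitive relationship", and an R-squared near zero would be "little explanatory value". With only 220 games, either can easily come from noise.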