The ShotLink system provides detailed shot-by-shot location data for PGA Tour events, and these data have driven many groundbreaking advances in golf analytics. The strokes gained metric, which measures how a golfer's shots compare to an expectation given ball location and other information, headlines these advances.
This notebook contains an effort to replicate the tee shot portion of strokes gained. That is, given the distance of the tee from the hole, what is the expected number of strokes needed to hole out?
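For concreteness, strokes gained for a single shot is the expected strokes to hole out from the starting location, minus the expected strokes from the finishing location, minus the one stroke actually taken. A minimal sketch (the helper below is this notebook's own illustration, not a library function):

# Strokes gained for one shot: expected strokes before the shot, minus
# expected strokes after it, minus the one stroke actually taken.
def strokes_gained(expected_before, expected_after):
    return expected_before - expected_after - 1.0

# A tee shot from a spot expected to take 4.1 strokes that finishes in a
# spot expected to take 2.9 strokes gains about 0.2 strokes.
print(strokes_gained(4.1, 2.9))  # ~0.2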
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import scipy
import scipy.stats as st
import xarray as xr
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
# Load the filtered hole-level tee shot data
hole_df = pd.read_csv("/content/drive/MyDrive/Golf/drive.csv")
hole_df
| | event_id | year | player_id | round_num | hole | strokes | hole_number | par | distance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | R2021464 | 2021 | 1810 | 1 | 1 | 3 | 1 | 4 | 436 |
| 1 | R2021464 | 2021 | 10809 | 1 | 1 | 4 | 1 | 4 | 436 |
| 2 | R2021464 | 2021 | 12510 | 1 | 1 | 5 | 1 | 4 | 436 |
| 3 | R2021464 | 2021 | 12716 | 1 | 1 | 4 | 1 | 4 | 436 |
| 4 | R2021464 | 2021 | 19803 | 1 | 1 | 5 | 1 | 4 | 436 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 192936 | R2022541 | 2022 | 52375 | 4 | 18 | 4 | 18 | 4 | 512 |
| 192937 | R2022541 | 2022 | 52666 | 4 | 18 | 4 | 18 | 4 | 512 |
| 192938 | R2022541 | 2022 | 52686 | 4 | 18 | 5 | 18 | 4 | 512 |
| 192939 | R2022541 | 2022 | 55182 | 4 | 18 | 5 | 18 | 4 | 512 |
| 192940 | R2022541 | 2022 | 55955 | 4 | 18 | 5 | 18 | 4 | 512 |

192941 rows × 9 columns
These data have been filtered, cleaned, and aggregated from the raw shot-by-shot data. With observations at the golfer-hole level, taking the mean strokes on each hole allows models to be fit at the hole level, with each observation weighted by the number of golfers who played the hole.
# Aggregate to the hole level: mean strokes, par, and distance, plus the
# number of golfers who played each hole (used later as a sample weight)
avg_df = (
    hole_df.groupby(["event_id", "year", "round_num", "hole"])
    .agg(strokes=("strokes", "mean"), par=("par", "mean"),
         distance=("distance", "mean"), count=("strokes", "count"))
    .reset_index()
)
avg_df[(avg_df["event_id"]=="R2021464") & (avg_df["round_num"]==1) & (avg_df["hole"]==1)]
| | event_id | year | round_num | hole | strokes | par | distance | count |
|---|---|---|---|---|---|---|---|---|
| 288 | R2021464 | 2021 | 1 | 1 | 3.907285 | 4.0 | 436.0 | 151 |
For the first hole of the first round of the 2021 RSM Classic, this par-4 played about a tenth of a stroke under par across 151 golfers.
Broadie reports that a piecewise polynomial fit to hole distance outperforms a simple linear regression, and Figure 1 of his paper seems to bear this out.
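As a point of reference, a fit in that spirit might look like the sketch below, which fits a separate quadratic to each par group with np.polyfit. The degree and the par-based breakpoints are assumptions for illustration, not Broadie's exact specification.

# Rough sketch of a piecewise polynomial fit: one quadratic per par value.
# np.polyfit weights unsquared residuals, so pass sqrt(count) to weight
# the squared error by the number of golfers on each hole.
piecewise_fits = {}
for par in (3, 4, 5):
    subset = avg_df[avg_df["par"] == par]
    coefs = np.polyfit(subset["distance"], subset["strokes"], deg=2,
                       w=np.sqrt(subset["count"]))
    piecewise_fits[par] = np.poly1d(coefs)

# Predicted strokes on a 440-yard par-4 under the quadratic fit
print(piecewise_fits[4](440))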
With varying intercepts and slopes apparent, fitting separate models by hole type would likely suffice. Such a framework forgoes information sharing between hole types, which might otherwise help for par-4 and par-5 holes especially. But given the sufficient sample size within each hole type, the benefit of any information sharing is likely minor. With that, let's fit an expected-strokes model for each of the par-3, par-4, and par-5 hole types, beginning by splitting the dataset.
# Hold out 25% of hole-level observations for testing
x_train, x_test, y_train, y_test = train_test_split(
    avg_df[["par", "distance", "count"]], avg_df[["strokes"]],
    test_size=0.25, random_state=42
)
# Weighted linear fit for par-3 holes
par_3_model = LinearRegression()
par_3_model.fit(x_train[x_train["par"]==3][["distance"]],
                y_train[x_train["par"]==3]["strokes"],
                sample_weight=x_train[x_train["par"]==3]["count"])
# Model parameters and MAE
print(par_3_model.coef_)
print(par_3_model.intercept_)
print("Training Set MAE: " + str(round(mean_absolute_error(y_train[x_train["par"]==3]["strokes"],
par_3_model.predict(x_train[x_train["par"]==3][["distance"]])), 3)))
print("Test Set MAE: " + str(round(mean_absolute_error(y_test[x_test["par"]==3]["strokes"],
par_3_model.predict(x_test[x_test["par"]==3][["distance"]])), 3)))
[0.00254143]
2.5686469023363045
Training Set MAE: 0.105
Test Set MAE: 0.111
# Weighted linear fit for par-4 holes
par_4_model = LinearRegression()
par_4_model.fit(x_train[x_train["par"]==4][["distance"]],
                y_train[x_train["par"]==4]["strokes"],
                sample_weight=x_train[x_train["par"]==4]["count"])
# Model parameters and MAE
print(par_4_model.coef_)
print(par_4_model.intercept_)
print("Training Set MAE: " + str(round(mean_absolute_error(y_train[x_train["par"]==4]["strokes"],
par_4_model.predict(x_train[x_train["par"]==4][["distance"]])), 3)))
print("Test Set MAE: " + str(round(mean_absolute_error(y_test[x_test["par"]==4]["strokes"],
par_4_model.predict(x_test[x_test["par"]==4][["distance"]])), 3)))
[0.00192602]
3.187367319289582
Training Set MAE: 0.134
Test Set MAE: 0.146
# Weighted linear fit for par-5 holes
par_5_model = LinearRegression()
par_5_model.fit(x_train[x_train["par"]==5][["distance"]],
                y_train[x_train["par"]==5]["strokes"],
                sample_weight=x_train[x_train["par"]==5]["count"])
# Model parameters and MAE
print(par_5_model.coef_)
print(par_5_model.intercept_)
print("Training Set MAE: " + str(round(mean_absolute_error(y_train[x_train["par"]==5]["strokes"],
par_5_model.predict(x_train[x_train["par"]==5][["distance"]])), 3)))
print("Test Set MAE: " + str(round(mean_absolute_error(y_test[x_test["par"]==5]["strokes"],
par_5_model.predict(x_test[x_test["par"]==5][["distance"]])), 3)))
[0.00326362]
2.7921565590168576
Training Set MAE: 0.151
Test Set MAE: 0.121
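To put the three fitted slopes on an interpretable scale, the snippet below (reusing the models fit above) converts each coefficient to strokes added per 100 yards of hole distance.

# Express each fitted slope as strokes added per 100 yards
for name, model in [("par 3", par_3_model), ("par 4", par_4_model),
                    ("par 5", par_5_model)]:
    print(f"{name}: {100 * model.coef_[0]:.3f} strokes per 100 yards")
# par 3: 0.254, par 4: 0.193, par 5: 0.326 given the coefficients above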
Performance on the test set is comparable to that on the training set for each hole type, so let's examine the model fits to see if they each seem reasonable.
# Visualize test set observations
fig, ax = plt.subplots()
ax.scatter(x_test[x_test["par"]==3]["distance"], y_test[x_test["par"]==3]["strokes"],
c="green", label="Par 3", alpha=0.1)
ax.scatter(x_test[x_test["par"]==4]["distance"], y_test[x_test["par"]==4]["strokes"],
c="blue", label="Par 4", alpha=0.1)
ax.scatter(x_test[x_test["par"]==5]["distance"], y_test[x_test["par"]==5]["strokes"],
c="red", label="Par 5", alpha=0.1)
# Visualize model predictions
# Par-3
dist = np.linspace(150, 250, 100)
test_df = pd.DataFrame({"distance": dist})
par_3_preds = par_3_model.predict(test_df)
ax.plot(dist, par_3_preds, "g-")
# Par-4
dist = np.linspace(300, 500, 100)
test_df = pd.DataFrame({"distance": dist})
par_4_preds = par_4_model.predict(test_df)
ax.plot(dist, par_4_preds, "b-")
# Par-5
dist = np.linspace(500, 650, 100)
test_df = pd.DataFrame({"distance": dist})
par_5_preds = par_5_model.predict(test_df)
ax.plot(dist, par_5_preds, "r-")
# Add even-par indicators
plt.hlines(3, 100, 660, 'k', '--')
plt.hlines(4, 100, 660, 'k', '--')
plt.hlines(5, 100, 660, 'k', '--')
plt.ylim(2.5, 6)
plt.title("Average Strokes Needed by Hole Distance")
plt.ylabel("Average Strokes")
plt.xlabel("Hole Distance (yards)")
ax.legend()
plt.show()
The linear fits by hole type appear reasonable. Interestingly, the linear models predict most par-3 holes to play over par and most par-5 holes to play under par, which matches intuition: par-5 holes offer more chances to recover from mistakes or poor shots, while par-3 holes aren't as forgiving. This result also matches Broadie's findings, so let's move forward with these models for now.
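As a quick check on that intuition, each linear fit can be solved for the break-even distance at which it predicts exactly even par, i.e. the distance d satisfying coef * d + intercept = par.

# Distance at which each linear model predicts exactly even par
for par, model in [(3, par_3_model), (4, par_4_model), (5, par_5_model)]:
    breakeven = (par - model.intercept_) / model.coef_[0]
    print(f"par {par}: even par at ~{breakeven:.0f} yards")
# Roughly 170, 422, and 677 yards given the coefficients printed above

Under these fits, a par-3 must be shorter than roughly 170 yards to play under par, while a par-5 would have to stretch to nearly 680 yards to play to even par, consistent with the plot.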
models = {
"par_3": par_3_model,
"par_4": par_4_model,
"par_5": par_5_model,
}
with open('/content/drive/MyDrive/Golf/drive_models.pkl', 'wb') as handle:
pickle.dump(models, handle)
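As a usage sketch, the pickled models can be reloaded and queried for expected strokes at any tee distance; the 448-yard par-4 below is a made-up example.

# Reload the saved models and score a hypothetical 448-yard par-4
with open('/content/drive/MyDrive/Golf/drive_models.pkl', 'rb') as handle:
    loaded = pickle.load(handle)
expected = loaded["par_4"].predict(pd.DataFrame({"distance": [448]}))[0]
print(f"Expected strokes on a 448-yard par 4: {expected:.2f}")  # ~4.05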