PGA Tour Expected Strokes - Off-the-Tee¶

The ShotLink system provides detailed shot-by-shot location data for PGA Tour events, and these data have led to many groundbreaking advances in golf analytics. The strokes gained metric, a measure of how a golfer's shots compare to an expectation given ball location and other information, headlines these advances.

This notebook contains an effort to replicate the tee shot portion of strokes gained. That is, given the distance of the tee from the hole, what is the expected number of strokes needed to hole out?
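As context, strokes gained off-the-tee for a single drive follows directly from such an expected-strokes baseline: the expectation from the tee, minus the expectation from where the drive finishes, minus the one stroke taken. A minimal sketch (the baseline values here are illustrative, not fitted):

```python
def strokes_gained_ott(exp_from_tee, exp_after_drive):
    """Strokes gained off-the-tee: the baseline expectation from the tee,
    minus the baseline from the ball's finishing position, minus the one
    stroke actually used."""
    return exp_from_tee - exp_after_drive - 1

# Illustrative numbers: a par 4 with a 4.10-stroke baseline from the tee,
# where a good drive leaves a position with a 2.95-stroke expectation.
print(strokes_gained_ott(4.10, 2.95))  # ~0.15 strokes gained
```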

In [1]:
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import scipy
import scipy.stats as st
import xarray as xr
import pickle

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
In [2]:
hole_df = pd.read_csv("/content/drive/MyDrive/Golf/drive.csv")
hole_df
Out[2]:
event_id year player_id round_num hole strokes hole_number par distance
0 R2021464 2021 1810 1 1 3 1 4 436
1 R2021464 2021 10809 1 1 4 1 4 436
2 R2021464 2021 12510 1 1 5 1 4 436
3 R2021464 2021 12716 1 1 4 1 4 436
4 R2021464 2021 19803 1 1 5 1 4 436
... ... ... ... ... ... ... ... ... ...
192936 R2022541 2022 52375 4 18 4 18 4 512
192937 R2022541 2022 52666 4 18 4 18 4 512
192938 R2022541 2022 52686 4 18 5 18 4 512
192939 R2022541 2022 55182 4 18 5 18 4 512
192940 R2022541 2022 55955 4 18 5 18 4 512

192941 rows × 9 columns

These data have been filtered and aggregated from the shot-by-shot data. Data cleaning included

  • Removing shots with missing or zero distance from the hole
  • Ensuring each event-year-golfer-round-hole has a tee shot and a hole out (that is, each hole that was started also ended in a hole out)
  • Removing provisional shots

With observations at the golfer-hole level, taking the mean strokes for each hole allows models to be fit at the hole level, with each observation weighted by the number of golfers who played the hole.
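The weighting matters: a weighted mean of hole-level averages, with weights equal to the golfer counts, reproduces the overall golfer-level mean exactly. A quick check with made-up numbers:

```python
import numpy as np

# Two hypothetical holes: one played by 150 golfers averaging 3.9 strokes,
# one played by 50 golfers averaging 5.1 strokes.
golfer_strokes = np.concatenate([np.full(150, 3.9), np.full(50, 5.1)])
hole_means = np.array([3.9, 5.1])
counts = np.array([150, 50])

overall = golfer_strokes.mean()                    # golfer-level mean
weighted = np.average(hole_means, weights=counts)  # count-weighted hole-level mean
print(overall, weighted)  # both 4.2
```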

In [4]:
avg_df = (
    hole_df.groupby(["event_id", "year", "round_num", "hole"], as_index=False)
    .agg(strokes=("strokes", "mean"), par=("par", "mean"),
         distance=("distance", "mean"), count=("strokes", "count"))
)
avg_df[(avg_df["event_id"]=="R2021464") & (avg_df["round_num"]==1) & (avg_df["hole"]==1)]
Out[4]:
event_id year round_num hole strokes par distance count
288 R2021464 2021 1 1 3.907285 4.0 436.0 151

For the first hole in the first round of the RSM Classic in 2021, the par-4 played about a tenth of a stroke easier across 151 golfers.

Modeling¶

Broadie notes that a piecewise polynomial fit to hole distance outperforms a simple linear regression. Indeed, Figure 1 from his paper seems to indicate as much.

With varying intercepts and slopes apparent across hole types, fitting separate models by hole type is likely sufficient. Fitting separate models does forgo information sharing between hole types, which might be helpful for par-4 and par-5 holes especially. But given sufficient sample size within each hole type, the benefit of any information sharing is likely minor. With that, let's fit an expected strokes model for each of the par-3, par-4, and par-5 holes, beginning by splitting the dataset.

(Screenshot of Figure 1 from Broadie's paper: average strokes by distance to the hole.)

In [5]:
x_train, x_test, y_train, y_test = train_test_split(avg_df[["par", "distance", "count"]], avg_df[["strokes"]], test_size=0.25, random_state=42)

Par-3 Holes¶

In [6]:
par_3_model = LinearRegression()
par_3_model.fit(x_train[x_train["par"]==3][["distance"]],
                y_train[x_train["par"]==3]["strokes"], sample_weight=x_train[x_train["par"]==3]["count"])

# Model parameters and MAE
print(par_3_model.coef_)
print(par_3_model.intercept_)
print("Training Set MAE: " + str(round(mean_absolute_error(y_train[x_train["par"]==3]["strokes"],
                                                           par_3_model.predict(x_train[x_train["par"]==3][["distance"]])), 3)))
print("Test Set MAE: " + str(round(mean_absolute_error(y_test[x_test["par"]==3]["strokes"],
                                                       par_3_model.predict(x_test[x_test["par"]==3][["distance"]])), 3)))
[0.00254143]
2.5686469023363045
Training Set MAE: 0.105
Test Set MAE: 0.111
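The fitted line can be read directly off the printed coefficients. For example, a 200-yard par 3 comes out to roughly 3.08 expected strokes:

```python
# Slope and intercept as printed above for the par-3 model.
slope, intercept = 0.00254143, 2.5686469023363045

def expected_strokes_par3(distance_yards):
    """Expected strokes on a par 3, from hole distance in yards."""
    return intercept + slope * distance_yards

print(round(expected_strokes_par3(200), 3))  # 3.077
```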

Par-4 Holes¶

In [7]:
par_4_model = LinearRegression()
par_4_model.fit(x_train[x_train["par"]==4][["distance"]], y_train[x_train["par"]==4]["strokes"],
                sample_weight=x_train[x_train["par"]==4]["count"])

# Model parameters and MAE
print(par_4_model.coef_)
print(par_4_model.intercept_)
print("Training Set MAE: " + str(round(mean_absolute_error(y_train[x_train["par"]==4]["strokes"],
                                                           par_4_model.predict(x_train[x_train["par"]==4][["distance"]])), 3)))
print("Test Set MAE: " + str(round(mean_absolute_error(y_test[x_test["par"]==4]["strokes"],
                                                       par_4_model.predict(x_test[x_test["par"]==4][["distance"]])), 3)))
[0.00192602]
3.187367319289582
Training Set MAE: 0.134
Test Set MAE: 0.146

Par-5 Holes¶

In [8]:
par_5_model = LinearRegression()
par_5_model.fit(x_train[x_train["par"]==5][["distance"]], y_train[x_train["par"]==5]["strokes"],
                sample_weight=x_train[x_train["par"]==5]["count"])

# Model parameters and MAE
print(par_5_model.coef_)
print(par_5_model.intercept_)
print("Training Set MAE: " + str(round(mean_absolute_error(y_train[x_train["par"]==5]["strokes"],
                                                           par_5_model.predict(x_train[x_train["par"]==5][["distance"]])), 3)))
print("Test Set MAE: " + str(round(mean_absolute_error(y_test[x_test["par"]==5]["strokes"],
                                                       par_5_model.predict(x_test[x_test["par"]==5][["distance"]])), 3)))
[0.00326362]
2.7921565590168576
Training Set MAE: 0.151
Test Set MAE: 0.121

Model Visualization¶

Performance on the test set is comparable to that on the training set for each hole type, so let's examine the model fits to see if they each seem reasonable.

In [9]:
# Visualize test set observations
fig, ax = plt.subplots()
ax.scatter(x_test[x_test["par"]==3]["distance"], y_test[x_test["par"]==3]["strokes"],
           c="green", label="Par 3", alpha=0.1)
ax.scatter(x_test[x_test["par"]==4]["distance"], y_test[x_test["par"]==4]["strokes"],
           c="blue", label="Par 4", alpha=0.1)
ax.scatter(x_test[x_test["par"]==5]["distance"], y_test[x_test["par"]==5]["strokes"],
           c="red", label="Par 5", alpha=0.1)

# Visualize model predictions
# Par-3
dist = np.linspace(150, 250, 100)
test_df = pd.DataFrame({"distance": dist})
par_3_preds = par_3_model.predict(test_df)
ax.plot(dist, par_3_preds, "g-")

# Par-4
dist = np.linspace(300, 500, 100)
test_df = pd.DataFrame({"distance": dist})
par_4_preds = par_4_model.predict(test_df)
ax.plot(dist, par_4_preds, "b-")

# Par-5
dist = np.linspace(500, 650, 100)
test_df = pd.DataFrame({"distance": dist})
par_5_preds = par_5_model.predict(test_df)
ax.plot(dist, par_5_preds, "r-")

# Add even-par indicators
plt.hlines([3, 4, 5], 100, 660, "k", "--")
plt.ylim(2.5, 6)

plt.title("Average Strokes Needed by Hole Distance")
plt.ylabel("Average Strokes")
plt.xlabel("Hole Distance (yards)")

ax.legend()
plt.show()

The linear fits by hole type appear reasonable. Interestingly, the linear models predict most par-3 holes to play over par and most par-5 holes to play under par. Anecdotally, this matches intuition: par-5 holes offer more chances to recover from mistakes or poor shots, while par-3 holes aren't as forgiving. This result seems consistent with Broadie's findings as well, so let's move forward with these models for the moment.
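One way to quantify that observation is to solve each fitted line for the distance at which expected strokes equals par. Using the coefficients printed above, par 3s break even near 170 yards, so most play over par, while the par-5 break-even distance sits far beyond typical hole lengths:

```python
# (intercept, slope) pairs as printed for the par-3 and par-5 models above.
coefs = {3: (2.5686469023363045, 0.00254143),
         5: (2.7921565590168576, 0.00326362)}

# Distance where the fitted line crosses par: intercept + slope * d == par.
breakeven = {par: (par - b0) / b1 for par, (b0, b1) in coefs.items()}

for par, dist in breakeven.items():
    print(f"Par {par} plays to even par at about {dist:.0f} yards")
```

Par 3s shorter than roughly 170 yards are predicted to play under par; par 5s would need to stretch to nearly 680 yards before playing over par.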

In [10]:
models = {
    "par_3": par_3_model,
    "par_4": par_4_model,
    "par_5": par_5_model,
}
with open('/content/drive/MyDrive/Golf/drive_models.pkl', 'wb') as handle:
    pickle.dump(models, handle)
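Loading the saved dictionary in a later session might look like the following. This sketch fits a toy stand-in model and writes to a local path so it runs standalone; the real notebook would load the Drive path above and index by `"par_3"`, `"par_4"`, or `"par_5"`. Note the models expect a one-column distance DataFrame, matching how they were fit.

```python
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for one of the fitted models above, so the sketch is self-contained.
toy = LinearRegression().fit(pd.DataFrame({"distance": [150.0, 250.0]}), [2.95, 3.2])
with open("drive_models.pkl", "wb") as handle:
    pickle.dump({"par_3": toy}, handle)

# Later session: load the dictionary and predict expected strokes by distance.
with open("drive_models.pkl", "rb") as handle:
    models = pickle.load(handle)

pred = models["par_3"].predict(pd.DataFrame({"distance": [200.0]}))
print(pred[0])  # ~3.075 for the toy line through (150, 2.95) and (250, 3.2)
```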