Technologies: R, Excel
Skills & Methodologies: Linear Regression, Hypothesis Testing, Confidence Intervals
GitHub Repository
In collaboration with the Baltimore Ravens, my research team was tasked with drawing insights about NFL punts from a proprietary play-by-play data set courtesy of PFF. After cleaning the data, I built several linear regression models to predict the length of NFL punt returns from in-game variables such as starting field position, punt distance, and punt hang time. The data cleaning involved creating dummy variables for categorical predictors and transforming predictors to a positive linear scale. The response variable was punt return yards.
After starting with over a dozen predictor variables, I reduced the model to include only those predictors that significantly improved the fit, justifying each removal with an ANOVA test. This reduction step was important because it made the findings much more intuitive and easier to convey to the Baltimore Ravens Analytics Team. At this point the R-squared value for what would eventually become the final model was 0.1357.
I then checked for high-leverage points using a half-normal quantile-quantile plot and Cook's distance, and for influential points by looking at DFBETAS. After removing a couple of points that severely skewed the model, the R-squared value improved to 0.1439. Next, I applied a transformation to the response variable, which further raised the R-squared value to 0.2046. Whether this was an improvement, however, depends on the preferences of the model user: some would prioritize ease of interpretation over strength of fit, and for them this transformation would not be ideal. When presenting the model on the poster, I opted not to use the transformation, since the untransformed model made our findings much easier to explain. Lastly, I verified the assumptions one inherently makes when performing a linear regression.
Specifically, I first checked whether the errors had constant variance using a plot of the residuals and an F-test. Second, I checked whether the residuals were normally distributed using a normal Q-Q plot of the residuals and a Shapiro-Wilk normality test. Third, I checked whether the residuals were serially correlated using both a plot and a Durbin-Watson test.
The final linear regression model I presented on the poster had an R-squared value of ~0.15 and a p-value near 0. Thus only approximately 15% of the variance in punt return yardage can be explained by the in-game variables in the model. I concluded that while in-game variables certainly have an effect on the length of a punt return, large break-out punt returns that turn the tide of a game are largely random events. At the very least, long punt returns are heavily influenced by factors not recorded in the data set. Grouping the data by team also revealed that no team reliably outperformed the model. This suggests that no team has devised a game plan that gains significantly more yardage on punt returns than its peers. Likewise, no team's defense has found a way to reliably minimize punt return yardage compared to its peers.
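To make two of the quantities above concrete, here is a minimal sketch of how an R-squared value and a Durbin-Watson statistic are computed from a fitted model's residuals. The original analysis was done in R on the proprietary PFF data; this sketch uses plain Python, a single predictor, and made-up toy numbers, and the function names are illustrative rather than taken from the project code. It computes only the Durbin-Watson statistic, not the p-value that a full test would report.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, residuals):
    """Share of the variance in y explained by the model: 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum(e * e for e in residuals)
    return 1 - rss / tss


def durbin_watson(residuals):
    """Sum of squared successive residual differences over the sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den


# Toy data, invented for illustration only (NOT the PFF data set).
hang_time = [3.9, 4.0, 4.1, 4.3, 4.4, 4.5, 4.6, 4.8]           # seconds
return_yards = [12.0, 3.0, 9.0, 0.0, 14.0, 5.0, 2.0, 7.0]      # yards

intercept, slope = fit_simple_ols(hang_time, return_yards)
residuals = [y - (intercept + slope * x) for x, y in zip(hang_time, return_yards)]

print("R-squared:", round(r_squared(return_yards, residuals), 4))
print("Durbin-Watson:", round(durbin_watson(residuals), 4))
```

The Durbin-Watson statistic lies between 0 and 4, and values near 2 indicate little first-order serial correlation in the residuals. An R-squared of about 0.15, as in the final model, means the remaining 85% or so of the variance sits in the residuals.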