Statistical Analysis of The 2016 and 2017 NCAA Division-I Swimming Championships
- Author(s): Kaunitz, Sarah Louise
- Advisor(s): Schoenberg, Frederic P
- et al.
This paper uses web scraping to combine eight separate competition results datasets into a single new dataset. Exploratory analysis is performed on the combined data to identify measurable reasons why most swimmers perform worse at the fastest collegiate competition in the nation. Additionally, forward and backward stepwise variable selection is used to study the impact of various factors on the outcome variable, time difference. Machine learning algorithms such as ridge regression and the lasso are used to build models that predict the time difference between a swimmer's entry time and final time in a race. The mean squared error is used to evaluate the overall performance of the models. Although many variables are created and used to fit the final ridge regression model, there are unmeasurable factors that must be taken into account to accurately describe what affects how fast a swimmer goes at the competition.
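As an illustration of the modeling step described above, the following is a minimal sketch of a ridge regression fit on the time difference outcome, scored by mean squared error. It is not the code used in the paper. It assumes a single predictor, the sign convention final time minus entry time (so a positive value means the swimmer was slower than the entry time), and an arbitrary penalty value; all function names and the toy numbers are hypothetical.

```python
from statistics import mean


def time_difference(entry_time, final_time):
    """Outcome variable. Assumed convention: positive means slower than entry."""
    return final_time - entry_time


def fit_ridge_1d(x, y, lam):
    """Closed-form ridge fit for one predictor with an unpenalized intercept.

    Minimizes sum((y - a - b*x)^2) + lam * b^2. Setting lam = 0 recovers
    ordinary least squares; larger lam shrinks the slope toward zero.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / (sxx + lam)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mean_squared_error(y_true, y_pred):
    """Average squared gap between observed and predicted values."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


if __name__ == "__main__":
    # Hypothetical toy values in seconds, for illustration only.
    entry = [49.8, 51.2, 52.5, 53.1]
    final = [50.1, 51.6, 52.4, 53.9]
    diffs = [time_difference(e, f) for e, f in zip(entry, final)]

    intercept, slope = fit_ridge_1d(entry, diffs, lam=1.0)
    predictions = [intercept + slope * e for e in entry]
    print("MSE:", mean_squared_error(diffs, predictions))
```

With several predictors the same idea applies in matrix form, and the lasso replaces the squared penalty on the coefficients with an absolute-value penalty, which can set some coefficients exactly to zero.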