Sensor systems that acquire large sets of data have been deployed to document sporting events at unprecedented levels of detail. Machine learning techniques have been applied to these sensor measurements to discover new skills, quantify known skills with greater accuracy, and understand biomechanical principles to improve performance and prevent injury. The use of learning methods to support the generation of predictive models has revolutionized decision making as teams search for an advantage in a highly competitive industry. Machine learning methods are particularly well suited for baseball due to the discrete structure of the sport.
We develop and apply learning methods to large sets of sensor data to address several of the most important and challenging problems in baseball analytics. We introduce a method for learning a function over distributions that generalizes nonparametric kernel regression by using the Wasserstein metric on the space of distributions. The technique is applied to the problem of learning how pitcher performance depends on multidimensional pitch distributions, which are derived from sensor measurements that capture the physical properties of each pitch. We also develop a method for estimation and prediction called measurement space partitioning. The method is applied to the problem of estimating batted-ball talent: using large sets of trajectory measurements acquired by in-game sensors, we show that the predictive value of a batted ball depends on its physical properties. This knowledge is exploited to estimate batted-ball distributions defined over a multidimensional measurement space by using regression parameters that adapt to batted-ball properties. This process is central to a new method for quantifying batted-ball skill. We present examples illustrating facets of the approach and use a set of experiments to show that the new methods generate predictions that are significantly more accurate than those generated using current methods.
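The idea of kernel regression over distributions can be illustrated with a minimal sketch. It is not the method developed in this work: it assumes one-dimensional empirical samples of equal size (the pitch distributions described above are multidimensional), a Gaussian kernel, and a Nadaraya-Watson estimator. The function names, the bandwidth, and the toy numbers are all hypothetical and chosen only for illustration.

```python
import math


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples.

    For equal-size samples, the optimal coupling pairs the sorted values,
    so the distance is the mean absolute difference of the order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def kernel_regress(query, train_samples, train_targets, bandwidth=1.0):
    """Nadaraya-Watson estimate whose inputs are samples from distributions.

    Each training sample is weighted by a Gaussian kernel applied to its
    Wasserstein distance from the query sample (an assumed kernel choice).
    """
    weights = [
        math.exp(-0.5 * (wasserstein_1d(query, s) / bandwidth) ** 2)
        for s in train_samples
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase bandwidth")
    return sum(w * y for w, y in zip(weights, train_targets)) / total


# Toy, made-up data: each inner list stands in for one pitcher's sample of a
# single pitch property, and each target is a made-up performance value.
train_samples = [[90.0, 92.0, 94.0], [80.0, 82.0, 84.0]]
train_targets = [3.0, 5.0]

# A query sample equally far from both training samples.
prediction = kernel_regress([85.0, 87.0, 89.0], train_samples, train_targets, bandwidth=2.0)
print(prediction)
```

Because the query sample in this toy example is equidistant from the two training samples, both receive the same weight and the estimate is the average of the two targets. Extending the sketch to multidimensional distributions would require a general optimal-transport solver in place of the sorted-sample shortcut.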