A problem-first approach to statistics develops statistical methods directly from real-world questions and problems. This dissertation illustrates this approach through the development of statistical methods and tools in four disciplines: active transportation, higher education, election auditing, and sports. Causal inference and nonparametric methods are emphasized because they avoid parametric assumptions that are typically incorrect.
The second chapter focuses on active transportation and problems with ensuring data quality. Sufficiently accurate bicycle and pedestrian counts are useful for improving safety
analyses, planning infrastructure, and prioritizing funding. The accuracy of instrumental
counts is affected by the instrument’s sensing technology, details of siting and installation,
calibration, random error, and malfunctions. Some of these errors cannot be detected without
an independent, accurate count to compare to the instrumental count. But some failures
can be detected (imperfectly) through their signal in the count data, which has led to a
variety of algorithms to clean and interpolate instrumental count data. We present different
methods for flagging questionable data and provide a detailed comparison of data cleaning
approaches.
Higher education is the focus of the next chapter, and the central research question is “do female presenters receive more questions or comments than male presenters during
academic job talks?” We collect a large dataset of academic job talks from eight UC Berkeley
departments from 2013-2019 in order to answer this question. We find that differences in
the number, nature, and total duration of audience questions and comments are neither
material nor statistically significant. For instance, the median difference (by gender) in the
duration of questioning ranges from zero to less than two minutes in the five departments.
Moreover, in some departments, candidates who were interrupted more often were more
likely to be offered a position, challenging the premise that interruptions are necessarily
prejudicial. These results are specific to the departments and years covered by the data, but
they are broadly consistent with previous research, which found differences of comparable
magnitude. However, those studies concluded that the (small) differences were statistically
significant. We present evidence that the nominal statistical significance is an artifact of
using inappropriate hypothesis tests. We show that it is possible to calibrate those tests to obtain a proper P-value using randomization.
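
As a rough illustration of how randomization yields a properly calibrated P-value, the sketch below runs a generic two-sample Monte Carlo permutation test. It is a minimal sketch under stated assumptions, not the analysis used in this chapter: the function name `permutation_p_value`, the choice of the difference in means as the test statistic, and the number of replications are all hypothetical.

```python
import numpy as np

def permutation_p_value(x, y, reps=10_000, seed=0):
    """Two-sided Monte Carlo permutation P-value for a difference in means.

    Hypothetical helper for illustration; the test statistic is an assumption.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    n = len(x)
    observed = abs(x.mean() - y.mean())

    hits = 0
    for _ in range(reps):
        # Under the null, group labels are exchangeable, so reshuffle them.
        perm = rng.permutation(pooled)
        stat = abs(perm[:n].mean() - perm[n:].mean())
        # Small tolerance so floating-point ties count as "at least as extreme".
        if stat >= observed - 1e-12:
            hits += 1

    # Counting the observed labelling as one of the permutations keeps the
    # Monte Carlo P-value conservative: it can never be exactly zero.
    return (hits + 1) / (reps + 1)
```

Calling `permutation_p_value(a, b)` on two arrays of per-talk measurements returns a P-value whose validity rests only on the exchangeability of the group labels under the null hypothesis, not on a parametric model for the data.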
Motivated by the permutation test work in the previous chapter, the fourth chapter develops a method to construct fast exact/conservative Monte Carlo confidence intervals by
inverting exact/conservative Monte Carlo tests about parameters. The method uses a single
set of Monte Carlo samples, which both reduces the computational burden and ensures that
the problem of finding where the P-value crosses α is well posed. For problems with real-valued
parameters, if the P-value is quasiconcave in the parameter, a minor modification of
the bisection algorithm quickly finds conservative confidence bounds to any desired degree
of accuracy. Additional computational savings are possible for common test statistics in the
one-sample and two-sample problem by exploiting the relationship between values of the test
statistics for different values of the parameter. Examples across a wide range of disciplines
are given to illustrate this new method.
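
The idea of inverting a Monte Carlo test with a single set of samples can be sketched for one simple case. The code below is an illustration under stated assumptions, not the algorithm developed in this chapter: it assumes a two-sample shift model, a one-sided permutation test based on the difference in means, and plain bisection. The names `make_shift_test` and `lower_confidence_bound` are hypothetical. Because the same random relabellings are reused for every hypothesized shift `d`, the P-value is a fixed, nondecreasing step function of `d`, so the crossing point is well defined.

```python
import numpy as np

def make_shift_test(x, y, reps=2000, seed=0):
    """Return p_value(d) for H0: shift = d, reusing ONE set of relabellings."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    pooled = np.concatenate([x, y])
    is_y = np.concatenate([np.zeros(n), np.ones(m)])

    # Each row marks which units one random relabelling assigns to the "y" group.
    groups = np.array([rng.permutation(is_y) for _ in range(reps)])
    # Difference in means of the unshifted data under each relabelling.
    a = groups @ pooled / m - (1 - groups) @ pooled / n
    # k = number of original y units kept in the relabelled "y" group.
    k = groups @ is_y
    # After subtracting d from y, (permuted stat - observed stat) equals
    # (a - a_obs) + d * slope, with slope >= 0, so no reshuffling is needed per d.
    slope = (m - k) * (1.0 / m + 1.0 / n)
    a_obs = y.mean() - x.mean()

    def p_value(d):
        exceed = (a - a_obs + d * slope) >= -1e-12
        return (1 + exceed.sum()) / (reps + 1)

    return p_value

def lower_confidence_bound(x, y, alpha=0.05, tol=1e-3, reps=2000, seed=0):
    """Conservative lower (1 - alpha) confidence bound for the shift of y over x."""
    p_value = make_shift_test(x, y, reps, seed)
    lo = float(np.min(y) - np.max(x) - 1.0)  # shifted y all exceed x: smallest P-value
    hi = float(np.max(y) - np.min(x) + 1.0)  # shifted y all below x: P-value is 1
    if p_value(lo) > alpha:
        raise ValueError("no shift can be rejected at this alpha; use more data or reps")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p_value(mid) > alpha:
            hi = mid  # mid is not rejected, so the bound is at or below mid
        else:
            lo = mid  # mid is rejected, so the bound is above mid
    # Returning the rejected endpoint errs on the side of a wider interval.
    return lo
```

In this sketch the bracket always keeps a rejected value at `lo` and a non-rejected value at `hi`, so returning `lo` gives a bound that is conservative to within `tol`. Precomputing `a` and `slope` is one example of reusing work across parameter values rather than recomputing the test statistic for each hypothesized shift.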
The fifth, sixth, and seventh chapters focus on post-election audits. Post-election audits can provide convincing evidence that election outcomes are correct, meaning that the reported winner(s)
really won, by manually inspecting ballots selected at random from a trustworthy
paper trail of votes. Risk-limiting audits (RLAs) control the probability that, if the reported
outcome is wrong, it is not corrected before the outcome becomes official. RLAs keep this
probability below the specified “risk limit.” Chapter five compares RLAs to a proposed
Bayesian alternative, Bayesian audits (BAs). BAs control a weighted average probability of
correcting wrong outcomes over a hypothetical collection of elections; the weights come from
the prior. RLAs and BAs make different assumptions, use different standards of evidence,
and offer different assurances. We illustrate these differences using simulations based on real
contests. Historically, conducting RLAs of all contests in a jurisdiction has been infeasible,
because efficiency is eroded when sampling cannot be targeted to ballot cards that contain
the contest(s) under audit. States that conduct RLAs of contests on multi-card ballots or of
small contests can dramatically reduce sample sizes by using information about which ballot
cards contain which contests—by keeping track of card-style data (CSD). We present a
method for using CSD to drastically decrease RLA sample sizes in chapter six. Chapter seven
describes an open-source Python implementation of RLAs using CSD for the Hart InterCivic
Verity voting system and the Dominion Democracy Suite voting system. The software is
demonstrated using all 181 contests in the 2020 general election and all 214 contests in the
2022 general election in Orange County, CA, USA, the fifth-largest election jurisdiction in
the U.S., with over 1.8 million active voters.
In the final chapter, we develop a novel method to quantify the impact of injuries on player performance in baseball. To quantify this impact, we can look at the difference between
the performance the player would have achieved in the absence of injury and the player's
performance after a given injury. This quantity can be estimated by matching injured players to similar non-injured
players. However, matching in observational studies faces complications when units enroll
in treatment on a rolling basis (e.g., players are injured at different times). To address
this issue, we first introduce a new matched design, GroupMatch with instance replacement,
which allows maximum flexibility in control selection. Second, we propose a block bootstrap
approach for inference in matched designs with rolling enrollment and demonstrate that it
accounts properly for complex correlations across matched sets in our new design and several
other contexts. Third, we develop a falsification test to detect violations of the timepoint
agnosticism assumption, which is needed to permit flexible matching across time.