Case-control study designs are frequently used in public health and medical research to assess potential risk factors for disease. These designs are particularly attractive to investigators researching rare diseases, because they can sample known cases of disease rather than follow a large number of subjects and wait for disease onset in a relatively small number of individuals. The data-generating experiment in case-control study designs involves an additional complexity called biased sampling. That is, one assumes an underlying experiment that randomly samples a unit from a target population, measures baseline characteristics, assigns an exposure, and measures a final binary outcome, but one actually samples from the conditional probability distribution of the data, given the value of the binary outcome. One still wishes to assess the causal effect of exposure on the binary outcome for the target population.
We present a targeted maximum likelihood estimator of the causal effect of treatment on the binary outcome for such case-control studies. The proposed case-control-weighted targeted maximum likelihood estimator relies on knowledge of the true prevalence probability, or a reasonable estimate of this probability, to eliminate the bias of the case-control sampling design. The prevalence probability enters through case-control weights, and this weighting scheme maps the targeted maximum likelihood estimator for a random sample into a method for case-control sampling.
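To make the weighting idea concrete, the following is a minimal sketch of the weighting step only, not of the targeting step of the estimator. It assumes the commonly used scheme in which cases receive weight q0 and controls receive weight (1 - q0)/J, where q0 denotes the prevalence probability and J the number of controls per case. The names `q0`, `J`, `case_control_weights`, and `weighted_mean` are illustrative and do not appear in the text above.

```python
import numpy as np


def case_control_weights(y, q0):
    """Return case-control weights: q0 for cases, (1 - q0) / J for controls.

    y  : array of binary outcomes (1 = case, 0 = control)
    q0 : assumed known (or estimated) prevalence probability, 0 < q0 < 1
    J  : average number of controls per case, computed from the sample
    """
    y = np.asarray(y)
    if not 0.0 < q0 < 1.0:
        raise ValueError("q0 must lie strictly between 0 and 1")
    n_cases = int(np.sum(y == 1))
    n_controls = int(np.sum(y == 0))
    if n_cases == 0 or n_controls == 0:
        raise ValueError("sample must contain both cases and controls")
    J = n_controls / n_cases
    return np.where(y == 1, q0, (1.0 - q0) / J)


def weighted_mean(values, weights):
    """Weighted average of values, normalized by the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * values) / np.sum(weights))


# Hypothetical toy sample: 2 cases and 4 controls, so J = 2.
y = np.array([1, 1, 0, 0, 0, 0])
q0 = 0.05  # assumed prevalence, chosen only for illustration
w = case_control_weights(y, q0)

# Under these weights the weighted outcome mean equals q0 rather than
# the artificial case fraction in the case-control sample (2/6 here).
recovered_prevalence = weighted_mean(y, w)
```

In this toy sample the unweighted case fraction is one third, whereas the weighted mean returns the assumed prevalence, which is the sense in which the weights undo the biased sampling.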
Individually matched case-control study designs are commonly implemented in the field of public health. While matching is intended to eliminate confounding, the main potential benefit of matching in case-control studies is a gain in efficiency. We investigate the use of the case-control-weighted targeted maximum likelihood estimator to estimate causal effects in matched case-control study designs. We also compare the case-control-weighted targeted maximum likelihood estimator in matched and unmatched designs in an effort to determine which design yields the most information about the causal effect. In many practical situations where a causal effect is the parameter of interest, researchers may be better served using an unmatched design.
We also consider two-stage sampling designs, including so-called nested case-control studies, where one takes a random sample from a target population and completes measurements on each subject in the first stage. The second stage involves drawing a subsample from the original sample and collecting additional data on the subsample. This data structure can be viewed as a missing data structure on the full-data structure collected in the second stage of the study. We propose an inverse-probability-of-censoring-weighted targeted maximum likelihood estimator for two-stage sampling designs. Two-stage designs are also common for prediction research questions. We present an analysis using super learner on nested case-control data from a large Kaiser Permanente database to generate a function for mortality risk prediction.
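The inverse-probability-of-censoring weighting used in two-stage designs can be illustrated with a short sketch. It covers only the weights, not the targeted maximum likelihood or super learner steps, and it assumes the second-stage selection probabilities are known. The names `delta` (second-stage inclusion indicator), `pi` (selection probability given first-stage data), and `ipcw_weights` are hypothetical and chosen for illustration.

```python
import numpy as np


def ipcw_weights(delta, pi):
    """Return weights delta / pi: 1/pi for selected subjects, 0 otherwise.

    delta : array of 0/1 indicators of inclusion in the second-stage subsample
    pi    : array of selection probabilities given first-stage data
    """
    delta = np.asarray(delta)
    pi = np.asarray(pi, dtype=float)
    selected = delta == 1
    if np.any(pi[selected] <= 0.0):
        raise ValueError("selected subjects need positive selection probability")
    weights = np.zeros(delta.shape, dtype=float)
    weights[selected] = 1.0 / pi[selected]
    return weights


# Hypothetical first-stage sample of 8 subjects: 2 cases, 6 controls.
y = np.array([1, 1, 0, 0, 0, 0, 0, 0])

# Assumed nested case-control style selection: every case is kept,
# and each control is kept with probability 0.5.
pi = np.where(y == 1, 1.0, 0.5)
delta = np.array([1, 1, 1, 0, 1, 0, 1, 0])  # 3 of the 6 controls selected

w = ipcw_weights(delta, pi)

# The weighted second-stage sample stands in for the full first-stage
# sample: the weighted case fraction is compared with the fraction in y.
weighted_case_fraction = float(np.sum(w * y) / np.sum(w))
first_stage_case_fraction = float(np.mean(y))
```

Here the selected subsample over-represents cases (2 of 5), while weighting each selected subject by the inverse of its selection probability restores the first-stage case fraction in this toy example.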