Abstract:
Large‐scale genome‐wide scans of large numbers of cases and controls are archived in publicly available genetic databases, for example, the Database of Genotypes and Phenotypes (https://www.ncbi.nlm.nih.gov/gap/). These databases offer an unprecedented opportunity to study genetic effects. Yet the set of nongenetic variables in these databases is often limited. From the statistical literature, we know that omitting a continuous variable from a logistic regression model can result in biased estimates of odds ratios (OR), even when the omitted and the included variables are independent. We are interested in assessing what information is needed to recover the bias in the OR estimate of genotype due to omitting a continuous variable in settings where the actual values of the omitted variable are not available. We derive two estimating procedures that can recover the degree of bias, based either on the conditional density of the omitted variable given disease status and genotype, or on the known distribution of the omitted variable and the frequency of the disease in the population. Importantly, our derivations show that omitting a continuous variable can result in either under‐ or over‐estimation of the genetic effects. We performed extensive simulation studies to examine bias, variability, false‐positive rate, and power in the model that omits a continuous variable. We illustrate the approach with an application to two genome‐wide studies of Alzheimer's disease.
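
The claim that omitting an independent continuous variable can still bias the OR can be checked numerically. The following is a minimal sketch under illustrative assumptions that are not taken from the study: a binary genotype G, an omitted variable X distributed as standard normal and independent of G, a population-level (not case-control) logistic model, and hypothetical coefficient values. It is not one of the two estimating procedures described above, and it shows only one direction of bias; it does not reproduce the under‐ and over‐estimation that the derivations cover.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_risk(g, beta0, beta_g, beta_x, half_width=8.0, step=0.01):
    """P(D=1 | G=g) after averaging over the omitted X ~ N(0, 1).

    Assumes logit P(D=1 | G, X) = beta0 + beta_g*G + beta_x*X with X
    independent of G (illustrative assumption). The expectation over X is
    approximated by a Riemann sum on [-half_width, half_width].
    """
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        x = -half_width + i * step
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += density * expit(beta0 + beta_g * g + beta_x * x) * step
    return total


def marginal_log_or(beta0, beta_g, beta_x):
    """Log OR for genotype in the model that omits X."""
    p0 = marginal_risk(0, beta0, beta_g, beta_x)
    p1 = marginal_risk(1, beta0, beta_g, beta_x)
    return math.log(p1 / (1.0 - p1)) - math.log(p0 / (1.0 - p0))


if __name__ == "__main__":
    beta0, beta_g = -1.0, math.log(1.5)  # hypothetical values
    for beta_x in (0.0, 1.0, 2.0):
        or_omitted = math.exp(marginal_log_or(beta0, beta_g, beta_x))
        print(f"beta_x={beta_x:.1f}  conditional OR=1.500  OR omitting X={or_omitted:.3f}")
```

When `beta_x` is zero, the OR from the model that omits X matches the conditional OR. As `beta_x` grows, the two diverge even though X and G are independent in this sketch.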