Using variable importance measures from causal inference to rank risk factors of schistosomiasis infection in a rural setting in China
- Author(s): Sudat, Sylvia EK
- Carlton, Elizabeth J
- Seto, Edmund YW
- Spear, Robert C
- Hubbard, Alan E
- et al.
Published Web Location: http://dx.doi.org/10.1186/1742-5573-7-3
Abstract

Background: Schistosomiasis infection, contracted through contact with contaminated water, is a global public health concern. In this paper we analyze data from a retrospective study reporting water contact and schistosomiasis infection status among 1011 individuals in rural China. We present semi-parametric methods for identifying risk factors through a comparison of three analysis approaches: a prediction-focused machine learning algorithm, a simple main-effects multivariable regression, and a semi-parametric variable importance (VI) estimate inspired by a causal population intervention parameter.

Results: The multivariable regression found only tool washing to be associated with the outcome, with a relative risk of 1.03 and a 95% confidence interval (CI) of 1.01-1.05. Three types of water contact were found to be associated with the outcome in the semi-parametric VI analysis: July water contact (VI estimate 0.16, 95% CI 0.11-0.22), water contact from tool washing (VI estimate 0.88, 95% CI 0.80-0.97), and water contact from rice planting (VI estimate 0.71, 95% CI 0.53-0.96). The July VI result, in particular, indicated a strong association with infection status. Its causal interpretation implies that eliminating water contact in July would reduce the prevalence of schistosomiasis in our study population by 84% (95% CI 78%-89%), or from 0.3 to 0.05.

Conclusions: The July VI estimate suggests possible within-season variability in schistosomiasis infection risk, an association not detected by the regression analysis. Many limitations of this study temper the potential for causal interpretation, but if a high-risk time period could be detected in something close to real time, new prevention options would open up. Most importantly, we emphasize that traditional regression approaches are usually based on arbitrary pre-specified models, making their parameters difficult to interpret in the context of real-world applications.
Our results support the practical application of analysis approaches that, in contrast, do not require arbitrary model pre-specification, estimate parameters that have simple public health interpretations, and apply inference that considers model selection as a source of variation.
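
The arithmetic behind the July result can be reproduced with a short sketch. It assumes, based only on the figures quoted in the abstract, that each VI estimate is a ratio of the prevalence expected without the exposure to the observed prevalence; the function name `implied_reduction` is hypothetical and is not taken from the paper.

```python
def implied_reduction(vi_ratio, baseline_prevalence):
    """Translate a ratio-scale VI estimate into a public health statement.

    Assumption (not confirmed by the abstract): the VI estimate is the ratio
    of prevalence without the exposure to the observed prevalence.
    Returns (relative reduction, prevalence without the exposure).
    """
    if not 0.0 <= baseline_prevalence <= 1.0:
        raise ValueError("baseline_prevalence must lie in [0, 1]")
    relative_reduction = 1.0 - vi_ratio
    counterfactual_prevalence = baseline_prevalence * vi_ratio
    return relative_reduction, counterfactual_prevalence


# July water contact, using the values quoted in the abstract.
reduction, prevalence = implied_reduction(vi_ratio=0.16, baseline_prevalence=0.3)

# Subtracting from 1 swaps the CI bounds: the upper VI bound (0.22)
# gives the lower bound of the reduction, and vice versa.
reduction_ci = (1.0 - 0.22, 1.0 - 0.11)

print(f"relative reduction: {reduction:.0%}")
print(f"95% CI: {reduction_ci[0]:.0%} to {reduction_ci[1]:.0%}")
print(f"prevalence without July contact: {prevalence:.2f}")
```

Under the same assumption, the tool washing estimate of 0.88 corresponds to a 12% relative reduction and the rice planting estimate of 0.71 to a 29% relative reduction.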