Of the 3.7 million deaths attributed to outdoor air pollution, ischemic heart disease (IHD) accounts for 40%, or approximately 1.48 million deaths, which occur mainly in older adults. IHD is the largest single cause of death attributable to ambient air pollution. Research on the incidence and progression of IHD points to ambient fine particulate matter (PM) as a major contributor to morbidity and mortality.
In this context, this thesis develops and investigates improvements in air pollution exposure assessment methods and health effects assessments. Two exposure assessment chapters present methods and tools created to improve air pollution exposure assessment. The first focuses on the creation of a national-level spatio-temporal air pollution exposure model. The second emphasizes the development and evaluation of methods used to estimate annual average daily traffic, a local source of ambient particulates and other air pollutants thought to have heightened toxicity.
A model was created to predict ambient fine particulate matter less than 2.5 microns in aerodynamic diameter (PM2.5) across the contiguous United States, to be applied to health effects modeling (Chapter 2). We developed a novel hybrid approach that combines a land use regression (LUR) model with Bayesian Maximum Entropy (BME) interpolation of the LUR space-time residuals. The PM2.5 dataset included observations at 1,464 monitoring locations across the contiguous United States, with approximately 10% of locations reserved for cross-validation. In the LUR, variables based on remote sensing estimates of PM2.5, land use, and traffic indicators were made available to the Deletion/Substitution/Addition machine learning algorithm, which was used to select predictive models describing local variability in PM2.5. Two modeling configurations were tested: the first included all of the available covariates, and the second excluded the remote sensing variable. The remote sensing variable was not based on any ground information.
Specific results showed that normalized cross-validated R2 values for LUR were 0.63 and 0.11 with and without remote sensing, respectively, suggesting that remote sensing is a strong predictor of ground-level concentrations. In the models including the BME interpolation of the residuals, cross-validated R2 values were 0.79 for both configurations; the model without remotely sensed data described more fine-scale variation than the model including remote sensing. Our results suggest that our modeling framework effectively predicts ground-level concentrations of PM2.5 at multiple scales over the contiguous U.S.
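The two-stage structure of the hybrid model for Chapter 2 (a regression trend plus interpolation of its residuals) can be illustrated with a minimal sketch. It rests on strong simplifying assumptions that are not part of the thesis: a single covariate fitted by ordinary least squares stands in for the LUR, purely spatial inverse distance weighting stands in for the space-time BME interpolation, and all function and variable names are hypothetical.

```python
import math


def fit_ols(x, y):
    """Fit y = a + b*x by least squares (one-covariate stand-in for the LUR)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def idw(points, values, target, power=2.0):
    """Inverse-distance-weighted average at `target` (stand-in for BME)."""
    num = 0.0
    den = 0.0
    for (px, py), value in zip(points, values):
        dist = math.hypot(px - target[0], py - target[1])
        if dist == 0.0:
            return value  # exact match with a monitor location
        weight = 1.0 / dist ** power
        num += weight * value
        den += weight
    return num / den


def hybrid_predict(coords, covariate, pm25, target_coord, target_cov):
    """Regression trend at the target plus interpolated regression residual."""
    intercept, slope = fit_ols(covariate, pm25)
    residuals = [obs - (intercept + slope * cov)
                 for cov, obs in zip(covariate, pm25)]
    trend = intercept + slope * target_cov
    return trend + idw(coords, residuals, target_coord)
```

In this sketch the residual step restores whatever local structure the regression misses, so a prediction at a monitor location reproduces the observed value. That is a property of the illustrative interpolator, not a claim about the BME estimator used in the thesis.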
The network interpolation tool used to estimate traffic is described in Chapter 3. The program was created using free open-source software, namely Python 2.7 and its related libraries. It was applied to two county study areas in California, USA (Alameda and Los Angeles), where inverse distance weighted (IDW) and kriging models of annual average daily traffic (AADT) were estimated. These estimates were compared to each other, to an entirely independent dataset, and to a traffic model built with methods similar to those behind the traffic estimates employed in the exposure model in Chapter 2.
Results show different levels of predictive agreement. Using cross-validation methods, the R2 values for these models were 0.36 and 0.32 in Alameda and 0.46 and 0.47 in Los Angeles, for IDW and kriging, respectively. Differences in model performance between and within the study areas suggest that data issues may have materially contributed, including temporal discordance in the measurements and mischaracterization of road types. A comparison of network interpolation methods to those used to estimate traffic in Chapter 2 found the network methods to be superior.
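As an illustration of how a cross-validated R2 for an IDW model can be computed, the sketch below uses leave-one-out cross-validation. Several choices here are assumptions rather than the thesis method: distances are straight-line (Euclidean) instead of measured along the road network, R2 is defined as one minus the ratio of residual to total sum of squares, and the function names are hypothetical.

```python
import math


def idw_predict(points, values, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` (Euclidean distance)."""
    num = 0.0
    den = 0.0
    for (px, py), value in zip(points, values):
        dist = math.hypot(px - target[0], py - target[1])
        if dist == 0.0:
            return value  # target coincides with a count location
        weight = 1.0 / dist ** power
        num += weight * value
        den += weight
    return num / den


def loo_cv_r2(points, values, power=2.0):
    """Leave-one-out cross-validated R2: 1 - SS_res / SS_tot."""
    predictions = []
    for i in range(len(points)):
        # Hold out location i and predict it from all remaining locations.
        train_points = points[:i] + points[i + 1:]
        train_values = values[:i] + values[i + 1:]
        predictions.append(
            idw_predict(train_points, train_values, points[i], power))
    mean_value = sum(values) / len(values)
    ss_res = sum((v - p) ** 2 for v, p in zip(values, predictions))
    ss_tot = sum((v - mean_value) ** 2 for v in values)
    return 1.0 - ss_res / ss_tot
```

Calling `loo_cv_r2(points, values)` with a list of `(x, y)` coordinate tuples and a matching list of traffic counts returns a single score. Values near 1 indicate that held-out counts are well predicted from their neighbors, and values can be negative when the interpolator does worse than the overall mean.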
For the health effects analysis, which estimated an exposure-response curve describing the effect of PM2.5 on ischemic heart disease mortality, monthly ambient PM2.5 estimates (from the model outlined in Chapter 2) were averaged to represent long-term exposure at the home. Super Learner evaluated 14 models that fell within the classes of parametric, semi-parametric, and non-parametric models. A generalized additive model with splined terms was identified as being most predictive of life expectancy. Over the exposure range of 3-27 µg/m3, the estimated years of life lost was 0.6 years. This relationship, however, was not linear. It followed the pattern reported in previous studies, with risk rising more steeply at lower exposures and the curve flattening out at higher exposures. An inflection point appeared to occur near 10 µg/m3. These estimates failed to reach significance at the 95% confidence level but were close enough to be suggestive of a relationship. Results from a complementary simulation showed that left truncation characteristics of the cohort likely biased the results towards the null. In addition, the use of inverse probability of censoring weights to control for bias induced by right censoring added variability to the estimator that likely reduced the power to detect an effect.
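The model-selection step can be illustrated with a minimal sketch of a discrete super learner, which picks the candidate with the lowest cross-validated risk. This is a simplification under stated assumptions: the full Super Learner can also form a weighted combination of candidates, the three toy learners below are hypothetical and are not the 14 models evaluated in the thesis, and squared error is assumed as the loss.

```python
def fit_mean(x, y):
    """Candidate 1: predict the training mean everywhere."""
    mean_y = sum(y) / len(y)
    return lambda q: mean_y


def fit_linear(x, y):
    """Candidate 2: simple linear regression on one exposure variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda q: intercept + slope * q


def fit_knn(x, y, k=3):
    """Candidate 3: average outcome of the k nearest training exposures."""
    pairs = list(zip(x, y))

    def predict(q):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - q))[:k]
        return sum(v for _, v in nearest) / len(nearest)

    return predict


def cv_risk(fit, x, y, folds=5):
    """Mean squared prediction error over interleaved validation folds."""
    n = len(x)
    squared_error = 0.0
    for fold in range(folds):
        held_out = set(range(fold, n, folds))
        train_x = [x[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        model = fit(train_x, train_y)
        squared_error += sum((y[i] - model(x[i])) ** 2 for i in held_out)
    return squared_error / n


def discrete_super_learner(library, x, y, folds=5):
    """Return the name of the lowest-risk candidate and all estimated risks."""
    risks = {name: cv_risk(fit, x, y, folds) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks
```

A library is passed as a dictionary such as `{"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}`. Because every candidate is scored only on observations it was not fitted to, flexible learners are not rewarded for overfitting, which is the property that lets this kind of selector arbitrate between parametric and non-parametric models.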
This research has shown the utility of machine-learning algorithms for improving health effects assessments in the field of air pollution epidemiology. In exposure science, they have proven their utility in creating estimates of exposure that can be used to characterize multiple scales of variability. In health effects assessments, in combination with causal inference methods, this work has shown the utility of these methods to detect non-linear effects in novel parameter estimates in individual cohort studies. In addition to the methodological contribution, the health effects results contribute to the discussion about the burden of disease attributable to particulate matter.