Statistical Analysis of Infectious Diseases in Nursing and Genomic Data
- Author(s): Toyama, Joy
- Advisor(s): Ramirez, Christina Michelle
- et al.
In a variety of settings, including the medical field, it is common for the number of variables gathered to far exceed the sample size. In addition to being high-dimensional, such data often contain many correlated variables, which poses problems for traditional methods. Much of the time the data cannot be used in full as collected; instead, prior research must guide the choice of relevant predictors before model selection. Traditional methods such as logistic regression and mixed models may fail to converge and struggle with identifiability when the number of measurements collected approaches or exceeds the number of patients in the study. Machine-learning techniques, including Random Forests and the newly developed Fuzzy Forests method, can accommodate data with high dimensionality. We concentrate on decision trees in particular because of their relative ease of use, availability and predictive ability. Random Forest is a widely used, parallelizable and computationally efficient method; however, it does not account for correlation between variables, which leads to a preference for correlated predictors. Fuzzy Forest, on the other hand, explicitly explores the correlation structure among the variables, leading to unbiased variable importance measures. Fuzzy Forest, along with Random Forest, is applied in three settings: smoking cessation in health care workers, re-arrest among homeless ex-offenders, and genetic predictors of lithium response in individuals with bipolar disorder.
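The identifiability problem described above can be illustrated with a minimal sketch in plain Python. It is not taken from the thesis: the design matrix, the coefficient values and the helper `predict` are all hypothetical, chosen only to show that with more measurements than patients, two different coefficient vectors can produce identical fitted values, so the data alone cannot distinguish between them.

```python
# Toy design matrix (hypothetical values): n = 2 patients (rows),
# p = 3 measurements (columns), so p > n.
X = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]


def predict(X, beta):
    """Linear predictor X @ beta, written out with plain lists."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


# Hypothetical coefficient vector, chosen only for illustration.
beta_a = [0.5, -1.0, 2.0]

# v lies in the null space of X (X @ v == 0). A nonzero vector of this
# kind always exists when there are more columns than rows.
v = [1.0, -2.0, 1.0]

# Shifting the coefficients along v changes them but not the fitted values.
beta_b = [b_j + 3.0 * v_j for b_j, v_j in zip(beta_a, v)]

print("coefficients A:", beta_a, "fitted:", predict(X, beta_a))
print("coefficients B:", beta_b, "fitted:", predict(X, beta_b))
```

Because a logistic regression likelihood depends on the coefficients only through this linear predictor, both coefficient vectors would fit the toy data equally well. This is one way to see why the predictors must either be reduced beforehand or handled by a method designed for high-dimensional data.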