Novel machine learning and correlation network methods for genomic data
- Author(s): Song, Lin
- Advisor(s): Horvath, Stefan
- et al.
Both correlation and mutual information (MI) are common co-expression measures. MI has the major advantage of being able to capture non-linear relationships. However, it is not clear how much MI adds beyond standard (robust) correlation measures or regression-model-based association measures. We provided a comprehensive comparison between mutual information and correlation in eight empirical data sets and in simulations. We confirmed close relationships between MI and correlation in all data sets. The biweight midcorrelation, a robust form of correlation, outperformed MI in terms of elucidating pairwise gene relationships. Coupled with the topological overlap matrix transformation, it often led to modules superior to those obtained with MI and the maximal information coefficient (MIC) in terms of gene ontology enrichment. In addition, we proposed the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables. Overall, our results indicated that MI networks could be safely replaced by correlation networks for stationary co-expression data.
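To make the robust correlation measure concrete, the following is a minimal sketch of the biweight midcorrelation in plain Python. It follows the commonly cited definition, in which each observation is down-weighted according to its distance from the median, measured in units of the median absolute deviation (MAD). The tuning constant of 9, the use of the raw (unscaled) MAD, and the function names `bicor` and `pearson` are assumptions made for illustration; they are not taken from the abstract, and the sketch is not the implementation used in the thesis.

```python
import math
import statistics


def _biweight_terms(values, c=9.0):
    """Return the normalized, biweight-weighted deviations from the median.

    Assumption: c = 9 and the raw (unscaled) MAD, as in the commonly
    cited definition of the biweight midcorrelation.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # This sketch does not handle the degenerate case of zero MAD.
        raise ValueError("MAD is zero; bicor is undefined in this sketch")
    terms = []
    for v in values:
        u = (v - med) / (c * mad)
        # Observations with |u| >= 1 are treated as outliers: weight 0.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    norm = math.sqrt(sum(t * t for t in terms))
    return [t / norm for t in terms]


def bicor(x, y):
    """Biweight midcorrelation of two equal-length sequences."""
    return sum(a * b for a, b in zip(_biweight_terms(x), _biweight_terms(y)))


def pearson(x, y):
    """Ordinary Pearson correlation, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# A linear relationship with one gross outlier in y.
x = list(range(20))
y = [2 * v + 1 for v in x]
y_outlier = y[:-1] + [-1000]

print("Pearson with outlier:", pearson(x, y_outlier))
print("bicor with outlier:  ", bicor(x, y_outlier))
```

In this toy example the single outlier receives weight zero in the biweight calculation, so the biweight midcorrelation stays strongly positive while the Pearson correlation turns negative.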
Sample classification, especially disease status prediction, is an important area of investigation for gene expression studies. Many machine learning methods, i.e. predictors, have been developed to tackle this problem. We proposed a novel bootstrap aggregated (bagged) generalized linear model (GLM) predictor, randomGLM (RGLM), that combines superior accuracy with good interpretability. RGLM incorporates several elements of randomness and instability, such as the random subspace method, optional interaction terms and forward feature selection. The prediction performance of various predictors was evaluated on hundreds of genomic data sets, the UCI machine learning benchmark data and simulations. RGLM often outperformed alternative methods, including random forests and penalized regression models (ridge regression, elastic net, lasso), in both binary and continuous outcome predictions. Further, RGLM provides variable importance measures that can be used to define a "thinned" ensemble predictor (involving few features) that retains excellent predictive accuracy.
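The ingredients named above (bagging, random subspaces, forward feature selection and a variable importance measure) can be illustrated with a small sketch in plain Python. It is a deliberately simplified stand-in, not the randomGLM algorithm itself: it handles only a continuous outcome with a Gaussian GLM (ordinary least squares), selects features greedily by residual sum of squares up to a fixed number of terms, omits interaction terms, and counts how often each feature is selected as its importance. All function names, parameters and defaults (`fit_bagged_glm`, `subspace_size`, `max_terms`, the small ridge term) are hypothetical choices made for illustration.

```python
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y, features):
    """Least-squares fit (intercept first) on the given feature indices."""
    design = [[1.0] + [row[j] for j in features] for row in rows]
    k = len(features) + 1
    xtx = [[sum(d[p] * d[q] for d in design) for q in range(k)] for p in range(k)]
    for p in range(k):
        xtx[p][p] += 1e-8  # assumption: tiny ridge term for numerical safety
    xty = [sum(d[p] * yi for d, yi in zip(design, y)) for p in range(k)]
    return solve_linear_system(xtx, xty)


def predict_ols(coef, features, row):
    return coef[0] + sum(c * row[j] for c, j in zip(coef[1:], features))


def rss(coef, features, rows, y):
    return sum((yi - predict_ols(coef, features, r)) ** 2 for r, yi in zip(rows, y))


def forward_select(rows, y, candidates, max_terms):
    """Greedy forward selection: add the feature that lowers the RSS most."""
    selected = []
    for _ in range(max_terms):
        remaining = [j for j in candidates if j not in selected]
        if not remaining:
            break
        best = min(
            remaining,
            key=lambda j: rss(fit_ols(rows, y, selected + [j]), selected + [j], rows, y),
        )
        selected.append(best)
    return selected


def fit_bagged_glm(rows, y, n_bags=50, subspace_size=4, max_terms=2, seed=0):
    """Fit one small linear model per bootstrap sample and random feature subset."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    models, importance = [], [0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        bag_rows = [rows[i] for i in idx]
        bag_y = [y[i] for i in idx]
        candidates = rng.sample(range(p), subspace_size)  # random subspace
        features = forward_select(bag_rows, bag_y, candidates, max_terms)
        models.append((features, fit_ols(bag_rows, bag_y, features)))
        for j in features:
            importance[j] += 1  # importance = number of times selected
    return models, importance


def predict_bagged(models, row):
    """Ensemble prediction: average over the per-bag models."""
    return sum(predict_ols(coef, feats, row) for feats, coef in models) / len(models)


# Synthetic data: only features 0 and 1 carry signal.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(80)]
Y = [3 * r[0] - 2 * r[1] + data_rng.gauss(0, 0.1) for r in X]
ensemble, counts = fit_bagged_glm(X, Y)
print("selection counts per feature:", counts)
```

Features with low selection counts could then be dropped and the ensemble refit on the remainder, which is the general idea behind a thinned predictor; how the thesis defines and thresholds its importance measure is not specified in this abstract.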
RGLM won the 2012 COPD Improver Challenge, in which we aimed to predict chronic obstructive pulmonary disease (COPD) status based on gene expression data. We outlined how RGLM compared with random forest on the COPD data set, and discussed potential reasons for the superior performance of RGLM in this sub-challenge.