Fuzzy Forests: Extending Random Forests for Correlated, High-Dimensional Data

2015

Abstract

In this paper we introduce fuzzy forests, a novel machine learning algorithm for ranking the importance of features in high-dimensional classification and regression problems. Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable importance in the presence of highly correlated features, especially when p >> n. We introduce our implementation of fuzzy forests in the R package fuzzyforest. Fuzzy forests works by taking advantage of the network structure between features. First, the features are partitioned into separate modules such that the correlation within modules is high and the correlation between modules is low. The package fuzzyforest allows for easy use of Weighted Gene Coexpression Network Analysis (WGCNA) to form modules of features such that the modules are roughly uncorrelated. Recursive feature elimination random forests (RFE-RFs) are then applied to each module separately. From the surviving features, a final group is selected and ranked using one last round of RFE-RFs. This procedure results in a ranked variable importance list whose size is pre-specified by the user. The selected features can then be used to construct a predictive model.
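
The control flow described in the abstract (partition into modules, screen each module by recursive elimination, then select from the pooled survivors) can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the fuzzyforest package's API: every function and parameter name below is hypothetical, a greedy correlation threshold stands in for WGCNA, and the absolute correlation with the response stands in for random forest variable importance. A real importance measure could be passed in through the `importance_fn` argument.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def partition_into_modules(columns, threshold=0.7):
    """Greedy stand-in for WGCNA (assumption): a feature joins the first module
    whose seed feature it correlates with above `threshold`, else starts a new one."""
    modules = []  # each module is a list of feature names; module[0] is its seed
    for name, values in columns.items():
        for module in modules:
            if abs(pearson(values, columns[module[0]])) >= threshold:
                module.append(name)
                break
        else:
            modules.append([name])
    return modules


def stand_in_importance(columns, y, features):
    """Placeholder for random forest variable importance (assumption): |corr(x, y)|."""
    return {f: abs(pearson(columns[f], y)) for f in features}


def recursive_elimination(columns, y, features, n_keep, drop_fraction=0.25,
                          importance_fn=stand_in_importance):
    """Re-score the surviving features and drop the least important fraction,
    repeating until `n_keep` remain. Returns the survivors ranked best-first."""
    survivors = list(features)
    while len(survivors) > n_keep:
        scores = importance_fn(columns, y, survivors)
        survivors.sort(key=scores.get, reverse=True)
        n_drop = max(1, int(len(survivors) * drop_fraction))
        survivors = survivors[:max(n_keep, len(survivors) - n_drop)]
    scores = importance_fn(columns, y, survivors)
    return sorted(survivors, key=scores.get, reverse=True)


def fuzzy_forest_sketch(columns, y, n_selected, keep_fraction=0.25, threshold=0.7):
    """Screen each module separately, then select from the pooled survivors."""
    pooled = []
    for module in partition_into_modules(columns, threshold):
        n_keep = max(1, math.ceil(len(module) * keep_fraction))
        pooled.extend(recursive_elimination(columns, y, module, n_keep))
    return recursive_elimination(columns, y, pooled, min(n_selected, len(pooled)))


# Synthetic example: two correlated blocks of features, only block "a" drives y.
rng = random.Random(0)
n = 200
za = [rng.gauss(0, 1) for _ in range(n)]  # latent factor behind block "a"
zb = [rng.gauss(0, 1) for _ in range(n)]  # latent factor behind block "b"
columns = {}
for i in range(5):
    columns[f"a{i}"] = [z + rng.gauss(0, 0.3) for z in za]
    columns[f"b{i}"] = [z + rng.gauss(0, 0.3) for z in zb]
y = [z + rng.gauss(0, 0.3) for z in za]

selected = fuzzy_forest_sketch(columns, y, n_selected=2)
print(selected)
```

In this toy setup the screening step keeps a couple of features from each block, and the final selection step is where the uninformative block is discarded; the length of the returned list is fixed in advance by `n_selected`, mirroring the user-specified list size mentioned above.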

Department of Biostatistics