Removing Unwanted Variation from Microarray Data with Negative Controls
Microarray expression studies suffer from batch effects and other unwanted variation. Unwanted variation complicates the analysis of microarray data, leading to high rates of false discoveries, high rates of missed discoveries, or both. Many methods have been proposed to adjust microarray data to mitigate these problems. Because the factors causing the unwanted variation are frequently unknown, several of these methods rely on factor analysis to infer the unwanted factors from the data. A central problem with this approach is the difficulty of distinguishing the unwanted variation from the biological variation that is of interest to the researcher. To overcome this problem, we present novel methods that use negative controls to help identify the unwanted factors and separate the unwanted variation from the variation that is of interest. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest.
The first method we present is a simple two-step procedure that we name RUV-2. In the first step RUV-2 estimates the unwanted factors by performing factor analysis on the negative control genes. Here, RUV-2 exploits the fact that any variation in the expression levels of negative control genes can be assumed to be unwanted variation. In the second step, RUV-2 regresses the expression data on the factor of interest, including the estimated unwanted factors as covariates in the regression model. The principal difficulty with RUV-2 is choosing the number of unwanted factors to include in the model.
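The two steps of RUV-2 can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the text: factor analysis is carried out with a truncated singular value decomposition, the regression is ordinary least squares with no intercept, the data are simulated, and all names (ruv2, Y, X, ctl, k) are hypothetical.

```python
import numpy as np


def ruv2(Y, X, ctl, k):
    """Sketch of a two-step RUV-2-style adjustment.

    Y   : (m samples, n genes) expression matrix
    X   : (m, p) factor(s) of interest
    ctl : boolean mask of length n marking negative control genes
    k   : number of unwanted factors to estimate
    """
    # Step 1: factor analysis on the negative control genes only.
    # Assumption: a truncated SVD stands in for the factor analysis.
    U, s, _ = np.linalg.svd(Y[:, ctl], full_matrices=False)
    W_hat = U[:, :k] * s[:k]

    # Step 2: regress every gene on the factor of interest, with the
    # estimated unwanted factors included as covariates.
    design = np.hstack([X, W_hat])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[: X.shape[1]], W_hat


# Simulated example (illustrative values only).
rng = np.random.default_rng(0)
m, n = 40, 500
X = np.where(np.arange(m) % 2 == 0, 1.0, -1.0).reshape(-1, 1)

# One unwanted factor that is correlated with the factor of interest.
W = (0.6 * X[:, 0] + rng.normal(size=m)).reshape(-1, 1)
alpha = 2.0 * rng.normal(size=(1, n))

# The first 20 genes are truly differentially expressed.
beta = np.zeros((1, n))
beta[0, :20] = 1.5

# The last 200 genes serve as negative controls (their beta is zero).
ctl = np.zeros(n, dtype=bool)
ctl[-200:] = True

Y = X @ beta + W @ alpha + 0.5 * rng.normal(size=(m, n))

# Unadjusted regression on X alone, for comparison.
beta_naive, *_ = np.linalg.lstsq(X, Y, rcond=None)
beta_ruv2, W_hat = ruv2(Y, X, ctl, k=1)

err_naive = np.mean((beta_naive - beta) ** 2)
err_ruv2 = np.mean((beta_ruv2 - beta) ** 2)
print("mean squared error, unadjusted:", err_naive)
print("mean squared error, RUV-2 sketch:", err_ruv2)
```

In this sketch k is supplied by hand, which reflects the difficulty noted above: the result depends on how many unwanted factors are included.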
The second method we present is a more complicated four-step procedure that we name RUV-4. Compared to RUV-2, RUV-4 is relatively insensitive to the number of unwanted factors included in the model; this makes estimating the number of factors less critical. We also present a novel method for estimating the genes' variances that may be used even when a large number of unwanted factors are included in the model and the design matrix is full rank. We name this method the "inverse method for estimating variances." By combining RUV-4 with the inverse method, it is no longer necessary to estimate the number of unwanted factors at all.
We discuss various techniques for assessing the performance of an adjustment method, and compare the performance of RUV-2, RUV-4, and their variants with the performance of other commonly used adjustment methods such as ComBat, SVA, LEAPP, and ICE. We present several example studies, each concerning genes differentially expressed with respect to gender in the brain. We find that our methods perform as well as or better than other methods.