Open Access Publications from the University of California

Improving Clinically Relevant Classification of Gene Expression Datasets Using Attribute Classifiers as Features

  • Author(s): Durbin, Kenneth James
  • Advisor(s): Stuart, Josh
  • et al.

In this work, I examine the use of low-level feature classifiers to enhance the performance of clinically relevant gene expression classifiers. In other machine learning domains, notably machine vision, recognizing higher-level or more abstract concepts by using lower-level or more concrete feature classifiers is a common motif. For example, a classifier to recognize images of “parties” might usefully first have sub-classifiers to recognize concrete elements commonly found in parties: indoors, people, party hats, cake, candles, banners, confetti, etc. The outputs of these low-level feature classifiers can be supplied to an overall “party” classifier to assign a label. Image pixels are far removed from an abstract concept like “party,” so training classifiers for these more concretely defined sub-concepts is a way to bridge this semantic gap. Moreover, there may be many more training examples available for the lower-level concepts (e.g., people) than for the target (e.g., party), allowing one to utilize the robustness that comes from plentiful data. I hypothesize that there is a similar semantic gap between gene expression levels and clinically relevant gene expression classification targets, such as survival prognosis or drug sensitivity for tumor types, and that these clinically relevant classification tasks can benefit from decomposition into classifiers for more concrete concepts, such as tissue type, chromatin state, mutations, and gene essentiality. I will present a series of experiments that show modest but real improvements from the use of low-level feature classifiers in several gene expression prediction tasks.
I will also present a series of tools developed for this work including wekaMine (a large-scale model selection and feature-building pipeline), viewtab (a “big data” spreadsheet), csvsql (a program to allow arbitrary SQL queries on csv/tab files), and SamplePsychic (a web application to apply a suite of feature classifiers to gene expression samples and explore the results).
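
The two-level idea described in the abstract can be illustrated with a minimal sketch. This is not the wekaMine pipeline or any code from the dissertation; it is a toy under stated assumptions: the data are invented three-gene expression vectors, the attribute names `tissue_A` and `mutation_B` are hypothetical, and a hand-written nearest-centroid rule stands in for whatever classifiers the actual work uses. Attribute classifiers are trained on an auxiliary dataset, and their outputs become the features of a top-level classifier trained on a separate target dataset.

```python
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Tiny binary classifier: predict the class whose centroid is closer."""

    def fit(self, X, y):
        self.pos = centroid([x for x, lab in zip(X, y) if lab == 1])
        self.neg = centroid([x for x, lab in zip(X, y) if lab == 0])
        return self

    def predict(self, x):
        return 1 if sq_dist(x, self.pos) < sq_dist(x, self.neg) else 0


# Auxiliary dataset (invented): expression of three genes per sample,
# labeled with concrete, low-level attributes rather than the clinical target.
X_aux = [[5, 1, 0], [5, 1, 3], [1, 5, 0], [1, 5, 3]]
aux_labels = {
    "tissue_A": [1, 1, 0, 0],    # hypothetical attribute
    "mutation_B": [0, 1, 0, 1],  # hypothetical attribute
}

# Level 1: one attribute classifier per low-level concept.
attribute_clfs = {
    name: NearestCentroid().fit(X_aux, y) for name, y in aux_labels.items()
}


def attribute_features(x):
    """Re-describe a sample by the outputs of the attribute classifiers."""
    return [clf.predict(x) for clf in attribute_clfs.values()]


# Target dataset (invented): a separate set of samples with the clinical label.
X_target = [[5, 1, 3], [4, 2, 3], [5, 1, 0], [1, 5, 3], [1, 5, 0]]
y_target = [1, 1, 0, 0, 0]

# Level 2: the top-level classifier sees attribute outputs, not raw expression.
top = NearestCentroid().fit(
    [attribute_features(x) for x in X_target], y_target
)


def predict_target(x):
    """Predict the clinical label for a raw expression vector."""
    return top.predict(attribute_features(x))
```

In this toy, the target label is positive only when both attributes are present, so `predict_target([4.5, 1.5, 2.5])` yields 1 while `predict_target([4.5, 1.5, 0.5])` yields 0. A practical version would more likely pass classifier confidences rather than hard 0/1 outputs to the top level; the hard outputs here are a simplification.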