High Dimensional Statistical and Computational Methods for Knowledge Discovery and Data Mining in Biomedical Data
Biomedical sciences have seen radical growth in recent decades, driven by a series of technological breakthroughs. Sequencing and imaging in particular are two technologies whose advances have enabled scientists to explore areas that were previously out of reach. High-throughput sequencing, for instance, is perhaps one of the most groundbreaking advances in biology; it allows genetic material (e.g., DNA, RNA, proteins) to be identified cheaply and accurately, granting investigators unprecedented insight into the inner workings of the genome, the blueprint of all living organisms. As a result, high-throughput sequencing, and in recent years single-cell sequencing in particular, has become the cornerstone of genetics research. Sequencing can reveal the genomic location of a gene, but the physical locations where a gene is expressed in a cell are often biologically meaningful as well, and imaging technologies such as fluorescent tagging and powerful electronic microscopes now make it possible to ascertain this information. The field of imaging technology is vast, and other areas have also seen tremendous leaps forward. For instance, with the development of CT scans and better PET tracers, researchers now have an in vivo view of metabolic activity in organs, allowing them to monitor and study diseases as they progress and generating an unprecedented level of understanding of devastating conditions such as Alzheimer’s.
In response to the profusion of quality data, statistical techniques for analyzing these data have also flourished, giving rise to the fields of computational biology and statistical genomics, which have since emerged as an indispensable part of the scientific discovery pipeline as well as an important interface between statistics/machine learning and the biomedical sciences. In this thesis we examine applications of statistical techniques to three vastly different data sets. In the first work we analyze data from PET brain scans of Alzheimer’s disease patients and explore how linear mixed-effects models offer a powerful and flexible alternative for gauging β-amyloid accumulation. The data we study in the second work consist of single-cell RNA-seq data from mouse embryonic, human embryonic, and human cancer cells, for which we introduce a biclustering method that simultaneously extracts biologically relevant cell clusters and the genes that are active in those clusters. In the third work, multiple biological databases comprising both imaging and sequencing data are combined to formulate a machine learning problem, to which a random forest is applied to mine master regulators of organogenesis.