Microarrays have emerged during the past decade as a viable platform for detecting DNA from microorganisms in clinical and environmental samples. These microbial detection arrays occupy a middle ground between low-cost, narrowly focused assays such as multiplex PCR and more expensive, broad-spectrum technologies such as high-throughput sequencing. The Pathogen Bioinformatics Group at Lawrence Livermore National Laboratory is one of several teams actively developing arrays for clinical diagnostics, biologic product safety testing, environmental monitoring and biodefense applications.
Statistical algorithms that can analyze data from microbial detection arrays and provide easily interpretable results are essential for these efforts to succeed. Several researchers have developed methods to determine which organisms are present in a microbial detection array sample. The algorithms developed so far operate mainly within a hypothesis testing framework, and are not motivated by a physical model of the process by which microbial DNA hybridizes to DNA probes on the array. As a result, they provide only probabilities for the presence or absence of an organism, and cannot infer the abundances of the microbes in the sample. They also have limited capacity to handle samples containing complex mixtures of microorganisms.
This dissertation describes an approach to developing a quantitative algorithm for microbial detection array data analysis, capable of both identifying the organisms present in a sample and inferring their concentrations. After reviewing the most promising array designs and analysis algorithms developed to date, I present a physical model for predicting probe signals on an array, given the set of target organisms present in a sample and their concentrations. I describe the experiments performed to fit the key parameters in this model. Finally, I present an approach to solving the inverse problem, in which the probe signals are observed and used to infer which targets are present and at what concentrations.
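The relationship between the forward model and the inverse problem can be illustrated with a minimal sketch. It is not the model developed in this dissertation: it assumes, purely for illustration, that each probe signal is a linear combination of target concentrations, and it solves the inverse problem by non-negative least squares using projected gradient descent. The names `predict_signals` and `infer_concentrations`, the affinity matrix, and all numeric values are hypothetical.

```python
import numpy as np


def predict_signals(affinity, concentrations):
    """Forward model (assumed linear here): expected signal for each probe.

    affinity[i, j] is a hypothetical response of probe i to a unit
    concentration of target j.
    """
    return affinity @ concentrations


def infer_concentrations(affinity, observed, n_iter=5000):
    """Inverse problem: non-negative least squares by projected gradient descent."""
    # A step of 1 / ||A||_2^2 guarantees descent for the least-squares objective.
    step = 1.0 / (np.linalg.norm(affinity, 2) ** 2)
    conc = np.zeros(affinity.shape[1])
    for _ in range(n_iter):
        grad = affinity.T @ (affinity @ conc - observed)
        # Project onto the non-negative orthant: concentrations cannot be negative.
        conc = np.maximum(conc - step * grad, 0.0)
    return conc


# Hypothetical array with 4 probes and 3 candidate targets.
affinity = np.array([
    [1.0, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 1.0],
])
true_conc = np.array([2.0, 0.0, 0.5])  # the second target is absent

observed = predict_signals(affinity, true_conc)      # noise-free, for illustration
estimated = infer_concentrations(affinity, observed)
present = estimated > 0.05                            # arbitrary detection threshold
```

In this toy setting a single estimate answers both questions at once: a target is called present when its estimated concentration exceeds the threshold, and the estimate itself gives the abundance. Real hybridization signals saturate and are noisy, so a linear model with a fixed threshold should be read only as a stand-in for the physical model and inference procedure.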