UC San Diego
Statistical Approaches for Big Data Analytics and Machine Learning: Data-Driven Network Reconstruction and Predictive Modeling of Time Series Biological Systems
- Author(s): Farhangmehr, Farzaneh
- et al.
The ever-increasing quantity of data generated by modern technologies necessitates the development of advanced approaches for big data analytics. The ultimate goal of such approaches is to capture insightful patterns and turn them into actionable information. This information not only reveals the hidden patterns underlying complex systems but also facilitates the design and development of new mechanisms to overcome multidisciplinary challenges. The data mining process can be divided into two steps: network reconstruction, which determines the structure and details of interactions, and predictive modeling, which represents the reconstructed networks as models capable of predicting system performance under new conditions. The main goal of this research is to develop algorithms and methodologies to overcome challenges in big data analytics. The statistical approaches for data-driven network reconstruction and predictive modeling developed in this research have several advantages. First, unlike most data-mining methods, they make no assumptions about linearity or about the functional or parametric forms of the variables. Second, they decrease the computational complexity for time-series data sets. Finally, these algorithms are applicable to multiple systems, ranging from social networks to complex biological systems, which are the main focus of this research. We propose a Bayesian and information-theoretic approach for data-driven network reconstruction and predictive modeling of phosphoprotein-cytokine signaling networks in RAW 264.7 macrophages. To decrease the computational complexity associated with dynamic networks, an algorithm is presented for network reconstruction of large-scale systems from time-course microarray data sets. The applicability of this algorithm is demonstrated by constructing the network of pathway interactions in the yeast cell cycle. The algorithm is also used to capture predictive models of dynamic networks and is applied to reverse engineer E. coli under ampicillin. Finally, we demonstrate a data-mining methodology for linking changes in gene expression and health over time by reverse engineering a GEO dataset in which gene expression of multiple sclerosis (MS) patients under interferon-β therapy was measured over a 10-year time interval.
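
As a rough illustration of what an information-theoretic, assumption-free approach to network reconstruction can look like, the sketch below scores every pair of variables by a plug-in mutual information estimate and keeps the pairs that exceed a cutoff. This is a generic textbook technique, not the algorithm developed in the dissertation. The function names (`quantile_bins`, `mutual_information`, `reconstruct_network`, `demo`), the equal-frequency binning, the bin count, and the threshold are all assumptions chosen for the example.

```python
import math
import random
from collections import Counter
from itertools import combinations


def quantile_bins(values, bins=4):
    """Assign each sample to an equal-frequency bin (0..bins-1) by rank.

    Ties are broken by sort order, which is adequate for continuous data.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def mutual_information(x, y, bins=4):
    """Plug-in estimate of I(X;Y) in bits from paired samples."""
    dx, dy = quantile_bins(x, bins), quantile_bins(y, bins)
    n = len(dx)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    # Sum p(a,b) * log2( p(a,b) / (p(a) p(b)) ) over observed bin pairs.
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def reconstruct_network(series, threshold=0.3, bins=4):
    """Return {(u, v): score} for variable pairs whose score >= threshold.

    `series` maps a variable name to its list of measurements; the
    threshold is a hypothetical cutoff, not a value from the source.
    """
    edges = {}
    for u, v in combinations(sorted(series), 2):
        score = mutual_information(series[u], series[v], bins)
        if score >= threshold:
            edges[(u, v)] = score
    return edges


def demo():
    """Synthetic data: B depends nonlinearly on A; C is independent noise."""
    rng = random.Random(0)
    a = [rng.gauss(0, 1) for _ in range(500)]
    b = [v * v + rng.gauss(0, 0.1) for v in a]
    c = [rng.gauss(0, 1) for _ in range(500)]
    return reconstruct_network({"A": a, "B": b, "C": c})


if __name__ == "__main__":
    for pair, score in demo().items():
        print(pair, round(score, 3))
```

In the synthetic data, B is a noisy square of A, so their linear correlation is close to zero while their mutual information is not; this is the kind of dependence a linearity-free score can still detect. A realistic pipeline would also need a significance test for the cutoff and a way to separate direct from indirect interactions, neither of which is attempted here.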