Scholarly Works (5 results)

Article
Peer Reviewed

Local Identifiability of $l_1$-minimization Dictionary Learning: a Sufficient and Almost Necessary Condition

UC Berkeley Previously Published Works (2018)

Thesis
Peer Reviewed

Dictionary learning: analysis of spatial gene expression data and local identifiability theory

Wu, Siqi
Advisor(s): Yu, Bin

UC Berkeley Electronic Theses and Dissertations (2016)

Spatial gene expression data enable the detection of local covariability and are extremely useful for identifying local gene interactions during normal development. The abundance of spatial expression data in recent years has led to the modeling and analysis of regulatory networks. The inherent complexity of such data makes it a challenge to extract biological information. In the first part of the thesis, we developed staNMF, a method that combines a dictionary learning algorithm called nonnegative matrix factorization (NMF) with a new stability-driven criterion to select the number of dictionary atoms. When applied to a set of Drosophila early embryonic spatial gene expression images, one of the largest datasets of its kind, staNMF identified a dictionary with 21 atoms, which we call principal patterns (PP). Providing a compact yet biologically interpretable representation of Drosophila expression patterns, PP are comparable to a fate map generated experimentally by laser ablation and show exceptional promise as a data-driven alternative to manual annotations. Our analysis mapped genes to cell-fate programs and assigned putative biological roles to uncharacterized genes. Furthermore, we used the PP to generate local transcription factor (TF) regulatory networks. Spatially local correlation networks (SLCN) were constructed for six PP that span along the embryonic anterior-posterior axis. Using a two-tail 5% cut-off on correlation, we reproduced 10 of the 11 links in the well-studied gap gene network. The performance of PP with the Drosophila data suggests that staNMF provides informative decompositions and constitutes a useful computational lens through which to extract biological insight from complex and often noisy gene expression data.

The biological interpretability of the NMF-derived dictionary motivated us to understand analytically why dictionary learning works. In particular, if the observed data are generated from a ground truth dictionary, under what conditions can dictionary learning recover the true dictionary? In the second part of the thesis, we studied the local correctness, or local identifiability, of a particular dictionary learning formulation with the $l_1$-norm objective function. Suppose we observe $N$ data points $\mathbf{x}_i\in \mathbb R^K$ for $i=1,\dots,N$, where the $\mathbf{x}_i$ are i.i.d. random linear combinations of the $K$ columns of a square and invertible dictionary $\mathbf{D}_0 \in \mathbb R^{K\times K}$. We assumed that the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. For the population case, we established a sufficient and almost necessary condition for $\mathbf{D}_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition in Gribonval and Schnass (2010). Moreover, we demonstrated that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Finally, it was shown that our local identifiability results translate to the finite sample case with high probability provided $N = O(K\log K)$.
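The abstract does not write the objective out. One common way to state an $l_1$-minimization formulation for a complete (square, invertible) dictionary is sketched below. The unit-norm column constraint and the symbols $L_N$, $L$, $\mathbf{d}_k$ and $\boldsymbol{\alpha}_i$ are assumptions introduced here for illustration and are not taken from the abstract.

```latex
% Data model: each observation is a random linear combination of the columns of D_0
\mathbf{x}_i = \mathbf{D}_0 \boldsymbol{\alpha}_i, \qquad i = 1, \dots, N .

% Finite-sample l_1 objective over candidate dictionaries D
% (assumed constraint: unit-norm columns d_k, to remove the scaling ambiguity)
\min_{\mathbf{D} \in \mathbb{R}^{K \times K}} \;
  L_N(\mathbf{D}) = \frac{1}{N} \sum_{i=1}^{N} \bigl\| \mathbf{D}^{-1} \mathbf{x}_i \bigr\|_1
\quad \text{subject to} \quad \| \mathbf{d}_k \|_2 = 1, \; k = 1, \dots, K .

% Population version: D_0 is locally identifiable if it is a local minimum of
L(\mathbf{D}) = \mathbb{E} \, \bigl\| \mathbf{D}^{-1} \mathbf{x} \bigr\|_1 .
```

Under this reading, $\mathbf{D}^{-1}\mathbf{x}_i$ is the coefficient vector that a candidate dictionary $\mathbf{D}$ assigns to $\mathbf{x}_i$, so the objective rewards dictionaries under which the data have small $l_1$-norm coefficients.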

Article
Peer Reviewed

Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks

UC Berkeley Previously Published Works (2016)

Spatial gene expression patterns enable the detection of local covariability and are extremely useful for identifying local gene interactions during normal development. The abundance of spatial expression data in recent years has led to the modeling and analysis of regulatory networks. The inherent complexity of such data makes it a challenge to extract biological information. We developed staNMF, a method that combines a scalable implementation of nonnegative matrix factorization (NMF) with a new stability-driven model selection criterion. When applied to a set of Drosophila early embryonic spatial gene expression images, one of the largest datasets of its kind, staNMF identified 21 principal patterns (PP). Providing a compact yet biologically interpretable representation of Drosophila expression patterns, PP are comparable to a fate map generated experimentally by laser ablation and show exceptional promise as a data-driven alternative to manual annotations. Our analysis mapped genes to cell-fate programs and assigned putative biological roles to uncharacterized genes. Finally, we used the PP to generate local transcription factor regulatory networks. Spatially local correlation networks were constructed for six PP that span along the embryonic anterior-posterior axis. Using a two-tail 5% cutoff on correlation, we reproduced 10 of the 11 links in the well-studied gap gene network. The performance of PP with the Drosophila data suggests that staNMF provides informative decompositions and constitutes a useful computational lens through which to extract biological insight from complex and often noisy gene expression data.
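The idea of stability-driven model selection can be illustrated with a minimal sketch: fit NMF several times from different random starts for each candidate number of atoms, and prefer the number for which the learned dictionaries agree most closely across runs. Everything below is an assumption made for illustration, not the staNMF implementation: the function names `nmf`, `dissimilarity` and `instability` are hypothetical, the solver is a plain multiplicative-update NMF rather than the scalable one the abstract mentions, and the dissimilarity is one plausible correlation-based (Amari-type) choice.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0):
    """Basic multiplicative-update NMF: X (features x samples) ~= D @ A."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    D = rng.random((n, k)) + 1e-3   # dictionary, one atom per column
    A = rng.random((k, m)) + 1e-3   # nonnegative coefficients
    eps = 1e-10                     # guards against division by zero
    for _ in range(n_iter):
        A *= (D.T @ X) / (D.T @ D @ A + eps)
        D *= (X @ A.T) / (D @ A @ A.T + eps)
    return D, A


def dissimilarity(D1, D2):
    """Amari-type distance: 0 when atoms match up to permutation and scale."""
    D1n = D1 / (np.linalg.norm(D1, axis=0, keepdims=True) + 1e-12)
    D2n = D2 / (np.linalg.norm(D2, axis=0, keepdims=True) + 1e-12)
    C = np.abs(D1n.T @ D2n)         # cross-correlation between atoms
    k = C.shape[0]
    return (2 * k - C.max(axis=0).sum() - C.max(axis=1).sum()) / (2 * k)


def instability(X, k, n_runs=5):
    """Average pairwise dissimilarity of dictionaries from different seeds."""
    dicts = [nmf(X, k, seed=s)[0] for s in range(n_runs)]
    scores = [
        dissimilarity(dicts[i], dicts[j])
        for i in range(n_runs)
        for j in range(i + 1, n_runs)
    ]
    return float(np.mean(scores))


if __name__ == "__main__":
    # Synthetic nonnegative data built from 3 hypothetical atoms.
    rng = np.random.default_rng(42)
    X = rng.random((30, 3)) @ rng.random((3, 200))
    for k in range(2, 6):
        print(k, round(instability(X, k), 4))
```

In this sketch a lower instability score means the dictionary is more reproducible across random restarts, so one would pick the candidate `k` with the lowest score. On real data the score can be noisy, and the number of restarts and iterations would need tuning.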

Article
Peer Reviewed

DataLab: A Version Data Management and Analytics System

UC Berkeley Previously Published Works (2016)

Article
Peer Reviewed

Spatial expression of transcription factors in Drosophila embryonic organ development

UC Berkeley Previously Published Works (2013)

Background

Site-specific transcription factors (TFs) bind DNA regulatory elements to control expression of target genes, forming the core of gene regulatory networks. Despite decades of research, most studies focus on only a small number of TFs and the roles of many remain unknown.

Results

We present a systematic characterization of spatiotemporal gene expression patterns for all known or predicted Drosophila TFs throughout embryogenesis, the first such comprehensive study for any metazoan animal. We generated RNA expression patterns for all 708 TFs by in situ hybridization, annotated the patterns using an anatomical controlled vocabulary, and analyzed TF expression in the context of organ system development. Nearly all TFs are expressed during embryogenesis and more than half are specifically expressed in the central nervous system. Compared to other genes, TFs are enriched early in the development of most organ systems, and throughout the development of the nervous system. Of the 535 TFs with spatially restricted expression, 79% are dynamically expressed in multiple organ systems while 21% show single-organ specificity. Of those expressed in multiple organ systems, 77 TFs are restricted to a single organ system either early or late in development. Expression patterns for 354 TFs are characterized for the first time in this study.

Conclusions

We produced a reference TF dataset for the investigation of gene regulatory networks in embryogenesis, and gained insight into the expression dynamics of the full complement of TFs controlling the development of each organ system.