## Dictionary learning: analysis of spatial gene expression data and local identifiability theory

- Author(s): Wu, Siqi
- Advisor(s): Yu, Bin

## Abstract

Spatial gene expression data enable the detection of local covariability and are extremely useful for identifying local gene interactions during normal development. The abundance of spatial expression data in recent years has led to the modeling and analysis of regulatory networks. The inherent complexity of such data makes it a challenge to extract biological information. In the first part of the thesis, we developed staNMF, a method that combines nonnegative matrix factorization (NMF), a dictionary learning algorithm, with a new stability-driven criterion for selecting the number of dictionary atoms. When applied to a set of {\em Drosophila} early embryonic spatial gene expression images, one of the largest datasets of its kind, staNMF identified a dictionary with 21 atoms, which we call {\em principal patterns} (PP). Providing a compact yet biologically interpretable representation of {\em Drosophila} expression patterns, PP are comparable to a fate map generated experimentally by laser ablation and show exceptional promise as a data-driven alternative to manual annotations. Our analysis mapped genes to cell-fate programs and assigned putative biological roles to uncharacterized genes. Furthermore, we used the PP to generate local transcription factor (TF) regulatory networks. Spatially local correlation networks (SLCN) were constructed for six PP spanning the embryonic anterior-posterior axis. Using a two-tailed 5\% cutoff on correlation, we reproduced 10 of the 11 links in the well-studied gap gene network. The performance of PP on the {\em Drosophila} data suggests that staNMF provides informative decompositions and constitutes a useful computational lens through which to extract biological insight from complex and often noisy gene expression data.
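The stability-driven model selection described above can be sketched in a few lines. This is a minimal illustration of the general idea, not the thesis's staNMF implementation: it runs NMF several times with different random initializations for each candidate number of atoms, matches atoms across runs by cosine similarity (Hungarian matching), and scores each candidate by the average matched similarity. The data matrix `X`, the candidate range, and the similarity measure are all illustrative assumptions; staNMF itself uses its own dissimilarity criterion.

```python
import numpy as np
from sklearn.decomposition import NMF
from scipy.optimize import linear_sum_assignment

def dictionary_stability(X, k, n_runs=5, seed=0):
    """Average cross-run similarity of NMF dictionaries with k atoms.

    Fits NMF n_runs times from different random starts, matches atoms
    one-to-one between each pair of runs via the Hungarian algorithm on
    cosine similarity, and returns the mean matched similarity.
    A k whose dictionaries are reproducible scores close to 1.
    """
    dicts = []
    for r in range(n_runs):
        model = NMF(n_components=k, init="random",
                    random_state=seed + r, max_iter=500)
        model.fit(X)
        H = model.components_                         # atoms x features
        H = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
        dicts.append(H)
    scores = []
    for i in range(n_runs):
        for j in range(i + 1, n_runs):
            S = dicts[i] @ dicts[j].T                 # pairwise atom similarity
            row, col = linear_sum_assignment(-S)      # best one-to-one matching
            scores.append(S[row, col].mean())
    return float(np.mean(scores))

# Toy stand-in for image data (nonnegative); pick the most stable k
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 40)))
best_k = max(range(2, 6), key=lambda k: dictionary_stability(X, k, n_runs=3))
```

On real expression images, the candidate range would span many more values of k, and the stability curve (rather than a single argmax) is inspected to choose the dictionary size.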

The biological interpretability of the NMF-derived dictionary motivated us to understand analytically why dictionary learning works. In particular, if the observed data are generated from a ground truth dictionary, under what conditions can dictionary learning recover the true dictionary? In the second part of the thesis, we studied the local correctness, or {\em local identifiability}, of a particular dictionary learning formulation with the $l_1$-norm objective function. Suppose we observe $N$ data points $\mathbf{x}_i\in \mathbb R^K$ for $i=1,...,N$, where the $\mathbf{x}_i$'s are i.i.d. random linear combinations of the $K$ columns of a square and invertible dictionary $\mathbf{D}_0 \in \mathbb R^{K\times K}$. We assumed that the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. For the population case, we established a sufficient and almost necessary condition for $\mathbf{D}_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition in Gribonval and Schnass (2010). Moreover, we demonstrated that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-products at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Finally, we showed that our local identifiability results translate to the finite sample case with high probability provided $N = O(K\log K)$.
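The local identifiability question above can be probed numerically. The sketch below, a toy experiment under assumed parameters (K, N, s, perturbation size are all illustrative, not values from the thesis), generates data from the $s$-sparse Gaussian model with a complete dictionary and checks that small random perturbations of $\mathbf{D}_0$ (with columns renormalized) tend not to decrease the empirical $l_1$-norm objective, which is what being a local minimum predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, s = 8, 2000, 2                          # illustrative sizes only

# Square reference dictionary with unit-norm columns
D0 = rng.normal(size=(K, K))
D0 /= np.linalg.norm(D0, axis=0)

# s-sparse Gaussian coefficients: s random nonzero entries per data point
A = np.zeros((K, N))
for i in range(N):
    support = rng.choice(K, size=s, replace=False)
    A[support, i] = rng.normal(size=s)
X = D0 @ A                                    # observed data points

def l1_objective(D, X):
    """Empirical l1 objective: mean l1 norm of the coefficients D^{-1} x_i.

    For a complete (square, invertible) dictionary the coefficients
    are recovered exactly by solving the linear system.
    """
    return np.abs(np.linalg.solve(D, X)).sum() / X.shape[1]

f0 = l1_objective(D0, X)
# Perturb D0 randomly and count how often the objective does not decrease;
# local identifiability predicts this for small enough perturbations.
worse = 0
for t in range(20):
    P = D0 + 0.05 * rng.normal(size=(K, K))
    P /= np.linalg.norm(P, axis=0)
    worse += l1_objective(P, X) >= f0
```

Mixing in off-dictionary directions spreads mass onto coordinates that were zero in the sparse coefficients, so each perturbation almost always raises the $l_1$ norm; this is the intuition the thesis makes precise and quantifies in terms of the coherence $\mu$.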