Neurodevelopmental disorders (NDDs), such as autism spectrum disorder (ASD), intellectual disability, and developmental disability, are genetically and phenotypically heterogeneous disorders that display high comorbidity and relatively high heritability. Although non-coding and common variation contribute to a substantial proportion of all NDD cases, rare coding variation has proved invaluable for identifying NDD risk genes. NDD cases carry a significantly larger burden of de novo variation, a form of rare genetic variation that is not inherited from either parent, than unaffected controls. The enrichment of non-synonymous de novo coding variation in cases relative to controls enables the discovery of genetic modules, the early prediction of a subset of affected cases at low false positive rates, and the identification of critical cell types relevant to specific modules. Modules are networks of genes that participate in a common biological function. Chapter 1 introduces the module discovery tool MAGI-S and its extension MAGI-MS, which identify modules that can dissect specific phenotypes given ‘seed’ gene(s) that are members of biological pathways of interest. MAGI-S and MAGI-MS provide evidence that the epilepsy phenotype can be dissected from more general NDD phenotypes and that module genes are enriched for non-synonymous de novo mutations in cases relative to controls.
In Chapter 2, a shallow neural network (SNN) trained with a loss function that minimizes the false positive rate (FPR) uses non-synonymous de novo mutations, together with features related to genic constraint and conservation, to identify a small subset of NDD cases at very low FPR. Compared to traditional machine learning techniques and to heuristics derived from genic constraint metrics and known NDD risk genes, the SNN achieves greater true positive rates (TPR) at near-zero FPR. It also ranks candidate NDD risk genes.
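The evaluation criterion named above, TPR at a fixed near-zero FPR, can be illustrated with a minimal sketch. This is not the SNN or its loss function; it only shows how the metric could be computed from per-sample scores. The function name `tpr_at_fpr`, its parameters, and the toy scores are hypothetical and chosen for illustration, and the sketch assumes at least one case and one control are present.

```python
import numpy as np

def tpr_at_fpr(scores, labels, max_fpr=0.0):
    """Best TPR over all score thresholds whose FPR does not exceed max_fpr.

    scores: higher means "more likely a case" (hypothetical classifier output).
    labels: 1 for cases, 0 for controls; both classes must be present.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    best_tpr = 0.0
    # Try every observed score as a threshold (predict "case" if score >= t).
    for t in np.unique(scores):
        predicted = scores >= t
        fpr = (predicted & ~labels).sum() / n_neg
        tpr = (predicted & labels).sum() / n_pos
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, tpr)
    return float(best_tpr)

# Toy data (made up for illustration): three cases, three controls.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0))
```

With these toy values, the strictest usable threshold admits the two top-scoring cases and no controls, so two of the three cases are recovered at zero FPR. Relaxing `max_fpr` trades additional false positives for additional recovered cases.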
Given single-cell expression data and modules such as those generated by MAGI-S and MAGI-MS, MoToCC identifies groups of cells that selectively express the module genes. MoToCC is a linear programming approach that maximizes gene co-expression among the selected cells while accounting for cell-cell similarity and K-nearest neighbor connectivity. Because users can vary the number of cells returned in a solution, the cell types relevant to a module and the shifts in their percent composition can be visualized at varied scales, as shown in Chapter 3 for three NDD modules.
The computational tools described here seek to use the predictive power of de novo coding variation to further characterize the genetic etiology of neurodevelopmental disorders and, ultimately, to improve the well-being of affected patients.