Recent advances in genomic technologies have made it possible to collect large-scale information on genetic variation across a diverse biological landscape.
This has resulted in an exponential influx of genetic information, and the field of genetics has become data-rich in a relatively short amount of time.
These developments have opened new avenues to elucidate the genetic basis of complex diseases, where traditional disease study approaches have had little success.
In recent years, the genome-wide association study (GWAS) approach has gained widespread popularity for its ease of use and effectiveness, and is now the standard approach to study complex diseases.
In GWAS, information on millions of single-nucleotide polymorphisms (SNPs) is collected from case and control individuals.
SNP genotyping is cost-effective, and because SNPs are abundant in the genome, they are correlated with neighboring genetic variation, which makes them tags for genomic regions.
Typically, each SNP is statistically tested for association with disease, and the genomic regions tagged by the significant SNPs are believed to harbor the functional variants contributing to disease.
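As a rough illustration of the per-SNP testing described above, the following sketch applies a 2x2 allelic chi-square test to each SNP and keeps those below a significance threshold. The function names, the input layout (allele counts per group), and the default threshold of 5e-8 (a conventional genome-wide level) are assumptions made for illustration; this is not necessarily the statistic used in this work.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """2x2 allelic chi-square test for a single SNP (1 degree of freedom).

    Each argument is a (minor_allele_count, major_allele_count) pair.
    Returns (statistic, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        # Monomorphic SNP or empty group: no evidence of association.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / (margins[0] * margins[1] * margins[2] * margins[3])
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def scan_snps(allele_counts, alpha=5e-8):
    """Test every SNP independently and return the IDs that pass alpha.

    allele_counts maps a (hypothetical) SNP ID to (case_counts, control_counts).
    """
    significant = []
    for snp_id, (case_counts, control_counts) in allele_counts.items():
        _, p_value = allelic_chi_square(case_counts, control_counts)
        if p_value < alpha:
            significant.append(snp_id)
    return significant
```

In a real study the threshold would also need to account for the number of tests performed; the fixed default here is only a placeholder.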
To reduce the cost of GWAS and the redundancy in the information collected, only an informative subset of the SNPs, known as tag SNPs, is genotyped.
The genomic regions harboring the significantly associated tag SNPs may be large and contain many additional polymorphisms.
At this stage of the study, it may not be clear which specific genes or polymorphisms are in fact most strongly associated with disease.
We present a novel framework for designing cost-effective follow-up association studies to further characterize such regions by genotyping additional SNPs to identify all the associated polymorphisms.
This identification of all associated polymorphisms provides a catalog of all possible functional variants, and the values of the actual association statistics at these polymorphisms may provide information to identify causal variants.
We demonstrate the utility of our method in identifying significant associations and causal variants using simulated and real GWAS datasets.
Although GWAS have been widely used to study associations of SNPs with disease phenotypes, there has been growing interest in applying the GWAS approach to high-throughput biological phenotypes, such as gene expression.
In these studies, the goal is to identify genomic regions that affect gene expression levels, known as expression quantitative trait loci (eQTL).
A challenge in applying GWAS to eQTL studies is that there are tens of thousands of measurements, each representing the expression level of one gene, for each sample tested, as opposed to values for one or two clinical traits.
This results in a tremendous computational burden: the analysis requires billions of tests and demands substantial computational resources.
We present a novel two-stage approach to efficiently identify all of the significant associations without testing all the SNPs.
In the first stage, a small number of informative SNPs across the genome are tested.
Based on their observed associations, our approach locates the regions that may contain significant SNPs and only tests additional SNPs from those regions.
We demonstrate that this method increases the computational speed of eQTL studies by a factor of ten, and can be applied to reduce the computational burden of a wide range of association statistics.
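The two-stage idea can be sketched as follows, under simplifying assumptions. Here a region is followed up whenever its tag SNP's absolute Pearson correlation with expression reaches a lenient screening cutoff. The function names, the data layout, and both cutoffs (`screen_r`, `final_r`) are hypothetical, and this naive screening rule is only a stand-in: unlike the approach presented in this work, it offers no guarantee of recovering every significant association.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    return sxy / math.sqrt(sxx * syy)


def two_stage_scan(genotypes, expression, regions, screen_r=0.3, final_r=0.6):
    """Stage 1 tests one tag SNP per region; stage 2 tests the rest of a
    region only if its tag passed the (assumed) screening cutoff.

    genotypes: SNP ID -> list of genotype values (0/1/2), one per sample
    regions:   tag SNP ID -> list of SNP IDs in that region (tag included)
    Returns (hits, tests_run), where hits maps SNP ID -> correlation.
    """
    hits, tests_run = {}, 0
    for tag, members in regions.items():
        tests_run += 1
        tag_r = pearson_r(genotypes[tag], expression)
        if abs(tag_r) < screen_r:
            continue  # region skipped: no follow-up tests are spent on it
        for snp in members:
            if snp == tag:
                r = tag_r
            else:
                tests_run += 1
                r = pearson_r(genotypes[snp], expression)
            if abs(r) >= final_r:
                hits[snp] = r
    return hits, tests_run


# Toy data: six samples, two regions with one tag and one extra SNP each.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
genotypes = {
    "tagA": [0, 0, 1, 1, 2, 2],
    "snpA1": [0, 1, 1, 1, 2, 2],
    "tagB": [1, 2, 0, 0, 2, 1],
    "snpB1": [2, 1, 0, 0, 1, 2],
}
regions = {"tagA": ["tagA", "snpA1"], "tagB": ["tagB", "snpB1"]}
hits, tests_run = two_stage_scan(genotypes, expression, regions)
```

In this toy example the second region is never expanded, so fewer tests are run than there are SNPs; the savings in practice depend on how many regions pass the first stage.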
Finally, we develop a novel approach to address a problem that has been of fundamental interest to geneticists for decades.
The contribution of genetics to a trait, termed heritability, is often measured by the proportion of variation in the trait that is due to genetics.
Heritability quantifies the role of genetics in a trait and provides insight into disease etiology.
Traditionally, heritabilities were estimated in studies of individuals with known relatedness, such as classical twin studies.
Recently, estimating the heritability of a trait from unrelated individuals using GWAS data and, further, partitioning the heritability into the contributions of genomic regions have received considerable attention.
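For concreteness, heritability and its partition can be written under a standard additive variance-components model. The model and all symbols below are assumptions introduced for illustration: \sigma_g^2 is the genetic variance, \sigma_e^2 the environmental (residual) variance, \sigma_p^2 the total phenotypic variance, and R the number of genomic regions, whose genetic effects are assumed to be independent.

```latex
% Total heritability: share of phenotypic variance due to genetics
\[
  \sigma_p^2 = \sigma_g^2 + \sigma_e^2,
  \qquad
  h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
\]
% Partition over R regions with independent effects
\[
  \sigma_g^2 = \sum_{r=1}^{R} \sigma_{g,r}^2,
  \qquad
  h_r^2 = \frac{\sigma_{g,r}^2}{\sigma_p^2},
  \qquad
  h^2 = \sum_{r=1}^{R} h_r^2
\]
```

Under these assumptions, partitioning the heritability amounts to estimating each regional component \sigma_{g,r}^2.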
Existing methods partition the heritability by jointly estimating the contributions of all regions.
However, these methods are computationally intractable and may be inaccurate when the number of regions is large.
In this work, we present an alternative approach that partitions the total heritability into the contributions of an arbitrary number of regions, while performing these computations in parallel.
We demonstrate that our method is more accurate and computationally efficient than existing approaches.