eScholarship
Open Access Publications from the University of California

UC San Francisco Electronic Theses and Dissertations

Development of Bioinformatics Methods to Interrogate Complex Immune Related Genomic Regions from Next Generation Sequencing Data

Abstract

The killer-cell immunoglobulin-like receptor (KIR) gene complex, located in human chromosomal region 19q13.42, and the complement component 4 (C4) gene complex, located in human chromosomal region 6p21.33, encode proteins that have vital roles in immune system function. Component genes of these complexes exhibit copy number variation (CNV), extensive nucleotide polymorphism, and high sequence similarity with other genes of their complex. Next generation sequencing (NGS) has transformed genomics by offering a high-throughput, high-fidelity, and cost-effective sequencing method. However, NGS analysis of the KIR and C4 regions has been thwarted by the bioinformatics challenges imposed by their complex variation. In this work, the researcher presents two bioinformatics pipelines: PING, developed for KIR sequence analysis, and C4Investigator, developed for C4 sequence analysis. These pipelines provide comprehensive, high-throughput characterization of human KIR and C4 sequence variation from NGS data. They take in paired-end short-read sequencing data and output gene copy number for both genomic regions, high-resolution genotypes for the KIR complex, and high-resolution mapping of single nucleotide variants (SNVs) for the C4 region. The performance of PING was evaluated using real-world and synthetic datasets, while the performance of C4Investigator was evaluated using real-world datasets and by comparison to existing methods. Both PING and C4Investigator showed high performance for copy number determination and SNV characterization. To demonstrate the utility of the C4Investigator pipeline, the researcher applied C4Investigator to whole genome sequencing (WGS) data from the 1000 Genomes Project (1KGP) cohort (N=3199), characterizing C4 copy number and sequence variation for the first time in this dataset. To demonstrate the utility of the PING pipeline, the researcher applied PING to targeted sequencing datasets from divergent populations (European N=363, Khoesan N=104), in addition to WGS data from the 1KGP cohort (N=215). To the best of the researcher's knowledge, PING and C4Investigator are the only bioinformatics workflows currently available for assessment of KIR and C4 full genomic sequence variation from NGS data.
