The term ‘rare disease’ may at first suggest a problem that, if addressed, would benefit few and lead only to obscure scientific discoveries. Nothing could be further from the truth. Since before the discovery of the structure of DNA, rare disease research has enabled essential biological insights. These insights are surpassed only by the clinical innovations that were developed to treat rare disease, which benefit not only those living with rare disease but also millions of individuals living with common diseases. In the past decade, whole genome sequencing has revolutionized the diagnosis and care of individuals with rare genetic disease. However, at least half of these individuals do not receive a conclusive diagnosis after whole genome sequencing. Structural variants (SVs; genomic variants longer than 50 base pairs) are the genetic cause of a portion of these unresolved cases. As long-read sequencing methods become more accessible and structural variant detection algorithms improve, clinicians and researchers are gaining access to thousands of reliable SVs of unknown disease relevance. To address this emerging need, I developed StrVCTVRE, which distinguishes pathogenic from benign SVs among those that overlap exons. StrVCTVRE performs accurately across a wide SV size range on independent test sets, which will allow clinicians and researchers to eliminate about half of SVs from consideration while retaining 90% sensitivity. I anticipate clinicians and researchers will use StrVCTVRE to prioritize SVs in patients for whom no SV is immediately compelling, empowering deeper investigation into novel SVs to resolve cases and understand new mechanisms of disease. To illustrate the value of StrVCTVRE, I next applied it to a cohort of 50 probands with undiagnosed rare disease. Linked-read sequencing and optical mapping were performed for each proband, mother, and father in this cohort. I investigated the diagnostic value of these two methods by comparing them to short-read sequencing.
Clinical analysis and validation identified 11 diagnostic or candidate SVs in this cohort. Analysis of optical mapping data and analysis of linked-read sequencing data each detected all 11 SVs. Analysis of short-read sequencing data detected only 7 of these 11 SVs (64%). After prioritizing the SVs in each case with StrVCTVRE, I considered the number of SVs a clinical researcher would need to manually investigate to find the diagnostic or candidate SV. This number was surprisingly consistent across methods, which can be attributed to the greater sensitivity of newer methods and the poor specificity of older methods. While newer methods detect more SVs with greater specificity, I found that they have not been carefully calibrated for several clinically important measures, including SV type, zygosity, and endpoint accuracy. These are mostly algorithmic limitations and should improve as these methods mature.
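The review-burden measure described above can be sketched as follows. This is only a minimal illustration, assuming each called SV carries a pathogenicity score in which higher values mean more likely pathogenic. The function name, SV identifiers, and scores are hypothetical and are not taken from the cohort.

```python
from typing import Dict, Optional


def review_burden(scores: Dict[str, float], diagnostic_sv: str) -> Optional[int]:
    """Count the SVs a reviewer would inspect, working down from the highest
    score, before reaching the diagnostic SV (inclusive).

    Ties are counted pessimistically: every SV scoring at least as high as
    the diagnostic SV is assumed to be reviewed. Returns None when the
    method did not call the diagnostic SV at all.
    """
    if diagnostic_sv not in scores:
        return None
    target = scores[diagnostic_sv]
    return sum(1 for score in scores.values() if score >= target)


# Hypothetical scores for the SVs called in one proband (illustration only).
example_scores = {"sv_a": 0.91, "sv_b": 0.35, "sv_c": 0.78, "sv_d": 0.78}
burden = review_burden(example_scores, "sv_c")
not_called = review_burden(example_scores, "sv_z")

# Detection rate reported above for short-read sequencing: 7 of 11 SVs.
short_read_rate = round(100 * 7 / 11)
```

Counting ties pessimistically keeps the measure from depending on an arbitrary sort order; a method that fails to call the diagnostic SV yields no rank at all rather than a large one.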
An important limitation of SV classification is that relatively few SVs have been cataloged as pathogenic compared to single nucleotide variants (SNVs). To investigate how the accuracy of cataloged variants has changed over time, I shifted my focus to SNVs. Curated databases of pathogenic SNVs help clinicians and researchers interpret genetic testing results and classify novel variants. Yet these databases contain errors. Several studies have sought to identify cataloged variants that are misclassified, but none have recorded how variant misclassification has changed over time. Using archives of ClinVar and HGMD, I investigated how variant misclassification has changed over six years across different ancestry groups. As a model system, I considered a class of disorders that are often highly penetrant with neonatal phenotypes: inborn errors of metabolism (IEMs) included in newborn screening. I used samples from the 1000 Genomes Project (1KGP) to identify individuals with genotypes that were annotated as pathogenic. Due to the rarity of IEMs, nearly all annotated pathogenic genotypes indicate likely variant misclassification. While the accuracy of both ClinVar and HGMD has improved over time, HGMD variants currently imply two orders of magnitude more affected individuals in 1KGP than ClinVar variants. I observed that individuals of African ancestry have a significantly increased chance of being incorrectly predicted to be affected by a screened IEM when HGMD variants are used. However, this ancestry bias was no longer significant once common variants were removed in accordance with recent variant interpretation guidelines.
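The counting step described above can be sketched as follows. It is a simplified illustration, not the analysis pipeline itself. It assumes recessive inheritance and flags only homozygous genotypes, ignoring compound heterozygotes. The function name, sample and variant identifiers, allele frequencies, and the 0.05 cutoff are all hypothetical.

```python
from typing import Dict, Set, Tuple

Genotype = Tuple[int, int]  # allele pair: 0 = reference, 1 = alternate


def predicted_affected(
    genotypes: Dict[str, Dict[str, Genotype]],
    pathogenic: Set[str],
    allele_freq: Dict[str, float],
    max_af: float = 1.0,
) -> Set[str]:
    """Return samples homozygous for at least one annotated-pathogenic variant.

    Variants whose allele frequency exceeds max_af are dropped first, which
    mimics removing common variants before counting. Variants with no
    recorded frequency are treated as rare (assumption).
    """
    kept = {v for v in pathogenic if allele_freq.get(v, 0.0) <= max_af}
    return {
        sample
        for sample, calls in genotypes.items()
        if any(calls.get(variant) == (1, 1) for variant in kept)
    }


# Hypothetical toy data (illustration only).
toy_genotypes = {
    "s1": {"var1": (1, 1), "var2": (0, 1)},
    "s2": {"var2": (1, 1)},
    "s3": {"var1": (0, 1)},
}
toy_pathogenic = {"var1", "var2"}
toy_af = {"var1": 0.08, "var2": 0.001}

unfiltered = predicted_affected(toy_genotypes, toy_pathogenic, toy_af)
filtered = predicted_affected(toy_genotypes, toy_pathogenic, toy_af, max_af=0.05)
```

In this toy example the common variant `var1` accounts for one of the two flagged samples, so the frequency filter removes that sample from the count. This is the kind of effect the comparison before and after removing common variants is meant to expose.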