Whole genome sequence analysis of blood lipid levels in >66,000 individuals

Plasma lipids are heritable modifiable causal factors for coronary artery disease, the leading cause of death globally. Despite the well-described monogenic and polygenic bases of dyslipidemia, limitations remain in discovery of lipid-associated alleles using whole genome sequencing, partly due to limited sample sizes, ancestral diversity, and interpretation of potential clinical significance. Increasingly larger whole genome sequence datasets with plasma lipids coupled with methodologic advances enable us to more fully catalog the allelic spectrum for lipids. Here, among 66,329 ancestrally diverse (56% non-European ancestry) participants, we associate 428M variants from deep-coverage whole genome sequences with plasma lipids. Approximately 400M of these variants were not studied in prior lipids genetic analyses. We find multiple lipid-related genes strongly associated with plasma lipids through analysis of common and rare coding variants. We additionally discover several significantly associated rare non-coding variants largely at Mendelian lipid genes. Notably, we detect rare LDLR intronic variants associated with markedly increased LDL-C, similar to rare LDLR exonic variants. In conclusion, we conducted a systematic whole genome scan for plasma lipids expanding the alleles linked to lipids for multiple ancestries and characterize a clinically-relevant rare non-coding variant model for lipids.


Introduction 131
Discovery of rare alleles linked to plasma lipids (i.e., low-density lipoprotein cholesterol [LDL-C], high-density 132 lipoprotein cholesterol [HDL-C], total cholesterol [TC], and triglycerides [TG]) continue to yield important translational 133 insights toward coronary artery disease (CAD), including PCSK9 and ANGPTL3 inhibitors now available in clinical 134 practice 1,2,3,4,5 . The monogenic and polygenic bases of plasma lipids are well-suited to population-based discovery 135 analyses and confer broader insights for genetic analyses of complex traits. We now evaluate numerous newly 136 catalogued, largely rare, alleles never previously systematically analyzed with lipids. 137 Analyses of imputed array-derived genome-wide genotypes and whole exome sequences in hundreds of 138 thousands of increasingly diverse individuals continue to uncover low-frequency protein-coding variants linked to lipids. 139 Due to purifying selection, causal variants conferring large effects tend to occur relatively more recently, and are thus rare 140 and often specific to families or communities 6 . Most discovery analyses for large-effect rare alleles have focused on the 141 analysis of disruptive protein-coding variants given (1) well-recognized constraint in coding regions, (2) incomplete 142 genotyping of rare non-coding sequence given relative sparsity of deep-coverage (i.e., >30X) whole genome sequencing 143 (WGS), and (3) better prediction of coding versus non-coding sequence variation consequence 1,7,8,9,10,11,12 . We recently 144 described a statistical framework incorporating multi-dimensional reference datasets paired with genomic data to improve 145 rare coding and non-coding variant analyses for WGS analysis of lipids and other complex traits 13,14 . Furthermore, 146 including individuals of non-European ancestry facilitates the discovery of both novel alleles at established loci as well as 147 novel loci 14,15,16 . 148 Here, we examine the full allelic spectrum with plasma lipids using whole genome sequences and harmonized 149 lipids from the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) 150 program 17,18 . We studied 66,329 participants and 428 million variants across multiple ancestry groups -44.48% 151 European, 25.60% Black, 21.02% Hispanic, 7.11% Asian and 1.78% Samoan. We identified robust allelic heterogeneity at 152 known loci with several novel variants at these loci; we additionally identified novel loci and pursued replication in 153 independent cohorts (31.50% non-European samples). We then explored the association of genome-wide rare variants 154 with lipids, with detailed explorations of rare coding and non-coding variant models at known Mendelian dyslipidemia  Table 5). We brought seven putative novel variants with p-values < 5x10 -9 forward for replication. The 225 three common variants, rs183130 (CETP), rs7140110 (GAS6) and rs73729083 (CREB3L2), that were associated with 226 both LDL-C and TC in TOPMed also replicated in MGB and two (rs183130, rs73729083) replicated in Penn Biobank at an 227 alpha level of 0.05 and consistent direction of effect ( Table 1). The two variants that were associated in both replication 228 studies were most significantly associated among African Americans in TOPMed (rs183130: beta = -2.762 mg/dL, p-value 229 = 5.71x10 -07 ; rs73729083: beta = -3.725 mg/dL, p-value = 5.25x10 -07 ). Low-frequency variants from specific ancestry 230 groups associated with lipids in TOPMed were not replicated but we cannot rule out the possibility of reduced power due 231 to general underrepresentation of non-white ancestry groups in the replication data. In exploratory analyses, we extended 232 the same approach for variants discovered to have 5x10 -9 < p-value < 5x10 -7 but did not observe replication 233 (Supplementary Table 6). 234 235 CETP locus, HDL-C, and LDL-C 236 CETP is a well-recognized Mendelian HDL-C gene and the locus was previously known to be significantly 237 associated with HDL-C, TC and TG at genome-wide significance 15 . Pharmacologic CETP inhibitors have shown strong 238 associations with increased HDL-C but mixed effects for LDL-C reduction in clinical trials 27,28,29,30 . We found that the CETP 239 locus variant rs183130 (chr16:56957451:C:T, MAF 28.3%, intergenic variant) was associated with reduced LDL-C 240 concentration (beta = -1.568 mg/dL, SE = 0.264, p-value = 2.88x10 -09 ). The lead HDL-C-associated variant at the locus, 241 rs3764261 (chr16:56959412:C:A, MAF 30.3%), was associated with 3.5 mg/dL increased HDL-C (p-value = 8.03x10 -283 ), 242 and rs183130 was associated with 3.9 mg/dL increased HDL-C (p-value < 1x10 -284 ) as well. Among the ancestry groups 243 analyzed, rs183130 was most significantly associated with LDL-C among those of African ancestry (beta = -2.762 mg/dL, 244 p-value = 5.71x10 -07 ) (Supplementary Table 7). We next investigated variants by their HDL-C and LDL-C effects within 245 this locus (+/-500kb of rs183130 and rs3764261) (Fig. 3). We identified five variants showing at least suggestive (p-value 246 < 5x10 -07 ) association with both HDL-C and LDL-C. Though variants with strong LD (linkage disequilibrium) existed, 247 ancestry-specific analyses showed that the stronger LDL-C effects were among those of African ancestry. 248 To better understand the mechanisms for HDL-C and LDL-C effects at the CETP locus, we pursued colocalization 249 with eQTLs from 3 tissues (Liver, Adipose Subcutaneous and Adipose Visceral [Omentum]) from GTEx 31 . We analyzed 5 250 LDL-C and 441 HDL-C associated (p-values < 5x10 -07 ) variants. We correlated eQTL effect estimates for genes at the 251 locus with lipid outcome effect estimates. Indeed, CETP gene expression effects were strongly negatively correlated with 252 HDL-C effects (Liver: ρ -0.933, p-value 4.01x10 -17 ; Adipose Subcutaneous: ρ -0.762, p-value 8.87x10 -12 ; Adipose Visceral: 253 ρ -0.739, p-value 5.52x10 -10 ) (Supplementary Fig. 3). However, CETP expression effects were not significantly 254 correlated with LDL-C (Liver: ρ 0.007, p-value 0.99; Adipose Subcutaneous: ρ 0.344, p-value 0.57; Adipose Visceral: ρ -255 0.59, p-value 0.29). Given the possibility that the observed lack of correlation for LDL-C could be due to reduced power 256 from a limited number of variants attaining a suggestive p-value (< 5x10 -07 ), we repeated the analysis with a subset of 122 257 nominally significant (p-value < 0.05) LDL-C associated variants in this locus. Indeed, CETP gene expression effects were 258 strongly positively correlated with LDL-C effects (Liver: ρ 0.957, p-value 2.28x10 -08 ; Adipose Subcutaneous: ρ 0.922, p-259 value 1.34 x10 -15 ; Adipose Visceral: ρ 0.868, p-value 6.09x10 -11 ). 260 261

I) Gene-Centric associations 263
We next evaluated the association of aggregated rare (MAF<1%) variants, linked to protein-coding genes ('gene-264 centric'). We employed a Bonferroni-corrected significance threshold of 0.05/20,000=2.5x10 -06 for coding and non-coding 265 gene-centric rare variant analyses (Supplementary Fig. 4). We identified 102 coding and 160 non-coding gene-centric 266 rare variant aggregates significantly associated with at least one of the four plasma lipid phenotypes in nonconditional 267 analysis (Supplementary Table 8-9). We secondarily conditioned our significant aggregate sets on variants individually 268 associated with lipid levels from the GWAS catalog, MVP summary statistics and the TOPMed data. We identified 74 269 coding and 25 non-coding rare variants aggregates associated with at least one lipid level after conditional analyses 270

(Supplementary Table 10-11). 271
Most of the coding gene-centric sets remained significant after secondary conditioning while a minority of non-272 coding gene-centric sets remained significant after conditioning. Significant genes identified from coding rare variant 273 analyses included multiple known Mendelian lipid genes including LCAT, LDLR, and APOB (Supplementary Table 10). 274 RFC2 putative loss-of-function mutations (combined allele frequency < 0.002%) were significantly associated with 275 triglycerides (p-value 2x10 -06 ) representing a putative novel association for triglycerides. The RFC2 aggregate set (plof) 276 was associated with reduced TG (beta = -0.89 for log [TG]). The persistently significant regions identified from non-coding 277 rare variant analyses linked to genes included the UTR (untranslated region) for CETP and promoter-CAGE (CAGE-Cap 278 Analysis of Gene Expression sites) around APOA1 for HDL-C, and APOE promoter-CAGE, APOE enhancer-DHS (DHS -279 DNase hypersensitivity sites), and EHD3 promoter-DHS for total cholesterol (Supplementary Table 11). Most of the 280 coding aggregates had larger effects compared to non-coding aggregates, and among the non-coding aggregates SPC24 281 non-coding aggregate (enhancer-CAGE) at the LDLR locus had the strongest effect for LDL-C (beta = 2.320 mg/dL; p-282 value = 1.75x10 -05 ). 283

II) Region-Based associations 285
We also performed unbiased region-based rare variant association analyses tiled across the genome with both 286 static and dynamic window sizes. We first evaluated 2.6M regions statically at 2 kb size and 1 kb window overlap by the 287 sliding window approach. Statistical significance was assigned at 0.05/(2.6x1 -06 )=1.88x10 -08 . We identified 28 significantly 288 associated windows with at least one lipid phenotype. After conditioning on variants individually associated with the 289 corresponding lipid phenotype, we identified two regions at LDLR still significantly associated with both total cholesterol 290 and LDL-C although these regions included both intronic and exonic variants (Supplementary Table 12). LDLR intron 1, 291 which encodes LDLR-AS1 (LDLR antisense RNA 1) on the minus strand, had suggestive evidence for association with TC 292 (p-value 3.17x10 -6 ) with -2.76 mg/dL reduction in TC. A prior study identify that a common variant (rs6511720, MAF 0.11) 293 in LDLR intron 1 is associated with increased LDLR expression in a luciferase assay and reduction in LDL-C 32 . When 294 adjusting for rs6511720, the significance improved (p-value 1.43x10 -8 ) with -3.35 mg/dL reduction in TC. 295 For dynamic window scanning of the genome, we implemented the SCANG method 33 . The SCANG procedure 296 accounts for multiple testing by controlling the genome-wide error rate (GWER) at 0.1 33 . In the dynamic window-based 297 workflow, STAAR-O detected 51 regions significantly associated with at least one lipid phenotype after conditioning on 298 known variants (Supplementary Table 13). Most of the regions mapped to known Mendelian lipid genes, including LCAT 299 (8.7x10 -13 ) for HDL-C, and LDLR (2.4x10 -28 , 7.3x10 -26 ) and PCSK9 (2.9x10 -12 , 5.5x10 -12 ) for LDL-C and TC, respectively. 300 Exon 4 aggregates of LDLR were specifically associated with 20 mg/dL increase in LDL-C. PCSK9 Exon2-Intron2 region 301 spanning chr1:55043782-55045960 had significantly reduced LDL-C by 6 mg/dL (p-value = 3x10 -13 ), and the effect 302 persisted even with only Intron 2 rare variants of PCSK9 (-5 mg/dL, p-value = 2x10 -8 ). Strikingly, in secondary analyses, 303 we found evidence for very large effects for rare variants in LDLR Introns 2 and 3 (+21 mg/dL, p-value = 7x10 -4 ) and 304 LDLR Introns 16 and 17 (+17 mg/dL, p-value = 0.02), similar to rare coding LDLR mutations. While 32 of the significant 305 dynamic windows also included exonic regions, there were also several dynamic windows significantly independently 306 associated with lipids not containing exonic regions. For example, four non-coding windows (two overlapping) at 2p24.1, 307 which harbors the Mendelian APOB gene, were significantly associated with LDL-C. Intronic non-coding regions were 308 associated with both LDL-C and TC -associated windows at LPAL2-LPA-SLC22A3; for example LPAL2 Intron 3 was 309 associated with a 3.7 mg/dL increase in TC. Non-coding TC-associated significant dynamic windows were near 310 TOMM40/APOE. One rare variant signal observed was at TOMM40 Intron 6, where the 'poly-T' variant in this region is on 311 the APOE4 haplotype and influences expressivity for Alzheimer's disease age-of-onset 34,35 . For HDL-C, we identified 312 significant non-coding windows at an intergenic region near LPL and CD36 Intron 4. In the generation of the 313 spontaneously hypertensive rat model, the deletion of intron 4 in Cd36 with resultant Cd36 deficiency has been mapped to 314 defective fatty acid metabolism in this model 36 . Several regions significant in SCANG were not even nominally significant 315 in burden association analyses indicating the likelihood of causal variants with bidirectional effects. 316 Several gene-centric non-coding aggregates associated with lipids near known monogenic lipid genes but mapped 317 to another gene at the locus via annotations. Therefore, we performed downstream conditional analyses adjusting the 318 gene-centric non-coding results for rare coding variants (MAF<1%) within known lipid monogenic genes (Supplementary 319 Table 14). When accounting for both common and rare coding variants at the nearby familial hypercholesterolemia LDLR 320 gene, SPC24-enhancer DHS was significantly associated with total cholesterol (p-value= 3.01x10 -11 ) and with suggestive 321 evidence for LDL-C (p-value= 1.57x10 -06 ). In a similarly adjusted model, LDLR-enhancer-DHS showed a strong 322 association with TC (p-value 5.18x10 -12 ). When adjusting for known common variants as well as rare coding variants in 323 PCSK9, both PCSK9-enhancer DHS and PCSK9-promoter DHS were significantly associated with total cholesterol. (Fig.  324 4, Supplementary Fig. 5). Through this procedure, CETP UTR retained significance for its independent association with 325 HDL-C as well as the putatively novel gene EHD3-promoter DHS association with TC. However, the non-coding gene-326 centric APOC3 and APOE associations were rendered non-significant for HDL-C and TC, respectively. 327 Since we cannot rule out the possibility of reduced power for genome-wide rare variant analyses, we leveraged 328 current knowledge of 22 Mendelian lipid genes for more focused exploratory analyses 14 . We validated most genes in rare 329 variant coding analyses. The genes with the strongest coding signals typically had at least nominal evidence of gene-330 centric non-coding rare variant associations (Supplementary Table 15, Supplementary Fig. 6). When rare coding 331 variants were introduced into the model, the evidence for non-coding rare variant associations were largely unchanged. 332 Our findings expanding the currently described genetic basis for hypercholesterolemia to include rare non-coding variation 333 at LDLR and PCSK9 (Fig. 5). by the availability of WGS data, we identify new putative loci associated with lipids in non-Europeans. Furthermore, our 348 study enabled the discovery of several novel alleles at known loci, with richly distinct allelic heterogeneity across ancestry 349 groups. For example, HDL-C-raising CETP locus variants linked to CETP gene expression were only associated with 350 LDL-C reduction among those of African ancestry. While all pharmacologic CETP inhibitors increase HDL-C, only those 351 that decrease LDL-C also reduce cardiovascular disease risk 27,28,30,29 . Given the contribution of genetic differences, 352 clinical trials with more diverse samples would show insights. 353 Our study now provides increasingly robust evidence for a rare non-coding variant model for complex traits. Our 354 rare non-coding variant associations in both gene-centric and sliding window models were largely restricted to the introns 355 of Mendelian lipid genes with prior robust rare coding variant support consistent with biologic plausibility 44 . Rare intronic 356 variants, often impacting splicing, have been previously implicated in afflicted Mendelian families or small exceptional 357 case series, often through candidate gene approaches 45,46,47,48 . We discovered one example of a rare non-coding signal 358 without prior rare coding support -i.e., EHD3. We obtained estimates of phenotypic effect using burden tests. For most 359 regions, even nominal significance was not detected using burden testing indicating the likelihood of variants with 360 bidirectional effects further complicating clinical interpretation. When burden signals were detected, observed effects were 361 typically larger than common non-coding variants and less than rare coding variants, with the exception of LDLR, 362 consistent with whole genome mutational constraint models 49,50,51 . 363 The detection of independent rare non-coding variant signals has remained elusive largely due to limited sample 364 sizes with requisite WGS and limitations in the interpretation of rare non-coding variation functional consequence. 365 Previously, we used annotated functional non-coding sequence in 16,324 TOPMed participants, and found that rare non-366 coding gene regions associated with lipid levels, but they were not independent of individually associated single 367 variants 14 . Using STAAR, we observed putative rare non-coding variant associations for lipids independent of individual 368 variants associated with lipids in TOPMed. variants causing familial hypercholesterolemia due to altered splicing, which we now observe in our unbiased population-377 based WGS study 56,57 . A WGS approach to lipid disorders, particularly for familial hypercholesterolemia, will markedly 378 improve the diagnostic yield beyond existing limited approaches. 379 Our dynamic window approach may also improve the clinical curation of exonic variants. Among the data used to 380 curate exonic variants is the use of in silico functional prediction tools 58  Our study has important limitations. First, while our study is large for a WGS study by contemporary standards, it is 389 dwarfed by existing GWAS datasets limiting power for novel discovery. Nevertheless, by using WGS in diverse ancestries, 390 we can study hundreds of millions new variants. Second, prediction of rare non-coding variation consequence to prioritize 391 causal variants remains a challenge thereby limiting power 65 . The striking difference for most STAAR and burden results 392 also highlights bidirectional effects for rare non-coding variants within the same region and further challenges for clinical 393 utility. Third, given the paucity of multi-ancestral WGS datasets with lipids, our analyses are largely restricted to TOPMed. 394 For single variant associations, we pursued TOPMed-imputed GWAS datasets but were limited by the lack of ancestral 395 diversity. As TOPMed is a consortium of multiple different cohorts, we demonstrate consistencies by cohort. Furthermore, 396 rare variant non-coding signals were largely restricted to regions with rare variant coding signals supporting biological 397 plausibility. 398 In conclusion, using WGS and lipids among 66,329 ancestrally diverse individuals we expand the catalog of alleles 399 associated with lipids, including allelic heterogeneity at known loci and locus heterogeneity by ancestry. We characterize 400 the common, rare coding, and rare non-coding variant model for lipids. Lastly, we now demonstrate a monogenic-   The primary outcomes in this study included LDL cholesterol (LDL-C), HDL cholesterol (HDL-C), total cholesterol 441 (TC) and triglycerides (TG) phenotypes. LDL-C was either directly measured or calculated by the Friedewald equation 442 when triglycerides were <400 mg/dL. Given the average effect of lipid lowering-medicines, when lipid-lowering medicines 443 were present, we adjusted the total cholesterol by dividing by 0.8 and LDL-C by dividing by 0.7, as previously done 14 . 444 Triglycerides remained natural log transformed for analysis. Fasting status was accounted for with an indicator variable. 445 We harmonized the phenotypes across each cohort 18 and inverse rank normalization of the residuals of each race 446 within each cohort scaled by the standard deviation of the trait and adjusted for covariates 12 . We included covariates such 447 as age, age 2 , sex, PC1-11, study-groups as well as Mendelian founder lipid variants APOB p.R3527Q and APOC3 448 p.R19X for the Amish cohort 7,66,8 . Supplementary Table 1 provides the distributions of each of the four lipid phenotypes 449 by cohort, ancestral groups, and gender. We executed similar steps of phenotype harmonization and normalization for the 450 replication cohorts. Additionally, we adjusted the MGB Biobank for study-center and array-type, and Penn Medicine 451 Biobank for ancestry and BMI in addition to the other common covariates.

Rare variant association test 498
We performed rare variant association (RVA) using the Variant-Set Test for Association using Annotation 499 infoRmation (STAAR) pipeline 13 from STARtopmed R package. STAARpipeline is a regression-based framework that 500 permits adjustment of covariates, population structure, and relatedness by fitting linear and logistic mixed models for 501 quantitative and dichotomous traits 75, . We chose STAAR to leverage the annotation information and associated scores 502 that were available for TOPMed Freeze 8 data to incorporate the analysis of rare non-coding variants from whole genome In the gene-centric workflows, for both coding (within exonic boundaries) and non-coding (promoter: +/-3kb 515 window of transcription starting site (TSS), enhancer: GeneHancer predicted regions) regions, we considered only genes 516 with at least two rare variants (i.e., 18,445 genes in all 22 autosomes). In the region-based workflows, we implemented two 517 protocols: 1) a 'sliding window' approach, where we aggregated rare variants within 2-kb sliding windows and with 1-kb 518 overlap length, and 2) a 'dynamic window' approach, where we executed SCANG 33 method and aggregated dynamically 519 variant-sets between 40-300 variants per set, where the method systematically scans the whole genome with overlapping 520 windows of varying sizes. The STAARtopmed R-package implements multiple rare-variants aggregate tests including 521 SKAT 81 , Burden 82 and ACAT 83 and integrates them as STAAR-O 13 . We performed gene-centric and region-based rare 522 variant tests using annotated GDS files of TOPMed. 523 We completed aggregate tests as three-step process. In the first step, we fitted a null model using glmmkin() 524 function in STAARtopmed. The null model was fitted for each of the four lipid phenotypes adjusted for all covariates and 525 relatedness except the genotype of interest. In the second step, we ran genome wide gene-centric and region-based rare-526 variant aggregate tests. The third step directed conditional analyses, where the results were adjusted for previously 527 known significantly lipid-associated (i.e., p < 5x10 -8 in external datasets) individual variants from GWAS Catalog 84 and 528 Million Veterans Program (MVP) 15 GWAS summary statistics. To obtain effect estimates of significant aggregate sets, we 529 associated the cumulative genotypes (binary scores) based on the variants forming the aggregates and used Glmm.Wald 530 test from GMMAT R package 75 (version 1.3.1). For significantly-associated window-based rare variant aggregations, we 531 trimmed the exonic variants and estimated the effects with only non-coding variants. 532

CETP gene expression and lipid trait colocalization 534
We studied the correlation of LDL-C and HDL-C effects with eQTL effects at chromosome 16q13, which includes 535 CETP. We downloaded GTEx eQTL build 38 (version8) data for Liver, Adipose Subcutaneous and Adipose Visceral 536 (Omentum) tissues from GTEx Portal on 16/APR/2020 85 . We selected eQTLs with nominal significance (p-value<0.05) 537 and utilized the eQTL-gene pairs with the most significant p-values. Genes with at least 5 eQTLs were selected for the 538 colocalization analysis. We selected variants with a suggestive significance (p-value < 5x10 -7 ) for LDL-C or HDL-C effects 539 within 500 kb of the lead locus variant. We performed Pearson correlation tests between the lipid effect estimates and

Fig. 5 1078
Influence of common and rare variants with hypercholesterolemia. In addition to monogenic contributions from rare 1079 variants in Mendelian hypercholesterolemia genes, multiple genome-wide significant LDL-C-associated common variants 1080 also yield a polygenic basis for hypercholesterolemia. In the present work, we now identify rare non-coding variants in 1081 proximity of Mendelian hypercholesterolemia genes, specifically LDLR and PCSK9, that also contribute to the genetic 1082 basis of hypercholesterolemia. 1083 LDL-C -Low-Density Lipoprotein Cholesterol 1084