A multi-ethnic polygenic risk score is associated with hypertension prevalence and progression throughout adulthood

Background: We used summary statistics from previously-published GWAS of systolic and diastolic BP and of hypertension to construct Polygenic Risk Scores (PRS) to predict hypertension across diverse populations. Methods: We used 10,314 participants of diverse ancestry from BioMe to train trait-specific PRS. We implemented a novel approach to select one of multiple potential PRS based on the same GWAS, by optimizing the coefficient of variation across estimated PRS effect sizes in independent subsets of the training dataset. We combined the 3 selected trait-specific PRS as their unweighted sum, called "PRSsum". We evaluated PRS associations in an independent dataset of 39,035 individuals from eight cohort studies, to select the final, multi-ethnic, HTN-PRS. We estimated its association with prevalent and incident hypertension 4-6 years later. We studied hypertension development within HTN-PRS strata in a longitudinal, six-visit, longitudinal dataset of 3,087 self-identified Black and White participants from the CARDIA study. Finally, we evaluated the HTN-PRS association with clinical outcomes in 40,201 individuals from the MGB Biobank. Results: Compared to other race/ethnic backgrounds, African-Americans had higher average values of the HTN-PRS. The HTN-PRS was associated with prevalent hypertension (OR=2.10, 95% CI [1.99, 2.21], per one standard deviation (SD) of the PRS) across all participants, and in each race/ethnic background, with heterogeneity by background (p-value < 1.0x10-4). The lowest estimated effect size was in African Americans (OR=1.53, 95% CI [1.38, 1.69]). The HTN-PRS was associated with new onset hypertension among individuals with normal (respectively, elevated) BP at baseline: OR=1.71, 95% CI [1.55, 1.91] (OR=1.48, 95% CI [1.27, 1.71]). Association was further observed in age-stratified analysis. In CARDIA, Black participants with high HTN-PRS percentiles developed hypertension earlier than White participants with high HTN-PRS percentiles. The HTN-PRS was significantly associated with increased risk of coronary artery disease (OR=1.12), ischemic stroke (OR=1.15), type 2 diabetes (OR=1.19), and chronic kidney disease (OR=1.12), in the MGB Biobank. Conclusions: The multi-ethnic HTN-PRS is associated with both prevalent and incident hypertension at 4-6 years of follow up across adulthood and is associated with clinical outcomes.

Methods Figure 1 provides an overview of the study. At stage 1, we used summary statistics from multiple GWAS of BP phenotypes to construct PRS in a training dataset (stage 1 dataset) with prevalent hypertension. We selected GWAS that were based on individuals not overlapping with our dataset (published Million Veterans Program GWAS (10), and summary statistics from FinnGen and UK Biobank databases). We used a clump & threshold methodology which requires selection of tuning parameters. Importantly, we evaluated a few methods to select tuning parameters, and an approach to combine PRS across phenotypes. At stage 2, we further assessed the methods for tuning parameter selection and the combined PRS in a stage 2 dataset using prevalent hypertension at a baseline exam. We selected the best performing PRS, and used it in analyses of PRS association with hypertension using data from two visits in the stage 2 dataset. At stage 3, we studied the PRS association with incident hypertension in young African and European adults, using a longitudinal, stage 3 dataset with 6 exams. At stage 4, we tested the association of the PRS with disease status in individuals from the MGB Biobank (stage 4 dataset).
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. as: not having hypertension across the two exams (healthy longitudinal trajectory; may include individuals with normal and with elevated BP, but without change in these categories between the exams); having hypertension in the two exams (severe longitudinal trajectory); BP category worsen between exams, including individuals who had normal BP at the baseline exam, and elevated BP or hypertension at the follow-up exam, or elevated BP followed by hypertension; BP category improved between exams, including individuals who were not treated by antihypertensive medications in any of the exams, and had improved BP category (hypertensive to elevated or normal, or elevated to normal). Individuals treated with antihypertensive medication in either baseline or follow-up exam were never categorized as "improved". We also performed association analysis of the PRS with new onset hypertension at the follow-up exam, focusing, separately, on individuals who had normal BP at baseline and who had elevated BP at baseline.

Whole Genome Sequencing
We used whole genome sequencing data from the Trans-Omics in Precision Medicine (TOPMed) program Freeze 8 release (29). Only variants with minor allele frequency (MAF) ≥ 0.01 were used in this analysis. Information about genome sequencing, variant calling, and quality control procedures is available here https://www.nhlbiwgs.org/topmed-whole-genomesequencing-methods-freeze-8. The TOPMed Data Coordinating Center constructed a sparse kinship matrix estimating recent genetic relatedness where values were set to zero when the genetic relationship was estimated to be more distant than 4 th degree relatedness, and principal components (PCs), using the PC-Relate algorithm (32). Table 1 provides information about hypertension and BP GWAS used to construct PRS. In primary analysis, we used multi-ethnic GWAS: hypertension "pan ancestry" GWAS from UKBB (https://pan.ukbb.broadinstitute.org/), and systolic BP (SBP), and diastolic BP (DBP) from MVP (10). Note that UKBB pan ancestry GWAS are multi-ethnic, however, U.S. minorities are not well represented compared to MVP, and therefore we prioritized MVP as a primary GWAS for SBP and DBP. All these GWAS are based on large sample sizes and have no overlap in participants with the TOPMed BP dataset. In secondary analysis, we also used hypertension GWAS from FinnGen (https://www.finngen.fi/en) and SBP and DBP GWAS from UKBB pan ancestry and BBJ (33), and performed inverse-variance fixed-effects meta-analyses using METAL (34) for each BP trait GWAS (SBP, DBP and hypertension). These are described in Supplementary Table S1.

Published GWAS of BP Phenotype
Secondary analyses were only performed on the training dataset.

Quality control on summary statistics
We filtered SNPs with MAF<0.01 in the discovery GWAS from Table 1 and/or in the dataset comprising of all TOPMed analysis participants (stage 1, 2 and 3 datasets combined), and SNPs that did not pass TOPMed quality filters. We re-coded the variant positions and alleles to match those in the TOPMed data (via the UCSC hg19 to hg38 chain file) using rtracklayer R package version 1.46.0 (35)

PRS construction based on a single GWAS
We constructed PRS using the clump-and-threshold method implemented in the PRSice 2 software version v2.3.3 (22) using each of the GWAS in Table 1. Three tuning parameters are required: p-value threshold, and two clumping parameters. As p-value thresholds, we used 5x10 -8 , 1x10 -7 , 1x10 -6 , 1x10 -5 , 1x10 -4 , 1x10 -3 , 1x10 -2 , 0.1, 0.2, 0.3, 0.4, 0.5. For clumping, we used the entire TOPMed dataset (stage 1, 2, and 3 datasets combined) as a reference panel for Linkage Disequilibrium (LD) and set clumping parameters to R 2 =0.1, 0.2 and 0.3 and distance 250kb, 500kb, and 1000Kb. To standardize PRS while keeping effect sizes comparable in all analyses, we computed the mean and standard deviations (SDs) of each type of PRS based on the complete TOPMed dataset. Then, we used these means and SDs in all subsequent analyses: for a given PRS in any dataset, we subtracted its pre-computed mean and divided it by its precomputed SD. This standardization approach allowed for obtaining comparable effect sizes across stage 1, 2, 3, and 4 datasets, as well as across background-specific and multi-ethnic analyses. Standardization does not influence p-values or prediction measures.

Using stage 1 dataset to select of tuning parameters for PRS construction
To select tuning parameters for a PRS based on a given GWAS from Table 1 (or based on metaanalysis of multiple GWAS), we examined the association of various constructed PRS with prevalent hypertension in the stage 1 dataset. We developed the selected CV-PRS approach, which we describe below, and compare it to two additional widely-used approaches: genomewide significance PRS, selected PVAL-PRS. Both the selected CV-PRS and the selected PVAL-PRS approaches attempt to select one set of tuning parameters (p-value threshold and LD clumping parameters) to construct PRS out of all tuning parameter combination used. The selected CV-PRS approach aims to identify the tuning parameters that yield consistently high PRS effect size in new, independent, datasets. To do this, it minimized the coefficient of variation (CV) computed on the PRS effect size estimates obtained from 5 equal-sized independent subsets of BioMe (this is visualized in Figure 1). Specifically, the CV was estimated as the standard deviation of the five effect (log odds ratio) estimates, divided by the mean of these effect estimates. The selected PVAL-PRS is the PRS with the lowest association p-value in the stage 1 dataset. The genome-wide significance PRS was constructed using SNPs with p-value<5x10 -8 , and fixed clumping parameters: R 2 =0.1 and distance of 1000kb, and otherwise no selection of other parameters.

PRS construction based on multiple GWAS
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021. 10.31.21265717 doi: medRxiv preprint We constructed a PRS called "PRSsum" by summing PRS constructed based on the three BP phenotypes (SBP, DBP, hypertension). The three PRS were summed after each was scaled using the mean and SD computed using the entire TOPMed dataset. We summed non-adaptively, i.e., unweighted sum with PRSsum = PRS1 + PRS2 + PRS3. We generated three PRSsum, based on the three potential strategies to construct PRS: 1) PRSsum based on genome-wide significance

Association analysis of PRS with hypertension in stage 1 and 2 datasets
We used logistic mixed models, as implemented in the GENESIS R package (36) version 2.16.1 to estimate the association between the PRS and hypertension, with relatedness modeled via a sparse kinship matrix. Association analyses were adjusted for age, age 2 , BMI, smoking status (current smoker versus former or never smoker) at the baseline exam, the first 11 PCs, race/ethnicity when evaluating PRS association in a multi-ethnic sample, and time between exams when studying incident associations. We performed two analyses of incident, new onset hypertension in the follow-up exam: one based on individuals who had normal BP at baseline, and second based on individuals who had elevated BP at baseline. We accounted for differences in variances across groups defined according to self-reported race/ethnicity and . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; study by estimating a different residual variance parameter for each (37). We estimated both multi-ethnic and background-specific PRS associations. For the latter, we tested for heterogeneity of estimated effects by race/ethnic background using the Cochran Q test that accounts for covariance between effect estimates across the background groups (38). We evaluated the predictive performance of PRS by calculating the area under the receiver operating curve (AUC) using the AUC function from the pROC R package (39), version 1.16.2.
We used only unrelated individual when calculating AUC. We visualized the unadjusted association of the final HTN-PRS with longitudinal measure of hypertension via a decile plot, demonstrating proportions of individuals in categories of longitudinal BP trajectories in each of the PRS deciles. In secondary analysis, to benchmark the effect of the HTN-PRS against known hypertension risk factors, we compared standardized effect size estimates of the HTN-PRS, BMI, and smoking status, from both the prevalent and incident hypertension analyses.

Development of new onset hypertension in young adulthood by PRS levels
Stage 3 dataset consisted of n=1,388 self-identified Black and n=1,699 self-identified White young adults from CARDIA. Follow up started on average at age 25 (minimum age of participants at baseline = 17). We used 15 years of follow up available on the dbGaP repository (40). We generated the HTN-PRS for each of CARDIA participant, removed related individuals (degree 3 or higher) and assigned individuals to strata defined by <10 percentile of the PRS, 10-50, 50-90, and >90 percentiles. We first computed these strata across all CARDIA individuals.
Next, because there was only a single Black individual in the bottom stratum and only a single White individual in the top stratum, we also defined strata within Black and White groups . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint separately. Next, we fit generalized linear mixed models (GLMM) with random intercept for each participant within strata in the combined and background-specific analyses. The outcome was hypertension, defined as before, and the exposure variables were sex, 11

Association of the HTN-PRS with outcomes in the MGB Biobank
We tested the association of the HTN-PRS with hypertension (another form of replication), coronary artery disease (CAD), ischemic stroke, type 2 diabetes, and chronic kidney disease (CKD), in the MGB Biobank (stage 4 dataset). We also tested the association of the HTN-PRS with obesity because the pan-UKBB GWAS summary statistics that we used for constructing one of the PRS used by the HTN-PRS was from an analysis not adjusted to BMI. We used n=40,201 unrelated individuals with relevant phenotypes. For all outcomes other than CKD we used "curated disease populations" defined by a phenotyping algorithm that uses ICD-9 codes and natural language processing (41). For CKD we used a single term referring to having a health care system encounter due to CKD ("reason to visit" is CKD). We used logistic regression . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint adjusted for 10 PCs, current age, sex, race/ethnicity. The main analysis was multi-ethnic, and we also tested race/ethnic background-specific associations, though sample sizes were small in non-EA groups. Comprehensive description of the MGB Biobank methods is provided in the Supplementary Information. Based on each GWAS, we selected PRS using three criteria: Genome-wide Significant PRS; Selected CV-PRS; Selected PVAL-PRS. Supplementary Figure S1 describes the association of each PRS with hypertension in the stage 1 dataset. Selected CV-PRS had the highest AUC compared to other PRS. Supplementary Figure S2 further reports these associations for the secondary GWAS as well, showing that they mostly performed less well than the primary GWAS, with the exception of the PRS based on the UKBB+ICBP EA GWAS (9), in which all TOPMed EA individuals participated. The meta-analysis of all available independent GWAS performed less well than the primary GWAS. Table S4 reports the clumping parameters and number of SNPs for each of the primary and secondary GWAS and each selection criterion. PRSsum based on selected CV-PRS had the strongest association with hypertension (OR=2.10, 95% CI [1.99, 2.21], p-value <1x10 -100 , AUC=0.76). Based on these results, we move forward with PRSsum based on Selected CV-PRS for analysis of incident hypertension. Figure 4 shows the distributions of Selected CV-PRS based on each GWAS and their combined PRSsum. For all PRS, the AA group tended to have the highest PRS values. We computed the correlation between PRS and stratified by race/ethnic background as described in supplementary S5. As . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

PRS associations with baseline hypertension in the stage 2 dataset
The copyright holder for this preprint this version posted November 1, 2021.  Figure S3 shows similar data stratified by race/ethnic background and demonstrates generally similar patterns across backgrounds, except for AsA, who are also the group of the smallest sample size (n=794), and therefore there is higher uncertainty in results for this group. Supplementary Figure S4 visualizes similar data stratified by age decades at baseline (≤20, 21-30, 31-40, … 71-80, >80). We observed longitudinal associations of the HTN-PRS with BP category is most age groups (age 31 to age 80), with most the pronounced associations from ages 41-70, for which each severity category is well represented in the data.

HTN-PRS association with prevalent and incident hypertension across race/ethnic
backgrounds . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint individuals with elevated hypertension in the 21-40 and 51-60 age groups were not statistically significant. This could be due to low sample sizes (see figure for more details).
Supplementary Figure S8 provides a comparison of effect sizes and predictive performance measured by AUC of the multi-ethnic PRS, BMI, and current smoking, in prevalent and incident hypertension analyses. The PRS is comparable to BMI, and both perform better than current smoking.

PRS Association with development of hypertension in young Black and White adults
We estimated trajectories of hypertension development as second order polynomial functions of age within strata defined by quantiles of the PRS. As shown in Figure 6, in the combined analysis of both Black and White CARDIA participants, trajectories of hypertension risk are separated between the PRS-defined strata, with individuals in strata defined by higher PRS values consistently develop higher hypertension risk compared to those in lower PRS strata.
Note that Black individuals obtain higher PRS values compared to White individuals and vice versa. Looking at the race-defined strata, this pattern is seen much more clearly in the Blacks,  Figure S9 in the Supplementary Materials describes the association of the multi-ethnic hypertension PRS with hypertension, ischemic stroke, CAD, type 2 diabetes, CKD, and obesity, in the MGB Biobank. In multi-ethnic analysis, the PRS was associated with hypertension (OR=1.45, p-value< 9x10 -100 ), as well as with all other outcomes (OR =1.1-1.4 for all outcomes, p-value<0.05). Association of the HTN-PRS with obesity is likely because the pan-ancestry UKBB GWAS of hypertension, which we used, was not adjusted for BMI, and/or due to residual effects of BMI that were not fully accounted for by the BMI adjustment in the MVP GWAS.

Discussion
We developed a HTN-PRS based on multi-ethnic GWAS for SBP, DBP and assessed its association in a multi-ethnic TOPMed dataset. A BioMe stage 1 dataset was used to select optimal tuning parameters for PRS based on each GWAS using three approaches: the novel Selected CV-PRS approach, Selected PVAL-PRS, and genome-wide significant PRS. We further proposed to combine BP phenotypes PRS based on GWAS of different phenotypes using the PRSsum approach: an unweighted sum of the separate phenotypes' PRS. This final HTN-PRS, PRSsum based on Selected CV-PRS, was associated with hypertension prevalence in the independent stage 2 dataset, as well as with longitudinal categories of BP trajectories across race/ethnic backgrounds. In analysis stratified by age decade, the association of the PRS with . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. Recently, a study in Finnish Europeans from FinnGen (26), studied the use of BP PRS to predict longitudinal and lifetime risk of hypertension. The PRS were highly associated hypertension and with cardiovascular disease (CVD) risk, underscoring the potential of PRS to predict hypertension and stratify individuals for intervention to potentially reduce CVD risk. Here, we addressed a similar problem while focusing on a multi-ethnic population and on 4-6 years from between two exams. The distribution of the various constructed PRS, including the final one, PRSsum based on Selected CV-PRS, differed across race/ethnic backgrounds. This is expected, because PRS are sums of alleles, which have different distributions (defined by allele frequencies) across genetic ancestries, and therefore, also race/ethnic background, as these generally have different ancestry admixture. While we expect PRS values in the upper decile to be associated with higher risk of hypertension across all race/ethnicities, a natural question is how to define individuals as "at risk". An "at risk" classification may use a specific cut-off value of the PRS, which may be based on a percentile of the distribution (42). Clearly, this approach cannot be used when distributions differ across race/ethnicities, and moreover, admixed individuals are not accurately represented by any specific distribution. Therefore, more work is . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint needed on approaches that do not require categorization of neither individuals nor of specific PRS values to define risk. Rather, models that take into account multiple risk factors (such as demographic, clinical, and other risk factors, as well as PRS (43)) and allow for flexible association model may be more powerful and equitable, in that they could be applied to more individuals. Notably, we could not use the standard approach of quantifying hypertension risk between individuals in the top HTN-PRS decile to the bottom, in the multi-ethnic analysis: such an approach would separate AAs from others, and therefore will be confounded by other social race/ethnic-related environmental exposures.
Methodologically, while we first constructed various PRS using a standard clump and threshold methodology (22), we used two novel approaches to construct the new HTN-PRS. First, we leveraged stage 1 and stage 2 independent datasets to study how to select the tuning parameters for a PRS, and chose the Selected CV-PRS as a method. This approach attempts to avoid overfitting to a particular dataset by splitting the training data to 5 independent subsets, i.e. with no related individuals between them, and assessing association of the PRS with the outcome in each. The Selected CV-PRS is the one that has consistent, high, effect size across these subsets, represented by smallest CV across them. Other measures can be used rather than effect size, but we chose the latter because of clinical interpretability. In the testing dataset, the effect size was indeed the highest when using the Selected CV-PRS compared with other PRS. A second methodological choice was the construction of PRSsum. PRSsum allowed us to combine information across PRS that were based on GWAS of different phenotypes (SBP, DBP, hypertension). GWAS of different phenotypes cannot be meaningfully meta-analyzed to . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; produce new summary statistics for PRS construction. Another motivation behind it is that high values of PRSsum may capture individuals with either high SBP, DBP, or hypertension PRS values, or combined, meaning that their hypertension may be captured by various underlying genetic components. While PRSsum is an unweighted sum of PRS, a weighted sum (or with adaptive weights) can be constructed as well (44,45). We here used PRSsum without adaptive weights out of concern for overfitting due to a dominant PRS and dominant subgroup, for example, where the PRS based on hypertension GWAS would potentially be upweighted due to a large set of individuals with European ancestry. Future work should study weighted combination of PRS in diverse populations.
Strengths of our study are the use of large multi-ethnic datasets with harmonized genetic and phenotypic data, a range of ages of participants, and longitudinal datasets, allowing us to explore the association of hypertension PRS across adulthood. In stage 3, we focused on one study, CARDIA, with longer follow up of younger individuals, and compared trajectories of hypertension risk by age across Black and White individuals, demonstrating that Blacks develop hypertension early, in agreement with their higher PRS values. Additional longitudinal datasets from underrepresented populations are needed to study long-term trajectories of disease development and usefulness of PRS across the lifespan. Our study also has additional weaknesses. For example, our primary analysis did not use the largest available GWAS of SBP and DBP to construct PRS, namely a meta-analysis of the European ancestry participants UKBB and of the international consortium for BP (9) as most of our study individuals participated in it, and the overlap could lead to overfitting. More work is needed to assess overfitting effects . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint across samples sizes and overlaps of discovery GWAS, training, and testing datasets. Also, to construct PRS we used the clump & threshold methodology, rather than a more modern approach such as LDpred (46) or lassosum (47). We chose to focus on clump & threshold methodology because we think that these other methods still need to be separately studied for diverse populations.
In summary, we applied novel methodology for developing PRS and constructed a PRS predictive of incident hypertension across adulthood in a multi-ethnic population. The PRS was also significantly associated with clinical outcomes. Future work will incorporate rare variants and pleiotropic variants (48) in the construction of PRS, and will investigate models for clinical uses of hypertension PRS in diverse populations.

Data availability statement
TOPMed freeze 8 WGS data are available by application to dbGaP according to the study specific accessions: ARIC: phs001416, BioMe: phs001644, CFS: phs000954, CHS: phs001368, . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021.  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; Program. Admixture mapping identifies genetic regions associated with blood pressure phenotypes in African Americans. PLoS One. 2020 Apr 21;15 (4) . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. In stage 1, we used the stage 1 dataset to select tuning parameters for PRS construction based on GWAS of BP phenotypes. We compared a few methods for tuning parameter selection and constructed PRSsum combining a few phenotype-specific PRS. In stage 2 we evaluate the methods for tuning parameter selection in the stage 2 dataset, and selected one PRS, namely HTN-PRS, to move forward for two-visit longitudinal analysis. In stage 3, we used a longitudinal dataset from CARDIA to study hypertension development in young adults, and compared AAs and EAs. In stage 4, we tested the association of the HTN-PRS with disease outcomes in MGB biobank (stage 4 dataset).
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Figure 2: PRS selection using coefficient of variation workflow
Flowchart describing the selection of PRS according to the coefficient of variation (CV) criterion. The data set is split into five independent sub-datasets (without related individuals between the subsets). An association model is fit on each sub-dataset for each PRS. Each PRS, defined by a unique combination of tuning parameter, has 5 independent effect size estimates. We compute the CV for each such PRS, and select the PRS that minizes the CV.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Figure 3: Association of PRS with prevalent hypertension at baseline in the stage 2 data set
Associations of PRS in stage 2 dataset. PRS were trained for hypertension association using stage 1 dataset. "Genome-wide significant PRS" are PRS constructed using genome-wide significant SNPs in the discovery GWAS, with fixed LD parameters or R 2 =0.1 and distance =1000kb. "Selected CV PRS" are PRS that minimized the coefficient of variation (CV) across effect size (log odds ratio (OR)) estimates in 5 independent subsets of the stage 1 dataset. "Selected PVAL-PRS" are PRS that minimized the association p-value with hypertension in the stage 1 dataset. The figure provides estimated ORs, 95% confidence intervals, p-values from association with hypertension, and Area Under the Receiver Operator Curve (AUC). The PRS association was estimated in a model adjusted for sex, age, age 2 , study site, race/ethnic background, smoking status, BMI, and 11 ancestral principal components.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ;     The forest plot provides the association of the HTN-PRS with prevalent and incident hypertension in the stage 2 dataset, and within race/ethnic backgrounds. The top part corresponds to prevalence analysis at the baseline visit, the middle part corresponds to prediction of new onset hypertension in exam 2, among individuals who had normal BP at baseline, and the bottom part corresponds to prediction of new onset hypertension in exam 2, among individuals who had elevated BP at baseline. For each analysis we provide sample size, estimated odds ratio (OR) and 95% confidence interval, p-value, and area under the receiver operating curve (AUC). Heterogeneity of effects across race/ethnic groups was tested using the Cochran's Q test accounting for correlation due to genetic relatedness across groups.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Figure 7: Trajectories of hypertension risk by strata defined by HTN-PRS in young adults from CARDIA
Results from analysis of age-dependent risk of hypertension in young adults from CARDIA (stage 3 dataset). We used generalized linear mixed model to model the risk of hypertension by age within strata defined by quantiles of the HTN-PRS. Analyses were adjusted for age, sex and the first 11 PCs of genetic data. Stratification by PRS quantiles was performed in each presented group: All (combined Black and White participants), Black, and White. The effect of age was modelled using a second degree polynomial. At each point on the curve we provide 95% confidence interval of the effect estimate. In the combined sample, all individuals in the top PRS strata are Black, and 99 % of the individuals in the bottom PRS strata are White.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint  . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021.

( 4.6)
Year between visit (mean (SD)) 4 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021.  . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. For each GWAS, the table describes the characteristics of the selected PRS based on: (1) Genome wide significant SNPs, using SNPs with p-value<5x10 -8 , and fixed clumping parameters: R 2 =0.1 and distance of 1000kb,(2) Optimizing the coefficient of variation across 5 independent subsets of the stage 1 dataset (see Figure 1), and (3) The PRS with the lowest association P-value. We provide clumping criteria (Threshold, R2 and distance) and number of SNP for each GWAS. PRSsum was created by summing the PRS for each criteria.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021.   Figure S1: Association of primary PRS with hypertension in the stage 1 dataset across compared tuning parameter selection criteria For each GWAS, the PRS were selected based on: (1) Genome wide significant SNPs, using SNPs with p-value<5x10 -8 , and fixed clumping parameters: R 2 =0.1 and distance of 1000kb, (2) Optimizing the coefficient of variation across 5 independent subsets of the stage 1 dataset (see Figure 1), and (3) The PRS with the lowest association P-value. We provide OR, 95% confident interval, association P-value and AUC. The PRS associations were estimated in models adjusted for sex, age, age 2 , study site, race/ethnic background, smoking status, BMI, and 11 ancestral principal components.

Supplementary figures
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Figure S2. Association of primary and secondary PRS with hypertension in the BioMe stage 1 dataset across tuning parameter selection criteria For each of the primary and secondary GWAS used, the PRS were selected based on: (1) Genome wide significant SNPs, using SNPs with p-value<5x10 -8 , and fixed clumping parameters: R 2 =0.1 and distance of 1000kb, (2) Optimizing the coefficient of variation across 5 independent subsets of the stage 1 dataset (see Figure 1), and (3) The PRS with the lowest association P-value. We provide OR, 95% confident interval, association p-value and AUC. The PRS associations were estimated in models adjusted for sex, age, age 2 , study site, race/ethnic background, smoking status, BMI, and 11 ancestral principal components.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Figure S4. Distribution of longitudinal categories of BP stratified by age The figure visualizes the distribution of hypertension severity measures stratified by age (age (≤20, 21-30, 31-40, … 71-80, >80). Severity measures from most to least severe: always hypertension (treated and hypertension in both visit), worsen (change from normal to elevated BP or hypertension, or change from elevated to hypertension) improve (change from untreated hypertension at to elevated/normal at follow up, or change from elevated to . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021.  . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Figure S6. Risk of hypertension in individuals in high deciles of the HTN-PRS compared to those in low deciles of the HTN-PRS.
The forest plot provides OR for hypertension at baseline in individuals in the highest versus the lowest decile of the HTN-PRS. Associations were computed in the stage 2 dataset, and are provided in each race/ethnic group separately, because the PRS distribution differ between race/ethnicities. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Figure S7. Age-stratified associations of HTN-PRS with prevalent and incident hypertension The forest plot provides the association of the HTN-PRS with prevalent and incident hypertension in the stage 2 dataset, and within age groups. Incident hypertension analysis was performed within individuals who had normal BP at baseline (normotensive), and within individuals who had elevated BP at baseline (elevated). For each analysis we provide sample size, estimated odds ratio (OR) and 95% confidence interval, p-value, and area under the receiver operating curve (AUC).
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Figure S8: Benchmarking the risk of HTN-PRS against other risk factors The figure compares hypertension risk factors: BMI, current smoking, and HTN-PRS. For each association analysis, the figure provides the within-group standardized effect size estimate (effect size per 1SD increase in the risk factor within the race/ethnic group used) and 95% confidence interval. We also provide the AUC from a model that include covariates (sex, age, age 2 , study site, race/ethnicity in the multi-ethnic model, and 11 PCs) and only the single risk factor of interest (and not the other two risk factors).
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Figure S9: Race/ethnicity stratified associations of the HTN-PRS with outcomes in the MGB Biobank The figure provides the association of HTN-PRS with outcomes in the MGB Biobank, stratified by race/ethnic background groups.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021.

Blood pressure measurements methods:
BioMe operations are fully integrated in clinical care processes, including direct recruitment from clinical sites waiting areas and phlebotomy stations by dedicated BioMe recruiters independent of clinical care providers, prior to or following a clinician standard of care visit.
Recruitment currently occurs at a broad spectrum of over 30 clinical care sites. Information on anthropometrics, demographics, blood pressure values and use of antihypertensive medication . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Ethics statement:
The BioMe cohort was approved by the Institutional Review Board at the Icahn School of Medicine at Mount Sinai. All BioMe participants provided written, informed consent for genomic data sharing.

BioMe acknowledgements:
The Mount Sinai BioMe Biobank has been supported by The Andrea and Charles Bronfman Philanthropies and in part by Federal funds from the NHLBI and NHGRI (U01HG00638001; U01HG007417; X01HL134588). We thank all participants in the Mount Sinai Biobank. We also thank all our recruiters who have assisted and continue to assist in data collection and management and are grateful for the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.

ARIC
The Atherosclerosis Risk in Communities (ARIC) study (dbGaP accession phs000090) is a population-based prospective cohort study of cardiovascular disease sponsored by the NHLBI.  CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint ARIC study has been described in detail previously (3). For this analysis, we used BP measurements from the first and third ARIC exams.

Blood pressure measurements methods:
BP was measured using a standardized Hawksley random-zero mercury column sphygmomanometer with participants in a sitting position after a resting period of 5 minutes.
The size of the cuff was chosen according to the arm circumference. Three sequential recordings for SBP and DBP were obtained; the mean of the last two measurements was used in this analysis, discarding the first reading. Blood pressure lowering medication use was recorded from the medication history

Ethics statement:
The ARIC study has been approved by Institutional Review Boards (IRB) at all participating institutions: University of North Carolina at Chapel Hill IRB, Johns Hopkins University IRB, University of Minnesota IRB, and University of Mississippi Medical Center IRB. Study participants provided written informed consent at all study visits.

ARIC acknowledgements:
The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services (contract numbers HHSN268201700001I, . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The authors thank the staff and participants of the ARIC study for their important contributions.

Blood pressure measurements methods:
Research staff who received central training in blood pressure measurement assessed repeat right-arm seated systolic and diastolic blood pressure levels at baseline with a Hawksley random-zero sphygmomanometer. Means of the repeated blood pressure measurements from the baseline examination were used for these analyses. Blood pressure lowering medication use was recorded from the medication history.

Ethics statement:
All CHS participants provided informed consent, and the study was approved by the Institutional Review Board [or ethics review committee] of University Washington.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. and 71%, respectively) and written informed consent was obtained in each visit.

Blood pressure measurements methods:
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Seated BP was measured on the right arm following 5 minutes rest using a random-zero sphygmomanometer. SBP and DBP were recorded as Phase I and Phase V Korotkoff sounds.
Three measurements were taken at 1 minute intervals with the average of the second and third measurements taken for the BP values.  designs for the three cohorts are summarized elsewhere (6)(7)(8). At each clinic visit, a medical history was obtained, and participants underwent a physical examination. Only study participants consented for genetic and non-genetic data are included. FHS has been approved by the Boston University IRB

Blood pressure measurements methods:
Systolic and diastolic blood pressures were measured twice by a physician on the left arm of the resting and seated participants using a mercury column sphygmomanometer. The blood pressure measurements used in this study were obtained at examination one cycle for all three generation cohorts.

Ethics statement:
The Framingham Heart Study was approved by the Institutional Review Board of the Boston University Medical Center. All study participants provided written informed consent.

FHS acknowledgements:
The Framingham Heart Study (FHS) acknowledges the support of contracts NO1-HC-25195, HHSN268201500001I and 75N92019D00031 from the National Heart, Lung and Blood Institute . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint and grant supplement R01 HL092577-06S1 for this research. We also acknowledge the dedication of the FHS study participants without whom this research would not be possible.

GENOA
The Genetic Epidemiology Network of Arteriopathy (GENOA) study, a part of the Family Blood Pressure Program (FBPP Investigators, 2002), consists of hypertensive sibships that were recruited for linkage and association studies in order to identify genes that influence blood pressure and its target organ damage (Daniels, 2004). In the initial phase of the GENOA study (Phase I: 1996(Phase I: -2001, all members of sibships containing ≥ 2 individuals with essential hypertension clinically diagnosed before age 60 were invited to participate, including both hypertensive and normotensive siblings. In the second phase of the GENOA study (Phase II: 2000-2004), 1,239 non-Hispanic white and 1,482 African American participants were successfully re-recruited to measure potential target organ damage due to hypertension.

Blood pressure measurements methods:
SBP and DBPs were measured using an automated oscillometric BP measurement device with a consistent protocol across the FBPP networks. BP was measured three times on each participant by trained and certified technicians and then averaged for use in this analysis.

Ethics statement:
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint from the baseline sample. Visit 3 has started in 2020 and will last 3 years. In addition to clinic visit, participants are contacted annually to assess clinical outcomes. The study was approved by the Institutional Review Boards at each participating institution and written informed consent was obtained from all participants.

Blood pressure measurements methods:
Blood pressure was measured on the right arm using an OMRON HEM-907 XL (Omron Healthcare, Inc., Lake Forest, IL) automatic sphygmomanometer, with the participant in the seated position and the arm resting. Cuff sizes were determined by measurement of the upper arm circumference. Four cuff sizes were available. Three blood pressures measurements were obtained 1 minute apart following an initial 5-minute rest period. The average of these 3 blood pressure values was used in this analysis. If there were fewer than 3 measurements, all available measurements were averaged. For more details, see (11).

Ethics statement:
This study was approved by the institutional review boards (IRBs) at each field center, where all participants gave written informed consent, and by the Non-Biomedical IRB at the University of

HCHS/SOL acknowledgements:
The Hispanic Community Health Study/Study of Latinos is a collaborative study supported by contracts from the National Heart, Lung, and Blood Institute (NHLBI) to the University of North

JHS
The Jackson Heart Study (dbGaP accession phs000286) is a longitudinal investigation of genetic and environmental risk factors associated with the disproportionate burden of cardiovascular disease in African Americans (12,13). JHS is funded by the NHLBI and the National Institute on Minority Health and Health Disparities (NIMHD). JHS is an expansion of the ARIC study in its Jackson Field Center. At baseline, the JHS recruited 5306 African American residents of the . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Two seated blood pressure measurements were taken using a Hawksley random zero sphygmomanometer and an appropriately sized cuff. BP measurements were calibrated using robust regression to the Omron HEM-907XL device (14). The use of antihypertensive medications was recorded in the medication history.

Ethics statement:
The JHS study was approved by Jackson State University, Tougaloo College, and the University of Mississippi Medical Center IRBs, and all participants provided written informed consent.

JHS acknowledgements:
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint accumulation of large numbers of diverse clinical outcomes, risk factor measurements, medication use, and many other types of data.

Blood pressure measurements methods:
BP was measured by certified staff using standardized procedures and instruments (15). Two BP measures were recorded after 5 minutes rest using a mercury sphygmomanometer. Appropriate cuff bladder size was determined at each visit based on arm circumference. Diastolic BP was taken from the phase V Korotkoff measures. The average of the two measurements, obtained 30 seconds apart, was used in analyses Ethics statement: All WHI participants provided informed consent and the study was approved by the Institutional Review Board (IRB) of the Fred Hutchinson Cancer Research Center.

WHI acknowledgements:
The WHI program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through contracts 75N92021D00001, 75N92021D00002, 75N92021D00003, 75N92021D00004, 75N92021D00005.

MESA
The Multi-Ethnic Study of Atherosclerosis (dbGaP accession phs000209) is a study of the characteristics of subclinical cardiovascular disease (disease detected non-invasively before it . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint has produced clinical signs and symptoms) and the risk factors that predict progression to clinically overt cardiovascular disease or progression of the subclinical disease (16). MESA consisted of a diverse, population-based sample of an initial 6,814 asymptomatic men and women aged 45-84. 38 percent of the recruited participants were white, 28 percent African American, 22 percent Hispanic, and 12 percent Asian, predominantly of Chinese descent.
Participants were recruited from six field centers across the United States: Wake Forest University, Columbia University, Johns Hopkins University, University of Minnesota, Northwestern University and University of California -Los Angeles. Participants are being followed for identification and characterization of cardiovascular disease events, including acute myocardial infarction and other forms of coronary heart disease (CHD), stroke, and congestive heart failure; for cardiovascular disease interventions; and for mortality. The first examination took place over two years, from July 2000 -July 2002. It was followed by five examination periods that were 17-20 months in length. Participants have been contacted every 9 to 12 months throughout the study to assess clinical morbidity and mortality.

Blood pressure measurement methods:
Resting BP was taken three times in the seated position after a five-minute rest using a Dinamap model Pro 100 automated oscillometric sphygmomanometer (Critikon, Tampa, Florida) (17) with the average of the last two measurements recorded and verified.

Ethics statements:
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted November 1, 2021.  . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint

MGB Biobank
Samples, genomic data, and health information were obtained from the Mass General Brigham (MGB) Biobank, a biorepository of consented patient samples at Mass General Brigham.

DNA samples
DNA samples are processed from whole blood that was collected as a dedicated research draw or as a clinical discard. Dedicated research samples are aimed to be processed within four hours of collection. Clinical discards are processed 24+ hours after collection. Whole blood is spun to buffy coat with a centrifuge and the buffy coat is stored in a freezer up to several months. The buffy coat is then extracted to DNA. The DNA is then placed in an ultralow freezer (-80ºC).Each DNA aliquot contains a minimum of 2 ug of DNA. The concentration varies.

Genotyping
Samples have been genotyped using three versions of the biobank SNP array offered by Illumina that is designed to capture the diversity of genetic backgrounds across the globe. The first batch of data was generated on the Multi-Ethnic Genotyping Array (MEGA) array, the first release of this SNP array. The second, third, and fourth batches were generated on the Expanded Multi-Ethnic Genotyping Array (MEGA Ex) array. All remaining data were generated on the Multi-Ethnic Global (MEG) BeadChip.

Imputation
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint Prior to performing imputation, files were converted to VCF format, separated by chromosomes. When multiple probes measured the same genotypes, they were checked for concordance and were set to a missing value if the genotypes did not match. Files were uploaded to the Michigan Imputation Server, and Genotypes were imputed using TOPMed reference panel. Genomic coordinates are provided in GRCh38.
We computed principal component (PC) using PLINK: we pruned the genotype data using a window size of 1000 variants, sliding across the genome with a step size of 250 variants at a time, filtering out any SNPs with LD R 2 >0.1. We used unrelated individuals (3rd degree, identified using PLINK) to compute the loadings for the first 10 PCs.

PRS construction
We constructed PRS using PRSice 2, using the same SNPs as those based on clumping performed on the TOPMed dataset, and otherwise the same methodology. Including, for comparability, when scaling PRS by mean and SD we used the ones estimated on the TOPMed dataset from stage 2.

Curated Disease Populations
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint We used outcomes from "curated disease populations". These phenotypes were developed by the Biobank Portal team using both structured and unstructured electronic medical record (EMR) data and clinical, computational and statistical methods. Natural Language Processing (NLP) was used to extract data from narrative text. Chart reviews by disease experts helped identify features and variables associated with particular phenotypes and were also used to validate results of the algorithms. The process produced robust phenotype algorithms that were evaluated using metrics such as sensitivity, the proportion of true positives correctly identified as such, and positive predictive value (PPV), the proportion of individuals classified as cases by the algorithm (18). The high throughput phenotyping algorithm is as follows: 1. Create an initial phenotype definition using ICD-9 diagnosis codes. 4. Create a gold-standard patient set for training the method. Query coded EMR data for the set of patients having at least one ICD-9 code for the phenotype. Apply a statistical sampling algorithm to select a random subset of those patients for full chart review. A clinical expert performs a full chart review to classify the patients as positive or negative for the phenotype.
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint 5. Train a statistical model that incorporates all features in the definition to predict the presence or absence of the phenotype against the gold-standard patient set. 6. Apply the trained model to the entire Biobank Population.
The following table provides the prevalence and AUC of the phenotypes that we used (AUC was computed based on step 4 above).

Association analyses with disease outcomes
We used the outcomes above in association analyses. We adjusted for current age, sex, genotype batch, and the first 10 PCs. We only used unrelated individuals in the analysis and therefore used standard logistic regression models.

Ethics statement
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint All Biobank subjects have provided their consent to join the Partners Biobank, which includes agreeing to provide a blood sample linked to the electronic medical record. Subjects also agree to be recontacted by the Partners Biobank staff as needed.

Acknowledgements
We thank Mass General Brigham Biobank for providing samples, genomic data, and health information data. Phenotypes in the Framingham Heart Study" (phs000974.v4.p3) was performed at the Broad Institute Genomics Platform (3R01HL092577-06S1, 3U54HG003067-12S2). DNA extraction for "NHLBI TOPMed: Genetic Epidemiology Network of Arteriopathy" (phs001345.v2.p1) was performed at the Mayo Clinic Genotyping Core and Genome sequencing was performed at the Broad Institute Genomics Platform (HHSN268201500014C) and the Northwest Genomics Center (3R01HL055673-18S1). Genome sequencing for "NHLBI TOPMed: The Jackson Heart Study" (phs000964.v1.p1) was performed at the Northwest Genomics Center (HHSN268201100037C).

TOPMed and CCDG acknowledgements
. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted November 1, 2021. ; https://doi.org/10.1101/2021.10.31.21265717 doi: medRxiv preprint biological samples and data for TOPMed. The Genome Sequencing Program (GSP) was funded by the National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI), and the National Eye Institute (NEI). The GSP Coordinating Center (U24 HG008956) contributed to cross-program scientific initiatives and provided logistical and general study coordination. The Centers for Common Disease Genomics (CCDG) program was supported by NHGRI and NHLBI, and whole genome sequencing was performed at the Baylor College of Medicine Human Genome Sequencing Center (UM1 HG008898 and R01HL059367).