Single-nucleotide conservation state annotation of the SARS-CoV-2 genome

Given the global impact and severity of COVID-19, there is a pressing need for a better understanding of the SARS-CoV-2 genome and its mutations. Multi-strain sequence alignments of coronaviruses (CoV) provide important information for interpreting the genome and its variation. We apply a comparative genomics method, ConsHMM, to multi-strain alignments of CoV to annotate every base of the SARS-CoV-2 genome with a conservation state based on sequence alignment patterns among CoV. The learned conservation states show distinct enrichment patterns for genes, protein domains, and other regions of interest. Certain states are strongly enriched for or depleted of SARS-CoV-2 mutations, and these states can be used to predict potentially consequential mutations. We expect the conservation states to be a resource for interpreting the SARS-CoV-2 genome and mutations.


[Figure: summary of conservation states learned from the Sarbecovirus alignment. The pairing of states with descriptions did not survive extraction.
State labels in this part of the figure: S5, S6, S7, S8, S9, S10, S11, S12, S13, S15, S16, S24.
Descriptions in this part of the figure: all Sarbecoviruses; RaTG13; small subset of close strains including RaTG13; subset of strains including RaTG13 and SARS-CoV; subset of strains corresponding to a subtree in the phylogeny (Supplementary Fig. 1); distinct subsets of strains with varying distance to SARS-CoV-2; most except several distal strains; aligns to most and matches to a subset of Sarbecoviruses; aligns and matches to most Sarbecoviruses; deviation along a branch of the Sarbecovirus phylogeny; enriched for mutations; most enriched for heptad repeat 1; most enriched for heptad repeat 2; most enriched for fusion peptide; most enriched for spike protein's receptor binding motif (RBM); most enriched for gene ORF8.]

[Figure: summary of conservation states learned from the Sarbecovirus alignment. The pairing of states with descriptions did not survive extraction.
State labels in this part of the figure: S1, S2, S3, S4, S14, S17, S18, S19, S20, S21, S22, S23, S26.
Descriptions in this part of the figure: all Sarbecoviruses; most except several strains; most except a distal strain; most except several distal strains; most enriched for mutations; depleted of mutations; most enriched for dimerization-associated region; most enriched for gene S; most enriched for gene E and most depleted of mutations.]

bioRxiv preprint doi: https://doi.org/10.1101/2020.07.13.201277; this version posted November 2, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Supplementary table. Panel titles: Enrichment for protein products in states learned from the Sarbecovirus alignment; Enrichment for protein products in states learned from the vertebrate CoV alignment.
Each cell corresponding to an enrichment value is colored by its value, with blue for 0 (the annotation does not overlap the state), white for 1 (no enrichment, a fold enrichment of 1), and red for the maximum enrichment value in this table. Each cell corresponding to a coverage percentage is colored by its value, with white for the minimum and green for the maximum.
b. Similar to a, except based on states learned from the vertebrate CoV model.
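The color scale in the caption is easier to read with a concrete calculation in mind. The following is a minimal sketch, assuming the common definition of fold enrichment as the fraction of a state's bases that overlap an annotation divided by the fraction of the genome covered by that annotation. The formula is not spelled out in this text, and the function name and toy masks are hypothetical.

```python
def fold_enrichment(state_mask, annotation_mask):
    """Fold enrichment of an annotation within a state.

    Both arguments are equal-length sequences of booleans, one per base.
    Assumed definition (not taken from the source text):
        (overlap / state bases) / (annotation bases / genome length)
    """
    if len(state_mask) != len(annotation_mask):
        raise ValueError("masks must cover the same genome")
    genome_len = len(state_mask)
    state_bases = sum(state_mask)
    annot_bases = sum(annotation_mask)
    if state_bases == 0 or annot_bases == 0:
        raise ValueError("state and annotation must each cover at least one base")
    overlap = sum(s and a for s, a in zip(state_mask, annotation_mask))
    return (overlap / state_bases) / (annot_bases / genome_len)


# Toy 10-base genome (hypothetical data, for illustration only).
state = [True, True] + [False] * 8
gene = [True] * 4 + [False] * 6
elsewhere = [False] * 8 + [True, True]

print(fold_enrichment(state, gene))       # above 1: state lies inside the gene
print(fold_enrichment(state, elsewhere))  # 0.0: annotation does not overlap the state
```

Under this definition a value of 0 means no overlap (blue in the table), a value of 1 means the annotation is as frequent in the state as in the genome overall (white), and larger values indicate enrichment (toward red).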
start  end  gene  confirmed based on human CoV  Gussow et al.
7390  7450  orf1ab
7807  7809  orf1ab
7809  7816  orf1ab  TRUE
7816  7825  orf1ab
7868  7871  orf1ab
7931  7933  orf1ab
8575  8589  orf1ab
8640  8647  orf1ab
8658  8660  orf1ab
8888  8892  orf1ab
8892  8893  orf1ab  TRUE
8893

Each row corresponds to a genomic segment annotated by state V14, which corresponds to bases with high (>0.5) align probabilities for SARS-CoV and MERS-CoV and low (<0.5) align probabilities for common-cold-associated human CoV. The first and second columns give 0-based genomic coordinates (BED format). The third column gives the gene in which the segment is located, or "non-coding" if the segment is not in a gene. The fourth column denotes whether the bases in the segment are confirmed to be unique to pathogenic human CoV and missing in less pathogenic human CoV, based on an alignment of 944 human CoV sequences. The last column denotes whether the segment was identified as an insertion specific to pathogenic strains in a prior study.
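Because the coordinates are 0-based and in BED format, each segment is a half-open interval [start, end): its length is end minus start, and a segment ending at 7809 is contiguous with one starting at 7809. The sketch below illustrates this with a few rows copied from the table. The helper names are hypothetical, and the conversion to 1-based inclusive coordinates follows the general BED convention rather than anything specific to this study.

```python
# A few V14 segments copied from the table: (start, end), 0-based half-open.
segments = [(7390, 7450), (7807, 7809), (7809, 7816), (7816, 7825), (7868, 7871)]


def segment_length(start, end):
    """Number of bases in a BED interval [start, end)."""
    return end - start


def to_one_based(start, end):
    """Convert a BED interval to 1-based inclusive coordinates."""
    return start + 1, end


def merge_adjacent(intervals):
    """Merge intervals that touch or overlap; the input is sorted first."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print([segment_length(s, e) for s, e in segments])  # [60, 2, 7, 9, 3]
print(to_one_based(7390, 7450))                     # (7391, 7450)
print(merge_adjacent(segments))  # [(7390, 7450), (7807, 7825), (7868, 7871)]
```

Merging shows that the three rows spanning 7807 to 7825 form one contiguous 18-base stretch that the state annotation reports as separate segments.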
Supplementary Figure 1. Sarbecoviruses associated with states S12 and S13 in the phylogenetic tree of the 44-way Sarbecovirus alignment. Similar to Fig. 2c, except that strains are colored according to their align and match probabilities in states S12 and S13. The strain colored in blue is the reference SARS-CoV-2 strain of the alignment, SARS-CoV-2/Wuhan-Hu-1. Strains colored in black have match probabilities below 0.5 for both states S12 and S13. Strains colored in red have match probabilities above 0.5 for both states S12 and S13. Strains colored in yellow have match probabilities above 0.5 for state S13 but not for state S12. All strains have high (>0.95) align probabilities for states S12 and S13. States S12 and S13 are likely to correspond to a deviation along the branch preceding all strains colored in black.
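The coloring rule in this caption can be written as a small decision function. This is a sketch of the rule as worded in the caption, not the authors' plotting code; the function name, its parameters, and the example probabilities are hypothetical. The caption does not say how a strain with a high match probability for S12 but not for S13 would be colored, so the sketch returns None in that case.

```python
def strain_color(match_s12, match_s13, is_reference=False, threshold=0.5):
    """Return the caption's color for a strain, or None if the caption is silent."""
    if is_reference:
        return "blue"  # reference strain SARS-CoV-2/Wuhan-Hu-1
    high_s12 = match_s12 > threshold
    high_s13 = match_s13 > threshold
    if high_s12 and high_s13:
        return "red"     # matches in both states
    if high_s13 and not high_s12:
        return "yellow"  # matches in S13 only
    if match_s12 < threshold and match_s13 < threshold:
        return "black"   # matches in neither state
    return None          # case not described in the caption


# Hypothetical match probabilities, for illustration only.
print(strain_color(0.9, 0.9))                     # red
print(strain_color(0.1, 0.9))                     # yellow
print(strain_color(0.1, 0.1))                     # black
print(strain_color(1.0, 1.0, is_reference=True))  # blue
```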