Open Access Publications from the University of California

Towards a reference genome that captures global genetic diversity.

Author(s):
  • Wong, Karen HY
  • Ma, Walfred
  • Wei, Chun-Yu
  • Yeh, Erh-Chan
  • Lin, Wan-Jia
  • Wang, Elin HF
  • Su, Jen-Ping
  • Hsieh, Feng-Jen
  • Kao, Hsiao-Jung
  • Chen, Hsiao-Huei
  • Chow, Stephen K
  • Young, Eleanor
  • Chu, Catherine
  • Poon, Annie
  • Yang, Chi-Fan
  • Lin, Dar-Shong
  • Hu, Yu-Feng
  • Wu, Jer-Yuarn
  • Lee, Ni-Chung
  • Hwu, Wuh-Liang
  • Boffelli, Dario
  • Martin, David
  • Xiao, Ming
  • Kwok, Pui-Yan
  • et al.

The current human reference genome is predominantly derived from a single individual and does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4,781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.
