eScholarship
Open Access Publications from the University of California

This series is automatically populated with publications deposited by UC Riverside Bourns College of Engineering Computer Science and Engineering Department researchers in accordance with the University of California’s open access policies. For more information see Open Access Policy Deposits and the UC Publication Management System.

RAmbler resolves complex repeats in human Chromosomes 8, 19, and X

(2025)

Repetitive regions in eukaryotic genomes often contain important functional or regulatory elements. Despite significant algorithmic and technological advancements in genome sequencing and assembly over the past three decades, modern de novo assemblers still struggle to accurately reconstruct highly repetitive regions. In this work, we introduce RAmbler (Repeat Assembler), a reference-guided assembler specialized for the assembly of complex repetitive regions exclusively from PacBio HiFi reads. RAmbler (i) identifies repetitive regions by detecting unusually high coverage regions after mapping HiFi reads to the draft genome assembly, (ii) finds single-copy k-mers in the HiFi reads (i.e., k-mers that are expected to occur only once in the genome), (iii) uses the relative location of single-copy k-mers to barcode each HiFi read, (iv) clusters HiFi reads based on their shared barcodes, (v) generates contigs by assembling the reads in each cluster, and (vi) generates a consensus assembly from the overlap graph of the assembled contigs. Here we show that RAmbler can reconstruct human centromeres and other complex repeats to a quality comparable to the manually curated telomere-to-telomere human genome assembly. Across over 250 synthetic datasets, RAmbler outperforms hifiasm, LJA, HiCanu, and Verkko across various parameters such as repeat lengths, number of repeats, heterozygosity rates, and depth of sequencing.
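Steps (ii) through (iv) above can be illustrated with a toy sketch. This is not RAmbler's implementation: the function names, the count window used to approximate "single-copy", and the union-find clustering with a `min_shared` threshold are all assumptions made for illustration, and the quadratic pairwise comparison would not scale to real HiFi data.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def single_copy_kmers(reads, k, min_count, max_count):
    """Return k-mers found in between min_count and max_count reads.

    Assumption: a k-mer that occurs once in the genome is seen in roughly
    `sequencing depth` reads, so a count window stands in for "single-copy".
    """
    counts = Counter()
    for read in reads:
        counts.update(set(kmers(read, k)))  # count each k-mer once per read
    return {km for km, c in counts.items() if min_count <= c <= max_count}


def barcode(read, k, unique):
    """Barcode a read as the ordered tuple of single-copy k-mers it contains."""
    return tuple(km for km in kmers(read, k) if km in unique)


def cluster_by_shared_barcode(barcodes, min_shared=1):
    """Group read indices whose barcodes share at least min_shared k-mers."""
    parent = list(range(len(barcodes)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = [set(b) for b in barcodes]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if len(sets[i] & sets[j]) >= min_shared:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(barcodes)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())
```

In this sketch, each resulting cluster would then be handed to a separate assembly step, corresponding to step (v); that step and the consensus step (vi) are not shown.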

Predicting differentially methylated cytosines in TET and DNMT3 knockout mutants via a large language model

(2025)

DNA methylation is an epigenetic marker that directly or indirectly regulates several critical cellular processes. While cytosines in mammalian genomes generally maintain stable methylation patterns over time, other cytosines that belong to specific regulatory regions, such as promoters and enhancers, can exhibit dynamic changes. These changes in methylation are driven by a complex cellular machinery, in which the enzymes DNMT3 and TET play key roles. The objective of this study is to design a machine learning model capable of accurately predicting which cytosines have a fluctuating methylation level [hereafter called differentially methylated cytosines (DMCs)] from the surrounding DNA sequence. Here, we introduce L-MAP, a transformer-based large language model that is trained on DNMT3-knockout and TET-knockout data in human and mouse embryonic stem cells. Our extensive experimental results demonstrate the high accuracy of L-MAP in predicting DMCs. Our experiments also explore whether a classifier trained on human knockout data could predict DMCs in the mouse genome (and vice versa), and whether a classifier trained on DNMT3 knockout data could predict DMCs in TET knockouts (and vice versa). L-MAP enables the identification of sequence motifs associated with the enzymatic activity of DNMT3 and TET, which include known motifs but also novel binding sites that could provide new insights into DNA methylation in stem cells. L-MAP is available at https://github.com/ucrbioinfo/dmc_prediction.

Balanced Training Sets Improve Deep Learning-Based Prediction of CRISPR sgRNA Activity

(2024)

CRISPR-Cas systems have transformed the field of synthetic biology by providing a versatile method for genome editing. The efficiency of CRISPR systems is largely dependent on the sequence of the constituent sgRNA, necessitating the development of computational methods for designing active sgRNAs. While deep learning-based models have shown promise in predicting sgRNA activity, the accuracy of prediction is primarily governed by the data set used in model training. Here, we trained a convolutional neural network (CNN) model and a large language model (LLM) on balanced and imbalanced data sets generated from CRISPR-Cas12a screening data for the yeast Yarrowia lipolytica and evaluated their ability to predict high- and low-activity sgRNAs. We further tested whether prediction performance can be improved by training on imbalanced data sets augmented with synthetic sgRNAs. Lastly, we demonstrated that adding synthetic sgRNAs to inherently imbalanced CRISPR-Cas9 data sets from Y. lipolytica and Komagataella phaffii leads to improved performance in predicting sgRNA activity, thus underscoring the importance of employing balanced training sets for accurate sgRNA activity prediction.
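As a rough illustration of what balancing a training set means, the sketch below equalizes class sizes by random oversampling with replacement. This is a stand-in technique, not the synthetic sgRNA generation used in the study, and the function name and the (sample, label) layout are assumptions.

```python
import random


def balance_by_oversampling(samples, labels, seed=0):
    """Return shuffled (sample, label) pairs with equal counts per label.

    Minority classes are padded by drawing existing samples at random,
    with replacement, until every class matches the largest one.
    """
    rng = random.Random(seed)

    # Group samples by their label.
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_label.values())

    balanced = []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)

    rng.shuffle(balanced)
    return balanced
```

Oversampling duplicates existing examples rather than adding new sequence diversity, so in practice it would be applied to the training split only, after the held-out test set has been set aside.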

Insights Into the Evolution, Virulence and Speciation of Babesia MO1 and Babesia divergens Through Multiomics Analyses.

(2024)

Babesiosis, caused by protozoan parasites of the genus Babesia, is an emerging tick-borne disease of significance for both human and animal health. Babesia parasites infect erythrocytes of vertebrate hosts where they develop and multiply rapidly to cause the pathological symptoms associated with the disease. The identification of new Babesia species underscores the ongoing risk of zoonotic pathogens capable of infecting humans, a concern amplified by anthropogenic activities and environmental changes. One such pathogen, Babesia MO1, previously implicated in severe cases of human babesiosis in the United States, was initially considered a subspecies of B. divergens, the predominant agent of human babesiosis in Europe. Here we report comparative multiomics analyses of B. divergens and B. MO1 that offer insight into their biology and evolution. Our analysis shows that despite their highly similar genomic sequences, substantial genetic and genomic divergence occurred throughout their evolution, resulting in major differences in gene functions, expression and regulation, replication rates, and susceptibility to antiparasitic drugs. Furthermore, both pathogens have evolved distinct classes of multigene families, crucial for their pathogenicity and adaptation to specific mammalian hosts. Leveraging genomic information for B. MO1, B. divergens, and other members of the Babesiidae family within Apicomplexa provides valuable insights into the evolution, diversity, and virulence of these parasites. This knowledge serves as a critical tool in preemptively addressing the emergence and rapid transmission of more virulent strains.

A view of the pan‐genome of domesticated Cowpea (Vigna unguiculata [L.] Walp.)

(2024)

Cowpea, Vigna unguiculata L. Walp., is a diploid warm-season legume of critical importance as both food and fodder in sub-Saharan Africa. This species is also grown in Northern Africa, Europe, Latin America, North America, and East to Southeast Asia. To capture the genomic diversity of domesticates of this important legume, de novo genome assemblies were produced for representatives of six subpopulations of cultivated cowpea identified previously from genotyping of several hundred diverse accessions. In the most complete assembly (IT97K-499-35), 26,026 core and 4,963 noncore genes were identified, with 35,436 pan genes when considering all seven accessions. GO terms associated with response to stress and defense response were highly enriched among the noncore genes, while core genes were enriched in terms related to transcription factor activity, and transport and metabolic processes. Over 5 million single nucleotide polymorphisms (SNPs) relative to each assembly and over 40 structural variants >1 Mb in size were identified by comparing genomes. Vu10 was the chromosome with the highest frequency of SNPs, and Vu04 had the most structural variants. Noncore genes harbor a larger proportion of potentially disruptive variants than core genes, including missense, stop gain, and frameshift mutations; this suggests that noncore genes substantially contribute to diversity within domesticated cowpea.
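The core, noncore, and pan gene categories can be sketched with set operations. The sketch assumes the common convention that a core gene is present in every accession and a noncore gene in only some of them; the study's exact criteria may differ, and the function name and gene identifiers used with it are hypothetical.

```python
def classify_pan_genome(gene_sets):
    """Split genes into pan, core, and noncore sets.

    gene_sets maps an accession name to the set of gene (family)
    identifiers found in that accession's assembly.
    """
    all_sets = list(gene_sets.values())
    if not all_sets:
        return set(), set(), set()

    pan = set().union(*all_sets)          # present in at least one accession
    core = set.intersection(*all_sets)    # present in every accession
    noncore = pan - core                  # present in some but not all
    return pan, core, noncore
```

Under this convention, the core and noncore genes reported for a single assembly are the intersections of that assembly's gene set with `core` and `noncore`, which is why their sum for IT97K-499-35 is smaller than the pan-gene total across all accessions.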