eScholarship
Open Access Publications from the University of California

This series is automatically populated with publications deposited by UC Riverside Bourns College of Engineering Computer Science and Engineering Department researchers in accordance with the University of California’s open access policies. For more information see Open Access Policy Deposits and the UC Publication Management System.

Balanced Training Sets Improve Deep Learning-Based Prediction of CRISPR sgRNA Activity

(2024)

CRISPR-Cas systems have transformed the field of synthetic biology by providing a versatile method for genome editing. The efficiency of CRISPR systems is largely dependent on the sequence of the constituent sgRNA, necessitating the development of computational methods for designing active sgRNAs. While deep learning-based models have shown promise in predicting sgRNA activity, the accuracy of prediction is primarily governed by the data set used in model training. Here, we trained a convolutional neural network (CNN) model and a large language model (LLM) on balanced and imbalanced data sets generated from CRISPR-Cas12a screening data for the yeast Yarrowia lipolytica and evaluated their ability to predict high- and low-activity sgRNAs. We further tested whether prediction performance can be improved by training on imbalanced data sets augmented with synthetic sgRNAs. Lastly, we demonstrated that adding synthetic sgRNAs to inherently imbalanced CRISPR-Cas9 data sets from Y. lipolytica and Komagataella phaffii leads to improved performance in predicting sgRNA activity, thus underscoring the importance of employing balanced training sets for accurate sgRNA activity prediction.

Insights Into the Evolution, Virulence and Speciation of Babesia MO1 and Babesia divergens Through Multiomics Analyses.

(2024)

Babesiosis, caused by protozoan parasites of the genus Babesia, is an emerging tick-borne disease of significance for both human and animal health. Babesia parasites infect erythrocytes of vertebrate hosts where they develop and multiply rapidly to cause the pathological symptoms associated with the disease. The identification of new Babesia species underscores the ongoing risk of zoonotic pathogens capable of infecting humans, a concern amplified by anthropogenic activities and environmental changes. One such pathogen, Babesia MO1, previously implicated in severe cases of human babesiosis in the United States, was initially considered a subspecies of B. divergens, the predominant agent of human babesiosis in Europe. Here we report comparative multiomics analyses of B. divergens and B. MO1 that offer insight into their biology and evolution. Our analysis shows that despite their highly similar genomic sequences, substantial genetic and genomic divergence occurred throughout their evolution resulting in major differences in gene functions, expression and regulation, replication rates and susceptibility to antiparasitic drugs. Furthermore, both pathogens have evolved distinct classes of multigene families, crucial for their pathogenicity and adaptation to specific mammalian hosts. Leveraging genomic information for B. MO1, B. divergens, and other members of the Babesiidae family within Apicomplexa provides valuable insights into the evolution, diversity, and virulence of these parasites. This knowledge serves as a critical tool in preemptively addressing the emergence and rapid transmission of more virulent strains.

A view of the pan‐genome of domesticated Cowpea (Vigna unguiculata [L.] Walp.)

(2024)

Cowpea, Vigna unguiculata L. Walp., is a diploid warm-season legume of critical importance as both food and fodder in sub-Saharan Africa. This species is also grown in Northern Africa, Europe, Latin America, North America, and East to Southeast Asia. To capture the genomic diversity of domesticates of this important legume, de novo genome assemblies were produced for representatives of six subpopulations of cultivated cowpea identified previously from genotyping of several hundred diverse accessions. In the most complete assembly (IT97K-499-35), 26,026 core and 4963 noncore genes were identified, with 35,436 pan genes when considering all seven accessions. GO terms associated with response to stress and defense response were highly enriched among the noncore genes, while core genes were enriched in terms related to transcription factor activity, and transport and metabolic processes. Over 5 million single nucleotide polymorphisms (SNPs) relative to each assembly and over 40 structural variants >1 Mb in size were identified by comparing genomes. Vu10 was the chromosome with the highest frequency of SNPs, and Vu04 had the most structural variants. Noncore genes harbor a larger proportion of potentially disruptive variants than core genes, including missense, stop gain, and frameshift mutations; this suggests that noncore genes substantially contribute to diversity within domesticated cowpea.

Reverse metabolomics for the discovery of chemical structures from humans

(2024)

Determining the structure and phenotypic context of molecules detected in untargeted metabolomics experiments remains challenging. Here we present reverse metabolomics as a discovery strategy, whereby tandem mass spectrometry spectra acquired from newly synthesized compounds are searched for in public metabolomics datasets to uncover phenotypic associations. To demonstrate the concept, we broadly synthesized and explored multiple classes of metabolites in humans, including N-acyl amides, fatty acid esters of hydroxy fatty acids, bile acid esters and conjugated bile acids. Using repository-scale analysis [1,2], we discovered that some conjugated bile acids are associated with inflammatory bowel disease (IBD). Validation using four distinct human IBD cohorts showed that cholic acids conjugated to Glu, Ile/Leu, Phe, Thr, Trp or Tyr are increased in Crohn's disease. Several of these compounds and related structures affected pathways associated with IBD, such as interferon-γ production in CD4+ T cells [3] and agonism of the pregnane X receptor [4]. Culture of bacteria belonging to the Bifidobacterium, Clostridium and Enterococcus genera produced these bile amidates. Because searching repositories with tandem mass spectrometry spectra has only recently become possible, this reverse metabolomics approach can now be used as a general strategy to discover other molecules from human and animal ecosystems.