eScholarship
Open Access Publications from the University of California

A Functional Survey of the Regulatory Landscape of Estrogen Receptor-Positive Breast Cancer Evolution.

(2024)

Only a handful of somatic alterations have been linked to endocrine therapy resistance in hormone-dependent breast cancer, potentially explaining ∼40% of relapses. Whether other mechanisms underlie the evolution of hormone-dependent breast cancer under adjuvant therapy is currently unknown. In this work, we employ functional genomics to dissect the contribution of cis-regulatory elements (CREs) to cancer evolution by focusing on 12 megabases of noncoding DNA, including clonal enhancers, gene promoters, and boundaries of topologically associating domains. Parallel epigenetic perturbation (CRISPRi) in vitro reveals context-dependent roles for many of these CREs, with a specific impact on dormancy entrance and endocrine therapy resistance. Profiling of CRE somatic alterations in a unique, longitudinal cohort of patients treated with endocrine therapies identifies a limited set of noncoding changes potentially involved in therapy resistance. Overall, our data uncover how endocrine therapies trigger the emergence of transient features which could ultimately be exploited to hinder the adaptive process. Significance: This study shows that cells adapting to endocrine therapies undergo changes in the usage of regulatory regions. Dormant cells are less vulnerable to regulatory perturbation but gain transient dependencies which can be exploited to decrease the formation of dormant persisters.

Origin of biogeographically distinct ecotypes during laboratory evolution.

(2024)

Resource partitioning is central to the incredible productivity of microbial communities, including gigatons in annual methane emissions through syntrophic interactions. Previous work revealed how a sulfate reducer (Desulfovibrio vulgaris, Dv) and a methanogen (Methanococcus maripaludis, Mm) underwent evolutionary diversification in a planktonic context, improving stability, cooperativity, and productivity within 300-1000 generations. Here, we show that mutations in just 15 Dv and 7 Mm genes within a minimal assemblage of this evolved community gave rise to co-existing ecotypes that were spatially enriched within a few days of culturing in a fluidized bed reactor. The spatially segregated communities partitioned resources in the simulated subsurface environment, with greater lactate utilization by attached Dv but partial utilization of resulting H2 by low affinity hydrogenases of Mm in the same phase. The unutilized H2 was scavenged by high affinity hydrogenases of planktonic Mm, producing copious amounts of methane. Our findings show how a few mutations can drive resource partitioning amongst niche-differentiated ecotypes, whose interplay synergistically improves productivity of the entire mutualistic community.

Expression of dehydroshikimate dehydratase in poplar induces transcriptional and metabolic changes in the phenylpropanoid pathway

(2024)

Modification of lignin in feedstocks via genetic engineering aims to reduce biomass recalcitrance to facilitate efficient conversion processes. These improvements can be achieved by expressing exogenous enzymes that interfere with native biosynthetic pathways responsible for the production of the lignin precursors. In planta expression of a bacterial 3-dehydroshikimate dehydratase in poplar trees reduced lignin content and altered the monomer composition, which enabled higher yields of sugars after cell wall polysaccharide hydrolysis. Understanding how plants respond to such genetic modifications at the transcriptional and metabolic levels is needed to facilitate further improvement and field deployment. In this work, we acquired fundamental knowledge on lignin-modified poplar expressing 3-dehydroshikimate dehydratase using RNA-seq and metabolomics. The data clearly demonstrate that changes in gene expression and metabolite abundance can occur in a strict spatiotemporal fashion, revealing tissue-specific responses in the xylem, phloem, or periderm. In the poplar line that exhibited the strongest reduction in lignin, we found that 3% of the transcripts had altered expression levels and ~19% of the detected metabolites had differential abundance in the xylem from older stems. The changes affected predominantly the shikimate and phenylpropanoid pathways as well as secondary cell wall metabolism, and resulted in significant accumulation of hydroxybenzoates derived from protocatechuate and salicylate.

High-throughput genetics enables identification of nutrient utilization and accessory energy metabolism genes in a model methanogen.

(2024)

Archaea are widespread in the environment and play fundamental roles in diverse ecosystems; however, characterization of their unique biology requires advanced tools. This is particularly challenging when characterizing gene function. Here, we generate randomly barcoded transposon libraries in the model methanogenic archaeon Methanococcus maripaludis and use high-throughput growth methods to conduct fitness assays (RB-TnSeq) across over 100 unique growth conditions. Using our approach, we identified new genes involved in nutrient utilization and response to oxidative stress. We identified novel genes for the usage of diverse nitrogen sources in M. maripaludis including a putative regulator of alanine deamination and molybdate transporters important for nitrogen fixation. Furthermore, leveraging the fitness data, we inferred that M. maripaludis can utilize additional nitrogen sources including ʟ-glutamine, ᴅ-glucuronamide, and adenosine. Under autotrophic growth conditions, we identified a gene encoding a domain of unknown function (DUF166) that is important for fitness and hypothesize that it has an accessory role in carbon dioxide assimilation. Finally, comparing fitness costs of oxygen versus sulfite stress, we identified a previously uncharacterized class of dissimilatory sulfite reductase-like proteins (Dsr-LP; group IIId) that is important during growth in the presence of sulfite. When overexpressed, Dsr-LP conferred sulfite resistance and enabled use of sulfite as the sole sulfur source. The high-throughput approach employed here allowed for generation of a large-scale data set that can be used as a resource to further understand gene function and metabolism in the archaeal domain. Importance: Archaea are widespread in the environment, yet basic aspects of their biology remain underexplored. To address this, we apply randomly barcoded transposon libraries (RB-TnSeq) to the model archaeon Methanococcus maripaludis. RB-TnSeq coupled with high-throughput growth assays across over 100 unique conditions identified roles for previously uncharacterized genes, including several encoding proteins with domains of unknown function (DUFs). We also expand on our understanding of carbon and nitrogen metabolism and characterize a group IIId dissimilatory sulfite reductase-like protein as a functional sulfite reductase. This data set encompasses a wide range of additional conditions including stress, nitrogen fixation, amino acid supplementation, and autotrophy, thus providing an extensive data set for the archaeal community to mine for characterizing additional genes of unknown function.

Nutrient and moisture limitations reveal keystone metabolites linking rhizosphere metabolomes and microbiomes.

(2024)

Plants release a wealth of metabolites into the rhizosphere that can shape the composition and activity of microbial communities in response to environmental stress. The connection between rhizodeposition and rhizosphere microbiome succession has been suggested, particularly under environmental stress conditions, yet definitive evidence is scarce. In this study, we investigated the relationship between rhizosphere chemistry, microbiome dynamics, and abiotic stress in the bioenergy crop switchgrass grown in a marginal soil under nutrient-limited, moisture-limited, and nitrogen (N)-replete, phosphorus (P)-replete, and NP-replete conditions. We combined 16S rRNA amplicon sequencing and LC-MS/MS-based metabolomics to link rhizosphere microbial communities and metabolites. We identified significant changes in rhizosphere metabolite profiles in response to abiotic stress and linked them to changes in microbial communities using network analysis. N-limitation amplified the abundance of aromatic acids, pentoses, and their derivatives in the rhizosphere, and their enhanced availability was linked to the abundance of bacterial lineages from Acidobacteria, Verrucomicrobia, Planctomycetes, and Alphaproteobacteria. Conversely, N-amended conditions increased the availability of N-rich rhizosphere compounds, which coincided with proliferation of Actinobacteria. Treatments with contrasting N availability differed greatly in the abundance of potential keystone metabolites; serotonin and ectoine were particularly abundant in N-replete soils, while chlorogenic, cinnamic, and glucuronic acids were enriched in N-limited soils. Serotonin, the keystone metabolite we identified with the largest number of links to microbial taxa, significantly affected root architecture and growth of rhizosphere microorganisms, highlighting its potential to shape microbial community and mediate rhizosphere plant-microbe interactions.

An ontology-based knowledge graph for representing interactions involving RNA molecules

(2024)

The "RNA world" represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs tailored to each patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are constantly produced and available from public repositories, they are scattered across different databases and a centralized, uniform, and semantically consistent representation of the "RNA world" is still lacking. We propose RNA-KG, a knowledge graph (KG) encompassing biological knowledge about RNAs gathered from more than 60 public databases, integrating functional relationships with genes, proteins, and chemicals and ontologically grounded biomedical concepts. To develop RNA-KG, we first identified, pre-processed, and characterized each data source; next, we built a meta-graph that provides an ontological description of the KG by representing all the bio-molecular entities and medical concepts of interest in this domain, as well as the types of interactions connecting them. Finally, we leveraged an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and also queried by a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph provides further insights into the characteristics of the "RNA world". RNA-KG can be both directly explored and visualized, and/or analyzed by applying computational methods to infer bio-medical knowledge from its heterogeneous nodes and edges. The resource can be easily updated with new experimental data, and specific views of the overall KG can be extracted according to the bio-medical problem to be studied.

Putative rhamnogalacturonan-II glycosyltransferase identified through callus gene editing which bypasses embryo lethality.

(2024)

Rhamnogalacturonan II (RG-II) is a structurally complex and conserved domain of the pectin present in the primary cell walls of vascular plants. Borate cross-linking of RG-II is required for plants to grow and develop normally. Mutations that alter RG-II structure also affect cross-linking and are lethal or severely impair growth. Thus, few genes involved in RG-II synthesis have been identified. Here, we developed a method to generate viable loss-of-function Arabidopsis (Arabidopsis thaliana) mutants in callus tissue via CRISPR/Cas9-mediated gene editing. We combined this with a candidate gene approach to characterize the male gametophyte defective 2 (MGP2) gene that encodes a putative family GT29 glycosyltransferase. Plants homozygous for this mutation do not survive. We showed that in the callus mutant cell walls, RG-II does not cross-link normally because it lacks 3-deoxy-D-manno-octulosonic acid (Kdo) and thus cannot form the α-L-Rhap-(1→5)-α-D-kdop-(1→sidechain). We suggest that MGP2 encodes an inverting RG-II CMP-β-Kdo transferase (RCKT1). Our discovery provides further insight into the role of sidechains in RG-II dimerization. Our method also provides a viable strategy for further identifying proteins involved in the biosynthesis of RG-II.

Evaluation of the Diagnostic Accuracy of GPT-4 in Five Thousand Rare Disease Cases

(2024)

Large language models (LLMs) have shown great promise in supporting differential diagnosis, but 23 available published studies on diagnostic accuracy evaluated small cohorts (number of cases, 30-422, mean 104) and have evaluated LLM responses subjectively by manual curation (23/23 studies). The performance of LLMs for rare disease diagnosis has not been evaluated systematically. Here, we perform a rigorous and large-scale analysis of the performance of GPT-4 in prioritizing candidate diagnoses, using the largest-ever cohort of rare disease patients. Our computational study used 5267 computational case reports from previously published data. Each case was formatted as a Global Alliance for Genomics and Health (GA4GH) phenopacket, in which clinical anomalies were represented as Human Phenotype Ontology (HPO) terms. We developed software to generate prompts from each phenopacket. Prompts were sent to Generative Pre-trained Transformer 4 (GPT-4), and the rank of the correct diagnosis, if present in the response, was recorded. The mean reciprocal rank (MRR) of the correct diagnosis was 0.24 (with the reciprocal of the MRR corresponding to a rank of 4.2), and the correct diagnosis was placed in rank 1 in 19.2% of the cases, in the first 3 ranks in 28.6%, and in the first 10 ranks in 32.5%. Our study is the largest to be reported to date and provides a realistic estimate of the performance of GPT-4 in rare disease medicine.
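
The ranking metrics reported in this abstract can be illustrated with a minimal Python sketch. It is not the study's software: the function names and the toy ranks are hypothetical, and the sketch assumes the common convention that a case whose correct diagnosis is missing from the response contributes a reciprocal rank of 0.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all cases.

    Each entry is the 1-based rank of the correct diagnosis, or None when
    the correct diagnosis was absent from the response (assumed to
    contribute 0 to the mean).
    """
    if not ranks:
        raise ValueError("at least one case is required")
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


def top_k_fraction(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose correct diagnosis appears within the first k ranks."""
    if not ranks:
        raise ValueError("at least one case is required")
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Hypothetical ranks for eight cases; None marks a missing correct diagnosis.
toy_ranks = [1, 3, None, 2, None, 10, 1, None]

mrr = mean_reciprocal_rank(toy_ranks)
print(f"MRR: {mrr:.3f}")
print(f"top-1: {top_k_fraction(toy_ranks, 1):.3f}")
print(f"top-3: {top_k_fraction(toy_ranks, 3):.3f}")
print(f"top-10: {top_k_fraction(toy_ranks, 10):.3f}")
```

Under this definition, the reported MRR of 0.24 has a reciprocal of 1/0.24 ≈ 4.2, which is the rank quoted in the abstract.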

Gene Set Summarization Using Large Language Models.

(2024)

Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling Large Language Models (LLMs) to use scientific texts directly and avoid reliance on a KB. TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives) uses generative AI to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct retrieval from the model. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for an input gene set. However, LLM-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, in our experiments these methods were rarely able to recapitulate the most precise and informative term from standard enrichment analysis. We also observe minor differences depending on prompt input information, with GO term descriptions leading to higher recall but lower precision. However, newer LLMs perform statistically significantly better than the oldest model across all performance metrics, suggesting that future models may lead to further improvements. Overall, the results are nondeterministic, with minor variations in prompt resulting in radically different term lists, true to the stochastic nature of LLMs. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis; however, they may provide summarization benefits for implicit knowledge integration across extant but unstandardized knowledge, for large sets of features, and where the amount of information is difficult for humans to process.
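
For context, the kind of statistical enrichment analysis this abstract contrasts with LLM summarization is often implemented as a one-sided hypergeometric (Fisher-type) over-representation test. The sketch below is an assumption about a typical approach, not the method used in the paper; the function name and the example numbers are hypothetical, and multiple-testing correction is omitted.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, overlap: int) -> float:
    """One-sided over-representation p-value, P(X >= overlap).

    population: number of genes in the background set
    annotated:  background genes annotated with the term
    sample:     size of the input gene list
    overlap:    genes in the input list annotated with the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    upper = min(annotated, sample)
    if not 0 <= overlap <= upper:
        raise ValueError("overlap must lie between 0 and min(annotated, sample)")
    total = comb(population, sample)
    # Sum the hypergeometric probability mass over the upper tail.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20,000 background genes, 100 annotated with a term,
# and a 50-gene input list of which 5 carry the annotation.
p_value = hypergeom_enrichment_p(20_000, 100, 50, 5)
print(f"over-representation p-value: {p_value:.3e}")
```

A test of this kind yields the scores and p-values that, according to the abstract, LLM-based approaches cannot reliably provide.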