eScholarship
Open Access Publications from the University of California

An approach to functionally relevant clustering of the protein universe: Active site profile-based clustering of protein structures and sequences.

  • Author(s): Knutson, Stacy T
  • Westwood, Brian M
  • Leuthaeuser, Janelle B
  • Turner, Brandon E
  • Nguyendac, Don
  • Shea, Gabrielle
  • Kumar, Kiran
  • Hayden, Julia D
  • Harper, Angela F
  • Brown, Shoshana D
  • Morris, John H
  • Ferrin, Thomas E
  • Babbitt, Patricia C
  • Fetrow, Jacquelyn S
  • et al.

Published Web Location

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5368075/
No data is associated with this publication.
Abstract

Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow identification of mechanistic determinants: the amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two-Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that uses active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (as curated by the Structure-Function Linkage Database, SFLD) self-identify in DASP2 searches, whereas clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, and each candidate is evaluated with DASP2 to determine whether it self-identifies. If it does, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a member of a functionally relevant group or a singlet. TuLIP is validated on enolase and glutathione transferase structures, two superfamilies well curated by SFLD. Correlation with the SFLD families is strong, although the small number of structures prevents statistically significant analysis. TuLIP-identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, a false negative rate of 4%, and a maximum false positive rate of 4%. F-measure and performance analysis of the enolase search results, together with a comparison to GEMMA and SCI-PHY, demonstrate that TuLIP avoids the over-division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results.
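The control flow of the divisive loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `divisive_cluster`, `split`, and `self_identifies` are hypothetical, and the DASP2 self-identification test is replaced by a stand-in predicate supplied by the caller. The sketch also assumes that the full input set is itself tested as the first candidate, which the abstract does not state. The toy data are 1-D numbers standing in for structures, chosen only to exercise the loop.

```python
from typing import Callable, List, Sequence, Tuple

Cluster = Tuple[float, ...]


def divisive_cluster(
    structures: Sequence[float],
    split: Callable[[Cluster], List[Cluster]],
    self_identifies: Callable[[Cluster], bool],
) -> Tuple[List[Cluster], List[float]]:
    """Divide candidates until each item is in an accepted group or a singlet.

    `split` and `self_identifies` are hypothetical stand-ins: in the paper the
    acceptance test is a DASP2 search, which is not reproduced here.
    """
    accepted: List[Cluster] = []
    singlets: List[float] = []
    pending: List[Cluster] = [tuple(structures)] if structures else []
    while pending:
        candidate = pending.pop()
        if len(candidate) == 1:
            # A lone structure cannot be divided further.
            singlets.append(candidate[0])
        elif self_identifies(candidate):
            # Stand-in for "the cluster self-identifies in a DASP2 search".
            accepted.append(candidate)
        else:
            parts = split(candidate)
            # Guard against a split that makes no progress (would loop forever).
            if len(parts) < 2 or any(len(p) == 0 for p in parts):
                raise ValueError("split must return at least two non-empty parts")
            pending.extend(parts)
    return accepted, singlets


# Toy stand-ins (assumptions for illustration only, unrelated to real profiles).
def toy_self_identifies(cluster: Cluster) -> bool:
    # Accept a cluster when its members lie within a narrow range.
    return max(cluster) - min(cluster) <= 0.5


def toy_split(cluster: Cluster) -> List[Cluster]:
    # Cut the sorted cluster at its largest gap between neighbors.
    s = sorted(cluster)
    cut = max(range(1, len(s)), key=lambda i: s[i] - s[i - 1])
    return [tuple(s[:cut]), tuple(s[cut:])]


groups, singlets = divisive_cluster(
    [1.0, 1.1, 1.2, 5.0, 5.1, 9.0], toy_split, toy_self_identifies
)
```

Passing the acceptance test in as a function keeps the loop independent of how self-identification is actually decided, so a real search-based test could be substituted without changing the control flow.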
