From random to predictive: a context-specific interaction framework improves selection of drug protein-protein interactions for unknown drug pathways

With high drug attrition, interaction network methods are increasingly attractive as quick and inexpensive ways to predict drug safety and efficacy effects when a drug pathway is unknown. However, these methods suffer from high false positive rates when selecting drug phenotypic effects, and their performance is often no better than random (AUROC ~0.5), which limits their use in regulatory and industrial decision making. In contrast to many network engineering approaches that apply mathematical thresholds to discover phenotype associations, we hypothesized that interaction networks associated with true positive drug phenotypes are context specific. We tested this hypothesis on 16 designated medical event (DME) phenotypes, a subset of adverse events that are of utmost concern in FDA review, using a novel data set extracted from drug labels. We demonstrated that context-specific interactions (CSIs) distinguished true from false positive DMEs with a 50% improvement over non-context-specific approaches (AUROC 0.77 compared to 0.51). By reducing false positives, CSI analysis has the potential to advance network techniques to influence decision making in regulatory and industry settings.

Author summary

Drugs bind proteins that interact with multiple downstream proteins, and these protein networks are responsible for drug efficacy and safety. Protein interaction network methods predict drug effects by aggregating information about proteins around drug-binding protein targets. However, many frameworks exist for identifying proteins relevant to a drug's effect. We consider three frameworks for selecting these proteins and show increased performance from a context-specific approach on selecting proteins relevant to severe drug side effects. The context-specific approach leverages the idea that the proteins responsible for a drug side effect are specific to each side effect. By discovering the relevant proteins, we can better understand downstream effects of drugs and better anticipate side effects for new drugs in development. Further, we focus on designated medical events, a subset of the most severe drug side effects that are high priority for regulatory review.

PathFX to identify networks for all 1,136 drugs and investigated where PathFX identified a true positive (a network association between a drug and a DME on the drug's label) and a false positive (a network association to a DME not listed on the drug label). The distributions of these p-values, both raw and normalized, overlap (Supplemental Figure 2), suggesting that a simple statistical test of enrichment is insufficient for separating true positives and true negatives. Not surprisingly, the area under the receiver operating characteristic curve (AUROC) is 0.51 (Figure 1C).
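To make the AUROC comparison concrete, the following is a minimal sketch of how an AUROC can be computed from two sets of enrichment p-values. The p-values are invented for illustration and are not taken from the study; the function name `auroc` and the choice of -log10(p) as the ranking score are assumptions of this sketch, not part of PathFX.

```python
import math


def auroc(scores_pos, scores_neg):
    """Rank-based AUROC: the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count as 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical enrichment p-values (illustrative only, not study data).
tp_pvalues = [1e-4, 3e-3, 2e-2, 4e-2]  # associations to DMEs on the label
fp_pvalues = [2e-4, 1e-3, 3e-2, 5e-2]  # associations to DMEs not on the label

# A smaller p-value means a stronger association, so rank by -log10(p).
tp_scores = [-math.log10(p) for p in tp_pvalues]
fp_scores = [-math.log10(p) for p in fp_pvalues]

overlap_auc = auroc(tp_scores, fp_scores)
print(f"AUROC for overlapping p-value distributions: {overlap_auc:.3f}")
```

When the two p-value distributions overlap heavily, as in this toy input, the AUROC stays close to 0.5, which is the behavior described above for the enrichment-only approach.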

We next investigated a simple distance metric for separating true and false positives (Figure 1A). For this investigation, we modified PathFX from the original published form (Supplemental Figure 1). Specifically, the original PathFX algorithm relied on an empirically derived path-score threshold to minimize common biases of network algorithms, including hub bias (a gene/protein has high connectivity because it is well studied) and annotation bias (a phenotype is associated with many network genes/proteins because it is heavily studied). We considered this path score to be a sufficient proxy for interaction path distance, and so we created modified versions of PathFX using non-optimal distances (e.g., PathFX_dist0.9, PathFX_dist0.8). We reanalyzed our 1,136-drug set using each of these distance algorithms and investigated how relaxing the path score value affected true and false positive rates. At distances of 0.82-0.99, we were unable to generate a full ROC curve (Figure 1C). This is likely because increasing the interaction path distance can only yield more true positives if there are more genes associated with the DME phenotype of interest. We discovered that modifying the path score threshold did not improve the ability to detect true positive associations to DME-associated genes.
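The threshold sweep can be illustrated with a small sketch. Everything here is hypothetical: the drug names, gene names, path scores, and the decision rule (a drug is called positive when any DME-associated gene is reachable at or above the path-score threshold) are assumptions chosen for illustration and are not PathFX's actual implementation. The sketch also assumes that a higher path score corresponds to a shorter interaction distance.

```python
def call_positive(network, dme_genes, threshold):
    """Call a drug positive if any DME-associated gene appears in its
    network with a path score at or above the threshold (assumed rule)."""
    return any(score >= threshold and gene in dme_genes
               for gene, score in network)


def rates_at_threshold(networks, labels, dme_genes, threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for drug, network in networks.items():
        predicted = call_positive(network, dme_genes, threshold)
        if labels[drug]:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical networks: drug -> list of (gene, path score).
networks = {
    "drug_a": [("G1", 0.99), ("G2", 0.85)],
    "drug_b": [("G3", 0.95), ("G1", 0.80)],
    "drug_c": [("G4", 0.99), ("G2", 0.90)],
    "drug_d": [("G5", 0.97)],
}
dme_genes = {"G1", "G2"}  # genes annotated to the DME (hypothetical)
labels = {"drug_a": True, "drug_b": True, "drug_c": False, "drug_d": False}

# Relaxing the threshold admits more distant proteins into each network.
sweep = {t: rates_at_threshold(networks, labels, dme_genes, t)
         for t in (0.99, 0.9, 0.8)}
for threshold, (tpr, fpr) in sweep.items():
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

In this toy example, relaxing the threshold recovers an additional true positive only at the cost of a new false positive, which mirrors the trade-off reported for the distance-based variants.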

Context-specific interactions increase ability to discern true from false positive DME associations

Much of biology is context dependent, and many pathway investigations have used disease-specific pathways to uncover target candidates for therapeutic interventions. We hypothesized that each DME may result from association to a DME-specific pathway and that a better separator of true and false positives could be the specific network genes/proteins supporting an association to a DME phenotype. To test this hypothesis, we tested multiple machine learning and multivariate approaches to distinguish network proteins associated with true positives and true negatives for each DME phenotype. We performed nested cross-validation to select among random forests, logistic regression, and decision trees and used the F1 statistic to discover that these methods were comparable in performance (Figure 1A, Supplemental Figure 3, Supplemental Table 1). We selected a simple logistic regression because it was the most straightforward method for interpreting if and how network genes/proteins were associated with each DME of interest. Indeed, using a logistic regression model combined with networks discovered for DME-associated drugs increased AUROC values by 50% over p-value (AUROC 0.77 compared to 0.51) or distance methods (Figure 1C). Performance varied for each DME because a separate logistic regression model was required for each DME phenotype (Supplemental Figure 4).
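The model-selection step can be sketched with scikit-learn as follows. This is a minimal illustration of nested cross-validation scored by F1, not the authors' pipeline: the feature matrix is synthetic (rows stand in for drugs, columns for the presence or absence of a network protein), and the hyperparameter grids, fold counts, and label rule are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for one DME: 120 drugs x 20 network proteins.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(120, 20))
# Hypothetical label rule: positive if at least two of three proteins are present.
y = (X[:, 0] + X[:, 1] + X[:, 2] >= 2).astype(int)

# Candidate model families with small, illustrative hyperparameter grids.
candidates = {
    "random_forest": (RandomForestClassifier(n_estimators=50, random_state=0),
                      {"max_depth": [3, None]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0),
                      {"max_depth": [3, None]}),
}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

nested_f1 = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="f1", cv=inner)
    scores = cross_val_score(search, X, y, cv=outer, scoring="f1")
    nested_f1[name] = float(scores.mean())

for name, score in nested_f1.items():
    print(f"{name}: mean nested-CV F1 = {score:.2f}")
```

Because hyperparameters are tuned only on the inner folds, the outer-fold F1 gives a less optimistic basis for comparing model families. In a per-DME setting, a loop of this kind would be repeated once for each DME phenotype.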

CSIs are further attractive for their interpretability. For instance, logistic regression feature importance scores highlight network proteins, both drug-binding and downstream of drug-binding proteins, that are associated with positive and negative drugs for each DME (example for edema shown in Figure 2, other feature importance scores in Supplemental File 1). We overlaid feature importance scores on a merged network for edema to visualize them in the context of drug protein-protein interaction networks (Figure 4). In the tabular results and merged network image, both drug-binding and downstream networked intermediate proteins have high feature importance scores, suggesting that downstream interactions (in addition to specific drug-binding targets) could contribute to drug-induced DMEs.
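One common way to obtain such feature importance scores is to read the fitted coefficients of a logistic regression, where the sign indicates association with DME-positive or DME-negative drugs. The sketch below assumes that approach; the protein names, the synthetic data, and the planted associations are hypothetical and are not drawn from the edema analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic presence/absence matrix: 400 drugs x 6 hypothetical network proteins.
rng = np.random.default_rng(0)
proteins = [f"P{i}" for i in range(6)]
X = rng.integers(0, 2, size=(400, len(proteins)))

# Planted signal (an assumption of this sketch): a drug is DME-positive when
# P0 is in its network and P1 is not.
y = ((X[:, 0] == 1) & (X[:, 1] == 0)).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank proteins by coefficient: positive values point toward DME-positive
# drugs, negative values toward DME-negative drugs.
ranked = sorted(zip(proteins, model.coef_[0]),
                key=lambda pair: pair[1], reverse=True)
for protein, coefficient in ranked:
    print(f"{protein}: {coefficient:+.2f}")
```

Coefficients are directly comparable here only because every feature is a 0/1 indicator on the same scale; features on different scales would need standardizing before ranking.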

Protein-protein interaction network methods are increasingly used for identifying phenotypes associated with drug-binding proteins; however, network methods are not sufficiently validated to have translational impact. Here we considered different network selection paradigms for their ability to discern true from false positive drug associations to designated medical events (DMEs). Statistical enrichment is a tractable and relatively easy method to implement because it requires only the selection of a p-value threshold for considering a phenotype as "positive". However, we discovered that statistical enrichment was unable to separate true positives from true negatives. Distance-based metrics are another attractive and easily implemented approach for discovering associations between a drug's targets and DME-associated genes. However, we were unable to universally apply a distance-based metric that correctly identified true positives without increasing false positives. Further, interaction distances at high path score thresholds include few to no downstream interactions in the network, and these truncated networks can be considered synonymous with analyzing only the drug's targets. An inability to detect DME associations using only drug targets further motivates the use of network methods for DME detection. We discovered that multivariate and machine learning techniques, specifically a simple logistic regression model, could identify network proteins for each DME and that these interaction-based classifiers could separate true positives and true negatives across DMEs. To build further validation and support for network methods to be used more broadly in drug discovery, our results emphasize the importance of leveraging a context-specific paradigm. Indeed, the main contribution of this work is advancing the paradigm of context-specific analysis and emphasizing the role that context-specific interaction "mining" could have in giving protein network methods greater utility in industrial and regulatory decision making.

The relative success of CSI-mining is not entirely surprising given that disease-specific pathway investigations have successfully identified candidate therapeutic targets; however, the results highlight several hypotheses related to advancing network methods to have greater translational impact. In this analysis of DME-associated pathways, it was possible that DME-positive and DME-negative drugs converged on the same pathway proteins but had different effects on pathway activation or deactivation. For instance, convergence on