Novel Machine Learning Approach for Protein Structure Prediction
- Author(s): Nagata, Ken
- Advisor(s): Baldi, Pierre
- et al.
Side-chain prediction and residue-residue contact prediction are sub-problems of protein structure prediction. Both are important for structure prediction itself and for other applications.
We have developed a new algorithm, SIDEpro, for side-chain prediction, in which an energy function for each rotamer in a structure is computed additively over pairs of contacting atoms. A family of 156 neural networks, indexed by amino acid and contacting atom type, computes these rotamer energies as a function of atomic contact distances. Although direct energy targets are not available for training, the neural networks can still be optimized by converting the energies to probabilities and optimizing these probabilities using Markov Chain Monte Carlo methods. The resulting predictor, SIDEpro, makes predictions by first setting the rotamer probabilities for each residue from a backbone-dependent rotamer library, then iteratively updating these probabilities using the trained neural networks. Once the probabilities have converged, each side chain is set to its highest-probability rotamer. Finally, a post-processing clash-reduction step is applied to the models. On the 379-protein test set of SCWRL4, the state-of-the-art method for rapid side-chain prediction, SIDEpro achieves a significant improvement in speed and a modest, but statistically significant, improvement in accuracy: its accuracy (χ1 86.14%, χ1+2 74.15%) is slightly better than that of SCWRL4-FRM (χ1 85.43%, χ1+2 73.47%), and it is 7.0 times faster. SIDEpro can also predict the side chains of proteins containing non-standard amino acids, including 15 of the most frequently observed post-translational modifications (PTMs) in the Protein Data Bank and all types of phosphorylation. For PTMs, the χ1 and χ1+2 accuracies are comparable with those obtained for the precursor amino acid, as are the RMSD values for the atoms shared with the precursor amino acid. In addition, SIDEpro can accommodate any PTM or unnatural amino acid, thus providing a flexible prediction system for high-throughput modeling of proteins beyond the standard amino acids.
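The iterative update of rotamer probabilities can be illustrated with a minimal sketch. It is not SIDEpro's implementation: the Boltzmann-style conversion from energies to probabilities, the mean-field form of the update, the `temperature` parameter, and the hand-written `pair_energy` tables (which stand in for the trained neural networks) are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def energies_to_probs(energies, temperature=1.0):
    """Assumed Boltzmann-style conversion: lower energy gives higher probability."""
    z = -np.asarray(energies, dtype=float) / temperature
    z -= z.max()                      # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def iterate_rotamer_probs(prior, pair_energy, n_iter=100, tol=1e-8):
    """
    prior:       list of 1-D arrays; prior[i][r] is the initial (library)
                 probability of rotamer r at residue i.
    pair_energy: dict mapping (i, j), i < j, to a 2-D array E[r, s] giving the
                 energy of rotamer r at residue i against rotamer s at residue j.
                 These toy tables replace the neural-network energies.
    Returns the converged probabilities and the highest-probability rotamers.
    """
    probs = [np.asarray(p, dtype=float).copy() for p in prior]
    for _ in range(n_iter):
        max_change = 0.0
        for i in range(len(probs)):
            # Expected energy of each rotamer at residue i, averaged over the
            # current rotamer probabilities of its contacting residues.
            expected = np.zeros_like(probs[i])
            for (a, b), E in pair_energy.items():
                E = np.asarray(E, dtype=float)
                if a == i:
                    expected += E @ probs[b]
                elif b == i:
                    expected += E.T @ probs[a]
            updated = energies_to_probs(expected)
            max_change = max(max_change, float(np.abs(updated - probs[i]).max()))
            probs[i] = updated
        if max_change < tol:          # probabilities have converged
            break
    picks = [int(np.argmax(p)) for p in probs]
    return probs, picks

# Toy example: two residues with two rotamers each.
prior = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
pair_energy = {(0, 1): np.array([[0.0, 4.0], [4.0, 1.0]])}
probs, picks = iterate_rotamer_probs(prior, pair_energy)
```

In this toy setting the rotamer pair with the lowest pairwise energy ends up with the highest probability at both residues. A clash-reduction pass, as described above, would follow in a full pipeline and is not shown.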
We have also developed a novel machine learning approach for contact map prediction that proceeds in three steps of increasing resolution. First, we use 2D recursive neural networks to predict coarse contacts and orientations between secondary structure elements. Second, we use an energy-based method to align secondary structure elements and predict contact probabilities between residues in contacting alpha-helices or strands. Third, we use a deep neural network architecture to organize and progressively refine the prediction of contacts, integrating information over both space and time. We train the architecture on a large set of non-redundant proteins and test it on a large set of non-homologous domains, as well as on the sets of protein domains used for contact prediction in the two most recent CASP experiments, CASP8 and CASP9. For long-range contacts, the accuracy of the new CMAPpro predictor is close to 30%, a significant increase over existing approaches.
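To make the long-range accuracy figure concrete, the following sketch shows one common way such a number can be computed from a predicted contact-probability matrix. The abstract does not state the evaluation protocol, so the 8 Å distance threshold, the minimum sequence separation of 24 residues, and the scoring of the top L/5 predictions are assumptions borrowed from widely used contact-prediction conventions; the function names are hypothetical.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Boolean contact map from an (L, 3) array of per-residue coordinates.
    The 8 Angstrom threshold is an assumed convention."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist < threshold

def long_range_accuracy(pred_probs, true_contacts, min_sep=24, top_fraction=0.2):
    """Fraction of true contacts among the highest-scoring long-range pairs.
    min_sep and top_fraction (top L/5) are assumed conventions."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    true_contacts = np.asarray(true_contacts, dtype=bool)
    L = true_contacts.shape[0]
    # Residue pairs (i, j) with j - i >= min_sep.
    i, j = np.triu_indices(L, k=min_sep)
    if i.size == 0:
        return float("nan")           # protein too short for long-range pairs
    k = max(1, int(L * top_fraction))
    order = np.argsort(-pred_probs[i, j])[:k]
    return float(true_contacts[i[order], j[order]].mean())

# Toy example: a 30-residue protein with six long-range contacts.
L = 30
truth = np.zeros((L, L), dtype=bool)
for a, b in [(0, 24), (0, 25), (1, 26), (2, 27), (3, 28), (5, 29)]:
    truth[a, b] = truth[b, a] = True
perfect = long_range_accuracy(truth.astype(float), truth)
inverted = long_range_accuracy(1.0 - truth.astype(float), truth)
```

A predictor that ranks every true long-range contact first scores 1.0 under this measure, while one that ranks only non-contacts first scores 0.0.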
Both SIDEpro and CMAPpro are part of the SCRATCH suite of predictors and are available from http://scratch.proteomics.ics.uci.edu/.