Predicting protein structures and simulating protein folding motions are two of the most important problems in computational biology today. Modern folding simulation methods rely on a scoring function, which attempts to distinguish the native structure (the most energetically stable 3D structure) from one or more non-native structures. Decoy databases are collections of non-native structures that are widely used to test and verify these scoring functions. We present a method to evaluate and improve the quality of decoy databases by adding novel structures and/or removing redundant structures. We test our approach on decoy databases of varying size and type and show significant improvement across a variety of metrics. Most of the improvement comes from the addition of novel structures, indicating that our improved databases contain more informative structures that are more likely to fool scoring functions. This work can aid the development and testing of better scoring functions, which, in turn, will improve the quality of protein folding simulations.
We apply our methods to existing decoy sets and show that our algorithms are able to generate sets with lower energies and more diverse structures that are more likely to fool scoring functions of protein folding algorithms. All decoy sets were obtained from the Decoys 'R' Us database and are listed in the following table.
Because our methods improve existing decoy sets, we first develop strategies for analyzing the quality of decoy sets. We present several quantitative metrics to compare decoy sets and describe how their values are calculated in the experiments.
Z-Score - The z-score (or standard score) indicates the number of standard deviations between the native structure energy and the average energy of a decoy set.
MinDist - The minimum distance metric measures the average minimum distance from each decoy structure to any other decoy structure in the set.
Improvement - Given an original decoy set and an improved decoy set, the improvement score returns the change in z-score per sample between the two sets.
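The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. It makes several assumptions that the text does not state: the z-score uses the population standard deviation, the distance between structures is supplied as a precomputed symmetric matrix (the distance measure itself is left open), and the "per sample" normalization of the improvement score divides the magnitude of the z-score change by the number of structures added or removed. The function names are hypothetical.

```python
from statistics import mean, pstdev


def z_score(native_energy, decoy_energies):
    """Number of standard deviations between the native energy and the decoy mean."""
    mu = mean(decoy_energies)
    sigma = pstdev(decoy_energies)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("decoy energies have zero variance")
    return (native_energy - mu) / sigma


def min_dist(dist):
    """Average, over all decoys, of the distance to the nearest other decoy.

    dist is a symmetric pairwise distance matrix given as a list of lists
    (assumption: the caller chooses the structural distance measure).
    """
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two decoys")
    nearest = [min(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    return mean(nearest)


def improvement(z_original, z_improved, n_changed):
    """Magnitude of the z-score change per structure added or removed.

    Assumption: n_changed counts every structure that differs between the sets.
    """
    if n_changed <= 0:
        raise ValueError("n_changed must be positive")
    return abs(z_improved - z_original) / n_changed
```

For example, a native energy of -10.0 against decoy energies of -4.0 and -6.0 gives a decoy mean of -5.0 and a standard deviation of 1.0, so the native structure sits five standard deviations below the decoy mean.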
There are two main phases in the improvement of decoy sets. First, samples are generated on the protein's energy landscape. Second, in the decoy selection phase, some structures are chosen from the original set D to be removed and some are chosen from the sample set S to be added. The original decoy set D and the sample set S can be broken down into four subsets:
The following figure shows the relationship between these four subsets.
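The decoy selection phase can also be illustrated in code. The text does not specify how redundant or novel structures are identified, so the sketch below assumes a simple greedy rule with a hypothetical distance threshold: a decoy is treated as redundant if it lies within the threshold of a decoy already kept, and a sample is treated as novel only if it lies at least the threshold away from everything kept or added so far. The function name, the `distance` callable, and `threshold` are all illustrative assumptions, not the authors' algorithm.

```python
def select_decoys(decoys, samples, distance, threshold):
    """Partition D and S into kept/removed and added/discarded subsets.

    Hypothetical greedy threshold rule, used here only for illustration.
    Returns (kept, removed, added, discarded).
    """
    kept, removed = [], []
    for d in decoys:
        # A decoy is redundant if it is too close to one already kept.
        if all(distance(d, k) >= threshold for k in kept):
            kept.append(d)
        else:
            removed.append(d)

    added, discarded = [], []
    for s in samples:
        # A sample is novel only if it is far from every structure selected so far.
        if all(distance(s, x) >= threshold for x in kept + added):
            added.append(s)
        else:
            discarded.append(s)

    return kept, removed, added, discarded


# Toy usage: structures are stand-in scalars, distance is absolute difference.
kept, removed, added, discarded = select_decoys(
    decoys=[0.0, 0.1, 1.0],
    samples=[1.05, 2.0],
    distance=lambda a, b: abs(a - b),
    threshold=0.5,
)
```

In this toy run the kept decoys and added samples together form the final set, which corresponds to combining the retained part of D with the selected part of S.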
The next three figures summarize the resulting z-score, improvement score, and minimum distance value for each protein studied. For each metric, we show the contribution from each operation: removing redundant decoys (DV), adding new samples (D U SV), and their combination (DV U SV).
When the z-score approaches zero, the native structure energy is harder to distinguish from the energies of the other structures in the set. For every protein in this figure, the z-scores of D and DV are very similar. Thus, simply removing structures does not greatly impact the z-score. However, once we add new structures from our sampling approach (D U SV), the z-score drops drastically and becomes comparable to that of the final set (DV U SV). Therefore, the main contributors to z-score improvement are the structures generated by our sampling approach.
Recall that the improvement score shows the change in z-score per sample between two sets. A higher value indicates that the change (structure addition, removal, or both) has a greater impact on the z-score. This figure displays the improvement scores across all test proteins. We again see that adding structures provides a decoy set of better quality than simply removing redundant structures. Proteins 1ash and 1gdm, which have the smallest original sets, show the largest improvement scores.
The last metric we examine is the minimum distance between neighboring structures, which indicates how varied the structures are. A larger distance signifies greater structural diversity and implies a greater ability to fool different scoring functions. This figure shows how this metric changes for each operation. As expected, when decoys are removed (DV), the minimum distance increases, and when decoys are added (D U SV), the minimum distance decreases. For all proteins studied, the final decoy set (DV U SV) has a smaller minimum distance than the original set (D), yielding a set with greater diversity.
Decoy Database Improvement for Protein Folding, Hsin-Yi (Cindy) Yeh, Aaron Lindsey, Chih-Peng Wu, Shawna Thomas, Nancy M. Amato, Journal of Computational Biology, 22(9):823 - 836, Sep 2015.
Improving Decoy Databases for Protein Folding Algorithms, Aaron Lindsey, Hsin-Yi (Cindy) Yeh, Chih-Peng Wu, Shawna Thomas, Nancy M. Amato, In ACM Conf. on Bioinformatics, Comput. Biology and Health Informatics on Computational Structural Bioinformatics Wkshp., pp. 717 - 724, Newport Beach, CA, Sep 2014.
Improving Decoy Databases for Protein Folding Algorithms, Aaron Lindsey, Hsin-Yi (Cindy) Yeh, Chih-Peng Wu, Shawna Thomas, Nancy M. Amato, In Proc. of 2014 RSS Wkshp. on Robotics Methods for Structural and Dynamic Modeling of Molecular Systems, Berkeley, CA, Jul 2014.
Supported by NSF, KAUST
Project Alumni: Aaron Lindsey