
Project Personnel: HsinYi (Cindy) Yeh, ChihPeng Wu, Shawna Thomas, Nancy Amato
Predicting protein structures and simulating protein folding motions are two of the most important problems in computational biology today. Modern folding simulation methods rely on a scoring function, which attempts to distinguish the native structure (the most energetically stable 3D structure) from one or more nonnative structures. Decoy databases are collections of nonnative structures that are widely used to test and verify these scoring functions. We present a method to evaluate and improve the quality of decoy databases by adding novel structures and/or removing redundant structures. We test our approach on decoy databases of varying size and type and show significant improvement across a variety of metrics. Most of the improvement comes from the addition of novel structures, indicating that our improved databases contain more informative structures that are more likely to fool scoring functions. This work can aid the development and testing of better scoring functions, which, in turn, will improve the quality of protein folding simulations.
We apply our methods to existing decoy sets and show that our algorithms are able to generate sets with lower energies and more diverse structures that are more likely to fool scoring functions of protein folding algorithms. All decoy sets were obtained from the Decoys 'R' Us database and are listed in the following table.
Because our methods improve existing decoy sets, we first develop strategies for analyzing the quality of decoy sets. We present several quantitative metrics to compare decoy sets and describe how their values are calculated in the experiments.
ZScore  The zscore (or standard score) indicates the number of standard deviations between the native structure energy and the average energy of a decoy set.
MinDist  The minimum distance metric measures the average minimum distance from each decoy structure to any other decoy structure in the set.
Improvement  Given an original decoy set and an improved decoy set, the improvement score returns the change in zscore per sample between the two sets.
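The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. The function names, the use of the population standard deviation, the sign convention for the zscore (positive when the native energy lies below the decoy mean), and the choice to normalize the improvement score by the number of structures added or removed are all assumptions made for this example. The distance function is supplied by the caller; a structural measure such as RMSD would be a natural choice, but the toy example below uses a simple absolute difference.

```python
from statistics import mean, pstdev


def z_score(native_energy, decoy_energies):
    """Standard deviations between the native energy and the mean decoy energy.

    Assumed sign convention: positive when the native energy is below the mean.
    """
    mu = mean(decoy_energies)
    sigma = pstdev(decoy_energies)  # population std. dev. (an assumption)
    if sigma == 0:
        raise ValueError("decoy energies have zero spread")
    return (mu - native_energy) / sigma


def min_dist(decoys, distance):
    """Average, over all decoys, of the distance to the nearest other decoy."""
    if len(decoys) < 2:
        raise ValueError("need at least two decoys")
    nearest = [
        min(distance(a, b) for j, b in enumerate(decoys) if j != i)
        for i, a in enumerate(decoys)
    ]
    return mean(nearest)


def improvement(z_original, z_improved, num_changed):
    """Change in zscore per structure added or removed (assumed normalization)."""
    if num_changed <= 0:
        raise ValueError("num_changed must be positive")
    return (z_original - z_improved) / num_changed


if __name__ == "__main__":
    # Toy data: energies in arbitrary units, 1D "structures" as placeholders.
    energies = [-10.0, -20.0, -30.0]
    print(z_score(-50.0, energies))
    print(min_dist([0.0, 1.0, 3.0], lambda a, b: abs(a - b)))
    print(improvement(3.0, 1.0, 4))
```

The nearest-neighbor search here is quadratic in the number of decoys, which is adequate for a sketch; a large decoy set would likely call for a spatial index or a precomputed distance matrix.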
There are two main phases in the improvement of decoy sets. First, samples are generated on the protein's energy landscape. Second, in the decoy selection phase, some structures are chosen from the original set D to be removed and some are chosen from the sample set S to be added. The original decoy set D and the sample set S can be broken down into four subsets:
The following figure shows the relationship between these four subsets.
The next three figures summarize the resulting zscore, improvement score, and minimum distance value for each protein studied. For each metric, we show the contribution from each operation (removing redundant decoys, DV, and adding new samples, D U SV) and from their combination (DV U SV).
When the zscore approaches zero, the native structure energy is harder to distinguish from the energies of the other structures in the set. For every protein in this figure, the zscores of D and DV are very similar. Thus, simply removing structures does not greatly impact the zscore. However, once we add new structures from our sampling approach (D U SV), the zscore drops drastically, to values comparable to those of the final set (DV U SV). Therefore, the main contributors to zscore improvement are the structures generated by our sampling approach.
Recall that the improvement score shows the change in zscore per sample between two sets. A higher value indicates that the change (structure addition, removal, or both) has a greater impact on the zscore. This figure displays the improvement scores across all test proteins. We again see that adding structures provides a decoy set of better quality than simply removing redundant structures. Proteins 1ash and 1gdm, which have the smallest original sets, show the largest improvement scores.
The last metric we examine is the minimum distance between neighboring structures, which indicates how varied the structures are. A larger distance signifies greater structural diversity and implies a greater ability to fool different scoring functions. This figure shows how this metric changes for each operation. As expected, when decoys are removed (DV), the minimum distance increases, and when decoys are added (D U SV), the minimum distance decreases. For all proteins studied, the final decoy set (DV U SV) has a larger minimum distance than the original set (D), yielding a set with greater diversity.
Decoy Database Improvement for Protein Folding, HsinYi (Cindy) Yeh, Aaron Lindsey, ChihPeng Wu, Shawna Thomas, Nancy M. Amato, Journal of Computational Biology, 22(9):823-836, Sep 2015.
Improving Decoy Databases for Protein Folding Algorithms, Aaron Lindsey, HsinYi (Cindy) Yeh, ChihPeng Wu, Shawna Thomas, Nancy M. Amato, In ACM Conf. on Bioinformatics, Comput. Biology and Health Informatics on Computational Structural Bioinformatics Wkshp., pp. 717-724, Newport Beach, CA, Sep 2014.
Improving Decoy Databases for Protein Folding Algorithms, Aaron Lindsey, HsinYi (Cindy) Yeh, ChihPeng Wu, Shawna Thomas, Nancy M. Amato, In Proc. of 2014 RSS Wkshp. on Robotics Methods for Structural and Dynamic Modeling of Molecular Systems, Berkeley, CA, Jul 2014.
Supported by NSF, KAUST
Project Alumni: Aaron Lindsey