Advanced homology modeling

Published

August 21, 2023

Modified

February 14, 2025

1 From homology modeling to threading

Although we do not intend to describe the evolution of modeling methods in detail, we briefly outline below the origin and transformation of the advanced protocols that have outperformed classical single-template homology modeling over the last three decades. This stepwise evolution is the origin of the revolution brought about by AlphaFold and related protocols, which we will discuss in the next section.

Threading or Fold-recognition methods

As mentioned earlier, the introduction of HMM-based profiles during the first decade of this century greatly improved template detection and protein modeling in the twilight zone, i.e., for proteins with only distant homologs (<25-30% identity) in databases. To exploit the power of HMM searches, those methods naturally evolved into iterative threading methods based on multi-template model construction, implemented in I-TASSER (Roy, Kucukural, and Zhang 2010), Phyre2 (L. A. Kelley et al. 2015), and RosettaCM (Song et al. 2013), among others. These methods are usually referred to as Threading or Fold-recognition methods. Note that the classification of modeling methods is often blurry: the current version of SwissModel and the combination of HHPred and Modeller already rely on HMM profiles for template identification and alignment, so strictly speaking they are also fold-recognition methods.

The two terms are often used interchangeably, although some authors regard Fold-Recognition as any technique that uses structural information in addition to sequence information to identify remote homologies, whereas Threading refers to a more complex modeling process that includes remote homologies and also models pairwise amino acid interactions in the structure. Under that definition, HHPred is a fold-recognition method, and its use together with Modeller could indeed be considered threading.

Figure 1: The idea behind fold-recognition is that instead of comparing sequences, we intend to compare structures. In the Frozen approximation (left), one residue is aligned with the template structure and then we evaluate the probability of the nearby residues in the query sequence being in the same position as their equivalents in the template. On the other hand, Defrost methods use profiles to generate improved alignments that provide better starting points for the energy calculations during the iterative modeling steps. From Lawrence A. Kelley (2009).

The Iterative Threading ASSembly Refinement (I-TASSER) from the Yang Zhang lab is one of the most widely used threading methods and servers. This method was ranked as the No. 1 server for protein structure prediction in the community-wide CASP7, CASP8, CASP9, CASP10, CASP11, CASP12, CASP13, and CASP14 experiments. I-TASSER first generates three-dimensional (3D) atomic models from multiple threading alignments and iterative structural assembly simulations, and these models are then iteratively selected and improved. The quality of the template alignments (and therefore the difficulty of modeling the targets) is judged based on the statistical significance of the best threading alignment, i.e., the Z-score, which is defined as the energy score in standard deviation units relative to the statistical mean of all alignments.
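
The Z-score definition above can be illustrated with a minimal sketch. The function name `alignment_z_scores` and the example scores are hypothetical, and the sign convention (whether a better alignment has a higher or lower raw score) depends on the threading program, so this is not I-TASSER's actual implementation.

```python
from statistics import mean, pstdev


def alignment_z_scores(scores):
    """Express each alignment score in standard deviation units
    relative to the mean of all alignment scores."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("all scores are identical; Z-score is undefined")
    return [(s - mu) / sigma for s in scores]


# Hypothetical raw scores for five threading alignments.
example_scores = [1.0, 2.0, 3.0, 4.0, 10.0]
z = alignment_z_scores(example_scores)
best = max(range(len(z)), key=lambda k: z[k])
print(f"most significant alignment: index {best}, Z = {z[best]:.2f}")
```

An alignment that stands far from the bulk of the distribution gets a large absolute Z-score, which is why the Z-score of the top hit is a convenient proxy for how easy or hard a target is.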

Figure 2: Flowchart of I-TASSER protein structure modeling. From Rigden (2017).

First, I-TASSER uses PSI-BLAST against curated databases to select sequence homologs and generate a sequence profile. That profile is used to predict the secondary structure and generate multiple fragmented models using several programs. The top template hits from each threading program are then selected for the following steps. In the second stage, continuous fragments in threading alignments are excised from the template structures and are used to assemble structural conformations of the sections that aligned well, with the unaligned regions (mainly loops/tails) built by ab initio modeling. The fragment assembly is performed using a modified replica-exchange Monte Carlo simulation technique, which runs several replica simulations in parallel under different conditions that are periodically exchanged. Those simulations consider multiple parameters, including model statistics (stereochemical outliers, H-bonds, hydrophobicity…), spatial restraints and amino acid pairwise contact predictions (see below). In each step, output models are clustered to select the representative ones for the next stage. A final refinement step includes rotamer modeling and filtering out steric clashes.
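
To make the replica-exchange idea concrete, here is a toy sketch on a one-dimensional double-well energy. Everything in it (the `energy` function, the step size, the inverse temperatures) is an illustrative assumption; it shows the general technique, not the modified scheme or the force field that I-TASSER uses.

```python
import math
import random


def energy(x):
    """Toy double-well energy with minima at x = -1 and x = +1
    (a stand-in for a real knowledge-based force field)."""
    return (x * x - 1.0) ** 2


def metropolis_step(x, beta, rng, step=0.5):
    """One Metropolis move at inverse temperature beta."""
    trial = x + rng.uniform(-step, step)
    delta = energy(trial) - energy(x)
    if delta <= 0 or rng.random() < math.exp(-beta * delta):
        return trial
    return x


def swap_probability(beta_i, beta_j, e_i, e_j):
    """Standard replica-exchange acceptance probability for swapping
    the conformations of two replicas."""
    exponent = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)


def replica_exchange(betas, n_sweeps=200, swap_every=10, seed=0):
    """Run one replica per inverse temperature and periodically try to
    exchange conformations between neighbouring temperatures."""
    rng = random.Random(seed)
    states = [rng.uniform(-2.0, 2.0) for _ in betas]
    for sweep in range(1, n_sweeps + 1):
        states = [metropolis_step(x, b, rng) for x, b in zip(states, betas)]
        if sweep % swap_every == 0:
            for i in range(len(betas) - 1):
                p = swap_probability(betas[i], betas[i + 1],
                                     energy(states[i]), energy(states[i + 1]))
                if rng.random() < p:
                    states[i], states[i + 1] = states[i + 1], states[i]
    return states


final_states = replica_exchange(betas=[4.0, 2.0, 1.0, 0.5])
print([round(energy(x), 3) for x in final_states])
```

The hot replicas (small beta) cross energy barriers easily, and the exchanges let the cold replicas inherit those conformations, which is what helps the sampling escape local minima.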

One interesting feature of I-TASSER is that it is integrated within a server with many other applications, including some of the tools that I-TASSER itself uses and other advanced methods based on I-TASSER, such as I-TASSER-MTD for large, multidomain proteins or C-I-TASSER, which implements a deep learning step similar to AlphaFold2 (see next section).

Figure 3: RosettaCM Protocol. (A) Flowchart of the RosettaCM protocol. (B–D) RosettaCM conformational sampling. From Song et al. (2013).

RosettaCM is an advanced homology modeling or threading algorithm by the Baker lab, implemented in the Rosetta software and the Robetta web server. RosettaCM breaks the sequence into fragments that are aligned to a set of selected templates and builds accurate models through a threading process that uses different fragments from each of the templates. Additionally, it uses limited ab initio folding to fill in the residues that could not be assigned during threading. The model is then closed by iterative optimization steps that include Monte Carlo sampling. Finally, an all-atom refinement drives the model towards a free-energy minimum (Song et al. 2013).

Puzzling nomenclature: comparative, homology or ab initio modeling?

De novo or ab initio modeling used to mean modeling a protein without using a template. However, this strict definition was blurred in the 2000s by advanced methods that use fragments. Threading protocols such as RosettaCM and I-TASSER, among others, use fragments that may or may not come from homologous protein structures. Therefore, they cannot be classified strictly as homology modeling, and they are sometimes referred to as comparative or hybrid methods.

Scoring functions in threading and deep-learning protein modeling

In protein modeling, various scoring functions are used to evaluate the similarity of protein structures. As you know, the Root-Mean-Square Deviation (RMSD) measures three-dimensional similarity as the root-mean-square distance between equivalent Cα atoms after structural superposition. However, it is sensitive to outliers, so a few badly placed residues can penalize an otherwise good model. TM-score is a length-normalized alternative to RMSD, ranging from 0 to 1, which is less influenced by outliers.
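
The different sensitivity to outliers can be seen in a small sketch. It assumes that the two Cα traces are already superposed and have a one-to-one residue correspondence; the real TM-score program additionally searches for the superposition that maximizes the score. The function names and the synthetic coordinates are illustrative only.

```python
import math


def rmsd(reference, model):
    """Root-mean-square deviation between equivalent atoms
    (coordinates assumed to be already superposed)."""
    squared = [math.dist(r, m) ** 2 for r, m in zip(reference, model)]
    return math.sqrt(sum(squared) / len(squared))


def tm_score(reference, model):
    """TM-score for one fixed superposition, normalized by the
    length of the reference structure."""
    length = len(reference)
    # Length-dependent distance scale d0; the 0.5 floor is a guard
    # for very short chains, where the formula would turn negative.
    d0 = max(1.24 * (max(length, 16) - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    terms = [1.0 / (1.0 + (math.dist(r, m) / d0) ** 2)
             for r, m in zip(reference, model)]
    return sum(terms) / length


# Synthetic 100-residue trace; the model is identical except for
# one residue displaced by 20 Å.
reference = [(3.8 * i, 0.0, 0.0) for i in range(100)]
model = list(reference)
model[-1] = (reference[-1][0], 20.0, 0.0)

print(f"RMSD = {rmsd(reference, model):.2f} Å")
print(f"TM-score = {tm_score(reference, model):.3f}")
```

A single misplaced residue raises the RMSD to 2 Å, while the TM-score stays above 0.99 because each residue contributes at most 1/L to the total.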

In CASP, the score of the models is based on the Global Distance Test (GDT), often expressed as a percentage between 0 and 100, which measures the fraction of residues that lie within a set distance cutoff of their position in the reference structure. Specifically, the GDT-TS is the average GDT for the 1, 2, 4, and 8 Å cutoffs. Similar to RMSD, the GDT score is length-dependent, as its average score for random structure pairs follows a power-law dependence on protein size. To address this, the GDT-TS Z-score, used in RosettaCM, indicates data quality and dispersion based on mean and standard deviation values. This use of the Z-score, or standard score, is common in statistics and reflects how many standard deviations a raw score is from the mean.
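
As a worked example of the GDT-TS average, the following sketch takes per-residue Cα distances between model and reference for one fixed superposition. This is a simplification: the actual GDT procedure searches many superpositions to maximize the residue count at each cutoff. The function name and the example distances are hypothetical.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average, over the cutoffs, of the percentage of residues whose
    model-to-reference distance is within the cutoff."""
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical Cα deviations (in Å) for a five-residue model.
example = [0.5, 1.5, 3.0, 6.0, 10.0]
print(gdt_ts(example))  # 20%, 40%, 60% and 80% average to 50.0
```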

Finally, the pLDDT, or predicted per-residue lDDT (Mariani et al. 2013), used in AlphaFold and related methods, provides a per-residue, normalized, superposition-free score based on Cα distances, with values ranging from 0 to 100. This scoring can refer to either a single structure or an ensemble, offering detailed insights into protein modeling accuracy. Additionally, a protein region that is naturally highly flexible or intrinsically disordered, and therefore lacks a well-defined structure, will also have a lower pLDDT (Wilson, Choy, and Karttunen 2022).
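
The quantity that the predicted score tries to estimate can be sketched as a Cα-only lDDT computed against a known reference. This is a simplified illustration: the published lDDT works on all atoms and includes stereochemical checks, and AlphaFold predicts the value with a neural network rather than computing it from a reference. The 15 Å inclusion radius and the 0.5, 1, 2 and 4 Å tolerances follow the lDDT definition; the function name is hypothetical.

```python
import math


def lddt_ca(reference, model, radius=15.0,
            thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Per-residue Cα lDDT (0-100): fraction of reference distances
    within `radius` that are preserved in the model, averaged over
    the tolerance thresholds. No superposition is needed."""
    n = len(reference)
    scores = []
    for i in range(n):
        preserved = 0
        total = 0
        for j in range(n):
            if i == j:
                continue
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= radius:
                continue  # only local distances are considered
            diff = abs(d_ref - math.dist(model[i], model[j]))
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
        scores.append(100.0 * preserved / total if total else 0.0)
    return scores


# A rigid translation leaves every internal distance unchanged,
# so all residues keep a perfect score.
reference = [(3.8 * i, 0.0, 0.0) for i in range(10)]
shifted = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in reference]
print(lddt_ca(reference, shifted))
```

Because only internal distances are compared, a displaced loop lowers the score of the residues around it without affecting a well-modeled domain elsewhere in the chain.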

2 From contact maps to pairwise high-res feature maps

A protein contact map illustrates the interactions between all possible pairs of amino acid residues in a protein’s three-dimensional structure. This is displayed as a binary matrix with n rows and columns, where n represents the number of residues in the sequence. In this matrix, the element at position ij is marked as 1 if residues i and j are in contact within the structure. Contact is typically defined as residues being closer than a certain distance threshold, which is 9 Å in the examples shown in Figure 4. The patterns in these maps highlight the differences between motifs and reflect the stretches of secondary structure.
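
A binary contact map as described above can be computed directly from coordinates. The sketch below assumes one representative atom per residue (for instance Cα) given as (x, y, z) tuples; the `min_separation` parameter, which leaves the trivial diagonal empty, is an illustrative addition and not part of the definition in the text.

```python
import math


def contact_map(coords, threshold=9.0, min_separation=1):
    """Return an n x n matrix of 0/1 where element [i][j] is 1 if
    residues i and j are closer than `threshold` Å."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
                cmap[j][i] = 1  # the map is symmetric
    return cmap


# Synthetic extended chain with 3.8 Å between consecutive residues.
chain = [(3.8 * i, 0.0, 0.0) for i in range(6)]
for row in contact_map(chain):
    print("".join("#" if c else "." for c in row))
```

For this straight chain only residues up to two positions apart fall under 9 Å, so the contacts form a narrow band along the diagonal; helices and strand pairings in real structures produce the broader diagonal and off-diagonal stretches seen in such maps.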

Figure 4: Contact-based map of representative proteins. The map represents a matrix of amino acid positions in the protein sequences (on both the X and Y axes), with contacts indicated as blue dots. When a number of consecutive residues in the sequence interact, the dots form diagonal stretches. Maps obtained at http://cmweb.enzim.hu/

Accurate information on residue-residue contacts is sufficient to determine a protein’s fold (Olmea and Valencia 1997). However, using these maps in protein modeling is challenging, as predicting these contacts is not straightforward. The advent of direct-coupling analysis (DCA), which extracts residue coevolution from multiple sequence alignments (MSAs) as shown in Figure 5, has improved contact map predictions. This has facilitated their use in protein folding with methods such as PSICOV (Jones et al. 2012) and Gremlin (Kamisetty, Ovchinnikov, and Baker 2013). Nevertheless, for proteins with few sequence homologs, the predicted contacts are often low in quality, making accurate contact-assisted protein modeling difficult.
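
The idea of reading coevolution from an MSA can be illustrated with mutual information between alignment columns. Note that this is a deliberately simpler stand-in: DCA, PSICOV and Gremlin go further by separating direct couplings from indirect correlations, which plain mutual information cannot do. The toy alignment and the function name are hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an
    MSA given as a list of equal-length aligned sequences."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 1 always change together (A-K or G-R),
# column 2 is fully conserved, column 3 varies independently.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]
for pair in [(0, 1), (0, 2), (0, 3)]:
    print(pair, round(column_mutual_information(toy_msa, *pair), 3))
```

Only the covarying pair (0, 1) gets a non-zero score, which is the signal that coevolution methods interpret as a likely contact. With few sequences in the alignment, the frequency estimates become unreliable, which is one way to see why shallow MSAs yield poor contact predictions.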

Figure 5: Schematic of how co-evolution methods extract information about protein structure from a multiple sequence alignment (MSA). From Bittrich, Schroeder, and Labudde (2019)

Implementation of several layers of information processed by neural network and deep learning methods

Deep learning is a sub-field of machine learning based on artificial neural networks (NNs). Neural networks were initially introduced in the late 1940s and 1950s but gained prominence again in the 2000s with the rise of computational capacities and the use of GPUs. Essentially, an NN uses multiple interconnected layers to transform various inputs, such as MSAs and high-resolution contact maps, into complex features that can then predict intricate outputs like a 3D protein structure. NNs aim to simulate the behavior of the human brain, processing large amounts of data and learning from it. Deep learning utilizes multiple-layer NNs to optimize and refine accuracy.

Figure 6: Illustration of column pair and precision sub-matrix grouping for advanced prediction of contact maps. In the example, Columns 5 and 14 in the first family are aligned to columns 5 and 11 in the second family, respectively, so column pair (5,14) in the first family and the pair (5,11) in the second family are assigned to the same group. Accordingly, the two precision sub-matrices will be assigned to the same group. From Ma et al. (2015).

The next complexity level in contact maps involves applying them to distantly related proteins by comparing sets of DCA from different protein families, sometimes referred to as joint evolutionary coupling analysis (Figure 6). This method requires processing massive amounts of information, which increases computational demands. Hence, the use of trained neural networks and advanced deep-learning methods has significantly enhanced protein modeling capabilities.

In this context, supervised machine learning methods that predict contacts using multilayer neural networks have outperformed DCA methods (Jones et al. 2015; Ma et al. 2015; Wang et al. 2017; Yang et al. 2020). These methods incorporate high-resolution contact maps (Figure 7), containing enriched information that includes not only contacts but also distances and angles, represented in a heatmap-like probability scale.
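
One way to picture such a high-resolution map is as a probability distribution over distance bins for every residue pair, instead of a single 0/1 value. The sketch below assumes a binning of 2 to 20 Å in 0.5 Å steps plus a final "no contact" bin, modelled on the scheme described by Yang et al. (2020); the exact bin order, the function names and the uniform example distribution are illustrative assumptions.

```python
D_MIN, D_MAX, WIDTH = 2.0, 20.0, 0.5
N_DISTANCE_BINS = int((D_MAX - D_MIN) / WIDTH)  # 36 distance bins
NO_CONTACT_BIN = N_DISTANCE_BINS                # extra bin, index 36


def distance_to_bin(distance):
    """Map a distance in Å to its bin index; distances at or beyond
    D_MAX go to the final 'no contact' bin."""
    if distance >= D_MAX:
        return NO_CONTACT_BIN
    if distance < D_MIN:
        return 0  # clamp very short distances into the first bin
    return int((distance - D_MIN) / WIDTH)


def contact_probability(bin_probs, cutoff=8.0):
    """Collapse a per-pair distribution over bins into a classical
    contact probability: total mass of bins lying below `cutoff`."""
    return sum(p for k, p in enumerate(bin_probs[:N_DISTANCE_BINS])
               if D_MIN + WIDTH * (k + 1) <= cutoff)


print(distance_to_bin(8.3))   # falls in the 8.0-8.5 Å bin, index 12
uniform = [1.0 / (N_DISTANCE_BINS + 1)] * (N_DISTANCE_BINS + 1)
print(round(contact_probability(uniform), 3))
```

Summing the probability mass below a cutoff recovers an ordinary contact map, which shows that the binned representation strictly extends the binary one.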

Figure 7: Example of high-resolution contact maps of 6MSP. From Yang et al. (2020)

References

Bittrich, Sebastian, Michael Schroeder, and Dirk Labudde. 2019. “StructureDistiller: Structural relevance scoring identifies the most informative entries of a contact map.” Scientific Reports 9 (1): 18517. https://doi.org/10.1038/s41598-019-55047-4.
Jones, David T., Daniel W. A. Buchan, Domenico Cozzetto, and Massimiliano Pontil. 2012. “PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments.” Bioinformatics (Oxford, England) 28 (2): 184–90. https://doi.org/10.1093/bioinformatics/btr638.
Jones, David T., Tanya Singh, Tomasz Kosciolek, and Stuart Tetchner. 2015. “MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins.” Bioinformatics (Oxford, England) 31 (7): 999–1006. https://doi.org/10.1093/bioinformatics/btu791.
Kamisetty, Hetunandan, Sergey Ovchinnikov, and David Baker. 2013. “Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era.” Proceedings of the National Academy of Sciences of the United States of America 110 (39): 15674–79. https://doi.org/10.1073/pnas.1314045110.
Kelley, L. A., S. Mezulis, C. M. Yates, M. N. Wass, and M. J. Sternberg. 2015. “The Phyre2 Web Portal for Protein Modeling, Prediction and Analysis.” Nat Protoc 10 (6): 845–58. https://doi.org/10.1038/nprot.2015.053.
Kelley, Lawrence A. 2009. “Fold Recognition.” In From Protein Structure to Function with Bioinformatics, edited by Daniel John Rigden, 27–55. Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-1-4020-9058-5_2.
Ma, Jianzhu, Sheng Wang, Zhiyong Wang, and Jinbo Xu. 2015. “Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning.” Bioinformatics (Oxford, England) 31 (21): 3506–13. https://doi.org/10.1093/bioinformatics/btv472.
Mariani, Valerio, Marco Biasini, Alessandro Barbato, and Torsten Schwede. 2013. “lDDT: A Local Superposition-Free Score for Comparing Protein Structures and Models Using Distance Difference Tests.” Bioinformatics 29 (21): 2722–28. https://doi.org/10.1093/bioinformatics/btt473.
Olmea, O., and A. Valencia. 1997. “Improving contact predictions by the combination of correlated mutations and other sources of sequence information.” Folding & Design 2 (3): S25–32. https://doi.org/10.1016/s1359-0278(97)00060-6.
Rigden, D. J. 2017. From Protein Structure to Function with Bioinformatics. Springer Netherlands. https://books.google.hn/books?id=l4LynAAACAAJ.
Roy, A., A. Kucukural, and Y. Zhang. 2010. “I-TASSER: a unified platform for automated protein structure and function prediction.” Nat Protoc 5 (4): 725–38. https://doi.org/10.1038/nprot.2010.5.
Song, Yifan, Frank DiMaio, Ray Yu-Ruei Wang, David Kim, Chris Miles, TJ Brunette, James Thompson, and David Baker. 2013. “High-Resolution Comparative Modeling with RosettaCM.” Structure 21 (10): 1735–42. https://doi.org/10.1016/j.str.2013.08.005.
Wang, Sheng, Siqi Sun, Zhen Li, Renyu Zhang, and Jinbo Xu. 2017. “Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.” PLoS computational biology 13 (1): e1005324. https://doi.org/10.1371/journal.pcbi.1005324.
Wilson, Carter J., Wing-Yiu Choy, and Mikko Karttunen. 2022. “AlphaFold2: A Role for Disordered Protein/Region Prediction?” International Journal of Molecular Sciences 23 (9): 4591. https://doi.org/10.3390/ijms23094591.
Yang, Jianyi, Ivan Anishchenko, Hahnbeom Park, Zhenling Peng, Sergey Ovchinnikov, and David Baker. 2020. “Improved Protein Structure Prediction Using Predicted Interresidue Orientations.” Proceedings of the National Academy of Sciences 117 (3): 1496–1503. https://doi.org/10.1073/pnas.1914677117.