1D Features

Published

August 21, 2023

Modified

February 20, 2025

1D features are protein characteristics that can be directly interpreted from the protein’s primary sequence and represented as values assigned to each residue in the sequence. For example, we can assign a secondary structure state (symbol or probability) to each residue. Many structure prediction methods use or incorporate third-party methods to predict secondary structure and other 1D features, which provide crucial information during the modeling process.

You can find links to several 1D features prediction tools in the Modeling Resources section.

1 Protein secondary structure prediction

Multiple state of secondary structures

Secondary structures are assigned to structures using the DSSP (Define Secondary Structure of Proteins) algorithm, originally created in 1983 and updated several times, with the latest version available on GitHub from 2021. The DSSP algorithm classifies each residue based on its geometry and predicted hydrogen bonds by comparing it with patterns in the DSSP database. Importantly, DSSP does not predict secondary structures; it simply extracts this information from the 3D coordinates.

Most protein secondary structure prediction (PSSP) methods use a three-state model, categorizing secondary structures into helix (H), sheet (E), and coil (C). Helix and sheet are the main conformations proposed by Linus Pauling in the early days of structural biology, while coil (C) refers to any amino acid that does not fit into either helix or sheet categories. This three-state model is still widely used, but it has limitations, as it oversimplifies the backbone structure, often missing deviations from standard helix and sheet conformations.

In the 1980s, an eight-state secondary structure model was proposed, including α-helix (H), 310-helix (G), parallel/anti-parallel β-sheet (E), isolated β-bridge (B), bend (S), turn (T), π-helix (I), and coil (C). DSSP defines these eight states in experimentally obtained structures and includes transformations for mapping eight-state structures to three-state models (see Ismi, Pulungan, and Afiahayati (2022)).

More recently, in 2020, four-state and five-state PSSP models were introduced to simplify predictions and increase accuracy. These new models address the imbalance in sample sizes for certain classes, like isolated β-bridge (B) and bend (S), which have fewer samples and lower true-positive rates. In the five-state model, B and S are categorized as C, while in the four-state model, B, S, and G are categorized as C. Additionally, 75% of π-helix (I) occurrences are found at the beginning or end of an α-helix (H), so they are categorized as H. The full potential of these new categories is still being explored.

Evolution of Prediction methods

Protein secondary structure prediction (PSSP) from protein sequences is based on the idea that segments of consecutive residues have preferences for certain secondary structure states. Like other methods in bioinformatics, including protein modeling, approaches to secondary structure prediction have evolved over the last 50 years (see Table 1).

The first generation methods relied on statistical approaches, where prediction depended on assigning a set of prediction values to a residue and then applying a simple algorithm to those numbers. This meant applying a probability score based on single amino acid propensity. In the 1990s, new methods included the information of the flanking residues (3-50 nearby amino acids) in the so-called Nearest Neighbor (N-N) Methods. These methods increased the accuracy in many cases but still had strong limitations, as they only considered three possible states (helix, strand, or turn). Moreover, as you know from the secondary structure practice, β-strand predictions are more difficult and did not improve much thanks to N-N methods. Additionally, predicted helices and strands were usually too short.

By the end of the 1990s, new methods boosted the accuracy to values near 80%. These methods included two innovations, one conceptual and one methodological. The conceptual innovation was the inclusion of evolutionary information in the predictions, by considering the information of multiple sequence alignments or profiles. If a residue or a type of residue is evolutionary conserved, it is likely that it is important for defining secondary structure stretches. The methodological innovation was the use of neural networks, in which multiple layers of sequence-to-structure predictions were compared with independently trained networks.

Since the 2000s, most commonly used methods are meta-servers that compare several algorithms, mostly based on neural networks, such as JPred or SYMPRED, among others.

In recent years, deep neural networks trained with large datasets have become the primary method for protein secondary structure prediction (and almost any other prediction in structural biology). In the Alphafold era (see the last lesson), methods adapted from image processing or natural language processing (NLP) are also used (for instance, in NetSurfP-3.0, see Høie et al. (2022)), allowing protein secondary structure predictions to focus on specific objectives, such as enhancing the quality of evolutionary information for protein modeling (Ismi, Pulungan, and Afiahayati 2022).

Evolution of secondary structure prediction methods (modified from Ismi, Pulungan, and Afiahayati (2022)).
Generation	Method	Accuracy
1st: Statistics	Chow & Fassman (1974-)	57%
	GOR (1978-)
2nd: Nearest Neighbor (N-N) methods	PREDATOR (1996)	75%
	NNSSP (1995)	72%
3rd: N-N neural network & evolutionary info	APSSP	Up to 86%
	PsiPRED (1999-)	75.7% (1999) 84% (2019)
	PHD (1997)
4th: Multiple layers of info	Extra layers of info, such as conserved domains, frequent patterns, contact maps or predicted residue solvent accessibility (2000s)	<80%
5th generation	Sophisticated deep learning architectures and NLP (2010s, 2020s). RaptorX-Property, (2018), SPIDER3 (2020) and NetSurfP-3.0 (2022), among others.	>80%
META-Servers	Jpred4
	GeneSilico (Discontinued)
	SYMPRED

2 Structural disorder and solvent accessibility

The expression disorder denote protein stretches that cannot be assigned to any SS. They are usually dynamic/flexible, thus with high B-factor or even missing in crystal structures. These fragments show a low complexity and they are usually rich in polar residues, whereas aromatic residues are rarely found in disordered regions. These motifs are usually at the ends of proteins or domain boundaries (as linkers). Additionally, they are frequently related to specific functionalities, such in the case of proteolytic targets or protein-protein interactions (PPI). More rarely, large disordered domains can be conserved in protein families and associated with relevant functions, as in the case of some transcription factors, transcription regulators, kinases…

There are many methods and servers to predict disordered regions. You can see a list in the Wikipedia here or in the review by Atkins et al. (2015). The best-known server is DisProt, which uses a large curated database of intrinsically disordered proteins and regions from the literature, which has been recently improved to version 9 in 2022, as described in Quaglia et al. (2022).

Interestingly, a low plDDT (see below) score in Alphafold2 models has been also suggested as a good indicator of protein disorder (Wilson, Choy, and Karttunen 2022).

Figure 1: Examples of disordered regions. From Database of disordered regions.

Hydrophobic collapse is usually referred to as a key step in protein folding. Hydrophobic residues tend to be buried inside the protein, whereas hydrophilic, polar amino acids are exposed to the aqueous solvent.

Solvent accessibility correlates with residue hydrofobicity (accessibility methods usually better performance). Therefore, estimation of how likely each residue is exposed to the solvent or buried inside the protein is useful to obtain and analyze protein models. Moreover, this information is useful to predict PPIs as well as ligand binding or functional sites. Most methods only classify each residue into two groups: Buried, for those with relative accessibility probability <16% and Exposed, for accessibility residues >16%.

Most common recent methods, like ProtSA or PROFacc, combine evolutionary information with neural networks to predict accessibility.

3 Trans-membrane motifs and membrane topology

Identification of transmembrane motifs is also a key step in protein modeling. About 25-30% of human proteins contain transmembrane elements, most of them in alpha helices.

Figure 3: Different topologies of transmembrane proteins

The PDBTM (Protein Data Bank of Transmembrane Proteins) is a comprehensive and up-to-date transmembrane protein selection. As of September 2022, it contains more than 7600 transmembrane proteins, 92.6% of them with alpha helices TM elements. This number of TM proteins is relatively low, as compared with more than 160k structures in PDB, as TM proteins are usually harder to purify and crystalization conditions are often elusive. Thus, although difficult, accurate predictions of TM motifs and overall protein topology can be essential to define protein architecture and identify domains that could be structurally or functionally studied independently.

Figure 4: Different topologies of TM helices.

Current state-of-the-art TM prediction protocols show an accuracy of 90% for definition of TM elements, but only a 80% regarding the protein topology. However, some authors claim that in some types of proteins, the accuracy is not over 70%, due to the small datasets of TM proteins. Most recent methods, based in deep-learning seem to have increased the accuracy to values near 90% for several groups of proteins (Hallgren et al. 2022).

4 Subcellular localization tags and post-translational modification sites

Many cellular functions are compartmentalized in the nucleus, mitochondria, endoplasmatic reticulum (ER), or other organules. Therefore, many proteins should be located in those compartments. That is achieved by the presence of some labels, in form of short peptidic sequences that regulate traffic and compartmentalization of proteins within the cells. Typically, N-terminal signals direct proteins to the mitochondrial matrix, ER, or peroxisomes, whereas nucleus traffic is regulated by nuclear localization signals (NLS) and nuclear export signals (NES). These short motifs are difficult to predict, as datasets of validated signals are small. The use of consensus sequences allowed predictions, although in many cases with a high level of uncertainty. As you can already suspect, there are a bunch of new deep learning-based methods in the last decade. If you are interested in this topic, check out the recent review by Pollastri lab (Gillani and Pollastri 2024).

Post-translational modifications often happen in specific patterns that include important residues for processes like phosphorylation or ubiquitination. Here also, using deep learning methods to predict these modifications can help identify them accurately and quickly (Meng et al. 2022).

5 References

Atkins, Jennifer D., Samuel Y. Boateng, Thomas Sorensen, and Liam J. McGuffin. 2015. “Disorder Prediction Methods, Their Applicability to Different Protein Targets and Their Usefulness for Guiding Experimental Studies.” International Journal of Molecular Sciences 16 (8): 19040–54. https://doi.org/10.3390/ijms160819040.

Gillani, Maryam, and Gianluca Pollastri. 2024. “Protein subcellular localization prediction tools.” Computational and Structural Biotechnology Journal 23 (December): 1796–1807. https://doi.org/10.1016/j.csbj.2024.04.032.

Hallgren, Jeppe, Konstantinos D. Tsirigos, Mads Damgaard Pedersen, José Juan Almagro Armenteros, Paolo Marcatili, Henrik Nielsen, Anders Krogh, and Ole Winther. 2022. “DeepTMHMM Predicts Alpha and Beta Transmembrane Proteins Using Deep Neural Networks.” https://doi.org/10.1101/2022.04.08.487609.

Høie, Magnus Haraldson, Erik Nicolas Kiehl, Bent Petersen, Morten Nielsen, Ole Winther, Henrik Nielsen, Jeppe Hallgren, and Paolo Marcatili. 2022. “NetSurfP-3.0: Accurate and Fast Prediction of Protein Structural Features by Protein Language Models and Deep Learning.” Nucleic Acids Research 50 (W1): W510–15. https://doi.org/10.1093/nar/gkac439.

Ismi, Dewi Pramudi, Reza Pulungan, and null Afiahayati. 2022. “Deep learning for protein secondary structure prediction: Pre and post-AlphaFold.” Computational and Structural Biotechnology Journal 20: 6271–86. https://doi.org/10.1016/j.csbj.2022.11.012.

Meng, Lingkuan, Wai-Sum Chan, Lei Huang, Linjing Liu, Xingjian Chen, Weitong Zhang, Fuzhou Wang, Ke Cheng, Hongyan Sun, and Ka-Chun Wong. 2022. “Mini-review: Recent advances in post-translational modification site prediction based on deep learning.” Computational and Structural Biotechnology Journal 20: 3522–32. https://doi.org/10.1016/j.csbj.2022.06.045.

Quaglia, Federica, Bálint Mészáros, Edoardo Salladini, András Hatos, Rita Pancsa, Lucía B. Chemes, Mátyás Pajkos, et al. 2022. “DisProt in 2022: improved quality and accessibility of protein intrinsic disorder annotation.” Nucleic Acids Research 50 (D1): D480–87. https://doi.org/10.1093/nar/gkab1082.

Wilson, Carter J., Wing-Yiu Choy, and Mikko Karttunen. 2022. “AlphaFold2: A Role for Disordered Protein/Region Prediction?” International Journal of Molecular Sciences 23 (9): 4591. https://doi.org/10.3390/ijms23094591.