Protein Databases

Published

August 21, 2023

Modified

September 12, 2023

1 Structure comparison and alignment

As in the case of sequences, comparison and alignment of protein structures is a fundamental and widely used task in computational structure biology. The identification and statistical measure of the similarities between two or more structures allow their classification and infer functional and evolutionary relationships. Moreover, this process is essential during protein modeling, allowing to identify, assess, and select intermediate models.

It is also important to clear up any possible confusion between alignment and superposition, as they are often interchanged in the literature. An structural alignment tries to identify similarities and differences between two structures, while structure superposition displays the structures using a given criteria, usually from a previous structural alignment. Thus, based on that criteria, superposition tries to minimize the distance between structures by finding a transformation that produces either the lowest root-mean-square deviation (RMSD) or the maximal equivalences within an RMSD cutoff. The RMSD can be calculated for any pair of molecules. Regarding proteins, we usually refer to the RMSD of the alpha-carbons. A better alignment will allow a better superposition. Thus, although alignment and superposition are two different processes, the RMSD can be used as an indication of both (the lowest RMSD, the best alignment/superposition).

Figure 1: Structural alignment of three different AP endoncleases with the cealign algorithm implemented in Pymol. The RMSD and the number of aligned residues for each pair of structures is indicated.

2 Major protein databases

Classification of protein sequences allow to understand the diversity of different proteins by their sequence, i.e., the protein sequence space. However, classification of protein structures aim to group proteins based on their structural relationships. Some of these classification schemes explore the concept of structural neighbourhood (structural continuum), whereas other utilize the notion of protein evolution as a driving force of diversification and thus provide a discrete rather than continuum protein structure space (Andreeva 2012).

This section does not aim to comprehensively review all protein databases in detail. Indeed, you probably already know most of the databases we are discussing. We will clarify the main differences and applications of Pfam, Uniprot, Prosite, PDB, SCOP & CATH. As usual, I encourage you to check out online in the references mentioned below for more detailed information.

Figure 2: Major protein databases. Although most of these databases are around 30 years old (or more), since they were created (or renovated) during the last years of the 20th century or the early 21st.

In bioinformatics, databases are often categorized as primary or secondary. Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure. Importantly, once given a database accession number, the data in primary databases are never changed: they form part of the scientific record. By contrast, secondary databases comprise data derived from the results of analysing primary data. Secondary databases often draw upon information from numerous sources, including other databases and the scientific literature. They are highly curated, often using a complex combination of computational algorithms and/or manual analysis and interpretation to derive new knowledge from the public record of science.

Although differences between primary and secondary databases are blurry nowadays because most databases currently contain data from different sources, we still can make some differences. The main protein primary databases are NCBI Protein for protein sequences and RCSB-PDB for protein structures. Uniprot (see below) also contains a primary sequence database, named TrEMBL. It also incorporated in 2002 PIR-PSD, joining the databases of the Protein information Resource, EMBL and SIB in a single (meta)database (see https://proteininformationresource.org/pirwww/dbinfo/pir_psd.shtml). On the other hand, RCSB-PDB is the major structural primary database, whereas SCOP2 and CATH are secondary databases.

Programatic access

All of the databases we describe here allow programmatic and/or API access, which significantly increases the possibilities for systematic or batch data analysis

Sequence databases

Uniprot

The Uniprot databases are maintained by the Uniprot consortium, created in 2002 by EMBL-EBI, SIB, and PIR. Uniprot can be considered nowadays as a metadatabase as its entries contain information from diverse sources. It was created with two goals, constitute a non-redundant comprehensive protein sequence database and enrich that database with annotation of those proteins. Annotations include, but it is not limited to: protein and gene families, function and structure-function available data, interactions with other proteins or cofactor, localization, patterns of expression, variants… Thus, it aims to join the goals of both primary and secondary DDBBs.

The central hub of Uniprot databases is Uniprot Knowledgebase. It is a collection of functional information on proteins, with accurate, consistent, and rich annotation. The UniProtKB consists of two internal databases: a section containing manually-annotated records with information extracted from literature, community suggestions, and curator-evaluated computational analysis, and a section with computationally analyzed records. For the sake of continuity and name recognition, the two sections are referred to as “UniProtKB/Swiss-Prot” (reviewed, manually annotated) and “UniProtKB/TrEMBL” (unreviewed, automatically annotated), respectively.

Besides cross-references to structural information, in the last years, UniprotKB has also incorporated structural data from Alphafold database (created by DeepMind and EBI), see “State-of-the-art protein modeling with Deep Learning-based methods” module.

Sequences with different detail of annotation can be found in two overlapping and complementary databases in Uniprot: Uniparc and Uniref. Briefly, UniParc (UniProt Archive) is a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. UniParc avoided redundancy by storing each unique sequence only once and giving it a stable and unique identifier (UPI) making it possible to identify the same protein from different source databases. A UPI is never removed, changed, or reassigned. On the other hand, UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniprotKB (and selected UniParc records) in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view. The UniRef100 database combines identical sequences into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding. UniRef90 is built by clustering UniRef100 sequences using the MMseqs2 algorithm (Steinegger and Söding 2018) such that each cluster is composed of sequences that have at least 90% sequence identity and 80% overlap with the longest sequence (a.k.a. seed sequence ) of the cluster. Similarly, UniRef50 is built by clustering UniRef90 seed sequences that have at least 50% sequence identity to and 80% overlap with the longest sequence in the cluster. UniParc and Uniref contain only protein sequences. All other information about the protein must be retrieved from the source databases using the database cross-references.

Pfam

Pfam is a protein database that aims to classify sequences by their evolutionary relationships. It was founded in 1995 and it has been very useful for functional annotation of genomic data. Pfam’s website (http://pfam.xfam.org/) was closed by the end of 2022. However, Pfam database was not discontinued, but integrated into the InterPro site (see the Xfam Blog entry).

Pfam uses HMM profiles (we will discuss HMMs in the Homology Modeling module) to classify proteins into families, which are grouped into clans. Check out the EBI course about using Pfam: https://www.ebi.ac.uk/training/online/courses/pfam-creating-protein-families/

The current release (35.0) contains 19,632 entries (families and clans). Pfam was designed as a database that should be often updated in the fast-forward genomic era. To this aim, they use two alignment types. Each Pfam family has a seed alignment that contains a representative set of sequences for the entry. A profile hidden Markov model (HMM) is automatically built from the seed alignment and searched against a sequence database called pfamseq using the HMMER3 software (http://hmmer.org/). All sequence regions that satisfy a family-specific curated threshold, also known as the gathering threshold, are aligned to the profile HMM to create the full alignment (Mistry et al. 2020).

In addition to the HMM-based Pfam entries (Pfam-A), the Pfam profiles are used to provide a set of unannotated, computationally generated multiple sequence alignments called Pfam-B. However, in the last Pfam versions, the Pfam-B alignments are presently only released on the Pfam FTP site.

Pfam has also been used in the creation of other resources such as Rfam (RNA families) and Dfam (transposable DNA elements).

InterPro

InterPro aims to be a functional secondary database, by classifying proteins into families, domains, and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (up to 13) that make up the InterPro consortium. InterPro combines those different signatures representing equivalent families, domains, or sites, and provides additional information such as descriptions, literature references, and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification (see Blum et al. (2021)).

InterPro database is updated every 2 months and it is very useful for annotation of ORFans or divergent proteins. In the last years, it has integrated more resources, including Pfam, as well as structural data and predictions, giving rise to a very handy resource for multiple purposes in protein science (Paysan-Lafosse et al. 2023).

Note

Interpro was created as a sequence DDBB, but it is currently in between. I’d rather say that it is more a “meta-database” containing sequence and structure information.

Structure databases

RCSB-PDB

The Protein Data Bank (PDB, www.rcsb.org) database is the major macromolecule structural primary database (Burley et al. 2020). It contains mostly protein structures, but also spans nucleic acids and nucleoprotein complexes. PDB turned 50 in 2021 and you can see a detailed overview of its history in the RCSB-PDB site.

Briefly, the PDB was established in 1971 at Brookhaven National Laboratory with only 7 structures. Then, the Research Collaboratory for Structural Bioinformatics (RCSB), formed by Rutgers, UCSD/SDSC, and CARB/NIST, became responsible for the management of the PDB in 1998 in response to an RFP and a lengthy review process. In 2003, the Worldwide Protein Data Bank (wwPDB) was formed to maintain a single PDB archive of macromolecular structural data that is freely and publicly available to the global community. It consists of organizations that act as deposition, data processing and distribution centers for PDB data.

PDB structures are largely obtained by X-ray crystallography, but it accepts derived from EM and RMN data since 1989 and 1991, respectively. Indeed, the BMRB (Biological Magnetic Resonance Bank) has partnered with the PDB since 2006 and the EMBD (Electron Microscopy Data Bank) since 2021. Furthermore, starting September 2022, the PDB also contains computed models from the Alphafold database (which we will discuss later in this course) and RoseTTAFold-ModelArchive. Thus, the PDB database is the major hub that centralizes biological structures nowadays.

PDB database has four mirrors and websites (RCSB, Europe, BMRB, and Japan) with mainly overlapping information, although they have some specialization. The RCSB PDB site has also an educational section (PDB-101) with very useful info and resources for teaching and learning structural biology and the work with PDB structures.

The PDB entries contain all the information about the structure, spanning from the protein sequence and source to the experiment details, as well as the structure assessment (see Structural quality assurance section) and visualization. You can download all this information and the structure coordinates in diverse file formats.

SCOP

The Structural Classification of Proteins (SCOP, http://scop.mrc-lmb.cam.ac.uk) database is a classification of protein domains organized according to their evolutionary and structural relationships in hierarchical categories. The main unit is the family that groups related proteins with clear evidence for their evolutionary origin while the superfamily brings together more distantly related protein domains. Further, superfamilies are grouped into distinct folds on the basis of the global structural features shared by the majority of their members. Domain definitions are provided for the two main levels of the SCOP classification, family and superfamily, and the domain boundaries for each of these can coincide or differ (Andreeva et al. 2020).

For each group, a representative is selected based on its sequence (UniProtKB) and structure (PDB) and used for SCOP classification. Thus, the SCOP domain boundaries are assigned to both, the PDB and UniProtKB entry.

CATH

CATH (www.cathdb.info) is a free, publicly available resource that identifies protein domains within proteins from the Protein Data Bank and classifies them into evolutionary-related groups according to sequence, structure, and function information. It assumes that related proteins which fold alike often exhibit similar functions (this only could be demonstrated if we find intermediates). CATH uses a hierarchical classification scheme where the units compared and classified are structural domains. Domains, defined here as globular structural domains capable of semi-independent folding, are extracted from experimentally determined protein structures available in the PDB database. The domains are classified into the following hierarchical levels that compose the CATH name: Class (C), Architecture (A), Topology (T), and Homologous superfamilies (H).

CATH uses a combination of several structure-based (SSAP, CATHEDRAL) and sequence-based (Needleman-Wunsch-based sequence alignments, Jackhmmer, Profile Comparer, and HHsearch) algorithms to assess the similarity of domains to each other and to identify protein homologues (Sillitoe et al. 2021).

CATH has a sister resource, Gene3D, that adds in additional protein domain sequences with no known structure, which brings the current total number of domains in CATH-Gene3D up to 95 million.

CATH database is updated quite regularly through daily snapshots (CATH-B), but a full release with more tools, named CATH+ is released every 12 months. CATH-plus contains functional families (CATH-FunFams), structural clusters, and other tools (Sillitoe et al. 2021).

Graphical summary

Figure 3: Main features of protein databases (updated in 2020)

3 DDBBs Practice

This is a Evernote note that you can check on your web browser and also copy into your Evenote account/App if you wish. Link here

The goal of this exercise is to get familiar with protein structures and their domain organization. Try to explore the estructures and identify similarities and differences (just by eye).

Classify the following protein structures into “groups” and “subgroups”

Focus only in structural databases (CATH, Scop2), but any given database can be helpful in a different way. Use generic terms, valid for any hierarchical classification, independent of the database(s) used.

Structures
2MNA
2WB0
1C0A
1QUM
5ZHZ
5CFE
1DE9
6WCD
1AKO
6KHY

Answer these questions

a. Can you suggest the structural and evolutionary relationship between them?

b. Which of the structures is more difficult to group

c. Which are the pros/cons of each database?

Answer briefly.

References

Andreeva, Antonina. 2012. “Classification of Proteins: Available Structural Space for Molecular Modeling.” In, edited by Andrew J. W. Orry and Ruben Abagyan, 1–31. Totowa, NJ: Humana Press. https://doi.org/10.1007/978-1-61779-588-6_1.
Andreeva, Antonina, Eugene Kulesha, Julian Gough, and Alexey G. Murzin. 2020. “The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures.” Nucleic Acids Research 48 (D1): D376–82. https://doi.org/10.1093/nar/gkz1064.
Blum, Matthias, Hsin-Yu Chang, Sara Chuguransky, Tiago Grego, Swaathi Kandasaamy, Alex Mitchell, Gift Nuka, et al. 2021. “The InterPro Protein Families and Domains Database: 20 Years On.” Nucleic Acids Research 49 (D1): D344–54. https://doi.org/10.1093/nar/gkaa977.
Burley, Stephen K, Charmi Bhikadiya, Chunxiao Bi, Sebastian Bittrich, Li Chen, Gregg V Crichlow, Cole H Christie, et al. 2020. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences.” Nucleic Acids Research 49 (D1): D437–51. https://doi.org/10.1093/nar/gkaa1038.
Mistry, Jaina, Sara Chuguransky, Lowri Williams, Matloob Qureshi, Gustavo A Salazar, Erik L L Sonnhammer, Silvio C E Tosatto, et al. 2020. “Pfam: The Protein Families Database in 2021.” Nucleic Acids Research 49 (D1): D412–19. https://doi.org/10.1093/nar/gkaa913.
Paysan-Lafosse, Typhaine, Matthias Blum, Sara Chuguransky, Tiago Grego, Beatriz Lázaro Pinto, Gustavo A Salazar, Maxwell L Bileschi, et al. 2023. “InterPro in 2022.” Nucleic Acids Research 51 (D1): D418–27. https://doi.org/10.1093/nar/gkac993.
Sillitoe, Ian, Nicola Bordin, Natalie Dawson, Vaishali P. Waman, Paul Ashford, Harry M. Scholes, Camilla S. M. Pang, et al. 2021. “CATH: increased structural coverage of functional space.” Nucleic Acids Research 49 (D1): D266–73. https://doi.org/10.1093/nar/gkaa1079.
Steinegger, Martin, and Johannes Söding. 2018. “Clustering huge protein sequence sets in linear time.” Nature Communications 9 (1): 2542. https://doi.org/10.1038/s41467-018-04964-5.