Structure Search

Introduction

The function of a biological molecule follows its form (or shape), so molecules with similar shapes or structures tend to have similar functions. The number, size, and complexity of experimental structures in the Protein Data Bank (PDB) continue to grow each year. Many of the experimental structures are assemblies of multiple proteins or multiple copies of a protein. The assembly coordinates may be specific subsets of the model (deposited) coordinates or may be derived by applying specific types of symmetry operations. Having to query both deposited and assembly coordinates makes finding structurally similar proteins and assemblies challenging.

RCSB.org also offers access to more than a million computed structure models (CSMs). The coordinates of these models do not include any symmetry-related information, so the model and assembly coordinates are identical and are included by default in structure-based searches.

What is Structure Search?

The Structure Search option allows you to query the PDB archive using the 3D shape of a protein structure. This method, developed by RCSB PDB (Guzenko et al., 2020), treats proteins as volumes of space filled by atoms (i.e., a density distribution) instead of as a collection of atomic coordinates and chain connectivities. The protein volumes are decomposed using a mathematical tool known as 3D Zernike polynomials and are described as vectors of Zernike moments. This approach describes volumes with compact descriptors that are invariant to rotation and translation (Novotni and Klein, 2004). The search uses BioZernike descriptors to assess global 3D-shape similarity, and it is fast for both individual protein chains and assemblies.
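
The idea of a descriptor that is invariant to rotation and translation can be illustrated with a much simpler stand-in than 3D Zernike moments. The sketch below is not the BioZernike method: it summarizes a point cloud by a few moments of its distance-to-centroid distribution and checks that this summary is unchanged after the points are rotated and shifted. The function names and the random test shape are made up for illustration.

```python
import math
import random


def radial_moments(points, orders=(1, 2, 3)):
    """Return the mean of (distance to centroid) ** k for each k in `orders`."""
    n = len(points)
    centroid = (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )
    radii = [math.dist(p, centroid) for p in points]
    return [sum(r ** k for r in radii) / n for k in orders]


def rotate_z_then_shift(points, angle, shift):
    """Rotate points about the z axis by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


# A made-up elongated point cloud standing in for a protein volume.
rng = random.Random(0)
shape = [
    (rng.uniform(-10, 10), rng.uniform(-5, 5), rng.uniform(-2, 2))
    for _ in range(200)
]
moved = rotate_z_then_shift(shape, angle=1.1, shift=(30.0, -12.0, 7.5))

original_descriptor = radial_moments(shape)
moved_descriptor = radial_moments(moved)

# True when the descriptor survives the rigid-body move (up to rounding).
descriptors_match = all(
    math.isclose(a, b, rel_tol=1e-9)
    for a, b in zip(original_descriptor, moved_descriptor)
)
```

Because the descriptor depends only on distances to the centroid, no superposition is needed before two shapes are compared. Zernike moments share this property while capturing far more shape detail.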

Why run a Structure Search?

Finding and classifying structures in the PDB is fundamental to understanding functional and evolutionary relationships. While sequence-based searches can reveal conserved domains in proteins, there are many examples in biology where the protein shapes (and functions) are similar despite sequence variations. In addition, the same protein may adopt more than one conformation, such as the open and closed forms of an enzyme. Such relationships cannot be identified using sequence-based searches and require structure search options.

Moreover, some proteins are stabilized and/or function as part of an assembly, in which they interact with one or more copies of themselves or with other proteins. The structure search option allows you to identify similar assemblies, enabling exploration of the shape and interactions of the protein (or its complex).

Documentation

Several options can be combined to run a Structure Search. They are described here in three sections:

  • Query - the options for entering your query.
  • Search - the types of searches that can be run (e.g., strict and relaxed).
  • Results - the options for what is shown on the results page.

Query Options

There are two types of structure searches possible:

  1. search for similar polymeric chains to a given chain
  2. search for similar assemblies to a given assembly

Both these types of structure searches can be launched from two different locations on the website as described here.

Query using the 'Advanced Search' panel

The structure search options are available from the “Advanced Search” panel and can be accessed by typing in a PDB ID or RCSB.org assigned CSM ID in the box listed under Structure Similarity (Figure 1).

Figure 1: Options for launching a Structure Similarity search from the Advanced Search Query builder.

Once a 3D structure ID (PDB ID for experimental structures or RCSB.org assigned CSM ID) is typed in the box, some additional options become available. Select the type of search to launch - Assembly ID or Chain ID.

For the assembly structure search, select the Assembly ID from the pull-down menu, select “Assemblies” in the results “Return” options, decide whether to include or exclude CSMs, and click the blue Search button with a green magnifying lens icon to launch the search (Figure 2A). If a CSM ID is used for this search, remember to turn on the Include CSM toggle switch (Figure 2B). Note that for CSMs the assembly coordinates are the same as the model coordinates, so the assembly is denoted as the deposited assembly.

Figure 2: Options to specify a Structure Similarity Search - A. using a PDB ID and assembly ID and deciding whether to include or exclude CSMs; B. using an RCSB.org assigned CSM ID, turn on Include CSM toggle switch. In both cases specify the results Return type to be Assemblies, before launching the search.

For the protein chain-based structure search, select the chain ID of the protein of interest in the query structure, select “Polymer entities” in the results “Return” options, decide whether to include or exclude CSMs, and click the blue Search button with a green magnifying lens icon to launch the search (Figure 3A). If a CSM ID is used for this search, remember to turn on the Include CSM toggle switch (Figure 3B).

Figure 3: Options to specify a Structure Similarity Search - A. using a PDB ID and chain ID and deciding whether to include or exclude CSMs; B. using an RCSB.org assigned CSM ID, turn on Include CSM toggle switch. In both cases specify the results Return type to be Polymer Entities, before launching the search.

To find structures that are similar to a 3D structure not included in RCSB.org, include the query structure using a URL. This allows custom searching for non-RCSB.org structures, e.g., from AlphaFold, RoseTTAFold, or ESMFold predictions.

To use this feature, switch the input mode from “Entry ID” to “Web Link” (Figure 4). Make sure the URL uses the “http” or “https” protocol. Specify the file format: the default is mmCIF, but BinaryCIF and PDB files are also supported. Select “Polymer entities” or “Structures” in the results “Return” options, as appropriate. Decide whether to include or exclude CSMs, and click the blue Search button with a green magnifying lens icon to launch the search.

The search will be based on the deposited coordinates, also referred to as “asymmetric unit”. Note: this is different from the 3D experimental or CSM entry-ID-based query, which allows you to select a specific assembly or chain identifier for the search.
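
The two input requirements for a Web Link query (an http or https URL, and one of the three supported file formats) can be checked before a search is launched. The sketch below is a minimal illustration: `web_link_query_value` is a hypothetical helper, the short format codes are assumptions that do not appear on this page, and the example URL is a placeholder.

```python
from urllib.parse import urlparse

# Assumed short codes; the page itself only names mmCIF, BinaryCIF, and PDB.
FORMAT_CODES = {"mmCIF": "cif", "BinaryCIF": "bcif", "PDB": "pdb"}


def web_link_query_value(url, file_format="mmCIF"):
    """Validate a Web Link query and return it as a small dictionary."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"URL must use http or https, got {scheme!r}")
    if file_format not in FORMAT_CODES:
        raise ValueError(f"unsupported file format: {file_format}")
    return {"url": url, "format": FORMAT_CODES[file_format]}


# Placeholder URL for a predicted model hosted outside RCSB.org.
value = web_link_query_value("https://example.org/models/my_prediction.cif")
```

The mmCIF default mirrors the default of the Web Link form, so only BinaryCIF or PDB files need the format to be named explicitly.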

For CSMs with local low-confidence regions, i.e., CIF files from AlphaFold, RoseTTAFold, or ESMFold in which the `ma_qa_metric_local` cif category is present and local pLDDT scores are less than 70, a pre-filtering step removes these regions from the query. Excluding such unstructured or highly flexible regions of CSMs can reduce the number of false positives and false negatives in the query results.
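
The effect of this pre-filtering step can be pictured with a short sketch. It is not the actual RCSB.org implementation, which this page does not describe: it only shows the stated rule (drop residues whose local pLDDT is below 70) applied to a hypothetical mapping of residue numbers to scores, such as one might parse from `ma_qa_metric_local`.

```python
PLDDT_CUTOFF = 70.0  # threshold stated in the text above


def confident_residues(plddt_by_residue, cutoff=PLDDT_CUTOFF):
    """Return residue numbers whose local pLDDT is at or above the cutoff."""
    return [
        residue
        for residue, score in sorted(plddt_by_residue.items())
        if score >= cutoff
    ]


# Hypothetical per-residue scores: a disordered tail at each end of a
# well-predicted core.
scores = {1: 42.5, 2: 55.0, 3: 71.2, 4: 93.8, 5: 88.1, 6: 69.9, 7: 30.4}
kept = confident_residues(scores)  # residues 3, 4, and 5 remain
```

In this toy case both low-confidence tails are removed, so only the core would contribute to the shape used as the query.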

Figure 4: Structure Similarity Search options using a Web Link to specify a non-RCSB.org 3D structure as a query.

Query from the Structure Summary page

All 3D structures available from RCSB.org (experimental structures and CSMs) have a dedicated Structure Summary page that displays information about the entities and assemblies of that entry.
To search for structures similar to any one polymer entity in the structure, click the “Structure” link above the details listed for the macromolecule (Figure 5).

Figure 5: Options to launch a structure based search from the structure summary page (highlighted in a red oval).

To search for assemblies similar to a specific assembly of the structure, click the “Find Similar Assemblies” link below the snapshot of the assembly on the page (Figure 6).

Figure 6: Options to launch a search for an assembly from the structure summary page.

Search Options

For any structure search it is possible to choose between two modes of matching by selecting the corresponding radio button:

  • Strict: use this if you want to be sure your matches are all relevant, at the risk of not finding some more distant matches.
  • Relaxed: use this if you want to be sure your matches include all similar structures, at the risk of bringing in some false positives.

Note that the strict or relaxed option can be selected for structure searches launched from the Advanced Search panel, whereas searches launched from the Structure Summary page always use the strict option.
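
Taken together, the query, return type, CSM, and matching-mode choices described above could also be written down as a single request body for programmatic searching. The sketch below only builds and prints such a body; it sends nothing. The field names and values (`service`, `entry_id`, `assembly_id`, `asym_id`, `strict_shape_match`, `relaxed_shape_match`, `results_content_type`) are assumptions modeled on the RCSB Search API rather than taken from this page, the mapping of the Chain ID field to `asym_id` is likewise assumed, and `structure_query` is a hypothetical helper. Check all of them against the current API reference before relying on them.

```python
import json

# Assumed operator names for the two matching modes.
MODE_OPERATORS = {
    "strict": "strict_shape_match",
    "relaxed": "relaxed_shape_match",
}


def structure_query(entry_id, *, assembly_id=None, chain_id=None,
                    mode="strict", include_csms=False):
    """Build (but do not send) a structure-similarity request body."""
    # A query targets either one assembly or one chain, never both.
    if (assembly_id is None) == (chain_id is None):
        raise ValueError("give exactly one of assembly_id or chain_id")
    value = {"entry_id": entry_id}
    if assembly_id is not None:
        value["assembly_id"] = str(assembly_id)
        return_type = "assembly"
    else:
        value["asym_id"] = chain_id
        return_type = "polymer_entity"
    content_types = ["experimental"]
    if include_csms:
        content_types.append("computational")
    return {
        "query": {
            "type": "terminal",
            "service": "structure",
            "parameters": {"value": value, "operator": MODE_OPERATORS[mode]},
        },
        "return_type": return_type,
        "request_options": {"results_content_type": content_types},
    }


chain_query = structure_query("1MBN", chain_id="A", include_csms=True)
assembly_query = structure_query("6VXX", assembly_id=1, mode="relaxed")
print(json.dumps(chain_query, indent=2))
```

The helper ties the return type to the kind of query (assemblies for an assembly query, polymer entities for a chain query), mirroring the pairing recommended for the Advanced Search panel.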

Results

Depending on the options selected, structure search results list similar entities or assemblies.

For entity-based searches, each matched entity can be superposed on the query entity and viewed in 3D using the pairwise alignment tool by clicking the View button next to “Structure Match” (Figure 7).

Figure 7: Part of the query results page showing options to view the structure match (panels on the right) and some measures describing the extent of the match (red outlined boxes at the top and bottom of the figure).

For assembly-based searches, each matched assembly is assigned a structure match score: the probability that it matches the query structure, expressed as a percentage. A score of 100 indicates a perfect match, while lower numbers indicate lesser degrees of similarity between the assemblies (Figure 8).

Figure 8: Part of the results list of assembly based match showing the structure match score
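
When a result list is processed outside the website, the structure match score can be used to set aside weak matches. The sketch below assumes a simple list of records with `identifier` and `score` keys; that layout, the placeholder identifiers, and the cutoff of 50 are all made up for illustration and are not defined by this page.

```python
def filter_matches(matches, min_score=50.0):
    """Return matches scoring at least `min_score`, best match first."""
    kept = [m for m in matches if m["score"] >= min_score]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


# Placeholder assembly identifiers with percentage scores.
matches = [
    {"identifier": "AAAA-1", "score": 100.0},
    {"identifier": "BBBB-2", "score": 12.0},
    {"identifier": "CCCC-1", "score": 87.5},
]
strong = filter_matches(matches)
```

A suitable cutoff depends on the question being asked: a low value keeps distant but possibly relevant matches, at the cost of more unrelated ones.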

Limitations of Structure Search

The structure search system has some limitations:

  • The method cannot report an RMSD because it only produces a globally optimal superposition of the volumes and has no information about which residues are paired in the alignment. Instead, the method outputs a score that indicates the likelihood that the match is relevant.
  • Highly symmetric assemblies often produce false positives (with lower scores), e.g. searching for a D3 point-group symmetric assembly will likely match a few unrelated D3 assemblies with lower scores.
  • Flexible NMR structures will often be unmatched because of their long flexible tails.
  • Long protruding tails will result in failure to match otherwise globally similar shapes.
  • The matching is global, thus local similarities are not found. For example:
    • when searching for chains: two chains that are similar only in some common domain will usually not match,
    • when searching for assemblies: two assemblies that are similar in some subset of chains but not globally will usually not match.

Examples

1. Search for entities similar to Myoglobin

  • Launch this search from the Advanced Search interface for PDB ID 1mbn, Chain ID A
  • Select the strict search radio button, Display results as Polymer Entities, include CSMs, and launch the search (Figure 9)
Figure 9: Options to run a structure based search for chain ID A in PDB entry 1mbn, to return polymer entities. The search includes CSMs.
  • The search results show many myoglobin entities, some hemoglobin entities, a few neuroglobin entities, and some other entities.

2. Search for entities that are conformationally similar to the open form of hexokinase

  • Use a structure of the enzyme hexokinase in an “open” conformation as a query. Launch this search from the Advanced Search interface for PDB ID 2yhx, Chain ID A (Figure 10)
  • Select the strict search radio button, Display results as Polymer Entities, include CSMs, and launch the search.
Figure 10: Options for searching structures that are conformationally similar to the open form of hexokinase
  • The search results show other hexokinases and related proteins. Note that the better matches are hexokinase entities in an open conformation, while the matches towards the end of the result list include the same or related enzyme entities in the closed conformation.

3. Search for assemblies similar to the SARS-CoV-2 Spike protein trimer

  • The SARS-CoV-2 spike protein is composed of three polymer chains, each of which has a receptor-binding domain that can be in an open (or up) conformation for interacting with cellular receptors or a closed (or down) conformation. The Structure Search functionality can be used to identify spike structures that have a similar arrangement of these domains.
  • To find spike structures where all three receptor-binding domains are closed, launch the structure search from the Structure Summary page for the PDB ID 6vxx, Biological Assembly 1 (Figure 11).
Figure 11: Options to search for structures with the same assembly from the structure summary page of PDB ID 6vxx.
  • The search results show similar spike protein assemblies with closed conformations.

4. Search for assemblies similar to Insulin hexamers

  • Launch this search from the Structure Summary page for the PDB entry 1trz, Biological Assembly 3 (Figure 12)
Figure 12: Options to launch a structure (assembly) based search from the structure summary page of PDB ID 1trz.

The search results show many other similar insulin assemblies, and some unrelated structures with Structure Match Scores of about 12%.

References



Please report any encountered broken links to info@rcsb.org
Last updated: 12/2/2022