RCSB PDB Help

Grouping Search Results

Introduction

Documentation

● Options for Grouping Search Results

○ Grouping Structures

■ Grouping by PDB Deposit Group ID

○ Grouping Macromolecules

■ Grouping by Sequence Identity

■ Grouping by UniProt Accession

Displaying Grouped Results

● Groups View

● Representatives View

Examples

● Exploring all protein targets of the drug, Imatinib

Introduction

For many proteins, the PDB archive includes multiple structures, providing snapshots of the structure, interactions, and functions of these proteins under different conditions. This redundancy provides opportunities for exploration of biomolecular interactions and functions. When query results include many matches to the same or similar proteins, it can help to remove redundancy by grouping and organizing the search results in meaningful ways.

While redundancy in the PDB enables a deeper understanding of biology, it may present some challenges in bioinformatics analysis. Here are four main reasons for grouping search results:

Reducing the size of datasets (by examining distinct representatives of groups). This is particularly important as the size of the PDB continues to grow.
Recognizing the relationship between various groups of the search results and exploring each group of results.
Drawing attention to the full range of query matches - by hiding redundant matches from the results, less frequent but relevant results can become more prominent and included in exploration and analysis.
Removing undesirable biases - which may be introduced if a result set has many similar and homologous proteins.

Documentation

Options for Grouping Search Results

Redundancy occurs at many levels, including sequence and/or structure similarity. A variety of grouping methods can be applied to PDB data to provide a non-redundant view. The currently available options allow grouping of:

Structures

By PDB Deposit Group ID

Macromolecules

By Sequence Identity
By UniProt Accession

Ligands

By Ligand ID

Grouping Structures

Grouping by PDB Deposit Group ID

Although PDB structures are each identified by a PDB ID, some PDB structures that are deposited by authors in batches have a common PDB Deposit Group ID.
Criteria for membership in these groups are determined by the authors at the time of structure submission.
Frequently these structures have the same protein(s) but with different ligands bound to them, for example, structures resulting from screening for fragments that bind to a specific target.
Only a small number of structures in the archive have a PDB Deposit Group ID (format G_#######, e.g., G_1002057). Organizing search results with this option therefore includes only the structures that have a PDB Deposit Group ID; structures in the results without one are not listed in the grouped results.

Grouping Macromolecules

A polymer entity (e.g., a protein) may appear in the PDB archive in many different entries: by itself under different experimental conditions, with minor modifications (e.g., mutations or sequence variations), or in complex with other molecules, representing different functional states. While the structures of the protein in each entry may be different, the sequence and its mapping to UniProt remain the same. Search results of a query for polymer entities can be organized in the following ways.

Grouping by Sequence Identity

Polymer entities in the result list can be matched by using specific sequence identity criteria (from 100% to 30%).
The sequence identity groups are based on sequence clustering performed by RCSB PDB with each weekly PDB archive release. These groups are therefore likely to change over time as new structures are added to the archive. Learn more about sequence clusters here.
Sequence clusters are based on alignments that span nearly the entire sequence, i.e., sequence coverage must be at least 90%. As a result, constructs of different lengths of the same protein, whether in the same structure or in different structures, may end up in different groups even if their sequence identity is sufficient.
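
To see why fragments of one protein can land in different sequence identity groups, consider a toy check that requires both an identity cutoff and at least 90% coverage. This is only a sketch: the function name `same_cluster`, the way coverage is computed (aligned length divided by the length of each sequence), and the example numbers are assumptions made for illustration, not RCSB PDB's actual clustering procedure.

```python
def same_cluster(len_a, len_b, aligned_len, identical,
                 identity_cutoff=0.30, min_coverage=0.90):
    """Toy test: would two sequences be placed in the same group?

    Assumption (not taken from RCSB PDB documentation): coverage is the
    aligned length divided by the length of each sequence, and both
    sequences must reach min_coverage.
    """
    identity = identical / aligned_len
    coverage_a = aligned_len / len_a
    coverage_b = aligned_len / len_b
    return identity >= identity_cutoff and min(coverage_a, coverage_b) >= min_coverage


# Hypothetical 500-residue protein versus a 280-residue domain of the same
# protein: every aligned residue is identical, but the alignment covers
# only 56% of the longer chain.
full_vs_domain = same_cluster(len_a=500, len_b=280, aligned_len=280, identical=280)

# Two near-full-length constructs that differ by a handful of mutations.
full_vs_mutant = same_cluster(len_a=500, len_b=495, aligned_len=495, identical=490)

print(full_vs_domain)  # False
print(full_vs_mutant)  # True
```

In this sketch the domain-only construct is separated from the full-length protein even at the loosest identity cutoff, because the coverage requirement fails first.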

Grouping by UniProt Accession

Polymer chains in the result set are matched by the UniProt Accession associated with the polymer sequence.
Grouping results by this method places all search results that share a specific UniProt Accession in one group, regardless of whether the structure includes the entire protein, parts of the protein, mutations, or modifications.
A group defined by a UniProt Accession may therefore include polymer sequences and structures that match different domains of the complete protein and its variants.
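
The idea of grouping by accession can be sketched in a few lines. The entity IDs, accession labels, and the helper `group_by_accession` below are hypothetical and chosen only for illustration; the sketch simply collects entities under a shared key and leaves out entities that have no UniProt mapping.

```python
from collections import defaultdict

# Hypothetical polymer entities: (entity ID, UniProt accession or None, note).
entities = [
    ("AAAA_1", "ACC_A", "full-length protein"),
    ("BBBB_1", "ACC_A", "single domain only"),
    ("CCCC_1", "ACC_A", "point mutant"),
    ("DDDD_1", "ACC_B", "full-length protein"),
    ("EEEE_1", None, "designed peptide without a UniProt mapping"),
]


def group_by_accession(items):
    """Collect entity IDs under their accession; skip unmapped entities."""
    groups = defaultdict(list)
    for entity_id, accession, _note in items:
        if accession is None:
            continue  # no accession, so the entity belongs to no group
        groups[accession].append(entity_id)
    return dict(groups)


uniprot_groups = group_by_accession(entities)
print(uniprot_groups)
# {'ACC_A': ['AAAA_1', 'BBBB_1', 'CCCC_1'], 'ACC_B': ['DDDD_1']}
```

Unlike a coverage-based sequence comparison, the full-length protein, the single domain, and the mutant all stay in one group here because only the accession is compared.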

Displaying Grouped Results

Grouped search results may be viewed as:

Groups - each group shown on the grouped results page has a page summarizing properties of the group members that can be explored.
Representatives - a list of representative members of each group. All other structures in the group are hidden. For the search and grouping criteria used, this is the set of non-redundant matches.

Groups View

This view displays a summary of each of the groups in the grouped search results.

Each group in the list displays a few features relevant to the group.
The criteria for selecting the representative for this grouped result may be changed using the options available in the pulldown menu in the center of the page. These options include:

Resolution: Best - the structure with the best (highest) resolution, i.e., the lowest resolution value reported for the refined structural model. Structures with no resolution rank below structures that have an assigned resolution value.
Entry All Residues: Most - the largest total count of residues (e.g., amino acids) for all polymer entity instances reported per deposited structure model.
Entry Modeled Residues: Most - the largest total count of residues (e.g., amino acids) with reported coordinate data for all polymer entity instances reported per deposited structure model.
Entry Chain Count: Most - the largest total count of polymer entity instances per deposited structure model.
Score: Best - the most relevant for a given search query.
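
How the choice of criterion changes the representative can be illustrated with a small sketch. The member records, field names, and helper functions here are hypothetical; only the ranking rules (lower resolution value is better, missing resolution ranks last, larger residue count wins) follow the descriptions above.

```python
import math

# Hypothetical group members; resolution is in angstroms, None when not reported.
members = [
    {"id": "AAAA", "resolution": 2.1, "modeled_residues": 540},
    {"id": "BBBB", "resolution": None, "modeled_residues": 610},
    {"id": "CCCC", "resolution": 1.6, "modeled_residues": 275},
]


def best_resolution(group):
    """Pick the member with the lowest resolution value; missing values rank last."""
    return min(
        group,
        key=lambda m: math.inf if m["resolution"] is None else m["resolution"],
    )


def most_modeled_residues(group):
    """Pick the member with the largest count of modeled residues."""
    return max(group, key=lambda m: m["modeled_residues"])


print(best_resolution(members)["id"])        # CCCC
print(most_modeled_residues(members)["id"])  # BBBB
```

The same group yields two different representatives, which is why the selected criterion matters when a representative set is exported for further analysis.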

The groups displayed here may be sorted using criteria available from the pulldown list on the right of the page, including:

Group Score: the average of the relevance scores of the group members matching the search query
Matched Count: the number of group members matching the search query
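
As a worked illustration of these two sort criteria, the sketch below computes both values for three made-up groups and orders the groups by each. The group labels and relevance scores are invented, and sorting in descending order is an assumption.

```python
from statistics import mean

# Hypothetical relevance scores of the members of each group that matched the query.
matched_scores = {
    "group_1": [0.9, 0.7],
    "group_2": [0.8, 0.8, 0.8, 0.4],
    "group_3": [1.0],
}

# Group Score = average relevance of matched members; Matched Count = how many matched.
summary = {
    name: {"group_score": mean(scores), "matched_count": len(scores)}
    for name, scores in matched_scores.items()
}

by_score = sorted(summary, key=lambda g: summary[g]["group_score"], reverse=True)
by_count = sorted(summary, key=lambda g: summary[g]["matched_count"], reverse=True)

print(by_score)  # ['group_3', 'group_1', 'group_2']
print(by_count)  # ['group_2', 'group_1', 'group_3']
```

A group with a single highly relevant match can lead the list when sorting by Group Score, while a large group with weaker matches leads when sorting by Matched Count.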

Each group listed on this page displays a few key pieces of information about the group.

Group name is the most frequent polymer name in the group.

When grouping structures by sequence identity, the group names are inferred: the name of the most frequent protein is used as the group name, so it may not fully describe every member of the group. The message “Accounts for % of matches” and its tooltip explain this. For example, the name of the group shown in Figure 5, panel B, is shared by 98% of the members of that group.
When grouping results by sequence clustering, the same proteins from different species may be grouped into different groups because of the level of their sequence identity. In this case, multiple groups of the search results may be assigned the same group name.

Group ID is shown if the grouping criteria use either the PDB Deposit Group ID (panel A) or UniProt Accession (panel C). No group ID is shown for groups formed by sequence clusters because this identifier is subject to change over time.
Group size is the number of entries in the group. The size is based on the archive content for the grouping option. For example, when grouping by UniProt Accession, the group size is the total number of structures in the PDB archive that are mapped to that UniProt Accession. Note that sequences without a UniProt assignment cannot belong to any group based on UniProt Accession, so they are left out of the grouping.
Matched count refers to a subset of a whole group that matches your search query.

The group size and matched count can be the same if your search returns all the members of a given group. Otherwise, the matched count will be smaller, indicating that some members of the whole group were filtered out by the search.
Clicking on the hyperlinked matched count allows exploration of the matched group members.
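
The way an inferred group name and its accompanying percentage relate to each other can be sketched as follows. The polymer names and the helper `infer_group_name` are hypothetical, and computing the share over the listed member names is an assumption about how the percentage is derived.

```python
from collections import Counter

# Hypothetical polymer names of the 50 members of one sequence identity group.
names = ["Protein kinase X"] * 49 + ["Kinase X catalytic domain"]


def infer_group_name(member_names):
    """Return the most frequent name and the percentage of members that carry it."""
    name, count = Counter(member_names).most_common(1)[0]
    share = round(100 * count / len(member_names))
    return name, share


group_name, share = infer_group_name(names)
print(f"{group_name} (accounts for {share}% of matches)")
# Protein kinase X (accounts for 98% of matches)
```

The one member with a different name still belongs to the group; the percentage only signals that the displayed name does not describe every member.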

An image next to the group summary depicts the structure of the best example. The stack icon in the upper left corner of the image helps to visually indicate a group.

Group views of a search result showing grouping by (A) PDB deposit Group ID; (B) sequence identity clusters; and (C) UniProt Accession. The Group name and contents link is shown in red outlined boxes, while the Group size is shown in a blue outlined box.

Hyperlinks from the group display can open Group Summary pages to explore the sequences, structures, and other properties of group members.

Representatives View

This view lists only representatives of the grouped search results.

The number of structures listed here is usually smaller than that of the complete search results.
The criteria for selecting the representative for this grouped result may be changed using the options available in the pulldown menu in the center of the page. These options include:

Resolution: Best - the structure with the best (highest) resolution, i.e., the lowest resolution value reported for the refined structural model. Structures with no resolution rank below structures that have an assigned resolution value.
Entry All Residues: Most - the largest total count of residues (e.g., amino acids) for all polymer entity instances reported per deposited structure model.
Entry Modeled Residues: Most - the largest total count of residues (e.g., amino acids) with reported coordinate data for all polymer entity instances reported per deposited structure model.
Entry Chain Count: Most - the largest total count of polymer entity instances per deposited structure model.
Score: Best - the most relevant for a given search query.

The entire list of representatives may be sorted by a number of criteria available from the pull down list on the right of the page. These include the same options available for sorting the full search results (see here).

Examples

Exploring all protein targets of the drug, Imatinib

The small molecule drug Imatinib (Gleevec; chemical component ID STI) is used for treating various types of cancer, including chronic myeloid leukemia, acute lymphoblastic leukemia, aggressive systemic mastocytosis, and metastatic malignant gastrointestinal stromal tumors. Examine structures of STI-protein complexes in the PDB archive to explore its interactions with various target proteins.

Find distinct proteins from structures that include STI as a standalone ligand.