SCOWLP METHODOLOGY


1) Extraction of interfaces and contacting domains

We label all structural units of the PDB in the following way:

  • SCOP database definitions are considered domains
  • atomic chains (ATOM) not defined in SCOP and shorter than 90 residues are considered peptides
  • DNA/RNA chains are considered nucleic acids; DNA and RNA are distinguished by the O2' group
  • HET-tagged molecules from the PDB that appear in a hand-curated list are considered saccharides
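The labelling rules above can be sketched as a small function. This is only an illustration: the dictionary keys (`scop_domain`, `record`, `polymer`, `n_residues`, `atom_names`, `het_id`) and the `saccharide_ids` set are hypothetical names chosen for this example, not the SCOWLP schema.

```python
def label_unit(unit, saccharide_ids):
    """Label one structural unit following the four rules listed above.

    `unit` is a hypothetical dict; its keys are illustrative only.
    `saccharide_ids` stands in for the hand-curated saccharide list.
    """
    if unit.get("scop_domain"):
        return "domain"
    if unit["record"] == "ATOM" and unit["polymer"] == "protein":
        # The text only covers non-SCOP chains shorter than 90 residues.
        return "peptide" if unit["n_residues"] < 90 else None
    if unit["record"] == "ATOM" and unit["polymer"] == "nucleic":
        # RNA carries a 2'-hydroxyl oxygen (O2'); DNA does not.
        kind = "RNA" if "O2'" in unit["atom_names"] else "DNA"
        return f"nucleic acid ({kind})"
    if unit["record"] == "HETATM" and unit["het_id"] in saccharide_ids:
        return "saccharide"
    return None
```

Units that match none of the rules return `None` here, since the text does not say how they are handled.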

Only amino acid residues and water molecules located in the intersection of the structural unit shapes are potential interactors. We apply atom-type and distance criteria to compute interactions between structural unit pairs at the physicochemical level. For hydrogen bonds we apply a donor-acceptor distance of ≤ 3.6 Å. For salt bridges we apply a distance criterion of ≤ 4 Å. Van der Waals interactions are defined by hydrophobic atoms at van der Waals radii distance.
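As a minimal sketch of how the two numeric distance criteria could be applied to a pair of atoms: the atom representation (a dict with `xyz`, `role` and `charge`) is an assumption made for this example, and the van der Waals criterion is left out because the text gives no numeric threshold for it.

```python
from math import dist

# Cut-offs quoted in the text, in Å.
HBOND_MAX = 3.6
SALT_BRIDGE_MAX = 4.0

def classify_contact(atom_a, atom_b):
    """Return the list of interaction types satisfied by two atoms.

    Each atom is a hypothetical dict with keys 'xyz' (coordinates in Å),
    'role' ('donor', 'acceptor' or None) and 'charge' (+1, -1 or 0).
    """
    d = dist(atom_a["xyz"], atom_b["xyz"])
    found = []
    # Hydrogen bond: one donor and one acceptor within 3.6 Å.
    if {atom_a["role"], atom_b["role"]} == {"donor", "acceptor"} and d <= HBOND_MAX:
        found.append("hydrogen_bond")
    # Salt bridge: opposite formal charges within 4 Å.
    if atom_a["charge"] * atom_b["charge"] < 0 and d <= SALT_BRIDGE_MAX:
        found.append("salt_bridge")
    return found
```

A pair may satisfy both criteria at once, so the function returns a list rather than a single label.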

2) Pair-wise structural alignments (PSAs)

We performed all-against-all PSAs of the contacting units for each family to be able to measure the similarity among binding regions. The alignments were performed with the MAMMOTH program, taking the Cα atoms into account and using a gap penalty function for opening and extension. The root-mean-squared deviation (RMSD) was not considered for measuring the similarity between two interfaces, as the superimposed members of the same family share a common structure.

3) Similarity Index (Si)

The residues described in SCOWLP as forming an interface were mapped onto the domain-pair structural alignment. We calculated a similarity index (Si) based on the number of interacting residues that overlap and the length of both interacting regions. Interacting residues located in gap regions of the structural alignment were excluded.
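The text does not give the exact Si formula, so the sketch below is an assumption: it uses a Dice-like ratio, twice the number of overlapping interacting residues divided by the summed sizes of the two interacting regions, after dropping residues that fall in alignment gaps. The function and argument names are hypothetical.

```python
def similarity_index(interface_a, interface_b, aligned_pairs):
    """Dice-like overlap of two interfaces on a structural alignment.

    interface_a / interface_b: sets of interacting residue numbers.
    aligned_pairs: iterable of (residue_in_a, residue_in_b) aligned positions;
    residues absent from it are treated as lying in gap regions.
    NOTE: the formula is an assumed stand-in, not the published definition.
    """
    a_to_b = dict(aligned_pairs)
    aligned_b = set(a_to_b.values())
    # Exclude interacting residues located in gap regions.
    a_kept = {r for r in interface_a if r in a_to_b}
    b_kept = {r for r in interface_b if r in aligned_b}
    # An overlap is an interacting residue of A aligned to one of B.
    overlap = sum(1 for r in a_kept if a_to_b[r] in b_kept)
    total = len(a_kept) + len(b_kept)
    return 2 * overlap / total if total else 0.0
```

With this choice Si ranges from 0 (no shared interacting positions) to 1 (identical interacting positions).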

4) Clustering binding regions

We cluster the binding regions of each SCOP family using an agglomerative hierarchical algorithm, in the following steps:

  • Define as a cluster each contacting domain.
  • Find the closest pair of clusters and merge them into a single cluster.
  • Re-compute the distances between the new cluster and each of the remaining clusters.
  • Repeat the previous two steps until all contacting domains are clustered into a single cluster.

To re-compute the distances we used the complete-linkage method, which takes the similarity between two clusters to be the minimum similarity between any pair of members, one from each cluster (equivalently, the maximum pairwise distance).
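The clustering steps can be sketched as follows. This is a plain, unoptimised illustration working directly on similarities; `items` and the pairwise function `sim` are hypothetical inputs, and the cluster-to-cluster similarity is recomputed from the members at each round instead of updating a distance matrix, which gives the same result for complete linkage.

```python
from itertools import combinations

def complete_linkage(items, sim):
    """Agglomerative clustering with complete linkage on similarities.

    items: hashable identifiers of contacting domains.
    sim:   function (x, y) -> pairwise similarity, e.g. Si.
    Returns the merge history as (cluster_a, cluster_b, similarity) tuples.
    """
    # Step 1: every contacting domain starts as its own cluster.
    clusters = [frozenset([i]) for i in items]
    merges = []

    def cluster_sim(a, b):
        # Complete linkage: the least similar pair decides.
        return min(sim(x, y) for x in a for y in b)

    while len(clusters) > 1:
        # Step 2: the closest pair is the one with the highest similarity.
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        merges.append((a, b, cluster_sim(a, b)))
        # Step 3: replace both clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges
```

The returned merge history is enough to draw a dendrogram or to recover the clusters at any similarity threshold.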

5) Binding region definition by Si cut-offs

The result of the clustering can be represented as an intuitive tree or dendrogram, which shows how the individual contacting domains are successively merged at greater distances into larger and fewer clusters. The final PBRs depend on the Si cut-off that is chosen. We pre-calculated the results for Si cut-offs of 0, 0.1, 0.2, 0.3 and 0.4 to offer a range of values that allows flexibility in the final analysis of PBRs. The SCOWLP web application can display the classification at any of these cut-off values.
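A minimal sketch of how clusters could be read off the tree at a given cut-off. It assumes a merge history of `(cluster_a, cluster_b, similarity)` tuples ordered from most to least similar, and assumes that a merge is kept only when its similarity is strictly greater than the cut-off; both the data layout and the strict comparison are assumptions of this example.

```python
def cut_tree(items, merges, cutoff):
    """Return the clusters obtained by keeping merges with similarity > cutoff.

    items:  identifiers of the contacting domains.
    merges: list of (frozenset, frozenset, similarity), ordered by
            decreasing similarity as produced by complete linkage.
    """
    clusters = {frozenset([i]) for i in items}
    for a, b, s in merges:
        if s <= cutoff:
            break  # all later merges are at or below the cut-off too
        clusters -= {a, b}
        clusters.add(a | b)
    return clusters

# Hypothetical three-domain example evaluated at the five cut-offs.
example_merges = [
    (frozenset("A"), frozenset("B"), 0.8),
    (frozenset("AB"), frozenset("C"), 0.1),
]
for cutoff in (0, 0.1, 0.2, 0.3, 0.4):
    print(cutoff, sorted(sorted(c) for c in cut_tree("ABC", example_merges, cutoff)))
```

Higher cut-offs keep fewer merges and therefore yield more, smaller binding regions.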

6) Interface definitions

To differentiate binding regions having single interfaces from those having multiple interfaces, we identified the partner of each contacting domain in each binding region. A binding region was divided into sub-clusters when different domain families interact in the same binding region.
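The sub-clustering by partner can be illustrated with a simple grouping. The input layout, a list of `(contacting_domain, partner_family)` pairs, and all identifiers in the usage lines are hypothetical placeholders.

```python
from collections import defaultdict

def split_by_partner_family(binding_region):
    """Group the contacting domains of one binding region by partner family.

    binding_region: iterable of (contacting_domain, partner_family) pairs.
    Returns {partner_family: [contacting_domain, ...]}.
    """
    sub_clusters = defaultdict(list)
    for domain, partner_family in binding_region:
        sub_clusters[partner_family].append(domain)
    return dict(sub_clusters)

def is_multi_interface(binding_region):
    """A binding region contacted by more than one partner family."""
    return len(split_by_partner_family(binding_region)) > 1

# Placeholder identifiers, not real SCOP entries.
region = [("dom1", "famX"), ("dom2", "famX"), ("dom3", "famY")]
print(split_by_partner_family(region))
print(is_multi_interface(region))
```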

PUTATIVE BINDING REGIONS


SCOWLP binding region clusters are taken at zero similarity for the first five classes of SCOP. A protein representative is taken per family that includes all family binding regions mapped on its sequence.

  • We performed all-against-all non-sequential structural alignments for all protein family representatives.
  • Binding region conservation

    For each binding region, the interacting residues are mapped onto the structure-based sequence alignment. We also calculate the solvent accessibility for both representative proteins to distinguish the residues located in the core region from the solvent-exposed ones, since core residues cannot participate in recognition. We used NACCESS to calculate the solvent accessibility of each residue using a probe sphere of radius 1.4 Å. A residue is considered accessible if its total relative accessible surface area (RSA) is more than 5%. We calculate the binding region conservation (BRC) as the ratio between the number of interacting residues located in structurally aligned regions that are also solvent exposed and the total number of interacting residues.
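The BRC ratio can be sketched as below. The inputs are hypothetical: `aligned` maps residues of the protein carrying the binding region to their aligned counterparts, and `rsa_a` / `rsa_b` hold per-residue relative accessibilities in percent, such as values parsed from NACCESS output. Requiring the aligned counterpart to be exposed as well is one reading of "both representative proteins" and is an assumption of this example.

```python
RSA_THRESHOLD = 5.0  # percent; a residue is accessible above this value

def binding_region_conservation(interacting, aligned, rsa_a, rsa_b):
    """Fraction of interacting residues that are aligned and solvent exposed.

    interacting: residues of protein A forming the binding region.
    aligned:     dict residue_in_A -> residue_in_B (gap residues absent).
    rsa_a/rsa_b: dict residue -> relative accessible surface area in %.
    """
    if not interacting:
        return 0.0
    conserved = 0
    for res in interacting:
        partner = aligned.get(res)
        if partner is None:
            continue  # residue falls in an unaligned region
        # Assumed: both the residue and its counterpart must be exposed.
        if rsa_a.get(res, 0.0) > RSA_THRESHOLD and rsa_b.get(partner, 0.0) > RSA_THRESHOLD:
            conserved += 1
    return conserved / len(interacting)
```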

  • P-value estimation

    We assess the statistical significance of the BRC by estimating the p-values under the null hypothesis that two random protein families do not contain conserved binding regions. The estimation was carried out by calculating the BRC of 10^5 randomly selected samples of protein representative pairs and a binding region for each pair. The distribution of these scores was used to estimate the p-values, obtained as (r+1)/(n+1), where n is the number of samples that have been simulated (10^5) and r is the number of these replicates that have a score greater than or equal to the BRC value for which we are estimating the p-value [41]. Note that, as the sampling procedure may include undetermined cases of similar binding regions from the alternate distribution, these p-values are likely to underestimate the true significance, i.e. in some instances the real p-values will be much more significant. In a pairwise ns-SA, a binding region is inferred from one protein family to another if the conservation significance has a p-value ≤ 0.05.
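The (r+1)/(n+1) estimate and the inference threshold can be written directly. The function names are hypothetical, and the null scores in the usage lines are made-up numbers used only to show the arithmetic.

```python
def empirical_p_value(observed_brc, null_brcs):
    """Empirical p-value (r + 1) / (n + 1).

    n: number of simulated (random) BRC scores.
    r: number of those scores greater than or equal to the observed BRC.
    """
    n = len(null_brcs)
    r = sum(1 for score in null_brcs if score >= observed_brc)
    return (r + 1) / (n + 1)

def is_inferred(observed_brc, null_brcs, alpha=0.05):
    """A binding region is inferred when the p-value is <= alpha."""
    return empirical_p_value(observed_brc, null_brcs) <= alpha

# Toy null distribution: 4 simulated scores, one of them >= 0.5.
print(empirical_p_value(0.5, [0.1, 0.2, 0.3, 0.9]))  # (1 + 1) / (4 + 1)
```

Adding 1 to both numerator and denominator keeps the estimate from ever being exactly zero, so the smallest attainable p-value is 1/(n+1).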

  • Clustering of inferred binding regions

    The inferred binding regions (iBR) and the known family binding regions (kBR) were collected for each family. Since binding inferences may occupy equivalent surface regions in the family, we re-clustered the binding regions in a way similar to that described above for SCOWLP. To make sure that the kBR clusters obtained from SCOWLP are not modified in this process, we set the similarity between these initial kBR to zero. Three distinguishable cluster types were obtained: 1) those that contained only one kBR, 2) those that contained only iBR (putative binding regions), 3) those that contained both.
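One way to picture the zero-similarity constraint and the three cluster types: wrap the pairwise similarity so that any two kBR score zero, then label each resulting cluster by its content. Function names and identifiers are hypothetical.

```python
def mask_known_pairs(sim, known):
    """Wrap a similarity function so that two known regions (kBR) score 0."""
    def masked(x, y):
        if x in known and y in known:
            return 0.0
        return sim(x, y)
    return masked

def cluster_type(cluster, known):
    """Label a cluster by whether it holds kBR, iBR, or both."""
    n_known = sum(1 for member in cluster if member in known)
    if n_known == 0:
        return "iBR only (putative)"
    if n_known == len(cluster):
        return "kBR only"
    return "kBR and iBR"
```

Under complete linkage, a cluster pair that would bring two kBR together then has a minimum similarity of zero, so two kBR cannot be joined by any merge made at a similarity above zero.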

CONTACT


Structural Bioinformatics Group. SCOWLP has been developed by Dr. Joan Teyra, Sven Schreiber and Dr. MT Pisabarro at BIOTEC of the TU-Dresden, Germany. All comments, suggestions, corrections and advice should be sent to:

DOWNLOAD


The tables composing the SCOWLP database can be downloaded here:

Here, you can download a compressed file in sql format of the results obtained using the methodology explained in the paper: "Studies on the inference of protein binding regions across fold space based on structural similarities"