Introduction

The analysis described here relates only to distance constraints derived from NOE data, with a base set of 1834 NMR PDB entries containing 1909 protein chains. Only constraints between protons in protein chains were retained for analysis. For validation purposes, the base set was further divided into subsets for entries that contain intra-residue constraints and for entries where all the original constraint and coordinate information was recognised and linked. A coordinate data set based on the original coordinate files was also generated and used for comparison. This article explores some of the issues surrounding distance constraints and the NMR data they are derived from, and aims to highlight the importance of depositing the constraint lists used for structure calculations along with the molecular coordinates.

Materials and methods

Table 1 Overview of the available data sets used in the analysis

Set  Name  Data type    Details
1    HP    Constraints  Base set
2    HIP   Constraints  Intra-residue
3    AHP   Constraints  High assignment
4    AHIP  Constraints  Intra-residue, high assignment
5    HPC   Coordinates  Original coordinate data

Entries were excluded from a data set for one of four reasons: no valid protein chains, no valid constraint lists, insufficient linking, or insufficient valid constraints.

Fig. 1 Overview of the workflow employed in the analysis.
Grey boxes indicate files, white boxes Python scripts.

Table 2 Overview, for each data set, of the number of removed and analysed entries

                                HP    HIP   AHP   AHIP
No valid protein chains         396   396   396   396
No valid constraint lists       310   409   310   409
Insufficient linking            48    37    677   635
Insufficient valid constraints  55    55    55    55
Total included entries          1834  1746  1203  1146
Total included chains           1909  1817  1252  1192

http://www.ebi.ac.uk/msd-srv/docs/NMR/analysis/results/html/comparison.html

In the analysis, a residue is marked as ‘assigned’ when at least one proton belonging to it is linked to a constraint. The total number of times a particular inter-atomic contact is observed can be a fraction, as for ambiguous constraints each constraint item contributes a fraction of 1 to the total:

$$ f_{\text{contribution}} = \frac{1}{n_{\text{items}}} \quad n_{\text{actual}} = \sum\limits_{i = 0}^{n_{\text{constraints}}} f_{\text{contribution},i} $$

$$ f_{\text{occurrence}} = \frac{n_{\text{actual}}}{n_{\text{possible}}} $$

$$ \text{ambiguity} = 1 - \frac{n_{\text{actual}}}{n_{\text{dist}}} $$
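The fractional counting described above can be illustrated with a short Python sketch. It is not the script used in the analysis: the data model (a constraint as a plain list of contact labels) and the function names are hypothetical, and n_dist is assumed here to be the number of distinct constraints in which a contact appears, which the surviving text does not confirm.

```python
from collections import defaultdict
from fractions import Fraction


def count_contacts(constraints):
    """Return (n_actual, n_dist) dictionaries keyed by contact label.

    `constraints` is a list of constraints; each constraint is a list of
    contact labels (its items). An unambiguous constraint has one item.
    """
    n_actual = defaultdict(Fraction)  # weighted count, may be fractional
    n_dist = defaultdict(int)         # ASSUMPTION: constraints containing the contact
    for items in constraints:
        if not items:
            continue
        # Each item of a constraint contributes 1 / n_items to the total.
        f_contribution = Fraction(1, len(items))
        for contact in items:
            n_actual[contact] += f_contribution
        for contact in set(items):
            n_dist[contact] += 1
    return dict(n_actual), dict(n_dist)


def f_occurrence(n_actual, n_possible):
    """Fraction of the possible occurrences that were actually observed."""
    return n_actual / n_possible


def ambiguity(n_actual, n_dist):
    """0 when every observation is unambiguous, approaching 1 otherwise."""
    return 1 - n_actual / n_dist


# Hypothetical example: contact "A" occurs alone, in a 2-item constraint
# and in a 4-item constraint.
example = [["A"], ["A", "B"], ["A", "B", "C", "D"]]
n_actual, n_dist = count_contacts(example)
```

With this example, contact "A" receives 1 + 1/2 + 1/4 = 7/4 weighted observations from three constraints, giving an ambiguity of 1 - (7/4)/3 = 5/12. Exact fractions are used so that the sums do not accumulate floating-point error.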
$$ \text{uniqueness} = \frac{n_{\text{actual,ss}}}{n_{\text{actual}}} $$

Results

Information from analysis

Contact analysis (i–i, i–i + 1, i–i + n, i–i + 6<)
Protein secondary structure analysis
Residue atom analysis
Unassigned fragments breakdown
Fragment analysis by residue

Fig. 2 http://www.ebi.ac.uk/msd-srv/docs/NMR/analysis/results/html/comparison.html

Validation of data sets

Fig. 3 f_occurrence

Table 3 f_occurrence

Type  Atom 1  Atom 2  α-helix  β-sheet  No secondary structure
i i
  H Hα 0.50 0.51/1.00 0.43
  H Hβ* 0.73 0.68 0.57
  Hα Hβ* 0.32/1.00 0.33/1.00 0.29
i i
  H H 0.84 0.50 1.00 0.57
  H Hβ* 0.22 0.99 0.07/0.69 0.10/
  Hα H 0.52 0.90 0.67
  Hα Hβ* 0.03/0.42 0.30 1.00 0.93
i i
  H H 0.47 0.99 0.00/0.00 0.16/0.45
  H Hβ* 0.09 0.79 0.00 0.25
  Hα H 0.28 0.97 0.04/0.04 0.60
  Hα Hβ* 0.63 0.00 0.27
  Hβ* H 0.98 0.45 0.17/0.77
  Hβ* Hβ* 0.02/0.12 0.70 0.51
i i
  H H 0.09 0.86 0.12 0.16
  H Hβ* 0.12 0.84 0.12 0.22
  Hα H 0.66 0.99 0.00/0.04 0.14/0.28
  Hα Hβ* 0.58 0.99 0.00/0.04 0.14/0.33
  Hβ* H 0.14 0.91 0.16 0.06/0.34
  Hβ* Hβ* 0.97 0.12 0.46
i i
  H H 0.02/0.01 0.00/0.00 0.02/0.03
  H Hβ* 0.01 0.05/0.09 0.02/0.07
  Hα H 0.32 0.94 0.00/0.00 0.07/0.16
  Hα Hβ* 0.79 0.00 0.23
  Hβ* H 0.09 0.85 0.09 0.22
  Hβ* Hβ* 0.93 0.09 0.37

In the secondary structure columns, the first value is from the HP set, the second from the HPC set.
A bold value indicates a contact that occurs significantly more than average, an italic value significantly less.

Contact data highlights

Table 4 f_occurrence

Type  Atom 1  Atom 2  α-helix  β-sheet  No secondary structure
i i
  Trp Hε3 Gly H 0.05/0.45 0.56 0.88 0.24/0.63
  Trp Hε3 Phe Hα 0.55/0.68 0.11/0.16 0.13/0.47
i i
  Thr Hγ2* Tyr Hε* 0.02 0.00 0.80 0.12 0.41
i i
  Ala Hα Ile Hδ1 0.66 0.93 0.00 0.00 0.11 0.21
  Xxx Hα Ile Hδ1 0.57 0.92 0.00 0.02 0.09 0.22
i i
  Trp Hζ2 Thr Hγ2 0.52 0.56 0.00/0.00 0.00
  Tyr Hε1 + Hε2 Val Hγ2 0.47 0.74 0.00/0.05 0.06

In the secondary structure columns, the first value is from the HP set, the second from the HPC set. A bold value indicates a contact that occurs significantly more than average, an italic value significantly less.

Table 5 Brief overview of general trends in joint secondary structure information for all contacts

Type  SS  H (i)  Hα (i)  Hβ (i)  H Hα Hβ  H Hα Hβ  H Hα Hβ
i–i  Helix  . . + . . . − − + . − −
     Sheet  . . + . . . . . + . + . + . + .
i–i  Helix  + . − + + − − − − + . + + + +
     Sheet  − − − − + . + + + + + . − − − −
i–i  Helix  + + − − + + + + − − . + + + − − − −
     Sheet  − − − − − − − − − − − + . + +
i–i  Helix  + + − + + + + + + + + + + − − + +
     Sheet  − − + + − − − − − − − − − − − −
i–i  Helix  . . − − − − ++ . . + + + + − − . +
     Sheet  ++ + + + . − − − − − + . −
i–i  Helix  − − − − − − − − − − − − − − − − − −
     Sheet  + + + + + + + + + + + + + + + + + +

+ indicates that signals are observed more than average, − less than average, and . signifies that there is no trend. The first character in each cell contains the constraint HP set information, the second the coordinate HPC set.

Residue atom analysis

Unassigned fragments breakdown

Fragment analysis by residue

Table 6 Selected sequence fragments where the central residue is often unassigned.
The unassigned percentages are relative to the total number of times the fragment occurs.

Fragment     Unassigned (%)  Total
Xxx–His–His  62              756
Xxx–His–Met  5               83
Gly–His–Xxx  2               34
His–His–Xxx  63              740
Ser–His–Xxx  8               114
Xxx–Ser–Gly  18              365
Gly–Ser–Xxx  19              479
Xxx–Pro–Ser  8               157

Discussion

In this analysis only the original data as deposited by the authors were used, and no attempt was made to ‘clean up’ or further interpret this information, except for linking the constraint data with the coordinate data and removing identical sequences from the data set pool (where only the entry with the highest number of constraints linked to atoms was kept). This approach is intentional, as it best represents the quality and extent of the data that is currently deposited at the PDB. Only the distance constraint information was included in the analysis, and the information from dihedral, H-bond and RDC constraints was ignored. Even though these constraints contain important structural information, they were, as experimental data, recorded independently from the NOE data. They are used in the structure determination process, however, and it was not investigated whether their presence influences the quality of the final distance constraint lists. Several other issues can still be addressed. Although addressing them is likely to improve some aspects of this type of study, it is also important to start with the original information so that a comparison point is available.

Fig. 4 Distance distribution from the constraint information for sequential Ala–Ala contacts between backbone H protons

In this analysis a particular inter-atomic contact between two residues from one PDB entry is either observed or not observed. The reason why a contact is observed (or not) implicitly includes distance information, peak overlap, water exchange line broadening, and all other factors that can lead to not observing or assigning a contact during analysis of a spectrum.
This is different from the traditional meaning of an ‘assigned atom’ on the chemical shift level, where it means that the chemical shift value for the resonance that arises from the atom is known. However, this does not necessarily mean that these assigned atoms produce any valid inter-atomic distance information. Thus, an ‘assigned atom’ (or residue) on the constraint level means that a chemical shift assignment also produced useful and valid information related to the inter-atomic distances within the molecule.

Also of interest is the relationship between the information that comes directly from the deposited constraints and the information that comes from the deposited coordinates. Here, the constraint information is compared to the distances from the originally deposited coordinates. Although a set of recalculated coordinates (as in RECOORD) or X-ray structures could equally well have been used, the originally deposited coordinates were chosen because they should best match the content of the constraint lists. All comparisons between constraint and coordinate information are intended for informative purposes only: the constraints represent the experimental NMR side of the information contained in the coordinates, and are in effect only a subset of the information contained therein. However, whether a particular NOE contact is observed cannot be dependably determined from an NMR structural ensemble, whereas it is trivial to determine from the constraints, because these inherently reflect NMR-specific effects such as signal overlap and dynamics.

Fig. 5 f_occurrence

Conclusion

A resource is now available where it is possible to check how likely a particular contact is when assigning NOESY spectra, or if a particular sequence fragment is likely to be difficult to assign. In this respect it formalises information that scientists with experience in spectrum analysis are aware of but cannot quantify.
The amount of information provided here is extensive, however, and becomes even more useful when used as ‘knowledge based’ probabilities in automatic assignment strategies, as a filter and/or validation for ambiguous constraint possibilities, and as a tool to rank assignment possibilities in spectrum analysis programs. These uses are being implemented as part of the CCPN framework.

Finally, the NMR constraint lists encompass the experimental NMR data encoded in the NMR structural ensembles, and comprise a single set of data that is much easier to analyse than an ensemble of solutions. As such, they provide a reduced form of structural information that is relevant for NMR analysis only. For this reason, and to allow a basic level of scientific reproducibility and validation, it is important that constraints, and all other possible NMR derived information, are deposited along with the structure coordinates. It is very likely that much more information than is described in this article can be gained from the deposited data, which in turn can assist the NMR community and help in understanding the relationships between NMR and structure.
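The idea from the Conclusion of ranking assignment possibilities by ‘knowledge based’ probabilities can be sketched in Python as follows. This is only an illustration under stated assumptions and not the CCPN implementation: the function name, the key layout (contact type, atom 1, atom 2) and the occurrence values are hypothetical and are not taken from the tables in this article.

```python
def rank_candidates(candidates, occurrence, default=0.0):
    """Rank candidate contacts by their occurrence fraction.

    `occurrence` maps a candidate key to an f_occurrence value; candidates
    missing from the table receive `default`. Returns (candidate, weight)
    pairs, most likely first, with weights normalised to sum to 1 whenever
    at least one candidate has a non-zero score.
    """
    scores = [(occurrence.get(candidate, default), candidate)
              for candidate in candidates]
    total = sum(score for score, _ in scores)
    # sorted() is stable, so tied candidates keep their input order.
    ranked = sorted(scores, key=lambda pair: pair[0], reverse=True)
    return [(candidate, score / total if total > 0 else 0.0)
            for score, candidate in ranked]


# HYPOTHETICAL occurrence fractions, for illustration only.
occurrence = {
    ("i-i+1", "H", "H"): 0.5,
    ("i-i+3", "Hα", "Hβ*"): 0.3,
    ("i-i+4", "H", "H"): 0.2,
}

# Three possible assignments for one ambiguous NOESY peak.
candidates = [
    ("i-i+4", "H", "H"),
    ("i-i+1", "H", "H"),
    ("i-i+3", "Hα", "Hβ*"),
]
ranking = rank_candidates(candidates, occurrence)
```

Normalising the scores turns them into relative weights for a single ambiguous peak, which is one simple way to compare candidates. A real spectrum analysis program would presumably combine such weights with chemical shift matching and other evidence.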