Background

There are now several known structures for retroviral capsid proteins in the Protein Databank (PDB). While some of these are only fragmentary, a selection can be extracted that gives a reasonable phylogenetic spread across the retroviruses, with examples from three of the six genera of orthoretroviruses. In all the known structures, the CA protein has an all-α structure consisting of two domains: a larger N-terminal domain and a smaller C-terminal domain, with a short extended linker region between them. As this linker enters the C-terminal domain it incorporates the MHR. There is considerable variation in the orientation of the domains and in the conformation of the loop regions between α-helices, particularly in the N-terminal domain.

Results and Discussion

Table 1
(a) EIAV (b) RSV: SEED 1eia; 12084543 pdb-1E6J 0–210; 6358699 gb-AAF07324 131–342; 8072301 gb-AAF71968 0–155; 6815746 gb-AAF28696 0–173; 6649692 gb-AAF21520 5–224; 12084543 pdb-1E6J 3–210; 120850 sp-P18041 97–362; 27803398 gb-AAO21890 120–281; 294961 gb-AAA74706 116–381; 5106563 gb-AAD39752 81–346; SEED 1d1dA
(c) HIV-1 (d) HTLV-I: 6358699 gb-AAF07324 129–340; SEED 1qrjA; 22037894 gb-AAM90230 148–359; 12084543 pdb-1E6J 0–210; SEED 1e6jP; 22037894 gb-AAM90230 144–370; 532325 gb-AAA99545 50–224; 9886907 gb-AAG01643 0–222; 9886907 gb-AAG01643 0–211

MLV Modelling

Figure 1: MLV sequence alignment
Figure 2: MLV model agreement

N-terminal domain

Given that the alignment of the capsid protein sequences is ambiguous, superposing the models on the structures from which they were derived provides a better way to assess whether there is any significant sequence similarity that could serve as a basis for preferring one model over the others. The PSId values were 5.6, 18.5, 10.5 and 13.0 for 1d1dA, 1qrjA, 1eia and 1e6jP, respectively.
(No differences were observed between the standard version and the sequence-biased version of SAP.)

Figure 3: Consensus model for the MLV N-domain

C-terminal domain

With its relatively unambiguous MHR, all the models of the C-terminal domain were in complete agreement over the first half of the domain. The more C-terminal half, however, was less consistent, owing to its generally less conserved nature combined with uncertainty in the location of the terminus in some of the sequences. As the C-terminal domain has been shown to be less important in determining the property of virus susceptibility, no further effort was expended on refining the alignment at the carboxy terminus of the molecule, especially in the more difficult alignment of the FV1 sequence described below.

FV1 Modelling

Figure 4: FV1 model agreement
Figure 5: FV1 sequence alignment

N-terminal domain

Figure 6: Consensus model for the FV1 N-terminal domain

Table 2
(a) MLV
SEED           AAD55051       215–432
gi-120892      sp-P03330      207–423
gi-2393894     gb-AAC58239    206–434
gi-419481      pir-A46312     199–423
gi-323873      gb-AAA43041    203–418*
gi-7548235     gb-AAA4306     206–422
gi-5726238     gb-AAD48375    156–352
(b) FV1
gi-7521942     T29096
gi-23485357    gb-EAA20381.1
gi-3913713     sp-P70213 FV1 MOUSE

Control Models

Table 3
str \ seq   D1D         QRJ         E6J         EIA         params
D1D         0.60/78.9   4.14/90.0   2.05/56.7   2.41/38.3   9,10
QRJ         5.58/100.   0.32/100.   3.58/88.4   5.51/100.   3,10
E6J         1.88/95.2   1.70/84.7   0.31/100.   1.39/100.   7,10
EIA         2.00/93.3   2.00/100.   1.28/100.   0.35/100.   8,20

Figure 7: Control model uRMSd values

Conclusions

This study has shown that reasonable models can be constructed for both the FV1 protein and its target MLV protein based on other retroviral capsid proteins.
Although this result was suggested by the existence of the MHR in both sequences, the fluid nature of retroviral genomes does not necessarily constrain the preceding domain to remain constant in structure, or even to remain at all. Despite only weak sequence similarities in this region, the addition of multiple sequences with predicted secondary structure has allowed plausible models to be constructed.

Methods and Data

Data

Sequence Data

Structural Data

The common core of the N-terminal domains of these proteins (in the numbering of the PDB structure) was defined as: 1d1dA 15–148, 1qrjA 16–129, 1eia 16–145 and 1e6jP 16–146. These fragments will be distinguished below as 1d1dAn, 1qrjAn, 1eia-n and 1e6jPn, and each terminates 8 or 9 residues before the conserved glutamine of the MHR. The N-terminal domain can be described as having five α-helices (N1...N5) with a long 'disordered', partly helical, loop between helices N4 and N5. For ease of reference below, this region will be called the 'top' of the molecule and its representation in the Figures will preserve this orientation.

The C-terminal domains were defined as: 1d1dA 152–224, 1qrjA 132–204, 1eia 149–220 and 1e6jP 149–220, and were distinguished by the suffix "c". The common core of this domain consists of an extended strand leading into the MHR region, followed by four helices designated C1...C4.

Despite their different sizes, the N and C domains have the same fold, perhaps suggesting an ancient gene duplication. This is most obvious in the HIV structure [1e6jP], where the domains can be superposed with an unweighted (weighted) RMSd of 4.6 (2.0) over 68 residues.
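The distinction between unweighted and weighted RMSd quoted for the domain superposition can be illustrated with a minimal sketch. It assumes the two sets of α-carbon coordinates have already been superposed and paired one-to-one, and the per-pair weights are purely hypothetical: the weighting actually used is not specified in this section, so the values below only show how down-weighting poorly matched pairs lowers the figure.

```python
import math


def rmsd(coords_a, coords_b, weights=None):
    """RMSd over paired, already-superposed 3D points.

    With weights=None every pair counts equally (unweighted RMSd).
    Otherwise each squared distance is scaled by its weight and the sum
    is normalised by the total weight (weighted RMSd).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    if weights is None:
        weights = [1.0] * len(coords_a)
    if len(weights) != len(coords_a):
        raise ValueError("need one weight per coordinate pair")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sq = sum(
        w * sum((p - q) ** 2 for p, q in zip(a, b))
        for w, a, b in zip(weights, coords_a, coords_b)
    )
    return math.sqrt(weighted_sq / total_weight)


# Toy coordinates (not taken from any PDB entry): the third pair is 3 units apart.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]

unweighted = rmsd(model, target)
# Hypothetical weights that trust the divergent third pair less.
weighted = rmsd(model, target, weights=[1.0, 1.0, 0.1])
print(round(unweighted, 3), round(weighted, 3))
```

In this toy case the single divergent pair dominates the unweighted value, whereas the weighted value is much smaller, which is the same qualitative pattern as the 4.6 versus 2.0 reported above.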
Table 4 (upper-right triangle; lower-left triangle)

(a) structure
N C    D1D    QRJ    E6J    EIA
D1D     -     2.88   2.64   2.46
QRJ    3.83    -     2.11   2.39
E6J    6.21   4.76    -     1.40
EIA    5.75   4.31   3.22    -

(b) sequence
N C    D1D    QRJ    E6J    EIA
D1D     -     21.9   22.2   14.7
QRJ    12.5    -     30.6   26.4
E6J     8.6   10.8    -     41.7
EIA    11.6   12.5   22.0    -

Sequence Databank Searches

Secondary Structure Prediction

Each sequence in the alignment was taken in turn and used as a probe against this small local database. As the Ψ-BLAST parameters used by PsiPred were more restrictive than those used in the full search (only 3 cycles) and there are fewer sequences in the databank, each sequence may find only those to which it is more closely related. This introduces some variation into the predictions, which provides a useful indication of the confidence of each predicted secondary structure element (SSE).

Multiple Sequence Threading

Template Sequence Alignments

The MST program can incorporate multiple aligned sequences along with both the probe sequence and the template structure. The sequences for the template were gathered in the same manner as those for the probe sequence, using the Ψ-BLAST/QUEST search protocol described above. Each search against the NRDB was started with the sequence of the protein of known structure, and the resulting multiple alignments were examined 'by eye' in the light of the known secondary structures. If any large insert had been made in an SSE, it was assessed whether the gap could be shifted outside the SSE without significant loss of residue matches. Similarly, if a large insert (more than 6 residues) was made by any sequence other than the probe sequence (of known structure), the insert was reduced to six residues by removing the positions with the most gaps.

Parameter Choice

Measuring Model Agreement

Whatever the parameters for MST, all the models constructed from the same probe have an identical sequence.
These might therefore be compared using the α-carbon RMSd based on a one-to-one (100 PSId) sequence equivalence. However, with this simple measure, a 'trivial' shift in space in which, say, an α-helix shifts by one turn relative to another α-helix might result in a large RMSd between what are, topologically, similar models. It is better to allow a local relative shift in sequence of four residues (roughly one helical turn) to restore the spatial equivalence at the expense of residue identity.

While this procedure provides a general method for choosing parameter values, in the current application to a multi-domain protein it was not meaningful to calculate the RMSd over the full atomic model (because of relative domain movements). Instead, the agreement was calculated over the more distantly related N-terminal domain.

Selecting a consensus model

An alternative selection test was also considered: selecting the model that had the greatest sequence similarity when superposed on the template structure from which it was derived. As most of the sequence similarities considered below lie in the 'twilight zone', this option was used only when one model was clearly better than the others. For this, we chose the criterion that it had to be 10 PSId points clear of its 'rivals'.

Abbreviations

Fv1, Friend virus susceptibility 1; MLV, Murine Leukaemia Virus; CA, CApsid protein; HERV-L, Human Endogenous RetroVirus (L family); MuERV-L, Murine Endogenous RetroVirus (L family); MHR, Major Homology Region; NCBI, National Center for Biotechnology Information; NRDB, Non-Redundant DataBank; MST, Multiple Sequence Threading (program); SSE, Secondary Structure Element; PSId, Percent Sequence Identity; PDB, Protein DataBank; RMSd, Root-Mean Square deviation; wRMSd, weighted Root-Mean Square deviation; uRMSd, unweighted Root-Mean Square deviation.

Figure 8: Capsid protein domains
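The sequence-similarity criterion described under "Selecting a consensus model" (a model is preferred only if it is 10 PSId points clear of its rivals) can be sketched as follows. This is an illustration rather than the procedure actually coded for the study: the function name `select_clear_winner` is hypothetical, and treating a lead of exactly 10 points as sufficient is an assumption, since the text does not say whether the margin is inclusive.

```python
def select_clear_winner(psid_by_template, margin=10.0):
    """Return the template whose model has the highest PSId, but only if it
    leads the runner-up by at least `margin` PSId points; otherwise None.

    psid_by_template maps a template name to the PSId obtained when the
    model is superposed on that template. The inclusive comparison (>=)
    is an assumption.
    """
    if not psid_by_template:
        raise ValueError("need at least one template")
    ranked = sorted(psid_by_template.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best_name, best_psid), (_, second_psid) = ranked[0], ranked[1]
    return best_name if best_psid - second_psid >= margin else None


# PSId values reported in the text for the MLV N-terminal domain models.
mlv_n_domain = {"1d1dA": 5.6, "1qrjA": 18.5, "1eia": 10.5, "1e6jP": 13.0}
print(select_clear_winner(mlv_n_domain))
```

With the MLV N-terminal domain values, the best model (1qrjA, 18.5) leads the next best (1e6jP, 13.0) by only 5.5 points, so no single model is selected on this criterion.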