Introduction

http://www.nigms.nih.gov/Initiatives/PSI.htm

The most readily apparent contribution of structural genomics (SG) is the rapid expansion in the number of available protein structures, derived at a reduced cost because of the efficiency of specialized centers. Proper target selection is critical to ensure that the structures solved by SG centers are indeed valuable to the research and industrial communities, either because of the intrinsic interest of the proteins investigated, or because they improve the mapping of the protein structure universe by providing homologous structural models. A second important contribution of SG projects to the scientific community is the development of methods for efficient protein production and structure determination, which could be adopted in smaller research laboratories to improve productivity. Other scientific deliverables of structural genomics derive from the scale and nature of the operations, and include comparative studies on members of protein families, identifying determinants of specificity, deriving general rules, and improving the capability to predict protein structure and function from gene sequences.

The Structural Genomics Consortium (SGC), operating in the Universities of Oxford and Toronto and the Karolinska Institute, was initiated in 2003 to address needs of industrial and academic pharmaceutical research. The SGC investigates human and apicomplexan proteins; the targets are selected based on their potential as drug targets or involvement in disease processes. Technologically, the SGC focuses on interaction of proteins with small molecules (ligands, inhibitors, substrates and co-factors), and on coverage of protein families.
This report provides several examples of the impact of research undertaken at the Oxford node of the SGC, including methodology for high-throughput structure determination, generic means for ligand screening, selected examples of insight from specific structures, insights from family coverage, and the possibilities resulting from the availability of large numbers of purified protein samples. The other SGC nodes share the core technologies but investigate non-overlapping target areas. Finally, the scientific impact depends on dissemination of structural data. We describe a new platform for distribution of annotated protein structures, which aims to make these data more meaningful to an audience beyond the usual users of the PDB.

Methodology

Protein production

Table 1 Core protocols employed at the SGC

1. Source of DNA
   1. Sequence-verified cDNA clone collections.
   2. Synthetic DNA.
   3. RT-PCR, site-directed mutagenesis.
   4. Genomic (microbial).
2. Cloning
   Ligation-independent cloning. Recombinase-based cloning (e.g., Gateway, InFusion).
3. Expression vectors and hosts
   Bacterial vectors: T7 promoters, controlled by Lac repressor. N-terminal hexahistidine tag, cleavable by specific proteases (TEV, Thrombin, C3). Host strains based on BL21(DE3), often expressing rare-codon tRNAs or chaperone proteins.
4. Eukaryotic expression
   Baculovirus-infected insect cells.
5. Protein expression
   Rich media, grow at 37°C to mid-log, then induce at low temperature with IPTG. OR: Similar protocol using minimal medium for selenomethionine or isotopic labelling.
6. Purification
   Two-step purification: affinity chromatography, gel filtration, all in high-salt buffers (0.5 M NaCl). Optional: tag cleavage and re-purification.
7. Ligand and buffer screening
8. Crystallization
   Initial coarse screens (2–4 × 96 conditions; 3 protein concentrations each). Vapour diffusion, sitting drops, imaged by robots but scored by humans. Include ligands identified from screening or biochemical knowledge to promote crystallization. Follow-up screens and crystal optimization.
9. Data collection and structure determination
   Manual or robotic screening of crystals for diffraction properties; data collection at rotating anode or synchrotron sources. Phasing: molecular replacement (95%), experimental phasing using SeMet derivatives, and MIR.

Several features of this protocol have been optimized to capture a large portion of target proteins. Gene clones have been predominantly obtained from public and commercial cDNA libraries. However, gene synthesis may become the method of choice, as it allows codon frequency, restriction sites, and mRNA structure to be optimized and site-directed mutations to be introduced. Ligation-independent cloning is a generic, high-throughput process that can be uniformly applied regardless of the target gene or the cloning vector. Short N-terminal fusion tags, including a hexahistidine sequence and a specific protease cleavage site, are almost universally used. It has been widely documented that larger fusion tags (e.g., GST, thioredoxin, MBP) can enhance solubility of proteins that are not soluble when expressed with a short peptide tag. However, such fusion proteins have not been widely used in the SGC, since removal of the tag often leads to loss of solubility.

The greatest barrier to production of human proteins in bacteria is recovery of soluble protein. Fewer than 15% of protein targets yielded detectable levels of soluble protein when tested as full-length constructs in the SGC, while more than 80% were expressed as insoluble aggregates. The key to achieving higher success rates has been the parallel production of large numbers of truncated constructs, often containing a compact protein domain.
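As a rough illustration of how such a panel of truncated constructs can be enumerated, the sketch below pairs every candidate N-terminal start residue with every candidate C-terminal end residue. It is not the SGC's design software: the function name `enumerate_constructs`, the residue numbers, and the minimum-length filter are all assumptions made for this example.

```python
from itertools import product


def enumerate_constructs(n_starts, c_ends, min_length=1):
    """Pair every candidate start residue with every candidate end residue.

    Residue numbers are 1-based and inclusive. Pairs that would give a
    construct shorter than min_length residues are dropped.
    """
    return [
        (start, end)
        for start, end in product(n_starts, c_ends)
        if end - start + 1 >= min_length
    ]


# Hypothetical endpoints around a predicted domain (illustrative values only).
n_starts = [24, 28, 32]
c_ends = [246, 250, 255, 260]

for start, end in enumerate_constructs(n_starts, c_ends):
    print(f"residues {start}-{end} ({end - start + 1} aa)")
```

Three starts combined with four ends give 12 constructs here; three or four endpoints on each side give between 9 and 16, so the size of the panel is fixed by the number of endpoints chosen at each terminus.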
Construct design is initially based on domain boundary analysis, using a number of bioinformatic tools; 3–4 endpoints are designated around each of the predicted termini of the domain, resulting in 9–16 constructs. We have consistently found that this approach results in a 4-fold increase in the number of targets that can be produced as soluble proteins; a similar impact has been seen on the production of diffracting crystals, which can be dramatically affected by minute changes in protein termini. Although not rigorously tested, it is presumed that a protein construct that is inherently well-behaved (little tendency to aggregate or denature) will be less dependent on specialized conditions for expression and purification, and may crystallize in a wider range of conditions.

Crystallization, crystal screening and data collection

The SGC’s phase I operation appears to have confirmed that the most important driver of successful crystallization of a given target is exploring protein diversity at the crystallization stage. One major form of variation was discussed above, namely testing multiple constructs of the target. Equally effective has been setting up co-crystallization with multiple ligands, along with varying protein concentration in the primary crystallization screens.

In practice, this diversity exploration leads to large numbers of parallel crystallization experiments, presenting a logistical challenge which, at this scale, can only be met with efficient robotics and IT infrastructure. For the automation, the SGC has been able to exploit the devices developed on the back of the first wave of structural genomics initiatives, and our investment has been less in developing the machines than in integrating them and implementing experimental best practices. Particular examples: by minimizing sample requirements with nanolitre crystallization, the available protein can be used in more experiments.
The large numbers of drops thereby produced (1.5 million/year) would be practically impossible to view by eye under the microscope, whereas automatic drop imaging on a fixed schedule allows images to be reviewed at leisure at the desk. Automation has also played an important role in crystal characterization. An automatic sample changer has been used for initial characterization of the diffraction quality of a vast number of crystals. This allows the crystals to be ranked for more careful data collection, especially at the synchrotron, and helps to direct further efforts at crystal optimization. A significant saving of upstream effort has come from exploiting each crystal’s diffraction as efficiently as possible, including crystals traditionally considered marginal or problematic. Marginal diffractors include crystals that are “very small” (<40 μm in longest dimension), twinned, or have streaky or anisotropic diffraction. The latter cases generally require the undivided attention of experienced crystallographers. Small crystals require an excellent X-ray beam: the PXII beamline of the Swiss Light Source (SLS) synchrotron provides a beam which is reliably small but also well-aligned and very stable. The most efficient use of the beamline relied on pre-screening all crystals at the laboratory source for thorough work prioritization; real-time data processing during data collection; and close attention to radiation damage of crystals. It has been crucial to have experienced crystallographers on site. Adherence to these good practices has been highly productive: of datasets collected on 24-hour trips to SLS, 66% were used for final structures, while 90% of all depositions relied on synchrotron data.
The ability to extract useful data from marginal crystals has been especially productive in combination with the protein/ligand diversity approach of the SGC, as a significant fraction of structures (>50%) could be derived from crystals emerging from the primary screens, removing the need for further optimization.

Phasing and structure solution

The final step, namely finalizing and depositing the model, is in fact a frequent stalling point, and not only in high-throughput contexts. The reason is that the final model is not merely a result that can be trivially read off a few measurements, but an interpretation of often rather noisy data, with a lot of detail that is easy to miss, and in which individual errors reduce clarity in all areas. Moreover, poor model definition affects biologically interesting parts of a structure, and interpreting it becomes a matter of judgment and of using orthogonal information. Indeed, the “final” model is as much scientific hypothesis as result, and depositing the model means signing off on the hypothesis, which is why it has traditionally been a bottleneck in structural genomics efforts. The SGC has used a peer proofreading system combined with strict timelines to counteract the problem: before deposition, the structure is reviewed by another crystallographer for errors or alternative interpretations, and comments are passed back to the original refiner. The intention is threefold. First, to introduce quality control on the final output. Second, to relieve the refiner of the compulsion to spend excessive time on the model to flush out the final errors, since she knows it will be checked. Third, by mixing up refiners and proofreaders, to arrive over time at common interpretations of marginal modeling decisions. The timelines depend on situation and difficulty, but typically allow two weeks for refinement, a day for proofreading, and two further days for deposition.
This approach has made it possible to deposit novel structures at a considerable rate (6 each month from a team of 6 dedicated and 4–5 occasional crystallographers) without compromising quality.

Information infrastructure

An efficient laboratory information management system (LIMS) has been vital not only for target tracking, but also for capturing and, where possible, integrating information generated by the robotics, and for capturing human assessments of experimental outcomes where these can be entered via a client (e.g., scoring of crystallization images).

http://www.molsoft.com/beehive.html

Protein characterization and ligand screening

SGC target and biology area selection: relevance for the treatment of human diseases

For any structural genomics organisation, target selection is an important consideration, as it can have a major impact on the procedures that are implemented during the process of structure determination. Different structural genomics projects apply a number of approaches to select targets for structural analysis, such as blanket coverage of an organism’s genome, targets with potential novel folds, a percentage cut-off based on sequence identity, or total coverage of selected protein families. The SGC has opted for the family-based approach, with an emphasis on protein families whose members are important in human health and disease and are potentially druggable. From our point of view, the main advantages of this approach are twofold. Firstly, the methods and procedures identified for one family member can be applied to another family member, improving everything from expression, solubility, stability, and purification, to crystallisation and structure determination.
Secondly, analysis of the structures from all family members can reveal additional significant information, such as ligand binding site specificity, conformational dynamics, and the basis of aberrant behaviour of specific family members or, conversely, structural properties common to all family members.

The SGC has focused on providing protein structures to support drug development and understanding of the structural determinants of human disease. Of 160 unique targets deposited by the SGC (in phase 1), clear disease relevance has been established for 70%, and a further 18% are likely to be involved in at least one disease. This pattern holds true for all the human protein families the SGC is working on. The following sections provide an overview of the three distinct biological areas selected at the Oxford site of the SGC.

Biology area I: Structural Genomics of human metabolic enzymes

Selection of metabolic enzymes as a biological target area at the SGC was based on two distinct features: they are fundamentally involved in a multitude of human diseases, including cardiovascular and metabolic diseases and cancer, and several enzymes constitute possible drug targets. Emphasis has been given to certain metabolic enzyme families such as oxidoreductases (mostly short-chain dehydrogenases/reductases (SDR), medium-chain dehydrogenases/reductases (MDR), long-chain dehydrogenases/reductases, aldehyde dehydrogenases (ALDH), aldo-keto reductases (AKR) and 2-oxoglutarate dependent oxygenases (2OGs)). In addition, pathways of importance, e.g., in lipid or amino acid metabolism, were selected, with a distribution of about 1:1 between oxidoreductases and other metabolic enzymes. The target list comprises about 300 metabolic enzymes, and after three years of operation, >60 unique novel structures have been solved.
Three points of importance are highlighted in this review: structural characterization of enzymes shown to be causative of inherited metabolic diseases, structure determination of drug discovery targets in metabolic diseases such as metabolic syndrome or osteoporosis, and structure-guided “de-orphanization” of insufficiently characterized human gene products or even entire pathways.

Structural basis of inherited metabolic diseases

Metabolic enzymes as drug targets

Fig. 1 Bisphosphonate binding to human farnesyl diphosphate synthase. Electron density is shown in green around the clinically used inhibitor risedronate

Deorphanization of metabolic enzymes and pathways

Biology area II: Structural Genomics of transmembrane receptor signalling pathways

Complete coverage of the 14-3-3 protein family

Fig. 2 The flexibility of the 14-3-3 proteins is illustrated by the superimposition of 14-3-3β (blue) with 14-3-3η (orange). The monomer conformations of both isoforms are essentially identical on the left hand side. However, the beta monomer on the right side has a more open peptide binding groove and flexibility at the dimeric interface

PDZ domains

Biology area III: Structural Genomics of human protein kinases

http://www.pdb.org/pdb/home/home.do

Table 2 Protein kinase structures determined by SGC

Name      PDB ID  Resolution [Å]  Inhibitor name      Disease link          Family
CLK1      1Z57    1.70            Hymenialdisine      e                     CMGC
c         2EU9    1.53            none                e                     CMGC
CK1γ1     2CMW    1.75            Compound 52                               CK1
CK1γ2     2C47    2.40            5-Iodotubercidin    Genetic               CK1
d         2CHL    1.95            Triazolodiamine 1   Cancer                CK1
ERK3      2I6L    2.25            none                Cancer                CMGC
ASK1      2CLQ    2.30            Staurosporine       f                     STE
NEK2      2JAV    2.10            SU11652             Cancer                Other-NEK
a         2CDZ    2.40            Cdk1 Inhibitor      Cancer                STE
PAK5      2F57    1.80            Cdk1 Inhibitor      Pot. Cancer           STE
PAK6      2C30    1.60            none                Cancer                STE
b         1XWS    1.80            BIM I, HB1          Cancer, Inflammation  CAMK
PIM2      2IWI    2.80            HB1                 Cancer, Inflammation  CAMK
c         2J51    2.10            Triazolodiamine 1   e                     STE
MPSK1     2BUJ    2.60            Staurosporine       e                     Other-NAK
STK10     2J7T    2.0             SU11274             Not known             STE
DAPK3     2J90    2.0             Pyridone 6          Cancer, Inflammation  CAMK
CAMK1G    2JAM    1.7             SU11652             Not known             CAMK
CAMK1D    2JC6    2.5             GSK inhibitor XIII  Genetic               CAMK

a b c http://www.sgc.ox.ac.uk/structures/KIN.html d e f

Fig. 3

Contributions of NMR to Structural Genomics

NMR as a complementary method to crystallography for protein structure determination

Table 3 Deposited NMR structures and assignments

Gene      PDB deposition  Resonance assignment deposition
RGS3      –               BMRB-15178
RGS10     2I59            BMRB-7272
RGS14     2JNU            BMRB-15128
RGS18     2OWI            BMRB-7106
SPRED2    2JP2            BMRB-5939
JARID1CA  2JRZ            BMRB-15348

NMR as an assessment tool for the feasibility of structure determination

Fig. 4 Visible improvement in quality of 15N-HSQC spectra over two rounds of iterative construct re-design for the JARID1CA Bright/ARID domain. The leftmost (initial) construct shows potential. The structure of the final construct on the far right was determined by NMR (PDB code: 2JRZ)

The study of protein dynamics by NMR

The use of NMR to study the rotational correlation times and internal dynamics of proteins offers good explanations as to why crystallization sometimes fails even for well-folded proteins. In all of the proteins we rescued by NMR, 15N heteronuclear NOE and 15N T1, T2 relaxation data revealed regions of internal mobility within the proteins, which would have hindered long-range order and impaired or prevented efficient crystal packing. A striking example was the RGS domain from RGS10, in which NMR relaxation data confirmed true local mobility in a region of the domain which not only lacked NMR restraints, but also showed no electron density in the crystal structure of the complex of RGS10 with G-alpha-i3 (PDB 2IHB).
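For readers less familiar with relaxation analysis, the rotational correlation time mentioned above is often estimated from the ratio of the 15N T1 and T2 values, using a standard approximation for rigid, slowly tumbling proteins: τc ≈ sqrt(6·T1/T2 − 7) / (4π·νN), where νN is the 15N resonance frequency in Hz. The sketch below applies that textbook approximation only. It is not necessarily the analysis used for the proteins described here, and the function name `estimate_tau_c` and all numerical values are hypothetical.

```python
import math


def estimate_tau_c(t1_s, t2_s, nu_n_hz):
    """Estimate the rotational correlation time (seconds) from 15N T1 and T2.

    Uses the slow-tumbling approximation
        tau_c ~ sqrt(6 * T1/T2 - 7) / (4 * pi * nu_N),
    which ignores internal motion and chemical exchange, so T1 and T2
    should be averages over residues in the rigid core of the protein.
    """
    radicand = 6.0 * (t1_s / t2_s) - 7.0
    if radicand <= 0.0:
        raise ValueError("T1/T2 is too small for the slow-tumbling approximation")
    return math.sqrt(radicand) / (4.0 * math.pi * nu_n_hz)


# Hypothetical core-averaged relaxation times, with a 15N frequency of
# 60.8 MHz (corresponding to a 600 MHz spectrometer).
tau_c = estimate_tau_c(t1_s=0.70, t2_s=0.07, nu_n_hz=60.8e6)
print(f"estimated tau_c = {tau_c * 1e9:.1f} ns")
```

A correlation time markedly longer than expected for a monomer of the construct's size is one hint of self-association, although an estimate of this kind is only indicative and needs an independent check.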
Comparison of mobility in RGS domains from different branches of the phylogenetic tree leads to clues about their specificity and helps to guide further investigations. In some cases, the 15N T1 and T2 data have also identified partial dimerization in proteins that fail to crystallize, thus explaining the failure. NMR relaxation data were in each case confirmed by analytical ultracentrifugation (AUC). The combined information allowed us to decide whether these proteins should be highlighted as candidates for structure determination by NMR and to judge the best conditions under which they should be studied.

Future and outlook

Structural bioinformatics and rationalisation of experimental results

Dissemination of structural genomics data and knowledge

Structural genomics produces a wealth of information of different types: DNA and protein sequences, biochemical information, coordinates of crystal structures, and structural annotation. This information is deposited in one or more public databases, predominantly the PDB, in addition to publication in journals. This form of data distribution does not adequately disseminate the full information to a wide scientific audience. The first issue is the fragmentation of data between different formats. A user may have to read text information in a journal paper, which may include a few two-dimensional figures; then download a PDB structure file and view it with a separate application; and then analyse and align data from, say, an SNP database using alignment software. The second issue is that non-structural biologists do not routinely access PDB files, especially those of structures that were not published in PubMed-indexed journals.

Fig. 5 Screenshot of an iSee datapack. The annotation text (top left panel) includes links (blue text), which lead to structural images focused on areas of interest, simultaneously accessing other types of information (sequence alignment, small molecule formulae, etc.)
http://www.sgc.ox.ac.uk/iSee

We also maintain and curate these files, revising each datapack quarterly to ensure that all recently disclosed information is added, whether generated by ourselves through follow-up experiments or by external collaborators working on the same targets. Each datapack has a built-in automated updating function that can be executed at the user’s request.