Introduction

http://daisy.nagahama-i-bio.ac.jp/Famsbase/

We report here the update of the database, including differences in the amount of structural data from the previous version, an estimate of the time needed for all ORFs predicted from genome sequences to be covered by homology-modeled 3D structures, and upcoming issues in utilizing those modeled structures.

Methods

Data update of FAMSBASE

Assessing annual difference of data in FAMSBASE

$$
\begin{aligned}
\text{coverage of soluble protein} &= \frac{\sum\limits_{i\in \mathrm{kingdom}} \mathrm{S3}_{\mathrm{G}i}}{\sum\limits_{i\in \mathrm{kingdom}} \mathrm{S}_{\mathrm{G}i}} \times 100,\\
\text{coverage of membrane protein} &= \frac{\sum\limits_{i\in \mathrm{kingdom}} \mathrm{M3}_{\mathrm{G}i}}{\sum\limits_{i\in \mathrm{kingdom}} \mathrm{M}_{\mathrm{G}i}} \times 100.
\end{aligned}
$$

Non-overlap multiple model structures in single ORFs

Prediction of domain interfaces

$$
I_{b,i} = -\log_2 \left( \frac{S_{b,i} / \sum\limits_i S_{b,i}}{S_{0,i} / \sum\limits_i S_{0,i}} \right).
$$

Results and discussion

Coverage of whole protein space by homology modeling

http://daisy.nagahama-i-bio.ac.jp/

Table 1 Number of ORFs and those with modeled 3D structures in 276 genomes

Species  ORF  Model  %

Archaea
Archaeoglobus fulgidus  2,407  1,233  51.2
Aeropyrum pernix  2,694  789  29.3
Halobacterium  2,605  1,195  45.9
Methanosarcina acetivorans  4,544  2,124  46.7
Methanocaldococcus jannaschii  1,770  875  49.4
Methanopyrus kandleri  1,687  784  46.5
Methanosarcina mazei  3,371  1,634  48.5
Methanothermobacter thermautotrophicus  1,869  998  53.4
Nanoarchaeum equitans  536  264  49.3
Pyrococcus abyssi Orsay  1,784  942  52.8
Pyrobaculum aerophilum  2,605  1,047  40.2
Pyrococcus furiosus  2,065  1,035  50.1
Pyrococcus horikoshii  2,061  879  42.6
Sulfolobus solfataricus  2,994  1,365  45.6
Sulfolobus tokodaii  2,826  1,228  43.5
Thermoplasma acidophilum  1,478  844  57.1
Thermoplasma volcanium  1,526  839  55.0
sum  38,822  18,075  46.6

Eubacteria
Aquifex aeolicus  1,553  929  59.8
Nostoc  6,132  2,765  45.1
Agrobacterium tumefaciens  5,301  3,017  56.9
A. tumefaciens  5,402  3,028  56.1
Bacillus anthracis  5,311  2,463  46.4
Buchnera aphidicola  552  410  74.3
B. aphidicola  507  385  75.9
Bordetella bronchiseptica  4,994  2,934  58.8
Borrelia burgdorferi  1,639  535  32.6
Bacillus cereus  5,255  2,534  48.2
Candidatus Blochmannia floridanus  583  447  76.7
Bacillus halodurans  4,066  2,127  52.3
Bradyrhizobium japonicum  8,317  4,449  53.5
Bifidobacterium longum  1,731  985  56.9
Brucella melitensis  3,198  1,801  56.3
Bordetella parapertussis  4,185  2,525  60.3
B. pertussis  3,447  2,179  63.2
Bacillus subtilis  4,106  2,153  52.4
Brucella suis  3,264  1,677  51.4
Bacteroides thetaiotaomicron  4,816  2,462  51.1
Buchnera  574  436  76.0
Clostridium acetobutylicum  3,848  2,053  53.4
Coxiella burnetii  2,045  925  45.2
Chlamydophila caviae  1,005  505  50.2
Caulobacter crescentus  3,737  2,084  55.8
Corynebacterium diphtheriae  2,272  1,165  51.3
Corynebacterium efficiens  2,998  1,513  50.5
Corynebacterium glutamicum  3,099  1,554  50.1
Campylobacter jejuni  1,634  893  54.7
Chlamydia muridarum  911  483  53.0
Clostridium perfringens  2,723  1,470  54.0
Chlamydophila pneumoniae  1,116  495  44.4
Chlamydophila pneumoniae  1,052  496  47.1
Chlamydophila pneumoniae  1,069  501  46.9
Chlamydophila pneumoniae  1,113  501  45.0
Chlorobium tepidum  2,252  1,166  51.8
Clostridium tetani  2,432  1,306  53.7
Chlamydia trachomatis  894  485  54.3
Chromobacterium violaceum  4,385  2,343  53.4
Deinococcus radiodurans  3,102  1,579  50.9
Escherichia coli  4,284  2,398  56.0
E. coli  5,447  2,607  47.9
E. coli  5,449  2,629  48.2
E. coli  5,379  2,558  47.6
Enterococcus faecalis  3,265  1,568  48.0
Fusobacterium nucleatum  2,067  1,011  48.9
Geobacter sulfurreducens  3,445  1,902  55.2
Gloeobacter violaceus  4,430  2,208  49.8
Haemophilus ducreyi  1,717  865  50.4
Helicobacter hepaticus  1,875  902  48.1
Haemophilus influenzae  1,709  1,038  60.7
Helicobacter pylori  1,566  741  47.3
Helicobacter pylori  1,491  747  50.1
Listeria innocua  3,043  1,641  53.9
Leptospira interrogans  4,725  1,719  36.4
Lactococcus lactis  2,266  1,254  55.3
Listeria monocytogenes  2,846  1,653  58.1
Lactobacillus plantarum  3,009  1,647  54.7
Mycobacterium bovis  3,920  2,018  51.5
Mycoplasma gallisepticum  726  371  51.1
Mycoplasma genitalium  480  305  63.5
Mycobacterium leprae  1,605  918  57.2
Mesorhizobium loti  7,281  3,829  52.6
Mycoplasma penetrans  1,037  472  45.5
Mycoplasma pneumoniae  688  333  48.4
Mycoplasma pulmonis  782  398  50.9
Mycobacterium tuberculosis  3,918  2,036  52.0
Mycobacterium tuberculosis  4,187  1,990  47.5
Nitrosomonas europaea  2,461  1,366  55.5
Neisseria meningitidis  2,025  1,016  50.2
Neisseria meningitidis  2,065  1,025  49.6
Oceanobacillus iheyensis  3,496  1,892  54.1
Phytoplasma asteris  754  423  56.1
Pseudomonas aeruginosa  5,566  3,206  57.6
Porphyromonas gingivalis  1,909  944  49.4
Photorhabdus luminescens laumondii  4,683  2,286  48.8
Prochlorococcus marinus  1,712  933  54.5
Prochlorococcus marinus  2,265  1,122  49.5
Prochlorococcus marinus marinus  1,882  939  49.9
Pasteurella multocida  2,014  1,237  61.4
Pseudomonas putida  5,350  2,968  55.5
Pseudomonas syringae  5,608  2,938  52.4
Pirellula  7,325  2,588  35.3
Rickettsia conorii  1,374  572  41.6
Rhodopseudomonas palustris  4,814  2,739  56.9
Rickettsia prowazekii  834  498  59.7
Ralstonia solanacearum  5,116  2,698  52.7
Streptococcus agalactiae  2,124  1,159  54.6
Streptococcus agalactiae  2,094  1,174  56.1
Staphylococcus aureus  2,748  1,451  52.8
Staphylococcus aureus  2,624  1,447  55.1
Staphylococcus aureus  2,659  1,410  53.0
Streptomyces avermitilis  7,671  4,001  52.2
Streptomyces coelicolor  8,154  4,195  51.4
Staphylococcus epidermidis  2,485  1,303  52.4
Shigella flexneri  4,452  2,306  51.8
Shigella flexneri  4,068  2,159  53.1
Sinorhizobium meliloti  6,205  3,499  56.4
Streptococcus mutans  1,960  1,136  58.0
Shewanella oneidensis  4,778  2,291  47.9
Streptococcus pneumoniae  2,094  1,101  52.6
Streptococcus pneumoniae  2,043  1,135  55.6
Streptococcus pyogenes  1,696  956  56.4
Streptococcus pyogenes  1,845  996  54.0
Streptococcus pyogenes  1,865  986  52.9
Streptococcus pyogenes  1,861  976  52.4
Salmonella typhi  4,767  2,347  49.2
Salmonella typhimurium  4,554  2,457  54.0
Salmonella enterica  4,323  2,263  52.3
Synechocystis  3,167  1,679  53.0
Synechococcus  2,517  1,243  49.4
Thermosynechococcus elongatus  2,475  1,303  52.6
Thermotoga maritima  1,846  1,051  56.9
Treponema pallidum  1,031  517  50.1
Thermoanaerobacter tengcongensis  2,588  1,403  54.2
Tropheryma whipplei  783  494  63.1
Tropheryma whipplei  808  499  61.8
Ureaplasma urealyticum  611  303  49.6
Vibrio cholerae  3,828  1,971  51.5
Vibrio parahaemolyticus  4,832  2,461  50.9
Vibrio vulnificus  4,537  2,461  54.2
Vibrio vulnificus  5,028  2,499  49.7
Wigglesworthia brevipalpis  611  441  72.2
Wolinella succinogenes  2,044  1,208  59.1
Xanthomonas axonopodis  4,427  2,374  53.6
Xanthomonas campestris  4,181  2,287  54.7
Xylella fastidiosa  2,832  1,158  40.9
Xylella fastidiosa  2,036  1,066  52.4
Yersinia pestis  4,083  2,116  51.8
Yersinia pestis  4,281  2,123  49.6
sum  396,126  206,311  52.1

Eukaryotes
Arabidopsis thaliana  28,723  14,394  50.1
Caenorhabditis briggsae  14,713  7,063  48.0
Caenorhabditis elegans  22,220  8,841  39.8
Ciona intestinalis  15,865  7,994  50.4
Drosophila melanogaster  18,302  9,541  52.1
Danio rerio  26,587  16,443  61.8
Encephalitozoon cuniculi  1,996  887  44.4
Guillardia theta  632  307  48.6
Homo sapiens  28,063  15,467  55.1
Leishmania major  173  62  35.8
Mus musculus  24,928  14,382  57.7
Neurospora crassa  10,088  3,800  37.7
Oryza sativa  16,724  4,517  27.0
Plasmodium falciparum  5,268  1,905  36.2
Rattus norvegicus  28,682  16,740  58.4
Saccharomyces cerevisiae  5,869  2,913  49.6
Schizosaccharomyces pombe  5,261  2,807  53.4
Takifugu rubripes rubripes  37,452  15,202  40.6
sum  291,546  143,265  49.1

Phages/Viruses
186  46  8  17.4
44AHJD  21  1  4.8
44RR2.8t  252  51  20.2
933W  80  9  11.3
A118  72  9  12.5
A511  11  0  0.0
Aeh1  331  51  15.4
APSE-1  54  6  11.1
B1  11  1  9.1
B103  17  4  23.5
Bcep781  61  5  8.2
BF23  8  1  12.5
bIL170  64  2  3.1
bIL285  62  5  8.1
bIL286  61  7  11.5
bIL309  56  6  10.7
bIL310  29  4  13.8
bIL311  22  6  27.3
bIL312  27  3  11.1
BK5-T  63  6  9.5
Bxb1  86  12  14.0
C2  39  2  5.1
Cp-1  28  2  7.1
ϕCTX  47  4  8.5
D29  79  15  19.0
D3  94  11  11.7
Rb15  49  6  12.2
ϕg1e  49  6  12.2
GA-1  35  3  8.6
Gh-1  42  12  28.6
H-19B  22  4  18.2
HF2  114  11  9.6
HK022  57  8  14.0
HK620  58  6  10.3
HK97  61  10  16.4
HP1  41  3  7.3
HP2  36  3  8.3
K139  44  4  9.1
KVP40  381  57  15.0
2,389  57  7  12.3
L-413C  40  4  10.0
L5  85  12  14.1
λ  66  18  27.3
A2  61  8  13.1
Mu  53  6  11.3
N15  60  13  21.7
Mycoplasma virus P1  11  0  0.0
Enterobacteria phage P1  11  0  0.0
P2  42  5  11.9
P22  36  9  25.0
P27  58  9  15.5
P335  49  6  12.2
P4  12  2  16.7
P60  80  13  16.3
PA01  34  5  14.7
PaP3  69  8  11.6
ϕKZ  306  25  8.2
ϕCh1  98  9  9.2
ϕYeO3-12  59  13  22.0
ϕ105  51  8  15.7
ϕC31  55  8  14.5
ϕ3626  50  10  20.0
ϕE125  71  12  16.9
ϕETA  66  8  12.1
ϕNIH1.1  55  6  10.9
ϕPV83  65  9  13.8
ϕSLT  62  12  19.4
ϕadh  63  8  12.7
ϕBT1  55  9  16.4
ϕA1122  50  10  20.0
P68  22  2  9.1
ϕKMV  48  11  22.9
PM2  22  1  4.5
PRD1  22  4  18.2
ΨM2  31  1  3.2
ΨM100  37  4  10.8
PY54  67  10  14.9
PZA  27  4  14.8
R1t  50  6  12.0
RB69  256  56  21.9
RB49  272  49  18.0
Rd  47  6  12.8
RM378  146  17  11.6
PVL  62  8  12.9
Sfi11  25  1  4.0
V  53  7  13.2
SIO1  34  6  17.6
Sk1  54  1  1.9
SP6  20  6  30.0
SP βc2  185  33  17.8
SPP1  106  7  6.6
MM1  53  6  11.3
ST64B  56  8  14.3
ST64T  65  9  13.8
7201  46  8  17.4
DT1  47  7  14.9
O1205  57  4  7.0
Sfi19  45  6  13.3
Sfi21  50  9  18.0
Stx2  165  11  6.7
T3  44  10  22.7
T4  278  58  20.9
T7  58  10  17.2
TM4  89  5  5.6
TP901-1  56  7  12.5
Tuc2009  56  7  12.5
Ul36  58  5  8.6
VHML  57  8  14.0
VpV262  67  4  6.0
VT2-Sa  82  11  13.4
Wϕ  44  4  9.1
Sum  7,699  1,073  13.9

Total  734,193  368,724  50.2

Fig. 1 Percentage of amino acid residues included in modeled 3D structures in each ORF, classified in 10% bins and shown as pie charts. ORFs without a modeled structure are not included. The number of ORFs with modeled structures and the average length of those ORFs are shown at the center of each pie chart. Sections bordered by thick black lines indicate that the unmodeled region in the ORF is no smaller than the size of a domain (about 150 residues).

Annual difference of model structures

Fig. 2 Annual differences of modeled structures classified by kingdoms of life. The percentage is the number of amino acid residues included in modeled structures divided by the total number of residues in predicted sequences, for soluble and membrane proteins in each kingdom. (S) stands for soluble proteins and (M) for membrane proteins. Some of the residues are predicted to be in disordered regions; the percentage of residues in disordered regions is shown at the top.

Frequency of template structure in use

Fig. 3 Frequency of template usage in descending order.
The horizontal axis shows templates and the vertical axis shows how often each template is used. The red line shows template usage in archaebacteria, the blue line in eubacteria and the green line in eukaryotes.

Fig. 4 Protein family distribution of the ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily in each kingdom. In the three pie charts, sections with the same color represent the same family, except for the white sections.

Table 2 Top 15 modeling templates in the newly determined 3D structures between 2002 and 2003

PDBID  Chain  Number of uses as a template  a  Protein name
1q12  A  7,031  N  Maltose/maltodextrin transport ATP-binding protein MalK
1l2t  A  6,529  N  Hypothetical ABC transporter ATP-binding protein Mj0796
1oxx  K  3,948  N  ABC transporter ATP-binding protein GlcV
1pf4  A  3,202  N  Transport ATP-Binding Protein MsbA
1nr0  A  2,640  Y  Actin interacting protein 1 Aip1
1ixc  A  2,495  N  LysR-type regulatory protein CbnR
1ld8  A  2,410  N  Farnesyltransferase α subunit
1ji0  A  2,331  Y  ABC transporter
1oyw  A  2,251  N  ATP-dependent DNA helicase; RecQ helicase
1kt1  A  2,198  N  Fk506-binding protein FKBP51
1mt0  A  1,961  N  Haemolysin secretion ATP-binding protein; ATP-binding domain
1mdb  A  1,745  N  2,3-dihydroxybenzoate-AMP ligase DhbE
1nnm  A  1,730  N  Acetyl-CoA synthetase
1gxr  A  1,715  N  Transducin-like enhancer protein 1 Esg1
1uoh  A  1,706  N  26S proteasome non-ATPase regulatory subunit 10
a

Whole structure and function of proteins from homology modeling of domain structures

Fig. 5 Eukaryotic ORFs with multiple model structures covering more than 70% of the entire protein. In each bar representation of a protein, a black box is a region with a 3D structure. The name and PDB ID of the template structure and the amino acid sequence identity between the template and target domains are given below the black box.
A yellow box is a putative signal peptide and a green box is a putative transmembrane region. The template and modeled structures of ENSMUSP00000019416 are shown on the right side of the figure. Each domain is colored by hydrophobicity: hydrophilic residues are in green and hydrophobic residues are in red. Buried residues are in deep blue.

Accuracy of homology modeling

Fig. 6 Distribution of sequence identity between template and target amino acid sequences in FAMSBASE

Conclusion

Construction of a database of whole-genome homology models showed that 3D structures can now be modeled for about 50% of the protein-coding regions in whole genomes. If the current pace of 3D structure determination is maintained, it will take at most 11 years to obtain enough templates to cover all soluble proteins of eubacterial genomes, and 25 years to cover those of eukaryotic genomes. Current advances in protein structure determination technologies are expected to shorten these times. What we will obtain by then, however, is not the 3D structures of entire proteins but domain structures in pieces. Homology-modeled domain structures are already used to predict domain functions, but predicting the spatial arrangement of domains within a protein will be an important issue for function prediction.
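
As an illustration of the two formulas given in Methods, the following is a minimal Python sketch, not the authors' implementation. It assumes that S3_Gi and S_Gi (or M3_Gi and M_Gi) are the modeled and total residue counts for genome i, in line with the Fig. 2 caption, and it treats S_{b,i} and S_{0,i} simply as non-negative quantities indexed by type i, because their definitions are not recoverable from this section. The function names `coverage_percent` and `information_score` and all example numbers are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def coverage_percent(genomes: Sequence[Tuple[int, int]]) -> float:
    """Pooled coverage in percent over the genomes of one kingdom.

    Each item is (modeled_residues, total_residues) for one genome i,
    assumed to correspond to (S3_Gi, S_Gi) or (M3_Gi, M_Gi).
    """
    modeled = sum(m for m, _ in genomes)
    total = sum(t for _, t in genomes)
    if total == 0:
        raise ValueError("total residue count is zero")
    return modeled / total * 100


def information_score(s_b: Dict[str, float],
                      s_0: Dict[str, float]) -> Dict[str, float]:
    """I_{b,i} = -log2((S_{b,i} / sum_i S_{b,i}) / (S_{0,i} / sum_i S_{0,i})).

    s_b and s_0 map a type label i to S_{b,i} and S_{0,i}.
    What these quantities measure is an assumption left to the caller.
    """
    sum_b = sum(s_b.values())
    sum_0 = sum(s_0.values())
    if sum_b <= 0 or sum_0 <= 0:
        raise ValueError("both distributions need a positive total")
    scores: Dict[str, float] = {}
    for label, value in s_b.items():
        reference = s_0[label]
        if reference <= 0:
            raise ValueError(f"reference value for {label!r} must be positive")
        ratio = (value / sum_b) / (reference / sum_0)
        # A type never observed in class b has ratio 0; report it as +inf
        # instead of letting math.log2 raise on zero.
        scores[label] = math.inf if ratio == 0 else -math.log2(ratio)
    return scores


if __name__ == "__main__":
    # Invented counts for two genomes of one kingdom.
    print(coverage_percent([(50, 100), (25, 100)]))
    # Invented values for two types; "A" is depleted relative to the reference.
    print(information_score({"A": 1.0, "B": 3.0}, {"A": 2.0, "B": 2.0}))
```

With this sign convention, a type that is rarer in class b than in the reference gets a positive score and an enriched type gets a negative one. For comparison, the same ratio applied to the ORF counts in the Total row of Table 1 (368,724 modeled out of 734,193) gives about 50.2%.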