Background 1 2 A. pernix 3 2 statistical significance 4 6 7 E. coli 8 9 1 10 11 12 13 In this paper, we describe a fully automated method for making an organism specific gene finder. It starts from a raw genome and searches for protein matches to get a good training set. Then an HMM with states for coding regions as well as RBS is estimated from the data set. This HMM is used to score all the ORFs in the genome and finally the score is converted to a measure of significance – R – which is the expected number of ORFs that would be predicted in one megabase of random DNA. The main advantage of this significance measure is that it takes the length distribution of random ORFs properly into account. The method is shown to match or exceed other gene finders currently available. Methods Automatic extraction of training set 1 1. Extract the maximal ORFs longer than 120 bases from the query-genome. For every stop codon, extract the region from the first downstream (in frame) start codon to the next downstream (in frame) stop codon. 14 -5 15 excluding A B A 16 A A B T T T 6. Add 50 bases of upstream flank to genes of all sets and 10 bases downstream flank. A T HMM architecture 17 19 1 B E not 13 Figure 1 The overall HMM architecture. 2 17 20 tied Figure 2 Enlargement of null model and internal looped codons. 3 21 Figure 3 The state structure of the RBS model. 2 22 Model estimation 17 1. The emission probabilities of the background state are estimated from both strands of the complete genome. 2. The genes from start to stop codons are extracted from the training set and reverse complemented. The shadow model (consisting of three states) is estimated from these sequences. The parameters of this model are fixed. 1 A A 23 A T The whole procedure can be completely automated. Note that no experimentally mapped RBSs are used for estimating the RBS model, the RBS is discovered during the estimation procedure. 17 20 T Decoding decoding 17 i S S i 24 17 1 8 The state posterior probability itself is not a useful score, because it is a probability of the whole sequence, not just a single gene, and it therefore depends on the length of the sequence it is part of. By dividing the posterior probability by the probability of the whole sequence (the genome) according to the null model, the contribution to the state posterior probability of the sequences flanking a gene will cancel and effectively make the ratio independent of the flanking sequences (except the parts very close to the gene), see the Appendix. The log of this ratio is called the log-odds score, and that is our basic score for a gene. Significance As mentioned above, it is important to take into account the chance that a random ORF of the same length scores as high as a given gene. This is implicitly taken into account by our HMM because it models the length distribution of genes, but it turns out that one can calculate a significance score, which works slightly better (se results) and has a more intuitive interpretation. l N l A Bl A B l l 0 l 0 l l l 0 n l n μ μ l σ σ l l n standard score which is normally distributed with mean 0 and variance 1. l C l C l A Bl l C l l 1 1 C C 1 all C l C C l C C l C C A B R C 4 R Figure 4 R l E. coli R A B μ μ σ σ C R R R Using other gene finders In order to benchmark EasyGene we compare it with some of the existing gene finders. GeneMark 2.4, Frame-by-Frame, GeneMark.hmm 2.1 and GeneMark.hmm/S all belong to the GeneMark suit of programs and are accessible via the web interfaces listed in the references. For GeneMark.hmm2.1, GeneMark.hmm/S and Frame-by-Frame the output is a coordinate listing (start and stop positions) of all predicted genes. GeneMark2.4 outputs a list of stop codons and corresponding high-scoring start codons. Each start/stop is listed with scores for coding potential and RBS. We collect all starts for a given stop and output the "Avg Prob" of the start with the highest RBS score. Whenever a threshold was needed for comparison purposes, we used 0.5 which is the default set on the web page. Glimmer2.02 and Orpheus2 were installed locally. We changed the minimum ORF length predicted by these gene finders to 60 bp which seems to be the minimum used by the other gene finders. Orpheus and Glimmer provide two kinds of output: a verbose coordinate list of starts, stops and ORF scores and a simpler, post-processed list of coordinates for ORFs regarded as genes. In order to test their ORF scoring we had to parse the scored output. We had some difficulties interpreting the scored Orpheus output since some ORFs were scored several times with identical results (several identical "Start chosen"). In cases where multiple copies were found, we simply chose one of them and used the corresponding "Coding potential" (with the recommended threshold of -1) for further analysis. For Glimmer2.02 the scored output was parsed simply by selecting the Gene Score attributed to every scored ORF and using the recommended threshold of 90. 25 Results and Discussion 5 Figure 5 Gene length distribution imposed by HMM architecture. n A H. pylori 6 7 performance Figure 6 Assessing the optimal number of HMM coding branches. E. coli R T R T R R Figure 7 Assessing the optimal order of looped codon states. E. coli 6 6 7 E. coli 8 Figure 8 Comparing significance and log-odds. E. coli 6 1 R E. coli R 6 7 8 Table 1 R R TP rate FP rate 0.1 0.971 0 2 0.980 2.3e-6 10 0.984 3.0e-5 50 0.987 3.1e-4 150 0.991 8.7e-4 500 0.995 2.8e-3 10000 0.999 0.059 R E. coli R R E. coli 9 R 10 11 R Figure 9 Statistical characteristics of random sequences. E. coli Figure 10 Comparing predicted and found number of false positives. E. coli Figure 11 Probability density functions for the standard score. E. coli 2 E. coli Table 2 E. coli Data set EasyGene Glim rbs-Glim Orpheus Gm24 GmS Gmhmm Frame A'-% found 98.4 98.9/98.9 98.9 98.0/95.3 91.5 97.2 98.1 97.0 A'-% exact 93.8 98.9/95.3 84.1 95.1/92.4 41.6 88.0 85.7 93.2 B'-% found 98.4 98.5/98.6 98.6 95.9/96.5 90.2 96.6 97.2 96.4 T-% found 98.1(98.0) 98.3/98.4 98.4 96.5/95.6 89.8 96.3 97.1 96.1 Genome 4145 6827/5756 5756 9333/7543 3552 4064 4230 4064 zero order 7 169/211 211 6761/5430 6 153 1459 0 first order 7 545/723 723 6836/4804 13 241 830 0 third order 1 2423/2694 2694 6582/4817 43 659 866 1 shadows 0 19/21 21 22/9 1 0 2 0 E. coli E. coli A E. coli A B T A B Genome E. coli E. coli A A T T orfnuc It is always difficult to asses the specificity of a gene finder, because it is difficult to find genomic regions that are certain to contain no genes. We have therefore assessed specificity in three different ways. First, by counting the number of predicted genes in a genome. If this number is much higher than the number of annotated genes, it is likely that there are many false positive predictions, i.e. poor specificity. Our second test is based on random sequence. Clearly, a high number of predicted genes in a random sequence of bases indicates a poor specificity. However, it is probably not possible to find an exact quantitative correspondence between predictions in random sequences and real genomes. Also, it is not clear what sort of random sequence to use for such a test. By 0'th order we mean a sequence with bases generated randomly and independently with the base frequencies of the genome. Bases are quite correlated in DNA sequences, so we have also tested on sequences that are generated by Markov chains of orders 1 and 3. These Markov chains are estimated from the genome, so the sequences will have the same distribution of dinucleotides and 4-nucleotides, respectively, as the genome. Finally, the third test is the number of genes predicted on the opposite strand of genes (shadows); these shadow regions should contain very few genes if any. All gene finders except EasyGene, Frame and GeneMark2.4 predict a rather large number of false positives in random sequences, but for GeneMark.hmm and GeneMarkS we do not see large over-prediction in the genome or in shadows. Evidently, Glimmer and Orpheus predict a lot more genes in the genome than the other gene finders, suggesting that these gene finders have very high false positive levels. This is supported by the high numbers of genes predicted in random sequences, and (to a much lesser extent) in shadows. Orpheus and Glimmer actually predict more genes pr. nucleotide in the third order random sequence than they do in the genome, suggesting that the coding potential calculated in these gene finders is far from optimal. The HMM used by Frame assumes a minimum gene length of 69 bases which could make its false positive level seem somewhat better (lower) than it is, but there was no convenient way to lower the minimum length so we simply left it. It should also be noted that Frame relies on pre-existing annotations for training and is therefore not a self-training gene finder like Glimmer, Orpheus, GeneMarkS and Easygene. The sensitivity of the gene finders is tested on sets of high-confidence genes. Glimmer has the highest sensitivity for all sets, but this is largely due to heavy over-prediction. One ought always to bear in mind that it is not difficult to achieve high sensitivity if high levels of false predictions are tolerated at the same time – sensitivity is 100% if all ORFs are predicted as genes! Although there are some very close competitors, EasyGene comes out as the second best in sensitivity for all sets. A A E. coli 26 3 3 27 25 Table 3 E. coli Data set EasyGene Glim rbs-Glim Orpheus Gm24 GmS Gmhmm Frame LiC-% found 100 100/100 100 97.7/91.7 97.7 100 100 100 LiC-% exact 94.0 100/97.0 88.7 96.2/90.2 49.6 94.0 90.2 98.5 LiD-% found 100 100/100 100 96.8/98.4 100 100 100 100 LiD-% exact 96.8 0/1.6 75.8 51.5/51.6 67.7 95.2 80.6 87.0 26 E. coli 2 3 M. tuberculosis 28 H. pylori 29 B. subtilis 30 M. tuberculosis H. pylori B. subtilis 11 4 2 Table 4 M. tuberculosis H. pylori J99 B. subtilis M. tuberculosis Data set EasyGene GmS Frame A'-% found 96.7 97.2 96.0 A'-% exact 89.1 80.9 87.9 B'-% found 96.8 97.1 96.3 T-% found 96.9(96.6) 97.3 96.4 Genome 3749 3983 4341 zero order 0 - 8 first order 3 - 2 third order 2 - 12 shadows 1 0 1 H. pylori J99 Data set EasyGene GmS Frame A'-% found 99.2 99.2 99.2 A'-% exact 97.5 95.7 96.7 B'-% found 100 99.6 98.9 T-% found 99.7(98.8) 99.5 99.1 Genome 1491 1518 1479 zero order 2 1479 2 first order 1 336 2 third order 0 403 0 shadows 2 0 0 B. subtilis Data set EasyGene GmS Frame A'-% found 99.3 98.1 98.8 A'-% exacts 94.8 94.1 93.3 B'-% found 99.2 99.0 98.2 T-% found 99.3(99.2) 99.0 98.4 Genome 4083 4221 4006 zero order 1 675 0 first order 2 457 0 third order 1 813 2 shadows 0 0 0 M. tuberculosis H. pylori J99 B. subtilis M. tuberculosis M. tuberculosis H. pylori B. subtilis Conclusions The emerging overall picture is that the sensitivity of EasyGene tends to be comparable to GeneMarkS and higher than Frame. With regards to specificity, EasyGene and Frame tend to be comparable and both higher than GeneMarkS. Hence, EasyGene comes out with the best combined sensitivity/specificity performance. When it comes to exact starts, EasyGene also generally performs best. Glimmer and Orpheus have good sensitivities but at the cost of very low specificities in this comparison. We have lowered the ORF length cutoff from their default values in these methods to make the results comparable. This may be unfair, but since the challenge is to find the short genes, we believe that any gene finder should be able to score them. At present it is not possible to automatically find all genes in a prokaryotic genome. We believe the aim of a gene finding system is to help expert annotators as much as possible, and we consider the statistical significance of a gene an important help in classifying the predictions into almost certain genes and border-line genes needing more attention. Contrary to most other gene finders discussed here, it is up to the user which significance cut-off to use. EasyGene also predicts sub-optimal start codons if need be, so it will be easy to see if e.g. two alternative starts have almost the same score. R If the amount of available genomic DNA is very small (as it may be in partially sequenced genomes) the 3 branches of 4th order coding models may have too many parameters to be reliably estimated. In such cases, one could reduce the parameter space simply by using fewer branches and/or lower orders. More generally, one could develop a method for automatic fine-tuning of HMM architecture for every new organism. Other lines of future research could focus on modelling of genes with errors and frame shifts. S. typhi 31 Appendix: The length dependent score distribution z M c 1 c l P l M P z N where We will now look at the distribution of each of these terms. The null model consists of a state with a loop and three reverse codon states with a loop. For a long sequence one of the loops will usually dominate the probability, so the length distribution is well approximated by a geometric distribution P l N k 1 k 1 l k 1 17 p n n l 0 variable l l l 0 l n l n n Q Q N l l p M p N l l μ μ l' 9 l l standard score l C l A Bl l l l C C l s l C R A C B l s l 0 C C B Web sites used Frame-by-Frame: GeneMark.hmm 2.1: GeneMark.hmm 2.1 using GeneMarkS models: GeneMark 2.4: