Background

It is obvious that the full text of an article contains more information than its Abstract. However, full text analysis raises several problems. On the one hand, storing full text articles requires more disk space and analyzing them requires more computational capacity. On the other hand, an Abstract, as a summary, contains a high frequency of relevant terms (keywords), which may not be the case for the rest of the article.

Other questions concern the quality of the information carried by the different sections of an article. First, is the information in the full text organized enough for keywords to be extracted? Second, different biological concepts (for example, gene and protein names, tissue names, organisms, and experimental conditions) may be located in different parts of the article. A word may also have a different meaning depending on the section where it appears (a context-dependent meaning). For example, gene names found in the Methods section may refer mostly to analytical tools rather than being relevant to the biological phenomena described in the article as a whole.

In summary, it would be good to quantify and qualify the information in a full text article before embarking on large scale extraction of particular items of information. With this goal in mind, we analyzed the kind of information attached to the different parts of an article and tried to quantify how much information can be found in each section. This should help to establish some guidelines for researchers attempting to extract particular keywords (words synthesizing the content of the article) from full text articles.

Results

Text Corpus

[...] Nature Genetics [...]

Selection of Keywords

[...] K [...] animal mouse, mycobacterium, human hippocampus, cerebellum, breast [...]

Keyword Selection by Section

[...]

[Figure 1]

Table 1 Keyword selection per section.
Section   all      K >= 0.3   K >= 0.4   K >= 0.5
A         52.17    19.44      14.42      9.77
I         171.32   31.03      20.47      14.00
M         404.19   54.24      28.50      15.80
R         599.98   24.74      12.74      7.85
D         331.04   26.16      14.25      8.75

Sections Display Heterogeneous Information

[...] Nf1 mouse mutation Nf1 neurofibromin GAP maze impairment, lethality antibody, amersham, tris, primer [...]

[Figure 2]

[...] protein, gene [...]

Table 2

      A      I      M      R      D
A     -      2.01   0.92   1.77   2.20
I     2.01   -      0.81   1.34   2.02
M     0.92   0.81   -      1.55   1.02
R     1.77   1.34   1.55   -      1.99
D     2.20   2.02   1.02   1.99   -

[Figure 3]

This result indicates that each section contains certain keywords that are unique to that section. In the following, we try to characterize the differences in content between sections.

Qualitative Analysis of Subjects per Section

For a deeper analysis of the kind of information present in each section, we classified into seven categories a set of words from our corpus of 104 articles (chosen among the most frequent nouns). To do so as unambiguously as possible, we selected words that matched MeSH descriptors consisting of that single word and belonging to only one major MeSH category (see Methods). We added another category not present in MeSH, "Units, Dimensions, & Parts", to account for many terms that are currently not MeSH terms but are of interest to us.

[Figure 4, panels (a) and (b)]

Distribution of Gene Names

[...] Not That A6 [...]

[Figure 5, panels (a) and (b)]

[...] Pbp2 Pom1 Sac1 [...]

Table 3 Detection of gene names appearing only in the Methods section.
Classification                 Description                            Gene name(s)              Ref
Incorrect                      Restriction endonucleases              Msp1                      v27.n3.277
                                                                      Pst1                      v19.n4.340
                                                                      Sac1                      v27.n4.375
                               Vector name                            Psg5                      v23.n3.287
                               Cell strain                            Tig3                      v26.n3.291
                               Definition of a yeast strain           Can1, Leu2, Lys2, Trp1    v26.n4.415
                               In array                               Faf1                      v20.n3.266
Correct (technical context)    Growth detection                       Mcm5, Mcm6                v25.n3.263
                               Platelet mRNA analysis                 Pbp2                      v23.n2.166
                               Primers used to determine embryo sex   Zfy1, Zfy2                v27.n1.31
                               Analysis of mutant phenotypes          Pmd1                      v24.n4.355
                               cDNA probe                             Rab2                      v19.n2.134
Correct                        SNP found in cDNA                      Add3, Npr2                v22.n3.239
                               Identifier given                       Pom1                      v28.n3.223
                               Detection of meiosis specific genes    Mei4, Mek1, Sps4, Zip1    v26.n4.415

[...] Nature Genetics [...]

Discussion

In this work we have shown that the distribution of information (as keywords) in full text articles is heterogeneous, and that article sections correspond to different kinds and densities of relevant data. Abstracts prove to be the best repository in terms of packing many keywords into a short space, which justifies previous information extraction approaches. The lack of large repositories of full text articles, in contrast to the current eleven million references (many of them with their abstract) in the MEDLINE database, is another advantage of the Abstract approach. However, we have shown that there is much more relevant information (a ratio of at least 1:4 for gene names, anatomical terms, organism names, etc.) in the rest of the article. We have demonstrated that the information is structured enough to yield substantial numbers of relevant keywords, but that for certain words (such as gene names) the context of the word must be treated with caution.

We propose that the text mining of full text articles should be approached with different strategies for different sections. Beyond the Abstract, the Introduction looks like the best place to look for protein and gene names (and interactions), since it probably describes current knowledge.
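
The caution about gene names and their context can be made concrete with a small example. The following is a minimal sketch, not the procedure used in this work: it assumes that gene-name mentions have already been extracted for each section, and the function name, the one-letter section labels used as dictionary keys, and the toy input are all hypothetical. It flags names that occur in the Methods section and nowhere else, which is the situation examined in Table 3.

```python
from typing import Dict, Set


def methods_only_names(mentions: Dict[str, Set[str]],
                       methods_key: str = "M") -> Set[str]:
    """Return names seen in the Methods section and in no other section.

    `mentions` maps a section label to the set of candidate gene names
    found in that section (hypothetical input format).
    """
    in_methods = mentions.get(methods_key, set())
    elsewhere: Set[str] = set()
    for section, names in mentions.items():
        if section != methods_key:
            elsewhere |= names  # collect names seen outside Methods
    return in_methods - elsewhere


# Toy input for illustration only; it is not taken from the corpus.
mentions = {
    "A": {"Nf1"},
    "I": {"Nf1"},
    "M": {"Nf1", "Sac1", "Pbp2"},
    "R": {"Nf1"},
    "D": {"Nf1"},
}

flagged = methods_only_names(mentions)
print(sorted(flagged))  # prints ['Pbp2', 'Sac1']
```

Flagged names are candidates for review rather than automatic rejection: Table 3 lists Methods-only names that are incorrect (restriction endonucleases, vector names) alongside others that are correct gene names.
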
The Discussion section, which interprets the results and puts them in the context of current knowledge, looks like the third best place for mining such information, with Methods probably the worst. The Results section could be problematic given its mixed nature, between Methods and the rest. Regarding other subjects, such as keywords about biological concepts (species, tissues, diseases, etc.), the Abstract and then the Introduction again look like the best sections to mine in terms of the frequency of such keywords, but Results and especially Discussion seem better in terms of the total quantity of such keywords. The Methods section is clearly appropriate for finding technical data, measurements, and chemicals. With respect to chemicals, again, their context in this section can be completely different from that in the rest of the article.

Conclusions

Extraction of biological information from full text looks promising, but context must be taken into account. Part of this context is given by the position of the text under analysis within the article. Therefore, tuning the extraction of information to the section is probably a good strategy, and for particular tasks some sections should be avoided.

Methods

Derivation of Associations between the words of a section

[...] w_i, w_j [...]

Selection of Keywords

[...] w_i [...] K [...]

Classification of Words in Subjects

[...]

Authors' Contributions

PS carried out the analysis of the keyword distribution from a database of full text articles. CP developed and applied the method to compute keywords. MA prepared the figures (except Figure 2, by PS) and conceptualised the structure of the paper. PB and MA co-directed the project and contributed to the final manuscript. All authors collaborated throughout the project. All authors read and approved the final manuscript.