Background

Figure 1

We validated the performance of keywords extracted by our method using a manually annotated corpus of 200 abstracts. We also evaluated the usefulness of our method by sorting differentially expressed genes from a microarray experiment into functional sub-groups. The objective of clustering genes by functional keywords is to identify and summarize potential functional gene groups and to complement conventional clustering of gene expression data.

Methodology

Gene/Protein name and synonym dictionary creation

Keywords extraction from biomedical literature

In our study each gene is represented by a list of keywords extracted from MEDLINE abstract sentences, MeSH terms and GO terms. The procedure for extracting keywords from each data source is discussed below.

MEDLINE abstracts keywords extraction

Gene-name normalization

This process replaces every gene name in the abstract with its unique canonical identifier (Entrez gene ID) using the gene-synonym dictionary specially constructed for this study.

Sentence filtering

(see supplementary material)

Sentence keyword extraction

Sentence: BRCA1 physically associates with p53 and stimulates its transcriptional activity.
Brill-POS-tagged sentence: BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.
Sentence keywords: associates, stimulates, transcription activity
Sentence keywords after manual curation: transcription activity

MeSH keywords extraction

To extract MeSH keywords, we searched the titles and abstracts of the MEDLINE citations related to each gene for the gene names in our gene lists and extracted the MeSH terms associated with each gene.
The extracted gene-MeSH term list was represented by scores indicating the frequency of gene-MeSH term co-occurrence. Initial tests showed that certain MeSH keywords in the list were common biological terms that are of little use for gene annotation (e.g., human, DNA, animal, Support U.S. Govt). A collection of MeSH stop words was created manually and these terms were removed from the gene-MeSH term lists. Finally, from the filtered gene-MeSH lists, the 20 highest-frequency MeSH terms associated with each gene were taken as the MeSH keywords for that gene. For example, the MeSH keywords associated with the gene “FOS” in our gene list are oncogene, felypressin, transcription-factor, thermoreceptors, DNA-binding, antibiosis, inflammatory-response, zinc-fingers, gene-regulation, and neuronal-plasticity.

GO keyword extraction

We used the annotations provided by Gene Ontology to extract the GO keywords associated with each gene. Of the three GO annotation categories we included only molecular function and biological process, as we believe that cellular component (e.g., nucleus, cell membrane) is less important for characterizing genes in the context of this study. Further, because GO is hierarchical and allows multiple inheritance, we consider every ancestor up to level 2 of the GO tree when assigning GO keywords. This enables us to use more generalized GO terms. For example, the GO keywords associated with the gene “FOS” in our gene list are protein-dimerization, DNA binding, RNA polymerase, transcription factor, DNA methylation, inflammatory-response, and nucleus.
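The MeSH keyword selection described under MeSH keywords extraction (co-occurrence counting, stop-word removal, then the highest-frequency terms per gene) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `top_mesh_keywords`, the input shape (one pair of gene set and MeSH term list per citation), the three-term stop list and the toy citations are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop list; the manually built list used in the study is not reproduced here.
MESH_STOP_TERMS = {"human", "dna", "animal"}


def top_mesh_keywords(citations, gene, stop_terms=MESH_STOP_TERMS, k=20):
    """Return up to k MeSH terms that co-occur most often with `gene`.

    `citations` is an iterable of (genes_mentioned, mesh_terms) pairs, one per
    MEDLINE citation. This input shape is an assumption for illustration.
    """
    counts = Counter()
    for genes_mentioned, mesh_terms in citations:
        if gene not in genes_mentioned:
            continue
        # Count each term once per citation, preserving the listed order.
        for term in dict.fromkeys(mesh_terms):
            if term.lower() not in stop_terms:
                counts[term] += 1
    return [term for term, _ in counts.most_common(k)]


# Toy citations, invented for the example.
citations = [
    ({"FOS"}, ["Human", "Oncogene", "Transcription-Factor"]),
    ({"FOS", "JUN"}, ["Oncogene", "DNA", "DNA-Binding"]),
    ({"JUN"}, ["Animal", "Zinc-Fingers"]),
]
fos_keywords = top_mesh_keywords(citations, "FOS")
print(fos_keywords)
```

With `k=20` as the default, the cut-off mirrors the 20 terms per gene mentioned in the text; stop terms are compared case-insensitively, which is a design choice of this sketch.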
Keyword representation and calculation of numeric vectors

(see supplementary material)

Gene clustering

Results and discussion

Evaluation

(see supplementary material)

With this text corpus we were able to construct a matrix containing all 20 genes and their associated keywords and keyword frequencies from abstracts, MeSH terms and GO terms. The manually annotated corpus of 200 abstracts and the matrix of 20 annotated genes served as the gold standard for our evaluation experiments. We carried out four evaluation experiments:

(1) Abstract keywords (baseline). Extracts gene annotation terms based on term frequency × inverse document frequency (TF*IDF) within the entire abstract, without regard to sentence structure.
(2) Sentence keywords. Extracts gene annotation terms based on sentence-level keywords.
(3) Sentence + MeSH keywords. As in (2) above plus MeSH terms (see Section MeSH keywords extraction).
(4) Sentence + MeSH + GO keywords. As in (2) above plus MeSH terms (see Section MeSH keywords extraction) and GO terms (see Section GO keyword extraction).

(see supplementary material)

We notice that the baseline method based on TF*IDF keywords fares worst among all four approaches. We interpret this as evidence for the validity of the methods involving sentence-level processing, as sentence-level information is likely to carry the most specific characterizing terms. The ‘brute-force’ abstract-level processing will have difficulty extracting these terms correctly and consistently. We further notice substantial improvements in precision and recall when we include MeSH terms and GO terms. This may be because these two categories are more specific, and because MeSH and GO annotations are made from full papers, so the biological functions and processes they capture are not described in all abstracts.
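As an illustration of how precision and recall can be computed for one gene against a gold-standard keyword set, consider the minimal sketch below. The function name `precision_recall` and both keyword sets are invented for the example; they are not the study's data or results, and the matching here is exact string equality, which is an assumption.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted keywords against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical keyword sets for a single gene, invented for illustration.
gold = {"transcription activity", "dna binding", "oncogene", "inflammatory response"}
extracted = {"transcription activity", "oncogene", "associates"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three extracted keywords are in the gold standard (precision 2/3) and two of the four gold keywords are recovered (recall 1/2).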
Clustering of genes resulting from microarray experiment

(see supplementary material)

Figure 2, Figure 3, Figure 4

The clustograms depict associations between genes and biological function/process terms derived from the abstracts obtained with the various gene lists. For the investigating scientist, the clustograms fulfill the following main functions:

(1) Squares highlighted in a horizontal line link a gene to one or more biological functions or processes. This shows which genes are associated with which functions/processes and which genes have few or many associations. The interpretation of "many" and "few" depends on the associated biological function/process categories, on the particular scientific question under investigation, and on how extensively a particular gene has been researched and reported in the literature.

(2) Users may visually delineate clusters, i.e., rectangular areas containing many highlighted squares and surrounded by few highlighted squares. Any such cluster, small or large, is potentially useful. Each cluster identified in this way relates a set of genes to a group of biological functions and processes. In a sense, each gene in the cluster is characterized by the same set of biological function and process concepts, a kind of ‘guilt by association’. This information is useful because it provides clues to the roles genes may play collectively in pathways, and to the functions, processes and possible phenotypes associated with these pathways.
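The idea behind reading a clustogram can be illustrated with a small binary gene-by-keyword matrix. The sketch below is an assumption-laden illustration and not the clustering procedure used in this study: the gene names, keyword sets, the helper names `binary_matrix` and `jaccard`, and the use of Jaccard similarity to quantify shared keywords are all hypothetical.

```python
def binary_matrix(gene_keywords):
    """Build a gene-by-keyword 0/1 matrix from a gene -> keyword-set mapping."""
    keywords = sorted({kw for kws in gene_keywords.values() for kw in kws})
    genes = sorted(gene_keywords)
    rows = {g: [1 if kw in gene_keywords[g] else 0 for kw in keywords] for g in genes}
    return genes, keywords, rows


def jaccard(a, b):
    """Share of keywords common to both genes among all keywords of either gene."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder genes and keywords, invented for illustration.
gene_keywords = {
    "GENE_A": {"transcription factor", "DNA binding"},
    "GENE_B": {"transcription factor", "DNA binding", "apoptosis"},
    "GENE_C": {"cell migration"},
}

genes, keywords, rows = binary_matrix(gene_keywords)
for g in genes:
    # '#' marks a highlighted square, '.' an empty one.
    print(g, "".join("#" if v else "." for v in rows[g]))

sim_ab = jaccard(gene_keywords["GENE_A"], gene_keywords["GENE_B"])
sim_ac = jaccard(gene_keywords["GENE_A"], gene_keywords["GENE_C"])
print(sim_ab, sim_ac)
```

In this toy example GENE_A and GENE_B share most of their keywords and would fall into one visual block, while GENE_C shares none with either.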
Summary of analysis of EGF cluster, G(EGF)

Figure 2, Figure 2a, Figure 2b

Summary of analysis of S1P cluster, G(S1P)

Figure 3, Figure 3a, Figure 3b

Summary of analysis of the common gene cluster, G(COM)

Figure 4, Figure 4a, Figure 4b

Figure 2b, Figure 3b, Figure 4b

Conclusion

The sequencing of whole genomes and the introduction of high-throughput analysis (e.g., oligonucleotide and cDNA chips, MALDI/SELDI-TOF MS) provide biomedical research with a global perspective, which necessitates the development of novel mining tools to explore and interpret data in a timely manner. This paper presents a novel approach that combines sentence-level keywords with GO and MeSH terms. In our evaluation experiment, this approach has shown promising results. The evaluation suggests that this approach provides more specific information than typical approaches using abstract-level information, particularly when the sentence-level terms are complemented by MeSH and GO terms. Further, clustering genes into functional groups based on literature keywords has the potential to help biologists identify and characterize informative genes of interest for further investigation.

Future work

Future enhancements of the system will include additional data resources (OMIM, DIP, KEGG) and the generation of association rules to identify correlations among genes in the same cluster. Association rules between genes in the same cluster seem particularly interesting because they allow one to find regularities between gene groups. Finally, abstracts were used in this study because they are readily available, but they are limited in content. Because full text contains a larger number of irrelevant sentences than abstracts, this approach may also be useful for full-text analysis, as it filters out irrelevant sentences before clustering.
We plan to repeat the current study with full-text articles and compare the results with those obtained from abstracts; this work is under way.

Supplementary material

Data 1