BACKGROUND 1 Hospitals typically generate a lot of data about patients, their examinations, and their treatments. Due to the patient-oriented workflow, hospital information systems have a strong patient-centric architecture; hence, problems tend to arise whenever information across patients is required. For purposes of business intelligence, data mining and data warehousing can provide an adequate solution. When cross-patient information is needed in a research or training context, collecting medical data will become labor-intensive or practically impossible if the relevant data only exist in a free textual form. A solution for this problem consists in building an index of all free-text words that allows for efficient queries. In our institution, the university hospital closely associated with the largest Belgian university, close to two million radiological reports and corresponding images were made instantaneously searchable by implementing a language-independent, feature-rich indexing solution. This paper hopes to share the insights we experienced during designing, building, and running an indexing system in a production environment for more than 3 years. Extended use of open-source technology dramatically reduced both implementation time and cost. METHODS 1 Fig 1. The architecture of the complete indexing system. Report Warehouse 2 3 4 5 Search Engine With the advent of the Internet and ever-increasing computing power, indexing software became widely available. Our report warehouse made it possible to plug in and evaluate several search engines with a limited set of reports. Due to our characteristic European setting, language independence was needed to support Dutch, French, English, and German reports. The ability to include RIS data in the index as report metadata was also an important factor; this allows searching for (or sorting on) specific authors and requesting physicians and departments and other administrative information. Four more features we required were phrase and wildcard searching, Boolean logic, and the possibility to exclude common stop words. The section on semantics in this paper contains a detailed motivation for why these last four features are essential in a medical context. 6 7 Import Filters Our evaluation showed that the filters provided by most search engines, including Swish-e, had issues with importing our reports. Most problems were related to file formats; old formats were converted incorrectly, international characters often disappeared, and automatically generated text fragments (Microsoft Word AutoText) were not being filled in. One issue was observed across file formats: hyphenated words did not always make it into the index in one piece. 2 Fig 2 Example of a XML file containing the imported report and its corresponding RIS metadata. Efficient Indexing Indexing without tuning becomes very computing-intensive as the collection of document content, and thus the number of word/document combinations, grows. Fortunately, limiting the size of the index itself strongly decreases the computing power needed. In our situation, patient and study IDs in the radiological reports were safely excluded from the index without loss of information; using the metadata from the RIS yielded the same results and was more efficient. A second technique consists in excluding some, but certainly not all, of the very frequent words present in almost all documents. These common stop words were found by building an index and evaluating the significance of the thousand most frequent words. Amongst the 39 manually excluded stop words, good examples like “a,” “the,” “of,” and closing statements like “greetings” were found. The very frequent word “no” was not excluded because its meaning was highly significant in combination with other keywords. In the section on semantics we will show that excluding common stop words leads to semantic advantages. We will also discuss building indexes in an incremental fashion in the section on scalability. User Interface 8 3 Fig 3. The user interface for radiologists and researchers. Alas, trying to comprehend the ranking code of Swish-e is not for the faint of heart because of its inherent complexity and the speed optimizations used. The algorithms take into account the binary structure of the search query, details of keyword occurrences, and the relative frequency of words in the index. Luckily, unlike searching the Internet, queries in a radiological context tend to be very specific (see results), making fine-tuning of ranking unnecessary. RESULTS Preceding the go-live of our indexing solution, over one million reports and associated data were extracted from the RIS. Some filter fine-tuning was necessary to pass all reports through the layer of import filters. The initial import was spread over a 14-day period without performance penalties for production systems. At the time of writing, our report warehouse contained 1.8 million reports and corresponding administrative data, their XML versions and backup copies not included. The system holds 7.2 million files in total. The oldest report dates back to 1989, but not many reports of that era were available in digital form. Each night the index is rebuilt completely, which facilitates early detection of disc decay because all XML reports get reread in the course of this process. Alternatively, incremental reindexing could have been used. Building a full index is very efficient; it only takes 76 min worth of CPU time on 8-year-old hardware to process 104 million words. Currently, the index contains 332,088 unique words in four languages, including names of patients and requesting physicians. Alas, all spelling mistakes ever made by our typists and radiologists are also included. 4 9 Fig 4. The increase in users over time. Indexing reports added useful metadata to our collection of radiological examinations. Early generations of the RIS did not track the supervisor of a radiologist in training; instead, the supervisor’s name was added manually to the text of a report. Indexing made it possible to retrieve historical supervisor information. Having an independent report warehouse available that contains all radiological reports avoids vendor lock-in. During PACS implementation we encountered a 6-month delay in the PACS/RIS front-end integration due to a difficult multivendor situation. We successfully avoided this delay by making historical reports available at the radiologists’ work stations through the report warehouse. Query Analysis The search engine was consulted 7,071 times by our radiologists mainly to find interesting cases for research or training purposes. Occasionally, a query is performed to find a recent clinical exam in a fast and convenient manner. 1 Table 1. The Usage of Search Engine Features Feature n n Keywords only 6,344 (89.7%) 15 (39%) Phrases 215 (3.0%) 14 (37%) Wildcards 208 (2.9%) 14 (37%) Binary operator “and” 325 (4.6%) 15 (39%) Binary operator “not” 117 (1.7%) 9 (24%) Binary operator “or” 83 (1.2%) 8 (21%) Parentheses 40 (0.6%) 6 (16%) 2 5 6 Table 2. Statistical Analysis of the Query Duration and the Number of Results Query Duration (ms) Number of Results Median 31 36 Maximum 2,598 962,629 Mean 129 5,878 Standard deviation 274 59,286 Skewness 4.4 15 Kurtosis 24 237 Fig 5. The cumulative frequency distribution of the query duration (ms). Fig 6. Detail of the frequency distribution of the query duration (ms), showing the influence of caching performed by the operating system. Open Source Our indexing solution is a prime example of an open-source success story; extended use of this technology dramatically reduced both implementation time and cost. Designing and building the system took a mere 6 weeks, which was less than the “red tape” time needed to approve a commercial offering. Apart from the implementation man-hours, the total cost of ownership consists of low-budget hardware and support costs. Support calls are rare due to the high-availability design of the system and the excellent stability of the Swish-e search engine. In fact, users almost never call; the interface is very user friendly and similar to well-known interfaces of Internet search engines. Furthermore, passwords cannot be forgotten because of the single-sign-on mechanism. About four times a year, manual intervention is required when the synchronization link with the RIS breaks down; in this case, the administrator is automatically notified by e-mail. There has only been one hardware failure during the system’s lifetime, a broken power supply, but this did not cause any downtime. 10 DISCUSSION Although indexing technology has undoubtedly been used in a clinical setting before, the authors believe they have followed a novel approach in designing and implementing a complete and scalable indexing solution. This high-availability and high-performance system has been running without issues for more than 3 years in a production environment. Furthermore, the independent and flexible design of the platform allows for hosting other applications; several research projects have already found a home under its wings. In this section, Internet indexing technology, scalability, and semantics are briefly discussed. Internet Indexing Technology 11 12 13 Two aspects of Internet indexing technology could not be applied to our medical documents because of absent hyperlinking. The first was “spidering”: robots crawling the web to search for new and updated material by investigating hyperlinks. However, this did not pose a problem because the RIS kept track of new and updated reports. Secondly, page-ranking algorithms could not use rules based on the number of referrals by other documents. 14 Scalability Generally, the duration of building an index should be less than the desired timeframe in which new or updated documents show up in the system. Build times can increase beyond this point when document access speeds decrease or when the total size of the document collection grows. To counter this limitation, indexes can be built incrementally; however, care must be taken because some indexers do not recalculate frequencies of very common words for automatic exclusion. Similarly, indexes can be built in parallel; both mechanisms make use of index merging. Judging by our current results and the obsolete hardware used, we estimate that our indexing system scales up by two orders of magnitude, the main bottleneck being the nondistributed index. To scale a cross-patient search engine, hospital-wide privacy considerations have to be taken into account. The hospital’s policy on patient privacy should be implemented in strict access rules based on document metadata. Semantics A disadvantage of keyword-based searching is the simple fact that radiologists or researchers want to find the meaningful concepts behind their keywords. To make word indexing more or less suitable for concept searching, four search engine features are necessary. Firstly, because medical terms mostly consist of more than one word, it has to be possible to search for phrases. Page-ranking algorithms can help by giving a higher score to documents with adjacent keywords, but in practice, the sole use of this technique alone is not specific enough. Secondly, very frequent words with low significance can hamper phrase searching and should not be indexed; when “of” and “the” are excluded, the phrases “MRI of the brain” and “MRI brain” yield the same results. A third helpful feature is Boolean logic to facilitate binary operations like “and,” “or,” and “not” in combination with parentheses. Finally, a word-stemming technique can be used to find all words with a common root. Although we did not implement word stemming because our multilanguage set-up required building a separate stemmed index for each language, the use of wildcards turned out to be an adequate alternative. Word stemming is part of the broader method of fuzzy indexing, which can be used to find words with similar pronunciations. 3 Table 3. Refining of a Query by a Typical User Time Query Number of Results 12:00 cirrhose 528 12:02 cirrhose ascites 183 12:02 cirrhose ascites MR not angio* 100 15 17 18 19 20 21 CONCLUSION In this digital age, special care should be taken that the ever-growing amount of on-line information does not deteriorate into an inaccessible swamp of mere bits and bytes. On should strive to provide fast, clear, and easy access to information sources and, hence, increase the total value of stored data. Designing and implementing an indexing solution for a large set of radiological reports and images turned out to be a technically challenging but educational endeavor. For research and training purposes, it certainly is a valuable and convenient addition to our radiology informatics toolbox. The use of open-source technology is highly recommended to reduce both implementation time and cost.