Introduction

Methods

The National Cancer Institute (NCI) partnered with AskJeeves, Inc to develop a methodology to capture, sample, and analyze 3 months of cancer-related queries on the Ask.com Web site, a prominent US natural-language-processing consumer search engine. At the time of the project, Ask.com was receiving over 35 million queries per month.

Search Terms

(Table 1)

Collecting Queries and Sampling

(Figure 1; Appendix 1)

Very often the same question was entered many times. Thus, these 7500 queries actually represented 76077 queries that were entered into Ask.com, about 37% (76077/204164) of all queries identified as cancer-related during the 3-month period of log analysis. For example, a user question might be "Where can I find information about breast cancer?" This individual example represents 1 user question, but might have been queried by more than 100 people on any given day. Each query was counted only once.

Sampling Issues

The random sample of 7500 individual queries provides a confidence interval of 1.11% at a confidence level of 95%. This means that even if more samples were taken from the 204165 queries, 95% of those samples should not be off by more than 1.1%. However, as the data are categorized and classified, in effect smaller and smaller samples are taken, and the margin of error for each subcategory grows. To offset this problem, additional queries were examined, even though a smaller sample would still have provided a high degree of confidence in the overall results. In other words, although broad generalizations (such as "breast cancer accounts for 25% of all cancer queries") can be easily presented, a large sample size is required to break down the data far enough to conclude that when users ask about breast cancer, they are most often asking about specific types of treatments.
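The 1.11% figure is consistent with the standard margin-of-error formula for a proportion with a finite population correction. The sketch below is an assumption about how such a figure could be derived, not a description of the authors' actual calculation; the function name, the worst-case proportion p = 0.5, and the "quarter of the sample" scenario are illustrative choices. It also shows why the margin widens once the sample is split into smaller subcategories.

```python
import math

Z_95 = 1.96  # normal critical value for a 95% confidence level


def margin_of_error(sample_size, population_size=None, p=0.5, z=Z_95):
    """Margin of error for a proportion; p = 0.5 is the assumed worst case."""
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    if population_size is not None:
        # Finite population correction for sampling without replacement
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error


# Full sample of 7500 drawn from 204165 cancer-related queries
full_sample = margin_of_error(7500, population_size=204165)

# Hypothetical subcategory holding one quarter of the sample (1875 queries)
quarter_sample = margin_of_error(1875)

print(f"full sample:       {full_sample:.2%}")
print(f"quarter of sample: {quarter_sample:.2%}")
```

Under these assumptions the full sample yields a margin of about 1.11%, while a subcategory one quarter that size has roughly twice the margin, which illustrates why a large starting sample is needed for fine-grained breakdowns.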
Highest-Level Categories for Queries Table 2 Cancer (ie, specifically mentioning a cancer type) General Research Treatment Diagnosis and Testing Cause/Risk/Link Coping 32 33 The highest-level categories were populated using proprietary AskJeeves filters and automated-analysis tools that sorted queries according to specific types of cancers, or—in the absence of mentioning a specific cancer type—whether the query asked about other areas such as Treatment or Coping. (AskJeeves did not share the filters and automated analysis tools with the authors.) Queries that could not be sorted by the filters and automated-analysis tools were placed in a temporarily-uncategorized category; they were categorized during the next step (reading and analysis). Reading and analyzing each individual query not only verified the automated process, but also helped to refine existing categories and create new categories and subcategories, as appropriate. For example, without this type of analysis, the query "Where can I find a Web site with information on using high protein food to fight Breast cancer?" might have been left under Breast Cancer > Media and Organizations > Web sites (where ">" indicates a change in category level). This would not be correct, as the true user intent was to inquire about Alternative Treatments. As a result, under the category Breast Cancer > Treatment, "Alternative" was added to the Breast Cancer > Treatment category analysis as a subtopic. (Treatment—without a specific cancer site designated—is both a highest-level category and a subcategory under Breast Cancer and under most cancer types.) Approximately 78% of all categorized queries from the sample referenced a particular type of Cancer, and were placed in the highest-level category Cancer. An example of this kind of query would be "Where can I find information about Breast Cancer?" (This query would be classified as Cancer > Breast Cancer > General Information.) 
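The sorting order described above (specific cancer type first, then the other highest-level categories, then a temporarily-uncategorized bucket) can be sketched as a simple keyword router. The AskJeeves filters were proprietary and were not shared with the authors, so everything below is hypothetical: the keyword lists, the function names, and the whole-word matching rule are illustrative assumptions only.

```python
import re

# Hypothetical keyword lists; the actual AskJeeves filters were proprietary.
CANCER_TYPES = {
    "breast": "Breast",
    "lung": "Lung",
    "leukemia": "Hematologic/Blood",
    "prostate": "Genitourinary",
}
OTHER_CATEGORIES = {
    "Treatment": ["treatment", "radiation", "chemotherapy"],
    "Diagnosis and Testing": ["biopsy", "mri", "ct scan"],
    "Cause/Risk/Link": ["cause", "risk"],
    "Coping": ["support group", "pain", "depression"],
    "General Research": ["research", "clinical trials"],
}
UNCATEGORIZED = "Temporarily Uncategorized"


def contains(query, keyword):
    """True if the keyword occurs as a whole word or phrase in the query."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, query.lower()) is not None


def route(query):
    """Return a highest-level category label for one query."""
    # Step 1: a named cancer type takes priority over everything else.
    for keyword, cancer_type in CANCER_TYPES.items():
        if contains(query, keyword):
            return f"Cancer > {cancer_type}"
    # Step 2: otherwise look for one of the other highest-level categories.
    for category, keywords in OTHER_CATEGORIES.items():
        if any(contains(query, kw) for kw in keywords):
            return category
    # Step 3: leave the query for manual reading and analysis.
    return UNCATEGORIZED


print(route("Where can I find information about Breast Cancer?"))
print(route("How does smoking cause cancer?"))
```

A router like this only assigns the highest level. As the high-protein-food example shows, keyword rules can miss the user's real intent, which is why each query was also read and analyzed individually.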
Any query that did not mention a specific kind of Cancer, even though the question was about cancer, was placed in 1 of the 5 other highest-level categories: General Research, Treatment, Diagnosis and Testing, Cause/Risk/Link, or Coping. An example of this type of query would be "Where can I find information on cancer treatment with radiation?" This query was assigned to the Radiation subcategory in the highest-level category Treatment (ie, it was classified as Treatment > Radiation). Similarly, the query "How does smoking cause cancer?" would be placed in the Cause/Risk/Link category, as it did not refer to any specific type of cancer.

"Cancer" Queries (Related to Specific Cancer Types)

(Table 3)

Privacy Issues

Although NCI helped create the search terms and the categories into which the analyzed data were placed, NCI did not have access to: the raw query logs at AskJeeves, any information about what AskJeeves users did with the searches generated on the AskJeeves Web site (ie, what links they picked), or the identities of any users of the Ask.com Web site. NCI did not require permission from the Institutional Review Board.
Results Frequency of Top-Level Categories Table 2 Cancer (N = 59619, 78.37%) General Research (N = 7808, 10.26%) Treatment (N = 3832, 5.04%) Diagnosis and Testing (N = 3315, 4.36%) Cause/Risk/Link (N = 1249, 1.64%) Coping (N = 254, 0.33%) Table 2 Subdividing Cancer Queries Table 3 Digestive/Gastrointestinal/Bowel (D/G/B) (N = 8959, 15.0%) Breast (N = 6953, 11.7%) Skin (N = 6709, 11.3%) Genitourinary (N = 6250, 10.5%) Hematologic/Blood (N = 5448, 9.2%) Gynecological (N = 5344, 9.0%) Lung (N = 4630, 7.8%) Soft Tissue/Muscle (N = 3954, 6.6%) Lymphoma (N = 3333, 5.6%) Head and Neck (N = 2522, 4.2%) Brain and Neurological (N = 1852, 3.1%) Miscellaneous (N = 1633, 2.7%) Bone (N = 1429, 2.4%) Pediatric (N = 603, 1.0%) Any query specifically mentioning a cancer type by name, was assigned to that subcategory. For example, questions about Breast-Cancer-specific Treatment, Diagnosis and Testing, Causes, and Coping are found in the Cancer > Breast Cancer category, within 1 of the 10 subcategories displaying Breast Cancer information. All questions about Leukemia or Myeloma would be found in Hematologic/Blood, Hodgkin's Disease queries in Lymphoma, and Esophageal cancer questions in D/G/B. The number of subcategories assigned to each of the 14 different cancer types varied somewhat and was driven by the nature and number of the specific queries in those cancer types. Detailed Analysis of Queries Appendix 1 Major observations about the 19 categories and subcategories are noted below, in the order they appear in the Appendix. Our comments emphasize issues related to requested cancer content more than technology issues related to the natural language processing. 1.0 Bone Cancer Appendix 1 2.0 Brain and Neurological Cancer Of the 1852 Brain and Neurological Cancers queries, General Information accounted for the vast majority (N = 1323, 71.44%). There were 427 (23.1%) questions about specific cancer types in this category. 
Some of the cancer-type queries asked about Medulloblastoma, which is typically but not always a Pediatric tumor. As with Bone Cancer above, some questions could have been meaningfully assigned to more than 1 top-level Cancer site category. In this category there were more queries about Symptoms (N = 259, 13.98%) than Treatment (N = 112, 6.05%).

3.0 Breast Cancer

(Appendix 1)

The 10 top-level Breast Cancer subcategories were:
General Information (N = 3423, 49.23%)
Symptoms (N = 889, 12.79%)
Treatment (N = 570, 8.20%)
Media/Organization (N = 428, 6.16%)
Cause/Risk/Link (N = 393, 5.65%)
Diagnosis and Testing (N = 376, 5.41%)
Statistics (N = 274, 3.94%)
Pictures (N = 225, 3.24%)
Type (N = 217, 3.12%)
Definition (N = 158, 2.27%)

(Appendix 1; Table 3)

Even though other cancer types may have been assigned more subcategories than the 10 for Breast, the detail, medical specificity, and technical vocabulary of Breast queries appear to be more complex than those for other Cancer sites, probably reflecting the sophistication of basic research and clinical data on this topic and the relative sophistication of the breast cancer information seekers.

4.0 Cause and Risk

There were 1249 queries in this highest-level category. Without mentioning a specific cancer by name, there were N = 1115 (89.27%) queries about Causes and Links but only N = 134 (10.73%) about Prevention. Among the 1115 queries in the Causes and Links subcategory, the following topics were noted:
Drugs (N = 287, 25.74%)
Unspecified (N = 247, 22.15%) (eg, "What is cause a cancer?" [sic])
Radiation (N = 247, 22.15%)
Personal (N = 116, 10.40%) (eg, "Can anti-persperant [sic] deodorant cause cancer?")
Chemical/Plastics (N = 74, 6.64%)
Environmental (N = 70, 6.28%)
Food Supplement (N = 64, 5.74%)
Genetic Mutation/Virus (N = 10, 0.90%)

Smoking was not in this list, probably because most queries about smoking were included under a query about a specific type of cancer, like Lung or Head and Neck.
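The percentages reported for the Causes and Links subtopics can be recomputed from the counts as a consistency check. The snippet below is only an illustrative sketch; the variable names are hypothetical and the counts are copied from the list above.

```python
# Counts for the Causes and Links subtopics, as reported in the text
causes_and_links = {
    "Drugs": 287,
    "Unspecified": 247,
    "Radiation": 247,
    "Personal": 116,
    "Chemical/Plastics": 74,
    "Environmental": 70,
    "Food Supplement": 64,
    "Genetic Mutation/Virus": 10,
}

# The subtopic counts should add up to the 1115 Causes and Links queries
total = sum(causes_and_links.values())

# Share of each subtopic as a percentage, rounded to 2 decimal places
shares = {
    topic: round(100 * count / total, 2)
    for topic, count in causes_and_links.items()
}

print(f"total queries: {total}")
for topic, share in shares.items():
    print(f"{topic}: {share:.2f}%")
```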
5.0 Coping

There were only 254 queries about Coping. The queries referenced Support Groups (N = 127, 50%), Pain (N = 98, 38.58%), and Depression (N = 29, 11.42%). Even though there were few questions in this highest-level category, the issue was of specific interest to NCI, which asked for this category to be created and analyzed separately.

6.0 Diagnosis and Testing

There were 3315 queries in this highest-level category, which did not mention a specific cancer by name. Most were queries about specific Testing (N = 2842, 85.73%). The others (N = 473, 14.27%) were queries about Symptoms. Among Testing queries, CAT/CT scan (Computerized Axial Tomography/Computed Tomography scan) (N = 1509, 53.10%) and MRI (N = 587, 20.65%) were the most-common Testing topics, followed by Biopsy (N = 502, 17.66%).

7.0 Digestive/Gastrointestinal/Bowel (D/G/B)

(Appendix 1)

General Information (N = 5568, 62.15%)
Symptoms (N = 1506, 16.81%)
Diagnosis and Testing (N = 1125, 12.56%)
Treatment (N = 294, 3.28%)
Statistics (N = 184, 2.05%)
Definition (N = 163, 1.82%)
Cause/Risk/Link (N = 119, 1.33%)

Most queries asked for General Information. Examples of General Information queries would be "Where can I learn about the cancer esophageal cancer?" and "Where can I find information on Stomach cancer?"

(Appendix 1)

Colorectal (N = 4801, 53.59%)
Liver (N = 1413, 15.77%)
Gastrointestinal (stomach) (N = 1094, 12.21%)
Pancreas (N = 965, 10.77%)
Bowel (N = 273, 3.05%)
Esophagus (N = 260, 2.90%)
Other (N = 153, 1.7%)

The organ subsites in Other include Gall Bladder, Bile Duct, Anal, and Abdominal.

(Appendix 1)

The terms Bowel, Gastrointestinal, Stomach, and Abdominal may have been used interchangeably by users. Users appear not to recognize that more specific queries (for sigmoid, rectum, cecum, appendix, transverse colon, small bowel, and stomach [gastric] cancer) would provide much more useful information. For D/G/B, some queries about Liver Metastases were included with queries about primary Liver Cancers.
8.0 General Research There were 7808 queries assigned to the highest-level category General Research, a topic not linked to a specific cancer type. In this category the 5 most-common subcategories were: Research (N = 2819, 36.10%) Organization (N = 1656, 21.21%) Clinical Trials (N = 1272, 16.29%) Concerns (N = 1201, 15.38%) Pictures (N = 559, 7.16%) Among the queries about Organization, there were 1065 queries about the American Cancer Society (ACS) and 223 about the National Cancer Institute (NCI). Among the 1272 queries about Clinical Trials, the most-common 3 questions/topics were: What are ... (N = 634, 49.84%) eg, "What are clinical trials?" Latest ... (N = 260, 20.44%) eg, "latest cancer clinical trial research" Types of ... (N = 111, 8.73%) eg, "types of cancer trials" 9.0 Genitourinary Cancers In decreasing order, the frequency of Genitourinary organ-type queries (N = 6250) in all 12 Genitourinary subcategories including General Information was: Prostate (N = 3141, 50.26%) Testicular (N = 1772, 28.35%) Bladder (N = 708, 11.33%) Kidney (N = 496, 7.94%) Other (N = 133, 2.12%) 34 As with most sites, the most-common Prostate Cancer questions were General Information (N = 1715, 54.6%). For Prostate Cancer, there were more questions about Treatment (N = 460, 14.65%) than Symptoms (N = 364, 11.59%). This may reflect major medical controversies about treatment options and the typically asymptomatic presentation of the disease. For the Genitourinary category as a whole, there were more questions about Symptoms (N = 854, 13.66%) than Treatment (N = 604, 9.66%). Expected misspellings of prostate (prostrate) were noted. 10.0 Gynecological Cancers There were 5344 queries overall. 
The breakdown of subcategories in decreasing frequency was:
General Information (N = 3409, 63.79%)
Symptoms (N = 939, 17.57%)
Diagnosis and Testing (N = 452, 8.46%)
Treatment (N = 247, 4.62%)
Definition (N = 158, 2.96%)
Cause/Risk (N = 83, 1.55%)
Statistics (N = 42, 0.79%)
Prevention (N = 14, 0.26%)

In decreasing order of frequency, the cancer types queried in all 8 Gynecological subcategories included the following:
Ovarian (N = 2031, 38.00%)
Cervical (N = 1924, 36.00%)
Uterine (N = 606, 11.34%)
Endometrial (N = 225, 4.21%)
Vaginal (N = 219, 4.09%)
Vulvar (N = 166, 3.11%)
Other or not specified (N = 173, 3.24%)

There were questions about Endometrial cancer as well as Uterine cancer. These data suggest that Web site information needs to be provided using both labels.

11.0 Head and Neck

There were 2522 queries overall. Most queries asked for General Information (N = 1485, 58.88%). The vocabulary used to ask about specific cancer types within General Information was: Throat, Mouth, Oral, Tongue, Head, and Neck. This vocabulary confirms the need to offer health information using nontechnical words rather than technical terms such as larynx, glottis, pharynx, or nasopharynx. There were 59 questions asking about Definitions of Head and Neck cancer. The anatomy of this cancer type may be less familiar to the general public than that of other sites. There were 422 queries asking for Pictures of Head and Neck Cancer. There were only 47 questions (1.86%) asking about Cause/Risk/Link issues, despite the fact that a great deal is known about the Causes and Prevention of Head and Neck Cancer. There were 418 questions (16.57%) about Symptoms but only 52 (2.06%) about Treatment.

12.0 Hematologic and Blood Cancers

Among 5448 queries in this category, the 5 most common of the 12 subcategories were: General Information (N = 3781, 69.40%), Definition (N = 701, 12.96%), Symptoms (N = 539, 9.89%), Treatments (N = 175, 3.21%), and Organizations (N = 102, 1.87%).
Within General Information users asked about Leukemia (N = 2895, 76.57%), Myeloma (N = 592, 15.66%), Bone Marrow (N = 148, 3.91%), and Blood Cancers (N = 146, 3.86%). Various misspellings of Leukemia were noted and nontechnical terms such as Blood Cancer and Bone Marrow Cancer were frequent. 13.0 Lung Cancer 32 Among Lung Cancer queries, the queries were classified as follows: General Information (N = 3223, 69.61%) Symptoms (N = 530, 11.45%) Cause/Risk/Link (N = 305, 6.59%) Treatment (N = 219, 4.73%) Definition (N = 150, 3.24%) Statistics (N = 113, 2.44%) Diagnosis and Testing (N = 90, 1.94%) In the Cause/Risk/Link category of Lung Cancer, there were only N = 180 queries (59.02%) that asked generally about Causes of Lung Cancer and N = 102 queries (33.44%) that asked specifically about Smoking. There were N = 23 queries (7.54%) asking if Marijuana caused Lung Cancer. Only N = 255 (7.91%) queries within General Information asked about Lung Cancer by (histologic cell) Type, despite the fact that this is a major determinant of triage for treatment. For Lung Cancer > Treatment, there were 219 queries (4.73%). Most Treatment queries were Unspecified (N = 118, 53.88%), eg, "What are treatments for lung cancer?" There were 26 Treatment questions about Cure (11.87%). There were few specific questions about Medications (chemotherapy) (N = 21, 9.59%), Radiation (N = 19, 8.68%), or Surgery (N = 10, 4.57%). Although all numbers were small, there were more questions about Alternative Treatment (N = 13, 5.94%) than Surgery (N = 10, 4.57%). There were only 4 Treatment questions (1.83%) about palliative care, despite the grave prognosis for most Lung Cancers. Clearly the questions about Lung Cancer, the most-common lethal cancer, were far less sophisticated than the questions about either Breast Cancer or Prostate Cancer. 
14.0 Lymphomas Among the 3333 queries about Lymphoma (including both Hodgkin's Disease and Non-Hodgkin's Lymphoma), General Information (N = 2391, 71.74%) questions were the most common. Unlike many cancer types, there was frequent mention of histologic types, as is appropriate, given the wide variety of clinically-different prognoses and treatments in this subcategory. There were many different spellings of Hodgkin's Disease. 15.0 Miscellaneous Cancers There were 1633 queries assigned to this Cancer subcategory. The Miscellaneous Cancers were: Endocrine (N = 901, 55.17%) Neoplasm (N = 272, 16.66%) Kaposis (N = 262, 16.04%) Ocular (N = 179, 10.96%) Germ Cell (N = 19, 1.16%) Several of the Ocular queries, eg, Ocular Melanoma and Retinoblastoma, could have been considered for other subcategories, such as Skin and Pediatric respectively. Germ cell tumors could also have been placed in either Genitourinary or Gynecological subcategories. These ambiguities illustrate the difficulty in categorizing precise user information needs despite the use of natural language processing. 16.0 Pediatric There were only 603 Pediatric queries, and most asked about a specific cancer type (N= 403, 66.83%). There were relatively few General Information queries (N = 81, 13.43%) eg, "where can I find information on children's cancers?" Since patients with Pediatric cancers in the US are usually managed generally by pediatric oncology specialists at major regional medical centers, those seeking Pediatric cancer information are probably directed to specialized Web sites rather than general sites like Ask.com. Of 403 queries for cancer types, the most common were Hematologic/Blood (N = 137, 34%), Neuroblastoma (N = 133, 33%), and Rhabdomyosarcoma (N = 68, 16.87%). There were only 4 questions referring to pediatric Brain and Neurological cancers. 
Since this is such a common Pediatric tumor type, it is possible that some Pediatric neurological tumor questions were assigned to the Brain and Neurological Cancer category even though the questions were really meant to target a Pediatric issue.

17.0 Skin Cancers

Among 6709 queries in this Cancer subcategory, 3596 (53.60%) asked for General Information. As with Lymphoma, there was frequent mention of specific Skin Cancer types (N = 2157, 32.15%), probably because of the significantly-different clinical prognoses and treatments. Only 169 queries (2.52%) asked about Cause/Risk/Link, and 60 queries (0.89%) asked about Prevention, despite the fact that so much is known about these topics for Skin Cancer.

18.0 Soft Tissue Cancers

There were 3954 queries in this Cancer subcategory. Although most appropriately refer to sarcomas of various types, there was a minority of misplaced queries. Some queries appear to reference conditions that are probably benign (Ganglion, Fibroid, Dysplasia, and Lipoma), and others should have been placed in different Cancer subcategories, eg, Brain and Neurological (Oligodendroglioma and Glioma). These will be corrected in later analyses.

19.0 Treatment

In the 3832 highest-level category queries about Treatment, most questions were about a specific Treatment Type (N = 3223, 84.11%), even though no specific cancer was mentioned. Within Treatment > Treatment Type there were many general queries about Chemotherapy (N = 2275, 70.59%). There were questions about general Radiation Therapy (N = 534, 16.57%), and few about specialized Radiation Therapy treatments like Gamma Knife, Laser, and Protons. There were more general questions about Alternative Therapies (N = 239, 7.42%) than Surgery (N = 127, 3.94%). Many Alternative Therapy questions also appear in specific organ-type subcategories, particularly Breast.
Query Frequency Relative to US Incidence of Cancer Types

(Table 4)

The relative percentage of specific organ-type queries exceeds the percentage of annual incidence only for rarer cancers. The difficulty of finding useful information on prominent cancer portals or with standard search engines may be one explanation, although there are others. The comparison is not meant to be definitive, as there are clearly issues with its validity:
Cancer prevalence might be a better benchmark than incidence
US incidence data exclude cases of in situ breast and cervix cancers as well as the very-common basal cell and squamous cell skin cancers
Queries could have come from anywhere in the world, not just the United States
Query total may include those who accessed the site more than once
Queries could have come from individuals who are not newly-diagnosed patients

Other Observations

The query analysis reveals that online users generally seek information about Symptoms and Treatment for specific cancers, rather than about cancers generally. In addition, Symptom queries showed a frequency between 2 and 5 times that of Treatment queries, for most cancers.

For this study we did not specifically target queries about Acquired Immune Deficiency Syndrome (AIDS), even though AIDS can often be associated with Cancer. There were 262 questions about Kaposi's Sarcoma in the Miscellaneous Cancers category.

Discussion

General Information was the largest category for almost all cancers, probably reflecting the nature of the Ask.com consumer search engine. It is a consumer-oriented Web-wide search engine where users tend to seek general information that can help them learn either how or where they should further pursue their inquiries. It is likely that the users are just starting their Web searches on Ask.com and are either not yet interested in asking more-sophisticated questions or do not yet know enough to ask them.
This behavior may not reflect that of users who go directly to a known cancer-information portal with a predetermined need for detailed information. Appendix 1 35 Ask.com users entered both keyword searches and sentence-style queries, despite the fact that this is a natural-language-processing search engine. We recognize that even if users typed in a long query it was still sometimes difficult to discern absolutely what specific information the user needed, particularly since we did not have access to the links users picked. The vocabulary employed by users of Ask.com ranged from unsophisticated to very sophisticated. This suggests that allowing users to employ less-technical language on cancer Web sites would significantly help them find the information they seek. The queries captured for this study undoubtedly reflect the news and research studies in the public arena during the time period from June to August 2001. A different time period would certainly reflect a different distribution. Examples of the kinds of events that could affect the results include the diagnosis or death of a celebrity with cancer, the publication of a major trial about bone marrow transplantation for breast cancer, or the Food and Drug Administration approval of an important new drug. 36 37 38 39 40 41 29 In summary, natural-language-processing tools such as the one used for this study are able to filter and subset raw query data into useful analysis categories. Retrieval and analysis of these data can be used to better understand the actual content users want and the level of understanding and sophistication they have when they come to the Web site. Using the information on a continuing basis can form the basis for updating content on Web sites based on the most-current user needs. 
If a natural-language search engine were offered on a health-information portal, for example, it could improve customer access to desired information, particularly for those users with less sophistication about content or language. Additional analyses of query results are planned for the future. Consideration has been given to piloting the use of natural language processing on subsites of our Web portal.