Robust Test Collections for Retrieval Evaluation

Ben Carterette
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts Amherst
Amherst, MA 01003
carteret@cs.umass.edu

ABSTRACT

Low-cost methods for acquiring relevance judgments can be a boon to researchers who need to evaluate new retrieval tasks or topics but do not have the resources to make thousands of judgments. While these judgments are very useful for a one-time evaluation, it is not clear that they can be trusted when re-used to evaluate new systems. In this work, we formally define what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments. We then present a method for augmenting a set of relevance judgments with relevance estimates that require no additional assessor effort. Using this method practically guarantees reusability: with as few as five judgments per topic taken from only two systems, we can reliably evaluate a larger set of ten systems. Even the smallest sets of judgments can be useful for evaluation of new systems.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]; H.3.4 [Systems and Software]: Performance Evaluation, Reliability

1. INTRODUCTION

Consider an information retrieval researcher who has invented a new retrieval task. She has built a system to perform the task and wants to evaluate it. Since the task is new, it is unlikely that there are any extant relevance judgments. She does not have the time or resources to judge every document, or even every retrieved document. She can only judge the documents that seem to be the most informative and stop when she has a reasonable degree of confidence in her conclusions. But what happens when she develops a new system and needs to evaluate it? Or another research group decides to implement a system to perform the task? Can they reliably reuse the original judgments? Can they evaluate without more relevance judgments?

Evaluation is an important aspect of information retrieval research, but it is only a semi-solved problem: for most retrieval tasks, it is impossible to judge the relevance of every document; there are simply too many of them. The solution used by NIST at TREC (Text REtrieval Conference) is the pooling method [19, 20]: all competing systems contribute N documents to a pool, and every document in that pool is judged. This method creates large sets of judgments that are reusable for training or evaluating new systems that did not contribute to the pool [21].

This solution is not adequate for our hypothetical researcher. The pooling method gives thousands of relevance judgments, but it requires many hours of (paid) annotator time. As a result, there has been a slew of recent papers on reducing annotator effort in producing test collections: Cormack et al. [11], Zobel [21], Sanderson and Joho [17], Carterette et al. [8], and Aslam et al. [4], among others. As we will see, the judgments these methods produce can significantly bias the evaluation of a new set of systems.

Returning to our hypothetical researcher, can she reuse her relevance judgments? First we must formally define what it means to be reusable. In previous work, reusability has been tested by simply assessing the accuracy of a set of relevance judgments at evaluating unseen systems.
While we can say that it was right 75% of the time, or that it had a rank correlation of 0.8, these numbers have no predictive power: they do not tell us which systems are likely to be wrong or how confident we should be in any one. We need a more careful definition of reusability.

Specifically, the question of reusability is not how accurately we can evaluate new systems. A malicious adversary can always produce a new ranked list that has not retrieved any of the judged documents. The real question is how much confidence we have in our evaluations, and, more importantly, whether we can trust our estimates of confidence. Even if confidence is not high, as long as we can trust it, we can identify which systems need more judgments in order to increase confidence. Any set of judgments, no matter how small, becomes reusable to some degree.

Small, reusable test collections could have a huge impact on information retrieval research. Research groups would be able to share the relevance judgments they have done in-house for pilot studies, new tasks, or new topics. The amount of data available to researchers would grow exponentially over time.

2. ROBUST EVALUATION

Above we gave an intuitive definition of reusability: a collection is reusable if we can trust our estimates of confidence in an evaluation. By that we mean that if we have made some relevance judgments and have, for example, 75% confidence that system A is better than system B, we would like there to be no more than a 25% chance that our assessment of the relative quality of the systems will change as we continue to judge documents. Our evaluation should be robust to missing judgments.

In our previous work, we defined confidence as the probability that the difference in an evaluation measure calculated for two systems is less than zero [8]. This notion of confidence is defined in the context of a particular evaluation task that we call comparative evaluation: determining the sign of the difference in an evaluation measure. Other evaluation tasks could be defined; estimating the magnitude of the difference or the values of the measures themselves are examples that entail different notions of confidence. We therefore see confidence as a probability estimate.

One of the questions we must ask about a probability estimate is what it means. What does it mean to have 75% confidence that system A is better than system B? As described above, we want it to mean that if we continue to judge documents, there will only be a 25% chance that our assessment will change. If this is what it means, we can trust the confidence estimates. But how do we know it has that meaning? Our calculation of confidence rested on an assumption about the probability of relevance of unjudged documents, specifically that each unjudged document was equally likely to be relevant or nonrelevant. This assumption is almost certainly not realistic in most IR applications. As it turns out, it is this assumption that determines whether the confidence estimates can be trusted. Before elaborating on this, we formally define confidence.

2.1 Estimating Confidence

Average precision (AP) is a standard evaluation metric that captures both the ability of a system to rank relevant documents highly (precision) as well as its ability to retrieve relevant documents (recall). It is typically written as the mean precision at the ranks of relevant documents:

$$AP = \frac{1}{|R|} \sum_{i \in R} prec@r(i)$$

where $R$ is the set of relevant documents and $r(i)$ is the rank of document $i$.
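As a concrete companion to this definition, the following sketch (our own; the function name and inputs are illustrative, not from the paper) computes average precision for a ranked list given binary relevance judgments, assuming for the moment that the judgments are complete:

```python
def average_precision(ranked_docs, relevant):
    """Mean of precision at the ranks of the relevant documents.

    ranked_docs: document ids in ranked order.
    relevant: set of ids judged relevant (assumed complete here).
    """
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # prec@r(i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# A three-document list with the first and third documents relevant:
print(average_precision(["B", "A", "C"], {"B", "C"}))  # (1 + 2/3) / 2 = 0.833...
```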
Let $X_i$ be a random variable indicating the relevance of document $i$. If documents are ordered by rank, we can express precision as

$$prec@i = \frac{1}{i} \sum_{j=1}^{i} X_j.$$

Average precision then becomes the quadratic form

$$AP = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \frac{X_i}{i} \sum_{j=1}^{i} X_j = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \sum_{j \geq i} a_{ij} X_i X_j$$

where $a_{ij} = 1/\max\{r(i), r(j)\}$. Using $a_{ij}$ instead of $1/i$ allows us to number the documents arbitrarily. To see why this is true, consider a toy example: a list of 3 documents with relevant documents B, C at ranks 1 and 3 and nonrelevant document A at rank 2. Average precision will be

$$\frac{1}{2}\left(\tfrac{1}{1} x_B^2 + \tfrac{1}{2} x_B x_A + \tfrac{1}{3} x_B x_C + \tfrac{1}{2} x_A^2 + \tfrac{1}{3} x_A x_C + \tfrac{1}{3} x_C^2\right) = \frac{1}{2}\left(1 + \frac{2}{3}\right)$$

because $x_A = 0$, $x_B = 1$, $x_C = 1$. Though the ordering B, A, C is different from the labeling A, B, C, it does not affect the computation.

We can now see that average precision itself is a random variable with a distribution over all possible assignments of relevance to all documents. This random variable has an expectation, a variance, confidence intervals, and a certain probability of being less than or equal to a given value. All of these are dependent on the probability that document $i$ is relevant: $p_i = p(X_i = 1)$.

Suppose in our previous example we do not know the relevance judgments, but we believe $p_A = 0.4$, $p_B = 0.8$, $p_C = 0.7$. We can then compute, e.g., $P(AP = 0) = 0.2 \cdot 0.6 \cdot 0.3 = 0.036$, or $P\big(AP = \tfrac{1}{2}(\tfrac{1}{2} + \tfrac{2}{3})\big) = 0.2 \cdot 0.4 \cdot 0.7 = 0.056$. Summing over all possibilities, we can compute expectation and variance:

$$E[AP] \approx \frac{1}{\sum_i p_i} \left( \sum_i a_{ii} p_i + \sum_{j>i} a_{ij} p_i p_j \right)$$

$$Var[AP] \approx \frac{1}{\left(\sum_i p_i\right)^2} \left( \sum_i a_{ii}^2 p_i q_i + \sum_{j>i} a_{ij}^2 p_i p_j (1 - p_i p_j) + \sum_{i \neq j} 2 a_{ii} a_{ij} p_i p_j (1 - p_i) + \sum_{k > j \neq i} 2 a_{ij} a_{ik} p_i p_j p_k (1 - p_i) \right)$$

where $q_i = 1 - p_i$. AP asymptotically converges to a normal distribution with expectation and variance as defined above.¹

¹These are actually approximations to the true expectation and variance, but the error is a negligible $O(n2^{-n})$.

For our comparative evaluation task we are interested in the sign of the difference in two average precisions: $\Delta AP = AP_1 - AP_2$. As we showed in our previous work, $\Delta AP$ has a closed form when documents are ordered arbitrarily:

$$\Delta AP = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \sum_{j \geq i} c_{ij} X_i X_j, \qquad c_{ij} = a_{ij} - b_{ij}$$

where $b_{ij}$ is defined analogously to $a_{ij}$ for the second ranking. Since AP is normal, $\Delta AP$ is normal as well, meaning we can use the normal cumulative density function to determine the confidence that a difference in AP is less than zero.

Since topics are independent, we can easily extend this to mean average precision (MAP). MAP is also normally distributed; its expectation and variance are:

$$E_{MAP} = \frac{1}{T} \sum_{t \in T} E[AP_t] \qquad (1)$$

$$V_{MAP} = \frac{1}{T^2} \sum_{t \in T} Var[AP_t]$$

$$\Delta MAP = MAP_1 - MAP_2$$

Confidence can then be estimated by calculating the expectation and variance and using the normal density function to find $P(\Delta MAP < 0)$.

2.2 Confidence and Robustness

Having defined confidence, we turn back to the issue of trust in confidence estimates, and show how it ties into the robustness of the collection to missing judgments.

Let $Z$ be the set of all pairs of ranked results for a common set of topics. Suppose we have a set of $m$ relevance judgments $x^m = \{x_1, x_2, \ldots, x_m\}$ (using lowercase $x$ rather than capital $X$ to distinguish judged from unjudged documents); these are the judgments against which we compute confidence. Let $Z_\alpha$ be the subset of pairs in $Z$ for which we predict that $\Delta MAP < 0$ with confidence $\alpha$ given the judgments $x^m$. For the confidence estimates to be accurate, we need at least $\alpha \cdot |Z_\alpha|$ of these pairs to actually have $\Delta MAP < 0$ after we have judged every document. If they do, we can trust the confidence estimates; our evaluation will be robust to missing judgments.
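To make these estimates concrete, here is a minimal sketch (ours, not the paper's code; it assumes NumPy and SciPy) of the closed-form approximations above: expected AP and its variance for one topic from the probabilities $p_i$, and the confidence $P(\Delta MAP < 0)$ from per-topic expectations and variances of $\Delta AP$ (which can be obtained, e.g., from the same approximations with $c_{ij}$ in place of $a_{ij}$):

```python
import numpy as np
from scipy.stats import norm

def exp_var_ap(ranks, p):
    """Approximate E[AP] and Var[AP] for one topic and one ranking.

    ranks[i]: 1-based rank of document i in the system's list.
    p[i]: estimated probability that document i is relevant.
    Uses a_ij = 1 / max(r(i), r(j)) as in Section 2.1.
    """
    ranks = np.asarray(ranks, dtype=float)
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    n = len(p)
    a = 1.0 / np.maximum.outer(ranks, ranks)  # a_ij
    sum_p = p.sum()

    e = np.sum(np.diag(a) * p)                # sum_i a_ii p_i
    v = np.sum(np.diag(a) ** 2 * p * q)       # first variance term
    for i in range(n):
        for j in range(n):
            if j > i:
                e += a[i, j] * p[i] * p[j]
                v += a[i, j] ** 2 * p[i] * p[j] * (1 - p[i] * p[j])
            if j != i:
                v += 2 * a[i, i] * a[i, j] * p[i] * p[j] * (1 - p[i])
                for k in range(j + 1, n):
                    if k != i:
                        v += 2 * a[i, j] * a[i, k] * p[i] * p[j] * p[k] * (1 - p[i])
    return e / sum_p, v / sum_p ** 2

def confidence(e_dap, v_dap):
    """P(dMAP < 0) given per-topic E[dAP_t] and Var[dAP_t], as in Eq. 1."""
    T = len(e_dap)
    e, v = sum(e_dap) / T, sum(v_dap) / T ** 2
    return norm.cdf(0.0, loc=e, scale=np.sqrt(v))

# Toy example from Section 2.1: documents A, B, C at ranks 2, 1, 3.
print(exp_var_ap(ranks=[2, 1, 3], p=[0.4, 0.8, 0.7]))
```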
If our confidence estimates are based on unrealistic assumptions, we cannot expect them to be accurate. The assumptions they are based on are the probabilities of relevance $p_i$. We need these to be realistic.

We argue that the best possible distribution of relevance $p(X_i)$ is the one that explains all of the data (all of the observations made about the retrieval systems) while at the same time making no unwarranted assumptions. This is known as the principle of maximum entropy [13]. The entropy of a random variable $X$ with distribution $p(X)$ is defined as

$$H(p) = -\sum_i p(X = i) \log p(X = i).$$

This has found a wide array of uses in computer science and information retrieval. The maximum entropy distribution is the one that maximizes $H$. This distribution is unique and has an exponential form. The following theorem shows the utility of a maximum entropy distribution for relevance when estimating confidence.

Theorem 1. If $p(X^n \mid I, x^m) = \arg\max_p H(p)$, confidence estimates will be accurate.

Here $x^m$ is the set of relevance judgments defined above, $X^n$ is the full set of documents whose relevance we wish to estimate, and $I$ is some information about the documents (unspecified as of now). We forgo the proof for the time being, but it is quite simple. The theorem says that the better the estimates of relevance, the more accurate the evaluation. The task of creating a reusable test collection thus becomes the task of estimating the relevance of unjudged documents.

The theorem and its proof say nothing whatsoever about the evaluation metric. The probability estimates are entirely independent of the measure we are interested in. This means the same probability estimates can tell us about average precision as well as precision, recall, bpref, etc. Furthermore, we could assume that the relevance of documents $i$ and $j$ is independent and achieve the same result, which we state as a corollary:

Corollary 1. If $p(X_i \mid I, x^m) = \arg\max_p H(p)$, confidence estimates will be accurate.

The task therefore becomes the imputation of the missing values of relevance. The theorem implies that the closer we get to the maximum entropy distribution of relevance, the closer we get to robustness.

3. PREDICTING RELEVANCE

In our statement of Theorem 1, we left the nature of the information $I$ unspecified. One of the advantages of our confidence estimates is that they admit information from a wide variety of sources; essentially anything that can be modeled can be used as information for predicting relevance.

A natural source of information is the retrieval systems themselves: how they ranked the judged documents, how often they failed to rank relevant documents, how they perform across topics, and so on. If we treat each system as an information retrieval expert providing an opinion about the relevance of each document, the problem becomes one of expert opinion aggregation. This is similar to the metasearch or data fusion problem in which the task is to take k input systems and merge them into a single ranking. Aslam et al. [3] previously identified a connection between evaluation and metasearch. Our problem has two key differences:

1. We explicitly need probabilities of relevance that we can plug into Eq. 1; metasearch algorithms have no such requirement.

2. We are accumulating relevance judgments as we proceed with the evaluation and are able to re-estimate relevance given each new judgment.

In light of (1) above, we introduce a probabilistic model for expert combination.
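Before presenting that model, a small numerical aside (our own illustration, not from the paper) shows why the choice of $p(X_i)$ matters: with no information $I$ at all, the maximum entropy distribution for a binary relevance variable is uniform, which is exactly the "equally likely relevant or nonrelevant" assumption questioned in Section 2; evidence from judgments and system rankings constrains the distribution away from that default.

```python
import numpy as np

def bernoulli_entropy(p):
    """H(p) = -sum_x p(x) log p(x) for a binary relevance variable X."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in (0.5, 0.7, 0.9, 0.99):
    print(f"p(X=1) = {p:.2f}  ->  H = {bernoulli_entropy(p):.3f} bits")
# Entropy is maximized at p = 0.5; committing to any other value without
# supporting evidence is an unwarranted assumption in the maximum entropy sense.
```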
3.1 A Model for Expert Opinion Aggregation

Suppose that each expert $j$ provides a probability of relevance $q_{ij} = p_j(X_i = 1)$. The information about the relevance of document $i$ will then be the set of $k$ expert opinions $I = \mathbf{q}_i = (q_{i1}, q_{i2}, \cdots, q_{ik})$. The probability distribution we wish to find is the one that maximizes the entropy of $p_i = p(X_i = 1 \mid \mathbf{q}_i)$. As it turns out, finding the maximum entropy model is equivalent to finding the parameters that maximize the likelihood [5]. Blower [6] explicitly shows that finding the maximum entropy model for a binary variable is equivalent to solving a logistic regression. Then

$$p_i = p(X_i = 1 \mid \mathbf{q}_i) = \frac{\exp\left(\sum_{j=1}^{k} \lambda_j q_{ij}\right)}{1 + \exp\left(\sum_{j=1}^{k} \lambda_j q_{ij}\right)} \qquad (2)$$

where $\lambda_1, \cdots, \lambda_k$ are the regression parameters. We include a beta prior for $p(\lambda_j)$ with parameters $\alpha, \beta$. This can be seen as a type of smoothing to account for the fact that the training data is highly biased.

This model has the advantage of including the statistical dependence between the experts. A model of the same form was shown by Clemen & Winkler to be the best for aggregating expert probabilities [10]. A similar maximum-entropy-motivated approach has been used for expert aggregation [15]. Aslam & Montague [1] used a similar model for metasearch, but assumed independence among experts.

Where do the $q_{ij}$s come from? Using raw, uncalibrated scores as predictors will not work because score distributions vary too much between topics. A language modeling ranker, for instance, will typically give a much higher score to the top retrieved document for a short query than to the top retrieved document for a long query. We could train a separate predicting model for each topic, but that does not take advantage of all of the information we have: we may only have a handful of judgments for a topic, not enough to train a model to any confidence. Furthermore, it seems reasonable to assume that if an expert makes good predictions for one topic, it will make good predictions for other topics as well. We could use a hierarchical model [12], but that will not generalize to unseen topics. Instead, we will calibrate the scores of each expert individually so that scores can be compared both within topic and between topics. Thus our model takes into account not only the dependence between experts, but also the dependence between experts' performances on different tasks (topics).

3.2 Calibrating Experts

Each expert gives us a score and a rank for each document. We need to convert these to probabilities. A method such as the one used by Manmatha et al. [14] could be used to convert scores into probabilities of relevance. The pairwise preference method of Carterette & Petkova [9] could also be used, interpreting the ranking of one document over another as an expression of preference.

Let $q^*_{ij}$ be expert $j$'s self-reported probability that document $i$ is relevant. Intuitively it seems clear that $q^*_{ij}$ should decrease with rank, and it should be zero if document $i$ is unranked (the expert did not believe it to be relevant). The pairwise preference model can handle these two requirements easily, so we will use it. Let $\theta_{r_j(i)}$ be the relevance coefficient of the document at rank $r_j(i)$. We want to find the $\theta$s that maximize the likelihood function $L_{jt}(\Theta)$, defined over the ranks $r_j(i)$.
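To make the aggregation model of Section 3.1 concrete, here is a minimal sketch (our own, using scikit-learn; the function names, the toy data, and the L2 penalty standing in for the beta prior on the $\lambda_j$ are assumptions of this sketch, not the paper's implementation) that fits Eq. 2 on the judged documents and predicts probabilities of relevance for the rest:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_relevance_model(Q, judged_idx, judgments):
    """Fit p(X_i = 1 | q_i) = exp(sum_j lambda_j q_ij) / (1 + exp(...)), as in Eq. 2.

    Q[i, j]: calibrated probability q_ij from expert (system) j for document i.
    judged_idx, judgments: indices of judged documents and their labels x_i.
    An L2 penalty (C) stands in, loosely, for the paper's beta prior smoothing.
    """
    model = LogisticRegression(C=1.0, fit_intercept=False)
    model.fit(Q[judged_idx], judgments)
    return model

def predict_relevance(model, Q):
    """Return p_i = p(X_i = 1 | q_i) for every document."""
    return model.predict_proba(Q)[:, 1]

# Hypothetical example: 3 experts, 6 documents, 3 of which have been judged.
Q = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.1, 0.4],
              [0.6, 0.7, 0.5],
              [0.8, 0.9, 0.9],
              [0.1, 0.2, 0.1],
              [0.5, 0.4, 0.6]])
model = fit_relevance_model(Q, judged_idx=[0, 1, 3], judgments=[1, 0, 1])
print(predict_relevance(model, Q))  # these p_i feed the estimates of Section 2.1
```

Because judgments accumulate during evaluation (point 2 of Section 3), each new judgment can simply be appended to the judged set and the model refit before recomputing confidence.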