File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-1051_metho.xml
Size: 11,270 bytes
Last Modified: 2025-10-06 14:10:11
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1051"> <Title>A Machine Learning based Approach to Evaluating Retrieval Systems</Title>
<Section position="4" start_page="400" end_page="401" type="metho"> <SectionTitle> 3 Machine learning based Pooling </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="400" end_page="400" type="sub_section"> <SectionTitle> 3.1 General framework </SectionTitle>
<Paragraph position="0"> Let M denote the topic set size available for training, N the number of participating systems, k1 the pool depth used to collect training data from each participating system, and K the final pool size.</Paragraph>
<Paragraph position="1"> The training process consists of two main steps.</Paragraph>
<Paragraph position="2"> First, for each training topic, the first k1 documents of all N systems are gathered and the assessors are asked to judge all of these documents. Let T denote the outcome of this assessment step over all M topics.</Paragraph>
<Paragraph position="3"> From the information in T, a function f will then be learned which assigns to each document a value corresponding to its degree of relevance for a given query.</Paragraph>
<Paragraph position="4"> At usage time, for each given topic, the complete retrieved lists of the N systems will be fused. These documents will then be sorted in decreasing order of their values under f, and the K top documents will be sent to the assessor for judgment. This last set of judgments will be the qrels used for the system evaluation.</Paragraph>
<Paragraph position="5"> In this framework, it is clear that the second step plays the major role. An effective scoring function can substantially reduce the workload of the final assessment step. We will now focus on methods for estimating such a scoring function.</Paragraph>
</Section>
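To make the two-step framework concrete, here is a minimal Python sketch (our own illustration; the function and variable names are not from the paper) of building the depth-k1 training pool and, at usage time, the final pool of K documents from a learned scoring function f. Here runs_by_topic maps each topic to the list of ranked document-id lists of the N systems.

    # Illustrative sketch of the two-step pooling framework; names are ours.

    def build_training_pool(runs_by_topic, k1):
        """Step 1: for each training topic, pool the first k1 documents of all
        N participating systems; these documents are judged by the assessors."""
        pools = {}
        for topic, runs in runs_by_topic.items():   # runs: one ranked doc-id list per system
            pool = set()
            for ranked_docs in runs:
                pool.update(ranked_docs[:k1])
            pools[topic] = pool
        return pools

    def build_final_pool(runs, scorer, topic, K):
        """Usage time: fuse the complete retrieved lists, score every candidate
        with the learned function f (here 'scorer'), and send the K top-scoring
        documents to the assessor for judgment."""
        candidates = set()
        for ranked_docs in runs:
            candidates.update(ranked_docs)
        ranked = sorted(candidates, key=lambda d: scorer(topic, d), reverse=True)
        return ranked[:K]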
<Section position="2" start_page="400" end_page="401" type="sub_section"> <SectionTitle> 3.2 Document ranking principle </SectionTitle>
<Paragraph position="0"> The scoring function f can be estimated in different ways, as seen in the previous section. In this study, we adopt the learning-to-rank paradigm for estimating this scoring function. The principle of document ranking is sketched in this section. The next subsection introduces the two specific ranking algorithms used in our experiments.</Paragraph>
<Paragraph position="1"> A ranking algorithm aims at estimating a function which correctly describes all partial orders within a set of elements. An ideal ranking in information retrieval must place all relevant documents above non-relevant ones for a given topic. The problem can be described as follows. For each topic, the document collection is decomposed into two disjoint sets $S^+$ and $S^-$ of relevant (respectively non-relevant) documents, whose cardinalities are R and NR. A ranking function H(d) assigns a score value to each document d of the collection. We seek a function H(d) such that the document ranking generated from the scores respects the relevance relationship, that is, any relevant document receives a higher score than any non-relevant one. Let &quot;$d \succ d'$&quot; signify that d is ranked higher than d'. The learning objective can therefore be stated as follows.</Paragraph>
<Paragraph position="2"> $$d^+ \succ d^- \;\Longleftrightarrow\; H(d^+) > H(d^-), \qquad \forall (d^+, d^-) \in S^+ \times S^-$$ There are different ways to measure the ranking error of a scoring function H. The natural criterion might be the proportion of misordered pairs (a relevant document ranked below a non-relevant one) over the total number of pairs $R \cdot NR$. This criterion is an estimate of the probability of misordering a pair, $P(d^- \succ d^+)$: $$\mathrm{RLoss}(H) = \sum_{(d^+,\, d^-) \in S^+ \times S^-} D(d^+, d^-)\, [\![\, H(d^+) \le H(d^-) \,]\!] \qquad (2)$$ where $[\![\varphi]\!]$ is 1 if $\varphi$ holds and 0 otherwise; $D(d^+, d^-)$ describes the importance of the pair under consideration and is uniform, $\bigl(\tfrac{1}{R \cdot NR}\bigr)$, if no such information is available. In practice, we have to average RLoss over the training topic set. This can be done either by macro-averaging at the topic level or by micro-averaging at the document-pair level. To simplify the presentation, this operation is left implicit.</Paragraph>
</Section>
<Section position="3" start_page="401" end_page="401" type="sub_section"> <SectionTitle> 3.3 Discriminative ranking algorithms </SectionTitle>
<Paragraph position="0"> Since RLoss is neither continuous nor differentiable, its direct use as a training criterion raises practical difficulties. Moreover, in order to provide reliable predictions on previously unseen data, the prediction error of the learned function has to be bounded with significant confidence. For both practical and theoretical reasons, RLoss is therefore often approximated by a smooth error function.</Paragraph>
<Paragraph position="1"> In this study, we will explore the performance of two ranking algorithms: RankBoost (Freund et al., 2003) and Ranking SVM (Joachims, 2002b). As far as we know, these algorithms are among the few state-of-the-art learning-to-rank algorithms whose convergence and generalization properties have been theoretically proved (Freund et al., 2003; Joachims, 2002b; Clémençon et al., 2005).</Paragraph>
<Paragraph position="2"> RankBoost (aka RBoost) (Freund et al., 2003) returns a scoring function for each document d by minimizing the following exponential upper bound of the ranking error RLoss (Eq. (2)):</Paragraph>
<Paragraph position="3"> $$\mathrm{ELoss}(H) = \sum_{(d^+,\, d^-)} D(d^+, d^-)\, \exp\bigl(H(d^-) - H(d^+)\bigr)$$ </Paragraph>
<Paragraph position="4"> This is an iterative algorithm, like all other boosting methods (Freund and Schapire, 1997). The global ranking function of a document d is a linear combination of all base functions, $H(d) = \sum_{t=1}^{T} a_t h_t(d)$. At each iteration t, a new training data sample is generated by putting more weight $D(\cdot,\cdot)$ on difficult pairs $(d^+, d^-)$. A scoring function $h_t$ is proposed (it can even be chosen among the features used to describe documents) and the weight $a_t$ is estimated so as to minimize the ELoss at that iteration.</Paragraph>
<Paragraph position="5"> RBoost has virtues that particularly fit the pooling task. First, it can operate on relative values.</Paragraph>
<Paragraph position="6"> Second, it does not impose any independence assumption between the combined systems. Finally, in the case of binary relevance judgments, which are usual in IR, there is an efficient implementation of RBoost whose complexity is linear in the number of training instances (cf. Freund et al., 2003).</Paragraph>
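To make these losses concrete, the following sketch (our own illustration, assuming binary relevance judgments and uniform pair weights 1/(R*NR); it is not the paper's RankBoost implementation) computes RLoss (Eq. (2)) and the exponential surrogate that RankBoost minimizes, given a score for each document.

    import math

    def rloss(scores, relevant, non_relevant):
        """RLoss of Eq. (2) with uniform pair weights 1/(R*NR): the fraction of
        (relevant, non-relevant) pairs that the scores misorder."""
        pairs = [(dp, dm) for dp in relevant for dm in non_relevant]
        misordered = sum(1 for dp, dm in pairs if scores[dp] <= scores[dm])
        return misordered / len(pairs)

    def exp_loss(scores, relevant, non_relevant):
        """Exponential upper bound of RLoss minimized by RankBoost, again with
        uniform pair weights; exp(H(d-) - H(d+)) >= 1 whenever a pair is misordered."""
        pairs = [(dp, dm) for dp in relevant for dm in non_relevant]
        return sum(math.exp(scores[dm] - scores[dp]) for dp, dm in pairs) / len(pairs)

Here scores maps each document identifier to its score H(d), and relevant / non_relevant play the roles of the sets S+ and S- for a single topic.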
<Paragraph position="7"> Ranking SVM (Joachims, 2002b), rSVM for short, is a straightforward adaptation of the max-margin principle (Vapnik, 2000) to pairwise object ranking. The scoring function is often assumed to be linear in some feature space, that is, $H(d) = w^T \Psi(d)$, where w is the vector of weights to be estimated and $\Psi$ is a feature mapping. The max-margin approach minimizes the following approximation of RLoss:</Paragraph>
<Paragraph position="8"> $$\min_{w,\; \xi \ge 0} \; \sum_{(d^+,\, d^-)} \xi_{d^+, d^-} \qquad \text{s.t.} \qquad w^T \Psi(d^+) \;\ge\; w^T \Psi(d^-) + 1 - \xi_{d^+, d^-}$$ </Paragraph>
<Paragraph position="9"> for all pairs $(d^+, d^-)$, while at the same time controlling the complexity of the function space, described via the norm of the vector w, for generalization purposes.</Paragraph>
<Paragraph position="10"> Notice that rSVM does not explicitly support rank values as RBoost does. Nevertheless, we will see later that its discriminative nature allows rSVM to work quite well on features merely derived from rank values. Its behavior in fact differs only negligibly from that of RBoost.</Paragraph>
</Section> </Section>
<Section position="5" start_page="401" end_page="402" type="metho"> <SectionTitle> 4 Experimental setup </SectionTitle>
<Paragraph position="0"> Our method is general enough to be applicable to any ad-hoc information retrieval task where pooling could be useful. In this paper, however, we focus on the traditional TREC ad-hoc retrieval collections. Experiments have been performed on the three corpora TREC-6, TREC-7 and TREC-8. Statistics about the number of runs, judgments and relevant documents are shown in Tab. 1. Due to space limitations, we will detail the results for the TREC-8 case and only mention the results for the other two.</Paragraph>
<Paragraph position="1"> [Table 1 caption, partially lost in extraction:] ...collections. The three last columns are averaged over the topic set size (50 topics/collection).</Paragraph>
<Paragraph position="2"> Training data is gathered from the top five answers of each run. The pool depth of five has been chosen arbitrarily, both to obtain sufficient training data and to eliminate potential bias from assessors towards a particular system or towards early identified answers while judging a shallow pool. Furthermore, this training data set is large enough for testing the efficiency of the ranking algorithms.</Paragraph>
<Paragraph position="3"> Each document is described by an N-dimensional feature vector, where N is the number of participating systems. The jth feature value for a document is a function of its position in the retrieved list; ties are arbitrarily broken. A document at rank i is assigned a feature value of (L + 1 - i), where L is the TREC limit on submission run length (L is usually set to 1000). Documents outside a submission run receive a feature value of zero (i.e., they are assumed to be at rank (L + 1)). For implementation speed, the input for rSVM is further scaled down to the interval [0, 1].</Paragraph>
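The rank-based features just described can be computed as in this short sketch (our own illustrative code, not the paper's): system j contributes the value L + 1 - i if it retrieved the document at rank i, and 0 otherwise; the optional scaling to [0, 1] mirrors the preprocessing used for rSVM.

    def rank_features(runs, doc_id, L=1000, scale=False):
        """N-dimensional feature vector for one document: the j-th component is
        L + 1 - i if system j retrieved the document at rank i (1-based), and 0
        if the document is outside that system's submission (rank L + 1)."""
        features = []
        for ranked_docs in runs:               # one ranked doc-id list per system
            if doc_id in ranked_docs:
                i = ranked_docs.index(doc_id) + 1
                value = L + 1 - i
            else:                              # not retrieved by this run
                value = 0
            features.append(value / L if scale else value)
        return features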
<Paragraph position="4"> Due to the small topic set size, we use a leave-one-out training strategy: a model is trained for each topic using the judgments of all the other topics. The size of the training data set is given in the last column of Tab. 1. The workload for the training data set does not exceed the effort of assessing 5 topics in the full TREC pool.</Paragraph>
<Paragraph position="5"> We employ the SVMlight package for rSVM.</Paragraph>
<Paragraph position="6"> We adopt the efficient RBoost version for binary feedback and binary base functions $h_t$ (cf. Freund et al., 2003); boosting is iterated 100 times and we impose positive weights for all coefficients $a_t$.</Paragraph>
<Paragraph position="7"> The non-interpolated average precision (MAP) has been chosen to measure system performance. This metric has been shown to be highly stable and reliable with both small topic set sizes (Buckley and Voorhees, 2000) and very large document collections (Hawking and Robertson, 2003).</Paragraph>
<Paragraph position="8"> RBoost and rSVM pools will be compared to TREC-style pools of the same size. We also include &quot;local MTF&quot; (Cormack et al., 1998) in the experiments. &quot;Global MTF&quot; has been shown to slightly outperform the local version in the aforementioned paper. However, we believe that the global mode is suitable mainly for demonstration and is unlikely to be practical for online judgment, since it requires all queries to be judged simultaneously with strict synchronisation among all assessors. Hereafter, for simplicity, the TREC-style pool of the first n documents retrieved by each submission will be denoted Depth-n; the equivalent pool (with the same average final pool size m over the topic set) produced by RBoost, rSVM or MTF will be denoted RBoost-m, rSVM-m or MTF-m respectively. In all figures in the next section, the abscissa denotes the pool size m, and the values of n are indicated along the Depth-n curve.</Paragraph>
</Section> </Paper>
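As a final illustration (our own sketch, not code from the paper), the size matching between Depth-n and the learned pools can be done by averaging the Depth-n pool size over the topic set and using that value as m when cutting the fused, rescored lists.

    def average_pool_size(runs_by_topic, n):
        """Average size m of the TREC-style Depth-n pools over the topic set;
        the RBoost-m, rSVM-m and MTF-m pools are then built with this m so that
        all methods are compared at the same judgment budget."""
        sizes = []
        for runs in runs_by_topic.values():
            pool = set()
            for ranked_docs in runs:
                pool.update(ranked_docs[:n])
            sizes.append(len(pool))
        return round(sum(sizes) / len(sizes))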