<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1051">
<Title>A Machine Learning based Approach to Evaluating Retrieval Systems</Title>
<Section position="3" start_page="399" end_page="400" type="intro">
<SectionTitle> 2 Related work 2.1 TREC methodology </SectionTitle>
<Paragraph position="0"> Since the seminal work on test collection construction in 1975 (Sparck Jones and Van Rijsbergen, 1975), pooling has been the main approach to forming the assessment set. The simple round robin pooling over different systems proposed in that report has been adopted in most existing IR evaluation forums such as TREC, CLEF or NTCIR. For convenience, we denote this strategy as TREC-style pooling. To form the assessment set, only the top n documents of each submission (whose length is restricted to L = 1000 for most TREC tracks) are pooled. Despite various technical tricks to control the final pool size, such as gathering only principal runs or reducing the value of n, the assessment procedure remains quite time-consuming.</Paragraph>
<Paragraph position="1"> In the TREC-8 ad-hoc track, for example, despite limiting the pool depth n to 100 and gathering only 71 of the 129 submissions, each assessor has to work with approximately 1737 documents per topic (precisely, between 1046 and 2992 documents). Assuming that it takes on average 30 seconds to read and judge a document, the whole judgment procedure for this topic set would only finish after a month of round-the-clock work. Meanwhile, a simple analysis of the ad-hoc collections from TREC-3 to TREC-8 reveals that on average 94% of the judged documents are non relevant. Since most existing effectiveness measures do not take these non relevant documents into account, it would be better not to waste effort on judging them, provided that the quality of the test collection is preserved. Several advanced pool sampling methods have been proposed, but due to some common drawbacks none of them has been used in practice.</Paragraph>
<Section position="1" start_page="399" end_page="399" type="sub_section">
<SectionTitle> 2.2 Topic adaptive pooling </SectionTitle>
<Paragraph position="0"> Zobel (Zobel, 1998) forms shallow pools according to the TREC methodology. Once enough documents have been judged (up to the top 30 documents per run in his experiment), an extrapolation function is estimated to predict the number of unpooled relevant documents. The idea is to judge more documents for topics that are likely to contain further relevant documents. Carterette and Allan (Carterette and Allan, 2005) have recently replaced that extrapolation function with statistical tests to distinguish runs. This method produced interesting empirical outcomes on TREC ad-hoc collections, but it lacks a sound theoretical basis and is of very high complexity due to iterative statistical tests over every pair of runs. Furthermore, this incremental/online pooling approach raises a major concern about the unbiasedness required of human judgments, since assessors know that documents presented later come from lower ranks and are therefore less likely to be relevant.</Paragraph>
</Section>
<Section position="2" start_page="399" end_page="400" type="sub_section">
<SectionTitle> 2.3 System adaptive pooling </SectionTitle>
<Paragraph position="0"> Cormack et al. (Cormack et al., 1998) propose the so-called Move-To-Front (MTF) heuristic to give priority to documents according to the performance of the system that retrieved them. In their experiment, this performance is simply measured by the number of non relevant documents the system has contributed to the pool since its last relevant one. Aslam et al. (Aslam et al., 2003) formulate this priority rule by adopting the online learning algorithm Hedge (Freund and Schapire, 1997).</Paragraph>
<Paragraph position="1"> Our method relies on this idea of pushing relevant documents forward by weighting retrieval systems.</Paragraph>
<Paragraph position="2"> There are, however, two major differences. Whilst all the aforementioned proposals favor an online paradigm with a series of human interaction rounds, our method works in batch mode. We believe that the latter is more suitable for this task since it eliminates as much as possible the bias introduced by human assessors towards any particular document. Moreover, the batch mode lets us naturally exploit the inter-topic relationship, which is not possible in the online paradigm. The second difference lies in the way the ranking function is estimated. It is widely accepted that machine learning techniques can deliver a more reliable model on previously unseen data, from far fewer training instances, than classical statistical techniques or expert rules can.</Paragraph>
</Section>
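To make the system-weighting idea above concrete, the following is a minimal Python sketch of Hedge-style pool construction. It is an illustration under simplifying assumptions rather than the exact procedures of Cormack et al. (1998) or Aslam et al. (2003): judgments are binary, a system incurs a loss of 1 whenever a document it retrieved is judged non relevant, and the names hedge_pool, judge, beta and budget are hypothetical.

```python
# Minimal sketch of system-weighted pool construction in the spirit of the
# MTF / Hedge-style prioritisation described above (NOT the exact algorithm
# of Cormack et al., 1998 or Aslam et al., 2003).

def hedge_pool(runs, judge, beta=0.8, budget=100):
    """
    runs   : dict system_id -> ranked list of doc_ids
    judge  : callable doc_id -> bool (True if relevant); stands in for
             the human assessor
    beta   : multiplicative discount in (0, 1); lower = harsher penalty
    budget : total number of judgments allowed
    returns: dict doc_id -> bool (the incrementally built qrels)
    """
    weights = {s: 1.0 for s in runs}   # one "expert" weight per system
    cursors = {s: 0 for s in runs}     # next unexamined rank per system
    qrels = {}

    for _ in range(budget):
        # Pick the highest-weighted system that still has documents left.
        live = [s for s in runs if cursors[s] < len(runs[s])]
        if not live:
            break
        best = max(live, key=lambda s: weights[s])

        # Advance that system's cursor to its next unjudged document.
        doc = None
        while cursors[best] < len(runs[best]):
            cand = runs[best][cursors[best]]
            cursors[best] += 1
            if cand not in qrels:
                doc = cand
                break
        if doc is None:
            continue  # this system only had already-judged documents

        relevant = judge(doc)
        qrels[doc] = relevant

        # Hedge-style update (hypothetical loss definition): every system
        # that retrieved this document is discounted if it was non relevant.
        if not relevant:
            for s in runs:
                if doc in runs[s]:
                    weights[s] *= beta
    return qrels


# Toy usage with three hypothetical runs and a made-up relevance oracle.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d5", "d1", "d6"],
    "sysC": ["d7", "d2", "d8", "d9"],
}
relevant_docs = {"d2", "d5"}  # purely illustrative
qrels = hedge_pool(runs, judge=lambda d: d in relevant_docs, budget=6)
print(qrels)
```

In this sketch the discount beta plays a role analogous to the MTF penalty: systems that keep contributing non relevant documents quickly lose their influence on which document is judged next.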
<Section position="3" start_page="400" end_page="400" type="sub_section">
<SectionTitle> 2.4 Generating pseudo assessment sets </SectionTitle>
<Paragraph position="0"> Several evaluation methodologies, especially for web search engines, have been proposed to evaluate systems without relevance judgments. These proposals fall into two main categories. The first (Soboroff et al., 2001; Wu and Crestani, 2003; Nuray and Can, 2006) exploits internal information of the submissions. The second (Can et al., 2004; Joachims, 2002a; Beitzel et al., 2003) benefits from external resources such as document and query content, or those of the web environment. We skip the second category since such resources are not available in generic situations.</Paragraph>
<Paragraph position="1"> Soboroff et al. (Soboroff et al., 2001) sample documents from a shallow pool (the top ten documents returned by the retrieval systems) based on statistics from past qrels. Wu and Crestani (Wu and Crestani, 2003) and Nuray and Can (Nuray and Can, 2006) adopt metasearch strategies based on document positions: a certain number of top-ranked documents are considered relevant without any human verification. Different voting schemes have been tried in these two papers. Their empirical experiments illustrate how sensitive the quality of these pseudo-qrels is to the chosen voting scheme and to other parameters such as the pool depth or the diversity of the systems used for fusion. They also confirm that pseudo-qrels are often unable to identify the best systems.</Paragraph>
<Paragraph position="2"> In sum, this literature review confirms the importance of relevance assessment sets in IR evaluation, and the lack of an appropriate solution for obtaining a reliable assessment set with only a moderate amount of judgment resources.</Paragraph>
</Section>
</Section>
</Paper>
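For illustration only, here is a toy Python sketch of the kind of position-based voting used in Section 2.4 to build pseudo-qrels. It is not the exact scheme of Wu and Crestani (2003) or Nuray and Can (2006): it assumes a simple Borda-like vote of (depth - rank) per run, a fixed pool depth, and a cut-off k, and the names pseudo_qrels, depth and k are hypothetical.

```python
# Toy sketch of a position-based voting scheme for building pseudo-qrels,
# in the spirit of the metasearch strategies cited above (not their exact
# schemes): documents are declared "relevant" without human judgment.
from collections import defaultdict

def pseudo_qrels(runs, depth=10, k=20):
    """
    runs  : dict system_id -> ranked list of doc_ids
    depth : how many top documents per run take part in the vote
    k     : how many of the highest-scoring documents are labelled relevant
    """
    scores = defaultdict(float)
    for ranking in runs.values():
        for rank, doc in enumerate(ranking[:depth]):
            scores[doc] += depth - rank   # higher rank -> larger vote
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])                # the pseudo relevance set
```

The output of such a scheme depends heavily on depth, k and the diversity of the fused runs, and, as noted above, pseudo-qrels built this way are often unable to identify the best systems.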