<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1051"> <Title>A Machine Learning based Approach to Evaluating Retrieval Systems</Title> <Section position="6" start_page="402" end_page="404" type="evalu"> <SectionTitle> 5 Experimental results </SectionTitle>
<Paragraph position="0"> This section examines small pools produced either by the TREC method or by RBoost/rSVM/MTF from two angles: their pooling performance and their influence on system comparison results.</Paragraph>
<Section position="1" start_page="402" end_page="403" type="sub_section"> <SectionTitle> 5.1 Identifying relevant documents </SectionTitle>
<Paragraph position="0"> Fig. 1 shows the ratio of relevant documents retrieved by the different pooling methods (i.e., the recall). The curves obtained by RBoost and rSVM are quite similar and lie much higher than that of the TREC methodology. The MTF curve lies between those of RBoost/rSVM and Depth-n at the beginning, then catches up with that of RBoost at pools of about 600 documents.</Paragraph> </Section>
<Section position="2" start_page="403" end_page="403" type="sub_section"> <SectionTitle> 5.2 Correlation of system rankings </SectionTitle>
<Paragraph position="0"> Once the pool is obtained by a given method, the assessor gives a relevance judgment for every document in that pool; the outcome is called a qrels. This qrels is then used as the ground truth to measure the effectiveness of a retrieval system.</Paragraph>
<Paragraph position="1"> [Figure caption fragment: "... according to different qrels methods in comparison with that produced by the full assessment set."]</Paragraph>
<Paragraph position="2"> The simplest way to compare systems is to sort them by decreasing effectiveness values. The correlation between two system rankings is then quantified through a rank correlation statistic. In this study we follow the TREC convention (Buckley and Voorhees, 2004; Carterette and Allan, 2005) of taking a Kendall's tau value of 0.9 as a sufficient threshold to conclude that the difference between two system rankings is negligible. We compare the system ranking obtained with the official TREC qrels against those obtained with Depth-n, where n varies from 1 to 100, and then repeat the comparison with RBoost-m, rSVM-m and MTF-m in place of Depth-n.</Paragraph>
<Paragraph position="3"> The results are shown in Fig. 2 for TREC-8 and in Tab. 2 for the first 7 pool depths. The figure shows the same ordering of pooling methods as seen in the previous section. The MTF curve meets those of RBoost and rSVM for qrels of more than 400 documents. The results obtained on the TREC-6 and TREC-7 collections are in line with those observed on TREC-8 (Tab. 2).</Paragraph>
<Paragraph position="4"> Clearly, system ranking correlation quantified by any rank correlation statistic provides necessary but not sufficient information for system comparison. Ranking systems by their sample means is indeed the simplest approach, but it rests on at least two implicit assumptions. First, that runs have similar variances; this usually does not hold in practice, even after discarding the poorest runs. Second, that all run swaps are equally important, regardless of whether the swapped runs are statistically significantly different and of their positions in the system ranking. In practice, a swap of adjacent systems does not mean much if the two runs are not significantly different from each other according to statistical tests.</Paragraph>
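To make the Kendall's tau criterion above concrete, here is a minimal Python sketch (an illustration added to this cleansed text, not part of the original paper; the MAP scores are invented). It shows that a single swap of adjacent systems among six already drops tau below the 0.9 threshold.

from scipy.stats import kendalltau

# Hypothetical MAP scores for six systems under the official qrels and
# under a reduced qrels; the reduced qrels swaps the top two systems.
map_official = [0.312, 0.298, 0.275, 0.251, 0.240, 0.233]
map_reduced  = [0.296, 0.301, 0.270, 0.255, 0.238, 0.229]

tau, _ = kendalltau(map_official, map_reduced)  # rank-based statistic
print(f"Kendall's tau = {tau:.3f}")             # 13/15, about 0.867
print("rankings equivalent" if tau >= 0.9 else "rankings differ")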
<Paragraph position="5"> The next section is devoted to further statistical validations.</Paragraph> </Section>
<Section position="3" start_page="403" end_page="404" type="sub_section"> <SectionTitle> 5.3 Statistical validations 5.3.1 Significant difference detection </SectionTitle>
<Paragraph position="0"> For a given qrels, we register all system pairs that are significantly different on the topic set. The quality of a qrels can then be measured by how closely its significant-difference detections match those obtained with the official TREC qrels.</Paragraph>
<Paragraph position="1"> We carry out a paired t-test for each pair of runs at the 95% confidence level (a toy sketch of this pairwise testing is appended after the paper body below). The recall and the false alarm rate of these detections are shown in Fig. 3. In terms of recall, the RBoost and rSVM qrels are much better than their TREC-style counterparts, with MTF in the middle. In terms of false alarm rate, the picture changes for rSVM and MTF.</Paragraph>
<Paragraph position="2"> Precisely, rSVM is best for small qrels of fewer than 100 documents, whilst MTF is best for qrels of more than 150 documents.</Paragraph>
<Paragraph position="3"> This multicomparison test [footnote 6: IR-STAT-PAK (Tague-Sutcliffe and Blustein, 1995)] aims to group runs based on their statistical differences. [Fig. 3 caption fragment: "... significantly different systems: recall (top) and false alarm rate (bottom)".] We concentrate particularly on the top group, called group A, which consists of the runs for which there is not enough evidence to conclude that they are statistically significantly worse than the top run. In practice, this group is meaningful if its size is around 10 (one often speaks of the top 10 runs). It becomes meaningless, however, if group A is too large, for example if it contains more than half of the systems under consideration. Note that the Tukey test relies on the assumption of equality of variances; this requirement cannot be completely satisfied in practice, even after some data transformation such as arcsine-root or using rank values. [Fig. 4 caption fragment: "... level) after the arcsine-root data transformation".]</Paragraph>
<Paragraph position="4"> The size of group A on the TREC-8 collection is shown in Fig. 4. The two curves of RBoost and rSVM are stable there, which implies that the two qrels RBoost-35 and rSVM-35, both of which satisfy the 0.9 requirement on Kendall's tau, can replace the official TREC qrels. The assessment effort is thus reduced by a factor of 50 (ignoring the cost of preparing the training data set) and of 10.5 otherwise. MTF needs qrels of at least 168 documents to produce a group A comparable to that of the official TREC qrels. The Depth-n pools, however, cannot be recommended with fewer than 1000 documents in total (i.e., pooling more than the top 40 documents per run).</Paragraph> </Section> </Section> </Paper>
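Appended illustration (added to this cleansed text, not part of the original paper): a minimal Python sketch of the pairwise paired t-test protocol of Section 5.3.1. The run names and per-topic average precision values are invented; only scipy's standard ttest_rel is used.

from itertools import combinations
from scipy.stats import ttest_rel

# Hypothetical per-topic average precision for three runs over five topics.
runs = {
    "runA": [0.41, 0.35, 0.52, 0.28, 0.47],
    "runB": [0.44, 0.30, 0.50, 0.25, 0.49],
    "runC": [0.12, 0.18, 0.25, 0.09, 0.16],
}

alpha = 0.05  # 95% confidence level, as in the paper
for (name1, s1), (name2, s2) in combinations(runs.items(), 2):
    _, p = ttest_rel(s1, s2)  # paired t-test over the common topic set
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name1} vs {name2}: p = {p:.3f} ({verdict})")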