<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1029">
  <Title>Comparison of Two Interactive Search Refinement Techniques</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Every run submitted to the HARD track was evaluated in three different ways. The first two evaluations are done at the document level only, whereas the last one takes into account the granularity metadata.</Paragraph>
    <Paragraph position="1">  1. SOFT-DOC - document-level evaluation, where only the traditional TREC topic formulations (title, description, narrative) are used as relevance criteria. 2. HARD-DOC - the same as the above, plus 'purpose', 'genre' and 'familiarity' metadata are used as additional relevance criteria.</Paragraph>
    <Paragraph position="2"> 3. HARD-PSG - passage-level evaluation, which in  addition to all criteria in HARD-DOC also requires that retrieved items satisfy the granularity metadata (Allan 2004).</Paragraph>
    <Paragraph position="3"> Document-level evaluation was done by the traditional IR metrics of mean average precision and precision at various document cutoff points. In this paper we focus on document-level evaluation. Passage-level evaluation is discussed elsewhere (Vechtomova et al. 2004).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Document-level evaluation
</SectionTitle>
      <Paragraph position="0"> For all of our runs we used Okapi BSS (Basic Search System). For the baseline run we used keywords from the title field only, as these proved to be most effective in our preliminary experiments described in section 2.2.</Paragraph>
      <Paragraph position="1"> Topic titles were parsed in Okapi, weighted and searched using BM25 function against the HARD track corpus.</Paragraph>
      <Paragraph position="2"> Document-level results of the three submitted runs are given in table 1. UWAThard1 is the baseline run using original query terms from the topic titles.</Paragraph>
      <Paragraph position="3"> UWAThard2 is a final run using query expansion method 1, outlined earlier, plus the granularity and known relevant documents metadata. UWAThard3 is a final run using query expansion method 2 plus the  The fact that the query expansion method 1 (UWAThard2) produced no improvement over the baseline (UWAThard1) was a surprise, and did not correspond to our training runs with the Financial Times and Los Angeles Times collections, which showed 21% improvement over the original title-only query run. We evaluated the user selection of the sentence using average precision, calculated as the number of relevant sentences selected by the user out of the total number of sentences selected, and average recall - the number of relevant sentences selected by the user out of the total number of relevant sentences shown in the clarification form. Average precision of TREC sentence selections made by TREC annotators is 0.73, recall - 0.69, what is slightly better than our selections during training runs (precision: 0.70, recall: 0.64). On average 7.14 relevant sentences were included in the forms. The annotators on average selected 4.9 relevant and 1.8 non-relevant sentences.</Paragraph>
      <Paragraph position="4"> Figure 1 shows the number of relevant/non-relevant selected sentences by topic. It is not clear why query expansion method 1 performed worse in the official UWAThard2 run compared to the training run, given very similar numbers of relevant sentences selected. Corpus differences could be one reason for that - HARD corpus contains a large proportion of governmental documents, and we have only evaluated our algorithm on newswire corpora. More experiments need to be done to determine the effect of the governmental documents on our query expansion algorithm.</Paragraph>
      <Paragraph position="5"> In addition to clarification forms, we used the 'related text' metadata for UWAThard2, from which we extracted query expansion terms using the method described in section 2.2. To determine the effect of this metadata on performance, we conducted a run without it (UWAThard5), which showed only a slight drop in performance. This suggests that additional relevant documents from other sources do not affect performance of this query expansion method significantly.</Paragraph>
      <Paragraph position="6"> We thought that one possible reason for the poor performance of UWAThard2 compared to the baseline run UWAThard1 was the fact that we used document retrieval search function BM25 for all topics in the UWAThard1, whereas for UWAThard2 we used BM25 for topics requiring document retrieval and BM250 for the topics requiring passage retrieval. The two functions produce somewhat different document rankings. In UWAThard4 we used BM250 for the topics requiring passages, and got only a slightly lower average precision of 0.2937 (SOFT-DOC evaluation) and 0.2450 (HARD-DOC evaluation).</Paragraph>
      <Paragraph position="7"> Our second query expansion method on the contrary did not perform very well in the training runs, achieving only 10% improvement over the original title-only query run. The official run UWAThard3, however resulted in 18% increase in average precision (SOFT-DOC evaluation) and 26.4% increase in average precision (HARD-DOC evaluation). Both improvements are statistically significant (using t-test at .05 significance level).</Paragraph>
      <Paragraph position="8"> TREC annotators selected on average 19 phrases, whereas we selected on average 7 phrases in our tests. This suggests that selecting more phrases leads to a notably better performance. The reason why we selected fewer phrases than the TREC annotators could be due to the fact that on many occasions we were not sufficiently familiar with the topic, and could not determine how an out-of-context phrase is related or not related to the topic. TREC annotators are, presumably, more familiar with the topics they have formulated.</Paragraph>
      <Paragraph position="9"> In total 88 runs were submitted by participants to the HARD track. All our submitted runs are above the median in all evaluation measures shown in table 1. The only participating site, whose expansion runs performed better than our UWAThard3 run, was the Queen's college group (Kwok et al. 2004). Their best baseline system achieved 32.7% AveP (HARD-DOC) and their best result after clarification forms was 36%, which gives 10% increase over the baseline. We have achieved 26% improvement over the baseline (HARD-DOC), which is the highest increase over baseline among the top 50% highest-scoring baseline runs.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The effect of different numbers of relevant and
non-relevant documents on performance following user feedback
</SectionTitle>
      <Paragraph position="0"> Query expansion based on relevance feedback is typically more effective than expansion based on blind feedback; however, as discussed in the previous section, only 73% of the sentences selected by users from clarification form 1 were actually relevant. This prompted us to explore the following question: how does the presence of different numbers of relevant and non-relevant documents in the feedback affect average precision? With this goal, we conducted a series of runs on the Financial Times and Los Angeles Times corpora and TREC topics 301-450. For each run we composed a set consisting of the required number of relevant and non-relevant documents. To minimize the difference between relevant and non-relevant documents, we selected non-relevant documents ranked closely to relevant documents in the ranked document set.</Paragraph>
      <Paragraph position="1"> The process of document selection is as follows: first all documents in the ranked set are marked as relevant/nonrelevant using TREC relevance judgements. Then, each time a relevant document is found, it is recorded together with the nearest non-relevant document, until the necessary  number of relevant/non-relevant documents is reached.</Paragraph>
      <Paragraph position="2"> The graph in figure 2 shows that as the number of relevant documents increases, average precision (AveP) after feedback increases considerably for each extra relevant document used, up to the point when we have 4 relevant documents. The increment in AveP slows down when more relevant documents are added.</Paragraph>
      <Paragraph position="3"> Adding few non-relevant documents to relevant ones causes a considerable drop in the AveP. However, the precision does not deteriorate further when more non-relevant documents are added (Figure 2). As long as there are more than three relevant documents that are used, a plateau is hit at around 4-5 non-relevant documents.</Paragraph>
      <Paragraph position="4"> We can conclude from this experiment that as a general rule, the more relevant documents are used for query expansion, the better is the average precision. Even though use of 5 or more relevant documents does not increase the precision considerably, it still does cause an improvement compared to 4 and fewer relevant documents.</Paragraph>
      <Paragraph position="5"> Another finding is that non-relevant documents do not affect average precision considerably, as long as there are a sufficient number of relevant documents.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML