<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1029">
  <Title>Comparison of Two Interactive Search Refinement Techniques</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
High Accuracy Retrieval from Documents (HARD) track of TREC (Text Retrieval Conference).
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 HARD track
</SectionTitle>
      <Paragraph position="0"> The main goal of the new HARD track in TREC-12 is to explore what techniques could be used to improve search results by using two types of information:  1. Extra-linguistic contextual information about the user and the information need, which was provided by track organisers in the form of metadata. It specifies the following: Genre - the type of documents that the searcher is looking for. It has the following values: - Overview (general news related to the topic); - Reaction (news commentary on the topic); - I-Reaction (as above, but about non-US commentary) - Any.</Paragraph>
      <Paragraph position="1"> Purpose of the user's search, which has one of the following values: - Background (the searcher is interested in the background information for the topic); - Details (the searcher is interested in the details of the topic); - Answer (the searcher wants to know the answer to a specific question); - Any.</Paragraph>
      <Paragraph position="2"> Familiarity of the user with the topic on a five-point scale.</Paragraph>
      <Paragraph position="3"> Granularity - the amount of text the user is expecting in response to the query. It has the following values: Document, Passage, Sentence, Phrase, Any.</Paragraph>
      <Paragraph position="4"> Related text - sample relevant text found by the users from any source, except the evaluation corpus.</Paragraph>
      <Paragraph position="5"> 2. Relevance feedback given by the user in response  to topic clarification questions. This information was elicited by each site by means of a (manually or automatically) composed set of clarification forms per topic. The forms are filled in by the users (annotators), and provide additional search criteria.</Paragraph>
      <Paragraph position="6"> In more detail the HARD track evaluation scenario consists of the following steps: 1) The track organisers invite annotators (users), each of whom formulates one or more topics. An example of a typical HARD topic is given below: Title: Red Cross activities Description: What has been the Red Cross's international role in the last year? Narrative: Articles concerning the Red Cross's activities around the globe are on topic. Has the RC's role changed? Information restricted to international relief efforts that do not include the RC are off-topic.  2) Participants receive Title, Description and Narrative sections of the topics, and use any information from them to produce one or more baseline runs.</Paragraph>
      <Paragraph position="7"> 3) Participants produce zero or more clarification forms with the purpose of obtaining feedback from the annotators. Only two forms were guaranteed to be filled out.</Paragraph>
      <Paragraph position="8"> 4) All clarification forms for one topic are filled out by the annotator, who has composed that topic.</Paragraph>
      <Paragraph position="9"> 5) Participants receive the topic metadata and the annotators' responses to clarification forms, and use any data from them to produce one or more final runs. 6) Two runs per site (baseline and final) are judged by  the annotators. Top 75 documents, retrieved for each topic in each of these runs, are assigned binary relevance judgement by the annotator - author of the topic.</Paragraph>
      <Paragraph position="10"> 7) The annotators' relevance judgements are then used to calculate the performance metrics (see section 4). The evaluation corpus used in the HARD track consists of 372,219 documents, and includes three newswire corpora (New York Times, Associated Press Worldstream and Xinghua English) and two governmental corpora (The Congressional Record and Federal Register). The overall size of the corpus is 1.7Gb.</Paragraph>
      <Paragraph position="11"> The primary goal of our participation in the track was to investigate how to achieve high retrieval accuracy through relevance feedback. The secondary goal was to study ways of reducing the amount of time and effort the user spends on making a relevance judgement, and at the same time assisting the user to make a correct judgement.</Paragraph>
      <Paragraph position="12"> We evaluated the effectiveness of two different approaches to eliciting information from the users. The first approach is to represent each top-ranked retrieved document by means of one sentence containing the highest proportion of query terms, and ask the user to select those sentences, which possibly represent relevant documents. The second method extracts noun phrases from top-ranked retrieved documents and asks the user to select those, which might be useful in retrieving relevant documents. Both approaches aim to minimise the amount of text the user has to read, and to focus the user's attention on the key information clues from the documents.</Paragraph>
      <Paragraph position="13"> Traditionally in bibliographical and library IR systems the hitlist of retrieved documents is represented in the form of the titles and/or the first few sentences of each document. Based on this information the user has to make initial implicit relevance judgements: whether to refer to the full text document or not. Explicit relevance feedback is typically requested by IR systems after the user has seen the full text document, an example of such IR system is Okapi (Robertson et al. 2000, Beaulieu 1997). Reference to full text documents is obviously time-consuming, therefore it is important to represent documents in the hitlist in such a form, that would enable the users to reliably judge their relevance without referring to the full text. Arguably, the title and the first few sentences of the document are frequently not sufficient to make correct relevance judgement. Query-biased summaries, usually constructed through the extraction of sentences that contain higher proportion of query terms than the rest of the text - may contain more relevance clues than generic document representations. Tombros and Sanderson (1998) compared query-biased summaries with the titles plus the first few sentences of the documents by how many times the users have to request full-text documents to verify their relevance/non-relevance. They discovered that subjects using query-biased summaries refer to the full text of only 1.32% documents, while subjects using titles and first few sentences refer to 23.7% of documents. This suggests that query-biased representations are likely to contain more relevance clues than generic document representations.</Paragraph>
      <Paragraph position="14"> The remainder of this paper is organised as follows: sections 2 and 3 present the two document representation and query expansion methods we developed, section 4 discusses their evaluation, and section 5 concludes the paper and outlines future research directions.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Query expansion method 1
</SectionTitle>
    <Paragraph position="0"> According to the HARD track specifications, a clarification form for each topic must fit into a screen with 1152 x 900 pixels resolution, and the user may spend no more than 3 minutes filling out each form.</Paragraph>
    <Paragraph position="1"> The goal that we aim to achieve with the aid of the clarification form is to have the users judge as many relevant documents as possible on the basis of one sentence representation of a document. The questions explored here were: What is the error rate in selecting relevant documents on the basis of one sentence representation of its content? How does sentence-level relevance feedback affect retrieval performance?</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Sentence selection
</SectionTitle>
      <Paragraph position="0"> The sentence selection algorithm consists of the following steps: We take N top-ranked documents, retrieved in response to query terms from the topic title. Given the screen space restrictions, we can only display 15 threeline sentences, hence N=15. The full-text of each of the documents is then split into sentences. For every sentence that contains one or more query terms, i.e. any term from the title field of the topic, two scores are calculated: S1 and S2.</Paragraph>
      <Paragraph position="1"> Sentence selection score 1 (S1) is the sum of idf of all query terms present in the sentence.</Paragraph>
      <Paragraph position="2"> Sentence selection score 2 (S2): Where: Wi - Weight of the term i, see (3); fs - length normalisation factor for sentence s, see (4). The weight of each term in the sentence, except stopwords, is calculated as follows: Where: idfi - inverse document frequency of term i in the corpus; tfi - frequency of term i in the document; tmax - tf of the term with the highest frequency in the document.</Paragraph>
      <Paragraph position="3"> To normalise the length of the sentence we introduced the sentence length normalisation factor f: Where: smax - the length of the longest sentence in the document, measured as a number of terms, excluding stopwords; slen - the length of the current sentence. All sentences in the document were ranked by S1 as the primary score and S2 as the secondary score. Thus, we first select the sentences that contain more query terms, and therefore are more likely to be related to the user's query, and secondarily, from this pool of sentences select the one which is more content-bearing, i.e. containing a higher proportion of terms with high tf*idf weights.</Paragraph>
      <Paragraph position="4"> Because we are restricted by the screen space, we reject sentences that exceed 250 characters, i.e. three lines. In addition, to avoid displaying very short, and hence insufficiently informative sentences, we reject sentences with less than 6 non-stopwords. If the top-scoring sentence does not satisfy the length criteria, the next sentence in the ranked list is considered to represent the document. Also, since there are a number of almost identical documents in the corpus, we remove the representations of the duplicate documents from the clarification form using pattern matching, and process the necessary number of additional documents from the baseline run sets.</Paragraph>
      <Paragraph position="5"> By selecting the sentence with the query terms and the highest proportion of high-weighted terms in the document, we are showing query term instances in their typical context in this document. Typically a term is only used in one sense in the same document. Also, in many cases it is sufficient to establish the linguistic sense of a word by looking at its immediate neighbours in the same sentence or a clause. Based on this, we hypothesise that users will be able to reject those sentences, where the query terms are used in an unrelated linguistic sense. However, we recognise that it is more difficult, if not impossible, for users to reliably determine the relevance of the document on the basis of one sentence, especially in cases where the relevance of the document to the query is due to more subtle aspects of the topic.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Selection of query expansion terms
</SectionTitle>
      <Paragraph position="0"> The user's feedback to the clarification form is used for obtaining query expansion terms for the final run. For query expansion we use collocates of query terms words co-occurring within a limited span with query terms. Vechtomova et al. (2003) have demonstrated that expansion with long-span collocates of query terms obtained from 5 known relevant documents showed 7274% improvement over the use of Title-only query terms on the Financial Times (TREC volume 4) corpus with TREC-5 ad hoc topics.</Paragraph>
      <Paragraph position="1"> We extract collocates from windows surrounding query term occurrences. The span of the window is</Paragraph>
      <Paragraph position="3"> measured as the number of sentences to the left and right of the sentence containing the instance of the query term. For example, span 0 means that only terms from the same sentence as the query term are considered as collocates, span 1 means that terms from 1 preceding and 1 following sentences are also considered as collocates.</Paragraph>
      <Paragraph position="4"> In more detail the collocate extraction and ranking algorithm is as follows: For each query term we extract all sentences containing its instance, plus s sentences to the left and right of these sentences, where s is the span size. Each sentence is only extracted once. After all required sentences are selected we extract stems from them, discarding stopwords. For each unique stem we calculate the Z score to measure the significance of its co-occurrence with the query term as follows: Where: fr(x,y) - frequency of x and y occurring in the same windows in the known relevant document set (see (6); fc(y) - frequency of y in the corpus; fr(x) frequency of x in the relevant documents; vx(R) average size of windows around x in the known relevant document set (R); N - the total number of non-stopword occurrences in the corpus.</Paragraph>
      <Paragraph position="5"> The frequency of x and y occurring in the same windows in the relevant set - fr(x,y) - is calculated as follows: Where: m - number of windows in the relevant set (R); fw(x) - frequency of x in the window w; fw(y) frequency of y in the window w.</Paragraph>
      <Paragraph position="6"> All collocates with an insignificant degree of association: Z&lt;1.65 are discarded, see (Church et al. 1991). The remaining collocates are sorted by their Z score. The above Z score formula is described in more detail in (Vechtomova et al. 2003).</Paragraph>
      <Paragraph position="7"> After we obtain sorted lists of collocates of each query term, we select those collocates for query expansion, which co-occur significantly with two or more query terms. For each collocate the collocate score (C1) is calculated: Where: ni - rank of the collocate in the Z-sorted collocation list for the query term i; Wi - weight of the query term i.</Paragraph>
      <Paragraph position="8"> The reason why we use the rank of the collocate in the above formula instead of its Z score is because Z scores of collocates of different terms are not comparable.</Paragraph>
      <Paragraph position="9"> Finally, collocates are ranked by two parameters: the primary parameter is the number of query terms they co-occur with, and the secondary - C1 score.</Paragraph>
      <Paragraph position="10"> We tested the algorithm on past TREC data (Financial Times and Los Angeles Times newswire corpora, topics 301-450) with blind feedback using Okapi BM25 search function (Sparck Jones et al. 2000). The goal was to determine the optimal values for R - the size of the pseudo-relevant set, s - the span size, and k the number of query expansion terms. The results indicate that variations of these parameters have an insignificant effect on precision. However, some tendencies were observed, namely: (1) larger R values tend to lead to poorer performance in both Title-only and Title+Desc. runs; (2) larger span sizes also tend to degrade performance in both Title and Title+Desc runs. Title-only unexpanded run was 10% better than Title+Description. Expansion of Title+Desc. queries resulted in relatively poorer performance than expansion of Title-only queries. For example, AveP of the worst Title+Desc expansion run (R=50, s=4, k=40) is 23% worse than the baseline, and AveP of the best run (R=5, s=1, k=10) is 8% better than the baseline. AveP of the worst Title-only run (R=50, s=5, k=20) is 4.5% worse than the baseline, and AveP of the best Title-only run (R=5, s=1, k=40) is 10.9% better than the baseline.</Paragraph>
      <Paragraph position="11"> Based on this data we decided to use Title-only terms for the official TREC run 'UWAThard2', and, given that values k=40 and s=1 contributed to a somewhat better performance, we used these values in all of our official expansion runs. The question of R value is obviously irrelevant here, as we used all documents selected by users in the clarification form. We used Okapi BM25 document retrieval function for topics with granularity Document, and Okapi passage retrieval function BM250 (Sparck Jones et al.</Paragraph>
      <Paragraph position="12"> 2000) for topics with other granularity values. For topics with granularity Sentence the best sentences were selected from the passages, returned by BM250, using the algorithm described in section 2.1 above.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Query expansion method 2
</SectionTitle>
    <Paragraph position="0"> The second user feedback mechanism that we evaluated consists of automatically selecting noun phrases from the top-ranked documents retrieved in the baseline run, and asking the users to select all phrases that contain possibly useful query expansion terms.</Paragraph>
    <Paragraph position="1"> The research question explored here is whether noun phrases provide sufficient context for the user to select potentially useful terms for query expansion.</Paragraph>
    <Paragraph position="3"> We take top 25 documents from the baseline run, and select 2 sentences per document using the algorithm described above. We have not experimented with alternative values for these two parameters. We then apply Brill's rule-based tagger (Brill 1995) and BaseNP noun phrase chunker (Ramshaw and Marcus 1995) to extract noun phrases from these sentences. The phrases are then parsed in Okapi to obtain their term weights, removing all stopwords and phrases consisting entirely of the original query terms. The remaining phrases are ranked by the sum of weights of their constituent terms.</Paragraph>
    <Paragraph position="4"> Top 78 phrases are then included in the clarification form for the user to select. This is the maximum number of phrases that could fit into the clarification form.</Paragraph>
    <Paragraph position="5"> All user-selected phrases were split into single terms, which were then used to expand the original user query. The expanded query was then searched against the HARD track database in the same way as in the query expansion method 1 described in the previous section.</Paragraph>
  </Section>
class="xml-element"></Paper>