<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1014">
  <Title>Novelty Detection: The TREC Experience</Title>
  <Section position="3" start_page="105" end_page="105" type="metho">
    <SectionTitle>
2 Input Data
</SectionTitle>
    <Paragraph position="0"> The first year of the novelty track (Harman, 2002) was a trial run in several ways. First, this was a new task for the community and participating groups had no training data or experience. But second, it was unclear how humans would perform this task and therefore creating the &amp;quot;truth&amp;quot; data was in itself a large experiment. NIST decided to minimize the cost by using 50 old topics from TRECs 6, 7, and 8.</Paragraph>
    <Paragraph position="1"> The truth data was created by asking NIST assessors (the humans performing this task) to identify the set of relevant sentences from each relevant document and then from that set of relevant sentences, mark those that were novel. Specifically, the assessors were instructed to identify a list of sentences  that were: 1. relevant to the question or request made in the description section of the topic, 2. their relevance was independent of any surrounding sentences, 3. they provided new information that had not been found in any previously picked sentences.</Paragraph>
    <Paragraph position="2">  Most of the NIST assessors who worked on this task were not the ones who created the original topics, nor had they selected the relevant documents. This turned out to be a major problem. The assessors' judgments for the topics were remarkable in that only a median of 2% of the sentences were judged to be relevant, despite the documents themselves being relevant. As a consequence, nearly every relevant sentence (median of 93%) was declared novel. This was due in large part to assessor disagreement as to relevancy, but also that fact that this was a new task to the assessors. Additionally, there was an encouragement not to select consecutive sentences, because the goal was to identify relevant and novel sentences minimally, rather than to try and capture coherent blocks of text which could stand alone. Unfortunately, this last instruction only served to confuse the assessors. Data from 2002 has not been included in the rest of this paper, nor are groups encouraged to use that data for further experiments because of these problems.</Paragraph>
    <Paragraph position="3"> In the second year of the novelty track (Soboroff and Harman, 2003), the assessors created their own new topics on the AQUAINT collection of three contemporaneous newswires. For each topic, the assessor composed the topic and selected twenty-five relevant documents by searching the collection. Once selected, the documents were ordered chronologically, and the assessor marked the relevant sentences and those relevant sentences that were novel. No instruction or limitation was given to the assessors concerning selection of consecutive sentences, although they were told that they did not need to choose an otherwise irrelevant sentence in order to resolve a pronoun reference in a relevant sentence. Each topic was independently judged by two different assessors, the topic author and a &amp;quot;secondary&amp;quot; assessor, so that the effects of different human judgments could be measured. The judgments of the primary assessor were used as ground truth for evaluation, and the secondary assessor's judgments were taken to represent a ceiling for system performance in this task.</Paragraph>
    <Paragraph position="4"> Another new feature of the 2003 data set was a division of the topics into two types. Twenty-eight of the fifty topics concerned events such as the bombing at the 1996 Olympics in Atlanta, while the remaining topics focused on opinions about controversial subjects such as cloning, gun control, and same-sex marriages. The topic type was indicated in the topic description by a &lt;toptype&gt; tag.</Paragraph>
    <Paragraph position="5"> This pattern was repeated for TREC 2004 (Soboroff, 2004), with fifty new topics (twenty-five events and twenty-five opinion) created in a similar manner and with the same document collection. For 2004, assessors also labeled some documents as irrelevant, and irrelevant documents up through the first twenty-five relevant documents were included in the document sets distributed to the participants. These irrelevant documents were included to increase the &amp;quot;noise&amp;quot; in the data set. However, the assessors only judged sentences in the relevant documents, since, by the TREC standard of relevance, a document is considered relevant if it contains any relevant information. null</Paragraph>
  </Section>
  <Section position="4" start_page="105" end_page="106" type="metho">
    <SectionTitle>
3 Task Definition
</SectionTitle>
    <Paragraph position="0"> There were four tasks in the novelty track: Task 1. Given the set of documents for the topic, identify all relevant and novel sentences.</Paragraph>
    <Paragraph position="1"> Task 2. Given the relevant sentences in all documents, identify all novel sentences.</Paragraph>
    <Paragraph position="2"> Task 3. Given the relevant and novel sentences in the first 5 documents only, find the relevant  and novel sentences in the remaining documents. Note that since some documents are irrelevant, there may not be any relevant or novel sentences in the first 5 documents for some topics.</Paragraph>
    <Paragraph position="3"> Task 4. Given the relevant sentences from all documents, and the novel sentences from the first 5 documents, find the novel sentences in the remaining documents.</Paragraph>
    <Paragraph position="4"> These four tasks allowed the participants to test their approaches to novelty detection given different levels of training: none, partial, or complete relevance information, and none or partial novelty information. null The test data for a topic consisted of the topic statement, the set of sentence-segmented documents, and the chronological order for those documents. For tasks 2-4, training data in the form of relevant and novel &amp;quot;sentence qrels&amp;quot; were also given. The data was released and results were submitted in stages to limit &amp;quot;leakage&amp;quot; of training data between tasks. Depending on the task, the system was to output the identifiers of sentences which the system determined to contain relevant and/or novel relevant information.</Paragraph>
  </Section>
  <Section position="5" start_page="106" end_page="107" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Because novelty track runs report their relevant and novel sentences as an unranked set, traditional measures of ranked retrieval effectiveness such as mean average precision can't be used. One alternative is to use set-based recall and precision. Let M be the number of matched sentences, i.e., the number of sentences selected by both the assessor and the system, A be the number of sentences selected by the assessor, and S be the number of sentences selected by the system. Then sentence set recall is R = M/A and precision is P = M/S.</Paragraph>
    <Paragraph position="1"> However, set-based recall and precision do not average well, especially when the assessor set sizes A vary widely across topics. Consider the following example as an illustration of the problems. One topic has hundreds of relevant sentences and the system retrieves 1 relevant sentence. The second topic has 1 relevant sentence and the system retrieves hundreds of sentences. The average for both recall and precision over these two topics is approximately .5 (the scores on the first topic are 1.0 for precision and essentially 0.0 for recall, and the scores for the second topic are the reverse), even though the system did precisely the wrong thing. While most real systems wouldn't exhibit this extreme behavior, the fact remains that set recall and set precision averaged over a set of topics is not a robust diagnostic indicator  precision and recall components. The lines show contours at intervals of 0.1 points of F. The black numbers are per-topic scores for one TREC system.</Paragraph>
    <Paragraph position="2"> of system performance. There is also the problem of how to define precision when the system returns no sentences (S = 0). Leaving that topic out of the evaluation for that run would mean that different systems would be evaluated over different numbers of topics. The standard procedure is to define precision to be 0 when S = 0.</Paragraph>
    <Paragraph position="3"> To avoid these problems, the primary measure used in the novelty track was the F measure. The F measure (which is itself derived from van Rijsbergen's E measure (van Rijsbergen, 1979)) is a function of set recall and precision, together with a parameter b which determines the relative importance of recall and precision:</Paragraph>
    <Paragraph position="5"> A b value of 1, indicating equal weight, is used in the novelty track:</Paragraph>
    <Paragraph position="7"> Alternatively, this can be formulated as Fb=1 = 2x(# relevant retrieved)(# retrieved) + (# relevant) For any choice of b, F lies in the range [0, 1], and the average of the F measure is meaningful even when the judgment sets sizes vary widely. For example, the F measure in the scenario above is essentially 0, an intuitively appropriate score for such behavior.</Paragraph>
    <Paragraph position="8"> Using the F measure also deals with the problem of  what to do when the system returns no sentences since recall is 0 and the F measure is legitimately 0 regardless of what precision is defined to be.</Paragraph>
    <Paragraph position="9"> Note, however, that two runs with equal F scores do not indicate equal precision and recall. The contour lines in Figure 1 illustrate the shape of the F measure in recall-precision space. An F score of 0.5, for example, can describe a range of precision and recall scores. Figure 1 also shows the per-topic scores for a particular TREC run. It is easy to see that topics 98, 83, 82, and 67 exhibit a wide range of performance, but all have an F score of close to 0.6.</Paragraph>
    <Paragraph position="10"> Thus, two runs with equal F scores may be performing quite differently, and a difference in F scores can be due to changes in precision, recall, or both. In practice, if F is used, precision and recall should also be examined, and we do so in the analysis which follows. null</Paragraph>
  </Section>
class="xml-element"></Paper>