<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1008">
  <Title>Document Fusion for Comprehensive Event Description</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Fusing Documents
</SectionTitle>
    <Paragraph position="0"> Before developing a document fusion system,  some basic issues have to be considered. 1. On which level of granularity are the documents fused (i.e., word or phrase level, sentence level, or paragraph level? 2. How to decide whether news fragments from different sources convey the same information? null 3. How to ensure readability of the fused document? I.e., where should information stem null ming from different documents be placed in the fused document, retaining a natural flow of information.</Paragraph>
    <Paragraph position="1"> Each of the these issues is addressed in the following subsections.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Segmentation
</SectionTitle>
      <Paragraph position="0"> In the current implementation, we decided to fuse documents on paragraph level for two reasons: First, paragraphs are less context-dependent than sentences and are therefore easier to compare. Second, compiling paragraphs yields a better readability of the fused document. It should be noted that paragraphs are rather short in news stories, rarely being longer than three sentences.</Paragraph>
      <Paragraph position="1"> When putting together (fusing) pieces of text from different sources in a way that was not anticipated by the writers of the news stories, it can introduce information gaps. For instance, if a paragraph containing a pronoun is taken out of its original context and placed in a new context (the fused document), this can lead to dangling pronouns, which cannot be correctly resolved anymore. In general, this problem does not only hold for pronouns but for all kind of anaphoric expressions such as pronouns, definite noun phrases (e.g., the negotiations) and anaphoric adverbials (e.g., later). To cope with this problem simple segmentation is applied as a pre-processing step where paragraphs that contain pronouns or simple definite noun phrases are attached to the preceding paragraph. A more sophisticated approach to text segmentation is described in (Hearst, 1997).</Paragraph>
      <Paragraph position="2"> Obviously, it would be better to use an automatic anaphora resolution component to cope with this problem, see, e.g., (Kennedy and Boguraev, 1996; Kameyama, 1997), where anaphoric expressions are replaced by their antecedents, but at the moment, the integration of such a component remains future work.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Informativity
</SectionTitle>
      <Paragraph position="0"> (Radev, 2000) describes 24 cross-document relations that can hold between their segments, one of which is the subsumption (or entailment) relation.</Paragraph>
      <Paragraph position="1"> In the context of document fusion, we focus on the entailment relation and how it can be formally defined; unfortunately, (Radev, 2000) provides no formal definition for any of the relations.</Paragraph>
      <Paragraph position="2"> Computing the informativity of a segment compared to another segment is an essential task during document fusion. Here, we say that the i-th segment of document d (si;d) is more informative than thej-th segment of documentd0 (sj;d0) if si;d entails sj;d0. In theory, this should be proven logically, but in practice this is far beyond the current state of the art in natural language processing. Additionally, a binary logical decision might also be too strict for simulating the human understanding of entailment.</Paragraph>
      <Paragraph position="3"> A simple but nevertheless quite effective solution is based on one of the simpler similarity measures in information retrieval (IR), where texts are simply represented as bags of (weighted) words.</Paragraph>
      <Paragraph position="4"> The definition of the entailment score (es) is given in (1). es(si;d;sj;d0) compares the sum of the weights of terms that appear in both segments to the total sum weights of sj;d0.</Paragraph>
      <Paragraph position="6"> The weight of a term ti is its inverse document frequency (idf i), as defined in (2), where N is the number of all segments in the set of related documents (the topic) and ni is the number of segments in which the term ti occurs.</Paragraph>
      <Paragraph position="8"> (2) Terms which occur in many segments (i.e., for which ni is rather large), such as the, some, etc., receive a lower idf -score than terms that occur only in a few segments. The underlying intuition of the idf -score is that terms with a higher idf -score are better suited for discriminating the content of a particular segment from the other segments in the topic, or to put it differently, they are more content-bearing. Note, that the logarithm in (2) is only used for dampening the differences.</Paragraph>
      <Paragraph position="9"> The entailment score es(si;d;sj;d0) measures how many of the words of the segment si;d occur in sj;d0, and how important those words are. This is obviously a very shallow approach to entailment computation, but nevertheless it proved to be effective, see (Monz and de Rijke, 2001).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Implementation
</SectionTitle>
      <Paragraph position="0"> In this subsection, we present the general algorithm underlying the implementation, given a set of documents belonging to topic T. The implementation has to tackle two basic tasks. First, identify segments that are entailed by other segments and use the more informative one. Second, place the remaining segments at positions with similar content. The fusion algorithm depicted in Figure 1 consists of five steps.</Paragraph>
      <Paragraph position="1"> 1. is basically a pre-processing step as explained above. 2. computes pairwise the cross-document entailment scores for all segments inT. Although the pairwise computation of es and sim is exponential in the number of documents in T, it still remains computationally tractable in practice. For instance, for a topic containing 4 documents (the average case) it takes 10 CPU seconds to compute all entailment and similarity relations.</Paragraph>
      <Paragraph position="2"> For a topic containing 8 documents (an artificially constructed extreme case) it takes 66 CPU seconds; both on a 600 MHz Pentium III PC.</Paragraph>
      <Paragraph position="3"> In 3., one of the documents is taken as base for the fusion process. Starting with a 'real' document improves the readability of the final fused documents as it imposes some structure on the fusion process. There are several ways to select the base document. For instance, take the document with the most unique terms, or the document with the highest document weight (sum of all idf scores). In the current implementation we simply took the longest document within the topic, which ensures a good base coverage of an event.</Paragraph>
      <Paragraph position="4"> 4. and 5. are the actual fusion steps. Step 4. replaces a segment si;dF in the fused document by a segment sj;d0 from another document if sj;d0 is the segment maximally entailing si;dF and if it is significantly (above the threshold es) more informative than si;dF . Choosing an optimal value for es is essential for the effectiveness of the fusion system. Section 3 discusses some of our experiments to determine es.</Paragraph>
      <Paragraph position="5"> Step 5. is kind of complementary to step 4., where related but more informative segments are identified. Step 5. identifies segments that add new information to dF, where a segment sj;d0 is new if it has low similarity to all segments in dF, i.e., if the the similarity score is below the threshold sim. If a segment sj;d0 is new, it is placed right after the segment in dF to which it is most similar.</Paragraph>
      <Paragraph position="6"> Similarity is implemented as the traditional cosine similarity in information retrieval, as defined in (3). This similarity measure is also known as the tfc.tfc measure, see (Salton and Buckley, 1988).</Paragraph>
      <Paragraph position="8"> Where wk;si;d is the weight associated with the  term tk in segment si;d. In the nominator of (3), 1. segmentize all documents in T 2. for all si;d;sj;d0 s.t. d;d02T and d6= d0: compute es(si;d;sj;d0) 3. select a document d2T as fusion base document: dF 4. for all si;dF : find sj;d0 s.t. dF 6= d0 and sj;d0 = arg maxs</Paragraph>
      <Paragraph position="10"> if es(sj;d0;si;dF) &gt; es then replace si;dF by sj;d0 in the fused document 5. for all sj;d0 s.t. sj;d062dF: if for all si;dF : sim(si;dF;sj;d0) &lt; sim, then find the most similar si;dF : si;dF = arg maxs</Paragraph>
      <Paragraph position="12"> the weights of the terms that occur insi;d andsj;d0 are summed up. The denominator is used for normalization. Otherwise, longer documents tend to result in a higher similarity score. In the current implementation wk;si;d = idf k for all si;d. The reader is referred to (Salton and Buckley, 1988; Zobel and Moffat, 1998) for a broad spectrum of similarity measures for information retrieval.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation Issues
</SectionTitle>
    <Paragraph position="0"> The document fusion system is evaluated in two steps. First, the effectiveness of entailment detection is evaluated, which is the key component of our system. Then we present some preliminary evaluation of the whole system focusing on the quality of the fused documents.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Evaluating Entailment
</SectionTitle>
      <Paragraph position="0"> Recently, we have started to build a small test collection for evaluating entailment relations. The reader is referred to (Monz and de Rijke, 2001) for more details on the results presented in this subsection.</Paragraph>
      <Paragraph position="1"> For each of the 21 topics in our test corpus two documents in the topic were randomly selected, and given to a human assessor to determine all subsumption relations between segments in different documents (within the same topic). Judgments were made on a scale 0-2, according to the extent to which one segment was found to entail another.</Paragraph>
      <Paragraph position="2"> Out of the 12083 possible subsumption relations between the text segments, 501 (4.15%) received a score of 1, and 89 (0.73%) received a score of 2.</Paragraph>
      <Paragraph position="3"> Let a subsumption pair be an ordered pair of segments (si;d, sj;d0) that may or may not stand in the subsumption relation, and let a correct subsumption pair be a subsumption pair (si;d, sj;d0) for which si;d does indeed entail sj;d0. Further, a computed subsumption pair is a subsumption pair for which our subsumption method has produced a score above the subsumption threshold.</Paragraph>
      <Paragraph position="4"> Then, precision is the fraction of computed subsumption pairs that is correct: Precision = number of correct subsumption pairs computedtotal number of subsumption pairs computed : And recall is the proportion of the total number of correct subsumption pairs that were computed: Recall = number of correct subsumption pairs computedtotal number of correct subsumption pairs : Observe that precision and recall depend on the subsumption threshold that we use.</Paragraph>
      <Paragraph position="5"> We computed average recall and precision at 11 different subsumption thresholds, ranging from 0 to 1, with :1 increments; the average was computed over all topics. The results are summarized in Figures 2 (a) and (b).</Paragraph>
      <Paragraph position="6"> Since precision and recall suggest two different optimal subsumption thresholds, we use the F-Score, or harmonic mean, which has a high value only when both recall and precision are high.</Paragraph>
      <Paragraph position="7">  The optimal subsumption threshold for human judgments &gt; 0 is around 0.18, while it is approximately 0:4 for human judgments &gt; 1. This confirms the intuition that a higher threshold is more effective when human judgments are stricter.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Evaluating Fusion
</SectionTitle>
      <Paragraph position="0"> In the introduction, it was pointed out that document fusion by hand can be rather laborious, and the same holds for the evaluation of automatic document fusion. Similar to automatic summarization, there are no standard document collections or clear evaluation criteria aiding to automatize the process of evaluation. One approach could be to focus on news stories which mention their sources. For instance CNN's new stories often say that &amp;quot;AP and Reuters contributed to this story&amp;quot;. On the other hand one has to be cautious to take those news stories as gold standard as the respective contributions of the journalist and his or her sources are not made explicit.</Paragraph>
      <Paragraph position="1"> In the area of multi-document summarization, there is a distinction between intrinsic and extrinsic evaluation, see (Mani et al., 1998). Intrinsic evaluation judges the quality directly based on analysis of the summary. Usually, a human judge assesses the quality of a summary based on some standardized evaluation criteria.</Paragraph>
      <Paragraph position="2"> In extrinsic evaluation, the usefulness of a summary is judged based on how it affects the completion of some other task. A typical task used for extrinsic evaluation is ad-hoc retrieval, where the relevance of a retrieved document is assessed by a human judge based on the document's summary. Then, those judgments are compared to judgments based on original documents, see, e.g., (Brandow et al., 1995; Mani and Bloedorn, 1999).</Paragraph>
      <Paragraph position="3"> At this stage we have just carried out some preliminary evaluation. The test collection consists of 69 news stories categorized into 21 topics. Categorization was done by hand, but it is also possible to have information filtering, see (Robertson and Hull, 2001), or topic detection and tracking (TDT) tools carrying out this task (Allan et al., 1998). All documents belonging to the same topic were released on the same day and describe the same event. Table 1 provides further details on the collection.</Paragraph>
      <Paragraph position="4"> avg. per topic no. of docs. 3.3 docs.</Paragraph>
      <Paragraph position="5"> length of a doc. 612 words length of all docs. together 2115 words length of longest doc. 783 words length of shortest doc. 444 words  ments).</Paragraph>
      <Paragraph position="6"> In addition to the aforementioned news agencies, the collection includes texts from the L.A. Times, Washington Post and Washington Times.</Paragraph>
      <Paragraph position="7"> In general, a segment should be included in the fused document if it did not occur before to avoid redundancy (False Alarm), and if it adds information, so no information is left out (Miss). As in IR or TDT, Miss and False Alarm tend to be inversely related; i.e., a decrease of Miss often results in an increase of False Alarm and vice versa.</Paragraph>
      <Paragraph position="8"> Table 2 illustrates the different possibilities how the system responds as to whether a segment should be included in the fused document and how a human reader judges.</Paragraph>
      <Paragraph position="9">  Then, Miss and False Alarm can be defined as in (4) and (5), respectively.</Paragraph>
      <Paragraph position="11"> The fusion impact factor (fif) describes to what extent the different sources actually contributed to the fused document. For instance if the fused document solely contains segments from one source, fif equals 0, and if all sources equally contributed it equals 1. This can be formalized as follows:</Paragraph>
      <Paragraph position="13"> Where S is a set of related documents, and nT is its size. nseg is the number of segments in the fused document and nseg;d is the number of segments stemming from document d.</Paragraph>
      <Paragraph position="14"> For our test collection, the average fusion impact factor was 0.56. Of course the fif -score depends on the choice of es and sim, in a way that a lower value of es or a higher value of sim increases the fif -score. In this case, es = 0:2 and sim = 0:05.</Paragraph>
      <Paragraph position="15"> Table 3 shows the length of the fused documents in average compared to the longest, shortest, and all documents in a topic, for es = 0:2 and sim = 0:05.</Paragraph>
      <Paragraph position="16"> avg. compression ratio per topic all docs. together 0.55 longest doc. 1.36 shortest doc. 2.55  Measuring Miss intrinsically is extremely laborious; especially comparing the effectiveness of different values for the thresholds es and sim is infeasible in practice. Therefore, we decided to measure Miss extrinsically. We used ad-hoc retrieval as the extrinsic evaluation task. The evaluation criterion is stated as follows: Using the fused document of each topic as a query, what is the average (non-interpolated) precision? As baseline, we concatenated all documents of each topic. This would constitute an event description that does not miss any information within the topic. This document is then used to query a collection of 242,996 documents, containing the 69 documents from our test collection. Since the baseline is simply the concatenation of all documents within the topic, one can expect that all documents from that topic receive a high rank in the set of retrieved documents. This average precision forms the optimal performance for that topic. For instance, if a topic contains three documents, and the ad-hoc retrieval ranks those documents as 1, 3, and 6, there are three recall levels: 33: 3%, 66: 6%, and 100%. The precision at these levels is 1/1, 2/3, and 3/6 respectively, which averages to 0:7 2.</Paragraph>
      <Paragraph position="17"> The next step is to compare the actually fused documents to the baseline. It is to be expected that the performance is worse, because the fused documents do not contain segments which are entailed by other segments in the topic. For instance, if the fused document for the aforementioned topic is used as a query and the original documents of the topic are ranked as 2, 4, and 9, the average precision is (1/2 + 2/4 + 3/9)/3 = 0: 4.</Paragraph>
      <Paragraph position="18"> Compared to 0:7 2 for the baseline, fusion leads to a decrease of effectiveness of approximately 38.5%. Figure 4, gives the averaged precision for the different values for es.</Paragraph>
      <Paragraph position="19">  It is not obvious how to interpret the numerical value of the ad-hoc retrieval precision in terms of Miss, but the degree of deviation from the base-line gives a rough estimate of the completeness of a fused document. At least, this allows for an ordinally scaled ranking of the different methods (in our case different values for es), that are used for generating the fused documents. Figure 4 illustrates that in the context of the ad-hoc retrieval evaluation an optimal entailment threshold ( es) lies around 0.2. Table 4 shows the decrease in retrieval effectiveness in percent, compared to the baseline. The average precision at 0.2 is 0.8614, which is just 11:5% below the baseline.</Paragraph>
      <Paragraph position="20"> For all ad-hoc retrieval experiments, the Lnu.ltu weighting scheme, see (Singhal et al., 1996), has been used, which is one of the best-performing weighting schemes in ad-hoc retrieval. In addition to the 69 documents from our collection, the retrieval collection contains articles from Associ- null ated Press 1988-1990 (from the TREC distribution), which also belong to the newswire or newspaper domain. Any meta information such as the name of the journalist or news agency is removed to avoid matches based on that information.</Paragraph>
      <Paragraph position="21"> In the context of multi-document summarization, (Stein et al., 2000) use topic clustering for extrinsic evaluation. Although we did not carry out any evaluation based on topic clustering, it seems that it could also be applied to multi-document fusion, given the close relationship between fusion and summarization on the one hand and retrieval and clustering on the other hand.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>