<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3216"> <Title>A Phrase-Based HMM Approach to Document/Abstract Alignment</Title> <Section position="4" start_page="0" end_page="0" type="evalu"> <SectionTitle> 3 Evaluation and Results </SectionTitle> <Paragraph position="0"> In this section, we describe an intrinsic evaluation of the PBHMM document/abstract alignment model.</Paragraph> <Paragraph position="1"> All experiments in this paper are done on the Ziff-Davis corpus (statistics are in Table 4). In order to judge the quality of the alignments produced by a system, we first need to create a set of &quot;gold standard&quot; alignments. Two human annotators manually constructed such alignments between documents and their abstracts. Software for assisting this process was developed and is made freely available.</Paragraph> <Paragraph position="2"> An annotation guide, which explains in detail the document/abstract alignment process was also prepared and is freely available.4 4Both the software and documentation are available on the first author's web page. The alignments are also available; contact the authors for a copy.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Human Annotation </SectionTitle> <Paragraph position="0"> From the Ziff-Davis corpus, we randomly selected 45 document/abstract pairs and had both annotators align them. The first five were annotated separately and then discussed; the last 40 were done independently. null Annotators were asked to perform phrase-to-phrase alignments between abstracts and documents and to classify each alignment as either possible P or sure S, where P S. In order to calculate scores for phrase alignments, we convert all phrase alignments to word alignments. That is, if we have an alignment between phrases A and B, then this induces word alignments between a and b for all words a 2 A and b 2 B. Given an alignment A, we could calculate precision and recall as (see (Och and Ney, 2003)):</Paragraph> <Paragraph position="2"> One problem with these definitions is that phrase-based models are fond of making phrases. That is, when given an abstract containing &quot;the man&quot; and a document also containing &quot;the man,&quot; a human may prefer to align &quot;the&quot; to &quot;the&quot; and &quot;man&quot; to &quot;man.&quot; However, a phrase-based model will almost always prefer to align the entire phrase &quot;the man&quot; to &quot;the man.&quot; This is because it results in fewer probabilities being multiplied together.</Paragraph> <Paragraph position="3"> To compensate for this, we define soft precision (SoftP in the tables) by counting alignments where &quot;a b&quot; is aligned to &quot;a b&quot; the same as ones in which &quot;a&quot; is aligned to &quot;a&quot; and &quot;b&quot; is aligned to &quot;b.&quot; Note, however, that this is not the same as &quot;a&quot; aligned to &quot;a b&quot; and &quot;b&quot; aligned to &quot;b&quot;. This latter alignment will, of course, incur a precision error. The soft precision metric induces a new, soft F-Score, labeled SoftF.</Paragraph> <Paragraph position="4"> Often, even humans find it difficult to align function words and punctuation. A list of 58 function words and punctuation marks which appeared in the corpus (henceforth called the ignore-list) was assembled. 
<Paragraph position="5"> Annotator agreement was strong for Sure alignments and fairly weak for Possible alignments (considering only the 40 independently annotated pairs).</Paragraph> <Paragraph position="6"> When considering only Sure alignments, the kappa statistic for agreement (over 7.2 million items, 2 annotators and 2 categories) was 0.63. When words from the ignore-list were excluded, this rose to 0.68. Carletta (1995) suggests that kappa values over 0.80 reflect very strong agreement and that kappa values between 0.60 and 0.80 reflect good agreement.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Machine Translation Experiments </SectionTitle> <Paragraph position="0"> In order to establish baseline alignment models, we used IBM Model 4 (Brown et al., 1993) and the HMM alignment model (Vogel et al., 1996) as implemented in the GIZA++ package (Och and Ney, 2003), modified slightly to allow longer inputs and higher fertilities.</Paragraph> <Paragraph position="1"> Such translation models require sentence-aligned input. In the summarization task, however, one abstract sentence often corresponds to multiple document sentences. To overcome this problem, each sentence in an abstract was paired with three sentences from the corresponding document, selected using the techniques described by Marcu (1999). In an informal evaluation, 20 such pairs were randomly extracted and judged by a human. Each pair was rated 0 (the document sentences contain little to none of the information in the abstract sentence), 1 (the document sentences contain some of the information in the abstract sentence) or 2 (the document sentences contain all of the information). Of the twenty random examples, none were rated 0, five were rated 1, and fifteen were rated 2, giving a mean rating of 1.75.</Paragraph> <Paragraph position="2"> We ran experiments using the document sentences as both the source and the target language in GIZA++. When the document sentences were used as the target language, each abstract word needed to produce many document words, leading to very high fertilities. However, since each target word is generated independently, this led to very flat rewrite tables and, hence, to poor results. Performance increased dramatically when the document was used as the source language and the abstract as the target language. In all MT cases, the corpus was augmented with one-word sentence pairs, one for each word, in which that word translates as itself. In the two basic models, HMM and Model 4, the abstract sentence is the source language and the document sentences are the target language. To alleviate the fertility problem, we also ran experiments with the translation going in the opposite direction; these are called HMM-flipped and Model 4-flipped, respectively, and they tend to outperform the original translation direction. In all of these setups, 5 iterations of Model 1 were run, followed by 5 iterations of the HMM model. In the Model 4 cases, 5 iterations of Model 4 were run after the HMM iterations.</Paragraph>
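<Paragraph> The corpus preparation for these baselines can be sketched as follows. This is a simplified illustration rather than the scripts actually used: the word-overlap ranking stands in for the sentence-extraction technique of Marcu (1999), and the function names are illustrative.

def build_sentence_pairs(abstract_sents, document_sents, k=3):
    """Pair each abstract sentence with its k most similar document sentences.
    Word overlap is a simple stand-in for the extraction method of Marcu (1999)."""
    pairs = []
    for abs_sent in abstract_sents:          # each sentence is a list of tokens
        abs_set = set(abs_sent)
        ranked = sorted(document_sents,
                        key=lambda sent: len(abs_set.intersection(sent)),
                        reverse=True)
        doc_side = [w for sent in ranked[:k] for w in sent]
        pairs.append((abs_sent, doc_side))
    return pairs

def add_identity_pairs(pairs, vocabulary):
    """Append one-word sentence pairs so that every word may translate as itself,
    as was done for all MT runs."""
    return pairs + [([w], [w]) for w in vocabulary]
</Paragraph>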
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Cut and Paste Experiments </SectionTitle> <Paragraph position="0"> We also tested alignments using the Cut and Paste summary decomposition method (Jing, 2002), which is based on a non-trainable HMM. Briefly, the Cut and Paste HMM searches for long contiguous blocks of words in the document and abstract that are identical (up to stemming); the longest such sequences are aligned. By fixing a length cutoff n and ignoring sequences shorter than n, one can arbitrarily increase the precision of this method. We found that n = 2 yields the best balance between precision and recall (and the highest F-measure). The results of these experiments are shown under the header &quot;Cut &amp; Paste&quot;; this method clearly outperforms all of the MT-based models.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 PBHMM Experiments </SectionTitle> <Paragraph position="0"> Although the PBHMM is based on a dynamic programming algorithm, the effective search space in this model is enormous, even for moderately sized document/abstract pairs. We selected the 2000 shortest document/abstract pairs from the Ziff-Davis corpus for training; however, only 12 of the hand-annotated documents were included in this set, so we added the other 33 hand-annotated documents, yielding 2033 document/abstract pairs. We then performed sentence extraction on this corpus exactly as in the MT case, using the technique of Marcu (1999). The relevant statistics for this corpus are in Table 4. We also restricted the state space with a beam sized at 50% of the unrestricted state space.</Paragraph> <Paragraph position="1"> The PBHMM system was then trained on this abstract/extract corpus. The precision/recall results are shown in Table 5. Because the two human annotations are combined by taking their union, either human annotation would achieve a precision and recall of 1.0 against the combined standard. To give a sense of how well humans actually perform on this task (in addition to the kappa scores reported earlier), we instead compare each human against the other.</Paragraph> <Paragraph position="2"> One common precision mistake made by the PBHMM system is to align a summary word to document words when that summary word should be null-aligned.</Paragraph> <Paragraph position="3"> The PBHMM-O system is an oracle in which system-produced alignments are removed for summary words that should be null-aligned (according to the hand-annotated data). Doing this results in a rather significant gain in SoftP score.</Paragraph> <Paragraph position="4"> As we can see from Table 5, none of the machine translation models is well suited to this task, achieving, at best, an F-score of 0.298. The Cut &amp; Paste method performs significantly better, which is to be expected, since it is designed specifically for summarization. As one would expect, this method achieves higher precision than recall, though not by very much. Our method significantly outperforms both the IBM models and the Cut &amp; Paste method, achieving a precision of 0.456 and a recall nearing 0.7, yielding an overall F-score of 0.548.</Paragraph> </Section> </Section> </Paper>