<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0905">
  <Title>Evaluating Automatic Summaries of Meeting Recordings</Title>
  <Section position="4" start_page="34" end_page="35" type="metho">
    <SectionTitle>
3 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> We used human summaries of the ICSI Meeting corpus for evaluation and for training the feature-based approaches. An evaluation set of six meetings was defined and multiple human summaries were created for these meetings, with each test meeting having either three or four manual summaries. The remaining meetings were regarded as training data and a single human summary was created for these. Our summaries were created as follows.</Paragraph>
    <Paragraph position="1"> Annotators were given access to a graphical user interface (GUI) for browsing an individual meeting that included earlier human annotations: an orthographic transcription time-synchronized with the audio, and a topic segmentation based on a shallow hierarchical decomposition with keyword-based text labels describing each topic segment. The annotators were told to construct a textual summary of the meeting aimed at someone who is interested in the research being carried out, such as a researcher who does similar work elsewhere, using four headings: * general abstract: &amp;quot;why are they meeting and what do they talk about?&amp;quot;; * decisions made by the group; * progress and achievements; * problems described The annotators were given a 200 word limit for each heading, and told that there must be text for the general abstract, but that the other headings may have null annotations for some meetings.</Paragraph>
    <Paragraph position="2"> Immediately after authoring a textual summary, annotators were asked to create an extractive summary, using a different GUI. This GUI showed both their textual summary and the orthographic transcription, without topic segmentation but with one line per dialogue act based on the pre-existing MRDA coding (Shriberg et al., 2004) (The dialogue act categories themselves were not displayed, just the segmentation). Annotators were told to extract dialogue acts that together would convey the information in the textual summary, and could be used to support the correctness of that summary. They were given no specific instructions about the number or percentage of acts to extract or about redundant dialogue act. For each dialogue act extracted, they were then required in a second pass to choose the sentences from the textual summary supported by the dialogue act, creating a many-to-many mapping between the recording and the textual summary.</Paragraph>
    <Paragraph position="3"> The MMR and LSA approaches are both unsupervised and do not require labelled training data. For both feature-based approaches, the GMM classifiers were trained on a subset of the training data representing approximately 20 hours of meetings.</Paragraph>
    <Paragraph position="4"> We performed summarization using both the human transcripts and speech recognizer output. The speech recognizer output was created using base-line acoustic models created using a training set consisting of 300 hours of conversational telephone speech from the Switchboard and Callhome corpora. The resultant models (cross-word triphones trained on conversational side based cepstral mean normalised PLP features) were then MAP adapted to the meeting domain using the ICSI corpus (Hain et al., 2005). A trigram language model was employed. Fair recognition output for the whole corpus was obtained by dividing the corpus into four parts, and employing a leave one out procedure (training the acoustic and language models on three parts of the corpus and testing on the fourth, rotating to obtain recognition results for the full corpus). This resulted in an average word error rate (WER) of 29.5%. Automatic segmentation into dialogue acts or sentence boundaries was not performed: the dialogue act boundaries for the manual transcripts were mapped on to the speech recognition output.</Paragraph>
    <Section position="1" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
3.1 Description of the Evaluation Schemes
</SectionTitle>
      <Paragraph position="0"> A particular interest in our research is how automatic measures of informativeness correlate with human judgments on the same criteria. During the development stage of a summarization system it is not feasible to employ many hours of manual evaluations, and so a critical issue is whether or not software packages such as ROUGE are able to measure informativeness in a way that correlates with subjective summarization evaluations.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="35" end_page="36" type="metho">
    <SectionTitle>
3.1.1 ROUGE
</SectionTitle>
    <Paragraph position="0"> Gauging informativeness has been the focus of automatic summarization evaluation research.</Paragraph>
    <Paragraph position="1"> We used the ROUGE evaluation approach (Lin and Hovy, 2003), which is based on n-gram co-occurrence between machine summaries and &amp;quot;ideal&amp;quot; human summaries. ROUGE is currently the standard objective evaluation measure for the Document Understanding Conference 1; ROUGE does not assume that there is a single &amp;quot;gold standard&amp;quot; summary. Instead it operates by matching the target summary against a set of reference summaries. ROUGE-1 through ROUGE-4 are simple n-gram co-occurrence measures, which check whether each n-gram in the reference summary is contained in the machine summary. ROUGE-L and ROUGE-W are measures of common subsequences shared between two summaries, with ROUGE-W favoring contiguous common subsequences. Lin (Lin and Hovy, 2003) has found that ROUGE-1 and ROUGE-2 correlate well with human judgments.</Paragraph>
    <Paragraph position="2">  The subjective evaluation portion of our research utilized 5 judges who had little or no familiarity with the content of the ICSI meetings. Each judge evaluated 10 summaries per meeting, for a total of sixty summaries. In order to familiarize themselves with a given meeting, they were provided with a human abstract of the meeting and the full transcript of the meeting with links to the audio. The human judges were instructed to read the abstract, and to consult the full transcript and audio as needed, with the entire familiarization stage not to exceed 20 minutes. The judges were presented with 12 questions at the end of each summary, and were instructed that upon beginning the questionnaire they should not reconsult the summary itself. 6 of the questions regarded informativeness and 6 involved readability and coherence, though our current research concentrates on the informativeness evaluations. The eval- null 2. The summary avoids redundancy.</Paragraph>
    <Paragraph position="3"> 3. The summary sentences on average seem relevant. null 4. The relationship between the importance of each topic and the amount of summary space given to that topic seems appropriate.</Paragraph>
    <Paragraph position="4"> 5. The summary is repetitive.</Paragraph>
    <Paragraph position="5"> 6. The summary contains unnecessary informa null tion.</Paragraph>
    <Paragraph position="6"> Statements such as 2 and 5 above are measuring the same impressions, with the polarity of the statements merely reversed, in order to better gauge the reliability of the answers. The readability/coherence portion consisted of the following statements:  1. It is generally easy to tell whom or what is being referred to in the summary.</Paragraph>
    <Paragraph position="7"> 2. The summary has good continuity, i.e. the sentences seem to join smoothly from one to another. null 3. The individual sentences on average are clear and well-formed.</Paragraph>
    <Paragraph position="8"> 4. The summary seems disjointed.</Paragraph>
    <Paragraph position="9"> 5. The summary is incoherent.</Paragraph>
    <Paragraph position="10"> 6. On average, individual sentences are poorly constructed.</Paragraph>
    <Paragraph position="11">  It was not possible in this paper to gauge how responses to these readability statements correlate with automatic metrics, for the reason that automatic metrics of readability and coherence have not been widely discussed in the field of summarization. Though subjective evaluations of summaries are often divided into informativeness and readability questions, only automatic metrics of informativeness have been investigated in-depth by the summarization community. We believe that the development of automatic metrics for coherence and readability should be a high priority for researchers in summarization evaluation and plan on pursuing this avenue of research. For example, work on coherence in NLG (Lapata, 2003) could potentially inform summarization evaluation. Mani (Mani et al.,</Paragraph>
  </Section>
  <Section position="6" start_page="36" end_page="38" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The results of these experiments can be analyzed in various ways: significant differences of ROUGE results across summarization approaches, deterioration of ROUGE results on ASR versus manual transcripts, significant differences of human evaluations across summarization approaches, deterioration of human evaluations on ASR versus manual transcripts, and finally, the correlation between ROUGE and human evaluations.</Paragraph>
    <Section position="1" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
4.1 ROUGE results across summarization
approaches
</SectionTitle>
      <Paragraph position="0"> All of the machine summaries were 10% of the original document length, in terms of the number of dialogue acts contained. Of the four approaches to summarization used herein, the latent semantic analysis method performed the best on every meeting tested for every ROUGE measure with the exception of ROUGE-3 and ROUGE-4. This approach was significantly better than either feature-based approach (p&lt;0.05), but was not a significant improvement over MMR. For ROUGE-3 and ROUGE-4, none of the summarization approaches were significantly different from each other, owing to data sparsity. Figure 1 gives the ROUGE-1, ROUGE-2 and ROUGE-L results for each of the summarization approaches, on both manual and ASR transcripts.</Paragraph>
      <Paragraph position="1">  The results of the four summarization approaches on ASR output were much the same, with LSA and MMR being comparable to each other, and each of them outperforming the feature-based approaches.</Paragraph>
      <Paragraph position="2"> On ASR output, LSA again consistently performed the best.</Paragraph>
      <Paragraph position="3"> Interestingly, though the LSA approach scored higher when using manual transcripts than when using ASR transcripts, the difference was small and insignificant despite the nearly 30% WER of the ASR. All of the summarization approaches showed minimal deterioration when used on ASR output as compared to manual transcripts, but the LSA approach seemed particularly resilient, as evidenced by Figure 1. One reason for the relatively small impact of ASR output on summarization results is that for each of the 6 meetings, the WER of the summaries was lower than the WER of the meeting as a whole. Similarly, Valenza et al (Valenza et al., 1999) and Zechner and Waibel (Zechner and Waibel, 2000) both observed that the WER of extracted summaries was significantly lower than the overall WER in the case of broadcast news. The table below demonstrates the discrepancy between summary WER and meeting WER for the six meetings used in this research.</Paragraph>
      <Paragraph position="4">  WER% for Summaries and Meetings There was no improvement in the second feature-based approach (adding an LSA sentence score) as compared with the first feature-based approach. The sentence score used here relied on a reduction to 300 dimensions, which may not have been ideal for this data.</Paragraph>
      <Paragraph position="5"> The similarity between the MMR and LSA approaches here mirrors Gong and Liu's findings, giving credence to the claim that LSA maximizes relevance and minimizes redundancy, in a different and more opaque manner then MMR, but with similar</Paragraph>
    </Section>
    <Section position="2" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
Transcripts
</SectionTitle>
      <Paragraph position="0"> results. Regardless of whether or not the singular vectors of V T can rightly be thought of as topics or concepts (a seemingly strong claim), the LSA approach was as successful as the more popular MMR algorithm.</Paragraph>
    </Section>
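    For readers unfamiliar with the two unsupervised selectors, the sketch below shows Gong and Liu-style LSA selection and greedy MMR over dialogue acts under simplifying assumptions (raw term counts, cosine similarity, the meeting centroid as the MMR query, an illustrative lambda); it follows the general shape of these methods rather than the exact implementations evaluated here.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsa_select(dialogue_acts, k):
    """Gong & Liu-style selection: one dialogue act per leading right singular vector."""
    A = CountVectorizer().fit_transform(dialogue_acts)          # DAs x terms
    # SVD of the term-by-DA matrix; each row of Vt scores DAs against one "concept"
    U, S, Vt = np.linalg.svd(A.T.toarray(), full_matrices=False)
    chosen = []
    for concept in range(min(k, Vt.shape[0])):
        for idx in np.argsort(-np.abs(Vt[concept])):            # best-loading DA for this concept
            if idx not in chosen:
                chosen.append(int(idx))
                break
    return sorted(chosen)

def mmr_select(dialogue_acts, k, lam=0.7):
    """Greedy MMR: trade off relevance to the whole meeting against redundancy."""
    X = CountVectorizer().fit_transform(dialogue_acts)
    centroid = np.asarray(X.mean(axis=0))                       # meeting centroid as query
    relevance = cosine_similarity(X, centroid).ravel()
    sim = cosine_similarity(X)
    chosen = [int(np.argmax(relevance))]
    k = min(k, X.shape[0])
    while len(chosen) < k:
        remaining = [i for i in range(X.shape[0]) if i not in chosen]
        scores = [lam * relevance[i] - (1 - lam) * max(sim[i][j] for j in chosen)
                  for i in remaining]
        chosen.append(remaining[int(np.argmax(scores))])
    return sorted(chosen)

acts = ["we need more training data for the recognizer",
        "yeah more data would definitely help",
        "let's also fix the segmentation bug",
        "the segmentation bug breaks the alignment"]
print(lsa_select(acts, k=2), mmr_select(acts, k=2))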
    <Section position="3" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
4.2 Human results across summarization
approaches
</SectionTitle>
      <Paragraph position="0"> Table 1 presents average ratings for the six statements across four summarization approaches on manual transcripts. Interestingly, the first feature-based approach is given the highest marks on each criterion. For statements 2, 5 and 6 FB1 is significantly better than the other approaches. It is particularly surprising that FB1 would score well on statement 2, which concerns redundancy, given that MMR and LSA explicitly aim to reduce redundancy while the feature-based approaches are merely classifying utterances as relevant or not. The second feature-based approach was not significantly worse than the first on this score.</Paragraph>
      <Paragraph position="1"> Considering the difficult task of evaluating ten extractive summaries per meeting, we are quite satisfied with the consistency of the human judges. For example, statements that were merely reworded versions of other statements were given consistent ratings. It was also the case that, with the exception of evaluating the sixth statement, judges were able to tell that the manual extracts were superior to the automatic approaches.</Paragraph>
      <Paragraph position="2">  Table 2 presents average ratings for the six statements across four summarization approaches on ASR transcripts. The LSA and MMR approaches performed better in terms of having less deteri-</Paragraph>
    </Section>
    <Section position="4" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
Summarization Approaches
</SectionTitle>
      <Paragraph position="0"> oration of scores when used on ASR output instead of manual transcripts. LSA-ASR was not significantly worse than LSA on any of the 6 ratings. MMR-ASR was significantly worse than MMR on only 3 of the 6. In contrast, FB1-ASR was significantly worse than FB1 for 5 of the 6 approaches, reinforcing the point that MMR and LSA seem to favor extracting utterances with fewer errors. Figures 2, 3 and 4 depict the how the ASR and manual approaches affect the INFORMATIVENESS-1, INFORMATIVENESS-4 and INFORMATIVENESS-6 ratings, respectively.</Paragraph>
      <Paragraph position="1"> Note that for Figure 6, a higher score is a worse rating. null</Paragraph>
    </Section>
    <Section position="5" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
4.3 ROUGE and Human correlations
</SectionTitle>
      <Paragraph position="0"> According to (Lin and Hovy, 2003), ROUGE-1 correlates particularly well with human judgments of informativeness. In the human evaluation survey discussed here, the first statement (INFORMATIVENESS-1) would be expected to correlate most highly with ROUGE-1, as it is ask-</Paragraph>
    </Section>
    <Section position="6" start_page="38" end_page="38" type="sub_section">
      <SectionTitle>
Summarization Approaches
</SectionTitle>
      <Paragraph position="0"> ing whether the summary contains the important points of the meeting. As could be guessed from the discussion above, there is no significant correlation between ROUGE-1 and human evaluations when analyzing only the 4 summarization approaches on manual transcripts. However, when looking at the 4 approaches on ASR output, ROUGE-1 and INFORMATIVENESS-1 have a moderate and significant positive correlation (Spearman's rho = 0.500, p &lt; 0.05). This correlation on ASR output is strong enough that when ROUGE-1 and INFORMATIVENESS-1 scores are tested for correlation across all 8 summarization approaches, there is a significant positive correlation (Spearman's rho = 0.388, p &lt; 0.05).</Paragraph>
      <Paragraph position="1"> The other significant correlations for ROUGE-1 across all 8 summarization approaches are with INFORMATIVENESS-2, INFORMATIVENESS-5 and INFORMATIVENESS-6. However, these are negative correlations. For example, with regard to INFORMATIVENESS-2, summaries that are rated as having a high level of redundancy are given high ROUGE-1 scores, and summaries with little redundancy are given low ROUGE-1 scores. Similary, with regard to INFORMATIVENESS-6, summaries that are said to have a great deal of unnecessary information are given high ROUGE-1 scores. It is difficult to interpret some of these negative correlations, as ROUGE does not measure redundancy and would not necessarily be expected to correlate with redundancy evaluations.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="38" end_page="38" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> In general, ROUGE did not correlate well with the human evaluations for this data. The MMR and LSA approaches were deemed to be significantly better than the feature-based approaches according to ROUGE, while these findings were reversed according to the human evaluations. An area of agreement, however, is that the LSA-ASR and MMR-ASR approaches have a small and insignificant decline in scores compared with the decline of scores for the feature-based approaches. One of the most interesting findings of this research is that MMR and LSA approaches used on ASR tend to select utterances with fewer ASR errors.</Paragraph>
    <Paragraph position="1"> ROUGE has been shown to correlate well with human evaluations in DUC, when used on news corpora, but the summarization task here - using conversational speech from meetings - is quite different from summarizing news articles. ROUGE may simply be less applicable to this domain.</Paragraph>
  </Section>
  <Section position="8" start_page="38" end_page="39" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> It remains to be determined through further experimentation by researchers using various corpora whether or not ROUGE truly correlates well with human judgments. The results presented above are mixed in nature, but do not present ROUGE as being sufficient in itself to robustly evaluate a summarization system under development.</Paragraph>
    <Paragraph position="1"> We are also interested in developing automatic metrics of coherence and readability. We now have human evaluations of these criteria and are ready to  begin testing for correlations between these subjective judgments and potential automatic metrics.</Paragraph>
  </Section>
class="xml-element"></Paper>