<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0311">
  <Title>Discourse-level argumentation in scientific articles: human and automatic annotation</Title>
  <Section position="3" start_page="0" end_page="84" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Work on summarisation has suffered from a lack of appropriately annotated corpora that can be used for building, training and evaluating summarisation systems. Typically, corpus work in this area has taken as its starting point texts target summaries: abstracts written by the researchers, supplied by the original authors or provided by professional abstractors. Training a summarisation system then involves learning the properties of sentences in those abstracts and using this knowledge to extract similax abstract-worthy sentences from unseen texts. In this scenario, system performance or development progress can be evaluated by taking texts in a test sample and comparing the sentences extracted from these texts with the sentences in the target abstract.</Paragraph>
    <Paragraph position="1"> But this approach has a number of shortcomings.</Paragraph>
    <Paragraph position="2"> First, sentence extraction on its own is a very general methodology, which can produce extracts that are incoherent or under-informative especially when used for high-compression summarisation (i.e. reducing a document to a small percentage of its original size). It is difficult to overcome this problem, because once sentences have been extracted from the source text, the context that is needed for their interpretation is not available anymore and cannot be used to produce more coherent abstracts (Spgrck Jones, 1998).</Paragraph>
    <Paragraph position="3"> Our proposed solution to this problem is to extract sentences but also to classify them into one of a small number of possible argumentative roles, reflecting whether the sentence expresses a main goal of the source text, a shortcoming in someone else's work, etc. The summarisation system can then use this information to generate template-like abstracts: Main goal of the text:... ; Builds on work by:... ; Contrasts with:... ; etc.</Paragraph>
    <Paragraph position="4"> Second, the question of what constitutes a useful gold standard has not yet been solved satisfactorily. Researchers developing corpus resources for summarisation work have often defined their own gold standard, relying on their own intuitions (see, e.g. Luhn, 1958; Edmundson, 1969) or have used abstracts supplied by authors or by professional abstractors as their gold standard (e.g. Kupiec et al., 1995; Mani and Bloedorn, 1998). Neither approach is very satisfactory. Relying only on your own intuitions inevitably creates a biased resource; indeed, Rath et al. (1961) report low agreement between human judges carrying out this kind of task. On the other hand, using abstracts as targets is not necessarily a good gold standard for comparison of the systems' results, although abstracts are the only kind of gold standard that comes for free with the papers. Even if the abstracts are written by professional abstractors, there are considerable differences in length, structure, and information content. This is due to differences in the common abstract presentation style in different disciplines and to the projected use of the abstracts (cf. Liddy, 1991). In the case of our corpus, an additional problem was the fact that the abstracts are written by the authors themselves and thus susceptible to differences  in individual writing style.</Paragraph>
    <Paragraph position="5"> For the task of summarisation and relevance decision between similar papers, however, it is essential that the information contained in the gold standard is comparable between papers. In our approach, the vehicle for comparability of information is similarity in argumentative roles of the associated sentences.</Paragraph>
    <Paragraph position="6"> We argue that it is more difficult to find the kind of information that preserves similarity of argumentative roles, and that it is not guaranteed that it will occur in the abstract. : ....</Paragraph>
    <Paragraph position="7"> A related problem concerns fair evaluation Of the extraction methodology. The evaluation of extracted material necessarily consists of a comparison of sentences, whereas one would really want to compare the informational content of the extracted sentences and the target abstract. Thus it will often be the case that a system extracts a sentence which in that form does not appear in the supplied abstract (resulting in a low performance score) but which is nevertheless an abstract-worthy sentence. The mis-match often arises simply because a similar idea is expressed in the supplied abstract in a very different form. But comparison of content is difficult to perform: it would require sentences to be mapped into some underlying meaning representations and then comparing these to the representations of the sentences in the gold standard. As this is technically not feasible, system performance is typically performed against a fixed gold standard (e.g. the aforementioned abstracts), which is ultimately undesirable. null Our proposed solution to this problem is to build a corpus which details not only what the abstract-worthy sentences are but also what their argumentative role is. This corpus can then be used as a resource to build a system to similarly classify sentences in unseen texts, and to evaluate that system. This paper reports on the development of a set of such argumentative roles that we have been using in our work.</Paragraph>
    <Paragraph position="8"> In particular, we employ human intuition to annotate argumentatively defined information. We ask our annotators to classify every sentence in the source text in terms of its argumentative role (e.g.</Paragraph>
    <Paragraph position="9"> that it expresses the main goal of the source text, or identifies open problems in earlier work, etc). Under this scenario, system evaluation is no longer a comparison of extracted sentences against a supplied abstract, or against a single sentence that was chosen as expressing (e.g.) the main goal of the source text.</Paragraph>
    <Paragraph position="10"> Instead, every sentence in the source text which expresses the main goal will have been identified, and the system's performance is evaluated against that classification.</Paragraph>
    <Paragraph position="11"> Of course, having someone annotate text in this way may still lead to a biased or careless annotation.</Paragraph>
    <Paragraph position="12"> We therefore needed an annotation scheme which is simple enough to be usable in a stable and intuitive way for several annotators. This paper also reports on how we tested the stability of the annotation scheme we developed. A second design criterion for our annotation scheme was that we wanted the roles to be annotated automatically. This paper reports on preliminary results which show that the annotation process can indeed be automated.</Paragraph>
    <Paragraph position="13"> To summarise, we have argued that discourse structure information will improve summarisation.</Paragraph>
    <Paragraph position="14"> Other researchers (Ono et al., 1994; Marcu, 1997) have argued similarly, although most previous work on discourse-based summarisation follows a different discourse model, namely Rhetorical Structure Theory (Mann and Thompson, 1987). In contrast to RST, we stress the importance of rhetorical moves which are global to the argumentation of the paper, as opposed to more local RST-type relations. Our categories are not hierarchical, and they are much less fine-grained than RST-relations. As mentioned above, we wanted them to a) provide context information for flexible summarisation, b) provide a higher degree of comparability between papers, and c) provide a fairer evaluation of superficially different sentences.</Paragraph>
    <Paragraph position="15"> In the rest of this paper, we will first describe how we chose the categories (section 2). Second, we had to construct training and evaluation material such that we could be sure that the proposed categorisation yielded a reliable resource of annotated text to train a system against, a gold standard. The human annotation experiments are reported in section 3.</Paragraph>
    <Paragraph position="16"> Finally, in section 4, we describe some of the automated annotation work which we have started recently and which uses a corpus annotated according to our scheme as its training material.</Paragraph>
  </Section>
class="xml-element"></Paper>