<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0311">
  <Title>Discourse-level argumentation in scientific articles: human and automatic annotation</Title>
  <Section position="4" start_page="84" end_page="85" type="metho">
    <SectionTitle>
2 The annotation scheme
</SectionTitle>
    <Paragraph position="0"> The domain in which we work is that of scientifc research articles, in particular computational linguistics articles. We settled on this domain for a number of reasons. One reason is that it is a domain we are familiar with, which helps for intermediate evaluation of the annotation work. The other reason is that computational linguistics is also a rather heterogeneous domain: the papers in our collection cover a wide range of subject matters, such as logic programming, statistical language modelling, theoreticai semantics and computational psycholinguistics. This makes it a challenging test bed for our  ing out weaknesses in other research; sentences stating that the research task of the current paper has never been done before; direct comparisons BASIS Statements that the own work uses some other work as its basis or starting point, or gets support from this other work</Paragraph>
  </Section>
  <Section position="5" start_page="85" end_page="86" type="metho">
    <SectionTitle>
FULL SCHEME
</SectionTitle>
    <Paragraph position="0"> scheme which we hope to be applicable in a range of disciplines.</Paragraph>
    <Paragraph position="1"> Despite its heterogeneity, our collection of papers does exhibit predictable rhetorical patterns of scientific argumentation. To analyse these patterns we used STales' (1990) CARS (Creating a Research space) model as our starting point.</Paragraph>
    <Paragraph position="2"> The annotation scheme we designed is summarised in Figure 1. The seven categories describe argumentative roles with respect to the overall communicative act of the paper. They are to be read as mutually exclusive labels, one of which is attributed to each sentence in a text. There are two kinds of categories in this scheme: basic categories and non-basic categories. Basic categories are defined by attribution of intellectual ownership; they distinguish between: * statements which are presented as generally accepted (BACKGROUND); * statements which are attributed to other, specific pieces of research outside the given paper, including the authors' own previous work (OTHER); * statements which describe the authors' own new contributions (OWN).</Paragraph>
    <Paragraph position="3"> The four additional (non-basic) categories are more directly based on STales' theory. The most important of these is AIM, as this move on its own is already a good characterisation of the entire paper, and thus very useful for the generation of abstracts. The other categories are TEXTUAL, which provides information about section structure that might prove helpful for subsequent search steps. There are two moves having to do with the author's attitude towards previous research, namely BASIS and CONTRAST. We expect this kind of information to be useful for the creation of typed links for bibliometric search tools and for the automatic determination of rival approaches in the field and intellectual ancestry of methodologies (cf. Garfield's (1979) classification of the function of citation within researchers' papers).</Paragraph>
    <Paragraph position="4"> The structure in Figure 2, for example, displays a common rhetorical pattern of scientific argumentation which we found in many introductions. A BACKGROUND segment, in which the history and the importance of the task is discussed, is followed by a longer sequence of OTHER sentences, in which specific prior work is described in a neutral way. This discussion usually terminates in a criticism of the prior work, thus giving a motivation for the own work presented in the paper. The next sentence typically states the specific goal or contribution of the paper, often in a formulaic way (Myers, 1992).</Paragraph>
    <Paragraph position="5"> Such regularities, where the segments are contiguous, non-overlapping and non-hierarchical, can be  paper introduction expressed well with our category labels. Whereas non-basic categories are typically short segments of one or two sentences, the basic categories form much larger segments of sentences with the same rhetorical role.</Paragraph>
  </Section>
  <Section position="6" start_page="86" end_page="88" type="metho">
    <SectionTitle>
3 Human Annotation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="86" end_page="87" type="sub_section">
      <SectionTitle>
3.1 Annotating full texts
</SectionTitle>
      <Paragraph position="0"> To ensure that our coding scheme leads to less biased annotation than some of the other resources available for building summarisation systems, and to ensure that other researchers besides ourselves can use it to replicate our results on different types of texts, we wanted to examine two properties of our scheme: stability and reproducibility (Krippendorff, 1980). Stability is the extent to which an annotator will produce the same classifications at different times. Reproducibility is the extent to which different annotators will produce the same classification. We use the Kappa coefficient (Siegel and Castellan, 1988) to measure stability and reproducibility. The rationale for using Kappa is explained in (Carletta, 1996).</Paragraph>
      <Paragraph position="1"> The studies used to evaluate stability and reproducibility we describe in more detail in (Teufel et al., To Appear). In brief, 48 papers were annotated by three extensively trained annotators. The training period was four weeks consisting of 5 hours of annotation per week. There were written instructions (guidelines) of 17 pages. Skim-reading and  annotation of an average length (3800 word) paper typically took 20-30 minutes. The studies show that the training material is reliable. In particular, the basic annotation scheme is stable (K=.82, .81, .76; N=1220; k=2 for all three annotators) and reproducible (K=.71, N=4261, k=3), where k denotes the number of annotators, N the number of sentences annotated, and K gives the Kappa value.</Paragraph>
      <Paragraph position="2"> The full annotation scheme is stable (K=.83, .79, .81; N=1248; k-2 for all three annotators) and reproducible (K=.78, N=4031, k=3). Overall, reproducibility and stability for trained annotators does not quite reach the levels found for, for instance, the best dialogue act coding schemes, which typically reach Kappa values of around K=.80 (Carletta et al., 1997; Jurafsky et al., 1997). Our annotation requires more subjective judgements and is possibly more cognitively complex. Our reproducibility and stability results are in the range which Krippendorff (1980) describes as giving marginally significant results for reasonable size data sets when correlating two coded variables which would show a clear correlation if there were perfect agreement. As our requirements are less stringent than Krippendorff's, we find the level of agreement which we achieved acceptable.</Paragraph>
      <Paragraph position="3">  categories, shows that OWN is by far the most frequent category. Figure 4 reports how well the four non-basic categories could be distinguished from all other categories, measured by Krippendorff's diagnostics for category distinctions (i.e. collapsing all other distinctions). When compared to the over-all reproducibility of .71, we notice that the annotators were good at distinguishing AIM and TEX-TUAL, and less good at determining BASIS and CON-TRAST. This might have to do with the location of those types of sentences in the paper: AIM and TEXTUAL are usually found at the beginning or end of the introduction section, whereas CONTRAST, and even more so BASIS, are usually interspersed within longer stretches of OWN. As a result, these categories are more exposed to lapses of attention during annotation.</Paragraph>
      <Paragraph position="4"> The fact that the annotators are good at determining AIM sentences is an important result: as AIM sentences constitute the best characterisation of the research paper for the summarisation task at a very high compression to 1.8% of the original text length, we are particularly interested in having them annotated consistently in our training material. This result is clearly in contrast to studies which conclude that humans are not very reliable at this kind of task (Rath et al., 1961). We attribute this difference to a difference in our instructions. Whereas the subjects in Rath et al.'s experiment were asked to look for the most relevant sentences, our annotators had to look for specific argumentative roles which seems to have eased the task. In addition, our guidelines give very specific instructions for ambiguous cases.</Paragraph>
      <Paragraph position="5"> These reproducibility values are important because they can act as a good evaluation measure as it factors random agreement out, unlike percentage agreement. It also provides a realistic upper bound on performance: if the machine is treated as another coder, and if reproducibity does not decrease then the machine has reached the theoretically best result, considering the cognitive difficulty of the task.</Paragraph>
    </Section>
    <Section position="2" start_page="87" end_page="88" type="sub_section">
      <SectionTitle>
3.2 Annotating parts of texts
</SectionTitle>
      <Paragraph position="0"> Annotating texts with our scheme is timeconsuming, so we wanted to determine if there was a more efficient way of obtaining hand-coded training material, namely by annotating only parts of the source texts. For example, the abstract, introductions and conclusions of source texts are often like &amp;quot;condensed&amp;quot; versions of the contents of the entire paper and might be good areas to restrict annotation to. Alternatively, it might be a good idea to restrict annotation to the first 20% or the last 10% of any given text. Yet another possibility for restricting the range of sentences to be annotated is based on the 'alignment' idea introduced in (Kupiec et al., 1995): a simple surface measure determines sentences in the document that are maximally similar to sentences in the abstract.</Paragraph>
      <Paragraph position="1"> Obviously, any of these strategies of area restriction would give us fewer gold standard sentences per paper, so we would have to make sure that we still had enough candidate sentences for all seven categories. On the other hand, because these areas could well be the most clearly written and informationally rich sections, it might be the case that the quality of the resulting gold standard is higher. In this case we would expect the reliability of the coding in these areas to be higher in comparison to the reliability achieved overall, which in turn would result in higher accuracy when this task is done automatically. null  We did extensive experiments on this. Figure 5 shows reliability values for each of the annotated portions of text, and Figure 6 shows the composi- null tion in terms of our labels for each of the annotated portions of text. The implications for corpus preparation for abstract generation experiments can be summarised as follows. If one wants to avoid manually annotating entire papers but still make all argumentative distinctions, one can restrict the annotation to sentences appearing in the introduction section, even though annotators will find them slightly harder to classify (K=.69), or to all alignable abstract sentences, even if there are not many alignable abstract sentences detectable overall (around 50% of the sentences in the abstract), or to conclusion sentences, even if the coverage of argumentative categories is very restricted in the conclusions (mostly AIM and OWN sentences).</Paragraph>
      <Paragraph position="2"> We also examined a fall-back option of just annotating the first 10% or last 5% of a paper (as not all papers in our collection have an explicitly marked introduction and conclusion section), but the reliability results of this were far less good (K=.66 and K=.63, respectively).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="88" end_page="90" type="metho">
    <SectionTitle>
4 Automatic annotation
</SectionTitle>
    <Paragraph position="0"> All the annotation work is obviously in aid of development work, in particular for the training of a system. We will provide a brief description of training results so as to show the practical viability of the proposed corpus preparation method.</Paragraph>
    <Section position="1" start_page="88" end_page="88" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> Our training material is a collection of 80 conference papers and their summaries, taken from the Computation and Language E-Print Archive (http://xxx. lanl. gov/cmp-lg/). The training material contains 330,000 word tokens.</Paragraph>
      <Paragraph position="1"> The data is automatically preprocessed into xml format, and the following structural information is marked up: title, summary, headings, paragraph structure and sentences, citations in running text, and reference list at the end of the paper. If one of the paper's authors also appears on the author list of a cited paper, then that citation is marked as self citation. Tables, equations, figures, captions, cross references are removed and replaced by place holders. Sentence boundaries are automatically detected, and the text is POS-tagged according to the UPenn tagset.</Paragraph>
      <Paragraph position="2"> Annotation of rhetorical roles for all 80 papers (around 12,000 sentences) was provided by one of our human judges during the annotation study mentioned above.</Paragraph>
    </Section>
    <Section position="2" start_page="88" end_page="88" type="sub_section">
      <SectionTitle>
4.2 The method
</SectionTitle>
      <Paragraph position="0"> (Kupiec et al., 1995) use supervised learning to automatically adjust feature weights. Each document sentence receives scores for each of the features, resuiting in an estimate for the sentence's probability to also occur in the summary. This probability is calculated for each feature value as a combination of the probability of the feature-value pair occurring in a sentence which is in the summary (successful case) and the probability that the feature-value pair occurs unconditionally.</Paragraph>
      <Paragraph position="1"> We extend Kupiec et al.'s estimation of the probability that a sentence is contained in the abstract, to the probability that it has rhetorical role R (cf. Figure 7).</Paragraph>
      <Paragraph position="3"> ., Fk): Probability that sentence s in the source text has rhetorical role R, given its feature values; relative frequency of role R (constant); null probability of feature-value pair occurring in a sentence which is in rhetorical class R; probability that the feature-value pair occurs unconditionally; number of feature-value pairs; j-th feature-value pair.</Paragraph>
      <Paragraph position="4">  Evaluation of the method relies on crossvalidation: the model is trained on a training set of documents, leaving one document out at a time (the test document). The model is then used to assign each sentence a probability for each category R, and the category with the highest probability is chosen as answer for the sentence.</Paragraph>
    </Section>
    <Section position="3" start_page="88" end_page="89" type="sub_section">
      <SectionTitle>
4.3 Features
</SectionTitle>
      <Paragraph position="0"> The features we use in training (see Figure 8) are different from Kupiecet al.'s because we do not estimate overall importance in one step, but instead guess argumentative status first and determine importance later.</Paragraph>
      <Paragraph position="1"> Many of our features can be read off directly from the way the corpus is encoded: our preprocessors determine sentence-boundaries and parse the reference list at the end. This gives us a good handle on structural and locational features, as well as on features related to citations.</Paragraph>
      <Paragraph position="2">  initial, medial, final first, second or last third</Paragraph>
      <Paragraph position="4"> The syntactic features rely on determining the first finite verb in the sentence, which is done symbolically using POS-information. Heuristics are used to determine the tense and possible negation.</Paragraph>
      <Paragraph position="5"> The semantic features rely on template matching.</Paragraph>
      <Paragraph position="6"> In the feature Sem-1, a hand-crafted lexicon is used to classify the verb into one of 20 Action Classes (cf. Figure 9, left half), if it is one of the 388 verbs contained in the lexicon. The feature Sem-2 encodes whether the agent of the action is most likely to refer to the authors, or to other agents, e.g. other researchers (177 templates). Heuristic rules determine that the agent is the subject in an active sentence, or the head of the by-phrase (if present) in a passive sentence. Sere-3 encodes various other formulaic expressions (indicator phrases (Paice, 1981), meta-comments (Zukerman, 1991)) in order to exploit explicit rhetoric phrases the authors might have used, cf. Figure 9, right half (414 templates).</Paragraph>
      <Paragraph position="7"> The content features use the tf/idf method and title and header information for finding contentful words or phrases. In contrast to all other features they do not attempt to model the form or meta-discourse contained in the sentences but instead model their domain (object-level) contents.</Paragraph>
    </Section>
    <Section position="4" start_page="89" end_page="90" type="sub_section">
      <SectionTitle>
4.4 Results
</SectionTitle>
      <Paragraph position="0"> When the Naive Bayesian Model is added to the pool of coders, the reproducibility drops from K=.71 to K=.55. This reproducibility value is equivalent to the value achieved by 6 human annotators with no prior training, as found in an earlier experiment (Teufel et al., To Appear). Compared to one of the annotators, Kappa is K=.37, which corresponds to percentage accuracy of 71.2%. This number cannot be directly compared to experiments like Kupiec et al.'s because in their experiment a compression of around 3% was achieved whereas we classify each sentence into one of the categories.</Paragraph>
      <Paragraph position="1"> Further analysis of our results shows the system performs well on the frequent category OWN, cf. the confusion matrix in Fig. reftab:confusion. Indeed, as Figure 3 shows, OWN is so frequent that choosing OWN all the time gives us a seemingly hardto-beat baseline with a high percentage agreement of 69% (Baseline 1). However, the Kappa statistic, which controls for expected random agreement, reveals just how bad that baseline really is: Kappa is K=-.12 (machine vs. one annotator). Random choice of categories according to the distribution of categories (Baseline 2) is a better baseline; Kappa</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="90" end_page="91" type="metho">
    <SectionTitle>
POSSESSION
</SectionTitle>
    <Paragraph position="0"> we hop._.._~e to improve these results we argue against an application of we know of no other attempts...</Paragraph>
    <Paragraph position="1"> our system outperforms that of ...</Paragraph>
    <Paragraph position="2"> we extend &lt; CITE/&gt; 's algorithm we tested_ our system against...</Paragraph>
    <Paragraph position="3"> we follow X in postulating that our approach differs from X's ...</Paragraph>
    <Paragraph position="4"> we inten..d to improve our results... we are concerned with ...</Paragraph>
    <Paragraph position="5"> this approach, however, lacks...</Paragraph>
    <Paragraph position="6"> we present here a method for...</Paragraph>
    <Paragraph position="7"> thi~-~ses the problem of how to...</Paragraph>
    <Paragraph position="8"> we collected our data from...</Paragraph>
    <Paragraph position="9"> our approach resembles that of X...</Paragraph>
    <Paragraph position="10"> we solve this problem by...</Paragraph>
    <Paragraph position="11"> the paper is organized as follows... we employ X's method...</Paragraph>
    <Paragraph position="12"> our goal i...ss to...</Paragraph>
    <Paragraph position="13"> our approach has three advantages... null  according to &lt; REF'~ to our knowledge main contribution of this in section &lt; CREF/&gt; in this paper following the argument in bears similarity to when compared to our however a novel method for XX-ing elsewhere, we have avenue for improvement  for this baseline is K=0.</Paragraph>
    <Paragraph position="14"> AIM categories can be determined with a precision of 48% and a recall of 56% (cf. Figure 11). These values are more directly comparable to Kupiec et al.'s results of 44% co-selection of extracted sentences with alignable summary sentences. We assume that most of the sentences extracted by their method would have fallen into the AIM category. The other easily determinable category for the automatic method is TEXTUAL (p----55%; r=52%), whereas the results for the other non-basic categories are relatively lower - mirroring the results for humans. null As far as the individual features are concerned, we found the strongest heuristics to be location, type of header, citations, and the semantic classes (indicator phrases, agents and actions); syntactic and content-based heuristics are the weakest. The first column in Figure 12 gives the predictiveness of the feature  on its own, in terms of kappa between machine and one annotator. Some of the weaker features are not predictive enough on their own to break the dominance of the prior; in that case, they behave just like Baseline 1 (K=-.12).</Paragraph>
    <Paragraph position="15"> The second column gives kappa for experiments using all features except the given feature, i.e. the results if this feature is left out of the pool of fea- null heuristics tures. These numbers show that some of the weaker features contribute some predictive power in combination with others.</Paragraph>
    <Paragraph position="16"> While not entirely satisfactory, these results might be taken as an indication that we have indeed managed to identify the right kinds of features for argumentative sentence classification. Taking the context into account should further increase results, as preliminary experiments with n-gram modelling have shown. In these experiments, we replaced the prior P(s E R) in Figure 7 with a n-gram based probability of that role occurring in the given context. null</Paragraph>
  </Section>
class="xml-element"></Paper>