<?xml version="1.0" standalone="yes"?>
<Paper uid="J02-4002">
  <Title>Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status</Title>
  <Section position="4" start_page="420" end_page="425" type="intro">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> or comparison of the own work to it? of other work, or a contrast Does it describe a negative aspect or support for the current paper? Does this sentence mention other work as basis of work by the authors (excluding previous Figure 6 Decision tree for rhetorical annotation.  Teufel and Moens Summarizing Scientific Articles We use the kappa coefficient K (Siegel and Castellan 1988) to measure stability and reproducibility, following Carletta (1996). The kappa coefficient is defined as follows:</Paragraph>
    <Paragraph position="2"> where P(A) is pairwise agreement and P(E) random agreement. K varies between 1 when agreement is perfect and [?]1 when there is a perfect negative correlation. K = 0 is defined as the level of agreement that would be reached by random annotation using the same distribution of categories as the actual annotators did.</Paragraph>
    <Paragraph position="3"> The main advantage of kappa as an annotation measure is that it factors out random agreement by numbers of categories and by their distribution. As kappa also abstracts over the number of annotators considered, it allows us to compare the agreement numerically among a group of human annotators with the agreement between the system and one or more annotators (section 5), which we use as one of the performance measures of the system.</Paragraph>
    <Paragraph position="4"> 3.2.2 Results. The annotation experiments show that humans distinguish the seven rhetorical categories with a stability of K = .82, .81, .76 (N = 1,220; k = 2, where K stands for the kappa coefficient, N for the number of items (sentences) annotated, and k for the number of annotators). This is equivalent to 93%, 92%, and 90% agreement.</Paragraph>
    <Paragraph position="5"> Reproducibility was measured at K = .71 (N = 4,261, k = 3), which is equivalent to 87% agreement. On Krippendorff's (1980) scale, agreement of K = .8 or above is considered as reliable, agreement of .67-.8 as marginally reliable, and less than .67 as unreliable. On Landis and Koch's (1977) more forgiving scale, agreement of .0-.2 is considered as showing &amp;quot;slight&amp;quot; correlation, .21-.4 as &amp;quot;fair,&amp;quot; .41-.6 as &amp;quot;moderate,&amp;quot; .610.8 as &amp;quot;substantial,&amp;quot; and .81 -1.0 as &amp;quot;almost perfect.&amp;quot; According to these guidelines, our results can be considered reliable, substantial annotation.</Paragraph>
    <Paragraph position="6">  items is then k * N, i.e., 12,783 in this case.) Table 2 shows a confusion matrix between two annotators. The numbers represent absolute sentence numbers, and the diagonal (boldface numbers) are the counts of sentences that were identically classified by both annotators. We used Krippendorff's diagnostics to determine which particular categories humans had most problems with: For each category, agreement is measured with a new data set in which all categories  Distribution of rhetorical categories (entire document).</Paragraph>
    <Paragraph position="7">  except for the category of interest are collapsed into one metacategory. Original agreement is compared to that measured on the new (artificial) data set; high values show that annotators can distinguish the given category well from all others. When their results are compared to the overall reproducibility of K = .71, the annotators were good at distinguishing AIM (Krippendorff's diagnostics; K = .79) and TEXTUAL (K = .79).</Paragraph>
    <Paragraph position="8"> The high agreement in AIM sentences is a positive result that seems to be at odds with previous sentence extraction experiments. We take this as an indication that some types of rhetorical classification are easier for human minds to do than unqualified relevance decision. We also think that the positive results are partly due to the existence of the guidelines.</Paragraph>
    <Paragraph position="9"> The annotators were less consistent at determining BASIS (K = .49) and CONTRAST (K = .59). The same picture emerges if we look at precision and recall of single categories between two annotators (cf. Table 3). Precision and recall for AIM and TEXTUAL are high at 72%/56% and 79%/79%, whereas they are lower for CONTRAST (50%/55%) and BASIS (82%/34%).</Paragraph>
    <Paragraph position="10"> This contrast in agreement might have to do with the location of the rhetorical zones in the paper: AIM and TEXTUAL zones are usually found in fixed locations (beginning or end of the introduction section) and are explicitly marked with metadiscourse, whereas CONTRAST sentences, and even more so BASIS sentences, are usually interspersed within longer OWN zones. As a result, these categories are more exposed to lapses of attention during annotation.</Paragraph>
    <Paragraph position="11"> With respect to the longer, more neutral zones (intellectual attribution), annotators often had problems in distinguishing OTHER work from OWN work, particularly in cases where the authors did not express a clear distinction between new work and previous own work (which, according to our instructions, should be annotated as OTHER). Another persistently problematic distinction for our annotators was that between OWN Table 3 Annotator C's precision and recall per category if annotator B is gold standard.</Paragraph>
    <Paragraph position="12">  Teufel and Moens Summarizing Scientific Articles and BACKGROUND. This could be a sign that some authors aimed their papers at an expert audience and thus thought it unnecessary to signal clearly which statements are commonly agreed upon in the field, as opposed to their own new claims. If a paper is written in such a way, it can indeed be understood only with a considerable amount of domain knowledge, which our annotators did not have.</Paragraph>
    <Paragraph position="13"> Because intellectual attribution (the distinction between OWN,OTHER, and BACKGROUND material) is an important part of our annotation scheme, we conducted a second experiment measuring how well our annotators could distinguish just these three roles, using the same annotators and 22 different articles. We wrote seven pages of new guidelines describing the semantics of the three categories. Results show higher stability compared to the full annotation scheme (K = .83, .79, .81; N = 1,248; k = 2) and higher reproducibility (K = .78, N = 4,031, k = 3), corresponding to 94%, 93%, and 93% agreement (stability) and 93% (reproducibility). It is most remarkable that agreement of annotation of intellectual attribution in the abstracts is almost perfect: K = .98 (N = 89, k = 3), corresponding to 99% agreement. This points to the fact that authors, when writing abstracts for their papers, take care to make it clear to whom a certain statement is attributed. This effect also holds for the annotation with the full scheme with all seven categories: again, reproducibility in the abstract is higher (K = .79) than in the entire document (K = .71), but the effect is much weaker.</Paragraph>
    <Paragraph position="14"> Abstracts might be easier to annotate than the rest of a paper, but this does not necessarily make it possible to define a gold standard solely by looking at the abstracts. As foreshadowed in section 2.5, abstracts do not contain all types of rhetorical information. AIM and OWN sentences make up 74% of the sentences in abstracts, and only 5% of all CONTRAST sentences and 3% of all BASIS sentences occur in the abstract. Abstracts in our corpus are also not structurally homogeneous. When we inspected the rhetorical structure of abstracts in terms of sequences of rhetorical zones, we found a high level of variation. Even though the sequence AIM-OWN is very common (contained in 73% of all abstracts), the 80 abstracts still contain 40 different rhetorical sequences, 28 of which are unique. This heterogeneity is in stark contrast to the systematic structures Liddy (1991) found to be produced by professional abstractors. Both observations, the lack of certain rhetorical types in the abstracts and their rhetorical heterogeneity, reassure us in our decision not to use human-written abstracts as a gold standard.</Paragraph>
    <Section position="1" start_page="423" end_page="425" type="sub_section">
      <SectionTitle>
3.3 Annotation of Relevance
</SectionTitle>
      <Paragraph position="0"> We collected two different kinds of relevance gold standards for the documents in our development corpus: abstract-similar document sentences and additional manually selected sentences.</Paragraph>
      <Paragraph position="1"> In order to establish alignment between summary and document sentences, we used a semiautomatic method that relies on a simple surface similarity measure (longest common subsequence of content words, i.e., excluding words on a stop list). As in Kupiec, Pedersen, and Chen's experiment, final alignment was decided by a human judge, and the criterion was semantic similarity of the two sentences. The following sentence pair illustrates a direct match: Summary: In understanding a reference, an agent determines his confidence in its adequacy as a means of identifying the referent.</Paragraph>
      <Paragraph position="2"> Document: An agent understands a reference once he is confident in the adequacy of its (inferred) plan as a means of identifying the referent.</Paragraph>
      <Paragraph position="3">  Computational Linguistics Volume 28, Number 4 Of the 346 abstract sentences contained in the 80 documents, 156 (45%) could be aligned this way. Because of this low agreement and because certain rhetorical types are not present in the abstracts, we decided not to rely on abstract alignment as our only gold standard. Instead, we used manually selected sentences as an alternative gold standard, which is more informative, but also more subjective.</Paragraph>
      <Paragraph position="4"> We wrote eight pages of guidelines that describe relevance criteria (e.g., our definition prescribes that neutral descriptions of other work be selected only if the other work is an essential part of the solution presented, whereas all statements of criticism are to be included). The first author annotated all documents in the development corpus for relevance using the rhetorical zones and abstract similarity as aides in the relevance decision, and also skim-reading the whole paper before making the decision. This resulted in 5 to 28 sentences per paper and a total of 1,183 sentences.</Paragraph>
      <Paragraph position="5"> Implicitly, rhetorical classification of the extracted sentences was already given as each of these sentences already had a rhetorical status assigned to it. However, the rhetorical scheme we used for this task is slightly different. We excluded TEXTUAL,as this category was designed for document uses other than summarization. If a selected sentence had the rhetorical class TEXTUAL, it was reclassified into one of the other six categories. Figure 8 shows the resulting category distribution among these 1,183 sentences, which is far more evenly distributed than the one covering all sentences (cf. Figure 7). CONTRAST and OWN are the two most frequent categories.</Paragraph>
      <Paragraph position="6"> We did not verify the relevance annotation with human experiments. We accept that the set of sentences chosen by the human annotator is only one possible gold standard. What is more important is that humans can agree on the rhetorical status of the relevant sentences. Liddy observed that agreement on rhetorical status was easier for professional abstractors than sentence selection: Although they did not necessarily agree on which individual sentences should go into an abstract, they did agree on the rhetorical information types that make up a good abstract.</Paragraph>
      <Paragraph position="7"> We asked our trained annotators to classify a set of 200 sentences, randomly sampled from the 1,183 sentences selected by the first author, into the six rhetorical categories. The sentences were presented in order of occurrence in the document, but without any context in terms of surrounding sentences. We measured stability at K = .9, .86, .83 (N = 100, k = 2) and reproducibility at K = .84 (N = 200, k = 3). These results are reassuring: They show that the rhetorical status for important sentences can be particularly well determined, better than rhetorical status for all sentences in the document (for which reproducibility was K = .71; cf. section 3.2.2).</Paragraph>
      <Paragraph position="8"> Figure 8 Distribution of rhetorical categories (relevant sentences).</Paragraph>
      <Paragraph position="9">  Teufel and Moens Summarizing Scientific Articles</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>