<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1015">
  <Title>An annotation scheme for discourse-level argumentation in research articles</Title>
  <Section position="5" start_page="111" end_page="111" type="metho">
    <SectionTitle>
FULL SCHEME
</SectionTitle>
    <Paragraph position="0"> 1997; Alexandersson et al., 1995; Jurafsky et al., 1997), but our task is more difficult since it requires more subjective interpretation.</Paragraph>
  </Section>
  <Section position="6" start_page="111" end_page="112" type="metho">
    <SectionTitle>
3 Annotation experiment
</SectionTitle>
    <Paragraph position="0"> Our annotation scheme is based on the intuition that its categories provide an adequate and intuitive description of scientific texts. But this intuition alone is not enough of a justification: we believe that our claims, like claims about any other descriptive account of textual interpretation, should be substantiated by demonstrating that other humans can apply this interpretation consistently to actual texts.</Paragraph>
    <Paragraph position="1"> We did three studies. Study I and II were designed to find out if the two versions of the annotation scheme (basic vs. full) can be learned by human coders with a significant amount of training. We are interested in two formal properties of the annotation scheme: stability and reproducibility (Krippendorff, 1980). Stability, the extent to which one annotator will produce the same classifications at different times, is important because an instable annotation scheme can never be reproducible. Reproducibility, the extent to which different annotators will produce the same classifications, is important because it measures the consistency of shared understandings (or meaning) held between annotators.</Paragraph>
    <Paragraph position="2"> We use the Kappa coefficient K (Siegel and Castellan, 1988) to measure stability and reproducibility among k annotators on N items: In our experiment, the items are sentences. Kappa is a better measurement of agreement than raw percentage agreement (Carletta, 1996) because it factors out the level of agreement which would be reached by random annotators using the same distribution of categories as the real coders. No matter how many items or annotators, or how the categories are distributed, K--0 when there is no agreement other than what would be expected by chance, and K=I when agreement is perfect. We expect high random agreement for our annotation scheme because so many sentences fall into the OWN category.</Paragraph>
    <Paragraph position="3"> Studies I and II will determine how far we can trust in the human-annotated training material for both learning and evaluation of the automatic method. The outcome of Study II (full annotation scheme) is crucial to the task, as some of the categories specific to the full annotation scheme (particularly AIM) add considerable value to the information contained in the training material.</Paragraph>
    <Paragraph position="4"> Study III tries to answer the question whether the considerable training effort used in Studies I and II can be reduced. If it were the case that coders with hardly any task-specific training can produce similar results to highly trained coders, the training material could be acquired in a more efficient way. A positive outcome of Study III would also strengthen claims about the intuitivity of the category definitions.</Paragraph>
    <Paragraph position="5">  Does this sentence refer to own work (excluding previous work of the same author)? Does this sentence contain material that describes the specific aim described in the paper? Does this sentence make reference to the structure of the paper?</Paragraph>
  </Section>
  <Section position="7" start_page="112" end_page="114" type="metho">
    <SectionTitle>
[Figure 2 residue: TEXTUAL]
</SectionTitle>
    <Paragraph position="0"> Does the sentence describe general background, including phenomena to be explained or linguistic example sentences? t\[ BACKGROUND 1 Does it describe a negative aspect J of the other work, or a contrast or comparison of the own work to it?  Our materials consist of 48 computational linguistics papers (22 for Study I, 26 for Study II), taken from the Computation and Language E-Print Archive (http://xxx. lanl. gov/cmp-lg/). We chose papers that had been presented at COL-ING, ANLP or ACL conferences (including student sessions), or ACL-sponsored workshops, and been put onto the archive between April 1994 and April 1995.</Paragraph>
    <Section position="1" start_page="112" end_page="114" type="sub_section">
      <SectionTitle>
3.1 Studies I and II
</SectionTitle>
      <Paragraph position="0"> For Studies I and II, we used three highly trained annotators. The annotators (two graduate students and the first author) can be considered skilled at extracting information from scientific papers but they were not experts in all of the sub-domains of the papers they annotated. The annotators went through a substantial amount of training, including the reading of coding instructions for the two versions of the scheme (6 pages for the basic scheme and 17 pages for the full scheme), four training papers and weekly discussions, in which previous annotations were discussed. However, annotators were not allowed to change any previous decisions. For the stability figures (intraannotator agreement), annotators re-coded 6 randomly chosen papers 6 weeks after the end of the annotation experiment. Skim-reading and annotation of an average length paper (3800 words) typically took the annotators 20-30 minutes.</Paragraph>
      <Paragraph position="1"> During the annotation phase, one of the papers turned out to be a review paper. This paper caused the annotators difficulty as the scheme was not intended to cover reviews. Thus, we discarded this paper from the analysis.</Paragraph>
      <Paragraph position="2"> The results show that the basic annotation scheme is stable (K=.83, .79, .81; N=1248; k=2 for all three annotators) and reproducible (K=.78, N=4031, k=3). This reconfirms that trained annotators are capable of making the basic distinction between own work, specific other work, and general background. The full annotation scheme is stable (K=.82, .81, .76; N--1220; k=2 for all three annotators) and reproducible (K=.71, N=4261, k=3). Because of the increased cognitive difficulty of the task, the decrease in stability and reproducibility in comparison to Study I is acceptable. Leaving the coding developer out of the coder pool for Study II did not change the results (K=.71, N=4261, k=2), suggesting that the training conveyed her intentions fairly well.</Paragraph>
      <Paragraph position="3"> We collected informal comments from our annotators about how natural the task felt, but did not conduct a formal evaluation of subjective perception of the difficulty of the task. As a general approach in our analysis, we wanted to look at the trends in the data as our main information source.</Paragraph>
      <Paragraph position="4"> Figure 3 reports how well the four non-basic categories could be distinguished from all other categories, measured by Krippendorff's diagnostics for category distinctions (i.e. collapsing all other distinctions). When compared to the overall reproducibility of .71, we notice that the annotators were good at distinguishing AIM and TEx- null research paper for the summarization task we are particularly interested in having them annotated consistently in our training material. The annotators were less good at determining BASIS and CONTRAST. This might have to do with the loca-tion of those types of sentences in the paper: AIM and TEXTUAL are usually found at the beginning or end of the introduction section, whereas CON-TRAST, and even more so BASIS, are usually interspersed within longer stretches of OWN. As a result, these categories are more exposed to lapses of attention during annotation.</Paragraph>
      <Paragraph position="5"> If we blur the less important distinctions between CONTRAST, OTHER, and BACKGROUND, the reproducibility of the scheme increases to K=.75. Structuring our training set in this way seems to be a good compromise for our task, because with high reliability, it would still give us the crucial distinctions contained in the basic annotation scheme, plus the highly important AIM sentences, plus the useful TEXTUAL and BASIS sentences.</Paragraph>
      <Paragraph position="6"> The variation in reproducibility across papers is large, both in Study I and Study II (cf. the quasibimodal distribution shown in Figure 4). Some hypotheses for why this might be so are the fol- null * One problem our annotators reported was a difficulty in distinguishing OTHEa work from OWN work, due to the fact that some authors did not express a clear distinction between previous own work (which, according to our instructions, had to be coded as OTHEa) and current, new work. This was particularly the case where authors had published several papers about different aspects of one piece of research. We found a correlation with self citation ratio (ratio of self citations to all citations in running text): papers with many self citations are more difficult to annotate than papers that have few or no self citations (cf.</Paragraph>
      <Paragraph position="7"> Figure 5).</Paragraph>
      <Paragraph position="8"> * Another persistent problematic distinction for our annotators was that between OWN and BACKGROUND. This could be a sign that some authors aimed their papers at an expert audience, and thus thought it unnecessary to signal clearly which statements are commonly agreed in the field, as opposed to their own new claims. If a paper is written in such a way, it can indeed only be understood with a considerable amount of domain knowledge, which our annotators did not have.</Paragraph>
      <Paragraph position="9"> * There is also a difference in reproducibility between papers from different conference types, as Figure 6 suggests. Out of our 25 papers, 4 were presented in student sessions, 4 came from workshops, the remaining 16 ones were main conference papers. Student session papers are easiest to annotate, which might be due to the fact that they are shorter and have a simpler structure, with less mentions of previous research. Main conference papers dedicate more space to describing and  criticising other people's work than student or workshop papers (on average about one fourth of the paper). They seem to be carefully prepared (and thus easy to annotate); conference authors must express themselves more clearly than workshop authors because they are reporting finished work to a wider audience.</Paragraph>
    </Section>
    <Section position="2" start_page="114" end_page="114" type="sub_section">
      <SectionTitle>
3.2 Study III
</SectionTitle>
      <Paragraph position="0"> For Study III, we used a different subject pool: 18 subjects with no prior annotation training. All of them had a graduate degree in Cognitive Science, with two exceptions: one was a graduate student in Sociology of Science, and one was a secretary. Subjects were given only minimal instructions (one A4 page) and the decision tree in Figure 2. Each annotator was randomly assigned to a group of six, all of whom independently annotated the same single paper. These three papers were randomly chosen from the set of papers for which our trained annotators had previously achieved good reproducibility in Study II (K=.65, N=205, k=3; K=.85, N=192, k=3; K=.87, N=144, k=3, respectively). Reproducibility varied considerably between groups (K=.35, N=205, k=6; K=.49, N=192, k=6; K=.72, N=144, k=6). Since Kappa is designed to abstract over the number of coders, the lower reliability of Study III as compared to Studies I and II is not an artefact of how K was calculated.</Paragraph>
      <Paragraph position="1"> Some subjects in Group 1 and 2 did not understand the instructions as intended - we must conclude that our very short instructions did not provide enough information for consistent annotation. This is not surprising, given that human indexers (whose task is very similar to the task introduced here) are highly skilled professionals.</Paragraph>
      <Paragraph position="2"> However, part of this result can be attributed to the papers: Group 3, which annotated the paper found to be most reproducible in Study II, performed almost as well as trained annotators; Group 1, which performed worst, also happened to have the paper with the lowest reproducibility. In Groups 1 and 2, the most similar three annotators reached a respectable reproducibility (K=.5, N=205, k=3; K=.63, N=192, k=3). That, together with the good performance of Group 3, seems to show that the instructions did at least convey some of the meaning of the categories.</Paragraph>
      <Paragraph position="3"> It is remarkable that the two subjects who had no training in computational linguistics performed reasonably well: they were not part of the circle of the three most similar subjects in their groups, but they were also not performing worse than the other two annotators.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="114" end_page="115" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> It is an interesting question how far shallow (human and automatic) information extraction methods, i.e. those using no domain knowledge, can be successful in a task such as ours. We believe that argumentative structure has so many reliable linguistic or non-linguistic correlates on the surface - physical layout being one of these correlates, others are linguistic indicators like &amp;quot;to our knowledge&amp;quot; and the relative order of the individual argumentative moves - that it should be possible to detect the line of argumentation of a text without much world knowledge. The two non-experts in the subject pool of Study III, who must have used some other information besides computational linguistics knowledge, performed satisfactorily - a fact that seems to confirm the promise of shallow methods.</Paragraph>
    <Paragraph position="1"> Overall, reproducibility and stability for trained annotators does not quite reach the levels found for, for instance, the best dialogue act coding schemes (around K=.80). Our annotation requires more subjective judgments and is possibly more cognitively complex. Our reproducibility and stability results are in the range which Krippendorff (1980) describes as giving marginally significant results for reasonable size data sets when correlating two coded variables which would show a clear correlation if there were prefectly agreement. That is, the coding contains enough signal to be found among the noise of disagreement.</Paragraph>
    <Paragraph position="2"> Of course, our requirements are rather less stringent than Krippendorff's because only one coded variable is involved, although coding is expensive enough that simply building larger data sets is not an attractive option. Overall, we find the level of agreement which we achieved acceptable. However, as with all coding schemes, its usefulness will only be clarified by the final appli- null Proceedings of EACL '99 cation.</Paragraph>
    <Paragraph position="3"> The single most surprising result of the experiments is the large variation in reproducibility between papers. Intuitively, the reason for this are qualitative differences in individual writing style - annotators reported that some papers are better structured and better written than others, and that some authors tend to write more clearly than others. It would be interesting to compare our reproducibility results to independent quality judgements of the papers, in order to determine if our experiments can indeed measure the clarity of scientific argumentation.</Paragraph>
    <Paragraph position="4"> Most of the problems we identified in our studies have to do with a lack of distinction between own and other people's work (or own previous work). Because our scheme discriminates based on these properties, as well as being useful for summarizing research papers, it might be used for automatically detecting whether a paper is a review, a position paper, an evaluation paper or a 'pure' research article by looking at the relative frequencies of automatically annotated categories.</Paragraph>
  </Section>
class="xml-element"></Paper>