<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0311">
  <Title>The Reliability of Anaphoric Annotation, Reconsidered: Taking Ambiguity into Account</Title>
  <Section position="6" start_page="79" end_page="82" type="evalu">
    <SectionTitle>
5 RESULTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="79" end_page="80" type="sub_section">
      <SectionTitle>
5.1 Agreement on category labels
</SectionTitle>
      <Paragraph position="0"> The following table reports, for each of the four categories, the number of cases (in the first half) in which a good number of annotators (18, 17, or 16) agreed on a particular label (phrase, segment, place, or none), or in which no annotator assigned that label to a markable.5 (The figures for the second half are similar.) 5 It would be preferable, of course, to get the annotators to mark such configurations in a uniform way; this, however, would require much more extensive training of the subjects, as well as support, currently unavailable in the annotation tool, for tracking chains of pointers.</Paragraph>
      <Paragraph position="1"> [Table: number of judgments per category at each agreement level: 18, 17, 16, or 0.] In other words, in 49 cases out of 72, at least 16 annotators agreed on a label.</Paragraph>
    </Section>
    <Section position="2" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
5.2 Explicitly annotated ambiguity, and its impact on agreement
</SectionTitle>
      <Paragraph position="0"> Next, we attempted to get an idea of the amount of explicit ambiguity, i.e., the cases in which coders marked multiple antecedents, and of the impact on reliability that results from allowing them to do this. In the first half, 15 markables out of 72 (20.8%) were marked as explicitly ambiguous by at least one annotator, for a total of 55 explicit ambiguity markings (45 phrase references, 10 segment references); in the second half, 8 out of 76 (10.5%), for 21 ambiguity judgments in total. The impact of these cases on agreement can be estimated by comparing the values of K and a on the antecedents only, before the construction of cospecification chains. Recall that the difference between the coefficients is that K gives no credit for partial agreement, while a gives it some. Thus if one subject marks markable A as antecedent of an expression, while a second subject marks markables A and B, K will register a disagreement while a will register partial agreement.</Paragraph>
      <Paragraph position="1"> Table 3 compares the values of K and a, computed separately for each half of the dialogue, first with all the markables, then excluding "place" markables (agreement on marking place names was almost perfect, contributing substantially to overall agreement). The value of a is somewhat higher than that of K across all conditions.</Paragraph>
    </Section>
    <Section position="3" start_page="80" end_page="81" type="sub_section">
      <SectionTitle>
5.3 Agreement on anaphora
</SectionTitle>
      <Paragraph position="0"> Finally, we come to the agreement values obtained by using a to compare anaphoric chains computed as discussed above. The coefficient reported here as K is the one called K by Siegel and Castellan (1988).</Paragraph>
      <Paragraph position="1"> The value of a is calculated using Passonneau's distance metric; for other distance metrics, see Table 4.</Paragraph>
      <Paragraph position="2"> Table 4 gives the value of a for the first half (the figures for the second half are similar). The calculation of a was manipulated under the following three conditions.</Paragraph>
      <Paragraph position="3"> Place markables. We calculated the value of a on the entire set of markables (with the exception of three which had data errors), and also on a subset of markables - those that were not place names. Agreement on marking place names was almost perfect: 45 of the 48 place name markables were marked correctly as "place" by all 18 subjects, two were marked correctly by all but one subject, and one was marked correctly by all but two subjects. Place names thus contributed substantially to the agreement among the subjects. Dropping these markables from the analysis resulted in a substantial drop in the value of a across all conditions.</Paragraph>
      <Paragraph position="4"> Distance measure. We used the three measures discussed earlier to calculate distance between sets: Passonneau, Jaccard, and Dice.6 Chain construction. Substantial variation in the agreement values can be obtained by making changes to the way we construct anaphoric chains. We tested the following methods.</Paragraph>
      <Paragraph position="5"> NO CHAIN: only the immediate antecedents of an anaphoric expression were considered, instead of building an anaphoric chain.</Paragraph>
      <Paragraph position="6"> PARTIAL CHAIN: a markable's chain included only phrase markables which occurred in the dialogue before the markable in question (as well as all discourse markables). 6 For the nominal categories "place" and "none" we assign a distance of zero between a category and itself, and of one between a nominal category and any other category.</Paragraph>
      <Paragraph position="7"> FULL CHAIN: chains were constructed by looking upward and then back down, including all phrase markables which occurred in the dialogue either before or after the markable in question (as well as the markable itself, and all discourse markables).</Paragraph>
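The three chain conditions can be modelled as set constructions over one coder's antecedent links. The sketch below is a simplified model under our own assumptions (the function names, the antecedents dictionary, and the position map are illustrative; discourse markables receive no special treatment here):

```python
# Simplified model of the three chain conditions. "antecedents" maps each
# markable to the set of antecedents one coder assigned it; "position"
# gives dialogue order. Names and data layout are illustrative only.

def no_chain(m, antecedents):
    """Only the immediate antecedents of m."""
    return set(antecedents.get(m, []))

def full_chain(m, antecedents):
    """Transitive closure upward and back down: every markable that
    shares a chain with m, plus m itself."""
    chain = {m}
    changed = True
    while changed:
        changed = False
        for x, ants in antecedents.items():
            linked = set(ants)
            linked.add(x)
            if chain.intersection(linked) and not linked.issubset(chain):
                chain.update(linked)
                changed = True
    return chain

def partial_chain(m, antecedents, position):
    """Full chain restricted to markables that precede m in the dialogue."""
    earlier = set()
    for x in full_chain(m, antecedents):
        if position[m] > position[x]:
            earlier.add(x)
    return earlier

antecedents = {"B": {"A"}, "C": {"B"}}   # B points back to A; C points to B
position = {"A": 0, "B": 1, "C": 2}

print(sorted(no_chain("B", antecedents)))                 # ['A']
print(sorted(full_chain("B", antecedents)))               # ['A', 'B', 'C']
print(sorted(partial_chain("B", antecedents, position)))  # ['A']
```

Note how the full chain for B picks up C, which refers back to B later in the dialogue, while the partial chain does not.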
      <Paragraph position="8"> We used two separate versions of the full chain condition: in the [+top] version we associate the top of a chain with the chain itself, whereas in the [-top] version we associate the top of a chain with its original category label, "place" or "none". Passonneau (2004) observed that in the calculation of observed agreement, two full chains always intersect because they include the current item. Passonneau suggests preventing this by excluding the current item from the chain for the purpose of calculating observed agreement. We performed the calculation both ways: the inclusive condition includes the current item, while the exclusive condition excludes it.</Paragraph>
      <Paragraph position="9"> The four ways of calculating a for full chains, plus the no chain and partial chain conditions, yield the six chain conditions in Table 4. Other things being equal, Dice yields higher agreement than Jaccard; considering both halves of the dialogue, the Passonneau measure always yielded higher agreement than Jaccard, while being higher than Dice in 10 of the 24 conditions and lower in the remaining 14.</Paragraph>
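The three set distances can be sketched as follows (our reading of the measures; the Passonneau values follow its usual identity / subset / overlap / disjoint tiers, and the function names are ours):

```python
# The three set distances compared in Table 4 (an illustrative sketch).
# Passonneau's measure is tiered; Jaccard and Dice scale with overlap.

def passonneau(a, b):
    if a == b:
        return 0.0               # identical sets
    if a.issubset(b) or b.issubset(a):
        return 1.0 / 3.0         # one set subsumes the other
    if a.intersection(b):
        return 2.0 / 3.0         # overlapping, neither subsumes
    return 1.0                   # disjoint

def jaccard(a, b):
    return 1.0 - len(a.intersection(b)) / len(a.union(b))

def dice(a, b):
    return 1.0 - 2.0 * len(a.intersection(b)) / (len(a) + len(b))

a, b = {"A"}, {"A", "B"}
print(passonneau(a, b))  # approx. 0.333: one set subsumes the other
print(jaccard(a, b))     # 0.5
print(dice(a, b))        # approx. 0.333
```

On this pair, Dice assigns a smaller distance than Jaccard, which is consistent with Dice yielding higher agreement overall.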
      <Paragraph position="10"> The exclusive chain conditions always give lower agreement values than the corresponding inclusive chain conditions, because excluding the current item reduces observed agreement without affecting expected agreement (there is no "current item" in the calculation of expected agreement).</Paragraph>
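A two-line illustration of the point (the chains shown are hypothetical): when the current item is kept, any two full chains for it share at least that item; removing it can leave them disjoint:

```python
# Two coders' full chains for the pronoun "it" (hypothetical example).
# Including the current item guarantees a nonempty intersection.
coder1_chain = {"it", "E2"}
coder2_chain = {"it", "boxcar"}

print(bool(coder1_chain.intersection(coder2_chain)))  # True: they share "it"

# Excluding the current item can leave the chains disjoint,
# which lowers observed agreement.
print(bool((coder1_chain - {"it"}).intersection(coder2_chain - {"it"})))  # False
```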
      <Paragraph position="11"> The [-top] conditions tended to result in higher agreement values than the corresponding [+top] conditions because the tops of the chains retained their "place" and "none" labels; not surprisingly, the effect was less pronounced when place markables were excluded from the analysis. Inclusive [-top] was the only full chain condition which gave a values comparable to the partial chain and no chain conditions. For each of the four selections of markables, the highest a value was given by the inclusive [-top] chain with the Dice measure.</Paragraph>
    </Section>
    <Section position="4" start_page="81" end_page="82" type="sub_section">
      <SectionTitle>
5.4 Qualitative Analysis
</SectionTitle>
      <Paragraph position="0"> The difference between the annotation of (identity!) anaphoric relations and other semantic annotation tasks such as dialogue-act or word-sense annotation is that, apart from the occasional example of carelessness, such as marking Elmira as antecedent for the boxcar at Elmira,7 all other cases of disagreement reflect a genuine ambiguity, as opposed to differences in the application of subjective categories.8 Lack of space prevents a full discussion of the data, but some of the main points can already be made with reference to the part of the dialogue in (2), repeated with additional context in (3).</Paragraph>
      <Paragraph position="1"> 7According to our (subjective) calculations, at least one annotator made one obvious mistake of this type for 20 items out of 72 in the first half of the dialogue, for a total of 35 careless or mistaken judgments out of 1296 total judgments, or 2.7%.</Paragraph>
      <Paragraph position="2"> 8Things are different for associative anaphora, see (Poesio and Vieira, 1998).</Paragraph>
      <Paragraph position="3"> (3) 1.4 M: first thing I'd like you to do
1.5 is send engine E2 off with a boxcar to Corning to pick up oranges
1.6 uh as soon as possible
2.1 S: okay [6 sec]
3.1 M: and while it's there it should pick up the tanker
The two it pronouns in utterance unit 3.1 are examples of the type of ambiguity already seen in (1). All of our subjects considered the first pronoun a 'phrase' reference. Nine coders marked the pronoun as ambiguous between engine E2 and the boxcar, six marked it as unambiguous and referring to engine E2, and three as unambiguous and referring to the boxcar. This example shows that when trying to develop methods to identify ambiguous cases it is important to consider not only cases of explicit ambiguity, but also so-called implicit ambiguity: cases in which subjects do not provide evidence of being consciously aware of the ambiguity, but in which its presence is revealed by two or more annotators being in disagreement (Poesio, 1996).</Paragraph>
    </Section>
  </Section>
</Paper>