<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0311"> <Title>The Reliability of Anaphoric Annotation, Reconsidered: Taking Ambiguity into Account</Title>
<Section position="3" start_page="76" end_page="76" type="metho"> <SectionTitle> 2 ANNOTATING ANAPHORA </SectionTitle>
<Paragraph position="0"> It is not our goal at this stage to propose a new scheme for annotating anaphora. For this study we simply developed a coding manual for the purposes of our experiment, broadly based on the approach adopted in MATE (Poesio et al., 1999) and GNOME (Poesio, 2004), but introducing new types of annotation (ambiguous anaphora, and a simple form of discourse deixis) while simplifying other aspects (e.g., by not annotating bridging references).</Paragraph>
<Paragraph position="1"> The task of 'anaphoric annotation' discussed here is related to, although different from, the task of annotating 'coreference' in the sense of the so-called MUCSS scheme for the MUC-7 initiative (Hirschman, 1998). This scheme, while often criticized, is still widely used, and has been the basis of coreference annotation for the ACE initiative in the past two years. It suffers however from a number of problems (van Deemter and Kibble, 2000), chief among which is the fact that the one semantic relation expressed by the scheme, ident, conflates a number of relations that semanticists view as distinct: besides COREFERENCE proper, there are IDENTITY ANAPHORA, BOUND ANAPHORA, and even PREDICATION. (Space prevents a fuller discussion and exemplification of these relations here.) The goal of the MATE and GNOME schemes (as well as of other schemes developed by Passonneau (1997) and Byron (2003)) was to devise instructions appropriate for the creation of resources suitable for the theoretical study of anaphora from a linguistic / psychological perspective, and, from a computational perspective, for the evaluation of anaphora resolution and referring expression generation. The goal is to annotate the discourse model resulting from the interpretation of a text, in the sense both of Webber (1979) and of dynamic theories of anaphora (Kamp and Reyle, 1993). In order to do this, annotators must first of all identify the noun phrases which either introduce new discourse entities (discourse-new (Prince, 1992)) or are mentions of previously introduced ones (discourse-old), ignoring those that are used predicatively. Secondly, annotators have to specify which discourse entities have the same interpretation. Given that the characterization of such discourse models is usually considered part of the area of the semantics of anaphora, and that the relations to be annotated include relations other than Sidner's (1979) COSPECIFICATION, we will use the term ANNOTATION OF ANAPHORA for this task (Poesio, 2004), but the reader should keep in mind that we are not concerned only with nominal expressions which are lexically anaphoric.</Paragraph>
</Section>
<Section position="4" start_page="76" end_page="78" type="metho"> <SectionTitle> 3 MEASURING AGREEMENT ON ANAPHORIC ANNOTATION </SectionTitle>
<Paragraph position="0"> The agreement coefficient which is most widely used in NLP is the one called K by Siegel and Castellan (1988). However, most authors who attempted anaphora annotation pointed out that K is not appropriate for anaphoric annotation.
The only sensible choice of 'label' in the case of (identity) anaphora is anaphoric chains (Passonneau, 2004); but except when a text is very short, few annotators will catch all mentions of the same discourse entity; most forget to mark a few, which means that agreement as measured with K is always very low. Following Passonneau (2004), we used the coefficient $\alpha$ of Krippendorff (1980) for this purpose, which allows for partial agreement among anaphoric chains.3</Paragraph>
<Paragraph position="1"> 3 We also tried a few variants of $\alpha$, but these differed from $\alpha$ only in the third to fifth significant digit, well below any of the other variables that affected agreement. In the interest of space we only report here the results obtained with $\alpha$.</Paragraph>
<Section position="1" start_page="76" end_page="77" type="sub_section"> <SectionTitle> 3.1 Krippendorff's alpha </SectionTitle>
<Paragraph position="0"> The $\alpha$ coefficient measures agreement among a set of coders C who assign each of a set of items I to one of a set of distinct and mutually exclusive categories K; for anaphora annotation the coders are the annotators, the items are the markables in the text, and the categories are the emerging anaphoric chains. The coefficient measures the observed disagreement between the coders, $D_o$, and corrects for chance by removing the amount of disagreement expected by chance, $D_e$. The result is subtracted from 1 to yield a final value of agreement:</Paragraph>
<Paragraph position="1"> $\alpha = 1 - \frac{D_o}{D_e}$</Paragraph>
<Paragraph position="3"> As in the case of K, the higher the value of $\alpha$, the more agreement there is between the annotators.</Paragraph>
<Paragraph position="4"> $\alpha = 1$ means that agreement is complete, and $\alpha = 0$ means that agreement is at chance level.</Paragraph>
<Paragraph position="5"> What makes $\alpha$ particularly appropriate for anaphora annotation is that the categories are not required to be disjoint; instead, they must be ordered according to a DISTANCE METRIC: a function d from category pairs to real numbers that specifies the amount of dissimilarity between the categories. The distance between a category and itself is always zero, and the less similar two categories are, the larger the distance between them. Table 1 gives the formulas for calculating the observed and expected disagreement for $\alpha$. The amount of disagreement for each item $i \in I$ is the arithmetic mean of the distances between the pairs of judgments pertaining to it, and the observed disagreement is the mean of all the item disagreements. The expected disagreement is the mean of the distances between all the judgment pairs in the data, without regard to items.</Paragraph>
<Paragraph position="6"> Table 1: Observed and expected disagreement for $\alpha$:
$D_o = \frac{1}{i c (c-1)} \sum_{i \in I} \sum_{k \in K} \sum_{k' \in K} n_{ik} n_{ik'} d_{kk'}$
$D_e = \frac{1}{i c (i c - 1)} \sum_{k \in K} \sum_{k' \in K} n_k n_{k'} d_{kk'}$</Paragraph>
<Paragraph position="7"> c: number of coders; i: number of items; $n_{ik}$: number of times item i is classified in category k; $n_k$: number of times any item is classified in category k; $d_{kk'}$: distance between categories k and k'.</Paragraph>
</Section>
<Section position="2" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 3.2 Distance measures </SectionTitle>
<Paragraph position="0"> The distance metric is not part of the general definition of $\alpha$, because different metrics are appropriate for different types of categories. For anaphora annotation, the categories are the ANAPHORIC CHAINS: the sets of markables which are mentions of the same discourse entity.</Paragraph>
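<Paragraph> To make the computation concrete, the sketch below implements the $\alpha$ formulas above together with set-based chain distances of the kind discussed in the following paragraphs (Passonneau, Jaccard, Dice). This is an illustrative Python sketch under our own assumptions, not the scripts used in the study: chains are represented as frozensets of markable identifiers, the annotation data map each markable to one chain label per coder, and all names and example values are invented.

# Illustrative sketch of Krippendorff's alpha over anaphoric chains.
from itertools import permutations


def passonneau(a, b):
    # 0 for identity, 1 for disjointness; a subset counts as closer than
    # mere overlap. The intermediate values follow Passonneau (2004).
    if a == b:
        return 0.0
    if a.issubset(b) or b.issubset(a):
        return 1.0 / 3.0
    if a.intersection(b):
        return 2.0 / 3.0
    return 1.0


def jaccard(a, b):
    # 1 - |A intersection B| / |A union B|
    return 1.0 - len(a.intersection(b)) / len(a.union(b))


def dice(a, b):
    # 1 - 2 |A intersection B| / (|A| + |B|)
    return 1.0 - 2.0 * len(a.intersection(b)) / (len(a) + len(b))


def alpha(judgments, distance):
    """judgments: dict mapping each item to the list of labels assigned
    by the coders; distance: a metric d(k, k') with d(k, k) = 0."""
    # Observed disagreement: mean over items of the mean pairwise
    # distance between the judgments for that item.
    per_item = []
    for labels in judgments.values():
        pairs = list(permutations(labels, 2))        # c(c-1) ordered pairs
        per_item.append(sum(distance(k, kp) for k, kp in pairs) / len(pairs))
    d_o = sum(per_item) / len(per_item)

    # Expected disagreement: mean distance between all pairs of judgments
    # in the data, regardless of item.
    pooled = [k for labels in judgments.values() for k in labels]
    pairs = list(permutations(pooled, 2))            # ic(ic-1) ordered pairs
    d_e = sum(distance(k, kp) for k, kp in pairs) / len(pairs)

    return 1.0 - d_o / d_e


if __name__ == "__main__":
    # Toy data: two coders, three markables; the chain labels are invented.
    c1 = frozenset({"m1", "m3"})
    c2 = frozenset({"m1", "m2", "m3"})
    c3 = frozenset({"m2"})
    data = {"m1": [c1, c2], "m2": [c3, c2], "m3": [c1, c2]}
    for d in (passonneau, jaccard, dice):
        print(d.__name__, round(alpha(data, d), 3))
</Paragraph>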
<Paragraph position="1"> Passonneau (2004) proposes a distance metric between anaphoric chains based on the following rationale: two sets are minimally distant when they are identical and maximally distant when they are disjoint; between these extremes, sets that stand in a subset relation are closer (less distant) than ones that merely intersect. This leads to the following distance metric between two sets A and B: $d(A,B) = 0$ if $A = B$; $d(A,B) = 1/3$ if $A \subset B$ or $B \subset A$; $d(A,B) = 2/3$ if $A \cap B \neq \emptyset$ but neither is a subset of the other; $d(A,B) = 1$ if $A \cap B = \emptyset$.</Paragraph>
<Paragraph position="2"> We also tested distance metrics commonly used in Information Retrieval that take the size of the anaphoric chain into account, such as Jaccard and Dice (Manning and Schuetze, 1999), the rationale being that the larger the overlap between two anaphoric chains, the better the agreement. Jaccard and Dice's set comparison metrics were subtracted from 1 in order to get measures of distance that range between zero (minimal distance, identity) and one (maximal distance, disjointness):</Paragraph>
<Paragraph position="3"> $d_{Jaccard}(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$ and $d_{Dice}(A,B) = 1 - \frac{2\,|A \cap B|}{|A| + |B|}$</Paragraph>
<Paragraph position="4"> The Dice measure always gives a smaller distance than the Jaccard measure, hence Dice always yields a higher agreement coefficient than Jaccard when the other conditions remain constant. The difference between Dice and Jaccard grows with the size of the compared sets. Obviously, the Passonneau measure is not sensitive to the size of these sets.</Paragraph>
</Section>
<Section position="3" start_page="77" end_page="78" type="sub_section"> <SectionTitle> 3.3 Computing the anaphoric chains </SectionTitle>
<Paragraph position="0"> Another factor that affects the value of the agreement coefficient, in fact arguably the most important one, is the method used for constructing from the raw annotation data the 'labels' used for agreement computation, i.e., the anaphoric chains. We experimented with a number of methods. However, since the raw data are highly dependent on the annotation scheme, we will postpone discussing our chain construction methods until after we have described our experimental setup and annotation scheme. We will also discuss there how comparisons are made when an ambiguity is marked.</Paragraph>
</Section> </Section>
<Section position="5" start_page="78" end_page="79" type="metho"> <SectionTitle> 4 THE ANNOTATION STUDY </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="78" end_page="78" type="sub_section"> <SectionTitle> 4.1 The Experimental Setup </SectionTitle>
<Paragraph position="0"> Materials. The text annotated in the experiment was dialogue 3.2 from the TRAINS 91 corpus. Subjects were trained on dialogue 3.1.</Paragraph>
<Paragraph position="1"> Tools. The subjects performed their annotations on Viglen Genie workstations with LG Flatron monitors running Windows XP, using the MMAX 2 annotation tool (Müller and Strube, 2003).4 Subjects. Eighteen paid subjects participated in the experiment, all students at the University of Essex, mostly undergraduates from the Departments of Psychology and Language and Linguistics.</Paragraph>
<Paragraph position="2"> Procedure. The subjects performed the experiment together in one lab, each working on a separate computer. The experiment was run in two sessions, each consisting of two hour-long parts separated by a 30-minute break. The first part of the first session was devoted to training: subjects were given the annotation manual and taught how to use the software, and then annotated the training text together. After the break, the subjects annotated the first half of the dialogue (up to utterance 19.6). The second session took place five days later.
In the first part we briefly pointed out some problems from the first session (for instance, reminding the subjects to be careful during the annotation), after which the subjects immediately annotated the second half of the dialogue and wrote up a summary. The second part of the second session was used for a separate experiment with a different dialogue and a slightly different annotation scheme.</Paragraph>
</Section>
<Section position="2" start_page="78" end_page="78" type="sub_section"> <SectionTitle> 4.2 The Annotation Scheme </SectionTitle>
<Paragraph position="0"> MMAX 2 allows for multiple types of markables; markables at the phrase, utterance, and turn levels were defined before the experiment. All noun phrases except temporal ones were treated as phrase markables (Poesio, 2004). Subjects were instructed to go through the phrase markables in order (using MMAX 2's markable browser) and mark each of them with one of four attributes: &quot;phrase&quot; if it referred to an object which was mentioned earlier in the dialogue; &quot;segment&quot; if it referred to a plan, event, action, or fact discussed earlier in the dialogue; &quot;place&quot; if it was one of the five railway stations Avon, Bath, Corning, Dansville, and Elmira, explicitly mentioned by name; or &quot;none&quot; if it did not fit any of the above criteria, for instance if it referred to a novel object or was not a referential noun phrase. (We included the attribute &quot;place&quot; in order to avoid having our subjects mark pointers from explicit place names. These occur frequently in the dialogue (49 of the 151 markables) but are rather uninteresting as far as anaphora goes.) For markables designated as &quot;phrase&quot; or &quot;segment&quot; subjects were instructed to set a pointer to the antecedent, a markable at the phrase or turn level. Subjects were instructed to set more than one pointer in case of ambiguous reference. Markables which were not given an attribute or which were marked as &quot;phrase&quot; or &quot;segment&quot; but did not have an antecedent specified were considered to be data errors; data errors occurred in 3 out of the 151 markables in the dialogue, and these items were excluded from the analysis.</Paragraph>
<Paragraph position="1"> We chose to mark antecedents using MMAX 2's pointers, rather than its sets, because pointers allow us to annotate ambiguity: an ambiguous phrase can point to two antecedents without creating an association between them. In addition, MMAX 2 makes it possible to restrict pointers to a particular level.</Paragraph>
<Paragraph position="2"> In our scheme, markables marked as &quot;phrase&quot; could only point to phrase-level antecedents while markables marked as &quot;segment&quot; could only point to turn-level antecedents, thus simplifying the annotation.</Paragraph>
<Paragraph position="3"> As in previous studies (Eckert and Strube, 2001; Byron, 2003), we only allowed a constrained form of reference to discourse segments: our subjects could only indicate turn-level markables as antecedents. This resulted in rather coarse-grained markings, especially when a single turn was long and included discussion of a number of topics.</Paragraph>
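<Paragraph> As an illustration of the kind of raw data this scheme produces, the hypothetical records below show one coder's markings in a plain mapping from markable to attribute and antecedent pointers. The identifiers and values are invented for illustration and do not reflect MMAX 2's actual storage format.

# One coder's markings for a handful of invented markables (Python literal).
raw_annotation = {
    "tanker_3.1": {"attribute": "none",    "pointers": []},
    "that_4.2":   {"attribute": "segment", "pointers": ["turn_3"]},
    "that_4.3":   {"attribute": "phrase",  "pointers": ["that_4.2"]},
    "Elmira_5.1": {"attribute": "place",   "pointers": []},
    # two pointers from a single markable encode an ambiguous reference
    "it_6.2":     {"attribute": "phrase",  "pointers": ["tanker_3.1", "Elmira_5.1"]},
}
</Paragraph>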
<Paragraph position="4"> In a separate experiment we tested a more complicated annotation scheme which allowed a more fine-grained marking of reference to discourse segments.</Paragraph>
</Section>
<Section position="3" start_page="78" end_page="79" type="sub_section"> <SectionTitle> 4.3 Computing anaphoric chains </SectionTitle>
<Paragraph position="0"> The raw annotation data were processed using custom-written Perl scripts to generate coreference chains and calculate reliability statistics.</Paragraph>
<Paragraph position="1"> The core of Passonneau's proposal (Passonneau, 2004) is her method for generating the set of distinct and mutually exclusive categories required by $\alpha$ out of the raw data of anaphoric annotation. Considering as categories the immediate antecedents would mean a disagreement every time two annotators mark different members of an anaphoric chain as antecedents, while agreeing that these different antecedents are part of the same chain. Passonneau proposes a better solution: to view the emerging anaphoric chains themselves as the categories. In a scheme where anaphoric reference is unambiguous, these chains are equivalence classes of markables. But we have a problem: since our annotation scheme allows for multiple pointers, these chains take on various shapes and forms.</Paragraph>
<Paragraph position="2"> Our solution is to associate each markable m with the set of markables obtained by following the chain of pointers from m, and then following the pointers backwards from the resulting set. The rationale for this method is as follows. Two pointers to a single markable never signify ambiguity: if B points to A and C points to A then B and C are cospecificational; we thus have to follow the links up and then back down. However, two pointers from a single markable may signify ambiguity, so we should not follow an up-link from a markable that we arrived at via a down-link. The net result is that an unambiguous markable is associated with the set of all markables that are cospecificational with it on one of their readings; an ambiguous markable is associated with the set of all markables that are cospecificational with at least one of its readings. (See figure 1.) This method of chain construction also allows us to resolve apparent discrepancies between reference to phrase-level and turn-level markables. Take for example the snippet below: many annotators marked a pointer from the demonstrative that in utterance unit 4.2 to turn 3; as for that in utterance unit 4.3, some marked a pointer to the previous that, while others marked a pointer directly to turn 3.</Paragraph>
<Paragraph position="3"> (2)
3.1 M: and while it's there it should pick up the tanker
4.1 S: okay
4.2 and that can get
4.3 we can get that done by three</Paragraph>
<Paragraph position="4"> In this case, not only do the annotators mark different direct antecedents for the second that; they even use different attributes: &quot;phrase&quot; when pointing to a phrase antecedent and &quot;segment&quot; when pointing to a turn.</Paragraph>
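<Paragraph> A minimal sketch of the chain construction just described, using a simplified pointer map (each markable mapped to the list of antecedents it points to). The function and identifiers are ours, for illustration only, not the Perl scripts actually used in the study.

def chain(m, pointers):
    """Set of markables associated with markable m (Section 4.3): follow
    pointers up from m, then follow pointers backwards from the resulting
    set, without going up again from markables reached that way."""
    # Step 1: follow antecedent pointers transitively from m ("up").
    up = {m}
    frontier = [m]
    while frontier:
        x = frontier.pop()
        for antecedent in pointers.get(x, []):
            if antecedent not in up:
                up.add(antecedent)
                frontier.append(antecedent)
    # Step 2: add everything that points, directly or transitively, into
    # the set ("down"); up-links from these markables are not followed.
    result = set(up)
    frontier = list(up)
    while frontier:
        x = frontier.pop()
        for y, antecedents in pointers.items():
            if x in antecedents and y not in result:
                result.add(y)
                frontier.append(y)
    return result


if __name__ == "__main__":
    # Example (2): one coder points that (4.3) at that (4.2), another points
    # it directly at turn 3; both markings yield the same three-element set.
    coder_a = {"that_4.2": ["turn_3"], "that_4.3": ["that_4.2"]}
    coder_b = {"that_4.2": ["turn_3"], "that_4.3": ["turn_3"]}
    print(chain("that_4.3", coder_a) == chain("that_4.3", coder_b))  # True
    # A markable with two pointers simply pools both antecedents' chains,
    # i.e., the union treatment described in Section 4.4 below.
</Paragraph>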
<Paragraph position="5"> Our method of chain construction associates both of these markings with the same set of three markables (the two that phrases and turn 3), capturing the fact that the two markings are in agreement.5</Paragraph>
</Section>
<Section position="4" start_page="79" end_page="79" type="sub_section"> <SectionTitle> 4.4 Taking ambiguity into account </SectionTitle>
<Paragraph position="0"> The cleanest way to deal with ambiguity would be to consider each item for which more than one antecedent is marked as denoting a set of interpretations, i.e., a set of anaphoric chains (Poesio, 1996), and to develop methods for comparing such sets of sets of markables. However, while our instructions to the annotators were to use multiple pointers for ambiguity, they only followed these instructions for phrase references; when indicating the referents of discourse deixis, they often used multiple pointers to indicate that more than one turn had contributed to the development of a plan. So, for this experiment, we simply used as the interpretation of markables marked as ambiguous the union of the constituent interpretations. E.g., a markable E marked as pointing both to antecedent A, belonging to anaphoric chain {A,B}, and to antecedent C, belonging to anaphoric chain {C,D}, would be treated by our scripts as being interpreted as referring to anaphoric chain {A,B,C,D}.</Paragraph>
</Section> </Section> </Paper>