<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0309"> <Title>The MATE meta-scheme for coreference in dialogues in multiple languages</Title> <Section position="2" start_page="0" end_page="65" type="metho"> <SectionTitle> 2 Annotating for 'Coreference' </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="65" type="sub_section"> <SectionTitle> 2.1 Problems to be addressed </SectionTitle> <Paragraph position="0"> The difficulties to be addressed in the case of the 'coreference' level are of a different nature from those that arise in the case of the other annotation task at the semantic level considered in MATE, the dialogue acts level.</Paragraph> <Paragraph position="1"> A very basic problem arising in the case of coreference is deciding what type of information is being annotated, since the term 'coreference' is used to indicate different things. The name 'coreference' derives from the task of 'coreference resolution', one of the semantic interpretation tasks adopted in the Message Understanding Conference (MUC), a US initiative to evaluate systems performing information extraction. The coreference annotation scheme used in MUC-7, MUCCS (Hirschman, 1997), was devised to evaluate the ability of the systems participating in the competition to identify which elements in the text referred to the same object; hence the term 'coreference'. The scheme adopted for annotating references to landmarks in the MapTask corpus is also meant to annotate reference in this sense. Both the DRAMA scheme and the schemes proposed by Lancaster University (Fligelstone, 1992), instead, are meant to be used to annotate anaphoric information in texts; but coreference is not the same as anaphoricity. 
Two NPs can corefer without either of them being 'anaphoric' in the traditional sense: e.g., proper names are not generally considered 'anaphoric expressions', yet two proper names can obviously corefer, as in (1a); and conversely, two NPs can be in an anaphoric relation without either of them 'referring' to anything, as in (1b), where one of the engines at Elmira doesn't really refer to any specific engine yet serves as antecedent of that. (See (van Deemter and Kibble, 1999) for further discussion.) Fortunately for our purposes, coreference information can be expressed using the relations used to express anaphoric information, which makes it possible to develop schemes in which both types of information can be encoded, as we will see below. We will keep referring to the markup level with which we are concerned as 'coreference' for consistency with common use, but the reader should use the term with care.</Paragraph> <Paragraph position="2"> (1) a. ... home fans at the Stade de France endured an agonising final 20 minutes after Laurent Blanc was shown the red card following a tussle with Slaven Bilic. Blanc ...</Paragraph> <Paragraph position="3"> b. 15.1 M: +we're+ gonna hook up ONE</Paragraph> </Section> </Section> <Section position="3" start_page="65" end_page="66" type="metho"> <SectionTitle> OF THE ENGINES AT ELMIRA </SectionTitle> <Paragraph position="0"> to the boxcar at Elmira 15.6 : shove THAT off to Corning One advantage of the coreference level is that notwithstanding the possible source of confusion just mentioned, there is much more agreement on the underlying catalogue of semantic notions needed to characterize this type of information than there is, say, for the discourse act level, so that it's possible to come up with fairly precise definitions of most of the information one would want to annotate. 
So, the main problem the designer of such a scheme has to confront is the sheer pervasiveness of the phenomenon: almost every word in a coherent text, including quantifiers, nouns, (modal) verbs, and adjectives, can be said to be anaphorically related in some way to what has already been introduced in the text, as shown by the following examples. This means that, in practice, it will always be necessary to restrict somehow the amount of anaphoric information to annotate. (See also the discussion in (Hirschman, 1997).) (2) a. A group of students entered a pub.</Paragraph> <Paragraph position="1"> Three boys ordered beer, ...</Paragraph> <Paragraph position="2"> b. ... It is in such places that we find some of the most beautiful and adventurous modern architecture, and some of the most intriguing attempts to deepen the experience of art. Newhouse has probably seen more of the recently built art museums ... than anyone else.</Paragraph> <Paragraph position="3"> A second problem is that we do not know yet what can be annotated reliably and what cannot. In their reliability study, (Poesio and Vieira, 1998) found a fair agreement among annotators (K = .76) concerning which NPs were anaphoric and which ones were not, and about 95% agreement on antecedents for those definite descriptions that all subjects identified as anaphoric; but no agreement on identifying bridging references (K = .24), and often different antecedents were indicated for those that all annotators classified as bridges. 3 As a result, the only coding scheme whose reliability has been extensively tested is MUCCS.</Paragraph> <Section position="1" start_page="65" end_page="66" type="sub_section"> <SectionTitle> 2.2 Existing schemes </SectionTitle> <Paragraph position="0"> The MUCCS scheme (Hirschman, 1997) is the most widely used of the existing coreference schemes, and also the most modest in scope: it concentrates on identity relations between NPs. 
The main problem with MUCCS from our point of view is that it was designed for texts, so it does not provide instructions either for dealing with typical problems in dialogue such as disfluencies, or for annotating references to the visual situation, common e.g., in the MapTask corpus and in multimodal applications, which we hypothesize can be reliably annotated (although this hypothesis will have to be verified). Also, it's only designed for English, and therefore does not include instructions for anaphoric expressions common in other European languages and whose relation with other discourse entities could be annotated reliably, such as clitics.</Paragraph> <Paragraph position="1"> The DRAMA scheme (Passonneau, 1997) does include instructions for dealing with some difficult problems of markable identification in dialogues, but not for multilingual annotation. DRAMA includes instructions for annotating bridging references (whose reliability, however, still has to be ascertained), but not for references to the visual situation. The scheme proposed by (Bruneseaux and Romary, 1998) provides markup elements (based on the TEI scheme) to annotate both references to the visual situation and discourse deixis, in addition to bridging references; the reliability of this type of annotation wasn't evaluated. The Lancaster scheme (Fligelstone, 1992) was also designed for texts, and in certain ways is more ambitious than any of the schemes discussed 3 Informal studies conducted in MUC and by the DRI confirm this (Discourse Resource Initiative, 1997). The participants in these initiatives found reasonable agreement on the antecedents of anaphoric expressions, but poor recall for bridges. These results led to the elimination of relations other than IDENT from the MUC coding scheme for coreference, MUCCS. Those studies also suggest that one way to improve reliability on bridges may be to use better tools suggesting relations. 
Also, using a smaller number of relations may help, but the actual number of anaphoric elements annotated was not indicated so it's not clear whether this result is actually</Paragraph> <Paragraph position="3"> here in that it also contains instructions for annotating elliptical references. We are not aware of any results about the reliability of the scheme.</Paragraph> </Section> </Section> <Section position="4" start_page="66" end_page="67" type="metho"> <SectionTitle> 3 The MATE Proposal </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 3.1 Approach: A Meta Scheme and Two Instantiations </SectionTitle> <Paragraph position="0"> It should be clear from the considerations above that we do not believe that there is such a thing as a universally useful standard for 'coreference' annotation in dialogues. At the same time, because the semantics of anaphora and coreference is relatively well-understood, it is possible to extract from the schemes discussed above a fairly short list of options available to the designer of a scheme. (This is unlike the case of dialogue acts, where different schemes are very difficult to compare.) These considerations suggested a 'meta-scheme' approach to the goal of developing a scheme for the coreference level that could be useful for different types of applications.</Paragraph> <Paragraph position="1"> What this means is that instead of proposing a single scheme, we identified a range of possible types of information about 'coreference' one may want to annotate, on the basis of the coding schemes for coreference discussed above; we evaluated how reliable each type of annotation is likely to be; and we specified the markup language needed to pursue each option. 4 The meta scheme consists of a CORE SCHEME and three extensions. 
The core scheme can be used to do the type of annotation that can be done with MUCCS (i.e., identity relations between discourse entities introduced by NPs). The three extensions to the core scheme can be used to annotate (i) references to the visual situation, as in the Bruneseaux and Romary scheme and in the MapTask scheme for annotating references to landmarks; (ii) a more complex set of relations between entities ('bridges'), as in the DRAMA scheme; and (iii) anaphoric relations involving an extended range of anaphoric expressions (such as incorporated clitics) and of antecedents (as in discourse deixis). The tool will support the whole range of elements and attributes of the meta scheme; the task of the designer of a scheme for a particular application will be to identify the options of interest among those supported by the tool, ignoring the rest. The documentation for the coreference level includes, in addition to a discussion of the core scheme and the extensions, an example of how to extract a scheme from the meta scheme (the example 4 The 'meta-scheme' approach was also adopted in EAGLES http://www.ilc.pi.cnr.it/EAGLES/home.html and the CES http://www.cs.vassar.edu/CES/.</Paragraph> <Paragraph position="2"> covers references to the visual situation) as well as instructions for using the markup definitions provided by the meta scheme to encode according to the DRAMA scheme.</Paragraph> <Paragraph position="3"> On the assumption that the designer of a scheme for dialogues may be interested in annotating both 'anaphoric' and 'coreferential' information, we addressed the problem of the difference between the two types of annotation by adopting a position analogous to that taken in DRT (Kamp, 1981; Heim, 1982), whereby coreference information is expressed in terms of the same semantic relations used to annotate anaphoric information. 
This is done by introducing in the annotation identifiers that stand for the 'actual' referents, and expressing reference by means of relations between the discourse entities that 'refer' and these identifiers.</Paragraph> </Section> <Section position="2" start_page="66" end_page="67" type="sub_section"> <SectionTitle> 3.2 Markup Language and Assumptions About File Organization </SectionTitle> <Paragraph position="0"> In the rest of the paper we discuss first the Core Scheme, then each of the extensions. In this section we introduce the markup language and discuss a few assumptions underlying annotation using the MATE Workbench common to several levels, as well as a few assumptions specific to the coreference level.</Paragraph> <Paragraph position="1"> The markup language in MATE is XML, a simplification of SGML meant to make for easier parsing, but the workbench will make this transparent to annotators (not to annotation scheme designers, of course). An assumption common to most levels is the distinction between BASE FILE and ANNOTATION FILE. The base file contains the information necessary to annotate; the results of the annotation for a given level go in a separate file containing pointers to elements of the base file. For the coreference level, for example, the base file could be either a file annotated with one XML element per word, as in the British National Corpus, or possibly a file containing syntactic information; in the latter case, the text elements to be annotated could be identified automatically.</Paragraph> <Paragraph position="2"> We assume here that the base file is encoded according to the recommendations for the morpho-syntactic level of chunks adopted in MATE (Pirrelli and Soria, 1999), which specify a type of syntactic representation that could be produced by existing parsers; such parsers might be integrated in the workbench. 
For example, the representation in terms of chunks of the sentence John likes Bill would be as follows: <cA id=&quot;ch_002&quot; type=&quot;V&quot;> <potgov id=&quot;p_002&quot;>likes </potgov> </cA> : <cA id=&quot;ch_003&quot; type=&quot;N&quot;> <potgov id=&quot;p_003&quot;>Bill </potgov> </cA> The result of an annotation at the coreference level of the file ch.xml (the base file) would be to produce a second file containing elements that point to ch.xml, as discussed below.</Paragraph> </Section> </Section> <Section position="5" start_page="67" end_page="69" type="metho"> <SectionTitle> 4 The Core Scheme </SectionTitle> <Paragraph position="0"> The Core Scheme has been designed to annotate the subset of all information about 'coreference' that can be annotated reliably: that is, information about identity relations between discourse entities explicitly introduced by elements of a text. As in MUCCS, it is assumed that annotation will proceed in two steps: first agreeing on the markables, then markup of anaphoric relations. The main difference from the MUC scheme is that following the recommendations of the Text Encoding Initiative and of Bruneseaux and Romary, the distinction between these two steps of annotation is mirrored in the Core Scheme by a distinction between two elements: <de>, used to mark the parts of text that may be involved in these relations, and <link>, used to mark information about these relations.</Paragraph> <Section position="1" start_page="67" end_page="68" type="sub_section"> <SectionTitle> 4.1 Identifying and Marking Discourse Entities </SectionTitle> <Paragraph position="0"> As mentioned above, the main problem with designing a scheme for anaphora and coreference is not that the relevant notions are difficult to define, but that you can't annotate everything. Of the schemes we examined, the Lancaster University scheme is the most ambitious, including also ways for annotating VP ellipsis. 
DRAMA recommends annotating all noun phrases, whether or not they introduce discourse entities. MUCCS recommends annotating also noun phrases occurring in prenominal position in other noun phrases, such as Getty in the Getty museum.</Paragraph> <Paragraph position="1"> For the Core Scheme we adopted a conservative view close to that of DRAMA, and recommend marking only the potential antecedents of anaphoric and referential expressions that are realized in the text as full NPs; in other words, we did not include instructions for annotating parts of NPs that may enter in such relations (as in MUCCS) or for annotating verbal constituents that enter e.g., in ellipsis (as in the Lancaster scheme). If needed, elements for marking up verbal constituents are included in one of the extensions of the Core Scheme (see below), so the MATE Workbench could be used to support this type of annotation as well, provided that the users come up with their own instructions for identifying the markables.</Paragraph> <Paragraph position="2"> Each markable NP is annotated with a <de> element with an ID attribute. In the underlying XML representation, the <de> elements include pointers to elements in the base file; e.g., the markables in the example in (3) would be represented as follows: (4) coref.xml: <de id=&quot;de_001&quot; href=&quot;ch.xml#id(ch_001)&quot;/> <de id=&quot;de_002&quot; href=&quot;ch.xml#id(ch_003)&quot;/> However, this aspect of the representation should be transparent to the annotator, who will only be concerned with marking <de> elements and assigning them an ID. 
So in what follows, also to make the notation more readable, we will represent markup using a simpler notation without HREF pointers, as in the following examples: (5) <de ID=&quot;de_01&quot;>we</de>'re gonna take <de ID=&quot;de_07&quot;>the engine E3</de> and shove <de ID=&quot;de_08&quot;> it </de> over to <de ID=&quot;de_02&quot;>Corning</de>, hook <de ID=&quot;de_09&quot;> it </de> up to <de ID=&quot;de_03&quot;>the tanker car</de>...</Paragraph> <Paragraph position="3"> (6) . 197 F: r~nh /Donc qu'est ce que vous allez garder en fait (?) + / 198 M: <de ID=&quot;de_96&quot;>la longueur du <de ID=&quot;de_97&quot;>tube</de></de> et <de ID=&quot;de_98&quot;> les ailerons </de> 199 D:<de ID=&quot;de_99&quot;> les ailerons </de> 200 F: Donc <de ID=&quot;de_100&quot;> les ailerons </de> vous m'avez dit.</Paragraph> <Paragraph position="4"> It is assumed that in most cases (at least, when the base file is annotated with syntactic information) markables will be automatically identified by means of search patterns formulated in terms of the MATE query language (Heid and Mengel, 1999); the main role of the annotator would be to correct possible problems. 
This suggests that the markables would be mostly identified on purely structural grounds.</Paragraph> <Paragraph position="5"> The instructions for identifying markables do include however a discussion of several cases in which the designers of a scheme may decide not to mark a text element as <de> even if syntactically it counts as a NP: examples are NPs in predicative position, such as a policeman in John is a policeman 5, and repeated NPs in the case of disfluencies, as in the following example: (7) 193 F: Donc qu'est ce qui / qu'est ce qui serait commun a 5 Note that the pronoun he in the continuation He works for the 27th district would not be considered ambiguous.</Paragraph> <Paragraph position="6"> (in cases like this, DRAMA recommends to mark up all repetitions of ells, although again, they do not create ambiguity). The instructions for the Core Scheme include a fairly extensive discussion of which text constituents count as NPs, which incorporates examples from MUCCS and DRAMA as well as from (Quirk and Greenbaum, 1973).</Paragraph> <Paragraph position="7"> An issue not considered either in MUCCS or in DRAMA is what to do when a discourse entity is not introduced by a single contiguous phrase, but by utterances interrupted by disfluencies or comments, as in (8), where the utterance of the diamond mine is interrupted by an acknowledgment from the follower: (8) GIVER: curving, just curving round the diamond FOLLOWER: uh-huh GIVER: mine ...... uh-huh We believe this problem should be addressed at the parsing level by providing ways of representing non-contiguous syntactic elements, as done in the representation for the morpho-syntactic level proposed in MATE. The chunk-level representation of the example above is shown in (9), whereas the representation at the coref level is shown in (10).</Paragraph> <Paragraph position="8"> (9) ch.xml: GIVER: Curving, just curving round ... href=&quot;ch.xml#id(ch_66)..id(ch_68)&quot;/> 
The other addition to the instructions given in MUCCS and DRAMA are instructions for marking up clitics and empty elements, common in Italian and Spanish. Markup elements for marking incorporated clitics (such as daselo in (11)) and empty elements are discussed below; clitics realized as distinct particles (such as la in (11)) are also marked as <de>s, as follows. (11) ~ir~. te doy <de ID=&quot;de_167&quot;> este libro </de></Paragraph> </Section> <Section position="2" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 4.2 Links </SectionTitle> <Paragraph position="0"> The subset of 'coreference' information which has been shown most clearly to be markable in a reliable way coincides with the information that can be annotated with the MUCCS scheme: identity relations between discourse entities explicitly introduced in the text by nominal phrases. In the core scheme, this is the only information that can be annotated.</Paragraph> <Paragraph position="1"> Whereas identity relations are represented in MUCCS and DRAMA by means of attributes on elements that correspond to the <de> element used in the MATE scheme, we adopted a notation derived from the <link> mechanism used in the TEI for linking any text element and adopted by Bruneseaux and Romary for representing anaphoric information. 6 <link> elements have two attributes: a HREF pointer to the <de> element that stands in an anaphoric relation with an antecedent, and a TYPE attribute specifying the relation (which in the case of the Core Scheme can only be IDENT). <link> elements contain then one or more <anchor> elements, with a single <href> pointer to the antecedent. So for example, the anaphoric relations in (5) and (6) would be annotated as follows: (12). 
coref.xml <de ID=&quot;de_01&quot;>we</de>'re gonna take <de ID=&quot;de_07&quot;> the engine E3 </de> and shove <de ID=&quot;de_08&quot;> it </de> over </link> In MUCCS, the annotator is free to choose either of the two elements in an identity relation as 'anaphor', because identity is symmetric. As the intention is to use <link> elements to also annotate non-symmetric relations such as those found in bridging cases, we recommend always having the HREF pointer in the <link> element point to the anaphor, and the HREF pointer in the <anchor> to the antecedent. (In general, the annotator will still be able to exploit the transitivity property of identity and choose any antecedent, although in particular cases this may not be a good idea either.) The reason why <link> elements may have more than one <anchor> element is to annotate ambiguities, which are very common in spoken dialogue. In case more than one <de> element appears to be an equally likely antecedent of an anaphoric expression, each of the possibilities should be marked by means of a separate <anchor> element. In (14a), for example, the pronoun it in 15.16 could refer equally well to engine E3 or the tanker car. Both antecedents should be annotated, as shown in (14b).</Paragraph> <Paragraph position="2"> (14) a. 15.12 : we're gonna take the engine E3 15.13 : and shove it over to Corning 15.14 : hook it up to the tanker car 15.15 : _and_ 15.16 : and send it back to Elmira b. 
coref.xml: 15.12 : we're gonna take <de ID=&quot;de_15&quot;>the engine E3</de> 15.13 : and shove <de ID=&quot;de_16&quot;> it </de> over to Corning 15.14 : hook <de ID=&quot;de_17&quot;>it</de> up to <de ID=&quot;de_18&quot;>the tanker car</de></Paragraph> </Section> </Section> <Section position="6" start_page="69" end_page="70" type="metho"> <SectionTitle> 5 References to Visual Situation </SectionTitle> <Paragraph position="0"> In multimodal dialogues it is possible to refer to objects which have not been previously introduced, but are 'accessible' by virtue of being part of the visual situation: examples are objects on the screen in the case of multimodal applications (Bruneseaux and Romary, 1998) and references to landmarks in the map in the MAPTASK corpus. The first proposed extension to the Core Scheme consists of a new set of elements introduced in order to annotate references to the visual situation.</Paragraph> <Paragraph position="1"> We adopted for this purpose a variant of the <universe> mechanism used in the Bruneseaux-Romary scheme. The idea is to assign an ID to each object in the visual situation that can be referred to, and then represent references to these objects by means of the same <link> mechanism used for anaphoric relations in the Core Scheme. For each object in the visual situation, a <ue> element gets created; the <ue> elements are then grouped in a <universe> element, as follows: Sort of in the middle of the page?...</Paragraph> <Paragraph position="2"> On on a level to <de ID=&quot;de52&quot;> the c--.., er diamond mine. </de> <link href=&quot;coref.xml#id(de50)&quot; type=&quot;ident&quot;> <anchor href=&quot;coref.xml#id(ue1)&quot;/></Paragraph> <Paragraph position="4"> Having a single universe is sufficient in cases when there is a single set of objects, but not in domains like the MapTask, where the two participants in the conversation have slightly different maps. 
The <universe> mechanism has been designed to handle this type of situation, as well. In these cases,</Paragraph> <Paragraph position="5"> it is suggested that three universes be created: one, with ID=&quot;COMMON&quot;, containing all objects shared between the visual situations, and then one universe for each conversational participant containing the elements known only to that participant. This will ensure that the shared elements receive a unique ID.</Paragraph> <Paragraph position="6"> <universe> elements have an optional MODIFIES attribute that can be used to encode the information that a given universe is an extension of another universe; e.g., in the case just discussed, the universe of each participant could be given the value MODIFIES=&quot;COMMON&quot;.</Paragraph> <Paragraph position="7"> In (16b) we see how the situation in (16a) could be encoded. Three universes are defined; COMMON contains a gold mine, whereas the GIVER_UNIVERSE also contains a diamond mine, which isn't in the follower's universe. As a result, the follower mistakenly believes that the gold mine and the diamond mine are the same. This example also illustrates how these misunderstandings could be encoded by means of another optional extension to the link mechanism specified in the Core Scheme: the attribute WHO-BELIEVES, whose values would be identifiers for the two participants in the conversation (G and F in this case).</Paragraph> <Paragraph position="8"> (16) a. GIVER: Do_you have diamond_mine.</Paragraph> <Paragraph position="9"> FOLLOWER: Yes I've got a gold mine.</Paragraph> <Paragraph position="10"> GIVER: Ah. S--.</Paragraph> <Paragraph position="11"> FOLLOWER: ....</Paragraph> <Paragraph position="12"> GIVER: You don't have diamond_mine though.</Paragraph> <Paragraph position="13"> FOLLOWER: No. 
It's a gold_mine according to this one.</Paragraph> <Paragraph position="14"> Presumably that's the same.</Paragraph> <Paragraph position="15"> GIVER: Well I've got a gold_mine as well you see. (MT) b. coref.xml: We don't know of any reliability study for this type of references, but experience with MapTask suggests that it can be done reliably. We are currently doing a test of the reliability of this extension in two languages (Italian and English) and will report at the Workshop.</Paragraph> </Section> <Section position="7" start_page="70" end_page="72" type="metho"> <SectionTitle> 6 Marking non-nominal elements </SectionTitle> <Paragraph position="0"> Even if we only consider anaphoric relations involving nominal elements, there are at least two situations in which an annotator may wish to mark an anaphoric relation that also involves other types of constituents. The first is the case, already mentioned in Section 4, in which we have a relation that would fall for all purposes under the Core Scheme, except that the anaphoric element is either unexpressed or incorporated in the verb. The second situation is that of so-called DISCOURSE DEIXIS (Webber, 1991), when the antecedent of a nominal expression is an abstract object such as an event or proposition introduced in the discourse somewhat indirectly by sentences. (DRAMA allows for such relations to be marked.) The second extension to the Core Scheme was developed to give annotators tools to mark these types of anaphoric relations. The solution we propose is to use the <seg> element introduced in the TEI to mark up arbitrary pieces of text; <seg> elements are given an ID which can then be used in <link> elements just like for other anaphoric relations. 
(The <seg> element could also be used to extend the meta scheme to cover anaphoric relations between non-nominal elements, such as VP ellipsis.)</Paragraph> <Section position="1" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 6.1 Using SEG to mark up empty and </SectionTitle> <Paragraph position="0"> incorporated constituents In Italian, Spanish and many other languages, certain nominal constituents may not be realized; this is especially common for nominals in subject position. These nominals are present in annotations produced by hand (e.g., in the Penn Treebank), but the parsers used for parsing spoken dialogues tend not to produce representations containing empty constituents in this case. In case these nominals are not represented in the base level, we recommend marking the verb with a <seg> element, and then coding the anaphoric relation as usual by means of <link> elements, as follows: (17) coref.xml: A: Dov'e' <de ID=&quot;de_157&quot;>Gianni?</de> [Where is Gianni?] B: <seg type=&quot;pred&quot; ID=&quot;seg_158&quot;>e' andato a mangiare </seg> [. went to have lunch]</Paragraph> <Paragraph position="2"> The reader will have noticed that this representation can only be used without loss of information when there is at most one empty element; this is true for Italian, but not for Japanese or Portuguese. If more precision is needed, the annotator should then define more specific identity relations also specifying which empty argument of the verb enters in the anaphoric relation: SUBJ-IDENT, OBJ-IDENT, etc. These relations could then be used instead of IDENT to specify the value of the TYPE attribute of the <link> element.</Paragraph> <Paragraph position="3"> A second case in which an argument is not realized by means of a nominal is in the case of incorporated clitics, such as daselo in (11). 
In this case, again, we recommend marking the verb by way of a <seg> element when the parser doesn't produce a morphologically decomposed representation, and then encoding the anaphoric relations in which the clitics are involved by means of either a single IDENT relation or by means of more fine-grained relations such as SUBJ-IDENT or OBJ-IDENT.</Paragraph> <Paragraph position="4"> Provided that the <seg> elements are identified during the first pass of markable identification, encoding this information should not be any harder than in the case of the Core Scheme. The real question for this type of annotation is which empty elements to annotate: e.g., in addition to 'small pro' elements such as those discussed above, the annotator may also decide to annotate 'big PRO' elements that according to some syntactic theories occupy the subject position of infinitival clauses.</Paragraph> </Section> <Section position="2" start_page="71" end_page="72" type="sub_section"> <SectionTitle> 6.2 Using SEG to mark the antecedents of </SectionTitle> <Paragraph position="0"> discourse deixis Abstract objects such as events, actions and propositions can all serve as antecedents of anaphoric expressions. We are not aware of any reliability results for this type of annotation, but the <seg> element can be used to identify the antecedents in this type of anaphora. If desired, the annotator could use a second attribute TYPE to specify the type of object introduced by the <seg> element; TYPE would have values EVENT, PROP and ACTION.</Paragraph> <Paragraph position="1"> (19) a. The 23-year-old had hit his head against another player during a game of Aussie-rules football. McGlinn remembered nothing of the collision, but developed a headache and had several seizures. (BBC) b. 
<seg type=&quot;event&quot; ID=&quot;seg_130&quot;>The 23-year-old had hit his head against another player</seg> during a game of Aussie-rules football.</Paragraph> <Paragraph position="2"> McGlinn remembered nothing of <de ID=&quot;de_131&quot;> the collision </de>, but developed a headache and had several seizures.</Paragraph> <Paragraph position="4"> (20) a. Despite the latest negative results, doctors are still convinced that Tamoxifen can prevent breast cancer. This is because of the way it blocks the action of oestrogen, the female sex hormone that can make the breast cells of some women go out of control.</Paragraph> <Paragraph position="5"> b. Despite the latest negative results, <seg type=&quot;prop&quot; ID=&quot;seg_129&quot;> doctors are still convinced that <de ID=&quot;de_131&quot;> Tamoxifen </de> can prevent breast cancer </seg>.</Paragraph> <Paragraph position="6"> <de ID=&quot;de_130&quot;> This </de> is because of the way <de ID=&quot;132&quot;> it </de> blocks the action of oestrogen, the female sex hormone that can make the breast cells of some women go out of control.</Paragraph> <Paragraph position="7"> <link href=&quot;coref.xml#id(de_130)&quot; type=&quot;ident&quot;> <anchor href=&quot;coref.xml#id(seg_129)&quot;/> </link> (21) a. GIVER: You're sort_of going past st ....... k... but your line's curving up past the...</Paragraph> <Paragraph position="8"> flat rocks.</Paragraph> <Paragraph position="9"> FOLLOWER: Right. Okay.</Paragraph> <Paragraph position="10"> but your line's curving up past the... flat rocks.</Paragraph> <Paragraph position="11"> FOLLOWER: Right. Okay.</Paragraph> <Paragraph position="12"> GIVER: <seg ID=&quot;seg_135&quot; type=&quot;action&quot;>And then starting to come down again.</seg> FOLLOWER: Got <de ID=&quot;de_136&quot;> that </de>. 
<link href=&quot;coref.xml#id(de_136)&quot; type=&quot;ident&quot;> <anchor href=&quot;coref.xml#id(seg_135)&quot;/> </link> These examples also illustrate some of the problems to be addressed when designing a reliable annotation scheme for this phenomenon: these include deciding what part of the text counts as the antecedent as well as deciding which type of object the antecedent is (see, e.g., (21)).</Paragraph> </Section> </Section> <Section position="8" start_page="72" end_page="73" type="metho"> <SectionTitle> 7 Bridging References </SectionTitle> <Paragraph position="0"> DRAMA also allows annotators to encode certain types of BRIDGING REFERENCES (Clark, 1977): these are anaphoric expressions that denote objects that have not yet been introduced in the discourse, but that are related to an entity already introduced in the text by relations other than identity. An example is the indicators in: (22) John has bought a new car. The indicators use the latest laser technology.</Paragraph> <Paragraph position="1"> We are able to interpret the description the indicators because we know that indicators are parts of cars. The set of relations that may hold between a bridging reference and its 'antecedent' or 'anchor' is rather wide; an extensive survey of the existing classifications can be found in (Vieira, 1998).</Paragraph> <Paragraph position="2"> The Extended Relations Scheme is designed for those who wish to mark up these more general anaphoric relations. It uses the same elements as the Core Scheme, but more values are allowed for the TYPE attribute of the <link> element besides simple IDENT. The set of relations allowed by the scheme derives from the analysis of Vieira and includes most of the bridging relations in DRAMA (MEMBER, SUBSET, PART, CAUSE, POSS and ARG). 
For example, we see in (23) how the elements of the Extended Relations Scheme can be used to encode a subset relation between les modeles de fusees and les fusees qui ont bien vole'.</Paragraph> <Paragraph position="3"> As the poor reliability scores obtained by (Poesio and Vieira, 1998) for this kind of scheme indicate, once one moves beyond the IDENT relation, it can be difficult to decide how to classify the link between two elements. We tried to alleviate this problem by adopting the TEI technique of specifying 'subtypes' of links: for cases in which it may be difficult to identify precisely the type of relation that exists between two entities, we introduced a more general relation to be used as the TYPE of a link, as well as more specific relations to be used as values of the SUBTYPE attribute when this additional specification is possible. We used this technique for two types of relations: possession relations (which include generic attribution, true possession and part as subtypes) and event relations (which include relations such as cause and 'role' as subparts). The following example illustrates how the TYPE and SUBTYPE attributes can be used to encode possession relations at the desired level of precision, as well as why it may sometimes be difficult to decide which relation holds between two discourse entities.</Paragraph> <Paragraph position="4"> (24) a. French boss Aime Jacquet praised his In the documentation we specify additional relations and further distinctions that an annotator may wish to make, including ways to annotate the function-value relations discussed in MUCCS.</Paragraph> <Paragraph position="5"> The basic problem to be solved when trying to do this type of annotation is to come up with instructions that will ensure that annotators recognize bridging references. 
As a preliminary proposal, we suggest that annotators first try to identify an antecedent which is identical with the anaphor; if that fails, they should try to find first a discourse entity with which the anaphor stands in one of the set relations, then one with which it stands in one of the generalized possession relations.</Paragraph> <Paragraph position="6"> 8 State of the Proposal; Further Work So far, we have used the scheme to annotate a TRAINS dialogue, a MAPTASK dialogue, and a dialogue from the microfusees corpus collected by LORIA. We are currently running a reliability study of the extension dealing with references to the visual situation, while waiting for the preliminary release of the MATE workbench to study the more complex features of the scheme. This will also involve trying to extract a coding book from the manual by fixing some parameters. As the preliminary release of the MATE workbench is planned for May, we may already be able to report some results at the ACL meeting.</Paragraph> </Section> </Paper>