File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-0209_metho.xml
Size: 20,135 bytes
Last Modified: 2025-10-06 14:09:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0209"> <Title>Exploiting semantic information for manual anaphoric annotation in Cast3LB corpus</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Cast3LB corpus: annotation project overview </SectionTitle> <Paragraph position="0"> The Cast3LB project is part of the general 3LB project, which is partially funded by the Spanish Government (FIT-150500-2002-244). The main objective of this general project is to develop three corpora annotated with syntactic, semantic and pragmatic/coreferential information: one for Catalan (Cat3LB), one for Basque (Eus3LB) and one for Spanish (Cast3LB).</Paragraph> <Paragraph position="1"> The Spanish corpus Cast3LB is part of the CLIC-TALP corpus, which is made up of 100,000 words from the LexEsp corpus (Sebastián et al., 2000) plus 25,000 words from the EFE Spanish Corpus, provided by the Agencia EFE (the official news agency) for research purposes. The EFE corpus fragments are comparable across the languages of the general project (Catalan, Basque and Spanish).</Paragraph> <Paragraph position="2"> We have selected this corpus because it contains a large variety of Spanish texts (newspapers, novels, scientific papers, etc.), both from Spain and South America, so it is a good representation of the current state of the Spanish language. Moreover, the automatic morphological annotation of this corpus has been manually checked (Civit, 2003).</Paragraph> <Paragraph position="3"> The spirit of the annotation scheme is to build a flexible system that is portable to different Romance languages and to potential new cases that might appear, while remaining consistent across all annotation levels and annotation data.</Paragraph> <Paragraph position="4"> At the syntactic level we follow the constituency annotation scheme. The main principles of syntactic annotation are the following (Civit et al., 2003): a) only the explicit elements are annotated (except for elliptical subjects); b) we do not alter the surface word order of the elements; c) we do not follow any specific theoretical framework; d) we do not take the verbal phrase into account; rather, the main constituents of the sentence become the daughters of the root node; e) this syntactic information is enriched with the functional information of the main phrases, but we have not taken into account the possibility of double functions.</Paragraph>
<Paragraph position="5"> At the semantic level, we annotate the sense of the nouns, verbs and some adjectives, following an all-words approach. The specific sense (or senses) of each one is assigned by means of the EuroWordNet offset number (Vossen, 1998). Also, because some words are not available in EuroWordNet or do not have a suitable sense, we have created two new tags to mark this circumstance.</Paragraph> <Paragraph position="6"> At the discourse level, we mark the coreference of nominal phrases and some elliptical elements. The coreferential expressions taken into account are personal pronouns, clitics, elliptical subjects and some elliptical adjectives. Definite descriptions are not marked. The possible antecedents considered are nominal phrases or other coreferential expressions.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Semantic annotation </SectionTitle> <Paragraph position="0"> As stated above, the main objective of the Cast3LB project at the semantic level is to develop an &quot;all words&quot; corpus with the specific sense (or senses) of nouns, verbs and adjectives.</Paragraph> <Paragraph position="1"> Our proposal is based on the SemCor corpus (Miller, 1990). This corpus is formed by a portion of the Brown corpus and the novel The Red Badge of Courage. 
Altogether, it contains approximately 250,000 words, in which nouns, verbs, adjectives and adverbs have been manually annotated with WordNet senses (Miller, 1990). Another corpus with WordNet-based semantic annotation is the DSO corpus (Ng and Lee, 1996). In this corpus, the most frequent ambiguous English nouns and verbs (121 nouns and 70 verbs) have been annotated with the correct sense. The corpus is formed by 192,800 sentences from the Brown Corpus and the Wall Street Journal, and it has also been manually annotated. Finally, the SENSEVAL forum has developed a few sense-annotated corpora for the evaluation of Word Sense Disambiguation systems (Kilgarriff and Palmer, 2000), some of which also use WordNet as a lexical resource.</Paragraph> <Paragraph position="2"> We have decided to use the Spanish WordNet for several reasons. First of all, the Spanish WordNet is, up to now, the most commonly used lexical resource in Word Sense Disambiguation tasks. Secondly, it is one of the most complete lexical resources currently available for Spanish. Finally, as part of EuroWordNet, the lexical structure of Spanish is related to the lexical structures of Catalan and Basque. Therefore, the annotated senses of the three corpora of the 3LB project can also be related.</Paragraph> <Paragraph position="3"> The tag used to mark a word sense is its offset number, that is, its identification number in EuroWordNet's InterLingua Index. The corpus has 42291 lexical words, of which 20461 are nouns, 13471 are verbs and 8543 are adjectives.</Paragraph> <Paragraph position="4"> On the other hand, not all nouns, verbs, adjectives and adverbs are annotated, because EuroWordNet does not contain all of them. 
The possible gaps in EuroWordNet are (i) the synset, (ii) the word, (iii) both the synset and the word, and (iv) the link between the synset and the word.</Paragraph> <Paragraph position="5"> In order to deal with these cases we have defined two more tags: + C1S: the word is found, but not its correct sense (because the sense is missing, or because there is no link between the word and the synset).</Paragraph> <Paragraph position="6"> + C2S: the word is not found (because it is not there, or because both the word and the synset are missing).</Paragraph> <Paragraph position="7"> Two methods for semantically annotating a corpus can be distinguished. The first is the linear (or &quot;textual&quot;) method (Kilgarriff, 1998), in which the human annotator marks the sentences token by token up to the end of the corpus. With this strategy the annotator must read and analyze the sense of each word every time it appears in the corpus. The second is the transversal (or &quot;lexical&quot;) method (Kilgarriff, 1998), in which the annotator proceeds word type by word type, annotating all the occurrences of each word in the corpus one by one. With this method, the annotator must read and analyze all the senses of a word only once.</Paragraph> <Paragraph position="8"> In Cast3LB we have followed the transversal process. The main advantage of this method is that we can focus our attention on the sense structure of one word and deal with its specific semantic problems: its main sense or senses, its specific senses, etc. Then we check the context of the word each time it appears and select the corresponding sense.</Paragraph> <Paragraph position="9"> Through this approach, the semantic features of each word are taken into consideration only once, and the whole corpus achieves greater consistency. 
Through the linear process, however, the annotator must remember the sense structure of each word and its specific problems each time the word appears in the corpus, which makes the annotation process much more complex and increases the risk of low consistency and disagreement between annotators. Nevertheless, the transversal method has a disadvantage when annotating a large corpus: no fragment of the corpus is available until the whole corpus is completed. To avoid this, we have selected a fragment of the whole corpus and annotated it by means of the linear process.</Paragraph> <Paragraph position="10"> Semantic annotation is widely regarded as a tedious and difficult task. From a general point of view, the main problem in semantic annotation is the subjectivity of the human annotator when selecting the correct sense, because there is usually more than one sense for a word and, due to WordNet's granularity, more than one could be correct for a given word. Another important problem in semantic annotation is the poor agreement between different annotators, due to the ambiguity and/or vagueness of many words.</Paragraph> <Paragraph position="11"> In order to overcome these problems, the annotation process has been carried out in two steps. In the first step, a subset of ambiguous words has been annotated by two annotators. With this double annotation we have developed a disagreement typology and an annotation handbook, where all the possible causes of ambiguity have been described and common solutions have been adopted for the rest of the cases. In the second step, the remaining corpus is annotated following the criteria adopted in the annotation handbook.</Paragraph> <Paragraph position="12"> Our final aim is to obtain useful resources for Word Sense Disambiguation (WSD) systems in Spanish. 
This semantically annotated corpus will be used as a training corpus for the development of unsupervised systems and as a reference in general evaluation tasks. At the end of the project, we will have a large number of words with an unambiguous sense tag in a real context.</Paragraph> <Paragraph position="13"> In addition to this final application, we exploit this semantic information in the anaphoric annotation task. Saiz-Noeda (2002) shows and evaluates how to apply semantic information in anaphora resolution systems. We adopt this proposal, but apply it to manual anaphora annotation.</Paragraph> <Paragraph position="14"> Because the corpus has been annotated with syntactic information, and the sense of each word is marked with its EuroWordNet offset number, it is possible to extract semantic features of each verb and noun through the ontological concepts of the EuroWordNet Top Ontology. Furthermore, the corpus has been annotated with syntactic roles, so it is possible to extract syntactic patterns formed by the verb and its main complements: subject-verb, verb-direct object and verb-indirect object.</Paragraph> <Paragraph position="15"> As we will show below, these patterns are useful for selecting the specific antecedent of an anaphora, according to semantic compatibility criteria between the antecedent and the verb of the sentence where the anaphora appears.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Discourse annotation: anaphora and coreference </SectionTitle> <Paragraph position="0"> At the discourse level, our objective is to annotate anaphora and coreference, in order to develop useful resources for anaphora resolution systems. We agreed to annotate the anaphoric elements and their antecedents. 
These anaphoric elements are anaphoric ellipsis, pronominal anaphora and coreferential chains.</Paragraph> <Paragraph position="1"> Specifically, for each one, we mark: + Anaphoric ellipsis: - The elliptical subject, made explicit in the syntactic annotation step. Being a noun phrase, it can also be an antecedent.</Paragraph> <Paragraph position="2"> Unlike English, where an expletive pronoun can appear as subject, in Spanish an elliptical nominal phrase is very common as the subject of the sentence. This is why we have decided to include this kind of anaphora in the annotation process.</Paragraph> <Paragraph position="3"> - Elliptical head of nominal phrases with an adjective complement. In English, this construction is the &quot;one anaphora&quot;. In Spanish, however, the anaphoric construction is made up of an elliptical head noun and an adjective complement.</Paragraph> <Paragraph position="4"> + Anaphora: Two kinds of pronouns: - The tonic personal pronouns in the third person. They can appear in subject function or in object function.</Paragraph> <Paragraph position="5"> - The atonic pronouns, specifically the clitic pronouns that appear in the subcategorization frame of the main verb.</Paragraph> <Paragraph position="6"> + Finally, there are sets of anaphoric and elliptical units that corefer to the same entity. These units form coreferential chains. They must be marked in order to show the cohesion and coherence of the text. They are annotated by means of the identification of the same antecedent. We do not annotate definite descriptions. They consist of nominal phrases that can refer (or not) to an antecedent. 
We do not mark them because they pose specific problems that make this task very difficult: firstly, there are no clear criteria that allow us to distinguish between coreferential and non-coreferential nominal phrases; secondly, there is no clear typology of definite descriptions; and finally, there is no clear typology of relationships between definite descriptions and their antecedents. These problems could further increase the time consumed by the annotation process and widen the disagreement between human annotators.</Paragraph> <Paragraph position="7"> The proposed annotation scheme is based on the one used in the MUC (Message Understanding Conference) (Hirschman, 1997) as well as in the work of Gaizauskas (Gaizauskas and Humphreys, 1996) and Mitkov (Mitkov et al., 2002): this is the most widely used scheme in coreferential annotation (Mitkov, 2002).</Paragraph> <Paragraph position="8"> In the anaphoric annotation, two linguistic elements must be marked: the anaphoric expression and its antecedent. In the antecedent we annotate the following information: + The type of anaphoric expression: elliptical subject, elliptical head of noun phrase, tonic pronoun or atonic pronoun (&quot;TYPE&quot;), + The antecedent, through its identification number (&quot;REF&quot;), + Finally, a status tag where the annotators show their confidence in the annotation (&quot;STATUS&quot;). As previously mentioned in this paper, the main problem in anaphoric annotation is the low agreement between human annotators. There is usually less agreement in anaphoric annotation than in syntactic annotation (Mitkov, 2002, p. 141). To improve agreement, we annotate only the clearest types of anaphoric units (pronouns, elliptical subjects and elliptical nominal heads), and we introduce only the minimum necessary information. 
Moreover, with the &quot;STATUS&quot; tag, the human annotator can indicate their confidence in the anaphoric unit and the antecedent marked. However, at the moment, as is the case for the semantic annotation, we do not have enough data on the agreement between annotators.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Manual annotation with an Enriched Anaphora Resolution System </SectionTitle> <Paragraph position="0"> As stated above, we carry out manual anaphora annotation with the help of an Enriched Anaphora Resolution System: our idea is to check the automatic annotation produced by the anaphora resolution system and to correct its mistakes.</Paragraph> <Paragraph position="1"> In manual anaphora and coreference annotation, the human annotator first locates a possible anaphora, and then must read back through the text until the antecedent appears. With an anaphora resolution system it is possible to automate this process: the system selects possible anaphoric elements and their possible antecedents, and decides on the main candidate. The human annotator only has to check the suggestion. The process is more efficient because the most tedious tasks (selecting a possible anaphora, reading back to look for the antecedent, etc.) are carried out by the system. When the human annotator checks the solution, he or she does not read back to look for antecedents, but goes directly to the possible antecedents. However, the anaphora resolution system must be very accurate. In order to automatically specify the antecedent of an anaphora and ensure the correctness of the system, we use all the linguistic information previously annotated in the corpus: morphological, syntactic and semantic. In this knowledge-based anaphora resolution system, the linguistic information is used through a set of restrictions and preferences. Following this strategy, the system rejects possible antecedents until only one is selected. 
The key point is the linguistic information used in the restrictions and preferences.</Paragraph> <Paragraph position="2"> We have developed a semantically enriched anaphora resolution system in order to aid annotation at the discourse level. EuroWordNet synsets are the basis of the semantic information added to the resolution process. Having a semantically annotated corpus such as Cast3LB facilitates the use of this anaphora resolution method, which is based on a natural account of how humans resolve anaphora.</Paragraph> <Paragraph position="3"> The specific use of semantic information is related to the semantic compatibility between the possible antecedent (a noun) and the verb of the sentence in which the anaphoric pronoun appears. Because the pronoun replaces a lexical word (the antecedent), the semantic information of the antecedent must be compatible with the semantic restrictions of the verb. In other words, the anaphoric expression takes on the semantic features of the antecedent, so they must be compatible with the semantic restrictions of the verb.</Paragraph> <Paragraph position="4"> Thus, verbs like &quot;eat&quot; or &quot;drink&quot; will be more compatible with animal subjects and with edible and drinkable objects than with others.</Paragraph> <Paragraph position="5"> In our case, the semantic features of the lexical words have been extracted from the ontological concepts of EuroWordNet, that is, the Top Ontology. All the synsets in EuroWordNet are semantically described through a set of base concepts (the most general concepts). In the EuroWordNet Top Ontology, these base concepts are classified into the three orders of Lyons (Lyons, 1977), according to basic semantic distinctions. So through the Top Ontology, all the synsets of EuroWordNet are semantically described with concepts like &quot;human&quot;, &quot;animal&quot;, &quot;artifact&quot;, etc. 
With this, we have extracted subject-verb, verb-direct object and/or verb-indirect object semantic patterns.</Paragraph> <Paragraph position="6"> From these semantic patterns, rules about the semantic compatibility between nouns and verbs have been extracted. These rules are applied to anaphora resolution as preferences. Based on the patterns, the system calculates the compatibility between the verb of the sentence in which the anaphora appears and the antecedent. The possible antecedents with low compatibility are thus rejected, and the antecedents with high compatibility are selected. These semantic preferences, plus the syntactic and morphological restrictions and preferences, are used to select the correct antecedent of the anaphora.</Paragraph> <Paragraph position="7"> Furthermore, semantic information is also used in some rules. There are two kinds of rules: + &quot;NO&quot; rules: NO(v#sense,c,r) defines the incompatibility between the verb v (and its sense) and any noun which contains 'c' in its ontological concept list, 'r' being the syntactic function that relates them.</Paragraph> <Paragraph position="8"> + &quot;MUST&quot; rules: MUST(v#sense,c,r) defines the incompatibility between the verb v (and its sense) and all the nouns that do not contain 'c' in their ontological concept list, 'r' being the syntactic function that relates them.</Paragraph> <Paragraph position="9"> At the final annotation step, the annotator checks whether the selected antecedent is the correct one and, accordingly, confirms the annotation or corrects it.</Paragraph> </Section> </Section> </Paper>
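The NO and MUST rules described in Section 4.1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the system's actual implementation: the `Rule` class, the `violates` and `filter_candidates` helpers, the verb sense identifier `comer#1`, the role label `subject` and the concept names are all hypothetical and introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str     # "NO" or "MUST"
    verb: str     # verb plus sense, e.g. "comer#1" (hypothetical identifier)
    concept: str  # ontological concept c
    role: str     # syntactic function r relating verb and noun


def violates(rule, verb, role, concepts):
    """Return True if a noun with these concepts is incompatible under the rule."""
    if rule.verb != verb or rule.role != role:
        return False  # the rule does not apply to this verb/function pair
    if rule.kind == "NO":
        return rule.concept in concepts
    if rule.kind == "MUST":
        return rule.concept not in concepts
    raise ValueError(f"unknown rule kind: {rule.kind}")


def filter_candidates(rules, verb, role, candidates):
    """Keep the candidate antecedents that no rule marks as incompatible."""
    return [
        noun
        for noun, concepts in candidates.items()
        if not any(violates(r, verb, role, concepts) for r in rules)
    ]


# Hypothetical rules and concept lists, for illustration only.
rules = [
    Rule("MUST", "comer#1", "Animal", "subject"),
    Rule("NO", "comer#1", "Artifact", "subject"),
]
candidates = {
    "perro": {"Animal", "Living"},
    "mesa": {"Artifact", "Object"},
}
remaining = filter_candidates(rules, "comer#1", "subject", candidates)
```

In this sketch the rules act as hard filters on the candidate antecedents. The pattern-based compatibility preferences mentioned in the text would be a separate scoring step, which is not shown here.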