<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0706"> <Title>Using word similarity lists for resolving indirect anaphora</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Lexical resource </SectionTitle> <Paragraph position="0"> Our lexical resource consists of lists of semantically related words. These lists are constructed automatically by a syntax-based, knowledge-poor technique.</Paragraph> <Paragraph position="1"> The technique is described in (Gasperin et al., 2001; Gasperin, 2001), and it is an extension of the technique presented in (Grefenstette, 1994).</Paragraph> <Paragraph position="2"> Briefly, the technique consists of extracting specific syntactic contexts for every noun in the whole parsed corpus and then applying a similarity measure (the weighted Jaccard measure) to compare the nouns by the contexts they have in common (the more contexts they share, the more similar they are). By syntactic context we mean any word that establishes a syntactic relation with a given noun in the corpus. One example of a syntactic context is subject/verb, meaning that two nouns that occur as subject of the same verb share this context. Other examples of syntactic contexts are verb/object, modifier/noun, etc. Each context is assigned a global and a local weight: the first relates to the context's frequency in the corpus, the second to its frequency as a context of the noun in focus. As output, we have a list of the nouns most similar to each noun in the corpus, ordered by similarity value. As an example, we present the similarity list for the noun acusacao (accusation) in Table 1.</Paragraph>
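As a concrete illustration of the measure, the sketch below computes similarity lists from (noun, context) co-occurrence counts. The weighted Jaccard formula (sum of minimum weights over sum of maximum weights) follows Grefenstette (1994); the particular local weight (log frequency) and global weight (penalising contexts shared by many nouns) used here are illustrative assumptions, not necessarily the exact formulas of (Gasperin et al., 2001).

```python
from collections import defaultdict
from math import log2

def context_weights(freq):
    """freq[noun][context] = co-occurrence count of a noun with a
    syntactic context (e.g. "subj-of:accuse").  Returns per-noun context
    weights combining a local and a global component (both assumed)."""
    nouns_per_context = defaultdict(set)
    for noun, ctxs in freq.items():
        for c in ctxs:
            nouns_per_context[c].add(noun)
    n_nouns = len(freq)
    weights = {}
    for noun, ctxs in freq.items():
        weights[noun] = {}
        for c, f in ctxs.items():
            local = log2(f + 1)  # frequency as a context of this noun
            global_ = 1.0 - len(nouns_per_context[c]) / (n_nouns + 1)  # rarer contexts count more
            weights[noun][c] = local * global_
    return weights

def weighted_jaccard(wa, wb):
    """Sum of minimum weights over sum of maximum weights, over the
    union of the two nouns' contexts."""
    union = set(wa) | set(wb)
    num = sum(min(wa.get(c, 0.0), wb.get(c, 0.0)) for c in union)
    den = sum(max(wa.get(c, 0.0), wb.get(c, 0.0)) for c in union)
    return num / den if den else 0.0

def similarity_list(noun, weights, k=15):
    """The k nouns most similar to `noun`, as in the lists described above."""
    scores = [(other, weighted_jaccard(weights[noun], weights[other]))
              for other in weights if other != noun]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```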
<Paragraph position="4"> The similarity lists can contain any kind of semantic relation between the words (e.g. synonymy, hyponymy, etc.), but the relations are not classified. In general, the similarity lists for the less frequent words in the corpus contain some non-semantically related words (noise), since the relations were based on the few syntactic contexts these words share across the corpus. The main advantage of this technique is the possibility of having a corpus-tuned lexical resource built completely automatically. This resource closely reflects the semantic relations present in the corpus used to create the lists. We therefore believe the similarity lists are more suitable as lexical knowledge for resolving anaphora than a generic lexical base (e.g. WordNet), since they focus on the semantic relations between the terms that appear in the corpus, without considering extra meanings that some words may have. New lists can be generated for each corpus in which one aims to resolve anaphora.</Paragraph> <Paragraph position="5"> To generate the similarity lists for Portuguese we used a 1,400,000-word corpus from the Brazilian newspaper 'Folha de Sao Paulo', containing news about different subjects (sports, economics, computers, culture, etc.). This corpus includes the set of texts that was hand-annotated with coreference information in previous work (Vieira et al., 2002; Salmon-Alt and Vieira, 2002). The corpus was parsed with the Portuguese parser PALAVRAS (Bick, 2000), provided by the VISL project.</Paragraph> <Paragraph position="6"> We created two different sets of similarity lists: one considering just nouns and the other considering nouns and proper names. The first set includes one list for each noun in the corpus, each list being composed of other common nouns. The second set has one list for each noun and proper name in the corpus, each list being composed of other nouns and proper names. The first set contains 8,019 lists and the second 12,275, corresponding to the different nouns (and proper names) appearing in the corpus. Each similarity list contains the 15 words most similar to the word in focus, according to the calculated similarity values.</Paragraph> <Paragraph position="7"> Having lexical information about the proper names in the corpus is important, since many of our coreference cases have a proper name as anaphor or antecedent. But when generating the similarity lists, proper names bring noise (in general they are less frequent than common nouns) and the lists become more heterogeneous (they include more non-semantically related words).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Using similar words lists to solve indirect anaphora </SectionTitle> <Paragraph position="0"> From the manual annotation and classification of 680 definite descriptions we selected the cases classified as indirect anaphora (95). For each of them there is a list of candidate antecedents, formed by all the NPs that occur in the text before the anaphor is mentioned.</Paragraph> <Paragraph position="1"> Our heuristic for solving indirect anaphora using lists of similar words is the following. Let Hana be the head noun of the anaphor and Hcani the head noun of candidate antecedent i. We check whether: (1) Hcani occurs in the similarity list of Hana; (2) Hana occurs in the similarity list of Hcani; (3) Hana and Hcani occur together in the similarity list of some other word.</Paragraph> <Paragraph position="2"> We call (1) 'right direction', (2) 'opposite direction', and (3) 'indirect way'.</Paragraph> <Paragraph position="3"> We consider (1) > (2) > (3) with regard to the reliability of the semantic relatedness between Hana and Hcani.</Paragraph> <Paragraph position="4"> If the application of the heuristic results in more than one possible antecedent, we adopt a weighting scheme to choose one among them: the candidate with the lowest weight wins. For ranking the possible antecedents, we consider two parameters: * reliability: how the possible antecedent was selected, according to (1), (2) or (3). A penalising value is added to its weight: 0, 40 or 200, respectively. The higher penalty for the 'indirect way' reflects our expectation that it could cause many false positives; * recency: the distance in words between the anaphor and the possible antecedent. The penalty values for the reliability parameter were chosen so as to be of the same magnitude as the recency values, which are measured in words. For example, if candidate A is 250 words away from the anaphor and was selected by (1) (weight = 250 + 0 = 250), and candidate B is 10 words away from the anaphor and was selected by (3) (weight = 10 + 200 = 210), candidate B will be selected as the correct antecedent.</Paragraph> </Section>
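A minimal sketch of this procedure, assuming the similarity lists are plain word lists keyed by head noun; the function and variable names are ours, but the three lookup directions and the penalty values (0, 40, 200) are those described above.

```python
PENALTY = {1: 0, 2: 40, 3: 200}  # right direction, opposite direction, indirect way

def resolve_indirect(anaphor_head, candidates, simlists):
    """candidates: (head_noun, distance_in_words) pairs for the NPs that
    precede the anaphor.  simlists[w]: the similarity list of word w.
    Returns the candidate head with the lowest weight, or None."""
    best = None
    for head, distance in candidates:
        if head in simlists.get(anaphor_head, ()):
            way = 1        # (1): candidate head in the anaphor's list
        elif anaphor_head in simlists.get(head, ()):
            way = 2        # (2): anaphor head in the candidate's list
        elif any(anaphor_head in lst and head in lst
                 for w, lst in simlists.items()
                 if w not in (anaphor_head, head)):
            way = 3        # (3): both heads in the list of some third word
        else:
            continue       # no semantic link found for this candidate
        weight = distance + PENALTY[way]   # recency + reliability penalty
        if best is None or weight < best[1]:
            best = (head, weight)
    return best[0] if best else None
```

On the worked example above, candidate A gets weight 250 + 0 = 250 and candidate B gets 10 + 200 = 210, so B is returned.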
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Our evaluation corpus </SectionTitle> <Paragraph position="0"> As a result of previous work (Vieira et al., 2002; Vieira et al., 2003), we have a Portuguese corpus manually annotated with coreference information.</Paragraph> <Paragraph position="1"> This corpus serves as our gold standard for evaluating the performance of the heuristic presented in the previous section. The annotation study aimed to verify whether we could get a similar distribution of types of definite descriptions for Portuguese and English, which would indicate that the heuristics tested for English (Vieira et al., 2000) could also apply to Portuguese. The main annotation task in this experiment was identifying antecedents and classifying each definite description according to the four classes presented in section 2.</Paragraph> <Paragraph position="2"> For the annotation task, we adopted the MMAX annotation tool (Muller and Strube, 2001), which requires all data to be encoded in XML format. The corpus is encoded as <word> elements with sequential identifiers, and the output - the anaphors and their antecedents - is encoded as <markable> elements, the anaphor markable pointing to the antecedent markable via a 'pointer' attribute.</Paragraph> <Paragraph position="3"> The annotation process was split into four steps: selecting coreferent terms; identifying the antecedents of coreferent terms; classifying coreferent terms (direct or indirect); and classifying non-coreferent terms (discourse new or other anaphora). About half of the definite descriptions were classified as discourse new, which accounts for about 70% of the non-coreferent cases. Among the coreferent cases, the number of direct coreference cases is twice the number of indirect ones. This confirms previous findings for English.</Paragraph> <Paragraph position="4"> For the present work, we took the 95 cases classified as indirect coreference as our evaluation set. In 14 of these cases the relation between anaphor and antecedent is synonymy, in 43 it is hyponymy, and in 38 the antecedent or the anaphor is a proper name.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Implementing heuristics for indirect anaphora in ART </SectionTitle> <Paragraph position="0"> Our heuristics were implemented as an XSL stylesheet on the basis of the Anaphora Resolution Tool (ART) (Vieira et al., 2003).</Paragraph> <Paragraph position="1"> The tool integrates a set of heuristics, each corresponding to one or more stylesheets, for resolving different sorts of anaphora. The heuristics may be applied in a sequence defined by the user. Since resolving direct anaphoric descriptions (those where anaphor and antecedent have the same head noun) is a much simpler problem, with the high performance rates shown in previous results (Vieira et al., 2000; Bean and Riloff, 1999), those heuristics should be applied first in a system that resolves definite descriptions. In this work, however, we decided to consider in the experiments just the anaphors previously annotated as indirect, and to check whether the proposed heuristic is able to find the correct antecedent. ART allows the user to define the set of anaphors to be resolved; in our case they are selected from the previously classified definite descriptions. The stylesheet for indirect anaphora takes as input this list of indirect anaphors, a list of the candidates and the similarity lists. We consider all NPs in the text as candidates, and for each anaphor we consider just the candidates that appear before it in the text (we ignore cataphora for the moment).</Paragraph> <Paragraph position="2"> All the input and output data is in XML format, based on the data format used by MMAX.</Paragraph> <Paragraph position="3"> Our stylesheet for solving indirect anaphora takes the <markable> elements with an empty 'pointer' attribute (left unresolved by the previously applied stylesheets/heuristics) and creates an intermediate file with <anaphor> elements to be resolved. The resolved <anaphor>s are again encoded as <markable>s, with the 'pointer' filled. A detailed description of our data encoding is presented in (Gasperin et al., 2003).</Paragraph>
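The fragment below illustrates this encoding and the selection of unresolved markables. It is a guess at the concrete MMAX-style attribute layout (only <word>, <markable> and 'pointer' are named in the text; the 'id' and 'span' attributes are assumptions), shown here with Python's standard ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical markables for the "rodovia ... a estrada" example:
# the anaphor's 'pointer' names its antecedent markable once resolved.
sample = """
<markables>
  <markable id="markable_3" span="word_4..word_9" pointer=""/>
  <markable id="markable_7" span="word_12..word_13" pointer="markable_3"/>
</markables>
"""

root = ET.fromstring(sample)
# What the stylesheet starts from: markables with an empty 'pointer'.
unresolved = [m.get("id") for m in root if m.get("pointer") == ""]
# What it produces: markables whose 'pointer' has been filled.
resolved = {m.get("id"): m.get("pointer") for m in root if m.get("pointer")}
print(unresolved)  # ['markable_3']
print(resolved)    # {'markable_7': 'markable_3'}
```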
</Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Experiments </SectionTitle> <Paragraph position="0"> We ran two experiments: one using the similarity lists that include proper names and another using the lists containing just common nouns.</Paragraph> <Paragraph position="1"> With these experiments we verify the values of precision, recall and false positives on the task of choosing a semantically similar antecedent for each indirect anaphor. Our annotated corpus has 95 indirect anaphors with nominal antecedents, 57 of which do not involve proper names (as anaphor or antecedent). We use a non-annotated version of this corpus for the experiments. It contains around 6,000 words, from 24 news texts of 6 different newspaper sections.</Paragraph> <Paragraph position="2"> First, we reduced both sets of similarity lists to contain just the lists for the words present in this portion of the corpus (660 lists without proper names and 742 including proper names).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.1 Experiment 1 </SectionTitle> <Paragraph position="0"> Considering the 57 indirect anaphors to be solved (the ones that do not involve any proper name), we could solve 19. This yields a precision of 52.7% and a recall of 33.3%. Table 2 shows the results of our study considering the set of common-noun lists.</Paragraph>
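These figures are consistent with precision computed over the cases for which some antecedent was proposed and recall over all indirect anaphors in the set; the underlying counts (17 false positives, 21 anaphors with no antecedent) are detailed in the next paragraphs. A quick check, under that reading (ours, not a formula given in the paper):

```python
def precision_recall(correct, false_positives, unresolved):
    attempted = correct + false_positives           # cases with a proposed antecedent
    total = correct + false_positives + unresolved  # all indirect anaphors in the set
    return correct / attempted, correct / total

# Experiment 1: 19 correct, 17 false positives, 21 unresolved (57 anaphors)
print(precision_recall(19, 17, 21))  # (0.527..., 0.333...) -> 52.7%, 33.3%
# Experiment 2 (section 7.2): 21 correct, 36 wrong, 38 unresolved (95 anaphors)
print(precision_recall(21, 36, 38))  # (0.368..., 0.221...) -> 36.8%, 22.1%
```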
<Paragraph position="1"> Most of the cases could be resolved by the 'right direction', which represents the most intuitive way. 21 of the cases did not get any antecedent. We got 17 false positives, with different causes: 1. the right antecedent was not in the lists, so it could not be found, but other, wrong antecedents were retrieved. For example, in meu amigo Ives Gandra da Silva Martins escreveu para esse jornal ... o conselheiro Ives (my friend Ives_Gandra_da_Silva_Martins wrote to this newspaper ... the councillor Ives), two other candidate head nouns are similar words to "conselheiro" (councillor) - "arquiteto" (architect) and "consultor" (consultant) - but "amigo" (friend) is not; 2. the right antecedent was in the lists, but another, wrong antecedent was preferred because of its proximity to the anaphor, as in a rodovia Comandante Joao Ribeiro de Barros ... proximo a ponte ... ao tentar atravessar a estrada (the highway Comandante Joao Ribeiro de Barros ... near the bridge ... while trying to cross the road). Here, the correct antecedent of "a estrada" (the road) is "rodovia" (the highway), which is present in "estrada"'s similarity list (right direction), but so is "ponte" (the bridge), and it is closer to the anaphor in the text.</Paragraph> <Paragraph position="2"> As expected, most of the false positives (11 cases) were 'resolved' by the 'indirect way'.</Paragraph> <Paragraph position="3"> Considering all similar words found among the candidates, not just the best-ranked one, we could find the correct antecedent in 24 cases (42%).</Paragraph> <Paragraph position="4"> The average number of similar words among the candidates was 2.8, again taking into account both the positive and the false positive cases. These numbers indicate how well the similarity lists encode the semantic relations present in the corpus. 64% of the synonymy cases and 28% of the hyponymy cases could be resolved. 35% of the hyponymy cases resulted in false positives; the same happened with just 14% of the synonymy cases.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.2 Experiment 2 </SectionTitle> <Paragraph position="0"> We replicated the previous experiment, now using the similarity lists that include proper names. Table 3 shows the results for the set of lists for nouns and proper names. Considering the 95 indirect anaphors to be solved, we could solve 21. This yields a precision of 36.8% and a recall of 22.1%. No antecedent was found for 38 anaphors, and 36 anaphors got wrong antecedents (half of them by the 'indirect way'). We observed the same causes for false positives as the two presented for experiment 1.</Paragraph> <Paragraph position="1"> Considering all resolved cases (correct and false ones), we could find the correct antecedent among the similar words of the anaphor in 31 cases (32.6%). The average number of similar words among the candidates was 2.75. The numbers for the synonymy and hyponymy cases were the same as in experiment 1 - 64% and 28% respectively. The proper-name cases split into 50% false positives and 50% unresolved. This means that none of the cases involving proper names could be resolved, but it does not mean they had no influence on other nouns' similarity lists. In 26% of the false positive cases, the correct antecedent (a proper name) was in the anaphor's similarity list (but was not selected due to the weighting strategy).</Paragraph> <Paragraph position="2"> The experiment with the similarity lists that include proper names was able to solve more cases, but experiment 1 achieved better precision and recall.</Paragraph> </Section> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Related work </SectionTitle> <Paragraph position="0"> An evaluation of the use of WordNet for treating bridging descriptions is presented in (Poesio et al., 1997). This evaluation considers 204 bridging descriptions, distributed as follows, where NPj is the anaphor and NPi the antecedent.</Paragraph> <Paragraph position="1"> * synonymy relation between NPj and NPi: 12 cases; * hypernymy relation between NPj and NPi: 14 cases; * meronymy relation between NPj and NPi: 12 cases; * NPj related to NPi being a proper name: 49 cases; * NPj sharing with NPi a noun other than the head (compound nouns): 25 cases; * NPj with the antecedent being an event: 40 cases; * NPj with the antecedent being an implicit discourse topic: 15 cases; * other types of inference holding between NPj and the antecedent: 37 cases.</Paragraph> <Paragraph position="2"> Due to the nature of the relations, only some of them were expected to be found in WordNet. For synonymy, hypernymy and meronymy, 39% of the 38 cases could be resolved on the basis of WordNet. This related work shows the large variety of cases one can find in a class such as bridging. In our work we concentrated on coreference relations; these correspond to the synonymy, hypernymy and proper-name sub-classes evaluated in (Poesio et al., 1997).</Paragraph> <Paragraph position="3"> The technique presented in (Schulte im Walde, 1997), based on lexical acquisition from the British National Corpus, was evaluated against the same cases as (Poesio et al., 1997). For synonymy, hypernymy and meronymy, it was reported that 22% of the 38 cases were resolved.
In (Poesio et al., 2002), the inclusion of syntactic patterns improved the resolution of meronymy in particular, with 66% of the meronymy cases being resolved. Bunescu (2003) reports, for his method for resolving associative anaphora (anaphoric relations between non-coreferent entities), a precision of 53% at a recall of 22.7%.</Paragraph> </Section> </Paper>