<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1040"> <Title>The Influence of Minimum Edit Distance on Reference Resolution</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Our Features </SectionTitle> <Paragraph position="0"> The features for our study were selected according to three criteria. We distinguish between features assigned to noun phrases and features assigned to the potential coreference relation. All features are listed in Table 3 together with their respective possible values. The grammatical function of referring expressions has often been claimed to be an important factor in reference resolution and was therefore included (features 2 and 6). The surface realization of referring expressions also seems to influence coreference relations (features 3 and 7). Since we use a German corpus, and in German gender and semantic class do not necessarily coincide (i.e., objects are not necessarily neuter, as they are in English), we also provide a semantic class feature (5 and 9), which captures the difference between humans, concrete objects, and abstract objects. This basically corresponds to the gender attribute in English, for which we introduced an agreement feature (4 and 8). The feature wdist (10) captures the distance in words between anaphor and antecedent, while ddist (11) does the same in terms of sentences and mdist (12) in terms of markables. The equivalence in grammatical function between anaphor and potential antecedent is captured by the feature syn par (13), which is true if both anaphor and antecedent are subjects or both are objects, and false otherwise. The string ident feature (14) appears to be of major importance, since it provides high precision in reference resolution (it almost never fails), while the substring match feature (15) could potentially provide better recall.</Paragraph>

Table 3: Features and their possible values

Document-level features
 1. doc id              document number (1 ... 250)

NP-level features
 2. ante gram func      grammatical function of antecedent (subject, object, other)
 3. ante npform         form of antecedent (definite NP, indefinite NP, personal pronoun, demonstrative pronoun, possessive pronoun, proper name)
 4. ante agree          agreement in person, gender, number
 5. ante semanticclass  semantic class of antecedent (human, concrete object, abstract object)
 6. ana gram func       grammatical function of anaphor (subject, object, other)
 7. ana npform          form of anaphor (definite NP, indefinite NP, personal pronoun, demonstrative pronoun, possessive pronoun, proper name)
 8. ana agree           agreement in person, gender, number
 9. ana semanticclass   semantic class of anaphor (human, concrete object, abstract object)

Coreference-level features
10. wdist               distance between anaphor and antecedent in words (1 ... n)
11. ddist               distance between anaphor and antecedent in sentences (0, 1, >1)
12. mdist               distance between anaphor and antecedent in markables (1 ... n)
13. syn par             anaphor and antecedent have the same grammatical function (yes, no)
14. string ident        anaphor and antecedent consist of identical strings (yes, no)
15. substring match     one string contains the other (yes, no)
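To make the feature set concrete, here is a minimal sketch of how a single (antecedent, anaphor) instance could be assembled. The Markable container, its field names, and the form abbreviations are illustrative assumptions, not the paper's implementation; only defNP, NE, PPER, PDS, and PPOS are abbreviations actually used in Section 4.2.

```python
# Minimal sketch of building one (antecedent, anaphor) instance with the
# Table 3 features. Markable is a hypothetical container; the paper does
# not specify its implementation.
from dataclasses import dataclass

PRONOUN_FORMS = {"PPER", "PDS", "PPOS"}  # personal, demonstrative, possessive

@dataclass
class Markable:
    doc_id: int      # document number (feature 1)
    index: int       # position in the document's markable sequence
    sentence: int    # sentence number
    word_pos: int    # position in the document, counted in words
    gram_func: str   # "subject" | "object" | "other"
    npform: str      # e.g. "defNP", "indefNP", "PPER", "PDS", "PPOS", "NE"
    agree: str       # agreement in person, gender, number
    semclass: str    # "human" | "concrete" | "abstract"
    string: str      # surface string

def make_instance(ante: Markable, ana: Markable) -> dict:
    return {
        "doc_id": ana.doc_id,
        "ante_gram_func": ante.gram_func,
        "ante_npform": ante.npform,
        "ante_agree": ante.agree,
        "ante_semanticclass": ante.semclass,
        "ana_gram_func": ana.gram_func,
        "ana_npform": ana.npform,
        "ana_agree": ana.agree,
        # Reset to missing for pronominal anaphors: their semantic class
        # would not be known before resolution (see Section 4.2).
        "ana_semanticclass": None if ana.npform in PRONOUN_FORMS
                             else ana.semclass,
        "wdist": ana.word_pos - ante.word_pos,
        "ddist": ana.sentence - ante.sentence,
        "mdist": ana.index - ante.index,
        "syn_par": ante.gram_func == ana.gram_func
                   and ante.gram_func in ("subject", "object"),
        "string_ident": ante.string == ana.string,
        "substring_match": ante.string in ana.string
                           or ana.string in ante.string,
    }
```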
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Baseline Results </SectionTitle> <Paragraph position="0"> Using the features of Table 3, we trained decision tree classifiers using C5.0, with standard settings for pre- and post-pruning. As several features are discrete, we allowed the algorithm to use subsets of feature values in questions such as &quot;Is ana npform in {...}?&quot;.</Paragraph> <Paragraph position="2"> We extracted rules from the decision trees, as we found them to give superior results. In our experiments, the value of the ana semanticclass attribute was reset to missing for pronominal anaphors, because in a realistic setting the semantic class of a pronoun obviously is not available prior to its resolution.</Paragraph> <Paragraph position="3"> Using 10-fold cross validation (with about 25 documents in each of the 10 bins), we achieved an overall error rate of 1.74%. Always guessing the far more frequent negative class would give an error rate of 2.88% (70019 out of 72093 cases). The precision for finding positive cases is 88.60%, the recall is 45.32%. The equally weighted F-measure (F = 2PR/(P + R)) is 59.97%.</Paragraph> <Paragraph position="4"> Since we were not satisfied with this result, we examined the performance of the features. Surprisingly, and against our linguistic intuition, the ana npform feature appeared to be the most important one. We therefore expected considerable differences in the performance of our classifier with respect to the NP form of the anaphor under consideration. We split the data into subsets defined by the NP form of the anaphor and trained the classifier on these data sets. The results confirmed that the classifier performs poorly on definite NPs (defNP) and demonstrative pronouns (PDS), moderately on proper names (NE), and quite well on personal pronouns (PPER) and possessive pronouns (PPOS); the results are reported in Table 4. As definite NPs account for 792 out of 2074 (38.19%) of the positive cases (and for 48125 (66.75%) of all cases), it is evident that the weak performance on definite NPs, especially the low recall of only 8.71%, clearly impairs the overall results. Demonstrative pronouns appear in only 0.87% of the positive cases, so the inferior performance there matters less. Proper names (NE), however, are more problematic, as they have to be considered in 644 (31.05%) of the positive cases (22.96% of all).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Additional features </SectionTitle> <Paragraph position="0"> Since definite noun phrases constitute more than a third of the anaphoric expressions in our corpus, we investigated why resolution performed so poorly for these cases. The major reason may be that the resolution algorithm relies on surface features and has no access to world or domain knowledge, on which we did not want to depend anyway, since we were mainly interested in cheap features. However, the string ident and substring match features did not perform very well either. The string ident feature had a very high precision (it almost never failed) but a poor recall. The substring match feature was not too helpful either, as it does not trigger in many cases. We therefore investigated ways to raise the recall of the string ident and substring match features without losing too much precision.</Paragraph> <Paragraph position="1"> A look at some relevant cases (Table 5) suggested that a large number of anaphoric definite NPs share some substring with their antecedent while being neither identical to it nor completely included in it. What is needed is a weakened form of the string ident and substring match features. Soon et al. (2001) removed determiners before comparing the strings. Other researchers, like Vieira and Poesio (2000), used information about the syntactic structure and compared only the syntactic heads of the phrases. However, the feature used by Soon et al. (2001) is neither sufficient nor language independent, and the one used by Vieira and Poesio (2000) is not cheap, since it relies on a syntactic analysis.</Paragraph> <Paragraph position="2"> We were looking for a feature which would give us the improvements of the features used by other researchers without their associated costs. Hence we considered the minimum edit distance (MED) (Wagner and Fischer, 1974), which has been used in the past for spelling correction and in speech recognizer evaluation (where it is termed &quot;accuracy&quot;). The MED computes the similarity of strings by taking into account the minimum number of editing operations (substitutions, insertions, deletions) needed to transform one string into the other (see also Jurafsky and Martin (2000, p. 153ff. and p. 271)).</Paragraph> <Paragraph position="3"> We included MED in our feature set by computing one value for each editing direction. Both values share the number of editing operations, but they differ when anaphor and antecedent have different lengths. The features ante med (16) and ana med (17) are computed from the number of substitutions s, insertions i, and deletions d, and from the length of the potential antecedent m or of the anaphor n, as shown in Table 6.</Paragraph>
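Since Table 6 itself is not reproduced here, the sketch below makes an explicit assumption: the normalization mirrors the speech-recognition &quot;accuracy&quot; measure mentioned above, i.e., ante med = 100 * (m - (s + i + d)) / m, and analogously for ana med with n. The Wagner-Fischer dynamic program is standard; the function names are ours.

```python
# Minimal Wagner-Fischer edit distance plus the two directional MED
# features. The 100 * (len - ops) / len normalization is our assumption,
# modeled on the ASR accuracy measure; the paper's exact Table 6
# formulas may differ.

def min_edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to transform a into b (Wagner and Fischer, 1974)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                                  # delete all of a
    for j in range(1, n + 1):
        d[0][j] = j                                  # insert all of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution or match
    return d[m][n]

def med_features(ante: str, ana: str) -> tuple[float, float]:
    """ante_med (feature 16) and ana_med (feature 17). Both share the
    operation count; they diverge when the strings differ in length."""
    ops = min_edit_distance(ante, ana)               # s + i + d
    return (100.0 * (len(ante) - ops) / len(ante),
            100.0 * (len(ana) - ops) / len(ana))
```

For identical strings both values are 100, and they degrade gradually as the strings diverge, which is precisely the weakened form of string ident and substring match motivated above.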
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Improved Results </SectionTitle> <Paragraph position="0"> The inclusion of the MED features 16 and 17 led to a significant improvement (Table 7): the F-measure rises to 67.98%, a gain of about 8 points.</Paragraph> <Paragraph position="1"> Considering the classifiers trained and tested on the data partitions according to ana npform, we can see that the improvements mainly stem from defNP and NE. For definite NPs we gained about 18 points of F-measure, for proper names about 11 points. For pronouns, the results did not vary much.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 MUC-style results </SectionTitle> <Paragraph position="0"> It is common practice to evaluate coreference resolution systems according to a scheme originally developed for the MUC evaluations by Vilain et al. (1995).</Paragraph> <Paragraph position="1"> In order to be able to apply it to our classifier, we first implemented a simple reference resolution algorithm. This algorithm processes a text incrementally by iterating over all referring expressions.</Paragraph> <Paragraph position="2"> Upon encountering a possibly anaphoric expression, it moves upwards (i.e., towards the beginning of the text) and submits each pair of potential anaphor and potential antecedent to a classifier trained on the features described above. For the reasons mentioned in Section 4.2, the value of the ana semanticclass attribute is reset to missing if the potential anaphor is a pronominal form. The algorithm then selects the first pair (if any) that the classifier labels as coreferential, as in the sketch below.
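A hedged sketch of this loop, reusing the make_instance feature extraction sketched in Section 4.1 and assuming a classify predicate that stands in for the trained decision tree (both names are ours, not the paper's):

```python
# Sketch of the incremental resolution algorithm described above.
# markables are assumed to be in document order; classify() stands in
# for the trained decision tree and returns True for coreferential pairs.

def resolve(markables, classify):
    """Return one (antecedent, anaphor) link per resolved anaphor."""
    links = []
    for j, ana in enumerate(markables):
        # Move upwards, i.e., towards the beginning of the text.
        for ante in reversed(markables[:j]):
            instance = make_instance(ante, ana)  # Table 3 features (+ MED)
            if classify(instance):
                links.append((ante, ana))        # first positive pair wins
                break                            # proceed to the next anaphor
    return links
```

The sets of coreferring expressions induced by these links are what the Vilain et al. (1995) scheme then scores.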
Once a text has been completely processed, the resulting coreference classes are evaluated by comparing them to the original annotation according to the scheme proposed by Vilain et al. (1995). This scheme takes into account the particularities of coreference resolution by abstracting away from the question of whether individual pairs of anaphors and antecedents are found.</Paragraph> <Paragraph position="3"> Instead, it focusses on whether sets of coreferring expressions are correctly identified. In contrast to the experiments reported in Sections 4.2 and 4.4, our algorithm did not use a C5.0 classifier, but a J48 decision tree classifier, which is a Java re-implementation of C4.5. This was done for technical reasons, J48 being more easily integrated into our system. Accompanying experimentation revealed that J48's performance is only slightly inferior to that of C5.0 for our data.</Paragraph> <Paragraph position="4"> Again using 10-fold cross validation, we obtained the results given in Table 8.</Paragraph> </Section> </Section> </Paper>