<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2143"> <Title>Robust Pronoun Resolution with Limited Knowledge</Title> <Section position="4" start_page="871" end_page="874" type="evalu"> <SectionTitle> 3. Evaluation </SectionTitle> <Paragraph position="0"> For practical reasons, the approach presented does not incorporate syntactic and semantic information (other than a list of domain terms), and it is not realistic to expect its performance to be as good as that of an approach which makes use of syntactic and semantic knowledge in the form of constraints and preferences.</Paragraph> <Paragraph position="1"> The lack of syntactic information, for instance, means giving up c-command constraints and subject preference (or, on other occasions, object preference; see Mitkov 1995) which could be used in center tracking. Syntactic parallelism, useful in discriminating between identical pronouns on the basis of their syntactic function, also has to be forgone. Lack of semantic knowledge rules out the use of verb semantics and semantic parallelism. Our evaluation, however, suggests that much less is lost than might be feared. In fact, our evaluation shows that the results are comparable to those of syntax-based methods (Lappin & Leass 1994). We believe that the high success rate is due to the fact that a number of antecedent indicators are taken into account and no factor is given absolute preference. In particular, this strategy can often override incorrect decisions linked with strong centering preference (Mitkov & Belguith 1998) or with syntactic and semantic parallelism preferences (see below).</Paragraph> <Section position="1" start_page="871" end_page="872" type="sub_section"> <SectionTitle> 3.1 Evaluation A </SectionTitle> <Paragraph position="0"> Our first evaluation exercise (Mitkov & Stys 1997) was based on a random sample text from a technical manual in English (Minolta 1994). There were 71 pronouns in the 140-page technical manual; 7 of the pronouns were non-anaphoric and 16 exophoric. 
The resolution of anaphors was carried out with a success rate of 95.8%. Since the approach is robust (an attempt is made to resolve each anaphor and a proposed antecedent is returned), this figure represents both &quot;precision&quot; and &quot;recall&quot; in MUC terminology. To avoid any terminological confusion, we shall therefore use the more neutral term &quot;success rate&quot; while discussing the evaluation. In order to evaluate the effectiveness of the approach and to explore whether, and how far, it is superior to baseline models for anaphora resolution, we also tested the sample text with (i) a Baseline Model which checks agreement in number and gender and, where more than one candidate remains, picks as antecedent the most recent subject matching the gender and number of the anaphor, and (ii) a Baseline Model which picks as antecedent the most recent noun phrase that matches the gender and number of the anaphor.</Paragraph> <Paragraph position="1"> The success rate of the &quot;Baseline Subject&quot; was 29.2%, whereas the success rate of &quot;Baseline Most Recent NP&quot; was 62.5%. Given that our knowledge-poor approach is essentially an enhancement of a baseline model with a set of antecedent indicators, we see a dramatic improvement in performance (95.8%) when these preferences are called upon.</Paragraph> <Paragraph position="2"> Typically, our preference-based model proved superior to both baseline models when the antecedent was neither the most recent subject nor the most recent noun phrase matching the anaphor in gender and number. 
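The indicator-based scoring that drives these decisions can be sketched in Python as follows. This is a hypothetical illustration (the candidate representation and function names are ours); the indicator weights are taken from the worked example that follows:

```python
def aggregate_score(indicator_scores):
    """Sum the antecedent-indicator scores awarded to one candidate NP."""
    return sum(indicator_scores.values())

def resolve(candidates):
    """Propose as antecedent the candidate with the highest aggregate score."""
    return max(candidates, key=lambda c: aggregate_score(c["indicators"]))

# Scores from the worked example in the text ("Identify the drawer by the
# lit paper port LED and add paper to it"):
candidates = [
    {"np": "the drawer",
     "indicators": {"definiteness": 1, "givenness": 0, "term_preference": 1,
                    "indicating_verbs": 1, "lexical_reiteration": 0,
                    "section_heading": 0, "collocation": 0,
                    "referential_distance": 2, "non_prepositional_np": 0,
                    "immediate_reference": 2}},   # aggregate 7
    {"np": "the lit paper port LED",
     "indicators": {"definiteness": 1, "givenness": 0, "term_preference": 1,
                    "indicating_verbs": 0, "lexical_reiteration": 0,
                    "section_heading": 0, "collocation": 0,
                    "referential_distance": 2, "non_prepositional_np": 0,
                    "immediate_reference": 0}},   # aggregate 4
]

antecedent = resolve(candidates)["np"]  # "the drawer"
```

No single indicator is decisive on its own: the winning candidate is simply the one whose summed scores are highest.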
Example: Identify the drawer_i by the lit paper port LED and add paper to it_i.</Paragraph> <Paragraph position="3"> The aggregate score for &quot;the drawer&quot; is 7 (definiteness 1 + givenness 0 + term preference 1 + indicating verbs 1 + lexical reiteration 0 + section heading 0 + collocation 0 + referential distance 2 + non-prepositional noun phrase 0 + immediate reference 2 = 7), whereas the aggregate score for the most recent matching noun phrase (&quot;the lit paper port LED&quot;) is 4 (definiteness 1 + givenness 0 + term preference 1 + indicating verbs 0 + lexical reiteration 0 + section heading 0 + collocation 0 + referential distance 2 + non-prepositional noun phrase 0 + immediate reference 0 = 4).</Paragraph> <Paragraph position="4"> From this example we can also see that our knowledge-poor approach successfully tackles cases in which the anaphor and the antecedent have not only different syntactic functions but also different semantic roles. Knowledge-based approaches usually have difficulty in such situations because they employ preferences such as &quot;syntactic parallelism&quot; or &quot;semantic parallelism&quot;. Our robust approach does not use these, because it has no information about the syntactic structure of the sentence or about the syntactic function/semantic role of each individual word.</Paragraph> <Paragraph position="5"> As far as typical failure cases are concerned, we expect the knowledge-poor approach to have difficulty with sentences that have a more complex syntactic structure. This should not be surprising, given that the approach does not rely on any syntactic knowledge and, in particular, does not produce a parse tree. 
Indeed, the approach fails on the sentence: The paper through key can be used to feed [a blank sheet of paper]_i through the copier out into the copy tray without making a copy on it_i.</Paragraph> <Paragraph position="6"> where &quot;a blank sheet of paper&quot; scores only 2, as opposed to &quot;the paper through key&quot;, which scores 6.</Paragraph> </Section> <Section position="2" start_page="872" end_page="872" type="sub_section"> <SectionTitle> 3.2 Evaluation B </SectionTitle> <Paragraph position="0"> We carried out a second evaluation of the approach on a different set of sample texts from the genre of technical manuals (the 47-page Portable Style-Writer User's Guide; Stylewriter 1994). Out of 223 pronouns in the text, 167 were non-anaphoric (deictic and non-anaphoric &quot;it&quot;). The evaluation was carried out manually to ensure that no additional error was introduced (e.g. due to possible wrong sentence/clause detection or POS tagging). Another reason for doing it by hand was to ensure a fair comparison with Breck Baldwin's method which, not being available to us, had to be hand-simulated (see 3.3).</Paragraph> <Paragraph position="1"> The evaluation indicated an 83.6% success rate. The &quot;Baseline subject&quot; model tested on the same data scored 33.9% recall and 67.9% precision, whereas &quot;Baseline most recent&quot; scored 66.7%. The lower figure for &quot;Baseline subject&quot; corresponds to &quot;recall&quot; and the higher figure to &quot;precision&quot;: this &quot;version&quot; can be assessed in terms of both measures because it is not robust - in the event of no subject being available, it is not able to propose an antecedent (the manual used as evaluation text contained many imperative zero-subject sentences).</Paragraph> <Paragraph position="2"> In the second experiment we also evaluated the approach from the point of view of its &quot;critical success rate&quot;. This measure (Mitkov 1998b) applies only to anaphors which are &quot;ambiguous&quot; from the point of view of number and gender (i.e. to those &quot;tough&quot; anaphors which, after activating the gender and number filters, still have more than one candidate for antecedent) and is indicative of the performance of the antecedent indicators. Our evaluation established the critical success rate as 82%.</Paragraph> <Paragraph position="3"> A case where the system failed was when the anaphor and the antecedent were in the same sentence but preference was given to a candidate in the preceding sentence. This and other cases suggest that it might be worthwhile reconsidering/refining the weights for the indicator &quot;referential distance&quot;.</Paragraph> <Paragraph position="4"> Similarly to the first evaluation, we found that the robust approach was not very successful on sentences with very complex syntax - a price we have to pay for the &quot;convenience&quot; of developing a knowledge-poor system.</Paragraph> <Paragraph position="5"> The results from experiment 1 and experiment 2 can be summarised in the following (statistically) slightly more representative figures. 
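The evaluation measures used above can be stated compactly. The following is a hypothetical Python sketch (function names are ours, and the illustrative counts are merely chosen to be consistent with the percentages reported for &quot;Baseline subject&quot;):

```python
def success_rate(correct, total_anaphors):
    """Robust system: an antecedent is proposed for every anaphor,
    so precision and recall coincide in a single 'success rate'."""
    return correct / total_anaphors

def recall_precision(correct, attempted, total_anaphors):
    """Non-robust system (e.g. a subject-based baseline that cannot
    answer when no subject is available): recall and precision differ."""
    return correct / total_anaphors, correct / attempted

def critical_success_rate(correct_ambiguous, ambiguous):
    """Success rate over only those anaphors that still have more than
    one candidate after the gender and number filters."""
    return correct_ambiguous / ambiguous

# Hypothetical counts consistent with 33.9% recall / 67.9% precision
# over the 56 anaphoric pronouns of the second evaluation text:
recall, precision = recall_precision(correct=19, attempted=28,
                                     total_anaphors=56)
```

The point of the sketch is the differing denominators: recall divides by all anaphors, precision only by those the system attempted.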
If we define the &quot;discriminative power&quot; of an antecedent indicator as the ratio &quot;number of successful antecedent identifications when this indicator was applied&quot; / &quot;number of applications of this indicator&quot; (for the penalising indicators - non-prepositional noun phrase and definiteness - the ratio is computed over unsuccessful identifications instead), immediate reference emerges as the most discriminative indicator (100%), followed by non-prepositional noun phrase (92.2%), collocation (90.9%), section heading (61.9%), lexical reiteration (58.5%), givenness (49.3%), term preference (35.7%) and referential distance (34.4%). The relatively low figures for the majority of indicators should not come as a surprise: firstly, in most cases a candidate was picked (or rejected) as antecedent on the basis of a number of different indicators applied jointly and, secondly, most anaphors had a relatively high number of candidates for antecedent.</Paragraph> <Paragraph position="7"> In terms of frequency of use (&quot;number of non-zero applications&quot; / &quot;number of anaphors&quot;), the most frequently used indicator proved to be referential distance, used in 98.9% of the cases, followed by term preference (97.8%), givenness (83.3%), lexical reiteration (64.4%), definiteness (40%), section heading (37.8%), immediate reference (31.1%) and collocation (11.1%). 
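The two ratios just defined can be tabulated directly from the figures reported above; a small sketch (variable and key names are ours):

```python
# Discriminative power: successful identifications / applications
# (for the penalising indicators, unsuccessful identifications / applications).
discriminative_power = {
    "immediate_reference": 1.000, "non_prepositional_np": 0.922,
    "collocation": 0.909, "section_heading": 0.619,
    "lexical_reiteration": 0.585, "givenness": 0.493,
    "term_preference": 0.357, "referential_distance": 0.344,
}

# Frequency of use: non-zero applications / number of anaphors.
frequency_of_use = {
    "referential_distance": 0.989, "term_preference": 0.978,
    "givenness": 0.833, "lexical_reiteration": 0.644,
    "definiteness": 0.400, "section_heading": 0.378,
    "immediate_reference": 0.311, "collocation": 0.111,
}

most_discriminative = max(discriminative_power, key=discriminative_power.get)
most_frequent = max(frequency_of_use, key=frequency_of_use.get)
# The two rankings lead with different indicators.
```
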
As expected, the most frequent indicators were not the most discriminative ones.</Paragraph> </Section> <Section position="3" start_page="872" end_page="874" type="sub_section"> <SectionTitle> 3.3 Comparison to similar approaches: comparative evaluation of Breck Baldwin's CogNIAC </SectionTitle> <Paragraph position="0"> We felt it appropriate to extend the evaluation of our approach by comparing it to Breck Baldwin's CogNIAC (Baldwin 1997), which features &quot;high precision coreference with limited knowledge and linguistics resources&quot;. Both approaches share common principles (both are knowledge-poor and use a POS tagger to provide the input), which makes a comparison appropriate.</Paragraph> <Paragraph position="1"> Given that our approach is robust and returns an antecedent for every pronoun, in order to make the comparison as fair as possible we used CogNIAC's &quot;resolve all&quot; version, simulating it manually on the same training data used in evaluation B above. CogNIAC successfully resolved the pronouns in 75% of the cases, a result comparable with those described in (Baldwin 1997). For the training data from the genre of technical manuals, rule 5 (see Baldwin 1997) was the most frequently used (39% of the cases, 100% success), followed by rule 8 (33% of the cases, 33% success), rule 7 (11%, 100%), rule 1 (9%, 100%) and rule 3 (7.4%, 100%).</Paragraph> <Paragraph position="2"> It would be fair to say that even though the results show the superiority of our approach on the training data used (the genre of technical manuals), they cannot automatically be generalised to other genres or to unrestricted texts; for a more accurate picture, further extensive tests are necessary.</Paragraph> <Paragraph position="3"> 4. Adapting the robust approach for other languages An attractive feature of any NLP approach would be its language &quot;universality&quot;. 
While we acknowledge that most monolingual NLP approaches are not automatically transferable (with the same degree of efficiency) to other languages, it would be highly desirable if this could be done with minimal adaptation. We used the robust approach as a basis for developing a genre-specific reference resolution approach for Polish. As expected, some of the preferences had to be modified in order to fit the specific features of Polish (Mitkov & Stys 1997). For the time being, we are using the same scores for Polish.</Paragraph> <Paragraph position="4"> The evaluation for Polish was based on technical manuals available on the Internet (Internet Manual 1994; Java Manual 1998). The sample texts contained 180 pronouns, among which were 120 instances of exophoric reference (most of them zero pronouns). The robust approach adapted for Polish demonstrated a high success rate of 93.3% in resolving anaphors (with a critical success rate of 86.2%). Similarly to the evaluation for English, we compared the approach for Polish with (i) a Baseline Model which discounts candidates on the basis of agreement in number and gender and, if more than one candidate remains, selects as the antecedent the most recent subject matching the anaphor in gender and number, and (ii) a Baseline Model which checks agreement in number and gender and, if more than one candidate remains, picks as the antecedent the most recent noun phrase that agrees with the anaphor.</Paragraph> <Paragraph position="5"> Our preference-based approach showed clear superiority over both baseline models. The first Baseline Model (Baseline Subject) was successful in only 23.7% of the cases, whereas the second (Baseline Most Recent) had a success rate of 68.4%. 
Therefore, the 93.3% success rate (see above) represents a dramatic improvement, which is due to the use of the antecedent indicators.</Paragraph> <Paragraph position="6"> We have recently adapted the approach for Arabic as well (Mitkov & Belguith 1998). Our evaluation, based on 63 anaphors from a technical manual (Sony 1992), indicates a success rate of 95.2% (and a critical success rate of 89.3%).</Paragraph> </Section> </Section> </Paper>