<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0713">
  <Title>An Algorithm for Resolving Individual and Abstract Anaphora in Danish Texts and Dialogues</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Tests and Evaluation
</SectionTitle>
    <Paragraph position="0"> We have manually tested dar on randomly chosen texts and dialogues from our collections.</Paragraph>
    <Paragraph position="1"> The performance of dar on dialogues has been compared with that of es00. The function for resolving IPAs (ResolveIpa) has similarly been tested on texts from which APAs were excluded, and the results have been compared with those obtained by testing bfp (Brennan et al., 1987) and str98 (Strube, 1998). In all tests, intrasentential anaphors were resolved manually, and expletive and cataphoric uses of pronouns were marked and excluded. Dialogue act units were marked and classified by three persons following the strategy proposed by Eckert and Strube (2000). The reliability for the two annotation tasks (kappa statistics (Carletta, 1996)) was 0.94 and 0.90, respectively. Pronominal anaphors were marked, classified and resolved by two annotators. The kappa statistic for the pronoun classification was 0.86. When the annotators did not agree on a resolution, the pronoun was marked as ambiguous and excluded from the evaluation. The results obtained for bfp and str98 are given in table 1, while the results of dar's ResolveIpa are given in table 2. Because dar both classifies and resolves anaphors, both precision and recall are given in table 2. Precision is the proportion of the pronouns resolved by the algorithm that are correctly resolved, while recall is the proportion of all pronouns resolved by humans that are correctly resolved by the algorithm.</Paragraph>
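    <Paragraph>The precision and recall definitions above can be illustrated with a minimal Python sketch. The function name, the argument names and the counts are hypothetical and are not taken from the tables of the paper.

```python
def precision_recall(correct, resolved, gold):
    """Return (precision, recall) under the definitions given in the text.

    correct:  pronouns the algorithm resolved correctly
    resolved: pronouns the algorithm resolved at all
    gold:     pronouns resolved by the human annotators
    """
    # Guard against empty denominators instead of raising ZeroDivisionError.
    precision = correct / resolved if resolved else 0.0
    recall = correct / gold if gold else 0.0
    return precision, recall


# Hypothetical counts, chosen only for illustration.
p, r = precision_recall(correct=60, resolved=80, gold=100)
print(f"precision={p:.2f} recall={r:.2f}")
```

    Under this reading, an algorithm that declines to resolve a pronoun lowers its recall but not its precision, which is why both figures are reported for a system that also classifies pronouns.</Paragraph>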
    <Paragraph position="2"> The results indicate that ResolveIpa performs significantly better than bfp and str98 on the Danish texts. The better performance of dar is due to its account of focal and parallelism preferences and of the different reference mechanisms of personal and demonstrative pronouns. Furthermore, dar recognises some generic and inferable pronouns and excludes them from resolution, but it often fails to recognise antecedentless and inferable plural pronouns, because it often finds a plural nominal in the preceding discourse and proposes it as antecedent. The lack of commonsense knowledge explains many incorrectly resolved anaphors. The results of the test of the dar algorithm on written texts are in table 3. These results compare well with those of the function ResolveIpa (table 2). The discriminating rules correctly identify IPAs and APAs in the large majority of cases. Recognition failure often involves pronouns in contexts which are not covered by the discriminating rules. In particular, dar fails to resolve singular neuter gender pronouns with distant antecedents and to identify vague anaphors, because it always &quot;finds&quot; an antecedent in the context ranking. Correct resolution in these cases requires a deep analysis of the context. The results of applying dar and es00 to Danish dialogues are reported in table 4 (see footnote 11). In the last column the overall performance of the two algorithms is given as f-measure (F), which is defined as $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$</Paragraph>
    <Paragraph position="4"> where P is precision, R is recall and $\alpha$ is the weight of P and R. We have assigned the same weight to P and R ($\alpha = 0.5$) and thus $F = \frac{2PR}{P+R}$. The results of the tests indicate that dar resolves IPAs significantly better than es00 (which uses str98). The better performance of dar is also due to its larger resolution scope compared with the one used in es00.</Paragraph>
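    <Paragraph>As a worked check of the f-measure, the following Python sketch assumes the standard weighted harmonic-mean form, F = 1 / (alpha/P + (1 - alpha)/R), and confirms that equal weighting reduces it to 2PR/(P+R). The function name and the input values are hypothetical and do not come from table 4.

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r.

    alpha is the weight given to precision; 1 - alpha goes to recall.
    """
    # The harmonic mean is undefined when either value is zero; return 0.0.
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


# Hypothetical precision and recall, chosen only for illustration.
p, r = 0.75, 0.60
balanced = f_measure(p, r)            # alpha = 0.5
closed_form = 2 * p * r / (p + r)     # the simplified expression in the text
print(round(balanced, 4), round(closed_form, 4))
```

    Both expressions give the same value for alpha = 0.5. A larger alpha would shift the score towards precision, a smaller one towards recall.</Paragraph>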
    <Paragraph position="5"> dar correctly resolves more Danish demonstrative pronouns than es00, because it accounts for language-specific particularities. In general, however, the resolution results for APAs are similar to those obtained for es00. This is not surprising, because dar uses the same resolution strategy on these pronouns. dar performs better on texts than on dialogues. This reflects the more complex nature of dialogues. [Footnote 11: We extended es00 with the Danish-specific identification rules before applying it.]</Paragraph>
    <Paragraph position="6"> The results indicate that the IPA/APA discriminating rules also work well on dialogues. The cases of resolution failure were the same as for the texts. As an experiment, we applied dar to the dialogues without relying on the predefined dialogue structure. In this test the recognition of IPAs and APAs was still good; however, the success rate for IPAs was 60.1% and for APAs only 39.3%. Many errors were due to the fact that antecedents were searched for in the preceding discourse in linear order and that ungrounded utterances were included in the discourse model.</Paragraph>
  </Section>
</Paper>