<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0207">
  <Title>Corpus-Based Anaphora Resolution Towards Antecedent Preference</Title>
  <Section position="3" start_page="47" end_page="48" type="metho">
    <SectionTitle>
2 Corpus-Based Anaphora Resolution
</SectionTitle>
    <Paragraph position="0"> In this section we introduce a new approach to anaphora resolution based on coreferential properties automatically extracted from a training corpus. In the first step, the decision tree filter is trained on the linguistic, discourse, and coreference information annotated in the training corpus, which is described in section 2.1.</Paragraph>
    <Paragraph position="1">  The resolution system in Figure 1 applies the coreference filter (cf. section 2.2) to all anaphor-candidate pairs (Ai, Cj) found in the discourse history. The detection of anaphoric expressions is outside the scope of this paper; here it is reduced to reading the tags in our annotated corpus. Antecedent candidates are identified according to noun phrase part-of-speech tags. The reduced candidate set forms the input of the preference algorithm, which selects the most salient candidate as described in section 2.3.</Paragraph>
    <Paragraph position="2"> Preliminary experiments are conducted for the task of pronominal anaphora resolution and the performance of our system is evaluated in section 3.</Paragraph>
    <Section position="1" start_page="47" end_page="48" type="sub_section">
      <SectionTitle>
2.1 Data Corpus
</SectionTitle>
      <Paragraph position="0"> For our experiments we use the ATR-ITL Speech and Language Database (Takezawa et al., 1998), consisting of 500 Japanese spoken-language dialogs annotated with coreferential tags. It includes nominal, pronominal, and ellipsis annotations; the anaphoric expressions used in our experiments are limited to those referring to nominal antecedents (nominal: 2160, pronominal: 526, ellipsis: 3843).</Paragraph>
      <Paragraph position="1"> Besides the anaphor type, the corpus also includes morphosyntactic information such as stem form and inflection attributes for each surface word, as well as semantic codes for content words (Ohno and Hamanishi, 1981).</Paragraph>
      <Paragraph position="2"> [Figure 2: example dialog between a hotel receptionist (r) and a customer (c); the Japanese transcription and interlinear glosses are not reproduced here.] In the example dialog between the hotel receptionist (r) and a customer (c) listed in Figure 2, the proper noun (r1) "City Hotel" is tagged as the antecedent of the pronoun (c1) "there" as well as of the noun (c1) "hotel". An example of ellipsis is the omitted subject (c2) "it" referring to (r2) "spelling".</Paragraph>
      <Paragraph position="3"> According to the tagging guidelines used for our corpus, an anaphoric tag refers to the most recent antecedent found in the dialog. However, this antecedent might itself refer to a previous one, e.g. (r3) "here" → (c1) "there" → (r1) "City Hotel".</Paragraph>
      <Paragraph position="4"> Thus, the transitive closure between the anaphor and the first mention of the antecedent in the discourse history defines the set of positive examples, e.g. ("there", "City Hotel"), whereas the nominal candidates outside the transitive closure, paired with the anaphor, are considered negative examples of coreferential relationships.</Paragraph>
      <Paragraph position="6"> Based on the corpus annotations, we extract the frequency information of coreferential anaphor-antecedent pairs and non-referential pairs from the training data. For each coreferential and non-referential pair, the occurrences of surface form and stem form as well as semantic code combinations are counted.</Paragraph>
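The counting step above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the pair representation (here, a semantic-class combination) and the corpus format are assumptions.

```python
from collections import Counter

def count_pairs(tagged_pairs):
    """Count coreferential (freq+) and non-referential (freq-) occurrences
    of anaphor-candidate pairs, keyed here by a semantic-class combination."""
    freq_pos, freq_neg = Counter(), Counter()
    for anaphor, candidate, coreferent in tagged_pairs:
        key = (anaphor, candidate)
        if coreferent:
            freq_pos[key] += 1
        else:
            freq_neg[key] += 1
    return freq_pos, freq_neg

# Hypothetical training sample reproducing the Table 1 entry
# ({demonstratives}, {shop}): freq+ = 51, freq- = 18.
pairs = ([("{demonstratives}", "{shop}", True)] * 51
         + [("{demonstratives}", "{shop}", False)] * 18)
fp, fn = count_pairs(pairs)
```

In practice the same counting would be run separately for surface-form, stem-form, and semantic-code combinations, as the text describes.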
      <Paragraph position="8"> [Table 1: example frequency entries for pronominal anaphora, e.g. ({demonstratives}, {shop}): freq+ = 51, freq- = 18, ratio = 0.48.] In Table 1 some examples are given for pronominal anaphora, where the expressions "{...}" denote semantic classes assigned to the respective words.</Paragraph>
      <Paragraph position="9">  The values freq+, freq-, and ratio and their usage are described in more detail in section 2.3. Moreover, each dialog is subdivided into utterances consisting of one or more clauses. Therefore, distance features are available on the utterance, clause, candidate, and morpheme levels. For example, the distance values of the pronoun (r3) "here" and the antecedent (r1) "City Hotel" in our sample dialog in Figure 2 are d_utter=4, d_clause=7, d_cand=14, d_morph=40.</Paragraph>
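The four distance levels can be computed as simple index differences. A sketch under the simplifying assumption that each mention carries (utterance, clause, candidate, morpheme) indices counted from the dialog start; the position representation is ours, not the paper's:

```python
def distance_features(anaphor_pos, candidate_pos):
    """Distance between an anaphor and a candidate on four levels.
    Each position is a tuple (utterance, clause, candidate, morpheme)
    of running indices from the start of the dialog."""
    names = ("d_utter", "d_clause", "d_cand", "d_morph")
    return {n: a - c for n, a, c in zip(names, anaphor_pos, candidate_pos)}

# The sample from Figure 2: pronoun "here" (r3) vs. antecedent
# "City Hotel" (r1), assuming the antecedent sits at the dialog start.
d = distance_features((4, 7, 14, 40), (0, 0, 0, 0))
```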
    </Section>
    <Section position="2" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
2.2 Coreference Analysis
</SectionTitle>
      <Paragraph position="0"> To learn the coreference relations from our corpus we have chosen a C4.5-like machine learning algorithm without pruning. The training attributes consist of lexical word attributes (surface word, stem form, part-of-speech, semantic code, morphological attributes) applied to the anaphor, the antecedent candidate, and the clause predicate. In addition, features like attribute agreement, distance, and frequency ratio are checked for each anaphor-candidate pair. The decision tree output consists of only two classes, determining the coreference relation for the given anaphor-candidate pair.</Paragraph>
      <Paragraph position="1"> During anaphora resolution the decision tree is used as a module determining the coreferential property of each anaphor-candidate pair. For each detected anaphoric expression a candidate list is created. The decision tree filter is then successively applied to all anaphor-candidate pairs.</Paragraph>
      <Paragraph position="2"> If the decision tree results in the non-reference class, the candidate is judged as irrelevant and eliminated from the list of potential antecedents forming the input of the preference selection algorithm.</Paragraph>
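The filtering step can be sketched as below. The trained C4.5-like tree is treated as a black-box predicate; the toy stand-in classifier and the feature names are hypothetical, for illustration only:

```python
def filter_candidates(anaphor, candidates, is_coreferent):
    """Apply a trained coreference classifier to each anaphor-candidate
    pair and keep only candidates it judges coreferential; the rest are
    eliminated before preference selection."""
    return [c for c in candidates if is_coreferent(anaphor, c)]

# Toy stand-in for the decision tree: require semantic-class agreement.
tree = lambda a, c: c["sem"] in a["compatible"]

anaphor = {"surface": "there", "compatible": {"location"}}
cands = [{"surface": "Tanaka", "sem": "person"},
         {"surface": "City Hotel", "sem": "location"}]
kept = filter_candidates(anaphor, cands, tree)
```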
    </Section>
    <Section position="3" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
2.3 Preference Selection
</SectionTitle>
      <Paragraph position="0"> The primary order of candidates is given by their word distance from the anaphoric expression. A straightforward preference strategy we could choose is the selection of the most recent candidate (MRC) as the antecedent, i.e., the first element of the candidate list. The success rate of this baseline test, however, is quite low as shown in section 3.</Paragraph>
      <Paragraph position="1"> However, this result does not mean that the recency factor is unimportant for the determination of saliency in this task. One reason for the poor performance is that the baseline test is applied to the unfiltered set of candidates, resulting in the frequent selection of non-referential antecedents. Additionally, long-range references to candidates introduced first in the dialog are quite frequent in our data.</Paragraph>
      <Paragraph position="2">  An examination of our corpus suggests that similarities to references in our training data might be useful for the identification of those antecedents. Therefore, we propose a preference selection scheme based on the combination of distance and frequency information.</Paragraph>
      <Paragraph position="3"> First, utilizing statistical information about the frequency of coreferential anaphor-antecedent pairs (freq+) and non-referential pairs (freq-) extracted from the training data, we define the ratio of a given reference pair as follows (to keep the formula simple, the frequency types are omitted; cf. Table 1):

ratio = -δ                                    if freq+ = freq- = 0
ratio = (freq+ - freq-) / (freq+ + freq-)     otherwise

The value of ratio is in the range [-1, +1], whereby ratio = -1 in the case of exclusively non-referential relations and ratio = +1 in the case of exclusively coreferential relationships. In order for referential pairs occurring in the training corpus with ratio = 0 to be preferred over those without frequency information, we slightly decrease the ratio value of the latter by a factor δ.</Paragraph>
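A direct transcription of the ratio definition; the concrete value of δ is an assumption, since the paper does not state it:

```python
DELTA = 1e-3  # small penalty delta; the paper's exact value is not given

def ratio(freq_pos, freq_neg):
    """ratio in [-1, +1]: -1 for exclusively non-referential pairs,
    +1 for exclusively coreferential ones.  Pairs never observed in
    training (freq+ = freq- = 0) get -DELTA, so observed pairs with
    ratio 0 are still preferred over unseen ones."""
    if freq_pos == 0 and freq_neg == 0:
        return -DELTA
    return (freq_pos - freq_neg) / (freq_pos + freq_neg)
```

For the Table 1 entry with freq+ = 51 and freq- = 18, this yields (51-18)/69 ≈ 0.48, matching the listed ratio.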
      <Paragraph position="4"> As mentioned above the distance plays a crucial role in our selection method, too. We define a preference value pref by normalizing the ratio value according to the distance dist given by the primary order of the candidates in the discourse.</Paragraph>
      <Paragraph position="5"> pref = ratio / dist

The pref value is calculated for each candidate and the precedence-ordered list of candidates is re-sorted to maximize the preference factor.</Paragraph>
      <Paragraph position="6"> Similarly to the baseline test, the first element of the re-sorted candidate list is chosen as the antecedent. Candidates with the same preference value keep their precedence order, so that a final decision is made even in the case of a draw.</Paragraph>
      <Paragraph position="7"> The robustness of our approach is ensured by the definition of a backup strategy which ultimately selects one candidate occurring in the history in the case that all antecedent candidates are rejected by the decision tree filter. For our experiments reported in section 3 we adopted the selection of the dialog-initial candidate as the backup strategy.</Paragraph>
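Putting the pieces together, the selection step can be sketched as follows. A simplified illustration: unseen pairs get a ratio of 0 here rather than -δ, and the backup candidate is passed in explicitly.

```python
def select_antecedent(candidates, ratios, backup):
    """Rank candidates by pref = ratio / dist and pick the maximum.
    `candidates` are in recency order, so dist is the 1-based position
    in that order.  Ties keep the precedence (recency) order because we
    return the first candidate reaching the best score.  If the filter
    rejected every candidate, fall back to `backup` (the dialog-initial
    candidate in our experiments)."""
    if not candidates:
        return backup
    scored = [(ratios.get(c, 0.0) / dist, c)
              for dist, c in enumerate(candidates, start=1)]
    best = max(score for score, _ in scored)
    return next(c for score, c in scored if score == best)
```

For example, with candidates ["hotel", "Tanaka", "City Hotel"] (most recent first) and ratios {"City Hotel": 0.9, "hotel": 0.2}, the pref values are 0.2, 0.0, and 0.3, so "City Hotel" wins despite being the most distant candidate.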
    </Section>
  </Section>
  <Section position="4" start_page="48" end_page="50" type="metho">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> For the evaluation of the experimental results described in this section we use F-measure metrics calculated from the recall and precision of the system performance. Let Σt denote the total number of tagged anaphor-antecedent pairs contained in the test data, Σf the number of these pairs passing the decision tree filter, and Σc the number of correctly selected antecedents.</Paragraph>
    <Paragraph position="1"> During evaluation we distinguish three classes: whether the correct antecedent is the first element of the candidate list (f), is in the candidate list (i), or is filtered out by the decision tree (o). The metrics F, recall (R) and precision (P) are defined as follows:</Paragraph>
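The formulas themselves did not survive extraction. The sketch below assumes the standard definitions consistent with the counts introduced above (recall over all tagged pairs, precision over the pairs passing the filter); this is our reconstruction, not the paper's own equations:

```python
def metrics(n_correct, n_filtered, n_tagged):
    """Recall, precision, and F-measure from the evaluation counts:
    n_correct  -- correctly selected antecedents,
    n_filtered -- tagged pairs passing the decision tree filter,
    n_tagged   -- all tagged anaphor-antecedent pairs.
    Assumed definitions: R = n_correct/n_tagged, P = n_correct/n_filtered,
    F = 2PR/(P+R)."""
    recall = n_correct / n_tagged
    precision = n_correct / n_filtered
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure
```

These assumed definitions are at least consistent with the reported numbers: R = 75.9% and P = 86.0% for DT+PREF give F ≈ 80.6%, as stated in section 3.1.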
    <Paragraph position="3"> In order to prove the feasibility of our approach we compare the four preference selection methods listed in Figure 3.</Paragraph>
    <Paragraph position="4"> The baseline test MRC selects the most recent candidate as the antecedent of an anaphoric expression. The necessity of the filter and preference selection components is shown by comparing the decision tree filter scheme DT (i.e., select the first element of the filtered candidate list) and the preference scheme PREF (i.e., re-sort the complete candidate list) against our combined method DT+PREF (i.e., re-sort the filtered candidate list).</Paragraph>
    <Paragraph position="5"> 5-way cross-validation experiments are conducted for pronominal anaphora resolution. The selected antecedents are checked against the annotated correct antecedents according to their morphosyntactic and semantic attributes.</Paragraph>
    <Section position="1" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
3.1 Training Size
</SectionTitle>
      <Paragraph position="0"> We use varied numbers of training dialogs (50-400) for the training of the decision tree and the extraction of the frequency information from the corpus.</Paragraph>
      <Paragraph position="1"> Open tests are conducted on 100 non-training dialogs whereas closed tests use the training data for evaluation. The results of the different preference selection methods are shown in Figure 4.</Paragraph>
      <Paragraph position="2"> The baseline test MRC succeeds in resolving only 43.9% of the most recent candidates correctly as the antecedent. The best F-measure rate for DT is 65.0% and for PREF the best rate is 78.1%, whereas the combination of both methods achieves a success rate of 80.6%.</Paragraph>
      <Paragraph position="5"> The PREF method seems to reach a plateau at around 300 dialogs, which is borne out by the closed test reaching a maximum of 81.1%. Comparing the recall rates of DT (61.2%) and DT+PREF (75.9%) with the PREF result, we might conclude that the decision tree is of little help, due to the side effect that 11.8% of the correct antecedents are filtered out.</Paragraph>
      <Paragraph position="6"> However, in contrast to the PREF algorithm, the DT method improves steadily with the training size, implying a lack of training data for the identification of potential candidates. Despite the sparse data, the filtering method proves to be very effective. The average number of candidates (history) for a given anaphor in our open data is 39, which the decision tree filter reduces to 11 potential candidates, a reduction rate of 71.8% (closed test: 81%). The number of trivial selection cases (only one candidate) increases from 2.7% (history) to 11.4% (filter; closed test: 21%).</Paragraph>
      <Paragraph position="7"> On average, two candidates are skipped in the history to select the correct antecedent.</Paragraph>
      <Paragraph position="8"> Moreover, the precision rates of DT (69.4%) and DT+PREF (86.0%) show that the utilization of the decision tree filter in combination with the statistical preference selection gains a relative improvement of 9% over the preference method and 16% over the filter method.</Paragraph>
      <Paragraph position="9"> Additionally, the system proves to be quite robust, because the decision tree filters out all candidates in only 1% of the open test samples. Selecting the candidate first introduced in the dialog as the backup strategy shows the best performance, due to the frequent dialog-initial references contained in our data.</Paragraph>
    </Section>
    <Section position="2" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
3.2 Feature Dependency
</SectionTitle>
      <Paragraph position="0"> In our approach, frequency ratio and distance information play a crucial role not only in the identification of potential candidates during decision tree filtering, but also in the calculation of the preference value for each antecedent candidate.</Paragraph>
      <Paragraph position="1"> In the first case these features are used independently to characterize the training samples whereas the preference selection method is based on the dependency between the frequency and distance values of the given anaphor-candidate pair in the context of the respective discourse. The relative importance of each factor is shown in Table 2.</Paragraph>
      <Paragraph position="2"> First, we compare our decision tree filter DT to those methods that do not use either frequency (DT-no-freq) or distance (DT-no-dist) information. Frequency information appears to be more relevant for the identification of potential candidates than the distance features extracted from the training corpus.</Paragraph>
      <Paragraph position="3"> The recall performance of DT-no-freq decreases by 7.6%, whereas DT-no-dist is only 1.1% below the result of the original DT filter (so far we have considered the decision tree filter just as a black-box tool; further investigation of the tree structures should give more evidence about the relative importance of the respective features). Moreover, the number of correct antecedents not passing the filter increases by 5.1% (DT-no-freq) and 0.7% (DT-no-dist).</Paragraph>
      <Paragraph position="4"> However, the distance factor proves to be quite important as a preference criterion. Relying only on the frequency ratio as the preference value, the recall performance of DT+PREF-no-dist is only 73.0%, down 2.9% from the original DT+PREF method.</Paragraph>
      <Paragraph position="5"> The effectiveness of our approach is not only based on the usage of single antecedent indicators extracted from the corpus, but also on the combination of these features for the selection of the most preferable candidate in the context of the given discourse.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="50" end_page="51" type="metho">
    <SectionTitle>
4 Related Research
</SectionTitle>
    <Paragraph position="0"> Due to the characteristics of the underlying data used in these experiments, a comparison with previous approaches in terms of absolute numbers gives us little evidence. However, the difficulty of our task can be gauged from the baseline experiment results reported in (Mitkov, 1998): resolving pronouns in English technical manuals to the most recent candidate achieved a success rate of 62.5%, whereas in our experiments only 43.9% of the most recent candidates are correctly resolved as the antecedent (cf. section 3).</Paragraph>
    <Paragraph position="2"> Whereas knowledge-based systems like (Carbonell and Brown, 1988) and (Rich and LuperFoy, 1988), which combine multiple resolution strategies, are expensive in terms of human effort at development time and limited in their ability to scale to new domains, more recent knowledge-poor approaches like (Kennedy and Boguraev, 1996) and (Mitkov, 1998) address the problem without sophisticated linguistic knowledge.</Paragraph>
    <Paragraph position="3"> Similarly to them we do not use any sentence parsing or structural analysis, but just rely on morphosyntactic and semantic word information.</Paragraph>
    <Paragraph position="4"> Moreover, clues about the grammatical and pragmatic functions of expressions are used, as in (Grosz et al., 1995), (Strube, 1998), or (Azzam et al., 1998), as well as in rule-based empirical approaches like (Nakaiwa and Shirai, 1996) or (Murata and Nagao, 1997), to determine the most salient referent. These kinds of manually defined scoring heuristics, however, involve a considerable amount of human intervention, which is avoided in machine learning approaches.</Paragraph>
    <Paragraph position="5"> As briefly noted in section 1, the work described in (Connolly et al., 1994) and (Aone and Bennett, 1995) differs from our approach in the usage of the decision tree in the resolution task. In (Connolly et al., 1994) a decision tree is trained on a small set of 15 features concerning anaphor type, grammatical function, recency, morphosyntactic agreement, and subsuming concepts. Given two anaphor-candidate pairs, the system judges which is "better". However, due to the lack of a strong assumption on "transitivity", this sorting algorithm may be unable to find the "best" solution.</Paragraph>
    <Paragraph position="6"> Based on discourse markers extracted from lexical, syntactic, and semantic processing, the approach of (Aone and Bennett, 1995) uses 66 unary and binary attributes (lexical, syntactic, semantic, position, matching category, topic) during decision tree training. The confidence values returned by the pruned decision tree are utilized as a saliency measure for each anaphor-candidate pair in order to select a single antecedent. In contrast, we use dependency factors for preference selection which cannot be learned automatically, because specific features are learned independently during decision tree training. Therefore, our decision tree is not applied directly to the task of preference selection, but only used as a filter to reduce the number of potential candidates for preference selection.</Paragraph>
    <Paragraph position="7"> In addition to salience preference, a statistically modeled lexical preference is exploited in (Dagan et al., 1995) by comparing the conditional probabilities of co-occurrence patterns given the occurrence of candidates. Experiments, however, are carried out on computer manual texts with mainly intra-sentential references. This kind of data is also characterized by the avoidance of ambiguities and by short discourse units, which prohibit almost any long-range references. In contrast to this research, our results show that the distance factor, in addition to corpus-based frequency information, is quite relevant for the selection of the most salient candidate in our task.</Paragraph>
  </Section>
</Paper>