<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1017"> <Title>Evaluating Automated and Manual Acquisition of Anaphora Resolution Strategies</Title> <Section position="3" start_page="0" end_page="122" type="metho"> <SectionTitle> 2 Applying a Machine Learning Technique to Anaphora Resolution </SectionTitle>
<Paragraph position="0"> In this section, we first discuss the corpora which we created for training and testing. Then, we describe the learning approach chosen, and discuss the training features and training methods that we employed for our current experiments.</Paragraph>
<Section position="1" start_page="0" end_page="122" type="sub_section"> <SectionTitle> 2.1 Training and Test Corpora </SectionTitle>
<Paragraph position="0"> In order to both train and evaluate an anaphora resolution system, we have been developing corpora which are tagged with discourse information.</Paragraph>
<Paragraph position="1"> The tagging has been done using a GUI-based tool called the Discourse Tagging Tool (DTTool) according to &quot;The Discourse Tagging Guidelines&quot; we have developed. 2 The tool allows a user to link an anaphor with its antecedent and specify the type of the anaphor (e.g. pronouns, definite NP's, etc.). The tagged result can be written out to an SGML-marked file, as shown in Figure 1.</Paragraph>
<Paragraph position="2"> For our experiments, we have used a discourse-tagged corpus which consists of Japanese newspaper articles about joint ventures. The tool lets a user define types of anaphora as necessary. The anaphoric types used to tag this corpus are shown in Table 1.</Paragraph>
<Paragraph position="3"> NAME anaphora are tagged when proper names are used anaphorically. For example, in Figure 1, &quot;Yamaichi (ID=3)&quot; and &quot;Sony-Prudential (ID=5)&quot;, referring back to &quot;Yamaichi Shouken (ID=4)&quot; (Yamaichi Securities) and &quot;Sony-Prudential Seimeihoken (ID=6)&quot; (Sony-Prudential Life Insurance) respectively, are NAME anaphora. NAME anaphora in Japanese are different from those in English in that any combination of characters from an antecedent can be a NAME anaphor as long as the character order is preserved (e.g. &quot;abe&quot; can be an anaphor of &quot;abcde&quot;); see the sketch below.</Paragraph>
<Paragraph position="4"> Japanese definite NPs (i.e. DNP anaphora) are those prefixed by &quot;dou&quot; (literally meaning &quot;the same&quot;), &quot;ryou&quot; (literally meaning &quot;the two&quot;), and deictic determiners like &quot;kono&quot; (this) and &quot;sono&quot; (that). For example, &quot;dou-sha&quot; is equivalent to &quot;the company&quot;, and &quot;ryou-koku&quot; to &quot;the two countries&quot;. The DNP anaphora with &quot;dou&quot; and &quot;ryou&quot; prefixes are characteristic of written, but not spoken, Japanese texts.</Paragraph>
<Paragraph position="5"> Unlike English, Japanese has so-called zero pronouns, which are not explicit in the text. In these cases, the DTTool lets the user insert a &quot;Z&quot; marker just before the main predicate of the zero pronoun to indicate the existence of the anaphor. We made a distinction between QZPRO and ZPRO when tagging zero pronouns.
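The character-order condition on Japanese NAME anaphora just described amounts to a subsequence test. The sketch below is a minimal illustration in Python, not the authors' implementation; the function name is hypothetical, and a real resolver would impose further constraints (e.g. matching semantic class):

```python
def is_name_anaphor_candidate(anaphor: str, antecedent: str) -> bool:
    """True if the anaphor's characters all occur in the antecedent
    in the same order (i.e. the anaphor is a subsequence of it)."""
    remaining = iter(antecedent)
    # `ch in remaining` advances the iterator, so order is enforced
    return all(ch in remaining for ch in anaphor)

assert is_name_anaphor_candidate("abe", "abcde")      # order preserved
assert not is_name_anaphor_candidate("aeb", "abcde")  # order violated
```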
QZPRO (&quot;quasi-zero pronoun&quot;) is chosen when a sentence has multiple clauses (subordinate or coordinate), and the zero pronouns in these clauses refer back to the subject of the initial clause in the same sentence, as shown in Figure 2.</Paragraph>
<Paragraph position="6"> The anaphoric types are sub-divided according to semantic criteria such as organizations, people, locations, etc. This is because the current application of our multilingual NLP system is information extraction (Aone et al., 1993), i.e. extracting from texts information about which organizations are forming joint ventures with whom. Thus, resolving certain anaphora (e.g. various ways to refer back to organizations) affects the task performance more than others, as we previously reported (Aone, 1994).</Paragraph>
<Paragraph position="7"> Our goal is to customize and evaluate anaphora resolution systems according to the types of anaphora when necessary.</Paragraph>
<Paragraph position="8"> 2 Our work on the DTTool and tagged corpora was reported in a recent paper (Aone and Bennett, 1994).</Paragraph> </Section>
<Section position="2" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 2.2 Learning Method </SectionTitle>
<Paragraph position="0"> While several inductive learning approaches could have been taken for construction of the trainable anaphora resolution system, we found it useful to be able to observe the resulting classifier in the form of a decision tree. The tree and the features used could most easily be compared to existing theories.</Paragraph>
<Paragraph position="1"> Therefore, our initial approach has been to employ Quinlan's C4.5 algorithm at the heart of our classification approach. We discuss the features used for learning below and go on to discuss the training methods and how the resulting tree is used in our anaphora resolution algorithm.</Paragraph> </Section>
<Section position="3" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 2.3 Training Features </SectionTitle>
<Paragraph position="0"> In our current machine learning experiments, we have taken an approach where we train a decision tree by feeding it feature vectors for pairs of an anaphor and its possible antecedent. Currently we use 66 features, which include lexical (e.g. category), syntactic (e.g. grammatical role), semantic (e.g. semantic class), and positional (e.g. distance between anaphor and antecedent) features. These features can be either unary features (i.e. features of either an anaphor or an antecedent, such as syntactic number values) or binary features (i.e. features concerning relations between the pair, such as the positional relation between an anaphor and an antecedent). We started with the features used by the MDR, generalized them, and added new features. The features that we employed are common across domains and languages, though the feature values may change in different domains or languages. Examples of training features are shown in Table 2.</Paragraph>
<Paragraph position="1"> The feature values are obtained automatically by processing a set of texts with our NLP system, which performs lexical, syntactic and semantic analysis and then creates discourse markers (Kamp, 1981) for each NP and S. 3 Since discourse markers store the output of lexical, syntactic and semantic processing, the feature vectors are automatically calculated from them.
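As a rough illustration of how such feature vectors might be assembled from discourse markers, consider the sketch below. The attribute names and the small feature set are hypothetical stand-ins for the 66 lexical, syntactic, semantic, and positional features described above:

```python
from dataclasses import dataclass

@dataclass
class DiscourseMarker:
    category: str          # lexical feature, e.g. "proper-noun", "pronoun"
    grammatical_role: str  # syntactic feature, e.g. "subject", "object"
    semantic_class: str    # semantic feature, e.g. "organization"
    number: str            # "singular", "plural", or "unknown"
    sentence_index: int    # position of the mention's sentence in the text
    topicalized: bool      # whether the mention is marked as topic

def pair_features(anaphor: DiscourseMarker,
                  antecedent: DiscourseMarker) -> dict:
    """Build one training/test vector for an anaphor-antecedent pair."""
    return {
        # unary features (properties of one member of the pair)
        "ana_category": anaphor.category,
        "ana_grammatical_role": anaphor.grammatical_role,
        "ante_semantic_class": antecedent.semantic_class,
        "ante_topicalized": antecedent.topicalized,
        # binary features (relations between the two members)
        "sentence_distance": anaphor.sentence_index - antecedent.sentence_index,
        "number_agreement": anaphor.number == antecedent.number,
    }
```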
However, because the system output is not always perfect (especially given the complex newspaper articles), there is some noise in the feature values.</Paragraph> </Section>
<Section position="4" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 2.4 Training Methods </SectionTitle>
<Paragraph position="0"> We have employed different training methods using three parameters: anaphoric chains, anaphoric type identification, and confidence factors.</Paragraph>
<Paragraph position="1"> The anaphoric chain parameter is used in selecting training examples. When this parameter is on, we select a set of positive training examples and a set of negative training examples for each anaphor in a text in the following way:</Paragraph> </Section> </Section>
<Section position="4" start_page="122" end_page="124" type="metho"> <SectionTitle> 3 Existence of zero pronouns in sentences is detected </SectionTitle>
<Paragraph position="0"> by the syntax module, and discourse markers are created for them.</Paragraph>
<Paragraph position="1"> [Figure 1: A sample SGML-marked file produced by the DTTool — Japanese text with COREF tags carrying ID, TYPE (e.g. NAME-ORG, ZPRO-ORG, DNP), and REF attributes linking anaphora to their antecedents; the original characters were garbled in extraction.] Positive training examples are those anaphor-antecedent pairs whose anaphor is directly linked to its antecedent in the tagged corpus and also whose anaphor is paired with one of the antecedents on the anaphoric chain, i.e. the transitive closure between the anaphor and the first mention of the antecedent.</Paragraph>
<Paragraph position="2"> For example, if B refers to A and C refers to B, C-A is a positive training example as well as B-A and C-B.</Paragraph>
<Paragraph position="3"> Negative training examples are chosen by pairing an anaphor with all the possible antecedents in a text except for those on the transitive closure described above. Thus, if there is a possible antecedent in the text which is not in the C-B-A transitive closure, say D, C-D and B-D are negative training examples. (A sketch of this selection scheme follows below.)</Paragraph>
<Paragraph position="4"> When the anaphoric chain parameter is off, only those anaphor-antecedent pairs whose anaphora are directly linked to their antecedents in the corpus are considered as positive examples. Because of the way in which the corpus was tagged (according to our tagging guidelines), an anaphor is linked to the most recent antecedent, except for a zero pronoun, which is linked to its most recent overt antecedent. In other words, a zero pronoun is never linked to another zero pronoun.</Paragraph>
<Paragraph position="5"> The anaphoric type identification parameter is utilized in training decision trees. With this parameter on, a decision tree is trained to answer &quot;no&quot; when a pair of an anaphor and a possible antecedent are not co-referential, or to answer the anaphoric type when they are co-referential.
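The chain-based selection of positive and negative training examples just described can be sketched as follows (a hypothetical illustration; mentions and candidates are simplified to strings):

```python
def chain_training_examples(chain, candidates):
    """chain: mentions on one anaphoric chain, ordered from first
    mention to last anaphor, e.g. ["A", "B", "C"];
    candidates: all possible antecedents in the text."""
    positives, negatives = [], []
    for i, anaphor in enumerate(chain[1:], start=1):
        # every earlier mention on the chain (the transitive closure)
        # is a positive example for this anaphor
        for antecedent in chain[:i]:
            positives.append((anaphor, antecedent))
        # any candidate off the chain yields a negative example
        for other in candidates:
            if other not in chain:
                negatives.append((anaphor, other))
    return positives, negatives

pos, neg = chain_training_examples(["A", "B", "C"], ["A", "B", "C", "D"])
# pos == [("B", "A"), ("C", "A"), ("C", "B")]
# neg == [("B", "D"), ("C", "D")]
```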
If the parameter is off, a binary decision tree is trained to answer just &quot;yes&quot; or &quot;no&quot; and does not have to answer the types of anaphora.</Paragraph>
<Paragraph position="6"> The confidence factor parameter (0-100) is used in pruning decision trees. With a higher confidence factor, less pruning of the tree is performed, and thus it tends to overfit the training examples. With a lower confidence factor, more pruning is performed, resulting in a smaller, more generalized tree. We used confidence factors of 25, 50, 75 and 100%.</Paragraph>
<Paragraph position="7"> The anaphoric chain parameter described above was employed because an anaphor may have more than one &quot;correct&quot; antecedent, in which case there is no absolute answer as to whether one antecedent is better than the others. The decision tree approach we have taken may thus predict more than one antecedent to pair with a given anaphor. Currently, confidence values returned from the decision tree are employed when it is desired that a single antecedent be selected for a given anaphor. We are experimenting with techniques to break ties in confidence values from the tree. One approach is to use a particular bias, say, preferring the antecedent closest to the anaphor among those with the highest confidence (as in the results reported here; see the sketch below). Although use of the confidence values from the tree works well in practice, these values were only intended as a heuristic for pruning in Quinlan's C4.5. We plan to use cross-validation across the training set as a method of determining error rates by which to prefer one predicted antecedent over another.</Paragraph>
<Paragraph position="8"> Another approach is to use a hybrid method where a preference-trained decision tree is brought in to supplement the decision process. Preference-trained trees, like that discussed in Connolly et al. (Connolly et al., 1994), are trained by presenting the learning algorithm with examples of when one anaphor-antecedent pair should be preferred over another. Although such trees learn preferences, they may not produce sufficient preferences to permit selection of a single best anaphor-antecedent combination (see the &quot;Related Work&quot; section below).</Paragraph> </Section>
<Section position="5" start_page="124" end_page="125" type="metho"> <SectionTitle> 3 Testing </SectionTitle>
<Paragraph position="0"> In this section, we first discuss how we configured and developed the MLRs and the MDR for testing.</Paragraph>
<Paragraph position="1"> Next, we describe the scoring methods used, and then the testing results of the MLRs and the MDR.</Paragraph>
<Paragraph position="2"> In this paper, we report the results for four types of anaphora, namely NAME-ORG, QZPRO-ORG, DNP-ORG, and ZPRO-ORG, since they are the majority of the anaphora appearing in the texts and the most important for the current domain (i.e. joint ventures) and application (i.e. information extraction).</Paragraph>
<Section position="1" start_page="124" end_page="124" type="sub_section"> <SectionTitle> 3.1 Testing the MLRs </SectionTitle>
<Paragraph position="0"> To build MLRs, we first trained decision trees with 1971 anaphora 4 (of which 929 were NAME-ORG; 546 QZPRO-ORG; 87 DNP-ORG; 282 ZPRO-ORG) in 295 training texts.
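The single-antecedent selection strategy described in Section 2.4 — classify all candidate pairs, keep the positives with the highest confidence, and break ties by proximity to the anaphor — might look roughly like the sketch below (the classify() interface is hypothetical, standing in for the trained C4.5 tree):

```python
def select_antecedent(anaphor, candidates, classify):
    """classify(anaphor, candidate) -> (is_coreferential, confidence);
    both mentions are assumed to carry a .position attribute."""
    scored = []
    for cand in candidates:
        coref, confidence = classify(anaphor, cand)
        if coref:
            scored.append((cand, confidence))
    if not scored:
        return None  # the tree proposed no antecedent at all
    best = max(conf for _, conf in scored)
    tied = [cand for cand, conf in scored if conf == best]
    # proximity bias: prefer the candidate closest to the anaphor
    return min(tied, key=lambda cand: abs(anaphor.position - cand.position))
```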
The six MLRs, using decision trees with different parameter combinations, are described in Table 3.</Paragraph>
<Paragraph position="1"> Then, we trained decision trees in the MLR-2 configuration with varied numbers of training texts, namely 50, 100, 150, 200 and 250 texts. This was done to determine the minimum number of training texts needed to achieve optimal performance.</Paragraph> </Section>
<Section position="2" start_page="124" end_page="125" type="sub_section"> <SectionTitle> 3.2 Testing the MDR </SectionTitle>
<Paragraph position="0"> The same training texts used by the MLRs served as development data for the MDR. Because the NLP system is used for extracting information about joint ventures, the MDR was configured to handle only the crucial subset of anaphoric types for this experiment, namely all the name anaphora and zero pronouns, and the definite NPs referring to organizations (i.e. DNP-ORG). The MDR applies different sets of generators, filters and orderers to resolve different anaphoric types (Aone and McKee, 1993). A generator generates a set of possible antecedent hypotheses for each anaphor, while a filter eliminates unlikely hypotheses from the set. An orderer ranks hypotheses in a preference order if there is more than one hypothesis left in the set after applying all the applicable filters. Table 4 shows the knowledge sources (KS's) employed for the four anaphoric types.</Paragraph>
<Paragraph position="1"> 4 In both training and testing, we did not include anaphora which refer to multiple discontinuous antecedents.</Paragraph> </Section>
<Section position="3" start_page="125" end_page="125" type="sub_section"> <SectionTitle> 3.3 Scoring Method </SectionTitle>
<Paragraph position="0"> We used recall and precision metrics, as shown in Table 5, to evaluate the performance of anaphora resolution. It is important to use both measures because one can build a high recall-low precision system or a low recall-high precision system, neither of which may be appropriate in certain situations.</Paragraph>
<Paragraph position="1"> The NLP system sometimes fails to create discourse markers exactly corresponding to anaphora in texts due to failures of lexical or syntactic processing. In order to evaluate the performance of the anaphora resolution systems themselves, we only considered anaphora whose discourse markers were identified by the NLP system in our evaluation. Thus, the system performance evaluated against all the anaphora in texts could be different.</Paragraph> </Section>
<Section position="4" start_page="125" end_page="125" type="sub_section"> <SectionTitle> 3.4 Testing Results </SectionTitle>
<Paragraph position="0"> The testing was done using 1359 anaphora (of which 1271 were one of the four anaphoric types) in 200 blind test texts for both the MLRs and the MDR. It should be noted that both the training and testing texts are newspaper articles about joint ventures, and that each article always talks about more than one organization. Thus, finding antecedents of organizational anaphora is not straightforward. Table 6 shows the results of the six different MLRs and the MDR for the four types of anaphora, while Table 7 shows the results of MLR-2 with different numbers of training texts.</Paragraph> </Section> </Section>
<Section position="6" start_page="125" end_page="295" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"/> <Section position="1" start_page="125" end_page="295" type="sub_section"> <SectionTitle> 4.1 The MLRs vs.
the MDR </SectionTitle>
<Paragraph position="0"> Using F-measures 5 as an indicator of overall performance, the MLRs with the chain parameter turned on and type identification turned off (i.e. MLR-1, 2, 3, and 4) performed the best. MLR-1, 2, 3, 4, and 5 all exceeded the MDR in overall performance based on F-measure.</Paragraph>
<Paragraph position="1"> Both the MLRs and the MDR used the character subsequence, the proper noun category, and the semantic class feature values for NAME-ORG anaphora (in MLR-5, using anaphoric type identification). It is interesting to see that the MLR additionally uses the topicalization feature before testing the semantic class feature. This indicates that, information-theoretically, if the topicalization feature is present, the semantic class feature is not needed for the classification. The performance on NAME-ORG is better than on other anaphoric phenomena because the character subsequence feature has very high antecedent predictive power.</Paragraph>
<Paragraph position="2"> Changing the three parameters in the MLRs caused changes in anaphora resolution performance. As Table 6 shows, using anaphoric chains without anaphoric type identification helped improve the MLRs. Our experiments with the confidence factor parameter indicate the trade-off between recall and precision. With a 100% confidence factor, which means no pruning of the tree, the tree overfits the examples, and this leads to spurious uses of features, such as the number of sentences between an anaphor and an antecedent, near the leaves of the generated tree. This causes the system to attempt more anaphora resolutions, albeit with lower precision. Conversely, too much pruning can also yield poorer results.</Paragraph>
<Paragraph position="3"> MLR-5 illustrates that when anaphoric type identification is turned on, the MLR's performance drops but still exceeds that of the MDR. MLR-6 shows the effect of not training on anaphoric chains. It results in poorer performance than the MLR-1, 2, 3, 4, and 5 configurations and the MDR.</Paragraph>
<Paragraph position="4"> 5 F-measure is calculated by F = ((β^2 + 1) P R) / (β^2 P + R), where P is precision, R is recall, and β is the relative importance given to recall over precision. In this case, β = 1.0.</Paragraph>
<Paragraph position="5"> One of the advantages of the MLRs is that, due to the number of different anaphoric types present in the training data, they also learned classifiers for several additional anaphoric types beyond what the MDR could handle. While additional coding would have been required for each of these types in the MDR, the MLRs picked them up without additional work. The additional anaphoric types included DPRO, REFLEXIVE, and TIMEI (cf. Table 1). Another advantage is that, unlike the MDR, whose features are hand-picked, the MLRs automatically select and use the necessary features.</Paragraph>
<Paragraph position="6"> We suspect that the poorer performance on ZPRO-ORG and DNP-ORG may be due to the following deficiency of the current MLR algorithms: because anaphora resolution is performed in a &quot;batch mode&quot; for the MLRs, there is currently no way to percolate information on an anaphor-antecedent link found by the system after each resolution.
For example, if a zero pronoun (Z-2) refers to another zero pronoun (Z-1), which in turn refers to an overt NP, knowing which is the antecedent of Z-1 may be important for Z-2 to resolve its antecedent correctly.</Paragraph>
<Paragraph position="7"> However, such information is not available to the MLRs when resolving Z-2.</Paragraph>
<Paragraph position="8"> One advantage of the MDR is that a tagged training corpus is not required for hand-coding the resolution algorithms. Of course, such a tagged corpus is necessary to evaluate system performance quantitatively, and it is also useful to consult during algorithm construction.</Paragraph>
<Paragraph position="9"> However, the MLR results seem to indicate a limitation of the MDR in the way it uses orderer KS's. Currently, the MDR uses an ordered list of multiple orderer KS's for each anaphoric type (cf. Table 4), where the first applicable orderer KS in the list is used to pick the best antecedent when there is more than one possibility. Such selection ignores the fact that even anaphora of the same type may use different orderers (i.e. have different preferences), depending on the types of possible antecedents and on the context in which the particular anaphor was used in the text.</Paragraph> </Section>
<Section position="2" start_page="295" end_page="295" type="sub_section"> <SectionTitle> 4.2 Training Data Size vs. Performance </SectionTitle>
<Paragraph position="0"> As Table 7 shows, even with 50 training texts, the MLR achieves better performance than the MDR. Performance seems to reach a plateau at about 250 training texts, with an F-measure of around 77.4.</Paragraph> </Section> </Section> </Paper>