<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1119"> <Title>A Statistical Approach to Anaphora Resolution</Title> <Section position="5" start_page="163" end_page="164" type="metho"> <SectionTitle> 3 The Implementation </SectionTitle> <Paragraph position="0"> We use a small portion of the Penn Wall Street Journal Tree-bank as our training corpus. From this data, we collect the three statistics detailed ha the following subsections.</Paragraph> <Paragraph position="1"> 3.0.1 The Hobbs algorithm The Hobbs algorithm makes a few assumptions about the syntactic trees upon which it operates that are not satisfied by the tree-bank trees that form the substrate for our algorithm. Most notably, the Hobbs algorithm depends on the existence of an/~&quot; parse-tree node that is absent from the Penn Tree-bank trees. We have implemented a slightly modified version of Hobbs algorithm for the Tree-bank parse trees. We also transform our trees under certain conditions to meet Hobbs' assumptions as much as possible. We have not, however, been able to duplicate exactly the syntactic structures assumed by Hobbs.</Paragraph> <Paragraph position="2"> Once we have the trees in the proper form (to the degree this is possible) we run Hobbs' algorithm repeatedly for each pronoun until it has proposed n (= 15 in our experiment) candidates. The ith candidate is regarded as occurring at &quot;Hobbs distance&quot; dH = i. Then the</Paragraph> <Paragraph position="4"> I correct antecedent at Hobbs distance i i \[ correct antecedents 1 We use \[ z \[ to denote the number of times z is observed in our training set.</Paragraph> <Section position="1" start_page="164" end_page="164" type="sub_section"> <SectionTitle> 3.1 The gender/animaticity statistics </SectionTitle> <Paragraph position="0"> After we have identified the correct antecedents it is a simple counting procedure to compute P(p\[wa) where wa is in the correct antecedent for the pronoun p (Note the pronouns are grouped by their gender): \[ wain the antecedent for p \[ P(pl o) = When there are multiple relevant words in the antecedent we apply the likelihood test designed by Dunning (1993) on all the words in the candidate NP. Given our limited data, the Dunning test tells which word is the most informative, call it w i, and we then use P(p\[wi).</Paragraph> <Paragraph position="1"> The referents range from being mentioned only once to begin mentioned 120 times in the trainhag examples. Instead of computing the probabUity for each one of them we group them into &quot;buckets&quot;, so that rrt a iS the bucket for the number of times that a is mentioned. We also observe that the position of a pronoun in a story influences the mention count of its referent. In other words, the nearer the end of the story a pronoun occurs, the more probable it is that its referent has been mentioned several times.</Paragraph> <Paragraph position="2"> We measure position by the sentence number, j. The method to compute this probability is: \[ a is antecedent, rna, j I P(alm~, j) = I ms, j l (We omitted j from equations (1-7) to reduce the notational load.)</Paragraph> </Section> <Section position="2" start_page="164" end_page="164" type="sub_section"> <SectionTitle> 3.2 Resolving Pronouns </SectionTitle> <Paragraph position="0"> After collecting the statistics on the training exanaples, we run the program on the test data.</Paragraph> <Paragraph position="1"> For any pronoun we collect n(= 15 in the experiment) candidate antecedents proposed by Hobbs' algorithm. 
<Section position="2" start_page="164" end_page="164" type="sub_section"> <SectionTitle> 3.2 Resolving Pronouns </SectionTitle> <Paragraph position="0"> After collecting the statistics on the training examples, we run the program on the test data.</Paragraph> <Paragraph position="1"> For any pronoun we collect n (= 15 in the experiment) candidate antecedents proposed by Hobbs' algorithm. It is quite possible that a word appears in the test data that the program never saw in the training data, and for which it hence has no P(p|w_a) probability. In this case we simply use the prior probability of the pronoun, P(p). From the parser project mentioned earlier, we obtain the probability P(w_a|h,t,l) / P(w_a|t,l). Finally, we extract the mention count number associated with each candidate NP, which is used to obtain P(a|m_a). The four probabilities are multiplied together. The procedure is repeated for each proposed NP, and the one with the highest combined probability is selected as the antecedent.</Paragraph> </Section> </Section> <Section position="6" start_page="164" end_page="165" type="metho"> <SectionTitle> 4 The Experiment </SectionTitle> <Paragraph position="0"> The algorithm has two modules. One collects the statistics on the training corpus required by equation (8) and the other uses these probabilities to resolve pronouns in the test corpus.</Paragraph> <Paragraph position="1"> Our data consists of 93,931 words (3975 sentences) and contains 2477 pronouns, 1371 of which are singular (he, she and it). The corpus is manually tagged with reference indices and referents' repetition numbers. The result presented here is the accuracy of the program in finding antecedents for he, she, and it and their various forms (e.g. him, his, himself, etc.). The cases where &quot;it&quot; is merely a dummy subject in a cleft sentence (example 1) or has conventional unspecified referents (example 2) are excluded from computing the precision: * Example 1: It is very hard to justify paying a silly price for Jaguar if an out-and-out bidding war were to start now.</Paragraph> <Paragraph position="2"> * Example 2: It is raining.</Paragraph> <Paragraph position="3"> We performed a ten-way cross-validation where we reserved 10% of the corpus for testing and used the remaining 90% for training. Our preliminary results are shown in the last line of Table 1. We are also interested in finding the relative importance of each probability (i.e. each of the four factors in equation (8)) in pronoun resolution. To this end, we ran the program &quot;incrementally&quot;, each time incorporating one more probability. The results are shown in Table 1 (all obtained from cross-validation). The last column of Table 1 contains the p-values for testing the statistical significance of each improvement. Due to relatively large differences between Tree-bank parse trees and Hobbs' trees, our Hobbs implementation does not yield as high an accuracy as it would have if we had had perfect Hobbs tree representations. Since the Hobbs algorithm serves as the base of our scheme, we expect the accuracy to be much higher with more accurately transformed trees.</Paragraph> <Paragraph position="4"> We also note that the very simple model that ignores syntax and takes the last mentioned noun-phrase as the referent performs quite a bit worse, about 43% correct. This indicates that syntax does play a very important role in anaphora resolution.</Paragraph> <Paragraph position="5"> We see a significant improvement after the word knowledge is added to the program. The P(p|w_a) probability gives the system information about gender and animaticity. The contribution of this factor is quite significant, as can be seen from Table 1. The impact of this probability can be seen more clearly from another experiment in which we tested the program (using just Hobbs distance and gender information) on the training data.
Here the program can be thought of as having &quot;perfect&quot; gender/animaticity knowledge. We obtained a success rate of 89.3%. Although this success rate overstates the effect, it is a clear indication that knowledge of a referent's gender and animaticity is essential to anaphora resolution.</Paragraph> <Paragraph position="6"> We hoped that the knowledge about the governing constituent would, like gender and animaticity, make a large contribution. To our surprise, the improvement is only about 2.2%.</Paragraph> <Paragraph position="7"> This is partly because selection restrictions are not clearcut in many cases. Also, some head verbs are too general to restrict the selection of any NP. Examples are &quot;is&quot; and &quot;has&quot;, which appear frequently in the Wall Street Journal: these verbs are not &quot;selective&quot; enough and the associated probability is not strong enough to rule out erroneous candidates. Sparse data also causes a problem in this statistic. Consequently, we observe a relatively small enhancement to the system.</Paragraph> <Paragraph position="8"> The mention information gives the system some idea of the story's focus. The more frequently an entity is repeated, the more likely it is to be the topic of the story and thus to be a candidate for pronominalization. Our results show that this is indeed the case. References by pronouns are closely related to the topic or the center of the discourse. NP repetition is one simple way of approximately identifying the topic. The more accurately the topic of a segment can be identified, the higher the success rate we would expect an anaphora resolution system to achieve.</Paragraph> </Section> <Section position="7" start_page="165" end_page="166" type="metho"> <SectionTitle> 5 Unsupervised Learning of Gender Information </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="165" end_page="166" type="sub_section"> <Paragraph position="0"> The importance of gender information as revealed in the previous experiments caused us to consider automatic methods for estimating the probability that nouns occurring in a large corpus of English text denote inanimate, masculine or feminine things. The method described here is based on simply counting co-occurrences of pronouns and noun phrases, and thus can employ any method of analysis of the text stream that results in referent/pronoun pairs (cf. (Hatzivassiloglou and McKeown, 1997) for another application in which no explicit indicators are available in the stream). We present two very simple methods for finding referent/pronoun pairs, and also give an application of a salience statistic that can indicate how confident we should be about the predictions the method makes. Following this, we show the results of applying this method to the 21-million-word 1987 Wall Street Journal corpus using two different pronoun reference strategies of varying sophistication, and evaluate their performance using honorifics as reliable gender indicators.</Paragraph> <Paragraph position="1"> The method is a very simple mechanism for harvesting the kind of gender information present in discourse fragments like &quot;Kim slept.
She slept for a long time.&quot; Even if Kim's gender was unknown before seeing the first sentence, after the second sentence it is known.</Paragraph> <Paragraph position="2"> The probability that a referent is in a particular gender class is just the relative frequency with which that referent is referred to by a pronoun p that is part of that gender class. That is, the probability of a referent ref being in gender class gc_i is</Paragraph> <Paragraph position="4"> P(ref ∈ gc_i) = |refs to ref with p ∈ gc_i| / Σ_j |refs to ref with p ∈ gc_j|   (9) In this work we have considered only three gender classes, masculine, feminine and inanimate, which are indicated by their typical pronouns, HE, SHE, and IT. However, a variety of pronouns indicate the same class:

pronoun                    gender class
he, himself, him, his      HE
she, herself, her, hers    SHE
it, itself, its            IT

Plural pronouns like &quot;they&quot; and &quot;us&quot; reveal no gender information about their referent and consequently aren't useful, although this might be a way to learn pluralization in an unsupervised manner. In order to gather statistics on the gender of referents in a corpus, there must be some way of identifying the referents. In attempting to bootstrap lexical information about referents' gender, we consider two strategies, both completely blind to any kind of semantics.</Paragraph> <Paragraph position="5"> One of the most naive pronoun reference strategies is the &quot;previous noun&quot; heuristic. On the intuition that pronouns closely follow their referents, this heuristic simply keeps track of the last noun seen and submits that noun as the referent of any pronoun that follows. This strategy is certainly simple-minded but, as noted earlier, it achieves an accuracy of 43%.</Paragraph> <Paragraph position="6"> In the present system, a statistical parser is used (see (Charniak, 1997)) simply as a tagger. This apparent parser overkill is a control to ensure that the part-of-speech tags assigned to words are the same when we use the previous noun heuristic and the Hobbs algorithm, to which we wish to compare the previous noun method. In fact, the only part-of-speech tags necessary are those indicating nouns and pronouns. Obviously a much superior strategy would be to apply the anaphora-resolution strategy from the previous sections to finding putative referents. However, we chose to use only the Hobbs distance portion thereof. We do not use the &quot;mention&quot; probabilities P(a|m_a), as they are not given in the unmarked text. Nor do we use the gender/animaticity information gathered from the much smaller hand-marked text, both because we were interested in seeing what unsupervised learning could accomplish, and because we were concerned about inheriting strong biases from the limited hand-marked data. Thus our second method of finding the pronoun/noun co-occurrences is simply to parse the text and then assume that the noun-phrase at Hobbs distance one is the antecedent.</Paragraph> <Paragraph position="7"> Given a pronoun resolution method and a corpus, the result is a set of pronoun/referent pairs. By collating by referent and abstracting away to the gender classes of pronouns, rather than individual pronouns, we obtain the relative frequencies with which a given referent is referred to by pronouns of each gender class. We will say that the gender class for which this relative frequency is the highest is the gender class to which the referent most probably belongs.</Paragraph>
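A minimal sketch of this collation step, assuming pronoun/referent pairs have already been produced by one of the two strategies above (the pair format and function names are illustrative, not from the paper):

```python
from collections import defaultdict

# Typical pronouns indicating each of the three gender classes.
GENDER_CLASS = {
    "he": "HE", "him": "HE", "his": "HE", "himself": "HE",
    "she": "SHE", "her": "SHE", "hers": "SHE", "herself": "SHE",
    "it": "IT", "its": "IT", "itself": "IT",
}

def gender_class_frequencies(pronoun_referent_pairs):
    """Collate (pronoun, referent) pairs into per-referent relative
    frequencies over the gender classes, as in equation (9)."""
    counts = defaultdict(lambda: defaultdict(int))
    for pronoun, referent in pronoun_referent_pairs:
        gc = GENDER_CLASS.get(pronoun.lower())
        if gc is not None:  # plural pronouns carry no gender signal
            counts[referent][gc] += 1
    freqs = {}
    for referent, by_class in counts.items():
        total = sum(by_class.values())
        freqs[referent] = {gc: c / total for gc, c in by_class.items()}
    return freqs
```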
<Paragraph position="8"> However, any syntax-only pronoun resolution strategy will be wrong some of the time - these methods know nothing about discourse boundaries, intentions, or real-world knowledge. We would like to know, therefore, whether the pattern of pronoun references that we observe for a given referent is the result of our supposed &quot;hypothesis about pronoun reference&quot; - that is, the pronoun reference strategy we have provisionally adopted in order to gather statistics - or whether it is the result of some other unidentified process.</Paragraph> <Paragraph position="9"> This decision is made by computing a log-likelihood ratio, termed salience, for each referent and ranking the referents by it. The likelihood ratio is adapted from Dunning (1993, page 66) and uses the raw frequencies of each pronoun class in the corpus as the null hypothesis, Pr_0(gc_i), as well as Pr(ref ∈ gc_i) from equation (9).</Paragraph> <Paragraph position="10"> salience(ref) = -2 log ( Π_i Pr_0(gc_i)^{c_i} / Π_i Pr(ref ∈ gc_i)^{c_i} ) where c_i is the number of observed references to ref with pronouns of gender class gc_i. Making the unrealistic simplifying assumption that references of one gender class are completely independent of references for the other classes (1), the likelihood function in this case is just the product over all classes of the probabilities of each class of reference to the power of the number of observations of this class.</Paragraph> <Paragraph position="11"> (1) In effect, this is the same as admitting that a referent can be in different gender classes across different observations.</Paragraph> </Section> </Section>
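A sketch of the salience computation as reconstructed above, assuming the per-referent frequencies of equation (9) and the corpus-wide null-hypothesis frequencies are already available (function and parameter names are our own):

```python
import math

def salience(ref_freqs, null_freqs, class_counts):
    """Dunning-style log-likelihood ratio for one referent.

    ref_freqs:    Pr(ref in gc_i), per-referent frequencies (equation 9)
    null_freqs:   Pr_0(gc_i), raw corpus frequencies of each pronoun class
    class_counts: c_i, observed references to ref in each gender class
    """
    log_null = sum(c * math.log(null_freqs[gc])
                   for gc, c in class_counts.items() if c > 0)
    log_alt = sum(c * math.log(ref_freqs[gc])
                  for gc, c in class_counts.items() if c > 0)
    # -2 log(L_null / L_alt): large values mean the observed skew toward
    # one gender class is unlikely under the corpus-wide null hypothesis.
    return -2.0 * (log_null - log_alt)
```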
<Section position="8" start_page="166" end_page="168" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> We ran the program on 21 million words of Wall Street Journal text. One can judge the program informally by simply examining the results and determining whether the program's gender decisions are correct (occasionally looking at the text for difficult cases). Figure 1 shows the 43 noun phrases with the highest salience figures (run using the Hobbs algorithm). An examination of these shows that all but three are correct. (The three mistakes are &quot;husband,&quot; &quot;wife,&quot; and &quot;years.&quot; We return to the significance of these mistakes later.) As a measure of the utility of these results, we also ran our pronoun-anaphora program with these statistics added. This achieved an accuracy rate of 84.2%. This is only a small improvement over what was achieved without the data.</Paragraph> <Paragraph position="1"> We believe, however, that there are ways to improve the accuracy of the learning method and thus increase its influence on pronoun anaphora resolution.</Paragraph> <Paragraph position="2"> Finally we attempted a fully automatic direct test of the accuracy of both pronoun methods for gender determination. To that end, we devised a more objective test, useful only for scoring the subset of referents that are names of people. In particular, we assume that any noun-phrase with the honorifics &quot;Mr.&quot;, &quot;Mrs.&quot; or &quot;Ms.&quot; may be confidently assigned to the gender classes HE, SHE, and SHE, respectively. Thus we compute precision as follows: precision = ( |r attributed as HE ∧ &quot;Mr.&quot; ∈ r| + |r attributed as SHE ∧ &quot;Mrs.&quot; or &quot;Ms.&quot; ∈ r| ) / |r with &quot;Mr.&quot;, &quot;Mrs.&quot;, or &quot;Ms.&quot; ∈ r| Here r varies over referent types, not tokens.</Paragraph> <Paragraph position="3"> The precision scores computed over all phrases containing any of the target honorifics are 66.0% for the last-noun method and 70.3% for the Hobbs method.</Paragraph> <Paragraph position="5"> There are several things to note about these results. First, as one might expect given the already noted superior performance of the Hobbs scheme over last-noun, Hobbs also performs better at determining gender. Secondly, at first glance, the 70.3% accuracy of the Hobbs method is disappointing, only slightly superior to the 65.3% accuracy of Hobbs at finding correct referents. It might have been hoped that the statistics would make things considerably more accurate.</Paragraph> <Paragraph position="6"> In fact, the statistics do make things considerably more accurate. Figure 2 shows average accuracy as a function of the number of references for a given referent. It can be seen that there is a significant improvement with increased referent count. The reason that the average over all referents is so low is that the counts on referents obey Zipf's law, so that the mode of the distribution on counts is one. Thus the 70.3% overall accuracy is a mix of relatively high accuracy for referents with counts greater than one, and relatively low accuracy for referents with counts of exactly one.</Paragraph> </Section>
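For concreteness, a small sketch of the honorific-based precision score defined above; the attribution-table format and function name are our assumptions:

```python
def honorific_precision(attributions):
    """Score gender attributions against honorifics, over referent types.

    attributions: mapping from referent noun-phrase strings to the gender
    class ('HE', 'SHE', or 'IT') the method assigned them.
    """
    correct = scored = 0
    for referent, gc in attributions.items():
        # Check the feminine honorifics first so neither case is shadowed.
        if "Mrs." in referent or "Ms." in referent:
            scored += 1
            correct += (gc == "SHE")
        elif "Mr." in referent:
            scored += 1
            correct += (gc == "HE")
    return correct / scored if scored else 0.0
```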
<Section position="9" start_page="168" end_page="169" type="metho"> <SectionTitle> 7 Previous Work </SectionTitle> <Paragraph position="0"> The literature on pronoun anaphora is too extensive to summarize, so we concentrate here on corpus-based anaphora research.</Paragraph> <Paragraph position="1"> Aone and Bennett (1996) present an approach to an automatically trainable anaphora resolution system. They use Japanese newspaper articles tagged with discourse information as training examples for a machine-learning algorithm, the C4.5 decision-tree algorithm of Quinlan (1993). They train their decision tree using (anaphor, antecedent) pairs together with a set of feature vectors. Among the 66 features are lexical, syntactic, semantic, and positional features. Their Machine Learning-based Resolver (MLR) is trained using decision trees with 1971 anaphors (excluding those referring to multiple discontinuous antecedents), and they report an average success rate of 74.8%.</Paragraph> <Paragraph position="2"> Mitkov (1997) describes an approach that uses a set of factors as constraints and preferences. The constraints rule out implausible candidates and the preferences emphasize the selection of the most likely antecedent. The system is not entirely &quot;statistical&quot; in that it consists of various types of rule-based knowledge -- syntactic, semantic, domain, discourse, and heuristic. A statistical approach is present in the discourse module only, where it is used to determine the probability that a noun (verb) phrase is the center of a sentence. The system also contains domain knowledge including the domain concepts, specific lists of subjects and verbs, and topic headings. The evaluation was conducted on 133 paragraphs of annotated Computer Science text. The results show an accuracy of 83% for the 512 occurrences of &quot;it&quot;.</Paragraph> <Paragraph position="3"> Lappin and Leass (1994) report on an (essentially non-statistical) approach that relies on salience measures derived from syntactic structure and a dynamic model of attentional state.</Paragraph> <Paragraph position="4"> The system employs various constraints for NP-pronoun non-coreference within a sentence. It also uses person, number, and gender features for ruling out anaphoric dependence of a pronoun on an NP. The algorithm has a sophisticated mechanism for assigning values to several salience parameters and for computing global salience values. A blind test was conducted on manual text containing 360 pronoun occurrences; the algorithm successfully identified the antecedent of the pronoun in 86% of these pronoun occurrences. The addition of a module that contributes statistically measured lexical preferences to the range of factors the algorithm considers improved the performance by 2%.</Paragraph> </Section> </Paper>