<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1119"> <Title>A Statistical Approach to Anaphora Resolution</Title> <Section position="4" start_page="161" end_page="163" type="intro"> <SectionTitle> 2 A Probabilistic Model </SectionTitle> <Paragraph position="0"> There are many factors, both syntactic and semantic, upon which a pronoun resolution system relies. (Mitkov (1997) does a detailed study on factors in anaphora resolution.) We first discuss the training features we use and then derive the probability equations from them.</Paragraph> <Paragraph position="1"> The first piece of useful information we consider is the distance between the pronoun and the candidate antecedent. Obviously the greater the distance the lower the probability.</Paragraph> <Paragraph position="2"> Secondly, we look at the syntactic situation in which the pronoun finds itself. The most well studied constraints are those involving reflexive pronouns. One classical approach to resolving pronouns in text that takes some syntactic factors into consideration is that of Hobbs (1976).</Paragraph> <Paragraph position="3"> This algorithm searches the parse tree in a leftto-right, breadth-first fashion that obeys the major reflexive pronoun constraints while giving a preference to antecedents that are closer to the pronoun. In resolving inter-sentential pronouns, the algorithm searches the previous sentence, again in left-to-right, breadth-first order. This implements the observed preference for subject position antecedents.</Paragraph> <Paragraph position="4"> Next, the actual words in a proposed noun-phrase antecedent give us information regarding the gender, number, and animaticity of the proposed referent. For example: Marie Giraud carries historical significance as one of the last women to be ezecuted in France. She became an abortionist because it enabled her to buy jam, cocoa and other war-rationed goodies.</Paragraph> <Paragraph position="5"> Here it is helpful to recognize that &quot;Marie&quot; is probably female and thus is unlikely to be referred to by &quot;he&quot; or &quot;it&quot;. Given the words in the proposed antecedent we want to find the probability that it is the referent of the pronoun in question. We collect these probabilities on the training data, which are marked with reference links. The words in the antecedent sometimes also let us test for number agreement. Generally, a singular pronoun cannot refer to a plural noun phrase, so that in resolving such a pronoun any plural candidates should be ruled out. However a singular noun phrase can be the referent of a plural pronoun, as illustrated by the following example: &quot;I think if I tell Viacom I need more time, they will take 'Cosby' across the street,&quot; says the general manager ol a network a~liate.</Paragraph> <Paragraph position="6"> It is also useful to note the interaction between the head constituent of the pronoun p and the antecedent. For example: A Japanese company might make television picture tubes in Japan, assemble the TV sets in Malaysia and extort them to Indonesia.</Paragraph> <Paragraph position="7"> Here we would compare the degree to which each possible candidate antecedent (A Japanese company, television picture tubes, Japan, TV sets, and Malaysia in this example) could serve as the direct object of &quot;export&quot;. These probabilities give us a way to implement selectional restriction. 
<Paragraph position="8"> The last factor we consider is the referent's mention count. Noun phrases that are mentioned repeatedly are preferred. The training corpus is marked with the number of times a referent has been mentioned up to that point in the story.</Paragraph>
<Paragraph position="9"> Here we are concerned with the probability that a proposed antecedent is correct given that it has been repeated a certain number of times.</Paragraph>
<Paragraph position="10"> In effect, we use this probability information to identify the topic of the segment, with the belief that the topic is more likely to be referred to by a pronoun. The idea is similar to that used in the centering approach (Brennan et al., 1987), where a continued topic is the highest-ranked candidate for pronominalization.</Paragraph>
<Paragraph position="11"> Given the above possible sources of information, we arrive at the following equation, where F(p) denotes a function from pronouns to their antecedents:</Paragraph>
<Paragraph position="12"> F(p) = argmax_a P(A(p) = a | p, h, W, t, l, sp, d, M) </Paragraph>
<Paragraph position="13"> where A(p) is a random variable denoting the referent of the pronoun p and a is a proposed antecedent. In the conditioning events, h is the head constituent above p, W is the list of candidate antecedents to be considered, t is the type of phrase of the proposed antecedent (always a noun phrase in this study), l is the type of the head constituent, sp describes the syntactic structure in which p appears, d specifies the distance of each antecedent from p, and M is the number of times each referent is mentioned. Note that W, d, and M are vector quantities in which each entry corresponds to a possible antecedent.</Paragraph>
<Paragraph position="14"> When viewed in this way, a can be regarded as an index into these vectors that specifies which value is relevant to the particular choice of antecedent. This equation is decomposed into pieces that correspond to all the above factors but are more statistically manageable. The decomposition makes use of Bayes' theorem and is based on certain independence assumptions discussed below.</Paragraph>
<Paragraph position="16"> Equation (1) is simply an application of Bayes' rule. The denominator is eliminated in the usual fashion, resulting in equation (2). Selectively applying the chain rule results in equations (3) and (4). In equation (4), the term P(h, t, l | a, W, sp, d) is the same for every antecedent and is thus removed. Equation (6) follows when we break the last component of (5) into two probability distributions.</Paragraph>
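The numbered display equations of this derivation do not survive in the extracted text. Purely as an illustrative sketch (not necessarily the exact form of the paper's equations (1)-(7)), the Bayes-rule and chain-rule steps described above can be written as:

    % Illustrative reconstruction only, not the paper's original display equations.
    \begin{align*}
    P(A(p)=a \mid p, h, W, t, l, s_p, d, M)
      &\propto P(a, p, h, W, t, l, s_p, d \mid M) \\
      &= P(a \mid M)\, P(s_p, d \mid a, M)\, P(h, t, l, W \mid a, M, s_p, d)\,
         P(p \mid a, M, s_p, d, h, t, l, W) \\
      &\propto P(a \mid M)\, P(s_p, d \mid a, M)\, P(W \mid a, M, s_p, d, h, t, l)\,
         P(p \mid a, M, s_p, d, h, t, l, W)
    \end{align*}

where the last step splits off and drops P(h, t, l | a, M, sp, d), which is the same for every candidate antecedent. The independence assumptions listed next reduce the remaining factors to the components of equation (8).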
<Paragraph position="17"> In equation (7) we make the following independence assumptions:
* Given a particular choice of antecedent candidates, the distance is independent of the distances of candidates other than the antecedent (and the distance to non-referents can be ignored): P(sp, d | a, M) ∝ P(sp, da | a, M).
* The syntactic structure sp and the distance from the pronoun da are independent of the number of times the referent is mentioned; thus P(sp, da | a, M) = P(sp, da | a). We then combine sp and da into a single variable dH, the Hobbs distance, since the Hobbs algorithm takes both syntax and distance into account.
* The words in the antecedent depend only on the parent constituent h, the type of the words t, and the type of the parent l. Hence P(W | a, M, sp, d, h, t, l) = P(W | h, t, l, a).
* The choice of pronoun depends only on the words in the antecedent, i.e. P(p | a, M, sp, d, h, t, l, W) = P(p | a, W).
* If we treat a as an index into the vector W, then (a, W) simply picks out the a-th candidate in the list W. We assume the selection of the pronoun is independent of the candidates other than the antecedent. Hence P(p | a, W) = P(p | wa).</Paragraph>
<Paragraph position="21"> Since W is a vector, we need to normalize P(W | h, t, l, a) to obtain the probability of each element in the vector. It is reasonable to assume that the antecedents in W are independent of each other; in other words,</Paragraph>
<Paragraph position="22"> P(W | h, t, l, a) = P(wa | h, t, l) ∏_{i≠a} P(wi | t, l) </Paragraph>
<Paragraph position="23"> To get the probability for each candidate, we divide the above product by:</Paragraph>
<Paragraph position="24"> ∏_i P(wi | t, l) </Paragraph>
<Paragraph position="25"> Now we arrive at the final equation for computing the probability of each proposed antecedent:</Paragraph>
<Paragraph position="26"> P(A(p) = a | p, h, W, t, l, sp, d, M) = P(dH | a) P(p | wa) [P(wa | h, t, l) / P(wa | t, l)] P(a | ma)   (8) </Paragraph>
<Paragraph position="27"> We obtain P(dH | a) by running the Hobbs algorithm on the training data. Since the training corpus is tagged with reference information, the probability P(p | wa) is easily obtained. In building a statistical parser for the Penn Treebank, various statistics have been collected (Charniak, 1997), two of which are P(wa | h, t, l) and P(wa | t, l). To avoid the sparse-data problem, the heads h are clustered according to how they behave in P(wa | h, t, l). The probability of wa is then computed on the basis of h's cluster c(h). Our corpus also contains referents' repetition information, from which we can directly compute P(a | ma). The four components in equation (8) can thus be estimated in a reasonable fashion. The system computes this product and returns the antecedent a for a pronoun p that maximizes this probability. More formally, we want the program to return our antecedent function F(p), where</Paragraph>
<Paragraph position="28"> F(p) = argmax_a P(dH | a) P(p | wa) [P(wa | h, t, l) / P(wa | t, l)] P(a | ma) </Paragraph>
</Section> </Paper>
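As a schematic sketch only (not the authors' implementation; the probability functions below are hypothetical stubs standing in for the corpus-estimated distributions described above), the equation (8) product and the final argmax can be combined as follows:

    # Schematic sketch of combining the four components of equation (8) and
    # returning the argmax candidate. All probability functions are hypothetical stubs.

    def p_dh(rank):                      # P(dH | a): Hobbs-distance distribution
        return {1: 0.6, 2: 0.25, 3: 0.1}.get(rank, 0.05)

    def p_pron_given_word(p, w):         # P(p | wa): gender/number/animacy evidence
        return {("she", "Marie"): 0.5, ("she", "company"): 0.01}.get((p, w), 0.1)

    def p_word_ratio(w, h, t, l):        # P(wa | h,t,l) / P(wa | t,l): selectional preference
        return {("company", "export"): 3.0}.get((w, h), 1.0)  # stub ignores t and l

    def p_mention(m):                    # P(a | ma): preference for oft-mentioned referents
        return min(0.1 * m, 0.9)

    def resolve(pronoun, head, t, l, candidates):
        """F(p): return the candidate maximizing the equation (8) product.

        Each candidate is a tuple (head_word, hobbs_rank, mention_count)."""
        def score(c):
            w, rank, m = c
            return (p_dh(rank) * p_pron_given_word(pronoun, w)
                    * p_word_ratio(w, head, t, l) * p_mention(m))
        return max(candidates, key=score)

    # Example: resolving "she" against two candidate antecedents.
    print(resolve("she", "export", "NP", "VP",
                  [("Marie", 1, 2), ("company", 2, 1)]))

In practice each stub would be replaced by the corresponding table estimated from the Hobbs-distance counts, the reference-annotated pronoun counts, the parser statistics, and the mention counts, as the text describes.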