<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0612"> <Title>An Expectation Maximization Approach to Pronoun Resolution</Title>
<Section position="5" start_page="88" end_page="90" type="metho"> <SectionTitle> 3 Methods </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="88" end_page="88" type="sub_section"> <SectionTitle> 3.1 Problem formulation </SectionTitle>
<Paragraph position="0"> We consider our training set to consist of (p,k,C) triples, one for each pronoun, where p is the pronoun to be resolved, k is the pronoun's context, and C is a candidate list containing the nouns p could potentially be resolved to. Initially, we take k to be the parsed sentence that p appears in.</Paragraph>
<Paragraph position="1"> C consists of all nouns and pronouns that precede p, looking back through the current sentence and the sentence immediately preceding it. This small window may seem limiting, but we found that a correct candidate appeared in 97% of such lists in a labeled development text. Mitkov et al. (2002) also limit candidate consideration to the same window.</Paragraph>
<Paragraph position="2"> Each triple is processed with non-anaphoric pronoun handlers (Section 3.3) and linguistic filters (Section 3.4), which produce the final candidate lists.</Paragraph>
<Paragraph position="3"> Before we pass the (p,k,C) triples to EM, we modify them to better suit our EM formulation. There are four possibilities for the gender and number of third-person pronouns in English: masculine, feminine, neutral and plural (e.g., he, she, it, they).</Paragraph>
<Paragraph position="4"> We assume a noun is equally likely to corefer with any member of a given gender/number category, and reduce each p to a category label accordingly. For example, he, his, him and himself are all labeled masc, for masculine. Plural, feminine and neutral pronouns are handled similarly. We reduce the context term k to p's immediate syntactic context, including only p's syntactic parent, the parent's part of speech, and p's relationship to the parent, as determined by a dependency parser. Incorporating context only through the governing constituent was also done by Ge et al. (1998). Finally, each candidate in C is augmented with ordering information, so we know how many nouns to &quot;step over&quot; before arriving at a given candidate. We will refer to this ordering information as a candidate's j term, for jump.</Paragraph>
<Paragraph position="5"> Our example sentence in Section 1 would create the two triples shown in Figure 1, assuming the sentence began the document it was found in.</Paragraph>
<Paragraph position="6"> [Figure 1: The two triples created for the example sentence. his: p = masc, k = p's family, C = arena (0), president (1). he: p = masc, k = serenade p, C = family (0), masc (1), arena (2), ...] </Paragraph>
</Section>
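To make the formulation concrete, the following minimal Python sketch (ours, not from the paper; all names are illustrative) shows one way the (p,k,C) triples could be represented, using the "his" triple from Figure 1 as a usage example:

from dataclasses import dataclass
from typing import List, Tuple

# A candidate pairs a lexical item l with its jump value j,
# where j = 0 is the closest preceding candidate.
Candidate = Tuple[str, int]

@dataclass
class Triple:
    p: str               # pronoun category: "masc", "fem", "neut", or "plur"
    k: str               # reduced context: p's parent, its POS, p's relation to it
    C: List[Candidate]   # filtered candidate list with jump terms

# The "his" triple from Figure 1:
his = Triple(p="masc", k="p's family", C=[("arena", 0), ("president", 1)])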
<Section position="2" start_page="88" end_page="90" type="sub_section"> <SectionTitle> 3.2 Probability model </SectionTitle>
<Paragraph position="0"> Expectation Maximization (Dempster et al., 1977) is a process for filling in unobserved data probabilistically. To use EM to do unsupervised pronoun resolution, we phrase the resolution task in terms of hidden variables of an observed process. We assume that in each case, one candidate from the candidate list is selected as the antecedent before p and k are generated. EM's role is to induce a probability distribution over candidates to maximize the likelihood of the (p,k) pairs observed in our training set:</Paragraph>
<Paragraph position="1"> \prod_{(p,k)} \Pr(p,k) \quad (1) </Paragraph>
<Paragraph position="2"> We can rewrite Pr(p,k) so that it uses a hidden candidate (or antecedent) variable c that influences the observed p and k:</Paragraph>
<Paragraph position="3"> \Pr(p,k) = \sum_{c \in C} \Pr(p,k,c) = \sum_{c \in C} \Pr(p,k|c)\,\Pr(c) \quad (2) </Paragraph>
<Paragraph position="4"> To improve our ability to generalize to future cases, we use a naïve Bayes assumption to state that the choices of pronoun and context are conditionally independent, given an antecedent. That is, once we select the word the pronoun represents, the pronoun and its context are no longer coupled:</Paragraph>
<Paragraph position="5"> \Pr(p,k|c) = \Pr(p|c)\,\Pr(k|c) \quad (3) </Paragraph>
<Paragraph position="6"> We can split each candidate c into its lexical component l and its jump value j. That is, c = (l,j). If we assume that l and j are independent, the candidate probability factors as: \Pr(c) = \Pr(l)\,\Pr(j) \quad (4) </Paragraph>
<Paragraph position="7"> If we further assume that p and k each depend only on the l component of c, we can combine Equations 3 and 4 to get our final formulation for the joint probability distribution:</Paragraph>
<Paragraph position="8"> \Pr(p,k,c) = \Pr(p|l)\,\Pr(k|l)\,\Pr(l)\,\Pr(j) \quad (5) </Paragraph>
<Paragraph position="9"> The jump term j, though important when resolving pronouns, is not likely to be correlated with any lexical choices in the training set.</Paragraph>
<Paragraph position="10"> This results in four models that work together to determine the likelihood of a given candidate. The Pr(p|l) distribution measures the likelihood of a pronoun given an antecedent. Since we have collapsed the observed pronouns into groups, this models a word's affinity for each of the four relevant gender/number categories. We will refer to this as our pronoun model. Pr(k|l) measures the probability of the syntactic relationship between a pronoun and its parent, given a prospective antecedent for the pronoun. This is effectively a language model, grading lexical choice by context. Pr(l) measures the probability that the word l will be found to be an antecedent. This is useful, as some entities, such as &quot;president&quot; in newspaper text, are inherently more likely to be referenced with a pronoun. Finally, Pr(j) measures the likelihood of jumping a given number of noun phrases backward to find the correct candidate. We represent these models with table look-up. Table 1 shows selected l-value entries in the Pr(p|l) table from our best performing EM model. Note that the probabilities reflect biases inherent in our news-domain training set.</Paragraph>
<Paragraph position="11"> Given models for the four distributions above, we can assign a probability to each candidate in C according to the observations p and k; that is, Pr(c|p,k) can be obtained by dividing Equation 5 by Equation 2. Remember that c = (l,j):</Paragraph>
<Paragraph position="12"> \Pr(c|p,k) = \frac{\Pr(p|l)\,\Pr(k|l)\,\Pr(l)\,\Pr(j)}{\sum_{(l',j') \in C} \Pr(p|l')\,\Pr(k|l')\,\Pr(l')\,\Pr(j')} \quad (6) </Paragraph>
<Paragraph position="13"> Pr(c|p,k) allows us to get fractional counts of (p,k,c) triples in our training set, as if we had actually observed c co-occurring with (p,k) in the proportions specified by Equation 6. This estimation process is effectively the E-step in EM.</Paragraph>
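To illustrate the E-step, here is a minimal sketch (ours, not the paper's implementation) of Equation 6, with the four look-up tables represented as plain Python dictionaries; smoothing and unknown-word handling are omitted:

def posterior(p, k, C, Pp, Pk, Pl, Pj):
    # Pr(c|p,k) for every candidate c = (l, j) in C, per Equation 6.
    # Pp[(p,l)] stands in for Pr(p|l), Pk[(k,l)] for Pr(k|l),
    # Pl[l] for Pr(l), and Pj[j] for Pr(j).
    scores = [Pp[(p, l)] * Pk[(k, l)] * Pl[l] * Pj[j] for (l, j) in C]
    total = sum(scores)  # Pr(p,k) restricted to C, as in Equation 2
    return [s / total for s in scores]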
<Paragraph position="14"> The M-step is conducted by redefining our models according to these fractional counts. For example, after assigning fractional counts to candidates according to Pr(c|p,k), we re-estimate Pr(p|l) with the following equation for a specific (p,l) pair:</Paragraph>
<Paragraph position="15"> \Pr(p|l) = \frac{N(p,l)}{N(l)} \quad (7) </Paragraph>
<Paragraph position="16"> where N() counts the number of times we see a given event or joint event throughout the training set.</Paragraph>
<Paragraph position="17"> Given trained models, we resolve pronouns by finding the candidate \hat{c} that is most likely for the current pronoun, that is, \hat{c} = \operatorname*{argmax}_{c \in C} \Pr(c|p,k). Because Pr(p,k) is constant with respect to c,</Paragraph>
<Paragraph position="18"> \hat{c} = \operatorname*{argmax}_{c \in C} \Pr(p|l)\,\Pr(k|l)\,\Pr(l)\,\Pr(j) \quad (8) </Paragraph>
</Section>
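Putting the pieces together, a sketch (again ours, reusing the posterior function above) of one EM iteration for the pronoun model, plus resolution by Equation 8; the other three models are re-estimated analogously, and smoothing is omitted:

from collections import defaultdict

def em_iteration(triples, Pp, Pk, Pl, Pj):
    # E-step: accumulate fractional counts N(p,l) and N(l) using Pr(c|p,k).
    N_pl = defaultdict(float)
    N_l = defaultdict(float)
    for (p, k, C) in triples:
        for (l, j), w in zip(C, posterior(p, k, C, Pp, Pk, Pl, Pj)):
            N_pl[(p, l)] += w
            N_l[l] += w
    # M-step: re-estimate Pr(p|l) = N(p,l) / N(l), as in Equation 7.
    return {(p, l): n / N_l[l] for (p, l), n in N_pl.items()}

def resolve(p, k, C, Pp, Pk, Pl, Pj):
    # Equation 8: return the candidate with the highest posterior.
    weights = posterior(p, k, C, Pp, Pk, Pl, Pj)
    return C[max(range(len(C)), key=weights.__getitem__)]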
<Section position="3" start_page="90" end_page="90" type="sub_section"> <SectionTitle> 3.3 Non-anaphoric Pronouns </SectionTitle>
<Paragraph position="0"> Not every pronoun in text refers anaphorically to a preceding noun phrase. Several frequent and difficult cases require special attention, including pronouns that are: * Pleonastic: pronouns that have a grammatical function but do not reference an entity. E.g., &quot;It is important to observe it is raining.&quot; * Cataphoric: pronouns that reference a future noun phrase. E.g., &quot;In his speech, the president praised the workers.&quot; * Non-noun referential: pronouns that refer to a verb phrase, sentence, or implicit concept. E.g., &quot;John told Mary they should buy a car.&quot;</Paragraph>
<Paragraph position="1"> If we construct them naïvely, the candidate lists for these pronouns will be invalid, introducing noise into our training set. Manual handling or removal of these cases is infeasible in an unsupervised approach, where the input is thousands of documents.</Paragraph>
<Paragraph position="2"> Instead, pleonastics are identified syntactically using an extension of the detector developed by Lappin and Leass (1994). Roughly 7% of all pronouns in our labeled test data are pleonastic. We detect cataphora using a pattern-based method on parsed sentences, described in (Bergsma, 2005b). Future nouns are included as candidates only when cataphora are identified. This approach is quite different from that of Lappin and Leass (1994), who always include all future nouns from the current sentence as candidates, with a constant penalty added to possible cataphoric resolutions. The cataphora module identifies 1.4% of test data pronouns as cataphoric; in each instance this identification is correct. Finally, we know of no approach that handles pronouns referring to verb phrases or implicit entities. The unavoidable errors for these pronouns, occurring roughly 4% of the time, are included in our final results.</Paragraph>
</Section>
<Section position="4" start_page="90" end_page="90" type="sub_section"> <SectionTitle> 3.4 Candidate list modifications </SectionTitle>
<Paragraph position="0"> It would be possible for C to include every noun phrase in the current and previous sentence, but performance can be improved by automatically removing improbable antecedents. We use a standard set of constraints to filter candidates. If a candidate's gender or number is known and does not match the pronoun's, the candidate is excluded. Candidates with known gender include other pronouns, and names with gendered designators (such as &quot;Mr.&quot; or &quot;Mrs.&quot;). Our parser also identifies plurals and some gendered first names. We remove from C all times, dates, addresses, monetary amounts, units of measurement, and pronouns identified as pleonastic.</Paragraph>
<Paragraph position="1"> We use the syntactic constraints of Binding Theory to eliminate candidates (Haegeman, 1994). For the reflexives himself, herself, itself and themselves, this allows immediate syntactic identification of the antecedent. These cases become unambiguous; only the indicated antecedent is included in C. We improve the quality of our training set by removing known noisy cases before passing the set to EM. For example, we anticipate that sentences with quotation marks will be problematic, as other researchers have observed that quoted text requires special handling for pronoun resolution (Kennedy and Boguraev, 1996). Thus we remove pronouns occurring in the same sentences as quotes from the learning process. Also, we exclude triples where the constraints removed all possible antecedents, or where the pronoun was deemed pleonastic.</Paragraph>
<Paragraph position="2"> Performing these exclusions is justified for training, but in testing we state results for all pronouns.</Paragraph>
</Section> </Section>
<Section position="6" start_page="90" end_page="91" type="metho"> <SectionTitle> 3.5 EM initialization </SectionTitle>
<Paragraph position="0"> Early in the development of this system, we were impressed with the quality of the pronoun model Pr(p|l) learned by EM. However, we found we could construct an even more precise pronoun model for common words by examining unambiguous cases in our training data. Unambiguous cases are pronouns having only one word in their candidate list C. This can result from the preprocessors described in Sections 3.3 and 3.4, or from the pronoun's position in the document. A PrU(p|l) model constructed from only unambiguous examples covers far fewer words than a learned model, but it rarely makes poor gender/number choices. Furthermore, it can be obtained without EM. Training on unambiguous cases is similar in spirit to (Hindle and Rooth, 1993). We found in our development and test sets that, after applying filters, roughly 9% of pronouns occur with unambiguous antecedents.</Paragraph>
<Paragraph position="1"> When optimizing a probability function that is not concave, the EM algorithm is only guaranteed to find a local maximum; it can therefore be helpful to start the process near the desired end-point in parameter space. The unambiguous pronoun model described above can provide such a starting point.</Paragraph>
<Paragraph position="2"> When using this initializer, we perform our initial E-step by weighting candidates according to PrU(p|l), instead of weighting them uniformly. This biases the initial E-step probabilities so that a strong indication of a candidate's gender/number from unambiguous cases will either boost the candidate's chances or remove it from competition, depending on whether or not the predicted category matches that of the pronoun being resolved.</Paragraph>
<Paragraph position="3"> To deal with the sparseness of the PrU(p|l) distribution, we use add-1 smoothing (Jeffreys, 1961). The resulting effect is that words with few unambiguous occurrences receive a near-uniform gender/number distribution, while those observed frequently will closely match the observed distribution.</Paragraph>
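A sketch of how PrU(p|l) could be built under our reading of this subsection (the function name and category labels are illustrative, not from the paper):

from collections import defaultdict

CATEGORIES = ("masc", "fem", "neut", "plur")

def unambiguous_pronoun_model(triples):
    # Count gender/number categories only where the candidate list holds a
    # single word, then apply add-1 smoothing so that rarely seen words
    # fall back toward a uniform distribution over the four categories.
    counts = defaultdict(lambda: defaultdict(int))
    for (p, k, C) in triples:
        if len(C) == 1:          # unambiguous: exactly one candidate
            l = C[0][0]
            counts[l][p] += 1
    PrU = {}
    for l, by_p in counts.items():
        total = sum(by_p.values()) + len(CATEGORIES)
        for p in CATEGORIES:
            PrU[(p, l)] = (by_p[p] + 1) / total
    return PrU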
<Paragraph position="4"> During development, we also tried clever initializers for the other three models, including an extensive language model initializer, but none were able to improve over PrU(p|l) alone.</Paragraph>
<Section position="1" start_page="91" end_page="91" type="sub_section"> <SectionTitle> 3.6 Supervised extension </SectionTitle>
<Paragraph position="0"> Even though we have justified Equation 5 with reasonable independence assumptions, our four models may not be combined optimally for our pronoun resolution task, as the models are only approximations of the true distributions they are intended to represent. Following the approach of Och and Ney (2002), we can view the right-hand side of Equation 5 as a special case of:</Paragraph>
<Paragraph position="1"> \Pr(p|l)^{\lambda_1}\,\Pr(k|l)^{\lambda_2}\,\Pr(l)^{\lambda_3}\,\Pr(j)^{\lambda_4} \quad (9) </Paragraph>
<Paragraph position="2"> where setting \forall i: \lambda_i = 1 recovers Equation 5. Effectively, the log probabilities of our models become feature functions in a log-linear model. When labeled training data is available, we can use the Maximum Entropy principle (Berger et al., 1996) to optimize the \lambda weights. This provides us with an optional supervised extension to the unsupervised system. Given a small set of data that has the correct candidates indicated, such as the set we used while developing our unsupervised system, we can re-weight the final models provided by EM to maximize the probability of observing the indicated candidates. To this end, we follow the approach of Och and Ney (2002) very closely, including their handling of multiple correct answers. We use the limited-memory variable metric method, as implemented in Malouf's maximum entropy package (Malouf, 2002), to set our weights.</Paragraph>
</Section> </Section>
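Concretely, the re-weighted score could be computed as below (a sketch under our reading of Equation 9; the lam values would come from maximum entropy training, and with all weights equal to 1 the ranking matches Equation 5):

import math

def loglinear_score(p, k, c, Pp, Pk, Pl, Pj, lam):
    # Log probabilities of the four models act as weighted feature
    # functions; lam = (lam_p, lam_k, lam_l, lam_j).
    (l, j) = c
    features = [Pp[(p, l)], Pk[(k, l)], Pl[l], Pj[j]]
    return sum(w * math.log(f) for w, f in zip(lam, features))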
<Section position="7" start_page="91" end_page="92" type="metho"> <SectionTitle> 4 Experimental Design </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="91" end_page="92" type="sub_section"> <SectionTitle> 4.1 Data sets </SectionTitle>
<Paragraph position="0"> We used two training sets in our experiments, both drawn from the AQUAINT Question Answering corpus (Voorhees, 2002). For each training set, we manually labeled pronoun antecedents in a corresponding key containing a subset of the pronouns in the set. These keys are drawn from a collection of complete documents. For each document, all pronouns are included. With the exception of the supervised extension, the keys are used only to validate the resolution decisions made by a trained system. Further details are available in (Bergsma, 2005b).</Paragraph>
<Paragraph position="1"> The development set consists of 333,000 pronouns drawn from 31,000 documents. The development key consists of 644 labeled pronouns drawn from 58 documents; 417 are drawn from sentences without quotation marks. The development set and its key were used to guide us while designing the probability model, and to fine-tune EM and smoothing parameters. We also use the development key as labeled training data for our supervised extension.</Paragraph>
<Paragraph position="2"> The test set consists of 890,000 pronouns drawn from 50,000 documents. The test key consists of 1209 labeled pronouns drawn from 118 documents; 892 are drawn from sentences without quotation marks. All of the results reported in Section 5 are determined using the test key.</Paragraph>
</Section>
<Section position="2" start_page="92" end_page="92" type="sub_section"> <SectionTitle> 4.2 Implementation Details </SectionTitle>
<Paragraph position="0"> To get the context values and implement the syntactic filters, we parsed our corpora with Minipar (Lin, 1994). Experiments on the development set indicated that EM generally began to overfit after two iterations, so we stop EM after the second iteration, using the models from the second M-step for testing. During testing, ties in likelihood are broken by taking the candidate closest to the pronoun.</Paragraph>
<Paragraph position="1"> The EM-produced models need to be smoothed, as there will be unseen words and unobserved (p,l) or (k,l) pairs in the test set. This is because problematic cases are omitted from the training set, while all pronouns are included in the key. We handle out-of-vocabulary events by replacing words or context values that occur only once during training with a special unknown symbol. Out-of-vocabulary events encountered during testing are also treated as unknown. We handle unseen pairs with additive smoothing. Instead of adding 1 as in Section 3.5, we add δp = 0.00001 for (k,l) pairs, and δw = 0.001 for (p,l) pairs. These δ values were determined experimentally with the development key.</Paragraph>
</Section>
<Section position="3" start_page="92" end_page="92" type="sub_section"> <SectionTitle> 4.3 Evaluation scheme </SectionTitle>
<Paragraph position="0"> We evaluate our work in the context of a fully automatic system, as was done in (Mitkov et al., 2002). Our evaluation criterion is similar to their resolution etiquette. We define accuracy as the proportion of pronouns correctly resolved, either to any coreferent noun phrase in the candidate list, or to the pleonastic category, which precludes resolution. Systems that handle and state performance for all pronouns in unrestricted text report much lower accuracy than most approaches in the literature. Furthermore, automatically parsing and pre-processing texts causes consistent degradation in performance, regardless of the accuracy of the pronoun resolution algorithm. As a point of comparison to other fully automatic approaches, note that the resolution etiquette score reported in (Mitkov et al., 2002) is 0.582.</Paragraph>
</Section> </Section> </Paper>