<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1022">
<Title>Fertility Models for Statistical Natural Language Understanding</Title>
<Section position="4" start_page="168" end_page="168" type="metho">
<SectionTitle> 2 Fertility Clumping Translation </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="168" end_page="168" type="sub_section">
<SectionTitle> Models </SectionTitle>
<Paragraph position="0"> The rationale behind a clumping model is that the input English can be clumped or bracketed into phrases. Each clump is then generated from a single formal language word using a translation model.</Paragraph>
<Paragraph position="1"> The notion of what constitutes a natural clumping depends on the formal language. For example, suppose the English sentence were: I want to fly to Memphis please.</Paragraph>
<Paragraph position="2"> If the formal language for this sentence were:</Paragraph>
</Section>
</Section>
<Section position="5" start_page="168" end_page="168" type="metho">
<SectionTitle> LIST FLIGHTS TO LOCATION, </SectionTitle>
<Paragraph position="0"> then the most plausible clumping would be: [I want] [to fly] [to] [Memphis] [please], for which we would expect "[I want]" and "[please]" to be generated from "LIST", "[to fly]" from "FLIGHTS", "[to]" from "TO", and "[Memphis]" from "LOCATION". Similarly, if the formal language were:</Paragraph>
</Section>
<Section position="6" start_page="168" end_page="169" type="metho">
<SectionTitle> LIST FLIGHTS DESTINATION_LOC </SectionTitle>
<Paragraph position="0"> then the most natural clumping would be: [I want] [to fly] [to Memphis] [please], in which we would now expect "[to Memphis]" to be generated by "DESTINATION_LOC".</Paragraph>
<Paragraph position="1"> Although these clumpings are perhaps the most natural, neither the clumping nor the alignment is annotated in our training data. Instead, both the alignment and the clumping are viewed as "hidden" quantities for which all values are possible with some probability. The EM algorithm is used to produce a maximum likelihood estimate of the model parameters, taking into account all possible alignments and clumpings.</Paragraph>
<Paragraph position="2"> In the discussion of fertility models we denote an English sentence by E, which consists of ℓ(E) words.</Paragraph>
<Paragraph position="3"> Similarly, we denote the formal language by F, a tuple of order ℓ(F), whose individual elements are denoted by f_i. A clumping for a sentence partitions E into a tuple of clumps C. The number of clumps in C is denoted by ℓ(C), and is an integer in the range 1...ℓ(E). A particular clump is denoted by c_i, where i ∈ {1...ℓ(C)}. The number of words in c_i is denoted by ℓ(c_i); c_1 begins at the first word in the sentence, and c_{ℓ(C)} ends at the last word in the sentence. The clumps form a proper partition of E. All the words in a clump c must align to the same f. An alignment between E and F determines which f generates each clump of E in C. Similarly, A denotes the alignment, with ℓ(A) = ℓ(C), and a_i denotes the formal language word to which the words in c_i align. The individual words in a clump c are represented by e_1 ... e_{ℓ(c)}.</Paragraph>
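To make the notation above concrete, the following minimal Python sketch (ours, not from the paper) represents a sentence E, a formal language F, a clumping C as word spans, and an alignment A, and checks the partition and alignment constraints just described; the class and field names are illustrative assumptions, and the example instance is the first clumping discussed in the text.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Clumping:
    E: List[str]                   # English sentence, l(E) words
    F: List[str]                   # formal language tuple, l(F) words
    spans: List[Tuple[int, int]]   # C: (start, end) word indices of each clump, end exclusive
    A: List[int]                   # a_j: index into F generating clump c_j; l(A) = l(C)

    def check(self) -> None:
        # The clumps must cover E from the first word to the last with no gaps or
        # overlaps, and every clump must align to exactly one formal language word.
        assert len(self.spans) == len(self.A)
        pos = 0
        for (start, end), a in zip(self.spans, self.A):
            assert start == pos and end > start   # contiguous, non-empty clump
            assert 0 <= a < len(self.F)           # aligns to a real f_i
            pos = end
        assert pos == len(self.E)                 # the last clump ends at the last word

# The example from the text: formal language LIST FLIGHTS TO LOCATION,
# with [I want] and [please] both generated by LIST.
c = Clumping(
    E="I want to fly to Memphis please".split(),
    F=["LIST", "FLIGHTS", "TO", "LOCATION"],
    spans=[(0, 2), (2, 4), (4, 5), (5, 6), (6, 7)],   # [I want][to fly][to][Memphis][please]
    A=[0, 1, 2, 3, 0],
)
c.check()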
<Paragraph position="4"> For all fertility models, the fundamental parameters are the joint probabilities p(E, C, A, F). Since the clumping and alignment are hidden, to compute the probability that E is generated by F, one calculates:</Paragraph>
<Paragraph position="6"> p(E | F) = Σ_{C,A} p(E, C, A | F). </Paragraph>
</Section>
<Section position="7" start_page="169" end_page="169" type="metho">
<SectionTitle> 3 General and Poisson Fertility </SectionTitle>
<Paragraph position="0"> In the general fertility model, the translation probability with "revealed" alignment and clumping is</Paragraph>
<Paragraph position="2"> where p(n_i | f_i) is the fertility probability of generating n_i clumps from formal word f_i. Note that the n_i sum to L, the total number of clumps. The factorial terms combine to give an inverse multinomial coefficient, which is the uniform probability distribution for the alignment A of F to C.</Paragraph>
<Paragraph position="3"> It appears that the computation of the likelihood, a sum of product terms over all possible clumpings and alignments, is exponential. Although dynamic programming can reduce the complexity, there remain an exponentially large number of terms to evaluate in each iteration of the EM algorithm. We resort to a top-N approximation to the EM sum for the general model, summing over candidate clumpings and alignments proposed by the Poisson fertility model developed below.</Paragraph>
<Paragraph position="4"> If one assumes that the fertility is modeled by the Poisson distribution with mean fertility λ_f,</Paragraph>
<Paragraph position="6"> p(n | f) = e^(−λ_f) λ_f^n / n!, </Paragraph>
<Paragraph position="7"> then a polynomial time training algorithm exists. The simplicity arises from the fortuitous cancellation of n! between the Poisson distribution and the uniform alignment probability. Substituting equation 3 into equation 1 yields:</Paragraph>
<Paragraph position="9"> where λ_f has been absorbed into the effective clump score q(c | f). In this form, it is particularly simple to explicitly sum over all alignments A to obtain p(E, C | F) by repeated application of the distributive law. The resulting polynomial time expressions are:</Paragraph>
<Paragraph position="11"> These expressions can be calculated in O(ℓ(E)^2 ℓ(F)) time if the maximum clump size is unbounded, and in O(ℓ(E) ℓ(F)) time if it is bounded. The Viterbi decoding algorithm (Forney, 1973) is used to calculate p(E | L, F) from these expressions. The Viterbi algorithm produces a score which is the sum over all possible clumpings for a fixed L. This score must then be normalized by the exp(−Σ_{f=1}^{ℓ(F)} λ_f)/L! factor. The EM count accumulation is done using an adaptation of the Baum-Welch algorithm (Baum, 1972), which searches through the space of all possible clumpings, first considering 1 clump, then 2, and so forth.</Paragraph>
<Paragraph position="12"> Initial values for p(e | f) are bootstrapped from Model 1 (Epstein et al., 1996) with the initial mean fertilities λ_f set to 1. We also fixed the maximum clump size at 5 words. Empirically, we found it beneficial to hold the p(e | f) parameters fixed for 20 iterations to allow the other parameters to train to reasonable values. After training, the translation probabilities and clump lengths are smoothed using deleted interpolation (Bahl, Jelinek, and Mercer, 1983).</Paragraph>
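The following Python sketch (ours, not the paper's implementation) illustrates the Poisson-fertility computation described above. It assumes the unigram clump model p(c | f) = Π_i p(e_i | f), takes the effective clump score to be q(c | f) = λ_f p(c | f), sums out the alignment of each clump by the distributive law, and sums over all clumpings of E by dynamic programming, applying the exp(−Σ λ_f)/L! normalization for each number of clumps L. The function names, dictionary layouts, and probability floor are illustrative assumptions.

import math

MAX_CLUMP = 5  # the paper fixes the maximum clump size at 5 words

def clump_score(words, F, p_e_given_f, lam):
    # s(c) = sum over f of q(c | f), with q(c | f) = lambda_f * prod_i p(e_i | f).
    # Summing q over all formal words sums out the alignment of this one clump.
    total = 0.0
    for f in F:
        q = lam[f]
        for e in words:
            q *= p_e_given_f.get((e, f), 1e-10)  # small floor for unseen pairs (assumption)
        total += q
    return total

def p_E_given_F(E, F, p_e_given_f, lam):
    n = len(E)
    # beta[m][k] = sum, over clumpings of the first m words into k clumps,
    # of the product of clump scores s(c_1)...s(c_k).
    beta = [[0.0] * (n + 1) for _ in range(n + 1)]
    beta[0][0] = 1.0
    for m in range(1, n + 1):
        for k in range(1, m + 1):
            for start in range(max(0, m - MAX_CLUMP), m):
                beta[m][k] += beta[start][k - 1] * clump_score(E[start:m], F, p_e_given_f, lam)
    # p(E | F) = sum over L of exp(-sum_f lambda_f) / L! * beta[l(E)][L]
    norm = math.exp(-sum(lam[f] for f in F))
    return sum(norm / math.factorial(L) * beta[n][L] for L in range(1, n + 1))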
<Paragraph position="13"> Since we have been unable to find a polynomial time algorithm to train the general fertility model, we use the Poisson model to "expose" the hidden alignments. The Poisson fertility model gives the most likely 1000 clumpings and alignments, which are then rescored according to the current general fertility model parameters. This gives fractional counts for each of the 1000 alignments, which are then used to update the general fertility model parameters.</Paragraph>
</Section>
<Section position="8" start_page="169" end_page="171" type="metho">
<SectionTitle> 4 Improved Clump Modeling </SectionTitle>
<Paragraph position="0"> In both the Poisson and general fertility models, the computation of p(c | f) in equation 2 uses a unigram model. Each English word e_i is generated with probability p(e_i | f_c). Two more powerful techniques for modeling clump generation are n-gram language models (Miller et al., 1995; Levin and Pieraccini, 1995; Epstein, 1996) and headword language models (Epstein, 1996). A bigram language model uses:</Paragraph>
<Paragraph position="2"> where bdy is a special marker to delimit the beginning and end of the clump.</Paragraph>
<Paragraph position="3"> A headword language model uses two unigram models, a headword model and a non-headword model. Each clump is required to have a headword. All other words are non-headwords. The identity of a clump's headword is hidden, hence it is necessary to sum over all possible headwords:</Paragraph>
<Paragraph position="5"/>
</Section>
<Section position="9" start_page="171" end_page="171" type="metho">
<SectionTitle> 5 Example Fertilities </SectionTitle>
<Paragraph position="0"> To illustrate how well fertility captures simple cases of embedding, trained fertilities are shown in table 1 for several formal language words denoting time intervals. As expected, "early_morning" dominantly produces two clumps, but can produce either one or three clumps with reasonable probability. "morning" and "afternoon" train to comparable fertilities and preferentially generate a single clump. Another interesting case is the formal language token "List", which trains to a λ of 0.62, indicating that it frequently generates no English text. As a further check, the λ values for "from", "to", and the two special classed words "CITY-1" and "CITY-2" are near 1, ranging between 0.96 and 1.17.</Paragraph>
<Paragraph position="1"> Some trained translation probabilities are shown for the unigram and headword models in table 2.</Paragraph>
<Paragraph position="2"> The formal language words have captured reasonable English words for their most likely translation or headword translation. However, "early" and "morning" have fairly undesirable looking second and third choices. The reason for this is that these undesirable words are frequently adjacent to the English words "early" and "morning"; hence the training algorithm includes contributions from two-word clumps containing these extraneous words.</Paragraph>
<Paragraph position="3"> This is the price we pay for not using supervised training data. Intriguingly, the headword model is more strongly biased towards the likely translations and has a smoother tail than the unigram model.</Paragraph>
</Section>
</Paper>
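As an illustration of the clump models in Section 4, here is a short Python sketch (our reading, not the paper's code) of the two alternatives to the unigram p(c | f): a bigram model that conditions each word on its predecessor, with bdy marking both clump boundaries, and a headword model that sums over the hidden choice of headword, scoring the headword with one unigram model and the remaining words with another. The exact functional forms, the uniform prior over headword positions, the dictionary layouts, and all names are assumptions.

def p_clump_bigram(words, f, p_bigram):
    # Assumed form: p(c | f) = prod_i p(e_i | e_{i-1}, f), with bdy delimiting both ends.
    prob = 1.0
    prev = "bdy"
    for e in list(words) + ["bdy"]:
        prob *= p_bigram.get((e, prev, f), 1e-10)  # small floor for unseen events (assumption)
        prev = e
    return prob

def p_clump_headword(words, f, p_head, p_nonhead):
    # Assumed form: sum over each possible headword position h; the headword is drawn
    # from the headword unigram model, the remaining words from the non-headword model.
    total = 0.0
    for h in range(len(words)):
        prob = p_head.get((words[h], f), 1e-10)
        for i, e in enumerate(words):
            if i != h:
                prob *= p_nonhead.get((e, f), 1e-10)
        total += prob / len(words)   # uniform prior over the headword position (assumption)
    return total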