<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1043"> <Title>Learning Stochastic OT Grammars: A Bayesian approach using Data Augmentation and Gibbs Sampling</Title> <Section position="3" start_page="346" end_page="347" type="metho"> <SectionTitle> 2 The difficulty of a maximum-likelihood approach </SectionTitle> <Paragraph position="0"> Naturally, one may consider &quot;frequency matching&quot; as estimating the grammar based on the maximum-likelihood criterion. Given a set of constraints and candidates, the data may be compiled in the form of (1), on which the likelihood calculation is based. As an example, given the grammar and data set in Table 1, the likelihood of d=&quot;max{C1,C2} > C3&quot; can be written as P(d|u1,u2,u3) =</Paragraph> <Paragraph position="2"> ∫∫_{max{x,y} > 0} N(f_xy; 0, S) dx dy, where f_xy = (x − u1 + u3, y − u2 + u3), and S is the identity covariance matrix. The integral follows from the fact that both C1 − C3 and C2 − C3 are normal, since each constraint is independently normally distributed.</Paragraph> <Paragraph position="3"> If we treat each datum as independently generated by the grammar, then the likelihood will be a product of such integrals (multiple integrals if many constraints are interacting). One may attempt to maximize such a likelihood function using numerical methods (note that even computing the gradient is non-trivial), yet it appears desirable to avoid likelihood calculations altogether.</Paragraph> </Section> <Section position="4" start_page="347" end_page="348" type="metho"> <SectionTitle> 3 The missing data scheme for learning Stochastic OT grammars </SectionTitle> <Paragraph position="0"> The Bayesian approach explores p(G|D), the posterior distribution. Notice that if we take the usual approach via the relationship p(G|D) ∝ p(D|G) · p(G), we will encounter the same problem as in Section 2. 
Therefore we need a feasible way of sampling p(G|D) without having to derive the closed form of p(D|G).</Paragraph> <Paragraph position="1"> The key idea here is the so-called &quot;missing data&quot; scheme in Bayesian statistics: in a complex model-fitting problem, the computation can sometimes be greatly simplified if we treat part of the unknown parameters as data and fit the model in successive stages. To apply this idea, one needs to observe that Stochastic OT grammars are learned from ordinal data, as seen in (1). In other words, only one aspect of the structure generated by those normal distributions -- the ordering of constraints -- is used to generate outputs.</Paragraph> <Paragraph position="2"> This observation points to the possibility of treating the sample values of the constraints y = (y1, y2, ..., yN) that satisfy the ordering relations as missing data. It is appropriate to refer to them as &quot;missing&quot; because a language learner obviously cannot observe real numbers from the constraints, which are postulated by linguistic theory. When the observed data are augmented with missing data and become a complete data model, computation becomes significantly simpler. This type of idea is formally known as Data Augmentation (Tanner and Wong, 1987). More specifically, we also make the following intuitive observations: * The complete data model consists of 3 random variables: the observed ordering relations D, the grammar G, and the missing samples of constraint values Y that generate the ordering D.</Paragraph> <Paragraph position="3"> * G and Y are interdependent: - For each fixed d, values of Y that respect d can be obtained easily once G is given: we just sample from p(Y|G) and keep only those that observe d. Then we let d vary with its frequency in the data, and obtain a sample of p(Y|G,D); - Once we have the values of Y that respect the ranking relations D, G becomes independent of D. 
Thus, sampling G from p(G|Y,D) becomes the same as sampling from p(G|Y).</Paragraph> <Paragraph position="4"> 4 Gibbs sampler for the joint posterior -- p(G,Y|D)</Paragraph> <Paragraph position="6"> The interdependence of G and Y helps design iterative algorithms for sampling p(G,Y|D). In this case, since each step samples from a conditional distribution (p(G|Y,D) or p(Y|G,D)), they can be combined to form a Gibbs sampler (Geman and Geman, 1984). In the same order as described in Section 3, the two conditional sampling steps are implemented as follows: 1. Sample an ordering relation d according to the prior p(D), which is simply normalized frequency counts; then sample a vector of constraint values y = {y1, ..., yN} from the normal distributions N(u1^(t), s^2), ..., N(uN^(t), s^2) such that y observes the ordering in d; 2. Repeat Step 1 to obtain M samples of missing data y^1, ..., y^M; then sample ui^(t+1) from N(∑j yi^j/M, s^2/M).</Paragraph> <Paragraph position="7"> The grammar G = (u1, ..., uN), and the superscript (t) represents a sample of G in iteration t. As explained in Section 3, Step 1 samples missing data from p(Y|G,D), and Step 2 is equivalent to sampling from p(G|Y,D), by the conditional independence of G and D given Y. The normal posterior distribution N(∑j yi^j/M, s^2/M) is derived by using p(G|Y) ∝ p(Y|G)p(G), where p(Y|G) is normal, and p(G) ~ N(u0, s0) is chosen to be a non-informative prior with s0 → ∞.</Paragraph> <Paragraph position="8"> M (the number of missing data) is not a crucial parameter. In our experiments, M is set to the total number of observed forms (other choices of M, e.g. M = 1, lead to more or less the same running time).</Paragraph> <Paragraph position="9"> Although it may seem that s^2/M is small for a large M and does not play a significant role in the sampling of ui^(t+1), the variance of the sampling distribution is a necessary ingredient of the Gibbs sampler.</Paragraph> <Paragraph position="10"> Under fairly general conditions (Geman and Geman, 1984), the Gibbs sampler iterates these two steps until it converges to a unique stationary distribution. In practice, convergence can be monitored by calculating cross-sample statistics from multiple Markov chains with different starting points (Gelman and Rubin, 1992). After the simulation is stopped at convergence, we will have obtained a perfect sample of p(G,Y|D). These samples can be used to derive our target distribution p(G|D) by simply keeping all the G components, since p(G|D) is a marginal distribution of p(G,Y|D). Thus, the sampling-based approach gives us the advantage of doing inference without performing any integration.</Paragraph> </Section> <Section position="5" start_page="348" end_page="350" type="metho"> <SectionTitle> 5 Computational issues in implementation </SectionTitle> <Paragraph position="0"> In this section, I will sketch some key steps in the implementation of the Gibbs sampler. Particular attention is paid to sampling p(Y|G,D), since a direct implementation may require an unrealistic running time.</Paragraph> <Section position="1" start_page="348" end_page="348" type="sub_section"> <SectionTitle> 5.1 Computing p(D) from linguistic data </SectionTitle> <Paragraph position="0"> The prior probability p(D) determines the number of samples (missing data) that are drawn under each ordering relation. The following example illustrates how the ordering D and p(D) are calculated from data collected in a linguistic analysis. 
Consider a data set that contains 2 inputs and a few outputs, each associated with an observed frequency in the lexicon. Here each ordering relation has several conjuncts, and the number of conjuncts is equal to the number of competing candidates for each given input. These conjuncts need to hold simultaneously because each winning candidate needs to be more harmonic than all other competing candidates. The probabilities p(D) are obtained by normalizing the frequencies of the surface forms in the original data. This has the consequence of placing more weight on lexical items that occur frequently in the corpus.</Paragraph> </Section> <Section position="2" start_page="348" end_page="349" type="sub_section"> <SectionTitle> 5.2 Sampling p(Y|G,D) under complex ordering relations </SectionTitle> <Paragraph position="0"> A direct implementation of sampling p(Y|G,d) is straightforward: 1) first obtain N samples from the N Gaussian distributions; 2) then check each conjunct to see if the ordering relation is satisfied. If so, keep the sample; if not, discard the sample and try again.</Paragraph> <Paragraph position="1"> However, this can be highly inefficient in many cases. For example, if m constraints appear in the ordering relation d and the sample is rejected, the N − m random numbers for constraints not appearing in d are also discarded. When d has several conjuncts, the chance of rejecting samples for irrelevant constraints is even greater.</Paragraph> <Paragraph position="2"> In order to save the generated random numbers, the vector Y can be decomposed into its 1-dimensional components (Y1, Y2, ..., YN). The problem then becomes sampling p(Y1, ..., YN|G,D). Again, we may use conditional sampling to draw yi one at a time: we keep y_{j≠i} and d fixed, and draw yi so that d holds for y. 
There are now two cases: if d holds regardless of yi, then any sample from N(ui^(t), s^2) will do; otherwise, we will need to draw yi from a truncated normal distribution.</Paragraph> <Paragraph position="3"> To illustrate this idea, consider the example used earlier where d=&quot;max{c1,c2} > c3&quot;, and the initial sample and parameters are (y1^(0), y2^(0), y3^(0)) =</Paragraph> <Paragraph position="5"> Notice that in each step, the sampling density is either just a normal, or a truncated normal distribution. This is because we only need to make sure that d will continue to hold for the next sample y^(t+1), which differs from y^(t) by just 1 constraint.</Paragraph> <Paragraph position="6"> In our experiment, sampling from truncated normal distributions is realized by rejection sampling: to sample from a truncated normal π_c(x) = (1/Z(c)) · N(u,s) · I{x > c} (the truncated distribution is re-normalized by Z(c) so that it is a proper density), we first find an envelope density function g(x) that is easy to sample from directly, such that π_c(x) is uniformly bounded by M · g(x) for some constant M that does not depend on x. It can be shown that if each sample x from g(x) is rejected with probability r(x) = 1 − π_c(x)/(M · g(x)), the resulting histogram will provide a perfect sample of π_c(x). In the current work, the exponential distribution g(x) = λ exp{−λx} is used as the envelope, with the choices of λ and the rejection ratio r(x) optimized to lower the rejection rate:</Paragraph> <Paragraph position="8"> Putting these ideas together, the final version of the Gibbs sampler is constructed by implementing Step 1 in Section 4 as a sequence of conditional sampling steps for p(Yi|Y_{j≠i},d), and combining them with the sampling of p(G|Y,D).</Paragraph> <Paragraph position="9"> Notice that the order in which Yi is updated is fixed, which makes our implementation an instance of the systematic-scan Gibbs sampler (Liu, 2001). 
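Concretely, the exponential-envelope rejection step for a truncated normal can be sketched as follows. This is an illustrative Python sketch, not the paper's code: the envelope rate used here is a standard choice for this construction, and the paper's own optimized λ and r(x) are not reproduced.

```python
import math
import random

def sample_truncated_normal(u, s, c):
    """Draw x ~ N(u, s^2) conditioned on x > c by rejection sampling.

    Illustrative sketch: for truncation points above the mean, a shifted
    exponential envelope g(z) = lam * exp(-lam * (z - a)) on z > a is used;
    the rate below is a standard choice, not necessarily the optimized one
    described in the text.
    """
    a = (c - u) / s  # truncation point in standard units
    if a > 0:
        # Right-tail case: sample z > a from the exponential envelope,
        # then accept with probability N(z; 0, 1) / (M * g(z)),
        # which simplifies to exp(-(z - lam)^2 / 2).
        lam = (a + math.sqrt(a * a + 4.0)) / 2.0
        while True:
            z = a + random.expovariate(lam)
            if math.exp(-(z - lam) ** 2 / 2.0) >= random.random():
                return u + s * z
    # Otherwise plain rejection from N(u, s^2) already accepts often enough.
    while True:
        x = random.gauss(u, s)
        if x > c:
            return x
```

Each accepted draw is an exact sample from the truncated density π_c(x), so the normalizing constant Z(c) never needs to be computed.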
This implementation may be improved even further by utilizing the structure of the ordering relation d and optimizing the order in which Yi is updated.</Paragraph> </Section> <Section position="3" start_page="349" end_page="349" type="sub_section"> <SectionTitle> 5.3 Model identifiability </SectionTitle> <Paragraph position="0"> Identifiability concerns the uniqueness of the solution in model fitting. Given N constraints, a grammar G ∈ R^N is not identifiable, because G + C will have the same behavior as G for any constant vector C = (c0, ..., c0). To remove this translation invariance, in Step 2 the average ranking value is subtracted from G, so that ∑i ui = 0.</Paragraph> <Paragraph position="1"> Another problem related to identifiability arises when the data contain so-called &quot;categorical domination&quot;, i.e., there may be data of the following form: c1 > c2 with probability 1.</Paragraph> <Paragraph position="2"> In theory, the mode of the posterior then tends to infinity and the Gibbs sampler will not converge. Since categorical dominance relations are common in linguistic analyses, we avoid this problem by truncating the posterior distribution (footnote 8) by I{|u| < K}, where K is chosen to be a positive number large enough to ensure that the model is identifiable. The role of truncation/renormalization may be seen as a strong prior that makes the model identifiable on a bounded set.</Paragraph> <Paragraph position="3"> A third problem related to identifiability occurs when the posterior has multiple modes, which suggests that multiple grammars may generate the same output frequencies. This situation is common when the grammar contains interactions between many constraints, and greedy algorithms like GLA tend to find one of the many solutions. 
In this case, one can either introduce extra ordering relations or use informative priors for sampling p(G|Y), so that inference on the posterior can be done with a relatively small number of samples.</Paragraph> </Section> <Section position="4" start_page="349" end_page="350" type="sub_section"> <SectionTitle> 5.4 Posterior inference </SectionTitle> <Paragraph position="0"> Once the Gibbs sampler has converged to its stationary distribution, we can use the samples to make various inferences on the posterior. (Footnote 8: the implementation of sampling from truncated normals is the same as described in 5.2.)</Paragraph> <Paragraph position="1"> In the experiments reported in this paper, we are primarily interested in the mode of the posterior marginal p(ui|D), where i = 1, ..., N (footnote 9). In cases where the posterior marginal is symmetric and uni-modal, its mode can be estimated by the sample median.</Paragraph> <Paragraph position="2"> In real linguistic applications, the posterior marginal may be a skewed distribution, and many modes may appear in the histogram. In these cases, more sophisticated non-parametric methods, such as kernel density estimation, can be used to estimate the modes. To reduce the computation in identifying multiple modes, a mixture approximation (by the EM algorithm or its relatives) may be necessary.</Paragraph> </Section> </Section> <Section position="6" start_page="350" end_page="351" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="350" end_page="350" type="sub_section"> <SectionTitle> 6.1 Ilokano reduplication </SectionTitle> <Paragraph position="0"> The following Ilokano grammar and data set, used in (Boersma and Hayes, 2001), illustrate a complex type of constraint interaction: the interaction between the three constraints *COMPLEX-ONSET, ALIGN, and IDENT-BR([long]) cannot be factored into interactions between 2 constraints. 
For any given candidate to be optimal, the constraint that prefers that candidate must simultaneously dominate the other two constraints. Hence it is not immediately clear whether there is a grammar that will assign equal probability to the 3 candidates.</Paragraph> <Paragraph position="2"> Since it does not address the problem of identifiability, the GLA does not always converge on this data set, and the returned grammar does not always fit the input frequencies exactly, depending on the choice of parameters (footnote 10).</Paragraph> <Paragraph position="3"> In comparison, the Gibbs sampler converges quickly (within 1000 iterations), regardless of the parameters. The result suggests the existence of a unique grammar that will assign equal probabilities to the 3 candidates. The posterior samples and histograms are displayed in</Paragraph> <Paragraph position="4"> Footnote 9: Note that G = (u1, ..., uN), and p(ui|D) is a marginal of p(G|D).</Paragraph> <Paragraph position="5"> Footnote 10: Boersma and Hayes reported results of averaging many runs of the algorithm. Yet there appears to be significant randomness in each run of the algorithm.</Paragraph> </Section> <Section position="2" start_page="350" end_page="351" type="sub_section"> <SectionTitle> 6.2 Spanish diminutive suffixation </SectionTitle> <Paragraph position="0"> The second experiment uses linguistic data on Spanish diminutives and the analysis proposed in (Arbisi-Kelm, 2002). There are 3 base forms, each associated with 2 diminutive suffixes. The grammar consists of 4 constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO and BaseTooLittle. The data present the problem of learning from noise, since no Stochastic OT grammar can provide an exact fit to the data: the candidate [ubita] violates an extra constraint compared to [liri.ito], and [ubasita] violates the same constraint as [liryosito]. Yet unlike [liryosito], [ubasita] is not observed.</Paragraph> <Paragraph position="1"> Input Output Freq. 
A M D B (A = ALIGN(TE,Word,R), M = MAX-OO(V), D = DEP-IO, B = BaseTooLittle)</Paragraph> <Paragraph position="3"> In the results found by GLA, [marEsito] always has a lower frequency than [marsito] (see Table 7).</Paragraph> <Paragraph position="4"> This is not accidental. Instead it reveals a problematic use of heuristics in GLA: since the constraint B is violated by [ubita], it is always demoted whenever the underlying form /uba/ is encountered during learning. Even when the model assigns equal values to u3 and u4 (corresponding to D and B, respectively), u3 is always less than u4, simply because there is more chance of penalizing D than B. This problem arises precisely because of the heuristic (i.e. demoting the constraint that prefers the wrong candidate) that GLA uses to find the target grammar.</Paragraph> <Paragraph position="5"> The Gibbs sampler, on the other hand, does not depend on heuristic rules in its search. Since the modes of the posteriors p(u3|D) and p(u4|D) lie at negative infinity, the posterior is truncated by I{ui < K}, with K = 6, based on the discussion in 5.3. Results of the Gibbs sampler and two runs of GLA (footnote 13) are reported in Table 7.</Paragraph> </Section> </Section> <Section position="7" start_page="351" end_page="351" type="metho"> <SectionTitle> 7 A comparison with Max-Ent models </SectionTitle> <Paragraph position="0"> Previously, problems with the GLA (footnote 14) have inspired other OT-like models of linguistic variation. One such proposal suggests using the more well-known Maximum Entropy model (Goldwater and Johnson, 2003). In Max-Ent models, a grammar G is also parameterized by a real vector of weights w = (w1, ..., wN), but the conditional likelihood of an output y given an input x is given by:</Paragraph> <Paragraph position="2"> P(y|x) = exp(∑i wi fi(y,x)) / ∑y' exp(∑i wi fi(y',x)) (2) where fi(y,x) is the violation each constraint assigns to the input-output pair (x,y). Clearly, Max-Ent is a rather different type of model from Stochastic OT, not only in the use of constraint ordering, but also in the objective function (conditional likelihood rather than likelihood/posterior). 
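To make the log-linear form in (2) concrete, here is a minimal Python sketch with hypothetical weights and violation counts (the candidate names, weights, and violation vectors below are invented for illustration, not figures from the paper):

```python
import math

def maxent_prob(w, candidates):
    """Conditional probabilities of the output candidates for one input,
    following equation (2): P(y|x) proportional to exp(sum_i w_i * f_i(y, x)).

    `candidates` maps each candidate name to its vector of constraint
    violations f(y, x); `w` is the weight vector. With this convention,
    weights are typically negative, so that extra violations lower a
    candidate's probability.
    """
    scores = {y: math.exp(sum(wi * fi for wi, fi in zip(w, f)))
              for y, f in candidates.items()}
    z = sum(scores.values())  # normalizing constant over the competitors
    return {y: score / z for y, score in scores.items()}

# Two hypothetical candidates for one input: "b" incurs the cheaper violation.
p = maxent_prob([-2.0, -1.0], {"a": [1, 0], "b": [0, 1]})
```

Here p["b"] = exp(-1) / (exp(-1) + exp(-2)) = 1 / (1 + exp(-1)), roughly 0.73, so the candidate with the cheaper violation profile receives most of the probability mass; normalization is only over competing outputs for the same input, which is what makes the objective a conditional likelihood.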
However, it may be of interest to compare these two types of models.</Paragraph> <Paragraph position="4"> Using the same data as in 6.2, the results of fitting Max-Ent (using conjugate gradient descent) and Stochastic OT (using the Gibbs sampler) are reported in Table 8. It can be seen that the Max-Ent model, in the absence of a smoothing prior, fits the data perfectly by assigning positive weights to constraints B and D. A less exact fit (denoted by MEsm) is obtained when a smoothing Gaussian prior is used with ui = 0, si^2 = 1. But as observed in 6.2, an exact fit is impossible to obtain using Stochastic OT, due to the difference in the way variation is generated by the two models. Thus it may be seen that Max-Ent is a more powerful class of models than Stochastic OT, though it is not clear how the Max-Ent model's descriptive power is related to generative linguistic theories like phonology.</Paragraph> <Paragraph position="3"> Footnote 13: The two runs here both use 0.002 and 0.0001 as the final plasticity. The initial plasticity and the number of iterations are set to 2 and 1.0e7. Slightly better fits can be found by tuning these parameters, but the observation remains the same. Footnote 14: See (Keller and Asudeh, 2002) for a summary.</Paragraph> <Paragraph position="1"> Although the abundance of well-behaved optimization algorithms has been pointed out in favor of Max-Ent models, it is the author's hope that the MCMC approach gives Stochastic OT a similar underpinning. However, complex Stochastic OT models often raise worries about identifiability, whereas the convexity property of Max-Ent may be viewed as an advantage.</Paragraph> </Section> </Paper>