<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1065"> <Title>Memory-Based Learning of Morphology with Stochastic Transducers</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Stochastic Transducers </SectionTitle> <Paragraph position="0"> It is possible to apply the EM algorithm to learn the parameters of stochastic transducers, (Ristad, 1997; Casacuberta, 1995; Clark, 2001a). (Clark, 2001a) showed how this approach could be used to learn morphology by starting with a randomly initialized model and using the EM algorithm to find a local maximum of the joint probabilities over the pairs of inflected and uninflected words. In addition rather than using the EM algorithm to optimize the joint probability it would be possible to use a gradient de-Computational Linguistics (ACL), Philadelphia, July 2002, pp. 513-520. Proceedings of the 40th Annual Meeting of the Association for scent algorithm to maximize the conditional probability. null The models used here are Stochastic Non-Deterministic Finite-State Transducers (FST), or Pair Hidden Markov Models (Durbin et al., 1998), a name that emphasizes the similarity of the training algorithm to the well-known Forward-Backward training algorithm for Hidden Markov Models.</Paragraph> <Paragraph position="1"> Instead of outputting symbols in a single stream, however, as in normal Hidden Markov Models they output them on two separate streams, the left and right streams. In general we could have different left and right alphabets; here we assume they are the same. At each transition the FST may output the same symbol on both streams, a symbol on the left stream only, or a symbol on the right stream only. I call these a0a2a1a3a1 , a0a2a1a5a4 and a0a6a4a7a1 outputs respectively. For each state a8 the sum of all these output parameters over the alphabet a9 must be one.</Paragraph> <Paragraph position="2"> Since we are concerned with finite strings rather than indefinite streams of symbols, we have in addition to the normal initial state a8 a4 , an explicit end state a8 a1 , such that the FST terminates when it enters this state. The FST then defines a joint probability distribution on pairs of strings from the alphabet. Though we are more interested in stochastic transductions, which are best represented by the conditional probability of one string given the other, it is more convenient to operate with models of the joint probability, and then to derive the conditional probability as needed later on.</Paragraph> <Paragraph position="3"> It is possible to modify the normal dynamic-programming training algorithm for HMMs, the Baum-Welch algorithm (Baum and Petrie, 1966) to work with FSTs as well. This algorithm will maximize the joint probability of the training data.</Paragraph> <Paragraph position="4"> We define the forward and backward probabilities as follows. 
<Paragraph position="4"> We define the forward and backward probabilities as follows. Given two strings $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$, we define the forward probability $\alpha_q(i,j)$ as the probability that the FST will start from the initial state $q_0$, output $x_1, \ldots, x_i$ on the left stream and $y_1, \ldots, y_j$ on the right stream, and be in state $q$; and the backward probability $\beta_q(i,j)$ as the probability that, starting from state $q$, it will output $x_{i+1}, \ldots, x_n$ on the left and $y_{j+1}, \ldots, y_m$ on the right and then terminate, i.e. end in state $q_1$.</Paragraph>
<Paragraph position="7"> We can calculate these using recurrence relations of the usual Forward-Backward form (a sketch is given below), in which the term corresponding to a $1{:}1$ output contributes only when $x_i$ is equal to $y_j$. Instead of the normal two-dimensional trellis discussed in standard works on HMMs, which has one dimension corresponding to the current state and one corresponding to the position, we have a three-dimensional trellis, with a dimension for the position in each string. With these modifications, we can use all of the standard HMM algorithms. In particular, we can use this as the basis of a parameter estimation algorithm using the expectation-maximization theorem. We use the forward and backward probabilities to calculate the expected number of times each transition will be taken; at each iteration we set the new values of the parameters to be the appropriately normalized sums of these expectations.</Paragraph>
<Paragraph position="8"> Given an FST and a string $x$, we often need to find the string $y$ that maximizes $P(x, y)$. This is equivalent to the task of finding the most likely string generated by an HMM, which is NP-hard (Casacuberta and de la Higuera, 2000), but it is possible to sample from the conditional distribution $P(y \mid x)$, which allows an efficient stochastic computation. If we consider only what is output on the left stream, the FST is equivalent to an HMM with null transitions corresponding to the $0{:}1$ transitions of the FST. We can remove these using standard techniques and then use the result to calculate the left backward probabilities for a particular string $x$: $\beta^{L}_q(i)$, defined as the probability that, starting from state $q$, the FST generates $x_{i+1}, \ldots, x_n$ on the left and terminates. Then if one samples from the FST, but weights each transition by the appropriate left backward probability, this is equivalent to sampling from the conditional distribution $P(y \mid x)$. We can then find the string $y$ that is most likely given $x$ by generating randomly from this conditional distribution: we sum $P(y \mid x)$ over all the distinct strings observed so far; if the difference between this sum and 1 is less than the maximum value of $P(y \mid x)$ among them, we know we have found the most likely $y$. In practice, the distributions we are interested in often have a $y$ with $P(y \mid x) > 0.5$; in this case we immediately know that we have found the maximum.</Paragraph>
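As a sketch of the recurrences referred to above: assuming a parameterization in which each move carries a combined transition-and-output probability $P(q' \to q, a{:}b)$ (an assumption of this sketch, not necessarily the paper's exact parameterization), the forward probabilities satisfy a recurrence of the following form; the backward recurrence is symmetric, and $[\cdot]$ is the Iverson bracket.

```latex
% Hedged sketch of the forward recurrence over the three-dimensional trellis
% (backward is symmetric). Assumes combined parameters P(q' -> q, a:b).
\begin{align*}
\alpha_{q_0}(0,0) &= 1, \\
\alpha_{q}(i,j) &= \sum_{q'} \Big( \alpha_{q'}(i-1,j-1)\, P(q' \to q,\; x_i{:}y_j)\, [x_i = y_j] \\
 &\qquad\quad + \alpha_{q'}(i-1,j)\, P(q' \to q,\; x_i{:}\epsilon)
   + \alpha_{q'}(i,j-1)\, P(q' \to q,\; \epsilon{:}y_j) \Big), \\
P(x,y) &= \sum_{q} \alpha_{q}(n,m)\, P(q \to q_1).
\end{align*}
```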
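The stochastic search described above can also be made concrete with a short sketch. This is not code from the paper: `sample_conditional` is a hypothetical helper standing in for the weighted sampling procedure (removing null transitions and weighting each transition by the left backward probabilities), assumed to return a sampled string together with its conditional probability.

```python
def most_likely_output(fst, x, sample_conditional, max_samples=10_000):
    """Stochastic search for argmax_y P(y | x), as described above.

    `sample_conditional(fst, x)` is an assumed helper returning (y, p), where
    y is a string sampled from P(y | x) and p is its conditional probability.
    """
    seen = {}                              # distinct sampled y -> P(y | x)
    best_y, best_p = None, 0.0
    for _ in range(max_samples):
        y, p = sample_conditional(fst, x)
        seen[y] = p
        if p > best_p:
            best_y, best_p = y, p
        covered = sum(seen.values())       # probability mass accounted for
        # Any y not sampled yet has probability at most 1 - covered, so once
        # that remainder falls below best_p (in particular once best_p > 0.5),
        # no unseen string can beat the current best.
        if best_p > 1.0 - covered:
            return best_y, True            # provably the most likely y
    return best_y, False                   # best found, optimality not proven
```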
<Paragraph position="13"> We then model the morphological process as a transduction from the lemma form to the inflected form, and assume that the model outputs, for each input, the output with the highest conditional or joint probability with respect to the model. There are a number of reasons why this simple approach will not work: first, for many languages the inflected form is specified lexically, not phonologically, and thus the model will not be able to identify the correct form; secondly, modelling all of the irregular exceptions in a single transduction is at present computationally intractable. One way to improve the efficiency is to use a mixture of models as discussed in Clark (2001a), each corresponding to a morphological paradigm. The productivity of each paradigm can be modelled directly, and the class of each lexical item can again be memorized.</Paragraph>
<Paragraph position="14"> There are a number of criticisms that can be made of this approach.</Paragraph>
<Paragraph position="15"> • Many of the models produced merely memorize a pair of strings; this is extremely inefficient.
• Though the model correctly captures the productivity of some morphological classes, it models this directly. A more satisfactory approach would be to have this arise naturally as an emergent property of other aspects of the model.</Paragraph>
<Paragraph position="16"> • These models may not be able to account for some psycholinguistic evidence that appears to require some form of proximity or similarity.</Paragraph>
<Paragraph position="17"> In the next section I shall present a technique that addresses these problems.</Paragraph> </Section> </Paper>