<?xml version="1.0" standalone="yes"?>
<Paper uid="E91-1019">
  <Title>AUTOMATIC LEARNING OF WORD TRANSDUCERS FROM EXAMPLES</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE TRANSDUCTION PROBLEM
</SectionTitle>
    <Paragraph position="0"> In the context of character string transduction, we look for an application f: C* --&gt; C'* which transforms certain words built over the alphabet C into words over the alphabet C'. For example, in the case of grapheme-to-phoneme transcription, C is the set of graphemes and C' that of phonemes. It may be appropriate, for example in morphology, to use an auxiliary lexicon (Ritchie et al. 1987; Ritchie 1989) which allows one to discard certain translation results. For example, the decomposition &amp;quot;sage&amp;quot; --&gt; &amp;quot;ser+age&amp;quot; would not be allowed because &amp;quot;ser&amp;quot; is not a verb in the French lexicon, although this is a correct result with respect to the splitting of word forms into affixes.</Paragraph>
    <Paragraph position="1"> The method we propose in this paper is only concerned with describing this last type of regularity, leaving aside all non-regular phenomena better described on a case-by-case basis, such as through a lexicon.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
MARKOV MODELS
</SectionTitle>
    <Paragraph position="0"> A Markov model is a probabilistic finite state automaton M = (S, T, A, s_I, s_F, g) where S is a finite set of states, A is a finite alphabet, s_I ∈ S and s_F ∈ S are two distinguished states called respectively the initial state and the final state, T is a finite set of transitions, and g is a function g: t ∈ T --&gt; (O(t), S(t), D(t), p(t))</Paragraph>
    <Paragraph position="2"> where p(t) is the probability of reaching state D(t) while generating symbol S(t) starting from state O(t).</Paragraph>
    <Paragraph position="3"> In general, the transition probabilities p(t) are mutually independent. Yet, in some contexts, it may be useful to have their values depend on other transitions.</Paragraph>
    <Paragraph position="4"> In this respect, it is possible to define a one-to-one correspondence x_{s,s'}: {t | O(t) = s} --&gt; {t | O(t) = s'} such that p(t) is equal to p(x_{s,s'}(t)). States s and s' are then said to be tied. For every word w = a_1 ... a_n ∈ A*, the set of partial paths compatible with w till i, Path_i(w), is the set of sequences of i transitions t_1 ... t_i such that O(t_1) = s_I, D(t_j) = O(t_{j+1}) for j = 1, ..., i-1, and S(t_j) = a_j for j = 1, ..., i.</Paragraph>
    <Paragraph position="5"> The set of complete paths compatible with w, Path(w), is in turn the set of elements in Path_{|w|}(w), where |w| = n is the length of the word w, such that D(t_n) = s_F.</Paragraph>
    <Paragraph position="6"> The probability for the model M of emitting the word w is P_M(w) = Σ_{t_1 ... t_n ∈ Path(w)} Π_{i=1,...,n} p(t_i).</Paragraph>
    <Paragraph position="8"> A Markov model for which there exists at most one complete path for a given word is said to be unifilar. In this case, the above probability reduces to P_M(w) = Π_{i=1,...,n} p(t_i), where t_1 ... t_n is the unique element of Path(w).</Paragraph>
    <Paragraph position="10"> Thus the probability P_M(w) may in general be computed by adding the probabilities observed along every path compatible with w. In practice, this makes the algorithm for computing P_M(w) computationally expensive, and it is tempting to assume that the model is unifilar. Practical studies have shown that this sub-optimal method is applicable without great loss (Bahl et al. 1983).</Paragraph>
    <Paragraph position="12"> Under this hypothesis, the probability P_M(w) may be computed through the Viterbi dynamic programming algorithm. Indeed, the probability P_M(w, i, s), the maximal probability of reaching state s with the first i transitions in a path compatible with w, satisfies the recurrence P_M(w, i, s) = max {P_M(w, i-1, O(t)) p(t) | t ∈ T, D(t) = s, S(t) = a_i}.</Paragraph>
    <Paragraph position="14"> It is therefore possible to compute P_M(w, i, s) recursively for i = 1, ..., n until P_M(w) is obtained.</Paragraph>
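The recursive computation just described can be sketched as follows; this is an illustrative Python rendering, not the authors' code, and the transition encoding (origin, symbol, destination, probability) and the toy model are our assumptions:

```python
def viterbi_word_prob(transitions, word, s_init, s_final, states):
    """Maximal probability over complete paths compatible with `word`."""
    # P[s] = maximal probability of reaching state s after the prefix read so far
    P = {s: (1.0 if s == s_init else 0.0) for s in states}
    for a in word:
        P_next = {s: 0.0 for s in states}
        for (o, sym, d, p) in transitions:
            if sym == a and P[o] * p > P_next[d]:
                P_next[d] = P[o] * p
        P = P_next
    return P[s_final]

# Toy unifilar-style model over A = {'a', 'b'} with three states
trans = [('I', 'a', 'q', 0.6), ('I', 'b', 'q', 0.4),
         ('q', 'a', 'F', 0.5), ('q', 'b', 'F', 0.5)]
print(viterbi_word_prob(trans, "ab", 'I', 'F', ['I', 'q', 'F']))  # 0.3
```

Words with no complete path (here, any word whose length is not 2) get probability 0.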
    <Paragraph position="15"> Automatic learning of Markov models Given a training set TS made of words in A* and a number N &gt; 2 of states, that is, the set S, learning a Markov model consists in finding a set T of transitions such that the joint probability P of the examples in the training set, P = Π_{w ∈ TS} P_M(w), is maximal.</Paragraph>
    <Paragraph position="18"> In general, the set T is composed a priori of all possible transitions between states in S producing a symbol in A. The determination of the probabilities p associated with these transitions is equivalent to the restriction of T to elements with non-null probability, which induces the structure of the associated automaton. In this case, the model is said to be hidden because it is hard to attach a meaning to the states in S. On the contrary, it is possible to force those states to have a clear-cut interpretation by defining them, for example, as n-grams, sequences of n elements in A which encode the last n symbols produced by the model to reach the state. It is clear that then only some transitions are meaningful. In dealing with problems like those studied in the present paper it is preferable to use hidden models, which allow states to stand for arbitrarily complex predicates.</Paragraph>
    <Paragraph position="19"> The learning algorithm (Bahl et al. 1983) is based upon the following remark: given a model M whose transition probabilities are known a priori, the a posteriori probability of a transition t may be estimated by the relative frequency with which t is used on a training set.</Paragraph>
    <Paragraph position="21"> The number of times a transition t is used on TS is n(t) = Σ_{w ∈ TS} Σ_{t' ∈ Path(w)} δ(t, t'), where δ(t, t') = 1 if t = t', 0 otherwise. The relative frequency of using t on TS is then f(t) = n(t) / Σ_{t' | O(t') = O(t)} n(t').</Paragraph>
    <Paragraph position="25"> The learning algorithm then consists in setting the probability distribution p(t) randomly and adjusting its values iteratively through the above formula until the adjustment is small enough to consider the distribution stationary. It has been shown (Bahl et al. 1983) that this algorithm does converge towards a stationary value of the p(t) which maximizes locally1 the probability P of the training set, depending on the initial random probability distribution.</Paragraph>
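One adjustment step of this scheme can be sketched as follows, under the unifilar (Viterbi-path) approximation. This is our own illustrative rendering: the transition encoding, the helper names and the normalisation over transitions sharing an origin state are assumptions, not the authors' implementation.

```python
from collections import defaultdict

def viterbi_path(trans, probs, word, s_init, s_final):
    """Best complete path compatible with `word`, as a list of transitions."""
    # layers[i][s] = (probability, best transition entering s at step i)
    layers = [{s_init: (1.0, None)}]
    for i, a in enumerate(word):
        layer = {}
        for t in trans:
            o, sym, d = t
            if sym == a and o in layers[i]:
                p = layers[i][o][0] * probs[t]
                if p > layer.get(d, (0.0, None))[0]:
                    layer[d] = (p, t)
        layers.append(layer)
    if layers[-1].get(s_final, (0.0, None))[0] == 0.0:
        return None
    path, s = [], s_final
    for i in range(len(word), 0, -1):   # follow back-pointers
        t = layers[i][s][1]
        path.append(t)
        s = t[0]
    return path[::-1]

def reestimate(trans, probs, ts, s_init, s_final):
    """One iteration: relative frequency of each transition's use on TS."""
    n = defaultdict(float)
    for w in ts:
        for t in viterbi_path(trans, probs, w, s_init, s_final) or []:
            n[t] += 1.0
    out = defaultdict(float)            # total usage per origin state
    for t, c in n.items():
        out[t[0]] += c
    return {t: n[t] / out[t[0]] if out[t[0]] else probs[t] for t in trans}
```

Repeating `reestimate` until the distribution no longer changes yields the stationary p(t) mentioned above.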
    <Paragraph position="26"> \]In order to find a global optimum, we used a kind of simulated annealing technique (Kirkpatrick et al. 1983) during the learning process. 10~ The stationary distribution defines the Markov model induced from the examples in TS i.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TRANSDUCTION MODEL
</SectionTitle>
    <Paragraph position="0"> To be applied to both illustrative examples, the general structure of Markov models should be related, by means of a shift in representation, to the problem of string translation. The model of two-level morphological analysis (Koskenniemi 1983) suggests the nature of this shift. Indeed, this method, which was successfully applied to morphologically rich natural languages (Koskenniemi 1983), is based upon a two-level rule formalism for which there exists a way to compile the rules into the language of finite state automata (FSA) (Ritchie 1989). This result validates the idea that FSAs are reasonable candidates for representing transduction rules, at least in the case of morphology2.</Paragraph>
    <Paragraph position="1"> The shift in representation is designed so as to define the alphabet A as the set of pairs c:- or -:c', where c ∈ C and c' ∈ C', - standing for the null character, - ∉ C, - ∉ C'. The mapping between the transducer f and the associated Markov model M is now straightforward: 1 In practice, the number N = Card(S) of states for the model to be learned on a training set is not known. When N is small, the model has a tendency to generate many more character strings than were in TS, due to overgeneralization. At the other end of the spectrum, when N is large, the learned model will describe the examples in TS and them only. So it is among the intermediate values of N that an optimum has to be looked for. 2 Ritchie (1989) showed that the generative power of two-level morphological analyzers is strictly bounded by that of finite state automata. He proved that all languages L generated by these analyzers are such that whenever E1E2, E3 and E1E2E3E4 belong to L, then E2E3 belongs to L. Although this point was not considered in the present study, we may suppose that constraining the learned automaton to respect this last property, for example by means of tying states, would improve the overall results by augmenting in a sound way the generalization from examples.</Paragraph>
    <Paragraph position="3"> where the function delete is defined as delete(-) = ε, delete(c) = c for c ≠ -, and delete(ua) = delete(u)delete(a).</Paragraph>
    <Paragraph position="5"> Given a training set TS = {&lt;w, w'&gt; | w ∈ C*, w' ∈ C'*}, the problem is thus to find the model M that maximizes the probability P = Π_{(w, w') ∈ TS} max_{(x, y)} Prob_M(x_1:y_1 ... x_n:y_n), where delete(x) = w and delete(y) = w'. This formula makes clear the new difficulty with this type of learning, namely the indetermination of the words x and y, that is, of the alignment they induce between w and its translation w'. The notions of partial and complete compatible paths should thus be redefined in order to take this into account.</Paragraph>
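A minimal sketch of this representation shift: every symbol of A pairs a character with the null character, and delete projects the null characters out to recover the surface words. The particular alignment shown for the paper's "sage" --&gt; "ser+age" example is one of several possible; the helper name is ours.

```python
NULL = '-'

def delete(s):
    """Erase the null character from one side of a pair string."""
    return ''.join(c for c in s if c != NULL)

# One possible alignment of "sage" -> "ser+age" as a sequence of x:y pairs,
# written here as the two projected sides (position k holds x_k and y_k):
x = "s----a-g-e-"   # input sides of the pairs
y = "-ser+-a-g-e"   # output sides of the pairs
print(delete(x), delete(y))  # sage ser+age
```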
    <Paragraph position="6"> The partial paths compatible with w and w' till i and j, Path_{i,j}(w, w'), are now the set of sequences t_1 ... t_{i+j} such that O(t_1) = s_I, D(t_k) = O(t_{k+1}) for k = 1, ..., i+j-1, S(t_k) = x_k:y_k for k = 1, ..., i+j, delete(x_1 ... x_{i+j}) = w_1 ... w_i and delete(y_1 ... y_{i+j}) = w'_1 ... w'_j. A partial path is also complete as soon as i = |w|, j = |w'| and D(t_{|w|+|w'|}) = s_F. As before, we can define the probability P_M(w, i, w', j, s) of reaching state s along a partial path compatible with w and w', generating the first i symbols in w and the first j symbols in w'.</Paragraph>
    <Paragraph position="7"> P_M(w, i, w', j, s) = max_{t_1 ... t_{i+j} ∈ {Path_{i,j}(w, w') | D(t_{i+j}) = s}} Π_{k ≤ i+j} p(t_k), with P_M(w, 0, w', 0, s_I) = 1 and P_M(w, 0, w', 0, s) = 0 if s ≠ s_I. Here again, this probability is such that Prob_M(w, w') = P_M(w, |w|, w', |w'|, s_F) and may be computed through dynamic programming according to the recurrence P_M(w, i, w', j, s) = max ({P_M(w, i-1, w', j, O(t)) p(t) | D(t) = s, S(t) = w_i:-} ∪ {P_M(w, i, w', j-1, O(t)) p(t) | D(t) = s, S(t) = -:w'_j}).</Paragraph>
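The two-dimensional recurrence can be sketched as follows. This is an illustrative Python rendering with our own encoding (each transition carries a pair (x, y) with exactly one non-null side, as in the shifted alphabet A); the model and names are assumptions, not the paper's code.

```python
NULL = '-'

def transduction_prob(trans, w, wp, s_init, s_final, states):
    """Max probability over complete paths compatible with the pair (w, wp)."""
    n, m = len(w), len(wp)
    # P[i][j][s]: best partial path generating w[:i] and wp[:j], ending in s
    P = [[dict.fromkeys(states, 0.0) for _ in range(m + 1)]
         for _ in range(n + 1)]
    P[0][0][s_init] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (o, x, y, d, p) in trans:
                di = 0 if x == NULL else 1   # does the pair consume w?
                dj = 0 if y == NULL else 1   # does the pair consume wp?
                pi, pj = i - di, j - dj
                if di + dj == 0 or pi < 0 or pj < 0:
                    continue
                if (di and w[pi] != x) or (dj and wp[pj] != y):
                    continue
                cand = P[pi][pj][o] * p
                if cand > P[i][j][d]:
                    P[i][j][d] = cand
    return P[n][m][s_final]

# Toy model mapping "ab" to "b": read 'a' silently, read 'b', then emit 'b'
trans = [('I', 'a', NULL, 'q', 0.7), ('I', NULL, 'b', 'q', 0.3),
         ('q', 'b', NULL, 'r', 1.0), ('r', NULL, 'b', 'F', 1.0)]
print(transduction_prob(trans, "ab", "b", 'I', 'F', ['I', 'q', 'r', 'F']))
```

Tracking back-pointers through the same table would additionally recover the optimal alignment between w and w' mentioned in the next paragraph.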
    <Paragraph position="9"> It is now possible to compute for every training example the optimal path corresponding to a given probability distribution p(t). This path not only defines the states crossed but also the alignment between w and w'. The learning algorithm applicable to general Markovian models remains valid for adjusting the probabilities p(t) iteratively.</Paragraph>
  </Section>
</Paper>