AUTOMATIC LEARNING OF WORD TRANSDUCERS 
FROM EXAMPLES 
Michel Gilloux 
Centre National d'Études des Télécommunications
LAA/SLC/AIA
Route de Trégastel, BP 40
F-22301 Lannion Cedex, FRANCE
e-mail: gilloux@lannion.cnet.fr
ABSTRACT
This paper describes the application of Markovian learning methods to the inference of word transducers. We show how the proposed method dispenses with the difficult design of hand-crafted rules, allows the use of weighted non-deterministic transducers, and is able to translate words by taking their whole form into account rather than by making decisions locally. These arguments are illustrated on two examples: morphological analysis and grapheme-to-phoneme transcription.
INTRODUCTION 
Several tasks associated with electronic lexicons may be viewed as transductions between character strings. This may be the decomposition of words into morphemes in morphology, or the grapheme-to-phoneme transcription in phonology. In the first case, one has for example to decompose the French word "chronométrage" into the sequence of affixes "chrono+mètre+er+age". In the second, "abstenir" should be translated into "abstəniR" or "apstəniR"¹.
¹ These two tasks are in fact closely related in that (1) the correct phoneme transcription may mirror an underlying morphological structure, as for "asocial", whose phonemic form is "asosjal" rather than "azosjal" due to the decomposition "a+social", and (2) the surface form of a derived word may depend on the pronunciation of its component morphemes, as for "dé+harnacher", which results in "déharnacher" and not "désharnacher".

Most of the proposed methods in the domain (Catach 1984; Danlos et al. 1986; Koskenniemi 1983; Laporte 1988; Ritchie et al. 1987; Tufiş 1989; Véronis 1988) are based on the availability of local rules whose combination, either through direct interpretation or after being compiled, forms the target transducer.
Although these methods make it possible, at least in theory, to design suitable transducers, provided that the rule description language has the right expressive power, they are complex to use because of the difficulty of writing down rules. Moreover, for a given rule language, there may not exist an algorithm for compiling rules into a form better suited to the translation process. Lastly, in numerous cases, the translation procedures are improperly deterministic, as shown by the example of "abstenir", so that it is not possible to consider several competing hypotheses in parallel, not to speak of ranking them according to some certainty factor.
We have designed a program which makes it possible to construct transducers without retaining the above shortcomings. It is no longer necessary to write down translation rules, since the transducer is obtained as the result of automatic learning over a set of examples. The transducer is represented in the language of probabilistic finite state automata (Markov models), so that its use is straightforward. Lastly, it produces results which are assigned a probability, and makes it possible to list them in decreasing order of likelihood.
After stating the problem of character string translation and defining the few central notions of Markovian learning, this paper describes their adaptation to the word translation problem in the learning and translating phases. This adaptation is illustrated through two applications: morphological analysis and grapheme-to-phoneme transcription.
THE TRANSDUCTION PROBLEM 
In the context of character string transduction, we look for an application $f: C^* \to C'^*$ which transforms certain words built over the alphabet $C$ into words over the alphabet $C'$. For example, in the case of grapheme-to-phoneme transcription, $C$ is the set of graphemes and $C'$ that of phonemes.
It may be appropriate, for example in morphology, to use an auxiliary lexicon (Ritchie et al. 1987; Ritchie 1989) which makes it possible to discard certain translation results. For example, the decomposition "sage" → "ser+age" would not be allowed because "ser" is not a verb in the French lexicon, although it is a correct result with respect to the splitting of word forms into affixes. The method we propose in this paper is only concerned with describing this last type of regularity, leaving aside all non-regular phenomena better described on a case-by-case basis, such as through a lexicon.
MARKOV MODELS 
A Markov model is a probabilistic finite state automaton $M = (S, T, A, s_I, s_F, g)$ where $S$ is a finite set of states, $A$ is a finite alphabet, $s_I \in S$ and $s_F \in S$ are two distinguished states called respectively the initial state and the final state, $T$ is a finite set of transitions, and $g$ is a function $g: t \in T \to (O(t), D(t), S(t), p(t)) \in S \times S \times A \times [0, 1]$ such that

$$\forall s \in S, \quad \sum_{\{t \mid O(t) = s\}} p(t) = 1$$

where $p(t)$ is the probability of reaching state $D(t)$ while generating symbol $S(t)$ starting from state $O(t)$.
In general, the transition probabilities $p(t)$ are mutually independent. Yet, in some contexts, it may be useful to have their values depend on other transitions. In this respect, it is possible to define a one-to-one correspondence $\tau_{s,s'}: \{t \mid O(t) = s\} \to \{t \mid O(t) = s'\}$ such that $p(t)$ is equal to $p(\tau_{s,s'}(t))$. States $s$ and $s'$ are then said to be tied.
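To make these definitions concrete, here is a minimal sketch in Python of one possible encoding of such a model. The class and field names (Transition, MarkovModel, check_stochastic) are ours, introduced only for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    origin: int    # O(t), the origin state
    dest: int      # D(t), the destination state
    symbol: str    # S(t), the symbol generated by the transition
    prob: float    # p(t), the transition probability

@dataclass
class MarkovModel:
    n_states: int                  # states are numbered 0 .. n_states - 1
    s_init: int                    # the distinguished initial state s_I
    s_final: int                   # the distinguished final state s_F
    transitions: list[Transition]

    def check_stochastic(self, tol: float = 1e-9) -> bool:
        # Check the constraint above: for every state s with outgoing
        # transitions, the probabilities p(t) with O(t) = s sum to 1.
        totals = [0.0] * self.n_states
        for t in self.transitions:
            totals[t.origin] += t.prob
        return all(abs(x - 1.0) <= tol for x in totals if x > 0.0)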
For every word $w = a_1 \cdots a_n \in A^*$, the set of partial paths compatible with $w$ up to $i$, $\mathrm{Path}_i(w)$, is the set of sequences of $i$ transitions $t_1 \cdots t_i$ such that $O(t_1) = s_I$, $D(t_j) = O(t_{j+1})$ for $j = 1, \ldots, i-1$, and $S(t_j) = a_j$ for $j = 1, \ldots, i$.

The set of complete paths compatible with $w$, $\mathrm{Path}(w)$, is in turn the set of elements of $\mathrm{Path}_{|w|}(w)$, where $|w| = n$ is the length of word $w$, such that $D(t_n) = s_F$.
The probability for the model $M$ of emitting the word $w$ is

$$\mathrm{Prob}_M(w) = \sum_{\mathit{path} \in \mathrm{Path}(w)} \; \prod_{t \in \mathit{path}} p(t)$$

A Markov model for which there exists at most one complete path for a given word is said to be unifilar. In this case, the above probability reduces to

$$\mathrm{Prob}_M(w) = \prod_{t \in \mathit{path}} p(t), \quad \text{if } \mathrm{Path}(w) = \{\mathit{path}\}$$
$$\mathrm{Prob}_M(w) = 0, \quad \text{if } \mathrm{Path}(w) = \emptyset$$
Thus the probability $P_M(w)$ must in general be computed by adding the probabilities observed along every path compatible with $w$. In practice, this renders the algorithm for computing $P_M(w)$ computationally expensive, and it is tempting to assume that the model is unifilar. Practical studies have shown that this sub-optimal method is applicable without great loss (Bahl et al. 1983).
Under this hypothesis, the probability $P_M(w)$ may be computed through the Viterbi dynamic programming algorithm. Indeed, $P_M(w, i, s)$, the maximal probability of reaching state $s$ with the $i$ first transitions of a path compatible with $w$, is

$$P_M(w, i, s) = \max_{\mathit{path}} \prod_{k \le i} p(t_k), \quad \mathit{path} \in \{t_1 \cdots t_i \in \mathrm{Path}_i(w) \mid D(t_i) = s\}$$

$$P_M(w, 0, s_I) = 1$$
$$P_M(w, 0, s) = 0, \quad \text{if } s \ne s_I$$

therefore

$$P_M(w, i+1, s) = \max_{\mathit{path}} \left( p(t_{i+1}) \cdot P_M(w, i, D(t_i)) \right), \quad \mathit{path} \in \{t_1 \cdots t_{i+1} \in \mathrm{Path}_{i+1}(w) \mid D(t_{i+1}) = s\}$$

whereby

$$P_M(w, i+1, s) = \max_{t} \left( p(t) \cdot P_M(w, i, O(t)) \right), \quad t \in \{t \in T \mid D(t) = s \text{ and } S(t) = a_{i+1}\}$$

with

$$\mathrm{Prob}_M(w) = P_M(w, |w|, s_F) = \prod_{t \in \mathrm{MaxPath}(w)} p(t)$$

It is therefore possible to compute $P_M(w, i, s)$ recursively for $i = 1, \ldots, n$ until $\mathrm{Prob}_M(w)$.
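The recursion translates directly into a few lines of Python; the sketch below reuses the MarkovModel encoding given earlier and computes $\mathrm{Prob}_M(w)$ under the unifilar hypothesis (names and details are ours).

def viterbi_prob(model: "MarkovModel", word: str) -> float:
    # best[s] holds P_M(w, i, s) for the current prefix length i.
    best = [0.0] * model.n_states
    best[model.s_init] = 1.0           # P_M(w, 0, s_I) = 1
    for a in word:
        new = [0.0] * model.n_states   # P_M(w, i + 1, .)
        for t in model.transitions:
            if t.symbol == a:          # S(t) = a_{i+1}
                cand = t.prob * best[t.origin]
                if cand > new[t.dest]:
                    new[t.dest] = cand # max over t with D(t) = s
        best = new
    return best[model.s_final]         # P_M(w, |w|, s_F)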
Automatic learning of Markov models
Given a training set TS made of words in $A^*$ and a number $N \ge 2$ of states, that is, the set $S$, learning a Markov model consists in finding a set $T$ of transitions such that the joint probability of the examples in the training set,

$$P(TS) = \prod_{w \in TS} \mathrm{Prob}_M(w),$$

is maximal.
In general, the set $T$ is composed a priori of all possible transitions between states in $S$ producing a symbol in $A$. The determination of the probabilities $p$ associated with these transitions is then equivalent to the restriction of $T$ to elements with non-null probability, which induces the structure of the associated automaton. In this case, the model is said to be hidden, because it is hard to attach a meaning to the states in $S$. On the contrary, it is possible to force those states to have a clear-cut interpretation by defining them, for example, as n-grams, that is, sequences of $n$ elements of $A$ which encode the last $n$ symbols produced by the model to reach the state. Clearly, only some transitions are then meaningful. In dealing with problems like those studied in the present paper, it is preferable to use hidden models, which allow states to stand for arbitrarily complex predicates.
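By way of contrast, a small sketch of the n-gram alternative just described (our illustration; the paper itself opts for hidden states): states are the sequences of the last n emitted symbols, and a transition is meaningful only when it shifts that window by the symbol it emits.

from itertools import product

def ngram_states(alphabet: list[str], n: int) -> list[tuple[str, ...]]:
    # One state per possible sequence of the last n emitted symbols.
    return list(product(alphabet, repeat=n))

def meaningful(src: tuple[str, ...], dst: tuple[str, ...], sym: str) -> bool:
    # src -> dst emitting sym is allowed only if dst is src shifted
    # left by one position, with sym appended.
    return dst == src[1:] + (sym,)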
The learning algorithm (Bahl et al. 1983) is based upon the following remark: given a model $M$ whose transition probabilities are known a priori, the a posteriori probability of a transition $t$ may be estimated by the relative frequency with which $t$ is used on a training set.

The number of times a transition $t$ is used on TS is

$$\mathrm{freq}(t) = \sum_{w \in TS} \; \sum_{t' \in \mathrm{MaxPath}(w)} \delta(t, t')$$

where $\delta(t, t') = 1$ if $t = t'$, $0$ otherwise.

The relative frequency of using $t$ on TS is

$$\mathrm{rel\text{-}freq}(t) = \frac{\mathrm{freq}(t)}{\sum_{\{t' \mid O(t') = O(t)\}} \mathrm{freq}(t')}$$
The learning algorithm then consists in setting the probability distribution $p(t)$ randomly and adjusting its values iteratively through the above formula until the adjustment is small enough to consider the distribution stationary. It has been shown (Bahl et al. 1983) that this algorithm does converge towards a stationary value of the $p(t)$ which maximizes locally² the probability $P$ of the training set, depending on the initial random probability distribution.

² In order to find a global optimum, we used a kind of simulated annealing technique (Kirkpatrick et al. 1983) during the learning process.
The stationary distribution defines the Markov model induced from the examples in TS³.
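Schematically, the whole procedure looks as follows. This is a sketch reusing the Transition encoding from the first example; max_path(model, w) is an assumed helper returning the most probable complete path of w (a Viterbi decoder extended with backpointers), and the annealing refinement of the footnote is left out.

def reestimate(model: "MarkovModel", train_set: list[str],
               max_iters: int = 100, eps: float = 1e-6) -> "MarkovModel":
    for _ in range(max_iters):
        # freq(t): count transition uses along the best paths over TS.
        freq = {t: 0.0 for t in model.transitions}
        for w in train_set:
            for t in max_path(model, w):   # t' in MaxPath(w), assumed helper
                freq[t] += 1.0
        # Normalise per origin state to obtain rel-freq(t).
        total = {}
        for t, f in freq.items():
            total[t.origin] = total.get(t.origin, 0.0) + f
        delta, updated = 0.0, []
        for t in model.transitions:
            # Origin states never visited keep their old distribution;
            # transitions never used on TS get probability 0.
            p = freq[t] / total[t.origin] if total.get(t.origin, 0.0) > 0.0 else t.prob
            delta = max(delta, abs(p - t.prob))
            updated.append(Transition(t.origin, t.dest, t.symbol, p))
        model.transitions = updated
        if delta < eps:                    # distribution is stationary
            break
    return model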
TRANSDUCTION MODEL 
To be applied to both illustrative examples, the general structure of Markov models should be related, by means of a shift in representation, to the problem of string translation. The model of two-level morphological analysis (Koskenniemi 1983) suggests the nature of this shift. Indeed, this method, which was successfully applied to morphologically rich natural languages (Koskenniemi 1983), is based upon a two-level rule formalism for which there exists a way to compile the rules into the language of finite state automata (FSA) (Ritchie 1989). This result validates the idea that FSAs are reasonable candidates for representing transduction rules, at least in the case of morphology⁴.

³ In practice, the number $N = \mathrm{Card}(S)$ of states for the model to be learned on a training set is not known. When $N$ is small, the model has a tendency to generate many more character strings than were in TS, due to overgeneralization. At the other end of the spectrum, when $N$ is large, the learned model will describe the examples in TS and them only. So it is among the intermediate values of $N$ that an optimum has to be looked for.

⁴ Ritchie (1989) has even shown that the generative power of two-level morphological analyzers is strictly bounded by that of finite state automata. He proved that all languages $L$ generated by these analyzers are such that whenever $E_2$, $E_3$ and $E_1E_2E_3E_4$ belong to $L$, then $E_2E_3$ belongs to $L$. Although this point was not considered in the present study, we may suppose that constraining the learned automaton to respect this last property, for example by means of tying states, would improve the overall results by augmenting in a sound way the generalization from examples.

The shift in representation is designed so as to define the alphabet $A$ as the set of pairs $c{:}{-}$ or ${-}{:}c'$ where $c \in C$ and $c' \in C'$, with $-$ standing for the null character, $- \notin C$, $- \notin C'$. The mapping between the transducer $f$ and the associated Markov model $M$ is now straightforward:
$w' \in f(w)$ iff there exist $x = x_1 \cdots x_n \in (C \cup \{-\})^*$ and $y = y_1 \cdots y_n \in (C' \cup \{-\})^*$ such that

$x_i{:}y_i$ is of the form ${-}{:}c'$ or $c{:}{-}$, for $i = 1, \ldots, n$,
$\mathrm{Prob}_M(x_1{:}y_1 \cdots x_n{:}y_n) \ne 0$,
$w = \mathrm{delete}(x)$ and $w' = \mathrm{delete}(y)$,

where the function delete is defined by $\mathrm{delete}(\lambda) = \lambda$ ($\lambda$ is the empty string), $\mathrm{delete}({-}Z) = \mathrm{delete}(Z)$, and $\mathrm{delete}(zZ) = z \cdot \mathrm{delete}(Z)$ if $z \in C$ or $z \in C'$.
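In code, the shift in representation is short; the sketch below (our names) builds the pair symbols and implements the delete function.

EPS = "-"   # the null character; it belongs to neither C nor C'

def make_pairs(x: str, y: str) -> list[str]:
    # Build the pair symbols x_i:y_i; each position must consume an
    # input character (c:-) or produce an output character (-:c').
    assert len(x) == len(y)
    pairs = []
    for xi, yi in zip(x, y):
        assert (xi == EPS) != (yi == EPS), "exactly one side must be null"
        pairs.append(f"{xi}:{yi}")
    return pairs

def delete(z: str) -> str:
    # delete(-Z) = delete(Z); delete(zZ) = z . delete(Z) otherwise.
    return z.replace(EPS, "")

# A toy alignment of w = "do" with w' = "du":
print(make_pairs("d-o-", "-d-u"))        # ['d:-', '-:d', 'o:-', '-:u']
print(delete("d-o-"), delete("-d-u"))    # do du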
Given a training set $TS = \{\langle w, w' \rangle \mid w \in C^*, w' \in C'^*\}$, the problem is thus to find the model $M$ that maximizes the probability

$$P = \prod_{\langle w, w' \rangle \in TS} \; \max_{(x, y)} \mathrm{Prob}_M(x_1{:}y_1 \cdots x_n{:}y_n)$$

where $\mathrm{delete}(x) = w$ and $\mathrm{delete}(y) = w'$.

This formula makes clear the new difficulty with this type of learning, namely the indetermination of the words $x$ and $y$, that is, of the alignment they induce between $w$ and its translation $w'$. The notions of partial and complete compatible paths should thus be redefined in order to take this into account.
The partial paths compatible with $w$ and $w'$ up to $i$ and $j$ now form the set $\mathrm{Path}_{i,j}(w, w')$ of sequences $t_1 \cdots t_{i+j}$ such that $O(t_1) = s_I$, $D(t_k) = O(t_{k+1})$ for $k = 1, \ldots, i+j-1$, $S(t_k) = x_k{:}y_k$ for $k = 1, \ldots, i+j$, $\mathrm{delete}(x_1 \cdots x_{i+j}) = w_1 \cdots w_i$ and $\mathrm{delete}(y_1 \cdots y_{i+j}) = w'_1 \cdots w'_j$. A partial path is also complete as soon as $i = |w|$, $j = |w'|$ and $D(t_{|w|+|w'|}) = s_F$.

As before, we can define the probability $P_M(w, i, w', j, s)$ of reaching state $s$ along a partial path compatible with $w$ and $w'$ while generating the first $i$ symbols of $w$ and the first $j$ symbols of $w'$:

$$P_M(w, i, w', j, s) = \max_{t_1 \cdots t_{i+j}} \prod_{k \le i+j} p(t_k), \quad t_1 \cdots t_{i+j} \in \{\mathrm{Path}_{i,j}(w, w') \mid D(t_{i+j}) = s\}$$

$$P_M(w, 0, w', 0, s_I) = 1$$
$$P_M(w, 0, w', 0, s) = 0, \quad \text{if } s \ne s_I$$
Here again, this probability is such that $\mathrm{Prob}_M(w, w') = P_M(w, |w|, w', |w'|, s_F)$ and may be computed through dynamic programming according to the formula

$$P_M(w, i+1, w', j+1, s) = \max \left( \max_{t_1} \, p(t_1) \cdot P_M(w, i, w', j+1, O(t_1)), \; \max_{t_2} \, p(t_2) \cdot P_M(w, i+1, w', j, O(t_2)) \right)$$

where $t_1 \in \{t \in T \mid D(t) = s \text{ and } S(t) = w_{i+1}{:}{-}\}$ and $t_2 \in \{t \in T \mid D(t) = s \text{ and } S(t) = {-}{:}w'_{j+1}\}$.
It is now possible to compute, for every training example, the optimal path corresponding to a given probability distribution $p(t)$. This path defines not only the states crossed but also the alignment between $w$ and $w'$. The learning algorithm applicable to general Markovian models remains valid for iteratively adjusting the probabilities $p(t)$.
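This two-dimensional recursion is again a few lines of dynamic programming; the sketch below keeps the conventions of the earlier examples, with pair symbols written "c:-" and "-:c'".

def transduction_prob(model: "MarkovModel", w: str, w_out: str) -> float:
    n, m = len(w), len(w_out)
    # best[i][j][s] = P_M(w, i, w', j, s)
    best = [[[0.0] * model.n_states for _ in range(m + 1)]
            for _ in range(n + 1)]
    best[0][0][model.s_init] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            for t in model.transitions:
                # t_1: consume the next input character, S(t) = w_{i+1}:-
                if i < n and t.symbol == f"{w[i]}:-":
                    cand = t.prob * best[i][j][t.origin]
                    if cand > best[i + 1][j][t.dest]:
                        best[i + 1][j][t.dest] = cand
                # t_2: produce the next output character, S(t) = -:w'_{j+1}
                if j < m and t.symbol == f"-:{w_out[j]}":
                    cand = t.prob * best[i][j][t.origin]
                    if cand > best[i][j + 1][t.dest]:
                        best[i][j + 1][t.dest] = cand
    return best[n][m][model.s_final]   # Prob_M(w, w')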
EXPERIMENTS 
Morphological analysis 
As a preliminary experiment, the morphological analysis automaton was learned on a set of 738 French words ending with the morpheme "isme" and associated with their decomposition into two morphemes, the first being a noun or an adjective. For example, we had the pair <"athlétisme", "athlète+isme">. With an automaton of only 400 states, the correct decomposition was found among the 10 most probable outputs for 97.6% of the training examples⁵.

⁵ We are aware that a more precise assessment of the method would use a test set different from the training set. We plan to perform such a test in the near future.
Grapheme-to-phoneme transcription
The case of grapheme-to-phoneme transcription is a straightforward application of the transduction model. String $w$ is the graphemic form, e.g. "abstenir", and $w'$ is its transcription into phonemes, e.g. "apstəniR" or "abstəniR". Here the training set may feature pairs <w, w'> and <w, w''> where $w' \ne w''$.
The automaton was learned on a set of 1170 acronyms associated with their phonemic form, described in a coarse phonemic alphabet where, for example, open and closed /o/ are not distinguished. Acronyms raise an interesting problem in that some should be spelled letter by letter ("ACL") whereas others may be pronounced ("COLING"). This experiment was thus intended to show that the model may take its input into account as a whole. With an automaton of only 400 states, more than 50% of the training examples were correctly transcribed when only the most probable output was considered. This figure may be improved by augmenting the number of states, in which case the learning phase becomes much longer.
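The paper does not spell out the procedure used to rank the outputs. One simple possibility, sketched below under the conventions of the earlier examples, is a beam search over partial alignments; the beam width and the bound on path length are heuristic choices of ours, not part of the described method.

import heapq

def transcribe(model: "MarkovModel", w: str, beam: int = 50, top: int = 10):
    # A hypothesis is (probability, input position, state, output so far).
    hyps = [(1.0, 0, model.s_init, "")]
    done = []
    for _ in range(2 * (len(w) + 4)):          # heuristic length bound
        new = []
        for prob, i, s, out in hyps:
            if i == len(w) and s == model.s_final:
                done.append((prob, out))        # complete translation
                continue
            for t in model.transitions:
                if t.origin != s:
                    continue
                if i < len(w) and t.symbol == f"{w[i]}:-":
                    new.append((prob * t.prob, i + 1, t.dest, out))
                elif t.symbol.startswith("-:"):
                    new.append((prob * t.prob, i, t.dest, out + t.symbol[2:]))
        hyps = heapq.nlargest(beam, new)        # prune to the beam width
        if not hyps:
            break
    return sorted(done, reverse=True)[:top]     # decreasing probability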
CONCLUSION 
We have proposed a method for learning transducers for the tasks of morphological analysis and grapheme-to-phoneme transcription. This method compares favorably with other solutions based upon writing rules, in the sense that it does not require rules to be identified, it provides a result which is directly usable as a transducer, and it allows translations to be listed in decreasing order of probability. Yet the learned automaton does not lend itself to an interpretation in the form of symbolic rules, provided that such rules exist. Moreover, some learning parameters are set only as the result of empirical or random choices: the number of states, the initial probability distribution, etc.

Still, other advantages weigh in favor of the proposed method. The automaton may take into account the whole word to be translated rather than a limited part of it, which explains why a set of equivalent symbolic rules is hard to obtain. For example, the grapheme-to-phoneme transcription may recognize the original language of a word while translating it (Oshika et al. 1988): the "French" nouns "meeting" and "carpaccio" have kept respectively their original English and Italian form and pronunciation. The learned automaton is symmetrical, and thus it is also reversible. In other words, the morphological analysis automaton may also be used as a generator, and the grapheme-to-phoneme automaton may become a phoneme-to-grapheme transducer. Another remark is in order: since the automaton is reversible, it may be composed with its inverse to form, for example, a grapheme-to-grapheme translator that keeps the phonemic form constant without actually computing it. It has been shown elsewhere (Reape and Thompson 1988) that the resulting transducer is also describable in the formalism of finite state automata and that its number of states is bounded above by the square of the number of states of the base automaton; Reape and Thompson (1988) also describe an algorithm for computing the resulting automaton.

Lastly, functions other than morphological analysis or grapheme-to-phoneme transcription may be envisioned, such as the decomposition of words into syllables or the computation of abbreviations by contraction.
REFERENCES 
Bahl, L. R.; Jelinek, F.; and Mercer, R. L. 1983. "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, 5(2):179-190.

Catach, N. 1984. "La phonétisation automatique du français," CNRS.

Danlos, L.; Laporte, E.; and Emerard, F. 1986. "Synthesis of Spoken Messages from Semantic Representations (Semantic-Representation-to-Speech System)," Proc. of the 11th Intern. Conf. on Computational Linguistics, 599-604.

Kirkpatrick, S.; Gelatt, C. D.; and Vecchi, M. P. 1983. "Optimization by Simulated Annealing," Science, 220:671-680.

Koskenniemi, K. 1983. "Two-Level Model for Morphological Analysis," Proc. of the Eighth Intern. Joint Conf. on Artificial Intelligence, 683-685.

Laporte, E. 1988. Méthodes algorithmiques et lexicales de phonétisation de textes, Thèse de doctorat, Université Paris 7.

Oshika, B. T.; Evans, B.; Machi, F.; and Tom, J. 1988. "Computational Techniques for Improved Name Search," Proc. of the Second Conf. on Applied Natural Language Processing, 203-210.

Reape, M. and Thompson, H. 1988. "Parallel Intersection and Serial Composition of Finite State Transducers," Proc. of COLING'88, 535-539.

Ritchie, G. D.; Pulman, S. G.; Black, A. W.; and Russell, G. J. 1987. "A Computational Framework for Lexical Description," Computational Linguistics, 13(3-4):290-307.

Ritchie, G. 1989. "On the Generative Power of Two-Level Morphological Rules," Proc. of the Fourth Conf. of the European Chapter of the ACL, 51-57.

Tufiş, D. 1989. "It Would Be Much Easier If WENT Were GOED," Proc. of the Fourth Conf. of the European Chapter of the ACL, 145-152.

Véronis, J. 1988. "Correction of Phonographic Errors in Natural Language Interfaces," 11th ACM-SIGIR Conf., 101-115.
