Finite-State Transducers in Language and 
Speech Processing 
Mehryar Mohri* 
AT&T Labs-Research 
Finite-state machines have been used in various domains of natural language processing. We 
consider here the use of a type of transducer that supports very efficient programs: sequential 
transducers. We recall classical theorems and give new ones characterizing sequential string-to- 
string transducers. Transducers that output weights also play an important role in language and 
speech processing. We give a specific study of string-to-weight transducers, including algorithms 
for determinizing and minimizing these transducers very efficiently, and characterizations of the 
transducers admitting determinization and the corresponding algorithms. Some applications of 
these algorithms in speech recognition are described and illustrated. 
1. Introduction 
Finite-state machines have been used in many areas of computational linguistics. Their 
use can be justified by both linguistic and computational arguments. Linguistically, 
finite automata are convenient since they allow one to describe easily most of the 
relevant local phenomena encountered in the empirical study of language. They often 
lead to a compact representation of lexical rules, or idioms and cliches, that appears 
natural to linguists (Gross 1989). Graphic tools also allow one to visualize and modify 
automata, which helps in correcting and completing a grammar. Other more general 
phenomena, such as parsing context-free grammars, can also be dealt with using finite- 
state machines such as RTN's (Woods 1970). Moreover, the underlying mechanisms in 
most of the methods used in parsing are related to automata. 
From the computational point of view, the use of finite-state machines is mainly 
motivated by considerations of time and space efficiency. Time efficiency is usually 
achieved using deterministic automata. The output of deterministic machines depends, 
in general linearly, only on the input size and can therefore be considered optimal from 
this point of view. Space efficiency is achieved with classical minimization algorithms 
(Aho, Hopcroft, and Ullman 1974) for deterministic automata. Applications such as 
compiler construction have shown deterministic finite automata to be very efficient 
in practice (Aho, Sethi, and Ullman 1986). Finite automata now also constitute a rich 
chapter of theoretical computer science (Perrin 1990). 
Their recent applications in natural language processing, which range from the 
construction of lexical analyzers (Silberztein 1993) and the compilation of morpho-
logical and phonological rules (Kaplan and Kay 1994; Karttunen, Kaplan and Zaenen
1992) to speech processing (Mohri, Pereira, and Riley 1996) show the usefulness of 
finite-state machines in many areas. In this paper, we provide theoretical and algo- 
rithmic bases for the use and application of the devices that support very efficient 
programs: sequential transducers. 
* 600 Mountain Avenue, Murray Hill, NJ 07974, USA. 
© 1997 Association for Computational Linguistics
Computational Linguistics Volume 23, Number 2 
We extend the idea of deterministic automata to transducers with deterministic 
input, that is, machines that produce output strings or weights in addition to (deter- 
ministically) accepting input. Thus, we describe methods consistent with the initial 
reasons for using finite-state machines, in particular the time efficiency of determinis- 
tic machines, and the space efficiency achievable with new minimization algorithms 
for sequential transducers. 
Both time and space concerns are important when dealing with language. Indeed, 
one of the recent trends in language studies is a large increase in the size of data sets. 
Lexical approaches have been shown to be the most appropriate in many areas of 
computational linguistics ranging from large-scale dictionaries in morphology to large 
lexical grammars in syntax. The effect of the size increase on time and space efficiency 
is probably the main computational problem of language processing. 
The use of finite-state machines in natural language processing is certainly not 
new. The limitations of the corresponding techniques, however, are pointed out more 
often than their advantages, probably because recent work in this field is not yet 
described in computer science textbooks. Sequential finite-state transducers are now 
used in all areas of computational linguistics. 
In the following sections, we give an extended description of these devices. We 
first consider string-to-string transducers, which have been successfully used in the 
representation of large-scale dictionaries, computational morphology, and local gram- 
mars and syntax, and describe the theoretical bases for their use. In particular, we 
recall classical theorems and provide some new ones characterizing these transducers. 
We then consider the case of sequential string-to-weight transducers. Language 
models, phone lattices, and word lattices are among the objects that can be represented 
by these transducers, making them very interesting from the point of view of speech 
processing. We give new theorems extending the known characterizations of string- 
to-string transducers to these transducers. We define an algorithm for determinizing 
string-to-weight transducers, characterize the unambiguous transducers admitting de- 
terminization, and describe an algorithm to test determinizability. We also give an 
algorithm to minimize sequential transducers that has a complexity equivalent to that 
of classical automata minimization and that is very efficient in practice. Under cer- 
tain restrictions, the minimization of sequential string-to-weight transducers can also 
be performed using the determinization algorithm. We describe the corresponding 
algorithm and give the proof of its correctness in the appendix. 
We have used most of these algorithms in speech processing. In the last section, we 
describe some applications of determinization and minimization of string-to-weight 
transducers in speech recognition, illustrating them with several results that show 
them to be very efficient. Our implementation of the determinization is such that it 
can be used on the fly: only the necessary part of the transducer needs to be expanded. 
This plays an important role in the space and time efficiency of speech recognition. 
The reduction in the size of word lattices that these algorithms provide sheds new 
light on the complexity of the networks involved in speech processing. 
2. Sequential String-to-String Transducers 
Sequential string-to-string transducers are used in various areas of natural language 
processing. Both determinization (Mohri 1994c) and minimization algorithms (Mohri 
1994b) have been defined for the class of p-subsequential transducers, which includes 
sequential string-to-string transducers. In this section, the theoretical basis of the use 
of sequential transducers is described. Classical and new theorems help to indicate 
the usefulness of these devices as well as their characterization. 
Mohri Transducers in Language and Speech 
Figure 1 
Example of a sequential transducer. 
2.1 Sequential Transducers 
We consider here sequential transducers, namely, transducers with a deterministic
input. At any state of such transducers, at most one outgoing arc is labeled with a
given element of the alphabet. Figure 1 gives an example of a sequential transducer.
Notice that output labels might be strings, including the empty string ε. The empty
string is not allowed on input, however. The output of a sequential transducer is not 
necessarily deterministic. The one in Figure 1 is not since, for instance, two distinct 
arcs with output labels b leave the state 0. 
Sequential transducers are computationally interesting because their use with a 
given input does not depend on the size of the transducer but only on the size of the 
input. Since using a sequential transducer with a given input consists of following the 
only path corresponding to the input string and in writing consecutive output labels 
along this path, the total computational time is linear in the size of the input, if we 
consider that the cost of copying out each output label does not depend on its length. 
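The linear-time behavior described above can be illustrated with a minimal Python sketch. The class name `SequentialTransducer`, its dictionary-based arc table, and the toy machine `toy` are assumptions made for illustration; they are not the transducer of Figure 1, whose arcs are not recoverable here.

```python
from typing import Dict, Iterable, Optional, Tuple


class SequentialTransducer:
    """Hypothetical container for a sequential string-to-string transducer."""

    def __init__(self, initial: int, finals: Iterable[int],
                 arcs: Dict[Tuple[int, str], Tuple[int, str]]) -> None:
        self.initial = initial
        self.finals = set(finals)
        # (state, input symbol) -> (next state, output string).
        # A dict key is unique, so the input side is deterministic.
        self.arcs = dict(arcs)

    def apply(self, text: str) -> Optional[str]:
        """Follow the only path labeled with `text`; one lookup per symbol."""
        state = self.initial
        pieces = []
        for symbol in text:
            arc = self.arcs.get((state, symbol))
            if arc is None:          # no outgoing arc: input rejected
                return None
            state, out = arc
            pieces.append(out)       # output labels may be empty strings
        if state not in self.finals:
            return None
        return "".join(pieces)


# Toy machine (illustrative only): two arcs leaving state 0 share the output "b".
toy = SequentialTransducer(
    initial=0,
    finals=[1],
    arcs={
        (0, "a"): (1, "b"),
        (0, "b"): (1, "b"),
        (1, "a"): (0, "ba"),
        (1, "b"): (1, ""),
    },
)

print(toy.apply("ab"))   # b
print(toy.apply("aab"))  # bbab
```

The loop performs one table lookup per input symbol, so the number of steps depends on the length of the input and not on the number of states or arcs.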
Definition 
More formally, a sequential string-to-string transducer $T$ is a 7-tuple $(Q, i, F, \Sigma, \Delta, \delta, \sigma)$,
with:
• $Q$ the set of states,
• $i \in Q$ the initial state,
• $F \subseteq Q$ the set of final states,
• $\Sigma$ and $\Delta$ finite sets corresponding respectively to the input and output
alphabets of the transducer,
• $\delta$ the state transition function, which maps $Q \times \Sigma$ to $Q$,
• $\sigma$ the output function, which maps $Q \times \Sigma$ to $\Delta^*$.
The functions $\delta$ and $\sigma$ are generally partial functions: a state $q \in Q$ does not
necessarily admit outgoing transitions labeled on the input side with all elements
of the alphabet. These functions can be extended to mappings from $Q \times \Sigma^*$ by the
following classical recurrence relations:
$$\forall s \in Q, \forall w \in \Sigma^*, \forall a \in \Sigma, \quad \delta(s, \epsilon) = s, \quad \delta(s, wa) = \delta(\delta(s, w), a);$$
$$\sigma(s, \epsilon) = \epsilon, \quad \sigma(s, wa) = \sigma(s, w)\,\sigma(\delta(s, w), a).$$
Figure 2 
Example of a 2-subsequential transducer $\tau_1$.
Thus, a string $w \in \Sigma^*$ is accepted by $T$ iff $\delta(i, w) \in F$, and in that case the output of
the transducer is $\sigma(i, w)$.
2.2 Subsequential and p-Subsequential Transducers 
Sequential transducers can be generalized by introducing the possibility of generating 
an additional output string at final states (Schützenberger 1977). The application of the
transducer to a string can then possibly finish with the concatenation of such an out- 
put string to the usual output. Such transducers are called subsequential transducers. 
Language processing often requires a more general extension. Indeed, the ambigui-
ties encountered in language (ambiguity of grammars, of morphological analyzers,
or that of pronunciation dictionaries, for instance) cannot be taken into account when
using sequential or subsequential transducers. These devices associate at most a sin- 
gle output to a given input. In order to deal with ambiguities, one can introduce 
p-subsequential transducers (Mohri 1994a), namely transducers provided with at most 
p final output strings at each final state. Figure 2 gives an example of a 2-subsequential 
transducer. Here, the input string w = aa gives two distinct outputs aaa and aab. Since
one cannot find any reasonable case in language in which the number of ambiguities 
would be infinite, p-subsequential transducers seem to be sufficient for describing lin- 
guistic ambiguities. However, the number of ambiguities could be very large in some 
cases. Notice that 1-subsequential transducers are exactly the subsequential transduc- 
ers. 
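A p-subsequential transducer can be sketched in Python by attaching a tuple of final output strings to each final state. The class name and the three-state machine below are hypothetical: they reproduce the behavior stated in the text (input aa yields aaa and aab) but are not claimed to match the topology of Figure 2.

```python
from typing import Dict, List, Tuple


class PSubsequentialTransducer:
    """Hypothetical sketch: a sequential transducer whose final states
    carry up to p additional output strings."""

    def __init__(self, initial: int,
                 arcs: Dict[Tuple[int, str], Tuple[int, str]],
                 final_outputs: Dict[int, Tuple[str, ...]]) -> None:
        self.initial = initial
        self.arcs = dict(arcs)
        # final state -> tuple of final output strings (at most p of them)
        self.final_outputs = dict(final_outputs)

    def apply(self, text: str) -> List[str]:
        """Return all outputs for `text`, or an empty list if it is rejected."""
        state = self.initial
        prefix = []
        for symbol in text:
            arc = self.arcs.get((state, symbol))
            if arc is None:
                return []
            state, out = arc
            prefix.append(out)
        if state not in self.final_outputs:
            return []
        common = "".join(prefix)
        # The input path is unique; ambiguity appears only at the final state.
        return [common + tail for tail in self.final_outputs[state]]


# Illustrative 2-subsequential machine: "aa" maps to both "aaa" and "aab".
two_sub = PSubsequentialTransducer(
    initial=0,
    arcs={(0, "a"): (1, "a"), (1, "a"): (2, "a")},
    final_outputs={2: ("a", "b")},
)

print(two_sub.apply("aa"))  # ['aaa', 'aab']
```

With p = 1 each final state carries a single string, which is the subsequential case.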
Transducers can be considered to represent mappings from strings to strings. As 
such, they admit the composition operation defined for mappings, a useful operation 
that allows the construction of more complex transducers from simpler ones. The 
result of the application of $\tau_2 \circ \tau_1$ to a string s can be computed by first considering
all output strings associated with the input s in the transducer $\tau_1$, then applying $\tau_2$
to all of these strings. The output strings obtained after this application represent the
result $(\tau_2 \circ \tau_1)(s)$. In fact, instead of waiting for the result of the application of $\tau_1$ to
be completely given, one can gradually apply $\tau_2$ to the output strings of $\tau_1$ yet to
be completed. This is the basic idea of the composition algorithm, which allows the
transducer $\tau_2 \circ \tau_1$ to be directly constructed given $\tau_1$ and $\tau_2$.
We define sequential (resp. p-subsequential) functions to be those functions that 
can be represented by sequential (resp. p-subsequential) transducers. We noted previ- 
ously that the result of the composition of two transducers is a transducer that can 
be directly constructed. There exists an efficient algorithm for the general case of the 
composition of transducers (transducers subsequential or not, having ε-transitions or
not, and with outputs in $\Sigma^*$, or in $\Sigma^* \cup \{\infty\} \times \mathcal{R}_+ \cup \{\infty\}$) (Mohri, Pereira, and Riley
1996).
The following theorem gives a more specific result for the case of subsequential 
and p-subsequential functions, which expresses their closure under composition. We 
use the expression p-subsequential in two ways here. One means that a finite number of
ambiguities is admitted (the closure under composition matches this case), the second
indicates that this number equals exactly p.
Figure 3
Example of a subsequential transducer $\tau_2$.
Theorem 1 
Let $f: \Sigma^* \to \Delta^*$ be a sequential (resp. p-subsequential) and $g: \Delta^* \to \Omega^*$ be a sequential
(resp. q-subsequential) function, then $g \circ f$ is sequential (resp. pq-subsequential).
Proof
We prove the theorem in the general case of p-subsequential transducers. The case
of sequential transducers, first proved by Choffrut (1978), can be derived from the
general case in a trivial way. Let $\tau_1$ be a p-subsequential transducer representing f,
$\tau_1 = (Q_1, i_1, F_1, \Sigma, \Delta, \delta_1, \sigma_1, \rho_1)$, and $\tau_2 = (Q_2, i_2, F_2, \Delta, \Omega, \delta_2, \sigma_2, \rho_2)$ a q-subsequential
transducer representing g. $\rho_1$ and $\rho_2$ denote the final output functions of $\tau_1$ and $\tau_2$,
which map $F_1$ to $(\Delta^*)^p$ and $F_2$ to $(\Omega^*)^q$, respectively. $\rho_1(r)$ represents, for instance, the
set of final output strings at a final state r. Define the pq-subsequential transducer
$\tau = (Q, i, F, \Sigma, \Omega, \delta, \sigma, \rho)$ by $Q = Q_1 \times Q_2$, $i = (i_1, i_2)$,
$F = \{(q_1, q_2) \in Q : q_1 \in F_1, \delta_2(q_2, \rho_1(q_1)) \cap F_2 \neq \emptyset\}$, with the following transition and output functions:
$$\forall a \in \Sigma, \forall (q_1, q_2) \in Q, \quad \delta((q_1, q_2), a) = (\delta_1(q_1, a), \delta_2(q_2, \sigma_1(q_1, a)))$$
$$\sigma((q_1, q_2), a) = \sigma_2(q_2, \sigma_1(q_1, a))$$
and with the final output function defined by:
$$\forall (q_1, q_2) \in F, \quad \rho((q_1, q_2)) = \sigma_2(q_2, \rho_1(q_1))\,\rho_2(\delta_2(q_2, \rho_1(q_1)))$$
Clearly, according to the definition of composition, the transducer $\tau$ realizes $g \circ f$. The
definition of $\rho$ shows that it admits at most pq distinct output strings for a given input
one. This ends the proof of the theorem. □
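The product construction used in the proof can be sketched in Python as follows. The dictionary layout (`initial`, `arcs`, `final`), the helper names `run`, `apply`, and `compose`, and the two toy machines are assumptions made for illustration. Only pairs of states reachable from the initial pair are built, which is a common practical refinement rather than something the proof requires.

```python
def run(t, state, text):
    """Extend the transition and output functions of t to strings.
    Returns (state reached, output written) or None if no path exists."""
    out = []
    for symbol in text:
        arc = t["arcs"].get((state, symbol))
        if arc is None:
            return None
        state, piece = arc
        out.append(piece)
    return state, "".join(out)


def apply(t, text):
    """All outputs of the p-subsequential transducer t on text."""
    res = run(t, t["initial"], text)
    if res is None or res[0] not in t["final"]:
        return []
    state, out = res
    return [out + tail for tail in t["final"][state]]


def compose(t1, t2):
    """Build a transducer realizing t2 applied after t1 on pairs of states."""
    start = (t1["initial"], t2["initial"])
    arcs, final = {}, {}
    stack, seen = [start], {start}
    while stack:
        q1, q2 = stack.pop()
        # Transitions: feed the output label of t1 into t2.
        for (p, a), (r1, out1) in t1["arcs"].items():
            if p != q1:
                continue
            res = run(t2, q2, out1)
            if res is None:
                continue
            r2, out2 = res
            arcs[((q1, q2), a)] = ((r1, r2), out2)
            if (r1, r2) not in seen:
                seen.add((r1, r2))
                stack.append((r1, r2))
        # Final outputs: push each final string of t1 through t2,
        # then append each final string of t2 (at most p*q results).
        tails = []
        for s in t1["final"].get(q1, ()):
            res = run(t2, q2, s)
            if res is not None and res[0] in t2["final"]:
                tails.extend(res[1] + s2 for s2 in t2["final"][res[0]])
        if tails:
            final[(q1, q2)] = tuple(tails)
    return {"initial": start, "arcs": arcs, "final": final}


# Toy 2-subsequential machine: "aa" -> "aaa", "aab".
t1 = {"initial": 0,
      "arcs": {(0, "a"): (1, "a"), (1, "a"): (2, "a")},
      "final": {2: ("a", "b")}}
# Toy subsequential machine: rewrites a -> x and b -> y.
t2 = {"initial": 0,
      "arcs": {(0, "a"): (0, "x"), (0, "b"): (0, "y")},
      "final": {0: ("",)}}

composed = compose(t1, t2)
print(apply(composed, "aa"))  # ['xxx', 'xxy']
```

The composed machine is again deterministic on its input side, since each pair of states has at most one outgoing arc per input symbol.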
Figure 3 gives an example of a 1-subsequential or subsequential transducer $\tau_2$. The
result of the composition of the transducers $\tau_1$ and $\tau_2$ is shown in Figure 4. States in
the transducer $\tau_3$ correspond to pairs of states of $\tau_1$ and $\tau_2$. The composition consists
essentially of making the intersection of the outputs of $\tau_1$ with the inputs of $\tau_2$.
Transducers admit another useful operation: union. Given an input string w, a
transducer union of $\tau_1$ and $\tau_2$ gives the set union of the strings obtained by application
of $\tau_1$ to w and $\tau_2$ to w. We denote by $\tau_1 + \tau_2$ the union of $\tau_1$ and $\tau_2$. The following
theorem specifies the type of the transducer $\tau_1 + \tau_2$, implying in particular the closure
under union of p-subsequential transducers. It can be proved in a way similar to the
composition theorem.
Theorem 2 
Let $f: \Sigma^* \to \Delta^*$ be a sequential (resp. p-subsequential) and $g: \Sigma^* \to \Delta^*$ be a se-
quential (resp. q-subsequential) function, then $g + f$ is 2-subsequential (resp. (p + q)-
subsequential).
Figure 4 
2-subsequential transducer $\tau_3$, obtained by composition of $\tau_1$ and $\tau_2$.
The union transducer $\tau_1 + \tau_2$ can be constructed from $\tau_1$ and $\tau_2$ in a way close
to the union of automata. One can indeed introduce a new initial state connected to
the old initial states of $\tau_1$ and $\tau_2$ by transitions labeled with the empty string both on
input and output. But the transducer obtained using this construction is not sequential,
since it contains ε-transitions on the input side. There exists, however, an algorithm
to construct the union of p-subsequential and q-subsequential transducers directly as
a (p + q)-subsequential transducer.
The direct construction consists of considering pairs of states $(q_1, q_2)$, $q_1$ being a
state of $\tau_1$ or an additional state that we denote by an underscore, $q_2$ a state of $\tau_2$ or
an additional state that we denote by an underscore. The transitions leaving $(q_1, q_2)$
are obtained by taking the union of the transitions leaving $q_1$ and $q_2$, or by keeping
only those of $q_1$ if $q_2$ is the underscore state, similarly by keeping only those of $q_2$ if $q_1$
is the underscore state. The union of the transitions is performed in such a way that
if $q_1$ and $q_2$ both have transitions labeled with the same input label a, then only one
transition labeled with a is associated to $(q_1, q_2)$. The output label of that transition is
the longest common prefix of the output transitions labeled with a leaving $q_1$ and $q_2$.
See Mohri (1996b) for a full description of this algorithm.
Figure 5 shows the 2-subsequential transducer obtained by constructing the union 
of the transducers $\tau_1$ and $\tau_2$ this way. Notice that according to the theorem the result
could be a priori 3-subsequential, but these two transducers share no common accepted 
string. In such cases, the resulting transducer is max(p, q)-subsequential. 
2.3 Characterization and Extensions 
The linear complexity of their use makes sequential or p-subsequential transducers 
both mathematically and computationally of particular interest. However, not all trans- 
ducers, even when they realize functions (rational functions), admit an equivalent 
sequential or subsequential transducer. 
Consider, for instance, the function f associated with the classical transducer rep-
resented in Figure 6; f can be defined by:¹
$$\forall w \in \{x\}^+, \quad f(w) = \begin{cases} a^{|w|} & \text{if } |w| \text{ is even,} \\ b^{|w|} & \text{otherwise.} \end{cases} \tag{1}$$
This function is not sequential, that is, it cannot be realized by any sequential trans-
ducer. Indeed, in order to start writing the output associated to an input string $w = x^n$,
a or b according to whether n is even or odd, one needs to finish reading the whole in-
put string w, which can be arbitrarily long. Sequential functions, namely functions that
can be represented by sequential transducers, do not allow such unbounded delays.
1 We denote by |w| the length of a string w.
Figure 5
2-subsequential transducer $\tau_4$, union of $\tau_1$ and $\tau_2$.
Figure 6
Transducer T with no equivalent sequential representation.
More generally, sequential functions can be characterized among rational functions by
the following theorem:
Theorem 3 (Ginsburg and Rose 1966) 
Let f be a rational function mapping $\Sigma^*$ to $\Delta^*$. f is sequential iff there exists a positive
integer K such that:
$$\forall u \in \Sigma^*, \forall a \in \Sigma, \exists w \in \Delta^*, |w| \leq K: \quad f(ua) = f(u)\,w \tag{2}$$
In other words, for any string u and any element of the alphabet a, f(ua) is equal
to f(u) concatenated with some bounded string. Notice that this implies that f(u) is
always a prefix of f(ua), and more generally that if f is sequential then it preserves
prefixes.
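The prefix condition of Theorem 3 can be checked on small inputs with a short Python sketch. The function `f` implements equation (1); `doubling` is a hypothetical sequential function (each x is rewritten as aa) added only for contrast, and `extension` is an illustrative helper name.

```python
def f(w: str) -> str:
    """Equation (1): a^|w| if |w| is even, b^|w| otherwise."""
    n = len(w)
    return ("a" if n % 2 == 0 else "b") * n


def doubling(w: str) -> str:
    """Hypothetical sequential function: every x is rewritten as aa."""
    return "aa" * len(w)


def extension(func, u: str, a: str):
    """Return the string w with func(u + a) == func(u) + w,
    or None when func(u) is not a prefix of func(u + a)."""
    before, after = func(u), func(u + a)
    if after.startswith(before):
        return after[len(before):]
    return None


# f("x") = "b" is not a prefix of f("xx") = "aa": condition (2) fails.
print(extension(f, "x", "x"))
# doubling always appends exactly "aa", so K = 2 works.
print(extension(doubling, "xxx", "x"))
```

A single failing pair (u, a) is enough to show that f is not sequential, whereas passing the check on finitely many samples is of course only evidence, not a proof, that a function is sequential.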
Figure 7
Left-to-right sequential transducer L.
Figure 8
Right-to-left sequential transducer R.
The fact that not all rational functions are sequential could reduce the interest 
of sequential transducers. The following theorem, due to Elgot and Mezei (1965), 
shows, however, that transducers are exactly compositions of left and right sequential 
transducers. 
Theorem 4 (Elgot and Mezei 1965) 
Let f be a partial function mapping $\Sigma^*$ to $\Delta^*$. f is rational iff there exist a left sequential
function $l: \Sigma^* \to \Omega^*$ and a right sequential function $r: \Omega^* \to \Delta^*$ such that $f = r \circ l$.
Left sequential functions or transducers are those we previously defined. Their 
application to a string proceeds from left to right. Right sequential functions apply 
to strings from right to left. According to the theorem, considering a new sufficiently 
large alphabet $\Omega$ allows one to define two sequential functions l and r that decompose
a rational function f. This result considerably increases the importance of sequential 
functions in the theory of finite-state machines as well as in the practical use of trans- 
ducers. 
Berstel (1979) gives a constructive proof of this theorem. Given a finite-state trans- 
ducer T, one can easily construct a left sequential transducer L and a right sequential 
transducer R such that $R \circ L = T$. Intuitively, the extended alphabet $\Omega$ keeps track of
the local ambiguities encountered when applying the transducer from left to right. A
distinct element of the alphabet is assigned to each of these ambiguities. The right
sequential transducer can be constructed in such a way that these ambiguities can
then be resolved from right to left. Figures 7 and 8 give a decomposition of the non-
sequential transducer T of Figure 6. The symbols of the alphabet $\Omega = \{x_1, x_2\}$ store
information about the size of the input string w. The output of L ends with $x_1$ iff $|w|$
is odd. The right sequential function R is then easy to construct.
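The decomposition can be sketched in Python under the assumption that L tags the symbols alternately with x1 and x2 (so that its output ends with x1 iff the input length is odd) and that R, reading from the right, chooses its output letter from the first tag it meets. The state names and the list-of-tags encoding are illustrative and are not taken from Figures 7 and 8.

```python
def left_sequential(w: str) -> list:
    """L: read w from left to right, tagging each x by position parity.
    The last tag is "x1" iff len(w) is odd."""
    state, out = 0, []
    for symbol in w:
        if symbol != "x":
            raise ValueError("input alphabet is {x}")
        out.append("x1" if state == 0 else "x2")
        state = 1 - state
    return out


def right_sequential(tags: list) -> str:
    """R: read the tags from right to left. The first tag read fixes the
    output letter (b after x1, a after x2) for the whole string."""
    state, out = "start", []
    for tag in reversed(tags):
        if state == "start":
            state = "odd" if tag == "x1" else "even"
        out.append("b" if state == "odd" else "a")
    return "".join(reversed(out))


def f(w: str) -> str:
    """Equation (1), used here only to check the decomposition."""
    return ("a" if len(w) % 2 == 0 else "b") * len(w)


for n in range(1, 6):
    w = "x" * n
    print(w, right_sequential(left_sequential(w)), f(w))
```

Each of the two machines is deterministic in its own reading direction, and neither needs unbounded look-ahead: the parity information that f needs is carried by the last symbol written by L.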
Sequential transducers offer other theoretical advantages. In particular, while sev- 
eral important tests, such as equivalence, are undecidable with general transducers, 
sequential transducers have the following decidability property. 
Theorem 5 
Let T be a transducer mapping $\Sigma^*$ to $\Delta^*$. It is decidable whether T is sequential.
A constructive proof of this theorem was given by Choffrut (1978). An efficient 
polynomial algorithm for testing the sequentiability of transducers based on this proof 
was given by Weber and Klemm (1995). 
Choffrut also gave a characterization of subsequential functions based on the def-
inition of a metric on $\Sigma^*$. Denote by $u \wedge v$ the longest common prefix of two strings u
and v in $\Sigma^*$. It is easy to verify that the following defines a metric on $\Sigma^*$:
$$d(u, v) = |u| + |v| - 2\,|u \wedge v| \tag{3}$$
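A direct Python transcription of equation (3) may help make the metric concrete. The function names are illustrative.

```python
def longest_common_prefix(u: str, v: str) -> str:
    """Longest common prefix of u and v."""
    n = 0
    for x, y in zip(u, v):
        if x != y:
            break
        n += 1
    return u[:n]


def d(u: str, v: str) -> int:
    """Equation (3): number of symbols of u and v outside their common prefix."""
    return len(u) + len(v) - 2 * len(longest_common_prefix(u, v))


print(d("abc", "abd"))  # 2
print(d("ab", "ab"))    # 0
print(d("", "abc"))     # 3
```

Intuitively, d(u, v) is the number of symbols that must be erased from the end of u and then appended to reach v.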
The following theorem describes this characterization of subsequential functions. 
Theorem 6 
Let f be a partial function mapping $\Sigma^*$ to $\Delta^*$. f is subsequential iff:
1. f has bounded variation (according to the metric defined above).
2. for any rational subset Y of $\Delta^*$, $f^{-1}(Y)$ is rational.
The notion of bounded variation can be roughly understood here as follows: if 
d(x,y) is small enough, namely if the prefix that x and y share is sufficiently long 
compared to their lengths, then the same is true of their images by f, f(x) and f(y). 
This theorem can be extended to describe the case of p-subsequential functions by
defining a metric $d_\infty$ on $(\Delta^*)^p$. For any $u = (u_1, \ldots, u_p)$ and $v = (v_1, \ldots, v_p) \in (\Delta^*)^p$,
we define:
$$d_\infty(u, v) = \max_{1 \leq i \leq p} d(u_i, v_i) \tag{4}$$
Theorem 7 
Let $f = (f_1, \ldots, f_p)$ be a partial function mapping $Dom(f) \subseteq \Sigma^*$ to $(\Delta^*)^p$. f is p-
subsequential iff:
1. f has bounded variation (using the metric d on $\Sigma^*$ and $d_\infty$ on $(\Delta^*)^p$).
2. for all i ($1 \leq i \leq p$) and any rational subset Y of $\Delta^*$, $f_i^{-1}(Y)$ is rational.
Proof 
Assume f p-subsequential, and let T be a p-subsequential transducer realizing f. A
transducer $T_i$, $1 \leq i \leq p$, realizing a component $f_i$ of f can be obtained from T simply
by keeping only one of the p outputs at each final state of T. $T_i$ is subsequential
by construction, hence the component $f_i$ is subsequential. Then the previous theorem
implies that each component $f_i$ has bounded variation, and by definition of $d_\infty$, f
also has bounded variation.
Conversely, if the first condition holds, a fortiori each $f_i$ has bounded variation. This
combined with the second condition implies that each $f_i$ is subsequential. A transducer
T realizing f can be obtained by taking the union of p subsequential transducers
realizing each component $f_i$. Thus, in view of Theorem 2, f is p-subsequential. □
One can also give a characterization of p-subsequential transducers irrespective of 
the choice of their components. Let $d'_p$ be the semimetric defined by:
$$\forall (u, v) \in [(\Delta^*)^p]^2, \quad d'_p(u, v) = \max_{1 \leq i, j \leq p} d(u_i, v_j) \tag{5}$$
The following theorem then gives that characterization.
Theorem 8 
Let f be a rational function mapping $\Sigma^*$ to $(\Delta^*)^p$. f is p-subsequential iff it has bounded
variation (using the semimetric $d'_p$ on $(\Delta^*)^p$).
Proof
According to the previous theorem the condition is sufficient since:
$$\forall (u, v) \in [(\Delta^*)^p]^2, \quad d_\infty(u, v) \leq d'_p(u, v)$$
Conversely if f is p-subsequential, let $T = (Q, i, F, \Sigma, \Delta, \delta, \sigma, \rho)$ be a p-subsequential
transducer representing f, where $\rho = (\rho_1, \ldots, \rho_p)$ is the output function mapping Q to
$(\Delta^*)^p$. Let N and M be defined by:
$$N = \max_{q \in F,\ 1 \leq i \leq p} |\rho_i(q)| \quad \text{and} \quad M = \max_{a \in \Sigma,\ q \in Q} |\sigma(q, a)| \tag{6}$$
We denote by Dom(T) the set of strings accepted by T. Let $k \geq 0$ and $(u_1, u_2) \in [Dom(T)]^2$
such that $d(u_1, u_2) \leq k$. Then, there exists $u \in \Sigma^*$ such that:
$$u_1 = u v_1, \quad u_2 = u v_2, \quad \text{and} \quad |v_1| + |v_2| \leq k \tag{7}$$
Hence,
$$f(u_1) = \{\sigma(i, u)\,\sigma(\delta(i, u), v_1)\,\rho_j(\delta(i, u_1)) : 1 \leq j \leq p\}$$
$$f(u_2) = \{\sigma(i, u)\,\sigma(\delta(i, u), v_2)\,\rho_j(\delta(i, u_2)) : 1 \leq j \leq p\} \tag{8}$$
Let K = kM + 2N. We have:
$$d'_p(f(u_1), f(u_2)) \leq M(|v_1| + |v_2|) + d'_p(\rho(\delta(i, u_1)), \rho(\delta(i, u_2))) \leq kM + 2N = K$$
Thus, f has bounded variation using $d'_p$. This ends the proof of the theorem. □
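The two distances on p-tuples used in Theorems 7 and 8 can be compared with a small Python sketch, assuming tuples of equal length p. The function names are illustrative; `d` repeats equation (3) so that the block stands on its own.

```python
def d(u: str, v: str) -> int:
    """Equation (3): |u| + |v| - 2 |longest common prefix of u and v|."""
    n = 0
    for x, y in zip(u, v):
        if x != y:
            break
        n += 1
    return len(u) + len(v) - 2 * n


def d_inf(us, vs) -> int:
    """Equation (4): compare components position by position."""
    return max(d(u, v) for u, v in zip(us, vs))


def d_prime(us, vs) -> int:
    """Equation (5): compare every component of us with every component of vs."""
    return max(d(u, v) for u in us for v in vs)


u = ("aaa", "aab")
v = ("aaa", "aab")
print(d_inf(u, v))    # 0
print(d_prime(u, v))  # 2
```

The example shows why $d'_p$ is only a semimetric: the distance from a tuple to itself can be nonzero. It also illustrates the inequality used at the start of the proof, since the maximum over matching positions cannot exceed the maximum over all pairs of positions.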
2.4 Application to Language Processing 
We briefly mentioned several theoretical and computational properties of sequential 
and p-subsequential transducers. These devices are used in many areas of compu- 
tational linguistics. In all those areas, the determinization algorithm can be used to 
obtain a p-subsequential transducer (Mohri 1996b), and the minimization algorithm to 
reduce the size of the p-subsequential transducer used (Mohri 1994b). The composi- 
tion, union, and equivalence algorithms for subsequential transducers are also useful 
in many applications. 
2.4.1 Representation of Dictionaries. Very large-scale dictionaries can be represented
by p-subsequential transducers because the number of entries and that of the ambigu-
ities they contain are finite. The corresponding representation offers fast look-up since
the recognition does not depend on the size of the dictionary but only on that of the in- 
put string considered. The minimization algorithm for sequential and p-subsequential 
transducers allows the size of these devices to be reduced to the minimum. Exper- 
iments have shown that these compact and fast look-up representations for large 
natural language dictionaries can be efficiently obtained. As an example, a French 
morphological dictionary of about 21.2 Mb can be compiled into a p-subsequential 
transducer of 1.3 Mb, in a few minutes (Mohri 1996b). 
2.4.2 Compilation of Morphological and Phonological Rules. Similarly, context-depen- 
dent phonological and morphological rules can be represented by finite-state transduc- 
ers (Kaplan and Kay 1994). Most phonological and morphological rules correspond to 
p-subsequential functions. The result of the computation described by Kaplan and Kay 
(1994) is not necessarily a p-subsequential transducer. But it can often be determinized
using the determinization algorithm for p-subsequentiable transducers. This consid- 
erably increases the time efficiency of the transducer. It can be further minimized to 
reduce its size. These observations can be extended to the case of weighted rewrite 
rules (Mohri and Sproat 1996). 
2.4.3 Syntax. Finite-state machines are also currently used to represent local syn- 
tactic constraints (Silberztein 1993; Roche 1993; Karlsson et al. 1995; Mohri 1994d). 
Linguists can conveniently introduce local grammar transducers that can be used to 
disambiguate sentences. The number of local grammars for a given language and 
even for a specific domain can be large. The local grammar transducers are mostly 
p-subsequential. Determinization and minimization can then be used to make the 
use of local grammar transducers more time efficient and to reduce their size. Since 
p-subsequential transducers are closed under composition, the result of the composi- 
tion of all local grammar transducers is a p-subsequential transducer. The equivalence 
of local grammars can also be tested using the equivalence algorithm for sequential 
transducers. 
For a more detailed overview of the applications of sequential string-to-string
transducers to language processing, see Mohri (1996a). 
Because they are so time and space efficient, sequential transducers will likely be 
used increasingly often in natural language processing as well as in other connected 
fields. In the following, we consider the case of string-to-weight transducers, which 
are also used in many areas of computational linguistics. 
3. Power Series and Subsequential String-to-Weight Transducers 
We consider string-to-weight transducers, namely transducers with input strings and 
output weights. These transducers are used in various domains, such as language 
modeling, representation of word or phonetic lattices, etc., in the following way: one 
reads and follows a path corresponding to a given input string and outputs a number 
obtained by combining the weights along this path. In most applications to natural 
language processing, the weights are simply added along the path, since they are 
interpreted as (negative) logarithms of probabilities. In case the transducer is not se- 
quential, that is, when it does not have a deterministic input, one proceeds in the same 
way for all the paths corresponding to the input string. In natural language processing, 
specifically in speech processing, one keeps the minimum of the weights associated to 
these paths. This corresponds to the Viterbi approximation in speech recognition or in
other related areas for which hidden Markov models (HMM's) are used. In all such
applications, one looks for the best path, i.e., the path with the minimum weight.
Figure 9
Example of a string-to-weight transducer.
3.1 Definitions 
In this section, we give the definition of string-to-weight transducers and other defi-
nitions useful for the presentation of the theorems of the following sections.
In addition to the output weights of the transitions, string-to-weight transducers
are provided with initial and final weights. For instance, when used with the input
string ab, the transducer in Figure 9 outputs: 5 + 1 + 2 + 3 = 11, 5 being the initial and
3 the final weight.
Definition 
More formally, a string-to-weight transducer T is defined by $T = (Q, \Sigma, I, F, E, \lambda, \rho)$
with:
• $Q$ a finite set of states,
• $\Sigma$ the input alphabet,
• $I \subseteq Q$ the set of initial states,
• $F \subseteq Q$ the set of final states,
• $E \subseteq Q \times \Sigma \times \mathcal{R}_+ \times Q$ a finite set of transitions,
• $\lambda$ the initial weight function mapping I to $\mathcal{R}_+$,
• $\rho$ the final weight function mapping F to $\mathcal{R}_+$.
One can define for T a transition (partial) function $\delta$ mapping $Q \times \Sigma$ to $2^Q$ by:
$$\forall (q, a) \in Q \times \Sigma, \quad \delta(q, a) = \{q' \mid \exists x \in \mathcal{R}_+ : (q, a, x, q') \in E\}$$
and an output function $\sigma$ mapping E to $\mathcal{R}_+$ by:
$$\forall t = (p, a, x, q) \in E, \quad \sigma(t) = x$$
A path $\pi$ in T from $q \in Q$ to $q' \in Q$ is a set of successive transitions from q to q':
$\pi = ((q_0, a_0, x_0, q_1), \ldots, (q_{m-1}, a_{m-1}, x_{m-1}, q_m))$, with $\forall i \in [0, m-1]$, $q_{i+1} \in \delta(q_i, a_i)$. We can
extend the definition of $\sigma$ to paths by: $\sigma(\pi) = x_0 x_1 \cdots x_{m-1}$.
We denote by $\pi \in q \overset{w}{\leadsto} q'$ the set of paths from q to q' labeled with the input string
w. The definition of $\delta$ can be extended to $Q \times \Sigma^*$ by:
$$\forall (q, w) \in Q \times \Sigma^*, \quad \delta(q, w) = \{q' : \exists \text{ path } \pi \text{ in } T, \ \pi \in q \overset{w}{\leadsto} q'\}$$
and to $2^Q \times \Sigma^*$, by:
$$\forall R \subseteq Q, \forall w \in \Sigma^*, \quad \delta(R, w) = \bigcup_{q \in R} \delta(q, w)$$
For $(q, w, q') \in Q \times \Sigma^* \times Q$ such that there exists a path from q to q' labeled with
w, we define $\theta(q, w, q')$ as the minimum of the outputs of all paths from q to q' with
input w:
$$\theta(q, w, q') = \min_{\pi \in q \overset{w}{\leadsto} q'} \sigma(\pi)$$
A successful path in T is a path from an initial state to a final state. A string $w \in \Sigma^*$
is accepted by T iff there exists a successful path labeled with w: $\delta(I, w) \cap F \neq \emptyset$. The
output corresponding to an accepted string w is then obtained by taking the minimum
of the outputs of all successful paths with input label w:
$$\min_{(i, f) \in I \times F :\ f \in \delta(i, w)} \left(\lambda(i) + \theta(i, w, f) + \rho(f)\right)$$
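The output defined above can be computed without enumerating paths by keeping, for each state, the smallest weight of any path that reaches it. The following Python sketch assumes that the weights of a path are combined by ordinary addition, as stated at the beginning of this section. The dictionary layout, the function name `output_weight`, and the toy transducer are illustrative; the toy machine reproduces the sum 5 + 1 + 2 + 3 = 11 but is not claimed to be the transducer of Figure 9.

```python
def output_weight(t, text):
    """Minimum over successful paths labeled with `text` of
    initial weight + transition weights + final weight.
    Returns None when `text` is not accepted."""
    # best[q] = smallest weight of a path from an initial state to q so far
    best = dict(t["initial"])
    for symbol in text:
        nxt = {}
        for (q, a, x, r) in t["transitions"]:
            if a == symbol and q in best:
                cand = best[q] + x
                if cand < nxt.get(r, float("inf")):
                    nxt[r] = cand
        best = nxt
    totals = [w + t["final"][q] for q, w in best.items() if q in t["final"]]
    return min(totals) if totals else None


# Toy nonsequential transducer: two paths are labeled with "ab".
toy_weighted = {
    "initial": {0: 5},                      # initial weight function
    "final": {3: 3},                        # final weight function
    "transitions": [                        # (source, input, weight, target)
        (0, "a", 1, 1),
        (0, "a", 4, 2),
        (1, "b", 2, 3),
        (2, "b", 1, 3),
    ],
}

print(output_weight(toy_weighted, "ab"))  # 11, the minimum of 5+1+2+3 and 5+4+1+3
```

Keeping only the best weight per state is the same idea as the Viterbi approximation mentioned earlier: at each step, all paths reaching a given state are summarized by their minimum.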
A transducer $T$ is said to be trim if all states of $T$ belong to a successful path. String-to-weight
transducers clearly realize functions mapping $\Sigma^*$ to $\mathbb{R}_+$. Since the operations
we need to consider are addition and min, and since $(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ is a
semiring, we call these functions formal power series. 2 We adopt the terminology
and notation used in formal language theory (Berstel and Reutenauer 1988; Kuich and
Salomaa 1986; Salomaa and Soittola 1978):
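• the image by a formal power series $S$ of a string $w$ is denoted by $(S, w)$
and called the coefficient of $w$ in $S$,
• the notation $S = \sum_{w \in \Sigma^*} (S, w)\, w$ is then used to define a power series by
its coefficients,
• the support of $S$ is the language defined by:
$$\mathrm{supp}(S) = \{w \in \Sigma^* : (S, w) \neq \infty\}$$
$\infty$ being the zero element of the semiring considered here.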
The fundamental theorem of Schützenberger (1961), analogous to Kleene's theorem
for formal languages, states that a formal power series $S$ is rational iff it is
recognizable, that is, realizable by a string-to-weight transducer. The semiring
$(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ used in many optimization problems is called the tropical semiring. 3
So, the functions we consider here are more precisely rational power series over the
tropical semiring.
A string-to-weight transducer T is said to be unambiguous if for any given string 
w there exists at most one successful path labeled with w. 
2 Recall that a semiring is essentially a ring that may lack negation, namely in which the first operation
does not necessarily admit inversion. $(\mathbb{R}, +, \cdot, 0, 1)$, where 0 and 1 are, respectively, the identity
elements for $+$ and $\cdot$, or, for any non-empty set $E$, $(2^E, \cup, \cap, \emptyset, E)$, where $\emptyset$ and $E$ are, respectively, the
identity elements for $\cup$ and $\cap$, are other examples of semirings.
3 This terminology is often used more specifically when the set is restricted to natural integers $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$.
In the following, we examine, more specifically, efficient string-to-weight transducers:
subsequential transducers. A transducer is said to be subsequential if its input is
deterministic, that is if at any state there exists at most one outgoing transition labeled
with a given element of the input alphabet $\Sigma$. Subsequential string-to-weight transducers
are sometimes called weighted automata, or weighted acceptors, or probabilistic
automata, or distance automata. Our terminology is meant to favor the functional view
of these devices, which is the view that we consider here. Not all string-to-weight transducers
are subsequential but we define an algorithm to determinize nonsubsequential
transducers when possible.
Definition 
More formally a string-to-weight subsequential transducer $T = (Q, i, F, \Sigma, \delta, \sigma, \lambda, \rho)$ is
an 8-tuple, with:
• $Q$ the set of its states,
• $i \in Q$ its initial state,
• $F \subseteq Q$ the set of final states,
• $\Sigma$ the input alphabet,
• $\delta$ the transition function mapping $Q \times \Sigma$ to $Q$; $\delta$ can be extended as in
the string case to map $Q \times \Sigma^*$ to $Q$,
• $\sigma$ the output function, which maps $Q \times \Sigma$ to $\mathbb{R}_+$; $\sigma$ can also be extended
to $Q \times \Sigma^*$,
• $\lambda \in \mathbb{R}_+$ the initial weight,
• $\rho$ the final weight function mapping $F$ to $\mathbb{R}_+$.
A string $w \in \Sigma^*$ is accepted by a subsequential transducer $T$ if there exists $f \in F$
such that $\delta(i, w) = f$. The output associated to $w$ is then: $\lambda + \sigma(i, w) + \rho(f)$.
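Applying a subsequential transducer amounts to following a single path, as the following Python sketch suggests. The encoding of $\delta$ and $\sigma$ as dictionaries keyed by (state, symbol) pairs, the function name `apply_subsequential`, and the example machine are assumptions introduced here for illustration only.

```python
def apply_subsequential(i, final, delta, sigma, lam, w):
    """Return lam + sigma(i, w) + rho(f) if w is accepted, else None.

    i:     initial state
    final: dict mapping final states to their final weights (rho)
    delta: dict mapping (state, symbol) to the unique next state
    sigma: dict mapping (state, symbol) to the weight of that transition
    lam:   initial weight
    """
    q, total = i, lam
    for a in w:
        if (q, a) not in delta:
            return None          # no transition: w is not accepted
        total += sigma[(q, a)]   # weights add up along the unique path
        q = delta[(q, a)]
    return total + final[q] if q in final else None


# Hypothetical two-transition machine with initial weight 5 and final weight 3.
delta = {(0, "a"): 1, (1, "b"): 2}
sigma = {(0, "a"): 1, (1, "b"): 2}
value = apply_subsequential(0, {2: 3}, delta, sigma, 5, "ab")
```

Each input symbol costs one dictionary lookup, so the running time is linear in the length of the input string and does not depend on the size of the transducer.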
We will use the following definition for characterizing the transducers that admit 
determinization. 
Definition 
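Two states $q$ and $q'$ of a string-to-weight transducer $T = (Q, I, F, \Sigma, \delta, \sigma, \lambda, \rho)$, not necessarily
subsequential, are said to be twins if:
$$\forall (u, v) \in (\Sigma^*)^2, \quad (\{q, q'\} \subseteq \delta(I, u),\ q \in \delta(q, v),\ q' \in \delta(q', v)) \Rightarrow \theta(q, v, q) = \theta(q', v, q') \tag{9}$$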
In other words, q and q' are twins if, when they can be reached from the initial 
state by the same string u, the minimum outputs of loops at q and q' labeled with any 
string v are identical. We say that T has the twins property when any two states q and 
q' of T are twins. Notice that according to the definition, two states that do not have 
cycles with the same string v are twins. In particular, two states that do not belong to 
any cycle are necessarily twins. Thus, an acyclic transducer has the twins property. 
In the following section, we consider subsequential power series in the tropi- 
cal semiring, that is, functions that can be realized by subsequential string-to-weight 
transducers. Many rational power series defined on the tropical semiring considered 
in practice are subsequential, in particular, acyclic transducers represent subsequential 
power series. 
We introduce a theorem giving an intrinsic characterization of subsequential power 
series irrespective of the transducer realizing them. We then present an algorithm 
that allows one to determinize some string-to-weight transducers. We give a general 
presentation of the algorithm since it can be used with many other semirings, in 
particular, with string-to-string transducers and with transducers whose output labels 
are pairs of strings and weights. 
We then use the twins property to define a set of transducers to which the deter- 
minization algorithm applies. We give a characterization of unambiguous transducers 
admitting determinization, and then use this characterization to define an algorithm 
to test if a given transducer can be determinized. 
We also present a very efficient minimization algorithm that applies to subse- 
quential string-to-weight transducers. In many cases, the determinization algorithm 
can also be used to minimize a subsequential transducer; we describe this use of the 
algorithm and give the related proofs in the appendix. 
3.2 Characterization of Subsequential Power Series 
Recall that one can define a metric on $\Sigma^*$ by:
$$d(u, v) = |u| + |v| - 2\,|u \wedge v| \tag{10}$$
where we denote by $u \wedge v$ the longest common prefix of two strings $u$ and $v$ in $\Sigma^*$.
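The metric of equation (10) is straightforward to compute. The following Python sketch, with the hypothetical function name `prefix_distance`, is given only to make the definition concrete.

```python
def prefix_distance(u, v):
    """d(u, v) = |u| + |v| - 2 |u ^ v|, where u ^ v is the longest common prefix."""
    common = 0
    while common < min(len(u), len(v)) and u[common] == v[common]:
        common += 1
    return len(u) + len(v) - 2 * common


examples = [
    prefix_distance("abc", "abd"),  # common prefix "ab"
    prefix_distance("ab", "ab"),    # identical strings
    prefix_distance("", "abc"),     # empty common prefix
]
```

The distance counts the symbols that remain in the two strings once their longest common prefix has been removed.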
The definition we gave for subsequential power series depends on the transducers 
representing them. The theorem that follows gives an intrinsic characterization of 
subsequential power series. 4 
Theorem 9 
Let S be a rational power series defined on the tropical semiring. S is subsequential 
iff it has bounded variation. 
Proof 
Assume that $S$ is subsequential. Let $T = (Q, i, F, \Sigma, \delta, \sigma, \lambda, \rho)$ be a subsequential transducer
realizing $S$. $\delta$ denotes the transition function associated with $T$, $\sigma$ its output function, and
$\lambda$ and $\rho$ the initial and final weight functions. Let $L$ be the maximum of the lengths of
all output labels of $T$:
$$L = \max_{(q, a) \in Q \times \Sigma} |\sigma(q, a)| \tag{11}$$
and $R$ the upper bound of all output differences at final states:
$$R = \max_{(q, q') \in F^2} |\rho(q) - \rho(q')| \tag{12}$$
and define $M$ as $M = L + R$. Let $(u_1, u_2)$ be in $(\Sigma^*)^2$. By definition of $d$, there exists
$u \in \Sigma^*$ such that:
$$u_1 = u v_1, \quad u_2 = u v_2, \quad \text{and} \quad |v_1| + |v_2| = d(u_1, u_2) \tag{13}$$
Hence,
$$\sigma(i, u_1) = \sigma(i, u) + \sigma(\delta(i, u), v_1)$$
$$\sigma(i, u_2) = \sigma(i, u) + \sigma(\delta(i, u), v_2)$$
4 This is an extension of the characterization theorem of Choffrut (1978) for string-to-string functions. 
The extension is not straightforward because the length of an output string is a natural integer. Here 
we deal with real numbers. 
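Since
$$|\sigma(\delta(i, u), v_1) - \sigma(\delta(i, u), v_2)| \leq L \cdot (|v_1| + |v_2|) = L \cdot d(u_1, u_2)$$
and
$$|\rho(\delta(i, u_1)) - \rho(\delta(i, u_2))| \leq R$$
we have
$$|(\lambda + \sigma(i, u_1) + \rho(\delta(i, u_1))) - (\lambda + \sigma(i, u_2) + \rho(\delta(i, u_2)))| \leq L \cdot d(u_1, u_2) + R$$
Notice that if $u_1 \neq u_2$, $R \leq R \cdot d(u_1, u_2)$. Thus
$$|(\lambda + \sigma(i, u_1) + \rho(\delta(i, u_1))) - (\lambda + \sigma(i, u_2) + \rho(\delta(i, u_2)))| \leq (L + R) \cdot d(u_1, u_2)$$
Therefore:
$$\forall (u_1, u_2) \in (\Sigma^*)^2, \quad |S(u_1) - S(u_2)| \leq M \cdot d(u_1, u_2) \tag{14}$$
This proves that $S$ is $M$-Lipschitzian 5 and a fortiori that it has bounded variation.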
Conversely, suppose that $S$ has bounded variation. Since $S$ is rational, according
to the theorem of Schützenberger (1961) it is recognizable and therefore there exists
a string-to-weight transducer $T = (Q, I, F, \Sigma, \delta, \sigma, \lambda, \rho)$ realizing $S$. As in the case of
string-to-string transducers, one can show that any transducer admits an equivalent
trim unambiguous transducer. So, without loss of generality we can assume $T$ trim
and unambiguous.
Furthermore, we describe in the next sections a determinization algorithm. We
show that this algorithm applies to any transducer that has the twins property. Thus,
in order to show that $S$ is subsequentiable, it is sufficient to show that $T$ has the twins
property.
Consider two states $q$ and $q'$ of $T$ and let $(u, v) \in (\Sigma^*)^2$ be such that:
$$\{q, q'\} \subseteq \delta(I, u),\ q \in \delta(q, v),\ q' \in \delta(q', v)$$
Since $T$ is trim there exists $(w, w') \in (\Sigma^*)^2$ such that $\delta(q, w) \cap F \neq \emptyset$ and $\delta(q', w') \cap F \neq \emptyset$.
Notice that
$$\forall k \geq 0, \quad d(u v^k w, u v^k w') = d(w, w')$$
Thus, since $S$ has bounded variation
$$\exists K \geq 0, \forall k \geq 0, \quad |S(u v^k w) - S(u v^k w')| \leq K$$
Since $T$ is unambiguous, there is only one path from $I$ to $F$ corresponding to $u v^k w$
(resp. $u v^k w'$). We have:
$$S(u v^k w) = \theta(I, uw, F) + k\,\theta(q, v, q)$$
$$S(u v^k w') = \theta(I, uw', F) + k\,\theta(q', v, q')$$
5 This implies in particular that the subsequential power series over the tropical semiring define 
continuous functions for the topology induced by the metric d on E*. Also this shows that in the 
theorem one can replace has bounded variation by is Lipschitzian. 
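Power Series Determinization($T_1$, $T_2$)
1  $F_2 \leftarrow \emptyset$
2  $\lambda_2 \leftarrow \bigoplus_{i \in I_1} \lambda_1(i)$
3  $i_2 \leftarrow \bigcup_{i \in I_1} \{(i, \lambda_2^{-1} \otimes \lambda_1(i))\}$
4  $Q \leftarrow \{i_2\}$
5  while $Q \neq \emptyset$
6    do $q_2 \leftarrow head[Q]$
7      if (there exists $(q, x) \in q_2$ such that $q \in F_1$)
8        then $F_2 \leftarrow F_2 \cup \{q_2\}$
9          $\rho_2(q_2) \leftarrow \bigoplus_{q \in F_1, (q, x) \in q_2} x \otimes \rho_1(q)$
10     for each $a$ such that $\Gamma(q_2, a) \neq \emptyset$
11       do $\sigma_2(q_2, a) \leftarrow \bigoplus_{(q, x) \in \Gamma(q_2, a)} \left[x \otimes \bigoplus_{t = (q, a, \sigma_1(t), n_1(t)) \in E_1} \sigma_1(t)\right]$
12         $\delta_2(q_2, a) \leftarrow \bigcup_{q' \in \nu(q_2, a)} \left\{\left(q', \bigoplus_{(q, x, t) \in \gamma(q_2, a),\ n_1(t) = q'} [\sigma_2(q_2, a)]^{-1} \otimes x \otimes \sigma_1(t)\right)\right\}$
13         if ($\delta_2(q_2, a)$ is a new state)
14           then ENQUEUE($Q$, $\delta_2(q_2, a)$)
15     DEQUEUE($Q$)
Figure 10
Algorithm for the determinization of a transducer $T_1$ representing a power series defined on
the semiring $(S, \oplus, \otimes, \bar{0}, \bar{1})$.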
Hence
$$\exists K \geq 0, \forall k \geq 0, \quad |(\theta(I, uw, F) - \theta(I, uw', F)) + k\,(\theta(q, v, q) - \theta(q', v, q'))| \leq K$$
$$\Longrightarrow \theta(q, v, q) - \theta(q', v, q') = 0$$
Thus $T$ has the twins property. This ends the proof of the theorem. □
3.3 General Determinization Algorithm for Power Series 
We describe in this section an algorithm for constructing a subsequential transducer
$T_2 = (Q_2, i_2, F_2, \Sigma, \delta_2, \sigma_2, \lambda_2, \rho_2)$ equivalent to a given nonsubsequential one $T_1 =
(Q_1, \Sigma, I_1, F_1, E_1, \lambda_1, \rho_1)$. The algorithm extends our determinization algorithm for string-to-string
transducers representing p-subsequential functions to the case of transducers
outputting weights (Mohri 1994c).
Figure 10 gives the pseudocode of the algorithm. We present the algorithm in the
general case of a semiring $(S, \oplus, \otimes, \bar{0}, \bar{1})$ on which the transducer $T_1$ is defined. Indeed,
the algorithm we are describing here applies as well to transducers representing power
series defined on many other semirings. 6 We describe the algorithm in the case of the
tropical semiring. For the tropical semiring, one can replace $\oplus$ by min and $\otimes$ by $+$ in
the pseudocode of Figure 10. 7
6 In particular, the algorithm also applies to string subsequentiable transducers and to transducers that 
output pairs of strings and weights. We will come back to this point later. 
7 Similarly, $\lambda_2^{-1}$ should be interpreted as $-\lambda_2$, and $[\sigma_2(q_2, a)]^{-1}$ as $-\sigma_2(q_2, a)$.
The algorithm is similar to the powerset construction used for the determinization 
of automata. However, since the outputs of two transitions bearing the same input 
label might differ, one can only output the minimum of these outputs in the result- 
ing transducer, therefore one needs to keep track of the residual weights. Hence, the 
subsets q2 that we consider here are made of pairs (q, x) of states and weights. 
The initial weight $\lambda_2$ of $T_2$ is the minimum of all the initial weights of $T_1$ (line
2). The initial state $i_2$ is a subset made of pairs $(i, x)$, where $i$ is an initial state of $T_1$,
and $x = \lambda_1(i) - \lambda_2$ (line 3). We use a queue $Q$ to maintain the set of subsets $q_2$ yet to
be examined, as in the classical powerset construction. 8 Initially, $Q$ contains only the
subset $i_2$. The subsets $q_2$ are the states of the resulting transducer. $q_2$ is a final state of
$T_2$ iff it contains at least one pair $(q, x)$, with $q$ a final state of $T_1$ (lines 7-8). The final
output associated to $q_2$ is then the minimum of the final outputs of all the final states
in $q_2$ combined with their respective residual weight (line 9).
For each input label $a$ such that there exists at least one state $q$ of the subset
$q_2$ admitting an outgoing transition labeled with $a$, one outgoing transition leaving $q_2$
with the input label $a$ is constructed (lines 10-14). The output $\sigma_2(q_2, a)$ of this transition
is the minimum of the outputs of all the transitions with input label $a$ that leave a
state in the subset $q_2$, when combined with the residual weight associated to that state
(line 11).
The destination state $\delta_2(q_2, a)$ of the transition leaving $q_2$ is a subset made of pairs
$(q', x')$, where $q'$ is a state of $T_1$ that can be reached by a transition labeled with $a$, and
$x'$ the corresponding residual weight (line 12). $x'$ is computed by taking the minimum
of all the transitions with input label $a$ that leave a state $q$ of $q_2$ and reach $q'$, when
combined with the residual weight of $q$ minus the output weight $\sigma_2(q_2, a)$. Finally,
$\delta_2(q_2, a)$ is enqueued in $Q$ iff it is a new subset.
We denote by $n_1(t)$ the destination state of a transition $t \in E_1$. Hence $n_1(t) = q'$,
if $t = (q, a, x, q') \in E_1$. The sets $\Gamma(q_2, a)$, $\gamma(q_2, a)$, and $\nu(q_2, a)$ used in the algorithm are
defined by:
• $\Gamma(q_2, a) = \{(q, x) \in q_2 : \exists t = (q, a, \sigma_1(t), n_1(t)) \in E_1\}$
• $\gamma(q_2, a) = \{(q, x, t) \in q_2 \times E_1 : t = (q, a, \sigma_1(t), n_1(t)) \in E_1\}$
• $\nu(q_2, a) = \{q' : \exists (q, x) \in q_2, \exists t = (q, a, \sigma_1(t), q') \in E_1\}$
$\Gamma(q_2, a)$ denotes the set of pairs $(q, x)$, elements of the subset $q_2$, having transitions
labeled with the input $a$. $\gamma(q_2, a)$ denotes the set of triples $(q, x, t)$ where $(q, x)$ is a pair
in $q_2$ such that $q$ admits a transition with input label $a$. $\nu(q_2, a)$ is the set of states $q'$
that can be reached by transitions labeled with $a$ from the states of the subset $q_2$.
The algorithm is illustrated in Figures 11 and 12. Notice that the input $ab$ admits
several outputs in $T_1$: $\{1 + 1 = 2, 1 + 3 = 4, 3 + 3 = 6, 3 + 5 = 8\}$. Only one of these
outputs (2, the smallest) is kept in the determinized transducer $T_2$, since in the tropical
semiring one is only interested in the minimum outputs for any given string.
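The construction described above, specialized to the tropical semiring, can be sketched in Python as follows. This is a minimal sketch under assumptions made here, not the implementation used for the experiments: the data encoding, the function name `determinize`, and the example transducer are hypothetical. The example was chosen so that the input ab has exactly the four outputs 2, 4, 6, and 8 mentioned above, but it is not claimed to be the transducer of Figure 11. The sketch uses integer weights to avoid rounding issues when subsets are compared, and, like the algorithm itself, it does not terminate on transducers that are not determinizable.

```python
from collections import deque


def determinize(initial, final, transitions):
    """Weighted subset construction over (min, +).

    initial, final: dicts mapping states to initial / final weights
    transitions:    list of (source, label, weight, destination)
    Returns (lam2, i2, delta2, sigma2, rho2); states of the result are
    frozensets of (state, residual weight) pairs.
    """
    lam2 = min(initial.values())                                # line 2
    i2 = frozenset((i, x - lam2) for i, x in initial.items())   # line 3
    delta2, sigma2, rho2 = {}, {}, {}
    queue, seen = deque([i2]), {i2}
    while queue:
        q2 = queue.popleft()
        residual = dict(q2)
        finals = [x + final[q] for q, x in q2 if q in final]    # lines 7-9
        if finals:
            rho2[q2] = min(finals)
        # Group candidate arcs by input label: (residual + weight, destination).
        by_label = {}
        for (p, a, x, q) in transitions:
            if p in residual:
                by_label.setdefault(a, []).append((residual[p] + x, q))
        for a, arcs in by_label.items():
            out = min(c for c, _ in arcs)                       # line 11
            dest = {}
            for c, q in arcs:                                   # line 12
                dest[q] = min(dest.get(q, float("inf")), c - out)
            d2 = frozenset(dest.items())
            sigma2[(q2, a)], delta2[(q2, a)] = out, d2
            if d2 not in seen:                                  # lines 13-14
                seen.add(d2)
                queue.append(d2)
    return lam2, i2, delta2, sigma2, rho2


# Hypothetical transducer whose outputs for "ab" are 1+1, 1+3, 3+3 and 3+5.
initial, final = {0: 0}, {3: 0}
transitions = [(0, "a", 1, 1), (0, "a", 3, 2),
               (1, "b", 1, 3), (1, "b", 3, 3), (2, "b", 3, 3), (2, "b", 5, 3)]
lam2, i2, delta2, sigma2, rho2 = determinize(initial, final, transitions)
```

On this example the subset reached after reading a is {(1, 0), (2, 2)}: state 2 carries a residual weight of 2 because its incoming transition costs 3 while the emitted output is only 1.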
Notice that several transitions might reach the same state with a priori differ- 
ent residual weights. Since one is only interested in the best path, namely the path 
corresponding to the minimum weight, one can keep the minimum of these weights 
for a given state element of a subset (line 11 of the algorithm of Figure 10). In the 
next section, we give a set of transducers $T_1$ for which the determinization algorithm
terminates. The following theorem shows the correctness of the algorithm when it 
terminates. 
8 The algorithm works with any queue discipline chosen for Q. 
Figure 11
Transducer $T_1$ representing a power series defined on $(\mathbb{R}_+ \cup \{\infty\}, \min, +)$.
Figure 12
Transducer $T_2$ obtained by power series determinization of $T_1$.
Theorem 10 
Assume that the determinization algorithm terminates, then the resulting transducer
$T_2$ is equivalent to $T_1$.
Proof
We denote by $\theta_1(q, w, q')$ the minimum of the outputs of all paths from $q$ to $q'$. By
construction we have:
$$\lambda_2 = \min_{i_1 \in I_1} \lambda_1(i_1)$$
We define the residual output associated to $q$ in the subset $\delta_2(i_2, w)$ as the weight
$c(q, w)$ associated to the pair containing $q$ in $\delta_2(i_2, w)$. It is not hard to show by induction
on $|w|$ that the subsets constructed by the algorithm are the sets $\delta_2(i_2, w)$, $w \in \Sigma^*$, such
that:
$$\forall w \in \Sigma^*, \quad \delta_2(i_2, w) = \bigcup_{q \in \delta_1(I_1, w)} \{(q, c(q, w))\}$$
$$c(q, w) = \min_{i_1 \in I_1} (\lambda_1(i_1) + \theta_1(i_1, w, q)) - \sigma_2(i_2, w) - \lambda_2$$
$$\sigma_2(i_2, w) = \min_{q \in \delta_1(I_1, w),\ i_1 \in I_1} (\lambda_1(i_1) + \theta_1(i_1, w, q)) - \lambda_2 \tag{15}$$
Notice that the size of a subset never exceeds $|Q_1|$: $\mathrm{card}(\delta_2(i_2, w)) \leq |Q_1|$. A state
$q$ belongs at most to one pair of a subset, since for all paths reaching $q$, only the
minimum of the residual outputs is kept. Notice also that, by definition of min, in any
subset there exists at least one state $q$ with a residual output $c(q, w)$ equal to 0.
A string $w$ is accepted by $T_1$ iff there exists $q \in F_1$ such that $q \in \delta_1(I_1, w)$. Using
equations 15, it is accepted iff $\delta_2(i_2, w)$ contains a pair $(q, c(q, w))$ with $q \in F_1$. This is
exactly the definition of the final states $F_2$ (line 7). So $T_1$ and $T_2$ accept the same set of
strings.
Let $w \in \Sigma^*$ be a string accepted by $T_1$ and $T_2$. The definition of $\rho_2$ in the algorithm
of Figure 10, line 9, gives:
$$\rho_2(\delta_2(i_2, w)) = \min_{q \in \delta_1(I_1, w) \cap F_1} \left(\rho_1(q) + \min_{i_1 \in I_1} (\lambda_1(i_1) + \theta_1(i_1, w, q))\right) - \sigma_2(i_2, w) - \lambda_2 \tag{16}$$
Thus, if we denote by $S$ the power series realized by $T_1$, we have:
$$\rho_2(\delta_2(i_2, w)) = (S, w) - \sigma_2(i_2, w) - \lambda_2 \tag{17}$$
Hence: $\lambda_2 + \sigma_2(i_2, w) + \rho_2(\delta_2(i_2, w)) = (S, w)$. □
The power series determinization algorithm is equivalent to the usual determiniza- 
tion of automata when the initial weight, the final weights, and all output labels are 
equal to 0. The subsets considered in the algorithm are then exactly those obtained in 
the powerset determinization of automata, all residual outputs c(q, w) being equal to 
0. 
Both space and time complexity of the determinization algorithm for automata 
are exponential. There are minimal deterministic automata with exponential size with 
respect to an equivalent nondeterministic one. A fortiori the complexity of the de- 
terminization algorithm in the weighted case we just described is also exponential. 
However, in some cases in which the degree of nondeterminism of the initial trans- 
ducer is high, the determinization algorithm turns out to be fast and the resulting 
transducer has fewer states than the initial one. We present examples of such cases, 
which appear in speech recognition, in the last section. We also present a minimization 
algorithm that allows the size of subsequential transducers representing power series 
to be reduced. 
The complexity of the application of subsequential transducers is linear in the 
size of the string to which it applies. This property makes it worthwhile to use the 
power series determinization to speed up the application of transducers. Not all trans- 
ducers can be determinized using the power series determinization. In the following 
section, we define a set of transducers that admit determinization, and characterize 
unambiguous transducers that admit the application of the algorithm. 
Since determinization does not apply to all transducers, it is important to be able to 
test the determinizability of a transducer. We present, in the next section, an algorithm 
to test this property in the case of unambiguous trim transducers. 
The proofs of some of the theorems in the next two sections are complex; they can 
be skipped on first reading. 
3.4 Determinizable Transducers 
There are transducers with which determinization does not halt, but rather generates 
an infinite number of subsets. We define determinizable transducers as those transduc- 
ers with which the algorithm terminates. We first show that a large set of transducers 
admit determinization, then give a characterization of unambiguous transducers ad- 
mitting determinization. In what follows, the states of the transducers considered will 
be assumed to be accessible from the initial one. 
The following lemma will be useful in the proof of the theorems. 
Lemma 1 
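Let $T = (Q, \Sigma, I, F, E, \lambda, \rho)$ be a string-to-weight transducer, $\pi \in p \stackrel{w}{\leadsto} q$ a path in $T$ from
the state $p \in Q$ to $q \in Q$, and $\pi' \in p' \stackrel{w}{\leadsto} q'$ a path from $p' \in Q$ to $q' \in Q$ both labeled
with the input string $w \in \Sigma^*$. Assume that the lengths of $\pi$ and $\pi'$ are greater than
$|Q|^2 - 1$, then there exist strings $u_1, u_2, u_3$ in $\Sigma^*$, and states $p_1$ and $p'_1$ such that
$|u_2| > 0$, $u_1 u_2 u_3 = w$ and such that $\pi$ and $\pi'$ be factored in the following way:
$$\pi \in p \stackrel{u_1}{\leadsto} p_1 \stackrel{u_2}{\leadsto} p_1 \stackrel{u_3}{\leadsto} q$$
$$\pi' \in p' \stackrel{u_1}{\leadsto} p'_1 \stackrel{u_2}{\leadsto} p'_1 \stackrel{u_3}{\leadsto} q'$$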
Proof 
The proof is based on the use of a cross product of two transducers. Given two
transducers $T_1 = (Q_1, \Sigma, I_1, F_1, E_1, \lambda_1, \rho_1)$ and $T_2 = (Q_2, \Sigma, I_2, F_2, E_2, \lambda_2, \rho_2)$, we define
the cross product of $T_1$ and $T_2$ as the transducer:
$$T_1 \times T_2 = (Q_1 \times Q_2, \Sigma, I_1 \times I_2, F_1 \times F_2, E, \lambda, \rho)$$
with outputs in $\mathbb{R}_+ \times \mathbb{R}_+$ such that $t = ((q_1, q_2), a, (x_1, x_2), (q'_1, q'_2)) \in (Q_1 \times Q_2) \times \Sigma \times (\mathbb{R}_+ \times \mathbb{R}_+) \times (Q_1 \times Q_2)$
is a transition of $T_1 \times T_2$, namely $t \in E$, iff $(q_1, a, x_1, q'_1) \in E_1$ and $(q_2, a, x_2, q'_2) \in E_2$.
We also define $\lambda$ and $\rho$ by: $\forall (i_1, i_2) \in I_1 \times I_2$, $\lambda(i_1, i_2) = (\lambda_1(i_1), \lambda_2(i_2))$, $\forall (f_1, f_2) \in F_1 \times F_2$,
$\rho(f_1, f_2) = (\rho_1(f_1), \rho_2(f_2))$.
Consider the cross product of $T$ with itself, $T \times T$. Let $\pi$ and $\pi'$ be two paths in $T$
with lengths greater than $|Q|^2 - 1$, $(m > |Q|^2 - 1)$:
$$\pi = ((p = q_0, a_0, x_0, q_1), \ldots, (q_{m-1}, a_{m-1}, x_{m-1}, q_m = q))$$
$$\pi' = ((p' = q'_0, a_0, x'_0, q'_1), \ldots, (q'_{m-1}, a_{m-1}, x'_{m-1}, q'_m = q'))$$
then:
$$\Pi = (((q_0, q'_0), a_0, (x_0, x'_0), (q_1, q'_1)), \ldots, ((q_{m-1}, q'_{m-1}), a_{m-1}, (x_{m-1}, x'_{m-1}), (q_m, q'_m)))$$
is a path in $T \times T$ with length greater than $|Q|^2 - 1$. Since $T \times T$ has exactly $|Q|^2$ states,
$\Pi$ admits at least one cycle at some state $(p_1, p'_1)$ labeled with a non-empty input string
$u_2$. This shows the existence of the factorization above and proves the lemma. □
Theorem 11 
Let $T_1 = (Q_1, \Sigma, I_1, F_1, E_1, \lambda_1, \rho_1)$ be a string-to-weight transducer defined on the tropical
semiring. If $T_1$ has the twins property then it is determinizable.
Proof
Assume that $T_1$ has the twins property. If the determinization algorithm does not halt,
there exists at least one subset of $2^Q$, $\{q_0, \ldots, q_m\}$, such that the algorithm generates
an infinite number of distinct weighted subsets $\{(q_0, c_0), \ldots, (q_m, c_m)\}$.
Then we have necessarily $m \geq 1$. Indeed, we mentioned previously that in any
subset there exists at least one state $q_i$ with a residual output $c_i = 0$. If the subset
contains only one state $q_0$, then $c_0 = 0$. So there cannot be an infinite number of
distinct subsets $\{(q_0, c_0)\}$.
Let $A \subseteq \Sigma^*$ be the set of strings $w$ such that the states of $\delta_2(i_2, w)$ be $\{q_0, \ldots, q_m\}$. We
have: $\forall w \in A$, $\delta_2(i_2, w) = \{(q_0, c(q_0, w)), \ldots, (q_m, c(q_m, w))\}$. Since $A$ is infinite, and since
in each weighted subset there exists a null residual output, there exists $i_0$, $0 \leq i_0 \leq m$,
such that $c(q_{i_0}, w) = 0$ for an infinite number of strings $w \in A$. Without loss of generality
we can assume that $i_0 = 0$.
Let $B \subseteq A$ be the infinite set of strings $w$ for which $c(q_0, w) = 0$. Since the number
of subsets $\{(q_0, c(q_0, w)), \ldots, (q_m, c(q_m, w))\}$, $w \in B$, is infinite, there exists $j$, $0 < j \leq m$,
such that $c(q_j, w)$ be distinct for an infinite number of strings $w \in B$. Without loss of
generality we can assume $j = 1$.
Let $C \subseteq B$ be an infinite set of strings $w$ with $c(q_1, w)$ all distinct. Define $R(q_0, q_1)$
to be the finite set of differences of the weights of paths leading to $q_0$ and $q_1$ labeled
with the same string $w$, $|w| \leq |Q_1|^2 - 1$:
$$R(q_0, q_1) = \{(\lambda(i_1) + \sigma(\pi_1)) - (\lambda(i_0) + \sigma(\pi_0)) : \pi_0 \in i_0 \stackrel{w}{\leadsto} q_0,\ \pi_1 \in i_1 \stackrel{w}{\leadsto} q_1,\ i_0 \in I,\ i_1 \in I,\ |w| \leq |Q_1|^2 - 1\}$$
We will show that $\{c(q_1, w) : w \in C\} \subseteq R(q_0, q_1)$. This will yield a contradiction with
the infinity of $C$, and will therefore prove that the algorithm terminates.
Let $w \in C$, and consider a shortest path $\pi_0$ from a state $i_0 \in I$ to $q_0$ labeled with
the input string $w$ and with total cost $\sigma(\pi_0)$. Similarly consider a shortest path $\pi_1$ from
$i_1 \in I$ to $q_1$ labeled with the input string $w$ and with total cost $\sigma(\pi_1)$. By definition of
the subset construction we have: $(\lambda(i_1) + \sigma(\pi_1)) - (\lambda(i_0) + \sigma(\pi_0)) = c(q_1, w)$. Assume
that $|w| > |Q_1|^2 - 1$. Using lemma 1, there exists a factorization of $\pi_0$ and $\pi_1$ of the
type:
$$\pi_0 \in i_0 \stackrel{u_1}{\leadsto} p_0 \stackrel{u_2}{\leadsto} p_0 \stackrel{u_3}{\leadsto} q_0$$
$$\pi_1 \in i_1 \stackrel{u_1}{\leadsto} p_1 \stackrel{u_2}{\leadsto} p_1 \stackrel{u_3}{\leadsto} q_1$$
with $|u_2| > 0$. Since $p_0$ and $p_1$ are twins, $\theta_1(p_0, u_2, p_0) = \theta_1(p_1, u_2, p_1)$. Define $\pi'_0$ and $\pi'_1$
by:
$$\pi'_0 \in i_0 \stackrel{u_1}{\leadsto} p_0 \stackrel{u_3}{\leadsto} q_0$$
$$\pi'_1 \in i_1 \stackrel{u_1}{\leadsto} p_1 \stackrel{u_3}{\leadsto} q_1$$
Since $\pi_0$ and $\pi_1$ are shortest paths, we have: $\sigma(\pi_0) = \sigma(\pi'_0) + \theta_1(p_0, u_2, p_0)$ and $\sigma(\pi_1) =
\sigma(\pi'_1) + \theta_1(p_1, u_2, p_1)$. Hence: $(\lambda(i_1) + \sigma(\pi'_1)) - (\lambda(i_0) + \sigma(\pi'_0)) = c(q_1, w)$. By induction on
$|w|$, we can therefore find shortest paths $\Pi_0$ and $\Pi_1$ from $i_0$ to $q_0$ (resp. $i_1$ to $q_1$) with
length less or equal to $|Q_1|^2 - 1$ and such that $(\lambda(i_1) + \sigma(\Pi_1)) - (\lambda(i_0) + \sigma(\Pi_0)) = c(q_1, w)$.
Since $(\lambda(i_1) + \sigma(\Pi_1)) - (\lambda(i_0) + \sigma(\Pi_0)) \in R(q_0, q_1)$, $c(q_1, w) \in R(q_0, q_1)$ and $C$ is finite. This ends the proof
of the theorem. □
There are transducers that do not have the twins property and that are still de- 
terminizable. To characterize such transducers, more complex conditions that we will 
not describe here are required. However, in the case of trim unambiguous transducers, 
the twins property provides a characterization of determinizable transducers. 
Theorem 12 
Let $T_1 = (Q_1, \Sigma, I_1, F_1, E_1, \lambda_1, \rho_1)$ be a trim unambiguous string-to-weight transducer
defined on the tropical semiring. Then $T_1$ is determinizable iff it has the twins property.
Proof
According to the previous theorem, if $T_1$ has the twins property, then it is determinizable.
Assume now that $T_1$ does not have the twins property, then there exist at
least two states $q$ and $q'$ in $Q_1$ that are not twins. There exists $(u, v) \in (\Sigma^*)^2$ such that:
$(\{q, q'\} \subseteq \delta_1(I, u),\ q \in \delta_1(q, v),\ q' \in \delta_1(q', v))$ and $\theta_1(q, v, q) \neq \theta_1(q', v, q')$. Consider the
weighted subsets $\delta_2(i_2, u v^k)$, with $k \in \mathbb{N}$, constructed by the determinization algorithm.
A subset $\delta_2(i_2, u v^k)$ contains the pairs $(q, c(q, u v^k))$ and $(q', c(q', u v^k))$. We will show that
these subsets are all distinct. This will prove that the determinization algorithm does
not terminate if $T_1$ does not have the twins property.
Since $T_1$ is a trim unambiguous transducer, there exists only one path in $T_1$ from
$I$ to $q$ or to $q'$ with input string $u$. Similarly, the cycles at $q$ and $q'$ labeled with $v$ are
unique. Thus, there exist $i \in I$ and $i' \in I$ such that:
$$\forall k \in \mathbb{N}, \quad c(q, u v^k) = \lambda_1(i) + \theta_1(i, u, q) + k\,\theta_1(q, v, q) - \sigma_2(i_2, u v^k) - \lambda_2 \tag{18}$$
$$\forall k \in \mathbb{N}, \quad c(q', u v^k) = \lambda_1(i') + \theta_1(i', u, q') + k\,\theta_1(q', v, q') - \sigma_2(i_2, u v^k) - \lambda_2$$
Let $\lambda$ and $\theta$ be defined by:
$$\lambda = (\lambda_1(i') - \lambda_1(i)) + (\theta_1(i', u, q') - \theta_1(i, u, q))$$
$$\theta = \theta_1(q', v, q') - \theta_1(q, v, q) \tag{19}$$
We have:
$$\forall k \in \mathbb{N}, \quad c(q', u v^k) - c(q, u v^k) = \lambda + k\,\theta \tag{20}$$
Since $\theta \neq 0$, equation 20 shows that the subsets $\delta_2(i_2, u v^k)$ are all distinct. □
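The growth of the residual weights described by equation 20 can be observed on a small example. The Python sketch below is an illustration under assumptions made here: the helper `step`, which performs one transition of the weighted subset construction over (min, +), and the example transducer are hypothetical. In the example, states 1 and 2 are both reached by a and carry cycles labeled b with weights 1 and 2, so they are not twins; the transducer is trim and unambiguous because the strings $a b^k c$ and $a b^k d$ each have a single successful path.

```python
def step(subset, a, transitions):
    """One transition of the weighted subset construction (tropical semiring).

    subset: frozenset of (state, residual weight) pairs
    Returns (emitted weight, next subset), or (None, frozenset()) when no
    state of the subset has a transition labeled with a.
    """
    arcs = [(c + x, q)
            for (p, c) in subset
            for (src, label, x, q) in transitions
            if src == p and label == a]
    if not arcs:
        return None, frozenset()
    out = min(w for w, _ in arcs)
    dest = {}
    for w, q in arcs:
        dest[q] = min(dest.get(q, float("inf")), w - out)
    return out, frozenset(dest.items())


# Hypothetical trim unambiguous transducer without the twins property.
transitions = [(0, "a", 0, 1), (0, "a", 0, 2),
               (1, "b", 1, 1), (2, "b", 2, 2),
               (1, "c", 0, 3), (2, "d", 0, 3)]
subset = frozenset({(0, 0)})
_, subset = step(subset, "a", transitions)
gaps = []
for k in range(5):
    residual = dict(subset)
    gaps.append(residual[2] - residual[1])   # c(q', a b^k) - c(q, a b^k)
    _, subset = step(subset, "b", transitions)
```

With $\lambda = 0$ and $\theta = 1$ in the notation of equation 19, the recorded differences grow by one at each iteration, so every subset $\delta_2(i_2, a b^k)$ is new and the construction never stops on this input.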
3.5 Test of Determinizability 
The characterization of determinizable transducers provided by theorem 12 leads to 
the definition of an algorithm for testing the determinizability of trim unambiguous 
transducers. Before describing the algorithm, we introduce a lemma that shows that 
it suffices to examine a finite number of paths to test the twins property. 
Lemma 2 
Let $T_1 = (Q_1, \Sigma, I_1, F_1, E_1, \lambda_1, \rho_1)$ be a trim unambiguous string-to-weight transducer
defined on the tropical semiring. $T_1$ has the twins property iff $\forall (u, v) \in (\Sigma^*)^2$, $|uv| \leq
2|Q_1|^2 - 1$,
$$(\{q, q'\} \subseteq \delta_1(I, u),\ q \in \delta_1(q, v),\ q' \in \delta_1(q', v)) \Rightarrow \theta_1(q, v, q) = \theta_1(q', v, q') \tag{21}$$
Proof
Clearly if $T_1$ has the twins property, then (21) holds. Conversely, we prove that if (21)
holds, then it also holds for any $(u, v) \in (\Sigma^*)^2$, by induction on $|uv|$. Our proof is
similar to that of Berstel (1979) for string-to-string transducers. Consider $(u, v) \in (\Sigma^*)^2$
and $(q, q') \in Q_1^2$ such that: $\{q, q'\} \subseteq \delta_1(I, u)$, $q \in \delta_1(q, v)$, $q' \in \delta_1(q', v)$. Assume that
$|uv| > 2|Q_1|^2 - 1$ with $|v| > 0$. Then either $|u| > |Q_1|^2 - 1$ or $|v| > |Q_1|^2 - 1$.
Assume that $|u| > |Q_1|^2 - 1$. Since $T_1$ is a trim unambiguous transducer there exists
a unique path $\pi$ in $T_1$ from $i \in I$ to $q$ labeled with the input string $u$, and a unique path
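$\pi'$ from $i' \in I$ to $q'$. In view of lemma 1, there exist strings $u_1, u_2, u_3$ in $\Sigma^*$, and states
$p_1$ and $p'_1$ such that $|u_2| > 0$, $u_1 u_2 u_3 = u$ and such that $\pi$ and $\pi'$ be factored in
the following way:
$$\pi \in i \stackrel{u_1}{\leadsto} p_1 \stackrel{u_2}{\leadsto} p_1 \stackrel{u_3}{\leadsto} q$$
$$\pi' \in i' \stackrel{u_1}{\leadsto} p'_1 \stackrel{u_2}{\leadsto} p'_1 \stackrel{u_3}{\leadsto} q'$$
Since $|u_1 u_3 v| < |uv|$, by induction $\theta_1(q, v, q) = \theta_1(q', v, q')$.
Next, assume that $|v| > |Q_1|^2 - 1$. Then according to lemma 1, there exist strings
$v_1, v_2, v_3$ in $\Sigma^*$, and states $q_1$ and $q'_1$ such that $|v_2| > 0$, $v_1 v_2 v_3 = v$ and such that
$\pi$ and $\pi'$ be factored in the following way:
$$\pi \in q \stackrel{v_1}{\leadsto} q_1 \stackrel{v_2}{\leadsto} q_1 \stackrel{v_3}{\leadsto} q$$
$$\pi' \in q' \stackrel{v_1}{\leadsto} q'_1 \stackrel{v_2}{\leadsto} q'_1 \stackrel{v_3}{\leadsto} q'$$
Since $|u v_1 v_3| < |uv|$, by induction, $\theta_1(q, v_1 v_3, q) = \theta_1(q', v_1 v_3, q')$. Similarly, since $|u v_1 v_2| <
|uv|$, $\theta_1(q_1, v_2, q_1) = \theta_1(q'_1, v_2, q'_1)$. $T_1$ is a trim unambiguous transducer, so:
$$\theta_1(q, v, q) = \theta_1(q, v_1 v_3, q) + \theta_1(q_1, v_2, q_1)$$
$$\theta_1(q', v, q') = \theta_1(q', v_1 v_3, q') + \theta_1(q'_1, v_2, q'_1)$$
Thus, $\theta_1(q, v, q) = \theta_1(q', v, q')$. This completes the proof of the lemma. □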
Theorem 13 
Let $T_1 = (Q_1, \Sigma, I_1, F_1, E_1, \lambda_1, \rho_1)$ be a trim unambiguous string-to-weight transducer
defined on the tropical semiring. There exists an algorithm to test the determinizability
of $T_1$.
Proof
According to theorem 12, testing the determinizability of $T_1$ is equivalent to testing for
the twins property. We define an algorithm to test this property. Our algorithm is close
to that of Weber and Klemm (1995) for testing the sequentiability of string-to-string
transducers. It is based on the construction of an automaton $A = (Q, I, F, E)$ similar to
the cross product of $T_1$ with itself.
Let $K \subset \mathbb{R}$ be the finite set of real numbers defined by:
$$K = \left\{\sum_{i=1}^{k} (\sigma(t'_i) - \sigma(t_i)) : 1 \leq k \leq 2|Q_1|^2 - 1,\ \forall i \leq k\ (t_i, t'_i) \in E_1^2\right\}$$
We define $A$ by the following:
• The set of states of $A$ is defined by $Q = Q_1 \times Q_1 \times K$,
• The set of initial states by $I = I_1 \times I_1 \times \{0\}$,
• The set of final states by $F = F_1 \times F_1 \times K$,
• The set of transitions by:
$$E = \{((q_1, q_2, c), a, (q'_1, q'_2, c')) \in Q \times \Sigma \times Q : \exists (q_1, a, x, q'_1) \in E_1,\ (q_2, a, x', q'_2) \in E_1,\ c' - c = x' - x\}.$$
By construction, two states $q_1$ and $q_2$ of $Q_1$ can be reached by the same string $u$, $|u| <
2|Q_1|^2 - 1$, iff there exists $c \in K$ such that $(q_1, q_2, c)$ can be reached from $I$ in $A$. The set
of such $(q_1, q_2, c)$ is exactly the transitive closure of $I$ in $A$. The transitive closure of $I$
can be determined in time linear in the size of $A$, $O(|Q| + |E|)$.
Two such states $q_1$ and $q_2$ are not twins iff there exists a path in $A$ from $(q_1, q_2, 0)$ to
$(q_1, q_2, c)$, with $c \neq 0$. Indeed, this is exactly equivalent to the existence of cycles at $q_1$
and $q_2$ with the same input label and distinct output weights. According to lemma 2,
it suffices to test the twins property for strings of length less than $2|Q_1|^2 - 1$. So the
following gives an algorithm to test the twins property of a transducer $T_1$:
1. Compute the transitive closure of $I$: $T(I)$.
2. Determine the set of pairs $(q_1, q_2)$ of $T(I)$ with distinct states $q_1 \neq q_2$.
3. For each such $\{q_1, q_2\}$, compute the transitive closure of $(q_1, q_2, 0)$ in $A$. If
it contains $(q_1, q_2, c)$ with $c \neq 0$, then $T_1$ does not have the twins property.
The operations used in the algorithm (computation of the transitive closure, determi- 
nation of the set of states) can all be done in polynomial time with respect to the size 
of A, using classical algorithms (Aho, Hopcroft, and Ullman 1974). \[\] 
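The three steps of the test can be sketched in Python as follows. This is a simplified illustration, not the construction of the automaton $A$ itself: instead of building $Q_1 \times Q_1 \times K$ explicitly, it explores triples $(q_1, q_2, c)$ on demand, level by level, up to the length bound of lemma 2. The function names, the data encoding, and the two example transducers are assumptions made here, and the test is meaningful only for trim unambiguous transducers, as in the theorem.

```python
from collections import deque


def reachable_pairs(initial, transitions):
    """Pairs of states reachable from initial states by the same input string."""
    start = {(i, j) for i in initial for j in initial}
    seen, queue = set(start), deque(start)
    while queue:
        p1, p2 = queue.popleft()
        for (s1, a, _, t1) in transitions:
            if s1 != p1:
                continue
            for (s2, b, _, t2) in transitions:
                if s2 == p2 and b == a and (t1, t2) not in seen:
                    seen.add((t1, t2))
                    queue.append((t1, t2))
    return seen


def has_twins_property(states, initial, transitions):
    """Look for cycles at (q1, q2) with equal input and distinct weights."""
    bound = 2 * len(states) ** 2 - 1           # length bound of lemma 2
    for (q1, q2) in reachable_pairs(initial, transitions):
        if q1 == q2:
            continue
        seen = {(q1, q2, 0)}
        frontier = [(q1, q2, 0)]
        for _ in range(bound):                  # paired paths of length <= bound
            nxt = []
            for (p1, p2, c) in frontier:
                for (s1, a, x, t1) in transitions:
                    if s1 != p1:
                        continue
                    for (s2, b, y, t2) in transitions:
                        if s2 != p2 or b != a:
                            continue
                        node = (t1, t2, c + y - x)   # c' - c = x' - x
                        if (t1, t2) == (q1, q2) and node[2] != 0:
                            return False             # distinct cycle weights
                        if node not in seen:
                            seen.add(node)
                            nxt.append(node)
            frontier = nxt
    return True


# Hypothetical examples: cycles labeled b at states 1 and 2.
non_twins = [(0, "a", 0, 1), (0, "a", 0, 2), (1, "b", 1, 1), (2, "b", 2, 2),
             (1, "c", 0, 3), (2, "d", 0, 3)]
twins = [(0, "a", 0, 1), (0, "a", 0, 2), (1, "b", 1, 1), (2, "b", 1, 2),
         (1, "c", 0, 3), (2, "d", 0, 3)]
results = (has_twins_property(range(4), [0], non_twins),
           has_twins_property(range(4), [0], twins))
```

The sketch favors readability over efficiency: it rescans the transition list at every step, whereas an implementation following the text would index transitions by source state and input label.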
This provides an algorithm for testing the twins property of an unambiguous trim 
transducer T. It is very useful when T is known to be unambiguous. 
In many practical cases, the transducer one wishes to determinize is ambiguous. 
It is always possible to construct an unambiguous transducer T' from T (Eilenberg 
1974-1976). The complexity of such a construction is exponential in the worst case. 
Thus the overall complexity of the test of determinizability is also exponential in the
worst case. 
Notice that if one wishes to construct the result of the determinization of T for a 
given input string w, one does not need to expand the whole result of the determiniza- 
tion, but only the necessary part of the determinized transducer. When restricted to 
a finite set the function realized by any transducer is subsequentiable, since it has 
bounded variation. 9 Acyclic transducers have the twins property, so they are determinizable.
Therefore, it is always possible to expand the result of the determinization
algorithm for a finite set of input strings, even if T is not determinizable. 
3.6 Determinization in Other Semirings 
The determinization algorithm that we previously presented applies as well to transducers
mapping strings to other semirings. We gave the pseudocode of the algorithm
in the general case. The algorithm applies for instance to the real semiring $(\mathcal{R}, +, \cdot, 0, 1)$.
One can also verify that $(\Sigma^* \cup \{\infty\}, \wedge, \cdot, \infty, \epsilon)$, where $\wedge$ denotes the longest common
prefix operation and $\cdot$ concatenation, $\infty$ a new element such that for any string
$w \in (\Sigma^* \cup \{\infty\})$, $w \wedge \infty = \infty \wedge w = w$ and $w \cdot \infty = \infty \cdot w = \infty$, defines a left semiring. 10 We
call this semiring the string semiring. The algorithm of Figure 10 used with the string
semiring is exactly the determinization algorithm for subsequentiable string-to-string
transducers, as defined by Mohri (1994c). The cross product of two semirings defines
a semiring. The algorithm also applies when the semiring is the cross product of
$(\Sigma^* \cup \{\infty\}, \wedge, \cdot, \infty, \epsilon)$ and $(\mathcal{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$, which allows transducers outputting
pairs of strings and weights to be determinized. The determinization algorithm for such
transducers is illustrated in Figures 13 and 14. Subsets in this algorithm are made of
triples $(q, w, x) \in Q \times (\Sigma^* \cup \{\infty\}) \times (\mathcal{R} \cup \{\infty\})$, where q is a state of the initial transducer,
w a residual string, and x a residual output weight.
9 Using the proof of the theorem of the previous section, it is easy to convince oneself that this assertion
can be generalized to any rational subset Y of $\Sigma^*$ such that the restriction of S, the function T realizes,
to Y has bounded variation.
10 A left semiring is a semiring that may lack right distributivity.
Computational Linguistics Volume 23, Number 2
Figure 13
Transducer $\tau_1$ with outputs in $\Sigma^* \times \mathcal{R}$.
Figure 14
Sequential transducer $\tau_2$ with outputs in $\Sigma^* \times \mathcal{R}$ obtained from $\tau_1$ by determinization.
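The two semirings above, and their cross product, can be illustrated with a short Python sketch. It is a sketch under stated assumptions only: the added element $\infty$ of the string semiring is represented by `None`, and the names `INF_STR`, `lcp`, `concat`, `pair_plus`, and `pair_times` are hypothetical helpers introduced for illustration.

```python
import math

INF_STR = None  # stands for the added element "infinity" of the string semiring

def lcp(u, v):
    """Semiring sum of the string semiring: longest common prefix."""
    if u is INF_STR:
        return v
    if v is INF_STR:
        return u
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def concat(u, v):
    """Semiring product of the string semiring: concatenation."""
    if u is INF_STR or v is INF_STR:
        return INF_STR
    return u + v

def pair_plus(x, y):
    """Sum in the cross product: (longest common prefix, min)."""
    return (lcp(x[0], y[0]), min(x[1], y[1]))

def pair_times(x, y):
    """Product in the cross product: (concatenation, +)."""
    return (concat(x[0], y[0]), x[1] + y[1])

# The zero of the cross product is (infinity, infinity); its one is ("", 0).
ZERO = (INF_STR, math.inf)
ONE = ("", 0)
```

The sketch also makes the lack of right distributivity concrete: `concat(lcp("a", "ab"), "c")` gives `"ac"`, whereas `lcp(concat("a", "c"), concat("ab", "c"))` gives `"a"`.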
3.7 Minimization 
We here define a minimization algorithm for subsequential power series defined on
the tropical semiring, which extends the algorithm defined by Mohri (1994b) in the
case of string-to-string transducers. For any subset L of $\Sigma^*$ and any string u we define
$u^{-1}L$ by:
$$u^{-1}L = \{w : uw \in L\} \tag{22}$$
Recall that L is a regular language iff there exists a finite number of distinct $u^{-1}L$
(Nerode 1958). In a similar way, given a power series S we define a new power series
$u^{-1}S$ by: 11
$$u^{-1}S = \sum_{w \in \Sigma^*} (S, uw)\, w \tag{23}$$
11 One can prove that S, a power series defined on a field, is rational if it admits a finite number of
independent $u^{-1}S$ (Carlyle and Paz 1971). This is the equivalent, for power series, of Nerode's theorem
for regular languages.
Mohri Transducers in Language and Speech 
For any subsequential power series S we can now define the following relation on $\Sigma^*$:
$$\forall (u, v) \in (\Sigma^*)^2,\ u \, R_S \, v \iff \exists k \in \mathcal{R},\ (u^{-1}\mathrm{supp}(S) = v^{-1}\mathrm{supp}(S)) \text{ and } ([u^{-1}S - v^{-1}S]/u^{-1}\mathrm{supp}(S) = k) \tag{24}$$
It is easy to show that $R_S$ is an equivalence relation. $(u^{-1}\mathrm{supp}(S) = v^{-1}\mathrm{supp}(S))$ defines
the equivalence relation for regular languages. $R_S$ is a finer relation. The additional
condition in the definition of $R_S$ is that the restriction of the power series $[u^{-1}S - v^{-1}S]$
to $u^{-1}\mathrm{supp}(S) = v^{-1}\mathrm{supp}(S)$ is constant. The following lemma shows that if there exists
a subsequential transducer T computing S with a number of states equal to the number
of equivalence classes of $R_S$, then T is a minimal transducer computing S.
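The relation $R_S$ can be checked directly on a small example. The following Python sketch assumes, for illustration only, that the power series has finite support and is stored as a dictionary from strings to weights; the helper names `quotient` and `related` are hypothetical.

```python
def quotient(series, u):
    """u^{-1}S for a finite series {string: weight}: maps w to (S, uw)."""
    return {w[len(u):]: c for w, c in series.items() if w.startswith(u)}

def related(series, u, v):
    """u R_S v: the quotients have the same support and differ by a constant k."""
    qu, qv = quotient(series, u), quotient(series, v)
    if qu.keys() != qv.keys():
        return False
    differences = {qu[w] - qv[w] for w in qu}
    return len(differences) <= 1

# (S, ab) = 1, (S, ac) = 3, (S, bb) = 4, (S, bc) = 6, (S, c) = 0
S = {"ab": 1, "ac": 3, "bb": 4, "bc": 6, "c": 0}
```

Here `related(S, "a", "b")` holds with $k = -3$, since both quotients have support $\{b, c\}$ and their weights differ by the same constant, while `related(S, "a", "c")` fails because the supports differ.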
Lemma 3 
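If S is a subsequential power series defined on the tropical semiring, $R_S$ has a finite
number of equivalence classes. This number is bounded by the number of states of
any subsequential transducer realizing S.
Proof
Let $T = (Q, i, F, \Sigma, \delta, \sigma, \lambda, \rho)$ be a subsequential transducer realizing S. Clearly,
$$\forall (u, v) \in (\Sigma^*)^2,\ \delta(i, u) = \delta(i, v) \Rightarrow \forall w \in u^{-1}\mathrm{supp}(S),\ \delta(i, uw) \in F \iff \delta(i, vw) \in F$$
$$\Rightarrow u^{-1}\mathrm{supp}(S) = v^{-1}\mathrm{supp}(S)$$
Also, if $u^{-1}\mathrm{supp}(S) = v^{-1}\mathrm{supp}(S)$, $\forall (u, v) \in (\Sigma^*)^2$,
$$\delta(i, u) = \delta(i, v) \Rightarrow \forall w \in u^{-1}\mathrm{supp}(S),\ (S, uw) - (S, vw) = \sigma(i, u) - \sigma(i, v)$$
$$\iff [u^{-1}S - v^{-1}S]/u^{-1}\mathrm{supp}(S) = \sigma(i, u) - \sigma(i, v)$$
So $\forall (u, v) \in (\Sigma^*)^2,\ \delta(i, u) = \delta(i, v) \Rightarrow (u \, R_S \, v)$. This proves the lemma. □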
The following theorem proves the existence of a minimal subsequential transducer 
representing S. 
Theorem 14 
For any subsequential function S, there exists a minimal subsequential transducer 
computing it. Its number of states is equal to the index of Rs. 
Proof 
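Given a subsequential power series S, we define a power series f by:
$$\forall u \in \Sigma^* : u^{-1}\mathrm{supp}(S) = \emptyset,\quad (f, u) = 0$$
$$\forall u \in \Sigma^* : u^{-1}\mathrm{supp}(S) \neq \emptyset,\quad (f, u) = \min_{w \in u^{-1}\mathrm{supp}(S)} (S, uw)$$
We then define a subsequential transducer $T = (Q, i, F, \Sigma, \delta, \sigma, \lambda, \rho)$ by: 12
• $Q = \{\bar{u} : u \in \Sigma^*\}$;
• $i = \bar{\epsilon}$;
• $F = \{\bar{u} : u \in \Sigma^* \cap \mathrm{supp}(S)\}$;
• $\forall u \in \Sigma^*, \forall a \in \Sigma,\ \delta(\bar{u}, a) = \overline{ua}$;
• $\forall u \in \Sigma^*, \forall a \in \Sigma,\ \sigma(\bar{u}, a) = (f, ua) - (f, u)$;
• $\lambda = (f, \epsilon)$;
• $\forall q \in Q,\ \rho(q) = 0$.
12 We denote by $\bar{u}$ the equivalence class of $u \in \Sigma^*$.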
Since the index of $R_S$ is finite, Q and F are well-defined. The definition of $\delta$ does
not depend on the choice of the element u in $\bar{u}$, since for any $a \in \Sigma$, $u \, R_S \, v$ implies
$(ua) \, R_S \, (va)$. The definition of $\sigma$ is also independent of this choice, since by definition
of $R_S$, if $u \, R_S \, v$, then $(ua) \, R_S \, (va)$ and there exists $k \in \mathcal{R}$ such that $\forall w \in \Sigma^*$,
$(S, uaw) - (S, vaw) = (S, uw) - (S, vw) = k$. Notice that the definition of $\sigma$ implies that:
$$\forall w \in \Sigma^*,\ \sigma(i, w) = (f, w) - (f, \epsilon) \tag{25}$$
So:
$$\forall w \in \mathrm{supp}(S),\ \lambda + \sigma(i, w) + \rho(q) = (f, w) = \min_{w' \in w^{-1}\mathrm{supp}(S)} (S, ww')$$
S is subsequential, hence: $\forall w' \in w^{-1}\mathrm{supp}(S),\ (S, ww') \geq (S, w)$. Since $\forall w \in \mathrm{supp}(S),\ \epsilon \in w^{-1}\mathrm{supp}(S)$,
we have:
$$\min_{w' \in w^{-1}\mathrm{supp}(S)} (S, ww') = (S, w)$$
T realizes S. This ends the proof of the theorem. □
Given a subsequential transducer $T = (Q, i, F, \Sigma, \delta, \sigma, \lambda, \rho)$, we can define for each
state $q \in Q$, d(q) by:
$$d(q) = \min_{\delta(q, w) \in F} \left(\sigma(q, w) + \rho(\delta(q, w))\right) \tag{26}$$
Definition 
We define a new operation of pushing, which applies to any transducer T. In particular,
if T is subsequential the result of the application of pushing to T is a new
subsequential transducer $T' = (Q, i, F, \Sigma, \delta, \sigma', \lambda', \rho')$ that only differs from T by its
output weights in the following way:
• $\lambda' = \lambda + d(i)$;
• $\forall (q, a) \in Q \times \Sigma,\ \sigma'(q, a) = \sigma(q, a) + d(\delta(q, a)) - d(q)$;
• $\forall q \in Q,\ \rho'(q) = 0$.
According to the definition of d, we have:
$$\forall w \in \Sigma^* : \delta(q, aw) \in F,\quad d(q) \leq \sigma(q, a) + \sigma(\delta(q, a), w) + \rho(\delta(\delta(q, a), w))$$
This implies that:
$$d(q) \leq \sigma(q, a) + d(\delta(q, a))$$
So, $\sigma'$ is well-defined:
$$\forall (q, a) \in Q \times \Sigma,\ \sigma'(q, a) \geq 0$$
Lemma 4 
Let T' be the transducer obtained from T by pushing. T' is a subsequential transducer
which realizes the same function as T.
Proof
That T' is subsequential follows immediately from its definition. Also,
$$\forall w \in \Sigma^*, q \in Q,\ \sigma'(q, w) = \sigma(q, w) + d(\delta(q, w)) - d(q)$$
Since $\delta(i, w) \in F \Rightarrow d(\delta(i, w)) = \rho(\delta(i, w))$, we have:
$$\lambda' + \sigma'(i, w) + \rho'(\delta(i, w)) = \lambda + d(i) + \sigma(i, w) + \rho(\delta(i, w)) - d(i) + 0$$
This proves the lemma. □
The following theorem defines the minimization algorithm. 
Theorem 15 
Let T be a subsequential transducer realizing a power series on the tropical semiring. 
Then applying the following two operations: 
1. pushing 
2. automata minimization 
leads to a minimal transducer. This minimal transducer is exactly the one defined in 
the proof of theorem 14. 
The automata minimization step in the theorem consists of considering pairs of 
input labels and associated weights as a single label and of applying classical mini- 
mization algorithms for automata (Aho, Hopcroft, and Ullman 1974). We do not give 
the proof of the theorem; it can be proved in a way similar to what is indicated in 
Mohri (1994b). 
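The automata minimization step can be sketched as follows in Python, treating each pair (input label, weight) as a single label. This is an illustrative sketch under assumptions, not the algorithm cited in the text: it uses simple Moore-style partition refinement rather than Hopcroft's algorithm, it assumes the transducer has already been pushed, and the representation (`arcs[q][a] = (weight, next_state)`, `final_weight[q]`) and the names `number_blocks` and `minimize_as_automaton` are hypothetical.

```python
def number_blocks(keys):
    """Map each state to a small integer identifying its key."""
    ids = {}
    return {q: ids.setdefault(k, len(ids)) for q, k in keys.items()}

def minimize_as_automaton(states, arcs, final_weight):
    """Return a map state -> block id; states sharing a block can be merged."""
    # Initial partition: separate states by finality and final weight.
    block = number_blocks(
        {q: (q in final_weight, final_weight.get(q)) for q in states}
    )
    while True:
        # A state's signature is its block plus its outgoing
        # (label, weight, target block) triples.
        signature = {
            q: (block[q],
                tuple(sorted((a, w, block[nxt])
                             for a, (w, nxt) in arcs.get(q, {}).items())))
            for q in states
        }
        refined = number_blocks(signature)
        if len(set(refined.values())) == len(set(block.values())):
            return refined  # no block was split: the partition is stable
        block = refined

# A pushed transducer in which states 1 and 2 have become equivalent.
pushed_arcs = {0: {"a": (0, 1), "b": (2, 2)}, 1: {"c": (0, 3)}, 2: {"c": (0, 3)}, 3: {}}
blocks = minimize_as_automaton([0, 1, 2, 3], pushed_arcs, {3: 0})
```

In this example states 1 and 2 fall into the same block, so the minimal transducer has three states. If the two `c` transitions carried weights 5 and 6 instead (the weights before pushing), no two states would be merged.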
In general, there are several distinct minimal subsequential transducers realizing
the same function. Pushing introduces an equivalence relation on minimal transducers:
$T \, R_p \, T'$ iff $p(T) = p(T')$, where p(T) (resp. p(T')) denotes the transducer obtained
from T (resp. T') by pushing. Indeed, if T and T' are minimal transducers realizing
the same function, then p(T) and p(T') are both equal to the unique minimal transducer
equivalent to T and T' as defined in theorem 14. So, two equivalent minimal
transducers only differ by their output labels; they have the same topology. They only
differ by the way the output weights are spread along the paths.
Notice that if we introduce a new super final state $\Phi$ to which each final state q
is connected by a transition of weight $\rho(q)$, then d(q) in the definition of T' is exactly
the length of a shortest path from q to $\Phi$ (equivalently, from $\Phi$ to q in the reversed
graph). Thus, T' can be obtained from T using
the classical single-source shortest paths algorithms such as that of Dijkstra (Cormen,
Leiserson, and Rivest 1992). 13 In case the transducer is acyclic, a classical linear time
algorithm based on a topological sort of the graph allows one to obtain d.
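Pushing can be sketched in Python as a shortest-distance computation followed by a reweighting pass. The sketch below rests on several assumptions that are not part of the text: weights are nonnegative, the transducer is trim and deterministic, it is stored as `arcs[q][a] = (weight, next_state)`, states are mutually comparable values such as integers, and the function names are hypothetical. The final weights are set to $\rho(q) - d(q)$, which equals 0 whenever $d(q) = \rho(q)$ as assumed in the definition above.

```python
import heapq

def shortest_distance_to_final(states, arcs, final_weight):
    """d(q): smallest weight of a path from q to a final state, final weight
    included, computed by Dijkstra's algorithm on the reversed graph."""
    reverse_arcs = {q: [] for q in states}
    for q, outgoing in arcs.items():
        for _label, (weight, nxt) in outgoing.items():
            reverse_arcs[nxt].append((weight, q))
    d = {q: float("inf") for q in states}
    heap = []
    for q, rho in final_weight.items():
        d[q] = rho
        heapq.heappush(heap, (rho, q))
    while heap:
        dist, q = heapq.heappop(heap)
        if dist > d[q]:
            continue  # stale heap entry
        for weight, p in reverse_arcs[q]:
            if dist + weight < d[p]:
                d[p] = dist + weight
                heapq.heappush(heap, (d[p], p))
    return d

def push(states, initial, arcs, initial_weight, final_weight):
    """Return (lambda', sigma', rho') of the pushed transducer."""
    d = shortest_distance_to_final(states, arcs, final_weight)
    new_arcs = {
        q: {a: (w + d[nxt] - d[q], nxt) for a, (w, nxt) in outgoing.items()}
        for q, outgoing in arcs.items()
    }
    new_final = {q: rho - d[q] for q, rho in final_weight.items()}
    return initial_weight + d[initial], new_arcs, new_final

example_arcs = {0: {"a": (0, 1), "b": (1, 2)}, 1: {"c": (5, 3)}, 2: {"c": (6, 3)}, 3: {}}
pushed = push([0, 1, 2, 3], 0, example_arcs, 0, {3: 0})
```

In the example, $d(0) = 5$, so the new initial weight is 5, the two `c` transitions both receive weight 0, and the string `bc` keeps its total weight $5 + 2 + 0 = 7$.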
13 This algorithm can be extended to the case where weights are negative. If there is no negative cycle the 
Bellman-Ford algorithm can be used. 
Figure 15
Transducer $\beta_1$.
Figure 16
Transducer $\gamma_1$ obtained from $\beta_1$ by pushing.
Once the function d is defined, the transformation of T into T' can be done in linear
time, namely $O(|Q| + |E|)$, if we denote by E the set of transitions of T. The complexity of
pushing is therefore linear ($O(|Q| + |E|)$) if the transducer is acyclic. In the general case,
the complexity of pushing is $O(|E| \log |Q|)$ if we use classical heaps, $O(|E| + |Q| \log |Q|)$
if we use Fibonacci heaps, and $O(|E| \log \log |Q|)$ if we use the efficient implementation
of priority queues by Thorup (1996). In case the maximum output weight W is small,
we can use the algorithm of Ahuja et al. (1988); the complexity of pushing is then
$O(|E| + |Q|\sqrt{|W|})$.
In case the transducer is acyclic, we can use a specific automata minimization
algorithm (Revuz 1992) with linear time complexity, $O(|Q| + |E|)$. In the general case,
an efficient implementation of Hopcroft's algorithm (Aho, Hopcroft, and Ullman 1974)
leads to $O(|E| \log |Q|)$.
Thus, the overall complexity of the minimization of subsequential transducers is
always as good as that of classical automata minimization: $O(|Q| + |E|)$ in the acyclic
case, and $O(|E| \log |Q|)$ in the general case.
Figures 15 to 17 illustrate the minimization algorithm. $\beta_1$ (Figure 15) represents a
subsequential string-to-weight transducer. Notice that the size of $\beta_1$ cannot be reduced
using the automata minimization. $\gamma_1$ represents the transducer obtained by pushing,
and $\delta_1$ a minimal transducer realizing the same function as $\beta_1$ in the tropical semiring.
Figure 17
Minimal transducer $\delta_1$ obtained from $\gamma_1$ by automata minimization.
The transducer obtained by this algorithm is the one defined in the proof of the- 
orem 14 and has the minimal number of states. This raises the question of whether 
there exists a subsequential transducer with the minimal number of transitions and 
computing the same function as a given subsequential transducer T. The following 
corollary offers an answer. 
Corollary 1 
A minimal subsequential transducer also has the minimal number of transitions among
all subsequential transducers realizing the same function.
Proof
This generalizes the analogous theorem that holds in the case of automata. The proof
is similar. Let T be a subsequential transducer with a minimal number of transitions.
Clearly, pushing does not change the number of transitions of T, and automata minimization,
which consists of merging equivalent states, reduces or does not change this
number. Thus, the number of transitions of the minimal transducer equivalent to T
as previously defined is less than or equal to that of T. This proves the corollary since, as
previously pointed out, equivalent minimal transducers all have the same topology:
in particular, they have the same number of states and transitions. □
Given two subsequential transducers, one might wish to test their equivalence. 
The importance of this problem was pointed out by Hopcroft and Ullman (1979, 284). 
The following corollary addresses this question. 
Corollary 2 
There exists an algorithm to determine if two subsequential transducers are equivalent. 
Proof 
The algorithm of theorem 15 associates a unique minimal transducer to each sub- 
sequential transducer T. More precisely, this minimal transducer is unique up to a 
renumbering of the states. The identity of two subsequential transducers with differ- 
ent numbering of states can be tested in the same way as that of two deterministic 
automata; for instance, by testing the equivalence of the automata and the equality 
of their number of states. An efficient algorithm for testing the equivalence of two 
deterministic automata is given in Aho, Hopcroft, and Ullman (1974). 14 Since the min- 
14 The automata minimization step can in fact be omitted if this equivalence algorithm is used, since it 
does not affect the equivalence of the two subsequential transducers, considered as automata. 

weight transducer, each path of which corresponds to a sentence. The weight of the 
path can be interpreted as a negative log of the probability of that sentence given the 
sequence of acoustic observations (utterance). Such acyclic string-to-weight transduc- 
ers are called word lattices. 
4.2 Word Lattices 
For a given utterance, the word lattice obtained in such a way contains many paths 
labeled with the possible sentences and their associated weights. A word lattice often 
contains a lot of redundancy: many paths correspond to the same sentence but with 
different weights. 
Word lattices can be directly searched to find the most probable sentences, those 
which correspond to the best paths, the paths with the smallest weights. 
Figure 18 shows a word lattice obtained in speech recognition for the 2,000-word 
ARPA ATIS Task. It corresponds to the following utterance: Which flights leave Detroit 
and arrive at Saint Petersburg around nine am? Clearly the lattice is complex; it contains 
about 83 million paths. 
Usually, it is not enough to consider the best path of a word lattice. It is also 
necessary to correct the best path approximation by considering the n best paths, 
where the value of n depends on the task considered. Notice that in case n is very 
large, one would need to consider, for the lattice in Figure 18, all 83 million paths. The 
transducer contains 106 states and 359 transitions. 
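The number of paths of a word lattice and the weight of its best path can both be obtained by a simple dynamic program over the acyclic graph. The Python sketch below is illustrative only and assumes a hypothetical representation: `lattice[state]` is a list of `(word, weight, next_state)` arcs, the lattice is acyclic, and it has a single final state with no outgoing arcs.

```python
from functools import lru_cache

def count_and_best(lattice, initial, final):
    """Return (number of paths, weight of the best path) from initial to final."""
    @lru_cache(maxsize=None)
    def visit(state):
        if state == final:
            return 1, 0
        paths, best = 0, float("inf")
        for _word, weight, nxt in lattice.get(state, ()):
            n, b = visit(nxt)
            paths += n
            best = min(best, weight + b)
        return paths, best
    return visit(initial)

toy_lattice = {
    0: [("which", 1, 1), ("witch", 3, 1)],
    1: [("flights", 2, 2), ("flight", 1, 2), ("lights", 4, 2)],
}
```

On this toy lattice the call `count_and_best(toy_lattice, 0, 2)` visits each state once and reports 6 paths with a best weight of 2, which shows how the number of paths grows multiplicatively with the number of alternatives at each position.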
Determinization applies to this lattice. The resulting transducer W2 (Figure 19) is
sparser. Recall that it is equivalent to W1, realizing exactly the same function mapping
strings to weights. For a given sentence s recognized by W1, there are many different
paths with different total weights. W2 contains a path labeled with s and with a total
weight equal to the minimum of the weights of the paths of W1. We emphasize that
no pruning, heuristic, or approximation has been used here. The lattice W2
only contains 18 paths. Obviously, the search stage in speech recognition is greatly
simplified when applied to W2 rather than W1. W2 admits 38 states and 51 transitions.
The transducer W2 can still be minimized. The minimization algorithm described
in the previous section leads to the transducer W3 shown in Figure 20. It contains 25
states and 33 transitions and of course the same number of paths as W2, 18. The effect of
minimization appears to be less important. This is because, in this case, determinization
includes a large part of the minimization by reducing the size of the first lattice. This
can be explained by the degree of nondeterminism of word lattices such as W1. 15 Many
states can be reached by the same set of strings. These states are grouped into a single
subset during determinization.
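The grouping of states into weighted subsets can be made concrete with a short Python sketch of determinization over the tropical semiring. It is a sketch under assumptions, not the pseudocode of the paper: the input is an acyclic lattice stored as `arcs[state] = [(label, weight, next_state), ...]` with a single initial state and initial weight 0, and the name `determinize` and the returned representation are hypothetical. Each state of the result is a set of pairs (state, residual weight).

```python
def determinize(initial, arcs, final_weight):
    """Weighted subset construction over the tropical semiring (min, +)."""
    start = frozenset({(initial, 0)})
    det_arcs, det_final = {}, {}
    stack, seen = [start], {start}
    while stack:
        subset = stack.pop()
        finals = [x + final_weight[q] for q, x in subset if q in final_weight]
        if finals:
            det_final[subset] = min(finals)
        # Group the transitions leaving the subset by input label.
        by_label = {}
        for q, x in subset:
            for label, weight, nxt in arcs.get(q, []):
                by_label.setdefault(label, []).append((x + weight, nxt))
        det_arcs[subset] = {}
        for label, reached in by_label.items():
            out = min(c for c, _ in reached)  # weight output on this transition
            residual = {}
            for c, nxt in reached:            # weight still owed at each state
                residual[nxt] = min(residual.get(nxt, float("inf")), c - out)
            target = frozenset(residual.items())
            det_arcs[subset][label] = (out, target)
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return start, det_arcs, det_final

lattice = {0: [("a", 1, 1), ("a", 2, 2)], 1: [("b", 3, 3), ("c", 1, 3)], 2: [("b", 1, 3)]}
start, det_arcs, det_final = determinize(0, lattice, {3: 0})
```

In the example, the two `a` arcs are merged into one transition of weight 1 leading to the subset {(1, 0), (2, 1)}, and the string `ab` receives weight $1 + 2 = 3$, the minimum of the two original path weights 4 and 3.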
Also, the complexity of determinization is exponential in general, but in the case
of the lattices considered in speech recognition, it is not. 16 Since they contain a lot
of redundancy, the resulting lattice is smaller than the initial one. In fact, the time
complexity of determinization can be expressed in terms of the initial and resulting
lattices, W1 and W2, by $O(|\Sigma| \log |\Sigma| (|W_1||W_2|)^2)$, where $|W_1|$ and $|W_2|$ denote the sizes
of W1 and W2. Clearly if we restrict determinization to the cases where $|W_2| < |W_1|$, its
complexity is polynomial in terms of the size of the initial transducer $|W_1|$. This also
applies to the space complexity of the algorithm. In practice, the algorithm appears to
be very efficient. As an example, it took about 0.02s on a Silicon Graphics (Indy 100
MHz Processor, 64 Mb RAM) to determinize the transducer of Figure 18. 17
15 The notion of ambiguity of a finite automaton can be formalized conveniently using the tropical
semiring. Many important studies of the degree of ambiguity of automata have been done in connection with the study of the properties of this semiring (Simon 1987).
16 A more specific determinization can be used in the cases often encountered in natural language
processing where the graph admits a loop at the initial state over the elements of the alphabet (Mohri 1995).
Figure 19
Equivalent word lattice W2 obtained by determinization of W1.
Figure 20
Equivalent word lattice W3 obtained by minimization from W2.
Figure 21
Rescoring.
Determinization makes the use of lattices much faster. Since at any state there 
exists at most one transition labeled with the word considered, finding the weight 
associated with a sentence does not depend on the size of the lattice. The time and 
space complexity of such an operation is simply linear in the size of the sentence. 
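The lookup described above amounts to following one transition per word. A minimal Python sketch, assuming a hypothetical representation of a determinized lattice as `arcs[state][word] = (weight, next_state)` together with a dictionary of final weights:

```python
def sentence_weight(initial, arcs, final_weight, words):
    """Weight of a sentence in a deterministic lattice, or None if rejected.
    The loop performs one dictionary lookup per word of the sentence."""
    state, total = initial, 0
    for word in words:
        step = arcs.get(state, {}).get(word)
        if step is None:
            return None  # no transition labeled with this word
        weight, state = step
        total += weight
    if state not in final_weight:
        return None
    return total + final_weight[state]

det_lattice = {0: {"which": (1, 1)}, 1: {"flights": (2, 2), "flight": (4, 2)}}
```

With this data, `sentence_weight(0, det_lattice, {2: 0}, ["which", "flights"])` follows two transitions and returns 3, independently of how many other arcs the lattice contains.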
When dealing with large tasks, most speech recognition systems use a rescoring
method (Figure 21). This consists of first using a simple acoustic and grammar model
to produce a word lattice, and then reevaluating this word lattice with a more sophisticated
model.
The size of the word lattice is then a critical parameter in the time and space 
efficiency of the system. The determinization and minimization algorithms we pre- 
sented allow the size of such word lattices to be considerably reduced, as seen in the 
examples. 
We experimented with both determinization and minimization algorithms in the 
ATIS task. Table 1 illustrates these results. It shows these algorithms to be very effective
in reducing the redundancy of speech networks in this task. The reduction is also 
illustrated by an example in the ATIS task. 
17 Part of this time corresponds to I/O's and is therefore independent of the algorithm. 
Table 1
Word lattices in the ATIS task.
Objects        Determinization      Determinization + Minimization
               reduction factor     reduction factor
States         ≈ 3                  ≈ 5
Transitions    ≈ 9                  ≈ 17
Paths          > $2^{32}$           > $2^{32}$
Table 2
Subsequential word lattices in the NAB task.
Objects        Minimization
               reduction factor
States         ≈ 4
Transitions    ≈ 3
Example 1
Example of a word lattice in the ATIS task.
States: 187 → 37
Transitions: 993 → 59
Paths: > $2^{32}$ → 6,993
The number of paths of the word lattice before determinization was larger than
the largest integer representable on 32-bit machines. We also experimented with
the minimization algorithm by applying it to several word lattices obtained in the
60,000-word ARPA North American Business News task (NAB).
These lattices were already determinized. Table 2 shows the average reduction fac- 
tors we obtained when using the minimization algorithms with several subsequential 
lattices obtained for utterances of the NAB task. The reduction factors help to mea- 
sure the gain of minimization alone, since the lattices are already subsequential. The 
numbers in example 2, an example of reduction we obtained, correspond to a typical 
case. 
Example 2 
Example of a word lattice in the NAB task.
Transitions: 108,211 → 37,563
States: 10,407 → 2,553
4.3 On-the-fly Implementation of Determinization 
An important characteristic of the determinization algorithm is that it can be used 
on-the-fly. Indeed, the determinization algorithm is such that given a subset repre- 
senting a state of the resulting transducer, the definition of the transitions leaving that 
state depends only on that state or, equivalently, on the states of that subset, and on 
the transducer to determinize. In particular, the definition and construction of these 
transitions do not depend directly on the previous subsets constructed. 
We have produced an implementation of the determinization that allows one either
to completely expand the result or to expand it on demand. Arcs leaving a state of
the determinized transducer are expanded only if necessary. This characteristic of the
implementation is important. It can then be used, for instance, at any step in an
on-the-fly cascade of composition of transducers in speech recognition to expand only
the necessary part of a lattice or transducer (Pereira and Riley 1996; Mohri, Pereira,
and Riley 1996). One of the essential implications of the implementation is that it
contributes to saving space during the search stage. It is also very useful in speeding
up the n-best decoder in speech recognition. 18
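On-demand expansion can be sketched in Python by computing the transition leaving a weighted subset only when it is first requested and caching the result. This is an illustrative sketch, not the implementation mentioned in the text; the class name `LazyDeterminizer`, its methods, and the arc representation `arcs[state] = [(label, weight, next_state), ...]` are assumptions made for the example, and the semiring is the tropical one.

```python
class LazyDeterminizer:
    """Expands transitions of the determinized transducer only when asked."""

    def __init__(self, initial, arcs, final_weight):
        self.arcs = arcs
        self.final_weight = final_weight
        self.start = frozenset({(initial, 0)})
        self.expanded = {}  # (subset, label) -> (output weight, target subset)

    def step(self, subset, label):
        key = (subset, label)
        if key not in self.expanded:
            reached = [(x + w, nxt)
                       for q, x in subset
                       for a, w, nxt in self.arcs.get(q, [])
                       if a == label]
            if not reached:
                self.expanded[key] = None
            else:
                out = min(c for c, _ in reached)
                residual = {}
                for c, nxt in reached:
                    residual[nxt] = min(residual.get(nxt, c - out), c - out)
                self.expanded[key] = (out, frozenset(residual.items()))
        return self.expanded[key]

    def weight(self, labels):
        """Weight assigned to the input sequence, or None if it is rejected."""
        subset, total = self.start, 0
        for label in labels:
            step = self.step(subset, label)
            if step is None:
                return None
            out, subset = step
            total += out
        finals = [x + self.final_weight[q] for q, x in subset if q in self.final_weight]
        return total + min(finals) if finals else None

# A cyclic transducer whose loops have different weights: full determinization
# would not terminate, but the part needed for one input string is finite.
cyclic = {0: [("a", 0, 1), ("a", 0, 2)], 1: [("b", 1, 1)], 2: [("b", 2, 2)]}
lazy = LazyDeterminizer(0, cyclic, {1: 0, 2: 0})
```

For the input `abb`, only three transitions are expanded and the returned weight is 2, the minimum of the path weights 2 and 4 in the original transducer.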
The determinization and minimization algorithms for string-to-weight transducers 
seem to have other applications in speech processing. Many new experiments can be 
done using these algorithms at different stages of speech recognition, which might 
lead to the reshaping of some of the methods used in this field and create a renewed 
interest in the theory of automata and transducers. 
5. Conclusion 
We have briefly presented the theoretical bases, algorithmic tools, and practical use of 
a set of devices that seem to fit the complexity of language and provide efficiency in 
space and time. From the theoretical point of view, the understanding of these objects 
is crucial. It helps to describe the possibilities they offer and to guide algorithmic 
choices. Many new theoretical issues arise when more precision is sought. 
The notion of determinization can be generalized to that of $\epsilon$-determinization, for
instance (Salomaa and Soittola 1978, chapter 3, exercise), requiring more general algorithms.
It can also be extended to local determinization: determinization at only
those states of a transducer that admit a predefined property, such as that of having 
a large number of outgoing transitions. An important advantage of local determiniza- 
tion is that it can be applied to any transducer without restriction. Furthermore, local 
determinization also admits an on-the-fly implementation. New characterizations of 
rational functions shed new light on some aspects of the theory of finite-state transducers
(Reutenauer and Schützenberger 1995). We have also offered a generalization of
the operations we use based on the notions of semiring and power series, which help 
to simplify problems and algorithms used in various cases. In particular, the string 
semiring that we introduced makes it conceptually easier to describe many algorithms 
and properties. 
Subsequential transducers admit very efficient algorithms. The determinization 
and minimization algorithms in the case of string-to-weight transducers presented 
here complete a large series of algorithms that have been shown to give remarkable 
results in natural language processing. Sequential machines lead to useful algorithms 
in many other areas of computational linguistics. In particular, subsequential power 
series allow for efficient results in indexation of natural language texts (Crochemore 
1986; Mohri 1996b). 
We briefly illustrated the application of these algorithms to speech recognition. 
More precision in acoustic modeling, finer language models, large lexicon grammars, 
and a larger vocabulary will lead, in the near future, to networks of much larger sizes 
in speech recognition. The determinization and minimization algorithms might help 
to limit the size of these networks while maintaining their time efficiency. 
These algorithms can also be used in text-to-speech synthesis. In fact, the same 
operations of composition of transducers (Sproat 1995) and perhaps more important 
size issues can be found in this field. 
18 We describe this application of determinization elsewhere. 
Figure 22 
Subsequential power series S nonbisubsequential. 
Appendix 
The determinization algorithm for power series can also be used to minimize trans- 
ducers in many cases. Let us first consider the case of automata. Brzozowski (1962) 
showed that determinization can be used to minimize automata. This nice result has 
also been proved more recently in elegant papers by Bauer (1988) and Urbanek (1989). 
These authors refine the method to obtain better complexities. 19 
Theorem 16 (Brzozowski 1962) 
Let A be a nondeterministic automaton. Then the automaton $A' = (Q', i', F', \Sigma, \delta')$ obtained
by reversing A, applying determinization, rereversing the obtained automaton
and determinizing it is the minimal deterministic automaton equivalent to A.
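The construction of the theorem can be sketched for unweighted automata in Python. The representation is an assumption made for the example: an automaton is a triple (set of initial states, set of final states, dictionary from `(state, label)` to a set of target states), and the function names are hypothetical. The determinization used here builds only accessible, non-empty subsets.

```python
def reverse(automaton):
    """Reverse every transition and exchange initial and final states."""
    initials, finals, arcs = automaton
    reversed_arcs = {}
    for (q, a), targets in arcs.items():
        for t in targets:
            reversed_arcs.setdefault((t, a), set()).add(q)
    return finals, initials, reversed_arcs

def determinize(automaton):
    """Classical subset construction restricted to accessible subsets."""
    initials, finals, arcs = automaton
    alphabet = {a for (_q, a) in arcs}
    start = frozenset(initials)
    seen, stack, det_arcs = {start}, [start], {}
    while stack:
        subset = stack.pop()
        for a in alphabet:
            target = frozenset(t for q in subset for t in arcs.get((q, a), ()))
            if not target:
                continue
            det_arcs[(subset, a)] = {target}
            if target not in seen:
                seen.add(target)
                stack.append(target)
    det_finals = {s for s in seen if s & frozenset(finals)}
    return {start}, det_finals, det_arcs

def brzozowski(automaton):
    """Reverse, determinize, reverse again, determinize again."""
    return determinize(reverse(determinize(reverse(automaton))))

def states_of(automaton):
    initials, finals, arcs = automaton
    found = set(initials) | set(finals)
    for (q, _a), targets in arcs.items():
        found.add(q)
        found |= targets
    return found

def accepts(automaton, word):
    initials, finals, arcs = automaton
    current = set(initials)
    for a in word:
        current = {t for q in current for t in arcs.get((q, a), ())}
    return bool(current & set(finals))

# Nondeterministic automaton for the strings over {a, b} ending in "ab".
nfa = ({0}, {2}, {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}})
minimal = brzozowski(nfa)
```

On this example the result has three states, which is the size of the minimal deterministic automaton for that language.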
We generalize this theorem to the case of string-to-weight transducers. We say that
a rational power series S is bisubsequential when S is subsequential and the power
series $S^R = \sum_{w \in \Sigma^*} (S, w^R)\, w$ is also subsequential. 20 Not all subsequential transducers
are bisubsequential. Figure 22 shows a transducer representing a power series S that
is not bisubsequential. S is such that:
$$\forall n \in \mathcal{N},\ (S, ba^n) = n + 1 \tag{27}$$
$$\forall n \in \mathcal{N},\ (S, ca^n) = 0$$
The transducer of Figure 22 is subsequential so S is subsequential. But the reverse $S^R$
is not, because it does not have bounded variation. Indeed, since:
$$\forall n \in \mathcal{N},\ (S^R, a^n b) = n + 1 \tag{28}$$
$$\forall n \in \mathcal{N},\ (S^R, a^n c) = 0$$
We have:
$$\forall n \in \mathcal{N},\ |(S^R, a^n b) - (S^R, a^n c)| = n + 1$$
19 See Watson (1993) for a taxonomy of minimization algorithms for automata; see also Courcelle, Niwinski, and Podelski 1991.
20 For any string $w \in \Sigma^*$, we denote by $w^R$ its reverse.
A characterization similar to that of string-to-string transducers (Choffrut 1978) is
possible for bisubsequential power series defined on the tropical semiring. In particular,
the theorem of the previous sections shows that S is bisubsequential iff S and $S^R$
have bounded variation.
We similarly define bideterminizable transducers as the transducers T defined on
the tropical semiring admitting two applications of determinization, as follows:
1. The reverse of T, $T^R$, can be determinized. We denote by $\det(T^R)$ the
resulting transducer.
2. The reverse of $\det(T^R)$, $[\det(T^R)]^R$, can also be determinized. We denote by
$\det([\det(T^R)]^R)$ the resulting transducer.
In this definition, we assume that the reverse operation is performed simply by
reversing the direction of the transitions and exchanging initial and final states. Given
this definition, we can present the extension of the theorem of Brzozowski (1962) to
bideterminizable transducers. 21
Theorem 17 
Let T be a bideterminizable transducer defined on the tropical semiring. Then the transducer
$\det([\det(T^R)]^R)$ obtained by reversing T, applying determinization, rereversing
the obtained transducer and determinizing it is a minimal subsequential transducer
equivalent to T.
Proof 
We denote by:
• $T_1 = (Q_1, i_1, F_1, \Sigma, \delta_1, \sigma_1, \lambda_1, \rho_1)$ the transducer $\det(T^R)$,
• $T' = (Q', i', F', \Sigma, \delta', \sigma', \lambda', \rho')$ the transducer $\det([\det(T^R)]^R)$,
• $T'' = (Q'', i'', F'', \Sigma, \delta'', \sigma'', \lambda'', \rho'')$ the transducer obtained from T' by
pushing.
The double reverse and determinization algorithms clearly do not change the function
that T realizes. So T' is a subsequential transducer equivalent to T. We only need to
prove that T' is minimal. This is equivalent to showing that T'' is minimal, since T'
and T'' have the same number of states. $T_1$ is the result of a determinization, hence
it is a trim subsequential transducer. We show that $T' = \det(T_1^R)$ is minimal if $T_1$ is
a trim subsequential transducer. Notice that the theorem does not require that T be
subsequential.
21 The theorem also holds in the case of string-to-string bideterminizable transducers. We give the proof
in the more complex case of string-to-weight transducers.
Figure 23
Transducer $\beta_2$ obtained by reversing $\beta_1$.
Figure 24
Transducer $\beta_3$ obtained by determinization of $\beta_2$.
Figure 25
Minimal transducer $\beta_4$ obtained by reversing $\beta_3$ and applying determinization.
Let $S_1$ and $S_2$ be two states of T'' equivalent in the sense of automata. We prove
that $S_1 = S_2$, namely that no two distinct states of T'' can be merged. This will prove
that T'' is minimal. Since pushing only affects the output labels, T' and T'' have the
same set of states: $Q' = Q''$. Hence $S_1$ and $S_2$ are also states of T'. The states of T'
can be viewed as weighted subsets whose state elements belong to $T_1$, because T' is
obtained by determinization of $T_1^R$.
Let $(q, c) \in Q_1 \times \mathcal{R}$ be a pair of the subset $S_1$. Since $T_1$ is trim there exists $w \in \Sigma^*$
such that $\delta_1(i_1, w) = q$, so $\delta'(S_1, w) \in F'$. Since $S_1$ and $S_2$ are equivalent, we also have:
$\delta'(S_2, w) \in F'$. Since $T_1$ is subsequential, there exists only one state of $T_1^R$ admitting a
path labeled with w to $i_1$; that state is q. Thus, $q \in S_2$. Therefore any state q member
of a pair of $S_1$ is also member of a pair of $S_2$. By symmetry the reverse is also true.
Thus exactly the same states are members of the pairs of $S_1$ and $S_2$. There exists $k \geq 0$
such that:
$$S_1 = \{(q_0, c_{10}), (q_1, c_{11}), \ldots, (q_k, c_{1k})\}$$
$$S_2 = \{(q_0, c_{20}), (q_1, c_{21}), \ldots, (q_k, c_{2k})\} \tag{29}$$
We prove that weights are also the same in $S_1$ and $S_2$. Let $\Pi_j$, $(0 \leq j \leq k)$, be the
set of strings labeling the paths from $i_1$ to $q_j$ in $T_1$. $\sigma_1(i_1, w)$ is the weight output
corresponding to a string $w \in \Pi_j$. Consider the accumulated weights $c_{ij}$, $1 \leq i \leq 2$,
$0 \leq j \leq k$, in determinization of $T_1^R$. Each $c_{1j}$ for instance corresponds to the weight not
yet output in the paths reaching $S_1$. It needs to be added to the weights of any path
from $q_j \in S_1$ to a final state in rev($T_1$). In other terms, the determinization algorithm
will assign the weight $c_{1j} + \sigma_1(i_1, w) + \lambda_1$ to a path labeled with $w^R$ reaching a final
state of T' from $S_1$. T'' is obtained by pushing from T'. Therefore the weight of such

References

1   Alfred V. Aho , John E. Hopcroft, The Design and Analysis of Computer Algorithms, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1974 

2   Alfred V. Aho , Ravi Sethi , Jeffrey D. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1986 

3   Ahuja, Ravindra K., Kurt Mehlhorn, James B. Orlin, and Robert Tarjan. 1988. Faster algorithms for the shortest path problem. Technical Report 193, MIT Operations Research Center. 

4   Bauer, W. 1988. On minimizing finite automata. EATCS bULLETIN, 35. 

5   Berstel, Jean. 1979. Transductions and Context-Free Languages. Teubner Studienbucher, Stuttgart. 

6   Jean Berstel, Jr. , Christophe Reutenauer, Rational series and their languages, Springer-Verlag New York, Inc., New York, NY, 1988 

7   Brzozowski, J. A. 1962. Canonical regular expressions and minimal state graphs for definite events. Methematical Theory of Automata, 12:529--561. 

8   Carlyle, J. W. and A. Paz. 1971. Realizations by stochastic finite automaton. Journal of Computer and System Sciences, 5:26--40. 

9   Choffrut, Christian. 1978. Contributions  l'tude de quelques familles remarquables de functions rationnelles. Ph.D. thesis, (thse de doctorat d'Etat), Universit Paris 7, LITP, Paris. 

10   Thomas T. Cormen , Charles E. Leiserson , Ronald L. Rivest, Introduction to algorithms, MIT Press, Cambridge, MA, 1990 

11   Courcelle, Bruno, Damian Niwinski, and Andreas Podelski. 1991. A geometrical view of the determinization and minimization of finite-state automata. Mathematical Systems Theory, 24:117--146.

12   Maxime Crochemore, Transducers and repetitions, Theoretical Computer Science, v.45 n.1, p.63-86, Sept. 1986

13   Samuel Eilenberg, Automata, Languages, and Machines, Academic Press, Inc., Orlando, FL, 1976 

14   Elgot, C. C. and J. E. Mezei. 1965. On relations defined by generalized finite automata. IBM Journal of Research and Development, 9. 

15   Ginsburg, S. and G. F. Rose. 1966. A characterization of machine mappings. Canadian Journal of Mathematics, 18. 

16   Gross, Maurice. 1989. The use of finite automata in the lexical representation of natural language. Lecture Notes in Computer Science, 377. 

17   John E. Hopcroft, Jeffrey D. Ullman, Introduction To Automata Theory, Languages, And Computation, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1990

18   Ronald M. Kaplan, Martin Kay, Regular models of phonological rule systems, Computational Linguistics, v.20 n.3, September 1994

19   Fred Karlsson, Atro Voutilainen, Juha Heikkilä, Arto Anttila, Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text, Mouton de Gruyter, 1995

20   Lauri Karttunen, Ronald M. Kaplan, Annie Zaenen, Two-level morphology with composition, Proceedings of the 14th conference on Computational linguistics, August 23-28, 1992, Nantes, France

21   Krob, Daniel. 1994. The equality problem for rational series with multiplicities in the tropical semiring is undecidable. Journal of Algebra and Computation, 4.

22   Werner Kuich, Arto Salomaa, Semirings, automata, languages, Springer-Verlag, London, 1985

23   Mehryar Mohri, Compact representations by finite-state transducers, Proceedings of the 32nd annual meeting on Association for Computational Linguistics, p.204-209, June 27-30, 1994, Las Cruces, New Mexico 

24   Mehryar Mohri, Minimization of Sequential Transducers, Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, p.151-163, June 05-08, 1994 

25   Mohri, Mehryar. 1994c. On some applications of finite-state automata theory to natural language processing: Representation of morphological dictionaries, compaction, and indexation. Technical Report IGM 94--22, Institut Gaspard Monge, Noisy-le-Grand. 

26   Mohri, Mehryar. 1994b. Syntactic analysis by local grammars automata: An efficient algorithm. In Proceedings of the International Conference on Computational Lexicography (COMPLEX94). Linguistic Institute, Hungarian Academy of Science, Budapest, Hungary. 

27   Mohri, Mehryar. 1995. Matching patterns of an automaton. Lecture Notes in Computer Science, 937. 

28   Mohri, Mehryar. 1996a. On The Use of Sequential Transducers in Natural Language Processing. In Yves Schabes, editor, Finite State Devices in Natural Language Processing. MIT Press, Cambridge, MA. To appear.

29   Mehryar Mohri, On some applications of finite-state automata theory to natural language processing, Natural Language Engineering, v.2 n.1, p.61-80, March 1996 

30   Mohri, Mehryar, Fernando C. N. Pereira, and Michael Riley. 1996. Weighted automata in text and speech processing. In ECAI-96 Workshop, Budapest, Hungary. ECAI.

31   Mehryar Mohri, Richard Sproat, An efficient compiler for weighted rewrite rules, Proceedings of the 34th annual meeting on Association for Computational Linguistics, p.231-238, June 24-27, 1996, Santa Cruz, California

32   Nerode, Anil. 1958. Linear automaton transformations. In Proceedings of AMS, volume 9. 

33   Pereira, Fernando C. N. and Michael Riley. 1996. Weighted Rational Transductions and their Application to Human Language Processing. In Yves Schabes, editor, Finite State Devices in Natural Language Processing. MIT Press, Cambridge, MA. To appear.

34   Dominique Perrin, Finite automata, Handbook of theoretical computer science (vol. B): formal models and semantics, MIT Press, Cambridge, MA, 1991 

35   Christophe Reutenauer, Marcel Paul Schützenberger, Varieties and rational functions, Theoretical Computer Science, v.145 n.1-2, p.229-240, July 10, 1995

36   Dominique Revuz, Minimisation of acyclic deterministic automata in linear time, Theoretical Computer Science, v.92 n.1, p.181-189, Jan. 6, 1992 

37   Roche, Emmanuel. 1993. Analyse syntaxique transformationnelle du français par transducteurs et lexique-grammaire. Ph.D. thesis, Université Paris 7, Paris.

38   Arto Salomaa, M. Soittola, Automata: Theoretic Aspects of Formal Power Series, Springer-Verlag New York, Inc., Secaucus, NJ, 1978

39   Schützenberger, Marcel Paul. 1961. On the definition of a family of automata. Information and Control, 4.

40   Schützenberger, Marcel Paul. 1977. Sur une variante des fonctions séquentielles. Theoretical Computer Science.

41   Schützenberger, Marcel Paul. 1987. Polynomial decomposition of rational functions. In Lecture Notes in Computer Science, volume 386. Springer-Verlag, Berlin, Heidelberg, and New York.

42   Silberztein, Max. 1993. Dictionnaires électroniques et analyse automatique de textes: le système INTEX. Masson, Paris.

43   Simon, Imre. 1987. The nondeterministic complexity of finite automata. Technical Report RT-MAP-8073, Instituto de Matemática e Estatística da Universidade de São Paulo.

44   Sproat, Richard. 1995. A finite-state architecture for tokenization and grapheme-to-phoneme conversion in multilingual text analysis. In Proceedings of the ACL SIGDAT Workshop, Dublin, Ireland. ACL. 

45   Mikkel Thorup, On RAM priority queues, Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms, p.59-67, January 28-30, 1996, Atlanta, Georgia, United States 

46   Urbanek, F. 1989. On minimizing finite automata. EATCS Bulletin, 39. 

47   Watson, Bruce W. 1993. A taxonomy of finite automata minimization algorithms. Technical Report 93/44, Eindhoven University of Technology, The Netherlands. 

48   Weber, Andreas and Reinhard Klemm. 1995. Economy of description for single-valued transducers. Information and Computation, 119.

49   W. A. Woods, Transition network grammars for natural language analysis, Communications of the ACM, v.13 n.10, p.591-606, Oct. 1970 
