Weighted Rational Transductions and their Application to Human 
Language Processing 
Fernando Pereira Michael Riley Richard Sproat 
AT&T Bell Laboratories 
600 Mountain Ave. 
Murray Hill, NJ 07974 
ABSTRACT 
We present the concepts of weighted language, transduction and automaton from algebraic automata theory as a general framework for
describing and implementing decoding cascades in speech and lan- 
guage processing. This generality allows us to represent uniformly 
such information sources as pronunciation dictionaries, language 
models and lattices, and to use uniform algorithms for building de-
coding stages and for optimizing and combining them. In particular, 
a single automata join algorithm can be used either to combine in- 
formation sources such as a pronunciation dictionary and a context- 
dependency model during the construction of a decoder, or dynam- 
ically during the operation of the decoder. Applications to speech 
recognition and to Chinese text segmentation will be discussed. 
1. Introduction 
As is well known, many problems in human language process- 
ing can be usefully analyzed in terms of the "noisy channel" 
metaphor: given an observation sequence o, find which in- 
tended message w is most likely to generate that observation 
sequence by maximizing 
P(w, o) = P(o | w) P(w),

where P(o | w) characterizes the transduction between intended messages and observations, and P(w) characterizes
the message generator. More generally, the transduction be- 
tween messages and observations may involve several inter- 
mediate stages 
P(s_0, s_k) = P(s_k | s_0) P(s_0),    P(s_k | s_0) = Σ_{s_1, ..., s_{k-1}} P(s_k | s_{k-1}) ··· P(s_1 | s_0)    (1)

where P(s_k | s_0) is the probability of transducing s_0 to s_k
through the intermediate stages, assuming that each step in the 
cascade is conditionally independent from the previous ones. 
Each s_i is a sequence of units in an appropriate representation.
For instance, in speech recognition some of the intermediate 
stages might correspond to sequences of units like phones or 
syllables. A straightforward but useful observation is that any such cascade can be factored at any intermediate stage
P(s_i | s_j) = Σ_{s_k} P(s_i | s_k) P(s_k | s_j)    (2)
For computational reasons, sums and products in (1) are often replaced by minimizations and sums of negative log probabilities, yielding the approximation

P̄(s_0, s_k) = P̄(s_k | s_0) + P̄(s_0),    P̄(s_k | s_0) = min_{s_1, ..., s_{k-1}} Σ_{1≤j≤k} P̄(s_j | s_{j-1})    (3)

where X̄ = −log X. In this formulation, assuming the approximation is reasonable, the most likely message s_0 is the one minimizing P̄(s_0, s_k).
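To make the reformulation concrete, here is a small illustrative sketch (ours, not from the paper; the toy distributions P0, P1, P2 and their names are invented) comparing exact sum-product decoding of a two-stage cascade with the min-sum approximation (3) over negative log probabilities:

import math

# Toy two-stage cascade: message s0 -> intermediate s1 -> observation s2.
# P0[s0] = P(s0), P1[s1][s0] = P(s1 | s0), P2[s2][s1] = P(s2 | s1).
P0 = {"a": 0.6, "b": 0.4}
P1 = {"x": {"a": 0.7, "b": 0.2}, "y": {"a": 0.3, "b": 0.8}}
P2 = {"o": {"x": 0.9, "y": 0.1}}

def best_message_sum_product(obs):
    # Exact decoding: maximize P(s0) * sum_{s1} P(obs | s1) P(s1 | s0).
    return max(P0, key=lambda s0: P0[s0] * sum(P2[obs][s1] * P1[s1][s0] for s1 in P1))

def best_message_min_sum(obs):
    # Approximation (3): minimize -log P(s0) + min_{s1} (-log P(obs | s1) - log P(s1 | s0)).
    def cost(s0):
        return -math.log(P0[s0]) + min(-math.log(P2[obs][s1]) - math.log(P1[s1][s0]) for s1 in P1)
    return min(P0, key=cost)

print(best_message_sum_product("o"), best_message_min_sum("o"))   # prints: a a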
Finally, each transduction in such a cascade is often modeled 
by some finite-state device, for example a hidden Markov 
model. 
Although the above approach is widely used in speech and 
language processing, usually the elements of the transduction 
cascade are built by "ad hoc" means, and commonalities be- 
tween them are not exploited. We will here outline how the 
theory of weighted rational languages and transductions can 
be used as a general framework for transduction cascades. 
This theoretical foundation provides a rich set of operators 
for combining cascade elements that generalizes the standard 
operations on regular languages, suggests novel ways of com- 
bining models of different parts of the decoding process, and 
supports uniform algorithms for transduction and search at all 
levels in the cascade. In particular, we developed a generic 
join algorithm for combining any two consecutive levels of a 
cascade, a generic best-path search algorithm, and a generic 
interleaving of join and search for building pruned joins. In 
addition, general finite-state minimization techniques are also 
applicable to all levels of a cascade. 
Weighted languages and transductions are generalizations of 
the standard notions of language and transduction in formal 
language theory [1, 2]. A weighted language is just a mapping
from strings over an alphabet to weights. A weighted trans- 
duction is a mapping from pairs of strings over two alphabets 
to weights. For example, when weights represent proba- 
bilities and assuming appropriate normalization, a weighted 
language is just a probability distribution over strings, and 
a weighted transduction a joint probability distribution over string pairs. The weighted rational languages and transducers
are those that can be represented by weighted finite-state ac- 
ceptors (WFSAs) and weighted finite-state transducers (WF- 
STs), as described in more detail in the next section. In this 
paper we will be concerned with the weighted rational case, 
although some of the theory can be profitably extended beyond 
the finite-state case [3, 4].
The notion of weighted rational transduction arises from the 
combination of two ideas in automata theory: rational trans- 
ductions, used in many aspects of formal language theory [2],
and weighted languages and automata, developed in pattern 
recognition [5, 6] and algebraic automata theory [7, 8, 9].
Ordinary (unweighted) rational transductions have been successfully applied by researchers at Xerox PARC [10] and at the University of Paris 7 [11], among others, to several problems in language processing, including morphological analysis, dictionary compression and syntactic analysis. Hidden
Markov Models and probabilistic finite-state language mod- 
els can be shown to be equivalent to WFSAs. In algebraic 
automata theory, rational series and rational transductions [8]
are the algebraic counterparts of WFSAs and WFSTs and 
give the correct generalizations to the weighted case of the 
standard algebraic operations on formal languages and trans- 
ductions, such as union, concatenation, intersection, restric- 
tion and composition. We believe the work presented here 
is among the first to apply these generalizations to human- 
language processing. 
Our first application is to speech recognition decoding. We 
show that a conventional HMM decoder can be naturally 
viewed as equivalent to a cascade of weighted transductions, 
and that our approach requires no modification whatsoever 
when context dependencies cross higher-level unit boundaries 
(for instance, cross-word context-dependent models). 
Our second application is to the segmentation of Chinese text 
into words, and the assignment of pronunciations to those 
words. In Chinese orthography, most characters represent 
(monosyllabic) 'morphemes', and as in English, 'words' may 
consist of one or more morphemes. Given that Chinese does 
not use whitespace to delimit words, it is necessary to recon- 
struct the grouping of characters into words. This reconstruc- 
tion can also be thought of as a transduction problem. 
2. Theory 
In the transduction cascade (1), each step corresponds to a 
mapping from input-output pairs (r, s) to probabilities P(s | r). More formally, steps in the cascade will be weighted transductions T : Σ* × Γ* → K, where Σ* and Γ* are the sets of strings over the alphabets Σ and Γ, and K is an appropriate set of weights, for instance the real numbers between 0 and 1 in the case of probabilities. We will denote by T⁻¹ the inverse of T, defined by T⁻¹(t, s) = T(s, t).
The right-most step of (1) is not a transduction, but rather an 
information source, in that case the language model. We will 
represent such sources as weighted languages L : Σ* → K.
Given two transductions S : Σ* × Γ* → K and T : Γ* × Δ* → K, we can define their composition S o T by

(S o T)(r, t) = Σ_{s ∈ Γ*} S(r, s) T(s, t)    (4)
For example, if S represents P(s_k | s_j) and T represents P(s_j | s_i) in (2), it is clear that S o T represents P(s_k | s_i).

Figure 1: Recognition Cascade
A weighted transduction S : Σ* × Γ* → K can be applied to a weighted language L : Σ* → K to yield a weighted language over Γ. It is convenient to abuse notation somewhat and use L o S for the result of the application, defined as

(L o S)(t) = Σ_{s ∈ Σ*} L(s) S(s, t)    (5)
Furthermore, if M is a weighted language over Γ, we can reverse apply S to M, written S o M = M o S⁻¹. For example, if S represents P(s_k | s_0) and M represents P(s_0) in (1), then S o M represents P(s_0, s_k).
Finally, given two weighted languages M, N : Σ* → K, we define their intersection, also by convenient abuse of notation written M o N, as

(M o N)(s) = M(s) N(s)    (6)
In any cascade R1 o ... o Rm, with the Ri for 1 < i < m 
appropriate transductions and R1 and Rm transductions or 
languages, it is easy to see that the order of association of 
the o operators does not matter. For example, if we have 
L o S o T o M, we could either apply S to L, apply T to 
the result and intersect the result with M, or compose S with 
T, reverse apply the result to M and intersect the result with 
L. We are thus justified in our use of the same symbol for 
composition, application and intersection, and we will in the 
rest of the paper use the term "(generalized) composition" for 
all of these operations. 
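To make the shared structure of (4), (5) and (6) concrete, the following sketch (ours, not the paper's implementation) represents weighted languages and transductions with finite support as Python dictionaries and implements composition, application and intersection with the same sum-and-product pattern; the final assertion checks the associativity just described on a toy cascade:

from collections import defaultdict

# A weighted language is a dict {string: weight}; a weighted transduction is a
# dict {(input_string, output_string): weight}.  Support is assumed finite.

def compose(S, T):
    # Composition (4): (S o T)(r, t) = sum_s S(r, s) T(s, t).
    result = defaultdict(float)
    for (r, s), w1 in S.items():
        for (s2, t), w2 in T.items():
            if s == s2:
                result[(r, t)] += w1 * w2
    return dict(result)

def apply_lang(L, S):
    # Application (5): (L o S)(t) = sum_s L(s) S(s, t).
    result = defaultdict(float)
    for s, w1 in L.items():
        for (s2, t), w2 in S.items():
            if s == s2:
                result[t] += w1 * w2
    return dict(result)

def intersect(M, N):
    # Intersection (6): (M o N)(s) = M(s) N(s).
    return {s: M[s] * N[s] for s in M if s in N}

L = {"ab": 1.0}
S = {("ab", "xy"): 0.5, ("ab", "xz"): 0.5}
T = {("xy", "u"): 0.75, ("xz", "u"): 0.25}
assert apply_lang(apply_lang(L, S), T) == apply_lang(L, compose(S, T))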
For a more concrete example, consider the transduction cas- 
cade for speech recognition depicted in Figure 1, where A is 
the transduction from acoustic observation sequences to phone 
sequences, D the transduction from phone sequences to word 
sequences (essentially a pronunciation dictionary) and M a 
weighted language representing the language model. Given a 
particular sequence of observations o, we can represent it as 
the trivial weighted language O that assigns 1 to o and 0 to 
any other sequence. Then O o A represents the acoustic likeli- 
hoods of possible phone sequences that generate o, O o A o D 
the acoustic-lexical likelihoods of possible word sequences
yielding o, and O o A o D o M the combined acoustic-lexical- 
linguistic probabilities of word sequences generating o. The 
word string w with the highest weight (O o A o D o M)(w)
is precisely the most likely sentence hypothesis generating o. 
Exactly the same construction could have been carried out with weights combined by min and sum instead of sum and product in the definitions of application and intersection, and in that case the string w with the lowest weight (O o A o D o M)(w) would be the best hypothesis. More generally, the sum and product operations in (4), (5) and (6) can be replaced by any two operations forming an appropriate semiring [7, 8, 9], of which numeric addition and multiplication, and numeric minimum and addition, are two examples.¹
                 Language                                         Transduction
singleton        {u}(v) = 1 iff u = v                             {(u, v)}(w, z) = 1 iff u = w and v = z
scaling          (kL)(u) = k L(u)                                 (kT)(u, v) = k T(u, v)
sum              (L + M)(u) = L(u) + M(u)                         (S + T)(u, v) = S(u, v) + T(u, v)
concatenation    (LM)(w) = Σ_{uv=w} L(u) M(v)                     (ST)(t, w) = Σ_{rs=t, uv=w} S(r, u) T(s, v)
power            L^0(ε) = 1, L^0(u ≠ ε) = 0, L^{n+1} = L L^n      T^0(ε, ε) = 1, T^0((u, v) ≠ (ε, ε)) = 0, T^{n+1} = T T^n
closure          L* = Σ_{k≥0} L^k                                 T* = Σ_{k≥0} T^k

Table 1: Rational Operations
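The semiring abstraction can be made explicit; in the sketch below (again ours, not the paper's software) the composition of the previous sketch is parameterized by a plus/times pair, so that the probability semiring and the (min, +) semiring over negative log weights can be used interchangeably:

import math

# A semiring here is just (plus, times, zero, one).
PROB     = (lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
TROPICAL = (min, lambda a, b: a + b, math.inf, 0.0)   # min-sum over -log weights

def compose_in(S, T, semiring):
    # Composition (4) with the sum and product replaced by the semiring operations.
    plus, times, zero, _ = semiring
    result = {}
    for (r, s), w1 in S.items():
        for (s2, t), w2 in T.items():
            if s == s2:
                result[(r, t)] = plus(result.get((r, t), zero), times(w1, w2))
    return result

S = {("ab", "xy"): 0.5, ("ab", "xz"): 0.25}
T = {("xy", "u"): 0.5, ("xz", "u"): 0.5}
print(compose_in(S, T, PROB))
print(compose_in({k: -math.log(v) for k, v in S.items()},
                 {k: -math.log(v) for k, v in T.items()}, TROPICAL))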
Generalized composition is thus the main operation involved 
in the construction and use of transduction cascades. As we 
will see in a moment, for rational languages and transductions, 
all instances of generalized composition are implemented by 
a uniform algorithm, the join of two weighted finite automata. 
In addition to those operations, weighted languages and trans- 
ductions can be constructed from simpler ones by the opera- 
tions shown in Table 1, which generalize in a straightforward 
way the regular operations well-known from traditional au- 
tomata theory [1]. In fact, the rational languages and trans-
ductions are exactly those that can be built from singletons by 
applications of scaling, sum, concatenation and closure. 
For example, assume that for each word w in a lexicon we are given a rational transduction D_w such that D_w(p, w) is the probability that w is realized as the phone sequence p. Note that this crucially allows for multiple pronunciations of w. Then the rational transduction (Σ_w D_w)* gives the probabilities for realizations of word sequences as phone sequences (ignoring possible cross-word dependencies, which will be discussed in the next section).
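As a toy illustration of this construction using the rational operations of Table 1 (a sketch with hypothetical one-word transducers and probabilities; the true closure is an infinite sum, which is truncated here):

from collections import defaultdict

def t_sum(S, T):
    # Sum of transductions (Table 1): (S + T)(u, v) = S(u, v) + T(u, v).
    out = defaultdict(float)
    for d in (S, T):
        for k, w in d.items():
            out[k] += w
    return dict(out)

def t_concat(S, T):
    # Concatenation: (ST)(t, w) = sum_{rs=t, uv=w} S(r, u) T(s, v).
    out = defaultdict(float)
    for (r, u), w1 in S.items():
        for (s, v), w2 in T.items():
            out[(r + s, u + v)] += w1 * w2
    return dict(out)

def t_closure(T, max_power):
    # Truncated closure T* = sum_k T^k, cut off at k = max_power.
    acc = power = {("", ""): 1.0}
    for _ in range(max_power):
        power = t_concat(power, T)
        acc = t_sum(acc, power)
    return acc

# Hypothetical one-word transducers D_w(phone string, word) = probability.
D_the = {("dh ax ", "the "): 0.7, ("dh iy ", "the "): 0.3}
D_cat = {("k ae t ", "cat "): 1.0}
D = t_closure(t_sum(D_the, D_cat), max_power=2)
print(D[("dh ax k ae t ", "the cat ")])   # 0.7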
Kleene's theorem states that regular languages are exactly those representable by finite-state acceptors [1]. Its generalization to the weighted case and to transducers states that weighted rational languages and transducers are exactly those that can be represented by finite automata [8]. Furthermore,
all the operations on languages and transductions we have 
discussed have finite-automata counterparts, which we have 
implemented. Any cascade representable in terms of those 
operations can thus be implemented directly as an appropri- 
ate combination of the programs implementing each of the 
operations. 
¹Additional conditions to guarantee the existence of certain infinite sums may be necessary for certain semirings; for details see [7] and [8].
In the present setting, a K-weighted finite automaton A consists of a finite set of states Q_A and a finite set Δ_A of transitions p -(x/k)-> q between states, where x is an element of the set of transition labels Λ_A and k ∈ K is the transition weight. An associative concatenation operation u · v must be defined between transition labels, with identity element ε. As usual, each automaton has an initial state i_A and a final state assignment, which we represent as a column vector of weights F_A indexed by states.² A K-weighted finite automaton with Λ_A = Σ* is just a weighted finite-state acceptor (WFSA). On the other hand, if Λ_A = Σ* × Γ* with concatenation defined by (r, s) · (u, v) = (ru, sv), we have a weighted finite-state transducer (WFST).
As usual, we can define a path in an automaton A as a sequence of connected transitions p = (q_0, x_1, k_1, q_1), ..., (q_{m-1}, x_m, k_m, q_m). Such a path has label L_A(p) = x_1 · ... · x_m, weight W_A(p) = k_1 ··· k_m and final weight FW_A(p) = W_A(p) F_A(q_m). We call p reduced if it is the empty path or if x_1 ≠ ε, and we write p -(u/k)-> q if k is the sum of the weights of all reduced paths with label u from p to q.
The language of the automaton A is defined as

[[A]](u) = Σ_{p ∈ I_A(u)} FW_A(p)

where I_A(u) is the set of paths in A with label u that start in the initial state i_A. Obviously, if A is an acceptor, [[A]] is a weighted language, and if A is a transducer, [[A]] is a weighted transduction. The appropriate generalization of Kleene's theorem to weighted acceptors and transducers states that under mild conditions on the weights (which for instance are satisfied by the (min, +) semiring), weighted rational languages and transductions are exactly those defined by weighted automata as outlined here [8].
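To spell this out, here is a small sketch (ours; labels are restricted to single symbols and ε-transitions are omitted, so the reduced-path machinery is not needed) in which an acceptor is an initial state, a transition map and final weights, and [[A]](u) is a sum over accepting paths labeled u:

def automaton_weight(A, u):
    init, trans, finals = A       # trans: state -> list of (symbol, weight, next_state)
    def paths(state, rest):
        if not rest:
            yield finals.get(state, 0.0)     # final weight F_A(q), 0 if not final
            return
        for (sym, w, nxt) in trans.get(state, []):
            if sym == rest[0]:
                for tail in paths(nxt, rest[1:]):
                    yield w * tail
    return sum(paths(init, list(u)))

# Toy acceptor over {a, b}: two paths accept "ab", with weights 0.5*0.5 and 0.25*1.0.
A = (0, {0: [("a", 0.5, 1), ("a", 0.25, 2)],
         1: [("b", 0.5, 3)],
         2: [("b", 1.0, 3)]},
     {3: 1.0})
print(automaton_weight(A, "ab"))   # 0.5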
Weighted acceptors and transducers are thus faithful imple- 
mentations of rational languages and transductions, and all 
²The usual notion of final state can be encoded this way by setting F_A(q) = 1 if q is final, F_A(q) = 0 otherwise.
Figure 2: Models as Automata: (a) the observation-sequence acceptor O, (b) a phone model, (c) a word model, (d) the context-dependency transducer C.
the operations on these described above have corresponding 
implementations in terms of algorithms on automata. In par- 
ticular, generalized composition corresponds to the join of 
two automata. 
Given two automata A and B, a new label set J, and a partial label join function ⋈ : Λ_A × Λ_B → J, we define their join by ⋈ as a new automaton C with label set J, states Q_C = Q_A × Q_B, initial state i_C = (i_A, i_B), final weights F_C(q, q') = F_A(q) F_B(q') and transitions

(p, p') -(x/k)-> (q, q')  iff  x = y ⋈ z, p -(y/a)-> q, p' -(z/b)-> q', and k = ab    (7)

Different choices of ⋈ correspond to the instances of generalized composition: for intersection, Λ_A = Λ_B = Σ* and x = y ⋈ z iff x = y = z; for composition, Λ_A = Σ* × Γ*, Λ_B = Γ* × Δ* and (x, z) = (x, y) ⋈ (y, z); and for application, Λ_A = Σ*, Λ_B = Σ* × Γ* and y = x ⋈ (x, y). Thus join is the automata counterpart of generalized composition, and we will use the composition symbol indifferently in what follows to represent either composition or join.
The operation between automata thus defined has a direct dynamic-programming implementation in which reachable join states (q, q') are placed in a queue and extended in turn using (7). By organizing this queue according to the weights of least-weight paths from the start state, we can combine the join computation with the search for lowest-weight paths, and build subautomata of the join containing only states reachable by paths with weights within a beam of the best path.
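A minimal sketch of this dynamic-programming join (an illustration under assumed conventions, not the paper's implementation: single-symbol labels, no ε-transitions, and weights that are negative log probabilities combined by (min, +)), specialized to the composition case where the output label of A must match the input label of B:

from collections import deque

# A transducer is (initial state, transitions, finals), where transitions maps a
# state to a list of (input, output, weight, next_state) arcs and finals maps a
# state to its final weight.

def join_compose(A, B):
    initA, transA, finalA = A
    initB, transB, finalB = B
    init = (initA, initB)
    trans, finals = {}, {}
    queue, seen = deque([init]), {init}
    while queue:
        p, q = queue.popleft()
        if p in finalA and q in finalB:
            finals[(p, q)] = finalA[p] + finalB[q]   # final weights combine by +
        arcs = []
        for (x, y, w1, p2) in transA.get(p, []):
            for (y2, z, w2, q2) in transB.get(q, []):
                if y == y2:                          # label join for composition
                    arcs.append((x, z, w1 + w2, (p2, q2)))
                    if (p2, q2) not in seen:
                        seen.add((p2, q2))
                        queue.append((p2, q2))
        trans[(p, q)] = arcs
    return init, trans, finals

A = (0, {0: [("a", "X", 1.0, 1)], 1: [("b", "Y", 2.0, 2)]}, {2: 0.0})
B = (0, {0: [("X", "X", 0.5, 1)], 1: [("Y", "Y", 0.5, 2)]}, {2: 0.0})
init, trans, finals = join_compose(A, B)
print(trans[(0, 0)], finals)   # [('a', 'X', 1.5, (1, 1))] {(2, 2): 0.0}

Replacing the FIFO queue by a priority queue keyed by the least cost reaching each state pair, and dropping pairs whose cost exceeds the current best by more than a beam, yields the pruned, search-interleaved join just described.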
3. Speech Recognition 
In our first application, we elaborate on how to describe a 
speech recognizer as a transduction cascade. Recall we de- 
compose the problem into a language, O, of acoustic observa- 
tion sequences, a transduction, A, from acoustic observation 
sequences to phone sequences, a transduction, D, from phone 
sequences to word sequences and a weighted language, M, 
specifying the language model (see Figure 1). Each of these 
can be represented as a finite-state automaton (to some ap- 
proximation). 
The trivial automaton for the acoustic observation language, 
O, is defined for a given utterance as depicted in Figure 2a. 
Each state represents a fixed point in time ti, and each transi- 
tion has a label, oi, drawn from a finite alphabet that quantizes 
the acoustic waveform between adjacent time points and is as- 
signed probability 1.0. 
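In the representation used in the sketches above, O is simply a linear chain (a toy construction; the label names below are invented):

def observation_acceptor(observations):
    # One state per time point t_0 .. t_n; arc i carries the quantized label o_i
    # with probability 1.0, and only the last state is final.
    trans = {i: [(obs, 1.0, i + 1)] for i, obs in enumerate(observations)}
    return 0, trans, {len(observations): 1.0}

O = observation_acceptor(["o1", "o7", "o3"])   # hypothetical codebook labels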
The automaton for the acoustic observation sequence to phone 
sequence transduction, A, is defined in terms of phone models. 
A phone model is defined as a transducer from a subsequence 
of acoustic observation labels to a specific phone, and assigns 
to each subsequence a likelihood that the specified phone 
produced it. Thus, different paths through a phone model 
correspond to different acoustic realizations of the phone. 
Figure 2b depicts a common topology for such a phone model. 
A is then defined as the closure of the sum of the phone models.
The automaton for the phone sequence to word sequence trans- 
duction, D, is defined similarly to that for A. We define a word 
model as a transducer from a subsequence of phone labels to 
a specific word, which assigns to each subsequence a like- 
lihood that the specified word produced it. Thus, different 
paths through a word model correspond to different phonetic 
realizations of the word. Figure 2c depicts a common topol- 
ogy for such a word model. D is then defined as the closure of the sum of the word models.
Finally, the language model, M, is commonly an N-gram 
model, encodable as a WFSA. Combining these automata, O o A o D o M is thus an automaton that assigns a probability to each word sequence, and the highest-probability
path through that automaton estimates the most likely word 
sequence for the given utterance. 
The finite-state modeling for speech recognition that we have 
just described is hardly novel. In fact, it is equivalent to 
that presented in [12], in the sense that it generates the same
weighted language. However, the transduction cascade ap- 
proach presented here allows one to view the computations in 
new ways. 
For instance, because composition, o, is associative, we see that the computation of max_w (O o A o D o M)(w) can be organized in several ways. A conventional integrated-search speech recognizer computes max_w (O o (A o D o M))(w). In other words, the phone, word, and language models are, in effect, compiled together into one large transducer which is then applied to the input observation sequence [12]. On the other hand, one can use a more modular, staged computation, max_w (((O o A) o D) o M)(w). In other words, first
the acoustic observations are transduced into a phone lattice 
represented as an automaton labeled by phones (phone recognition). This lattice is in turn transduced into a word lattice (word recognition), which is then joined with the language model (language model application) [13].
The best approach may depend on the specific task, which determines the size of intermediate results and whether finite-state minimization is fruitful. By having a general
package to manipulate these automata, we have been able 
to experiment with various alternatives. For many tasks, the 
complete network, O o A o D o M, is too large to compute
explicitly, regardless of the order in which the operations are 
applied. The solution that is usually taken is to interleave the 
best path computation with the composition operations and to 
retain only a portion of the intermediate results by discarding 
unpromising paths. 
So far, our presentation has used context-independent phone 
models. In other words, the likelihoods assigned by a phone 
model in A assumed conditional independence from neigh- 
boring phones. However, it has been shown that context- 
dependent phone models, which model a phone in the context 
of its adjacent phones, are very effective for improving recog- 
nition performance [14].
We can include context-dependent models, such as triphone 
models, in our presentation by expanding our 'atomic models' 
in A to one for every phone in a distinct triphonic context. 
Each model will have the same form as in Figure 2b, but 
will have different likelihoods for the different contexts. We 
could also try to directly specify D in terms of the new units, 
but this is problematic. First, even if each word in D had 
only one phonetic realization, we could not directly substitute 
its spelling in terms of context-dependent units, since the 
cross-word units must be specified (because of the closure 
operation). In this case, a common approach is to either 
use left (right) context-independent units at the word starts 
(ends), or to build a fully context-dependent lexicon, but have 
special computations that ensure the correct models are used at
word junctures. In either case, this disallows use of phonetic 
networks as in Figure 2c. 
There is, however, a natural solution to these problems using a finite-state transduction. We leave D as defined before, but
interpose a new transduction, C, between A and D, to convert 
between context-dependent and context-independent units. In 
other words, we now compute max_w (O o A o C o D o M)(w).
The form of C for triphonic models is depicted in Figure 2d. 
For each context-dependent phone model γ, which corresponds to the (context-independent) phone π_c in the context of π_l and π_r, there is a state q_lc in C for the biphone π_l π_c, a state q_cr for the biphone π_c π_r, and a transition from q_lc to q_cr with input label γ and output label π_r. We have constructed such a transducer
and have been able to easily convert context-independent pho- 
netic networks into context-dependent networks for certain 
tasks. In those cases, we can implement full-context depen- 
dency with no special-purpose computations. 
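A direct transcription of this construction (a sketch only: the naming convention for the context-dependent models is ours, and word-boundary states and weights are omitted):

def build_context_transducer(triphones):
    # triphones: iterable of (pl, pc, pr), one per context-dependent model for
    # phone pc in the context of pl and pr.
    states, arcs = set(), []
    for pl, pc, pr in triphones:
        q_lc, q_cr = (pl, pc), (pc, pr)            # biphone states
        states.update([q_lc, q_cr])
        model = f"{pc}/{pl}_{pr}"                  # hypothetical name for the model
        arcs.append((q_lc, model, pr, q_cr))       # input: model label, output: right phone
    return states, arcs

states, arcs = build_context_transducer([("sil", "dh", "ax"), ("dh", "ax", "k")])
# arcs[0] == (('sil', 'dh'), 'dh/sil_ax', 'ax', ('dh', 'ax'))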
4. Chinese Text Segmentation 
Our second application is to text processing, namely the to- 
kenization of Chinese text into words, and the assignment 
of pronunciations to those words. In Chinese orthography, 
most characters represent (monosyllabic) morphemes, and as 
in English, words may consist of one or more morphemes. 
Given that Chinese does not use whitespace to delimit words, 
it is necessary to 'reconstruct' the grouping of characters into 
words. For example, we want to say that the sentence 日文章魚怎麼說 "How do you say 'octopus' in Japanese?" consists of four words, namely 日文 ri4-wen2 'Japanese', 章魚 zhang1-yu2 'octopus', 怎麼 zen3-mo 'how', and 說 shuo1 'say'. The problem with this sentence is that 日 ri4 is also a word (e.g. a common abbreviation for Japan), as are 文章 wen2-zhang1 'essay' and 魚 yu2 'fish', so there is not a unique segmentation.
The task of segmenting and pronouncing Chinese text is nat- 
urally thought of as a transduction problem. The Chinese dictionary³ is represented as a WFST D. The input alphabet
is the set of Chinese characters, and the output alphabet is the 
union of the set of Mandarin syllables with the set of part- 
of-speech labels. A given word is represented as a sequence 
of character-to-syllable transitions, terminated in an e-to-part- 
of-speech transition weighted by an estimate of the negative 
log probability of the word. For instance, the word 章魚 'octopus' would be represented as the sequence of transductions 章:zhang1/0.0 魚:yu2/0.0 ε:noun/13.18. A dictionary in this
form can easily be minimized using standard algorithms. 
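The following sketch shows how such entries might be laid down as transitions (the characters, pinyin and probabilities below are placeholders, not values from the Behavior dictionary):

import math

EPS = "<eps>"

def add_entry(arcs, next_state, start, final, chars, syllables, pos, prob):
    # A word is a chain of character-to-syllable arcs ending in an
    # epsilon-to-part-of-speech arc weighted by -log P(word).
    state = start
    for ch, syl in zip(chars, syllables):
        arcs.append((state, ch, syl, 0.0, next_state))
        state, next_state = next_state, next_state + 1
    arcs.append((state, EPS, pos, -math.log(prob), final))
    return next_state

arcs, free = [], 2                     # state 0 = start, state 1 = final
free = add_entry(arcs, free, 0, 1, "章魚", ["zhang1", "yu2"], "noun", 1.4e-6)
free = add_entry(arcs, free, 0, 1, "日文", ["ri4", "wen2"], "noun", 3.0e-6)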
An input sentence is represented as an unweighted acceptor S, 
with characters as transition labels. Segmentation is then ac- 
complished by finding the lowest weight string in S o D*. The 
result is a string with the words delimited by part-of-speech 
labels and marked with their pronunciation. For the example 
at hand, the best path is the correct segmentation, mapping 
the input sequence 日 文 ε 章 魚 ε 怎 麼 ε 說 ε to the sequence ri4 wen2 noun zhang1 yu2 noun zen3 mo adv shuo1 verb.
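The effect of taking the lowest-weight string in S o D* can be mimicked by a small dynamic program over a toy dictionary (the entries and costs below are invented for the example; the actual system builds and composes the automata instead):

import math

# Hypothetical mini-dictionary: word -> (pronunciation, part of speech, cost).
DICT = {
    "日":   ("ri4", "noun", 9.0),
    "日文": ("ri4 wen2", "noun", 7.0),
    "文章": ("wen2 zhang1", "noun", 6.5),
    "章魚": ("zhang1 yu2", "noun", 13.0),
    "魚":   ("yu2", "noun", 8.5),
    "怎麼": ("zen3 mo", "adv", 5.0),
    "說":   ("shuo1", "verb", 4.0),
}

def segment(sentence):
    # best[i] = (cost, analysis) of the best segmentation of sentence[:i].
    n = len(sentence)
    best = [(0.0, [])] + [(math.inf, None)] * n
    for i in range(n):
        if best[i][1] is None:
            continue
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            if word in DICT:
                pron, pos, cost = DICT[word]
                if best[i][0] + cost < best[j][0]:
                    best[j] = (best[i][0] + cost, best[i][1] + [f"{pron} {pos}"])
    return best[n]

print(segment("日文章魚怎麼說"))
# (29.0, ['ri4 wen2 noun', 'zhang1 yu2 noun', 'zen3 mo adv', 'shuo1 verb'])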
As is the case with English, no Chinese dictionary covers all 
of the words that one will encounter in Chinese text. For 
example, many words that are derived via productive mor- 
phological processes are not generally to be found in the dic- 
tionary. One such case in Chinese involves words derived via 
the nominal plural affix 們 -men. While some words in 們 will be found in the dictionary (e.g., 他們 ta1-men 'they'; 人們 ren2-men 'people'), many attested instances will not: for example, 將們 jiang4-men '(military) generals', 青蛙們 qing1-wa1-men 'frogs'. Given that the basic dictionary is
represented as a finite-state automaton, it is a simple matter 
to augment the model just described with standard techniques 
from finite-state morphology ([15, 16], inter alia). For in-
³We are currently using the 'Behavior Chinese-English Electronic Dictionary', Copyright Number 112366, from Behavior Design Corporation, R.O.C.; we also wish to thank United Informatics, Inc., R.O.C. for providing us with the Chinese text corpus that we used in estimating lexical probabilities. Finally we thank Dr. Jyun-Sheng Chang for kindly providing us with Chinese personal name corpora.
stance, we can represent the fact that 們 attaches to nouns by allowing ε-transitions from the final states of noun entries to the initial state of a sub-transducer containing 們. However, for our purposes it is not sufficient merely to represent the morphological decomposition of (say) plural nouns, since we also want to estimate the cost of the resulting words. For derived words that occur in our corpus we can estimate these costs as we would the costs for an underived dictionary entry. So, 將們 jiang4-men '(military) generals' occurs, and we estimate its cost at 15.02; we include this word by allowing an ε-transition between 將 and 們, with a cost chosen so that the entire analysis of 將們 ends up with a cost of 15.02.
For non-occurring possible plural forms (e.g., 南瓜們 nan2-gua1-men 'pumpkins') we use the Good-Turing estimate (e.g. [17]), whereby the aggregate probability of previously unseen
members of a construction is estimated as N1/N, where N is 
the total number of observed tokens and N1 is the number of 
types observed only once; again, we arrange the automaton so that noun entries may transition to 們, and the cost of the
whole (previously unseen) construction comes out with the 
value derived from the Good-Turing estimate. 
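A small worked example of this estimate, with invented counts:

import math
from collections import Counter

# Hypothetical observed tokens of the X-們 construction.
men_tokens = ["他們", "他們", "人們", "將們", "人們", "青蛙們"]
counts = Counter(men_tokens)
N  = sum(counts.values())                          # total observed tokens: 6
N1 = sum(1 for c in counts.values() if c == 1)     # types seen exactly once: 2
unseen_mass = N1 / N                               # aggregate probability of unseen plurals
unseen_cost = -math.log(unseen_mass)               # cost assigned to, e.g., 南瓜們
print(N, N1, round(unseen_cost, 2))                # 6 2 1.1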
Another large class of words that are generally not to be found 
in the dictionary are Chinese personal names: only famous 
names like 周恩來 'Zhou Enlai' can reasonably be expected
to be in a dictionary, and even many of these are missing. Full 
Chinese personal names are formally simple, being always 
of the form FAMILY+GIVEN. The FAMILY name set is re- 
stricted: there are a few hundred single-character FAMILY 
names, and about ten double-character ones. Given names 
are most commonly two characters long, occasionally one- 
character long: there are thus four possible name types. The 
difficulty is that GIVEN names can consist, in principle, of any 
character or pair of characters, so the possible GIVEN names 
are limited only by the total number of characters, though 
some characters are certainly far more likely than others. For 
a sequence of characters that is a possible name, we wish to 
assign a probability to that sequence qua name. We use a variant of an estimate proposed in [18]. Given a potential name
of the form F1 G1 G2, where F1 is a legal FAMILY name and 
G1 and G2 are Chinese characters, we estimate the probabil- 
ity of that name as the product of the probability of finding 
any name in text; the probability of F1 as a FAMILY name; 
the probability of the first character of a double GIVEN name 
being G1; the probability of the second character of a double 
GIVEN name being G2; and the probability of a name of the 
form SINGLE-FAMILY+DOUBLE-GIVEN. The first proba-
bility is estimated from a count of names in a text database, 
whereas the last four probabilities are estimated from a large 
list of personal names. This model is easily incorporated into 
the segmenter by building a transducer restricting the names 
to the four licit types, with costs on the transitions for any 
particular name summing to an estimate of the cost of that 
name. This transducer is then summed with the transducer 
implementing the dictionary and morphological rules, and the 
transitive closure of the resulting transducer computed. 
References 
1. M. A. Harrison, Introduction to Formal Language Theory. Reading, Massachusetts: Addison-Wesley, 1978.
2. J. Berstel, Transductions and Context-Free Languages. No. 38 in Leitfäden der angewandten Mathematik und Mechanik LAMM, Stuttgart, Germany: Teubner Studienbücher, 1979.
3. R. Teitelbaum, "Context-free error analysis by evaluation of algebraic power series," in Proc. Fifth Annual ACM Symposium on Theory of Computing, (Austin, Texas), pp. 196-199, 1973.
4. B. Lang, "A generative view of ill-formed input processing," in ATR Symposium on Basic Research for Telephone Interpretation, (Kyoto, Japan), Dec. 1989.
5. A. Paz, Introduction to Probabilistic Automata. Academic, 1971.
6. T. R. Booth and R. A. Thompson, "Applying probability measures to abstract languages," IEEE Trans. Computers, vol. C-22, pp. 442-450, May 1973.
7. S. Eilenberg, Automata, Languages, and Machines, vol. A. San Diego, California: Academic Press, 1974.
8. W. Kuich and A. Salomaa, Semirings, Automata, Languages. No. 5 in EATCS Monographs on Theoretical Computer Science, Berlin, Germany: Springer-Verlag, 1986.
9. J. Berstel and C. Reutenauer, Rational Series and Their Languages. No. 12 in EATCS Monographs on Theoretical Computer Science, Berlin, Germany: Springer-Verlag, 1988.
10. R. M. Kaplan and M. Kay, "Regular models of phonological rule systems," Computational Linguistics, 1994. To appear.
11. E. Roche, Analyse Syntaxique Transformationnelle du Français par Transducteurs et Lexique-Grammaire. PhD thesis, Université Paris 7, 1993.
12. L. R. Bahl, F. Jelinek, and R. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. PAMI, vol. 5, pp. 179-190, Mar. 1983.
13. A. Ljolje and M. D. Riley, "Optimal speech recognition using phone recognition and lexical access," in Proceedings of ICSLP, (Banff, Canada), pp. 313-316, Oct. 1992.
14. K.-F. Lee, "Context-dependent phonetic hidden Markov models for continuous speech recognition," IEEE Trans. ASSP, vol. 38, pp. 599-609, Apr. 1990.
15. K. Koskenniemi, Two-Level Morphology: a General Computational Model for Word-Form Recognition and Production. PhD thesis, University of Helsinki, Helsinki, 1983.
16. E. Tzoukermann and M. Liberman, "A finite-state morphological processor for Spanish," in COLING-90, Volume 3, pp. 277-286, COLING, 1990.
17. K. W. Church and W. Gale, "A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams," Computer Speech and Language, vol. 5, no. 1, pp. 19-54, 1991.
18. J.-S. Chang, S.-D. Chen, Y. Zheng, X.-Z. Liu, and S.-J. Ke, "Large-corpus-based methods for Chinese personal name recognition (in Chinese)," Journal of Chinese Information Processing, vol. 6, no. 3, pp. 7-15, 1992.
