Look-Back and Look-Ahead in the Conversion of 
Hidden Markov Models into Finite State Transducers 
André Kempe
Xerox Research Centre Europe - Grenoble Laboratory
6, chemin de Maupertuis - 38240 Meylan - France
andre.kempe@xrce.xerox.com
http://www.xrce.xerox.com/research/mltt
Abstract 
This paper describes the conversion of a Hid- 
den Markov Model into a finite state trans- 
ducer that closely approximates the behavior 
of the stochastic model. In some cases the 
transducer is equivalent to the HMM. This 
conversion is especially advantageous for part- 
of-speech tagging because the resulting trans- 
ducer can be composed with other transducers 
that encode correction rules for the most fre- 
quent tagging errors. The speed of tagging is 
also improved. The described methods have 
been implemented and successfully tested. 
1 Introduction 
This paper presents an algorithm 1 which approxi- 
mates a Hidden Markov Model (HMM) by a finite- 
state transducer (FST). We describe one applica- 
tion, namely part-of-speech tagging. Other poten- 
tial applications may be found in areas where both 
HMMs and finite-state technology are applied, such 
as speech recognition, etc. The algorithm has been 
fully implemented. 
An HMM used for tagging encodes, like a trans- 
ducer, a relation between two languages. One lan- 
guage contains sequences of ambiguity classes ob- 
tained by looking up in a lexicon all words of a sen- 
tence. The other language contains sequences of tags 
obtained by statistically disambiguating the class se- 
quences. From the outside, an HMM tagger behaves 
like a sequential transducer that deterministically 
maps every class sequence to a tag sequence, e.g.: 
[DET,PRO] [ADJ,NOUN] [ADJ,NOUN] ... [END]    (1)
 DET       ADJ        NOUN      ...  END
1 There are other (different) algorithms for HMM to FST conversion: an unpublished one by Julian M. Kupiec and John T. Maxwell (p.c.), and n-type and s-type approximation by Kempe (1997).
The main advantage of transforming an HMM is 
that the resulting transducer can be handled by fi- 
nite state calculus. Among others, it can be com- 
posed with transducers that encode: 
• correction rules for the most frequent tagging 
errors which are automatically generated (Brill, 
1992; Roche and Schabes, 1995) or manually 
written (Chanod and Tapanainen, 1995), in or- 
der to significantly improve tagging accuracy 2. These rules may include long-distance dependencies not handled by HMM taggers, and can
conveniently be expressed by the replace oper- 
ator (Kaplan and Kay, 1994; Karttunen, 1995; 
Kempe and Karttunen, 1996). 
• further steps of text analysis, e.g. light parsing 
or extraction of noun phrases or other phrases 
(Ait-Mokhtar and Chanod, 1997). 
These compositions enable complex text analysis to 
be performed by a single transducer. 
The speed of tagging by an FST is up to six times 
higher than with the original HMM. 
The motivation for deriving the FST from an 
HMM is that the HMM can be trained and con-
verted with little manual effort. 
An HMM transducer builds on the data (probabil- 
ity matrices) of the underlying HMM. The accuracy 
of this data has an impact on the tagging accuracy 
of both the HMM itself and the derived transducer. 
The training of the HMM can be done on either a 
tagged or untagged corpus, and is not a topic of this 
paper since it is exhaustively described in the liter- 
ature (Bahl and Mercer, 1976; Church, 1988). 
An HMM can be identically represented by a 
weighted FST in a straightforward way. We are, 
however, interested in non-weighted transducers. 
2 Automatically derived rules require less work than
manually written ones but are unlikely to yield better 
results because they would consider relatively limited 
context and simple relations only. 
Kempe 29 Look-Back and Look-Ahead in the Conversion of HMMs

André Kempe (1998) Look-Back and Look-Ahead in the Conversion of Hidden Markov Models into Finite State Transducers. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 29-37.
2 b-Type Approximation 
This section presents a method that approximates 
a (first order) Hidden Markov Model (HMM) by a 
finite-state transducer (FST), called b-type approximation 3. Regular expression operators used in this section are explained in the annex.
Looking up in a lexicon the word sequence of a sentence produces a unique sequence of ambiguity classes. Tagging the sentence by means of a (first order) HMM consists of finding the most probable tag sequence T given this class sequence C (eq. 1,
fig. 1). The joint probability of the sequences C and 
T can be estimated by: 
p(C,T) = p(c1 ... cn, t1 ... tn) =
π(t1) b(c1|t1) · ∏_{i=2}^{n} a(ti|ti-1) b(ci|ti)    (2)
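To make eq. 2 concrete, here is a small sketch of the joint probability computation in Python; the probability tables and their values are illustrative assumptions, not the trained parameters of the paper's HMM.

```python
# Joint probability of a class sequence C and tag sequence T under a
# first-order HMM (eq. 2). All tables and numbers below are toy values.

def joint_prob(classes, tags, pi, a, b):
    """p(C,T) = pi(t1) b(c1|t1) * prod_i a(ti|ti-1) b(ci|ti)."""
    p = pi[tags[0]] * b[(classes[0], tags[0])]
    for i in range(1, len(tags)):
        p *= a[(tags[i], tags[i - 1])] * b[(classes[i], tags[i])]
    return p

# Hypothetical model: two tags, two ambiguity classes.
pi = {"DET": 0.6, "NOUN": 0.4}
a = {("NOUN", "DET"): 0.7, ("DET", "DET"): 0.3,
     ("NOUN", "NOUN"): 0.5, ("DET", "NOUN"): 0.5}
b = {("[DET,PRO]", "DET"): 0.8, ("[DET,PRO]", "NOUN"): 0.0,
     ("[ADJ,NOUN]", "NOUN"): 0.4, ("[ADJ,NOUN]", "DET"): 0.0}

p = joint_prob(["[DET,PRO]", "[ADJ,NOUN]"], ["DET", "NOUN"], pi, a, b)
```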
2.1 Basic Idea 
The determination of a tag of a particular word can- 
not be made separately from the other tags. Tags 
can influence each other over a long distance via 
transition probabilities. 
In this approach, an ambiguity class is disam- 
biguated with respect to a context. A context con- 
sists of a sequence of ambiguity classes limited at 
both ends by some selected tag 4. For the left con- 
text of length β we use the term look-back, and for the right context of length α we use the term look-ahead.
[Figure 1: Disambiguation of classes between two selected tags]
In figure 1, the tag t_i can be selected from the class c_i because it is between two selected tags, which are t_{i-2} at a look-back distance of β = 2 and t_{i+2} at a look-ahead distance of α = 2. Actually, the two selected tags t_{i-2} and t_{i+2} allow not only the disambiguation of the class c_i but of all classes in between, i.e. c_{i-1}, c_i and c_{i+1}.

We approximate the tagging of a whole sentence by tagging subsequences with selected tags at both ends (fig. 1), and then overlapping them. The most probable paths in the tag space of a sentence, i.e. valid paths according to this approach, can be found as sketched in figure 2.

[Figure 2: Two valid paths through the tag space of a sentence]

[Figure 3: Incompatible sequences in the tag space of a sentence]

A valid path consists of an ordered set of overlapping sequences in which each member overlaps with its neighbour except for the first or last tag. There can be more than one valid path in the tag space of a sentence (fig. 2). Sets of sequences that do not overlap in such a way are incompatible according to this model, and do not constitute valid paths (fig. 3).

3 Name given by the author, to distinguish the algorithm from n-type and s-type approximation (Kempe, 1997).

4 The algorithm is explained for a first order HMM. In the case of a second order HMM, b-type sequences must begin and end with two selected tags rather than one.
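The overlap condition on neighbouring subsequences of a valid path can be sketched in a few lines (a toy check; function and data names are our own, not the paper's):

```python
# Each subsequence of a valid path must equal its right neighbour
# shifted by one position, i.e. agree everywhere except its own first
# tag and the neighbour's last tag.

def overlaps(left, right):
    return left[1:] == right[:-1]

def is_valid_path(subsequences):
    return all(overlaps(l, r) for l, r in zip(subsequences, subsequences[1:]))

# Two length-3 subsequences that overlap on DET NOUN form a valid path;
# replacing DET by ADJ in the second one breaks the overlap.
ok = is_valid_path([("#", "DET", "NOUN"), ("DET", "NOUN", "VERB")])
bad = is_valid_path([("#", "DET", "NOUN"), ("ADJ", "NOUN", "VERB")])
```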
2.2 b-Type Sequences
Given a length β of look-back and a length α of look-ahead, we generate for every class c0, every look-back sequence t-β c-β+1 ... c-1, and every look-ahead sequence c1 ... cα-1 tα, a b-type sequence 5:

t-β c-β+1 ... c-1 c0 c1 ... cα-1 tα    (3)
For example: 
CONJ [DET,PRON] [ADJ,NOUN,VERB] [NOUN,VERB] VERB    (4)
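The generation of original b-type sequences can be sketched as a plain enumeration; the tiny tag and class inventories below are stand-ins for the real tag set, and sentence boundaries and shorter sequences are ignored for brevity:

```python
# Enumerate every original b-type sequence (eq. 3): one selected tag,
# beta-1 look-back classes, the middle class c0, alpha-1 look-ahead
# classes, and one selected tag.
from itertools import product

def b_type_sequences(tags, classes, beta, alpha):
    look_back = product(tags, *([classes] * (beta - 1)))
    look_ahead = product(*([classes] * (alpha - 1)), tags)
    return [lb + (c0,) + la
            for lb, c0, la in product(look_back, classes, look_ahead)]

tags = ["DET", "NOUN"]
classes = ["[DET,PRO]", "[ADJ,NOUN]"]
# beta = 1, alpha = 1: sequences of the form  t_-1 c_0 t_1
seqs = b_type_sequences(tags, classes, beta=1, alpha=1)
```

With 2 tags and 2 classes this yields 2 × 2 × 2 = 8 sequences.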
Each such original b-type sequence (eq. 3,4; fig. 4) 
is disambiguated based on a first order HMM. Here 
we use the Viterbi algorithm (Viterbi, 1967; Ra- 
biner, 1990) for efficiency. 
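A minimal Viterbi decoder of the kind that could be used for this disambiguation step is sketched below; the model tables are invented toy values, not the paper's trained HMM:

```python
# Viterbi decoding for a first-order HMM. members maps an ambiguity
# class to the tags it contains; pi, a, b are dictionaries as in eq. 2.

def viterbi(classes, members, pi, a, b):
    # delta[t] = (probability of the best path ending in tag t, that path)
    delta = {t: (pi[t] * b[(classes[0], t)], [t]) for t in members[classes[0]]}
    for c in classes[1:]:
        delta = {
            t: max(((p * a[(t, prev)] * b[(c, t)], path + [t])
                    for prev, (p, path) in delta.items()),
                   key=lambda x: x[0])
            for t in members[c]
        }
    return max(delta.values(), key=lambda x: x[0])[1]

members = {"[DET,PRO]": ["DET", "PRO"], "[ADJ,NOUN]": ["ADJ", "NOUN"]}
pi = {"DET": 0.7, "PRO": 0.3}
a = {("ADJ", "DET"): 0.4, ("NOUN", "DET"): 0.5,
     ("ADJ", "PRO"): 0.1, ("NOUN", "PRO"): 0.4}
b = {("[DET,PRO]", "DET"): 0.6, ("[DET,PRO]", "PRO"): 0.4,
     ("[ADJ,NOUN]", "ADJ"): 0.3, ("[ADJ,NOUN]", "NOUN"): 0.5}

best = viterbi(["[DET,PRO]", "[ADJ,NOUN]"], members, pi, a, b)
```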
[Figure 4: b-Type sequence, with transition probabilities a and class probabilities b over the look-back positions -β ... -1, position 0, and look-ahead positions 1 ... α]
For an original b-type sequence, the joint probability of its class sequence C with its tag sequence T (fig. 4) can be estimated by:

p(C,T) = p(c-β+1 ... cα-1, t-β ... tα) =
[ ∏_{i=-β+1}^{α-1} a(ti|ti-1) b(ci|ti) ] · a(tα|tα-1)    (5)
At every position in the look-back sequence and in the look-ahead sequence, a boundary # may occur, i.e. a sentence beginning or end. No look-back (β = 0) or no look-ahead (α = 0) is also allowed. The above probability estimation (eq. 5) can then be expressed more generally (fig. 4) as:

p(C,T) = Pstart · Pmiddle · Pend    (6)

with Pstart being

Pstart = a(t-β+1|t-β)    for selected tag t-β    (7)
Pstart = π(t-β+1)        for boundary #          (8)
Pstart = 1               for β = 0               (9)

with Pmiddle being

Pmiddle = b(c-β+1|t-β+1) · ∏_{i=-β+2}^{α-1} a(ti|ti-1) b(ci|ti)    for α+β > 0    (10)
Pmiddle = b(c0|t0)    for α+β = 0    (11)

and with Pend being

Pend = a(tα|tα-1)    for selected tag tα        (12)
Pend = 1             for boundary # or α = 0    (13)
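The case distinctions of eqs. 7-9 and 12-13 can be sketched as follows; pi and a are assumed probability tables as in eq. 2, "#" stands for a sentence boundary, and all names are our own:

```python
# The start and end factors of eq. 6, with their boundary and
# zero-length cases. A sketch, not the paper's implementation.

def p_start(left, t_first, pi, a, beta):
    if beta == 0:
        return 1.0               # eq. 9: no look-back at all
    if left == "#":
        return pi[t_first]       # eq. 8: sentence beginning
    return a[(t_first, left)]    # eq. 7: selected look-back tag

def p_end(right, t_last, a, alpha):
    if alpha == 0 or right == "#":
        return 1.0               # eq. 13: no look-ahead, or sentence end
    return a[(right, t_last)]    # eq. 12: selected look-ahead tag

pi = {"DET": 0.6}
a = {("DET", "NOUN"): 0.3, ("VERB", "NOUN"): 0.2}

s_boundary = p_start("#", "DET", pi, a, beta=2)   # eq. 8 case
s_tag = p_start("NOUN", "DET", pi, a, beta=2)     # eq. 7 case
e_boundary = p_end("#", "NOUN", a, alpha=1)       # eq. 13 case
e_tag = p_end("VERB", "NOUN", a, alpha=1)         # eq. 12 case
```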
When the most likely tag sequence is found for an original b-type sequence, the class c0 in the middle position (eq. 3) is associated with its most likely tag t0. We formulate constraints for the other tags t-β and tα and classes c-β+1 ... c-1 and c1 ... cα-1 of the original b-type sequence. Thus we obtain a tagged b-type sequence:

t-β^Bβ c-β+1^B(β-1) ... c-1^B1 c0:t0 c1^A1 ... cα-1^A(α-1) tα^Aα    (14)

stating that t0 is the most probable tag in the class c0 if it is preceded by t-β^Bβ c-β+1^B(β-1) ... c-2^B2 c-1^B1 and followed by c1^A1 c2^A2 ... cα-1^A(α-1) tα^Aα.
In expression 14 the subscripts -β, -β+1, ..., 0, ..., α-1, α denote the position of the tag or class in the b-type sequence, and the superscripts Bβ, B(β-1), ..., B1 and A1, ..., A(α-1), Aα express constraints for preceding and following tags and classes which are part of other b-type sequences. In the example:
CONJ^B2 [DET,PRON]^B1
[ADJ,NOUN,VERB]:ADJ
[NOUN,VERB]^A1 VERB^A2    (15)
ADJ is the most likely tag in the class [ADJ,NOUN,VERB] if it is preceded by the tag CONJ two positions back (B2), by the class [DET,PRON] one position back (B1), and followed by the class [NOUN,VERB] one position ahead (A1) and by the tag VERB two positions ahead (A2).
Boundaries are denoted by a particular symbol, #, and can occur at the edge of the look-back and look-ahead sequence:

t^Bβ c^B(β-1) ... c^B2 c^B1 c:t c^A1 ... c^A(α-1) #^Aα    (16)
t^Bβ c^B(β-1) ... c^B2 c^B1 c:t c^A1 ... #^A(α-1)         (17)
#^Bβ c^B(β-1) ... c^B2 c^B1 c:t #^A1                      (18)
#^B1 c:t #^A1                                             (19)
#^B2 c^B1 c:t c^A1 ... c^A(α-1) t^Aα                      (20)
For example:

#^B2 [DET,PRON]^B1
[ADJ,NOUN,VERB]:ADJ    (21)
5 Regular expression operators used in this article are explained in the annex.
CONJ^B2 [DET,PRON]^B1
[ADJ,NOUN,VERB]:NOUN
#^A1    (22)
Note that look-back of length β and look-ahead of length α also include all sequences shorter than β or α, respectively, that are limited by #.
For a given length β of look-back and a length α of look-ahead, we generate every possible original b-type sequence (eq. 3), disambiguate it statistically (eq. 5-13), and encode the tagged b-type sequence Bi (eq. 14) as an FST. All sequences Bi are then unioned

°B = ⋃_i Bi    (23)

and we generate a preliminary tagger model B'':

B'' = [°B]*    (24)

where all sequences Bi can occur in any order and number (including zero times) because no constraints have yet been applied.
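The Kleene closure of eq. 24 admits any number of tagged sequences in any order. A toy membership test (plain Python standing in for finite-state calculus, with invented placeholder sequences) illustrates this:

```python
# A symbol string belongs to the closure of a set of pieces iff it can
# be segmented into those pieces; the empty string (zero occurrences)
# is always accepted.

def in_closure(seq, pieces):
    ok = [False] * (len(seq) + 1)
    ok[0] = True  # zero occurrences are allowed
    for end in range(1, len(seq) + 1):
        ok[end] = any(ok[end - len(p)] and tuple(seq[end - len(p):end]) == p
                      for p in pieces if len(p) <= end)
    return ok[-1]

# Two hypothetical tagged pieces; the first string is their concatenation.
pieces = {("t1", "c0:t0", "t2"), ("t2", "c1:t1", "t3")}
yes = in_closure(["t1", "c0:t0", "t2", "t2", "c1:t1", "t3"], pieces)
no = in_closure(["t1", "c0:t0"], pieces)
```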
2.3 Concatenation Constraints 
To ensure a correct concatenation of sequences Bi, 
we have to make sure that every Bi is preceded and 
followed by other Bi according to what is encoded 
in the look-back and look-ahead constraints. E.g. the sequence in example (21) must be preceded by a sentence beginning, #, and the class [DET,PRON], and followed by the class [NOUN,VERB] and the tag VERB.
We create constraints for preceding and following tags, classes and sentence boundaries. For the look-back, a particular tag ti or class cj is required for a particular distance of δ ≤ -1, by:

R^δ(ti) = ~[ ~[ ?* ti [\°t]* [°t [\°t]*]^(-δ-1) ] ti^B(-δ) ?* ]    (25)

R^δ(cj) = ~[ ~[ ?* cj [\°c]* [°c [\°c]*]^(-δ-1) ] cj^B(-δ) ?* ]    (26)

for δ ≤ -1

with °t and °c being the union of all tags and all classes respectively.

A sentence beginning, #, is required for a particular look-back distance of δ ≤ -1, on the side of the tags, by:

R^δ(#) = ~[ ~[ [\°t]* [°t [\°t]*]^(-δ-1) ] #^B(-δ) ?* ]    (27)

for δ ≤ -1
In the case of look-ahead we require for a particular distance of δ ≥ 1, a particular tag ti or class cj or a sentence end, #, on the side of the tags, in a similar way by:

R^δ(ti) = ~[ ?* ti^Aδ ~[ [\°t]* [°t [\°t]*]^(δ-1) ti ?* ] ]    (28)

R^δ(cj) = ~[ ?* cj^Aδ ~[ [\°c]* [°c [\°c]*]^(δ-1) cj ?* ] ]    (29)

R^δ(#) = ~[ ?* #^Aδ ~[ [\°t]* [°t [\°t]*]^(δ-1) ] ]    (30)

for δ ≥ 1
All tags ti are required for the look-back only at the distance of δ = -β, and for the look-ahead only at the distance of δ = α. All classes cj are required for distances of δ ∈ [-β+1, -1] and δ ∈ [1, α-1]. Sentence boundaries, #, are required for distances of δ ∈ [-β, -1] and δ ∈ [1, α].
We create the intersection Rt of all tag constraints, the intersection Rc of all class constraints, and the intersection R# of all sentence boundary constraints:

Rt = ⋂_{i ∈ [1,n]; δ ∈ {-β, α}} R^δ(ti)    (31)

Rc = ⋂_{j ∈ [1,m]; δ ∈ [-β+1,-1] ∪ [1,α-1]} R^δ(cj)    (32)

R# = ⋂_{δ ∈ [-β,-1] ∪ [1,α]} R^δ(#)    (33)
All constraints are enforced by composition with the preliminary tagger model B'' (eq. 24). The class constraint Rc is composed on the upper side of B'', which is the side of the classes (eq. 14), and both the tag constraint Rt and the boundary constraint 6 R# are composed on the lower side of B'', which is the side of the tags:

B''' = Rc .o. B'' .o. Rt .o. R#    (34)
Having ensured correct concatenation, we delete 
all symbols r that have served to constrain tags, 
classes or boundaries, using Dr: 
6 The boundary constraint R# could alternatively be
computed for and composed on the side of the classes. 
The transducer which encodes R# would then, however, 
be bigger because the number of classes is bigger than 
the number of tags. 
Dr = r -> []    (36)
By composing 7 B''' (eq. 34) on the lower side with Dr and on the upper side with the inverted relation Dr.i, we obtain the final tagger model B:

B = Dr.i .o. B''' .o. Dr    (37)
We call the model a b-type model, the correspond- 
ing FST a b-type transducer, and the whole algo- 
rithm leading from the HMM to the transducer, a 
b-type approximation of an HMM. 
2.4 Properties of b-Type Transducers 
There are two groups of b-type transducers with different properties: FSTs without look-back and/or without look-ahead (β·α = 0) and FSTs with both look-back and look-ahead (β·α > 0). Both accept any sequence of ambiguity classes.

b-Type FSTs with β·α = 0 are always sequential. They map a class sequence that corresponds to the word sequence of a sentence always to exactly one tag sequence. Their tagging accuracy and similarity with the underlying HMM increases with growing β+α. A b-type FST with β = 0 and α = 0 is equivalent to an n0-type FST, and with β = 1 and α = 0 it is equivalent to an n1-type FST (Kempe, 1997).

b-Type FSTs with β·α > 0 are in general not sequential. For a class sequence they deliver a set of different tag sequences, which means that the tagging results are ambiguous. This set is never empty, and the most probable tag sequence according to the underlying HMM is always in this set. The longer the look-back distance β and the look-ahead distance α are, the larger the FST and the smaller the set of resulting tag sequences. For sufficiently large β+α, this set may always contain only one tag sequence. In this case the FST is equivalent to the underlying HMM. For reasons of size, however, this FST may not be computable for particular HMMs (sec. 4).
3 An Implemented Finite-State Tagger 
The implemented tagger requires three transducers 
which represent a lexicon, a guesser and an approx- 
imation of an HMM mentioned above. 
Both the lexicon and guesser are sequential, i.e. 
deterministic on the input side. They both unam- 
biguously map a surface form of any word that they 
accept to the corresponding ambiguity class (fig. 5, 
col. 1 and 2): First of all, the word is looked for in the 
lexicon. If this fails, it is looked for in the guesser. If this equally fails, it gets the label [UNKNOWN] which denotes the ambiguity class of unknown words. Tag probabilities in this class are approximated by tags of words that appear only once in the training corpus.

7 For efficiency reasons, we actually do not delete the constraint symbols by composition. Rather, we traverse the network and overwrite every constraint symbol with the empty string symbol ε. In the following determinization of the network, all ε are eliminated.
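The lookup cascade described above can be sketched with plain dictionaries standing in for the lexicon and guesser transducers (all names and entries are illustrative; the real guesser is a transducer, modeled here as a suffix map):

```python
# Map a surface word to its ambiguity class: lexicon first, then the
# guesser, then the class of unknown words.

def ambiguity_class(word, lexicon, guesser):
    if word in lexicon:
        return lexicon[word]             # known word
    for suffix, cls in guesser.items():  # guesser, sketched as suffix rules
        if word.endswith(suffix):
            return cls
    return "[UNKNOWN]"                   # ambiguity class of unknown words

lexicon = {"the": "[AT]", "share": "[NN,VB]"}
guesser = {"ed": "[VBD,VBN]"}

c1 = ambiguity_class("the", lexicon, guesser)
c2 = ambiguity_class("tripled", lexicon, guesser)
c3 = ambiguity_class("xyzzy", lexicon, guesser)
```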
As soon as an input token gets labeled with the 
tag class of sentence end symbols (fig. 5: [SENT]),
the tagger stops reading words from the input. At 
this point, the tagger has read and stored the words 
of a whole sentence (fig. 5, col. 1) and generated the 
corresponding sequence of classes (fig. 5, col. 2). 
The class sequence is now mapped to a tag se- 
quence (fig. 5, col. 3) using the HMM transducer. A 
b-type FST is not sequential in general (sec. 2.4), so to obtain a unique tagging result, the finite-state tagger can be run in a special mode where only the first result found is retained, and the tagger does not look for other results 8. Since paths through an FST have no particular order, the result retained is random.
The tagger outputs the stored word and tag se- 
quence of the sentence, and continues in the same 
way with the remaining sentences of the corpus. 
The       [AT]           AT
share     [NN,VB]        NN
of        [IN]           IN
...
tripled   [VBD,VBN]      VBD
within    [IN,RB]        IN
that      [CS,DT,WPS]    DT
span      [NN,VB,VBD]    NN
of        [IN]           IN
time      [NN,VB]        NN
          [SENT]         SENT

Figure 5: Tagging a sentence
The tagger can be run in a statistical mode, where
the number of tag sequences found per sentence is 
counted. These numbers give an overview of the 
degree of non-sequentiality of the concerned b-type 
transducer (sec. 2.4). 
8 This mode of retaining the first result only is not
necessary with n-type and s-type transducers which are 
both sequential (Kempe, 1997). 
Transducer              Accuracy     Tagging speed         Transducer size       Creation
or HMM                  test corp.   in words/sec                                time
                        in %         ultra2    sparc20     #states  #arcs        ultra2

HMM                      97.35        4 834      1 624
s+n1-FST (1M,F1)         97.33
s+n1-FST (1M,F8)         96.12       22 001      9 969         329      42 560   4 min
b-FST (β=0,α=0), =n0     87.21       26 585     11 000           1         181   6 sec
b-FST (β=1,α=0), =n1     95.16       26 585     11 600          37       6 697   11 sec
b-FST (β=2,α=0)          95.32       21 268      7 089       3 663     663 003   4 h 11
b-FST (β=0,α=1)          93.69       19 939        877         252      40 243   12 sec
b-FST (β=0,α=2)          93.92       19 334        114      10 554   1 246 686   10 min
b-FST (β=2,α=1)         *97.34       15 191      6 510      54 578  18 402 055   2 h 17
b-FST (β=3,α=1)          FST was not computable

Language: English
Corpora: 19 944 words for HMM training, 19 934 words for test
Tag set: 36 tags, 181 classes
* Multiple, i.e. ambiguous tagging results: only first result retained
Types of FST (finite-state transducers):
  n0, n1        n-type transducers (Kempe, 1997)
  s+n1 (1M,F8)  s-type transducer (Kempe, 1997), with subsequences of frequency > 8,
                from a training corpus of 1 000 000 words, completed with n1-type
  b (β=2,α=1)   b-type transducer (sec. 2), with look-back of 2 and look-ahead of 1
Computers:
  ultra2   1 CPU, 512 MBytes physical RAM, 1.4 GBytes virtual RAM
  sparc20  1 CPU, 192 MBytes physical RAM, 827 MBytes virtual RAM

Table 1: Accuracy, speed, size and creation time of some HMM transducers
4 Experiments and Results 
This section compares different FSTs with each 
other and with the original HMM.
As expected, the FSTs perform tagging faster 
than the HMM. 
Since all FSTs are approximations of HMMs, they 
show lower tagging accuracy than the HMMs. In the case of FSTs with β > 1 and α = 1, this difference in
accuracy is negligible. Improvement in accuracy can 
be expected since these FSTs can be composed with 
FSTs encoding correction rules for frequent errors 
(sec. 1). 
For all tests below an English corpus, lexicon and 
guesser were used, which were originally annotated 
with 74 different tags. We automatically recoded the 
tags in order to reduce their number, i.e. in some 
cases more than one of the original tags were recoded 
into one and the same new tag. We applied different 
recodings, thus obtaining English corpora, lexicons 
and guessers with reduced tag sets of 45, 36, 27, 18 
and 9 tags respectively. 
FSTs with β = 2 and α = 1 and with β = 1 and α = 2 were equivalent in all cases where they could
be computed. 
Table 1 compares different FSTs for a tag set of 
36 tags. 
The b-type FST with no look-back and no look- 
ahead which is equivalent to an n0-type FST 
(Kempe, 1997), shows the lowest tagging accuracy 
(b-FST (β=0, α=0): 87.21 %). It is also the smallest transducer (1 state and 181 arcs, as many as
tag classes) and can be created faster than the other 
FSTs (6 sec.). 
The highest accuracy is obtained with a b-type FST with β = 2 and α = 1 (b-FST (β=2,α=1): 97.34 %) and with an s-type FST (Kempe, 1997) trained on 1 000 000 words (s+n1-FST (1M,F1): 97.33 %). In these two cases the difference in accuracy with respect to the underlying HMM (97.35 %) is negligible. In this particular test, the s-type FST comes out ahead because it is considerably smaller than the b-type FST.
The size of a b-type FST increases with the size of the tag set and with the length of look-back plus look-ahead, β+α. Accuracy improves with growing β+α.

b-Type FSTs may produce ambiguous tagging results (sec. 2.4). In such instances only the first result was retained (sec. 3).
Transducer             74 tags    45 tags    36 tags    27 tags    18 tags    9 tags
or HMM                 297 cls.   214 cls.   181 cls.   119 cls.   97 cls.    67 cls.

HMM                     96.78      96.92      97.35      97.07      96.73      95.76

s+n1-FST (1M,F1)        96.76      96.88      97.33      97.06      96.72      95.74
                        99.89      99.93      99.90      99.95      99.95      99.94
s+n1-FST (1M,F8)        95.09      95.25      96.12      96.36      96.05      95.29
                        97.00      97.35      98.15      98.90      98.99      98.96
b-FST (β=0,α=0), =n0    83.53      83.71      87.21      94.47      94.24      93.86
                        84.00      84.40      88.04      96.03      96.22      95.76
b-FST (β=1,α=0), =n1    94.19      94.09      95.16      95.60      95.17      94.14
                        95.61      95.92      96.90      97.75      97.66      96.74
b-FST (β=2,α=0)                    94.28      95.32      95.71      95.31      94.22
                                   96.09      97.01      97.84      97.77      96.83
b-FST (β=0,α=1)         92.79      92.47      93.69      95.26      95.19      94.64
                        93.64      93.41      94.67      96.87      97.06      97.09
b-FST (β=0,α=2)         93.46      92.77      93.92      95.37      95.30      94.80
                        94.35      93.70      94.90      96.99      97.20      97.29
b-FST (β=1,α=1)        *94.94     *95.14     *95.78     *96.78     *96.59     *95.36
                       *97.86     *97.93     *98.11     *99.58     *99.72     *99.26
b-FST (β=2,α=1)                              *97.34     *97.06     *96.73     *95.73
                                             *99.97     *99.98    *100.00     *99.97
b-FST (β=3,α=1)                                                               95.76
                                                                             100.00

For each transducer, the upper line gives the tagging accuracy and the lower line the agreement of the FST tagging results with those of the HMM (both in %).
Language: English
Corpora: 19 944 words for HMM training, 19 934 words for test
Types of FST (finite-state transducers): cf. table 1
* Multiple, i.e. ambiguous tagging results: only first result retained.
Empty cells: transducer could not be computed, for reasons of size.

Table 2: Tagging accuracy and agreement of the FST tagging results with those of the underlying HMM, for tag sets of different sizes
Table 2 shows the tagging accuracy and the agree- 
ment of the tagging results with the results of the 
underlying HMM for different FSTs and tag sets of 
different sizes. 
To get results that are almost equivalent to those of an HMM, a b-type FST needs at least a look-back of β = 2 and a look-ahead of α = 1, or vice versa. For reasons of size, this kind of FST could only be computed for tag sets with 36 tags or less. A b-type FST with β = 3 and α = 1 could only be computed for the tag set with 9 tags. This FST gave exactly the same tagging results as the underlying HMM.
Table 3 illustrates which of the b-type FSTs are 
sequential, i.e. always produce exactly one tagging 
result, and which of the FSTs are non-sequential. 
For all tag sets, the FSTs with no look-back (β = 0) and/or no look-ahead (α = 0) behaved sequentially. Here 100 % of the tagged sentences had only one result. Most of the other FSTs (β·α > 0) behaved non-sequentially. For example, in the case of 27 tags with β = 1 and α = 1, 90.08 % of the tagged sentences had one result, 9.46 % had two results, 0.23 % had three results, etc.

Non-sequentiality decreases with growing look-back and look-ahead, β+α, and should completely disappear with sufficiently large β+α. Such b-type FSTs can, however, only be computed for small tag sets. We could compute this kind of FST only for the case of 9 tags with β = 3 and α = 1.
The set of alternative tag sequences for a sentence, 
produced by a b-type FST with β·α > 0, always
contains the tag sequence that corresponds with the 
result of the underlying HMM. 
Sentences with n tagging results (in %):

Transducer           n=1      n=2      n=3      n=4      5-8      9-16

74 tags, 297 classes (original tag set)
b-FST (β·α=0)        100
b-FST (β=1,α=1)      75.14    20.18    0.34     3.42     0.80     0.11
b-FST (β=2,α=1)      FST was not computable

45 tags, 214 classes (reduced tag set)
b-FST (β·α=0)        100
b-FST (β=1,α=1)      75.71    19.73    0.68     3.19     0.68
b-FST (β=2,α=1)      FST was not computable

36 tags, 181 classes (reduced tag set)
b-FST (β·α=0)        100
b-FST (β=1,α=1)      78.56    17.90    0.34     2.85     0.34
b-FST (β=2,α=1)      99.77    0.23

27 tags, 119 classes (reduced tag set)
b-FST (β·α=0)        100
b-FST (β=1,α=1)      90.08    9.46     0.23     0.11     0.11
b-FST (β=2,α=1)      99.77    0.23

18 tags, 97 classes (reduced tag set)
b-FST (β·α=0)        100
b-FST (β=1,α=1)      93.04    6.84     0.11
b-FST (β=2,α=1)      99.89    0.11

9 tags, 67 classes (reduced tag set)
b-FST (β·α=0)        100
b-FST (β=1,α=1)      86.66    12.43    0.91
b-FST (β=2,α=1)      99.77    0.23
b-FST (β=3,α=1)      100

Language: English
Test corpus: 19 934 words, 877 sentences
Types of FST (finite-state transducers): cf. table 1

Table 3: Percentage of sentences with a particular number of tagging results
5 Conclusion and Future Research 
The algorithm presented in this paper describes the 
construction of a finite-state transducer (FST) that 
approximates the behaviour of a Hidden Markov 
Model (HMM) in part-of-speech tagging. 
The algorithm, called b-type approximation, uses 
look-back and look-ahead of freely selectable length. 
The size of the FSTs grows with both the size of 
the tag set and the length of the look-back plus look- 
ahead. Therefore, to keep the FST at a computable 
size, an increase in the length of the look-back or 
look-ahead, requires a reduction of the number of 
tags. In the case of small tag sets (e.g. 36 tags), the 
look-back and look-ahead can be sufficiently large 
to obtain an FST that is almost equivalent to the 
original HMM. 
In some tests s-type FSTs (Kempe, 1997) and 
b-type FSTs reached equal tagging accuracy. In 
these cases s-type FSTs are smaller because they 
encode the most frequent ambiguity class sequences 
of a training corpus very accurately and all other sequences less accurately. b-Type FSTs encode all
sequences with the same accuracy. Therefore, a 
b-type FST can reach equivalence with the original 
HMM, but an s-type FST cannot. 
The algorithms of both conversion and tagging are 
fully implemented. 
The main advantage of transforming an HMM is 
that the resulting FST can be handled by finite state 
calculus 9 and thus be directly composed with other
FSTs. 
The tagging speed of the FSTs is up to six times 
higher than the speed of the original HMM. 
Future research will include the composition of 
HMM transducers with, among others: 
• FSTs that encode correction rules for the most 
frequent tagging errors in order to significantly 
improve tagging accuracy (above the accuracy 
of the underlying HMM). These rules can ei- 
ther be extracted automatically from a corpus 
(Brill, 1992) or written manually (Chanod and 
Tapanainen, 1995).
• FSTs for light parsing, phrase extraction and
other text analysis (Ait-Mokhtar and Chanod, 
1997). 
An HMM transducer can be composed with one 
or more of these FSTs in order to perform complex 
text analysis by a single FST. 
ANNEX: Regular Expression Operators 
Below, a and b designate symbols, A and B designate languages, and R and Q designate relations between two languages. More details on the following operators and pointers to finite-state literature can be found in
http://www.xrce.xerox.com/research/mltt/fst

~A       Complement (negation). Set of all strings except those from the language A.
\a       Term complement. Any symbol other than a.
A*       Kleene star. Language A zero or more times concatenated with itself.
A^n      A n times. Language A n times concatenated with itself.
a -> b   Replace. Relation where every a on the upper side gets mapped to a b on the lower side.
9 A large library of finite-state functions is available at Xerox.
a:b      Symbol pair with a on the upper and b on the lower side.
R.i      Inverse relation where both sides are exchanged with respect to R.
A B      Concatenation of all strings of A with all strings of B.
R .o. Q  Composition of the relations R and Q.
0 or []  Empty string (epsilon).
?        Any symbol in the known alphabet and its extensions.
Acknowledgements 
I am grateful to all colleagues who helped me, par- 
ticularly to Lauri Karttunen (XRCE Grenoble) for 
extensive discussion, and to Julian Kupiec (Xerox 
PARC) for sending me information on his own re- 
lated work. Many thanks to Irene Maxwell for cor- 
recting various versions of the paper. 

References 
Ait-Mokhtar, Salah and Chanod, Jean-Pierre 
(1997). Incremental Finite-State Parsing. In the Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP). ACL, pp. 72-79. Washington, DC, USA.
Bahl, Lalit R. and Mercer, Robert L. (1976). Part 
of Speech Assignment by a Statistical Decision Al- 
gorithm. In IEEE international Symposium on 
Information Theory. pp. 88-89. Ronneby. 
Brill, Eric (1992). A Simple Rule-Based Part-of- 
Speech Tagger. In the Proceedings of the 3rd con- 
ference on Applied Natural Language Processing, 
pp. 152-155. Trento, Italy. 
Chanod, Jean-Pierre and Tapanainen, Pasi (1995). 
Tagging French - Comparing a Statistical and a 
Constraint Based Method. In the Proceedings of 
the 7th conference of the EACL, pp. 149-156. 
ACL. Dublin, Ireland. cmp-lg/9503003
Church, Kenneth W. (1988). A Stochastic Parts 
Program and Noun Phrase Parser for Unrestricted 
Text. In Proceedings of the 2nd Conference on 
Applied Natural Language Processing. ACL, pp. 
136-143. 
Kaplan, Ronald M. and Kay, Martin (1994). Regu- 
lar Models of Phonological Rule Systems. In Com- 
putational Linguistics. 20:3, pp. 331-378. 
Karttunen, Lauri (1995). The Replace Operator. In 
the Proceedings of the 33rd Annual Meeting of the 
Association for Computational Linguistics. Cam- 
bridge, MA, USA. cmp-lg/9504032 
Kempe, André and Karttunen, Lauri (1996). Par-
allel Replacement in Finite State Calculus. In 
the Proceedings of the 16th International Confer- 
ence on Computational Linguistics, pp. 622-627. 
Copenhagen, Denmark. cmp-lg/9607007
Kempe, André (1997). Finite State Transducers Ap-
proximating Hidden Markov Models. In the Pro- 
ceedings of the 35th Annual Meeting of the Associ- 
ation for Computational Linguistics, pp. 460-467. 
Madrid, Spain. cmp-lg/9707006
Rabiner, Lawrence R. (1990). A Tutorial on Hid- 
den Markov Models and Selected Applications in 
Speech Recognition. In Readings in Speech Recog- 
nition (eds. A. Waibel, K.F. Lee). Morgan Kauf- 
mann Publishers, Inc. San Mateo, CA., USA. 
Roche, Emmanuel and Schabes, Yves (1995). Deter- 
ministic Part-of-Speech Tagging with Finite-State 
Transducers. In Computational Linguistics. Vol. 
21, No. 2, pp. 227-253. 
Viterbi, A.J. (1967). Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. In Proceedings of IEEE, vol. 61, pp. 268-278.
