Deterministic Part-of-Speech Tagging 
with Finite-State Transducers 
Emmanuel Roche* 
MERL 
Yves Schabes* 
MERL 
Stochastic approaches to natural language processing have often been preferred to rule-based 
approaches because of their robustness and their automatic training capabilities. This was the 
case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can 
be achieved with a rule-based tagger by inferring rules from a training corpus. However, current 
implementations of the rule-based tagger run more slowly than previous approaches. In this 
paper, we present a finite-state tagger, inspired by the rule-based tagger, that operates in optimal 
time in the sense that the time to assign tags to a sentence corresponds to the time required to 
follow a single path in a deterministic finite-state machine. This result is achieved by encoding 
the application of the rules found in the tagger as a nondeterministic finite-state transducer and 
then turning it into a deterministic transducer. The resulting deterministic transducer yields a 
part-of-speech tagger whose speed is dominated by the access time of mass storage devices. We 
then generalize the techniques to the class of transformation-based systems. 
1. Introduction 
Finite-state devices have important applications to many areas of computer science, in- 
cluding pattern matching, databases, and compiler technology. Although their linguis- 
tic adequacy to natural language processing has been questioned in the past (Chomsky, 
1964), there has recently been a dramatic renewal of interest in the application of finite- 
state devices to several aspects of natural language processing. This renewal of interest 
is due to the speed and compactness of finite-state representations. This efficiency is ex- 
plained by two properties: finite-state devices can be made deterministic, and they can 
be turned into a minimal form. Such representations have been successfully applied to 
different aspects of natural language processing, such as morphological analysis and 
generation (Karttunen, Kaplan, and Zaenen 1992; Clemenceau 1993), parsing (Roche 
1993; Tapanainen and Voutilainen 1993), phonology (Laporte 1993; Kaplan and Kay 
1994) and speech recognition (Pereira, Riley, and Sproat 1994). Although finite-state 
machines have been used for part-of-speech tagging (Tapanainen and Voutilainen 1993; 
Silberztein 1993), none of these approaches has the same flexibility as stochastic tech- 
niques. Unlike stochastic approaches to part-of-speech tagging (Church 1988; Kupiec 
1992; Cutting et al. 1992; Merialdo 1990; DeRose 1988; Weischedel et al. 1993), up to 
now the knowledge found in finite-state taggers has been handcrafted and was not 
automatically acquired. 
Recently, Brill (1992) described a rule-based tagger that performs as well as taggers 
based upon probabilistic models and overcomes the limitations common in rule-based 
approaches to language processing: it is robust and the rules are automatically ac- 
* Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139. E-mail: 
roche/schabes@merl.com.
© 1995 Association for Computational Linguistics
Computational Linguistics Volume 21, Number 2 
quired. In addition, the tagger requires drastically less space than stochastic taggers. 
However, current implementations of Brill's tagger are considerably slower than the 
ones based on probabilistic models since it may require RKn elementary steps to tag 
an input of n words with R rules requiring at most K tokens of context. 
Although the speed of current part-of-speech taggers is acceptable for interac- 
tive systems where a sentence at a time is being processed, it is not adequate for 
applications where large bodies of text need to be tagged, such as in information re- 
trieval, indexing applications, and grammar-checking systems. Furthermore, the space 
required for part-of-speech taggers is also an issue in commercial personal computer 
applications such as grammar-checking systems. In addition, part-of-speech taggers 
are often being coupled with a syntactic analysis module. Usually these two modules 
are written in different frameworks, making it very difficult to integrate interactions 
between the two modules. 
In this paper, we design a tagger that requires n steps to tag a sentence of length 
n, independently of the number of rules and the length of the context they require. 
The tagger is represented by a finite-state transducer, a framework that can also be 
the basis for syntactic analysis. This finite-state tagger will also be found useful when 
combined with other language components, since it can be naturally extended by 
composing it with finite-state transducers that could encode other aspects of natural 
language syntax. 
Relying on algorithms and formal characterizations described in later sections, we 
explain how each rule in Brill's tagger can be viewed as a nondeterministic finite-state 
transducer. We also show how the application of all rules in Brill's tagger is achieved 
by composing each of these nondeterministic transducers and why nondeterminism 
arises in this transducer. We then prove the correctness of the general algorithm for 
determinizing (whenever possible) finite-state transducers, and we successfully apply 
this algorithm to the previously obtained nondeterministic transducer. The resulting 
deterministic transducer yields a part-of-speech tagger that operates in optimal time 
in the sense that the time to assign tags to a sentence corresponds to the time required 
to follow a single path in this deterministic finite-state machine. We also show how 
the lexicon used by the tagger can be optimally encoded using a finite-state machine. 
The techniques used for the construction of the finite-state tagger are then for- 
malized and mathematically proven correct. We introduce a proof of soundness and 
completeness with a worst-case complexity analysis for the algorithm for determiniz- 
ing finite-state transducers. 
We conclude by proving that the method can be applied to the class of transformation- 
based error-driven systems. 
2. Overview of Brill's Tagger 
Brill's tagger is composed of three parts, each of which is inferred from a training
corpus: a lexical tagger, an unknown word tagger, and a contextual tagger. For purposes
of exposition, we will postpone the discussion of the unknown word tagger and focus 
mainly on the contextual rule tagger, which is the core of the tagger. 
The lexical tagger initially tags each word with its most likely tag, estimated by 
examining a large tagged corpus, without regard to context. For example, assuming 
that vbn is the most likely tag for the word "killed" and vbd for "shot," the lexical 
tagger might assign the following part-of-speech tags: 1 
1 The notation for part-of-speech tags is adapted from the one used in the Brown Corpus (Francis and 
(1) Chapman/np killed/vbn John/np Lennon/np
(2) John/np Lennon/np was/bedz shot/vbd by/by Chapman/np
(3) He/pps witnessed/vbd Lennon/np killed/vbn by/by Chapman/np
1. vbn vbd PREVTAG np
2. vbd vbn NEXTTAG by
Figure 1
Sample rules.
Since the lexical tagger does not use any contextual information, many words can 
be tagged incorrectly. For example, in (1), the word "killed" is erroneously tagged as 
a verb in past participle form, and in (2), "shot" is incorrectly tagged as a verb in past 
tense. 
Given the initial tagging obtained by the lexical tagger, the contextual tagger ap- 
plies a sequence of rules in order and attempts to remedy the errors made by the initial 
tagging. For example, the rules in Figure 1 might be found in a contextual tagger. 
The first rule says to change tag vbn to vbd if the previous tag is np. The second 
rule says to change vbd to tag vbn if the next tag is by. Once the first rule is applied, 
the tag for "killed" in (1) and (3) is changed from vbn to vbd and the following tagged 
sentences are obtained: 
(4) Chapman/np killed/vbd John/np Lennon/np
(5) John/np Lennon/np was/bedz shot/vbd by/by Chapman/np
(6) He/pps witnessed/vbd Lennon/np killed/vbd by/by Chapman/np
And once the second rule is applied, the tag for "shot" in (5) is changed from vbd 
to vbn, resulting in (8), and the tag for "killed" in (6) is changed back from vbd to vbn, 
resulting in (9): 
(7) Chapman/np killed/vbd John/np Lennon/np
(8) John/np Lennon/np was/bedz shot/vbn by/by Chapman/np
(9) He/pps witnessed/vbd Lennon/np killed/vbn by/by Chapman/np
It is relevant to our following discussion to note that the application of the
NEXTTAG rule must look ahead one token in the sentence before it can be applied, and that
the application of two rules may perform a series of operations resulting in no net 
change. As we will see in the next section, these two aspects are the source of local 
nondeterminism in Brill's tagger. 
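As an illustration, the following Python sketch applies the two rules of Figure 1 in order to the lexically tagged sentences (1)-(3). It is not taken from Brill's implementation: the helper names (`parse`, `show`, `prevtag`, `nexttag`, `contextual_tagger`) are hypothetical, and each rule is assumed to scan the sentence from left to right, updating tags in place.

```python
def parse(line):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]


def show(sent):
    """Inverse of parse: render (word, tag) pairs as 'word/tag ...'."""
    return " ".join(f"{w}/{t}" for w, t in sent)


def prevtag(a, b, c):
    """Rule 'a b PREVTAG c': change tag a to b if the previous tag is c."""
    def apply(sent):
        out = list(sent)
        for i in range(1, len(out)):
            if out[i][1] == a and out[i - 1][1] == c:
                out[i] = (out[i][0], b)
        return out
    return apply


def nexttag(a, b, c):
    """Rule 'a b NEXTTAG c': change tag a to b if the next tag is c."""
    def apply(sent):
        out = list(sent)
        for i in range(len(out) - 1):
            if out[i][1] == a and out[i + 1][1] == c:
                out[i] = (out[i][0], b)
        return out
    return apply


# The two sample rules of Figure 1, in the order in which they are applied.
RULES = [prevtag("vbn", "vbd", "np"), nexttag("vbd", "vbn", "by")]


def contextual_tagger(sent, rules=RULES):
    """Apply every rule, one full pass over the sentence per rule."""
    for rule in rules:
        sent = rule(sent)
    return sent


s3 = parse("He/pps witnessed/vbd Lennon/np killed/vbn by/by Chapman/np")
print(show(RULES[0](s3)))          # intermediate tagging, as in (6)
print(show(contextual_tagger(s3)))  # final tagging, as in (9)
```

On sentence (3) the first pass rewrites killed/vbn to killed/vbd and the second pass rewrites it back, which is the "no net change" interaction noted above.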
The sequence of contextual rules is automatically inferred from a training corpus. 
A list of tagging errors (with their counts) is compiled by comparing the output of 
the lexical tagger to the correct part-of-speech assignment. Then, for each error, it is 
determined which instantiation of a set of rule templates results in the greatest error 
reduction. Then the set of new errors caused by applying the rule is computed and 
the process is repeated until the error reduction drops below a given threshold. 
Kučera 1982): pps stands for singular nominative pronoun in third person, vbd for verb in past tense,
np for proper noun, vbn for verb in past participle form, by for the word "by," at for determiner, nn for
singular noun, and bedz for the word "was."
A B PREVTAG C           change A to B if previous tag is C
A B PREV1OR2OR3TAG C    change A to B if previous one or two or three tag is C
A B PREV1OR2TAG C       change A to B if previous one or two tag is C
A B NEXT1OR2TAG C       change A to B if next one or two tag is C
A B NEXTTAG C           change A to B if next tag is C
A B SURROUNDTAG C D     change A to B if surrounding tags are C and D
A B NEXTBIGRAM C D      change A to B if next bigram tag is C D
A B PREVBIGRAM C D      change A to B if previous bigram tag is C D
Figure 2 
Contextual rule templates. 
[Figure 3 diagram not reproduced: panels (1), (2), and (3) show the pattern C C A aligned against the input C D C C A.]
Figure 3
Partial matches of A B PREVBIGRAM C C on the input C D C C A.
Using the set of contextual rule templates shown in Figure 2, after training on 
the Brown Corpus, 280 contextual rules are obtained. The resulting rule-based tagger 
performs as well as state-of-the-art taggers based upon probabilistic models. It also 
overcomes the limitations common in rule-based approaches to language processing: 
it is robust, and the rules are automatically acquired. In addition, the tagger requires 
drastically less space than stochastic taggers. However, as we will see in the next 
section, Brill's tagger is inherently slow. 
3. Complexity of Brill's Tagger 
Once the lexical assignment is performed, in Brill's algorithm, each contextual rule 
acquired during the training phase is applied to each sentence to be tagged. For each 
individual rule, the algorithm scans the input from left to right while attempting to 
match the rule. 
This simple algorithm is computationally inefficient for two reasons. The first rea- 
son for inefficiency is the fact that an individual rule is compared at each token of the 
input, regardless of the fact that some of the current tokens may have been previously 
examined when matching the same rule at a previous position. The algorithm treats 
each rule as a template of tags and slides it along the input, one word at a time. 
Consider, for example, the rule A B PREVBIGRAM C C that changes tag A to tag B if 
the previous two tags are C. 
When applied to the input CDCCA, the pattern CCA is compared three times to 
the input, as shown in Figure 3. At each step no record of previous partial matches 
or mismatches is remembered. In this example, C is compared with the second input 
token D during the first and second steps, and therefore, the second step could have 
been skipped by remembering the comparisons from the first step. This method is 
similar to a naive pattern-matching algorithm. 
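The repeated work can be made visible with a small Python sketch of the naive sliding-template match. The function name `naive_comparisons` is hypothetical, and the sketch assumes that the template is compared left to right and abandoned at the first mismatch, a detail the description above does not fix.

```python
def naive_comparisons(pattern, text):
    """Slide `pattern` along `text` one position at a time.

    Returns a log of (step, text_index, pattern_index) triples, one per
    elementary symbol comparison. Nothing learned at one step is reused
    at the next step.
    """
    log = []
    for start in range(len(text) - len(pattern) + 1):
        for j, symbol in enumerate(pattern):
            log.append((start + 1, start + j, j))
            if text[start + j] != symbol:
                break  # mismatch: give up on this alignment
    return log


log = naive_comparisons("CCA", "CDCCA")
# Steps during which the second input token (index 1, the D) is examined.
steps_touching_d = [step for step, text_index, _ in log if text_index == 1]
print(len(log), steps_touching_d)
```

Under these assumptions the input token D is examined in both the first and the second step, which is the redundancy that a record of earlier partial matches would avoid.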
The second reason for inefficiency is the potential interaction between rules. For 
example, when the rules in Figure 1 are applied to sentence (3), the first rule results 
in a change (6) that is undone by the second rule as shown in (9). The algorithm may 
therefore perform unnecessary computation. 
In summary, Brill's algorithm for implementing the contextual tagger may require 
RKn elementary steps to tag an input of n words with R contextual rules requiring at 
most K tokens of context. 
4. Construction of the Finite-State Tagger 
We show how the function represented by each contextual rule can be represented
as a nondeterministic finite-state transducer and how the sequential application of
the contextual rules also corresponds to a nondeterministic finite-state transducer,
namely the composition of the individual transducers. We will then turn
the nondeterministic transducer into a deterministic transducer. The resulting part- 
of-speech tagger operates in linear time independent of the number of rules and the 
length of the context. The new tagger operates in optimal time in the sense that the 
time to assign tags to a sentence corresponds to the time required to follow a single 
path in the resulting deterministic finite-state machine. 
Our work relies on two central notions: the notion of a finite-state transducer and 
the notion of a subsequential transducer. Informally speaking, a finite-state transducer 
is a finite-state automaton whose transitions are labeled by pairs of symbols. The first 
symbol is the input and the second is the output. Applying a finite-state transducer to 
an input consists of following a path according to the input symbols while storing the 
output symbols, the result being the sequence of output symbols stored. Section 8.1 
formally defines the notion of transducer. 
Finite-state transducers can be composed, intersected, merged with the union op- 
eration and sometimes determinized. Basically, one can manipulate finite-state trans- 
ducers as easily as finite-state automata. However, whereas every finite-state automa- 
ton is equivalent to some deterministic finite-state automaton, there are finite-state 
transducers that are not equivalent to any deterministic finite-state transducer. Trans- 
ductions that can be computed by some deterministic finite-state transducer are called 
subsequential functions. We will see that the final step of the compilation of our tag- 
ger consists of transforming a finite-state transducer into an equivalent subsequential 
transducer. 
We will use the following notation when pictorially describing a finite-state
transducer: final states are depicted with two concentric circles; ε represents the empty
string; on a transition from state i to state j, a/b indicates a transition on input symbol
a and output symbol(s) b; 2 a question mark (?) on an input transition (for example
labeled ?/b) originating at state i stands for any input symbol that does not appear as
input symbol on any other outgoing arc from i. In this document, each depicted
finite-state transducer will be assumed to have a single initial state, namely the leftmost
state (usually labeled 0).
We are now ready to construct the tagger. Given a set of rules, the tagger is 
constructed in four steps. 
The first step consists of turning each contextual rule found in Brill's tagger into a 
finite-state transducer. Following the example discussed in Section 2, the functionality 
of the rule vbn vbd PREVTAG np is represented by the transducer shown on the left of 
Figure 4. 
2 When multiple output symbols are emitted, a comma symbolizes the concatenation of the output 
symbols. 
[Figure 4 diagram not reproduced; recoverable arc labels: np/np, vbn/vbd, ?/?.]
Figure 4
Left: Transducer T1 representing the contextual rule vbn vbd PREVTAG np. Right: Local
extension LocExt(T1) of T1.
[Figure 5 diagram not reproduced.]
Figure 5
Left: Transducer T2 representing vbd vbn NEXTTAG by. Right: Local extension LocExt(T2) of T2.
Each contextual rule is defined locally; that is, the transformation it describes must 
be applied at each position of the input sequence. For instance, the rule 
A B PREV1OR2TAG C,
which changes A into B if the previous tag or the one before is C, must be applied 
twice on C A A (resulting in the output C B B). As we have seen in the previous section, 
this method is not efficient. 
The second step consists of turning the transducers produced by the preceding step 
into transducers that operate globally on the input in one pass. This transformation 
is performed for each transducer associated with each rule. Given a function f1 that
transforms, say, a into b (i.e. f1(a) = b), we want to extend it to a function f2 such
that f2(w) = w' where w' is the word built from the word w where each occurrence
of a has been replaced by b. We say that f2 is the local extension 3 of f1, and we write
f2 = LocExt(f1). Section 8.2 formally defines this notion and gives an algorithm for
computing the local extension. 
Referring to the example of Section 2, the local extension of the transducer for the 
rule vbn vbd PREVTAG np is shown to the right of Figure 4. Similarly, the transducer for 
the contextual rule vbd vbn NEXTTAG by and its local extension are shown in Figure 5. 
The transducers obtained in the previous step still need to be applied one after 
the other. 
3 This notion was introduced by Roche (1993). 
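[Figure 6 diagram not reproduced; one recoverable arc label: vbd/vbn.]
Figure 6
Composition T3 = LocExt(T1) ∘ LocExt(T2).
[Figure 7 diagram not reproduced; one recoverable arc label: a:a.]
Figure 7
Example of a transducer not equivalent to any subsequential transducer.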
The third step combines all transducers into one single transducer. This corre- 
sponds to the formal operation of composition defined on transducers. The formaliza- 
tion of this notion and an algorithm for computing the composed transducer are well 
known and are described originally by Elgot and Mezei (1965). 
Returning to our running example of Section 2, the transducer obtained by com- 
posing the local extension of T2 (right in Figure 5) with the local extension of T1 (right 
in Figure 4) is shown in Figure 6. 
The fourth and final step consists of transforming the finite-state transducer ob- 
tained in the previous step into an equivalent subsequential (deterministic) transducer. 
The transducer obtained in the previous step may contain some nondeterminism. The 
fourth step tries to turn it into a deterministic machine. This determinization is not al- 
ways possible for any given finite-state transducer. For example, the transducer shown 
in Figure 7 is not equivalent to any subsequential transducer. Intuitively speaking, this 
transducer has to look ahead an unbounded distance in order to correctly generate 
the output. This intuition will be formalized in Section 9.2. 
However, as proven in Section 10, the rules inferred in Brill's tagger can always 
be turned into a deterministic machine. Section 9.1 describes an algorithm for deter- 
minizing finite-state transducers. This algorithm will not terminate when applied to 
transducers representing nonsubsequential functions. 
In our running example, the transducer in Figure 6 has some nondeterministic 
paths. For example, from state 0 on input symbol vbd, two emissions are
possible: vbn (from 0 to 2) and vbd (from 0 to 3). This nondeterminism is due to the
rule vbd vbn NEXTTAG by, since this rule has to read the second symbol before it can 
know which symbol must be emitted. The deterministic version of the transducer T3 is 
shown in Figure 8. Whenever nondeterminism arises in T3, the deterministic machine 
[Figure 8 diagram not reproduced; one recoverable arc label: ?/vbd,?.]
Figure 8
Subsequential form for T3.
emits the empty symbol ε, and postpones the emission of the output symbol. For
example, from the start state 0, the empty string is emitted on input vbd, while the
current state is set to 2. If the following word is by, the two-token string vbn by is
emitted (from 2 to 0); otherwise vbd is emitted (depending on the input, from 2 to 2 or
from 2 to 0).
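The delayed emission can be sketched in Python for the two sample rules. This is a hand-written illustration and not a transcription of Figure 8: the state is encoded by two hypothetical booleans (`prev_np`, `pending`) rather than by numbered states, and `sequential` is a two-pass reference that applies the rules one after the other.

```python
def sequential(tags):
    """Reference behavior: one full pass per rule, in rule order."""
    t = list(tags)
    for i in range(1, len(t)):            # vbn vbd PREVTAG np
        if t[i] == "vbn" and t[i - 1] == "np":
            t[i] = "vbd"
    for i in range(len(t) - 1):           # vbd vbn NEXTTAG by
        if t[i] == "vbd" and t[i + 1] == "by":
            t[i] = "vbn"
    return t


def one_pass(tags):
    """Single left-to-right pass with a finite amount of memory.

    prev_np : the previous input tag was np (context of the first rule)
    pending : a vbd has been read but not emitted yet, because the
              second rule needs to see the next tag first
    """
    out = []
    prev_np = False
    pending = False
    for t in tags:
        cur = "vbd" if (t == "vbn" and prev_np) else t   # first rule
        if pending:
            # The right context is now known: resolve the delayed tag.
            out.append("vbn" if cur == "by" else "vbd")
            pending = False
        if cur == "vbd":
            pending = True        # emit the empty string for now
        else:
            out.append(cur)
        prev_np = (t == "np")
    if pending:
        out.append("vbd")         # final emission at the end of the input
    return out


print(one_pass(["pps", "vbd", "np", "vbn", "by", "np"]))
```

Because the memory is bounded, every input tag is read exactly once regardless of how many rules were folded into the machine; the price is that some outputs are emitted one token late.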
Using an appropriate implementation for finite-state transducers (see Section 11), 
the resulting part-of-speech tagger operates in linear time, independently of the num- 
ber of rules and the length of the context. The new tagger therefore operates in optimal 
time. 
We have shown how the contextual rules can be implemented very efficiently. We 
now turn our attention to lexical assignment, the step that precedes the application of 
the contextual transducer. This step can also be made very efficient. 
5. Lexical Tagger 
The first step of the tagging process consists of looking up each word in a dictionary. 
Since the dictionary is the largest part of the tagger in terms of space, a compact
representation is crucial. Moreover, the lookup process has to be very fast too; otherwise
the improvement in speed of the contextual manipulations would be of little practical
interest.
To achieve high speed for this procedure, the dictionary is represented by a deter- 
ministic finite-state automaton with both fast access and small storage space. Suppose 
one wants to encode the sample dictionary of Figure 9. The algorithm, as described by 
Revuz (1991), consists of first building a tree whose branches are labeled by letters and 
whose leaves are labeled by a list of tags (such as nn vb), and then minimizing it into 
a directed acyclic graph (DAG). The result of applying this procedure to the sample 
dictionary of Figure 9 is the DAG of Figure 10. When a dictionary is represented as 
a DAG, looking up a word in it consists simply of following one path in the DAG. 
The complexity of the lookup procedure depends only on the length of the word; in 
particular, it is independent of the size of the dictionary. 
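A minimal Python sketch of this construction, applied to the sample dictionary of Figure 9, is given below. It is not the algorithm of Revuz (1991): nodes are plain dictionaries, and minimization is done by merging identical subtrees through a registry, which is one simple way to obtain a DAG from a tree. All function names are hypothetical.

```python
def build_trie(entries):
    """Build a letter tree; a node stores its tag list (or None) and its arcs."""
    root = {"tags": None, "next": {}}
    for word, tags in entries.items():
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"tags": None, "next": {}})
        node["tags"] = tuple(sorted(tags))
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up so that shared suffixes are stored once."""
    for ch, child in list(node["next"].items()):
        node["next"][ch] = minimize(child, registry)
    # Children are already canonical, so their identities characterize them.
    key = (node["tags"],
           tuple(sorted((ch, id(c)) for ch, c in node["next"].items())))
    return registry.setdefault(key, node)


def count_states(node, seen=None):
    """Number of distinct nodes reachable from `node`."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_states(c, seen) for c in node["next"].values())


def lookup(dag, word):
    """Follow one path; the cost depends only on len(word)."""
    node = dag
    for ch in word:
        node = node["next"].get(ch)
        if node is None:
            return None
    return node["tags"]


DICTIONARY = {
    "ads": ["nns"],
    "bag": ["nn", "vb"],
    "bagged": ["vbn", "vbd"],
    "bayed": ["vbn", "vbd"],
    "bids": ["nns"],
}

trie = build_trie(DICTIONARY)
states_before = count_states(trie)
dag = minimize(trie, {})
states_after = count_states(dag)
print(states_before, states_after, lookup(dag, "bagged"))
```

Words that are absent from the dictionary, or that are only prefixes of entries, yield `None` in this sketch; in the tagger they would be handed to the unknown-word module.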
The lexicon used in our system encodes 54,000 words. The corresponding DAG
takes 360KB of space and provides an access time of 12,000 words per second. 4
4 The size of the dictionary in plain text (ASCII form) is 742KB. 
ads       nns
bag       nn vb
bagged    vbn vbd
bayed     vbn vbd
bids      nns
Figure 9
Sample dictionary.
[Figure 10 diagram not reproduced; recoverable leaf labels: (nns), (nn,vb), (vbd,vbn).]
Figure 10 
DAG representation of the dictionary of Figure 9. 
6. Tagging Unknown Words 
The rule-based system described by Brill (1992) contains a module that operates after
all known words (that is, words listed in the dictionary) have been tagged with their
most frequent tag, and before contextual rules are applied. This module guesses a
tag for a word according to its suffix (e.g. a word with an "ing" suffix is likely to be 
a verb), its prefix (e.g. a word starting with an uppercase character is likely to be a 
proper noun), and other relevant properties. 
This module basically follows the same techniques as the ones used to implement 
the lexicon. Because of the similarity of the methods used, we do not provide further 
details about this module. 
7. Empirical Evaluation 
The tagger we constructed has an accuracy identical 5 to that of Brill's tagger and comparable
to statistical-based methods. However, it runs at a much higher speed. The tagger 
runs nearly ten times faster than the fastest of the other systems. Moreover, the finite- 
state tagger inherits from the rule-based system its compactness compared with a 
stochastic tagger. In fact, whereas stochastic taggers have to store word-tag, bigram, 
and trigram probabilities, the rule-based tagger and therefore the finite-state one only 
have to encode a small number of rules (between 200 and 300). 
We empirically compared our tagger with Eric Brill's implementation of his tagger, 
and with our implementation of a trigram tagger adapted from the work of Church 
(1988) that we previously implemented for another purpose. We ran the three programs 
on large files and piped their output into a file. In the times reported, we included 
the time spent reading the input and writing the output. Figure 11 summarizes the 
results. All taggers were trained on a portion of the Brown corpus. The experiments 
were run on an HP720 with 32MB of memory. In order to conduct a fair comparison, 
the dictionary lookup part of the stochastic tagger has also been implemented using 
the techniques described in Section 5. All three taggers have approximately the same 
5 Our current implementation is functionally equivalent to the tagger as described by Brill (1992). 
However, the tagger could be extended to include recent improvements described in more recent 
papers (Brill 1994). 
         Stochastic Tagger    Rule-Based Tagger    Finite-State Tagger
Speed    1,200 w/s            500 w/s              10,800 w/s
Space    2,158KB              379KB                815KB
Figure 11
Overall performance comparison.
                       dictionary lookup    unknown words    contextual
Speed                  12,800 w/s           16,600 w/s       125,100 w/s
Percent of the time    85%                  6.5%             8.5%
Figure 12
Speeds of the different parts of the program.
precision (95% of the tags are correct). 6 By design, the finite-state tagger produces 
the same output as the rule-based tagger. The rule-based tagger and the finite-state
tagger do not always produce exactly the same tagging as the stochastic tagger (they do
not make the same errors); however, no significant difference in performance between
the systems was detected. 7
Independently, Cutting et al. (1992) quote a performance of 800 words per second
for their part-of-speech tagger based on hidden Markov models. 
The space required by the finite-state tagger (815KB) is distributed as follows: 
363KB for the dictionary, 440KB for the subsequential transducer and 12KB for the 
module for unknown words. 
The speeds of the different parts of our system are shown in Figure 12. 8 
Our system reaches a performance level in speed for which other, very low-level 
factors (such as storage access) may dominate the computation. At such speeds, the 
time spent reading the input file, breaking the file into sentences, breaking the sen- 
tences into words, and writing the result into a file is no longer negligible. 
8. Finite-State Transducers 
The methods used in the construction of the finite-state tagger described in the previ- 
ous sections were described informally. In the following section, the notion of finite- 
state transducer and the notion of local extension are defined. We also provide an 
algorithm for computing the local extension of a finite-state transducer. Issues related 
to the determinization of finite-state transducers are discussed in the section following 
this one. 
8.1 Definition of Finite-State Transducers 
A finite-state transducer $T$ is a five-tuple $(\Sigma, Q, i, F, E)$ where: $\Sigma$ is a finite alphabet; $Q$ is
a finite set of states or vertices; $i \in Q$ is the initial state; $F \subseteq Q$ is the set of final states;
$E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times \Sigma^* \times Q$ is the set of edges or transitions.
6 For evaluation purposes, we randomly selected 90% of the Brown corpus for training purposes and 
10% for testing. 
7 An extended discussion of the precision of the rule-based tagger can be found in Brill (1992).
8 In Figure 12, the dictionary lookup includes reading the file, splitting it into sentences, looking up each
word in the dictionary, and writing the final result to a file. The dictionary lookup and the tagging of
unknown words take roughly the same amount of time, but since the second procedure applies only
to unknown words (around 10% in our experiments), the percentage of time it takes is much smaller.
Figure 13 
T4: Example of a finite-state transducer. 
For instance, Figure 13 is the graphical representation of the transducer: 
$T_4 = (\{a,b,c,h,e\}, \{0,1,2,3\}, 0, \{3\}, \{(0,a,b,1), (0,a,c,2), (1,h,h,3), (2,e,e,3)\})$.
A finite-state transducer T also defines a function on words in the following way: 
the extended set of edges $\hat{E}$, the transitive closure of $E$, is defined by the following
recursive relation:
• if $e \in E$ then $e \in \hat{E}$
• if $(q,a,b,q'), (q',a',b',q'') \in \hat{E}$ then $(q,aa',bb',q'') \in \hat{E}$.
Then the function $f$ from $\Sigma^*$ to $\Sigma^*$ defined by $f(w) = w'$ iff $\exists q \in F$ such that
$(i,w,w',q) \in \hat{E}$ is the function defined by $T$. One says that $T$ represents $f$ and writes $f = |T|$.
The functions on words that are represented by finite-state transducers are called 
rational functions. If, for some input $w$, more than one output is allowed (e.g. $f(w) =
\{w_1, w_2, \ldots\}$) then $f$ is called a rational transduction.
In the example of Figure 13, $|T_4|$ is defined by $|T_4|(ah) = bh$ and $|T_4|(ae) = ce$.
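The definition can be exercised on this example with a short Python sketch. The edge list is copied from the five-tuple given above; the function name `outputs` and the representation of partial paths as (state, output) pairs are assumptions of the sketch, which also assumes single-character symbols and does not handle edges with empty input, none of which occur in this example.

```python
# Edges (q, input, output, q') of the example transducer.
EDGES = [(0, "a", "b", 1), (0, "a", "c", 2), (1, "h", "h", 3), (2, "e", "e", 3)]
INITIAL = 0
FINALS = {3}


def outputs(word):
    """All outputs w' such that some path from the initial state to a
    final state reads `word` and writes w'."""
    configs = {(INITIAL, "")}          # (current state, output so far)
    for symbol in word:
        configs = {
            (target, out + emitted)
            for (state, out) in configs
            for (source, read, emitted, target) in EDGES
            if source == state and read == symbol
        }
    return {out for (state, out) in configs if state in FINALS}


print(outputs("ah"), outputs("ae"))
```

After reading `a`, two partial paths are alive at once (one having written `b`, the other `c`); only the next symbol decides which of them survives.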
Given a finite-state transducer $T = (\Sigma, Q, i, F, E)$, the following additional notions
are useful: its state transition function $d$ that maps $Q \times (\Sigma \cup \{\epsilon\})$ into $2^Q$ defined by
$d(q,a) = \{q' \in Q \mid \exists w' \in \Sigma^* \text{ and } (q,a,w',q') \in E\}$; and its emission function $\delta$ that maps
$Q \times (\Sigma \cup \{\epsilon\}) \times Q$ into $2^{\Sigma^*}$ defined by $\delta(q,a,q') = \{w' \in \Sigma^* \mid (q,a,w',q') \in E\}$.
A finite-state transducer could be seen as a finite-state automaton, where each
transition label is a pair. In this respect, $T_4$ would be deterministic; however, since
transducers are generally used to compute a function, a more relevant definition
of determinism consists of saying that both the transition function $d$ and the emission
function $\delta$ lead to sets containing at most one element, that is, $|d(q,a)| \le 1$ and
$|\delta(q,a,q')| \le 1$ (and that these sets are empty for $a = \epsilon$). With this notion, if a finite-state
transducer is deterministic, one can apply the function to a given word by deterministically
following a single path in the transducer. Deterministic transducers are called
subsequential transducers (Schützenberger 1977). 9 Given a deterministic transducer, we
can define the partial functions $q \otimes a = q'$ iff $d(q,a) = \{q'\}$ and $q * a = w'$ iff $\exists q' \in Q$ such
that $q \otimes a = q'$ and $\delta(q,a,q') = \{w'\}$. This leads to the definition of subsequential
transducers: a subsequential transducer $T'$ is a seven-tuple $(\Sigma, Q, i, F, \otimes, *, \rho)$ where: $\Sigma, Q, i, F$
are defined as above; $\otimes$ is the deterministic state transition function that maps $Q \times \Sigma$
on $Q$, one writes $q \otimes a = q'$; $*$ is the deterministic emission function that maps $Q \times \Sigma$ on
$\Sigma^*$, one writes $q * a = w'$; and the final emission function $\rho$ maps $F$ on $\Sigma^*$, one writes
$\rho(q) = w$.
For instance, $T_4$ is not deterministic because $d(0,a) = \{1,2\}$, but it is equivalent
to $T_5$ represented in Figure 14 in the sense that they represent the same function, i.e.
9 A sequential transducer is a deterministic transducer for which all states are final. Sequential transducers 
are also called generalized sequential machines (Eilenberg 1974). 
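[Figure 14 diagram not reproduced; one recoverable arc label: h/bh.]
Figure 14
Subsequential transducer T5.
[Figure 15 diagram not reproduced.]
Figure 15
T6: a finite-state transducer to be extended.
Input:                  a a b c a b
First factorization:    a [a b] c [a b]     with [a b] -> b c
Second factorization:   a a [b c a] b       with [b c a] -> d c a
Figure 16
Top: Input. Middle: First factorization. Bottom: Second factorization.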
$|T_4| = |T_5|$. $T_5$ is defined by $T_5 = (\{a,b,c,h,e\}, \{0,1,2\}, 0, \{2\}, \otimes, *, \rho)$ where $0 \otimes a = 1$,
$0 * a = \epsilon$, $1 \otimes h = 2$, $1 * h = bh$, $1 \otimes e = 2$, $1 * e = ce$, and $\rho(2) = \epsilon$.
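Applying a subsequential transducer amounts to following a single path while concatenating emissions. The Python sketch below encodes the seven-tuple just given as three lookup tables; the table names (`NEXT`, `EMIT`, `FINAL_EMIT`) and the function `run` are hypothetical, and returning `None` for inputs outside the domain is a convention of the sketch.

```python
NEXT = {(0, "a"): 1, (1, "h"): 2, (1, "e"): 2}         # state transition function
EMIT = {(0, "a"): "", (1, "h"): "bh", (1, "e"): "ce"}  # emission function
FINAL_EMIT = {2: ""}                                   # final emission, defined on final states


def run(word, initial=0):
    """Follow the unique path for `word`; return the output or None."""
    state, pieces = initial, []
    for symbol in word:
        if (state, symbol) not in NEXT:
            return None               # no transition: word not in the domain
        pieces.append(EMIT[(state, symbol)])
        state = NEXT[(state, symbol)]
    if state not in FINAL_EMIT:
        return None                   # stopped in a non-final state
    return "".join(pieces) + FINAL_EMIT[state]


print(run("ah"), run("ae"))
```

Nothing is emitted on `a`; the choice between `b` and `c` is postponed until `h` or `e` has been read, at which point the whole output is produced at once.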
8.2 Local Extension 
In this section, we will see how a function that needs to be applied at all input positions 
can be transformed into a global function that needs to be applied once on the input. 
For instance, consider $T_6$ of Figure 15. It represents the function $f_6 = |T_6|$ such that
$f_6(ab) = bc$ and $f_6(bca) = dca$. We want to build the function that, given a word $w$,
transforms each occurrence of $ab$ (resp. $bca$) as a factor of $w$
into its image $bc$ (resp. $dca$). Suppose, for instance, that the input word is $w = aabcab$, as
shown in Figure 16, and that the factors that are in $dom(f_6)$ 10 can be found according
to two different factorizations: i.e. $w = a \cdot w_2 \cdot c \cdot w_2$ 11, where $w_2 = ab$, and $w =
aa \cdot w_3 \cdot b$, where $w_3 = bca$. The local extension of $f_6$ will be the transduction that takes
each possible factorization and transforms each factor according to $f_6$, i.e. $f_6(w_2) =
bc$ and $f_6(w_3) = dca$, and leaves the other parts unchanged; here this leads to two
outputs: $abccbc$ according to the first factorization, and $aadcab$ according to the second
factorization.
The notion of local extension is formalized through the following definition. 
Definition 
If $f$ is a rational transduction from $\Sigma^*$ to $\Sigma^*$, the local extension $F = LocExt(f)$ is
the rational transduction from $\Sigma^*$ on $\Sigma^*$ defined in the following way: if
$u = a_1 b_1 a_2 b_2 \cdots a_n b_n a_{n+1} \in \Sigma^*$ then $v = a_1 b'_1 a_2 b'_2 \cdots a_n b'_n a_{n+1} \in F(u)$ if
$a_i \in \Sigma^* - (\Sigma^* \cdot dom(f) \cdot \Sigma^*)$, $b_i \in dom(f)$ and $b'_i \in f(b_i)$.
10 dom(f) denotes the domain of f, that is, the set of words that have at least one output through f. 
11 If $w_1, w_2 \in \Sigma^*$, $w_1 \cdot w_2$ denotes the concatenation of $w_1$ and $w_2$. It may also be written $w_1 w_2$.
Emmanuel Roche and Yves Schabes Deterministic Part-of-Speech Tagging 
LocalExtension(T' = (Σ, Q', i', F', E'), T = (Σ, Q, i, F, E))
1   C'[0] = ({i}, identity); q = 0; i' = 0; F' = ∅; E' = ∅; Q' = ∅; C'[1] = (∅, transduction); n = 2;
2   do {
3     (S, type) = C'[q]; Q' = Q' ∪ {q};
4     if (type == identity)
5       F' = F' ∪ {q}; E' = E' ∪ {(q, ?, ?, i')};
6       for each w ∈ (Σ ∪ {ε}) s.t. ∃x ∈ S, d(x,w) ≠ ∅ and ∀y ∈ S, d(y,w) ∩ F = ∅
7         if ∃r ∈ [0, n−1] such that C'[r] == ({i} ∪ ⋃_{x∈S} d(x,w), identity)
8           e = r;
9         else
10          C'[e = n++] = ({i} ∪ ⋃_{x∈S} d(x,w), identity);
11        E' = E' ∪ {(q, w, w, e)};
12      for each (i, w, w', x) ∈ E
13        if ∃r ∈ [0, n−1] such that C'[r] == ({x}, transduction)
14          e = r;
15        else
16          C'[e = n++] = ({x}, transduction);
17        E' = E' ∪ {(q, w, w', e)};
18      for each w ∈ (Σ ∪ {ε}) s.t. ∃x ∈ S, d(x,w) ∩ F ≠ ∅ then E' = E' ∪ {(q, w, w, 1)};
19    else if (type == transduction)
20      if ∃x1 ∈ Q s.t. S == {x1}
21        if (x1 ∈ F) then E' = E' ∪ {(q, ε, ε, 0)};
22        for each (x1, w, w', y) ∈ E
23          if ∃r ∈ [0, n−1] such that C'[r] == ({y}, transduction)
24            e = r;
25          else
26            C'[e = n++] = ({y}, transduction);
27          E' = E' ∪ {(q, w, w', e)};
28    q++;
29  } while (q < n);
Figure 17 
Local extension algorithm. 
Intuitively, if $F = LocExt(f)$ and $w \in \Sigma^*$, each factor of w in dom(f) is transformed
into its image by f and the remaining part of w is left unchanged. If f is represented
by a finite-state transducer T and LocExt(f) is represented by a finite-state transducer
T', one writes T' = LocExt(T).
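The definition can be checked on small inputs with a brute-force enumeration. The following Python sketch is only an illustration on strings, not the transducer construction of Figure 17: it assumes f has a finite domain without the empty word and is given as a dictionary from factors to sets of images; the names `local_extension` and `F6` are hypothetical.

```python
from typing import Dict, Set


def contains_factor(word: str, domain) -> bool:
    """True if some element of the domain occurs as a factor of `word`."""
    return any(pattern in word for pattern in domain)


def local_extension(word: str, f: Dict[str, Set[str]]) -> Set[str]:
    """All outputs of LocExt(f) on `word`, by enumerating factorizations."""
    results: Set[str] = set()
    n = len(word)
    for cut in range(n + 1):
        prefix = word[:cut]                  # candidate untouched part a_1
        if contains_factor(prefix, f):
            break                            # every longer prefix contains it too
        if cut == n:
            results.add(prefix)              # the whole remainder is left unchanged
        for pattern, images in f.items():
            if word.startswith(pattern, cut):
                rest = word[cut + len(pattern):]
                for tail in local_extension(rest, f):
                    for image in images:
                        results.add(prefix + image + tail)
    return results


# f6 from the running example: ab -> bc and bca -> dca.
F6 = {"ab": {"bc"}, "bca": {"dca"}}
```

On the word aabcab with `F6`, the enumeration finds exactly the two factorizations discussed for Figure 16, because any untouched part that still contains ab or bca is rejected.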
It could also be seen that if $\gamma_T$ is the identity function on $\Sigma^* - (\Sigma^* \cdot dom(T) \cdot \Sigma^*)$,
then $LocExt(T) = \gamma_T \cdot (T \cdot \gamma_T)^*$.$^{12}$ Figure 17 gives an algorithm that computes the local
extension directly.
The idea is that an input word is processed nondeterministically from left to right. 
Suppose, for instance, that we have the initial transducer T7 of Figure 18 and that we 
want to build its local extension, T8 of Figure 19.
When the input is read, if a current input letter cannot be transformed at the
initial state of T7 (the letter c for instance), it is left unchanged: this is expressed by
the looping transition on the initial state 0 of T8 labeled ?/?.$^{13}$ On the other hand,
12 In this last formula, the concatenation • stands for the concatenation of the graphs of each function; 
that is, for the concatenation of the transducers viewed as automata whose labels are of the form a/b. 
13 As explained before, an input transition labeled by the symbol ? stands for all transitions labeled with 
a letter that doesn't appear as input on any outgoing arc from this state. A transition labeled ?/? stands 
Figure 18 
Sample transducer T7. 
Figure 19 
Local extension T8 of T7: T8 = LocExt(T7).
if the input symbol, say a, can be processed at the initial state of T7, one doesn't 
know yet whether a will be the beginning of a word that can be transformed (e.g. ab) 
or whether it will be followed by a sequence that makes it impossible to apply the 
transformation (e.g. ac). Hence one has to entertain two possibilities, namely (1) we 
are processing the input according to T7 and the transitions should be a/b; or (2) we 
are within the identity and the transition should be a/a. This leads to two kinds of
states: the transduction states (marked transduction in the algorithm) and the identity 
states (marked identity in the algorithm). It can be seen in Figure 19 that this leads 
to a transducer that has a copy of the initial transducer and an additional part that 
processes the identity while making sure it could not have been transformed. In other 
words, the algorithm consists of building a copy of the original transducer and at the 
same time the identity function that operates on $\Sigma^* - (\Sigma^* \cdot dom(T) \cdot \Sigma^*)$.
Let us now see how the algorithm of Figure 17 applies step by step to the trans- 
ducer T7 of Figure 18, producing the transducer T8 of Figure 19. 
In Figure 17, C'[0] = ({i}, identity) of line 1 states that state 0 of the transducer to
be built is of type identity and refers to the initial state i = 0 of T7. q represents the
current state and n the current number of states. In the loop do{...} while (q < n), one
builds the transitions of each state one after the other: if the transition points to a state
not already built, a new state is added, thus incrementing n. The program stops when
all states have been inspected and when no additional state is created. The number of
iterations is bounded by $2^{\|T\|} \times 2$, where $\|T\| = |Q|$ is the number of states of the original
transducer.$^{14}$ Line 3 says that the current state within the loop is q and that this state
for all the diagonal pairs a/a s.t. a is not an input symbol on any outgoing arc from this state. 
14 In fact, $Q' \subset 2^Q \times \{transduction, identity\}$. Thus, $q \leq 2 \cdot 2^{|Q|}$.
240 
Emmanuel Roche and Yves Schabes Deterministic Part-of-Speech Tagging 
1 
?/? 
Figure 20 
Local extension T9 of T6: T9 = LocExt(T6).
refers to the set of states S and is marked by the type type. In our example, at the 
first occurrence of this line, S is instantiated to {0} and type = identity. Line 5 adds 
the current identity state to the set of final states and a transition to the initial state 
for all letters that do not appear on any outgoing arc from this state. Lines 6-11 build 
the transitions from and to the identity states, keeping track of where this leads in the 
original transducer. For instance, a is a label that verifies the conditions of line 6. Thus 
a transition a/a is to be added to the identity state 2, which refers to 1 (because of the 
transition a/b of T7) and to i = 0 (because it is possible to start the transduction T7 
from any identity state). Line 7 checks that this state doesn't already exist and adds it
if necessary; e = n++ means that the arrival state for this transition, i.e. d(q, w), will be
the last added state and that the number of states being built has to be incremented. 
Line 11 actually builds the transition between 0 and e = 2 labeled a/a. Lines 12-17 
describe the fact that it is possible to start a transduction from any identity state. Here 
a transition is added to a new state, i.e. a/b to 3. The next state to be considered is 2 
and it is built like state 0, except that the symbol b should block the current output. In 
fact, state 1 means that we already read a with a as output; thus, if one reads b, ab is 
at the current point, and since ab should be transformed into bc, the current identity 
transformation (that is, a → a) should be blocked: this is expressed by the transition b/b
that leads to state 1 (this state is a "trash" state; that is, it has no outgoing transition 
and it is not final). 
The following state is 3, which is marked as being of type transduction, which
means that lines 19-27 should be applied. This consists simply of copying the
transitions of the original transducer. If the original state was final, as for 4 = ({2}, transduction),
an ε/ε transition to the initial state is added (to get the behavior of $T^+$).
The transducer T9 = LocExt(T6) of Figure 20 gives a more complete (and slightly 
more complex) example of this algorithm. 
9. Determinization 
The basic idea behind the determinization algorithm comes from Mehryar Mohri.$^{15}$
In this section, after giving a formalization of the algorithm, we introduce a proof of 
soundness and completeness, and we study its worst-case complexity. 
9.1 Determinization Algorithm 
In the following, for $w_1, w_2 \in \Sigma^*$, $w_1 \wedge w_2$ denotes the longest common prefix of $w_1$
and $w_2$.
The finite-state transducers we use in our system have the property that they can be
made deterministic; that is, there exists a subsequential transducer that represents the
same function.$^{16}$ If $T = (\Sigma, Q, i, F, E)$ is such a finite-state transducer, the subsequential
transducer $T' = (\Sigma, Q', i', F', \otimes, *, \rho)$ defined as follows will be later proved equivalent
to T:
• $Q' \subset 2^{Q \times \Sigma^*}$. In fact, the determinization of the transducer is related to
the determinization of FSAs in the sense that it also involves a power set
construction. The difference is that one has to keep track of the set of
states of the original transducer one might be in, and also of the words
whose emission has been postponed. For instance, a state
$\{(q_1, w_1), (q_2, w_2)\}$ means that this state corresponds to a path that leads
to $q_1$ and $q_2$ in the original transducer and that the emission of $w_1$ (resp.
$w_2$) was delayed for $q_1$ (resp. $q_2$).
• $i' = \{(i, \epsilon)\}$. There is no postponed emission at the initial state.
• the emission function is defined by:
$$S * a = \bigwedge_{(q,u) \in S} \; \bigwedge_{q' \in d(q,a)} u \cdot \delta(q,a,q')$$
This means that, for a given symbol, the set of possible emissions is
obtained by concatenating the postponed emissions with the emission at
the current state. Since one wants the transition to be deterministic, the
actual emission is the longest common prefix of this set.
• the state transition function is defined by:
$$S \otimes a = \bigcup_{(q,u) \in S} \; \bigcup_{q' \in d(q,a)} \{(q', (S * a)^{-1} \cdot u \cdot \delta(q,a,q'))\}$$
Given $u, v \in \Sigma^*$, $u \cdot v$ denotes the concatenation of u and v, and
$u^{-1} \cdot v = w$ if w is such that $u \cdot w = v$; $u^{-1} \cdot v = \emptyset$ if no such w exists.
• $F' = \{S \in Q' \mid \exists (q,u) \in S \text{ and } q \in F\}$
• if $S \in F'$, $\rho(S) = u$ s.t. $\exists q \in F$, $(q,u) \in S$. We will see in the proof of
correctness that $\rho$ is properly defined.
15 Mohri (1994b) also gives a formalization of the algorithm.
16 As opposed to automata, a large class of finite-state transducers do not have any deterministic
representation; they cannot be determinized.
The determinization algorithm of Figure 21 computes the above subsequential 
transducer. 
Let us now apply the determinization algorithm of Figure 21 on the finite-state
transducer T4 of Figure 13 and show how it builds the subsequential transducer T10
of Figure 22. Line 1 of the algorithm builds the first state and instantiates it with the
pair $\{(0, \epsilon)\}$. q and n respectively denote the current state and the number of states
having been built so far. At line 5, one takes all the possible input symbols w; here
only a is possible. $w'$ of line 6 is the output symbol,
$$w' = \epsilon \cdot \Big( \bigwedge_{\bar{q}' \in \{1,2\}} \delta(0,a,\bar{q}') \Big),$$
thus $w' = \delta(0,a,1) \wedge \delta(0,a,2) = b \wedge c = \epsilon$. Line 8 is then computed as follows:
$$S' = \bigcup_{\bar{q} \in \{0\}} \; \bigcup_{\bar{q}' \in \{1,2\}} \{(\bar{q}', \epsilon^{-1} \cdot \epsilon \cdot \delta(\bar{q},a,\bar{q}'))\},$$
thus $S' = \{(1, \delta(0,a,1))\} \cup \{(2, \delta(0,a,2))\} = \{(1,b), (2,c)\}$. Since no r verifies the condition
on line 9, a new state e is created to which the transition labeled $a/w' = a/\epsilon$ points and
n is incremented. On line 15, the program goes to the construction of the transitions
of state 1. On line 5, h and e are then two possible symbols. The first symbol, h, at line
6, is such that $w'$ is
$$w' = \bigwedge_{\bar{q}' \in d(1,h) = \{2\}} b \cdot \delta(1,h,\bar{q}') = bh.$$
Henceforth, the computation of line 8 leads to
$$S' = \bigcup_{\bar{q} \in \{1\}} \; \bigcup_{\bar{q}' \in \{2\}} \{(\bar{q}', (bh)^{-1} \cdot b \cdot h)\} = \{(2, \epsilon)\}.$$
State 2 labeled $\{(2, \epsilon)\}$ is thus added, and a transition labeled h/bh that points to state
2 is also added. The transition for the input symbol e is computed the same way.
The subsequential transducer generated by this algorithm could in turn be minimized
by an algorithm described in Mohri (1994a). However, in our case, the transducer
is nearly minimal.
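The construction can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: Figure 13 is not reproduced in this section, so the encoding of T4 below (`T4_EDGES`, `T4_FINALS`, with four states) is an assumption chosen only so that ah maps to bh and ae maps to ce; `os.path.commonprefix` stands in for the longest-common-prefix operator, and the sketch assumes a transducer without ε input transitions. Like the algorithm itself, it does not terminate on a nonsubsequential function.

```python
from os.path import commonprefix
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[int, str, str, int]           # (source, input, output, target)
StateSet = FrozenSet[Tuple[int, str]]      # pairs (original state, delayed output)


def determinize(initial: int, finals: Set[int], edges: List[Edge]):
    """Power-set construction with postponed emissions."""
    states: List[StateSet] = [frozenset({(initial, "")})]
    index: Dict[StateSet, int] = {states[0]: 0}
    next_state: Dict[Tuple[int, str], int] = {}   # deterministic transition
    emit: Dict[Tuple[int, str], str] = {}         # deterministic emission
    final_emit: Dict[int, str] = {}               # final emission
    q = 0
    while q < len(states):
        current = states[q]
        for state, delayed in current:
            if state in finals:
                final_emit[q] = delayed
        # Collect, per input symbol, every (target, delayed + output) candidate.
        candidates: Dict[str, List[Tuple[int, str]]] = {}
        for state, delayed in current:
            for source, symbol, output, target in edges:
                if source == state:
                    candidates.setdefault(symbol, []).append((target, delayed + output))
        for symbol, pairs in candidates.items():
            prefix = commonprefix([out for _, out in pairs])   # emitted now
            successor = frozenset((t, out[len(prefix):]) for t, out in pairs)
            if successor not in index:                         # new state set
                index[successor] = len(states)
                states.append(successor)
            emit[(q, symbol)] = prefix
            next_state[(q, symbol)] = index[successor]
        q += 1
    return states, next_state, emit, final_emit


# Assumed encoding of T4: 0 -a/b-> 1 -h/h-> 3 and 0 -a/c-> 2 -e/e-> 3.
T4_EDGES: List[Edge] = [(0, "a", "b", 1), (0, "a", "c", 2), (1, "h", "h", 3), (2, "e", "e", 3)]
T4_FINALS: Set[int] = {3}
```

Under this assumed encoding, `determinize(0, T4_FINALS, T4_EDGES)` yields three state sets whose transitions are a/ε, h/bh and e/ce, which agrees with the transducer described in the walk-through.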
9.2 Proof of Correctness 
Although it is decidable whether a function is subsequential or not (Choffrut 1977), 
the determinization algorithm described in the previous section does not terminate 
when run on a nonsubsequential function. 
Two issues are addressed in this section. First, the proof of soundness: the fact that 
if the algorithm terminates, then the output transducer is deterministic and represents 
the same function. Second, the proof of completeness: the algorithm terminates in the 
case of subsequential functions. 
Soundness and completeness are a consequence of the main proposition, which 
states that if a transducer T represents a subsequential function f, then the algorithm 
DeterminizeTransducer described in the previous section applied on T computes a sub- 
sequential transducer representing the same function. 
In order to simplify the proofs, we will only consider transducers that do not have 
ε input transitions, that is $E \subset Q \times \Sigma \times \Sigma^* \times Q$, and also without loss of generality,
DeterminizeTransducer(T' = (Σ, Q', i', F', ⊗, *, ρ), T = (Σ, Q, i, F, E))
1   i' = 0; q = 0; n = 1; C'[0] = {(0, ε)}; F' = ∅; Q' = ∅;
2   do {
3     S = C'[q]; Q' = Q' ∪ {q};
4     if ∃(q̄, u) ∈ S s.t. q̄ ∈ F then F' = F' ∪ {q} and ρ(q) = u;
5     foreach w such that ∃(q̄, u) ∈ S and d(q̄, w) ≠ ∅ {
6       w' = ⋀_{(q̄,u) ∈ S} ⋀_{q̄' ∈ d(q̄,w)} u · δ(q̄, w, q̄');
7       q * w = w';
8       S' = ⋃_{(q̄,u) ∈ S} ⋃_{q̄' ∈ d(q̄,w)} {(q̄', w'⁻¹ · u · δ(q̄, w, q̄'))};
9       if ∃r ∈ [0, n−1] such that C'[r] == S'
10        e = r;
11      else
12        C'[e = n++] = S';
13      q ⊗ w = e;
14    }
15    q++;
16  } while (q < n);
Figure 21
Determinization algorithm.
Figure 22
Subsequential transducer T10 such that $|T_{10}| = |T_4|$.
transducers that are reduced and that are deterministic in the sense of finite-state 
automata. 17 
In order to prove this proposition, we need to establish some preliminary notations 
and lemmas. 
First we extend the definition of the transition function $d$, the emission function $\delta$,
the deterministic transition function $\otimes$, and the deterministic emission function $*$ on
words in the classical way. We then have the following properties:
$$d(q, ab) = \bigcup_{q' \in d(q,a)} d(q', b)$$
$$\delta(q_1, ab, q_2) = \bigcup_{\{q' \in d(q_1,a) \,\mid\, q_2 \in d(q',b)\}} \delta(q_1, a, q') \cdot \delta(q', b, q_2)$$
$$q \otimes ab = (q \otimes a) \otimes b$$
17 A transducer defines an automaton whose labels are the pairs "input/output"; this automaton is 
assumed to be deterministic. 
$$q * ab = (q * a) \cdot ((q \otimes a) * b)$$
For the following, it is useful to note that if $|T|$ is a function, then $\delta$ is a function
too.
The following lemma states an invariant that holds for each state S built within 
the algorithm. The lemma will later be used for the proof of soundness. 
Lemma 1 
Let $I = C'[0]$ be the initial state. At each iteration of the "do" loop in
DeterminizeTransducer, for each $S = C'[q]$ and for each $w \in \Sigma^*$ such that $I \otimes w = S$, the following
holds:
(i) $I * w = \bigwedge_{q \in d(i,w)} \delta(i,w,q)$
(ii) $S = I \otimes w = \{(q,u) \mid q \in d(i,w) \text{ and } u = (I * w)^{-1} \cdot \delta(i,w,q)\}$
Proof
(i) and (ii) are obviously true for $S = I$ (since $d(i,\epsilon) = \{i\}$ and $\delta(i,\epsilon,i) = \epsilon$), and we
will show that given some $w \in \Sigma^*$ if it is true for $S = I \otimes w$, then it is also true for
$S_1 = S \otimes a = I \otimes wa$ for all $a \in \Sigma$.
Assuming that (i) and (ii) hold for S and w, then for each $a \in \Sigma$:
$$\bigwedge_{q \in d(i,w),\, q' \in d(q,a)} \delta(i,w,q) \cdot \delta(q,a,q')$$
$$= (I * w) \cdot \bigwedge_{q \in d(i,w),\, q' \in d(q,a)} (I * w)^{-1} \cdot \delta(i,w,q) \cdot \delta(q,a,q')$$
$$= (I * w) \cdot \bigwedge_{(q,u) \in S = I \otimes w,\, q' \in d(q,a)} u \cdot \delta(q,a,q')$$
$$= (I * w) \cdot (S * a)$$
$$= (I * w) \cdot ((I \otimes w) * a)$$
$$= I * wa$$
This proves (i).
We now turn to (ii). Assuming that (i) and (ii) hold for S and w, then for each
$a \in \Sigma$, let $S_1 = S \otimes a$; the algorithm (line 8) is such that
$$S_1 = \{(q',u') \mid \exists (q,u) \in S,\ q' \in d(q,a) \text{ and } u' = (S * a)^{-1} \cdot u \cdot \delta(q,a,q')\}$$
Let
$$S_2 = \{(q',u') \mid q' \in d(i,wa) \text{ and } u' = (I * wa)^{-1} \cdot \delta(i,wa,q')\}$$
We show that $S_1 \subset S_2$. Let $(q',u') \in S_1$, then $\exists (q,u) \in S$ s.t. $q' \in d(q,a)$ and
$u' = (S * a)^{-1} \cdot u \cdot \delta(q,a,q')$. Since $u = (I * w)^{-1} \cdot \delta(i,w,q)$, then
$u' = (S * a)^{-1} \cdot (I * w)^{-1} \cdot \delta(i,w,q) \cdot \delta(q,a,q')$; that is, $u' = (I * wa)^{-1} \cdot \delta(i,wa,q')$.
Thus $(q',u') \in S_2$. Hence $S_1 \subset S_2$.
We now show that $S_2 \subset S_1$. Let $(q',u') \in S_2$, and let $q \in d(i,w)$ be s.t. $q' \in d(q,a)$
and $u = (I * w)^{-1} \cdot \delta(i,w,q)$; then $(q,u) \in S$ and since
$u' = (I * wa)^{-1} \cdot \delta(i,wa,q') = (S * a)^{-1} \cdot u \cdot \delta(q,a,q')$, $(q',u') \in S_1$.
This concludes the proof of (ii). □
The following lemma states a common property of the state S, which will be used 
in the complexity analysis of the algorithm. 
Lemma 2 
Each $S = C'[q]$ built within the "do" loop is s.t. $\forall q \in Q$, there is at most one pair
$(q,w) \in S$ with q as first element.
Proof
Suppose $(q,w_1) \in S$ and $(q,w_2) \in S$, and let w be s.t. $I \otimes w = S$. Then
$w_1 = (I * w)^{-1} \cdot \delta(i,w,q)$ and $w_2 = (I * w)^{-1} \cdot \delta(i,w,q)$. Thus $w_1 = w_2$. □
The following lemma will also be used for soundness. It states that the final state 
emission function is indeed a function. 
Lemma 3 
For each S built in the algorithm, if $(q,u), (q',u') \in S$, then $q, q' \in F \Rightarrow u = u'$.
Proof
Let S be one state set built in line 8 of the algorithm. Suppose $(q,u), (q',u') \in S$ and
$q, q' \in F$. According to (ii) of Lemma 1, $u = (I * w)^{-1} \cdot \delta(i,w,q)$ and $u' = (I * w)^{-1} \cdot \delta(i,w,q')$.
Since $|T|$ is a function and $\delta(i,w,q), \delta(i,w,q') \in |T|(w)$, then $\delta(i,w,q) = \delta(i,w,q')$,
therefore $u = u'$. □
The following lemma will be used for completeness. 
Lemma 4 
Given a transducer T representing a subsequential function, there exists a bound M
s.t. for each S built at line 8, for each $(q,u) \in S$, $|u| \leq M$.
We rely on the following theorem proven by Choffrut (1978):
Theorem 1
A function f on $\Sigma^*$ is subsequential iff it has bounded variations and for any rational
language $L \subset \Sigma^*$, $f^{-1}(L)$ is also rational.
with the following two definitions:
Definition
The left distance between two strings u and v is $\|u,v\| = |u| + |v| - 2|u \wedge v|$.
Definition
A function f on $\Sigma^*$ has bounded variations iff for all $k \geq 0$, there exists $K > 0$ s.t.
$u,v \in dom(f)$, $\|u,v\| \leq k \Rightarrow \|f(u),f(v)\| \leq K$.
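As a small illustration of the left distance, the following Python sketch computes it directly from the definition. The function name `left_distance` is an assumption, and `os.path.commonprefix` is used here as the longest-common-prefix operator on two strings.

```python
from os.path import commonprefix


def left_distance(u: str, v: str) -> int:
    """Left distance |u| + |v| - 2 * |longest common prefix of u and v|."""
    return len(u) + len(v) - 2 * len(commonprefix([u, v]))


# Strings sharing a long prefix are close; unrelated strings are far apart.
EXAMPLES = {
    ("abc", "abd"): left_distance("abc", "abd"),
    ("ab", "ab"): left_distance("ab", "ab"),
    ("", "abc"): left_distance("", "abc"),
}
```

Intuitively, the distance counts the symbols that remain in u and v once their common prefix is removed, so a function with bounded variations maps inputs that differ only in a short suffix to outputs that also differ only in a bounded suffix.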
Proof of Lemma 4 
Let $f = |T|$. For each $q \in Q$, let $c(q)$ be a string w s.t. $d(q,w) \cap F \neq \emptyset$ and s.t. $|w|$ is
minimal among such strings. Note that $|c(q)| \leq \|T\|$ where $\|T\|$ is the number of states
in T. For each $q \in Q$ let $s(q) \in Q$ be a state s.t. $s(q) \in d(q,c(q)) \cap F$. Let us further define
$$M_1 = \max_{q \in Q} |\delta(q,c(q),s(q))|$$
$$M_2 = \max_{q \in Q} |c(q)|$$
Since f is subsequential, it is of bounded variations, therefore there exists K s.t. if
$\|u,v\| \leq 2M_2$ then $\|f(u),f(v)\| \leq K$. Let $M = K + 2M_1$.
Let S be a state set built at line 8, let w be s.t. $I \otimes w = S$ and $\lambda = I * w$. Let $(q_1,u) \in S$.
Let $(q_2,v) \in S$ be s.t. $u \wedge v = \epsilon$. Such a pair always exists, since if not
$$\Big| \bigwedge_{(q',u') \in S} u' \Big| > 0$$
thus
$$\Big| \lambda \cdot \bigwedge_{(q',u') \in S} u' \Big| = \Big| \bigwedge_{(q',u') \in S} \lambda \cdot u' \Big| > |\lambda|$$
Thus, because of (ii) in Lemma 1,
$$\Big| \bigwedge_{q' \in d(i,w)} \delta(i,w,q') \Big| > |I * w|$$
which contradicts (i) in Lemma 1.
Let $\omega = \delta(q_1,c(q_1),s(q_1))$ and $\omega' = \delta(q_2,c(q_2),s(q_2))$.
Moreover, for any $a,b,c,d \in \Sigma^*$, $\|a,c\| \leq \|ab,cd\| + |b| + |d|$. In fact,
$\|ab,cd\| = |ab| + |cd| - 2|ab \wedge cd| = |a| + |c| + |b| + |d| - 2|ab \wedge cd|$
$= \|a,c\| + 2|a \wedge c| + |b| + |d| - 2|ab \wedge cd|$,
but $|ab \wedge cd| \leq |a \wedge c| + |b| + |d|$, and since
$\|ab,cd\| = \|a,c\| - 2(|ab \wedge cd| - |a \wedge c| - |b| - |d|) - |b| - |d|$,
one has $\|a,c\| \leq \|ab,cd\| + |b| + |d|$.
Therefore, in particular, $|u| \leq \|\lambda u, \lambda v\| \leq \|\lambda u \omega, \lambda v \omega'\| + |\omega| + |\omega'|$, thus
$|u| \leq \|f(w \cdot c(q_1)), f(w \cdot c(q_2))\| + 2M_1$. But $\|w \cdot c(q_1), w \cdot c(q_2)\| \leq |c(q_1)| + |c(q_2)| \leq 2M_2$, thus
$\|f(w \cdot c(q_1)), f(w \cdot c(q_2))\| \leq K$ and therefore $|u| \leq K + 2M_1 = M$. □
The time is now ripe for the main proposition, which proves soundness and com- 
pleteness. 
Proposition 
If a transducer T represents a subsequential function f, then the algorithm
DeterminizeTransducer described in the previous section applied on T computes a subsequential
transducer $\tau$ representing the same function.
Proof
Lemma 4 shows that the algorithm always terminates if $|T|$ is subsequential.
Let us show that $dom(|\tau|) \subset dom(|T|)$. Let $w \in \Sigma^*$ s.t. w is not in $dom(|T|)$, then
$d(i,w) \cap F = \emptyset$. Thus, according to (ii) of Lemma 1, for all $(q,u) \in I \otimes w$, q is not in F,
thus $I \otimes w$ is not terminal and therefore w is not in $dom(\tau)$.
Conversely, let $w \in dom(|T|)$. There exists a $q_f \in F$ s.t. $|T|(w) = \delta(i,w,q_f)$ and s.t.
$q_f \in d(i,w)$. Therefore $|T|(w) = (I * w) \cdot ((I * w)^{-1} \cdot \delta(i,w,q_f))$ and according to (ii) of
Lemma 1, $(q_f, (I * w)^{-1} \cdot \delta(i,w,q_f)) \in I \otimes w$, and since $q_f \in F$, Lemma 3 shows that
$\rho(I \otimes w) = (I * w)^{-1} \cdot \delta(i,w,q_f)$, thus $|T|(w) = (I * w) \cdot \rho(I \otimes w) = |\tau|(w)$. □
9.3 Worst-Case Complexity 
In this section we give a worst-case upper bound of the size of the subsequential 
transducer in terms of the size of the input transducer. 
Let $L = \{w \in \Sigma^* \text{ s.t. } |w| \leq M\}$, where M is the bound defined in the proof
of Lemma 4. Since, according to Lemma 2, for each state set $Q'$, for each $q \in Q$, $Q'$
contains at most one pair $(q,w)$, the maximal number N of states built in the algorithm
is smaller than the sum of the number of functions from states to strings in L for each
state set, that is
$$N \leq \sum_{Q' \in 2^Q} |L|^{|Q'|}$$
we thus have $N \leq 2^{|Q|} \times |L|^{|Q|} = 2^{|Q|} \times 2^{|Q| \times \log_2 |L|}$ and therefore $N \leq 2^{|Q|(1 + \log |L|)}$.
Moreover,
$$|L| = 1 + |\Sigma| + \cdots + |\Sigma|^M = \frac{|\Sigma|^{M+1} - 1}{|\Sigma| - 1} \quad \text{if } |\Sigma| > 1$$
and $|L| = M + 1$ if $|\Sigma| = 1$. In this last formula, $M = K + 2M_1$, as described in Lemma 4.
Note that if $P = \max_{a \in \Sigma} |\delta(q,a,q')|$ is the maximal length of the simple transitions
emissions, $M_1 \leq |Q| \times P$, thus $M \leq K + 2 \times |Q| \times P$.
Therefore, if $|\Sigma| > 1$, the number of states N is bounded:
$$N \leq 2^{|Q| \times \left(1 + \log \frac{|\Sigma|^{K + 2 \times |Q| \times P + 1} - 1}{|\Sigma| - 1}\right)}$$
and if $|\Sigma| = 1$, $N \leq 2^{|Q| \times (1 + \log(K + 2 \times |Q| \times P + 1))}$.
10. Subsequentiality of Transformation-Based Systems 
The proof of correctness of the determinization algorithm and the fact that the algo- 
rithm terminates on the transducer encoding Brill's tagger show that the final function 
is subsequential and equivalent to Brill's original tagger. 
In this section, we prove in general that any transformation-based system, such as 
those used by Brill, is a subsequential function. In other words, any transformation- 
based system can be turned into a deterministic finite-state transducer. 
We define transformation-based systems as follows. 
Definition 
A transformation-based system is a finite sequence $(f_1, \ldots, f_n)$ of subsequential
functions whose domains are bounded.
Applying a transformation-based system consists of applying each function fi one 
after the other. Applying one function consists of looking for the first position in 
the input at which the function can be triggered. When the function is triggered, 
the longest possible string starting at that position is transformed according to this 
function. After the string is transformed, the process is iterated starting at the end of 
the previously transformed string. Then, the next function is applied. The program 
ends when all functions have been applied. 
It is not true that, in general, the local extension of a subsequential function is
subsequential.$^{18}$ For instance, consider the function $f_a$ of Figure 23.
18 However, the local extensions of the functions we had to compute were subsequential.
Figure 23 
Function fa. 
a:b a:b a:b 
The local extension of the function $f_a$ is not a function. In fact, consider the input
string daaaad; it can be decomposed either into $d \cdot aaa \cdot ad$ or into $da \cdot aaa \cdot d$. The first
decomposition leads to the output dbbbad, and the second one to the output dabbbd.
The intended use of the rules in the tagger defined by Brill is to apply each 
function from left to right. In addition, if several decompositions are possible, the one 
that occurs first is the one chosen. In our previous example, it means that only the 
output dbbbad is generated. 
This notion is now defined precisely. 
Let $\alpha$ be the rational function defined by $\alpha(a) = a$ for $a \in \Sigma$, $\alpha([) = \alpha(]) = \epsilon$ on the
additional symbols '[' and ']', with $\alpha$ such that $\alpha(u \cdot v) = \alpha(u) \cdot \alpha(v)$.
Definition
Let $Y \subset \Sigma^+$ and $X = \Sigma^* - \Sigma^* \cdot Y \cdot \Sigma^*$; a Y-decomposition of x is a string
$y \in X \cdot ([ \cdot Y \cdot ] \cdot X)^*$ s.t. $\alpha(y) = x$.
For instance, if $Y = dom(f_a) = \{aaa\}$, the set of Y-decompositions of x = daaaad is
{d[aaa]ad, da[aaa]d}.
Definition
Let < be a total order on $\Sigma$, and let $\bar{\Sigma} = \Sigma \cup \{[, ]\}$ be the alphabet $\Sigma$ with the two
additional symbols '[' and ']'. Let us extend the order < to $\bar{\Sigma}$ by $\forall a \in \Sigma$, '[' < a and
a < ']'. < defines a lexicographic order on $\bar{\Sigma}^*$ that we also denote <. Let $Y \subset \Sigma^+$
and $x \in \Sigma^*$; the minimal Y-decomposition of x is the Y-decomposition which is
minimal in $(\bar{\Sigma}^*, <)$.
For instance, the minimal $dom(f_a)$-decomposition of daaaad is d[aaa]ad. In fact,
d[aaa]ad < da[aaa]d.
Proposition 
Given $Y \subset \Sigma^+$ finite, the function $md_Y$ that to each $x \in \Sigma^*$ associates its minimal
Y-decomposition, is subsequential and total.
Proof 
Let dec be defined by $dec(w) = u \cdot [ \cdot v \cdot ] \cdot dec((uv)^{-1} \cdot w)$, where $u, v \in \Sigma^*$ are s.t. $v \in Y$,
$\exists v' \in \Sigma^*$ with $w = uvv'$ and $|u|$ is minimal among such strings, and $dec(w) = w$ if no
such u, v exists. The function $md_Y$ is total because the function dec always returns an
output that is a Y-decomposition of w.
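The function dec can be sketched as a left-to-right scan. The following Python illustration assumes Y is a finite set of non-empty strings and, when several elements of Y start at the same position, picks the longest one (the definition above leaves this choice open; the longest match is the one that is smallest in the order where a < ']'). The names `dec` and `strip_brackets` are hypothetical.

```python
from typing import Iterable


def dec(word: str, patterns: Iterable[str]) -> str:
    """Bracket factors of `word` from left to right: leftmost start, longest match."""
    ordered = sorted(set(patterns), key=len, reverse=True)   # try longest first
    pieces = []
    position = 0
    while position < len(word):
        match = next((p for p in ordered if p and word.startswith(p, position)), None)
        if match is None:
            pieces.append(word[position])       # symbol stays outside any bracket
            position += 1
        else:
            pieces.append("[" + match + "]")    # bracket the factor found here
            position += len(match)              # resume after the bracketed factor
    return "".join(pieces)


def strip_brackets(decomposition: str) -> str:
    """The erasing morphism: remove '[' and ']' and keep every other symbol."""
    return decomposition.replace("[", "").replace("]", "")
```

For Y = {aaa} and the word daaaad, the scan brackets the first occurrence of aaa and then copies the remaining ad, giving d[aaa]ad, and erasing the brackets recovers the input.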
We shall now prove that the function is rational and then that it has bounded 
variations; this will prove according to Theorem 1 that the function is subsequential. 
In the following $X = \Sigma^* - \Sigma^* \cdot Y \cdot \Sigma^*$. The transduction $T_Y$ that generates the set of
Y-decompositions is defined by
$$T_Y = Id_X \cdot (\epsilon/[ \, \cdot Id_Y \cdot \epsilon/] \, \cdot Id_X)^*$$
where $Id_X$ (resp. $Id_Y$) stands for the identity function on X (resp. Y). Furthermore,
Figure 24
Transduction $T_{\bar{\Sigma},>}$.
the transduction $T_{\bar{\Sigma},>}$ that to each string $w \in \bar{\Sigma}^*$ associates the set of strings strictly
greater than w, that is $T_{\bar{\Sigma},>}(w) = \{w' \in \bar{\Sigma}^* \mid w < w'\}$, is defined by the transducer of
Figure 24, in which $A = \{(x,x) \mid x \in \bar{\Sigma}\}$, $B = \{(x,y) \in \bar{\Sigma}^2 \mid x < y\}$, $C = \bar{\Sigma}^2$, $D = \{\epsilon\} \times \bar{\Sigma}$
and $E = \bar{\Sigma} \times \{\epsilon\}$.$^{19}$
Therefore, the right-minimal Y-decomposition function $md_Y$ is defined by
$md_Y = T_Y - (T_{\bar{\Sigma},>} \circ T_Y)$, which proves that $md_Y$ is rational.
Let $k > 0$. Let $K = 6 \times k + 6 \times M$, where $M = \max_{x \in Y} |x|$. Let $u, v \in \Sigma^*$ be s.t.
$\|u,v\| \leq k$. Let us consider two cases: (i) $|u \wedge v| \leq M$ and (ii) $|u \wedge v| > M$.
(i): $|u \wedge v| \leq M$, thus $|u|, |v| \leq |u \wedge v| + \|u,v\| \leq M + k$. Moreover, for each $w \in \Sigma^*$,
for each Y-decomposition $w'$ of w, $|w'| \leq 3 \times |w|$. In fact, Y doesn't contain $\epsilon$, thus the
number of [ (resp. ]) in $w'$ is smaller than $|w|$. Therefore, $|md_Y(u)|, |md_Y(v)| \leq 3 \times (M + k)$,
thus $\|md_Y(u), md_Y(v)\| \leq K$.
(ii): $u \wedge v = \lambda \cdot \omega$ with $|\omega| = M$. Let $\mu, \nu$ be s.t. $u = \lambda\omega\mu$ and $v = \lambda\omega\nu$. Let $\lambda'$,
$\omega'$, $\mu'$, $\lambda''$, $\omega''$ and $\nu''$ be s.t. $md_Y(u) = \lambda'\omega'\mu'$, $md_Y(v) = \lambda''\omega''\nu''$, $\alpha(\lambda') = \alpha(\lambda'') = \lambda$,
$\alpha(\omega') = \alpha(\omega'') = \omega$, $\alpha(\mu') = \mu$ and $\alpha(\nu'') = \nu$. Suppose that $\lambda' \neq \lambda''$, for instance
$\lambda' < \lambda''$. Let i be the first index s.t. $(\lambda')_i < (\lambda'')_i$.$^{20}$ We have two possible situations:
(ii.1) $(\lambda')_i = [$ and $(\lambda'')_i \in \Sigma$ or $(\lambda'')_i = ]$. In that case, since the length of the elements in
Y is smaller than $M = |\omega|$, one has $\lambda'\omega' = \lambda_1[\lambda_2]\lambda_3$ with $|\lambda_1| = i$, $\lambda_2 \in Y$ and $\lambda_3 \in \bar{\Sigma}^*$.
We also have $\lambda''\omega'' = \lambda_1\lambda'_2\lambda'_3$ with $\alpha(\lambda'_2) = \alpha(\lambda_2)$ and the first letter of $\lambda'_2$ is different
from [. Let $\lambda_4$ be a Y-decomposition of $\lambda_3\nu$, then $\lambda_1[\lambda_2]\lambda_4$ is a Y-decomposition of
v strictly smaller than $\lambda_1\lambda'_2\lambda'_3\nu'' = md_Y(v)$, which contradicts the minimality of $md_Y(v)$.
The second situation is (ii.2): $(\lambda')_i \in \Sigma$ and $(\lambda'')_i = ]$, then we have $\lambda'\omega' = \lambda_1[\lambda_2\lambda_3]\lambda_4$
s.t. $|\lambda_1[\lambda_2| = i$ and $\lambda''\omega'' = \lambda_1[\lambda_2]\lambda'_3\lambda'_4$ s.t. $\alpha(\lambda'_3) = \alpha(\lambda_3)$ and $\alpha(\lambda'_4) = \alpha(\lambda_4)$. Let $\lambda_5$ be
a Y-decomposition of $\nu''$, then $\lambda_1[\lambda_2\lambda_3]\lambda_5$ is a Y-decomposition of v strictly smaller
than $\lambda''\omega''\nu''$, which leads to the same contradiction. Therefore, $\lambda' = \lambda''$ and since
$|\mu'| + |\nu''| \leq 3 \times (|\mu| + |\nu|) = 3 \times \|u,v\| \leq 3 \times k$,
$\|md_Y(u), md_Y(v)\| \leq |\omega'| + |\omega''| + |\mu'| + |\nu''| \leq 2 \times M + 3 \times k \leq K$.
This proves that $md_Y$ has bounded variations and therefore that it
is subsequential. □
We can now define precisely what is the effect of a function when one applies it 
from left to right, as was done in the original tagger. 
19 This construction is similar to the transduction built within the proof of Eilenberg's cross section 
theorem (Eilenberg 1974). 
20 $(w)_i$ refers to the $i$th letter in w.
Definition 
If f is a rational function with bounded domain, $Y = dom(f) \subset \Sigma^+$, the right-minimal
local extension of f, denoted RmLocExt(f), is the composition of a right-minimal
Y-decomposition $md_Y$ with $Id_{\Sigma^*} \cdot ([/\epsilon \cdot f \cdot ]/\epsilon \cdot Id_{\Sigma^*})^*$.
RmLocExt being the composition of two subsequential functions, it is itself subse- 
quential; this proves the following final proposition, which states that given a rule- 
based system similar to Brill's system, one can build a subsequential transducer that 
represents it: 
Proposition 
If $(f_1, \ldots, f_n)$ is a sequence of subsequential functions with bounded domains and such
that $f_i(\epsilon) = \emptyset$, then $RmLocExt(f_1) \circ \cdots \circ RmLocExt(f_n)$ is subsequential.
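The behavior that this proposition captures can be simulated directly on strings. The Python sketch below is an illustration under stated assumptions, not the transducer construction: each rule is a single-valued function with a finite domain of non-empty factors, given as a dictionary; the longest factor is preferred at each position; and rules are applied in list order. The names `rm_loc_ext` and `apply_system` are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List


def rm_loc_ext(rule: Dict[str, str]) -> Callable[[str], str]:
    """Left-to-right application of one finite rule (dict: factor -> image)."""
    factors = sorted(rule, key=len, reverse=True)   # longest factor first

    def apply(word: str) -> str:
        out, position = [], 0
        while position < len(word):
            match = next((f for f in factors if f and word.startswith(f, position)), None)
            if match is None:
                out.append(word[position])          # symbol copied unchanged
                position += 1
            else:
                out.append(rule[match])             # factor replaced by its image
                position += len(match)              # resume after the factor
        return "".join(out)

    return apply


def apply_system(rules: List[Dict[str, str]], word: str) -> str:
    """Apply the rules one after the other, each from left to right."""
    return reduce(lambda current, rule: rm_loc_ext(rule)(current), rules, word)
```

With the single rule aaa to bbb, the input daaaad gives dbbbad only, which matches the left-to-right reading of the rule described earlier. This simulation rescans the input once per rule, whereas the composed deterministic transducer reads the input once regardless of the number of rules.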
We have proven in this section that our techniques apply to the class of transformation- 
based systems. We now turn our attention to the implementation of finite-state trans- 
ducers. 
11. Implementation of Finite-State Transducers 
Once the final finite-state transducer is computed, applying it to an input is straight- 
forward: it consists of following the unique sequence of transitions whose left labels 
correspond to the input. However, in order to have a complexity fully independent of 
the size of the grammar and in particular independent of the number of transitions 
at each state, one should carefully choose an appropriate representation for the trans- 
ducer. In our implementation, transitions can be accessed randomly. The transducer 
is first represented by a two-dimensional table whose rows are indexed by states and 
whose columns are indexed by the alphabet of all possible input letters. The content 
of the table at line q and at column a is the word w such that the transition from q 
with the input label a outputs w. Since only a few transitions are allowed from many 
states, this table is very sparse and can be compressed. This compression is achieved 
while maintaining random access using a procedure for sparse data tables following 
the method given by Tarjan and Yao (1979). 
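To illustrate how a sparse transition table can be compressed while keeping constant-time access, here is a simplified first-fit row-displacement sketch in Python. It is written in the spirit of sparse-table compression and is not claimed to be the exact procedure of Tarjan and Yao (1979) or the authors' implementation; the names `compress` and `lookup`, the integer symbol codes, and the example rows are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[int, Tuple[int, str]]    # column (symbol code) -> (next state, output)


def compress(rows: List[Row], width: int):
    """Overlay sparse rows in one array using first-fit row displacement."""
    base: List[int] = []                          # base[q] = offset of row q
    owner: List[Optional[int]] = []               # owner[k] = state that owns slot k
    value: List[Optional[Tuple[int, str]]] = []   # stored transitions
    for q, row in enumerate(rows):
        offset = 0
        # Slide the row to the right until none of its entries hits a used slot.
        while any(offset + c < len(owner) and owner[offset + c] is not None for c in row):
            offset += 1
        needed = offset + width
        if needed > len(owner):
            grow = needed - len(owner)
            owner.extend([None] * grow)
            value.extend([None] * grow)
        for c, entry in row.items():
            owner[offset + c], value[offset + c] = q, entry
        base.append(offset)
    return base, owner, value


def lookup(base, owner, value, q: int, c: int):
    """Random access: one addition and one ownership check."""
    k = base[q] + c
    return value[k] if owner[k] == q else None


# Example rows for a three-state transducer, with assumed codes a=0, h=1, e=2.
ROWS: List[Row] = [{0: (1, "")}, {1: (2, "bh"), 2: (2, "ce")}, {}]
```

In this example the three rows of the 3 x 3 table interleave into three slots, and the ownership check is what distinguishes a real transition from a slot borrowed by another state.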
12. Conclusion 
The techniques described in this paper are more general than the problem of part-of- 
speech tagging and are applicable to the class of problems dealing with local transfor- 
mation rules. 
We showed that any transformation-based program can be transformed into a 
deterministic finite-state transducer. This yields optimal-time implementations of
transformation-based programs.
As a case study, we applied these techniques to the problem of part-of-speech 
tagging and presented a finite-state tagger that requires n steps to tag a sentence of 
length n, independently of the number of rules and the length of the context they 
require. We achieved this result by representing the rules acquired for Brill's tagger 
as nondeterministic finite-state transducers. We composed each of these nondetermin- 
istic transducers and turned the resulting transducer into a deterministic transducer. 
The resulting deterministic transducer yields a part-of-speech tagger that operates in 
optimal time in the sense that the time to assign tags to a sentence corresponds to 
the time required to follow a single path in this deterministic finite-state machine. The 
tagger outperforms in speed both Brill's tagger and stochastic taggers. Moreover, the 
finite-state tagger inherits from the rule-based system its compactness compared with 
stochastic taggers. We also proved the correctness and the generality of the methods. 
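The linear-time behavior can be illustrated on a toy deterministic transducer. The sketch below is not the tagger itself: the function apply_transducer, the tables DELTA and FINAL, and the rule they encode (rewrite "ab" as "X" over the alphabet {a, b}) are assumptions chosen only to show that one transition is followed per input symbol, with any delayed output emitted at the end of the input.

```python
from typing import Dict, Tuple

# Assumed toy machine: (state, input symbol) -> (output word, next state).
# State 1 means "an 'a' has been read but not yet emitted".
DELTA: Dict[Tuple[int, str], Tuple[str, int]] = {
    (0, "a"): ("", 1),    # hold the 'a' back: it may start "ab"
    (0, "b"): ("b", 0),
    (1, "a"): ("a", 1),   # emit the held 'a', hold the new one
    (1, "b"): ("X", 0),   # "ab" recognized, emit its replacement
}
# Output emitted when the input ends in a given state.
FINAL: Dict[int, str] = {0: "", 1: "a"}


def apply_transducer(delta, final_out, start: int, symbols: str):
    """Follow the single path labeled by the input; return (output, steps)."""
    q = start
    out = []
    steps = 0
    for a in symbols:
        w, q = delta[(q, a)]   # exactly one table access per input symbol
        out.append(w)
        steps += 1
    out.append(final_out.get(q, ""))  # flush any delayed output
    return "".join(out), steps
```

For an input of length n the loop runs n times, whatever the number of rules that were compiled into the transition table.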
We believe that this finite-state tagger will also be found useful when combined 
with other language components, since it can be naturally extended by composing 
it with finite-state transducers that could encode other aspects of natural language 
syntax. 
Acknowledgments 
We thank Eric Brill for providing us with 
the code of his tagger and for many useful 
discussions. We also thank Aravind K. 
Joshi, Mark Liberman, and Mehryar Mohri 
for valuable discussions. We thank the 
anonymous reviewers for many helpful 
comments that led to improvements in both 
the content and the presentation of this 
paper. 
References 
Brill, Eric (1992). "A simple rule-based part 
of speech tagger." In Proceedings, Third 
Conference on Applied Natural Language 
Processing. Trento, Italy, 152-155. 
Brill, Eric (1994). "A report of recent 
progress in transformation error-driven 
learning." In Proceedings, Tenth National 
Conference on Artificial Intelligence 
(AAAI-94). Seattle, Washington, 722-727. 
Choffrut, Christian (1977). "Une
caractérisation des fonctions séquentielles
et des fonctions sous-séquentielles en tant
que relations rationnelles." Theoretical
Computer Science, 5, 325-338.
Choffrut, Christian (1978). Contribution à
l'étude de quelques familles remarquables de
fonctions rationnelles. Doctoral dissertation,
Université Paris VII (Thèse d'Etat).
Chomsky, Noam (1964). Syntactic Structures. 
Mouton. 
Church, Kenneth Ward (1988). "A stochastic 
parts program and noun phrase parser for 
unrestricted text." In Proceedings, Second 
Conference on Applied Natural Language 
Processing. Austin, Texas, 136-143. 
Clemenceau, David (1993). Structuration du 
lexique et reconnaissance de mots dérivés.
Doctoral dissertation, Université Paris 7.
Cutting, Doug; Kupiec, Julian; Pedersen,
Jan; and Sibun, Penelope (1992). "A 
practical part-of-speech tagger." In 
Proceedings, Third Conference on Applied 
Natural Language Processing. Trento, Italy, 
133-140. 
DeRose, S. J. (1988). "Grammatical category 
disambiguation by statistical 
optimization." Computational Linguistics, 
14, 31-39. 
Eilenberg, Samuel (1974). Automata, 
Languages, and Machines. Academic Press. 
Elgot, C. C., and Mezei, J. E. (1965). "On 
relations defined by generalized finite 
automata." IBM Journal of Research and 
Development, 9, 47-65. 
Francis, W. Nelson, and Kučera, Henry
(1982). Frequency Analysis of English Usage. 
Houghton Mifflin. 
Kaplan, Ronald M., and Kay, Martin (1994). 
"Regular models of phonological rule 
systems." Computational Linguistics, 20(3), 
331-378. 
Karttunen, Lauri; Kaplan, Ronald M.; and 
Zaenen, Annie (1992). "Two-level 
morphology with composition." In 
Proceedings, 14th International Conference on
Computational Linguistics (COLING-92). 
Nantes, France, 141-148. 
Kupiec, J. M. (1992). "Robust part-of-speech 
tagging using a hidden Markov model." 
Computer Speech and Language, 6, 225-242. 
Laporte, Eric (1993). "Phonétique et
transducteurs." Technical report,
Université Paris 7, June.
Merialdo, Bernard (1990). "Tagging text 
with a probabilistic model." Technical 
Report RC 15972, IBM Research Division. 
Mohri, Mehryar (1994a). "Minimisation of 
sequential transducers." In Proceedings, 
Fifth Annual Symposium on Combinatorial 
Pattern Matching. Lecture Notes in 
Computer Science, Springer-Verlag. 
Mohri, Mehryar (1994b). "On some 
applications of finite-state automata 
theory to natural language processing." 
Technical report, Institut Gaspard Monge. 
Pereira, Fernando C. N.; Riley, Michael; and 
Sproat, Richard W. (1994). "Weighted 
rational transductions and their 
application to human language 
processing." In Human Language Technology 
Workshop. 262-267. Morgan Kaufmann. 
Revuz, Dominique (1991). Dictionnaires et 
lexiques, méthodes et algorithmes. Doctoral
dissertation, Université Paris 7.
Roche, Emmanuel (1993). Analyse syntaxique 
transformationnelle du français par
transducteurs et lexique-grammaire. Doctoral
dissertation, Université Paris 7.
Schützenberger, Marcel Paul (1977). "Sur
une variante des fonctions séquentielles."
Theoretical Computer Science, 4, 47-57. 
Silberztein, Max (1993). Dictionnaires 
Electroniques et Analyse Lexicale du 
Français--Le Système INTEX. Masson.
Tapanainen, Pasi, and Voutilainen, Atro 
(1993). "Ambiguity resolution in a 
reductionistic parser." In Proceedings, Sixth 
Conference of the European Chapter of the 
ACL. Utrecht, Netherlands, 394-403. 
Tarjan, Robert Endre, and Yao, Andrew
Chi-Chih (1979). "Storing a sparse table."
Communications of the ACM, 22(11), 
606-611. 
Weischedel, Ralph; Meteer, Marie; Schwartz, 
Richard; Ramshaw, Lance; and Palmucci, 
Jeff (1993). "Coping with ambiguity and 
unknown words through probabilistic 
models." Computational Linguistics, 19(2), 
359-382. 

