c© 2002 Association for Computational Linguistics
Incremental Construction and
Maintenance of Minimal Finite-State
Automata
Rafael C. Carrasco
∗
Mikel L. Forcada
†
Universitat d’Alacant Universitat d’Alacant
Daciuk et al. [Computational Linguistics 26(1):3–16 (2000)] describe a method for constructing
incrementally minimal, deterministic, acyclic finite-state automata (dictionaries) from sets of
strings. But acyclic finite-state automata have limitations: For instance, if one wants a linguistic
application to accept all possible integer numbers or Internet addresses, the corresponding finite-
state automaton has to be cyclic. In this article, we describe a simple and equally efficient method
for modifying any minimal finite-state automaton (be it acyclic or not) so that a string is added
to or removed from the language it accepts; both operations are very important when dictionary
maintenance is performed and solve the dictionary construction problem addressed by Daciuk
et al. as a special case. The algorithms proposed here may be straightforwardly derived from
the customary textbook constructions for the intersection and the complementation of finite-
state automata; the algorithms exploit the special properties of the automata resulting from the
intersection operation when one of the finite-state automata accepts a single string.
1. Introduction
In a recent paper in this journal, Daciuk et al. (2000) describe two methods for con-
structing incrementally minimal, deterministic, acyclic finite-state automata (dictio-
naries) from sets of strings: The first method adds strings in dictionary order, and the
second one is for unsorted data. Adding an entry is an important dictionary mainte-
nance operation, but so is removing an entry from the dictionary, for example, if it
is found to be incorrect. Since ordering cannot obviously be expected in the removal
case, we will focus on the second, more general problem (a solution for which has
already been sketched by Revuz [2000]).
But dictionaries or acyclic finite automata have limitations: For instance, if one
wants an application to accept all possible integer numbers or Internet addresses, the
corresponding finite-state automaton has to be cyclic. In this article, we show a simple
and equally efficient method for modifying any minimal finite-state automaton (be it
acyclic or not) so that a string is added to or removed from the language it accepts. The
algorithm may be straightforwardly derived from customary textbook constructions
for the intersection and the complementation of finite-state automata; the resulting
algorithm solves the dictionary construction problem addressed by Daciuk et al.’s
(2000) second algorithm as a special case.
∗ Departament de Llenguatges i Sistemes Inform`atics, Universitat d’Alacant, E-03071 Alacant, Spain.
E-mail: carrasco@dlsi.ua.es
† Departament de Llenguatges i Sistemes Inform`atics, Universitat d’Alacant, E-03071 Alacant, Spain.
E-mail: mlf@dlsi.ua.es.
208
Computational Linguistics Volume 28, Number 2
This article has the following parts. In Section 2, we give some necessary math-
ematical preliminaries. The minimal automata resulting from adding or removing a
string are described in detail in Section 3; the algorithms are described in Section 4.
In Section 5, one addition and one removal example are explained in detail, and some
closing remarks are given in Section 6.
2. Mathematical Preliminaries
2.1 Finite-State Automata and Languages
As in Daciuk et al. (2000), we will define a deterministic finite-state automaton as
M =(Q,Σ, δ, q
0
, F), where Q is a finite set of states, q
0
∈ Q is the start state, F ⊆ Q is
a set of accepting states, Σ is a finite set of symbols called the alphabet, and δ: Q ×
Σ → Q is the next-state mapping. In this article, we will define δ as a total mapping;
the corresponding finite-state automaton will be called complete (Revuz 2000). This
involves no loss of generality, as any finite-state automaton may be made complete
by adding a new absorption state ⊥ to Q, so that all undefined transitions point to it
and δ(⊥, a)=⊥ for all a ∈ Σ. Using complete finite-state automata is convenient for
the theoretical discussion presented in this article; real implementations of automata
and the corresponding algorithms need not contain an explicit representation of the
absorption state and its incoming and outgoing transitions.
For complete finite-state automata, the extended mapping δ
∗
: Q × Σ
∗
→ Q (the
extension of δ for strings) is defined simply as
δ
∗
(q, epsilon1)=q
δ
∗
(q, ax)=δ
∗
(δ(q, a), x) (1)
for all a ∈ Σ and x ∈ Σ
∗
, with epsilon1 the empty or null string. The language accepted by
automaton M
L(M)={w ∈ Σ
∗
: δ
∗
(q
0
, w) ∈ F} (2)
and the right language of state q
vector
L(q)={x ∈ Σ
∗
: δ
∗
(q, x) ∈ F} (3)
are defined as in Daciuk et al. (2000).
2.2 Single-String Automaton
We also find it convenient to define the (complete) single-string automaton for string
w, denoted M
w
=(Q
w
,Σ, δ
w
, q
0w
, F
w
), such that L(M
w
)={w}. This automaton has
Q
w
= Pr(w)∪{⊥
w
}, where Pr(w) is the set of all prefixes of w and ⊥
w
is the absorption
state, F
w
= {w}, and q
0w
= epsilon1 (note that nonabsorption states in Q
w
will be named after
the corresponding prefix of w). The next-state function is defined as follows
δ(x, a)=
braceleftBigg
xa if x, xa ∈ Pr(w)
⊥
w
otherwise
(4)
Note that the single-string automaton for a string w has |Q
w
| = |w|+ 2 states.
2.3 Operations with Finite-State Automata
2.3.1 Intersection Automaton. Given two finite-state automata M
1
and M
2
, it is easy
to build an automaton M so that L(M)=L(M
1
)∩L(M
2
). This construction is found
209
Carrasco and Forcada Incremental Construction of Minimal FSA
in formal language theory textbooks (Hopcroft and Ullman 1979, page 59) and is
referred to as standard in papers (Karakostas, Viglas, and Lipton 2000). The (complete)
intersection automaton has Q = Q
1
×Q
2
, q
0
=(q
01
, q
02
), F = F
1
×F
2
, and δ((q
1
, q
2
), a)=
(δ
1
(q
1
, a), δ
2
(q
2
, a)) for all a ∈ Σ, q
1
∈ Q
1
and q
2
∈ Q
2
.
2.3.2 Complementary Automaton. Given a complete finite-state automaton M,itis
easy to build its complementary automaton
¯
M so that L(
¯
M)=Σ
∗
−L(M); the only
change is the set of final states:
¯
F = Q − F (Hopcroft and Ullman 1979, page 59). In
particular, the complementary single-string automaton M
−w
accepting Σ
∗
−{w} is
identical to M
w
except that F
−w
= Q −{w}.
2.3.3 Union Automaton. The above constructions may be combined easily to obtain a
construction to build, from two complete automata M
1
and M
2
, the (complete) union
automaton M such that L(M)=L(M
1
) ∪L(M
2
). It suffices to consider that, for any
two languages on Σ
∗
, L
1
∪ L
2
=Σ
∗
−(Σ
∗
− L
1
)∩(Σ
∗
− L
2
). The resulting automaton
M is identical to the intersection automaton defined above except that F =(F
1
×Q
2
)∪
(Q
1
× F
2
).
3. Adding and Removing a String
3.1 Adding a String
Given a (possibly cyclic) minimal complete finite-state automaton M, it is easy to
build a new complete automaton M
prime
accepting L(M
prime
)=L(M)∪{w} by applying the
union construct defined above to M and the single-string automaton M
w
. The resulting
automaton M
prime
=(Q
prime
,Σ, δ
prime
, q
prime
0
, F
prime
), which may be minimized very easily (see below), has
four kinds of states in Q
prime
:
• States of the form (q,⊥
w
) with q ∈ Q −{⊥}, equivalent to those
nonabsorption states of M that are not reached by any prefix of w; they
will be called intact states, because they have the same transition
structure as their counterparts in M (that is, if δ(q, a)=q
prime
, then
δ
prime
((q,⊥
w
), a)=(q
prime
,⊥
w
)) and belong to F
prime
if q ∈ F. As a result, they have
exactly the same right languages,
vector
L((q,⊥
w
)) =
vector
L(q), because all of their
outgoing transitions go to other intact states. Furthermore, each state
(q,⊥
w
) has a different right language; therefore, no two intact states will
ever be merged into one by minimization (intact states may, however, be
eliminated, if they become unreachable, as we will describe below). For
large automata (dictionaries) M, these are the great majority of states (the
number of intact states ranges between |Q|−|w|−1 and |Q|); therefore, it
will be convenient in practice to consider M
prime
as a modified version of M,
and it will be treated as such in the algorithms found in this article.
• States of the form (q, x) with q ∈ Q −{⊥} and x ∈ Pr(w), such that
δ
∗
(q
0
, x)=q; they will be called cloned states, inspired by the
terminology in Daciuk et al. (2000); the remaining states in
(Q −{⊥})× Pr(w)—the great majority of states in Q × Q
w
—may safely
be discarded because they are unreachable from the new start state
q
prime
0
=(q
0
, epsilon1), which itself is a cloned state. Cloned states are modified
versions of the original states q ∈ Q −{⊥}: All of their outgoing
transitions point to the corresponding intact states in Q
prime
, that is,
(δ(q, a),⊥
w
), except for the transition with symbol a : xa ∈ Pr(w), which
210
Computational Linguistics Volume 28, Number 2
now points to the corresponding cloned state (δ(q, a), xa), that is,
δ
prime
((q, x), a)=
braceleftBigg
(δ(q, a), xa) if xa ∈ Pr(w)
(δ(q, a),⊥
w
) otherwise
(5)
Cloned states are in F
prime
if the corresponding original states are in F;in
addition, if there is a cloned state of the form (q, w), then it is in F
prime
. There
are at most |w|+ 1 cloned states.
• States of the form (⊥, x), with x ∈ Pr(w). These states will be called
queue states; states of this form appear when the string w is not in L(M)
(the pertinent case, because we are adding it) and only if in the original
automaton δ
∗
(q
0
, x)=⊥ for some x ∈ Pr(w). Only the final queue state
(⊥, w)—if it exists—is in F
prime
. There are at most |w| queue states.
• The new absorption state ⊥
prime
=(⊥,⊥
w
) /∈ F.
This automaton has to be minimized; because of the nature of the construction algo-
rithm, however, minimization may be accomplished in a small number of operations.
It is not difficult to show that minimization may be performed by initializing a list R
called the register (Daciuk et al. 2000) with all of the intact states and then testing,
one by one, queue and cloned states (starting with the last queue state (⊥, w) or, if it
does not exist, the last clone state (q, w), and descending in Pr(w)) against states in
the register and adding them to the register if they are not found to be equivalent to
a state in R. (Performing this check backwards avoids having to test the equivalence
of states by visiting their descendants recursively: see the end of Section 4.1.) Mini-
mization (including the elimination of unreachable states in M
prime
) appears in Section 4
as part of the string addition and removal algorithms.
3.2 Removing a String
Again, given a (possibly cyclic) minimal complete finite-state automaton M, it is easy
to build a new complete automaton M
prime
accepting L(M
prime
)=L(M) −{w} = L(M) ∩
(Σ
∗
−{w}) by applying the intersection construct defined above to M and M
−w
. The
resulting automaton has the same sets of reachable states in Q
prime
as in the case of adding
string w and therefore the same close-to-minimality properties; since w is supposed
to be in L(M), however, no queue states will be formed. (Note that, if w /∈L(M),
a nonaccepting queue with all states eventually equivalent to ⊥
prime
=(⊥,⊥
w
) may be
formed.) The accepting states in F
prime
are intact states (q,⊥
w
) and cloned states (q, x)
with q ∈ F, except for state (q, w). Minimization may be performed analogously to the
string addition case.
4. Algorithms
4.1 Adding a String
Figure 1 shows the algorithm that may be used to add a string to an existing automa-
ton, which follows the construction in Section 3.1. The resulting automaton is viewed
as a modification of the original one: Therefore, intact states are not created; instead,
unreachable intact states are eliminated later. The register R of states not needing
minimization is initialized with Q. The algorithm has three parts:
• First, the cloned and queue states are built and added to Q by using
function clone() for all prefixes of w. The function returns a cloned state
211
Carrasco and Forcada Incremental Construction of Minimal FSA
(with all transitions created), if the argument is a nonabsorption state in
Q −{⊥}, or a queue state, if it operates on the absorption state ⊥∈Q.
• Second, those intact states that have become unreachable as a result of
designating the cloned state q
prime
0
as the new start state are removed from Q
and R, and the start state is replaced by its clone. Unreachable states are
simply those having no incoming transitions as constructed by the
algorithm or as a consequence of the removal of other unreachable
states; therefore, function unreachable() simply has to check for the
absence of incoming transitions. Note that only intact states (q,⊥
w
)
corresponding to q such that δ
∗
(q
0
, x)=q for some x ∈ Pr(w) may
become unreachable as a result of having been cloned.
• Third, the queue and cloned states are checked (starting with the last
state) against the register using function replace or register(), which is
essentially the same as the nonrecursive version in the second algorithm
in Daciuk et al. (2000) and is shown in Figure 2. If argument state q is
found to be equivalent to a state p in the register R, function merge(p, q)
is called to redirect into p those transitions coming into q; if not,
argument state q is simply added to the register. Equivalence is checked
by function equiv(), shown in Figure 3, which checks for the equivalence
of states by comparing (1) whether both states are accepting or not, and
(2) whether the corresponding outgoing transitions lead to the same state
in R. Note that outgoing transitions cannot lead to equivalent states, as
there are no pairs of different equivalent states in the register
(∀p, q ∈ R, equiv(p, q) ⇒ p = q) and backwards minimization guarantees
that the state has no transitions to unregister states.
Finally, the new (minimal) automaton is returned. In real implementations, absorption
states are not explicitly stored; this results in small differences in the implementations
of the functions clone() and equiv().
4.2 Removing a String
The algorithm for removing a string from the language accepted by an automaton M
prime
differs from the previous algorithm only in that the line
F ← F −{q
last
}
has to be added after the first end for. Since the string removal algorithm will usually
be asked to remove a string that was in L(M), function clone() will usually generate
only cloned states and no queue states (see Section 3.2 for the special case w /∈ L(M)).
5. Examples
5.1 Adding a String
Assume that we want to add the string bra to the automaton in Figure 4, which accepts
the set of strings (ba)
+
∪{bar} (for clarity, in all automata, the absorption state and all
transitions leading to it will not be drawn). The single-string automaton for string bra
is shown in Figure 5. Application of the first stages of the string addition algorithm
leads to the (unminimized) automaton in Figure 6. The automaton has, in addition to
the set of intact states {(0,⊥
w
), ...,(5,⊥
w
)}, two cloned states ((0, epsilon1) and (1,b)) and two
queue states ((⊥,br) and (⊥,bra)). As a consequence of the designation of (0, epsilon1) as the
212
Computational Linguistics Volume 28, Number 2
algorithm addstring
Input: M =(Q,Σ, δ, q
0
, F) (minimal, complete), w ∈ Σ
∗
Output: M
prime
=(Q
prime
,Σ, δ
prime
, q
prime
0
, F
prime
) minimal, complete, and such that L(M
prime
)=L(M)∪{w}
R ← Q [initialize register]
q
prime
0
← clone(q
0
) [clone start state]
q
last
← q
prime
0
for i = 1to|w|
q ← clone(δ
∗
(q
0
, w
1
···w
i
)) [create cloned and queue states;
add clones of accepting states to F]
δ(q
last
, w
i
) ← q
q
last
← q
end for
i ← 1
q
current
← q
0
while(i ≤|w| and unreachable(q
current
))
q
next
← δ(q
current
, w
i
)
Q ← Q −{q
current
} [remove unreachable state from Q
and update transitions in δ]
R ← R −{q
current
} [remove also from register]
q
current
← q
next
i ← i + 1
end while
if unreachable(q
current
)
Q ← Q −{q
current
}
R ← R −{q
current
}
end if
q
0
← q
prime
0
[replace start state]
for i = |w| downto 1
replace or register(δ
∗
(q
0
, w
1
···w
i
)) [check queue and cloned states one by one]
end for
return M =(Q,Σ, δ, q
0
, F)
end algorithm
Figure 1
Algorithm to add a string w to the language accepted by a finite-state automaton while
keeping it minimal.
function replace or register(q)
if ∃p ∈ R : equiv(p, q) then
merge(p, q)
else
R ← R ∪{q}
end if
end function
Figure 2
The function replace or register().
213
Carrasco and Forcada Incremental Construction of Minimal FSA
function equiv(p, q)
if (p ∈ F ∧ q /∈ F)∨(p /∈ F ∧ q ∈ F) return false
for all symbols a ∈ Σ
if δ(p, a) negationslash= δ(q, a) return false
end for
return true
end function
Figure 3
The function equiv(p, q).
Figure 4
Minimal automaton accepting the set of strings (ba)
+
∪{bar}.
Figure 5
Single-string automaton accepting string bra.
0,⊥
w
1,⊥
w
2,⊥
w
3,⊥
w
4,⊥
w
5,⊥
w
0,ε 1,b
⊥,br ⊥,bra
b a
r
b
a
b
b
a
r
a
Figure 6
Unminimized automaton accepting the set (ba)
+
∪{bar}∪{bra}. Shadowed states (0,⊥
w
)
and (1,⊥
w
) have become unreachable (have no incoming transitions) and are eliminated in
precisely that order.
214
Computational Linguistics Volume 28, Number 2
Figure 7
Minimal automaton accepting the set (ba)
+
∪{bar}∪{bra}.
ε b ba bab baba
b a b a
Figure 8
Single-string automaton accepting the string baba.
0,⊥
w
1,⊥
w
2,⊥
w
3,⊥
w
4,⊥
w
5,⊥
w
6,⊥
w
0,ε 1,b 2,ba 4,bab 6,baba
b
a
r
b
r
a
a
b
b
r
a
r
b a
b
Figure 9
Unminimized automaton accepting the set (ba)
+
∪{bar}∪{bra}−{baba}. Shadowed states
(0,⊥
w
), (1,⊥
w
), and (2,⊥
w
) have become unreachable (have no incoming transitions) and are
eliminated in precisely that order.
new start state, shadowed states (0,⊥
w
) and (1,⊥
w
) become unreachable (have no in-
coming transitions) and are eliminated in precisely that order in the second stage of the
algorithm. The final stage of the algorithm puts intact states into the register and tests
queue and cloned states for equivalence with states in the register. The first state tested
is (⊥,bra), which is found to be equivalent to (3,⊥
w
); therefore, transitions coming
into (⊥,bra) are made to point to (3,⊥
w
). Then, states (⊥,br), (1,b) and (0, epsilon1) are tested
in order, found to have no equivalent in the register, and added to it. The resulting
minimal automaton, after a convenient renumbering of states, is shown in Figure 7.
5.2 Removing a String
Now let us consider the case in which we want to remove string baba from the
language accepted by the automaton in Figure 7 (the single-string automaton for baba
is shown in Figure 8). The automaton resulting from the application of the initial
(construction) stages of the automaton is shown in Figure 9. Note that state (6,baba) is
marked as nonaccepting, because we are removing a string. Again, as a consequence
of the designation of (0, epsilon1) as the new start state, shadowed states (0,⊥
w
), (1,⊥
w
),
215
Carrasco and Forcada Incremental Construction of Minimal FSA
01
2
3
4
56
78
b
r
a
a
r
b a
b
a
b
Figure 10
Minimal automaton accepting the set (ba)
+
∪{bar}∪{bra}−{baba}.
and (2,⊥
w
) become unreachable (have no incoming transitions) and are eliminated in
precisely that order in the second stage of the algorithm. The last stage of the algorithm
puts all intact states into the register, checks cloned states (6,baba), (4,bab), (2,ba),
(1,b) and (0, epsilon1) (no queue states, since baba is accepted by the automaton in Figure 7),
and finds none of them to be equivalent to those in the register, to which they are
added. The resulting minimal automaton is shown in Figure 10.
6. Concluding Remarks
We have derived, from basic results of language and automata theory, a simple method
for modifying a minimal (possibly cyclic) finite-state automaton so that it recognizes
one string more or one string less while keeping the finite-state automaton minimal.
These two operations may be applied to dictionary construction and maintenance
and generalize the result in Daciuk et al.’s (2000) second algorithm (incremental con-
struction of acyclic finite-state automata from unsorted strings) in two respects, with
interesting practical implications:
• The method described here allows for the addition of strings to the
languages of cyclic automata (in practice, it may be convenient to have
cycles in dictionaries if we want them to accept, for example, all possible
integer numbers or Internet addresses). In this respect, the algorithm
presented also generalizes the string removal method sketched by Revuz
(2000) for acyclic automata.
• Removal of strings is as easy as addition. This means that, for example,
the detection of an erroneous entry in the dictionary does not imply
having to rebuild the dictionary completely.
The asymptotic time complexity of the algorithms is in the same class (O(|Q||w|))as
that in Daciuk et al. (2000), because the slowest part of the algorithm (the last one)
checks all queue and cloned states (O(|w|)) against all states of the register (O(|Q|)). As
suggested by one of the reviewers of this article, an improvement in efficiency may be
obtained by realizing that, in many cases, cloned states corresponding to the shortest
prefixes of string w are not affected by minimization, because their intact equivalents
have become unreachable and therefore have been removed from the register; the
solution lies in identifying these states and not cloning them (for example, Daciuk et
al.’s [2000] and Revuz’s [2000] algorithms do not clone them).
As for the future, we are working on an adaptation of this algorithm for the main-
tenance of morphological analyzers and generators using finite-state nondeterministic
letter transducers (Roche and Schabes 1997; Garrido et al. 1999).
216
Computational Linguistics Volume 28, Number 2
Acknowledgments
The work reported in this article has been
funded by the Spanish Comisi´on
Interministerial de Ciencia y Tecnolog´ıa
through grant TIC2000-1599. We thank the
two reviewers for their suggestions and
Colin de la Higuera for his comments on
the manuscript.

References
Daciuk, Jan, Stoyan Mihov, Bruce W.
Watson, and Richard E. Watson. 2000.
Incremental construction of minimal
acyclic finite-state automata.
Computational Linguistics, 26(1):3–16.
Garrido, Alicia, Amaia Iturraspe, Sandra
Montserrat, Herm´ınia Pastor, and
Mikel L. Forcada. 1999. A compiler for
morphological analysers and generators
based on finite-state transducers.
Procesamiento del Lenguaje Natural,
25:93–98.
Hopcroft, John E. and Jeffrey D. Ullman.
1979. Introduction to Automata Theory,
Languages, and Computation.
Addison-Wesley, Reading, MA.
Karakostas, George, Anastasios Viglas, and
Richard J. Lipton. 2000. On the
complexity of intersecting finite state
automata. In Proceedings of the 15th Annual
IEEE Conference on Computational
Complexity (CoCo’00), pages 229–234.
Revuz, Dominique. 2000. Dynamic acyclic
minimal automaton. In Preproceedings of
CIAA 2000: Fifth International Conference on
Implementation and Application of Automata,
pages 226–232, London, Ontario, July
24–25.
Roche, Emmanuel and Yves Schabes. 1997.
Introduction. In Emmanuel Roche and
Yves Schabes, editors, Finite-State Language
Processing. MIT Press, pages 1–65.
