Proceedings of EACL '99 
Transducers from Rewrite Rules with Backreferences 
Dale Gerdemann 
University of Tuebingen 
K1. Wilhelmstr. 113 
D-72074 Tuebingen 
dg@sf s. nphil, uni-tuebingen, de 
Gertjan van Noord 
Groningen University 
PO Box 716 
NL 9700 AS Groningen 
vannoord@let, rug. nl 
Abstract 
Context sensitive rewrite rules have been 
widely used in several areas of natural 
language processing, including syntax, 
morphology, phonology and speech pro- 
cessing. Kaplan and Kay, Karttunen, 
and Mohri & Sproat have given vari- 
ous algorithms to compile such rewrite 
rules into finite-state transducers. The 
present paper extends this work by al- 
lowing a limited form of backreferencing 
in such rules. The explicit use of backref- 
erencing leads to more elegant and gen- 
eral solutions. 
1 Introduction 
Context sensitive rewrite rules have been widely 
used in several areas of natural language pro- 
cessing. Johnson (1972) has shown that such 
rewrite rules are equivalent to finite state trans- 
ducers in the special case that they are not al- 
lowed to rewrite their own output. An algo- 
rithm for compilation into transducers was pro- 
vided by Kaplan and Kay (1994). Improvements 
and extensions to this algorithm have been pro- 
vided by Karttunen (1995), Karttunen (1997), 
Karttunen (1996) and Mohri and Sproat (1996). 
In this paper, the algorithm will be ex- 
tended to provide a limited form of back- 
referencing. Backreferencing has been im- 
plicit in previous research, such as in the 
"batch rules" of Kaplan and Kay (1994), brack- 
eting transducers for finite-state parsing (Kart- 
tunen, 1996), and the "LocalExtension" operation 
of Roche and Schabes (1995). The explicit use of 
backreferencing leads to more elegant and general 
solutions. 
Backreferencing is widely used in editors, script- 
ing languages and other tools employing regular 
expressions (Friedl, 1997). For example, Emacs 
uses the special brackets \( and \) to capture 
strings along with the notation \n to recall the nth 
such string. The expression \(a*\)b\l matches 
strings of the form anba n. Unrestricted use of 
backreferencing thus can introduce non-regular 
languages. For NLP finite state calculi (Kart- 
tunen et al., 1996; van Noord, 1997) this is unac- 
ceptable. The form of backreferences introduced 
in this paper will therefore be restricted. 
The central case of an allowable backreference 
is: 
x ~ T(x)/A__p (1) 
This says that each string x preceded by A and 
followed by p is replaced by T(x), where A and p 
are arbitrary regular expressions, and T is a trans- 
ducer) This contrasts sharply with the rewriting 
rules that follow the tradition of Kaplan & Kay: 
¢ ~ ¢l:~__p (2) 
In this case, any string from the language ¢ is 
replaced by any string independently chosen from 
the language ¢. 
We also allow multiple (non-permuting) back- 
references of the form: 
~The syntax at this point is merely suggestive. As 
an example, suppose that T,c,. transduces phrases into 
acronyms. Then 
x =¢~ T=cr(x)/(abbr)__(/abbr> 
would transduce <abbr>non-deterministic finite 
automaton</abbr> into <abbr>NDFA</abbr>. 
To compare this with a backreference in Perl, 
suppose that T~cr is a subroutine that con- 
verts phrases into acronyms and that R~¢,. is 
a regular expression matching phrases that can 
be converted into acronyms. Then (ignoring 
the left context) one can write something like: 
s/(R~c,.)(?=(/ASBR))/T,,c~($1)/ge;. The backrefer- 
ence variable, $1, will be set to whatever string R~c,. 
matches. 
126 
Proceedings of EACL '99 
xlx2.., xn ~ Tl(xl)T2(x2)...Tn(x,O/A--p (3) 
Since transducers are closed under concatenation, 
handling multiple backreferences reduces to the 
problem of handling a single backreference: 
x ~ (TI" T2..... T,O(x)/A--p (4) 
A problem arises if we want capturing to fol- 
low the POSIX standard requiring a longest- 
capture strategy. ~riedl (1997) (p. 117), for 
example, discusses matching the regular expres- 
sion (toltop)(olpolo)?(gicallo?logical) against the 
word: topological. The desired result is that 
(once an overall match is established) the first set 
of parentheses should capture the longest string 
possible (top); the second set should then match 
the longest string possible from what's left (o), 
and so on. Such a left-most longest match con- 
catenation operation is described in §3. 
In the following section, we initially concentrate 
on the simple Case in (1) and show how (1) may be 
compiled assuming left-to-right processing along 
with the overall longest match strategy described 
by Karttunen (1996). 
The major components of the algorithm are 
not new, but straightforward modifications of 
components presented in Karttunen (1996) and 
Mohri and Sproat (1996). We improve upon ex- 
isting approaches because we solve a problem con- 
cerning the use of special marker symbols (§2.1.2). 
A further contribution is that all steps are imple- 
mented in a freely available system, the FSA Util- 
ities of van Noord (1997) (§2.1.1). 
2 The Algorithm 
2.1 Preliminary Considerations 
Before presenting the algorithm proper, we will 
deal with a couple of meta issues. First, we in- 
troduce our version of the finite state calculus in 
§2.1.1. The treatment of special marker symbols 
is discussed in §2.1.2. Then in §2.1.3, we discuss 
various utilities that will be essential for the algo- 
rithm. 
2.1.1 FSA Utilities 
The algorithm is implemented in the FSA Util- 
ities (van Noord, 1997). We use the notation pro- 
vided by the toolbox throughout this paper. Ta- 
ble 1 lists the relevant regular expression opera- 
tors. FSA Utilities offers the possibility to de- 
fine new regular expression operators. For exam- 
ple, consider the definition of the nullary operator 
vowel as the union of the five vowels: 
\[\] empty string 
\[El,... En\] concatenation of E1 ... En 
{} empty language 
<El,...En} union of El,...En 
E* Kleene closure 
E ^ optionality 
-E complement 
EI-E2 difference 
$ E containment 
E1 ~ E2 intersection 
any symbol 
A : B pair 
E1 x E2 cross-product 
A o B composition 
domain(E) domain of a transduction 
range (E) range of a transduction 
ident ity (E) identity transduction 
inverse (E) inverse transduction 
Table 1: Regular expression operators. 
macro (vowel, {a, e, i,o,u}). 
In such macro definitions, Prolog variables can be 
used in order to define new n-ary regular expres- 
sion operators in terms of existing operators. For 
instance, the lenient_composition operator (Kart- 
tunen, 1998) is defined by: 
macro (priorityiunion (Q ,R), 
{Q, -domain(Q) o R}). 
macro (lenient_composition (R, C), 
priority_union(R o C,R)). 
Here, priority_union of two regular expressions 
Q and R is defined as the union of Q and the compo- 
sition of the complement of the domain of Q with 
R. Lenient composition of R and C is defined as the 
priority union of the composition of R and C (on 
the one hand) and R (on the other hand). 
Some operators, however, require something 
more than simple macro expansion for their def- 
inition. For example, suppose a user wanted to 
match n occurrences of some pattern. The FSA 
Utilities already has the '*' and '+' quantifiers, 
but any other operators like this need to be user 
defined. For this purpose, the FSA Utilities sup- 
plies simple Prolog hooks allowing this general 
quantifier to be defined as: 
macro (mat chn (N, X), Regex) • - 
mat ch_n (N, X, Regex). 
match_n(O, _X, \[\] ) . 
match_n(N,X, \[XIRest\]) :- 
N>O, 
N1 is N-l, 
mat ch_n (NI, X, Rest) . 
127 
Proceedings of EACL '99 
For example: match_n(3,a) is equivalent to the 
ordinary finite state calculus expression \[a, a, a\]. 
Finally, regular expression operators can be 
defined in terms of operations on the un- 
derlying automaton. In such cases, Prolog 
hooks for manipulating states and transitions 
may be used. This functionality has been 
used in van Noord and Gerdemann (1999) to pro- 
vide an implementation of the algorithm in 
Mohri and Sproat (1996). 
2.1.2 Treatment of Markers 
Previous algorithms for compiling rewrite 
rules into transducers have followed 
Kaplan and Kay (1994) by introducing spe- 
cial marker symbols (markers) into strings in 
order to mark off candidate regions for replace- 
ment. The assumption is that these markers are 
outside the resulting transducer's alphabets. But 
previous algorithms have not ensured that the 
assumption holds. 
This problem was recognized by 
Karttunen (1996), whose algorithm starts with 
a filter transducer which filters out any string 
containing a marker. This is problematic for two 
reasons. First, when applied to a string that does 
happen to contain a marker, the algorithm will 
simply fail. Second, it leads to logical problems in 
the interpretation of complementation. Since the 
complement of a regular expression R is defined 
as E - R, one needs to know whether the marker 
symbols are in E or not. This has not been 
clearly addressed in previous literature. 
We have taken a different approach by providing 
a contextual way of distinguishing markers from 
non-markers. Every symbol used in the algorithm 
is replaced by a pair of symbols, where the second 
member of the pair is either a 0 or a 1 depending 
on whether the first member is a marker or not. 2 
As the first step in the algorithm, O's are inserted 
after every symbol in the input string to indicate 
that initially every symbol is a non-marker. This 
is defined as: 
macro (non_markers, \[?, \[\] :0\] *) . 
Similarly, the following macro can be used to 
insert a 0 after every symbol in an arbitrary ex- 
pression E. 
2This approach is similar to the idea of laying down 
tracks as in the compilation of monadic second-order 
logic into automata Klarlund (1997, p. 5). In fact, this 
technique could possibly be used for a more efficient 
implementation of our algorithm: instead of adding 
transitions over 0 and 1, one could represent the al- 
phabet as bit sequences and then add a final 0 bit for 
any ordinary symbol and a final 1 bit for a marker 
symbol. 
macro (non_markers (E), 
range (E o non_markers)). 
Since E is a recognizer, it is first coerced to 
identity(E). This form of implicit conversion is 
standard in the finite state calculus. 
Note that 0 and 1 are perfectly ordinary alpha- 
bet symbols, which may also be used within a re- 
placement. For example, the sequence \[i,0\] repre- 
sents a non-marker use of the symbol I. 
2.1.3 Utilities 
Before describing the algorithm, it will be 
helpful to have at our disposal a few general 
tools, most of which were described already in 
Kaplan and Kay (1994). These tools, however, 
have been modified so that they work with our 
approach of distinguishing markers from ordinary 
symbols. So to begin with, we provide macros to 
describe the alphabet and the alphabet extended 
with marker symbols: 
macro (sig, \[?, 0\] ). 
macro (xsig, \[?, {0,1}\] ). 
The macro xsig is useful for defining a special- 
ized version of complementation and containment: 
macro(not (X) ,xsig* - X). 
macro ($$ (X), \[xsig*, X, xsig*\] ). 
The algorithm uses four kinds of brackets, so 
it will be convenient to define macros for each of 
these brackets, and for a few disjunctions. 
macro (lbl, \[' <1 ', 1\] ) 
macro (lb2, \[' <2', 1\] ) 
macro (rb2, \[' 2> ', 1\] ) 
macro (rbl, \[' 1> ', 1\] ) 
macro (lb, {lbl, lb2}) 
macro (rb, {rbl ,rb2}) 
macro (bl, {lbl, rbl}) 
macro (b2, {lb2, rb2}) 
macro (brack, {lb, rb}). 
As in Kaplan & Kay, we define an Intro(S) op- 
erator that produces a transducer that freely in- 
troduces instances of S into an input string. We 
extend this idea to create a family of Intro oper- 
ators. It is often the case that we want to freely 
introduce marker symbols into a string at any po- 
sition except the beginning or the end. 
%% Free introduction 
macro(intro(S) ,{xsig-S, \[\] x S}*) . 
~.7. Introduction, except at begin 
macro (xintro (S) , ( \[\] , \[xsig-S, intro (S) \] }) . 
°/.~. Introduction, except at end 
macro (introx (S) , ( \[\] , \[intro (S) , xsig-S\] }) . 
128 
Proceedings of EACL '99 
%% Introduction, except at begin & end 
macro (xintrox (S), { \[\], \[xsig-S\] , 
\[xsig-S, intro (S), xsig-S\] }). 
This family of Intro operators is useful for defin- 
ing a family of Ignore operators: 
macro( ign( E1,S),range(E1 o intro(S))). 
macro(xign(El,S) ,range(E1 o xintro(S))). 
macro( ignx(E1,S),range(E1 o introx(S))). 
macro (xigax (El, S), range (El o xintrox (S)) ). 
In order to create filter transducers to en- 
sure that markers are placed in the correct po- 
sitions, Kaplan & Kay introduce the operator 
P-iff-S(L1,L2). A string is described by this 
expression iff each prefix in L1 is followed by a 
suffix in L2 and each suffix in L2 is preceded by a 
prefix in L1. In our approach, this is defined as: 
macro(if_p then s(L1,L2), 
not( iLl ,not (L2) \] )). 
macro (if s then_p (L1,L2), 
not ( \[not (al), L2\] ) ). 
macro (p_iff_s (LI, L2), 
if_p_then_s (LI, L2) 
if_s_then_p (LI ,L2) ). 
To make the use ofp_iff_s more convenient, we 
introduce a new operator l_if f_r (L, R), which de- 
scribes strings where every string position is pre- 
ceded by a string in L just in case it is followed by 
a string in R: 
macro (l_iff_r (L ,R), 
p_iff_s(\[xsig*,L\] , \[R,xsig*\])) . 
Finally, we introduce a new operator 
if (Condit ion, Then, Else) for conditionals. 
This operator is extremely useful, but in order 
for it to work within the finite state calculus, one 
needs a convention as to what counts as a boolean 
true or false for the condition argument. It is 
possible to define true as the universal language 
and false as the empty language: 
macro(true,? *). macro(false,{}). 
With these definitions, we can use the comple- 
ment operator as negation, the intersection opera- 
tor as conjunction and the union operator as dis- 
junction. Arbitrary expressions may be coerced 
to booleans using the following macro: 
macro (coerce_t oboolean (E), 
range(E o (true x true))). 
Here, E should describe a recognizer. E is com- 
posed with the universal transducer, which trans- 
duces from anything (?*) to anything (?*). Now 
with this background, we can define the condi- 
tionah 
macro ( if (Cond, Then, Else), 
{ coerce_to_boolean(Cond) o Then, 
-coerce_to_boolean(Cond) o Else 
}). 
2.2 Implementation 
A rule of the form x ~ T(x)/A__p will be written 
as replace(T,Lambda,Rho). Rules of the more 
general form xl ...z,, ~ Tl(xl)...T,~(Xn)/A_-p 
will be discussed in §3. The algorithm consists 
of nine steps composed as in figure 1. 
The names of these steps are mostly 
derived from Karttunen (1995) and 
Mohri and Sproat (1996) even though the 
transductions involved are not exactly the same. 
In particular, the steps derived from Mohri & 
Sproat (r, f, 11 and 12) will all be defined in 
terms of the finite state calculus as opposed to 
Mohri & Sproat's approach of using low-level 
manipulation of states and transitions, z 
The first step, non_markers, was already de- 
fined above. For the second step, we first consider 
a simple special case. If the empty string is in 
the language described by Right, then r(Right) 
should insert an rb2 in every string position. The 
definition of r(Right) is both simpler and more 
efficient if this is treated as a special case. To in- 
sert a bracket in every possible string position, we 
use: 
\[\[\[\] x rb2,sig\]*,\[\] x rb2\] 
If the empty string is not in Right, then we 
must use intro(rb2) to introduce the marker 
rb2, fol\]owed by l_iff_r to ensure that such 
markers are immediately followed by a string in 
Right, or more precisely a string in Right where 
additional instances of rb2 are freely inserted in 
any position other than the beginning. This ex- 
pression is written as: 
intro (rb2) 
o 
i_ if f _r (rb2, xign (non_markers (Right) , rb2) ) 
Putting these two pieces together with the con- 
ditional yields: 
macro (r (R), 
if(\[\] ~ R, % If: \[\] is in R: 
\[\[\[\] x rb2,sig\]*,\[\] x rb2\], 
intro (rb2) % Else: 
o 
l_iff_r (rb2, xign (non_markers (R) , rb2) ) ) ) . 
The third step, f(domain(T)) is implemented 
as: 
3The alternative implementation is provided in 
van Noord and Gerdemann (1999). 
129 
macro(replace(T,Left,Right), 
non_markers 
0 
r(Right) 
0 
f(domain(T)) 
0 
left_toright (domain(T)) 
0 
longest_match(domain(T)) 
0 
aux_replace(T) 
0 
ll(Left) 
0 
12(Left) 
O 
inverse(non_markers)). 
Proceedings of EACL '99 
% introduce 0 after every symbol 
% (a b c => a 0 b 0 c 0). 
% introduce rb2 before any string 
% in Right. 
% introduce ib2 before any string in 
% domain(T) followed by rb2. 
% ib2 ... rb2 around domain(T) optionally 
% replaced by Ibl ... rbl 
% filter out non-longest matches marked 
% in previous step. 
% perform T's transduction on regions marked 
% off by bl's. 
% ensure that Ibl must be preceded 
% by a string in Left. 
% ensure that Ib2 must not occur preceded 
% by a string in Left. 
% remove the auxiliary O's. 
Figure 1: Definition of replace operator. 
macro (f (Phi), intro (lb2) 
O 
l_iff_r (Ib2, \[xignx (non_markers (Phi), b2), 
lb2", rb2\] ) ). 
The lb2 is first introduced and then, using 
t_i f f_.r, it is constrained to occur immediately be- 
fore every instance of (ignoring complexities) Phi 
followed by an rb2. Phi needs to be marked as 
normal text using non_markers and then xign_x 
is used to allow freely inserted lb2 and rb2 any- 
where except at the beginning and end. The fol- 
lowing lb2" allows an optional lb2, which occurs 
when the empty string is in Phi. 
The fourth step is a guessing component which 
(ignoring complexities) looks for sequences of the 
form lb2 Phi rb2 and converts some of these 
into lbl Phi rbl, where the bl marking indicates 
that the sequence is a candidate for replacement. 
The complication is that Phi, as always, must 
be converted to non_markers (Phi) and instances 
of b2 need to be ignored. Furthermore, between 
pairs of lbl and rbl, instances of lb2 are deleted. 
These lb2 markers have done their job and are 
no longer needed. Putting this all together, the 
definition is: 
macro (left_to_right (Phi), 
\[ \[xsig*, 
lib2 x ibl, 
( ign (non_markers (Phi) , b2) 
O 
inverse (intro (ib2)) 
), 
rb2 x rbl\] 
\]*, xsig*\]). 
The fifth step filters out non-longest matches 
produced in the previous step. For example (and 
simplifying a bit), if Phi is ab*, then a string of 
the form ... rbl a b Ibl b ... should be ruled out 
since there is an instance of Phi (ignoring brackets 
except at the end) where there is an internal Ibl. 
This is implemented as:~ 
macro (longest_mat ch (Phi), 
not ($$ ( \[lbl, 
(ignx (non_markers (Phi) , brack) 
$$(rbl) 
), % longer match must be 
rb % followed by an rb 
\])) % so context is ok 
0 
~, done with rb2, throw away: 
inverse (intro (rb2)) ) . 
The sixth step performs the transduction de- 
scribed by T. This step is straightforwardly imple- 
mented, where the main difficulty is getting T to 
apply to our specially marked string: 
macro (aux_replace (T), 
{{sig, Ib2}, 
\[Ibl, 
inverse (non_markers) 
4The line with $$ (rbl) (:an be oI)ti- 
mized a bit: Since we know that an rbl 
must be preceded by Phi, we can write! 
\[ign_ (non_markers (Phi) , brack) , rb 1, xs ig*\] ). 
This may lead to a more constrained (hence smaller) 
transducer. 
130 
Proceedings of EACL '99 
oTo 
non_markers, 
rbl x \[\] 
\] 
}*). 
The seventh step ensures that lbl is preceded 
by a string in Left: 
macro (ii (L), 
ign ( if _s_then p ( 
ignx ( \[xsig*, non_markers (L) \], lbl), 
\[lbl,xsig*\] ), 
ib2) 
O 
inverse (intro (ib i) ) ). 
The eighth step ensures that ib2 is not preceded 
by a string in Left. This is implemented similarly 
to the previous step: 
macro (12 (L), 
if_s_then_p ( 
ignx (not ( \[xsig*,non_markers (L) \] ), lb2), 
\[lb2, xsig*\] ) 
0 
inverse ( intro (lb2) ) ). 
Finally the ninth step, inverse (non_markers), 
removes "the O's so that the final result in not 
marked up in any special way. 
3 Longest Match Capturing 
As discussed in §1 the POSIX standard requires 
that multiple captures follow a longest match 
strategy. For multiple captures as in (3), one es- 
tablishes first a longest match for domain(T1). 
.... domain( T~ ). Then we ensure that each of 
domain(Ti) in turn is required to match as long 
as possible, with each one having priority over its 
rightward neighbors. To implement this, we define 
a macro lm_concat(Ts) and use it as: 
replace (lm_concat (Ts), Left, Right) 
Ensuring the longest overall match is delegated 
to the replace macro, so lm_concat(Ts) needs 
only ensure that each individual transducer within 
Ts gets its proper left-to-right longest matching 
priority. This problem is mostly solved by the 
same techniques used to ensure the longest match 
within the replace macro. The only complica- 
tion here is that Ts can be of unbounded length. 
So it is not possible to have a single expression in 
the finite state calculus that applies to all possi- 
ble lenghts. This means that we need something 
a little more powerful than mere macro expan- 
sion to construct the proper finite state calculus 
expression. The FSA Utilities provides a Prolog 
hook for this purpose. The resulting definition of 
lm_concat is given in figure 2. 
Suppose (as in Friedl (1997)), we want to match 
the following list of recognizers against the string 
topological and insert a marker in each bound- 
ary position. This reduces to applying: 
im_concat ( \[ 
\[{\[t,o\],\[t,o,p\]},\[\] : '#'\], 
\[{o,\[p,o,l,o\]},\[\]: '#'\], 
{ \[g,i,c,a,l\], \[o',l,o,g,i,c,a,l\] } 
\]) 
This expression transduces the string 
topological only to the string top#o#1ogical. 5 
4 Conclusions 
The algorithm presented here has extended previ- 
ous algorithms for rewrite rules by adding a lim- 
ited version of backreferencing. This allows the 
output of rewriting to be dependent on the form of 
the strings which are rewritten. This new feature 
brings techniques used in Perl-like languages into 
the finite state calculus. Such an integration is 
needed in practical applications where simple text 
processing needs to be combined with more so- 
phisticated computational linguistics techniques. 
One particularly interesting example where 
backreferences are essential is cascaded determin- 
istic (longest match) finite state parsing as de- 
scribed for example in Abney (Abney, 1996) and 
various papers in (Roche and Schabes, 1997a). 
Clearly, the standard rewrite rules do not apply in 
this domain. If NP is an NP recognizer, it would 
not do to.say NP ~ \[NP\]/A_p. Nothing would 
force the string matched by the NP to the left of 
the arrow to be the same as the string matched 
by the NP to the right of the arrow. 
One advantage of using our algorithm for fi- 
nite state parsing is that the left and right con- 
texts may be used to bring in top-down filter- 
ing. 6 An often cited advantage of finite state 
5An anonymous reviewer suggested theft 
lm_concat could be implemented in the frame- 
work of Karttunen (1996) as: 
\[toltoplolpolo\]-+... #; 
Indeed the resulting transducer from this expression 
would transduce topological into top#o#1ogical. 
But unfortunately this transducer would also trans- 
duce polotopogical into polo#top#o#gical, since 
the notion of left-right ordering is lost in this expres- 
sion. 
6The bracketing operator of Karttunen (1996), on 
the other hand, does not provide for left and right 
contexts. 
131 
Proceedings of EACL '99 
macro(im_concat(Ts),mark_boundaries(Domains) o ConcatTs):- 
domains(Ts,Domains), concatT(Ts,ConcatTs). 
domains(\[\],\[\]). 
domains(\[FIRO\],\[domain(F) IR\]):- domains(RO,R). 
concatT(\[\],\[\]). 
concatT(\[TlTs\], \[inverse(non_markers) o T,ibl x \[\]IRest\]):- concatT(Ts,Rest). 
%% macro(mark_boundaries(L),Exp): This is the central component of im_concat. For our 
%% "toplological" example we will have: 
%% mark_boundaries (\[domain( \[{ \[t, o\] , \[t, o ,p\] }, \[\] : #\] ), 
%% domain(\[{o,\[p,o,l,o\]},\[\]: #\]), 
%% domain({ \[g,i, c,a, i\] , \[o^,l,o,g,i,c,a,l\] })\]) 
%% which simplifies to: 
%% mark_boundaries(\[{\[t,o\],\[t,o,p\]}, {o,\[p,o,l,o\]}, {\[g,i,c,a,l\],\[o^,l,o,g,i,c,a,l\]}\]). 
%% Then by macro expansion, we get: 
%% \[{\[t,o\], \[t,o,p\]} o non_markers,\[\]x ibl, 
%% {o,\[p,o,l,o\]} o non_markers,\[\]x ibl, 
%% {\[g,i,c,a,l\],\[o',l,o,g,i,c,a,l\]} o non_markers,\[\]x ibl\] 
%% o 
%% % Filter i: {\[t,o\],\[t,o,p\]} gets longest match 
%% - \[ignx_l(non_markers({ \[t,o\] , \[t,o,p\] }),ibl) , 
%% ign(non_markers({o, \[p,o,l,o\] }) ,ibl) , 
%% ign(non_markers({ \[g,i,c,a,l\] , \[o^,l,o,g,i,c,a,l\] }) ,ibl)\] 
%% o 
%% % Filter 2: {o,\[p,o,l,o\]} gets longest match 
%% ~ \[non_markers ({ \[t, o\] , \[t, o, p\] }) , Ib i, 
%% ignx_l(non_markers ({o, \[p,o,l,o\] }) ,ibl), 
%% ign(non_markers({ \[g, i,c,a,l\] , \[o',l,o,g,i,c,a,l\] }) ,ibl)\] 
macro(mark_boundaries(L),Exp):- 
boundaries(L,ExpO), % guess boundary positions 
greed(L,ExpO,Exp). % filter non-longest matches 
boundaries(\[\],\[\]). 
boundaries(\[FIRO\],\[F o non_markers, \[\] x ibl \]R\]):- boundaries(RO,R). 
greed(L,ComposedO,Composed) :- 
aux_greed(L,\[\],Filters), compose_list(Filters,ComposedO,Composed). 
aux_greed(\[HIT\],Front,Filters):- aux_greed(T,H,Front,Filters,_CurrentFilter). 
aux_greed(\[\],F,_,\[\],\[ign(non_markers(F),Ibl)\]). 
aux_greed(\[HlRO\],F,Front,\[-LiIR\],\[ign(non_markers(F),ibl)IRl\]) "- 
append(Front,\[ignx_l(non_markers(F),Ibl)IRl\],Ll), 
append(Front,\[non_markers(F),ibl\],NewFront), 
aux_greed(RO,H,NewFront,R,Rl). 
%% ignore at least one instance of E2 except at end 
macro(ignx_l(E1,E2), range(El o \[\[? *,\[\] x E2\]+,? +\])). 
compose_list(\[\],SoFar,SoFar). 
compose_list(\[FlR\],SoFar,Composed):- compose_list(R,(SoFar o F),Composed). 
Figure 2: Definition of lm_concat operator. 
132 
Proceedings of EACL '99 
parsing is robustness. A constituent is found bot- 
tom up in an early level in the cascade even if 
that constituent does not ultimately contribute 
to an S in a later level of the cascade. While 
this is undoubtedly an advantage for certain ap- 
plications, our approach would allow the intro- 
duction of some top-down filtering while main- 
taining the robustness of a bottom-up approach. 
A second advantage for robust finite state pars- 
ing is that bracketing could also include the no- 
tion of "repair" as in Abney (1990). One might, 
for example, want to say something like: xy 
\[NP RepairDet(x) RepairN(y) \]/)~__p 7 so that an 
NP could be parsed as a slightly malformed Det 
followed by a slightly malformed N. RepairDet 
and RepairN, in this example, could be doing a 
variety of things such as: contextualized spelling 
correction, reordering of function words, replace- 
ment of phrases by acronyms, or any other oper- 
ation implemented as a transducer. 
Finally, we should mention the problem of com- 
plexity. A critical reader might see the nine steps 
in our algorithm and conclude that the algorithm 
is overly complex. This would be a false conclu- 
sion. To begin with, the problem itself is complex. 
It is easy to create examples where the resulting 
transducer created by any algorithm would be- 
come unmanageably large. But there exist strate- 
gies for keeping the transducers smaller. For ex- 
ample, it is not necessary for all nine steps to 
be composed. They can also be cascaded. In 
that case it will be possible to implement different 
steps by different strategies, e.g. by determinis- 
tic or non-deterministic transducers or bimachines 
(Roche and Schabes, 1997b). The range of possi- 
bilities leaves plenty of room for future research. 
References 
Steve Abney. 1990. Rapid incremental parsing 
with repair. In Proceedings of the 6th New OED 
Conference: Electronic Text Rese arch, pages 
1-9. 
Steven Abney. 1996. Partial parsing via finite- 
state cascades. In Proceedings of the ESSLLI 
'96 Robust Parsing Workshop. 
.Jeffrey Friedl. 1997. Mastering Regular Expres- 
sions. O'Reilly & Associates, Inc. 
C. Douglas Johnson. 1972. Formal Aspects 
of Phonological Descriptions. Mouton, The 
Hague. 
7The syntax here has been simplified. The rule 
should be understood as: replace(lm_concat(\[\[\]:'\[np', 
repair_det, repair_n, \[\]:'\]'\],lambda, rho). 
Ronald Kaplan and Martin Kay. 1994. Regular 
models of phonological rule systems. Computa- 
tional Linguistics, 20(3):331-379. 
L. Karttunen, J-P. Chanod, G. Grefenstette, and 
A. Schiller. 1996. Regular expressions for lan- 
guage engineering. Natural Language Engineer- 
ing, 2(4):305-238. 
Lauri Karttunen. 1995. The replace operator. 
In 33th Annual Meeting of the Association for 
Computational Linguistics, M.I.T. Cambridge 
Mass. 
Lauri Karttunen. 1996. Directed replacement. 
In 34th Annual Meeting of the Association for 
Computational Linguistics, Santa Cruz. 
Lauri Karttunen. 1997. The replace operator. 
In Emannual Roche and Yves Schabes, editors, 
Finite-State Language Processing, pages 117- 
147. Bradford, MIT Press. 
Lauri Karttunen. 1998. The proper treatment 
of optimality theory in computational phonol- 
ogy. In Finite-state Methods in Natural Lan- 
guage Processing, pages 1-12, Ankara, June. 
Nils Klarlund. 1997. Mona & Fido: The logic 
automaton connection in practice. In CSL '97. 
Mehryar Mohri and Richard Sproat. 1996. An 
efficient compiler for weighted rewrite rules. 
In 3~th Annual Meeting of the Association for 
Computational Linguistics, Santa Cruz. 
Emmanuel Roche and Yves Schabes. 1995. De- 
terministic part-of-speech tagging with finite- 
state transducers. Computational Linguistics, 
21:227-263. Reprinted in Roche & Schabes 
(1997). 
Emmanuel Roche and Yves Schabes, editors. 
1997a. Finite-State Language Processing. MIT 
Press, Cambridge. 
Emmanuel Roche and Yves Schabes. 1997b. In- 
troduction. In Emmanuel Roche and Yves Sch- 
abes, editors, Finite-State Language Processing. 
MIT Press, Cambridge, Mass. 
Gertjan van Noord and Dale Gerdemann. 1999. 
An extendible regular expression compiler for 
finite-state approaches in natural language pro- 
cessing. In Workshop on Implementing Au- 
tomata 99, Potsdam Germany. 
Gertjan van Noord. 1997. Fsa utilities. 
The FSA Utilities toolbox is available free of 
charge under Gnu General Public License at 
http://www.let.rug.nl/-vannoord/Fsa/. 
133 
