Practical Experiments with Regular
Approximation of Context-Free Languages
Mark-Jan Nederhof*
German Research Center for Artificial Intelligence
Several methods are discussed that construct a finite automaton given a context-free grammar, including both methods that lead to subsets and those that lead to supersets of the original context-free language. Some of these methods of regular approximation are new, and some others are presented here in a more refined form with respect to existing literature. Practical experiments with the different methods of regular approximation are performed for spoken-language input: hypotheses from a speech recognizer are filtered through a finite automaton.
1. Introduction 
Several methods of regular approximation of context-free languages have been proposed in the literature. For some, the regular language is a superset of the context-free language, and for others it is a subset. We have implemented a large number of methods, and where necessary, refined them with an analysis of the grammar. We also propose a number of new methods.
The analysis of the grammar is based on a sufficient condition for context-free grammars to generate regular languages. For an arbitrary grammar, this analysis identifies sets of rules that need to be processed in a special way in order to obtain a regular language. The nature of this processing differs for the respective approximation methods. For other parts of the grammar, no special treatment is needed and the grammar rules are translated to the states and transitions of a finite automaton without affecting the language.
Few of the published articles on regular approximation have discussed the application in practice. In particular, little attention has been given to the following two questions: First, what happens when a context-free grammar grows in size? What is then the increase of the sizes of the intermediate results and the obtained minimal deterministic automaton? Second, how "precise" are the approximations? That is, how much larger than the original context-free language is the language obtained by a superset approximation, and how much smaller is the language obtained by a subset approximation? (How we measure the "sizes" of languages in a practical setting will become clear in what follows.)
Some considerations with regard to theoretical upper bounds on the sizes of the intermediate results and the finite automata have already been discussed in Nederhof (1997). In this article we will try to answer the above two questions in a practical setting, using practical linguistic grammars and sentences taken from a spoken-language corpus.
* DFKI, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany. E-mail: nederhof@dfki.de
© 2000 Association for Computational Linguistics 
Computational Linguistics Volume 26, Number 1 
The structure of this paper is as follows: In Section 2 we recall some standard definitions from language theory. Section 3 investigates a sufficient condition for a context-free grammar to generate a regular language. We also present the construction of a finite automaton from such a grammar. In Section 4, we discuss several methods to approximate the language generated by a grammar if the sufficient condition mentioned above is not satisfied. These methods can be enhanced by a grammar transformation presented in Section 5. Section 6 compares the respective methods, which leads to conclusions in Section 7.
2. Preliminaries 
Throughout this paper we use standard formal language notation (see, for example, Harrison [1978]). In this section we review some basic definitions.
A context-free grammar $G$ is a 4-tuple $(\Sigma, N, P, S)$, where $\Sigma$ and $N$ are two finite disjoint sets of terminals and nonterminals, respectively, $S \in N$ is the start symbol, and $P$ is a finite set of rules. Each rule has the form $A \to \alpha$ with $A \in N$ and $\alpha \in V^*$, where $V$ denotes $N \cup \Sigma$. The relation $\to$ on $N \times V^*$ is extended to a relation on $V^* \times V^*$ as usual. The transitive and reflexive closure of $\to$ is denoted by $\to^*$.
The language generated by $G$ is given by the set $\{w \in \Sigma^* \mid S \to^* w\}$. By definition, such a set is a context-free language. By reduction of a grammar we mean the elimination from $P$ of all rules $A \to \gamma$ such that $S \to^* \alpha A \beta \to \alpha \gamma \beta \to^* w$ does not hold for any $\alpha, \beta \in V^*$ and $w \in \Sigma^*$.
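Reduction as defined above can be sketched in Python as follows. This is only an illustration under stated assumptions: rules are encoded as pairs of a left-hand side and a tuple of right-hand-side symbols, and the function name `reduce_grammar` and the example grammar are hypothetical, not taken from the paper. The sketch first keeps the rules whose symbols can all derive terminal strings, and then keeps those whose left-hand side is reachable from the start symbol.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side)


def reduce_grammar(rules: List[Rule], terminals: Set[str], start: str) -> List[Rule]:
    """Keep exactly the rules that occur in some derivation S ->* w."""
    # Step 1: nonterminals that derive at least one terminal string.
    generating: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in generating and all(
                    x in terminals or x in generating for x in rhs):
                generating.add(lhs)
                changed = True
    useful = [(lhs, rhs) for lhs, rhs in rules
              if lhs in generating
              and all(x in terminals or x in generating for x in rhs)]
    # Step 2: nonterminals reachable from the start symbol via useful rules.
    reachable, todo = {start}, [start]
    while todo:
        a = todo.pop()
        for lhs, rhs in useful:
            if lhs == a:
                for x in rhs:
                    if x not in terminals and x not in reachable:
                        reachable.add(x)
                        todo.append(x)
    return [(lhs, rhs) for lhs, rhs in useful if lhs in reachable]


# Hypothetical example: A never terminates, C is unreachable from S.
EXAMPLE = [("S", ("a", "S", "a")), ("S", ()), ("S", ("A",)),
           ("A", ("A", "b")), ("C", ("c",))]
REDUCED = reduce_grammar(EXAMPLE, {"a", "b", "c"}, "S")
```

In the example, only the two palindrome-like rules for `S` survive.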
We generally use symbols $A, B, C, \ldots$ to range over $N$, symbols $a, b, c, \ldots$ to range over $\Sigma$, symbols $X, Y, Z$ to range over $V$, symbols $\alpha, \beta, \gamma, \ldots$ to range over $V^*$, and symbols $v, w, x, \ldots$ to range over $\Sigma^*$. We write $\epsilon$ to denote the empty string.
A rule of the form $A \to B$ is called a unit rule.
A (nondeterministic) finite automaton $\mathcal{F}$ is a 5-tuple $(K, \Sigma, \Delta, s, F)$, where $K$ is a finite set of states, of which $s$ is the initial state and those in $F \subseteq K$ are the final states, $\Sigma$ is the input alphabet, and the transition relation $\Delta$ is a finite subset of $K \times \Sigma^* \times K$.
We define a configuration to be an element of $K \times \Sigma^*$. We define the binary relation $\vdash$ between configurations as: $(q, vw) \vdash (q', w)$ if and only if $(q, v, q') \in \Delta$. The transitive and reflexive closure of $\vdash$ is denoted by $\vdash^*$.
Some input $v$ is recognized if $(s, v) \vdash^* (q, \epsilon)$, for some $q \in F$. The language accepted by $\mathcal{F}$ is defined to be the set of all strings $v$ that are recognized. By definition, a language accepted by a finite automaton is called a regular language.
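The recognition relation can be illustrated with a small Python sketch. It makes a simplifying assumption that the definition above does not: every transition label is either a single symbol or the empty string (written `""`), whereas the definition allows arbitrary strings from $\Sigma^*$. The function names and the example automaton are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

State = int
Transition = Tuple[State, str, State]  # label "" stands for the empty string


def eps_closure(states: Set[State], delta: Set[Transition]) -> FrozenSet[State]:
    """All states reachable from `states` through empty-string transitions."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for p, label, r in delta:
            if p == q and label == "" and r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def recognizes(delta: Set[Transition], start: State,
               finals: Set[State], word: str) -> bool:
    """True if (start, word) |-* (q, empty string) for some final state q."""
    current = eps_closure({start}, delta)
    for symbol in word:
        moved = {r for p, label, r in delta if p in current and label == symbol}
        current = eps_closure(moved, delta)
    return bool(current & finals)


# Hypothetical automaton for the regular language d c* a.
DELTA = {(0, "d", 1), (1, "c", 1), (1, "", 2), (2, "a", 3)}
```

Tracking the set of all states reachable after each symbol avoids backtracking over the nondeterministic choices.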
3. Finite Automata in the Absence of Self-Embedding 
We define a spine in a parse tree to be a path that runs from the root down to some 
leaf. Our main interest in spines lies in the sequences of grammar symbols at nodes 
bordering on spines. 
A simple example is the set of parse trees such as the one in Figure 1, for a grammar of palindromes. It is intuitively clear that the language is not regular: the grammar symbols to the left of the spine from the root to $\epsilon$ "communicate" with those to the right of the spine. More precisely, the prefix of the input up to the point where it meets the final node $\epsilon$ of the spine determines the suffix after that point, in such a way that an unbounded quantity of symbols from the prefix need to be taken into account.
A formal explanation for why the grammar may not generate a regular language relies on the following definition (Chomsky 1959b):
Nederhof Experiments with Regular Approximation 
$S \to a\,S\,a$
$S \to b\,S\,b$
$S \to \epsilon$
Figure 1
Grammar of palindromes, and a parse tree.
Definition
A grammar is self-embedding if there is some $A \in N$ such that $A \to^* \alpha A \beta$, for some $\alpha \neq \epsilon$ and $\beta \neq \epsilon$.
If a grammar is not self-embedding, this means that when a section of a spine in 
a parse tree repeats itself, then either no grammar symbols occur to the left of that 
section of the spine, or no grammar symbols occur to the right. This prevents the 
"unbounded communication" between the two sides of the spine exemplified by the 
palindrome grammar. 
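We now prove that grammars that are not self-embedding generate regular languages. For an arbitrary grammar, we define the set of recursive nonterminals as:
$$\overline{N} = \{A \in N \mid \exists \alpha, \beta\,[A \to^+ \alpha A \beta]\}$$
where $\to^+$ denotes the transitive closure of $\to$. We determine the partition $\mathcal{N}$ of $\overline{N}$ consisting of subsets $N_1, N_2, \ldots, N_k$, for some $k \geq 0$, of mutually recursive nonterminals:
$$\mathcal{N} = \{N_1, N_2, \ldots, N_k\}$$
$$N_1 \cup N_2 \cup \cdots \cup N_k = \overline{N}$$
$$\forall i\,[N_i \neq \emptyset]$$
$$\forall i, j\,[i \neq j \Rightarrow N_i \cap N_j = \emptyset]$$
and for all $A, B \in \overline{N}$:
$$\exists i\,[A \in N_i \wedge B \in N_i] \equiv \exists \alpha_1, \beta_1, \alpha_2, \beta_2\,[A \to^* \alpha_1 B \beta_1 \wedge B \to^* \alpha_2 A \beta_2]$$
We now define the function recursive from $\mathcal{N}$ to the set {left, right, self, cyclic}. For $1 \leq i \leq k$:
recursive($N_i$) = left, if $\neg$LeftGenerating($N_i$) $\wedge$ RightGenerating($N_i$)
recursive($N_i$) = right, if LeftGenerating($N_i$) $\wedge$ $\neg$RightGenerating($N_i$)
recursive($N_i$) = self, if LeftGenerating($N_i$) $\wedge$ RightGenerating($N_i$)
recursive($N_i$) = cyclic, if $\neg$LeftGenerating($N_i$) $\wedge$ $\neg$RightGenerating($N_i$)
where
LeftGenerating($N_i$) = $\exists (A \to \alpha B \beta) \in P\,[A \in N_i \wedge B \in N_i \wedge \alpha \neq \epsilon]$
RightGenerating($N_i$) = $\exists (A \to \alpha B \beta) \in P\,[A \in N_i \wedge B \in N_i \wedge \beta \neq \epsilon]$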
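The definitions above translate fairly directly into code. The following Python sketch is one possible way to compute the partition of mutually recursive nonterminals and the value of the function recursive for each set; the rule encoding and the names `derives` and `classify` are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side)


def derives(rules: List[Rule], nonterminals: Set[str]) -> Dict[str, Set[str]]:
    """reach[A] = all nonterminals B such that A ->+ alpha B beta."""
    reach: Dict[str, Set[str]] = {a: set() for a in nonterminals}
    for lhs, rhs in rules:
        reach[lhs].update(x for x in rhs if x in nonterminals)
    changed = True
    while changed:
        changed = False
        for a in nonterminals:
            new = set().union(*(reach[b] for b in reach[a])) - reach[a]
            if new:
                reach[a] |= new
                changed = True
    return reach


def classify(rules: List[Rule], nonterminals: Set[str]) -> Dict[FrozenSet[str], str]:
    """Map each set N_i of mutually recursive nonterminals to its kind."""
    reach = derives(rules, nonterminals)
    recursive_nts = {a for a in nonterminals if a in reach[a]}
    parts = {frozenset(b for b in recursive_nts
                       if b in reach[a] and a in reach[b])
             for a in recursive_nts}
    kinds = {(False, True): "left", (True, False): "right",
             (True, True): "self", (False, False): "cyclic"}
    result = {}
    for part in parts:
        left_gen = right_gen = False
        for lhs, rhs in rules:
            if lhs not in part:
                continue
            for k, x in enumerate(rhs):
                if x in part:
                    left_gen |= k > 0                # alpha is nonempty
                    right_gen |= k < len(rhs) - 1    # beta is nonempty
        result[part] = kinds[(left_gen, right_gen)]
    return result


PALINDROMES = [("S", ("a", "S", "a")), ("S", ("b", "S", "b")), ("S", ())]
LEFT_EXAMPLE = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
                ("B", ("B", "c")), ("B", ("d",))]
```

For the palindrome grammar the single set {S} is classified as self, so that grammar is self-embedding; the second grammar (the one used later in Figure 3) yields only left-recursive sets.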
When recursive($N_i$) = left, $N_i$ consists of only left-recursive nonterminals. This does not mean that $N_i$ cannot also contain right-recursive nonterminals, but in that case right recursion amounts to application of unit rules. When recursive($N_i$) = cyclic, it is only such unit rules that take part in the recursion.
That recursive($N_i$) = self, for some $i$, is a sufficient and necessary condition for the grammar to be self-embedding. Therefore, we have to prove that if recursive($N_i$) $\in$ {left, right, cyclic}, for all $i$, then the grammar generates a regular language. Our proof differs from an existing proof (Chomsky 1959a) in that it is fully constructive: Figure 2 presents an algorithm for creating a finite automaton that accepts the language generated by the grammar.
The process is initiated at the start symbol, and from there the process descends the grammar in all ways until terminals are encountered, and then transitions are created labeled with those terminals. Descending the grammar is straightforward in the case of rules of which the left-hand side is not a recursive nonterminal: the subautomata found recursively for members in the right-hand side will be connected. In the case of recursive nonterminals, the process depends on whether the nonterminals in the corresponding set from $\mathcal{N}$ are mutually left-recursive or right-recursive; if they are both, which means they are cyclic, then either subprocess can be applied; in the code in Figure 2 cyclic and right-recursive subsets $N_i$ are treated uniformly.
We discuss the case in which the nonterminals are left-recursive. One new state is 
created for each nonterminal in the set. The transitions that are created for terminals 
and nonterminals not in Ni are connected in a way that is reminiscent of the con- 
struction of left-corner parsers (Rosenkrantz and Lewis 1970), and specifically of one 
construction that focuses on sets of mutually recursive nonterminals (Nederhof 1994, 
Section 5.8). 
An example is given in Figure 3. Four states have been labeled according to the names they are given in procedure make_fa. There are two states that are labeled $q_B$. This can be explained by the fact that nonterminal $B$ can be reached by descending the grammar from $S$ in two essentially distinct ways.
The code in Figure 2 differs from the actual implementation in that sometimes, for a 
nonterminal, a separate finite automaton is constructed, namely, for those nonterminals 
that occur as A in the code. A transition in such a subautomaton may be labeled by 
another nonterminal B, which then represents the subautomaton corresponding to B. 
The resulting representation is similar to extended context-free grammars (Purdom 
and Brown 1981), with the exception that in our case recursion cannot occur, by virtue 
of the construction. 
The representation for the running example is indicated by Figure 4, which shows two subautomata, labeled $S$ and $B$. The one labeled $S$ is the automaton on the top level, and contains two transitions labeled $B$, which refer to the other subautomaton. Note that this representation is more compact than that of Figure 3, since the transitions that are involved in representing the sublanguage of strings generated by nonterminal $B$ are included only once.
The compact representation consisting of subautomata can be turned into a sin- 
gle finite automaton by substituting subautomata A for transitions labeled A in other 
automata. This comes down to regular substitution in the sense of Berstel (1979). The 
advantage of this way of obtaining a finite automaton over a direct construction of a 
nondeterministic automaton is that subautomata may be determinized and minimized 
before they are substituted into larger subautomata. Since in many cases determinized 
and minimized automata are much smaller, this process avoids much of the combina- 
let $K = \emptyset$, $\Delta = \emptyset$, $s$ = fresh_state, $f$ = fresh_state, $F = \{f\}$;
make_fa($s$, $S$, $f$).

procedure make_fa($q_0$, $\alpha$, $q_1$):
  if $\alpha = \epsilon$
  then let $\Delta = \Delta \cup \{(q_0, \epsilon, q_1)\}$
  elseif $\alpha = a$, some $a \in \Sigma$
  then let $\Delta = \Delta \cup \{(q_0, a, q_1)\}$
  elseif $\alpha = X\beta$, some $X \in V$, $\beta \in V^*$ such that $|\beta| > 0$
  then let $q$ = fresh_state;
       make_fa($q_0$, $X$, $q$);
       make_fa($q$, $\beta$, $q_1$)
  else let $A = \alpha$; (* $\alpha$ must consist of a single nonterminal *)
       if there exists $i$ such that $A \in N_i$
       then for each $B \in N_i$ do let $q_B$ = fresh_state end;
            if recursive($N_i$) = left
            then for each $(C \to X_1 \cdots X_m) \in P$ such that $C \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_0$, $X_1 \cdots X_m$, $q_C$)
                 end;
                 for each $(C \to D X_1 \cdots X_m) \in P$ such that
                          $C, D \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_D$, $X_1 \cdots X_m$, $q_C$)
                 end;
                 let $\Delta = \Delta \cup \{(q_A, \epsilon, q_1)\}$
            else (* recursive($N_i$) $\in$ {right, cyclic} *)
                 for each $(C \to X_1 \cdots X_m) \in P$ such that $C \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_C$, $X_1 \cdots X_m$, $q_1$)
                 end;
                 for each $(C \to X_1 \cdots X_m D) \in P$ such that
                          $C, D \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_C$, $X_1 \cdots X_m$, $q_D$)
                 end;
                 let $\Delta = \Delta \cup \{(q_0, \epsilon, q_A)\}$
            end
       else for each $(A \to \beta) \in P$ do make_fa($q_0$, $\beta$, $q_1$) end (* $A$ is not recursive *)
       end
  end
end.

procedure fresh_state():
  create some object $q$ such that $q \notin K$;
  let $K = K \cup \{q\}$;
  return $q$
end.
Figure 2
Transformation from a grammar $G = (\Sigma, N, P, S)$ that is not self-embedding into an equivalent finite automaton $\mathcal{F} = (K, \Sigma, \Delta, s, F)$.
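For readers who prefer executable code, the procedure of Figure 2 can be transcribed into Python roughly as below. This is a sketch rather than the implementation used for the experiments: the rule encoding, the names `build_fa` and `accepts`, and the convention that the caller supplies the partition together with the value of recursive for each set (in the dictionary `parts`) are all assumptions. The empty string `""` plays the role of $\epsilon$.

```python
import itertools
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]


def build_fa(rules: List[Rule], start_symbol: str,
             parts: Dict[FrozenSet[str], str]):
    """Return (delta, s, f); `parts` maps each N_i to left/right/cyclic."""
    nonterminals = {lhs for lhs, _ in rules}
    counter = itertools.count()
    delta: Set[Tuple[int, str, int]] = set()

    def fresh_state() -> int:
        return next(counter)

    def make_fa(q0: int, alpha: Tuple[str, ...], q1: int) -> None:
        if len(alpha) == 0:
            delta.add((q0, "", q1))
        elif len(alpha) > 1:                       # alpha = X beta
            q = fresh_state()
            make_fa(q0, alpha[:1], q)
            make_fa(q, alpha[1:], q1)
        elif alpha[0] not in nonterminals:         # a single terminal
            delta.add((q0, alpha[0], q1))
        else:                                      # a single nonterminal A
            a = alpha[0]
            part = next((p for p in parts if a in p), None)
            if part is None:                       # A is not recursive
                for lhs, rhs in rules:
                    if lhs == a:
                        make_fa(q0, rhs, q1)
                return
            assert parts[part] != "self", "grammar must not be self-embedding"
            q_of = {b: fresh_state() for b in part}
            for lhs, rhs in rules:
                if lhs not in part:
                    continue
                if parts[part] == "left":
                    if rhs and rhs[0] in part:     # C -> D X1...Xm
                        make_fa(q_of[rhs[0]], rhs[1:], q_of[lhs])
                    else:                          # C -> X1...Xm
                        make_fa(q0, rhs, q_of[lhs])
                elif rhs and rhs[-1] in part:      # C -> X1...Xm D
                    make_fa(q_of[lhs], rhs[:-1], q_of[rhs[-1]])
                else:                              # C -> X1...Xm
                    make_fa(q_of[lhs], rhs, q1)
            if parts[part] == "left":
                delta.add((q_of[a], "", q1))
            else:
                delta.add((q0, "", q_of[a]))

    s, f = fresh_state(), fresh_state()
    make_fa(s, (start_symbol,), f)
    return delta, s, f


def accepts(delta, s, f, word: str) -> bool:
    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            q = todo.pop()
            for p, label, r in delta:
                if p == q and label == "" and r not in seen:
                    seen.add(r)
                    todo.append(r)
        return seen
    current = closure({s})
    for symbol in word:
        current = closure({r for p, label, r in delta
                           if p in current and label == symbol})
    return f in current


# The grammar of Figure 3, with its partition supplied by hand.
RULES = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
         ("B", ("B", "c")), ("B", ("d",))]
PARTS = {frozenset({"S", "A"}): "left", frozenset({"B"}): "left"}
DELTA, START, FINAL = build_fa(RULES, "S", PARTS)
```

For this grammar the resulting automaton accepts strings such as `dba` and `dccbadca`, and rejects `da`.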
$S \to A\,a$
$A \to S\,B$
$A \to B\,b$
$B \to B\,c$
$B \to d$
$\overline{N} = \{S, A, B\}$
$\mathcal{N} = \{N_1, N_2\}$
$N_1 = \{S, A\}$, recursive($N_1$) = left
$N_2 = \{B\}$, recursive($N_2$) = left
Figure 3
Application of the code from Figure 2 on a small grammar.
Figure 4
The automaton from Figure 3 in a compact representation, consisting of two subautomata labeled $S$ and $B$.
torial explosion that takes place upon naive construction of a single nondeterministic 
finite automaton. 1 
Assume we have a list of subautomata $A_1, \ldots, A_m$ that is ordered from lower-level to higher-level automata; i.e., if an automaton $A_p$ occurs as the label of a transition of automaton $A_q$, then $p < q$; $A_m$ must be the start symbol $S$. This order is a natural result of the way that subautomata are constructed during our depth-first traversal of the grammar, which is actually postorder in the sense that a subautomaton is output after all subautomata occurring at its transitions have been output.
Our implementation constructs a minimal deterministic automaton by repeating the following for $p = 1, \ldots, m$:
1. Make a copy of $A_p$. Determinize and minimize the copy. If it has fewer transitions labeled by nonterminals than the original, then replace $A_p$ by its copy.
2. Replace each transition in $A_p$ of the form $(q, A_r, q')$ by (a copy of) automaton $A_r$ in a straightforward way. This means that new $\epsilon$-transitions connect $q$ to the start state of $A_r$ and the final states of $A_r$ to $q'$.
3. Again determinize and minimize $A_p$ and store it for later reference.
The automaton obtained for $A_m$ after step 3 is the desired result.
1 The representation in Mohri and Pereira (1998) is even more compact than ours for grammars that are not self-embedding. However, in this paper we use our representation as an intermediate result in approximating an unrestricted context-free grammar, with the final objective of obtaining a single minimal deterministic automaton. For this purpose, Mohri and Pereira's representation offers little advantage.
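The substitution in step 2 of the procedure above can be sketched in Python as follows. The sketch deliberately omits determinization and minimization (steps 1 and 3), and its encoding of an automaton as a triple (start state, final states, transitions), the helper names, and the concrete state numbers are all assumptions made for illustration; the two example subautomata are a hand-written rendering of the running example with subautomata $S$ and $B$.

```python
import itertools

_fresh = itertools.count(1000)  # hypothetical supply of otherwise unused states


def copy_of(automaton):
    """Return a copy of (start, finals, delta) with all states renamed."""
    start, finals, delta = automaton
    states = {start} | set(finals) | {q for p, _, r in delta for q in (p, r)}
    rename = {q: next(_fresh) for q in states}
    return (rename[start], {rename[q] for q in finals},
            {(rename[p], label, rename[r]) for p, label, r in delta})


def substitute(automaton, subautomata):
    """Replace each transition labeled by a key of `subautomata` by a copy."""
    start, finals, delta = automaton
    expanded = set()
    for p, label, r in delta:
        if label in subautomata:
            sub_start, sub_finals, sub_delta = copy_of(subautomata[label])
            expanded |= sub_delta
            expanded.add((p, "", sub_start))                  # enter the copy
            expanded |= {(q, "", r) for q in sub_finals}      # leave the copy
        else:
            expanded.add((p, label, r))
    return start, set(finals), expanded


def accepts(automaton, word):
    start, finals, delta = automaton

    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            q = todo.pop()
            for p, label, r in delta:
                if p == q and label == "" and r not in seen:
                    seen.add(r)
                    todo.append(r)
        return seen

    current = closure({start})
    for symbol in word:
        current = closure({r for p, label, r in delta
                           if p in current and label == symbol})
    return bool(current & set(finals))


# Subautomaton for B (d c*) and top-level automaton for S with B-transitions.
B_AUT = (0, {1}, {(0, "d", 1), (1, "c", 1)})
S_AUT = (0, {3}, {(0, "B", 1), (1, "b", 2), (2, "a", 3), (3, "B", 2)})
FLAT = substitute(S_AUT, {"B": B_AUT})
```

Each transition labeled `B` receives its own copy of the subautomaton, which is why processing lower-level automata first, in minimized form, keeps the copies small.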
4. Methods of Regular Approximation 
This section describes a number of methods for approximating a context-free grammar by means of a finite automaton. Some published methods did not mention self-embedding explicitly as the source of nonregularity for the language, and suggested that approximations should be applied globally for the complete grammar. Where this is the case, we adapt the method so that it is more selective and deals with self-embedding locally.
The approximations are integrated into the construction of the finite automaton from the grammar, which was described in the previous section. A separate incarnation of the approximation process is activated upon finding a nonterminal $A$ such that $A \in N_i$ and recursive($N_i$) = self, for some $i$. This incarnation then only pertains to the set of rules of the form $B \to \alpha$, where $B \in N_i$. In other words, nonterminals not in $N_i$ are treated by this incarnation of the approximation process as if they were terminals.
4.1 Superset Approximation Based on RTNs 
The following approximation was proposed in Nederhof (1997). The presentation 
here, however, differs substantially from the earlier publication, which treated the ap- 
proximation process entirely on the level of context-free grammars: a self-embedding 
grammar was transformed in such a way that it was no longer self-embedding. A 
finite automaton was then obtained from the grammar by the algorithm discussed 
above. 
The presentation here is based on recursive transition networks (RTNs) (Woods 1970). We can see a context-free grammar as an RTN as follows: We introduce two states $q_A$ and $q'_A$ for each nonterminal $A$, and $m + 1$ states $q_0, \ldots, q_m$ for each rule $A \to X_1 \cdots X_m$. The states for a rule $A \to X_1 \cdots X_m$ are connected with each other and to the states for the left-hand side $A$ by one transition $(q_A, \epsilon, q_0)$, a transition $(q_{i-1}, X_i, q_i)$ for each $i$ such that $1 \leq i \leq m$, and one transition $(q_m, \epsilon, q'_A)$. (Actually, some epsilon transitions are avoided in our implementation, but we will not be concerned with such optimizations here.)
In this way, we obtain a finite automaton with initial state $q_A$ and final state $q'_A$ for each nonterminal $A$ and its defining rules $A \to X_1 \cdots X_m$. This automaton can be seen as one component of the RTN. The complete RTN is obtained by the collection of all such finite automata for different nonterminals.
An approximation now results if we join all the components in one big automaton, and if we approximate the usual mechanism of recursion by replacing each transition $(q, A, q')$ by two transitions $(q, \epsilon, q_A)$ and $(q'_A, \epsilon, q')$. The construction is illustrated in Figure 5.
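The construction just described is short enough to sketch in Python. The sketch below works on the complete grammar, and its state names (`("in", A)` for $q_A$, `("out", A)` for $q'_A$, and `(n, i)` for state $q_i$ of the rule with index `n`) as well as the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Rule = Tuple[str, Tuple[str, ...]]


def rtn_approximation(rules: List[Rule], start_symbol: str):
    """Return (delta, initial state, final state) of the approximation."""
    nonterminals = {lhs for lhs, _ in rules}
    delta = set()
    for n, (lhs, rhs) in enumerate(rules):
        delta.add((("in", lhs), "", (n, 0)))            # (q_A, eps, q_0)
        delta.add(((n, len(rhs)), "", ("out", lhs)))    # (q_m, eps, q'_A)
        for i, x in enumerate(rhs, start=1):
            if x in nonterminals:
                # (q, X, q') is replaced by (q, eps, q_X) and (q'_X, eps, q').
                delta.add(((n, i - 1), "", ("in", x)))
                delta.add((("out", x), "", (n, i)))
            else:
                delta.add(((n, i - 1), x, (n, i)))
    return delta, ("in", start_symbol), ("out", start_symbol)


def accepts(delta, start, final, word: str) -> bool:
    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            q = todo.pop()
            for p, label, r in delta:
                if p == q and label == "" and r not in seen:
                    seen.add(r)
                    todo.append(r)
        return seen
    current = closure({start})
    for symbol in word:
        current = closure({r for p, label, r in delta
                           if p in current and label == symbol})
    return final in current


# The grammar of Figure 5, with A as the start symbol.
RULES = [("A", ("a", "B", "b")), ("A", ("c", "A")),
         ("B", ("d", "A", "e")), ("B", ("f",))]
DELTA, START, FINAL = rtn_approximation(RULES, "A")
```

The string `afbeb` is accepted by this automaton although the grammar cannot generate it (every `e` requires a matching `d`), which illustrates that the result is a superset approximation.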
In terms of the original grammar, this approximation can be informally explained as follows: Suppose we have three rules $B \to \alpha A \beta$, $B' \to \alpha' A \beta'$, and $A \to \gamma$. Top-down, left-to-right parsing would proceed, for example, by recognizing $\alpha$ in the first rule; it would then descend into rule $A \to \gamma$, and recognize $\gamma$; it would then return to the first rule and subsequently process $\beta$. In the approximation, however, the finite automaton "forgets" which rule it came from when it starts to recognize $\gamma$, so that it may subsequently recognize $\beta'$ in the second rule.
(a)
$A \to a\,B\,b$
$A \to c\,A$
$B \to d\,A\,e$
$B \to f$
Figure 5
Application of the RTN method for the grammar in (a). The RTN is given in (b), and (c) presents the approximating finite automaton. We assume $A$ is the start symbol and therefore $q_A$ becomes the initial state and $q'_A$ becomes the final state in the approximating automaton.
For the sake of presentational convenience, the above describes a construction 
working on the complete grammar. However, our implementation applies the con- 
struction separately for each nonterminal in a set Ni such that recursive(Ni) = self, 
which leads to a separate subautomaton of the compact representation (Section 3). 
See Nederhof (1998) for a variant of this approximation that constructs finite trans- 
ducers rather than finite automata. 
We have further implemented a parameterized version of the RTN approximation. A state of the nondeterministic automaton is now also associated with a list $H$ of length $|H|$ strictly smaller than a number $d$, which is the parameter to the method. This list represents a history of rule positions that were encountered in the computation leading to the present state.
More precisely, we define an item to be an object of the form $[A \to \alpha \bullet \beta]$, where $A \to \alpha\beta$ is a rule from the grammar. These are the same objects as the "dotted" productions of Earley (1970). The dot indicates a position in the right-hand side.
The unparameterized RTN method had one state $q_I$ for each item $I$, and two states $q_A$ and $q'_A$ for each nonterminal $A$. The parameterized RTN method has one state $q_{IH}$ for each item $I$ and each list of items $H$ that represents a valid history for reaching $I$, and two states $q_{AH}$ and $q'_{AH}$ for each nonterminal $A$ and each list of items $H$ that represents a valid history for reaching $A$. Such a valid history is defined to be a list $H$ with $0 \leq |H| < d$ that represents a series of positions in rules that could have been invoked before reaching $I$ or $A$, respectively. More precisely, if we set $H = I_1 \cdots I_n$, then each $I_m$ ($1 \leq m \leq n$) should be of the form $[A_m \to \alpha_m \bullet B_m \beta_m]$ and for $1 \leq m < n$ we should have $A_m = B_{m+1}$. Furthermore, for a state $q_{IH}$ with $I = [A \to \alpha \bullet \beta]$ we demand $A = B_1$ if $n > 0$. For a state $q_{AH}$ we demand $A = B_1$ if $n > 0$. (Strictly speaking, states $q_{AH}$ and $q_{IH}$, with $|H| < d - 1$ and $I = [A \to \alpha \bullet \beta]$, will only be needed if $A_{|H|}$ is the start symbol in the case $|H| > 0$, or if $A$ is the start symbol in the case $H = \epsilon$.)
The transitions of the automaton that pertain to terminals in right-hand sides of rules are very similar to those in the case of the unparameterized method: For a state $q_{IH}$ with $I$ of the form $[A \to \alpha \bullet a\beta]$, we create a transition $(q_{IH}, a, q_{I'H})$, with $I' = [A \to \alpha a \bullet \beta]$.
Similarly, we create epsilon transitions that connect left-hand sides and right-hand sides of rules: For each state $q_{AH}$ there is a transition $(q_{AH}, \epsilon, q_{IH})$ for each item $I = [A \to \bullet\,\alpha]$, for some $\alpha$, and for each state of the form $q_{I'H}$, with $I' = [A \to \alpha\,\bullet]$, there is a transition $(q_{I'H}, \epsilon, q'_{AH})$.
For transitions that pertain to nonterminals in the right-hand sides of rules, we need to manipulate the histories. For a state $q_{IH}$ with $I$ of the form $[A \to \alpha \bullet B\beta]$, we create two epsilon transitions. One is $(q_{IH}, \epsilon, q_{BH'})$, where $H'$ is defined to be $IH$ if $|IH| < d$, and to be the first $d - 1$ items of $IH$, otherwise. Informally, we extend the history by the item $I$ representing the rule position that we have just come from, but the oldest information in the history is discarded if the history becomes too long. The second transition is $(q'_{BH'}, \epsilon, q_{I'H})$, with $I' = [A \to \alpha B \bullet \beta]$.
If the start symbol is $S$, the initial state is $q_S$ and the final state is $q'_S$ (after the symbol $S$ in the subscripts we find empty lists of items). Note that the parameterized method with $d = 1$ concurs with the unparameterized method, since the lists of items then remain empty.
An example with parameter $d = 2$ is given in Figure 6. For the unparameterized method, each $I = [A \to \alpha \bullet \beta]$ corresponded to one state (Figure 5). Since reaching $A$ can have three different histories of length shorter than 2 (the empty history, since $A$ is the start symbol; the history of coming from the rule position given by item $[A \to c \bullet A]$; and the history of coming from the rule position given by item $[B \to d \bullet A e]$), in Figure 6 we now have three states of the form $q_{IH}$ for each $I = [A \to \alpha \bullet \beta]$, as well as three states of the form $q_{AH}$ and $q'_{AH}$.
The higher we choose d, the more precise the approximation is, since the histories 
allow the automaton to simulate part of the mechanism of recursion from the original 
grammar, and the maximum length of the histories corresponds to the number of 
levels of recursion that can be simulated accurately. 
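The parameterized construction can be sketched in Python along the following lines. The sketch only creates states that are reachable from the initial state; items are encoded as pairs (rule index, dot position), histories as tuples of items with the most recent item first, and the state names, the function names, and the small example grammar are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Rule = Tuple[str, Tuple[str, ...]]


def parameterized_rtn(rules: List[Rule], start_symbol: str, d: int):
    """Return (delta, initial state, final state) for history bound d >= 1."""
    nonterminals = {lhs for lhs, _ in rules}
    delta = set()
    start = ("in", start_symbol, ())
    todo, seen = [start], {start}

    def visit(state):
        if state not in seen:
            seen.add(state)
            todo.append(state)

    while todo:
        state = todo.pop()
        if state[0] == "in":                        # state q_{AH}
            _, a, hist = state
            for n, (lhs, _) in enumerate(rules):
                if lhs == a:
                    target = ("item", (n, 0), hist)
                    delta.add((state, "", target))
                    visit(target)
            continue
        _, (n, dot), hist = state                   # state q_{IH}
        lhs, rhs = rules[n]
        if dot == len(rhs):                         # I = [A -> alpha .]
            delta.add((state, "", ("out", lhs, hist)))
        elif rhs[dot] in nonterminals:              # I = [A -> alpha . B beta]
            new_hist = ((n, dot),) + hist           # the list IH
            if len(new_hist) >= d:
                new_hist = new_hist[:d - 1]         # keep the first d-1 items
            callee = ("in", rhs[dot], new_hist)
            resume = ("item", (n, dot + 1), hist)
            delta.add((state, "", callee))
            delta.add((("out", rhs[dot], new_hist), "", resume))
            visit(callee)
            visit(resume)
        else:                                       # I = [A -> alpha . a beta]
            target = ("item", (n, dot + 1), hist)
            delta.add((state, rhs[dot], target))
            visit(target)
    return delta, start, ("out", start_symbol, ())


def accepts(automaton, word: str) -> bool:
    delta, start, final = automaton

    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            q = todo.pop()
            for p, label, r in delta:
                if p == q and label == "" and r not in seen:
                    seen.add(r)
                    todo.append(r)
        return seen

    current = closure({start})
    for symbol in word:
        current = closure({r for p, label, r in delta
                           if p in current and label == symbol})
    return final in current


# Hypothetical grammar: S -> a A b | c A d, A -> e.
RULES = [("S", ("a", "A", "b")), ("S", ("c", "A", "d")), ("A", ("e",))]
```

With $d = 1$ the automaton for this small grammar accepts `aed`, because it forgets which rule invoked $A$; with $d = 2$ the history of length one is enough to reject it.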
4.2 Refinement of RTN Superset Approximation 
We rephrase the method of Grimley-Evans (1997) as follows: First, we construct the approximating finite automaton according to the unparameterized RTN method above. Then an additional mechanism is introduced that ensures for each rule $A \to X_1 \cdots X_m$ separately that the list of visits to the states $q_0, \ldots, q_m$ satisfies some reasonable criteria: a visit to $q_i$, with $0 \leq i < m$, should be followed by one to $q_{i+1}$ or $q_0$. The latter option amounts to a nested incarnation of the rule. There is a complementary condition for what should precede a visit to $q_i$, with $0 < i \leq m$.
Since only pairs of consecutive visits to states from the set $\{q_0, \ldots, q_m\}$ are considered, finite-state techniques suffice to implement such conditions. This can be realized by attaching histories to the states as in the case of the parameterized RTN method above, but now each history is a set rather than a list, and can contain at most one item $[A \to \alpha \bullet \beta]$ for each rule $A \to \alpha\beta$. As reported by Grimley-Evans (1997) and confirmed by our own experiments, the nondeterministic finite automata resulting from this method may be quite large, even for small grammars. The explanation is that the number of such histories is exponential in the number of rules.
$A \to a\,B\,b$
$A \to c\,A$
$B \to d\,A\,e$
$B \to f$
Figure 6
Application of the parameterized RTN method with $d = 2$. We again assume $A$ is the start symbol. States $q_{IH}$ have not been labeled in order to avoid cluttering the picture.
We have refined the method with respect to the original publication by applying 
the construction separately for each nonterminal in a set Ni such that recursive(Ni) = 
self. 
4.3 Subset Approximation by Transforming the Grammar 
Putting restrictions on spines is another way to obtain a regular language. Several methods can be defined. The first method we present investigates spines in a very detailed way. It eliminates from the language only those sentences for which a subderivation is required of the form $B \to^* \alpha B \beta$, for some $\alpha \neq \epsilon$ and $\beta \neq \epsilon$. The motivation is that such sentences do not occur frequently in practice, since these subderivations make them difficult for people to comprehend (Resnik 1992). Their exclusion will therefore not lead to much loss of coverage of typical sentences, especially for simple application domains.
We express the method in terms of a grammar transformation in Figure 7. The effect of this transformation is that a nonterminal $A$ is tagged with a set of pairs $(B, Q)$, where $B$ is a nonterminal occurring higher in the spine; for any given $B$, at most one such pair $(B, Q)$ can be contained in the set. The set $Q$ may contain the element $l$ to indicate that something to the left of the part of the spine from $B$ to $A$ was generated. Similarly, $r \in Q$ indicates that something to the right was generated. If $Q = \{l, r\}$, then we have obtained a derivation $B \to^* \alpha A \beta$, for some $\alpha \neq \epsilon$ and $\beta \neq \epsilon$, and further occurrences of $B$ below $A$ should be blocked in order to avoid a derivation with self-embedding.
We are given a grammar $G = (\Sigma, N, P, S)$. The following is to be performed for each set $N_i \in \mathcal{N}$ such that recursive($N_i$) = self.
1. For each $A \in N_i$ and each $F \in 2^{(N_i \times 2^{\{l, r\}})}$, add the following nonterminal to $N$.
   • $A^F$.
2. For each $A \in N_i$, add the following rule to $P$.
   • $A \to A^{\emptyset}$.
3. For each $(A \to \alpha_0 A_1 \alpha_1 A_2 \cdots \alpha_{m-1} A_m \alpha_m) \in P$ such that $A, A_1, \ldots, A_m \in N_i$ and no symbols from $\alpha_0, \ldots, \alpha_m$ are members of $N_i$, and each $F$ such that $(A, \{l, r\}) \notin F$, add the following rule to $P$.
   • $A^F \to \alpha_0 A_1^{F_1} \alpha_1 \cdots A_m^{F_m} \alpha_m$, where, for $1 \leq j \leq m$,
     - $F_j = \{(B, Q \cup Q_l^j \cup Q_r^j) \mid (B, Q) \in F'\}$;
     - $F' = F \cup \{(A, \emptyset)\}$ if $\neg\exists Q\,[(A, Q) \in F]$, and $F' = F$ otherwise;
     - $Q_l^j = \emptyset$ if $\alpha_0 A_1 \alpha_1 \cdots A_{j-1} \alpha_{j-1} = \epsilon$, and $Q_l^j = \{l\}$ otherwise;
     - $Q_r^j = \emptyset$ if $\alpha_j A_{j+1} \alpha_{j+1} \cdots A_m \alpha_m = \epsilon$, and $Q_r^j = \{r\}$ otherwise.
4. Remove from $P$ the old rules of the form $A \to \alpha$, where $A \in N_i$.
5. Reduce the grammar.
Figure 7
Subset approximation by transforming the grammar.
An example is given in Figure 8. The original grammar is implicit in the depicted parse tree on the left, and contains at least the rules $S \to A\,a$, $A \to b\,B$, $B \to C$, and $C \to S$. This grammar is self-embedding, since we have a subderivation $S \to^* b S a$. We explain how $F_B$ is obtained from $F_A$ in the rule $A^{F_A} \to b\,B^{F_B}$. We first construct $F' = \{(S, \{r\}), (A, \emptyset)\}$ from $F_A = \{(S, \{r\})\}$ by adding $(A, \emptyset)$, since no other pair of the form $(A, Q)$ was already present. To the left of the occurrence of $B$ in the original rule $A \to b\,B$ we find a nonempty string $b$. This means that we have to add $l$ to all second components of pairs in $F'$, which gives us $F_B = \{(S, \{l, r\}), (A, \{l\})\}$.
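The computation of the tag of a right-hand-side nonterminal from the tag of the left-hand side can be reproduced with a few lines of Python. The names `child_tag` and `blocked` are hypothetical, and the two Boolean arguments stand for whether the material to the left and to the right of the nonterminal in the rule is nonempty.

```python
from typing import FrozenSet, Tuple

Tag = FrozenSet[Tuple[str, FrozenSet[str]]]  # a set F of pairs (B, Q)


def child_tag(parent_tag: Tag, lhs: str,
              left_nonempty: bool, right_nonempty: bool) -> Tag:
    """Tag F_j of a member of N_i in a rule with left-hand side lhs^F."""
    f_prime = set(parent_tag)
    if lhs not in {b for b, _ in parent_tag}:
        f_prime.add((lhs, frozenset()))          # F' = F united with {(A, {})}
    extra = set()
    if left_nonempty:
        extra.add("l")
    if right_nonempty:
        extra.add("r")
    return frozenset((b, q | extra) for b, q in f_prime)


def blocked(tag: Tag, nonterminal: str) -> bool:
    """No rule is created for a nonterminal A tagged with (A, {l, r})."""
    return (nonterminal, frozenset({"l", "r"})) in tag


# The spine of the example: S -> A a, A -> b B, B -> C, C -> S.
F_S = frozenset()
F_A = child_tag(F_S, "S", left_nonempty=False, right_nonempty=True)
F_B = child_tag(F_A, "A", left_nonempty=True, right_nonempty=False)
F_C = child_tag(F_B, "B", left_nonempty=False, right_nonempty=False)
F_LOWER_S = child_tag(F_C, "C", left_nonempty=False, right_nonempty=False)
```

Following the spine of the example in this way gives the same value for $F_B$ as derived in the text, and the tag of the lower occurrence of $S$ contains $(S, \{l, r\})$, so that occurrence is blocked.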
In the transformed grammar, the lower occurrence of $S$ in the tree is tagged with the set $\{(S, \{l, r\}), (A, \{l\}), (B, \emptyset), (C, \emptyset)\}$. The meaning is that higher up in the spine, we will find the nonterminals $S$, $A$, $B$, and $C$. The pair $(A, \{l\})$ indicates that since we saw $A$ on the spine, something to the left has been generated, namely, $b$. The pair $(B, \emptyset)$ indicates that nothing either to the left or to the right has been generated since we saw $B$. The pair $(S, \{l, r\})$ indicates that both to the left and to the right something has been generated (namely, $b$ on the left and $a$ on the right). Since this indicates that an offending subderivation $S \to^* \alpha S \beta$ has been found, further completion of the parse tree is blocked: the transformed grammar will not have any rules with left-hand side $S^{\{(S,\{l,r\}),(A,\{l\}),(B,\emptyset),(C,\emptyset)\}}$. In fact, after the grammar is reduced, any parse tree that is constructed can no longer even contain a node labeled by $S^{\{(S,\{l,r\}),(A,\{l\}),(B,\emptyset),(C,\emptyset)\}}$, or any nodes with labels of the form $A^F$ such that $(A, \{l, r\}) \in F$.
Figure 8
A parse tree in a self-embedding grammar (a), and the corresponding parse tree in the transformed grammar (b), for the transformation from Figure 7. The tags shown in (b) include $F_B = \{(S, \{l, r\}), (A, \{l\})\}$ and $F_C = \{(S, \{l, r\}), (A, \{l\}), (B, \emptyset)\}$; the lower occurrence of $S$ carries the tag $\{(S, \{l, r\}), (A, \{l\}), (B, \emptyset), (C, \emptyset)\}$. For the moment we ignore step 5 of Figure 7, i.e., reduction of the transformed grammar.
One could generalize this approximation in such a way that not all self-embedding is blocked, but only self-embedding occurring, say, twice in a row, in the sense of a subderivation of the form $A \to^* \alpha_1 A \beta_1 \to^* \alpha_1 \alpha_2 A \beta_2 \beta_1$. We will not do so here, because already for the basic case above, the transformed grammar can be huge due to the high number of nonterminals of the form $A^F$ that may result; the number of such nonterminals is exponential in the size of $N_i$.
We therefore present, in Figure 9, an alternative approximation that has a lower 
complexity. By parameter d, it restricts the number of rules along a spine that may 
generate something to the left and to the right. We do not, however, restrict pure left 
recursion and pure right recursion. Between two occurrences of an arbitrary rule, we 
allow left recursion followed by right recursion (which leads to tag r followed by tag 
rl), or right recursion followed by left recursion (which leads to tag l followed by 
tag lr). 
An example is given in Figure 10. As before, the rules of the grammar are implicit in the depicted parse tree. At the top of the derivation we find $S$. In the transformed grammar, we first have to apply $S \to S^{\top,0}$. The derivation starts with a rule $S \to A\,a$, which generates a string ($a$) to the right of a nonterminal ($A$). Before we can apply zero or more of such rules, we first have to apply a unit rule $S^{\top,0} \to S^{r,0}$ in the transformed grammar. For zero or more rules that subsequently generate something on the left, such as $A \to b\,B$, we have to obtain a superscript containing $rl$, and in the example this is done by applying $A^{r,0} \to A^{rl,0}$. Now we are finished with pure left recursion and pure right recursion, and apply $B^{rl,0} \to B^{\bot,0}$. This allows us to apply one unconstrained rule, which appears in the transformed grammar as $B^{\bot,0} \to c\,S^{\top,1}\,d$.
We are given a grammar $G = (\Sigma, N, P, S)$. The following is to be performed for each set $N_i \in \mathcal{N}$ such that recursive($N_i$) = self. The value $d$ stands for the maximum number of unconstrained rules along a spine, possibly alternated with a series of left-recursive rules followed by a series of right-recursive rules, or vice versa.
1. For each $A \in N_i$, each $Q \in \{\top, l, r, lr, rl, \bot\}$, and each $f$ such that $0 \leq f \leq d$, add the following nonterminals to $N$.
   • $A^{Q,f}$.
2. For each $A \in N_i$, add the following rule to $P$.
   • $A \to A^{\top,0}$.
3. For each $A \in N_i$ and $f$ such that $0 \leq f \leq d$, add the following rules to $P$.
   • $A^{\top,f} \to A^{l,f}$.
   • $A^{\top,f} \to A^{r,f}$.
   • $A^{l,f} \to A^{lr,f}$.
   • $A^{r,f} \to A^{rl,f}$.
   • $A^{lr,f} \to A^{\bot,f}$.
   • $A^{rl,f} \to A^{\bot,f}$.
4. For each $(A \to B\alpha) \in P$ such that $A, B \in N_i$ and no symbols from $\alpha$ are members of $N_i$, each $f$ such that $0 \leq f \leq d$, and each $Q \in \{r, lr\}$, add the following rule to $P$.
   • $A^{Q,f} \to B^{Q,f} \alpha$.
5. For each $(A \to \alpha B) \in P$ such that $A, B \in N_i$ and no symbols from $\alpha$ are members of $N_i$, each $f$ such that $0 \leq f \leq d$, and each $Q \in \{l, rl\}$, add the following rule to $P$.
   • $A^{Q,f} \to \alpha B^{Q,f}$.
6. For each $(A \to \alpha_0 A_1 \alpha_1 A_2 \cdots \alpha_{m-1} A_m \alpha_m) \in P$ such that $A, A_1, \ldots, A_m \in N_i$ and no symbols from $\alpha_0, \ldots, \alpha_m$ are members of $N_i$, and each $f$ such that $0 \leq f \leq d$, add the following rule to $P$, provided $m = 0 \vee f < d$.
   • $A^{\bot,f} \to \alpha_0 A_1^{\top,f+1} \alpha_1 \cdots A_m^{\top,f+1} \alpha_m$.
7. Remove from $P$ the old rules of the form $A \to \alpha$, where $A \in N_i$.
8. Reduce the grammar.
Figure 9
A simpler subset approximation by transforming the grammar.
Now the counter f has been increased from 0 at the start of the subderivation to 
1 at the end. Depending on the value d that we choose, we cannot build derivations 
by repeating subderivation $S \to^* b\,c\,S\,d\,a$ an unlimited number of times: at some
point the counter will exceed d. If we choose d = 0, then already the derivation at 
Computational Linguistics Volume 26, Number 1 
Figure 10
A parse tree in a self-embedding grammar (a), and the corresponding parse tree in the 
transformed grammar (b), for the simple subset approximation from Figure 9. 
Figure 10 (b) is no longer possible, since no nonterminal in the transformed grammar 
would contain 1 in its superscript. 
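As an illustration, steps 1 to 7 of Figure 9 can be sketched in Python for a single set $N_i$. This is a minimal sketch under stated assumptions, not the integrated implementation discussed in the text: the function name, the encoding of rules as (left-hand side, right-hand side tuple) pairs, the encoding of a nonterminal $A^{Q,f}$ as the triple (A, Q, f), and the tag names "top" and "bot" for $\top$ and $\bot$ are all chosen here for illustration. Step 8 (reducing the grammar) is omitted.

```python
def simple_subset_transform(rules, ni, d):
    """Sketch of steps 1-7 of Figure 9 for one set ni of mutually
    recursive nonterminals.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    ni:    set of nonterminals (hypothetical encoding: plain strings).
    d:     bound on the counter f.
    New nonterminals A^{Q,f} are encoded as triples (A, Q, f); they are
    introduced implicitly by the rules that mention them (step 1).
    """
    # Step 7: rules whose left-hand side is in ni are dropped; the rest stay.
    out = [(lhs, rhs) for lhs, rhs in rules if lhs not in ni]

    unit_steps = [("top", "l"), ("top", "r"), ("l", "lr"),
                  ("r", "rl"), ("lr", "bot"), ("rl", "bot")]
    for a in sorted(ni):
        out.append((a, ((a, "top", 0),)))                   # step 2
        for f in range(d + 1):                              # step 3
            for src, dst in unit_steps:
                out.append(((a, src, f), ((a, dst, f),)))

    for lhs, rhs in rules:
        if lhs not in ni:
            continue
        # Positions of right-hand side members that belong to ni.
        inside = [k for k, x in enumerate(rhs) if x in ni]
        for f in range(d + 1):
            if inside == [0]:                               # step 4: A -> B alpha
                for q in ("r", "lr"):
                    out.append(((lhs, q, f), ((rhs[0], q, f),) + rhs[1:]))
            if inside == [len(rhs) - 1]:                    # step 5: A -> alpha B
                for q in ("l", "rl"):
                    out.append(((lhs, q, f), rhs[:-1] + ((rhs[-1], q, f),)))
            if not inside or f < d:                         # step 6
                new_rhs = tuple((x, "top", f + 1) if x in ni else x
                                for x in rhs)
                out.append(((lhs, "bot", f), new_rhs))
    return out
```

With d = 0, step 6 produces no rule whose right-hand side contains a member of $N_i$, which is what blocks the derivation of Figure 10 (b) in that case.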
Because of the demonstrated increase of the counter f, this transformation is guaranteed
to remove self-embedding from the grammar. However, it is not as selective as
the transformation we saw before, in the sense that it may also block subderivations
that are not of the form $A \to^* \alpha A \beta$. Consider for example the subderivation from
Figure 10, but replacing the lower occurrence of S by any other nonterminal C that is
mutually recursive with S, A, and B. Such a subderivation $S \to^* b\,c\,C\,d\,a$ would also
be blocked by choosing d = 0. In general, increasing d allows more of such derivations
that are not of the form $A \to^* \alpha A \beta$ but also allows more derivations that are of that
form.
The reason for considering this transformation rather than any other that elim- 
inates self-embedding is purely pragmatic: of the many variants we have tried that 
yield nontrivial subset approximations, this transformation has the lowest complex- 
ity in terms of the sizes of intermediate structures and of the resulting finite au- 
tomata. 
In the actual implementation, we have integrated the grammar transformation and 
the construction of the finite automaton, which avoids reanalysis of the grammar to 
determine the partition of mutually recursive nonterminals after transformation. This 
integration makes use, for example, of the fact that for fixed $N_i$ and fixed $f$, the set of
nonterminals of the form $A^{l,f}$, with $A \in N_i$, is (potentially) mutually right-recursive.
A set of such nonterminals can therefore be treated as the corresponding case from 
Figure 2, assuming the value right. 
The full formulation of the integrated grammar transformation and construction 
of the finite automaton is rather long and is therefore not given here. A very similar 
formulation, for another grammar transformation, is given in Nederhof (1998). 
4.4 Superset Approximation through Pushdown Automata 
The distinction between context-free languages and regular languages can be seen in
terms of the distinction between pushdown automata and finite automata. Pushdown
automata maintain a stack that is potentially unbounded in height, which allows more
complex languages to be recognized than in the case of finite automata. Regular approximation
can be achieved by restricting the height of the stack, as we will see in
Section 4.5, or by ignoring the distinction between several stacks when they become 
too high. 
More specifically, the method proposed by Pereira and Wright (1997) first con- 
structs an LR automaton, which is a special case of a pushdown automaton. Then, 
stacks that may be constructed in the course of recognition of a string are computed 
one by one. However, stacks that contain two occurrences of a stack symbol are iden- 
tified with the shorter stack that results by removing the part of the stack between the 
two occurrences, including one of the two occurrences. This process defines a congru- 
ence relation on stacks, with a finite number of congruence classes. This congruence 
relation directly defines a finite automaton: each class is translated to a unique state of 
the nondeterministic finite automaton, shift actions are translated to transitions labeled 
with terminals, and reduce actions are translated to epsilon transitions. 
The method has a high complexity. First, construction of an LR automaton, of 
which the size is exponential in the size of the grammar, may be a prohibitively ex- 
pensive task (Nederhof and Satta 1996). This is, however, only a fraction of the effort 
needed to compute the congruence classes, of which the number is in turn exponen- 
tial in the size of the LR automaton. If the resulting nondeterministic automaton is 
determinized, we obtain a third source of exponential behavior. The time and space 
complexity of the method are thereby bounded by a triple exponential function in the 
size of the grammar. This theoretical analysis seems to be in keeping with the high 
costs of applying this method in practice, as will be shown later in this article. 
As proposed by Pereira and Wright (1997), our implementation applies the ap- 
proximation separately for each nonterminal occurring in a set Ni that reveals self- 
embedding. 
A different superset approximation based on LR automata was proposed by Baker 
(1981) and rediscovered by Heckert (1994). Each individual stack symbol is now trans- 
lated to one state of the nondeterministic finite automaton. It can be argued theoret- 
ically that this approximation differs from the unparameterized RTN approximation 
from Section 4.1 only under certain conditions that are not likely to occur very often 
in practice. This consideration is confirmed by our experiments to be discussed later. 
Our implementation differs from the original algorithm in that the approximation is 
applied separately for each nonterminal in a set Ni that reveals self-embedding. 
A generalization of this method was suggested by Bermudez and Schimpf (1990). 
For a fixed number d > 0 we investigate sequences of d top-most elements of stacks 
that may arise in the LR automaton, and we translate these to states of the finite 
automaton. More precisely, we define another congruence relation on stacks, such that 
we have one congruence class for each sequence of d stack symbols and this class 
contains all stacks that have that sequence as d top-most elements; we have a separate 
class for each stack that contains fewer than d elements. As before, each congruence 
class is translated to one state of the nondeterministic finite automaton. Note that the 
case d = 1 is equivalent to the approximation in Baker (1981). 
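The two ways of identifying stacks discussed in this section can be illustrated with a small Python sketch. It is a simplification under stated assumptions: stacks are tuples of stack symbols listed from bottom to top, stack symbols are represented by arbitrary integers, and the function names are hypothetical. The first function applies the identification of stacks with a repeated symbol at the moment a symbol is pushed, which is one possible reading of the description above. Neither function constructs the LR automaton or the resulting finite automaton.

```python
def push_identifying_repeats(stack, symbol):
    """Push `symbol`; if it already occurs in `stack`, identify the result
    with the shorter stack obtained by cutting back to the earlier
    occurrence (the part between the two occurrences is removed, together
    with one of the two occurrences)."""
    if symbol in stack:
        return stack[: stack.index(symbol) + 1]
    return stack + (symbol,)


def top_d_class(stack, d):
    """Congruence class of `stack` for fixed d > 0: the d top-most
    elements, or the whole stack if it has fewer than d elements."""
    return stack[-d:] if len(stack) >= d else stack
```

Since stacks built with the first function never contain a symbol twice, only finitely many of them exist for a finite set of stack symbols. For the second function, d = 1 yields one class per stack symbol.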
If we replace the LR automaton by a certain type of automaton that performs top- 
down recognition, then the method in Bermudez and Schimpf (1990) amounts to the 
parameterized RTN method from Section 4.1; note that the histories from Section 4.1 
in fact function as stacks, the items being the stack symbols. 
4.5 Subset Approximation through Pushdown Automata 
By restricting the height of the stack of a pushdown automaton, one obstructs recognition
of a set of strings in the context-free language, and therefore a subset approximation
results. This idea was proposed by Krauwer and des Tombe (1981), Langendoen
and Langsam (1987), and Pulman (1986), and was rediscovered by Black (1989) and 
recently by Johnson (1998). Since the latest publication in this area is more explicit in 
its presentation, we will base our treatment on this, instead of going to the historical 
roots of the method. 
One first constructs a modified left-corner recognizer from the grammar, in the 
form of a pushdown automaton. The stack height is bounded by a low number; 
Johnson (1998) claims a suitable number would be 5. The motivation for using the 
left-corner strategy is that the height of the stack maintained by a left-corner parser 
is already bounded by a constant in the absence of self-embedding. If the artificial 
bound imposed by the approximation method is chosen to be larger than or equal to 
this natural bound, then the approximation may be exact. 
Our own implementation is more refined than the published algorithms mentioned 
above, in that it defines a separate left-corner recognizer for each nonterminal A such 
that $A \in N_i$ and $\mathit{recursive}(N_i) = \mathit{self}$, for some $i$. In the construction of one such recognizer,
nonterminals that do not belong to Ni are treated as terminals, as in all other methods 
discussed here. 
4.6 Superset Approximation by N-grams 
An approximation from Seyfarth and Bermudez (1995) can be explained as follows.
Define the set of all terminals reachable from nonterminal $A$ to be
$\Sigma_A = \{a \mid \exists \alpha, \beta\, [A \to^* \alpha a \beta]\}$. We now approximate the set of strings derivable from $A$ by $\Sigma_A^*$, which is the
set of strings consisting of terminals from $\Sigma_A$. Our implementation is made slightly
more sophisticated by taking $\Sigma_A$ to be $\{X \mid \exists B, \alpha, \beta\, [B \in N_i \wedge B \to \alpha X \beta \wedge X \notin N_i]\}$, for
each $A$ such that $A \in N_i$ and $\mathit{recursive}(N_i) = \mathit{self}$, for some $i$. That is, each $X \in \Sigma_A$ is
a terminal, or a nonterminal not in the same set $N_i$ as $A$, but immediately reachable
from set $N_i$, through $B \in N_i$.
This method can be generalized, inspired by Stolcke and Segal (1994), who derive
N-gram probabilities from stochastic context-free grammars. By ignoring the probabilities,
each $N = 1, 2, 3, \ldots$ gives rise to a superset approximation that can be described
as follows: The set of strings derivable from a nonterminal $A$ is approximated by the
set of strings $a_1 \cdots a_n$ such that
• for each substring $v = a_{i+1} \cdots a_{i+N}$ ($0 \le i \le n - N$) we have $A \to^* w v y$, for
some $w$ and $y$,
• for each prefix $v = a_1 \cdots a_i$ ($0 \le i \le n$) such that $i < N$ we have $A \to^* v y$,
for some $y$, and
• for each suffix $v = a_{i+1} \cdots a_n$ ($0 \le i \le n$) such that $n - i < N$ we have
$A \to^* w v$, for some $w$.
(Again, the algorithms that we actually implemented are more refined and take into 
account the sets Ni.) 
The approximation from Seyfarth and Bermudez (1995) can be seen as the case N = 
1, which will henceforth be called the unigram method. We have also experimented 
with the cases N = 2 and N = 3, which will be called the bigram and trigram methods. 
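The unigram case can be sketched in a few lines of Python. The sketch covers only the basic definition of $\Sigma_A$, not the refinement with the sets $N_i$, and it assumes a reduced grammar. The function names and the encoding of rules as (left-hand side, right-hand side tuple) pairs are assumptions made for this illustration.

```python
def reachable_terminals(rules, nonterminals, start):
    """Compute Sigma_A for A = start: all terminals occurring in some
    sentential form derivable from `start` (simple worklist search)."""
    seen, todo, terminals = {start}, [start], set()
    while todo:
        a = todo.pop()
        for lhs, rhs in rules:
            if lhs != a:
                continue
            for x in rhs:
                if x in nonterminals:
                    if x not in seen:       # visit each nonterminal once
                        seen.add(x)
                        todo.append(x)
                else:
                    terminals.add(x)
    return terminals


def unigram_accepts(sentence, sigma_a):
    """Membership in Sigma_A^*: every token must be a reachable terminal."""
    return all(token in sigma_a for token in sentence)
```

For the palindrome grammar with rules S → a S a, S → b S b, and S → ε, the first function returns the set containing a and b, so the approximation is $\{a, b\}^*$.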
5. Increasing the Precision 
The methods of approximation described above take as input the parts of the grammar 
that pertain to self-embedding. It is only for those parts that the language is affected.
This leads us to a way to increase the precision: before applying any of the above 
methods of regular approximation, we first transform the grammar. 
This grammar transformation copies grammar rules containing recursive nonter- 
minals and, in the copies, it replaces these nonterminals by new nonrecursive nonter- 
minals. The new rules take over part of the roles of the old rules, but since the new 
rules do not contain recursion and therefore do not pertain to self-embedding, they 
remain unaffected by the approximation process. 
Consider for example the palindrome grammar from Figure 1. The RTN method
will yield a rather crude approximation, namely, the language $\{a, b\}^*$. We transform
this grammar in order to keep the approximation process away from the first three
levels of recursion. We achieve this by introducing three new nonterminals $S[1]$, $S[2]$,
and $S[3]$, and by adding modified copies of the original grammar rules, so that we
obtain:
$S[1] \to a\,S[2]\,a \mid b\,S[2]\,b \mid \epsilon$
$S[2] \to a\,S[3]\,a \mid b\,S[3]\,b \mid \epsilon$
$S[3] \to a\,S\,a \mid b\,S\,b \mid \epsilon$
$S \to a\,S\,a \mid b\,S\,b \mid \epsilon$
The new start symbol is $S[1]$.
The new grammar generates the same language as before, but the approximation
process leaves unaffected the nonterminals $S[1]$, $S[2]$, and $S[3]$ and the rules defining
them, since these nonterminals are not recursive. These nonterminals amount to the
upper three levels of the parse trees, and therefore the effect of the approximation
on the language is limited to lower levels. If we apply the RTN method then we
obtain the language that consists of (grammatical) palindromes of the form $w w^R$, where
$w \in \{\epsilon, a, b\} \cup \{a, b\}^2 \cup \{a, b\}^3$, plus (possibly ungrammatical) strings of the form $w v w^R$,
where $w \in \{a, b\}^3$ and $v \in \{a, b\}^*$. ($w^R$ indicates the mirror image of $w$.)
The grammar transformation in its full generality is given by the following, which
is to be applied for fixed integer $j > 0$, which is a parameter of the transformation,
and for each $N_i$ such that $\mathit{recursive}(N_i) = \mathit{self}$.
For each nonterminal $A \in N_i$ we introduce $j$ new nonterminals $A[1], \ldots, A[j]$. For
each $A \to X_1 \cdots X_m$ in $P$ such that $A \in N_i$, and $h$ such that $1 \le h \le j$, we add
$A[h] \to X'_1 \cdots X'_m$ to $P$, where for $1 \le k \le m$:
$X'_k = X_k[h + 1]$ if $X_k \in N_i \wedge h < j$
$X'_k = X_k$ otherwise
Further, we replace all rules $A \to X_1 \cdots X_m$ such that $A \notin N_i$ by $A \to X'_1 \cdots X'_m$, where
for $1 \le k \le m$:
$X'_k = X_k[1]$ if $X_k \in N_i$
$X'_k = X_k$ otherwise
If the start symbol $S$ was in $N_i$, we let $S[1]$ be the new start symbol.
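The transformation can be sketched in Python for a single set $N_i$. This is an illustrative sketch only: the function name, the encoding of rules as (left-hand side, right-hand side tuple) pairs, and the naming of the copy $A[h]$ by the string "A[h]" are assumptions made here. For several sets $N_i$ the function would have to be applied once per set.

```python
def unfold_recursion(rules, ni, j, start):
    """Copy the rules for nonterminals in `ni` into j nonrecursive levels.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    ni:    one set of mutually recursive nonterminals (plain strings).
    j:     number of levels kept away from the approximation (j > 0).
    Returns the new rule list and the new start symbol.
    """
    out = []
    for lhs, rhs in rules:
        if lhs in ni:
            out.append((lhs, rhs))                  # original rule is kept
            for h in range(1, j + 1):
                # Level h refers to level h + 1, except at the last level,
                # which falls back to the original recursive nonterminals.
                new_rhs = tuple(f"{x}[{h + 1}]" if x in ni and h < j else x
                                for x in rhs)
                out.append((f"{lhs}[{h}]", new_rhs))
        else:
            # Rules outside ni enter the recursion at level 1.
            out.append((lhs, tuple(f"{x}[1]" if x in ni else x for x in rhs)))
    new_start = f"{start}[1]" if start in ni else start
    return out, new_start
```

Applied to the palindrome grammar with j = 3, the sketch yields the four groups of rules for S[1], S[2], S[3], and S shown in the example above, with S[1] as the new start symbol.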
A second transformation, which shares some characteristics with the one above, 
was presented in Nederhof (1997). One of the earliest papers suggesting such transformations
as a way to increase the precision of approximation is due to Čulik and Cohen
(1973), who only discuss examples, however; no general algorithms were defined.
[Figure 11 plots. Left panel: horizontal axis corpus size (# sentences). Right panel: horizontal axis length (# words).]
Figure 11 
The test material. The left-hand curve refers to the construction of the grammar from 332 
sentences, the right-hand curve refers to the corpus of 1,000 sentences used as input to the 
finite automata. 
6. Empirical Results 
In this section we investigate empirically how the respective approximation methods
behave on grammars of different sizes and how much the approximated languages
differ from the original context-free languages. This last question is difficult to answer
precisely. Both an original context-free language and an approximating regular language
generally consist of an infinite number of strings, and the number of strings
that are introduced in a superset approximation or that are excluded in a subset approximation
may also be infinite. This makes it difficult to attach numbers to the
"quality" of approximations.
We have opted for a pragmatic approach, which does not require investigation of
the entire infinite languages of the grammar and the finite automata, but looks at a
certain finite set of strings taken from a corpus, as discussed below. For this finite set
of strings, we measure the percentage that overlaps with the investigated languages.
For the experiments, we took context-free grammars for German, generated automatically
from an HPSG and a spoken-language corpus of 332 sentences. This corpus
consists of sentences possessing grammatical phenomena of interest, manually selected 
from a larger corpus of actual dialogues. An HPSG parser was applied on these sen- 
tences, and a form of context-free backbone was selected from the first derivation that 
was found. (To take the first derivation is as good as any other strategy, given that we 
have at present no mechanisms for relative ranking of derivations.) The label occur- 
ring at a node together with the sequence of labels at the daughter nodes was then 
taken to be a context-free rule. The collection of such rules for the complete corpus 
forms a context-free grammar. Due to the incremental nature of this construction of 
the grammar, we can consider the subgrammars obtained after processing the first p 
sentences, where p = 1, 2, 3, ..., 332. See Figure 11 (left) for the relation between p and
the number of rules of the grammar. The construction is such that rules have at most 
two members in the right-hand side. 
As input, we considered a set of 1,000 sentences, obtained independently from the 
332 sentences mentioned above. These 1,000 sentences were found by having a speech 
recognizer provide a single hypothesis for each utterance, where utterances come from 
actual dialogues. Figure 11 (right) shows how many sentences of different lengths the 
corpus contains, up to length 30. Above length 25, this number quickly declines, but 
still a fair quantity of longer strings can be found, e.g., 11 strings of a length between 
51 and 60 words. In most cases however such long strings are in fact composed of a 
number of shorter sentences. 
Each of the 1,000 sentences was input in its entirety to the automata, although
in practical spoken-language systems, often one is not interested in the grammaticality
of complete utterances, but tries to find substrings that form certain phrases bearing
information relevant to the understanding of the utterance. We will not be concerned
here with the exact way such recognition of substrings could be realized by means of
finite automata, since this is outside the scope of this paper.
For the respective methods of approximation, we measured the size of the com- 
pact representation of the nondeterministic automaton, the number of states and the 
number of transitions of the minimal deterministic automaton, and the percentage 
of sentences that were recognized, in comparison to the percentage of grammatical 
sentences. For the compact representation, we counted the number of lines, which is 
roughly the sum of the numbers of transitions from all subautomata, not considering 
about three additional lines per subautomaton for overhead. 
We investigated the size of the compact representation because it is reasonably 
implementation independent, barring optimizations of the approximation algorithms 
themselves that affect the sizes of the subautomata. For some methods, we show that 
there is a sharp increase in the size of the compact representation for a small increase 
in the size of the grammar, which gives us a strong indication of how difficult it 
would be to apply the method to much larger grammars. Note that the size of the 
compact representation is a (very) rough indication of how much effort is involved in 
determinization, minimization, and substitution of the subautomata into each other. 
For determinization and minimization of automata, we have applied programs from 
the FSM library described in Mohri, Pereira, and Riley (1998). This library is considered 
to be competitive with respect to other tools for processing of finite-state machines. 
When these programs cannot determinize or minimize in reasonable time and space 
some subautomata constructed by a particular method of approximation, then this can 
be regarded as an indication of the impracticality of the method. 
We were not able to compute the compact representation for all the methods 
and all the grammars. The refined RTN approximation from Section 4.2 proved to be 
quite problematic. We were not able to compute the compact representation for any 
of the automatically obtained grammars in our collection that were self-embedding. 
We therefore eliminated individual rules by hand, starting from the smallest self- 
embedding grammar in our collection, eventually finding grammars small enough to 
be handled by this method. The results are given in Table 1. Note that the size of the 
compact representation increases significantly for each additional grammar rule. The 
sizes of the finite automata, after determinization and minimization, remain relatively 
small. 
Also problematic was the first approximation from Section 4.4, which was based 
on LR parsing following Pereira and Wright (1997). Even for the grammar of 50 rules, 
we were not able to determinize and minimize one of the subautomata according 
to step 1 of Section 3: we stopped the process after it had reached a size of over 600 
megabytes. Results, as far as we could obtain them, are given in Table 2. Note the sharp 
increases in the size of the compact representation, resulting from small increases, from 
44 to 47 and from 47 to 50, in the number of rules, and note an accompanying sharp 
increase in the size of the finite automaton. For this method, we see no possibility 
of accomplishing the complete approximation process, including determinization and 
minimization, for grammars in our collection that are substantially larger than 50 rules. 
Since no grammars of interest could be handled by them, the above two methods 
will be left out of further consideration. 
Table 1
Size of the compact representation and number of states and transitions,
for the refined RTN approximation (Grimley-Evans 1997).

Grammar Size   Compact Representation   # of States   # of Transitions
10                                133            11                 14
12                                427            17                 26
13                              1,139            17                 34
14                              4,895            17                 36
15                             16,297            17                 40
16                             51,493            19                 52
17                            208,350            19                 52
18                            409,348            21                 59
19                          1,326,256            21                 61
Table 2
Size of the compact representation and number of states and transitions,
for the superset approximation based on LR automata following Pereira
and Wright (1997).

Grammar Size   Compact Representation   # of States   # of Transitions
35                             15,921           350              2,125
44                             24,651           499              4,352
47                            151,226         5,112             35,754
50                            646,419             ?                  ?
Below, we refer to the unparameterized and parameterized approximations based 
on RTNs (Section 4.1) as RTN and RTNd, respectively, for d = 2,3; to the subset 
approximation from Figure 9 as Subd, for d = 1, 2, 3; and to the second and third 
methods from Section 4.4, which were based on LR parsing following Baker (1981) 
and Bermudez and Schimpf (1990), as LR and LRd, respectively, for d = 2, 3. We refer 
to the subset approximation based on left-corner parsing from Section 4.5 as LCd, for 
the maximal stack height of d = 2, 3, 4; and to the methods discussed in Section 4.6 as 
Unigram, Bigram, and Trigram. 
We first discuss the compact representation of the nondeterministic automata. In 
Figure 12 we use two different scales to be able to represent the large variety of values. 
For the method Subd, the compact representation is of purely theoretical interest for 
grammars larger than 156 rules in the case of Sub1, for those larger than 62 rules 
in the case of Sub2, and for those larger than 35 rules in the case of Sub3, since 
the minimal deterministic automata could thereafter no longer be computed with a 
reasonable bound on resources; we stopped the processes after they had consumed 
over 400 megabytes. For LC3, LC4, RTN3, LR2, and LR3, this was also the case for 
grammars larger than 139, 62, 156, 217, and 156 rules, respectively. The sizes of the 
compact representation seem to grow moderately for LR and Bigram, in the upper 
panel, yet the sizes are much larger than those for RTN and Unigram, which are 
indicated in the lower panel. 
The numbers of states for the respective methods are given in Figure 13, again 
using two very different scales. As in the case of the grammars, the terminals of our 
finite automata are parts of speech rather than words. This means that in general there 
will be nondeterminism during application of an automaton on an input sentence due 
to lexical ambiguity. This nondeterminism can be handled efficiently using tabular 
[Figure 12 plots. Upper panel: LC4, LR3, RTN3, LC3, LR2, Trigram, LC2, RTN2, LR, Bigram. Lower panel: RTN2, LC2, LR, Bigram, Sub3, Sub2, Sub1, RTN, Unigram. Horizontal axis: grammar size.]
Figure 12
Size of the compact representation.
techniques, provided the number of states is not too high. This consideration favors 
methods that produce low numbers of states, such as Trigram, LR, RTN, Bigram, and 
Unigram. 
[Figure 13 plots. Upper panel: Sub2, LC4, Sub1, LC3, RTN3, LC2. Lower panel: LC3, RTN3, LR3, LC2, RTN2, Trigram, LR2, LR, RTN, Bigram, Unigram. Horizontal axis: grammar size.]
Figure 13 
Number of states of the determinized and minimized automata.
Note that the numbers of states for LR and RTN differ very little. In fact, for 
some of the smallest and for some of the largest grammars in our collection, the 
resulting automata were identical. Note, however, that the intermediate results for LR 
(Figure 12) are much larger. It should therefore be concluded that the "sophistication" 
of LR parsing is here merely an avoidable source of inefficiency. 
The numbers of transitions for the respective methods are given in Figure 14. 
Again, note the different scales used in the two panels. The numbers of transitions 
roughly correspond to the storage requirements for the automata. It can be seen that, 
again, Trigram, LR, RTN, Bigram, and Unigram perform well. 
The precision of the respective approximations is measured in terms of the per- 
centage of sentences in the corpus that are recognized by the automata, in comparison 
to the percentage of sentences that are generated by the grammar, as presented by Fig- 
ure 15. The lower panel represents an enlargement of a section from the upper panel. 
Methods that could only be applied for the smaller grammars are only presented in 
the lower panel; LC4 and Sub2 have been omitted entirely. 
The curve labeled G represents the percentage of sentences generated by the gram- 
mar. Note that since all approximation methods compute either supersets or subsets, a 
particular automaton cannot both recognize some ungrammatical sentences and reject 
some grammatical sentences. 
Unigram and Bigram recognize very high percentages of ungrammatical sentences. 
Much better results were obtained for RTN. The curve for LR would not be distin- 
guishable from that for RTN in the figure, and is therefore omitted. (For only two of 
the investigated grammars was there any difference, the largest difference occurring 
for grammar size 217, where 34.1 versus 34.5 percent of sentences were recognized 
in the cases of LR and RTN, respectively.) Trigram remains very close to RTN (and 
LR); for some grammars a lower percentage is recognized, for others a higher per- 
centage is recognized. LR2 seems to improve slightly over RTN and Trigram, but data 
is available only for small grammars, due to the difficulty of applying the method to 
larger grammars. A more substantial improvement is found for RTN2. Even smaller 
percentages are recognized by LR3 and RTN3, but again, data is available only for 
small grammars. 
The subset approximations LC3 and Sub1 remain very close to G, but here again 
only data for small grammars is available, since these two methods could not be 
applied on larger grammars. Although application of LC2 on larger grammars required 
relatively few resources, the approximation is very crude: only a small percentage of 
the grammatical sentences are recognized. 
We also performed experiments with the grammar transformation from Section 5, 
in combination with the RTN method. We found that for increasing j, the interme- 
diate automata soon became too large to be determinized and minimized, with a 
bound on the memory consumption of 400 megabytes. The sizes of the automata that 
we were able to compute are given in Figure 16. RTN+j, for j = 1, 2, 3, 4, 5, represents
the (unparameterized) RTN method in combination with the grammar transformation
with parameter j. This is not to be confused with the parameterized RTNd
method. 
Figure 17 indicates the number of sentences in the corpus that are recognized by 
an automaton divided by the number of sentences in the corpus that are generated 
by the grammar. For comparison, the figure also includes curves for RTNd, where 
d = 2, 3 (cf. Figure 15). We see that j = 1, 2 has little effect. For j = 3, 4, 5, however,
the approximating language becomes substantially smaller than that in the case of
RTN, but at the expense of large automata. In particular, if we compare the sizes of 
the automata for RTN+j in Figure 16 with those for RTNd in Figures 13 and 14, then 
Figure 17 suggests the large sizes of the automata for RTN+j are not compensated 
adequately by a reduction of the percentage of sentences that are recognized. RTNd 
seems therefore preferable to RTN+j. 
[Figure 14 plots. Upper panel: Sub2, LC4, Sub1, LC3, LC2, RTN2. Lower panel: LC3, RTN3, LR3, LC2, RTN2, Trigram, LR2, LR, RTN, Bigram, Unigram. Horizontal axis: grammar size.]
Figure 14 
Number of transitions of the determinized and minimized automata. 
7. Conclusions 
If we apply the finite automata with the intention of filtering out incorrect sentences, 
for example from the output from a speech recognizer, then it is allowed that a 
[Figure 15 plots. Upper panel: Unigram, Bigram, Trigram, RTN (LR), RTN2, G, LC2. Lower panel: Bigram, RTN (LR), Trigram, LR2, RTN2, LR3, RTN3, G, LC3, Sub1, LC2. Horizontal axis: grammar size.]
Figure 15
Percentage of sentences that are recognized.
certain percentage of ungrammatical input is recognized. Recognizing ungrammat- 
ical input merely makes filtering less effective; it does not affect the functionality 
of the system as a whole, provided we assume that the grammar specifies exactly 
the set of sentences that can be successfully handled by a subsequent phase of pro- 
41 
Computational Linguistics Volume 26, Number 1 
10000 
8000 
6000 
4000 
2000 
:!! 
i i i i 
RTN+5 -~-- 
RTN+4 -0-- 
RTN+3 -+-- 
RTN+2 -o-- 
RTN+I ..... 
a %0 
, / ~,, ,, 
;/ .4- -" 
50 100 150 200 250 300 350 400 450 500 550 
grammar size 
250000 
200000 
o 150000 
100000 
50000 
iii,ii 
:i 
a ii j 
} 
! o 
i i i i i 
RTN+5 -z~-- 
RTN+4 -o-- 
RTN+3 -+-- 
RTN+2 -ra-- 
RTN+I "~'" 
,+" ~-E\],\[3 
^ _~/ ,, 4z''+ 
0 :"~'~"" ............. 
0 50 100 150 200 250 300 350 400 450 500 550 
grammar size 
Figure 16 
Number of states and number of transitions of the determinized and minimized automata. 
1.6 
1.5 
1.4 
1.3 
s 
1.2 
1.1 
i i 
RTN 
RTN2 -a--- 
RTN3 
RTN+I -~<-- 
RTN+2 -D-- 
RTN+3 -+-- 
RTN+4 -o-- 
RTN+5 -a-- 
1 
50 100 150 200 250 300 350 400 
grammar size 
Figure 17 
Number of recognized sentences divided by number of grammatical sentences. 
cessing. Also allowed is that "pathological" grammatical sentences are rejected that 
seldom occur in practice; an example are sentences requiring multiple levels of self- 
embedding. 
Of the methods we considered that may lead to rejection of grammatical sen- 
tences, i.e., the subset approximations, none seems of much practical value. The most 
serious problem is the complexity of the construction of automata from the compact 
representation for large grammars. Since the tools we used for obtaining the minimal 
deterministic automata are considered to be of high quality, it seems unlikely that 
alternative implementations could succeed on much larger grammars, especially con- 
sidering the sharp increases in the sizes of the automata for small increases in the size 
of the grammar. Only LC2 could be applied with relatively few resources, but this is a 
very crude approximation, which leads to rejection of many more sentences than just 
those requiring self-embedding. 
Similarly, some of the superset approximations are not applicable to large grammars 
because of the high costs of obtaining the minimal deterministic automata. Some 
others provide rather large languages, and therefore do not allow very effective 
filtering of ungrammatical input. One method, however, seems to be excellently suited 
for large grammars, namely, the RTN method, considering both the unparameterized 
version and the parameterized version with d = 2. In both cases, the size of the au- 
tomaton grows moderately in the grammar size. For the unparameterized version, the 
compact representation also grows moderately. Furthermore, the percentage of recog- 
nized sentences remains close to the percentage of grammatical sentences. It seems 
therefore that, under the conditions of our experiments, this method is the most suit- 
able regular approximation that is presently available. 
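The filtering application considered in these conclusions can be illustrated with a minimal Python sketch, under the assumption that the approximating automaton is deterministic and stored as a transition table. The toy automaton, the word-level tokenization by whitespace, and the names `accepts` and `filter_hypotheses` are hypothetical and do not come from the experiments described above.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

# A deterministic automaton as a partial transition table:
# (state, input word) -> next state. Missing entries mean rejection.
Transitions = Dict[Tuple[int, str], int]


def accepts(transitions: Transitions, start: int, finals: Set[int],
            words: Sequence[str]) -> bool:
    """Return True if the automaton recognizes the sequence of words."""
    state = start
    for word in words:
        key = (state, word)
        if key not in transitions:
            return False  # no transition defined: the hypothesis is rejected
        state = transitions[key]
    return state in finals


def filter_hypotheses(hypotheses: Iterable[str], transitions: Transitions,
                      start: int, finals: Set[int]) -> List[str]:
    """Keep only the hypotheses recognized by the automaton, in input order."""
    return [h for h in hypotheses
            if accepts(transitions, start, finals, h.split())]


# Hypothetical toy automaton recognizing "show flights" and "show me flights".
toy_transitions: Transitions = {
    (0, "show"): 1,
    (1, "me"): 2,
    (1, "flights"): 3,
    (2, "flights"): 3,
}
toy_hypotheses = [
    "show me flights",
    "show flights",
    "show me",
    "flights show me",
]
kept = filter_hypotheses(toy_hypotheses, toy_transitions, start=0, finals={3})
print(kept)
```

With a superset approximation, every hypothesis removed by such a filter is guaranteed to be ungrammatical, while some ungrammatical hypotheses may still pass; this is why overgeneration only reduces the effectiveness of the filter.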
Acknowledgments 
This paper could not have been written 
without the wonderful help of Hans-Ulrich 
Krieger, who created the series of grammars 
that are used in the experiments. I also owe 
to him many thanks for countless 
discussions and for allowing me to pursue 
this work. I am very grateful to the 
anonymous referees for their inspiring 
suggestions. 
This work was funded by the German 
Federal Ministry of Education, Science, 
Research and Technology (BMBF) in the 
framework of the VERBMOBIL Project under 
Grant 01 IV 701 V0. 
References 
Baker, Theodore P. 1981. Extending 
lookahead for LR parsers. Journal of 
Computer and System Sciences, 22:243-259. 
Bermudez, Manuel E. and Karl M. Schimpf. 
1990. Practical arbitrary lookahead LR 
parsing. Journal of Computer and System 
Sciences, 41:230-250. 
Berstel, Jean. 1979. Transductions and 
Context-Free Languages. B. G. Teubner, 
Stuttgart. 
Black, Alan W. 1989. Finite state machines 
from feature grammars. In International 
Workshop on Parsing Technologies, pages 
277-285, Pittsburgh, PA. 
Chomsky, Noam. 1959a. A note on phrase 
structure grammars. Information and 
Control, 2:393-395. 
Chomsky, Noam. 1959b. On certain formal 
properties of grammars. Information and 
Control, 2:137-167. 
Culik, Karel II and Rina Cohen. 1973. 
LR-regular grammars--An extension of 
LR(k) grammars. Journal of Computer and 
System Sciences, 7:66-96. 
Earley, Jay. 1970. An efficient context-free 
parsing algorithm. Communications of the 
ACM, 13(2):94-102, February. 
Grimley-Evans, Edmund. 1997. 
Approximating context-free grammars 
with a finite-state calculus. In Proceedings 
of the 35th Annual Meeting of the Association 
for Computational Linguistics and 8th 
Conference of the European Chapter of the 
Association for Computational Linguistics, 
pages 452-459, Madrid, Spain. 
Harrison, Michael A. 1978. Introduction to 
Formal Language Theory. Addison-Wesley. 
Heckert, Erik. 1994. Behandlung von 
Syntaxfehlern für LR-Sprachen ohne 
Korrekturversuche. Ph.D. thesis, 
Ruhr-Universität Bochum. 
Johnson, Mark. 1998. Finite-state 
approximation of constraint-based 
grammars using left-corner grammar 
transforms. In COLING-ACL '98: 36th 
Annual Meeting of the Association for 
Computational Linguistics and 17th 
International Conference on Computational 
Linguistics, volume 1, pages 619-623, 
Montreal, Quebec, Canada. 
Krauwer, Steven and Louis des Tombe. 1981. 
Transducers and grammars as theories of 
language. Theoretical Linguistics, 8:173-202. 
Langendoen, D. Terence and Yedidyah 
Langsam. 1987. On the design of finite 
transducers for parsing phrase-structure 
languages. In Alexis Manaster-Ramer, 
editor, Mathematics of Language. John 
Benjamins, Amsterdam, pages 191-235. 
Mohri, Mehryar and Fernando C. N. 
Pereira. 1998. Dynamic compilation of 
weighted context-free grammars. In 
COLING-ACL '98: 36th Annual Meeting of 
the Association for Computational Linguistics 
and 17th International Conference on 
Computational Linguistics, volume 2, pages 
891-897, Montreal, Quebec, Canada. 
Mohri, Mehryar, Fernando C. N. Pereira, 
and Michael Riley. 1998. A rational design 
for a weighted finite-state transducer 
library. In Derick Wood and Sheng Yu, 
editors, Automata Implementation. Lecture 
Notes in Computer Science, Number 1436. 
Springer Verlag, pages 144-158. 
Nederhof, Mark-Jan. 1994. Linguistic Parsing 
and Program Transformations. Ph.D. thesis, 
University of Nijmegen. 
Nederhof, Mark-Jan. 1997. Regular 
approximations of CFLs: A grammatical 
view. In Proceedings of the International 
Workshop on Parsing Technologies, 
pages 159-170, Massachusetts Institute of 
Technology. 
Nederhof, Mark-Jan. 1998. Context-free 
parsing through regular approximation. 
In Proceedings of the International Workshop 
on Finite State Methods in Natural Language 
Processing, pages 13-24, Ankara, Turkey. 
Nederhof, Mark-Jan and Giorgio Satta. 1996. 
Efficient tabular LR parsing. In Proceedings 
of the 34th Annual Meeting, pages 239-246, 
Santa Cruz, CA. Association for 
Computational Linguistics. 
Pereira, Fernando C. N. and Rebecca N. 
Wright. 1997. Finite-state approximation 
of phrase-structure grammars. In 
Emmanuel Roche and Yves Schabes, 
editors, Finite-State Language Processing. 
MIT Press, pages 149-173. 
Pulman, S. G. 1986. Grammars, parsers, and 
memory limitations. Language and 
Cognitive Processes, 1(3):197-225. 
Purdom, Paul Walton, Jr. and Cynthia A. 
Brown. 1981. Parsing extended LR(k) 
grammars. Acta Informatica, 15:115-127. 
Resnik, Philip. 1992. Left-corner parsing and 
psychological plausibility. In COLING '92: 
Papers presented to the Fifteenth [sic] 
International Conference on Computational 
Linguistics, pages 191-197, Nantes, France. 
Rosenkrantz, D. J. and P. M. Lewis, II. 1970. 
Deterministic left corner parsing. In IEEE 
Conference Record of the 11th Annual 
Symposium on Switching and Automata 
Theory, pages 139-152. 
Seyfarth, Benjamin R. and Manuel E. 
Bermudez. 1995. Suffix languages in LR 
parsing. International Journal of Computer 
Mathematics, 55:135-153. 
Stolcke, Andreas and Jonathan Segal. 1994. 
Precise N-gram probabilities from 
stochastic context-free grammars. In 
Proceedings of the 32nd Annual Meeting, 
pages 74-79, Las Cruces, NM. Association 
for Computational Linguistics. 
Woods, W. A. 1970. Transition network 
grammars for natural language analysis. 
Communications of the ACM, 
13(10):591-606. 
