Context-Free Parsing through Regular Approximation 
Mark-Jan Nederhof 
DFKI 
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, GERMANY
Abstract. We show that context-free parsing can be realised by a 2-phase process,
relying on an approximated context-free grammar. In the first phase a finite transducer
performs parsing according to the approximation. In the second phase, the approximated
parses are refined according to the original grammar.
1 Introduction 
A recent publication [15] presented a novel way of transforming a context-free grammar into
a new grammar that generates a regular language. This new language is a superset of the
original language. It was argued that this approach has advantages over other methods of
regular approximation [16, 7].
Our method of approximation is the following. We define a condition on context-free grammars
that is a sufficient condition for a grammar to generate a regular language. We then give a
transformation that turns an arbitrary grammar into another grammar that satisfies this
condition. This transformation is obviously not language-preserving; it adds strings to the language
generated by the original grammar, in such a way that the language becomes regular.
In the present communication we show how this procedure needs to be extended so that 
context-free parsing can be realised by a 2-phase process. For the first phase, the approximated 
grammar is turned into a finite transducer. This transducer processes the input in linear time 
and produces a table. In the second phase, this table is processed to obtain the set of all parses 
according to the original grammar. 
The order of the time complexity of the second phase is cubic, which corresponds to the time 
complexity of most context-free parsing algorithms that are used in practice. However, the first 
phase filters out many parses that are inconsistent with respect to the regular approximation. 
This may reduce the effort needed by the second phase. 
It is interesting to note that the work presented here is conceptually related to the use of regular
lookahead in context-free parsing [5].
The structure of this paper is as follows. In Section 2 we recall some standard definitions
from language theory. Section 3 investigates a sufficient condition for a context-free grammar
to generate a regular language. We also present the construction of a finite transducer from
such a grammar.
How this transducer reads input and how the output of the transducer can be turned into 
a representation of all parse trees is discussed in Sections 4 and 5, respectively. 
An algorithm to transform a grammar if the sufficient condition mentioned above is not 
satisfied is given in Section 6. Section 7 explains how this transformation can be incorporated 
into the construction of the transducer and how the output of such a transducer is then to be 
interpreted in order to obtain parse trees according to the original grammar. 
Some preliminary conclusions drawn from empirical results are given in Section 8. 
2 Preliminaries 
A context-free grammar $G$ is a 4-tuple $(\Sigma, N, P, S)$, where $\Sigma$ and $N$ are two finite disjoint sets
of terminals and nonterminals, respectively, $S \in N$ is the start symbol, and $P$ is a finite set
of rules. Each rule has the form $A \to \alpha$ with $A \in N$ and $\alpha \in V^*$, where $V$ denotes $N \cup \Sigma$.
The relation $\to$ on $N \times V^*$ is extended to a relation on $V^* \times V^*$ as usual. The transitive and
reflexive closure of $\to$ is denoted by $\to^*$.
The language generated by a context-free grammar is given by the set $\{w \in \Sigma^* \mid S \to^* w\}$.
By definition, such a set is a context-free language. By reduction of a grammar we mean the
elimination from $P$ of all rules $A \to \gamma$ such that $S \to^* \alpha A \beta \to \alpha \gamma \beta \to^* w$ does not hold for
any $\alpha, \beta \in V^*$ and $w \in \Sigma^*$.
We generally use symbols $A, B, C, \ldots$ to range over $N$, symbols $a, b, c, \ldots$ to range over $\Sigma$,
symbols $X, Y, Z$ to range over $V$, symbols $\alpha, \beta, \gamma, \ldots$ to range over $V^*$, and symbols $v, w, x, \ldots$
to range over $\Sigma^*$. We write $\epsilon$ to denote the empty string.
A rule of the form $A \to B$ is called a unit rule.
A (nondeterministic) finite automaton $\mathcal{F}$ is a 5-tuple $(K, \Sigma, \Delta, s, F)$, where $K$ is a finite set
of states, of which $s$ is the initial state and those in $F \subseteq K$ are the final states, $\Sigma$ is the input
alphabet, and the transition relation $\Delta$ is a finite subset of $K \times \Sigma^* \times K$.
We define a configuration to be an element of $K \times \Sigma^*$. We define the binary relation $\vdash$
between configurations as: $(q, vw) \vdash (q', w)$ if and only if $(q, v, q') \in \Delta$. The transitive and
reflexive closure of $\vdash$ is denoted by $\vdash^*$.
Some input $v$ is recognized if $(s, v) \vdash^* (q, \epsilon)$, for some $q \in F$. The language accepted by $\mathcal{F}$ is
defined to be the set of all strings $v$ that are recognized. By definition, a language accepted by
a finite automaton is called a regular language.
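The definition of recognition can be made concrete with a small sketch. The function name `recognizes`, the encoding of a configuration as a pair (state, offset into the input), and the example automaton are assumptions introduced for illustration only; they are not part of the original presentation.

```python
def recognizes(delta, s, final, v):
    """Return True if (s, v) |-* (q, "") for some q in final.

    delta is a set of triples (q, label, q2) where label is a string
    over the input alphabet (possibly empty).
    """
    seen = set()
    stack = [(s, 0)]  # configuration (q, w) stored as (q, offset of w in v)
    while stack:
        q, i = stack.pop()
        if (q, i) in seen:
            continue
        seen.add((q, i))
        if i == len(v) and q in final:
            return True
        for p, label, q2 in delta:
            # One step of the relation |- : consume `label` at offset i.
            if p == q and v.startswith(label, i):
                stack.append((q2, i + len(label)))
    return False


# Hypothetical automaton: reads "ab" one or more times, then "c".
example_delta = {(0, "ab", 1), (1, "", 0), (1, "c", 2)}
accepted = [w for w in ["abc", "ababc", "ab", "c"]
            if recognizes(example_delta, 0, {2}, w)]
```

Because configurations are memoised, transitions labelled with the empty string cannot cause the search to loop.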
A finite transducer $\mathcal{T}$ is a 6-tuple $(K, \Sigma_1, \Sigma_2, \Delta, s, F)$. Next to the input alphabet $\Sigma_1$ we
now have an output alphabet $\Sigma_2$. Transitions are of the form $(q, v|w, q')$ where $v \in \Sigma_1^*$ and
$w \in \Sigma_2^*$.
For finite transducers, a configuration is an element of $K \times \Sigma_1^* \times \Sigma_2^*$. We define the binary
relation $\vdash$ between configurations as: $(q, v_1 w_1, w_2) \vdash (q', w_1, w_2 v_2)$ if and only if $(q, v_1|v_2, q') \in \Delta$.
Some input $w_1$ is associated with output $w_2$ if $(s, w_1, \epsilon) \vdash^* (q, \epsilon, w_2)$, for some $q \in F$. The
set of all such pairs $(w_1, w_2)$ is the (regular) transduction represented by the transducer.
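A minimal sketch of this definition follows, assuming transitions are encoded as 4-tuples (q, v, w, q2) with v and w tuples of symbols. The name `transduce` and the bound `max_out`, which guards against transducers that can emit unboundedly long output without consuming input, are illustrative assumptions.

```python
def transduce(delta, s, final, w1, max_out=50):
    """Return the set of outputs w2 with (s, w1, ()) |-* (q, (), w2), q in final."""
    results = set()
    seen = set()
    stack = [(s, 0, ())]  # (state, offset into w1, output produced so far)
    while stack:
        conf = stack.pop()
        if conf in seen:
            continue
        seen.add(conf)
        q, i, out = conf
        if i == len(w1) and q in final:
            results.add(out)
        for p, v, w, q2 in delta:
            fits = tuple(w1[i:i + len(v)]) == tuple(v)
            if p == q and fits and len(out) + len(w) <= max_out:
                stack.append((q2, i + len(v), out + tuple(w)))
    return results


# Hypothetical transducer: each "a" is rewritten to "x y", a final "b" is dropped.
example_delta = {(0, ("a",), ("x", "y"), 0), (0, ("b",), (), 1)}
pairs = transduce(example_delta, 0, {1}, "aab")
```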
3 The Structure of Parse Trees 
We define a spine in a parse tree to be a path that runs from the root down to some leaf. Our 
main interest in spines lies in the sequences of grammar symbols at nodes bordering on spines. 
A simple example is the set of parse trees such as the one in Figure 1 (a), for a 3-line
grammar of palindromes. It is intuitively clear that the language is not regular: the grammar
symbols to the left of the spine from the root to $\epsilon$ "communicate" with those to the right of
the spine. More precisely, the prefix of the input up to the point where it meets the final node
of the spine determines the suffix after that point, in a way that an unbounded quantity of
symbols from the prefix need to be taken into account.
A formal explanation for why the grammar may not generate a regular language relies on 
the following definition \[4\]: 
Figure 1. Parse trees for a palindrome: (a) original grammar, (b) transformed grammar (Section 6).
[Tree diagrams not recoverable. The original grammar has the rules $S \to a\,S\,a$, $S \to b\,S\,b$ and $S \to \epsilon$.]
Definition 1 A grammar is self-embedding if there is some $A \in N$ such that $A \to^* \alpha A \beta$, for
some $\alpha \neq \epsilon$ and $\beta \neq \epsilon$.
In order to avoid the somewhat unfortunate term nonselfembedding (or noncenter-embedding 
\[11\]) we define a strongly regular grammar to be a grammar that is not self-embedding. Strong 
regularity informally means that when a section of a spine in a parse tree repeats itself, then 
either no grammar symbols occur to the left of that section of the spine, or no grammar symbols 
occur to the right. This prevents the "unbounded communication" between the two sides of the 
spine exemplified by the palindrome grammar. 
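We now prove that strongly regular grammars generate regular languages. For an arbitrary
grammar, we define the set of recursive nonterminals as:
$$\overline{N} = \{A \in N \mid \exists \alpha, \beta\,[A \to^{+} \alpha A \beta]\}$$
where $\to^{+}$ denotes the transitive closure of $\to$.
We determine the partition $\mathcal{N}$ of $\overline{N}$ consisting of subsets $N_1, N_2, \ldots, N_k$, for some $k \geq 0$, of
mutually recursive nonterminals:
$$\mathcal{N} = \{N_1, N_2, \ldots, N_k\}$$
$$N_1 \cup N_2 \cup \ldots \cup N_k = \overline{N}$$
$$\forall i\,[N_i \neq \emptyset] \quad\text{and}\quad \forall i, j\,[i \neq j \Rightarrow N_i \cap N_j = \emptyset]$$
$$\exists i\,[A \in N_i \wedge B \in N_i] \Leftrightarrow \exists \alpha_1, \beta_1, \alpha_2, \beta_2\,[A \to^* \alpha_1 B \beta_1 \wedge B \to^* \alpha_2 A \beta_2], \quad\text{for all } A, B \in \overline{N}$$
We now define the function $\mathit{recursive}$ from $\mathcal{N}$ to the set $\{\mathit{left}, \mathit{right}, \mathit{self}, \mathit{cyclic}\}$:
$$\mathit{recursive}(N_i) = \begin{cases}
\mathit{left}, & \text{if } \neg\mathit{LeftGenerating}(N_i) \wedge \mathit{RightGenerating}(N_i) \\
\mathit{right}, & \text{if } \mathit{LeftGenerating}(N_i) \wedge \neg\mathit{RightGenerating}(N_i) \\
\mathit{self}, & \text{if } \mathit{LeftGenerating}(N_i) \wedge \mathit{RightGenerating}(N_i) \\
\mathit{cyclic}, & \text{if } \neg\mathit{LeftGenerating}(N_i) \wedge \neg\mathit{RightGenerating}(N_i)
\end{cases}$$
where
$$\mathit{LeftGenerating}(N_i) = \exists (A \to \alpha B \beta) \in P\,[A \in N_i \wedge B \in N_i \wedge \alpha \neq \epsilon]$$
$$\mathit{RightGenerating}(N_i) = \exists (A \to \alpha B \beta) \in P\,[A \in N_i \wedge B \in N_i \wedge \beta \neq \epsilon]$$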
When $\mathit{recursive}(N_i) = \mathit{left}$, $N_i$ consists of only left-recursive nonterminals, which does not mean
it cannot also contain right-recursive nonterminals, but in that case right recursion amounts to
application of unit rules. When $\mathit{recursive}(N_i) = \mathit{cyclic}$, it is only such unit rules that take part
in the recursion.
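The partition into sets of mutually recursive nonterminals and the function recursive can be computed directly from the rules. The following is a sketch under stated assumptions: a grammar is encoded as a list of pairs (lhs, rhs) with rhs a tuple of symbols, every symbol that never occurs as a left-hand side is taken to be a terminal, and the name `classify` is hypothetical.

```python
def classify(rules):
    """Map each set N_i of mutually recursive nonterminals to its kind."""
    nts = {lhs for lhs, _ in rules}
    # reach[a]: nonterminals occurring in some string derived from a in >= 1 step.
    reach = {a: {x for lhs, rhs in rules if lhs == a for x in rhs if x in nts}
             for a in nts}
    changed = True
    while changed:
        changed = False
        for a in nts:
            new = set().union(*(reach[b] for b in reach[a])) - reach[a]
            if new:
                reach[a] |= new
                changed = True
    recursive_nts = {a for a in nts if a in reach[a]}
    partition = {frozenset(b for b in recursive_nts
                           if b in reach[a] and a in reach[b])
                 for a in recursive_nts}
    kinds = {}
    for ni in partition:
        left_gen = right_gen = False
        for lhs, rhs in rules:
            if lhs not in ni:
                continue
            for k, x in enumerate(rhs):
                if x in ni:
                    left_gen = left_gen or k > 0              # alpha is non-empty
                    right_gen = right_gen or k < len(rhs) - 1  # beta is non-empty
        if left_gen and right_gen:
            kinds[ni] = "self"
        elif left_gen:
            kinds[ni] = "right"
        elif right_gen:
            kinds[ni] = "left"
        else:
            kinds[ni] = "cyclic"
    return kinds


# The small grammar used as running example, and the palindrome grammar.
running = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
           ("B", ("c", "B")), ("B", ("d",))]
palindromes = [("S", ("a", "S", "a")), ("S", ("b", "S", "b")), ("S", ())]
```

A grammar is strongly regular exactly when no set is mapped to "self", so the palindrome grammar is rejected by this test while the running example passes it.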
That $\mathit{recursive}(N_i) = \mathit{self}$, for some $i$, is a sufficient and necessary condition for
the grammar to be self-embedding. Therefore, we have to prove that if $\mathit{recursive}(N_i) \in
\{\mathit{left}, \mathit{right}, \mathit{cyclic}\}$, for all $i$, then the grammar generates a regular language. Our proof
differs from an existing proof [3] in that it is fully constructive: Figure 2 presents an algorithm for
creating a finite transducer that recognizes as input all strings from the language generated by
the grammar, and produces output strings of a form to be discussed shortly.
The process is initiated at the start symbol, and from there the process descends the grammar
in all ways until terminals are encountered. Descending the grammar is straightforward in
the case of rules of which the left-hand side is not a recursive nonterminal: the subautomata
found recursively for members in the right-hand side will be connected. In the case of recursive
nonterminals, the process depends on whether the nonterminals in the corresponding set from
$\mathcal{N}$ are mutually left-recursive or right-recursive; if they are both, which means they are cyclic,
then either subprocess can be applied; in the code in Figure 2 cyclic and left-recursive subsets
$N_i$ are treated uniformly.
We discuss the case that the nonterminals are left-recursive or cyclic. One new state is created
for each nonterminal in the set. The transitions that are created for terminals and nonterminals
not in $N_i$ are connected in a way that is reminiscent of the construction of left-corner parsers
[17].
The output of the transducer consists of a list of filter items interspersed with input symbols. 
A filter item is a rule with a distinguished position in the right-hand side, indicated by a 
diamond. The part to the left of the diamond generates a part of the input just to the left 
of the current input position. The part to the right of the diamond potentially generates a 
subsequent part of the input. A string consisting of filter items and input symbols can be seen 
as a representation of a parse, different from some existing representations \[11, 9, 12\]. 
At this point we use only initial filter items, from the set:
$$I_{init} = \{B \to X_1 \cdots X_{m-1} \diamond X_m \mid (B \to X_1 \cdots X_{m-1} X_m) \in P \wedge
\exists i\,[\mathit{recursive}(N_i) = \mathit{right} \wedge B, X_m \in N_i \wedge X_1, \ldots, X_{m-1} \notin N_i]\} \;\cup$$
$$\{B \to X_1 \cdots X_m \diamond \mid (B \to X_1 \cdots X_m) \in P \wedge
(m = 0 \vee \neg\exists i\,[\mathit{recursive}(N_i) = \mathit{right} \wedge B, X_m \in N_i \wedge X_1, \ldots, X_{m-1} \notin N_i])\}$$
This definition implies that for every rule there is exactly one initial filter item. The diamond
holds the rightmost position, unless we are dealing with a right-recursive rule.
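As a small illustration of this set, the sketch below computes the position of the diamond for every rule. It assumes that the sets with recursive value right are supplied by the caller as `right_sets`; the function name and the encoding of a filter item as a triple (lhs, rhs, diamond position) are assumptions made here.

```python
def initial_filter_items(rules, right_sets):
    """Return one filter item (lhs, rhs, diamond) per rule.

    rules:      list of (lhs, rhs) with rhs a tuple of symbols.
    right_sets: the sets N_i with recursive(N_i) = right (assumed precomputed).
    """
    items = []
    for lhs, rhs in rules:
        m = len(rhs)
        right_recursive = m > 0 and any(
            lhs in ni and rhs[-1] in ni and all(x not in ni for x in rhs[:-1])
            for ni in right_sets)
        # The diamond precedes the last member only for right-recursive rules.
        items.append((lhs, rhs, m - 1 if right_recursive else m))
    return items


# Running example: only B -> c B is right-recursive.
running = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
           ("B", ("c", "B")), ("B", ("d",))]
diamonds = [d for _, _, d in initial_filter_items(running, [{"B"}])]
```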
An example is given in Figure 3. Four states have been labelled according to the names
they are given in procedure make_fst. There are two states that are labelled $q_B$. This can be
explained by the fact that nonterminal $B$ can be reached by descending the grammar from $S$
in two essentially distinct ways.
let $K = \emptyset$, $\Delta = \emptyset$, $s$ = fresh_state, $f$ = fresh_state, $F = \{f\}$; make_fst($s$, $S$, $f$).

procedure make_fst($q_0$, $\alpha$, $q_1$):
  if $\alpha = \epsilon$
  then let $\Delta = \Delta \cup \{(q_0, \epsilon|\epsilon, q_1)\}$
  elseif $\alpha = a$, some $a \in \Sigma$
  then let $\Delta = \Delta \cup \{(q_0, a|a, q_1)\}$
  elseif $\alpha = X\beta$, some $X \in V$, $\beta \in V^*$ such that $|\beta| > 0$
  then let $q$ = fresh_state; make_fst($q_0$, $X$, $q$); make_fst($q$, $\beta$, $q_1$)
  else let $A = \alpha$; (* $\alpha$ must consist of a single nonterminal *)
    if $A \in N_i$, some $i$
    then for each $B \in N_i$ do let $q_B$ = fresh_state end;
      if $\mathit{recursive}(N_i) = \mathit{right}$
      then for each $(B \to X_1 \cdots X_m) \in P$ such that $B \in N_i \wedge X_1, \ldots, X_m \notin N_i$
        do let $q$ = fresh_state; make_fst($q_B$, $X_1 \cdots X_m$, $q$);
          let $\Delta = \Delta \cup \{(q, \epsilon|(B \to X_1 \cdots X_m \diamond), q_1)\}$
        end;
        for each $(B \to X_1 \cdots X_m C) \in P$ such that $B, C \in N_i \wedge X_1, \ldots, X_m \notin N_i$
        do let $q$ = fresh_state; make_fst($q_B$, $X_1 \cdots X_m$, $q$);
          let $\Delta = \Delta \cup \{(q, \epsilon|(B \to X_1 \cdots X_m \diamond C), q_C)\}$
        end;
        let $\Delta = \Delta \cup \{(q_0, \epsilon|\epsilon, q_A)\}$
      else for each $(B \to X_1 \cdots X_m) \in P$ such that $B \in N_i \wedge X_1, \ldots, X_m \notin N_i$
        do let $q$ = fresh_state; make_fst($q_0$, $X_1 \cdots X_m$, $q$);
          let $\Delta = \Delta \cup \{(q, \epsilon|(B \to X_1 \cdots X_m \diamond), q_B)\}$
        end;
        for each $(B \to C X_1 \cdots X_m) \in P$ such that $B, C \in N_i \wedge X_1, \ldots, X_m \notin N_i$
        do let $q$ = fresh_state; make_fst($q_C$, $X_1 \cdots X_m$, $q$);
          let $\Delta = \Delta \cup \{(q, \epsilon|(B \to C X_1 \cdots X_m \diamond), q_B)\}$
        end;
        let $\Delta = \Delta \cup \{(q_A, \epsilon|\epsilon, q_1)\}$
      end
    else for each $(A \to \beta) \in P$ (* $A$ is not recursive *)
      do let $q$ = fresh_state; make_fst($q_0$, $\beta$, $q$); let $\Delta = \Delta \cup \{(q, \epsilon|(A \to \beta \diamond), q_1)\}$
      end
    end
  end
end.

procedure fresh_state():
  create some fresh object $q$; let $K = K \cup \{q\}$; return $q$
end.

Figure 2. Transformation from a strongly regular grammar $G = (\Sigma, N, P, S)$ to a finite
transducer $\mathcal{T} = (K, \Sigma, \Sigma \cup I_{init}, \Delta, s, F)$.
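The procedure of Figure 2 can be transcribed fairly directly. The following is an illustrative sketch, not the implementation used for the experiments. It assumes the partition and the values of recursive are supplied as `sets`, encodes a transition as (q, input tuple, output tuple, q2) and a filter item as (lhs, rhs, diamond position), and silently skips rules that would violate strong regularity. The helper `outputs`, which enumerates the output strings for a given input, is likewise hypothetical and assumes a grammar without cyclic recursion.

```python
import itertools


def make_transducer(rules, start, sets):
    """Return (delta, s, F); sets is a list of (N_i, kind), kind != 'self'."""
    nts = {lhs for lhs, _ in rules}
    counter = itertools.count()
    delta = set()

    def fresh():
        return next(counter)

    def emit(q, lhs, rhs, diamond, q2):
        # Transition reading nothing and writing one filter item.
        delta.add((q, (), ((lhs, rhs, diamond),), q2))

    def make(q0, alpha, q1):
        if not alpha:
            delta.add((q0, (), (), q1))
        elif len(alpha) > 1:
            q = fresh()
            make(q0, alpha[:1], q)
            make(q, alpha[1:], q1)
        elif alpha[0] not in nts:
            delta.add((q0, alpha, alpha, q1))  # terminal a: transition a|a
        else:
            a = alpha[0]
            owner = [(ni, kind) for ni, kind in sets if a in ni]
            if not owner:  # a is not recursive
                for lhs, rhs in rules:
                    if lhs == a:
                        q = fresh()
                        make(q0, rhs, q)
                        emit(q, lhs, rhs, len(rhs), q1)
                return
            ni, kind = owner[0]
            qs = {b: fresh() for b in ni}
            for b, rhs in rules:
                if b not in ni:
                    continue
                inside = [x in ni for x in rhs]
                q = fresh()
                if kind == "right":
                    if not any(inside):
                        make(qs[b], rhs, q)
                        emit(q, b, rhs, len(rhs), q1)
                    elif not any(inside[:-1]):  # B -> X1..Xm C
                        make(qs[b], rhs[:-1], q)
                        emit(q, b, rhs, len(rhs) - 1, qs[rhs[-1]])
                else:  # left or cyclic
                    if not any(inside):
                        make(q0, rhs, q)
                        emit(q, b, rhs, len(rhs), qs[b])
                    elif not any(inside[1:]):  # B -> C X1..Xm
                        make(qs[rhs[0]], rhs[1:], q)
                        emit(q, b, rhs, len(rhs), qs[b])
            if kind == "right":
                delta.add((q0, (), (), qs[a]))
            else:
                delta.add((qs[a], (), (), q1))

    s, f = fresh(), fresh()
    make(s, (start,), f)
    return delta, s, {f}


def outputs(delta, s, final, tokens):
    """Enumerate output strings for `tokens` (terminates for non-cyclic grammars)."""
    results, seen, stack = set(), set(), [(s, 0, ())]
    while stack:
        conf = stack.pop()
        if conf in seen:
            continue
        seen.add(conf)
        q, i, out = conf
        if i == len(tokens) and q in final:
            results.add(out)
        for p, v, w, q2 in delta:
            if p == q and tuple(tokens[i:i + len(v)]) == v:
                stack.append((q2, i + len(v), out + w))
    return results


running = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
           ("B", ("c", "B")), ("B", ("d",))]
delta, s, final = make_transducer(
    running, "S", [(frozenset({"S", "A"}), "left"), (frozenset({"B"}), "right")])
result = outputs(delta, s, final, "cdba")
```

For the input cdba the only output string interleaves the four input symbols with the filter items for B -> c B, B -> d, A -> B b and S -> A a, in that order.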
4 Tabular Simulation of Finite Transducers 
After a finite transducer has been obtained, it may sometimes be turned into a deterministic
transducer [13]. However, this is not always possible since not all regular transductions can be
described by means of deterministic finite transducers. In this case, input can be processed by
simulating a nondeterministic transducer in a tabular way.

Figure 3. Application of the code from Figure 2 on a small grammar. [Transducer diagram not recoverable.]
The grammar has the rules
1: $S \to A\,a$
2: $A \to S\,B$
3: $A \to B\,b$
4: $B \to c\,B$
5: $B \to d$
with $\overline{N} = \{S, A, B\}$, $N_1 = \{S, A\}$, $\mathit{recursive}(N_1) = \mathit{left}$, $N_2 = \{B\}$, $\mathit{recursive}(N_2) = \mathit{right}$.
Assume we have a finite transducer $\mathcal{T} = (K, \Sigma_1, \Sigma_2, \Delta, s, F)$ and an input string $a_1 \cdots a_n$.
We create two tables. The first table $K'$ contains entries of the form $(i, q_1)$, where $0 \leq i \leq n$ and
$q_1 \in K$. Such an entry indicates that the transducer may be in state $q_1$ after reading input from
position 0 up to $i$. The second table $\Delta'$ contains entries of the form $((i, q_1), v, (j, q_2))$, where
$v \in \Sigma_2^*$. Such an entry indicates that furthermore the transducer may go from state $q_1$ to $q_2$ in
a single step, by reading the input from position $i$ to position $j$ while producing $v$ as output.
The preferred way of looking at these two tables is as a set of states and a set of transitions
of a finite automaton $\mathcal{F} = (K', \Sigma_2, \Delta', (0, s), F')$, where $F'$ is a subset of $\{n\} \times F$.
Initially $K' = \{(0, s)\}$ and $\Delta' = \emptyset$. Then the following is repeated until no more new elements
can be added to $K'$ or $\Delta'$:
1. We choose a state $(i, q_1) \in K'$ and a transition $(q_1, a_{i+1} \cdots a_j | v, q_2) \in \Delta$.
2. We add a state $(j, q_2)$ to $K'$ and a transition $((i, q_1), v, (j, q_2))$ to $\Delta'$ if not already present.
We then define $F' = K' \cap (\{n\} \times F)$. The input $a_1 \cdots a_n$ is recognized by $\mathcal{T}$ when $F'$ is
non-empty. The language accepted by $\mathcal{F}$ is the set of output strings that $a_1 \cdots a_n$ is associated with
by $\mathcal{T}$ [1].
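The two steps above amount to a reachability computation over pairs of input position and state. A minimal sketch follows, assuming transitions are encoded as (q, input tuple, output tuple, q2). The name `simulate` and the toy transducer are illustrative assumptions, and the inner loop scans all of the transition relation for simplicity, whereas a practical version would index transitions by source state.

```python
def simulate(delta, s, final, tokens):
    """Tabular simulation: returns the tables (K', Delta', F')."""
    n = len(tokens)
    k1 = {(0, s)}
    d1 = set()
    agenda = [(0, s)]
    while agenda:
        i, q1 = agenda.pop()                      # step 1: choose (i, q1) in K'
        for p, v, w, q2 in delta:
            j = i + len(v)
            if p == q1 and j <= n and tuple(tokens[i:j]) == tuple(v):
                d1.add(((i, q1), w, (j, q2)))     # step 2: new transition
                if (j, q2) not in k1:
                    k1.add((j, q2))               # step 2: new state
                    agenda.append((j, q2))
    f1 = {(i, q) for (i, q) in k1 if i == n and q in final}
    return k1, d1, f1


# Hypothetical two-state transducer mapping a to x and a final b to y.
toy = {(0, ("a",), ("x",), 0), (0, ("b",), ("y",), 1)}
k1, d1, f1 = simulate(toy, 0, {1}, "aab")
```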
Before continuing with the next phases of processing, as presented in the following sections,
we may first reduce the automaton, i.e. we may remove the transitions that do not contribute
to any paths from $(0, s)$ to a state in $F'$. For simplifying the discussion in the next section, we
further assume that $\mathcal{F}$ is transformed such that all transitions $(q, v, q') \in \Delta'$ satisfy $|v| = 1$.
For the running example of Figure 3 we may then obtain the finite automaton indicated by
the thick lines in Figure 4. (Two $\epsilon$-transitions were implicitly eliminated and the automaton
has been reduced.)
The time demand of the construction of $\mathcal{F}$ from $\mathcal{T}$ and $a_1 \cdots a_n$ is linear measured both in
$n$ and in the size of $\mathcal{T}$. Note that in general the language accepted by $\mathcal{F}$ may be infinite in case
the grammar is cyclic.
Figure 4. A finite automaton resulting from simulating the transducer on input cdba (thick
lines), and the subsequent table $U$ of dotted items (thin lines). [Diagram not recoverable. The path of the
automaton is labelled $c$, $(B \to c \diamond B)$, $d$, $(B \to d\,\diamond)$, $b$, $(A \to B b\,\diamond)$, $a$, $(S \to A a\,\diamond)$.]
5 Retrieving a Parse Forest 
Using the compact representation of all possible output strings discussed above, we can obtain
the structure of the input according to the context-free grammar; by "structure" of the input
we mean the collection of all parse trees. Again, we use a tabular representation, called a parse
forest [6, 10, 2].
Our particular kind of parse forest is a table $U$ consisting of dotted items of the form
$[q, A \to \alpha \bullet \beta, q']$, where $q$ and $q'$ are states from $K'$ and $A \to \alpha\beta$ is a rule. The dot indicates
how far recognition of the right-hand side has progressed. To be more precise, the meaning of
the above dotted item is that the input symbols on a path from $q$ to $q'$ can be derived from $\beta$.
Note that recognition of right-hand sides is done from right to left, i.e. in reversed order with
respect to Earley's algorithm [6].
For a certain instance of a rule, the initial position of the dot is given by the position of the
diamond in the corresponding filter item.
There are several ways to construct $U$. For presentational reasons our algorithm will be
relatively simple, in the style of the CYK algorithm [8]:
1. Initially $U$ is empty.
2. We perform one of the following until no more new elements can be added to $U$:
(a) We choose a transition $(q, A \to \alpha\,\diamond, q') \in \Delta'$ and add an item $[q, A \to \alpha\,\bullet, q']$ to $U$.
(b) We choose a transition $(q, A \to \alpha \diamond B, q') \in \Delta'$ and an item $[q', B \to \bullet\,\gamma, q''] \in U$ and
add an item $[q, A \to \alpha \bullet B, q'']$ to $U$.
(c) We choose a transition $(q, a, q') \in \Delta'$ and an item $[q', A \to \alpha a \bullet \beta, q''] \in U$ and add an
item $[q, A \to \alpha \bullet a\beta, q'']$ to $U$.
(d) We choose a pair of items $[q, B \to \bullet\,\gamma, q'], [q', A \to \alpha B \bullet \beta, q''] \in U$ and add an item
$[q, A \to \alpha \bullet B\beta, q'']$ to $U$.
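Steps (a) to (d) can be organised around an agenda so that every item is processed once. The sketch below is one possible reading rather than the author's implementation. It assumes that each transition of the automaton carries either a terminal (a string) or a filter item encoded as (lhs, rhs, diamond position), and that a dotted item is encoded as (q, lhs, rhs, dot position, q2); the name `build_forest` is hypothetical.

```python
def build_forest(transitions, nonterminals):
    """Return the table U of dotted items for an automaton with unit-length labels."""
    table, agenda = set(), []

    def add(item):
        if item not in table:
            table.add(item)
            agenda.append(item)

    scans = [t for t in transitions if isinstance(t[1], str)]
    filters = [t for t in transitions if not isinstance(t[1], str)]
    for q, (lhs, rhs, d), q2 in filters:          # step (a)
        if d == len(rhs):
            add((q, lhs, rhs, d, q2))
    while agenda:
        q1, lhs, rhs, dot, q2 = agenda.pop()
        if dot > 0 and rhs[dot - 1] not in nonterminals:
            for q, a, qx in scans:                # step (c)
                if qx == q1 and a == rhs[dot - 1]:
                    add((q, lhs, rhs, dot - 1, q2))
        if dot > 0 and rhs[dot - 1] in nonterminals:
            # step (d): the popped item is the right-hand partner
            for p, b, _, bdot, px in list(table):
                if bdot == 0 and b == rhs[dot - 1] and px == q1:
                    add((p, lhs, rhs, dot - 1, q2))
        if dot == 0:
            for q, (a, arhs, d), qx in filters:   # step (b)
                if qx == q1 and d < len(arhs) and arhs[d] == lhs:
                    add((q, a, arhs, d, q2))
            # step (d): the popped item is the left-hand partner
            for p, a, arhs, adot, p2 in list(table):
                if p == q2 and adot > 0 and arhs[adot - 1] == lhs:
                    add((q1, a, arhs, adot - 1, p2))
    return table


# Reduced automaton for input cdba in the running example (states renumbered 0..8).
chain = {(0, "c", 1), (1, ("B", ("c", "B"), 1), 2), (2, "d", 3),
         (3, ("B", ("d",), 1), 4), (4, "b", 5), (5, ("A", ("B", "b"), 2), 6),
         (6, "a", 7), (7, ("S", ("A", "a"), 2), 8)}
forest = build_forest(chain, {"S", "A", "B"})
```

The input is accepted by the grammar when the table contains an item for the start symbol with the dot in the leftmost position that spans the path from the initial state to a final state, here the item from state 0 to state 8.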
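Assume the grammar is $G = (\Sigma, N, P, S)$. The following is to be performed for each set $N_i \in \mathcal{N}$
such that $\mathit{recursive}(N_i) = \mathit{self}$.
1. Add the following nonterminals to $N$: $A^{\uparrow}_B$, $A^{\downarrow}_B$, $\overleftarrow{A}_B$ and $\overrightarrow{A}_B$ for all $A, B \in N_i$.
2. Add the following rules to $P$, for all $A, B, C, D, E \in N_i$:
- $A \to A^{\uparrow}_A$;
- $A^{\uparrow}_B \to \overleftarrow{A}_C\, Y_1 \cdots Y_m\, C^{\downarrow}_B$, for all $(C \to Y_1 \cdots Y_m) \in P$, with $Y_1, \ldots, Y_m \notin N_i$;
- $A^{\downarrow}_B \to \overrightarrow{C}_A\, Y_1 \cdots Y_m\, E^{\uparrow}_B$, for all $(D \to \alpha C Y_1 \cdots Y_m E \beta) \in P$, with $Y_1, \ldots, Y_m \notin N_i$;
- $A^{\downarrow}_B \to \overrightarrow{B}_A$;
- $\overleftarrow{A}_B \to Y_1 \cdots Y_m\, \overleftarrow{C}_B$, for all $(A \to Y_1 \cdots Y_m C \beta) \in P$, with $Y_1, \ldots, Y_m \notin N_i$;
- $\overleftarrow{B}_B \to \epsilon$;
- $\overrightarrow{A}_B \to \overrightarrow{C}_B\, Y_1 \cdots Y_m$, for all $(A \to \alpha C Y_1 \cdots Y_m) \in P$, with $Y_1, \ldots, Y_m \notin N_i$;
- $\overrightarrow{B}_B \to \epsilon$.
3. Remove from $P$ the old rules of the form $A \to \alpha$, where $A \in N_i$.
4. Reduce the grammar.
Figure 5. Approximation by transforming the grammar.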
The items produced for the running example are represented as the thin lines in Figure 4. 
6 Approximating a Context-Free Language 
Section 3 presented a sufficient condition for the generated language to be regular, and explained 
when this condition is violated. This suggests how to change an arbitrary grammar so that it 
will come to satisfy the condition. 
The intuition is that the "unbounded communication" between the left and right sides of 
spines is broken. This is done by a transformation that operates separately on each set Ni such 
that recursive(Ni) = self, as indicated in Figure 5. After this, the grammar will be strongly 
regular. 
Consider the grammar of palindromes in the left half of Figure 1. The approximation algorithm
leads to the grammar in the right half. Figure 1 (b) shows the effect on the structure of
parse trees. Note that the left sides of former spines are treated by the new nonterminal $\overleftarrow{S}_S$ and
the right sides by the new nonterminal $\overrightarrow{S}_S$.
This example deals with the special case that each nonterminal can lead to at most one 
recursive call of itself. The general case is more complicated and is treated elsewhere \[15\]. 
7 Obtaining Correct Parse Trees 
In Section 5 we discussed how the table resulting from simulating the transducer should be 
interpreted in order to obtain a parse forest. However, we assumed then that the transducer 
had been constructed from a grammar that was strongly regular. In case the original grammar 
is not strongly regular we have to approach this task in a different way. 
One possibility is to first apply the grammar transformation from the previous section and 
subsequently perform the 2-phase process as before. However, this approach results in a parse 
forest that reflects the structure of the transformed grammar rather than that of the original 
grammar. 
The second and preferred approach is to incorporate the grammar transformation into the 
construction of the transducer. The accepted language is then the same as in the case of the 
first approach, but the symbols that occur in the output carry information about the rules from 
the original grammar. 
How the construction of the finite transducer from Figure 2 needs to be changed is indicated
in Figure 6. We only show the part of the code which deals with the case that $\alpha$ consists of a
single nonterminal.
For nonterminals which are not in a set $N_i$ such that $\mathit{recursive}(N_i) = \mathit{self}$, the same treatment
as before is applied. Upon encountering a nonterminal $B \in N_i$ such that $\mathit{recursive}(N_i) =
\mathit{self}$, we consider the structure of the grammar if it is transformed according to Figure 5. This
transformation creates new sets of recursive nonterminals, which have to be treated according
to Figure 2 depending on whether they may be left-recursive or right-recursive.
For example, given a fixed nonterminal $B \in N_i$, for some $i$ such that $\mathit{recursive}(N_i) = \mathit{self}$, the
set of nonterminals $A^{\uparrow}_B$ and $A^{\downarrow}_B$, for any $A \in N_i$, together form a set $M$ in the transformed grammar
for which $\mathit{recursive}(M) = \mathit{right}$. We may therefore construct the transducer as dictated by
Figure 2 for this case. In particular, this relates to the rules of the form $A^{\uparrow}_B \to \overleftarrow{A}_C\, Y_1 \cdots Y_m\, C^{\downarrow}_B$,
$A^{\downarrow}_B \to \overrightarrow{C}_A\, Y_1 \cdots Y_m\, E^{\uparrow}_B$ and $A^{\downarrow}_B \to \overrightarrow{B}_A$.
Note that a nonterminal of the form $\overleftarrow{A}_C$ does not belong to $M$ but to another set, say $M_1$,
which in the transformed grammar satisfies $\mathit{recursive}(M_1) = \mathit{right}$ (or $\mathit{recursive}(M_1) = \mathit{cyclic}$).
Similarly, a nonterminal of the form $\overrightarrow{C}_A$ belongs to a set, say $M_2$, which satisfies $\mathit{recursive}(M_2) =
\mathit{left}$ (or $\mathit{recursive}(M_2) = \mathit{cyclic}$). Treatment of these nonterminals occurs in a deeper level of
recursion of make_fst, and appears as separate cases in Figure 6.
It is important to remember that the sets Ni in Figure 6 always refer to the nature of 
recursion in the original grammar; the transformed grammar is merely implicit in the given 
construction of the transducer, and helps us to understand the construction in terms of Figure 2. 
In addition to $I_{init}$, filter items from the following set are used:
$$I_{mid} = \{B \to \alpha \diamond C \beta \mid (B \to \alpha C \beta) \in P \wedge \exists i\,[\mathit{recursive}(N_i) = \mathit{self} \wedge B, C \in N_i]\}$$
The meaning of the diamond is largely unchanged with regard to Section 3. For example, for
the rule $D \to \alpha C Y_1 \cdots Y_m E \beta$, which corresponds to the rule $A^{\downarrow}_B \to \overrightarrow{C}_A\, Y_1 \cdots Y_m\, E^{\uparrow}_B$ of the
transformed grammar, the filter item $D \to \alpha C Y_1 \cdots Y_m \diamond E \beta$ is output, which indicates that
an instance of $Y_1 \cdots Y_m$ (or an approximation thereof) has just been read, which is potentially
preceded by an instance of $\alpha C$ and followed by an instance of $E\beta$. On the other hand, upon
encountering a rule such as $A^{\downarrow}_B \to \overrightarrow{B}_A$, which is an artifact of the grammar transformation, no
output symbol is generated.
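For retrieving the forest from $\mathcal{F}$ we need to take into account the additional form of filter
item. Now the following steps are required:
(a) We choose $(q, A \to \alpha\,\diamond, q') \in \Delta'$ and add $[q, A \to \alpha\,\bullet, q']$ to $U$.
(b) We choose $(q, A \to \alpha \diamond B, q') \in \Delta'$, such that $(A \to \alpha \diamond B) \in I_{init}$, and $[q', B \to \bullet\,\gamma, q''] \in U$
and add $[q, A \to \alpha \bullet B, q'']$ to $U$.
(c) We choose $(q, a, q') \in \Delta'$ and $[q', A \to \alpha a \bullet \beta, q''] \in U$ and add $[q, A \to \alpha \bullet a\beta, q'']$ to $U$.
(d) We choose $[q, B \to \bullet\,\gamma, q'], [q', A \to \alpha B \bullet \beta, q''] \in U$ and:
- if $(A \to \alpha \diamond B\beta) \in I_{mid}$, then add $[q''', A \to \alpha \bullet B\beta, q'']$ to $U$ for each $(q''', A \to \alpha \diamond
B\beta, q) \in \Delta'$, and
- otherwise, add $[q, A \to \alpha \bullet B\beta, q'']$ to $U$.

else (* $\alpha$ must consist of a single nonterminal *)
  if $\alpha$ is of form $A \in N_i$, some $i$, and $\mathit{recursive}(N_i) \in \{\mathit{right}, \mathit{left}, \mathit{cyclic}\}$
  then ...treatment as in Figure 2...
  elseif $\alpha$ is of form $B \in N_i$, some $i$, and $\mathit{recursive}(N_i) = \mathit{self}$
  then (* we implicitly replace $B$ by $B^{\uparrow}_B$ according to $B \to B^{\uparrow}_B$ *)
    for each $A \in N_i$ do let $q_{A^{\uparrow}_B}$ = fresh_state, $q_{A^{\downarrow}_B}$ = fresh_state end;
    for each $A \in N_i$ and $(C \to Y_1 \cdots Y_m) \in P$ such that $C \in N_i \wedge Y_1, \ldots, Y_m \notin N_i$
    do let $q$ = fresh_state; make_fst($q_{A^{\uparrow}_B}$, $\overleftarrow{A}_C\, Y_1 \cdots Y_m$, $q$);
      let $\Delta = \Delta \cup \{(q, \epsilon|(C \to Y_1 \cdots Y_m \diamond), q_{C^{\downarrow}_B})\}$
    end; (* for $A^{\uparrow}_B \to \overleftarrow{A}_C\, Y_1 \cdots Y_m\, C^{\downarrow}_B$ *)
    for each $A \in N_i$ and $(D \to \alpha C Y_1 \cdots Y_m E \beta) \in P$
        such that $C, D, E \in N_i \wedge Y_1, \ldots, Y_m \notin N_i$
    do let $q$ = fresh_state; make_fst($q_{A^{\downarrow}_B}$, $\overrightarrow{C}_A\, Y_1 \cdots Y_m$, $q$);
      let $\Delta = \Delta \cup \{(q, \epsilon|(D \to \alpha C Y_1 \cdots Y_m \diamond E \beta), q_{E^{\uparrow}_B})\}$
    end; (* for $A^{\downarrow}_B \to \overrightarrow{C}_A\, Y_1 \cdots Y_m\, E^{\uparrow}_B$ *)
    for each $A \in N_i$
    do make_fst($q_{A^{\downarrow}_B}$, $\overrightarrow{B}_A$, $q_1$) (* for $A^{\downarrow}_B \to \overrightarrow{B}_A$ *)
    end;
    let $\Delta = \Delta \cup \{(q_0, \epsilon|\epsilon, q_{B^{\uparrow}_B})\}$
  elseif $\alpha$ is of form $\overleftarrow{D}_B$ such that $D, B \in N_i$, some $i$
  then for each $A \in N_i$ do let $q_{\overleftarrow{A}_B}$ = fresh_state end;
    for each $(A \to Y_1 \cdots Y_m C \beta) \in P$ such that $A, C \in N_i \wedge Y_1, \ldots, Y_m \notin N_i$
    do let $q$ = fresh_state; make_fst($q_{\overleftarrow{A}_B}$, $Y_1 \cdots Y_m$, $q$);
      let $\Delta = \Delta \cup \{(q, \epsilon|(A \to Y_1 \cdots Y_m \diamond C \beta), q_{\overleftarrow{C}_B})\}$
    end; (* for $\overleftarrow{A}_B \to Y_1 \cdots Y_m\, \overleftarrow{C}_B$ *)
    let $\Delta = \Delta \cup \{(q_{\overleftarrow{B}_B}, \epsilon|\epsilon, q_1)\}$; (* for $\overleftarrow{B}_B \to \epsilon$ *)
    let $\Delta = \Delta \cup \{(q_0, \epsilon|\epsilon, q_{\overleftarrow{D}_B})\}$
  elseif $\alpha$ is of form $\overrightarrow{D}_B$ such that $D, B \in N_i$, some $i$
  then for each $A \in N_i$ do let $q_{\overrightarrow{A}_B}$ = fresh_state end;
    for each $(A \to \alpha C Y_1 \cdots Y_m) \in P$ such that $A, C \in N_i \wedge Y_1, \ldots, Y_m \notin N_i$
    do let $q$ = fresh_state; make_fst($q_{\overrightarrow{C}_B}$, $Y_1 \cdots Y_m$, $q$);
      let $\Delta = \Delta \cup \{(q, \epsilon|(A \to \alpha C Y_1 \cdots Y_m \diamond), q_{\overrightarrow{A}_B})\}$
    end; (* for $\overrightarrow{A}_B \to \overrightarrow{C}_B\, Y_1 \cdots Y_m$ *)
    let $\Delta = \Delta \cup \{(q_0, \epsilon|\epsilon, q_{\overrightarrow{B}_B})\}$; (* for $\overrightarrow{B}_B \to \epsilon$ *)
    let $\Delta = \Delta \cup \{(q_{\overrightarrow{D}_B}, \epsilon|\epsilon, q_1)\}$
  else let $A = \alpha$; (* $\alpha$ must consist of a single non-recursive nonterminal *)
    ...treatment as in Figure 2...
  end

Figure 6. Code from Figure 2 changed to deal with arbitrary context-free grammars.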
8 Empirical Results 
The implementation was completed recently. Initial experiments allow some tentative conclu- 
sions, reported here. 
We have compared the 2-phase algorithm to a traditional tabular context-free parsing algorithm.
In order to allow a fair comparison, we have taken a mixed parsing strategy that applies
a set of dotted items comparable to that of Section 7. Assuming the input is given by $a_1 \cdots a_n$
as before, the steps are given by:
(a) We choose $i$, such that $0 \leq i \leq n$, and $(A \to \alpha\,\diamond) \in I_{init}$ and add $[i, A \to \alpha\,\bullet, i]$ to $U$.
(b) We choose $[i, B \to \bullet\,\gamma, j] \in U$ and $(A \to \alpha \diamond B) \in I_{init}$, and add $[i, A \to \alpha \bullet B, j]$ to $U$.
(c) We choose $[i + 1, A \to \alpha a_{i+1} \bullet \beta, j] \in U$ and add $[i, A \to \alpha \bullet a_{i+1}\beta, j]$ to $U$.
(d) We choose $[i, B \to \bullet\,\gamma, j], [j, A \to \alpha B \bullet \beta, k] \in U$ and add $[i, A \to \alpha \bullet B\beta, k]$ to $U$.
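The four steps can be sketched as an agenda-driven procedure. This is an illustration under assumptions, not the implementation that was measured: rules are given as a list of (lhs, rhs) pairs, `diamonds[r]` is the position of the diamond in the initial filter item of rule r, and an item is encoded as (i, rule index, dot position, j). The name `tabular_parse` is hypothetical.

```python
def tabular_parse(rules, diamonds, tokens):
    """Return the table U of items (i, r, dot, j) for the given input."""
    n = len(tokens)
    nonterminals = {lhs for lhs, _ in rules}
    table, agenda = set(), []

    def add(item):
        if item not in table:
            table.add(item)
            agenda.append(item)

    for r, (lhs, rhs) in enumerate(rules):        # step (a)
        if diamonds[r] == len(rhs):
            for i in range(n + 1):
                add((i, r, len(rhs), i))
    while agenda:
        i, r, dot, j = agenda.pop()
        lhs, rhs = rules[r]
        if dot > 0 and rhs[dot - 1] not in nonterminals:
            # step (c): move the dot leftwards over a matching input symbol
            if i > 0 and tokens[i - 1] == rhs[dot - 1]:
                add((i - 1, r, dot - 1, j))
        elif dot > 0:
            # step (d): the popped item is the right-hand partner
            for h, r2, dot2, i2 in list(table):
                if dot2 == 0 and i2 == i and rules[r2][0] == rhs[dot - 1]:
                    add((h, r, dot - 1, j))
        else:
            for r2, (_, rhs2) in enumerate(rules):  # step (b)
                d = diamonds[r2]
                if d < len(rhs2) and rhs2[d] == lhs:
                    add((i, r2, d, j))
            # step (d): the popped item is the left-hand partner
            for j2, r2, dot2, k in list(table):
                if j2 == j and dot2 > 0 and rules[r2][1][dot2 - 1] == lhs:
                    add((i, r2, dot2 - 1, k))
    return table


running = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
           ("B", ("c", "B")), ("B", ("d",))]
running_diamonds = [2, 2, 2, 1, 1]
table = tabular_parse(running, running_diamonds, "cdba")
```

Counting the number of calls to `add` that insert a new item would give one possible measure of the number of steps performed by this algorithm.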
For the experiments we have taken a grammar for German, generated automatically through
EBL, of which a considerable part contains self-embedding. The transducer was determinized
and minimized as if it were a finite automaton, i.e. in a transition $(q, v|w, q')$ the pair $v|w$ is
treated as one symbol, and the pair $\epsilon|\epsilon$ is treated as the empty string. The test sentences were
obtained using a random generator [14].
For a given input sentence, we define $T_1$ and $T_2$ to be the number of steps that are performed
for the respective phases of the 2-phase algorithm: first, the creation of $\mathcal{F}$ from the input
$a_1 \cdots a_n$, and second, the creation of $U$ from $\mathcal{F}$. We define $T_{cf}$ to be the number of steps
that are performed for the direct construction of table $U$ from $a_1 \cdots a_n$ by the above tabular
algorithm.
Concerning the two processes with context-free power, viz. $T_{cf}$ and $T_2$, we have observed
that in the majority of cases there is a reduction in the number of steps from $T_{cf}$ to $T_2$. This
can be a reduction from several hundreds of steps to fewer than 10. In individual cases however,
especially for long sentences, $T_2$ can be larger than $T_{cf}$. This can be explained by the fact that
$\mathcal{F}$ may have many more states than the input sentence has positions, which leads to less
sharing of computation.
Adding $T_1$ and $T_2$ in many cases leads to higher numbers of steps than $T_{cf}$. At this stage
we cannot say whether this implies that the 2-phase idea is not useful. Many refinements,
especially concerning the reduction of the number of states of $\mathcal{F}$ in order to enhance sharing of
computation, have as yet not been explored.
In this context, we observe that the size of the repertoire of filter items has conflicting
consequences for the overall complexity. If $\mathcal{T}$ outputs no filter items, then it reduces to a
recognizer, which can be determinized. Consequently, $T_1$ will be equal to the sentence length,
but $T_2$ will be no less than (and in fact identical to) $T_{cf}$. If on the other hand $\mathcal{T}$ outputs many
types of filter item, then determinization and minimization are more difficult and consequently
$\mathcal{F}$ may be large and both $T_1$ and $T_2$ may be high.
Acknowledgements 
Parts of this research were carried out within the framework of the Priority Programme Lan- 
guage and Speech Technology (TST), while the author was employed at the University of 
Groningen. The TST-Programme is sponsored by NWO (Dutch Organization for Scientific Re- 
search). This work was further funded by the German Federal Ministry of Education, Science, 
Research and Technology (BMBF) in the framework of the VERBMOBIL Project under Grant 
01 IV 701 V0. The responsibility for the contents lies with the author. 

References 
1. J. Berstel. 1979. Transductions and Context-Free Languages. B.G. Teubner, Stuttgart.
2. S. Billot and B. Lang. 1989. The structure of shared forests in ambiguous parsing. In 27th Annual
Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages
143-151, Vancouver, British Columbia, Canada, June.
3. N. Chomsky. 1959. A note on phrase structure grammars. Information and Control, 2:393-395.
4. N. Chomsky. 1959. On certain formal properties of grammars. Information and Control, 2:137-167.
5. K. Čulik II and R. Cohen. 1973. LR-regular grammars--an extension of LR(k) grammars. Journal
of Computer and System Sciences, 7:66-96.
6. J. Earley. 1970. An efficient context-free parsing algorithm. Communications of the ACM,
13(2):94-102, February.
7. E. Grimley Evans. 1997. Approximating context-free grammars with a finite-state calculus. In 35th 
Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 
pages 452-459, Madrid, Spain, July. 
8. M.A. Harrison. 1978. Introduction to Formal Language Theory. Addison-Wesley. 
9. S. Krauwer and L. des Tombe. 1981. Transducers and grammars as theories of language. Theoretical
Linguistics, 8:173-202.
10. B. Lang. 1974. Deterministic techniques for efficient non-deterministic parsers. In Automata,
Languages and Programming, 2nd Colloquium, Lecture Notes in Computer Science, volume 14,
pages 255-269, Saarbrücken. Springer-Verlag.
11. D.T. Langendoen. 1975. Finite-state parsing of phrase-structure languages and the status of read- 
justment rules in grammar. Linguistic Inquiry, 6(4):533-554. 
12. D.T. Langendoen and Y. Langsam. 1990. A new method of representing constituent structures. 
Annals New York Academy of Sciences, 583:143-160. 
13. M. Mohri. 1997. Finite-state transducers in language and speech processing. Computational Lin- 
guistics, 23(2):269-311. 
14. M.-J. Nederhof. 1996. Efficient generation of random sentences. Natural Language Engineering,
2(1):1-13.
15. M.-J. Nederhof. 1997. Regular approximations of CFLs: A grammatical view. In International 
Workshop on Parsing Technologies, pages 159-170, Massachusetts Institute of Technology, Septem- 
ber. 
16. F.C.N. Pereira and R.N. Wright. 1997. Finite-state approximation of phrase-structure grammars. 
In E. Roche and Y. Schabes, editors, Finite-State Language Processing, pages 149-173. MIT Press. 
17. D.J. Rosenkrantz and P.M. Lewis II. 1970. Deterministic left corner parsing. In IEEE Conference 
Record of the 11th Annual Symposium on Switching and Automata Theory, pages 139-152. 
