Parsing Non-Recursive Context-Free Grammars
Mark-Jan Nederhof*
Faculty of Arts
University of Groningen
P.O. Box 716
NL-9700 AS Groningen, The Netherlands
markjan@let.rug.nl
Giorgio Satta
Dip. di Elettronica e Informatica
Universit`a di Padova
via Gradenigo, 6/A
I-35131 Padova, Italy
satta@dei.unipd.it
Abstract
We consider the problem of parsing
non-recursive context-free grammars, i.e.,
context-free grammars that generate finite
languages. In natural language process-
ing, this problem arises in several areas
of application, including natural language
generation, speech recognition and ma-
chine translation. We present two tabu-
lar algorithms for parsing of non-recursive
context-free grammars, and show that they
perform well in practical settings, despite
the fact that this problem is PSPACE-
complete.
1 Introduction
Several applications in natural language processing
require “parsing” of a large but finite set of candidate
strings. Here parsing means some computation that
selects those strings out of the finite set that are well-
formed according to some grammar, or that are most
likely according to some language model. In these
applications, the finite set is typically encoded in a
compact way as a context-free grammar (CFG) that
is non-recursive. This is motivated by the fact that
non-recursive CFGs allow very compact represen-
tations for finite languages, since the strings deriv-
able from single nonterminals may be substrings of
many different strings in the language. Unfolding
such a grammar and parsing the generated strings
*Secondary affiliation is the German Research Center for Artificial Intelligence (DFKI).
one by one then leads to an unnecessary duplica-
tion of subcomputations, since each occurrence of
a repeated substring has to be independently parsed.
As this approach may be prohibitively expensive, it
is preferable to find a parsing algorithm that shares
subcomputations among different strings by work-
ing directly on the nonterminals and the rules of the
non-recursive CFG. In this way, “parsing” a nonter-
minal of the grammar amounts to shared parsing of
all the substrings encoded by that nonterminal.
To give a few examples, in some natural lan-
guage generation systems (Langkilde, 2000) non-
recursive CFGs are used to encode very large sets
of candidate sentences realizing some input con-
ceptual representation (Langkilde calls such gram-
mars forests). Each CFG is later "parsed" using a language model, in order to rank the sentences in the set according to their likelihood. Similarly, in some approaches to automatic speech understanding (Corazza and Lavelli, 1994) the N-best sentences obtained from the speech recognition module are "compressed" into a non-recursive CFG, which is later provided as input to a parser. Fi-
nally, in some machine translation applications re-
lated techniques are exploited to obtain sentences
that simultaneously realize two different conceptual
representations (Knight and Langkilde, 2000). This
is done in order to produce translations that preserve
syntactic or semantic ambiguity in cases where the
ambiguity could not be resolved when processing
the source sentence.
To be able to describe the above applications in an
abstract way, let us first fix some terminology. The
term “recognition” refers to the process of deciding
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 112-119.
whether an input string is in the language described by a grammar, the parsing grammar $G_P$. We will generalize this notion in a natural way to input representing a set of strings, and here the goal of recognition is to decide whether at least one of the strings in the set is in the language described by $G_P$. If the input is itself given in the form of a grammar, the input grammar $G_I$, then recognition amounts to determining whether the intersection of the languages described by $G_I$ and $G_P$ is non-empty. In this paper we use the term parsing as synonymous with recognition, since the recognition algorithms we present can be easily extended to yield parse trees (with associated probabilities if either $G_I$ or $G_P$ or both are probabilistic).
In what follows we consider the case where both $G_P$ and $G_I$ are CFGs. General CFGs have unfavourable computational properties with respect to intersection. In particular, the problem of deciding whether the intersection of two CFGs is non-empty is undecidable (Harrison, 1978). Following the terminology adopted above, this means that parsing a context-free input grammar $G_I$ on the basis of a context-free parsing grammar $G_P$ is not possible in general.
One way to make the parsing problem decidable is to place some additional restrictions on $G_I$ or $G_P$. This direction is taken by Langkilde (2000), where $G_I$ is a non-recursive CFG and $G_P$ represents a regular language, more precisely an $n$-gram model. In this way the problem can be solved using a stochastic variant of an algorithm presented by Bar-Hillel et al. (1964), where it is shown that the intersection of a general context-free language and a regular language is still context-free.
In the present paper we leave the theoretical framework of Bar-Hillel et al. (1964), and consider parsing grammars $G_P$ that are unrestricted CFGs, and input grammars $G_I$ that are non-recursive context-free grammars. In this case the parsing (intersection) problem becomes PSPACE-complete.¹ Despite this unfavourable theoretical result, algorithms for the problem at hand have been proposed in the literature and are currently used in practical applications. In (Knight and Langkilde, 2000), $G_I$ is
¹The PSPACE-hardness result has been shown by Harry B. Hunt III and Dan Rosenkrantz (Harry B. Hunt III, p.c.). Membership in PSPACE is shown by Nederhof and Satta (2002).
unfolded into a lattice (acyclic finite automaton) and later parsed with $G_P$ using an algorithm close to the one by Bar-Hillel et al. (1964). The algorithm proposed by Corazza and Lavelli (1994) involves copying of charts, and this makes it very similar in behaviour to the former approach. Thus in both algorithms parts of the input grammar $G_I$ are copied where a nonterminal occurs more than once, which destroys the compactness of the representation. In this paper we propose two alternative tabular algorithms that exploit the compactness of $G_I$ as much as possible. Although a limited amount of copying is also done by our algorithms, this never happens in cases where the resulting structure is ungrammatical with respect to the parsing grammar $G_P$.
The structure of this paper is as follows. In Sec-
tion 2 we introduce some preliminary definitions,
followed in Section 3 by a first algorithm based on
CKY parsing. A more sophisticated algorithm, sat-
isfying the equivalent of the correct-prefix property
and based on Earley’s algorithm, is presented in Sec-
tion 4. Section 5 presents our experimental results
and Section 6 closes with some discussion.
2 Preliminaries
In this section we briefly recall some standard no-
tions from formal language theory. For more details
we refer the reader to textbooks such as (Harrison,
1978).
A context-free grammar is a 4-tuple $(\Sigma, N, S, R)$, where $\Sigma$ is a finite set of terminals, called the alphabet, $N$ is a finite set of nonterminals, including the start symbol $S$, and $R$ is a finite set of rules having the form $A \to \gamma$ with $A \in N$ and $\gamma \in (\Sigma \cup N)^*$. Throughout the paper we assume the following conventions: $A, B, \ldots$ denote nonterminals, $a, b, \ldots$ denote terminals, $\alpha$, $\beta$, $\gamma$ are strings in $(\Sigma \cup N)^*$, and $v$, $w$ are strings in $\Sigma^*$. We also assume that each CFG is reduced, i.e., no CFG contains nonterminals that do not occur in any derivation of a string in the language. Furthermore, we assume that the input grammars do not contain epsilon rules and that there is only one rule $S \to \gamma$ defining the start symbol $S$.² Finally, in Section 3 we will consider parsing grammars in Chomsky normal form (CNF), i.e., grammars with rules of the form $A \to B\,C$ or $A \to a$.
²Strictly speaking, the assumption about the absence of epsilon rules is not without loss of generality, since without epsilon rules the language cannot contain the empty string. However, this has no practical consequence.
Instead of working with non-recursive CFGs, it will be more convenient in the specification of our algorithms to encode $G_I$ as a push-down automaton (PDA) with stack size bounded by some constant. Unlike many textbooks, we assume PDAs do not have states; this is without loss of generality, since states can be encoded in the symbols that occur top-most on the stack. Thus, a PDA is a 5-tuple $(\Sigma, Q, X_{init}, X_{final}, \Delta)$, where $\Sigma$ is the alphabet as above, $Q$ is a finite set of stack symbols including the initial stack symbol $X_{init}$ and the final stack symbol $X_{final}$, and $\Delta$ is the set of transitions, having one of the following three forms: $X \mapsto X\,Y$ (a push transition), $X\,Y \mapsto Z$ (a pop transition), or $X \stackrel{a}{\mapsto} Y$ (a scan transition, scanning symbol $a$). Throughout this paper we use the following conventions: $W$, $X$, $Y$, $Z$ denote stack symbols and $\delta$, $\eta$, $\theta$ are strings in $Q^*$ representing stacks. We remark that in our notation stacks grow from left to right, i.e., the top-most stack symbol will be found at the right end.
Configurations of the PDA have the form $(\delta, w)$, where $\delta \in Q^*$ is a stack and $w \in \Sigma^*$ is the remaining input. We let the binary relation $\vdash$ be defined by: $(\theta\delta, vw) \vdash (\theta\eta, w)$ if and only if there is a transition in $\Delta$ of the form $\delta \mapsto \eta$, where $v = \varepsilon$, or of the form $\delta \stackrel{a}{\mapsto} \eta$, where $v = a$. The relation $\vdash^*$ denotes the reflexive and transitive closure of $\vdash$. An input string $w$ is recognized by the PDA if and only if $(X_{init}, w) \vdash^* (X_{final}, \varepsilon)$.
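For concreteness, the relation $\vdash$ and the recognition condition can be sketched as follows. This is our own illustration, not the authors' implementation, and the toy PDA is hand-written rather than derived from a grammar; since the PDAs considered in this paper have bounded stack size, exhaustive search over configurations terminates.

```python
PUSH = {('X', 'Y')}           # push transition X |-> X Y
POP  = {(('X', 'Z'), 'F')}    # pop transition X Z |-> F
SCAN = {('Y', 'a', 'Z')}      # scan transition Y |-a-> Z

def step(stack, rest):
    """Yield all configurations reachable from (stack, rest) in one |- step."""
    for x, y in PUSH:
        if stack and stack[-1] == x:
            yield stack + (y,), rest
    for (x, y), z in POP:
        if stack[-2:] == (x, y):
            yield stack[:-2] + (z,), rest
    for x, a, y in SCAN:
        if stack and stack[-1] == x and rest[:1] == (a,):
            yield stack[:-1] + (y,), rest[1:]

def recognizes(w, x_init='X', x_final='F'):
    """Decide whether (x_init, w) |-* (x_final, eps) by exhaustive search."""
    agenda, seen = [((x_init,), tuple(w))], set()
    while agenda:
        conf = agenda.pop()
        if conf in seen:
            continue
        seen.add(conf)
        stack, rest = conf
        if stack == (x_final,) and not rest:
            return True
        agenda.extend(step(stack, rest))
    return False
```

On this toy PDA the only accepted string is "a": the computation pushes Y, scans a, and pops down to the final symbol F.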
3 The CKY algorithm
In this section we present our first parsing algorithm,
based on the so-called CKY algorithm (Harrison,
1978) and exploiting a decomposition of computa-
tions of PDAs cast in a specific form. We start with
a construction that translates the non-recursive input CFG $G_I$ into a PDA accepting the same language.
Let $G_I = (\Sigma, N, S, R)$. The PDA associated with $G_I$ is specified as
$(\Sigma,\, Q,\, [S \to \bullet\, \gamma],\, [S \to \gamma\, \bullet],\, \Delta)$,
where $Q$ consists of symbols of the form $[A \to \alpha \bullet \beta]$ for $(A \to \alpha\beta) \in R$, and $\Delta$ contains the following transitions:
• For each pair of rules $A \to \alpha B \beta$ and $B \to \gamma$, $\Delta$ contains:
$[A \to \alpha \bullet B\beta] \mapsto [A \to \alpha \bullet B\beta]\,[B \to \bullet\, \gamma]$
and
$[A \to \alpha \bullet B\beta]\,[B \to \gamma\, \bullet] \mapsto [A \to \alpha B \bullet \beta]$.
• For each rule $A \to \alpha a \beta$, $\Delta$ contains:
$[A \to \alpha \bullet a\beta] \stackrel{a}{\mapsto} [A \to \alpha a \bullet \beta]$.
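The construction can be prototyped in a few lines. The sketch below is our own illustration (not the authors' implementation), using a hypothetical toy grammar; a dotted rule $[A \to \alpha \bullet \beta]$ is represented as a tuple (lhs, rhs, i).

```python
def build_pda(rules, start):
    """rules: dict mapping each nonterminal to a list of right-hand sides
    (tuples of symbols); start: the start symbol, assumed to have one rule."""
    nonterms = set(rules)
    push, pop, scan = set(), set(), set()
    # A dotted rule (lhs, rhs, i) stands for the stack symbol [lhs -> rhs[:i] . rhs[i:]].
    for lhs, rhss in rules.items():
        for rhs in rhss:
            for i, sym in enumerate(rhs):
                if sym in nonterms:
                    for gamma in rules[sym]:
                        # [A -> alpha . B beta] |-> [A -> alpha . B beta][B -> . gamma]
                        push.add(((lhs, rhs, i), (sym, gamma, 0)))
                        # [A -> alpha . B beta][B -> gamma .] |-> [A -> alpha B . beta]
                        pop.add((((lhs, rhs, i), (sym, gamma, len(gamma))),
                                 (lhs, rhs, i + 1)))
                else:
                    # [A -> alpha . a beta] |-a-> [A -> alpha a . beta]
                    scan.add(((lhs, rhs, i), sym, (lhs, rhs, i + 1)))
    gamma0 = rules[start][0]   # the unique start rule S -> gamma0
    return (start, gamma0, 0), (start, gamma0, len(gamma0)), push, pop, scan

# Hypothetical toy input grammar: S -> A A, A -> a | b
rules = {'S': [('A', 'A')], 'A': [('a',), ('b',)]}
x_init, x_final, push, pop, scan = build_pda(rules, 'S')
```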
Observe that for all PDAs constructed as above, no push transition can be immediately followed by a pop transition, i.e., there are no stack symbols $X$, $Y$ and $Z$ such that $X \mapsto X\,Y$ and $X\,Y \mapsto Z$. As a consequence of this, a computation $(X_{init}, w) \vdash^* (X_{final}, \varepsilon)$ of the PDA can always and uniquely be decomposed into consecutive subcomputations, which we call segments, each starting with zero or more push transitions, followed by a single scan transition and by zero or more pop transitions. In what follows, we will formalize this basic idea and exploit it within our parsing algorithm.
We write $\delta \stackrel{a}{\Rightarrow} \eta$ to indicate that there is a computation $(\delta, a) \vdash^* (\eta, \varepsilon)$ of the PDA such that all of the following three conditions hold:
(i) either $|\delta| = 1$ or $|\eta| = 1$;
(ii) the computation starts with zero or more push transitions, followed by one scan transition reading $a$ and by zero or more pop transitions;
(iii) if $|\delta| > 1$ then the top-most symbol of $\delta$ must be in the right-hand side of a pop or scan transition (i.e., top-most in the stack at the end of a previous segment) and if $|\eta| > 1$, then the top-most symbol of $\eta$ must be in the left-hand side of a push or scan transition (i.e., top-most in the stack at the beginning of a following segment).
Let $\mathit{starts} = \{X_{init}\} \cup \{Z \mid \exists X, Y\, [X\,Y \mapsto Z]\} \cup \{Y \mid \exists X, a\, [X \stackrel{a}{\mapsto} Y]\}$, and $\mathit{ends} = \{X_{final}\} \cup \{X \mid \exists Y\, [X \mapsto X\,Y]\} \cup \{X \mid \exists Y, a\, [X \stackrel{a}{\mapsto} Y]\}$. A formal definition of the relation $\stackrel{a}{\Rightarrow}$ above is provided in Figure 1 by means of a deduction system. We assign a procedural interpretation to such a system following Shieber et al. (1995), resulting in an algorithm for the computation of the relation.
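Under that procedural interpretation, the rules of Figure 1 amount to a closure computation. The sketch below is our own illustration on a hypothetical toy PDA (not taken from the paper); it applies rules (1)-(4) until no new segments can be derived.

```python
PUSH = {('X', 'Y')}          # X |-> X Y
POP  = {(('X', 'Z'), 'F')}   # X Z |-> F
SCAN = {('Y', 'a', 'Z')}     # Y |-a-> Z
STARTS = {'X'} | {z for _, z in POP} | {y for _, _, y in SCAN}   # 'X' plays X_init
ENDS   = {'F'} | {w for w, _ in PUSH} | {x for x, _, _ in SCAN}  # 'F' plays X_final

def segments():
    """Closure under rules (1)-(4): all segments (delta, a, eta)."""
    segs = {((x,), a, (y,)) for x, a, y in SCAN}                  # rule (1)
    while True:
        new = set()
        for delta, a, eta in segs:
            if len(delta) == 1 and len(eta) == 1:                 # rule (2)
                for w, x in PUSH:
                    for (w2, y), z in POP:
                        if x == delta[0] and w == w2 and y == eta[0]:
                            new.add(((w,), a, (z,)))
            if len(eta) == 1:                                     # rule (3)
                for (w, y), z in POP:
                    if y == eta[0] and delta[-1] in STARTS:
                        new.add(((w,) + delta, a, (z,)))
            if len(delta) == 1:                                   # rule (4)
                for w, x in PUSH:
                    if x == delta[0] and eta[-1] in ENDS:
                        new.add(((w,), a, (w,) + eta))
        if new <= segs:
            return segs
        segs |= new
```

On this toy PDA the closure contains the scan segment itself and, by rule (2), the balanced segment from 'X' to 'F'.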
We now turn to an important property of segments. Any computation $(X_{init},\, a_1 \cdots a_n) \vdash^* (X_{final},\, \varepsilon)$, $n \geq 1$, can be computed by combining
$\frac{}{X \stackrel{a}{\Rightarrow} Y}\;\; X \stackrel{a}{\mapsto} Y$   (1)

$\frac{X \stackrel{a}{\Rightarrow} Y}{W \stackrel{a}{\Rightarrow} Z}\;\; W \mapsto W\,X,\; W\,Y \mapsto Z$   (2)

$\frac{\delta X \stackrel{a}{\Rightarrow} Y}{W \delta X \stackrel{a}{\Rightarrow} Z}\;\; W\,Y \mapsto Z,\; X \in \mathit{starts}$   (3)

$\frac{X \stackrel{a}{\Rightarrow} \delta Y}{W \stackrel{a}{\Rightarrow} W \delta Y}\;\; W \mapsto W\,X,\; Y \in \mathit{ends}$   (4)

Figure 1: Inference rules for the computation of relation $\stackrel{a}{\Rightarrow}$.
$\frac{\delta X \stackrel{a}{\Rightarrow} \eta Y}{\delta X \stackrel{a}{\Rightarrow'} \eta Y}\;\; X \in \mathit{starts} \wedge Y \in \mathit{ends}$   (5)

$\frac{\delta \stackrel{v}{\Rightarrow'} \theta \eta \quad \eta \stackrel{w}{\Rightarrow'} \chi}{\delta \stackrel{vw}{\Rightarrow'} \theta \chi}$   (6)

$\frac{\delta \stackrel{v}{\Rightarrow'} \eta \quad \theta \eta \stackrel{w}{\Rightarrow'} \chi}{\theta \delta \stackrel{vw}{\Rightarrow'} \chi}$   (7)

Figure 2: Inference rules for combining segments $\delta_i \stackrel{a_i}{\Rightarrow} \eta_i$.
$n$ segments represented by $\delta_i \stackrel{a_i}{\Rightarrow} \eta_i$, $1 \leq i \leq n$, with $\delta_1 = X_{init}$, $\eta_n = X_{final}$, and for $1 \leq i < n$, $\eta_i$ is a suffix of $\delta_{i+1}$ or $\delta_{i+1}$ is a suffix of $\eta_i$. This is done by the deduction system given in Figure 2, which defines the relation $\stackrel{v}{\Rightarrow'}$. The second side condition of inference rule (5) checks whether a segment $\delta X \stackrel{a}{\Rightarrow} \eta Y$ may border on other segments, or may be the first or last segment in a computation.
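Rules (6) and (7) can likewise be read procedurally. The following sketch is our own (over a hypothetical pair of segments for an input string $ab$, assumed to have passed the rule (5) check): it combines adjacent derivations whenever their stacks stand in the required suffix relation.

```python
SEGS = {(('X',), 'a', ('M',)), (('M',), 'b', ('F',))}  # toy segments after rule (5)

def combine(segs, max_len=4):
    """Closure under rules (6) and (7), for strings of length <= max_len."""
    derivs = set(segs)
    changed = True
    while changed:
        changed = False
        for d1, v, e1 in list(derivs):
            for d2, w, e2 in list(derivs):
                if len(v) + len(w) > max_len:
                    continue
                out = None
                if e1[-len(d2):] == d2:            # rule (6): d2 a suffix of e1
                    out = (d1, v + w, e1[:-len(d2)] + e2)
                elif d2[-len(e1):] == e1:          # rule (7): e1 a suffix of d2
                    out = (d2[:-len(e1)] + d1, v + w, e2)
                if out and out not in derivs:
                    derivs.add(out)
                    changed = True
    return derivs
```

For the two toy segments, the closure contains the combined derivation from the initial symbol to the final symbol over the whole string "ab".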
Figure 3 illustrates a computation of a PDA recognizing a string $a_1 a_2 a_3 a_4$. A horizontal line segment in the curve represents a scan transition, an upward line segment represents a push transition, and a downward line segment a pop transition. The shaded areas represent segments $\delta_i \stackrel{a_i}{\Rightarrow} \eta_i$. As an example, the area labelled I represents $X_{init} \stackrel{a_1}{\Rightarrow} X_{init} X_1 X_2$, for certain stack symbols $X_1$ and $X_2$, where the left edge of the shaded area represents $X_{init}$ and the right edge represents $X_{init} X_1 X_2$. Note that segments $\delta_i \stackrel{a_i}{\Rightarrow} \eta_i$ abstract away from the stack symbols that are pushed and then popped again. Furthermore, in the context of the whole computation, segments abstract away from stack symbols that are not accessed during a subcomputation. As an example, the shaded area labelled III represents segment $Y_1 Y_2 \stackrel{a_3}{\Rightarrow} Z$, for certain stack symbols $Y_1$, $Y_2$ and $Z$, and this abstracts away from the stack symbols that may occur below $Y_1$ and $Z$.
Figure 4 illustrates how two adjacent segments are
combined. The dashed box in the left-hand side of
the picture represents stack symbols from the right
edge of segment II that need not be explicitly repre-
sented by segment III, as discussed above. We may
assume that these symbols exist, so that II and III
can be combined into the larger computation in the
right-hand side of the picture. Note that if a computation $\delta \stackrel{w}{\Rightarrow'} \eta$ is obtained as the combination
of two segments as in Figure 4, then some internal
details of these segments are abstracted away, i.e.,
stack elements that were pushed and again popped in
the combined computation are no longer recorded.
This abstraction is a key feature of the parsing al-
gorithm to be presented next, in that it considerably
reduces the time complexity as compared with that
of an algorithm that investigates all computations of
the PDA in isolation.
We are now ready to present our parsing algorithm, which is the main result of this section. The algorithm combines the deduction system in Figure 2, as applied to the PDA encoding the input grammar $G_I$, with the CKY algorithm as applied to the parsing grammar $G_P$. (We assume that $G_P$ is in CNF.) The parsing algorithm may rule out many combinations of segments from Figure 2 that are inconsistent with the language generated by $G_P$. Also ruled out are structural compositions of segments that are inconsistent with the structure that $G_P$ assigns to the corresponding substrings.
The parsing algorithm is again specified as a deduction system, presented in Figure 5. The algorithm manipulates items of the form $[A, \delta, \eta]$, where $A$ is a nonterminal of $G_P$ and $\delta$, $\eta$ are stacks of the PDA encoding $G_I$. Such an item indicates that there
[Figure 3 plots stack height against time; the shaded areas are the segments labelled I, II, III and IV.]
Figure 3: A computation of a PDA divided into segments.
[Figure 4 shows segments II and III combined into II + III.]
Figure 4: Combining two segments using rule (6) from Figure 2.
$\frac{}{[A,\; \delta X,\; \eta Y]}\;\; \delta X \stackrel{a}{\Rightarrow} \eta Y,\; X \in \mathit{starts},\; Y \in \mathit{ends},\; A \to a$   (8)

$\frac{[B,\; \delta,\; \theta \eta] \quad [C,\; \eta,\; \chi]}{[A,\; \delta,\; \theta \chi]}\;\; A \to B\,C$   (9)

$\frac{[B,\; \delta,\; \eta] \quad [C,\; \theta \eta,\; \chi]}{[A,\; \theta \delta,\; \chi]}\;\; A \to B\,C$   (10)

Figure 5: Inference rules that simultaneously derive strings generated by $G_P$ and accepted by the PDA encoding $G_I$.
is some terminal string $w$ that is derivable from $A$ in $G_P$, and such that $(\delta, w) \vdash^* (\eta, \varepsilon)$. If the item $[S, X_{init}, X_{final}]$ can be derived by the algorithm, then the intersection of the language generated by $G_P$ and the language accepted by the PDA (generated by $G_I$) is non-empty.
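The rules of Figure 5 again admit a closure reading. The sketch below is our own illustration, with a hypothetical two-rule CNF grammar and the segments of a toy PDA for the string "ab"; it derives the goal item exactly when the intersection is non-empty.

```python
UNIT   = {('A', 'a'), ('B', 'b')}    # CNF rules of the form A -> a
BINARY = {('S', 'A', 'B')}           # CNF rules of the form A -> B C
SEGS   = {(('X',), 'a', ('M',)), (('M',), 'b', ('F',))}  # toy PDA segments
STARTS = {'X', 'M', 'F'}
ENDS   = {'F', 'X', 'M'}

def parse():
    """Closure under rules (8)-(10); items are triples (A, delta, eta)."""
    items = {(n, d, e) for d, a, e in SEGS for n, t in UNIT
             if t == a and d[-1] in STARTS and e[-1] in ENDS}      # rule (8)
    changed = True
    while changed:
        changed = False
        for n, b, c in BINARY:
            for n1, d1, e1 in list(items):
                for n2, d2, e2 in list(items):
                    if n1 != b or n2 != c:
                        continue
                    out = None
                    if e1[-len(d2):] == d2:                        # rule (9)
                        out = (n, d1, e1[:-len(d2)] + e2)
                    elif d2[-len(e1):] == e1:                      # rule (10)
                        out = (n, d2[:-len(e1)] + d1, e2)
                    if out and out not in items:
                        items.add(out)
                        changed = True
    return items
```

Here 'X' and 'F' play the roles of the initial and final stack symbols, so the derived item ('S', ('X',), ('F',)) signals a non-empty intersection.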
4 Earley’s algorithm
The CKY algorithm from Figure 5 can be seen to
filter out a selection of the computations that may be
derived by the deduction system from Figure 2. One
may however be even more selective in determining
which computations of the PDA to consider. The ba-
sis for the algorithm in this section is Earley’s algo-
rithm (Earley, 1970). This algorithm differs from the
CKY algorithm in that it satisfies the correct-prefix
property (Harrison, 1978).
The new algorithm is presented in Figure 6. There are now two types of item involved. The first type has the form $[A \to \alpha \bullet \beta \mid \theta {\triangleright} \delta,\; \theta {\triangleright} \eta]$, where $A \to \alpha \bullet \beta$ has the same role as the dotted rules in Earley's original algorithm. The second and third components are stacks of the PDA as before, but these stacks now contain a distinguished position, indicated by ${\triangleright}$. The existence of an item $[A \to \alpha \bullet \beta \mid \theta {\triangleright} \delta,\; \theta {\triangleright} \eta]$ implies that $(\theta\delta, v) \vdash^* (\theta\eta, \varepsilon)$, where $v$ is now a string derivable from $\alpha$. This is quite similar to the meaning we assigned to the items of the CKY algorithm, but here not all stack symbols in $\theta\delta$ and $\theta\eta$ are involved in this computation: only the symbols in $\delta$ and $\eta$ are now accessed, while all symbols in $\theta$ remain unaffected. The portion of the stack represented by $\theta$ is needed to ensure the correct-prefix property in subsequent computations following from this item, in case all of the symbols in $\eta$ are popped.
The correct-prefix property is ensured in the following sense. The existence of an item $[A \to \alpha \bullet \beta \mid \theta {\triangleright} \delta,\; \theta {\triangleright} \eta]$ implies that (i) there is a string $wv$ that is both a prefix of a string accepted by the PDA and of a string generated by the CFG such that after
$\frac{}{[S \to \bullet\, \gamma \mid {\triangleright} X_{init},\; {\triangleright} X_{init}]}\;\; S \to \gamma$   (11)

$\frac{[A \to \alpha \bullet a \beta \mid {\triangleright} \delta,\; {\triangleright} \theta \eta]}{[A \to \alpha a \bullet \beta \mid {\triangleright} \delta,\; {\triangleright} \theta \chi]}\;\; \eta \stackrel{a}{\Rightarrow'} \chi$   (12)

$\frac{[A \to \alpha \bullet a \beta \mid \theta {\triangleright} \delta,\; \theta {\triangleright} \eta]}{[A \to \alpha a \bullet \beta \mid {\triangleright} \theta \delta,\; {\triangleright} \chi]}\;\; \theta \eta \stackrel{a}{\Rightarrow'} \chi$   (13)

$\frac{[A \to \alpha \bullet a \beta \mid {\triangleright} \delta,\; {\triangleright} \eta]}{[A \to \alpha \bullet a \beta \mid {\triangleright} \delta,\; {\triangleright} \eta \mid W?]}\;\; \theta W \eta \stackrel{a}{\Rightarrow'} \chi$   (14)

$\frac{[A \to \alpha \bullet B \beta \mid {\triangleright} \delta,\; {\triangleright} \eta X]}{[B \to \bullet\, \gamma \mid {\triangleright} X,\; {\triangleright} X]}\;\; B \to \gamma$   (15)

$\frac{[A \to \alpha \bullet B \beta \mid {\triangleright} \delta,\; {\triangleright} \theta \eta] \quad [B \to \gamma\, \bullet \mid {\triangleright} \eta,\; {\triangleright} \chi]}{[A \to \alpha B \bullet \beta \mid {\triangleright} \delta,\; {\triangleright} \theta \chi]}$   (16)

$\frac{[A \to \alpha \bullet B \beta \mid \theta {\triangleright} \delta,\; \theta {\triangleright} \eta] \quad [B \to \gamma\, \bullet \mid {\triangleright} \theta \eta,\; {\triangleright} \chi]}{[A \to \alpha B \bullet \beta \mid {\triangleright} \theta \delta,\; {\triangleright} \chi]}$   (17)

$\frac{[A \to \alpha \bullet \beta \mid \delta X,\; \eta \mid W?]}{[A \to \bullet\, \alpha \beta \mid \delta {\triangleright} X,\; \delta {\triangleright} X \mid W?]}$   (18)

$\frac{[A \to \alpha \bullet B \beta \mid \delta,\; \eta X] \quad [B \to \bullet\, \gamma \mid \eta {\triangleright} X,\; \eta {\triangleright} X \mid W?]}{[A \to \alpha \bullet B \beta \mid \delta,\; \eta X \mid W?]}$   (19)

$\frac{[A \to \alpha \bullet B \beta \mid \delta,\; \theta W \eta X] \quad [B \to \bullet\, \gamma \mid \eta {\triangleright} X,\; \eta {\triangleright} X \mid W?]}{[B \to \bullet\, \gamma \mid W \eta {\triangleright} X,\; W \eta {\triangleright} X]}$   (20)

$\frac{[A \to \bullet\, \alpha \beta \mid W \delta_1 \delta_2 {\triangleright} X,\; W \delta_1 \delta_2 {\triangleright} X] \quad [A \to \alpha \bullet \beta \mid \delta_1 {\triangleright} \delta_2 X,\; \delta_1 {\triangleright} \eta \mid W?]}{[A \to \alpha \bullet \beta \mid W \delta_1 {\triangleright} \delta_2 X,\; W \delta_1 {\triangleright} \eta]}$   (21)

Figure 6: Inference rules based on Earley's algorithm.
processing $w$, $A$ is expanded in a left-most derivation and some stack can be obtained of which $\theta\delta$ represent the top-most elements, and (ii) $\alpha$ is rewritten to $v$ and while processing $v$ the PDA replaces the stack elements $\delta$ by $\eta$.³
The second type of item has the form $[A \to \alpha \bullet \beta \mid \theta {\triangleright} \delta,\; \theta {\triangleright} \eta \mid W?]$. The first three components are the same as before, and $W$ indicates that we wish to know whether a stack with top-most symbols $W\theta\delta$ may arise after reading a prefix of a string that may also lead to expansion of nonterminal $A$ in a left-most derivation. Such an item results if it is detected that the existence of $W$ below $\theta\delta$ needs to be ensured in order to continue the computation under the constraint of the correct-prefix property.
Our algorithm also makes use of segments, as computed by the algorithm from Figure 1. Consistently with rule (5) from Figure 2, we write $\delta X \stackrel{a}{\Rightarrow'} \eta Y$ to represent a segment $\delta X \stackrel{a}{\Rightarrow} \eta Y$ such that $X \in \mathit{starts} \wedge Y \in \mathit{ends}$. The use of segments that were computed bottom-up is a departure from pure left-to-right processing in the spirit of Earley's original algorithm. The motivation is that we have found empirically that the use of rule (2) was essential for avoiding a large part of the exponential behaviour; note that that rule considers at most a number of stacks that is quadratic in the size of the PDA.
The first inference rule (11) can be easily justified:
we want to investigate strings that are both generated
by the grammar and recognized by the PDA, so we
begin by combining the start symbol and a match-
ing right-hand side from the grammar with the initial
stack for the PDA.
Segments are incorporated into the left-to-right computation by rules (12) and (13). These two rules are the equivalents of (9) and (10) from Figure 5. Note that in the case of (13) we require the presence of $\theta$ below the marker in the antecedent. This indicates that a stack with top-most symbols $\theta\delta$ and a dotted rule $A \to \alpha \bullet a\beta$ can be obtained by simultaneously processing a string from left to right by the grammar and the PDA. Thereby, we may continue the derivation with the item in the consequent without violating the correct-prefix property.
Rule (14) states that if a segment presupposes the
existence of stack elements that are not yet available,
we produce an item that starts a backward computation. We do this one symbol at a time, starting with
³We naturally assume that the PDA itself satisfies the correct-prefix property, which is guaranteed by the construction from Section 3 and the fact that $G_I$ is reduced.
the symbol $W$ just beneath the part of the stack that is already available. This will be discussed more carefully below.
The predictor step of Earley’s algorithm is repre-
sented by (15), and the completer step by rules (16)
and (17). These latter two are very similar to (12)
and (13) in that they incorporate a smaller derivation
in a larger derivation.
Rules (18) and (19) repeat computations that have been done before, but in a backward manner, in order to propagate the information that deeper stack symbols are needed than those currently available, in particular that we want to know whether a certain stack symbol $W$ may occur below the currently available parts of the stack. In (18) this query is passed on to the beginning of the context-free rule, and in (19) this query is passed on backwards through a predictor step. In the antecedent of rule (18) the position of the marker is irrelevant, and is not indicated explicitly. Similarly, for rule (19) we assume the position of the marker is copied unaltered from the first antecedent to the consequent.
If we find the required stack symbol $W$, we propagate the information forward that this symbol may indeed occur at the specified position in the stack. This is implemented by rules (20) and (21). Rule (20) corresponds to the predictor step (15), but (20) passes on a larger portion of the stack than (15). Rule (15) only transfers the top-most symbol $X$ to the consequent, in order to keep the stacks as shallow as possible and to achieve a high degree of sharing of computation.
5 Empirical results
We have implemented the two algorithms and tested
them on non-recursive input CFGs and a parsing
CFG. We have had access to six input CFGs of the
form described by Langkilde (2000). As parsing
CFG we have taken a small hand-written grammar
of about 100 rules. While this small size is not at all
typical of practical grammars, it suffices to demon-
strate the applicability of our algorithms.
The results of the experiments are reported in Table 1. We have ordered the input grammars by size, according to the number of nonterminals (or the number of nodes in the forest, following the terminology of Langkilde (2000)).
The second column presents the number of strings
generated by the input CFG, or more accurately,
the number of derivations, as the grammars contain
some ambiguity. The high numbers show that with-
out a doubt the naive solution of processing the input
grammars by enumerating individual strings (deriva-
tions) is not a viable option.
The third column shows the size, expressed as
number of states, of a lattice (acyclic finite au-
tomaton) that would result by unfolding the gram-
mar (Knight and Langkilde, 2000). Although this
approach could be of more practical interest than
the naive approach of enumerating all strings, it still
leads to large intermediate results. In fact, practical
context-free parsing algorithms for finite automata
have cubic time complexity in the number of states,
and derive a number of items that is quadratic in the
number of states.
The next column presents the number of segments $\delta \stackrel{a}{\Rightarrow} \eta$. These apply to both algorithms. We only compute segments $\delta \stackrel{a}{\Rightarrow} \eta$ for terminals $a$ that also occur in the parsing grammar. (Further obvious optimizations in the case of Earley's algorithm were found to lead to no more than a slight reduction of produced segments.) The last two columns present the number of items specific to the two algorithms in Figures 5 and 6, respectively. Although our two algorithms are exponential in the number of stack symbols in the worst case, just as approaches that enumerate all strings or that unfold $G_I$ into a lattice, we see that the numbers of items are relatively moderate if we compare them to the number of strings generated by the input grammars.
Earley’s algorithm generally produces more items
than the CKY algorithm. An exception is the last in-
put CFG; it seems that the number of items that Ear-
ley’s algorithm needs to consider in order to main-
tain the correct-prefix property is very sensitive to
qualities of the particular input CFG.
The present implementations use a trie to store stacks; the arcs in the trie closest to the root represent stack symbols closest to the top of the stacks. For example, for storing $\delta \stackrel{a}{\Rightarrow} \eta$, the algorithm represents $\delta$ and $\eta$ by their corresponding nodes in the trie, and it indexes $\delta \stackrel{a}{\Rightarrow} \eta$ twice, once through each associated node. Since the trie is doubly linked (i.e., we may traverse the trie upwards as well as downwards), we can always reconstruct the stacks
Table 1: Empirical results.

# nonts   # strings       # states   # segments   # items CKY   # items Earley
168       ?.? · 10^?          2643         1437          1252             6969
248       ?.? · 10^?         21984         3542          4430            40568
259       ?.0 · 10^?          6528          957          1314            29925
361       1.? · 10^11        77198         7824         14627            14907
586       ?.? · 10^12        45713         8832          5608             8611
869       ?.? · 10^12        63851        15679          5709             3781
from the corresponding nodes. This structure is also
convenient for finding pairs of matching stacks, one
of which may be deeper than the other, as required
by the inference rules from e.g. Figure 5, since given
the first stack in such a pair, the second can be found
by traversing the trie either upwards or downwards.
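Such a doubly linked trie can be sketched as follows. This is our own illustration (symbol names are hypothetical): arcs nearest the root carry the symbols nearest the top, so stacks sharing top-most symbols share a prefix of the trie.

```python
class TrieNode:
    def __init__(self, symbol=None, parent=None):
        self.symbol, self.parent = symbol, parent   # upward link
        self.children = {}                          # downward links

    def child(self, symbol):
        if symbol not in self.children:
            self.children[symbol] = TrieNode(symbol, self)
        return self.children[symbol]

def intern(root, stack):
    """Store a stack (top-most symbol last) and return its trie node."""
    node = root
    for symbol in reversed(stack):      # walk from the top of the stack down
        node = node.child(symbol)
    return node

def stack_of(node):
    """Reconstruct the stack from a node by following the upward links."""
    out = []
    while node.symbol is not None:
        out.append(node.symbol)
        node = node.parent
    return tuple(out)

root = TrieNode()
n1 = intern(root, ('W', 'X', 'Y'))      # top-most symbol Y
n2 = intern(root, ('V', 'X', 'Y'))      # shares the arcs for Y and X
```

Because the nodes for the two stacks share the arcs for their common top-most symbols, matching a shallow stack against a deeper one reduces to walking up or down from a node.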
6 Discussion
It is straightforward to give an algorithm for parsing
a finite language: we may trivially parse each string
in the language in isolation. However, this is not a
practical solution when the number of strings in the
language exceeds all reasonable bounds.
Some algorithms have been described in the exist-
ing literature that parse sets of strings of exponential
size in the length of the input description. These so-
lutions have not considered context-free parsing of
finite languages encoded by non-recursive CFGs, in
a way that takes full advantage of the compactness
of the representation. Our algorithms make this pos-
sible, relying on the compactness of the input gram-
mars for efficiency in practical cases, and on the ab-
sence of recursion for guaranteeing termination. Our
experiments also show that these algorithms are of
practical interest.
Acknowledgements
We are indebted to Irene Langkilde for putting at our disposal the non-recursive CFGs on which we have based our empirical evaluation.
References
Y. Bar-Hillel, M. Perles, and E. Shamir. 1964. On formal
properties of simple phrase structure grammars. In
Y. Bar-Hillel, editor, Language and Information: Se-
lected Essays on their Theory and Application, chap-
ter 9, pages 116–150. Addison-Wesley.
A. Corazza and A. Lavelli. 1994. An N-best representation for bidirectional parsing strategies. In Working
Notes of the AAAI’94 Workshop on Integration of Nat-
ural Language and Speech Processing, pages 7–14,
Seattle, WA.
J. Earley. 1970. An efficient context-free parsing algo-
rithm. Communications of the ACM, 13(2):94–102,
February.
M.A. Harrison. 1978. Introduction to Formal Language
Theory. Addison-Wesley.
K. Knight and I. Langkilde. 2000. Preserving ambigu-
ities in generation via automata intersection. In Pro-
ceedings of the Seventeenth National Conference on
Artificial Intelligence and Twelfth Conference on In-
novative Applications of Artificial Intelligence, pages
697–702, Austin, Texas, USA, July–August.
I. Langkilde. 2000. Forest-based statistical sentence generation. In 6th Applied Natural Language Processing Conference and 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 170–177, Seattle, Washington, USA, April–May.
M.-J. Nederhof and G. Satta. 2002. The emptiness prob-
lem for intersection of a CFG and a nonrecursive CFG
is PSPACE-complete. In preparation.
S.M. Shieber, Y. Schabes, and F.C.N. Pereira. 1995.
Principles and implementation of deductive parsing.
Journal of Logic Programming, 24:3–36.
