Guided Parsing of Range Concatenation Languages
François Barthélemy, Pierre Boullier, Philippe Deschamp and Éric de la Clergerie
INRIA-Rocquencourt
Domaine de Voluceau
B.P. 105
78153 Le Chesnay Cedex, France
{Francois.Barthelemy, Pierre.Boullier,
Philippe.Deschamp, Eric.De La Clergerie}@inria.fr
Abstract
The theoretical study of the range
concatenation grammar [RCG] formal-
ism has revealed many attractive prop-
erties which may be used in NLP.
In particular, range concatenation lan-
guages [RCL] can be parsed in poly-
nomial time and many classical gram-
matical formalisms can be translated
into equivalent RCGs without increas-
ing their worst-case parsing time com-
plexity. For example, after transla-
tion into an equivalent RCG, any tree
adjoining grammar can be parsed in
O(n^6) time. In this paper, we study a
parsing technique whose purpose is to
improve the practical efficiency of RCL
parsers. The non-deterministic parsing
choices of the main parser for a lan-
guage L are directed by a guide which
uses the shared derivation forest output
by a prior RCL parser for a suitable su-
perset of L. The results of a practi-
cal evaluation of this method on a wide
coverage English grammar are given.
1 Introduction
Usually, during a nondeterministic process, when
a nondeterministic choice occurs, one explores all
possible ways, either in parallel or one after the
other, using a backtracking mechanism. In both
cases, the nondeterministic process may be as-
sisted by another process to which it asks its way.
This assistant may be either a guide or an oracle.
An oracle always indicates all the good ways that
will eventually lead to success, and those good
ways only, while a guide will indicate all the good
ways but may also indicate some wrong ways. In
other words, an oracle is a perfect guide (Kay,
2000), and the worst guide indicates all possi-
ble ways. Given two problems a12a14a13 and a12a16a15 and
their respective solutions a17a18a13 and a17a19a15 , if they are
such that a17a20a13a22a21a23a17a8a15 , any algorithm which solves
a12 a13 is a candidate guide for nondeterministic al-
gorithms solving a12a16a15 . Obviously, supplementary
conditions have to be fulfilled fora12a24a13 to be a guide.
The first one deals with relative efficiency: it as-
sumes that problem a12a24a13 can be solved more effi-
ciently than problem a12 a15 . Of course, parsers are
privileged candidates to be guided. In this pa-
per we apply this technique to the parsing of a
subset of RCLs that are the languages defined by
RCGs. The syntactic formalism of RCGs is pow-
erful while staying computationally tractable. In-
deed, the positive version of RCGs [PRCGs] de-
fines positive RCLs [PRCLs] that exactly cover
the class PTIME of languages recognizable in de-
terministic polynomial time. For example, any
mildly context-sensitive language is a PRCL.
In Section 2, we present the definitions of
PRCGs and PRCLs. Then, in Section 3, we design
an algorithm which transforms any PRCL L into
another PRCL L1, L ⊆ L1, such that the (theoretical)
parse time for L1 is less than or equal to the parse
time for L: the parser for L will be guided by the
parser for L1. Last, in Sections 4 and 5, we present
guided parsing and relate some experiments with a
wide coverage tree-adjoining grammar [TAG] for English.
2 Positive Range Concatenation
Grammars
This section only presents the basics of RCGs;
more details can be found in (Boullier, 2000b).
A positive range concatenation grammar
[PRCG] G = (N, T, V, P, S) is a 5-tuple where
N is a finite set of nonterminal symbols (also
called predicate names), T and V are finite, dis-
joint sets of terminal symbols and variable sym-
bols respectively, S ∈ N is the start predicate
name, and P is a finite set of clauses

    ψ0 → ψ1 ... ψm

where m ≥ 0 and each of ψ0, ψ1, ..., ψm is a
predicate of the form

    A(α1, ..., αi, ..., αp)

where p ≥ 1 is its arity, A ∈ N, and each
αi ∈ (T ∪ V)*, 1 ≤ i ≤ p, is an argument.
Each occurrence of a predicate in the LHS
(resp. RHS) of a clause is a predicate defini-
tion (resp. call). Clauses which define predicate
name A are called A-clauses. Each predicate
name A ∈ N has a fixed arity whose value is
arity(A). By definition arity(S) = 1. The ar-
ity of an A-clause is arity(A), and the arity k
of a grammar (we have a k-PRCG) is the max-
imum arity of its clauses. The size of a clause
c = A0(...) → A1(...) ... Ai(...) ... Am(...) is the
integer |c| = Σ_{i=0..m} arity(Ai) and the size of
G is |G| = Σ_{c ∈ P} |c|.
For a given string w = a1 ... an ∈ T*, a pair
of integers (i, j) s.t. 0 ≤ i ≤ j ≤ n is called a
range, and is denoted ⟨i..j⟩_w: i is its lower bound,
j is its upper bound and j − i is its size. For a
given w, the set of all ranges is noted R_w. In
fact, ⟨i..j⟩_w denotes the occurrence of the string
a_{i+1} ... a_j in w. Two ranges ⟨i..j⟩_w and ⟨k..l⟩_w
can be concatenated iff the two bounds j and k are
equal; the result is the range ⟨i..l⟩_w. Variable oc-
currences, or more generally strings in (T ∪ V)*,
can be instantiated to ranges. However, an oc-
currence of the terminal t can be instantiated to
the range ⟨j−1..j⟩_w iff t = a_j. That is, in a
clause, several occurrences of the same terminal
may well be instantiated to different ranges while
several occurrences of the same variable can only
be instantiated to the same range. Of course, the
concatenation on strings matches the concatena-
tion on ranges.
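These range operations are easy to state as code. The following Python sketch is an illustration under our own encoding, not the authors' implementation: a range is a plain pair (i, j), and the function names are ours.

```python
# A range over an input w of length n is a pair (i, j), 0 <= i <= j <= n,
# standing for the occurrence w[i:j] of the substring a_{i+1}...a_j.

def concat(r1, r2):
    """Concatenate two ranges; defined iff the upper bound of r1
    equals the lower bound of r2."""
    (i, j), (k, l) = r1, r2
    return (i, l) if j == k else None

def terminal_ranges(w, t):
    """Ranges a terminal t may be instantiated to in w: the size-one
    ranges (j - 1, j) such that w[j - 1] == t."""
    return [(j - 1, j) for j in range(1, len(w) + 1) if w[j - 1] == t]
```

For w = "aab", the terminal a may be instantiated to (0, 1) or (1, 2), two different occurrences, while a given variable is always bound to a single range.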
We say that A(ρ1, ..., ρp) is an instantiation of
the predicate A(α1, ..., αp) iff ρi ∈ R_w, 1 ≤ i ≤
p, and each symbol (terminal or variable) of
αi, 1 ≤ i ≤ p, is instantiated to a range in R_w s.t. αi
is instantiated to ρi. If, in a clause, all predicates
are instantiated, we have an instantiated clause.
A binary relation derive, denoted ⇒_{G,w}, is de-
fined on strings of instantiated predicates. If
Γ1 Λ Γ2 is a string of instantiated predicates and if
Λ is the LHS of some instantiated clause Λ → Γ,
then we have Γ1 Λ Γ2 ⇒_{G,w} Γ1 Γ Γ2.
An input string w ∈ T*, |w| = n, is a sen-
tence iff the empty string (of instantiated predi-
cates) can be derived from S(⟨0..n⟩_w), the instan-
tiation of the start predicate on the whole source
text. Such a sequence of instantiated predicates is
called a complete derivation. L(G), the PRCL de-
fined by a PRCG G, is the set of all its sentences.
For a given sentence w, as in the context-free
[CF] case, a single complete derivation can be
represented by a parse tree and the (unbounded)
set of complete derivations by a finite structure,
the parse forest. All possible derivation strategies
(i.e., top-down, bottom-up, . . . ) are encompassed
within both parse trees and parse forests.
A clause is:
• combinatorial if at least one argument of its
RHS predicates does not consist of a single
variable;
• bottom-up erasing (resp. top-down erasing)
if there is at least one variable occurring in
its RHS (resp. LHS) which does not appear
in its LHS (resp. RHS);
• erasing if there exists a variable appearing
only in its LHS or only in its RHS;
• linear if none of its variables occurs twice in
its LHS or twice in its RHS;
• simple if it is non-combinatorial, non-
erasing and linear.
These definitions extend naturally from clause
to set of clauses (i.e., grammar).
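To make the classification concrete, here is a Python sketch; the clause encoding (an LHS given as a tuple of arguments, an RHS given as a list of argument tuples, each argument a tuple of symbols) and the function names are ours, purely illustrative.

```python
from collections import Counter

def var_counts(arglists, variables):
    """Count variable occurrences over a list of argument tuples."""
    return Counter(s for args in arglists for arg in args
                   for s in arg if s in variables)

def is_combinatorial(lhs, rhs, variables):
    """Some RHS argument is not a single variable."""
    return any(len(arg) != 1 or arg[0] not in variables
               for args in rhs for arg in args)

def is_erasing(lhs, rhs, variables):
    """Some variable appears only in the LHS or only in the RHS."""
    l, r = var_counts([lhs], variables), var_counts(rhs, variables)
    return set(l) != set(r)

def is_linear(lhs, rhs, variables):
    """No variable occurs twice in the LHS or twice in the RHS."""
    l, r = var_counts([lhs], variables), var_counts(rhs, variables)
    return all(c == 1 for c in l.values()) and all(c == 1 for c in r.values())

def is_simple(lhs, rhs, variables):
    return (not is_combinatorial(lhs, rhs, variables)
            and not is_erasing(lhs, rhs, variables)
            and is_linear(lhs, rhs, variables))
```

On the clause A(aX, aY, aZ) → A(X, Y, Z) of Section 3, all four tests together classify it as simple.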
In this paper we will not consider negative
RCGs, since the guide construction algorithm
presented in Section 3 is not valid for this class.
Thus, in the sequel, we shall assume that RCGs
are PRCGs.
In (Boullier, 2000b), a parsing algorithm is
presented which, for any RCG G and any input
string of length n, produces a parse forest in
O(|G| n^d) time. The exponent d, called the degree
of G, is the maximum number of free (indepen-
dent) bounds in a clause. For a non-bottom-up-
erasing RCG, d is less than or equal to the max-
imum value, for all clauses, of the sum p_c + v_c
where, for a clause c, p_c is its arity and v_c is the
number of (different) variables in its LHS predi-
cate.
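This upper bound can be computed mechanically. The following Python sketch is an illustration; the encoding of a grammar as a list of clauses, each carrying its tuple of LHS arguments, is our own assumption.

```python
def degree_bound(clauses, variables):
    """Upper bound on the degree d of a non-bottom-up-erasing RCG:
    max over clauses of p_c + v_c, where p_c is the LHS arity and
    v_c the number of distinct variables in the LHS predicate."""
    bound = 0
    for lhs_args, _rhs in clauses:
        p = len(lhs_args)                                        # arity p_c
        v = len({s for arg in lhs_args for s in arg if s in variables})
        bound = max(bound, p + v)
    return bound
```

For the 3-copy grammar of Section 3, the bound is 6, given by the arity-3 clauses whose LHS carries the three variables X, Y and Z.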
3 PRCG to 1-PRCG Transformation
Algorithm
The purpose of this section is to present a transfor-
mation algorithm which takes as input any PRCG
G and generates as output a 1-PRCG G1, such
that L = L(G) ⊆ L1 = L(G1).
Let G = (N, T, V, P, S) be the initial PRCG
and let G1 = (N1, T1, V1, P1, S1) be the gen-
erated 1-PRCG. Informally, to each p-ary predi-
cate name A we shall associate p unary predicate
names A_i, each corresponding to one argument of
A. We define

    N1 = ∪_{A ∈ N} { A_i | 1 ≤ i ≤ arity(A) }

and T1 = T, V1 = V, S1 = S_1, and the set of
clauses P1 is generated in the way described be-
low.
We say that two strings α and β, on some al-
phabet, share a common substring, and we write
c(α, β), iff either α, or β, or both are empty or, if
α = x u y and β = z u t, we have |u| ≥ 1.
For any clause c = ψ0 → ψ1 ... ψj ... ψm
in P, such that ψj = A^j(α^j_1, ..., α^j_{m_j}),
0 ≤ j ≤ m, m_j = arity(A^j), we generate the set of
m_0 clauses Γ_c = {c_1, ..., c_{m_0}} in the following
way. The clause c_k, 1 ≤ k ≤ m_0, has the form
A^0_k(α^0_k) → Φ_k where the RHS Φ_k is constructed
from the ψ_j's as follows. A predicate call A^j_i(α^j_i)
is in Φ_k iff the arguments α^j_i and α^0_k share a com-
mon substring (i.e., we have c(α^0_k, α^j_i)).
As an example, the following set of clauses,
in which X, Y and Z are variables and a and b
are terminal symbols, defines the 3-copy language
{ w w w | w ∈ {a, b}* }, which is not a CF language
[CFL] and even lies beyond the formal power of
TAGs.
    S(XYZ)        → A(X, Y, Z)
    A(aX, aY, aZ) → A(X, Y, Z)
    A(bX, bY, bZ) → A(X, Y, Z)
    A(ε, ε, ε)    → ε
This PRCG is transformed by the above algorithm
into a 1-PRCG whose clause set is
    S_1(XYZ) → A_1(X) A_2(Y) A_3(Z)
    A_1(aX)  → A_1(X)
    A_2(aY)  → A_2(Y)
    A_3(aZ)  → A_3(Z)
    A_1(bX)  → A_1(X)
    A_2(bY)  → A_2(Y)
    A_3(bZ)  → A_3(Z)
    A_1(ε)   → ε
    A_2(ε)   → ε
    A_3(ε)   → ε
It is not difficult to show that L ⊆ L1.
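The clause-generation step can be sketched in a few lines. In the following Python fragment (our own illustrative encoding: a predicate is a (name, arguments) pair, an argument a tuple of symbols; none of this comes from the authors' implementation), transform builds the guiding 1-PRCG, and share implements the common-substring test c, for which sharing at least one symbol suffices when arguments are symbol strings.

```python
def share(alpha, beta):
    """c(alpha, beta): true iff either string is empty or the two
    share a common substring of length >= 1 (equivalently, at least
    one common symbol)."""
    return not alpha or not beta or bool(set(alpha) & set(beta))

def transform(clauses):
    """PRCG -> 1-PRCG: each argument of each LHS predicate yields one
    unary clause; an RHS call A_i(arg) survives iff arg shares a
    common substring with the LHS argument at hand."""
    out = []
    for (name0, args0), rhs in clauses:
        for k, lhs_arg in enumerate(args0, 1):
            phi = [("%s_%d" % (name, i), (arg,))
                   for name, args in rhs
                   for i, arg in enumerate(args, 1)
                   if share(arg, lhs_arg)]
            out.append((("%s_%d" % (name0, k), (lhs_arg,)), phi))
    return out

# The 3-copy grammar: S(XYZ) -> A(X,Y,Z), A(aX,aY,aZ) -> A(X,Y,Z), ...
copy3 = [
    (("S", (("X", "Y", "Z"),)), [("A", (("X",), ("Y",), ("Z",)))]),
    (("A", (("a", "X"), ("a", "Y"), ("a", "Z"))),
     [("A", (("X",), ("Y",), ("Z",)))]),
    (("A", (("b", "X"), ("b", "Y"), ("b", "Z"))),
     [("A", (("X",), ("Y",), ("Z",)))]),
    (("A", ((), (), ())), []),   # A(eps, eps, eps) -> eps
]
```

Applied to copy3, transform returns exactly the ten clauses of the 1-PRCG listed in this section.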
This transformation algorithm works for any
PRCG. Moreover, if we restrict ourselves to the
class of PRCGs that are non-combinatorial and
non-bottom-up-erasing, it is easy to check that the
constructed 1-PRCG is also non-combinatorial
and non-bottom-up-erasing. It has been shown in
(Boullier, 2000a) that non-combinatorial and non-
bottom-up-erasing 1-RCLs can be parsed in cubic
time after a simple grammatical transformation.
In order to reach this cubic parse time, we as-
sume in the sequel that any RCG at hand is a non-
combinatorial and non-bottom-up-erasing PRCG.
However, even if this cubic time transformation
is not performed, we can show that the (theoreti-
cal) throughput of the parser for L1 cannot be less
than the throughput of the parser for L. In other
words, if we consider the parsers for L and L1 and
if we recall the end of Section 2, it is easy to show
that the degrees, say d and d1, of their polynomial
parse times are such that d1 ≤ d. The equality is
reached iff the maximum value d in G is produced
by a unary clause, which is kept unchanged by our
transformation algorithm.
The starting RCG G is called the initial gram-
mar and it defines the initial language L. The cor-
responding 1-PRCG G1 constructed by our trans-
formation algorithm is called the guiding gram-
mar and its language L1 is the guiding language.
If the algorithm to reach a cubic parse time is ap-
plied to the guiding grammar G1, we get an equiv-
alent n^3-guiding grammar (it also defines L1).
The various RCL parsers associated with these
grammars are respectively called initial parser,
guiding parser and n^3-guiding parser. The output
of a (n^3-)guiding parser is called a (n^3-)guiding
structure. The term guide is used for the process
which, with the help of a guiding structure, an-
swers ‘yes’ or ‘no’ to any question asked by the
guided process. In our case, the guided processes
are the RCL parsers for L, called guided parser
and n^3-guided parser.
4 Parsing with a Guide
Parsing with a guide proceeds as follows. The
guided process is split in two phases. First, the
source text is parsed by the guiding parser which
builds the guiding structure. Of course, if the
source text is parsed by thea5 a187 -guiding parser, the
a5 a187 -guiding structure is then translated into a guid-
ing structure, as if the source text had been parsed
by the guiding parser. Second, the guided parser
proper is launched, asking the guide to help (some
of) its nondeterministic choices.
Our current implementation of RCL parsers is
like a (cached) recursive descent parser in which
the nonterminal calls are replaced by instantiated
predicate calls. Assume that, at some place in an
RCL parser, A(ρ1, ρ2) is an instantiated predicate
call. In a corresponding guided parser, this call
can be guarded by a call to a guide, with A, ρ1
and ρ2 as parameters, that will check that both
A_1(ρ1) and A_2(ρ2) are instantiated predicates in
the guiding structure. Of course, various actions
in a guided parser can be guarded by guide calls,
but the guide can only answer questions that, in
some sense, have been registered into the guiding
structure. The guiding structure may thus con-
tain more or less complete information, leading
to several guide levels.
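A guide of this kind reduces to a membership test. The sketch below is our own minimal illustration; the data layout and names are assumptions, not the actual implementation: the guiding structure is modeled as a set of instantiated unary predicates output by the guiding parser, and the guard projects an instantiated call of the initial grammar onto it.

```python
class Guide:
    """A guide over a guiding structure holding the instantiated
    unary predicates A_i(rho_i) derived by the guiding parser."""

    def __init__(self, facts):
        self.facts = facts  # set of (unary predicate name, range) pairs

    def allows(self, name, ranges):
        """Guard for an instantiated call A(rho_1, ..., rho_p):
        answer 'yes' iff every projection A_i(rho_i) is in the
        guiding structure."""
        return all(("%s_%d" % (name, i), r) in self.facts
                   for i, r in enumerate(ranges, 1))
```

For the call A(ρ1, ρ2), guide.allows("A", [ρ1, ρ2]) answers 'yes' iff both A_1(ρ1) and A_2(ρ2) were derived by the guiding parser.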
For example, one of the simplest levels one
may think of, is to only register in the guiding
structure the (numbers of the) clauses of the guid-
ing grammar for which at least one instantiation
occurs in their parse forest. In such a case, dur-
ing the second phase, when the guided parser tries
to instantiate some clause c of G, it can call the
guide to know whether or not c can be valid. The
guide will answer ‘yes’ iff the guiding structure
contains the set Γ_c of clauses in G1 generated
from c by the transformation algorithm.
At the opposite extreme, we can register in the
guiding structure the full parse forest output by the
guiding parser. This parse forest is, for a given
sentence, the set of all instantiated clauses of the
guiding grammar that are used in all complete
derivations. During the second phase, when the
guided parser has instantiated some clause c of
the initial grammar, it builds the set of the cor-
responding instantiations of all clauses in Γ_c and
asks the guide to check that this set is a subset of
the guiding structure.
During our experiments, several guide levels
have been considered; however, the results in Sec-
tion 5 are reported with a restricted guiding struc-
ture which only contains the set of all (valid)
clause numbers and, for each clause, the set of its
LHS instantiated predicates.
The goal of a guided parser is to speed up a
parsing process. However, it is clear that the the-
oretical parse time complexity is not improved by
this technique and even that some practical parse
time will get worse. For example, this is the case
for the above 3-copy language. In that case, it
is not difficult to check that the guiding language
L1 is T*, and that the guide will always answer
‘yes’ to any question asked by the guided parser.
Thus the time taken by the guiding parser and by
the guide itself is simply wasted. Of course, a
guide that always answers ‘yes’ is not a good one,
and we should note that this case may happen
even when the guiding language is not T*. Thus,
from a practical point of view the question is sim-
ply “will the time spent in the guiding parser and
in the guide be at least recouped by the guided
parser?” Clearly, in the general case, no definite
answer can be brought to such a question, since
the total parse time depends not only on the
input grammar, the (quality of the) guiding gram-
mar (e.g., is L1 not too “large” a superset of L?)
and the guide level, but also on the parsed
sentence itself. Thus, in our opinion, only
the results of practical experiments can globally
decide whether using a guided parser is worthwhile.
Another potential problem may come from the
size of the guiding grammar itself. In partic-
ular, experiments with regular approximation of
CFLs related in (Nederhof, 2000) show that most
reported methods are not practical for large CF
grammars, because of the high costs of obtaining
the minimal DFSA.
In our case, it can easily be shown that the in-
crease in size of the guiding grammars is bounded
by a constant factor and thus seems a priori ac-
ceptable from a practical point of view.
The next section presents the practical exper-
iments we have performed to validate our ap-
proach.
5 Experiments with an English
Grammar
In order to compare a (normal) RCL parser and its
guided versions, we looked for an existing wide-
coverage grammar. We chose the grammar for
English designed for the XTAG system (XTAG,
1995), because it is both freely available and
seems rather mature. Of course, that grammar
uses the TAG formalism.1 Thus, we first had
to transform that English TAG into an equiva-
lent RCG. To perform this task, we implemented
the algorithm described in (Boullier, 1998) (see
also (Boullier, 1999)), which allows one to transform
any TAG into an equivalent simple PRCG.2
However, Boullier’s algorithm was designed
for pure TAGs, while the structures used in
the XTAG system are not trees, but rather tree
schemata, grouped into linguistically pertinent
tree families, which have to be instantiated by in-
flected forms for each given input sentence. That
important difference stems from the radical dif-
ference in approaches between “classical” TAG
parsing and “usual” RCL parsing. In the former,
through lexicalization, the input sentence allows
the selection of tree schemata which are then in-
stantiated on the corresponding inflected forms,
thus the TAG is not really part of the parser,
while in the latter the (non-lexicalized) grammar
is precompiled into an optimized automaton.3
Since the instantiation of all tree schemata
1We assume here that the reader has at least some cursory
notions of this formalism. An introduction to TAG can be
found in (Joshi, 1987).
2We first stripped the original TAG of its feature struc-
tures in order to get a pure featureless TAG.
3The advantages of this approach might be balanced by
the size of the automaton, but we shall see later on that it can
be made to stay reasonable, at least in the case at hand.
by the complete dictionary is impracticable, we
designed a two-step process. For example, from
the sentence “George loved himself .”, a lexer
first produces the sequence “George {n-n nxn-
n nn-n} loved {tnx0vnx1-v tnx0vnx1s2-
v tnx0vs1-v} himself {tnx0n1-n nxn-n}
. {spu-punct spus-punct}”, and, in a second
phase, this sequence is used as actual input to
our parsers. The names between braces are
pre-terminals. We assume that each terminal
leaf l of every elementary tree schema τ has
been labeled by a pre-terminal name of the form
t = f-c-i where f is the family of τ, c is the
category of l (verb, noun, . . . ) and i is an optional
occurrence index.4
Thus, the association George “{n-n nxn-n
nn-n}” means that the inflected form “George”
is a noun (suffix -n) that can occur in all trees of
the “n”, “nxn” or “nn” families (everywhere a ter-
minal leaf of category noun occurs).
Since, in this two-step process, the inputs are
not sequences of terminal symbols but instead
simple DAG structures, as the one depicted in
Figure 1, we have accordingly implemented in
our RCG system the ability to handle inputs that
are simple DAGs of tokens.5
In Section 3, we have seen that the language
L1 defined by a guiding grammar G1 for some
RCG G is a superset of L, the language defined
by G. If G is a simple PRCG, G1 is a simple
1-PRCG, and thus L1 is a CFL (see (Boullier,
2000a)). In other words, in the case of TAGs, our
transformation algorithm approximates the initial
tree-adjoining language by a CFL, and the steps
of CF parsing performed by the guiding parser
can well be understood in terms of TAG parsing.
The original algorithm in (Boullier, 1998) per-
forms a one-to-one mapping between elementary
trees and clauses, initial trees generate simple
unary clauses while auxiliary trees generate sim-
ple binary clauses. Our transformation algorithm
leaves unary clauses unchanged (simple unary
clauses are in fact CF productions). For binary
A-clauses, our algorithm generates two clauses,
4The usage of f as a component of t is due to the fact
that in the XTAG syntactic dictionary, lemmas are associ-
ated with tree family names.
5This is done rather easily for linear RCGs. The process-
ing of non-linear RCGs with lattices as input is outside the
scope of this paper.
[Figure: the sentence “George loved himself .” represented as a
simple DAG over positions 0–4, where each word edge is paralleled
by edges labeled with its possible pre-terminals (n-n, nxn-n, nn-n;
tnx0vnx1-v, tnx0vnx1s2-v, tnx0vs1-v; tnx0n1-n, nxn-n; spu-punct,
spus-punct).]
Figure 1: Actual source text as a simple DAG structure
an A_1-clause which corresponds to the part of the
auxiliary tree to the left of the spine and an A_2-
clause for the part to the right of the spine. Both
are CF clauses that the guiding parser calls inde-
pendently. Therefore, for a TAG, the associated
guiding parser performs substitutions as would a
TAG parser, while each adjunction is replaced by
two independent substitutions, such that there is
no guarantee that any couple of an A_1-tree and an
A_2-tree can glue together to form a valid (adjoin-
able) A-tree. In fact, guiding parsers perform some
kind of (deep-grammar based) shallow parsing.
For our experiments, we first transformed the
English XTAG into an equivalent simple PRCG:
the initial grammar G. Then, using the algorithms
of Section 3, we built, from G, the correspond-
ing guiding grammar G1, and from G1 the n^3-
guiding grammar. Table 1 gives some information
on these grammars.6
RCG        initial   guiding   n^3-guiding
|N|             22        33         4 204
|T|            476       476           476
|P|          1 144     1 696         5 554
|G|         15 578    15 618        17 722
degree          27        27             3

Table 1: RCGs G = (N, T, V, P, S) facts
For our experiments, we have used a test suite
distributed with the XTAG system. It contains 31
sentences ranging from 4 to 17 words, with an
average length of 8. All measures have been per-
formed on a 800 MHz Pentium III with 640 MB
of memory, running Linux. All parsers have been
6Note that the worst-case parse time for both the initial
and the guiding parsers is O(n^27). As explained in Sec-
tion 3, this identical polynomial degree d = d1 = 27 comes
from an untransformed unary clause which itself is the result
of the translation of an initial tree.
compiled with gcc without any optimization flag.
We have first compared the total time taken to
produce the guiding structures, both by the n^3-
guiding parser and by the guiding parser (see Ta-
ble 2). On this sample set, the n^27-guiding parser
is twice as fast as the n^3-guiding parser. We
guess that, on such short sentences, the benefit
yielded by the lowest degree has not yet offset
the time needed to handle a much greater num-
ber of clauses. To validate this guess, we have
tried longer sentences. With a 35-word sentence
we have noted that the n^3-guiding parser is almost
six times faster than the n^27-guiding parser, and
besides we have verified that the break-even
point seems to occur for sentences of around 16–
20 words.

parser          guiding   n^3-guiding
sample set        0.990         1.870
35-word sent.    30.560         5.210

Table 2: Guiding parsers times (sec)
parser        load module
initial             3.063
guided              8.374
n^3-guided         14.530

Table 3: RCL parser sizes (MB)

parser        sample set   35-word sent.
initial            5.810       3 679.570
guided             1.580          63.570
n^3-guided         2.440          49.150
XTAG           4 282.870        > 5 days

Table 4: Parse times (sec)
The sizes of these RCL parsers (load modules)
are in Table 3 while their parse times are in Ta-
ble 4.7 We have also noted in the last line, for
reference, the times of the latest XTAG parser
(February 2001),8 on our sample set and on the
35-word sentence.9
6 Guiding Parser as Tree Filter
In (Sarkar, 2000), there is some evidence to in-
dicate that in LTAG parsing the number of trees
selected by the words in a sentence (a measure
of the syntactic lexical ambiguity of the sentence)
is a better predictor of complexity than the num-
ber of words in the sentence. Thus, the accuracy
of the tree selection process may be crucial for
parsing speeds. In this section, we wish to briefly
compare the tree selections performed, on the one
hand by the words in a sentence and, on the other
hand, by a guiding parser. Such filters can be
used, for example, as pre-processors in classical
[L]TAG parsing. With a guiding parser as tree fil-
ter, a tree (i.e., a clause) is kept, not because it has
been selected by a word in the input sentence, but
because an instantiation of that clause belongs to
the guiding structure.
The recall of both filters is 100%, since all per-
tinent trees are necessarily selected by the input
words and present in the guiding structure. On
the other hand, for the tree selection by the words
in a sentence, the precision measured on our sam-
7The time taken by the lexer phase is linear in the length
of the input sentences and is negligible.
8It implements a chart-based head-corner parsing algo-
rithm for lexicalized TAGs, see (Sarkar, 2000). This parser
can be run in two phases, the second one being devoted to
the evaluation of the features structures on the parse forest
built during the first phase. Of course, the times reported
in that paper are only those of the first pass. Moreover, the
various parameters have been set so that the resulting parse
trees and ours are similar. Almost half the sample sentences
give identical results in both that system and ours. For the
other half, it seems that the differences come from the way
the co-anchoring problem is handled in both systems. To be
fair, it must be noted that the time taken to output a complete
parse forest is not included in the parse times reported for our
parsers. Outputting those parse forests, similar to Sarkar’s
ones, takes one second on the whole sample set and 80 sec-
onds for the 35-word sentence (there are more than 3 600 000
instantiated clauses in the parse forest of that last sentence).
9Considering the last line of Table 2, one can notice that
the times taken by the guided phases of the guided parser
and the n^3-guided parser are noticeably different, when they
should be the same. This anomaly, not present on the sample
set, is currently under investigation.
ple set is 15.6% on the average, while it reaches
100% for the guiding parser (i.e., each and every
selected tree is in the final parse forest).
7 Conclusion
The experiment reported in this paper shows that
some kind of guiding technique has to be con-
sidered when one wants to increase parsing effi-
ciency. With a wide coverage English TAG, on
a small sample set of short sentences, a guided
parser is on the average three times faster than
its non-guided counterpart, while, for longer sen-
tences, more than one order of magnitude may be
expected.
However, the guided parser speed is very sensi-
tive to the level of the guide, which must be cho-
sen very carefully since potential benefits may be
overcome by the time taken by the guiding struc-
ture book-keeping procedures.
Of course, the filtering principle presented in this
paper is not novel (see for example (Lakshmanan
and Yim, 1991) for deductive databases) but, if
we consider the various attempts of guided pars-
ing reported in the literature, ours is one of the
very few examples in which important savings
are noted. One reason for that seems to be the
extreme simplicity of the interface between the
guiding and the guided process: the guide only
performs a direct access into the guiding struc-
ture. Moreover, this guiding structure is (part
of) the usual parse forest output by the guiding
parser, without any transduction (see for example
in (Nederhof, 1998) how a FSA can guide a CF
parser).
As already noted by many authors (see for ex-
ample (Carroll, 1994)), the choice of a (parsing)
algorithm, as far as its throughput is concerned,
cannot rely only on its theoretical complexity
but must also take into account practical experi-
ments. Complexity analysis gives worst-case up-
per bounds which may well not be reached, and
which imply constants that may have a prepon-
derant effect on the typical size ranges of the ap-
plication.
We have also noted that guiding parsers can
be used in classical TAG parsers, as efficient and
(very) accurate tree selectors. More generally, we
are currently investigating the possibility of using
guiding parsers as shallow parsers.
The above results also show that (guided) RCL
parsing is a valuable alternative to classical (lex-
icalized) TAG parsers since we have exhibited
parse time savings of several orders of magnitude
over the most recent XTAG parser. These savings
even make it possible to consider the parsing of
medium-size sentences with the English XTAG.
The global parse time for TAGs might also
be further improved using the transformation de-
scribed in (Boullier, 1999) which, starting from
any TAG, constructs an equivalent RCG that can
be parsed in O(n^6). However, this improvement
is not definite, since, on typical input sentences,
the increase in size of the resulting grammar may
well ruin the expected practical benefits, as in
the case of the n^3-guiding parser processing short
sentences.
We must also note that a (guided) parser may
also be used as a guide for a unification-based
parser in which feature terms are evaluated (see
the experiment related in (Barthélemy et al.,
2000)).
Although the related practical experiments
have been conducted on a TAG, this guide tech-
nique is not dedicated to TAGs, and the speed of
all PRCL parsers may be thus increased. This per-
tains in particular to the parsing of all languages
whose grammars can be translated into equivalent
PRCGs — MC-TAGs, LCFRS, . . .
References
F. Barthélemy, P. Boullier, Ph. Deschamp, and É. de la
Clergerie. 2000. Shared forests can guide parsing.
In Proceedings of the Second Workshop on Tabula-
tion in Parsing and Deduction (TAPD’2000), Uni-
versity of Vigo, Spain, September.
P. Boullier. 1998. A generalization of mildly context-
sensitive formalisms. In Proceedings of the Fourth
International Workshop on Tree Adjoining Gram-
mars and Related Frameworks (TAG+4), pages 17–
20, University of Pennsylvania, Philadelphia, PA,
August.
P. Boullier. 1999. On TAG parsing. In 6ème
conférence annuelle sur le Traitement Au-
tomatique des Langues Naturelles (TALN’99),
pages 75–84, Cargèse, Corse, France, July.
See also Research Report No. 3668,
http://www.inria.fr/RRRT/RR-
3668.html, INRIA-Rocquencourt, France, Apr.
1999, 39 pages.
P. Boullier. 2000a. A cubic time extension of context-
free grammars. Grammars, 3(2/3):111–131.
P. Boullier. 2000b. Range concatenation grammars.
In Proceedings of the Sixth International Workshop
on Parsing Technologies (IWPT 2000), pages 53–
64, Trento, Italy, February.
John Carroll. 1994. Relating complexity to practical
performance in parsing with wide-coverage unifi-
cation grammars. In Proceedings of the 32nd An-
nual Meeting of the Association for Computational
Linguistics (ACL’94), pages 287–294, New Mexico
State University at Las Cruces, New Mexico, June.
A. K. Joshi. 1987. An introduction to tree adjoining
grammars. In A. Manaster-Ramer, editor, Math-
ematics of Language, pages 87–114. John Ben-
jamins, Amsterdam.
M. Kay. 2000. Guides and oracles for linear-time
parsing. In Proceedings of the Sixth International
Workshop on Parsing Technologies (IWPT 2000),
pages 6–9, Trento, Italy, February.
V.S. Lakshmanan and C.H. Yim. 1991. Can filters
do magic for deductive databases? In 3rd UK
Annual Conference on Logic Programming, pages
174–189, Edinburgh, April. Springer Verlag.
M.-J. Nederhof. 1998. Context-free parsing through
regular approximation. In Proceedings of the Inter-
national Workshop on Finite State Methods in Nat-
ural Language Processing, Ankara, Turkey, June–
July.
M.-J. Nederhof. 2000. Practical experiments with
regular approximation of context-free languages.
Computational Linguistics, 26(1):17–44.
A. Sarkar. 2000. Practical experiments in parsing
using tree adjoining grammars. In Proceedings of
the Fifth International Workshop on Tree Adjoin-
ing Grammars and Related Formalisms (TAG+5),
pages 193–198, University of Paris 7, Jussieu, Paris,
France, May.
The XTAG Research Group. 1995. A lexicalized tree
adjoining grammar for English. Technical Report
IRCS 95-03, Institute for Research in Cognitive
Science, University of Pennsylvania, Philadelphia,
PA, USA, March.
