<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1007">
  <Title>Guided Parsing of Range Concatenation Languages</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Positive Range Concatenation
Grammars
</SectionTitle>
    <Paragraph position="0"> This section presents only the basics of RCGs; more details can be found in (Boullier, 2000b).</Paragraph>
    <Paragraph position="1"> A positive range concatenation grammar [PRCG] $G = (N, T, V, P, S)$ is a 5-tuple where $N$ is a finite set of nonterminal symbols (also called predicate names), $T$ and $V$ are finite, disjoint sets of terminal symbols and variable symbols respectively, $S \in N$ is the start predicate name, and $P$ is a finite set of clauses</Paragraph>
    <Paragraph position="3"> where $m \ge 0$ and each of $\psi_0, \psi_1, \ldots, \psi_m$ is a predicate of the form</Paragraph>
    <Paragraph position="5"> Each occurrence of a predicate in the LHS (resp. RHS) of a clause is a predicate definition (resp. call). Clauses which define predicate name $A$ are called $A$-clauses. Each predicate name $A \in N$ has a fixed arity whose value is $\mathrm{arity}(A)$. By definition $\mathrm{arity}(S) = 1$. The arity of an $A$-clause is $\mathrm{arity}(A)$, and the arity $k$ of a grammar (we have a $k$-PRCG) is the maximum arity of its clauses. The size of a clause</Paragraph>
    <Paragraph position="7"> For a given string $w = a_1 \ldots a_n \in T^*$, a pair of integers $(i, j)$ s.t. $0 \le i \le j \le n$ is called a range, and is denoted $\langle i..j \rangle_w$: $i$ is its lower bound, $j$ is its upper bound and $j - i$ is its size. For a given $w$, the set of all ranges is noted $\mathcal{R}_w$. In fact, $\langle i..j \rangle_w$ denotes the occurrence of the string $a_{i+1} \ldots a_j$ in $w$.</Paragraph>
    <Paragraph position="9"> Two ranges $\langle i..j \rangle_w$ and $\langle k..l \rangle_w$ can be concatenated iff the two bounds $j$ and $k$ are equal; the result is the range $\langle i..l \rangle_w$. Variable occurrences or more generally strings in $(T \cup V)^*$ can be instantiated to ranges. However, an occurrence of the terminal $t$ can be instantiated to the range $\langle j-1..j \rangle_w$ iff $t = a_j$.</Paragraph>
    <Paragraph position="11"> In a clause, several occurrences of the same terminal may well be instantiated to different ranges, while several occurrences of the same variable can only be instantiated to the same range. Of course, the concatenation on strings matches the concatenation on ranges.</Paragraph>
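    <Paragraph> The range operations described above can be sketched as follows. This is a minimal illustration, not the implementation used in our system: the names Range, make_range, concat, text and terminal_matches are hypothetical, and input strings are plain Python strings indexed from 0.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    """A range (lower, upper) over an input string w, 0 ≤ lower ≤ upper ≤ len(w)."""
    lower: int
    upper: int

    @property
    def size(self) -> int:
        return self.upper - self.lower


def make_range(w: str, i: int, j: int) -> Range:
    # Reject pairs of integers that are not ranges of w.
    if not (i >= 0 and j >= i and len(w) >= j):
        raise ValueError("not a range of w")
    return Range(i, j)


def concat(r1: Range, r2: Range) -> Optional[Range]:
    # Two ranges concatenate only when r1's upper bound equals r2's lower bound.
    if r1.upper != r2.lower:
        return None
    return Range(r1.lower, r2.upper)


def text(w: str, r: Range) -> str:
    # The substring occurrence denoted by the range, i.e. w[lower:upper].
    return w[r.lower:r.upper]


def terminal_matches(w: str, t: str, r: Range) -> bool:
    # A terminal can only be instantiated to a size-1 range holding that terminal.
    return r.size == 1 and w[r.lower] == t
```

With w = "abab", concatenating the ranges (0, 2) and (2, 4) yields (0, 4), whereas the ranges (2, 4) and (0, 2), taken in that order, cannot be concatenated.</Paragraph>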
    <Paragraph position="12"> We say that $A(\rho_1, \ldots, \rho_p)$ is an instantiation of the predicate $A(\alpha_1, \ldots, \alpha_p)$ iff $\rho_i \in \mathcal{R}_w$ [...] An input string $w \in T^*$, $|w| = n$, is a sentence iff the empty string (of instantiated predicates) can be derived from $S(\langle 0..n \rangle_w)$, the instantiation of the start predicate on the whole source text. Such a sequence of instantiated predicates is called a complete derivation. $\mathcal{L}(G)$, the PRCL defined by a PRCG $G$, is the set of all its sentences.</Paragraph>
    <Paragraph position="13"> For a given sentence $w$, as in the context-free [CF] case, a single complete derivation can be represented by a parse tree and the (unbounded) set of complete derivations by a finite structure, the parse forest. All possible derivation strategies (i.e., top-down, bottom-up, ...) are encompassed within both parse trees and parse forests.</Paragraph>
    <Paragraph position="14"> A clause is:
      • combinatorial if at least one argument of its RHS predicates does not consist of a single variable;
      • bottom-up erasing (resp. top-down erasing) if there is at least one variable occurring in its RHS (resp. LHS) which does not appear in its LHS (resp. RHS);
      • erasing if there exists a variable appearing only in its LHS or only in its RHS;
      • linear if none of its variables occurs twice in its LHS or twice in its RHS;
      • simple if it is non-combinatorial, non-erasing and linear.</Paragraph>
    <Paragraph position="15"> These definitions extend naturally from clause to set of clauses (i.e., grammar).</Paragraph>
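    <Paragraph> These clause properties can be checked mechanically. The sketch below assumes a hypothetical encoding that is not part of our system: a predicate is a pair (name, args), each argument is a tuple of symbols, and variables are written in upper case while terminals are written in lower case.

```python
from collections import Counter


def is_var(sym):
    # Assumption of this sketch: upper-case symbols are variables.
    return sym.isupper()


def var_occurrences(predicates):
    # Count how often each variable occurs in a list of predicates.
    return Counter(
        s for _, args in predicates for arg in args for s in arg if is_var(s)
    )


def classify(lhs, rhs):
    """Return the properties of the clause lhs -> rhs as a dict of booleans."""
    lhs_vars = var_occurrences([lhs])
    rhs_vars = var_occurrences(rhs)
    # Combinatorial: some RHS argument is not exactly one variable.
    combinatorial = any(
        not (len(arg) == 1 and is_var(arg[0])) for _, args in rhs for arg in args
    )
    bottom_up_erasing = any(v not in lhs_vars for v in rhs_vars)
    top_down_erasing = any(v not in rhs_vars for v in lhs_vars)
    erasing = bottom_up_erasing or top_down_erasing
    # Linear: no variable occurs twice in the LHS or twice in the RHS.
    linear = all(n == 1 for n in lhs_vars.values()) and all(
        n == 1 for n in rhs_vars.values()
    )
    return {
        "combinatorial": combinatorial,
        "bottom_up_erasing": bottom_up_erasing,
        "top_down_erasing": top_down_erasing,
        "erasing": erasing,
        "linear": linear,
        "simple": not combinatorial and not erasing and linear,
    }
```

For instance, a clause of the shape A(XY) → B(X) C(Y) is classified as simple, while A(X) → B(Xa) is combinatorial.</Paragraph>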
    <Paragraph position="16"> In this paper we will not consider negative RCGs, since the guide construction algorithm presented in Section 3 is not valid for this class.</Paragraph>
    <Paragraph position="17"> Thus, in the sequel, we shall assume that RCGs are PRCGs.</Paragraph>
    <Paragraph position="18"> (Boullier, 2000b) presents a parsing algorithm which, for any RCG $G$ and any input string of length $n$, produces a parse forest in</Paragraph>
    <Paragraph position="20"> $O(|G|\,n^d)$ time. The exponent $d$, called the degree of $G$, is the maximum number of free (independent) bounds in a clause. For a non-bottom-up-erasing RCG, $d$ is less than or equal to the maximum value, for all clauses, of the sum $p_c + v_c$ where, for a clause $c$, $p_c$ is its arity and $v_c$ is the number of (different) variables in its LHS predicate.</Paragraph>
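    <Paragraph> The bound on the degree can be computed directly from the clauses. The following sketch assumes a hypothetical encoding (a clause is a pair (lhs, rhs), a predicate is a pair (name, args), upper-case symbols are variables) and is only meaningful for non-bottom-up-erasing grammars.

```python
def degree_bound(clauses):
    """Max over clauses of (arity + number of distinct LHS variables)."""
    best = 0
    for (_name, args), _rhs in clauses:
        arity = len(args)
        # Distinct variables of the LHS predicate (upper-case by assumption).
        lhs_vars = {s for arg in args for s in arg if s.isupper()}
        best = max(best, arity + len(lhs_vars))
    return best
```

A unary clause of the shape A(XY) → B(X) C(Y), which is a CF production, gives the bound 1 + 2 = 3, and a binary clause whose two arguments each hold two variables gives 2 + 4 = 6.</Paragraph>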
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 PRCG to 1-PRCG Transformation
Algorithm
</SectionTitle>
    <Paragraph position="0"> The purpose of this section is to present a transformation algorithm which takes as input any PRCG $G$ and generates as output a 1-PRCG $G_1$, such that $L = \mathcal{L}(G) \subseteq L_1 = \mathcal{L}(G_1)$. Let $G = (N, T, V, P, S)$ be the initial PRCG and let $G_1 = (N_1, T_1, V_1, P_1, S_1)$ be the generated 1-PRCG. Informally, to each $p$-ary predicate name $A$ we shall associate $p$ unary predicate</Paragraph>
    <Paragraph position="2"> clauses $P_1$ is generated in the way described below. We say that two strings $\alpha$ and $\beta$, on some alphabet, share a common substring, and we write [...] As an example, the following set of clauses, in which $X$, $Y$ and $Z$ are variables and $a$ and $b$ are terminal symbols, defines the 3-copy language</Paragraph>
    <Paragraph position="4"> It is not difficult to show that $L \subseteq L_1$.</Paragraph>
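    <Paragraph> As a point of reference, membership in the 3-copy language can be tested directly. The sketch below does not use RCG clauses at all; it assumes the alphabet {a, b} of the example, and the function name in_three_copy is hypothetical.

```python
def in_three_copy(w: str) -> bool:
    """True iff w = uuu for some string u over the alphabet {a, b}."""
    # Reject foreign symbols and lengths that are not a multiple of three.
    if set(w) - {"a", "b"} or len(w) % 3 != 0:
        return False
    k = len(w) // 3
    # The three consecutive thirds must be identical.
    return w[:k] == w[k:2 * k] == w[2 * k:]
```

Any superset of this language, such as a guiding language, accepts every string for which this test succeeds and possibly others.</Paragraph>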
    <Paragraph position="5"> This transformation algorithm works for any PRCG. Moreover, if we restrict ourselves to the class of PRCGs that are non-combinatorial and non-bottom-up-erasing, it is easy to check that the constructed 1-PRCG is also non-combinatorial and non-bottom-up-erasing. It has been shown in (Boullier, 2000a) that non-combinatorial and non-bottom-up-erasing 1-RCLs can be parsed in cubic time after a simple grammatical transformation.</Paragraph>
    <Paragraph position="6"> In order to reach this cubic parse time, we assume in the sequel that any RCG at hand is a non-combinatorial and non-bottom-up-erasing PRCG.</Paragraph>
    <Paragraph position="7"> However, even if this cubic time transformation is not performed, we can show that the (theoretical) throughput of the parser for $L_1$ cannot be less than the throughput of the parser for $L$. In other words, if we consider the parsers for $L$ and $L_1$ and if we recall the end of Section 2, it is easy to show that the degrees, say $d$ and $d_1$, of their polynomial parse times are such that $d_1 \le d$. The equality is reached iff the maximum value $d$ in $G$ is produced by a unary clause which is kept unchanged by our transformation algorithm.</Paragraph>
    <Paragraph position="8"> The starting RCG $G$ is called the initial grammar and it defines the initial language $L$. The corresponding 1-PRCG $G_1$ constructed by our transformation algorithm is called the guiding grammar and its language $L_1$ is the guiding language. If the algorithm to reach a cubic parse time is applied to the guiding grammar $G_1$, we get an equivalent $n^3$-guiding grammar (it also defines $L_1$). The various RCL parsers associated with these grammars are respectively called initial parser, guiding parser and $n^3$-guiding parser. The output of a ($n^3$-) guiding parser is called a ($n^3$-) guiding structure. The term guide is used for the process which, with the help of a guiding structure, answers 'yes' or 'no' to any question asked by the guided process. In our case, the guided processes are the RCL parsers for $L$ called guided parser and $n^3$-guided parser.</Paragraph>
    <Paragraph position="9"> 4 Parsing with a Guide
      Parsing with a guide proceeds as follows. The guided process is split into two phases. First, the source text is parsed by the guiding parser, which builds the guiding structure. Of course, if the source text is parsed by the $n^3$-guiding parser, the $n^3$-guiding structure is then translated into a guiding structure, as if the source text had been parsed by the guiding parser. Second, the guided parser proper is launched, asking the guide to help with (some of) its nondeterministic choices.</Paragraph>
    <Paragraph position="10"> Our current implementation of RCL parsers is like a (cached) recursive descent parser in which the nonterminal calls are replaced by instantiated predicate calls. Assume that, at some place in an RCL parser, $A(\rho_1, \rho_2)$ is an instantiated predicate call. In a corresponding guided parser, this call can be guarded by a call to a guide, with $A$, $\rho_1$ and $\rho_2$ as parameters, that will check that both $A_1(\rho_1)$ and $A_2(\rho_2)$ occur in</Paragraph>
    <Paragraph position="12"> the guiding structure. Of course, various actions in a guided parser can be guarded by guide calls, but the guide can only answer questions that, in some sense, have been registered into the guiding structure. The guiding structure may thus contain more or less complete information, leading to several guide levels.</Paragraph>
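    <Paragraph> The guarded call can be pictured as follows. This is only a sketch under stated assumptions: the guiding structure is modelled as a set of (unary predicate name, range) pairs, the unary names are formed by appending "_1" and "_2" to the binary name, and the names make_guided_call and unguided_call are hypothetical. The cache stands for the caching of the recursive descent parser.

```python
from functools import lru_cache


def make_guided_call(guiding_structure, unguided_call):
    """Wrap a binary predicate call A(r1, r2) with a guide check.

    guiding_structure: set of (unary_name, range) pairs (hypothetical layout).
    unguided_call: function (name, r1, r2) -> bool doing the real work.
    """
    @lru_cache(maxsize=None)
    def guided(name, r1, r2):
        # Ask the guide first: both unary projections must be registered.
        if (name + "_1", r1) not in guiding_structure:
            return False
        if (name + "_2", r2) not in guiding_structure:
            return False
        # Only then pay for the real instantiated predicate call.
        return unguided_call(name, r1, r2)

    return guided
```

A call whose projections are not both registered is rejected without ever entering the guided parser proper, which is where the expected savings come from.</Paragraph>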
    <Paragraph position="13"> For example, one of the simplest levels one may think of is to register in the guiding structure only the (numbers of the) clauses of the guiding grammar for which at least one instantiation occurs in their parse forest. In such a case, during the second phase, when the guided parser tries to instantiate some clause $c$ of $G$, it can call the guide to know whether or not $c$ can be valid. The guide will answer 'yes' iff the guiding structure contains the set $\mathcal{P}_c$ of clauses in $G_1$ generated from $c$ by the transformation algorithm.</Paragraph>
    <Paragraph position="14"> At the opposite extreme, we can register in the guiding structure the full parse forest output by the guiding parser. This parse forest is, for a given sentence, the set of all instantiated clauses of the guiding grammar that are used in all complete derivations. During the second phase, when the guided parser has instantiated some clause $c$ of the initial grammar, it builds the set of the corresponding instantiations of all clauses in $\mathcal{P}_c$ and asks the guide to check that this set is a subset of the guiding structure.</Paragraph>
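    <Paragraph> Both guide levels reduce to subset tests. The sketch below is illustrative only: clause numbers, the mapping generated_from (from an initial clause to the guiding clauses generated from it) and the encoding of an instantiated clause as a (clause number, range) pair are assumptions of the example.

```python
def guide_clause_level(valid_guiding_clauses, generated_from, c):
    """Coarse level: 'yes' iff every guiding clause generated from c is
    registered as valid in the guiding structure."""
    return generated_from[c].issubset(valid_guiding_clauses)


def guide_forest_level(guiding_forest, instantiations):
    """Fine level: 'yes' iff all corresponding instantiated guiding clauses
    occur in the guiding parse forest."""
    return set(instantiations).issubset(guiding_forest)
```

The coarse level stores far less information but can only reject a clause as a whole, whereas the fine level can reject individual instantiations of a clause.</Paragraph>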
    <Paragraph position="15"> During our experiments, several guide levels were considered; however, the results in Section 5 are reported with a restricted guiding structure which only contains the set of all (valid) clause numbers and, for each clause, the set of its LHS instantiated predicates.</Paragraph>
    <Paragraph position="16"> The goal of a guided parser is to speed up a parsing process. However, it is clear that the theoretical parse time complexity is not improved by this technique and even that some practical parse times will get worse. For example, this is the case for the above 3-copy language. In that case, it is not difficult to check that the guiding language $L_1$ is $T^*$, and that the guide will always answer 'yes' to any question asked by the guided parser.</Paragraph>
    <Paragraph position="17"> Thus the time taken by the guiding parser and by the guide itself is simply wasted. Of course, a guide that always answers 'yes' is not a good one and we should note that this case may happen even when the guiding language is not $T^*$. Thus, from a practical point of view the question is simply &quot;will the time spent in the guiding parser and in the guide be at least recouped by the guided parser?&quot; Clearly, in the general case, no definite answer can be given to such a question, since the total parse time may depend not only on the input grammar, the (quality of the) guiding grammar (e.g., is $L_1$ not too &quot;large&quot; a superset of $L$), and the guide level, but also on the parsed sentence itself. Thus, in our opinion, only the results of practical experiments may globally decide whether using a guided parser is worthwhile.</Paragraph>
    <Paragraph position="18"> Another potential problem may come from the size of the guiding grammar itself. In particular, experiments with regular approximation of CFLs reported in (Nederhof, 2000) show that most reported methods are not practical for large CF grammars, because of the high costs of obtaining the minimal DFSA.</Paragraph>
    <Paragraph position="19"> In our case, it can easily be shown that the increase in size of the guiding grammars is bounded by a constant factor and thus seems a priori acceptable from a practical point of view.</Paragraph>
    <Paragraph position="20"> The next section describes the practical experiments we have performed to validate our approach.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments with an English
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Grammar
</SectionTitle>
      <Paragraph position="0"> In order to compare a (normal) RCL parser and its guided versions, we looked for an existing wide-coverage grammar. We chose the grammar for English designed for the XTAG system (XTAG, 1995), because it is freely available and seems rather mature. Of course, that grammar uses the TAG formalism.1 Thus, we first had to transform that English TAG into an equivalent RCG. To perform this task, we implemented the algorithm described in (Boullier, 1998) (see also (Boullier, 1999)), which makes it possible to transform any TAG into an equivalent simple PRCG.2 However, Boullier's algorithm was designed for pure TAGs, while the structures used in the XTAG system are not trees, but rather tree schemata, grouped into linguistically pertinent tree families, which have to be instantiated by inflected forms for each given input sentence. That important difference stems from the radical difference in approaches between &quot;classical&quot; TAG parsing and &quot;usual&quot; RCL parsing. In the former, through lexicalization, the input sentence allows the selection of tree schemata which are then instantiated on the corresponding inflected forms; thus the TAG is not really part of the parser. In the latter, the (non-lexicalized) grammar is pre-compiled into an optimized automaton.3 [3: [...] the size of the automaton, but we shall see later on that it can be made to stay reasonable, at least in the case at hand.] Since the instantiation of all tree schemata by the complete dictionary is impracticable, we designed a two-step process. For example, from the sentence &quot;George loved himself .&quot;, a lexer first produces the sequence &quot;George {n-n nxn-n nn-n} loved {tnx0vnx1-v tnx0vnx1s2v tnx0vs1-v} himself {tnx0n1-n nxn-n} . {spu-punct spus-punct}&quot;, and, in a second phase, this sequence is used as actual input to our parsers. The names between braces are pre-terminals.
We assume that each terminal leaf $l$ of every elementary tree schema $\tau$ has been labeled by a pre-terminal name of the form</Paragraph>
      <Paragraph position="2"> where $f$ is the family of $\tau$, $c$ is the category of $l$ (verb, noun, ...) and $i$ is an optional occurrence index.4 Thus, the association George &quot;{n-n nxn-n nn-n}&quot; means that the inflected form &quot;George&quot; is a noun (suffix -n) that can occur in all trees of the &quot;n&quot;, &quot;nxn&quot; or &quot;nn&quot; families (everywhere a terminal leaf of category noun occurs).</Paragraph>
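      <Paragraph> The first step of this two-step process can be sketched as a dictionary lookup. The tiny LEXICON below is hypothetical: it only mirrors the example sentence, whereas the real lexer relies on the complete XTAG dictionary, and the function names lex and render are illustrative.

```python
# Hypothetical mini-dictionary: inflected form -> pre-terminal names.
# The entries mirror the example sentence; a real lexicon would be far larger.
LEXICON = {
    "George": ["n-n", "nxn-n", "nn-n"],
    "loved": ["tnx0vnx1-v", "tnx0vnx1s2v", "tnx0vs1-v"],
    "himself": ["tnx0n1-n", "nxn-n"],
    ".": ["spu-punct", "spus-punct"],
}


def lex(sentence):
    """Pair each whitespace-separated token with its list of pre-terminals."""
    # Unknown forms raise KeyError in this sketch.
    return [(tok, LEXICON[tok]) for tok in sentence.split()]


def render(tokens):
    """Print the token sequence in the 'form {pre-terminals}' notation."""
    return " ".join(tok + " {" + " ".join(pts) + "}" for tok, pts in tokens)
```

Calling render(lex("George loved himself .")) rebuilds the sequence quoted above, which is then handed to the parsers as their actual input.</Paragraph>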
      <Paragraph position="3"> Since, in this two-step process, the inputs are not sequences of terminal symbols but instead simple DAG structures, such as the one depicted in Figure 1, we have accordingly implemented in our RCG system the ability to handle inputs that are simple DAGs of tokens.5 In Section 3, we have seen that the language $L_1$ defined by a guiding grammar $G_1$ for some RCG $G$ is a superset of $L$, the language defined by $G$. If $G$ is a simple PRCG, $G_1$ is a simple 1-PRCG, and thus $L_1$ is a CFL (see (Boullier, 2000a)). In other words, in the case of TAGs, our transformation algorithm approximates the initial tree-adjoining language by a CFL, and the steps of CF parsing performed by the guiding parser can well be understood in terms of TAG parsing.</Paragraph>
      <Paragraph position="4"> The original algorithm in (Boullier, 1998) performs a one-to-one mapping between elementary trees and clauses: initial trees generate simple unary clauses while auxiliary trees generate simple binary clauses. Our transformation algorithm leaves unary clauses unchanged (simple unary clauses are in fact CF productions). For binary [...] an $A_1$-clause which corresponds to the part of the auxiliary tree to the left of the spine and an $A_2$-clause for the part to the right of the spine. Both are CF clauses that the guiding parser calls independently. Therefore, for a TAG, the associated guiding parser performs substitutions as would a TAG parser, while each adjunction is replaced by two independent substitutions, such that there is no guarantee that any pair of an $A_1$-tree and an $A_2$-tree can glue together to form a valid (adjoinable) $A$-tree. In fact, guiding parsers perform some kind of (deep-grammar based) shallow parsing.</Paragraph>
      <Paragraph position="5"> For our experiments, we first transformed the English XTAG into an equivalent simple PRCG: the initial grammar $G$. Then, using the algorithms of Section 3, we built, from $G$, the corresponding guiding grammar $G_1$, and from $G_1$ the $n^3$-guiding grammar. Table 1 gives some information [...] For our experiments, we have used a test suite distributed with the XTAG system. It contains 31 sentences ranging from 4 to 17 words, with an average length of 8. All measures have been performed on an 800 MHz Pentium III with 640 MB of memory, running Linux. [6: Note that the worst-case parse time for both the initial and the guiding parsers is $O(n^d)$. As explained in Section 3, this identical polynomial degree $d = d_1 = [...]$ comes from an untransformed unary clause which itself is the result of the translation of an initial tree.]</Paragraph>
      <Paragraph position="6"> All parsers have been compiled with gcc without any optimization flag.</Paragraph>
      <Paragraph position="7"> We have first compared the total time taken to produce the guiding structures, both by the $n^3$-guiding parser and by the guiding parser (see Table 2). On this sample set, the (untransformed) guiding parser is twice as fast as the $n^3$-guiding parser. We guess that, on such short sentences, the benefit yielded by the lowest degree has not yet offset the time needed to handle a much greater number of clauses. To validate this guess, we have tried longer sentences. With a 35-word sentence we have noted that the $n^3$-guiding parser is almost six times faster than the (untransformed) guiding parser, and we have also verified that the break-even point seems to occur for sentences of around 16- [...] The sizes of these RCL parsers (load modules) are in Table 3 while their parse times are in Table 4.7 We have also noted in the last line, for reference, the times of the latest XTAG parser (February 2001),8 on our sample set and on the 35-word sentence.9
      6 Guiding Parser as Tree Filter
      In (Sarkar, 2000), there is some evidence to indicate that in LTAG parsing the number of trees selected by the words in a sentence (a measure of the syntactic lexical ambiguity of the sentence) is a better predictor of complexity than the number of words in the sentence. Thus, the accuracy of the tree selection process may be crucial for parsing speeds. In this section, we wish to briefly compare the tree selections performed, on the one hand, by the words in a sentence and, on the other hand, by a guiding parser. Such filters can be used, for example, as pre-processors in classical [L]TAG parsing. With a guiding parser as tree filter, a tree (i.e., a clause) is kept, not because it has been selected by a word in the input sentence, but because an instantiation of that clause belongs to the guiding structure.</Paragraph>
      <Paragraph position="8"> The recall of both filters is 100%, since all pertinent trees are necessarily selected by the input words and present in the guiding structure. On the other hand, for the tree selection by the words in a sentence, the precision measured on our sample set is 15.6% on the average, while it reaches 100% for the guiding parser (i.e., each and every selected tree is in the final parse forest).</Paragraph>
      <Paragraph position="9"> [8: [...]rithm for lexicalized TAGs, see (Sarkar, 2000). This parser can be run in two phases, the second one being devoted to the evaluation of the feature structures on the parse forest built during the first phase. Of course, the times reported in that paper are only those of the first pass. Moreover, the various parameters have been set so that the resulting parse trees and ours are similar. Almost half the sample sentences give identical results in both that system and ours. For the other half, it seems that the differences come from the way the co-anchoring problem is handled in both systems. To be fair, it must be noted that the time taken to output a complete parse forest is not included in the parse times reported for our parsers. Outputting those parse forests, similar to Sarkar's, takes one second on the whole sample set and 80 seconds for the 35-word sentence (there are more than 3 600 000 instantiated clauses in the parse forest of that last sentence).] [9: Considering the last line of Table 2, one can notice that the times taken by the guided phases of the guided parser and the $n^3$-guided parser are noticeably different, when they should be the same. This anomaly, not present on the sample set, is currently under investigation.]</Paragraph>
    </Section>
  </Section>
</Paper>