<?xml version="1.0" standalone="yes"?> <Paper uid="P89-1018"> <Title>The Structure of Shared Forests in Ambiguous Parsing</Title> <Section position="4" start_page="144" end_page="146" type="metho"> <SectionTitle> 3 Implementation and Experimental </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="144" end_page="146" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> The ideas presented above have been implemented in an experimental system called Tin (after the woodman of Oz).</Paragraph> <Paragraph position="1"> 10 This was noted by Sheil \[26\] and is implicit in his use of &quot;2-form&quot; grammars.</Paragraph> <Paragraph position="2"> The intent is to provide a uniform framework for the construction of, and experimentation with, chart parsers, somewhat as in systems like MCHART \[29\], but with a more systematic theoretical foundation. The kernel of the system is a virtual parsing machine with a stack and a set of primitive commands corresponding essentially to the operations of a practical Push-Down Transducer. These commands include for example: push (resp. pop) to push a symbol on the stack (resp. pop one), checkwindow to compare the look-ahead symbol(s) to some given symbol, checkstack to branch depending on the top of the stack, scan to read an input word, output to output a rule number (or a terminal symbol), goto for unconditional jumps, and a few others. However these commands are never used directly to program parsers. They are used as machine instructions for compilers that compile grammatical definitions into Tin code according to some parsing schema.</Paragraph> <Paragraph position="3"> A characteristic of these commands is that they may all be marked as non-deterministic. The intuitive interpretation is that there is a non-deterministic choice between a command thus marked and another command whose address in the virtual machine code is then specified. 
However execution of the virtual machine code is done by an all-paths interpreter that follows the dynamic programming strategy described in section 2.1 and appendix A.</Paragraph> <Paragraph position="4"> The Tin interpreter is used in two different ways: 1. to study the effectiveness for chart parsing of known parsing schemata designed for deterministic parsing.</Paragraph> <Paragraph position="5"> We have only considered formally defined parsing schemata, corresponding to established PDA construction techniques that we use to mechanically translate CF grammars into Tin code (e.g. LALR(1) and LALR(2) \[6\], weak precedence \[12\], LL(0) top-down (recursive descent), LR(0), LR(1) \[1\] ...).</Paragraph> <Paragraph position="6"> 2. to study the computational behavior of the generated code, and the optimization techniques that could be used on the Tin code -- and more generally chart parser code -- with respect to code size, execution speed and better sharing in the parse forest.</Paragraph> <Paragraph position="7"> Experimenting with several compilation schemata has shown that sophistication may have a negative effect on the efficiency of all-path parsing 11. Sophisticated PDT construction techniques tend to multiply the number of special cases, thereby increasing the code size of the chart parser. Sometimes it also prevents sharing of locally identical subcomputations because of differences in context analysis. This in turn may result in lesser sharing in the parse forest and sometimes longer computation, as in example SBBL in appendix C, but of course it does not change the set of parse-trees encoded in the forest 12. Experimentally, weak precedence gives slightly better sharing than LALR(1) parsing. 
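To make the all-paths interpretation of non-deterministic stack code concrete, here is a minimal sketch in Python. It is not the actual Tin instruction set: the instruction encoding, the `alt` field marking a non-deterministic choice, and the toy `ifstack` branch are our own invention, and the hand-written program below merely recognizes the language a^n b^n. The interpreter explores every branch with a breadth-first worklist and shares identical sub-computations, in the spirit of the dynamic programming strategy described above.

```python
from collections import deque

def run_all_paths(program, tokens):
    """Explore every computation path of a tiny non-deterministic PDT.

    Each instruction is a dict. An optional 'alt' field marks it as
    non-deterministic: either execute the command and fall through,
    or jump to address 'alt' without executing it.
    """
    accepted = False
    work = deque([(0, (), 0)])           # (pc, stack, input position)
    seen = set()                         # share identical sub-computations
    while work:
        config = work.popleft()
        if config in seen:
            continue
        seen.add(config)
        pc, stack, pos = config
        instr = program[pc]
        if 'alt' in instr:               # non-deterministic branch
            work.append((instr['alt'], stack, pos))
        op = instr['op']
        if op == 'scan':                 # read one input word
            if pos < len(tokens) and tokens[pos] == instr['sym']:
                work.append((pc + 1, stack, pos + 1))
        elif op == 'push':
            work.append((pc + 1, stack + (instr['sym'],), pos))
        elif op == 'pop':
            if stack:
                work.append((pc + 1, stack[:-1], pos))
        elif op == 'ifstack':            # branch on the top of the stack
            if stack and stack[-1] == instr['sym']:
                work.append((instr['to'], stack, pos))
            else:
                work.append((pc + 1, stack, pos))
        elif op == 'goto':
            work.append((instr['to'], stack, pos))
        elif op == 'accept':
            if pos == len(tokens) and not stack:
                accepted = True
    return accepted

# Hand-written code for the language { a^n b^n | n >= 1 }.
AN_BN = [
    {'op': 'scan', 'sym': 'a'},                # 0: read an 'a'
    {'op': 'push', 'sym': 'A'},                # 1: remember it
    {'op': 'goto', 'to': 0, 'alt': 3},         # 2: more 'a's, or switch to 'b's
    {'op': 'scan', 'sym': 'b'},                # 3: read a 'b'
    {'op': 'pop'},                             # 4: cancel one 'A'
    {'op': 'ifstack', 'sym': 'A', 'to': 3},    # 5: 'A's left -> read more 'b's
    {'op': 'accept'},                          # 6
]
```

Because configurations are deduplicated in `seen`, the number of explored configurations stays polynomial even when the number of distinct accepting paths is exponential, which is the point of the dynamic programming interpretation.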
The latter is often viewed as more efficient, whereas it only has a larger deterministic domain.</Paragraph> <Paragraph position="8"> One essential guideline to achieve better sharing (and often also reduced computation time) is to try to recognize every grammar rule in only one place of the generated chart parser code, even at the cost of increasing non-determinism.</Paragraph> <Paragraph position="9"> Thus simpler schemata such as precedence, LL(0) (and probably LR(0) 13) produce the best sharing. However, since they correspond to a smaller deterministic domain within the CF grammar realm, they may sometimes be computationally less efficient because they produce a larger number of useless items (i.e. edges) that correspond to dead-end computational paths.</Paragraph> <Paragraph position="10"> Slight sophistication (e.g. LALR(1) used by Tomita in \[31\], or LR(1)) may slightly improve computational performance by detecting dead-end computations earlier. This may however be at the expense of the forest sharing quality. More sophistication (say LR(2)) usually loses on both accounts, as explained earlier. The duplication of computational paths due to distinct context analysis outweighs the 11 We mean here the sophistication of the CF parser construction technique rather than the sophistication of the language features chosen to be used by this parser.</Paragraph> <Paragraph position="11"> 12 This negative behavior of some techniques originally intended to preserve determinism had been remarked and analyzed in a special case by Bouckaert, Pirotte and Snelling \[3\]. However we believe their result to be weaker than ours, since it seems to rely on the fact that they directly interpret grammars rather than first compile them. Hence each interpretive step includes in some sense compilation steps, which are more expensive when look-ahead is increased. Their paper presents several examples that run less efficiently when look-ahead is increased. 
For all these examples, this behavior disappears in our compiled setting. However the grammar SBBL in appendix C shows a loss of efficiency with increased look-ahead that is due exclusively to loss of sharing caused by irrelevant contextual distinctions. This effect is particularly visible when parsing incomplete sentences \[16\].</Paragraph> <Paragraph position="12"> Efficiency loss with increased look-ahead is mainly due to state splitting \[6\]. This should favor LALR techniques over LR ones. 13 Our results do not take into account a newly found optimization of PDT interpretation that applies to all and only to bottom-up PDTs. This should make simple bottom-up schemata competitive for sharing quality, and even increase their computational efficiency. However it should not change qualitatively the relative performances of bottom-up parsers, and may emphasize even more the phenomenon that reduces efficiency when look-ahead increases. benefits of early elimination of dead-end paths. But there can be no absolute rule: if a grammar is &quot;close&quot; to the LR(2) domain, an LR(2) schema is likely to give the best result for most parsed sentences.</Paragraph> <Paragraph position="13"> Sophisticated schemata also correspond to larger parsers, which may be critical in some natural language applications with very large grammars.</Paragraph> <Paragraph position="14"> The choice of a parsing schema depends in fine on the grammar used, on the corpus (or kind) of sentences to be analyzed, and on a balance between computational and sharing efficiency. It is best decided on an experimental basis with a system such as ours. Furthermore, we do not believe that any firm conclusion limited to CF grammars would be of real practical usefulness. The real purpose of the work presented is to gain a qualitative insight into phenomena which are best exhibited in the simpler framework of CF parsing.</Paragraph> <Paragraph position="15"> This insight should help us with more complex formalisms (cf. 
section 5) for which the phenomena might be less easily evidenced.</Paragraph> <Paragraph position="16"> Note that the evidence gained contradicts the common belief that parsing schemata with a large deterministic domain (see for example the remarks on LR parsing in \[31\]) are more effective than simpler ones. Most experiments in this area were based on incomparable implementations, while our uniform framework gives us a common theoretical yardstick.</Paragraph> </Section> </Section> <Section position="5" start_page="146" end_page="147" type="metho"> <SectionTitle> 4 A Simple Bottom-Up Example </SectionTitle> <Paragraph position="0"> The following is a simple example based on a bottom-up PDT generated by our LALR(1) compiler from the following grammar taken from \[31\]:
(0) '$ax ::= $ 's $
(1) 's ::= 'np 'vp
(2) 's ::= 's 'pp
(3) 'np ::= n
(4) 'np ::= det n
(5) 'np ::= 'np 'pp
(6) 'pp ::= prep 'np
(7) 'vp ::= v 'np
Nonterminals are prefixed with a quote symbol. The first rule is used for initialization and handling of the delimiter symbol $. The $ delimiters are implicit in the actual input sentence.</Paragraph> <Paragraph position="1"> The sample input is &quot;n v det n prep n&quot;. It figures (for example) the sentence: &quot;I see a man at home&quot;.</Paragraph> <Section position="1" start_page="146" end_page="146" type="sub_section"> <SectionTitle> 4.1 Output grammar produced by the parser </SectionTitle> <Paragraph position="0"> The grammar of parses of the input sentence is given in figure 3.</Paragraph> <Paragraph position="1"> The initial nonterminal is the left-hand side of the first rule. For readability, the nonterminals have been given computer-generated names of the form atx, where x is an integer. All other symbols are terminal. 
Integer terminals correspond to rule numbers of the input language grammar given above, and the other terminals are symbols of the parsed language, except for the special terminal &quot;nil&quot; which indicates the end of the list of subconstituents of a sentence constituent, and may also be read as the empty string ε. Note the ambiguity for nonterminal at4.</Paragraph> <Paragraph position="2"> It is possible to simplify this grammar to 7 rules without losing the sharing of common subparses. However it would no longer exhibit the structure that makes it readable as a shared-forest (though this structure could be retrieved).</Paragraph> <Paragraph position="4"> The two parses of the input sentence defined by this grammar are:
$ n 3 v det n 4 7 1 prep n 3 6 2 $
$ n 3 v det n 4 prep n 3 6 5 7 1 $
Here again the two $ symbols must be read as delimiters. The &quot;nil&quot; symbols, no longer useful, have been omitted in these two parses.</Paragraph> </Section> <Section position="2" start_page="146" end_page="147" type="sub_section"> <SectionTitle> 4.2 Parse shared-forest constructed from that grammar </SectionTitle> <Paragraph position="0"> To explain the structure of the shared forest, we first build a graph from the grammar, as shown in figure 4. Each node corresponds to one terminal or nonterminal of the grammar in figure 3, and is labelled by it. The labels at the right of small dashes are rule numbers from the parsed language grammar (see beginning of section 4). The basic structure is that of figure 1.</Paragraph> <Paragraph position="1"> From this first graph, we can trivially derive the more traditional shared forest given in figure 5. Note that this simplified representation is not always adequate since it does not allow partial sharing of their sons between two nodes. 
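The two parse strings above are reverse-Polish encodings of the parse trees: reading left to right, a word pushes a leaf, and a rule number pops as many subtrees as that rule has right-hand-side symbols, pushing the completed constituent. A minimal decoding sketch (ours, in Python; the rule table is transcribed from the grammar at the beginning of section 4):

```python
# Left-hand side and right-hand-side length of each rule of the example
# grammar. Rule 0, which only handles the $ delimiters, never appears
# inside the parse strings.
RULES = {1: ('s', 2), 2: ('s', 2), 3: ('np', 1), 4: ('np', 2),
         5: ('np', 2), 6: ('pp', 2), 7: ('vp', 2)}

def decode(parse):
    """Turn a bottom-up (reverse-Polish) parse string into a bracketed tree."""
    stack = []
    for tok in parse.split():
        if tok.isdigit():                 # a rule number: reduce
            lhs, arity = RULES[int(tok)]
            sons = stack[-arity:]
            del stack[-arity:]
            stack.append('%s(%s)' % (lhs, ','.join(sons)))
        else:                             # a word of the parsed language
            stack.append(tok)
    assert len(stack) == 1                # a complete parse leaves one tree
    return stack[0]

print(decode('n 3 v det n 4 7 1 prep n 3 6 2'))
# -> s(s(np(n),vp(v,np(det,n))),pp(prep,np(n)))
print(decode('n 3 v det n 4 prep n 3 6 5 7 1'))
# -> s(np(n),vp(v,np(np(det,n),pp(prep,np(n)))))
```

The two decoded trees are exactly the two attachments of the prepositional phrase: to the sentence in the first parse, and to the object noun phrase in the second.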
Each node includes a label which is a non-terminal of the parsed language grammar, and for each possible derivation (several in case of ambiguity) there is the number of the grammar rule used for that derivation. Though this simplified version is more readable, the representation of figure 5 is not adequate to represent partial sharing of the subconstituents of a constituent.</Paragraph> <Paragraph position="2"> Of course, the &quot;constructions&quot; given in this section are purely virtual. In an implementation, the data-structure representing the grammar of figure 3 may be directly interpreted and used as a shared-forest.</Paragraph> <Paragraph position="3"> A similar construction for top-down parsing is sketched in appendix B.</Paragraph> </Section> </Section> <Section position="6" start_page="147" end_page="147" type="metho"> <SectionTitle> 5 Extensions </SectionTitle> <Paragraph position="0"> As indicated earlier, our intent is mostly to understand phenomena that would be harder to evidence in more complex grammatical formalisms.</Paragraph> <Paragraph position="1"> This statement implies that our approach can be extended. This is indeed the case. It is known that many simple parsing schemata can be expressed with stack based machines \[32\]. This is certainly the case for all 
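The ambiguity of the example in section 4 can be checked independently of any PDT machinery. The sketch below is a plain CKY-style recognizer (ours, not the Tin system) that counts derivations per span with dynamic programming; on the grammar of section 4 and the input &quot;n v det n prep n&quot; it finds exactly the two parses, and the chart cells it shares between the two derivations play the same role as the shared nodes of the forest:

```python
from collections import defaultdict

# Rules (1)-(7) of the example grammar; the terminals are the lexical
# categories themselves (n, v, det, prep), as in the paper's input.
BINARY = [('s', 'np', 'vp'), ('s', 's', 'pp'), ('np', 'det', 'n'),
          ('np', 'np', 'pp'), ('pp', 'prep', 'np'), ('vp', 'v', 'np')]
UNIT = [('np', 'n')]

def count_parses(tokens, axiom='s'):
    """Count the derivations of each span with CKY-style dynamic programming."""
    n = len(tokens)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        chart[i][i + 1][tok] = 1          # the terminal itself
        for lhs, rhs in UNIT:
            if rhs == tok:
                chart[i][i + 1][lhs] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            cell = chart[i][k]
            for j in range(i + 1, k):     # every split point of the span
                for lhs, r1, r2 in BINARY:
                    c = chart[i][j][r1] * chart[j][k][r2]
                    if c:
                        cell[lhs] += c
    return chart[0][n][axiom]

print(count_parses('n v det n prep n'.split()))   # -> 2
```

Each chart cell is computed once and reused by every derivation that covers it, which is precisely the sharing that the output grammar of figure 3 encodes.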
left-to-right CF chart parsing schemata.</Paragraph> <Paragraph position="2"> We have formally extended the concept of PDA into that of Logical PDA, which is an operational push-down stack device for parsing unification based grammars \[17,18\] or other non-CF grammars such as Tree Adjoining Grammars \[19\].</Paragraph> <Paragraph position="3"> Hence we are reusing and developing our theoretical \[18\] and experimental \[38\] approach in this much more general setting, which is more likely to be effectively usable for natural language parsing.</Paragraph> <Paragraph position="4"> Furthermore, these extensions can also express, within the PDA model, non-left-to-right behavior such as is used in island parsing \[38\] or in Sheil's approach \[26\]. More generally they allow the formal analysis of agenda strategies, which we have not considered here. In these extensions, the counterpart of parse forests are proof forests of definite clause programs.</Paragraph> </Section> <Section position="7" start_page="147" end_page="147" type="metho"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> Analysis of all-path parsing schemata within a common framework exhibits in comparable terms the properties of these schemata, and gives objective criteria for choosing a given schema when implementing a language analyzer. The approach taken here supports both theoretical analysis and actual experimentation, both for the computational behavior of parsers and for the structure of the resulting shared forest. Many experiments and extensions still remain to be made: improved dynamic programming interpretation of bottom-up parsers, more extensive experimental measurements with a variety of languages and parsing schemata, or generalization of this approach to more complex situations, such as word lattice parsing \[21,30\], or even handling of &quot;secondary&quot; language features. 
Early research in that latter direction is promising: our framework and the corresponding paradigm for parser construction have been extended to full first-order Horn clauses \[17,18\], and are hence applicable to unification based grammatical formalisms \[27\]. Shared forest construction and analysis can be generalized in the same way to these more advanced formalisms.</Paragraph> <Paragraph position="1"> Acknowledgements: We are grateful to Véronique Donzeau-Gouge for many fruitful discussions.</Paragraph> <Paragraph position="2"> This work has been partially supported by the Eureka Software Factory (ESF) project.</Paragraph> </Section> </Paper>