File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/86/c86-1125_metho.xml
Size: 14,306 bytes
Last Modified: 2025-10-06 14:11:56
<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1125"> <Title>ON FORMALIZATIONS OF MARCUS' PARSER</Title> <Section position="3" start_page="0" end_page="534" type="metho"> <SectionTitle> 2. Operation of Marcus' parser </SectionTitle> <Paragraph position="0"> Let us first recall that Marcus' parser has two data structures: a pushdown stack which holds the constructs yet to be completed, and a finite size buffer which holds the lookahead symbols. The lookaheads can be completed constructs as well as bare terminals. I n addition, the parser has three basic operations: (1) Attach: attaches a constituent in the buffer to the current active node (stack top).</Paragraph> <Paragraph position="1"> (2) Create (push): creates a new active node, i.e., when the parser decides thai Ihe first constituent(s) in the buffer begin a new higher constituent, a new node of the specified type is created and pushed on the stack. However the create operation has a second mode in which the newly created node is first attached to the old active node, and then pushed on the stack. Marcus indicates this by use of &quot;attach a new node of 'type' to active node&quot; in the grammar rules. Following Ritchie \[6\], we use a shorter notation: 'cattach' for this second mode.</Paragraph> <Paragraph position="2"> (3) Drop (pop): pops the top node of the stack (CAN). However if this node is not attached to a higher level node, it will be dropped in the first position of the window defined on the buffer. Marcus uses different notations, namely &quot;drop&quot; and &quot;drop into buffer&quot;, in the grammar to indicate the etTect of drop operations. This suggests that a grammar writer must be aware of the attachment of the current active node. Here, we adhere to his provision about differentiating between these two modes of drop operations.</Paragraph> <Paragraph position="3"> However we feel that there is no need for such a provision since PARSIFAl_. (the grammar interpreter) can take care of that by inserting an unattached node into the buffer, and the grammar can test the contents of the buffer to see if such insertion has taken place.</Paragraph> <Paragraph position="4"> The three basic operations plus &quot;attention shift&quot; and &quot;restore buffer&quot; (forward and backward window movements on the buffer) are sufficient for parsing some context-free grammars which we informally denote by MP(k) (i.e., Marcus parsable with k lookabeads).</Paragraph> <Paragraph position="5"> Now let us consider the context-free grammar G,: (1) S'-~S (2) S --,.d (3) S --.-A S B (4) A-).a (5) A-).a S (6) l~-}b The following gives a Marcns-style parser for L(G,), i.e., a grammar G~ written in a PIDGIN-like language that can be interpreted by PAF.SI FAI.. &quot;\['he symbols inside square brackets refer to the contents of buffer positions, except \[CAN = \] which indicates the current active node. The grammar has no attention shift rules. In the Marcus parser active packets are associated with the active node. that is, when a new node is created, some packets will usually he activated as well. Unless a packet is deactivated explicitly this association remains with the node. So when a node on the stack becomes the active node again as a result of 'pop' operations, its associated packets will be reactivated.</Paragraph> <Paragraph position="6"> We do not attempt to show formally the equivalence of G~ and G~, since there is no formal characterization of Marcus-style parsers yet. However one may, by going through examples, convince oneself that the parser given in PIDGIN parses L(G~). Such an example is illustrated in detail next.</Paragraph> <Paragraph position="7"> Example : The following diagrams illustrate the parsing of the sentence addb, L(G~) by the parser described by G2. The symbols inside the boxes are on the stack, and those inside the circles are already attached to a higher level symbol on the stack. The numbers shown above each stack node are the packet nttmbers associated with that node. G2 uses a buffer of size 2 shown on the right. This example shows the power of the Marcus parser in employing completed subtrees as lookaheads. The grammar G, is not LR(k) for any fixed k. Any a can be reduced to an A via production A~a or can be considered as a first symbol in production A~aS (i.e., a reduce/shift conflict in LR parser). However, in the first case a is followed by an Sb. and in the latter by an Sd or an Sa. By postponing the parsing decisions about the completion of A's, Marcus' parser is able to produce the correct parse.</Paragraph> </Section> <Section position="4" start_page="534" end_page="534" type="metho"> <SectionTitle> 3. LRlk,t) Grammars </SectionTitle> <Paragraph position="0"> LR(k,t) grammars were originally proposed by Knuth in his landmark paper on parsing of LR(k) grammars \[3\], and later developed by Szymanski \[7,8\], Essentially the LR(k,t) parsing technique is a non-canonical extension of tile LR(k) technique, in which instead of the reduction of the handle (the leftmost phrase) of a right sentcntial form, we must be able to determine that in any sentential form at least one of the t (a fixed number) leftmost phrases is reducible to a specific non-terminal. In other words, a grammar G is not l.R(k,t) if in parsing of an input sentence the decision about reduction of t or more questionable phrases in a sentential form needs to be delayed. Tile reduction decision is reached by examining the whole left context and k symbols to the right of a phrase in a sentential form.</Paragraph> <Paragraph position="1"> Now, it is easy to see that G, is not LR(k,t) for any finite numbers k and t. For given k and t, L(Gt) includes sentences with prefix a a where n>k+t. In such sentences t initial a's have different interpretations depending on the other parts of the sentences. For example consider the two sentences:</Paragraph> <Paragraph position="3"> them is a phrase. Therefore an LR(k.t) parser will need to delay reduction of more than t possible phrases in parsing of&quot; a sentence with a prefix a '~, n>k+t, and thus G, is not LR(k,t) for any given k and t. In fact, LR(k,t) parsers put a limit t on the number of delayed decisions at any time during the parsing. In Marcus parsing, depending on characteristics of the grammar there may be no limit on this number.</Paragraph> <Paragraph position="4"> We have shown that MP(k)q'-LR(k',t) for any k' and t. An interesting question is whether LR(k,t)cMP(k') for some k'. The answer is negative. Consider the LR (0) = LR (0,1) grammar G ~:</Paragraph> </Section> <Section position="5" start_page="534" end_page="534" type="metho"> <SectionTitle> S-~,A A~cA A-~a S~B B-~cB B~b </SectionTitle> <Paragraph position="0"> With any finite buffer, Marcus' parser will be flooded with c's, before it can decide to put an A node or a B node on the stack. The weakness of Marcus' parser is in its insistence on being partially predictive or top-down. Purely bottom-up LRRL(k) parsers remedy this shortcoming.</Paragraph> <Paragraph position="1"> ,4. BCP(m,n) Grammars The bounded context parsable grammars were introduced by Williams \[9\]. In parsing these grammars we need to be able to reduce at least one phrase in every sentential form (in a bottom-up fashion) by looking at m symbols to the left and n symbols to the right of a phrase. BCP-parsers use two stacks to work in this fashion.</Paragraph> <Paragraph position="2"> It is trivial to show that BCP grammars are unsuitable for formalizing the Marcus parser. A BCP-parser ignores the information extractable from the left context (except the last m symbols). Whereas in the Marcus parser, the use of tlmt information is the compelling reason for deployment of the packeting mechanism. In fact there are numerous simple LP,(k) grammars that are not BCP, but are parsed by the Marcus parser. An example is the grammar G4 :</Paragraph> <Paragraph position="4"> A Marcus-style parser after attaching the first symbol in an input sentence will activate different packets to parse A or B depending on whetller the first symbol was a or b. However, a BCP-parser cannot reduce the only phrase, i.e., d in the sentences ac...cd and bc...cd.</Paragraph> <Paragraph position="5"> Because a number of c's more than m shields the necessary context for reduction of A.,d or 11~d.</Paragraph> </Section> <Section position="6" start_page="534" end_page="534" type="metho"> <SectionTitle> 5. LRRL(k) Grammars </SectionTitle> <Paragraph position="0"> LRRI,(k) parsing basically is a non-canonical bottom-up parsing technique which is influenced by the &quot;wait and see&quot; policy of Marcus' parser. By LRRL(k) grammars, we denote a family of grammar classes that are parsed left to right with k reduced /ookaheads in a deterministic maimer. The difference between these classes lies in the nature of lookaheads that they employ. Roughly, the class with more 'complex' lookaheads includes the class with 'simpler' lookaheads. Here, we discuss the basic I~RRL(k) gramnlars. Further details abmtt LRRL(k) grammars and the algorithm for generation of basic LRRI, parsers are given in \[5\].</Paragraph> <Paragraph position="1"> A basic LRRL(k) parser employs k-symbol fully reduced right contexts or lookaheads. The k fully reduced right context of a phrase in a parse tree consists of the k non-null deriving nodes that follow the phrase in the leftnlost derivation of the tree. Thus these nodes dominate any sequence of k subtrees to the immediate right of&quot; the phrase that have nondegnull frontiers. This generalized lookahead policy implies Ihat when a questionable handle in a right sentential form is reached, the decision to reduce it or not may be reached by parsing ahead a segment of the input that can be reduced to a relevant fully reduced right context of length k. For example, in parsing a sentence in L(G~), after seeing the initial a there is a shift oreduce conflict as to whether we should reduce according to rule (4) or continue with the nile (5). However the 2-symbol fully reduced context for reduction is SB, and for the shift operation is SS, which indicates a possible resolution of conflict if we can parse the lookaheads. Therefore we postpone the reduction of this questionable phrase and add two new auxiliary productions SUBGOAL-RED(4).-,,.SB and SUBGOAL-SHIFT~SS, and continue with the parsing of these new constructs. Upon completion of one of these productions we will be able to resolve the conflicting situation. Fnrthermore, we may apply the same policy to the parsing of lookahead contexts themselves. This feature of LRRL(k) parsing, i.e., the recursive application of the method to the lookahead information, is the one that differentiates this method from any other. The method is recursively applied whenever the need arises, i.e., at ambivalent points during parsing.</Paragraph> <Paragraph position="2"> Note that the lookahead scheme does not allow us to examine any remaining segment or the input that is not a part of the Iookahead context, The parsed context is put in a buffer of size k.</Paragraph> <Paragraph position="3"> and no reexamination of the segment of the input sentence that has been reduced to this right context is carried out. In addition, the right context which is k symbols or less does not contain a complete phrase, i.e., the symbols in the right context do not participate in any future reductions involving only these symbols.</Paragraph> <Paragraph position="4"> The parsing algorithm for an LRRL(k) grammar is based on construction of a Characteristic Finite State Machine. A CFSM is rather similar to the deterministic finite automaton that is used in l,R(k) parsing for recognition of viable prefixes. However there are three major differences: (1) The nature of lookaheads. The lookaheads are fully reduced symbols as opposed to bare terminals in LR(k) parsers.</Paragraph> <Paragraph position="5"> (2) Introduction of auxiliary productions.</Paragraph> <Paragraph position="6"> (3) Partitioning of states which coqceals conflicting items.</Paragraph> <Paragraph position="7"> The information extracted from this machine is in tabulated form that acts as the finite control for the parsing algorithm.</Paragraph> <Paragraph position="8"> The basic I~RRL grammars, when augmented with atlributes or features, generate a class of languages that includes the subsets of English which are parsable by a Marcus type parser. Thus introduction of LRRI, grammars provides us with a capability for automatic generation of Marcus style parsers from a conlext-free base grannnar plus the information about the feature set, their propagation and matching rules, and a limited number of' transfornmtional rules (e.g., auxiliary inversion and handling of traces). We believe tha |such a description of a language m a declarative grammar form is much more understandable than the procedurally defined form in Marcus' parser. Not only does the presence of parsing notions such as create, drop, etc. in a PI DGI N grammar make it difficult to determine exactly what language (i.e.</Paragraph> <Paragraph position="9"> what subset of English) is parsed by the grammar, but it is also very hard to determine whether a given arbitrary language can he parsed in this style arid if so, how to construct a parser. Furthermore, modification of existing parsers and verification of their correctness and completeness seems lo be unmanageable.</Paragraph> <Paragraph position="10"> We may pause here to observe that I,RP, I, parsiug is a bottom-up method, while Marcus' parser is not strictly a boltom-up one. In fact it proceeds in a rap-down maimer and when need arises it contiuues in a bottom-up fashion. However. as lleawick \[1\] notes, the use of top-down prediction iu such a parser does not affect its basic bottom-up completion of coustructs. In fact the inclusion of MP(k) granlmals in the more general class of LRRI,(k) grammars is analogous to the inclusion of l,L(k) grammars in lhe class of LR(k) graunnars. In the Marcus parser incomplete nodes are put on the stack, while in a bottom-up parser completed nodes lhat seek a parent reside on the stack.</Paragraph> </Section> class="xml-element"></Paper>