<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1012"> <Title>Another Facet of LIG Parsing</Title> <Section position="3" start_page="87" end_page="88" type="metho"> <SectionTitle> 3 Linear Indexed Grammars </SectionTitle> <Paragraph position="0"> An indexed grammar is a CFG in which stacks of symbols are associated with non-terminals. LIGs are a restricted form of indexed grammars in which the dependence between stacks is such that at most one stack in the RHS of a production is related to the stack in its LHS. Other non-terminals are associated with independent stacks of bounded size.</Paragraph> <Paragraph position="1"> Following (Vijay-Shanker and Weir, 1994): Definition 2. $L = (V_N, V_T, V_I, P_L, S)$ denotes a LIG where $V_N$, $V_T$, $V_I$ and $P_L$ are respectively finite sets of non-terminals, terminals, stack symbols and productions, and $S$ is the start symbol.</Paragraph> <Paragraph position="2"> (Footnote 2: if $x = a_1 \ldots a_n$, the states can be the integers $0 \ldots n$, 0 is the initial state, $n$ the unique final state, and the transition function $\delta$ is s.t. $i \in \delta(i-1, a_i)$ and $i \in \delta(i, \varepsilon)$.) In the sequel we will only consider a restricted form of LIGs with productions of the form</Paragraph> <Paragraph position="4"> where $A, B \in V_N$, $w \in V_T^* \wedge 0 \le |w| \le 2$, $\alpha\alpha' \in V_I^* \wedge 0 \le |\alpha\alpha'| \le 1$ and $\Gamma_1\Gamma_2 \in V_T \cup \{\varepsilon\} \cup \{C() \mid C \in V_N\}$. An element like $A(..\alpha)$ is a primary constituent while $C()$ is a secondary constituent. The stack schema $(..\alpha)$ of a primary constituent matches all the stacks whose prefix (bottom) part is left unspecified and whose suffix (top) part is $\alpha$; the stack of a secondary constituent is always empty.</Paragraph> <Paragraph position="5"> Such a form has been chosen both for complexity reasons and to decrease the number of cases we have to deal with. However, it is easy to see that this form of LIG constitutes a normal form.</Paragraph> <Paragraph position="6"> We use $r()$ to denote a production in $P_L$, where the parentheses remind us that we are in a LIG! 
The CF-backbone of a LIG is the underlying CFG in which each production is a LIG production where the stack part of each constituent has been deleted, leaving only the non-terminal part. We will only consider LIGs such that there is a bijection between their production set and the production set of their CF-backbone (footnote 3: $r_p$ and $r_p()$ with the same index $p$ designate associated productions).</Paragraph> <Paragraph position="7"> We call object the pair denoted $A(\alpha)$ where $A$ is a non-terminal and $(\alpha)$ a stack of symbols. Let $V_O = \{A(\alpha) \mid A \in V_N \wedge \alpha \in V_I^*\}$ be the set of objects. We define on $(V_O \cup V_T)^*$ the binary relation derives denoted $\Rightarrow$ (the relation symbol is sometimes</Paragraph> <Paragraph position="9"> In the first element above we say that the object $B(\alpha''\alpha')$ is the distinguished child of $A(\alpha''\alpha)$, and if $\Gamma_1\Gamma_2 = C()$, $C()$ is the secondary object. A derivation $\Gamma_1, \ldots, \Gamma_i, \Gamma_{i+1}, \ldots, \Gamma_l$ is a sequence of strings where the relation derives holds between any two consecutive strings. The language defined by a LIG $L$ is the set:</Paragraph> <Paragraph position="11"> As in the CF case we can talk of rightmost derivations when the rightmost object is derived at each step. Of course, many other derivation strategies may be thought of. For our parsing algorithm, we need one such particular derives relation. Assume that at one step an object derives both a distinguished child and a secondary object.</Paragraph> <Paragraph position="12"> Our particular derivation strategy is such that this distinguished child will always be derived after the secondary object (and its descendants), whether this secondary object lies to its left or to its right. This derives relation is denoted $\underset{l,L}{\Rightarrow}$ and is called linear (footnote 4).</Paragraph> <Paragraph position="13"> A spine is the sequence of objects $A_1(\alpha_1) \ldots A_i(\alpha_i) A_{i+1}(\alpha_{i+1}) \ldots$ 
$A_p(\alpha_p)$ if there is a derivation in which each object $A_{i+1}(\alpha_{i+1})$ is the distinguished child of $A_i(\alpha_i)$ (and therefore the distinguished descendant of $A_j(\alpha_j)$, $1 \le j \le i$).</Paragraph> </Section> <Section position="4" start_page="88" end_page="89" type="metho"> <SectionTitle> 4 Linear Derivation Grammar </SectionTitle> <Paragraph position="0"> For a given LIG $L$, consider a linear $S()/x$-derivation $S() \underset{l,L}{\overset{r_1()}{\Rightarrow}} \cdots \underset{l,L}{\overset{r_i()}{\Rightarrow}} \cdots \underset{l,L}{\overset{r_n()}{\Rightarrow}} x$. The sequence of productions $r_1() \ldots r_i() \ldots r_n()$ (considered in reverse order) is a string in $P_L^*$. The purpose of this section is to define the set of such strings as the language defined by some CFG.</Paragraph> <Paragraph position="1"> Associated with a LIG $L = (V_N, V_T, V_I, P_L, S)$, we first define a set of binary relations which are borrowed from (Boullier, 1995)</Paragraph> <Paragraph position="3"> is a distinguished descendant of $A_1()$} The 1-level relations simply indicate, for each production, which operation can be applied to the stack associated with the LHS non-terminal to get the stack associated with its distinguished child; ~ indicates equality, -~ the pushing of $\gamma$, and ~- the popping of $\gamma$. If we look at the evolution of a stack along a spine $A_1(\alpha_1) \ldots A_i(\alpha_i) A_{i+1}(\alpha_{i+1}) \ldots A_p(\alpha_p)$, between any two objects one of the following holds:</Paragraph> <Paragraph position="5"> upon a linear (total) order over object occurrences in a derivation. See (Boullier, 1996) for a more formal definition.</Paragraph> <Paragraph position="7"> for which the relation p holds between A and B.</Paragraph> <Paragraph position="9"> The productions in $P^D$ define all the ways linear derivations can be composed from linear subderivations. These compositions rely on one side upon property 1 (recall that the productions in $P_L$ must be produced in reverse order) and, on the other side, upon the order in which secondary spines (the $\Gamma_1\Gamma_2$-spines) are processed to get the linear derivation order. 
In (Boullier, 1996), we prove that LDGs are not ambiguous (in fact they are SLR(1)) and define</Paragraph> <Paragraph position="11"> If, by some classical algorithm, we remove from $D$ all its useless symbols, we get a reduced CFG, say $D' = (V_N^{D'}, V_T^{D'}, P^{D'}, S^{D'})$. In this grammar, all the terminal symbols, which are productions in $L$, are useful. By the way, the construction of $D'$ solves the emptiness problem for LIGs: $L$ specifies the empty set iff the set $V_T^{D'}$ is empty (footnote 7).</Paragraph> </Section> <Section position="5" start_page="89" end_page="91" type="metho"> <SectionTitle> 5 LIG parsing </SectionTitle> <Paragraph position="0"> Given a LIG $L = (V_N, V_T, V_I, P_L, S)$ we want to find all the syntactic structures associated with an input string $x \in V_T^*$. In section 2 we used a CFG (the shared parse forest) for representing all parses in a CFG. In this section we will see how to build a CFG which represents all parses in a LIG.</Paragraph> <Paragraph position="1"> In (Boullier, 1995) we give a recognizer for LIGs with the following scheme: in a first phase a general CF parsing algorithm, working on the CF-backbone, builds a shared parse forest for a given input string $x$. In a second phase, the LIG conditions are checked on this forest. This checking can result in some subtree (production) deletions, namely the ones for which there is no valid symbol stack evaluation. If the resulting grammar is not empty, then $x$ is a sentence.</Paragraph> <Paragraph position="2"> However, in the general case, this resulting grammar is not a shared parse forest for the initial LIG, in the sense that the computations of stacks of symbols along spines are not guaranteed to be consistent. Such invalid spines are not deleted during the check of the LIG conditions because they could be composed of sub-spines which are themselves parts of other valid spines. One way to solve this problem is to unfold the shared parse forest and to extract individual parse trees. 
A parse tree is then kept iff the LIG conditions are valid on that tree. But such a method is not practical since the number of parse trees can be unbounded when the CF-backbone is cyclic. Even for non-cyclic grammars, the number of parse trees can be exponential in the size of the input. Moreover, it is doubtful that a worst-case polynomial-size structure could be reached by some sharing compatible both with the syntactic and the &quot;semantic&quot; features.</Paragraph> <Paragraph position="3"> However, we know that derivations in TAGs are context-free (see (Vijay-Shanker, 1987)) and (Vijay-Shanker and Weir, 1993) exhibits a CFG which represents all possible derivation sequences in a TAG. We will show that the analogous result holds for LIGs and leads to an $O(n^6)$ time parsing algorithm.</Paragraph> <Paragraph position="5"> in $\mathcal{L}(G)$, and $G^x = (V_N^x, V_T^x, P^x, S^x)$ its shared parse forest for $x$. We define the LIGed forest for $x$ as being the LIG $L^x = (V_N^x, V_T^x, V_I, P_L^x, S^x)$ s.t. $G^x$ is its CF-backbone and its productions are the productions of $P^x$ in which the corresponding stack-schemas of $L$ have been added. For exam-</Paragraph> <Paragraph position="7"> If we follow (Lang, 1994), the previous definition which produces a LIGed forest from any $L$ and $x$ is a (LIG) parser (footnote 8): given a LIG $L$ and a string $x$, we have constructed a new LIG $L^x$ for the intersection $\mathcal{L}(L) \cap \{x\}$, which is the shared forest for all parses of the sentences in the intersection. However, we wish to go one step further since the parsing (or even recognition) problem for LIGs cannot be trivially extracted from the LIGed forests.</Paragraph> <Paragraph position="8"> Our vision for the parsing of a string $x$ with a LIG $L$ can be summarized in a few lines. Let $G$ be the CF-backbone of $L$; we first build $G^x$, the CFG shared parse forest, by any classical general CF parsing algorithm and then $L^x$, its LIGed forest. 
Afterwards, we build the reduced LDG $D_{L^x}$ associated with $L^x$ as shown in section 4.</Paragraph> <Paragraph position="9"> Footnote 8: Of course, instead of $x$, we can consider any FSA.</Paragraph> <Paragraph position="10"> The recognition problem for $(L, x)$ (i.e. is $x$ an element of $\mathcal{L}(L)$) is equivalent to the non-emptiness of the production set of $D_{L^x}$.</Paragraph> <Paragraph position="11"> Moreover, each linear $S()/x$-derivation in $L$ is (the reverse of) a string in $\mathcal{L}(D_{L^x})$ (footnote 9). So the extraction of individual parses in a LIG is merely reduced to the derivation of strings in a CFG.</Paragraph> <Paragraph position="12"> An important issue is the complexity, in time and space, of $D_{L^x}$. Let $n$ be the length of the input string $x$. Since $G$ is in binary form we know that the shared parse forest $G^x$ can be built in $O(n^3)$ time and the number of its productions is also in $O(n^3)$. Moreover, the cardinality of $V_N^x$ is $O(n^2)$ and, for any given non-terminal, say $[A]_p^q$, there are at most $O(n)$ $[A]_p^q$-productions. Of course, these complexities extend to the LIGed forest $L^x$.</Paragraph> <Paragraph position="13"> We now look at the LDG complexity when the input LIG is a LIGed forest. In fact, we mainly have to check two forms of productions (see definition 3). The first form is production (6) ([A +-~ C] $\to$ [B + C][A ~-0 B]), where three different non-terminals in $V_N$ are involved (i.e. $A$, $B$ and $C$), so the number of productions of that form is cubic in the number of non-terminals and therefore is $O(n^6)$.</Paragraph> <Paragraph position="14"> In the second form (productions (5), (7) and (9)), exemplified by [A ~ C] $\to$ [B ~ C][$\Gamma_1\Gamma_2$] $r()$, there are four non-terminals in $V_N$ (i.e. $A$, $B$, $C$, and $X$ if $\Gamma_1\Gamma_2 = X()$) and a production $r()$ (the number of relation symbols ~ is a constant); therefore, the number of such productions seems to be of fourth degree in the number of non-terminals and linear in the number of productions. However, these variables are not independent. 
For a given $A$, the number of triples $(B, X, r())$ is the number of $A$-productions, hence $O(n)$. So, with $O(n^2)$ choices for each of $A$ and $C$, the number of productions of that form is $O(n^5)$.</Paragraph> <Paragraph position="15"> We can easily check that the other forms of productions have a lesser degree.</Paragraph> <Paragraph position="16"> Therefore, the number of productions is dominated by the first form and the size (and in fact the construction time) of this grammar is $O(n^6)$.</Paragraph> <Paragraph position="17"> This (once again) shows that the recognition and parsing problem for a LIG can be solved in $O(n^6)$ time.</Paragraph> <Paragraph position="18"> For a LDG $D = (V_N^D, V_T^D, P^D, S^D)$, we note that for any given non-terminal $A \in V_N^D$ and string $\sigma \in \mathcal{L}(A)$ with $|\sigma| \ge 2$, a single production $A \to X_1X_2$ or $A \to X_1X_2X_3$ in $P^D$ is needed to &quot;cut&quot; $\sigma$ into two or three non-empty pieces $\sigma_1$, $\sigma_2$, and $\sigma_3$, such that [Footnote 9: In fact, the terminal symbols in $D_{L^x}$ are productions in $L^x$ (say $R_q()$), which trivially can be mapped to productions in $L$ (here $r_p()$).]</Paragraph> <Paragraph position="20"> two productions (namely (4) and (7)). This shows that the cutting out of any string of length $l$ into elementary pieces of length 1 is performed using $O(l)$ productions. Therefore, the extraction of a linear $S()/x$-derivation in $L$ is performed in time linear in the length of that derivation. If we assume that the CF-backbone $G$ is non-cyclic, the extraction of a parse is linear in $n$. Moreover, during an extraction, since $D_{L^x}$ is not ambiguous, at some place, the choice of another $A$-production will result in a different linear derivation.</Paragraph> <Paragraph position="21"> Of course, practical generations of LDGs must improve over a blind application of definition 3. One way is to consider a top-down strategy: the $X$-productions in a LDG are generated iff $X$ is the start symbol or occurs in the RHS of an already generated production. 
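As a rough illustration of this top-down strategy (a sketch only, not the construction actually used for the paper), the following Python fragment generates productions on demand. The names generate_top_down, productions_of and is_nonterminal are hypothetical: productions_of stands for whatever procedure applies definition 3 to a single symbol X, and the toy table at the end is invented for the illustration.

```python
from collections import deque


def generate_top_down(start, productions_of, is_nonterminal):
    """Generate X-productions only for symbols X reachable from `start`.

    `productions_of(X)` is a hypothetical callback returning the right-hand
    sides (tuples of symbols) of the X-productions; `is_nonterminal(sym)`
    tells non-terminals from terminals. Neither name comes from the paper.
    """
    generated = {}           # X -> list of right-hand sides
    agenda = deque([start])  # symbols whose productions are still to build
    while agenda:
        x = agenda.popleft()
        if x in generated:   # already expanded, nothing to do
            continue
        generated[x] = list(productions_of(x))
        for rhs in generated[x]:
            for sym in rhs:
                if is_nonterminal(sym) and sym not in generated:
                    agenda.append(sym)
    return generated


# Toy grammar (assumption): "U" is never reached from "S".
TOY = {"S": [("A", "r1")], "A": [("r2",)], "U": [("r3",)]}
RESULT = generate_top_down("S", lambda x: TOY.get(x, []), lambda s: s in TOY)
```

In this toy run only the S- and A-productions are built; the U-productions are never requested, which is the saving the top-down strategy aims at.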
The examples in section 6 are produced this way.</Paragraph> <Paragraph position="22"> If the number of ambiguities in the initial LIG is bounded, the size of $D_{L^x}$, for a given input string $x$ of length $n$, is linear in $n$.</Paragraph> <Paragraph position="23"> The size and the time needed to compute $D_{L^x}$ are closely related to the actual sizes of the -<~-, >- and + + relations. As pointed out in (Boullier, 1995), their $O(n^4)$ maximum sizes seem to be seldom reached in practice. This means that the average parsing time is much better than this $O(n^6)$ worst case.</Paragraph> <Paragraph position="24"> Moreover, our parsing schema allows us to avoid some useless computations. Assume that the symbol [A ~ B] is useless in the LDG $D_L$ associated with the initial LIG $L$; we know that any non-terminal of the form [$[A]_p^q$ ~ $[B]_{p'}^{q'}$] is also useless in $D_{L^x}$. Therefore, the static computation of a reduced LDG for the initial LIG $L$ (and the corresponding -C/-, >- and .~ + + relations) can be used to direct the parsing process and decrease the parsing time (see section 6).</Paragraph> </Section> <Section position="6" start_page="91" end_page="93" type="metho"> <SectionTitle> 6 Two Examples </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="91" end_page="93" type="sub_section"> <SectionTitle> 6.1 First Example </SectionTitle> <Paragraph position="0"> In this section, we illustrate our algorithm with a LIG $L = (\{S, T\}, \{a, b, c\}, \{\gamma_a, \gamma_b, \gamma_c\}, P_L, S)$ where $P_L$ contains the following productions:</Paragraph> <Paragraph position="2"> It is easy to see that its CF-backbone $G$, whose production set $P_G$ is: $S \to Sa$, $S \to Sb$, $S \to Sc$, $S \to T$, $T \to aT$, $T \to bT$, $T \to cT$, $T \to c$, defines the language $\mathcal{L}(G) = \{wcw' \mid w, w' \in \{a, b, c\}^*\}$. We remark that the stacks of symbols in $L$ constrain the string $w'$ to be equal to $w$ and therefore the language $\mathcal{L}(L)$ is $\{wcw \mid w \in \{a, b, c\}^*\}$. 
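The difference between the two languages can be made concrete with a small Python sketch. It does not implement the parsing algorithm; it only enumerates the ways a string can be cut as w c w' (one per reading by the backbone G) and keeps the cuts with w' equal to w (the ones the stacks of L allow). The function names backbone_cuts and lig_cuts are ours, chosen for the illustration.

```python
def backbone_cuts(x):
    """Return every (w, w2) with x == w + "c" + w2 over the alphabet {a, b, c}.

    Each pair is one way the CF-backbone G can read x, one per choice of
    the separator c.
    """
    if set(x) - set("abc"):
        return []  # x uses symbols outside the terminal alphabet
    return [(x[:i], x[i + 1:]) for i, ch in enumerate(x) if ch == "c"]


def lig_cuts(x):
    """Keep only the cuts accepted by the LIG L, whose stack forces w2 == w."""
    return [(w, w2) for (w, w2) in backbone_cuts(x) if w == w2]


if __name__ == "__main__":
    print(backbone_cuts("ccc"))  # the three cuts seen by G
    print(lig_cuts("ccc"))       # the single cut kept by L
```

For the input ccc, the backbone sees three cuts while only the middle one satisfies the stack constraint.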
We note that in $L$ the key part is played by the middle $c$, introduced by production $r_8()$, and that this grammar is non-ambiguous, while in $G$ the symbol $c$, introduced by the last production $T \to c$, is only a separator between $w$ and $w'$ and that this grammar is ambiguous (any occurrence of $c$ may be this separator).</Paragraph> <Paragraph position="3"> The computation of the relations gives:</Paragraph> <Paragraph position="5"> The production set $P^D$ of the LDG $D$ associated with $L$ is:</Paragraph> <Paragraph position="7"> The numbers (i) refer to definition 3. We can easily check that this grammar is reduced.</Paragraph> <Paragraph position="8"> Let $x = ccc$ be an input string. Since $x$ is an element of $\mathcal{L}(G)$, its shared parse forest $G^x$ is not empty. Its production set $P^x$ is:</Paragraph> <Paragraph position="10"> We can observe that this shared parse forest denotes in fact three different parse trees, each one corresponding to a different cutting out of $x = wcw'$ (i.e.</Paragraph> <Paragraph position="11"> $w = \varepsilon$ and $w' = cc$, or $w = c$ and $w' = c$, or $w = cc$ and $w' = \varepsilon$).</Paragraph> <Paragraph position="12"> The corresponding LIGed forest whose start symbol is $S^x = [S]_0^3$ and production set $P_L^x$ is:</Paragraph> <Paragraph position="14"> For this LIGed forest the relations are: The start symbol of the LDG associated with the LIGed forest $L^x$ is $[[S]_0^3]$. If we assume that an $A$-production is generated iff it is an $[[S]_0^3]$-production or $A$ occurs in an already generated production, we get:</Paragraph> <Paragraph position="16"> This CFG is reduced. Since its production set is non-empty, we have $ccc \in \mathcal{L}(L)$. Its language is {r~ deg 0 r9 0 r4 ()r~ 0 } which shows that the only linear derivation in $L$ is $S() \underset{l,L}{\Rightarrow} S(\gamma_c)c \underset{l,L}{\Rightarrow} T(\gamma_c)c \underset{l,L}{\Rightarrow}$</Paragraph> <Paragraph position="18"> In computing the relations for the initial LIG $L$, we remark that though T ~2 T, T ~ T, and T ~ T, + + + the non-terminals [T ~ T], [T ~ T], and [T ~: T] are + + not used in $P^D$. 
This means that for any LIGed forest $L^x$, the elements of the form ($[T]_p^q$, $[T]_{p'}^{q'}$) do not need to be computed in the ~+, ~+ , and ~:+ relations since they will never produce a useful non-terminal.</Paragraph> <Paragraph position="19"> In this example, the subset ~: of ~: is useless.</Paragraph> </Section> </Section> <Section position="7" start_page="93" end_page="93" type="metho"> <SectionTitle> 1 -b </SectionTitle> <Paragraph position="0"> The next example shows the handling of a cyclic grammar.</Paragraph> <Section position="1" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 6.2 Second Example </SectionTitle> <Paragraph position="0"> The following LIG $L$, where $A$ is the start symbol:</Paragraph> <Paragraph position="2"> is cyclic (we have $A \overset{+}{\Rightarrow} A$ and $B \overset{+}{\Rightarrow} B$ in its CF-backbone), and the stack schemas in production $r_1()$ indicate that an unbounded number of push $\gamma_a$ actions can take place, while production $r_3()$ indicates an unbounded number of pops. Its CF-backbone is unboundedly ambiguous though its language contains the single string $a$.</Paragraph> <Paragraph position="3"> The computation of the relations gives: We can easily check that this grammar is reduced. We want to parse the input string $x = a$ (i.e. find all the linear $S()/a$-derivations).</Paragraph> <Paragraph position="4"> Its LIGed forest, whose start The start symbol of the LDG associated with $L^x$ is $[[A]_0^1]$. If we assume that an $A$-production is generated iff it is an $[[A]_0^1]$-production or $A$ occurs in an already generated production, its production set is:</Paragraph> <Paragraph position="6"> This CFG is reduced. Since its production set is non-empty, we have $a \in \mathcal{L}(L)$. Its language is $\{r_4() \{r_3()\}^k r_2() \{r_1()\}^k \mid 0 \le k\}$ which shows that the only valid linear derivations w.r.t. $L$ must contain an identical number $k$ of productions which push $\gamma_a$ (i.e. the production $r_1()$) and productions which pop $\gamma_a$ (i.e. 
the production $r_3()$).</Paragraph> <Paragraph position="7"> As in the previous example, we can see that the element [S]~ ~ [B]~ is useless.</Paragraph> </Section> </Section> <Section position="8" start_page="93" end_page="94" type="metho"> <SectionTitle> 7 Conclusion </SectionTitle> <Paragraph position="0"> We have shown that the parses of a LIG can be represented by a non-ambiguous CFG. This representation captures the fact that the values of a stack of symbols are well parenthesized. When a symbol $\gamma$ is pushed on a stack at a given index at some place, this very symbol must be popped some place else, and we know that such (recursive) pairing is the essence of context-freeness.</Paragraph> <Paragraph position="1"> In this approach, the number of productions and the construction time of this CFG are at worst $O(n^6)$, though much better results occur in practical situations. Moreover, static computations on the initial LIG may decrease this practical complexity by avoiding useless computations. Each sentence in this CFG is a derivation of the given input string by the LIG, and is extracted in linear time.</Paragraph> </Section> </Paper>