File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/a00-2033_metho.xml
Size: 20,206 bytes
Last Modified: 2025-10-06 14:07:08
<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2033"> <Title>Removing Left Recursion from Context-Free Grammars</Title> <Section position="3" start_page="0" end_page="249" type="metho"> <SectionTitle> 3 Test Grammars </SectionTitle> <Paragraph position="0"> We will test the algorithms considered here on three large, independently-motivated, natural-language grammars. The CT grammar 1 was compiled into a CFG from a task-specific unification grammar written for CommandTalk (Moore et al., 1997), a spoken-language interface to a military simulation system. The ATIS grammar was extracted from an internally generated treebank of the DARPA ATIS3 training sentences (Dahl et al., 1994). The PT grammar 2 was extracted from the Penn Treebank (Marcus et al., 1993). To these grammars we add a small &quot;toy&quot; grammar, simply because some of the algorithms cannot be run to completion on any of the &quot;real&quot; grammars within reasonable time and space bounds.</Paragraph> <Paragraph position="1"> Some statistics on the test grammars are contained in Table 1. The criterion we use to judge effectiveness of the algorithms under test is the size of the resulting grammar, measured in terms of the total number of terminal and nonterminal symbols needed to express the productions of the grammar.</Paragraph> <Paragraph position="2"> We use a slightly nonstandard metric, counting the symbols as if, for each nonterminal, there were a single production of the form $A \rightarrow \alpha_1 \mid \ldots \mid \alpha_n$. This reflects the size of files and data structures typically used to store grammars for top-down processing more accurately than counting a separate occurrence of the left-hand side for each distinct right-hand side.</Paragraph> <Paragraph position="3"> It should be noted that the CT grammar has a very special property: none of the 535 left-recursive nonterminals is indirectly left recursive. 
The grammar was designed to have this property specifically because Paull's algorithm does not handle indirect left recursion well.</Paragraph> <Paragraph position="4"> It should also be noted that none of these grammars contains empty productions or cycles, which can cause problems for algorithms for removing left recursion. It is relatively easy to transform an arbitrary CFG into an equivalent grammar which does not contain any of the problematic cases. In its initial form the PT grammar contained cycles, but these were removed at a cost of increasing the size of the grammar by 78 productions and 89 total symbols. No empty productions or cycles existed anywhere else in the original grammars.</Paragraph> </Section> <Section position="4" start_page="249" end_page="251" type="metho"> <SectionTitle> 4 Paull's Algorithm </SectionTitle> <Paragraph position="0"> Paull's algorithm for eliminating left recursion from CFGs attacks the problem by an iterative procedure for transforming indirect left recursion into direct left recursion, with a subprocedure for eliminating direct left recursion. This algorithm is perhaps more familiar to some as the first phase of the textbook algorithm for transforming CFGs to Greibach normal form (Greibach, 1965). 3 The subprocedure to eliminate direct left recursion performs the following transformation (Hopcroft and Ullman, 1979, p. 96): Let $A \rightarrow A\alpha_1 \mid \ldots \mid A\alpha_s$ be the set of all directly left-recursive $A$-productions, and let $A \rightarrow \beta_1 \mid \ldots \mid \beta_s$ be the remaining $A$-productions. Replace all these productions with $A \rightarrow \beta_1 \mid \beta_1 A' \mid \ldots \mid \beta_s \mid \beta_s A'$, and $A' \rightarrow \alpha_1 \mid \alpha_1 A' \mid \ldots \mid \alpha_s \mid \alpha_s A'$, where $A'$ is a new nonterminal not used elsewhere in the grammar.</Paragraph> <Paragraph position="1"> This transformation is embedded in the full algorithm (Aho et al., 1986, p. 
177), displayed in Figure 1.</Paragraph> <Paragraph position="2"> The idea of the algorithm is to eliminate left recursion by transforming the grammar so that all the direct left corners of each nonterminal strictly follow that nonterminal in a fixed total ordering, in which case, no nonterminal can be left recursive. This is accomplished by iteratively replacing direct left corners that precede a given nonterminal with all their expansions in terms of other nonterminals that are greater in the ordering, until the nonterminal has only itself and greater nonterminals as direct left corners. Any direct left recursion for that nonterminal is then eliminated by the first transformation discussed.
Figure 1:
Assign an ordering $A_1, \ldots, A_n$ to the nonterminals of the grammar.
for i := 1 to n do begin
    for j := 1 to i - 1 do begin
        for each production of the form $A_i \rightarrow A_j \alpha$ do begin
            remove $A_i \rightarrow A_j \alpha$ from the grammar
            for each production of the form $A_j \rightarrow \beta$ do begin
                add $A_i \rightarrow \beta \alpha$ to the grammar
            end
        end
    end
    transform the $A_i$-productions to eliminate direct left recursion
end
</Paragraph> <Paragraph position="3"> The difficulty with this approach is that the iterated substitutions can lead to an exponential increase in the size of the grammar. Consider the grammar consisting of the productions $A_1 \rightarrow 0 \mid 1$, plus $A_{i+1} \rightarrow A_i 0 \mid A_i 1$ for $1 \leq i &lt; n$. It is easy to see that Paull's algorithm will transform the grammar so that it consists of all possible $A_i$-productions with a binary sequence of length $i$ on the right-hand side, for $1 \leq i \leq n$, which is exponentially larger than the original grammar. Notice that the efficiency of Paull's algorithm crucially depends on the ordering of the nonterminals. If the ordering is reversed in the grammar of this example, Paull's algorithm will make no changes, since the grammar will already satisfy the condition that all the direct left corners of each nonterminal strictly follow that nonterminal in the revised ordering. 
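The blow-up and its dependence on the ordering can be checked on a small instance. The following is a minimal sketch, not the implementation used for the experiments: it assumes a grammar is a dict from nonterminal names to lists of right-hand-side tuples, that there are no empty productions, and that a fresh nonterminal can be named by appending a prime. The helper names (eliminate_direct, paull, size, chain) are illustrative only.

```python
def eliminate_direct(grammar, a):
    """Remove direct left recursion from the productions of nonterminal a."""
    rec = [rhs[1:] for rhs in grammar[a] if rhs[0] == a]   # the alpha parts
    rest = [rhs for rhs in grammar[a] if rhs[0] != a]       # the beta parts
    if not rec:
        return
    new = a + "'"  # assumed naming scheme for the new nonterminal
    grammar[a] = [r for beta in rest for r in (beta, beta + (new,))]
    grammar[new] = [r for alpha in rec for r in (alpha, alpha + (new,))]


def paull(grammar, order):
    """Run the loop of Figure 1 over the nonterminals in the given order."""
    g = {a: list(rhss) for a, rhss in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for rhs in g[ai]:
                if rhs[0] == aj:
                    # replace Ai -> Aj alpha by Ai -> beta alpha for each Aj -> beta
                    expanded.extend(beta + rhs[1:] for beta in g[aj])
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        eliminate_direct(g, ai)
    return g


def size(grammar):
    """One left-hand side per nonterminal plus every right-hand-side symbol."""
    return sum(1 + sum(len(rhs) for rhs in rhss) for rhss in grammar.values())


def chain(n):
    """The example grammar: A1 -> 0 | 1, and A(i+1) -> Ai 0 | Ai 1."""
    g = {"A1": [("0",), ("1",)]}
    for i in range(1, n):
        g[f"A{i + 1}"] = [(f"A{i}", "0"), (f"A{i}", "1")]
    return g


names = ["A1", "A2", "A3", "A4"]
forward = paull(chain(4), names)          # A1 first: every Ai is fully expanded
backward = paull(chain(4), names[::-1])   # A4 first: nothing is substituted
print(size(chain(4)), size(forward), size(backward))  # prints 18 102 18
```

With n = 4 the forward ordering already yields 16 productions for A4, one per binary string of length 4, while the reversed ordering leaves the grammar untouched.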
The textbook discussions of Paull's algorithm, however, are silent on the choice of ordering. In the inner loop of Paull's algorithm, for nonterminals $A_i$ and $A_j$, such that $i > j$ and $A_j$ is a direct left corner of $A_i$, we replace all occurrences of $A_j$ as a direct left corner of $A_i$ with all possible expansions of $A_j$. This only contributes to elimination of left recursion from the grammar if $A_i$ is a left-recursive nonterminal, and $A_j$ lies on a path that makes $A_i$ left recursive; that is, if $A_i$ is a left corner of $A_j$ (in addition to $A_j$ being a left corner of $A_i$). We could eliminate replacements that are useless in removing left recursion if we could order the nonterminals of the grammar so that, if $i > j$ and $A_j$ is a direct left corner of $A_i$, then $A_i$ is also a left corner of $A_j$. We can achieve this by ordering the nonterminals in decreasing order of the number of distinct left corners they have. Since the left-corner relation is transitive, if $C$ is a direct left corner of $B$, every left corner of $C$ is also a left corner of $B$. In addition, since we defined the left-corner relation to be reflexive, $B$ is a left corner of itself. Hence, if $C$ is a direct left corner of $B$, it must follow $B$ in decreasing order of number of distinct left corners, unless $B$ is a left corner of $C$.</Paragraph> <Paragraph position="4"> Table 2 shows the effect on Paull's algorithm of ordering the nonterminals according to decreasing number of distinct left corners, with respect to the toy grammar. 4 In the table, &quot;best&quot; means an ordering consistent with this constraint. Note that if a grammar has indirect left recursion, there will be multiple orderings consistent with our constraint, since indirect left recursion creates cycles in the left-corner relation, so every nonterminal in one of these cycles will have the same set of left corners. 
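The ordering criterion can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code behind Table 2: grammars are dicts from nonterminal names to lists of right-hand-side tuples with no empty productions, left corners are taken to include terminals as well as nonterminals, and ties are broken by the input order of the dict. The function names are hypothetical.

```python
def left_corners(grammar):
    """Map each nonterminal to its set of left corners (reflexive, transitive)."""
    lc = {a: {a} | {rhs[0] for rhs in rhss} for a, rhss in grammar.items()}
    changed = True
    while changed:
        changed = False
        for a in lc:
            for b in list(lc[a]):
                extra = lc.get(b, set()) - lc[a]  # terminals have no entry
                if extra:
                    lc[a] |= extra
                    changed = True
    return lc


def best_order(grammar):
    """Order nonterminals by decreasing number of distinct left corners."""
    lc = left_corners(grammar)
    return sorted(grammar, key=lambda a: len(lc[a]), reverse=True)


example = {
    "A1": [("0",), ("1",)],
    "A2": [("A1", "0"), ("A1", "1")],
    "A3": [("A2", "0"), ("A2", "1")],
}
print(best_order(example))  # prints ['A3', 'A2', 'A1']
```

Because Python's sort is stable, nonterminals with equally many left corners keep their input order, which is one arbitrary way of choosing among the orderings that respect the constraint.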
Our &quot;best&quot; ordering is simply an arbitrarily chosen ordering respecting the constraint; we are unaware of any method for finding a unique best ordering, other than trying all the orderings respecting the constraint.</Paragraph> <Paragraph position="5"> 4 As mentioned previously, grammar sizes are given in terms of total terminal and nonterminal symbols needed to express the grammar.</Paragraph> <Paragraph position="6"> As a neutral comparison, we also ran the algorithm with the nonterminals ordered lexicographically. Finally, to test how bad the algorithm could be with a really poor choice of nonterminal ordering, we defined a &quot;worst&quot; ordering to be one with increasing numbers of distinct left corners. It should be noted that with either the lexicographical or worst ordering, on all three of our large grammars Paull's algorithm exceeded a cut-off of 5,000,000 grammar symbols, which we chose as being well beyond what might be considered a tolerable increase in the size of the grammar.</Paragraph> <Paragraph position="7"> Let PA refer to Paull's algorithm with the nonterminals ordered according to decreasing number of distinct left corners. The second line of Table 3 shows the results of running PA on our three large grammars. The CT grammar increases only modestly in size, because as previously noted, it has no indirect left recursion. Thus the combinatorial phase of Paull's algorithm is never invoked, and the increase is solely due to the transformation applied to directly left-recursive productions. With the ATIS grammar and PT grammar, which do not have this special property, Paull's algorithm exceeded our cut-off, even with our best ordering of nonterminals. Some additional optimizations of Paull's algorithm are possible. One way to reduce the number of substitutions made by the inner loop of the algorithm is to &quot;left factor&quot; the grammar (Aho et al., 1986, pp. 178-179). 
The left-factoring transformation (LF) applies the following grammar rewrite schema repeatedly, until it is no longer applicable: LF: For each nonterminal $A$, let $\alpha$ be the longest nonempty sequence such that there is more than one grammar production of the form $A \rightarrow \alpha\beta$. Replace the set of all productions $A \rightarrow \alpha\beta_1$, ..., $A \rightarrow \alpha\beta_n$ with the productions $A \rightarrow \alpha A'$, $A' \rightarrow \beta_1$, ..., $A' \rightarrow \beta_n$, where $A'$ is a new nonterminal symbol.</Paragraph> <Paragraph position="8"> With left factoring, for each nonterminal $A$ there will be only one $A$-production for each direct left corner of $A$, which will in general reduce the number of substitutions performed by the algorithm.</Paragraph> <Paragraph position="9"> The effect of left factoring by itself is shown in the third line of Table 3. Left factoring actually reduces the size of all three grammars, which may be unintuitive, since left factoring necessarily increases the number of productions in the grammar. However, the transformed productions are shorter, and the grammar size as measured by total number of symbols can be smaller because common left factors are represented only once.</Paragraph> <Paragraph position="10"> The result of applying PA to the left-factored grammars is shown in the fourth line of Table 3 (LF+PA). This produces a modest decrease in the size of the non-left-recursive form of the CT grammar, and brings the non-left-recursive form of the ATIS grammar under the cut-off size, but the non-left-recursive form of the PT grammar still exceeds the cut-off.</Paragraph> <Paragraph position="11"> The final optimization we have developed for Paull's algorithm is to transform the grammar to combine all the non-left-recursive possibilities for each left-recursive nonterminal under a new nonterminal symbol. 
This transformation, which we might call &quot;non-left-recursion grouping&quot; (NLRG), can be defined as follows: NLRG: For each left-recursive nonterminal $A$, let $\alpha_1, \ldots, \alpha_n$ be all the expansions of $A$ that do not have a left-recursive nonterminal as the leftmost symbol. If $n > 1$, replace the set of productions $A \rightarrow \alpha_1$, ..., $A \rightarrow \alpha_n$ with the productions</Paragraph> </Section> <Section position="5" start_page="251" end_page="252" type="metho"> <SectionTitle> $A \rightarrow A'$, $A' \rightarrow \alpha_1$, ..., $A' \rightarrow \alpha_n$, </SectionTitle> <Paragraph position="0"> where $A'$ is a new nonterminal symbol.</Paragraph> <Paragraph position="1"> Since all the new nonterminals introduced by this transformation will be non-left-recursive, Paull's algorithm with our best ordering will never substitute the expansions of any of these new nonterminals into the productions for any other nonterminal, which in general reduces the number of substitutions the algorithm makes. We did not empirically measure the effect on grammar size of applying the NLRG transformation by itself, but it is easy to see that it increases the grammar size by exactly two symbols for each left-recursive nonterminal to which it is applied. Thus an addition of twice the number of left-recursive nonterminals will be an upper bound on the increase in the size of the grammar, but since not every left-recursive nonterminal necessarily has more than one non-left-recursive expansion, the increase may be less than this.</Paragraph> <Paragraph position="2"> The fifth line of Table 3 (LF+NLRG+PA) shows the result of applying LF, followed by NLRG, followed by PA. This produces another modest decrease in the size of the non-left-recursive form of the CT grammar and reduces the size of the non-left-recursive form of the ATIS grammar by a factor of 27.8, compared to LF+PA. 
The non-left-recursive form of the PT grammar remains larger than the cut-off size of 5,000,000 symbols, however.</Paragraph> </Section> <Section position="6" start_page="252" end_page="253" type="metho"> <SectionTitle> 5 Left-Recursion Elimination Based on the Left-Corner Transform </SectionTitle> <Paragraph position="0"> An alternate approach to eliminating left recursion is based on the left-corner (LC) grammar transform of Rosenkrantz and Lewis (1970) as presented and modified by Johnson (1998). Johnson's second form of the LC transform can be expressed as follows, with expressions of the form $A\text{-}a$, $A\text{-}X$, and $A\text{-}B$ being new nonterminals in the transformed grammar: 1. If a terminal symbol $a$ is a proper left corner of $A$ in the original grammar, add $A \rightarrow a \; A\text{-}a$ to the transformed grammar.</Paragraph> <Paragraph position="1"> 2. If $B$ is a proper left corner of $A$ and $B \rightarrow X\beta$ is a production of the original grammar, add $A\text{-}X \rightarrow \beta \; A\text{-}B$ to the transformed grammar.</Paragraph> <Paragraph position="2"> 3. If $X$ is a proper left corner of $A$ and $A \rightarrow X\beta$ is a production of the original grammar, add $A\text{-}X \rightarrow \beta$ to the transformed grammar.</Paragraph> <Paragraph position="3"> In Rosenkrantz and Lewis's original LC transform, schema 2 applied whenever $B$ is a left corner of $A$, including all cases where $B = A$. In Johnson's version schema 2 applies when $B = A$ only if $A$ is a proper left corner of itself. Johnson then introduces schema 3 to handle the residual cases, without introducing instances of nonterminals of the form $A\text{-}A$ that need to be allowed to derive the empty string. The original purpose of the LC transform is to allow simulation of left-corner parsing by top-down parsing, but it also eliminates left recursion from any noncyclic CFG. 
5 Furthermore, in the worst case, the total number of symbols in the transformed grammar cannot exceed a fixed multiple of the square of the number of symbols in the original grammar, in contrast to Paull's algorithm, which exponentiates the size of the grammar in the worst case.</Paragraph> <Paragraph position="4"> Thus, we can use Johnson's version of the LC transform directly to eliminate left recursion. Before applying this idea, however, we have one general improvement to make in the transform. Johnson notes that in his version of the LC transform, a new nonterminal of the form $A\text{-}X$ is useless unless $X$ is a proper left corner of $A$. We further note that a new nonterminal of the form $A\text{-}X$, as well as the original nonterminal $A$, is useless in the transformed grammar, unless $A$ is either the top nonterminal of the grammar or appears on the right-hand side of an original grammar production in other than the leftmost position. This can be shown by induction on the length of top-down derivations using the productions of the transformed grammar. Therefore, we will call the original nonterminals meeting this condition &quot;retained nonterminals&quot; and restrict the LC transform so that productions involving nonterminals of the form $A\text{-}X$ are created only if $A$ is a retained nonterminal.</Paragraph> <Paragraph position="5"> Let LC refer to Johnson's version of the LC transform restricted to retained nonterminals. In Table 4 the first three lines repeat the previously shown sizes for our three original grammars, their left-factored form, and their non-left-recursive form using our best variant of Paull's algorithm (LF+NLRG+PA).</Paragraph> <Paragraph position="6"> The fourth line shows the results of applying LC to the three original grammars. 
Note that this produces a non-left-recursive form of the PT grammar smaller than the cut-off size, but the non-left-recursive forms of the CT and ATIS grammars are considerably larger than the most compact versions created with Paull's algorithm.</Paragraph> <Paragraph position="7"> 5 In the case of a cyclic CFG, schema 2 fails to guarantee a non-left-recursive transformed grammar.</Paragraph> <Paragraph position="8"> We can improve on this result by noting that, since we are interested in the LC transform only as a means of eliminating left recursion, we can greatly reduce the size of the transformed grammars by applying the transform only to left-recursive nonterminals. More precisely, we can retain in the transformed grammar all the productions expanding non-left-recursive nonterminals of the original grammar, and for the purposes of the LC transform, we can treat non-left-recursive nonterminals as if they were terminals: 1. If a terminal symbol or non-left-recursive nonterminal $X$ is a proper left corner of a retained left-recursive nonterminal $A$ in the original grammar, add $A \rightarrow X \; A\text{-}X$ to the transformed grammar.</Paragraph> <Paragraph position="9"> 2. If $B$ is a left-recursive proper left corner of a retained left-recursive nonterminal $A$ and $B \rightarrow X\beta$ is a production of the original grammar, add</Paragraph> </Section> <Section position="7" start_page="253" end_page="253" type="metho"> <SectionTitle> $A\text{-}X \rightarrow \beta \; A\text{-}B$ to the transformed grammar. </SectionTitle> <Paragraph position="0"> 3. If $X$ is a proper left corner of a retained left-recursive nonterminal $A$ and $A \rightarrow X\beta$ is a production of the original grammar, add $A\text{-}X \rightarrow \beta$ to the transformed grammar.</Paragraph> <Paragraph position="1"> 4. 
If $A$ is a non-left-recursive nonterminal and $A \rightarrow \beta$ is a production of the original grammar, add $A \rightarrow \beta$ to the transformed grammar.</Paragraph> <Paragraph position="2"> Let LCLR refer to the LC transform restricted by these modifications so as to apply only to left-recursive nonterminals. The fifth line of Table 4 shows the results of applying LCLR to the three original grammars. LCLR greatly reduces the size of the non-left-recursive forms of the CT and ATIS grammars, but the size of the non-left-recursive form of the PT grammar is only slightly reduced. This is not surprising if we note from Table 1 that almost all the productions of the PT grammar are productions for left-recursive nonterminals. However, we can apply the additional transformations that we used with Paull's algorithm, to reduce the number of productions for left-recursive nonterminals before applying our modified LC transform. The effects of left factoring the grammar before applying LCLR (LF+LCLR), and additionally combining non-left-recursive productions for left-recursive nonterminals between left factoring and applying LCLR (LF+NLRG+LCLR), are shown in the sixth and seventh lines of Table 4.</Paragraph> <Paragraph position="3"> With all optimizations applied, the non-left-recursive forms of the ATIS and PT grammars are smaller than the originals (although not smaller than the left-factored forms of these grammars), and the non-left-recursive form of the CT grammar is only slightly larger than the original. In all cases, LF+NLRG+LCLR produces more compact grammars than LF+NLRG+PA, the best variant of Paull's algorithm: slightly more compact in the case of the CT grammar, more compact by a factor of 5.9 in the case of the ATIS grammar, and more compact by at least two orders of magnitude in the case of the PT grammar.</Paragraph> </Section> </Paper>
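The three schemas of Johnson's LC transform from Section 5 can be sketched as below. This is a simplified illustration under several assumptions, not the LC or LCLR procedure evaluated in Table 4: it applies the schemas to every nonterminal (no restriction to retained or left-recursive nonterminals), represents grammars as dicts from names to lists of right-hand-side tuples, assumes no empty productions in the input, and spells a new nonterminal A-X as the string "A-X", which presumes symbol names contain no hyphen. All function names are hypothetical.

```python
def proper_left_corners(grammar):
    """Transitive (not reflexive) closure of the direct-left-corner relation."""
    plc = {a: {rhs[0] for rhs in rhss} for a, rhss in grammar.items()}
    changed = True
    while changed:
        changed = False
        for a in plc:
            for b in list(plc[a]):
                extra = plc.get(b, set()) - plc[a]  # terminals have no entry
                if extra:
                    plc[a] |= extra
                    changed = True
    return plc


def lc_transform(grammar):
    """Apply schemas 1-3 to every nonterminal of the grammar."""
    plc = proper_left_corners(grammar)
    out = {}

    def add(lhs, rhs):
        rules = out.setdefault(lhs, [])
        if rhs not in rules:
            rules.append(rhs)

    for a in grammar:
        for x in sorted(plc[a]):
            if x not in grammar:
                # schema 1: x is a terminal proper left corner of a
                add(a, (x, f"{a}-{x}"))
            else:
                # schema 2: x plays the role of B, with productions B -> X beta
                for rhs in grammar[x]:
                    add(f"{a}-{rhs[0]}", rhs[1:] + (f"{a}-{x}",))
        for rhs in grammar[a]:
            # schema 3: a -> X beta gives a-X -> beta
            add(f"{a}-{rhs[0]}", rhs[1:])
    return out


# The example has no one-symbol productions, so no right-hand side becomes empty.
example = {"S": [("S", "a"), ("b", "c")]}
for lhs, rules in lc_transform(example).items():
    print(lhs, "->", " | ".join(" ".join(r) for r in rules))
```

On this example the result is S -> b S-b, S-S -> a S-S | a, and S-b -> c S-S | c, which generates the same strings (b c followed by any number of a) with no nonterminal as a proper left corner of itself.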