<?xml version="1.0" standalone="yes"?> <Paper uid="J00-1003"> <Title>Practical Experiments with Regular Approximation of Context-Free Languages</Title> <Section position="5" start_page="32" end_page="33" type="metho"> <SectionTitle> 5. Increasing the Precision </SectionTitle> <Paragraph position="0"> The methods of approximation described above take as input the parts of the grammar that pertain to self-embedding. It is only for those parts that the language is affected.</Paragraph> <Paragraph position="1"> This leads us to a way to increase the precision: before applying any of the above methods of regular approximation, we first transform the grammar.</Paragraph> <Paragraph position="2"> This grammar transformation copies grammar rules containing recursive nonterminals and, in the copies, it replaces these nonterminals by new nonrecursive nonterminals. The new rules take over part of the roles of the old rules, but since the new rules do not contain recursion and therefore do not pertain to self-embedding, they remain unaffected by the approximation process.</Paragraph> <Paragraph position="3"> Consider for example the palindrome grammar from Figure 1. The RTN method will yield a rather crude approximation, namely, the language {a, b}*. We transform this grammar in order to keep the approximation process away from the first three levels of recursion. We achieve this by introducing three new nonterminals S\[1\], S\[2\] and S\[3\], and by adding modified copies of the original grammar rules, so that we aSa l bSb i c aSa i bSb i e The new grammar generates the same language as before, but the approximation process leaves unaffected the nonterminals S\[1\], S\[2\], and S\[3\] and the rules defining them, since these nonterminals are not recursive. These nonterminals amount to the upper three levels of the parse trees, and therefore the effect of the approximation on the language is limited to lower levels. If we apply the RTN method then we obtain the language that consists of (grammatical) palindromes of the form ww R, where w E {C/, a, b} U {a, b} 2 U {a, b} 3, plus (possibly ungrammatical) strings of the form wvw R, where w E {a, b} 3 and v E {a, b}*. (w R indicates the mirror image of w.) The grammar transformation in its full generality is given by the following, which is to be applied for fixed integer j > 0, which is a parameter of the transformation, and for each Ni such that recursive(Ni) = self.</Paragraph> <Paragraph position="4"> For each nonterminal A E Ni we introduce j new nonterminals All\] ..... A~\]. For each A --, X1...Xm in P such that A E Ni, and h such that 1 ~ h < j, we add A\[h\] --* X'I... X&quot; to P, where for 1 < k < m:</Paragraph> <Paragraph position="6"> Further, we replace all rules A --* X1 ... Xm such that A ~ Ni by A --* X~ ... X~m, where for 1 < k < m:</Paragraph> <Paragraph position="8"> If the start symbol S was in Ni, we let S\[1\] be the new start symbol.</Paragraph> <Paragraph position="9"> A second transformation, which shares some characteristics with the one above, was presented in Nederhof (1997). One of the earliest papers suggesting such transformations as a way to increase the precision of approximation is due to ~ulik and Cohen (1973), who only discuss examples, however; no general algorithms were defined.</Paragraph> <Paragraph position="10"> The test material. 
<Paragraph position="10"> Figure 11: The test material. The left-hand curve refers to the construction of the grammar from the 332 sentences; the right-hand curve refers to the corpus of 1,000 sentences used as input to the finite automata.</Paragraph> </Section> <Section position="6" start_page="33" end_page="39" type="metho"> <SectionTitle> 6. Empirical Results </SectionTitle> <Paragraph position="0"> In this section we investigate empirically how the respective approximation methods behave on grammars of different sizes and how much the approximated languages differ from the original context-free languages. The latter question is difficult to answer precisely. Both an original context-free language and an approximating regular language generally consist of an infinite number of strings, and the number of strings that are introduced by a superset approximation, or excluded by a subset approximation, may also be infinite. This makes it difficult to attach numbers to the "quality" of approximations.</Paragraph> <Paragraph position="1"> We have opted for a pragmatic approach, which does not require investigation of the entire infinite languages of the grammar and the finite automata, but looks at a certain finite set of strings taken from a corpus, as discussed below. For this finite set of strings, we measure the percentage that overlaps with the investigated languages.</Paragraph> <Paragraph position="2"> For the experiments, we took context-free grammars for German, generated automatically from an HPSG and a spoken-language corpus of 332 sentences. This corpus consists of sentences exhibiting grammatical phenomena of interest, manually selected from a larger corpus of actual dialogues. An HPSG parser was applied to these sentences, and a form of context-free backbone was extracted from the first derivation that was found. (Taking the first derivation is as good as any other strategy, given that we have at present no mechanism for the relative ranking of derivations.) The label occurring at a node, together with the sequence of labels at the daughter nodes, was then taken to be a context-free rule. The collection of such rules for the complete corpus forms a context-free grammar. Due to the incremental nature of this construction of the grammar, we can consider the subgrammars obtained after processing the first p sentences, where p = 1, 2, 3, ..., 332. See Figure 11 (left) for the relation between p and the number of rules of the grammar. The construction is such that rules have at most two members on the right-hand side.</Paragraph>
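As an illustration of this construction (a sketch under assumed data structures, not the paper's tooling), context-free rules can be read off derivation trees as follows; the tree representation and the function names are hypothetical.

def rules_from_tree(tree):
    # A derivation tree is assumed to be a pair (label, daughters), where
    # daughters is a list of subtrees; leaves have an empty daughter list.
    label, daughters = tree
    rules = set()
    if daughters:
        # the label at a node together with the labels of its daughters
        # is taken to be a context-free rule
        rules.add((label, tuple(d[0] for d in daughters)))
        for d in daughters:
            rules |= rules_from_tree(d)
    return rules

def grammar_from_corpus(trees):
    # Union of the rules over all derivation trees; processing the trees
    # incrementally gives the subgrammars for the first p sentences.
    grammar = set()
    for tree in trees:
        grammar |= rules_from_tree(tree)
    return grammar

# toy example with a single derivation tree
tree = ("S", [("NP", [("Det", []), ("N", [])]), ("VP", [("V", [])])])
print(sorted(grammar_from_corpus([tree])))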
<Paragraph position="3"> As input, we considered a set of 1,000 sentences, obtained independently from the 332 sentences mentioned above. These 1,000 sentences were found by having a speech recognizer provide a single hypothesis for each utterance, where the utterances come from actual dialogues. Figure 11 (right) shows how many sentences of different lengths the corpus contains, up to length 30. Above length 25 this number quickly declines, but a fair number of longer strings can still be found, e.g., 11 strings of a length between 51 and 60 words. In most cases, however, such long strings are in fact composed of a number of shorter sentences.</Paragraph> <Paragraph position="4"> Each of the 1,000 sentences was input in its entirety to the automata, although in practical spoken-language systems one is often not interested in the grammaticality of complete utterances, but rather tries to find substrings that form certain phrases bearing information relevant to the understanding of the utterance. We will not be concerned here with the exact way such recognition of substrings could be realized by means of finite automata, since this is outside the scope of this paper.</Paragraph> <Paragraph position="5"> For the respective methods of approximation, we measured the size of the compact representation of the nondeterministic automaton, the number of states and the number of transitions of the minimal deterministic automaton, and the percentage of sentences that were recognized, compared to the percentage of grammatical sentences. For the compact representation, we counted the number of lines, which is roughly the sum of the numbers of transitions over all subautomata, not counting roughly three additional lines of overhead per subautomaton.</Paragraph>
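As a concrete reading of these measures (a sketch under assumed interfaces, not the actual experimental setup), the recognized and grammatical percentages and the size of the compact representation could be computed as follows; the predicates passed to percentage are hypothetical stand-ins for membership tests against the automaton and the grammar.

def percentage(predicate, sentences):
    # share of the test corpus (e.g., the 1,000 sentences) passing a test
    return 100.0 * sum(1 for s in sentences if predicate(s)) / len(sentences)

def compact_size(subautomata, overhead=3):
    # number of lines of the compact representation: the transitions of all
    # subautomata, plus roughly three lines of overhead per subautomaton
    return sum(len(transitions) + overhead for transitions in subautomata)

# usage, with hypothetical membership tests:
# pct_recognized  = percentage(lambda s: automaton_recognizes(s), corpus)
# pct_grammatical = percentage(lambda s: grammar_generates(s), corpus)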
<Paragraph position="6"> We investigated the size of the compact representation because it is reasonably implementation independent, barring optimizations of the approximation algorithms themselves that affect the sizes of the subautomata. For some methods, we show that there is a sharp increase in the size of the compact representation for a small increase in the size of the grammar, which gives a strong indication of how difficult it would be to apply the method to much larger grammars. Note that the size of the compact representation is a (very) rough indication of how much effort is involved in determinization, minimization, and substitution of the subautomata into each other.</Paragraph> <Paragraph position="7"> For determinization and minimization of automata, we have applied programs from the FSM library described in Mohri, Pereira, and Riley (1998). This library is considered competitive with other tools for processing finite-state machines.</Paragraph> <Paragraph position="8"> When these programs cannot determinize or minimize, in reasonable time and space, some of the subautomata constructed by a particular method of approximation, this can be regarded as an indication that the method is impractical.</Paragraph> <Paragraph position="9"> We were not able to compute the compact representation for all the methods and all the grammars. The refined RTN approximation from Section 4.2 proved to be quite problematic: we were not able to compute the compact representation for any of the automatically obtained grammars in our collection that were self-embedding.</Paragraph> <Paragraph position="10"> We therefore eliminated individual rules by hand, starting from the smallest self-embedding grammar in our collection, eventually obtaining grammars small enough to be handled by this method. The results are given in Table 1. Note that the size of the compact representation increases significantly for each additional grammar rule. The sizes of the finite automata, after determinization and minimization, remain relatively small.</Paragraph> <Paragraph position="11"> Also problematic was the first approximation from Section 4.4, which was based on LR parsing following Pereira and Wright (1997). Even for the grammar of 50 rules, we were not able to determinize and minimize one of the subautomata according to step 1 of Section 3: we stopped the process after it had reached a size of over 600 megabytes. Results, as far as we could obtain them, are given in Table 2. Note the sharp increases in the size of the compact representation, resulting from small increases, from 44 to 47 and from 47 to 50, in the number of rules, and note an accompanying sharp increase in the size of the finite automaton. For this method, we see no possibility of accomplishing the complete approximation process, including determinization and minimization, for grammars in our collection that are substantially larger than 50 rules. Since no grammars of interest could be handled by these two methods, they will be left out of further consideration.</Paragraph> <Paragraph position="12"> Below, we refer to the unparameterized and parameterized approximations based on RTNs (Section 4.1) as RTN and RTNd, respectively, for d = 2, 3; to the subset approximation from Figure 9 as Subd, for d = 1, 2, 3; and to the second and third methods from Section 4.4, which were based on LR parsing following Baker (1981) and Bermudez and Schimpf (1990), as LR and LRd, respectively, for d = 2, 3. We refer to the subset approximation based on left-corner parsing from Section 4.5 as LCd, for the maximal stack height d = 2, 3, 4; and to the methods discussed in Section 4.6 as Unigram, Bigram, and Trigram.</Paragraph> <Paragraph position="13"> We first discuss the compact representation of the nondeterministic automata. In Figure 12 we use two different scales to be able to represent the large variety of values. For the method Subd, the compact representation is of purely theoretical interest for grammars larger than 156 rules in the case of Sub1, larger than 62 rules in the case of Sub2, and larger than 35 rules in the case of Sub3, since the minimal deterministic automata could thereafter no longer be computed within a reasonable bound on resources; we stopped the processes after they had consumed over 400 megabytes. For LC3, LC4, RTN3, LR2, and LR3, this was also the case for grammars larger than 139, 62, 156, 217, and 156 rules, respectively. The sizes of the compact representation seem to grow moderately for LR and Bigram, shown in the upper panel, yet these sizes are much larger than those for RTN and Unigram, shown in the lower panel.</Paragraph> <Paragraph position="14"> The numbers of states for the respective methods are given in Figure 13, again using two very different scales. As in the case of the grammars, the terminals of our finite automata are parts of speech rather than words. This means that in general there will be nondeterminism during application of an automaton to an input sentence, due to lexical ambiguity. This nondeterminism can be handled efficiently using tabular techniques, provided the number of states is not too high. This consideration favors methods that produce low numbers of states, such as Trigram, LR, RTN, Bigram, and Unigram.</Paragraph>
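To illustrate the point about lexical ambiguity (a sketch only, not the system used in the experiments; the automaton encoding and the toy lexicon are assumptions), a nondeterministic automaton over part-of-speech terminals can be simulated tabularly by keeping a set of reachable states per input position:

def recognizes(transitions, start, finals, tag_sets):
    # transitions: dict mapping (state, tag) to a set of successor states;
    # tag_sets: for each input word, the set of its possible POS tags.
    # The work per word is bounded by the number of states times the number
    # of tags, which is why a low number of states is advantageous.
    states = {start}
    for tags in tag_sets:
        states = {q2 for q in states for tag in tags
                  for q2 in transitions.get((q, tag), ())}
        if not states:
            return False   # no reachable states left: reject early
    return bool(states & finals)

# toy automaton accepting the tag sequence Det N V
transitions = {(0, "Det"): {1}, (1, "N"): {2}, (2, "V"): {3}}
ambiguous_input = [{"Det"}, {"N", "V"}, {"V", "N"}]   # one tag set per word
print(recognizes(transitions, 0, {3}, ambiguous_input))   # True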
<Paragraph position="15"> Figure 13: Number of states of the determinized and minimized automata.</Paragraph> <Paragraph position="17"> Note that the numbers of states for LR and RTN differ very little. In fact, for some of the smallest and for some of the largest grammars in our collection, the resulting automata were identical. Note, however, that the intermediate results for LR (Figure 12) are much larger. It should therefore be concluded that the "sophistication" of LR parsing is here merely an avoidable source of inefficiency. The numbers of transitions for the respective methods are given in Figure 14. Again, note the different scales used in the two panels. The numbers of transitions roughly correspond to the storage requirements for the automata. It can be seen that, again, Trigram, LR, RTN, Bigram, and Unigram perform well.</Paragraph> <Paragraph position="18"> The precision of the respective approximations is measured in terms of the percentage of sentences in the corpus that are recognized by the automata, in comparison to the percentage of sentences that are generated by the grammar, as presented in Figure 15. The lower panel is an enlargement of a section of the upper panel. Methods that could be applied only to the smaller grammars are presented only in the lower panel; LC4 and Sub2 have been omitted entirely.</Paragraph> <Paragraph position="19"> The curve labeled G represents the percentage of sentences generated by the grammar. Note that since all approximation methods compute either supersets or subsets, a particular automaton cannot both recognize some ungrammatical sentences and reject some grammatical sentences.</Paragraph> <Paragraph position="20"> Unigram and Bigram recognize very high percentages of ungrammatical sentences. Much better results were obtained for RTN. The curve for LR would not be distinguishable from that for RTN in the figure, and is therefore omitted. (For only two of the investigated grammars was there any difference, the largest difference occurring for grammar size 217, where 34.1 versus 34.5 percent of sentences were recognized in the cases of LR and RTN, respectively.) Trigram remains very close to RTN (and LR); for some grammars a lower percentage is recognized, for others a higher one. LR2 seems to improve slightly over RTN and Trigram, but data is available only for small grammars, due to the difficulty of applying the method to larger grammars. A more substantial improvement is found for RTN2. Even smaller percentages are recognized by LR3 and RTN3, but again, data is available only for small grammars.</Paragraph> <Paragraph position="21"> The subset approximations LC3 and Sub1 remain very close to G, but here again only data for small grammars is available, since these two methods could not be applied to larger grammars. Although applying LC2 to larger grammars required relatively few resources, the approximation is very crude: only a small percentage of the grammatical sentences is recognized.</Paragraph> <Paragraph position="22"> We also performed experiments with the grammar transformation from Section 5, in combination with the RTN method. We found that for increasing j, the intermediate automata soon became too large to be determinized and minimized, with a bound on memory consumption of 400 megabytes. The sizes of the automata that we were able to compute are given in Figure 16. RTN+j, for j = 1, 2, 3, 4, 5, represents the (unparameterized) RTN method in combination with the grammar transformation with parameter j.
This is not to be confused with the parameterized RTNd method.</Paragraph> <Paragraph position="23"> Figure 17 indicates the number of sentences in the corpus that are recognized by an automaton, divided by the number of sentences in the corpus that are generated by the grammar. For comparison, the figure also includes curves for RTNd, where d = 2, 3 (cf. Figure 15). We see that j = 1 and j = 2 have little effect. For j = 3, 4, 5, however, the approximating language becomes substantially smaller than in the case of RTN, but at the expense of large automata. In particular, if we compare the sizes of the automata for RTN+j in Figure 16 with those for RTNd in Figures 13 and 14, then Figure 17 suggests that the large sizes of the automata for RTN+j are not adequately compensated for by the reduction in the percentage of sentences that are recognized. RTNd therefore seems preferable to RTN+j.</Paragraph> </Section> </Paper>