File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/87/p87-1016_metho.xml
Size: 15,541 bytes
Last Modified: 2025-10-06 14:12:02
<?xml version="1.0" standalone="yes"?> <Paper uid="P87-1016"> <Title>ON THE SUCCINCTNESS PROPERTIES OF UNORDERED CONTEXT-FREE GRAMMARS</Title> <Section position="3" start_page="0" end_page="112" type="metho"> <SectionTitle> 2 Introduction </SectionTitle> <Paragraph position="0"> Context-free grammars in immediate dominance and linear precedence (ID/LP) format were used in GPSG [3] as a skeleton for metarule generation and feature checking. It is intuitively obvious that grammars in this form can describe languages which are closed under the operation of taking arbitrary permutations of strings in the language. (Such languages will be called symmetric.) Ordinary context-free grammars, on the other hand, seem to require that all permutations of right-hand sides of productions be explicitly listed, in order to describe certain symmetric languages. For an explicit example, consider the n-letter alphabet $\Sigma_n = \{a_1, \ldots, a_n\}$. Let $P_n$ be the set of all strings which are permutations of exactly these letters. It seems obvious that no context-free grammar could generate this language without explicitly listing it. Now try to prove that this is the case. This is in essence what we do in this paper. We also hope to get the audience for the paper interested in why the proof works! To give some idea of the difficulty of our problem, we begin by recounting Barton's results [1], presented at this conference in 1985. (There is a general discussion in [2].) He showed that the universal recognition problem (URP) for ID/LP grammars is NP-complete (see footnote 1). This means that if $P \ne NP$, then no polynomial-time algorithm can solve this problem. 
The difficulty of the problem seems to arise from the fact that the translation from an ID/LP grammar to a weakly equivalent CFG blows up exponentially.</Paragraph> <Paragraph position="1"> It is easy to show, assuming $P \ne NP$, that no reasonable transformation from ID/LP grammars to equivalent CFGs can be done in polynomial time; Rounds has done this as a remark in [8]. In this paper, we remove the hypothesis $P \ne NP$. That is, we can show that no algorithm whatever can effect the translation polynomially in all cases. (Unfortunately, this does not solve the [...])</Paragraph> <Paragraph position="2"> Footnote 1: The universal recognition problem is to tell, for an ID/LP grammar $G$ and a string $w$, whether or not $w \in L(G)$.</Paragraph> <Paragraph position="4"> Barton's reduction took a known NP-complete problem, the vertex-cover problem, and reduced it to the URP for ID/LP grammars. The reduction makes crucial use of grammars whose production size can be arbitrarily large. Define the fanout of a grammar to be the largest total number of symbol occurrences on the right-hand side of any production. For a CFG, this is the maximum length of any right-hand side; for an ID/LP grammar, we count symbols together with their multiplicities. Barton's reduction does the following. For each instance of the vertex-cover problem, of size n, he constructs a string $w$ and an ID/LP grammar of fanout proportional to n such that the instance has a vertex cover if and only if the string is generated by the grammar. He also notes that if all ID/LP grammars have fanout bounded by a fixed constant, then the URP can be solved in polynomial time.</Paragraph> <Paragraph position="5"> This brings us to the statement of our results. Let $P_n$ be the language described above. Clearly this language can be generated by the ID/LP grammar $S \to a_1, \ldots, a_n$, whose size in bits is $O(n \log n)$.</Paragraph> <Paragraph position="6"> Theorem 1. There is a constant $c &gt; 1$ such that any context-free grammar $G_n$ generating $P_n$ must have size $\Omega(c^n)$. 
(See footnote 2.) Moreover, every ID/LP grammar generating $P_n$ whose fanout is bounded by a fixed constant must likewise have exponential size.</Paragraph> <Paragraph position="7"> The theorem does not actually depend on having a vocabulary which grows with n. It is possible to code everything homomorphically into a two-letter alphabet.</Paragraph> <Paragraph position="8"> However, we think that the result shows that ordinary CFGs, and bounded-fanout ID/LP grammars, are inadequate for giving succinct descriptions of languages whose vocabulary is open, and whose word order can be very free. Thus, we prefer the statement of the result as it is. We start the paper with the technical results, in Section 3, and continue with a discussion of the implications for linguistics in Section 4. The final section contains a proof of the Interchange Lemma of Ogden, Ross, and Winklmann [7], which is the main tool used for our results. This proof is included, not because it is new, but because we want to show a beautiful example of the use of combinatorial principles in formal linguistics, and because we think the proof may be generalized to other classes of grammars.</Paragraph> <Paragraph position="9"> Footnote 2: This notation means that for infinitely many n, the size of $G_n$ must be bigger than $c^n$.</Paragraph> </Section> <Section position="4" start_page="112" end_page="113" type="metho"> <SectionTitle> 3 Technical Results </SectionTitle> <Paragraph position="0"> As we have said, our basic tool is the Interchange Lemma, which was first used to show that the &quot;embedded reduplication&quot; language $\{ wxxy \mid w, x, y \in \{a, b, c\}^* \}$ is not context-free. It was also used in Kac, Manaster-Ramer, and Rounds [6] to show that English is not context-free, and by Rounds, Manaster-Ramer, and Friedman to show that reduplication even over length-n strings requires context-free grammar size exponential in n. 
The current application uses the last-mentioned technique, but the argument is more complicated.</Paragraph> <Paragraph position="1"> We will discuss the Interchange Lemma (IL) informally, then state it formally. We will then show how to apply it in our case.</Paragraph> <Paragraph position="2"> The IL relies on the following basic observation. Suppose we have a context-free language, and two strings in that language, each of which has a substring which is the yield of a subtree whose root is labeled by the same nonterminal symbol. Then these substrings can be interchanged, and the resulting strings will still be in the language. This is what distinguishes the IL from the Pumping Lemma, which finds repeated nonterminals in the derivation tree of just one string.</Paragraph> <Paragraph position="3"> The next observation about the IL is that it attempts to find these interchangeable strings among the length-n strings of the given language. Moreover, we want to find a whole set of such strings, such that in the set, the interchanged substrings all have the same length, and all start at the same position in the host string. The lemma lets us select a number m less than n, and tells us that the length k of the interchangeable substrings is between $m/r$ and $m$, where r is the fanout of the grammar. Finally, the lemma gives us an estimate of the size of the interchangeable subset. We may choose an arbitrary subset $Q(n)$ of $L(n)$, where $L(n)$ is the set of length-n strings in the language $L$. If we also choose an integer $m &lt; n$, then the IL tells us that there is an interchangeable set $A \subseteq Q(n)$ such that $|A| \ge |Q(n)| / (|N| \cdot n^2)$, where the vertical bars denote cardinality, and $N$ is the set of nonterminals of the given grammar. (The interchanged strings need not stay in $Q(n)$, but they do stay in $L(n)$.) Notice that if $Q(n)$ is exponential in size, then $A$ will be also. 
Thus, if a language has exponentially many strings of length n, then it will have an interchangeable subset of roughly the same exponential size, provided the set of nonterminals of the grammar is small. Our proof turns this idea around. We show that any context-free description of the permutation language $L(n)$ must have an exponentially large set of nonterminals, because an interchangeable subset of this language cannot be of the same exponential order as $n!$, which is the size of $L(n)$.</Paragraph> <Paragraph position="4"> Now we can give a more formal statement of the lemma.</Paragraph> <Paragraph position="5"> Definition. Suppose that $A$ is a subset $\{z_1, \ldots, z_p\}$ of $L(n)$. $A$ has the k-interchangeability property iff there are substrings $x_1, \ldots, x_p$ of $z_1, \ldots, z_p$ respectively, such that each $x_i$ has length k, each $x_i$ occurs in the same relative position in its $z_i$, and such that if $z_i = w_i x_i y_i$ and $z_j = w_j x_j y_j$ for any i and j, then $w_i x_j y_i$ is an element of $L(n)$.</Paragraph> <Paragraph position="6"> Interchange Lemma. Let $G$ be a CFG or ID/LP grammar with fanout r, and with nonterminal alphabet $N$. Let m and n be any positive natural numbers with $r &lt; m \le n$. Let $L(n)$ be the set of length-n strings in $L(G)$, and $Q(n)$ be a subset of $L(n)$. Then we can find a k-interchangeable subset $A$ of $Q(n)$, such that $m/r \le k \le m$, and such that $|A| \ge |Q(n)| / (|N| \cdot n^2)$.</Paragraph> <Paragraph position="7"> Now we can prove our main theorem. First we show that no CFG of fanout 2 can generate $L(n)$ without an exponential number of nonterminals. The theorem for any CFG then follows, because any CFG can be transformed into a CFG with fanout 2 by a process essentially like that of transforming into Chomsky normal form, but without having to eliminate $\epsilon$-productions or unit productions. This process at most cubes the grammar size, and the result follows because the cube root of an exponential is still an exponential. 
The proof for bounded-fanout ID/LP grammars is a direct adaptation of the proof for fanout 2, which we now give.</Paragraph> <Paragraph position="8"> Let $P_n$ be the permutation language above, and let $G$ be a fanout 2 grammar for this language. Apply the Interchange Lemma to $G$, choosing $Q(n) = P_n$, $r = 2$, and $m = n/2$. (n will be chosen as a multiple of 4.) Observe that $|Q(n)| = |L(n)| = n!$. From the IL, we get a k-interchangeable subset $A$ of $L(n)$, such that $n/4 \le k \le n/2$, and such that $|A| \ge n! / (|N| \cdot n^2)$.</Paragraph> <Paragraph position="10"> Next we use the fact that $A$ is k-interchangeable to get an upper bound on its cardinality. Let $w_1 x_1 y_1$ and $w_2 x_2 y_2$ be members of $A$, and let $\Sigma(x)$ be the set of alphabet characters appearing in $x$.</Paragraph> <Paragraph position="11"> We claim that $\Sigma(x_1) = \Sigma(x_2)$.</Paragraph> <Paragraph position="12"> For if, say, $x_1$ has a character not occurring in $x_2$, then the interchanged string $w_1 x_2 y_1$ will lack that character, so some other character must occur in it twice, and thus it will not be in $L(n)$, contrary to what the IL requires. Without loss of generality, $\Sigma(x) = \{a_1, \ldots, a_k\}$.</Paragraph> <Paragraph position="13"> The number of strings in $A$ is thus less than or equal to the number of ways of arranging the $x$ string, that is, $k!$, times the number of ways of arranging the characters in the rest of the string, that is, $(n - k)!$. In other words, $|A| \le k!\,(n - k)!$.</Paragraph> <Paragraph position="14"> Putting the two inequalities together and solving for $|N|$, we get $|N| \ge \frac{n!}{k!\,(n - k)!} \cdot n^{-2} = \binom{n}{k} \cdot n^{-2}$. From Pascal's triangle in high school mathematics, $\binom{n}{k}$ increases with k until $k = n/2$. Thus since $n/4 \le k \le n/2$, we have $\binom{n}{k} \ge \binom{n}{n/4}$, which, by using Stirling's approximation $m! \sim m^m e^{-m} \sqrt{2 \pi m}$ to estimate the various factorials, grows exponentially with n. Therefore, so does $|N|$ (the factor $n^{-2}$ is only polynomial), and our theorem is proved.</Paragraph> <Paragraph position="15"> To obtain the result for a two-letter alphabet, consider the homomorphism sending the letter $a_j$ into $0^j 1$. 
Let $K_n$ be the image of $P_n$ under this mapping. Then, because the mapping is one-to-one, $P_n$ is the inverse homomorphic image of $K_n$. If for every $c &gt; 1$ there is a sequence of CFGs $G_n$ generating $K_n$ such that the size of $G_n$ is not $\Omega(c^n)$, then the same is true for the language $P_n$, contradicting Theorem 1. The reason is that the size of a grammar for the inverse homomorphic image of a language need only be polynomially bigger than the size of a grammar for the language itself. The proof of this claim rests on inspection of one of the standard proofs, say that of Hopcroft and Ullman [5]. The result is proved using pushdown automata, but all conversions from PDAs to grammars require only a polynomial increase in size.</Paragraph> <Paragraph position="16"> Our final technical result concerns an n-symbol analogue of the so-called MIX language, which has been conjectured by Marsh not to be an indexed language (see [4] for discussion). We define the language $M_n$ to be the set of all strings over $\Sigma_n$ which have identical numbers of occurrences of each character $a_i$ in $\Sigma_n$. Observe that $M_n$ is infinite for each n. However, there is a sequence of finite sublanguages of the various $M_n$, such that this sequence requires exponentially increasing context-free descriptions. We have the following theorem.</Paragraph> <Paragraph position="17"> Theorem 2. Consider the set $M_n(n^2)$ of all length-$n^2$ strings of $M_n$. Then there is a constant $c &gt; 1$ such that any context-free grammar $G_n$ generating $M_n(n^2)$ must have size $\Omega(c^n)$.</Paragraph> <Paragraph position="18"> Proof. This proof is really just a generalization of the proof of Theorem 1. It uses, however, the $Q$ subsets in a way that the proof of Theorem 1 does not.</Paragraph> <Paragraph position="19"> First, we drop the n subscript in $M_n(n^2)$. Observe next that in every string in $M(n^2)$, each character in $\Sigma_n$ occurs exactly n times. 
Let $Q(n^2) = \{u^n : |u| = n\}$ be the subset of $M(n^2)$ where, as indicated, each string is composed of n identical substrings concatenated in order. Then each $u$ substring must be a permutation of $\Sigma_n$, i.e., a member of $P_n$. Let $G_n$ be a fanout 2 grammar generating $M(n^2)$. As in the proof of Theorem 1, apply the Interchange Lemma to $G_n$, choosing $Q(n^2)$ as above, $r = 2$, and $m = n/2$. Observe that we still have $|Q(n^2)| = n!$. From the IL, we get a k-interchangeable subset $A$ of $Q(n^2)$, such that $n/4 \le k \le n/2$, and such that $|A| \ge n! / (|N| \cdot n^4)$ (the factor $n^4$ arises because the strings now have length $n^2$). Once again we use the fact that $A$ is k-interchangeable to get an upper bound on its cardinality. Let $w_1 x_1 y_1$ and $w_2 x_2 y_2$ be members of $A$, and let $\Sigma(x)$ be the set of alphabet characters appearing in $x$. We claim once again that $\Sigma(x_1) = \Sigma(x_2)$. To see this, notice that the $x$ portion of a string in $A$ can overlap at most one of the boundaries between the successive $u$ strings, because $|u| = n$ and $|x| \le n/2$. If it does not overlap a boundary, then the reasoning is as before. If it does overlap a boundary, then we claim that the characters in $x$ occurring to the right of the boundary must all be different from the characters in $x$ to the left. This is because of the &quot;wraparound phenomenon&quot;: the $u$ strings are identical, so the $x$ characters to the right of the boundary are the same characters which occur to the right of the previous $u$-boundary. Since each $u$ is a permutation of $\Sigma_n$, the claim holds. The same reasoning now applies to show that $\Sigma(x_1) = \Sigma(x_2)$. For if, say, $x_1$ has a character not occurring in $x_2$, then the interchanged string $w_1 x_2 y_1$ will contain fewer than n occurrences of that character, and thus will not be in $M(n^2)$, contrary to what the IL requires. Without loss of generality, $\Sigma(x) = \{a_1, \ldots, a_k\}$. The number of strings in $A$ is less than or equal to the number of ways of selecting one of the $u$ strings. Consider the $u$ string to the left of the boundary which $x$ overlaps. 
Because of wraparound, this $u$ string is still determined by selecting the characters in the k positions of $x$, and then choosing the characters in the remaining $n - k$ positions. Thus we still have $|A| \le k!\,(n - k)!$, and we finish the proof as above.</Paragraph> </Section> </Paper>
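The counting argument in the proof of Theorem 1 ends with a lower bound of the form $\binom{n}{k} / n^2$ on the number of nonterminals, with k between n/4 and n/2. As a rough numeric illustration only, the following Python sketch evaluates the smallest value of that bound over the allowed range of k. The function name `nonterminal_lower_bound` is hypothetical and is not taken from the paper; the sketch assumes n is a positive multiple of 4, as in the proof, and it checks no grammar, it only evaluates the arithmetic.

```python
from fractions import Fraction
from math import comb


def nonterminal_lower_bound(n: int) -> Fraction:
    """Smallest value of C(n, k) / n^2 for k in the range n/4 .. n/2.

    Illustrative helper (name and interface are assumptions): it evaluates
    the bound obtained in the proof of Theorem 1 for a fanout 2 grammar
    generating the permutation language on n letters.
    """
    if n <= 0 or n % 4 != 0:
        raise ValueError("n must be a positive multiple of 4")
    # The lemma only guarantees n/4 <= k <= n/2, so take the worst case.
    # C(n, k) increases with k up to n/2, so the minimum sits at k = n/4.
    smallest_binomial = min(comb(n, k) for k in range(n // 4, n // 2 + 1))
    # Exact rational arithmetic avoids floating-point rounding.
    return Fraction(smallest_binomial, n * n)


if __name__ == "__main__":
    # Print the bound for a few sizes to show how quickly it grows with n.
    for n in (8, 16, 32, 64):
        print(n, float(nonterminal_lower_bound(n)))
```

Running the script prints one line per value of n. The division by $n^2$ slows the growth only polynomially, which is the point made at the end of the proof of Theorem 1.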