<?xml version="1.0" standalone="yes"?>
<Paper uid="P88-1028">
  <Title>Polynomial Learnability and Locality of Formal Grammars</Title>
  <Section position="3" start_page="0" end_page="226" type="metho">
    <SectionTitle>
2 Polynomial Learnability
</SectionTitle>
    <Paragraph position="0"> &quot;Polynomial learnability&quot; is a complexity-theoretic notion of feasible learnability recently formulated by Blumer et al. [6]. This notion generalizes Valiant's theory of learnable boolean concepts [15], [14] to infinite objects such as formal languages. In this paradigm, the languages are presented via infinite sequences of positive and negative examples (footnote 5) drawn with an arbitrary but time-invariant distribution over the entire space, which in our case is $\Sigma_T^*$. Learners are to hypothesize a grammar at each finite initial segment of such a sequence; in other words, they are functions from finite sequences of members of $\Sigma_T^* \times \{0, 1\}$ to grammars (footnote 6). The criterion for learning is a complexity-theoretic, approximate, and probabilistic one. A learner is said to learn if it can, with an arbitrarily high probability ($1 - \delta$), converge to an arbitrarily accurate (within $\epsilon$) grammar in a feasible number of examples. &quot;A feasible number of examples&quot; means, more precisely, polynomial in the size of the grammar it is learning and in the degrees of probability and accuracy that it achieves, $\delta^{-1}$ and $\epsilon^{-1}$. &quot;Accurate within $\epsilon$&quot; means, more precisely, that the output grammar can predict, with error probability $\epsilon$, future events (examples) drawn from the same distribution from which it has been presented examples for learning. We now formally state this criterion (footnote 7).
Definition 2.1 (Polynomial Learnability) A collection of languages $\mathcal{L}$ with an associated &quot;size&quot; function with respect to some fixed representation mechanism is polynomially learnable if and only if (footnote 8):</Paragraph>
    <Paragraph position="2"> [... and $f$ is computable in time polynomial in the length of the input]. If in addition all of $f$'s output grammars on example sequences for languages in $\mathcal{L}$ belong to $G$, then we say that $\mathcal{L}$ is polynomially learnable by $G$.</Paragraph>
    <Paragraph position="3"> Suppose we take the sequence of the hypotheses (grammars) made by a learner on successive initial finite sequences of examples, and plot the &quot;errors&quot; of those grammars with respect to the language being learned. The two learnability criteria, &quot;identification in the limit&quot; and &quot;polynomial learnability&quot;, require different kinds of convergence behavior of such a sequence, as is illustrated in Figure 1.</Paragraph>
    <Paragraph position="4"> Blumer et al. [6] show an interesting connection between polynomial learnability and data compression. The connection is one way: if there exists a polynomial time algorithm which reliably &quot;compresses&quot; any sample of any language in a given collection to a provably small consistent grammar for it, then such an algorithm polynomially learns that collection. We state this theorem in a slightly weaker form.</Paragraph>
    <Paragraph position="5"> Definition 2.2 Let $\mathcal{L}$ be a language collection with an associated size function &quot;size&quot;, and for each $n$ let $\mathcal{L}_n = \{L \in \mathcal{L} \mid \mathrm{size}(L) \le n\}$. Then $\mathcal{A}$ is an Occam algorithm for $\mathcal{L}$ with range size (footnote 9) $f(m, n)$ if and only if: [... and $\mathcal{A}$ runs in time polynomial in $|\bar{t}_m|$].</Paragraph>
    <Paragraph position="6"> Theorem 2.1 (Blumer et al.) If $\mathcal{A}$ is an Occam algorithm for $\mathcal{L}$ with range size $f(n, m) = O(n^k m^\alpha)$ for some $k \ge 1$, $0 &lt; \alpha &lt; 1$ (i.e. less than linear in sample size and polynomial in complexity of language), then $\mathcal{A}$ polynomially learns $\mathcal{L}$.</Paragraph>
    <Paragraph position="8"> Footnote 3: We hold no particular stance on the validity of the claim that children make no use of negative examples. We do, however, maintain that the investigation of learnability of grammars from both positive and negative examples is a worthwhile endeavour for at least two reasons: First, it has a potential application for the design of natural language systems that learn. Second, it is possible that children do make use of indirect negative information.
Footnote 5: We let $\mathcal{EX}(L)$ denote the set of infinite sequences which contain only positive and negative examples for $L$, so indicated.
Footnote 6: We let $\mathcal{F}$ denote the set of all such functions.
Footnote 7: The following presentation uses concepts and notation of formal learning theory, cf. [12].
Footnote 8: Note the following notation. The initial segment of a sequence $t$ up to the $n$-th element is denoted by $\bar{t}_n$. $L$ denotes some fixed mapping from grammars to languages: if $G$ is a grammar, $L(G)$ denotes the language generated by it. If $L_1$ is a language, $\mathrm{size}(L_1)$ denotes the size of a minimal grammar for $L_1$. $A \Delta B$ denotes the symmetric difference, i.e. $(A - B) \cup (B - A)$. Finally, if $P$ is a probability measure on $\Sigma_T^*$, then $P^*$ is the canonical product extension of $P$.</Paragraph>
    <Paragraph position="9"> Footnote 9: In [6] the notion of &quot;range dimension&quot; is used in place of &quot;range size&quot;, which is the Vapnik-Chervonenkis dimension of the hypothesis class. Here, we use the fact that the dimension of a hypothesis class with a size bound is at most equal to that size bound.</Paragraph>
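    <Paragraph> A minimal sketch of how the condition in Definition 2.1 could be written out, assuming the notation introduced in this section. It is one reading consistent with the prose description (probability at least $1 - \delta$, error at most $\epsilon$, sample size polynomial in $\epsilon^{-1}$, $\delta^{-1}$ and the size of the language), not necessarily the authors' exact formula. The name $L_t$ for the target language and the polynomial $q$ are introduced here only for illustration.

```latex
\[
\exists f \in \mathcal{F} \;\; \exists q \ \text{(a polynomial)} \;\;
\forall L_t \in \mathcal{L} \;\; \forall P \ \text{(a probability measure on } \Sigma_T^*\text{)} \;\;
\forall \epsilon, \delta > 0 \;\;
\forall m \ge q(\epsilon^{-1}, \delta^{-1}, \mathrm{size}(L_t)) :
\]
\[
P^*\bigl(\{\, t \in \mathcal{EX}(L_t) \;:\; P\bigl(L(f(\bar{t}_m)) \,\Delta\, L_t\bigr) \le \epsilon \,\}\bigr) \;\ge\; 1 - \delta ,
\]
\[
\text{and } f \text{ is computable in time polynomial in the length of its input.}
\]
```

    Read this way, after $m$ examples the learner must, with probability at least $1 - \delta$ over example sequences, output a grammar whose language differs from the target on a set of probability at most $\epsilon$.
    </Paragraph>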
  </Section>
  <Section position="4" start_page="226" end_page="227" type="metho">
    <SectionTitle>
3 K-Local Context Free Grammars
</SectionTitle>
    <Paragraph position="0"> The notion of &quot;k-locality&quot; of a context free grammar is defined with respect to a formulation of derivations defined originally for TAG's by Vijay-Shanker, Weir, and Joshi [16] [17], which is a generalization of the notion of a parse tree. In their formulation, a derivation is a tree recording the history of rewritings. Each node of a derivation tree is labeled by a rewriting rule, and in particular, the root must be labeled with a rule with the starting symbol as its left hand side. Each edge corresponds to the application of a rewriting; the edge from a rule (host rule) to another rule (applied rule) is labeled with the &quot;position&quot; of the nonterminal in the right hand side of the host rule at which the rewriting takes place.</Paragraph>
    <Paragraph position="1"> The degree of locality of a derivation is the number of distinct kinds of rewritings in it, including the immediate context in which rewritings take place. In terms of a derivation tree, the degree of locality is the number of different kinds of edges in it, where two edges are equivalent just in case the two end nodes are labeled by the same rules, and the edges themselves are labeled by the same node address.</Paragraph>
    <Paragraph position="2"> Definition 3.1 Let $D(G)$ denote the set of all derivation trees of $G$, and let $\tau \in D(G)$. Then the degree of locality of $\tau$, written $\mathrm{locality}(\tau)$, is defined as follows: $\mathrm{locality}(\tau) = \mathrm{card}\{\,(p, q, \eta) \mid \text{there is an edge in } \tau \text{ from a node labeled with } p \text{ to another labeled with } q \text{, and it is itself labeled with } \eta \,\}$.</Paragraph>
    <Paragraph position="3"> The degree of locality of a grammar is the maximum of those of all its derivations.</Paragraph>
    <Paragraph position="4"> Definition 3.2 A CFG $G$ is called k-local if $\max\{\mathrm{locality}(\tau) \mid \tau \in D(G)\} \le k$.</Paragraph>
    <Paragraph position="5">  We write k-local-CFG $= \{G \mid G \in \mathrm{CFG} \text{ and } G \text{ is k-local}\}$ and k-local-CFL $= \{L(G) \mid G \in \text{k-local-CFG}\}$.</Paragraph>
    <Paragraph position="6"> Example 3.1 $L_1 = \{a^n b^n a^m b^m \mid n, m \in N\} \in$ 4-local-CFL, since all the derivations of $G_1 = (\{S, S_1\}, \{a, b\}, S, \{S \to S_1 S_1,\; S_1 \to a S_1 b,\; S_1 \to \lambda\})$ generating $L_1$ have degree of locality at most 4. For example, the derivation for the string $a^3 b^3 a b$ has degree of locality 4, as shown in Figure 2.</Paragraph>
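    <Paragraph> As an illustration of Definition 3.1, the degree of locality of a derivation tree can be computed by collecting the distinct (host rule, applied rule, position) triples found on its edges. The Python sketch below is only an illustration: the `Node` class, the `nest` helper, the rule strings, and the convention that positions count the symbols of a rule body starting from 1 are all assumptions made for this example, not notation from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A derivation-tree node: a rule label plus children keyed by position."""
    rule: str
    # Maps the position of a nonterminal in the rule body to the child node.
    children: dict = field(default_factory=dict)


def locality(root):
    """Count the distinct (host rule, applied rule, position) edge kinds."""
    kinds = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for position, child in node.children.items():
            kinds.add((node.rule, child.rule, position))
            stack.append(child)
    return len(kinds)


def nest(depth):
    """Derivation of a^depth b^depth using S1 -> a S1 b and S1 -> lambda."""
    node = Node("S1 -> lambda")
    for _ in range(depth):
        node = Node("S1 -> a S1 b", {2: node})
    return node


# Derivation of a^3 b^3 a b: the root rule S -> S1 S1 with two S1 subtrees.
tree = Node("S -> S1 S1", {1: nest(3), 2: nest(1)})
print(locality(tree))
```

    Under these assumptions the tree for $a^3 b^3 a b$ has four edge kinds: two leaving the root (one per position), one from the recursive rule to itself, and one from the recursive rule to the empty rule.
    </Paragraph>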
    <Paragraph position="7"> A crucial property of k-local grammars, which we will utilize in proving the learnability result, is that for each k-local grammar, there exists another k-local grammar in a specific normal form, whose size is only polynomially larger than the original grammar.</Paragraph>
    <Section position="1" start_page="226" end_page="227" type="sub_section">
      <SectionTitle>
Ga
</SectionTitle>
      <Paragraph position="0"> The normal form in effect puts the grammar into a disjoint union of small grammars, each with at most k rules and k nonterminal occurrences. By &quot;the disjoint union&quot; of an arbitrary set of n grammars, $g_1, \ldots, g_n$, we mean the grammar obtained by first renaming nonterminals in each $g_i$ so that the nonterminal set of each one is disjoint from that of any other, then taking the union of the rules in all those grammars, and finally adding the rule $S \to S_i$ for each starting symbol $S_i$ of $g_i$, and making a brand new symbol $S$ the starting symbol of the grammar so obtained.</Paragraph>
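      <Paragraph> The disjoint union construction just described can be sketched in a few lines of Python. The representation is an assumption made for this example: a grammar is a triple (nonterminal set, start symbol, list of rules), a rule is a pair (head, body tuple), and renaming is done by appending the component index. The sketch also assumes that the new start symbol does not clash with any terminal or renamed nonterminal.

```python
def disjoint_union(grammars, new_start="S"):
    """Rename nonterminals apart, pool all rules, and add new_start -> S_i."""
    rules = []
    for index, (nonterminals, start, grammar_rules) in enumerate(grammars, 1):
        # Suffix every nonterminal with the component index to force disjointness.
        rename = {symbol: f"{symbol}_{index}" for symbol in nonterminals}
        for head, body in grammar_rules:
            new_body = tuple(rename.get(symbol, symbol) for symbol in body)
            rules.append((rename[head], new_body))
        # Link the brand new start symbol to this component's start symbol.
        rules.append((new_start, (rename[start],)))
    return new_start, rules


# Two small grammars that both use the nonterminal name "S".
g1 = ({"S"}, "S", [("S", ("a", "S", "b")), ("S", ())])
g2 = ({"S"}, "S", [("S", ("c", "S")), ("S", ("c",))])
start, rules = disjoint_union([g1, g2])
for head, body in rules:
    print(head, "->", " ".join(body) or "lambda")
```

      Renaming by index is only one way to make the nonterminal sets disjoint; any injective renaming would serve the same purpose.
      </Paragraph>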
      <Paragraph position="1">  Lemma 3.1 (K-Local Normal Form) For every k-local-CFG $H$, if $n = \mathrm{size}(H)$, then there is a k-local-CFG $G$ such that 1. $L(G) = L(H)$.</Paragraph>
      <Paragraph position="2"> 2. $G$ is in k-local normal form, i.e. there is an index set $I$ such that $G = (\Sigma_T, \bigcup_{i \in I} \Sigma_i, S, \{S \to S_i \mid i \in I\} \cup (\bigcup_{i \in I} R_i))$, and if we let $G_i = (\Sigma_T, \Sigma_i, S_i, R_i)$ for each $i \in I$, then (a) each $G_i$ is &quot;k-simple&quot;: $\forall i \in I$, $|R_i| \le k$ and $\mathrm{NTO}(R_i) \le k$ (footnote 11); (b) each $G_i$ has size bounded by $\mathrm{size}(G)$: $\forall i \in I$, $\mathrm{size}(G_i) = O(n)$; (c) all $G_i$'s have disjoint nonterminal sets: $\forall i, j \in I$, $i \ne j \Rightarrow \Sigma_i \cap \Sigma_j = \emptyset$.</Paragraph>
      <Paragraph position="3"> 3. $\mathrm{size}(G) = O(n^{k+1})$.</Paragraph>
      <Paragraph position="4"> Definition 3.3 We let $\phi$ and $\psi$ be any maps that satisfy: if $G$ is any k-local-CFG in k-local normal form, then $\phi(G)$ is the set of all of its k-local components ($G_i$ above). If $\bar{G} = \{G_i \mid i \in I\}$ is a set of k-simple grammars, then $\psi(\bar{G})$ is a single grammar that is a &quot;disjoint union&quot; of all of the k-simple grammars in $\bar{G}$.
Footnote 11: If $R$ is a set of production rules, then $\mathrm{NTO}(R)$ denotes the number of nonterminal occurrences in $R$.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="227" end_page="230" type="metho">
    <SectionTitle>
4 K-Local Context Free Languages
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="227" end_page="229" type="sub_section">
      <SectionTitle>
Are Polynomially Learnable
</SectionTitle>
      <Paragraph position="0"> In this section, we present a sketch of the proof of our main learnability result.</Paragraph>
      <Paragraph position="1"> Theorem 4.1 For each $k \in N$, k-local-CFL is polynomially learnable (footnote 12). Proof: We prove this by exhibiting an Occam algorithm $\mathcal{A}$ for k-local-CFL with some fixed $k$, with range size polynomial in the size of a minimal grammar and less than linear in the sample size.</Paragraph>
      <Paragraph position="2"> We assume that $\mathcal{A}$ is given a labeled m-sample (footnote 13) $S_L$ for some $L \in$ k-local-CFL with $\mathrm{size}(H) = n$, where $H$ is its minimal k-local-CFG. We let $\mathrm{length}(S_L) = \sum_{s \in S_L} \mathrm{length}(s) = l$ (footnote 14). We let $S_L^+$ and $S_L^-$ denote the positive and negative portions of $S_L$ respectively,</Paragraph>
      <Paragraph position="4"> [...] $p$ by Lemma 3.1 and the fact that a minimal consistent k-local-CFG is not larger than $H$. Further, we let $\bar{G}$ be the set of all of the &quot;k-simple components&quot; of $G$ and define $L(\bar{G}) = \bigcup_{G_i \in \bar{G}} L(G_i)$. Then note $L(\bar{G}) = L(G)$.</Paragraph>
      <Paragraph position="5"> Since each k-simple component has at most k nonterminals, we assume without loss of generality that each $G_i$ in $\bar{G}$ has the same nonterminal set of size k, say $\Sigma_k$.</Paragraph>
      <Paragraph position="7"> The idea for constructing $\mathcal{A}$ is straightforward.</Paragraph>
      <Paragraph position="8"> Step 1. We generate all possible rules that may be in the portion of $G$ that is relevant to $S_L^+$. That is, if we fix a set of derivations $\mathcal{D}$, one for each string in $S_L^+$ from $G$, then the set of rules that we generate will contain all the rules that participate in any derivation in $\mathcal{D}$. (We let $\mathrm{Rel}(\bar{G}, S_L^+)$ denote the restriction of $\bar{G}$</Paragraph>
      <Paragraph position="10"> Footnote 14: In the sequel, we refer to the number of strings in a sample as the sample size, and the total length of the strings in a sample as the sample length.</Paragraph>
      <Paragraph position="11"> [...] k-locality of $G$ to show that such a set will be polynomially bounded in the length of $S_L^+$. Step 2. We then generate the set of all possible grammars having at most k of these rules. Since each k-simple component of $\bar{G}$ has at most k rules, the generated set of grammars will include all of the k-simple components of $G$.</Paragraph>
      <Paragraph position="12"> Step 3. We then use the negative portion of the sample, $S_L^-$, to filter out the &quot;inconsistent&quot; ones. What we have at this stage is a polynomially bounded set of k-simple grammars with varying sizes, which do not generate any of $S_L^-$, and contain all the k-simple grammars of $G$. Associated with each k-simple grammar is the portion of $S_L^+$ that it &quot;covers&quot; and its size. Step 4. What an Occam algorithm needs to do, then, is to find some subset of these k-simple grammars that &quot;covers&quot; $S_L^+$, and has a total size that is provably only polynomially larger than a minimal total size of a subset that covers $S_L^+$, and is less than linear in the sample size, $m$. We formalize this as a variant of the &quot;Set Cover&quot; problem which we call &quot;Weighted Set Cover&quot; (WSC), and prove the existence of an approximation algorithm with a performance guarantee which suffices to ensure that the output of $\mathcal{A}$ will be a grammar that is provably only polynomially larger than the minimal one, and is less than linear in the sample size. The algorithm runs in time polynomial in the size of the grammar being learned and the sample length.</Paragraph>
      <Paragraph position="13"> Step 1.</Paragraph>
      <Paragraph position="14"> A crucial consequence of the way k-locality is defined is that the &quot;terminal yield&quot; of any rule body that is used to derive any string in the language could be split into at most $k + 1$ intervals. (We define the &quot;terminal yield&quot; of a rule body $R$ to be $h(R)$, where $h$ is a homomorphism that preserves terminal symbols and deletes nonterminal symbols.) Definition 4.1 (Subyields) For an arbitrary $i \in N$, an i-tuple of members of $\Sigma_T^*$, $w = (v_1, v_2, \ldots, v_i)$, is said to be a subyield of $s$ if there are some $u_1, \ldots, u_i, u_{i+1} \in \Sigma_T^*$ such that $s = u_1 v_1 u_2 v_2 \cdots u_i v_i u_{i+1}$. We let $\mathrm{SubYields}(i, s) = \{w \in (\Sigma_T^*)^x \mid x \le i \text{ and } w \text{ is a subyield of } s\}$.</Paragraph>
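      <Paragraph> Definition 4.1 can be made concrete with a small enumeration. The Python sketch below lists every tuple of at most `max_arity` ordered, non-overlapping substrings of a string by choosing non-decreasing cut positions, two per component. The function name and the decision to start at arity 1 (leaving out the empty tuple) are assumptions made for this illustration.

```python
from itertools import combinations_with_replacement


def subyields(max_arity, s):
    """Return every tuple of at most max_arity ordered, non-overlapping substrings of s."""
    found = set()
    for arity in range(1, max_arity + 1):
        # 2 * arity non-decreasing cut points delimit v_1, ..., v_arity in order.
        for cuts in combinations_with_replacement(range(len(s) + 1), 2 * arity):
            found.add(tuple(s[cuts[2 * j]:cuts[2 * j + 1]] for j in range(arity)))
    return found


print(sorted(subyields(1, "ab")))
print(("a", "b") in subyields(2, "ab"))
```

      Because each tuple is fixed by at most $2i$ cut positions in a string of length $a$, the number of candidates grows polynomially in $a$ for fixed $i$, which is the counting idea behind the polynomial bound on subyields.
      </Paragraph>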
      <Paragraph position="15"> We then let $\mathrm{SubYields}_k(S_L^+)$ denote the set of all subyields of strings in $S_L^+$ that may have come from a rule body in a k-local-CFG, i.e. subyields that are tuples of at most $k + 1$ strings.</Paragraph>
      <Paragraph position="17"> This is obvious, since given a string $s$ of length $a$, there are only $O(a^{2(k+1)})$ ways of choosing $2(k + 1)$ different positions in the string. This completely specifies all the elements of $\mathrm{SubYields}(k+1, s)$. Since the number of strings ($m$) in $S_L^+$ and the length of each string in $S_L^+$ are each bounded by the sample length ($l$), we have at most $O(l) \times O(l^{2(k+1)})$ strings in $\mathrm{SubYields}_k(S_L^+)$. $\Box$ Thus we now have a polynomially generable set of possible yields of rule bodies in $G$. The next step is to generate the set of all possible rules having these yields. Now, by k-locality, in any derivation of $G$ we have at most k distinct &quot;kinds&quot; of rewritings present. So, each rule has at most k useful nonterminal occurrences, and since $G$ is minimal, it is free of useless nonterminals. We generate all possible rules with at most k nonterminal occurrences from some fixed set of k nonterminals ($\Sigma_k$), having as terminal subyields one of $\mathrm{SubYields}_k(S_L^+)$. We will then have generated all possible rules of $\mathrm{Rel}(\bar{G}, S_L^+)$. In other words, such a set will provably contain all the rules of $\mathrm{Rel}(\bar{G}, S_L^+)$.</Paragraph>
      <Paragraph position="18"> We let $\mathrm{TFRules}(\Sigma_k)$ denote the set of &quot;terminal free rules&quot; $\{A_{i_0} \to x_1 A_{i_1} x_2 \cdots x_n A_{i_n} x_{n+1} \mid n \le k \text{ and } \forall j \le n,\; A_{i_j} \in \Sigma_k\}$. We note that the cardinality of such a set is a function only of $k$. We then &quot;assign&quot; members of $\mathrm{SubYields}_k(S_L^+)$ to $\mathrm{TFRules}(\Sigma_k)$, wherever it is possible (that is, wherever the arities agree). We let $\mathrm{CRules}(k, S_L^+)$ denote the set of &quot;candidate rules&quot; so obtained.</Paragraph>
      <Paragraph position="19">  Definition 4.3 $\mathrm{CRules}(k, S_L^+) = \{R(w_1/x_1, \ldots, w_n/x_n) \mid R \in \mathrm{TFRules}(\Sigma_k) \text{ and } w \in \mathrm{SubYields}_k(S_L^+) \text{ and } \mathrm{arity}(w) = \mathrm{arity}(R) = n\}$. It is easy to see that the number of rules in such a set is also polynomially bounded.</Paragraph>
      <Paragraph position="20"> Claim 4.2 $\mathrm{card}(\mathrm{CRules}(k, S_L^+)) = O(l^{2k+3})$.</Paragraph>
      <Paragraph position="21"> Step 2. Recall that we have assumed that the k-simple components each have a nonterminal set contained in some fixed set of k nonterminals, $\Sigma_k$. So if we generate all subsets of $\mathrm{CRules}(k, S_L^+)$ with at most k rules, then these will include all the k-simple grammars in $\bar{G}$.</Paragraph>
      <Paragraph position="22"> Definition 4.4 $\mathrm{CGrams}(k, S_L^+) = \mathcal{P}_k(\mathrm{CRules}(k, S_L^+))$ (footnote 15).</Paragraph>
      <Paragraph position="23"> Step 3. Now we finally make use of the negative portion of the sample, $S_L^-$, to ensure that we do not include any inconsistent grammars in our candidates.</Paragraph>
      <Paragraph position="24"> Footnote 15: $\mathcal{P}_k(X)$ in general denotes the set of all subsets of $X$ with cardinality at most $k$.</Paragraph>
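      <Paragraph> The operator $\mathcal{P}_k$ used to form the candidate grammars is easy to sketch. The Python generator below yields every subset of a collection with cardinality at most k; the function name and the use of strings to stand for rules are assumptions made for this illustration, and the empty subset is included since its cardinality is also at most k.

```python
from itertools import combinations


def subsets_up_to(items, k):
    """Yield every subset of items with cardinality at most k, as a frozenset."""
    pool = list(dict.fromkeys(items))  # drop duplicates, keep first-seen order
    for size in range(k + 1):
        for chosen in combinations(pool, size):
            yield frozenset(chosen)


# Three placeholder rule names; with k = 2 there are 1 + 3 + 3 candidate subsets.
candidates = list(subsets_up_to(["r1", "r2", "r3"], 2))
print(len(candidates))
```

      For a fixed k the number of subsets is polynomial in the number of candidate rules, which is what keeps the set of candidate grammars polynomially bounded.
      </Paragraph>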
      <Paragraph position="25"> Definition 4.5 $\mathrm{FGrams}(k, S_L) = \{H \mid H \in \mathrm{CGrams}(k, S_L^+) \text{ and } L(H) \cap S_L^- = \emptyset\}$. This filtering can be computed in time polynomial in the length of $S_L$, because for testing consistency of each grammar in $\mathrm{CGrams}(k, S_L^+)$, all that is involved is the membership question for strings in $S_L^-$ with that grammar.</Paragraph>
      <Paragraph position="26"> Step 4. What we have at this stage is a set of 'subcovers' of $S_L^+$, each with a size (or 'weight') associated with it, and we wish to find a subset of these 'subcovers' that covers the entire $S_L^+$, but has a provably small 'total weight'. We abstract this as the following problem.</Paragraph>
      <Paragraph position="27"> WEIGHTED-SET-COVER (WSC) INSTANCE: $(X, Y, w)$ where $X$ is a finite set, $Y$ is a subset of $\mathcal{P}(X)$, and $w$ is a function from $Y$ to $N^+$. Intuitively, $Y$ is a set of subcovers of the set $X$, each associated with its 'weight'.</Paragraph>
      <Paragraph position="28"> NOTATION: For every subset $Z$ of $Y$, we let $\mathrm{cover}(Z) = \bigcup\{z \mid z \in Z\}$, and $\mathrm{totalweight}(Z) = \sum_{z \in Z} w(z)$.</Paragraph>
      <Paragraph position="29"> QUESTION: What subset of $Y$ is a set-cover of $X$ with a minimal total weight, i.e. find $Z \subseteq Y$ with the following properties:</Paragraph>
      <Paragraph position="31"> We now prove the existence of an approximation algorithm for this problem with the desired performance guarantee.</Paragraph>
      <Paragraph position="32">  Lemma 4.1 There is an algorithm B and a polynomial $p$ such that, given an arbitrary instance $(X, Y, w)$ of WEIGHTED-SET-COVER with $|X| = n$, B always outputs $Z$ such that: 1. $Z \subseteq Y$. 2. $Z$ is a cover for $X$, i.e. $\bigcup Z = X$. 3. If $Z'$ is a minimal weight set cover for $(X, Y, w)$, then $\sum_{y \in Z} w(y) \le p(\sum_{y \in Z'} w(y)) \times \log n$. 4. B runs in time polynomial in the size of the instance. Proof: To exhibit an algorithm with this property, we make use of the greedy algorithm C for the standard set-cover problem due to Johnson [8], with a performance guarantee. SET-COVER can be thought of as a special case of WEIGHTED-SET-COVER with the weight function being the constant function 1.</Paragraph>
      <Paragraph position="33"> Theorem 4.2 (David S. Johnson) There is a greedy algorithm C for SET-COVER such that, given an arbitrary instance $(X, Y)$ with an optimal solution $Z'$, it outputs a solution $Z$ such that $\mathrm{card}(Z) = O(\log |X| \times \mathrm{card}(Z'))$, and it runs in time polynomial in the instance size.</Paragraph>
      <Paragraph position="34"> Now we present the algorithm for WSC. The idea of the algorithm is simple. It applies C on $X$ and successive subclasses of $Y$ with bounded weights, up to the maximum weight there is, but using only powers of 2 as the bounds. It then outputs one with a minimal total weight among those.</Paragraph>
      <Paragraph position="35">  Algorithm B($(X, Y, w)$): $\mathit{maxweight} := \max\{w(y) \mid y \in Y\}$; $m := \lceil \log \mathit{maxweight} \rceil$; /* this loop gets an approximate solution using C for subsets of Y, each defined by putting an upper bound on the weights */ [...]</Paragraph>
      <Paragraph position="37"> [...] the instance size, since Algorithm C runs in time polynomial in the instance size and there are only $m = \lceil \log \mathit{maxweight} \rceil$ calls to it, which certainly does not exceed the instance size.</Paragraph>
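      <Paragraph> The idea behind Algorithm B can be sketched in Python as follows. This is a hedged illustration based on the prose description (run a greedy set cover on subfamilies whose weights are bounded by successive powers of 2, then keep the lightest cover found), not a transcription of the authors' pseudocode. The function names, the dictionary representation of the weight function, the loop that starts at the bound 1, and the use of base-2 logarithms are all assumptions. The greedy routine is the familiar heuristic that repeatedly takes the set covering the most uncovered elements.

```python
def greedy_cover(universe, family):
    """Greedy set cover: keep taking the set that covers the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(family, key=lambda s: len(uncovered.intersection(s)), default=None)
        if best is None or not uncovered.intersection(best):
            return None  # no remaining set helps, so this family cannot cover the universe
        chosen.append(best)
        uncovered.difference_update(best)
    return chosen


def weighted_cover(universe, weights):
    """Try greedy_cover on subfamilies bounded by 1, 2, 4, ... and keep the lightest cover.

    weights maps each candidate subset (a frozenset) to a positive integer weight.
    """
    max_weight = max(weights.values())
    rounds = (max_weight - 1).bit_length()  # ceil(log2(max_weight)) for positive integers
    best_total, best_cover = None, None
    for i in range(rounds + 1):
        bound = 2 ** i
        family = [s for s, weight in weights.items() if bound >= weight]
        cover = greedy_cover(universe, family)
        if cover is None:
            continue
        total = sum(weights[s] for s in cover)
        if best_total is None or best_total > total:
            best_total, best_cover = total, cover
    return best_cover


weights = {
    frozenset({1, 2, 3, 4}): 10,
    frozenset({1, 2}): 1,
    frozenset({3, 4}): 1,
}
cover = weighted_cover({1, 2, 3, 4}, weights)
print(sorted(sorted(s) for s in cover))
```

      In this small instance an unweighted greedy cover would take the single heavy set, whereas restricting attention to the light sets first yields a cover of total weight 2.
      </Paragraph>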
    </Section>
    <Section position="2" start_page="229" end_page="230" type="sub_section">
      <SectionTitle>
Performance Guarantee
</SectionTitle>
      <Paragraph position="0"> Let $(X, Y, w)$ be a given instance with $\mathrm{card}(X) = n$. Then let $Z^*$ be an optimal solution of that instance, i.e., it is a minimal total weight set cover. Let $\mathrm{totalweight}(Z^*) = w^*$. Now let $m^* = \lceil \log \max\{w(z) \mid z \in Z^*\} \rceil$. Then $m^* \le \min(n, \lceil \log \mathit{maxweight} \rceil)$. So when C is called with an instance $(X, Y[m^*])$ in the $m^*$-th iteration of the first 'For'-loop in the algorithm, every member of $Z^*$ is in $Y[m^*]$. Hence, the optimal solution of this instance equals $Z^*$. Thus, by the performance guarantee of C, $s[m^*]$ will be a cover of $X$ with cardinality at most $\mathrm{card}(Z^*) \times \log n$. Thus, we have $\mathrm{card}(s[m^*]) \le \mathrm{card}(Z^*) \times \log n$. Now, for every member $t$ of $s[m^*]$, $w(t) \le 2^{m^*} \le 2^{\lceil \log w^* \rceil} \le 2w^*$.</Paragraph>
      <Paragraph position="1"> Therefore, $\mathrm{totalweight}(s[m^*]) = \mathrm{card}(Z^*) \times \log n \times O(2w^*) = O(w^*) \times \log n \times O(2w^*)$, since $w^*$ certainly is at least as large as $\mathrm{card}(Z^*)$. Hence, we have $\mathrm{totalweight}(s[m^*]) = O(w^{*2} \times \log n)$. Now it is clear that the output of B will be a cover, and its total weight will not exceed the total weight of $s[m^*]$. We conclude therefore that $B((X, Y, w))$ will be a set-cover for $X$, with total weight bounded above by $O(w^{*2} \times \log n)$, where $w^*$ is the total weight of a minimal weight cover and $n = |X|$.</Paragraph>
      <Paragraph position="2"> $\Box$ Now, to apply algorithm B to our learning problem, we let $Y = \{S_L^+ \cap L(H) \mid H \in \mathrm{FGrams}(k, S_L)\}$, define the weight function $w : Y \to N^+$ by $\forall y \in Y$, $w(y) = \min\{\mathrm{size}(H) \mid H \in \mathrm{FGrams}(k, S_L) \text{ and } y = L(H) \cap S_L^+\}$, and call B on $(S_L^+, Y, w)$. We then output the grammar 'corresponding' to $B((S_L^+, Y, w))$. In other words, we let $\bar{H} = \{\mathrm{mingrammar}(y) \mid y \in B((S_L^+, Y, w))\}$, where $\mathrm{mingrammar}(y)$ is a minimal-size grammar $H$ in $\mathrm{FGrams}(k, S_L)$ such that $L(H) \cap S_L^+ = y$. The final output grammar $H$ will be the &quot;disjoint union&quot; of all the grammars in $\bar{H}$, i.e. $H = \psi(\bar{H})$. $H$ is clearly consistent with $S_L$, and since the minimal total weight solution of this instance of WSC is no larger than $\mathrm{Rel}(\bar{G}, S_L^+)$, by the performance guarantee on the algorithm B, $\mathrm{size}(H) \le p(\mathrm{size}(\mathrm{Rel}(\bar{G}, S_L^+))) \times O(\log m)$ for some polynomial $p$, where $m$ is the sample size.</Paragraph>
      <Paragraph position="3"> $\mathrm{size}(\bar{G}) \ge \mathrm{size}(\mathrm{Rel}(\bar{G}, S_L^+))$ is also bounded by a polynomial in the size of a minimal grammar consistent with $S_L$. We therefore have shown the existence of an Occam algorithm with range size polynomial in the size of a minimal consistent grammar and less than linear in the sample size. Hence, Theorem 4.1 has been proved.</Paragraph>
      <Paragraph position="4"> Q.E.D.</Paragraph>
      <Paragraph position="5"> 5 Extension to Mildly Context Sensitive Languages
The learnability of k-local subclasses of CFG may appear to be quite restricted. It turns out, however, that the learnability of k-local subclasses extends to a rich class of mildly context sensitive grammars which we call &quot;Ranked Node Rewriting Grammars&quot; (RNRG's). RNRG's are based on the underlying ideas of Tree Adjoining Grammars (TAG's) (footnote 16), and are also a special case of context free tree grammars [13] in which unrestricted use of variables for moving, copying and deleting is not permitted. In other words, each rewriting in this system replaces a &quot;ranked&quot; nonterminal node of, say, rank $j$ with an &quot;incomplete&quot; tree containing exactly $j$ edges that have no descendants. If we define a hierarchy of languages generated by subclasses of RNRG's having nodes and rules with bounded rank $j$ (RNRL$_j$), then RNRL$_0$ = CFL, and RNRL$_1$ = TAL (footnote 17). It turns out that each k-local subclass of each RNRL$_j$ is polynomially learnable. Further, the constraint of k-locality on RNRG's is an interesting one because not only is each k-local subclass an exponential class containing infinitely many infinite languages, but also k-local subclasses of the RNRG hierarchy become progressively more complex as we go higher in the hierarchy. In particular, for each $j$, RNRG$_j$ can &quot;count up to&quot; $2(j + 1)$, and for each $k \ge 2$, k-local-RNRG$_j$ can also count up to $2(j + 1)$ (footnote 18). We will omit a detailed definition of RNRG's (see [2]), and informally illustrate them by some examples (footnote 19). Example 5.1 $L_1 = \{a^n b^n \mid n \in N\} \in$ CFL is generated by the following RNRG$_0$ grammar, where $\alpha$ is shown in Figure 3. $G_1 = (\{S\}, \{s, a, b\}, |, \{S\}, \{S \to$</Paragraph>
      <Paragraph position="7"> [...] TAL is generated by the following RNRG$_1$ grammar, where $\beta$ is shown in Figure 3. $G_2 = (\{S\}, \{s, a, b, c, d\}, |, \{(S(|))\}, \{S \to \beta,\; S \to s(|)\})$. Example 5.3 $L_3 = \{a^n b^n c^n d^n e^n f^n \mid n \in N\} \notin$ TAL is generated by the following RNRG$_2$ grammar, where $\gamma$ is shown in Figure 3. $G_3 =$</Paragraph>
      <Paragraph position="9"> Footnote 16: Tree adjoining grammars were introduced as a formalism for linguistic description by Joshi et al. [10], [9]. Various formal and computational properties of TAG's were studied in [16]. Its linguistic relevance was demonstrated in [11].
Footnote 17: This hierarchy is different from the hierarchy of &quot;meta-TAL's&quot; invented and studied extensively by Weir in [18].</Paragraph>
      <Paragraph position="10"> Footnote 18: A class of grammars $\mathcal{G}$ is said to be able to &quot;count up to&quot; $j$ just in case $\{a_1^n a_2^n \cdots a_j^n \mid n \in N\} \in \{L(G) \mid G \in \mathcal{G}\}$ but $\{a_1^n a_2^n \cdots a_{j+1}^n \mid n \in N\} \notin \{L(G) \mid G \in \mathcal{G}\}$.</Paragraph>
      <Paragraph position="11"> Footnote 19: Simpler trees are represented as term structures, whereas more involved trees are shown in the figure. Also note that we use uppercase letters for nonterminals and lowercase for terminals. Note the use of the special symbol | to indicate an edge with no descendant.</Paragraph>
      <Paragraph position="12"> [Figure 3: elementary trees and a derived tree]
We state the learnability result of RNRL$_j$'s below as a theorem, and again refer the reader to [2] for details.</Paragraph>
      <Paragraph position="13"> Note that this theorem subsumes Theorem 4.1 as the case $j = 0$.</Paragraph>
      <Paragraph position="14"> Theorem 5.1 $\forall j, k \in N$, k-local-RNRL$_j$ is polynomially learnable (footnote 20).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="230" end_page="231" type="metho">
    <SectionTitle>
6 Some Negative Results
</SectionTitle>
    <Paragraph position="0"> The reader's reaction to the result described above may be an illusion that the learnability of k-local grammars follows from &quot;bounding by k&quot;. On the contrary, we present a case where &quot;bounding by k&quot; not only does not help feasible learning, but in some sense makes it harder to learn. Let us consider Tree Adjoining Grammars without local constraints, TAG(wolc), for the sake of comparison (footnote 21). Then an argument analogous to the one for the learnability of k-local-CFL shows that k-local-TAL(wolc) is polynomially learnable for any k.</Paragraph>
    <Paragraph position="1"> Theorem 6.1 $\forall k \in N^+$, k-local-TAL(wolc) is polynomially learnable.</Paragraph>
    <Paragraph position="3"> Now let us define subclasses of TAG(wolc) with a bounded number of initial trees; k-initial-tree-TAG(wolc) is the class of TAG(wolc) with at most k initial trees. Then surprisingly, for the case of a single letter alphabet, we already have the following striking result. (For full detail, see [1].) Theorem 6.2 (i) TAL(wolc) on a 1-letter alphabet is polynomially learnable.</Paragraph>
    <Paragraph position="4"> (ii) $\forall k \ge 3$, k-initial-tree-TAL(wolc) on a 1-letter alphabet is not polynomially learnable by k-initial-tree-TAG(wolc).</Paragraph>
    <Paragraph position="5"> Footnote 20: We use the size of a minimal k-local RNRG$_j$ as the size of a k-local RNRL$_j$, i.e., $\forall j \in N$, $\forall L \in$ k-local-RNRL$_j$, $\mathrm{size}(L) = \min\{\mathrm{size}(G) \mid G \in \text{k-local-RNRG}_j \text{ and } L(G) = L\}$.</Paragraph>
    <Paragraph position="6"> Footnote 21: Tree Adjoining Grammar formalism was never defined without local constraints.</Paragraph>
    <Paragraph position="7"> As a corollary to the second part of the above theorem, we have that k-initial-tree-TAL(wolc) on an arbitrary alphabet is not polynomially learnable (by k-initial-tree-TAG(wolc)). This is because we would be able to use a learning algorithm for an arbitrary alphabet to construct one for the single letter alphabet case.</Paragraph>
    <Paragraph position="8"> Corollary 6.1 k-initial-tree-TAL(wolc) is not polynomially learnable by k-initial-tree-TAG(wolc).</Paragraph>
    <Paragraph position="9"> The learnability of k-local-TAL(wolc) and the nonlearnability of k-initial-tree-TAL(wolc) form an interesting contrast. Intuitively, in the former case, the &quot;k-bound&quot; is placed so that the grammar is forced to be an arbitrarily &quot;wide&quot; union of boundedly small grammars, whereas, in the latter, the grammar is forced to be a boundedly &quot;narrow&quot; union of arbitrarily large grammars. It is suggestive of the possibility that a human infant, when acquiring her native tongue, may in fact start by developing small special-purpose grammars for different uses and contexts, and slowly start to generalize and compress the large set of similar grammars into a smaller set.</Paragraph>
  </Section>
</Paper>