FeaMble Learnability of Formal Grammars and 
\[~?he Theory of Natm'al Language Acquisition 
Naoki ABE 
l)epartment of Computer and Information Science 
University of Pem~sylvania 
Philadelphia, PA 19104-6389 
A bst;ract 
We propo;:e to apply a. complexity theoretic notion of feasible 
learnability called "polynomial learnability" to the evaluation 
of grammatical formalisms for linguistic de.~;criptiol). Polylm-. 
mill h;arnability was originally defined by Valiant in the con- 
text of bo,llean concept t(!arniiig and sul)scquetltly generalized 
hy Blumec el, al. to i~llinita.cy domains. We give a clear, intuitive 
exposition el' this notion (/l' k'arnability au(l what characteristics 
of a collection of hmguages may or many not help feasible learn-- 
ability under this paradigm. In particular, we preset,t a novel, 
nontrivJal ::onstraint on the degree of "locality" of grammars 
which allows a ri& class of mildly context sensitive languages 
to be feasibly learnable. We discuss pos,';ihle implications of this 
observati(m to the theory of natm'al language acquisition. 
t. Introduct, ion 
A central i~sue o\[ linguistic theory is the "t)~'ojectio~l prohhml", 
which was origblally prol)osed by Noam Chomsky \[?\] and sub 
sequ(mtly l.?d to much of the development in modern linguistics. 
This probh,.m pose~ the question: "i\[ow is it posslbk~ for human 
infants to acquire thei,' native language on the basis of casual 
exposure to limited data in a short amount of t, ime?" The pro- 
posed solulion is that the human infant in ell\;ct "knows" what 
the natura{ language that it is trying to learn could possibly 
be. Another way to look at it is that there is a re.latively small 
set of possible grammars that it would be able to learn, and 
its learmng stratergy, implicitly or explicitly, takes adwmtage of 
this apriori knowledge. The goal of linguistic theory, then, is 
to &aractedze this set of possible grammars, by specifiying the 
constraints, often cMled the "Uniwwsal (Irammar". Tile theory 
of inductiw~' inference oilers a precise solution to this problem, 
by characterizing exactly what collections of (or its dual "con- 
straints ou") languages satisfy tile requirement for being the set 
of possible grammars, i,e. are learnable? A theory of "feasible" 
inference is particularly interesting because the language acqui- 
sitkm process of a human infant is feasible, not to mention its 
relewmce to the technological counterpart of such a pwbh'.m. 
In this paper, we investigate the learuability of formal gram- 
mars for linguistic description with respect to a complexity the- 
oretic notion of feasible lea.rnability called 'polynomial learnabil- 
ity'. Polynomial learnabillty was originally developed by Valiant 
\[?\], \[?\] in the context of learning boolean coitcei)t from exam- 
ples, artd subsequently generalized by I llumer et al. for arbitrary 
concepts \[?\]. We apply this criterion of feasible lcarnability to 
subclasses of formal grammars thai, are of considerable linguistic 
interest. Specifically, we present a novel, nontrivial constraint 
on gramma,:s called "k. locality", which ena\])k~s a rich ehlss of 
mildly context sensitive grammars called l{ank<~d Node Rewrit- 
ing G'rammars (RNI{.( 0 to be limsibly lear1~able. \'Vc discuss 
possible implications of this result to thc Lheory of natural Inn 
guagc acqui:~ition. 
2 Polynomial Learnability 
2ol Formal Modeling of Learning 
What constitutes a good model of tile learning behavior? Below 
we list tlve basic elements that any formal model of learning 
must con<,. (c.f. \[13\]) 
1. Objects to be learned: l,ct us call them ~knacks' for full 
generality. The question of learnability is asked of a col- 
lection of knacks. 
2. Environment: The way in whidl 'data' are available to tile 
learner. 
3. I\[ypotheses: I)escriptious t))r 'knacks', usually CXl)ressed 
in a certain language. 
4. /,earners: Ill general functions from data to hypotheses. 
5. Criterion of l,earning: \])efines precisely what is meant by 
the question; When is a learner said to 'learn' a giwm 
collection of 'knacks' on the basis of data obtained through 
the enviromnent ? 
In most cases 'knacks' can be thought of as subsets of some 
universe (set) of objects, from which examples are drawn. 1 (Such 
a set is often called the 'domain' of the learning problem.) The 
obvions example is the definition of what a language is in the 
theory of natural language syntax. Syntactically, the English 
language is nothing but the set of all grammatical sentences, 
although this is subject to much philosophical controversy. The 
corresponding mathematical notion of a formal language is one 
that is fi'ee of such a controversy. A formal language is a subset 
of the set of all strings in .E* for some alphabet E. Clearly E* 
is tile domMn. The characterization of a kna& as a subset of a 
universe is in fact a very general one. For example, a boolean 
concept of n variables is a subset of the set of all assignments to 
those n variables, often written 2 '~. Positive examples in this case 
are assignments to the n variables which 'satisfy' the concept in 
question. 
When the 'knacks' under consideration can in fact be thought 
of as subsets of some domain, the overall picture of a learning 
model looks like the one given in Figure 1. 
2.2 Polynomial Learnability 
Polynomial learnability departs from the classic paradigm of lan- 
guage learning, 'idenitification in the limit', ~ in at least two 
important aspects, lilt enforces a higher demand oil tile time 
1First order structures are an example in which langtlages arc more than 
just subsets of some set \[14\]. 
2Identification in the limit w¢~s originally proposed and studied by Gold 
\[8\], and has subsequently been generalized in many diflbrent ways. See for 
example \[13\] for a comprehensive treatment of this and related paradigms. 
The Knacks 
The Domain 
The Environment 
o 
The Hypotheses 
The Learner 
The Crileriony 
Figure 1: A Learning Model 
complexity by requiring that the learner converge in time poly- 
nomial, but on the other hand relaxes the criterion of what con- 
stitutes a 'correct' grammar by employing an approximate, and 
probabilistic notion of correctness, or aecraey to be'precise. Fur- 
thermore, this notion of correctness is intricately tied to both 
the time complexity requirement and the way in which the en- 
vironment presents examples to the learner, Specifically, the 
environment is assumed to present to the learner examples from 
the domain with respect to an unknown (to the learner) but 
fixed probability distribution, and the accuracy of a hypothesis 
is measured with respect to that same probability distribution. 
This way, the learner is, so to speak, protected from 'bad' pre- 
sentations of a knack. We now make these ideas precise by spec- 
ifying the five essential parameters of this learning paradigm. 
1. Objects to be learned are languages or subsets of ?2" for 
some fixed alphabet E. Although we do not specify apri- 
ori the language in which to express these grammars a, for 
each collection of languages Z; of which we ask the learn- 
ability, we fix a class of grammars G (such that L(~) = £ 
where we write L(~) to mean {L(G) I G E ~}) with re- 
spect to which we will define the notion of 'complexity' or 
'size' of a language. We take the number of bits it takes to 
write down a grammar under a reasonable 4, fixed encod- 
ing scheme to be the size of the grammar. The size of a 
language is then defined as the size of a minimal grammar 
for it. (For a language L, we write size(L) for its size.) 
2. The environment produces a string in E* with a time- 
invariant probability distribution unknown to the learner 
and pairs it with either 0 or 1 depending on whether the 
string is in the language in question or not, gives it to the 
learner. It repeats this process indefinitely. 
3. The hypotheses axe expressed as grammars. The class of 
grammars allowed as hypotheses, say "H, is not necessarily 
required to generate exactly the class Z; of languages to be 
learned. In general, when a collection £ can be learned by 
a learner which only outputs hypotheses from a class 7"/, 
we say that £ is learnable by Tl, and in particular, when 
Z; = L(~)) is learnable by ~, the class of representations G 
is said to be properly learnable. (See \[6\].) 
4. Learners passively receive an infinite sequence of positive 
and negative examples in the manner described above, and 
aPotentAally any 'l?urning program could be a hypothesis 
~By a reasonblc encoding, we mean one which can represent n ditrerent. 
grannnars using O(log*~) bits. 
5. 
at each initial (finite) segment of such a sequence, output a 
hypothesis. In other words, they are functions from finite 
sequences of positive and negative examples 5 to grammars. 
A learning function is said to polynomially learn a col- 
lection of languages just in case it is computable in time 
polynomial ill the length of the input sample, and for an 
arbitrary degrees of accuracy e and confidence 5, its output 
on a sample produced by the environment by the manner 
described above for any language L in that collection, will 
be an e-approximation of the unknown language L with 
confidence probability at least 1 -- a, no matter what the 
unknown distribution is, as long as the number of strings 
in the sample exceeds p(e -~, 5 -~, size (L)) for some fixed 
plynomial p. Here, grammar G is an e-approximation of 
language L, if the probability distribution over the sym- 
metric difference 6 of L and I,(G) is at most e. 
2.3 Occam Algorithm 
Blumer et al. \[5\] have shown an extremely interesting result 
revealing a connection between reliable data compression and 
polynomial learnability. Occam's l~azor is a principle in the 
philosophy of science which stipulates that a shorter theory is 
tobe preferred as long as it remains adequate. B\]umel" el; al. 
define a precise version of such a notion in the present context 
of learning which they call Occam Algorithm, and establishes a 
relation between the existence of such an algorithm and poly- 
nomiM learnability: If there exists a polynomial time algorithm 
which reliably "compresses" any sample of any language in a 
given collection to a provably small consistent grammar for it, 
then such an Mogorithm polynomially learns that collection in 
the limit. We state this theorem in a slightly weaker form. 
Definition 2.1 Let £ be a language collection with associated 
represenation ~ with size function "size". (Define a sequence 
of subclasses of ~ by 7~n = {G e 7-\[ \] size(G) _< n}.) Then A 
is an Occar(~ algorithm for £ with range size f(m, ~z) if and only 
if! 
VLE£ 
VS C graph(L) 
if size(L) = n and \] S I= m then 
A(S) is consistent with S 
and A(S)) e 7~I(,~,m ) 
and .A runs in time polynomial in the length of S. 
Theorem 2.1 (Blumer et al.) If A is an Occam algorithm 
for f~ with range size f(n,m) = O(nk~ ~) for some k >_ ; 
0 < c~ < 1 then .4 polynomially learns £ in the limit. 
We give below an intuitive explication of why an 0cesta Algo- 
rithm polynomiMly learns in the limit. Suppose A is an Occam 
Algorithm for £, and let L ~ l: be the language to be learned, 
and n its size. Then for an arbitrary sample for L of an arbi- 
trary size, a minimal consistent language for it will never have 
size larger than size(L) itself. Hence A's output on a sample of 
size m will always be one of the hypotheses in H\](m,~), whose 
cardinality is at most 2\](~,n). As the sample size m grows, its ef- 
fect on the probability that any consistent hypothesis in 7~i(,~,, 0 
is accurate will (polynomially) soon dominate that of the growth 
of the eardinality of the hypothesis class, which is less than linear 
in the sample size. 
Sin the sequel, we shall call them 'labeled samples' 
SThe symmetric difference between two sets A and B is (A-B)U(B-A). 
rFor any langugage L, ~jraph(L) = {(x, O} I x C-: L} U {{a:, I) \] a: ~ L}. 
3 Rar~ked Node Rewriting Grammars 
In this section, we define l, hc class of nrihlly context sensitive 
grammars under consideration, or Ranked Node Rewriting (\]ram.- 
mars (RNR(~'s). \[{NR(\]'s are based on the underlying ideas of 
Tree Adjoining Grammars (TArt's) s and are also a specical 
case of context fi'ee tree grammars \[15\] in which unres~,ricted 
use of w~rial)les for moving, copying and deleting, is not per- 
mitted, in other words each rewriting in this system replaces a 
"ranked" noclterminal node of say rank j with an "incomplete" 
tree containing exactly j edges that have no descendants. If 
we define a hierarchy of languages generated by subclasses of 
RNRG's having nodes and rules with hounded rank j (RNRLj), 
then RNRL0 = CFL, and RNRLa :: TAL. 9 We formally define 
these grammars below. 
Definition 'LI (Preliminaries) 77ze following definitions are 
necessar!l Jb'," the ,~equel. 
(i) The set of labeled directed trees over an alphabet E is denoted 
7;> 
(ii) r\['ll.e Ta.'ll.'. of an "incomplete" tree is the number of outgoing 
edges with no descendents. 
(iii) The rarth oj'a node is the. number of outgoing edges. 
(iv) The ~u& 4'a symbol is defined if the rank of any node 
labeled by it is always the same, and equal~ that rank. 
(v) A ranked alphabet is one in which every symbol has a rank. 
(vi) I,l)r writ,': rank(x) for the rank of a~ything x, if it is defined. 
Definition 3.2 (Ranked Node Rewriting Grammars) A 
ronl;ed nodt; re'writing grammar C is a q'uinl,ph' {>',,v, E'e, ~, It,., Re;) 
where: 
(i) EN is a ranked nonterminal alphabet. 
(ii) );'r is a germinal alphabet di4oint fi'om F~N. We let ~; = 
}-;N U 2T. 
(iii) ~ is a distinguished symbol distinct from any member of E, 
indicating "a'a outgoing edge with no descendent", m 
(iv) It; is a finite set of labeled trees over E. We refer ~o I(; as 
~he "initial trees" of the grammar. 
(v) Ra is a finite set of rewriting rules: R<~ C {(A,a} I A e 
Y,'N & a C T~u{.} & rank(A) = rank(re)}. (In the sequel, we 
write A --. o for rewriting rule {A, ce).) 
(vO ,'a,,V(c) = ,ha, {,-~,4.(A) I A e EN}. 
We emphasize that the nonterminM vs. terminal distinction above 
does not coiadde with the internal node vs. frontier node dis- 
tinction. (See examples 2.1 - 2.3.) tiaving defined the notions 
of 'rewriting' and 'derivation' in the obvious manner, the tree 
language of a grammar is then defiimd as the set of trees over 
the terminal alphabet, whid~ can be derived fi'om the grammar. 11 
This is analogous to the way the string language of a rewriting 
grammar in the Chomsky hierarchy is defined. 
Definition 3.:"1 ('IYee Languages and String Languages) 
The tree language and string Iang~tagc of a RNRG G, denoted 
s'\]?ree adjoitdng grammars were introduced a.s a formalism for linguis- 
tic description by aoshi et al. \[10\], \[9\]. Various formal and computational 
properties of TAG's were studied in \[17\]. Its linguistic relevance was demon- 
s~rated in \[12\]. 
9This hierar,:hy is different fi'om the hierarchy of "meta-TAL's" invented 
and studied exl.ensively by Weir in \[20\]. 
l°ln context free t.ree grammars iu \[15\], variables are used in place of ~J. 
'l'hese variables can then be used in rewriting rules to move, copy, or erase 
subtrees.. \[t is i;his restriction of avoiding such use of variables Hint keeps 
RNR,G's within the class of etlicient, ly recognizable rewriting systems called 
"Linear context fi'ee rewriting systems" (\[18\]). 
II'Phis is how an "obligatory adjunction constraint" in the tree adjoining 
nunar formalism can be sintulated. 
a S b 
9: 
S 
a S d 
I j\[--. 
b # c 
7: 
S 
IV.. 
a 8 f 
S $ 
b # c d # c 
derived : 
s 
a s f 
a s f 
s s 
b s c d s e 
b )v c d )~ e 
Figurc 2: a, fl, 7 and deriving 'aabbccddeeff' by G:~ 
T((;) and Leo repectively, are defined as follows; 
/~(c') = {.,ji~ld(~) t ~ ~ T(O)}. 
If we now define a hierarchy of languages generated by sub- 
classes of RNRG's with bounded ranks, context fi'ee languages 
((',FL) and tree adjoining languages (TAt) constitute the first 
two members of the hierarchy. 
Definition 3.4 l;br each j ~ N RNI~Gj = {GIG C RNRG & 
rank(G) < J}. l;br each j ~ N, I{NIU, j = {L(C) I O e: antiC;;} 
Theorem 3.1 I{NI~Lo - CFL ~tn.d l~N I~\[.1 : !I'AL. 
We now giw; some examples of grammars in this laierarchy, J2 
which also illustrate the way in which the weak generative ca- 
pacity of different levels of this hierarchy increases progressively. 
13 
Example 3.1. 1), = {3% ~ \[ n. C N} C Gl' , is generated by the 
following l?~Nl~(_7o 9rammar~ where o' is shown in Figure 2. 
6', = ({s}, {,,a,b},L {s'}, {,5'--~ ~,,~ + s(~)}) 
Example 3.2 I)2 -- {a'W~c'~d '~ \] n G N} C- TAL is ocher, ted by 
the following \]~N I~G1 grammar, where/~ is shown in Figure 2. 
C;~=({S},{s,a,b,e,d},~,{(S(,~))},{S'-, ,'<S' +s(~)}> 
Example 3.3 L3 = {a'%'*c'~d'~e'~f '~ I n C N} ¢ TAL is gen- 
erated by the following RNI?,G2 grnmn~ar, where 7 is shown 5* 
t,'igure 2. 
C;':~ = ({S'}, {s, a, b, ,.', d, c, f}, ~, {(,5'(A, A))}, {5'--, 7, ,5'-~ ,~(~, I1)}) 
4 K-Local Grammars 
q'he notion of qocality' of a grammar we define in this paper is 
a measure of how much global dependency there is within the 
grammar. By global dependency within a gramnlar, we. mean 
the interactions that exist between different rules and nonter- 
minals in the grammar. As it is intuitively clear, allowing un- 
bounded amont of global interaction is a major, though not 
only, cause of a combinatorial explosion in a search for a right 
grammar. K-locality limits the amount of such interaction, by 
tSSimpler trees are represented as term struct.ures, whereas lnore involved 
trees are shown in the figure. Also note that we rise uppercase letters for 
nonterminals and lowercase for terminals. 
IaSome linguistic motiwltions of this extension of'lDkG's are argagned for 
by the author in \[1\]. 
bounding the number of different rules that can participate in 
any slngle derivation. 
Pormally, the notion of "k-locality" of a grammar is defined 
with respect to a formulation of derivations due originally to 
Vijay-Shankar, Weir, and 3oshi (\[\[9\]), which is a generalization 
of the notion of parse trees for CFO's. In their formulation, 
a derivation is a tree recording the tfistory of rewritings. The 
root of a derivation tree is labeled with an initial tree, and the 
rest of the nodes with rewriting rules. Each edge corresponds 
to a rewriting; the edge from a rule (host rule) to auother rule 
(applied rule) is labeled with the address of the node in the host 
l, ree at which the rewriting takes place. 
The degree of locality of a derivation is the number of distinct 
kinds of rewritings that appear in it. In terms of a derivation 
tree, the degree of locality is the number of different kinds of 
edges in it, where two edges are equivalent just in ease the two 
end nodes are labeled by the same rules, and the edges them- 
selves are labeled by the same node address. 
Definition 4.1 Let 7)(G) denote the set of all derivation trees 
of G, and let r 6 D(G). Then, the degree of locality oft, written 
locality(r), is d4ned as follows, locality(r) = card{(p,q,,t) I 
there is an edge in r from a node labeled with p to another labeled 
with q, and is itself labeled with 77} 
The degree of locality of a gramm,~r is the maximum of those of 
all its derivations. 
Definition 4.2 a RNRG G is called k-local if max{locality(r) \] 
r e ~(C)} _< k. 
We write k-Local-I~NRO - {(7 I G (5 RNRG and G is k-Local} 
and k-Local-t2Nl~L = { L(G) I G C k-Local-i~NR(: }, etc.. 
Example 4.1 L1 = {a"bna"b '' I n,m C N} ~ /t-Local-RNRLo 
since all the derivations of G, -({S}, {s,a,b}, ~, {s(S,S)}, 
{S -+ sea, S,b), S --~ A}) generating Lt have deflree of locality 
at most 4. l,br example, the derivation for the string a3b3ab has 
degree of locality 4 as shown in Figure 8. 
Because locality of a derivation is the number of distinct 
kinds of rewritings, inclusive of the positions at which they takc 
place, k-locality also puts a bound on the number of nonterminal 
occurrences in any rule. In fact, had we defined the notion of k- 
locality by the two conditins: (i) at most k rules take part in any 
derivation, (if) each rule is k-bounded, t4, the analogous learn- 
ability result would follow essentially by the same argument. So, 
k-locality in effect forces a grammar to be an unbounded union 
of boundedly simple grammar, with bounded number of rules 
each boundedly small, with a bounded number of nonterminals. 
This fact is captured formally by the existence of the following 
normal form with only a polynomial expansion factor. 
Lelnma 4.1 (K-Local Normal Form) For every k-Local-RNRGj 
G, if we let n = size(G), then there is a RNRGj G' such that 
~. L( C') = r,,( a). 
2. c' is in k-local normal form, i.c. O' = U{1\]~ I i C -rG,} 
such that: 
(a) each lIi has a nonterminal set that is: disjoint from 
any other IIj. 
(b) each tI~ is k-sire, pie, that is 
i. each Ili contains exactly i initial tree. 
14'K-bounded' here means k nontermineJ occurrences in each rule, \[4\]. 
For instance, a context free grammar in Chomsky Normal l%rm has only 
2-bounded rules. 
,/-: 
s s 2 
A 2..£- s-*A.--- s 
S S a Sb I 
s s 2 s 2 s-./l',, s ---../1Xm 
aS b a S b a Sb 
locality(~-) = 4 
s 2 s s s 
A s--*A A m s.. A 
S S a Sb S S a Sb 
s s s 
s -'/1",, s.o 
aS b a S b a Sb 
Figure 3: Degree of locality of a derivation of a3b3ab by G1 
if. each Hi contains at most k rules. 
iii. each IIi contains at most k nonterminal occur- 
rences. 
s. ~i~e(c~") = o(~+'). 
Crucially, the constraint of k-locality on RNRG's is an interest- 
ing one because not only each k-local subclass is an exponential 
class containing infinitely many infinite languages, but also k- 
local subclasses of the RNRG hierarchy become progressively 
more complex as we go higher in the hierarchy. In particular, 
for each j, IlNP~Gj can "count up to" 2(j + 1) and for each k > 2, 
k-local-RN\[4Gj can also count up to 2(j + 1)) 5 We summarize 
these properties of k-loeal-RNRL's below. 
Theorem 4.1 Pbr every k E N, 
1. Vj E N UkeN k-local-RNRLj = RNRLj. 
~. Vj C N Vk > 3 k-local-RNRLj+l is incomparable with 
RNRLp 
3. Vj, k ~ N k-local:RNRLj is a p~oper subset of (k+I)- 
loeal-t~NRLj. 
4. Vj Vk > 2 E N k-local-RNRLj contains infinitely many 
infinite languages. 
hfformal t'roof: 
1 is obvious because for each grammar in RNRLj, the degree 
of locality o~" the grannnar is finite. 
As for 2, we note that the sequence of the languages (for the 
first three of which we gave example grammars) L~ = {a~*a~...a~ I 
u ~ N} are each in 3-1ocal-RNRLI_I but not in RNRLi_2. 
To verii} 3, we give the following sequence of languages Lj,k 
such that for each j and k, Lj, k is in k-local-RNRLj but not in 
(k-1)-local-RNRL/. Intuitively this is because k-local-languages 
can have at most O(k) mutually independent dependencies in a 
single sentence. 
Example 4.2 For each j, k ~ N, let Lj,k = { ~ '~ 2,~2 2~, al ...a20+1 ) al ...a2(j+l) 
knk kn~ ... a 1 ...a2(j~t) \]nl,n2,...,nk e N}. 
is obvious because Zoo = Uwe~.Lw where Lt~ = {w" \] n e N} 
are a subset of 2-1ocal-I~NRL0, and hence is a subset of k:local- 
RNl~Lj for every j and k >_ 2. £¢¢ clearly contains inifinitely 
many infinite languages. \[\] 
5 K-Local Languages Are Learnable 
It turns out that each k-loeal subclass of each RNRLj is poly- 
nomially lear~lable. 
Theorem 5. t For each j and k, k-local-RNRLj is polynomially 
Icarnable. 
This theorem can be proved by exhibiting an Occam Algorithm 
i(c.f, for this class with size which is Subsection 2.3), a range 
l logarithmic in the sample size, and polynomial in the size of a 
minimal consistent grammar. We ommit a detailed proof and 
igiw~ an informal outline of the proof. : 
1. By the Normal Form Lemma, for any k-local-RNRG G, 
there is a language equivalent k-local-RNR.G H in k-local 
normal form whose size is only polynomially larger than 
the size of G. 
t~A class of grammars G is said to be able to "count up to" j, just in 
case {a?a'~...a\] \] n e N} e {L(G) \[ G (~ G} but {ai'a'~...a~+ 1 \[ n e N} ¢ 
{c(G) I a e 6}. 
2. The number of k-simple grammars with is apriori infinite, 
but for a given positive sample, the number of such gram- 
mars that are 'relevant' to that sample (i.e. which could 
have been used to derive any of the examples) is polyno- 
mially bounded in the length of the sample. This follows 
essentially by the non-erasure and non-copying properties 
of RNRG's. (See \[3\] for detail.) 
3. Out of the set of k-simple grammars in the normal form 
thus obtained, the ones that are inconsistent with the neg- 
ative sample are eliminated. Such a filtering can be seen to 
be performable in polynomial time, appealing to the result 
of Vijay-Shankar, Weir and Joshi \[18\] that Linear Context 
Free Rewriting Systems (LCFRS's) are polynomial time 
recognizable. That R.NRG's are indeed LCFRS's follow 
also from the non-erasure and non-copying properties. 
4. What we have at this stage is a polynomially bounded set 
of k-simple grammars of varying sizes which are all con- 
sistent with the input sample. The 'relevant' part 10 of 
a minimal consistent grammar in k-local normal form is 
guaranteed to be a subset of this set of grammars. What 
an Oceam algorithm needs to do, then, is to find some sub- 
set of this set of k-simple grammars that "covers" all the 
points in the positive sample, and has a total size that is 
provably only polynomially larger than the minimal total 
size of a subset that covers the positive sample and is less 
than linear in the sample size. 
5. We formalize this as a variant of "Set Cover" problem 
which we call "Weighted Set Cover" (WSC), and prove (in 
\[2 D the existence of an approximation algorithm with a 
performance guarantee which suffices to ensure that the 
output of ,4 will be a basis set consistent with the sample 
which is provably only polynomially larger than a mini- 
mal one, and is less than linear in the sample size. The 
algorithm runs in time polynomial in the size of a minimal 
consistent grammar and the sample length. 
6 Discussion: Possible Implications 
to the Theory of Natural Language 
Acquisition 
We have shown that a single, nontrivial constraint of 'k-locality' 
allows a rich class of mildly context sensitive languages, which 
are argued by some \[9\] to be an upperbound of weak genera- 
tive capacity that may be needed by a hnguistic formalism, to 
be learnable. Let us recall that k-locality puts a bound on the 
amount of global interactions between different parts (rules) of a 
grammar. Although the most concise discription of natrual lan- 
guage might require almost unbounded amount of such interac- 
tions, it is conceivable that the actual grammar that is acquired 
by humans have a bounded degree of interactions, and thus in 
some cases may involve some inefficiency and redundancy. To 
illustrate the nature of inefficiecy introduced by 'forcing' a gram- 
mar to be k-loeal, consider the following. The syntactic category 
of a noun phrase seems to be essentially context independent in 
the sense that a noun phrase in a subject position and a noun 
phrase in an object positionare more or less syntactically equiv- 
alent. Such a 'generalization' contributes to the 'global' inter- 
action in a grammar. Thus, for a k-local grammar (for some 
relatively small k) to account for it, it may have to repeat the 
same set of noun phrase rules for different constructions. 
t¢This ,lotion is to be made precise. 
As is stated in Section 4, for each fixed k, there are clearly 
a lot of languages (in a given class) which could not be gener- 
ated by a k-local grammar. However, it is also the case that 
many languages, for which the most concise grammar is not a 
k-local grammar, can be generated by a less concise (and thus 
perhaps less explanatory) grammar, which is k-locah In some 
sense, this is similar to the well-known distinction of 'compe- 
tence' and 'performance'. It is conceivable that performance 
grammars which are actually acquired by humans are in some 
sense much less efficient and less explanatory than a competence 
grammar for the same language. After all when the 'projection 
problem' asks: 'How is it possible for human infants to acquire 
their native languages...', it does not seem necessary that it be 
asking the question with respect to 'competence grammars', for 
what we know is that the set of 'performance grammars' is fea- 
sibly learnable. The possibility that we are suggesting here is 
that 'k-locality ~ is not visible in competence grammars, however, 
it is implicitly there so that the languages generated by the class 
of competence grammars, which are not necessarily k-local, are 
indeed all k-local languages for some fixed 'k'. 
7 Conclusions 
We have investigated the use of complexity theory to the evalu- 
ation of grammatical systems as linguistic formalisms from the 
point of view of feasible learnability. In particular, we have 
demonstrated that a single, natural and non-trivial constraint 
of "locality" on the grammars allows a rich class of mildly con- 
text sensitive languages to be feasibly learnable, in a well-defined 
complexity theoretic sense. Our work differs from recent works 
on efficient learning of formal languages, for example by An- 
gluin (\[4\]), in that it uses only examples and no other powerful 
oracles. We hope to have demonstrated that learning formal 
-- grammars need not be doomed to be necessarily computation- 
ally intractable, and the investigation of alternative formulations 
of this problem is a worthwhile endeavonr. 
8 Acknowledgment 
The research reported here in was in part supported by an IBM 
graduate fellowship awarded to the author. The author grate- 
fully acknowledges his advisor, Scott Weinstein, for his guidance 
and encouragement throughout this research. He has also ben- 
efitted from valuable discussions with Aravind Joshi and David 
Weir. Finally he wishes to thank Haim Levkowitz and Ethel 
Schuster for their kind help in formatting this paper. 
References 
\[1\] Naoki Abe. Generalization of tree adjunction as ranked 
node rewriting. 1987. Unpublished manuscript. 

\[2\] Naoki Abe.. Polynomial learnability and locality of formal 
grammars. In 26th Meeting of A.C.L., June 1988. 

\[3\] Naoki Abe. Polynomially learnable subclasses of mildy con- 
text sensitive languages. 1987. Unpublished manuscript. 

\[4\] Dana Angluin. Leafing k-bounded context-free grammars. 
Technical Report YALEU/DCS/TR-557, Yale University, 
August 1987. 

\[5\] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. 
Classifying learnable geometric concepts with the vapnik- 
chervonenkis dimension. In Proc. 18th ACM Syrup. on The- 
ory of Computation, pages 243 - 282, 1986. 

\[6\] A. Blumer, A. Ehrenfeueht, D. Hausslor, and M. War- 
muth. Learnability and the Vapnik-Chervonenkis Dimen- 
sion. Technical Report UCSC CI~L-87-20, University of 
California at Santa Cruz, Novermber 1987. 

\[7\] Noam Chomsky. Aspects of the Theory of Syntax. The MIT 
Press, 1965. 

\[8\] E. Mark Gold. Language identification in the limit. Infor- 
mation and Control, 10:447-474, 1967. 

\[9\] A. K. Joshi. How much context-sensitivity is necessary for 
characterizing structural description - tree adjoining gram- 
mars. In D. Dowty, L. Karttunen, and A. Zwicky, edi- 
tors, Natural Language Processing - Theoretical, Computa- 
tional~ and Psychological Perspectives, Cambridege Univer- 
sity Press, 1983. 

\[10\] Aravind K. Joshi, Leon Levy, and Masako Takahashi. Tree 
adjunct grammars. Journal of Computer and System Sci- 
ences, 10:136-163, 1975. 

\[11\] M. Kearns, M. Li, L. Pitt, and L. Valiant. On the learn- 
ability of boolean formulae. In Proc. 19th ACM Syrup. on 
Theory of Comoputation, pages 285 - 295, 1987. 

\[12\] A. Kroch and A. K. Joshi. Linguistic relevance of tree ad- 
joining grammars. 1989. To appear in Linguistics and Phi- 
losophy. 

\[13\] Daniel N. Osherson, Michael Stob, and Scott Weinstein. 
Systems That Learn. The MIT Press, 1986. 

\[14\] Daniel N. Osherson and Scott Weinstein. Identification in 
the limit of first order structures. JouT"aal of Philosophical 
Logic, 15:55 - 81, 1986. 

\[15\] William C. Rounds. Context-free grammars on trees. In 
A CM Symposium on Theory of Computing, pages 143-148, 
1969. 

\[16\] Leslie G. Valiant. A theory of the learnable. Communica- 
tions of A.C.M., 27:1134-1142, 1984. 

\[17\] K. Vijay-Shanker and A. K. Joshi. Some computational 
properties of tree adjoining grammars. In 23rd Meeting of 
A.C.L., 1985. 

\[18\] K. Vijay-Shanker, D. J. Weir, and A. K. Joshi. Character- 
izing structural descriptions produced by various grarmnat- 
ieal formalisms. In 25th Meeting of A.C.L., 1987. 

\[19\] K. Vijay-Shanker, D. J. Weir, and A. K. Joshi. On the 
progression from context-freo to tree adjoining languages. 
In A. Manaster-Ramer, editor, Mathematics of Language, 
John Benjamins, 1986. 

\[20\] David J. Weir. From Context-Free Grammars to Tree Ad- 
joining Grammars and Beyond - A dissertation proposal. 
Technical Report MS-CIS-87-42, University of Pennsylva- 
nia, 1987. 
