Grammatical Inference and First Language Acquisition
Alexander Clark (asc@aclark.demon.co.uk)
ISSCO / TIM, University of Geneva
UNI-MAIL, Boulevard du Pont-d’Arve,
CH-1211 Genève 4, Switzerland
Abstract
One argument for parametric models of language
has been learnability in the context of first language
acquisition. The claim is made that “logical” ar-
guments from learnability theory require non-trivial
constraints on the class of languages. Initial formal-
isations of the problem (Gold, 1967) are however
inapplicable to this particular situation. In this pa-
per we construct an appropriate formalisation of the
problem using a modern vocabulary drawn from sta-
tistical learning theory and grammatical inference
and looking in detail at the relevant empirical facts.
We claim that a variant of the Probably Approxi-
mately Correct (PAC) learning framework (Valiant,
1984) with positive samples only, modified so that it is
not completely distribution-free, is the appropriate
choice. Some negative results derived from cryptographic
problems (Kearns et al., 1994) appear to apply
in this situation, but the existence of algorithms
with provably good performance (Ron et al., 1995),
together with subsequent work, shows that these negative
results are not as strong as they initially appear, and
that recent algorithms for learning regular languages
partially satisfy our criteria. We then discuss the
applicability of these results to parametric and non-
parametric models.
1 Introduction
For some years, the relevance of formal results
in grammatical inference to the empirical question
of first language acquisition by infant children has
been recognised (Wexler and Culicover, 1980). Un-
fortunately, for many researchers, with a few no-
table exceptions (Abe, 1988), this begins and ends
with Gold’s famous negative results in the identifi-
cation in the limit paradigm. This paradigm, though
still widely used in the grammatical inference com-
munity, is clearly of limited relevance to the issue
at hand, since it requires the model to be able to
exactly identify the target language even when an
adversary can pick arbitrarily misleading sequences
of examples to provide. Moreover, the paradigm as
stated has no bounds on the amount of data or com-
putation required for the learner. In spite of the inap-
plicability of this particular paradigm, in a suitable
analysis there are quite strong arguments that bear
directly on this problem.
Grammatical inference is the study of machine
learning of formal languages. It has a vast formal
vocabulary and has been applied to a wide selec-
tion of different problems, where the “languages”
under study can be (representations of) parts of nat-
ural languages, sequences of nucleotides, moves of
a robot, or some other sequence data. For any con-
clusions that we draw from formal discussions to
have any applicability to the real world, we must
be sure to select, or construct, from the rich set of
formal devices available an appropriate formalisa-
tion. Even then, we should be very cautious about
making inferences about how the infant child must
or cannot learn language: subsequent developments
in grammatical inference (GI) might allow a more nuanced description in
which these conclusions are not valid. The situation
is complicated by the fact that the field of grammatical
inference, much like the wider field of machine
learning in general, is in a state of rapid change.
In this paper we hope to address this problem by
justifying the selection of the appropriate learning
framework starting by looking at the actual situa-
tion the child is in, rather than from an a priori deci-
sion about the right framework. We will not attempt
a survey of grammatical inference techniques; nor
shall we provide proofs of the theorems we use here.
Arguments based on formal learnability have been
used to support the idea of parameter based theo-
ries of language (Chomsky, 1986). As we shall see
below, under our analysis of the problem these ar-
guments are weak. Indeed, they are more pertinent
to questions about the autonomy and modularity of
language learning: the question whether learning of
some level of linguistic knowledge – morphology
or syntax, for example – can take place in isolation
from other forms of learning, such as the acquisition
of word meaning, and without interaction, ground-
ing and so on.
Positive results can help us to understand how hu-
mans might learn languages by outlining the class of
algorithms that might be used by humans, consid-
ered as computational systems at a suitable abstract
level. Conversely, negative results might be help-
ful if they could demonstrate that no algorithms of a
certain class could perform the task – in this case we
could know that the human child learns his language
in some other way.
We shall proceed as follows: after briefly describing
first language acquisition (FLA), we describe the various elements of
a model of learning, or framework. We then make
a series of decisions based on the empirical facts
about FLA, to construct an appropriate model or
models, avoiding unnecessary idealisation wherever
possible. We proceed to some strong negative results,
well known in the GI community, that bear on
the questions at hand. The most powerful of these
(Kearns et al., 1994) appears to apply quite directly
to our chosen model. We then discuss an interest-
ing algorithm (Ron et al., 1995) which shows that
this can be circumvented, at least for a subclass of
regular languages. Finally, after discussing the pos-
sibilities for extending this result to all regular lan-
guages, and beyond, we conclude with a discussion
of the implications of the results presented for the
distinction between parametric and non-parametric
models.
2 First Language Acquisition
Let us first examine the phenomenon we are con-
cerned with: first language acquisition. In the space
of a few years, children almost invariably acquire,
in the absence of explicit instruction, one or more of
the languages that they are exposed to. A multitude
of subsidiary debates have sprung up around this
central issue covering questions about critical peri-
ods – the ages at which this can take place, the ex-
act nature of the evidence available to the child, and
the various phases of linguistic use through which
the infant child passes. In the opinion of many re-
searchers, explaining this ability is one of the most
important challenges facing linguists and cognitive
scientists today.
A difficulty for us in this paper is that many of
the idealisations made in the study of this field are
in fact demonstrably false. Classical assumptions,
such as the existence of uniform communities of
language users, are well-motivated in the study of
the “steady state” of a system, but less so when
studying acquisition and change. There is a regrettable
tendency to slip from viewing these idealisations
correctly – as counterfactual idealisations – to
viewing them as empirical facts that need to be ex-
plained. Thus, when looking for an appropriate for-
mulation of the problem, we should recall for exam-
ple the fact that different children do not converge to
exactly the same knowledge of language as is some-
times claimed, nor do all of them acquire a language
competently at all, since there is a small proportion
of children who, though apparently neurologically
normal, fail to acquire language. In the context of
our discussion later on, these observations lead us
to accept slightly less stringent criteria where we al-
low a small probability of failure and do not demand
perfect equality of hypothesis and target.
3 Grammatical Inference
The general field of machine learning has a spe-
cialised subfield that deals with the learning of for-
mal languages. This field, Grammatical Inference
(GI), is characterised above all by an interest in for-
mal results, both in terms of formal characterisa-
tions of the target languages, and in terms of formal
proofs either that particular algorithms can learn
according to particular definitions, or that particular
classes of languages cannot be learnt. In spite of the
field’s theoretical bent, GI algorithms have also been
applied with some success. Natural language, however, is not the
only source of real-world applications for GI. Other
domains include biological sequence data, artificial
languages, such as discovering XML schemas, or
sequences of moves of a robot. The field is also
driven by technical motives and the intrinsic ele-
gance and interest of the mathematical ideas em-
ployed. In summary it is not just about language,
and accordingly it has developed a rich vocabulary
to deal with the wide range of its subject matter.
In particular, researchers are often concerned
with formal results – that is, we want algorithms
where we can prove that they will perform in a cer-
tain way. Often, we may be able to empirically es-
tablish that a particular algorithm performs well, in
the sense of reliably producing an accurate model,
while we may be unable to prove formally that the
algorithm will always perform in this way. This
can be for a number of reasons: the mathematics
required in the derivation of the bounds on the er-
rors may be difficult or obscure, or the algorithm
may behave strangely when dealing with sets of data
which are ill-behaved in some way.
The basic framework can be considered as a
game played between two players. One player, the
teacher, provides information to another, the learner,
and from that information the learner must identify
the underlying language. We can break down this
situation further into a number of elements. We as-
sume that the languages to be learned are drawn
in some way from a possibly infinite class of languages,
$\mathcal{L}$, which is a set of formal mathematical
objects. The teacher selects one of these languages,
which we call the target, and then gives the learner
a certain amount of information of various types
about the target. After a while, the learner
returns its guess, the hypothesis, which in general will
be a language drawn from the same class $\mathcal{L}$.
Ideally the learner has been able to deduce or induce
or abduce something about the target from the
information we have given it, and in this case the
hypothesis it returns will be identical to, or close in
some technical sense to, the target. If the learner
can consistently do this, under whatever constraints
we choose, then we say it can learn that class of lan-
guages. To turn this vague description into some-
thing more concrete requires us to specify a number
of things.
• What sort of mathematical object should we
use to represent a language?
• What is the target class of languages?
• What information is the learner given?
• What computational constraints does the
learner operate under?
• How close must the target be to the hypothesis,
and how do we measure it?
This paper addresses the extent to which negative
results in GI could be relevant to this real world sit-
uation. As always, when negative results from the-
ory are being applied, a certain amount of caution
is appropriate in examining the underlying assump-
tions of the theory and the extent to which these are
applicable. As we shall see, in our opinion, none
of the current negative results, though powerful, are
applicable to the empirical situation. We shall ac-
cordingly, at various points, make strong pessimistic
assumptions about the learning environment of the
child, and show that even under these unrealistically
stringent stipulations, the negative results are still
inapplicable. This will make the conclusions we
come to a little sharper. Conversely, if we wanted
to show that the negative results did apply, to be
convincing we would have to make rather optimistic
assumptions about the learning environment.
4 Applying GI to FLA
We now have the delicate task of selecting, or rather
constructing, a formal model by specifying the various
components we have identified above. We want
to choose the model that is the best representation
of the learning task or tasks that the infant child
must perform. We consider that some of the em-
pirical questions do not yet have clear answers. In
those cases, we shall make the choice that makes the
learning task more difficult. In other cases, we may
not have a clear idea of how to formalise some in-
formation source. We shall start by making a signif-
icant idealisation: we consider language acquisition
as being a single task. Natural languages as traditionally
described have different levels. At the very
least we have morphology and syntax; one might
also consider inter-sentential or discourse structure as an
additional level. We conflate all of these into a single
task: learning a formal language; in the discussion
below, for the sake of concreteness and clarity, we
shall talk in terms of learning syntax.
4.1 The Language
The first question we must answer concerns the lan-
guage itself. A formal language is normally defined
as follows. Given a finite alphabet $\Sigma$, we define the
set of all strings (the free monoid) over $\Sigma$ as $\Sigma^*$.
We want to learn a language $L \subseteq \Sigma^*$. The alphabet
$\Sigma$ could be a set of phonemes, or characters, or
a set of words, or a set of lexical categories (part
of speech tags). The language could be the set of
well-formed sentences, or the set of words that obey
the phonotactics of the language, and so on. We re-
duce all of the different learning tasks in language
to a single abstract task – identifying a possibly in-
finite set of strings. This is overly simplistic since
transductions, i.e. mappings from one string to an-
other, are probably also necessary. We are using
here a standard definition of a language where every
string is unambiguously either in or not in the language.
This may appear unrealistic – if the formal
language is meant to represent the set of grammati-
cal sentences, there are well-known methodological
problems with deciding where exactly to draw the
line between grammatical and ungrammatical sen-
tences. An alternative might be to consider accept-
ability rather than grammaticality as the defining
criterion for inclusion in the set. Moreover, there
is a certain amount of noise in the input. There
are other possibilities. We could, for example, use a
fuzzy set – i.e. a function $\Sigma^* \to [0, 1]$ where
each string has a degree of membership between 0
and 1. This would seem to create more problems
than it solves. A more appealing option is to learn
distributions, again functions $f : \Sigma^* \to [0, 1]$
but where $\sum_{s \in L} f(s) = 1$. This is of course the
classic problem of language modelling, and is com-
pelling for two reasons. First, it is empirically well
grounded – the probability of a string is related to its
frequency of occurrence, and secondly, we can de-
duce from the speech recognition capability of hu-
mans that they must have some similar capability.
Both possibilities – crisp languages and distributions
– are reasonable. The choice depends on
which phenomena one considers most in need of
explanation: grammaticality judgments by native
speakers, or natural use and comprehension of the
language. We favour the latter, and accordingly
think that learning distributions is a more accurate
and more difficult choice.
4.2 The class of languages
A common confusion in some discussions of this
topic is between languages and classes of lan-
guages. Learnability is a property of classes of
languages. If there is only one language in the
class of languages to be learned then the learner
can just guess that language and succeed. A class
with two languages is again trivially learnable if
you have an efficient algorithm for testing member-
ship. It is only when the set of languages is expo-
nentially large or infinite, that the problem becomes
non-trivial, from a theoretical point of view. The
class of languages we need is a class of languages
that includes all attested human languages and ad-
ditionally all “possible” human languages. Natu-
ral languages are thought to fall into the class of
mildly context-sensitive languages, (Vijay-Shanker
and Weir, 1994), so clearly this class is large
enough. It is, however, not necessary that our class
be this large. Indeed it is essential for learnability
that it is not. As we shall see below, even the class
of regular languages contains some subclasses that
are computationally hard to learn. Indeed, we claim
it is reasonable to define our class so it does not con-
tain languages that are clearly not possible human
languages.
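The remark that a two-language class is trivially learnable can be made concrete with a minimal sketch. It assumes each candidate language is available as a membership predicate; the names `learn_from_two`, `only_a` and `even_length` are hypothetical and chosen only for illustration. With positive samples only, the learner returns whichever candidate accepts every sample seen so far.

```python
from typing import Callable, Iterable

Language = Callable[[str], bool]  # membership test for a crisp language


def learn_from_two(first: Language, second: Language,
                   samples: Iterable[str]) -> Language:
    """Pick the candidate consistent with all positive samples.

    Assumes the target is one of the two candidates, so at least one
    of them accepts every sample. Ties are broken in favour of `first`.
    """
    samples = list(samples)
    if all(first(s) for s in samples):
        return first
    return second


# Toy candidates over the alphabet {a, b} (illustrative only).
def only_a(s: str) -> bool:
    return set(s) <= {"a"}


def even_length(s: str) -> bool:
    return len(s) % 2 == 0


# "ab" rules out only_a, so the learner settles on even_length.
hypothesis = learn_from_two(only_a, even_length, ["ab", "bb", ""])
```

The difficulty discussed in the text arises only when the class is too large for this kind of exhaustive elimination.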
4.3 Information sources
Next we must specify the information that our learn-
ing algorithm has access to. Clearly the primary
source of data is the primary linguistic data (PLD),
namely the utterances that occur in the child’s envi-
ronment. These will consist of both child-directed
speech and adult-to-adult speech. These are generally
acceptable sentences, that is to say, sentences
that are in the language to be learned. These are
called positive samples. One of the most long-
running debates in this field is over whether the
child has access to negative data – unacceptable sen-
tences that are marked in some way as such. The
consensus (Marcus, 1993) appears to be that they do
not. In middle-class Western families, children are
provided with some sort of feedback about the well-
formedness of their utterances, but this is unreliable
and erratic, not a universal of global child-raising.
Furthermore this appears to have no effect on the
child. Children do also get indirect pragmatic feed-
back if their utterances are incomprehensible. In our
opinion, both of these would be better modelled by
what is called a membership query: the algorithm
may generate a string and be informed whether that
string is in the language or not. However, we feel
that this is too erratic to be considered an essential
part of the process. Another question is whether the
input data is presented as a flat string or annotated
with some sort of structural evidence, which might
be derived from prosodic or semantic information.
Unfortunately there is little agreement on what the
constituent structure should be – indeed many lin-
guistic theories do not have a level of constituent
structure at all, but just dependency structure.
Semantic information is also claimed as an im-
portant source. The hypothesis is that children can
use lexical semantics, coupled with rich sources of
real-world knowledge, to infer the meaning of utterances
from the situational context. That would be
an extremely powerful piece of information, but it is
clearly absurd to claim that the meaning of an utter-
ance is uniquely specified by the situational context.
If true, there would be no need for communication
or information transfer at all. Of course the context
puts some constraints on the sentences that will be
uttered, but it is not clear how to incorporate this
fact without being far too generous. In summary, it
appears that only positive evidence can be unequivocally
relied upon, though this may seem a harsh and
unrealistic environment.
4.4 Presentation
We have now decided that the only evidence avail-
able to the learner will be unadorned positive sam-
ples drawn from the target language. There are var-
ious possibilities for how the samples are selected.
The choice that is most favourable for the learner is
where they are selected by a helpful teacher to make
the learning process as easy as possible (Goldman
and Mathias, 1996). While it is certainly true that
carers speak to small children in sentences of sim-
ple structure (Motherese), this is not true for all of
the data that the child has access to, nor is it uni-
versally valid. Moreover, there are serious techni-
cal problems with formalising this, namely what is
called ’collusion’ where the teacher provides exam-
ples that encode the grammar itself, thus trivialising
the learning process. Though attempts have been
made to limit this problem, they are not yet com-
pletely satisfactory. The next alternative is that the
examples are selected randomly from some fixed
distribution. This appears to us to be the appropri-
ate choice, subject to some limitations on the dis-
tributions that we discuss below. The final option,
the most difficult for the learner, is where the se-
quence of samples can be selected by an intelli-
gent adversary, in an attempt to make the learner
fail, subject only to the weak requirement that each
string in the language appears at least once. This is
the approach taken in the identification in the limit
paradigm (Gold, 1967), and is clearly too stringent.
The remaining question then regards the distribu-
tion from which the samples are drawn: whether the
learner has to be able to learn for every possible dis-
tribution, or only for distributions from a particular
class, or only for one particular distribution.
4.5 Resources
Beyond the requirement of computability we will
wish to place additional limitations on the computa-
tional resources that the learner can use. Since chil-
dren learn the language in a limited period of time,
which limits both the amount of data they have ac-
cess to and the amount of computation they can use,
it seems appropriate to disallow algorithms that use
unbounded or very large amounts of data or time.
As normal, we shall formalise this by putting poly-
nomial bounds on the sample complexity and com-
putational complexity. Since the individual samples
are of varying length, we need to allow the compu-
tational complexity to depend on the total length of
the sample. A key question is what the parameters
of the sample complexity polynomial should be. We
shall discuss this further below.
4.6 Convergence Criteria
Next we address the issue of reliability: the extent
to which all children acquire language. First, vari-
ability in achievement of particular linguistic mile-
stones is high. There are numerous causes including
deafness, mental retardation, cerebral palsy, specific
language impairment and autism. Generally, autis-
tic children appear neurologically and physically
normal, but about half may never speak. Autism,
on some accounts, has an incidence of about 0.2%.
Therefore we can require learning to happen with
arbitrarily high probability, but requiring it to hap-
pen with probability one is unreasonable. A related
question concerns convergence: the extent to which
children exposed to a linguistic environment end
up with the same language as others. Clearly they
are very close since otherwise communication could
not happen, but there is ample evidence from stud-
ies of variation (Labov, 1975), that there are non-
trivial differences between adults, who have grown
up with near-identical linguistic experiences, about
the interpretation and syntactic acceptability of sim-
ple sentences, quite apart from the wide purely lex-
ical variation that is easily detected. A famous ex-
ample in English is “Each of the boys didn’t come”.
Moreover, language change requires some chil-
dren to end up with slightly different grammars
from the older generation. At the very most, we
should require that the hypothesis should be close
to the target. The function we use to measure the
’distance’ between hypothesis and target depends on
whether we are learning crisp languages or distributions.
If we are learning distributions then the obvious
choice is the Kullback-Leibler divergence – a
very strict measure. For crisp languages, the prob-
ability of the symmetric difference with respect to
some distribution is natural.
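For reference, the two measures can be written out using their standard definitions. In this sketch $D$ in the first line is the distribution used to weight errors, $L$ and $H$ are the target and hypothesis languages, and in the second line $T$ and $H$ are the target and hypothesis distributions. The subscripted name $d_D$ is introduced here only for illustration; the notation $D(T \| H)$ for the divergence follows the usage later in the paper.

```latex
% Crisp languages: probability mass of the symmetric difference.
d_D(H, L) = \Pr_D(H \,\triangle\, L) = \sum_{s \in H \triangle L} D(s)

% Distributions: Kullback-Leibler divergence of the hypothesis from the target.
D(T \,\|\, H) = \sum_{s \in \Sigma^*} T(s) \log \frac{T(s)}{H(s)}
```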
4.7 PAC-learning
These considerations lead us to some variant of the
Probably Approximately Correct (PAC) model of
learning (Valiant, 1984). We require the algorithm
to produce with arbitrarily high probability a good
hypothesis. We formalise this by saying that for any
$\delta > 0$ it must produce a good hypothesis with probability
more than $1 - \delta$. Next we require a good
hypothesis to be arbitrarily close to the target, so we
have a precision $\epsilon$ and we say that for any $\epsilon > 0$, the
hypothesis must be less than $\epsilon$ away from the target.
We allow the amount of data it can use to increase as
the confidence and precision get smaller. We define
PAC-learning in the following way: given a finite
alphabet $\Sigma$, and a class of languages $\mathcal{L}$ over $\Sigma$, an
algorithm PAC-learns the class $\mathcal{L}$ if there is a polynomial
$q$ such that for every confidence $\delta > 0$ and
precision $\epsilon > 0$, for every distribution $D$ over $\Sigma^*$,
for every language $L$ in $\mathcal{L}$, whenever the number of
samples exceeds $q(1/\epsilon, 1/\delta, |\Sigma|, |L|)$, the algorithm
must produce a hypothesis $H$ such that, with probability
greater than $1 - \delta$, $\Pr_D(H \triangle L) < \epsilon$. Here
we use $A \triangle B$ to mean the symmetric difference between
two sets. The polynomial $q$ is called the
sample complexity polynomial. We also limit the
amount of computation to some polynomial in the
total length of the data it has seen. Note first of all
that this is a worst case bound – we are not requiring
merely that on average it comes close. Additionally
this model is what is called ’distribution-free’. This
means that the algorithm must work for every com-
bination of distribution and language. This is a very
stringent requirement, only mitigated by the fact
that the error is calculated with respect to the same
distribution that the samples are drawn from. Thus,
if there is a subset of $\Sigma^*$ with low aggregate probability
under $D$, the algorithm will not get many samples
from this region but will not be penalised very
much for errors in that region. From our point of
view, there are two problems with this framework:
first, we only want to draw positive samples, but the
distributions are over all strings in $\Sigma^*$, and include
some that give a zero probability to all strings in
the language concerned. Secondly, this is too pes-
simistic because the distribution has no relation to
the language: intuitively it’s reasonable to expect
the distribution to be derived in some way from the
language, or the structure of a grammar generating
the language. Indeed there is a causal connection
in reality since the sample of the language the child
is exposed to is generated by people who do in fact
know the language.
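The point that errors are weighted by the sampling distribution can be illustrated with a minimal sketch. It assumes a distribution with finite support stored as a dictionary; the function name `symmetric_difference_error` and the toy languages are hypothetical. A hypothesis that over-generalises badly still has small error if the distribution puts little mass on the strings where it disagrees with the target.

```python
from typing import Callable, Dict

Language = Callable[[str], bool]


def symmetric_difference_error(dist: Dict[str, float],
                               hypothesis: Language,
                               target: Language) -> float:
    """Return Pr_D(H symmetric-difference L) for a finite-support D.

    `dist` maps each string in the support of D to its probability.
    """
    return sum(p for s, p in dist.items()
               if hypothesis(s) != target(s))


# Toy setting (illustrative): the target is the set of strings of a's;
# the hypothesis wrongly accepts everything.
def target(s: str) -> bool:
    return set(s) <= {"a"}


def hypothesis(s: str) -> bool:
    return True


# D puts almost all of its mass where H and L agree.
dist = {"a": 0.5, "aa": 0.45, "ab": 0.04, "b": 0.01}
error = symmetric_difference_error(dist, hypothesis, target)
```

Here the error is only the mass on "ab" and "b". If the distribution were restricted to positive samples, that mass would be zero and the over-general hypothesis would not be penalised at all.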
One alternative that has been suggested is the
PAC learning with simple distributions model introduced
by Denis (2001). This is based on ideas from
complexity theory where the samples are drawn ac-
cording to a universal distribution defined by the
conditional Kolmogorov complexity. While math-
ematically correct this is inappropriate as a model
of FLA for a number of reasons. First, learnability
is proven only on a single very unusual distribution,
and relies on particular properties of this distribu-
tion, and secondly there are some very large con-
stants in the sample complexity polynomial.
The solution we favour is to define some natu-
ral class of distributions based on a grammar or au-
tomaton generating the language. Given a class of
languages defined by some generative device, there
is normally a natural stochastic variant of the de-
vice which defines a distribution over that language.
Thus regular languages can be defined by finite-state
automata, and these can be naturally extended
to probabilistic finite-state automata. Similarly,
context-free languages are normally defined
by context-free grammars, which can be extended
in turn to probabilistic or stochastic CFGs. We
therefore propose a slight modification of the PAC
framework. For every class of languages $\mathcal{L}$, defined
by some formal device, define a class of distributions
$\mathcal{D}$, defined by a stochastic variant of that device.
Then for each language $L$, we select the set of
distributions whose support is equal to the language,
subject to a polynomial bound ($q$) on the complexity
of the distribution in terms of the complexity
of the target language:
$\mathcal{D}^+_L = \{D \in \mathcal{D} : L = \mathrm{supp}(D) \wedge |D| < q(|L|)\}$.
Samples are drawn from
one of these distributions.
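As an illustration of a stochastic variant of a generative device, the following is a minimal sketch of a probabilistic deterministic finite automaton. The encoding (a nested dictionary with `None` as a termination pseudo-symbol) and the names `PDFA`, `string_probability` and `sample` are assumptions made for this example, not a representation taken from the cited work. The support of the distribution is exactly the regular language accepted by the underlying automaton, here a b*.

```python
import random
from typing import Dict, Optional, Tuple

# state -> symbol -> (next state, probability); END marks termination.
END = None
Transitions = Dict[int, Dict[Optional[str], Tuple[Optional[int], float]]]

# Hypothetical two-state PDFA whose support is the language a b*.
PDFA: Transitions = {
    0: {"a": (1, 1.0)},
    1: {"b": (1, 0.5), END: (None, 0.5)},
}


def string_probability(pdfa: Transitions, s: str, start: int = 0) -> float:
    """Probability that the automaton generates exactly `s`."""
    state, prob = start, 1.0
    for symbol in s:
        if symbol not in pdfa[state]:
            return 0.0  # no transition: s is outside the support
        state, p = pdfa[state][symbol]
        prob *= p
    # The string must also terminate in the state it reaches.
    return prob * pdfa[state].get(END, (None, 0.0))[1]


def sample(pdfa: Transitions, rng: random.Random, start: int = 0) -> str:
    """Draw one positive sample from the distribution."""
    state, out = start, []
    while True:
        options = list(pdfa[state].items())
        weights = [p for _, (_, p) in options]
        symbol, (nxt, _) = rng.choices(options, weights=weights)[0]
        if symbol is END:
            return "".join(out)
        out.append(symbol)
        state = nxt
```

Every string returned by `sample` lies in the language, and every string in the language has non-zero probability, which is the support condition in the definition above.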
There are two technical problems here: first, this
doesn’t penalise over-generalisation. Since the dis-
tribution is over positive examples, negative exam-
ples have zero weight, so we need some penalty
function over negative examples or alternatively
require the hypothesis to be a subset of the tar-
get. Secondly, this definition is too vague. The
exact way in which you extend the “crisp” lan-
guage to a stochastic one can have serious con-
sequences. When dealing with regular languages,
for example, though the class of languages defined
by deterministic automata is the same as that defined
by non-deterministic automata, the same is
not true for their stochastic variants. Additionally,
one can have exponential blow-ups in the number
of states when determinising automata. Similarly,
with CFGs, Abney et al. (1999) showed that two
parametrisations of stochastic context-free languages
are equivalent, but that there are blow-ups in both
directions when converting between them. We do not have a
completely satisfactory solution to this problem at
the moment; an alternative is to consider learning
the distributions rather than the languages.
In the case of learning distributions, we have the
same framework, but the samples are drawn according
to the distribution being learned, $T$, and we require
that the hypothesis $H$ has small divergence
from the target: $D(T \| H) < \epsilon$. Since the divergence
is infinite if the hypothesis gives probability zero to
a string in the target, this will have the consequence
that the hypothesis must assign a non-zero probability to
every string in the support of the target.
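The effect of zero probabilities on the divergence can be checked with a minimal sketch. It assumes distributions with finite support stored as dictionaries; the function name `kl_divergence` and the toy distributions are hypothetical.

```python
import math
from typing import Dict


def kl_divergence(target: Dict[str, float],
                  hypothesis: Dict[str, float]) -> float:
    """D(T || H) in nats for distributions with finite support.

    Strings missing from `hypothesis` are treated as having
    probability zero, which makes the divergence infinite.
    """
    total = 0.0
    for s, t in target.items():
        if t == 0.0:
            continue  # 0 * log(0 / h) is taken to be 0
        h = hypothesis.get(s, 0.0)
        if h == 0.0:
            return math.inf
        total += t * math.log(t / h)
    return total


target = {"a": 0.5, "ab": 0.25, "abb": 0.25}
close = {"a": 0.5, "ab": 0.3, "abb": 0.2}
too_narrow = {"a": 0.6, "ab": 0.4}  # gives zero mass to "abb"
```

With these toy values, `kl_divergence(target, close)` is small and finite, while `kl_divergence(target, too_narrow)` is infinite because the hypothesis excludes a string that the target can generate.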
5 Negative Results
Now that we have a fairly clear idea of various ways
of formalising the situation we can consider the ex-
tent to which formal results apply. We start by con-
sidering negative results, which in Machine Learn-
ing come in two types. First, there are information-
theoretic bounds on sample complexity, derived
from the Vapnik-Chervonenkis (VC) dimension of
the space of languages, a measure of the complex-
ity of the set of hypotheses. If we add a parameter
to the sample complexity polynomial that represents
the complexity of the concept to be learned then this
will remove these problems. This can be the size of
a representation of the target which will be a poly-
nomial in the number of states, or simply the num-
ber of non-terminals or states. This is very standard
in most fields of machine learning.
The second problem relates not to the amount
of information but to the computation involved.
Results derived from cryptographic limitations on
computational complexity, can be proved based on
widely held and well supported assumptions that
certain hard cryptographic problems are insoluble.
In what follows we assume that there are no efficient
algorithms for common cryptographic problems
such as factoring Blum integers, inverting the RSA
function, recognizing quadratic residues or learning
noisy parity functions.
There may be algorithms that will learn with rea-
sonable amounts of data but that require unfeasibly
large amounts of computation to find. There are
a number of powerful negative results on learning
in the purely distribution-free situation we considered
and rejected above. Kearns and Valiant (1989)
showed that acyclic deterministic automata are not
learnable even with positive and negative examples.
Similarly, Abe and Warmuth (1992) showed
a slightly weaker representation-dependent result on
learning with a large alphabet for non-deterministic
automata, by showing that there are strings such that
maximising the likelihood of the string is NP-hard.
Again this does not strictly apply to the partially dis-
tribution free situation we have chosen.
However, there is one very strong result that appears
to apply. A straightforward consequence of
Kearns et al. (1994) is that acyclic deterministic
probabilistic FSAs over a two-letter alphabet cannot
be learned under another cryptographic assumption
(the noisy parity assumption). Therefore any
class of languages that includes this comparatively
weak family will not be learnable in our framework.
But this rests upon the assumption that the class
of possible human languages must include some
cryptographically hard functions. It appears that
our formal apparatus does not distinguish between
these cryptographic functions, which have been
consciously designed to be hard to learn, and natural
languages, which presumably have evolved to be
easy to learn, since there is no evolutionary pressure
to make them hard to decrypt – no intelligent predators
eavesdropping, for example. Clearly this is a
flaw in our analysis: we need to find some more
nuanced description for the class of possible human
languages that excludes these hard languages or dis-
tributions.
6 Positive results
There is a positive result that shows a way forward.
A PDFA is $\mu$-distinguishable if the distributions
generated from any two states differ by at least $\mu$ in
the $L_\infty$-norm, i.e. there is a string with a difference
in probability of at least $\mu$. (Ron et al., 1995)
showed that $\mu$-distinguishable acyclic PDFAs can
be PAC-learned using the KLD as error function
in time polynomial in $n, 1/\epsilon, 1/\delta, 1/\mu, |\Sigma|$. They
use a variant of a standard state-merging algorithm.
Since these are acyclic the languages they define
are always finite. This additional criterion of distin-
guishability suffices to guarantee learnability. This
work can be extended to cyclic automata (Clark and
Thollard, 2004a; Clark and Thollard, 2004b), and
thus the class of all regular languages, with the ad-
dition of a further parameter which bounds the ex-
pected length of a string generated from any state.
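The distinguishability condition can be made concrete for a small acyclic automaton. The following Python sketch is a brute-force illustration and not the state-merging algorithm of (Ron et al., 1995): it enumerates the finite set of strings generated from each state and reports the smallest value, over all pairs of states, of the largest difference in probability on any single string. The dictionary representation, the convention that leftover probability mass at a state is its stopping probability, and all function names are assumptions made for this example.

```python
from itertools import combinations


def suffix_distribution(pdfa, state):
    """Map each string generated from `state` to its probability.

    pdfa maps a state to {symbol: (probability, next_state)}; whatever
    mass is not assigned to a symbol is taken as the stopping probability.
    Only valid for acyclic automata, where the enumeration terminates.
    """
    dist = {}
    stack = [(state, "", 1.0)]
    while stack:
        q, prefix, p = stack.pop()
        stop = 1.0 - sum(pr for pr, _ in pdfa[q].values())
        if stop > 1e-12:
            dist[prefix] = dist.get(prefix, 0.0) + p * stop
        for symbol, (pr, nxt) in pdfa[q].items():
            stack.append((nxt, prefix + symbol, p * pr))
    return dist


def max_string_difference(d1, d2):
    """Largest difference in probability that the two distributions assign to one string."""
    return max(abs(d1.get(s, 0.0) - d2.get(s, 0.0)) for s in set(d1) | set(d2))


def distinguishability(pdfa):
    """Smallest pairwise difference between the suffix distributions of distinct states."""
    dists = {q: suffix_distribution(pdfa, q) for q in pdfa}
    return min(max_string_difference(dists[a], dists[b])
               for a, b in combinations(pdfa, 2))


# A four-state acyclic example over the alphabet {a, b}.
example = {
    "q0": {"a": (0.5, "q1"), "b": (0.5, "q2")},
    "q1": {"a": (0.9, "q3")},   # stops with probability 0.1
    "q2": {"a": (0.2, "q3")},   # stops with probability 0.8
    "q3": {},                   # always stops
}
mu = distinguishability(example)  # about 0.2, set by the pair (q2, q3)
```

In this example the states q2 and q3 are the hardest to tell apart, so they determine the distinguishability of the whole automaton.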
The use of distinguishability seems innocuous; in
syntactic terms it is a consequence of the plausible
condition that for any pair of distinct non-terminals
there is some fairly likely string generated by one
and not the other. Similarly, strings of symbols in
natural language tend to have limited length. An
alternative way of formalising this is to define a class
of distinguishable automata, where the distinguisha-
bility of the automata is lower bounded by an in-
verse polynomial in the number of states. This is
formally equivalent, but avoids adding terms to the
sample complexity polynomial. In summary this
would be a valid solution if all human languages
actually lay within the class of regular languages.
Note also the general properties of this kind of al-
gorithm: provably learning an infinite class of lan-
guages with infinite support using only polynomial
amounts of data and computation.
It is worth pointing out that the algorithm does
not need to “know” the values of the parameters.
Define a new parameter t, and set, for example, $n = t$,
$L = t$, $\delta = e^{-t}$, $\epsilon = t^{-1}$ and $\mu = t^{-1}$. This gives
a sample complexity polynomial in one parameter
q(t). Given a certain amount of data N we can just
choose the largest value of t such that q(t) < N,
and set the parameters accordingly.
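The selection of t from the amount of data can be sketched in a few lines of Python. The polynomial q below is purely hypothetical, since no concrete sample complexity bound is given here, and the settings n = t, L = t, delta = e^{-t}, epsilon = 1/t and mu = 1/t are taken as the example assignment; the function and key names are likewise assumptions.

```python
import math


def q(t):
    """Hypothetical sample-complexity polynomial in the single parameter t."""
    return 10 * t ** 4


def choose_parameters(num_samples):
    """Return the settings for the largest integer t >= 1 with q(t) < num_samples.

    Returns None when even t = 1 would need more data than is available.
    """
    if q(1) >= num_samples:
        return None
    t = 1
    # q is increasing, so a linear scan finds the largest admissible t.
    while q(t + 1) < num_samples:
        t += 1
    return {
        "t": t,
        "n": t,                  # bound on the number of states
        "L": t,                  # bound on the expected string length
        "delta": math.exp(-t),   # confidence parameter
        "epsilon": 1 / t,        # accuracy parameter
        "mu": 1 / t,             # distinguishability
    }


params = choose_parameters(10_000)  # q(5) = 6250 < 10000 <= q(6) = 12960, so t = 5
```

As more data arrive, t grows and every bound is relaxed at once, so the learner never needs to be told the true values in advance.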
7 Parametric models
We can now examine the relevance of these re-
sults to the distinction between parametric and non-
parametric languages. Parametric models are those
where the class of languages is parametrised by a
small set of finite-valued (binary) parameters, where
the number of parameters is small compared to the
$\log_2$ of the complexity of the languages. Without
this latter constraint the notion is mathematically
vacuous, since, for example, any context free gram-
mar in Chomsky normal form can be parametrised
with $N^3 + NM + 1$ binary parameters, where N
is the number of non-terminals and M the num-
ber of terminals. This constraint is also necessary
for parametric models to make testable empirical
predictions about language universals, developmental
evidence and the relationships between the
two (Hyams, 1986). We neglect here the important
issue of lexical learning: we assume, implausibly,
that lexical learning can take place completely be-
fore syntax learning commences. It has in the past
been stated that the finiteness of a language class
suffices to guarantee learnability even under a PAC-
learning criterion (Bertolo, 2001). This is, in gen-
eral, false, and arises from neglecting constraints on
the sample complexity and the computational com-
plexities both of learning and of parsing. The neg-
ative result of (Kearns et al., 1994) discussed above
applies also to parametric models. The specific class
of noisy parity functions that they prove to be unlearnable
is parametrised by a number of binary parameters
in a way very reminiscent of a parametric
model of language. The mere fact that there are a
finite number of parameters does not suffice to guar-
antee learnability, if the resulting class of languages
is exponentially large, or if there is no polynomial
algorithm for parsing. This does not imply that all
parametrised classes of languages will be unlearn-
able, only that having a small number of parame-
ters is neither necessary nor sufficient to guarantee
efficient learnability. If the parameters are shallow
and relate to easily detectable properties of the lan-
guages and are independent then learning can oc-
cur efficiently (Yang, 2002). If they are “deep” and
inter-related, learning may be impossible. Learn-
ability depends more on simple statistical properties
of the distributions of the samples than on the struc-
ture of the class of languages.
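The parameter count given earlier in this section for context free grammars in Chomsky normal form can be checked by counting candidate rules, each of which is switched on or off by one binary parameter. The decomposition below is the usual reading of that formula and is offered as an assumption, since the text does not spell it out; the numerical values of N and M are chosen only for illustration.

```latex
\begin{align*}
\#\text{parameters}
  &= \underbrace{N^3}_{\text{rules } A \to B\,C}
   + \underbrace{NM}_{\text{rules } A \to a}
   + \underbrace{1}_{\text{rule } S \to \epsilon}, \\
\#\text{grammars} &= 2^{\,N^3 + NM + 1}, \\
N = 10,\ M = 100 &\;\Longrightarrow\; 10^3 + 10 \cdot 100 + 1 = 2001 \text{ parameters}.
\end{align*}
```

The number of parameters here equals the $\log_2$ of the number of grammars, which is why a parametrisation of this kind, although finite, imposes no real constraint on the class.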
Our conclusion then is ultimately that the theory
of learnability will not be able to resolve disputes
about the nature of first language acquisition: these
problems will have to be answered by empirical re-
search, rather than by mathematical analysis.
Acknowledgements
This work was supported in part by the IST
Programme of the European Community, under
the PASCAL Network of Excellence, IST-2002-
506778, funded in part by the Swiss Federal Office
for Education and Science (OFES). This publication
only reflects the authors’ views.

References
N. Abe and M. K. Warmuth. 1992. On the com-
putational complexity of approximating distribu-
tions by probabilistic automata. Machine Learn-
ing, 9:205–260.
N. Abe. 1988. Feasible learnability of formal gram-
mars and the theory of natural language acquisi-
tion. In Proceedings of COLING 1988, pages 1–
6.
S. Abney, D. McAllester, and F. Pereira. 1999. Re-
lating probabilistic grammars and automata. In
Proceedings of ACL ’99.
Stefano Bertolo. 2001. A brief overview of learn-
ability. In Stefano Bertolo, editor, Language Ac-
quisition and Learnability. Cambridge University
Press.
Noam Chomsky. 1986. Knowledge of Language: Its
Nature, Origin, and Use. Praeger.
Alexander Clark and Franck Thollard. 2004a.
PAC-learnability of probabilistic deterministic fi-
nite state automata. Journal of Machine Learning
Research, 5:473–497, May.
Alexander Clark and Franck Thollard. 2004b. Par-
tially distribution-free learning of regular lan-
guages from positive samples. In Proceedings of
COLING, Geneva, Switzerland.
F. Denis. 2001. Learning regular languages from
simple positive examples. Machine Learning,
44(1/2):37–66.
E. M. Gold. 1967. Language identification in the
limit. Information and Control, 10(5):447–474.
S. A. Goldman and H. D. Mathias. 1996. Teach-
ing a smarter learner. Journal of Computer and
System Sciences, 52(2):255–267.
N. Hyams. 1986. Language Acquisition and the
Theory of Parameters. D. Reidel.
M. Kearns and L. G. Valiant. 1989. Cryptographic
limitations on learning boolean formulae and finite
automata. In 21st Annual ACM Symposium
on Theory of Computing, pages 433–444, New
York. ACM.
M.J. Kearns, Y. Mansour, D. Ron, R. Rubinfeld,
R.E. Schapire, and L. Sellie. 1994. On the learnability
of discrete distributions. In Proc. of the
26th Annual ACM Symposium on Theory of Computing,
pages 273–282.
W. Labov. 1975. Empirical foundations of linguis-
tic theory. In R. Austerlitz, editor, The Scope of
American Linguistics. Peter de Ridder Press.
G. F. Marcus. 1993. Negative evidence in language
acquisition. Cognition, 46:53–85.
D. Ron, Y. Singer, and N. Tishby. 1995. On the
learnability and usage of acyclic probabilistic fi-
nite automata. In COLT 1995, pages 31–40,
Santa Cruz CA USA. ACM.
L. Valiant. 1984. A theory of the learnable. Communications
of the ACM, 27(11):1134–1142.
K. Vijay-Shanker and David J. Weir. 1994.
The equivalence of four extensions of context-
free grammars. Mathematical Systems Theory,
27(6):511–546.
Kenneth Wexler and Peter W. Culicover. 1980. For-
mal Principles of Language Acquisition. MIT
Press.
C. Yang. 2002. Knowledge and Learning in Natu-
ral Language. Oxford.
