Partially Distribution-Free Learning of Regular Languages from
Positive Samples
Alexander Clark
ISSCO / TIM, University of Geneva
UNI-MAIL, Boulevard du Pont-d'Arve,
CH-1211 Genève 4, Switzerland
asc@aclark.demon.co.uk
Franck Thollard
EURISE, Université Jean Monnet,
23, Rue du Docteur Paul Michelon,
42023 Saint-Étienne Cedex 2, France
thollard@univ-st-etienne.fr
Abstract
Regular languages are widely used in NLP to-
day in spite of their shortcomings. Efficient
algorithms that can reliably learn these lan-
guages, and which in realistic applications must
use only positive samples, are necessary. These
languages are not learnable under traditional
distribution free criteria. We claim that an ap-
propriate learning framework is PAC learning
where the distributions are constrained to be
generated by a class of stochastic automata with
support equal to the target concept. We discuss
how this is related to other learning paradigms.
We then present a simple learning algorithm
for regular languages, and a self-contained proof
that it learns according to this partially distri-
bution free criterion.
1 Introduction
Regular languages, especially those generated by
deterministic finite state automata, are widely used
in Natural Language Processing for various dif-
ferent tasks (Mohri, 1997). Efficient learning
algorithms, that have some guarantees of cor-
rectness, would clearly be useful. Existing al-
gorithms for learning deterministic automata,
such as (Carrasco and Oncina, 1994) have only
guarantees of identification in the limit (Gold,
1967), generally considered not to be a good
guide to practical utility. Unfortunately the
prospects for learning according to the more
useful PAC-learning criterion are poor after the
well-known result of (Kearns and Valiant, 1989).
Distribution-free learning criteria require algo-
rithms to learn for every possible combination
of concept and distribution. Under this worst-
case analysis many simple concept classes are
unlearnable. However in many situations it is
more realistic to assume that there is some re-
lationship between the concept and the distri-
bution, and furthermore in general only positive
examples will be available.
There are two ways of modelling this. The
simplest is to study the learnability of distribu-
tions (Kearns et al., 1994; Ron et al., 1995). In
this case the samples are drawn from the distri-
bution that is being learned. The choice of error
function then becomes critical: the most nat-
ural (and most difficult) being the Kullback-Leibler
divergence. This means that any successful al-
gorithm must produce hypotheses that assign a
non-zero probability to every string. If what we
are interested in is learning the underlying non-
probabilistic concept then these hypotheses will
be useless. We have elsewhere proved (Clark
and Thollard, 2004) a suitable result similar to
that of (Ron et al., 1995), bounding the diver-
gence, but that proof involves some more elab-
orate technical machinery.
The second way is to consider a traditional
concept-learning problem, but to restrict the
class of distributions to some set that only gen-
erates positive examples, and has some relation
to the target concept. It is this latter possibility
that we explore here.
In the particular case of learning languages
we will have an instance space of Σ* for some
finite alphabet Σ, and we shall have a concept
class, in this paper, corresponding to the class
of all regular languages. In a distribution-free
setting this is not learnable from positive and
negative samples, nor a fortiori from positive
samples alone. In our partially distribution-free
framework however, we are able to prove learn-
ability with an additional parameter in the sam-
ple complexity polynomial, that bounds a sim-
ple property of the distribution. We are able to
present a simple stand-alone proof for this well-
studied class of languages.
The rest of the paper is structured as follows.
Section 2 argues for a modified version of PAC
learning as being an appropriate learning frame-
work for a range of NLP problems. After defin-
ing some notation in Section 3, we then define
an algorithm that learns regular languages (Sec-
tion 4) and then in Section 5 prove that it does
so according to this modified PAC-learnability
criterion. We conclude with a critical analysis
of our results.
2 Appropriateness
Regular languages are widely used in a num-
ber of different applications drawn from nu-
merous domains such as computational biology,
robotics, etc. In many of these areas, efficient
learning algorithms are desirable, but in each the
exact requirements will be different since the
sources of information, and the desired proper-
ties of the algorithms vary widely. We argue
here that learning algorithms in NLP have cer-
tain special properties that make the particu-
lar learnability result we study here useful. The
most important feature in our opinion is the ne-
cessity for learning from positive examples only.
Negative examples in NLP are rarely available.
Even in a binary classification problem, there
will often be some overlap between the classes,
so that examples labelled with − are not nec-
essarily negative examples of the class labelled
with +. For this reason alone we consider a tra-
ditional distribution-free PAC-learning frame-
work to be wholly inappropriate. An essential
part of the PAC-learning framework is a sort
of symmetry between the positive and negative
examples. Furthermore, there are a number
of negative results which rule out distribution
free learning of regular languages (Kearns et al.,
1994).
A related problem is that in the sorts of learn-
ing situations that occur in practice in NLP
problems, and also those such as first language
acquisition that one wishes to model formally,
the distribution of examples is dependent on the
concept being learned. Thus if we are modelling
the acquisition of the grammar of a language,
the positive examples are the grammatical, or
perhaps acceptable, sentences of the target lan-
guage. The distribution of examples is clearly
highly dependent on the particular language,
simply as a matter of fact, in that the sentences
in the sample are generated by people who have
acquired the language.
It thus seems reasonable to require the dis-
tribution to be drawn from some limited class
that depends on the target concept and gen-
erates only positive examples, i.e. where the
support of the distribution is identical to the
positive part of the target concept.
Our proposal is that when the class of lan-
guages is defined by some simple class of au-
tomata, we can consider only those distribu-
tions generated by the corresponding stochas-
tic automata. The set of distributions is re-
stricted and thus we call this partially distribu-
tion free. Thus when learning the class of regu-
lar languages, which are generated by determin-
istic finite-state automata, we select the class
of distributions which are generated by PDFAs.
Similarly, context-free languages are normally
defined by context-free grammars, which can
be extended again to probabilistic or stochastic
context-free grammars.
Formally, for every class of languages, L, de-
fined by some formal device, we define a class of
distributions, D, defined by a stochastic variant
of that device. Then for each language L, we
select the set of distributions whose support is
equal to the language:

D+_L = {D ∈ D : ∀s ∈ Σ*, s ∈ L ⇔ P_D(s) > 0}
Samples are drawn from one of these distri-
butions. There are two technical problems here:
first, this does not penalise over-generalisation.
Since the distribution is over positive examples,
negative examples have zero weight, which
would give a hypothesis of all strings zero er-
ror. We therefore need some penalty function
over negative examples, or alternatively require
the hypothesis to be a subset of the target, and
use a one-sided loss function as in Valiant's orig-
inal paper (Valiant, 1984), which is what we
do here. Secondly, this definition is too vague.
The exact way in which the "crisp" language
is extended to a stochastic one can have serious
consequences. When dealing with regular lan-
guages, for example, though the class of lan-
guages defined by deterministic automata is the
same as that defined by non-deterministic au-
tomata, the same is not true for their stochas-
tic variants. Additionally, one can have expo-
nential blow-ups in the number of states when
determinizing automata. Similarly, with con-
text-free languages, (Abney et al., 1999) showed
that two parametrisations of models for stochas-
tic context-free languages are equivalent, but
that converting between them can incur blow-
ups in both directions.
It is interesting to compare this to the PAC-
learning with simple distributions model (De-
nis, 2001). There, the class of distributions
is limited to a single distribution derived from
algorithmic complexity theory. There are a
number of reasons why this is not appropriate.
First there is a computational issue: since Kol-
mogorov complexity is not computable, sam-
pling from the distribution is not possible,
though a lower bound on the probabilities can
be defined. Secondly, there are very large con-
stants in the sample complexity polynomial. Fi-
nally and most importantly, there is no reason
to think that in the real world, samples will be
drawn from this distribution; in some sense it
is the easiest distribution to learn from since it
dominates every other distribution up to a mul-
tiplicative factor.
We reject the identification in the limit
paradigm introduced by (Gold, 1967) as un-
suitable for three reasons. First, it is only an
asymptotic criterion that says nothing about the
performance of the algorithms on finite amounts
of data; secondly, because the algorithm must
learn under all presentations of the data, even
when these are chosen by an adversary to make
learning hard; and thirdly, because it places no
bounds on the amount of computation allowed.
An alternative way to conceive of this prob-
lem is to consider the task of learning distri-
butions directly (Kearns et al., 1994), a task
related to probability density estimation and
language modelling, where the algorithm is
given examples drawn from a distribution and
must approximate the distribution closely ac-
cording to some distance metric: usually the
Kullback-Leibler divergence or the variational
distance. We consider the choice between the
distribution-learning analysis and the analysis
we present here to depend on the underlying
task or phenomenon being modelled. If
it is the probability of an event occurring, then
the distribution-modelling analysis is better. If
on the other hand it concerns binary judgments
about the membership of strings in some set
then the analysis we present here is preferable.
The result of (Kearns et al., 1994) shows up
a further problem. Under a standard crypto-
graphic assumption, the class of acyclic PDFAs
over a two-letter alphabet is not learnable,
since the class of noisy parity functions can be
embedded in this simple subclass of PDFAs.
(Ron et al., 1995) show that this can be cir-
cumvented by adding an additional parameter
to the sample complexity polynomial, the dis-
tinguishability, which we define below.
3 Preliminaries
We will write σ for letters and s for strings.
We have a finite alphabet Σ, and Σ* is the
free monoid generated by Σ, i.e. the set of all
strings with letters from Σ, with λ the empty
string (identity). For s ∈ Σ* we define |s| to
be the length of s. The subset of Σ* of strings
of length d is denoted by Σ^d. A distribution
or stochastic language D over Σ* is a function
D : Σ* → [0, 1] such that ∑_{s ∈ Σ*} D(s) = 1. The
L∞ norm between two distributions is defined as
L∞(D1, D2) = max_s |D1(s) − D2(s)|. For a
multiset of strings S we write Ŝ for the empirical
distribution defined by that multiset, i.e. the max-
imum likelihood estimate of the probability of
each string.
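These two notions, the empirical distribution of a multiset and the L∞ distance, can be computed directly; the following is a minimal Python sketch (the function names are our own, not from the paper):

```python
from collections import Counter

def empirical_distribution(sample):
    """Maximum-likelihood estimate: the relative frequency of each
    string in the multiset (represented here as a list)."""
    counts = Counter(sample)
    m = len(sample)
    return {s: c / m for s, c in counts.items()}

def l_inf(d1, d2):
    """L-infinity distance between two distributions given as dicts
    from strings to probabilities; missing strings have probability 0."""
    support = set(d1) | set(d2)
    return max(abs(d1.get(s, 0.0) - d2.get(s, 0.0)) for s in support)
```

For example, the empirical distribution of the multiset {a, a, b, ab} assigns a the probability 1/2, and its L∞ distance to the distribution {a: 1/2, b: 1/2} is 1/4, attained both at b and at ab.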
A probabilistic deterministic finite state
automaton (PDFA) is a mathematical object that
stochastically generates strings of symbols.
It has a finite number of states, one of which
is a distinguished start state. Parsing or
generation starts in the start state and at
each step makes a transition with a
certain probability to another state and emits
a symbol. A particular symbol and
state correspond to finishing.
A PDFA A is a tuple (Q, Σ, q0, qf, ζ, τ, γ),
where

• Q is a finite set of states,
• Σ is the alphabet, a finite set of symbols,
• q0 ∈ Q is the single initial state,
• qf ∉ Q is the final state,
• ζ ∉ Σ is the final symbol,
• τ : Q × (Σ ∪ {ζ}) → Q ∪ {qf} is the transition
  function, and
• γ : Q × (Σ ∪ {ζ}) → [0, 1] is the next-symbol
  probability function; γ(q, σ) = 0 when
  τ(q, σ) is not defined.

We will sometimes refer to automata by their
set of states. All transitions that emit ζ go to
the final state. In the following, τ and γ will
be extended to strings recursively in the normal
way.
The sum of the output transition probabilities
from each state must be one: so for all q ∈ Q,

∑_{σ ∈ Σ ∪ {ζ}} γ(q, σ) = 1   (1)

Assuming further that there is a non-zero proba-
bility of reaching the final state from each state,
i.e.

∀q ∈ Q ∃s ∈ Σ* : τ(q, sζ) = qf ∧ γ(q, sζ) > 0   (2)
the PDFA then defines a probability distribu-
tion over Σ*, where the probability of generat-
ing a string s ∈ Σ* is PA(s) = γ(q0, sζ). We will
write L(A) for the support of this distribution,
L(A) = {s ∈ Σ* : PA(s) > 0}. We will also
define Pq(s) = γ(q, sζ), which we call the suffix
distribution of the state q.
We say that two states q, q′ are μ-
distinguishable if L∞(Pq, Pq′) ≥ μ for some μ >
0. An automaton is μ-distinguishable iff every
pair of distinct states is μ-distinguishable. Since
we can merge states q, q′ which have
L∞(Pq, Pq′) = 0, we can assume without loss of
generality that every PDFA has a non-zero
distinguishability.
Note that γ(q0, s) for s ∈ Σ* is the prefix
probability of the string s, i.e. the probability
that the automaton will generate a string that
starts with s.
We will use a similar notation, neglecting the
probability function, for (non-probabilistic) de-
terministic finite-state automata (DFAs).
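To make the definition concrete, a PDFA can be sketched in Python as follows. This is our own illustrative encoding, not from the paper: the dictionaries `trans` and `prob` play the roles of τ and γ, and a sentinel object stands in for the final symbol ζ.

```python
import random

class PDFA:
    """A probabilistic deterministic finite automaton.

    trans[(q, sigma)] gives the next state (the role of tau);
    prob[(q, sigma)] gives the emission probability (the role of
    gamma).  Emitting ZETA corresponds to moving to the final state.
    """
    ZETA = object()  # sentinel for the final symbol, not in the alphabet

    def __init__(self, trans, prob, q0):
        self.trans, self.prob, self.q0 = trans, prob, q0

    def string_prob(self, s, q=None):
        """P_q(s) = gamma(q, s zeta): the probability that the suffix
        distribution of state q generates exactly the string s."""
        q = self.q0 if q is None else q
        p = 1.0
        for sigma in s:
            p *= self.prob.get((q, sigma), 0.0)
            if p == 0.0:
                return 0.0
            q = self.trans[(q, sigma)]
        return p * self.prob.get((q, PDFA.ZETA), 0.0)

    def sample(self, rng=random):
        """Generate one string by walking from q0 until ZETA is emitted."""
        q, out = self.q0, []
        while True:
            symbols = [sig for (st, sig) in self.prob if st == q]
            weights = [self.prob[(q, sig)] for sig in symbols]
            sigma = rng.choices(symbols, weights=weights)[0]
            if sigma is PDFA.ZETA:
                return "".join(out)
            out.append(sigma)
            q = self.trans[(q, sigma)]
```

For instance, a one-state automaton with γ(q0, a) = γ(q0, ζ) = 1/2 generates aⁿ with probability 2^(−(n+1)), so string_prob("") is 0.5 and string_prob("aa") is 0.125.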
4 Algorithm
We shall first state our main result.
Theorem 1 For any regular language L, when
samples are generated by a PDFA A where
L(A) = L, with distinguishability μ and num-
ber of states n, for any ε, δ > 0, the algorithm
LearnDFA will with probability at least 1 − δ re-
turn a DFA H which defines a language L(H)
that is a subset of L with PA(L(A) \ L(H)) < ε.
The algorithm will draw a number of samples
bounded by a polynomial in |Σ|, n, 1/ε, 1/δ, 1/μ,
and the computation is bounded by a polynomial
in the number of samples and the total length of
the strings in the sample.
We now define the algorithm LearnDFA. We
incrementally construct a sequence of DFAs
that will generate subsets of the target lan-
guage. Each state of the hypothesis automata
will represent a state of the target and will have
attached a multiset of strings that approximates
the distribution of strings generated by that
state. We calculate the following quantities m0
and N from the input parameters.
m0 = (8/μ²) log(48n|Σ|(n|Σ| + 2)/(δμ))   (3)

N = 2n|Σ|m0/ε   (4)
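For concreteness, the two thresholds can be computed as follows (a small sketch; the helper name is ours, and the constants follow equations (3) and (4) as reconstructed above):

```python
import math

def sample_sizes(n, alphabet_size, mu, delta, eps):
    """Compute the multiset threshold m0 (eq. 3) and the per-round
    sample size N (eq. 4) from the learning parameters."""
    k = n * alphabet_size  # n|Sigma|, a bound on the number of target transitions
    m0 = math.ceil(8.0 / mu ** 2 * math.log(48.0 * k * (k + 2) / (delta * mu)))
    N = math.ceil(2.0 * k * m0 / eps)
    return m0, N
```

For instance, a 5-state target over a 2-letter alphabet with μ = 0.2, δ = 0.05 and ε = 0.1 already requires m0 = 2653 and N = 530600, which illustrates that the constants, while polynomial, are large.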
We start with an automaton that consists of a
single state and no transitions, and the attached
multiset is a sample of strings from the target.
At each step we sample N strings from the tar-
get distribution. This re-sampling ensures the
independence of all of the samples, and allows
us to apply bounds in a straightforward way.
For each state u in the hypothesis automaton
and letter σ in the alphabet such that there is
no arc labelled with σ out of u, we construct a
candidate node (u, σ) which represents the state
reached from u by the transition labelled with σ.
For each string in the sample, we trace the cor-
responding path through the hypothesis. When
we reach a candidate node, we remove the pre-
ceding part of the string and add the rest to
the multiset of the candidate node. Otherwise,
in the case where the string terminates in the
hypothesis automaton, we discard the string.
After we have done this for every string in
the sample, we select a candidate node (u, σ)
that has a multiset of size at least m0. If there
is no such candidate node, the algorithm ter-
minates. Otherwise we compare this candidate
node with each of the nodes already in the hy-
pothesis. The comparison calculates the
L∞-norm between the empirical distributions of
the two multisets and says they are similar if
this distance is less than μ/2. We will make
sure that with high probability these empirical
distributions are close in the L∞-norm to the
suffix distributions of the states they represent.
Since we know that the suffix distributions of
different states will be at least μ apart, we can
be confident that we will only rarely make mis-
takes. If there is a node, v, which is similar, then
we conclude that v and (u, σ) represent the same
state. We therefore add an arc labelled with σ
leading from u to v. If it is not similar to any
node in the hypothesis, then we conclude that
it represents a new state, and we create a new
node u′ and add an arc labelled with σ leading
from u to u′. In this case we attach the mul-
tiset of the candidate node to the new node in
the hypothesis. Intuitively this multiset will be
a sample from the suffix distribution of the state
of the target that it represents. We then discard
all of the candidate nodes and their associated
multisets, but keep the multisets attached to the
states of the hypothesis, and repeat.
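The loop described above can be sketched in Python. This is an illustrative sketch only, not the authors' implementation: the sampler `sample_n` (a callable drawing N fresh strings per call), the helper names, and the simplified candidate bookkeeping are our own, and we omit refinements such as capping each multiset comparison at m0 strings.

```python
from collections import Counter

def l_inf(c1, c2):
    """L-infinity distance between the empirical distributions of two
    multisets represented as Counters."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    return max(abs(c1[s] / n1 - c2[s] / n2) for s in set(c1) | set(c2))

def learn_dfa(sample_n, m0, mu):
    """Sketch of LearnDFA.  States are integers (0 is the root), trans
    maps (state, letter) -> state, and multiset[q] holds the suffix
    sample attached to state q."""
    trans = {}
    multiset = {0: Counter(sample_n())}  # initial suffix sample for the root
    n_states = 1
    while True:
        # Route each fresh string through the hypothesis; suffixes that
        # leave through a missing arc land in a candidate node (u, a).
        candidates = {}
        for s in sample_n():
            q, i = 0, 0
            while i < len(s) and (q, s[i]) in trans:
                q = trans[(q, s[i])]
                i += 1
            if i < len(s):  # string exits at candidate node (q, s[i])
                candidates.setdefault((q, s[i]), Counter())[s[i + 1:]] += 1
            # strings that terminate inside the hypothesis are discarded
        big = [(ua, c) for ua, c in candidates.items()
               if sum(c.values()) >= m0]
        if not big:
            return trans, n_states  # no candidate above threshold: stop
        (u, a), cand = big[0]
        # Merge with a similar existing state, else create a new state.
        for v in range(n_states):
            if l_inf(cand, multiset[v]) < mu / 2:
                trans[(u, a)] = v
                break
        else:
            trans[(u, a)] = n_states
            multiset[n_states] = cand
            n_states += 1
```

On samples from a one-state target generating a geometric distribution over a*, this sketch adds the self-loop on a and then stops with a single state, as expected.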
5 Proof
We can now prove that this algorithm has the
properties we claim. We use one technical
lemma, which we prove in the appendix.
Lemma 1 Given a distribution D over Σ*, for
any ε′ < 1/2 and δ′ > 0, when we independently
draw a number of samples m greater than

m0 = (1/(2ε′²)) log(12/(ε′δ′))

into a multiset S, then L∞(Ŝ, D) < ε′ with
probability greater than 1 − δ′.
Let H0, H1, …, Hk be the sequence of finite
automata, with states labelled with multisets,
generated by the algorithm when samples are
drawn from a target PDFA A.
We will say that a hypothesis automaton Hi
is good if there is a bijective function Φ from
a subset of the states of A including q0, to all the
states of Hi, such that Φ(q0) is the root node
of Hi, and if there is an edge in Hi such that
δ(u, σ) = v then τ(Φ⁻¹(u), σ) = Φ⁻¹(v), i.e. if
Hi is isomorphic to a subgraph of the target that
includes the root. If Φ(q) = u then we say that
u represents q. In this case the language gen-
erated by Hi is a subset of the target language.
Additionally we require that for every state v in
the hypothesis, the corresponding multiset sat-
isfies L∞(Ŝv, P_{Φ⁻¹(v)}) < μ/4. When a multiset
satisfies this we will say it is μ-good.
We will extend the function Φ to candidate
nodes in the obvious way, and also the definition
of μ-good.
Definition 1 (Good sample) We say that a
sample of size N is (μ, ε)-good, given a good hy-
pothesis DFA H and a target A, if all the candi-
date nodes with multisets larger than the thresh-
old m0 are μ-good, and if PA(L(A) \
L(H)) > ε then the number of strings that
exit the hypothesis automaton is more than
(1/2)N · PA(L(A) \ L(H)).
5.1 Approximately Correct
We will now show that if all the samples are
good, then for all i ∈ {0, 1, …, k} the hypothesis
Hi will be good, and that when the algorithm
terminates the final hypothesis will have low error.
We will do this by induction on the index i of
the hypothesis Hi. Clearly H0 is good. Sup-
pose Hi−1 is good, and we draw a good sample.
Consider a candidate node (u, σ) with a multiset
of size greater than m0.
Since the previous hypothesis was good, this
node will be a representative of a state q, and thus
the multiset will be a sequence of independent
draws from the suffix distribution Pq of this state.
Thus L∞(Ŝ_{u,σ}, Pq) < μ/4 by the good-
ness of the sample. We compare it to a state
v in the hypothesis. If this state is a rep-
resentative of the same state q in the target,
then L∞(Ŝv, Pq) < μ/4 (by the goodness of the
multisets), the triangle inequality shows that
L∞(Ŝ_{u,σ}, Ŝv) < μ/2, and therefore the compar-
ison will return true. On the other hand, let us
suppose that v is a representative of a different
state qv. We know that L∞(Ŝ_{u,σ}, Pq) < μ/4
and L∞(Ŝv, P_{qv}) < μ/4 (by the goodness of
the multisets), and L∞(Pq, P_{qv}) ≥ μ (by the
μ-distinguishability of the target). By the tri-
angle inequality L∞(Pq, P_{qv}) ≤ L∞(Ŝ_{u,σ}, Pq) +
L∞(Ŝ_{u,σ}, Ŝv) + L∞(Ŝv, P_{qv}), which implies that
L∞(Ŝ_{u,σ}, Ŝv) ≥ μ/2, and the comparison will
return false. In both cases Hi will be good.
Alternatively there is no candidate node above
threshold, in which case the algorithm termi-
nates and i = k. The total number of strings
that exit the hypothesis must then be less than
n|Σ|m0, since there are at most n|Σ| candidate
nodes, each of which has a multiset of size less
than m0. By the definition of N and the goodness
of the sample, PA(L(A) \ L(H)) < ε. Since the
hypothesis is good, and thus defines a subset of
the target language, this is a suitably close
hypothesis.
5.2 Probably Correct
We must now show that by setting m0 suffi-
ciently large we can be sure that with probabil-
ity greater than 1 − δ all of the samples will be
good. We need to show that with high prob-
ability a sample of size N will be good for a
given hypothesis G. We can assume that the
hypothesis is good at each step. Each step of
the algorithm will increase the number of tran-
sitions in the hypothesis by at least 1. There are
at most n|Σ| transitions in the target, so there
are at most n|Σ| + 2 steps in the algorithm, since
we need an initial step to get the multiset for
the root node and another at the end when we
terminate. So we want to show that a particu-
lar sample will be good with probability at least
1 − δ/(n|Σ| + 2).
There are two sorts of errors that can make
the sample bad. First, one of the multisets could
be bad, and secondly, too few strings might exit
the graph. There are at most n|Σ| candidate
nodes, so we will make the probability of getting
a bad multiset less than δ/(2n|Σ|(n|Σ| + 2)), and
we will make the probability of the second sort
of error less than δ/(2(n|Σ| + 2)).
First we bound the probability of getting a
bad multiset of size m0. This will be satisfied
if we set ε′ = μ/4 and δ′ = δ/(2n|Σ|(n|Σ| + 2)),
and use Lemma 1.
We next need to show that at each step the
number of strings that exit the graph will be
not too far from its expectation, if PA(L(A) \
L(H)) > ε. We can use Chernoff bounds to
show that the probability that too few strings exit
the graph will be less than δ/(2(n|Σ| + 2)):

e^{−N·PA(L(A)\L(H))/4} < e^{−Nε/4} < δ/(2(n|Σ| + 2))

which will be satisfied by the value of N de-
fined earlier, as can be easily verified.
5.3 Polynomial complexity
Since we need to draw at most n|Σ| + 2 samples
of size N, the overall sample complexity will be
(n|Σ| + 2)N, which ignoring log factors gives a
sample complexity of O(n²|Σ|²μ⁻²ε⁻¹), which
is quite benign. It is easy to see that the com-
putational complexity is polynomial. Produc-
ing an exact bound is difficult since it depends
on the length of the strings. The precise com-
plexity also depends on the relative magnitudes
of ε, |Σ| and so on. The complexity is domi-
nated by the cost of the comparisons. We can
limit each multiset comparison to at most m0
strings, which can be compared naively with m0²
string comparisons, or much more efficiently us-
ing hashing or sorting. The number of nodes
in the hypothesis is at most n, and the num-
ber of candidate nodes is at most n|Σ|, so the
number of comparisons at each step is bounded
by n²|Σ|, and thus the total number of multiset
comparisons by n²|Σ|(n|Σ| + 2). Construction of
the multisets can be performed in time linear in
the sample size. These observations suffice to show
that the computation is polynomially bounded.
6 Discussion
The convergence of these sorts of algorithms
has been studied before in the identification in
the limit framework, but previous proofs have
not been completely convincing (Carrasco and
Oncina, 1999), and this criterion gives no guide
to the practical utility of the algorithms, since it
applies only asymptotically. The partially dis-
tribution-free learning problem we study here
is novel, as is the extension of the results of
(Ron et al., 1995) to cyclic automata and thus
to infinite languages.
Before we examine our results critically, we
would like to point out some positive aspects of
the algorithm. First, this class of algorithms is
in practice efficient and reliable. This particular
algorithm is designed to have a provably good
worst-case performance, and thus we anticipate
its average performance on naturally occurring
data to be marginally worse than that of compa-
rable algorithms. We have established that we
can learn an exponentially large family of infinite
languages using polynomial amounts of data
and computation. Mild properties of the in-
put distributions suffice to guarantee learnabil-
ity. The algorithm we present here is, however,
not intended to be efficient or cognitively plau-
sible: our intention was to find one that allowed
a simple proof.
The major weakness of this approach, in our
opinion, is that the parameter n in the sample
complexity polynomial is the number of states
in the PDFA generating the distribution, and
not the number of states in the minimal FA gen-
erating the language. Since determinisation of
finite automata can cause exponential blow-ups,
this is potentially a serious problem, depending
on the application domain. A second problem
is the need for a distinguishability parameter,
which again in specific cases could be exponen-
tially small. An alternative to this is to define
a class of μ-distinguishable automata where the
distinguishability is bounded by an inverse poly-
nomial in the number of states. Formally this is
equivalent, but it has the effect of removing the
parameter from the sample complexity polyno-
mial at the cost of placing a further restriction
on the class of distributions. Indeed, we can deal
with the previous objection in the same way if
necessary, by requiring the number of states in
the generating PDFA to be bounded by a poly-
nomial in the minimal number of states needed
to generate the target language. However, both
of these limitations are unavoidable given the
negative results previously discussed.
Appendix
Proof of Lemma 1.
We write p(s) for the true probability and
p̂(s) = c(s)/m for the empirical probability of
a string in the sample, i.e. the maximum
likelihood estimate. We want to bound the
probability of error over an infinite number of
strings, which rules out a naive application of
Hoeffding bounds. It will suffice to show that
every string with probability less than ε′/2 will
have empirical probability less than ε′, and that
all other strings will have empirical probability
within ε′ of their true values. The latter is
straightforward, since there are at most 2/ε′ of
these frequent strings. For any given frequent
string s, by Hoeffding bounds:

Pr[|p̂(s) − p(s)| > ε′] < 2e^{−2mε′²} ≤ 2e^{−2m0ε′²}   (5)

So the probability of making an error on some
frequent string is less than (4/ε′)e^{−2m0ε′²}.
Consider all of the strings whose probability
is in [ε′2^{−(k+1)}, ε′2^{−k}):

Sk = {s ∈ Σ* : p(s) ∈ [ε′2^{−(k+1)}, ε′2^{−k})}   (6)

We define S_rare = ∪_{k=1}^∞ Sk. Note that each
string in Sk has probability at least ε′2^{−(k+1)},
so |Sk| ≤ 2^{k+1}/ε′. The Chernoff bound says
that for any β > 0, for the sum X of n Bernoulli
variables each with probability p,

Pr(X > (1 + β)np) < (e^β / (1 + β)^{(1+β)})^{np}   (7)

Now we bound each group separately, using
the binomial Chernoff bound with threshold
n = mε′ > mp (which is true since p < ε′):

Pr[p̂(s) ≥ ε′] ≤ (mp/n)^n e^{n−mp}   (8)

This bound is monotone in p, so for all strings
in Sk we can replace p with the upper bound
ε′2^{−k} of their probabilities, and since m ≥ m0
we can replace m with m0:

Pr[p̂(s) ≥ ε′] ≤ ((m0ε′2^{−k})/(m0ε′))^{m0ε′} e^{m0ε′ − m0ε′2^{−k}}
 = (2^{−k} e^{1−2^{−k}})^{m0ε′}
 ≤ 2^{−km0ε′}

Assuming that m0ε′ > 3,

Pr[p̂(s) ≥ ε′] < 2^{−2k} 2^{2−m0ε′}

Pr[∃s ∈ Sk : p̂(s) ≥ ε′] ≤ |Sk| 2^{−2k} 2^{2−m0ε′} ≤ (8/ε′) 2^{−k} 2^{−m0ε′}

Using the factor of the form 2^{−k}, we can sum
over all of the k:

Pr[∃s ∈ S_rare : p̂(s) ≥ ε′] ≤ (8/ε′) 2^{−m0ε′} ∑_{k=1}^∞ 2^{−k} = (8/ε′) 2^{−m0ε′}
Putting these together, we can show that the
probability of the bound being exceeded will be
at most

(4/ε′) e^{−2m0ε′²} + (8/ε′) 2^{−m0ε′} < (12/ε′) e^{−2m0ε′²}   (9)

This will be less than δ′ if

m0 = (1/(2ε′²)) log(12/(ε′δ′))   (10)

which establishes the result. □
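As a quick empirical sanity check of the lemma (our own illustration, not part of the proof), one can repeatedly draw m0 samples from a fixed finite-support distribution and measure how often the empirical distribution strays by ε′ or more in the L∞ norm:

```python
import math
import random
from collections import Counter

def lemma1_check(dist, eps, delta, trials=200, seed=0):
    """Draw m0 = ceil((1/(2*eps^2)) * log(12/(eps*delta))) samples from
    dist (a dict mapping strings to probabilities) in each trial, and
    report the fraction of trials in which the empirical distribution
    is at L-infinity distance >= eps from dist.  Lemma 1 predicts this
    fraction to be below delta."""
    m0 = math.ceil(1.0 / (2 * eps ** 2) * math.log(12 / (eps * delta)))
    rng = random.Random(seed)
    strings, probs = zip(*dist.items())
    failures = 0
    for _ in range(trials):
        counts = Counter(rng.choices(strings, weights=probs, k=m0))
        if max(abs(counts[s] / m0 - dist[s]) for s in dist) >= eps:
            failures += 1
    return m0, failures / trials
```

With dist = {"": 0.5, "a": 0.3, "ab": 0.2}, ε′ = 0.1 and δ′ = 0.05, this gives m0 = 390, and the observed failure rate stays below δ′.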

References

S. Abney, D. McAllester, and F. Pereira.
1999. Relating probabilistic grammars and
automata. In Proceedings of ACL ’99.

R. C. Carrasco and J. Oncina. 1994. Learn-
ing stochastic regular grammars by means of
a state merging method. In R. C. Carrasco
and J. Oncina, editors, Grammatical Infer-
ence and Applications, ICGI-94, number 862
in LNAI, pages 139–152, Berlin, Heidelberg.
Springer Verlag.

R. C. Carrasco and J. Oncina. 1999. Learning
deterministic regular grammars from stochas-
tic samples in polynomial time. Theoretical
Informatics and Applications, 33(1):1–20.

Alexander Clark and Franck Thollard. 2004.
PAC-learnability of probabilistic determinis-
tic finite state automata. Journal of Machine
Learning Research, 5:473–497, May.

F. Denis. 2001. Learning regular languages
from simple positive examples. Machine
Learning, 44(1/2):37{66.

E. M. Gold. 1967. Language identification in
the limit. Information and Control, 10(5):447–
474.

M. Kearns and G. Valiant. 1989. Crypto-
graphic limitations on learning boolean for-
mulae and finite automata. In 21st Annual
ACM Symposium on Theory of Computing,
pages 433–444, New York. ACM.

M.J. Kearns, Y. Mansour, D. Ron, R. Rubin-
feld, R.E. Schapire, and L. Sellie. 1994. On
the learnability of discrete distributions. In
Proc. of the 25th Annual ACM Symposium
on Theory of Computing, pages 273–282.

Mehryar Mohri. 1997. Finite-state transducers
in language and speech processing. Compu-
tational Linguistics, 23(2):269{311.

D. Ron, Y. Singer, and N. Tishby. 1995. On the
learnability and usage of acyclic probabilistic
finite automata. In COLT 1995, pages 31–40,
Santa Cruz, CA, USA. ACM.

L. Valiant. 1984. A theory of the learnable.
Communications of the ACM, 27(11):1134–
1142.
