Priors in Bayesian Learning of Phonological Rules
Sharon Goldwater and Mark Johnson
Department of Cognitive and Linguistic Sciences
Box 1978
Brown University
Providence, RI 02912
USA
fsharon goldwater, mark johnsong@brown.edu
Abstract
This paper describes a Bayesian procedure for un-
supervised learning of phonological rules from an
unlabeled corpus of training data. Like Goldsmith’s
Linguistica program (Goldsmith, 2004b), whose
output is taken as the starting point of this proce-
dure, our learner returns a grammar that consists of
a set of signatures, each of which consists of a set
of stems and a set of suffixes. Our grammars dif-
fer from Linguistica’s in that they also contain a set
of phonological rules, specifically insertion, dele-
tion and substitution rules, which permit our gram-
mars to collapse far more words into a signature
than Linguistica can. Interestingly, the choice of
Bayesian prior turns out to be crucial for obtaining a
learner that makes linguistically appropriate gener-
alizations through a range of different sized training
corpora.
1 Introduction
Unsupervised learning presents unusual challenges
to the field of computational linguistics. In super-
vised systems, the task of learning can often be re-
stricted to finding the optimal values for the param-
eters of a pre-specified model. In contrast, an un-
supervised learning system must often propose the
structure of the model itself, as well as the values
for any parameters in that model. In general, there
is a trade-off between the structural complexity of a
model and its ability to explain a set of data. One
way to deal with this trade-off is by using Bayesian
learning techniques, where the objective function
used to evaluate the overall goodness of a system
takes the form
Pr(H)Pr(DjH)
where Pr(H) is the prior probability of the hypoth-
esized model H, and Pr(DjH) is the likelihood of
the data D given that model. In a Bayesian sys-
tem, we want to find the hypothesis H for which
Pr(H)Pr(DjH) is highest (or, equivalently, where
 log Pr(H) log Pr(DjH) is lowest). While cal-
culating the likelihood of the data given a particu-
lar hypothesis is generally straightforward, the more
difficult question in Bayesian learning is how to de-
termine the prior probabilities of various hypothe-
ses.
In this paper, we compare the results of using
two different prior distributions for an unsupervised
learning task in the domain of morpho-phonology.
Our goal is to learn transformation rules of the form
x!y = C, where x and y are individual charac-
ters (or the empty character  ) and C is some rep-
resentation of the context licensing the transforma-
tion. Our input is an existing segmentation of words
from the Penn Treebank (Marcus et al., 93) into
stems and suffixes. This segmentation is provided
by the Linguistica morphological analyzer (Gold-
smith, 2001; Goldsmith, 2004b), itself an unsuper-
vised algorithm. Using the transformation rules we
learn, we are able to output a new segmentation that
more closely matches our linguistic intuitions.1
We are not the first to apply Bayesian learning
techniques for unsupervised learning of morphol-
ogy and phonology. Several other researchers have
also pursued these methods, usually within a Mini-
mum Description Length (MDL) framework (Ris-
sanen, 1989). In MDL approaches,  log Pr(H)
is taken to be proportional to the length of H in
some standard encoding, and log Pr(DjH) is the
length of D using the encoding specified by H.
MDL-based systems have been relatively successful
for tasks including word segmentation (de Marcken,
1996; Brent and Cartwright, 1996), morphological
1Since we use ordinary text, rather than phonological tran-
scriptions, as input, the rules we learn are really spelling rules,
not phonological rules. We believe that the work discussed
here would be equally applicable, and possibly more success-
ful, with phonological transcriptions. However, since we wish
to have an entirely unsupervised system and we require a mor-
phological segmentation as input, we are currently limited by
the capabilities of Linguistica, which requires standard textual
input. For the remainder of this paper, we use “phonology” and
“phonological rules” in a broad sense to include orthography as
well.
                                                                  Barcelona, July 2004
                                              Association for Computations Linguistics
                       ACL Special Interest Group on Computational Phonology (SIGPHON)
                                                    Proceedings of the Workshop of the
8
>>>
<
>>>:
lift
jump
roll
walk
:::
9
>>>
=
>>>; 
8
>>>
<
>>>:
 
 s
 ed
 ing
:::
9
>>>
=
>>>;
Figure 1: An example signature
segmentation (Goldsmith, 2001; Creutz and Lagus,
2002), discovery of syllabicity and sonority (Elli-
son, 1993), and learning constraints on vowel har-
mony and consonant clusters (Ellison, 1994). How-
ever, our work shows that a straightforward MDL
approach, where the prior log Pr(H) depends on
the length of the phonological rules and the rest of
the grammar in the obvious way, does not result in a
successful system for learning phonological rules.
We discuss why this is so, and then present sev-
eral changes that can be made to the prior in order
to learn phonological rules successfully. Our con-
clusion is that, although Bayesian techniques can
be successful for unsupervised learning of linguis-
tic information, careful choice of the prior, with at-
tention to both linguistic and statistical factors, is
important.
In the remainder of this paper, we first review
the basics behind Goldsmith’s Linguistica program,
which serves as the starting point for our own work.
We then explain the additional framework neces-
sary for learning phonological rules, and describe
our search algorithm. In Section 5, we describe the
results of two experiments using our search algo-
rithm, first with an MDL prior, then with a modified
prior. We discuss why the modified prior works bet-
ter for our task, and implications for other Bayesian
learners.
2 Linguistica
Since our algorithm is designed to take as input
a morphological analysis produced by Linguistica,
we first briefly review what that analysis consists of
and how it is arrived at. Linguistica is based on the
MDL principle, which states that the optimal hy-
pothesis to explain a set of data is the one that min-
imizes the total number of bits required to describe
both the hypothesis and the data under that hypoth-
esis. Information theory tells us that the description
length of the data under a given hypothesis is simply
the negative log likelihood of the data, so the MDL
criterion is equivalent to a Bayesian prior favoring
hypotheses that can be described succintly.
Linguistic hypotheses (grammars) all contain
some primitive types. Linguistica uses three primi-
tive types in its grammar: stems, suffixes, and sig-
natures. 2 Each signature is associated with a set
of stems, and each stem is associated with exactly
one signature representing those suffixes with which
it combines freely. For example, walk and jump
might be associated with the signature h .ed.ing.si
(see Figure 1), while bad might be associated with
h .lyi. Unanalyzed words can be thought of as be-
longing to the h i signature. A possible grammar
under this scenario consists of a set of signatures,
where each signature contains a set of stems and a
set of suffixes. Rather than modeling the probabil-
ity of each word in the corpus directly, the gram-
mar assumes that each word consists of a stem and
a (possibly empty) suffix, and assigns a probability
to each word w according to
Pr(w = t + f) = Pr( )Pr(tj )Pr(fj );
where  is the signature containing stem t. (We
have adopted Goldsmith’s notation here, using f to
denote suffixes, t for stems, and  for signatures.)
Clearly, grouping words into signatures will cause
their probabilities to be modeled less well than mod-
eling each word individually. The negative log like-
lihood of the corpus will therefore increase, and this
portion of the description length will grow. How-
ever, listing each word individually in the grammar
requires as many stems as there are words. As-
signing words to signatures significantly reduces the
number of stems, and thus the length of the gram-
mar. If the stems are chosen well, then the length of
the grammar will decrease more than the length of
the encoded corpus will increase, leading to an over-
all win. Goldsmith (2001) provides a detailed de-
scription of the exact grammar encoding and search
heuristics used to find the optimal set of stems, suf-
fixes, and signatures under this type of model.
Goldsmith’s algorithm is not without its prob-
lems, however. We concern ourselves here with its
tendency to postulate spurious signatures in cases
where phonological constraints operate. For exam-
ple, many English verb stems ending in e are placed
in the signature he.ed.es.ingi, while stems not end-
ing in e have the signature h .ed.ing.si. This is due
to the fact that the stem-final e deletes before suf-
fixes beginning in e or i. Similarly, words like match
and index are likely to be given the signatureh .esi,
whereas most nouns would beh .si. The toy gram-
mar G1 in Figure 2 illustrates the sort of analysis
produced by Linguistica.
Goldsmith himself has noted the problem of spu-
rious signatures (Goldsmith, 2004a), and recent ver-
2Linguistica actually can perform prefix analysis as well as
suffix analysis, but in our work we used only the suffixing func-
tions.
 1 = (fwork, rollg f , ed, ing, erg)
 2 = (fdin, bikg fe, ed, ing, erg)
 3 = (fwaitg f , ed, erg)
 4 = (fcarrg fy, ied, ierg)
 5 = (fcarryg f , ingg)
 6 = (fbike, booth, workerg f , sg)
 7 = (fbeach, matchg f , esg)
Figure 2: G1: A Sample Linguistica Grammar
sions of Linguistica include some functionality de-
voted to detecting allomorphs. Superficially, our
work may seem similar to Goldsmith’s, but in fact it
is quite different. First of all, the allomorphic vari-
ation detected by Linguistica is suffix-based. That
is, suffixes are proposed that operate to delete cer-
tain stem-final material. For example, a suffix (e)ing
could be proposed in order to include both hope
and walk in the signature h .(e)ing.si. This suf-
fix is actually separate in the grammar from the
ordinary ing suffix, so there is no recognition of
the fact that any occurrence of ing in any signa-
ture should delete a preceding stem-final e. More-
over, this approach is not really phonological, in
the sense that other suffixes beginning with i might
or might not be analyzed as deleting stem-final
e. While many languages do contain some affix-
specific morpho-phonological processes, our goal
here is to find phonological rules that apply at all
stem-suffix boundaries, given certain context crite-
ria.
A second major difference between the allomor-
phy detection in Linguistica and the work presented
here is that a Linguistica suffix such as (e)ing is as-
sumed to delete any stem-final e, without exception.
While this assumption may be valid in this case,
there are other suffixes and phonological processes
that are not categorical. For example, the English
plural s requires insertion of an e after certain stems,
including those ending in x or s. However, there
is no simple way to describe the context for this
rule based solely on orthography, because of stems
such as beach (+es) and stomach (+s). For this rea-
son, and to add robustness against errors in the input
morphological segmentation, we allow stems to be
listed in the grammar as exceptions to phonological
rules.
In addition to these theoretical differences, the
work presented here covers a wider range of phono-
logical processes than does Linguistica. Linguis-
tica is capable of detecting only stem-final deletion,
whereas our algorithm can also detect insertion (as
in match + s ! matches) and stem-final substitu-
tion (as in carry + ed!carried). In the following
section we discuss the structure of the grammar we
use to describe the words in our corpus.
3 A Morpho-Phonological Grammar
Since the morphology we use as input to our pro-
gram is obtained directly from Linguistica, our
grammar is necessarily similar to the one in that
program. As discussed above, Linguistica contains
three primitives types in its grammar: signatures,
stems, and suffixes. We add one more primitive type
to our grammar, the notion of a rule.
Each rule consists of a transformation, for ex-
ample  ! e or y ! i, and a conditioning con-
text. A context consists of a string of four charac-
ters XtytyfXf , where Xi 2fC;V; #g(consonant,
vowel, end-of-word) and yi is in the set of characters
in our text.3 The first half of the context is from the
end of the stem, and the second half is from the be-
ginning of the suffix. For example, the stem-suffix
pair jump + ed has the context CpeC. All trans-
formations are assumed to occur stem-finally, i.e.
at the second context position (or after the second
position, for insertions). Of course, these contexts
are more detailed than necessary for certain phono-
logical rules, and don’t capture all the information
required for others. In future work, we plan to al-
low for different types of contexts and generaliza-
tion over contexts, but for the present, all contexts
have the same form.
Using these four primitives, we can construct a
grammar in the following way: As in Goldsmith’s
work, we list a set of signatures, each of which
contains a set of stems and suffixes. In addition,
we list a set of phonological rules. In many cases,
only one rule will apply in a particular context, in
which case it applies to all stem-suffix pairs that
meet its context. If more than one rule applies, we
list the rule with the most common transformation
first and assume that it applies unless a particular
stem specifies otherwise. Stems can thus be listed
as exceptions to rules by using a non-default *no-
change* rule with the appropriate context. Note
that the more exceptions a rule has, the more expen-
sive it is to add to the grammar: each new type of
transformation in a particular context must be listed,
and each stem requiring a non-default transforma-
tion must specify the transformation required. Any
prior preferring short grammars will therefore tend
3The knowledge of which characters are consonants and
which are vowels is the only information we provide to our pro-
gram, other than the text corpus and the Linguistica-produced
morphology. Aside from the C/V distinction, our program is
entirely knowledge-free.
 1 = (fwork, roll, dine, carryg f , ed, er, ingg)
 2 = (fbikeg f , ed, er, ing, sg)
 3 = (fwaitg f , ed, erg)
 4 = (fbooth (r5), worker, beach, matchg f , sg)
r1 = e! = CeeC
r2 = e! = CeiC
r3 = y!i = CyeC
r4 =  !e = Chs#
r5 = *no-change* / Chs#
Figure 3: G2: A Sample Grammar with Transfor-
mation Rules
to reject rules requiring many exceptions (i.e. those
without a consistent application context).
Grammar G2, in Figure 3, shows a sample of the
kind of grammar we use. This grammar generates
exactly the same wordforms as G1, but using fewer
signatures due to the effects of the phonological
rules. All the stem-suffix parings in this grammar
undergo the default rules for their contexts except
for the stem booth, which is listed as an exception
to the e-insertion rule. For booth + s, the grammar
therefore generates booths, not boothes.
Our model generates data in much the same way
as Goldsmith’s: a word is generated by selecting a
signature and then independently generating a stem
and suffix from that signature. This means that
the likelihood of the data takes the same form in
our model as in Goldsmith’s, namely Pr(w) =
Pr( )Pr(tj )Pr(fj ), where the word w consists
of a stem t and a suffix f both drawn from the
same signature  . Our model differs from Gold-
smith’s in the way that stems and suffixes are pro-
duced; because we use phonological rules a great
many more stems and suffixes can belong to a sin-
gle signature. We defer discussion of how we define
the prior probability over grammars to Section 5,
and assume for the moment that we are given prior
and likelihood functions that can evaluate the utility
of a grammar and training data.
4 Search Algorithm
Since it is clearly infeasible to evaluate the utility of
every possible grammar, we need a search algorithm
to guide us toward a good solution. Our algorithm
uses certain heuristics to make small changes to the
initial grammar (the one provided by Linguistica),
evaluating each change using our objective func-
tion, and accepting or rejecting it based on the result
of evaluation. Our algorithm contains three major
components: a procedure to find signatures that are
similar in ways suggesting phonological change, a
procedure to identify possible contexts for phono-
logical change, and a procedure to collapse related
signatures and add phonological rules to the gram-
mar. We discuss each of these components in turn.
4.1 Identifying Similar Signatures
An important first step in simplifying the mor-
phological analysis of our data using phonologi-
cal rules is to identify signatures that might be re-
lated via such rules. Since our algorithm considers
three different types of possible phonological pro-
cesses (deletion, substitution, and insertion), there
are three different ways in which signatures may be
related. We need to look for pairs of signatures that
are similar in any of these three ways.
Insertion We look for potential insertion rules by
finding pairs of signatures in which all suffixes but
one are common to both signatures. The distinct
pair of suffixes must be such that one can be formed
from the other by inserting a single character at the
beginning. Example pairs found by our algorithm
include h .si/h .esi and h .yi/h .lyi. In searching
for these pairs (as well as deletion and substitution
pairs), we consider only pairs where each signature
contains at least two stems. This is partly in the
interests of efficiency and partly due to the fact that
signatures with only one stem are often less reliable.
Deletion Signature pairs exhibiting possible dele-
tion behavior are similar to those exhibiting inser-
tion behavior, except that one of the suffixes not
common to both signatures must be the empty suf-
fix. Examples of possible deletion pairs include
h .ed.ingi/he.ed.ingiandh .ed.ingi/hed.ing.si.
Substitution In a possible substitution pair, one
signature (the one potentially exhibiting stem-final
substitution) contains suffixes that all begin with
one of two characters: the basic stem-final char-
acter, and the substituted character. The signature
hied.ier.yi from G1 is such a signature. The other
signature in a possible substitution pair must con-
tain the empty suffix, and the two signatures must
be identical when the first character of each suffix in
the first signature is removed. Possible substitution
pairs includehied.ier.yi/h .ed.eriandhous.yi/h .usi.
Using the set of similar signatures we have de-
tected, we can propose a set of possible phonolog-
ical processes in our data. Some transformations,
such as e! , will be suggested by more than one
pair of signatures, while others, such as y ! o,
will occur with only one pair. We create a list of
all the possible transformations, ranked according
to the number of signature pairs attesting to them.
4.2 Identifying possible contexts
Once we have found a set of possible transforma-
tions, we need to identify the contexts in which
those transformations might apply. To see how this
works, suppose we are looking at the proposed e-
deletion rule and our input grammar is G1. Using
one of the signature pairs attesting to this rule, such
as h .ed.er.ingi/he.ed.er.ingi, we can find possible
conditioning contexts by examining the set of stems
and suffixes in the second signature. If we want to
reanalyze the stems din and bik as dine and bike,
we hypothesize that each wordform generated us-
ing the suffixes present in both signatures (ed, er,
and ing) must have deleted an e. We can find the
context for this deletion by looking at these suffixes
together with the reanalyzed stems. The contexts
for deletion that we would get fromfbike, dineg 
fed, inggarefCeeC, CeiCg.4
Our methods for finding possible contexts for
substitution and insertion rules are similar: reana-
lyze the stems and suffixes in the signature hypoth-
esized to require a phonological rule, combine them,
and note the context generated. In this way, we can
get contexts such as CyeC for the y!i rule (from
carry + ed) and V xs# for the ;! e rule (from
index + s).
Just as we ranked the set of possible phonologi-
cal rules according to the number of signature pairs
attesting to them, we can rank the set of contexts
proposed for each rule. We do this by calculating
r = Pr(XtytjyfXf )=Pr(Xtyt), the ratio between
the probability of seeing a particular stem context
given a particular suffix context to the prior proba-
bility of the stem context. If a stem context (such
as Ce) is quite common overall but hardly ever ap-
pears before a particular suffix context (iC), this is
good evidence that some phonological process has
modified the stem in the context of that suffix. Low
values of r are therefore better evidence of condi-
tioning for a rule than are high values of r.
4.3 Collapsing signatures
Given a set of similar signature pairs, the rules
relating them, and the possible contexts for those
rules, we need to determine which rules are actu-
ally phonologically legitimate and which are sim-
ply accidents of the data. We do this by simply
considering each rule and context in turn, proceed-
ing from the most attested to least attested rules and
from most likely to least likely contexts. For each
rule-context pair, we add the rule to the grammar
4The reasoning we use to finding conditioning contexts for
deletion rules was also described by Goldsmith (2004a), and is
similar to the much earlier work of Johnson (1984).
FINDPHONORULES()
1 G grammar produced by Linguistica
2 R ordered set of possible rules
3 for each r2R
4 do
5 Cr  ordered set of possible contexts for r
6 C  ;
7 while Cr 6=;
8 do c next c2Cr
9 Cr  Crnfcg
10 C  C[fcg
11 G0  collapseInContext(G;r;C)
12 G0  pruneRules(G0)
13 if score(G0) < score(G)
14 then G G0
15 return G
COLLAPSEINCONTEXT(G;r;C)
1 for each  i 2G
2 do for each  j 2G
3 do if ( i!r  j)^(8(t;f)2 i; ctx(t;f)2C)
4 then collapseSigs( i; j)
Figure 4: Pseudocode for our search algorithm
with that context and collapse any pairs of signa-
tures related by the rule, as long as all stem-suffix
pairs contain a context at least as likely as the one
under consideration. Collapsing a pair of signa-
tures means reanalyzing all the stems and suffixes
in one of the signatures, and possibly adding excep-
tions for any stems that don’t fit the rule. We have
found that exceptions are often required to handle
stems that were originally misanalyzed by Linguis-
tica. For that reason, we prune the rules added to the
grammar, and for each rule, if fewer than 2% of the
stems require exceptions, we assume that these are
errors and de-analyze the stems, returning the word-
forms they generated to theh isignature. We then
evaluate the new analysis using our objective func-
tion, and accept it if it scores better than our previ-
ous analysis. Otherwise, we revert to the previous
analysis and continue trying new rule-context pairs.
Pseudocode for our algorithm is presented in Fig-
ure 4. We use the notation  i!r  j to indicate that
 i and  j are similar with respect to rule r, with  j
being the more “basic” signature (i.e. adding r to
the grammar would allow us to move the stems in
 i into  j).
Note that collapsing a pair of signatures does not
always result in an overall reduction in the number
of signatures in the grammar. To see why this is
Morph Only Morph+Phon
Small Large Small Large
Tokens 100k 888k 100k 888k
Types 11313 35631 11313 35631
 s 435 1634 404 1594
Singleton  s 280 1231 259 1215
Stems 8255 24529 8186 24379
Non- Stems 2363 7673 2286 7494
Table 1: Grammatical Analysis of our Corpora
so, consider the effect of collapsing  1 and  2 and
adding r1 and r2 (the e-deletion rules) to G1. When
the stem bik gets reanalyzed as bike, the algorithm
recognizes that bike is already a stem in the gram-
mar, so rather than placing the reanalyzed stem in
 1, it combines the reanalyzed suffixes f , ed, er,
ingg with the suffixes f , sg from  6 and creates a
new signature for the stem bike — h .ed.er.ing.si.
The two stems carr and carry are also combined
in this way, but in that case, the combined suffixes
form a signature already present in the grammar, so
no new signature is required.
5 Experiments
For our experiments with learning phonological
rules, we used two different corpora obtained from
the Penn Treebank. The larger corpus contains the
words from sections 2-21 of the treebank, filtered to
remove most numbers, acronyms, and words con-
taining puctuation. This corpus consists of approx-
imately 900,000 tokens. The smaller corpus is sim-
ply the first 100,000 words from the larger corpus.
We ran each corpus through the Linguistica pro-
gram to obtain an initial morphological segmenta-
tion. Statistics on the results of this segmentation
are shown in the left half of Table 1. “Singleton
signatures” are those containing a single stem, and
“Non- stems” refers to stems in a signature other
than theh isignature, i.e. those stems that combine
with at least one non- suffix.
The original function we used to evaluate the util-
ity of our grammars was an MDL prior very simi-
lar to the one described by Goldsmith (2001). This
prior is simply the number of bits required to de-
scribe the grammar using a fairly straightforward
encoding. The encoding essentially lists all the suf-
fixes in the grammar along with pointers to each
one; then lists the phonological rules with their
pointers; then lists all the signatures. Each signa-
ture is a list of stems and their pointers, and a list of
pointers to suffixes. Each exceptional stem also has
Init. Grammar Change
#  s 1617 -10
# Stems 24374 -17
Grammar Size: 1335425 +520
 s, Suffixes 53933 -253
Stems 1280617 +493
Phonology 875 +279
Likelihood: 6478490 +166
Total: 7813915 +686
Table 2: Effects of adding y !i rules under MDL
prior (large corpus).
a pointer to a phonological rule.5
Our algorithm considered a total of 11 possible
transformations in the small corpus and 40 in the
large corpus, but using this prior, only a single type
of transformation appeared in any rule in the final
grammar: e !  , with seven contexts in the small
corpus and eight contexts in the large corpus. In
analyzing why our algorithm failed to accept any
other types of rules, we realized that there were sev-
eral problems with the MDL prior. Consider what
happens to the overall evaluation when two signa-
tures are collapsed. In general, the likelihood of the
corpus will go down, because the stem and suffix
probabilities in the combined signature will not fit
the true probabilities of the words as well as two
separate signatures could. For large corpora like the
ones we are using, this likelihood drop can be quite
large. In order to counterbalance it, there must be a
large gain in the prior.
But now look at Table 2, which shows the effects
of adding all the y ! i rules to the grammar for
the large corpus under the MDL prior. The first
two lines give the number of signatures and stems
in each grammar. The next line shows the total
length (in bits) of each grammar, and this value is
then broken down into three different components:
the overhead caused by listing the signatures and
their suffixes, the length of the stem list (not in-
cluding the length required to specify exceptions to
rules), and the length of the phonological compo-
nent (including both rules and exception specifica-
tions). Finally, we have the negative log likelihood
under each grammar and the total MDL cost (gram-
mar plus likelihood).
As expected, the likelihood term for the grammar
5There are some additional complexities in the grammar en-
coding that we have not mentioned, due to the fact that stems
can be recursively analyzed using shorter stems. These com-
plexities are irrelevant to the points we wish to make here, but
are described in detail in Goldsmith (2001).
Init. Grammar Change
#  s 1601 -7
# Stems 24386 -7
Grammar Size: 1249629 -318
 s, Suffixes 241465 -493
Stems 1005887 -111
Phonology 2277 +286
Likelihood: 6478764 +39
Total: 7728393 -279
Table 3: Effects of adding y !i rules under modi-
fied prior (large corpus).
with y ! i rules has increased, indicating a drop
in the probability of the corpus under this gram-
mar. But notice that the total grammar size has
also increased, leading to an overall evaluation that
is worse than for the original grammar. There are
two main reasons for this increase in grammar size.
Initially, the more puzzling of the two is the fact
that the number of bits required to list all the stems
has increased, despite the fact that the number of
stems has decreased due to reanalyzing some pairs
of stems into single stems. It turns out that this ef-
fect is due to the encoding used for stems, which is
simply a bitwise encoding of each character in the
stem. This encoding means that longer stems re-
quire longer descriptions. When reanalysis requires
shifting a character from a suffix onto the entire set
of stems in a signature (as infcertif, empt, hurrg 
fied, yg!fcertify, empty, hurryg f , edg), there
can be a large gain in description length simply due
to the extra characters in the stems. If the number of
stems eliminated through reanalysis is high enough
(as it is for the e! rules), this stem length effect
will be outweighed. But when only a few stems are
eliminated relative to the number that get longer, the
overall length of the stem list increases.
However, even without the stem list, the grammar
with y !i rules would still be slightly longer than
the grammar without them. In this case, the rea-
son in that under our MDL prior, it is quite efficient
to encode a signature and its suffixes. Therefore the
grammar reduction caused by removing a few signa-
tures is not enough to outweigh the increase caused
by adding a few phonological rules.
Using these observations as a guideline, we re-
designed our prior by assigning a fixed cost to each
stem and increasing the overhead cost for signa-
tures. The new overhead function is equal to the
sum of the lengths of all the suffixes in the signature,
times a constant factor. This function means there is
more incentive to collapse two signatures that share
several suffixes, such as he.ed.er.ingi/h .ed.er.ingi,
than to collapse signatures sharing only a single suf-
fix, such ashing.si/h .ingi. This behavior is exactly
what we want, since these shorter pairs are more
likely to be accidental. Table 3 shows the effects
of adding the y ! i rules under this new prior.
The starting grammar is somewhat different from
the one in Table 2, because more rules have already
been added by the time the y !i rules are consid-
ered. The important point, however, is that the cost
of each component of the grammar changes in the
direction we expect it to, and the total grammar cost
is reduced enough to more than make up for the loss
in likelihood.
With this new prior, our algorithm was more suc-
cessful, learning from the large corpus the three ma-
jor transformations for English (e ! ,  !e, and
y ! i) with a total of 22 contexts. Eight of these
rules, such as  !e = V xs# and y!i = CyeC,
had no exceptions. Of the remaining rules, the ex-
ceptions to six of the rules were correctly analyzed
stems (for example, unhappy + ly!unhappily and
necessary + ly!necessarily but sly + ly!slyly),
while the remaining eight rules contained misan-
alyzed exceptions (such as overse + er ! over-
seer, which was listed as an exception to the rule
e! = CeeC, rather than being reanalyzed as over-
see + er). In the small corpus, no y!i rules were
learned due to the fact that no similar signatures at-
testing to these rules were found.
Using these phonological rules, a total of 31 sig-
natures in the small corpus and 57 signatures in the
large corpus were collapsed, with subsequent re-
analysis of 225 and 528 stems, respectively. This
represents 7-10% of all the non- stems. The final
grammars are summarized in the right half of Table
1.
6 Conclusion
The work described here is clearly preliminary with
respect to learning phonological rules and using
those rules to simplify an existing morphology. Our
notion of context, for example, is somewhat impov-
erished; our system might benefit from using con-
texts with variable lengths and levels of generality,
such as those in Albright and Hayes (2003). We
also cannot handle transformations that require rule
ordering or more than one-character changes. One
reason we have not yet implemented these additions
is the difficulty of designing a heuristic search that
can handle the additional complexity required. We
are therefore working toward implementing a more
general search procedure that will allow us to ex-
plore a larger grammar space, allowing greater flex-
ibility with rules and contexts. Once some of these
improvements have been implemented, we hope to
explore the possibilities for learning in languages
with richer morphology and phonology than En-
glish.
Our point in this paper, however, is not to present
a fully general learner, but to emphasize that in a
Bayesian system, the choice of prior can be crucial
to the success of the learning task. Learning is a
trade-off between finding an explanation that fits the
current data (maximizing the likelihood) and main-
taining the ability to generalize to new data (max-
imizing the prior). The MDL framework is a way
to formalize this trade-off that is intuitively appeal-
ing and seems straightforward to implement, but we
have shown that a simple MDL approach is not the
best way to achieve our particular task. There are at
least two reasons for this. First, the obvious encod-
ing of stems actually penalizes the addition of cer-
tain types of phonological rules, even when adding
these rules reduces the number of stems in the gram-
mar. More importantly, the type of grammar we
want to learn allows two different kinds of general-
izations: the grouping of stems into signatures, and
the addition of phonological rules. Simply speci-
fying a method of encoding each type of general-
ization may not result in a linguistically appropriate
trade-off during learning. In particular, we discov-
ered that our MDL encoding for signatures was too
efficient relative to the encoding for rules, leading
the system to prefer not to add rules. Our large cor-
pus size already puts a great deal of pressure on the
system to keep signatures separate, since this leads
to a better fit of the data. In order to learn most of
the rules, we therefore had to significantly increase
the cost of signatures.
We are not the first to note that with an MDL-
style prior the choice of encoding makes a differ-
ence to the linguistic appropriateness of the result-
ing grammar. Chomsky himself (Chomsky, 1965)
points out that the reason for using certain types
of notation in grammar rules is to make clear the
types of generalizations that lead to shorter gram-
mars. However, our experience emphasizes the fact
that very little is still known about how to choose
appropriate encodings (or, more generally, priors).
As researchers continue to attempt more sophisti-
cated Bayesian learning tasks, they will encounter
more interactions between different kinds of gener-
alizations. As a result, the question of how to de-
sign a good prior will become increasingly impor-
tant. Our primary goal for the future is therefore to
investigate exactly what assumptions go into decid-
ing whether a grammar is linguistically sound, and
to determine how to specify those assumptions ex-
plicitly in a Bayesian prior.
Acknowledgements
The authors would like to thank Eugene Charniak
and the anonymous reviewers for helpful comments.
This work was supported by NSF grants 9870676
and 0085940, NIMH grant 1R0-IMH60922-01A2,
and an NDSEG fellowship.

References
A. Albright and B. Hayes. 2003. Rules vs.
analogy in english pass tenses: a computa-
tional/experimental study. Cognition, 90:119–
161.
M. Brent and T. Cartwright. 1996. Distributional
regularity and phonotactic constraints are useful
for segmentation. Cognition, 61:93–125.
N. Chomsky. 1965. Aspects of the Theory of Syn-
tax. MIT Press, Cambridge.
M. Creutz and K. Lagus. 2002. Unsupervised dis-
covery of morphemes. In Proceedings of the
Workshop on Morphological and Phonological
Learning at ACL ’02, pages 21–30.
C. de Marcken. 1996. Unsupervised Language Ac-
quisition. Ph.D. thesis, Massachusetts Institute of
Technology.
T. M. Ellison. 1993. The Machine Learning of
Phonological Structure. Ph.D. thesis, University
of Western Australia.
T. M. Ellison. 1994. The iterative learning of
phonological constraints. Computational Lin-
guistics, 20(3).
J. Goldsmith. 2001. Unsupervised learning of the
morphology of a natural language. Computa-
tional Linguistics, 27:153–198.
J. Goldsmith. 2004a. An algorithm for the unsuper-
vised learning of morphology. Preliminary draft
as of January 1.
J. Goldsmith. 2004b. Linguis-
tica. Executable available at
http://humanities.uchicago.edu/faculty/goldsmith/.
M. Johnson. 1984. A discovery procedure for cer-
tain phonological rules. In Proceedings of COL-
ING.
M. Marcus, B. Santorini, and M. A. Marcinkiewicz.
93. Building a large annotated corpus of english:
the penn treebank. Computational Linguistics,
19(2).
Rissanen. 1989. Stochastic Complexity and Statis-
tical Inquiry. World Scientific Co., Singapore.
