A DISCOVERY PROCEDURE 
FOR CERTAIN PHONOLOGICAL RULES 
Mark Johnson 
Linguistics, UCSD. 
ABSTRACT 
Acquisition of phonological systems can be insightfully 
studied in terms of discovery procedures. This paper describes 
a discovery procedure, implemented in Lisp, capable of deter- 
mining a set of ordered phonological rules, which may be in 
opaque contexts~ from a set of surface forms arranged in para- 
digms. 
1. INTRODUCTION 
For generative grammarians, such as Chomsky (1965), a 
primary problem of linguistics is to explain how the language 
learner can acquire the grammar of his or her language on the 
basis of the limited evidence available to him or her. Chomsky 
introduced the idealization of instantaneous acquisition, which 
1 adopt here, in order to model the language acquisition device 
as a function from primary linguistic data to possible gram- 
mars, rather than as a process. 
Assuming that the set of possible human languages is 
small, rather than large, appears to make acquisition easier, 
since there are fewer possible grammars to choose from, and 
less data should be required to choose between them. Accord- 
ingly, generative linguists are interested in delimiting the class 
of possible human languages. This is done by looking for pro- 
perties common to all human languages, or universals. 
Together, these universals form universal grammar, a set of 
principles that all human languages obey. Assuming that 
universal grammar is innate, the language learner can use it to 
restrict the number of possible grammars he or she must con- 
sider when learning a language. 
As part of universal grammar, the language learner is 
supposed to innately possess an evaluation metric, which is 
used to "decide" between two grammars when both are con- 
sistent with other principles of universal grammar and the 
available language data. 
2. DISCOVERY PROCEDURES 
This approach deals with acquisition without reference to 
a specific discovery procedure, and so in some sense the results 
of such research are general~ in that in principle they apply to 
all discovery procedures. Still, I think that there is some util- 
ity in considering the problem of acquisition in terms of actual 
discovery procedures. 
Firstly, we can identify the parts of a grammar that are 
underspeeified with respect to the available data. Parts of a 
grammar or a rule are strongly data determined if they are 
fixed or uniquely determined by the data, given the require- 
ment that overall grammar be empirically correct. 
By contrast, a part of a grammar or of a rule is weakly data 
determined if there is a large class of grammar or rule parts 
that are all consistent with the available data. For example, if 
there are two possible analyses that equally well account for 
the available data, then the choice of which of these analyses 
should be incorporated in the final grammar is weakly data 
determined. Strong or weak data determination is therefore a 
property of the grammar formalism and the data combined, 
and independent of the choice of discovery procedure. 
Secondly, a discovery procedure may partition a phono- 
logical system in an interesting way. For instance, in the 
discovery procedure described here tile evaluation metric is not 
called apon to compare one grammar with another, but rather 
to make smaller, more local, comparisons. This leads to a fac- 
toring of the evaluation metric that may prove useful for its 
further investigation. 
Thirdly, focussing on discovery procedures forces us to 
identify what the surface indications of the various construc- 
tions in the grammar are. Of course, this does not mean one 
should look for a one-to-one correspondence between individual 
grammar constructions and the surface data; but rather com- 
plexes of grammar constructions that interact to yield particu- 
lar patterns on the surface. One is then investigating the logi- 
cal implications of the existence of a particular constructions in 
the data. 
Following from the last point, 1 think a discovery pro- 
cedure should have a deductive rather than enumerative struc- 
ture. In particular, procedures that work essentially by 
enumerating all possible (sub)grammars and seeing which ones 
work are not only in general very inefficient, but. also not. very 
insightful. These discovery by enumeration procedures simply 
give us a list of all rule systems that are empirically adequate 
as a result, but they give us no idea as to what properties of 
these systems were crucial in their being empirically adequate. 
This is because the structure imposed on the problem by a 
simple recursive enumeration procedure is in general not 
related to the intrinsic structure of the rule discovery problem. 
3. A PHONOLOGICAL RULE DISCOVERY PRO- 
CEDURE 
Below and in Appendix A I outline a discovery pro- 
cedure: which I have fully implemented in Franz Lisp on a 
VAX 11/750 computer, for a restricted class of phonological 
rules, namely rules of the type shown in (1). 
(1) ~ ~ b / c 
Rule (1) means that any segment a that appears in con- 
text Cin the input to the rule appears asa bin the rule's out- 
put. Context C is a feature matrix, and to say that a appears 
in context C means that C is a subse! of the fvature malrix 
344 
formed by the segments around a 1. A phonological system 
consists of an ordered 2 set of such rules, where the rules are 
considered to apply in a cascaded fashion, that. is, the output 
of one rule is the input to the next.. 
The problem the discovery procedure must solve is, given 
some data, to determine the set of rules. As an idealization, I 
assume that the input to the discovery procedure is a set of 
surface paradigms, a two dimensional array of words with all 
words in the same row possessing the same stem and all words 
in the same column the same affix. Moreover, l assume the 
root and suffix morphemes are already identified, ahhough I 
admit this task may be non-trivial. 
4. DETERMINING THE CONTEXT THAT CONDI- 
TIONS AN ALTERNATION 
Consider the simplest phonological system: one in which 
only one phonological rule is operative. In this system the 
alternating segements a and b can be determined by inspec- 
tion, since a and b will be the only alternating segments in the 
data (although there will be a systematic ambiguity as to 
which is a and which is b). Thus a and b are strongly data 
determined. 
Given a and b. we can write a set of equations that the 
rule context C that conditions this alternation must obey. 
Our rule rnust apply in all contexts C b where a b appears that 
alternates with an a, since by hypothesis b was produced by 
this rule. We can represent this by equation (2). 
(2) ~7\]Cb, C matches C b 
The second condition that our rule must obey is that it 
doesn't apply in any context. C a where an a appears. If it did, 
of course, we would expect a b, not an a, in this position on 
the surface. We can write this condition by equation (3). 
(3) ~¢C,, C does not match 6', 
These two equations define the rule context C. Note that 
in general these equations do not yield a unique value for C; 
depending apon the data tbere may be no C that simultane- 
ously satisfies (2) and (3). or there may be several different C 
that simultaneously satisfies (2) and (3). We cannot appeal 
further to the data to decide which C to use, since they all are 
equally consistent with the data. 
Let us call the set of C that simultaneously satisfies (2) 
and (3) S o Then S c is strongly data determined; in fact, 
there is an efficient algorithm for computing S c from the C,s 
and Cbs that does not involve enumerating and testing all ima- 
ginable C (the algorithm is described in Appendix A). 
However, if S c contains more than one 6', the choice of 
which C from Sc to actually use as the rule's context is weakly 
1 What is crucial for what follows is that saying context C 
matches a portion of a word W is equivalent to saying that C 
is a subset of W. Since both rule contexts and words can be 
written as sets of features, 1 use "contexts" to refer both to 
rule contexts and to words. 
z I make this assumption as a first approximation. In 
fact, in real phonological systems phonological rules may be 
unordered with respect to each other. 
data determined. Moreover. the choice of v, hich ('from Sclo 
use does not affect any other decisions that the discovery pro- 
cedure has to make - that is. nothing else in the complete 
grammar must change if we decide to use one C instead of 
another. 
Plausibly, the evaluation metric and universal principles 
decide which C to use in this situation. For example, if the 
alternation involves nasafization of a vowel, something that 
usually only occurs in the context, of a nasal, and one of the 
contexts in S c involves the feature nasal but the other C in S c 
do not, a reasonable requirement is that the discovery pro- 
cedure should select the context involving the feature nasal as 
the appropriate context Cfor the rule. 
Another possibility is that .qc'S containing more than one, 
member indicates to the discovery procedure that it simply has 
too little data to determine the grammar, and it defers making 
a decision on which C to use until it has the relevant data. 
The decision as to which of these possibilities is correct is is 
not unimportant, and may have interesting empirical conse- 
quences regarding language acquisition. 
McCarthy (1981) gives some data on a related issue. 
Spanish does not tolerate word initial sC clusters, a fact. which 
might be accounted for in two ways; either with a rule that 
inserts e before word initial sC clusters, or by a constraint on 
well-formed underlying structures (a redundancy rule) barring 
word initial sC. McCarthy reports that either constraint is 
adequate to account for Spanish morphopbonemics, and there 
is no particular language internal evidence to prefer one over 
the other. 
The two accounts make differing predictions regarding 
the treatrnent of loan words. The e insertion rule predicts that 
loan words beginning with sC should receive an initial e (as 
they do: esnob, esmoking, esprey), while the well-formedness 
constraint makes no such prediction. 
McCarthy's evidence from Spanish therefore suggests that 
the human acquisition procedure can adopt one potential 
analysis and rejects an other without empirical evidence to dis- 
tinguish between them. ltowever, in the Spanish case, the two 
potential analyses differ as to which components of the gram- 
mar they involve (active phonological processes versus lexical 
redundancy rules) which affects the overall structure of the 
adopted grammar to a much greater degree than the choice of 
one C from S c over another. 
5. RULE ORDERING 
In the last section 1 showed that a single phonological 
rule can be determined from the surface data. In practice, 
very few, if any, phonological systems involve only one rule. 
Systems involving more than one rule show complexity that 
single rule systems do not. In particular, a rules may be 
ordered in such a fashion that one rule affects segments that 
are part of the context that conditions the operation of 
another rule. If a rule's context is visible on the surface (ie. 
has not been destroyed by the operation of another rule) it is 
said to be transparent, while if a rule's context is no longer 
visible on the surface it is opaque. On the face of it, opaque 
contexts could pose problems for discovery procedures. 
345 
()r<h,rillg (,i r,lh,~ h~u- b<'(q+ a topic ~,ul>,,l+jlilial re.~e~-~r\[h it+ 
?h..<,h,g',. Xl'. mai,, ,d,i,.cli'..c, i. thi- ~,,rti.. is t(, shov. that 
e×trirlsically ordered ruh,s i,, prilu'iph' pose t~o prohlem for a 
discover) prl,tt'durl'. ('~l'n if later ruh's obscure Ihe ('ontext of 
earlier ones. I don't make any elaitn that Ihe procedure 
presented here is optinlal - in fact I can think of at least two 
ways to make it perform its job more effil'ienlly. The output 
of this (lisc<~very procedure is the set of all possible ordered 
ruh. s3stelllS z aud their correspondiHg u lderhing forms that 
can pr(,duee the given surface fort,is. 
As before. I ass,lnle thal the data is in the form of sets of 
paradigms. I also assunu, that for e~er) ruh, ctlanging an a to 
a b. an aheri,aiion hetween a and b appears in the data: thus 
++e know hy listing the alternations in ttw data just what the 
possihle as and bs of the ruh' are 4. 
Frorn the assumpxion thai ruh,s are ex tins\[(ally ordered 
il folh,ws lhat one of the ruh's must have appli(,(t last: that is. 
there is a urJique "most surfaev" rule. The ('ontext or this ruh. 
+~ill ne<essariLy I,r t ransl)aret, t (visible in the surface hJrms), as 
there is ill) later rule to nlake its context opaque. 
Of coHrse, till' (liscover.', procedure has no a priori way of 
tellhJg +~hit'h alt(.rnati.n (.,rresponds In the nlost surfacy rule. 
ThlLy> although tilt, identh) of till' segnlelitS involved in tile 
niosl suffal", rule ilia)" he strictly data delerlnined, at this 
stall, Ihls inftlrnlaliiln i."; Ill)| availahle to the discovery pro- 
('edure. 
SO at this point, tile discovery pr(lcedure proposed here 
systematically investigates all of the surface ahernations: fi)r 
each alternation it makes the hypothesis that h, is the the 
alternation (if lilt, nlost sllrfa(') rub'. ('herks that a context Call 
be fouud thai conditions this alternation (this lnust he so if 
the hypothesis is correct) using the sirigle rule algorithm 
presented earlier, and then investigates if it, is possible to con- 
strut( an empirically correct set of rules based on this 
hylitlt.hesis. 
Given thai we have found a potential IlIIIOSI surfacy" 
ruh,, all of the surface alternates are replaced by the putative 
underlying segment to fornl a set of intermediate forms, in 
whi<'h the rule just discovered has been undone. We can undo 
this rule berause we previously identified tile alternating seg- 
nlents, ull),.rtantly, undoing this rule means that all other 
Thus if the n rules in the systetn are unoi'dered, this 
procedure returns n! solutions corresponding to the n ways of 
ordering these rules. 
The reason why the class of phonological rules con- 
sidered in this paper was restricted to those mapping segments 
into segments was so that all alternations could be identified 
by simply comparing surface forms segment by segment. Thus 
in this discovery procedure the algorithm for identifying possi- 
ble alternates can be of a particularly simple form. If we are 
willing It) complicate the rnachinery that deterlnines the possi- 
bh' ahernations in some data. we can relax the restriction 
prohibiting epe+nt, hesis and deletion rules, and the requirement 
that all alternations are visible on tile surface. That is, if the 
approach here is correct, the problem of identifying which seg- 
ments alternate is a different problem to discovering the 
((Ull|'~t llllll tl~hdllll~ll~, lhl ~ ,flit I hill ll,il, 
ruh.s whl)se cot, texts had been made opaque in the surface 
dala b.v the operation of the most surfacy rule will now be 
t ransparen t. 
The hypothesis tester proceeds to look for another alter- 
nation, this tilne in the intermediate forms, rather than in the 
surface fi)rms, and so on until all alternations have been 
accounted for. 
If at an.',' stage the hypothesis tester fails to find a rule I,o 
dr'scribe the alternation it is currently working with, that is, 
the single-rule algorithm determines thai no rule context exists 
that can capture this alternation, the hypothesis tester dis- 
cards ttte current hypothesis, and tries auother. 
The hypothesis tester is responsible for proposing dif- 
ferent rule order\[ass, which are tested by applying the rules in 
reverse to arrive at progressively more renloved representa- 
lions, with the single-ruh' algorithm being applied at each step 
to deterlnine if a rule exists that relates one level of intermedi- 
ate representation with the next. We ran regard the 
hyp(itilesis tester as systematically searching through tile space 
of different rule orderings, seeking rub' orderings that success- 
fully accounts for the ohserved data. 
q'tJe output of this procedure is therefore a list of all pos- 
sible rule orderings. As \] tnentioned before, I think that tile 
etlumeratlve approacit adopted here is basically flawed. So 
althougit this procedure is relatively efficient, in situations 
where rule ordering is strictly data determined (that is, where 
only one nile ordering is consistent with the data), in situa- 
tions where the rules are tmordered (any rule ordering will do), 
the procedure will generate all possible n! orderings of the n 
rules. 
This was most striking while working with some Japanese 
data. with 6 dislincl alternations, 4 of which were unordered 
with respect to each other. The discovery procedure, as 
presented above, required approximately 1 hour of CPU time 
to completely analyse this data: it. found <l different underlying 
forms and 512 different rule s.vstems that generate the 
Japanese data, differing primarily in tile ordering of the rules. 
This demonstrates that a discovery procedure that simply 
enumerates all possible rule ordering is failing to capture some 
inlportant insight regarding rule ordering, since unordered 
rules are much more difficult for this type of procedure to han- 
dle, yet, unordered rules are the most comtnon situation in 
natural langnage phonology. 
This problem may be traced back to the assumption 
made above that a phonological system consists of an ordered 
set of rules. The Japanese example shows that in many real 
phonological systems, the ordering of particular rules is simply 
not strongly data determined. What we need is some way of 
partitioning different, rule orderings into equivalence classes, as 
was done with this the different rule contexts in the single rule 
algorithm, and then compute with these equivalence classes 
rather than individual rule systems; that is. seek to localize the 
weak data determinacy. 
Looking at the problem in another way, we asked the 
discovery procedure to find all sets of ordered rules that gen- 
erate the surface data, which it did. However, it seems that 
this simply was not rigllt question, since the answer to this 
question, a set of 512 different systems, is virtually 
346 
uninterpretable by human beings. Part of the problem is lhat 
phonologists in general have not yet agreed what exactly the 
principles of rule ordering are s . 
Still, the present discovery procedure, whatever its defi- 
ciencies, does demonstrate that rule ordering in phonology 
does not pose any principled insurmountable problems for 
discovery procedures (although the procedure presented here is 
certainly practically lacking in certain situations), even if a 
later rule is allowed to disturb the context of an earlier rule, so 
that the rule's context is no longer "surface true". None the 
less, it is an empirical question as to whether phonology is best 
described in terms of ordered interacting rules~ all that l have 
shown is that such systems are not in principle unlearnable. 
6. CONCLUSION 
In this paper I have presented the details of a discovery 
procedure that can determine a limited class of phonological 
rules with arbitrary rule ordering. The procedure has the 
interesting property that it can be separated into two separate 
phases, the first, phase being superificial data analysis, that is, 
collecting the sets C, and C b of equations (2) and (3), and the 
second phase being the application of the procedure proper, 
which need never reference the data directly, but can do all of 
its calculations using C, and Cb ~. This property is interesting 
because it is likely that 6", and C a have limiting values, as the 
number of forms in the surface data increases. That is, 
presumably the language only has a fixed number of alterna- 
tions, and each of these only occurs in some fixed contexts, 
and as soon as we have enough data to see all of these con- 
texts we will have determined C, and C b. and extra data will 
not. make these sets larger. Thus the computational complex- 
ity of the second phase of the discovery procedure is more or 
less independent, of the size the lexicon, making the entire pro- 
cedure require linear time with respect to the size of the data. 
i think this is a desirable result, since there is something coun- 
terintuitive to a situation in which the difficulty of discovering 
a grammar increases rapidly with the size of the lexicon. 
7. APPENDIX A: DETERMINING A RULE'S CON- 
TEXT 
In this appendix ! describe an algorithm for calculating 
the set of rule contexts S c = { C } that satisify equations (2) 
and (3) repeated below in set notation as (4) and (5). Recall 
that C b are the contexts in which the alternation did take 
place, and C a are the contexts in which the alternations did 
not take place. We want to find (the set, of) contexts that. 
simultaneously match all the Cb, while not matching any C.. 
(4) V C~, C C_ C b 
In this paper 1 adopted strict ordering of all rules be- 
cause it is one of the more stringent rule ordering hypotheses 
available. 
e In fact, the sets C a and C b as defined above do not con- 
tain quite enough information alone. We must also indicate 
which segments in these contexts alternate, and what they al- 
ternate to. This may form the basis of a very different rule 
order discovery procedure. 
(5) Vc,. c ; c, 
We can manipulat.e these into computationally more 
tractable forms. Starting with (4), we have 
c~, c c c~ (= (4)) 
VCb,\/fE C,f~ C b 
~/e c, fc A CbCC I"3 Cb 
Put C, = f"l Cb- Then CC 6"i. 
Now consider equation (5). 
~'c,,c~ c, 
Vc,,~ i~ ( c- c.) 
But since C~ C 1, if f~ ( C- C0). then 
fE ( C 1 - C,) N C. Then 
~/c.,q_ /~ ( c,- c,),/~ c 
This last equation says thal ever), context thai fulfills the 
conditions above contains at least one feature that distin- 
guishes it from each C0, and that this feature must be in the 
intersection of all the C b. If for any C,. C\] - C e=O (the null 
set of features), then there are no contexts C that simultane- 
ously match all the C b and none of the C,, implying that no 
rule exists that accounts for the observed ah.ernation. 
We can construct the set S c using this last formula by 
first, calculating C1, the intersection of all the Cb, and then for 
each C,, calculating C I : ( C I - C° ), a member of which 
must be in every 6'. The idea is to keep a set of the minimal 
C needed to account for the C, so far; if C conl.ains a member 
of C! we don't need to modify it; if C does not contain a 
member of C I then we have to add a member of C I to it in 
order for it to satisfy the equations above. The algorithm 
below acomplishes this. 
set C 1 : \["I Cb 
set S c = {~} 
foreach C. 
set%= c,- c. 
if%-O 
return "No rule contezts" 
foreach C in S c 
ifcn el=-0 
remove C from S c 
foreaeh/in 6'/ 
add CU {/}t°S c 
return S c 
where the subroutine "add" adds a set to S c only if it or 
its subset is not already present. 
After this algorithm has applied, S c will contain all the 
minimal different C that satisfy equations (4) and (5) above. 
347 
