Unsupervised Discovery of Phonological Categories through 
Supervised Learning of Morphological Rules 
Walter Daelemans* 
CL & AI, Tilburg Uniw'xsity 
P.().Box 90153, 5000 LE Tilburg 
The Netherlands 
walter, dae-I emans@kub, nl 
Abstract 
We describe a case study in tit(', ap- 
plication of symbolic machinc learning 
techniques for the discow;ry of linguis- 
tic rules and categories. A supervised 
rule induction algorithm is used to learn 
to predict the. correct dimilmtive suffix 
given the phonological representation of 
Dutch nouns. The system produces rules 
which are comparable, to rules proposed 
by linguists, l,Slrthermore, in the process 
of learning this morphological task, the 
phonemes used are grouped into phono- 
logically relevant categories. We discuss 
the relevance of our method for linguis- 
tics attd language technology. 
1 Introduction 
This paper shows how machine lem'ning tech- 
niques can be used to induce linguistically relevant 
rules and categories fl'om data. Statistical, con- 
nectionist, and machine learning induction (data- 
oriented approaches) are currently nsed mainly in 
language, engineering at)t)lications in order to alle- 
viate the. linguistic knowledge acquisition bottle- 
neck (the fact that lexical an(t grammatical knowl- 
edge usually has to be reformulated t'i'()iii scratch 
whenever a new application has to be built or 
an existing application ported to a new domain), 
and to solve problems with robustness and cover- 
age inherent in knowledge-based (the.ory-oriente.d, 
hand-crafting) approaches. Linguistic relevance. 
or inspectability of the induced knowledge is usu- 
ally not an issue in this type of research. \]n lin- 
guistics, on the other hand, it is usually agreed 
that while computer modeling is a useful (or essen- 
tial) tool for enforcing internal consistency, com- 
pleteness, and empirical validity of the linguistic 
theory being modeled, its role in formulating or 
evaluating linguistic theories is minimal. 
In this paper, we argue that machine learning 
techniques can also assist in linguistic theory for- 
*Visiting fl'.llow at NIAS (Netherlands Instituee for 
Advanced Studies), Wassenaar, The. Netherlands. 
Peter Berck and Steven Gillis 
Linguistics, University of Antwerp 
Unive.rsiteitsplein 1, 2610 WiMjk 
l ~elgium 
steven, gillis@uia, ua. ac. be 
peter, berck@uia, ua. ac. be 
mation by providing a new tool for the evalua- 
tion of linguistic hypotheses, for the extraction 
of rules front corpora, and for the discovery (if 
useflll linguistic categories. As a case. study, we 
apply Quinlan's C4.5 inductive machine learning 
me.thod (Quinlan, 1993) to a particular linguistic 
task (diminutive fi)rmation in Dutch) and show 
that it; can be use(l (i) to test linguistic hypothe- 
ses about this process, (ii) to discover interesting 
morphological rules, and (iii) discover interesting 
phonological categories. Nothing hinges on our 
choic.e of (\]4.5 as a rule induction mechanism. Wc 
chose it because it is an easily available and so- 
phisticated instance of the class of rule induction 
algorithms. 
A second focus of this paper is the interac- 
tion between supervised and unsulmrvised ma- 
chine learning me.thods in linguistic discovery, in 
supervised learning, the. learner is presented a set 
of examples (the experience of the system). These 
examples consist of an inImt outtmt association 
(in our case, e.g., a representation of a llotln as 
input, and the corresponding dimilmtive sul\[ix as 
output). Unsupervised learning methods do not 
1)rovide the h',m'ner with inforlnatioil at)out the 
outf)ut to be generated; only the inputs ar(; I)re- 
sented to the learner as experience, not the target 
outputs. 
Unsupervised learning is necessarily more lim- 
ited t, hm~ supervised learning; the only informa- 
tion it has to construct categories is the similarity 
between inputs. Unsupervised learning has been 
successflflly applied e.g. for the discovery of syn- 
tactic categories from corpora on the basis of dis- 
tributional inforlnation about words (Finch and 
Chalet 1992, tIughes 1994, Schiitze 1995). We 
will show that it, is possible and useful to make 
use of unsupervised learning relative to a particu- 
lar task which is being learned in a supervised way. 
In our experinmnt, phonological categories are dis- 
covered in an unsupervised way, as a side-effect of 
the supervised learning of a morphological prob- 
lem. We will also show that this raises interesl;ing 
questions about, the. task-dependence of linguistic 
category systems. 
95 
2 Supervised Rule Induction with 
C4.5 
For the experiments, we used C4.5 (Quinlan, 
1993). Although several decision tree and rule 
induction variants have been proposed, we chose 
this program because it is widely available and 
reasonably well tested. C4,5 is a TDIDT (Top 
Down Induction of Decision Trees) decision tree 
learning algorithm which constructs a decision 
tree on the basis of a set of examples (tit('. training 
set). This decision tree has tests (feature names) 
as nodes, and feature values as branches between 
nodes. The leaf nodes are labeled with a category 
name and constitute the output of the system. A 
decision tree constructed on the basis of examples 
is used after training to assign a class to patterns. 
To test whether the tree has actually learned the 
problem, and has not just memorized the items 
it was trained on, the 9eneralization accuracy is 
measured by testing the learned tree on a part of 
the dataset not used in training. 
The algorithm for the construction of a C4.5 
decision tree can be easily stated. Given are a 
training set T (a collection of examples), and a 
finite number of classes C1 ... C~. 
1. If T contains one or more cases all belonging 
to the same class Cj, then the decision tree 
for 5/" is a leaf node with category Cj. 
2. If T is empty, a category has to be found on 
the basis of other information (e.g. domain 
knowledge). The heuristic used here is that 
the most frequent class in the initial training 
set is used. 
3. If T contains different classes then 
(at Choose a test (feature) with a finite num- 
ber of outcomes (values), and partition 
T into subsets of examples that have the 
same outcome for tim test chosen. The 
decision tree. consists of a root node con- 
taining the test, and a branch for each 
outcome, each bt'anch leading to a sub- 
set of the original set. 
(b) Apply the procedure recursively to sub- 
sets created this way. 
In this algorithm, it is not specitied which test 
to choose to split a node into sut)trees at some 
point. Taking one at random will usually result in 
large decision trees with poor generalization per- 
formanee, as uninformative tests may be chosen. 
Considering all possible trees consistent; with the 
data is computationally intractable, so a reliable 
heuristic test selection method has to be found. 
The method used in C4.5 is based on the con- 
cept of mutual information (or information gain). 
Whenever a test has to be selected, the feature is 
chosen with the highest information gain. This is 
the feature that reduces the information entropy 
of the training (sub)set on average most, when its 
value would be known. For the computation of 
information gain, see Quinlan (1993). 
Decision trees can be easily and automatically 
transformed into sets of if-then rules (production 
rules), which are in general easier to understand 
by domain experts (linguists in our case). In 
C4.5 this tree-to-ruh; transformation involves ad- 
ditional statistical evaluation resulting sometimes 
in a rule set more understandable att(.l accurate 
than the corresponding decision tree. 
The C4.5 algorithm also contains a value group- 
ing method which, on the basis of statistical in- 
formation, collapses different values for a feature 
into the same category. That way, more concise 
decision trees and rules can be produced (instead 
of sew'~ral different branches or rule conditions for 
each wflue, only one branch or condition has to 
be detined, making reference to a (;lass of values). 
The algorithm works as a heuristic search of the 
search space of all possible partitionings of the wd- 
ues of a particular tbature into sets, with the for- 
Ination of homogeneous nodes (nodes representing 
examples with predominantly the same category) 
as a heuristic guide. See Quinlan (1993) for more 
information. 
3 Diminutive Formation in Dutch 
In the remainder of this t)ape.r, we will describe 
a case study of using C4.5 to test linguistic hy- 
1)otheses attd to discover regularities and cate- 
gories. Tit(,. case study concerns allomorphy in 
Dutch diminutive formation, "one of the more 
vexed probleins of l)utch i,honology (...) \[and\] 
one of the most spectacular phenomena of mod- 
ern Dutch morphophonemics" (Trommelen 1983). 
Diminutive forlnation is a productive morpholog- 
ical rule in Dutch. Diminutives are formed by at- 
taching a form of the Germanic sntfix -tje to t;he 
singular base form of a noun. The suffix shows 
allomorphic variation (Table 1). 
Noun 
huts (house) 
man (man) 
raam (window) 
woning (house) 
baan (job) 
Form 
huisje 
mannetje 
raalnpje 
woninkje 
baantje 
Suffix 
-jc 
-gtj(; -pj('~ 
-tie 
Table 1: Allomorphic variation in Dutch diminu- 
tives. 
The fi'equency distribution of the different cat- 
egories is given in Table 2. We distinguish be- 
tween database frequency (frequency of a suffix in 
a list, of 3900 diminutive forms of nouns we took 
from the CELEX lexical database 1) and corpus 
~Developed by tile Center for Lexical Ilfforma- 
tion, Nijmegen. l)istributed by tile Linguistic Data 
Consortium. 
96 
frequency (frequency of ~ sutfix in the text corpus 
on which the word list was based). 
1).,,al,as. 'X, LC(,,p-s \] 
~t.i7 --|- --~897 - ->ig.7~, T 3b.,~%-1 
U / ar.a'x, / i;,¢ / / 7 0.9% t 
j 104 2.7'x, 1 4.o% 1 
i¢io J 77 2 0% 3 8% /1,, _ : • 
'l'abh~ 2: Lexicoil an(l (:O,l)US \[requ(!n(:y of alh/- 
morphs. 
llistoricnlly, dilh;rcnt a.nalyses of diminutive for- 
real;ion }tav(~ taken a (lifferenl, view of tile rules 
thai; goveru the (:hoi(:(', of 1;he diminutiv(~ sullix, 
and ot! the, linguistic con(:el)l;s playing a role in 
these rules (see, e.g. T(; Winkel 11866, Kruizinga 
1(.t15, Cohen 11958, and l'ef('~t'ellces ill Tl'OItlillt',lell 
1983). In t;ho, lal;1;er, il; ix argued l;hal; (limimll;ive 
formation ix a local 1)recess, in which collCel)l;s 
such as word stress and morphological st, rll(;l,llre 
(proposed ill l;he earlier analyses) (1() not play a 
r()le. The rhyme of the last syllabic of tim noun 
is necessary and sutlicienl; t(/ predict I;}m col'l'(~cl; 
a/lomort)h. The, nal;uraJ (:ategorics (or feal,ures) 
wlfi(:h are hyllothesised in her rules in(:lu(h', obst, r'u- 
ents, .sonorwnl,.% alld the (:lass of bimoraic vowels 
(consisting of long vowels, diphtongs and schwa). 
Diminutive formation is a. Slna\[l linguisl;i(: (lo- 
main for which different COmlmting l,hcories have 
}men pr(/t)os('xl ~ &ll(\[ fol' whi(:ll (liff(',r(~nt generaliza- 
l;ions (in terms of rules and linguistic categories) 
have been proposed. What we will show tw, x~; is 
how machine learning techniques tllay t)(~ llSed I;O 
(i) test competing hyi)otheso~s, (ii) discovc, r gene, r- 
alizations in the data whi(;h c}tIl I;ll(}II t)e comt/are(1 
to the generMizal;ions formulated })y linguists, aim 
(iii) discover phonologi(:al categories in ml unsu- 
pervised way by supervised learning of diminutive 
suttix prediction. 
4: Experiments 
li'or ea(:h of l,he 3900 nouns we coll(!cted, th(! fol- 
lowing information was kept. 
1. The phoneme transcription describing the 
syllable structure (in terms of onset, nucleus, 
and coda) of l;he last three syllables of the 
word. Missing slots are imlicatexl with =. 
2. D)r each of l;hese l;hree last syllables the, pres- 
elICe ()I' abse.llce of Sl;l'O, ss. 
3. The (:orreslionding dimitmtive allomorph, ab- 
breviated to E (-etjc), T (-tie,), ./ l-j(;), K (- 
Me), and I' (-pie). This is the' '(:al,egory' of 
the word to be learned by the learner. 
Some examples are given below (l;he word itself 
and its gloss are provided for convenience and were 
not used in the exllerimenl,s ). 
- b i = - z @ = + m h nt J biezenmand (basket) 
........ + b I x E big (pig) 
.... + b K = - b a n T bijbaan (side job) 
.... + b K = ~ b @ i T bijbel (bible) 
4.1 Experimental Method 
The, ext)(wim(ml;al set-u t) use(t in all eXl)Crin/(ml:s 
consisted of a ten-lbhl cross-wflid;ttion eXl)erimcnt 
(Weiss & Kulikowski 1991). In this set-up, the 
database is partitioned l;en time~s, each with ;t diL 
\['orelll. 101~/ (If lll(~ dal;asel; as the tesl prot, mid the 
remaining 9/1% as training parL. For each ¢)f l}te 
l,(',n simulations in our exp('~riinelll;s, I;h(~ l;esl; p;u't, 
was used to to, st go, ueralization perfornuu,:e. The 
success rate of an algoril;hm is obtained I)y cah:u- 
lat;ing Ihc av(ua,r( ,, aCClll'/lCy (llllltll)(!l' O\[: l;(~SI, t)nt, - 
I,ern categories correctly predit:ted) over the l:en 
test sets in the ten-fold cross-validation eXlmri- 
lilO.n{;. 
4.2 Learnal fility 
The, exp(~rim(mts show thai; the diminutive li~- 
marion 1)roblem is learnMfle in a data-(/ri(ml;(~(l 
way (i.e. 1)y extraction of regularities \['rein (!x- 
amlflCs , without, any a priori knowledge ahout~ 
the domain"). The average accuracy on unseen 
tx~st data of 98.4% should be compared to bast;- 
line l)crforlnan(:e measures baso, d on tnolmbilit~y - 
based guessing. This baseline would t)e an accu- 
ra(:y of a.l)out 4()~ for this prol)h;m. This shows 
t;hat the tn'()l)h'm is a.lm(/st t)(;rlh(:tly h',aruabl(! I)y 
induction, It, shouhl 1)e noted that CI';I,I,;X con- 
tains a numl)(~r ()\[ coding (',trots, so that some (ff 
lhe ~wrong' all(mlOrl)hs \])r(',(li(:ted by the ma(:lfine 
h;arning system were actually (:(II'I'(~(;L, Wq did not 
correct for this. 
\]It the next; three secl;ions, we will describe Lhe 
resull;s of l;he (~xImrim(;nts; tirst on the 1;ask of (:Olll- 
paring conlli(:ting l;he(/reti(:al hypotheses, then on 
discoverittg linguistic gen(;ralizaLions, and flintily 
(m unsul)(~rvis(~(l dis(:overy of l/h(/nologica.l cat(> 
gories. 
5 Linguistic Hypothesis Testing 
()n the basis of the analysis of I)utch diminutive 
formation by TronuneJen (1983), discussed brietly 
in SecLion 3, Lhe following hypotheses (among oth- 
ers) can be \[brnmlated. 
1. Only informatioil about the last, syllable is 
re, levant in predicting the, correct allomorph. 
2. \[nlormation about l;he onset of the last sylla- 
bi(, is irrelevant in predicting the, correct al- 
lomorph. 
3. Stress is irrelevant in predicting l;he correct 
allomorph. 
:~lCxcepl; syllMde stru(:tm-e, 
97 
In other words, information about the rhyine 
of the last syllable of a noun is necessary and 
sufficient to predict the correct allomorph of the 
diminutive suffix. To test these hypotheses, we 
performed four experiments, training and testing 
the C4.5 machine learning algorithm with fore' dif- 
ferent corpora. These corpora contained the fol- 
lowing information. 
1. All information (stress, onset, nucleus, coda) 
about the three last syllables (3-SYLL cor- 
pus). 
2. All information about the last syllable 
(SONC corpus). 
3. Information about the last syllable without 
stress (ONC corpus). 
4. Information about the last syllable without 
stress and onset (NC corpus). 
5.1 Results 
Table 3 lists the learnability results. The gener- 
alization error is given for each allomorph for the 
four different; training corpora. 
s''mX al 
I 
I -kjy 
\[ -p3e 
Errors and Error percental e.s 
3 SYLI, SONC ONC NC 
61 1.6 79 2.0 80 2.0 77 2.0 
13 0.7 
16 1.1 
26 7.3 
4 5.2 
2 1.9 
13 0.7 
15 1.0 
49 13.7 
0 0 
2 1.9 
14 0.7 
16 1.1 
48 13.5 
0 0 
2 1.9 
14 0.7 
14 1.0 
44 12.3 
0 0 
5 4.8 
Table 3: Error of C4.5 on the different corpora. 
The overall best results are achieved with the 
most elaborate corpus (containing all information 
about; the three last syllables), suggesting that, 
eontra Trommelen, important information is lost 
by restricting attention to only the last syllable. 
As far as the different encodings of the last sylla- 
ble are concerned, however, the learnability exper- 
iment coroborates Trommelen's claim that stress 
and onset are not necessary to predict the correct 
diminutive allomorph. When we look at the error 
rates for individual allomorphs, a more complex 
picture emerges. The error rate on -etje dramati- 
cally increases (from 7% to 14%) when restricting 
information to the last syllable. The -k~e allo- 
morph, on the other hand, is learned perfectly on 
the basis of the last syllable alone. What has hap- 
pened here is that the learning method has over- 
generalized a rule predicting -kje after the velar 
nasal, because the data do not contain enough in- 
formation to correctly handle the notoriously diffi- 
cult opposition between words like leerling (pupil, 
takes -etje) and koning (king, takes -kje). Purther- 
more, the error rate on -pje is doubled when onset 
information is left out from the corpus. 
We can conch;de from these experiments that 
although the broad lines of the analysis by Trom- 
melen (1983) are correct, the learnability results 
point at a number of problems with it (notably 
with -kje versus -etje and with -pje). We will move 
now to the use of inductive learning algorithms as 
a generator of generalizations about the domain, 
and compare these generalizations to the analysis 
of Trommelen. 
6 Supervised Learning of 
Linguistic Generalizations 
When looking only at the rhyme of the last sylla- 
ble (the NC corpus), the decision tree generated 
by C4.5 looks as follows: 
Decision Tree: 
coda in {rk,nt,lt,rt,p,k,t,st,s,ts,rs,rp,f, 
x, ik,Nk,mp, xt,rst,ns ,nst, 
rx,kt, ft, if ,mr, Ip,ks, is,kst, ix} : J 
coda in {n,=,l,j,r,m,N,rn,rm,w,lm}: 
nucleus in {I,A,},O,E}: 
coda in {n,l,r,m}: E 
coda in {=,j,rn}: T 
coda in {rm,lm}: P 
coda = N: 
1 nucleus = I: K 
I nucleus in {A,O,E}: E 
nucleus in {K,a,e,u,M,@,y,o,i,L,), I,<}: 
\[ coda in {n,=,l,j,r,rn,w}: T 
\[ coda = m: P 
Notice that the phoneme representation used 
by CELEX (called DISC) is shown here instead 
of the more standard IPA font, and that the value 
grouping mechanism of C4.5 has created a mnnber 
of phonological categories by collapsing different 
phonemes into sets indicated by curly brackets. 
This decision tree should be read as follows: 
first check the coda (of the last syllable). If it 
ends in an obstruent, the allomorph is -jc. If not, 
check tile nucleus. If it is bimoraic, and the coda 
is/m/, decide -pje, if the coda is not/m/, decide 
-tje. When the coda is not an obstruent, the nu- 
cleus is short and the coda is /ng/, we have to 
look at the nucleus again to decide between -kje 
and -etje (this is where the overgeneralization to 
-kje for words in -ing occurs). Finally, the coda 
(nasa-liquid or not) helps us distinguish between 
-etje and -pje for those cases where the nucleus is 
short. It should be clear that this tree can eas- 
ily be formulated as a set of rules without loss of 
accuracy. 
An interesting problem is that the -etje versus 
-kje problem for words ending in -ing couht hot be 
solved by referring only to the last syllable (C4.5 
and any other statistically based induction algo- 
rithm overgeneralize to -kjc). The following is the 
knowledge derived by C4.5 t'rofll the flfll corpus, 
with all information about the three last syllables 
(the 3 SYLL corpus). We provide the rule version 
of the inferred knowledge this time. 
98 
Default class is -tje 
1. IF coda last is /lm/ or /rm/ 
THEN -pje 
2. IF nucleus last is \[+bimoraic\] 
coda last is /m/ 
THEN -pje 
3. IF coda last is /N/ 
THEN IF nucleus penultimate is empty 
(monosyllabic word) or schwa 
THEN -etje 
ELSE -kje 
4. IF nucleus last is \[+short\] 
coda last is \[+nas\] or \[+liq\] 
THEN -etje 
5. IF coda last is \[+obstruent\] 
THEN -je 
The default class is -tjc, which is the allomorph 
chosen when none of the other rules apply. This 
explains why this rule set looks simi)h'.r than tit(; 
decision tree earlier. 
The first thing which is interesting in this rule 
set, is that only tlu'ee of the twelve presented fea- 
tures (coda an(1 nllclelts of (;lie last syllable, nll- 
cleus of the i)emlltimate syllal)le) m'e used in the 
rules. Contrary to the hyi)oth(;sis of Trommelen, 
apart from the rhyme of the last sylbfl)le, the m> 
(:\[eus of the pemfltimate sylhd)le is taken to \])e 
re.levant ;~s well. 
The induced rules roughly correspond to the 
previous decision tree, but; in ad(lition a solution 
is provided to the -etje versus -kje problem for 
words ending in -in9 (rule 3) making use of in- 
formation about the nucleus of the. t)emfltiInate 
syllabi(;. Rule 3 states that words ending in /ng/ 
get -etjeas (liminutive alloinorl)h when they are 
monosyllables (nucleus of the penultimate syllable 
is empty) or when they have a schwa as t)(multi - 
mate rainless, and -kjc othe, rwise. As fro as we 
now, this generalization has not been prot)osed in 
this form in the lmblished literature on diminutive 
formation. 
We conclude from this part of the experiment 
that the Inaehine learning inethod has suc(:ee(led 
in extracting a sophistieate(l set of linguistic rules 
from the examph'.s in a purely data-oriented way, 
an(l that these rules are formulated at a level that 
makes their use in the development of linguistic 
theories possible. 
7 Discovery of Phonological 
Categories 
To structure the phoneme inventory of a language, 
linguists define features. '\['hese ekLIl be interpreted 
as sets of st)ee(:h sounds (categories): e.g. the 
category (or feature) labial groups those speech 
sounds that involve the lips as an a(:tive art\[c- 
ulator. Speech sounds behmg to different cate- 
gories, i.e., are defined by ditferent \['e~tures. l,',.g. 
t is voiceless, a coronal, and a stop. Categories 
t)roposed in phonology are inspired by articula- 
tory, acoustic or tmreeptual phonetic ditferences 
between speech sounds. They are also proposed 
to allow an optimally concise or elegant formu- 
lation of rules for the description of phonologi- 
cal or mot'phological processes. E.g., the so-calleA 
major (:lass features (obstruents, nasals, liquids, 
glides, vowels) efficiently explain syllable structure 
eomput;ation, lint are of little use in the definition 
of rules describing assimilation. For ass\[re\[In\[ion, 
placu of mti(:ulation f(~atllr(~s arc t)est ilse(l. This 
situation has led to the t)roposa.l of many dillhrenC 
phonoh)gieal category systems. 
Whih; constructing the decision tre.e (see prey\[- 
Oils section), several t)honologically relevant cat- 
(;gories are 'discovered' by the value grouping 
mechanism in C4.5, including the nasals, the liq- 
uids, the obstruents, the short vowels, mtd the 
bimoraic vowels. This last category corresponds 
completely with the (then new) category hypoth- 
esise, d by Trommelen and containing the long vow- 
els, tit(; diphtongs att(l the s('hwa, in oth(;r words, 
the learning a.lgorithm has discovered this set of 
phonemes to 1)e a useful category in solving the 
(|iminut;ive formation problem t)y t)rovi(ling ml e×- 
I,e.nsional detinition of it (a lisl; of tim inst;ulees ()\[: 
I;he ea.tegory). 
This raises the question of the task-dependence 
of linguistic categories. Similar experiments in 
Dutch t)lural formation, for examt)le, fail to pro- 
duce th(' (:atcgory of bimoraic vowels, and for some 
tasks, categori(:s show u t) which hi~vc no ontolog- 
ical status in linguistics. In other words, mak- 
ing category formation del)endent oil the task to 
t)e learned, unde.rmitms the. tratlitional linguistic 
ideas about absolute, task-indel)endent (and even 
1;mguage-in(h',t)endeitt) categories. We present 
heI'e & lI(!~,v methodology with which this flltl(lDo- 
mental issue in linguistics can t)(; investigated: 
category systems ext;racted for difl'erent tasks in 
different languages can be studied to see which 
categories (if any) truely have a universal status. 
This is subject tbr fllrther resem'ch. It wouhl also 
l)e use.rid to stu(ly the indu(:ed categories when 
intensional descriptions (feature represeutations) 
are used as input instead of extensional descrit)- 
lions (phoitetnes). 
We also experimented with a siml)h;r alternative 
to the computationally complex heuristic category 
\[orma.tion algorithm used by (;4.5. This method 
is inspire(1 by machine learning work on wflue dif 
ference metrics (Stanfill & Waltz, 1986; Cost & 
Salzberg, :1993). Starting fl'om the training set 
of the sut)ervised learning exl)erinlent (the set ()f 
input ouq)ut mappings used by the system to ex- 
tract rules), we selc(:t a particular feature (e.g. the 
coda of the last syllable), and comt)ute a table as- 
99 
sociating with each t)ossit)le value of tile feature 
the number of times the pattern in which it, oc- 
curs was assigned to each different category (in 
this case, each of the the five allomorphs). This 
produces a table with for each value a distribution 
over categories. This table is then used in stan- 
dard clustering approaches to derive categories of 
values (in this case consonmlts). The following 
is one of these clustering results. The example 
shows that this computationally simple approach 
also succeeds in discovering categories in an unsu- 
pervised way on tile basis of data for supervised 
learning. 
....... > l 
I ....... > r 
-I ............ > n 
I---I ..... I ..... > t 
I I I ..... I .... > k 
I ............ I I .... > s 
I ..... I--> p 
I .... I I--> f 
I ..... I .... > m 
I .... I .... > N 
I .... I--> x 
I__1-> j 
I-> w 
Several categories, relevant for diminutive for- 
mation, such as liquids, nasals, the velar nasal, 
semi-vowels, fi'icatives etc., are reflected in this 
hierarchical clustering. 
8 Conclusion 
We have shown by example that machine learn- 
ing technique, s can profitably be used in linguistics 
as a tool for the comparison of linguistic theories 
and hypotheses or for the discovery of new lin- 
guistic theories in the form of linguistic rules or 
categories. 
The case study we presented concerns dimiml- 
live formation in Dutch, for which we showed that 
(i) machine learning techniques can be used to cor- 
roborate and falsify some of the existing theories 
about the phenomenon, and (ii) machine learning 
techniques can be used to (re)discover interesting 
linguistic rules (e.g. the rule solving the -etjc ver- 
sus -kje problem) and categories (e.g. the category 
of bimoraic vowels). 
The extracted system can of course also be used 
in language technology as a data-oriented system 
for solving particular linguistic tasks (in this case 
diminutive format!on). In order to test the usabil- 
ity of the approach for this application, we com- 
pared the performance of the extracted rule sys- 
tem to tile performance of the hand-crafted rule 
system proposed by Trommelen. Table 4 shows 
for each allomorph the number of errors by the 
C4.5 rules (trained using corpus NC, i.e. only the 
rhyme of the last syllable) as opposed to an imple- 
mentation of the rules suggested by ~l¥ommelen. 
One problem with the latter is thai; they often sug- 
gest more than one allomorph (the rules are not 
mutually exclusive). In those cases where more 
than one rule applies, a choice was made at ran- 
dom. 
Suffix Trommelen C4.5 
-tje 53 11 
-jc 12 12 
-eric 28 39 
-~iie 38 o 
-pje 21 4 
Total 152 66 
Table 4: Comparison of accuracy between hand- 
crafl;ed and induced rules. 
The comparison shows that C4.5 did a good job 
of finding an elegant and accurate rule-based de- 
scription of the problem. This rule set is useful 
both in linguistics (for evaluation, refinement, and 
discovery of theories) and in language technology. 
References 
Cohen, A. Het Nederlandse diminutiefsuilix; een 
morfologische I)roeve. De Nic~twe Taalgids, 51, 
40-45, 1958. 
Cost, S. and Salzberg, S. 'A weighte, d nearest 
neight)or algorithm tor learning with symbolic 
features.' Mach, ine Learning, l(/, 57 78, 1993. 
Finch, S. & N. Chater. 'Bootstrapping Syntac- 
tic Categories Using Statistical Methods', in: 
W.Daelemans & 1).Powers (eds.), Backgrov, r~,d 
and E.,cperirnent.s in Machine Lear'nb~,g of Natv,- 
ral Language, Tilburg University, ITK, 1992. 
Itughes, J. 'Automatically Acquiring a Classifica- 
tion of Words', Phi) dissertation, University of 
Leeds, School of Computer Studies, 1994 
Kruisinga, E. De vorm vail verkMnwoorden. De 
Nieuwc ~lhalgids , 9, 96-97, 1915. 
Quinlan, J. It,. C~.5 Programs for machine learn- 
ing 1993. 
Sch{itze, H., Ambiguity in Language Learning: 
Computational and Cognitive Models, PhD dis- 
sertation, Stanford University, Department of 
Linguistics, 1995. 
Stanfill, C. and Waltz, D.L. 'Toward Memory- 
based Reasoning'. Communications of the A CM 
29, 1986, 1213 1228. 
\]¥onunelen, M.T.G. Thc syllable in Dutch, with 
special reference to diminutive formation. Foris, 
Dordrecht, 1983. 
Weiss, S. and Kulikowski, C. (1991). Computer 
systems that learn. Morgan Kaufmann, San 
Marco. 
Winkel, L.A. Te. Over de verklcinwoorden. De 
Taalgids 4: 81-116. 
i00 
