Do Not Forget: 
Full Memory in Memory-Based Learning of Word Pronunciation * 
Antal van den Bosch and Walter Daelemans 
Tilburg University, ILK 
P.O. Box 90153, NL-5000 LE Tilburg 
The Netherlands 
{antalb, walter}@kub.nl
Abstract 
Memory-based learning, keeping full memory
of learning material, appears to be a viable approach
to learning NLP tasks, and is often superior
in generalisation accuracy to eager learning
approaches that abstract from learning mate-
rial. Here we investigate three partial memory-
based learning approaches which remove from
memory specific task instance types estimated
to be exceptional. The three approaches each
implement one heuristic function for estimat-
ing exceptionality of instance types: (i) typi-
cality, (ii) class prediction strength, and (iii)
friendly-neighbourhood size. Experiments are
performed with the memory-based learning al-
gorithm IB1-IG trained on English word pro-
nunciation. We find that removing instance
types with low prediction strength (ii) is the
only tested method which does not seriously
harm generalisation accuracy. We conclude
that keeping full memory of types rather than
tokens, and excluding minority ambiguities, ap-
pear to be the only performance-preserving op-
timisations of memory-based learning.
1 Introduction 
Memory-based learning of classification tasks is a 
branch of supervised machine learning in which the 
learning phase consists simply of storing all en- 
countered instances from a training set in mem- 
ory (Aha, 1997). Memory-based learning algorithms 
do not invest effort during learning in abstract-
ing from the training data, as eager-learning algorithms
(e.g., decision-tree algorithms, rule-induction, or
connectionist-learning algorithms (Quinlan, 1993;
Mitchell, 1997)) do. Rather, they defer investing
effort until new instances are presented. On be-
ing presented with an instance, a memory-based
*This research was done in the context of the "Induc- 
tion of Linguistic Knowledge" research programme, par- 
tially supported by the Foundation for Language Speech 
and Logic (TSL), which is funded by the Netherlands 
Organization for Scientific Research (NWO). Part of the 
first author's work was performed at the Department of 
Computer Science of the Universiteit Maastricht.
learning algorithm searches for a best-matching in- 
stance, or, more generically, a set of the k best- 
matching instances in memory. Having found such 
a set of k best-matching instances, the algorithm
takes the (majority) class with which the instances
in the set are labeled to be the class of the new
instance. Pure memory-based learning algorithms 
implement the classic k-nearest neighbour algo- 
rithm (Cover and Hart, 1967; Devijver and Kittler, 
1982; Aha, Kibler, and Albert, 1991); in different 
contexts, memory-based learning algorithms have 
also been named lazy, instance-based, exemplar-
based, memory-based, and case-based learning or reason-
ing (Stanfill and Waltz, 1986; Kolodner, 1993; Aha,
Kibler, and Albert, 1991; Aha, 1997).
Memory-based learning has been demonstrated 
to yield accurate models of various natural lan- 
guage tasks such as grapheme-phoneme conver- 
sion, word stress assignment, part-of-speech tagging, 
and PP-attachment (Daelemans, Van den Bosch, 
and Weijters, 1997a). For example, the memory-
based learning algorithm IB1-IG (Daelemans and
Van den Bosch, 1992; Daelemans, Van den Bosch,
and Weijters, 1997b), which extends the well-known
IB1 algorithm (Aha, Kibler, and Albert, 1991)
with an information-gain weighted similarity met-
ric, has been demonstrated to perform adequately
and, moreover, consistently and significantly better
than eager-learning algorithms which do invest ef-
fort in abstraction during learning (e.g., decision-
tree learning (Daelemans, Van den Bosch, and
Weijters, 1997b; Quinlan, 1993), and connectionist
learning (Rumelhart, Hinton, and Williams, 1986))
when trained and tested on a range of morpho-
phonological tasks (e.g., morphological segmenta-
tion, grapheme-phoneme conversion, syllabification,
and word stress assignment) (Daelemans, Gillis, and
Durieux, 1994; Van den Bosch, Daelemans, and
Weijters, 1996; Van den Bosch, 1997). Thus, when
learning NLP tasks, the abstraction occurring in de-
cision trees (i.e., the explicit forgetting of informa-
tion considered to be redundant) and in connec-
tionist networks (i.e., a non-symbolic encoding and
decoding in relatively small numbers of connection
van den Bosch and Daelemans 195 Memory-Based Learning of Word Pronunciation 
Antal van den Bosch and Walter Daelemans (1998) Do Not Forget: Full Memory in Memory-Based Learning of Word 
Pronunciation. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational 
Natural Language Learning, ACL, pp 195-204. 
weights) both hamper accurate generalisation of the 
learned knowledge to new material. 
These findings appear to contrast with the general 
assumption behind eager learning, that data repre-
senting real-world classification tasks tend to con-
tain (i) redundancy and (ii) exceptions: redundant
data can be compressed, yielding smaller descrip-
tions of the original data; some exceptions (e.g., low-
frequency exceptions) can (or should) be discarded
since they are expected to be bad predictors for clas-
sifying new (test) material. However, neither redun-
dancy nor exceptionality can be computed triv-
ially; heuristic functions are generally used to esti-
mate them (e.g., functions from information theory
(Quinlan, 1993)). The lower generalisation accura-
cies of both decision-tree and connectionist learning,
compared to memory-based learning, on the above-
mentioned NLP tasks, suggest that these heuristic es-
timates may not be the best choice for learning NLP
tasks. It appears that in order to learn such tasks 
successfully, a learning algorithm should not forget 
(i.e., explicitly remove from memory) any informa- 
tion contained in the learning material: it should not 
abstract from the individual instances. 
An obvious type of abstraction that is not harm- 
ful for generalisation accuracy (but that is not al- 
ways acknowledged in implementations of memory- 
based learning) is the straightforward abstraction 
from tokens to types with frequency information. 
In general, data sets representing natural language 
tasks, when large enough, tend to contain consider- 
able numbers of duplicate sequences mapping to the 
same output or class. For example, in data repre- 
senting word pronunciations, some sequences of let- 
ters, such as ing at the end of English words, occur 
hundreds of times, while each of the sequences is 
pronounced identically, viz. /ɪŋ/. Instead of storing
all individual sequence tokens in memory, each set 
of identical tokens can be safely stored in memory 
as a single sequence type with frequency informa-
tion, without loss of generalisation accuracy (Daele-
mans and Van den Bosch, 1992; Daelemans, Van den
Bosch, and Weijters, 1997b). Thus, forgetting in-
stance tokens and replacing them by instance types
may lead to considerable computational optimisa-
tions of memory-based learning, since the memory
that needs to be searched may become considerably
smaller.
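The token-to-type abstraction described above can be illustrated with a minimal Python sketch. The function name `tokens_to_types` and the toy tokens are hypothetical and serve only as an illustration; this is not the implementation used in the experiments.

```python
from collections import Counter

def tokens_to_types(tokens):
    """Collapse identical (features, class) tokens into types with frequencies."""
    return Counter(tokens)

# Toy instance tokens: (feature string, class label); labels are placeholders.
tokens = [
    ("ing", "I"), ("ing", "I"), ("ing", "I"),
    ("ink", "I"),
    ("ing", "N"),
]
types = tokens_to_types(tokens)

# Each distinct (features, class) pair is stored once, with its frequency.
for (features, label), frequency in types.items():
    print(features, label, frequency)
```

Note that a type is a unique combination of feature values and class, so the same feature values with two different classes yield two types.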
Given the safe, performance-preserving optimisa-
tion of replacing sets of instance tokens by instance
types with frequency information, a next step of in-
vestigation into optimising memory-based learning
is to measure the effects of forgetting instance types
on grounds of their exceptionality, the underlying 
idea being that the more exceptional a task instance 
type is, the more likely it is that it is a bad predic- 
tor for new instances. Thus, exceptionality should in 
some way express the unsuitability of a task instance 
type to be a best match (nearest neighbour) to new 
instances: it would be unwise to copy its associated 
classification to best-matching new instances. In this 
paper, we investigate three criteria for estimating 
an instance type's exceptionality, and removing in- 
stance types estimated to be the most exceptional 
by each of these criteria. The criteria investigated 
are 
1. typicality of instance types; 
2. class prediction strength of instance types; 
3. friendly-neighbourhood size of instance types;
4. random (to provide a baseline experiment). 
We base our experiments on a large data set of 
English word pronunciation. We briefly describe 
this data set, and the way it is converted into an 
instance base fit for memory-based learning, in Sec-
tion 2. In Section 3 we describe the settings of our
experiments and the memory-based learning algo-
rithm IB1-IG with which the experiments are per-
formed. We then turn to describing the notions
of typicality, class-prediction strength, and friendly- 
neighbourhood size, and the functions to estimate 
them, in Section 4. Section 5 provides the experi- 
mental results. In Section 6, we discuss the obtained 
results and formulate our conclusions. 
2 The word-pronunciation data 
Converting written words to stressed phonemic tran- 
scription, i.e., word pronunciation, is a well-known 
benchmark task in machine learning (Stanfill and 
Waltz, 1986; Sejnowski and Rosenberg, 1987; Shav-
lik, Mooney, and Towell, 1991; Dietterich, Hild, and
Bakiri, 1990; Wolpert, 1990). We define the task as
the conversion of fixed-sized instances representing
parts of words to a class representing the phoneme
and the stress marker of the instance's middle let-
ter. To generate the instances, windowing is used
(Sejnowski and Rosenberg, 1987). Table 1 displays
example instances and their classifications generated
on the basis of the sample word booking. Classifica-
tions, i.e., phonemes with stress markers (henceforth
PSs), are denoted by composite labels. For exam-
ple, the first instance in Table 1, ___book, maps to
class label /b/1, denoting a /b/ which is the first
phoneme of a syllable receiving primary stress. In
this study, we chose a fixed window width of seven 
letters, which offers sufficient context information for 
adequate performance, though extension of the win- 
dow decreases ambiguity within the data set (Van 
den Bosch, 1997). 
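The windowing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `window_instances`, the use of `_` as padding symbol, and the ASCII class labels for booking are hypothetical stand-ins, and the letter-to-phoneme alignment is assumed to be given.

```python
def window_instances(word, labels, width=7, pad="_"):
    """Generate one fixed-width instance per letter, centred on that letter."""
    if len(word) != len(labels):
        raise ValueError("one class label per letter is required")
    half = width // 2
    padded = pad * half + word + pad * half
    # The window starting at position i has word[i] as its middle (focus) letter.
    return [(padded[i:i + width], labels[i]) for i in range(len(word))]

# Illustrative ASCII stand-ins for the phoneme-with-stress labels of "booking".
labels = ["b1", "u0", "-0", "k0", "I0", "N0", "-0"]
instances = window_instances("booking", labels)
for features, label in instances:
    print(features, label)
```

A wider window would be obtained by raising `width`, at the cost of longer feature vectors.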
The task, henceforth referred to as GS (Grapheme-
phoneme conversion and Stress assignment), is sim-
ilar to the NETTALK task presented by Sejnowski
and Rosenberg (1986), but is performed on a larger
corpus of 77,565 English word-pronunciation pairs,
extracted from the CELEX lexical data base (Bur-
nage, 1990). Converted into fixed-sized instances, the
instance   left      focus    right
number     context   letter   context   classification
1          _ _ _     b        o o k     /b/1
2          _ _ b     o        o k i     /u/0
3          _ b o     o        k i n     /-/0
4          b o o     k        i n g     /k/0
5          o o k     i        n g _     /ɪ/0
6          o k i     n        g _ _     /ŋ/0
7          k i n     g        _ _ _     /-/0
Table 1: Example of instances generated for the word-pronunciation
task from the word booking.
full instance base representing the GS task contains
675,745 instances. The task features 159 classes
(combined phonemes and stress markers). The cod-
ing of the output as 159 atomic ('local') classes com-
bining grapheme-phoneme conversion and stress as-
signment is one out of many types of output cod-
ing (Shavlik, Mooney, and Towell, 1991), e.g., dis-
tributed bit coding using articulatory features (Se-
jnowski and Rosenberg, 1987), error-correcting out-
put coding (Dietterich, Hild, and Bakiri, 1990), or
split discrete coding of grapheme-phoneme conver-
sion and stress assignment (Van den Bosch, 1997).
These studies point at back-propagation learn-
ing (Rumelhart, Hinton, and Williams, 1986), us-
ing distributed output code, as the better per-
former as compared to ID3 (Quinlan, 1986), a sym-
bolic inductive-learning decision tree algorithm (Di-
etterich, Hild, and Bakiri, 1990; Shavlik, Mooney,
and Towell, 1991), unless ID3 was equipped with
error-correcting output codes and additional man-
ual tweaks (Dietterich, Hild, and Bakiri, 1990). Sys-
tematic experiments with the data also used in this
paper have indicated that both back-propagation
and decision-tree learning (using either distributed
or atomic output coding) are consistently and sig-
nificantly outperformed by memory-based learning
of grapheme-phoneme conversion, stress assignment,
and the combination of the two (Van den Bosch,
1997), using atomic output coding. Our choice for
atomic output classes in the present study is moti-
vated by the latter results.
3 Algorithm and experimental setup 
3.1 Memory-based learning in IB1-IG
In the experiments reported here, we employ IB1-IG
(Daelemans and Van den Bosch, 1992; Daelemans,
Van den Bosch, and Weijters, 1997b), which has
been demonstrated to perform adequately, and sig-
nificantly better than eager-learning algorithms on
the GS task (Van den Bosch, 1997). IB1-IG con-
structs an instance base during learning. An in-
stance in the instance base consists of a fixed-length
vector of n feature-value pairs (here, n = 7), an in-
formation field containing the classification of that
particular feature-value vector, and an information
field containing the occurrences of the instance with
its classification in the full training set. The lat-
ter information field thus enables the storage of in-
stance types rather than the more extensive storage
of identical instance tokens. After the instance base
is built, new (test) instances are classified by match-
ing them to all instance types in the instance base,
and by calculating with each match the distance be-
tween the new instance X and the memory instance
type Y, $\Delta(X, Y)$, using the function given in Eq. 1:
$$\Delta(X, Y) = \sum_{i=1}^{n} W(f_i)\,\delta(x_i, y_i) \qquad (1)$$
where $W(f_i)$ is the weight of the ith feature, and
$\delta(x_i, y_i)$ is the distance between the values of the
ith feature in the instances X and Y. When the
values of the instance features are symbolic, as with
the GS task (i.e., feature values are letters), a simple
distance function is used (Eq. 2):
$$\delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases} \qquad (2)$$
The classification of the memory instance type Y
with the smallest $\Delta(X, Y)$ is then taken as the clas-
sification of X. This procedure is also known as
1-NN, i.e., a search for the single nearest neighbour, 
the simplest variant of k-NN (Devijver and Kittler, 
1982). 
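The weighted distance of Eq. 1 with the overlap metric of Eq. 2, followed by 1-NN classification, may be sketched in Python as below. The names `distance` and `classify_1nn`, the toy memory, and the feature weights are hypothetical; ties between equidistant neighbours are resolved here simply by storage order, which is a simplification of the tie handling described for IB1-IG.

```python
def distance(x, y, weights):
    """Eq. 1 with the overlap metric of Eq. 2: sum the weights of mismatching features."""
    return sum(w * (0 if a == b else 1) for w, a, b in zip(weights, x, y))

def classify_1nn(x, memory, weights):
    """Return the class of the stored instance type nearest to x.

    memory is a list of (features, class) pairs; on distance ties the
    first stored type wins (a simplification, not the IB1-IG tie rule).
    """
    nearest = min(memory, key=lambda item: distance(x, item[0], weights))
    return nearest[1]

# Toy three-feature memory with made-up weights (middle feature most relevant).
weights = [1, 3, 2]
memory = [("abc", "X"), ("abd", "Y"), ("xbc", "Z")]
predicted = classify_1nn("xbd", memory, weights)
print(predicted)
```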
The weighting function of IB1-IG, $W(f_i)$, repre-
sents the information gain of feature $f_i$. Weight-
ing features in k-NN classifiers such as IB1-IG is an
active field of research (cf. (Wettschereck, 1995;
Wettschereck, Aha, and Mohri, 1997), for compre-
hensive overviews and discussion). Information gain
is a function from information theory also used in
ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993). The
information gain of a feature expresses its relative
relevance compared to the other features when per-
forming the mapping from input to classification.
The idea behind computing the information gain
of features is to interpret the training set as an in-
formation source capable of generating a number of
messages (i.e., classifications) with a certain proba-
bility. The information entropy H of such an infor-
mation source can be compared in turn for each of
the features characterising the instances (let n equal
the number of features), to the average information
entropy of the information source when the value of
that feature is known.
Data-base information entropy $H(D)$ is equal to
the number of bits of information needed to know
the classification given an instance. It is computed
by equation 3, where $p_i$ (the probability of classifi-
cation i) is estimated by its relative frequency in the
training set.
$$H(D) = -\sum_{i} p_i \log_2 p_i \qquad (3)$$
To determine the information gain of each of the n
features $f_1 \ldots f_n$, we compute the average informa-
tion entropy for each feature and subtract it from
the information entropy of the data base. To com-
pute the average information entropy for a feature
$f_i$, given in equation 4, we take the average informa-
tion entropy of the data base restricted to each pos-
sible value for the feature. The expression $D_{[f_i=v_j]}$
refers to those patterns in the data base that have
value $v_j$ for feature $f_i$, j ranges over the possible
values of $f_i$, and V is the set of possible values for
feature $f_i$. Finally, $|D|$ is the number of patterns in
the (sub) data base.
$$H(D_{[f_i]}) = \sum_{v_j \in V} H(D_{[f_i=v_j]}) \frac{|D_{[f_i=v_j]}|}{|D|} \qquad (4)$$
Information gain of feature $f_i$ is then obtained by
equation 5.
$$G(f_i) = H(D) - H(D_{[f_i]}) \qquad (5)$$
Using the weighting function W(fi) acknowledges 
the fact that for some tasks, such as the current GS 
task, some features are far more relevant (impor-
tant) than other features. Using it, instances that
match on a feature with a relatively high informa-
tion gain are regarded as less distant (more alike)
than instances that match on a feature with a lower 
information gain. 
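The computation of information gain (equations 3 to 5) can be made concrete with a short Python sketch. The function names `entropy` and `information_gain` and the four toy instances are hypothetical; base-2 logarithms are used because the entropy is expressed in bits.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Eq. 3: entropy in bits, with probabilities estimated by relative frequency."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(instances, feature_index):
    """Eq. 5: data-base entropy minus the average entropy given the feature (Eq. 4)."""
    labels = [label for _, label in instances]
    by_value = defaultdict(list)
    for features, label in instances:
        by_value[features[feature_index]].append(label)
    average = sum(len(subset) / len(labels) * entropy(subset)
                  for subset in by_value.values())
    return entropy(labels) - average

# Toy data: the first feature determines the class, the second is irrelevant.
data = [("ax", "P"), ("ay", "P"), ("bx", "Q"), ("by", "Q")]
gains = [information_gain(data, i) for i in range(2)]
print(gains)
```

In this toy data set the first feature receives the full one bit of gain and the second receives none, so a mismatch on the first feature would dominate the weighted distance.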
Finding a nearest neighbour to a test instance may 
result in two or more candidate nearest-neighbour
instance types at an identical distance to the test in-
stance, yet associated with different classes. The im-
plementation of IB1-IG used here handles such cases
in the following way. First, IB1-IG selects the class
with the highest occurrence within the merged set of
classes of the best-matching instance types. In case
of occurrence ties, the classification is selected that
has the highest overall occurrence in the training set
(Daelemans, Van den Bosch, and Weijters, 1997b).
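The two-stage tie resolution can be sketched as follows. This is one possible reading of the description above, not the actual IB1-IG code: the function name `resolve_tie` is hypothetical, and it is an assumption that "occurrence within the merged set" means the summed frequencies of the equidistant types per class.

```python
from collections import Counter

def resolve_tie(candidates, class_totals):
    """Pick a class among equidistant nearest instance types.

    candidates: list of (class, frequency) for the best-matching types.
    class_totals: overall class occurrence in the training set, used only
    when the merged frequencies are themselves tied.
    """
    merged = Counter()
    for label, frequency in candidates:
        merged[label] += frequency  # assumption: occurrences are summed per class
    top = max(merged.values())
    tied = [label for label, count in merged.items() if count == top]
    return max(tied, key=lambda label: class_totals.get(label, 0))

# Class A wins on merged occurrence (2 + 2 > 3).
first = resolve_tie([("A", 2), ("B", 3), ("A", 2)], {"A": 10, "B": 20})
# Merged occurrences tie (3 vs 3); B wins on overall occurrence.
second = resolve_tie([("A", 3), ("B", 3)], {"A": 10, "B": 20})
print(first, second)
```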
3.2 Setup 
We performed a series of experiments in which IB1-
IG is applied to the GS data set, systematically edited
according to each of the three tested criteria (plus 
the baseline random criterion) described in the next 
section. We performed the following global proce- 
dure: 
1. We partitioned the full GS data set into a training
set of 608,228 instances (90% of the full data 
set) and a test set of 67,517 instances (10%). 
For use with IB1-IG, which stores instance types
rather than instance tokens, the data set was re- 
duced to contain 222,601 instance types (i.e., 
unique combinations of feature-value vectors 
and their classifications), with frequency infor- 
mation. 
2. For each exceptionality criterion (i.e., typ- 
icality, class prediction strength, friendly- 
neighbourhood size, and random selection), 
(a) we created four edited instance bases by 
removing 1%, 2%, 5%, and 10% of the 
most exceptional instance types (according 
to the criterion) from the training set, re- 
spectively. 
(b) For each of these increasingly edited train- 
ing sets, we performed one experiment in 
which IB1-IG was trained on the edited
training set, and tested on the original 
unedited test set. 
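The editing step of this procedure can be sketched in Python. The function name `edit_instance_base` is hypothetical, and truncating the number of removed types to an integer is an assumption; lower scores are taken to mean more exceptional, which matches the criteria only after a suitable choice of score.

```python
def edit_instance_base(types, scores, fraction):
    """Remove the given fraction of instance types with the lowest scores."""
    n_remove = int(len(types) * fraction)  # assumption: truncate to a whole number
    order = sorted(range(len(types)), key=lambda i: scores[i])
    removed = set(order[:n_remove])
    return [t for i, t in enumerate(types) if i not in removed]

# Toy instance base of ten types with made-up exceptionality scores.
types = list("abcdefghij")
scores = [5, 1, 9, 0, 7, 3, 8, 2, 6, 4]
edited = edit_instance_base(types, scores, 0.2)
print(edited)
```

Calling the function with the fractions 0.01, 0.02, 0.05, and 0.10 would produce the four increasingly edited instance bases per criterion.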
4 Three estimations of 
exceptionality 
We investigate three methods for estimating the 
(degree of) exceptionality of instance types: typ-
icality, class prediction strength, and friendly-
neighbourhood size.
4.1 Typicality 
In its common meaning, "typicality" denotes
roughly the opposite of exceptionality; atypicality
can be said to be a synonym of exceptionality. We
adopt a definition from (Zhang, 1992), who proposes
a typicality function. Zhang computes typicalities
of instance types by taking both their feature values
and their classifications into account (Zhang, 1992).
He adopts the notions of intra-concept similarity and
inter-concept similarity (Rosch and Mervis, 1975) to
do this. First, Zhang introduces a distance func-
tion similar to Equation 1, in which $W(f_i) = 1.0$
for all features (i.e., flat Euclidean distance rather
than information-gain weighted distance), in which
the distance between two instances X and Y is nor-
malised by dividing the summed squared distance by
n, the number of features, and in which $\delta(x_i, y_i)$ is
given as Equation 2. The normalised distance func-
tion used by Zhang is given in Equation 6.
$$\Delta(X, Y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\delta(x_i, y_i)\right)^2} \qquad (6)$$
The intra-concept similarity of instance X with
classification C is its similarity (i.e., 1 - distance)
with all instances in the data set with the same clas-
sification C: this subset is referred to as X's family,
Fam(X). Equation 7 gives the intra-concept simi-
larity function Intra(X) ($|Fam(X)|$ being the num-
ber of instances in X's family, and $Fam(X)^i$ the ith
instance in that family).
$$Intra(X) = \frac{1}{|Fam(X)|} \sum_{i=1}^{|Fam(X)|} \left(1.0 - \Delta(X, Fam(X)^i)\right) \qquad (7)$$
All remaining instances belong to the subset of un-
related instances, Unr(X). The inter-concept simi-
larity of an instance X, Inter(X), is given in Equa-
tion 8 (with $|Unr(X)|$ being the number of instances
unrelated to X, and $Unr(X)^i$ the ith instance in
that subset).
$$Inter(X) = \frac{1}{|Unr(X)|} \sum_{i=1}^{|Unr(X)|} \left(1.0 - \Delta(X, Unr(X)^i)\right) \qquad (8)$$
The typicality of an instance X, Typ(X), is the quo-
tient of X's intra-concept similarity and X's inter-
concept similarity, as given in Equation 9.
$$Typ(X) = \frac{Intra(X)}{Inter(X)} \qquad (9)$$
An instance type is typical when its intra-concept
similarity is larger than its inter-concept similar-
ity, which results in a typicality larger than 1.
An instance type is atypical when its intra-concept 
similarity is smaller than its inter-concept similar- 
ity, which results in a typicality between 0 and 1. 
Around typicality value 1, instances cannot be sen- 
sibly called typical or atypical; (Zhang, 1992) refers 
to such instances as boundary instances. 
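A minimal Python sketch of the typicality computation follows. It assumes that the normalised distance is the square root of the mean squared per-feature overlap distance, and that an instance belongs to its own family; the names `zhang_distance` and `typicality`, the toy data, and the handling of a zero inter-concept similarity (returning infinity) are hypothetical choices.

```python
import math

def zhang_distance(x, y):
    """Normalised flat Euclidean distance over symbolic features (range 0 to 1)."""
    mismatches = sum((0 if a == b else 1) ** 2 for a, b in zip(x, y))
    return math.sqrt(mismatches / len(x))

def typicality(index, instances):
    """Quotient of intra-concept and inter-concept similarity for one instance."""
    x, label = instances[index]
    family = [f for f, c in instances if c == label]     # includes x itself
    unrelated = [f for f, c in instances if c != label]
    intra = sum(1.0 - zhang_distance(x, f) for f in family) / len(family)
    if not unrelated:
        return math.inf  # assumption: no unrelated instances means maximally typical
    inter = sum(1.0 - zhang_distance(x, u) for u in unrelated) / len(unrelated)
    return intra / inter if inter > 0 else math.inf

# Toy data: "bbc" carries class P but resembles the Q instances.
instances = [("aaa", "P"), ("aab", "P"), ("bbb", "Q"), ("bba", "Q"), ("bbc", "P")]
typical_value = typicality(0, instances)
atypical_value = typicality(4, instances)
print(typical_value, atypical_value)
```

In this toy set the first instance comes out typical (value above 1) and the last one atypical (value below 1).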
In our experiments, we compute the typicality of 
all instance types in the training set, order them 
on their typicality, and remove 1%, 2%, 5%, and 
10% of the instance types with the lowest typicality, 
i.e., the most atypical instance types. In addition to 
these four experiments, we performed an additional 
eight experiments using the same percentages, and 
editing on the basis of (i) instance types' typicality
(by ordering them in reverse order) and (ii) their in-
difference towards typicality or atypicality (i.e., the
closeness of their typicality to 1.0, by ordering them
in order of the absolute difference between their typ-
icality and 1.0). The experiments with removing typ-
ical and boundary instance types provide interesting
comparisons with the more intuitive editing of atyp- 
ical instance types. 
Table 2 provides examples of four atypical, bound- 
ary, and typical instance types found in the train-
ing set. Globally speaking, (i) the set of atypical
instances tends to contain foreign spellings of loan
words; (ii) there is no clear characteristic of bound-
ary instances; and (iii) 'certain' pronunciations, i.e.,
instance types with high typicality values, often in-
volve instance types of which the middle letters are
at the beginning of words or immediately following
a hyphen, or high-frequency instance types, or in-
stance types mapping to a low-frequency class that
always occurs with a certain spelling (class frequency
is not accounted for in Zhang's metric).
4.2 Class-prediction strength
A second estimate of exceptionality is to measure 
how well an instance type predicts the class of 
all instance types within the training set (includ- 
ing itself). Several functions for computing class- 
prediction strength have been proposed, e.g., as a 
criterion for removing instances in memory-based 
(k-NN) learning algorithms, such as IB3 (Aha, Ki-
bler, and Albert, 1991) (cf. earlier work on edited
k-NN (Wilson, 1972; Voisin and Devijver, 1987));
or for weighting instances in the EACH algorithm
(Salzberg, 1990; Cost and Salzberg, 1993). We chose
to implement the straightforward class-prediction
strength function as proposed in (Salzberg, 1990)
in two steps. First, we count (a) the number of
times that the instance type is the nearest neigh-
bour of another instance type, and (b) the number
of times that, when the instance type is a near-
est neighbour of another instance type, the classes
of the two instances match. Second, the instance's 
class-prediction strength is computed by taking the 
ratio of (b) over (a). An instance type with class- 
prediction strength 1.0 is a perfect predictor of its 
own class; a class-prediction strength of 0.0 indicates 
that the instance type is a bad predictor of classes 
of other instances, presumably indicating that the 
instance type is exceptional. 
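The two counting steps can be sketched in Python as follows. This is a simplified reading rather than the implementation used in the experiments: every instance type is matched against the whole memory including itself, distance ties are resolved in favour of the higher-frequency type, and a type that is never a nearest neighbour is assigned strength 0.0. These three choices, like the function names and the toy data, are assumptions.

```python
def distance(x, y, weights):
    """Weighted overlap distance between two symbolic feature vectors."""
    return sum(w * (0 if a == b else 1) for w, a, b in zip(weights, x, y))

def class_prediction_strengths(types, weights):
    """Return one strength per instance type; types are (features, class, frequency)."""
    times_nearest = [0] * len(types)  # count (a)
    times_correct = [0] * len(types)  # count (b)
    for features, label, _ in types:
        # Nearest type: smallest distance, then highest frequency on ties.
        best = min(range(len(types)),
                   key=lambda j: (distance(features, types[j][0], weights),
                                  -types[j][2]))
        times_nearest[best] += 1
        if types[best][1] == label:
            times_correct[best] += 1
    return [b / a if a else 0.0 for a, b in zip(times_nearest, times_correct)]

# Toy memory: the second type is a minority ambiguity of the first.
memory = [(("a", "b"), "X", 5), (("a", "b"), "Y", 1), (("a", "c"), "X", 2)]
strengths = class_prediction_strengths(memory, [1, 1])
print(strengths)
```

In the toy memory the minority ambiguity is overshadowed by its higher-frequency counterpart even when it is matched against itself, so it ends up with the lowest possible strength.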
We computed the class-prediction strength of all 
instance types in the training set, ordered the in- 
stance types according to their strengths, and cre- 
ated edited training sets with 1%, 2%, 5%, and 
10% of the instance types with the lowest class 
prediction strength removed, respectively. In Ta-
ble 3, four sample instance types are displayed
which have class-prediction strength 0.0, i.e., the
lowest possible strength. They are never a correct
nearest-neighbour match, since they all have higher-
frequency counterpart types with the same feature
values. For example, the letter sequence __algo oc-
curs in two types, one associated with the pronun-
ciation /ˈæ/ (viz., primary-stressed /æ/, or 1æ in
our labelling), as in algorithm and algorithms; the
other associated with the pronunciation /ˌæ/ (viz.,
secondary-stressed /æ/, or 2æ), as in algorithmic.
The latter instance type occurs less frequently than 
the former, which is the reason that the class of the 
former is preferred over the latter. Thus, an am- 
biguous type with a minority class (a minority am- 
biguity) can never be a correct predictor, not even 
atypical instance types           | boundary instance types           | typical instance types
feature values  class  typicality | feature values  class  typicality | feature values  class  typicality
ureaucr         0OU    0.428      | cheques         0ks    1.000      | __oiff          l:)z   7.338
freudia         0ar    0.442      | elgium_         0-     1.000      | etectio         0kf    8.452
_tissue         0f     0.458      | laby__          0va    1.000      | ow-by-b         0b     9.130
_czech          0-     0.542      | manna__         0-     1.000      | ng-iron         2van   12.882
Table 2: Examples of atypical (left), boundary (middle), and typical (right) instance types in the training set.
For each instance (seven letters and a class mapping to the middle letter), its typicality value is given.
feature values  class  cps
__algo          2æ     0.0
ck-benc         1b     0.0
erby__          0m     0.0
reface_         0ez    0.0
Table 3: Examples of instance types with the lowest
possible class prediction strength (cps) 0.0.
for itself, when using IB1-IG as a classifier, which
always prefers high frequency over low frequency in
case of ties.
4.3 Friendly-neighbourhood size
A third estimate for the exceptionality of instance
types is counting by how many nearest neighbours of
the same class an instance type is surrounded in in-
stance space. Given a training set of instance types,
for each instance type a ranking can be made of all of
its nearest neighbours, ordered by their distance to
the instance type. The number of nearest-neighbour
instance types in this ranking with the same class,
henceforth referred to as the friendly-neighbourhood
size, may range between 0 and the total number of
instance types of the same class. When the friendly
neighbourhood is empty, the instance type only has
nearest neighbours of different classes. The argu-
mentation to regard a small friendly neighbourhood
as an indication of an instance type's exceptionality
follows from the same argumentation as used with
class-prediction strength: when an instance type has
nearest neighbours of different classes, it is vice versa
a bad predictor for those classes. Thus, the smaller
an instance type's friendly neighbourhood, the more
it could be regarded exceptional.
To illustrate the computation of friendly-
neighbourhood size, Table 4 lists four examples of
possible nearest-neighbour rankings (truncated at
ten nearest neighbours) with their respective num-
ber of friendly neighbours. The Table shows that
the number of friendly neighbours is the number of
similarly-labeled instances counted from left to right
in the ranking, until a dissimilarly-labeled instance
occurs.
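The left-to-right counting rule can be written as a short Python sketch. The function name `friendly_neighbourhood_size` and the encoding of a ranking as a string of `o` (match) and `x` (mismatch) symbols are hypothetical; the four rankings follow the match and mismatch patterns of Table 4, with distances omitted.

```python
def friendly_neighbourhood_size(ranking):
    """Count same-class neighbours from the top of the ranking up to the first mismatch.

    ranking is a sequence of booleans ordered by increasing distance,
    True meaning the neighbour has the same class as the target.
    """
    size = 0
    for same_class in ranking:
        if not same_class:
            break
        size += 1
    return size

# Match (o) and mismatch (x) patterns of the four rankings in Table 4.
patterns = ["oxooooxxxx", "ooooooooox", "xoooooxxxx", "oooxxxxxxo"]
sizes = [friendly_neighbourhood_size([c == "o" for c in p]) for p in patterns]
print(sizes)
```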
feature values  class  fns
__edib          2ɛ     0
__edib          1ɛ     0
echnocr         1n     0
soiree_         0r     0
Table 5: Examples of instance types with the lowest
possible friendly-neighbourhood size (fns) 0, i.e., no friendly neighbours.
Friendly-neighbourhood size and class-prediction
strength are related functions, but differ in their
treatment of class ambiguity. As stated above, in-
stance types may receive a class-prediction strength
of 0.0 when they are minority ambiguities. Counting
a friendly neighbourhood does not take class ambi-
guity into account; each of a set of ambiguous types
necessarily has no friendly neighbours, since they are
each other's nearest neighbours with different classes.
Thus, friendly-neighbourhood size does not discrim-
inate between minority and majority ambiguities. In
Table 5, four sample instance types are listed with
friendly-neighbourhood size 0. While some of these
instance types without friendly neighbours in the
training set (perhaps with friendly neighbours in the
test set) are minority ambiguities (e.g., __edib 2ɛ),
others are majority ambiguities (e.g., __edib 1ɛ),
while others are not ambiguous at all but simply
have a nearest neighbour at some distance with a
different class (e.g., soiree_ 0r).
5 Results 
Figure 1 displays the generalisation accuracies in
terms of incorrectly classified test instances obtained
with all performed experiments. The leftmost point
in the Figure, from which all lines originate, indi-
cates the performance of IB1-IG when trained on
the full data set of 222,601 types, viz. 6.42% in-
correctly classified test instances (when computed
at the level of whole test words, IB1-IG
pronounces 64.61% of all test words flawlessly).
The line graph representing the four experiments
in which instance types are removed randomly can
be seen as the baseline graph. It can be expected
nearest neighbour rank
1    2    3    4    5    6    7    8    9    10   fn
o1   ×2   o3   o3   o3   o4   ×4   ×5   ×5   ×6   1
o1   o1   o1   o1   o1   o1   o2   o2   o3   ×4   9
×2   o2   o2   o2   o2   o2   ×3   ×3   ×3   ×4   0
o1   o1   o1   ×3   ×4   ×4   ×4   ×4   ×5   o6   3
Table 4: Four examples of nearest-neighbour rankings and their respective numbers of friendly neighbours
(fn). Each ranked nearest neighbour is identified by its match (o) or mismatch (×) with the target instance
the ranking is computed for, and a number denoting its distance to the target instance.
that removing instances randomly leads to a degra- 
dation of generalisation performance. The upward 
curve of the line graph denoting the experiments 
with random selection indeed shows degrading per-
formance with increasing numbers of left-out in-
stance types. The relative decrease in generalisation
accuracy is 2.0% when 1% of the training material is 
removed randomly, 3.8% with 2% random removal, 
10.7% with 5% random removal, and 20.7% with 
10% random removal. 
Surprisingly, the only experiments showing lower 
performance degradation than removal by random 
selection are those with class-prediction strength; 
the other criteria for removing exceptional instances 
lead to worse degradations. It does not matter 
whether instance types are removed on grounds of 
their typicality: apparently, a markedly low, neutral, 
or high typicality value indicates that the instance 
type is (on average) important, rather than remov- 
able. The same applies to friendly-neighbourhood 
size: instances with small neighbourhood sizes ap- 
pear to contribute significantly to performance on 
test material. It is remarkable that the largest er- 
rors with 1% and 2% removal are obtained with 
the friendly-neighbourhood size criterion: it appears 
that on average, the instances with few or no nearest 
neighbours are important in the classification of test 
material. 
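The friendly-neighbour counts listed in Table 4 can be reproduced with a few lines of Python. The sketch below assumes that friendly-neighbourhood size is the number of same-class neighbours ranked before the first mismatching neighbour; this reading is inferred from the four rows of Table 4 and is not a quoted definition. The names `friendly_neighbours` and `parse` are illustrative only.

```python
from typing import List, Tuple

# A ranked neighbour: (matches_target_class, distance_to_target).
Neighbour = Tuple[bool, int]


def friendly_neighbours(ranking: List[Neighbour]) -> int:
    """Count same-class neighbours ranked before the first mismatch.

    Assumption: this interpretation is inferred from Table 4.
    """
    count = 0
    for matches, _distance in ranking:
        if not matches:
            break
        count += 1
    return count


def parse(row: str) -> List[Neighbour]:
    """Parse a Table 4 row such as 'o1 x2 o3' (o = match, x = mismatch)."""
    return [(token[0] == "o", int(token[1:])) for token in row.split()]


TABLE_4 = [
    "o1 x2 o3 o3 o3 o4 x4 x5 x5 x6",
    "o1 o1 o1 o1 o1 o1 o2 o2 o3 x4",
    "x2 o2 o2 o2 o2 o2 x3 x3 x3 x4",
    "o1 o1 o1 x3 x4 x4 x4 x4 x5 o6",
]

fn_values = [friendly_neighbours(parse(row)) for row in TABLE_4]
print(fn_values)  # [1, 9, 0, 3], the last column of Table 4
```

Note that in the third row the mismatching neighbour shares distance 2 with several matching ones; under the rank-based reading assumed here the count is still 0, in agreement with the table.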
When using class-prediction strength as removal 
criterion, performance does not degrade until about 
5% of the instance types with the lowest strength 
are removed from memory. The reason is that
class-prediction strength is the only criterion that detects
minority ambiguities, i.e., instance types with pre- 
diction strength 0.0, that cannot contribute to classi- 
fication since they are always overshadowed by their 
counterpart instance types with majority classes, 
even for their own classification. In the training set,
9,443 instance types are minority ambiguities, i.e., 
4.2% of the instance types (accounting for 3.8% of 
the instance tokens in the original token set). 
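Minority ambiguities can also be found directly, without computing class-prediction strength. The following is a minimal sketch under stated assumptions: an instance is taken to be a pair of a feature-value tuple and a class label, the majority class of a feature pattern is the one with the highest token frequency, and ties are resolved in favour of the class seen first. The tie rule and the function name `minority_ambiguities` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# An instance: (feature values, class label).
Instance = Tuple[Tuple[str, ...], str]


def minority_ambiguities(tokens: Iterable[Instance]) -> Set[Instance]:
    """Return instance types whose feature values also occur with a
    more frequent (majority) class."""
    type_counts = Counter(tokens)  # instance type -> token frequency
    by_features: Dict[Tuple[str, ...], List[Tuple[str, int]]] = defaultdict(list)
    for (features, label), freq in type_counts.items():
        by_features[features].append((label, freq))

    minority: Set[Instance] = set()
    for features, labelled in by_features.items():
        if len(labelled) < 2:
            continue  # unambiguous feature pattern
        # Assumption: on a frequency tie, the first class seen is the majority.
        majority_label = max(labelled, key=lambda pair: pair[1])[0]
        for label, _freq in labelled:
            if label != majority_label:
                minority.add((features, label))
    return minority


# Toy tokens (hypothetical data, not taken from the training set).
toy_tokens = [
    (("a", "b", "c"), "X"),
    (("a", "b", "c"), "X"),
    (("a", "b", "c"), "Y"),  # minority ambiguity: same features, rarer class
    (("a", "b", "d"), "Y"),
]
found = minority_ambiguities(toy_tokens)
print(found)
```

Removing the returned types from the instance base leaves one class per ambiguous feature pattern, which is the second performance-preserving reduction discussed in the text.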
Thus, among the tested methods for reducing 
the memory needed for storing an instance base in 
memory-based learning, only two relatively trivial 
methods are performance-preserving while account- 
ing for a substantial reduction in the amount of 
memory needed by IB1-IG:
1. Replacing instance tokens by instance types ac- 
counts for a reduction of about 63% of mem- 
ory needed to store instances, excluding the 
memory needed to store frequency information. 
When frequency information is stored in two 
bytes per instance type, the memory reduction 
is about 54%. 
2. Removing instance types that are minority
ambiguities on top of the type/token reduction
accounts only for an additional memory reduction
of 2%, i.e., for a total memory reduction
of 65%, or 56% with two-byte frequency
information stored per instance type.
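The four percentages in this list are mutually consistent, as a short calculation shows. The sketch below assumes that each stored instance occupies 8 bytes; that value is an assumption chosen because it reproduces the reported figures, and it is not stated in this section. The type/token ratio of 0.37 follows from the reported reduction of about 63%, and 4.2% is the reported share of minority ambiguities among instance types.

```python
def memory_reduction(kept_fraction: float, instance_bytes: int,
                     freq_bytes: int = 0) -> float:
    """Fraction of the token-based memory that is saved when only
    `kept_fraction` of the instances is stored, each with an optional
    frequency field of `freq_bytes` bytes."""
    return 1.0 - kept_fraction * (instance_bytes + freq_bytes) / instance_bytes


TYPE_TOKEN_RATIO = 0.37  # implied by the reported reduction of about 63%
INSTANCE_BYTES = 8       # ASSUMPTION: not given in the text
FREQ_BYTES = 2           # two-byte frequency field, as in the text
MINORITY_SHARE = 0.042   # 4.2% of instance types are minority ambiguities

kept_after_editing = TYPE_TOKEN_RATIO * (1.0 - MINORITY_SHARE)

types_only = memory_reduction(TYPE_TOKEN_RATIO, INSTANCE_BYTES)
types_with_freq = memory_reduction(TYPE_TOKEN_RATIO, INSTANCE_BYTES, FREQ_BYTES)
edited_only = memory_reduction(kept_after_editing, INSTANCE_BYTES)
edited_with_freq = memory_reduction(kept_after_editing, INSTANCE_BYTES, FREQ_BYTES)

for name, value in [
    ("types, no frequencies", types_only),
    ("types, with frequencies", types_with_freq),
    ("types minus minority ambiguities", edited_only),
    ("same, with frequencies", edited_with_freq),
]:
    print(f"{name}: {round(100 * value)}%")
```

Under these assumptions the rounded results are 63%, 54%, 65%, and 56%, matching the figures given in the two list items.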
6 Discussion and future research 
As previous research has suggested (Daelemans,
1996; Daelemans, Van den Bosch, and Weijters,
1997a; Van den Bosch, 1997), keeping full memory
in memory-based learning of word pronunciation
strongly appears to yield optimal generalisation
accuracy. The experiments in this paper show that
optimisation of memory use in memory-based learning
while preserving generalisation accuracy can only be
performed by (i) replacing instance tokens by
instance types with frequency information, and (ii)
removing minority ambiguities. Both optimisations
can be performed straightforwardly; minority
ambiguities can be traced with less effort than by using
class-prediction strength. Our implementation of
IB1-IG described in (Daelemans and Van den Bosch,
1992; Daelemans, Van den Bosch, and Weijters,
1997b) already makes use of this knowledge, albeit
partially (it stores class distributions with
letter-window types).
Our results also show that atypicality, non-typic- 
ality, and typicality (Zhang, 1992), and friendly- 
neighbourhood size are all estimates of exception- 
ality that indicate the importance of instance types 
for classification, rather than their removability. As 
far as these estimates of exceptionality are viable,
our results suggest that exceptions should be kept 
in memory and not be thrown away. 
[Figure 1: line graph. Vertical axis: generalisation error in % (ticks from 6.0 to 12.0). Horizontal axis:
number of removed instance types (ticks at 5000, 10000, 15000, 20000). Lines: atypical, typical,
non-typical, small neighbourhood, low prediction strength, random.]
Figure 1: Generalisation errors (percentages of incorrectly classified test instances) of IB1-IG, with increased
numbers of edited instances, according to the tested exceptionality criteria atypical, typical, boundary,
small neighbourhood, low prediction strength, and random selection. Performances, denoted by points, are
measured when 1%, 2%, 5%, and 10% of the most exceptional instance types are edited.
Lazy vs. eager; not stable vs. unstable 
From the results in this paper and those reported
earlier (Daelemans, Van den Bosch, and Weijters,
1997a; Van den Bosch, 1997), it appears that no
compromise can be made on memory-based learning
in terms of abstraction by forgetting without
losing generalisation accuracy. Consistently lower
performances are obtained with algorithms that forget
by constructing decision trees or connectionist
networks, or by editing instance types. Generalisation
accuracy appears to be related to the dimension
lazy-eager learning; for the GS task (and for many other
language tasks, (Daelemans, Van den Bosch, and
Weijters, 1997a)), it is demonstrated that
memory-based lazy learning leads to the best generalisation
accuracies.
Another explanation for the difference in
performance between decision-tree, connectionist, and
editing methods versus pure memory-based learning
is that the former generally display high variance,
which is the portion of the generalisation error
caused by the instability of the learning algorithm
(Breiman, 1996a). An algorithm is unstable when
small perturbations in the learning material lead to
large differences in induced models, and stable
otherwise; pure memory-based learning algorithms are
said to be very stable, and decision-tree algorithms
and connectionist learning to be unstable (Breiman,
1996a). High variance is usually coupled with low
bias (i.e., unstable learning algorithms with high
variance tend to have few limitations in the freedom
to approximate the task or function to be learned)
(Breiman, 1996b). Breiman points out that often
the opposite also holds: a stable classifier with a
low variance can display a high bias when it cannot
represent data adequately in its available set of
models, but it is not clear whether or how this
applies to pure memory-based learning as in IB1-IG;
its success in representing the GS data and other
language tasks quite adequately would rather
suggest that IB1-IG has both low variance and low bias.
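The notion of stability used above can be made concrete with a small experiment: train on slightly perturbed versions of the learning material and count how often predictions change. The sketch below is an illustration under stated assumptions, not the procedure of the cited work. It uses a plain 1-nearest-neighbour classifier with an unweighted overlap metric (so it is not IB1-IG, which weights features by information gain), perturbs the training set by leaving out a random share of instances, and runs on hypothetical toy data. All names are illustrative.

```python
import random
from typing import List, Sequence, Tuple

# An instance: (feature values, class label).
Instance = Tuple[Tuple[str, ...], str]


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of mismatching feature values."""
    return sum(x != y for x, y in zip(a, b))


def classify_1nn(memory: List[Instance], features: Sequence[str]) -> str:
    """Class of the nearest stored instance (first one wins on ties)."""
    return min(memory, key=lambda inst: overlap_distance(inst[0], features))[1]


def instability(train: List[Instance], test: List[Tuple[str, ...]],
                drop: float = 0.2, runs: int = 20, seed: int = 0) -> float:
    """Mean fraction of test predictions that differ from the
    full-memory predictions when a share `drop` of the training
    instances is left out at random."""
    rng = random.Random(seed)
    reference = [classify_1nn(train, x) for x in test]
    keep_n = max(1, round(len(train) * (1.0 - drop)))
    changed = 0
    for _ in range(runs):
        # Keep the original order so that tie-breaking stays comparable.
        kept = sorted(rng.sample(range(len(train)), keep_n))
        perturbed = [train[i] for i in kept]
        predictions = [classify_1nn(perturbed, x) for x in test]
        changed += sum(p != r for p, r in zip(predictions, reference))
    return changed / (runs * len(test))


# Hypothetical toy data.
TRAIN = [
    (("a", "b", "c"), "X"), (("a", "b", "d"), "X"), (("a", "e", "c"), "X"),
    (("f", "g", "h"), "Y"), (("f", "g", "c"), "Y"), (("f", "b", "h"), "Y"),
]
TEST = [("a", "b", "h"), ("f", "g", "d"), ("a", "g", "c")]

score = instability(TRAIN, TEST)
print(f"share of changed predictions: {score:.2f}")
```

A value near 0 indicates a stable learner in the above sense; running the same measurement with a different learner on the same data would allow a like-for-like comparison.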
Apart from the possibility that the lazy and eager
learning algorithms investigated here and in earlier
work do not have a strongly contrasting bias, we
conjecture that the editing methods discussed here, and
some specific decision-tree learning algorithms
investigated earlier (i.e., IGTREE (Daelemans, Van den
Bosch, and Weijters, 1997b), a decision-tree learning
algorithm that is an approximate optimisation
of IB1-IG) have a similar variance to that of
IB1-IG; they are virtually as stable as IB1-IG. We base
this conjecture on the fact that the standard
deviations of both decision-tree learning and
memory-based learning trained and tested on the GS data are
not only very small (in the order of tenths of a percent),
but also hardly different (cf. (Van den Bosch, 1997)
for details and examples). Only connectionist
networks trained with back-propagation and
decision-tree learning with pruning display larger standard
deviations when accuracies are averaged over
experiments (Van den Bosch, 1997); the stable-unstable
dimension might play a role there, but not in the
difference between pure memory-based learning and
edited memory-based learning.
Future research 
The results of the present study suggest that 
the following questions be investigated in future re- 
search: 
• The tested criteria for editing can be employed
as instance weights as in EACH (Salzberg,
1990) and PEBLS (Cost and Salzberg, 1993),
rather than as criteria for instance removal.
Instance weighting, preserving pure
memory-based learning, may add relevant information
to similarity matching, and may improve
IB1-IG's performance.
• Different data sets of different sizes may
contain different portions of atypical instances or
minority ambiguities. Moreover, data sets may
contain pure noise. While atypical or
exceptional instances may (and do) return in test
material, the chance that noise returns is
relatively minute. Our results generalise to data
sets with approximately the characteristics of
the GS data set. Although there are
indications that data sets representing other language
tasks indeed share some essential characteristics
(e.g., memory-based learning is consistently the
best-performing algorithm), more investigation
is needed to make these characteristics explicit.
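The first question above, using exceptionality estimates as instance weights rather than as removal criteria, can be illustrated as follows. This is a sketch loosely in the spirit of instance-weighted nearest-neighbour classification, not a reimplementation of EACH or PEBLS: it assumes that each stored instance carries a weight of at least 1.0 by which its distance is multiplied, so that less reliable instances appear further away. How such weights would be derived from typicality, class-prediction strength, or friendly-neighbourhood size is left open; the weights and data below are hypothetical.

```python
from typing import List, Sequence, Tuple

# A weighted instance: (feature values, class label, weight >= 1.0).
WeightedInstance = Tuple[Tuple[str, ...], str, float]


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of mismatching feature values."""
    return sum(x != y for x, y in zip(a, b))


def classify_weighted(memory: List[WeightedInstance],
                      features: Sequence[str]) -> str:
    """Class of the stored instance with the smallest weighted distance.

    Assumption: the weight multiplies the distance, so an instance with
    a high weight needs a closer match to win.
    """
    best = min(memory,
               key=lambda inst: inst[2] * overlap_distance(inst[0], features))
    return best[1]


# Hypothetical memory: the second instance is treated as less reliable.
memory = [
    (("a", "b", "c"), "X", 1.0),
    (("a", "b", "d"), "Y", 3.0),
]

near_y = classify_weighted(memory, ("a", "x", "d"))   # weighted: 2.0 vs 3.0
exact_y = classify_weighted(memory, ("a", "b", "d"))  # weighted: 1.0 vs 0.0
print(near_y, exact_y)
```

In this toy case the query `("a", "x", "d")` is closer to the second instance in unweighted overlap, yet the weight lets the first instance win, while an exact match with the second instance is still classified by it. All instances stay in memory, which is the property the text refers to as preserving pure memory-based learning.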
Acknowledgements 
We thank the members of the ILK group, Ton Weij- 
ters, and Eric Postma for fruitful discussions, and 
the anonymous reviewers for relevant comments and 
suggestions. 

References 
Aha, D. W., editor. 1997. Lazy learning. Dordrecht:
Kluwer Academic Publishers. Reprinted from:
Artificial Intelligence Review, 11:1-5.
Aha, D. W., D. Kibler, and M. Albert. 1991. 
Instance-based learning algorithms. Machine 
Learning, 7:37-66. 
Breiman, L. 1996a. Bagging predictors. Machine 
Learning, 24(2). 
Breiman, L. 1996b. Bias, variance and arcing
classifiers. Technical Report 460, University of
California, Statistics Department, Berkeley, CA.
Burnage, G., 1990. CELEX: A guide for users. Cen- 
tre for Lexical Information, Nijmegen. 
Cost, S. and S. Salzberg. 1993. A weighted near- 
est neighbor algorithm for learning with symbolic 
features. Machine Learning, 10:57-78. 
Cover, T. M. and P. E. Hart. 1967. Nearest neighbor
pattern classification. Institute of Electrical
and Electronics Engineers Transactions on
Information Theory, 13:21-27.
Daelemans, W. 1996. Abstraction considered harmful:
lazy learning of language processing. In H. J.
Van den Herik and A. Weijters, editors,
Proceedings of the Sixth Belgian-Dutch Conference on
Machine Learning, pages 3-12, Maastricht, The
Netherlands. MATRIKS.
Daelemans, W., S. Gillis, and G. Durieux. 1994.
The acquisition of stress: a data-oriented
approach. Computational Linguistics, 20(3):421-451.
Daelemans, W. and A. Van den Bosch. 1992.
Generalisation performance of backpropagation learning
on a syllabification task. In M. F. J. Drossaers
and A. Nijholt, editors, TWLT3: Connectionism
and Natural Language Processing, pages 27-37,
Enschede. Twente University.
Daelemans, W., A. Van den Bosch, and A. Weijters.
1997a. Empirical learning of natural language
processing tasks. Lecture Notes in Artificial
Intelligence, number 1224, pages 337-344.
Berlin: Springer-Verlag.
Daelemans, W., A. Van den Bosch, and A. Weijters.
1997b. IGTree: using trees for classification
in lazy learning algorithms. Artificial Intelligence
Review, 11:407-423.
Devijver, P. A. and J. Kittler. 1982. Pattern
recognition. A statistical approach. London, UK:
Prentice-Hall.
Dietterich, T. G., H. Hild, and G. Bakiri. 1990. A 
comparison of ID3 and backpropagation for En- 
glish text-to-speech mapping. Technical Report 
90-20-4, Oregon State University. 
Kolodner, J. 1993. Case-based reasoning. San Mateo,
CA: Morgan Kaufmann.
Mitchell, T. 1997. Machine learning. New York, 
NY: McGraw Hill. 
Quinlan, J. R. 1986. Induction of decision trees.
Machine Learning, 1:81-106.
Quinlan, J. R. 1993. C4.5: Programs for Machine
Learning. San Mateo, CA: Morgan Kaufmann.
Rosch, E. and C. B. Mervis. 1975. Family
resemblances: studies in the internal structure of
categories. Cognitive Psychology, 7:??-??
Rumelhart, D. E., G. E. Hinton, and R. J. Williams.
1986. Learning internal representations by error
propagation. In D. E. Rumelhart and J. L.
McClelland, editors, Parallel Distributed Processing:
Explorations in the Microstructure of Cognition.
Cambridge, MA: The MIT Press, pages 318-362.
Salzberg, S. 1990. Learning with nested generalised
exemplars. Norwell, MA: Kluwer Academic
Publishers.
Sejnowski, T. J. and C. S. Rosenberg. 1987. Parallel
networks that learn to pronounce English text.
Complex Systems, 1:145-168.
Shavlik, J. W., R. J. Mooney, and G. G. Towell. 
1991. Symbolic and neural learning algorithms: 
An experimental comparison. Machine Learning, 
6:111-143. 
Stanfill, C. and D. Waltz. 1986. Toward
memory-based reasoning. Communications of the ACM,
29(12):1213-1228.
Van den Bosch, A. 1997. Learning to pronounce
written words, a study in inductive language
learning. Ph.D. thesis, Universiteit Maastricht.
Van den Bosch, A., W. Daelemans, and A. Weijters.
1996. Morphological analysis as classification: an
inductive-learning approach. In K. Oflazer and
H. Somers, editors, Proceedings of the Second
International Conference on New Methods in
Natural Language Processing, NeMLaP-2, Ankara,
Turkey, pages 79-89.
Voisin, J. and P. A. Devijver. 1987. An applica- 
tion of the Multiedit-Condensing technique to the 
reference selection problem in a print recognition 
system. Pattern Recognition, 5:465-474. 
Wettschereck, D. 1995. A study of distance-based
machine learning algorithms. Ph.D. thesis,
Oregon State University.
Wettschereck, D., D. W. Aha, and T. Mohri. 1997.
A review and empirical evaluation of feature
weighting methods for a class of lazy learning
algorithms. Artificial Intelligence Review, 11:273-314.
Wilson, D. 1972. Asymptotic properties of
nearest neighbor rules using edited data. Institute of
Electrical and Electronic Engineers Transactions
on Systems, Man and Cybernetics, 2:408-421.
Wolpert, D. H. 1990. Constructing a generalizer
superior to NETtalk via a mathematical theory of
generalization. Neural Networks, 3:445-452.
Zhang, J. 1992. Selecting typical instances in
instance-based learning. In Proceedings of the
International Machine Learning Conference 1992,
pages 470-479.
