In: Proceedings of CoNLL-2000 and LLL-2000, pages 103-106, Lisbon, Portugal, 2000. 
Genetic Algorithms for Feature Relevance Assignment in 
Memory-Based Language Processing 
Anne Kool and Walter Daelemans and Jakub Zavrel* 
CNTS - Language Technology Group 
University of Antwerp, UIA, Universiteitsplein 1, 2610 Antwerpen, Belgium {kool, daelem, zavrel}@uia.ua.ac.be 
Abstract 
We investigate the usefulness of evolutionary al- 
gorithms in three incarnations of the problem of 
feature relevance assignment in memory-based 
 processing (MBLP): feature weight- 
ing, feature ordering and feature selection. We 
use a simple genetic algorithm (GA) for this 
problem on two typical tasks in natural lan- 
guage processing: morphological synthesis and 
unknown word tagging. We find that GA fea- 
ture selection always significantly outperforms 
the MBLP variant without selection and that 
feature ordering and weighting with CA signifi- 
cantly outperforms a situation where no weight- 
ing is used. However, GA selection does not sig- 
nificantly do better than simple iterative feature 
selection methods, and GA weighting and order- 
ing reach only similar performance as current 
information-theoretic feature weighting meth- 
ods. 
1 Memory-Based Language 
Processing 
Memory-Based Language Processing (Daele- 
mans, van den Bosch, and Zavrel, 1999) is 
based on the idea that  acquisition 
should be seen as the incremental storage of 
exemplars of specific tasks, and  pro- 
cessing as analogical reasoning on the basis of 
these stored exemplars. These exemplars take 
the form of a vector of, typically, nominal fea- 
tures, describing a linguistic problem and its 
context, and an associated class symbol repre- 
senting the solution to the problem. A new in- 
stance is categorized on the basis of its similar- 
ity with a memory instance and its associated 
* Research funded by CELE, S.AI.L Trust V.Z.W., 
Ieper, Belgium. 
class. 
The basic algorithm we use to calculate the dis- 
tance between two items is a variant of IB1 
(Aha, Kibler, and Albert, 1991). IB1 does 
not solve the problem of modeling the differ- 
ence in relevance between the various sources 
of information. In an MBLP approach, this 
can be overcome by means of feature weighting. 
The IBi-IG algorithm uses information gain to 
weight the cost of a feature value mismatch dur- 
ing comparison. IGTREE is a variant in which an 
oblivious decision tree is created with features 
as tests, and in which tests are ordered accord- 
ing to information gain of the associated fea- 
tures. In this case, the accuracy of the trained 
system is very much dependent on a good fea- 
ture ordering. For all variants of MBLP dis- 
cussed here, feature selection can also improve 
both accuracy and efficiency by discarding some 
features altogether because of their irrelevance 
or even counter-productivity in learning to solve 
the task. In our experiments we will use a rel- 
evance assignment method that radically dif- 
fers from information-theoretic measures: ge- 
netic algorithms. 
2 Genetic Algorithms for Assigning 
Relevance 
In the experiments, we linked our memory- 
based learner TIMBL 1 to PGAPACK 2. During the 
weighting experiments a gene corresponds to a 
specific real-valued feature-weight (we will indi- 
cate this by including GA in the algorithm name, 
i.e. IB1-GA and GATREE, cf. IBi-IG and IGTREE). 
1TIMBL is available from http://ilk.kub.nl/ and the 
algorithms are described in more detail in (Daelemans 
et al., 1999). 
2A software environment for evolutionary computa- 
tion developed by D. Levine, Argonne National Labora- 
tory, available from ftp://ftp.mcs.anl.gov/pub/pgapack/ 
103 
In the case of selection the string is composed 
of binary values, indicating presence or absence 
of a feature (we will call this GASEL). The fit- 
ness of the strings is determined by running the 
memory-based learner with each string on a val- 
idation set, and returning the resulting accuracy 
as a fitness value for that string. Hence, both 
weighting and selection with the GA is an in- 
stance of a wrapper approach as opposed to a 
filter approach such as information gain (Ko- 
havi and John, 1995). 
For comparison, we include two popular 
classical wrapper methods: backward elimina- 
tion selection (BASEL) and forward selection 
(FOSEL). Forward selection starts from an 
empty set of features and backward selection 
begins with a full set of features. At each fur- 
ther addition (or deletion, for BASEL) the fea- 
ture with the highest accuracy increase (resp. 
lowest accuracy decrease) is selected, until im- 
provement stalls (resp. performance drops). 
During the morphology experiment the pop- 
ulation size was 50, but for prediction of un- 
known words it was set to 16 because the larger 
dataset was computationally more demanding. 
The populations were evolved for a maximum 
of 200 generations or stopped when no change 
had occurred for over 50 generations. Parame- 
ter settings for the genetic algorithm were kept 
constant: a two-point crossover probability of 
0.85, a mutation rate of 0.006, an elitist replace- 
ment strategy, and tournament selection. 
2.1 Data 
The first task 3 we consider is prediction of what 
diminutive suffix a Dutch noun should take on 
the basis of its form. There are five different 
possible suffix forms (the classes). There are 12 
features which contain information (stress and 
segmental information) about the structure of 
the last three syllables of a noun. The data set 
contains 3949 such instances. 
The second data set 4 is larger and contains 
65275 instances, the task we consider here is 
part-of-speech (morpho-syntactic category) tag- 
ging of unknown words. The features used here 
are the coded POS-tags of two words before and 
two words after the focus word to be tagged, the 
3Data from the CELEX lexical data base, available 
on CD-ROM from the LDC, http://ldc.upenn, edu. 
4This dataset is based on the TOSCA tagged LOB 
corpus of English. 
last three letters of the focus word, and informa- 
tion on hyphenation and capitalisation. There 
are 111 possible classes (part of speech tags) to 
predict. 
2.2 Method 
We have used 10-fold-cross-validation in all ex- 
periments. Because the wrapper methods get 
their evaluation feedback directly from accuracy 
measurements on the data, we further split the 
trainfile for each fold into 2/3 sub-trainset and 
a 1/3 validation set. The settings obtained by 
this are then tested on the test set of that fold. 
2,3 Results 
In Table 1 we show the results of our exper- 
iments (average accuracy and standard devia- 
tion over ten folds). We can see that applying 
any feature selection scheme when no weights 
are used (IB1) significantly improves classifi- 
cation performance (p<0.01) 5. Selection also 
improves accuracy when using the IBI-IG or 
IGTREE algorithm. These differences are sig- 
nificant on the morphology dataset (p<0.05), 
but for the unknown words dataset only the dif- 
ference between (IB1) and (IBi-~-GASEL) is sig- 
nificant (p<0.01). In both cases, however, the 
results in Table 1 do not reveal significant dif- 
ferences between evolutionary, backward or for- 
ward selection. 
With respect to feature weighting by means 
of a GA the results are much less clear: for the 
morphology data, the GA-weights significantly 
improve upon IB1, refered to as IB1-GA in the 
table, (p<0.01) but not IGTREE (GATREE in the 
table). For the other dataset OA-weights do not 
even improve upon IB1. But in general, those 
weights found by the genetic algorithm lead to 
comparable classification accuracy as with gain 
ratio based weighting. The same applies to the 
combination of aA-weights with further selec- 
tion of irrelevant features (GATREE-bGASEL). 
2.4 The Effect of GA Parameters 
We also wanted to test whether the GA would 
benefit from optimisation in the crossover and 
mutation probabilities. To this end, we used 
the morphology dataset, which was split into an 
80% trainfile, a 10% validationfile and a held- 
out 10% testfile. The mutation rate was var- 
5All significance tests in this paper are one-tailed 
paired t-tests. 
104 
Classifier Morphology Unknown Words 
IB1 87.2 (± 1.6) 81.7 (± 0.5) 
IBi+GASEL 96.5 (± 1.0) 82.8 (± 0.6) 
IB1WFOSEL 96.6 (± 1.1) 82.9 (± 0.2) 
IB1WBASEL 96.6 (± 1.1) 82.9 (± 0.2) 
IBi-IG 96.2 (± 0.8) 82.8 (± 0.3) 
IBi-IG+GASEL 97.3 (± 0.9) 83.0 (± 0.3) 
IBI-IG+FOSEL 97.1 (± 0.9) 82.8 (± 0.3) 
IBI-IGWBASEL 97.3 (± 1.0) 82.9 (± 0.3) 
IGTREE 96.2 (±0.8) 81.4 (± 0.4) 
IGTREE-t-GASEL 97.1 (± 0.9) 81.4 (± 0.4) 
IGTREE--~-FOSEL 97.0 (± 0.9) 81.3 (± 0.4) 
IGTREE--~-BASEL 97.0 (± 1.1) 81.3 (± 0.4) 
ml-GA 95.6 (± 1.0) 81.6 (± 0.8) 
IB1-GA+GASEL 97.0 (± 1.1) 82.0 (± 1.2) 
GATREE 96.0 (± 1.0) 80.4 (± 1.2) 
GATREE--t-GASEL 9'7.1 (± 1.0) 81.0 (± 0.6) 
Table 1: Accuracy (:i: standard deviation) results of 
the experiments. Boldface marks the best results for 
each basic algorithm per data set. 
ied stepwise adding a value of 0.001 at each ex- 
periment, starting at a 0.004 value up to 0.01. 
The different values for crossover ranged from 
0.65 to 0.95, in steps of 0.05. The effect of 
changing crossover and mutation probabilities 
was tested for IBl-IG+GA-selection, for IB1 with 
CA weighting, for IGTREE+GA-selection, and for 
IGTREE with GA-weight settings. 
These experiments show considerable fluctua- 
tion in accuracy within the tested range, but dif- 
ferent parameter settings could also yield same 
results although they were far apart in value. 
Some settings achieved a particularly high accu- 
racy in this training regime (e.g. crossover: 0.75, 
mutation: 0.009). However, when we used these 
in the ten-fold cv setup of our main experi- 
ments, this gave a mean score of 97.4 (± 0.9) 
for IBi-IG with CA-selection and a mean score 
of 97.1 (± 1.1) for IGTREE with GA-selection. 
These accuracies are similar to those achieved 
with our default parameter settings. 
2.5 Discussion 
Feature selection on the morphology task shows 
a significant increase in performance accuracy, 
whereas on the unknown words task the differ- 
ences are less outspoken. To get some insight 
into this phenomenon, we looked at the average 
probabilities of the features that were left out 
by the evolutionary algorithm and their aver- 
age weights. 
On the morphology task this reveals that nu- 
cleus and coda of the last syllable are highly 
relevant, they are always included. The onset 
of all three syllables is always left out. Further, 
in all partitions the nucleus and coda of the sec- 
ond syllable are left out. 6 For part-of-speech 
tagging of unknown words all features appear 
to be more or less equally relevant. Over the 
ten partitions, either no omission is suggested 
at all, or the features that carry the pos-tag of 
n-2 word before and the n+2 word after the fo- 
cus word are deleted. This is comparable to re- 
ducing the context window of this classification 
task to one word before and one after the focus. 
The fact that all features seem to contribute 
to the classification when doing POS-tagging 
(making selection irrelevant) could also explain 
why the IGTREE algorithm seems to benefit less 
from the feature orders suggested and why the 
non-weighted approach IB1 already has a high 
score on the tagging task. The IGTREE algo- 
rithm is more suited for problems where the fea- 
tures can be ordered in a straightforward way 
because they have significantly different rele- 
vance. 
3 Conclusions and Related Research 
The issue of feature-relevance assignment is 
well-documented in the machine learning lit- 
erature. Excellent comparative surveys are 
(Wettschereck, Aha, and Mohri, 1997) and 
(Wettschereck and Aha, 1995) or (Blum and 
Langley, 1997). Feature subset selection 
by means of evolutionary algorithms was in- 
vestigated by Skalak (1994), Vafaie and de 
Jong (1992), and Yang and Honavar (1997). 
Other work deals with evolutionary approaches 
for continuous feature weight assignment such 
as Wilson and Martinez (1996), or Punch and 
Goodman (1993). 
The conclusions from these papers are in 
agreement with our findings on the natural lan- 
guage data, suggesting that feature selection 
and weighting with GA's significantly outper- 
form non-weighted approaches. Feature selec- 
tion generally improves accuracy with a reduc- 
6This fits in with current theory about this morpho- 
logical process (e.g. Trommelen (1983), Daelemans et 
al. (1997)). 
105 
tion in the number of features used. However, 
we have found no results (on these particular 
data) that indicate an advantage of evolutionary 
feature selection approach over the more classi- 
cal iterative methods. Our experiments further 
show that there is no evidence that GA weight- 
ing is in general competitive with simple filter 
methods such as gain ratio. Possibly, a parame- 
ter setting for the GA could be found that gives 
better results, but searching for such an optimal 
parameter setting is at present computationally 
unfeasible for typical natural  process- 
ing problems. 

References 
Aha, D., D. Kibler, and M. Albert. 1991. Instance- 
based learning algorithms. In Machine Learning 
Vol. 6, pp 37-66. 
Blum, A. and P. Langley. 1997. Selection of rele- 
vant features and examples in machine learning. 
In Machine Learning: Artificial Intelligence,97, 
pp 245-271. 
Daelemans, W., P. Berck, and S. Gillis. 1997. Data 
mining as a method for linguistic analysis: Dutch 
diminutives. In Folia Linguistica , XXXI/1-2, pp 
57-75. 
Daelemans, W., A. van den Bosch, and J. Zavrel. 
1999. Forgetting exceptions is harmful in lan- 
guage learning. In Machine Learning, special is- 
sue on natural  learning, 34 , pp 11-43. 
Daelemans, W., J. Zavrel, K. van der Sloot, and 
A. van den Bosch. 1999. Timbl: Tilburg mem- 
ory based learner, version 2.0, reference guide. Ilk 
technical report 99-01, ILK. 
John, G.H., R. Kohavi, and K. Pfleger. 1994. Irrel- 
evant features and the subset selection problem. 
In Machine Learning: Proceedings of the Eleventh 
International Conference, pp 121-129. 
Kohavi, R. and G.H. John. 1995. Wrappers for 
feature subset selection. In Artificial Intelligence 
Journal, Special Issue on Relevance Vol.97, pp 
273-324. 
Punch, W. F., E.D. Goodman, Lai Chia-Shun 
Min Pei, P. Hovland, and R. Enbody. 1993. Fur- 
ther research on feature selection and classifica- 
tion using genetic algorithms. In Proceedings of 
the Fifth International Conference on Genetic Al- 
gorithms, pp 557. 
Quinlan, J.R. 1993. C4.5: Programs for Machine 
Learning. San Mateo: Morgan Kaufmann. 
Skalak, D. 1993. Using a genetic algorithm to learn 
prototypes for case retrieval and classification. 
In Case-Based Reasoning: Papers from the 1993 
Workshop, Tech. Report WS-93-01, pp 211-215. 
AAAI Press. 
Skalak, D. B. 1994. Prototype and feature selec- 
tion by sampling and random mutation hill climb- 
ing algorithms. In Proceedings of the eleventh In- 
ternational Conference on Machine Learning, pp 
293-301. 
Trommelen, M.T.G. 1983. The Syllable in Dutch, 
with special Reference to Diminutive Formation. 
Foris: Dordrecht. 
Vafaie, H. and K. de Jong. 1992. Genetic algorithms 
as a tool for feature selection in machine learn- 
ing. In Machine Learning, Proceeding of the 4th 
International Conference on Tools with Artificial 
Intelligence, pp 200-204. 
Wettschereck, D. and D. Aha. 1995. Weighting fea- 
tures. In Proceedings of the First International 
Conference on Case-Based Reasoning, ICCBR- 
95, pp 347-358. 
Wettschereck, D., D. Aha, and T. Mohri. 1997. A 
review and empirical evaluation of feature weight- 
ing methods for a class of lazy learning algorithms. 
In Artificial Intelligence Review Vol.11, pp 273- 
314. 
Wilson, D. and T. Martinez. 1996. Instance- 
based learning with genetically derived attribute 
weights. In Proceedings of the International Con- 
ference on Artificial Intelligence, Expert Systems, 
and Neural Networks, pp 11-14. 
Yang, J. and V. Honavar. 1997. Feature subset se- 
lection using a genetic algorithm. In Genetic Pro- 
gramming 1997: Proceedings of the Second An- 
nual Conference, pp 380. 
