Combining a self-organising map with memory-based learning
James Hammerton
james.hammerton@ucd.ie
Dept of Computer Science
University College Dublin
Ireland
Erik Tjong Kim Sang
erikt@uia.ua.ac.be
CNTS – Language Technology Group
University of Antwerp
Belgium
Abstract
Memory-based learning (MBL) has
enjoyed considerable success in
corpus-based natural language process-
ing (NLP) tasks and is thus a reliable
method of getting a high-level of per-
formance when building corpus-based
NLP systems. However there is a
bottleneck in MBL whereby any novel
testing item has to be compared against
all the training items in memory base.
For this reason there has been some
interest in various forms of memory
editing whereby some method of
selecting a subset of the memory base
is employed to reduce the number of
comparisons. This paper investigates
the use of a modified self-organising
map (SOM) to select a subset of the
memory items for comparison. This
method involves reducing the number
of comparisons to a value proportional
to the square root of the number of
training items. The method is tested on
the identification of base noun-phrases
in the Wall Street Journal corpus,
using sections 15 to 18 for training and
section 20 for testing.
1 Introduction
Currently, there is considerable interest in ma-
chine learning methods for corpus-based lan-
guage learning. A promising technique here is
memory-based learning1 (MBL) (Daelemans et
al., 1999a), where a task is redescribed as a classi-
fication problem. The classification is performed
by matching an input item to the most similar of
a set of training items and choosing the most fre-
quent classification of the closest item(s). Sim-
ilarity is computed using an explicit similarity
metric.
MBL performs well by bringing all the training
data to bear on the task. This is done at the cost, in
the worst case, of comparing novel items to all of
the training items to find the closest match. There
is thus some interest in developing memory edit-
ing techniques to select a subset of the items for
comparison.
This paper investigates whether a self-
organising map (SOM) can be used to perform
memory editing without reducing performance.
The system is tested on base noun-phrase (NP)
chunking using the Wall Street Journal corpus
(Marcus et al., 1993).
2 The Self-Organising Map (SOM)
The SOM was developed originally by Kohonen
(1990) and has found a wide range of uses from
classification to storing a lexicon. It operates as
follows (see Figure 1). The SOM consists of two
layers, an input layer and the map. Each unit in
the map has a vector of weights associated with
it that is the same size as that of the input layer.
When an input is presented to the SOM, the unit
whose weight vector is closest to the input vector
is selected as a winner.
1Also known as instance based learning.
INPUTS
MAP
Figure 1: The Self-Organising Map. The map
units respond to the inputs. The map unit whose
weight vector is closest to the input vector be-
comes the winner. During training, after a winner
is chosen, the weight vectors of the winner and
a neighbourhood of surrounding units are nudged
towards the current input.
During training, the weight vectors of winning
unit and a set of units within the neighbourhood
of the winner are nudged, by an amount deter-
mined by the learning rate, towards the input vec-
tor. Over time the size of the neighbourhood is
decreased. Sometimes the learning rate may be
too. At the end of training, the units form a map
of the input space that reflects how the input space
was sampled in the training data. In particular ar-
eas of input space in which there were a lot of
inputs will be mapped in finer detail, using more
units than areas where the inputs were sparsely
distributed.
3 Why use a SOM for memory editing
The SOM was chosen because the input is
matched to the unit with the closest weight vec-
tor. Thus it is motivated by the same principle as
used to find the closest match in MBL. It is thus
hoped that the SOM can minimise the risk of fail-
ing to select the closest match, since the subset
will be chosen according to similarity.
However, Daelemans et al (1999b) claim that,
in language learning, pruning the training items
is harmful. When they removed items from the
memory base on the basis of their typicality (i.e.
the extent to which the items were representative
of other items belonging to the same class) or
their class prediction strength (i.e. the extent to
which the item formed a predictor for its class),
the generalisation performance of the MBL sys-
tem dropped across a range of language learning
tasks.
The memory editing approach used by Daele-
mans et al removes training items independently
of the novel items, and the remainder are used for
matching with all novel items. If one selects a
different subset for each novel item based on sim-
ilarity to the novel item, then maybe the risk of
degrading performance in memory editing will be
reduced. This work aims to achieve precisely this.
4 A hybrid SOM/MBL classifier
4.1 Labelled SOM and MBL (LSOMMBL)
A modified SOM was developed called Labelled
SOM. Training proceeds as follows:
a0 Each training item has an associated label.
Initially all the map units are unlabelled.
a0 When an item is presented, the closest unit
out of those with the same label as the input
and those that are unlabelled is chosen as the
winner. Should an unlabelled unit be chosen
it gets labelled with the input’s label.
a0 The weights for neighbouring units are up-
dated as with the standard SOM if they share
the same label as the input or are unlabelled.
a0 When training ends, all the training inputs
are presented to the SOM and the winners
for each training input noted. Unused units
are discarded.
Testing proceeds as follows:
a0 When an input is presented a winning unit is
found for each category.
a0 The closest match is selected from the train-
ing items associated with each of the win-
ning units found.
a0 The most frequent classification for that
match is chosen.
It is thus hoped that the closest matches for each
category are found and that these will include the
closest match amongst them.
Assuming each unit is equally likely to be cho-
sen, the average number of comparisons here is
given bya1a3a2a5a4a7a6a9a8a11a10a12a4a14a13 wherea1 is the number of
categories, a4 is the number of units in the map
and a8 is the number of training items. Choosing
a4 a15a17a16 a8 minimises comparisons to
a18
a1a19a16 a8 . In
the experiments the size of the map was chosen
to be close to a16 a8 . This system is referred to as
LSOMMBL.
4.2 SOM and MBL (SOMMBL)
In the experiments, a comparison with using
the standard SOM in a similar manner was per-
formed. Here the SOM is trained as normal on
the training items.
At the end of training each item is presented to
the SOM and the winning unit noted as with the
modified SOM above. Unused units are discarded
as above.
During testing, a novel item is presented to the
SOM and the top C winners chosen (i.e. the a1
closest map units), where C is the number of cat-
egories. The items associated with these win-
ners are then compared with the novel item and
the closest match found and then the most fre-
quent classification of that match is taken as be-
fore. This system is referred to as SOMMBL.
5 The task: Base NP chunking
The task is base NP chunking on section 20 of the
Wall Street Journal corpus, using sections 15 to
18 of the corpus as training data as in (Ramshaw
and Marcus, 1995). For each word in a sentence,
the POS tag is presented to the system which out-
puts whether the word is inside or outside a base
NP, or on the boundary between 2 base NPs.
Training items consist of the part of speech
(POS) tag for the current word, varying amounts
of left and right context (POS tags only) and the
classification frequencies for that combination of
tags. The tags were represented by a set of vec-
tors. 2 sets of vectors were used for comparison.
One was an orthogonal set with a vector of all
zeroes for the “empty” tag, used where the con-
text extends beyond the beginning/end of a sen-
tence. The other was a set of 25 dimensional vec-
tors based on a representation of the words en-
coding the contexts in which each word appears
in the WSJ corpus. The tag representations were
obtained by averaging the representations for the
words appearing with each tag. Details of the
method used to generate the tag representations,
known as lexical space, can be found in (Zavrel
and Veenstra, 1996). Reilly (1998) found it ben-
eficial when training a simple recurrent network
on word prediction.
The self-organising maps were trained as fol-
lows:
a0 For maps with 100 or more units, the train-
ing lasted 250 iterations and the neighbour-
hood started at a radius of 4 units, reducing
by 1 unit every 50 iterations to 0 (i.e. where
only the winner’s weights are modified). The
learning rate was constant at 0.1.
a0 For the maps with 6 units, the training lasted
90 iterations, with an initial neighbourhood
of 2, reducing by one every thirty iterations.
a0 For the maps with 30 units, the training
lasted 150 iterations, with an initial neigh-
bourhood of 2, reduced by one every 50 iter-
ations. A single training run is reported for
each network since the results did not vary
significantly for different runs.
These map sizes were chosen to be close to the
square root of the number of items in the training
set. No attempt was made to systematically in-
vestigate whether these sizes would optimise the
performance of the system. They were chosen
purely to minimise the number of comparisons
performed.
6 Results
Table 1 gives the results of the experiments. The
columns are as follows:
a0 “features”. This column indicates how the
features are made up.
“lex” means the features are the lexical
space vectors representing the POS tags.
“orth” means that orthogonal vectors are
used. “tags” indicates that the POS tags
themselves are used. MBL uses a weighted
overlap similarity metric while SOMMBL
and LSOMMBL use the euclidean distance.
features window Chunk Chunk tag Max
fscore accuracy comparisons (% of items)
LSOMMBL lex 0-0 79.99 94.48% 87 (197.7%)
LSOMMBL lex 1-0 86.13 95.77% 312 (30.3%)
LSOMMBL lex 1-1 89.51 96.76% 2046 (20.4%)
LSOMMBL lex 2-1 88.91 96.42% 2613 (6.8%)
LSOMMBL orth 0-0 79.99 94.48% 87 (197.7%)
LSOMMBL orth 1-0 86.09 95.76% 702 (68.1%)
LSOMMBL orth 1-1 89.39 96.75% 1917 (19.1%)
LSOMMBL orth 2-1 88.71 96.52% 2964 (7.7%)
SOMMBL lex 0-0 79.99 94.48% 51 (115.9%)
SOMMBL lex 1-0 86.11 95.77% 327 (31.7%)
SOMMBL lex 1-1 89.47 96.74% 1005 (10.0%)
SOMMBL lex 2-1 88.98 96.48% 1965 (5.1%)
SOMMBL orth 0-0 79.99 94.48% 42 (95.5%)
SOMMBL orth 1-0 86.08 95.75% 306 (29.7%)
SOMMBL orth 1-1 89.38 96.77% 1365 (13.6%)
SOMMBL orth 2-1 88.61 96.45% 2361 (6.1%)
MBL tags 0-0 79.99 94.48% 44 (100.0%)
MBL tags 1-0 86.14 95.78% 1031 (100.0%)
MBL tags 1-1 89.57 96.80% 10042 (100.0%)
MBL tags 2-1 89.81 96.83% 38465 (100.0%)
MBL lex 0-0 79.99 94.70% 44 (100.0%)
MBL lex 1-0 86.14 95.95% 1031 (100.0%)
MBL lex 1-1 89.57 96.93% 10042 (100.0%)
MBL lex 2-1 89.81 96.96% 38465 (100.0%)
MBL orth 0-0 79.99 94.70% 44 (100.0%)
MBL orth 1-0 86.12 95.94% 1031 (100.0%)
MBL orth 1-1 89.46 96.89% 10042 (100.0%)
MBL orth 2-1 89.55 96.87% 38465 (100.0%)
Table 1: Results of base NP chunking for section 20 of the WSJ corpus, using SOMMBL, LSOMMBL
and MBL, training was performed on sections 15 to 18. The fscores of the best performers in each case
for LSOMMBL, SOMMBL and MBL have been highlighted in bold.
a0 “window”. This indicates the amount of con-
text, in the form of “left-right” where “left”
is the number of words in the left context,
and “right” is the number of words in the
right context.
a0 “Chunk fscore” is the fscore for finding base
NPs. The fscore (a20 ) is computed as a20 a15
a21a23a22a25a24
a22a25a26a27a24 where
a28 is the percentage of base NPs
found that are correct and a29 is the percent-
age of base NPs defined in the corpus that
were found.
a0 “Chunk tag accuracy” gives the percentage
of correct chunk tag classifications. This is
provided to give a more direct comparison
with the results in (Daelemans et al., 1999a).
However, for many NL tasks it is not a good
measure and the fscore is more accurate.
a0 “Max comparisons”. This is the maximum
number of comparisons per novel item com-
puted as a1a3a2a5a4a30a6a32a31a33a13 wherea1 is the number
of categories, a4 is the number of units and
a31 is the maximum number of items associ-
ated with a unit in the SOM. This number de-
pends on how the map has organised itself in
training. The number given in brackets here
is the percentage this number represents of
window SOM Training
size items
0-0 10 44
1-0 30 1131
1-1 100 10042
2-1 200 38465
Table 2: Network sizes and number of training
items
the maximum number of comparisons under
MBL. The average number of comparisons
is likely to be closer to the average men-
tioned in Section 4.1.
Table 2 gives the sizes of the SOMs used for
each context size, and the number of training
items.
For small context sizes, LSOMMBL and MBL
give the same performance. As the window size
increases, LSOMMBL falls behind MBL. The
worst drop in performance is just over 1.0% on
the fscores (and just over 0.5% on chunk tag ac-
curacy). This is small considering that e.g. for the
largest context the number of comparisons used
was at most 6.8% of the number of training items.
To investigate whether the method is less risky
than the memory editing techniques used in
(Daelemans et al., 1999b), a re-analysis of their
data in performing the same task, albeit with lex-
ical information, was performed to find out the
exact drop in the chunk tag accuracy.
In the best case with the editing techniques
used by (Daelemans et al., 1999b), the drop in
performance was 0.66% in the chunk tag accu-
racy with 50% usage (i.e. 50% of the training
items were used). Our best involves a drop of
0.23% in chunk tag accuracy and only 20.4% us-
age. Furthermore their worst case involves a drop
of 16.06% in chunk tag accuracy again at the 50%
usage level, where ours involves only a 0.54%
drop in accuracy at the 6.8% usage level. This
confirms that our method may be less risky, al-
though a more direct comparison is required to
demonstrate this in a fully systematic manner. For
example, our system does not use lexical informa-
tion in the input where theirs does, which might
make a difference to these results.
Comparing the SOMMBL results with the
LSOMMBL results for the same context size
and tagset, the differences in performance are
insignificant, typically under 0.1 points on the
fscore and under 0.1% on the chunk tag accu-
racy. Furthermore the differences sometimes are
in favour of LSOMMBL and sometimes not. This
suggests they may be due to noise due to differ-
ent weight initialisations in training rather than a
systematic difference.
Thus at the moment it is unclear whether
SOMMBL or LSOMMBL have an advantage
compared to each other. It does appear however
that the use of the orthogonal vectors to repre-
sent the tags leads to slightly worse performance
than the use of the vectors derived from the lexical
space representations of the words.
7 Discussion and future work
The results suggest that the hybrid system pre-
sented here is capable of significantly reducing
the number of comparisons made without risking
a serious deterioration in performance. Certainly,
the reduction in the number of comparisons was
far greater than with the sampling techniques used
by Daelemans et al while yielding similar levels
of deterioration.
Given that a systematic investigation into what
the optimal training regimes and network sizes
are has not been performed the results thus seem
promising.
With regard to the network sizes, for example,
one reviewer commented that choosing the num-
ber of units to be the square root of the number
of training items may not be optimal since the
clusters formed in self-organising maps theoret-
ically vary as the squared cube root of the den-
sity, thus implying that a larger number of units
may offer better performance. What impact us-
ing this insight would have on the performance
(as opposed to the speed) of the system however
is unclear. Another variable that has not been sys-
tematically investigated is the optimal number of
winning units to choose during testing. Increasing
this value should allow tuning of the system to
balance deterioration of the performance against
the reduction in comparisons made.
Another issue regards the nature of the com-
parisons made. When choosing a winning unit,
the input is compared to the centroid of the clus-
ter each unit represents. However this fails to take
into account the distribution of the items around
the centroid. As one reviewer suggested, it may
be that if instead the comparison is made with the
periphery of the cluster there would be a reduced
risk of missing the closest item. One possibility
for taking this into account would be to compute
the average distance of the items from the cen-
troid and subtract this from the raw distance com-
puted between the input and the centroid. This
would take into account how spread out the clus-
ter is as well as how far the centroid is from the
input item. This would be a somewhat more ex-
pensive comparison to make but may be worth-
while to improve the probability that the closest
item is found in the clusters that are searched.
Future work will therefore include systemati-
cally investigating these issues to determine the
optimal parameter settings. Also a more system-
atic comparison with other sampling techniques,
especially other methods of clustering, is needed
to confirm that this method is less risky than other
techniques.
8 Conclusion
This work suggests that using the SOM for mem-
ory editing in MBL may be a useful technique for
improving the speed of MBL systems whilst min-
imising reductions in performance. Further work
is needed however to find the optimal training pa-
rameters for the system and to confirm the utility
of this approach.
Acknowledgements
The authors would like to thank the following
people for their discussions and advice during this
work; the reviewers of this paper for some help-
ful comments, Antal van den Bosch for providing
the data from (Daelemans et al., 1999b) for our
analysis, Ronan Reilly for his discussions of this
work, Walter Daelemans & other members of the
CNTS who made the first author’s visit to CNTS
an enjoyable and productive stay.
This work was supported by the EU-TMR
project “Learning Computational Grammars”.
The (L)SOMMBL simulations were performed
using the PDP++ simulator. The MBL simula-
tions were performed using the TiMBL package

References
W. Daelemans, S. Buchholz, and J. Veenstra. 1999a.
Memory-based shallow parsing. In M. Osborne
and E. Tjong Kim Sang, editors, Proceedings of the
Computational Natural Language Learning Work-
shop (CoNLL-99), University of Bergen, Bergen,
Norway, Somerset, NJ, USA. Association for Com-
putational Linguistics.
W. Daelemans, A. van den Bosch, and J. Zavrel.
1999b. Forgetting exceptions is harmful in lan-
guage learning. Machine Learning, 34:11–43.
T. Kohonen. 1990. The self-organising map. Pro-
ceedings of the IEEE, 78(9):1464–1480.
M. Marcus, B. Santorini, and M. A. Marcinkiewicz.
1993. Building a large annotated corpus of English:
the Penn Treebank. Computational Linguistics, 19.
L.A. Ramshaw and M.P. Marcus. 1995. Test chunking
using transformation based learning. In Proceed-
ings of the Third Workshop on Very Large Corpora,
Cambridge, MA, USA.
R.G. Reilly. 1998. Enriched lexical representations,
large corpora and the performance of srns. In
L. Niklasson, M. Bod`en, and T. Zemke, editors,
Proceedings of ICANN’98, Skovde, Sweden, pages
405–410.
J. Zavrel and J. Veenstra. 1996. The language envi-
ronment and syntactic word class acquisition. In
Koster C. and Wijnen F., editors, Proceedings of
the Groningen Assembly on Language Acquisition
(GALA ’95).
