Finding optimal parameter settings for high performance word sense
disambiguation
Cristian Grozea
Department of Computer Science
University of Bucharest
Str. Academiei 14, 70109 Bucharest, Romania
chrisg@phobos.ro
Abstract
This article describes the four systems submitted by the author to the SENSEVAL-3 contest, English lexical sample task. The best recognition rate obtained by one of these systems was 72.9% (fine-grained score).
1 Introduction. RLSC algorithm, input
and output.
This paper is not self-contained. The reader should first read the paper of Marius Popescu (Popescu, 2004), which contains the full description of the base algorithm, Regularized Least Squares Classification (RLSC), as applied to WSD.
Our systems used the feature extraction described in (Popescu, 2004), with some differences.
Let us fix a word from the list of words we must be able to disambiguate, and let m be the number of possible senses of this word.
Each instance of the WSD problem for this fixed word is represented as an array of binary values (features), divided by its Euclidean norm. The number of input features differs from one word to another. The desired output for that array is another binary array, of length m.
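As a minimal sketch of this representation (hypothetical feature indices and sense inventory, not the actual SENSEVAL feature extraction):

```python
import numpy as np

def encode_instance(active_features, num_features):
    """Binary feature vector, divided by its Euclidean norm."""
    x = np.zeros(num_features)
    x[list(active_features)] = 1.0
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def encode_sense(sense_index, m):
    """Target: m - 1 zeros and a single 1 at the correct sense."""
    y = np.zeros(m)
    y[sense_index] = 1.0
    return y

# Hypothetical example: 10 features, 3 senses, features 2 and 7 active.
x = encode_instance({2, 7}, num_features=10)
y = encode_sense(1, m=3)
```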
After the feature extraction, the WSD problem is regarded as a linear regression problem. The equation of the regression is Xc = Y, where each of the lines of the matrix X is an example and each line of Y is an array of length m containing m - 1 zeros and a single 1. The output xc of the trained model c on some particular input x is an array of values that ideally are just 0 or 1. In practice those values are never exactly 0 and 1, so we are prepared to consider them an "activation" of the sense recognizers and to consider that the most "activated" sense (the one with the highest value) wins and gives the sense we decide on. In other words, we consider the xc values an approximation of the true probabilities associated with each sense.
The RLSC solution to this linear regression problem is c = X^T (X X^T + λI)^{-1} Y.
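A direct NumPy transcription of this closed form, as a sketch (variable names are ours; solving the linear system avoids forming the inverse explicitly):

```python
import numpy as np

def rlsc_train(X, Y, lam):
    """RLSC: c = X^T (X X^T + lam I)^{-1} Y.

    X: (n_examples, n_features) normalized feature rows.
    Y: (n_examples, m) one-hot sense targets.
    Returns c: (n_features, m), one model column per sense.
    """
    n = X.shape[0]
    G = X @ X.T + lam * np.eye(n)          # regularized Gram matrix
    return X.T @ np.linalg.solve(G, Y)     # solve instead of invert
```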
The first difference between our systems and Marius Popescu's RLSC-LIN is that two of the systems (HTSA3 and HTSA4) use supplementary features, obtained by multiplying together up to three of the existing features, because these improved the accuracy on SENSEVAL-2.
Another difference is that the targets Y have values 0 and 1, while in Marius Popescu's RLSC-LIN the targets have values -1 and 1. We see the output values of the trained model as approximations of the true probabilities of the senses.
The main difference is the postprocessing we apply after obtaining c. It is explained below.
2 Adding parameters
The single obvious parameter of RLSC is λ. Some improvement can be obtained using larger λ values, but after dropping the parser information from the features (when it became clear that we would not have it for SENSEVAL-3) the improvements proved to be too small. Therefore we fixed λ = 10^{-7}.
During our tests we observed that normalizing the model for each sense (the columns of c), that is, dividing each column by its Euclidean norm, gives better results, at least on SENSEVAL-2, and does not give too bad results on SENSEVAL-1 either. A yes/no parameter like this one (normalizing or not the columns of c) leaves little room for fine tuning. After some experimentation we decided that the most promising way to convert this new discrete parameter into a continuous one was to consider that in both cases it is a division by ||c_j||^α, where α = 0 when we leave the model unchanged and α = 1 when we normalize the model columns.
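In code, this continuous relaxation of the yes/no choice might be sketched as follows (assuming `c` is the trained model matrix, one column per sense):

```python
import numpy as np

def normalize_columns(c, alpha):
    """Divide each sense model c_j by ||c_j||^alpha.

    alpha = 0 leaves the model unchanged;
    alpha = 1 fully normalizes the columns.
    """
    norms = np.linalg.norm(c, axis=0)      # ||c_j|| for each sense j
    norms[norms == 0] = 1.0                # guard against empty columns
    return c / norms ** alpha
```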
3 Choosing the best value of the
parameters
This is the procedure we employed to tune the parameter α until the recognition rate achieved the best levels on SENSEVAL-1 and SENSEVAL-2 data:
1. preprocess the input data - obtain the features
2. compute c = X^T (X X^T + λI)^{-1} Y
3. for each α from 0 to 1 with step 0.1
4. test the model (using α in the postprocessing phase, then the scoring Python script)
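Under the assumption that `rlsc_train` and `normalize_columns` from the earlier sketches are in scope, and with a tiny synthetic stand-in for the real data and the official scorer, the tuning loop might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny synthetic stand-in for one word's data (not the real SENSEVAL features).
X = rng.random((40, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
labels = rng.integers(0, 3, size=40)
Y = np.eye(3)[labels]                              # one-hot sense targets
X_dev, gold = X[30:], labels[30:]                  # pretend held-out data

c = rlsc_train(X[:30], Y[:30], lam=1e-7)           # step 2

best_alpha, best_acc = 0.0, -1.0
for alpha in np.arange(0.0, 1.01, 0.1):            # step 3: the alpha grid
    scores = X_dev @ normalize_columns(c, alpha)   # step 4: postprocess
    acc = (scores.argmax(axis=1) == gold).mean()   # stands in for the scorer
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc
```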
At this point we were worried by the lack of any explanation (and therefore the lack of any guarantee about the performance on SENSEVAL-3). After some thinking about the strengths and weaknesses of RLSC, it became apparent that RLSC implicitly incorporates Bayesian-style reasoning: the senses most frequent in the training data lead to higher-norm models, and thus to a higher a posteriori probability. Experimental evidence was obtained by plotting the sense frequencies next to the norms of the model's columns. Seen this way, the correction we applied amounted, more or less, to an implicit division by the empirical frequency of the senses in the training data. So we switched to dividing the columns c_j by the observed frequency f_j of the j-th sense instead of the norm ||c_j||. This led to an improvement on SENSEVAL-2, so this is our base system, HTSA1:
Test procedure for HTSA1:
1. Postprocessing: for j = 1..m, correct the model column c_j by c_j ← c_j / f_j^α.
For each test input x, do steps 2 and 3:
2. Compute the output y = xc for the input x.
3. Find the maximum component of y. Its position is the label returned by the algorithm for the input x.
Please observe that, because of the linearity, the correction can be performed on y instead of c, just after step 2: y_j ← y_j / f_j^α. For this reason we call this correction "postprocessing".
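A sketch of this test procedure in code, using the output-side form of the correction (`freqs`, holding the observed training frequencies f_j, is our name, not from the paper):

```python
import numpy as np

def htsa1_predict(x, c, freqs, alpha):
    """Return the index of the predicted sense for one test input x.

    Equivalent, by linearity, to dividing each model column c_j
    by f_j^alpha before computing x @ c.
    """
    y = x @ c                    # step 2: raw activations, one per sense
    y = y / freqs ** alpha       # postprocessing: y_j <- y_j / f_j^alpha
    return int(np.argmax(y))     # step 3: the most activated sense wins
```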
4 Description of the systems.
Performance.
Here is a very short description of our systems: what they have in common and what differs, as well as their performance level (recognition rate).
There are four flavors of the same RLSC-based algorithm. They differ in the preprocessing and in the postprocessing applied (the name and explanation appear under each figure).
[Figure: HTSA1 performance on SENSEVAL-1 (recognition rate as a function of α)]
HTSA1: implicit correction of the frequencies, dividing the output confidence of each sense by frequency^α. The figure shows how the recognition rate depends on α on SENSEVAL-1.
[Figure: HTSA1 performance on SENSEVAL-2 (recognition rate as a function of α)]
HTSA1 on SENSEVAL-2: the recognition rate depicted as a function of α.
[Figure: HTSA2 performance on SENSEVAL-2 (recognition rate as a function of α)]
HTSA2: explicit correction of the frequencies, multiplying the output confidences by a certain decreasing function of the frequency that tries to approximate the effect of the postprocessing done by HTSA1; here, the performance on SENSEVAL-2 as a function of α.
[Figure: HTSA3 performance on SENSEVAL-2 (recognition rate as a function of α)]
HTSA3: like HTSA1, with a preprocessing step that adds supplementary features by multiplying some of the existing ones; here, the performance on SENSEVAL-2 as a function of α.
The supplementary features added in HTSA3 and HTSA4 are all products of two and three local context features. This was meant to supply the linear regression with some nonlinear terms, thus giving the algorithm the possibility to use conjunctions.
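A minimal sketch of such a preprocessing step (the selection of local-context feature indices to combine is assumed here; the real choice follows the feature set of (Popescu, 2004)):

```python
from itertools import combinations
import numpy as np

def add_product_features(x, local_idx):
    """Append products of two and three of the (binary) local context
    features; for 0/1 features each product acts as a conjunction."""
    pairs = [x[i] * x[j] for i, j in combinations(local_idx, 2)]
    triples = [x[i] * x[j] * x[k] for i, j, k in combinations(local_idx, 3)]
    return np.concatenate([x, pairs, triples])

# Hypothetical usage: features 0..4 are the local context of the word.
x = np.array([1, 0, 1, 1, 0, 1, 0], dtype=float)
x_ext = add_product_features(x, local_idx=range(5))
```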
[Figure: HTSA3 performance on SENSEVAL-3 (recognition rate as a function of α)]
Was our best result lucky? Here is the performance graph of HTSA3 on SENSEVAL-3 as a function of α. As we can see, any α between 0.2 and 0.3 would have given accuracies between 72.5% and 72.9%.
[Figure: HTSA4 performance on SENSEVAL-2 (recognition rate as a function of α)]
HTSA4: like HTSA2, with the preprocessing described above. Here, the performance on SENSEVAL-2 as a function of α.
[Figure: HTSA4 performance on SENSEVAL-3 (recognition rate as a function of α)]
The performance of HTSA4 on SENSEVAL-3 as a function of α.
What can be seen in this figure is that α = 0.2 was not such a good choice for SENSEVAL-3. Instead, α = 0.45 would have achieved a recognition rate of 73.2%. In other words, the best value of α on SENSEVAL-2 is not necessarily the best one on SENSEVAL-3. The next section discusses alternative ways of "guessing" the best values of the parameters, as well as why they would not have worked in this case.
5 Cross-validation. Possible explanations
of the results
The common idea of HTSA1, 2, 3 and 4 is that a slight departure from the Bayesian a priori frequencies improves the accuracy. This is done here by postprocessing, and it works with any method that produces probabilities/credibilities for all word senses. The degree of departure from the Bayesian a priori frequencies can be varied, and it was tuned on SENSEVAL-1 and SENSEVAL-2 data until the optimum value α = 0.2 was determined.
Of course, there was still no guarantee of how good the performance on SENSEVAL-3 would be. The natural idea is to apply cross-validation to determine the best α using the current training set. We tried that, but a very strange thing could be observed: on both SENSEVAL-1 and SENSEVAL-2, cross-validation indicated that values of α around 0 should have been better than 0.2.
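A minimal sketch of that cross-validation, assuming the `rlsc_train` and `normalize_columns` helpers from the earlier sketches:

```python
import numpy as np

def cv_best_alpha(X, labels, m, lam=1e-7, k=5):
    """Pick alpha by k-fold cross-validation on the training set."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    alphas = np.arange(0.0, 1.01, 0.1)
    acc = np.zeros(len(alphas))
    for f in folds:
        train = np.setdiff1d(np.arange(len(X)), f)
        c = rlsc_train(X[train], np.eye(m)[labels[train]], lam)
        for a_i, alpha in enumerate(alphas):
            scores = X[f] @ normalize_columns(c, alpha)
            acc[a_i] += (scores.argmax(axis=1) == labels[f]).mean()
    return alphas[int(np.argmax(acc))]
```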
We see this as an indication that the distribution of sense frequencies in the test set does not fully match that of the training set. This could explain why it is better to depart from the Bayesian style and move toward the maximum likelihood method. We think that this is exactly what we did.
Initially we only had HTSA1 and HTSA3. By looking at the graph of the correction done by dividing by frequency^0.2 (the "implicit correction" curve in the figure below), we observed that it tends to give more chances to the weakly represented senses. To test this hypothesis we built an explicit, piecewise-linear correction, also reproduced in the same figure. Thus we obtained HTSA2 and HTSA4; in their case, α is the position of the joining point. These performed close to HTSA1 and HTSA3, so we have experimental evidence that increasing the a priori probabilities of the lower-frequency senses gives better recognition rates.
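A sketch of the explicit correction as we understand it from the figure below; the maximum boost of 2.0 is read off the plot's vertical range and is an assumption, as is the exact shape:

```python
import numpy as np

def explicit_correction(freq, alpha, boost=2.0):
    """Piecewise-linear correction factor, as in HTSA2/HTSA4 (sketch).

    Multiplies confidences of rare senses by up to `boost`, decreasing
    linearly to 1.0 at frequency alpha; no change above alpha.
    """
    freq = np.asarray(freq, dtype=float)
    factor = np.ones_like(freq)
    low = freq < alpha
    factor[low] = boost - (boost - 1.0) * freq[low] / alpha
    return factor
```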
[Figure: correction factor as a function of the observed frequency; implicit correction vs. explicit correction]
Red: implicit correction (HTSA1, 3); blue: explicit correction (HTSA2, 4).
6 Conclusions. Further work.
RLSC proved to be a very powerful learning model. We also believe that tuning the parameters of a model is a must, even if one has to invent the parameters first. We think that the way we proceeded here with α can be applied to other models, as a simple and direct postprocessing step. Of course, the right value of α has to be found case by case. We would suggest that everyone who participated with systems producing Bayesian-like class probabilities try applying this postprocessing to their systems.

References
Marius Popescu. 2004. Regularized least-squares classification for word sense disambiguation. In Proceedings of SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain.
