In: Proceedings o/CoNLL-2000 and LLL-2000, pages 119-122, Lisbon, Portugal, 2000. 
A Default First Order Family Weight Determination Procedure 
for WPDV Models 
Hans van Halteren 
Dept. of Language and Speech, University of Nijmegen 
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands 
hvh@let, kun. nl 
Abstract 
Weighted Probability Distribution Voting 
(WPDV) is a newly designed machine learning 
algorithm, for which research is currently 
aimed at the determination of good weighting 
schemes. This paper describes a simple yet 
effective weight determination procedure, which 
leads to models that can produce competitive 
results for a number of NLP classification 
tasks. 
1 The WPDV algorithm 
Weighted Probability Distribution Voting 
(WPDV) is a supervised learning approach to 
classification. A case which is to be classified is 
represented as a feature-value pair set: 
Fcase -- {{fl : Vl},..., {fn :Vn}} 
An estimation of the probabilities of the various 
classes for the case in question is then based on 
the classes observed with similar feature-value 
pair sets in the training data. To be exact, the 
probability of class C for Fcase is estimated as 
a weighted sum over all possible subsets Fsub of 
Fcase: 
w /req(CJF  b) P(C) = N(C) /req(F  b) 
FsubCFcase 
with the frequencies (freq) measured on the 
training data, and N(C) a normalizing factor 
such that ~/5(C) = 1. 
In principle, the weight factors WF,~,~ can be 
assigned per individual subset. For the time 
being, however, they are assigned for groups of 
subsets. First of all, it is possible to restrict 
the subsets that are taken into account in the 
model, using the size of the subset (e.g. Fsub 
contains at most 4 elements) and/or its fre- 
quency (e.g. Fsub occurs at least twice in the 
training material). Subsets which do not fulfil 
the chosen criteria are not used. For the sub- 
sets that are used, weight factors are not as- 
signed per individual subset either, but rather 
per "family", where a family consists of those 
subsets which contain the same combination of 
feature types (i.e. the same f/). 
The two components of a WPDV model, dis- 
tributions and weights, are determined sepa- 
rately. In this paper, I will use the term training 
set for the data on which the distributions are 
based and tuning set for the data on the basis of 
which the weights are selected. Whether these 
two sets should be disjunct or can coincide is 
one of the subjects under investigation. 
2 Family weights 
The various family weighting schemes can be 
classified according to the type of use they make 
of the tuning data. Here, I use a very rough 
classification, into weighting scheme orders. 
With 0 th order weights, no information 
whatsoever is used about the data in tuning 
set. Examples of such rudimentary weighting 
schemes are the use of a weight of k! for all sub- 
sets containing k elements, as has been used e.g. 
for wordclass tagger combination (van Halteren 
et al., To appear), or even a uniform weight for 
all subsets. 
With 1 st order weights, information is used 
about the individual feature types, i.e. 
WF,~b = IT WIt 
{il(fi:vi}eF, ub} 
First order weights ignore any possible inter- 
action between two or more feature types, but 
119 
have the clear advantage of corresponding to a 
reasonably low number of weights, viz. as many 
as there are feature types. 
With n th order weights, interaction pat- 
terns are determined of up to n feature types 
and the family weights are adjusted to compen- 
sate for the interaction. When n is equal to 
the total number of feature types, this corre- 
sponds to weight determination per individual 
family, n th order weighting generally requires 
much larger numbers of weights, which can be 
expected to lead to much slower tuning proce- 
dures. In this paper, therefore, I focus on first 
order weighting. 
3 First order weight determination 
As argumented in an earlier paper (van Hal- 
teren, 2000a), a theory-based feature weight de- 
termination would have to take into account 
each feature's decisiveness and reliability. How- 
ever, clear definitions of these qualities, and 
hence also means to measure them, are as yet 
sorely lacking. As a result, a more pragmatic 
approach will have to be taken. Reliability is ig- 
nored altogether at the moment, 1 and decisive- 
ness replaced by an entropy-related measure. 
3.1 Initial weights 
The weight given to each feature type fi should 
preferably increase with the amount of informa- 
tion it contributes to the classification process. 
A measure related to this is Information Gain, 
which represents the difference between the en- 
tropy of the choice with and without knowledge 
of the presence of a feature (cf. Quinlan (1986)). 
As do Daelemans et al. (2000), I opt for a fac- 
tor proportional to the feature type's Gain Ra- 
tio, a normalising derivative of the Information 
Gain value. The weight factors W/~ are set to 
an optimal multiplication constant C times the 
measured Gain Ratio for fi- C is determined by 
calculating the accuracies for various values of 
C on the tuning set 2 and selecting the C which 
yields the highest accuracy. 
lit may still be present, though, in the form of the 
abovementioned frequency threshold for features. 
2If the tuning set coincides with the training set, all 
parts of the tuning procedure are done in leave-one-out 
mode: in the WPDV implementation, it is possible to 
(virtually) remove the information about each individual 
instance from the model when that specific instance has 
to be classified. 
3.2 Hill-climbing 
Since the initial weight determination is based 
on pragmatic rather than theoretical consider- 
ations, it is unlikely that the resulting weights 
are already the optimal ones. For this reason, 
an attempt is made to locate even better weight 
vectors in the n-dimensional weight space. The 
navigation mechanism used in this search is hill- 
climbing. This means that systematic variations 
of the currently best vector are investigated. If 
the best variation is better than the currently 
best vector, that variation is taken as the best 
vector and the process is repeated. This repeti- 
tion continues until no better vector is found. 
In the experiments described here, the varia- 
tion consists of multiplication or division of each 
individual W/i by a variable V (i.e. 2n new vec- 
tors are tested each time), which is increased if a 
better vector is found, and otherwise decreased. 
The process is halted as soon as V falls below 
some pre-determined threshold. 
Hill-climbing, as most other optimaliza- 
tion techniques, is vulnerable to overtraining. 
To lessen this vulnerability, the WPDV hill- 
climbing implementation splits its tuning mate- 
rial into several (normally five) parts. A switch 
to a new weight vector is only taken if the ac- 
curacy increases on the tuning set as a whole 
and does not decrease on more than one part, 
i.e. some losses are accepted but only if they 
are localized. 
4 Quality of the first order weights 
In order to determine the quality of the WPDV 
system, using first order weights as described 
above, I run a series of experiments, using tasks 
introduced by Daelemans et al. (1999): 3 
The Part-of-speech tagging task (POS) is 
to determine a wordclass tag on the basis of dis- 
ambiguated tags of two preceding tokens and 
undisambiguated tags for the focus and two fol- 
lowing tokens. 4 5 features with 170-480 values; 
169 classes; 837Kcase training; 2xl05Kcase test. 
The Grapheme-to-phoneme conversion 
with stress task (GS) is to determine the pro- 
nunciation of an English grapheme, including 
aI only give a rough description of the tasks here. For 
the exact details, I refer the reader to Daelemans et al. 
(1999). 
4For a overall WPDV approach to wordclass tagging, 
see van Halteren (2000b). 
120 
Table h Accuracies for the POS task (with the 
training set ah tested in leave-one-out mode) 
Table 2: Accuracies for the GS task (with the 
training set ah tested in leave-one-out mode) 
Weighting scheme Test set 
ah i 
Comparison 
Naive Bayes 
TiMBL (k=l) 
Maccent (freq=2;iter=150) 
Maccent (freq=l;iter=300) 
WPDV 0 th order weights 
1 
kl 
96.41 
97.83 
98.07 
98.13 
97.66 97.71 
96.86 96.92 
WPDV initial 18t order 
tune = ah (10GR) 98.14 98.16 
tune = i (12GR) 98.14 98.17 
tune = j (llGR) 98.14 98.16 
WPDV with hill-climbing 
tune = ah (30 steps) 98.17 98.21 
tune = i (20 steps) 98.15 98.20 
tune = j (20 steps) 98.15 98.18 
96.24 
97.79 
98.03 
98.10 
97.63 
96.86 
98.12 
98.12 
98.13 
98.15 
98.12 
98.16 
Weighting scheme 
Comparison 
Naive Bayes 
TiMBL (k----l) 
Maccent (freq=2;iter=150) 
Maccent (freq=l;iter=300) 
Test set 
ah i j 
50.05 49.98 
92.25 92.02 
79.41 79.36 
80.43 80.35 
WPDV 0 th order weights 
1 90.99 90.49 90.25 
k! 92.77 92.05 91.89 
WPDV initial 1 st order 
tune = ah (30GR) 93.27 92.74 92.52 
tune = i (25GR) 93.24 92.76 92.54 
tune = j (25GR) 93.24 92.76 92.54 
WPDV with hill-climbing 
tune = ah (34 steps) 93.29 92.77 92.53 
tune = i (28 steps) 93.25 92.79 92.53 
tune = j (12 steps) 93.24 92.76 92.54 
presence of stress, on the basis of the focus 
grapheme, three preceding and three following 
graphemes. 7 features with 42 values each; 159 
classes; 540Kcase training; 2x68Kcase test. 
The PP attachment task (PP) is preposi- 
tional phrase attachment to either a preceding 
verb or a preceding noun, on the basis of the 
verb, the noun, the preposition in question and 
the head noun of the prepositional complement. 
4 features with 3474, 4612, 68 and 5780 values; 
2 classes; 19Kcase training; 2x2Kcase test. 
The NP chunking task (NP) is the deter- 
ruination of the position of the focus token in a 
base NP chunk (at beginning of chunk, in chunk, 
or not in chunk), on the basis of the words and 
tags for two preceding tokens, the focus and 
one following token, and also the predictions by 
three newfirst stage classifiers for the task. 5 11 
features with 3 (first stage classifiers), 90 (tags) 
and 20K (words) values; 3 classes; 201Kcase 
training; 2x25Kcase test. 6 
For each of the tasks, sections a to h of the data 
set are used as the training set and sections i 
5For a WPDV approach to a more general chunking 
task, see my contribution to the CoNLL shared task, 
elsewhere in these proceedings. 
~The number of feature combinations for the NP task 
is so large that the WPDV model has to be limited. For 
the current experiments, I have opted for a maximum 
size for fsub of four features and a threshold frequency 
of two observations in the training set. 
and j as (two separate) test sets. All three are 
also used as tuning sets. This allows a compari- 
son between tuning on the training set itself and 
on a held-out tuning set. For comparison with 
some other well-known machine learning algo- 
rithms, I complement the WPDV experiments 
with accuracy measurements for three other sys- 
tems: 1) A system using a Naive Bayes prob- 
ability estimation; 2) TiMBL, using memory 
based learning and probability estimation based 
on the nearest neighbours (Daelemans et al., 
2000), 7 for which I use the parameters which 
yielded the best results according to Daelemans 
et al. (1999); and 3) Maccent, a maximum en- 
tropy based system, s for which I use both the 
default parameters, viz. a frequency threshold 
of 2 for features to be used and 150 iterations 
of improved iterative scaling, and a more am- 
bitious parameter setting, viz. a threshold of 1 
and 300 iterations. 
The results for various WPDV weights, and 
the other machine learning techniques are listed 
in Tables 1 to 4. 9 Except for one case (PP with 
tune on j and test on i), the first order weight 
WPDV results are all higher than those for the 
7http://ilk.kub.nl/. Shttp://w~.cs.kuleuven.ac.be/~ldh. 
9The accuracy is shown in itMics wheneverthetuning 
set is equM to the test set, i.e. when there is anunf~r 
advantage. 
121 
Table 3: Accuracies for the PP task (with the 
training set ah tested in leave-one-out mode) 
Table 4: Accuracies for the NP task (with the 
training set ah tested in leave-one-out mode) 
Weighting scheme Test set 
ah i 
Comparison 
Naive Bayes 
TiMBL (k=l) 
Maccent (freq=2;iter=150) 
Maccent (freq=l;iter=300) 
WPDV 0 ~h order weights 
1 
k~ 
82.68 82.64 
83.43 81.97 
81.00 80.25 
79.41 79.79 
80.83 82.26 81.46 
80.76 82.30 81.30 
WPDV initial 18t order 
tune = ah (21GR) 82.89 83.64 82.38 
tune = i (15GR) 82.82 83.81 82.55 
tune = j (llGR) 82.60 83.26 82.76 
WPDV with hill-climbing 
tune --- ah (19 steps) 83.10 83.72 82.68 
tune = i (18 steps) 82.95 84.06 82.80 
tune = j (16 steps) 82.65 83.10 82.93 
Weighting scheme i Test set 
ah i j 
Comparison 
Naive Bayes 
TiMBL (k=3) 
Maccent (freq=2;iter=150) 
Maccent (freq=l;iter=300) 
WPDV 0 th order weights 
1 
k~ 
96.52 96.49 
98.34 98.22 
97.89 97.75 
97.66 97.45 
97.56 97.77 97.69 
97.74 97.97 97.87 
WPDV initial I st order 
tune = ah (380GR) 98.19 98.38 98.26 
tune = i (60GR) 98.14 98.39 98.17 
tune = j (360GR) 98.19 98.38 98.27 
WPDV with hill-climbing 
tune = ah (50 steps) 98.36 98.54 98.44 
tune = i (34 steps) 98.25 98.57 98.33 
tune = j (12 steps) 98.19 98.38 98.27 
comparison systems. 1° 0 th order weights gener- 
ally do not reach this level of accuracy. 
Hill-climbing with the tuning set equal to the 
training set produces the best results overall. 
It always leads to an improvement over initial 
weights of the accuracies on both test sets, al- 
though sometimes very small (GS). Equally im- 
portant, the improvement on the test sets is 
comparable to that on the tuning/training set. 
This is certainly not the case for hill-climbing 
with the tuning set equal to the other test set, 
which generally does not reach the same level of 
accuracy and may even be detrimental (climb- 
ing on PPj). 
Strangely enough, hill-climbing with the tun- 
ing set equal to the test set itself sometimes does 
not even yield the best quality for that test set 
(POS with test set i and especially NP with j). 
This shows that the weight-+accuracy function 
does have local maxima~ and the increased risk 
for smaller data sets to run into a sub-optimal 
one is high enough that it happens in at least 
two of the eight test set climbs. 
1°The accuracies for TiMBL are lower than those 
found by Daelemans et ai. (1999): POSi 97.95, POSj 
97.90, GS~ 93.75, GSj 93.58, PP~ 83.64, PPj 82.51, NP~ 
98.38 and NPj 98.25. This is due to the use of eight part 
training sets instead of nine. The extreme differences for 
the GS task show how much this task depends on indi- 
vidual observations rather than on generalizations, which 
probably also explains why Naive Bayes and Maximum 
Extropy (Maccent) handle this task so badly. 
In summary, hill-climbing should preferably 
be done with the tuning set equal to the training 
set. This is not surprising, as the leave-one- 
out mechanism allows the training set to behave 
as held-out data, while containing eight times 
more cases than a test set turned tuning set. 
The disadvantage is a much more time-intensive 
hill-climbing procedure, but when developing an 
actual production model, the weights only have 
to be determined once and the results appear to 
be worth it most of the time. 

References 
w. Daelemans, A. Van den Bosch, and J. Zavrel. 
1999. Forgetting exceptions is harmful in lan- 
guage learning. Machine Learning, Special issue 
on Natural Language Learning, 34:11-41. 
W. Daelemans, J. Zavrel, K. Van der Sloot, and 
A. Van den Bosch. 2000. TiMBL: Tilburg Mem- 
ory Based Learner, version 3.0, reference manual. 
Tech. Report ILK-00-01, ILK, Tilburg University. 
H. van Halteren. 2000a. Weighted Probability Dis- 
tribution Voting, an introduction. In Computa- 
tional linguistics in the Netherlands, 1999. 
H. van Halteren. 2000b. The detection of in- 
consistency in manually tagged text. In Proc. 
LINC2000. 
H. van Halteren, J. Zavrel, and W. Daelemans. 
To appear. Improving accuracy in NLP through 
combination of machine learning systems. Com- 
putational Linguistics. 
J.R. Quinlan. 1986. Induction of Decision Trees. 
Machine Learning, 1:81-206. 
