Improving POS Tagging Using Machine-Learning Techniques 
Llufs Mhrquez 1, Horacio Rodrfguez 2, Josep Carmona 1 and Josep Montolio 1 
1 TALP Research Center. Dep. LSI - Universitat Polit~cnica de Catalunya 
c/Jordi Girona 1-3. 08034 Barcelona. Catalonia 
lluism@l si .upc. es 
2 Dep. IMA - Universitat de Girona 
Horacio. Rodriguez@ima. udg. es 
Abstract 
In this paper we show how machine learning 
techniques for constructing and combining sev- 
eral classifiers can be applied to improve the 
accuracy of an existing English POS tagger 
(MSxquez and Rodrfguez, 1997). Additionally, 
the problem of data sparseness is also addressed 
by applying a technique of generating convez 
pseudo-data (Breiman, 1998). Experimental re- 
sults and a comparison to other state-of-the- 
art tuggers are reported. 
Keywords: POS Tagging, Corpus-based mod- 
eling, Decision Trees, Ensembles of Classifiers. 
1 Introduction 
The study of general methods to improve the 
performance in classification tasks, by the com- 
bination of different individual classifiers, is a 
currently very active area of research in super- 
vised learning. In the machine learning (ML) 
literature this approach is known as ensemble, 
stacked, or combined classifiers. Given a classi- 
fication problem, the main goal is to construct 
several independent classifiers, since it has been 
proven that when the errors committed by indi- 
vidual classifiers are uncorrelated to a sufficient 
degree, and their error rates are low enough, 
the resulting combined classifier performs bet- 
ter than all the individual systems (Ali and Paz- 
zani, 1996; Tumer and Ghosh, 1996; Dietterich, 
1997). 
Several methods have been proposed in order 
to construct ensembles of classifiers that make 
uncorrelated errors. Some of them are general, 
and they can be applied to any learning algo- 
rithln, while other are specific to particular al- 
gorithms. From a different perspective, there 
exist methods for constructing homogeneous en- 
sembles, in the sense that a unique learning al- 
gorithm has been used to acquire each individ- 
ual classifier, and heterogeneous ensembles that 
combine different types of learning paradigms 1. 
Impressive results have been obtained by ap- 
plying these techniques on the so-called unsta- 
ble learning algorithms (e.g. induction of deci- 
sion trees, neural networks, rule-induction sys- 
tems, etc.). Several applications to real tasks 
have been performed, and, regarding NLP, we 
find ensembles of classifiers in context-sensitive 
spelling correction (Golding and Roth, 1999), 
text categorization (Schapire and Singer, 1998; 
Blum and Mitchell, 1998), and text filtering 
(Schapire et al., 1998). Combination of classi- 
tiers have also been applied to POS tagging. For 
instance, van Halteren (1996) combined a num- 
ber of similar tuggers by way of a straightfor- 
ward majority vote. More recently, two parallel 
works (van Halteren et al., 1998; Brill and Wu, 
1998) combined, with a remarkable success, the 
output of a set of four tuggers based on different 
principles and feature modelling. Finally, in the 
work by MSxquez et al. (1998) the combination 
of taggers is used in a bootstrapping algorithm 
to train a part of speech tagger from a limited 
amount of training material. 
The aim of the present work is to improve 
an existing POS tagger based on decision trees 
(Mkrquez and Rodriguez, 1997), by using en- 
sembles of classifiers. This tagger treats sepa- 
rately the different types (classes) of ambiguity 
by considering a different decision tree for each 
class. This fact allows a selective construction 
of ensembles of decision trees focusing on the 
most relevant ambiguity classes, which greatly 
vary in size and difficulty. Another goal of the 
present work is to try to alleviate the problem 
of data sparseness by applying a method, due 
1An excellent survey covering all these topics call be 
found in (Dietterich, 1997). 
53 
to Breiman (1998), for generating new pseudo- 
examples from existing data. As we will see in 
section 4.2 this technique will be combined with 
the construction of an ensemble of classifiers. 
The paper is organized as follows: we start 
by presenting the two versions of the POS tag- 
ger and their evaluation on the reference corpus 
(sections 2 and 3). Sections 4 and 5 are, respec- 
tively, devoted to present the machine-learning 
improvements and to test their implementation. 
Finally, section 6 concludes. 
2 Tree-based Taggers 
Decision trees have been successfully applied to 
a number of different NLP problems and, in par- 
ticular, in POS tagging they have proven to be 
an efficient and compact way of capturing the 
relevant information for disambiguating. See 
(MSxquez, 1999) for a broad survey on this is- 
sue. 
In this approach to tagging, the ambiguous 
words in the training corpus are divided into 
classes corresponding to the sets of tags they 
can take (i.e, 'noun-adjective', 'noun-adjective- 
verb', etc.). These sets are called ambiguity 
classes and a decision tree is acquired for each 
of them. Afterwards, the tree-base is applied in 
a particular disambiguation algorithm. 
Regarding the learning algorithm, we use a 
particular implementation of a top-down induc- 
tion of decision trees (TDIDT) algorithm, be- 
longing to the supervised learning family. This 
algorithm is quite similar to the well-known 
CART (Breiman et al., 1984), and C4.5 (Quin- 
lan, 1993), but it incorporates some particular- 
ities in order to better fit the domain at hand. 
Training examples are collected from anno- 
tated corpora and they consist of the target 
word to be disambiguated and some informa- 
tion of its local context in the sentence. 
All words not present in the training corpus 
are considered unknown. In principle, we have 
to assume that they can take any tag corre- 
sponding to open categories (i.e., noun, proper 
noun, verb, adjective, adverb, cardinal, etc.), 
which sum up to 20 in the Penn Treebank 
tagset. In this approach, an additional ambigu- 
ity class for unknown words is considered, and 
so, they are treated exactly in the same way 
as the other ambiguous words, except by the 
type of information used for acquiring the trees, 
which is enriched with a number of morpholog- 
ical features. 
Once the tree-model has been acquired, it can 
be used in many ways to disambiguate a real 
text. In the following sections, 2.1 and 2.2, we 
present two alternatives. 
2.1 RTT: A Reductionistic Tree-based 
Tagger 
RTT is a reductionistic tagger in the sense of 
Constraint Grammars (Karlsson et al., 1995). 
In a first step a word-form frequency dictionary 
provides each input word with all possible tags 
with their associated lexical probability. After 
that, an iterative process reduces the ambiguity 
(discarding low probable tags) at each step until 
a certain stopping criterion is satisfied. 
More particularly, at each step and for each 
ambiguous word (at a sentence level) the work 
performed in parallel is: 1) The target word 
is "passed" through its corresponding decision 
tree; 2) The resulting probability distribution is 
used to multiplicatively update the probability 
distribution of the word; and 3) The tags with 
very low probabilities are filtered out. 
For more details, we refer the reader to 
(Mgrquez and Rodrfguez, 1997). 
2.2 STT: A Statistical Tree-based 
Tagger 
The aim of statistical or probabilistic tagging 
(Church, 1988; Cutting et al., 1992) is to as- 
sign the most likely sequence of tags given the 
observed sequence of words. For doing so, two 
kinds of information are used: the lexical prob- 
abilities, i.e, the probability of a particular tag 
conditional on the particular word, and the con- 
textual probabilities, which describe the proba- 
bility of a particular tag conditional on the sur- 
rounding tags. 
Contextual (or transition) probabilities are 
usually reduced to the conditioning of the pre- 
ceding tag (bigrams), or pair of tags (tri- 
grams), however, the general formulation allows 
a broader definition of context. In this way, the 
set of acquired statistical decision trees can be 
seen as a compact representation of a rich con- 
textual model, which can be straightforwardly 
incorporated inside a statistical tagger. The 
point here is that the context is not restricted to 
the n-1 preceding tags as in the n-gram formu- 
lation. Instead, it is extended to all the contex- 
S4 
tual information used for learning the decision 
trees. 
The Viterbi algorithm (described for instance 
in (Deaose, 1988)),. in which n-gram probabil- 
ities are substituted by the application of the 
corresponding decision trees, allows the calcu- 
lation of the most-likely sequence of tags with 
a linear cost on the sequence length. However, 
one problem appears when applying condition- 
ings on the right context of the target word, 
since the disambiguation proceeds from left to 
right and, so, the right hand side words may be 
ambiguous. Although dynamic programming 
can be used to Calculate the most likely sequence 
of tags to the right (in a forward-backward ap- 
proach), we use a simpler approach which con- 
sists of calculating the contextual probabilities 
by a weighted average of all possible tags for the 
right context. 
Additionally, the already presented tagger al- 
lows a straightforward incorporation of n-gram 
probabilities, by linear interpolation, in a back- 
off approach including, from most general to 
most specific, unigrams, bigrams, trigrams and 
decision trees. From now on, we will refer to 
STT as STT + when using n-gram information. 
Due to the high ambiguity of unknown words, 
their direct inclusion in the statistical tagger 
would result in a severe decreasing of perfor- 
mance. To avoid this situation, we apply the 
tree for unknown words in a pre-process for fil- 
tering low probable tags. In this way, when en- 
tering to the tagger the average number of tags 
per unknown word is reduced from 20 to 3.1. 
3 Evaluation of the Taggers 
3.1 Domain of Application 
We have used a portion of about 1,17 Mw of the 
Wall Street Journal (WSJ) corpus, tagged ac- 
cording to the Penn Treebank tag set (45 differ- 
ent tags). The corpus has been randomly parti- 
tioned into two subsets to train (85%) and test 
(15%) the system. See table 1 for some details 
about the used corpus. 
The training corpus has been used to create a 
word form lexicon --of 45,469 entries-- with the 
associated lexical probabilities for each word. 
The training corpus contains 239 different 
ambiguity classes, with a number of examples 
ranging from few dozens to several thousands 
(with a maximum of 34,489 examples for the 
preposition-adverb-particle ambiguity). It is 
noticeable that only the 36 most frequent am- 
biguity classes concentrate up to 90% of the 
ambiguous occurrences of the training corpus. 
Table 2 contains more information about the 
number of ambiguity classes necessary to cover 
a concrete percentage of the training corpus. 
Training examples for the unknown-word am- 
biguity class were collected from the training 
corpus in the following way: First, the training 
corpus is randomly divided into twenty parts of 
equal size. Then, the first part is used to extract 
the examples which do not occur in the remain- 
ing nineteen parts, that is, taking the 95% of the 
corpus as known and the remaining 5% to ex- 
tract the examples. This procedure is repeated 
with each of the twenty parts, obtaining approx- 
imately 22,500 examples from the whole corpus. 
The choice of dividing by twenty is not arbi- 
trary. 95%-5% is the proportion that results in 
a percentage of unknown words very similar to 
the test set (i.e., 2.25%) 2 . 
Finally, the test set has been used as com- 
pletely fresh material to test the taggers. All 
results on tagging accuracy reported in this pa- 
per have been obtained against this test set. 
3.2 Results 
In this experiment we used six basic discrete- 
valued features to disambiguate know n ambigu- 
ous words, which are: the part-of-speech tags 
of the three preceding and two following words, 
and the orthography of the word to be disam- 
biguated. 
For tagging unknown words, we used 20 at- 
tributes that can be classified into three groups: 
• Contextual information: part-of-speech 
tags of the two preceding and following 
words. 
• Orthographic and Morphological informa- 
tion (about the target word): prefixes (first 
two symbols) and suffixes (last three sym- 
bols); Length; Multi-word?; Capitalized?; 
Other capital letters?; Numerical charac- 
ters?; Contain dots? 
• Dictionary-related information: Does the 
target word contains any known word as 
a prefix (or a suffix)?; Is the target word 
:See (M£rquez, 1999) for a discussion on the appro- 
priateness of this procedure. 
55 
Training 
Test 
S W W/S AW T/W T/AW T/DW U 
40,977 998,354 24.36 339,916(34.05%) 1.48 2.40 -- -- 
7,167 175,412 24.47 59,440 (33.89%) 1.45 2.40 3.49 3,941 (2.25%) 
Total 48,144 1,173,766 24.38 399,356 (34.02%) 1.47 2.40 -- -- 
Table 1: Information about the WSJ training and test corpora. S: number of sentences; W: number 
of words; W/S: average number of words per sentence; AW: number and percentage of ambiguous 
words; T/W: average number of tags per word; T/AW: average number of tags per ambiguous 
t:nown word; T/DW: average number of tags per ambiguous word (including unknown words); and 
U: number and percentage of Unknown words 
Classes I 8 11 14 18 36 57 111 239 J 
Table 2: Number of ambiguity classes that cover the x% of the ambiguous words of the training 
corpus 
the prefix (or the suffix) of any word in the 
lexicon? 
The last group of features are inspired in 
those applied by Brill (1995) when addressing 
unknown words. 
The learning algorithm 3 acquired, in about 
thirty minutes, a base of 191 trees (the other 
ambiguity classes had not enough examples) 
which required about 0,68 Mb of storage. 
The results of the taggers working with this 
tree-base is presented in table 3. MFT stands 
for a baseline most-frequent-tag tagger. RTT, 
STT, and STT + stand for the basic versions of 
the taggers presented in section 2. The over- 
all accuracy is reported in the first column. 
Columns 2, 3, and 4 contain the tagging ac- 
curacy on some specific groups of words: un- 
known words, ambiguous words (excluding un- 
known words) and known words which is the 
complementary of the set of unknown words. 
Column 5 shows the speed of each tagger 4 and, 
finally, the 'Memory' column reflects the size of 
the used language model (the lexicon is not con- 
sidered). 
Three main conclusions can be extracted: 
• RTT and STT approaches obtain almost 
the same results in accuracy, however RTT 
is faster. 
ZThe programs were implemented using PERL-5.0 
and they were run on a SUN UltraSparc2 machine with 
194Mb of RAM. 
4More than absolute figures what is important here 
is the performance of each tagger relative to the others. 
5TT obtains better results when it incor- 
porates bigrams and trigrams, with a slight 
time-space penalty. 
The accuracy of all taggers is comparable 
to the best state-of-the art taggers under 
the open vocabulary assumption (see sec- 
tion 5.2). 
4 Machine-Learning-based 
Improvements 
Our purpose is to improve the performance on 
two types of ambiguity classes, namely: 
Most frequent ambiguity classes. We fo- 
cused on the 26 most representative classes, 
which concentrate the 86% of the am- 
biguous occurrences. From these, eight 
(24.1%) were already resolved at almost 
100% of accuracy, while the remaining eigh- 
teen (61.9%) left some room for improve- 
ment. Section 4.1 explain which meth- 
ods have been applied to construct en- 
sembles for these eighteen classes plus the 
unknown-word ambiguity class. 
Ambiguity classes with few examples. We 
considered the set of 82 ambiguity classes 
with a number of examples between 50 and 
3,000 and an accuracy rate lower than 95%. 
They agglutinate 48,322 examples (14.24% 
of the total ambiguous occurrences). Sec- 
tion 4.2 explains the applied method to 
increase the number of examples of these 
classes. 
56 
i 
Tagger 
MFT 
RTT 
STT . 
STT + 
Overall Known Ambiguous Unknown Speed Memory 
92.75% 94.25% 83.40% 27.43% 2818 w/s 0 Mb 
96.61% 97.01% 91.36% 79.22% 426 w/s 0.68 Mb 
96.63% 97.02% 91.40% 79.60% ~ 321 w/s 0.68 Mb 
96.84% 97.21% 91.95% 80.70% ! 302 w/s 0.90 Mb 
Table 3: Tagging accuracy, speed, and storage requirement of RTT and STT taggers 
4.1 Ensembles of Decision Trees 
The general methods for constructing ensem- 
bles of classifiers are based on four tech- 
niques: 1) Resampling the training data, e.g. 
Boosting (Freund and Schapire, 1995), Bagging 
(Breiman, 1996), and Cross-validated Commit- 
tees (Parmanto et al., 1996); 2) Combining dif- 
ferent input features (Cherkauer, 1996; Tumer 
and Ghosh, 1996); 3) Changing output repre- 
sentation, e.g. ECOC (Dietterich and Bakiri, 
1995) and PWC-CC (Moreira and Mayoraz, 
1998); and 4) Injecting randomness (Dietterich, 
1998). 
We tested several of the preceding methods 
on our domain. Below, we briefly describe those 
that reported major benefits. 
4.i.1 Bagging (BAG) 
From a training set of n examples, severaI sam- 
ples of the same size are extracted by randomly 
drawing, with replacement, n times. Such new 
training sets are called bootstrap replicates. In 
each replicate, some examples appear multiple 
times, while others do not appear. A classifier is 
induced from each bootstrap replicate and then 
they are combined in a voting approach. The 
technique is called bootstrap aggregation, from 
which the acronym bagging is derived. In our 
case, the bagging approach was performed fol- 
lowing the description of Breiman (1996), con- 
structing 10 replicates for each data set 5. 
4.1.2 Combining Feature Selection 
Criteria, (FSC) 
In this case, the idea is to obtain different clas- 
sifiers by applying several different functions for 
feature selection inside the tree induction algo- 
rithm. In particular, we have selected a set of 
seven functions that achieve a similar accuracy, 
namely: Gini Impurity Index, Information Gain 
and Gain Ratio, Chi-square statistic (X2), Sym- 
metrical Tau criterion, RLM (a distance-based 
method), and a version of RELIEF-F which uses 
the Information Gain function to assign weights 
to the features. The first five are described, 
for instance, in (Sestito and Dillon, 1994), RLM 
is due to LSpez de M£ntaras (1991), and, fi- 
nally, RELIEF-F is described in (Kononenko et 
al., 1995). Since the applied feature selection 
functions are based on different principles, we 
expect to obtain biased classifiers with comple- 
mentary information. 
4.1.3 Combining Features (FCOMB) 
We have extended the basic set of six features 
with lexical information about words appear- 
ing in the local context of the target word, and 
with the ambiguity classes of the same words. 
In this way, we consider information about the 
surrounding words at three different levels of 
specificity: word form, POS tag, and ambiguity 
class. 
Very similar to Brill's lexical patterns (Brill, 
1995), we also have included features to capture 
collocational information. Such features are ob- 
tained by composition of the already described 
single attributes and they are sequences of con- 
tiguous words and/or POS tags (up to three 
items). 
The resulting features were grouped accord- 
ing to their specificity to generate ensembles 
of eight trees 6. The idea here is that spe- 
cific information (lexical attributes and colloca- 
tional patterns) would produce classifiers that 
cover concrete cases (hopefully, with a high pre- 
cision), while more general information (POS 
tags) would produce more general (but proba- 
bly less precise) trees. The combination of both 
type of trees should perform better because of 
the complementarity of the information. 
5Several authors indicate that most of the potential 
improvement provided by bagging is obtained within the 
first ten replicates. 
6The features for dealing with unknown words were 
combined in a similar way to create ensembles of 10 trees. 
For details, see (M£rquez, 1999). 
57 
4.2 Generating Pseudo-Examples 
(CPD) 
Breiman (1998), describes a simple and effective 
method for generating new pseudo-examples 
fl'om existing data and incorporating them into 
a tree-based learning algorithm to increase pre- 
diction accuracy in domains with few training 
exalnples. We call this method CPD (standing 
for generation of Convex Pseudo-Data). 
The method for obtaining new data from the 
old is similar to the process of gene combination 
to create new generations in genetic algorithms. 
First, two examples of the same class are se- 
lected at random from the training set. Then, 
a new example is generated from them by se- 
lecting attributes from one or another parent 
according to a certain probability. This prob- 
ability depends on a single generation param- 
eter (a real number between 0 and 1), which 
regulates the amount of change allowed in the 
combination step. 
In the original paper, Breiman does not pro- 
pose any optimization of the generation param- 
eter, instead, he performs a limited amount of 
trials with different values and simply reports 
the best result. In our domain, we observed 
a big variance on the results depending on the 
concrete values of the generation parameter. In- 
stead of trying to tune it, we generate several 
training sets using different values of the genera- 
tion parameter and we construct an ensemble of 
decision trees. In this way, we make the global 
classifier independent of the particular choice, 
and we generally obtain a combined result which 
is more accurate than any of the individuals. 
5 Experiments and Results 
5.1 Constructing and Evaluating 
Ensembles 
First, the three types of ensembles were applied 
to the 19 selected ambiguity classes in order to 
decide which is the best in each case. The eval- 
uation was performed by means of a 10-fold 
cross-validation ,using the training corpus. The 
obtained results confirm that all methods con- 
tribute to improve accuracy in almost all do- 
mains. The absolute improvement is not very 
impressive but the variance is generally very low 
and, so, the gain is statistically significant in the 
majority of cases. Summarizing, BAG wins in 8 
cases, FCOMB in 9, and FSC in 2 (including the 
unknown-word class). 
These results are reported in table 4, in which 
the error rate of a single basic tree is compared 
to the results of the ensembles for each ambi- 
guity class r. The last column presents the per- 
centage of error reduction for the best method 
in each row. 
Second, CPD was applied to the 82 selected 
ambiguity classes, with positive results in 59 
cases, from which 25 were statistically signifi- 
cant (again in a 10-fold cross-validation exper- 
iment). These 25 classes agglutinate 20,937 ex- 
amples and the error rate was diminished, o11 
average, from 20.16% to 18.17%. 
5.2 Tagging with the Enriched Model 
Ensembles of classifiers were learned for the am- 
biguity classes explained in the previous sec- 
tions using the best technique in each case. 
These ensembles were included in the tree-base, 
used by the basic taggers of section 3, substi- 
tuting the corresponding individual trees, and 
both taggers were tested again using the en- 
riched model. 
At runtime, the combination of classifiers was 
done by averaging the results of each individual 
decision tree. 
In order to test the relative improvement 
of each component, the inclusion of the en- 
sembles is performed in three steps: 'CPD ~ 
stands for the ensembles for infrequent ambi- 
guity classes, 'ENS' stands for the ensembles for 
frequent ambiguity classes and unknown words, 
and 'CPD-~ENS' stands for the inclusion of both. 
Results are described in table 5. 
Some important conclusions are: 
• The best result of each tagger is signifi- 
cantly better than each corresponding ba- 
sic version, and the accuracy consistently 
grows as more components are added. 
• The relative improvement of STT + is lower 
than those of RTT and STT, suggesting 
than the better the tree-based model is, 
the less relevant is the inclusion of n-gram 
information. 
• The special treatment of low frequent am- 
biguity classes results in a very small con- 
tribution, indicating that there is no much 
7These figures are calculated by averaging the resu|ts 
of the ten folds. 
58 
A-class #exs 
IN-RB-RP i 
2 VBD-VBN 
3 NN-VB-VBP 
4 VB-VBP 
5 JJ-NN 
6 NNS-VBZ I 
7 NN-NNP 
8 JJ-VBD-VBN 
9 NN-VBG 
I0 JJ-NNP 
ii JJ-RB 
12 DT-IN-RB-WDT 
13 JJR-RBR 
14 NNP-NNPS-NNS 
15 JJ-NN-RB 
16 JJ-NN-VB 
17 JJ-NN-VBG 
18 JJ-VBG 
Total 
19 unknown-word 
34,489 
25,882 
24,522 
17,788 
17,077 
15,295 
13,824 
11,403 
9,597 
8 724 
8 722 
8 419 
2 868 
2 808 
2 625 
2 145 
1.986 
1.980 
210.154 
22.594 
%exs 
10.16% 
7.63% 
7.23% 
5.24% 
5.03% 
4.51% 
4.07% 
3.36% 
2.83% 
2.57% 
2.57% 
2.48% 
o.85% 
0.83% 
0.77% 
0.63% 
0.59% 
0.58% 
61.93% 
B asic 
8.30% 
7.44% 
4.10% 
4.13% 
14.71% 
5.14% 
9.67% 
19.18% 
14.11% 
5.10% 
10.45% 
7.01% 
16.40% 
36.50% 
i5.31% 
13.32% 
20.30% 
21.11% 
9.35% 
20.87% 
BAG 
7.31% 
5.93% 
3.7o% 
3.62% 
13.30% 
4.37% 
9.10% 
17.91% 
12.53% 
4.5o% 
8.86% 
6.49% 
15.84% 
36.50% 
13.32% 
13.87% 
17.98% 
18.89% 
8.38% 
17,47% 
FSC 
7.79% 
6.64% 
3.84% 
3.94% 
13.50% 
4.59% 
8.37% 
18.05% 
12.93% 
4.56% 
9.75% 
6.84% 
15.28% 
35.14% 
11.83% 
12.99% 
18.79% 
19.39% 
8.61% 
16.86% 
FCOMB 
7.23% 
6.28% 
3.58% 
3.76% 
13.55% 
4.34% 
6.83% 
17.27% 
12.99% 
4.35% 
9.68% 
6.53% 
14.72% 
35.oo% 
12.44% 
12.75% 
18.23% 
19.60% 
8.25% 
17.21% 
BestER 
12.89% 
20.30% 
12.68% 
12.35% 
9.59% 
15.56% 
29.37% 
9.96% 
11.20% 
14.71% 
15.22% 
7.42% 
10.24% 
4.11% 
22.73% 
4.28% 
11.43% 
10.52% 
13.40% 
19.26% 
Table 4: 
classes 
Comparative results (error rates) of different ensembles on the most significant ambiguity 
Tagger 
RTT 
RTT(cPD) 
RTT(ENS) 
RTT(cPD+ENS) 
STT 
STT(cPD) 
STT(ENs) 
STT(cPD+ENS) 
STT + 
STT+(cPD) 
STT+(ENS) 
5TT+(CPD+ENS) 
Overall Known Ambig. Unknown 
96.61% 97.00% 91.36% 79.21% 
96.66% 97.06% 91.51% 79.25% 
96.99% 97.30% 92.23% 83.25% 
97.05% 97.37% 92.48% 83.30% 
96.63% 97.02% 91.40% 79.60% 
96.69% 97.07% 91.56% 79.69% 
97.05% 97.36% 92.38% 83.78% 
97.10% 97.40% 92.51% 83.68% 
96.84% 97.21% 91.95% 80.70% 
96.88% 97.25% 92.09% 80.77% 
97.19% 97.48% 92.73% 84.47% 
97.22% 97.51% 92.81% 84.54% 
Speed Memory 
426 w/s 0.68Mb 
366 w/s 0.93Mb 
97 w/s 3.53Mb 
89 w/s 3.78Mb 
321 w/s 0.68Mb 
261 w/s 0.93Mb 
70 w/s 3.53Mb 
64 w/s 3.78Mb 
302 w/s 0.90Mb 
235 w/s 1.15Mb 
65 w/s 3.75Mb 
60 w/s 3.97Mb 
Table 5: Tagging accuracy, speed, and storage requirements of enriched RTT and 5TT taggers 
to win from these classes, unless we were 
able to fix their errors in a much greater 
proportion than we really did. 
• The price to pay for the enriched models is 
a substantial overhead in storage require- 
ment and speed decreasing, which in the 
worst case is divided by 5. 
In order to compare our results to others, 
we list in table 6 the results reported by sev- 
eral state-of-the-art PO5 taggers, tested on 
the WSJ corpus with the open vocabulary as- 
sumption. In that table, TBL stands for 
Brill's transformation-based error-driven tag- 
get (Brill, 1995), ME stands for a tagger based 
on the ma×imum entropy modelling (Ratna- 
parkhi, 1996), SPATTER stands for a statisti- 
cal parser based on decision trees (Magerman, 
1996), IGTREE stands for the memory-based 
tagger by Daelemans et al. (1996), and, finally, 
TComb stands for a tagger that works by com- 
bination of a statistical trigram-based tagger, 
59 
Tagger 
TBL 
ME 
SPATTER 
IGTREE 
TComb 
STT+(CPD+ENS) 
Train Test 
950 Kw 150 Kw 
963 Kw 193 Kw 
~975 Kw 47 Kw 
2,000 kw 200 Kw 
1,i00 Kw 265 Kw 
Overall Known Unknown 
96.6% -- 82.2% 
96.5% -- 86.2% 
96.5% -- -- 
96.4% 96.7% 90.6% 
97.2% -- -- 
Ambig 
998 Kw 175 Kw 97.2% 97.5% 84.5% 92.8% 
Table 6: Comparison of different tuggers on the WSJ corpus 
TBL and ME (Brill and Wu, 1998). 
Comparing to all the individual tuggers we 
observe that our approach reports the highest 
accuracy, and that it is comparable to that of 
YComb obtained by the combination of three 
tuggers. This is encouraging, since we have im- 
proved an individual POS tagger which could be 
further introduced as a better component in an 
ensemble of tuggers. 
Unfortunately, the performance on unknown 
words is difficult to compare, since it strongly 
depends on the used lexicon. For instance, 
IGTREE does not include in the lexicon the num- 
bers appearing in the training set, and, so, any 
number in the test set is considered unknown 
(they report an unusually high percentage of 
Unknown words: 5.5% compared to our 2.25%). 
The fact that numbers are very easy to rec- 
ognize could explain their outstanding results 
on tagging unknown words. ME also reports 
a higher percentage of unknown words, 3.2%, 
• while TBL says nothing about this issue. 
6 Conclusions and Further Work 
In this paper, we have applied several ML tech- 
niques for constructing ensembles of classifiers 
to address the most representative and/or diffi- 
cult cases of ambiguity within a decision-tree- 
based English POS tagger. As a result, the over- 
all accuracy has been significantly improved. 
Comparing to other approaches, we see that our 
tagger performs better on the WSJ corpus and 
under the open vocabulary assumption, than a 
number of state-of-the-art POS tuggers, and 
similar to another approach based on the com- 
bination of several tuggers s. 
8However, it has to be said that the pure statistical 
or machine-learning based approaches to POS tagging 
still significantly underperform some sophisticated man- 
ually constructed systems, such as the English shallow 
parser based on Constraint Grammars developed at the 
Helsinki University (Samuelsson and Voutilainen, 1997). 
The cost of this improvement has been quan- 
tiffed in terms of storage requirement and speed 
of the resulting enriched tuggers. Of course, 
there exists a clear tradeoff between accuracy 
and efficiency which should be resolved on the 
basis of the user needs. Although all proposed 
techniques are fully automatic, it has to be said 
that the construction of appropriate ensembles 
requires a significant human and computational 
effort. 
There are several features that should be fur- 
ther studied with respect to the used methods 
for constructing the ensembles of decision trees, 
the way they are combined and included in the 
tuggers, etc. However, we are now more inter- 
ested on experimenting with the inclusion of our 
tagger as a component in an ensemble of pre- 
existing tuggers, in the style of (Brill and Wu, 
1998; van Halteren et al., 1998). 
More generally, one may think that, after all 
the involved effort, the achieved improvement 
seems small. On this particular, we think that 
we are moving very close to the best achiev- 
able results using fully statistically-based tech- 
niques, and that some kind of specific human 
knowledge should be jointly considered in order 
to achieve the next qualitative step. We also 
think that other issues than simply 'accuracy 
rates' are becoming more important in order to 
test and evaluate the real utility of different ap- 
proaches for tagging. Such aspects, that should 
be studied in the near future, refer to the abil- 
ity of adapting to new domains (tuning), the 
types of errors committed and their influence 
on the task at hand, the language independence 
assumption, etc. 
Acknowledgments 
This research has been partially funded by the 
Spanish Research Department (CICYT's ITEM 
project TIC96-1243-C03-02), by the EU Corn- 
60 
mission (EuroWordNet LE4003) and by the 
Catalan Research Department (CIRIT's con- 
solidated research group 1997SGR 00051, and 
CREL project). 

References 

K. M. All and M. J. Pazzani. 1996. Error Reduction 
through Learning Multiple Descriptions. Machine 
Learning, 24(3): 173-202. 

A. Blum and T. Mitchell. 1998. Combining Labeled 
and Unlabeled Data with Co-Training. In Pro- 
ceedings of the ilth Annual Conference on Com- 
putational Learning Theory, COLT-98, pages 92- 
100, Madison, Wisconsin. 

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. 
Stone. 1984. Classification and Regression Trees. 
Wadsworth International Group, Belmont, CA. 

L. Breiman. 1996. Bagging Predictors. Machine 
Learning, 24(2): !23-140- 

L. Breiman. 1998. I Using Convex Pseudo-Data to 
Increase Prediction Accuracy. Technical Report, 
Statistics Department. University of California, 
Berkeley, CA. 

E. Brill and J. Wu. 1998. Classifier Combination 
for Improved L exical Disambiguation. In Proceed- 
ings of the joint COLING-ACL'98, pages 191- 
195, Montreal, Canada. 

E. Brill. 1995. Transformation-based Error-driven 
Learning and Natural Language Processing: A 
Case Study in Part-of-speech Tagging. Compu- 
tational Linguistics, 21(4) :543-565. 

E. Charniak. 1993. Statistical Language Learning. 
The MIT Press, Cambridge, Massachusetts. 

K.J. Cherkauer. 1996. Human Expert-level Perfor- 
mance on a Scientific Image Analysis Task by 
a System Using Combined Artificial Neural Net- 
works. In P. Chan, editor, Working Notes of the 
AAAI Workshop on Integrating Multiple Learned 
Models, pages 15-21. 

K. W. Church. 1988. A Stochastic Parts Program 
and Noun Phrase Parser for Unrestricted Text. 
In Proceedings of the 1st Conference on Applied 
Natural Language Processing, ANLP, pages 136- 
143. ACL. 

D. Cutting, J. Kupiec, J. Pederson, and P. Sibun. 
1992. A Practical Part-of-speech Tagger. In Pro- 
ceedings of the 3rd Conference on Applied Natu- 
ral Language Processing, ANLP, pages 133-140. 
ACL. 

W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. 
1996: MBT: A Memory-Based Part-of-speech 
Tagger Generator. In Proceedings of the 4th 
Workshop on Very Large ColTora , pages 14-27, 
Copenhagen, Denmark. 

S. J. DeRose. 1988. Grammatical Category Disam- 
biguatlon by Statistical Optimization. Computa- 
tional Linguistics, 14:31-39. 

T. G. Dietterich and G. Bakiri. 1995. Solving Mul- 
ticlass Learning Problems via Error-Correcting 
Output Codes. Journal of Artificial Intelligence 
Research, 2:263-286. 

T. G. Dietterich. 1997. Machine Learning Research: 
Four Current Directions. AI Magazine, 18(4):97- 
136. 

T. G. Dietterich. 1998. An Experimental Compar- 
ison of Three Methods for Constructing Ensem- 
bles of Decision Trees: Bagging, Boosting, and 
Randomization. Machine Learning, pages 1-22. 

Y. Freund and R. E. Schapire. 1995. A Decision- 
Theoretic Generalization of On-line Learning 
and an Application to Boosting. In Pro- 
ceedings of the 2nd European Conference on 
Computational Learning Theory, EuroCOLT'95, 
Barcelona, Spain. 

A. R. Golding and D. Roth. 1999. A Winnow-based 
Approach to Spelling Correction. Machine Learn- 
ing, Special issue on Machine Learning and Nat- 
ural Language Processing. 

H. van Halteren, J. Zavrel, and W. Daelemans. 
1998. Improving Data Driven Wordclass Tagging 
by System Combination. In Proceedings of the 
joint COLING-A CL'98, pages 491-497, MontrEal, 
Canada. 

H. van Halteren. 1996. Comparison of Tagging 
Strategies, a Prelude to Democratic Tagging. S. 
Hockney and N. Ide (eds.), Clarendon Press. 
Research in Humanities Computing 4. Selected 
papers for the ALLC/ACH Conference, Christ 
Church, Oxford. 

F. Karlsson, A. Voutilainen, J. Heikkil£, and 
A. Anttila, editors. 1995. Constraint Grammar: 
A Language-Independent System for Parsing Un- 
restricted Text. Mouton de Gruyter, Berlin and 
New York. 

I. Kononenko, E. Simec, and M. Robnik-Sikouja. 
1995. Overcoming the Myopia of Inductive Learn- 
ing Algorithms with RELIEFF. Applied Intelli- 
gence, 10:39-55. 

R. LSpez de Mhntaras. 1991. A Distance-Based At- 
tribute Selection Measure for Decision Tree In- 
duction. Machine Learning, Kluwer Academic, 
6(1):81-92. 

D. M. Magerman. 1996. Learning Gramnaatical 
Structure Using Statistical Decision-Trees. In 
Proceedings of the 3rd International Colloquium 
on Grammatical Inference, ICGL Springer-Verlag 
Lecture Notes Series in Artificial Intelligence 

L. M~trquez and H. Rodrfguez. 1997. Automatically 
Acquiring a Language Model for POS Tagging Us- 
ing Decision Trees. In Proceedings of the Second 
Conference on Recent Advances in Natural Lan- 
guage Processing, RANLP, pages 27-34, Tzigov 
Chark, Bulgaria. 

L. M~rquez, L. Padr5, and H. Rodrfguez. 1998. 
hnproving Tagging Accuracy by Voting Taggers. 
In Proceedings of thc 2nd Conference on Nat- 
ural Language Processing ~ Industrial Applica- 
lions, NLP+IA/TAL+AI, pages 149-155, New 
Brunswick, Canada. 

L. Mhrquez. 1999. Part-of-Speech Tagging: A Ma- 
chine Learning Approach based on Decision Trees. 
Phd. Thesis, Dep. Llenguatges i Sistemes In- 
fortuities. Universitat Polit~cnica de Catalunya. 
(Forthcoming) 

M. Moreira and E. Mayoraz. 1998. Improved Pair- 
wise Coupling Classification with Correcting Clas- 
sifiers. In Proceedings of the lOth European Con- 
ference on Machine Learning, ECML, pages 160- 
171, Chemnitz, Germany. 

B. Parmanto, P.W. Munro, and H.R. Doyle. 1996. 
hnproving Committee Diagnosis with Resampling 
Techniques. In M.C. Mozer D.S. Touretzky and 
M.E. Hesselmo, editors, Advances in Neural In- 
formation Processing Systems, volume 8, pages 
882-888. MIT Press., Cambridge, MA. 

J. R. Quinlan. 1993. C4.5: Programs for Machine 
Learning. Morgan Kaufmann Publishers, Inc., 
San Mateo, CA. 

L. R. Rabiner. 1990. A Tutorial on Hidden Markov 
Models and Selected Applications in Speech Recog- 
nition. Readings in Speech Recognition (eds. A. 
Waibel, K. F. Lee). Morgan Kaufmann Publish- 
ers, Inc., San Mateo, CA. 

A. Ratnaparkhi. 1996. A Maximum Entropy Part- 
of-speech Tagger. In Proceedings of the 1st Con- 
ference on Empirical Methods in Natural Lan- 
guage Processing, EMNLP'96. 

C. Samuelsson and A. Voutilainen. 1997. Compar- 
ing a Linguistic and a Stochastic Tagger. In Pro- 
ceedings of the 35th Annual Meeting of the Asso- 
ciation for Computational Linguistics, pages 246- 
253, Madrid, Spain. 

R. E. Schapire and Y. Singer. 1998. BoosTexter: A 
system for multiclass multi-label text categoriza- 
tion. Unpublished. Postscript version available at 
AT&T Labs. 

R. E. Schapire, Y. Singer, and A. Singhal. 1998. 
Boosting and Rocchio applied to text filtering. In 
In Proceedings of the 21st Annual International 
Conference on Research and Development in In- 
formation Retrieval, SIGIR '98. 

S. Sestito and T. S. Dillon. 1994. Automated 
Knowledge Acquisition. T. S. Dillon (ed.), Series 
in Computer Systems Science and Engineering. 
Prentice Hall, New York/London. 

K. Tumer and J. Ghosh. 1996. Error Correla- 
tiou and Error Reduction in Ensemble Classifiers. 
Connection Science. Special issue on combining 
artificial neural networks: ensemble appTvache.~, 
8(3 and 4):385-404. 
