POS tagger combinations on Hungarian text
Andr as Kuba, L aszl o Felf oldi, Andr as Kocsor
Research Group on Arti cial Intelligence,
University of Szeged,
Aradi vrt. 1, H-6720, Hungary
fakuba, lfelfold, kocsorg@inf.u-szeged.hu
Abstract
In this paper we will brie y survey
the key results achieved so far in
Hungarian POS tagging and show
how classi er combination tech-
niques can aid the POS taggers.
Methods are evaluated on a manu-
ally annotated corpus containing 1.2
million words. POS tagger tests
were performed on single-domain,
multiple domain and cross-domain
test settings, and, to improve the ac-
curacy of the taggers, various com-
bination rules were implemented.
The results indicate that combina-
tion schemas (like the Boosting al-
gorithm) are promising tools which
can signi cantly degrade the classi-
 cation errors, and produce a more
e ective tagger application.
1 Introduction an related works
Part-of-speech (POS) tagging is perhaps one
of the most basic tasks in natural language
processing. In this paper we will review
the current state-of-the-art in Hungarian POS
tagging, and investigate the possibilities of
improving the results of the taggers by ap-
plying classi er combination techniques.
We used a Transformation Based Learner
(TBL) as the base tagger for combination
experiments, and the learner algorithm de-
termines the set of applicable combination
schemes. From this set we chose two algo-
rithms called Bagging and Adaboost.M1.
In the next subsection, the most impor-
tant published results of the last few years
in Hungarian POS tagging are summarized.
The TBL tagger is described in detail in Sec-
tion 2, then the corpora and the data sets
we used for our investigations are presented
in Section ??. The classi er combination and
details about the implementation of this tech-
nique are described in Section 3. After the
results of the boosted tagger are presented in
Section 4. Lastly, some conclusions about the
e ectiveness and e ciency of our boosting ap-
proach are made in the  nal section.
1.1 POS Tagging of Hungarian Texts
Standard POS tagging methods were applied
to Hungarian as soon as the  rst annotated
corpora appeared that were big enough to
serve as a training database for various meth-
ods. The TELRI corpus (Dimitrova et al.,
1998) was the  rst corpus that was used
for testing di erent POS tagging methods.
This corpus contains approximately 80; 000
words. Later, as the Hungarian National Cor-
pus (V aradi, 2002) and the Manually Anno-
tated Hungarian Corpus (the Szeged Corpus)
(Alexin et al., 2003) became available, an op-
portunity was provided to test the results on
bigger corpora (153M and 1.2M words, re-
spectively).
In recent years several authors have pub-
lished many useful POS tagging results in
Hungarian. It is generally believed that, ow-
191
ing to the fairly free word order and the ag-
glutinative property of the Hungarian lan-
guage, there are more special problems as-
sociated with Hungarian than those of the
Indo-European languages. However, the lat-
est results are comparable to results achieved
in English and other well-studied languages.
Fruitful approaches for Hungarian POS tag-
ging are Hidden Markov Models, Transforma-
tion Based Learning and rule-based learning
methods.
One of the most common POS tagging ap-
proaches is to build a tagger based on Hid-
den Markov Models (HMM). Tu s (Tu s et
al., 2000) reported good results with the Tri-
grams and Tags (TnT) tagger (Brants, 2000).
A slightly better version of TnT was employed
by Oravecz (Oravecz and Dienes, 2002), and
it achieved excellent results. In their pa-
per, Oravecz and Dienes (Oravecz and Dienes,
2002) argue that regardless of the rich mor-
phology and relatively free word order, the
POS tagging of Hungarian with HMM meth-
ods is possible and e ective once one is able
to handle the data sparsity problem. They
used a modi ed version of TnT that was sup-
ported by an external morphological analyzer.
In this way the trigram tagger was able to
make better guesses about the unseen words
and therefore to get better results. An ex-
ample of the results achieved by this trigram
tagger is presented in the  rst row of Table 1.
Another approach besides the statistical
methods is the rule-based learning one. A
valuable feature of the rule-based methods is
that the rules these methods work with are
usually more intelligible to humans than the
parameters of statistical methods. For Hun-
garian, a few such approaches are available in
the literature.
In a comprehensive investigation,
Horv ath et al. (Horv ath et al., 1999) applied
 ve di erent machine learning methods to
Hungarian POS tagging. They tested C4.5,
PHM, RIBL, Progol and AGLEARN (Alexin
et al., 1999) methods on the TELRI corpus.
The results of C4.5 and the best tagger found
in this investigation (RIBL) are presented
in the second and third rows of Table 1.
Tagger Accuracy
TnT + Morph. Ana. 98.11%
C4.5 97.60%
Best method (RIBL) 98.03%
RGLearn 97.32%
TBL 91.94%
TBL + Morph. Ana. 96.52%
Best combination 96.95%
Table 1: Results achieved by various Hungar-
ian POS taggers.
H ocza (H ocza et al., 2003) used a di erent
rule generalization method called RGLearn.
Row 4 shows the test results of that tagger
in Table 1. Transformation Based Learning
is a rule-based method that we will discuss
in depth in Section 2. Megyesi (Megyesi,
1999) and Kuba et al. (Kuba et al., 2004)
produced results with TBL taggers that
are given in Table 1, in rows 5 and 6,
respectively. Kuba et al. (Kuba et al., 2004)
performed experiments with combinations of
various tagger methods. The combinations
outperformed their component taggers in
almost every case. However, in the di erent
test sets, di erent combinations proved the
best, so no conclusion could be drawn about
the best combination. The combined tagger
that performed best on the largest test set is
shown in row 7 of Table 1.
2 The TBL tagger
Transformation Based Learning (TBL) was
introduced by Brill (Brill, 1995) for the task
of POS tagging. Brill’s implementation con-
sists of two processing steps. In the  rst step,
a lexical tagger calculates the POS tags based
on lexical information only (word forms). The
result of the lexical tagger is used as a  rst
guess in the second run where both the word
forms and the actual POS tags are applied by
the contextual tagger. Both lexical and con-
textual taggers make use of the TBL concept.
During training, TBL performs a greedy
search in a rule space in order to  nd the rules
that best improve the correctness of the cur-
192
rent tagging. The rule space contains rules
that change the POS tag of some words ac-
cording to their environments. From these
rules, an ordered list is created. In the tag-
ging phase, the rules on the rule list are ap-
plied one after another in the same order as
the rule list. After the last rule is applied, the
current tag sequence is returned as a result.
For the Hungarian language, Megyesi ap-
plied this technique initially with moderate
success. (Megyesi, 1998) The weak part of
her  rst implementation was the lexical mod-
ule of the tagger, as described in (Megyesi,
1999). With the use of extended lexical tem-
plates, TBL produced a much better perfor-
mance but still lagged behind the statistical
taggers.
We chose a di erent approach that is simi-
lar to (Kuba et al., 2004). The  rst guess of
the TBL tagger is the result of the baseline
tagger. For the second run, the contextual
tagger implementation we used is based on
the fnTBL learner module. (Ngai and Flo-
rian, 2001) We used the standard parameter
settings included in the fnTBL package.
2.1 Baseline Tagger
The baseline tagger relies on an external mor-
phological analyzer1 to get the list of possible
POS tags. If the word occurs in the training
data, the word gets its most frequent POS
tag in the training. If the word does not ap-
pear in the training, but representatives of its
ambiguity class (words with the same possible
POS tags) are present, then the most frequent
tag of all these words will be selected. Other-
wise, the word gets the  rst tag from the list
of possible POS tags.
Some results produced by the baseline tag-
ger and the improvements achieved by the
TBL tagger are given in Table 2.
3 Classi er Combinations
The goal of designing pattern recognition sys-
tems is to achieve the best possible classi ca-
tion performance for the speci ed task. This
objective traditionally led to the development
1We used the Humor (Pr osz eky, 1995) analyzer de-
veloped by MorphoLogic Ltd.
Test Domain Baseline TBL
full corpus 94.94% 96.52%
business news 97.56% 98.26%
cross-domain 79.51% 95.79%
Table 2: Accuracy of the baseline tagger and
the TBL tagger.
of di erent classi cation schemes for recogni-
tion problems the user would like solved. Ex-
periments shows that although one of the de-
signs should yield the best performance, the
sets of patterns misclassi ed by the di erent
classi ers do not necessarily overlap. These
observations motivated the relatively recent
interest in combining classi ers. The main
idea behind it is not to rely on the decision
of a single classi er. Rather, all of the induc-
ers or their subsets are employed for decision-
making by combining their individual opin-
ions to produce a  nal decision.
3.1 Bagging
The Bagging (Bootstrap aggregating) algo-
rithm (Breiman, 1996) applies majority vot-
ing (Sum Rule) to aggregate the classi ers
generated by di erent bootstrap samples. A
bootstrap sample is generated by uniformly
sampling m instances from the training set
with replacement. T bootstrap samples
B1; B2; :::; BT are generated and a classi er
Ci is built from each bootstrap sample Bi.
Bagging algorithm
Require: Training Set S = fx1; : : :;xmg
Ensure: Combined classi er ^C
for i = 1 : : :T do
S0 = bootstrap sample from S
Train classifer Ci on S0
end for
^C(x) = argmax
j
X
i
I[Ci(x) = !j]
For a given bootstrap sample, an instance
in the training set will have a probability
1  (1  1=m)m of being selected at least
once from the m instances that are randomly
picked from the training set. For large m,
193
this is about 1-1/e = 63.2%. This perturba-
tion causes di erent classi ers to be built if
the inducer is unstable (e.g. ANNs, decision
trees) and the performance may improve if the
induced classi ers are uncorrelated. However,
Bagging can slightly degrade the performance
of stable algorithms (e.g. kNN) since e ec-
tively smaller training sets are used for train-
ing.
3.2 Boosting
Boosting (Freund and Schapire, 1996) was
introduced by Shapire as a method for im-
proving the performance of a weak learning
algorithm. AdaBoost changes the weights of
the training instances provided as input for
each inducer based on classi ers that were
previously built. The  nal decision is made
using a weighted majority voting schema for
each classi er, whose weights depend on the
performance of the training set used to build
it.
Adaboost algorithm
Require: Training Set S = fx1; : : :;xmg
Ensure: Combined classi er ^C
d(1)j = 1=N for all j = 1 : : :m
for t = 1 : : :T do
Train classi er Ct with respect to the dis-
tribution d(t).
Calculate the weighted training error  t
of Ct:  t =
mX
j=1
d(t)j I[Ct(xj) 6= !j]
if  t > 1=2 then Exit
Set  t:  t = 12log1   t 
t
d(t+1)j = d(t+1)j 1Z
t
e  tI[Ct(x)6=!j],
where Zt is the normalization constant,
such that:
mX
j=1
d(t+1)j = 1
end for
^C(x) = argmax
j
X
t
 tI[Ct(x) = !j]
The boosting algorithm requires a weak
learning algorithm whose error is bounded by
a constant strictly less than 1/2. In the case of
multi-class classi cations this condition might
be di cult to guarantee, and various tech-
niques may need to be applied to get round
this restriction.
There is an important issue that relates
to the construction of weak learners. At
step t the weak learner is constructed based
on the weighting dt. Basically, there are
two approaches for taking this weighting into
account. In the  rst approach we assume
that the learning algorithm can operate with
reweighted examples. For instance, when the
learner minimizes a cost function, one can
construct a revised cost function which as-
signs weights to each of the examples. How-
ever, not all learners can be easily adapted to
such an inclusion of the weights. The other
approach is based on resampling the data with
replacement. This approach is more general
as it is applicable to all kinds of learners.
4 Boosted results for TBL
TBL belongs to the group of learners that gen-
erates abstract information, i.e. only the class
label of the source instance. Although it is
possible to transcribe the output format to
con dence type, this limitation degrades the
range of the applicable combination schemes.
Min, Max and Prod Rules cannot produce
a competitive classi er ensemble, while Sum
Rule and Borda Count are equivalent to ma-
jority voting. From the set of available boost-
ing algorithms we may only apply those meth-
ods that do not require the modi cation of the
learner.
For the experiments we chose Bagging and
a modi ed Adaboost.M1 algorithm as boost-
ing. Since the learner is incapable of han-
dling instance weights, individual training
datasets were generated by bootstrapping (i.e.
resamping with replacement). The origi-
nal Adaboost.M1 algorithm requires that the
weighted error should be below 50%. In this
case the modi ed algorithm reinitializes the
instance weights and goes on with the pro-
cessing.
The Boosting algorithm is based on weight-
ing the independent instances based on the
classi cation error of the previously trained
194
0 20 40 60 80 1001
1.2
1.4
1.6
1.8
2
number of classifiers
classification error [%]
Recognition error on training dataset
0 20 40 60 80 1001
1.2
1.4
1.6
1.8
2
number of classifiers
classification error [%]
Recognition error on test dataset
Figure 1: Classi cation error of Bagging and
Boosting algorithm on the training and train-
ing datatsets ( dashed: Bagging, solid: Boost-
ing).
learners. TBL operates on words, but words
are not treated as independent instances.
Their context, the position in the sentence,
a ects the building of the classi er. Thus
instead of the straightforward selection of
words, the boosting method handles the sen-
tences as instance samples. The classi cation
error of the instances are calculated as the
arithmetic mean of the classi cation errors
of the words in the corresponding sentence.
Despite this, the combined  nal error is ex-
pressed as the relative number of the misclas-
si ed words.
The training and testing datasets were cho-
sen from the business news domain of the
Szeged Corpus. The train database contained
about 128,000 annotated words in 5700 sen-
tences. The remaining part of the domain,
the test database, has 13,300 words in 600
sentences.
The results of training and testing error
rates are shown in Fig.1. The classi cation
error of the stand-alone TBL algorithm on the
test dataset was 1.74%. Boosting is capable of
decreasing it to below 1.31%, which means a
24.7% relative error reduction. As the graphs
show, boosting achieves this during the  rst
20 iterations, so further processing steps can-
not make much di erence to the classi cation
accuracy. It can also be seen that the train-
ing error does not converge to a zero-error
level. This behavior is due to the fact that
the learner cannot maintain the 50% weighted
error limit condition. Bagging achieved only
a moderate gain in accuracy, its relative error
reduction rate being 18%.
5 Conclusions
In this paper we investigated the possibility
of improving the classi cation performance of
POS taggers by applying classi er combina-
tion schemas. For the experiments we chose
TBL as a tagger and Bagging and Boosting
as combination schemas. The results indi-
cates that Bagging and Boosting can reduce
the classi cation error of the TBL tagger by
18% and 24.7%, respectively.
It is clear that further improvements
could be made by tailoring the tagger algo-
rithm to the requirements of more sophis-
ticated boosting methods like Adaboost.M2
and other derivatives optimized for multi-
class recognition. Another promising idea
for more e ective combinations is that of
applying con dence-type generative (ANN,
kNN, SVM) or discriminative (HMM) learn-
ers (Duda et al., 2001; Bishop, 1995; Vap-
nik, 1998). These kinds of classi ers pro-
vide more information about the recognition
results and can improve the cooperation be-
tween the combiner and the classi ers. In the
future we plan to investigate these alterna-
tives for constructing better tagger combina-
tions.
195
References
Zolt an Alexin, Szilvia Zvada, and Tibor
Gyim othy. 1999. Application of AGLEARN
on Hungarian Part-of-Speech tagging. In
D. Parigot and M. Mernik, editors, Sec-
ond Workshop on Attribute Grammars and
their Applications, WAGA’99, pages 133{
152, Amsterdam, The Netherlands. INRIA
rocquencourt.
Zolt an Alexin, J anos Csirik, Tibor Gyim othy,
K aroly Bibok, Csaba Hatvani, G abor Pr osz eki,
and L aszl o Tihanyi. 2003. Manually anno-
tated Hungarian corpus. In Proceedings of the
Research Note Sessions of the 10th Confer-
ence of the European Chapter of the Associa-
tion for Computational Linguistics, EACL’03,
pages 53{56, Budapest, Hungary.
C. M. Bishop. 1995. Neural Networks for Pattern
Recognition. Oxford University Press.
Thorsten Brants. 2000. TnT { a statistical part-
of-speech tagger. In Proceedings of the Sixth
Applied Natural Language Processing, ANLP-
2000, Seattle, WA.
L. Breiman. 1996. Bagging predictors. Machine
Learning, 24, no. 2:123{140.
Eric Brill. 1995. Transformation-based error-
driven learning and natural language process-
ing: A case study in part-of-speech tagging.
Computational Linguistics, 21(4):543{565.
Ludmila Dimitrova, Toma z Erjavec, Nancy Ide,
Heiki Jaan Kaalep, Vladimir Petkevic, and Dan
Tu s. 1998. Multext-east: Parallel and com-
parable corpora and lexicons for six Central
and Eastern European languages. In Christian
Boitet and Pete Whitelock, editors, Proceedings
of the Thirty-Sixth Annual Meeting of the Asso-
ciation for Computational Linguistics and Sev-
enteenth International Conference on Compu-
tational Linguistics, pages 315{319, San Fran-
cisco, California. Morgan Kaufmann Publish-
ers.
R. O. Duda, P. E. Hart, and D. G. Stork. 2001.
Pattern Classi cation. John Wiley and Son,
New York.
Yoav Freund and Robert E. Schapire. 1996. Ex-
periments with a new boosting algorithm. In
Morgan Kaufmann, editor, Proc International
Conference on Machine Learning, pages 148{
156, San Francisco.
Andr as H ocza, Zolt an Alexin, D ora Csendes,
J anos Csirik, and Tibor Gyim othy. 2003. Ap-
plication of ILP methods in di erent natural
language processing phases for information ex-
traction from Hungarian texts. In Proceedings
of the Kalm ar Workshop on Logic and Com-
puter Science, pages 107{116, Szeged, Hungary.
Tam as Horv ath, Zolt an Alexin, Tibor Gyim othy,
and Stefan Wrobel. 1999. Application of dif-
ferent learning methods to Hungarian Part-of-
Speech tagging. In S. D zeroski and P. Flach,
editors, Proceedings of ILP99, volume 1634 of
LNAI, pages 128{139. Springer Verlag.
Andr as Kuba, Andr as H ocza, and J anos Csirik.
2004. POS tagging of Hungarian with com-
bined statistical and rule-based methods. In
Petr Sojka, Ivan Kopecek, and Karel Pala, ed-
itors, Text, Speech and Dialogue, Proceedings
of the 7th International Conference, TSD 2004,
pages 113{121, Brno, Czech Republic.
Be ata Megyesi. 1998. Brill’s rule-based POS tag-
ger for Hungarian. Master’s thesis, Department
of Linguistics, Stockholm University, Sweden.
Be ata Megyesi. 1999. Improving Brill’s POS
tagger for an agglutinative language. In Pro-
ceedings of the Joint Sigdat Conference on Em-
pirical Methods in Natural Language Process-
ing and Very Large Corpora, EMNLP/VLC ’99,
pages 275{284.
Grace Ngai and Radu Florian. 2001.
Transformation-based learning in the fast
lane. In Proceedings of North American ACL
2001, pages 40{47, June.
Csaba Oravecz and P eter Dienes. 2002. E cient
stochastic Part-of-Speech tagging for Hungar-
ian. In Proceedings of the Third International
Conference on Language Resources and Evalu-
ation, LREC2002, pages 710{717, Las Palmas.
G abor Pr osz eky. 1995. Humor: a morphological
system for corpus analysis. In H. Retting, edi-
tor, Language Resources for Language Technol-
ogy, Proceedings of the First European TELRI
Seminar, pages 149{158, Tihany, Hungary.
Dan Tu s, P eter Dienes, Csaba Oravecz, and
Tam as V aradi. 2000. Principled hidden tagset
design for tiered tagging of Hungarian.
V. N. Vapnik. 1998. Statistical Learning Theory.
John Wiley and Son.
Tam as V aradi. 2002. The Hungarian National
Corpus. In Proceedings of the Third Interna-
tional Conference on Language Resources and
Evaluation, LREC2002, pages 385{396, Las
Palmas de Gran Canaria.
196
