Morphological Tagging: Data vs. Dictionaries 
Jan Haji~.* 
Department of Computer Science 
Johns Hopkins University 
Baltimore, MD 21218 
hajic@cs.jhu.edu 
Abstract 
Part of Speech tagging for English seems to have 
reached the the human levels of error, but full mor- 
phological tagging for inflectionally rich s, 
such as Romanian, Czech, or Hungarian, is still an 
open problem, and the results are far from being 
satisfactory. This paper presents results obtained 
by using a universalized exponential feature-based 
model for five such s. It focuses on the data 
sparseness issue, which is especially severe for such 
s (the more so that there are no extensive 
annotated data for those s). In conclusion, 
we argue strongly that the use of an independent 
morphological dictionary is the preferred choice to 
more annotated data under such circumstances. 
1 Full Morphological Tagging 
English Part of Speech (POS) tagging has been 
widely described in the recent past, starting with 
the (Church, 1988) paper, followed by numerous 
others using various methods: neural networks (Ju- 
lian Benello and Anderson, 1989), HMM tagging 
(Merialdo, 1992), decision trees (Schmid, 1994), 
transformation-based error-driven learning (Brill, 
1995), and maximum entropy (Ratnaparkhi, 1996), 
to select just a few. However different the methods 
were, English dominated in these tests. 
Unfortunately, English is a morphologically "im- 
poverished" : there are no complicated 
agreement relations, word order variation is mini- 
real, and the morphological categories are either ex- 
tremely simple (-s for plural of nouns, for example), 
or (almost) nonexistent (cases expressed by inflec- 
tion, for example) - with not too many exceptions 
and irregularities. Therefore the number of tags se- 
lected for an English tagset is not that large (40-75 
in the typical case). Also, the average ambiguity 
is low (2.32 tags per token on the manually tagged 
* The work described herein has been started and largely 
done within author's home institution, the Institute of For- 
mal and Applied Linguistics, Charles University, Prague, CZ, 
within the project VS96151 of the Ministry of Education 
of the Czech Republic and partially also under the grant 
405/96/K214 of the Grant Agency of the Czech Republic. 
Wall Street Journal part in the Penn Treebank, for 
example). 
Highly inflective and agglutinative s are 
different. Obviously we can limit the number of tags 
to the major part-of-speech classes, plus some (like 
the Xerox Language Tools (Chanod, 1997) for such 
s do), and in fact achieve similar perfor- 
mance, but that limits the usefulness of the results 
thus obtained for further analysis. These s, 
obviously, do not use the rich inflection just for the 
amusement (or embarrassment) of their speakers (or 
NLP researchers): the inflectional categories carry 
important information which ought to be known at 
a later time (e.g., during parsing). Thus one wants 
not only to tell apart verbs from nouns, but also 
nominative from genitive, masculine animate from 
inanimate, singular from plural - all of them being 
often ambiguous one way or the other. 
The average tagset, as found even in a moderate 
corpus, contains between 500 and 1,000 distinct tags 
- whereas the size of the set of possible and plausible 
tags can reach 3,000 to 5,000. Obviously, any of the 
statistical methods used for English (even if fully 
supervised) clash with (or, fall through) the data 
sparseness problem (see below Table 1 for details). 
There have been attempts to solve this problem 
for some of the highly inflectional European lan- 
guages ((Daelemans et al., 1996), (Erjavec et al., 
1999), (Tufts, 1999), and also our own in (Haji~ 
and Hladk~, 1997), (Haji~ and Hladk~, 1998), see 
also below), but so far no method nor a tagger has 
been evaluated against a larger number of those lan- 
guages in a similar setting, to allow for a side-by- 
side comparison of the difficulty (or ease) of full 
morphological tagging of those s. Thanks 
to the Multext-East project (V6ronis, 1996a), there 
are now five annotated corpora available (which are 
manually fully morphologically tagged) to perform 
such experiments. 
2 The Languages Used and The 
Training Data 
We use the Multext-East-annotated version of the 
Orwell's 1984 novel in Czech, Estonian, Hungarian, 
94 
Romanian and Slovene I. The annotation uses a sin- 
gle SGML-based formal scheme, and even common 
guidelines for tagset design and annotation, nev- 
ertheless the tagsets differ substantially since the 
s differ as well: Romanian is a French-like 
romance , Hungarian is agglutinative, and 
the other s are more or less inflectional- 
type s 2. The annotated data contains about 
100k tokens (including punctuation) for each lan- 
guage; out of those, the first 20k tokens has been 
used for testing, the rest for training. We have also 
extended the tag identifiers by appending a string 
of hyphens ('-') to suit the exponential tagger which 
expects the tags to be of equal length; the mapping 
was 1:1 for all tags in all s, since the "long" 
tags are in fact the Multext-East standard. 
From the tagging point of view, the  char- 
acteristics displayed in Table 1 are the most rele- 
vant 3 . 
3 The Methodology 
The main tagger used for the comparison experiment 
is the probabilistic exponential-model-based, error- 
driven learner we described in detail in (Haji~ and 
Hladk~, 1998). Modifications had to be made, how- 
ever, to make it more universal across s. 
3.1 Structure of the Model 
The model described in (Haji~ and Hladk~, 1998) 
is a general exponential (specifically, a log-linear) 
model (such as the one used for Maximum Entropy- 
based models): 
pAc(ylx ) = exp(~\]in_1AJi(y, x)) z(x) (1) 
where fi(y,x) is a binary-valued feature of the 
event value being predicted and its context, A~ is 
a weight of the feature fi, and Z(x) is the natural 
normalization factor. This model is then essentially 
reduced to Naive Bayes by the approximation of the 
IThere are more s involved in the Multext-East 
project, but only these five s have been really care- 
fully tagged; English is unfortunately tagged using Eric Brill's 
tagger trained in unsupervised mode, leaving multiple output 
at almost every ambiguous token, and Bulgarian is totally 
unusable since it has been tagged automatically with only 
a baseline tagger. The English results reported below thus 
come from the Penn Treebank data, from which we have used 
roughly 100,000 words to match the training data sizes for the 
remaining s. For Czech, Hungarian, and Slovene we 
use later versions of the annotated data (than those found on 
the Multext-East CD) which we obtained directly from the 
authors of the annotations after the Multext-CD had been 
published, since the new data contain rather substantial im- 
provements over the originally published data. 
2For detailed account of the lexical characteristics of these 
s, see (Vdronis, 1996b). 
3We have included English here for comparison purposes, 
since these characteristics are independent of the annotation. 
Ai parameters, which is done because there are mil- 
lions of possible features in the pool and thus the full 
entropy maximization is prohibitively expensive, if 
we want to select a small number of features instead 
of keeping them all. 
The tags are predicted separately for each mor- 
phological category (such as POS, NUMBER, 
CASE, DEGREE OF COMPARISON, etc.). The 
model makes an extensive use of so-called "ambigu- 
ity classes" (ACs). An ambiguity class is a set of 
values (such as genitive and accusative) of a single 
category (such as CASE) which arises for some word 
forms as a result of morphological analysis. For un- 
ambiguous word forms (unambiguous from the point 
of view of a certain category), the ambiguity class set 
contains only a single value; for ambiguous forms, 
there are 2 or more values in the AC. For example, 
let's suppose we use part-of-speech (POS), number 
and tense as morphological categories for English; 
then the word form "borrowed" is 2-way ambiguous 
in POS ({V, J} for verb and adjective, respectively), 
unambiguous in number (linguistic arguments apart, 
number is typically regarded "not applicable" to ad- 
jectives as well as to almost all forms of verbs in 
English), and 3-way ambiguous in tense ({P,N,-) 
for past tense, past participle, and "not applicable" 
in the adjective form). 
The predictions of the models are always condi- 
tioned on the ambiguity class of the category (POS, 
NUMBER, ...) in question. In other words, there is 
a separate model for each category and an ambigu- 
ity class from that category. Naturally, there is no 
model for unambiguous ACs classes. However, even 
though the ambiguity classes bring very valuable in- 
formation about the word form being tagged and a 
reliable information about the context (since they 
are fixed during tagging), using ACs causes also an 
unwelcome effect of partitioning the already scarce 
data and also effectively ignores statistics of the un- 
ambiguous cases. 
The context of features uses the neighboring words 
(original word forms) and ambiguity classes on sub- 
tags, where their relative position in text might be 
either fixed (0, -1, +1) or "variable" using a value of 
the POS subtag as the "stop here" criterion, up to 
4 text positions (words) apart. 
3.2 General Subtag Features 
The original model uses the ambiguity classes not 
only for conditioning on context in features, but also 
for the individual models based on category and an 
AC. 
More general features have been introduced, 
which do not depend on the ambiguity class of the 
subtag being predicted any more. This allows to 
learn also from unambiguous tokens. However, the 
training time is increased dramatically by doing so 
since all events in the training data have to be taken 
95 
Language 
English 4 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
Table 1: Training data in numbers 
Training Size Tagset Size Ambiguous Tokens 
99903 
87071 
81383 
102992 
104583 
94457 
139 
970 
476 
401 
486 
1033 
38.65% 
45.97% 
40.24% 
21.58% 
40.00% 
38.01% 
into consideration, as opposed to the case of training 
the small AC-based model, when only those training 
events which contain the particular AC are used. 
3.3 Variable Distance Condition 
The "stop" criterion for finding the appropriate rel- 
ative position was originally based on hard coded 
choices suitable for the Czech  only, and of 
course it depended on the tagset as well. This depen- 
dency has been removed by selecting the appropriate 
conditions automatically when building the pool of 
possible features at the initialization phase 5 (using 
the relative frequency of the POS ambiguity classes, 
and a threshold to cut off less frequent categories to 
limit the size of the feature pool). 
3.4 Weight Variation 
Even though the full computation of the appropriate 
feature weight is still prohibitive (the more so when 
the general features are added), the learner is now 
allowed to vary the weights (in several discrete steps) 
during feature selection, as a (somewhat crude) at- 
tempt to depart from the Naive Bayes simplification 
to the approximation of the "correct" Maximum En- 
tropy estimation. 
3.5 Handling Unknown Words 
In order to compare the effects of (not) using an in- 
dependent dictionary, we have added an unknown 
word handling module to the code. 6 It extracts 
the prefix and suffix frequency information (and the 
combination thereof) from the training data. Then, 
for each of the combinations, it selects the most fre- 
quent set of tags seen in the training data and stores 
it for later use. When tagging, the data is first piped 
through a "guesser" which assigns to the unknown 
words such a set of possible tags which is stored with 
the longest matching prefix/suffix combination. 
5Also, the use of variable-distance context may be switched 
off entirely. 
6Originally, the code relied exclusively on the use of such 
an independent dictionary. Since the coverage of the Czech 
dictionary we have used is extensive, we have been simply 
ignoring the unknown word problem altogether in the past. 
4 The Results 
4.1 Reporting Error Rate: Words vs. 
Tokens 
Since "best-only" tagging has been carried out, the 
error rate (i.e, 100 - accuracy in %) measure has 
been used throughout as the only evaluation crite- 
rion. However, since some results reported previ- 
ously were apparently obtained using only the "real" 
words as the total for accuracy evaluation, whereas 
in other experiments every token counts (including 
punctuation 7, for example), we have computed both 
and report them separately s. 
4.2 Availability of Dictionary Information 
We use two methods to obtain the set of possible 
tags for any given word form (i.e., to analyze it mor- 
phologically). Both methods include handling un- 
known words. First, we use only information which 
may be obtained automatically from the manually 
annotated corpus (we call this method automatic). 
This is the way the Maximum Entropy tagger (Rat- 
naparkhi, 1996) runs if one uses the binary version 
from the website (see the comparison in Section 5). 
However, it is not unreasonable to assume that a 
larger independent dictionary exists which can help 
to obtain a list of possible tags for each word form 
in test data. This is what we have at our disposal 
for the s in question, since the development 
of such a dictionary was part of the Multext-East 
project. We can thus assume a dictionary info is 
available for unknown words in the test data, i.e., 
even though there is no statistics available for them 
(since they did not appear in the training data), all 
possible tags for (almost 9) every test token are avail- 
able. This method is referred to as independent in 
the following text. 
We have also used a third method of obtain- 
ing a dictionary information (called mized), namely, 
by using only the words from the training data, 
rAnd sometimes a separate token for sentence boundary 
STable 1 has been computed using all tokens. In fact, the 
s differ significantly in the proportion of punctuation: 
from about 18% (English) to 30% (Estonian). 
9Depending on the quality of the independent dictionary. 
Of course, the tagsets must match, which could be a problem 
per se. Here it is simple, since the dictionaries have been 
developed using the same tagsets as the tagged data. 
96 
but complementing the information about them ob- 
tained from the training data by including all other 
possible tags for such words. Therefore the net result 
is that during testing, we have only training words 
at our disposal, but with a complete dictionary in- 
formation (as if coming from a full morphological 
dictionary) 1°. 
The results on the full training data set are sum- 
marized in Table 2. 
The baseline error rate is computed as follows. 
First of all, we use the independent dictionary for 
obtaining the possible tags for each word. Then we 
extract only the lexical information from the current 
position 11 and counts used for smoothing (which is 
based on the ambiguity classes only and it does not 
use lexical information). The system is then trained 
normally, which means it uses the lexical information 
only if the AC-based smoothing 12 alone does not 
work. This baseline method is thus very close to the 
usual baseline method of using simple conditional 
distribution of tags given words. 
The message of Table 2 seems to be obvious; but 
before we jump to conclusions, let's present another 
set of experiments. 
In view of the recent interest in dealing with 
"small s", and with regard to the questions 
of cost-effectiveness of using "human" resources (i.e. 
annotation vs. rule-writing vs. tools development 
etc.), we have also performed experiments with re- 
duced training data size (but with an enriched fea- 
ture pool - by lowering thresholds, adding more of 
the "general features" as described above, etc. - as 
allowed by reasonable time/space constraints). 13 
These results are summarized in Table 3 (using 
only the dictionary derived from the training data), 
Table 4 (using words from training data with mor- 
phological information complemented from a dictio- 
nary) and Table 5 (using the "independent" dictio- 
nary). In all cases, we again count only true words 
(no punctuation). Accordingly, the major POS er- 
ror rate is reported, too (12 POS tags to be dis- 
tinguished only: Noun, Verb, Adjective .... ; see Ta- 
bles 6, 7, and 8). 
1°This arrangement removes the "closed vocabulary" phe- 
nomenon from the test data, since for the Multext-East data, 
we did not have a truly independent vocabulary available. 
11Words from the training data which are not singletons 
(freq > 1) are used. Surprisingly enough, it would not hurt 
to use them too. We believe it is due to the smoothing method 
used. Even though this is valid only for the baseline experi- 
ment, we have observed in general that this form of exponen- 
tial model (with error-driven training, that is) is remarkably 
resistant to overtrainlng. 
12Using ACs linearly interpolated with global unigram sub- 
tag distribution and finally the uniform distribution. 
13By reasonable we mean less than a day of CPU for train- 
ing. 
Table 9: Exponential w/feature selection vs. Max- 
imum Entropy tagger (Words-only Error Rate, no 
dictionary) 
Language Tagger 
Exp. 
English 9.18% 
Czech 18.83% 
Estonian 13.95% 
Hungarian 8.16% 
Romanian 7.76% 
Slovene 16.26% 
MaxEnt 
6.38% 
17.77% 
14.92% 
8.55% 
7.66% 
17.44% 
4.3 Tagger Comparison 
The work (Erjavec et al., 1999) consistently com- 
pares several taggers (HMM, Brill's Transformation- 
based Tagger, Ratnaparkhi's Maximum Entropy 
tagger, and the Daelemans et al.'s Memory-based 
Tagger) on Slovene. We have chosen the Maximum 
Entropy tagger (Ratnaparkhi, 1996) for a compari- 
son with our universal tagger, since it achieved (by 
a small margin) the best overall result on Slovene 
as reported there (86.360% on all tokens) of tag- 
gers available to us (MBT, the best overall, was not 
freely available to us at the time of writing). We 
have trained and tested the Maximum Entropy Tag- 
ger on exactly the same data, using the off-the-shelf 
(java binary only) version. 
The results are compared in Table 9. 
Since we want to show how a tagger accuracy is 
influenced by the amount of training data available, 
we have run a series of experiments comparing the 
results of the exponential tagger to the maximum 
entropy tagger when there is only a limited amount 
of data available. The results are summarized in 
Table 10. Since the public version of the MaxEnt 
tagger cannot be modified to take advantage of nei- 
ther the mixed nor the independent dictionary, we 
have compared it only to the automatic dictionary 
version of the exponential tagger. To save space, 
the results are tabulated only for the training data 
sizes of 2000, 5000 and 20000 words. Again, only 
the "true" word error rate is reported. 
As the tables show, for the s we tested, 
the exponential, feature-based tagger we adapted 
from (Haji~ and Hladk~, 1998) achieves similar re- 
sults as the Maximum Entropy tagger 14 15. (using 
exactly the same (full) training data; the "score" 
is 3:3, with the MaxEnt tagger being substantially 
better on English; probably the development lan- 
14Otherwise the acknowledged leader in English tagging 
15The only substantial difference we noticed was in tagging 
speed. The runtime speed of the MaxEnt tagger is lower, only 
about 10 words per second vs. almost 500 words per second; 
it should be noted however that we are comparing MaxEnt's 
java bytecode and C. 
97 
Table 2: Results (Error rate, ER) on full training data, only true words counted (no punctuation) 
Dictionary: 
Language 
English 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
Automatic 
Baseline Pull 
11.42% 9.18% 
23.02% 18.83% 
16.12% 13.95% 
8.35% 8.16% 
10.87% 7.76% 
20.53% 16.26% 
Mixed 
Baseline Full 
11.40% 7.91% 
22.61% 14.78% 
16.19% 12.98% 
8.31% 8.00% 
10.81% 7.34% 
20.01% 13.29% 
Independent 
Baseline Full 
7.07% 3.58% 
19.40% 9.59% 
9.94% 5.34% 
3.55% 2.58% 
7.49% 3.35% 
17.29% 9.00% 
Table 3: Error rate on reduced training data, dictionary: automatic 
Language 
English 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
I000 
36.20% 
48.22% 
48.14% 
39.68% 
40.61% 
45.84% 
2000 
29.36% 
42.95% 
42.10% 
32.21% 
35.02% 
39.58% 
Training 
5000 
23.47% 
36.54% 
32.44% 
23.94% 
25.06% 
33.12% 
data size 
10000 
18.27% 
30.97% 
26.81% 
18.04% 
19.26% 
28.60% 
20000 
14.46% 
27.08% 
21.51% 
13.92% 
15.16% 
24.50% 
Full 
9.18% 
18.83% 
13.95% 
8.16% 
7.76% 
16.26% 
Table 4: Error rate on reduced training data, dictionary: mixed 
Language 
English 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
Training data size 
I000 
36.15% 
48.97% 
48.24% 
39.87% 
42.85% 
46.74% 
2000 
29.58% 
41.93% 
42.79% 
32.71% 
35.70% 
39.88% 
5000 
22.93% 
34.37% 
32.98% 
23.63% 
25.46% 
32.00% 
10000 
17.70% 
28.10% 
26.60% 
17.98% 
19.23% 
26.20% 
20000 
14.00% 
23.31% 
21.02% 
13.82% 
14.81% 
21.73% 
Full 
7.91% 
14.78% 
12.98% 
8.00% 
7.34% 
13.29% 
Table 5: Error 
Language 
English 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
rate on reduced training data, dictionary: "independent" 
1000 
10.29% 
22.51% 
13.11% 
6.84% 
13.11% 
24.63% 
2000 
7.64% 
18.07% 
11.95% 
5.35% 
9.47% 
19.17% 
Training data size 
5000 10000 
5.53% 4.54% 
17.33% 15.10% 
10.70% 9.29% 
4.29% 4.07% 
7.81% 6.18% 
16.17% 14.12% 
20000 Pull 
3.83% 3.58% 
12.62% 9.59% 
8.10% 5.34% 
3.48% 2.58% 
5.07% 3.35% 
12.62% 9.00% 
guage bias shows herein). However, when the train- 
ing data size goes down, the advantage of predicting 
the single morphological categories separately favors 
the exponential tagger (with the notable and sub- 
stantial exception of English). The less data, the 
larger the difference (Tab 10). 
16On the other hand, the Exponential tagger has been de- 
veloped on Czech originally and it lost on this . It 
should be noted that the original version of the exponen- 
tial tagger did contain more Czech-specific features, and thus 
might in fact do better. 
The resulting accuracy (of both taggers) is still 
unsatisfactory not only from the point of view of 
results obtained on English, but also from the prac- 
tical point of view: approx. 85% accuracy (Czech, 
Slovene) typically means that about five out of six 
10-word sentences contain at least one error in it. 
That is bad news e.g. for parsing projects involving 
tagging as a preliminary step. 
98 
Table 6: POS Error rate on reduced training data, dictionary: automatic 
Language 
English 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
1000 
26.77% 
24.32% 
35.81% 
30.54% 
31.33% 
27.16% 
Training data size 
2000 5000 
20.82% 16.11% 
20.20% 13.46% 
30.52% 23.02% 
24.99% 18.09% 
27.59% 19.24% 
23.15% 17.01% 
10000 
11.86% 
9.70% 
18.26% 
13.15% 
14.51% 
12.89% 
20000 Full 
9.48% 5.64% 
7.22% 3.72% 
14.31% 8.46% 
10.29% 5.81% 
11.25% 5.21% 
9.74% 5.61% 
Table 7: POS Error rate on reduced training data, dictionary: mixed 
Language 
English 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
1000 
26.69% 
24.32% 
36.48% 
30.28% 
33.56% 
27.58% 
Training data size 
2000 
21.09% 
20.61% 
31.76% 
25.25% 
28.34% 
23.30% 
5OOO 
15.82% 
13.47% 
23.55% 
17.59% 
20.03% 
16.85% 
10000 
11.53% 
10.19% 
18.21% 
12.89% 
14.52% 
12.59% 
20000 
9.08% 
7.37% 
14.32% 
10.15% 
11.03% 
9.88% 
Full 
4.94% 
3.76% 
8.2O% 
5.64% 
5.04% 
5.12% 
Table 8: POS Error rate on reduced training data, dictionary: "independent" 
Language Training data size 
1000 2000 5000 10000 20000 Full 
-English 6.42% 5.36% 3.63% 3.02% 2.53% 2.43% 
Czech 3.21% 2.85% 2.17% 2.01% 1.65% 1.12% 
Estonian 6.71% 6.32% 5.27% 4.31% 3.77% 2.36% 
Hungarian 5.35% 4.42% 3.39% 3.18% 2.75% 2.04% 
Romanian 9.51% 6.54% 5.36% 4.00% 3.18% 1.89% 
Slovene 6.10% 5.19% 4.04% 3.59% 3.25% 2.08% 
5 Conclusions 
5.1 The Differences Among Languages 
The following discussion abstracts from the tagset 
design, relying on the fact that the Multext-East 
project has been driven by common tagset guidelines 
to an unprecedented extent, given the very different 
s involved. At the same time, we acknowl- 
edge that even so, their design for the individual 
s might have influenced the results. Also, 
the quality of the annotation is an important factor; 
we believe though that the later data we obtained 
for the experiments described here are within the 
range of usual human error and do not suffer from 
negligence 1~. 
First of all, it is clear that these s differ 
substantially just by looking at the simple training 
17Specifically, we are sure that the post-release Czech, 
Slovene and Hungarian data we are using are without anno- 
tation defects beyond the usual occasional annotation error, 
as they have been double checked, and we also believe that 
the other two s are reasonably clean. Bulgarian, al- 
though present on the CD, is unfortunately unusable since it 
has not been manually annotated; for English, see above. 
data statistics, where the number of unique tags seen 
in a relatively small collection of about 100k tokens is 
high - from 401 (Hungarian) to 1033 (Slovene); com- 
pare that to English with only 139 tags. However, it 
is interesting to see that the average per-token ambi- 
guity is much more narrowly distributed, and in fact 
English ranks 3rd (after Hungarian and Slovene), 
Czech being the last with almost every other token 
ambiguous on average. This ambiguity does not cor- 
respond with the results obtained: Slovene, being 
the second least ambiguous, is the second most dif- 
ficult to tag. Only Czech behaves consistently by 
tailing the pack in both cases. 
5.2 Comparison to Previous Results 
Any comparison is necessarily difficult due to differ- 
ent evaluation methodologies, even within the "best- 
only", accuracy-based reporting. Nevertheless, we 
will try. 
For Romanian, Tufts in his recent work (Tufts, 
1999) reports 98.5% accuracy (i.e. 1.5% error rate) 
on Romanian, using the classifier combination ap- 
proach advocated by e.g. (Brill and Wu, 1998). His 
99 
Table 10: Error rate comparison on reduced training data, automatic dictionary 
Language 
English 
Czech 
Estonian 
Hungarian 
Romanian 
Slovene 
Training data size 
2000 
ME Exp 
26'03% 29.36% 
50.77% 42.95% 
51.08% 42.10% 
41.12% 32.21% 
42.88% 35.02% 
49.46% 39.58% 
5000 
ME Exp 
17.70% 23.47% 
41.95% 36.54% 
40.09% 32.44% 
30.68% 23.94% 
30.07% 25.06% 
39.34% 33.12% 
20000 
ME Exp 
9.61% 14.46% 
28.16% 27.08% 
25.50% 21.51% 
17.27% 13.92% 
16.67% 15.16% 
27.77% 24.50% 
results are well above the 3.29% error rate achieved 
here (with even a larger tagset of 1391 vs. 486 here), 
but the paper does not say how this number has been 
computed (training data size, the all-token/words- 
only question) thus making any conclusions difficult 
to make. He also argues that his method is  
independent but no results are mentioned for other 
s. 
For Czech, previous work achieved similar results 
(6.20% on newspaper text using the all-tokens-based 
error rate computation, on 160,000 training tokens; 
vs. 7.04% here on approx, half that amount of train- 
ing data; same handling of unknown words). This is 
in line with the expectations, since the same method- 
ology (tagging as well as evaluation) has been used, 
except the features used in that work were specifi- 
cally tuned to Czech. 
The most detailed account of Slovene (Erjavec et 
al., 1999) reports various results, which might not 
be directly comparable because it is unclear whether 
they use the all-tokens-based or words-only compu- 
tation of the error rate. They report 6.421% error 
rate on the full tagset on known words, and 13.583% 
on all words (tokens?) including unknown words 
(the exponential tagger we used achieved 13.82% on 
all tokens, 16.26% on words only). They use almost 
the same data (Orwell's 1984, but leaving out the 
Appendices) ls. They also report that the original 
Czech-specific exponential tagger used as a basis for 
the work reported here achieved 7.28% error rate on 
Slovene on full tags on the same data, which means 
that by the changes to the exponential tagger aimed 
at its  independence we introduced in Sec- 
tion 3, we have not achieved any improvement (on 
Slovene) of the exp. tagger (the error rate stayed at 
7.26% - using all-tokens-based evaluation numbers, 
dictionary available; but the data was not exactly 
the same, presumably). 
5.3 Dictionary vs. Training Data 
This is, according to our opinion, the most interest- 
ing result of the experiments described so far. As 
18Their tag count is lower (1021) than here (1033), but 
that's not really relevant. They do not report the average 
ambiguity or a similar measure. 
100 
already Table 2 clearly suggests, even the baseline 
tagging results obtained with the help of an indepen- 
dent dictionary are comparable (if not better) than 
the fully-trained tagger on 100k words, but without 
the dictionary information. The situation is even 
clearer when comparing the POS-only results: here 
the "independent" dictionary results are better by 
far, with almost no training data needed. 
Looking at the characteristics of the s, it 
is apparent that the inflections cause the problem: 
the coverage of a previously unseen text is inferior to 
the usual coverage of English or another analytical 
. Therefore, unless we can come up with 
a really clever way of learning rules for dealing with 
previously unseen words, it is clearly strongly prefer- 
able to work on a morphological dictionary 19, rather 
than to try to annotate more data. 
6 Future Work 
We would like to compare more taggers using still 
other methodologies, especially the MBT tagger, 
which achieved the best results on Slovene but which 
was not available to us at the time of writing this 
paper. Obviously, we would also like to use the clas- 
sifter combination method on them, to confirm the 
really surprisingly good results on Romanian and 
test it on the other s as well. 
We would also like to enrich the best taggers avail- 
able today (such as the Maximum Entropy tagger) 
by using the dictionary information available and 
compare the results with the exponential feature- 
based tagger we have been using in the experiments 
here. 
For Czech and Slovene, the results are still far be- 
low what one would like to see (in absolute terms). It 
seems that the key lies in the initial feature set defi- 
nition - including statistical tagset clustering, which 
might potentially lead to more reliable estimates of 
certain parameters while using still the same size of 
training data. 
19Not necessarily manually - apparently, even a partially 
supervised method would be of tremendous help. 
7 Acknowledgements 
The author wishes to thank many Multext-East par- 
ticipants for their efforts to improve the original 
data, especially to Niki Petkevi~, Tomaz Erjavec, 
Heiki-Jaan Ka~lep and G£bor Pr6sz6ky, and for pro- 
viding the final versions of the annotated data for 
the experiments. Any errors and mistakes are solely 
to be blamed on the author, not the annotators, of 
course. 

References 

Eric Brill and Jun Wu. 1998. Classifier combination 
for improved lexical disambiguation. In Proceedings of ACL/COLING , pages 191-195, Montreal, Canada. ACL/ICCL. 

Eric Brill. 1995. Transformation-based error-driven 
learning and natural  processing: A case 
study in part-of-speech tagging. Computational 
Linguistics, 21:543-565. 

Jean-Pierre Chanod. 1997. Current developments 
for Central & Eastern European s. In 
Proceedings of EU Project meeting TELRI I, Romania. 

Kenneth W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. 
In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136-143, 
Austin, Texas. ACL. 

Walter Daelemans, Jakub Zavrel, Peter Berck, and 
Steven Gillis. 1996. MBT: A memory-based part 
of speech tagger generator. In Proceedings of 
WVLC 4, pages 14-27. ACL. 

Tomaz Erjavec, Saso Dzeroski, and Jakub Zavrd. 
1999. Morphosyntactic Tagging of Slovene: Evaluating PoS Taggers and Tagsets. Technical Report IJS-DP 8018, Dept. for Intelligent Systems, 
Jozef Stefan Institute, Ljubljana, Slovenia, April 
2nd. 

Jan Hajic and Barbora Hladka. 1997. Tagging of inflective s: a comparison. In Proceedings 
of ANLP 'gT, pages 136-143, Washington, DC. 
ACL. 

Jan Haji~ and Barbora Hladk& 1998. Tagging 
inflective s: Prediction of morphological categories for a rich, structured tagset. In 
Proceedings of ACL / COLING'98, pages 483-490, 
Montreal, Canada. ACL/ICCL. 

Andrew W. Mackie Julian Benello and James A. Anderson. 1989. Syntactic category disambiguation 
with neural networks. Computer Speech and Language, 3:203-217. 

Bernard Merialdo. 1992. Tagging text with a 
probabilistic model. Computational Linguistics, 
20(2):155-171. 

Adwait Ratnaparkhi. 1996. A maximum entropy 
model for part-of-speech tagging. In Proceedings 
of EMNLP 1, pages 133-142. ACL. 

Helmut Schmid. 1994. Probabilistic part-of-speech 
tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, pages 44-49, Manchester, England. 

Dan Tufts. 1999. Tiered tagging and combined lan- 
guage models classifiers. In Proceedings of Text, 
Speech and Dialogue'99, Mari~nskd LLzn~, Czech 
Republic, Sept. 15-18. 

Jean Vdronis. 1996a. Multext-East (Copernicus 106). 

Jean Vdronis. 1996b. Multext-East -specific resources (Copernicus 106). 
