Tagging Inflective Languages: Prediction of Morphological 
Categories for a Rich, Structured Tagset 
Jan Hajič and Barbora Hladká
Institute of Formal and Applied Linguistics MFF UK 
Charles University, Prague, Czech Republic 
{hajic,hladka}@ufal.mff.cuni.cz
Abstrakt (česky)
(This short abstract is in Czech. For illustration
purposes, it has been tagged by our tagger; errors
are printed underlined and corrections are shown.)
[Tagged Czech abstract: one word-form/tag pair per line, with
"Correct:" annotations marking the tagger's errors. The per-token
layout did not survive extraction. The English abstract below
carries the same content.]
Abstract 
The major obstacle in morphological (sometimes 
called morpho-syntactic, or extended POS) tagging 
of highly inflective languages, such as Czech or Rus- 
sian, is - given the resources possibly available - the 
tagset size. Typically, it is in the order of thou- 
sands. Our method uses an exponential probabilis- 
tic model based on automatically selected features. 
The parameters of the model are computed using 
simple estimates (which makes training much faster 
than when one uses Maximum Entropy) to directly 
minimize the error rate on training data. 
The results obtained so far not only show good
performance on disambiguation of most of the individual
morphological categories, but also show a significant
improvement in the overall prediction of the resulting
combined tag over an HMM-based tag n-gram model, even
when using substantially less training data.
1 Introduction 
1.1 Orthogonality of morphological 
categories of inflective languages 
The major obstacle in morphological¹ tagging of
highly inflective languages, such as Czech or Russian,
is the tagset size, given the resources likely to be
available. Typically, it is in the order of thousands.
This is due to the (partial) "orthogonality"² of simple
morphological categories, which then multiply when
creating a "flat" list of tags. However, the individual
categories contain only a very small number of different
values; e.g., number has five (Sg, Pl, Dual, Any, and
"not applicable"), case nine, etc.
The "orthogonality" should not be taken to mean
complete independence, though. Inflectional languages
(as opposed to agglutinative languages such as Finnish
or Hungarian) typically combine several categories into
one morpheme (suffix or ending). At the same time, the
morphemes display a high degree of ambiguity, even
across major POS categories.
For example, most Czech nouns can form singular and
plural forms in all seven cases, and most adjectives can
(at least potentially) form all (4) genders, both numbers,
all (7) cases, all (3) degrees of comparison, and can be of
either positive or negative polarity. That gives 336
possibilities (for adjectives), many of them homonymous on
the surface. On the other hand, pronouns and numerals do
not display such orthogonality, and even adjectives are not
fully orthogonal: an ancient "dual" number, happily living
in modern Czech in the feminine, plural and instrumental
case, adds another 6 sub-orthogonal possibilities to almost
every adjective. Together, we employ 3127 plausible
combinations (including style and diachronic variants).

¹ This type of tagging is sometimes called morpho-syntactic
tagging. However, to stress that we are not dealing with
syntactic categories such as Object or Attribute (but rather
with morphological categories such as Number or Case) we will
use the term "morphological" here.
² By orthogonality we mean that all combinations of values
of two (or more) categories are systematically possible, i.e.
that every member of the cartesian product of the two (or
more) sets of values does appear in the language.
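The count of 336 adjective forms quoted above is simply the size of a Cartesian product. The following minimal sketch enumerates it; the value labels (such as "F" and "N" for feminine and neuter, or "N" for negative polarity) are illustrative assumptions chosen to resemble the subtags used later in the paper.

```python
from itertools import product

# Illustrative value sets for a Czech adjective (labels are assumptions).
genders = ["M", "I", "F", "N"]   # masc. animate, masc. inanimate, feminine, neuter
numbers = ["S", "P"]             # singular, plural
cases = [1, 2, 3, 4, 5, 6, 7]    # the seven numbered cases
degrees = [1, 2, 3]              # positive, comparative, superlative
polarities = ["A", "N"]          # affirmative, negative

# Full orthogonality means every member of the Cartesian product is possible.
combinations = list(product(genders, numbers, cases, degrees, polarities))
print(len(combinations))  # 4 * 2 * 7 * 3 * 2 = 336
```

The 6 extra "dual" forms mentioned in the text are not part of this product, which is exactly why the paper calls them sub-orthogonal.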
1.2 The individual categories 
There are 13 morphological categories currently used 
for morphological tagging of Czech: part of speech, 
detailed POS (called "subpart of speech"), gender, 
number, case, possessor's gender, possessor's num- 
ber, person, tense, degree of comparison, negative- 
ness (affirmative/negative), voice (active/passive), 
and variant/register. 
The POS category contains only the major part of
speech values (noun (N), verb (V), adjective (A),
pronoun (P), adverb (D), numeral (C), preposition (R),
conjunction (J), interjection (I), particle (T),
punctuation (Z), and "undefined" (X)).
The "subpart of speech" (SUBPOS) contains details
about the major category and has 75 different values.
For example, verbs (POS: V) are divided into simple
finite form in present or future tense (B), conditional
(c), infinitive (f), imperative (i), etc.³
All the categories vary in their size as well as in
their unigram entropy (see Table 1) computed using
the standard entropy definition

    $$ H_p = -\sum_{y \in Y} p(y) \log(p(y)) \qquad (1) $$

where p is the unigram distribution estimate based
on the training data, and Y is the set of possible
values of the category in question. This formula can
be rewritten as

    $$ H_{p,D} = -\frac{1}{|D|} \sum_{i=1}^{|D|} \log(p(y_i)) \qquad (2) $$

where p is the unigram distribution, D is the data
and |D| its size, and $y_i$ is the value of the category
in question at the i-th event (or position) in the
data. The form (2) is usually used for cross-entropy
computation on data (such as test data) different
from those used for estimating p. The base of the
log function is always taken to be 2.
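As a small illustration of definitions (1) and (2), the sketch below estimates a unigram distribution from a toy list of subtags and evaluates both forms with base-2 logarithms. The function names and the toy data are assumptions made for this example only.

```python
import math
from collections import Counter

def unigram_distribution(values):
    """Relative-frequency estimate p(y) from a list of observed subtags."""
    counts = Counter(values)
    total = sum(counts.values())
    return {y: c / total for y, c in counts.items()}

def entropy(p):
    """Formula (1): H_p = -sum_y p(y) log2 p(y)."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def cross_entropy(p, data):
    """Formula (2): average of -log2 p(y_i) over the events in `data`.

    Raises KeyError for a value never seen when p was estimated.
    """
    return -sum(math.log2(p[y]) for y in data) / len(data)

# Toy NUMBER subtags: singular, singular, plural, not applicable.
train = ["S", "S", "P", "-"]
p = unigram_distribution(train)
h = entropy(p)
h_data = cross_entropy(p, train)
```

When (2) is evaluated on the same data that p was estimated from, it coincides with (1); on held-out data it gives the cross-entropy.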
1.3 The morphological analyzer 
Given the nature of inflectional languages, which can 
generate many (sometimes thousands of) forms for a 
given lemma (or "dictionary entry"), it is necessary 
to employ morphological analysis before the tagging 
proper. In Czech, there are as many as 5 differ- 
ent lemmas (not counting underlying derivations nor 
³ The categories POS and SUBPOS are the only two categories
which are rather lexically (and not inflectionally) based.
Table 1: Most Difficult Individual Morphological
Categories

Category      Number of values   Unigram entropy H_p (in bits)
POS                  12                    2.99
SUBPOS               75                    3.83
GENDER               11                    2.05
NUMBER                6                    1.62
CASE                  9                    2.24
POSSGENDER            5                    0.04
POSSNUMBER            3                    0.04
PERSON                5                    0.64
TENSE                 6                    0.55
GRADE                 4                    0.55
NEGATION              3                    1.07
VOICE                 3                    0.45
VAR                  10                    0.07
word senses) and up to 108 different tags for an
input word form. The morphological analyzer used for
this purpose (Hajič, in prep.), (Hajič, 1994) covers
about 98% of running unrestricted text (newspaper,
magazines, novels, etc.). It is based on a lexicon
containing about 228,000 lemmas and it can analyze
about 20,000,000 word forms.
2 The Training Data 
Our training data consists of about 130,000 tokens
of newspaper and magazine text, manually tagged
using a special-purpose tool which allows for easy
disambiguation of morphological output. The data
has been tagged twice, with manual resolution of
discrepancies (the discrepancy rate being about 5%,
most of them being simple tagging errors rather than
opinion differences).
One data item contains several fields: the input 
word form (token), the disambiguated tag, the set of 
all possible tags for the input word form, the disam- 
biguated lemma, and the set of all possible lemmas 
with links to their possible tags. Out of these, we 
are currently interested in the form, its possible tags 
and the disambiguated tag. The lemmas are ignored 
for tagging purposes.⁴
The tag from the "disambiguated tag" field as
well as the tags from the "possible tags" field are
further divided into so-called subtags (by
morphological category). In the "possible tags" field,
⁴ In fact, tagging helps in most cases to disambiguate the
lemmas. Lemma disambiguation is a separate process following
tagging. The lemma disambiguation is a much simpler
problem: the average number of different lemmas per token
(as output by the morphological analyzer) is only 1.15. We
do not cover the lemma disambiguation procedure here.
[Figure 1 data: twelve training-data lines, one per token, each of
the form "correct tag | 13 ambiguity classes separated by / | word
form"; the column contents were garbled in extraction.]
Figure 1: Training Data: lit: on computer(adj.)
model, which was-simulating development of-world
climate in next decades
the ambiguity on the level of full (combined) tags is
mapped onto so-called "ambiguity classes" (ACs)
of subtags. This mapping is generally not reversible,
which means that the links across categories might
not be preserved. For example, the word form jen,
for which the morphology generates three possible
tags, namely TT----------- (particle "only"), and
NNIS1-----A-- and NNIS4-----A-- (noun, masc.
inanimate, singular, nominative (1) or accusative
(4) case; "yen" (the Japanese currency)), will be
assigned six ambiguous ambiguity classes (NT, NT,
-I, -S, -14, -A, for POS, subpart of speech, gender,
number, case, and negation) and 7 unambiguous
ambiguity classes (all -). An example of the training
data is presented in Fig. 1. It contains three
columns, separated by the vertical bar (|):
1. the "truth" (the correct tag, i.e. a sequence of
13 subtags, each represented by a single character,
which is the true value for each individual
category in the order defined in Fig. 1 (1st column:
POS, 2nd: SUBPOS, etc.));
2. the 13-tuple of ambiguity classes, separated by 
a slash (/), in the same order; each ambiguity 
class is named using the single character subtags 
used for all the possible values of that category; 
3. the original word form. 
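The mapping from full tags to per-category ambiguity classes can be sketched as follows, using the three tags the text gives for the word form jen. Naming each class by its sorted set of subtag characters is an assumption; it happens to reproduce the class names quoted in the text.

```python
CATEGORIES = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE", "POSSGENDER", "POSSNUMBER",
    "PERSON", "TENSE", "GRADE", "NEGATION", "VOICE", "VAR",
]

def ambiguity_classes(possible_tags):
    """Collapse a set of 13-character tags into 13 per-category classes.

    Each class is the sorted set of characters seen at that position,
    so links between categories of the same tag are deliberately lost.
    """
    classes = []
    for position in range(len(CATEGORIES)):
        values = {tag[position] for tag in possible_tags}
        classes.append("".join(sorted(values)))
    return classes

jen_tags = ["TT-----------", "NNIS1-----A--", "NNIS4-----A--"]
classes = ambiguity_classes(jen_tags)
for name, ac in zip(CATEGORIES, classes):
    print(f"{name:<11} {ac}")
```

Note that the result no longer records that case 1 or 4 is only possible together with POS N, which is the irreversibility mentioned above.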
Please note that it is customary to number the
seven grammatical cases in Czech (instead of naming
them): "nominative" gets 1, "genitive" 2, etc.
There are four genders, as the Czech masculine
gender is divided into masculine animate (M) and
inanimate (I).
Fig. 1 is a typical example of the ambiguities en- 
countered in a running text: little POS ambigu- 
ity, but a lot of gender, number and case ambiguity 
(columns 3 to 5). 
3 The Model 
Instead of employing the source-channel paradigm
for tagging (more or less explicitly present e.g. in
(Merialdo, 1992), (Church, 1988), (Hajič, Hladká,
1997)) used in the past (notwithstanding some
exceptions, such as Maximum Entropy and rule-based
taggers), we use here a "direct" approach to
modeling, for which we have chosen an exponential
probabilistic model. Such a model (when predicting
an event⁵ y ∈ Y in a context x) has the general
form

    $$ p_{AC,e}(y|x) = \frac{\exp(\sum_{i=1}^{n} \lambda_i f_i(y,x))}{Z(x)} \qquad (3) $$

where the $f_i(y,x)$ form the set (of size n) of binary-valued
(yes/no) features of the event value being predicted
and its context, $\lambda_i$ is a "weight" (in the exponential
sense) of the feature $f_i$, and the normalization factor
Z(x) is defined naturally as

    $$ Z(x) = \sum_{y \in Y} \exp(\sum_{i=1}^{n} \lambda_i f_i(y,x)) \qquad (4) $$
We use a separate model for each ambiguity class
AC (which actually appeared in the training data)
of each of the 13 morphological categories⁶. The
final $p_{AC}(y|x)$ distribution is further smoothed using
unigram distributions on subtags (again, separately
for each category):

    $$ p_{AC}(y|x) = a\, p_{AC,e}(y|x) + (1 - a)\, p_{AC,1}(y) \qquad (5) $$

Such smoothing takes care of any unseen context;
for ambiguity classes not seen in the training data,
for which there is no model, we use unigram
probabilities of subtags, one distribution per category.
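To make formulas (3) to (5) concrete, here is a minimal sketch of one such model for the CASE ambiguity class 145. The single feature, its weight, the unigram distribution and the smoothing weight are all made-up values for illustration; only the functional form is taken from the text.

```python
import math

def exponential_model(y_values, features, weights, x):
    """Formulas (3) and (4): normalized exponential of weighted binary features."""
    scores = {
        y: math.exp(sum(w * f(y, x) for f, w in zip(features, weights)))
        for y in y_values
    }
    z = sum(scores.values())  # normalization factor Z(x)
    return {y: s / z for y, s in scores.items()}

def smooth(p_exp, p_unigram, a):
    """Formula (5): interpolate the exponential model with a unigram distribution."""
    return {y: a * p_exp[y] + (1 - a) * p_unigram[y] for y in p_exp}

# One hypothetical feature: case is nominative and the next word form is "pracuje".
def nominative_before_pracuje(y, x):
    return int(y == "1" and x.get("FORM+1") == "pracuje")

ac_145 = ["1", "4", "5"]
context = {"FORM+1": "pracuje"}
p_e = exponential_model(ac_145, [nominative_before_pracuje], [math.log(3)], context)

unigram = {"1": 0.5, "4": 0.4, "5": 0.1}   # assumed unigram estimate
p = smooth(p_e, unigram, a=0.9)            # assumed smoothing weight
```

With no features at all (n = 0), the exponential part is uniform over the ambiguity class, so the unigram term alone decides the ranking.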
In the general case, features can operate on any
imaginable context (such as the speed of the wind
over Mt. Washington, the last word of yesterday's
TV news, or the absence of a noun in the next 1000
words, etc.). In practice, we view the context as a
set of attribute-value pairs with a discrete range of
values (from now on, we will use the word "context"
for such a set). Every feature can thus be represented
by a set of contexts in which it is positive.
There is, of course, also a distinguished attribute for
the value of the variable being predicted (y); the rest
of the attributes are denoted by x as expected. Values
of attributes will be denoted by an overbar ($\bar{y}$, $\bar{x}$).
The pool of contexts of prospective features is for
the purpose of morphological tagging defined as a
full cross-product of the category being predicted
(y) and of the x specified as a combination of:
1. an ambiguity class of a single category, which
may be different from the category being predicted, or
2. a word form
and
1. the current position, or
2. the immediately preceding (following) position in
text, or
3. the closest preceding (following) position (up to
four positions away) having a certain ambiguity
class in the POS category.

⁵ A subtag, i.e. (in our case) the unique value of a
morphological category.
⁶ Every category is, of course, treated separately. This means
that e.g. the ambiguity class 23 for category CASE (meaning
that there is an ambiguity between genitive and dative
cases) is different from ambiguity class 23 for category GRADE
or PERSON.
Let now
Categories = { POS, SUBPOS, GENDER,
NUMBER, CASE, POSSGENDER,
POSSNUMBER, PERSON, TENSE,
GRADE, NEGATION, VOICE, VAR};
then the feature function $f_{Cat_{AC},\bar{y},\bar{x}}(y,x) \in \{0,1\}$
is well-defined iff

    $$ \bar{y} \in Cat_{AC} \qquad (6) $$

where Cat ∈ Categories and $Cat_{AC}$ is the ambiguity
class AC (such as AN, for adjective/noun ambiguity
of the part of speech category) of a morphological
category Cat (such as POS). For example, the
function $f_{POS_{AN},A,\bar{x}}$ is well-defined (A ∈ {A, N}),
whereas the function $f_{CASE_{145},6,\bar{x}}$ is not
(6 ∉ {1, 4, 5}). We will introduce the notation of the
context part in the examples of feature value
computation below. The indexes may be omitted if it
is clear what category, ambiguity class, value of
the category being predicted and/or context the
feature belongs to.
The value of a well-defined feature⁷ function
$f_{Cat_{AC},\bar{y},\bar{x}}(y,x)$ is determined by

    $$ f_{Cat_{AC},\bar{y},\bar{x}}(y,x) = 1 \iff \bar{y} = y \wedge \bar{x} \subseteq x \qquad (7) $$

This definition excludes features which are positive
for more than one y in any context x. This property
will be used later in the feature selection algorithm.
As an example of a feature, let's assume we are
predicting the category CASE from the ambiguity
class 145, i.e. the morphology gives us the possibility
to assign nominative (1), accusative (4) or vocative
(5) case. A feature then is e.g.

    The resulting case is nominative (1) and
    the following word form is pracuje (lit.
    (it) works)

denoted as $f_{CASE_{145},1,(FORM_{+1}=pracuje)}$, or

    The resulting case is accusative (4) and the
    closest preceding preposition's case has the
    ambiguity class 46

denoted as $f_{CASE_{145},4,(CASE_{-POS=R}=46)}$.
The feature $f_{POS_{NV},N,(POS_{-1}=A,CASE_{-1}=145)}$ will
be positive in the context of Fig. 2, but not in the
context of Fig. 3.

[Figure 2 data: two training-data lines, for the word forms tvrdý
and boj; the column contents were garbled in extraction.]
Figure 2: Context where the feature $f_{POS_{NV},N,(POS_{-1}=A,CASE_{-1}=145)}$
is positive (lit. heavy fighting).

[Figure 3 data: two training-data lines, for the word forms
pražském and hradě; the column contents were garbled in extraction.]
Figure 3: Context where the feature $f_{POS_{NV},N,(POS_{-1}=A,CASE_{-1}=145)}$
is negative (lit. (at the) Prague castle).

⁷ From now on, we will assume that all features are
well-defined.
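Definition (7) amounts to an equality test on the predicted value plus a subset test on the context. The sketch below encodes a context as a dictionary of attribute-value pairs; the attribute names and the two toy contexts are hypothetical stand-ins modeled on the captions of Figs. 2 and 3, not a transcription of the figures.

```python
def make_feature(y_bar, x_bar):
    """Build f_{y_bar, x_bar}(y, x) as in (7): 1 iff y == y_bar and x_bar is a subset of x."""
    required = set(x_bar.items())

    def feature(y, x):
        return int(y == y_bar and required <= set(x.items()))

    return feature

# Feature for POS ambiguity class NV: the result is N, the preceding word
# is an adjective (POS-1 = A) whose CASE ambiguity class is 145.
f = make_feature("N", {"POS-1": "A", "CASE-1": "145"})

# Hypothetical contexts of the second word in "tvrdý boj" and "pražském hradě".
context_heavy_fighting = {"POS-1": "A", "CASE-1": "145", "FORM": "boj"}
context_prague_castle = {"POS-1": "A", "CASE-1": "6", "FORM": "hradě"}

positive = f("N", context_heavy_fighting)   # context matches, value matches
wrong_value = f("V", context_heavy_fighting)  # same context, different y
negative = f("N", context_prague_castle)    # preceding case class differs
```

Because the value test is an equality, at most one y can make a given feature positive in any context, which is the property the selection algorithm relies on.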
The full cross-product of all the possibilities out- 
lined above is again restricted to those features 
which have actually appeared in the training data 
more than a certain number of times. 
Using ambiguity classes instead of unique values
of morphological categories for evaluating the
(context part of the) features has the advantage of
letting us avoid Viterbi search during tagging. This
in turn makes it easy to add lookahead (right)
context.⁸
There is no "forced relationship" among categories 
of the same tag. Instead, the model is allowed to 
learn also from the same-position "context" of the 
subtag being predicted. However, when using the 
model for tagging one can choose between two modes 
of operation: separate, which is the same mode 
used when training as described herein, and VTC 
(Valid Tag Combinations) method, which does 
not allow for impossible combinations of categories. 
See Sect. 5 for more details and for the impact on 
the tagging accuracy. 
4 Training 
4.1 Feature Weights 
The usual method for computing the feature weights
(the $\lambda_i$ parameters) is Maximum Entropy (Berger
et al., 1996). This method is generally slow, as it
requires a lot of computing power.

⁸ It remains to be seen whether using the unique values
(at least for the left context) and employing Viterbi would
help. The results obtained so far suggest that probably not
much, and if yes, then it would restrict the number of features
selected rather than increase tagging accuracy.
Based on our experience with tagging as well as
with other projects involving statistical modeling,
we assume that the weights are actually much less
important than the features themselves.
We therefore employ very simple weight estimation.
It is based on the ratio of the conditional
probability of y in the context defined by the feature
$f_{\bar{y},\bar{x}}$ and the uniform distribution for the
ambiguity class AC.
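One way to read this description is sketched below: count how often each subtag occurs in the feature's context, turn the counts into a conditional probability, and take the logarithm of its ratio to the uniform probability 1/|AC|. Taking the log of that ratio is an assumption about how the pieces combine, and the toy events are invented for illustration.

```python
import math
from collections import Counter

def batch_weights(ac, events, x_bar):
    """Estimate one weight per subtag y for the features sharing context x_bar.

    `events` is a list of (y, x) pairs for a single ambiguity class `ac`,
    where x is a dict of attribute-value pairs. Subtags never seen in the
    context get no feature (and hence no weight).
    """
    required = set(x_bar.items())
    counts = Counter(y for y, x in events if required <= set(x.items()))
    total = sum(counts.values())
    uniform = 1.0 / len(ac)
    return {y: math.log((c / total) / uniform) for y, c in counts.items()}

ac = ["1", "4"]  # CASE ambiguity class 14
events = (
    [("1", {"FORM+1": "pracuje"})] * 3
    + [("4", {"FORM+1": "pracuje"})]
    + [("4", {"FORM+1": "vidí"})] * 2
)
weights = batch_weights(ac, events, {"FORM+1": "pracuje"})
```

Under this reading a weight is positive exactly when the context makes the subtag more likely than a uniform guess, and no iterative fitting is needed.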
4.2 Feature Selection 
The usual guiding principle for selecting features of
exponential models is the Maximum Likelihood
principle, i.e. the probability of the training data is
maximized (or the cross-entropy of the model and
the training data is minimized, which is the
same thing). Even though we are eventually interested
in the final error rate of the resulting model,
this might be the only solution in the usual
source-channel setting where two independent models (a
language model and a "translation" model of some
sort: acoustic, real translation etc.) are being used.
The improvement of one model influences the error
rate of the combined model only indirectly.
This is not the case for tagging. Tagging can be
seen as a "final application" problem for which we
assume we have enough data at hand to train and
use just one model, abandoning the source-channel
paradigm. We have therefore used the error rate
directly as the objective function which we try to
minimize when selecting the model's features. This
idea is not new, but as far as we know it has been
implemented only in rule-based taggers and parsers,
such as (Brill, 1993a), (Brill, 1993b), (Brill, 1993c) and
(Ribarov, 1996), and not in models based on
probability distributions.
Let's define the set of contexts of a set of features:

    $$ X(F) = \{ \bar{x} : \exists \bar{y}\;\; f_{\bar{y},\bar{x}} \in F \} \qquad (8) $$

where F is some set of features of interest.
The features can therefore be grouped together
based on the context they operate on. In the current
implementation, we actually add features in
"batches". A "batch" of features is defined as a set
of features which share the same context $\bar{x}$ (see the
definition below). Computationally, adding features
in batches is relatively cheap both time- and
space-wise.
For example, the features
    $f_{POS_{NV},N,(POS_{-1}=A,CASE_{-1}=145)}$
and
    $f_{POS_{NV},V,(POS_{-1}=A,CASE_{-1}=145)}$
share the context ($POS_{-1}$ = A, $CASE_{-1}$ = 145).
Let further
• $F_{AC}$ be the pool of features available for selection,
• $S_{AC}$ be the set of features selected so far for a
model for ambiguity class AC,
• $p_{S_{AC}}(y|d)$ be the probability, using model (3-5)
with features $S_{AC}$, of subtag y in a context defined
by position d in the training data, and
• $F_{AC,\bar{x}}$ be the set ("batch") of features sharing
the same context $\bar{x}$, i.e.

    $$ F_{AC,\bar{x}} = \{ f \in F_{AC} : \exists \bar{y}\;\; f = f_{\bar{y},\bar{x}} \} \qquad (9) $$

Note that the size of AC is equal to the size of
any batch of features ($|AC| = |F_{AC,\bar{x}}|$ for any
$\bar{x}$).
The selection process then proceeds as follows:
1. For all contexts $\bar{x} \in X(F_{AC})$ do the following:
2. For all features $f = f_{\bar{y},\bar{x}} \in F_{AC,\bar{x}}$ compute their
associated weights $\lambda_f$ using the formula:

    $$ \lambda_f = \log(\tilde{p}_{AC\bar{x}}(\bar{y})) \qquad (10) $$

where

    $$ \tilde{p}_{AC\bar{x}}(\bar{y}) = \frac{\sum_{d=1}^{|D|} f_{\bar{y},\bar{x}}(y_d, x_d)}{\sum_{d=1}^{|D|} \sum_{f' \in F_{AC,\bar{x}}} f'(y_d, x_d)} \qquad (11) $$

3. Compute the error rate of the training data by
going through it and at each position d selecting
the best subtag by maximizing $p_{S_{AC} \cup F_{AC,\bar{x}}}(y|d)$
over all y ∈ AC.
4. Select the feature set $F_{AC,\bar{x}}$ which results in
the maximal improvement in the error rate of
the training data and add all $f \in F_{AC,\bar{x}}$ permanently
to $S_{AC}$; with $S_{AC}$ now extended, start
from the beginning (unless the termination
condition is met).
5. Termination condition: improvement in error
rate smaller than a preset minimum.
The probability defined by the formula (11) can
easily be computed despite its ugly general form, as
the denominator is in fact the number of (positive)
occurrences of all the features from the batch defined
by the context $\bar{x}$ in the training data. It also helps
if the underlying ambiguity class AC is found only
in a fraction of the training data, which is typically
the case. Also, the size of the batch (equal to |AC|)
is usually very small.
On top of rather roughly estimating the $\lambda_f$
parameters, we use another implementation shortcut here:
we do not necessarily compute the best batch of
features in each iteration, but rather add all (batches
of) features which improve the error rate by more
than a threshold δ. This threshold is set to half the
number of data items which contain the ambiguity
class AC at the beginning of the loop, and is then cut
in half at every iteration. The positive consequence
of this shortcut (which certainly adds some
unnecessary features) is that the number of iterations is
much smaller than if the maximum were computed
at each iteration.
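The threshold-halving shortcut can be sketched as the loop below for a single ambiguity class. Several simplifications are assumptions of this sketch rather than details from the paper: weights are the log-ratio to the uniform distribution, smoothing (5) is ignored, ties are broken by the order of `ac`, and the improvement of a batch is measured against the set selected so far rather than against the set at the start of the iteration.

```python
import math
from collections import Counter

def matches(x_bar, x):
    """True if every attribute-value pair of context x_bar occurs in x."""
    return all(x.get(attr) == value for attr, value in x_bar)

def batch_weights(ac, events, x_bar):
    """Weights of the batch sharing x_bar (assumed log-ratio to uniform)."""
    counts = Counter(y for y, x in events if matches(x_bar, x))
    total = sum(counts.values())
    return {y: math.log(len(ac) * c / total) for y, c in counts.items()}

def predict(ac, selected, x):
    """Pick the subtag with the largest exponent; Z(x) cannot change the argmax."""
    def score(y):
        return sum(w.get(y, 0.0) for x_bar, w in selected.items() if matches(x_bar, x))
    return max(ac, key=score)

def count_errors(ac, selected, events):
    return sum(predict(ac, selected, x) != y for y, x in events)

def select_batches(ac, events, pool, final_threshold=1.0):
    """Add every batch that removes more than `threshold` errors, then halve it."""
    selected = {}
    current = count_errors(ac, selected, events)
    threshold = len(events) / 2
    while threshold >= final_threshold:
        for x_bar in pool:
            if x_bar in selected:
                continue
            candidate = {**selected, x_bar: batch_weights(ac, events, x_bar)}
            candidate_errors = count_errors(ac, candidate, events)
            if current - candidate_errors > threshold:
                selected, current = candidate, candidate_errors
        threshold /= 2
    return selected

# Toy data for CASE ambiguity class 14; contexts are tuples of (attribute, value).
ac = ["1", "4"]
ctx_verb = (("FORM+1", "pracuje"),)
ctx_prep = (("CASE-R", "46"),)
events = [("1", {"FORM+1": "pracuje"})] * 4 + [("4", {"CASE-R": "46"})] * 4
selected = select_batches(ac, events, [ctx_verb, ctx_prep])
```

In this toy run only the preposition context is selected: the verb context never changes a prediction, so it never clears any threshold, while the preposition batch removes four errors and is added once the threshold has dropped to 2.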
5 Results 
We have used 130,000 words as the training set and a
test set of 1000 words. There have been 378 different
ambiguity classes (of subtags) across all categories.
We have used two evaluation metrics: one which
evaluates each category separately and one
"flat-list" error rate which is used for comparison with
other methods which do not predict the morphological
categories separately. We compare the new
method with results obtained on Czech previously,
as reported in (Hladká, 1994) and (Hajič, Hladká,
1997). The apparently high baseline when compared
to previously reported experiments is undoubtedly
due to the introduction of multiple models based on
ambiguity classes.
In all cases, since the percentage of text tokens 
which are at least two-way ambiguous is about 55%, 
the error rate should be almost doubled if one wants 
to know the error rate based on ambiguous words 
only. 
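The "almost doubled" remark is a one-line rescaling, sketched below under the assumption that every tagging error falls on an ambiguous token (unambiguous tokens have only one candidate tag). The function name is introduced here for illustration.

```python
def error_rate_on_ambiguous(overall_error_pct, ambiguous_fraction=0.55):
    """Rescale an all-token error rate to ambiguous tokens only.

    Assumes unambiguous tokens are never mistagged, so all errors fall on
    the `ambiguous_fraction` of tokens that are at least two-way ambiguous.
    """
    return overall_error_pct / ambiguous_fraction

baseline = error_rate_on_ambiguous(20.7)  # smoothing-only test error
final = error_rate_on_ambiguous(8.0)      # test error after training
print(round(baseline, 1), round(final, 1))
```

The scaling factor is 1/0.55, i.e. roughly 1.8, which is why the text speaks of almost doubling.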
The baseline, or "smoothing-only" error rate was 
at 20.7 % in the test data and 22.18 % in the training 
data. 
Table 2 presents the initial error rates for the indi- 
vidual categories computed using only the smooth- 
ing part of the model (n = 0 in equation 3). 
Training took slightly under 20 hours on a
Linux-powered Pentium 90, with the feature adding threshold
set to 4 (which means that a feature batch was not
added if it improved the absolute error rate on training
data by 4 errors or less). 840 batches of features
(corresponding to about 2000 fully specified features)
were learned. The tagging itself is (contrary to
training) very fast. The average speed is about
300 words/sec on morphologically prepared data on
the same machine. The results are summarized in
Table 3.
There is no apparent overtraining yet. However, 
it does appear when the threshold is lowered (we 
have tested that on a smaller set of training data 
consisting of 35,000 words: overtraining started to 
occur when the threshold was down to 2-3). 
Table 4 contains a comparison of the results
Category      training data   test data
POS                1.10           2.1
SUBPOS             1.06           1.1
GENDER             6.35           6.1
NUMBER             5.34           4.2
CASE              14.55          14.5
POSSGENDER         0.05           0.0
POSSNUMBER         0.13           0.1
PERSON             0.28           0.0
TENSE              0.36           0.1
GRADE              0.48           0.3
NEGATION           1.33           1.0
VOICE              0.40           0.1
VAR                0.30           0.3
Overall           22.18          20.7
Table 2: Initial Error Rate (in %)
Category      training data   test data
POS                0.02           0.9
SUBPOS             0.49           1.0
GENDER             1.78           2.0
NUMBER             2.73           0.9
CASE               6.01           5.0
POSSGENDER         0.04           0.0
POSSNUMBER         0.01           0.0
PERSON             0.12           0.0
TENSE              0.12           0.1
GRADE              0.11           0.1
NEGATION           0.25           0.0
VOICE              0.11           0.0
VAR                0.10           0.2
Overall            8.75           8.0
Table 3: Resulting Error Rate (in %)
achieved with the previous experiments on Czech
tagging (Hajič, Hladká, 1997). It shows that we
obtained a more than 50% improvement over the best
error rate achieved so far. Also, the amount of training
data used was lower than that needed for the HMM
experiments. We have also performed an experiment
using 35,000 training words, which yielded results
about 4% worse (88% combined tag accuracy).

Experiment             training data size   best error rate (in %)
Unigram HMM                 621,015                34.30
Rule-based (Brill's)         37,892                20.25
Trigram HMM                 621,015                18.86
Bigram HMM                  621,015                18.46
Exponential                  35,000                12.00
Exponential                 130,000                 8.00
Exponential, VTC            160,000                 6.20
Table 4: Comparing Various Methods

Finally, Table 5 compares results (given different
training thresholds⁹) obtained on larger training
data using the "separate" prediction method discussed
so far with results obtained through a modification,
the key point of which is that it considers
only "Valid (sub)Tag Combinations" (VTC). The
probability of a tag is computed as a simple product
of subtag probabilities (normalized), thus assuming
subtag independence. The "winner" is presented in
boldface. As expected, the overall error rate is always
better using the VTC method, but some of the
subtags are (sometimes) better predicted using the
"separate" prediction method¹⁰. This could have
important practical consequences if, for example,
the POS or SUBPOS is all that is of interest.
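The difference between the two modes can be sketched as follows: "separate" takes the best subtag in each category independently, while VTC scores only the full tags offered by the morphological analyzer, using the normalized product of subtag probabilities. The tags are those given earlier for the word form jen; the subtag probabilities are invented for illustration.

```python
def vtc_distribution(valid_tags, subtag_probs):
    """Normalized product of per-category subtag probabilities over valid tags only."""
    scores = {}
    for tag in valid_tags:
        prob = 1.0
        for position, value in enumerate(tag):
            prob *= subtag_probs[position].get(value, 0.0)
        scores[tag] = prob
    total = sum(scores.values())
    return {tag: s / total for tag, s in scores.items()} if total > 0 else scores

def separate_prediction(subtag_probs):
    """Best subtag per category, ignoring whether the combination is a valid tag."""
    return "".join(max(p, key=p.get) for p in subtag_probs)

valid_tags = ["TT-----------", "NNIS1-----A--", "NNIS4-----A--"]
certain = {"-": 1.0}
# Hypothetical model outputs for the 13 categories (POS, SUBPOS, GENDER, ...).
subtag_probs = [
    {"N": 0.4, "T": 0.6}, {"N": 0.4, "T": 0.6},
    {"-": 0.5, "I": 0.5}, {"-": 0.5, "S": 0.5},
    {"-": 0.1, "1": 0.2, "4": 0.7},
    certain, certain, certain, certain, certain,
    {"-": 0.5, "A": 0.5}, certain, certain,
]

distribution = vtc_distribution(valid_tags, subtag_probs)
vtc_tag = max(distribution, key=distribution.get)
separate_tag = separate_prediction(subtag_probs)
```

With these numbers the separate mode combines a particle POS with an accusative case, which no analysis of jen allows, whereas VTC settles on the accusative noun reading.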
6 Conclusion and Further Research 
The combined error rate results are still considerably
worse than the results reported for English, but we
believe that there is still room for improvement.
Moreover, splitting the tags into subtags showed that
"pure" part of speech (as well as the even more detailed
"subpart" of speech) tagging actually gives better
results than those for English.
We see several ways how to proceed to possibly 
improve the performance of the tagger (we are still 
talking here about the "single best tag" approach; 
the n-best case will be explored separately): 
• Disambiguated tags (in the left context) plus 
Viterbi search. Some errors might be eliminated 
if features asking questions about the disam- 
biguated context are being used. The disam- 
biguated tags concentrate - or transfer - in- 
formation about the more distant context. It 
would avoid "repeated" learning of the same 
or similar features for different but related dis- 
ambiguation problems. The final effect on the 
overall accuracy is yet to be seen. Moreover, 
the transition function assumed by the Viterbi 
algorithm must be reasonably defined (approx- 
imated). 
• Final re-estimation using maximum entropy. 
Let's imagine that after selecting all the features 
using the training method described here we 
recompute the feature weights using the usual 
maximum entropy objective function. This will 
produce better (read: more principled) weight 
estimates for the features already selected, but 
it might help as well as hurt the performance. 
• Improved feature pool. This is, in our opinion,
the source of major improvement.
The error analysis shows that in many cases the
⁹ No overtraining occurred here either, but the results for
thresholds 2-4 do not differ significantly.
¹⁰ For English, using the Penn Treebank data, we have always
obtained better accuracy using the VTC method (and
redefinition of the tag set based on 4 categories).
Threshold:                   128            16             8             4             2
Features learned:             23           213           772          1529          4571
Category              Sep    VTC    Sep    VTC    Sep    VTC    Sep    VTC    Sep    VTC
POS                  1.50   1.32   0.86   0.78   0.66   0.60   0.44   0.42   0.36   0.44
SUBPOS               1.24   1.40   0.78   0.84   0.70   0.64   0.36   0.48   0.30   0.48
GENDER               4.50   4.06   3.00   2.80   2.40   2.14   2.14   1.80   2.08   1.90
NUMBER               3.46   2.94   2.62   2.40   1.86   1.72   1.72   1.56   1.80   1.50
CASE                11.10  10.52   7.74   7.66   5.30   5.34   4.82   4.80   4.88   4.84
POSSGENDER           0.08   0.10   0.08   0.12   0.08   0.04   0.04   0.06   0.02   0.04
POSSNUMBER           0.14   0.04   0.04   0.04   0.04   0.00   0.02   0.02   0.00   0.00
PERSON               0.28   0.18   0.14   0.16   0.16   0.10   0.14   0.12   0.12   0.06
TENSE                0.36   0.18   0.16   0.14   0.10   0.12   0.10   0.12   0.10   0.08
GRADE                0.88   1.00   0.70   0.30   0.44   0.30   0.22   0.18   0.22   0.16
NEGATION             0.62   0.26   0.34   0.36   0.28   0.26   0.24   0.24   0.26   0.24
VOICE                0.38   0.18   0.16   0.14   0.10   0.12   0.10   0.12   0.08   0.08
VAR                  0.26   0.18   0.24   0.22   0.14   0.14   0.12   0.14   0.12   0.04
Overall             16.50  13.22  12.20   9.58   8.42   6.98   7.62   6.22   7.66   6.20
Table 5: Resulting Error Rate in % (newspaper, training size: 160,000, test size: 5000 tokens)
context to be used for disambiguation has not
been used by the tagger simply because more
sophisticated features have not been considered
for selection. An example of such a feature,
which would possibly help to solve the very hard
and relatively frequent problem of disambiguating
between nominative and accusative cases of
certain nouns, would be a question "Is there
a noun in nominative case only in the same
clause?", since a clause usually has only one
noun phrase in nominative, constituting its
subject. For such a feature to work we will have
to correctly determine or at least approximate
the clause boundaries, which is obviously a
non-trivial task by itself.
7 Acknowledgements 
Various parts of this work have been supported by
the following grants: Open Foundation RSS/HESP
195/1995, Grant Agency of the Czech Republic
(GAČR) 405/96/K214, and Ministry of Education
Project No. VS96151. The authors would also like
to thank Fred Jelinek of CLSP JHU Baltimore for
valuable comments and suggestions which helped to
improve this paper a lot.

References 
Adam Berger, Stephen Della Pietra, Vincent Della
Pietra. 1996. Maximum Entropy Approach. In
Computational Linguistics, vol. 3, MIT Press,
Cambridge, MA.
Eric Brill. 1993a. A Corpus Based Approach To
Language Learning. PhD Dissertation, Department
of Computer and Information Science,
University of Pennsylvania.
Eric Brill. 1993b. Automatic grammar induction
and parsing free text: A Transformation-Based
Approach. In: Proceedings of the 3rd
International Workshop on Parsing Technologies,
Tilburg, The Netherlands.
Eric Brill. 1993c. Transformation-Based
Error-Driven Parsing. In: Proceedings of the Twelfth
National Conference on Artificial Intelligence.
Kenneth W. Church. 1988. A stochastic parts program
and noun phrase parser for unrestricted
text. In Proceedings of the Second Conference
on Applied Natural Language Processing, pages
136-143, Austin, Texas. Association for
Computational Linguistics, Morristown, New Jersey.
Jan Hajič. 1994. Unification Morphology Grammar.
PhD Dissertation. MFF UK, Charles University,
Prague.
Jan Hajič. In prep. Automatic Processing of Czech:
between Morphology and Syntax. MFF UK,
Charles University, Prague.
Jan Hajič, Barbora Hladká. 1997. Tagging of
Inflective Languages: a Comparison. In Proceedings of
the ANLP'97, pages 136-143, Washington, DC.
Association for Computational Linguistics,
Morristown, New Jersey.
Barbora Hladká. 1994. Programové vybavení pro
zpracování velkých českých textových korpusů.
MSc Thesis, Institute of Formal and Applied
Linguistics, Charles University, Prague, Czech
Republic.
Bernard Merialdo. 1992. Tagging Text With A
Probabilistic Model. Computational Linguistics,
20(2):155-171.
Kiril Ribarov. 1996. Automatická tvorba gramatiky
přirozeného jazyka. MSc Thesis, Institute of
Formal and Applied Linguistics, Charles University,
Prague, Czech Republic. In Czech.
