Probabilistic and Rule-Based Tagger of an Inflective Language - a Comparison
Jan Hajič and Barbora Hladká
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Malostranské nám. 25
CZ-118 00 Prague 1
e-mail: {hajic,hladka}@ufal.mff.cuni.cz
Abstract 
We present results of probabilistic tag- 
ging of Czech texts in order to show how 
these techniques work for one of the highly 
morphologically ambiguous inflective lan- 
guages. After description of the tag system 
used, we show the results of four experi- 
ments using a simple probabilistic model 
to tag Czech texts (unigram, two bigram 
experiments, and a trigram one). For com- 
parison, we have applied the same code and 
settings to tag an English text (another 
four experiments) using the same size of 
training and test data in the experiments in 
order to avoid any doubt concerning the va- 
lidity of the comparison. The experiments 
use the source channel model and maxi- 
mum likelihood training on a Czech hand- 
tagged corpus and on tagged Wall Street 
Journal (WSJ) from the LDC collection. 
The experiments show (not surprisingly) 
that the more training data, the better is 
the success rate. The results also indicate 
that for inflective languages with 1000+ 
tags we have to develop a more sophisti- 
cated approach in order to get closer to an 
acceptable error rate. In order to compare 
two different approaches to text tagging -- 
statistical and rule-based -- we modified 
Eric Brill's rule-based part of speech tag- 
ger and carried out two more experiments 
on the Czech data, obtaining similar results 
in terms of the error rate. We have also run three more experiments with a greatly reduced tagset to get another comparison based on a similar tagset size.
1 INTRODUCTION 
Languages with rich inflection like Czech pose a 
special problem for morphological disambiguation 
(which is usually called tagging1). For example, the 
ending "-u" is not only highly ambiguous, but at the 
same time it carries complex information: it corre- 
sponds to the genitive, the dative and the locative 
singular for inanimate nouns, or the dative singu- 
lar for animate nouns, or the accusative singular for 
feminine nouns, or the first person singular present 
tense active participle for certain verbs. There are two different techniques for text tagging: a stochastic technique and a rule-based technique. Each approach has some advantages. For stochastic techniques there exists a good theoretical framework, probabilities provide a straightforward way to disambiguate the tags of each word, and the probabilities can be acquired automatically from the data. For rule-based techniques the set of meaningful rules is acquired automatically, there is an easy way to find and implement improvements of the tagger, and a small set of rules can be used, in contrast to the large statistical tables. Given the success of statistical methods in different areas, including text tagging, given the very positive results of English statistical taggers, and given that no statistical tagger existed for any Slavic language, we wanted to apply statistical methods to Czech as well, although it exhibits rich inflection accompanied by a high degree of ambiguity. Originally,
we expected that the result would be plain negative, 
getting no more than about two thirds of the tags 
correct. However, as we show below, we got bet- 
ter results than we had expected. We used the same 
statistical approach to tag both the English text and 
the Czech text. For English, we obtained results 
comparable with the results presented in (Merialdo, 
1992) as well as in (Church, 1992). For Czech, we ob- 
tained results which are less satisfactory than those 
for English. Given that the accuracy of the rule-based part-of-speech (POS) tagger (Brill, 1992) is comparable with the accuracy of stochastic taggers, and given that a rule-based POS tagger has never been used for a Slavic language, we have also tried to apply rule-based methods to Czech.

1 The development of automatic tagging of Czech is/was supported fully or partially by the following grants/projects: Charles University GAUK 39/94, Grant Agency of the Czech Republic GACR 405/96/K214 and Ministry of Education VS96151.
2 STATISTICAL EXPERIMENTS 
2.1 CZECH EXPERIMENTS 
2.1.1 CZECH TAGSET 
The Czech experiment is based upon ten basic POS classes, and the tags describe the possible combinations of morphological categories for each POS class. In most cases, the first letter of the tag denotes the part of speech; the letters and numbers which follow it describe combinations of morphological categories (for a detailed description, see Table 2.1 and Table 2.2).

Morph. Categ.    Cat. Var.         Poss. Val.   Description
                 (see Tab. 2.2)
gender           g                 M            masc. anim.
                                   I            masc. inanim.
                                   N            neuter
                                   F            feminine
number           n                 S            singular
                                   P            plural
tense            t                 M            past
                                   P            present
                                   F            future
mood             m                 O            indicative
                                   R            imperative
case             c                 1            nominative
                                   2            genitive
                                   3            dative
                                   4            accusative
                                   5            vocative
                                   6            locative
                                   7            instrumental
voice            s                 A            active voice
                                   P            passive voice
polarity         a                 N            negative
                                   A            affirmative
deg. of comp.    d                 1            base form
                                   2            comparative
                                   3            superlative
person           p                 1            1st
                                   2            2nd
                                   3            3rd
Table 2.1
Note especially that Czech nouns are divided into four classes according to gender (Sgall, 1967) and into seven classes according to case.
POS Class                               Tag
nouns                                   Ngnc
noun, abbreviations                     NZ
adjectives                              Agncda
verbs, infinitives                      VTa
verbs, transgressives                   VWntsga
verbs, common                           Vpnstmga
pronouns, personal                      PPpnc
pronouns, 3rd person                    PP3gnc
pronouns, possessive                    PRgncpgn
"svůj" -- "his" referring to subject    PSgnc
reflexive particle "se"                 PEc
pronouns, demonstrative                 PDgnca
adverbs                                 Oda
conjunctions                            S
numerals                                Cgnc
prepositions                            Rpreposition
interjections                           F
particles                               K
sentence boundaries                     T_SB
punctuation                             T_IP
unknown tag                             X
Table 2.2
Not all possible combinations of morphological 
categories are meaningful, however. In addition to 
these usual tags we have used special tags for sen- 
tence boundaries, punctuation and a so called "un- 
known tag". In the experiments, we used only those 
tags which occurred at least once in the training cor- 
pus. To illustrate the form of the tagged text, we 
present here the following examples from our train- 
ing data, with comments: 
word|tag            #comments
do|Rdo              #"to" (prepositions have their own individual tags)
oddílu|NIS2         #"unit" (noun, masculine inanimate, singular, genitive)
k|Rk                #"for" (preposition)
snídani|NFS3        #"breakfast" (noun, feminine, singular, dative)
použije|V3SAPOMA    #"uses" (verb, 3rd person, singular, active, present, indicative, masc. animate, affirmative)
pro|Rpro            #"for" (preposition)
nás|PP1P4           #"us" (pronoun, personal, 1st person, plural, accusative)
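To make the positional reading of these tags concrete, the following Python sketch decodes two of the tag patterns of Table 2.2 into named categories. The function name `decode_tag` and the dictionaries are assumptions introduced only for this illustration; the value tables are copied from Table 2.1, and only the noun (Ngnc) and personal pronoun (PPpnc) patterns are covered.

```python
# Value tables copied from Table 2.1; all names here are illustrative.
GENDER = {"M": "masc. anim.", "I": "masc. inanim.", "N": "neuter", "F": "feminine"}
NUMBER = {"S": "singular", "P": "plural"}
CASE = {"1": "nominative", "2": "genitive", "3": "dative", "4": "accusative",
        "5": "vocative", "6": "locative", "7": "instrumental"}
PERSON = {"1": "1st", "2": "2nd", "3": "3rd"}

# Two patterns from Table 2.2: prefix -> ordered (category name, value table).
PATTERNS = {
    "N": [("gender", GENDER), ("number", NUMBER), ("case", CASE)],    # Ngnc
    "PP": [("person", PERSON), ("number", NUMBER), ("case", CASE)],   # PPpnc
}


def decode_tag(tag):
    """Decode a positional tag such as 'NFS3' into a dict of categories.

    Only the patterns in PATTERNS are handled; any other tag (for
    example NZ or PP3gnc, which need their own patterns) yields None.
    """
    # Try longer prefixes first so that 'PP' is matched before shorter ones.
    for prefix in sorted(PATTERNS, key=len, reverse=True):
        slots = PATTERNS[prefix]
        rest = tag[len(prefix):]
        if tag.startswith(prefix) and len(rest) == len(slots):
            if all(ch in table for ch, (_, table) in zip(rest, slots)):
                return {name: table[ch] for ch, (name, table) in zip(rest, slots)}
    return None


if __name__ == "__main__":
    print(decode_tag("NFS3"))
    print(decode_tag("PP1P4"))
```

Keeping the patterns in a table rather than in code means that further rows of Table 2.2 could be added without touching the decoding function.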
2.1.2 CZECH TRAINING DATA 
For training, we used the corpus collected dur- 
ing the 1960's and 1970's in the Institute for Czech 
Language at the Czechoslovak Academy of Sciences. 
The corpus was originally hand-tagged, including lemmatization and syntactic tags. We had to do some cleaning, which means that we disregarded the lemmatization information and the syntactic tags, as we were interested in words and tags only. The tags used in this corpus were different from our suggested tags: the number of morphological categories was higher in the original sample and the notation was also different. Thus we had to convert the original data into the format presented above, which resulted in the so-called Czech "modified" corpus, with the following features:
tokens 621 015 
words 72 445 
tags 1 171 
average number of tags per token 3.65 
Table 2.3 
We used the complete "modified" corpus (621 015 tokens) in experiments No. 1, No. 3 and No. 4, and a small part of this corpus in experiment No. 2, as indicated in Table 2.4.
tokens 110 874 
words 22 530 
tags 882 
average number of tags per token 2.36 
Table 2.4 
2.2 ENGLISH EXPERIMENTS 
2.2.1 ENGLISH TAGSET 
For the tagging of English texts, we used the 
Penn Treebank tagset which contains 36 POS tags 
and 12 other tags (for punctuation and the currency 
symbol). A detailed description is available in (San- 
torini, 1990). 
2.2.2 ENGLISH TRAINING DATA 
For training in the English experiments, we used WSJ (Marcus et al., 1993). We had to change the format of WSJ to prepare it for our tagging software. We used a small (100k tokens) part of WSJ in experiment No. 6 and the complete corpus (1M tokens) in experiments No. 5, No. 7 and No. 8. Table 2.5 contains the basic characteristics of the training data.

                                    Experiment    Experiments
                                    No. 6         No. 5, No. 7, No. 8
tokens                              110 530       1 287 749
words                               13 582        51 433
tags                                45            45
average number of tags per token    1.72          2.34
Table 2.5
2.3 CZECH VS ENGLISH 
Differences between Czech as a morphologically ambiguous inflective language and English as a language with poor inflection are also reflected in the number of tag bigrams and tag trigrams. The figures given in Tables 2.6 and 2.7 were obtained from the training files.
Czech corpus               WSJ
x<=4          24 064       x<=10           459
4<x<=16       5 577        10<x<=100       411
16<x<=64      2 706        100<x<=1000     358
x>64          1 581        x>1000          225
bigrams       33 928       bigrams         1 453
Table 2.6 Number of bigrams with frequency x

Czech corpus               WSJ
x<=4          155 399      x<=10           11 810
4<x<=16       16 371       10<x<=100       4 571
16<x<=64      4 380        100<x<=1000     1 645
x>64          933          x>1000          231
trigrams      177 083      trigrams        18 257
Table 2.7 Number of trigrams with frequency x
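Counts of the kind shown in Tables 2.6 and 2.7 can be obtained with a few lines of code. The Python sketch below is only an illustration under stated assumptions: the function names `ngram_counts` and `bucket` are hypothetical, the input is assumed to be a list of tag sequences (one per sentence), and the band boundaries are passed in explicitly (for example `[4, 16, 64]` for the Czech columns).

```python
from collections import Counter


def ngram_counts(tag_sequences, n):
    """Count tag n-grams over a list of tag sequences (one per sentence)."""
    counts = Counter()
    for tags in tag_sequences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def bucket(counts, bounds):
    """Return the number of distinct n-grams in each frequency band.

    bounds=[4, 16, 64] produces the bands x<=4, 4<x<=16, 16<x<=64, x>64.
    """
    bands = [0] * (len(bounds) + 1)
    for freq in counts.values():
        # The band index equals the number of boundaries the frequency exceeds.
        bands[sum(freq > b for b in bounds)] += 1
    return bands


if __name__ == "__main__":
    toy = [["T_SB", "Rdo", "NIS2"], ["T_SB", "Rdo", "NFS2"]]
    bigrams = ngram_counts(toy, 2)
    print(len(bigrams), bucket(bigrams, [1]))
```

On the toy input there are three distinct bigrams, two of which occur once and one of which occurs twice.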
It is interesting to note the frequencies of the 
most ambiguous tokens encountered in the whole 
"modified" corpus and to compare them with the 
English data. Table 2.8 and Table 2.9 contain the 
first tokens with the highest number of possible tags 
in the complete Czech "modified" corpus and in the 
complete WSJ. 
Token       Frequency         #tags
            in train. data    in train. data
jejich      1 087             51
jeho        1 087             46
jehož       163               35
jejichž     150               25
vedoucí     193               22
Table 2.8
In the Czech "modified" corpus, the token "vedoucí" appeared 193 times and was tagged by twenty-two different tags: 13 tags for adjective and 9 tags for noun. The token "vedoucí" means either "leading" (adjective) or "manager" or "boss" (noun). The following columns represent the tags for the token "vedoucí" and their frequencies in the training data; for example, "vedoucí" was tagged twice as adjective, feminine, plural, nominative, first degree, affirmative.

 2  vedoucí|AFP11A        10  vedoucí|NFS1
 4  vedoucí|AFP41A         1  vedoucí|NFS2
 6  vedoucí|AFS11A         1  vedoucí|NFS3
11  vedoucí|AFS21A         1  vedoucí|NFS4
 1  vedoucí|AFS31A         2  vedoucí|NFS7
 4  vedoucí|AFS41A        34  vedoucí|NMP1
 5  vedoucí|AFS71A        17  vedoucí|NMP4
 2  vedoucí|AIP11A        61  vedoucí|NMS1
11  vedoucí|AMP11A         1  vedoucí|NMS5
 3  vedoucí|AMP41A
12  vedoucí|AMS11A
 2  vedoucí|ANP11A
 2  vedoucí|ANS41A
Token       Frequency         #tags
            in train. data    in train. data
a           25 791            7
down        1 052             7
put         380               6
set         362               6
that        10 902            6
the         56 265            6
Table 2.9
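A ranking like the one in Tables 2.8 and 2.9 can be derived directly from a tagged corpus. The following Python sketch assumes the corpus is available as (word, tag) pairs; the function names `tag_ambiguity` and `most_ambiguous` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def tag_ambiguity(tagged_tokens):
    """Map each word form to (frequency, number of distinct tags seen)."""
    freq = defaultdict(int)
    tags = defaultdict(set)
    for word, tag in tagged_tokens:
        freq[word] += 1
        tags[word].add(tag)
    return {word: (freq[word], len(tags[word])) for word in freq}


def most_ambiguous(tagged_tokens, k=5):
    """Return the k word forms with the most distinct tags.

    Ties are broken alphabetically so that the result is deterministic.
    """
    stats = tag_ambiguity(tagged_tokens)
    return sorted(stats.items(), key=lambda item: (-item[1][1], item[0]))[:k]


if __name__ == "__main__":
    toy = [("vedoucí", "NMS1"), ("vedoucí", "AFS11A"),
           ("vedoucí", "NMS1"), ("do", "Rdo")]
    print(most_ambiguous(toy, k=2))
```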
It is clear from these figures that the two lan- 
guages in question have quite different properties 
and that nothing can be said without really going 
through an experiment. 
2.4 THE ALGORITHM 
We have used the basic source channel model (described e.g. in (Merialdo, 1992)). The tagging procedure $\phi$ selects a sequence of tags $T$ for the sentence $W$: $\phi : W \rightarrow T$. In this case the optimal tagging procedure is

$$
\begin{aligned}
\phi(W) &= \arg\max_T \Pr(T \mid W) \\
        &= \arg\max_T \Pr(T \mid W) \cdot \Pr(W) \\
        &= \arg\max_T \Pr(W, T) \\
        &= \arg\max_T \Pr(W \mid T) \cdot \Pr(T).
\end{aligned}
$$

Our implementation is based on generating the $(W, T)$ pairs by means of a probabilistic model using approximations of the probability distributions $\Pr(W \mid T)$ and $\Pr(T)$. $\Pr(T)$ is based on tag bigrams and trigrams, and $\Pr(W \mid T)$ is approximated as the product of $\Pr(w_i \mid t_i)$. The parameters have been estimated by the usual maximum likelihood training method, i.e. we approximated them as the relative frequencies found in the training data, with smoothing based on estimated unigram probability and uniform distributions.
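The model can be illustrated with a small bigram tagger. The Python sketch below is not the implementation used in the experiments: the class name `BigramTagger`, the interpolation weights `lambdas` and `mu`, and the sentence-start symbol `<s>` are assumptions made for illustration. It estimates relative frequencies from tagged sentences, interpolates the tag-bigram estimate with the unigram and uniform distributions, and finds the best tag sequence by Viterbi search.

```python
import math
from collections import Counter


class BigramTagger:
    """Source-channel tagger with a tag-bigram model (illustrative sketch)."""

    def __init__(self, tagged_sentences, lambdas=(0.8, 0.15, 0.05), mu=0.99):
        self.l2, self.l1, self.l0 = lambdas   # assumed interpolation weights
        self.mu = mu                          # assumed weight of f(w|t)
        self.uni, self.bi, self.emit = Counter(), Counter(), Counter()
        self.ctx = Counter()                  # how often a tag occurs as left context
        vocab = set()
        for sent in tagged_sentences:
            prev = "<s>"
            for word, tag in sent:
                self.uni[tag] += 1
                self.bi[prev, tag] += 1
                self.ctx[prev] += 1
                self.emit[word, tag] += 1
                vocab.add(word)
                prev = tag
        self.tags = sorted(self.uni)
        self.n_tokens = sum(self.uni.values())
        self.n_words = len(vocab) + 1         # +1 leaves room for unseen words

    def p_trans(self, prev, tag):
        """Pr(tag | prev): bigram, unigram and uniform estimates interpolated."""
        rel = self.bi[prev, tag] / self.ctx[prev] if self.ctx[prev] else 0.0
        return (self.l2 * rel + self.l1 * self.uni[tag] / self.n_tokens
                + self.l0 / len(self.tags))

    def p_emit(self, word, tag):
        """Pr(word | tag): relative frequency smoothed with a uniform term."""
        rel = self.emit[word, tag] / self.uni[tag]
        return self.mu * rel + (1 - self.mu) / self.n_words

    def tag(self, words):
        """Viterbi search for argmax_T Pr(W|T) * Pr(T), computed in log space."""
        if not words:
            return []
        best = {t: (math.log(self.p_trans("<s>", t))
                    + math.log(self.p_emit(words[0], t)), [t])
                for t in self.tags}
        for word in words[1:]:
            new = {}
            for t in self.tags:
                score, path = max(
                    (best[p][0] + math.log(self.p_trans(p, t)), best[p][1])
                    for p in self.tags)
                new[t] = (score + math.log(self.p_emit(word, t)), path + [t])
            best = new
        return max(best.values())[1]


if __name__ == "__main__":
    train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
             [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
    print(BigramTagger(train).tag(["the", "dog", "sleeps"]))
```

A trigram variant would condition each transition on the two preceding tags, which multiplies the number of parameters and is the reason the sparse trigram counts of Table 2.7 matter.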
2.5 THE RESULTS 
The results of the Czech experiments are displayed in Table 2.10.

                      No. 1      No. 2     No. 3     No. 4
test data (tokens)    1 294      1 294     1 294     1 294
prob. model           unigram    bigram    bigram    trigram
incorrect tags        444        334       239       244
tagging accuracy      65.70%     74.19%    81.53%    81.14%
Table 2.10
These results show, not surprisingly, that the more data, the better (compare the results of experiments No. 2 and No. 3), but in order to get better results for a trigram tag prediction model, we would need far more data. Clearly, if 88% of the trigrams occur four times or less, then the statistics are not reliable. The following tables show a detailed analysis of the errors of the trigram experiment.
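The figures quoted here follow from simple arithmetic on the tables. As a short check in Python (the variable and function names are illustrative only), the accuracies of experiments No. 2 to No. 4 are recomputed from the error counts of Table 2.10, and the share of rare Czech trigrams from Table 2.7:

```python
def accuracy(test_tokens, errors):
    """Tagging accuracy in percent."""
    return 100.0 * (test_tokens - errors) / test_tokens


# Error counts from Table 2.10; the test file has 1 294 tokens.
czech_errors = {"No. 2": 334, "No. 3": 239, "No. 4": 244}
czech_accuracy = {exp: round(accuracy(1294, err), 2)
                  for exp, err in czech_errors.items()}

# Table 2.7: 155 399 of the 177 083 distinct Czech trigrams occur at most 4 times.
rare_trigram_share = round(100.0 * 155399 / 177083, 1)

if __name__ == "__main__":
    print(czech_accuracy)        # {'No. 2': 74.19, 'No. 3': 81.53, 'No. 4': 81.14}
    print(rare_trigram_share)    # 87.8
```

The second figure is the "88%" referred to above.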
      A    C    F    K    N    O
A    32    0    0    0    6    3
C     0    4    0    0    1    0
F     0    0    0    0    0    0
K     0    0    0    0    0    0
N     4    0    0    0   64    8
O     0    0    0    0    1    0
P     0    0    0    0    0    3
R     0    0    0    0    1    1
S     0    0    0    0    0    0
V     0    0    0    0    3    8
T     0    0    0    0    1    0
X     0    0    0    0    0    0
Table 2.11a

      P    R    S    V    T    X
A     2    2    2    2    1    0    50
C     0    0    0    0    0    0     5
F     0    0    0    0    0    0     0
K     0    0    1    0    0    1     2
N     0    4    2    2    5    4    93
O     0    0    0    1    1    0     3
P    19    0    0    0    1    2    23
R     0    0    0    0    0    2     4
S     0    0    0    0    0    2     2
V     0    3    8   28    1    2    53
T     0    0    0    0    0    0     1
X     5    0    1    2    0    0     8
Table 2.11b
The letters in the first column and row denote POS classes, punctuation (T) and the "unknown tag" (X). The numbers show how many times the tagger assigned an incorrect tag to a token in the test file; rows correspond to the correct POS class and columns to the class assigned by the tagger. The total number of errors was 244. Altogether, adjectives (A) were tagged incorrectly 50 times, nouns (N) 93 times, numerals (C) 5 times, etc. (see the last unmarked column in Table 2.11b). To provide a better insight, we should add that in 32 cases the adjective was correctly tagged as an adjective, but mistakes appeared in the assignment of morphological categories (see Table 2.12); 6 times the adjective was tagged as a noun, twice as a pronoun, 3 times as an adverb and so on (see the row for A in Table 2.11a). A detailed look at Table 2.12 reveals that for the 32 correctly marked adjectives the mistake was 17 times in gender, once in number, three times in gender and case simultaneously and so on.
A     g     n    c    g&c    g&?    c&?    g&n&c    g&c&d
32    17    1    6    3      2      1      1        1
Table 2.12

Similar tables can be provided for nouns (Table 2.13), numerals (Table 2.14), pronouns (Table 2.15) and verbs (Table 2.16a, Table 2.16b).

N     g     n    c     g&c    n&c    ->NZ
64    11    5    41    2      4      1
Table 2.13

C    g    c
4    1    3
Table 2.14

P     g    c    g&c    VD->PV
19    8    7    3      1
Table 2.15

V     p    t    n    s    n&t    p&t    t&a
22    3    6    5    5    1      1      1
Table 2.16a

V    g&a    p&n&t    V->VT
6    1      1        4
Table 2.16b
The results of our experiments with English are displayed in Table 2.17.

                      No. 5      No. 6     No. 7     No. 8
test data (tokens)    1 294      1 294     1 294     1 294
prob. model           unigram    bigram    bigram    trigram
incorrect tags        136        81        41        37
tagging accuracy      89.5%      93.74%    96.83%    97.14%
Table 2.17
To illustrate the results of our tagging experiments, we present here short examples taken from the test data. Cases of incorrect tag assignment (in boldface in the original) are marked with an asterisk.

-- Czech
word|hand tag      exp.      exp.      exp.       exp.
                   No.4      No.3      No.2       No.1²
na|Rna             Rna       Rna       Rna        Rna
půdě|NFS6          NFS6      NFS6      NFS6       NFS6
vlasti|NFS2        NFS2      NFS2      NFS2       NFS2
rady|NFS2          NFS2      NFS2      NFS2       NFS2
žen|NFP2           NFP2      NFP2      NFP2       NFP2
Gusta|NFS1         *T_SB     *T_SB     *AFP21A    *XX
Fučíková|NFS1      NFS1      NFS1      *NFP2      NFS1
a|SS               SS        SS        SS         SS
předseda|NMS1      NMS1      NMS1      NMS1       NMS1
úv|NZ              NZ        NZ        NZ         NZ
ssm|NZ             NZ        NZ        NZ         NZ
Juraj|NMS1         NMS1      NMS1      NMS1       *XX
Varholík|NMS1      NMS1      NMS1      NMS1       NMS1

-- English
word|hand tag      exp.      exp.      exp.       exp.
                   No.8      No.7      No.6       No.5
With|IN            IN        IN        IN         IN
stock|NN           NN        NN        NN         NN
prices|NNS         NNS       NNS       NNS        NNS
hovering|VBG       VBG       VBG       *IN        VBG
near|IN            IN        IN        *JJ        IN
record|NN          NN        NN        NN         NN
levels|NNS         NNS       NNS       NNS        NNS
,|,
a|DT               DT        DT        DT         DT
number|NN          NN        NN        NN         NN
of|IN              IN        IN        IN         IN
companies|NNS      NNS       NNS       NNS        NNS
have|VBP           VBP       VBP       VBP        VBP
been|VBN           VBN       VBN       VBN        VBN
announcing|VBG     VBG       VBG       *IN        VBG
stock|NN           NN        NN        NN         NN
splits|NNS         NNS       *VBZ      *NN        *VBZ
.|.
2.6 A PROTOTYPE OF RANK XEROX 
POS TAGGER FOR CZECH 
(Schiller, 1996) describes the general architecture of 
the tool for noun phrase mark-up based on finite- 
state techniques and statistical part-of-speech dis- 
ambiguation for seven European languages. For 
Czech, we created a prototype of the first step of 
this process -- the part-of-speech (POS) tagger -- 
using Rank Xerox tools (Tapanainen, 1995), (Cut- 
ting et al., 1992). 
2.6.1 POS TAGSET 
The first step of POS tagging is obviously the definition of the POS tags. We performed three experiments, which differ in the POS tagset. For the first experiment we designed a tagset which contains 47 tags. The POS tagset can be described as follows:

Category        Symbol    Poss. Value    Description
case            c         NOM            nominative
                          GEN            genitive
                          DAT            dative
                          ACC            accusative
                          VOC            vocative
                          LOC            locative
                          INS            instrumental
                          INV            invariant
kind of verb    k         PAP            past participle
                          PRI            present participle
                          INF            infinitive
                          IMP            imperative
                          TRA            transgressive
Table 2.18

POS tag    Description
NOUN_c     nouns + case
ADJ_c      adjectives + case
PRON_c     pronouns + case
NUM_c      numerals + case
VERB_k     verbs + kind of verb
ADV        adverbs
PROP       proper names
PREP       prepositions
PSE        reflexive particles "se"
CLIT       clitics
CONJ       conjunctions
INTJ       interjections
PTCL       particles
DATE       dates
CM         comma
PUNCT      punctuation
SENT       sentence boundaries
Table 2.19

The analysis of the results of the first experiment showed very high ambiguity between the nominative and accusative cases of nouns, adjectives, pronouns and numerals. That is why we replaced the tags for nominative and accusative of nouns, adjectives, pronouns and numerals by the new tags NOUN_NA, ADJ_NA, PRON_NA and NUM_NA (meaning nominative or accusative, undistinguished). The rest of the tags stayed unchanged. This led to 43 POS tags. In the third experiment we deleted the morphological information for nouns and adjectives altogether. This process resulted in the final 34 POS tags.

2 We used a special tag XX for unknown words.

2.6.2 RESULTS
The results of all experiments are presented in the following table. We have also included the results of English tagging using the same Xerox tools.

language    tags    ambiguity³    tagging accuracy
Czech       47      39%           91.7%
Czech       43      36%           93.0%
Czech       34      14%           96.2%
English     76      36%           97.8%
Table 2.20

The results show that the more radical the reduction of the Czech tagset (from 1171 to 34 tags), the higher the accuracy and the more comparable the Czech and English results. However, the difference in the error rate is still clearly visible; here we can speculate that the reason is that Czech is a "free" word order language, whereas English is not.
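The two reduction steps of Section 2.6.1 (merging nominative and accusative, then dropping case on nouns and adjectives) amount to a simple mapping over tag names. The Python sketch below assumes that the first-experiment tags are spelled as a POS name, an underscore and a value (e.g. `NOUN_GEN`), and that the merged tags are spelled with the suffix `_NA`; both spellings, and the function name `reduce_tag`, are assumptions made for illustration.

```python
CASE_MARKED = ("NOUN", "ADJ", "PRON", "NUM")   # classes that carry a case value


def reduce_tag(tag, level):
    """Collapse a first-experiment tag to one of the coarser tagsets.

    level 1: tag unchanged (first experiment)
    level 2: nominative and accusative merged (second experiment)
    level 3: additionally, no case on nouns and adjectives (third experiment)
    """
    pos, _, value = tag.partition("_")
    if level >= 3 and pos in ("NOUN", "ADJ"):
        return pos
    if level >= 2 and pos in CASE_MARKED and value in ("NOM", "ACC"):
        return pos + "_NA"
    return tag


if __name__ == "__main__":
    for t in ("NOUN_NOM", "NOUN_GEN", "PRON_ACC", "VERB_INF", "ADV"):
        print(t, reduce_tag(t, 2), reduce_tag(t, 3))
```

Because the mapping is many-to-one, it can be applied to an already tagged corpus to retrain a tagger on the smaller tagset without re-annotation.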
3 A RULE-BASED EXPERIMENT 
FOR CZECH 
A simple rule-based part of speech (RBPOS) tagger is introduced in (Brill, 1992). The accuracy of this tagger for English is comparable to that of a stochastic English POS tagger. From our point of view, it is very interesting to compare the results of the Czech stochastic POS (SPOS) tagger with those of a modified RBPOS tagger for Czech.
3.1 TRAINING DATA 
We used the same corpus used in the case of the 
SPOS tagger for Czech. RBPOS requires different 
input format; we thus converted the whole corpus 
into this format, preserving the original contents. 
3.2 LEARNING 
The Czech tagset is obviously totally different from the English tagset. Therefore, we had to modify the method for the initial guess. For Czech the algorithm is: "If the word is W_SB (sentence boundary), assign the tag T_SB; otherwise assign the tag NNS1."
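The initial guess is simple enough to state in a few lines. In the Python sketch below, the default tag is written `NNS1` on the assumption that the tag in the rule above is the noun, neuter, singular, nominative tag of Table 2.2. The optional `lexicon` argument (most frequent tag of each known word, as in Brill's tagger) and the function name `initial_guess` are additions made for illustration and are not described in this section.

```python
def initial_guess(words, lexicon=None, default_tag="NNS1"):
    """Assign a start tag to every word before any rules are applied.

    Sentence-boundary markers get T_SB; known words get their lexicon
    tag (an assumed extension); everything else gets the default tag.
    """
    lexicon = lexicon or {}
    tags = []
    for word in words:
        if word == "W_SB":
            tags.append("T_SB")
        else:
            tags.append(lexicon.get(word, default_tag))
    return tags


if __name__ == "__main__":
    print(initial_guess(["W_SB", "do", "oddílu"], lexicon={"do": "Rdo"}))
```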
3.2.1 LEARNING RULES TO PREDICT 
THE MOST LIKELY TAG FOR 
UNKNOWN WORDS 
The first stage of training is learning rules to predict the most likely tag for unknown words. These rules operate on word types; for example, if a word ends in "ý", it is probably a masculine adjective. To compare the influence of the size of the training files on the accuracy of the tagger, we performed two subexperiments4:

3 The percentage of ambiguous word forms in the test file.
TAGGED-CORPUS 
(tokens) 
TAGGED-CORPUS 
(words) 
TAGGED-CORPUS 
(tags) 
No. 1 No. 2 
15 297 5 031 
738 495 
UNTAGGED-CORPUS (tokens) 621 015 621 015
UNTAGGED-CORPUS (words) 72 445 72 445
LEXRULEOUTFILE (rules) 101 75
Table 3.1 
We present here an example of rules taken from 
LEXRULEOUTFILE from the exp. No. 1: 
u hassuf 1 NIS2       # change the tag to NIS2 if the suffix is "u"
y hassuf 1 NFS2       # change the tag to NFS2 if the suffix is "y"
ho hassuf 2 AIS21A    # change the tag to AIS21A if the suffix is "ho"
ách hassuf 3 NFP6     # change the tag to NFP6 if the suffix is "ách"
nej addpref 3 O2A     # change the tag to O2A if adding the prefix "nej" results in a word
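The effect of such lexical rules can be sketched as follows. This Python fragment is a simplified illustration, not Brill's implementation: it handles only the two rule types shown above (`hassuf` and `addpref`), applies the rules in file order so that later rules override earlier ones, and uses hypothetical names (`parse_lexical_rule`, `apply_lexical_rules`). The suffix "ách" in the sample rule is an assumed reading of the example.

```python
def parse_lexical_rule(line):
    """Parse a line such as 'u hassuf 1 NIS2 # comment' into a tuple."""
    affix, kind, length, new_tag = line.split("#")[0].split()
    return affix, kind, int(length), new_tag


def apply_lexical_rules(word, tag, rules, vocabulary):
    """Return the tag of an unknown word after applying all rules in order.

    hassuf:  the rule fires if the word ends with the given suffix.
    addpref: the rule fires if prefix + word is itself a known word.
    """
    for affix, kind, _length, new_tag in rules:
        if kind == "hassuf" and word.endswith(affix):
            tag = new_tag
        elif kind == "addpref" and (affix + word) in vocabulary:
            tag = new_tag
    return tag


if __name__ == "__main__":
    lines = ["u hassuf 1 NIS2", "y hassuf 1 NFS2", "ho hassuf 2 AIS21A",
             "ách hassuf 3 NFP6", "nej addpref 3 O2A"]
    rules = [parse_lexical_rule(line) for line in lines]
    vocabulary = {"nejlépe"}
    for w in ("oddílu", "nového", "lépe"):
        print(w, apply_lexical_rules(w, "NNS1", rules, vocabulary))
```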
3.2.2 LEARNING CONTEXTUAL CUES 
The second stage of training is learning rules to 
improve tagging accuracy based on contextual cues. 
These rules operate on individual word tokens. 
4We use the same names of files and variables as 
Eric Brill in the rule-based POS tagger's documenta- 
tion. TAGGED-CORPUS -- manually tagged train- 
ing corpus, UNTAGGED-CORPUS -- collection of 
all untagged texts, LEXRULEOUTFILE -- the list 
of transformations to determine the most likely tag 
for unknown words, TAGGED-CORPUS-2 -- manually 
tagged training corpus, TAGGED-CORPUS-ENTIRE 
-- Czech "modified" corpus (the entire manually tagged 
corpus), CONTEXT-RULEFILE -- the list of transfor- 
mations to improve accuracy based on contextual cues. 
                                 No. 1      No. 2
TAGGED-CORPUS-2 (tokens)         37 892     9 989
TAGGED-CORPUS-2 (words)          12 676     4 635
TAGGED-CORPUS-2 (tags)           717        479
TAGGED-ENTIRE-CORPUS (tokens)    621 015    621 015
TAGGED-ENTIRE-CORPUS (words)     72 445     72 445
TAGGED-ENTIRE-CORPUS (tags)      1 171      1 171
CONTEXT-RULEFILE (rules)         487        61
Table 3.2
We present here an example of the rules taken 
from CONTEXT-RULEFILE from the exp. No. 1: 
AFP21A AIP21A NEXT1OR2TAG NIP2    # change the tag AFP21A to AIP21A if the following tag is NIP2
NIS2 NIS6 PREV1OR2OR3TAG Rv       # change the tag NIS2 to NIS6 if the preceding tag is Rv
NIS1 NIS4 PREV1OR2TAG Rna         # change the tag NIS1 to NIS4 if the preceding tag is Rna
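A minimal Python sketch of how contextual rules of this form rewrite a tag sequence is given below. It covers only the tag-window templates that appear in the examples and their one-step variants, interprets `PREV1OR2TAG` as "one of the two preceding tags" (and analogously for the others), and checks each rule's context on the sequence as it was before that rule started rewriting; this last choice and the names `WINDOWS` and `apply_contextual_rules` are assumptions of the sketch.

```python
# Relative positions inspected by each template (inclusive ranges).
WINDOWS = {
    "PREVTAG": (-1, -1), "PREV1OR2TAG": (-2, -1), "PREV1OR2OR3TAG": (-3, -1),
    "NEXTTAG": (1, 1), "NEXT1OR2TAG": (1, 2), "NEXT1OR2OR3TAG": (1, 3),
}


def apply_contextual_rules(tags, rules):
    """Apply (old_tag, new_tag, template, trigger_tag) rules in order."""
    tags = list(tags)
    for old, new, template, trigger in rules:
        lo, hi = WINDOWS[template]
        snapshot = list(tags)          # contexts are read from this copy
        for i, current in enumerate(snapshot):
            if current != old:
                continue
            context = [snapshot[j] for j in range(i + lo, i + hi + 1)
                       if 0 <= j < len(snapshot)]
            if trigger in context:
                tags[i] = new
    return tags


if __name__ == "__main__":
    rules = [("AFP21A", "AIP21A", "NEXT1OR2TAG", "NIP2"),
             ("NIS2", "NIS6", "PREV1OR2OR3TAG", "Rv"),
             ("NIS1", "NIS4", "PREV1OR2TAG", "Rna")]
    print(apply_contextual_rules(["Rv", "AIS61A", "NIS2"], rules))
    print(apply_contextual_rules(["Rna", "NIS1"], rules))
```

In the first call the preposition tag Rv two positions to the left triggers the change of NIS2 to NIS6; in the second, Rna immediately to the left triggers the change of NIS1 to NIS4.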
3.2.3 RESULTS 
The tagger was tested on the same test file as for the statistical experiments. We obtained the following results:

                    No. 1     No. 2
TEST-FILE           1 294     1 294
errors              262       294
tagging accuracy    79.75%    77.28%
Table 3.3
4 CONCLUSION 
The results, though they might seem negative compared to English, are still better than our original expectations. Before trying some completely different approach, we would like to improve the current simple approach by some other simple measures: adding a morphological analyzer (Hajič, 1994) as a front-end to the tagger (serving as a "supplier" of possible tags, instead of just taking all tags occurring in the training data for a given token), simplifying the tagset, and adding more data. However, the desired positive effect of some of these measures is not guaranteed: for example, the average number of tags per token will increase after a morphological analyzer is added. Success should be guaranteed, however, by certain tagset reductions, as the original tagset (even after the reductions mentioned above) is still too detailed. This is especially true when comparing it to English, where some tags represent, in fact, a set of tags to be discriminated later (if ever). For example, the tag VB used in the WSJ corpus actually means "one of the (five different) tags for 1st person sg., 2nd person sg., 1st person pl., etc.". First, we will reduce the tagset to correspond to our morphological analyzer, which already uses a reduced one. Then, the tagset will be reduced even further, but not as much as we did for the Xerox-tools-based experiment, because that tagset is too "rough" for many applications, even though the results are good.
Regarding tagset reduction, we should note that we have not performed a "combined" experiment, i.e. using the full (1100+) tagset for (thus) "intermediate" tagging, but only the reduced tagset for the final results. However, it can be quite simply derived from Tables 2.10, 2.11a and 2.11b that the error rate would not drop much: it would remain high at about 6.5% (based on the results of experiment No. 4) using the very small tagset of 12 (= number of lines in Table 2.11a) tags used for part of speech identification. This is even much higher than the error rate reported here for the smallest tagset used in the "pure" experiment (Sect. 2.6, Table 2.20), which was 3.8%. This suggests that the pure methods (which are obviously also simple to implement) may in general be better than the "combined" methods.
Another possible improvement is to add more data to allow for more reliable trigram estimates. We will also add contemporary newspaper texts to our training data in order to account for recent language development. Hedging against failure of all these simple improvements, we are also working on a different model using independent predictions for certain grammatical categories (and the lemma itself), but the final shape of the model has not yet been determined. This would mean introducing constraints on possible combinations of morphological categories and taking them into account when "assembling" the final tag.
ACKNOWLEDGMENTS: The authors wish to thank Eva Hajičová for her comments and suggestions and Eric Brill, Jean-Pierre Chanod and Anne Schiller who made their software tools available.

References

Eric Brill. 1992. A Simple Rule-Based Part of Speech Tagger. In: Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.

Eric Brill. 1993. A Corpus Based Approach To Language Learning. PhD Dissertation, Department of Computer and Information Science, University of Pennsylvania.

Eric Brill. 1994. Some Advances in Transformation-Based Part of Speech Tagging. In: Proceedings of the Twelfth National Conference on Artificial Intelligence.

Jan Hajič. 1994. Unification Morphology Grammar. PhD Dissertation, Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic.

Kenneth W. Church. 1992. Current Practice In Part Of Speech Tagging And Suggestions For The Future. For Henry Kucera, Studies in Slavic Philology and Computational Linguistics, Michigan Slavic Publications, Ann Arbor.

Doug Cutting, Julian Kupiec, Jan Pedersen and Penelope Sibun. 1992. A Practical Part-of-Speech Tagger. In: Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building A Large Annotated Corpus Of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

Bernard Merialdo. 1992. Tagging Text With A Probabilistic Model. Computational Linguistics, 20(2):155-171.

Beatrice Santorini. 1990. Part Of Speech Tagging Guidelines For The Penn Treebank Project. Technical report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.

Anne Schiller. 1996. Multilingual Finite-State Noun Phrase Extraction. ECAI'96, Budapest, Hungary.

Petr Sgall. 1967. The Generative Description of a Language and the Czech Declension (In Czech). Studie a práce lingvistické, 6. Prague.

Pasi Tapanainen. 1995. RXRC Finite-State Compiler. Technical Report MLTT-20, Rank Xerox Research Center, Meylan, France.
