PROBABILISTIC TAGGING WITH FEATURE STRUCTURES
André Kempe
University of Stuttgart, Institute for Computational Linguistics,
Azenbergstraße 12, 70174 Stuttgart, Germany, kempe@ims.uni-stuttgart.de
Abstract 
The described tagger is based on a hidden Markov
model and uses tags composed of features such as
part-of-speech, gender, etc. The contextual probability of a
tag (state transition probability) is deduced from the
contextual probabilities of its feature-value-pairs.
This approach is advantageous when the available 
training corpus is small and the tag set large, which 
can be the case with morphologically rich languages. 
1 INTRODUCTION 
The present article describes a probabilistic tagger
which is based on a hidden Markov model (HMM) (Rabiner,
1990) and employs tags which are feature structures.
Their features concern part-of-speech (POS), gender,
number, etc. and have only atomic values.
Usually, the contextual probability of a tag (state
transition probability) is estimated by dividing a trigram
frequency by a bigram frequency (second order HMM).
With a large tag set, resulting from the fact that the
tags contain a lot of morphological information besides
the POS, and with only a small training corpus
available, most of these frequencies are too low for an
exact estimation of contextual probabilities.
Our feature structure tagger estimates these
probabilities by connecting contextual probabilities of the
single feature-value-pairs (fv-pairs) of the tags (cf. sec. 2).
The starting point for the implementation of the
feature structure tagger was a second-order-HMM tagger
(trigrams) based on a modified version of the Viterbi
algorithm (Viterbi, 1967; Church, 1988) which we had
earlier implemented in C (Kempe, 1994). There we
modified the calculus of the contextual probabilities
of the tags in the above-described way (cf. sec. 4).
A test of both taggers under the same conditions on
a French corpus¹ has shown that the feature structure
tagger is clearly better when the available training
corpus is small and the tag set is large but the tags are
decomposable into relatively few fv-pairs. The latter
can be the case with morphologically rich languages
when the tags contain a lot of morphological
information (cf. sec. 5).

¹ I am much obliged to Achim Stein and Leo Wanner, Romance
Dept., Univ. Stuttgart, Germany, for providing the corpus
and a dictionary.
2 MATHEMATICAL BACKGROUND
In order to assign tags to a word sequence, an HMM can
be used where the tagger selects among all possible
tag sequences the most probable one (Garside, Leech
and Sampson, 1987; Church, 1988; Brown et al., 1989;
Rabiner, 1990). The joint probability of a tag sequence
$T = t_0 \ldots t_{N-1}$ given a word sequence $W = w_0 \ldots w_{N-1}$ is
in the case of a second order HMM:

$$p(T, W) = \pi_{t_0 t_1} \cdot p(w_0|t_0) \cdot p(w_1|t_1) \cdot \prod_{i=2}^{N-1} p(w_i|t_i) \, p(t_i|t_{i-2}\, t_{i-1}) \tag{1}$$
The term $\pi_{t_0 t_1}$ stands for the initial state
probability, i.e. the probability that the sequence begins with
the first two tags. $N$ is the number of words in the
sequence, i.e. the corpus size. The term $p(w_i|t_i)$ is the
probability of a word $w_i$ in the context of the assigned
tag $t_i$. It is called observation symbol probability
(lexical probability) and can be estimated by:

$$p(w_i|t_i) = \frac{f(w_i\, t_i)}{f(t_i)} \tag{2}$$
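As an illustration of formula (2), the following minimal sketch estimates lexical probabilities by relative frequency. The function name `lexical_probabilities` and the toy corpus are assumptions made for this example only; they are not taken from the tagger described here.

```python
from collections import Counter

def lexical_probabilities(tagged_words):
    """Estimate p(w|t) = f(w, t) / f(t) from a list of (word, tag) pairs."""
    pair_freq = Counter(tagged_words)                    # f(w, t)
    tag_freq = Counter(tag for _, tag in tagged_words)   # f(t)
    return {(w, t): n / tag_freq[t] for (w, t), n in pair_freq.items()}

# Hypothetical toy corpus, for illustration only.
toy_corpus = [("la", "DET"), ("porte", "NOUN"), ("la", "DET"),
              ("le", "DET"), ("la", "PRON")]
lex = lexical_probabilities(toy_corpus)
```

In this toy corpus the tag DET occurs three times, twice with the word "la", so `lex[("la", "DET")]` is 2/3.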
The second order state transition probability
(contextual probability) $p(t_i|t_{i-2}\, t_{i-1})$ in formula (1)
expresses how probable it is that the tag $t_i$ appears in
the context of its two preceding tags $t_{i-2}$ and $t_{i-1}$. It
is usually estimated as the ratio of the frequency of
the trigram $(t_{i-2}, t_{i-1}, t_i)$ in a given training corpus
to the frequency of the bigram $(t_{i-2}, t_{i-1})$ in the same
corpus:

$$p(t_i|t_{i-2}\, t_{i-1}) = \frac{f(t_{i-2}\, t_{i-1}\, t_i)}{f(t_{i-2}\, t_{i-1})} \tag{3}$$
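The trigram estimate of formula (3) can be sketched as follows. This is a minimal illustration under the assumption that the training corpus is available as a flat list of tags; the function name and the toy tag sequence are hypothetical.

```python
from collections import Counter

def transition_probabilities(tags):
    """Estimate p(t_i | t_{i-2} t_{i-1}) as trigram frequency / bigram frequency."""
    trigram_freq = Counter(zip(tags, tags[1:], tags[2:]))
    bigram_freq = Counter(zip(tags, tags[1:]))
    return {(a, b, c): n / bigram_freq[(a, b)]
            for (a, b, c), n in trigram_freq.items()}

# Hypothetical tag sequence, for illustration only.
toy_tags = ["DET", "NOUN", "ADJ", "DET", "NOUN", "VERB"]
trans = transition_probabilities(toy_tags)
```

Here the bigram (DET, NOUN) occurs twice and is followed once by ADJ and once by VERB, so both transitions receive 0.5. With frequencies this small the estimate is exactly the kind of value that cannot be trusted.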
With a large tag set and a relatively small
hand-tagged training corpus, formula (3) has an important
disadvantage: the majority of transition probabilities
cannot be estimated exactly because most of the
possible trigrams (sequences of three consecutive tags) will
not appear at all or only a few times².
In our example we have a French training corpus
of 10,000 words tagged with a set of 386 different
tags which could form $386^3$ = 57,512,456 different
trigrams, but because of the corpus size no more
than 10,000-2 trigrams can appear. Actually, their
number was only 4,815, i.e. 0.008 % of all possible
ones, because some of them appeared more than once
(table 1).

² A detailed description of problems caused by small and zero
frequencies was given by Gale and Church (1989).
frequency    number and percentage
range        of trigrams in the range
≥ 128            1   (0.021 %)
64 - 127         2   (0.042 %)
32 - 63         13   (0.27 %)
16 - 31         43   (0.89 %)
 8 - 15        119   (2.5 %)
 4 - 7         282   (5.9 %)
 2 - 3         860   (18 %)
 1           3,495   (73 %)
sum          4,815   (100 %)

Table 1: Trigram count from a French training
corpus of 10,000 words
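The sparsity figures quoted above can be re-derived with a few lines of arithmetic. The sketch below uses only numbers stated in the text and in table 1; the variable names are chosen for this illustration.

```python
n_tags = 386                 # size of the tag set
observed_trigrams = 4815     # distinct trigrams found in the 10,000-word corpus
singletons = 3495            # trigrams seen exactly once (table 1)

possible_trigrams = n_tags ** 3
coverage_percent = 100 * observed_trigrams / possible_trigrams
singleton_percent = 100 * singletons / observed_trigrams

print(possible_trigrams)            # number of conceivable tag trigrams
print(round(coverage_percent, 3))   # share of them actually observed, in %
print(round(singleton_percent))     # share of observed trigrams seen only once, in %
```

Less than one hundredth of a percent of the conceivable trigrams is observed, and roughly three quarters of the observed ones occur only once.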
When we divide e.g. a trigram frequency 1 by a
bigram frequency 2 according to formula (3) we get
the probability p=0.5 but we cannot trust it to be
exact because the frequencies it is based on are too
small.
We can take advantage of the fact that the 386 tags
are constituted by only 57 different fv-pairs concerning
POS, gender, number, etc. If we consider probabilistic
relations between single fv-pairs then we get higher
frequencies (fig. 2) and the resulting probabilities are
more exact.
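From the equations

$$\{t_i\} = \{e_{i0} \cap e_{i1} \cap \ldots \cap e_{i,n-1}\} = \left\{ \bigcap_{k=0}^{n-1} e_{ik} \right\} \tag{4}$$

where $t_i$ means a tag and the $e_{ik}$ symbolize its fv-pairs
and

$$p\left( \bigcap_{k=0}^{n-1} e_{ik} \,\middle|\, C_i \right) = p(e_{i0}|C_i) \cdot p(e_{i1}|C_i \cap e_{i0}) \cdot \ldots \cdot p\left( e_{i,n-1} \,\middle|\, C_i \cap \bigcap_{k=0}^{n-2} e_{ik} \right) \tag{5}$$

where $C_i$ means the context of $t_i$ and contains the tags
$t_{i-2}$ and $t_{i-1}$, follows

$$p(t_i|C_i) = p(e_{i0}|C_i) \cdot \prod_{k=1}^{n-1} p\left( e_{ik} \,\middle|\, C_i \cap \bigcap_{j=0}^{k-1} e_{ij} \right) \tag{6}$$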
The latter formula³ describes the relation between
the contextual probability of a tag and the contextual
probabilities of its fv-pairs.
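Formula (6) amounts to multiplying the contextual probabilities of the single fv-pairs. The sketch below illustrates this with the three values $p_0$, $p_1$, $p_2$ that appear in figure 2; the function name `tag_probability` is an assumption of this example.

```python
import math

def tag_probability(fv_pair_probabilities):
    """Contextual tag probability as the product of the contextual
    probabilities of the tag's fv-pairs (formula 6)."""
    return math.prod(fv_pair_probabilities)

p0 = 170 / 174   # 0gen:FEM in its reduced context (figure 2)
p1 = 1.0         # 0num:SG in its reduced context (figure 2)
p2 = 69 / 465    # 0pos:ADJ in its reduced context (figure 2)
p_tag = tag_probability([p0, p1, p2])
```

The product is about 0.145, close to the directly counted 44/298 = 0.148. A single fv-pair with probability 0 makes the whole product 0, and one with probability 1 leaves it unchanged.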
The unification of morphological features inside
a noun phrase is accomplished indirectly. In a
given context of fv-pairs the correct fv-pair obtains
the probability p=1 and therefore will not influence
the probability of the tag to which it belongs (e.g.
$p_1$(0num:SG | ...) = 1 in fig. 2). A wrong fv-pair
would obtain p=0 and make the whole tag impossible.
³ Suggested by Mats Rooth, IMS, Univ. Stuttgart, Germany
3 TRAINING ALGORITHM 
In the training process we are not interested in
analysing and storing the contextual probabilities
(state transition probabilities) of whole tags but of
single fv-pairs. We note them in terms of probabilistic
feature relations (PFR):

$$\mathrm{PFR}: \left( e_i \,|\, C_i^{sub} \;;\; p(e_i|C_i^{sub}) \right) \tag{7}$$

which later, in the tagging process, will be combined
in order to obtain the contextual tag probabilities.
The term $e_i$ in formula (7) is a fv-pair. $C_i^{sub}$ is a
reduced context which contains only a subset of the
fv-pairs of a really appearing context $C_i$ (fig. 1). $C_i^{sub}$
is obtained from $C_i$ by eliminating all fv-pairs which
do not influence the relative frequency of $e_i$, according
to the condition:

$$\frac{p(e_i|C_i^{sub})}{p(e_i|C_i)} \in [1 - \varepsilon,\, 1 + \varepsilon] \tag{8}$$

The considered fv-pair has nearly⁴ the same
probability in the complete and in the reduced contexts,
i.e. $C_i$ does not supply more information about the
probability of $e_i$ than $C_i^{sub}$ does.
      t_{i-2}      t_{i-1}      t_i
(a)   2typ:DEF     1gen:FEM     0gen:FEM
      2gen:FEM     1num:SG      0num:SG
      2num:SG

Figure 1: (a) Complete context $C_i$ and (b)
reduced context $C_i^{sub}$ of the feature-value-pair
$e_i$ = 0gen:FEM
In the example (fig. 1a) we consider the fv-pair
0gen:FEM. Within the given training corpus, its
probability in the complete context $C_i$, i.e. in the context
of all the other fv-pairs of figure 1a, is $p_0'$ = 44/44 = 1
(cf. $p_0'$ in fig. 2).
The presence of 1num:SG in tag $t_{i-1}$ does not
influence the probability of 0gen:FEM in tag $t_i$. Therefore
1num:SG can be eliminated. Only fv-pairs which
really have an influence remain in the context. The
reduced context $C_i^{sub}$ with fewer fv-pairs, which we obtain
this way, is more general (fig. 1b).
In the given training corpus, the probability of
0gen:FEM in the context $C_i^{sub}$ is $p_0$ = 170/174 = 0.977
(cf. $p_0$ in PFR$_0$ in fig. 2), which is near to $p_0'$ = 1. The
reduced context $C_i^{sub}$ is used to form a PFR which will
be stored.
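The admissibility test of condition (8) can be sketched as a small predicate. The function name `is_admissible_reduction` and the handling of a zero probability in the complete context are assumptions of this illustration; the tolerance of 3 % and the example frequencies are the ones given in the text.

```python
def is_admissible_reduction(p_sub, p_full, eps=0.03):
    """Check condition (8): the probability in the reduced context may
    deviate from the one in the complete context by at most eps."""
    if p_full == 0:
        # Assumption for this sketch: a zero probability can only be
        # matched by a zero probability in the reduced context.
        return p_sub == 0
    return 1 - eps <= p_sub / p_full <= 1 + eps

# Example from the text: 0gen:FEM in the reduced and the complete context.
ok = is_admissible_reduction(p_sub=170 / 174, p_full=44 / 44)
```

With the values from the example the ratio is about 0.977, which lies inside [0.97, 1.03], so the reduced context is accepted.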
⁴ A small change in the probability caused by the elimination
of fv-pairs from the context is admitted if it does not exceed a
defined small percentage ε. (We used ε = 3 %.)
We see two advantages in the use of reduced contexts
instead of complete ones:
(1) A great number of complete contexts containing
many fv-pairs can lead, after elimination of irrelevant
fv-pairs, to the same PFR, which makes the number
of all possible PFRs much smaller than the number of
all possible trigrams (cf. sec. 2).
(2) The probability of a fv-pair can be estimated
more exactly in a reduced context than in a complete
one because of the higher frequencies in the first case.
The Generation of PFRs
In the training process we first extract from a
training corpus a set of trigrams where the tags are split
up into their fv-pairs. From these trigrams a set of
PFRs is generated separately for every fv-pair $e_i$. We
examined four different methods for this procedure:
Method 1-3: For every trigram we generate all
possible subsets of its fv-pairs. Many trigrams, e.g.
if they differ in only one fv-pair, have most of their
subsets of fv-pairs in common. The complete
trigrams and the subsets together constitute the set
of contexts and subcontexts ($C_i$ and $C_i^{sub}$) wherein a
fv-pair could appear. To generate PFRs for a given
fv-pair, we preselect and mark those (sub-)contexts
which are supposed to have an influence on the
contextual probability of the fv-pair. A (sub-)context will
not be preselected if its frequency is smaller than a
defined threshold. We use different ways for the
preselection:
Method 1: A (sub-)context will be preselected if
the considered fv-pair itself or any fv-pair belonging
to the same feature type ever appears in this
(sub-)context. E.g., if gen:MAS appears in a certain
(sub-)context then this (sub-)context will be
preselected for gen:FEM too. Furthermore, it is possible
to impose special conditions on the preselection, e.g.
that a (sub-)context can only be preselected if it
contains a POS feature in tag $t_i$ and $t_{i-1}$ (cf. fig. 1b:
0pos and 1pos).
Method 2: In order to preselect (sub-)contexts for an
fv-pair, we generate a decision tree⁵ (Quinlan, 1983)
where the feature of the fv-pair, e.g. gen, num etc.,
serves to classify all existing (sub-)contexts. E.g., num
produces three classes of contexts: those containing
the fv-pair 0num:SG, those with 0num:PL and those
without a 0num feature. We assign to the tree nodes
features other than the one upon which the classification is
based. The root node is labeled with the feature from
which we expect most information about the
probability of the currently considered feature. The values
of the root node feature are assigned to the branches
starting at the root node. We continue the
branching until there remain no features with an expected
information gain and a frequency higher than defined
thresholds. To every leaf of the tree corresponds a
(sub-)context which will be marked and thus
preselected for further analysis.

⁵ Suggested by Helmut Schmid, IMS, Univ. Stuttgart,
Germany. For reasons of space we explain only how we employ
decision trees for our purposes. For details about the automatic
generation of such trees see Quinlan (1983).
Method 3: For each fv-pair concerning POS we
preselect every (sub-)context containing only POS
features in tag $t_{i-2}$ and $t_{i-1}$ (classical POS trigram), e.g.
2pos:PREP 1pos:DET for 0pos:NOUN. For the other
fv-pairs we mark every (sub-)context containing any
fv-pair of the same type in the previous tag $t_{i-1}$ and
any POS features in tag $t_{i-1}$ and $t_i$, e.g. 1pos:DET
1gen:FEM 0pos:NOUN for 0gen:FEM.
With the methods 1-3, we next eliminate from
every preselected (sub-)context all fv-pairs which in the
above-described sense do not influence the relative
frequency of the currently considered fv-pair (eq. 8).
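The first step shared by methods 1-3, generating all subsets of the fv-pairs of each trigram and discarding infrequent ones, can be sketched as follows. This is a minimal illustration; the function names, the representation of a trigram as a set of position-indexed fv-pair strings, and the toy data are assumptions, and the method-specific preselection rules are not modelled.

```python
from collections import Counter
from itertools import combinations

def subcontexts(fv_pairs):
    """Yield every non-empty subset of the fv-pairs of one trigram."""
    pairs = sorted(fv_pairs)
    for size in range(1, len(pairs) + 1):
        for subset in combinations(pairs, size):
            yield frozenset(subset)

def frequent_subcontexts(trigrams, min_freq):
    """Count all (sub-)contexts over the corpus and keep those whose
    frequency reaches the threshold."""
    counts = Counter()
    for trigram in trigrams:
        counts.update(subcontexts(trigram))
    return {ctx: n for ctx, n in counts.items() if n >= min_freq}

# Two hypothetical trigrams that differ in a single fv-pair.
toy_trigrams = [
    {"1gen:FEM", "1pos:NOUN", "0pos:ADJ"},
    {"1gen:MAS", "1pos:NOUN", "0pos:ADJ"},
]
kept = frequent_subcontexts(toy_trigrams, min_freq=2)
```

Each toy trigram yields 7 non-empty subsets, of which the 3 built from the shared pairs 1pos:NOUN and 0pos:ADJ occur twice. The number of subsets grows as $2^n$ with the number $n$ of fv-pairs, which is one reason a frequency threshold is useful in practice.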
Method 4: From the set of trigrams extracted from
a training corpus we generate, separately for every
fv-pair, a binary-branched decision tree which shall
describe various contextual probabilities of this fv-pair.
The tree is generated by a modified version of the ID3
algorithm (Quinlan, 1983) and is similar to the one
described by Schmid (1994).
We start with a binary classification of all trigrams
based on the considered fv-pair. E.g., a classification
for 0gen:FEM will divide the set of trigrams in two
subsets, one where the trigrams contain 0gen:FEM in
the tag $t_i$ and one where they do not.
Figure 3: Decision tree for the fv-pair
0gen:FEM (Every number is a probability of
0gen:FEM in the context described by the
path from the root node to the node labeled
with the number.)
The tree is built up recursively (fig. 3). At each
step, i.e. with the construction of each node, we test
which one of the other fv-pairs delivers most
information concerning the above-described classification.
The current node will be labeled with this fv-pair. One
of its two branches concerns the trigrams which
contain the fv-pair, the other branch concerns the
trigrams which do not contain it. The recursive
expansion of the tree stops if either the information gained
by consulting further fv-pairs or the frequencies upon
which the calculus is based are smaller than defined
thresholds.

p( 0gen:FEM 0num:SG 0pos:ADJ | 1gen:FEM 1num:SG 1pos:NOUN
   2gen:FEM 2num:SG 2pos:DET 2typ:DEF ) = 44/298 = 0.148

   p0'( 0gen:FEM | 0num:SG 0pos:ADJ 1gen:FEM 1num:SG 1pos:NOUN
        2gen:FEM 2num:SG 2pos:DET 2typ:DEF ) = 44/44 = 1.0
   PFR0 : ( 0gen:FEM | 0pos:ADJ 1gen:FEM ; p0 = 170/174 = 0.977 )

   p1'( 0num:SG | 0gen:FEM 0pos:ADJ 1gen:FEM 1num:SG 1pos:NOUN
        2gen:FEM 2num:SG 2pos:DET 2typ:DEF ) = 44/44 = 1.0
   PFR1 : ( 0num:SG | 0pos:ADJ 1num:SG 2pos:DET ; p1 = 90/96 = 1.0 )

   p2'( 0pos:ADJ | 1gen:FEM 1num:SG 1pos:NOUN
        2gen:FEM 2num:SG 2pos:DET 2typ:DEF ) = 44/298 = 0.148
   PFR2 : ( 0pos:ADJ | 1gen:FEM 1pos:NOUN 2pos:DET ; p2 = 69/465 = 0.148 )

   $\prod_{i=0}^{2} p_i \approx 0.145$

The position index at the beginning of every feature-value-pair indicates the tag to which
it belongs; e.g. 0gen:FEM belongs to tag $t_i$ and 2num:SG to $t_{i-2}$.

Figure 2: Decomposition and reconstruction of a contextual tag probability (state
transition probability) using probabilistic feature relations (PFR)
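The node-labeling step of method 4 can be illustrated with the standard ID3 information gain for a binary split. This is only a sketch under assumptions: the tagger uses a modified ID3 whose details are not given here, and the function names, the set-based trigram representation and the toy data are hypothetical.

```python
import math

def entropy(positives, total):
    """Binary entropy (in bits) of a group with `positives` target cases."""
    if total == 0 or positives in (0, total):
        return 0.0
    p = positives / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(trigrams, target, candidate):
    """Gain about `target` from splitting on presence of `candidate`."""
    def group_entropy(group):
        return entropy(sum(target in t for t in group), len(group))
    with_c = [t for t in trigrams if candidate in t]
    without_c = [t for t in trigrams if candidate not in t]
    n = len(trigrams)
    return (group_entropy(trigrams)
            - len(with_c) / n * group_entropy(with_c)
            - len(without_c) / n * group_entropy(without_c))

def best_split(trigrams, target):
    """Pick the fv-pair (other than the target) with the highest gain."""
    candidates = sorted(set().union(*trigrams) - {target})
    return max(candidates, key=lambda c: information_gain(trigrams, target, c))

# Hypothetical trigrams, each given as a set of position-indexed fv-pairs.
toy = [
    {"1gen:FEM", "1pos:NOUN", "0gen:FEM"},
    {"1gen:FEM", "1pos:NOUN", "0gen:FEM"},
    {"1gen:MAS", "1pos:NOUN"},
    {"1gen:MAS", "1pos:DET"},
]
label = best_split(toy, "0gen:FEM")
```

In the toy data the presence of 1gen:FEM separates the trigrams with 0gen:FEM perfectly, so it is chosen as the node label. A full tree builder would recurse on the two resulting subsets and stop when the gain or the subset frequency falls below a threshold.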
4 TAGGING ALGORITHM 
The starting point for the implementation of a feature
structure tagger was a second-order-HMM tagger
(trigrams) based on a modified version of the Viterbi
algorithm (Viterbi, 1967; Church, 1988) which we had
earlier implemented in C (Kempe, 1994). There we
replaced the function which estimated the contextual
probability of a tag (state transition probability) by
dividing a trigram frequency by a bigram frequency
(eq. 3) with a function which accomplished this
calculus either using PFRs in the above-described way
(eqs. 6, 7) or by consulting a decision tree (fig. 3).
To estimate the contextual probability of a tag we 
have to know the contextual probabilities of its fv- 
pairs in order to multiply them (eq. 6). 
Using PFRs generated by method 1 or 2, when
e.g. looking for the probability $p_2'$(0pos:ADJ | ...) from
figure 2, we may find in the list of PFRs, instead of
a PFR which would directly correspond (but is not
stored), the two PFRs

(0pos:ADJ | 1gen:FEM 1pos:NOUN 2pos:DET ; $p_1$ = 0.148)
(0pos:ADJ | 0num:SG 1num:SG 1syn:NOUN 2syn:DET ; $p_2$ = 0.414)

Both of them contain subsets of the fv-pairs of the
required complete context and could therefore both be
applied. In such a case we need to know how to combine
$p_1$ and $p_2$ in order to get $p$ (= $p_2'$ in fig. 2).
As there exists no mathematical relation between
these three probabilities, we simply average $p_1$ and $p_2$
to get $p$ because this gives as good tagging results as a
number of other more complicated approaches which
we examined.
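The averaging step can be sketched as follows. The function name `combine_pfrs`, the tuple representation of a PFR and the construction of the complete context as the union of the two reduced contexts are assumptions made for this illustration; the two PFRs and their probabilities are the ones quoted above.

```python
def combine_pfrs(target, context, pfrs):
    """Average the probabilities of all stored PFRs for `target` whose
    reduced context is a subset of the complete context."""
    applicable = [p for (e, sub, p) in pfrs if e == target and sub <= context]
    if not applicable:
        return None   # no stored PFR applies
    return sum(applicable) / len(applicable)

pfrs = [
    ("0pos:ADJ", frozenset({"1gen:FEM", "1pos:NOUN", "2pos:DET"}), 0.148),
    ("0pos:ADJ", frozenset({"0num:SG", "1num:SG", "1syn:NOUN", "2syn:DET"}), 0.414),
]
# Illustrative complete context: here simply the union of both reduced contexts.
context = frozenset().union(*(sub for _, sub, _ in pfrs))
p = combine_pfrs("0pos:ADJ", context, pfrs)
```

For the two PFRs above the result is (0.148 + 0.414) / 2 = 0.281.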
PFRs generated by method 3 do not create this
problem. For every complete context only one PFR is
stored.
When we use the set of decision trees generated by
method 4, we obtain for every fv-pair in every
possible context only one probability by going down
the relevant branches until a probability information
is reached.
In contrast to the PFRs of the other methods, the
decision trees also contain negative information about
the context of an fv-pair, i.e. not only which fv-pairs
have to be in the context but also which ones must be
absent.
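Looking up a probability in such a binary decision tree can be sketched like this. The nested-dictionary representation, the function name `lookup` and all probabilities in the toy tree are hypothetical; only the principle of following yes/no branches until a probability is reached comes from the text.

```python
def lookup(node, context):
    """Walk down the tree: take the 'yes' branch if the tested fv-pair is
    in the context, otherwise the 'no' branch, until a leaf is reached."""
    while isinstance(node, dict):
        branch = "yes" if node["test"] in context else "no"
        node = node[branch]
    return node   # a leaf holds a probability

# Hypothetical tree for the fv-pair 0gen:FEM; the numbers are made up.
tree = {
    "test": "1gen:FEM",
    "yes": {"test": "0pos:ADJ", "yes": 0.98, "no": 0.60},
    "no": {"test": "1gen:MAS", "yes": 0.01, "no": 0.30},
}
p = lookup(tree, {"1gen:FEM", "1pos:NOUN", "0pos:ADJ"})
```

Every path uses both the presence and the absence of fv-pairs, which is the negative information mentioned above, and every context reaches exactly one leaf, so no averaging is needed.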
5 TAGGING RESULTS 
In the training and tagging process we experimented
with different values for parameters like: minimal
admitted frequency for preselection, admitted percentual
difference ε between probabilities considered to be
equal, etc. (cf. sec. 3).
The feature structure tagger was trained on the
French 10,000 words corpus already mentioned in
table 1, with the four different training methods (sec. 3).
When tagging a 6,000 words corpus⁶ with an average
ambiguity of 2.63 tags per word (after the dictionary
look-up) we obtained in the best case an accuracy of
88.89 % (table 2).

⁶ No overlap between training and test corpora.

tagger   training corpus          tag set                HMM     tagging
         number      language     number     number of   order   accuracy
         of words                 of tags    fv-pairs
tT       2,000,000   English        47         --          1      94.93 %
tT       2,000,000   English        47         --          2      96.16 %
tT          10,000   French        386         57          1      56.39 %
tT          10,000   French        386         57          2      83.23 %
lpT         10,000   French        386         57          2      83.81 %
fsT1        10,000   French        386         57          --     88.53 %
fsT2        10,000   French        386         57          --     88.89 %
fsT3        10,000   French        386         57          --
fsT4        10,000   French        386         57          --     88.14 %

tT → "traditional" HMM-tagger,
lpT → "Tagger" considering only lexical probabilities,
fsT1..4 → feature structure tagger
trained with method 1..4,
HMM order 1 → bigrams, 2 → trigrams

Table 2: Comparison of the tagging accuracy with
different taggers, corpora, tag sets and HMM orders
For comparison, we used a "traditional" HMM-tagger
(cf. sec. 4) on the same training and test
corpora and got an accuracy of 83.23 %⁷, i.e. the
error rate was about 50 % higher than with the
feature structure tagger (table 2).
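The "about 50 % higher" error rate follows directly from the two accuracies quoted in the text. A short check, with a helper name chosen for this illustration:

```python
def error_rate(accuracy_percent):
    """Error rate in percent for a given tagging accuracy in percent."""
    return 100.0 - accuracy_percent

traditional_error = error_rate(83.23)         # "traditional" HMM-tagger
feature_structure_error = error_rate(88.89)   # best feature structure tagger
ratio = traditional_error / feature_structure_error
```

The error rates are 16.77 % and 11.11 %, and their ratio is roughly 1.51.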
When we used a tool which always selects the
lexically most probable tag without considering the
context we obtained an accuracy of 83.81 %, which is even
better than with the "traditional" HMM-tagger.
Provided with enough training data and working
on a small tag set, our "traditional" tagger got an
accuracy of 96.16 % (Kempe, 1994), which is usual in
this case (Cutting et al., 1992). The English test corpus
we used here had an average ambiguity of 2.61 tags per
word which is amazingly similar to the ambiguity of
the French corpus.
The feature structure tagger is clearly better when
the available training corpus is small and the tag set
large but the tags are decomposable into few fv-pairs.
6 FURTHER RESEARCH
We intend to search for other similar models while
keeping in mind the basic idea described above:
splitting up a tag into fv-pairs and deducing its contextual
probability from the contextual probabilities of its
fv-pairs.
Furthermore, it may be preferable to split up the
tags only when the frequencies are too small⁸.

⁷ For a similar experiment for German (20,000 words training
corpus, 689 tags, trigrams) an accuracy of 72.5 % has been
reported (Wothke et al., 1993, p. 21).
⁸ Suggested by Ted Briscoe, Rank Xerox Research Centre,
Grenoble, France
References 
Brown, P.F. et al. (1989). A Statistical Approach to
Machine Translation. Technical Report, RC 14773
(#66226) 7/17/89, IBM Research Division.
Church, K.W. (1988). A Stochastic Parts Program
and Noun Phrase Parser for Unrestricted Text. In
Proc. 2nd Conference on Applied Natural Language
Processing, ACL, pp. 136-143.
Cutting, D. et al. (1992). A Practical Part-of-Speech
Tagger. In Proc. 3rd Conference on Applied Natural
Language Processing, ACL. Trento, Italy.
Gale, W.A. and Church, K.W. (1989). What's Wrong
with Adding One?. Statistical Research Reports, No.
90, AT&T Bell Laboratories, Murray Hill.
Garside, R., Leech, G. and Sampson, G. (1987). The
Computational Analysis of English: A Corpus-based
Approach. London: Longman.
Kempe, A. (1994). A Probabilistic Tagger and an
Analysis of Tagging Errors. Research Report. IMS,
Univ. of Stuttgart.
Quinlan, J.R. (1983). Learning Efficient Classification
Procedures and Their Application to Chess End
Games. In Michalski, R., Carbonell, J. and Mitchell
T. (Eds.) Machine Learning: An artificial intelligence
approach, pp. 463-482. San Mateo, California:
Morgan Kaufmann.
Rabiner, L.R. (1990). A Tutorial on Hidden Markov
Models and Selected Applications in Speech
Recognition. In Waibel, A. and Lee, K.F. (Eds.) Readings
in Speech Recognition. San Mateo, California:
Morgan Kaufmann.
Schmid, H. (1994). Probabilistic Part-of-Speech
Tagging Using Decision Trees. Research Report. IMS,
Univ. of Stuttgart.
Viterbi, A.J. (1967). Error Bounds for Convolutional
Codes and an Asymptotical Optimal Decoding
Algorithm. In Proceedings of IEEE, vol. 61, pp. 268-278.
Wothke, K. et al. (1993). Statistically Based
Automatic Tagging of German Text Corpora with
Parts-of-Speech - Some Experiments. Research Report,
Doc. No. TR 75.93.02. Heidelberg Scientific Center,
IBM Germany.
