PART-OF-SPEECH TAGGING WITH NEURAL NETWORKS 
Helmut Schmid
Institute for Computational Linguistics, Azenbergstr.12, 70174 Stuttgart, Germany, 
schmid@ims.uni-stuttgart.de 
Topic area: large text corpora, part-of-speech tag- 
ging, neural networks 
1 ABSTRACT 
Text corpora which are tagged with part-of-speech information are useful in many areas of linguistic research. In this paper, a new part-of-speech tagging method based on neural networks (Net-Tagger) is presented and its performance is compared to that of an HMM tagger (Cutting et al., 1992) and a trigram-based tagger (Kempe, 1993). It is shown that the Net-Tagger performs as well as the trigram-based tagger and better than the HMM tagger.
2 INTRODUCTION 
Words are often ambiguous in their part of speech. 
The English word store for example can be either a 
noun, a finite verb or an infinitive. In an utterance, this ambiguity is normally resolved by the context of a word: e.g. in the sentence "The 1977 PCs could store two pages of data.", store can only be an infinitive.
A part-of-speech tagger is a system which automat- 
ically assigns the part of speech to words using con- 
textual information. Potential applications for part-of-speech taggers exist in many areas including speech recognition, speech synthesis, machine translation and information retrieval.
Different methods have been used for the implementation of part-of-speech taggers. TAGGIT (Greene, Rubin, 1971), an early system which was used for the initial tagging of the Brown corpus, was rule-based. It was able to assign the correct part of speech to about 77 % of the words in the Brown corpus.
In another approach contextual dependencies are 
modelled statistically. Church (1988) and Kempe (1993) use second order Markov Models and train their systems on large hand-tagged corpora. Using this method, they are able to tag more than 96 % of their test words with the correct part of speech. The need for reliably tagged training data, however, is a problem for languages where such data is not available in sufficient quantities. Jelinek (1985) and Cutting et al. (1992) circumvent this problem by training their taggers on untagged data using the Baum-Welch algorithm (also known as the forward-backward algorithm).
They report rates of correctly tagged words which are comparable to those presented by Church (1988) and
Kempe (1993). 
A third and rather new approach is tagging with 
artificial neural networks. In the area of speech recognition, neural networks have been used for a decade now. They have shown performances comparable to those of Hidden Markov Model systems or even better (Lippmann, 1989). Part-of-speech prediction is another area, closer to POS tagging, where neural networks have been applied successfully. Nakamura et al. (1990) trained a 4-layer feed-forward network with up to three preceding part-of-speech tags as input to predict the word category of the next word. The prediction accuracy was similar to that of a trigram-based predictor. Using the predictor, Nakamura et al. were able to improve the recognition rate of their speech recognition system from 81.0 % to 86.9 %.
Federici and Pirrelli (1993) developed a part-of-speech tagger which is based on a special type of neural network. It disambiguates between alternative morphosyntactic tags which are generated by a morphological analyzer. The tagger is trained with an
analogy-driven learning procedure. Only preliminary 
results are presented, so that a comparison with other 
methods is difficult. 
In this paper, a part-of-speech tagger based on a multilayer perceptron network is presented. It is similar to the network of Nakamura et al. (1990) in so far as the same training procedure (backpropagation) is used; but it differs in the structure of the network and also in its purpose (disambiguation vs. prediction). The performance of the presented tagger is measured and compared to that of two other taggers (Cutting et al., 1992; Kempe, 1993).
3 NEURAL NETWORKS 
Artificial neural networks consist of a large number of 
simple processing units. These units are highly inter- 
connected by directed weighted links. Associated with 
each unit is an activation value. Through the connections, this activation is propagated to other units.
In multilayer perceptron networks (MLP-networks), the most popular network type, the processing units are arranged vertically in several layers (fig. 1). Connections exist only between units in adjacent layers. The bottom layer is called input layer, because the activations of the units in this layer represent the input of the network. Correspondingly, the top layer is called output layer. Any layers between input layer
Figure 1: A 3-layer perceptron network (output units, hidden units, input units)
and output layer are called hidden layers. Their activations are not visible externally.
During the processing in an MLP-network, activations are propagated from input units through hidden units to output units. At each unit j, the weighted input activations a_i w_ij are summed and a bias parameter θ_j is added:

    net_j = Σ_i a_i w_ij + θ_j    (1)

The resulting network input net_j is then passed through a sigmoid function (the logistic function) in order to restrict the value range of the resulting activation a_j to the interval [0,1]:

    a_j = 1 / (1 + e^(-net_j))    (2)
The network learns by adapting the weights of the connections between units until the correct output is produced. One widely used method is the backpropagation algorithm, which performs a gradient descent search on the error surface. The weight update Δw_ij, i.e. the difference between the old and the new value of weight w_ij, is here defined as:

    Δw_ij = η a_pi δ_pj,    where    (3)

        δ_pj = a_pj (1 - a_pj) (t_pj - a_pj),      if j is an output unit
        δ_pj = a_pj (1 - a_pj) Σ_k δ_pk w_jk,      if j is a hidden unit

Here, t_p is the target output vector which the network must learn.¹
Training the MLP-network with the backpropagation rule guarantees that a local minimum of the error surface is found, though this is not necessarily the global one. In order to speed up the training process, a momentum term is often introduced into the update formula:

    Δw_ij(t+1) = η a_pi δ_pj + α Δw_ij(t)    (4)

¹We assume here that the bias parameter θ_j is realized as a weight to an additional unit which always has the activation value 1 (cp. (Rumelhart, McClelland, 1984)).

For a detailed introduction to MLP networks see e.g. (Rumelhart, McClelland, 1984).
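The update rules (1)-(4) for a network without hidden layer can be sketched in a few lines (a minimal illustration; the variable names and the learning-rate and momentum values are our own choices, not taken from the paper):

```python
import math

def sigmoid(x):
    # eq. (2): logistic function, squashes the net input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights, bias):
    # eq. (1): net_j = sum_i a_i * w_ij + theta_j, then eq. (2)
    # weights[i][j] is the link from input unit i to output unit j
    return [sigmoid(sum(a * w[j] for a, w in zip(inputs, weights)) + bias[j])
            for j in range(len(bias))]

def train_step(inputs, target, weights, bias, prev_dw, eta=0.1, alpha=0.5):
    # one backpropagation step for a 2-layer network:
    # delta_pj = a_pj (1 - a_pj) (t_pj - a_pj)                -- eq. (3)
    # dw_ij(t+1) = eta * a_pi * delta_pj + alpha * dw_ij(t)   -- eq. (4)
    out = forward(inputs, weights, bias)
    for j in range(len(bias)):
        delta = out[j] * (1.0 - out[j]) * (target[j] - out[j])
        bias[j] += eta * delta  # bias treated as a weight to a unit fixed at 1
        for i in range(len(inputs)):
            dw = eta * inputs[i] * delta + alpha * prev_dw[i][j]
            weights[i][j] += dw
            prev_dw[i][j] = dw
    return out
```

Repeated calls to train_step move the output toward the target until the sigmoid saturates near the desired value.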
4 THE TAGGER NETWORK

The Net-Tagger consists of an MLP-network and a lexicon (see fig. 2).
Figure 2: Structure of the Net-Tagger without hidden layer; the arrow symbolizes the connections between the layers.
In the output layer of the MLP-network, each unit corresponds to one of the tags in the tagset. During training, the network learns to activate the output unit which represents the correct tag and to deactivate all other output units. Hence, in the trained network, the output unit with the highest activation indicates which tag should be attached to the word that is currently processed.
The input of the network comprises all the information which the system has about the parts of speech of the current word, the p preceding words and the f following words. More precisely, for each part-of-speech tag pos_j and each of the p + 1 + f words in the context, there is an input unit whose activation in_ij represents the probability that word_i has part of speech pos_j.
For the word which is being tagged and the following words, the lexical part-of-speech probability P(pos_j | word_i) is all we know about the part of speech.² This probability does not take into account any contextual influences. So, we get the following input representation for the currently tagged word and the following words:

    in_ij = P(pos_j | word_i),    if i ≥ 0    (5)

²Lexical probabilities are estimated by dividing the number of times a word occurs with a given tag by the overall number of times the word occurs. This method is known as the Maximum Likelihood Principle.
For the preceding words, there is more information available, because they have already been tagged. The activation values of the output units at the time of processing are here used instead of the lexical part-of-speech probabilities³:

    in_ij(t) = out_j(t + i),    if i < 0    (6)
Copying output activations of the network into the input units introduces recurrence into the network. This complicates the training process, because the output of the network is not correct when the training starts, and therefore it cannot be fed back directly at that stage. Instead, a weighted average of the actual output and the target output is used. This average resembles the output of the trained network, which is similar (or at least should be similar) to the target output. At the beginning of the training, the weighting of the target output is high. It falls to zero during the training.
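This decaying mixture of target and actual output can be sketched as follows (a minimal illustration; the linear decay schedule is an assumption, since the paper only states that the weighting falls from high to zero):

```python
def feedback_activations(actual_out, target_out, w):
    # weighted average of actual network output and target output,
    # fed back into the input units for the preceding words;
    # w is the weighting of the target output (high early, 0 at the end)
    return [w * t + (1.0 - w) * a for a, t in zip(actual_out, target_out)]

def target_weight(step, total_steps):
    # hypothetical linear decay of the target weighting from 1 to 0
    return max(0.0, 1.0 - step / total_steps)
```

With w = 1 the feedback equals the target output, with w = 0 it equals the actual network output.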
The network is trained on a tagged corpus. Target activations are 0 for all output units, except for the unit which corresponds to the correct tag, for which it is 1. A slightly modified version of the backpropagation algorithm with momentum term, as presented in the last section, is used: if the difference between the activation of an output unit j and the corresponding target output is below a predefined threshold (we used 0.1), the error signal δ_pj is set to zero. In this way, the network is forced to pay more attention to larger error signals. This resulted in an improvement of the tagging accuracy by more than 1 percent.
Network architectures with and without hidden lay- 
ers have been trained and tested. In general, MLP- 
networks with hidden layers are more powerful than 
networks without one, but they also need more training and there is a higher risk of overlearning⁴. As will
be shown in the next section, the Net-Tagger did not 
profit from a hidden layer. 
In both network types, the tagging of a single word 
is performed by copying the tag probabilities of the 
current word and its neighbours into the input units, 
propagating the activations through the network to 
the output units and determining the output unit 
which has the highest activation. The tag correspond- 
ing to this unit is then attached to the current word. 
If the second strongest activation in the output layer is close to the strongest one, the tag corresponding to the second strongest activation may be given as an alternative output. No additional computation is required for this. Further, it is possible to give a scored list of all tags as output.
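The decision step, including the optional alternative tag, can be sketched like this (the closeness margin of 0.05 is a made-up value; the paper does not specify how close the second activation has to be):

```python
def choose_tag(activations, tagset, margin=0.05):
    # rank output units by activation; the strongest one gives the tag
    ranked = sorted(range(len(activations)), key=lambda j: -activations[j])
    best, runner = ranked[0], ranked[1]
    # if the runner-up is close, report it as an alternative tag
    close = activations[best] - activations[runner] < margin
    alternative = tagset[runner] if close else None
    # a scored list of all tags is available at no extra cost
    scored = [(tagset[j], activations[j]) for j in ranked]
    return tagset[best], alternative, scored
```

The scored list allows a later processing stage, e.g. a parser, to make the final decision.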
³The output activations of the network do not necessarily sum to 1. Therefore, they should not be interpreted as probabilities.

⁴Overlearning means that irrelevant features of the training set are learned. As a result, the network is unable to generalize.
5 THE LEXICON

The lexicon which contains the a priori tag probabilities for each word is similar to the lexicon which was used by Cutting et al. (1992). It has three parts: a fullform lexicon, a suffix lexicon and a default entry. No documentation of the construction algorithm of the suffix lexicon in (Cutting et al., 1992) was available. Thus, a new method based on information theoretic principles was developed.
During the lookup of a word in the lexicon of the Net-Tagger, the fullform lexicon is searched first. If the word is found there, the corresponding tag probability vector is returned. Otherwise, the uppercase letters of the word are turned to lowercase, and the search in the fullform lexicon is repeated. If it fails again, the suffix lexicon is searched next. If none of the previous steps has been successful, the default entry of the lexicon is returned.
The fullform lexicon was created from a tagged 
training corpus (some 2 million words of the Penn 
Treebank Corpus). First, the number of occurrences 
of each word/tag pair was counted. Afterwards, those 
tags of each word with an estimated probability of less 
than 1 percent were removed, because they were in 
most cases the result of tagging errors in the original
corpus. 
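The construction of the fullform lexicon can be sketched as follows (our own minimal rendering of the counting and 1-percent pruning step described above):

```python
from collections import Counter, defaultdict

def build_fullform_lexicon(tagged_corpus, min_prob=0.01):
    # tagged_corpus: iterable of (word, tag) pairs from the training corpus
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    lexicon = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        # drop tags with an estimated probability below 1 percent:
        # these are mostly tagging errors in the original corpus
        kept = {t: c for t, c in tag_counts.items() if c / total >= min_prob}
        norm = sum(kept.values())
        lexicon[word] = {t: c / norm for t, c in kept.items()}
    return lexicon
```

The remaining counts are renormalized so that each word's tag probabilities sum to one.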
Figure 3: A sample suffix tree of length 3 (leaves labeled with suffixes such as -ies, -ous, -ble, -ive, -ing, -ion, -ity)
The second part of the lexicon, the suffix lexicon, forms a tree. Each node of the tree (except the root node) is labeled with a character. At the leaves, tag probability vectors are attached. During a lookup, the suffix tree is searched from the root. In each step, the branch which is labeled with the next character from the end of the word suffix is followed.
Assume e.g. we want to look up the word tagging in the suffix lexicon which is shown in fig. 3. We start at the root (labeled #) and follow the branch which leads to the node labeled g. From there, we move to the node labeled n, and finally we end up in the node
Table 1: Sample tag frequencies at a tree node and its two child nodes.

    suffix:    ess   ness   less
                86     1     85
                10     2      8
                45    45      0
                 2     0      2
    total:     143    48     95
labeled i. This node is a leaf, and the attached tag probability vector (which is not shown in fig. 3) is returned.
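The lookup walk just described can be sketched as follows (the dict-based node layout with 'probs' and 'default' keys is our own choice, not the paper's data structure):

```python
def suffix_lookup(root, word):
    # walk the suffix tree from the last character of the word backwards;
    # each node is a dict mapping characters to child nodes, with an
    # optional tag probability vector under 'probs' and an optional
    # fallback child under 'default' (hypothetical field names)
    node = root
    for ch in reversed(word):
        if ch in node:
            node = node[ch]
        elif 'default' in node:
            node = node['default']
        else:
            return None  # search fails; caller falls back to the default entry
        if 'probs' in node:
            return node['probs']
    return None
```

For "tagging" and a tree containing the reversed path g → n → i, the tag probability vector stored at the leaf i is returned.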
The suffix lexicon was automatically built from the training corpus. First, a suffix tree was constructed from the suffixes of length 5 of all words which were annotated with an open class part-of-speech⁵. Then tag frequencies were counted for all suffixes and stored at the corresponding tree nodes.
In the next step, an information measure I(S) was calculated for each node of the tree:

    I(S) = - Σ_pos P(pos|S) log₂ P(pos|S)    (7)

Here, S is the suffix which corresponds to the current node and P(pos|S) is the probability of tag pos given a word with suffix S.
Using this information measure, the suffix tree has been pruned. For each leaf, the weighted information gain G(aS) was calculated:

    G(aS) = F(aS) (I(S) - I(aS))    (8)

where S is the suffix of the parent node, aS is the suffix of the current node and F(aS) is the frequency of suffix aS.
If the information gain at some leaf of the suffix tree is below a given threshold⁶, it is removed. The tag frequencies of all deleted subnodes of a parent node are collected at the default node of the parent node. If the default node is the only remaining subnode, it is deleted too. In this case, the parent node becomes a leaf and is also checked for deletability.
To illustrate this process consider the following example, where ess is the suffix of the parent node, less is the suffix of one child node and ness is the suffix of the other child node. The tag frequencies of these nodes are given in table 1.
The information measure for the parent node is:

    I(ess) = - (86/143) log₂ (86/143) - (10/143) log₂ (10/143) - ... ≈ 1.32    (9)

The corresponding values for the child nodes are 0.39 for ness and 0.56 for less. Now, we can determine the weighted information gain at each of the child nodes. We get:

    G(ness) = 48 (1.32 - 0.39) = 44.64    (10)

⁵Open class parts-of-speech are those which allow for the production of new words (e.g. noun, verb, adjective).

⁶We used a gain threshold of 10.
Table 2: Comparison of recognition rates

    method            accuracy
    Net-Tagger        96.22 %
    trigram tagger    96.06 %
    HMM tagger        94.24 %
    G(less) = 95 (1.32 - 0.56) = 72.20    (11)

Both values are well above the threshold of 10, and therefore none of them should be deleted.
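The worked example can be reproduced with a few lines (the exact gains differ slightly from (10) and (11) because the paper rounds I(S) to two decimals before multiplying):

```python
import math

def entropy(freqs):
    # eq. (7): I(S) = - sum_pos P(pos|S) * log2 P(pos|S)
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

def info_gain(parent_freqs, child_freqs):
    # eq. (8): G(aS) = F(aS) * (I(S) - I(aS))
    return sum(child_freqs) * (entropy(parent_freqs) - entropy(child_freqs))

ess  = [86, 10, 45, 2]   # parent node 'ess', total 143 (table 1)
ness = [1, 2, 45, 0]     # child node 'ness', total 48
less = [85, 8, 0, 2]     # child node 'less', total 95
```

With these frequencies, entropy(ess) ≈ 1.32, entropy(ness) ≈ 0.39 and entropy(less) ≈ 0.56, and both gains lie well above the threshold of 10.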
As explained before, during a lookup the suffix tree is walked along the path where the nodes are annotated with the letters of the word suffix in reversed order. If at some node on the path no matching subnode can be found, and there is a default subnode, then the default node is followed. If a leaf is reached at the end of the path, the corresponding tag probability vector is returned. Otherwise, the search fails and the default entry is returned.
The default entry is constructed by subtracting the tag frequencies at all leaves of the pruned suffix tree from the tag frequencies of the root node and normalizing the resulting frequencies. Thereby, relative frequencies are obtained which sum to one.
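This construction can be sketched as follows (a minimal rendering with made-up frequency vectors; tag-frequency vectors are represented as lists aligned over the tagset):

```python
def default_entry(root_freqs, leaf_freq_vectors):
    # subtract the tag frequencies at all leaves of the pruned suffix tree
    # from the frequencies at the root, then normalize the remainder to
    # relative frequencies that sum to one
    remaining = list(root_freqs)
    for leaf in leaf_freq_vectors:
        remaining = [r - f for r, f in zip(remaining, leaf)]
    total = sum(remaining)
    return [r / total for r in remaining]
```

The result is the tag probability vector returned when every other lookup step fails.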
6 RESULTS

The 2-layer version of the Net-Tagger was trained on a 2 million word subpart of the Penn-Treebank corpus. Its performance was tested on a 100,000 word subpart which was not part of the training corpus. The settings of the network parameters were as follows: the number of preceding words in the context p was 3, the number of following words f was 2 and the number of training cycles was 4 million. The training of the tagger took one day on a Sparc10 workstation and the tagging of 100,000 words took 12 minutes on the same machine.
In table 2, the accuracy rate of the Net-Tagger is compared to that of a trigram-based tagger (Kempe, 1993) and a Hidden Markov Model tagger (Cutting et al., 1992) which were trained and tested on the same data. In order to determine the influence of the size of the training sample, the taggers were also trained on corpora of different sizes and tested again⁷. The resulting percentages of correctly tagged words are shown in figure 4.
These experiments demonstrate that the performance of the Net-Tagger is comparable to that of the trigram tagger and better than that of the HMM tagger. They further show that the performance of the Net-Tagger is less affected by a small amount of training data than that of the trigram tagger. This may be due to a much smaller number of parameters in the Net-Tagger: while the trigram tagger must accurately

⁷For this test, a slightly simpler network structure with two preceding and one following word in the input context was used.
Figure 4: Recognition rates for varying sizes of the training corpus (x-axis: size of training corpus, 10,000 to 1e+06 words; y-axis: accuracy, 75 to 100 %; curves: Net-Tagger, Xerox-Tagger, Trigram Tagger)
estimate 110,592 trigrams, the Net-Tagger only has to train 13,824 network parameters.
It was further tested whether an additional hidden layer in the network with 50 units would improve the accuracy of the tagging. It turned out that the accuracy actually deteriorated slightly, although the number of training cycles had been increased to 50 million⁸.
Also, the influence of the size of the input context was determined. Shrinking the context from three preceding and two following words to two preceding and one following word reduced the accuracy only by 0.1 %. Enlarging the context gave no improvement. A context of three preceding and two following words seems to be optimal.
As mentioned previously, the tagger can produce 
an alternative tag, if the decision between two tags is 
difficult. In that way, the accuracy can be raised to 
97.79 % at the expense of 4.6 % ambiguously tagged 
words. 
An analysis of the errors of the Net-Tagger and the trigram tagger shows that both have problems with the same words, although the individual errors are often different⁹.
7 CONCLUSIONS

In this paper, the Net-Tagger was presented, a part-of-speech tagger which is based on an MLP-network. A comparison of the tagging results with those of a trigram tagger and an HMM tagger showed that the accuracy is as high as that of the trigram tagger and the robustness on small training corpora is as good as that of the HMM tagger. Thus, the Net-Tagger combines advantages of both of these methods.
The Net-Tagger has the additional advantage that problematic decisions between tags are easy to detect, so that in these cases an additional tag can be given in the output. In this way, the final decision can be delayed to a later processing stage, e.g. a parser.

⁸Due to the large training times needed to train the 3-layer network, no further tests have been conducted.

⁹Less than 60 % of the tagging errors were made in common by both taggers.
A disadvantage of the presented method may be its 
lower processing speed compared to statistical meth- 
ods. In the light of the high speed of present computer 
hardware, however, this does not seem to be a serious 
drawback. 
8 REFERENCES

Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, p. 136-143.

Cutting, D., J. Kupiec, J. Pedersen and P. Sibun (1992). A practical part-of-speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy (ACL), pages 133-140, 1992. Also available as Xerox technical report SSL-92-01.

Federici, S. and V. Pirrelli (1993). Analogical modelling of text tagging. Unpublished report, Istituto di Linguistica Computazionale, Pisa, Italy.

Greene, B. B. and G. M. Rubin (1971). Automatic grammatical tagging of English. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island.

Jelinek, F. (1985). Markov source modeling of text generation. In J. K. Skwirzinski Ed., Impact of Processing Techniques on Communication, Nijhoff, Dordrecht.

Kempe, A. (1993). A stochastic Tagger and an Analysis of Tagging Errors. Internal paper, Institute for Computational Linguistics, University of Stuttgart.

Lippmann, R. P. (1989). Review of Neural Networks for Speech Recognition. Neural Computation, Vol. 1, p. 1-38.

Nakamura, M., K. Maruyama, T. Kawabata and K. Shikano (1990). Neural network approach to word category prediction for English texts. In H. Karlgren Ed., COLING-90, Helsinki University, p. 213-218.

Rumelhart, D. E. and J. L. McClelland (1984). Parallel Distributed Processing. MIT-Press, Cambridge, MA.
