NEURAL NETWORK APPROACH TO WORD CATEGORY PREDICTION 
FOR ENGLISH TEXTS 
Masami NAKAMURA, Katsuteru MARUYAMA f, Takeshi KAWABATA f¢, Kiyohiro SHIKANO tit 
ATR Interpreting Telephony Research Laboratories 
Seika-chou, Souraku-gun, Kyoto 619-02, JAPAN 
e-mail masami@atr-la.atr.co.jp 
Abstract 
Word category prediction is used to implement an 
accurate word recognition system. Traditional statistical 
approaches require considerable training data to estimate 
the probabilities of word sequences, and many parameters 
to memorize probabilities. To solve this problem, NETgram, 
which is the neural network for word category prediction, is 
proposed. Training results show that the perfornmnce of tim 
NETgram is comparable to that of the statistical model 
;although the NETgram requires fewer parameters than the 
~;tatisticat model. Also the NETgram performs effectively 
£or unknown data, i.e., the NETgram interpolates sparse 
training data. Results of analyzing the hidden layer show 
that the word categories are classified into linguistically 
ti~ignificant groups. The results of applying the NETgram to 
HMM English word recognition show that the NETgram 
improves the word recognition rate fi'om 81.0% to 86.9%. 
1. Introduction 
For the realization of an interpreting telephony system, 
an accurate word recognition system is necessary. Because 
it is difficult to recognize English words using only their 
acoustical characteristics, an accurate word recognition 
system needs certain linguistic information. Errors in word 
recognition results for sentences uttered in isolation 
include the tbllowing types of errors recoverable using 
linguistic infi)rmation. 
(a) Local syntax errors. 
(b) Global syntax errors. 
(c) Semantics and context errors. 
Many errors arise with one-syllable words such as ( I, by 
) and ( the, be ). More than half of these errors can be 
recovered by use of local syntax rules. The Trigram 
language model is an extremely rough approximation of a 
language, but it is a practical and useful model from the 
J Research and Development Department, NITSUF, O CorporatAo~l 
1 N'ffFBasic Research \[,aboratm'ies 
"J"l"t N'UI' ttuman Interface \[,aboratories 
viewpoint of entropy. At the very least, the trigram model 
is useful as a preprocessor for a linguistic processor which 
will be able to deal with syntax, semantics and context. 
Text Mr. Hawksly said yesterday he would 
Category NP NP VBD NR PPS MD 
Category 5 \] 51 79 55 66 46 No. 
Bigram :1 /"" ~'-~"" 
prediction \] / :' ' ............. ~. 
Fig. 1 Word Category Prediction 
Using Brown Corpus Text Data 
The trigram model using the appearance probabilities 
of the following Word was efficiently applied to improve 
word recognition results 111\]\[121. However, the traditional 
statistical approach requires considerable training samples 
to estimate the probabilities of word sequence and 
considerable memory capacity to process these 
probabilities. Additionally, it is difficult to predict unseen 
data which never appeared in tile training data. 
Neural networks are interesting devices which can 
learn general characteristics or rules from limited sample 
data. Neural networks are particularly useful in pattern 
recognition. In symbol processing, NETtalk \[3\], which 
produces phonemes from English text, has been used 
successfully. Now a neural network is being applied to word 
category prediction \[4\]. 
This paper describes the NETgram, which is a neural 
network for word category prediction in text. The NETgram 
is constructed by a trained Bigram network with two hidden 
layers, so that each bidden layer can learn the coarse-coded 
features of the input or output word category. Also, the 
NETgram can easily be expanded from Bigran) to N-gram 
network without exponentially increasing the number of 
parameters. Tire NETgram is tested by ti'ainin~ 
experiments with the Brown Corpus English Text Database 
i 213 
\[51. The NETgram is applied to IIMM English word 
recognition resulting in an improvement of its recognition 
performance. 
2. Word Category Prediction Neural Net (N ETgram) 
The basic Bigram network in the NETgram is a 4-layer 
feed-forward network, as shown in Fig.2, which has 2 
hidden layers. Because this network is trained for the next 
word category as the output for an input word category, 
hidden layers are expected to learn some linguistic 
structure from the relationship between one word category 
and the next in the text. The Trigram network in the 
NETgram has a structure such that, as the number of 
next word category 
\[ .................. 1 
L ___ (! 
I iO0 ...... 0', 
...... 
I. .................. J 
present word category 
Output Layer 
89 units 
Hidden layer 2 
16 units 
Hidden layer 1 
16 units 
Input Layer 
89 units 
Fig.2 NETgram ( Basic Bigram Network ) 
:,oo°.,,~°..°,,,, 
N th word 
"I Output ! 
I Unit<s9) ,, i 
I HL2"I 
o ,°"~ lu'it~06) I " I .~i I .. . LU'it<lm I 
'1 Ifiput I '." Input .... i U nits(89) I Units(89) 
:{N-3)th Word " (N-2)th Word (N-1)th Word ! 
4-gram Trigram Bigram network 
Fig.3 NETgram ( Trigram, 4-gram Network ) 
grams increases, every new input block produced is fully 
connected to the lower hidden layer of one basic Bigram 
network. The link weight is set at wt' as shown in Fig.3. 
When expanding from Trigram network to 4-gram network, 
one lower hidden layer block is added and the first and 
second input blocks are fully connected to one lower hidden 
layer block, and the second and third input blocks are fully 
connected to the other lower hidden layer block. 
3. How to Train NETgram 
Ilow to train a NE'Pgram, e.g. a Trigram network, is 
shown in Fig.4. As input data, word categories in the 
Brown Corpus text\[5\] are given, in order, from the first 
word in the sentence to the last, In one input block, only one 
unit corresponding to the word category number is turned 
ON (1); The others are turned OFF (0). As output data, 
only one unit corresponding to the next word category 
number is trained by ON (1). The others are trained by OFF 
(0). The training algorithm is the Back-Propagation 
algorithm\[6l, which uses the gradient descent to change 
link-weights in order to reduce the difference between the 
network output vectors and the desired output vectors. 
First, the basic Bigram network is trained. Next, the 
Trigram networks are trained with the llnk weight values 
trained by the basic Bigram network as initial values. 
This task is a many-to-many mapping problem. Thus, it . 
is difficult to train because the updating direction of the 
link weight vector easily fluctuates. In a two-sentence 
.~ko. o°°° 
0 1 0 
i O ...... 0 .... O! oo,pot 
~_1 ........ 5~ ..... J!9_~ Layer 
Hidden Layers 
L_! ........ _s3 ..... 8_9_2 L_t ........... 79___8_9_2 Layer ~. ..~ 
• , o,, ,'l 
i _..d 
Fig.4 ttow to Train NETgram (Trigram Model) 
214 
training experiment of about 50 words, we have confirmed 
that the output values of tbe basic Bigram network 
converge on the next occurrence probability distribution. 
ttowever, for many training data, considerable time is 
required for training. Therefore, in order to increase 
training speed, we use the next werd category occurrence 
probability distribution calculated for 1,024 sentences 
(about 24,000 words) as output training data in the basic 
Bigram network. Of course, in Trigram and 4-gram 
training, we use the next one-word category as output 
training data. 
4. Training llcsnlts 
4.1. Basic Bigram Network 
Word category prediction results show that NETgrmn 
(the basic Bigram network) is comparal)le to the statistical 
Bigram model. 
Next, we consider whether the hidden layer has 
obtained some linguistic structure. We Calculated the 
similarity of every two lower hidden layer (HIA) output 
vectors for 89 word categories and clustered them. 
Similarity S is calculated by 
(M(Ci),M(Cj)) S(ci,cj) 
= (4.1) 
I1 M(Ci)it II M(Qi)I1 
where MfOi) is the lower hidden layer (ILL1) output vector 
of the input word category CL (M(Ci),M(C~\])) is the immr 
product of M(Ci) and M(Cj). It M(Ci) II is the norm of M(Ci). 
The clustering result is shown in Fig.5. Clustering by the 
threshold of similarity, 0.985, the word categories are 
classified into linguistically significant groups, which are 
the HAVE', verb group, BE verb group, subjective pronoun 
group, group whose categories should be before a noun, and 
others. Therefore the NETgram can learn linguistic 
structure naturally. 
4.2. Trigram Network 
Word category prediction results are shown in Fig.6. 
The NETgram (Trigram network) is comparable to the 
statistical Trigram model for test data. 
Furthermore, the NETgram performs effectively for 
unseen data whicil never appeared in the training data, 
although the statistical Trigram can not predict the next 
word category for unseen data. That is to say, NETgrams 
interpolate sparse training data in the same way deleted 
interpolation \[71 does. 
CA'FE- 
GORY 
"3"g3qV~ 
37 HVD 
40 EIV7 
16 BEDZ 
20 BER 
21 BEZ 
19 BEN 
14 BE 
17BEG 
38 I-IVG 
~ffPPS-- 
86 WPS 
67 PPSS 
"2gD-r--- 
45 JJT 
58 OD 
42 JJ 
48 NN$ 
61 PP$ 
52 NP$ 
13 AT 78 VB 
80 VBG 
06, 
1lAP 
22 CC 
09 ABN 
10 ABX 
75 RP 
81 VBN 
89 DUM 
23 CD 
43 J JR 
47 NN noun,single) 
55 NR lome, west 
79 VBD (verb, past) 
32 VSZ (verb, -s, -es) 
~2 DTS these 
55 PPO me, him, it 
19 NNS (noun,plural) 
t0 RB (adverb) 
~0NNS$ men's 
}1 N PP ATR, Tom 
~thers others 
EXAMPLE Threshold of Similarity 
(part of speech) 1.000 0.995 0.990 0.985 0.980 
had ---1 J : 
has :\ ----~-- T_Y2-L _ ~_L_ -- 
WaS ........ / ! 
are ....... ~ i ' 
 een Q?)i 
bei.~ ....... ~ j 
h~ .................. 7 fie, Jt ........... ~ 
whe,whicb .............. ~ I ; 
I,we,thev ........ ~ ; 
biggest .... ~ /I I F F l; 
first, 2nd .... Jill| I \[i 
(adjective) ........ 1 I I t t |i 
dog's ------ 
ATR's ....... 1 I I I ~ I ! a the ....... ~_ I I ; ! 
(verb, base ~--t \] \[ ; (~erU. ~.g) -2 _~ i i i 
\] many, next .... ~ ', i 
and, or --~! ! 
half, all ~- I ~ 
both -I I about, off ~-- i 
(verb, -ed) ~_~ i 
(dummy) 
one, 2 I--t- ', comp.adj.) 
Prediction Rate 
0.8 - 
............. C) 
NETgram for test data 
Statistical Model for test data 
0.6 
0.4 .,... ,,,'""'" 
data as Trigram data. 
0.2 
I_ 
Fig.5 Clustering Result of iliA Output Vectors of NETgram 
( Bigram ) 
l 1 l__J 
0 \] 2 3 4 5 
Number of Prediction Candidate Categories 
Fig.6 NETgram (Trigran0 Prediction Rates 
215 3 
4.3. Differences between the statistical model and the 
N ETgram 
We discuss differences between two approaches, the 
conventional statistical model and the NETgraln. The 
conventional statistical model is based on the table-lookup. 
In the case of the Trigram nlodel, next appearance 
probabilities are computed frmn the histogram, counting 
the next word category for the two word categories in the 
training sentences. The probabilities are put in an 89*89*89 
size table. Thus, the 89 appearance probabilities of the next 
word category are obtained from the 89*89*89 size table 
using the argument of 89*89 symbol permutation. 
B 89.s9 ---> R 89 B ; binary space 
R ; real space 
In order to get 89 prediction values for the next word 
category, the trained NETgram procedure is as follows : 
First, encode from an 89*89 symbol permutation to a 
16-dimensional analogue code. ( h'om the input 
layer to the hidden layer 1 ) 
Second, transform the 16-dimensional analogue code to 
a 16-dimensiotial analogue code of the next 
word's 89 prediction values. ( from hidden layer 
1 to hidden layer 2 ) 
Finally, decode the 16-dimensional analogue code to 89 
prediction values of the next word's 89 
prediction values. ( fl'om hidden layer 2 to 
output layer ) 
B89"89 __> R 16 .__> R 16 ___> R 89 
The values of each space are output values of the 
NETgram units of each layer. These mappings are uniquely 
determined by link-weight values of the NETgram. That is 
to say, each layer unit value is computed by summing 
lower-connected unit values multiplied by link-weights and 
passing through the nonlinear function (sigmoid function). 
These two approaches need the following memory area 
(number of parameters). 
'Statistical model 89×89X89 = 704,969 
(max number of table elements} 
"NETgram (89+89)X16+16X16+16X89 +121 = 5,193 
(number of link-weights) 
(121 ; offset parameters) 
Thus, the parameters of the statistical model ar(~ 
89×89X89 probabilities. In practice, there are ninny 0 
values in 89 X 89 X 89 probabilities and the size of the table 
can be reduced using a particular technique, tlowever, this 
depends on the kind of task and the number of' training 
data. On the other band, the NETgram can produce 
89×89×89 prediction values using link-weight values 
memorized as parameters. 
Next, concerning'the data representation, the statistical 
model does not use input data structures because it is based 
on the table-lookup which get probabilities directly by 
symbol series input. On the other hand, the NETgram 
extracts a feature related to the distance between word 
categories from symbol series input into 16-dimensional 
analogue code. 16-dimensional analogue codes are described 
in 4.1 as the feature of the NETgram hidden layer in the 
Bigram nmdel. Thus, the NETgram interpolates sparse 
training data in the process of bigram and trigram training. 
From the viewpoint of data coding, The NETgram 
compresses data from 89-dimensional binary space into 16- 
dimensional real space. 
4.4. 4-gram Network 
4-gram prediction rates of the NETgram trained by 
2,048 are not much higher than the trigram prediction rates 
of that trained by 1,024 sentences. The statistical model 
experiment results show that more than 6,000 sentences are 
necessary as training data in order for the 4-gram 
prediction rates to equal the trigram prediction rates of the 
NETgram trained by 1,024 sentences. Futhermore, the 
trigram prediction rates of the statistical model increase as 
tim training sentences increase, up to a max of 16,000 
training sentences. The NETgram compensates for the 
sparse 4-gram data through the interpolation effect. 
However, it is clear that the 4-gram prediction NETgram 
needs far more than 16,000 training sentences in order to 
better the performance of the trigram prediction. Training 
for so many sentences was not possible because of the 
limited database and considerable computing required. 
5. Applying the NETgram to Speech Recognition 
The algorithm for applying the NETgram to speech 
recognition is shown in Fig.7. HMM refers to the ltidden 
Markov Model which is a technique for speech 
recognition\[l\] \[8\] \[9\]. 
216 
Acoustic lin___quistic 
Keyboard Conversation Speech Data Brown Col pus Toxt Data 
,* Traininq ~, ~ Test "** ~* Training ~. 
Data" ~ ,, Data • ~ ~ Data * "'l'" " 
Recoqnit o Training ~ j 
Fig. 7 I mprovenmnt of 11MM English Word Recognition 
Using the NETgram 
5.1. Formulation 
l.,et w~ show a word just after wi_ 1 and just before wi+ I. 
Let Ci show one of the word categories to which the word wi 
belongs. The same we,'d belonging to a different category is 
regarded as a different word. The trigram probability of wi 
is calculated using the following approximations. 
t'(wi/wi.2 wi !) 
~-~ P(wi/Ci-2 Ci-1) 
:= P(Ci/Ci-2 Ci4) 
X {P(w i / C i.2 Ci-1) / P( Ci / Ci-2 Ci-1) } 
:= P(Ci/Ci-2 Ci-1) {P(wi) / P(Ci) } (5.1) 
Word trigram probabilities are approximated using 
category trigram probabilities as follows : 
P(wi/wi.2 wi.1) ~ P(wi/Ci-2 Ci-1) (5.2) 
The probability of w~ is denoted by the preceding two- 
word sequence, wi.2, Wi.l, and is approximated by their 
preceding two-category sequence. 
P(wi / Ci-2 Ci-1) / P(Ci / Ci-2 Ci-l) = I'(wi) / P(Ci) (5.3) 
The probability ratio of wi and Ci given by Ci-2 Cij is 
nearly equal to the total probability ratio ofwi and C i. 
To eah.'ulate the above probability, the trigram 
probability of word category, P(C i / Ci-2 Ci-1), and word 
occurrence probability, P(wi)/P(Ci), are required. The word 
probability, P(wi) / P(Ci), is prestored in tim dietio,mry of 
word wi for each wm'd category. 
To avoid the multit)lication of probabilities% tbe log 
likelihood, STi, is defined as : 
STi = IolIP(Ci/Ci-2 Ci-l) ÷ log(P(wi)/P(C~)) (5,4) 
'rhe. first term is retrieved from the trigram of word 
categories and the second term is retrieved from the word 
dictionary. 
The maximum likelihood of a word, SW, is given by the 
sum of word likelihood values of a n-word sequence. The j- 
th word candidate in the i-th word of a sentence is denoted 
by wij. The likelihood of" wij , SWi,i, is defined as the sum of 
two types of likelihood which are the log-likelihood of the 
ItMM output probability, SHia, and the trigram likelihood, 
STij. Thus, the likelihood of wij is described as follows : 
SWi.j = (1-o~) . SHij + ~o . S'l'ij (55) 
where a~ is the weighting parameter to adjust the 
scaling of two kinds of likelihood. 
The maximum sentence likelihood values, G, are 
denoted by the following equations : 
Go,; = SWod ( i = O) (5.6) 
Gij -- max( S Wij -~ Gi.1, k ) (i v= 0 ) (5.7) k 
When the length era sentence is N, the maximum value 
of GN.ij is regarded as the maximum likelihood of the word 
sequence. The back-tracing of wij gives the optimal word 
sequence. 
In this paper, the best-ten candidates in the tlMM word 
recognition results are used. As the same word belonging to 
a different category is regarded as a different word, there 
are ten or more word candidates. 
5.2. English Word Recognition Results 
The experiment task is to translate keyboard 
conversations which include 377 English sentences (2,834 
words) uttered word by word by one mate native speaker. 
The sentences are composed of 542 different words. HMM 
phone models are trained using 190 sentences (1,487 words) 
without phone labels. 
The trigram models, the NETgram and the statistical 
model, are trained using using 512 and 1,024 sentences of 
the Brown Corpus Text l)atabase. One sentence is about 24 
words long. 
English word recognition results for 18'7 sentences 
(l,347 words) of keyboa,'d eonversatim~s using HMM and 
the trigram models are shown in Table 1. The recognition 
rate in the experiment using only flMM is 81.0% Using the 
217 
NETgram, the recognition rates have been in)proved about 
5 or 6 %. The number of recognition errors decreases using 
N l,\]Tgram. 
Table 1 tlMM English Word Recognition Rates 
using NETgrana or Statistical model (%) 
Model Training NETgram Statistical Sentences Model 
! 
I 512 86.3 85.5 Trigram 
1,024 86.9 85.4 
• i i i i 
The results of analyzing the hidden layer after training 
showed that the word categories; were classified into some 
linguistically significant groups, that is to say, the 
NlgTgram learns a linguistic structure. 
Next, the NETgram was applied to ttMM English word 
recognition, and it was shown that the NETgram can 
effectively correct word recognition errors in text. The word 
recognition rate using tlMM is 81.0%. The NETgram 
trained by 1,024 sentences improves the word recognition 
rate to 86.9%. The NETgram performs better than tim 
statistical trigram model when data is in*;ufficient to 
estimate the correct probabilities of a word sequence. 
Comparing the NETgram and the statistical trigram 
model, the performance of the NETgram is higher than that 
of the statistical trigram in the case of training data 
consisting of 512 and 1,024 sentences. Furthermore, the 
statistical trigram model cannot learn word sequences 
which do not appear as a trigram in the training datm Thus, 
the prediction value o. c that word sequence is zero. The 
NETgram does not make such fatal mistakes. 
Additional results for 4,096 and 30,000 sentences show 
that recognition rates are 86.9% and 87.2% using the 
N ETgram, and 86.6% and 87,7% using the statistical model. 
It is confirmed that the NETgram performs better than the 
statistical trigram nmdel when data is insufficient to 
estimate the correct probabilities Tberefore, even if 
training data were insufficient to estimate accurate trigram 
probabilities, the NETgram pe,'forms effectively. That is to 
say, the NETgram interpolates sparse trigram training 
data using bigram training memory. 
6. Conclusion 
In this paper we have presented the NETgram, a neural 
network for N-gram word category prediction in text. The 
NETgram can easily be expanded from Bigram to N-gram 
network without exponentially increasing the number of 
parameters. 
The training results showed that the Trigram word 
category prediction ability of the NETgram was comparable 
to that of the statistical Trigram model although the 
NgTgram requires fewer parameters than the statistical 
model. We also confirmed that NETgrams performed 
effectively for unknown data which never appeared in the 
training data, that is to say, NETgrams interpolate sparse 
training data naturally. 
Acknowledgement 
The authors would like to express their gratitude to Dr. 
Akira Kurematsu, president of ATR Interpreting 
Telephony Research Laboratories, which made this 
research possible, for his encouragement and support. We 
are also indebted to the members of the Speech Processing 
Department at ATR, for their hell) in the various stages of 
this research. 

References 

\[11 I,'.Jetinek, "Continuous Speech Recognition by 
Statistical Methods", Proceedings of the iI'~EI'\], Vol.64, No.4 
(1976.4) 

121 K.Shikano, "hnprovenmnt of Word Recognition Result 
by Trigram Model", \[CASSP 87, 29.2(1987.4) 

\[al T.J.Sejnowski, C.R.Rosenberg, "NETtalk, A Parallel 
Network that l,earns to Read Aloud", Teeh. Report, The 
Johns llopkins University EESS -86-01 (1986) 

\[41 M.Nakamura, K.Shikano, "A Study of English Word 
Category Prediction Based on Neural Networks", ICASSP 
89, SI 3.10(1989.5) 

151 13town University, "Brown Corpus", Tech: Report, 
Brown University (1967) 

16l D.E.Rumelhart, G.E.tllnton, R.d.Williams, "Parallel 
Distributed Processing", M.I.T. Press (1986) 

\[7\] F.Jelinek, R.Mercer, "Interpolated Estimation of 
Marker Source Parameters from Sparse Data", Pattern 
Recognition in Practice, ed. E,S. Gelsema and L. N. Kanal, 
North ttolland (1980) 

\[81 S.E.Levinson, L.R.Rabiner, M.M.Sondhi, "An 
Introduction to the Application of the Theory of 
Probabilistie Functions of a Marker Process to Automatic 
Speech Recognition", Bell Syst. Teeh. J. 62(4), pp1035-1074 
(1983) 

\[91 T.1Ianazawa, G.Kawabata, K.Shikano, "Study of 
Separate Vector Quantization for HMM Phoneme 
Recognition", ASJ 2-P-8(1988.10)
