<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3038"> <Title>NEURAL NETWORK APPROACH TO WORD CATEGORY PREDICTION FOR ENGLISH TEXTS</Title> <Section position="3" start_page="0" end_page="217" type="evalu"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> For the realization of an interpreting telephony system, an accurate word recognition system is necessary. Because it is difficult to recognize English words using only their acoustical characteristics, an accurate word recognition system needs certain linguistic information. Errors in word recognition results for sentences uttered in isolation include the tbllowing types of errors recoverable using linguistic infi)rmation.</Paragraph> <Paragraph position="1"> (a) Local syntax errors.</Paragraph> <Paragraph position="2"> (b) Global syntax errors.</Paragraph> <Paragraph position="3"> (c) Semantics and context errors.</Paragraph> <Paragraph position="4"> Many errors arise with one-syllable words such as ( I, by ) and ( the, be ). More than half of these errors can be recovered by use of local syntax rules. The Trigram language model is an extremely rough approximation of a language, but it is a practical and useful model from the J Research and Development Department, NITSUF, O CorporatAo~l 1 N'ffFBasic Research \[,aboratm'ies &quot;J&quot;l&quot;t N'UI' ttuman Interface \[,aboratories viewpoint of entropy. At the very least, the trigram model is useful as a preprocessor for a linguistic processor which will be able to deal with syntax, semantics and context. Text Mr. Hawksly said yesterday he would The trigram model using the appearance probabilities of the following Word was efficiently applied to improve word recognition results 111\]\[121. However, the traditional statistical approach requires considerable training samples to estimate the probabilities of word sequence and considerable memory capacity to process these probabilities. Additionally, it is difficult to predict unseen data which never appeared in tile training data.</Paragraph> <Paragraph position="5"> Neural networks are interesting devices which can learn general characteristics or rules from limited sample data. Neural networks are particularly useful in pattern recognition. In symbol processing, NETtalk \[3\], which produces phonemes from English text, has been used successfully. Now a neural network is being applied to word category prediction \[4\].</Paragraph> <Paragraph position="6"> This paper describes the NETgram, which is a neural network for word category prediction in text. The NETgram is constructed by a trained Bigram network with two hidden layers, so that each bidden layer can learn the coarse-coded features of the input or output word category. Also, the NETgram can easily be expanded from Bigran) to N-gram network without exponentially increasing the number of parameters. Tire NETgram is tested by ti'ainin~ experiments with the Brown Corpus English Text Database i 213 \[51. The NETgram is applied to IIMM English word recognition resulting in an improvement of its recognition performance.</Paragraph> <Paragraph position="7"> 2. Word Category Prediction Neural Net (N ETgram) The basic Bigram network in the NETgram is a 4-layer feed-forward network, as shown in Fig.2, which has 2 hidden layers. Because this network is trained for the next word category as the output for an input word category, hidden layers are expected to learn some linguistic structure from the relationship between one word category and the next in the text. 
<Paragraph position="8"> The Trigram network in the NETgram has a structure such that, as the order of the N-gram increases, every new input block produced is fully connected to the lower hidden layer of one basic Bigram network. The link weight is set at wt' as shown in Fig.3.</Paragraph> <Paragraph position="10"> When expanding from the Trigram network to the 4-gram network, one lower hidden layer block is added; the first and second input blocks are fully connected to one lower hidden layer block, and the second and third input blocks are fully connected to the other lower hidden layer block.</Paragraph> <Paragraph position="11"> 3. How to Train NETgram How to train a NETgram, e.g. a Trigram network, is shown in Fig.4. As input data, word categories in the Brown Corpus text [5] are given, in order, from the first word in the sentence to the last. In one input block, only the one unit corresponding to the word category number is turned ON (1); the others are turned OFF (0). As output data, only the one unit corresponding to the next word category number is trained to be ON (1); the others are trained to be OFF (0). The training algorithm is the Back-Propagation algorithm [6], which uses gradient descent to change the link weights in order to reduce the difference between the network output vectors and the desired output vectors.</Paragraph> <Paragraph position="12"> First, the basic Bigram network is trained. Next, the Trigram networks are trained with the link weight values trained by the basic Bigram network as initial values.</Paragraph> <Paragraph position="13"> This task is a many-to-many mapping problem. Thus, it is difficult to train because the updating direction of the link weight vector easily fluctuates. (Fig.4: How to Train NETgram (Trigram Model).) In a two-sentence training experiment of about 50 words, we have confirmed that the output values of the basic Bigram network converge on the next occurrence probability distribution. However, for many training data, considerable time is required for training. Therefore, in order to increase training speed, we use the next word category occurrence probability distribution calculated for 1,024 sentences (about 24,000 words) as output training data in the basic Bigram network. Of course, in Trigram and 4-gram training, we use the next one-word category as output training data. Next, we consider whether the hidden layer has obtained some linguistic structure. We calculated the similarity of every two lower hidden layer (HL1) output vectors for the 89 word categories and clustered them. Similarity S is calculated by</Paragraph> <Paragraph position="17"> S(Ci, Cj) = (M(Ci), M(Cj)) / ( ||M(Ci)|| ||M(Cj)|| )</Paragraph> <Paragraph position="18"> where M(Ci) is the lower hidden layer (HL1) output vector of the input word category Ci, (M(Ci), M(Cj)) is the inner product of M(Ci) and M(Cj), and ||M(Ci)|| is the norm of M(Ci).</Paragraph> <Paragraph position="19"> The clustering result is shown in Fig.5. Clustering with a similarity threshold of 0.985, the word categories are classified into linguistically significant groups, which are the HAVE verb group, the BE verb group, the subjective pronoun group, a group whose categories should appear before a noun, and others. Therefore, the NETgram can learn linguistic structure naturally. A minimal sketch of the similarity computation used for this clustering is given below.</Paragraph>
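The similarity above is an ordinary cosine similarity between HL1 output vectors. The following is a minimal illustrative sketch, assuming the HL1 activations are available as vectors; the function name and the 16-dimensional toy vectors are hypothetical.

# Illustrative only: cosine similarity between two lower-hidden-layer (HL1)
# output vectors M(Ci) and M(Cj), as used for clustering the 89 categories.
import numpy as np

def hl1_similarity(m_ci, m_cj):
    # inner product divided by the product of the norms
    return float(np.dot(m_ci, m_cj) / (np.linalg.norm(m_ci) * np.linalg.norm(m_cj)))

# Example with two hypothetical 16-dimensional HL1 vectors.
m_a = np.array([0.1, 0.8, 0.3] + [0.5] * 13)
m_b = np.array([0.2, 0.7, 0.4] + [0.5] * 13)
print(hl1_similarity(m_a, m_b))  # values near 1.0 indicate similar categories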
<Section position="1" start_page="214" end_page="216" type="sub_section"> <SectionTitle> 4.2. Trigram Network </SectionTitle> <Paragraph position="0"> Word category prediction results are shown in Fig.6.</Paragraph> <Paragraph position="1"> The NETgram (Trigram network) is comparable to the statistical Trigram model for test data.</Paragraph> <Paragraph position="2"> Furthermore, the NETgram performs effectively for unseen data which never appeared in the training data, whereas the statistical Trigram model cannot predict the next word category for unseen data. That is to say, NETgrams interpolate sparse training data in the same way that deleted interpolation [7] does.</Paragraph> <Paragraph position="3"> (Fig.5: Clustering result of the 89 word categories based on the similarity of their HL1 output vectors.) 4.3. Differences between the statistical model and the NETgram We discuss the differences between the two approaches, the conventional statistical model and the NETgram. The conventional statistical model is based on table-lookup. In the case of the Trigram model, the next appearance probabilities are computed from the histogram obtained by counting the next word category for each pair of preceding word categories in the training sentences. The probabilities are put in an 89 x 89 x 89 table. Thus, the 89 appearance probabilities of the next word category are obtained from the 89 x 89 x 89 table using the 89 x 89 symbol permutation as the argument: B^(89x89) -> R^89, where B denotes binary space and R denotes real space. In order to get the 89 prediction values for the next word category, the trained NETgram proceeds as follows. First, it encodes the 89 x 89 symbol permutation into a 16-dimensional analogue code (from the input layer to hidden layer 1). Second, it transforms this 16-dimensional analogue code into another 16-dimensional analogue code for the next word category (from hidden layer 1 to hidden layer 2). Finally, it decodes the 16-dimensional analogue code into the 89 prediction values of the next word category (from hidden layer 2 to the output layer). That is, B^(89x89) -> R^16 -> R^16 -> R^89.</Paragraph> <Paragraph position="5"> The values of each space are the output values of the NETgram units of each layer. These mappings are uniquely determined by the link-weight values of the NETgram. That is to say, each layer unit value is computed by summing the lower-connected unit values multiplied by their link weights and passing the result through a nonlinear function (the sigmoid function).</Paragraph> <Paragraph position="6"> These two approaches need the following memory area (number of parameters).</Paragraph> <Paragraph position="7"> Statistical model: 89 x 89 x 89 = 704,969 (maximum number of table elements).</Paragraph> <Paragraph position="9"> Thus, the parameters of the statistical model are the 89 x 89 x 89 probabilities. In practice, there are many 0 values among the 89 x 89 x 89 probabilities and the size of the table can be reduced using a particular technique. However, this depends on the kind of task and the number of training data. A minimal sketch of this table-lookup approach is given below.</Paragraph>
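For comparison, the table-lookup model can be sketched as follows (illustrative only, not the paper's implementation): trigram counts are accumulated in an 89 x 89 x 89 array and normalized into next-category probabilities, which makes both the memory cost and the zero-probability problem for unseen trigrams explicit.

# Illustrative sketch of the table-lookup trigram model: count next-category
# occurrences for each pair of preceding categories, then normalize the
# 89 x 89 x 89 count table into probabilities.
import numpy as np

N_CATEGORIES = 89

def train_trigram_table(category_sequences):
    counts = np.zeros((N_CATEGORIES, N_CATEGORIES, N_CATEGORIES))
    for seq in category_sequences:
        for c1, c2, c3 in zip(seq, seq[1:], seq[2:]):
            counts[c1, c2, c3] += 1
    # Normalize; rows with no observations stay all-zero, which is exactly
    # the sparse-data problem the NETgram avoids.
    totals = counts.sum(axis=2, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        probs = np.where(totals > 0, counts / totals, 0.0)
    return probs

# Toy usage with hypothetical category-number sequences.
table = train_trigram_table([[1, 5, 7, 5, 1], [5, 7, 5, 2]])
print(table[1, 5])  # 89 next-category probabilities given categories (1, 5)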
<Paragraph position="10"> On the other hand, the NETgram can produce the 89 x 89 x 89 prediction values using only the link-weight values memorized as its parameters. Next, concerning the data representation, the statistical model does not use any structure of the input data, because it is based on table-lookup, which gets probabilities directly from the symbol series input. On the other hand, the NETgram extracts a feature related to the distance between word categories from the symbol series input into a 16-dimensional analogue code. These 16-dimensional analogue codes are described in 4.1 as the feature of the NETgram hidden layer in the Bigram model. Thus, the NETgram interpolates sparse training data in the process of bigram and trigram training. From the viewpoint of data coding, the NETgram compresses the data from an 89-dimensional binary space into a 16-dimensional real space.</Paragraph> <Paragraph position="11"> 4.4. 4-gram Network The 4-gram prediction rates of the NETgram trained on 2,048 sentences are not much higher than the trigram prediction rates of the NETgram trained on 1,024 sentences. The statistical model experiment results show that more than 6,000 sentences are necessary as training data in order for its 4-gram prediction rates to equal the trigram prediction rates of the NETgram trained on 1,024 sentences. Furthermore, the trigram prediction rates of the statistical model increase as the number of training sentences increases, up to a maximum of 16,000 training sentences. The NETgram compensates for the sparse 4-gram data through the interpolation effect.</Paragraph> <Paragraph position="12"> However, it is clear that the 4-gram prediction NETgram needs far more than 16,000 training sentences in order to better the performance of the trigram prediction. Training on so many sentences was not possible because of the limited database and the considerable computing required.</Paragraph> <Paragraph position="13"> 5. Applying the NETgram to Speech Recognition The algorithm for applying the NETgram to speech recognition is shown in Fig.7. HMM refers to the Hidden Markov Model, a technique for speech recognition [1][8][9].</Paragraph> </Section> <Section position="2" start_page="216" end_page="217" type="sub_section"> <SectionTitle> 5.1. Formulation </SectionTitle> <Paragraph position="0"> Let wi denote a word just after wi-1 and just before wi+1. Let Ci denote one of the word categories to which the word wi belongs. The same word belonging to a different category is regarded as a different word. The trigram probability of wi is calculated using the following approximations.</Paragraph> <Paragraph position="2"> Word trigram probabilities are approximated using category trigram probabilities as follows: P(wi | wi-2 wi-1) ≈ P(wi | Ci-2 Ci-1). That is, the probability of wi, conditioned on the preceding two-word sequence wi-2 wi-1, is approximated by the probability conditioned on the preceding two-category sequence.</Paragraph> <Paragraph position="6"> The probability ratio of wi and Ci given Ci-2 Ci-1 is nearly equal to their overall probability ratio, so that P(wi | Ci-2 Ci-1) ≈ P(Ci | Ci-2 Ci-1) P(wi) / P(Ci).</Paragraph> <Paragraph position="7"> To calculate the above probability, the trigram probability of the word category, P(Ci | Ci-2 Ci-1), and the word occurrence probability, P(wi) / P(Ci), are required. A minimal sketch of this approximation is given below.</Paragraph>
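The following is a minimal sketch of this approximation; the dictionaries and category names are hypothetical, and in the paper the category trigram probability would come from the NETgram output rather than a fixed table.

# Illustrative sketch of the category-based trigram approximation
# P(wi | Ci-2 Ci-1) ≈ P(Ci | Ci-2 Ci-1) * P(wi) / P(Ci).
# category_trigram, word_prob and category_prob are hypothetical inputs.

def word_trigram_prob(word, category, prev2_cat, prev1_cat,
                      category_trigram, word_prob, category_prob):
    """Approximate the probability of a word given the two preceding word categories."""
    p_cat = category_trigram[(prev2_cat, prev1_cat)][category]  # P(Ci | Ci-2 Ci-1)
    return p_cat * word_prob[word] / category_prob[category]    # times P(wi) / P(Ci)

# Toy usage with made-up numbers.
category_trigram = {("PRON", "VERB"): {"DET": 0.4, "NOUN": 0.2}}
word_prob = {"the": 0.05}
category_prob = {"DET": 0.10}
print(word_trigram_prob("the", "DET", "PRON", "VERB",
                        category_trigram, word_prob, category_prob))  # 0.2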
<Paragraph position="8"> The word probability, P(wi) / P(Ci), is prestored in the dictionary entry of word wi for each word category.</Paragraph> <Paragraph position="9"> To avoid the multiplication of probabilities, the log likelihood, STi, is defined as STi = log P(Ci | Ci-2 Ci-1) + log ( P(wi) / P(Ci) ).</Paragraph> <Paragraph position="11"> The first term is retrieved from the trigram of word categories and the second term is retrieved from the word dictionary.</Paragraph> <Paragraph position="12"> The maximum likelihood of a word sequence, SW, is given by the sum of the word likelihood values of an n-word sequence. The j-th word candidate in the i-th word position of a sentence is denoted by wij. The likelihood of wij, SWij, is defined as the sum of two types of likelihood: the log likelihood of the HMM output probability, SHij, and the trigram likelihood, STij. Thus, the likelihood of wij is described as SWij = SHij + a STij, where a is the weighting parameter that adjusts the scaling of the two kinds of likelihood.</Paragraph> <Paragraph position="15"> The maximum sentence likelihood values, G, are given by the following equations: G0,j = SW0,j (i = 0) (5.6) Gi,j = max_k ( SWi,j + Gi-1,k ) (i != 0) (5.7) When the length of a sentence is N, the maximum value of GN-1,j over j is regarded as the maximum likelihood of the word sequence. Back-tracing of wij gives the optimal word sequence.</Paragraph> <Paragraph position="16"> In this paper, the best-ten candidates in the HMM word recognition results are used. As the same word belonging to a different category is regarded as a different word, there are ten or more word candidates. A minimal sketch of the search defined by (5.6) and (5.7) is given below.</Paragraph>
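The following is a minimal sketch of the search defined by (5.6) and (5.7), assuming the HMM log likelihoods SHij and trigram log likelihoods STij are already available for each candidate; the function name, weighting value and toy numbers are illustrative, not the paper's implementation.

# Illustrative sketch of the search in (5.6)-(5.7):
# SW[i][j] = SH[i][j] + a * ST[i][j], G[i][j] = max_k ( SW[i][j] + G[i-1][k] ),
# followed by back-tracing to recover the best word-candidate sequence.

def best_word_sequence(SH, ST, a=1.0):
    """SH, ST: lists of lists; SH[i][j], ST[i][j] are log likelihoods of the
    j-th candidate at word position i. Returns the best candidate index per position."""
    n = len(SH)
    SW = [[sh + a * st for sh, st in zip(SH[i], ST[i])] for i in range(n)]
    G = [SW[0][:]]                      # (5.6): G[0][j] = SW[0][j]
    back = [[None] * len(SW[0])]
    for i in range(1, n):
        row, ptr = [], []
        for j in range(len(SW[i])):
            k = max(range(len(G[i - 1])), key=lambda k: G[i - 1][k])
            row.append(SW[i][j] + G[i - 1][k])   # (5.7)
            ptr.append(k)
        G.append(row)
        back.append(ptr)
    # Back-trace from the best final candidate.
    j = max(range(len(G[-1])), key=lambda j: G[-1][j])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return list(reversed(path))

# Toy usage: 3 word positions, 2 candidates each (log likelihoods).
print(best_word_sequence([[-1.0, -2.0], [-0.5, -3.0], [-2.0, -0.7]],
                         [[-0.2, -0.1], [-0.4, -0.3], [-0.1, -0.9]]))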
<Paragraph position="17"> 5.2. English Word Recognition Results The experiment task is to translate keyboard conversations which include 377 English sentences (2,834 words) uttered word by word by one male native speaker. The sentences are composed of 542 different words. HMM phone models are trained using 190 sentences (1,487 words) without phone labels.</Paragraph> <Paragraph position="18"> The trigram models, the NETgram and the statistical model, are trained using 512 and 1,024 sentences of the Brown Corpus Text Database. One sentence is about 24 words long.</Paragraph> <Paragraph position="19"> English word recognition results for 187 sentences (1,347 words) of keyboard conversations using HMM and the trigram models are shown in Table 1. The recognition rate in the experiment using HMM alone is 81.0%. Using the NETgram, the recognition rates are improved by about 5 to 6%. The number of recognition errors decreases when the NETgram is used.</Paragraph> <Paragraph position="20"> (Table 1: HMM English Word Recognition Rates.) The results of analyzing the hidden layer after training showed that the word categories were classified into linguistically significant groups; that is to say, the NETgram learns a linguistic structure.</Paragraph> <Paragraph position="21"> Next, the NETgram was applied to HMM English word recognition, and it was shown that the NETgram can effectively correct word recognition errors in text. The word recognition rate using HMM alone is 81.0%. The NETgram trained on 1,024 sentences improves the word recognition rate to 86.9%. The NETgram performs better than the statistical trigram model when data is insufficient to estimate the correct probabilities of a word sequence; comparing the two, the performance of the NETgram is higher than that of the statistical trigram for training data consisting of 512 and 1,024 sentences. Furthermore, the statistical trigram model cannot learn word sequences which do not appear as a trigram in the training data, so the prediction value for such a word sequence is zero. The NETgram does not make such fatal mistakes.</Paragraph> <Paragraph position="22"> Additional results for 4,096 and 30,000 training sentences show recognition rates of 86.9% and 87.2% using the NETgram, and 86.6% and 87.7% using the statistical model. It is confirmed that the NETgram performs better than the statistical trigram model when data is insufficient to estimate the correct probabilities. Therefore, even if training data is insufficient to estimate accurate trigram probabilities, the NETgram performs effectively. That is to say, the NETgram interpolates sparse trigram training data using its bigram training memory.</Paragraph> </Section> </Section> </Paper>