<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1027">
  <Title>PART-OF-SPEECH TAGGING WITH NEURAL NETWORKS</Title>
  <Section position="4" start_page="0" end_page="772" type="metho">
    <SectionTitle>
3 NEURAL NETWORKS
</SectionTitle>
    <Paragraph position="0"> Artificial neural networks consist of a large number of simple processing units. These units are highly interconnected by directed weighted links. Associated with each unit is an activation value. Through the connections, this activation is propagated to other units.</Paragraph>
    <Paragraph position="1"> In multilayer perceptron networks (MLP-networks), the most popular network type, the processing units are arranged vertically in several layers (fig. 1). Connections exist only between units in adjacent layers.</Paragraph>
    <Paragraph position="2"> The bottom layer is called input layer, because the activations of the units in this layer represent the input of the network. Correspondingly, the top layer is called output layer. Any layers between input layer and output layer are called hidden layers. Their activations are not visible externally.</Paragraph>
    <Paragraph position="3"> During the processing in a MLP-network, activations are propagated from input units through hidden units to output units. At each unit j, the weighted input activations $a_i w_{ij}$ are summed and a bias parameter $\theta_j$ is added: $net_j = \sum_i a_i w_{ij} + \theta_j$</Paragraph>
    <Paragraph position="5"> The resulting network input $net_j$ is then passed through a sigmoid function (the logistic function) in order to restrict the value range of the resulting activation $a_j$ to the interval [0,1]: $a_j = \frac{1}{1 + e^{-net_j}}$</Paragraph>
    <Paragraph position="7"> The network learns by adapting the weights of the connections between units, until the correct output is produced. One widely used method is the backpropagation algorithm, which performs a gradient descent search on the error surface. The weight update $\Delta w_{ij}$, i.e. the difference between the old and the new value of weight $w_{ij}$, is here defined as: $\Delta w_{ij} = \eta\, a_{pi}\, \delta_{pj}$, where

$\delta_{pj} = a_{pj}(1 - a_{pj})(t_{pj} - a_{pj})$, if j is an output unit
$\delta_{pj} = a_{pj}(1 - a_{pj}) \sum_k \delta_{pk} w_{jk}$, if j is a hidden unit (3)

Here, $t_p$ is the target output vector which the network must learn¹.</Paragraph>
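To make the propagation and weight-update rules above concrete, here is a minimal NumPy sketch of the forward pass through one layer, the logistic activation, and the two delta rules; the function and parameter names are illustrative and not taken from the paper.

    import numpy as np

    def forward(a_in, W, theta):
        # net_j = sum_i a_i * w_ij + theta_j, squashed by the logistic function
        net = a_in @ W + theta
        return 1.0 / (1.0 + np.exp(-net))

    def output_delta(a_out, target):
        # delta_pj = a_pj * (1 - a_pj) * (t_pj - a_pj) for output units
        return a_out * (1.0 - a_out) * (target - a_out)

    def hidden_delta(a_hid, W_out, delta_out):
        # delta_pj = a_pj * (1 - a_pj) * sum_k delta_pk * w_jk for hidden units
        return a_hid * (1.0 - a_hid) * (delta_out @ W_out.T)

    def weight_update(eta, a_pre, delta_post):
        # Delta w_ij = eta * a_pi * delta_pj (plain backpropagation, no momentum)
        return eta * np.outer(a_pre, delta_post)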
    <Paragraph position="8"> Training the MLP-network with the backpropagation rule guarantees that a local minimum of the error surface is found, though this is not necessarily the global one. In order to speed up the training process, a momentum term is often introduced into the update formula:</Paragraph>
    <Paragraph position="10"> $\Delta w_{ij}(t+1) = \eta\, a_{pi}\, \delta_{pj} + \alpha\, \Delta w_{ij}(t)$

¹The bias parameter $\theta_j$ is formally equivalent to a weight to an additional unit which always has the activation value 1 (cp. (Rumelhart, McClelland, 1984)).</Paragraph>
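A minimal sketch of the update with momentum, assuming the standard formulation with a momentum factor alpha and the previous update kept from the last step; these names are illustrative, not the paper's.

    import numpy as np

    def weight_update_momentum(eta, alpha, a_pre, delta_post, prev_update):
        # Delta w_ij(t+1) = eta * a_pi * delta_pj + alpha * Delta w_ij(t)
        return eta * np.outer(a_pre, delta_post) + alpha * prev_update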
    <Paragraph position="11"> For a detailed introduction to MLP networks see e.g.</Paragraph>
    <Paragraph position="12"> (Rumelhart, McClelland, 1984).</Paragraph>
    <Paragraph position="13"> 4 THE TAGGER NETWORK

The Net-Tagger consists of a MLP-network and a lexicon (see fig. 2).</Paragraph>
    <Paragraph position="14"> l;'igu,'e 2: Structure. of I.he Net-Tagger without hidden layer; tile arrow symbolizes the connections between the layers.</Paragraph>
    <Paragraph position="16"> In the output layer of the MLP network, each unit corresponds to one of the tags in the tagset. The network learns during the training to activate the output unit which represents the correct tag and to deactivate all other output units. Hence, in the trained network, the output unit with the highest activation indicates which tag should be attached to the word that is currently processed.</Paragraph>
    <Paragraph position="17"> The input of the network comprises all the information which the system has about the parts of speech of the current word, the p preceding words and the f following words. More precisely, for each part-of-speech tag $pos_j$ and each of the p+1+f words in the context, there is an input unit whose activation $in_{ij}$ represents the probability that $word_i$ has part of speech $pos_j$.</Paragraph>
    <Paragraph position="18"> For the word which is being tagged and the following words, the lexical part-of-speech probability $P(pos_j \mid word_i)$ is all we know about the part of speech². This probability does not take into account any contextual influences. So, we get the following input representation for the currently tagged word and the following words:

$in_{ij} = P(pos_j \mid word_i)$, if $i \geq 0$ (5)

²Lexical probabilities are estimated by dividing the number of times a word occurs with a given tag by the overall number of times the word occurs. This method is known as the Maximum Likelihood Principle.</Paragraph>
    <Paragraph position="19"> For the preceding words, there is more information available, because they have already been tagged. The activation values of the output units at the time of processing are here used instead of the lexical part-of-speech probabilities³:

$in_{ij} = out_j(word_i)$, if i &lt; 0

where $out_j(word_i)$ is the activation of output unit j at the time $word_i$ was tagged.</Paragraph>
    <Paragraph position="21"> Copying output activations of the network into the input units introduces recurrence into the network.</Paragraph>
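As a rough sketch of how such an input vector might be assembled (not the paper's code): lexical probability vectors for the current and following words, previously recorded output activations for the preceding words. The lexicon layout, the boundary padding and the "%DEFAULT%" key are assumptions made for illustration.

    import numpy as np

    def build_input(position, words, lexicon, prev_outputs, p, f, n_tags):
        # One probability vector per context word, concatenated:
        #   i >= 0: lexical probabilities P(pos_j | word_i) from the lexicon
        #   i negative: output activations recorded when word_i was tagged
        parts = []
        for i in range(-p, f + 1):
            idx = position + i
            if idx in range(len(words)):
                if i >= 0:
                    vec = lexicon.get(words[idx], lexicon["%DEFAULT%"])
                else:
                    vec = prev_outputs[idx]
            else:
                vec = np.zeros(n_tags)  # padding at sentence boundaries (assumption)
            parts.append(np.asarray(vec, dtype=float))
        return np.concatenate(parts)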
    <Paragraph position="22"> This complicates the training process, because the output of the network is not correct when the training starts and therefore cannot be fed back directly. Instead, a weighted average of the actual output and the target output is used.</Paragraph>
    <Paragraph position="23"> This average resembles more closely the output of the trained network, which is similar (or at least should be similar) to the target output. At the beginning of the training, the weighting of the target output is high. It falls to zero during the training.</Paragraph>
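A minimal sketch of this feedback blend, assuming a linear decay schedule for the target weighting (the paper only states that it starts high and falls to zero):

    import numpy as np

    def feedback_vector(actual_out, target_out, step, total_steps):
        # Weighted average of actual and target output fed back into the input
        # units for an already tagged word; the target weight decays linearly
        # to zero over training (the schedule is an assumption).
        w = max(0.0, 1.0 - step / float(total_steps))
        return w * np.asarray(target_out) + (1.0 - w) * np.asarray(actual_out)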
    <Paragraph position="24"> The network is trained on a tagged corpus. Target activations are 0 for all output units, except for the unit which corresponds to the correct tag, for which it is 1. A slightly modified version of the backpropagation algorithm with momentum term which has been presented in the last section is used: if the difference between the activation of an output unit j and the corresponding target output is below a predefined threshold (we used 0.1), the error signal $\delta_{pj}$ is set to zero. In this way the network is forced to pay more attention to larger error signals. This resulted in an improvement of the tagging accuracy by more than 1 percent.</Paragraph>
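The thresholded error signal described here can be sketched directly; only the function name is invented.

    import numpy as np

    def thresholded_output_delta(a_out, target, threshold=0.1):
        # Standard output delta, but zeroed where the output/target difference
        # is below the threshold (0.1 in the paper), so larger errors dominate
        # the weight updates.
        a_out, target = np.asarray(a_out), np.asarray(target)
        delta = a_out * (1.0 - a_out) * (target - a_out)
        keep = np.abs(target - a_out) >= threshold
        return np.where(keep, delta, 0.0)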
    <Paragraph position="25"> Network architectures with and without hidden layers have been trained and tested. In general, MLP-networks with hidden layers are more powerful than networks without one, but they also need more training and there is a higher risk of overlearning⁴. As will be shown in the next section, the Net-Tagger did not profit from a hidden layer.</Paragraph>
    <Paragraph position="26"> In both network types, the tagging of a single word is performed by copying the tag probabilities of the current word and its neighbours into the input units, propagating the activations through the network to the output units and determining the output unit which has the highest activation. The tag corresponding to this unit is then attached to the current word. If the second strongest activation in the output layer is close to the strongest one, the tag corresponding to the second strongest activation may be given as an alternative output. No additional computation is required for this. Further, it is possible to give a scored list of all tags as output.</Paragraph>
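A small sketch of this tagging step (propagation itself omitted): take the most active output unit and report the runner-up when it is close to the winner. The closeness margin is an assumption, since the paper does not quantify "close".

    import numpy as np

    def tag_word(net_output, tagset, alt_margin=0.1):
        # Return the tag of the most active output unit and, if the second
        # strongest activation is within alt_margin of the strongest one,
        # the runner-up tag as an alternative.
        net_output = np.asarray(net_output)
        order = np.argsort(net_output)[::-1]
        best, second = order[0], order[1]
        alternative = None
        if net_output[second] >= net_output[best] - alt_margin:
            alternative = tagset[second]
        return tagset[best], alternative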
    <Paragraph position="27"> ³The output activations of the network do not necessarily sum to 1. Therefore, they should not be interpreted as probabilities. ⁴Overlearning means that irrelevant features of the training set are learned. As a result, the network is unable to generalize.</Paragraph>
  </Section>
  <Section position="5" start_page="772" end_page="772" type="metho">
    <SectionTitle>
5 THE LEXICON
</SectionTitle>
    <Paragraph position="0"> The lexicon which contains the a priori tag probabilities for each word is similar to the lexicon which was used by Cutting et al. (1992). It has three parts: a fullform lexicon, a suffix lexicon and a default entry.</Paragraph>
    <Paragraph position="1"> No documentation of the construction algorithm of the suffix lexicon in (Cutting et al., 1992) was available.</Paragraph>
    <Paragraph position="2"> Thus, a new method based on information theoretic principles was developed.</Paragraph>
    <Paragraph position="3"> During the lookup of a word in the lexicon of the Net-Tagger, the fullform lexicon is searched first. If the word is found there, the corresponding tag probability vector is returned. Otherwise, the uppercase letters of the word are turned to lowercase, and the search in the fullform lexicon is repeated. If it fails again, the suffix lexicon is searched next. If none of the previous steps has been successful, the default entry of the lexicon is returned.</Paragraph>
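The lookup cascade can be sketched as follows; the data structures (a dict for the fullform lexicon, a callable for the suffix lexicon) are assumptions, not the paper's implementation.

    def lookup(word, fullform, suffix_lookup, default_entry):
        # Lookup order: fullform lexicon, lowercased fullform lexicon,
        # suffix lexicon, default entry.
        if word in fullform:
            return fullform[word]
        lowered = word.lower()
        if lowered in fullform:
            return fullform[lowered]
        vec = suffix_lookup(word)  # returns a tag probability vector or None
        if vec is not None:
            return vec
        return default_entry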
    <Paragraph position="4"> The fullform lexicon was created from a tagged training corpus (some 2 million words of the Penn Treebank Corpus). First, the number of occurrences of each word/tag pair was counted. Afterwards, those tags of each word with an estimated probability of less than 1 percent were removed, because they were in most cases the result of tagging errors in the original corpus.

The second part of the lexicon, the suffix lexicon, forms a tree. Each node of the tree (except the root node) is labeled with a character. At the leaves, tag probability vectors are attached. During a lookup, the suffix tree is searched from the root. In each step, the branch which is labeled with the next character from the end of the word suffix is followed.</Paragraph>
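A sketch of the fullform lexicon construction described at the start of the previous paragraph (counting word/tag pairs and dropping tags below 1 percent); the dictionary of relative frequencies is an assumed representation.

    from collections import Counter, defaultdict

    def build_fullform_lexicon(tagged_corpus, min_prob=0.01):
        # tagged_corpus: iterable of (word, tag) pairs
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        lexicon = {}
        for word, tags in counts.items():
            total = sum(tags.values())
            # drop tags whose estimated probability is below 1 percent
            kept = {t: n for t, n in tags.items() if n / total >= min_prob}
            kept_total = sum(kept.values())
            lexicon[word] = {t: n / kept_total for t, n in kept.items()}
        return lexicon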
    <Paragraph position="5"> Assume e.g. we want to look up the word tagging in the suffix lexicon which is shown in fig. 3. We start at the root (labeled #) and follow the branch which leads to the node labeled g. From there, we move to the node labeled n, and finally we end up in the node labeled i. Its tag probability vector (which is not shown in fig. 3) is returned.</Paragraph>
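The lookup walk through the suffix tree might look like the following sketch, where each node is a dict mapping characters to subnodes and the keys "DEFAULT" and "TAGS" (default subnode, tag probability vector at a leaf) are assumed names; the default-node fallback anticipates the description given further below.

    def make_suffix_lookup(root):
        def suffix_lookup(word):
            # Follow, character by character from the end of the word, the
            # branch labeled with that character; fall back to a default
            # subnode if present, otherwise the search fails.
            node = root
            for ch in reversed(word):
                if ch in node:
                    node = node[ch]
                elif "DEFAULT" in node:
                    node = node["DEFAULT"]
                else:
                    return None
                if "TAGS" in node:
                    return node["TAGS"]  # reached a leaf
            return node.get("TAGS")
        return suffix_lookup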
    <Paragraph position="6"> The suffix lexicon was automatically built from the training corpus. First, a suffix tree was constructed from the suffixes of length 5 of all words which were annotated with an open class part-of-speech⁵. Then tag frequencies were counted for all suffixes and stored at the corresponding tree nodes.</Paragraph>
    <Paragraph position="7"> In the next step, an information measure I(S) was calculated for each node of the tree:

$I(S) = - \sum_{pos} P(pos \mid S) \log_2 P(pos \mid S)$</Paragraph>
    <Paragraph position="9"> Here, S is the suffix which corresponds to the current node and $P(pos \mid S)$ is the probability of tag pos given a word with suffix S.</Paragraph>
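A direct implementation of this measure from a node's tag frequency counts (function name assumed):

    import math

    def information(tag_freqs):
        # I(S) = - sum over tags of P(pos|S) * log2 P(pos|S),
        # with P(pos|S) estimated from the tag frequencies at the node
        total = float(sum(tag_freqs.values()))
        return -sum((n / total) * math.log2(n / total)
                    for n in tag_freqs.values() if n > 0)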
    <Paragraph position="10"> Using this information measure, the suffix tree has been pruned. For each leaf, the weighted information gain G(aS) was calculated:

$G(aS) = F(aS)\,(I(S) - I(aS))$</Paragraph>
    <Paragraph position="12"> where S is the suffix of the parent node, aS is the suffix of the current node and F(aS) is the frequency of suffix aS.</Paragraph>
    <Paragraph position="13"> If the information gain at some leaf of the suffix tree is below a given threshold, it is removed. The tag frequencies of all deleted subnodes of a parent node are collected at the default node of the parent node. If the default node is the only remaining subnode, it is deleted too. In this case, the parent node becomes a leaf and it is also checked whether it is deletable.</Paragraph>
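A sketch of one pruning step over the children of a single parent node, under the threshold rule just described; the node layout and names are assumptions, and the further collapsing of a lone default node is omitted.

    import math

    def info(freqs):
        total = float(sum(freqs.values()))
        return -sum((n / total) * math.log2(n / total)
                    for n in freqs.values() if n > 0)

    def prune_children(parent_freqs, children, threshold=10.0):
        # children: dict mapping a leading character a to the tag frequency
        # counts of the leaf with suffix aS; leaves whose weighted information
        # gain G(aS) = F(aS) * (I(S) - I(aS)) falls below the threshold are
        # removed and their frequencies pooled in a DEFAULT child.
        i_parent = info(parent_freqs)
        kept, default_freqs = {}, {}
        for ch, freqs in children.items():
            gain = sum(freqs.values()) * (i_parent - info(freqs))
            if gain >= threshold:
                kept[ch] = freqs
            else:
                for tag, n in freqs.items():
                    default_freqs[tag] = default_freqs.get(tag, 0) + n
        if default_freqs:
            kept["DEFAULT"] = default_freqs
        return kept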
    <Paragraph position="14"> To illustrate this process consider the following example, where ess is the suffix of the parent node, less is the suffix of one child node and ness is the suffix of the other child node. The tag frequencies of these nodes are given in table 1.</Paragraph>
    <Paragraph position="15"> The information measure for the parent node is:

$I(ess) = - \frac{86}{143} \log_2 \frac{86}{143} - \frac{10}{143} \log_2 \frac{10}{143} - \ldots \approx 1.32$ (9)

The corresponding values for the child nodes are 0.39 for ness and 0.56 for less. Now, we can determine the weighted information gain at each of the child nodes.</Paragraph>
    <Paragraph position="16"> We get:</Paragraph>
    <Paragraph position="18"/>
    <Paragraph position="20"> Both values are well above a threshold of 10, and therefore none of them should be deleted.</Paragraph>
    <Paragraph position="21"> As explained before, the suffix tree is walked during a lookup along the path where the nodes are annotated with the letters of the word suffix in reversed order. If at some node on the path no matching subnode can be found, and there is a default subnode, then the default node is followed. If a leaf is reached at the end of the path, the corresponding tag probability vector is returned. Otherwise, the search fails and the default entry is returned.</Paragraph>
    <Paragraph position="22"> The default entry is constructed by subtracting the tag frequencies at all leaves of the pruned suffix tree from the tag frequencies of the root node and normalizing the resulting frequencies. Thereby, relative frequencies are obtained which sum to one.</Paragraph>
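Finally, the default entry construction can be sketched as follows (frequency dictionaries assumed as the representation):

    def build_default_entry(root_freqs, pruned_leaf_freqs):
        # Subtract the tag frequencies at all leaves of the pruned suffix tree
        # from the tag frequencies of the root node, then normalize so the
        # resulting relative frequencies sum to one.
        remaining = dict(root_freqs)
        for freqs in pruned_leaf_freqs:
            for tag, n in freqs.items():
                remaining[tag] = remaining.get(tag, 0) - n
        remaining = {t: n for t, n in remaining.items() if n > 0}
        total = float(sum(remaining.values()))
        return {t: n / total for t, n in remaining.items()}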
  </Section>
</Paper>