<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0426"> <Title>Named Entity Recognition with Long Short-Term Memory</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Long Short-Term Memory (LSTM) </SectionTitle> <Paragraph position="0"> An LSTM network consists of 3 layers, an input layer, a recurrent hidden layer and an output layer. The hidden layer in LSTM constitutes the main innovation. It consists of one or more memory blocks each with one or more memory cells. Normally the inputs are connected to all of the cells and gates. The cells are connected to the outputs and the gates are connected to other cells and gates in the hidden layer.</Paragraph> <Paragraph position="1"> A single-celled memory block is illustrated in Figure 1.</Paragraph> <Paragraph position="2"> The block consists of an input gate, the memory cell and an output gate. The memory cell is a linear unit with selfconnection with a weight of value 1. When not receiving any input, the cell maintains its current activation over time. The input to the memory cell is passed through a squashing function and gated (multiplied) by the activation of the input gate. The input gate thus controls the flow of activation into the cell.</Paragraph> <Paragraph position="3"> The memory cell's output passes through a squashing function before being gated by the output gate activation. Thus the output gate controls the activation flow from cells to outputs. During training the gates learn to open and close in order to let new information into the cells and let the cells influence the outputs. The cells otherwise hold onto information unless new information is accepted by the input gate. Training of LSTM networks proceeds by a fusion of back-propagation through time and real-time recurrent learning, details of which can be found in (Hochreiter and Schmidhuber, 1997).</Paragraph> <Paragraph position="4"> In artificial tasks LSTM is capable of remembering information for up-to 1000 time-steps. It thus tackles one of the most serious problems affect the performance of recurrent networks on temporal sequence processing tasks.</Paragraph> <Paragraph position="5"> LSTM has recently been extended (Gers and Schmidhuber, 2000) to include forget gates which can learn to modify the cell contents directly and peephole connections which connect the cell directly to the gates, thus enabling them to use the cells' contents directly in their decisions. Peephole connections are not used here, but forget gates are used in some experiments.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The LSTM networks used here were trained as follows: a0 Each sentence is presented word by word in two passes. The first pass is used to accumulate information for disambiguation in the second pass. In the second pass the network is trained to output a vector representation (see Table 1) of the relevant output tag. During the first pass the network is just trained to produce &quot;0.1&quot;s at all its outputs. Note that the binary patterns listed in Table 1 are converted to &quot;0.1&quot;s and &quot;0.9&quot;s when used as target patterns. 
<Paragraph position="1"> • The inputs to the networks are as follows: - The SARDNET representations of the current word and, optionally, the next word (or a null vector if at the end of a sentence).</Paragraph> <Paragraph position="2"> - With some nets, the lexical space representation of the current word is also used. This involves computing, for each word, the frequencies with which the 250 most frequent words appear either immediately before or immediately after that word in the training set. The resulting 500-element vectors (250 elements each for the left and right contexts) are normalised and then mapped onto their top 25 principal components. - An orthogonal representation of the current part-of-speech (POS) tag. However, for some networks, the input units to which the POS tag is presented perform a form of time integration, as follows. The units are updated according to the formula x(t) = 0.5 * x(t-1) + i(t), where x(0) = 0, i(t) is the pattern representing the current POS tag, and t = 1, ..., N, where N is the length of the current sequence of inputs (twice the length of the current sentence due to the two-pass processing). By doing this the network receives a representation of the sequence of POS tags presented thus far, integrating these inputs over time (see the sketch after this list).</Paragraph> <Paragraph position="3"> - An orthogonal representation of the current chunk tag, though with some networks time integration is performed as described above.</Paragraph> <Paragraph position="4"> - One input indicates which pass through the sentence is in progress.</Paragraph> <Paragraph position="5"> - Some networks used a list of named entities (NEs) as follows. Some units are set aside corresponding to the categories of NE, 1 unit per category. If the current word occurs in an NE, the unit for that NE's category is activated. If the word occurs in more than one NE, the units for all of those NEs' categories are activated. In the case of the English data there were 5 categories of NE (though one category, &quot;MO&quot;, seems to arise from an error in the data).</Paragraph> <Paragraph position="6"> • The networks were trained with a learning rate of 0.3, no momentum, and direct connections from the input to the output layers, for 100 iterations. Weight updating occurred after the second pass of each sentence had been presented. The best set of weights found during training was saved and used for evaluation on the development data.</Paragraph> <Paragraph position="7"> • The results reported for each network are averaged over 5 runs from different randomised initial weight settings.</Paragraph> <Paragraph position="8"> Table 2 lists the various networks used in these experiments. The &quot;Net&quot; column lists the names of the networks used. The &quot;Opts&quot; column indicates whether word lists are used (list), whether a 1-word lookahead is used (look), whether lexical space vectors are used (lex), whether the units for the POS tags use time integration as described above (int), and whether time integration is performed on both the units for the POS tags and the units for the chunk tags (int2). Additionally, it indicates whether forget gates were used (FG). The &quot;Hidden&quot; column gives the size of the hidden layer of the network (e.g. 8x6 means 8 blocks of 6 cells). The &quot;Wts&quot; column gives the number of weights used.</Paragraph> [Table caption: ... networks trained on the training data. Results are averaged over 5 runs using different initial weights. * indicates use of the list of NEs. Italics indicate the best result reported on first submission, whilst bold indicates the best result achieved overall.]
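As a concrete illustration of the time-integrated POS-tag (and chunk-tag) inputs referred to in the list above, here is a minimal sketch of the update x(t) = 0.5 * x(t-1) + i(t). It is illustrative only, assumes NumPy arrays for the orthogonal tag patterns, and the function name is not from the paper.

```python
import numpy as np

def integrate_pos_inputs(pos_patterns):
    """Integrate orthogonal tag patterns over time: x(t) = 0.5*x(t-1) + i(t), x(0) = 0.

    pos_patterns: one 1-D array per time step, covering both passes over the sentence.
    Returns the integrated input vector presented to the network at each step.
    """
    x = np.zeros_like(pos_patterns[0], dtype=float)
    integrated = []
    for i_t in pos_patterns:
        x = 0.5 * x + i_t              # decay earlier tags, add the current tag pattern
        integrated.append(x.copy())
    return integrated

# Illustrative usage with a hypothetical 3-tag orthogonal encoding:
tags = [np.eye(3)[0], np.eye(3)[1], np.eye(3)[0]]
for step, vec in enumerate(integrate_pos_inputs(tags), start=1):
    print(step, vec)
```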
<Paragraph position="9"> Table 3 gives the results for extracting named entities from the English development data for these networks. The &quot;Precision&quot;, &quot;Recall&quot; and &quot;Fscore&quot; columns show the average scores across 5 runs from different random initial weight settings. The &quot;Range&quot; column shows the range of fscores produced across the 5 runs used for each network. Precision gives the percentage of named entities found that were correct, whilst Recall gives the percentage of the named entities defined in the data that were found. The Fscore is (2*Precision*Recall)/(Precision+Recall).</Paragraph> <Paragraph position="10"> Most options boosted performance. The biggest boosts came from the lexical space vectors and the word lists. The use of forget gates improved performance despite leading to fewer weights being used. Lookahead seems to make no significant difference overall. Only Net8 gets above the baseline performance (best fscore = 72.88), but its average performance is lower than the baseline.</Paragraph> <Paragraph position="11"> Table 4 gives the results for the best network broken down by type of NE, for both the English development and test data. These are from the best-performing run of Net8. Table 4 also depicts the best result from 5 runs of a network configured similarly to Net7 above, using the German data. This network did not employ a list of NEs, and the lemmas in the data were ignored. The fscore of 43.501 is almost 13 points higher than the baseline of 30.65. With the German test set the fscore is 47.74, some 17 points higher than the baseline.</Paragraph> </Section> </Paper>
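For completeness, the F-score computation quoted above is straightforward to reproduce; the short sketch below uses made-up counts purely for illustration, not figures from the paper.

```python
def f_score(precision, recall):
    """Balanced F-score as defined above: (2*Precision*Recall)/(Precision+Recall)."""
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)

# Hypothetical example (not the paper's data): the system proposes 120 entities,
# 90 of which are correct, against 130 entities defined in the gold standard.
precision = 100 * 90 / 120   # percentage of proposed entities that are correct
recall = 100 * 90 / 130      # percentage of gold entities that are found
print(round(f_score(precision, recall), 2))
```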