<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2031">
<Title>Learning with Multiple Stacking for Named Entity Recognition</Title>
<Section position="2" start_page="0" end_page="0" type="metho">
<SectionTitle>2 System Description</SectionTitle>
<Paragraph position="0"> Before describing our system, let us look at one aspect of named entity recognition, the outline of our method, and its relation to previous work.</Paragraph>
<Paragraph position="1"> The task of named entity recognition can be regarded as a process of assigning a named entity tag to each given word, taking into account the patterns of the surrounding words. Suppose that a sequence of words is given as below: ... W_{-2}, W_{-1}, W_0, W_{+1}, W_{+2} ...</Paragraph>
<Paragraph position="2"> Then, given that the current position is at word W_0, the task is to assign tag T_0 to W_0.</Paragraph>
<Paragraph position="3"> In the named entity recognition task, an entity is often made up of a sequence of words rather than a single word. For example, the entity "the United States of America" consists of five words. When allocating a tag to each word, the tags of the surrounding words (we call these the surrounding tags) can be a clue for predicting the tag of the current word (we call this the current tag). For the test set, however, these tags are unknown.</Paragraph>
<Paragraph position="4"> In order to take the surrounding tags into account when predicting the current tag, we propose a method which employs multiple stacked learners, an extension of the stacking method (Wolpert, 1992).</Paragraph>
<Paragraph position="5"> Stacking-based methods for named entity recognition usually employ learners at two or more levels. The higher-level learner uses the current tags predicted by its lower-level learners. In our method, by contrast, the higher-level learner uses not only the current tag but also the surrounding tags predicted by the lower-level learner. Our aim is to improve the performance of the base system by using the surrounding tags as features.</Paragraph>
<Paragraph position="6"> At least two groups have previously proposed systems which use the predicted surrounding tags.</Paragraph>
<Paragraph position="7"> One system, proposed by van Halteren et al. (1998), also uses a stacking method. It uses four completely different types of taggers as the first-level learners, on the assumption that first-level learners should be as different as possible. The tags predicted by the first-level learners are used as features of the second-level learner.</Paragraph>
<Paragraph position="8"> The other system uses "dynamic features" (Kudo and Matsumoto, 2000; Yamada et al., 2001). In the test phase, the predicted tags of the preceding (or subsequent) words are used as features, which are called "dynamic features". In the training phase, the system uses the answer tags of the preceding (or subsequent) words as features.</Paragraph>
<Paragraph position="9"> A more detailed description of our system follows.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.1 Learning Algorithm</SectionTitle>
<Paragraph position="0"> As the learning algorithm at all levels, we use real AdaBoost.MH, an extension of AdaBoost to multiclass problems (Schapire and Singer, 1999). For weak learners, we use decision stumps (Schapire and Singer, 1999), which select only one feature to classify an example.</Paragraph>
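As an illustration of this learning algorithm, the following is a minimal sketch (our reconstruction, not the authors' code) of real AdaBoost.MH with decision-stump weak learners over binary features, following Schapire and Singer (1999). The function names, the smoothing constant EPS, and the use of NumPy are our own assumptions.

```python
# Sketch of real AdaBoost.MH with decision stumps (Schapire and Singer, 1999).
# Assumes binary features; names and smoothing are illustrative, not from the paper.
import numpy as np

EPS = 1e-8  # smoothing term for the real-valued confidence predictions

def train_adaboost_mh(X, y, n_labels, n_rounds=50):
    """X: (m, n) binary feature matrix; y: (m,) integer class labels."""
    m, n = X.shape
    # One-vs-all encoding: Y[i, l] = +1 if example i has label l, else -1.
    Y = -np.ones((m, n_labels))
    Y[np.arange(m), y] = 1.0
    # Distribution over (example, label) pairs.
    D = np.full((m, n_labels), 1.0 / (m * n_labels))
    stumps = []  # each stump: (feature index j, confidences c[2, n_labels])
    for _ in range(n_rounds):
        best = None
        for j in range(n):  # a stump "selects only one feature"
            mask = X[:, j] == 1
            Z = 0.0
            c = np.zeros((2, n_labels))
            for b, rows in enumerate((~mask, mask)):  # feature absent / present
                Wp = np.where(Y[rows] > 0, D[rows], 0.0).sum(axis=0)
                Wm = np.where(Y[rows] < 0, D[rows], 0.0).sum(axis=0)
                c[b] = 0.5 * np.log((Wp + EPS) / (Wm + EPS))
                Z += 2.0 * np.sqrt(Wp * Wm).sum()
            if best is None or Z < best[0]:  # minimize the normalization factor
                best = (Z, j, c)
        _, j, c = best
        stumps.append((j, c))
        h = c[X[:, j].astype(int)]   # (m, n_labels) real-valued predictions
        D *= np.exp(-Y * h)          # reweight toward hard (example, label) pairs
        D /= D.sum()
    return stumps

def predict(stumps, X):
    scores = sum(c[X[:, j].astype(int)] for j, c in stumps)
    return scores.argmax(axis=1)     # H(x) = argmax_l sum_t h_t(x, l)
```

Each round picks the single feature whose stump minimizes the normalization factor Z, which is the sense in which a decision stump selects only one feature to classify an example.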
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.2 Features</SectionTitle>
<Paragraph position="0"> We use the following types of features to predict the tag of a word:</Paragraph>
<Paragraph position="1"> - One of the eight word features in Table 1. These features are similar to those used in (Bikel et al., 1997).</Paragraph>
<Paragraph position="2"> - First and last two/three letters of W_0.</Paragraph>
<Paragraph position="3"> - Estimated tag of W_0 based on the word unigram model in the training set.</Paragraph>
<Paragraph position="4"> Additionally, we use the surrounding tag feature. This feature is discussed in Section 2.3.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.3 Multiple Stacking</SectionTitle>
<Paragraph position="0"> In order to take into account the tags of the surrounding words, our system employs stacked learners. Figure 1 outlines the learning and application algorithm of our system. In the learning phase, the base system is trained first. After that, each higher-level learner is trained using the word features (described in Section 2.2), the current tag T_0, and the surrounding tags T_{-2}, T_{-1}, T_{+1}, T_{+2} predicted by the lower-level learner. Although these tags may not be predicted correctly, as the prediction accuracy of the lower-level learner improves, the features used in each prediction become more accurate. In the application phase, all of the learners are cascaded in order.</Paragraph>
<Paragraph position="1"> Compared to the previous systems (van Halteren et al., 1998; Kudo and Matsumoto, 2000; Yamada et al., 2001), our system: (i) employs stacking with more than two levels; (ii) uses only one algorithm and trains only one learner at each level; (iii) uses the surrounding tags given by the lower-level learner; (iv) uses both the preceding and subsequent tags as features; and (v) uses the predicted tags instead of the answer tags in the training phase. A sketch of this cascade is given below.</Paragraph>
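To make the cascade concrete, here is a minimal sketch of the multiple-stacking procedure as we read it (our reconstruction; the helper names tag_window, train_stack, apply_stack, and the dict-based feature interface are hypothetical): the base system is trained on word features alone, and each higher level is trained and applied on word features augmented with the current tag T_0 and the surrounding tags T_{-2}, T_{-1}, T_{+1}, T_{+2} predicted by the level below.

```python
# Sketch of the multiple-stacking cascade (our reconstruction, not the authors'
# code). A learner is anything with fit(features, tags) / predict(features);
# features are dicts here, which would be binarized for the stump learner above.

def tag_window(tags, i, offsets=(-2, -1, 1, 2)):
    """Surrounding-tag features T-2, T-1, T+1, T+2, padded at sentence edges."""
    return {f"T{o:+d}": (tags[i + o] if 0 <= i + o < len(tags) else "PAD")
            for o in offsets}

def train_stack(word_feats, gold_tags, make_learner, n_levels=3):
    learners = [make_learner()]
    learners[0].fit(word_feats, gold_tags)       # level 0: the base system
    pred = learners[0].predict(word_feats)
    for _ in range(1, n_levels):
        # Augment word features with the lower level's PREDICTED tags
        # (not the answer tags), for both current and surrounding positions.
        feats = [{**f, "T0": pred[i], **tag_window(pred, i)}
                 for i, f in enumerate(word_feats)]
        learner = make_learner()                 # same single algorithm per level
        learner.fit(feats, gold_tags)
        learners.append(learner)
        pred = learner.predict(feats)
    return learners

def apply_stack(learners, word_feats):
    pred = learners[0].predict(word_feats)
    for learner in learners[1:]:                 # cascade the levels in order
        feats = [{**f, "T0": pred[i], **tag_window(pred, i)}
                 for i, f in enumerate(word_feats)]
        pred = learner.predict(feats)
    return pred
```

</Section>
</Section>
</Paper>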