<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0124"> <Title>Boosting for Chinese Named Entity Recognition</Title>
<Section position="5" start_page="150" end_page="150" type="metho"> <SectionTitle> 2 Boosting </SectionTitle>
<Paragraph position="0"> The main idea behind the boosting algorithm is that a set of many simple and moderately accurate weak classifiers (also called weak hypotheses) can be effectively combined to yield a single strong classifier (also called the final hypothesis). The algorithm works by sequentially training weak classifiers whose classification accuracy need only be slightly better than random guessing, and finally combining them into a highly accurate classifier.</Paragraph>
<Paragraph position="1"> Each weak classifier searches the hypothesis space for the hypothesis that best classifies the current set of training examples. Based on the evaluation of each iteration, the algorithm reweights the training examples, giving higher weights to the examples misclassified in the previous iteration and thereby forcing the newly generated weak classifier to concentrate on them. The boosting algorithm was originally created to deal with binary classification in supervised learning. It is simple to implement, performs feature selection that results in a relatively simple classifier, and has fairly good generalization.</Paragraph>
<Paragraph position="2"> Based on the boosting framework, our system uses the AdaBoost.MH algorithm (Schapire and Singer, 1999) shown in Figure 1, an n-ary classification variant of the original, well-known binary AdaBoost algorithm (Freund and Schapire, 1997). The original AdaBoost algorithm was designed for the binary classification problem and did not fulfill the requirements of the Chinese NER task.</Paragraph>
<Paragraph position="3"> Figure 1: The AdaBoost.MH algorithm. Input: a training set $Tr = \{\langle d_1, C_1\rangle, \ldots, \langle d_g, C_g\rangle\}$ where $C_j \subseteq C = \{c_1, \ldots, c_m\}$ for all $j = 1, \ldots, g$. Output: a final hypothesis $\Phi(d,c) = \sum_{s=1}^{S} \alpha_s \Phi_s(d,c)$. Algorithm: let $D_1(d_j, c_i) = \frac{1}{mg}$ for all $j = 1, \ldots, g$ and all $i = 1, \ldots, m$. For $s = 1, \ldots, S$ do: pass the distribution $D_s(d_j, c_i)$ to the weak classifier; derive the weak hypothesis $\Phi_s$ from the weak classifier; choose $\alpha_s \in \mathbb{R}$; update the distribution and renormalize it to obtain $D_{s+1}$.</Paragraph>
<Paragraph position="4"> AdaBoost.MH has shown its usefulness on standard machine learning tasks through extensive theoretical and empirical studies in which different standard machine learning methods have been used as the weak classifier (e.g., Bauer and Kohavi (1999), Opitz and Maclin (1999), Schapire (2002)). It also performs well on a number of natural language processing problems, including text categorization (e.g., Schapire and Singer (2000), Sebastiani et al. (2000)) and word sense disambiguation (e.g., Escudero et al. (2000)). In particular, it has also been demonstrated that boosting can be used to build language-independent NER models that perform exceptionally well (Wu et al. (2002), Wu et al. (2004), Carreras et al. (2002)).</Paragraph>
<Paragraph position="5"> The weak classifiers used in the boosting algorithm can come from a wide range of machine learning methods. We have chosen to use a simple classifier called a decision stump. A decision stump is basically a one-level decision tree where the split at the root level is based on a specific attribute/value pair. For example, a possible attribute/value pair could be W2 = /.</Paragraph>
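<Paragraph position="6"> To make the procedure in Figure 1 concrete, the following Python sketch implements a bare-bones AdaBoost.MH loop with decision-stump weak learners over attribute/value features. It is an illustration only, not the BoosTexter implementation used in our experiments; the attribute encoding and the single-label simplification (each example has exactly one correct class) are assumptions made for brevity.
import math
from collections import defaultdict

def train_adaboost_mh(examples, labels, label_set, rounds=50):
    """examples: list of dicts mapping attribute -> value; labels: gold class per example."""
    g, m = len(examples), len(label_set)
    # Y[j][c] = +1 if c is the correct label of example j, else -1.
    Y = [{c: (1 if labels[j] == c else -1) for c in label_set} for j in range(g)]
    # Initial distribution D_1(d_j, c_i) = 1 / (m * g).
    D = [{c: 1.0 / (m * g) for c in label_set} for _ in range(g)]
    ensemble = []
    for _ in range(rounds):
        stump, error = best_stump(examples, Y, D, label_set)
        error = min(max(error, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, stump))
        # Reweight: (example, label) pairs the stump gets wrong gain weight.
        z = 0.0
        for j, d in enumerate(examples):
            for c in label_set:
                D[j][c] *= math.exp(-alpha * Y[j][c] * stump_predict(stump, d, c))
                z += D[j][c]
        for j in range(g):
            for c in label_set:
                D[j][c] /= z
    return ensemble

def best_stump(examples, Y, D, label_set):
    """Try every attribute/value test; keep the one with the lowest weighted error."""
    best_found, best_error = None, None
    tests = {(a, v) for d in examples for a, v in d.items()}
    for test in tests:
        votes = defaultdict(float)  # (branch, label) -> weighted agreement with Y
        for j, d in enumerate(examples):
            branch = d.get(test[0]) == test[1]
            for c in label_set:
                votes[(branch, c)] += D[j][c] * Y[j][c]
        preds = {key: (1 if v >= 0 else -1) for key, v in votes.items()}
        error = sum(D[j][c]
                    for j, d in enumerate(examples)
                    for c in label_set
                    if preds.get((d.get(test[0]) == test[1], c), 1) != Y[j][c])
        if best_error is None or error < best_error:
            best_found, best_error = (test, preds), error
    return best_found, best_error

def stump_predict(stump, d, c):
    test, preds = stump
    return preds.get((d.get(test[0]) == test[1], c), 1)

def classify(ensemble, d, label_set):
    # Final hypothesis: Phi(d, c) = sum_s alpha_s * Phi_s(d, c); return the best-scoring class.
    return max(label_set, key=lambda c: sum(a * stump_predict(s, d, c) for a, s in ensemble))
</Paragraph>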
</Section>
<Section position="6" start_page="150" end_page="152" type="metho"> <SectionTitle> 3 Experiment Details </SectionTitle>
<Paragraph position="0"> To implement the boosting/decision stumps, we used the publicly available AT&T BoosTexter software (Schapire and Singer, 2000), which implements boosting on top of decision stumps. For preprocessing we used an off-the-shelf Chinese lexical analysis system, the open-source ICTCLAS (Zhang et al., 2003), to segment and POS tag the training and test corpora.</Paragraph>
<Section position="1" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 3.1 Data Preprocessing </SectionTitle>
<Paragraph position="0"> The training corpora provided by the SIGHAN bakeoff organizers were in the CoNLL two-column format, with one Chinese character per line and hand-annotated named entity chunks in the second column.</Paragraph>
<Paragraph position="1"> In order to provide basic features for training the decision stumps, the training corpora were segmented and POS tagged by ICTCLAS, which labels Chinese words using a set of 39 tags. This module employs a hierarchical hidden Markov model (HHMM) and provides word segmentation, POS tagging and unknown word recognition. It performs reasonably well, with segmentation precision recently evaluated at 97.58%. The recall rate of unknown words using role tagging was over 90%.</Paragraph>
<Paragraph position="2"> We note that about 200 words in each training corpus remained untagged. For these words we simply assigned the most frequently occurring tag in each training corpus.</Paragraph> </Section>
<Section position="2" start_page="151" end_page="152" type="sub_section"> <SectionTitle> 3.2 Feature Set </SectionTitle>
<Paragraph position="0"> The boosting/decision stumps were able to accommodate a large set of features, including the chunk tags (taken from the gold-standard label during training) of the previous two characters.</Paragraph>
<Paragraph position="1"> The chunk tag uses the BIO representation, which was employed in the CoNLL-2002 and CoNLL-2003 evaluations. In this representation, each character is tagged as either the beginning of a named entity (B tag), a character inside a named entity (I tag), or a character outside a named entity (O tag).</Paragraph>
<Paragraph position="2"> We found that conjunction features helped NER performance significantly; the conjunction features used are basically conjunctions of 2 consecutive characters. We also found a larger context window (3 characters instead of 2 before and after the current character) to be quite helpful to performance.</Paragraph>
<Paragraph position="3"> Apart from the training and test corpora, we considered the gazetteers from the LDC, which contain about 540K person, 242K location and 98K organization names. Named entities in the training corpora that appeared in the gazetteers were identified lexically or by using a maximum forward match algorithm. Once named entities have been identified, each character can be annotated with an NE chunk tag, which the boosting learner can view as an additional feature. Here we used binary gazetteer features: if the character was annotated with an NE chunk tag, its gazetteer feature was set to 1; otherwise it was set to 0. However, we found that adding binary gazetteer features did not significantly help performance when conjunction features were used; in fact, it actually hurt the performance slightly.</Paragraph>
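<Paragraph position="4"> As an illustration of how such binary gazetteer features can be derived, the following Python sketch marks every character covered by a maximum forward match against a gazetteer. It is not the matcher actually used in our experiments, and the gazetteer entries in the usage example are made up.
def max_forward_match(chars, gazetteer, max_len=10):
    """Return one binary gazetteer feature per character:
    1 if the character lies inside a gazetteer entry matched
    greedily (longest first) from left to right, else 0."""
    features = [0] * len(chars)
    i = 0
    while i < len(chars):
        matched = 0
        # Try the longest candidate first (maximum forward match).
        for length in range(min(max_len, len(chars) - i), 0, -1):
            if "".join(chars[i:i + length]) in gazetteer:
                matched = length
                break
        if matched:
            for k in range(i, i + matched):
                features[k] = 1
            i += matched
        else:
            i += 1
    return features

# Hypothetical usage with two made-up gazetteer entries.
gazetteer = {"北京", "北京大学"}
print(max_forward_match(list("我在北京大学工作"), gazetteer))
# -> [0, 0, 1, 1, 1, 1, 0, 0]
</Paragraph>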
<Paragraph position="5"> The features used in the final experiments were the following (a sketch of the corresponding feature extraction is given after the list):
* The current character and its POS tag.
* The characters within a window of 3 characters before and after the current character.
* The POS tags within a window of 3 characters before and after the current character.
* A small set of conjunctions of POS tags and characters within a window of 3 characters of the current character.
* The BIO chunk tags of the previous 3 characters.</Paragraph>
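<Paragraph position="6"> The following Python sketch shows how such a per-character feature vector might be assembled before being handed to the boosting learner. It is illustrative only: the attribute names (W-3 ... W+3, P-3 ... P+3, the two-character conjunction template, and T-1 ... T-3 for the previous chunk tags) are hypothetical, and the actual attribute encoding expected by BoosTexter may differ.
def extract_features(chars, pos_tags, chunk_tags, i, window=3):
    """Build the feature dict for the character at position i."""
    feats = {"W0": chars[i], "P0": pos_tags[i]}  # current character and its POS tag
    n = len(chars)
    # Characters and POS tags within a window of 3 before and after.
    for off in range(-window, window + 1):
        if off != 0 and 0 <= i + off < n:
            feats["W%+d" % off] = chars[i + off]
            feats["P%+d" % off] = pos_tags[i + off]
    # A small set of conjunctions: pairs of consecutive characters in the window.
    for off in range(-window, window):
        if 0 <= i + off and i + off + 1 < n:
            feats["W%+d_W%+d" % (off, off + 1)] = chars[i + off] + chars[i + off + 1]
    # BIO chunk tags of the previous 3 characters (gold tags during training,
    # previously predicted tags at test time).
    for back in range(1, 4):
        if i - back >= 0:
            feats["T-%d" % back] = chunk_tags[i - back]
    return feats
</Paragraph> </Section> </Section> </Paper>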