<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0124"> <Title>Boosting for Chinese Named Entity Recognition</Title> <Section position="4" start_page="0" end_page="150" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Named entity recognition (NER), which includes the identification and classification of certain proper nouns, such as person names, organizations, locations, temporal, numerical and monetary phrases, plays an important part in many natural language processing applications, such as machine translation, information retrieval, information extraction and question answering.</Paragraph> <Paragraph position="1"> Much of the NER research was pioneered in the MUC/DUC and Multilingual Entity Task (MET) evaluations, as a result of which significant progress has been made and many NER systems of fairly high accuracy have been constructed. [Footnote: This work was supported in part by DARPA GALE contract HR0011-06-C-0023, and by the Hong Kong Research Grants Council (RGC) research grants RGC6083/99E, RGC6256/00E, and DAG03/04.EG09.]</Paragraph> <Paragraph position="2"> In addition, the shared tasks of CoNLL-2002 and CoNLL-2003 helped spur the development toward more language-independent NER systems, by evaluating four types of entities (people, locations, organizations and names of miscellaneous entities) in English, German, Dutch and Spanish.</Paragraph> <Paragraph position="3"> However, these are all European languages, and Chinese NER appears to be significantly more challenging in a number of important respects.</Paragraph> <Paragraph position="4"> We believe some of the main reasons to be as follows: (1) Unlike European languages, Chinese lacks capitalization information which plays a very important role in identifying named entities. (2) There is no space between words in Chinese, so ambiguous segmentation interacts with NER decisions. Consequently, segmentation errors will affect the NER performance, and vice versa.
(3) Unlike European languages, Chinese allows an open vocabulary for proper names of persons, eliminating another major source of explicit clues used by European language NER models.</Paragraph> <Paragraph position="5"> This paper presents a system that introduces boosting to Chinese named entity identification and classification. Our primary aim was to conduct a controlled experiment to test how well the boosting-based models we designed for European languages would fare on Chinese, without major modeling alterations to accommodate Chinese. We evaluated the system using data from the third SIGHAN Chinese language processing bakeoff, the goal of which was to perform NER on three types of named entities: PERSON, LOCATION and ORGANIZATION.[1] Three training corpora from MSRA, CityU and LDC were given.</Paragraph> <Paragraph position="6"> The MSRA and LDC corpora were simplified Chinese texts, while the CityU corpus was traditional Chinese. [Footnote 1: Except in the LDC corpus, which contains four types of entities: PERSON, LOCATION, ORGANIZATION and GEOPOLITICAL.]</Paragraph> <Paragraph position="7"> In addition, the competition specified open and closed tests. In the open test, participants may use any other material besides the given training corpora, including material from other training corpora, proprietary dictionaries, and material from the Web. In the closed test, participants can only use the three training corpora. No other material or knowledge is allowed, including part-of-speech (POS) information, externally generated word-frequency counts, Arabic and Chinese numbers, feature characters for place names, common Chinese surnames, and so on.</Paragraph> <Paragraph position="8"> The approach we used is based on selecting a number of features, which are used to train several weak classifiers.
The weak classifiers are then combined into a strong classifier using boosting, which has been shown to perform well on other NLP problems and is a theoretically well-founded method.</Paragraph> </Section> </Paper>
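<!-- Editorial sketch, not part of the original paper. The introduction states that weak classifiers trained on selected features are combined by boosting into a strong classifier, but this section does not name the boosting variant, the weak learner, or the feature set. As an illustration only, the block below assumes discrete AdaBoost over single-feature threshold stumps. The names train_adaboost and predict, and the toy data, are hypothetical and do not come from the paper.

```python
import math


def train_adaboost(X, y, rounds=3):
    """Discrete AdaBoost over one-feature threshold stumps (assumed variant).

    X: list of feature vectors (lists of numbers); y: labels, each +1 or -1.
    Returns a list of (alpha, feature_index, threshold, polarity) tuples.
    """
    n = len(X)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        best = None
        # Weak learner: exhaustively pick the stump with the lowest weighted error.
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                for pol in (1, -1):
                    preds = [pol if row[j] >= thr else -pol for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, preds)
        err, j, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, j, thr, pol))
        # Raise the weight of misclassified examples, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, p, t in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Strong classifier: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if row[j] >= thr else -pol)
                for alpha, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single threshold separates these labels, but three stumps can.
toy_X = [[0], [1], [2], [3], [4], [5]]
toy_y = [1, 1, -1, -1, 1, 1]
toy_model = train_adaboost(toy_X, toy_y, rounds=3)
toy_predictions = [predict(toy_model, row) for row in toy_X]
```

The stump learner here stands in for whatever weak learner the system actually uses. In an NER setting each row would presumably hold features of a token or character and each label an entity decision, which is likewise an assumption and is not specified in this section.
-->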