<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3248"> <Title>A New Approach for English-Chinese Named Entity Alignment</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper addresses Named Entity (NE) alignment in a bilingual corpus, that is, building an alignment between each source-language NE and its translation in the target language. Research has shown that NEs carry essential information in human language (Hobbs et al., 1996). Aligning bilingual NEs is an effective way to extract an NE translation list and translation templates. For example, in the following sentence pair, aligning the NEs [Zhi Chun road] and [Zhi Chun Lu] can produce a translation</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> Language Information Retrieval (CLIR). </SectionTitle> <Paragraph position="0"> NE alignment, however, is not easy to obtain. It requires that both Named Entity Recognition (NER) and alignment be handled correctly. (Footnote: The work was done while the first author was visiting Microsoft Research Asia.)</Paragraph> <Paragraph position="1"> NEs may not be well recognized, or only parts of them may be recognized during NER. When aligning bilingual NEs, we need to handle many-to-many alignments. The inconsistency of NE translation and NER across languages is also a major problem. Specifically, in Chinese NE processing, since Chinese text is not delimited into words, previous work (Huang et al., 2003) typically performs word segmentation first and then identifies Named Entities. This introduces several problems for Chinese NEs, such as word segmentation errors, misidentified Chinese NE boundaries, and mis-tagged Chinese NEs. For example, &quot;Guo Fang Bu Chang&quot; in Chinese is really one unit and should not be segmented as [ON Guo Fang Bu]/Chang.
The errors from word segmentation and NER will propagate into NE alignment.</Paragraph> <Paragraph position="2"> In this paper, we propose a novel approach that uses a maximum entropy model to carry out English-Chinese NE alignment. NEs in English are first recognized by NER tools. We then investigate NE translation features to identify NEs in Chinese and determine the most probable alignment. To ease the training of the maximum entropy model, bootstrapping is used to help supervised learning.</Paragraph> <Paragraph position="3"> In addition, to avoid error propagation from word segmentation and NER, we extract Chinese NEs and perform the alignment directly from plain text, without word segmentation. This differs from previous work reported in the literature. Although this makes the task more difficult, it greatly reduces the chance of errors introduced by earlier steps and therefore produces much better performance on our task.</Paragraph> <Paragraph position="4"> To justify our approach, we adopt traditional alignment approaches, in particular IBM Model 4 (Brown et al., 1993) and HMM (Vogel et al., 1996), to carry out NE alignment as our baseline systems. Experimental results show that in this task our approach outperforms IBM Model 4 and HMM significantly. (Footnote: We only discuss NEs of three categories: Person Name (PN), Location Name (LN), and Organization Name (ON).)</Paragraph> <Paragraph position="5"> Furthermore, the performance without word segmentation is much better than that with word segmentation.</Paragraph> <Paragraph position="6"> The rest of this paper is organized as follows. In Section 2, we discuss related work on NE alignment. Section 3 gives the overall framework of NE alignment with our maximum entropy model and explains the feature functions and bootstrapping procedures. We show experimental results and compare them with baseline systems in Section 4.
Section 5 concludes the paper and discusses ongoing and future work.</Paragraph> </Section> </Section> </Paper>