File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-1501_intro.xml

Size: 5,460 bytes

Last Modified: 2025-10-06 14:01:59

<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1501">
  <Title>Learning Formulation and Transformation Rules for Multilingual Named Entities</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Named entities are major components of a document. Capturing named entities is fundamental to understanding documents (MUC, 1998). Several approaches have been proposed to recognize these types of terms. For example, corpus-based methods are employed to extract Chinese personal names, and rule-based methods are used to extract Chinese date/time expressions and monetary and percentage expressions (Chen and Lee, 1996; Chen, Ding and Tsai, 1998). In the past, named entity extraction mainly focused on general domains and was applied to various tasks such as information retrieval (Chen, Ding and Tsai, 1998) and question answering (Lin, et al., 2001). Recently, several attempts have been made to extend it to mining knowledge from biomedical documents (Hirschman, et al., 2002).</Paragraph>
    <Paragraph position="1"> Most of the previous approaches dealt with monolingual named entity extraction. Chen et al. (1998) extended it to cross-language information retrieval. A grapheme-based model was proposed to compute the similarity between a Chinese transliterated name and an English name. Lin and Chen (2000) further classified the work into two directions, namely forward transliteration (Wan and Verspoor, 1998) and backward transliteration (Chen et al., 1998; Knight and Graehl, 1998), and proposed a phoneme-based model. Lin and Chen (2002) employed a machine learning approach to determine phonetic similarity scores for machine transliteration. Al-Onaizan and Knight (2002) investigated the translation of Arabic named entities to English using monolingual and bilingual resources.</Paragraph>
    <Paragraph position="2"> Past work on multilingual named entities has emphasized transliteration issues. However, the transformation between named entities in different languages is not a matter of transliteration only. The mapping may be a combination of meaning translation and/or phoneme transliteration. The following five English-Chinese examples illustrate this issue. The notation A = B denotes that a foreign name A is translated and/or transliterated into a Chinese name B.</Paragraph>
    <Paragraph position="3">  (s1) Victoria Fall = Wei Duo Li Ya Pu Bu (wei duo li ya pu bu) (s2) Little Rocky Mountains = Xiao Luo Ji Shan Mo (xiao luo ji shan mo) (s3) Great Salt Lake = Da Yan Hu (da yan hu) (s4) Kenmare = Kang Mei Er (kang mei er) (s5) East Chicago = Dong Zhi Jia Ge (dong zhi jia ge)  Example (s1) shows that the name part (i.e., Victoria) and the keyword part (i.e., Fall) of a location name are transliterated and translated into &quot;Wei Duo Li Ya&quot; (wei duo li ya) and &quot;Pu Bu&quot; (pu bu), respectively. In Example (s2), the keyword part (i.e., Mountains) is still translated, i.e., &quot;Shan Mo&quot; (shan mo); however, one part of the name is translated (i.e., Little = &quot;Xiao&quot; (xiao)) and another part is transliterated (i.e., Rocky = &quot;Luo Ji&quot; (luo ji)). Example (s3) shows an extreme case: all three words are translated (i.e., Great = &quot;Da&quot; (da), Salt = &quot;Yan&quot; (yan), Lake = &quot;Hu&quot; (hu)). Examples (s4) and (s5) show two location names without keywords. The former is transliterated and the latter is a combination of transliteration and translation.</Paragraph>
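    The segment-level view in examples (s1)-(s5) can be written down as a small data structure. The following Python sketch is only an illustration and is not taken from the paper: the names Segment, EXAMPLES, mapping_type and render are hypothetical, the targets are the pinyin strings given above, and the split of (s5) into East and Chicago is an inference, since the text only states that (s5) combines both operations.

```python
from dataclasses import dataclass

TRANSLATE = "translation"          # the meaning is carried over
TRANSLITERATE = "transliteration"  # the sound is carried over


@dataclass(frozen=True)
class Segment:
    source: str  # English word of the name
    target: str  # pinyin of the Chinese rendering, as given in the text
    op: str      # TRANSLATE or TRANSLITERATE


# Labels for (s1)-(s4) follow the explanation in the text.
EXAMPLES = {
    "s1": [Segment("Victoria", "wei duo li ya", TRANSLITERATE),
           Segment("Fall", "pu bu", TRANSLATE)],
    "s2": [Segment("Little", "xiao", TRANSLATE),
           Segment("Rocky", "luo ji", TRANSLITERATE),
           Segment("Mountains", "shan mo", TRANSLATE)],
    "s3": [Segment("Great", "da", TRANSLATE),
           Segment("Salt", "yan", TRANSLATE),
           Segment("Lake", "hu", TRANSLATE)],
    "s4": [Segment("Kenmare", "kang mei er", TRANSLITERATE)],
    # Assumption: the per-word split of (s5) is inferred, not stated in the text.
    "s5": [Segment("East", "dong", TRANSLATE),
           Segment("Chicago", "zhi jia ge", TRANSLITERATE)],
}


def mapping_type(segments):
    """Return the single operation used by all segments, or 'mixed'."""
    ops = {seg.op for seg in segments}
    if len(ops) == 1:
        return next(iter(ops))
    return "mixed"


def render(segments):
    """Join the target sides of the segments in source order."""
    return " ".join(seg.target for seg in segments)
```

    Under these assumptions, mapping_type separates the pure cases (s3) and (s4) from the mixed cases (s1), (s2) and (s5), and render reproduces the pinyin strings listed in the examples.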
    <Paragraph position="4"> Which part is translated and which part is transliterated depends on the type of named entity. For example, personal names tend to be transliterated. For a location name, the name part and the keyword part are usually transliterated and translated, respectively. Organization names are quite different: most of their constituents are translated. Besides the issue of named entity types, different language pairs have different transformation rules. German named entities pose a decompounding problem when they are translated/transliterated, e.g., Bundesbahn = &quot;Lian Bang Tie Lu Ju&quot; (lian bang tie lu ju) and Bundesbank = &quot;Lian Bang Yin Hang&quot; (lian bang yin hang). This paper studies how languages and named entity types affect the choice between translation and transliteration. We focus only on three more challenging types of named entities, i.e., named people, named locations and named organizations.</Paragraph>
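    To make the decompounding problem concrete, the sketch below splits a German compound into known components before mapping each component. It is a minimal illustration under stated assumptions, not the method of this paper: LEXICON is a hypothetical toy lexicon whose entries are inferred from the two examples above, and decompound and render_compound are illustrative names. The splitter tries the longest known prefix first and backtracks if the remainder cannot be split.

```python
# Hypothetical toy lexicon: German component to pinyin of a Chinese rendering.
# The splits "Bundes" + "bahn" and "Bundes" + "bank" are inferred from the
# two examples in the text; a real system would need a far larger lexicon.
LEXICON = {
    "bundes": "lian bang",
    "bahn": "tie lu ju",
    "bank": "yin hang",
}


def decompound(word, lexicon):
    """Split word into lexicon components, longest prefix first.

    Returns a list of components, or None if no complete split exists.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = decompound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


def render_compound(word, lexicon):
    """Map each component of a compound and join the results, or return None."""
    parts = decompound(word, lexicon)
    if parts is None:
        return None
    return " ".join(lexicon[part] for part in parts)
```

    With this toy lexicon, render_compound("Bundesbahn", LEXICON) yields the pinyin string given above for Bundesbahn, while a compound with an unknown component, such as "Bundestag", has no complete split and yields None.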
    <Paragraph position="5"> Three phrase-aligned corpora are used, namely a multilingual personal name corpus and a multilingual organization name corpus compiled by the Central News Agency (hereafter the CNA personal name and organization name corpora), and a multilingual location name corpus compiled by the National Institute for Compilation and Translation of Taiwan (hereafter the NICT location name corpus). We extract transliteration/translation rules from these multilingual named entity corpora. This paper is organized as follows. Section 2 introduces the corpora used. Section 3 shows how to extract the formulation rules and the transformation rules.</Paragraph>
    <Paragraph position="6"> Section 4 analyzes the results. Section 5 demonstrates the application of the extracted rules to cross-language information retrieval. Section 6 gives concluding remarks.</Paragraph>
  </Section>
</Paper>