<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1809">
  <Title>Corpus-Based Pinyin Name Resolution</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> For readers of English text who know some Chinese, Pinyin codes that spell out Chinese names are often ambiguous as to their original Chinese characters when the names are new or not well known. For English-Chinese cross-language retrieval, failure to accurately translate Pinyin names in a query into Chinese characters can lead to dismal retrieval effectiveness. This paper presents an approach that extracts Pinyin names from English text, suggests translations for them using a database of names and their characters with usage probabilities, and then applies IR techniques, with a corpus as a disambiguation tool, to resolve the translation candidates.</Paragraph>
    <Paragraph position="1"> Introduction It is important for many applications to be able to identify and extract person names in text. For English, an initial capital letter is an important clue for spotting names, in addition to other contextual clues. When an English story refers to a foreign person, it is relatively easy to represent the person's name if the alphabets of the two languages correspond approximately. When it refers to a Chinese person, this is not possible because the Chinese language does not use an alphabet. The most popular method for this purpose is Pinyin coding (see, for example, the conversion project at the Library of Congress website (2002)), China's official method of using English letters to spell out Chinese character pronunciations according to the Beijing Putonghua convention. Chinese characters are monosyllabic, and the large majority of them have one sound (ignoring tones) and hence one code. However, a given Pinyin code usually maps to multiple characters. Such an English Pinyin name raises ambiguity about the original Chinese characters that it refers to, and hence about the person. If the name is well known, such as Mao ZeDong, this is not an issue; if the name is less frequently seen, one would like to see or confirm the actual Chinese characters.</Paragraph>
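    <Paragraph> The one-to-many mapping from Pinyin codes to characters can be illustrated with a minimal sketch. The table `SYLLABLE_TO_CHARS` and the function `candidate_names` are hypothetical names introduced only for this illustration, and the table lists just a few of the characters that share each toneless sound; it is not the database used in the paper.

```python
from itertools import product

# Hypothetical toy table (an assumption for illustration only):
# toneless Pinyin syllable -> some characters that share that sound.
# A realistic table would list many more characters per syllable.
SYLLABLE_TO_CHARS = {
    "mao": ["毛", "茅"],
    "ze": ["泽", "则"],
    "dong": ["东", "冬", "董"],
}


def candidate_names(syllables):
    """Enumerate every character sequence consistent with the syllables."""
    options = [SYLLABLE_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*options)]


candidates = candidate_names(["mao", "ze", "dong"])
print(len(candidates))          # number of candidate character sequences
print("毛泽东" in candidates)    # the well-known name is one of them
```

Even with this tiny table, a three-syllable name yields 2 * 2 * 3 = 12 candidate character sequences, which suggests why a less familiar name cannot be resolved from its Pinyin alone.</Paragraph>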
    <Paragraph position="2"> The situation is similar to that of many Chinese word processing systems that use Pinyin as one of their input methods. When a Pinyin is typed (sometimes with tonal denotation), many candidate characters are displayed for the user to select from. The character list can be ordered based on a language model, as in Chen &amp; Lee (2000), or on the user's past habits. When names are the input, however, a language model is not as helpful because practically any character combination is possible for names.</Paragraph>
    <Paragraph position="3"> Pinyin names also present difficulties in a cross-language information retrieval (CLIR) scenario. Here, an English query is given to retrieve Chinese documents, and Pinyin names could be present as part of the query. In general, there are three approaches to CLIR, as discussed in Grefenstette (1998): translate the Chinese documents to English and do retrieval matching in English; translate the English query to Chinese and do matching in Chinese; or translate both to an intermediate representation. With the first approach, one could use standard table lookup to map the characters of a Chinese name to Pinyin after identifying a name for extraction.</Paragraph>
    <Paragraph position="4"> Chen and Bai (1998) and Sun et al. (1994) have shown that this extraction process is not trivial since Chinese writing has no white space to delimit names or words. A more general difficulty is that the document collection may not be under a user's control, but available for retrieval purposes only. This makes translating the documents to the query language (or to an intermediate language) unsuitable. A more flexible approach is to translate a query to Chinese and do retrieval in Chinese. This has been the more popular method for CLIR in TREC experiments; see Voorhees and Harman (2001). Whichever translation direction one chooses, a bilingual dictionary is essential. This dictionary, however, can be expected to be incomplete, especially with respect to person names.</Paragraph>
    <Paragraph position="5"> Missing translations can adversely affect CLIR effectiveness. This raises the question of how to render Pinyin names into Chinese characters for translingual retrieval purposes.</Paragraph>
    <Paragraph position="6"> In the recent NTCIR-2 English-Chinese cross-language experiments (Eguchi et al., 2001), quite a few queries contain names. Kwok (2001) found that these lead to good monolingual retrieval because the names are quite specific and have good retrieval properties. On the other hand, for CLIR that starts with English queries, not being able to translate Pinyin names correctly leads to a substantial deficit in effectiveness. This makes comparisons with monolingual results particularly dismal.</Paragraph>
    <Paragraph position="7"> In this paper, we propose an approach to resolve the characters from a Pinyin code. It is based on: 1) a rule-based procedure to extract Pinyin codes for Chinese person names in English text; 2) a database for proposing candidate Chinese character sequences for a Pinyin code based on usage probabilities; and 3) a target collection and IR techniques as a confirmation tool for resolving or narrowing down the proposed candidates. These are described in Sections 1, 2, and 3 respectively.</Paragraph>
    <Paragraph position="8"> Section 4 presents some CLIR results and a measure of the effectiveness of our procedures.</Paragraph>
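    <Paragraph> The three components listed above can be sketched end to end as follows. This is a rough illustration under stated assumptions, not the procedure of the paper: the toy database `NAME_DB` and its probabilities are invented for the example, the extraction rule (runs of capitalized words that split entirely into known syllables) is a simplified stand-in for the rule-based procedure, and plain document-frequency counting stands in for the IR techniques. All function names are hypothetical.

```python
import re
from itertools import product
from math import prod

# Hypothetical toy database (assumed values, for illustration only):
# syllable -> {character: usage probability of that character in names}.
NAME_DB = {
    "mao": {"毛": 0.8, "茅": 0.2},
    "ze": {"泽": 0.7, "则": 0.3},
    "dong": {"东": 0.6, "冬": 0.1, "董": 0.3},
}


def split_syllables(token):
    """Split a lowercase token into known syllables, or return None."""
    if not token:
        return []
    for end in range(len(token), 0, -1):  # try the longest prefix first
        head = token[:end]
        if head in NAME_DB:
            rest = split_syllables(token[end:])
            if rest is not None:
                return [head] + rest
    return None


def extract_pinyin_names(text):
    """Step 1 (simplified rule): collect runs of capitalized words
    that split entirely into known Pinyin syllables."""
    names, current = [], []
    for word in re.findall(r"[A-Za-z]+", text):
        syllables = split_syllables(word.lower()) if word[0].isupper() else None
        if syllables:
            current.extend(syllables)
        else:
            if current:
                names.append(current)
            current = []
    if current:
        names.append(current)
    return names


def ranked_candidates(syllables, top_n=5):
    """Step 2: score each character sequence by the product of the
    per-character usage probabilities and keep the top_n."""
    scored = []
    for chars in product(*(NAME_DB[s].items() for s in syllables)):
        name = "".join(c for c, _ in chars)
        scored.append((name, prod(p for _, p in chars)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def confirm_with_corpus(candidates, documents):
    """Step 3 (stand-in for IR techniques): keep candidates that occur
    in the target collection, ordered by document frequency."""
    hits = []
    for name, _score in candidates:
        df = sum(1 for doc in documents if name in doc)
        if df > 0:
            hits.append((name, df))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


text = "Chairman Mao ZeDong visited the province."
corpus = ["毛泽东主席发表讲话。", "今年冬天很冷。"]  # made-up target collection
for syllables in extract_pinyin_names(text):
    print(confirm_with_corpus(ranked_candidates(syllables), corpus))
```

In this sketch the database narrows the search to a handful of likely candidates, and the corpus is consulted only to confirm which of them are actually attested in the target collection.</Paragraph>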
    <Paragraph position="9"> We would like to stress that even if one obtains the correct Chinese characters for a Pinyin, they can still refer to different persons with the same name. We do not address this issue here.</Paragraph>
  </Section>
</Paper>