<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2010"> <Title>Applying a Mix Word-Pair Identifier to the Chinese Syllable-to-Word Conversion Problem</Title> <Section position="2" start_page="0" end_page="55" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Currently, the most popular method for Chinese input is phonetic and pinyin based, because Chinese people are taught in primary school to write the phonetic and pinyin syllables corresponding to each Chinese character and word.</Paragraph> <Paragraph position="1"> In Chinese, each character corresponds to at least one syllable, and each word can be a mono-syllabic word, such as &quot;Shu (mouse)&quot;, a bi-syllabic word, such as &quot;Dai Shu (kangaroo)&quot;, or a multi-syllabic word, such as &quot;Mi Lao Shu (Mickey mouse).&quot; Although there are more than 13,000 distinct Chinese characters (of which 5,400 are commonly used), there are only about 1,300 distinct syllables. Since the problem space of syllable-to-word (STW) conversion is much smaller than that of syllable-to-character (STC) conversion, most existing Chinese input systems (Hsu 1994, Hsu et al.</Paragraph> <Paragraph position="2"> 1999, Tsai and Hsu 2002, Gao et al. 2002, MSIME) focus on STW conversion.</Paragraph> <Paragraph position="3"> Conventionally, there are two approaches to STW conversion: (1) the linguistic approach, based on syntax parsing, semantic template matching and contextual information (Hsu 1994, Fu et al. 1996, Hsu et al. 1999, Kuo 1995, Tsai and Hsu 2002); and (2) the statistical approach, based on n-gram models where n is usually 2 or 3 (Lin and Tsai 1987, Gu et al. 1991, Fu et al.</Paragraph> <Paragraph position="4"> 1996, Ho et al. 1997, Sproat 1990, Gao et al.</Paragraph> <Paragraph position="5"> 2002, Lee 2003). 
Although the linguistic approach requires considerable effort in designing effective syntax rules, semantic templates or contextual information, it makes it easier than the statistical approach to understand why a system makes a mistake (Hsu 1994, Tsai and Hsu 2002). On the other hand, the statistical language model (SLM) used in the statistical approach requires less effort and has been widely adopted in commercial Chinese input systems (Gao et al. 2002, Lee 2003).</Paragraph> <Paragraph position="6"> According to (Fong and Chung 1994, Tsai and Hsu 2002), homophone selection and syllable-word segmentation are two critical problems in Chinese STW conversion. Incorrect homophone selection and failed syllable-word segmentation directly affect the STW conversion rate. The goal of this study is to illustrate the effectiveness of specific word-pairs and common word-pairs in resolving homonym/segmentation ambiguities during Chinese STW conversion. In this paper, we use &quot;tonal&quot; to refer to syllables marked with one of the four tones, such as &quot;ji4(Ji )shu4(Shu )&quot;, and &quot;toneless&quot; to refer to syllables without tone marks, such as &quot;ji(Ji ) shu(Shu ).&quot; The remainder of this paper is arranged as follows. In Section 2, we first propose a method for automatically generating the specific word-pairs and the common word-pairs from given Chinese sentences. Then, we develop a mix word-pair (mix-WP) identifier that includes a specific word-pair identifier and a common word-pair identifier. The mix-WP identifier is based on pre-collected datasets of specific and common word-pairs. In Section 3, we present our STW experiment results. Finally, in Section 4, we give our conclusions and suggest some future research directions.</Paragraph> </Section> </Paper>