<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1031"> <Title>A New Statistical Approach to Chinese Pinyin Input</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Chinese input is one of the most difficult problems for Chinese PC users. There are two main categories of Chinese input methods. One is the shape-based input method, such as &quot;wu bi zi xing&quot;; the other is the Pinyin, or pronunciation-based, input method, such as &quot;Chinese CStar&quot; and &quot;MSPY&quot;. Because it is easy to learn and use, Pinyin is the most popular Chinese input method. Over 97% of the users in China use Pinyin for input (Chen Yuan 1997). Despite its advantages, the Pinyin input method suffers from several problems, including Pinyin-to-character conversion errors, user typing errors, and UI problems such as the need for two separate modes for typing Chinese and English.</Paragraph> <Paragraph position="1"> A Pinyin-based method automatically converts Pinyin to Chinese characters. However, there are only about 406 syllables, and they correspond to over 6000 common Chinese characters. It is therefore very difficult for a system to select the correct Chinese characters automatically. Higher accuracy may be achieved using sentence-based input.</Paragraph> <Paragraph position="2"> A sentence-based input method chooses characters by using a language model based on context, so its accuracy is higher than that of a word-based input method. In this paper, all the technology is based on the sentence-based input method, but it can easily be adapted to a word-based input method.</Paragraph> <Paragraph position="3"> In our approach we use a statistical language model to achieve very high accuracy. We design a unified approach to Chinese statistical language modelling. 
This unified approach enhances trigram-based statistical language modelling with automatic, maximum-likelihood-based methods to segment words, select the lexicon, and filter the training data. Compared to the commercial product, our system has an error rate up to 50% lower at the same memory size, and is about 76% better without any memory limit (Jianfeng et al.</Paragraph> <Paragraph position="4"> 2000).</Paragraph> <Paragraph position="5"> However, sentence-based input methods have problems of their own. One is that the system assumes the user's input is perfect. In reality, users make many typing errors, and these cause many system errors. Another problem is that, to type both English and Chinese, the user has to switch between two modes, which is cumbersome. In this paper, a new typing model is proposed to solve these problems. The system accepts correct typing but also tolerates common typing errors. Furthermore, the typing model is combined with a probabilistic spelling model for English, which measures how likely the input sequence is to be an English word. Both models can run in parallel, guided by a Chinese language model, to output the most likely sequence of Chinese and/or English characters.</Paragraph> <Paragraph position="6"> The organization of this paper is as follows. In the second section, we briefly discuss the Chinese language model used by the sentence-based input method. In the third section, we introduce a typing model to deal with typing errors made by the user. In the fourth section, we propose a spelling model for English, which discriminates between Pinyin and English. Finally, we give some conclusions.</Paragraph> </Section> </Paper>