<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1031">
  <Title>A New Statistical Approach to Chinese Pinyin Input</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Chinese Language Model
</SectionTitle>
    <Paragraph position="0"> Pinyin input is the most popular form of text input in Chinese. Basically, the user types a phonetic spelling with optional spaces, like: woshiyigezhongguoren And the system converts this string into a string of Chinese characters, like:</Paragraph>
    <Paragraph position="2"> A sentence-based input method chooses the probable Chinese word according to the context. In our system, statistical language model is used to provide adequate information to predict the probabilities of hypothesized Chinese word sequences.</Paragraph>
    <Paragraph position="3"> In the conversion of Pinyin to Chinese character, for the given Pinyin P , the goal is to find the most probable Chinese character H , so as to maximize )|Pr( PH . Using Bayes law, we have:</Paragraph>
    <Paragraph position="5"> The problem is divided into two parts, typing model )|Pr( HP and language model )Pr(H .</Paragraph>
    <Paragraph position="6"> Conceptually, all H 's are enumerated, and the one that gives the largest ),Pr( PH is selected as the best Chinese character sequence. In practice, some efficient methods, such as Viterbi Beam Search (Kai-Fu Lee 1989; Chin-hui Lee 1996), will be used.</Paragraph>
    <Paragraph position="7"> The Chinese language model in equation 2.1, )Pr(H measures the a priori probability of a Chinese word sequence. Usually, it is determined by a statistical language model (SLM), such as Trigram LM. )|Pr( HP , called typing model, measures the probability that a Chinese word H is typed as Pinyin P .</Paragraph>
    <Paragraph position="8"> Usually, H is the combination of Chinese words, it can decomposed into</Paragraph>
    <Paragraph position="10"> character. So typing model can be rewritten as equation 2.2.</Paragraph>
    <Paragraph position="12"> The most widely used statistical language model is the so-called n-gram Markov models (Frederick 1997). Sometimes bigram or trigram is used as SLM. For English, trigram is widely used. With a large training corpus trigram also works well for Chinese. Many articles from newspapers and web are collected for training. And some new filtering methods are used to select balanced corpus to build the trigram model. Finally, a powerful language model is obtained. In practice, perplexity (Kai-Fu Lee 1989; Frederick 1997) is used to evaluate the SLM, as equation 2.3.</Paragraph>
    <Paragraph position="14"> where N is the length of the testing data. The perplexity can be roughly interpreted as the geometric mean of the branching factor of the document when presented to the language model. Clearly, lower perplexities are better.</Paragraph>
    <Paragraph position="15"> We build a system for cross-domain general trigram word SLM for Chinese. We trained the system from 1.6 billion characters of training data. We evaluated the perplexity of this system, and found that across seven different domains, the average per-character perplexity was 34.4. We also evaluated the system for Pinyin-to-character conversion.</Paragraph>
    <Paragraph position="16"> Compared to the commercial product, our system is up to 50% lower in error rate at the same memory size, and about 76% better without memory limits at all. (JianFeng etc.</Paragraph>
    <Paragraph position="17"> 2000)</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Spelling Correction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Typing Errors
</SectionTitle>
      <Paragraph position="0"> The sentence-based approach converts Pinyin into Chinese words. But this approach assumes correct Pinyin input. Erroneous input will cause errors to propagate in the conversion. This problem is serious for Chinese users because:  1. Chinese users do not type Pinyin as frequently as American users type English. 2. There are many dialects in China. Many people do not speak the standard Mandarin Chinese dialect, which is the origin of Pinyin. For example people in the southern area of China do not distinguish 'zh'-'z', 'sh'-'s', 'ch'-'c', 'ng'-'n', etc.</Paragraph>
      <Paragraph position="1"> 3. It is more difficult to check for errors  while typing Pinyin for Chinese, because Pinyin typing is not WYSIWYG. Preview experiments showed that people usually do not check Pinyin for errors, but wait until the Chinese characters start to show up.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Spelling Correction
</SectionTitle>
      <Paragraph position="0"> In traditional statistical Pinyin-to-characters</Paragraph>
      <Paragraph position="2"> mentioned in equation 2.2, is usually set to 1 if</Paragraph>
      <Paragraph position="4"> and 0 if it is not. Thus, these systems rely exclusively on the language model to carry out the conversion, and have no tolerance for any variability in Pinyin input. Some systems have the &amp;quot;southern confused pronunciation&amp;quot; feature to deal with this problem. But this can only address a small fraction of typing errors because it is not data-driven (learned from real typing errors). Our solution trains the probability of )|Pr( )( iif wP from a real corpus.</Paragraph>
      <Paragraph position="5"> There are many ways to build typing models. In theory, we can train all possible</Paragraph>
      <Paragraph position="7"> wP , but there are too many parameters to train. In order to reduce the number of parameters that we need to train, we consider only single-character words and map all characters with equivalent pronunciation into a single syllable. There are about 406 syllables in Chinese, so this is essentially training: ) |Pr( SyllableStringPinyin , and then mapping each character to its corresponding syllable.</Paragraph>
      <Paragraph position="8"> According to the statistical data from psychology (William 1983), most frequently errors made by users can be classified into the following types: 1. Substitution error: The user types one key instead of another key. This error is mainly caused by layout of the keyboard.</Paragraph>
      <Paragraph position="9"> The correct character was replaced by a character immediately adjacent and in the same row. 43% of the typing errors are of this type. Substitutions of a neighbouring letter from the same column (column errors) accounted for 15%. And the substitution of the homologous (mirrorimage) letter typed by the same finger in the same position but the wrong hand, accounted for 10% of the errors overall  (William 1983).</Paragraph>
      <Paragraph position="10"> 2. Insertion errors: The typist inserts some keys into the typing letter sequence. One reason of this error is the layout of the keyboard. Different dialects also can result in insertion errors.</Paragraph>
      <Paragraph position="11"> 3. Deletion errors: some keys are omitted while typing.</Paragraph>
      <Paragraph position="12"> 4. Other typing errors, all errors except the  errors mentioned before. For example, transposition errors which means the reversal of two adjacent letters.</Paragraph>
      <Paragraph position="13"> We use models learned from psychology, but train the model parameters from real data, similar to training acoustic model for speech recognition (Kai-Fu Lee 1989). In speech recognition, each syllable can be represented as a hidden Markov model (HMM). The pronunciation sample of each syllable is mapped to a sequence of states in HMM. Then the transition probability between states can be trained from the real training data. Similarly, in Pinyin input each input key can be seen as a state, then we can align the correct input and actual input to find out the transition probability of each state. Finally, different HMMs can be used to model typists with different skill levels.</Paragraph>
      <Paragraph position="14"> In order to train all 406 syllables in Chinese, a lot of data are needed. We reduce this data requirement by tying the same letter in different syllable or same syllable as one state. Then the number of states can be reduced to 27 (26 different letters from 'a' to 'z', plus one to represent the unknown letter which appears in the typing letters). This model could be integrated into a Viterbi beam search that utilizes a trigram language model.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Experiments
</SectionTitle>
      <Paragraph position="0"> Typing model is trained from the real user input. We collected actual typing data from 100 users, with about 8 hours of typing data from each user. 90% of this data are used for training and remaining 10% data are used for testing. The character perplexity for testing corpus is 66.69, and the word perplexity is 653.71.</Paragraph>
      <Paragraph position="1"> We first, tested the baseline system without spelling correction. There are two groups of input: one with perfect input (which means instead of using user input); the other is actual input, which contains real typing errors. The error rate of Pinyin to Hanzi conversion is shown as table 3.1.</Paragraph>
      <Paragraph position="2">  In the actual input data, approximately 4.6% Chinese characters are typed incorrectly. This 4.6% error will cause more errors through propagation. In the whole system, we found that it results in tripling increase of the error rate from table 3.1. It shows that error tolerance is very important for typist while using sentence-based input method. For example, user types the Pinyin like: wisiyigezhonguoren (G0G0G0G0G0G0G0), system without error tolerance will convert it into Chinese character like: wiG0G0G0G0uG0G0.</Paragraph>
      <Paragraph position="3"> Another experiment is carried out to validate the concept of adaptive spelling correction. The motivation of adaptive spelling correction is that we want to apply more correction to less skilled typists. This level of correction can be controlled by the &amp;quot;language model weight&amp;quot;(LM weight) (Frederick 1997; Bahl etc. 1980; X. Huang etc. 1993). The LM weight is applied as in equation 3.1.</Paragraph>
      <Paragraph position="4">  where a is the LM weight. (3.1) Using the same data as last experiment, but applying the typing model and varying the LM weight, results are shown as Figure 3.1.</Paragraph>
      <Paragraph position="5"> As can be seen from Figure 3.1, different LM weight will affect the system performance. For a fixed LM weight of 0.5, the error rate of conversion is reduced by approximately 30%.</Paragraph>
      <Paragraph position="6"> For example, the conversion of &amp;quot;wisiyigezhonguoren&amp;quot; is now correct.  If we apply adaptive LM weight depending on the typing skill of the user, we can obtain further error reduction. To verify this, we select 3 users from the testing data, adding one ideal user (suppose input including no errors), we test the error rate of system with different LM weight, and result is as table 3.2.</Paragraph>
      <Paragraph position="7">  The average input error rates of User 1,2,3 are 0.77%, 4.41% and 5.73% respectively.</Paragraph>
      <Paragraph position="8"> As can be seen from table 3.2, the best weight for each user is different. In a real system, skilled typist could be assigned lower LM weight, and the skill of typist can be  determined by: 1. the number of modification during typing. 2. the difficulty of the text typed distribution of typing time can also be estimated. It can be applied to judge the skill of the typist. 4. Modeless Input  Another annoying UI problem of Pinyin input is the language mode switch. The mode switch is needed while typing English words in a Chinese document. It is easy for users to forget to do this switch. In our work, a new spelling model is proposed to let system automatically detect which word is Chinese, and which word is English. We call it modeless Pinyin input method. This is not as easy as it may seem to be, because many legal English words are also legal Pinyin strings. And because no spaces are typed between Chinese characters, and between Chinese and English words, we obtain even more ambiguities in the input. The way to solve this problem is analogous to speech recognition. Bayes rule is used to divided the objective function (as equation 4.1) into two parts, one is the spelling model for English, the other is the Chinese language model, as shown in equation  One of the common methods is to consider the English word as one single category, called &lt;English&gt;. We then train into our Chinese language model (Trigram) by treating &lt;English&gt; like a single Chinese word. We also train an English spelling model which could be a combination of: 1. A unigram language model trained on real English inserted in Chinese language texts. It can deal with many frequently used English words, but it cannot predict the unseen English words.</Paragraph>
      <Paragraph position="9"> 2. An &amp;quot;English spelling model&amp;quot; of tri-syllable probabilities - this model should have non-zero probabilities for every 3-syllable sequence, but also should emit a higher probability for words that are likely to be English-like. This can be trained from real English words also, and can deal with unseen English words.</Paragraph>
      <Paragraph position="10"> This English spelling models should, in general, return very high probabilities for real English word string, high probabilities for letter strings that look like English words, and low probabilities for non-English words. In the actual recognition, this English model will run in parallel to (and thus compete with) the Chinese spelling model. We will have the following situations:  1. If a sequence is clearly Pinyin, Pinyin models will have much higher score.</Paragraph>
      <Paragraph position="11"> 2. If a sequence is clearly English, English models will have much higher score.</Paragraph>
      <Paragraph position="12"> 3. If a sequence is ambiguous, the two models will both survive in the search until further context disambiguates.</Paragraph>
      <Paragraph position="13"> 4. If a sequence does not look like Pinyin,  nor an English word, then Pinyin model should be less tolerant than the English tri-syllable model, and the string is likely to remain as English, as it may be a proper name or an acronym (such as &amp;quot;IEEE&amp;quot;). During training, we choose some frequently used English syllables, including 26 uppercase, 26 lower-case letters, English word begin, word end and unknown into the English syllable list. Then the English words or Pinyin in the training corpus are segmented by these syllables. We trained the probability for every three syllable. Thus the syllable model can be applied to search to measure how likely the input sequence is an English word or a Chinese word. The probability can be combined with Chinese language model to find the most probable Chinese and/or English words.</Paragraph>
      <Paragraph position="14"> Some experiments are conducted to test the modeless Pinyin input methods. First, we tell the system the boundary between English word and Chinese word, then test the error of system; Second, we let system automatically judge the boundary of English and Chinese word, then test the error rate again. The result is as table 4.1.</Paragraph>
      <Paragraph position="15">  In our modeless approach, only 52 English letters are added into English syllable list, and a tri-letter spelling model is trained based on corpus. If we let system automatically judge the boundary of English word and Chinese word, we found the error rate is approximate 3.6% (which means system make some mistake in judging the boundary). And we found that spelling model for English can be run with spelling correction, with only a small error increase.</Paragraph>
      <Paragraph position="16"> Another experiment is done with an increased English syllable list. 1000 frequently used English syllables are selected into English syllable list. Then we train a tri-syllable model base on corpus. The result is shown in table 4.2.</Paragraph>
      <Paragraph position="17">  As can be seen from table 4.2, increasing the complexity of spelling model adequately will help system a little.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML