<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0311">
  <Title>Corpus-based Adaptation Mechanisms for Chinese Homophone Disambiguation</Title>
  <Section position="4" start_page="94" end_page="94" type="metho">
    <SectionTitle>
2 Homophone Disambiguation
</SectionTitle>
    <Paragraph position="0"> Mandarin Chinese has approximately 1300 syllables, 13,051 commonly used characters, and more than 100,000 words. Each character is pronounced as a syllable. Thus, it is clear that there are many syllables are shared by numbers of characters. Actually, some syllables correspond to more than 100 characters, e.g.. tile syllable \[yi4\] corresponds to 125 characl, ers, ,~, Jilt, ~, ~C/, ~, ~, etc. Thus, homophone (character) disambiguation is difficult but important in Chinese phonetic input methods and speech recognition systems.</Paragraph>
    <Paragraph position="1"> The problem of homophone disambiguation can be defined as how to convert a sequence of syllables S = sl, s2 ..... sn (usually a sentence or a clause) into a corresponding sequence of characters C = cl,e~,...,cn correctly. Here, each si stands for one of the 1300 Chinese syllables and each c, one of the 13,051 characters. null Fortunately, when the characters are grouped into words (the smallest meaningful unit), the homophone problem is lessened. The number of homophone polysyllables is much less than that of homophone characters. (A Chinese word is usually composed of 1 to 4 characters.) For the disamhiguation, longer words are usually correct and preferred. Thus, the homophone disambiguation problem is usually formulated as a word-lattice optimal path finding problem. (Note that there is the problem of unknown words, especially personal names, compound words, and acronyms, which are not registered in the lexicon.) null For example, a sequence of three syllables sl, s2, s3 involves six possible subsequences sl, s2, s3, sl-s2, s2-s3, sl-s2-s3, which can correspond to some words.</Paragraph>
    <Paragraph position="2"> Each subsequence could correspond to more than one word, especially in the case of monosyllables. Accordingly, a word lattice is formed by the words with one of the six subsequences as pronunciation. See Figure 1 for a sample word lattice.</Paragraph>
    <Paragraph position="3"> Note that syllables are chosen as input units instead of word-sized units used in systems like TianMa. The major reason is: Chinese is a mono-syllabic language; characters/syllables are the most natural units, while &amp;quot;words&amp;quot; are not well-defined in Chinese. It is difficult for people to segment the words correctly and consistently, especially according to the dictionary provided by the system. This is also the reason why newer intelligent Chinese input methods in Taiwan like Hanin, WangXing, and Going, all use syllables (for a sentence) as input units. In addition, our target system is an isolated-syllables speech recognition system.</Paragraph>
  </Section>
  <Section position="5" start_page="94" end_page="96" type="metho">
    <SectionTitle>
3 The Baseline System
</SectionTitle>
    <Paragraph position="0"> The proposed system (Figure 2) is composed of a baseline system plus two new features: character-preference learning (CPL) and pseudo word learning (PWL).</Paragraph>
    <Paragraph position="1"> The baseline syllable-to-character converter consists of three components: (1) a word hypothesizer0 (2) a word-lattice search algorithm, and (3) a score function. The basic model used in our system is: (-1) a Viterbi search algorithm, (2) a lexicon-based word hypothesizer, and (3) a score function considering word length and word frequency.</Paragraph>
    <Paragraph position="2"> The word hypothesizer matches the current input syllable candidates with the lexical entries in the lexicon (7,953 1-character words, 25,567 2-character, 12,216 3-character, 12,419 4-character, 558,155 words totally). All matched words are proposed as word hypotheses forming the word lattice. Currently, we consider only those words with at most four syllables (only less than 0.1% of words contain five or more syllables). In addition, Determinative-Measure (DM)  compounds are proposed dynamically, i.e., not stored in the lexicon.</Paragraph>
    <Paragraph position="3"> Viterbi search is a well-known algorithm for optimal path-finding problems. The word lattice for a whole clause (delimited by a punctuation) is searched using the dynamic-programnfing-style Viterbi algorithm. null The score function is defined as follows: If a path P is composed of n words uq ..... w,, and two assumed clause delimiters w0 and wn+l, the path score for P is the sum of word scores for the n words and inter-word link scores for the n+l word links (n-1 between-word links and 2 boundary links).</Paragraph>
    <Paragraph position="5"> The word score of a word is based on the word frequency statistics computed by counting the number of occurrences the word appears in the 10-millioncharacter UD corpus. The word frequency is mapped into an integral score by taking its logarithm value and truncating the value to an integer. Lee et al. \[11\] recently presented a novel idea called word-lattzce-based Chinese character bigram for Chinese language modeling. Basically, they approximate the effect of word bigrams by applying character bigrams to the boundary characters of adjacent words. The approach is simple (easy to implement) and very effective. Following the idea, we built a Chinese character bigram based on the UD corpus and used it to compute inter-word link scores. For two adjacent words, the last. character of the first word and the first character of the second word are used to consult the character bigram which recorded the number of occurrences in the UD corpus. Inter-word link scores are then computed similarly to word scores.</Paragraph>
  </Section>
  <Section position="6" start_page="96" end_page="98" type="metho">
    <SectionTitle>
4 Bidirectional Conversion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="96" end_page="96" type="sub_section">
      <SectionTitle>
and Automatic Evaluation
</SectionTitle>
      <Paragraph position="0"> Here, we will only briefly review the concepts of bidirectional conversion and automatic evaluation \[I,2\].</Paragraph>
      <Paragraph position="1"> For more details, see the cited papers.</Paragraph>
      <Paragraph position="2"> Homophone disambiguation can be considered as a process of syllable-to-character ($2C) conversion, Its reverse process, character-to-syllable (C2S) conversion, is also nontrivial. There are more than 1000 characters, so-called Poyinzi (homographs), with multiple pronunciations. However, a high-accuracy C2S converter is achievable. Using an n-gram lookahead scheme, we have designed such a converter with 99.71c~ accuracy. Because of the high accuracy, the C2S converter can be used to convert a text corpus to a syllable corpus automatically. The two processes together form a bidirectional conversion model. The i~oint is: If we ignore the 0.29% error (could be reduced ifa better C2S system is used), many applications of the model appear.</Paragraph>
      <Paragraph position="3"> We have applied the bidirectional model to automatic evaluation of language models for speech recognition. A more straightforward application is automatic evaluation of the $2C converter. A text is converted into a syllable sequence, which then is converted back to an output text. Comparing the input text with the output, we can compute the accuracy of the $2C converter automatically.</Paragraph>
      <Paragraph position="4">  In the following, we describe how to apply the model to user-adaptation of homophone disambiguator.</Paragraph>
    </Section>
    <Section position="2" start_page="96" end_page="97" type="sub_section">
      <SectionTitle>
5.1 Character-Preference Learning
</SectionTitle>
      <Paragraph position="0"> Everyone has his own preference for characters and words. A chemist might use the special characters for chemical elements frequently. Different people uses a different set of proper names that are usually not stored in the lexicon. In this section, we propose an adaptation method based on the bidirectional conversion model.</Paragraph>
      <Paragraph position="1"> From a sample text given by the user, the system first converts it to a sequence of syllables. Then, the baseline system is used to convert them back to Chinese characters. After that, we can compare them with the input to obtain the error records. From the comparison report, we will derive three indices for each character in the character set (say, 13,051 characters in the Big-5 coding used in Taiwan): Acount, B-count, and C-count. A-count is defined a.s  the number of times thai the character is misrecognized. B-count the number of times it is wrougly used. while C-count the number of times it is correctly recognized. For example, if the user wants to input the string ~I~N~ and keys in the corresponding syllables \[li3\]\[zhenl\]\[zhenl\] while the output is ?-~t.~t~, the indices would be: A(~)=0, B(~)=0, C(~)=I, A(~ )=2, B(~)=0, C(~)=0, A({.~)=0, B({~)=2, C(\].~ )=0. From these indices, we propose a character-preference learmng procedure:  1. Convert the given sample text 1C/ intoa syllable file 10, using the character-to-syllable converter.</Paragraph>
      <Paragraph position="2"> Let the baseline version be I~j. Run V c with l, to obtain an output 0 deg. From \]C/ and 0 deg, compute the initial accuracy a deg.</Paragraph>
      <Paragraph position="3"> 2. Initialize the 13051-entry character-preference table CPT deg to zeroes. Set n to 1.</Paragraph>
      <Paragraph position="4"> 3. From Ic and O n-l. compute the A. B, C indices for each character.</Paragraph>
      <Paragraph position="5"> 4. For each character c, add to the corresponding entry in CP'1 'n-I a preference score (according to a preference adjustment function pf of A(c), B(c), C(c)) to form CPT n 5. Form a new version V&amp;quot; of the syllable-to-character converter by considering CPT'. Run V n with 1., to obtain a new output O n .</Paragraph>
      <Paragraph position="6"> 6. From \]c and O&amp;quot;, compute the new accuracy rate a n &amp;quot; 7. If a n &gt; a '~-l, set n to n + 1 and repeat steps 3.6. Otherwise, stop and let CPT&amp;quot;-\] be the final CPT for the user.</Paragraph>
    </Section>
    <Section position="3" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
Adjustment Functions
</SectionTitle>
      <Paragraph position="0"> In step 4, the adjustment function pf is a function of A(c), B(c), C(c). Several versions have been tried in our experiments. Three of them are:</Paragraph>
      <Paragraph position="2"> training since it only considers error cases. To avoid the problem, we devise a new pf (2) taking the correct cases into account. After trying several combinations  of A, B, C for pf, we observe that positive learning (3) is most effective, i.e., achieving the highest accuracy. Therefore, in the current implementation, pf (3) is used.</Paragraph>
    </Section>
    <Section position="4" start_page="97" end_page="98" type="sub_section">
      <SectionTitle>
5.2 Pseudo Word Learning
</SectionTitle>
      <Paragraph position="0"> 'the second adaptation mechanism is to treat error N-grams as new words (called pseudo words). An error N-gram is defined as a sequence of N characters in which at least (N - 1) characters are wrongly converted (from syllables) by the system. (In practice, 2 &lt; N &lt; 4.) For example, if \[fan4\]\[zhen4\]\[he2\] (to input ffill\]~$fl) is converted to ~E~, three pseudo words are produced: t:~, ~$\[1, and $~$~. There are two modes for generating pseudo words: corpus training and interactive correction. In the former, the user-specific text corpus (or simply a sample text) is used for generating the pseudo word lexicon (PW lexicon), applying the concept of bidirectional conversion. In the latter, pseudo words are produced through user corrections in an interactive input process. Both modes can be used at the same time.</Paragraph>
      <Paragraph position="1"> In the following, we will describe how to build, maintain, and use the user-specific PW lexicon. The PW lexicon stores the M (lexicon size) pseudo words that are produced or referenced in the most recent period. It is structurally exactly the same as the general lexicon, containing Hanzi, phonetic code, and word frequency. The word frequency of a new PW is set to /0 (3 in the implementation) and incremented by one when referenced. Once the word frequency exceeds an upper bound F, the PW would be considered as a real word and no longer liable to replacement.</Paragraph>
      <Paragraph position="2"> The procedure is: I. Segment the sample text into clauses (separated by punctuations). For each clause It, do steps 2.--4.</Paragraph>
      <Paragraph position="3"> 2. Convert the clause into syllable sequence I0 using C2S, then convert I0 back to a character se- null quence Oc using baseline $2C. For each character Cn in 0~. do steps 3-4.</Paragraph>
      <Paragraph position="4">  3. Compare Cn with the corresponding input character. Set the error flag if different.</Paragraph>
      <Paragraph position="5"> 4. If a pseudo word ending with Cn is found (ac null cording to error flags) then (1) increment the word frequency if it is already in the PW lexicon, and check the upper bound F; (2) replace the old entry and set frequency to f0 if the lexicon has a homophone PW: (3) add a new entry if the lexicon has vacancies: (4) otherwise, remove one of the entries that have the least word frequency and add the new PW.</Paragraph>
      <Paragraph position="6"> 5. We have a new PW lexicon after the above steps are done.</Paragraph>
      <Paragraph position="7"> We observe that 3-character pseudo words are very useful for dealing with the unregistered proper name problem, which is a significant source of conversion errors. The reasons are: ( 1 ) A large part. of unknown word~ in news articles are proper names, especially three-character personal names; (2) It is not practical to store all the proper names beforehand; (3) The proper names usually contain uncommon characters which are difficult to convert from syllables. Therefore, the user (or author) can have a personalized PW lexicon which contains unregistered proper names he will use, simply by providing a sample text.</Paragraph>
      <Paragraph position="8"> The parameters for both CPL and PWL can be trained by the bidirectional learning procedure. The only input the user needs to provide is a sample text similar to the texts he wants to input by&amp;quot; the phonetic-input-to-character converter. The phonetic input file will be automatically generated by the character-to-syllable converter.</Paragraph>
      <Paragraph position="9"> pus with more than 10 million characters. First, collect character-trigrams after the word \[~:~ (ji4zhe3,'reporter') and sort them according to the number of occurrences. Most of these trigrams ha W pen to be names of reporter. We use the top-10 names as the basis for selecting articles. Then, search the names in the corpus in order to built the article databases for the 10 reporters. The AP News corpus is built in a similar way (searching for the word I\[~:~+- rnei31ian2she4). Table 1 lists some statistics for the article databases. The first column lists the set names, the second column the numbers of articles in the set, the third column the numbers of characters, and the fourth column the numbers of pronounceable characters.</Paragraph>
      <Paragraph position="10">  Each corpus is then divided into two parts according to publication date: a training set and a testing set. For example, the corpus lwy is divided into lvy-1 and lwy-2.</Paragraph>
      <Paragraph position="11"> In the following, we show the experimental results for training sets and testing sets, respectively.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="98" end_page="99" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
6.1 The Corpora
</SectionTitle>
      <Paragraph position="0"> Eleven sets of newspaper articles are extracted from the 1991 United Daily News Corpus (kindly provided by United Informatics. Inc., Taiwan). Ten of them are by specific reporters, i.e., one set per reporter.</Paragraph>
      <Paragraph position="1"> The other is translated AP News. These corpora are used to validate the proposed adaptation techniques.</Paragraph>
      <Paragraph position="2"> We design an extraction procedure to select articles written by a specific reporter from the cor-</Paragraph>
    </Section>
    <Section position="2" start_page="98" end_page="99" type="sub_section">
      <SectionTitle>
6.2 Training Sets
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the adaptation results for the training sets. The RI column lists the accuracy rates for the baseline system, while the R2 column lists those for the adapted (or personalized) system. To avoid the problem of over-training, we train the the system only by two iterations in practice. More iterations can improve the performance for training sets but hurt the performance for testing sets. The average character accuracy rate is improved by 4.68% (from 93.48% to 98.16%). That is, 71.8 percent of errors are eliminated.</Paragraph>
    </Section>
    <Section position="3" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
6.3 Testing Sets
</SectionTitle>
      <Paragraph position="0"> Table 3 shows tile results for the testing sets. The average accuracy is improved by 1.34c~. (from 93.46% to 94.80%). That is, 20.5 percent of errors are eliminated. null</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="99" end_page="99" type="metho">
    <SectionTitle>
7 Related Work
</SectionTitle>
    <Paragraph position="0"> The study of phonetic-input-to-character conversion has been quite active m the recent years. There are two different approaches for the problem: dictionary-based and statistics-based.</Paragraph>
    <Paragraph position="1"> Matsushita (Taipei) developed a Chinese word-string input system, ltanin, as a new input method (Chen \[4\]) in which phonetic symbols are continuously converted to Chinese characters through dictionary lookup. Commercial systems TianMa and WangXing (ETch Corp.) also belong to this type. In the mainland, there have been several groups involving in similar projects \[14,15\] although most of them are pinyin-based and word-based.</Paragraph>
    <Paragraph position="2"> In the statistics-based school are relaxation techniques (Fan and Tsai \[6\] ), character bigrams with dynamic programming (Sproat \[12\]), constraint satisfaction approaches (JS Chang \[3\]), and zero-order or first-order Markov models (Gu et aL \[7\]).</Paragraph>
    <Paragraph position="3"> Ni \[9\] mentioned a so-called self-learning capability for his Chinese PC called LX-PC. However, the method is (1) let the user define new words during the input process (2) dynamically adjust the word frequency of used words. Chen \[4\] also proposed a learning function that uses a learning file to store user-selected characters and words and the character before them. The entries in the learning file are favored over those in the regular dictionary. Lua and Gall \[8\] describe a simple error-correcting mechanism: increase the usage frequency of the desired word by 1 unit when the user corrects the system's output.</Paragraph>
    <Paragraph position="4"> These methods are either manual adaptation or simple word frequency counting.</Paragraph>
    <Paragraph position="5"> Recently, Su el at. \[5,13\] proposed a discrimination oriented adaptive learning procedure for various problems, e.g., speech recognition, part-of-speech tagging, and word segmentation. The basic idea is: When an error is made, i.e., the first candidate is not correct, adjust the parameters in the score function based on subspace projection. The parameters for the correct candidate are increased, while those for the first candidate are decreased, both in an amount decided by the difference between the scores of the two candidates. This process continues until the correct candidate becomes the new first candidate; that is, the score of the correct candidate is greater than that of the old first one. Our learning procedure is different from theirs because (I) ours is increment-based while theirs is projection-based, (2) ours is not discrimination oriented, (3) ours is coarse-grained learning while theirs is fine-grained, and (4) the application domain is different.</Paragraph>
  </Section>
class="xml-element"></Paper>