<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2010">
  <Title>Applying a Mix Word-Pair Identifier to the Chinese Syllable-to-Word Conversion Problem</Title>
  <Section position="3" start_page="55" end_page="56" type="metho">
    <SectionTitle>
2 Development of Mix-WP Identifier
</SectionTitle>
    <Paragraph position="0"> In this study, a mix word-pair (mix-WP) identifier consists of a specific word-pair (SWP) identifier and a common word-pair (CWP) identifier. The system dictionary of the mix-WP identifier comprises the CKIP lexicon (CKIP, 1995) and the unknown words found automatically in the UDN 2001 corpus by a Chinese word auto-confirmation (CWAC) system (Tsai et al. 2003).</Paragraph>
    <Paragraph position="1"> The pinyin syllable-words were translated by phoneme-to-pinyin mappings, such as &quot;JI \&quot;-to-&quot;ji4.&quot;</Paragraph>
    <Section position="1" start_page="55" end_page="56" type="sub_section">
      <SectionTitle>
2.1 Development of SWP Identifier
</SectionTitle>
      <Paragraph position="0"> The steps for auto-generating specific word-pairs (AUTO-SWP) for a given Chinese sentence are as follows: Step 1. Generate the segmentation of the given Chinese sentence with a backward maximum matching (BMM) technique.</Paragraph>
      <Paragraph position="1"> As per (Tsai et al. 2004), the performance of BMM is better than that of forward maximum matching.</Paragraph>
      <Paragraph position="2"> Step 2. Extract the BEGIN, END and BOUND word-pairs from the BMM segmentation of Step 1 by the following processes, respectively: (1) BEGIN word-pair. When the segmentation contains more than one word, the first two words form a BEGIN word-pair. For the segmentation &quot;Yin Le Hui (concert) Xian Chang (locale) Yong Ru (enter) Xu Duo (many) Guan Zhong (audience members),&quot; the pair &quot;Yin Le Hui-Xian Chang&quot; will be generated as a BEGIN word-pair.</Paragraph>
      <Paragraph position="3"> (2) END word-pair. When the segmentation contains more than two words, the last two words form an END word-pair.</Paragraph>
      <Paragraph position="4"> For the segmentation &quot;Quan Bu (whole) Gong Cheng (construction) Yu Ding (prearrange) Nian Di (end of year) Wan Cheng (complete),&quot; the pair &quot;Nian Di-Wan Cheng&quot; will be generated as an END word-pair.</Paragraph>
      <Paragraph position="5"> (3) BOUND word-pair. When the segmentation contains more than two words, the first word and the last word form a BOUND word-pair. For the segmentation &quot;Wu Jia (price) Da Di (ordinarily) Wei Chi (maintain) Ping Wen (stable),&quot; the pair &quot;Wu Jia-Ping Wen&quot; will be generated as a BOUND word-pair.</Paragraph>
      <Paragraph position="6"> Step 3. If a generated SWP is not found in its corresponding dataset, insert it into the BEGIN, END or BOUND word-pair dataset, respectively.</Paragraph>
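      <Paragraph> The three AUTO-SWP steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names bmm_segment, extract_swp and update_datasets, the max_len parameter, and the use of in-memory sets for the BEGIN, END and BOUND datasets are all hypothetical, and any lexicon passed in merely stands in for the system dictionary.
```python
from typing import Dict, List, Set, Tuple

WordPair = Tuple[str, str]


def bmm_segment(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Step 1: backward maximum matching (scan right to left, longest match first)."""
    words: List[str] = []
    end = len(sentence)
    while end > 0:
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def extract_swp(words: List[str]) -> Dict[str, WordPair]:
    """Step 2: extract the BEGIN, END and BOUND word-pairs from a segmentation."""
    pairs: Dict[str, WordPair] = {}
    if len(words) > 1:
        pairs["BEGIN"] = (words[0], words[1])
    if len(words) > 2:
        pairs["END"] = (words[-2], words[-1])
        pairs["BOUND"] = (words[0], words[-1])
    return pairs


def update_datasets(datasets: Dict[str, Set[WordPair]], words: List[str]) -> None:
    """Step 3: insert each generated SWP into its dataset if it is not there yet."""
    for kind, pair in extract_swp(words).items():
        # Adding to a set is a no-op when the pair is already present.
        datasets.setdefault(kind, set()).add(pair)
```
      For the five-word segmentation used in the BEGIN example, extract_swp yields the BEGIN pair (Yin Le Hui, Xian Chang), the END pair (Xu Duo, Guan Zhong) and the BOUND pair (Yin Le Hui, Guan Zhong).</Paragraph>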
      <Paragraph position="7">  In Figure 1, the SWP data is a collection of the auto-generated BEGIN, END and BOUND SWP datasets. If an SWP identifier uses only one of the BEGIN, END or BOUND SWP datasets, it becomes a BEGIN (BN), END (ED) or BOUND (BD) SWP identifier, respectively. The algorithm of our SWP identifier is as follows:  SWP in the input syllables to be the initial SWP set. For the initial SWP set, if the found SWP number of the word-syllable pair of a BN, ED or BD SWP is greater than one in the BN, ED or BD datasets, respectively, the SWP is dropped from the initial SWP set.</Paragraph>
      <Paragraph position="8"> Step 3. Use the longest syllabic word-pair first (LS-WPF) strategy (Tsai and Hsu, 2002) to select the BN, ED and BD word-pairs from the initial SWP set into the final SWP set.</Paragraph>
      <Paragraph position="9"> Step 4. Replace the corresponding syllable-word pairs of the input syllables with the word-pairs of the final SWP set to form an SWP-sentence. In our experiment, the performance of the three SWP identifiers is ordered BD &lt; BN &lt; ED. Thus, the identifying sequence of our SWP identifier is from BD, BN to ED.</Paragraph>
      <Paragraph position="10"> Table 1 is a step-by-step example that illustrates the four steps of our SWP identifier for the Chinese syllables &quot;shu3 dou1 shu3 bu4 qing1 (Shu [count] Du [always] Shu Bu Qing [innumerable]).&quot; Note that when we used the Microsoft Input Method Editor 2003 for Traditional Chinese (MSIME), a trigram-like input system, to convert the same syllables, the output was &quot;Shu (belong) Du (always) Shu (mouse) Bu (not) Qing (clear).&quot;</Paragraph>
    </Section>
    <Section position="2" start_page="56" end_page="56" type="sub_section">
      <SectionTitle>
2.2 Development of CWP Identifier
</SectionTitle>
      <Paragraph position="0"> The steps for auto-generating common word-pairs (AUTO-CWP) for a given Chinese sentence are as follows: Step 1. Generate the word segmentation of the given Chinese sentence with the BMM technique.</Paragraph>
      <Paragraph position="1"> Step 2. Extract all the combinations of word-pairs from the BMM segmentation of Step 1 to be the initial CWP set. For the segmentation &quot;Wo/Bu Hui/Kai Che,&quot; three CWPs will be extracted, i.e. &quot;Wo-Bu Hui&quot;, &quot;Wo-Kai Che&quot; and &quot;Bu Hui-Kai Che.&quot; Step 3. Select the word-pairs comprised of two multi-syllabic Chinese words (such as &quot;Bu Hui (can not)&quot;) to be the final CWP set. For each word-pair in the final CWP set, if the word-pair is not found in the CWP database, insert it into the CWP database and set its frequency to 1; otherwise, increase its frequency by 1. In the above case, the final CWP set includes one word-pair, i.e. &quot;Bu Hui-Kai Che.&quot;</Paragraph>
      <Paragraph position="2"> The system overview of the CWP identifier is the same as that of the SWP identifier shown in Fig. 1. The algorithm of our CWP identifier is as follows: Step 1. Input tonal or toneless syllables.</Paragraph>
      <Paragraph position="3"> Step 2. Generate all possible word-pairs comprised of two multi-syllabic Chinese words for the input syllables to be the input of Step 3.</Paragraph>
      <Paragraph position="4"> Step 3. First, select the word-pairs that match a word-pair in the CWP database to be the initial CWP set. Then, from the initial CWP set, select the word-pair with the maximum frequency as the key word-pair. Finally, find the word-pairs that co-occur with the key word-pair in the training corpus to be the final CWP set. If two or more word-pairs share the same maximum frequency, one of them is randomly selected as the key word-pair. Step 4. Arrange all word-pairs of the final CWP set into a CWP-sentence. If no word-pairs can be identified in the input syllables, a NULL CWP-sentence is produced.</Paragraph>
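      <Paragraph> The key word-pair selection in Step 3 can be sketched in Python as follows. This is an illustrative sketch only: the name select_key_pair, the dictionary representation of the CWP database and the optional rng parameter are assumptions, and the co-occurrence lookup in the training corpus is left out because the text does not say how co-occurrence is stored.
```python
import random
from typing import Dict, List, Optional, Tuple

WordPair = Tuple[str, str]


def select_key_pair(
    candidates: List[WordPair],
    cwp_db: Dict[WordPair, int],
    rng: Optional[random.Random] = None,
) -> Optional[WordPair]:
    """Pick the candidate with the highest CWP frequency; break ties at random."""
    rng = rng or random.Random()
    # Initial CWP set: candidates that match a word-pair in the CWP database.
    initial = [pair for pair in candidates if pair in cwp_db]
    if not initial:
        return None  # nothing identified, so a NULL CWP-sentence would result
    best = max(cwp_db[pair] for pair in initial)
    tied = [pair for pair in initial if cwp_db[pair] == best]
    return rng.choice(tied)
```
      Passing a seeded random.Random instance makes the random tie-break reproducible across runs.</Paragraph>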
      <Paragraph position="5"> When the CWP identifier is applied to the syllables &quot;yi1 ge5 wen2 ming2 de5 shuai1 wei2 guo4 cheng2 (Yi Ge [a] Wen Ming [civilization] De [of] Shuai Wei [decay] Guo Cheng [process]),&quot; the generated WP-sentence will be &quot;Yi Ge Wen Ming de5 shuai1 wei2 Guo Cheng.&quot; For the same syllables, the MSIME will convert them into &quot;Yi Ge [a] Wen Ming [famous] De [of] Shuai Wei [decay] Guo Cheng [process].&quot; The detailed analysis and demonstration of our CWP identifier can be found in (Tsai 2005). Appendix A presents a case of the CWP identified results.</Paragraph>
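      <Paragraph> Returning to the AUTO-CWP generation steps at the start of this sub-section, Steps 2 and 3 can be sketched in Python as follows. The names auto_cwp and is_multi_syllabic are hypothetical, a Counter stands in for the CWP database, and words are assumed to be romanized with one space between syllables; the input is an already segmented sentence.
```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

WordPair = Tuple[str, str]


def is_multi_syllabic(word: str) -> bool:
    # Assumption: a romanized word is written as space-separated syllables.
    return len(word.split()) > 1


def auto_cwp(words: List[str], cwp_db: Counter) -> List[WordPair]:
    """Generate the final CWP set for one segmentation and update the database."""
    # Step 2: every ordered combination of two words is an initial CWP.
    initial = list(combinations(words, 2))
    # Step 3: keep only pairs made of two multi-syllabic words.
    final = [(a, b) for a, b in initial if is_multi_syllabic(a) and is_multi_syllabic(b)]
    # A new pair starts at frequency 1; a known pair is incremented by 1.
    cwp_db.update(final)
    return final
```
      For the segmentation Wo/Bu Hui/Kai Che, the only pair kept is (Bu Hui, Kai Che), and its frequency becomes 1 on the first call.</Paragraph>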
    </Section>
  </Section>
  <Section position="4" start_page="56" end_page="58" type="metho">
    <SectionTitle>
3 The STW Experiments
</SectionTitle>
    <Paragraph position="0"> To evaluate the STW performance of our mix-WP identifier, the STW accuracy, the identified character ratio (ICR) and the STW improvement were used (Tsai 2005).</Paragraph>
    <Section position="1" start_page="56" end_page="57" type="sub_section">
      <SectionTitle>
3.1 Experimental Data
</SectionTitle>
      <Paragraph position="0"> To conduct the STW experiments, we first used the inverse translator of phoneme-to-character (PTC) provided in the GOING system to convert the testing sentences into their corresponding syllables. All erroneous PTC translations of GOING were corrected by manual post-editing. We then applied our SWP, CWP and mix-WP identifiers to convert the syllable sequences back to words and calculated the STW accuracy and identified character ratio. All test sentences are composed of a string of Chinese characters.</Paragraph>
      <Paragraph position="1"> In the following experiments, the training and testing corpora, the closed and open test sets, and the collections of testing SWP and CWP data were:  Training corpus: The UDN 2001 corpus was selected as our training corpus. It is a collection of 4,539,624 Chinese sentences extracted from all 2001 articles on the United Daily News Website (UDN) in Taiwan.</Paragraph>
      <Paragraph position="2"> Testing corpus: The UDN 2002 corpus was selected as our testing corpus. It is a collection of 3,321,504 Chinese sentences extracted from all 2002 articles on UDN.</Paragraph>
      <Paragraph position="3"> Closed testing set: 10,000 sentences were randomly selected from the UDN 2001 corpus as the closed testing set.</Paragraph>
      <Paragraph position="4"> Open testing set: 10,000 sentences were randomly selected from the UDN 2002 corpus as the open testing set. We also verified that none of the selected open testing sentences appears in the closed testing set.</Paragraph>
      <Paragraph position="5"> Testing SWP data: By applying our AUTO-SWP on the UDN 2001 corpus, we created 1,754,055 BN, 1,594,036 ED and 2,502,241 BD specific word-pairs.</Paragraph>
      <Paragraph position="6"> Testing CWP data: By applying our AUTO-CWP on the UDN 2001 corpus, we created 25,439,679 common word-pairs.</Paragraph>
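      <Paragraph> The construction of the closed and open testing sets can be sketched in Python as follows. This is a hedged illustration: the name sample_test_sets and the seed parameter are hypothetical, and the sketch removes closed-set sentences from the candidate pool before sampling, whereas the text only states that the absence of overlap was checked.
```python
import random
from typing import List, Sequence, Tuple


def sample_test_sets(
    train_corpus: Sequence[str],
    test_corpus: Sequence[str],
    size: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Draw a closed set from the training corpus and a disjoint open set."""
    rng = random.Random(seed)
    closed = rng.sample(list(train_corpus), size)
    closed_lookup = set(closed)
    # Keep only testing-corpus sentences that are not in the closed set.
    pool = [s for s in test_corpus if s not in closed_lookup]
    open_set = rng.sample(pool, size)
    return closed, open_set
```
      In the setting described above, size would be 10,000, with the UDN 2001 and UDN 2002 corpora as the two inputs.</Paragraph>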
      <Paragraph position="7"> In this study, we conducted the STW experiments in a progressive manner. The experimental results of the SWP, CWP and mix-WP identifiers are described in Sub-sections 3.2, 3.3 and 3.4, respectively.</Paragraph>
    </Section>
    <Section position="2" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
3.2 Experiment of SWP Identifier
</SectionTitle>
      <Paragraph position="0"> This experiment demonstrates the tonal and toneless STW accuracies obtained by using the SWP identifier with the testing BN, ED, BD and ALL datasets, respectively. Note that the symbol ALL stands for a mixed collection of all BN, ED and BD SWP data. [Table 2: The performance of the SWP identifier with the three SWP data; the word-pair replacing sequence of the SWP identifier is from BD, BN to ED.] Table 2 shows that the average tonal and toneless STW accuracies of the SWP identifier with ALL SWP data over the closed and open test sets are 99.4% and 97.1%, respectively. Meanwhile, between the closed and open test sets, the differences in tonal and toneless STW accuracies of the SWP identifier are 0.5% and 2%, respectively.</Paragraph>
    </Section>
    <Section position="3" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
3.3 Experiment of CWP Identifier
</SectionTitle>
      <Paragraph position="0"> This experiment demonstrates the tonal and toneless STW accuracies among the identified word-pairs obtained by using the CWP identifier with the testing CWP data. Table 3 shows that the average tonal and toneless STW accuracies of the CWP identifier over the closed and open test sets are 98.8% and 92.6%, respectively. Meanwhile, between the closed and open test sets, the differences in tonal and toneless STW accuracies of the CWP identifier are 0.7% and 3.2%, respectively.</Paragraph>
    </Section>
    <Section position="4" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
3.4 Experiment of Mix-WP Identifier
</SectionTitle>
      <Paragraph position="0"> This experiment demonstrates the tonal and toneless STW accuracies among the identified word-pairs obtained by using the mix-WP identifier with all testing WP data. From Tables 2 and 3, the STW performance of the SWP identifier is better than that of the CWP identifier. Therefore, our mix-WP identifier uses the CWP identifier to identify CWPs first and the SWP identifier to identify SWPs last for the given syllables.</Paragraph>
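      <Paragraph> The identifying order of the mix-WP identifier can be sketched in Python as follows. This is a sketch under assumptions that the text does not spell out: each identifier is modelled as a hypothetical function returning a mapping from syllable position to the identified character, and a result from the identifier applied later (SWP) is assumed to replace an earlier (CWP) result at the same position. The names mix_wp_identify and render are illustrative only.
```python
from typing import Callable, Dict, List

# Hypothetical interface: syllables in, {syllable position: identified character} out.
Identifier = Callable[[List[str]], Dict[int, str]]


def mix_wp_identify(
    syllables: List[str],
    cwp_identifier: Identifier,
    swp_identifier: Identifier,
) -> Dict[int, str]:
    """Apply the CWP identifier first and the SWP identifier last."""
    identified = dict(cwp_identifier(syllables))
    # SWP results are written last, so they win wherever both identifiers fire.
    identified.update(swp_identifier(syllables))
    return identified


def render(syllables: List[str], identified: Dict[int, str]) -> List[str]:
    """Keep the raw syllable wherever no character was identified."""
    return [identified.get(i, syllable) for i, syllable in enumerate(syllables)]
```
      The same overlay pattern applies to the ordering BD, BN, ED inside the SWP identifier, where the dataset with the best measured performance is applied last.</Paragraph>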
      <Paragraph position="1">  Table 4 shows that the average tonal and toneless STW accuracies of the mix-WP identifier over the closed and open test sets are 98.8% and 93.5%, respectively. Meanwhile, between the closed and open test sets, the differences in tonal and toneless STW accuracies of the mix-WP identifier are 0.8% and 3.1%, respectively. The average identified character ratios (ICR) of the tonal and the toneless syllables are 67.6% and 64.6%, respectively. To sum up the results of Tables 2 to 4, we conclude that the mix-WP (SWP and CWP) data can be used to convert Chinese STW effectively on the mix-WP-related portion (including the SWP-related portion and the CWP-related portion).</Paragraph>
    </Section>
    <Section position="5" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
3.5 Commercial IME System and Bigram Model with WP Identifier
</SectionTitle>
      <Paragraph position="0"> We selected Microsoft Input Method Editor 2003 for Traditional Chinese (MSIME) as our experimental commercial Chinese input system. In addition, an optimized bigram model called BiGram was developed (Tsai et al. 2004). The BiGram STW system is a bigram-based model developed with SRILM (Stolcke 2002), using Good-Turing back-off smoothing (Manning and Schuetze, 1999) as well as forward and backward LS-WPF strategies (Chen et al. 1986, Tsai et al. 2004). The training corpus and the system dictionary of this BiGram system are the same as those of the mix-WP identifier. In this experiment, the STW output of the MSIME with the mix-WP identifier, or of the BiGram with the mix-WP identifier, was collected by directly replacing the identified word-pairs in the corresponding STW output of MSIME or BiGram.</Paragraph>
      <Paragraph position="1">  [Table caption: STW accuracies of the words identified by the BiGram with the mix-WP identifier.] From Table 5, the tonal and toneless STW improvements of the MSIME by using the mix-WP identifier are 29.2% and 22.5%, respectively. On the other hand, from Table 6, the tonal and toneless STW improvements of the BiGram by using the mix-WP identifier are 12.8% and 19.6%, respectively. To sum up the results of this experiment, we conclude that the mix-WP identifier can achieve better WP-portion STW accuracy than the MSIME and BiGram Chinese input systems.</Paragraph>
    </Section>
  </Section>
</Paper>