<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0311"> <Title>Corpus-based Adaptation Mechanisms for Chinese Homophone Disambiguation</Title> <Section position="3" start_page="0" end_page="94" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Corpus-based Chinese NLP research has been very active in recent years as more and more computer-readable Chinese corpora become available. Reported corpus-based NLP applications [10] include machine translation, word segmentation, character recognition, text classification, lexicography, and spelling checking. In this paper, we describe our work on adaptive Chinese homophone disambiguation (also known as phonetic-input-to-character conversion or phonetic decoding) using part of the 1991 United Daily (UD) corpus of approximately 10 million Chinese characters (Hanzi).</Paragraph> <Paragraph position="1"> Inputting Chinese characters into a computer requires a coding method, structural or phonetic, since there are more than 10,000 characters in common use.</Paragraph> <Paragraph position="2"> In the literature [3,7], there are several hundred different coding methods for this purpose. For most users, phonetic coding (Pinyin or Bopomofo) is the method of choice. To input a Chinese character, the user simply keys in its corresponding phonetic code. This is easy to learn, but suffers from the homophone problem: a single phonetic code corresponds to several different characters. The user therefore needs to choose the desired character from a (usually long) list of candidates, which is inefficient and annoying. Automatic homophone disambiguation is thus highly desirable. Several disambiguation approaches have been reported in the literature [3, 7], and some have even been realized in commercial input methods, e.g., Hanin, WangXing, Going. However, the accuracies of these disambiguators are not satisfactory. 
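To make the homophone problem concrete, here is a minimal sketch of phonetic lookup; the pinyin spellings and candidate table below are our own illustrative assumptions (a real table covers more than 10,000 characters and every Mandarin syllable), not data from the paper:

```python
# Toy homophone table: one phonetic code maps to many characters.
# Entries are illustrative only.
HOMOPHONES = {
    "shi4": ["是", "事", "市", "世", "式", "士"],
    "yi4":  ["意", "義", "易", "議", "億"],
}

def candidates(phonetic_code):
    """Return the candidate characters for one phonetic code."""
    return HOMOPHONES.get(phonetic_code, [])

# Without automatic disambiguation, the user must scan the whole
# candidate list for every syllable typed:
print(candidates("shi4"))  # six candidates for a single syllable
```

A disambiguator's job is to rank or choose among these candidates automatically, using context, so the user rarely has to pick by hand.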
In this paper, we propose a corpus-based adaptation method for improving the accuracy of homophone disambiguation.</Paragraph> <Paragraph position="3"> For homophone disambiguation, what we need as input is syllable (phonetic code) corpora instead of text corpora. For adaptation, what we need is personal corpora instead of general corpora (such as the UD corpus). Thus, we first design a selection procedure to extract articles by individual reporters. Ten personal corpora were set up in this way. An additional domain-specific corpus, translated AP news, was built up similarly. Then, we design a highly reliable (99.7% correct) character-to-syllable converter [1] to convert the text corpora into syllable corpora.</Paragraph> <Paragraph position="4"> Our baseline disambiguator is rather conventional, composed of a word-lattice searching module, a path scorer, and a lexicon-driven word hypothesizer. Using the original text corpora and the corresponding syllable corpora, we propose a user-adaptation method, applying the concepts of bidirectional conversion [1] and automatic evaluation [2]. The adaptation method includes two parts: character-preference learning and pseudo word learning. Given a personal corpus (i.e., sample text), the adaptation procedure is able to produce a user-specific character-preference model and a pseudo word lexicon automatically. The system can then use the user-specific parameters in the two models to improve conversion accuracy.</Paragraph> <Paragraph position="5"> Extensive experiments have been conducted for (1) ten sets of local-news articles (one set per reporter) and (2) translated international news from AP News.</Paragraph> <Paragraph position="6"> Each set is divided into two subsets: one for training, the other for testing. The character accuracy of the baseline version is 93.46% on average. 
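The conventional baseline described above (a lexicon-driven word hypothesizer feeding a word-lattice search with a path scorer) can be sketched minimally as follows; the toy lexicon, pinyin spellings, and simple unigram path score are our own illustrative assumptions, not the paper's actual lexicon or scoring models:

```python
import math

# Toy lexicon: syllable sequence -> list of (word, probability).
# Entries and probabilities are invented for illustration.
LEXICON = {
    ("zhong1", "guo2"): [("中國", 0.010)],
    ("zhong1",):        [("中", 0.004), ("鐘", 0.001)],
    ("guo2",):          [("國", 0.003), ("鍋", 0.001)],
}

def convert(syllables):
    """Best-path search over the word lattice of a syllable sequence.

    best[i] holds (log-probability, converted text) for the best path
    covering the first i syllables; words hypothesized from the lexicon
    form the lattice edges, and the unigram score ranks complete paths.
    """
    n = len(syllables)
    best = [(-math.inf, "")] * (n + 1)
    best[0] = (0.0, "")
    for i in range(n):
        score, text = best[i]
        if score == -math.inf:
            continue
        for j in range(i + 1, n + 1):  # hypothesize words spanning i..j
            for word, p in LEXICON.get(tuple(syllables[i:j]), []):
                cand = (score + math.log(p), text + word)
                if cand[0] > best[j][0]:
                    best[j] = cand
    return best[n][1]

print(convert(["zhong1", "guo2"]))  # the two-syllable word wins over 中+國
```

Under this kind of scorer, the adaptation step amounts to adjusting the per-user parameters (character preferences and learned pseudo words) that feed the path score.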
With the proposed adaptation method, the augmented version increases the accuracy to 98.16% for the training sets and to 94.80% for the test sets. In other words, 71.8% and 20.5% of the errors have been eliminated, respectively. These results encourage us to further pursue corpus-based adaptive learning methods for Chinese phonetic input and for language modeling in speech recognition.</Paragraph> </Section></Paper>