File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/05/i05-2010_concl.xml
Size: 2,567 bytes
Last Modified: 2025-10-06 13:54:37
<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2010">
<Title>Applying a Mix Word-Pair Identifier to the Chinese Syllable-to-Word Conversion Problem</Title>
<Section position="5" start_page="58" end_page="59" type="concl">
<SectionTitle> 4 Conclusion and Future Directions </SectionTitle>
<Paragraph position="0"> In this paper, we have applied a mix-WP identifier to Chinese syllable-to-word (STW) conversion and obtained high STW accuracy on the identified word-pairs, with an ICR of more than 60%. All of the testing mix-WP data was auto-generated by running AUTO-SWP and AUTO-CWP on the training corpus. We are encouraged by the fact that mix-WP knowledge can achieve tonal and toneless STW accuracies of 98.8% and 93.5%, respectively, for the mix-WP-related portion of the testing syllables. The mix-WP identifier can be easily integrated into existing Chinese input systems, or into the Chinese language processing components of typical speech recognition systems, by identifying word-pairs in a post-processing step. Our experimental results show that, by applying the mix-WP identifier together with the MSIME and the BiGram input systems, the tonal and toneless STW improvements are 29%/23% and 13%/20%, respectively. As an adaptive approach, we also first used AUTO-SWP and AUTO-CWP to auto-extract new SWPs and CWPs from the open test sentences into the mix-WP data. We then found that the overall tonal and toneless STW accuracies of the MSIME and the BiGram for closed/open syllables became 96.5%/90% and 97.1%/89%, respectively.</Paragraph>
<Paragraph position="1"> Currently, our approach handles sentences that contain more than one SWP or CWP in a fairly basic way. Although there is room for improvement, we believe that refining it would not noticeably affect STW accuracy. However, this issue will become important when we apply the mix-WP knowledge to speech recognition.
According to our computations, our collection of mix-WP knowledge covers approximately 70% and 60% of the characters in the UDN 2001 and 2002 corpora, respectively.</Paragraph>
<Paragraph position="2"> We will continue to expand our collection of mix-WP knowledge using Web corpora. In other directions, we will try to improve our WP-based STW conversion with other types of WP data, such as NEVF and MWP (Tsai et al. 2002 and 2004), and with statistical language models, such as HMMs. We also plan to extend it to other areas of NLP, especially word segmentation and mix-WP identification from the word lattices of Chinese speech recognition systems.</Paragraph>
</Section>
</Paper>