<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1724"> <Title>Integrating Ngram Model and Case-based Learning For Chinese Word Segmentation</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> After about two decades of studies of Chinese word segmentation, ICWSB-1 (henceforth, the bakeoff) is the first effort to put different approaches and systems to the test and comparison on common datasets. We participated in the bakeoff with a segmentation system designed to integrate a general-purpose ngram model for probabilistic segmentation with a case- or example-based learning approach (Kit et al., 2002) for disambiguation.</Paragraph> <Paragraph position="1"> The ngram model, with words extracted from training corpora, is trained with the EM algorithm (Dempster et al., 1977) on unsegmented training corpora. It was originally developed to enhance word segmentation accuracy so as to facilitate Chinese-English word alignment for our ongoing EBMT project, where only unsegmented texts are available for training. It is expected to be robust enough to handle novel texts, independent of any segmented texts for training. To simplify the EM training, we used the unigram model for the bakeoff and relied on the Viterbi algorithm (Viterbi, 1967) to find the most probable segmentation, instead of attempting to exhaust all possible segmentations of each sentence for a complicated full version of EM training.</Paragraph> <Paragraph position="2"> The case-based learning works in a straightforward way. It first extracts case-based knowledge, as a set of context-dependent transformation rules, from the segmented training corpus, and then applies them to ambiguous strings in a test corpus in terms of the similarity of their contexts.
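The unigram-plus-Viterbi segmentation described in the paragraph above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the lexicon and its probabilities here are a hypothetical toy example, whereas the actual model is EM-trained from unsegmented corpora.

```python
import math

# Hypothetical toy unigram probabilities; the paper's real probabilities
# are estimated with EM from unsegmented training corpora.
UNIGRAM_P = {"ab": 0.4, "a": 0.3, "b": 0.2, "c": 0.1}
MAX_WORD_LEN = max(len(w) for w in UNIGRAM_P)

def viterbi_segment(text):
    """Return the most probable segmentation under a unigram model,
    using log-space dynamic programming (Viterbi)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of last word
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            w = text[j:i]
            if w in UNIGRAM_P:
                score = best[j] + math.log(UNIGRAM_P[w])
                if score > best[i]:
                    best[i], back[i] = score, j
    # Recover the word sequence by following back-pointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

print(viterbi_segment("abc"))  # -> ['ab', 'c']
```

The dynamic program considers, for each position, every dictionary word ending there, so the whole sentence is segmented in one left-to-right pass instead of enumerating all possible segmentations.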
The similarity is computed empirically in terms of the length of the relevant common affixes of the context strings.</Paragraph> <Paragraph position="3"> The effectiveness of this integrated approach is verified by its outstanding performance on IV word identification. Its IV recall rate, ranging from 96% to 98%, ranks at or next to the top in all the closed tests in which we participated. Unfortunately, its overall performance is not sustainable at the same level, due to the lack of a module for OOV word detection.</Paragraph> <Paragraph position="4"> This paper presents the implementation of the system and analyzes its performance and problems, aiming to explore directions for further improvement. The remaining sections are organized as follows. Section 2 presents the ngram model and its training with the EM algorithm, and Section 3 presents the case-based learning for disambiguation. The overall architecture of our system is given in Section 4, and its performance and problems are analyzed in Section 5. Section 6 concludes the paper and outlines future work.</Paragraph> </Section> </Paper>