<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1723"> <Title>A two-stage statistical word segmentation system for Chinese</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Word segmentation, which aims to recognize the implicit word boundaries in Chinese text, is an essential step in Chinese language processing. Over the past decades, great success has been achieved in Chinese word segmentation (Nie et al., 1995; Yao, 1997; Fu and Wang, 1999; Wang et al., 2000; Zhang et al., 2002). However, two difficult problems remain in developing a practical segmentation system for large open applications: ambiguity resolution and the identification of unknown words (so-called OOV words).</Paragraph> <Paragraph position="1"> In this paper, we present a two-stage statistical word segmentation system for Chinese. In the first stage, we employ a word bigram model to segment the known words (viz. the words included in the system dictionary) in the input. In the second stage, we develop a hybrid algorithm for unknown word identification that incorporates word contextual information, word-formation patterns and a word juncture model.</Paragraph> <Paragraph position="2"> The rest of this paper is organized as follows: Section 2 presents a word bigram solution for known word segmentation. Section 3 describes a hybrid approach for unknown word identification.</Paragraph> <Paragraph position="3"> In Section 4, we report the results of our system at the SIGHAN evaluation program, and in the final section we give our conclusions on this work.</Paragraph> </Section> </Paper>