<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3026"> <Title>Description of the HKU Chinese Word Segmentation System for Sighan Bakeoff 2005</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Word segmentation, which aims to recognize the implicit word boundaries in plain Chinese text, is an essential step in Chinese text processing. Over the past decades, great progress has been made in Chinese word segmentation technology. However, two difficulties remain in developing a practical segmentation system for large open applications: resolving segmentation ambiguities and identifying unknown, or out-of-vocabulary (OOV), words.</Paragraph> <Paragraph position="1"> To address these two problems, we developed a purely statistical Chinese word segmentation system that uses a two-stage strategy.</Paragraph> <Paragraph position="2"> We participated in eight tracks at the Second International Chinese Word Segmentation Bakeoff, sponsored by ACL-SIGHAN, and tested our system on different test corpora.</Paragraph> <Paragraph position="3"> The scored results show that our system handles most segmentation ambiguities and unknown words in Chinese text effectively. In this paper, we summarize this work and analyze the results.</Paragraph> <Paragraph position="4"> The rest of this paper is organized as follows. Section 2 briefly describes the two-stage strategy for Chinese word segmentation. Section 3 details the configuration of our system for the different test tracks, particularly the training data and the dictionaries used.
Finally, Section 4 reports the results of our system at this bakeoff, and Section 5 gives our conclusions.</Paragraph> </Section> </Paper>