<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1725"> <Title>A Unicode based Adaptive Segmentor</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The most difficult problem in Chinese word segmentation arises from overlapping ambiguities [12]. The recognition of names, foreign names, and organizations is also quite specific to Chinese. Some systems can already achieve very high accuracy [3], but they rely heavily on manual work to train the system for a particular language environment. For many applications, however, we need to weigh the cost of achieving high accuracy.</Paragraph> <Paragraph position="1"> In a competitive environment, we also need systems that can adapt quickly to new requirements with limited resources.</Paragraph> <Paragraph position="2"> In this paper, we report on a Unicode-based Chinese word segmentor. The segmentor can handle Chinese text in Simplified, Traditional, or mixed mode, and internally it needs only one dictionary. The system uses a divide-and-conquer strategy to recognize personal names, numbers, time expressions, and numerical values. The system has a built-in new-word extractor that can extract new words from running text, which saves training time and lets the system adapt quickly to a new language environment. The Bakeoff results for our system in the open test, across all categories, show that it works reasonably well on all the different corpora.</Paragraph> <Paragraph position="3"> The rest of the paper is organized as follows.</Paragraph> <Paragraph position="4"> Section 2 presents our system design objectives and components. Section 3 discusses implementation details. Section 4 presents performance evaluations. Section 5 concludes.</Paragraph> </Section> </Paper>