<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1105"> <Title>An Enhanced Model for Chinese Word Segmentation and Part-of-speech Tagging</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Generally, Chinese Lexical Analysis consists of two phases: word segmentation and part-of-speech (POS) tagging. Rule-based and statistics-based approaches are the two dominant approaches in natural language processing, and in Chinese Lexical Analysis as well. This paper focuses only on the latter; hence, our model is a probabilistic model.</Paragraph> <Paragraph position="1"> Reviewing previous research in this field, we found two points at which the performance of a Chinese word segmentation and POS tagging system could be improved. One concerns the system architecture, and the other comes from Machine Learning theory.</Paragraph> <Paragraph position="2"> First, the traditional approach to Chinese Lexical Analysis simply treats word segmentation and POS tagging as two separate phases, each with its own algorithms and models.</Paragraph> <Paragraph position="3"> Dividing the whole process into two independent parts can lower the complexity of the system design, but it also decreases performance, because the two are fully integrated when a human processes a sentence. Fortunately, many researchers have already noticed this, and recent projects pay more attention to integrating word segmentation and POS tagging. Examples include the pseudo trigram integrated model of [Gao Shan, Zhang Yan. 2001], the analyzer of [Fu Guohong et al. 2001], which incorporates backward Dynamic Programming and the A* algorithm, the 'Divide and Conquer integration' of [Sun Maosong, et al. 2003], and the hierarchical hidden Markov model of [Zhang Huaping, et al. 2003].
The experiments reported in these papers also showed the great potential of integrated models.</Paragraph> <Paragraph position="4"> Besides the system architecture, a second point deserves attention. A probabilistic model of word segmentation and POS tagging can be regarded as an instance of Machine Learning. In Machine Learning, feature extraction is the most important aspect, far more important than the learning algorithm. In current models, the features used for Chinese Lexical Analysis seem a little too simple. Most of them take tag sequences or word frequencies as the distinguishing features and ignore other useful information that Chinese itself provides.</Paragraph> <Paragraph position="5"> In this paper, we present an enhanced but not overly complex model for word segmentation and POS tagging, which not only inherits the merits of an integrated model but also takes a new feature (word length) into account.</Paragraph> <Paragraph position="6"> The second part of this paper describes the model, including its input, output, and assumptions. The third part briefly discusses the model with respect to issues such as data sparseness and Named Entity Recognition. The final part reports the results of our experiments.</Paragraph> </Section> </Paper>