Design of Chinese Morphological Analyzer 
 
Huihsin Tseng 
Institute of Information Science 
Academia Sinica, Taipei 
kaori@hp.iis.sinica.edu.tw 
Keh-Jiann Chen 
Institute of Information Science 
Academia Sinica, Taipei 
kchen@iis.sinica.edu.tw 
 
Abstract 
This pilot study presents the design of a 
Chinese morphological analyzer that is able 
to predict the syntactic and semantic properties of 
nominal, verbal and adjectival compounds. 
The morphological structure of a compound word 
carries essential information about its 
syntactic and semantic characteristics. In 
particular, morphological analysis is a primary 
step for predicting the syntactic and semantic 
categories of out-of-vocabulary (unknown) words. 
The analyzer performs three major functions: 1) to 
segment a word into a sequence of morphemes, 2) to 
tag the part-of-speech of those morphemes, and 3) to 
identify the morpho-syntactic relation between 
morphemes. We propose a method that uses 
associative strength among morphemes, 
morpho-syntactic patterns, and syntactic 
categories to resolve ambiguities in 
segmentation and part-of-speech tagging. In our 
evaluation, the accuracy of the analyzer is 81%; 
the remaining 19% consists of segmentation errors 
(5%) and part-of-speech errors (14%). Once the 
internal structure of a compound is known, it can 
support further research on predicting the 
meaning and function of a word. 
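
The first two functions listed above can be illustrated with a minimal sketch. It is not the analyzer described in this paper: the toy lexicon, the coarse tags, and the function names `segmentations` and `tag_candidates` are all assumptions made for illustration. The sketch only enumerates candidate segmentations and candidate tags; it does not rank them by associative strength or morpho-syntactic patterns.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: morpheme -> candidate part-of-speech tags.
# The tags are coarse placeholders, not the tag set used by the authors.
LEXICON: Dict[str, List[str]] = {
    "電": ["N"],
    "腦": ["N"],
    "電腦": ["N"],
    "桌": ["N"],
}


def segmentations(word: str, lexicon: Dict[str, List[str]]) -> List[List[str]]:
    """Return every way to split `word` into morphemes found in `lexicon`."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    results: List[List[str]] = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for rest in segmentations(word[end:], lexicon):
                results.append([head] + rest)
    return results


def tag_candidates(
    segmentation: List[str], lexicon: Dict[str, List[str]]
) -> List[Tuple[str, List[str]]]:
    """Pair each morpheme with all of its candidate tags (no disambiguation)."""
    return [(morpheme, lexicon[morpheme]) for morpheme in segmentation]


for seg in segmentations("電腦桌", LEXICON):
    print(tag_candidates(seg, LEXICON))
```

With this toy lexicon the word 電腦桌 has two candidate segmentations (電/腦/桌 and 電腦/桌), which is the kind of ambiguity a disambiguation step would have to resolve.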
 
1. Introduction 
This is the first attempt to design a morphological 
analyzer that automatically analyzes the 
morphological structures of Chinese compound 
words (here, compound words include both compounds 
in traditional Chinese linguistics and 
morphologically complex words). The morphological 
structure of a compound word carries essential 
information about its syntactic and semantic 
characteristics. In particular, morphological 
analysis is a primary step for predicting the 
syntactic and semantic categories of 
out-of-vocabulary (unknown) words. The 
existence of unknown words is a major obstacle in 
Chinese natural language processing. Because 
new words are easily coined from 
morphemes in Chinese text, the number of 
unknown words keeps growing. As a result, 
we cannot collect all the unknown words and 
manually mark their syntactic categories and 
meanings. Our approach to predicting the category 
and the meaning of a word rests on 
Frege’s principle: “The meaning of the whole is a 
function of the meanings of the parts”. That is, the 
meanings of the morphemes are assumed to make up 
the meaning of the word. However, some words, 
such as idioms and proper nouns, fall outside 
this principle. In general, unknown words can 
be divided into two types: those that 
have semantic transparency, i.e. 
words whose meanings can be derived from their 
morphemes, and those without semantic 
transparency, such as proper nouns. In this paper 
we deal only with compound words that have 
semantic transparency. For 
compounds without semantic transparency, such 
as proper nouns, the morphemes and 
morphological structures do not provide useful 
information for predicting their syntactic and 
semantic categories. Therefore they are processed 
differently and independently. In addition, some 
regular types of compounds, such as numbers, 
dates, and determinant-measure compounds, are 
easily analyzed by matching their morphological 
structures against regular expression grammars, 
and the result can be used to predict their syntactic 
and semantic properties. They are therefore handled 
by regular expression matching at the stage of 
word segmentation. According to our observation, 
most Chinese compounds other than proper nouns 
have semantic transparency, which means that 
the meaning of an unknown word can be 
interpreted from its morpheme components. 
The design of our morphological analyzer therefore 
focuses on these compounds; words 
without semantic transparency are excluded. It 
takes a compound word as input and produces the 
morphological structure of the word. The major 
functions are 1) to segment a word into a sequence 
of morphemes, 2) to tag the part-of-speech of 
those morphemes, and 3) to identify the 
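
The regular compound types mentioned in the introduction (numbers, dates, and determinant-measure compounds) lend themselves to a small illustration of regular expression matching. The sketch below is only an assumption about what such patterns might look like: the character inventories, the labels, and the function name `classify_regular` are hypothetical and far smaller than anything a real system would use.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny character inventories (assumptions).
NUM = "[零〇一二三四五六七八九十百千萬億兩]"
MEASURE = "[個本張隻條位]"
DET = "[這那每]"

# Ordered from most to least specific; the first full match wins.
PATTERNS = [
    # One or more "number + date unit" groups, e.g. year then month.
    ("date", re.compile(f"(?:{NUM}+[年月日號])+")),
    # A determinant and/or a number, followed by a measure word.
    ("determinant-measure", re.compile(f"(?:{DET}{NUM}*|{NUM}+){MEASURE}")),
    # A bare sequence of numeral characters.
    ("number", re.compile(f"{NUM}+")),
]


def classify_regular(word: str) -> Optional[str]:
    """Return the label of the first pattern matching the whole word, else None."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(word):
            return label
    return None


for w in ["三百五十", "二〇〇三年五月", "這三本", "電腦"]:
    print(w, classify_regular(w))
```

Words that match none of the patterns (such as 電腦 here) return `None` and would be left to other processing.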

References

Bosch, Antal van den, Walter Daelemans and Ton
Weijters. (1996) Morphological Analysis
Classification: an Inductive-Learning Approach.
NeMLaP.

Chao, Yuen Ren. (1968) A Grammar of Spoken Chinese.
Berkeley: University of California Press.

Chen, Chao-jan, Ming-hung Bai and Keh-jiann Chen.
(1997) Category Guessing for Chinese Unknown
Words. Proceedings of the Natural Language
Processing Pacific Rim Symposium 1997, 35-40.

Chen, Yun-chai. (2001) Corpus Analysis of
Reduplication in Mandarin Chinese. National
Kaohsiung Normal University: English Department.

CKIP. (1993) Technical Report no. 93-05: The analysis
of Chinese category. CKIP: Nankang.

Creutz, Mathias and Krista Lagus. (2002)
Unsupervised Discovery of Morphemes. Proceedings
of Morphological and Phonological Learning
Workshop of ACL'02.

Beaney, Michael (editor). (1997) The Frege Reader.
Oxford: Blackwell.

Li, Charles and Sandra A. Thompson. (1981) Mandarin
Chinese. Berkeley: University of California Press.

Ma, Weiyun, Youming Hsieh, Changhua Yang, and
Keh-jiann Chen. (2001) Chinese Corpus
Development and Management System.
Proceedings of Research
on Computational Linguistics Conference XIV,
175-191.
