Automatic Semantic Classification for Chinese Unknown Compound Nouns
Keh-Jiann Chen & Chao-jan Chen
Institute of Information Science, Academia Sinica, Taipei
Abstract 
This paper describes a similarity-based model to present the morphological rules for Chinese compound nouns. This representation model serves three functions: 1) as the morphological rules of the compounds, 2) as a means to evaluate the properness of a compound construction, and 3) as a means to disambiguate the semantic ambiguity of the morphological head of a compound noun. An automatic semantic classification system for Chinese unknown compounds is thus implemented based on the model. Experiments and error analyses are also presented.
1. Introduction 
The occurrence of unknown words causes difficulties in natural language processing. The word set of a natural language is open-ended. There is no way of collecting every word of a language, since new words will be created for expressing new concepts and new inventions. Therefore how to identify new words in a text is the most challenging task for natural language processing. This is especially true for Chinese. Each Chinese morpheme (usually a single character) carries meanings and most are polysemous. New words are easily constructed by combining morphemes, and their meanings are the semantic composition of the morpheme components. Of course there are exceptions, namely semantically non-compositional compounds. In Chinese text, there are no blanks to mark word boundaries and no inflectional markers or capitalization markers to denote the syntactic or semantic types of new words. Hence unknown word identification for Chinese has become one of the most difficult and demanding research topics.
The syntactic and semantic categories of unknown words can in principle be determined by their content and contextual information. However, many difficult problems have to be solved. First of all, it is not possible to find a uniform representational schema and categorization algorithm to handle different types of unknown words, since each type of unknown word has a very different morpho-syntactic structure. Second, the clues for identifying different types of unknown words are also different. For instance, identification of Chinese names relies heavily on the surnames, which are a limited set of characters. Statistical methods are commonly used for identifying proper names (Chang et al. 1994, Sun et al. 1994). The identification of general compounds relies more on the morphemes and the semantic relations between morphemes. There are co-occurrence restrictions between morphemes of compounds, but their relations are irregular and mostly due to common sense knowledge. The third difficulty is the problem of ambiguities, such as structural ambiguities, syntactic ambiguities and semantic ambiguities. For instance, a morpheme character/word usually has multiple meanings and syntactic categories. Therefore ambiguity resolution has become one of the major tasks.
Compound nouns are the most frequently occurring unknown words in Chinese text. According to an inspection of the Sinica corpus (Chen etc. 1996), 3.51% of the word tokens in the corpus are unknown, i.e. they are not listed in the CKIP lexicon, which contains about 80,000 entries. Among them, about 51% of the word types are compound nouns, 34% are compound verbs and 15% are proper names. In this paper we focus our attention on the identification of the compound nouns. We propose a representation model, which is used to identify, to disambiguate and to evaluate the structure of a compound noun. In fact this model can be extended to handle compound verbs also.
1.1 General properties of compounds and 
their identification strategy 
The semantic category and syntactic category of a word are closely related, especially for coarse-grained analysis. For instance, nouns denote entities; active verbs denote events and stative verbs denote states. For fine-grained analysis, syntactic and semantic classifications take different classification criteria. In our model the coarse-grained analysis is processed first. The syntactic categories of an unknown word are predicted first and the possible semantic categories are then identified according to its top ranked syntactic categories. Different syntactic categories require different representational models and different fine-grained semantic classification methods.
The presupposition of automatic semantic classification for compounds is that the meaning of a compound is the semantic composition of its morphemic components and that the head morpheme determines the major semantic class of the compound. There are many polysyllabic words for which the property of semantic composition does not hold, for instance transliterated words; those words should be listed in the lexicon. Since the presupposition holds for the majority of compounds, the design of our semantic classification algorithm is based upon it. Therefore the process of identifying the semantic class of a compound boils down to finding the head morpheme and determining its semantic class. However, ambiguous morphological structures cause difficulties in finding the head morpheme. For instance, the compound in 1a) has two possible morphological structures, 1b) and 1c), but only 1b) is the right interpretation.
1 a) 美國人 'American'
  b) 美國 人 'America' 'people'
  c) 美 國人 'beautiful' 'country-man'
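The structural ambiguity in example 1 can be illustrated with a minimal Python sketch that enumerates every way of splitting a compound into known morphemes. The function name `segmentations` and the toy lexicon are assumptions made for illustration only; the lexicon entries are chosen to match the glosses of example 1 and are not taken from the CKIP lexicon.

```python
def segmentations(word, lexicon):
    """Return every split of `word` into morphemes that are all in `lexicon`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            # Extend each segmentation of the remainder with this prefix.
            for rest in segmentations(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (hypothetical): morpheme -> gloss.
LEXICON = {"美": "beautiful", "美國": "America", "人": "people", "國人": "country-man"}

for seg in segmentations("美國人", LEXICON):
    print(" + ".join(f"{m} '{LEXICON[m]}'" for m in seg))
```

Both candidate structures are produced; choosing between them is the disambiguation problem discussed in the text.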
Once the morphological head is determined, the semantic resolution for the head morpheme is the next difficulty to be solved. About 51.5% of the 200 most productive morphemes are polysemous and, according to the Collocation Dictionary of Noun and Measure Words (CDNM), each ambiguous morpheme carries 3.5 different senses on average (Huang et al. 1997).
2. Representation Models 
Compounds are a very productive type of unknown word. Nominal and verbal compounds are easily coined by combining two or more words/characters. Since there are more than 5000 commonly used Chinese characters, each with idiosyncratic syntactic behavior, it is hard to derive a set of morphological rules that generates the set of Chinese compounds without over-generation or under-generation. The set of general compounds is an open class. The strategy for automatic identification therefore relies not only on the morpho-syntactic structures but also on morpho-semantic relations. In general, certain interpretable semantic relationships between morphemic components must hold. However, there is no absolute means to judge whether the semantic relations between morphemic components are acceptable, i.e. the acceptability of such compounds is not simply 'yes' or 'no'. The degree of properness of a compound should depend on the logical relation between its morphemic components, and their logicalness should be judged by common sense knowledge. It is almost impossible to implement a system with common sense knowledge. Chen & Chen (1998) proposed an example-based measurement to evaluate the properness of a newly coined compound instead. They postulate that for a newly coined compound, if the semantic relation of its morphemic components is similar to that of existing compounds, then it is more likely that the newly coined compound is proper.
2.1 Example-based similarity measure
Suppose that a compound has the structure XY, where X and Y are morphemes, and suppose without loss of generality that Y is the head. For instance, 學字機 'learn-word-machine' is a noun compound in which the head morpheme Y is 機 'machine' and the modifier X is 學字 'learn-word'. In fact the morpheme 機 has four different meanings: 'machine', 'airplane', 'secret' and 'opportunity'. How does a computer judge which one is the right meaning, and whether the compound construction is well-formed or logically meaningful? First of all, the examples with the head morpheme 機 are extracted from corpora and dictionaries. The examples are classified according to their meaning as shown in Table 1.
Sense             Semantic category   Examples
(1) machine       Bo011               (not legible)
(2) airplane      Bo223               (not legible)
(3) opportunity   Ca042               (not legible)
(4) secret        Da011               (not legible)
Table 1. Four senses of the morpheme 機 and their respective samples
The meaning of 學字機 is then determined by comparing the similarity between it and each class of examples. The input unknown word is assigned the meaning of the class whose morpho-semantic structures are most similar to it. The similarity measure is based on the following formula.
Suppose that each class of examples forms the following semantic relation rules. The rules show the possible semantic relations between the prefix and the suffix Y, and their weights in terms of the frequency distribution of each semantic category of the prefixes in the class.
Rules: Sem1 + Y   Freq1
       Sem2 + Y   Freq2
       :
       Semk + Y   Freqk
(Freqi: the number of the words of the form Semi + Y)
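As a minimal sketch of how such a rule could be assembled from examples, the Python snippet below counts how often each modifier category occurs among the example compounds of one head sense. The function name `derive_rule`, the placeholder compound names and the choice to count every listed category of a modifier once are assumptions made for illustration, not details given in the paper.

```python
from collections import Counter


def derive_rule(examples):
    """Build a rule {Sem_i: Freq_i} from (compound, modifier_categories) pairs."""
    rule = Counter()
    for _compound, categories in examples:
        for category in categories:
            rule[category] += 1  # one more word of the form Sem_i + Y
    return rule


# Hypothetical examples for one sense of a head morpheme Y.
# Compound names are placeholders; codes follow the Chilin style.
EXAMPLES = [
    ("X1+Y", ["Hh031"]),
    ("X2+Y", ["Hh031", "Ao173"]),
    ("X3+Y", ["He032"]),
]

RULE = derive_rule(EXAMPLES)
for sem, freq in RULE.most_common():
    print(f"{sem} + Y   {freq}")
```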
Take the suffix 機 with the meaning 'machine' as an example. For the morpheme 機 'machine', the extracted compounds of the form X+機 'machine' and the semantic categories of the modifiers are shown in Table 2, and the morphological rule derived from them is in Table 3. The semantic types and their hierarchical structure are adopted from the Chilin (Mei et al. 1984). The similarity is measured between the semantic class of the prefix X of the unknown compound and the prefix semantic types shown in the rule. One of the measurements proposed is:
$$\mathrm{SIMILAR}(Sem, Rule) = \frac{\sum_{i=1}^{k} \text{Information-Load}(Sem \cap Sem_i) \times Freq_i}{\text{Max-value}}$$

where Sem is the semantic class of X. Max-value is the maximal value of $\sum_{i=1}^{k} \text{Information-Load}(S \cap Sem_i) \times Freq_i$ over all semantic classes S. Max-value normalizes the SIMILAR value to the range 0-1. $S \cap Sem_i$ denotes the least common ancestor of S and $Sem_i$. For instance, (Hh03 ∩ Hb06) = H. The Information-Load(S) of a semantic class S is defined as Entropy(semantic system) - Entropy(S). Simply speaking, it is the amount of entropy reduced after S is seen.

$$\mathrm{Entropy}(S) = \sum_{i=1}^{k} -P(Sem_i \mid S) \log P(Sem_i \mid S)$$

where $\{Sem_1, Sem_2, \ldots, Sem_k\}$ is the set of the bottom level semantic classes contained in S.
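The definition of Information-Load can be sketched in Python as follows. This is only an illustration under stated assumptions: the probabilities P(Sem_i | S) are estimated from word counts per bottom-level class, the logarithm is taken to base 2, a class is identified by a prefix of the category code, and the counts in `LEAF_COUNTS` are invented. None of these choices is specified in the text.

```python
import math


def entropy(counts):
    """Entropy (base 2) of the distribution given by a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_load(node, leaf_counts):
    """Entropy(semantic system) - Entropy(node).

    `node` is a code prefix; the bottom-level classes contained in it are
    the keys of `leaf_counts` that start with that prefix.
    """
    system_entropy = entropy(list(leaf_counts.values()))
    inside = [c for leaf, c in leaf_counts.items() if leaf.startswith(node)]
    return system_entropy - entropy(inside)


# Hypothetical word counts per bottom-level class.
LEAF_COUNTS = {"Hg11": 10, "Hg19": 10, "Hh03": 10, "Ao17": 10}

for node in ["", "H", "Hg", "Hg11"]:
    print(repr(node), round(information_load(node, LEAF_COUNTS), 3))
```

In this toy taxonomy the load grows as the class becomes more specific, which is the property the similarity measure relies on: a deeper least common ancestor contributes more to the score.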
[Table 2 body not recoverable: it lists the extracted compounds of the form X+機 'machine', each followed by the Chilin category codes of its modifier X, e.g. Ao173, Hh031, He032, Eb342.]
Table 2. The semantic categories of modifiers of the compounds of X-機 'machine'
Take the word 學字機 'learning-word-machine' as an example. Table 3 shows the estimated similarity between the compound 學字機 and the extracted examples. The similarity value is also considered as the logical properness value of this compound. In this case it is 0.67, which can be interpreted as having 67% confidence that 學字機 'learning-word-machine' is a well-formed compound.
X-機    Semi     Freqi
        Hh031    3
        Ao173    1
        He032    2
        Eb342    2
        Ae162    1
        Hg191    1
Sem(學字) = Hg111
Information-Load(Hg111 ∩ Hh031) = Information-Load(H) = 2.231
Information-Load(Hg111 ∩ He032) = Information-Load(H) = 2.231
Information-Load(Hg111 ∩ Hg191) = Information-Load(Hg) = 5.912
Σ i=1,k Information-Load(Hg111 ∩ Semi) * Freqi
  = 2.231*3 + 2.231*2 + 5.912*1 + ......
  = 104.632
Max-Value1 = Σ i Information-Load(Hg031 ∩ Semi) * Freqi
  = 155.164
SIMILAR = (104.632 / 155.164) = 0.6743
Table 3. The derived morphological rule for the morpheme 'machine' and the similarity measure of 學字機 as a noun compound which denotes a kind of machine.
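The arithmetic of this example can be retraced with a short Python sketch. The level boundaries in `LEVELS` (one letter, a second letter, two digits, one further digit) are an assumption read off the shape of the category codes. Only the three terms spelled out in Table 3 are recomputed, because the remaining terms are elided in the source; the totals 104.632 and 155.164 are taken as reported.

```python
LEVELS = (1, 2, 4, 5)  # assumed code layout, e.g. H / Hg / Hg11 / Hg111


def lca(a, b):
    """Least common ancestor: longest shared prefix ending on a level boundary."""
    best = ""
    for cut in LEVELS:
        if len(a) >= cut and len(b) >= cut and a[:cut] == b[:cut]:
            best = a[:cut]
        else:
            break
    return best


# Information loads reported in Table 3.
INFO_LOAD = {"H": 2.231, "Hg": 5.912}

# The three rule entries whose terms are written out in Table 3.
PARTIAL_RULE = {"Hh031": 3, "He032": 2, "Hg191": 1}

SEM = "Hg111"  # semantic class of the modifier in the example
partial_sum = sum(INFO_LOAD[lca(SEM, sem_i)] * freq
                  for sem_i, freq in PARTIAL_RULE.items())
print(round(partial_sum, 3))          # the three explicit terms only

similar = 104.632 / 155.164           # reported numerator / reported Max-Value1
print(round(similar, 4))
```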
The above representation model serves several functions. First, it serves as the morphological rules of the compounds. Second, it serves as a means to implement the evaluation function. Third, it serves as a means to disambiguate the semantic ambiguity of the morphological head of a compound noun. For instance, there are four different 機, denoting 'machine', 'airplane', 'opportunity' and 'secret' respectively, and they are considered as four different morphemes. The example shows that 學字機 denotes a machine and not one of the other senses, since matching against the rules of 機 'machine' yields the highest evaluation score among them.
The above discussion shows the basic concept and the baseline of the example-based model. The above similarity measure is called the over-all-similarity measure, since it gives equal weight to the similarity values of the input compound with every member in the class. Another similarity measure, called maximal-similarity, is defined as follows. It takes the maximal value of the similarity between the input compound and every member in the class as the output.

$$\mathrm{SIM2}(Word, Rule) = \max_{i=1,\ldots,k} \frac{\text{Information-Load}(Sem \cap Sem_i)}{\text{Max-value2}}$$
Both similarity measures are reasonable and have their own advantages. The experimental results showed that the combination of these two measures achieved the best performance with the weights w1 = 0.3 and w2 = 0.7 (Chen & Chen 1998), i.e. SIM = SIM1 * w1 + SIM2 * w2, where w1 + w2 = 1. We adopt this measure in our experiments.
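A minimal Python sketch of the weighted combination is given below. The function names, the toy information loads, the frequencies and both normalizers are invented for illustration; only the weights w1 = 0.3 and w2 = 0.7 come from the text.

```python
def overall_similarity(loads, freqs, max_value1):
    """SIM1: frequency-weighted sum of information loads, normalized."""
    return sum(load * freq for load, freq in zip(loads, freqs)) / max_value1


def maximal_similarity(loads, max_value2):
    """SIM2: the single best information load, normalized."""
    return max(loads) / max_value2


def combined_similarity(sim1, sim2, w1=0.3, w2=0.7):
    """SIM = SIM1 * w1 + SIM2 * w2, with w1 + w2 = 1."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sim1 * w1 + sim2 * w2


# Hypothetical values of Information-Load(Sem ∩ Sem_i) and Freq_i for one rule.
LOADS = [2.0, 1.0, 0.5]
FREQS = [1, 2, 4]

SIM1 = overall_similarity(LOADS, FREQS, max_value1=12.0)  # (2 + 2 + 2) / 12
SIM2 = maximal_similarity(LOADS, max_value2=2.5)          # 2 / 2.5
print(round(combined_similarity(SIM1, SIM2), 3))
```

Giving SIM2 the larger weight lets one very close example dominate, while SIM1 still rewards agreement with the bulk of the class.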
The experiments also showed a strong correlation between the similarity scores and the human evaluation scores on the properness of the testing compounds. Compounds that humans considered bad also received low similarity scores from the computer.
3. System Implementation 
3.1 Knowledge sources 
To categorize unknown words, the computer system has to be equipped with linguistic and semantic knowledge about words, morphemes, and word formation rules. This knowledge is used to identify words, to categorize their semantic and syntactic classes, and to evaluate the properness of word formation and the confidence level of categorization. In our experiments, the available knowledge sources include:
1) CKIP lexicon: an 80,000 entry Chinese lexi- 
con with syntactic categories for each entry 
(CKIP 1993). 
2) Chilin: a thesaurus of synonym classes, which 
contains about 70,000 words distributed un- 
der 1428 semantic classes (Mei 1984). 
3) Sinica Corpus: a 5-million-word balanced
Chinese corpus, word-segmented and
part-of-speech tagged (Chen 1996).
4) the Collocation Dictionary of Noun and 
Measure Words (CDNM) : The CDNM lists 
collocating measure words for nouns. The 
nouns in this dictionary are arranged by their 
ending morpheme, i.e. head morpheme. There 
are 1910 noun ending morphemes and 12,352 
example nouns grouped according to their 
different senses. 
Each knowledge source provides partial data for representing morphological rules, which include lists of sample compounds, high frequency morphemes and their syntactic and semantic information. Unknown words and their frequencies can be extracted from the Sinica corpus. The extracted unknown words produce the testing data and the morpheme-category association strengths, which are used in the algorithm for syntactic category prediction for unknown words (Chen et al. 1997). The CKIP dictionary provides the syntactic categories for morphemes and words. The Chilin provides the semantic categories for morphemes and words. The CDNM provides the set of high frequency noun morphemes and the example compounds grouped according to each different sense. The semantic category for each sense is extracted from the Chilin and disambiguated manually. The sample compounds for each sense-differentiated morpheme extracted from CDNM form the base samples for the morphological rules. Additional samples are supplemented from the Chilin.
3.2 The algorithm for morphological analysis
The process of morphological analysis for compound words is very similar to the Chinese word segmentation process. It requires dictionary look-up for matching morphemes and resolution methods for the inherently ambiguous segmentations, such as the examples in 1). However, conventional word segmentation algorithms cannot be applied to morphological analysis without modification, since morpho-syntactic behavior is different from syntactic behavior. Since the structure of Chinese compound nouns is head-final and the most productive morphemes are monosyllabic, there is a simple and effective algorithm which agrees with these facts. This algorithm segments input compounds from left to right by the longest matching criterion (Chen & Liu 1992). It is clear that the left-to-right longest matching algorithm prefers structures with a shorter head and a longer modifier.
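A left-to-right longest matching segmenter of this kind can be sketched in Python as follows. This is a generic illustration rather than the implementation of Chen & Liu (1992); the function name, the toy lexicon and the fallback of emitting a single character when nothing matches are assumptions.

```python
def longest_match_segment(word, lexicon):
    """Segment `word` left to right, always taking the longest lexicon match."""
    morphemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                break
        else:
            j = i + 1  # no match: fall back to a single character
        morphemes.append(word[i:j])
        i = j
    return morphemes


# Toy lexicon (hypothetical), matching the glosses of example 1.
LEXICON = {"美", "美國", "人", "國人"}

parts = longest_match_segment("美國人", LEXICON)
print(parts, "head =", parts[-1])
```

Because the longest match is consumed from the left, the modifier absorbs as many characters as possible and the remaining head stays short, which fits the head-final, monosyllabic-head pattern described above.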
3.3 Semantic categories of morphemes
The semantic categories of morphemes follow the thesaurus Chilin. This thesaurus is a lattice structure of concept taxonomy. Morphemes/words may have multiple classifications due to either ambiguous classification or inherent semantic ambiguities. For the ambiguous semantic categories of a morpheme, the lower ranking semantic categories are eliminated, leaving the higher-ranking semantic categories to compete during the identification process. For instance, in Table 2 only the major categories of each example are shown. Since the majority of morphemes are unambiguous, they compensate for the uncertainty caused by the semantically ambiguous morphemes. The rank of a semantic category of a morpheme depends on the order in which the morpheme occurs in its synonym group, since the Chilin entries are arranged in this way. In addition, due to the limited coverage of Chilin, many of the morphemes are not listed. For the unlisted morphemes, we recursively apply the current algorithm to predict their semantic categories.
4. Semantic Classification and Ambiguity Resolution for Compound Nouns
The demand for a semantic classification system for compound nouns was first raised when the task of semantic tagging for a Chinese corpus was attempted. The Sinica corpus is a 5-million-word Chinese corpus with part-of-speech tagging. In this corpus there are 47,777 word types tagged as common nouns and only 12,536 of them are listed in the Chilin. They account for only 26.23%. In other words, the semantic categories of most of the common nouns are unknown. They will be the target of automatic semantic classification.
4.1 Derivation of morphological rules 
A list of the most productive morphemes is first generated from the unknown words extracted from the Sinica corpus. The morphological rules of the set of the most productive head morphemes are derived from their examples. Both the CDNM and Chilin provide some examples. So far there are 1910 head morphemes for compound nouns with examples in the system, and the number is increasing. They are all monosyllabic morphemes. Among the top 200 most productive morphemes, 51.5% are polysemous and each has 3.5 different meanings on average. The current 1910 morphemes cover about 71% of the unknown noun compounds of the testing data. The remaining 29% uncovered noun morphemes are either polysyllabic morphemes or low frequency morphemes.
4.2 Semantic classification algorithm 
The unknown compound nouns extracted from the Sinica corpus were classified according to the morphological representation by the similarity-based algorithm. The problems of semantic ambiguities and out-of-coverage morphemes were the two major difficulties to be solved during the classification stage. The complete semantic classification algorithm is as follows:
1) For each input noun compound, apply the morphological analysis algorithm to derive the morphemic components of the input compound.
2) Determine the head morpheme and modifiers. The default head morpheme is the last morpheme of a compound.
3) Get the syntactic and semantic categories of the modifiers. If a modifier is also an unknown word, then apply this algorithm recursively to identify its semantic category.
4) For a head morpheme with representational rules, apply the similarity measure for each possible semantic class and output the semantic class with the highest similarity value.
5) If the head morpheme is not covered by the morphological rules, search for its semantic class in the Chilin. If its semantic class is not listed in the Chilin, then no answer can be found. If it is polysemous, then the top ranked classes will be the output.
In step 1, the algorithm resolves the possible ambiguities of the morphological structures of the input compound. In step 3, the semantic categories of the modifier are determined. There are some complications. The first complication is that the modifier may have multiple semantic categories. In our current process, the categories of lower ranking order are eliminated. The remaining categories are processed independently. One of the semantic categories of the modifier, paired with one of the rules of the head morpheme, will achieve the maximal similarity value. Step 4 thus resolves the semantic ambiguities of both the head and the modifier. However, only the category of the head is our target of resolution. The second complication is that the modifier may also be unknown. If it is not listed in the Chilin, there is no way of knowing its semantic categories from the currently available resources. In that case, at step 4, the prediction of the semantic category of the input compound depends solely on the information about its head morpheme. If the head morpheme is unambiguous, then the category of the head morpheme is output as the prediction. Otherwise, the semantic category of the top ranked sense of the head morpheme is output. Step 5 handles the exceptional cases, i.e. head morphemes with no representational rule.
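The five steps can be sketched in Python as below. This is an illustrative skeleton under stated assumptions, not the authors' system: the segmenter, the modifier category look-up (which in the paper may recurse on unknown modifiers), the rule table, the Chilin look-up and the similarity function are all passed in as parameters, and the toy data, sense labels and `toy_similarity` are invented.

```python
def classify(compound, segment, categories_of, rules, chilin, similarity):
    """Predict the semantic class of `compound`, or return None.

    segment(word)         -> list of morphemes                    (step 1)
    categories_of(m)      -> ranked category codes, may be empty  (step 3)
    rules[head]           -> dict {sense: rule}, top-ranked sense first
    chilin[head]          -> ranked classes for heads without rules
    similarity(cat, rule) -> score in [0, 1]
    """
    morphemes = segment(compound)
    head, modifiers = morphemes[-1], morphemes[:-1]      # step 2: head-final
    cats = [c for m in modifiers for c in categories_of(m)]
    if head in rules:                                    # step 4
        senses = rules[head]
        if not cats:                  # modifier categories unknown:
            return next(iter(senses))  # fall back to the top-ranked sense
        return max(senses,
                   key=lambda s: max(similarity(c, senses[s]) for c in cats))
    ranked = chilin.get(head, [])                        # step 5
    return ranked[0] if ranked else None


def toy_similarity(cat, rule):
    """Stand-in score: share of rule frequency in the same major class."""
    total = sum(rule.values())
    return sum(f for sem, f in rule.items() if sem[0] == cat[0]) / total


# Hypothetical data for illustration only.
RULES = {"機": {"machine": {"Hh031": 3, "Hg191": 1, "Ao173": 1},
               "airplane": {"Di021": 2, "Hd011": 1}}}
CHILIN = {"人": ["Aa01"]}
MODIFIER_CATS = {"學字": ["Hg111"]}


def split_last(word):
    """Toy segmenter: everything but the last character is the modifier."""
    return [word[:-1], word[-1]]


def categories_of(morpheme):
    return MODIFIER_CATS.get(morpheme, [])


print(classify("學字機", split_last, categories_of, RULES, CHILIN, toy_similarity))
```

Passing the knowledge sources in as parameters mirrors the declarative character of the model: examples and rules can change without touching the procedure.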
4.3 Experimental results 
The system classifies the set of unknown common nouns extracted from the Sinica corpus. We randomly picked two hundred samples from the output for performance evaluation by examining the semantic classification manually. The correct rate for semantic classification is 84% and 81% for the first hundred samples and the second hundred samples respectively. We further classify the errors into different types. The first type is caused by selection errors while disambiguating the polysemous head morphemes. The second type is caused by the fact that the meanings of some compounds are not the semantic composition of the meanings of their morphological components. The third type of error is caused by the fact that a few compounds are conjunctive structures rather than the head-modifier structure assumed by the system. The fourth type of error is caused by head-initial constructions. Other than the classification errors, there are 10 unidentifiable compounds, 4 and 6 in the respective sets, because their head morphemes are listed neither in the system nor in the Chilin. Among the 190 identifiable head morphemes, 142 are covered by the morphological rules encoded in the system and 80 of them have multiple semantic categories. The semantic categories of the remaining 48 head morphemes were found in the Chilin. If the 15 type 1 selection errors are all caused by the 80 morphemes with multiple semantic categories, then the correct rate of semantic disambiguation by our similarity-based measure is (80-15)/80 = 81%.
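The reported rates can be re-derived from the counts in Table 5 with a few lines of Python. Treating a sample as correct when it is neither an error nor unidentified is an assumption about how the figures were obtained; it does reproduce the reported 84% and 81%.

```python
def correct_rate(total, errors, unidentified):
    """Share of samples that are neither misclassified nor unidentified."""
    return (total - errors - unidentified) / total


rate1 = correct_rate(total=100, errors=12, unidentified=4)
rate2 = correct_rate(total=100, errors=13, unidentified=6)

# Disambiguation rate: 80 polysemous heads, 8 + 7 type 1 selection errors.
type1_errors = 8 + 7
disambiguation_rate = (80 - type1_errors) / 80

print(f"{rate1:.0%} {rate2:.0%} {disambiguation_rate:.2%}")
```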
                                      Testing data 1   Testing data 2
Total                                      100              100
Errors                                      12               13
  type 1 (semantic selection error)          8                7
  type 2 (non-compositional)                 2                5
  type 3 (conjunction)                       1                0
  type 4 (head-initial)                      1                1
Unidentified                                 4                6
Table 5. The performance evaluations of the semantic classification algorithm
5. Further Remarks and Conclusions
In general, if an unknown word is extracted from corpora, neither its syntactic nor its semantic categories are known. The syntactic categories are predicted first according to its prefix-category and suffix-category associations, as described in (Chen et al. 1997). According to the top ranked syntactic predictions, the respective semantic representational rules or models are applied to produce the morpho-semantic plausibility of the unknown word under each syntactic categorization. For instance, if the predicted syntactic categories are either a common noun or a verb, the algorithm presented in this paper is carried out to classify its semantic category and produce its plausibility value for the noun category. A similar process should deal with the case of verbs and produce the plausibility of being a verb. The final syntactic and semantic prediction is based on the plausibility values and the contextual environment (Bai et al. 1998, Ide 1998).
The advantages of the current representational model are:
1) It is declarative. New examples and new morphemes can be added into the system without changing the processing algorithm, and the performance of the system may increase as the knowledge grows.
2) The representational model not only provides the semantic classification of unknown words but also gives the plausibility value of a compound construction. This value could be utilized to resolve ambiguous matches between competing compound rules.
3) The representational model can be extended to represent compound verbs.
4) It acts as one of the major building blocks of a self-learning system for linguistic and world knowledge acquisition in the Internet environment.
The classification errors are caused by the following: a) some of the testing examples lack the semantic composition property; b) some semantic classifications are too fine-grained, so there is no clear-cut difference between some classes and even a human judge cannot make the right classification; c) there are not enough samples, so the similarity-based model does not work on suffixes with few or no sample data. The above classification errors can be resolved by collecting the new words that are semantically non-compositional into the lexicon and by adding new examples for each morpheme.
The current semantic categorization system only roughly classifies the unknown compound nouns according to their semantic heads. In the future, deeper analysis of the semantic relations between modifier and head should also be carried out.

References 

Bai, M.H., C.J. Chen & K.J. Chen, 1998, "POS-tagging for Chinese Unknown Words by Contextual Rules", Proceedings of ROCLING, pp. 47-62.

Chang, J.S., S.D. Chen, S.J. Ker, Y. Chen, & J. Liu, 1994, "A Multiple-Corpus Approach to Recognition of Proper Names in Chinese Texts", Computer Processing of Chinese and Oriental Languages, Vol. 8, No. 1, pp. 75-85.

Chen, C.J., M.H. Bai, K.J. Chen, 1997, "Category Guessing for Chinese Unknown Words." Proceedings of the Natural Language Processing Pacific Rim Symposium 1997, pp. 35-40. NLPRS '97 Thailand.

Chen, K.J., C.J. Chen, 1998, "A Corpus-based Study on Computational Morphology for Mandarin Chinese", in Quantitative and Computational Studies on the Chinese Language, Eds by Benjamin K. Tsou, City Univ. of Hong Kong, pp. 283-306.

Chen, Keh-Jiann, Ming-Hong Bai, 1997, "Unknown Word Detection for Chinese by a Corpus-based Learning Method." Proceedings of the 10th Research on Computational Linguistics International Conference, pp. 159-174.

Chen, K.J. & S.H. Liu, 1992, "Word Identification for Mandarin Chinese Sentences," Proceedings of 14th COLING, pp. 101-107.

Chien, Lee-feng, 1999, "PAT-tree-based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval," Information Processing and Management, Vol. 35, pp. 501-521.

Fung P., 1998, "Extracting Key Terms from Chinese and Japanese Texts," Computer Processing of Oriental Languages, Vol. 12, #1, pp. 99-122.

Huang, C.R. et al., 1995, "The Introduction of Sinica Corpus," Proceedings of ROCLING VIII, pp. 81-89.

Huang, Chu-Ren, Keh-Jiann Chen, and Ching-hsiung Lai, 1997, Mandarin Daily Classification Dictionary, Taipei: Mandarin Daily Press.

Ide, Nancy & Jean Veronis, 1998, "Special Issue on Word Sense Disambiguation", Computational Linguistics, Vol. 24, #1.

Lee, J.C., Y.S. Lee and H.H. Chen, 1994, "Identification of Personal Names in Chinese Texts." Proceedings of 7th ROC Computational Linguistics Conference.

Lin, M.Y., T.H. Chiang, & K.Y. Su, 1993, "A Preliminary Study on Unknown Word Problem in Chinese Word Segmentation", Proceedings of ROCLING VI, pp. 119-137.

Mei, Gia-Chu etc., 1984, 同義詞詞林 (Chilin - thesaurus of Chinese words). Hong Kong.

McDonald D., 1996, "Internal and External Evidence in the Identification and Semantic Categorization of Proper Names", in Corpus Processing for Lexical Acquisition, J. Pustejovsky and B. Boguraev Eds, MIT Press 1996.

Sun, M.S., C.N. Huang, H.Y. Gao, & Jie Fang, 1994, "Identifying Chinese Names in Unrestricted Texts", Communication of COLIPS, Vol. 4 No. 2, pp. 113-122.
