Word Meaning Inducing via Character Ontology: A Survey on the
Semantic Prediction of Chinese Two-Character Words
Shu-Kai Hsieh
Seminar f¨ur Sprachwissenschaft
Abt. Computerlinguistik
Universit¨at T¨ubingen
72074, Germany
kai@hanzinet.org
Abstract
This paper presents a semantic class
prediction model of Chinese two-
character compound words based on
a character ontology, which is set to
be a feasible conceptual knowledge re-
source grounded in Chinese characters.
The experiment we conduct yields sat-
isfactory results which turn out to be
that the task of semantic prediction of
two-character words could be greatly
facilitated using Chinese characters as
a knowledge resource.
1 Introduction
This paper describes the theoretical considera-
tion concerning with the interaction of ontology
and morpho-semantics, and an NLP experiment
is performed to do semantic class prediction of
unknown two-character words based on the on-
tological and lexical knowledge of Chinese mor-
phemic components of words (i.e., characters).
The task that the semantic predictor (or classifier)
performs is to automatically assign the (prede-
fined) semantic thesaurus classes to the unknown
two-character words of Chinese.
Among these types of unknown words, Chen
and Chen (2000) pointed out that compound
words constitute the most productive type of un-
known words in Chinese texts. However, the
caveat at this point should be carefully formu-
lated, due to the fact that there are no unequiv-
ocal opinions concerning with some basic theo-
retical settings in Chinese morphology. The no-
tion of word, morpheme and compounding are
not exactly in accord with the definition common
within the theoretical setting of Western morphol-
ogy. To avoid unnecessary misunderstanding, the
pre-theoretical term two-character words will be
mostly used instead of compound words in this
paper.
2 Word Meaning Inducing via
Character Meaning
2.1 Morpho-Semantic Description
As known, “bound roots” are the largest classes of
morpheme types in Chinese morphology, and they
are very productive and represent lexical rather
than grammatical information (Packard 2000).
This morphological phenomena leads many Chi-
nese linguists to view the word components (i.e.,
characters) as building blocks in the seman-
tic composition process of dis- or multisyllabic
words. In many empirical studies (Tseng and
Chen (2002); Tseng (2003); Lua (1993); Chen
(2004)), this view has been confirmed repeatedly.
In the semantic studies of Chinese word for-
mation, many descriptive and cognitive seman-
tic approaches have been proposed, such as ar-
gument structure analysis (Chang 1998) and the
frame-based semantic analysis (Chu 2004). How-
ever, among these qualitative explanation theoret-
ical models, problems often appear in the lack of
predictability on the one end of spectrum, or over-
generation on the other.1 Empirical data have
1For example, in applying Lieber’s (1992) analysis of ar-
gument structure and theta-grid in Chinese V-V compounds,
Chang (1998) found some examples which may satisfy the
semantic and syntactic constraints, but they may not be ac-
56
also shown that in many cases, – e.g., the abun-
dance of phrasal lexical units in any natural lan-
guage, – the principle of compositionality in a
strict sense, that is, “the meaning of a complex
expression can be fully derivable from the mean-
ings of its component parts, and from the schemas
which sanction their combination”(Taylor 2002),
which is taken to be a fundamental proposition in
some of morpho-semantically motivated analysis,
is highly questionable.
This has given to the consideration of the em-
beddedness of linguistic meanings within broader
conceptual structures. In what follows, we will
argue that an ontology-based approach would
provide an interesting and efficient prospective
toward the character-triggered morpho-semantic
analysis of Chinese words.
2.2 Conceptual Aggregate in Compounding:
A Shift Toward Character Ontology
In prior studies, it is widely presumed that the cat-
egory (be it syntactical or semantic) of a word, is
somehow strongly associated with that of its com-
posing characters. The semantic compositionality
underlying two-character words appears in differ-
ent terms in the literature.2
Word semantic similarity calculation tech-
niques have been commonly used to retrieve the
similar compositional patterns based on semantic
taxonomic thesaurus. However, one weak point
in these studies is that they are unable to sep-
arate conceptual and semantic levels. Problem
raises when words in question are conceptually
correlated are not necessarily semantically corre-
lated, viz, they might or might not be physically
close in the CILIN thesaurus (Mei et al 1998).
On closer observations, we found that most syn-
onymic words (i.e., with the same CILIN seman-
tic class) have characters which carry similar con-
ceptual information. This could be best illustrated
by examples. Table 1 shows the conceptual distri-
bution of the modifiers of an example of VV com-
pound by presuming the second character ¦ as a
ceptable to native speakers.
2Using statistical techniques, Lua (1993) found out that
each Chinese two-character word is a result of 16 types of
semantic transformation patterns, which are extracted from
the meanings of its constituent characters. In Chen (2004),
the combination pattern is referred to as compounding se-
mantic template.
head. The first column is the semantic class of
CILIN (middle level), the second column lists the
instances with lower level classification number,
and the third column lists their conceptual types
adopted from a character ontology we will discuss
later. As we can see, though there are 12 result-
ing semantic classes for the * ¦ compounds, the
modifier components of these compounds involve
only 4 concept types as follows:
11000 (SUBJECTIVE→EXCITABILITY→ABILITY→ORGANIC FUNCTION)
ÜSTXÙ,
11010 (SUBJECTIVE→EXCITABILITY→ABILITY→SKILLS) ¿STX
YSTXLSTX‰STXTSTXÙSTX²,
11011 (SUBJECTIVE→EXCITABILITY→ABILITY→INTELLECT) „STX
ÜSTX5STX²STX“STXpSTX=,
11110 (SUBJECTIVE→EXCITABILITY→SOCIAL EXPERIENCE→DEAL WITH THINGS)
YSTX×STXäSTX²STXTSTX¥STXÆSTX,STXªSTXDC1STXySTX¹STXÛSTXª
We defined these patterns as conceptual aggre-
gate pattern in compounding. Unlike statistical
measure of the co-occurrence restrictions or asso-
ciation strength, a concept aggregate pattern pro-
vides a more knowledge-rich scenario to repre-
sent a specific manner in which concepts are ag-
gregated in the ontological background, and how
they affect the compounding words. We will pro-
pose that the semantic class prediction of Chinese
two-character words could be improved by mak-
ing use of their conceptual aggregate pattern of
head/modifier component.
3 Semantic Prediction of Unknown
Two-Character Words
The practical task intended to be experimented
here involves the automatic classification of Chi-
nese two-character words into a predetermined
number of semantic classes. Difficulties encoun-
tered in previous researches could be summarized
as follows:
First, many models (Chen and Chen
1998;2000) cannot deal with the issue of
“incompleteness” of characters in the lexicon, for
these models depend heavily on CILIN, a Chinese
Thesaurus containing only about 4,133 monosyl-
labic morphemic components (characters). As
a result, if unknown words contain characters
that are not listed in CILIN, then the prediction
task cannot be performed automatically. Second,
the ambiguity of characters is often shunned by
57
SC VV compounds Concept types of modifier component
Ee 37 ª¦ 11110
Fa 05 Y¦ 08 ¿¦ 15 L¦ 11010
Fc 05 =¦ 11011
Gb 07 p¦ 11011
Ha 06 T¦ 11110
Hb 08 ¹¦ 12 Û¦ 12 T¦ 12 ¹¦ 11110
Hc 07 Y¦ 23 ‰¦ 25 “¦ {11110;11011}
Hi 27 ä¦ 27 T¦ {11010;11110}
Hj 25 ²¦ 25 ¿¦ {11010;11110}
Hn 03 ª¦ 10 y¦ 12 Y¦ 11110
If 09 5¦ 11011
Je 12 DC1¦ 12 Æ¦ 12 Ü¦ 12 Ù¦ 12 ¥¦ 12 „
¦ 12 Ü¦ 12 ª¦ 12 ,¦ 12 i¦ 12 ×¦ 12
²¦ 12 T¦
{11000;11110;11011;11110}
Table 1: Conceptual aggregate patterns in two-character VV (compound) words: An example of * ¦
manual pre-selection of character meaning in the
training step, which causes great difficulty for an
automatic work. Third, it has long been assumed
(Lua 1997; Chen and Chen 2000) that the over-
whelming majority of Chinese compounds are
more or less endocentric, where the compounds
denote a hyponym of the head component in the
compound. E.g, Ús (“electric-mail”; e-mail)
is a kind of mail. So the process of identifying
semantic class of a compound boils down to find
and to determine the semantic class of its head
morpheme. However, there is also an amount
of exocentric and appositional compounds3
where no straightforward criteria can be made to
determine the head component. For example, in
a case of VV compound o½ (“denounce-scold”,
drop-on), it is difficult (and subjective) to say
which character is the head that can assign a
semantic class to the compound.
To solve above-mentioned problems, Chen
(2004) proposed a non head-oriented character-
sense association model to retrieve the latent
senses of characters and the latent synonymous
compounds among characters by measuring sim-
ilarity of semantic template in compounding by
using a MRD. However, as the author remarked
in the final discussion of classification errors, the
performance of this model relies much on the pro-
ductivity of compounding semantic templates of
the target compounds. To correctly predict the se-
mantic category of a compound with an unpro-
ductive semantic template is no doubt very dif-
ficult due to a sparse existence of the template-
3Lua reports a result of 14.14% (Z3 type).
similar compounds. In addition, the statistical
measure of sense association does not tell us any
more about the constraints and knowledge of con-
ceptual combination.
In the following, we will propose that a knowl-
edge resource at the morpheme (character) level
could be a straightforward remedy to these prob-
lems. By treating characters as instances of con-
ceptual primitives, a character ontology thereof
might provide an interpretation of conceptual
grounding of word senses. At a coarse grain, the
character ontological model does have advantages
in efficiently defining the conceptual space within
which character-grounded concept primitives and
their relations, are implicitly located.
4 A Proposed Character
Ontology-based Approach
In carrying out the semantic prediction task,
we presume the context-freeness hypothesis, i.e.,
without resorting to any contextual information.
The consideration is taken based on the observa-
tion that native speaker seems to reconstruct their
new conceptual structure locally in the processing
of unknown compound words. On the other hand,
it has the advantage especially for those unknown
words that occur only once and hence have lim-
ited context.
In general, the approach proposed here differs
in some ways from previous research based on the
following presuppositions:
58
4.1 Character Ontology as a Knowledge
Resource
The new model that we will present below will
rely on a coarsely grained upper-level ontology
of characters.4 This character ontology is a tree-
structured conceptual taxonomy in terms of which
only two kinds of relations are allowed: the
INSTANCE-OF (i.e., certain characters are in-
stances of certain concept types) and IS-A rela-
tions (i.e., certain concept type is a kind of certain
concept type).
In the character ontology, monosyllabic char-
acters 5 are assigned to at least 6 one of 309 con-
sets (concept set), a new term which is defined as
a type of concept sharing a given putatively prim-
itive meaning. For instance, z (speak), − (chatter),
x (say), ; (say), µ (tell), s (inform), ƒ (explain), Å (nar-
rate), ‚ (be called), H (state), these characters are as-
signed to the same conset.
Following the basic line of OntoClear method-
ology (Guarino and Welty (2002)), we use sim-
ple monotonic inheritance, which means that each
node inherits properties only from a single ances-
tor, and the inherited value cannot be overwritten
at any point of the ontology. The decision to keep
the relations to one single parent was made in or-
der to guarantee that the structure would be able
to grow indefinitely and still be manageable, i.e.
that the transitive quality of the relations between
the nodes would not degenerate with size. Fig-
ure 1 shows a snapshot of the character ontology.
4.2 Character-triggered Latent
Near-synonyms
The rationale behind this approach is that simi-
lar conceptual primitives - in terms of characters
- probably participate in similar context or have
similar meaning-inducing functions. This can
be rephrased as the following presumptions: (1).
Near-synonymic words often overlap in senses,
i.e., they have same or close semantic classes. (2).
Words with characters which share similar con-
ceptual information tend to form a latent cluster
4At the time of writing, about 5,600 characters have been
finished in their information construction. Please refer to [4]
5In fact, in addition to monosyllabic morpheme, it also
contains a few dissyllabic morphemes, and borrowed poly-
syllabic morphemes.
6This is due to the homograph.
uni0052uni004Funi004Funi0054
uni004Funi0042uni004A
uni0053uni0055uni0042uni004A
uni0043uni004Funi004Euni0043uni0052uni0045uni0054uni0045
uni0041uni0042uni0053uni0054uni0052uni0041uni0043uni0054
uni0045uni0058uni0049uni0053uni0054uni0045uni004Euni0043uni0045
uni0041uni0052uni0054uni0049uni0046uni0041uni0043uni0054
uni0045uni0058uni0043uni0049uni0054uni0041uni0042uni004Cuni0045
uni0043uni004Funi0047uni004Euni0049uni0054uni0049uni0056uni0045
uni0053uni0045uni004Duni0049uni004Funi0054uni0049uni0043
uni0052uni0045uni004Cuni0041uni0054uni0049uni004Funi004Euni0041
uni004C
uni0053uni0045uni004Euni0053uni0041uni0054uni0049uni004Funi004E
uni0053uni0054uni0041uni0054uni0045
uni0049uni004Euni004Euni0041uni0054uni0045
uni0053uni004Funi0043uni0049uni0041uni004C
uni0063uni006Funi006Euni0073uni0065uni0074uni0020uni0031
uni0063uni006Funi006Euni0073uni0065uni0074uni0020uni0033uni0030uni0039
uni0063uni006Funi006Euni0073uni0065uni0074uni0020uni0032
uni0063uni006Funi006Euni0073uni0065uni0074uni0020uni0033
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni0063uni006Funi006Euni0073uni0065uni0074uni0020uni0033uni0030uni0038
uni0063uni006Funi006Euni0073uni0065uni0074uni0020uni0033uni0030uni0037
uni007Buni91D1uni9280uni9285uni9435uni932Buni6C27uni78B3uni92C1uni78F7uni7812uni6C2Buni6C2Euni007D
uni007Buni6D77uni6D0Buni6CBCuni6FA4uni6E56uni6CCAuni6C60uni5858uni6F6Duni6EAAuni6CB3uni85EAuni007D
uni007Buni990Auni4FDDuni8892uni8B77uni5E87uni4F51uni8F14uni885Buni9867uni5ED5uni620Cuni007D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni002Duni002Duni002Duni002Duni002Duni002D
uni007Buni96FBuni6CE2uni78C1uni007D
uni007Buni8B1Duni8FADuni62D2uni99C1uni63A8uni7D55uni916Cuni4EA4uni7DE0uni7D50uni8655uni9080uni7D04uni966Auni4F34uni8FCEuni007D
uni007Buni5C60uni6C7Auni622Euni52E6uni527Funi8A85uni6BB2uni5937uni6BC0uni6EC5uni6CEFuni5BB0uni6BBAuni6B8Auni007D
Figure 1: The character ontology: a snapshot
of synonyms. (2). These similar conceptual in-
formation can be formalized as conceptual aggre-
gate patterns extracted from a character ontology.
(3). Identifying such conceptual aggregate pat-
terns might thus greatly benefit the automatically
acquired near-synonyms, which give a set of good
candidates in predicting the semantic class of pre-
viously unknown ones.
The proposed semantic classification system
retrieves at first a set of near-synonym candidates
using conceptual aggregation patterns. Consid-
erations from the view of lexicography can win-
now the overgenerated candidates, that is, a final
decision of a list of near-synonym candidates is
formed on the basis of the CILIN’s verdict as to
what latent near-synonyms are. Thus the semantic
class of the target unknown two-character words
will be assigned with the semantic class of the
top-ranked near-synonym calculated by the sim-
ilarity measurement between them. This method
has advantage of avoiding the snag of apparent
multiplicity of semantic usages (ambiguity) of a
character.
Take for an example. Suppose that the seman-
tic class of a two-character word \ˆ (protect;
Hi37) is unknown. By presuming the leftmost
character \ the head of the word, and the right-
most character ˆ as the modifier of the word,
59
we first identify the conset which the modifier
ˆ belongs to. Other instances in this conset are
\, ô, {, ¨, 7, G, è, ., 1, è, ô, etc. So we
can retrieve a set of possible near-synonym can-
didates by substitution, namely, NS1: {\\, \ô,
\{, \¨, \7, \G, \è, \., \1, \è, \ô}; in
the same way, by presuming ˆ as the head, we
have a second set of possible near-synonym can-
didates, NS2: {ˆˆ, ôˆ, {ˆ, ¨ˆ, 7ˆ, Gˆ, èˆ,
.ˆ, 1ˆ, èˆ, ôˆ}7. Aligned with CILIN, those
candidates which are also listed in the CILIN are
adopted as the final two list of the near-synonym
candidates for the unknown word \ˆ: NSprime1:
{ôˆ(Hi41), ¨ˆ(Hb04;Hi37), 7ˆ(Hi47), è
ˆ(Hi37),ôˆ(Hd01)}, and NSprime2: {\G(Hl33),\
ô(Hj33), \è(Ee39)}.
4.3 Semantic Similarity Measure of
Unknown Word and its Near-Synonyms
Given two sets of character-triggered near-
synonyms candidates, the next step is to calcu-
late the semantic similarity between the unknown
word (UW) and these near-synonyms candidates.
CILIN Thesaurus is a tree-structured taxo-
nomic semantic structure of Chinese words,
which can be seen as a special case of seman-
tic network. To calculate semantic similarity be-
tween nodes in the network can thus make use of
the structural information represented in the net-
work.
Following this information content-based
model, in measuring the semantic similarity
between unknown word and its candidate near-
synonymic words, we use a measure metric
modelled on those of Chen and Chen (2000),
which is a simplification of the Resnik algorithm
by assuming that the occurrence probability
of each leaf node is equal. Given two sets
(NSprime1,NSprime2) of candidate near synonyms, each
with m and n near synonyms respectively, the
similarity is calculated as in equation (1) and
(2), where scuwc1 and scuwc2 are the semantic
class(es) of the first and second morphemic com-
ponent (i.e., character) of a given unknown word,
respectively. sci and scj are the semantic classes
of the first and second morphemic components
on the list of candidate near-synonyms NSprime1
7Note that in this case, \ and ˆ are happened to be in
the same conset.
and NSprime2. f is the frequency of the semantic
classes, and the denominator is the total value of
numerator for the purpose of normalization. β
and 1−β are the weights which will be discussed
later. The Information Load (IL) of a semantic
class sc is defined in Chen and Chen (2004):
IL(sc) = Entropy(system) −Entropy(sc)
(3)
similarequal (−1q
summationdisplay
log2 1q) − (−1p
summationdisplay
log2 1p)
= log2 q − log2 p
= −log2(pq),
if there is q the number of the minimal semantic
classes in the system,8 p is the number of the se-
mantic classes subordinate sc.
4.4 Circumventing “Head-oriented”
Presupposition
As remarked in Chen (2004), the previous re-
search concerning the automatic semantic classi-
fication of Chinese compounds (Lua 1997; Chen
and Chen 2000) presupposes the endocentric fea-
ture of compounds. That is, by supposing that
compounds are composed of a head and a modi-
fier, determining the semantic category of the tar-
get therefore boils down to determine the seman-
tic category of the head compound.
In order to circumventing the strict “head-
determination” presumption, which might suf-
fer problems in some borderline cases of V-V
compounds, the weight value (β and 1 − β) is
proposed. The idea of weighting comes from
the discussion of morphological productivity in
Baayen (2001). We presume that, within a given
two-character words, the more productive, that
is, the more numbers of characters a charac-
ter can combine with, the more possible it is a
head, and the more weight should be given to it.
The weight is defined as β = C(n,1)N , viz, the
number of candidate morphemic components di-
vided by the total number of N. For instance, in
the above-mentioned example, NS1 should gain
more weights than NS2, for ˆ can combine with
more characters (5 near-synonyms candidates) in
8In CILIN, q = 3915.
60
simµ(UW,NSprime1) = argmaxi=1,m IL(LCS(scuwc1,sci)) ∗fisummationtextm
i=1 IL(LCS(scuwc1,sci)) ∗fi
(β) (1)
simν(UW,NSprime2) = argmaxj=1,n IL(LCS(scuwc2,scj)) ∗fjsummationtextn
j=1 IL(LCS(scuwc2,scj)) ∗fj
(1 −β) (2)
NS1 than \ does in NS2 (3 near-synonyms can-
didates). In this case, β = 58 = 0.625. It is
noted that the weight assignment should be char-
acter and position independent.
4.5 Experimental Settings
4.5.1 Resources
The following resources are used in the ex-
periments: (1)Sinica Corpus9, (2) CILIN The-
saurus (Mei et al 1998) and (3) a Chinese char-
acter upper-level ontology.10 (1) is a well known
balanced Corpus for modern Chinese used in Tai-
wan. (2) CILIN Thesaurus is a Chinese The-
saurus widely accepted as a semantic categoriza-
tion standard of Chinese word in Chinese NLP.
In CILIN, a collection of about 52,206 Chinese
words are grouped in a Roget’s Thesaurus-like
structure based on categories within which there
are several 3 levels of finer clustering (12 major,
95 minor and 1428 minor semantic classes).(3) is
an on-going project of Hanzi-grounded Ontology
and Lexicon as introduced.
4.5.2 Data
We conducted an open test experiment, which
meant that the training data was different from the
testing data. 800 two-character words in CILIN
were chosen at random to serve as test data, and
all the words in the test set were assumed to be un-
known. The distribution of the grammatical cate-
gories of these data is: NN (200, 25%), VN (100,
12.5%) and VV (500, 62.5%).
4.5.3 Baseline
The baseline method assigns the semantic class
of the randomly picked head component to the se-
mantic class of the unknown word in question. It
is noted that most of the morphemic components
9http://www.sinica.edu.tw/SinicaCorpus/
10http://www.hanzinet.org/HanziOnto/
Compound types Baseline Our algorithm
V-V 12.20% 42.00%
V-N 14.00% 37.00%
N-N 11.00% 72.50%
Table 2: Accuracy in the test set (level 3)
(characters) are ambiguous, in such cases, seman-
tic class is chosen at random as well.
4.5.4 Outline of the Algorithm
Briefly, the strategy to predict the seman-
tic class of a unknown two-character word
is, to measure the semantic similarity of un-
known words and their candidate near-synonyms
which are retrieved based on the character
ontology. For any unknown word UW,
which is the character sequence of C1C2,
the RANK(simµ(β),simν(1 − β)) is com-
puted. The semantic category sc of the
candidate synonym which has the value of
MAX(simµ(β),simν(1 − β)), will be the top-
ranked guess for the target unknown word.
4.6 Results and Error Analysis
The correctly predicted semantic class is the se-
matic class listed in CILIN. In the case of ambigu-
ity, when the unknown word in question belongs
to more than one semantic classes, any one of the
classes of an ambiguous word is considered cor-
rect in the evaluation.
The SC prediction algorithm was performed
on the test data for outside test in level-3 classi-
fication. The resulting accuracy is shown in Ta-
ble 2. For the purpose of comparison, Table 3
also shows the more shallow semantic classifica-
tion (the 2nd level in CILIN).
Generally, without contextual information, the
classifier is able to predict the meaning of a Chi-
nese two-character words with satisfactory accu-
61
Compound types Baseline Our algorithm
V-V 13.20% 46.20%
V-N 16.00% 42.00%
N-N 12.50% 76.50%
Table 3: Accuracy in the test set (level 2)
racy against the baseline. A further examina-
tion of the bad cases indicates that error can be
grouped into the following sources:
• Words with no semantic transparency:
Like “proper names”, these types have no se-
mantic transparency property, i.e., the word
meanings can not be derived from their mor-
phemic components. Loan words such as c144
ê (/sh¯af¯a/; “sofa”) are typical examples.
• Words with weak semantic transparency:
These can be further classified into four
types:
– Appositional compounds: words whose
two characters stand in a coordinate re-
lationship, e.g. ‰a (“east-west”, thing).
– Lexicalized idiomatic usage: For such
usage, each word is an indivisible con-
struct and each has its meaning which
can hardly be computed by adding up
the separate meaning of the compo-
nents of the word. The sources of these
idiomatic words might lie in the etymo-
logical past and are at best meaningless
to the modern native speaker. e.g, ‹®
(“salary-water”, salary).
– Metaphorical usage: the meaning of
such words are therefore different from
the literal meaning. Some testing data
is not semantically transparent due to
their metaphorical uses, For instance, ¸
I (Aj) is assigned to the ¸å (Bk).
• Derived words:
Such as ª  (enter). These could be filter out
using syntactical information.
• The quality and coverage of CILIN and char-
acter ontology:
Since our SC system’s test and training data
are gleaned from CILIN and the character
Compound types Our model Current best
model
V-V 42.00% 39.80% (Chen
2004)
N-N 72.50% 81.00% (Chen
and Chen 2000)
Table 4: Level-3 performance in the outside test:
a comparison
ontology, the quality and coverage play a
crucial role. For example, for the unknown
compound word ÂÂ (/s¯ao-s¯ao/; “be in tu-
mult”), there not even an example which
has Â as the first character or as the sec-
ond character. the same problem such as
falling short on coverage and data sparse-
ness goes to the character ontology, too. For
instance, there are some dissyllabic mor-
phemes which are not listed in ontology,
such as ˙˘ (/j`ıy´u/;“covet”).
4.7 Evaluation
So far as we know, no evaluation in the previous
works was done. This might be due to many rea-
sons: (1) the different scale of experiment (how
many words are in the test data?), (2) the selec-
tion of syntactic category (VV, VN or NN?) of
morphemic components, and (3) the number of
morphemic components involved (two or three-
character words?).. etc. Hence it is difficult to
compare our results to other models. Among the
current similar works, Table 4 shows that our sys-
tem outperforms Chen(2004) in VV compounds,
and approximates the Chen and Chen(2000) in
NN compounds.
5 Conclusion
In this paper, we propose a system that aims to
gain the possible semantic classes of unknown
words via similarity computation based on char-
acter ontology and CILIN thesaurus. In gen-
eral, we approach the task in a hybrid way that
combines the strengths of ontology-based and
example-based model to achieve at better result
for this task.
The scheme we use for automatic semantic
class prediction takes advantage of the presump-
tions that the conceptual information wired in
Chinese characters can help retrieve the near-
62
synonyms, and the near-synonyms constitute a
key indicator for the semantic class guess of un-
known words in question.
The results obtained show that, our SC pre-
diction algorithm can achieve fairly high level of
performance. While the work presented here in
still in progress, a first attempt to analyze a test
set of 800 examples has already shown a 43.60%
correctness for VV compounds, 41.00% for VN
compounds, and 74.50% for NN compounds at
the level-3 of CILIN. If shallow semantics is taken
into consideration, the results are even better.
Working in this framework, however, one point
as suggested by other similar approach is that,
human language processing is not limited to an
abstract ontology alone (Hong et al. 2004). In
practical applications, ontologies are seldom used
as the only knowledge resources. For those un-
known words with very weak semantic trans-
parency, it would be interesting to show that an
ontology-based system can be greatly boosted
when other information sources such as metaphor
and etymological information integrated. Fu-
ture work is aimed at improving this accuracy by
adding other linguistic knowledge sources and ex-
tending the technique to WSD (Word Sense Dis-
ambiguation).
Acknowledgements
I would like to thank Erhard Hinrichs and Lothar
Lemnitzer for their useful discussions. I also
thank the anonymous referees for constructive
comments. Thanks also go to the institute of lin-
guistics of Academia Sinica for their kindly data
support.
References
Baayen, Harald. (2001). Word frequency distributions.
Kluwer Academic Publishers.
Chen, Keh-Jiann and Chao-Jan Chen. (2000). Auto-
matic semantic classification for Chinese unknown
compound nouns. COLING 2000, Saarbr¨ucken,
Germany.
Chen, Chao-Ren. (2004). Character-Sense association
and compounding template similarity: Automatic
semantic classification of Chinese compounds. The
3rd SIGHAN Workshop.
Chu, Yan. (2004). Semantic word formation of Chinese
compound words. Peking University Press.
HanziNet Project: http://www.hanzinet.org.
Guarino, Nicola and Chris Welty. (2002). Evaluating
ontological decisions with OntoClean. In: Commu-
nications of the ACM. 45(2):61-65.
Hong, Li and Huang (2004). Ontology-based Predic-
tion of Compound Relations: A study based on
SUMO. PACLIC 18.
Hsieh, Shu-Kai. (2005). HanziNet: An enriched con-
ceptual network of Chinese characters. The 5rd
workshop on Chinese lexical semantics, China: Xi-
amen.
Lin, Dekang. (1998). A information-theoretic defini-
tion of similarity. In:Proceeding of 15th Interna-
tional Conference of Machine Learning..
Lua, K. T. (1993). A study of Chinese word semantics
and its prediction. Computer Processing of Chinese
and Oriental Languages, Vol 7. No 2.
Lua, K.T. (1997). Prediction of meaning of bisyl-
labic Chinese words using back propagation neural
network. In:Computer Processing of Oriental Lan-
guages. 11(2).
Lua, K. T. (2002). The Semantic Transformation of
Chinese Compound Words (Çx¨È¬˙íx<c158²).
The 3rd workshop on Chinese lexical semantics,
Taipei.
Packard, J. L. (2000). The morphology of Chinese.
Cambridge, UK: Cambridge University Press.
Mei et al (1998). °2ÈÈŠ. Dong-Hua Bookstore:
Taipei.
63
