Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 385–390,
Sydney, July 2006. ©2006 Association for Computational Linguistics
When Conset meets Synset: A Preliminary Survey of an Ontological
Lexical Resource based on Chinese Characters
Shu-Kai Hsieh
Institute of Linguistics
Academia Sinica
Taipei, Taiwan
shukai@gate.sinica.edu.tw
Chu-Ren Huang
Institute of Linguistics
Academia Sinica
Taipei, Taiwan
churen@gate.sinica.edu.tw
Abstract
This paper describes an ongoing project on an ontological lexical
resource based on the abundant conceptual information grounded in
Chinese characters. The ultimate goal of this project is to construct
a cognitively sound and computationally effective character-grounded
machine-understandable resource.
Philosophically, the Chinese ideogram has its ontological status, but
its applicability to NLP tasks has not been expressed explicitly in
terms of a language resource. We thus propose the first attempt to
locate Chinese characters within the context of ontology. Given its
initial success in some NLP tasks, we believe that the construction of
this knowledge resource will shed new light on the theoretical setting
as well as on the construction of Chinese lexical semantic resources.
1 Introduction
In the history of Western linguistics, writing has long been viewed as
a surrogate or substitute for speech, the latter being the primary
vehicle for human communication. Such a “surrogational model”, which
neglects the systematicity of writing in its own right, has also been
the predominant view in current computational linguistic studies.
This paper provides a quite different perspective, in line with the
Eastern philological tradition of the study of scripts, especially
ideographic ones, i.e., Chinese characters (Hanzi). We believe that
the conceptual knowledge grounded in Chinese characters can be used as
a cognitively sound and computationally effective ontological lexical
resource in performing some NLP tasks, and that it will contribute to
the development of the Semantic Web as well.
2 Background Issues of Chinese
Ideographic Writing
2.1 Ideographic Script and Conceptual
Knowledge
From the viewpoint of writing systems and cognition, human conceptual
information has been regarded as being wired into ideographic scripts.
However, in reviewing the contemporary linguistic literature on the
essence of the Chinese writing system, we found that the main
theoretical dispute lies in the fact that both structural descriptions
and psycholinguistic modeling seem to presume that the notions of
ideography and phonography are mutually exclusive.
To break the theoretical impasse, we take a
pragmatic position in claiming the tripartite prop-
erties of Chinese characters: They are logographic
(morpho-syllabic) in essence, function phonologi-
cally at the same time, and can be interpreted ideo-
graphically and implemented as concept instances
by computers.
2.2 Chinese Wordhood
Roughly put, a Chinese character is regarded as
an ideographic symbol representing syllable and
meaning of a “morpheme” in spoken Chinese.
But unlike most affixing languages, Chinese has a large class of
morphemes, which Packard (2000) calls “bound roots”, that possess
certain affixal properties (namely, they are bound and productive in
forming words) but encode lexical rather than grammatical information.
These may occur as either the left- or right-hand component of a word.
For example, the morpheme 輸 (/shu/; “transport”) can be used as either
the first morpheme (e.g., 輸入 /shu-rù/; transport-into “import”) or
the second morpheme (e.g., 運輸 /yùn-shu/; transit-transport
“conveyance”) of a dissyllabic word, but cannot occur in isolation.
The fuzzy boundary between free and bound morphemes is directly
related to the notoriously controversial notion of Chinese Wordhood.
Multiple studies show that, to a large extent, (trained or untrained)
native speakers of Chinese disagree on what a (free)
morpheme/word/compound is.
Such difficulty can be traced back to historical facts. In modern
Mandarin Chinese, there is a strong tendency toward dissyllabic words,
while the predominantly monosyllabic words of ancient Chinese remain
more or less a closed set. But the conceptual knowledge encoded in
monosyllabic morphemes still exerts its influence even on contemporary
texts, resulting in the difficulty of word-marking decisions.
3 Theoretical Setting
Yu et al. (1999) reported that a Morpheme Knowledge Base of Modern
Chinese covering all Chinese characters in the GB2312-80 code has been
constructed by the Institute of Computational Linguistics of Peking
University. This Morpheme Knowledge Base was later integrated into the
project called “Grammatical Knowledge Base of Contemporary Chinese”.
It is noted that the “morphemes” adopted in this database are
monosyllabic “bound morphemes”. “Free morphemes”, that is, characters
which can be used independently as words, are not included in the
Knowledge Base. For example, the monosyllabic character 梳 (/shu/,
“comb”) has (at least) two senses. In the verbal sense (“to comb”), it
can be used as a word; in the nominal sense (“a comb”), it can only be
used in combination with other morphemes. Therefore, only the nominal
sense of 梳 is included in the Knowledge Base. However, such a
morpheme-based approach can hardly escape the difficult decision of
the free/bound distinction in contemporary Chinese.
3.1 Hanzi/Word Space Model
Based on the considerations mentioned above, in this paper we propose
a historical, conventionalized, pre-theoretical perspective for
viewing the lexical and knowledge information within Chinese
characters. In Figure 1, (a) illustrates a naive Hanzi space, while
(d) shows a linguistic theory-laden result of Hanzi/Word space, where
green areas denote words consisting of 1 to 4 characters. The decision
between words (green) and non-words (white) in the space is based on
certain perspectives (be they psycholinguistic or computational
linguistic). Instead, we take the traditional philological construct
of Hanzi into consideration. By analyzing the conceptual relations
between characters (b), which are scattered among diverse lexical
resources, we construct a top-level ontology with Hanzi as its
instances (c). Rather than (a) → (d), which is the predominant
approach in contemporary linguistic theoretical construction of
Chinese Wordhood, we believe that the proposed approach
(a) → (b) → (c) → (d) could not only capture the implicit conceptual
information evolutionarily encoded in Chinese characters, but also
provide a clearer knowledge scenario for the interaction of
characters/words in the modern linguistic theoretical setting.
3.2 Conset and Character Ontology
The new model that we propose here is called
HanziNet. It relies on a novel notion called con-
set and a coarsely grained upper-level ontology
of characters.
In comparison with synset, which has become a core notion in the
construction of Wordnet-like lexical semantic resources, we will argue
that there is a crucial difference between word-based and
character-based lexical resources, in that the nodes of their networks
represent finely differentiated information contents.
A synset, or synonym set, in WordNet contains a group of words,1 each
of which is synonymous with the other words in the same synset.
In WordNet’s design, each synset can be viewed as a concept in a
taxonomy, while in HanziNet we seek to align Hanzi which share a given
putatively primitive meaning extracted from traditional philological
resources, so a new term, conset (concept set), is proposed. A conset
contains
1To be exact, a synset contains a group of lexical units, which can be
words or collocations.
(a) (b) (c) (d)
Figure 1: Illustrations of Hanzi/Word Spaces
a group of Chinese characters similar in concept, each of which shares
similar conceptual information with the other characters in the same
conset.2
The relations between consets constitute a char-
acter ontology. Formally, it is a tree-structured
conceptual taxonomy in terms of which only two
kinds of relations are allowed: the INSTANCE-OF
(i.e., characters are instances of consets) and IS-
A relations (i.e., consets are hypernyms/hyponyms
to other consets).
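The two relation types can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions, not the HanziNet implementation: the class `Conset`, the function `build_index`, and the toy hierarchy are hypothetical, with labels borrowed from the conset examples in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Conset:
    """A node of the conceptual taxonomy (hypothetical representation)."""
    name: str
    parent: Optional["Conset"] = None  # IS-A link; None only for the root
    characters: List[str] = field(default_factory=list)  # INSTANCE-OF links

    def path(self) -> List[str]:
        """Names from the root down to this conset, following IS-A links."""
        names: List[str] = []
        node: Optional["Conset"] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


def build_index(consets: List[Conset]) -> Dict[str, List[Conset]]:
    """Map each character to every conset it is an instance of."""
    index: Dict[str, List[Conset]] = {}
    for conset in consets:
        for ch in conset.characters:
            index.setdefault(ch, []).append(conset)
    return index


# Toy fragment; the shape of the hierarchy above SUBJECTIVE is assumed.
root = Conset("ROOT")
subjective = Conset("SUBJECTIVE", parent=root)
excitability = Conset("EXCITABILITY", parent=subjective)
ability = Conset("ABILITY", parent=excitability)
skills = Conset("SKILLS", parent=ability, characters=["摘", "拔", "選"])
intellect = Conset("INTELLECT", parent=ability, characters=["考", "選", "記"])

index = build_index([skills, intellect])
print(" → ".join(skills.path()))
print([c.name for c in index["選"]])  # a character may belong to several consets
```

Because every conset has exactly one parent, the IS-A links form a tree, while a character such as 選 can still be an instance of more than one conset.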
Currently, frequently used monosyllabic char-
acters are assigned to at least one of 309 consets.
Following are some examples:
conset 126 (SUBJECTIVE → EXCITABILITY → ABILITY → ORGANIC FUNCTION)
吸、品、嚐、嚼、嚥、吞、饌、茹、飲,
conset 130 (SUBJECTIVE → EXCITABILITY → ABILITY → SKILLS)
摘、榨、拾、拔、提、攝、選,
conset 133 (SUBJECTIVE → EXCITABILITY → ABILITY → INTELLECT)
牟、謀、考、選、錄、記、聽,
In fact, the core assumption behind the synset/conset distinction is
non-trivial. In this project, we hypothesize the locality of Concept
Gestalt and the context-sensitivity of Word Sense with respect to
Chinese characters. That is, characters carry two meaning dimensions:
on the one hand, they are lexicalized concepts; on the other hand,
they can be observed linguistically as bound root morphemes and
monomorphemic words according to their independent usage in modern
Chinese texts.
2At the time of writing, the information for about 3,600 characters
has been completed.
Figure 2 shows a schematic diagram of our proposed model. In
Aitchison’s (2003) terms, for the character level we take an “atomic
globule” network viewpoint, where the characters, realized as
instances of core concept Gestalts and sharing similar conceptual
information, cluster together. The relationships between these concept
Gestalts form a rooted tree structure. Characters are thus assigned to
the leaves of the tree in terms of an assemblage of bits. For the word
level, we take the “cobweb” viewpoint, as words, built up from a pool
of characters, are connected to each other through lexical semantic
relations. In such a case, the network does not form a tree structure
but a more complex, long-range, highly correlated random acyclic graph
structure.
4 Hanzi-grounded Ontological
CharacterNet
In light of the previous considerations, this section attempts to
further clarify the building blocks of the HanziNet system, a
Hanzi-grounded ontological Character Net, with the goal of arriving at
a working model which will serve as a framework for ontological
knowledge processing. Briefly, HanziNet consists of two main parts:
Figure 2: The Schematic Representation of
character-triggered tree-like conceptual hierarchy
and word-based semantic network
a character-stored machine-readable lexicon and a
top-level character ontology.
4.1 Hanzi-grounded Lexicon and Ontology
The current lexicon contains over 5000 characters,
and 30,000 derived words in total.3
The building of the lexical specification of the
entries in HanziNet includes various aspects of
Hanzi:
1. Conset(s): The conceptual code is the core part of the MRD lexicon
in HanziNet. Concepts in HanziNet are indicated by means of a label
(conset name) with a code form. To increase efficiency, an ideal
strategy is to adopt a Huffman-coding-like method, encoding the
conceptual structure of Hanzi as a pattern of bits set within a bit
string.4 The coding thus refers to the assignment of code sequences to
a character. The sequence of edges from the root to any character
yields the code for that character, and the number of bits varies from
one character to another. Currently, each conset (309 in total) has 12
characters assigned on average, and each character is assigned to
3Since this lexicon aims at establishing a knowledge resource for
modern Chinese NLP, characters and words are mostly extracted from the
Academia Sinica Balanced Corpus of Modern Chinese
(http://www.sinica.edu.tw/SinicaCorpus/). Characters and words that
probably appear only in classical literary works (considered ghost
words in lexicography) will be discarded.
4This is inspired by Chu (1999)’s works.
2-3 consets on the average.5
2. Character Semantic Head (CSH) and Char-
acter Semantic Modifier (CSM) division.6
3. Shallow parts of speech (mainly Nominal(N)
and Verbal(V) tags)
4. Gloss of prototypical meaning
5. List of combined words with statistics calcu-
lated from corpus, and
6. Further aspects such as character types and
cognates: According to ancient study, char-
acters can be compartmentalized into six
groups based on the six classical principles of
character construction. Character type here
means which group the character belongs to.
And the term cognate here is defined as char-
acters that share the same CSH or CSM. Fig-
ure 3 shows a snapshot of this lexicon.
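The coding idea in item 1 can be sketched as follows. This is a minimal illustration, assuming a fixed-width binary label per sibling edge rather than frequency-based Huffman codes, and assuming that a character's code is the path to its conset. The function `assign_codes` and the toy hierarchy, including the placement of characters in it, are hypothetical.

```python
from typing import Dict, List


def assign_codes(tree: dict, prefix: str = "") -> Dict[str, List[str]]:
    """Return, for each character, the bit strings of the consets it belongs to.

    `tree` maps a node label either to a sub-tree (dict) or, for a leaf
    conset, to the list of characters that are its instances.
    """
    codes: Dict[str, List[str]] = {}
    children = list(tree.items())
    # Enough bits to tell siblings apart (at least one bit per edge).
    width = max(1, (len(children) - 1).bit_length())
    for i, (_label, sub) in enumerate(children):
        edge = format(i, f"0{width}b")
        if isinstance(sub, dict):
            for ch, found in assign_codes(sub, prefix + edge).items():
                codes.setdefault(ch, []).extend(found)
        else:
            for ch in sub:
                codes.setdefault(ch, []).append(prefix + edge)
    return codes


# Toy hierarchy; the placement of characters is illustrative only.
toy = {
    "OBJ": {"CONCRETE": ["金", "銀"], "ABSTRACT": ["電"]},
    "SUBJ": {
        "ABILITY": {
            "ORGANIC FUNCTION": ["吸", "吞"],
            "SKILLS": ["摘", "選"],
            "INTELLECT": ["考", "選"],
        }
    },
}

codes = assign_codes(toy)
for ch in ("金", "吸", "選"):
    print(ch, codes[ch])
```

Since sibling labels at each node have equal width, no code is a prefix of another, and the code length grows with the depth of the conset, so the number of bits varies from one character to another.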
Figure 3: The character-stored lexicon: a snapshot
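A lexicon entry carrying the aspects listed above might be modeled as in the following sketch. All names here (`HanziEntry`, `cognates`, the field names) are hypothetical, and the CSH/CSM values are placeholders rather than data taken from HanziNet; only the conset numbers follow the examples given earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class HanziEntry:
    character: str
    consets: List[int]        # item 1: conset numbers
    csh: str                  # item 2: Character Semantic Head
    csm: str                  # item 2: Character Semantic Modifier
    pos: Set[str]             # item 3: shallow tags, e.g. {"N", "V"}
    gloss: str = ""           # item 4: prototypical meaning
    words: Dict[str, int] = field(default_factory=dict)  # item 5: word -> corpus count
    char_type: str = ""       # item 6: one of the six classical principles


def cognates(entry: HanziEntry, lexicon: List[HanziEntry]) -> List[str]:
    """Characters sharing the same CSH or CSM with `entry` (item 6)."""
    return [
        other.character
        for other in lexicon
        if other.character != entry.character
        and (other.csh == entry.csh or other.csm == entry.csm)
    ]


# Placeholder CSH/CSM labels ("H1", "M1", ...) stand in for real values.
lexicon = [
    HanziEntry("吸", [126], csh="H1", csm="M1", pos={"V"}),
    HanziEntry("吞", [126], csh="H1", csm="M2", pos={"V"}),
    HanziEntry("選", [130, 133], csh="H2", csm="M3", pos={"V"}),
]

print(cognates(lexicon[0], lexicon))
```

Keeping the conset assignment as a list reflects that a character is assigned to more than one conset on average.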
The second core component of the proposed re-
source is a set of hierarchically related Top Con-
cepts called Top-level Ontology (or Upper ontol-
ogy). This is similar to EuroWordnet 1.2, which is
5A point of dispute here is that, if some of the monosyllabic
morphemes are taken as words, they should be very ambiguous in daily
linguistic contexts, at least more ambiguous than dissyllabic words.
However, as we argued previously, HanziNet takes a different
perspective in locating the theoretical roles of Hanzi.
6This distinction is made based on the glyphographical
consideration, which has been a crucial topic in the studies of
traditional Chinese scriptology. Due to the limited space, this
will not be discussed here.
also enriched with the Top Ontology and the set of
Base Concepts (Vossen 1998).
As mentioned, a tentative set of 309 consets, a kind of ontological
category in contrast with synsets, has been proposed,7 and over 5000
characters have been used as instances in populating the character
ontology.
Methodologically, following the basic line of the OntoClean approach
(Guarino and Welty 2002), we use simple monotonic inheritance in our
ontology design, which means that each node inherits properties only
from a single ancestor, and the inherited value cannot be overwritten
at any point of the ontology. The decision to keep the relations to
one single parent was made in order to guarantee that the structure
would be able to grow indefinitely and still be manageable, i.e., that
the transitive quality of the relations between the nodes would not
degenerate with size. Figure 4 shows a snapshot of the character
ontology.
[Figure 4 content (tree diagram; the links between nodes are not
recoverable from the extracted text). Node labels: ROOT; OBJ, SUBJ;
CONCRETE, ABSTRACT, EXISTENCE, ARTIFACT, EXCITABLE, COGNITIVE,
SEMIOTIC, RELATIONAL, SENSATION, STATE, INNATE, SOCIAL. Leaves:
conset 1, conset 2, conset 3, ..., conset 307, conset 308, conset 309,
each populated with a set of characters, e.g.
{金銀銅鐵錫氧碳鋁磷砒氫氮} and {電波磁}.]
Figure 4: The character ontology: a snapshot
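Simple monotonic inheritance with a single parent can be sketched as below. This is a hedged illustration only: the class `Node`, its methods, and the property name `domain` are hypothetical and are not taken from HanziNet or OntoClean.

```python
from typing import Any, Dict, Optional


class Node:
    """Ontology node with one parent and non-overridable inherited properties."""

    def __init__(self, name: str, parent: Optional["Node"] = None, **props: Any) -> None:
        self.name = name
        self.parent = parent
        self._props: Dict[str, Any] = {}
        for key, value in props.items():
            self.set(key, value)

    def get(self, key: str) -> Any:
        """Look the property up locally, then along the single chain of ancestors."""
        node: Optional["Node"] = self
        while node is not None:
            if key in node._props:
                return node._props[key]
            node = node.parent
        raise KeyError(key)

    def has(self, key: str) -> bool:
        try:
            self.get(key)
        except KeyError:
            return False
        return True

    def set(self, key: str, value: Any) -> None:
        """Add a property; refuse to overwrite a local or inherited value."""
        if self.has(key):
            raise ValueError(f"{self.name}: '{key}' is already defined and cannot be overwritten")
        self._props[key] = value


root = Node("ROOT")
subj = Node("SUBJ", parent=root, domain="subjective")
cognitive = Node("COGNITIVE", parent=subj)

print(cognitive.get("domain"))  # inherited from SUBJ
```

The check in `set` only guards against a descendant overwriting an ancestor's value; a fuller implementation would also need to reject a property added to an ancestor after a descendant has already defined it.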
4.2 Characters in a Small World
In addition, an experiment on the character network, which is based on
the meaning aspects of characters, was performed from a statistical
point of view. It was found that this character network, like many
other linguistic semantic networks (such as WordNet), exhibits a
small-world property (Watts and Strogatz 1998), characterized by
sparse connectivity, short average shortest paths between characters,
and strong local clustering. Moreover, due to its dynamic property, it
appears to exhibit an asymptotic scale-free (Barabasi and Albert 1999)
feature
7It would be interesting to compare consets with the basic
400 nodes in the upper region proposed by Hovy(2005).
Table 1: Statistical characteristics of the character network: N is
the total number of nodes (characters), k is the average number of
links per node, C is the clustering coefficient, L is the average
shortest-path length, and Lmax is the maximum length of the shortest
path between a pair of characters in the network.
                        N      k     C      L
Actual configuration    6493   350   0.64   2.0
Random configuration    6493   350   0.06   1.5
with a power-law connectivity distribution, which is found in many
other network systems as well.
Our first result is that the proposed conceptual network is highly
clustered and at the same time has a very small path length, i.e., it
is a small-world model in the static aspect. Specifically,
$L \gtrsim L_{random}$ but $C \gg C_{random}$. Results for the network
of characters, and a comparison with a corresponding random network
with the same parameters, are shown in Table 1. N is the total number
of nodes (characters), k is the average number of links per node, C is
the clustering coefficient, and L is the average shortest path.
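For readers unfamiliar with the two statistics, the following sketch computes C and L on a toy undirected graph. It uses the standard Watts-Strogatz definition of the clustering coefficient and breadth-first search for path lengths. The graph is made up, and how links between characters are defined in the actual network is not specified here.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected: node -> set of neighbours


def clustering_coefficient(g: Graph) -> float:
    """Mean over nodes of (links among neighbours) / (possible such links)."""
    total = 0.0
    for nbrs in g.values():
        k = len(nbrs)
        if k < 2:
            continue  # a node with fewer than two neighbours contributes 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in g[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(g)


def average_shortest_path(g: Graph) -> float:
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in g:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
toy: Graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

C = clustering_coefficient(toy)
L = average_shortest_path(toy)
print(f"C = {C:.3f}, L = {L:.3f}")
```

A small-world network in the sense used above is one where L stays close to that of a random graph with the same N and k while C is much larger.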
4.3 HanziNet in the Global Wordnet Grid
In order to promote semantic and ontological interoperability, we have
aligned consets with the 164 Base Concepts, a shared set of concepts
from EWN expressed in terms of Wordnet synsets and SUMO definitions,
which has recently been proposed in the international collaborative
platform of the Global Wordnet Grid.
5 Applications and Future Development
5.1 Sense Prediction and Disambiguation
Based on the initial version of the proposed resources, Hsieh (2005b)
has proposed a semantic class prediction model which aims to obtain
the possible semantic classes of unknown two-character words. The
results obtained show that, with this knowledge resource, the system
can achieve a fairly high level of performance. Meaning-relevant NLP
tasks such as Word Sense Disambiguation are also in preparation.
5.2 Interfacing Hantology, HanziNet and
Chinese Wordnet
Interfacing ontologies and lexical resources has become a research
topic in the coming age of the Semantic Web. In the case of Chinese,
three existing lexical resources (意符 Radicals::Hantology (Chou and
Huang 2005) - 字 Characters::HanziNet - 詞 Words::Chinese Wordnet)
constitute an integrated three-level knowledge scenario, which would
provide important insights into the problems of understanding the
complexities of Chinese natural language and the interaction among
these levels.
6 Conclusion
In conclusion, the goal of this research is set
to survey the unique characteristics of Chinese
Ideographs.
Though it is well understood and agreed upon in cognitive linguistics
that concepts can be represented in many ways, using various
constructions at different syntactic levels, conceptual representation
at the script level has unfortunately been both undervalued and
under-represented in computational linguistics. Therefore, the
Hanzi-driven conceptual approach in this paper might require that we
consider the Chinese writing system from a perspective that is not
normally found in canonical treatments of writing systems in
contemporary linguistics.
Against the deep-seated tradition in contemporary Chinese linguistics,
which views the use of Chinese characters in scientific theories as a
manifestation of mathematical immaturity and interpretational
subjectivity, we propose the first lexical knowledge resource based on
Chinese characters in the field of linguistics as well as in NLP.
It is noted that HanziNet, as a general knowledge resource, should not
claim to be a sufficient knowledge resource in and of itself, but
should instead seek to provide groundwork for the incremental
integration of other knowledge resources for language processing
tasks. In order to augment HanziNet, additional information will need
to be incorporated and mapped into HanziNet. This leads us to several
avenues of future research.
Acknowledgements
The authors would like to thank the anonymous referees for their
constructive comments. Thanks also go to the Institute of Linguistics
of Academia Sinica for their kind data support.
References
Aitchison, Jean. 2003. Words in the mind: an introduc-
tion to the mental lexicon. Blackwell publishing.
Barabasi, Albert-Laszlo and Reka Albert. 1999. Emer-
gence of scaling in random networks. Science,
286:509-512.
Chou, Ya-Min and Chu-Ren Huang. 2005. Hantology:
An ontology based on conventionalized conceptual-
ization. OntoLex Workshop, Korea.
Chu, Bong-Foo. 1999-. http://www.cbflabs.com
Guarino, Nicola and Chris Welty. 2002. Evaluating on-
tological decisions with OntoClean. In: Communi-
cations of the ACM. 45(2):61-65
Hovy, E.H. 2005. Methodologies for the Reliable Con-
struction of Ontological Knowledge. In : F. Dau,
M.-L. Mugnier, and G. Stumme (eds), Conceptual
Structures: Common Semantics for Sharing Knowl-
edge. Proceedings of the 13th Annual International
Conference on Conceptual Structures (ICCS 2005).
Kassel, Germany.
Hsieh, Shu-Kai. 2005(a). HanziNet: An enriched
conceptual network of Chinese characters. The 5rd
workshop on Chinese lexical semantics, China: Xi-
amen.
Hsieh, Shu-Kai. 2005(b). Word Meaning Inducing via
Character Ontology. IJCNLP, SIGHAN Workshop,
Jeju Island, South Korea.
Packard, J. L. 2000. The morphology of Chinese. Cam-
bridge, UK: Cambridge University Press.
Steyvers, M. and Tenenbaum, J.B. 2002 The Large-
Scale Structure of Semantic Networks: Statistical
Analyses and a Model of Semantic Growth. Cog-
nitive Science.
Watts, D. J. and Strogatz, S. H. 1998. Collective dy-
namics of ‘small-world’ networks. Nature 393:440-
42.
Yu, Shiwen, Zhu Xuefeng and Li Feng. 1999. The development and
application of modern Chinese morpheme knowledge base [in Chinese].
In: 世界漢語教學, No. 2, pp. 38-45.
