<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2050"> <Title>When Conset meets Synset: A Preliminary Survey of an Ontological Lexical Resource based on Chinese Characters</Title> <Section position="5" start_page="385" end_page="386" type="metho"> <SectionTitle> 3 Theoretical Setting </SectionTitle> <Paragraph position="0"> Yu et al. (1999) reported that a Morpheme Knowledge Base of Modern Chinese covering all Chinese characters in the GB2312-80 code has been constructed by the Institute of Computational Linguistics of Peking University. This Morpheme Knowledge Base was later integrated into the project called &quot;Grammatical Knowledge Base of Contemporary Chinese&quot;.</Paragraph> <Paragraph position="1"> It should be noted that the &quot;morphemes&quot; adopted in this database are monosyllabic &quot;bound morphemes&quot;.</Paragraph> <Paragraph position="2"> &quot;Free morphemes&quot;, that is, characters which can be used independently as words, are not included in the Knowledge Base. For example, the monosyllabic character 梳 (/shu/, &quot;comb&quot;) has (at least) two senses. In the verbal sense (&quot;to comb&quot;), it can be used as a word; in the nominal sense (&quot;a comb&quot;), it can only be used in combination with other morphemes. Therefore, only the nominal sense of 梳 is included in the Knowledge Base. However, such a morpheme-based approach can hardly escape the difficult decision of the free/bound distinction in contemporary Chinese.</Paragraph> <Section position="1" start_page="385" end_page="385" type="sub_section"> <SectionTitle> 3.1 Hanzi/Word Space Model </SectionTitle> <Paragraph position="0"> Based on the considerations mentioned above, in this paper we propose a historical, conventionalized, pre-theoretical perspective on the lexical and knowledge information within Chinese characters. 
In Figure 1, (a) illustrates a naive Hanzi space, while (d) shows a linguistic theory-laden result of the Hanzi/Word space, where green areas denote words consisting of 1 to 4 characters. The division into words (green) and non-words (white) in the space is based on certain perspectives (be they psycholinguistic or computational-linguistic). Instead, we take the traditional philological construct of Hanzi into consideration. By analyzing the conceptual relations between characters (b), which are scattered among diverse lexical resources, we construct a top-level ontology with Hanzi as its instances (c). Rather than (a) - (d), the predominant approach in contemporary linguistic theoretical construction of Chinese wordhood, we believe that the proposed approach (a) - (b) - (c) - (d) can not only capture the implicit conceptual information evolutionarily encoded in Chinese characters, but also provide a clearer knowledge scenario for the interaction of characters/words in a modern linguistic theoretical setting.</Paragraph> </Section> <Section position="2" start_page="385" end_page="386" type="sub_section"> <SectionTitle> 3.2 Conset and Character Ontology </SectionTitle> <Paragraph position="0"> The new model that we propose here is called HanziNet. It relies on a novel notion called conset and a coarse-grained upper-level ontology of characters.</Paragraph> <Paragraph position="1"> In comparison with the synset, which has become a core notion in the construction of Wordnet-like lexical semantic resources, we will argue that there is a crucial difference between word-based and character-based lexical resources, in that they differ in the granularity of the information content represented by the nodes of the network. 
A synset, or synonym set, in WordNet contains a group of words,1 each of which is synonymous with the other words in the same synset.</Paragraph> <Paragraph position="2"> In WordNet's design, each synset can be viewed as a concept in a taxonomy. In HanziNet, by contrast, we seek to align Hanzi which share a given putatively primitive meaning extracted from traditional philological resources, so a new term, conset (concept set), is proposed. A conset contains a group of Chinese characters similar in concept, each of which shares similar conceptual information with the other characters in the same conset.2 The relations between consets constitute a character ontology. Formally, it is a tree-structured conceptual taxonomy in which only two kinds of relations are allowed: INSTANCE-OF (i.e., characters are instances of consets) and IS-A (i.e., consets are hypernyms/hyponyms of other consets).</Paragraph> <Paragraph position="3"> Currently, frequently used monosyllabic characters are assigned to at least one of 309 consets. Following are some examples: In fact, the core assumption behind the synset/conset distinction is non-trivial. In this project, we assume a hypothesis of the locality of Concept Gestalt and the context-sensitivity of Word Sense with respect to Chinese characters.</Paragraph> <Paragraph position="4"> That is, characters carry two meaning dimensions: on the one hand, they are lexicalized concepts; on the other hand, they can be observed linguistically as bound root morphemes and monomorphemic words according to their independent usage in modern Chinese texts.</Paragraph> <Paragraph position="5"> (Footnote 2: At the time of writing, the information construction for about 3,600 characters has been completed.)</Paragraph> <Paragraph position="6"> Figure 2 shows a schematic diagram of our proposed model. 
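The two relations just described can be made concrete with a short sketch. The following Python fragment is a toy illustration (the class, the conset names, and the member character are invented for this example, not taken from HanziNet): IS-A is modeled as a single-parent link between consets, and INSTANCE-OF as character membership in a conset.

```python
# Toy sketch of the conset taxonomy: consets form a tree via IS-A links,
# and characters attach to consets via INSTANCE-OF links.
# Conset names and member characters below are invented for illustration.

class Conset:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent       # IS-A: a single hypernym (tree structure)
        self.instances = set()     # INSTANCE-OF: member characters

    def add_instance(self, hanzi):
        self.instances.add(hanzi)

    def hypernyms(self):
        """Walk the IS-A links from this conset up to the root."""
        node, path = self, []
        while node.parent is not None:
            node = node.parent
            path.append(node.name)
        return path

root = Conset("ENTITY")
artifact = Conset("ARTIFACT", parent=root)
tool = Conset("TOOL", parent=artifact)
tool.add_instance("梳")            # 'comb' as an instance of the TOOL conset

print(tool.hypernyms())            # → ['ARTIFACT', 'ENTITY']
print("梳" in tool.instances)      # → True
```

The single `parent` link is deliberate: it anticipates the monotonic, single-ancestor inheritance the paper adopts for the character ontology in Section 4.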
In Aitchison's (2003) terms, at the character level we take an &quot;atomic globule&quot; network viewpoint, where characters, realized as instances of core concept Gestalts, cluster together when they share similar conceptual information. The relationships between these concept Gestalts form a rooted tree structure. Characters are thus assigned to the leaves of the tree in terms of an assemblage of bits. At the word level, we take the &quot;cobweb&quot; viewpoint, as words, built up from a pool of characters, are connected to each other through lexical semantic relations. In this case, the network does not form a tree but a more complex, long-range, highly correlated random acyclic graph structure.</Paragraph> </Section> </Section> <Section position="6" start_page="386" end_page="388" type="metho"> <SectionTitle> 4 Hanzi-grounded Ontological CharacterNet </SectionTitle> <Paragraph position="0"> In light of the previous considerations, this section attempts to further clarify the building blocks of the HanziNet system, a Hanzi-grounded ontological Character Net, with the goal of arriving at a working model which will serve as a framework for ontological knowledge processing.</Paragraph> <Paragraph position="1"> Briefly, HanziNet consists of two main parts:</Paragraph> <Section position="1" start_page="387" end_page="388" type="sub_section"> <SectionTitle> 4.1 Hanzi-grounded Lexicon and Ontology </SectionTitle> <Paragraph position="0"> The current lexicon contains over 5000 characters and 30,000 derived words in total.3 The lexical specification of the entries in HanziNet includes various aspects of Hanzi: 1. Conset(s): The conceptual code is the core part of the MRD lexicon in HanziNet. Concepts in HanziNet are indicated by means of a label (conset name) with a code form. 
In order to increase efficiency, an ideal strategy is to adopt a Huffman-coding-like method, encoding the conceptual structure of a Hanzi as a pattern of bits within a bit string.4 The coding thus refers to the assignment of a code sequence to a character. The sequence of edges from the root to any character yields the code for that character, and the number of bits varies from one character to another. Currently, each conset (309 in total) has 12 characters assigned on average, and each character is assigned to 2-3 consets on average.5 (Footnote 3: Since this lexicon aims at establishing a knowledge resource for modern Chinese NLP, characters and words are mostly extracted from the Academia Sinica Balanced Corpus of Modern Chinese (http://www.sinica.edu.tw/SinicaCorpus/); characters and words which have probably appeared only in classical literary works (considered ghost words in lexicography) are discarded.)</Paragraph> <Paragraph position="1"> 2. Character Semantic Head (CSH) and Character Semantic Modifier (CSM) division.6 3. Shallow parts of speech (mainly Nominal (N) and Verbal (V) tags). 4. Gloss of prototypical meaning. 5. List of combined words, with statistics calculated from the corpus. 6. Further aspects such as character types and cognates: according to ancient scholarship, characters can be divided into six groups based on the six classical principles of character construction. Character type here means which group a character belongs to.</Paragraph> <Paragraph position="2"> The term cognate here is defined as characters that share the same CSH or CSM. Figure 3 shows a snapshot of this lexicon.</Paragraph> <Paragraph position="3"> The second core component of the proposed resource is a set of hierarchically related Top Concepts called the Top-level Ontology (or Upper Ontology). 
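The Huffman-like coding described under item 1 above can be sketched as follows. In this toy Python fragment (the miniature taxonomy and its grouping are invented for illustration, not drawn from the 309-conset ontology), each character's code is the sequence of 0/1 edge labels on the path from the root down to its leaf, so the code length varies with the character's depth, as in the scheme described in the text.

```python
# Sketch of the Huffman-coding-like idea: the bit string for a character
# is read off the edges on the root-to-leaf path in the taxonomy.
# The toy taxonomy below is an invented grouping, not HanziNet data.

def assign_codes(tree, prefix=""):
    """Recursively assign a bit string to every leaf (character)."""
    codes = {}
    if isinstance(tree, str):              # a leaf holds a character
        codes[tree] = prefix
        return codes
    for bit, subtree in zip("01", tree):   # left edge = 0, right edge = 1
        codes.update(assign_codes(subtree, prefix + bit))
    return codes

# invented taxonomy with two levels on one side and three on the other
toy = (("梳", "刷"), ("山", ("水", "火")))
codes = assign_codes(toy)
print(codes)  # → {'梳': '00', '刷': '01', '山': '10', '水': '110', '火': '111'}
```

Note that the resulting codes are prefix-free and of varying length, mirroring the statement that the number of bits varies from one character to another.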
This is similar to EuroWordNet 1.2, which is also enriched with the Top Ontology and the set of Base Concepts (Vossen 1998).</Paragraph> <Paragraph position="4"> (Footnote 5: The disputable point here is that, if some of the monosyllabic morphemes are taken as words, they should be very ambiguous in the daily linguistic context, at least more ambiguous than disyllabic words. However, as we argued previously, HanziNet takes a different perspective in locating the theoretical roles of Hanzi.)</Paragraph> <Paragraph position="5"> (Footnote 6: This distinction is made based on glyphographical considerations, which have been a crucial topic in the study of traditional Chinese scriptology. Due to limited space, this will not be discussed here.)</Paragraph> <Paragraph position="6"> As mentioned, a tentative set of 309 consets, a kind of ontological category in contrast with synsets, has been proposed,7 and over 5000 characters have been used as instances in populating the character ontology.</Paragraph> <Paragraph position="7"> Methodologically, following the basic line of the OntoClean approach (Guarino and Welty, 2002), we use simple monotonic inheritance in our ontology design, which means that each node inherits properties from only a single ancestor, and an inherited value cannot be overwritten at any point in the ontology. The decision to keep the relations to a single parent was made in order to guarantee that the structure can grow indefinitely and still remain manageable, i.e. that the transitive quality of the relations between the nodes does not degenerate with size. Figure 4 shows a snapshot of the character ontology.</Paragraph> </Section> <Section position="2" start_page="388" end_page="388" type="sub_section"> <SectionTitle> 4.2 Characters in a Small World </SectionTitle> <Paragraph position="0"> In addition, an experiment on the character network, based on the meaning aspects of characters, was performed from a statistical point of view. 
It was found that this character network, like many other linguistic semantic networks (such as WordNet), exhibits a small-world property (Watts 1998), characterized by sparse connectivity, small average shortest paths between characters, and strong local clustering. Moreover, due to its dynamic property, it appears to exhibit an asymptotic scale-free (Barabási 1999) feature, with a power-law connectivity distribution, which is found in many other network systems as well.</Paragraph> <Paragraph position="1"> Table 1: Statistics of the character network. N is the total number of nodes (characters), k is the average number of links per node, C is the clustering coefficient, L is the average shortest-path length, and Lmax is the maximum length of the shortest path between a pair of characters in the network.

                       N     k    C     L
Actual configuration   6493  350  0.64  2.0
Random configuration   6493  350  0.06  1.5

</Paragraph> <Paragraph position="2"> Our first result is that our proposed conceptual network is highly clustered and at the same time has a very small average path length, i.e., it is a small-world model in the static aspect. Specifically, L ≳ Lrandom but C ≫ Crandom. Results for the network of characters, and a comparison with a corresponding random network with the same parameters, are shown in Table 1. 
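For illustration, the two statistics reported in Table 1 can be computed with a few lines of Python. The sketch below uses an invented four-node adjacency list, not the actual 6,493-character network; it implements the average local clustering coefficient C and the average shortest-path length L via breadth-first search.

```python
# Minimal sketch of the two small-world statistics in Table 1:
# the clustering coefficient C and the average shortest-path length L.
# The adjacency list below is a toy graph, not the character network.
from collections import deque

def clustering_coefficient(adj):
    """Mean local clustering: the fraction of each node's neighbour
    pairs that are themselves connected, averaged over all nodes."""
    total = 0.0
    for v, nbrs in adj.items():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            continue                      # degree < 2 contributes 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean BFS distance over all ordered node pairs (assumes the
    graph is connected)."""
    dist_sum = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        dist_sum += sum(d for v, d in dist.items() if v != src)
        pairs += len(dist) - 1
    return dist_sum / pairs

# toy network: a triangle a-b-c plus a pendant node d attached to c
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(round(clustering_coefficient(adj), 3))   # → 0.583
print(round(average_path_length(adj), 3))      # → 1.333
```

Applied to the real network, a high C alongside an L close to that of a matched random graph is exactly the small-world signature the table reports.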
</Paragraph> </Section> <Section position="3" start_page="388" end_page="388" type="sub_section"> <SectionTitle> 4.3 HanziNet in the Global Wordnet Grid </SectionTitle> <Paragraph position="0"> In order to promote semantic and ontological interoperability, we have aligned consets with the 164 Base Concepts, a shared set of concepts from EWN expressed in terms of Wordnet synsets and SUMO definitions, which has recently been proposed on the international collaborative platform of the Global Wordnet Grid.</Paragraph> </Section> </Section> </Paper>