<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1318">
  <Title>Automatic WordNet mapping using word sense disambiguation*</Title>
  <Section position="2" start_page="0" end_page="144" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> There is little doubt about the increasing importance of wide-coverage ontologies for NLP tasks, especially information retrieval and cross-language information retrieval. While such ontologies exist for English, very few wide-coverage ontologies are available for other languages. Manual construction of an ontology by experts is the most reliable technique, but it is costly and highly time-consuming. For this reason, many researchers have focused on the massive acquisition of lexical knowledge and semantic information from pre-existing lexical resources, as automatically as possible.</Paragraph>
    <Paragraph position="1"> This paper presents a novel approach to automatic WordNet mapping using word sense disambiguation. The method has been applied to link Korean words from a bilingual dictionary to English WordNet synsets.</Paragraph>
    <Paragraph position="2"> To make the task concrete, consider an example. To link the first sense of the Korean word 'gwan-mog' to a WordNet synset, we employ a bilingual Korean-English dictionary. The first sense of 'gwan-mog' has 'bush' as its English translation, and 'bush' has five synsets in WordNet. Therefore the first sense of 'gwan-mog' has five candidate synsets.</Paragraph>
    <Paragraph position="3"> We must then select the synset {shrub, bush} from among the five candidate synsets and link the sense of 'gwan-mog' to this synset.</Paragraph>
    <Paragraph position="4"> As this example shows, linking the senses of Korean words to WordNet synsets involves semantic ambiguities. To remove the ambiguities, we develop new word sense disambiguation heuristics and an automatic mapping method to construct a Korean WordNet based on the existing English WordNet.</Paragraph>
    <Paragraph position="5"> This paper is organized as follows. In section 2, we describe multiple heuristics for word sense disambiguation for sense linking. In section 3, we explain the method for combining these heuristics. Section 4 presents experimental results, and section 5 discusses related research. Finally, section 6 draws conclusions and outlines future research. The automatically mapped Korean WordNet serves as a natural Korean-English bilingual thesaurus, so it can be directly applied to Korean-English cross-lingual information retrieval as well as Korean monolingual information retrieval.</Paragraph>
    <Paragraph position="6"> 2 Multiple heuristics for word sense disambiguation
As the mapping method described in this paper has been developed for combining multiple individual solutions, each single heuristic must be seen as a container for some part of the linguistic knowledge needed to disambiguate the ambiguous WordNet synsets. Therefore, no single heuristic is suitable for all Korean words collected from a bilingual dictionary. We describe each individual WSD (word sense disambiguation) heuristic for mapping Korean words onto the corresponding English senses.
* This research was supported by KOSEF special purpose basic research (1997.9-2000.8 #970-1020-301-3).</Paragraph>
    <Paragraph position="7"> A Korean word kw has m translations in English and n WordNet synsets as candidate senses. Each heuristic is applied to the candidate senses ($ws_1, \ldots, ws_n$) and provides scores for them.</Paragraph>
    <Section position="1" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
2.1 Heuristic 1: Maximum Similarity
</SectionTitle>
      <Paragraph position="0"> This heuristic comes from our previous Korean WSD research (Lee and Lee, 2000) and assumes that all the translations in English for the same Korean word sense are semantically similar. So this heuristic provides the maximum score to the sense that is most similar to the senses of the other translations. This heuristic is applied when the number of translations for the same Korean word sense is more than 1. The following formula explains the idea.</Paragraph>
      <Paragraph position="1"> $$H_1(s_i) = \max_{ew_j \in EW_i} \frac{\sum_{k=1}^{n} support(s_i, ew_k) - 1}{(n-1)+\alpha}$$ where $EW_i = \{ew \mid s_i \in synset(ew)\}$. In this formula, $H_1(s_i)$ is the heuristic score of synset $s_i$, $s_i$ is a candidate synset, $ew$ is a translation into English, $n$ is the number of translations, and $synset(ew)$ is the set of synsets of the translation $ew$. So $EW_i$ is the set of translations that have the synset $s_i$. The parameter $\alpha$ controls the relative contribution of candidate synsets that occur with different numbers of translations: as the value of $\alpha$ increases, candidate synsets with fewer translations get relatively less weight ($\alpha = 0.5$ was tuned experimentally). $support(s_i, ew)$ calculates the maximum similarity between the synset $s_i$ and the translation $ew$, which is defined as:</Paragraph>
      <Paragraph position="3"> Similarity values lower than a threshold $\theta$ are considered to be noise and are ignored. In our experiments, $\theta = 0.3$ was used. $sim(s_1, s_2)$ computes the conceptual similarity between concepts $s_1$ and $s_2$ as in the following formula: $$sim(s_1, s_2) = \frac{2 \times level(MSCA(s_1, s_2))}{level(s_1) + level(s_2)}$$ where $MSCA(s_1, s_2)$ represents the most specific common ancestor of concepts $s_1$ and $s_2$, and $level(s)$ refers to the depth of concept $s$ from the root node in WordNet.</Paragraph>
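      <Paragraph> The similarity and support definitions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy hierarchy PARENT, the synset names, and the helper names (ancestors, level, msca, sim, support) are invented for the example, the root is assumed to sit at level 1, and the thresholded maximum in support follows the prose description because the support formula itself is missing from this copy.
<![CDATA[
```python
# Toy IS-A hierarchy (hypothetical): child synset -> parent synset.
PARENT = {
    "entity": None,
    "plant": "entity",
    "woody_plant": "plant",
    "shrub": "woody_plant",
    "tree": "woody_plant",
    "region": "entity",
    "wilderness": "region",
}

THETA = 0.3  # noise threshold reported in the text


def ancestors(s):
    """Return the path from s up to the root, s first."""
    path = []
    while s is not None:
        path.append(s)
        s = PARENT[s]
    return path


def level(s):
    """Depth of s counted from the root (root = 1, an assumption)."""
    return len(ancestors(s))


def msca(s1, s2):
    """Most specific common ancestor of s1 and s2."""
    seen = set(ancestors(s1))
    for a in ancestors(s2):  # walks upward, so the first hit is the deepest
        if a in seen:
            return a
    return None


def sim(s1, s2):
    """sim(s1, s2) = 2 * level(MSCA(s1, s2)) / (level(s1) + level(s2))."""
    return 2 * level(msca(s1, s2)) / (level(s1) + level(s2))


def support(s, synsets_of_ew, theta=THETA):
    """Maximum similarity between s and any synset of one translation.

    Similarities below theta are treated as noise and contribute 0.0
    (assumed reading of the text).
    """
    kept = [v for v in (sim(s, t) for t in synsets_of_ew) if v >= theta]
    return max(kept, default=0.0)
```
]]>
      In this toy hierarchy, 'shrub' and 'tree' share the ancestor 'woody_plant', so they are scored as similar, while 'shrub' and 'wilderness' meet only at the root and fall below the threshold.</Paragraph>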
    </Section>
    <Section position="2" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
2.2 Heuristic 2: Prior Probability
</SectionTitle>
      <Paragraph position="0"> This heuristic assigns to each sense of a single translation its prior probability as a score.</Paragraph>
      <Paragraph position="1"> Therefore we give the maximum score to the synset of a monosemous translation, that is, a translation that has only one corresponding synset. The following formula explains the idea.</Paragraph>
      <Paragraph position="3"> where $s_i \in synset(ew_j)$ and $n_j = |synset(ew_j)|$. In this formula, $n_j$ is the number of synsets of the translation $ew_j$.</Paragraph>
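      <Paragraph> A small Python sketch of this heuristic follows. Because the formula itself is missing from this copy, the sketch assumes a uniform prior of 1/n_j over the n_j synsets of a translation ew_j and takes the maximum over the translations that contain the candidate synset. Both choices, the toy table SYNSETS, its synset labels, and the function name prior_probability_score are assumptions made for illustration only.
<![CDATA[
```python
# Hypothetical table: English translation -> its candidate synsets.
SYNSETS = {
    "bush": ["shrub", "wilderness", "scrub", "hair", "person"],
    "shrub": ["shrub"],  # monosemous translation
}


def prior_probability_score(s, translations):
    """Score synset s by the largest uniform prior 1/n_j among the
    translations ew_j whose synset list contains s (assumed form)."""
    priors = [
        1.0 / len(SYNSETS[ew])
        for ew in translations
        if s in SYNSETS[ew]
    ]
    return max(priors, default=0.0)
```
]]>
      Under these assumptions, a synset reached through a monosemous translation receives the maximum score of 1, as the text describes.</Paragraph>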
    </Section>
    <Section position="3" start_page="142" end_page="143" type="sub_section">
      <SectionTitle>
2.3 Heuristic 3: Sense Ordering
</SectionTitle>
      <Paragraph position="0"> Gale et al. (1992) report that word sense disambiguation would be at least 75% correct if a system assigned the most frequently occurring sense. Miller et al. (1994) found that automatic assignment of polysemous words in the Brown Corpus to senses in WordNet was 58% correct with a most-frequent-sense heuristic. We build on these previous results to develop the sense ordering heuristic.</Paragraph>
      <Paragraph position="1"> The sense ordering heuristic provides the maximum score to the most frequently used sense of a translation. The following formula explains the heuristic.</Paragraph>
      <Paragraph position="3"> In this formula, $x$ refers to the sense order of $s_i$ in $synset(ew)$: $x$ is 1 when $s_i$ is the most frequently used sense of $ew$. The information about the sense order of the synsets of an English word was extracted from WordNet.</Paragraph>
      <Paragraph position="4"> The values $\alpha = 0.705$ and $\beta = 2.2$ were acquired from a regression on the SemCor corpus (footnote 2) data distribution shown in Figure 2.</Paragraph>
    </Section>
    <Section position="4" start_page="143" end_page="143" type="sub_section">
      <SectionTitle>
2.4 Heuristic 4: IS-A relation
</SectionTitle>
      <Paragraph position="0"> (Footnote 2: SemCor is a sense-tagged corpus drawn from part of the Brown Corpus.)
This heuristic is based on the following assumption:</Paragraph>
      <Paragraph position="1"> If two Korean words have an IS-A relation, their translations in English should also have an IS-A relation.</Paragraph>
      <Paragraph position="2"> [Figure 3: Korean word, English word, WordNet] In Figure 3, hkw is a hypernym of a Korean word kw, hew is a translation of hkw, and ew is a translation of kw.</Paragraph>
      <Paragraph position="3"> This heuristic assigns score 1 to the synsets which satisfy the above assumption according to the following formula:</Paragraph>
      <Paragraph position="5"> In this formula, $IsA(s_1, s_2)$ returns true if $s_1$ is a kind of $s_2$.</Paragraph>
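      <Paragraph> The IS-A check can be sketched in Python as below. The scoring formula is missing from this copy, so the sketch assumes the simplest reading of the text: a candidate synset of ew scores 1 if it is a kind of some synset of hew, and 0 otherwise. The table HYPERNYM, the synset names, the function names, and the choice to treat a synset as a kind of itself are all assumptions for illustration.
<![CDATA[
```python
# Hypothetical hypernym links between WordNet synsets: child -> parent.
HYPERNYM = {
    "shrub": "woody_plant",
    "woody_plant": "plant",
    "plant": "entity",
    "wilderness": "region",
    "region": "entity",
}


def is_a(s1, s2):
    """True if s1 is a kind of s2, i.e. s2 is s1 itself or lies above it."""
    while s1 is not None:
        if s1 == s2:
            return True
        s1 = HYPERNYM.get(s1)  # None once the top of the chain is passed
    return False


def isa_score(s, hypernym_translation_synsets):
    """1 if candidate synset s is a kind of any synset of hew, else 0."""
    return int(any(is_a(s, h) for h in hypernym_translation_synsets))
```
]]>
      For example, if hew has the synset 'plant', the candidate 'shrub' satisfies the assumption while 'wilderness' does not.</Paragraph>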
    </Section>
    <Section position="5" start_page="143" end_page="144" type="sub_section">
      <SectionTitle>
2.5 Heuristic 5: Word Match
</SectionTitle>
      <Paragraph position="0"> This heuristic assumes that related concepts will be expressed using the same content words.</Paragraph>
      <Paragraph position="1"> Given two definitions, that of the bilingual dictionary and that of WordNet, this heuristic computes the total number of shared content words.</Paragraph>
      <Paragraph position="3"> In this formula, $X$ is the set of content words in the English examples of the bilingual dictionary and $Y$ is the set of content words of the definition and example of the synset $s_i$ in WordNet.</Paragraph>
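      <Paragraph> The overlap between X and Y can be sketched in Python as follows. The exact formula is missing from this copy, so the unnormalised count of shared words is an assumption, as are the small stopword list used to approximate content words and the function names.
<![CDATA[
```python
import re

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "of", "in", "and", "or", "with", "is", "to"}


def content_words(text):
    """Lowercased alphabetic tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {w for w in tokens if w not in STOPWORDS}


def word_match_score(dictionary_examples, synset_gloss):
    """Number of content words shared by X (dictionary) and Y (WordNet)."""
    x = content_words(dictionary_examples)
    y = content_words(synset_gloss)
    return len(x & y)
```
]]>
      A real system would likely use part-of-speech tagging or a fuller stopword list to identify content words; the set intersection is the part that corresponds to the heuristic.</Paragraph>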
    </Section>
    <Section position="6" start_page="144" end_page="144" type="sub_section">
      <SectionTitle>
2.6 Heuristic 6: Cooccurrence
</SectionTitle>
      <Paragraph position="0"> This heuristic uses a cooccurrence measure acquired from the sense-tagged Korean definition sentences of the bilingual dictionary. To build the sense-tagged corpus, we use the definition sentences that have a monosemous translation in the bilingual dictionary, and we use the 25 semantic tags of WordNet as sense tags:</Paragraph>
      <Paragraph position="2"> In this formula, Def is the set of content words of a Korean definition sentence, t is a semantic tag corresponding to the synset s and n refers to  all these 6 scores in combination to classify the synset as linking or discarding.</Paragraph>
      <Paragraph position="3"> The heuristics are combined by decision tree learning, which can capture non-linear relationships among the scores. Each internal node of a decision tree is a choice point, dividing the scores of an individual heuristic into ranges of possible values. Each leaf node is labeled with a classification (linking or discarding). The most popular method of decision tree induction, employed here, is C4.5 (Quinlan, 1993).</Paragraph>
      <Paragraph position="4"> Figure 4 shows the training phase of the decision-tree-based combination method. In the training phase, each candidate synset $ws_k$ of a Korean word is manually classified as linking or discarding and is assigned a score by each heuristic. A training data set is constructed from these scores and the manual classifications. The training data set is used to optimize a model for combining the heuristics.</Paragraph>
      <Paragraph position="5"> Figure 5 shows the mapping phase. In the mapping phase, a new candidate synset $ws_k$ of a Korean word is rated using the 6 heuristics, and then the decision tree learned in the training phase classifies $ws_k$ as linking or discarding. A synset classified as linking is linked to the corresponding Korean word.</Paragraph>
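      <Paragraph> The mapping phase can be illustrated with the short Python sketch below. It does not implement C4.5 induction: the tree TREE is written by hand as a stand-in for a learned tree, and its heuristic indices, thresholds, and the function names classify and map_word are invented for illustration. The sketch only shows how internal nodes split one heuristic score into ranges and how leaves carry the linking or discarding label.
<![CDATA[
```python
# A hand-written tree standing in for one learned by C4.5.
# Internal node: (heuristic_index, threshold, low_branch, high_branch).
# Leaf: the string "linking" or "discarding".
# All indices and thresholds below are invented for illustration.
TREE = (0, 0.5,
        (2, 0.7, "discarding", "linking"),
        (3, 0.5, (1, 0.4, "discarding", "linking"), "linking"))


def classify(scores, node=TREE):
    """Walk the tree with the six heuristic scores of one candidate synset."""
    while not isinstance(node, str):
        index, threshold, low, high = node
        node = high if scores[index] > threshold else low
    return node


def map_word(candidates):
    """Return the candidate synsets classified as linking.

    candidates maps a synset name to its list of six heuristic scores.
    """
    return [s for s, scores in candidates.items()
            if classify(scores) == "linking"]
```
]]>
      With a tree learned from the manually classified training data, the same traversal would decide which candidate synsets are linked to the Korean word.</Paragraph>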
    </Section>
  </Section>
</Paper>