<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1025">
  <Title>Building Japanese-English Dictionary based on Ontology for Machine Translation</Title>
  <Section position="3" start_page="141" end_page="142" type="metho">
    <SectionTitle>
2. Linguistic Resources
2.1. Ontology
</SectionTitle>
    <Paragraph position="0"> At USC/ISI, we have been constructing an ontology, a large-scale conceptual network, that serves three main purposes in the Pangloss MT system, which we are building together with CMT and NMSU. The first is to define the interlingua constituents, which comprise the semantic meanings of the input sentences independent of the source and target languages. They are defined in the ontology as concepts that represent commonly encountered objects, entities, qualities, and relations. As a result of analyzing the input text, the parsers of our MT system produce an interlingua representation built from these concepts.</Paragraph>
    <Paragraph position="1"> The second purpose is to describe semantic constraints among concepts in the ontology, which support the analysis and generation processes of the MT system. The third purpose is to act as a common unifying framework among the lexical items of the various languages. The ontology is being semi-automatically constructed from the lexical database WordNet [Miller, 1990] and the Longman Dictionary of Contemporary English (LDOCE) [Knight, 1993]. At present, the ontology contains over 70,000 items. English lexical items are associated with over 98% of the ontology. The ontology is also being linked to a lexicon of Spanish words, using the Collins Spanish-English bilingual dictionary. In our work, it is being linked to the Japanese lexicon developed for the JUMAN word identification and morphology system [Matsumoto et al., 1993b] by the algorithms described in this paper.</Paragraph>
    <Paragraph position="2"> The ontology consists of three regions: the upper (more abstract) region, the middle region, and the lower (domain-specific) region. The upper region of the ontology is called the Ontology Base (OB) and contains approximately 400 items that represent generalizations essential for the various modules' linguistic processing during translation. The middle region of the ontology, approximately 50,000 items, provides a framework for a generic world model, containing items representing many English and other word senses. The lower regions of the ontology provide anchor points for different application domains. Both the middle and domain model regions of the ontology house the open-class terms of the MT interlingua. They also contain specific information used to screen unlikely semantic and anaphoric interpretations.</Paragraph>
    <Paragraph position="3">  At USC/ISI, we employ the JUMAN morphological analyzer and the SAX parser for Japanese parsing [Matsumoto et al., 1993b; Matsumoto et al., 1993a]. These two modules use a lexicon of approximately 100,000 Japanese words. The lexicon contains spelling/orthography forms, morphological information, and part-of-speech annotations. To be useful for MT, the Japanese words should also carry English word-sense equivalents or semantic definitions. We obtain the information required for linking the JUMAN lexicon to the ontology concepts by employing a Japanese-English bilingual dictionary as a &quot;bridge&quot;.</Paragraph>
    <Section position="1" start_page="141" end_page="142" type="sub_section">
      <SectionTitle>
2.3. Bilingual Dictionary
</SectionTitle>
      <Paragraph position="0"> To link the unilingual Japanese JUMAN lexicon to the ontology, we employ a Japanese-English bilingual dictionary. This dictionary contains 75,000 words, providing Japanese-English word correspondences as shown in Figure 1. It is not difficult to link JUMAN lexical entries with the Japanese lexical items of the bilingual dictionary by simple string matching. Our problem is: how can we automatically find the appropriate ontology item, if any, corresponding to each Japanese lexical item? Since we assume that there is at least one sense shared by a Japanese word jw_i and its equivalent English words ew_i1, ew_i2, ..., ew_ij, we define that sense as the bilingual concept JWi_001. A bilingual concept JWi_k is assigned to the kth correspondence pair. For each bilingual concept, we have extracted from the dictionary lists of the lexical information necessary for MT processing: the Japanese word entry, including its definition, parts of speech, and syntactic and semantic constraints for the arguments; the English equivalent words, including synonyms; and bilingual example sentences. The lexical lists indexed by the bilingual concept are shown in Figure 2.</Paragraph>
      <Paragraph position="1"> For each bilingual concept, we replace information written in Japanese (such as the Japanese definition) by lists of English words for each Japanese word, by applying Japanese morphological analysis and the bilingual dictionary. We thereby obtain, for each Japanese word in the JUMAN lexicon that also appears in the bilingual dictionary, the raw material to which we can apply algorithms to link it to the ontology.</Paragraph>
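      <Paragraph position="2"> The lexical lists described above can be pictured as one record per bilingual concept. The following Python sketch is only an illustration, not the authors' implementation: the class name BilingualConcept, its field names, and the sample values are assumptions made here, and only the identifier pattern (for example TAMA_001) follows the text.

```python
from dataclasses import dataclass, field


@dataclass
class BilingualConcept:
    """One Japanese-English correspondence pair (hypothetical layout)."""
    japanese_word: str      # romanized headword, e.g. "tama"
    sense_index: int        # k, the position of the correspondence pair
    part_of_speech: str
    english_equivalents: list = field(default_factory=list)
    argument_constraints: dict = field(default_factory=dict)
    examples: list = field(default_factory=list)

    @property
    def concept_id(self) -> str:
        # Build an identifier such as TAMA_001 from the word and sense index.
        return f"{self.japanese_word.upper()}_{self.sense_index:03d}"


tama = BilingualConcept("tama", 1, "noun", ["ball", "globe"])
print(tama.concept_id)  # TAMA_001
```

Keeping the English equivalents, argument constraints, and examples on the same record means each of the matching algorithms in Section 3 can read its input from a single object.</Paragraph>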
    </Section>
  </Section>
  <Section position="4" start_page="142" end_page="143" type="metho">
    <SectionTitle>
3. Concept Association Algorithms
</SectionTitle>
    <Paragraph position="0"> There are four cases in associating ontology concepts with equivalent bilingual concepts. case-I: Single-to-single association. A bilingual concept leads to one equivalent English word. The English word is linked to one ontology concept. Therefore, the bilingual concept is linked to one ontology concept as shown in Figure 3.</Paragraph>
    <Paragraph position="1"> case-II: Single-to-multiple association. A bilingual concept leads to one equivalent English word. The English word is linked to several ontology concepts. Therefore, the bilingual concept is linked to several ontology concepts as shown in Figure 4.</Paragraph>
    <Paragraph position="2"> case-III: Multiple-to-single association. A bilingual concept leads to several equivalent English words. The English words are linked to one ontology concept. Therefore, the bilingual concept is linked to one ontology concept as shown in Figure 5.</Paragraph>
    <Paragraph position="3"> case-IV: Multiple-to-multiple association. A bilingual concept leads to several equivalent English words. Each English word is linked to several ontology concepts. Therefore, the bilingual concept is linked to several ontology concepts as shown in Figure 6.</Paragraph>
    <Paragraph position="4">  case-IV. The equivalent-word match is designed for case-IV. The argument match and the example match are designed for case-II and for complementing the equivalent-word match.</Paragraph>
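    <Paragraph position="5"> The four cases can be told apart from two counts: how many English equivalents a bilingual concept has, and how many distinct ontology concepts those equivalents reach. The Python sketch below illustrates this under stated assumptions: the function name classify_association, the word-to-concepts mapping, and the toy data are hypothetical, and treating the union of linked concepts as the candidate set is a simplification of the case definitions.

```python
def classify_association(equivalents, word_to_concepts):
    """Return the case label I, II, III, or IV for one bilingual concept.

    equivalents: English equivalent words of the bilingual concept.
    word_to_concepts: hypothetical mapping from an English word to the
    set of ontology concepts it is linked to.
    """
    linked = [word_to_concepts.get(word, set()) for word in equivalents]
    candidates = set().union(*linked)
    if not candidates:
        return None  # no English equivalent is linked to the ontology
    multiple_words = len(equivalents) > 1
    multiple_concepts = len(candidates) > 1
    if not multiple_words:
        return "II" if multiple_concepts else "I"
    return "IV" if multiple_concepts else "III"


# Toy links with placeholder names, not taken from the real ontology.
toy_links = {"a": {"c1"}, "b": {"c1", "c2"}, "c": {"c1"}}
print(classify_association(["a"], toy_links))       # I
print(classify_association(["b"], toy_links))       # II
print(classify_association(["a", "c"], toy_links))  # III
print(classify_association(["a", "b"], toy_links))  # IV
```

Only the labels II and IV call for the disambiguating matches described in the following subsections.</Paragraph>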
    <Section position="1" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
3.1. Equivalent-word Match
</SectionTitle>
      <Paragraph position="0"> The equivalent-word match algorithm is based on the algorithm developed by K. Knight for merging LDOCE and WordNet [Knight, 1993] and on Knight's bilingual match algorithm [Knight, 1994]. The equivalent-word match searches for concept equivalences by intersecting the sets of ontology concepts linked to each English equivalent word of the bilingual concept.</Paragraph>
      <Paragraph position="1"> Higher confidence is assigned to the concepts whose part of speech corresponds to the ontology type. For example, the Japanese noun &quot;Tama&quot; has nine senses in the dictionary. One of these senses is shown in Figure 7.</Paragraph>
      <Paragraph position="2"> The bilingual concept TAMA_001 is represented by two English words: &quot;ball&quot; and &quot;globe&quot;. There are six and three concepts in the ontology for &quot;ball&quot; and &quot;globe&quot;, respectively, as shown in Figure 8. By intersecting the ontology concepts for &quot;ball&quot; with the ontology concepts for &quot;globe&quot;, TAMA_001 can be associated with the ontology concept ball_0_1 with a fairly high level of confidence.</Paragraph>
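      <Paragraph position="3"> The intersection step can be sketched in a few lines of Python. This is an illustration under assumptions rather than the algorithm of [Knight, 1993]: the function name equivalent_word_match and the dictionary layout are hypothetical, and of the concept identifiers below only ball_0_1 appears in the text, while the others are placeholders invented to give six and three candidates. The part-of-speech confidence adjustment is not modeled.

```python
def equivalent_word_match(equivalents, word_to_concepts):
    """Intersect the ontology concepts linked to every English equivalent."""
    concept_sets = [set(word_to_concepts.get(w, ())) for w in equivalents]
    if not concept_sets:
        return set()
    return set.intersection(*concept_sets)


# Placeholder links: only ball_0_1 is named in the text.
links = {
    "ball": {"ball_0_1", "ball_0_2", "ball_0_3",
             "ball_0_4", "ball_0_5", "ball_0_6"},
    "globe": {"ball_0_1", "globe_0_1", "globe_0_2"},
}
print(equivalent_word_match(["ball", "globe"], links))  # {'ball_0_1'}
```

When a bilingual concept has only one English equivalent, the intersection returns every concept linked to that word, which is why the argument and example matches are needed as complements.</Paragraph>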
    </Section>
    <Section position="2" start_page="142" end_page="143" type="sub_section">
      <SectionTitle>
3.2. Argument Match
</SectionTitle>
      <Paragraph position="0"> The argument match collates Japanese argument constraints with ontology argument constraints. The argument match complements the equivalent-word match, because not all the lists contain two or more English equivalent words. For example, the Japanese verb &quot;utsusu&quot; has five senses in the dictionary. One of these senses is shown in Figure 9. There are three concepts linked to &quot;infect&quot; in the ontology as shown in Figure 10. Ontology concept infect_0_2 contains an argument constraint such as &quot;Somebody infects somebody with some disease.&quot; When the algorithm matches the argument constraints, the ontology concept infect_0_2 is found to contain argument constraints similar to those of the bilingual concept UTSUSU_004. The algorithm therefore assigns higher confidence to the association of UTSUSU_004 and infect_0_2.</Paragraph>
    </Section>
    <Section position="3" start_page="143" end_page="143" type="sub_section">
      <SectionTitle>
3.3. Example Match
</SectionTitle>
      <Paragraph position="0"> In order to complement the above two matches, the example match algorithm compares the bilingual examples with the ontology examples and definition sentences. By measuring the similarity of the two sets of examples, the algorithm determines the similarity of the concepts. For example, the Japanese noun &quot;ginkou&quot; has one sense in the dictionary. The sense is shown in Figure 11. There are four concepts linked to &quot;bank&quot; in the ontology as shown in Figure 12. The algorithm calculates the similarity of two word sets (the words contained in the bilingual examples and the words contained in the ontology examples and definition sentence) by simply intersecting the two sets of words after transforming them to canonical dictionary entry forms and removing function words. In the case of the GINKOU_001 example set and the bank example sets, GINKOU_001 and bank_0_3 share the maximum number of words: &quot;deposit&quot; and &quot;money&quot;. As a result, GINKOU_001 is strongly associated with the ontology concept bank_0_3.</Paragraph>
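      <Paragraph position="1"> The word-set comparison can be sketched in Python as follows. This is a toy illustration under assumptions: the function names, the small function-word list, and the lemma table stand in for a real morphological component, and the example sentences and the two bank senses shown are invented here rather than taken from Figures 11 and 12.

```python
import string

# Tiny stand-ins for a function-word list and a lemmatizer (assumptions).
FUNCTION_WORDS = {"a", "an", "the", "in", "of", "to", "i", "my",
                  "where", "is", "for", "and", "or"}
LEMMAS = {"deposited": "deposit", "deposits": "deposit", "banks": "bank"}


def content_words(sentences):
    """Lower-case, strip punctuation, lemmatize, and drop function words."""
    words = set()
    for sentence in sentences:
        for token in sentence.lower().split():
            token = token.strip(string.punctuation)
            token = LEMMAS.get(token, token)
            if token and token not in FUNCTION_WORDS:
                words.add(token)
    return words


def example_match(bilingual_examples, ontology_examples):
    """Return (concept, shared words) pairs, most shared words first."""
    source = content_words(bilingual_examples)
    overlaps = [(concept, source.intersection(content_words(texts)))
                for concept, texts in ontology_examples.items()]
    return sorted(overlaps, key=lambda pair: len(pair[1]), reverse=True)


ginkou_001 = ["I deposited my money in the bank."]
bank_senses = {
    "bank_0_1": ["sloping land beside a river"],
    "bank_0_3": ["an institution where money is deposited for safekeeping"],
}
concept, shared = example_match(ginkou_001, bank_senses)[0]
print(concept, sorted(shared))  # bank_0_3 ['deposit', 'money']
```

Counting shared content words is a deliberately simple similarity measure, in keeping with the description of the match as a plain intersection of the two word sets.</Paragraph>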
    </Section>
  </Section>
</Paper>