<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1036">
  <Title>BUILDING A LARGE ONTOLOGY FOR MACHINE TRANSLATION</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The PANGLOSS project is a three-site collaborative effort to build a large-scale knowledge-based machine translation system. Key components of PANGLOSS include New Mexico State University's ULTRA parser \[Farwell and Wilks, 1991\], Carnegie Mellon's interlingua representation format \[Nirenburg and Defrise, 1991\], and USC/ISI's PENMAN English generation system \[Penman, 1989\]. Another key component currently under construction at ISI is the PANGLOSS ontology, a large-scale conceptual network intended to support semantic processing in other PANGLOSS modules. This network will contain 50,000 nodes representing commonly encountered objects, entities, qualities, and relations.</Paragraph>
    <Paragraph position="1"> The upper (more abstract) region of the ontology is called the Ontology Base (OB) and contains approximately 400 items that represent generalizations essential for the various PANGLOSS modules' linguistic processing during translation. The middle region of the ontology, approximately 50,000 items, provides a framework for a generic world model, containing items representing many English word senses. The lower (more specific) regions of the ontology provide anchor points for different application domains. Both the middle and domain model regions of the ontology house the open-class terms of the MT interlingua. They also contain specific information used to screen unlikely semantic and anaphoric interpretations.</Paragraph>
    <Paragraph position="2"> The Ontology Base is a synthesis of USC/ISI's PENMAN Upper Model \[Bateman, 1990\] and CMU's ONTOS concept hierarchy \[Carlson and Nirenburg, 1990\].</Paragraph>
    <Paragraph position="3"> Both of these high-level ontologies were built by hand, and they were merged manually. Theoretical motivations behind the OB and its current status are described in \[Hovy and Knight, 1993\].</Paragraph>
    <Paragraph position="4"> The problem we focus on in this paper is the construction of the large middle region of the ontology. Because large-scale knowledge resources are difficult to build by hand, we are pursuing primarily automatic methods applied in several stages. During the first stage we created several tens of thousands of nodes, organized them into sub/superclass taxonomies, and subordinated those taxonomies to the 400-node Ontology Base. This work we describe below. Later stages will address the insertion of additional semantic information such as restrictions on actors in events, domain/range constraints on relations, and so forth.</Paragraph>
    <Paragraph position="5"> For the major node creation and taxonomization stage, we have primarily used two on-line sources of information: (1) the Longman Dictionary of Contemporary English (LDOCE) \[Group, 1978\], and (2) the lexical database WordNet \[Miller, 1990\].</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="185" type="metho">
    <SectionTitle>
2. Merging LDOCE and WordNet
</SectionTitle>
    <Paragraph position="0"> LDOCE is a learner's dictionary of English with 27,758 words and 74,113 word senses. Each word sense comes with: A short definition. One of the unique features of LDOCE is that its definitions use only words from a "control vocabulary" list of 2000 words. This makes it attractive from the point of view of extracting semantic information by parsing dictionary entries.</Paragraph>
    <Paragraph position="1"> Examples of usage.</Paragraph>
    <Paragraph position="2"> One or more of 81 syntactic codes.</Paragraph>
    <Paragraph position="3"> For nouns, one of 33 semantic codes.</Paragraph>
    <Paragraph position="4"> For nouns, one of 124 pragmatic codes.</Paragraph>
    <Paragraph position="5"> WordNet is a semantic word database based on psycholinguistic principles. Its size is comparable to LDOCE, but its information is organized in a completely different manner. WordNet groups synonymous word senses into single units ("synsets"). Noun senses are organized into a deep hierarchy, and the database also contains part-of links, antonym links, and others. Approximately 55% of WordNet synsets have brief informal definitions.</Paragraph>
    <Paragraph position="6"> Each of these resources has something to offer a large-scale natural language system, but each is missing important features present in the other. What we need is a combination of the features of both.</Paragraph>
    <Paragraph position="7"> Our most significant project to date has been to merge LDOCE and WordNet. This involves producing a list of matching pairs of word senses, e.g.:</Paragraph>
  </Section>
  <Section position="5" start_page="185" end_page="185" type="metho">
    <SectionTitle>
LDOCE WORDNET
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> ... Section 4 describes how we produced this list semi-automatically. Solving this problem yields several benefits:</Paragraph>
  </Section>
  <Section position="6" start_page="185" end_page="185" type="metho">
    <SectionTitle>
3. Related Work
</SectionTitle>
    <Paragraph position="0"> Our ontology is a symbolic model for fueling semantic processing in a knowledge-based MT system. We are aiming at broader coverage (dictionary-scale) than has previously been available to symbolic MT systems. Also, we are committed to automatic and semi-automatic methods of knowledge acquisition from the start. This, and the fact that we are concentrating on a particular language-processing application, distinguishes the PANGLOSS work from the CYC knowledge base \[Lenat and Guha, 1990\]. We also believe that dictionaries and corpora are imperfect sources of knowledge, so we still employ human effort to check the results of our semi-automatic algorithms. This is in contrast to purely statistical systems (e.g., \[Brown et al., 1992\]), which are difficult to inspect and modify.</Paragraph>
    <Paragraph position="1"> There has been considerable use in the NLP community of both WordNet (e.g., \[Lehman et al., 1992; Resnik, 1992\]) and LDOCE (e.g., \[Liddy et al., 1992; Wilks et al., 1990\]), but no one has merged the two in order to combine their strengths. The next section describes our approach in detail.</Paragraph>
  </Section>
  <Section position="7" start_page="185" end_page="186" type="metho">
    <SectionTitle>
4. Algorithms and Results
</SectionTitle>
    <Paragraph position="0"> We have developed two algorithms for merging LDOCE and WordNet. Both algorithms generate lists of sense pairs, where each pair consists of one sense from LDOCE and the proposed matching sense from WordNet, if any.</Paragraph>
    <Section position="1" start_page="185" end_page="186" type="sub_section">
      <SectionTitle>
4.1. Definition Match
</SectionTitle>
      <Paragraph position="0"> The Definition Match algorithm is based on the idea that two word senses should be matched if their two definitions share words. For example, there are two noun definitions of "batter" in LDOCE: (batter_2_0) "mixture of flour, eggs, and milk, beaten together and used in cooking" and (batter_3_0) "a person who bats, esp in baseball - compare BATSMAN"; and two definitions in WordNet: (BATTER-1) "ballplayer who bats" and (BATTER-2) "a flour mixture thin enough to pour or drop from a spoon". The Definition Match algorithm will match (batter_2_0) with (BATTER-2) because their definitions share words like "flour" and "mixture." Similarly, (batter_3_0) and (BATTER-1) both contain the word "bats," so they are also matched together.</Paragraph>
      <Paragraph position="1"> Not all senses in WordNet have definitions, but most have synonyms and superordinates. For this reason, the algorithm looks not only at WordNet definitions, but also at locally related words and senses. For example, if  synonyms of WordNet sense x appear in the definition of LDOCE sense y, then this is evidence that x and y should be matched.</Paragraph>
      <Paragraph position="2"> Here is the algorithm: Definition-Match. For each English word w found in both LDOCE and WordNet: 1. Let n be the number of senses of w in LDOCE. 2. Let m be the number of senses of w in WordNet. 3. Identify and stem all open-class content words in the definitions (and example sentences) of all senses of w in both resources.</Paragraph>
      <Paragraph position="5"> 4. Let ULD be the union of all stemmed content words appearing in LDOCE definitions.</Paragraph>
      <Paragraph position="6"> 5. Let UWN be the same for WordNet, plus all synonyms of the senses, their direct superordinates, siblings, super-superordinates, as well as stemmed content words from the definitions of direct superordinates.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="186" end_page="188" type="metho">
    <SectionTitle>
6. Let CW = (ULD ∩ UWN) - {w}. These are definition
</SectionTitle>
    <Paragraph position="0"> words common to LDOCE and WordNet.</Paragraph>
    <Paragraph position="1"> 7. Create matrix L of the n LDOCE senses and the words from CW. For all 0 ≤ i &lt; n and 0 ≤ x &lt; |CW|, the entry L\[i, x\] takes one value if the definition of sense i in LDOCE contains word x, and another value otherwise.</Paragraph>
    <Paragraph position="3"> 8. Create matrix W of the m WordNet senses and the words from CW. For all 0 ≤ j &lt; m and 0 ≤ x &lt; |CW|, the entry W\[x, j\] takes one of four values, depending on which case holds: (i) x is a synonym or superordinate of sense j in WordNet; (ii) x is contained in the definition of sense j or the definition of its superordinate; (iii) x is a sibling or super-superordinate of sense j in WordNet; (iv) otherwise.</Paragraph>
    <Paragraph position="5"> 9. Create similarity matrix SIM of LDOCE and WordNet senses. For all 0 ≤ i &lt; n and 0 ≤ j &lt; m: SIM\[i, j\] = ( Σ_{x=0}^{|CW|-1} L\[i, x\] · W\[x, j\] ) / |CW|. 10. Repeat until SIM is a zero matrix: (a) Let SIM\[y, z\] be the largest value in the SIM matrix.</Paragraph>
    <Paragraph position="6"> (b) Generate a matched pair of LDOCE sense y and WordNet sense z.</Paragraph>
    <Paragraph position="7"> (c) For all 0 ≤ i &lt; n, set SIM\[i, z\] = 0.0. (d) For all 0 ≤ j &lt; m, set SIM\[y, j\] = 0.0. In constructing the SIM matrix, the algorithm computes a similarity measure for each of the n·m possible pairs of LDOCE and WordNet senses. This measure, SIM\[i, j\], is a number from 0 to 1, with 1 being as good a match as possible. Thus, every matching pair proposed by the algorithm comes with a confidence factor. Empirical results are as follows. We ran the algorithm over all nouns in both LDOCE and WordNet. We judged the correctness of its proposed matches, keeping records of the confidence levels and the degree of ambiguity present.</Paragraph>
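    <Paragraph> Steps 1 through 10 can be illustrated with a minimal Python sketch, run on the "batter" definitions quoted in Section 4.1. This is not the PANGLOSS implementation: the four weight constants are placeholder values chosen only for illustration, all function, variable, and dictionary-key names are hypothetical, stemming is done by hand, and the WordNet synonym and sibling sets are left empty because the text does not list them.

```python
from typing import Dict, List, Set, Tuple

# Placeholder weights (assumptions for illustration, not values from the text).
STRONG, DEFN, WEAK, FLOOR = 1.0, 0.8, 0.6, 0.01


def definition_match(
    word: str,
    ldoce: Dict[str, Set[str]],
    wordnet: Dict[str, Dict[str, Set[str]]],
) -> List[Tuple[str, str, float]]:
    """Return (LDOCE sense, WordNet sense, confidence) triples for one word."""
    uld = set().union(*ldoce.values())                                   # step 4
    uwn = set().union(*(s for rel in wordnet.values() for s in rel.values()))  # step 5
    cw = sorted(uld.intersection(uwn) - {word})                          # step 6
    if not cw:
        return []

    def l_entry(i: str, x: str) -> float:                                # step 7
        return STRONG if x in ldoce[i] else FLOOR

    def w_entry(x: str, j: str) -> float:                                # step 8
        rel = wordnet[j]
        if x in rel["syn_super"]:
            return STRONG
        if x in rel["definition"]:
            return DEFN
        if x in rel["sib_supersuper"]:
            return WEAK
        return FLOOR

    sim = {                                                              # step 9
        (i, j): sum(l_entry(i, x) * w_entry(x, j) for x in cw) / len(cw)
        for i in ldoce
        for j in wordnet
    }
    matches = []
    while any(v > 0.0 for v in sim.values()):                            # step 10
        (y, z), score = max(sim.items(), key=lambda kv: kv[1])           # 10(a)
        matches.append((y, z, score))                                    # 10(b)
        for key in sim:                                                  # 10(c), 10(d)
            if key[0] == y or key[1] == z:
                sim[key] = 0.0
    return matches


# Hand-stemmed content words from the definitions quoted in the text.
LDOCE_BATTER = {
    "batter_2_0": {"mixture", "flour", "egg", "milk", "beat", "use", "cook"},
    "batter_3_0": {"person", "bat", "baseball"},
}
WORDNET_BATTER = {
    "BATTER-1": {"syn_super": set(), "definition": {"ballplayer", "bat"},
                 "sib_supersuper": set()},
    "BATTER-2": {"syn_super": set(),
                 "definition": {"flour", "mixture", "thin", "pour", "drop", "spoon"},
                 "sib_supersuper": set()},
}

pairs = definition_match("batter", LDOCE_BATTER, WORDNET_BATTER)
for ldoce_sense, wn_sense, confidence in pairs:
    print(ldoce_sense, wn_sense, round(confidence, 3))
```

    With these assumed weights, the greedy step pairs (batter_2_0) with (BATTER-2) first and then (batter_3_0) with (BATTER-1), in line with the example in Section 4.1. The small floor weight keeps every cell of SIM above zero, so the loop ends only once each row or column has been consumed.</Paragraph>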
    <Paragraph position="8"> For low-ambiguity words (words with exactly two senses in LDOCE and two in WordNet), the results are: (table columns: confidence level; pct. correct; pct. coverage)</Paragraph>
    <Paragraph position="9"> At confidence levels &gt; 0.0, 75% of the proposed matches are correct. If we restrict ourselves to matches proposed at confidence ≥ 0.8, accuracy increases to 90%, but we get only 27% of the possible matches. For high-ambiguity words (more than five senses in LDOCE and WordNet), the results are: (table columns: confidence level; pct. correct; pct. coverage)</Paragraph>
    <Paragraph position="10"> Accuracy here is worse, but increases sharply when we consider only high-confidence matches. The algorithm's performance is quite reasonable, given that 45% of WordNet senses have no definitions and that many existing definitions are brief and contain misspellings. Still, there are several improvements to be made, e.g., modifying the "greedy" strategy in which matches are extracted from the SIM matrix, weighing rare words in definitions more highly than common ones, and/or scoring senses with long definitions lower than ones with short definitions. These improvements yield only slightly better results, however, because most failures are simply due to the fact that matching sense definitions have no words in common. For example, "seal" has 5 noun senses in LDOCE, one of which is: (seal_1_1) "any of several types of large fish-eating animals living mostly on cool seacoasts and floating ice, with broad flat limbs (FLIPPERs) suitable for swimming" WordNet has 7 definitions of "seal," one of which is: (SEAL-7) "any of numerous marine mammals that come on shore to breed; chiefly of cold regions" The Definition Match algorithm cannot see any similarity between (seal_1_1) and (SEAL-7), so it does not match them. However, we have developed another match algorithm that can handle cases like these.</Paragraph>
    <Section position="1" start_page="187" end_page="188" type="sub_section">
      <SectionTitle>
4.2. Hierarchy Match
</SectionTitle>
      <Paragraph position="0"> The Hierarchy Match algorithm dispenses with sense definitions altogether. Instead, it uses the various sense hierarchies inside LDOCE and WordNet.</Paragraph>
      <Paragraph position="1"> WordNet noun senses are arranged in a deep is-a hierarchy. For example, SEAL-7 is a PINNIPED-1, which is an AQUATIC-MAMMAL-1, which is a EUTHERIAN-1, which is a MAMMAL-1, which is ultimately an ANIMAL-1, and so forth.</Paragraph>
      <Paragraph position="2"> LDOCE has two fairly flat hierarchies. The semantic code hierarchy is induced by a set of 33 semantic codes drawn up by Longman lexicographers. Each sense is marked with one of these codes, e.g., "H" for human, "P" for plant, "J" for movable object. The other hierarchy is the genus sense hierarchy. Researchers at New Mexico State University have built an automatic algorithm \[Bruce and Guthrie, 1992\] for locating and disambiguating genus terms (head nouns) in sense definitions. For example, (bat_1_1) is defined as "any of the several types of specially shaped wooden stick used for ..." The genus term for (bat_1_1) is (stick_1_1). As another example, the genus sense of (aisle_0_1) is (passage_0_7). The genus sense and the semantic code hierarchies were extracted automatically from LDOCE. The semantic code hierarchy is fairly robust, but since the genus sense hierarchy was generated heuristically, it is only 80% correct. The idea of the Hierarchy Match algorithm is that once two senses are matched, it is worthwhile to look at their respective ancestors and descendants for further matches. For example, once (animal_1_2) and ANIMAL-1 are matched, we can look into the respective animal subhierarchies. We find that the word "seal" is locally unambiguous: only one sense of "seal" refers to an animal (in both LDOCE and WordNet). So we feel confident in matching those seal-animal senses. As another example, suppose we know that (swan_dive_0_0) is the same concept as (SWAN-DIVE-1). We can then match their superordinates (dive_2_1) and (DIVE-3) with high confidence; we need not consider other senses of "dive." Here is the algorithm: Hierarchy-Match 1. Initialize the set of matches: (a) Retrieve all words that are unambiguous in both LDOCE and WordNet. Match their corresponding senses, and place all the matches on a list called M1.</Paragraph>
      <Paragraph position="3"> (b) Retrieve a prepared list of hand-crafted matches. Place these matches on a list called M2. We created 15 of these, mostly high-level matches like (person_0_1, PERSON-2) and (plant_2_1, PLANT-3). This step is not strictly necessary, but provides guidance to the algorithm.</Paragraph>
      <Paragraph position="4"> 2. Repeat until M1 and M2 are empty: (a) For each match on M2, look for words that are unambiguous within the hierarchies rooted at the two matched senses. Match the senses of locally unambiguous words and place the matches on M1.</Paragraph>
      <Paragraph position="5"> (b) Move all matches from M2 to a list called M3. (c) For each match on M1, look upward in the two  hierarchies from the matched senses. Whenever a word appears in both hierarchies, match the corresponding senses, and place the match on M2.</Paragraph>
      <Paragraph position="6">  (d) Move all matches from M1 to M2.</Paragraph>
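      <Paragraph> The M1/M2/M3 bookkeeping in steps 1 and 2 can be sketched in Python on two toy hierarchies. This is an illustrative sketch under stated assumptions, not the authors' code: the class and function names are hypothetical, each sense has at most one parent, the WordNet chain above SEAL-7 is shortened, the sense keys ending in "other" are invented stand-ins for the remaining senses of each word, and the seen set (which stops an already known match from being queued again so that the loop terminates) is an addition not described in the text.

```python
from typing import Dict, List, Tuple

Match = Tuple[str, str]  # (LDOCE sense, WordNet sense)


class Hierarchy:
    """Toy is-a hierarchy: every sense has one word and at most one parent."""

    def __init__(self, word_of: Dict[str, str], parent_of: Dict[str, str]):
        self.word_of = word_of
        self.parent_of = parent_of

    def ancestors(self, sense: str) -> List[str]:
        chain = []
        while sense in self.parent_of:
            sense = self.parent_of[sense]
            chain.append(sense)
        return chain

    def descendants(self, root: str) -> List[str]:
        return [s for s in self.word_of if root in self.ancestors(s)]

    def by_word(self, senses: List[str]) -> Dict[str, List[str]]:
        index: Dict[str, List[str]] = {}
        for s in senses:
            index.setdefault(self.word_of[s], []).append(s)
        return index


def unambiguous(ld: Hierarchy, ld_senses, wn: Hierarchy, wn_senses) -> List[Match]:
    """Pair up words that have exactly one sense on each side."""
    a, b = ld.by_word(ld_senses), wn.by_word(wn_senses)
    return [(a[w][0], b[w][0]) for w in a
            if len(a[w]) == 1 and len(b.get(w, [])) == 1]


def hierarchy_match(ld: Hierarchy, wn: Hierarchy, seeds: List[Match]) -> List[Match]:
    m1 = unambiguous(ld, list(ld.word_of), wn, list(wn.word_of))  # step 1(a)
    m2 = [p for p in seeds if p not in m1]                         # step 1(b)
    m3: List[Match] = []
    seen = set(m1) | set(m2)  # assumption: never re-queue a known match

    def queue(target: List[Match], found: List[Match]) -> None:
        for p in found:
            if p not in seen:
                seen.add(p)
                target.append(p)

    while m1 or m2:
        for ls, ws in m2:                                          # step 2(a)
            queue(m1, unambiguous(ld, ld.descendants(ls), wn, wn.descendants(ws)))
        m3.extend(m2)                                              # step 2(b)
        m2 = []
        for ls, ws in m1:                                          # step 2(c)
            up = wn.by_word(wn.ancestors(ws))
            for anc in ld.ancestors(ls):
                queue(m2, [(anc, b) for b in up.get(ld.word_of[anc], [])])
        m2.extend(m1)                                              # step 2(d)
        m1 = []
    return m3


ldoce = Hierarchy(
    word_of={"animal_1_2": "animal", "animal_other": "animal",
             "seal_1_1": "seal", "seal_other": "seal",
             "swan_dive_0_0": "swan dive",
             "dive_2_1": "dive", "dive_other": "dive"},
    parent_of={"seal_1_1": "animal_1_2", "swan_dive_0_0": "dive_2_1"},
)
wordnet = Hierarchy(
    word_of={"ANIMAL-1": "animal", "ANIMAL-other": "animal", "MAMMAL-1": "mammal",
             "SEAL-7": "seal", "SEAL-other": "seal",
             "SWAN-DIVE-1": "swan dive",
             "DIVE-3": "dive", "DIVE-other": "dive"},
    parent_of={"SEAL-7": "MAMMAL-1", "MAMMAL-1": "ANIMAL-1", "SWAN-DIVE-1": "DIVE-3"},
)
matches = hierarchy_match(ldoce, wordnet, seeds=[("animal_1_2", "ANIMAL-1")])
print(matches)
```

      On this toy data the seed match for "animal" lets the downward step pick out the animal sense of "seal", and the globally unambiguous "swan dive" lets the upward step pair (dive_2_1) with (DIVE-3) without considering the other senses of "dive".</Paragraph>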
      <Paragraph position="7"> The algorithm operates in phases, shifting matches from M1 to M2 to M3, placing newly generated matches on M1 and M2. Once M1 and M2 are exhausted, M3 contains the final list of matches proposed by the algorithm. Again, we can measure the success of the algorithm along two dimensions, coverage and correctness. Step 1 proposed 7563 matches, of which 99% were correct. In the end, the algorithm produced 11,128 matches at 96% accuracy. We expected 100% accuracy, but the algorithm was foiled at several places by errors in one or another of the hierarchies. For example, (savings_bank_0_0) is mistakenly a subclass of river-bank (bank_1_1) in the LDOCE genus hierarchy, rather than (bank_1_4), the money-bank. "Savings bank" senses are matched in step 1(a), so step 2(c) erroneously goes on to match the river-bank of LDOCE with the money-bank of WordNet. Fortunately, the Definition and Hierarchy Match algorithms complement one another, and there are several ways to combine them. Our practical experience has been to run the Hierarchy Match algorithm to completion, remove the matched senses from the databases, then run the Definition Match algorithm. The Definition Match algorithm's performance improves slightly after hierarchy matching removes some word senses. Once the high-confidence definition matches have been verified, we use them as fuel for another run of the Hierarchy Match algorithm.</Paragraph>
      <Paragraph position="8"> We have built an interface that allows a person to verify matches produced by both algorithms, and to reject or correct faulty matches. So far, we have 15,000 correct matches, with 10,000 to follow shortly. The next section describes what we do with them in our ontology.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="188" end_page="189" type="metho">
    <SectionTitle>
5. The Current Ontology
</SectionTitle>
    <Paragraph position="0"> The ontology currently contains 15,000 noun senses from LDOCE and 20,000 more from WordNet. Its purpose is to support semantic processing in the PANGLOSS analysis and generation modules. Because we have not yet taxonomized adjective and verb senses (see Section 6), semantic support is still very limited.</Paragraph>
    <Paragraph position="1"> On the generation side, the PENMAN system requires that all concepts be subordinated to the PENMAN Upper Model, which is part of the Ontology Base (OB). It is difficult to subordinate tens of thousands of LDOCE word senses to the OB individually, but if we instead subordinate various WordNet hierarchies to the OB, the LDOCE senses will follow automatically via the WordNet-LDOCE merge.</Paragraph>
    <Paragraph position="2"> Subordinating the WordNet noun hierarchy to the OB required about 100 manual operations. Each operation either merged a WordNet concept with an OB equivalent, inserted one or more WordNet concepts between two OB concepts, or attached a WordNet concept below an OB concept. The noun senses from WordNet (and their matches from LDOCE) fall under all three of the OB's primary top-level categories of OBJECT, PROCESS, and QUALITY. The PENMAN generator now has access to the semantic knowledge it needs to generate a broad range of English.</Paragraph>
    <Paragraph position="3"> To support parsing, we have manually added about 20 mutual-disjoint assertions into the ontology. One of these assertions states that no individual can be both an INANIMATE-OBJECT and an ANIMATE-OBJECT, another states that PERSON and NON-HUMAN-ANIMAL are mutually disjoint, and so forth. A parser can use such information to disambiguate sentences like "this crane is my pet," where "crane" and "pet" have several senses in LDOCE (crane_1_1, a machine; crane_1_2, a bird; pet_1_1, a domestic animal; pet_1_2, a favorite person; etc.). The only pair of senses that are not mutually disjoint in our ontology is (crane_1_2)/(pet_1_1), so this is the preferred interpretation. So far, all mutual-disjoint links are between OB concepts. We plan a study of our lexicon to determine which nouns have senses that are not distinguishable on the basis of mutual-disjointness, and this will drive further knowledge acquisition of these assertions. We are now integrating the ontology with ULTRA, the Prolog-based parsing component of the PANGLOSS translator. Although ULTRA parses Spanish input for PANGLOSS, the lexical items have already been semantically tagged with LDOCE sense keys, so no large-scale knowledge acquisition is necessary. Our first step has been to produce a Prolog version of the ontology, with inference rules for inheritance and propagation of mutual-disjoint links.</Paragraph>
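    <Paragraph> The way inherited mutual-disjoint links single out one reading of "this crane is my pet" can be sketched in a few lines of Python. This is a hedged illustration rather than the Prolog rules mentioned above: the two disjointness assertions come from the text, but the attachment of each LDOCE sense under a particular OB concept, the placement of PERSON and NON-HUMAN-ANIMAL under ANIMATE-OBJECT, and all function names are assumptions made for the example.

```python
from itertools import product
from typing import List, Tuple

# Assumed is-a links (child to parent); only the concept names come from the text.
PARENT = {
    "crane_1_1": "INANIMATE-OBJECT",      # a machine
    "crane_1_2": "NON-HUMAN-ANIMAL",      # a bird
    "pet_1_1": "NON-HUMAN-ANIMAL",        # a domestic animal
    "pet_1_2": "PERSON",                  # a favorite person
    "NON-HUMAN-ANIMAL": "ANIMATE-OBJECT",
    "PERSON": "ANIMATE-OBJECT",
}

# The two mutual-disjoint assertions quoted in the text.
DISJOINT = {
    frozenset(("INANIMATE-OBJECT", "ANIMATE-OBJECT")),
    frozenset(("PERSON", "NON-HUMAN-ANIMAL")),
}


def lineage(concept: str) -> List[str]:
    """The concept itself followed by all of its ancestors."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def are_disjoint(a: str, b: str) -> bool:
    """Disjointness is inherited: any disjoint pair of ancestors suffices."""
    return any(frozenset((x, y)) in DISJOINT
               for x, y in product(lineage(a), lineage(b)))


def compatible_readings(senses_a: List[str], senses_b: List[str]) -> List[Tuple[str, str]]:
    """Sense pairs that could describe the same individual."""
    return [(a, b) for a, b in product(senses_a, senses_b) if not are_disjoint(a, b)]


readings = compatible_readings(["crane_1_1", "crane_1_2"], ["pet_1_1", "pet_1_2"])
print(readings)
```

    Under these assumed attachments, the only surviving pair is (crane_1_2)/(pet_1_1), the bird that is a domestic animal, which agrees with the preferred interpretation given in the text.</Paragraph>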
    <Paragraph position="4"> Another use of the ontology has been to help us refine LDOCE and WordNet themselves. For example, any sample of the automatically generated LDOCE genus-sense hierarchy has approximately 20% errors. Using our merged LDOCE-WordNet-OB ontology as a standard, we have been able to locate and fix a large number of these errors automatically.</Paragraph>
  </Section>
</Paper>