File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-0713_metho.xml
Size: 13,407 bytes
Last Modified: 2025-10-06 14:15:08
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0713"> <Title>I I I ! I ! I I I ! I I I I I i I I Lexical Acquisition with WordNet and the Mikrokosmos Ontology</Title> <Section position="4" start_page="96" end_page="98" type="metho"> <SectionTitle> 3 Implementation </SectionTitle> <Paragraph position="0"> The Onto-WordNet Mapper works by performing a breadth-first traversal of the pK concept space, attempting to find matches for each concept node with a synset ~om WordNet. The end result is a list of potential mappings sorted by a match score derived by weighting the scores ~om the individual heuristics.</Paragraph> <Paragraph position="1"> An empty list indicates that no suitable matches were determined. This mapping process is detailed in figure 3, which also shows the default weights used prior to the optimization discussed later. The following sections describe each of the five matching heuristics. Note that a separate component, not described here, is used to produce the lexicon entry from the best match, provided the score is above a certain threshold.</Paragraph> <Paragraph position="2"> For each Mikrokosmos concept: 1. Get candidate synset (a) Try to find a word in WordNet with the same speUing (e.g. REAL*ESTATE vs. &quot;real estate&quot;). (b) Try to find a word matching a prefix or suifix of the concept (e.g., PEPPER-VEGETABLE vs.</Paragraph> <Paragraph position="3"> &quot;pepper&quot;).</Paragraph> <Paragraph position="4"> 2. Perform structure match of the synset and concept hierarchies. For each word and concept pair: (a) Check for exact match of the word and concept. null (b) Check for partial match of the word and concept (as above).</Paragraph> <Paragraph position="5"> (c) Check predefined equivalences.</Paragraph> <Paragraph position="6"> (d) Evaluate each match by computing the percent of matched nodes on the best-matching branches for each (scaled by length).</Paragraph> <Paragraph position="7"> 3. Perform concept-similarity match using corpus-derived probabilities (a) Get words occurring in the definition glosses for synset & concept.</Paragraph> <Paragraph position="8"> (b) Compute palrwise similarity by finding ancestor with the highest information content (the most-informative-subsumer ).</Paragraph> <Paragraph position="9"> (c) Evaluate the match by the degree of support the synset gets from all of the mostinformative-subsumers that are applicable. 4. Perform intersection matches for the following: (a) the sibling synsets PS, concepts.</Paragraph> <Paragraph position="10"> (b) the children synsets & concepts.</Paragraph> <Paragraph position="11"> (c) the definition gloss words ~om the synset & concept 5. Compute total match score by a weighted sum* 25 * hier + .25 * sire + .2 * child + .2 * sibl +. 1 * text.</Paragraph> <Section position="1" start_page="97" end_page="97" type="sub_section"> <SectionTitle> 3.1 Hierarchical Match </SectionTitle> <Paragraph position="0"> The hierarchy match (see figure 4) computes a score for the similarity of the two concept hierarchies.</Paragraph> <Paragraph position="1"> Since WordNet gives several words per synset, the matching at each step uses the maximum scores of the alternatives. The matching proceeds node by node up the hierarchies. 
</Section> <Section position="2" start_page="97" end_page="98" type="sub_section"> <SectionTitle> 3.2 Similarity Match </SectionTitle>
<Paragraph position="0"> The idea of the similarity heuristic is to use the information content of the common ancestor concepts (or subsumers) for words in the definition texts. This is based on the technique Resnik (1995) uses for disambiguating noun groups. The frequency of a synset is based on the frequency of the word plus the frequencies of all its descendant synsets. Therefore, the top-level synsets have the highest frequencies and thus the highest estimated probability of occurrence. For each pair of nouns from the text of both definitions (one each from WordNet and µK), the most-informative-subsumer is determined by finding the common ancestor with the highest information content, which is inversely related to frequency. The information content of this ancestor determines the pairwise similarity score. The candidate synset that receives the most support from the pairwise similarity scores is then preferred. These calculations are detailed in figures 5 and 6. </Paragraph>
<Paragraph position="1"> This technique requires an estimation of the frequencies of the WordNet synsets. Unless the corpus has been annotated to indicate the WordNet synset for each word, there is no direct way to determine the synset frequencies. However, these can be estimated by taking the frequency of the words in all descendant synsets (i.e., all the words the synset subsumes). </Paragraph>
<Paragraph position="2"> (Figures 5 and 6: for each applicable noun pair i, j, increase synset-support by sim_i,j and increase normalization by sim_i,j.) </Paragraph>
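<Paragraph> A rough Python sketch of these estimates is given below. It illustrates Resnik's measure rather than the authors' implementation: word_counts, synset_words, children, and ancestors are hypothetical placeholder tables, and the step that maps gloss nouns to their candidate synsets is elided. </Paragraph>
<Paragraph>
import math
from collections import defaultdict
from functools import lru_cache

# Hypothetical inputs: word frequencies from an untagged corpus, and the
# WordNet noun hierarchy keyed by synset identifiers.
word_counts = {}                # word -> corpus frequency
synset_words = {}               # synset id -> list of member words
children = defaultdict(list)    # synset id -> child synset ids
ancestors = defaultdict(set)    # synset id -> all ancestor synset ids (incl. itself)

@lru_cache(maxsize=None)
def synset_frequency(synset):
    # Estimated frequency: counts of the synset's own words plus the
    # frequencies of everything it subsumes (its descendant synsets).
    own = sum(word_counts.get(w, 0) for w in synset_words.get(synset, []))
    return own + sum(synset_frequency(c) for c in children[synset])

def information_content(synset, total):
    # IC = -log p(synset), where total is the frequency of the hierarchy root;
    # top-level synsets are frequent and hence uninformative.
    p = synset_frequency(synset) / total
    return -math.log(p) if p > 0 else 0.0

def pairwise_similarity(s1, s2, total):
    # Similarity = information content of the most-informative-subsumer,
    # i.e., the common ancestor with the highest IC (Resnik, 1995).
    common = ancestors[s1] & ancestors[s2]
    return max((information_content(a, total) for a in common), default=0.0)

def synset_support(candidate_gloss, concept_gloss, total):
    # Tally the pairwise similarities between gloss synsets, as in the
    # synset-support accumulation of figures 5 and 6; dividing by the sum
    # over all candidate synsets would give the normalized support.
    return sum(pairwise_similarity(a, b, total)
               for a in candidate_gloss for b in concept_gloss)
</Paragraph>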
</Section> <Section position="3" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 3.3 Miscellaneous matching heuristics </SectionTitle>
<Paragraph position="0"> The remaining heuristics are similar in that each is based on the degree of overlap in word-based matching. For instance, in the match-siblings heuristic, the sibling sets for the candidate synset and µK concept are compared by determining the size of the intersection relative to the size of the µK set. The match-children and match-definition-text heuristics are similar. Figure 7 shows the general form of these intersection-based matching heuristics. This uses an equivalence test modified to account for a few morphological variations; the test also accounts for partial matches with the components of a concept name (similar to the first step in figure 3). </Paragraph>
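<Paragraph> The general form of these overlap heuristics reduces to an intersection ratio. The following is a minimal sketch, not the figure 7 code itself: the normalize function is a deliberately simplistic stand-in for the modified equivalence test and the concept-name component handling described above. </Paragraph>
<Paragraph>
def normalize(term):
    # Crude stand-in for the modified equivalence test: lower-case, strip a
    # plural "s", and split concept names into their components.
    return frozenset(term.lower().rstrip('s').replace('_', '-').split('-'))

def intersection_score(synset_items, concept_items):
    # Overlap of the two sets relative to the size of the µK set, as in
    # match-siblings, match-children, and match-definition-text.
    syn = {normalize(t) for t in synset_items}
    onto = {normalize(t) for t in concept_items}
    if not onto:
        return 0.0
    matched = sum(1 for c in onto if any(c & s for s in syn))
    return matched / len(onto)
</Paragraph>
<Paragraph> For match-siblings, synset_items would hold the words of the candidate synset's sibling synsets and concept_items the names of the µK concept's siblings; match-children and match-definition-text substitute child nodes and definition gloss words, respectively. </Paragraph>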
</Section> </Section> <Section position="5" start_page="98" end_page="99" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"> To evaluate the performance of the mapper, two sets of 100 random concepts were mapped by hand into the corresponding WordNet synset (or marked as not applicable). The first set was selected from the entire set of concepts mapped, whereas the second set was selected just from the cases with more than one plausible mapping (e.g., the corresponding WordNet entry has more than one synset). The results of this test show that the mapper handles 77% of the ambiguous cases and 94% of the cases overall, excluding cases corresponding to lexical gaps in WordNet (see table 1). This is an improvement of more than 15% over the lower bound, which was estimated from the proportion of correct mappings using sense 1. Note that these tests were performed after development was completed on the system. </Paragraph>
<Paragraph position="1"> The remainder of this section presents results on how often the individual heuristics contribute to the correct result. The most important finding is that the hierarchy and text matches account for most of the results. Furthermore, when all heuristics are used together, the similarity heuristic makes only a minor contribution to the result, although it is second best when the heuristics are applied individually. </Paragraph>
<Paragraph position="2"> Table 2 contains the results for each heuristic evaluated individually against the manual mapping of the ambiguous cases. Note that the overall score shows the accuracy using the default weights for comparison purposes. </Paragraph>
<Paragraph position="3"> As a rough estimate for optimizing the weighting scheme, regression analysis was performed relating the score produced by each heuristic to the result of the manual mapping for the ambiguous cases. This accounts for the interactions among the heuristics. There are 343 data points, because the score for each sense is included, not just those for the current sense (see table 3). Although the correlation coefficient is only 0.41, the regression suggests that the hierarchy match and the text match are the most significant heuristics. When using these revised weights, the accuracy increases to 81.3%. </Paragraph>
<Paragraph position="4"> An alternative method for determining these weights used an optimization search, which accounts for nonlinear relationships. This method produced the weights shown in table 4, which shows that only the hierarchy and text heuristics contribute significantly to the result. When these weights are applied to the ambiguous sample, the accuracy becomes 83.5%. Note that the results given earlier use the lower figure, because this represents the evaluation before training the weights on the sample. </Paragraph>
<Paragraph position="5"> These results are preliminary: larger test sets would be required before conclusions can be drawn. However, it seems clear that a statistical approach is not likely to serve as a complete solution for this problem. Instead, a combination of symbolic and statistical approaches seems appropriate, with an emphasis on the former. </Paragraph> </Section>
<Section position="6" start_page="99" end_page="99" type="metho"> <SectionTitle> 5 Relation to other work </SectionTitle>
<Paragraph position="0"> Work of this nature has been more common in matching entries in multilingual dictionaries (e.g., (Rigau and Agirre, 1995)) than in lexical acquisition. This section will concentrate on work augmenting lexical information by ontological mappings. </Paragraph>
<Paragraph position="1"> Knight and Luk (1994) describe an approach to establishing correspondences between Longman's Dictionary of Contemporary English (LDOCE) and WordNet entries. A definition match compares the overlap of the LDOCE definition text with those of both the WordNet entry and its hypernym, along with the words from closely related synsets. Their hierarchy match uses the implicit hierarchy within LDOCE defined from the genus terms of the definitions, incorporating work done at NMSU (Bruce and Guthrie, 1991) that identifies and disambiguates the head nouns in the definition texts. The hierarchy is used to guide the determination of nontrivial matches by providing a local context in which senses can be considered unambiguous, by filtering out the other senses not applicable to either subhierarchy. It also allows for matching the parents of words from an existing match. Note that this mapping is facilitated by the target and source domains being the same: namely, English words. Therefore, the problem of assessing correspondence is minimized. </Paragraph>
<Paragraph position="2"> Chang and Chen (1996) describe an algorithm for augmenting LDOCE with information from Longman's Lexicon of Contemporary English (LLOCE). LLOCE is basically a thesaurus, with word lists arranged under 14 subjects and 129 topics. These topic identifiers are used as a coarse form of sense division. The matching algorithm works by computing a similarity score for the degree of overlap between the list of words for each LDOCE sense and the list of words from the LLOCE topics that contain the headword (expanded to include cross-references). </Paragraph>
<Paragraph position="3"> Other work is less directly related. Lehmann (1995) describes a methodology for semantic integration that matches classes based on the overlap in the inclusions of typical class members. For this to be effective, these instances must have been applied consistently in both ontologies. O'Sullivan et al. (1995) describe work that does the reverse of our process: they augment WordNet by linking in entries from an ontology describing word processing. However, their approach requires manual linking. </Paragraph> </Section> </Paper>