<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1162">
<Title>Identifying Concepts Across Languages: A First Step towards a Corpus-based Approach to Automatic Ontology Alignment</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 3 Cross-lingual Semantic Similarity </SectionTitle>
<Paragraph position="0"> Since automatic ontology alignment involves comparing sets of words to each other, it is necessary to define a measure of semantic similarity. Much work has been done on this topic, but most of it addresses monolingual semantic similarity. Our problem is more complicated, as cross-lingual ontology alignment requires measuring the semantic similarity of words from different languages.</Paragraph>
<Paragraph position="1"> The method used in this paper is an extension of work by Fung and Lo (1998). The assumption is that word cooccurrence patterns are correlated across languages, and that the similarity between two words' cooccurrence patterns is indicative of their semantic similarity. To construct a representation of the cooccurrence patterns, a list of seedwords is compiled. The seedwords in one language are direct translations of those in the other language. Given a bilingual corpus, a context vector can then be constructed for each word of interest, where each element of the vector is a weight that is a function of the significance of a particular seedword and its cooccurrence frequency with the word of interest. This method, which was applied to the problem of automatic dictionary induction, has the advantage of being able to use non-parallel bilingual corpora, which are by nature much more plentiful than parallel corpora.</Paragraph>
<Paragraph position="2"> The most important extension that our work makes to the work of Fung et al. is the introduction of translation groups of words.
A major issue in translation research is that, for any two languages, a word in one language commonly has multiple translations in the other. It is also common for a given translation of a particular word to be a translation of one of its synonyms as well.</Paragraph>
<Paragraph position="3"> To address this problem, this work uses seedword groups, $n$-to-$m$ translations of sets of words, rather than 1-to-1 translations of single words. This increases the robustness of the method, since a word need not be consistently translated for its context to be accurately identified. An additional benefit is that the sparse data problem is alleviated somewhat: the larger number of seedwords increases coverage of the corpus, which reduces the chance that a rare word whose translation we are interested in does not cooccur with any of the seedwords.</Paragraph>
<Paragraph position="4"> Given two languages, $L_1$ and $L_2$, the algorithm proceeds as follows: 1. Define a list $S_1 = \{S_{10}, S_{11}, \ldots, S_{1n}\}$, where each member $S_{1i}$ of the list is a set of words in $L_1$.</Paragraph>
<Paragraph position="5"> 2. Create a list $S_2 = \{S_{20}, S_{21}, \ldots, S_{2n}\}$, where $S_{2i}$ is a set of words in $L_2$ that are translations of the words from $S_{1i}$.</Paragraph>
<Paragraph position="6"> 3. For each word $w$ of interest in $L_i$, create a vec-</Paragraph>
<Paragraph position="8"/>
</Section>
</Paper>
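<!-- Editorial sketch, not part of the original paper: the index-aligned seedword groups of steps 1 and 2 can be illustrated with a small Python example. Everything named here is an assumption made for illustration: the function names `context_vector` and `cosine`, the sentence-level cooccurrence window, the raw-count weight (the paper's weight also depends on seedword significance, which this section does not fully specify), and cosine as the comparison measure. The toy corpora and seedword groups are invented.

```python
import math
from typing import List, Sequence, Set


def context_vector(word: str,
                   sentences: Sequence[Sequence[str]],
                   seed_groups: Sequence[Set[str]]) -> List[float]:
    """Count, per seedword group, the sentences in which `word` cooccurs
    with at least one member of that group (an assumed weighting)."""
    vector = [0.0] * len(seed_groups)
    for sentence in sentences:
        tokens = set(sentence)
        if word not in tokens:
            continue
        for i, group in enumerate(seed_groups):
            # A hit on any member of the group counts for the whole group.
            if tokens & group:
                vector[i] += 1.0
    return vector


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Step 1: groups of words in the first language (English here).
seeds_l1 = [{"money", "cash"}, {"river", "stream"}]
# Step 2: index-aligned groups of translations in the second language (French).
seeds_l2 = [{"argent", "monnaie"}, {"fleuve", "ruisseau"}]

# Non-parallel toy corpora: the sentences are not translations of each other.
corpus_l1 = [["the", "bank", "holds", "cash"],
             ["the", "bank", "lends", "money"],
             ["the", "shore", "of", "the", "river"]]
corpus_l2 = [["la", "banque", "change", "la", "monnaie"],
             ["cet", "argent", "vient", "de", "la", "banque"],
             ["la", "rive", "du", "fleuve"]]

v_bank = context_vector("bank", corpus_l1, seeds_l1)
v_banque = context_vector("banque", corpus_l2, seeds_l2)
v_rive = context_vector("rive", corpus_l2, seeds_l2)

print(cosine(v_bank, v_banque), cosine(v_bank, v_rive))
```

In this toy data "bank" cooccurs with "cash" in one sentence and with "money" in another; both hits fall into the same group, which is the robustness to inconsistent translation that the seedword groups are meant to provide. Because the two lists are aligned by index, vectors built from either language share one space and can be compared directly. -->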