<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1162"> <Title>Identifying Concepts Across Languages: A First Step towards a Corpus-based Approach to Automatic Ontology Alignment</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiment Details </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Ontologies </SectionTitle> <Paragraph position="0"> The ontologies selected for alignment in this work were the American English WordNet (Miller et al., 1990) version 1.7, and the Mandarin Chinese HowNet (Dong, 1988).2 There are two main reasons why these particular two ontologies were chosen: they represent very different languages, and were constructed with very different approaches. WordNet was constructed with what is commonly referred to as a differential theory of lexical semantics (Miller et al., 1990), which aims to differentiate word senses by grouping words into synonym sets (synsets), which are constructed as to allow a user to easily distinguish between different senses of a word.</Paragraph> <Paragraph position="1"> HowNet, in contrast, was constructed following a constructive approach. At the most atomic level is a set of almost 1500 basic definitions, or sememes, such as &quot;human&quot;, or &quot;aValue&quot; (attributevalue). Higher-level concepts, or definitions, are composed of subsets of these sememes, sometimes with &quot;pointers&quot; that express certain kinds of relations, such as &quot;agent&quot; or &quot;target&quot;, and words are associated with the definition(s) that describe them. For example, the word &quot;a0 &quot; (scar) is associated with the definition &quot;tracea0a2a1 ,#diseasea0a4a3</Paragraph> <Paragraph position="3"> HowNet contains a total of almost 17000 definitions. 
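HowNet's constructive design lends itself to a simple set-based sketch. The following is illustrative only: the sememes &quot;trace&quot; and &quot;#disease&quot; come from the scar example above, while the other names and the data layout are our assumptions, not HowNet's actual notation.

```python
# Illustrative sketch of HowNet-style definitions as sememe combinations.
# Definition contents other than the "scar" example are hypothetical.

# A definition is modeled as a frozen set of sememes.
DEF_SCAR = frozenset({"trace", "#disease"})    # from the example above
DEF_DOCTOR = frozenset({"human", "medical"})   # hypothetical

# Words map to the definition(s) that describe them.
lexicon = {
    "scar": [DEF_SCAR],
    "doctor": [DEF_DOCTOR],
    "physician": [DEF_DOCTOR],
}

def share_definition(w1, w2, lex):
    """Words attached to the same definition play the role of a synset."""
    return any(d in lex[w2] for d in lex[w1])
```

Words that share a definition, like &quot;doctor&quot; and &quot;physician&quot; in this sketch, correspond to the members of a WordNet synset.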
On average, each definition contained 6.5 Chinese words, with 45% of them containing only one word, and 10% of them containing more than 10 words. Since all the words within a definition are composed of the same sememe combination, HowNet definitions can be considered the equivalent of WordNet synsets.</Paragraph> <Paragraph position="4"> A detailed structural comparison between HowNet and WordNet can be found in (Wong and Fung, 2002).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Supplementary Dictionary </SectionTitle> <Paragraph position="0"> To supplement the English translations included in HowNet, translations were added from CEDict, an open-source Chinese-English lexicon downloaded from the web. The two lexicons were merged into the final dictionary by iteratively grouping together Chinese words that shared English translations, creating our many-to-many seedword translation groups.2</Paragraph> <Paragraph position="1"> 2The entries in HowNet are mainly in Chinese with a few English technical terms such as &quot;ASCII&quot;. English translations are included for all the words and sememes.</Paragraph> <Paragraph position="2"> The finalized dictionary is used to create seed word groups for building the contextual vectors.</Paragraph> <Paragraph position="3"> First, the mappings in which none of the Chinese or English words appear in the corpus are filtered out. Second, only the mappings in which all of the Chinese words appear in the same HowNet definition are kept. 
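The two filtering steps above can be sketched as follows, assuming simple tuple and set data structures (the function and variable names are ours, not the authors'):

```python
# Sketch of seed-group construction: keep only mappings that are attested
# in the corpus and whose Chinese words share one HowNet definition.
# Data structures are assumptions, not the authors' implementation.

def build_seed_groups(mappings, corpus_vocab, hownet_defs):
    """mappings: (chinese_words, english_words) tuples from the merged
    dictionary; corpus_vocab: all word types seen in the corpus;
    hownet_defs: one set of Chinese words per HowNet definition."""
    groups = []
    for zh_words, en_words in mappings:
        # Filter 1: at least one word on either side occurs in the corpus.
        if corpus_vocab.isdisjoint(set(zh_words) | set(en_words)):
            continue
        # Filter 2: all Chinese words appear in the same HowNet definition.
        if any(set(zh_words).issubset(d) for d in hownet_defs):
            groups.append((zh_words, en_words))
    return groups
```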
The remaining 1975 mappings, which map an average of 2.0 Chinese words to an average of 2.2 English words, are used as seed word groups.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Corpora </SectionTitle> <Paragraph position="0"> The bilingual corpus from which the context vectors were constructed was extracted from newspaper articles from 1987-1992 of the American English Wall Street Journal and 1988-1996 of the Mandarin Chinese People's Daily newspaper. The articles were sentence-delimited, and a greedy maximum forward match algorithm was used with a lexicon which included all word entries in HowNet to perform word segmentation on the Chinese corpus. On the English side, the same greedy maximum forward match algorithm was used in conjunction with a lexicon consisting of all word phrases found in WordNet to concatenate individual words into non-compositional compounds. To ensure that we were working on well-formed, complete sentences, sentences which were shorter than 10 Chinese words or 15 English words were filtered out.</Paragraph> <Paragraph position="1"> A set of sentences was then randomly picked to be included: the final corpus consisted of 15 million English words (540k sentences) and 11.6 million Chinese words (390k sentences). Finally, the English half of the corpus was part-of-speech tagged with fnTBL (Ngai and Florian, 2001), the fast adaptation of Brill's transformation-based tagger (Brill, 1995).</Paragraph> <Paragraph position="2"> It is important to note that the final corpus thus generated is not parallel or even comparable in nature. To our knowledge, most of the previous work which utilizes bilingual corpora has involved corpora which were at least comparable in origin or content, if not parallel. 
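The greedy maximum forward match used for both Chinese segmentation and English compound concatenation can be sketched as follows; this is a generic textbook implementation, not the authors' code, and the maximum entry length is an assumed parameter:

```python
# Generic greedy maximum forward match: at each position, take the
# longest lexicon entry that starts there, falling back to one unit.

def max_forward_match(text, lexicon, max_len=8):
    tokens, i = [], 0
    while len(text) > i:
        # Try candidate lengths from longest down to a single unit.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in lexicon or n == 1:
                tokens.append(piece)
                i += n
                break
    return tokens
```

For Chinese, `text` is a character string and the lexicon holds HowNet entries; for English, the same routine can be run over a tuple of tokens with WordNet phrases (as token tuples) in the lexicon.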
The only previous work that we are aware of which uses unrelated corpora is that of Rapp (1995), a study on word co-occurrence statistics in unrelated German and English corpora.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> To get a sense of the efficacy of our method, a test set of 160 HowNet definitions was randomly chosen.3 The Chinese words contained within the definitions were extracted, along with the corresponding English translations. Two sets of context vectors can then be constructed, one for the Chinese words in each definition and one for their English translations. Once these context vectors have been constructed, the similarities between the HowNet definitions and the WordNet synsets can be calculated according to the formulae in Section 2.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> To get a sense of the complexity of the problem, it is necessary to construct a reasonable baseline system to compare against. For a baseline, all of the synsets that directly correspond to the English translations were extracted and enumerated.</Paragraph> <Paragraph position="1"> The synset with the highest number of corresponding translations was selected as the alignment candidate, with ties broken randomly.</Paragraph> <Paragraph position="2"> Because there is no annotated data available for the evaluation, two judges who speak the languages involved were asked to hand-evaluate the resulting alignments, based primarily on the set of sememes that make up each definition, with the words contained in the definition serving only as a secondary aid. 
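The vector construction and comparison can be sketched as below. The sentence-window co-occurrence counts and the use of cosine similarity are our assumptions for illustration; the paper's actual formulae are given in its Section 2.

```python
import math

# Sketch: context vectors over aligned seed groups, compared with cosine
# similarity. The windowing and weighting choices here are assumptions.

def context_vector(word, sentences, seed_groups):
    """One dimension per seed group: how often `word` co-occurs with
    members of that group within a sentence."""
    vec = [0] * len(seed_groups)
    for sent in sentences:
        if word in sent:
            for k, group in enumerate(seed_groups):
                vec[k] += sum(1 for w in sent if w in group)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0
```

Because the seed groups are aligned across languages, a Chinese vector and an English vector share the same dimensions and can be compared directly.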
Table 1 shows the overall performance of our algorithm, and Table 2 shows the highest-scoring alignment mappings.</Paragraph> <Paragraph position="3"> In addition to the overall results, it is also interesting to examine the rankings of the alignment candidates for some of the more difficult HowNet definitions. Table 3 shows an example definition and the candidate rankings. This definition includes the words &quot;population&quot; and &quot;number of people&quot;; however, &quot;number of people&quot; was filtered out as it does not occur in WordNet as a single collocation, leaving only &quot;population&quot;, a noun with 6 senses in WordNet, to work with. This example is a good illustration of the strength and power of the cross-lingual word similarity calculation.3 3The original number of definitions chosen for the test set was higher. However, upon inspection, it was found that a number of them had no corresponding WordNet synset and thus could not be aligned. The 160 are those which were left after the non-alignable definitions were filtered out.</Paragraph> <Paragraph position="4"> The system correctly identifies the first sense of &quot;population&quot; -- &quot;the people who inhabit a territory or state&quot; -- as the correct semantic sense of this particular definition from the Chinese words glossed as &quot;number of human mouths&quot;, &quot;number of people&quot; and &quot;number of human heads&quot;.</Paragraph> <Paragraph position="5"> Another very good example of the algorithm's strength can be found in the rankings for the HowNet definition &quot;TakeAway, patient=family&quot; (Table 4). Again, the phrasal word translations &quot;move house&quot;, &quot;change one's residence&quot;, &quot;move to a better place&quot;, etc. were filtered out, leaving the single word &quot;move&quot;, which has a total of 16 senses as a verb in WordNet 1.7. 
However, as the table shows, the algorithm correctly assigns the &quot;change residence&quot; sense of &quot;move&quot; to the HowNet definition, which is appropriate for the Chinese words it contains, including the words glossed as &quot;move house&quot;, &quot;change one's dwelling&quot;, and &quot;tear down one's house and move&quot;.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Analysis </SectionTitle> <Paragraph position="0"> Even though the final goal of our work is to construct a full mapping from HowNet to WordNet, there will be quite a number of HowNet definitions which do not have a WordNet synset equivalent.</Paragraph> <Paragraph position="1"> The reason is that given an arbitrary pair of languages, there will exist some words in one language which do not have a translation in the other language. In the case of English and Chinese, many of the encountered problems came from Chinese idiomatic expressions, which are common in everyday usage and are considered to be single words, but do not usually translate to a single word in English.</Paragraph> <Paragraph position="2"> In addition, the inherent difference in sense granularity and structure between any given two ontologies means that a full-scale mapping of synsets from one ontology to another will not usually be possible.</Paragraph> <Paragraph position="3"> For example, HowNet's &quot;livestock&quot; definition covers words which are as diverse as &quot;cow&quot;, &quot;cat&quot; and &quot;dog&quot;, while the finest-grained WordNet synset that covers all these words is {placental, placental mammal, eutherian, eutherian mammal}.</Paragraph> <Paragraph position="4"> One of the most troublesome problems encountered in this work was in the selection of seedwords, which define the seed set for the automatic lexicon induction.</Paragraph> <Paragraph position="5"> If the seedwords occur so frequently in the corpus that other 
words co-occur with them too easily, they will provide little useful discriminatory information to the algorithm; but if they are too rare, they will not co-occur often enough with other words to be able to provide enough information, either. This problem can be solved, however, by a better selection of seedwords, or, more easily, simply by using a bigger corpus to alleviate the sparse data problem.</Paragraph> <Paragraph position="6"> A more serious problem was introduced by the comparability of the corpora involved in the experiment. Even though both English and Chinese halves were extracted from news articles, the newspapers involved are very different in content and style: the People's Daily is a government publication, written in a very terse and brief style, and does not concern itself much with non-government affairs. The Wall Street Journal, on the other hand, caters to a much broader audience with a variety of news articles from all sources.</Paragraph> <Paragraph position="7"> This creates a problem in the co-occurrence patterns of a word and its translations. The assumption that word co-occurrence patterns tend to hold across language boundaries seems to be less valid with corpora that differ too much from each other. An observation made during the experiments was that some words occurred much more frequently (relative to the half of the corpus they were in) than their translated counterparts. The result of this is that their context vectors may not be as similar as desired.</Paragraph> <Paragraph position="8"> The difference in the co-occurrence patterns between the two halves of the corpora is best illustrated by a dotplot (Church, 1993). The total term frequency (TF) of each seedword group is plotted against that of its translations.</Paragraph> <Paragraph position="9"> Figure 1 shows the resulting dotplot. 
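The dotplot data can be sketched as below; the divergence score summarizing distance from the diagonal is our own addition for illustration, not a measure used in the paper:

```python
import math

# Sketch of the dotplot data: one point per seed group, pairing the
# total TF of its Chinese words with that of its English translations.

def tf_points(seed_groups, zh_counts, en_counts):
    pts = []
    for zh_words, en_words in seed_groups:
        x = sum(zh_counts.get(w, 0) for w in zh_words)
        y = sum(en_counts.get(w, 0) for w in en_words)
        pts.append((x, y))
    return pts

def mean_log_divergence(points):
    """Mean |log(x/y)|: zero when all points sit on the diagonal,
    growing as the two halves' word usage diverges."""
    logs = [abs(math.log(x / y)) for x, y in points if x > 0 and y > 0]
    return sum(logs) / len(logs) if logs else 0.0
```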
If the two halves of the corpora were exact copies of each other, the frequencies of the seedwords would be equal and the points would therefore be aligned along the x = y diagonal. The further the points diverge from the diagonal, the more different the two halves of the corpus are from each other. This is quite obviously the case for this particular corpus -- the overall point pattern is fan-shaped, with the diagonal only faintly discernible. This suggests that the word usage patterns of the English and Chinese halves of the corpus are quite dissimilar to each other.</Paragraph> <Paragraph position="10"> It is, of course, reasonable to ask why parallel or comparable corpora had not been used in the experiments. The reason is, as noted in Section 2, that noncomparable corpora are easier to come by. In fact, the only Chinese/English corpus of comparable origin that was available to us was the parallel Hong Kong News corpus, which is about half the size. Furthermore, the word entries in HowNet were extracted from Mandarin Chinese corpora, which differ enough from the style of Chinese used in Hong Kong that many words from HowNet do not exist in the Hong Kong News corpus. Feasibility experiments with that corpus showed that many of the seedwords either did not occur, or did not co-occur with the words of interest, resulting in sparse context vectors with only a few non-zero co-occurrence frequencies. The result was that the similarity between many of the candidate WordNet synset-HowNet definition pairs was reduced to zero.</Paragraph> <Paragraph position="11"> Despite all these problems, our method is successful at aligning some of the more difficult, single-word HowNet definitions to appropriate WordNet synsets, thus creating a partial mapping between two ontologies with very different structures from very different languages. 
The method is completely unsupervised and therefore cheap in resource requirements -- it does not use any annotated data, and the only resource that it requires -- beyond the ontologies that are to be aligned -- is a bilingual machine-readable dictionary, which can usually be obtained for free or at very low cost.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Previous Work </SectionTitle> <Paragraph position="0"> The preceding sections mentioned some previous and related work that targets the same problem, or some of its subproblems. This section takes a closer look at some other related work.</Paragraph> <Paragraph position="1"> There has been some interest in aligning ontologies. Dorr et al. (2000) and Palmer and Wu (1995) focused on HowNet verbs and used thematic-role information to align them to verbs in an existing classification of English verbs called EVCA (Levin, 1993). Asanoma (2001) used structural link information to align nouns from WordNet to an existing Japanese ontology called Goi-Taikei via the Japanese WordNet, which was constructed by manual translation of a subset of WordNet nouns.</Paragraph> <Paragraph position="2"> There has also been a lot of work involving bilingual corpora, including the IBM Candide project (Brown et al., 1990), which used statistical data to align words in sentence pairs from parallel corpora in an unsupervised fashion through the EM algorithm; Church (1993) used character frequencies to align words in a parallel corpus; Smadja et al. (1996) used co-occurrence functions to extract phrasal collocations for translation, and Melamed (1997) identified non-compositional compounds by comparing the objective functions of a translation model with and without NCCs.</Paragraph> <Paragraph position="3"> The calculation of word semantic similarity scores is also a problem that has attracted a lot of interest. 
The notable approaches can usually be divided into those which utilize the hierarchical information from an ontology, such as Resnik (1995) and Agirre and Martinez (2002); and those which simply use word distribution information from a large corpus, such as Lin (1998) and Lee (1999).</Paragraph> </Section> </Paper>