<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0801"> <Title>An Unsupervised Method for Multilingual Word Sense Tagging Using Parallel Corpora: A Preliminary Investigation</Title> <Section position="4" start_page="0" end_page="4" type="metho"> <SectionTitle> 2. Proposed method </SectionTitle> <Paragraph position="0"> We propose a method that utilizes translations as filters for sense distinctions. The method is unsupervised since it does not rely on the availability of sense-tagged data. As an illustration, if we look up the canonical ambiguous word bank in the Oxford Hachette English-French dictionary, we find that it translates to several words indicating its possible senses. Bank, as a noun, translates to the French words banque, rive, bord, etc. If we reverse the French translations into English, we get the original word bank as well as other English equivalents. Accordingly, rive translates back into English as bank and shore; bord translates into bank, edge, and rim. Therefore, given a parallel corpus with a source and target language, if there exists a method of finding word alignments from the source language corpus to words in the target language corpus, one can create a set of all the words in the target corpus that are aligned with a word in the source corpus. For example, given a French/English parallel corpus, we would expect the word rive, on the French side, to align with the words bank and shore, on the English side, in the correct contexts with a high probability. This approach essentially hinges upon the diversity of contexts in which words are translated.</Paragraph> <Paragraph position="1"> We will refer to the English side of the parallel corpus as the target language corpus since we assume the knowledge resources exist for English.
The foreign language side is referred to as the source corpus.</Paragraph> <Paragraph position="2"> The required linguistic knowledge resource is a lexical ontology that has the words in the target language and a listing of their associated senses. There are several databases of that sort available for language researchers, among which is WordNet \[Fellbaum, 1998; Miller et al., 1990\].</Paragraph> <Paragraph position="3"> WordNet is a lexical ontology - a variant on semantic networks with more of a hierarchical structure, even though some of the nodes can have multiple parents - that was manually constructed for the English language. It comprises four taxonomies for four parts of speech: nouns, verbs, adverbs and adjectives.</Paragraph> <Paragraph position="4"> Accordingly, given a taxonomy like WordNet for the target language, and an appropriate distance measure between words with their associated senses, the distance between all the senses of both shore and bank is calculated. In WordNet 1.6, bank has 10 senses; the three most frequent senses are: 1. a financial institution that accepts deposits and channels the money into lending activities 2. sloping land (especially the slope beside a body of water) 3. a supply or stock held in reserve especially for future use (especially in emergencies) shore has two senses listed: 1. the land along the edge of a body of water (a lake or ocean or river) 2. a beam that is propped against a structure to provide support One would expect the distance between sense #2 of bank and sense #1 of shore to be smaller than the latter's distance from the other two senses of bank. Accordingly, with an appropriate optimization function over the distance measures between all the senses of the two words, sense #2 for bank and sense #1 for shore are assigned as the correct tags for the words, respectively. In effect, we have assigned sense tags to rive in its respective alignments, in the appropriate contexts.
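The sense-selection step just described can be sketched as follows. This is a toy illustration only: the mini-taxonomy, its hypernym paths, and the path-based distance are invented stand-ins for WordNet 1.6 and for the paper's actual distance measure.

```python
# Toy sketch of selecting sense tags by minimizing taxonomy distance.
# The sense inventory below is invented for illustration; the paper
# uses WordNet 1.6 and an information-content-based measure.

# Hypothetical sense inventories: (word, sense id) -> hypernym path (root first)
SENSES = {
    ("bank", 1): ["entity", "institution", "financial_institution"],
    ("bank", 2): ["entity", "object", "land", "slope"],
    ("bank", 3): ["entity", "object", "stock"],
    ("shore", 1): ["entity", "object", "land"],
    ("shore", 2): ["entity", "object", "beam"],
}

def distance(path_a, path_b):
    """Path distance: steps from each sense up to the deepest shared ancestor."""
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    return (len(path_a) - shared) + (len(path_b) - shared)

def best_sense_pair(word1, word2):
    """Pick the sense pair for two aligned words that minimizes distance."""
    pairs = [((w1, s1), (w2, s2))
             for (w1, s1) in SENSES if w1 == word1
             for (w2, s2) in SENSES if w2 == word2]
    return min(pairs, key=lambda p: distance(SENSES[p[0]], SENSES[p[1]]))

print(best_sense_pair("bank", "shore"))  # (('bank', 2), ('shore', 1))
```

Here bank#2 and shore#1 share the ancestor "land", so they are chosen over the financial and supply senses, mirroring the rive example in the text.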
Therefore the instances where rive is aligned with bank are assigned sense #2 for the noun bank; instances where rive is aligned with shore are assigned sense #1 for shore. Furthermore, we have automatically created links in WordNet for the French word rive. Our approach is described as follows: * Preprocessing of corpora: tokenize both corpora; align the sentences of the corpora such that each sentence in the source corpus is aligned with one corresponding sentence in the target corpus.</Paragraph> <Paragraph position="5"> * For each source and corresponding target sentence, find the best token-level alignments. Methods for automating this process have been proposed in the literature \[Al-Onaizan et al., 1999; Melamed, 2000; etc.\] * For each source language token, create a list of its alignments to target language tokens, the target set * Using the taxonomy, calculate the distance between the senses of the tokens in the target set; assign the appropriate sense(s) to each of the tokens in the target set based on an optimization function over the entire set of target token senses * Propagate the assigned senses back to both target and source corpora tokens, effectively creating two tag sets, one each for the target and source corpus * Evaluate the resulting tag sets against a hand-tagged test set.</Paragraph> <Paragraph position="6"> 3. Preliminary Evaluation 3.1. Materials We chose the Brown Corpus of American English \[Francis & Kučera, 1982\] - of one million words - as our target language corpus. It is a balanced corpus and it has more than 200K words that are manually sense tagged as a product of the semantic concordance (SemCor) effort using WordNet \[Miller et al. 1994\]. The SemCor data is tagged in running text - words of varying parts of speech are tagged in context - using WordNet 1.6. Hence, we used the WordNet 1.6 taxonomy as the linguistic knowledge resource.
\[Fellbaum, 1998\] For purposes of this preliminary investigation, we only explored nouns in the corpus, yet there are no inherent restrictions in the method against applying it to other parts of speech. Accordingly, we used part-of-speech tags that were available in the Penn Tree Bank for the Brown Corpus.</Paragraph> <Paragraph position="7"> The test set was created from the polysemous nouns in SemCor. The nouns were extracted from the Brown corpus with their relative corpus and sentence position information. The test set comprised 58372 noun instances of 6824 polysemous nouns. The nouns were not lemmatized.</Paragraph> <Paragraph position="8"> Two baselines were constructed: a random baseline (RBL), where each noun instance in the test set was assigned a random sense from the list of senses pertaining to that noun in the taxonomy; and a default baseline (DBL), where each noun instance in the test set is assigned its most frequent sense according to WordNet 1.6. The Brown Corpus only exists in English; therefore, we decided to automatically translate it into three different languages using two commercially available machine translation (MT) packages, Systran Professional 2.0 (SYS) and Globalink Power Translator Pro v.6.4 (GL). We used two different translation packages to maximize the variability of the word translation selection, in an attempt to approximate a human translation. The idea is that different MT packages use different bilingual lexicons in the translation process. Moreover, we decided to use more than one language since polysemous words can be translated in different ways in different languages, i.e. an ambiguous word that has two senses could be translated as two distinct words in one language but as one word in another language. We translated the Brown Corpus into French, German and Spanish, since these are considered the most reliable languages for the translation quality of the MT packages.
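The two baselines described above (RBL and DBL) can be sketched in a few lines. The sense inventory here is an invented placeholder; in the paper the senses come from WordNet 1.6, ordered by frequency.

```python
# Minimal sketch of the random (RBL) and default (DBL) baselines.
# The inventory below is a hypothetical stand-in for WordNet 1.6,
# with each noun's senses listed most-frequent first.
import random

SENSE_INVENTORY = {"bank": [1, 2, 3], "shore": [1, 2]}

def random_baseline(noun, rng=random):
    """RBL: assign a random sense from the noun's sense list."""
    return rng.choice(SENSE_INVENTORY[noun])

def default_baseline(noun):
    """DBL: assign the most frequent sense (first in the list)."""
    return SENSE_INVENTORY[noun][0]
```

DBL is the stronger of the two in practice, since WordNet's first sense is the most frequent one in SemCor.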
Furthermore, the fact that EuroWordNet exists for these languages facilitates the process of evaluating the source language tag set.</Paragraph> <Section position="1" start_page="0" end_page="4" type="sub_section"> <SectionTitle> 3.2. Experiments </SectionTitle> <Paragraph position="0"> Once we had the translations available, the seven corpora - namely, the English Brown corpus, French GL, German GL, Spanish GL, French SYS, German SYS, and Spanish SYS - were tokenized and the sentences were aligned. (This was a relatively easy task: since the corpora are artificially created, there was a one-to-one correspondence between the sentences.) For token-level alignments, we used the GIZA program \[Al-Onaizan et al., 1999\]. GIZA is an intermediate program in a statistical machine translation system, EGYPT. It is an implementation of Models 1-4 of Brown et al.</Paragraph> <Paragraph position="1"> \[1993\], where each of these models produces a Viterbi alignment. The models are trained in succession, where the final parameter values from one model are used as the starting parameters for the next model. We trained each model for 10 iterations. Given a source and target pair of aligned sentences, GIZA produces the most probable token-level alignments.</Paragraph> <Paragraph position="2"> Multiple token alignments are allowed on the target language side, i.e. a token in English could align with multiple tokens in the foreign language. Tokens on either side could align with nothing, designated as a null token. GIZA requires a large corpus in order to produce reliable alignments, hence the use of the entire Brown corpus: both the SemCor tagged data without the tags and the untagged data.</Paragraph> <Paragraph position="3"> Therefore, we produced the alignments for the 6 parallel corpora - a parallel corpus comprises the English corpus and its translation into one of the three languages using one of the MT packages - with English as the target language.</Paragraph> <Paragraph position="4"> The Brown Corpus has 52282 sentences. Due to processing limitations, GIZA ignores sentences that exceed 50 words in length; therefore, it ignored ~3000 sentences on average per parallel corpus alignment. GIZA output was converted to an internal format: sentence number followed by all the tokens in the sentence (tokens include punctuation, source and target), represented as token positions in the target language aligned with corresponding source language token positions in the aligned foreign sentence.</Paragraph> <Paragraph position="5"> All the token positions were replaced by the actual tokens from the corresponding corpora.</Paragraph> <Paragraph position="6"> Tokens that were aligned with null tokens on either side of the parallel corpus were ignored.</Paragraph> <Paragraph position="7"> All the tokens were tagged with the sentence number and sentence position. In order to reduce the search space, we reduced the list to the nouns in the corpus. We created a list of the source language words that were aligned to nouns in the target language, thereby creating a source-target noun list for each source word.</Paragraph> <Paragraph position="8"> We removed punctuation marks and their corresponding alignments; also, we filtered out stop words from the source language.
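The filtering pipeline just described - dropping punctuation alignments and source stop words, keeping only target nouns, and excluding source words aligned to a single target noun - can be sketched as below. The POS tags, stop-word list, and alignment representation are illustrative assumptions, not the paper's actual resources or GIZA's output format.

```python
# Sketch of building per-source-word target noun sets from token-level
# alignment pairs (null alignments assumed already dropped). The tiny
# stop-word list and "NN" POS tag are hypothetical placeholders.
from collections import defaultdict

STOP_WORDS = {"le", "la", "de"}          # hypothetical source stop list
PUNCT = {".", ",", ";", ":", "!", "?"}

def source_target_nouns(aligned_pairs, target_pos):
    """aligned_pairs: (target_token, source_token) pairs.
    target_pos: target token -> POS tag.
    Returns source word -> set of target nouns, keeping only source
    words aligned with at least two distinct target nouns."""
    table = defaultdict(set)
    for tgt, src in aligned_pairs:
        if tgt in PUNCT or src in PUNCT or src in STOP_WORDS:
            continue  # remove punctuation alignments and source stop words
        if target_pos.get(tgt) == "NN":   # restrict search space to nouns
            table[src].add(tgt)
    # words aligned to one target noun only carry no sense contrast: exclude
    return {src: nouns for src, nouns in table.items() if len(nouns) >= 2}

pairs = [("bank", "rive"), ("shore", "rive"), ("the", "la"), ("bank", "banque")]
pos = {"bank": "NN", "shore": "NN", "the": "DT"}
print(source_target_nouns(pairs, pos))
```

In this toy run, banque is excluded (one target noun) while rive survives with the set {bank, shore}, which is exactly the kind of set passed to the distance measure routine.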
Finally, we compressed the source-target list to have the following format: Src_wd_i trgt_nn_1, trgt_nn_2, ..., trgt_nn_n, where Src_wd_i is a word in the source corpus (parts of speech are not necessarily symmetric in alignments, i.e. nouns could very well map to verbs or other parts of speech) and trgt_nn_j is a noun it aligned to in the target corpus (note that the nouns at this point are types, not tokens, i.e. not instances in the corpus but rather a conflation of instances).</Paragraph> <Paragraph position="9"> Source words that were aligned with only one target word throughout the corpus were excluded from the final list of words to be tagged in our tag set. Each resulting set of English target nouns corresponding to a source word - a set had to include at least 2 nouns - was passed on to the distance measure routine.</Paragraph> <Paragraph position="10"> We used an optimization function over the senses of the nouns in a set. The function aims at maximizing similarity of meaning over all the members of a set based on a pairwise similarity calculation over all the listed senses in WordNet 1.6. The algorithm, disambiguate_class, which is implemented by Resnik and described in detail in \[Resnik, 1999\], calculates the similarity between all the senses of the words in a set. It assigns a confidence score based on the shared information content of the sense combinations, which is measured via the most informative subsumer in the taxonomy. The senses with the highest confidence scores are the senses that contribute the most to the maximization function for the set. The algorithm expects the words to be input as a set for calculating the confidence scores. In many instances, we observed considerable noise in the target noun set. For example, the French source word accord was aligned with the English nouns accord, agreement, signing, consonance, and encyclopaedia in the target corpus. All the words in the target set except the last, encyclopaedia, seem to be related to the French word accord. The source of noise can be attributed to the specific translation system, to the alignment program, or in other cases to the fact that the source language word itself is ambiguous.</Paragraph> <Paragraph position="11"> Consequently, we conducted three types of experiments in an attempt to reduce the noise in the target sets: Class_sim, Pair_sim_1 and Pair_sim_all. They essentially varied in the input format to disambiguate_class.</Paragraph> <Paragraph position="12"> For Class_sim, the target noun data was produced directly from the source-target list and input to the distance measure routine with no special formatting. Each of the target nouns was assigned the sense(s) that had the maximum confidence level from among the senses listed for it in the taxonomy, thereby creating the tag set for the target language, English. If a noun does not have an entry in the taxonomy, it is assigned a null sense.</Paragraph> <Paragraph position="13"> On the other hand, for both Pair_sim_1 and Pair_sim_all, the nouns in the target list for each source word were formatted into all pair combinations in the set and then sent to disambiguate_class. The idea was to localize the noise to the pair-level comparison, since disambiguate_class optimizes over the entire set of nouns. The senses that were selected were the ones with the maximum confidence score from the noun pair sense comparison.
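The difference between the class-level and pair-level input formats can be sketched as follows. Only the formatting of the target set is shown; disambiguate_class itself (Resnik's algorithm) is not reimplemented here.

```python
# Sketch contrasting the input formats of the experiment types:
# Class_sim passes the whole target set in one call, while the Pair_sim
# variants pass every unordered pair separately, localizing noise to
# individual pair comparisons.
from itertools import combinations

def class_sim_input(target_set):
    """Class_sim: one call over the entire target noun set."""
    return [sorted(target_set)]

def pair_sim_input(target_set):
    """Pair_sim_1 / Pair_sim_all: all pair combinations of the set."""
    return [list(pair) for pair in combinations(sorted(target_set), 2)]

print(pair_sim_input({"accord", "agreement", "encyclopaedia"}))
# [['accord', 'agreement'], ['accord', 'encyclopaedia'],
#  ['agreement', 'encyclopaedia']]
```

With the pairwise format, a noisy member like encyclopaedia only contaminates the pairs it appears in, rather than dragging down the optimization over the whole set.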
All the senses with a maximum confidence score for a noun were aggregated into a final list of senses for that noun, and duplicates were removed.</Paragraph> <Paragraph position="15"> In Pair_sim_1, only the senses that had a confidence score of 100% were considered.</Paragraph> <Paragraph position="16"> If disambiguate_class is agnostic as to whether the senses of the target noun pair are similar, each noun in this pair comparison is assigned a null sense for the noun pair in the local comparison. That does not necessarily mean that either noun will have a final null sense in the aggregate list; rather, it depends on the sum total of comparisons for each of them with all the nouns in the set.</Paragraph> <Paragraph position="17"> In Pair_sim_all, the same conditions apply as in Pair_sim_1, yet there is no threshold of 100%. A pair of nouns in a local comparison is assigned a null sense if one of the nouns in the pair is not in WordNet or all the senses get a confidence score of 0%.</Paragraph> <Paragraph position="18"> Once we had the tag set for each of our parallel corpora, we evaluated it against the manually tagged test set. So far, we have only evaluated the tag set for the target language, English. Evaluation of the source tag set is in progress; a serious hurdle is that EuroWordNet is interfaced with WordNet 1.5 only. The preliminary evaluation metric is: acc = (number of correct tags / total number of test senses) x 100 \[1\] We only considered the first sense assigned in the test set for any noun instance in the process of our evaluation. The system was not penalized if it assigned more than one sense to the noun in the tag set, as long as the correct sense was among the senses assigned.</Paragraph> <Paragraph position="19"> We conducted the three types of experiments on the 6 parallel corpora.
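The evaluation metric in \[1\] and its scoring rules can be sketched as below: only the first gold sense of each test instance counts, and an instance is scored correct if that sense is among the (possibly multiple) senses the system assigned. The instance identifiers are hypothetical.

```python
# Sketch of evaluation metric [1]:
#   acc = (number of correct tags / total number of test senses) x 100
# A tag is correct if the first gold sense is among the assigned senses.

def accuracy(system_tags, gold_tags):
    """system_tags: instance -> set of assigned senses;
    gold_tags: instance -> list of gold senses (only the first is used)."""
    correct = sum(
        1 for inst, senses in gold_tags.items()
        if senses and senses[0] in system_tags.get(inst, set())
    )
    return 100.0 * correct / len(gold_tags)

gold = {"sent1:bank": [2], "sent2:shore": [1]}
system = {"sent1:bank": {2, 3}, "sent2:shore": {2}}
print(accuracy(system, gold))  # 50.0
```

Note that assigning multiple senses is never penalized under this metric, so the Pair_sim variants, which aggregate senses across many pair comparisons, benefit from it.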
In the following section, we present the results for the GL translations for the three languages and the SYS translation for Spanish, since we found no significant difference in the results across the two translation systems for the three experiment types. Furthermore, we wanted to test the effect of merging the token alignments of the two MT systems on the accuracy rates. For all the experiment conditions, the noun instances that were excluded from the tag set but were in the test set were sense tagged using the default baseline of 67.6%, in order to report the results at 100% coverage for the test set; the results are presented in table 2 below.</Paragraph> </Section> </Section> </Paper>