<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2505"> <Title>Multilingual versus Monolingual WSD</Title> <Section position="3" start_page="33" end_page="34" type="metho"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> Recently, others have also investigated the differences between sense repositories for monolingual and multilingual WSD. Chatterjee et al.</Paragraph> <Paragraph position="1"> (2005), e.g., investigated the ambiguity in the translation of the English verb &quot;to have&quot; into Hindi. 11 translation patterns were identified for the 19 senses of the verb, according to the various target syntactic structures and/or target words for the verb. They argued that differences in both these aspects do not depend only on the sense of the verb. Out of the 14 senses analyzed, six had 2-5 different translations each.</Paragraph> <Paragraph position="2"> Bentivogli et al. (2004) proposed an approach to create an Italian sense tagged corpus (MultiSemCor) based on the transference of the annotations from the English sense tagged corpus SemCor (Miller et al., 1994), by means of word-alignment methods. A gold standard corpus was created by manually transferring senses in SemCor to the Italian words in a translated version of that corpus. From a total of 1,054 English words, 155 annotations were considered nontransferable to their corresponding Italian words, mainly due to the lack of synonymy at the lexical level.</Paragraph> <Paragraph position="3"> Mihaltz (2005) manually mapped senses from the English in a sense tagged corpus to Hungarian translations, in order to carry out WSD between these languages. Out of 43 ambiguous nouns, 38 had all or most of their English senses mapped into the same Hungarian translation.</Paragraph> <Paragraph position="4"> Some senses of the remaining nouns had to be split into different Hungarian translations. 
On average, the sense mapping decreased the ambiguity from 3.97 English senses to 2.49 Hungarian translations.</Paragraph> <Paragraph position="5"> As we intend to show with this work, differences like those mentioned above in the sense inventories make it inappropriate to use monolingual WSD strategies for multilingual disambiguation. Nevertheless, some approaches have successfully employed multilingual information, especially parallel corpora, to support monolingual WSD. They are motivated by the argument that the senses of a word should be determined based on the distinctions that are lexicalized in a second language (Resnik and Yarowsky, 1997).</Paragraph> <Paragraph position="6"> In general, the assumptions behind these approaches are the following: (1) If a source language word is translated differently into a second language, it might be ambiguous and the different translations can indicate the senses in the source language.</Paragraph> <Paragraph position="7"> (2) If two distinct source language words are translated as the same word into a second language, it often indicates that the two are being used with similar senses.</Paragraph> <Paragraph position="8"> Ide (1999), for example, analyzes translations of English words into four different languages, in order to check if the different senses of an English word are lexicalized by different words in all the other languages. A parallel aligned corpus is used and the translated senses are mapped into WordNet senses. She uses this information to determine a set of monolingual sense distinctions that is potentially useful for NLP applications. In subsequent work (Ide et al., 2002), seven languages and clustering techniques are employed to create sense groups based on the translations. Diab and Resnik (2002) use multilingual information to create an English sense tagged corpus to train a monolingual WSD approach. 
An English sense inventory and a parallel corpus automatically produced by an MT system are employed. Sentence and word alignment systems are used to assign the word correspondences between the two languages. After grouping all the words that correspond to translations of a single word in the target language, all their possible senses are considered as candidates. The sense that maximizes the semantic similarity of the word with the others in the group is chosen.</Paragraph> <Paragraph position="9"> Similarly, Ng et al. (2003) employ English-Chinese parallel word aligned corpora to identify a repository of senses for English. The English word senses are manually defined, based on the WordNet senses, and then revised in the light of the Chinese translations. For example, if two occurrences of a word with two different senses in WordNet are translated into the same Chinese word, they will be considered to have the same English sense.</Paragraph> <Paragraph position="10"> In general, these approaches rely on the two previously mentioned assumptions about the interaction between translations and word senses. Although these assumptions can be useful when using cross-language information as an approximation to monolingual disambiguation, they are not very helpful in the opposite direction, i.e., using monolingual information for cross-language disambiguation, as we will show in Section 4.</Paragraph> </Section> <Section position="4" start_page="34" end_page="35" type="metho"> <SectionTitle> 3 Experimental setting </SectionTitle> <Paragraph position="0"> We focused our experiments on verbs, which represent difficult cases for WSD. 
In particular, we experimented with five frequent and highly ambiguous verbs identified as problematic for MT systems in a previous study (Specia, 2005): &quot;to come&quot;, &quot;to get&quot;, &quot;to give&quot;, &quot;to look&quot;, and &quot;to make&quot;; and three other frequent verbs that are less ambiguous: &quot;to ask&quot;, &quot;to live&quot;, and &quot;to tell&quot;. The inclusion of the additional verbs allows us to analyze the effect of the ambiguity level in the experiment. These verbs will then be translated into Portuguese so that the resulting translations can be contrasted with the English senses.</Paragraph> <Section position="1" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 3.1 Corpus selection </SectionTitle> <Paragraph position="0"> We collected all the sentences containing one of the eight verbs and their corresponding phrasal verbs from the SemCor, Senseval-2 and Senseval-3 corpora1. These corpora were chosen because they are both widely used and easily available. In each of these corpora, ambiguous words are annotated with WordNet 2.0 senses. Occurrences which did not identify a unique sense were not used. 
The numbers of sentences selected for each verb and its phrasal verbs are shown in Table 1.</Paragraph> <Paragraph position="1"> It is worth mentioning that the phrasal verbs include simple verb-particle constructions, such as &quot;give up&quot;, and more complex multi-word expressions, e.g., &quot;get in touch with&quot;, &quot;make up for&quot;, &quot;come to mind&quot;, etc.</Paragraph> <Paragraph position="2"> In order to avoid biasing the experiment due to possible misunderstandings of the verb uses, and to make the experiment feasible, with a reasonable number of occurrences to be analyzed, we selected a subset of the total number of sentences in Table 1, which were distributed among five professional English-Portuguese translators (T1, T2, T3, T4, T5), according to the following criteria: - The meaning of the verb/phrasal verb in the context of the sentence should be understandable and non-ambiguous (for human translators).</Paragraph> <Paragraph position="3"> 1 Available at http://www.cs.unt.edu/~rada/downloads.html. - The experiment should be as comprehensive as possible, with the largest possible number of senses for each verb/phrasal verb.</Paragraph> <Paragraph position="4"> - Each translator should be given two occurrences (when available) of all the distinct senses of each verb/phrasal verb, in order to make it possible to contrast different uses of the verb. 
- The translators should not be given any information other than the sentence to select the translation.</Paragraph> <Paragraph position="5"> To meet these criteria, a professional translator, who was not involved in the translation task, post-processed the selected sentences, filtering them according to the criteria specified above.</Paragraph> <Paragraph position="6"> Due to both the scarce number of occurrences of each phrasal verb sense and the large number of different phrasal verbs for certain verbs, the post-selection of phrasal verbs was different from the post-selection of verbs. In the case of verbs, the translator scanned the sentences in order to get 10 distinct occurrences of each sense (two for each translator), eliminating those sentences which were too complex to understand or used the verb in an ambiguous way. This process did not eliminate any senses, and thus did not reduce the coverage of the experiment. When there were fewer than 10 occurrences of a given sense, sentences were repeated among translators to guarantee that each translator would be given examples of all the senses of the verb. For instance, if a sense had only four occurrences, the first two occurrences were given to T1, T3 and T5, while the other two occurrences were given to T2 and T4. If a sense occurred only once for a verb, it was repeated for all five translators.</Paragraph> <Paragraph position="7"> For phrasal verbs, the same process was used to eliminate the complex and ambiguous sentences. Two occurrences (when available) of each sense of a phrasal verb were then selected.</Paragraph> <Paragraph position="8"> Due to the large number of different phrasal verbs for certain verbs, they were divided among translators, so that each translator was given two occurrences of only some phrasal verbs of each verb. 
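The distribution of verb occurrences among translators described above can be sketched in code. This is only a minimal illustration under stated assumptions, not the procedure actually used in the experiment: the function name distribute_occurrences, its parameters, and the cyclic assignment of pairs are hypothetical and simply generalize the two cases given in the text (a sense with four occurrences and a sense with a single occurrence).

```python
from typing import Dict, List, Sequence


def distribute_occurrences(
    occurrences: Sequence[str],
    translators: Sequence[str] = ("T1", "T2", "T3", "T4", "T5"),
    per_translator: int = 2,
) -> Dict[str, List[str]]:
    """Assign the occurrences of one sense to translators, repeating when scarce.

    Hypothetical helper: the cyclic rule is an assumption that reproduces
    the four-occurrence and single-occurrence examples from the text.
    """
    # Keep at most per_translator occurrences per translator (10 in total here).
    needed = per_translator * len(translators)
    selected = list(occurrences[:needed])
    if not selected:
        return {t: [] for t in translators}
    # Split the selected occurrences into consecutive groups of per_translator items.
    groups = [
        selected[i:i + per_translator]
        for i in range(0, len(selected), per_translator)
    ]
    # Hand the groups out cyclically, so scarce senses are repeated across translators.
    return {
        t: list(groups[i % len(groups)])
        for i, t in enumerate(translators)
    }
```

Under these assumptions, a sense with four occurrences gives the first two to T1, T3 and T5 and the other two to T2 and T4, while a sense with a single occurrence is repeated for all five translators.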
Sentences were distributed so that all translators had a similar number of cases, as shown in In order to avoid biasing the translations according to the English senses, the original sense annotations were not shown to the translators and the sentences for each of the verbs, together with their phrasal verbs, were randomly ordered.</Paragraph> <Paragraph position="9"> Additionally, we gave the same set of selected sentences to another group of five translators, so that we could analyze the reliability of the experiment by investigating the agreement between the groups of translators on the same data.</Paragraph> <Paragraph position="10"> distribution among the five translators</Paragraph> </Section> <Section position="2" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 3.2 English senses and Portuguese translations </SectionTitle> <Paragraph position="0"> As mentioned above, the corpora used are tagged with WordNet senses. Although this may not be the optimal sense inventory for many purposes, it is the best option in terms of availability and comprehensiveness. Moreover, it is the most frequently used repository for monolingual WSD systems, making it possible to generalize, to a certain level, our results to most of the monolingual work. The number of senses for the eight selected verbs (and their phrasal verbs) in WordNet 2.0, along with the number of their As we can see, the number of possible translations is different from the number of possible senses, which already shows that there is not a one-to-one correspondence between senses and translations (although there is a high correlation between the number of senses and translations: Pearson's Correlation = 0.955). In general, the number of possible translations is greater than</Paragraph> </Section> </Section> <Section position="5" start_page="35" end_page="36" type="metho"> <SectionTitle> 2 For example, DIC Pratico Michaelis(r), version 5.1. 
</SectionTitle> <Paragraph position="0"> the number of possible senses, in part because synonyms are considered as different translations. As we will show in Section 5 (Table 4), we eliminate the use of synonyms as possible translations. Moreover, we are dealing with a limited set of possible senses, provided by the SemCor and Senseval data. As a consequence, the number of translations pointed out by the human translators for our corpus will be considerably smaller than the total number of possible translations.</Paragraph> </Section> <Section position="6" start_page="36" end_page="37" type="metho"> <SectionTitle> 4 Contrasting senses and translations </SectionTitle> <Paragraph position="0"> In order to contrast the English senses with the Portuguese translations, we submitted the selected sentences (cf. Section 3.1) to two groups of five translators (T1, T2, T3, T4, and T5), all native speakers of Portuguese. We asked the translators to assign the appropriate translation to each of the verb occurrences, which we would then compare to the original English senses.</Paragraph> <Paragraph position="1"> They were not told what their translations were going to be used for.</Paragraph> <Paragraph position="2"> The translators were provided with entire sentences, but for practical reasons they were asked to translate only the verb and were allowed to use any bilingual resource to search for possible translations, if needed. They were asked to avoid considering synonyms as different translations.</Paragraph> <Paragraph position="3"> The following procedure was defined to analyze the results returned by the translators, for each verb and its phrasal verbs separately: 1) We grouped all the occurrences of an English sense and looked at all the translations used by the translators in order to identify synonyms (in those specific uses), using a dictionary of Portuguese synonyms. 
Synonyms were considered as a single translation.</Paragraph> <Paragraph position="4"> 2) We then analyzed the sentences which had been given to multiple translators of the same group (when there were not enough occurrences of certain senses, as mentioned in Section 3.1), in order to identify a single translation for the occurrence and eliminate redundancies. The translation chosen was the one pointed out by the majority of the translators. When it was not possible to elect only one translation, the n equally most used were kept, and thus the sentence was repeated n times.</Paragraph> <Paragraph position="5"> 3) Finally, we examined the relation between senses and translations, focusing on two cases: (1) if a sense had only one or many translations; and (2) if a translation referred to only one or many senses, i.e., whether the sense was shared by many translations. We placed each sense into two of the following categories, explained below: (a) or (b), mutually exclusive, representing the first case; and (c), (d) or (e), also mutually exclusive, representing the second case.</Paragraph> <Paragraph position="6"> (a) 1 sense → 1 translation: all the occurrences of the same sense being translated as the same Portuguese word. For example, &quot;to ask&quot;, in the sense of &quot;inquire, enquire&quot;, is always translated as &quot;perguntar&quot;.</Paragraph> <Paragraph position="7"> (b) 1 sense → n translations: different occurrences of the same sense being translated as different, non-synonymous Portuguese words.</Paragraph> <Paragraph position="8"> For example, &quot;to look&quot;, in the sense of &quot;perceive with attention; direct one's gaze towards&quot; can be translated as &quot;olhar&quot;, &quot;assistir&quot;, and &quot;voltar-se&quot;.</Paragraph> <Paragraph position="9"> (c) n senses → 1 translation (ambiguous): different senses of a word being translated as the same Portuguese word, which encompasses all the English senses. 
For example, &quot;make&quot;, in the sense of &quot;engage in&quot;, &quot;create&quot;, and &quot;give certain properties to something&quot;, is translated as &quot;fazer&quot;, which carries the three senses.</Paragraph> <Paragraph position="10"> (d) n senses → 1 translation (non-ambiguous): different senses of a word being translated using the same Portuguese word, which has only one sense. For example, &quot;take advantage&quot; in both the senses of &quot;draw advantages from&quot; and &quot;make excessive use of&quot;, being translated as &quot;aproveitar-se&quot;.</Paragraph> <Paragraph position="11"> (e) n senses → n translations: different senses of a word being translated as different Portuguese words. For example, the &quot;move fast&quot; and &quot;carry out a process or program&quot; senses of the verb &quot;run&quot; being translated respectively as &quot;correr&quot; and &quot;executar&quot;. Items (a) and (e) represent cases where multilingual ambiguity only reflects the monolingual one, that is, all the occurrences of each sense of an English word correspond to a specific Portuguese translation. On the other hand, items (b), (c) and (d) provide evidence that multilingual ambiguity is different from monolingual ambiguity. Item (b) means that different criteria are needed for the disambiguation, as ambiguity arises only during the translation, due to specific principles used to distinguish senses in Portuguese. Items (c) and (d) mean that disambiguation is not necessary, as either the Portuguese translation is also ambiguous, embracing the same senses of the English word, or Portuguese has a less refined sense distinction.</Paragraph> </Section> </Paper>