<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0801"> <Title>An Unsupervised Method for Multilingual Word Sense Tagging Using Parallel Corpora: A Preliminary Investigation</Title> <Section position="5" start_page="4" end_page="4" type="evalu"> <SectionTitle> 3.3. Results and Discussion </SectionTitle> <Paragraph position="0"> The investigation yielded the following results. [Table 1: Class_sim, Pair_sim_1, and Pair_sim_all coverage and accuracy percentages of the test set data for the English target corpus; table body not recovered.] The first column has the 5 experiment conditions used as source language filters of the English target corpus, and the first row has the three experiment types. FG is the French translation of the Brown corpus rendered by the MT system GL; GG is the German translation by GL; SG is the Spanish translation by GL; SS is the Spanish translation by the MT system SYS; and MSp is the merged Spanish translations from both MT systems. All the results are presented as percentages: Cov. indicates the percentage of the test set covered by the tag set, and Acc. is the percent correct at that coverage level, based on the evaluation measure in [1].</Paragraph> <Paragraph position="1"> Across the board, the results from Pair_sim_all for all the experiment conditions are higher than the results from Pair_sim_1, which in turn are higher than the Class_sim results. The results do not seem to suggest any significant difference between the two Spanish translations SG and SS across the three experiment types. On the other hand, results from MSp outperform the individual Spanish translation systems for the Pair_sim_1 and Pair_sim_all experiments, by a margin of ~25% in coverage and ~6% in accuracy. In the Class_sim experiment, the individual Spanish translations outperform the MSp condition. We also note that coverage is higher for all the experiment conditions.</Paragraph> <Paragraph position="2"> [Table 2: accuracy over the full test set data for the target tag set; table body not recovered.] FG, GG, SG, SS, and MSp are the same as in table 1. RBL is the random baseline, while DBL is the default baseline. All the experimental conditions significantly outperformed the random baseline. None of the conditions outperformed the default baseline, DBL, in either the Class_sim or the Pair_sim_1 experiments. Pair_sim_1 had a higher accuracy rate than Class_sim for all the experiment conditions. Similar to the observations in table 1, Pair_sim_all outperformed the other two experiment types for all the experiment conditions. Pair_sim_all also outperformed the default baseline, with an improvement ranging from 1.4% (marginal in this case) to 9%. It is worth noting that there was no significant difference between the experimental conditions SG and SS across the experiment types. As in table 1, the results from MSp are significantly higher than those obtained from the individual Spanish translation conditions for both Pair_sim_1 and Pair_sim_all, while the results for Class_sim were much lower than for the individual Spanish conditions. This can be attributed to the fact that, in combining evidence from both translations, we also aggregated the noise in the target set from both translations. The noise causes the disambiguation in the Class_sim condition to get trapped into assigning higher confidences to irrelevant senses.</Paragraph> <Paragraph position="3"> In terms of the overall performance of the different conditions, the results suggest that merging the two translation systems yields the best results, with an improvement of 6% over the individual Spanish translations in Pair_sim_all. Examining the results across the three languages, there were only slight variations in the accuracy rates in the Pair_sim_1 and Pair_sim_all experiments at full coverage, as exemplified in table 2. Yet we note the low relative coverage of the test data in the German, GG, condition, as shown in table 1. This can be explained as a result of the nature of the German language, which is highly agglutinative, thereby affecting the quality of the alignments. It could also be a reflection of the quality of the GL MT system for German.</Paragraph>
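<Paragraph> The merging and scoring just described can be made concrete with a short sketch. The paper gives neither an implementation nor its exact merging rule, so the Python below is illustrative only: the function names, the toy data, and the choice to pool evidence by summing per-sense confidences are assumptions for illustration, not a specification from the paper.

from collections import defaultdict

def merge_evidence(cond_a, cond_b):
    """Pool per-noun sense-confidence distributions from two conditions
    (e.g. SG and SS into MSp); note that noise is aggregated as well."""
    merged = defaultdict(lambda: defaultdict(float))
    for cond in (cond_a, cond_b):
        for noun, sense_conf in cond.items():
            for sense, conf in sense_conf.items():
                merged[noun][sense] += conf
    return merged

def evaluate(condition, gold):
    """Cov. = fraction of test nouns the condition tags at all;
    Acc. = fraction tagged correctly among the covered nouns."""
    covered = correct = 0
    for noun, gold_sense in gold.items():
        sense_conf = condition.get(noun)
        if not sense_conf:
            continue  # noun left untagged: hurts coverage, not accuracy
        covered += 1
        best = max(sense_conf, key=sense_conf.get)
        correct += (best == gold_sense)
    cov = covered / len(gold)
    acc = correct / covered if covered else 0.0
    return cov, acc

# Toy illustration with invented data (not the paper's figures):
sg = {"bank": {"bank%1": 0.6, "bank%2": 0.4}}
ss = {"bank": {"bank%1": 0.7}, "plant": {"plant%2": 1.0}}
gold = {"bank": "bank%1", "plant": "plant%2", "spring": "spring%3"}
print(evaluate(merge_evidence(sg, ss), gold))  # merged coverage exceeds sg alone

As in the experiments above, pooling raises coverage because either system alone can supply a tag, but it also pools each system's noise. </Paragraph>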
<Paragraph position="4"> The most interesting result is that of the MSp condition in table 1, which indicates that 81.8% of the target data can be sense tagged with an accuracy of 79%, significantly higher than chance (25.6%) and higher than the default tagging of 67.6%. We have yet to investigate the source tag set in order to see how many of these source words can transparently acquire the target noun senses. The fine granularity of WordNet leads us to suspect that the appropriate level of evaluation will be at the most informative subsumer level in the taxonomy (a coarser grain), as opposed to the actual sense tagged for the corresponding aligned target noun; a comparison of this kind is sketched below.</Paragraph> <Paragraph position="5"> The low accuracy rates over the full test set (table 2) may be attributed to the cascading of different sources of noise in the evaluation method, starting with a less-than-perfect translation [5] and an automated alignment program with a reported accuracy rate of ~92% for English-to-German word alignments [Och & Ney, 2000]. The latter result has to be considered with caution in the present experimental design context, since the evaluation of the alignments was done against a human translation on a closed-domain corpus, for only one of the languages under consideration in the current investigation. A large-scale multilingual evaluation of the alignment program is much needed. Looking qualitatively at some of the automatic alignments, some cases had very tight alignments in the target language. For instance, the French word abri was aligned with cover and shed; agitation, in French, was aligned with the nouns agitation, bustle, commotion, flurry, fuss, restlessness, and turmoil. [Footnote 5: We do not know of any formal evaluation of the quality of the two translation packages used.]</Paragraph> <Paragraph position="6"> Word ambiguity in the source language could have contributed to the low accuracy rates attained. In many cases, we noticed that the source language seemed to preserve the ambiguity found in the target language. For example, (a) the French word canon was aligned with the target nouns cannon, cannonball, canon, theologian; (b) the French word bandes was aligned with the target nouns band, gang, mob, streaks, strips, tapes, tracks. In both examples we see at least two clusters in the target noun sets: in (a), cannon and cannonball form one cluster, while canon and theologian form the other; in (b), the word band is ambiguous, and band, gang, and mob can form one cluster, while band, streaks, strips, tapes, and tracks could form another. We are currently investigating the effect of incorporating co-occurrence information as a means of clustering the words in the target set, aiming at delineating the senses for the source language word; a sketch of such a clustering step follows this paragraph. Another source of noise is the metaphoric as well as slang usage of some of the words in the target language; for instance, bébés, in French, was aligned with babes and babies in the target language.</Paragraph>
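<Paragraph> As flagged above, the co-occurrence clustering of a target set can be sketched briefly. The paper reports this as work in progress without implementation details, so everything below is only a plausible rendering: the window size, the cosine measure, the greedy single-link grouping, and the threshold are all assumptions.

import math
from collections import Counter

def cooc_vector(word, sentences, window=5):
    """Count the words co-occurring with `word` within a fixed window."""
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in sent[lo:hi] if t != word)
    return counts

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster_target_set(words, sentences, threshold=0.3):
    """Greedy single-link clustering of an aligned target set, so that,
    e.g., {band, gang, mob} and {streaks, strips, tapes, tracks} can
    surface as distinct senses of French 'bandes'."""
    vecs = {w: cooc_vector(w, sentences) for w in words}
    clusters = []
    for w in words:
        for c in clusters:
            if any(cosine(vecs[w], vecs[m]) >= threshold for m in c):
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

# e.g. cluster_target_set(["band", "gang", "mob", "tapes", "tracks"], corpus)
# where `corpus` is a list of tokenized English sentences.
</Paragraph>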
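<Paragraph> Returning to the coarser-grained evaluation suggested above: one natural reading of "most informative subsumer" is to count a predicted WordNet sense as correct when it meets the gold sense at a sufficiently specific common hypernym, rather than requiring an exact fine-grained match. The sketch below uses NLTK's WordNet interface, which postdates the paper; the depth cutoff is an arbitrary assumption.

from nltk.corpus import wordnet as wn

def coarse_match(pred, gold, min_depth=6):
    """True if the predicted and gold noun synsets share a subsumer deep
    enough in the taxonomy to count as the same coarse-grained sense."""
    if pred == gold:
        return True
    subsumers = pred.lowest_common_hypernyms(gold)
    return any(s.min_depth() >= min_depth for s in subsumers)

# e.g., for the pair from example (a) above:
print(coarse_match(wn.synset("cannon.n.01"), wn.synset("cannonball.n.01")))
</Paragraph>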
<Paragraph position="7"> We expect the results to improve the more distant the language pair is. Moreover, combining different language sources simultaneously could yield improved results, since languages will differ in the manner in which they conflate senses.</Paragraph> <Paragraph position="8"> We would like to explore different evaluation metrics for the target language that are fine-tuned to the fine granularity of WordNet, as well as devise methods for obtaining a quantitative measure of evaluation for the source tag set.</Paragraph> </Section> </Paper>