<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1192"> <Title>Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets</Title> <Section position="3" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 The Experiment </SectionTitle> <Paragraph position="0"> The parallel corpus we used for our experiments is based on Orwell's novel &quot;Nineteen Eighty-Four&quot; (&quot;1984&quot;), and was initially developed by the Multext-East consortium. Besides Orwell's original text, the corpus contained professional translations in six languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene).</Paragraph> <Paragraph position="1"> The Multext-East corpus (and other language resources) is maintained by Tomaz Erjavec, and a new release of it may be found at http://nl.ijs.si/ME/V3. Later, the parallel corpus was extended with translations into many other languages. The BalkaNet consortium added three new translations to the &quot;1984&quot; corpus: Greek, Serbian and Turkish. Each language text is tokenized, tagged and sentence-aligned to the English original. We extracted from the entire parallel corpus only the languages of concern in the BalkaNet project (English, Bulgarian, Czech, Greek, Romanian, Serbian and Turkish) and further retained only the 1-1 sentence alignments between English and all the other languages. This way, we built a unique alignment for all the languages and, by exploiting the transitivity of sentence alignment, we can run experiments with any combination of languages.</Paragraph> <Paragraph position="2"> The BalkaNet version of the &quot;1984&quot; corpus is encoded as a sequence of translation units (TU), each containing one sentence per language, such that the sentences are reciprocal translations. 
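The transitivity argument above can be sketched in a few lines. This is an illustrative assumption, not the Multext-East tooling: we model each retained 1-1 alignment as a mapping from English sentence ids to the other language's sentence ids, and align any two translations to each other through the shared English pivot (the function name and toy ids are hypothetical).

```python
def transitive_alignment(en_to_l1, en_to_l2):
    """Align two translations via the English pivot: any English sentence
    linked 1-1 to a sentence in both languages yields a direct L1-L2 link."""
    return {
        l1_id: en_to_l2[en_id]
        for en_id, l1_id in en_to_l1.items()
        if en_id in en_to_l2
    }

# Toy example: English sentence ids mapped to Romanian and Bulgarian ids.
en_ro = {1: "ro-1", 2: "ro-2", 3: "ro-3"}
en_bg = {1: "bg-1", 3: "bg-9"}
print(transitive_alignment(en_ro, en_bg))  # {'ro-1': 'bg-1', 'ro-3': 'bg-9'}
```

Because every language is aligned to the same English text, this pivot step gives any language pair (or n-tuple) without computing new pairwise alignments.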
In order both to evaluate the performance of the WSDtool and to assess the accuracy of the interlingual linking of the BalkaNet wordnets, we selected a bag of English target words (nouns and verbs) occurring in the corpus. The selection considered only polysemous words (at least two senses per part of speech) implemented (and ILI-linked) in all BalkaNet wordnets. This yielded 211 words with 1644 occurrences in the English part of the parallel corpus.</Paragraph> <Paragraph position="3"> Three experts independently sense-tagged all the occurrences of the target words, and the disagreements were negotiated until consensus was obtained. The commonly agreed annotation represented the Gold Standard (GS) against which the WSD algorithm was evaluated.</Paragraph> <Paragraph position="4"> Additionally, 13 students enrolled in a Computational Linguistics Master's program were asked to manually sense-tag overlapping subsets of the same word occurrences. The overlapping ensured that each target word occurrence was seen by at least three students.</Paragraph> <Paragraph position="5"> Based on the students' annotations, using majority voting, we computed another set of comparison data, referred to below as SMAJ (Students MAJority).</Paragraph> <Paragraph position="6"> Finally, the same target words were automatically disambiguated by the WSDtool algorithm (ALG), which was run both with and without the back-off clustering algorithm. For the basic wordnet-based WSD we used the Princeton WordNet, the Romanian wordnet and the English-Romanian translation equivalence dictionary. 
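The SMAJ vote described above can be sketched as follows. This is a minimal sketch under stated assumptions (the paper does not give the exact tie-breaking rule; the function name and sense labels are hypothetical): each occurrence carries the sense tags of the at-least-three students who saw it, and the majority tag is kept.

```python
from collections import Counter

def majority_sense(tags):
    """Return the most frequent sense tag among the student annotations
    for one occurrence; ties are broken by Counter's internal ordering."""
    (sense, _count), = Counter(tags).most_common(1)
    return sense

# One occurrence, tagged by three students: two agree, one dissents.
print(majority_sense(["sense%1", "sense%1", "sense%3"]))  # sense%1
```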
For the back-off clustering we extracted a four-language translation dictionary (EN-RO-CZ-BG), based on which we computed the initial clustering vectors for all occurrences of the target words.</Paragraph> <Paragraph position="7"> Although we used only the RO, CZ and BG translation texts, nothing prevents us from using any other translations, irrespective of whether or not their languages belong to the BalkaNet consortium. Out of the set of 211 target words with 1644 occurrences, the system could not make a decision for 38 words (18%), accounting for 63 occurrences (3.83%). Most of these words were hapax legomena (21), for which neither the wordnet-based step nor the clustering back-off could do anything. Others (28) were not translated by the same part of speech, were wrongly translated by the human translator, or were not translated at all. Finally, four occurrences remained untagged due to the incompleteness of the Romanian synsets linked to the relevant concepts (that is, the four translation equivalents had their relevant sense missing from the Romanian wordnet). Applying the simple heuristic (SH) that assigns any unlabelled target occurrence its most frequent sense, 42 of the 63 received a correct sense-tag. The table below summarizes the results.</Paragraph> <Paragraph position="8"> [Table omitted in extraction: results for the algorithm based on aligned wordnets (AWN), for AWN with clustering (AWN+C), for AWN+C plus the simple heuristic (AWN+C+SH), and for the students' majority voting (SMAJ).] It is interesting to note that in this experiment the students' majority annotation is less accurate than the automatic WSD annotation in all three variants. This is a very encouraging result, since it shows that the tedious manual sense-tagging involved in building word-sense disambiguated corpora for supervised training can be avoided.</Paragraph> </Section> </Paper>