<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0801">
  <Title>An Unsupervised Method for Multilingual Word Sense Tagging Using Parallel Corpora: A Preliminary Investigation</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> With the term &quot;globalization&quot; becoming the theme of current political and economic discourse, communications technology, exemplified by the World Wide Web (WWW), has become a source of an abundance of languages. Language researchers face the ever-present challenge, and excitement, of studying and processing these languages and creating the appropriate NLP applications for them. Yet a major bottleneck for many NLP applications, such as machine translation, cross-language information retrieval, and natural language understanding, is word sense ambiguity. The problem escalates as we deal with languages that are scarce in processing resources and knowledge bases. The availability of large-scale, accurately sense-tagged data should help alleviate the problem.</Paragraph>
    <Paragraph position="1"> It has been acknowledged that the best way to acquire sense tags for words in a corpus is manually, which has proven to be a very expensive and labor-intensive endeavor. In an attempt to approximate the human effort, both supervised [Bruce &amp; Wiebe, 1994; Lin, 1999; etc.] and unsupervised methods [Resnik, 1997; Yarowsky, 1992 &amp; 1995; etc.] have been proposed to solve the problem automatically. On average, supervised methods report higher accuracy rates, but they require large amounts of sense-tagged data as training material. Most methods to date aim at solving the problem for one language, namely the language with the most available linguistic resources. Moreover, most of the proposed approaches report results on only a handful of the data, so they remain small-scale solutions.</Paragraph>
    <Paragraph position="2"> Many researchers in the field have looked at language translations as a source for sense distinctions [Dagan &amp; Itai, 1994; Dyvik, 1998; Ide, in press; Resnik &amp; Yarowsky, 1999; etc.].</Paragraph>
    <Paragraph position="3"> The idea is that polysemous words in one language can be translated as distinct words in a different language. The problem has always been the availability of large corpora in translation, i.e., parallel corpora. Resnik [1999] proposed a method for facilitating the acquisition of parallel corpora from the WWW.</Paragraph>
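The idea that a polysemous word may receive distinct translations can be illustrated with a minimal sketch. It assumes a word-aligned parallel corpus is already available; the toy English-French alignments and the function names below are hypothetical and are not taken from the paper. Distinct translations are only a candidate signal, since two target words may also be synonyms.

```python
from collections import defaultdict

# Hypothetical word-aligned tokens: (English word, French translation, sentence id).
# These toy alignments are illustrative assumptions, not data from the paper.
ALIGNED_TOKENS = [
    ("bank", "banque", 1),
    ("bank", "rive", 2),
    ("bank", "banque", 3),
    ("money", "argent", 4),
]


def group_by_translation(aligned_tokens):
    """Map each source word to {translation: [sentence ids where it occurs]}."""
    groups = defaultdict(lambda: defaultdict(list))
    for source, target, sentence_id in aligned_tokens:
        groups[source][target].append(sentence_id)
    return {word: dict(translations) for word, translations in groups.items()}


def candidate_polysemous_words(groups):
    """Return source words aligned to two or more distinct translations."""
    # Every word in `groups` has at least one translation, so "not exactly one"
    # means "two or more".
    return sorted(word for word, translations in groups.items()
                  if len(translations) != 1)


if __name__ == "__main__":
    groups = group_by_translation(ALIGNED_TOKENS)
    print(groups["bank"])
    print(candidate_polysemous_words(groups))
```

In this toy data, the occurrences of "bank" split into two groups according to their translation, which is the kind of evidence a translation-based approach could use to separate senses.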
    <Paragraph position="4"> Potentially, we can have parallel corpora in a myriad of languages, yet the downside is the scarcity of linguistic knowledge resources and processing tools for less widely represented or studied languages. Consequently, we decided to bootstrap the process of word sense tagging for both languages in a parallel corpus, using the translations as a source of word sense distinctions. We thereby obtain sense-tagged data for languages with scarce resources and create a supply of large-scale, automatically sense-tagged data, albeit noisy, for the language with more knowledge resources, to be utilized by supervised algorithms. In this paper, we propose an unsupervised method for automatic word sense tagging of both corpora. The algorithm assumes the availability of a word sense inventory in one of the languages. A preliminary evaluation of the method on the nouns in an English corpus yielded accuracy rates in the range of 69-77% on the polysemous nouns in a hand-tagged test set, compared with a random baseline of 25.6% and a most-frequent-sense baseline of 67.6%.</Paragraph>
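For readers unfamiliar with the two baselines, the following sketch shows one common way such figures are computed. It is an assumption about the general technique, not a description of the authors' evaluation procedure: the test instances, the sense inventory, and the convention that the first-listed sense is the most frequent one are all hypothetical.

```python
# Hypothetical hand-tagged test instances: (noun, gold sense id).
TEST_SET = [
    ("bank", "bank.1"),
    ("bank", "bank.2"),
    ("bank", "bank.1"),
    ("plant", "plant.1"),
]

# Hypothetical sense inventory. By assumption, the first-listed sense of
# each word is its most frequent sense.
INVENTORY = {
    "bank": ["bank.1", "bank.2"],
    "plant": ["plant.1", "plant.2", "plant.3", "plant.4"],
}


def random_baseline(test_set, inventory):
    """Expected accuracy when a sense is chosen uniformly at random per instance."""
    expected_hits = sum(1.0 / len(inventory[word]) for word, _ in test_set)
    return expected_hits / len(test_set)


def most_frequent_sense_baseline(test_set, inventory):
    """Accuracy when every instance is tagged with the first-listed sense."""
    hits = sum(1 for word, gold in test_set if inventory[word][0] == gold)
    return hits / len(test_set)


if __name__ == "__main__":
    print(random_baseline(TEST_SET, INVENTORY))
    print(most_frequent_sense_baseline(TEST_SET, INVENTORY))
```

On this toy data the random baseline is 0.4375 and the most-frequent-sense baseline is 0.75. The gap between the two shows why the most-frequent-sense baseline is usually the harder one to beat.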
    <Paragraph position="5"> In the following section we describe the proposed method, followed by a preliminary evaluation of the method. Section 4 discusses related work, and we conclude with some thoughts on future directions in Section 5.</Paragraph>
  </Section>
</Paper>