<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1039">
  <Title>Relieving The Data Acquisition Bottleneck In Word Sense Disambiguation</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Supervised Word Sense Disambiguation (WSD) systems perform better than unsupervised systems.</Paragraph>
    <Paragraph position="1"> But lack of training data is a severe bottleneck for supervised systems due to the extensive labor and cost involved. Indeed, one of the main goals of the SENSEVAL exercises is to create large amounts of sense-annotated data for supervised systems (Kilgarriff&amp;Rosenzweig, 2000). The problem is even more challenging for languages which possess scarce computer readable knowledge resources.</Paragraph>
    <Paragraph position="2"> In this paper, we investigate the role of large amounts of noisily sense annotated data obtained using an unsupervised approach in relieving the data acquisition bottleneck for the WSD task. We bootstrap a supervised learning WSD system with an unsupervised seed set. We use the sense annotated data produced by Diab's unsupervised system SALAAM (Diab&amp;Resnik, 2002; Diab, 2003). SALAAM is a WSD system that exploits parallel corpora for sense disambiguation of words in running text. To date, SALAAM yields the best scores for an unsupervised system on the SENSEVAL2 English All-Words task (Diab, 2003). SALAAM is an appealing approach as it provides automatically sense annotated data in two languages simultaneously, thereby providing a multilingual framework for solving the data acquisition problem. For instance, SALAAM has been used to bootstrap the WSD process for Arabic as illustrated in (Diab, 2004).</Paragraph>
    <Paragraph position="3"> In a supervised learning setting, WSD is cast as a classification problem, where a predefined set of sense tags constitutes the classes. The ambiguous words in text are assigned one or more of these classes by a machine learning algorithm based on some extracted features. This algorithm learns parameters from explicit associations between the class and the features, or combination of features, that characterize it. Therefore, such systems are very sensitive to the training data, and those data are, generally, assumed to be as clean as possible.</Paragraph>
    <Paragraph position="4"> In this paper, we question that assumption. Can large amounts of noisily annotated data used in training be useful within such a learning paradigm for WSD? What is the nature of the quality-quantity trade-off in addressing this problem?</Paragraph>
  </Section>
class="xml-element"></Paper>