<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0201"> <Title>Getting Serious about Word Sense Disambiguation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Much recent research in the field of natural language processing (NLP) has focused on an empirical, corpus-based approach (Church and Mercer, 1993).</Paragraph> <Paragraph position="1"> The high accuracy achieved by a corpus-based approach to part-of-speech tagging and noun phrase parsing, as demonstrated by (Church, 1988), has inspired similar approaches to other NLP problems, including syntactic parsing and word sense disambiguation (WSD).</Paragraph> <Paragraph position="2"> The availability of large quantities of part-of-speech tagged and syntactically parsed sentences, such as the Penn Treebank corpus (Marcus, Santorini, and Marcinkiewicz, 1993), has contributed greatly to the development of robust, broad coverage part-of-speech taggers and syntactic parsers: the Penn Treebank contains enough annotated sentences to serve as adequate training material for building them.</Paragraph> <Paragraph position="3"> Unfortunately, no analogous sense-tagged corpus large enough to support broad coverage, high accuracy word sense disambiguation is available at present. In this paper, I argue that, given the current state-of-the-art capability of automated machine learning algorithms, a supervised learning approach using a large sense-tagged corpus is a viable way to build a robust, broad coverage, and high accuracy WSD program. On this view, a large sense-tagged corpus is indispensable to achieving broad coverage, high accuracy WSD.</Paragraph> <Paragraph position="4"> The rest of this paper is organized as follows.
In Section 2, I briefly discuss the utility of WSD in practical NLP tasks such as information retrieval and machine translation, and address some objections to WSD research. In Section 3, I examine the effect of training corpus size on the accuracy of WSD, using a corpus of 192,800 occurrences of 191 words hand tagged with WORDNET senses (Ng and Lee, 1996). In Section 4, I estimate the size of the sense-tagged corpus and the manual annotation effort needed to build a broad coverage, high accuracy WSD program. Finally, in Section 5, I suggest that intelligent example selection techniques may significantly reduce the amount of sense-tagged data needed, and offer this problem as a fruitful direction for WSD research.</Paragraph> </Section> </Paper>