<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1030">
  <Title>Using the Web to Overcome Data Sparseness</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In two recent papers, Banko and Brill (2001a; 2001b) criticize the fact that current NLP algorithms are typically optimized, tested, and compared on fairly small data sets (corpora with millions of words), even though data sets several orders of magnitude larger are available, at least for some tasks.</Paragraph>
    <Paragraph position="1"> Banko and Brill go on to demonstrate that learning algorithms typically used for NLP tasks benefit significantly from larger training sets, and their performance shows no sign of reaching an asymptote as the size of the training set increases.</Paragraph>
    <Paragraph position="2"> Arguably, the largest data set that is available for NLP is the web, which currently consists of at least 968 million pages.</Paragraph>
    <Paragraph position="3">  Data retrieved from the web therefore provides enormous potential  This is the number of pages indexed by Google in March 2002, as estimated by Search Engine Showdown (see http://www.searchengineshowdown.com/).</Paragraph>
    <Paragraph position="4"> for training NLP algorithms, if Banko and Brill's findings generalize. There is a small body of existing research that tries to harness the potential of the web for NLP. Grefenstette and Nioche (2000) and Jones and Ghani (2000) use the web to generate corpora for languages where electronic resources are scarce, while Resnik (1999) describes a method for mining the web for bilingual texts. Mihalcea and Moldovan (1999) and Agirre and Martinez (2000) use the web for word sense disambiguation, and Volk (2001) proposes a method for resolving PP attachment ambiguities based on web data.</Paragraph>
    <Paragraph position="5"> A particularly interesting application is proposed by Grefenstette (1998), who uses the web for example-based machine translation. His task is to translate compounds from French into English, with corpus evidence serving as a filter for candidate translations. As an example consider the French compound groupe de travail. There are five translation of groupe and three translations for travail (in the dictionary that Grefenstette (1998) is using), resulting in 15 possible candidate translations. Only one of them, viz., work group has a high corpus frequency, which makes it likely that this is the correct translation into English. Grefenstette (1998) observes that this approach suffers from an acute data sparseness problem if the corpus counts are obtained from a conventional corpus such as the British National Corpus (BNC) (Burnard, 1995).</Paragraph>
    <Paragraph position="6"> However, as Grefenstette (1998) demonstrates, this problem can be overcome by obtaining counts through web searches, instead of relying on the BNC. Grefenstette (1998) therefore effectively uses the web as a way of obtaining counts for compounds that are sparse in the BNC.</Paragraph>
    <Paragraph position="7"> Association for Computational Linguistics.</Paragraph>
    <Paragraph position="8"> Language Processing (EMNLP), Philadelphia, July 2002, pp. 230-237. Proceedings of the Conference on Empirical Methods in Natural While this is an important initial result, it raises the question of the generality of the proposed approach to overcoming data sparseness. It remains to be shown that web counts are generally useful for approximating data that is sparse or unseen in a given corpus. It seems possible, for instance, that Grefenstette's (1998) results are limited to his particular task (filtering potential translations) or to his particular linguistic phenomenon (noun-noun compounds). Another potential problem is the fact that web counts are far more noisy than counts obtained from a well-edited, carefully balanced corpus such as the BNC. The effect of this noise on the usefulness of the web counts is largely unexplored.</Paragraph>
    <Paragraph position="9"> The aim of the present paper is to generalize Grefenstette's (1998) findings by testing the hypothesis that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. Instead of having a particular task in mind (which would introduce a sampling bias), we rely on sets of bigrams that are randomly selected from the corpus.</Paragraph>
    <Paragraph position="10"> We use a web-based approach not only for noun-noun bigrams, but also for adjective-noun and verb-object bigrams, so as to explore whether this approach generalizes to different predicate-argument combinations. We evaluate our web counts in two different ways: (a) comparison with actual corpus frequencies, and (b) task-based evaluation (predicting human plausibility judgments).</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML