<?xml version="1.0" standalone="yes"?> <Paper uid="J03-3005"> <Title>Using the Web to Obtain Frequencies for Unseen Bigrams</Title> <Section position="2" start_page="0" end_page="462" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> In two recent papers, Banko and Brill (2001a, 2001b) criticize the fact that current NLP algorithms are typically optimized, tested, and compared on fairly small data sets (corpora with millions of words), even though data sets several orders of magnitude larger are available, at least for some NLP tasks. Banko and Brill (2001a, 2001b) experiment with context-sensitive spelling correction, a task for which large amounts of data can be obtained straightforwardly, as no manual annotation is required. They demonstrate that the learning algorithms typically used for spelling correction benefit significantly from larger training sets, and that their performance shows no sign of reaching an asymptote as the size of the training set increases.</Paragraph> <Paragraph position="1"> Arguably, the largest data set that is available for NLP is the Web, which currently consists of at least 3,033 million pages.</Paragraph> <Paragraph position="2"> Data retrieved from the Web therefore provide enormous potential for training NLP algorithms, if Banko and Brill's (2001a, 2001b) findings for spelling corrections generalize; potential applications include tasks that involve word n-grams and simple surface syntax. There is a small body of existing research that tries to harness the potential of the Web for NLP. Grefenstette and Nioche (2000) and Jones and Ghani (2000) use the Web to generate corpora for languages for which electronic resources are scarce, and Resnik (1999) describes a method for mining the Web in order to obtain bilingual texts. 
Mihalcea and Moldovan (1999) and Agirre and Martinez (2000) use the Web for word sense disambiguation, Volk (2001) proposes a method for resolving PP attachment ambiguities based on Web data, Markert, Nissim, and Modjeska (2003) use the Web for the resolution of nominal anaphora,</Paragraph> <Paragraph position="3"> and Zhu and Rosenfeld (2001) use Web-based n-gram counts to improve language modeling.</Paragraph> <Paragraph position="4"> A particularly interesting application is proposed by Grefenstette (1998), who uses the Web for example-based machine translation. His task is to translate compounds from French into English, with corpus evidence serving as a filter for candidate translations. An example is the French compound groupe de travail. There are five translations of groupe and three translations of travail (in the dictionary that Grefenstette [1998] is using), resulting in 15 possible candidate translations. Only one of them, namely, work group, has a high corpus frequency, which makes it likely that this is the correct translation into English. Grefenstette (1998) observes that this approach suffers from an acute data sparseness problem if the counts are obtained from a conventional corpus. However, as Grefenstette (1998) demonstrates, this problem can be overcome by obtaining counts through Web searches instead of relying on a corpus. 
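Grefenstette's filtering strategy amounts to a "generate candidates, rank by frequency" loop. The sketch below illustrates it under stated assumptions: the translation lists and the frequency table are invented stand-ins (the actual dictionary entries and counts are not given here), and in practice `count` would query a corpus or a Web search engine rather than a Python dict.

```python
from itertools import product

def best_translation(translation_lists, count):
    """Generate all word-for-word candidate translations of a compound
    and return the one with the highest frequency under `count`."""
    candidates = [" ".join(words) for words in product(*translation_lists)]
    return max(candidates, key=count)

# 'groupe de travail': hypothetical dictionary entries; English compounds
# are head-final, so the translations of 'travail' (the modifier) come first.
travail = ["work", "labor", "labour"]
groupe = ["group", "grouping", "cluster", "collective", "concern"]

# Toy frequency table standing in for corpus or Web counts:
# only one of the 15 candidates is frequent.
toy_counts = {"work group": 1420}

print(best_translation([travail, groupe],
                       lambda phrase: toy_counts.get(phrase, 0)))
# prints "work group"
```

The 3 x 5 translation lists yield exactly the 15 candidates mentioned in the text; the single high-frequency candidate wins the ranking.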
Grefenstette (1998) therefore effectively uses the Web as a way of obtaining counts for compounds that are sparse in a given corpus.</Paragraph> <Paragraph position="5"> Although this is an important initial result, it raises the question of the generality of the proposed approach to overcoming data sparseness. It remains to be shown that Web counts are generally useful for approximating data that are sparse or unseen in a given corpus. It seems possible, for instance, that Grefenstette's (1998) results are limited to his particular task (filtering potential translations) or to his particular linguistic phenomenon (noun-noun compounds). Another potential problem is that Web counts are far noisier than counts obtained from a well-edited, carefully balanced corpus. The effect of this noise on the usefulness of the Web counts is largely unexplored.</Paragraph> <Paragraph position="6"> Zhu and Rosenfeld (2001) use Web-based n-gram counts for language modeling.</Paragraph> <Paragraph position="7"> They obtain a standard language model from a 103-million-word corpus and employ Web-based counts to interpolate unreliable trigram estimates. They compare their interpolated model against a baseline trigram language model (without interpolation) and show that the interpolated model yields an absolute reduction in word error rate of 0.93% over the baseline. Zhu and Rosenfeld's (2001) results demonstrate that the Web can be a source of data for language modeling. It is not clear, however, whether their result carries over to tasks that employ linguistically meaningful word sequences (e.g., head-modifier pairs or predicate-argument tuples) rather than simply adjacent words. Furthermore, Zhu and Rosenfeld (2001) do not undertake any studies that evaluate Web frequencies directly (i.e., without a task such as language modeling). 
This could be done, for instance, by comparing Web frequencies to corpus frequencies, or to frequencies re-created by smoothing techniques.</Paragraph> <Paragraph position="8"> The aim of the present article is to generalize Grefenstette's (1998) and Zhu and Rosenfeld's (2001) findings by testing the hypothesis that the Web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. Instead of having a particular task in mind (which would introduce a sampling bias), we rely on sets of bigrams that are randomly selected from a corpus. We use a Web-based approach for bigrams that encode meaningful syntactic relations and obtain Web frequencies not only for noun-noun bigrams, but also for adjective-noun and verb-object bigrams. We thus explore whether this approach generalizes to different predicate-argument combinations. We evaluate our Web counts in four ways: (a) comparison with actual corpus frequencies from two different corpora, (b) comparison with human plausibility judgments, (c) comparison with frequencies re-created using class-based smoothing, and (d) performance in a pseudodisambiguation task on data sets from the literature.</Paragraph> <Paragraph position="9"> 2. Obtaining Frequencies from the Web</Paragraph> <Section position="1" start_page="461" end_page="462" type="sub_section"> <SectionTitle> 2.1 Sampling Bigrams from the BNC </SectionTitle> <Paragraph position="0"> The data sets used in the present experiment were obtained from the British National Corpus (BNC) (see Burnard [1995]). The BNC is a large, synchronic corpus, consisting of 90 million words of text and 10 million words of speech. The BNC is a balanced corpus (i.e., it was compiled so as to represent a wide range of present-day British English). The written part includes samples from newspapers, magazines, books (both academic and fiction), letters, and school and university essays, among other kinds of text. 
The spoken part consists of spontaneous conversations, recorded from volunteers balanced by age, region, and social class. Other samples of spoken language are also included, ranging from business or government meetings to radio shows and phone-ins. The corpus represents many different styles and varieties and is not limited to any particular subject field, genre, or register.</Paragraph> <Paragraph position="1"> For the present study, the BNC was used to extract data for three types of predicate-argument relations. The first type is adjective-noun bigrams, in which we assume that the noun is the predicate that takes the adjective as its argument.</Paragraph> <Paragraph position="2"> The second predicate-argument type we investigated is noun-noun compounds. For these, we assume that the rightmost noun is the predicate that selects the leftmost noun as its argument (as compound nouns are generally right-headed in English). Third, we included verb-object bigrams, in which the verb is the predicate that selects the object as its argument. We considered only direct NP objects; the bigram consists of the verb and the head noun of the object. For each of the three predicate-argument relations, we gathered two data sets, one containing seen bigrams (i.e., bigrams that occur in the BNC) and one with unseen bigrams (i.e., bigrams that do not occur in the BNC).</Paragraph> <Paragraph position="3"> For the seen adjective-noun bigrams, we used the data of Lapata, McDonald, and Keller (1999), who compiled a set of 90 bigrams as follows. First, 30 adjectives were randomly chosen from a part-of-speech-tagged and lemmatized version of the BNC so that each adjective had exactly two senses according to WordNet (Miller et al. 1990) and was unambiguously tagged as &quot;adjective&quot; 98.6% of the time. 
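The two selection criteria just named (exactly two WordNet senses, and an adjective-tag ratio of at least 98.6%) can be sketched as a simple filter. The per-lemma statistics and the helper below are illustrative assumptions, not the authors' code; real sense counts and tag ratios would come from WordNet and the tagged BNC.

```python
import random

def select_adjectives(lemma_stats, n=30, seed=0):
    """Keep lemmas with exactly two WordNet senses that are tagged
    'adjective' at least 98.6% of the time, then sample n of them.
    lemma_stats maps lemma to (wordnet_senses, adjective_tag_ratio)."""
    eligible = [lemma for lemma, (senses, adj_ratio) in lemma_stats.items()
                if senses == 2 and adj_ratio >= 0.986]
    return random.Random(seed).sample(eligible, min(n, len(eligible)))

# Hypothetical statistics: lemma -> (WordNet senses, adjective-tag ratio).
stats = {"hungry": (2, 0.999), "light": (14, 0.43),
         "proud": (2, 0.991), "fast": (2, 0.61)}
print(select_adjectives(stats, n=2))
```

Here "light" fails both criteria and "fast" fails the tag-ratio criterion, so only the remaining two lemmas are eligible for sampling.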
Lapata, McDonald, and Keller used the part-of-speech-tagged version that is made available with the BNC and was tagged using CLAWS4 (Leech, Garside, and Bryant 1994), a probabilistic part-of-speech tagger, with an error rate ranging from 3% to 4%. The lemmatized version of the corpus was obtained using Karp et al.'s (1992) morphological analyzer.</Paragraph> <Paragraph position="4"> The 30 adjectives ranged in BNC frequency from 1.9 to 49.1 per million words; that is, they covered the whole range from fairly infrequent to highly frequent items. Gsearch (Corley et al. 2001), a chart parser that detects syntactic patterns in a tagged corpus by exploiting a user-specified context-free grammar and a syntactic query, was used to extract all nouns occurring in a head-modifier relationship with one of the 30 adjectives. Examples of the syntactic patterns the parser identified are given in Table 1. In the case of adjectives modifying compound nouns, only sequences of two nouns were included, and the rightmost-occurring noun was considered the head.</Paragraph> <Paragraph position="5"> Bigrams involving proper nouns or low-frequency nouns (less than 10 per million words) were discarded. This was necessary because the bigrams were used in experiments involving native speakers (see Section 3.2), and we wanted to reduce the risk of including words unfamiliar to the experimental subjects. For each adjective, the set of bigrams was divided into three frequency bands based on an equal division of the range of log-transformed co-occurrence frequencies. Then one bigram was chosen at random from each band. (The assumption that the noun is the predicate in an adjective-noun bigram is disputed in the theoretical linguistics literature: Pollard and Sag [1994], for instance, present an analysis in which there is mutual selection between the noun and the adjective.) 
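The banding-and-sampling step can be sketched as follows. The bigram counts are invented for illustration; the only assumption carried over from the text is that the bands arise from an equal split of the range of log-transformed frequencies.

```python
import math
import random

def sample_one_per_band(bigram_counts, n_bands=3, seed=0):
    """Split bigrams into n_bands by an equal division of the range of
    log-transformed co-occurrence frequencies, then draw one bigram at
    random from each non-empty band."""
    log_freq = {bg: math.log(c) for bg, c in bigram_counts.items()}
    lo, hi = min(log_freq.values()), max(log_freq.values())
    width = (hi - lo) / n_bands or 1.0  # guard against all-equal counts
    bands = [[] for _ in range(n_bands)]
    for bg, lf in log_freq.items():
        idx = min(int((lf - lo) / width), n_bands - 1)  # clip top endpoint
        bands[idx].append(bg)
    rng = random.Random(seed)
    return [rng.choice(band) for band in bands if band]

# Invented counts spanning low, medium, and high frequency.
counts = {"hungry eye": 1, "guilty verdict": 8, "naval officer": 150}
print(sample_one_per_band(counts))
# prints ['hungry eye', 'guilty verdict', 'naval officer']
```

With one bigram per band, the sample is guaranteed to cover the full frequency range rather than clustering at the frequent end.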
This procedure ensures that the whole range of frequencies is represented in our sample.</Paragraph> <Paragraph position="6"> Lapata, Keller, and McDonald (2001) compiled a set of 90 unseen adjective-noun bigrams using the same 30 adjectives. For each adjective, Gsearch was used to compile a list of all nouns that did not co-occur in a head-modifier relationship with the adjective. Again, proper nouns and low-frequency nouns were discarded from this list. Then each adjective was paired with three randomly chosen nouns from its list of non-co-occurring nouns. Examples of seen and unseen adjective-noun bigrams are shown in Table 2.</Paragraph> <Paragraph position="7"> For the present study, we applied the procedure used by Lapata, McDonald, and Keller (1999) and Lapata, Keller, and McDonald (2001) to noun-noun bigrams and to verb-object bigrams, creating a set of 90 seen and 90 unseen bigrams for each type of predicate-argument relationship. More specifically, 30 nouns and 30 verbs were chosen according to the same criteria proposed for the adjective study (i.e., minimal sense ambiguity and unambiguous part of speech). All nouns modifying one of the 30 nouns were extracted from the BNC using a heuristic from Lauer (1995) that looks for consecutive pairs of nouns that are neither preceded nor succeeded by another noun. Table 2: Example stimuli for seen and unseen adjective-noun, noun-noun, and verb-object bigrams (with log-transformed BNC counts).</Paragraph> </Section> </Section> </Paper>