<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1030"> <Title>Using the Web to Overcome Data Sparseness</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Obtaining Frequencies from the Web </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 Sampling Bigrams </SectionTitle> <Paragraph position="0"> Two types of adjective-noun bigrams were used in the present study: seen bigrams, i.e., bigrams that occur in a given corpus, and unseen bigrams, i.e., bigrams that fail to occur in the corpus. For the seen adjective-noun bigrams, we used the data of Lapata et al. (1999), who compiled a set of 90 bigrams as follows. First, 30 adjectives were randomly chosen from a lemmatized version of the BNC so that each adjective had exactly two senses according to WordNet (Miller et al., 1990) and was unambiguously tagged as &quot;adjective&quot; 98.6% of the time. The 30 adjectives ranged in BNC frequency from 1.9 to 49.1 per million. Gsearch (Corley et al., 2001), a chart parser which detects syntactic patterns in a tagged corpus by exploiting a user-specified context-free grammar and a syntactic query, was used to extract all nouns occurring in a head-modifier relationship with one of the 30 adjectives. Bigrams involving proper nouns or low-frequency nouns (less than 10 per million) were discarded. For each adjective, the set of bigrams was divided into three frequency bands based on an equal division of the range of log-transformed co-occurrence frequencies. Then one bigram was chosen at random from each band.</Paragraph> <Paragraph position="1"> Lapata et al. (2001) compiled a set of 90 unseen adjective-noun bigrams using the same 30 adjectives. For each adjective, the Gsearch chunker was used to compile a list of all nouns that failed to co-occur in a head-modifier relationship with the adjective. 
Proper nouns and low-frequency nouns were discarded from this list. Then each adjective was paired with three randomly chosen nouns from its list of non-co-occurring nouns.</Paragraph> <Paragraph position="2"> For the present study, we applied the procedure used by Lapata et al. (1999) and Lapata et al. (2001) to noun-noun bigrams and to verb-object bigrams, creating a set of 90 seen and 90 unseen bigrams for each type of predicate-argument relationship. More specifically, 30 nouns and 30 verbs were chosen according to the same criteria proposed for the adjective study (i.e., minimal sense ambiguity and unambiguous part of speech). All nouns modifying one of the 30 nouns were extracted from the BNC using a heuristic which looks for consecutive pairs of nouns that are neither preceded nor succeeded by another noun (Lauer, 1995). Verb-object bigrams for the 30 preselected verbs were obtained from the BNC using Cass (Abney, 1996), a robust chunk parser designed for the shallow analysis of noisy text. The parser's output was post-processed to remove bracketing errors and errors in identifying chunk categories that could potentially result in bigrams whose members do not stand in a verb-argument relationship (see Lapata (2001) for details on the filtering process). Only nominal heads were retained from the objects returned by the parser. As in the adjective study, noun-noun bigrams and verb-object bigrams with proper nouns or low-frequency nouns (less than 10 per million) were discarded. The sets of noun-noun and verb-object bigrams were divided into three frequency bands and one bigram was chosen at random from each band.</Paragraph> <Paragraph position="3"> The procedure described by Lapata et al. (2001) was followed for creating sets of unseen noun-noun and verb-object bigrams: for each noun or verb, we compiled a list of all nouns with which it failed to co-occur in a noun-noun or verb-object bigram in the BNC. 
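The three-band division and per-band random selection used for the seen bigrams can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the input format, and the fixed seed are all assumptions.

```python
import math
import random

def sample_by_band(bigram_counts, seed=0):
    """Divide bigrams into three bands by an equal split of the
    log-transformed frequency range, then draw one bigram per band.
    bigram_counts: dict mapping a bigram (tuple) to its corpus frequency."""
    rng = random.Random(seed)
    logs = {bg: math.log(c) for bg, c in bigram_counts.items()}
    lo, hi = min(logs.values()), max(logs.values())
    width = (hi - lo) / 3.0
    bands = [[], [], []]
    for bg, lf in logs.items():
        # index 0 = low band, 2 = high band; the maximum is clamped into
        # band 2, and a degenerate range (width 0) falls into band 0
        idx = min(int((lf - lo) / width) if width > 0 else 0, 2)
        bands[idx].append(bg)
    return [rng.choice(band) for band in bands if band]
```

Note that the bands split the *range* of log frequencies into three equal-width intervals, as the text describes; they are not equal-sized tertiles of the bigram list.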
Again, Lauer's (1995) heuristic and Abney's (1996) partial parser were used to identify bigrams, and proper nouns and low-frequency nouns were excluded. For each noun and verb, three bigrams were randomly selected from the set of their non-co-occurring nouns.</Paragraph> <Paragraph position="4"> Table 1 lists examples for the seen and unseen noun-noun and verb-object bigrams generated by this procedure.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Obtaining Web Counts </SectionTitle> <Paragraph position="0"> Web counts for bigrams were obtained using a simple heuristic based on queries to the search engines Altavista and Google. All search terms took into account the inflectional morphology of nouns and verbs.</Paragraph> <Paragraph position="1"> The search terms for verb-object bigrams matched not only cases in which the object was directly adjacent to the verb (e.g., fulfill obligation), but also cases where there was an intervening determiner (e.g., fulfill the/an obligation). The following search terms were used for adjective-noun, noun-noun, and verb-object bigrams, respectively: (1) &quot;A N&quot;, where A is the adjective and N is the singular or plural form of the noun.</Paragraph> <Paragraph position="2"> (2) &quot;N1 N2&quot;, where N1 is the singular form of the first noun and N2 is the singular or plural form of the second noun.</Paragraph> <Paragraph position="3"> (3) &quot;V Det N&quot;, where V is the infinitive, singular present, plural present, past, perfect, or gerund form of the verb, Det is the determiner the, a, or the empty string, and N is the singular or plural form of the noun.</Paragraph> <Paragraph position="4"> Note that all searches were for exact matches, which means that the search terms were required to be directly adjacent on the matching page. This is encoded using quotation marks to enclose the search term. 
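The query schemes (1)-(3) can be expanded mechanically. The sketch below is hypothetical: the inflected forms are supplied by the caller rather than derived from a morphological lexicon, and only the verb-object scheme inserts the determiners the, a, or the empty string.

```python
def adj_noun_queries(adj, noun_sg, noun_pl):
    # Scheme (1): "A N" with N singular or plural
    return ['"%s %s"' % (adj, n) for n in (noun_sg, noun_pl)]

def noun_noun_queries(n1_sg, n2_sg, n2_pl):
    # Scheme (2): "N1 N2" with N1 singular, N2 singular or plural
    return ['"%s %s"' % (n1_sg, n2) for n2 in (n2_sg, n2_pl)]

def verb_obj_queries(verb_forms, noun_sg, noun_pl):
    # Scheme (3): "V Det N" for every verb form, with Det drawn from
    # {the, a, empty string} and N singular or plural
    queries = []
    for v in verb_forms:
        for det in ("the ", "a ", ""):
            for n in (noun_sg, noun_pl):
                queries.append('"%s %s%s"' % (v, det, n))
    return [q.lower() for q in queries]  # all search terms in lower case
```

The surrounding quotation marks mark each term as an exact-phrase match, mirroring the encoding described in the text.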
All our search terms were in lower case.</Paragraph> <Paragraph position="5"> For Google, the resulting bigram frequencies were obtained by adding up the number of pages that matched the expanded forms of the search terms in (1), (2), and (3). Altavista returns not only the number of matches, but also the number of words that match the search term. We used this count, as it takes multiple matches per page into account, and is thus likely to produce more accurate frequencies. The process of obtaining bigram frequencies from the web can be automated straightforwardly using a script that generates all the search terms for a given bigram (from (1)-(3)), issues an Altavista or Google query for each of the search terms, and then adds up the resulting number of matches for each bigram. We applied this process to all the bigrams in our data set, covering seen and unseen adjective-noun, noun-noun, and verb-object bigrams, i.e., 540 bigrams in total.</Paragraph> <Paragraph position="6"> A small number of bigrams resulted in zero counts, i.e., they failed to yield any matches in the web search. Table 2 lists the number of zero bigrams for both search engines. Note that Google returned fewer zeros than Altavista, which presumably indicates that it indexes a larger proportion of the web. We adjusted the zero counts by setting them to one. This was necessary as all further analyses were carried out on log-transformed frequencies.</Paragraph> <Paragraph position="7"> Table 3 lists the descriptive statistics for the bigram counts we obtained using Altavista and Google.</Paragraph> <Paragraph position="8"> From these data, we computed the average factor by which the web counts are larger than the BNC counts. The results are given in Table 4 and indicate that the Altavista counts are between 331 and 467 times larger than the BNC counts, while the Google counts are between 759 and 977 times larger than the BNC counts. 
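The per-bigram aggregation, the zero-to-one adjustment, and the corpus-scaling estimate can be sketched as follows. This is a minimal sketch: `fetch` is a hypothetical stand-in for a search-engine query, since neither engine's interface is given here, and the function names are assumptions.

```python
import math

def bigram_web_count(queries, fetch):
    """Sum the match counts over all expanded search terms for one bigram.
    fetch: hypothetical callable returning the number of matches for one
    exact-phrase search term."""
    total = sum(fetch(q) for q in queries)
    adjusted = max(total, 1)   # zero counts are set to one ...
    return math.log(adjusted)  # ... so log-transformed analyses are defined

def estimate_web_size(avg_factor, corpus_words=100_000_000):
    """Scale the corpus size (BNC: 100 million words) by the average
    factor between web counts and corpus counts."""
    return avg_factor * corpus_words
```

For example, scaling the BNC's 100 million words by a factor of 467 yields the 46.7 billion-word upper estimate for Altavista reported in the text.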
As we know the size of the BNC (100 million words), we can use these figures to estimate the number of words on the web: between 33.1 and 46.7 billion words for Altavista, and between 75.9 and 97.7 billion words for Google. These estimates are in the same order of magnitude as Grefenstette and Nioche's (2000) estimate that 48.1 billion words of English are available on the web (based on Altavista counts in February 2000).</Paragraph> <Paragraph position="9"> Table 1: Example stimuli for the seen and unseen noun-noun and verb-object bigrams (with log-transformed BNC frequencies).

noun-noun bigrams
predicate   high              medium          low         unseen
directory   process 1.14      user .95        gala 0      collection, clause, coat
broadcast   television 1.53   satellite .95   edition 0   chain, care, vote
membrane    plasma 1.78       nylon 1.20      unit .60    fund, theology, minute

verb-object bigrams
predicate   high              medium        low            unseen
fulfill     obligation 3.87   goal 2.20     scripture .69  participant, muscle, grade
intensify   problem 1.79      effect 1.10   alarm 0        score, quota, chest
choose      name 3.74         law 1.61      series 1.10    lift, bride, listener

(Table 4: average factor by which the web counts are larger than the BNC counts, seen bigrams.)</Paragraph> </Section> </Section> </Paper>