<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-3005">
<Title>Using the Web to Obtain Frequencies for Unseen Bigrams</Title>
<Section position="5" start_page="481" end_page="482" type="concl">
<SectionTitle> 4. Conclusions </SectionTitle>
<Paragraph position="0"> This article explored a novel approach to overcoming data sparseness. If a bigram is unseen in a given corpus, conventional approaches re-create its frequency using techniques such as back-off, linear interpolation, class-based smoothing, or distance-weighted averaging (see Dagan, Lee, and Pereira [1999] and Lee [1999] for overviews).</Paragraph>
<Paragraph position="1"> The approach proposed here does not re-create the missing counts but instead retrieves them from a corpus that is much larger (but also much noisier) than any existing corpus: it launches queries to a search engine in order to determine how often the bigram occurs on the Web.</Paragraph>
<Paragraph position="2"> We systematically investigated the validity of this approach by using it to obtain frequencies for predicate-argument bigrams (adjective-noun, noun-noun, and verb-object bigrams). We first applied the approach to seen bigrams randomly sampled from the BNC and found that the counts obtained from the Web are highly correlated with the counts obtained from the BNC. We then obtained bigram counts from NANTC, a corpus that is substantially larger than the BNC, and again found that Web counts are highly correlated with corpus counts. This indicates that Web queries can generate frequencies comparable not only to those obtained from a balanced, carefully edited corpus such as the BNC, but also to those obtained from a large news text corpus such as NANTC.</Paragraph>
<Paragraph position="3"> Second, we performed an evaluation that used the Web frequencies to predict human plausibility judgments for predicate-argument bigrams. The results show that Web counts correlate reliably with judgments for all three types of predicate-argument bigrams tested, both seen and unseen. For the seen bigrams, we showed that the Web frequencies correlate better with judged plausibility than the corpus frequencies do.</Paragraph>
<Paragraph position="4"> To substantiate the claim that Web counts can be used to overcome data sparseness, we compared bigram counts obtained from the Web with bigram counts re-created using a class-based smoothing technique (a variant of the one proposed by Resnik [1993]). We found that Web frequencies and re-created frequencies are reliably correlated, and that Web frequencies are better at predicting plausibility judgments than smoothed frequencies. This holds both for unseen bigrams and for seen bigrams that are treated as unseen by omitting them from the training corpus.</Paragraph>
<Paragraph position="5"> Finally, we tested the performance of our frequencies in a standard pseudodisambiguation task. We applied our method to three data sets from the literature. The results show that Web counts outperform counts re-created using a number of class-based smoothing techniques. However, counts re-created using an EM-based smoothing approach yielded better pseudodisambiguation performance than Web counts.</Paragraph>
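To make the heuristic described above concrete, the following is a minimal sketch (not the authors' implementation) of how a Web count for a bigram could be obtained and applied to a single pseudodisambiguation item. The function names (web_count, pseudodisambiguate), the placeholder query handling, and the toy word pairs are all hypothetical illustrations; the actual search-engine interface and query syntax used in the experiments are not reproduced here.

```python
# Minimal sketch (not the authors' implementation): obtaining a Web count
# for a bigram and using it in one pseudodisambiguation item.
# `web_count` is a hypothetical placeholder; a real version would send a
# quoted-phrase query (e.g. "hungry dog") to a search-engine API and parse
# the reported number of matching pages from the response.

def web_count(word1: str, word2: str) -> int:
    """Return the number of Web hits for the quoted bigram "word1 word2".

    Placeholder so the sketch stays self-contained: it always returns 0.
    In practice this would submit the query f'"{word1} {word2}"' to a
    search engine and read off the hit count.
    """
    query = f'"{word1} {word2}"'  # quoted phrase, as a search engine expects
    _ = query                     # unused in this stub
    return 0


def pseudodisambiguate(verb: str, attested_obj: str, confounder_obj: str) -> bool:
    """Decide which object `verb` is more plausible with, by Web frequency.

    Returns True if the attested (seen) object receives at least as high a
    Web count as the confounder, i.e. the heuristic makes the right choice.
    """
    return web_count(verb, attested_obj) >= web_count(verb, confounder_obj)


if __name__ == "__main__":
    # Toy item in the style of a pseudodisambiguation test: did "drink"
    # originally occur with "coffee" or with the randomly paired "carburetor"?
    print(pseudodisambiguate("drink", "coffee", "carburetor"))
```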
<Paragraph position="6"> To summarize, we have proposed a simple heuristic method for obtaining bigram counts from the Web. Using four different types of evaluation, we demonstrated that this simple heuristic method is sufficient to obtain valid frequency estimates. It seems that the large amount of data available outweighs the problems associated with using the Web as a corpus (such as the fact that it is noisy and unbalanced).</Paragraph>
<Paragraph position="7"> A number of questions arise for future research: (1) Are Web frequencies suitable for probabilistic modeling, in particular since Web counts are not perfectly normalized, as Zhu and Rosenfeld (2001) have shown? (2) How can existing smoothing methods benefit from Web counts? (3) How do the results reported in this article carry over to languages other than English (for which a much smaller amount of Web data is available)? (4) What is the effect of the noise introduced by our heuristic approach? The last question could be assessed by reproducing our results using a snapshot of the Web, from which argument relations can be extracted more accurately using POS tagging and chunking techniques.</Paragraph>
<Paragraph position="8"> Finally, it will be crucial to test the usefulness of Web-based frequencies for realistic NLP tasks. Preliminary results are reported by Lapata and Keller (2003), who use Web counts successfully for a range of NLP tasks, including candidate selection for machine translation, context-sensitive spelling correction, bracketing and interpretation of compounds, adjective ordering, and PP attachment.</Paragraph>
</Section>
</Paper>