<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1016">
  <Title>The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Previous work demonstrated that web counts can be used to approximate bigram frequencies, and thus should be useful for a wide variety of NLP tasks. So far, only two generation tasks (candidate selection for machine translation and confusion-set disambiguation) have been tested using web-scale data sets. The present paper investigates whether these results generalize to tasks covering both syntax and semantics, both generation and analysis, and a larger range of n-grams. For the majority of tasks, we find that simple, unsupervised models perform better when n-gram frequencies are obtained from the web rather than from a large corpus.</Paragraph>
    <Paragraph position="1"> However, in most cases, web-based models fail to outperform more sophisticated state-of-the-art models trained on small corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard models.</Paragraph>
  </Section>
</Paper>