<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1030">
  <Title>Web Text Corpus for Natural Language Processing</Title>
  <Section position="10" start_page="239" end_page="239" type="concl">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper, the accuracy of natural language applications trained on a 10 billion word Web Corpus is compared with other methods using search engine hit counts and corpora of printed text.</Paragraph>
    <Paragraph position="1"> In the context-sensitive spelling correction task, a simple memory-based learner trained on our Web Corpus achieved better results than methods based on search engine queries. It also rivals some of the state-of-the-art systems, exceeding the accuracy of the Unpruned Winnow method (the only other true cross-corpus experiment). In the task of thesaurus extraction, the same overall results are obtained when extracting from the Web Corpus as from a traditional corpus of printed texts.</Paragraph>
    <Paragraph position="2"> The Web Corpus contrasts with other NLP approaches that access web data through search engine queries. Although the 10 billion word Web Corpus is smaller than the number of words indexed by search engines, better results have been achieved using the smaller collection. This is due to the more accurate n-gram counts in the downloaded text. Other NLP tasks that require further analysis of the downloaded text, such as PP attachment, may benefit more from the Web Corpus.</Paragraph>
    <Paragraph position="3"> We have demonstrated that carefully collected and filtered web corpora can be as useful as newswire corpora of equivalent sizes. Using the same framework described here, it is possible to collect a much larger corpus of freely available web text than our 10 billion word corpus. As NLP algorithms tend to perform better when more data is available, we expect that state-of-the-art results for many tasks will come from exploiting web text.</Paragraph>
  </Section>
</Paper>