<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1030">
  <Title>Web Text Corpus for Natural Language Processing</Title>
  <Section position="3" start_page="233" end_page="233" type="intro">
    <SectionTitle>
2 Existing Web Corpora
</SectionTitle>
    <Paragraph position="0"> The web has become an indispensable resource with a vast amount of information available. Many NLP tasks have successfully utilised web data, including machine translation (Grefenstette, 1999), prepositional phrase attachment (Volk, 2001), and other-anaphora resolution (Modjeska et al., 2003).</Paragraph>
    <Section position="1" start_page="233" end_page="233" type="sub_section">
      <SectionTitle>
2.1 Search Engine Hit Counts
</SectionTitle>
      <Paragraph position="0"> Most NLP systems that have used the web access it via search engines such as Altavista and Google.</Paragraph>
      <Paragraph position="1"> N-gram counts are approximated by literal queries &amp;quot;w1 ... wn&amp;quot;. Relations between two words are approximated in Altavista by the NEAR operator (which locates word pairs within 10 tokens of each other). The overall coverage of the queries can be expanded by morphological expansion of the search terms.</Paragraph>
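The NEAR-style approximation described above can be sketched as follows. This is an illustrative reconstruction, not Altavista's actual implementation: two words are counted as co-occurring whenever they fall within 10 tokens of each other in a punctuation-stripped token stream.

```python
# Minimal sketch of a NEAR-operator count (illustrative, not
# Altavista's implementation): count pairs of positions at which
# w1 and w2 lie within `window` tokens of each other.

def near_count(tokens, w1, w2, window=10):
    """Count position pairs where w1 and w2 are within `window` tokens."""
    positions1 = [i for i, t in enumerate(tokens) if t == w1]
    positions2 = [i for i, t in enumerate(tokens) if t == w2]
    return sum(1 for i in positions1 for j in positions2
               if 0 < abs(i - j) <= window)

# Because punctuation (and hence sentence boundaries) has been
# discarded, words from adjacent sentences can still match.
text = "the cat sat on the mat the dog barked at the cat"
print(near_count(text.split(), "cat", "dog"))  # prints 2
```

Note that both occurrences of "cat" match "dog" here, even though in the original text they might have belonged to different sentences, which is exactly the weakness discussed below.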
      <Paragraph position="2"> Keller and Lapata (2003) demonstrated a high degree of correlation between n-gram estimates from search engine hit counts and n-gram frequencies obtained from traditional corpora such as the British National Corpus (BNC). The hit counts also had a higher correlation to human plausibility judgements than the BNC counts.</Paragraph>
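The kind of comparison Keller and Lapata report can be illustrated with a log-scale Pearson correlation between the two sets of counts. The numbers below are toy values chosen for the sketch, not their data:

```python
# Illustrative only: correlating (log) web hit counts against (log)
# corpus frequencies for the same n-grams. Toy numbers, not real data.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

web_counts = [120000, 45000, 3000, 800]   # hypothetical hit counts
bnc_counts = [310, 150, 12, 4]            # hypothetical BNC frequencies
r = pearson([math.log(x) for x in web_counts],
            [math.log(y) for y in bnc_counts])
print(r)
```

Log-transforming the counts before correlating is standard for frequency data, since raw counts span several orders of magnitude.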
      <Paragraph position="3"> The web count method contrasts with traditional methods where the frequencies are obtained from a corpus of locally available text. While the corpus is much smaller than the web, an accurate count and further text processing is possible because all of the contexts are readily accessible.</Paragraph>
      <Paragraph position="4"> The web count method obtains only an approximate number of matches on the web, with no control over which pages are indexed by the search engines and with no further analysis possible.</Paragraph>
      <Paragraph position="5"> There are a number of limitations in the search engine approximations. As many search engines discard punctuation information (especially when using the NEAR operator), words considered adjacent to each other could actually lie in different sentences or paragraphs. For example in Volk (2001), the system assumes that a preposition attaches to a noun simply when the noun appears within a fixed context window of the preposition.</Paragraph>
      <Paragraph position="6"> The preposition and noun could in fact be related differently or be in different sentences altogether.</Paragraph>
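A hedged sketch of the fixed-window heuristic just described (the function name and window size are illustrative, not Volk's exact setup) makes the failure mode concrete: with sentence boundaries removed, a preposition can be paired with a noun from an entirely different sentence.

```python
# Sketch of a fixed-window attachment heuristic (illustrative, not
# Volk's exact system): assume the preposition attaches to the noun
# whenever the noun occurs within k tokens of it.

def window_attaches(tokens, prep, noun, k=5):
    """True if any occurrence of `noun` is within k tokens of `prep`."""
    preps = [i for i, t in enumerate(tokens) if t == prep]
    nouns = [i for i, t in enumerate(tokens) if t == noun]
    return any(0 < abs(i - j) <= k for i in preps for j in nouns)

# Two sentences with punctuation stripped: "with" (second sentence)
# is wrongly judged to attach to "telescope" (first sentence).
tokens = "she lost the telescope he left with his friend".split()
print(window_attaches(tokens, "with", "telescope"))  # prints True
```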
      <Paragraph position="7"> The speed of querying search engines is another concern. Keller and Lapata (2003) needed to obtain the frequency counts of 26,271 test adjective pairs from the web and from the BNC for the task of prenominal adjective ordering. While extracting this information from the BNC presented no difficulty, making so many queries to Altavista was too time-consuming. They had to reduce the size of the test set to obtain a result.</Paragraph>
      <Paragraph position="8"> Lapata and Keller (2005) performed a wide range of NLP tasks using web data by querying Altavista and Google. These included a variety of generation tasks (e.g. machine translation candidate selection) and analysis tasks (e.g. prepositional phrase attachment, countability detection). They showed that while web counts usually outperformed BNC counts and consistently outperformed the baseline, the best-performing system was usually a supervised method trained on annotated data. Keller and Lapata concluded that having access to linguistic information (accurate n-gram counts, POS tags, and parses) outperforms using a large amount of web data.</Paragraph>
    </Section>
    <Section position="2" start_page="233" end_page="233" type="sub_section">
      <SectionTitle>
2.2 Spidered Web Corpora
</SectionTitle>
      <Paragraph position="0"> A few projects have utilised data downloaded from the web. Ravichandran et al. (2005) used a collection of 31 million web pages to produce noun similarity lists. They found that most NLP algorithms are unable to run on web-scale data, especially those with quadratic running time. Halacsy et al. (2004) created a Hungarian corpus from the web by downloading text from the .hu domain.</Paragraph>
      <Paragraph position="1"> From an 18 million page crawl of the web, a 1 billion word corpus was created after removing duplicates and non-Hungarian text.</Paragraph>
      <Paragraph position="2"> A terabyte-sized corpus of the web was collected at the University of Waterloo in 2001. A breadth-first search from a seed set of university home pages yielded over 53 billion words, requiring 960GB of storage. Clarke et al. (2002) and Terra and Clarke (2003) used this corpus for their question answering system. They obtained increasing performance with increasing corpus size, but began reaching asymptotic behaviour in the 300-500GB range.</Paragraph>
    </Section>
  </Section>
</Paper>