<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1030">
  <Title>Web Text Corpus for Natural Language Processing</Title>
  <Section position="9" start_page="238" end_page="239" type="evalu">
    <SectionTitle>
6.3 Results
</SectionTitle>
    <Paragraph position="0"> We used the same 300 evaluation headwords as Curran (2004) and extracted the top 200 synonyms for each headword. The synonyms were extracted from two corpora for comparison: a 2 billion word sample of our Web Corpus and the 2 billion words in the Gigaword Corpus. Table 5 shows the average InvR scores over the 300 headwords for the two corpora, one of web text and the other of newspaper text. The InvR values differ by a negligible 0.05 (out of a maximum of 5.92).</Paragraph>
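    As a rough illustration of the InvR measure reported above, the following minimal sketch assumes InvR is the sum of the inverse ranks of the extracted synonyms that also appear in the gold standard, following Curran (2004). The function name inv_r and the toy word lists are hypothetical and are not taken from the paper's tables.

```python
def inv_r(extracted, gold):
    """Sum of inverse ranks (1-based) of extracted synonyms found in the gold set.

    Assumption: this follows the InvR definition of Curran (2004); the paper's
    own implementation is not shown in this section.
    """
    gold_set = set(gold)
    return sum(
        1.0 / rank
        for rank, synonym in enumerate(extracted, start=1)
        if synonym in gold_set
    )


# Toy data, invented for illustration only.
extracted = ["house", "page", "residence", "site", "dwelling"]
gold = {"house", "residence", "dwelling"}

# Matches occur at ranks 1, 3 and 5, so the score is 1 + 1/3 + 1/5.
print(round(inv_r(extracted, gold), 3))  # prints 1.533
```

    Because each match is weighted by the inverse of its rank, a list whose correct synonyms appear early scores higher than a list with the same number of matches further down.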
    <Section position="1" start_page="238" end_page="239" type="sub_section">
      <SectionTitle>
6.4 Analysis
</SectionTitle>
      <Paragraph position="0"> However, on a per-word basis one corpus can significantly outperform the other. Table 6 ranks the 300 headwords by difference in the InvR score. While much better results were extracted for words like home from the Gigaword, much better results were extracted for words like chain from the Web Corpus. Table 7 shows the top 50 synonyms extracted for the headword home from the Gigaword and the Web Corpus. While a similar number of correct synonyms were extracted from both corpora, the Gigaword matches were higher in the extracted list and received a much higher InvR score. In the list extracted from the Web Corpus, web-related collocations such as home page and search home appear. Table 8 shows the top 50 synonyms extracted for the headword chain from both corpora. While there are only a total of 9 matches from the Gigaword Corpus, there are 53 matches from the Web Corpus. A closer examination shows that the synonyms extracted from the Gigaword belong only to one sense of the word chain, as in chain stores.</Paragraph>
      <Paragraph position="1"> The gold standard list and the Web Corpus results both contain the necklace sense of the word chain.</Paragraph>
      <Paragraph position="2"> The Gigaword results show a skew towards the business sense of the word chain, while the Web Corpus covers both senses of the word.</Paragraph>
      <Paragraph position="3"> While individual words can achieve better results in one corpus than in the other, the aggregate results of synonym extraction for the 300 headwords are the same. For this task, the Web Corpus can replace the Gigaword without affecting the overall result. However, as some words perform better under different corpora, an aggregate of the Web Corpus and the Gigaword may produce the best result.</Paragraph>
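      The per-headword comparison in this analysis can be sketched as follows. This is only an illustration: the function name rank_by_invr_difference is hypothetical, the scores are invented toy values rather than the paper's results, and sorting by the signed difference (Web Corpus minus Gigaword) is an assumption about how such a ranking might be produced.

```python
def rank_by_invr_difference(web_scores, giga_scores):
    """Rank headwords by InvR difference (Web Corpus minus Gigaword).

    Returns (headword, difference) pairs sorted from the headword that most
    favours the Gigaword to the one that most favours the Web Corpus.
    Only headwords scored on both corpora are compared.
    """
    shared = set(web_scores).intersection(giga_scores)
    diffs = [(word, web_scores[word] - giga_scores[word]) for word in shared]
    return sorted(diffs, key=lambda pair: pair[1])


# Invented scores, for illustration only.
web = {"home": 0.5, "chain": 2.0, "car": 1.0}
giga = {"home": 2.5, "chain": 0.5, "car": 1.0}

for word, diff in rank_by_invr_difference(web, giga):
    print(word, diff)
```

      In this toy data the large negative and positive differences partly cancel when averaged, which mirrors how per-word gaps can coexist with near-identical aggregate scores.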
    </Section>
  </Section>
</Paper>