<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1030">
  <Title>Using the Web to Overcome Data Sparseness</Title>
  <Section position="4" start_page="1" end_page="1" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Evaluation Against Corpus Frequencies
</SectionTitle>
      <Paragraph position="0"> While the procedure for obtaining web counts described in Section 2.2 is very straightforward, it also has obvious limitations. Most importantly, it is based on bigrams formed by adjacent words, and fails to take syntactic variants into account (other than intervening determiners for verb-object bigrams). In the case of Google, there is also the problem that the counts are based on the number of matching pages, not the number of matching words. Finally, there is the problem that web data is very noisy and unbalanced compared to a carefully edited corpus like the BNC.</Paragraph>
      <Paragraph position="1"> Given these limitations, it is necessary to explore if there is a reliable relationship between web counts and BNC counts. Once this is assured, we can explore the usefulness of web counts for overcoming data sparseness. We carried out a correlation analysis to determine if there is a linear relationship between the BNC counts and Altavista and Google counts. The results of this analysis are listed in Table 5. All correlation coefficients reported in this paper refer to Pearson's r and were computed on log-transformed counts.</Paragraph>
      <Paragraph position="2"> A high correlation coefficient was obtained across the board, ranging from :675 to :822 for Altavista counts and from :737 to :849 for Google counts.</Paragraph>
      <Paragraph position="3"> This indicates that web counts approximate BNC counts for the three types of bigrams under investigation, with Google counts slightly outperforming Altavista counts. We conclude that our simple  (seen bigrams) heuristics (see (1)-(3)) are sufficient to obtain useful frequencies from the web. It seems that the large amount of data available for web counts outweighs the associated problems (noisy, unbalanced, etc.).</Paragraph>
      <Paragraph position="4"> Note that the highest coefficients were obtained for adjective-noun bigrams, which probably indicates that this type of predicate-argument relationship is least subject to syntactic variation and thus least affected by the simplifications of our search heuristics.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Task-based Evaluation
</SectionTitle>
      <Paragraph position="0"> Previous work has demonstrated that corpus counts correlate with human plausibility judgments for adjective-noun bigrams. This results holds for both seen bigrams (Lapata et al., 1999) and for unseen bigrams whose counts were recreated using smoothing techniques (Lapata et al., 2001). Based on these findings, we decided to evaluate our web counts on the task of predicting plausibility ratings. If the web counts for bigrams correlate with plausibility judgments, then this indicates that the counts are valid, in the sense of being useful for predicting intuitive plausibility.</Paragraph>
      <Paragraph position="1"> Lapata et al. (1999) and Lapata et al. (2001) collected plausibility ratings for 90 seen and 90 unseen adjective-noun bigrams (see Section 2.1) using magnitude estimation. Magnitude estimation is an experimental technique standardly used in psychophysics to measure judgments of sensory stimuli (Stevens, 1975), which Bard et al. (1996) and Cowart (1997) have applied to the elicitation of linguistic judgments. Magnitude estimation requires subjects to assign numbers to a series of linguistic stimuli in a proportional fashion. Subjects are first exposed to a modulus item, which they assign an arbitrary number. All other stimuli are rated proportional to the modulus. In the experiments conducted by Lapata et al. (1999) and Lapata et al. (2001), native speakers of English were presented with adjective-noun bigrams and were asked to rate the degree of adjective-noun fit proportional to the modulus item. The resulting judgments were normalized by dividing them by the modulus value and by logtransforming them. Lapata et al. (1999) report a correlation of :570 between mean plausibility judgments and BNC counts for the seen adjective-noun bigrams. For unseen adjective-noun bigrams, Lapata et al. (2001) found a correlation of :356 between mean judgments and frequencies recreated using class-based smoothing (Resnik, 1993).</Paragraph>
      <Paragraph position="2"> In the present study, we used the plausibility judgments collected by Lapata et al. (1999) and Lapata et al. (2001) for adjective-noun bigrams and conducted additional experiments to obtain noun-noun and verb-object judgments for the materials described in Section 2.1. We used the same experimental procedure as the original study (see Lapata et al. (1999) and Lapata et al. (2001) for details). Four experiments were carried out, one each for seen and unseen noun-noun bigrams, and for seen and unseen verb-object bigrams. Unlike the adjective-noun and the noun-noun bigrams, the verb-object bigrams were not presented to subjects in isolation, but embedded in a minimal sentence context involving a proper name as the subject (e.g., Paul fulfilled the obligation).</Paragraph>
      <Paragraph position="3"> The experiments were conducted over the web using the WebExp software package (Keller et al., 1998). A series of previous studies has shown that data obtained using WebExp closely replicates results obtained in a controlled laboratory setting; this was demonstrated for acceptability judgments (Keller and Alexopoulou, 2001), co-reference judgments (Keller and Asudeh, 2001), and sentence completions (Corley and Scheepers, 2002). These references also provide a detailed discussion of the WebExp experimental setup.</Paragraph>
      <Paragraph position="4"> Table 6 lists the descriptive statistics for all six judgment experiments: the original experiments by Lapata et al. (1999) and Lapata et al. (2001) for adjective-noun bigrams, and our new ones for noun-noun and verb-object bigrams.</Paragraph>
      <Paragraph position="5"> We used correlation analysis to compare web counts with plausibility judgments for seen adjective-noun, noun-noun, and verb-object bigrams. Table 7 (top half) lists the correlation coefficients that were obtained when correlat- null in each experiment ing log-transformed web and BNC counts with log-transformed plausibility judgments.</Paragraph>
      <Paragraph position="6"> The results show that both Altavista and Google counts correlate with plausibility judgments for seen bigrams. Google slightly outperforms Altavista: the correlation coefficient for Google ranges from :624 to :693, while for Altavista, it ranges from :638 to :685. A surprising result is that the web counts consistently achieve a higher correlation with the judgments than the BNC counts, which range from :488 to :569. We carried out a series of one-tailed t-tests to determine if the differences between the correlation coefficients for the web counts and the correlation coefficients for the BNC counts were significant. For the adjective-noun bigrams, the difference between the BNC coefficient and the Altavista coefficient failed to reach significance (t(87)=1:46, p &gt; :05), while the Google coefficient was significantly higher than the BNC coefficient (t(87)=1:78, p &lt; :05). For the noun-noun bigrams, both the Altavista and the Google coefficients were significantly higher than the BNC coefficient (t(87)=2:94, p &lt;:01 and t(87)=3:06, p &lt;:01). Also for the verb-object bigrams, both the Altavista coefficient and the Google coefficient were significantly higher than the BNC coefficient (t(87)=2:21, p &lt;:05 and t(87)=2:25, p &lt;:05). In sum, for all three types of bigrams, the correlation coefficients achieved with Google were significantly higher than the ones achieved with the BNC. For Altavista, the noun-noun and the verb-object coefficients were higher than the coefficients obtained from the BNC.</Paragraph>
      <Paragraph position="7"> Table 7 (bottom half) lists the correlations coefficients obtained by comparing log-transformed judgments with log-transformed web counts for unseen adjective-noun, noun-noun, and verb-object bigrams. We observe that the web counts consistently show a significant correlation with the judgments, the coefficient ranging from :466 to :588 for Al- null web counts and BNC counts tavista counts, and from :446 to :611 for the Google counts. Note that a small number of bigrams produced zero counts even in our web queries; these frequencies were set to one for the correlation analysis (see Section 2.2).</Paragraph>
      <Paragraph position="8"> To conclude, this evaluation demonstrated that web counts reliably predict human plausibility judgments, both for seen and for unseen predicate-argument bigrams. In the case of Google counts for seen bigrams, we were also able to show that web counts are a better predictor of human judgments than BNC counts. These results show that our heuristic method yields useful frequencies; the simplifications we made in obtaining the counts, as well as the fact that web data are noisy, seem to be outweighed by the fact that the web is up to three orders of magnitude larger than the BNC (see our estimate in Section 2.2).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML