<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-3005">
  <Title>Using the Web to Obtain Frequencies for Unseen Bigrams</Title>
  <Section position="4" start_page="469" end_page="481" type="intro">
    <SectionTitle>
3. Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="469" end_page="471" type="sub_section">
      <SectionTitle>
3.1 Evaluation against Corpus Frequencies
</SectionTitle>
      <Paragraph position="0"> Since Web counts can be relatively noisy, as discussed in the previous section, it is crucial to determine whether there is a reliable relationship between Web counts and corpus counts. Once this is assured, we can explore the usefulness of Web counts for overcoming data sparseness. We carried out a correlation analysis to determine whether there is a linear relationship between BNC and NANTC counts and AltaVista and Google counts. All correlation coefficients reported in this article refer to Pearson's r.</Paragraph>
      <Paragraph position="1">  All results were obtained on log-transformed counts.</Paragraph>
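To make the analysis concrete, here is a minimal sketch of the kind of computation involved: Pearson's r between log-transformed Web and corpus frequencies for the same bigrams. The arrays and values are hypothetical stand-ins, not data from the paper.

```python
# Minimal sketch (not the authors' code): Pearson's r between log-transformed
# Web counts and corpus counts for the same bigrams. Frequencies are hypothetical.
import numpy as np
from scipy.stats import pearsonr

web_counts = np.array([120000.0, 45000.0, 800.0, 2.0, 31000.0])   # e.g., AltaVista hits
bnc_counts = np.array([310.0, 95.0, 4.0, 1.0, 60.0])              # e.g., BNC frequencies

# Log-transform to normalize the Zipfian distribution of the counts (cf. footnote 7).
r, p = pearsonr(np.log(web_counts), np.log(bnc_counts))
print(f"Pearson's r = {r:.3f} (p = {p:.3g})")
```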
      <Paragraph position="2">  Table 8 shows the results of correlating Web counts with corpus counts from the BNC, the corpus from which our bigrams were sampled (see Section 2.1). A high correlation coefficient was obtained across the board, ranging from .720 to .847 for AltaVista counts and from .720 to .850 for Google counts. This indicates that Web counts approximate BNC counts for the three types of bigrams under investigation. Note that there is almost no difference between the correlations achieved using Google and AltaVista counts.</Paragraph>
      <Paragraph position="3"> It is important to check that these results are also valid for counts obtained from other corpora. We therefore correlated our Web counts with the counts obtained from NANTC, a corpus that is larger than the BNC but is drawn from a single genre, namely, news text (see Section 2.2). The results are shown in Table 9. We find that correlation coefficients range from .667 to .788 for AltaVista and from .662 to .787 for Google. Again, there is virtually no difference between the correlations for the two search engines. We also observe that the correlation between Web counts and BNC counts is generally slightly higher than the correlation between Web counts and NANTC counts. We carried out one-tailed t-tests to determine whether the differences in the correlation coefficients were significant. We found that both AltaVista counts (t(87)=3.11, p &lt;.01) and Google counts (t(87)=3.21, p &lt;.01) were significantly better correlated with BNC counts than with NANTC counts for adjective-noun bigrams. The difference in correlation coefficients was not significant for noun-noun and verb-object bigrams, for either search engine.</Paragraph>
      <Paragraph position="4"> Footnote 6: Correlation analysis is a way of measuring the degree of linear association between two variables. Effectively, we are fitting a linear equation y = ax + b to the data; this means that the two variables x and y (which in our case represent frequencies or judgments) can still differ by a multiplicative constant a and an additive constant b, even if they are highly correlated.</Paragraph>
      <Paragraph position="5"> Footnote 7: It is well known that corpus frequencies have a Zipfian distribution. Log-transforming them is a way of normalizing the counts before applying statistical tests. We apply correlation analysis to the log-transformed data, which is equivalent to computing a log-linear regression coefficient on the untransformed data.</Paragraph>
      <Paragraph position="6"> Table 9 also shows the correlations between BNC counts and NANTC counts.</Paragraph>
      <Paragraph position="7"> The intercorpus correlation can be regarded as an upper limit for the correlations we can expect between counts from two corpora that differ in size and genre and that have been obtained using different extraction methods. The correlation between AltaVista and Google counts and NANTC counts reached the upper limit for all three bigram types (one-tailed t-tests found no significant differences between the correlation coefficients). The correlation between BNC counts and Web counts reached the upper limit for noun-noun and verb-object bigrams (no significant differences for either search engine) and significantly exceeded it for adjective-noun bigrams for AltaVista (t(87)=3.16, p &lt;.01) and Google (t(87)=3.26, p &lt;.01).</Paragraph>
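The paper reports one-tailed t-tests with 87 degrees of freedom (n = 90 bigrams per set) for differences between correlation coefficients that share one variable, but it does not spell out the exact procedure. A common choice in this situation is Williams' test for dependent correlations; the sketch below assumes that kind of test and uses illustrative numbers rather than values from the tables.

```python
# Hedged sketch: Williams' test (as described by Steiger 1980) for comparing two
# dependent correlations r12 and r13 that share variable 1. Its df = n - 3 matches
# the t(87) values reported for n = 90, but the paper does not state which test it used.
import math
from scipy.stats import t as t_dist

def williams_test(r12, r13, r23, n):
    """Return (t, one-tailed p) for H0: r12 = r13, with df = n - 3."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23  # det of 3x3 correlation matrix
    r_bar = (r12 + r13) / 2
    t_stat = (r12 - r13) * math.sqrt(
        ((n - 1) * (1 + r23)) /
        (2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3)
    )
    return t_stat, 1 - t_dist.cdf(abs(t_stat), df=n - 3)

print(williams_test(r12=0.82, r13=0.75, r23=0.70, n=90))  # illustrative values only
```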
      <Paragraph position="8"> We conclude that simple heuristics (see Section 2.3) are sufficient to obtain useful frequencies from the Web; it seems that the large amount of data available for Web counts outweighs the associated problems (noisy, unbalanced, etc.). We found that Web counts were highly correlated with frequencies from two different corpora. Furthermore, Web counts and corpus counts are as highly correlated as counts from two different corpora (which can be regarded as an upper bound).</Paragraph>
      <Paragraph position="9"> Note that Tables 8 and 9 also provide the correlation coefficients obtained when corpus frequencies are compared with frequencies that were re-created through class-based smoothing, using the BNC as a training corpus (after removing the seen bigrams). This will be discussed in more detail in Section 3.3.</Paragraph>
    </Section>
    <Section position="2" start_page="471" end_page="474" type="sub_section">
      <SectionTitle>
3.2 Evaluation against Plausibility Judgments
</SectionTitle>
      <Paragraph position="0"> Previous work has demonstrated that corpus counts correlate with human plausibility judgments for adjective-noun bigrams. This result holds both for seen bigrams (Lapata, McDonald, and Keller 1999) and for unseen bigrams whose counts have been re-created using smoothing techniques (Lapata, Keller, and McDonald 2001). Based on these findings, we decided to evaluate our Web counts on the task of predicting plausibility ratings. If the Web counts for bigrams correlate with plausibility judgments, then this indicates that the counts are valid, in the sense of being useful for predicting the intuitive plausibility of predicate-argument pairs. The degree of correlation between Web counts and plausibility judgments is an indicator of the quality of the Web counts (compared to corpus counts or counts re-created using smoothing techniques).</Paragraph>
      <Paragraph position="1"> 3.2.1 Method. For seen and unseen adjective-noun bigrams, we used the two sets of plausibility judgments collected by Lapata, McDonald, and Keller (1999) and Lapata, Keller, and McDonald (2001), respectively. We conducted four additional experiments to collect judgments for noun-noun and verb-object bigrams, both seen and unseen.</Paragraph>
      <Paragraph position="2"> The experimental method was the same for all six experiments.</Paragraph>
      <Paragraph position="3"> Materials. The experimental stimuli were based on the six sets of seen or unseen bigrams extracted from the BNC as described in Section 2.1 (adjective-noun, noun-noun, and verb-object bigrams). In the adjective-noun and noun-noun cases, the stimuli consisted simply of the bigrams. In the verb-object case, the bigrams were embedded in a short sentence to make them more natural: A proper-noun subject was added.</Paragraph>
      <Paragraph position="4"> Procedure. The experimental paradigm was magnitude estimation (ME), a technique standardly used in psychophysics to measure judgments of sensory stimuli (Stevens 1975), which Bard, Robertson, and Sorace (1996) and Cowart (1997) have applied to the elicitation of linguistic judgments. The ME procedure requires subjects to estimate the magnitude of physical stimuli by assigning numerical values proportional to the stimulus magnitude they perceive. In contrast to the five- or seven-point scale conventionally used to measure human intuitions, ME employs an interval scale and therefore produces data for which parametric inferential statistics are valid.</Paragraph>
      <Paragraph position="5"> ME requires subjects to assign numbers to a series of linguistic stimuli in a proportional fashion. Subjects are first exposed to a modulus item, to which they assign an arbitrary number. All other stimuli are rated proportional to the modulus. In this way, each subject can establish his or her own rating scale, thus yielding maximally fine-grained data and avoiding the known problems with the conventional ordinal scales for linguistic data (Bard, Robertson, and Sorace 1996; Cowart 1997; Schütze 1996).</Paragraph>
      <Paragraph position="6"> The experiments reported in this article were carried out using the WebExp software package (Keller et al. 1998). A series of previous studies has shown that data obtained using WebExp closely replicate results obtained in a controlled laboratory setting; this has been demonstrated for acceptability judgments (Keller and Alexopoulou 2001), coreference judgments (Keller and Asudeh 2001), and sentence completions (Corley and Scheepers 2002).</Paragraph>
      <Paragraph position="7"> In the present experiments, subjects were presented with bigram pairs and were asked to rate the degree of plausibility proportional to a modulus item. They first saw a set of instructions that explained the ME technique and the judgment task. The concept of plausibility was not defined, but examples of plausible and implausible bigrams were given (different examples for each stimulus set). Then subjects were asked to fill in a questionnaire with basic demographic information. The experiment proper  consisted of three phases: (1) a calibration phase, designed to familiarize subjects with the task, in which they had to estimate the length of five horizontal lines; (2) a practice phase, in which subjects judged the plausibility of eight bigrams (similar to the ones in the stimulus set); (3) the main experiment, in which each subject judged one of the six stimulus sets (90 bigrams). The stimuli were presented in random order, with a new randomization being generated for each subject.</Paragraph>
      <Paragraph position="8"> Subjects. A separate experiment was conducted for each set of stimuli. The number of subjects per experiment is shown in Table 10 (in the column labeled N). All subjects were self-reported native speakers of English; they were recruited by postings to newsgroups and mailing lists. Participation was voluntary and unpaid.</Paragraph>
      <Paragraph position="9"> WebExp collects by-item response time data; subjects whose response times were very short or very long were excluded from the sample, as they are unlikely to have completed the experiment adequately. We also excluded the data of subjects who had participated more than once in the same experiment, based on their demographic data and on their Internet connection data, which is logged by WebExp.</Paragraph>
      <Paragraph position="10">  The judgments were normalized by dividing each numerical judgment by the modulus value that the subject had assigned to the reference sentence. This operation creates a common scale for all subjects. Then the data were transformed by taking the decadic logarithm. This transformation ensures that the judgments are normally distributed and is standard practice for magnitude estimation data (Bard, Robertson, and Sorace 1996; Cowart 1997; Stevens 1975). All further analyses were conducted on the normalized, log-transformed judgments.</Paragraph>
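A small sketch of this normalization step follows, with hypothetical ratings and modulus values (the variable names are illustrative and not taken from WebExp output).

```python
# Sketch of the normalization described above: divide each subject's judgments by the
# value that subject assigned to the modulus, then take the decadic logarithm.
import numpy as np

raw_judgments = {                       # subject -> ratings for the same list of bigrams
    "subj01": np.array([50.0, 10.0, 200.0, 25.0]),
    "subj02": np.array([8.0, 2.0, 30.0, 5.0]),
}
modulus = {"subj01": 25.0, "subj02": 4.0}   # each subject's rating of the modulus item

normalized = {s: np.log10(r / modulus[s]) for s, r in raw_judgments.items()}
mean_judgments = np.mean(list(normalized.values()), axis=0)  # per-bigram means used below
print(mean_judgments)
```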
      <Paragraph position="11"> Table 10 shows the descriptive statistics for all six judgment experiments: the original experiments by Lapata, McDonald, and Keller (1999) and Lapata, Keller, and McDonald (2001) for adjective-noun bigrams, and our new ones for noun-noun and verb-object bigrams.</Paragraph>
      <Paragraph position="12"> We used correlation analysis to compare corpus counts and Web counts with plausibility judgments. Table 11 (top half) lists the correlation coefficients that were obtained when correlating log-transformed Web counts (AltaVista and Google) and corpus counts (BNC and NANTC) with mean plausibility judgments for seen adjective-noun, noun-noun, and verb-object bigrams. The results show that both AltaVista and Google counts correlate well with plausibility judgments for seen bigrams. The correlation coefficient for AltaVista ranges from .641 to .700; for Google, it ranges from .624 to .692. The correlations for the two search engines are very similar, which is also what we found in Section 3.1 for the correlations between Web counts and corpus counts.</Paragraph>
      <Paragraph position="13"> Note that the Web counts consistently achieve a higher correlation with the judgments than the BNC counts, which range from .488 to .569. We carried out a series of one-tailed t-tests to determine whether the differences between the correlation coefficients for the Web counts and the correlation coefficients for the BNC counts were significant. For the adjective-noun bigrams, the AltaVista coefficient was significantly higher than the BNC coefficient (t(87)=1.76, p &lt;.05), whereas the difference between the Google coefficient and the BNC coefficient failed to reach significance. For the noun-noun bigrams, both the AltaVista and the Google coefficients were significantly higher than the BNC coefficient (t(87)=3.11, p &lt;.01 and t(87)=2.95, p &lt;.01).</Paragraph>
      <Paragraph position="14"> Also, for the verb-object bigrams, both the AltaVista coefficient and the Google coefficient were significantly higher than the BNC coefficient (t(87)=2.64, p &lt;.01 and t(87)=2.32, p &lt;.05).</Paragraph>
      <Paragraph position="15"> A similar picture was observed for the NANTC counts. Again, the Web counts outperformed the corpus counts in predicting plausibility. For the adjective-noun bigrams, both the AltaVista and the Google coefficient were significantly higher than the NANTC coefficient (t(87)=1.97, p &lt;.05; t(87)=1.81, p &lt;.05). For the noun-noun bigrams, the AltaVista coefficient was higher than the NANTC coefficient (t(87)=1.64, p &lt;.05), but the Google coefficient was not significantly different from the NANTC coefficient. For verb-object bigrams, the difference was significant for both search engines (t(87)=2.74, p &lt;.01; t(87)=2.38, p &lt;.01).</Paragraph>
      <Paragraph position="16"> In sum, for all three types of bigrams, the correlation coefficients achieved with AltaVista were significantly higher than the ones achieved by either the BNC or the NANTC. Google counts outperformed corpus counts for all bigrams with the exception of adjective-noun counts from the BNC and noun-noun counts from the NANTC.</Paragraph>
      <Paragraph position="17"> The bottom panel of Table 11 shows the correlation coefficients obtained by comparing log-transformed judgments with log-transformed Web counts for unseen adjective-noun, noun-noun, and verb-object bigrams. We observe that the Web counts consistently show a significant correlation with the judgments, with the coefficient ranging from .480 to .578 for AltaVista counts and from .473 to .595 for the Google counts. Table 11 also provides the correlations between plausibility judgments and counts re-created using class-based smoothing, which we will discuss in Section 3.3.</Paragraph>
      <Paragraph position="18"> An important question is how well humans agree when judging the plausibility of adjective-noun, noun-noun, and verb-object bigrams. Intersubject agreement gives an upper bound for the task and allows us to interpret how well our Web-based method performs in relation to humans. To calculate intersubject agreement we used leave-one-out resampling. This technique is a special case of n-fold cross-validation (Weiss and Kulikowski 1991) and has been previously used for measuring how well humans agree in judging semantic similarity (Resnik 1999, 2000).</Paragraph>
      <Paragraph position="19"> For each subject group, we divided the set of the subjects' responses with size n into a set of size n-1 (i.e., the response data of all but one subject) and a set of size 1 (i.e., the response data of a single subject). We then correlated the mean ratings of the former set with the ratings of the latter. This was repeated n times (see the number of participants in Table 10); the mean of the correlation coefficients for the seen and unseen bigrams is shown in Table 11 in the rows labeled &amp;quot;Agreement.&amp;quot; For both seen and unseen bigrams, we found no significant difference between the upper bound (intersubject agreement) and the correlation coefficients obtained using either AltaVista or Google counts. This finding holds for all three types of bigrams. The same picture emerged for the BNC and NANTC counts: These correlation coefficients were not significantly different from the upper limit, for all three types of bigrams, both for seen and for unseen bigrams.</Paragraph>
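The leave-one-out procedure can be summarized in a few lines; the sketch below assumes a subjects-by-items matrix of normalized judgments and uses toy data rather than the experimental ratings.

```python
# Sketch of leave-one-out intersubject agreement: correlate each subject's ratings with
# the mean ratings of the remaining subjects, then average the coefficients.
import numpy as np
from scipy.stats import pearsonr

def intersubject_agreement(ratings):
    """ratings: (n_subjects, n_items) array of normalized judgments."""
    coefficients = []
    for i in range(ratings.shape[0]):
        others_mean = np.delete(ratings, i, axis=0).mean(axis=0)
        r, _ = pearsonr(ratings[i], others_mean)
        coefficients.append(r)
    return float(np.mean(coefficients))

rng = np.random.default_rng(0)
toy_ratings = rng.normal(size=(20, 90))   # 20 subjects, 90 bigrams (toy data)
print(intersubject_agreement(toy_ratings))
```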
      <Paragraph position="20"> To conclude, our evaluation demonstrated that Web counts reliably predict human plausibility judgments, both for seen and for unseen predicate-argument bigrams.</Paragraph>
      <Paragraph position="21"> AltaVista counts for seen bigrams are a better predictor of human judgments than BNC and NANTC counts. These results show that our heuristic method yields valid frequencies; the simplifications we made in obtaining the Web counts (see Section 2.3), as well as the fact that Web data are noisy (see Section 2.4), seem to be outweighed by the fact that the Web is up to a thousand times larger than the BNC.</Paragraph>
    </Section>
    <Section position="3" start_page="474" end_page="476" type="sub_section">
      <SectionTitle>
3.3 Evaluation against Class-Based Smoothing
</SectionTitle>
      <Paragraph position="0"> The evaluation in the last two sections established that Web counts are useful for approximating corpus counts and for predicting plausibility judgments. As a further step in our evaluation, we correlated Web counts with counts re-created by applying a class-based smoothing method to the BNC.</Paragraph>
      <Paragraph position="1"> We re-created co-occurrence frequencies for predicate-argument bigrams using a simplified version of Resnik's (1993) selectional association measure proposed by Lapata, Keller, and McDonald (2001). In a nutshell, this measure replaces Resnik's (1993) information-theoretic approach with a simpler measure that makes no assumptions with respect to the contribution of a semantic class to the total quantity of information provided by the predicate about the semantic classes of its argument. It simply substitutes the argument occurring in the predicate-argument bigram with the concept by which it is represented in the WordNet taxonomy. Predicate-argument co-occurrence frequency is estimated by counting the number of times the concept corresponding to the argument is observed to co-occur with the predicate in the corpus. Because a given word is not always represented by a single class in the taxonomy (i.e., the argument co-occurring with a predicate can generally be the realization of one of several conceptual classes), Lapata, Keller, and McDonald (2001) constructed the frequency counts for a predicate-argument bigram for each conceptual class by dividing the contribution from the argument by the number of classes to which it belongs.</Paragraph>
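The following toy sketch conveys the flavor of this class-based re-creation. The noun-to-class mapping stands in for the WordNet taxonomy, and the counting scheme is a simplification of the measure of Lapata, Keller, and McDonald (2001), not their implementation.

```python
# Rough sketch (not the authors' code): each seen argument contributes its count to every
# class it belongs to, divided by the number of its classes; an unseen bigram's count is
# then re-created from the class counts of its argument. Classes here are a toy stand-in
# for WordNet.
from collections import defaultdict

classes_of = {                            # hypothetical class memberships
    "coffee": {"beverage", "substance"},
    "tea": {"beverage", "substance", "meal"},
    "juice": {"beverage", "substance"},
    "idea": {"cognition"},
}
seen_counts = {("drink", "coffee"): 120, ("drink", "tea"): 80}   # toy corpus counts

class_counts = defaultdict(float)
for (pred, arg), freq in seen_counts.items():
    for c in classes_of[arg]:             # distribute the count over the argument's classes
        class_counts[(pred, c)] += freq / len(classes_of[arg])

def recreate(pred, arg):
    """Re-create a count for a (predicate, argument) bigram from class-level evidence."""
    cls = classes_of[arg]
    return sum(class_counts[(pred, c)] for c in cls) / len(cls)

print(recreate("drink", "juice"))   # unseen bigram with related classes -> nonzero count
print(recreate("drink", "idea"))    # unseen bigram with an unrelated class -> 0.0
```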
      <Paragraph position="2"> They demonstrate that the counts re-created using this smoothing technique correlate significantly with plausibility judgments for adjective-noun bigrams. They also show that this class-based approach outperforms distance-weighted averaging (Dagan, Lee, and Pereira 1999), a smoothing method that re-creates unseen word co-occurrences on the basis of distributional similarity (without relying on a predefined taxonomy), in predicting plausibility.</Paragraph>
      <Paragraph position="3"> In the current study, we used the smoothing technique of Lapata, Keller, and McDonald (2001) to re-create not only adjective-noun bigrams, but also noun-noun  and verb-object bigrams. As already mentioned in Section 2.1, it was assumed that the noun is the predicate in adjective-noun bigrams; for noun-noun bigrams, we treated the right noun as the predicate, and for verb-object bigrams, we treated the verb as the predicate. We applied Lapata, Keller, and McDonald's (2001) technique to the unseen bigrams for all three bigram types. We also used it on the seen bigrams, which we were able to treat as unseen by removing all instances of the bigrams from the training corpus.</Paragraph>
      <Paragraph position="4"> To test the claim that Web frequencies can be used to overcome data sparseness, we correlated the frequencies re-created using class-based smoothing on the BNC with the frequencies obtained from the Web. The correlation coefficients for both seen and unseen bigrams are shown in Table 12. In all cases, a significant correlation between Web counts and re-created counts is obtained. For seen bigrams, the correlation coefficient ranged from .344 to .362 for AltaVista counts and from .330 to .349 for Google counts. For unseen bigrams, the correlations were somewhat higher, ranging from .386 to .439 for AltaVista counts and from .397 to .444 for Google counts. For both seen and unseen bigrams, there was only a very small difference between the correlation coefficients obtained with the two search engines.</Paragraph>
      <Paragraph position="5"> It is also interesting to compare the performance of class-based smoothing and Web counts on the task of predicting plausibility judgments. The correlation coefficients are listed in Table 11. The re-created frequencies are correlated significantly with all three types of bigrams, both for seen and unseen bigrams. For the seen bigrams, we found that the correlation coefficients obtained using smoothed counts were significantly lower than the upper bound for all three types of bigrams (t(87)=3.01, p &lt;.01; t(87)=3.23, p &lt;.01; t(87)=3.43, p &lt;.01). This result also held for the unseen bigrams: The correlations obtained using smoothing were significantly lower than the upper bound for all three types of bigrams (t(87)=1.86, p &lt;.05; t(87)=1.97, p &lt;.05; t(87)=3.36, p &lt;.01).</Paragraph>
      <Paragraph position="6"> Recall that the correlation coefficients obtained using the Web counts were not found to be significantly different from the upper bound, which indicates that Web counts are better predictors of plausibility than smoothed counts. This fact was confirmed by further significance testing: For seen bigrams, we found that the AltaVista correlation coefficients were significantly higher than correlation coefficients obtained using smoothing, for all three types of bigrams (t(87)=3.31, p &lt;.01; t(87)=4.11, p &lt;.01; t(87)=4.32, p &lt;.01). This also held for Google counts (t(87)=3.16, p &lt;.01; t(87)=4.02, p &lt;.01; t(87)=4.03, p &lt;.01). For unseen bigrams, the AltaVista coefficients and the coefficients obtained using smoothing were not significantly different  Computational Linguistics Volume 29, Number 3 for adjective-noun bigrams, but the difference reached significance for noun-noun and verb-object bigrams (t(87)=2.08, p &lt;.05; t(87)=2.53, p &lt;.01). For Google counts, the difference was again not significant for adjective-noun bigrams, but it reached significance for noun-noun and verb-object bigrams (t(87)=2.34, p &lt;.05; t(87)=2.15, p &lt;.05).</Paragraph>
      <Paragraph position="7"> Finally, we conducted a small study to investigate the validity of the counts that were re-created using class-based smoothing. We correlated the re-created counts for the seen bigrams with their actual BNC and NANTC frequencies. The correlation coefficients are reported in Tables 8 and 9. We found that the correlation between re-created counts and corpus counts was significant for all three types of bigrams, for both corpora. This demonstrates that the smoothing technique we employed generates realistic corpus counts, in the sense that the re-created counts are correlated with the actual counts. However, the correlation coefficients obtained using Web counts were always substantially higher than those obtained using smoothed counts. These differences were significant for the BNC counts for AltaVista (t(87)=8.38, p &lt;.01; t(87)=5.00, p &lt;.01; t(87)=5.03, p &lt;.01) and Google (t(87)=8.35, p &lt;.01; t(87)=5.00, p &lt;.01; t(87)=5.03, p &lt;.01). They were also significant for the NANTC counts for AltaVista (t(87)=4.12, p &lt;.01; t(87)=3.72, p &lt;.01; t(87)=6.58, p &lt;.01) and Google (t(87)=4.08, p &lt;.01; t(87)=3.06, p &lt;.01; t(87)=6.47, p &lt;.01).</Paragraph>
      <Paragraph position="8"> To summarize, the results presented in this section indicate that Web counts are indeed a valid way of obtaining counts for bigrams that are unseen in a given corpus: They correlate reliably with counts re-created using class-based smoothing. For seen bigrams, we found that Web counts correlate with counts that were re-created using smoothing techniques (after removing the seen bigrams from the training corpus). For the task of predicting plausibility judgments, we were able to show that Web counts outperform re-created counts, both for seen and for unseen bigrams. Finally, we found that Web counts for seen bigrams correlate better than re-created counts with the real corpus counts.</Paragraph>
      <Paragraph position="9"> It is beyond the scope of the present study to undertake a full comparison between Web counts and frequencies re-created using all available smoothing techniques (and all available taxonomies that might be used for class-based smoothing). The smoothing method discussed above is simply one type of class-based smoothing. Other, more sophisticated class-based methods do away with the simplifying assumption that the argument co-occurring with a given predicate (adjective, noun, verb) is distributed evenly across its conceptual classes and attempt to find the right level of generalization in a concept hierarchy, by discounting, for example, the contribution of very general classes (Clark and Weir 2001; McCarthy 2000; Li and Abe 1998). Other smoothing approaches such as discounting (Katz 1987) and distance-weighted averaging (Grishman and Sterling 1994; Dagan, Lee, and Pereira 1999) re-create counts of unseen word combinations by exploiting only corpus-internal evidence, without relying on taxonomic information. Our goal was to demonstrate that frequencies retrieved from the Web are a viable alternative to conventional smoothing methods when data are sparse; we do not claim that our Web-based method is necessarily superior to smoothing or that it should be generally preferred over smoothing methods. However, the next section will present a small-scale study that compares the performance of several smoothing techniques with the performance of Web counts on a standard task from the literature.</Paragraph>
    </Section>
    <Section position="4" start_page="476" end_page="481" type="sub_section">
      <SectionTitle>
3.4 Pseudodisambiguation
</SectionTitle>
      <Paragraph position="0"> In the smoothing literature, re-created frequencies are typically evaluated using pseudodisambiguation (Clark and Weir 2001; Dagan, Lee, and Pereira 1999; Lee 1999; Pereira, Tishby, and Lee 1993; Prescher, Riezler, and Rooth 2000; Rooth et al. 1999).</Paragraph>
      <Paragraph position="1"> The aim of the pseudodisambiguation task is to decide whether a given algorithm re-creates frequencies that make it possible to distinguish between seen and unseen bigrams in a given corpus. A set of pseudobigrams is constructed according to a set of criteria (detailed below) that ensure that they are unattested in the training corpus. Then the seen bigrams are removed from the training data, and the smoothing method is used to re-create the frequencies of both the seen bigrams and the pseudobigrams. The smoothing method is then evaluated by comparing the frequencies it re-creates for both types of bigrams.</Paragraph>
      <Paragraph position="2"> We evaluated our Web counts by applying the pseudodisambiguation procedure that Rooth et al. (1999), Prescher, Riezler, and Rooth (2000), and Clark and Weir (2001) employed for evaluating re-created verb-object bigram counts. In this procedure, the noun n from a verb-object bigram (v, n) that is seen in a given corpus is paired with a randomly chosen verb v′ that does not take n as its object within the corpus. This results in an unseen verb-object bigram (v′, n). The seen bigram is now treated as unseen (i.e., all of its occurrences are removed from the training corpus), and the frequencies of both the seen and the unseen bigram are re-created (using smoothing, or Web counts, in our case). The task is then to decide which of the two verbs v and v′ takes the noun n as its object. For this, the re-created bigram frequency is used: The bigram with the higher re-created frequency (or probability) is taken to be the seen bigram. If this bigram is really the seen one, then the disambiguation is correct. The overall percentage of correct disambiguations is a measure of the quality of the re-created frequencies (or probabilities). In the following, we will first describe in some detail the experiments that Rooth et al. (1999) and Clark and Weir (2001) conducted. We will then discuss how we replicated their experiments using the Web as an alternative smoothing method.</Paragraph>
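As a concrete illustration of this construction, the following toy sketch pairs a seen (v, n) bigram with a randomly sampled confounder verb v′ that never takes n as its object in a (hypothetical) corpus.

```python
# Sketch of pseudobigram construction as described above: for a seen pair (v, n), sample
# a confounder verb v' that does not occur with n as its object anywhere in the corpus.
# The corpus here is a toy stand-in.
import random

seen_pairs = {("drink", "coffee"), ("abolish", "tax"), ("read", "book")}
verbs = sorted({v for v, _ in seen_pairs})

def make_triple(v, n, rng):
    candidates = [u for u in verbs if u != v and (u, n) not in seen_pairs]
    return (v, n, rng.choice(candidates))

print(make_triple("drink", "coffee", random.Random(0)))   # e.g., ('drink', 'coffee', 'read')
```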
      <Paragraph position="3"> Rooth et al. (1999) used pseudodisambiguation to evaluate a class-based model that is derived from unlabeled data using the expectation maximization (EM) algorithm. From a data set of 1,280,712 (v, n) pairs (obtained from the BNC using Carroll and Rooth's [1998] parser), they randomly selected 3,000 pairs, with each pair containing a fairly frequent verb and noun (only verbs and nouns that occurred between  ). The probabilities were re-created using Rooth et al.'s (1999) EM-based clustering model on a training set from which all seen pairs (v, n) had been removed. An accuracy of 80% on the pseudodisambiguation task was achieved (see Table 13). Prescher, Riezler, and Rooth (2000) evaluated Rooth et al.'s (1999) EM-based clustering model again using pseudodisambiguation, but on a separate data set using a slightly different method for constructing the pseudobigrams. They used a set of 298 (v, n, n′) BNC triples in which (v, n) was chosen as in Rooth et al. (1999) but paired with a randomly chosen noun n′. Given the set of (v, n, n′) triples, the task was to decide whether (v, n) or (v, n′) was the correct pair in each triple. Prescher, Riezler, and Rooth (2000) reported pseudodisambiguation results with two clustering models: (1) Rooth et al.'s (1999) clustering approach, which models the semantic fit between a verb and its argument (VA model), and (2) a refined version of this approach that models only the fit between a verb and its object (VO model), disregarding other arguments of the verb. The results of the two models on the pseudodisambiguation task are shown in Table 14.</Paragraph>
      <Paragraph position="4"> At this point, it is important to note that neither Rooth et al. (1999) nor Prescher, Riezler, and Rooth (2000) used pseudodisambiguation for the final evaluation of their models. Rather, the performance on the pseudodisambiguation task was used to optimize the model parameters. The results in Tables 13 and 14 show the pseudodisambiguation performance achieved for the best parameter settings. In other words, these results were obtained on the development set (i.e., on the same data set that was used to optimize the parameters), not on a completely unseen test set. This procedure is well-justified in the context of Rooth et al.'s (1999) and Prescher, Riezler, and Rooth's (2000) work, which aimed at building models of lexical semantics, not of pseudodisambiguation. Therefore, they carried out their final evaluations on unseen test sets for the tasks of lexicon induction (Rooth et al. 1999) and target language disambiguation (Prescher, Riezler, and Rooth 2000), once the model parameters had been fixed using the pseudodisambiguation development set.</Paragraph>
      <Paragraph position="5">  Clark and Weir (2002) use a setting similar to that of Rooth et al. (1999) and Prescher, Riezler, and Rooth (2000); here pseudodisambiguation is employed to evaluate the performance of a class-based probability estimation method. In order to address the problem of estimating conditional probabilities in the face of sparse data, Clark and Weir (2002) define probabilities in terms of classes in a semantic hierarchy and propose hypothesis testing as a means of determining a suitable level of generalization in the hierarchy. Clark and Weir (2002) report pseudodisambiguation results on two data sets, with an experimental setup similar to that of Rooth et al. (1999). For the first data set, 3,000 pairs were randomly chosen from 1.3 million (v, n) tuples extracted from the BNC (using the parser of Briscoe and Carroll [1997]). The selected pairs contained relatively frequent verbs (occurring between 500 and 5,000 times in the data). The data sets were constructed as proposed by Rooth et al. (1999). The procedure for creating the second data set was identical, but this time only verbs that occurred between 100 and 1,000 times were considered. Clark and Weir (2002) further compared their approach with Resnik's (1993) selectional association model and Li and Abe's (1998) tree cut model on the same data sets. These methods are directly comparable, as they can be used for class-based probability estimation and address the question of how to find a suitable level of generalization in a hierarchy (i.e., WordNet). The results of the three methods on the two data sets are shown in Table 15. Footnote 8: Stefan Riezler (personal communication, 2003) points out that the main variance in Rooth et al.'s (1999) pseudodisambiguation results comes from the class cardinality parameter (start values account for only 2% of the performance, and iterations do not seem to make a difference at all). Figure 3 of Rooth et al. (1999) shows that a performance of more than 75% is obtained for every reasonable choice of classes. This indicates that a &amp;quot;proper&amp;quot; pseudodisambiguation setting with separate development and test data would have resulted in a similar choice of class cardinality and thus achieved the same 80% performance that is cited in Table 13.</Paragraph>
      <Paragraph position="6"> We employed the same pseudodisambiguation method to test whether Web-based frequencies can be used for distinguishing between seen and artificially constructed unseen bigrams. We obtained the data sets of Rooth et al. (1999), Prescher, Riezler, and Rooth (2000), and Clark and Weir (2002) described above. Given a set of (v, n, v′) triples, the task was to decide whether (v, n) or (v′, n) was the seen bigram. Then we used two models for pseudodisambiguation: the joint probability model compared the joint probability estimates f(v, n) and f(v′, n) and predicted that the bigram with the higher estimate is the seen one. The conditional probability model compared the conditional probability estimates f(v, n)/f(v) and f(v′, n)/f(v′) and again selected as the seen bigram the one with the higher estimate (in both cases, ties were resolved by choosing at random).</Paragraph>
      <Paragraph position="7">  The same two models were used to perform pseudodisambiguation for the (v, n, n′) triples, where we have to choose between (v, n) and (v, n′).</Paragraph>
      <Paragraph position="8"> Here, the probability estimates f(v, n) and f(v, n′) were used for the joint probability model, and f(v, n)/f(n) and f(v, n′)/f(n′) for the conditional probability model.</Paragraph>
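A compact sketch of the two decision rules just described, using a hypothetical table of Web counts (the numbers are invented; a real run would query the search engine instead):

```python
# Sketch of the two pseudodisambiguation decision rules for a (v, n, v') triple.
# `f` stands in for Web frequency lookups; the corpus size N cancels out, so raw
# counts can be compared directly. Counts are invented for illustration.
import random

f = {("drink", "coffee"): 500000, ("abolish", "coffee"): 900,
     "drink": 9000000, "abolish": 400000}

def choose_joint(v, n, v_prime, rng=random):
    """Joint probability model: compare f(v, n) with f(v', n)."""
    a, b = f[(v, n)], f[(v_prime, n)]
    return rng.choice([v, v_prime]) if a == b else (v if a > b else v_prime)

def choose_conditional(v, n, v_prime, rng=random):
    """Conditional probability model: compare f(v, n)/f(v) with f(v', n)/f(v')."""
    a, b = f[(v, n)] / f[v], f[(v_prime, n)] / f[v_prime]
    return rng.choice([v, v_prime]) if a == b else (v if a > b else v_prime)

print(choose_joint("drink", "coffee", "abolish"))        # -> 'drink'
print(choose_conditional("drink", "coffee", "abolish"))  # -> 'drink'
```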
      <Paragraph position="9"> The results for Rooth et al.'s (1999) data set are given in Table 13. The conditional probability model achieves a performance of 71.2% correct for subjects and 85.2% correct for objects. The performance on the whole data set is 77.7%, which is below the performance of 80.0% reported by Rooth et al. (1999). However, the difference is not found to be significant using a chi-square test comparing the number of correct and incorrect classifications (χ²(1)=2.02, p = .16). The joint probability model performs consistently worse than the conditional probability model: It achieves an overall accuracy of 72.7%, which is significantly lower than the accuracy of the Rooth et al. (1999) model (χ²(1)=19.50, p &lt;.01).</Paragraph>
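The chi-square comparisons of correct versus incorrect classifications can be set up as a 2x2 contingency table. The sketch below uses invented accuracies and item counts, and since the paper does not say whether a continuity correction was applied, correction=False is an assumption.

```python
# Hedged sketch: chi-square test on a 2x2 table of correct vs. incorrect decisions for
# two methods evaluated on the same number of items. Accuracies and item counts are
# invented; correction=False is an assumption about the original analysis.
from scipy.stats import chi2_contingency

n_items = 1000
acc_a, acc_b = 0.85, 0.81
table = [
    [round(acc_a * n_items), round((1 - acc_a) * n_items)],   # method A: correct, incorrect
    [round(acc_b * n_items), round((1 - acc_b) * n_items)],   # method B: correct, incorrect
]
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```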
      <Paragraph position="10"> Footnote 9: We used only AltaVista counts, as there was virtually no difference between AltaVista and Google counts in our previous evaluations (see Sections 3.1-3.3). Google allows only 1,000 queries per day (for registered users), which makes it time-consuming to obtain large numbers of Google counts. AltaVista has no such restriction.</Paragraph>
      <Paragraph position="11"> A similar picture emerges with regard to Prescher, Riezler, and Rooth's (2000) data set (see Table 14). The conditional probability model achieves an accuracy of 66.7% for subjects and 70.5% for objects. The combined performance of 68.5% is significantly lower than the performance of both the VA model (χ²(1)=33.28, p &lt;.01) and the VO model reported by Prescher, Riezler, and Rooth (2000). Again, the joint probability model performs worse than the conditional probability model, achieving an overall accuracy of 62.4%. Footnote 10: The probability estimates are P(a, b) = f(a, b)/N and P(b|a) = f(a, b)/f(a) for the joint probability and the conditional probability, respectively. However, the corpus size N can be ignored, as it is constant.</Paragraph>
      <Paragraph position="12"> We also applied our Web-based method to the pseudodisambiguation data set of Clark and Weir (2002). Here, the conditional probability model reached a performance of 83.9% correct on the low-frequency data set. This is significantly higher than the highest performance of 72.4% reported by Clark and Weir (2002) on the same data set (χ²(1)=115.50, p &lt;.01). The joint probability model performs worse than the conditional model, at 81.1%. However, this is still significantly better than the best result of Clark and Weir (2002) (χ²(1)=63.14, p &lt;.01). The same pattern is observed for the high-frequency data set, on which the conditional probability model achieves 87.7% correct and thus significantly outperforms Clark and Weir (2002), who obtained 73.9% (χ²(1)=283.73, p &lt;.01). The joint probability model achieved 85.3% on this data set, also significantly outperforming Clark and Weir (2002) (χ²(1)=119.35, p &lt;.01).</Paragraph>
      <Paragraph position="13"> To summarize, we demonstrated that the simple Web-based approach proposed in this article yields results for pseudodisambiguation that outperform class-based smoothing techniques, such as the ones proposed by Resnik (1993), Li and Abe (1998), and Clark and Weir (2002). We were also able to show that a Web-based approach is able to achieve the same performance as an EM-based smoothing model proposed by Rooth et al. (1999). However, the Web-based approach was not able to outperform the more sophisticated EM-based model of Prescher, Riezler, and Rooth (2000). Another result we obtained is that Web-based models that use conditional probabilities (where unigram frequencies are used to normalize the bigram frequencies) generally outperform a more simple-minded approach that relies directly on bigram frequencies for pseudodisambiguation.</Paragraph>
      <Paragraph position="14"> There are a number of reasons why our results regarding pseudodisambiguation have to be treated with some caution. First of all, the two smoothing methods (i.e., EM-based clustering and class-based probability estimation using WordNet) were not evaluated on the same data set, and therefore the two results are not directly comparable. For instance, Clark and Weir's (2002) data set is substantially less noisy than Rooth et al.'s (1999) and Prescher, Riezler, and Rooth's (2000), as it contains only verbs and nouns that occur in WordNet. Furthermore, Stephen Clark (personal communication, 2003) points out that WordNet-based approaches are at a disadvantage when it comes to pseudodisambiguation. Pseudodisambiguation assumes that the correct pair is unseen in the training data; this makes the task deliberately hard, because some of the pairs might be frequent enough that reliable corpus counts can be obtained without having to use WordNet (using WordNet is likely to be more noisy than using the actual counts). Another problem with WordNet-based approaches is that they offer no systematic treatment of word sense ambiguity, which puts them at a disadvantage with respect to approaches that do not rely on a predefined inventory of word senses.</Paragraph>
      <Paragraph position="15"> Finally, recall that the results for the EM-based approaches in Tables 13 and 14 were obtained on the development set (as pseudodisambiguation was used as a means of parameter tuning by Rooth et al. [1999] and Prescher, Riezler, and Rooth [2000]). It is possible that this fact inflates the performance values for the EM-based approaches (but see note 8).</Paragraph>
    </Section>
  </Section>
</Paper>