File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/93/p93-1024_evalu.xml
Size: 5,740 bytes
Last Modified: 2025-10-06 14:00:08
<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1024">
<Title>DISTRIBUTIONAL CLUSTERING OF ENGLISH WORDS</Title>
<Section position="6" start_page="188" end_page="189" type="evalu">
<SectionTitle> MODEL EVALUATION </SectionTitle>
<Paragraph position="0"> The preceding qualitative discussion provides some indication of what aspects of distributional relationships may be discovered by clustering.</Paragraph>
<Paragraph position="1"> However, we also need to evaluate clustering more rigorously as a basis for models of distributional relationships. So far, we have looked at two kinds of measurements of model quality: (i) relative entropy between held-out data and the asymmetric model, and (ii) performance on the task of deciding which of two verbs is more likely to take a given noun as direct object when the data relating one of the verbs to the noun has been withheld from the training data.</Paragraph>
<Paragraph position="2"> The evaluation described below was performed on the largest data set we have worked with so far, extracted from 44 million words of 1988 Associated Press newswire with the pattern-matching techniques mentioned earlier. This collection process yielded 1,112,041 verb-object pairs. We then selected the subset involving the 1000 most frequent nouns in the corpus for clustering, and randomly divided it into a training set of 756,721 pairs and a test set of 81,240 pairs.</Paragraph>
<Section position="1" start_page="188" end_page="189" type="sub_section">
<SectionTitle> Relative Entropy </SectionTitle>
<Paragraph position="0"> Figure 3 plots the unweighted average relative entropy, in bits, of several test sets to asymmetric clustered models of different sizes, given by $\frac{1}{|N|} \sum_{n \in N} D(t_n \| \hat{p}_n)$, where $N$ is the set of direct objects in the test set, $t_n$ is the relative frequency distribution of verbs taking $n$ as direct object in the test set, and $\hat{p}_n$ is the verb distribution the asymmetric model assigns to $n$. [Footnote 3: We use unweighted averages because we are interested here in how well the noun distributions are approximated by the cluster model. If we were interested in the total information loss of using the asymmetric model to encode a test corpus, we would instead use the weighted average $\sum_{n \in N} f_n D(t_n \| \hat{p}_n)$, where $f_n$ is the relative frequency of $n$ in the test set.] For each critical value of $\beta$, we show the relative entropy with respect to the asymmetric model of the training set (set train), of a randomly selected held-out test set (set test), and of held-out data for a further 1000 nouns that were not clustered (set new).</Paragraph>
<Paragraph position="1"> Unsurprisingly, the training set relative entropy decreases monotonically. The test set relative entropy decreases to a minimum at 206 clusters, and then starts increasing, suggesting that larger models are overtrained.</Paragraph>
<Paragraph position="2"> The new noun test set is intended to test whether clusters based on the 1000 most frequent nouns are useful classifiers for the selectional properties of nouns in general. Since the nouns in the test set pairs do not occur in the training set, we do not have the cluster membership probabilities needed in the asymmetric model. Instead, for each noun $n$ in the test set, we classify it with respect to the clusters by setting $p(c|n) = \exp(-\beta D(p_n \| c))/Z_n$, where $p_n$ is the empirical conditional verb distribution for $n$ given by the test set and $Z_n$ is the corresponding normalizing sum. These cluster membership estimates were then used in the asymmetric model and the test set relative entropy calculated as before. As the figure shows, the cluster model provides over one bit of information about the selectional properties of the new nouns, but the overtraining effect is even sharper than for the held-out data involving the 1000 clustered nouns.</Paragraph>
</Section>
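To make the two measurements above concrete, the following is a minimal sketch of the unweighted relative-entropy evaluation and of the soft classification of unseen nouns. It assumes the asymmetric model is given as cluster centroids (conditional verb distributions) and per-noun membership probabilities stored as plain dictionaries; all names (kl, model_dist, classify_new_noun) are illustrative, not the authors' implementation.

```python
import math

def kl(p, q, eps=1e-12):
    """Relative entropy D(p || q) in bits; p, q map verbs to probabilities."""
    return sum(pv * math.log2(pv / max(q.get(v, 0.0), eps))
               for v, pv in p.items() if pv > 0.0)

def model_dist(n, membership, centroids):
    """Asymmetric model's verb distribution for noun n:
    p_hat(v | n) = sum_c p(c | n) * p(v | c)."""
    p_hat = {}
    for c, w in membership[n].items():
        for v, qv in centroids[c].items():
            p_hat[v] = p_hat.get(v, 0.0) + w * qv
    return p_hat

def avg_relative_entropy(test_dists, membership, centroids):
    """Unweighted average (1/|N|) sum_n D(t_n || p_hat_n) over test nouns,
    where t_n is the test set's relative frequency distribution for n."""
    return sum(kl(t_n, model_dist(n, membership, centroids))
               for n, t_n in test_dists.items()) / len(test_dists)

def classify_new_noun(p_n, centroids, beta):
    """Soft membership for an unseen noun: p(c | n) ~ exp(-beta D(p_n || c))."""
    scores = {c: math.exp(-beta * kl(p_n, qc)) for c, qc in centroids.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}
```

Under these assumptions, the membership returned by classify_new_noun can be fed directly to model_dist, which is how the set new curve would be computed.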
<Section position="2" start_page="188" end_page="189" type="sub_section">
<SectionTitle> Decision Task </SectionTitle>
<Paragraph position="0"> We also evaluated asymmetric cluster models on a verb decision task closer to possible applications to disambiguation in language analysis. The task consists of judging which of two verbs v and v' is more likely to take a given noun n as object, when all occurrences of (v, n) in the training set have been deliberately deleted. Thus this test evaluates how well the models reconstruct missing data in the verb distribution for n from the cluster centroids close to n.</Paragraph>
<Paragraph position="1"> The data for this test was built from the training data for the previous one in the following way, based on a suggestion by Dagan et al. (1993). 104 noun-verb pairs with a fairly frequent verb (between 500 and 5000 occurrences) were randomly picked, and all occurrences of each pair in the training set were deleted. The resulting training set was used to build a sequence of cluster models as before. Each model was used to decide which of two verbs v and v' is more likely to appear with a noun n where the (v, n) data was deleted from the training set, and the decisions were compared with the corresponding ones derived from the original event frequencies in the initial data set. The error rate for each model is simply the proportion of disagreements for the selected (v, n, v') triples. Figure 4 shows the error rates for each model for all the selected (v, n, v') triples (all) and for just those exceptional triples in which the conditional ratio p(n, v)/p(n, v') is on the opposite side of 1 from the marginal ratio p(v)/p(v'). In other words, the exceptional cases are those in which predictions based just on the marginal frequencies, which the initial one-cluster model represents, would be consistently wrong.</Paragraph>
<Paragraph position="2"> Here too we see some overtraining for the largest models considered, although not for the exceptional verbs.</Paragraph>
</Section>
</Section>
</Paper>
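As an illustration of the decision-task scoring described in the Decision Task subsection, the sketch below compares each model's choice between v and v' with the choice implied by the original event frequencies, and reports error rates over all withheld triples and over the exceptional ones. The table layouts (orig_freq, marg_freq) and all names are hypothetical, chosen only to mirror the definitions in the text.

```python
def decision_error_rates(triples, p_hat, orig_freq, marg_freq):
    """triples: withheld (v, n, v2) test items; p_hat[n]: the model's verb
    distribution for noun n; orig_freq[(v, n)]: pair counts in the initial
    data set; marg_freq[v]: marginal verb counts."""
    errors, exc_errors, exc_total = 0, 0, 0
    for v, n, v2 in triples:
        # Model's decision: which verb is more likely to take n as object?
        model_choice = v if p_hat[n].get(v, 0.0) >= p_hat[n].get(v2, 0.0) else v2
        # Reference decision from the original (undeleted) frequencies.
        true_choice = v if orig_freq.get((v, n), 0) >= orig_freq.get((v2, n), 0) else v2
        wrong = model_choice != true_choice
        errors += wrong
        # Exceptional triple: conditional and marginal preferences disagree,
        # so the initial one-cluster (marginal) model would be consistently wrong.
        cond_prefers_v = orig_freq.get((v, n), 0) > orig_freq.get((v2, n), 0)
        marg_prefers_v = marg_freq[v] > marg_freq[v2]
        if cond_prefers_v != marg_prefers_v:
            exc_total += 1
            exc_errors += wrong
    return errors / len(triples), (exc_errors / exc_total if exc_total else 0.0)
```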