© 2002 Association for Computational Linguistics
Class-Based Probability Estimation Using
a Semantic Hierarchy
Stephen Clark∗ (University of Edinburgh)
David Weir† (University of Sussex)
This article concerns the estimation of a particular kind of probability, namely, the probability
of a noun sense appearing as a particular argument of a predicate. In order to overcome the
accompanying sparse-data problem, the proposal here is to define the probabilities in terms of
senses from a semantic hierarchy and exploit the fact that the senses can be grouped into classes
consisting of semantically similar senses. There is a particular focus on the problem of how
to determine a suitable class for a given sense, or, alternatively, how to determine a suitable
level of generalization in the hierarchy. A procedure is developed that uses a chi-square test to
determine a suitable level of generalization. In order to test the performance of the estimation
method, a pseudo-disambiguation task is used, together with two alternative estimation methods.
Each method uses a different generalization procedure; the first alternative uses the minimum
description length principle, and the second uses Resnik’s measure of selectional preference. In
addition, the performance of our method is investigated using both the standard Pearson chi-
square statistic and the log-likelihood chi-square statistic.
1. Introduction
This article concerns the problem of how to estimate the probabilities of noun senses
appearing as particular arguments of predicates. Such probabilities can be useful for a
variety of natural language processing (NLP) tasks, such as structural disambiguation
and statistical parsing, word sense disambiguation, anaphora resolution, and language
modeling. To see how such knowledge can be used to resolve structural ambiguities,
consider the following prepositional phrase attachment ambiguity:
Example 1
Fred ate strawberries with a spoon.
The ambiguity arises because the prepositional phrase with a spoon can attach to either
strawberries or ate. The ambiguity can be resolved by noting that the correct sense of
spoon is more likely to be an argument of “ate-with” than “strawberries-with” (Li and
Abe 1998; Clark and Weir 2000).
The problem with estimating a probability model defined over a large vocabulary
of predicates and noun senses is that this involves a huge number of parameters,
which results in a sparse-data problem. In order to reduce the number of parameters,
we propose to define a probability model over senses in a semantic hierarchy and
∗ Division of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh, EH8 9LW, UK. E-mail:
stephenc@cogsci.ed.ac.uk.
† School of Cognitive and Computing Sciences, University of Sussex, Brighton, BN1 9QH, UK. E-mail:
david.weir@cogs.susx.ac.uk.
Computational Linguistics Volume 28, Number 2
to exploit the fact that senses can be grouped into classes consisting of semantically
similar senses. The assumption underlying this approach is that the probability of a
particular noun sense can be approximated by a probability based on a suitably chosen
class. For example, it seems reasonable to suppose that the probability of (the food
sense of) chicken appearing as an object of the verb eat can be approximated in some
way by a probability based on a class such as FOOD.
There are two elements involved in the problem of using a class to estimate the
probability of a noun sense. First, given a suitably chosen class, how can that class
be used to estimate the probability of the sense? And second, given a particular noun
sense, how can a suitable class be determined? This article offers novel solutions to
both problems, and there is a particular focus on the second question, which can be
thought of as how to find a suitable level of generalization in the hierarchy.¹
The semantic hierarchy used here is the noun hierarchy of WordNet (Fellbaum
1998), version 1.6. Previous work has considered how to estimate probabilities using
classes from WordNet in the context of acquiring selectional preferences (Resnik
1998; Ribas 1995; Li and Abe 1998; McCarthy 2000), and this previous work has also
addressed the question of how to determine a suitable level of generalization in the
hierarchy. Li and Abe use the minimum description length principle to obtain a level
of generalization, and Resnik uses a simple technique based on a statistical measure
of selectional preference. (The work by Ribas builds on that by Resnik, and the work
by McCarthy builds on that by Li and Abe.) We compare our estimation method with
those of Resnik and Li and Abe, using a pseudo-disambiguation task. Our method
outperforms these alternatives on the pseudo-disambiguation task, and an analysis of
the results shows that the generalization methods of Resnik and Li and Abe appear
to be overgeneralizing, at least for this task.
Note that the problem being addressed here is the engineering problem of estimating
predicate argument probabilities, with the aim of producing estimates that
will be useful for NLP applications. In particular, we are not addressing the problem
of acquiring selectional restrictions in the way this is usually construed (Resnik 1993;
Ribas 1995; McCarthy 1997; Li and Abe 1998; Wagner 2000). The purpose of using a
semantic hierarchy for generalization is to overcome the sparse data problem, rather
than find a level of abstraction that best represents the selectional restrictions of some
predicate. This point is considered further in Section 5.
The next section describes the noun hierarchy from WordNet and gives a more
precise description of the probabilities to be estimated. Section 3 shows how a class
from WordNet can be used to estimate the probability of a noun sense. Section 4 shows
how a chi-square test is used as part of the generalization procedure, and Section 5
describes the generalization procedure. Section 6 describes the alternative class-based
estimation methods used in the pseudo-disambiguation experiments, and Section 7
presents those experiments.
2. The Semantic Hierarchy
The noun hierarchy of WordNet consists of senses, or what Miller (1998) calls lexicalized
concepts, organized according to the “is-a-kind-of” relation. Note that we are using
concept to refer to a lexicalized concept or sense and not to a set of senses; we use class to
refer to a set of senses. There are around 66,000 different concepts in the noun hierarchy
1 A third element of the problem, namely, how to obtain arguments of predicates as training data, is not
considered here. We assume the existence of such data, obtained from a treebank or shallow parser.
Clark and Weir Class-Based Probability Estimation
of WordNet version 1.6. A concept in WordNet is represented by a “synset,” which is
the set of synonymous words that can be used to denote that concept. For example,
the synset for the concept 〈cocaine〉² is {cocaine, cocain, coke, snow, C}. Let syn(c) be the
synset for concept c, and let cn(n) = {c | n ∈ syn(c)} be the set of concepts that can be
denoted by noun n.
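The two mappings just defined can be sketched in a few lines of Python. This is only an illustration: the concept identifiers and the second synset are hypothetical and are not taken from WordNet itself.

```python
# Toy synsets keyed by concept. The identifiers and the second entry are
# illustrative assumptions, not actual WordNet data.
SYNSETS = {
    "<cocaine>": {"cocaine", "cocain", "coke", "snow", "C"},
    "<snow_weather>": {"snow", "snowfall"},  # hypothetical second sense of "snow"
}

def syn(c):
    """Return the synset (set of nouns) for concept c."""
    return SYNSETS[c]

def cn(n):
    """Return the set of concepts that noun n can denote: {c | n in syn(c)}."""
    return {c for c, words in SYNSETS.items() if n in words}
```

Under these toy data, cn("snow") contains both concepts, whereas cn("cocain") contains only the first.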
The hierarchy has the structure of a directed acyclic graph (although only around
1% of the nodes have more than one parent), where the edges of the graph constitute
what we call the “direct–isa” relation. Let isa be the transitive, reflexive closure of
direct–isa; then c′ isa c implies c′ is a kind of c. If c′ isa c, then c is a hypernym of c′ and
c′ is a hyponym of c. In fact, the hierarchy is not a single hierarchy but instead consists of
nine separate subhierarchies, each headed by the most general kind of concept, such as
〈entity〉, 〈abstraction〉, 〈event〉, and 〈psychological feature〉. For the purposes of this work
we add a common root dominating the nine subhierarchies, which we denote 〈root〉.
There are some important points that need to be clarified regarding the hierarchy.
First, every concept in the hierarchy has a nonempty synset (except the notional concept
〈root〉). Even the most general concepts, such as 〈entity〉, can be denoted by some
noun; the synset for 〈entity〉 is {entity, something}. Second, there is an important distinction
between an individual concept and a set of concepts. For example, the individual
concept 〈entity〉 should not be confused with the set or class consisting of concepts
denoting kinds of entities. To make this distinction clear, we use $\overline{c} = \{c' \mid c' \text{ isa } c\}$
to denote the set of concepts dominated by concept c, including c itself. For example,
$\overline{\langle\text{animal}\rangle}$ is the set consisting of those concepts corresponding to kinds of animals
(including 〈animal〉 itself).
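As a rough illustration of the distinction between a concept and the set of concepts it dominates, the following Python sketch computes that set over a small, hypothetical fragment of a hierarchy. The fragment and the function names are assumptions made for the example.

```python
# Hypothetical fragment of a hierarchy: concept -> list of direct parents.
DIRECT_ISA = {
    "<dog>": ["<canine>"],
    "<wolf>": ["<canine>"],
    "<canine>": ["<animal>"],
    "<cat>": ["<animal>"],
    "<animal>": ["<entity>"],
    "<entity>": [],
}

def isa(c_prime, c):
    """True if c_prime isa c, using the reflexive, transitive closure of direct-isa."""
    if c_prime == c:
        return True
    return any(isa(parent, c) for parent in DIRECT_ISA[c_prime])

def dominated(c):
    """Return the set of concepts dominated by c, including c itself."""
    return {c_prime for c_prime in DIRECT_ISA if isa(c_prime, c)}
```

Here dominated("<canine>") is the class containing the concepts for canine, dog, and wolf, while "<canine>" alone names just the individual concept.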
The probability of a concept appearing as an argument of a predicate is written p(c | v, r),
where c is a concept in WordNet, v is a predicate, and r is an argument position.³
The focus in this article is on the arguments of verbs, but the techniques discussed
can be applied to any predicate that takes nominal arguments, such as adjectives. The
probability p(c | v, r) is to be interpreted as follows: This is the probability that some
noun n in syn(c), when denoting concept c, appears in position r of verb v (given
v and r). The example used throughout the article is p(〈dog〉 | run, subj), which is
the conditional probability that some noun in the synset of 〈dog〉, when denoting the
concept 〈dog〉, appears in the subject position of the verb run. Note that, in practice,
no distinction is made between the different senses of a verb (although the techniques
do allow such a distinction) and that each use of a noun is assumed to correspond to
exactly one concept.⁴
3. Class-Based Probability Estimation
This section explains how a set of concepts, or class, from WordNet can be used to
estimate the probability of an individual concept. More specifically, we explain how
a set of concepts $\overline{c'}$, where $c'$ is some hypernym of concept c, can be used to estimate
p(c | v, r). (Recall that $\overline{c'}$ denotes the set of concepts dominated by $c'$, including $c'$ itself.)
One possible approach would be simply to substitute $\overline{c'}$ for the individual concept c.
This is a poor solution, however, since $p(\overline{c'} \mid v, r)$ is the conditional probability that
2 Angled brackets are used to denote concepts in the hierarchy.
3 The term predicate is used loosely here, in that the predicate does not have to be a semantic object but
can simply be a word form.
4 A recent paper that extends the acquisition of selectional preferences to sense-sense relationships is
Agirre and Martinez (2001).
some noun denoting a concept in $\overline{c'}$ appears in position r of verb v. For example,
$p(\overline{\langle\text{animal}\rangle} \mid \text{run}, \text{subj})$ is the probability that some noun denoting a kind of animal
appears in the subject position of the verb run. Probabilities of sets of concepts are
obtained by summing over the concepts in the set:
$$p(\overline{c'} \mid v, r) = \sum_{c'' \in \overline{c'}} p(c'' \mid v, r) \tag{1}$$
This means that $p(\overline{\langle\text{animal}\rangle} \mid \text{run}, \text{subj})$ is likely to be much greater than
p(〈dog〉 | run, subj) and thus is not a good approximation of p(〈dog〉 | run, subj).
What can be done, though, is to condition on sets of concepts. If it can be shown
that $p(v \mid \overline{c'}, r)$, for some hypernym $c'$ of c, is a reasonable approximation of $p(v \mid c, r)$,
then we have a way of estimating p(c | v, r). The two probabilities p(c | v, r) and
p(v | c, r) are related by Bayes’ theorem:
$$p(c \mid v, r) = p(v \mid c, r)\,\frac{p(c \mid r)}{p(v \mid r)} \tag{2}$$
Since p(c | r) and p(v | r) are conditioned on the argument slot only, we assume
these can be estimated satisfactorily using relative frequency estimates. Alternatively,
a standard smoothing technique such as Good-Turing could be used.⁵ This leaves
p(v | c, r). Continuing with the 〈dog〉 example, the proposal is to estimate p(run | 〈dog〉, subj)
using a relative-frequency estimate of $p(\text{run} \mid \overline{\langle\text{animal}\rangle}, \text{subj})$ or an estimate based on a
similar, suitably chosen class. Thus, assuming this choice of class, p(〈dog〉 | run, subj)
would be approximated as follows:
$$p(\langle\text{dog}\rangle \mid \text{run}, \text{subj}) \approx p(\text{run} \mid \overline{\langle\text{animal}\rangle}, \text{subj})\,\frac{p(\langle\text{dog}\rangle \mid \text{subj})}{p(\text{run} \mid \text{subj})} \tag{3}$$
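The approximation in (3) can be illustrated with a minimal Python sketch using relative-frequency estimates. The counts, the class membership, and the function names below are all hypothetical; they serve only to show how the three factors combine.

```python
# Hypothetical sense-level counts f(c, v) for a single argument slot (subj).
COUNTS = {
    ("<dog>", "run"): 4.0, ("<dog>", "bark"): 6.0,
    ("<cat>", "run"): 3.0, ("<cat>", "sleep"): 7.0,
    ("<idea>", "run"): 1.0, ("<idea>", "spread"): 19.0,
}
ANIMAL_CLASS = {"<dog>", "<cat>"}  # assumed set of concepts dominated by <animal>

def p_c(c):
    """Relative-frequency estimate of p(c | r)."""
    total = sum(COUNTS.values())
    return sum(f for (c2, _), f in COUNTS.items() if c2 == c) / total

def p_v(v):
    """Relative-frequency estimate of p(v | r)."""
    total = sum(COUNTS.values())
    return sum(f for (_, v2), f in COUNTS.items() if v2 == v) / total

def p_v_given_class(v, cls):
    """Relative-frequency estimate of p(v | class, r) for a set of concepts."""
    in_class = sum(f for (c, _), f in COUNTS.items() if c in cls)
    with_v = sum(f for (c, v2), f in COUNTS.items() if c in cls and v2 == v)
    return with_v / in_class

def class_based_estimate(c, v, cls):
    """Approximate p(c | v, r) as p(v | class, r) * p(c | r) / p(v | r)."""
    return p_v_given_class(v, cls) * p_c(c) / p_v(v)
```

With these toy counts, conditioning on the animal class gives 0.35 × 0.25 / 0.2 = 0.4375 for the dog concept with run, whereas using the singleton class containing only the dog concept recovers the plain relative frequency 4/8 = 0.5.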
The following derivation shows that if $p(v \mid \overline{c'_i}, r) = k$ for each child $c'_i$ of $c'$, and
$p(v \mid c', r) = k$, then $p(v \mid \overline{c'}, r)$ is also equal to k:
$$
\begin{aligned}
p(v \mid \overline{c'}, r) &= p(\overline{c'} \mid v, r)\,\frac{p(v \mid r)}{p(\overline{c'} \mid r)} && (4)\\
&= \frac{p(v \mid r)}{p(\overline{c'} \mid r)} \left( \sum_i p(\overline{c'_i} \mid v, r) + p(c' \mid v, r) \right) && (5)\\
&= \frac{p(v \mid r)}{p(\overline{c'} \mid r)} \left( \sum_i p(v \mid \overline{c'_i}, r)\,\frac{p(\overline{c'_i} \mid r)}{p(v \mid r)} + p(v \mid c', r)\,\frac{p(c' \mid r)}{p(v \mid r)} \right) && (6)\\
&= \frac{1}{p(\overline{c'} \mid r)} \left( \sum_i k\,p(\overline{c'_i} \mid r) + k\,p(c' \mid r) \right) && (7)\\
&= \frac{k}{p(\overline{c'} \mid r)} \left( \sum_i p(\overline{c'_i} \mid r) + p(c' \mid r) \right) && (8)\\
&= k && (9)
\end{aligned}
$$
5 Unsmoothed estimates were used in this work.
Note that the proof applies only to a tree, since the proof assumes that $\overline{c'}$ is partitioned
by $c'$ and the sets of concepts dominated by each of the daughters of $c'$, which is not
necessarily true for a directed acyclic graph (DAG). WordNet is a DAG but is a close
approximation to a tree, and so we assume this will not be a problem in practice.⁶
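The result of the derivation in (4)–(9) can be checked numerically under the tree assumption. In the sketch below, the counts are hypothetical and each pair stands for one block of the partition: the parent concept itself and the sets dominated by its two children.

```python
def p_v_given_set(parts):
    """Relative-frequency estimate of p(v | set, r), where `parts` lists
    (f(part, v, r), f(part, r)) pairs for a partition of the set."""
    with_v = sum(f_v for f_v, _ in parts)
    total = sum(f for _, f in parts)
    return with_v / total

# Hypothetical counts: every part has the same ratio k = 0.2.
parts = [(2.0, 10.0), (6.0, 30.0), (12.0, 60.0)]
k_parent = p_v_given_set(parts)
```

Because every part has the same ratio, the estimate for the whole set is also 0.2; if the parts overlapped, as they can in a DAG, the simple sums above would double-count and the equality would no longer be guaranteed.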
The derivation in (4)–(9) shows how probabilities conditioned on sets of concepts
can remain constant when moving up the hierarchy, and this suggests a way of finding
a suitable set, $\overline{c'}$, as a generalization for concept c: Initially set $c'$ equal to c and move
up the hierarchy, changing the value of $c'$, until there is a significant change in
$p(v \mid \overline{c'}, r)$. Estimates of $p(v \mid \overline{c'_i}, r)$, for each child $c'_i$ of $c'$, can be compared to see whether
$p(v \mid \overline{c'}, r)$ has significantly changed. (We ignore the probability $p(v \mid c', r)$ and consider
the probabilities $p(v \mid \overline{c'_i}, r)$ only.) Note that this procedure rests on the assumption that
$p(v \mid \overline{c}, r)$ is close to $p(v \mid c, r)$. (In fact, $p(v \mid \overline{c}, r)$ is equal to $p(v \mid c, r)$ when c is a leaf
node.) So when finding a suitable level for the estimation of p(〈sandwich〉 | eat, obj),
for example, we first assume that $p(\text{eat} \mid \overline{\langle\text{sandwich}\rangle}, \text{obj})$ is a good approximation of
p(eat | 〈sandwich〉, obj) and then apply the procedure to $p(\text{eat} \mid \overline{\langle\text{sandwich}\rangle}, \text{obj})$.
A feature of the proposed generalization procedure is that comparing probabilities
of the form p(v | C, r), where C is a class, is closely related to comparing ratios of
probabilities of the form p(C | v, r)/p(C | r) (for a given verb and argument position):
$$p(v \mid C, r) = \frac{p(C \mid v, r)}{p(C \mid r)}\,p(v \mid r) \tag{10}$$
Note that, for a given verb and argument position, p(v | r) is constant across classes.
Equation (10) is of interest because the ratio p(C | v, r)/p(C | r) can be interpreted as a
measure of association between the verb v and class C. This ratio is similar to pointwise
mutual information (Church and Hanks 1990) and also forms part of Resnik’s
association score, which will be introduced in Section 6. Thus the generalization procedure
can be thought of as one that finds “homogeneous” areas of the hierarchy,
that is, areas consisting of classes that are associated to a similar degree with the verb
(Clark and Weir 1999).
Finally, we note that the proposed estimation method does not guarantee that the
estimates form a probability distribution over the concepts in the hierarchy, and so a
normalization factor is required:
$$p_{sc}(c \mid v, r) = \frac{\hat{p}(v \mid [c, v, r], r)\,\dfrac{\hat{p}(c \mid r)}{\hat{p}(v \mid r)}}{\displaystyle\sum_{c' \in C} \hat{p}(v \mid [c', v, r], r)\,\dfrac{\hat{p}(c' \mid r)}{\hat{p}(v \mid r)}} \tag{11}$$
We use $p_{sc}$ to denote an estimate obtained using our method (since the technique
finds sets of semantically similar senses, or “similarity classes”) and [c, v, r] to denote
the class chosen for concept c in position r of verb v; $\hat{p}$ denotes a relative frequency
estimate, and C denotes the set of concepts in the hierarchy.
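The normalization step can be sketched as follows. The unnormalized scores are hypothetical stand-ins for the numerator of the estimate, computed for every concept in a tiny three-concept hierarchy; the function name is an assumption.

```python
def normalize(scores):
    """Rescale unnormalized class-based scores so that they sum to one
    over all concepts in the hierarchy."""
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Hypothetical unnormalized values of p(v | [c, v, r], r) * p(c | r) / p(v | r).
raw = {"<dog>": 0.30, "<cat>": 0.15, "<idea>": 0.05}
p_sc = normalize(raw)
```

Normalization preserves the ratios between concepts; it only rescales the scores so that they form a distribution.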
6 Li and Abe (1998) also develop a theoretical framework that applies only to a tree and turn WordNet
into a tree by copying each subgraph with multiple parents. One way to extend the experiments in
Section 7 would be to investigate whether this transformation has an impact on the results of those
experiments.
Before providing the details of the generalization procedure, we give the relative-frequency
estimates of the relevant probabilities and deal with the problem of ambiguous
data. The relative-frequency estimates are as follows:
$$\hat{p}(c \mid r) = \frac{f(c, r)}{f(r)} = \frac{\sum_{v' \in V} f(c, v', r)}{\sum_{v' \in V} \sum_{c' \in C} f(c', v', r)} \tag{12}$$
$$\hat{p}(v \mid r) = \frac{f(v, r)}{f(r)} = \frac{\sum_{c' \in C} f(c', v, r)}{\sum_{v' \in V} \sum_{c' \in C} f(c', v', r)} \tag{13}$$
$$\hat{p}(v \mid \overline{c'}, r) = \frac{f(\overline{c'}, v, r)}{f(\overline{c'}, r)} = \frac{\sum_{c'' \in \overline{c'}} f(c'', v, r)}{\sum_{v' \in V} \sum_{c'' \in \overline{c'}} f(c'', v', r)} \tag{14}$$
where f(c, v, r) is the number of (n, v, r) triples in the data in which n is being used to
denote c, and V is the set of verbs in the data. The problem is that the estimates are
defined in terms of frequencies of senses, whereas the data are assumed to be in the
form of (n, v, r) triples: a noun, verb, and argument position. All the data used in this
work have been obtained from the British National Corpus (BNC), using the system
of Briscoe and Carroll (1997), which consists of a shallow-parsing component that is
able to identify verbal arguments.
We take a simple approach to the problem of estimating the frequencies of senses,
by distributing the count for each noun in the data evenly among all senses of the
noun:
$$\hat{f}(c, v, r) = \sum_{n \in \text{syn}(c)} \frac{f(n, v, r)}{|\text{cn}(n)|} \tag{15}$$
where $\hat{f}(c, v, r)$ is an estimate of the number of times that concept c appears in position
r of verb v, and |cn(n)| is the cardinality of cn(n). This is the approach taken by
Li and Abe (1998), Ribas (1995), and McCarthy (2000).⁷ Resnik (1998) explains how
this apparently crude technique works surprisingly well. Alternative approaches are
described in Clark and Weir (1999) (see also Clark [2001]), Abney and Light (1999),
and Ciaramita and Johnson (2000).
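The even distribution of noun counts among senses can be sketched in Python as follows. The sense inventory and the triple counts are hypothetical, and the function name is an assumption; the sketch simply splits each (n, v, r) count equally over cn(n).

```python
from collections import defaultdict

# Hypothetical sense inventory: noun -> set of concepts it can denote, i.e. cn(n).
CN = {
    "dog": {"<dog>", "<cad>"},
    "hound": {"<dog>"},
}

def sense_counts(triple_counts, cn=CN):
    """Estimate f(c, v, r) by splitting each noun's count evenly over its senses."""
    est = defaultdict(float)
    for (n, v, r), f in triple_counts.items():
        senses = cn[n]
        for c in senses:
            est[(c, v, r)] += f / len(senses)
    return dict(est)

# Hypothetical (n, v, r) counts.
triples = {("dog", "run", "subj"): 3.0, ("hound", "run", "subj"): 2.0}
estimated = sense_counts(triples)
```

Here the dog concept receives 1.5 from the ambiguous noun dog plus 2.0 from the unambiguous noun hound, which is how fractional counts arise. The total mass of the counts is unchanged.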
4. Using a Chi-Square Test to Compare Probabilities
In this section we show how to test whether $p(v \mid \overline{c'}, r)$ changes significantly when
considering a node higher in the hierarchy. Consider the problem of deciding whether
$p(\text{run} \mid \overline{\langle\text{canine}\rangle}, \text{subj})$ is a good approximation of $p(\text{run} \mid \overline{\langle\text{dog}\rangle}, \text{subj})$. (〈canine〉 is the
parent of 〈dog〉 in WordNet.) To do this, the probabilities $p(\text{run} \mid \overline{c_i}, \text{subj})$ are compared
using a chi-square test, where the $c_i$ are the children of 〈canine〉. In this case, the null
hypothesis of the test is that the probabilities $p(\text{run} \mid \overline{c_i}, \text{subj})$ are the same for each
child $c_i$. By judging the strength of the evidence against the null hypothesis, we can
determine how similar the true probabilities are likely to be. If the test indicates
that the probabilities are sufficiently unlikely to be the same, then the null hypothesis
is rejected, and the conclusion is that $p(\text{run} \mid \overline{\langle\text{canine}\rangle}, \text{subj})$ is not a good approximation
of $p(\text{run} \mid \overline{\langle\text{dog}\rangle}, \text{subj})$.
An example contingency table, based on counts obtained from a subset of the BNC
using the system of Briscoe and Carroll, is given in Table 1. (Recall that the frequencies
are estimated by distributing the count for a noun equally among the noun’s senses;
this explains the fractional counts.) One column contains estimates of counts arising
7 Resnik takes a similar approach but divides the count evenly among the noun’s senses and all the
hypernyms of those senses.
Table 1
Contingency table for the children of 〈canine〉 in the subject position of run.
$c_i$          $\hat{f}(\overline{c_i}, \text{run}, \text{subj})$    $\hat{f}(\overline{c_i}, \text{subj}) - \hat{f}(\overline{c_i}, \text{run}, \text{subj})$    $\hat{f}(\overline{c_i}, \text{subj}) = \sum_{v \in V} \hat{f}(\overline{c_i}, v, \text{subj})$
〈bitch〉        0.3 (0.5)       26.7 (26.6)       27.0
〈dog〉          12.8 (10.5)     620.4 (622.7)     633.2
〈wolf〉         0.3 (0.6)       38.7 (38.4)       39.0
〈jackal〉       0.0 (0.3)       20.0 (19.7)       20.0
〈wild dog〉     0.0 (0.0)       3.0 (3.0)         3.0
〈hyena〉        0.0 (0.2)       10.0 (9.8)        10.0
〈fox〉          0.0 (1.2)       72.3 (71.1)       72.3
Total          13.4            791.1             804.5
from concepts in $\overline{c_i}$ appearing in the subject position of the verb run: $\hat{f}(\overline{c_i}, \text{run}, \text{subj})$. A
second column presents estimates of counts arising from concepts in $\overline{c_i}$ appearing in
the subject position of a verb other than run. The figures in brackets are the expected
values if the null hypothesis is true.
There is a choice of which statistic to use in conjunction with the chi-square test.
The usual statistic encountered in textbooks is the Pearson chi-square statistic, denoted X²:
$$X^2 = \sum_{i,j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \tag{16}$$
where $o_{ij}$ is the observed value for the cell in row i and column j, and $e_{ij}$ is the
corresponding expected value. An alternative statistic is the log-likelihood chi-square
statistic, denoted G²:⁸
$$G^2 = 2 \sum_{i,j} o_{ij} \log_e \frac{o_{ij}}{e_{ij}} \tag{17}$$
The two statistics have similar values when the counts in the contingency table are
large (Agresti 1996). The statistics behave differently, however, when the table contains
low counts, and, since corpus data are likely to lead to some low counts, the question
of which statistic to use is an important one. Dunning (1993) argues for the use of G²
rather than X², based on an analysis of the sampling distributions of G² and X², and
results obtained when using the statistics to acquire highly associated bigrams. We
consider Dunning’s analysis at the end of this section, and the question of whether to
use G² or X² will be discussed further there. For now, we continue with the discussion
of how the chi-square test is used in the generalization procedure.
For Table 1, the value of G² is 3.8, and the value of X² is 2.5. Assuming a level of
significance of α = 0.05, the critical value is 12.6 (for six degrees of freedom). Thus,
for this α value, the null hypothesis would not be rejected for either statistic, and the
conclusion would be that there is no reason to suppose that $p(\text{run} \mid \overline{\langle\text{canine}\rangle}, \text{subj})$ is
not a reasonable approximation of $p(\text{run} \mid \overline{\langle\text{dog}\rangle}, \text{subj})$.
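The test just described can be reproduced approximately in Python from the observed counts in Table 1. This is a sketch, not the authors' implementation: the function names are assumptions, the critical value is simply the one quoted in the text, and because the published counts are rounded to one decimal place the computed statistics need not match the reported values exactly.

```python
import math

# Observed counts from Table 1: (count with run, count with other verbs)
# for the set dominated by each child of <canine>.
TABLE_1 = {
    "bitch": (0.3, 26.7), "dog": (12.8, 620.4), "wolf": (0.3, 38.7),
    "jackal": (0.0, 20.0), "wild dog": (0.0, 3.0), "hyena": (0.0, 10.0),
    "fox": (0.0, 72.3),
}

def expected(rows):
    """Expected cell values under the null hypothesis: row total * column total / N."""
    col = [sum(r[j] for r in rows) for j in range(2)]
    n = sum(col)
    return [[sum(r) * col[j] / n for j in range(2)] for r in rows]

def pearson_x2(rows):
    """Pearson chi-square statistic, summed over all cells."""
    exp = expected(rows)
    return sum((o - e) ** 2 / e
               for r, er in zip(rows, exp) for o, e in zip(r, er) if e > 0)

def log_likelihood_g2(rows):
    """Log-likelihood chi-square statistic; cells with an observed count of
    zero contribute nothing (0 * log 0 is taken as 0)."""
    exp = expected(rows)
    return 2 * sum(o * math.log(o / e)
                   for r, er in zip(rows, exp) for o, e in zip(r, er) if o > 0)

rows = list(TABLE_1.values())
CRITICAL_005_DF6 = 12.6  # critical value quoted in the text (alpha = 0.05, 6 d.f.)
x2, g2 = pearson_x2(rows), log_likelihood_g2(rows)
reject_null = x2 > CRITICAL_005_DF6 or g2 > CRITICAL_005_DF6
```

With these inputs both statistics fall well below the critical value, so the null hypothesis is not rejected, in line with the conclusion in the text.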
8 An alternative formula for G² is given in Dunning (1993), but the two are equivalent.
Table 2
Contingency table for the children of 〈liquid〉 in the object position of drink.
$c_i$            $\hat{f}(\overline{c_i}, \text{drink}, \text{obj})$    $\hat{f}(\overline{c_i}, \text{obj}) - \hat{f}(\overline{c_i}, \text{drink}, \text{obj})$    $\hat{f}(\overline{c_i}, \text{obj}) = \sum_{v \in V} \hat{f}(\overline{c_i}, v, \text{obj})$
〈beverage〉       261.0 (238.7)    2,367.7 (2,390.0)    2,628.7
〈supernatant〉    0.0 (0.1)        1.0 (0.9)            1.0
〈alcohol〉        11.5 (9.4)       92.0 (94.1)          103.5
〈ammonia〉        0.0 (0.8)        8.5 (7.7)            8.5
〈antifreeze〉     0.0 (0.1)        1.0 (0.9)            1.0
〈distillate〉     0.0 (0.5)        6.0 (5.5)            6.0
〈water〉          12.0 (31.6)      335.7 (316.1)        347.7
〈ink〉            0.0 (2.9)        32.0 (29.1)          32.0
〈liquor〉         0.7 (1.1)        11.6 (11.2)          12.3
Total            285.2            2,855.5              3,140.7
As a further example, Table 2 gives counts for the children of 〈liquid〉 in the object
position of drink. Again, the counts have been obtained from a subset of the BNC
using the system of Briscoe and Carroll. Not all the sets dominated by the children of
〈liquid〉 are shown, as some, such as 〈sheep dip〉, never appear in the object position
of a verb in the data. This example is designed to show a case in which the null
hypothesis is rejected. The value of G² for this table is 29.0, and the value of X² is
21.2. So for G², even if an α value as low as 0.0005 were being used (for which the
critical value is 27.9 for eight degrees of freedom), the null hypothesis would still be
rejected. For X², the null hypothesis is rejected for α values greater than 0.005. This
seems reasonable, since the probabilities associated with the children of 〈liquid〉 and
the object position of drink would be expected to show a lot of variation across the
children.
A key question is how to select the appropriate value for α. One solution is to
treat α as a parameter and set it empirically by taking a held-out test set and choosing
the value of α that maximizes performance on the relevant task. For example, Clark
and Weir (2000) describe a prepositional phrase attachment algorithm that employs
probability estimates obtained using the WordNet method described here. To set the
value of α, the performance of the algorithm on a development set could be compared
across different values of α, and the value that leads to the best performance
could be chosen. Note that this approach sets no constraints on the value of α: The
value could be as high as 0.995 or as low as 0.0005, depending on the particular
application.
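Treating α as a tunable parameter amounts to a simple search over candidate values. The sketch below assumes a hypothetical evaluate callable that maps an α value to a development-set score; the scores shown are made up purely for illustration and are not results from any experiment.

```python
def choose_alpha(candidate_alphas, evaluate):
    """Return the candidate alpha with the highest development-set score.

    `evaluate` is a hypothetical callable mapping an alpha value to a task
    score (for example, accuracy on a held-out development set).
    """
    return max(candidate_alphas, key=evaluate)

# Made-up development-set accuracies, for illustration only.
toy_scores = {0.0005: 0.71, 0.05: 0.74, 0.5: 0.73, 0.995: 0.70}
best_alpha = choose_alpha(toy_scores, toy_scores.get)
```

In practice evaluate would rerun the generalization procedure and the downstream task for each candidate value, which is why the set of candidates is usually kept small.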
There may be cases in which the conditions for the appropriate application of a chi-
square test are not met. One condition that is likely to be violated is the requirement
that expected values in the contingency table not be too small. (A rule of thumb
often found in textbooks is that the expected values should be greater than five.) One
response to this problem is to apply some kind of thresholding and either ignore
counts below the threshold, or apply the test only to tables that do not contain low
counts. Ribas (1995), Li and Abe (1998), McCarthy (2000), and Wagner (2000) all use
some kind of thresholding when dealing with counts in the hierarchy (although not in
the context of a chi-square test). Another approach would be to use Fisher’s exact test
(Agresti 1996; Pedersen 1996), which can be applied to tables regardless of the size of
the counts they contain. The main problem with this test is that it is computationally
expensive, especially for large contingency tables.
What we have found in practice is that applying the chi-square test to tables dominated
by low counts tends to produce an insignificant result, and the null hypothesis
is not rejected. The consequence of this for the generalization procedure is that
low-count tables tend to result in the procedure moving up to the next node in the
hierarchy. But given that the purpose of the generalization is to overcome the sparse-
data problem, moving up a node is desirable, and therefore we do not modify the test
for tables with low counts.
The final issue to consider is which chi-square statistic to use. Dunning (1993)
argues for the use of G² rather than X², based on the claim that the sampling distribution
of G² approaches the true chi-square distribution more quickly than the sampling
distribution of X². However, Agresti (1996, page 34) makes the opposite claim: “The
sampling distributions of X² and G² get closer to chi-squared as the sample size n
increases. . . . The convergence is quicker for X² than G².”
In addition, Pedersen (2001) questions whether one statistic should be preferred
over the other for the bigram acquisition task and cites Cressie and Read (1984), who
argue that there are some cases where the Pearson statistic is more reliable than the
log-likelihood statistic. Finally, the results of the pseudo-disambiguation experiments
presented in Section 7 are at least as good, if not better, when using X² rather than G²,
and so we conclude that the question of which statistic to use should be answered on
a per-application basis.
5. The Generalization Procedure
The procedure for finding a suitable class, $\overline{c'}$, to generalize concept c in position r
of verb v works as follows. (We refer to $\overline{c'}$ as the “similarity class” of c with respect
to v and r and the hypernym $c'$ as top(c, v, r), since the chosen hypernym sits at the
“top” of the similarity class.) Initially, concept c is assigned to a variable top. Then,
by working up the hierarchy, successive hypernyms of c are assigned to top, and this
process continues until the probabilities associated with the sets of concepts dominated
by top and the siblings of top are significantly different. Once a node is reached that
yields a significant result for the chi-square test, the procedure stops, and top is
returned as top(c, v, r). In cases where a concept has more than one parent, the parent
is chosen that results in the lowest value of the chi-square statistic, as this indicates
the probabilities are the most similar. The set $\overline{\text{top}(c, v, r)}$ is the similarity class of c for
verb v and position r. Figure 1 gives an algorithm for determining top(c, v, r).
Figure 2 gives an example of the procedure at work. Here, top(〈soup〉, stir, obj) is
being determined. The example is based on data from a subset of the BNC, with 303
cases of an argument in the object position of stir. The G² statistic is used, together with
an α value of 0.05. Initially, top is set to 〈soup〉, and the probabilities corresponding
to the children of 〈dish〉 are compared: $p(\text{stir} \mid \overline{\langle\text{soup}\rangle}, \text{obj})$, $p(\text{stir} \mid \overline{\langle\text{lasagne}\rangle}, \text{obj})$,
$p(\text{stir} \mid \overline{\langle\text{haggis}\rangle}, \text{obj})$, and so on for the rest of the children. The chi-square test results in a G²
value of 14.5, compared to a critical value of 55.8. Since G² is less than the critical value,
the procedure moves up to the next node. This process continues until a significant
result is obtained, which first occurs at 〈substance〉 when comparing the children of
〈object〉. Thus 〈substance〉 is the chosen level of generalization.
Now we show how the chosen level of generalization varies with α and how it
varies with the size of the data set. A note of clarification is required before presenting
the results. In related work on acquiring selectional preferences (Ribas 1995; McCarthy
Algorithm top(c, v, r):
    top ← c
    sig_result ← false
    comment: parent_min gives lowest G² value, G²_min
    while not sig_result & top ≠ 〈root〉 do
        G²_min ← ∞
        for all parents of top do
            calculate G² for sets dominated by children of parent
            if G² < G²_min
                then G²_min ← G²
                     parent_min ← parent
        end
        if chi-square test for parent_min is significant
            then sig_result ← true
            else move up to next node: top ← parent_min
    end
    return top
Figure 1
An algorithm for determining top(c, v, r).
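The algorithm in Figure 1 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the authors' code: the hierarchy and counts are hypothetical, rows with no counts are dropped from the contingency table, the degrees of freedom are taken as the number of remaining rows minus one, and the chi-square critical value is obtained with the Wilson-Hilferty approximation (a substitute chosen to keep the sketch free of external dependencies; a statistics library could be used instead).

```python
import math
from statistics import NormalDist

def g2(rows):
    """Log-likelihood chi-square statistic for (with verb, without verb) rows."""
    col = [sum(r[j] for r in rows) for j in range(2)]
    n = sum(col)
    total = 0.0
    for r in rows:
        for j in range(2):
            if r[j] > 0:  # zero cells contribute nothing
                total += r[j] * math.log(r[j] / (sum(r) * col[j] / n))
    return 2 * total

def critical_value(alpha, df):
    """Wilson-Hilferty approximation to the chi-square critical value
    (an assumption of this sketch, not part of the original method)."""
    z = NormalDist().inv_cdf(1 - alpha)
    return df * (1 - 2 / (9 * df) + z * math.sqrt(2 / (9 * df))) ** 3

def top(c, parents, children, class_counts, alpha=0.05, root="<root>"):
    """Move up from c until the children of the chosen parent differ significantly.

    parents[x], children[x]: direct parents / children of concept x.
    class_counts[x]: (count with the verb, count with other verbs) for the
    set of concepts dominated by x.
    """
    current = c
    while current != root:
        best_parent, best_stat, best_df = None, math.inf, 0
        for parent in parents[current]:
            rows = [class_counts[ch] for ch in children[parent]
                    if sum(class_counts[ch]) > 0]
            stat = g2(rows) if len(rows) > 1 else 0.0
            if stat < best_stat:  # keep the parent with the lowest statistic
                best_parent, best_stat, best_df = parent, stat, len(rows) - 1
        if best_parent is None:
            break
        if best_df > 0 and best_stat > critical_value(alpha, best_df):
            break  # significant result: stop at the current node
        current = best_parent
    return current

# Hypothetical hierarchy and counts, for illustration only.
PARENTS = {"<soup>": ["<dish>"], "<stew>": ["<dish>"], "<dish>": ["<food>"],
           "<fruit>": ["<food>"], "<food>": ["<root>"], "<tool>": ["<root>"]}
CHILDREN = {"<dish>": ["<soup>", "<stew>"], "<food>": ["<dish>", "<fruit>"],
            "<root>": ["<food>", "<tool>"]}
CLASS_COUNTS = {"<soup>": (10.0, 90.0), "<stew>": (11.0, 89.0),
                "<dish>": (21.0, 179.0), "<fruit>": (2.0, 198.0),
                "<food>": (23.0, 377.0), "<tool>": (1.0, 299.0)}
chosen = top("<soup>", PARENTS, CHILDREN, CLASS_COUNTS)
```

With these toy counts the soup and stew classes behave alike, so the procedure moves up to the dish concept; the dish and fruit classes then differ significantly at α = 0.05, so the dish concept is returned. A much smaller α makes the test harder to pass and lets the procedure climb further.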
[Figure 2: a fragment of the WordNet noun hierarchy above 〈soup〉. Node labels recoverable from the figure: haggis, lasagne, dish, nourishment, food, fare, beverage, course, meal, substance, object, fluid, poison, artifact, ground, entity, soup. Test results annotated in the figure, in the order they appear: G²: 14.5, critical value: 55.8; G²: 5.4, crit val: 16.9; G²: 5.5, crit val: 16.9; G²: 29.9, crit val: 58.1; G²: 141.1, crit val: 37.7.]
Figure 2
An example generalization: Determining top(〈soup〉, stir, obj).
1997; Li and Abe 1998; Wagner 2000), the level of generalization is often determined for
a small number of hand-picked verbs and the result compared with the researcher’s
intuition about the most appropriate level for representing a selectional preference.
According to this approach, if 〈sandwich〉 were chosen to represent 〈hotdog〉 in the
object position of eat, this might be considered an undergeneralization, since 〈food〉
might be considered more appropriate. For this work we argue that such an evaluation
is not appropriate; since the purpose of this work is probability estimation, the most
appropriate level is the one that leads to the most accurate estimate, and this may or
may not agree with intuition. Furthermore, we show in Section 7 that to generalize
unnecessarily can be harmful for some tasks: If we already have lots of data regarding
〈sandwich〉, why generalize any higher? Thus the purpose of this section is not to show
that the acquired levels are “correct,” but simply to show how the levels vary with α
and the sample size.
To show how the level of generalization varies with changes in α, top(c, v, obj)
was determined for a number of hand-picked (c, v, obj) triples over a range of values
for α. The triples were chosen to give a range of strongly and weakly selecting verbs
and a range of verb frequencies. The data were again extracted from a subset of the
BNC using the system of Briscoe and Carroll (1997), and the G² statistic was used in
the chi-square test. The results are shown in Table 3. The number of times the verb
occurred with some object is also given in the table.
The results suggest that the generalization level becomes more specific as α increases.
This is to be expected, since, given a contingency table chosen at random, a
higher value of α is more likely to lead to a significant result than a lower value of α.
We also see that, for some cases, the value of α has little effect on the level. We would
expect there to be less change in the level of generalization for strongly selecting verbs,
such as drink and eat, and a greater range of levels for weakly selecting verbs such
as see. This is because any significant difference in probabilities is likely to be more
marked for a strongly selecting verb, and likely to be significant over a wider range
of α values. The table provides only anecdotal evidence, but it lends some support
to this argument.
To investigate more generally how the level of generalization varies with changes
in α, and also with changes in sample size, we took 6,000 (c, v, obj) triples and calculated
the difference in depth between c and top(c, v, r) for each triple. The 6,000 triples
were taken from the first experimental test set described in Section 7, and the training
data from this experiment were used to provide the counts. (The test set contains
nouns, rather than noun senses, and so the sense of the noun that is most probable
given the verb and object slot was used.) An average difference in depth was then
calculated. To give an example of how the difference in depth was calculated, suppose
〈dog〉 generalized to 〈placental mammal〉 via 〈canine〉 and 〈carnivore〉; in this case
the difference would be three.
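The depth calculation can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the hypernym chain is assumed to be given as a list running from the concept upward.

```python
def generalized_levels(chain, selected):
    """Number of hypernym steps from the concept chain[0] up to `selected`."""
    return chain.index(selected)


def average_levels(cases):
    """Average difference in depth over (chain, selected) test cases."""
    return sum(generalized_levels(chain, sel) for chain, sel in cases) / len(cases)


# The example from the text: <dog> generalizes to <placental mammal>
# via <canine> and <carnivore>.
dog_chain = ["dog", "canine", "carnivore", "placental mammal"]
print(generalized_levels(dog_chain, "placental mammal"))  # 3
```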
The results for various levels of α and different sample sizes are shown in Table 4.
The figures in each column arise from using the contingency tables based on the
complete training data, but with each count in the table multiplied by the percentage
at the head of the column. Thus the 50% column is based on contingency tables in
which each original count is multiplied by 50%, which is equivalent to using a sample
one-half the size of the original training set. Reading across a row shows how the
generalization varies with sample size, and reading down a column shows how it
varies with α. The results show clearly that the extent of generalization decreases
with an increase in the value of α, supporting the trend observed in Table 3. The
results also show that the extent of generalization increases with a decrease in sample
size. Again, this is to be expected, since any difference in probability estimates is less
likely to be significant for tables with low counts.

Table 3
Example levels of generalization for different values of α.

(c, v, r), f(v, r)              α       Hypernym chain
(〈coffee〉, drink, obj)          0.0005  〈coffee〉 〈BEVERAGE〉 〈food〉 ... 〈object〉 〈entity〉
f(drink, obj) = 849             0.05    〈coffee〉 〈BEVERAGE〉 〈food〉 ... 〈object〉 〈entity〉
                                0.5     〈coffee〉 〈BEVERAGE〉 〈food〉 ... 〈object〉 〈entity〉
                                0.995   〈coffee〉 〈BEVERAGE〉 〈food〉 ... 〈object〉 〈entity〉
(〈hotdog〉, eat, obj)            0.0005  〈hotdog〉 〈sandwich〉 〈snack food〉 〈DISH〉 ... 〈food〉 ... 〈entity〉
f(eat, obj) = 1,703             0.05    〈hotdog〉 〈sandwich〉 〈snack food〉 〈DISH〉 ... 〈food〉 ... 〈entity〉
                                0.5     〈hotdog〉 〈sandwich〉 〈snack food〉 〈DISH〉 ... 〈food〉 ... 〈entity〉
                                0.995   〈hotdog〉 〈SANDWICH〉 〈snack food〉 〈dish〉 ... 〈food〉 ... 〈entity〉
(〈Socrates〉, kiss, obj)         0.0005  〈Socrates〉 ... 〈person〉 〈life form〉 〈CAUSAL AGENT〉 〈entity〉
f(kiss, obj) = 345              0.05    〈Socrates〉 ... 〈person〉 〈life form〉 〈CAUSAL AGENT〉 〈entity〉
                                0.5     〈Socrates〉 ... 〈person〉 〈life form〉 〈CAUSAL AGENT〉 〈entity〉
                                0.995   〈Socrates〉 ... 〈PERSON〉 〈life form〉 〈causal agent〉 〈entity〉
(〈dream〉, remember, obj)        0.0005  〈dream〉 ... 〈preoccupation〉 〈cognitive state〉 〈STATE〉
f(remember, obj) = 1,982        0.05    〈dream〉 ... 〈preoccupation〉 〈cognitive state〉 〈STATE〉
                                0.5     〈dream〉 ... 〈preoccupation〉 〈COGNITIVE STATE〉 〈state〉
                                0.995   〈dream〉 ... 〈PREOCCUPATION〉 〈cognitive state〉 〈state〉
(〈man〉, see, obj)               0.0005  〈man〉 ... 〈mammal〉 ... 〈ANIMAL〉 〈life form〉 〈entity〉
f(see, obj) = 16,757            0.05    〈man〉 ... 〈MAMMAL〉 ... 〈animal〉 〈life form〉 〈entity〉
                                0.5     〈man〉 ... 〈MAMMAL〉 ... 〈animal〉 〈life form〉 〈entity〉
                                0.995   〈MAN〉 ... 〈mammal〉 ... 〈animal〉 〈life form〉 〈entity〉
(〈belief〉, abandon, obj)        0.0005  〈belief〉 〈mental object〉 〈cognition〉 〈PSYCHOLOGICAL FEATURE〉
f(abandon, obj) = 673           0.05    〈belief〉 〈MENTAL OBJECT〉 〈cognition〉 〈psychological feature〉
                                0.5     〈BELIEF〉 〈mental object〉 〈cognition〉 〈psychological feature〉
                                0.995   〈BELIEF〉 〈mental object〉 〈cognition〉 〈psychological feature〉
(〈nightmare〉, have, obj)        0.0005  〈nightmare〉 〈dreaming〉 〈IMAGINATION〉 ... 〈psychological feature〉
f(have, obj) = 93,683           0.05    〈nightmare〉 〈dreaming〉 〈IMAGINATION〉 ... 〈psychological feature〉
                                0.5     〈nightmare〉 〈DREAMING〉 〈imagination〉 ... 〈psychological feature〉
                                0.995   〈nightmare〉 〈DREAMING〉 〈imagination〉 ... 〈psychological feature〉

Note: The selected level is shown in upper case.

Table 4
Extent of generalization for different values of α and sample sizes.

α        100%   50%    10%    1%
0.0005   3.3    3.9    5.0    5.6
0.05     2.8    3.5    4.6    5.6
0.5      2.1    2.9    4.1    5.4
0.995    1.2    1.5    2.6    3.9
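The effect of sample size can be checked numerically. The sketch below assumes the usual definition of the log-likelihood statistic, G² = 2 Σ o ln(o/e), and uses an invented 2×2 table. Multiplying every count by a factor, as is done for the columns of Table 4, multiplies G² by the same factor while the degrees of freedom (and hence the critical value) stay fixed, so a smaller sample is less likely to yield a significant result.

```python
import math


def g2(table):
    """Log-likelihood chi-square statistic for a two-way contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    return 2.0 * total


def scale(table, factor):
    """Multiply every count by `factor` (e.g. 0.5 for the 50% column)."""
    return [[count * factor for count in row] for row in table]


table = [[30.0, 10.0], [20.0, 40.0]]  # made-up counts, for illustration only
full = g2(table)
for factor in (1.0, 0.5, 0.1, 0.01):
    print(factor, g2(scale(table, factor)) / full)  # ratio tracks the factor
```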
6. Alternative Class-Based Estimation Methods
The approaches used for comparison are that of Resnik (1993, 1998), subsequently
developed by Ribas (1995), and that of Li and Abe (1998), which has been adopted by
McCarthy (2000). These have been chosen because they directly address the question
of how to find a suitable level of generalization in WordNet.
The first alternative uses the “association score,” which is a measure of how well
a set of concepts, C, satisfies the selectional preferences of a verb, v, for an argument
position, r:⁹

\[ A(C, v, r) = p(C \mid v, r) \, \log_2 \frac{p(C \mid v, r)}{p(C \mid r)} \qquad (18) \]

An estimate of the association score, Â(C, v, r), can be obtained using relative frequency
estimates of the probabilities. The key question is how to determine a suitable level of
generalization for concept c, or, alternatively, how to find a suitable class to represent
concept c (assuming the choice is from those classes that contain all concepts dominated
by some hypernym of c). Resnik’s solution to this problem (which he neatly
refers to as the “vertical-ambiguity” problem) is to choose the class that maximizes
the association score.
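As a rough illustration of this selection rule, the sketch below estimates the association score of equation (18) from relative frequencies and picks the maximizing class. All counts and class names are invented, and the function name is hypothetical.

```python
import math


def association_score(count_c_v_r, count_v_r, count_c_r, count_r):
    """Relative-frequency estimate of A(C, v, r) from equation (18)."""
    p_c_given_vr = count_c_v_r / count_v_r  # estimate of p(C | v, r)
    p_c_given_r = count_c_r / count_r       # estimate of p(C | r)
    if p_c_given_vr == 0.0:
        return 0.0  # convention: 0 * log 0 = 0
    return p_c_given_vr * math.log2(p_c_given_vr / p_c_given_r)


# Made-up counts: (count of C with v in slot r, count of C in slot r).
COUNT_V_R = 100    # f(v, r)
COUNT_R = 10_000   # all tokens in slot r
CLASS_COUNTS = {"beverage": (60, 200), "food": (80, 800), "entity": (95, 6000)}

scores = {
    name: association_score(c_vr, COUNT_V_R, c_r, COUNT_R)
    for name, (c_vr, c_r) in CLASS_COUNTS.items()
}
best_class = max(scores, key=scores.get)  # class chosen to represent the concept
print(best_class)
```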
It is not clear that the class with the highest association score is always the most
appropriate level of generalization. For example, this approach does not always generalize
appropriately for arguments that are negatively associated with some verb. To
see why, consider the problem of deciding how well the concept 〈location〉 satisfies the
preferences of the verb eat for its object. Since locations are not the kinds of things that
are typically eaten, a suitable level of generalization would correspond to a class that
has a low association score with respect to eat. However, 〈location〉 is a kind of 〈entity〉
in WordNet,¹⁰ and choosing the class with the highest association score is likely to
produce 〈entity〉 as the chosen class. This is a problem, because the association score
of 〈entity〉 with respect to eat may be too high to reflect the fact that 〈location〉 is a very
unlikely object of the verb.
Note that the solution to the vertical-ambiguity problem presented in the previous
sections is able to generalize appropriately in such cases. Continuing with the eat
〈location〉 example, our generalization procedure is unlikely to get as high as 〈entity〉
(assuming a reasonable number of examples of eat in the training data), since the
probabilities corresponding to the daughters of 〈entity〉 are likely to be very different
with respect to the object position of eat.
The second alternative uses the minimum description length (MDL) principle.
Li and Abe use MDL to select a set of classes from a hierarchy, together with their
associated probabilities, to represent the selectional preferences of a particular verb.
The preferences and class-based probabilities are then used to estimate probabilities
of the form p(n | v, r), where n is a noun, v is a verb, and r is an argument slot.
Li and Abe’s application of MDL requires the hierarchy to be in the form of a
thesaurus, in which each leaf node represents a noun and internal nodes represent the
class of nouns that the node dominates. The hierarchy is also assumed to be in the
form of a tree. The class-based models consist of a partition of the set of nouns (leaf
nodes) and a probability associated with each class in the partition. The probabilities
are the conditional probabilities of each class, given the relevant verb and argument
position. Li and Abe refer to such a partition as a “cut” and the cut together with the
probabilities as a “tree cut model.” The probabilities of the classes in a cut, Γ, satisfy
the following constraint:
\[ \sum_{C \in \Gamma} p(C \mid v, r) = 1 \qquad (19) \]
9 The definition used here is that given by Ribas (1995).
10 For example, the hypernyms of the concept 〈Dallas〉 are as follows: 〈city〉, 〈municipality〉,
〈urban area〉, 〈geographical area〉, 〈region〉, 〈location〉, 〈object〉, 〈entity〉.
[Figure 3: tree diagram showing part of the WordNet hierarchy, with the nodes 〈root〉, 〈entity〉, 〈abstraction〉, 〈set〉, 〈time〉, 〈space〉, 〈life form〉, 〈object〉, 〈plant〉, 〈animal〉, 〈substance〉, 〈artifact〉, 〈solid〉, 〈fluid〉, 〈food〉, 〈mushroom〉, 〈lobster〉, 〈rope〉, 〈pizza〉.]
Figure 3
Possible cut returned by MDL.
In order to determine the probability of a noun, the probability of a class is assumed
to be distributed uniformly among the members of that class:
\[ p(n \mid v, r) = \frac{1}{|C|} \, p(C \mid v, r) \quad \text{for all } n \in C \qquad (20) \]
Since WordNet is a hierarchy with noun senses, rather than nouns, at the nodes,
Li and Abe deal with the issue of word sense ambiguity using the method described
in Section 3, by dividing the count for a noun equally among the concepts whose
synsets contain the noun. Also, since WordNet is a DAG, Li and Abe turn WordNet
into a tree by copying each subgraph with multiple parents. Finally, to ensure that each
noun in the data appears (in a synset) at a leaf node, Li and Abe remove those parts of the
hierarchy dominated by a noun in the data (but only for that instance of WordNet
corresponding to the relevant verb).
An example cut showing part of the WordNet hierarchy is shown in Figure 3 (based
on an example from Li and Abe [1998]; the dashed lines indicate parts of the hierarchy
that are not shown in the diagram). This is a possible cut for the object position of the
verb eat, and the cut consists of the following classes: 〈life form〉, 〈solid〉, 〈fluid〉, 〈food〉,
〈artifact〉, 〈space〉, 〈time〉, 〈set〉. (The particular choice of classes for the cut in this example
is not too important; the example is designed to show how probabilities of senses are
estimated from class probabilities.) Since the class in the cut containing 〈pizza〉 is 〈food〉,
the probability p(〈pizza〉|eat, obj) would be estimated as p(〈food〉|eat, obj)/|〈food〉|.
Similarly, since the class in the cut containing 〈mushroom〉 is 〈life form〉, the probability
p(〈mushroom〉|eat, obj) would be estimated as p(〈life form〉|eat, obj)/|〈life form〉|.
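The estimate in equation (20) amounts to a lookup followed by a division. The sketch below uses a toy cut with invented class probabilities and memberships; the names `noun_probability`, `toy_cut`, and `toy_class_prob` are hypothetical.

```python
def noun_probability(noun, cut, class_prob):
    """p(n | v, r) under a tree cut model: the class probability is shared
    uniformly among the leaf nouns of the class (equation 20)."""
    for cls, members in cut.items():
        if noun in members:
            return class_prob[cls] / len(members)
    raise KeyError(f"{noun!r} is not covered by the cut")


# Toy cut: each class lists the leaf nouns it dominates (invented memberships).
toy_cut = {
    "food": ["pizza", "lobster (food)"],
    "life form": ["mushroom", "lobster (animal)"],
    "artifact": ["rope"],
}
# Invented values of p(C | eat, obj); they sum to 1, as equation (19) requires.
toy_class_prob = {"food": 0.8, "life form": 0.15, "artifact": 0.05}

print(noun_probability("pizza", toy_cut, toy_class_prob))
```

Because every noun sits in exactly one class of the cut, the noun probabilities again sum to one.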
The uniform-distribution assumption (20) means that cuts close to the root of the
hierarchy result in a greater smoothing of the probability estimates than cuts near the
leaves. Thus there is a trade-off between choosing a model that has a cut near the
leaves, which is likely to overfit the data, and a more general (simple) model near the
root, which is likely to underfit the data. MDL looks ideally suited to the task of model
selection, since it is designed to deal with precisely this trade-off. The simplicity of a
model is measured using the model description length, which is an information-theoretic
term and denotes the number of bits required to encode the model. The fit to the data
is measured using the data description length, which is the number of bits required to
encode the data (relative to the model). The overall description length is the sum of
the model description length and the data description length, and the MDL principle
is to select the model with the shortest description length.
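The article does not give the code lengths themselves, so the following is only a sketch under stated assumptions: a standard two-part code in which the model costs (k/2) log₂ N bits for k free parameters and N observations, and the data cost −Σ log₂ p̂(n) bits. The counts and the two candidate cuts are invented, and `description_length` is a hypothetical helper, not the implementation used in the experiments.

```python
import math


def description_length(cut, counts):
    """Total description length (in bits) of a cut, under the assumed coding.

    cut:    dict mapping class name -> list of leaf nouns it dominates
    counts: dict mapping noun -> observed frequency with the verb and slot
    """
    n_total = sum(counts.values())
    k = len(cut) - 1  # free parameters: class probabilities sum to 1
    model_dl = 0.5 * k * math.log2(n_total)
    data_dl = 0.0
    for members in cut.values():
        class_count = sum(counts.get(noun, 0) for noun in members)
        if class_count == 0:
            continue  # no observations to encode for this class
        # Relative-frequency class probability, spread uniformly over members.
        p_noun = (class_count / n_total) / len(members)
        data_dl -= class_count * math.log2(p_noun)
    return model_dl + data_dl


counts = {"pizza": 6, "lobster": 2, "rope": 0, "knife": 0}  # made-up data
coarse = {"entity": ["pizza", "lobster", "rope", "knife"]}
fine = {"food": ["pizza", "lobster"], "artifact": ["rope", "knife"]}

best = min((coarse, fine), key=lambda cut: description_length(cut, counts))
print(description_length(coarse, counts), description_length(fine, counts))
```

In this toy case the finer cut pays a small model cost but encodes the data far more cheaply, so it has the shorter overall description length.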
We used McCarthy’s (2000) implementation of MDL. So that every noun is represented
at a leaf node, McCarthy does not remove parts of the hierarchy, as Li and Abe
do, but instead creates new leaf nodes for each synset at an internal node. McCarthy
also does not transform WordNet into a tree, which is strictly required for Li and
Abe’s application of MDL. This did create a problem with overgeneralization: Many
of the cuts returned by MDL were overgeneralizing at the 〈entity〉 node. The reason
is that 〈person〉, which is close to 〈entity〉 and dominated by 〈entity〉, has two parents:
〈life form〉 and 〈causal agent〉. This DAG-like property was responsible for the overgeneralization,
and so we removed the link between 〈person〉 and 〈causal agent〉. This
appeared to solve the problem, and the results presented later for the average degree
of generalization do not show an overgeneralization compared with those given in Li
and Abe (1998).
7. Pseudo-Disambiguation Experiments
The task we used to compare the class-based estimation techniques is a decision task
previously used by Pereira, Tishby, and Lee (1993) and Rooth et al. (1999). The task is
to decide which of two verbs, v and v′, is more likely to take a given noun, n, as an
object. The test and training data were obtained as follows. A number of verb–direct
object pairs were extracted from a subset of the BNC, using the system of Briscoe and
Carroll. All those pairs containing a noun not in WordNet were removed, and each
verb and argument was lemmatized. This resulted in a data set of around 1.3 million
(v, n) pairs.
To form a test set, 3,000 of these pairs were randomly selected such that each
selected pair contained a fairly frequent verb. (Following Pereira, Tishby, and Lee, only
those verbs that occurred between 500 and 5,000 times in the data were considered.)
Each instance of a selected pair was then deleted from the data to ensure that the test
data were unseen. The remaining pairs formed the training data. To complete the test
set, a further fairly frequent verb, v′, was randomly chosen for each (v, n) pair. The
random choice was made according to the verb’s frequency in the original data set,
subject to the condition that the pair (v′, n) did not occur in the training data. Given
the set of (v, n, v′) triples, the task is to decide whether (v, n) or (v′, n) is the correct
pair.¹¹
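The construction of the test set can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name and parameters are hypothetical, test pairs are sampled as distinct pair types, and the confounding verb is additionally required to differ from v, which the description above does not state explicitly.

```python
import random
from collections import Counter


def build_test_set(pairs, n_test, min_freq, max_freq, rng):
    """Return ((v, n, v_prime) triples, training pairs) from (verb, noun) pairs."""
    verb_freq = Counter(v for v, _ in pairs)
    frequent = [v for v, f in verb_freq.items() if min_freq <= f <= max_freq]
    frequent_set = set(frequent)
    candidates = sorted({p for p in pairs if p[0] in frequent_set})
    test_pairs = rng.sample(candidates, min(n_test, len(candidates)))
    held_out = set(test_pairs)
    # Delete every instance of a selected pair so that the test data are unseen.
    training = [p for p in pairs if p not in held_out]
    seen = set(training)
    triples = []
    for v, n in test_pairs:
        # Confounders: frequent verbs never seen with n in the training data.
        options = [u for u in frequent if u != v and (u, n) not in seen]
        if not options:
            continue  # assumption: skip pairs with no admissible confounder
        weights = [verb_freq[u] for u in options]  # choose by verb frequency
        v_prime = rng.choices(options, weights=weights, k=1)[0]
        triples.append((v, n, v_prime))
    return triples, training
```

With the settings described in the text, a call would look like `build_test_set(pairs, 3000, 500, 5000, random.Random(0))`, where `pairs` is the full list of extracted (v, n) tokens.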
We acknowledge that the task is somewhat artificial, but pseudo-disambiguation
tasks of this kind are becoming popular in statistical NLP because of the ease with
which training and test data can be created. We also feel that the pseudo-disambiguation
task is useful for evaluating the different estimation methods, since it directly
addresses the question of how likely a particular predicate is to take a given noun as
an argument. An evaluation using a PP attachment task was attempted in Clark and
Weir (2000), but the evaluation was limited by the relatively small size of the Penn
Treebank.
11 We note that this procedure does not guarantee that the correct pair is more likely than the incorrect
pair, because of noise in the data from the parser and also because a highly plausible incorrect pair
could be generated by chance.
Table 5
Results for the pseudo-disambiguation task.

Generalization technique   % correct   av.gen.   sd.gen.
Similarity class
  α = 0.0005               73.8        3.3       2.0
  α = 0.05                 73.4        2.8       1.9
  α = 0.3                  73.0        2.4       1.8
  α = 0.75                 73.9        1.9       1.6
  α = 0.995                73.8        1.2       1.2
Low class                  73.6        0.9       1.0
MDL                        68.3        4.1       1.9
Assoc                      63.9        4.2       2.1

Note: av.gen. is the average number of generalized levels;
sd.gen. is the standard deviation.
Using our approach, the disambiguation decision for each (v, n, v′) triple was made
according to the following procedure:

    if $\max_{c \in cn(n)} p_{sc}(c \mid v, obj) > \max_{c \in cn(n)} p_{sc}(c \mid v', obj)$
        then choose (v, n)
    else if $\max_{c \in cn(n)} p_{sc}(c \mid v', obj) > \max_{c \in cn(n)} p_{sc}(c \mid v, obj)$
        then choose (v′, n)
    else choose at random

If n has more than one sense, the sense is chosen that maximizes the relevant probability
estimate; this explains the maximization over cn(n). The probability estimates
were obtained using our class-based method, and the G² statistic was used for the
chi-square test. This procedure was also used for the MDL alternative, but using the
MDL method to estimate the probabilities.
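The decision rule translates directly into code. In the sketch below the probability estimator is passed in as a function, so the same rule can serve the similarity-class and MDL estimates alike; `choose_verb`, `toy_senses`, and `toy_probs` are hypothetical names, and the probabilities are invented.

```python
import random


def choose_verb(noun, v, v_prime, senses, prob, rng=random):
    """Pick the verb whose best sense-level estimate for `noun` is higher.

    senses(noun)   -> iterable of senses, i.e. cn(n)
    prob(c, verb)  -> estimate of p(c | verb, obj)
    """
    score_v = max(prob(c, v) for c in senses(noun))
    score_v_prime = max(prob(c, v_prime) for c in senses(noun))
    if score_v > score_v_prime:
        return v
    if score_v_prime > score_v:
        return v_prime
    return rng.choice([v, v_prime])  # tie: choose at random


# Invented senses and probability estimates, for illustration only.
toy_senses = {"coffee": ["coffee (beverage)", "coffee (tree)"]}
toy_probs = {
    ("coffee (beverage)", "drink"): 0.02,
    ("coffee (tree)", "drink"): 0.0001,
    ("coffee (beverage)", "kiss"): 0.0002,
    ("coffee (tree)", "kiss"): 0.0001,
}

decision = choose_verb(
    "coffee", "drink", "kiss",
    senses=lambda n: toy_senses[n],
    prob=lambda c, verb: toy_probs[(c, verb)],
)
print(decision)
```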
Using the association score for each test triple, the decision was made according
to the following procedure:

    if $\max_{c \in cn(n)} \max_{c' \in h(c)} \hat{A}(c', v, obj) > \max_{c \in cn(n)} \max_{c' \in h(c)} \hat{A}(c', v', obj)$
        then choose (v, n)
    else if $\max_{c \in cn(n)} \max_{c' \in h(c)} \hat{A}(c', v', obj) > \max_{c \in cn(n)} \max_{c' \in h(c)} \hat{A}(c', v, obj)$
        then choose (v′, n)
    else choose at random

We use h(c) to denote the set consisting of the hypernyms of c. The inner maximization
is over h(c), assuming c is the chosen sense of n, which corresponds to Resnik’s method
of choosing a set to represent c. The outer maximization is over the senses of n, cn(n),
which determines the sense of n by choosing the sense that maximizes the association
score.
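The nested maximization can be written as a single generator expression. The sketch below is illustrative only: the association scores are invented, the helper names are hypothetical, and whether h(c) contains c itself is left to whatever the supplied `hypernyms` function returns.

```python
def best_association(noun, verb, senses, hypernyms, assoc):
    """Maximum association score over the senses of `noun` (outer max)
    and over the hypernyms of each sense (inner max)."""
    return max(
        assoc(c_prime, verb)
        for c in senses(noun)
        for c_prime in hypernyms(c)
    )


# Invented hierarchy fragment and scores, for illustration only.
toy_senses = {"soup": ["soup (dish)"]}
toy_hypernyms = {"soup (dish)": ["dish", "food", "entity"]}
toy_assoc = {
    ("dish", "stir"): 0.9, ("food", "stir"): 1.4, ("entity", "stir"): 0.2,
    ("dish", "kiss"): 0.0, ("food", "kiss"): 0.1, ("entity", "kiss"): 0.3,
}


def score(verb):
    return best_association(
        "soup", verb,
        senses=lambda n: toy_senses[n],
        hypernyms=lambda c: toy_hypernyms[c],
        assoc=lambda cls, v: toy_assoc[(cls, v)],
    )


# The verb with the higher maximum wins; an exact tie would be broken at random.
chosen = "stir" if score("stir") > score("kiss") else "kiss"
print(chosen)
```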
The first set of results is given in Table 5. Our technique is referred to as the
“similarity class” technique, and the approach using the association score is referred
to as “Assoc.” The results are given for a range of α values and demonstrate clearly that
the performance of similarity class varies little with changes in α and that similarity
class outperforms both MDL and Assoc.¹²

Table 6
Results for the pseudo-disambiguation task with one-fifth training data.

Generalization technique   % correct   av.gen.   sd.gen.
Similarity class
  α = 0.0005               66.7        4.5       1.9
  α = 0.05                 68.4        4.1       1.9
  α = 0.3                  70.2        3.7       1.9
  α = 0.75                 72.3        3.0       1.9
  α = 0.995                72.4        1.9       1.6
Low class                  71.9        1.1       1.1
MDL                        62.9        4.7       1.9
Assoc                      62.6        4.1       2.0

Note: av.gen. is the average number of generalized levels;
sd.gen. is the standard deviation.

We also give a score for our approach using a simple generalization procedure,
which we call “low class.” The procedure is to select the first class that has a count
greater than zero (relative to the verb and argument position), which is likely to return
a low level of generalization, on the whole. The results show that our generalization
technique only narrowly outperforms the simple alternative. Note that, although low
class is based on a very simple generalization method, the estimation method is still
using our class-based technique, by applying Bayes’ theorem and conditioning on a
class, as described in Section 3; the difference is in how the class is chosen.
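The low-class procedure is simple enough to state as a short loop. In this sketch the hypernym chain is assumed to be listed from the concept upward, the counts are invented, and falling back to the top of the chain when every count is zero is an assumption made here, not something the description above specifies.

```python
def low_class(chain, count):
    """Return the first class on the hypernym chain with a nonzero count.

    chain:      classes from the concept itself up to the root
    count(cls): frequency of the class with the verb and argument position
    """
    for cls in chain:
        if count(cls) > 0:
            return cls
    return chain[-1]  # assumption: fall back to the top of the chain


# Invented class counts for the object position of a verb.
toy_counts = {"hotdog": 0, "sandwich": 3, "snack food": 5,
              "dish": 40, "food": 120, "entity": 1703}
chain = ["hotdog", "sandwich", "snack food", "dish", "food", "entity"]

selected = low_class(chain, lambda cls: toy_counts.get(cls, 0))
print(selected)
```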
To investigate the results, we calculated the average number of generalized levels
for each approach. The number of generalized levels for a concept c (relative to a
verb v and argument position r) is the difference in depth between c and top(c, v, r),
as explained in Section 5. For each test case, the number of generalized levels for
both verbs, v and v′, was calculated, but only for the chosen sense of n. The results
are given in the third column of Table 5 and demonstrate clearly that both MDL and
Assoc are generalizing to a greater extent than similarity class. (The fourth column
gives a standard deviation figure.) These results suggest that MDL and Assoc are
overgeneralizing, at least for the purposes of this task.
To investigate why the value for α had so little impact on the results, we repeated the
experiment, but with one fifth of the data. A new data set was created by taking every
fifth pair of the original 1.3 million pairs. A test set of 3,000 triples was created from
this new data set, as before, but this time only verbs that occurred between 100 and
1,000 times were considered. The results using these test and training data are given
in Table 6.
These results show a variation in performance across values for α, with an optimal
performance when α is around 0.75. (Of course, in practice, the value for α would
need to be optimized on a held-out set.) But even with this variation, similarity class is
still outperforming MDL and Assoc across the whole range of α values. Note that the
α values corresponding to the lowest scores lead to a significant amount of generalization,
which provides additional evidence that MDL and Assoc are overgeneralizing
for this task. The low-class method scores highly for this data set also, but given that
the task is one that apparently favors a low level of generalization, the high score is
not too surprising.

12 The results given for similarity class are different from those given in Clark and Weir (2001) because
the probability estimates used in Clark and Weir (2001) were not normalized.

Table 7
Disambiguation results for G² and X².

α value   % correct (G²)   % correct (X²)
0.0005    73.8 (3.3)       74.1 (3.0)
0.05      73.4 (2.8)       73.8 (2.5)
0.3       73.0 (2.4)       74.1 (2.2)
0.75      73.9 (1.9)       74.3 (1.8)
0.995     73.8 (1.2)       73.3 (1.2)

As a final experiment, we compared the task performance using the X², rather than
G², statistic in the chi-square test. The results are given in Table 7 for the complete
data set.¹³ The figures in brackets give the average number of generalized levels.
The X² statistic is performing at least as well as G², and the results show that the
average level of generalization is slightly higher for G² than X². This suggests a possible
explanation for the results presented here and those in Dunning (1993): that the X²
statistic provides a less conservative test when counts in the contingency table are
low. (By a conservative test we mean one in which the null hypothesis is not easily
rejected.) A less conservative test is better suited to the pseudo-disambiguation task,
since it results in a lower level of generalization, on the whole, which is good for this
task. In contrast, the task that Dunning considers, the discovery of bigrams, is better
served by a more conservative test.
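For readers who want to compare the two statistics on their own tables, the sketch below computes both from the standard definitions, X² = Σ (o − e)² / e and G² = 2 Σ o ln(o/e), with expected counts taken from the row and column totals. The tables are invented, and the function names are hypothetical.

```python
import math


def expected_counts(table):
    """Expected counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(table):
    """Pearson chi-square statistic X^2."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for row, exp_row in zip(table, expected)
        for o, e in zip(row, exp_row)
    )


def log_likelihood_g2(table):
    """Log-likelihood chi-square statistic G^2 (zero cells contribute nothing)."""
    expected = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for row, exp_row in zip(table, expected)
        for o, e in zip(row, exp_row)
        if o > 0
    )


balanced = [[10, 20], [20, 10]]  # made-up table with moderate counts
sparse = [[3, 1], [1, 95]]       # made-up table with some low counts
for name, table in (("balanced", balanced), ("sparse", sparse)):
    print(name, pearson_x2(table), log_likelihood_g2(table))
```

Both statistics are compared against the same chi-square critical value, so whichever is larger for a given table is the one more likely to reject the null hypothesis there.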
8. Conclusion
We have presented a class-based estimation method that incorporates a procedure for
finding a suitable level of generalization in WordNet. This method has been shown to
provide superior performance on a pseudo-disambiguation task, compared with two
alternative approaches. An analysis of the results has shown that the other approaches
appear to be overgeneralizing, at least for this task. One of the features of the generalization
procedure is the way that α, the level of significance in the chi-square test,
is treated as a parameter. This allows some control over the extent of generalization,
which can be tailored to particular tasks. We have also shown that the task performance
is at least as good when using the Pearson chi-square statistic as when using
the log-likelihood chi-square statistic.
There are a number of ways in which this work could be extended. One possibility
would be to use all the classes dominated by the hypernyms of a concept, rather than
just one, to estimate the probability of the concept. An estimate would be obtained for
each hypernym, and the estimates combined in a linear interpolation. An approach
similar to this is taken by Bikel (2000), in the context of statistical parsing.
There is still room for investigation of the hidden-data problem when data are used
that have not been sense disambiguated. In this article, a very simple approach is taken,
which is to split the count for a noun evenly among the noun’s senses. Abney and Light
(1999) have tried a more motivated approach, using the expectation maximization
algorithm, but with little success. The approach described in Clark and Weir (1999) is
shown in Clark (2001) to have some impact on the pseudo-disambiguation task, but
only with certain values of the α parameter, and ultimately does not improve on the
best performance.

13 χ² performed slightly better than G² using the smaller data set also.

Finally, an issue that has not been much addressed in the literature (except by
Li and Abe [1996]) is how the accuracy of class-based estimation techniques compares
when automatically acquired classes, as opposed to the manually created classes from
WordNet, are used. The pseudo-disambiguation task described here has also been used
to evaluate clustering algorithms (Pereira, Tishby, and Lee, 1993; Rooth et al., 1999),
but with different data, and so it is difficult to compare the results. A related issue
is how the structure of WordNet affects the accuracy of the probability estimates. We
have taken the structure of the hierarchy for granted, without any analysis, but it may
be that an alternative design could be more conducive to probability estimation.
Acknowledgments
This article is an extended and updated
version of a paper that appeared in the
proceedings of NAACL 2001. The work on
which it is based was carried out while the
first author was a D.Phil. student at the
University of Sussex and was supported by
an EPSRC studentship. We would like to
thank Diana McCarthy for suggesting the
pseudo-disambiguation task and providing
the MDL software, John Carroll for
supplying the data, and Ted Briscoe, Geoff
Sampson, Gerald Gazdar, Bill Keller, Ted
Pedersen, and the anonymous reviewers for
their helpful comments. We would also like
to thank Ted Briscoe for presenting an
earlier version of this article on our behalf
at NAACL 2001.

References
Abney, Steven P. and Marc Light. 1999.
Hiding a semantic hierarchy in a Markov
model. In Proceedings of the ACL Workshop
on Unsupervised Learning in Natural
Language Processing, University of
Maryland, College Park, pages 1–8.
Agirre, Eneko and David Martinez. 2001.
Learning class-to-class selectional
preferences. In Proceedings of the Fifth ACL
Workshop on Computational Language
Learning, Toulouse, France, pages 15–22.
Agresti, Alan. 1996. An Introduction to
Categorical Data Analysis. Wiley.
Bikel, Daniel M. 2000. A statistical model
for parsing and word-sense
disambiguation. In Proceedings of the Joint
SIGDAT Conference on Empirical Methods in
Natural Language Processing and Very Large
Corpora, pages 155–163, Hong Kong.
Briscoe, Ted and John Carroll. 1997.
Automatic extraction of subcategorization
from corpora. In Proceedings of the Fifth
ACL Conference on Applied Natural Language
Processing, pages 356–363, Washington,
DC.
Church, Kenneth W. and Patrick Hanks.
1990. Word association norms, mutual
information, and lexicography.
Computational Linguistics, 16(1):22–29.
Ciaramita, Massimiliano and Mark Johnson.
2000. Explaining away ambiguity:
Learning verb selectional preference with
Bayesian networks. In Proceedings of the
18th International Conference on
Computational Linguistics, pages 187–193,
Saarbrücken, Germany.
Clark, Stephen. 2001. Class-Based Statistical
Models for Lexical Knowledge Acquisition.
Ph.D. dissertation, University of Sussex.
Clark, Stephen and David Weir. 1999. An
iterative approach to estimating
frequencies over a semantic hierarchy. In
Proceedings of the Joint SIGDAT Conference
on Empirical Methods in Natural Language
Processing and Very Large Corpora, pages
258–265, University of Maryland, College
Park.
Clark, Stephen and David Weir. 2000. A
class-based probabilistic approach to
structural disambiguation. In Proceedings
of the 18th International Conference on
Computational Linguistics, pages 194–200,
Saarbrücken, Germany.
Clark, Stephen and David Weir. 2001.
Class-based probability estimation using a
semantic hierarchy. In Proceedings of the
Second Meeting of the North American
Chapter of the Association for Computational
Linguistics, pages 95–102, Pittsburgh.
Cressie, Noel A. C. and Timothy R. C. Read.
1984. Multinomial goodness of fit tests.
Journal of the Royal Statistical Society, Series B,
46:440–464.
Dunning, Ted. 1993. Accurate methods for
the statistics of surprise and coincidence.
Computational Linguistics, 19(1):61–74.
Fellbaum, Christiane, editor. 1998. WordNet:
An Electronic Lexical Database. MIT Press.
Li, Hang and Naoki Abe. 1996. Clustering
words with the MDL principle. In
Proceedings of the 16th International
Conference on Computational Linguistics,
pages 4–9, Copenhagen, Denmark.
Li, Hang and Naoki Abe. 1998. Generalizing
case frames using a thesaurus and the
MDL principle. Computational Linguistics,
24(2):217–244.
McCarthy, Diana. 1997. Word sense
disambiguation for acquisition of
selectional preferences. In Proceedings of
the ACL/EACL Workshop on Automatic
Information Extraction and Building of Lexical
Semantic Resources for NLP Applications,
pages 52–61, Madrid.
McCarthy, Diana. 2000. Using semantic
preferences to identify verbal
participation in role switching. In
Proceedings of the First Conference of the
North American Chapter of the Association for
Computational Linguistics, pages 256–263,
Seattle.
Miller, George A. 1998. Nouns in WordNet.
In Christiane Fellbaum, editor, WordNet:
An Electronic Lexical Database. MIT Press,
pages 23–46.
Pedersen, Ted. 1996. Fishing for exactness.
In Proceedings of the South-Central SAS Users
Group Conference, Austin, pages 188–200.
Pedersen, Ted. 2001. A decision tree of
bigrams is an accurate predictor of word
sense. In Proceedings of the Second Meeting
of the North American Chapter of the
Association for Computational Linguistics,
pages 79–86, Pittsburgh.
Pereira, Fernando, Naftali Tishby, and
Lillian Lee. 1993. Distributional clustering
of English words. In Proceedings of the 31st
Annual Meeting of the Association for
Computational Linguistics, pages 183–190,
Columbus, OH.
Resnik, Philip. 1993. Selection and
Information: A Class-Based Approach to
Lexical Relationships. Ph.D. dissertation,
University of Pennsylvania.
Resnik, Philip. 1998. WordNet and
class-based probabilities. In Christiane
Fellbaum, editor, WordNet: An Electronic
Lexical Database. MIT Press, pages 239–263.
Ribas, Francesc. 1995. On learning more
appropriate selectional restrictions. In
Proceedings of the Seventh Conference of the
European Chapter of the Association for
Computational Linguistics, pages 112–118,
Dublin.
Rooth, Mats, Stefan Riezler, Detlef Prescher,
Glenn Carroll, and Franz Beil. 1999.
Inducing a semantically annotated lexicon
via EM-based clustering. In Proceedings of
the 37th Annual Meeting of the Association for
Computational Linguistics, pages 104–111,
University of Maryland, College Park.
Wagner, Andreas. 2000. Enriching a lexical
semantic net with selectional preferences
by means of statistical corpus analysis. In
Proceedings of the ECAI-2000 Workshop on
Ontology Learning, Berlin, pages 37–42.
