Automatic Identification of Infrequent Word Senses
Diana McCarthy & Rob Koeling & Julie Weeds & John Carroll
Department of Informatics,
University of Sussex
Brighton BN1 9QH, UK
a0 dianam,robk,juliewe,johnca
a1 @sussex.ac.uk
Abstract
In this paper we show that an unsupervised method for
ranking word senses automatically can be used to iden-
tify infrequently occurring senses. We demonstrate this
using a ranking of noun senses derived from the BNC
and evaluating on the sense-tagged text available in both
SemCor and the SENSEVAL-2 English all-words task.
We show that the method does well at identifying senses
that do not occur in a corpus, and that those that are erro-
neously filtered but do occur typically have a lower fre-
quency than the other senses. This method should be
useful for word sense disambiguation systems, allowing
effort to be concentrated on more frequent senses; it may
also be useful for other tasks such as lexical acquisition.
Whilst the results on balanced corpora are promising, our
chief motivation for the method is for application to do-
main specific text. For text within a particular domain
many senses from a generic inventory will be rare, and
possibly redundant. Since a large domain specific cor-
pus of sense annotated data is not available, we evaluate
our method on domain-specific corpora and demonstrate
that sense types identified for removal are predominantly
senses from outside the domain.
1 Introduction
Much about the behaviour of words is most appro-
priately expressed in terms of word senses rather
than word forms. However, an NLP application
computing over word senses is faced with consid-
erable extra ambiguity. There are systems which
can perform word sense disambiguation (WSD) on
the words in input text, however there is room for
improvement since the best systems on the English
SENSEVAL-2 all-words task obtained at most 69%
for precision and recall. Whilst there are systems
that obtain higher precision (Magnini et al., 2001),
these typically suffer from a low recall. WSD per-
formance is affected by the degree of polysemy, but
even more so by the entropy of the frequency distri-
butions of the words’ senses (Kilgarriff and Rosen-
zweig, 2000) since the distribution for many words
is highly skewed. Many of the senses in such an
inventory are rare and WSD and lexical acquisition
systems do best when they take this into account.
There are many ways that the skewed distribution
can be taken into account. One successful approach
is to back-off to the first (predominant) sense (Wilks
and Stevenson, 1998; Hoste et al., 2001). Another
possibility would be concentrate the selection pro-
cess to senses with higher frequency, and filter out
rare senses. This is implicitly done by systems
which rely on hand-tagged training corpora, since
rare senses often do not occur in the available data.
In this paper we use an unsupervised method to rank
word senses from an inventory according to preva-
lence (McCarthy et al., 2004a), and utilise the rank-
ing scores to identify senses which are rare. We use
WordNet for our inventory, since it is widely used
and freely available, but our method could in prin-
ciple be used with another MRD (we comment on
this in the conclusions). We report work with nouns
here, and leave evaluation on other PoS for the fu-
ture.
Our approach exploits automatically acquired
thesauruses which provide “nearest neighbours” for
a given word entry. The neighbours are ordered
in terms of the distributional similarity that they
share with the target word. The neighbours relate
to different senses of the target word, so for exam-
ple the word competition in such a thesaurus pro-
vided by Lin 1 has neighbours tournament, event,
championship and then further down the ordered list
we see neighbours pertaining to a different sense
competitor,...market...price war. Pantel and Lin
(2002) demonstrate that it is possible to cluster the
neighbours into senses and relate these to WordNet
senses. In contrast, we use the distributional sim-
ilarity scores of the neighbours to rank the various
senses of the target word since we expect that the
quantity and similarity of the neighbours pertain-
ing to different senses will reflect the relative dom-
inance of the senses. This is because there will
1Available from
http://www.cs.ualberta.ca/˜lindek/demos/depsim.htm
be more data for the more prevalent senses com-
pared to the less frequent senses. We use a measure
of semantic similarity from the WordNet Similarity
package to relate the senses of the target word to the
neighbours in the thesaurus.
The paper is structured as follows. The ranking
method is described elsewhere (McCarthy et al.,
2004a), but we summarise in the following section
and describe how ranking scores can be used for fil-
tering word senses. Section 3 describes two exper-
iments using the BNC for acquisition of the sense
rankings with evaluation using the hand-tagged data
in i) SemCor and ii) the English SENSEVAL-2 all-
words task. We demonstrate that the majority of
senses identified by the method do not occur in these
gold-standards, and that for those that do, only a
small percentage of the sense tokens would be re-
moved in error by filtering these senses. In section 4
we use domain labels produced by (Magnini and
Cavagli`a, 2000) to demonstrate differences in the
senses filtered for a sample of words in two domain
specific corpora. We describe some related work in
section 5 and conclude in section 6.
2 Method
McCarthy et al. (2004a) describe a method to pro-
duce a ranking over senses and find the predominant
sense of a word just using raw text. We summarise
the method below, and describe how we use it for
identifying candidate senses for filtering.
2.1 Ranking the Senses
In order to rank the senses of a target word (e.g.
plant) we use a thesaurus acquired from automati-
cally parsed text (section 2.2 below). This provides
the a2 nearest neighbours to each target word (e.g.
factory, refinery, tree etc...) along with the distribu-
tional similarity score between the target word and
its neighbour. We then use the WordNet similar-
ity package (Patwardhan and Pedersen, 2003) (see
section 2.3) to give us a semantic similarity mea-
sure (hereafter referred to as the WordNet similarity
measure) to weight the contribution that each neigh-
bour (e.g. factory) makes to the various senses of
the target word (e.g. flora, industrial, actor etc...).
We take each sense of the target word (a3 ) in turn
and obtain a score reflecting the prevalence which is
used for ranking. Let a4a6a5a8a7a10a9a12a11a14a13a16a15a17a11a19a18a21a20a22a20a22a20a23a11a19a24a26a25 be the or-
dered set of the top scoring a2 neighbours of a3 from
the thesaurus with associated distributional similar-
ity scores a9a12a27a29a28a30a28a26a31a32a3a6a15a33a11a14a13a35a34a17a15a33a27a29a28a30a28a16a31a36a3a37a15a17a11a19a18a38a34a33a15a21a20a22a20a22a20a39a27a29a28a30a28a26a31a32a3a6a15a33a11a14a24a35a34a38a25 .
Let a28a30a40a30a11a14a28a21a40a30a28a16a31a36a3a41a34 be the set of senses of a3 . For each
sense of a3 (a3a42a28a44a43a46a45a47a28a30a40a30a11a14a28a21a40a30a28a16a31a36a3a41a34 ) we obtain a rank-
ing score by summing over the a27a48a28a21a28a16a31a32a3a6a15a33a11a50a49a30a34 of each
neighbour (a11a50a49a51a45a52a4a6a5 ) multiplied by a weight. This
weight is the WordNet similarity score (a3a42a11a14a28a30a28 ) be-
tween the target sense (a3a53a28a35a43 ) and the sense of a11a54a49
(a11a14a28a35a55a56a45a57a28a30a40a30a11a14a28a21a40a30a28a16a31a36a11a54a49a16a34 ) that maximises this score, di-
vided by the sum of all such WordNet similarity
scores for a28a30a40a21a11a19a28a21a40a30a28a26a31a32a3a41a34 and a11a50a49 .
Thus we rank each sense a3a42a28 a43 a45a10a28a30a40a21a11a19a28a21a40a30a28a26a31a32a3a41a34 us-
ing:a58a41a59
a11a14a2a54a60a61a11a50a62a64a63a66a65a35a67a16a68a12a40a26a31a32a3a53a28 a43 a34a69a7
a70
a71a44a72a33a73a16a74a76a75
a27a48a28a21a28a16a31a32a3a6a15a33a11a50a49a30a34a30a77
a3a42a11a14a28a21a28a16a31a32a3a53a28a35a43a36a15a33a11a50a49a30a34
a78
a5a19a79a81a80a39a82
a73
a79a61a83
a71
a79a81a83a61a79a36a84a85a5a19a86
a3a53a11a14a28a30a28a26a31a32a3a42a28 a43
a82
a15a17a11 a49 a34
(1)
where:
a3a42a11a14a28a21a28a16a31a32a3a53a28a35a43a87a15a17a11a54a49a21a34a66a7 a88a46a89a21a90
a71
a79a92a91
a73
a79a81a83
a71
a79a81a83a61a79a93a84
a71a44a72
a86
a31a36a3a42a11a14a28a30a28a26a31a32a3a53a28a35a43a36a15a33a11a14a28a35a55a29a34a87a34
2.2 Acquiring the Automatic Thesaurus
There are many alternative distributional similarity
measures proposed in the literature, for this work
we used the measure and thesaurus construction
method described by Lin (1998). For input we
used grammatical relation data extracted using an
automatic parser (Briscoe and Carroll, 2002). For
each noun we considered the co-occurring verbs in
the direct object and subject relation, the modifying
nouns in noun-noun relations and the modifying ad-
jectives in adjective-noun relations. We could easily
extend the set of relations in the future. A noun, a3 ,
is thus described by a set of co-occurrence triples
a94
a3a6a15a33a68a16a15a17a95a97a96 and associated frequencies, where a68
is a grammatical relation and a95 is a possible co-
occurrence with a3 in that relation. For every pair
of nouns, we computed their distributional similar-
ity. If a98a41a31a32a3a99a34 is the set of co-occurrence types a31a32a68a16a15a33a95a100a34
such that a101a102a31a32a3a37a15a17a68a16a15a33a95a100a34 is positive then the similarity
between two nouns, a3 and a11 , can be computed as:
a27a29a28a30a28a26a31a32a3a37a15a17a11a103a34a76a7
a78
a84a105a104a36a106a55a30a86
a73a16a107
a84a85a5a19a86a109a108
a107
a84
a71
a86
a31a110a101a102a31a32a3a6a15a33a68a16a15a17a95a14a34a54a111a56a101a102a31a32a11a112a15a33a68a16a15a33a95a100a34a87a34
a78
a84a85a104a36a106a55a30a86
a73a12a107
a84a105a5a14a86
a101a102a31a32a3a6a15a33a68a16a15a87a95a100a34a54a111
a78
a84a85a104a36a106a55a30a86
a73a16a107
a84
a71
a86
a101a102a31a32a11a112a15a33a68a16a15a17a95a14a34
where:
a101a102a31a32a3a6a15a33a68a16a15a17a95a14a34a14a7a114a113a22a115a117a116a64a118
a31a36a95a76a119a120a3a122a121a46a68a123a34
a118
a31a32a95a112a119a120a68a123a34
A thesaurus entry of size a2 for a target noun a3 is
then defined as the a2 most similar nouns to a3 .
2.3 The WordNet Similarity Package
We use the WordNet Similarity Package 0.05 and
WordNet version 1.6. 2 The WordNet Similarity
2We use this version of WordNet since it would in principle
allow us to map information to WordNets of other languages
more accurately. We are able to apply the method to other ver-
sions of WordNet.
package supports a range of WordNet similarity
scores. We used the jcn measure to give results
for the a3a53a11a19a28a21a28 function in equation 1 above, since
this has given us good results for other experiments,
and is efficient given the precompilation of required
frequency files (information dat files). We discuss
the merits of investigating other semantic similarity
scores in section 6.
The jcn (Jiang and Conrath, 1997) measure
provides a similarity score between two WordNet
senses (a28a117a124 and a28a21a125 ), these being synsets within
WordNet. The measure uses corpus data to pop-
ulate classes (synsets) in the WordNet hierarchy
with frequency counts. Each synset, is incre-
mented with the frequency counts from the cor-
pus of all words belonging to that synset, directly
or via the hyponymy relation. The frequency data
is used to calculate the “information content” (IC)
of a class a101a54a126a127a31a32a28a26a34a128a7a130a129a132a131a133a67a44a62a19a31a135a134a100a31a32a28a16a34a87a34 . Jiang and Con-
rath specify a distance measure: a136a6a49a93a137 a71 a31a32a28a117a124a123a15a33a28a30a125a138a34a56a7
a101a54a126a127a31a32a28a123a124a16a34a66a111a139a101a54a126a127a31a36a28a30a125a29a34a140a129a141a125a142a77a56a101a54a126a127a31a36a28a30a143a29a34 , where the third
class, a28a30a143 is the most informative, or most specific
superordinate synset of the two senses a28a117a124 and a28a30a125 .
This is transformed from a distance measure in the
WN-Similarity package by taking the reciprocal:
a144
a65a35a11a103a31a32a28a117a124a117a15a17a28a30a125a29a34a112a7a145a124a16a146a16a136 a49a93a137
a71
a31a36a28a117a124a117a15a17a28a30a125a29a34
The jcn measure uses corpus data for the calcu-
lation of IC. The experimental results reported here
are obtained using IC counts from the BNC corpus
with the resnik count option available in the Word-
Net similarity package. We did not use the default
IC counts provided with the package since these are
derived from the hand-tagged data in SemCor. All
the results shown here are those with the size of the-
saurus entries (a2 ) set to 50. 3
2.4 Filtering
We use equation 1 above to produce ranking scores
for the senses a28a30a40a21a11a19a28a21a40a30a28a26a31a32a3a41a34 of a target word a3 . We
then use a threshold a98 a5 which is a constant percent-
age (a98a99a147 ) of the ranking score of the first ranked
sense. Any senses with scores lower than a98 a5 are
identified for filtering. This threshold will permit
the filtering to be sensitive to the ranking scores of
the word in question.
3 Experiments with a Thesaurus from a
Balanced Corpus
For the experiments described in this section we
acquired the thesaurus from the grammatical rela-
tions listed in 2.2 automatically extracted from the
90 million words of written English from the BNC.
3Previous ranking experiments using
a148a150a149a152a151a93a153a44a154a32a155a17a153a35a154a61a156a17a153 and a157a33a153
gave only minimal changes to the results.
We generated a thesaurus entry for all polysemous
nouns which occurred in SemCor with a frequency
a96 2, and in the BNC with a frequency a158 10. We ex-
periment with a98a99a147a159a7a161a160a22a124a30a162a54a15a17a125a117a162a54a20a22a20a23a163a117a162a16a164 . For these exper-
iments we evaluate using the gold-standard sense-
tagged data available in i) SemCor and ii) the En-
glish SENSEVAL-2 all-words task. For each value
of a98a99a147 we compute the number of sense types fil-
tered (a165a53a166a32a167a21a134a168a40a21a28 ), and the percentage of these that are
correctly filtered (a165a53a166a32a167a21a134a168a40a30a169a17a137a133a137 ) in that they do not oc-
cur at all in our gold-standard. We also compute
for those types that do occur a165a53a166a32a67a16a2a29a83a133a104a36a104 a80 , the percent-
age of sense tokens that would be filtered incorrectly
from the gold-standard by their removal from Word-
Net. a165a53a166a36a67a12a2a29a83a133a104a36a104 a80a39a80 is the percentage of sense tokens that
would be filtered incorrectly for the subset of words
for which there are tokens filtered.
The results when using the ranking scores derived
from the BNC thesaurus for filtering the senses in
SemCor are shown in table 1 for different values of
a98a41a147 . For polysemous nouns in SemCor, the percent-
age of sense types that do not occur is 38%, so if we
filtered randomly we could expect to get 38% ac-
curacy. a165a53a166a32a167a21a134a168a40a21a169a33a137a133a137 is well above this baseline for all
values of a98a41a147 . Whilst there are sense types in Sem-
Cor that are filtered erroneously, these are senses
which occur less frequently than the non-filtered
types. Furthermore, they account for a relatively
small percentage of tokens for the filtered words as
shown by a165a53a166a36a67a12a2a29a83a133a104a36a104 a80a39a80 . Table 2 shows that a165a53a166a36a67a12a2a29a83a133a104a36a104 a80
is lower than would be expected if the sense types
which are filtered had average frequency. There
are 10687 sense types for the polysemous nouns in
SemCor, of which 6573 actually occur. The num-
ber of sense types filtered in error for each value
of a98a99a147 is shown by a165a53a166a32a167a21a134a168a40a30a28a21a83a133a104a36a104 . The proportion
of tokens expected for the given a165a53a166a32a167a21a134a168a40a21a28a30a83a133a104a36a104 , if the
filtered types were of average frequency, is given
by a166a36a67a12a2a29a83a133a55a170a7 a171a14a172a22a173a36a174 a83a81a79a61a175a133a176a133a176a177a33a178a38a179a36a180 . For the highest value of
a98a41a147a181a7a182a163a123a162 , 3099 types are identified for filtering,
this comprises 47% of the types occurring in Sem-
Cor, however a165a53a166a36a67a12a2 a83a133a104a36a104 a80 shows that only 39% tokens
are filtered. As the value of a98a99a147 decreases, we filter
fewer sense types, less tokens in error and the ratio
between a166a36a67a16a2a48a83a92a55 and a165a53a166a36a67a16a2a48a83a92a104a32a104
a80 increases. The com-
promise between the number of sense types filtered,
and the removal of tokens in error will depend on
the needs of the application, and can be altered with
a98a41a147 .
The SENSEVAL-2 English all-words task (Palmer
et al., 2001) is a much smaller sample of hand-
tagged text compared to SemCor, comprising three
documents from the Wall Street Journal section of
the Penn Treebank. For the sample of polysemous
a98a41a147 a165a53a166a32a167a21a134a168a40a21a28 a165a53a166a32a167a21a134a168a40a30a169a17a137a133a137 a165a53a166a36a67a12a2a29a83a133a104a36a104
a80
a165a53a166a32a67a16a2a29a83a133a104a36a104
a80a23a80
90 5952 48 39 44
80 4560 50 25 32
70 3057 52 16 25
60 1724 52 8 19
50 672 54 3 13
40 146 54 0.5 9
30 28 57 0.04 5
20 - - - -
Table 1: Filtering results for SemCor
a98a41a147 a165a53a166a32a167a21a134a168a40a30a28a21a83a133a104a36a104 a166a32a67a16a2a29a83a133a55 a165a53a166a32a67a16a2a29a83a133a104a36a104
a80
a172a109a183
a24 a175a109a91
a171a19a172a109a183
a24a38a175a133a176a133a176 a80
90 3099 47 39 1.2
80 2271 35 25 1.4
70 1472 22 16 1.4
60 821 12 8 1.5
50 308 5 3 1.7
40 67 1 0.5 2
30 12 0.2 0.04 5
Table 2: Erroneous tokens anticipated, and filtered
from SemCor
nouns occurring in this corpus, there are 77% sense
types which do not occur. The results in table 3
show much higher values for a165a53a166a32a167a21a134a168a40a21a169a33a137a133a137 because of
this higher baseline (77%). The filtering results nev-
ertheless show superior performance to this base-
line at all levels of a98a99a147 . This time there are no
sense types filtered for a98a41a147 a7a184a143a117a162 . The frequen-
cies of the types filtered in error are close to the val-
ues of a166a36a67a12a2 a83a133a55 , as shown in table 4. This is because
the corpus is very small. Many types do not occur
and many types have a low frequency, regardless of
whether they are filtered or not.
In this section we demonstrated that the ranking
scores can be used alongside a threshold to remove
senses which are considered rare for the corpus data
at hand, that the majority of sense types filtered in
this way do not occur in our test data, and that those
that do typically have a low or average frequency.
There are of course differences between the BNC
corpus that we used to create our sense ranking and
the test corpora, however, since the BNC is a bal-
anced corpus we feel that this is a feasible means
of evaluation, and the results bear this out. A main
advantage of our approach is to enable us to tailor
a resource such as WordNet to domain specific text,
and it is to this that we now turn.
a98a41a147 a165a53a166a36a167a30a134a168a40a21a28 a165a53a166a32a167a21a134a168a40a30a169a17a137a133a137 a165a53a166a32a67a16a2a29a83a133a104a36a104
a80
a165a53a166a32a67a16a2a29a83a133a104a36a104
a80a23a80
90 1018 87 38 44
80 827 88 28 35
70 584 89 18 29
60 370 91 10 22
50 157 89 5 24
40 42 95 0.06 11
30 - - - -
Table 3: Filtering results on the SENSEVAL-2 En-
glish all-words task
a98a99a147 a165a53a166a36a167a30a134a102a40a30a28a30a83a92a104a36a104 a166a32a67a16a2a29a83a133a55 a165a53a166a32a67a16a2a29a83a133a104a36a104
a80
a172a135a183
a24a87a175a109a91
a171a14a172a135a183
a24 a175a133a176a133a176 a80
90 133 38 39 1
80 96 28 28 1
70 62 18 18 1
60 33 10 10 1
50 17 5 5 1
40 2 0.06 0.06 1
Table 4: Erroneous tokens anticipated, and filtered
from SENSEVAL-2
4 Experiments Filtering Senses from
Domain Specific Texts
A major motivation for our work is to try to tailor a
sense inventory to the text at hand. In this section we
apply our filtering method to two domain specific
corpora. We demonstrate that the senses filtered us-
ing our method on these corpora are determined by
the domain. The Reuters corpus (Rose et al., 2002)
is a collection of about 810,000 Reuters, English
Language News stories (covering the period August
1996 to August 1997). Many of the news stories
are economy related, but several other topics are in-
cluded too. We have selected documents from the
SPORTS domain (topic code: GSPO) and a limited
number of documents from the FINANCE domain
(topic codes: ECAT (ECONOMICS) and MCAT
(MARKETS)). We chose the domains of SPORTS
and FINANCE since there is sufficient material for
these domains in this publically available corpus.
The SPORT corpus consists of 35317 documents
(about 9.1 million words). The FINANCE corpus
consists of 117734 documents (about 32.5 million
words). We acquired thesauruses for these corpora
using the procedure described in section 2.2.
There is no existing gold-standard that we could
use to determine the frequency of word senses
within these domain specific corpora. Instead we
evaluate our method using the Subject Field Codes
(SFC) resource (Magnini and Cavagli`a, 2000)
a98a99a147 BNC FINANCE SPORT
90 83 82 81
80 75 62 60
70 61 49 37
60 46 32 12
50 24 1 27
40 6 5 -
30 3 - -
20 - - -
Table 5: Percentage of sense types filtered
which annotates WordNet synsets with domain la-
bels. The SFC contains an economy label and a
sports label. For this domain label experiment we
selected all the words in WordNet that have at least
one synset labelled economy and at least one synset
labelled sports. The resulting set consisted of 38
words. The relative frequency of the domain labels
for all the sense types of the 38 words is show in
figure 1. The three main domain labels for these
38 words are of course sports, economy and fac-
totum (domain independent). In figure 2 we con-
trast the relative frequency distribution of domain
labels for filtered senses (using a98a41a147a185a7a187a186a117a162 ) of these
38 words in i) the BNC ii) the FINANCE corpus and
iii) the SPORT corpus.
From this figure one can see that there are more
economy and commerce senses removed from the
SPORT corpus, with no filtered sport labels. The
FINANCE and BNC corpora do have some filtered
economy and commerce labels, but these are only
a small percentage of the filtered senses, and for FI-
NANCE there are less than for the BNC.
Table 5 shows the percentage of sense types fil-
tered at different values of a98a41a147 . There are a rela-
tively larger number of sense types filtered in the
BNC compared to the FINANCE corpus, and this in
turn has a larger percentage than the SPORT corpus.
This is particularly noticeable at lower values of a98a41a147
and is because for these 38 words the ranking scores
are less spread in the FINANCE, and SPORT corpus,
arising from the relative size of the corpora and the
spread of the distributional similarity scores. We
conclude from these experiments that the value of
a98a41a147 should be selected dependent on the corpus as
well as the requirements of the application. There is
also scope for investigating other distributional sim-
ilarity scores and other filtering thresholds, for ex-
ample, taking into account the variance of the rank-
ing scores in the corpus.
5 Related Work
WordNet is an extensive resource, as new versions
are created new senses get included, however, for
backwards compatibility previous senses are not
deleted. For many NLP applications the problems
of word sense ambiguity are significant. One way
to cope with the larger numbers of senses for a word
is by working at a coarser granularity, so that re-
lated senses are grouped together. There is useful
work being done to cluster WordNet senses auto-
matically (Agirre and Lopez de Lacalle, 2003). Pan-
tel and Lin (2002) are working with automatically
constructed thesauruses and identifying senses di-
rectly from the nearest neighbours, where the gran-
ularity depends on the parameters of the clustering
process. In contrast we are using the nearest neigh-
bours to indicate the frequency of the senses of the
target word, using semantic similarity between the
neighbours and the word senses listed in WordNet.
We do so here in order to identify the senses of the
word which are rare in corpus data.
Lapata and Brew (2004) have recently used syn-
tactic evidence to produce a prior distribution for
verb senses and incorporate this in a WSD system.
The work presented here focusses on using a preva-
lence ranking for word senses to identify and re-
move rare senses from a generic resource such as
WordNet. We believe that this method will be use-
ful for systems using such a resource, which can
incorporate prior distributions over word senses or
wish to identify and remove rare word senses. Sys-
tems requiring sense frequency distributions cur-
rently rely on available hand-tagged training data,
and for WordNet the most extensive resource for all-
words is SemCor. Whilst SemCor is extremely use-
ful, it comprises only 250,000 words taken from a
subset of the Brown corpus and a novel. Because of
its size, and the zipfian distribution of words, there
are many words which do not occur in this resource,
for example embryo, fridge, pancake, wheelbarrow
and many words which occur only once or twice.
Our method using raw text permits us to obtain a
sense ranking for any word from our corpus, subject
to the constraint that we have enough occurrences in
the corpus. Given the increasing amount of data on
the web, this constraint is not likely to be problem-
atic.
Another major benefit of the work here, rather
than reliance on hand-tagged training data such as
SemCor, is that this method permits us to produce
a ranking for the domain and text type required.
The sense distributions of many words depend on
the domain, and filtering senses that are rare in a
specific domain permits a generic resource such as
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Relative Frequency Sense Types
a188
literature
a189
industry
a190
factotum
a191
physics
a192
architecture
a193
alimentation
a188
politics agriculture
a190 art
body_care military
a194
commerce
a195
engineering
a193
pedagogy
a189
psychology
a196
publishing
a192
play
a189
chemistry
a195
religion
a197
telecommunication
a195 lawa189
administration
a192
sexuality
a190
computer_science
a198
fashion
a196
transport
a192
linguistics
a199
free_time mathematics
a195
earth biology
a191
sociology
a191
sport
a191
economy
a188
medicine
a199
all sense types
Figure 1: Distribution of domain labels of all senses for 38 polysemous words
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Relative Frequency Filtered Sense Types
a200
literature
a189
industry
a190
factotum
a191
physics
a192
architecture
a193
alimentation
a188
politics agriculture
a190
art
body_care military
a194
commerce
a195
engineering
a193
pedagogy
a189
psychology
a196
publishing
a192
play
a189
chemistry
a195
religion
a197
telecommunication
a195 lawa189
administration
a192
sexuality
a190
computer_science
a198
fashion
a196
transport
a192
linguistics
a199
free_time mathematics
a195
earth biology
sociology
a191
sport
a191
economy
a188
medicine
a199
bnc
finance
sport
Figure 2: Distribution of domain labels of filtered senses for 38 polysemous words
WordNet to be tailored to the domain. Buitelaar and
Sacaleanu (2001) have previously explored rank-
ing and selection of synsets in GermaNet for spe-
cific domains using the words in a given synset, and
those related by hyponymy, and a term relevance
measure taken from information retrieval. Buite-
laar and Sacaleanu have evaluated their method on
identifying domain specific concepts using human
judgements on 100 items.
Magnini and Cavagli`a (2000) have identified
WordNet word senses with particular domains, and
this has proven useful for high precision WSD
(Magnini et al., 2001); indeed in section 4 we used
these domain labels for evaluation of our automatic
filtering senses from domain specific corpora. Iden-
tification of these domain labels for word senses was
semi-automatic and required a considerable amount
of hand-labelling. Our approach is complementary
to this. It provides a ranking of the senses of a word
for a given domain so that manual work is not neces-
sary, because of this it can easily be applied to a new
domain, or sense inventory, given sufficient text.
6 Conclusions
We have proposed and evaluated a method which
can identify senses which are rare in a given cor-
pus. This method uses a ranking of senses derived
automatically from raw text using both distribu-
tional similarity methods and a measure of semantic
similarity, such as those available in the WordNet
similarity package. When using rankings derived
from a thesaurus automatically acquired from the
BNC, we have demonstrated that this technique pro-
duces promising results in removing unused senses
from both SemCor and the SENSEVAL-2 English
all-words task corpus. Moreover, the senses re-
moved erroneously from SemCor were less frequent
than average.
A major benefit of this method is to tailor
a generic resource such as WordNet to domain-
specific text, and we have demonstrated this us-
ing two domain specific corpora and and an eval-
uation using semi-automatically created domain la-
bels (Magnini and Cavagli`a, 2000).
There is scope for experimentation with other
WordNet similarity scores. From earlier experi-
ments we noted that the lesk measure produced
quite good results, although it is considerably less
efficient than jcn as it compares sense definitions at
run time. One major advantage that lesk has, is its
applicability to other PoS. The lesk measure can be
used when ranking adjectives, and adverbs as well
as nouns and verbs (which can also be ranked us-
ing jcn). Another advantage of the lesk measure is
that it is applicable to lexical resources which do not
have the hierarchical structure that WordNet does,
but do have definitions associated with word senses.
This paper only deals with nouns, however we
have recently investigated the ranking method for an
unsupervised predominant sense heuristic for WSD
for other PoS (McCarthy et al., 2004b). We plan
to use the ranking method for identifying prevalent
and infrequent senses from domain specific text and
using this as a resource for WSD and lexical acqui-
sition.
Acknowledgements
We would like to thank Siddharth Patwardhan and
Ted Pedersen for making the WN Similarity pack-
age publically available. This work was funded
by EU-2001-34460 project MEANING: Develop-
ing Multilingual Web-scale Language Technolo-
gies, UK EPSRC project Robust Accurate Statisti-
cal Parsing (RASP) and a UK EPSRC studentship.

References

Eneko Agirre and Oier Lopez de Lacalle. 2003. Clus-
tering wordnet word senses. In Recent Advances in
Natural Language Processing, Borovets, Bulgaria.

Edward Briscoe and John Carroll. 2002. Robust ac-
curate statistical annotation of general text. In Pro-
ceedings of the Third International Conference on
Language Resources and Evaluation (LREC), pages
1499–1504, Las Palmas, Canary Islands, Spain.

Paul Buitelaar and Bogdan Sacaleanu. 2001. Ranking
and selecting synsets by domain relevance. In Pro-
ceedings of WordNet and Other Lexical Resources:
Applications, Extensions and Customizations, NAACL
2001 Workshop, Pittsburgh, PA.

V´eronique Hoste, Anne Kool, and Walter Daelemans.
2001. Classifier optimization and combination in
the English all words task. In Proceedings of the
SENSEVAL-2 workshop, pages 84–86.

Jay Jiang and David Conrath. 1997. Semantic similarity
based on corpus statistics and lexical taxonomy. In In-
ternational Conference on Research in Computational
Linguistics, Taiwan.

Adam Kilgarriff and Joseph Rosenzweig. 2000. English
SENSEVAL: Report and results. In Proceedings of
LREC-2000, Athens, Greece.

Mirella Lapata and Chris Brew. 2004. Verb class dis-
ambiguation using informative priors. Computational
Linguistics, 30(1):45–75.

Dekang Lin. 1998. Automatic retrieval and clustering of
similar words. In Proceedings of COLING-ACL 98,
Montreal, Canada.

Bernardo Magnini and Gabriela Cavagli`a. 2000. Inte-
grating subject field codes into WordNet. In Proceed-
ings of LREC-2000, Athens, Greece.

Bernardo Magnini, Carlo Strapparava, Giovanni Pezzuli,
and Alfio Gliozzo. 2001. Using domain information
for word sense disambiguation. In Proceedings of the
SENSEVAL-2 workshop, pages 111–114.

Diana McCarthy, Rob Koeling, Julie Weeds, and John
Carroll. 2004a. Finding predominant senses in un-
tagged text. In Proceedings of the 42nd Annual Meet-
ing of the Association for Computational Linguistics,
Barcelona, Spain.

Diana McCarthy, Rob Koeling, Julie Weeds, and John
Carrolł. 2004b. Using automatically acquired pre-
dominant senses for word sense disambiguation. In
Proceedings of the ACL SENSEVAL-3 workshop.

Martha Palmer, Christiane Fellbaum, Scott Cotton, Lau-
ren Delfs, and Hoa Trang Dang. 2001. English tasks:
All-words and verb lexical sample. In Proceedings of
the SENSEVAL-2 workshop, pages 21–24.

Patrick Pantel and Dekang Lin. 2002. Discovering word
senses from text. In Proceedings of ACM SIGKDD
Conference on Knowledge Discovery and Data Min-
ing, pages 613–619, Edmonton, Canada.

Siddharth Patwardhan and Ted Pedersen.
2003. The cpan wordnet::similarity pack-
age. http://search.cpan.org/author/SID/WordNet-
Similarity-0.03/.

Tony G. Rose, Mary Stevenson, and Miles Whitehead.
2002. The Reuters Corpus volume 1 - from yes-
terday’s news to tomorrow’s language resources. In
Proc. of Third International Conference on Language
Resources and Evaluation, Las Palmas de Gran Ca-
naria.

Yorick Wilks and Mark Stevenson. 1998. The grammar
of sense: using part-of speech tags as a first step in se-
mantic disambiguation. Natural Language Engineer-
ing, 4(2):135–143.
