Supersense Tagging of Unknown Nouns in WordNet*

Massimiliano Ciaramita
Brown University
massi@brown.edu

Mark Johnson
Brown University
mark_johnson@brown.edu

*We would like to thank Thomas Hofmann, Brian Roark, and
our colleagues in the Brown Laboratory for Linguistic Infor-
mation Processing (BLLIP), as well as Jesse Hochstadt for his
editing advice. This material is based upon work supported
by the National Science Foundation under Grant No. 0085940.
Abstract
We present a new framework for classify-
ing common nouns that extends named-
entity classification. We use a fixed set
of 26 semantic labels, which we call su-
persenses. These are the labels used by
lexicographers developing WordNet. This
framework has a number of practical ad-
vantages. We show how information con-
tained in the dictionary can be used as ad-
ditional training data that improves accu-
racy in learning new nouns. We also de-
fine a more realistic evaluation procedure
than cross-validation.
1 Introduction
Lexical semantic information is useful in many nat-
ural language processing and information retrieval
applications, particularly tasks that require com-
plex inferences involving world knowledge, such
as question answering or the identification of co-
referential entities (Pasca and Harabagiu, 2001;
Pustejovsky et al., 2002).
However, even large lexical databases such as
WordNet (Fellbaum, 1998) do not include all of
the words encountered in broad-coverage NLP ap-
plications. Ideally, we would like a system that
automatically extends existing lexical resources by
identifying the syntactic and semantic properties of
unknown words. In terms of the WordNet lexical
database, one would like to automatically assign un-
known words a position in the synset hierarchy, in-
troducing new synsets and extending the synset hier-
archy where appropriate. Doing this accurately is a
difficult problem, and in this paper we address a sim-
pler problem: automatically determining the broad
semantic class, or supersense, to which unknown
words belong.
Systems for thesaurus extension (Hearst, 1992;
Roark and Charniak, 1998), information extrac-
tion (Riloff and Jones, 1999) or named-entity recog-
nition (Collins and Singer, 1999) each partially ad-
dress this problem in different ways. The goal
in these tasks is automatically tagging words with
semantic labels such as “vehicle”, “organization”,
“person”, etc.
In this paper we extend the named-entity recogni-
tion approach to the classification of common nouns
into 26 different supersenses. Rather than define
these ourselves, we adopted the 26 “lexicographer
class” labels used in WordNet, which include labels
such as person, location, event, quantity, etc. We
believe our approach should generalize to other
definitions of supersenses.
Using the WordNet lexicographer classes as su-
persenses has a number of practical advantages.
First, we show how information contained in the dic-
tionary can be used as additional training data that
improves the system’s accuracy. Secondly, it is pos-
sible to use a very natural evaluation procedure. A
system can be trained on an earlier release of Word-
Net and tested on the words added in a later release,
 1 person           7 cognition     13 attribute     19 quantity   25 plant
 2 communication    8 possession    14 object        20 motive     26 relation
 3 artifact         9 location      15 process       21 animal
 4 act             10 substance     16 Tops          22 body
 5 group           11 state         17 phenomenon    23 feeling
 6 food            12 time          18 event         24 shape

Table 1. Lexicographer class labels, or supersenses.
since these labels are constant across different re-
leases. This new evaluation defines a realistic lexi-
cal acquisition task which is well defined, well mo-
tivated and easily standardizable.
The heart of our system is a multiclass perceptron
classifier (Crammer and Singer, 2002). The features
used are the standard ones used in word-sense classi-
fication and named-entity extraction tasks, i.e., col-
location, spelling and syntactic context features.
The experiments presented below show that when
the classifier also uses the data contained in the dic-
tionary its accuracy improves over that of a tradition-
ally trained classifier. Finally, we show that there are
both similarities and differences in the results ob-
tained with the new evaluation and standard cross-
validation. This suggests that the new evaluation
may in fact define a more realistic task.
The paper is organized as follows. In Section 2
we discuss the problem of unknown words and the
task of semantic classification. In Section 3 we de-
scribe the WordNet lexicographer classes, how to
extract training data from WordNet, the new evalu-
ation method and the relation of this task to named-
entity classification. In Section 4 we describe the
experimental setup, and in Section 5 we explain the
averaged perceptron classifier used. In Sections 6
and 7 we discuss the results and the two evaluations.
2 Unknown Words and Semantic
Classification
Language processing systems make use of “dictio-
naries”, i.e., lists that associate words with useful
information such as the word’s frequency or syn-
tactic category. In tasks that also involve inferences
about world knowledge, it is useful to know some-
thing about the meaning of the word. This lexical
semantic information is often modeled on what is
found in normal dictionaries, e.g., that “irises” are
flowers or that “hexane” is a solvent.
This information can be crucial in tasks such
as question answering - e.g., to answer a ques-
tion such as “What kind of flowers did Van Gogh
paint?” (Pasca and Harabagiu, 2001) - or the indi-
viduation of co-referential expressions, as in the pas-
sage “... the prerun can be performed with [term]
... this [term] can be considered ...” (Pustejovsky
et al., 2002).
Lexical semantic information can be extracted
from existing dictionaries such as WordNet. How-
ever, these resources are incomplete and systems
that rely on them often encounter unknown words,
even if the dictionary is large. As an example, in the
Bllip corpus (a very large corpus of Wall Street Jour-
nal text) the relative frequency of common nouns
that are unknown to WordNet 1.6 is approximately
0.0054; an unknown noun occurs, on average, ev-
ery eight sentences. WordNet 1.6 lists 95,000 noun
types. For this reason the importance of issues such
as automatically building, extending or customizing
lexical resources has been recognized for some time
in computational linguistics (Zernik, 1991).
Solutions to this problem were first proposed
in AI in the context of story understanding, cf.
(Granger, 1977). The goal is to label words using
a set of semantic labels specified by the dictionary.
Several studies have addressed the problem of ex-
panding one semantic category at a time, such as
“vehicle” or “organization”, chosen for relevance to a
particular task (Hearst, 1992; Roark and Charniak,
1998; Riloff and Jones, 1999). In named-entity clas-
sification a large set of named entities (proper nouns)
are classified using a comprehensive set of semantic
labels such as “organization”, “person”, “location”
or “other” (Collins and Singer, 1999). This latter
approach assigns all named entities in the data set a
semantic label. We extend this approach to the clas-
sification of common nouns using a suitable set of
semantic classes.
3 Lexicographer Classes for Noun
Classification
3.1 WordNet Lexicographer Labels
WordNet (Fellbaum, 1998) is a broad-coverage
machine-readable dictionary. Release 1.71 of the
English version lists about 150,000 entries for all
open-class words, mostly nouns (109,000 types), but
also verbs, adjectives, and adverbs. WordNet is or-
ganized as a network of lexicalized concepts, sets of
synonyms called synsets; e.g., the nouns {chairman,
chairwoman, chair, chairperson} form a synset. A
word that belongs to several synsets is ambiguous.
To facilitate the development of WordNet, lexi-
cographers organize synsets into several domains,
based on syntactic category and semantic coherence.
Each noun synset is assigned one out of 26 broad
categories¹. Since these broad categories group to-
gether very many synsets, i.e., word senses, we call
them supersenses. The supersense labels that Word-
Net lexicographers use to organize nouns are listed
in Table 1². Notice that since the lexicographer la-
bels are assigned to synsets, often ambiguity is pre-
served even at this level. For example, chair has
three supersenses: “person”, “artifact”, and “act”.
This set of labels has a number of attractive fea-
tures for the purposes of lexical acquisition. It is
fairly general and therefore small. The reasonable
size of the label set makes it possible to apply state-
of-the-art machine learning methods. Otherwise,
classifying new words at the synset level defines a
multiclass problem with a huge class space - more
than 66,000 noun synsets in WordNet 1.6, more than
75,000 in the newest release, 1.71 (cf. also (Cia-
ramita, 2002) on this problem). At the same time
the labels are not too abstract or vague. Most of the
classes seem natural and easily recognizable. That
is probably why they were chosen by the lexicog-
raphers to facilitate their task. But there are more
important practical and methodological advantages.
3.2 Extra Training Data from WordNet
WordNet contains a great deal of information about
words and word senses. The information contained
¹There are also 15 lexicographer classes for verbs, 3 for ad-
jectives and 1 for adverbs.
²The label “Tops” refers to about 40 very general synsets,
such as “phenomenon”, “entity”, “object”, etc.
in the dictionary’s glosses is very similar to what
is typically listed in normal dictionaries: synonyms,
definitions and example sentences. This suggests a
very simple way in which it can be put into use: it
can be compiled into training data for supersense la-
bels. This data can then be added to the data ex-
tracted from the training corpus.
For several thousand concepts WordNet’s glosses
are very informative. The synset “chair” for example
looks as follows:
    president, chairman, chairwoman, chair,
    chairperson – (the officer who presides at
    the meetings of an organization); “address your
    remarks to the chairperson”.
In WordNet 1.6, 66,841 synsets contain definitions
(in parentheses above), and 6,147 synsets contain
example sentences (in quotation marks). As we
show below, this information about word senses is
useful for supersense tagging. Presumably this is be-
cause if it can be said of a “chairperson” that she can
“preside at meetings” or that “a remark” can be “ad-
dressed to her”, then logically speaking these things
can be said of the superordinates of “chairperson”,
like “person”, as well.
Therefore information at the synset level is rele-
vant also at the supersense level. Furthermore, while
individually each gloss doesn’t say too much about
the narrow concept it is attached to (at least from
a machine learning perspective), at the supersense
level this information accumulates. In fact it forms
a small corpus of supersense-annotated data that can
be used to train a classifier for supersense tagging of
words or for other semantic classification tasks.
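To make this concrete, the following is a minimal sketch of how such a gloss corpus might be compiled. It uses NLTK's WordNet interface as a stand-in for the WordNet 1.6 database files used here; lexname() returns the lexicographer class, i.e., the supersense.

    # A minimal sketch of compiling WordNet glosses into supersense-
    # labeled training text. Assumes NLTK with its WordNet corpus
    # installed; the paper worked from the WordNet 1.6 files directly.
    from nltk.corpus import wordnet as wn

    def gloss_corpus():
        """Yield (text, supersense) pairs from noun definitions and examples."""
        for synset in wn.all_synsets(pos=wn.NOUN):
            supersense = synset.lexname()   # lexicographer class, e.g. "noun.person"
            if synset.definition():
                yield synset.definition(), supersense
            for sentence in synset.examples():
                yield sentence, supersense

    # e.g. ("the officer who presides at the meetings of an organization",
    #       "noun.person") for the synset discussed above.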
3.3 Evaluation Methods
Formulating the problem in this fashion also makes
it possible to define a very natural evaluation pro-
cedure. Systems can be trained on nouns listed in
a given release of WordNet and tested on the nouns
introduced in a later version. The set of lexicogra-
pher labels remains constant and can be used across
different versions.
In this way systems can be tested on a more real-
istic lexical acquisition task - the same task that lex-
icographers carried out to extend the database. The
task is then well defined and motivated, and easily
standardizable.
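Assuming the noun vocabulary of each release has been extracted to a word list (the file names below are hypothetical), the split itself is a simple set difference:

    # Sketch of the release-based evaluation split. The two files are
    # hypothetical word lists, one noun lemma per line, extracted from
    # the old and new WordNet releases.
    def release_split(old_list="wn16_nouns.txt", new_list="wn171_nouns.txt"):
        with open(old_list) as f:
            old_nouns = {line.strip() for line in f if line.strip()}
        with open(new_list) as f:
            new_nouns = {line.strip() for line in f if line.strip()}
        # Train on nouns the old release knows; test on nouns added since.
        return old_nouns, new_nouns - old_nouns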
[Figure 1. Cumulative distribution of supersense labels in
Bllip. x-axis: supersense labels, ordered by decreasing fre-
quency as in Table 1; y-axis: cumulative relative frequency.
Curves: new nouns in WN 1.71 (test) and nouns in WN 1.6
(train).]
3.4 Relation to Named-Entity Tasks
The categories typically used in named-entity recog-
nition tasks are a subset of the noun supersense la-
bels: “person”, “location”, and “group”. Small la-
bel sets like these can be sufficient in named-entity
recognition. Collins and Singer (1999), for exam-
ple, report that 88% of the named entities occur-
ring in their data set belong to these three cate-
gories.
The distribution of common nouns, however, is
more uniform. We estimated this distribution by
counting the occurrences of 744 unambiguous com-
mon nouns newly introduced in WordNet 1.71. Fig-
ure 1 plots the cumulative frequency distribution of
supersense tokens; the labels are ordered by decreas-
ing relative frequency as in Table 1.
The most frequent supersenses are “person”,
“communication”, “artifact” etc. The three most fre-
quent supersenses account for a little more than 50%
of all tokens, and 9 supersenses account for 90% of
all tokens. A larger number of labels is needed for
supersense tagging than for named-entity recogni-
tion. The figure also shows the distribution of labels
for all unambiguous tokens in WordNet 1.6; the two
distributions are quite similar.
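The estimate itself is straightforward; the following sketch computes the cumulative curve from a hypothetical list of token labels:

    # Sketch of the estimate behind Figure 1: given the supersense of
    # each token of the new unambiguous nouns, compute the cumulative
    # relative frequency of labels, ordered by decreasing frequency.
    from collections import Counter

    def cumulative_distribution(token_labels):
        counts = Counter(token_labels)
        total = sum(counts.values())
        curve, running = [], 0.0
        for label, count in counts.most_common():   # most frequent first
            running += count / total
            curve.append((label, running))
        return curve

    # Per the distribution reported above, the first three entries
    # should cover a little more than 50% of the tokens.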
4 Experiments
The “new” nouns in WordNet 1.71 and the “old”
ones in WordNet 1.6 constitute the test and training
data that we used in our word classification exper-
iments. Here we describe the experimental setup:
training and test data, and features used.
4.1 Training Data
We extracted from the Bllip corpus all occur-
rences of nouns that have an entry in WordNet 1.6.
Bllip (BLLIP, 2000) is a 40-million-word syntac-
tically parsed corpus. We used the parses to ex-
tract the syntactic features described below. We then
removed all ambiguous nouns, i.e., nouns that are
tagged with more than one supersense label (72%
of the tokens, 28.9% of the types). In this way we
avoided dealing with the problem of ambiguity³.
We extracted a feature vector for each noun in-
stance using the feature set described below. Each
vector is a training instance. In addition we com-
piled another training set from the example sen-
tences and from the definitions in the noun database
of WordNet 1.6. Overall this procedure produced
787,186 training instances from Bllip, 66,841 train-
ing instances from WordNet’s definitions, and 6,147
training instances from the example sentences.
4.2 Features
We used a mix of standard features used in word
sense disambiguation, named-entity classification
and lexical acquisition. The following sentence il-
lustrates them: “The art-students, nine teen-agers,
read the book”, where art-students is the tagged noun:
1. part of speech of the neighboring words, e.g., P-1=DT,
P0=NNS, ...

2. single words in the surrounding context, e.g., W=read,
W=book, W=the, W=nine, ...

3. bigrams and trigrams, e.g., W-1,+1="the nine",
W+1,+2="nine teen-agers", ...

4. syntactically governed elements under a given phrase,
e.g., the determiner "the" under the target NP

5. syntactically governing elements under a given phrase,
e.g., the verb "read"

6. coordinates/appositives, e.g., "teen-agers"

7. spelling/morphological features: prefixes, suffixes, com-
plex morphology, e.g., prefixes "a-", "ar-", ..., suffixes
"-s", "-ts", ..., components "art" and "student"
³A simple option for dealing with ambiguous words would be
to distribute an ambiguous noun’s counts to all its senses. How-
ever, in preliminary experiments we found that better accuracy
is achieved using only non-ambiguous nouns. We will investi-
gate this issue in future research.
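As an illustration, the sketch below computes a rough version of the surface features (items 1-3 and 7) for a tagged noun occurrence; the feature-name format and window sizes are our assumptions, and the parse-based features (items 4-6) are omitted.

    # A rough sketch of the surface features for one noun occurrence;
    # names and windows are illustrative, parse features are omitted.
    def extract_features(tokens, pos_tags, i):
        """Surface features for the noun at tokens[i], e.g. "art-students"."""
        word = tokens[i]
        feats = []
        # 1. part of speech of the neighboring words
        for off in (-1, 1):
            j = i + off
            if 0 <= j < len(tokens):
                feats.append(f"P{off:+d}={pos_tags[j]}")
        # 2. single words in the surrounding context
        feats += [f"W={t.lower()}" for j, t in enumerate(tokens) if j != i]
        # 3. bigram around the target (trigrams are analogous)
        left = tokens[i - 1] if i > 0 else "<s>"
        right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        feats.append(f"W-1,+1={left}_{right}")
        # 7. spelling/morphological features: prefixes and suffixes
        for k in (1, 2, 3):
            if len(word) > k:
                feats.append(f"PRE={word[:k]}")
                feats.append(f"SUF={word[-k:]}")
        return feats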
Open class words were morphologically simpli-
fied with the “morph” function included in Word-
Net. We parsed the WordNet definitions and exam-
ple sentences with the same syntactic parser used for
Bllip (Charniak, 2000).
It is not always possible to identify the noun that
represents the synset in the WordNet glosses. For
example, in the gloss for the synset relegation the
example sentence is “He has been relegated to a post
in Siberia”, where a verb is used instead of the noun.
When it was possible to identify the target noun the
complete feature set was used; otherwise only the
surrounding-word features (2) and the spelling fea-
tures (7) of all synonyms were used. With the def-
initions it is much harder to individuate the target;
consider the definition “a member of the genus Ca-
nis” for dog. For all definitions we used only the
reduced feature set. One training instance per synset
was extracted from the example sentences and one
training instance from the definitions. Overall, in
the experiments we performed we used around 1.5
million features.
4.3 Evaluation
In a similar way to how we produced the training
data we compiled a test set from the Bllip corpus.
We found all instances of nouns that are not in Word-
Net 1.6 but are listed in WordNet 1.71 with only
one supersense. The majority of the novel nouns in
WordNet 1.71 are unambiguous (more than 90%).
There were 744 new noun types, with a total fre-
quency of 9,537 occurrences. We refer to this test
set as Test1.71.
We also randomly removed 755 noun types
(20,394 tokens) from the training data and used them
as an alternative test set. We refer to this other test
set as Test1.6. We then ran experiments using the
averaged multiclass perceptron.
5 The Multiclass Averaged Perceptron
We used a multiclass averaged perceptron classifier,
an “ultraconservative” on-line learning algorithm
(Crammer and Singer, 2002) that extends standard
perceptron learning to the multiclass case. It takes
as input a training set S = \{(x_i, y_i)\}_{i=1}^{n},
where each x_i \in \mathbb{R}^d represents an instance
of a noun and y_i \in Y. Here Y is the set of
supersenses defined by WordNet. Since for training
and testing we used only unambiguous words, there is
always exactly one label per instance. Thus S
summarizes n word tokens that belong to the
dictionary, where each instance i is represented as a
vector of features x_i extracted from the context in
which the noun occurred; d is the total number of
features; and y_i is the true label of x_i.

Algorithm 1 Multiclass Perceptron
 1: input training data \{(x_i, y_i)\}_{i=1}^{n}, V = 0
 2: repeat
 3:   for i = 1, ..., n do
 4:     if H(x_i; V) \neq y_i then
 5:       v_{y_i} \leftarrow v_{y_i} + x_i
 6:       E_i = \{ y \in Y : \langle v_y, x_i \rangle \geq \langle v_{y_i}, x_i \rangle, y \neq y_i \}
 7:       for y \in E_i do
 8:         v_y \leftarrow v_y - (1/|E_i|) x_i
 9:       end for
10:     end if
11:   end for
12: until no more mistakes
In general, a multiclass classifier for the dictio-
nary is a function H : \mathbb{R}^d \to Y that maps
feature vectors x to one of the possible supersenses
of WordNet. In the multiclass perceptron, one intro-
duces a weight vector v_y \in \mathbb{R}^d for every
y \in Y and defines H implicitly by the so-called
winner-take-all rule

    H(x; V) = \arg\max_{y \in Y} \langle v_y, x \rangle .    (1)

Here V \in \mathbb{R}^{d \times |Y|} refers to the matrix
of weights, with every column corresponding to one of
the weight vectors v_y.
The learning algorithm works as follows: Train-
ing patterns are presented one at a time in
the standard on-line learning setting. Whenever
H(x_i; V) \neq y_i an update step is performed; oth-
erwise the weight vectors remain unchanged. To
perform the update, one first computes the error set
E_i containing those class labels that have received a
higher score than the correct class:

    E_i = \{ y \in Y : \langle v_y, x_i \rangle \geq \langle v_{y_i}, x_i \rangle, y \neq y_i \}    (2)
An ultraconservative update scheme in its most gen-
eral form is then defined as follows: Update
v_y \leftarrow v_y + \tau_y x_i with learning rates
fulfilling the constraints \tau_{y_i} = 1,
\sum_{y \neq y_i} \tau_y = -1, and \tau_y = 0 for
y \notin E_i \cup \{y_i\}. Hence changes are limited to
v_y for y \in E_i \cup \{y_i\}. The sum constraint
ensures that the update is balanced, which is crucial
to guaranteeing the convergence of the learning
procedure (cf. (Crammer and Singer, 2002)). We have
focused on the simplest case of uniform update
weights, \tau_y = -1/|E_i| for y \in E_i. The
algorithm is summarized in Algorithm 1.
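A minimal NumPy sketch of this update rule, with the error set computed from the scores at prediction time, might look as follows; the dense matrix layout and the class interface are our assumptions.

    # Sketch of the ultraconservative update with uniform weights:
    # +x_i to the correct class, -x_i/|E_i| to each error-set class.
    import numpy as np

    class MulticlassPerceptron:
        def __init__(self, n_features, n_classes):
            self.V = np.zeros((n_features, n_classes))  # columns are the v_y

        def predict(self, x):
            return int(np.argmax(x @ self.V))           # winner-take-all, eq. (1)

        def update(self, x, y):
            scores = x @ self.V
            if int(np.argmax(scores)) == y:
                return                                  # correct: no change
            # Error set of eq. (2): classes scoring at least as high as y.
            error_set = np.flatnonzero(scores >= scores[y])
            error_set = error_set[error_set != y]
            self.V[:, y] += x                           # tau for the true class is 1
            self.V[:, error_set] -= (x / len(error_set))[:, None]

        def fit(self, X, Y, epochs=10):
            for _ in range(epochs):
                for x, y in zip(X, Y):
                    self.update(x, y)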
Notice that the multiclass perceptron algorithm
learns all weight vectors in a coupled manner, in
contrast to methods that perform multiclass classifi-
cation by combining binary classifiers, for example,
training a classifier for each class in a one-against-
the-rest manner.
The averaged version of the perceptron (Collins,
2002), like the voted perceptron (Freund and
Schapire, 1999), reduces the effect of over-training.
In addition to the matrix of weight vectors V, the
model keeps track, for each feature weight f, of each
value f_j it assumed during training and of the number
of consecutive training instance presentations during
which this weight was not changed, its “life span”
ls(f_j). When training is done these weights are av-
eraged and the final averaged weight f_{avg} of feature
f is computed as

    f_{avg} = \frac{\sum_j f_j \, ls(f_j)}{\sum_j ls(f_j)}    (3)
For example, if there is a feature weight that is
not updated until example 500, at which point it is
incremented to value 1, and is not touched again
until after example 1000, then the average weight
of that feature in the averaged perceptron at ex-
ample 750 will be (0 \cdot 500 + 1 \cdot 250) / (500 + 250),
or 1/3. At example 1000 it will be 1/2, etc.
eraged model for evaluation and parameter setting;
see below. Figure 2 plots the results on test data of
both models. The average model produces a better-
performing and smoother output.
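A small sketch of the bookkeeping behind equation (3), for a single feature weight; the (value, life span) history representation is our assumption.

    # Sketch of eq. (3) for one feature weight: each value the weight
    # took is weighted by its "life span", the number of consecutive
    # instance presentations during which it stayed unchanged.
    def averaged_weight(history):
        """history: list of (value, life_span) pairs recorded in training."""
        total = sum(span for _, span in history)
        return sum(value * span for value, span in history) / total

    # The worked example above: weight 0 for 500 presentations, then 1
    # for 250 presentations; the average at presentation 750 is 1/3.
    assert abs(averaged_weight([(0.0, 500), (1.0, 250)]) - 1 / 3) < 1e-9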
5.1 Parameter Setting
We used an implementation with full, i.e., not
sparse, representation of the matrix for the percep-
tron. Training and testing are fast, at the expense of a
[Figure 2. Results on Test1.71 of the basic and averaged
perceptron. x-axis: epochs (0-1,000); y-axis: accuracy on
test (27-38%). The averaged perceptron curve is higher and
smoother than the basic perceptron curve.]
slightly greater memory load. Given the great num-
ber of features, we couldn’t use the full training set
from the Bllip corpus. Instead we randomly sampled
roughly half of the available training data, yielding
around 400,000 instances; with the WordNet data
included, the training set is close to 500,000 in-
stances. When training to test on Test1.6, we
removed from the WordNet training set the synsets
corresponding to the nouns in Test1.6.
The only adjustable parameter to set is the number
of passes on the training data, or epochs. While test-
ing on Test1.71 we set this parameter using Test1.6,
and vice versa for Test1.6. The estimated values for
the stopping iterations were very close, at roughly ten
passes. As Figure 2 shows, the great amount of data
requires many passes over the data, around 1,000,
before reaching convergence (on Test1.71).
6 Results
The classifier outputs the estimated supersense label
of each instance of each unknown noun type. The
label \hat{y}(n) of a noun type n is obtained by voting⁴:

    \hat{y}(n) = \arg\max_{y \in Y} \sum_{x \in n} [[ H(x; V) = y ]]    (4)

where [[ \cdot ]] is the indicator function and x \in n
means that x is a token of type n. The score on n is
1 if \hat{y}(n) equals the correct label of n, and 0
otherwise.

⁴During preliminary experiments we also tried creating one
single aggregate pattern for each test noun type, but this
method produced worse results.
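A sketch of this vote; the representation of token predictions as a plain list is our assumption.

    # Sketch of the vote of eq. (4): each token of an unknown noun type
    # contributes its predicted supersense; the type gets the majority label.
    from collections import Counter

    def type_label(token_predictions):
        """token_predictions: predicted supersense for each token of a type."""
        return Counter(token_predictions).most_common(1)[0][0]

    # e.g. type_label(["artifact", "artifact", "person"]) == "artifact"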
Method        Token   Type    Test set
Baseline      20.0    27.8
AP-B-55       35.9    50.7    Test1.71
AP-B-65       36.1    50.8
AP-B-55+WN    36.9    52.9
Baseline      24.1    21.0
AP-B-55       47.4    47.7    Test1.6
AP-B-65       47.9    48.3
AP-B-55+WN    52.3    53.4

Table 2. Experimental results (accuracy %).
Table 2 summarizes the results of the experiments
on Test1.71 (upper half) and on Test1.6 (bottom half).
A baseline was computed that always selected the
most frequent label in the training data, “person”,
which is also the most frequent in both Test1.6 and
Test1.71. The baseline performances are in the low
twenties. The first and second columns report per-
formance on tokens and types respectively.
The classifiers’ results are averages over 50 trials
in which a fraction of the Bllip data was randomly
selected. One classifier was trained on 55% of the
Bllip data (AP-B-55). An identical one was trained
on the same data and, additionally, on the WordNet
data (AP-B-55+WN). We also trained a classifier on
65% of the Bllip data (AP-B-65). Adding the Word-
Net data to this training set was not possible because
of memory limitations. The model also trained on
WordNet outperforms on both test sets those trained
only on the Bllip data. A paired t-test proved the
difference between models with and without Word-
Net data to be statistically significant. The “least”
significant difference is between AP-B-65 and AP-
B-55+WN (token) on Test1.6: p < 0.01. In all
other cases the p-level is much smaller.
These results seem to show that the positive im-
pact of the WordNet data is not simply due to the
fact that there is more training data⁵. Adding the
WordNet data seems more effective than adding an
equivalent amount of standard training data. Fig-
ure 3 plots the results of the last set of (single trial)
experiments we performed, in which we varied the
⁵Notice that 10% of the Bllip data is approximately the size
of the WordNet data and therefore AP-B-65 and AP-B-55+WN
are trained on roughly the same amount of data.
[Figure 3. Results on Test1.71 as the amount of training
data is increased. x-axis: percentage of training data used
(5-55%); y-axis: accuracy on types, Test1.71 (45-55%).
Curves: AP-B-55+WN and AP-B-55.]
amount of Bllip data to be added to the WordNet
one. The model with WordNet data often performs
better than the model trained only on Bllip data even
when the latter training set is much larger.
Two important reasons why the WordNet data is
particularly good are, in our opinion, the following.
The data is less noisy because it is extracted from
sentences and definitions that are always “pertinent”
to the class label. The data also contains instances
of disambiguated polysemous nouns, which were
instead excluded from the Bllip training. This means
that disambiguating the training data is important;
unfortunately this is not a trivial task. Using the
WordNet data provides a simple way of getting at
least some information from ambiguous nouns.
7 Differences Between Test Sets
The type scores on both evaluations produced simi-
lar results. This finding supports the hypothesis that
the two evaluations are similar in difficulty, and that
the two versions of WordNet are not inconsistent in
the way they assign supersenses to nouns.
The evaluations show, however, very different
patterns at the token level. This might be due to the
fact that the label distribution of the training data is
more similar to Test1.6 than to Test1.71. In particular,
there are many new nouns in Test1.71 that belong to
“abstract” classes⁶, which seem harder to learn. Ab-
stract classes are also more confusable; i.e., mem-
⁶Such as “communication” (e.g., reaffirmation) or “cogni-
tion” (e.g., mind set).
[Figure 4. Results on types for Test1.6 and Test1.71,
grouped in bins of test noun types ranked by frequency.
x-axis: test noun frequency bins (1-7); y-axis: accuracy
(45-70%).]
bers of these classes are frequently mis-classified
with the same wrong label. A few very frequently
mis-classified pairs are communication/act, commu-
nication/person and communication/artifact.
Because abstract nouns are more frequent
in Test1.71 than in Test1.6, the accuracy on
tokens is much worse in the new evaluation than in
the more standard one. This has an impact also on
the type scores. Figure 4 plots the results on types
for Test1.6 and Test1.71 grouped in bins of test noun
types ranked by decreasing frequency. It shows that
the first bin is harder in Test1.71 than in Test1.6.
Overall, then, it seems that there are similarities
but also important differences between the evalua-
tions. Therefore the new evaluation might define a
more realistic task than cross-validation.
8 Conclusion
We presented a new framework for word sense
classification, based on the WordNet lexicographer
classes, that extends named-entity classification.
Within this framework it is possible to use the in-
formation contained in WordNet to improve classi-
fication and define a more realistic evaluation than
standard cross-validation. Directions for future re-
search include the following topics: disambiguation
of the training data, e.g. during training as in co-
training; learning unknown ambiguous nouns, e.g.,
studying the distribution of the labels the classifier
guessed for the individual tokens of the new word.

References
BLLIP. 2000. 1987-1989 WSJ Corpus Release 1. Linguistic
Data Consortium.
E. Charniak. 2000. A maximum-entropy-inspired parser. In
Proceedings of the 38th Annual Meeting of the Association
for Computational Linguistics.
M. Ciaramita. 2002. Boosting Automatic Lexical Acquisi-
tion with Morphological Information. In Proceedings of the
Workshop on Unsupervised Lexical Acquisition, ACL-02.
M. Collins and Y. Singer. 1999. Unsupervised Models for
Named Entity Classification. In Proceedings of the Joint
SIGDAT Conference on Empirical Methods in Natural Lan-
guage Processing and Very Large Corpora.
M. Collins. 2002. Discriminative Training Methods for Hidden
Markov Models: Theory and Experiments with Perceptron
Algorithms. In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing (EMNLP-02),
pages 1–8.
K. Crammer and Y. Singer. 2002. Ultraconservative Online Al-
gorithms for Multiclass Problems. Technical Report 2001-18,
School of Computer Science and Engineering, Hebrew
University, Jerusalem, Israel.
C. Fellbaum. 1998. WordNet: An Electronic Lexical Database.
MIT Press, Cambridge, MA.
Y. Freund and R. Schapire. 1999. Large Margin Classification
Using the Perceptron Algorithm. Machine Learning, 37.
R. Granger. 1977. FOUL-UP: A Program that Figures Out
Meanings of Words from Context. In Proceedings of the
Fifth International Joint Conference on Artificial Intelli-
gence.
M. Hearst. 1992. Automatic Acquisition of Hyponyms from
Large Text Corpora. In Proceedings of the 14th Interna-
tional Conference on Computational Linguistics.
M. Pasca and S.H. Harabagiu. 2001. The Informative Role of
WordNet in Open-Domain Question Answering. In NAACL
2001 Workshop on WordNet and Other Lexical Resources:
Applications, Extensions and Customizations.
J. Pustejovsky, A. Rumshisky, and J. Castaño. 2002. Rerender-
ing Semantic Ontologies: Automatic Extensions to UMLS
through Corpus Analytics. In Proceedings of the LREC 2002
Workshop on Ontologies and Lexical Knowledge Bases.
E. Riloff and R. Jones. 1999. Learning Dictionaries for In-
formation Extraction by Multi-Level Bootstrapping. In Pro-
ceedings of the Sixteenth National Conference on Artificial
Intelligence.
B. Roark and E. Charniak. 1998. Noun-Phrase Co-Occurrence
Statistics for Semi-Automatic Semantic Lexicon Construc-
tion. In Proceedings of the 36th Annual Meeting of the Asso-
ciation for Computational Linguistics and 17th International
Conference on Computational Linguistics.
U. Zernik. 1991. Introduction. In U. Zernik, editor, Lexical Ac-
quisition: Exploiting On-line Resources to Build a Lexicon.
Lawrence Erlbaum Associates.
