I 
Determining the specificity of nouns from text 
Sharon A. Caraballo and Eugene Charniak 
Dept. of Computer Science 
Brown University 
Providence, RI 02912 
{ SC, ec}@cs, brown, edu 
Abstract 
In this work, we use a large text corpus to 
order nouns by their level of specificity. This 
semantic information can for most nouns be 
determined with over 80% accuracy using 
simple statistics from a text corpus with- 
out using any additional sources of seman- 
tic knowledge. This kind of semantic in- 
formation can be used to help in automat- 
ically constructing or augmenting a lexical 
database such as WordNet. 
1 Introduction 
Large lexical databases such as Word- 
Net (see Fellbaum (1998)) are in com- 
mon research use. However, there are cir- 
cumstances, particularly involving domain- 
specific text, where WordNet does not have 
sufficient coverage. Various automatic meth- 
ods have been proposed to automatically 
build lexical resources or augment existing 
resources. (See, e.g., Riloff and Shepherd 
(1997), Roark and Charniak (1998), Cara- 
hallo (1999), and Berland and Charniak 
(1999).) In this paper, we describe a method 
which can be used to assist in this problem. 
We present here a way to determine the 
relative specificity of nouns; that is, which 
nouns are more specific (or more general) 
than others, using only a large text cor- 
pus and no additional sources of semantic 
knowledge. By gathering simple statistics, 
we are able to decide which of two nouns is 
more specific to over 80% accuracy for nouns 
at "basic level" or below (see, e.g., Lakoff 
(1987)), and about 59% accuracy for nouns 
above basic level. 
It should be noted that specificity by it- 
self is not enough information from which 
to construct a noun hierarchy. This project 
is meant to provide a tool to support other 
methods. See Caraballo (1999) for a detailed 
description of a method to construct such a 
hierarchy. 
2 Previous work 
To the best of our knowledge, this is the first 
attempt to automatically rank nouns based 
on specificity. 
Hearst (1992) found individual pairs of 
hypernyms and hyponyms from text using 
pattern-matching techniques. The sparse- 
ness of these patterns prevents this from be- 
ing an effective approach to the problem we 
address here. 
In Caraballo (1999), we construct a hierar- 
chy of nouns, including hypernym relations. 
However, there are several areas where that 
work could benefit from the research pre- 
sented here. The hypernyms used to label 
the internal nodes of that hierarchy are cho- 
sen in a simple fashion; pattern-matching as 
in Hearst (1992) is used to identify candi- 
date hypernyms of the words dominated by a 
particular node, and a simple voting scheme 
selects the hypernyms to be used. The hy- 
pernyms tend to lose quality as one looks 
closer to the root of the tree, generally by 
being too specific. This work could help to 
choose more general hypernyms from among 
the candidate words. In addition, it could 
be used to correct places where a more spe- 
cific word is listed as a hypernym of a more 
general word, and to select between hyper- 
nyms which seem equally good by the voting 
scheme (which is currently done arbitrarily). 
63 
3 Methods for determining 
specificity 
We tested several methods for ordering 
nouns by their specificity. Each of these 
methods was trained on the text of the 1987 
Wall Street Journal corpus, about 15 mil- 
lion words. When parsed text was needed, 
it was obtained using a parser recently de- 
veloped at Brown which performs at about 
the 88% level on the standard precision and 
recall measures. 
One possible indicator of specificity is how 
often the noun is modified. It seems rea- 
sonable to suppose that very specific nouns 
are rarely modified, while very general nouns 
would usually be modified. Using the parsed 
text, we collected statistics on the probabil- 
ity that a noun is modified by a prenomi- 
nal adjective, verb, or other noun. (In all of 
these measures, when we say "noun" we are 
referring only to common nouns, tagged NN 
or NNS, not proper nouns tagged NNP or 
NNPS. Our results were consistently better 
when proper nouns were eliminated, prob- 
ably since the proper nouns may conflict 
with identically-spelled common nouns.) We 
looked at both the probability that the noun 
is modified by any of these modifiers and 
the probability that the noun is modified 
by each specific category. The nouns, ad- 
jectives, and verbs are all stemmed before 
computing these statistics. 
Po  (noun) = 
count(noun with a prenominal adjective) 
count(noun) 
Pub(noun) = 
count(noun with a prenominal verb) 
count(noun) 
Pn (noun) = 
count(noun with a prenominal noun) 
count(noun) 
Brood(noun) = 
count(noun with prenom, adj, vb, or nn) 
count(noun) 
However, if a noun almost always appears 
with exactly the same modifiers, this may 
be an indication of an expression (e.g., "ugly 
duckling"), rather than a very general noun. 
For this reason, we also collected entropy- 
based statistics. For each noun, we com- 
puted the entropy of the rightmost prenom- 
inal modifier. 
Hmod(noun) =- ~ \[P(modifierlnoun) 
modifier 
* log 2 P(modifier Inoun)\] 
where P(modifierlnoun ) is the probability 
that a (possibly null) modifier is the right- 
most modifier of noun. The higher the en- 
tropy, the more general we believe the noun 
to be. In other words, we are considering not 
just how often the noun is modified, but how 
much these modifiers vary. A great variety 
of modifiers suggests that the noun is quite 
general, while a noun that is rarely modified 
or modified in only a few different ways is 
probably fairly specific. 
We also looked at a simpler measure which 
can be computed from raw text rather than 
parsed text. (For this experiment we used 
the part-of-speech tags determined by the 
parser, but that was only to produce the 
set of nouns for testing. If one wanted to 
compute this measure for all words, or for a 
specific list of words, tagging would be un- 
necessary.) We simply looked at all words 
appearing within an n-word window of any 
instance of the word being evaluated, and 
then computed the entropy measure: 
Hn(noun) = - E \[P(w°rdln°un) 
Word 
* log 2 P(word\[noun)\] 
where P(word\]noun) is the probability that 
a word appearing within an n-word win- 
dow of noun is word. Again, a higher en- 
tropy indicates a more general noun. In 
this measure, the nouns being evaluated are 
stemmed, but the words in its n-word win- 
dow are not. 
Finally, we computed the very simple mea- 
sure of frequency (freq(noun)). The higher 
64 
the frequency, the more general we expect 
the noun to be. (Recall that we are using 
tagged text, so it is not merely the frequency 
of the word that is being measured, but the 
frequency of the word or its plural tagged as 
a common noun.) 
This assumed inverse relationship between 
frequency and the semantic content of a 
word is used, for example, to weight the im- 
portance of terms in the standard IDF mea- 
sure used in information retrieval (see, e.g., 
Sparck Jones (1972)), and to weight the im- 
portance of context words to compare the 
semantic similarity of nouns in Grefenstette 
(1993). 
4 Evaluation 
To evaluate the performance of these mea- 
sures, we used the hypernym data in Word- 
Net (1998) as our gold standard. (A word 
X is considered a hypernym of a word Y if 
native speakers accept the sentence "Y is a 
(kind of) X.")'We constructed three small 
hierarchies of nouns and looked at how of- 
ten our measures found the proper relation- 
ships between the hypernym/hyponym pairs 
in these hierarchies. 
To select the words for our three hierar- 
chies, we wanted to use sets of words for 
which there would be enough information in 
the Wall Street Journal corpus. We chose 
three clusters produced by a program similar 
to Roark and Charniak (1998) except that it 
is based on a generative probability model 
and tries to classify all nouns rather than 
just those in pre-selected clusters. (All data 
sets are given in the Appendix.) The clus- 
ters we selected represented vehicles (car, 
truck, boat, ...), food (bread, pizza, wine, 
...), and occupations (journalist, engineer, 
biochemist, ...). From the clustered data we 
removed proper nouns and words that were 
not really in our target categories. We then 
looked up the remaining words in Word- 
Net, and added their single-word hypernyms 
to the categories in the correct hierarchical 
structure. (Many WordNet categories are 
described by multiple words, e.g., "motor- 
ized vehicle", and these were omitted for ob- 
vious reasons.) 
For each of these three hierarchies, 
we looked at each hypernym/hyponym 
pair within the hierarchy and determined 
whether each specificity measure placed the 
words in the proper order. The percentage 
each specificity measure placed correctly are 
presented in Table 1. 
Clearly the better measures are perform- 
ing much better than a random-guess algo- 
rithm which would give 50% performance. 
Among the measures based on the parsed 
text (Pmod and its components and Hmod), 
the entropy-based measure Hmod is clearly 
the best performer, as would be expected. 
However, it is interesting to note that the 
statistics based on adjectives alone (Padj) 
somewhat outperform those based on all of 
our prenominal modifiers (P~od). The rea- 
sons for this are not entirely clear. 
Although Hmod is the best performer 
on the vehicles data, freq and Hs0 do 
marginally better overall, with each having 
the best results on one of the data sets. All 
three of these measures, as well as H2 and 
H10, get above 80% correct on average. 
In these evaluations, it became clear that a 
single bad node high in the hierarchy could 
have a large effect on the results. For ex- 
ample, in the "occupations" hierarchy, the 
root node is "person," however, this is not a 
very frequent word in the Wall Street Jour- 
nal corpus and rates as fairly specific across 
all of our measures. Since this node has 
eight children, a single bad value at this 
node can cause eight errors. We therefore 
considered another evaluation measure: for 
each internal node in the tree, we evalu- 
ated whether each specificity measure rated 
this word as more general than all of its de- 
scendants. (This is somewhat akin to the 
idea of edit distance. If we sufficiently in- 
creased the generality measure for each node 
marked incorrect in this system, the hierar- 
chy would match WordNet's exactly.) The 
results for this evaluation are presented in 
Table 2. Although this is a harsher measure, 
it isolates the effect of individual difficult in- 
ternal nodes. 
Although the numbers are lower in Table 
65 
Specificity measure Vehicles\] 
Pmod 65.2 
/:~dj 
Pvb 
Pnn 
Hmod 
65.2 
73.9 
65.2 
91.3 
H2 87.0 
H10 87.0 
Hso 87.0 
Freq 87.0 \[ 
Food 
63.3 
67.3 
42.9 
57.1 
79.6 
79.6 
79.6 
85.7 
83.7 
Occupations Average 
66.7 65.0 
69.7 67.4 
51.5 56.1 
51.5 58.0 
72.7 81.2 
75.8 8O.8 
75.8 8O.8 
75.8 82.8 
78.8 83.1 
Table 1: Percentage of parent-child relationships which 
sure. 
Specificity measure Vehicles Food 
Pmod 44.4 
Padj 33.3 
Pvb 33.3 
Pn~ 55.6 
Hmod 77.8 
H2 66.7 
Hlo 66.7 
Hso 66.7 
Freq 66.7 
57.9 
52.6 
21.1 
21.1 
63.2 
57.9 
63.2 
73.7 
63.2 
Table 2: Percentage of internal nodes 
dants. 
are ordered correctly by each mea- 
Occupations Average 
53.3 51.9 
60.0 48.7 
40.0 31.5 
33.3 36.6 
66.7 69.2 
60.0 61.5 
60.0 63.3 
60.0 
60.0 
having the 
66.8 
63.3 
correct relationship to all of their descen- 
2, the same measures as in Table 1 perform 
relatively well. However, here Hmod has the 
best performance both on average and on 
two of three data sets, while the freq mea- 
sure does a bit less well, now performing at 
about the level of Hi0 rather than Hs0. The 
fact that some of the numbers in Table 2 are 
below 50% should not be alarming, as the av- 
erage number of descendants of an internal 
node is over 5, implying that random chance 
would give performance well below the 50% 
level on this measure. 
Some of these results are negatively af- 
fected by word-sense problems. Some of the 
words added from the WordNet data are 
much more common in the Wall Street Jour- 
nal data for a different word sense than the 
one we are trying to evaluate. For example, 
the word "performer" is in the occupations 
hierarchy, but in the Wall Street Journal this 
word generally refers to stocks or funds (as 
"good performers", for example) rather than 
to people. Since it was our goal not to use 
any outside sources of semantic knowledge 
these words were included in the evaluations. 
However, if we eliminate those words, the re- 
sults are as shown in Tables 3 and 4. 
It is possible that using some kind of 
automatic word:sense disambiguation while 
gathering the statistics would help reduce 
this problem. This is also an area for fu- 
ture work. However, it should be noted that 
on the evaluation measures in Tables 3 and 
4, as in the first two tables, the best results 
are obtained with Hmod, Hso and freq. 
The above results are primarily for nouns 
at "basic level" and below, which includes 
the vast majority of nouns. We also consid- 
ered a data set at basic level and above, with 
"entity" at its root. Table 5 presents the 
66 
Specificity measure Vehicles 
Pmod 65.0 
Padj 70.0 
Pvb 80.0 
P~n 70.0 
Hmo d 100.0 
H2 95.0 
H10 95.0 
H~0 95.0 
Freq 95.0 
Food 
62.5 
66.7 
43.8 
56.3 
81.3 I 
79.2 
79.2 
85.4 
83.31 
Table 3: 
dominant sense are removed. 
Percentage of correct parent-child 
Specificity measure Vehicles 
Brood 
Padj 
Vvb 
Hmod 100.0 
H2 
Hlo 
"-H5 0 
I Occupations Average 
67.7 65.1 
71.0 69.2 
48.4 57.4 
51.6 59.3 
Freq 
74.2 85.1 
77.4 83.9 
77.4 83.9 
77.4 85.9 
80.6 86.3 
relationships when words with the wrong pre- 
Food 
50.0 55.6 61.5 
33.3 50.0 61.5 
33.3 16.7 38.5 
66.7 22.2 30.8 
55.6 I 76.9 
83.3 55.6 69.2 
83.3 61.1 61.5 
83.3 72.2 69.2 
83.3 61.1 I 69.2 
I Occupations Average 
55.7 
48.3 
29.5 
39.9 
77.5 
69.4 
68.7 
74.9 
71.2 
Table 4: Percentage of internal nodes with the correct relationship to all descendants when 
words with the wrong predominant sense are removed. 
results of testing on this data set and each 
measure, for the evaluation measures de- 
scribed above, percentage of correct parent- 
child relationships and percentage of nodes 
in the correct relationship to all of their de- 
scendants. 
Note that on these nouns, freq and H~0 
are among the worst performers; in fact, by 
looking at the parent-child results, we can 
see that these measures actually do worse 
than chance. As nouns start to get extremely 
general, their frequency appears to actually 
decrease, so these are no longer useful mea- 
sures. On the: other hand, Hmod is still one 
of the best performers; although it does per- 
form worse here than on very specific nouns, 
it still assigns the correct relationship to a 
pair of nouns about 59% of the time. 
5 Conclusion 
Determining the relative specificity of nouns 
is a task which can help in automatically 
building or augmenting a semantic lexicon. 
We have identified various measures which 
can identify which of two nouns is more spe- 
cific with over 80% accuracy on basic-level or 
more specific nouns. The best among these 
measures seem to be Hmod, the entropy of 
the rightmost modifier of a noun, Hs0, the 
entropy of words appearing within a 50-word 
window of a target noun, and freq, the fre- 
quency of the noun. These three measures 
perform approximately equivalently. 
If the task requires handling very gen- 
eral nouns as well as those at or below the 
basic level, we recommend using the Hmod 
measure. This measure performs nearly as 
well as the other two on specific nouns, and 
much better on general nouns. However, 
67 
Specificity measure Parent-child All descendants 
Pmod 59.1 46.4 
Padj 60.2 46.4 
Pvb 50.0 35.7 
Pnn 50.0 28.6 
Hmod 59.1 39.3 
H2 53.4 25.0 
H10 45.5 32.1 
Hs0 i 46.6 32.1 
Freq 45.5 32.1 
Table 5: Evaluation of the various specificity measures on a test set of more general nouns. 
if it is known that the task will only in- 
volve fairly specific nouns, such as adding 
domain-specific terms to an existing hier- 
archy which already has the more general 
nouns arranged appropriately, the easily- 
computed freq measure can be used instead. 
6 Acknowledgments 
Thanks to Mark Johnson and to the anony- 
mous reviewers for many helpful sugges- 
tions. This research is supported in part by 
NSF grant IRI-9319516 and by ONR grant 
N0014-96-1-0549. 
68 
Appendix 
Below are the data sets used in these exper- 
iments. Words shown in italics are omitted 
from the results in Tables 3 and 4 because 
the predominant sense in the Wall Street 
Journal text is not the one represented by 
the word's position in this hierarchy. 
Food: 
fo )~(,verage 
-- alcg.hol nquor 
| gin | rum 
| vodka 
i brandy cognac 
I Winc~ampagne 
cola dessert 
bread muffin ' 
cracker cheese 
meat liver 
veal , ham 
~ ingredient 
relish 
I L_ olives ' ketchup 
dish sandwich 
soup 
pizza 
salad butter 
i cake cookie 
egg 
candy I mint 
~pPr apastry sta 
noodles oduce 
~e~ vegetable tomatoes 
shroom ume 
pea 
fruit pineapple 
peaches i oerry 
strawberry 
Vehicles: 
vehicle 
--1 truck 
~--\] van . . 
mlnlvans 
car compact 
limousines jeep 
wagon 
cab sedan 
coupe 
hatchback trailer 
campers 
i craft ~ssel 
yacht 
b°~arges 
motorcycle 
I motorbike wagon 
I cart 
Occupations: 
person 
worker I editor 
technician riter 
journalist 
columnist commentator 
novelist biographer 
intellectual scientist 
I sociologist 
t--7 chemist \[____L_~ biochemist 
physicist 
' scholar ' . historian 
professional physician 
I i speciali_st psychiatrist 
veterinarian 
educator I teacher 
nurse 
dentist 
~p.e leader administrator 
tertainer r former 
comedian I engineer 
' homemaker 
69 
Entities (data used for Table 5): 
entity organism 
person 
animal vermin 
mammal • horse 
dog 
cat 
cattle bird 
chicken duck 
fiSl~erring 
salmon trout 
reptile 
turtle snake 
lizard alligator 
virus bacteria 
microbe -, cause 
\] i danger 
hazard -- menace 
object 
substance food 
" metal alloy 
steel ' bronze 
gold 
silver ir.on 
carcinogen 
liquid 
water locatio.n 
I region 
I country 
state city 
com mq&ty \[ clothing 
appliance 
artifact covering 
paint 
roof curtain 
creation 
I art music 
publication 
\[ book ,, article 
decoration 
sedative interferon 
enclosure fabric 
nylon 
wool facility 
airport headquarters 
station fixture 
structure OUSe 
ctory 
store ~part 
organ \[ 
\]---- ~eart \] ~-a~: ng 
I corner I fragment 
slice need 
variable 
70 

References 

Matthew Berland and Eugene Charniak. 
1999. Finding parts in very large corpora. 
In Proceedings of the 37th Annual Meet- 
ing of the Association for Computational 
Linguistics. 

Sharon A. Caraballo. 1999. Automatic con: 
struction of a hypernym-labeled noun hi- 
erarchy from text. In Proceedings of the 
37th Annual Meeting of the Association 
for Computational Linguistics. 

Christiane Fellbaum, editor. 1998. Word- 
Net: An Electronic Lexical Database. MIT 
Press. 

Gregory Grefenstette. 1993. SEXTANT: 
Extracting semantics from raw text imple- 
mentation details. Heuristics: The Jour- 
nal of Knowledge Engineering. 

Marti A. Hearst. 1992. Automatic acquisi- 
tion of hyponyms from large text corpora. 
In Proceedings of the Fourteenth Interna- 
tional Conference on Computational Lin- 
guistics. 

George Lakoff. 1987. Women, Fire, and 
Dangerous Things: What Categories Re- 
veal about the Mind. University of Chicago 
Press. 

Ellen Riloff and Jessica Shepherd. 1997. 
A corpus-based approach for building se- 
mantic lexicons. In Proceedings of the Sec- 
ond Conference on Empirical Methods in 
Natural Language Processing, pages 117-124. 

Brian Roark and Eugene Charniak. 1998. 
Noun-phrase co-occurrence statistics for 
semi-automatic semantic lexicon construc- 
tion. In COLING-ACL '98: 36th An- 
nual Meeting of the Association for Com- 
putational Linguistics and 17th Interna- 
tional Conference on Computational Lin- 
guistics: Proceedings of the Conference, 
pages 1110-1116. 

Karen Sparck Jones. 1972. A statistical in- 
terpretation of term specificity and its ap- 
plication in retrieval. Journal of Docu- 
mentation, 28:11-21. 
