Automatic Selection of Class Labels from a Thesaurus for an Effective 
Semantic Tagging of Corpora. 
Alessandro Cucchiarelli†  Paola Velardi‡
†Università di Ancona - alex@inform.unian.it  ‡Università di Roma 'La Sapienza' - velardi@dsi.uniroma1.it
Abstract 
It is widely accepted that tagging text with se- 
mantic information would improve the quality of 
lexical learning in corpus-based NLP methods. 
However, available on-line taxonomies are
rather entangled and introduce an unnecessary 
level of ambiguity. The noise produced by the re- 
dundant number of tags often overrides the ad- 
vantage of semantic tagging. In this paper we 
propose an automatic method to select from 
WordNet a subset of domain-appropriate cate- 
gories that effectively reduce the overambiguity of WordNet, and help identify and categorise relevant language patterns in a more compact way. The method is evaluated against a
manually tagged corpus, SEMCOR. 
1 Introduction 
It is well known that statistically-based approaches to 
lexical knowledge acquisition are faced with the problem 
of low counts. Many language patterns (from simple co- 
occurrences to more complex syntactic associations 
among words) occur very rarely, or are never encoun- 
tered, in the learning corpus. Since rare patterns are the 
majority, the quality and coverage of lexical learning may be severely affected.
The obvious strategy to reduce this problem is to gener- 
alise word patterns according to some clustering tech- 
niques. In the literature, two generalisation strategies 
have been adopted: 
Distributional approaches: Several papers adopt distribu- 
tional techniques to identify clusters of words according 
to some defined measure of similarity. Among these, in 
(Grishman and Sterling, 1994) a method is proposed to 
cluster syntactic triples, while in (Pereira and Tishby 
1992, 1993), (Dagan et al., 1994) pure bigrams are anal- 
ysed. 
The most intuitive evaluation of the effectiveness of dis- 
tributional approaches to the problem of word general- 
ization is presented in (Grishman and Sterling, 1994). In 
this paper it is argued that distributional (also called
smoothing) techniques introduce a certain degree of addi- 
tional error, because co-occurrences may be erroneously 
conflated in a cluster, and some of the co-occurrences be- 
ing generalized are themselves incorrect. In general the 
effect is a higher recall at the price of a lower precision. 
Another drawback of these methods is that, since clusters 
have only a numeric description, they are often hard to 
evaluate on linguistic grounds.
Semantic tagging: Another adopted solution is to gener- 
alise the observed word patterns by grouping patterns in 
which words have the same semantic tag. Semantic tags 
are assigned from on-line thesauri like WordNet (Basili
et al, 1996) (Resnik, 1995), Roget's categories (Yarowsky 
1992) (Chen and Chen, 1996), the Japanese BGH (Utsuro 
et al, 1993), or assigned manually (Basili et al, 1992) 1. 
The obvious advantage of semantic tags is that words are 
clustered according to an intuitive principle (they belong 
to the same concept) rather than to some probabilistic 
measure. Semantic tagging has been proven useful for 
learning and categorising interesting relations among 
words, and for systematic lexical learning in sublan- 
guages, as shown in (Basili et al, 1996) and (Basili et al, 
1996b). 
On the other hand, semantic tagging has a serious draw- 
back, which is not solely due to the limited availability 
of on-line resources, but rather to the entangled structure 
of thesauri. WordNet and Roget's thesauri have not been
conceived, despite their success among researchers in lex- 
ical statistics, as tools for automatic language processing. 
The purpose was rather to provide linguists with a
very refined, general purpose, linguistically motivated 
source of taxonomic knowledge. 
As a consequence, in most on-line thesauri words are ex-
tremely ambiguous, with very subtle distinctions among 
senses. 
(Dolan, 1994) and (Krovetz and Croft, 1992) claim that 
fine-grained semantic distinctions are unlikely to be of 
practical value for many applications. Our experience 
supports this claim: often, what matters is to be able to 
distinguish among contrastive (Pustejovsky, 1995) ambiguities of the bank_river / bank_organisation flavour.
High ambiguity, entangled nodes, and asymmetry have al- 
ready been emphasised in (Hearst and Schuetze, 1993) as being an obstacle to the effective use of on-line thesauri
in corpus linguistics. In most cases, the noise introduced 
by overambiguity almost overrides the positive effect of 
semantic clustering. For example, in (Brill and Resnik, 
1994) clustering PP heads according to WordNet synsets 
produced only a 1% improvement in a PP disambiguation 
task, with respect to the non-clustered method. There are 
reported cases in which the use of WordNet worsened 
the performance of an automatic indexing method. Even 
context-based sense disambiguation becomes a pro- 
hibitive task on a wide-scale basis, because when words 
in the context of an ambiguous word are replaced by 
1 Manually assigning semantic tags is of course rather time-consuming; however, on-line thesauri are not available for many languages, like Italian.
their synsets, there is a multiplication of possible con- 
texts, rather than a generalization. In (Agirre and Rigau, 
1996) a method called Conceptual Distance is proposed 
to reduce this problem, but the reported performance in disambiguation still does not reach 50%.
A possible alternative is to manually select a set of high- 
level tags from the thesaurus. This approach is adopted 
in (Chen and Chen, 1996) and in (Basili et al, 1996) 
where only a dozen categories are used. As discussed in 
the latter paper, high-level tags reduce the problem of 
overambiguity and allow the detection of more regular 
behaviours in the analysis of lexical patterns. On the 
other hand, high-level tags may be overgeneral, and the 
acquired lexical rules, while they usually perform well in the
task of selecting the correct word associations (for ex- 
ample in PP disambiguation, or sense interpretation), are 
less capable of filtering out the noise. Overgeneral cate- 
gories may even fail to capture contrastive ambiguities of 
words. 
So far the manual selection of an appropriate set of se- 
mantic tags has been a matter of personal intuitions, but 
we believe that this task should be performed in a more 
principled, and automatic, way. 
In this paper, we present a method for the selection of the 
"best-set" of WordNet categories for an effective, domain- 
tailored, semantic tagging of a corpus. The purpose of the 
method is to automatically select: 
• A domain-appropriate set of categories, that well 
represent the semantics of the domain. 
• A "right" level of abstraction, so as to mediate at best 
between overambiguity and overgenerality. 
• A balanced (for the domain) set of categories, i.e. 
words should be evenly distributed among cate- 
gories. 
The second feature is the most important, since as we re- 
marked so far, assigning semantic characteristics to 
words is very useful in lexical learning tasks, but over- 
ambiguity is the major obstacle to an effective use of thesauri in semantic tagging.
In the following sections, we define a method for the au- 
tomatic selection of the "best-set" of WordNet categories for nouns, given an application corpus.
First, an iterative method is used to create alternative 
sets of balanced categories. Sets have an increasing level 
of generality. Second, a scoring function is applied to al- 
ternative sets to identify the "best" set. The best set is 
modelled as the linear function of four performance fac- 
tors: generality, coverage of the domain, average ambigu- 
ity, and discrimination power. An interpolation method is 
adopted to estimate the parameters of the model against a 
reference, correctly tagged, corpus (SEMCOR). The per- 
formance of the selected set of categories is evaluated in 
terms of effective reduction of overambiguity. 
The described method only requires a medium-range 
(stemmed) application corpus and a thesaurus. The model 
parameters are tuned against a reference correctly tagged 
corpus, but this is not strictly necessary if correctly 
tagged corpora are not available. 
2 Selection of Alternative Sets of Semantic Categories from WordNet
The first step of the method is generating alternative sets 
of WordNet categories. Alternative sets are selected ac- 
cording to the following principles: 
• Balanced categories: words must be uniformly distributed among categories of a set;
• Increasing level of generality: alternative sets are se- 
lected by uniformly increasing the level of generality 
of the categories belonging to a set; 
• Domain-appropriateness: selected categories in a set 
are those pointed to by (an increasingly large number
of) words of the application domain, weighted by 
their frequency in the corpus. 
The set-generation algorithm is an iterative application 
of the algorithm proposed in (Hearst and Schuetze, 1993)
for creating WordNet categories of a fixed average size. 
In its modified version, the algorithm is as follows2: 
Let S be a set of WordNet synsets s, W the set of different 
words (nouns) in the corpus, P(s) the number of words in 
W that are instances of s, weighted by their frequency, 
UB and LB the upper and lower bounds for P(s), and N, h and k constant values.
i = 1
UB = N
LB = UB * h
do
{   initialise S with the set of WordNet topmost synsets;
    initialise the set of categories Ci with the empty set;
    new_cat(S);
    if i = 1 or Ci ≠ Ci-1 then add Ci to the set of category sets;
    i = i + 1;
    UB = UB + k;
    LB = UB * h;
}
while Ci is not an empty set;

where:

new_cat(S):
for any category s of S {
    if s does not belong to Ci then
    {   if P(s) <= UB and P(s) >= LB
            then put s in the set Ci
        else if P(s) > UB
            then {
                let S' be the set of direct descendants of s;
                new_cat(S')
            }
        else add s to SCT(Ci)
    }
}
2 The procedure new_cat(S) is almost the same as in (Hearst and Schuetze, 1993). For the sake of brevity, the algorithm is not further explained here.
N, h and k are the initial parameters of the algorithm. We 
experimentally observed that only h (the ratio between 
lower and upper bound) significantly modifies the result- 
ing sets of categories (Ci): we established that a good 
compromise is h=0.4. SCT(C i) is the set of "smaller" 
WordNet categories with P(s)<LB that do not belong to 
the Ci set (see next section). 
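As an illustration, the following Python sketch mirrors the iterative set-generation procedure above. It is a sketch under stated assumptions, not the authors' code: topmost (the WordNet topmost synsets), descendants(s), and weight(s) (the frequency-weighted count P(s)) are hypothetical helpers standing in for a WordNet interface.

def new_cat(synsets, C, SCT, UB, LB, weight, descendants):
    # Recursively collect categories whose weighted population P(s) falls in [LB, UB].
    for s in synsets:
        if s in C:
            continue
        p = weight(s)
        if LB <= p <= UB:
            C.add(s)                  # balanced category: keep it
        elif p > UB:
            # overpopulated: branch into the direct descendants
            new_cat(descendants(s), C, SCT, UB, LB, weight, descendants)
        else:
            SCT.add(s)                # underpopulated: set aside in SCT(Ci)

def generate_sets(topmost, weight, descendants, N=2000, h=0.4, k=1000):
    # Produce the alternative sets Ci of increasing generality.
    sets, UB = [], N
    while True:
        C, SCT = set(), set()
        new_cat(topmost, C, SCT, UB, UB * h, weight, descendants)
        if not C:                     # stop when no balanced category survives
            break
        if not sets or C != sets[-1][1]:
            sets.append((UB, C))
        UB += k
    return sets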
3 Scoring Alternative Sets of Categories
The algorithm of section 2 creates alternative sets of bal- 
anced and increasingly general categories C i. We now 
need a scoring function to evaluate these alternatives. 
The following performance factors have been selected to 
express the scoring function: 
Generality: In principle, we would like to represent the semantics of the domain using the highest possible level of generalisation. We can express the generality G'(Ci) as 1/DM(Ci), where DM(Ci) is the average distance between the categories of Ci and the WordNet topmost synsets. Due to the graph structure of WordNet, different paths may connect each element cij of Ci with different topmosts, therefore we compute DM(Ci) as:

DM(Ci) = (1/n) * Σj dm(cij)

where dm(cij) is the average distance of each cij from the topmosts. Figure 1 illustrates a possible synset hierarchy in which, for Ci = {ci1, ci2}, dm(ci1) = (4+3)/2 = 3.5 and dm(ci2) = 3, so DM(Ci) = (3.5+3)/2 = 3.25.
Figure 1 - An example of synset hierarchy (legend: topmost synset, general synset)
As defined, G'(Ci) is a linear function (low values for low generality, high values for high generality), whilst our goal is to mediate at best between overspecificity and overgenerality. Therefore, we model the generality as G(Ci) = G'(Ci)*Gauss(G'(Ci)), where Gauss(G'(Ci)) is a Gaussian distribution function computed using the average and the variance of the G'(Ci) values over all the sets of categories Ci selected by the algorithm in section 2, normalised in the [0,1] interval.
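A minimal sketch of how this factor could be computed, assuming a hypothetical helper dm(c) that returns the average distance of a category c from the WordNet topmost synsets; the Gaussian weighting and normalisation follow our reading of the definition above.

import math

def g_prime(C_i, dm):
    # G'(Ci) = 1 / DM(Ci), with DM(Ci) the mean distance from the topmost synsets
    DM = sum(dm(c) for c in C_i) / len(C_i)
    return 1.0 / DM

def generality_scores(all_sets, dm):
    # G(Ci) = G'(Ci) * Gauss(G'(Ci)), normalised into [0,1] over all generated sets
    gp = [g_prime(C, dm) for C in all_sets]
    mu = sum(gp) / len(gp)
    var = sum((g - mu) ** 2 for g in gp) / len(gp) or 1e-9   # guard against zero variance
    gauss = [math.exp(-((g - mu) ** 2) / (2 * var)) for g in gp]
    scores = [g * w for g, w in zip(gp, gauss)]
    top = max(scores)
    return [s / top for s in scores]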
Coverage: the algorithm of section 2, for any set C i , does 
not allow a full coverage of the nouns in the domain. 
Given a selected pair <UB, LB>, it may well be the case 
that several words are not assigned to any category, be- 
cause when branching from an overpopulated category to 
its descendants, some of the descendants may be under- 
populated. Each iterative step that creates a C i also cre- 
ates a set of underpopulated categories SCT(Ci). To en- 
sure full coverage, these categories may be added to Ci, or 
alternatively, they can be replaced by their direct ances- 
tors, but clearly a "good" selection of C i is one that mini- 
mizes this problem. The coverage CO(Ci) is therefore defined as the ratio Nc(Ci)/|W|, where Nc(Ci) is the number of words that reach at least one category of Ci.
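A sketch of this factor under the same assumptions, with a hypothetical categories_of(w, C_i) returning the categories of Ci reached by any leaf synset of word w:

def coverage(words, C_i, categories_of):
    # CO(Ci) = Nc(Ci) / |W|: fraction of corpus nouns reaching at least one category of Ci
    covered = sum(1 for w in words if categories_of(w, C_i))
    return covered / len(words)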
Discrimination Power: a certain selection of categories 
may not allow a full discrimination of the lowest-level 
senses for a word (leaves-synsets hereafter). Figure 2 il- 
lustrates an example. If Ci = {ci1, ci2, ci3, ci4}, w2 cannot be fully disambiguated by any sense selection algorithm, because two of its leaves-synsets belong to the same category ci2. With respect to w2, ci2 is overgeneral (though
nothing can be said about the actual importance of dis- 
criminating between such two synsets). 
We measure the discrimination power DP(C i) as the ratio 
(Nc(Ci)-Npc(Ci))/Nc(Ci), where Nc(Ci) is the number of 
words that reach at least one category of C i, and Npc(Ci) 
is the number of words that have at least two leaves- 
synsets that reach the same category cij of Ci. For the example of figure 2, DP(Ci) = (3-1)/3 = 0.66.
Figure 2 - An example of distribution of leaves-synsets among categories (legend: general synset, leaf synset, corpus word)
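A sketch of the discrimination power, assuming a hypothetical leaf_categories(w, C_i) that returns, for each leaf synset of word w, the category of Ci it reaches (or None if it reaches none):

def discrimination_power(words, C_i, leaf_categories):
    # DP(Ci) = (Nc(Ci) - Npc(Ci)) / Nc(Ci)
    reached, conflated = 0, 0
    for w in words:
        cats = [c for c in leaf_categories(w, C_i) if c is not None]
        if not cats:
            continue                       # word not covered by Ci
        reached += 1
        if len(cats) > len(set(cats)):     # two leaf synsets share the same category
            conflated += 1
    return (reached - conflated) / reached if reached else 0.0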
Average Ambiguity: each choice of C i in general re- 
duces the initial ambiguity of the corpus. In part, because 
there are leaves-synsets that converge into a single cate- 
gory of the set, in part because there are leaves-synsets of 
a word that do not reach any of these categories. This 
phenomenon is accounted for by the inverse of the average ambiguity A(Ci), which is measured as:

A(Ci) = (1 / Nc(Ci)) * Σ_{j=1..Nc(Ci)} Cwj(Ci)
where Nc(C i) is the number of words that reach at least 
one category of Ci and, for each word wj in this set, 
Cwj(C i) is the number of categories of C i reached. 
In figure 2, the average ambiguity is 2 for the set Ci = {ci1, ci2, ci3, ci4}, and 5/3 = 1.66 for Ci = {ci1, ci2, ci3}.
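With the same hypothetical leaf_categories helper, the average ambiguity can be sketched as:

def average_ambiguity(words, C_i, leaf_categories):
    # A(Ci): mean number of distinct categories of Ci reached by each covered word
    counts = []
    for w in words:
        cats = {c for c in leaf_categories(w, C_i) if c is not None}
        if cats:
            counts.append(len(cats))
    return sum(counts) / len(counts) if counts else 0.0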
Figure 3 - Performance factors G, CO, DP and 1/A, for each generated set of categories
The scoring function for a set of categories C i is defined 
as the linear combination of the performance parameters 
described above: 
(1) Score(Ci) = αG(Ci) + βCO(Ci) + γDP(Ci) + δ(1/A(Ci))
Notice that we assigned a positive effect on the score 
(modelled by l/A) to the ability of eliminating certain 
leaves-synsets and a negative effect (modelled by DP) to 
the inability of discriminating among certain other 
leaves-synsets. This is reasonable in general, because our 
aim is to control overgenerality while reducing overam- 
biguity. However, nothing can be said on the appropri- 
ateness of a specific sense aggregation and/or sense elim- 
ination for a word. It may well be the case that merging 
two senses in a single category is a reasonable thing to 
do, if the senses do not draw interesting (for the domain) 
distinctions. Therefore eliminating a priori a sense of a 
word may be inappropriate in the domain. 
Formula (1) is computed for all the generated sets of categories Ci, and then normalised in the [0,1] interval. The effectiveness of this model is estimated in the following section.
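To make the combination concrete, a minimal sketch of formula (1); the parameter names are ours, and the weights are the model parameters estimated in the next section:

def score(G, CO, DP, A, alpha, beta, gamma, delta):
    # Formula (1): linear combination of the four performance factors for one set Ci
    return alpha * G + beta * CO + gamma * DP + delta * (1.0 / A)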
4 Evaluation Experiments and Discussion of the Data
The algorithm was applied to the 10,235 different nouns 
of the Wall Street Journal (hereafter WSJ) corpus that are 
classified in WordNet. Categories are generated with h=0.4 and k=1,000. The cardinality of each set varies, but not uniformly, from 456 categories for UB=2,000 (remember that words are frequency-weighted) to 1 category (i.e. the topmost entity) for UB=264,000. Medium-high level categories (those between 50,000 and 100,000 maximum words) range between 10 and 20 members for each set.
Figure 3 plots the values of G, CO, DP and 1/A for the 
different sets of categories generated by the algorithm of 
Section 2. Alternative sets of categories are identified by 
their upper bound3. The figure shows that DP(Ci) has a
regular decreasing behaviour, while 1/A(Ci) is less regu- 
lar. The coverage CO(C i) has a rather unstable behaviour 
due to the entangled structure of WordNet. We attempted 
slight changes in the definitions and computation of CO, 
DP and 1/A (for example, weighting words with their 
frequency), but globally the behaviour remains as in figure 3.
To compute the score of each set Ci, the parameters α, β, γ and δ in (1) must be estimated. To perform this task, we
adopted a linear interpolation method, using SEMCOR 
(the semantically tagged Brown Corpus) as a reference 
corpus. In SEMCOR every word is unambiguously tagged 
with its leaf-synset. 
To build a reference scoring function against which to 
evaluate our model parameters, we proceeded as follows: 
• Since our categories are generated for an economic 
domain (WSJ) while SEMCOR is a tagged balanced 
corpus (the Brown Corpus), we extracted only the 
fragment of the corpus dealing with economic and fi- 
nancial texts. We obtained a reference corpus in- 
cluding 475 of the 10,235 nouns of the WSJ corpus.
• For each set of categories C i generated by the algo- 
rithm in section 2, we computed on the reference cor- 
pus the following two performance figures:
Precision: For each Ci, let W(Ci) be the set of words in the reference corpus covered by the set Ci. For each wk in W(Ci), let S(wk) be the total set of leaves-synsets of wk in WordNet, SR(wk) the subset of leaves-synsets of wk found in the reference corpus, and SC(wk) the subset of leaves-synsets that reach some of the categories of Ci. Let WR(Ci) ⊆ W(Ci) be the set of wk having SC(wk) ⊂ S(wk).

3 Remember that words are weighted by their frequency in the corpus. This seems reasonable, but in any case we observed that our results do not vary when counting each word only once.
Following the algorithm:

for any wk in WR(Ci)
{
    for any si in SR(wk) {
        if si ∈ SC(wk) then N+ = N+ + freqi(wk)
        Ntot = Ntot + freq(wk)
    }
}

where freq(wk) is the number of occurrences of wk in the reference corpus, the precision Precision(Ci) is then defined as N+/Ntot. The precision measures the ability of each set Ci at correctly pruning out some of the senses of W(Ci).
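A sketch of this computation, following the pseudocode above. The helpers reference_senses(w) (SR), kept_senses(w, C_i) (SC), sense_freq(w, s) and freq(w) are hypothetical, and we read freqi(wk) as the number of occurrences of wk tagged with sense si in the reference corpus; the accumulation of Ntot follows the literal pseudocode.

def precision(WR, C_i, reference_senses, kept_senses, sense_freq, freq):
    n_plus, n_tot = 0, 0
    for w in WR:
        kept = kept_senses(w, C_i)          # SC(w): senses reaching some category of Ci
        for s in reference_senses(w):       # SR(w): senses observed in the reference corpus
            if s in kept:
                n_plus += sense_freq(w, s)  # freqi(wk)
            n_tot += freq(w)                # accumulated once per reference sense, as in the pseudocode
    return n_plus / n_tot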
Global reduction of ambiguity: For each Ci, let S(Wi) be the total set of WordNet leaves-synsets reached by the words in WR(Ci), and SC(Wi) ⊆ S(Wi) the set of these synsets that reach some category in Ci. By tagging the corpus with Ci, we obtain a reduction of ambiguity measured by:

GRAmb(Ci) = (card(S(Wi)) - card(SC(Wi))) / card(S(Wi))

where card(X) is the number of elements in the set X.
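A minimal sketch of this measure, taking the two sets of leaves-synsets as inputs (argument names are ours):

def gramb(S_Wi, SC_Wi):
    # Global reduction of ambiguity: fraction of leaf synsets pruned by tagging with Ci
    return (len(S_Wi) - len(SC_Wi)) / len(S_Wi)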
Starting from these two performance figures, the global 
performance function Perf(C i ) is measured by: 
(2) Perf(Ci) = Precision(Ci) + GRAmb(Ci)

Formula (2) is computed for all the generated sets of categories Ci, and then normalised in the [0,1] interval. The obtained plot is the reference against which we apply a standard linear interpolation method to estimate the values of the model parameters α, β, γ and δ that minimize the difference between the values of the two functions for each Ci.
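One way to carry out this estimation step is an ordinary least-squares fit; the sketch below (assuming numpy) fits alpha, beta, gamma and delta so that the linear combination of the four factors best matches the reference Perf values. This is our reading of the "standard linear interpolation method", not the authors' code.

import numpy as np

def fit_parameters(G, CO, DP, inv_A, perf):
    # Each argument is a sequence with one value per generated set Ci.
    X = np.column_stack([G, CO, DP, inv_A])    # predictors: the four performance factors
    y = np.asarray(perf)                       # target: reference performance (2)
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return params                              # estimated (alpha, beta, gamma, delta)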
In figure 4a the (not normalised) Precision and GRAmb 
are plotted for the test corpus. In figure 4b the normalised 
reference performance function and the "best fitting" 
scoring function are shown, with the estimated values of α, β, γ and δ.
While the reference function has a peak on the class set Cj with UB=55,000 and the score function assigns the maximum value to the class set Ck with UB=62,000, the performance of the sets in the range j-k is very similar. Table 1 shows the values of precision and global reduction of ambiguity in the range [j,k].
UB        Precision   GRAmb
55,000    0.752       0.338
56,000    0.758       0.347
59,000    0.748       0.360
60,000    0.734       0.378
61,000    0.731       0.386
62,000    0.733       0.368

Table 1 - Values of precision and global reduction of ambiguity for UB in [55,000-62,000]
Figure 4a - Performance figures of sets Ci against the reference corpus
Figure 4b - Reference function and best-fitting Score function, with estimated parameters.
In evaluating the method, a few aspects are worth underlining:
• The test corpus includes only 475 words of the over 
10,000 in our learning corpus. This may well cause a 
shift of the reference scoring function, as compared 
with the "real" scoring function. 
• In any case, figure 4a shows that the sets Ci have peak performances in the range 50,000-100,000. In this range, the precision is around 73-76%, and the reduction of ambiguity is around 35%, which are both valuable results. We also observed that, by slightly changing the model parameters and/or the definitions of the four performance figures in (1), the peak performance of the obtained scoring function in any case falls in the 50,000-100,000 interval, and the function stays high around the peak, with local maxima.
• In other domains (see a brief summary in the concluding remarks) for which we did not have a reference tagged corpus, we used (α=1, β=0.5, γ=1, δ=1) as model parameters in (1), and still observed a scoring function similar in shape to that of figure 4b. Selected categories vary according to the domain, but the size of the best set stays around 10-20 categories. Evaluation is of course more problematic due to the absence of a tagged reference corpus.
Therefore, we may conclude that the method is "robust", 
in the sense that it correctly identifies a range of reason- 
able choices for the set of categories to be used, ultimately leaving the final choice to a linguist.
As for the WSJ corpus, a short analysis of the linguistic 
data may be useful. In figure 5 the 14 "best" selected categories for nouns are listed. Figure 6 shows four very
frequent and very ambiguous words in the domain: bank, 
business, market and stock, with attached list of synsets as 
generated by WordNet, ordered from left to right by the 
increasing level of generality (leaf-synset leftmost). The
senses marked with '*' are those that reach some of the 
categories (marked in bold in the figure) of the best- 
performing set, selected by the scoring function (1). For 
bank and market, we observed that the less plausible (for 
the domain) senses ridge and market_grocery_store are 
pruned out. The word stock retains only 5 out of 16 
senses. Of these, the gunstock and progenitor senses 
should have been further dropped out, but there are 11 
senses that are correctly pruned, like liquid, caudex, 
plant, etc. The word business still keeps its ambiguity, but 
the 9 subtle distinctions of WordNet are reduced to 7 
senses. 
1. person, individual, someone, mortal, human, soul
2. instrumentality, instrumentation
3. attribute
4. written_communication, written_language
5. message, content, subject_matter, substance
6. measure, quantity, amount, quantum
7. action
8. activity
9. group_action
10. organization
11. psychological_feature
12. possession
13. state
14. location

Figure 5 - List of best-performing categories
Word: bank
Sense n.1: bank, side -> slope, incline, side -> ...
Sense n.2 (*): depository_financial_institution, bank, banking_concern, banking_company -> financial_institution, financial_organization -> institution, establishment -> organization -> ...
Sense n.3: bank -> ridge -> ...
Sense n.4: bank -> array -> ...
Sense n.5 (*): bank -> reserve, backlog, stockpile -> accumulation -> asset -> possession
Sense n.6 (*): bank -> funds, finances, monetary_resource, cash_in_hand, pecuniary_resource -> asset -> possession
Sense n.7: bank, cant, camber -> slope, incline, side -> ...
Sense n.8 (*): savings_bank, coin_bank, money_box, bank -> container -> instrumentality, instrumentation -> ...
Sense n.9: bank, bank_building -> depository, deposit, repository -> ...
Word: business
Sense n.1 (*): business, concern, business_concern, business_organization -> enterprise -> organization -> ...
Sense n.2 (*): commercial_enterprise, business_enterprise, business -> commerce, commercialism, mercantilism -> group_action -> ...
Sense n.3 (*): occupation, business, line_of_work, line -> activity -> ...
Sense n.4 (*): business -> concern, worry, headache, vexation -> negative_stimulus -> stimulation, stimulus, stimulant, input -> information -> cognition, knowledge -> psychological_feature
Sense n.5 (*): business -> aim, object, objective, target -> goal, end -> content, cognitive_content, mental_object -> cognition, knowledge -> psychological_feature
Sense n.6: business, business_sector -> sector -> ...
Sense n.7 (*): business -> business_activity, commercial_activity -> activity -> ...
Sense n.8: clientele, patronage, business -> people -> ...
Sense n.9 (*): business, stage_business, byplay -> acting, playing, playacting, performing -> activity -> ...
Word: market
Sense n.1: market -> class, social_class, socio-economic_class -> ...
Sense n.2: grocery_store, grocery, market -> marketplace, mart -> ...
Sense n.3 (*): market, marketplace -> activity -> ...
Sense n.4 (*): market, securities_industry -> industry -> commercial_enterprise -> enterprise -> organization -> ...
Word: stock
Sense n.1 (*): stock -> capital, working_capital -> asset -> possession
Sense n.2 (*): stock, gunstock -> support -> device -> instrumentality, instrumentation -> ...
Sense n.3: stock, inventory -> merchandise, wares, product -> ...
Sense n.4 (*): stock_certificate, stock -> security, certificate -> legal_document, legal_instrument, official_document, instrument -> document, written_document, papers -> writing, written_material -> written_communication, written_language -> ...
Sense n.5 (*): store, stock, fund -> accumulation -> asset -> possession
Sense n.6 (*): stock -> progenitor, primogenitor -> ancestor, ascendant, ascendent, antecedent -> relative, relation -> person, individual, someone, mortal, human, soul -> ...
Sense n.7: broth, stock -> soup -> ...
Sense n.8: stock, caudex -> stalk, stem -> ...
Sense n.9: stock -> plant_part -> ...
Sense n.10: stock, gillyflower -> flower -> ...
Sense n.11: Malcolm_stock, stock -> flower -> ...
Sense n.12: lineage, line, line_of_descent, descent, bloodline, blood_line, blood, pedigree, ancestry, origin, parentage, stock -> genealogy, family_tree -> ...
Sense n.13: breed, strain, stock, variety -> animal_group -> ...
Sense n.14: stock -> lumber, timber -> ...
Sense n.15: stock -> handle, grip, hold -> ...
Sense n.16: neckcloth, stock -> cravat -> ...

Figure 6 - Selected synsets for the words bank, business, market and stock.
5 Concluding Remarks 
It has already been demonstrated in (Basili et al, 1996) 
that tagging a corpus with semantic categories triggers a 
more effective lexical learning. However, overambiguity 
of on-line thesauri is known as the major obstacle to automatic semantic tagging of corpora. The method presented in this paper allows an efficient and simple selection of a flat set of domain-tuned categories that dramatically reduces the initial overambiguity of the thesaurus.
We measured a 73% precision in reducing the initial am- 
biguity, and a 37% global reduction of ambiguity. 
Significantly, our method selects a limited number of 
categories (10-20, depending upon the learning corpus 
and the model parameters), out of the initial 47,110 leaf- 
synsets of WordNet 4. 
We remark that our experiment is on a large scale, meaning that
we automatically evaluated the performance of the model 
on a large set of nouns taken from the Wall Street 
Journal. Most sense disambiguation or semantic tagging 
methods evaluate their performances manually, against 
a few very ambiguous cases, with clear distinctions among
senses. Instead, WordNet draws very subtle and fine- 
grained distinctions among words. We believe that our 
results are very encouraging. 
The model parameters for category selection have been
tuned on SEMCOR, but a correctly tagged corpus is not 
strictly necessary. In our experiments, we applied a scor- 
ing function similar to that obtained for the Wall Street 
Journal to two other domains, a corpus of Airline reser- 
vations and the Unix handbook. We do not discuss the 
data here for the sake of space. The method consistently selects a set of categories at the medium-high level of
generality, different for each domain. The selection 
"seems" good according to our linguistic intuition of the 
domains, but the absence of a correctly tagged corpus 
does not allow a large-scale evaluation. 
In the future, we plan to demonstrate that the method 
proposed in this paper, besides reducing the overambigu- 
ity of on-line thesaura, improves the performance of lexi- 
cal learning methods that are based on semantic tagging, 
such as PP disambiguation, case frame acquisition and 
sense selection, with respect to a non-optimal choice
of semantic categories. 
6 Acknowledgements 
The method presented in this paper has been developed 
within the context of the ECRAN project LE 2110, funded 
by the European Community. One of the main research 
objectives of ECRAN is lexical tuning, of which semantic tagging and sense disambiguation are two important preliminary objectives. This paper approached the problem of domain-appropriate semantic tagging.
We thank Christian Pavoni, who developed much of the software used in this experiment, as well as all our partners in the ECRAN project.

4 We used WordNet in this experiment because it is now very popular among scholars in lexical statistics, but clearly our method could be applied to any on-line taxonomy or lattice.

References 

(Agirre and Rigau, 1996) E. Agirre and G. Rigau, Word 
Sense Disambiguation using Conceptual Density, 
Proc. of COLING 1996

(Basili et al, 1992) Basili, R., Pazienza, M.T., Velardi, P., 
"Computational Lexicons: the Neat Examples and the 
Odd Exemplars", Proc. of Third Int. Conf. on Applied 
Natural Language Processing, Trento, Italy, 1-3 April, 
1992. 

(Basili et al, 1996) Basili, R., M.T. Pazienza, P. Velardi, An Empirical Symbolic Approach to Natural Language Processing, Artificial Intelligence, August 1996

(Basili et al, 1996b) Basili, R., M.T. Pazienza, P. Velardi, Integrating general purpose and corpus-based verb classification, Computational Linguistics, 1996

(Brill and Resnik, 1994) E. Brill and P. Resnik, A transformation-based approach to prepositional phrase 
attachment disambiguation, Proc. of COLING 1994

(Chen and Chen, 1996) K. Chen and C. Chen, A rule-based 
and MT-oriented Approach to Prepositional Phrase 
Attachment, Proc. of COLING 1996

(Dagan et al, 1994) Dagan I., Pereira F., Lee L., Similarity-based Estimation of Word Cooccurrence Probabilities, Proc. of ACL, Las Cruces, New Mexico, USA,
1994. 

(Dolan, 1994) W. Dolan, Word Sense Ambiguation: Clustering Related Senses, Proc. of Coling 1994 

(Hearst and Schuetze, 1993) M. Hearst and H. Schuetze, 
Customizing a Lexicon to Better Suit a Computational Task, ACL SIGLEX Workshop on Lexical Acquisition from Text, Columbus, Ohio, USA, 1993.

(Krovetz and Croft, 1992) R. Krovetz and B. Croft, Lexi- 
cal Ambiguity and Information Retrieval, ACM Trans. on Information Systems, 10:2, 1992

(Grishman and Sterling, 1994) R. Grishman, J. Sterling, 
Generalizing Automatically Generated Selectional 
Patterns, Proc. of COLING '94, Kyoto, August 1994. 

(Pereira et al., 1993) Pereira F., N. Tishby, L. Lee, Distributional Clustering of English Words, Proc. of ACL, Columbus, Ohio, USA, 1993.

(Pustejovsky, 1995) J. Pustejovsky, The Generative Lexicon, MIT Press, 1995

(Resnik, 1995) P. Resnik, Disambiguating Noun Groupings with respect to WordNet Senses, Proc. of 3rd
Workshop on Very Large Corpora, 1995 

(Yarowsky, 1992) Yarowsky D., Word-sense disambiguation using statistical models of Roget's categories trained on large corpora, Proc. of COLING 92, 
Nantes, July 1992. 

(Utsuro et al, 1993) T. Utsuro, Y. Matsumoto and M. Nagao, Verbal case frame acquisition from Bilingual Corpora, Proc. of IJCAI, 1993
