DO WE NEED LINGUISTICS WHEN WE HAVE STATISTICS? A 
COMPARATIVE ANALYSIS OF THE CONTRIBUTIONS OF LINGUISTIC 
CUES TO A STATISTICAL WORD GROUPING SYSTEM 
Vasileios Hatzivassiloglou 
Department of Computer Science 
450 Computer Science Building 
Columbia University 
New York, N.Y. 10027 
vh@cs.columbia.edu 
ABSTRACT 
We present a comparative analysis of the performance of 
a statistics-based system for the formation of semantic groups 
of adjectives when various sources of linguistic knowledge 
are introduced. We identify four different types of shallow 
linguistic knowledge that are applicable to this system, and 
we quantify the performance gained by incorporating each 
such knowledge module. We perform experiments for dif- 
ferent corpus sizes and different inputs (sets of adjectives to 
group), collect data on the usefulness of each linguistic 
module, assess the statistical significance of the results, and 
compare the contributions of the linguistic knowledge sources 
against each other. We also assess the overall effect linguistic 
knowledge has in our system. Our results show that linguistic 
knowledge causes a significant increase in the performance of 
the system. We conclude by discussing how these positive 
results can be generalized to other problems in statistical 
NLP. 
1. INTRODUCTION 
The idea of integrating statistical and knowledge-based approaches to natural language 
problems has recently been gaining ground in the computational linguistics community, as it is ex- 
pected that a combined approach will offer sig- 
nificantly better performance than either 
methodology alone. This paper supplements this 
intuitive belief with actual evaluation data, ob- 
tained when several linguistics-based modules 
were integrated in a statistical system. 
We used a system we previously developed for 
the separation of adjectives into semantic groups 
\[Hatzivassiloglou and McKeown, 1993\] as the 
basis for our comparative analysis. We identified 
several different types of shallow linguistic knowledge that can be efficiently introduced into 
our system. We evaluated the system with and 
without each such feature, obtaining an estimate of 
each feature's positive or negative contribution to 
the overall performance. By matching cases where 
all system parameters are the same except for one feature, we assess the statistical significance of the 
differences found. Also, a statistical model of the 
system's performance in terms of the active fea- 
tures for each run offers a view of the contribu- 
tions of features from a different angle, contrasting 
the significance of linguistic features (or other modeled system parameters) against each other. 
Our analysis of the experimental results 
showed that many forms of linguistic knowledge 
have a significant positive contribution to the per- 
formance of the system. We attribute to the com- 
bined effect of the linguistic knowledge modules 
the ability of our system to perform fine-tuned 
classification of adjectives into semantic classes. 
Other statistical systems that address word clas- 
sification problems do not emphasize the use of 
linguistic knowledge and do not deal with a 
specific word class \[Brown et al., 1992\], or do not 
exploit as much linguistic knowledge as we do 
\[Pereira et al., 1993\]. As a result, a coarser clas- 
sification is usually produced. In contrast, by limiting the system's input to adjectives, we can 
take advantage of specific syntactic relationships and additional filtering procedures that apply only 
to particular word classes. These sources of lin- 
guistic knowledge provide in turn the extra edge 
for discriminating among the adjectives at the 
semantic level. 
Our adjective grouping system can be used for 
applications such as natural language generation 
(where knowledge of the semantic groups and of 
the ordering of the elements within them allows 
the precise lexicalization of semantic concepts 
\[Elhadad, 1991\]) and computational lexicography (by automatically compiling domain-dependent 
lists of synonyms and antonyms). The produced 
groups can also help correct erroneous usage of 
multiple qualifiers that are superfluous or con- 
tradict each other, a phenomenon that has been ob- 
served in medical reports 1. But in addition to the 
immediate applications of word classification, 
many other statistical NLP applications can be cast 
in a similar framework. Therefore, the positive ef- 
fects of linguistic knowledge on our system in- 
dicate that the incorporation of linguistic knowledge will probably result in similar benefits 
for other applications as well. 
In what follows, we briefly review our adjec- 
tive grouping system, and then present the linguistic features we explored and the alternatives for 
each of them. In Section 5 we give the results of 
our evaluation on different combinations of features and we analyze their significance. We also 
present these results in a predictor-response 
framework, and we conclude by discussing the ap- 
plicability of our results to other NLP problems. 
2. AN OVERVIEW OF THE ADJECTIVE 
GROUPING SYSTEM 
Our adjective grouping system \[Hatzi- 
vassiloglou and McKeown, 1993\] starts with a set 
of adjectives to be clustered into semantically re- 
lated groups. Ideally, we want highly related 
words such as synonyms, antonyms, and 
hyponyms to be the only ones placed in the same 
group. The system is given the number of groups 
to form as an input parameter 2, and has access to a 
text corpus. No semantic information about the 
adjectives is available to the system. The system operates by extracting pairs of modified nouns for 
each adjective, and, optionally, pairs of adjectives 
that we can expect to be semantically unrelated on 
linguistic grounds 3. From the estimated distribu- 
tion of modified nouns for each adjective, a similarity score is assigned to each possible pair of 
adjectives. This is based on Kendall's τ, a non- 
parametric, robust estimator of correlation 
\[Kendall, 1938\]. Using the similarity scores and, optionally, the established relationships of non- 
relatedness, a non-hierarchical clustering method 
\[Spath, 1985\] assigns the adjectives to groups in a 
way that maximizes the within-group similarity 
(and therefore also maximizes the between-group 
dissimilarity). 
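The similarity computation just described can be illustrated in a few lines. The sketch below is a minimal version of Kendall's τ applied to hypothetical noun-frequency vectors for two adjectives (the adjectives, nouns, and counts are invented for illustration; the actual system estimates the distributions from the corpus and uses a tie-aware estimator):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Minimal Kendall's tau: (concordant - discordant) pairs over all
    pairs of observations; ties are simply skipped in this sketch."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical frequency vectors over a shared noun vocabulary:
nouns = ["problem", "issue", "crisis", "growth"]
freq_big = [10, 7, 3, 1]      # counts of (big, noun) pairs
freq_large = [9, 8, 2, 2]     # counts of (large, noun) pairs
print(kendall_tau(freq_big, freq_large))
```

Two adjectives whose noun distributions rank the nouns similarly receive a τ near 1, while unrelated adjectives drift toward 0 or below.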
1We thank Johanna Moore for pointing out this application 
to us. 
2Determining this number from the data is probably the 
hardest problem in cluster analysis in general; see \[Kaufman 
and Rousseeuw, 1990\]. 
3These are adjectives that either modify the same noun in 
the same NP (e.g. big white house) or one of them modifies 
the other (e.g. light blue coat); see \[Hatzivassiloglou and 
McKeown, 1993\] for a detailed analysis. 
1. deadly fatal 
2. capitalist socialist 
3. clean dirty dumb 
4. hazardous toxic 
5. insufficient scant 
6. generous outrageous unreasonable 
7. endless protracted 
8. plain 
9. hostile unfriendly 
10. delicate fragile unstable 
11. affluent impoverished prosperous 
12. brilliant clever energetic smart stupid 
13. communist leftist 
14. astonishing meager vigorous 
15. catastrophic disastrous harmful 
16. dry exotic wet 
17. chaotic turbulent 
18. confusing misleading 
19. dismal gloomy 
20. dual multiple pleasant 
21. fat slim 
22. affordable inexpensive 
23. abrupt gradual stunning 
24. flexible lenient rigid strict stringent 
Figure 1: Example clustering found by the 
system using all linguistic modules. 
To evaluate our system, we have developed 
extended versions of the standard information 
retrieval measures precision, recall, and fallout. 
These extended versions score the grouping 
produced by the system against a set of model groupings (instead of just one) for the same adjec- 
tives, supplied by humans. In the experiments 
reported in this paper, we employ 8 or 9 human- 
constructed models for each adjective set. We 
base our comparisons on and report the F-measure 
scores \[Van Rijsbergen, 1979\], which combine 
precision and recall in a single number. In ad- 
dition, since the correct number of groupings is 
something that the system cannot yet determine 
(and, incidentally, something that human 
evaluators disagree about), we run the system for 
the five cases in the range -2 to +2 around the 
average number of clusters employed by the 
humans and average the results. This smoothing 
operation prevents an accidentally high or low score 
being reported when a small variation in the num- 
ber of clusters produces very different scores. 
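The combination of precision and recall into a single F-measure score can be written down directly; this is the standard formula, shown here with β = 1 (equal weighting of precision and recall), which is what a single combined number usually denotes:

```python
def f_measure(precision, recall, beta=1.0):
    """Van Rijsbergen's F-measure: a weighted harmonic-style combination
    of precision and recall; beta > 1 favors recall, beta < 1 precision."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.25))   # 2PR / (P + R) = 1/3
```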
It should be noted here that the scores reported 
should not be interpreted as linear percentages. In 
other words, a score of 40 is not just twice as good 
as a score of 20, and going from 30 to 40 is much 
harder than going from 20 to 30. The latter is true 
for most applications, but the problem of interpret- 
ing the scores is exacerbated in our context be- 
cause of the structural constraints imposed by the 
clustering and the presence of multiple models. 
Furthermore, even the best clustering that could be 
produced would not receive a score of 100, be- 
cause of disagreement among humans on what is 
the correct answer. To clarify the meaning of the 
scores, we accompany them with lower and upper 
bounds for each adjective set we examine. These 
bounds are obtained by the performance of a sys- 
tem that creates random groupings (averaged over 
many runs) and by the average score of the 
human-produced partitions when evaluated against 
the other human models, respectively. 
Figure 1 shows an example clustering 
produced by our system for one of the adjective 
sets analyzed in this paper. 
3. THE LINGUISTIC FEATURES BEING 
TESTED 
We have identified several sources of linguis- 
tic knowledge that can be incorporated in our sys- 
tem, augmenting the statistical component. Each 
such source represents a parameter of the system, 
i.e. a feature that can be present or absent, or more 
generally take a value from a predefined set. We 
selected features that can be efficiently computed 
in a completely automatic way for unrestricted text 
and do not require extensive amounts of 
knowledge to be available to the system. Almost 
all of these features can be generalized to other ap- 
plications as well, as we discuss in Section 6. In 
this section we discuss first one of these 
parameters that can take several values, namely 
the method of extracting data from the corpus, and 
then several other binary-valued features. 
3.1 Extracting data from the corpus 
Our adjective clustering system determines the 
distribution of related (modified) nouns for each 
adjective and eventually the similarity between ad- 
jectives from pairs of the form (adjective, 
modified noun) observed in the corpus. Direct in- 
formation about incompatible adjectives (in the 
form of appropriate adjective-adjective pairs) can 
also be collected from the corpus. Therefore, a 
first parameter of the system and a possible dimen- 
sion for comparisons is the method employed to 
identify such pairs in free text. This is hardly a 
unique feature of our system: all word-based 
statistical systems must first collect data from the 
corpus about the words of interest, on which the 
subsequent statistics operate 4 . 
There are several alternative models for this task 
of data collection, with different degrees of lin- 
guistic sophistication. A first model is to use no 
linguistic knowledge at all: we collect for each ad- 
jective of interest all words that fall within a win- 
dow of some predetermined size. Naturally, no 
negative data (adjective-adjective pairs) can be 
collected with this method. However, the method 
can be implemented easily and does not require 
the identification of any linguistic constraints, so it 
is completely general. It has been used for diverse 
problems such as machine translation and sense 
disambiguation \[Gale et al., 1992, Schütze, 1992\]. 
A second model is to restrict the words col- 
lected to the same sentence as the adjective of in- 
terest and to word class(es) that we expect on lin- 
guistic grounds to be relevant to adjectives. For 
our application, we collect all nouns in the vicinity 
of an adjective without leaving the current sen- 
tence. We assume that these nouns have some 
relationship with the adjective and that seman- 
tically different adjectives will exhibit different 
collections of such nouns. This model requires 
only part-of-speech information (to identify 
nouns) and a method of detecting sentence boun- 
daries. It uses a window of fixed length to define 
the neighborhood of each adjective. Such a model 
incorporates minimal linguistic knowledge, 
namely in determining what constitutes the infor- 
mative class(es) of words collected (nouns in our 
problem). Again, negative knowledge such as in- 
compatible adjective pairs cannot be collected 
with this model. Nevertheless, it has also been 
widely used, e.g. for collocation extraction 
\[Smadja, 1993\] and sense disambiguation \[Liddy 
and Park, 1992\]. 
A third model uses a simple linguistic rule to 
identify pairs of interest that is even more restric- 
tive and informative than the "nouns in vicinity" 
approach. Since we are interested in nouns 
modified by adjectives, such a rule is to collect a 
noun immediately following an adjective, assum- 
ing that this implies a modification relationship. 
Pairs of consecutive adjectives can also be col- 
lected. 
4Although frequently details of the statistical model 
employed receive more consideration. 
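A sketch of this third model, assuming the input has already been tokenized and part-of-speech tagged (the tag labels below are illustrative Penn-style tags, not necessarily what the original implementation used):

```python
def extract_pairs(tagged_tokens):
    """Collect (adjective, noun) pairs where a noun immediately follows
    an adjective, and (adjective, adjective) pairs for consecutive
    adjectives, following the simple rule described above."""
    adj_noun, adj_adj = [], []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "JJ" and t2 == "NN":
            adj_noun.append((w1, w2))
        elif t1 == "JJ" and t2 == "JJ":
            adj_adj.append((w1, w2))
    return adj_noun, adj_adj

sent = [("the", "DT"), ("big", "JJ"), ("white", "JJ"),
        ("house", "NN"), ("collapsed", "VBD")]
print(extract_pairs(sent))
# -> ([('white', 'house')], [('big', 'white')])
```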
Up to this point we have successively 
restricted the collected pairs on linguistic grounds, 
so that less but cleaner data is collected. For the 
fourth model, we extend the simple rule given 
above, using linguistic information to catch more 
valid pairs without sacrificing accuracy. We 
employ a pattern matcher that retrieves any se- 
quence of one or more adjectives followed by any 
sequence of zero or more nouns. These sequences 
are then analyzed with heuristics based on linguis- 
tics to obtain pairs. For example, it can be shown 
that all adjectives in such a sequence must be 
semantically unrelated, and that it is best to attach 
all the adjectives to the final noun. 
The regular expression and pattern matching 
rules of the previous model can be extended fur- 
ther, forming a grammar for the constructs of in- 
terest. This approach can detect more pairs, and at 
the same time address known problematic cases 
not detected by the previous models. 
We implemented the above five data extraction 
models, using typical window sizes for the first 
two methods (50 and 5 on each side of the window 
respectively) which have been found useful in 
other problems before. For the fifth model, we 
developed a finite-state grammar for NPs which is able to handle both predicative and attributive 
modification of nouns, conjunctions of adjectives, 
adverbial modification of adjectives, quantifiers, 
and apposition of adjectives to nouns or other 
adjectives 5. Unfortunately, the resources required 
to perform our tests for the first model were too 
great (e.g. 12,287,320 pairs in a 151 MB file were 
extracted for the 21 adjectives in our smallest test 
set) so we dropped that model from further con- 
sideration and we use the second model as the 
baseline of minimal linguistic knowledge. Other 
researchers have also reported similar problems of 
excessive resource demands with the "collect all 
neighbors" model \[Gale et al., 1992\]. 
3.2 Other linguistic features 
In addition to the data extraction method, we 
identified three other areas where linguistic 
knowledge can be introduced in our system. First, 
we can employ morphology to convert plural 
5For efficiency reasons we did not consider a more power- 
ful formalism. 
antitrust new 
big old 
economic political 
financial potential 
foreign real 
global serious 
international severe 
legal staggering 
little technical 
major unexpected 
mechanical 
Figure 2: Test set 1; from an earlier corpus. 
nouns to the corresponding singular ones and ad- 
jectives in comparative or superlative degree to 
their base form. Almost all adjectives and nouns 
that appear in multiple forms have no semantic 
difference from their base form except for the 
number or degree feature. This conversion com- 
bines counts of similar pairs, thus raising the ex- 
pected and estimated frequencies of each pair in 
any statistical model. We developed a morphol- 
ogy component that produces the singular form of 
nouns using rules plus a large table of exceptions. 
For adjectives, a set of rules is again employed, but 
because of the vowel in the suffix -er or -est, many 
base forms look plausible without a lexicon (e.g. 
bigger could have been produced from big, bigg, 
or bigge). We solve this problem by counting the 
occurrences of each candidate form in our corpus 
and selecting the one with non-zero frequency. 
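For the comparative case, this disambiguation strategy can be sketched as follows (the candidate-generation rules and the frequency table are simplified assumptions; the real component also handles -est and uses a table of exceptions):

```python
def base_form(word, corpus_counts):
    """Guess the base form of a comparative adjective in -er by
    generating candidate stems and keeping the first one attested
    (non-zero frequency) in the corpus."""
    if not word.endswith("er"):
        return word
    stem = word[:-2]                       # bigger -> bigg
    candidates = [stem, stem + "e"]        # bigg, bigge
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        candidates.append(stem[:-1])       # big (undoubled consonant)
    for cand in candidates:
        if corpus_counts.get(cand, 0) > 0:
            return cand
    return word

counts = {"big": 120, "nice": 40}          # hypothetical corpus counts
print(base_form("bigger", counts), base_form("nicer", counts))
# -> big nice
```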
Another potential application of linguistic 
knowledge is the use of a spell-checking proce- 
dure, combined with a word list, to eliminate 
typographical errors from the corpus. Such errors 
can produce wrong estimates for the frequencies 
of modified nouns for an adjective, but most im- 
portantly introduce "unique" nouns appearing 
only with one adjective, skewing the comparison 
of noun distributions. We implemented this com- 
ponent using the Unix spell program and as- 
sociated word list, with extensions for hyphenated 
compounds. Unfortunately, since a fixed and 
domain-independent lexicon is used for this 
process, some valid but overspecialized words 
may be discarded too. 
Finally, we can use additional sources of 
knowledge which supplement the primary 
similarity relationships and are justified on linguis- 
tic grounds. We identified several potential 
sources of additional knowledge that can be ex- 
tracted from the corpus (e.g. conjunctions of ad- 
jectives). In this comparison study we im- 
plemented and consider the significance of one of 
these knowledge sources, namely the negative ex- 
amples offered by adjective-adjective pairs. 
4. THE COMPARISON EXPERIMENTS 
In the previous section we identified four 
parameters of the system, the effects of which we 
want to analyze. But in addition to these 
parameters that can be directly varied and have 
predetermined possible values, several other vari- 
ables can affect the performance of the system. 
First, the performance of the system depends 
naturally on the adjective set that is to be clus- 
tered. Presumably variations in the adjective set 
can be modeled by several parameters, such as 
size of the set, number of semantic groups in it, 
and strength of semantic relatedness among its 
members, plus several parameters describing the 
properties of the adjectives in the set in isolation, 
such as frequency, specificity, etc. 
A second variable that affects the clustering is 
the corpus that is used as the main knowledge 
source, through the observed cooccurrence pat- 
terns. Again the effects of different corpora can be 
separated into several factors, e.g. the size of the 
corpus, its generality, the genre of the texts, etc. 
Since in this paper we are interested in quan- 
tifying the effect of the linguistic knowledge in our 
system, or more precisely of the linguistic 
knowledge that we can explicitly control through 
the four parameters discussed above, we did not 
attempt to model in detail the various factors 
entering the system as a result of the choice of ad- 
annual negative 
big net 
chief new 
commercial next 
current old 
daily past 
different positive 
difficult possible 
easy pre-tax 
final previous 
future private 
hard public 
high quarterly 
important recent 
initial regional 
international senior 
likely significant 
local similar 
low small 
military strong 
modest weak 
national 
Figure 3: Test set 2; high frequency words. 
jective set and corpus. However, we are interested 
in measuring the effects of the linguistic 
parameters in a wide range of contexts, and in cor- 
relating these effects with variables originating 
from the choice of corpus and adjective set. For 
example, we would want to be able to detect that 
the linguistic parameter "morphology" is sig- 
nificant for small corpora but not for large ones, if 
that were the case. Therefore, we included in our 
model two additional parameters, representing the 
corpus and the adjective set used. 
We used the Wall Street Journal articles from 
the ACL-DCI as our corpus. We selected four sub- 
corpora of decreasing size to study the relationship 
of corpus size with linguistic feature effects: all 
the 1987 articles (21 million words), every third of 
these articles (7 million words), every twenty-first 
(1 million words), and articles no. 50 and 100 
(330,000 words). Since we use subsets of the same 
corpus, we are essentially modeling the corpus 
size parameter only. 
abrupt hazardous 
affluent hostile 
affordable impoverished 
astonishing inexpensive 
brilliant insufficient 
capitalist leftist 
catastrophic lenient 
chaotic meager 
clean misleading 
clever multiple 
communist outrageous 
confusing plain 
deadly pleasant 
delicate prosperous 
dirty protracted 
disastrous rigid 
dismal scant 
dry slim 
dual smart 
dumb socialist 
endless strict 
energetic stringent 
exotic stunning 
fat stupid 
fatal toxic 
flexible turbulent 
fragile unfriendly 
generous unreasonable 
gloomy unstable 
gradual vigorous 
harmful wet 
Figure 4: Test set 3; low frequency words. 
Parameter                    Value               Score
Extraction Model             Parsing             30.29
                             Pattern Matching    28.88
                             Observed Pairs      27.87
                             Nouns in Vicinity   22.36
Morphology                   Yes                 28.60
                             No                  27.53
Spell-checking               Yes                 28.12
                             No                  28.00
Use of negative knowledge    Yes                 29.40
                             No                  28.63
Table 1: Average scores when only one feature 
is changed. 
For each corpus, we analyzed three different 
sets of adjectives, listed in Figures 2-4. The first of 
them was selected from a similar corpus, contains 
21 frequent and ambiguous words that all as- 
sociate strongly with a particular noun (problem), 
and was analyzed in \[Hatzivassiloglou and McKeown, 1993\]. The second set (43 adjectives) 
was selected with the constraint that it contain 
high frequency adjectives (more than 1,000 occur- 
rences in the 21 million word corpus). The third 
set (62 adjectives) satisfies the opposite constraint, 
containing adjectives of relatively low frequency 
(between 50 and 250). Figure 1 shows a typical 
grouping found by our system for the third set of 
adjectives, when the full corpus and all linguistic 
modules were used. 
These three sets of adjectives represent various 
characteristics of the adjective sets that the system 
may be called to cluster. First, they explicitly 
represent increasing sizes of the grouping 
problem. The second and third sets also contrast 
the independent frequencies of their member ad- 
jectives. Furthermore, we have found that the less 
frequent adjectives of the third set tend to be more 
specific than the more frequent ones. The human 
evaluators reported that the task of classification 
was easier for the third set, and their models ex- 
hibited about the same degree of agreement for the 
second and third sets although the third set is sig- 
nificantly larger. We plan to investigate the 
generality of this inverse correlation between fre- 
quency and specificity in the future. 
By including the parameters "corpus size" 
and "adjective set", we have six parameters that 
we can vary in our experiments. Any remaining 
factors affecting the performance of our system 
are modeled as random noise, so statistical 
methods are used to evaluate the effects of the 
selected parameters. The six chosen parameters 
are completely orthogonal, with the exception that 
parameter "negative knowledge" must have the 
value "not used" when parameter "extraction 
model" has the value "nouns in vicinity". In or- 
der to avoid introducing imbalance in our experi- 
ment, we constructed a complete designed experi- 
ment \[Hicks, 1973\] for all their (4×2-1)×2×2× 
4×3 = 336 valid combinations 6. 
5. RESULTS 
5.1 Average effect of each linguistic 
parameter 
Space limitations do not allow us to present the 
scores for every one of the 336 individual experi- 
ments performed, corresponding to all valid com- 
binations of the six modeled parameters. Instead 
we present several summary measures. We 
measured the effect of each particular setting of 
each linguistic parameter of Section 3 by averag- 
ing the scores obtained in all experiments where 
that particular parameter had that particular value. 
In this way, Table 1 summarizes the differences in 
the performance of the system caused by each 
parameter. Because of the complete design of the 
experiment, each value in Table 1 is obtained in 
runs that are identical to the runs used for estimat- 
ing the other values of the same parameter except 
for the difference in the parameter itself 7. 
Table 1 shows that there is indeed improve- 
ment with the introduction of any of the proposed 
linguistic features, or with the use of a linguis- 
tically more sophisticated extraction model. To as- 
sess the statistical significance of these dif- 
ferences, we compared each run for a particular 
value of a parameter to the corresponding identical 
(except for that parameter) run for a different 
value of the parameter. Each pair of values for a 
parameter produces in this way a set of paired ob- 
servations. On each of these sets, we performed a 
sign test \[Gibbons and Chakraborti, 1992\] of the 
null hypothesis that there is no real difference in 
the system's performance between the two values, 
i.e. that any observed difference is due to chance. 
We counted the number of times that the first of the two compared values led to superior perfor- 
mance relative to the second, distributing ties 
equally between the two cases. Under the null 
hypothesis, the number of times that the first value 
6Recall that a designed experiment is complete when at 
least one trial, or run, is performed for every valid combina- 
tion of the modeled predictors. 
7The slight asymmetry in parameters "extraction model" 
and "negative knowledge" is accounted for by leaving out 
non-matching runs. 
performs better follows the binomial distribution 
with parameter p = 0.5. Table 2 gives the results of 
these tests along with the probabilities that the 
same or more extreme results would be encoun- 
tered by chance. We can see from the table that all 
types of linguistic knowledge except spell- 
checking have a beneficial effect that is statis- 
tically significant at, or below, the 1% level. 

Parameter tested     First value        Second value       Comparisons  First value better  Probability
                                                                        than second
Extraction model     Parsing            Pattern matching        96            64            0.0014
                     Parsing            Observed pairs          96            66            0.0003
                     Parsing            Nouns in vicinity       48            42            10^-7
                     Pattern matching   Observed pairs          96            61            0.0104
                     Pattern matching   Nouns in vicinity       48            41            6.24×10^-7
                     Observed pairs     Nouns in vicinity       48            36            0.0007
Morphology           Used               Not used               168           107            0.0005
Spell-checking       Used               Not used               168            94            0.1425
Negative knowledge   Used               Not used               144            97            3.756×10^-5

Table 2: Statistical tests of the difference in performance offered by each linguistic feature. 
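The sign test used in these comparisons reduces to an exact binomial tail computation. A minimal sketch (one-sided; the probabilities in Table 2 may additionally reflect the redistribution of ties described above):

```python
from math import comb

def sign_test_p(wins, n):
    """One-sided sign test: probability, under the null hypothesis that
    each value of the parameter wins a paired comparison with
    probability 0.5, of observing at least `wins` wins out of n."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Morphology row of Table 2: "used" won 107 of 168 paired runs.
print(sign_test_p(107, 168) < 0.001)   # -> True
```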
5.2 Comparison among the linguistic 
features 
In order to measure the significance of the con- 
tribution of each linguistic feature relative to the 
other linguistic features, we fitted a linear regres- 
sion model \[Draper and Smith, 1981\] to the data. 
We use the six parameters of our experiments as 
the predictors, and the measured F-score of the 
corresponding clustering as the response variable. 
In such a model the response Y is assumed to be a 
linear function of the predictors, i.e. 

    Y = b0 + b1·X1 + b2·X2 + ... + bn·Xn    (1) 

where Xi is the i-th predictor and bi is its cor- 
responding weight 8. The weights found by the fit- 
ting process (Table 3) indicate by their absolute 
magnitude and sign how important each predictor 
is and whether it contributes positively or nega- 
tively to the final result. Numerical values such as 
8Such a model is appropriate for comparative purposes, 
although extrapolating response values for prediction outside the range of predictor values used in the fitting may give 
incorrect results. For example, the coefficients in Table 3 
cannot be used to predict the score when the corpus is sig- 
nificantly smaller than 0.33 Mbytes or larger than 21 Mbytes. 
the corpus size enter formula (1) directly as 
predictors, so Table 3 indicates that each ad- 
ditional megabyte of text increases the perfor- 
mance of the system by 0.9417 on the average. 
For binary features, the weights in Table 3 indicate 
the increase in the system's performance when the 
feature is present, so introduction of morphology 
improves the system's performance by 0.5371 on 
the average. For the categorical variables "extrac- 
tion model" and "adjective set", the weights 
show the change in score for the indicated value in 
contrast to the base case (minimal linguistic 
knowledge represented by extraction model 
"nouns in vicinity" and adjective set 1 respec- 
tively). For example, using the finite-state parser 
instead of the "nouns in vicinity" model improves 
Variable                         Weight
Intercept                       18.7997
Corpus size (in megabytes)       0.9417
Extraction method (Pairs)        5.1307
Extraction method (Sequences)    6.1418
Extraction method (Parser)       7.5423
Morphology                       0.5371
Spelling                         0.0589
Adjective Set (2)                2.5996
Adjective Set (3)              -11.4882
Use of negative knowledge        0.3838
Table 3: Fitted coefficients for linear regression 
model. 
                                   Adjective Set 1   Adjective Set 2   Adjective Set 3
Random partitions                        9.66              6.21              3.80
No linguistic components active         24.51             38.51             33.21
All linguistic components active        39.06             44.73             46.17
Humans                                  53.98             64.27             63.07
Table 4: Performance of a random classifier, of the system on the 21 million word corpus, 
and of the humans. 
the score by 7.5423 on the average, while going 
from adjective set 2 to adjective set 3 decreases 
the score by 2.5996+11.4882 = 14.0878 on the 
average. Finally, the intercept b0 gives the baseline 
performance of a minimal system that uses the 
base case for each parameter; the effects of corpus 
size are to be added to this system. 
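Formula (1) with the coefficients of Table 3 can be applied directly. The sketch below encodes the table as a lookup (the encoding of the categorical values is our own reading of the table, and, per footnote 8, predictions are only meaningful within the fitted range of roughly 0.33 to 21 megabytes):

```python
# Coefficients from Table 3.
WEIGHTS = {
    "intercept": 18.7997,
    "corpus_mb": 0.9417,
    "extraction": {"vicinity": 0.0, "pairs": 5.1307,
                   "sequences": 6.1418, "parser": 7.5423},
    "morphology": 0.5371,
    "spelling": 0.0589,
    "adj_set": {1: 0.0, 2: 2.5996, 3: -11.4882},
    "negative": 0.3838,
}

def predict_score(corpus_mb, extraction, morphology, spelling,
                  adj_set, negative):
    """Predicted F-score from the fitted linear model of formula (1)."""
    w = WEIGHTS
    return (w["intercept"] + w["corpus_mb"] * corpus_mb
            + w["extraction"][extraction]
            + w["morphology"] * morphology + w["spelling"] * spelling
            + w["adj_set"][adj_set] + w["negative"] * negative)

# Full corpus, parser extraction, all binary features on, adjective set 1:
print(round(predict_score(21, "parser", 1, 1, 1, 1), 4))
```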
From Table 3 we can see that the data extrac- 
tion model has a significant effect on the quality of 
the produced clustering, and among the linguistic 
parameters is the most important one. Increasing 
the size of the corpus also significantly increases 
the score. The adjective set that is clustered also 
has a major influence on the score, with rarer ad- 
jectives leading to worse clusterings. The two lin- 
guistic features "morphology" and "negative 
knowledge" have less pronounced although still 
significant effects, while spell-checking offers 
minimal improvement that probably does not jus- 
tify the effort of implementing the module and the 
cost of activating it at run-time. 
5.3 Overall effect of linguistic knowledge 
Up to this point we have described averages of 
scores, taken over many combinations of features 
that are orthogonal to the one studied. These 
averages are good for describing the existence of a 
difference caused by the different values of each 
feature, across all possible combinations of the 
other features. They are not, however, repre- 
sentative of the performance of the system in a 
particular setting of parameters, nor are they 
suitable for describing the difference in features 
quantitatively, since they are averages taken over 
widely differing settings of the system's 
parameters. In particular, the inclusion of very 
small corpora drives the average scores down, as 
we have confirmed in a more detailed analysis 
where averages were computed separately for each 
value of the corpus size parameter. To give a feel- 
ing of how important the introduction of linguistic 
knowledge is quantitatively, we compare in Table 
4 the results obtained for the full corpus of 21 mil- 
lion words for the two cases of having all or none 
of the linguistic components active. The scores ob- 
tained by a random system that produces partitions 
of the adjectives with no knowledge except the 
number of groups are included as a lower bound. 
These estimates are obtained after averaging the 
scores of 20,000 such random partitions for each 
adjective set. The average scores that each human 
model receives when compared to all the other 
human models are also included, since they 
provide an estimate of the maximum score that can
be achieved by any system. That maximum
depends on the disagreement between models for 
each adjective set. For these measurements we use 
a smaller smoothing window of size 3 instead of 5, 
which is fairer to the system when its performance 
is compared to the humans. We also give in Figure 
5 the grouping produced by the system without
using any of the linguistic modules for adjective
set 3; this is to be contrasted with Figure 1.
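The random lower bound described above can be sketched directly: partition the adjectives into a fixed number of groups uniformly at random, score each partition against a reference grouping, and average over many trials. The pairwise-agreement score used here (a Rand-index-style measure) is an assumption for the sketch; the paper's own scoring function is not reproduced.

```python
import random
from itertools import combinations

def random_partition(words, n_groups, rng):
    """Assign each word to one of n_groups groups uniformly at random."""
    return {w: rng.randrange(n_groups) for w in words}

def pairwise_agreement(p1, p2, words):
    """Fraction of word pairs treated the same way by both partitions
    (both together or both apart)."""
    pairs = list(combinations(words, 2))
    same = sum((p1[a] == p1[b]) == (p2[a] == p2[b]) for a, b in pairs)
    return same / len(pairs)

rng = random.Random(0)
words = ["dry", "wet", "lenient", "strict", "deadly", "fatal"]
reference = {"dry": 0, "wet": 0, "lenient": 1, "strict": 1,
             "deadly": 2, "fatal": 2}

scores = [pairwise_agreement(random_partition(words, 3, rng),
                             reference, words)
          for _ in range(20000)]
baseline = sum(scores) / len(scores)
print(round(baseline, 3))
```

Averaging over many random partitions, as the paper does with 20,000 per adjective set, stabilizes the estimate of the chance-level score.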
6. GENERALIZING TO OTHER 
APPLICATIONS 
In the previous section, we showed that the in- 
troduction of linguistic knowledge in our system 
produces a performance difference, which is not 
only statistically observable but also quantitatively 
significant (cf. Table 4). We believe that these 
positive results should also apply to other corpus- 
based NLP systems that employ statistical 
methods. Many of the linguistic components of 
our system, including the extraction model that 
was shown to be the most important linguistic 
parameter, are not specific to the word grouping 
problem. They can thus be directly incorporated in 
systems designed for other problems but essen- 
tially following the same basic architecture as 
ours. 
Many statistical approaches share the same 
basic methodology with our system: a set of words 
is preselected, related words are identified in a 
corpus, the frequencies of words and of pairs of 
related words are estimated, and a statistical model 
is used to make predictions for the original words. 
Across applications, there are differences in what
words are selected, how related words are defined,
and what kinds of predictions are made. Neverthe-
less, the basic components stay the same. For ex-
ample, in our application the original words are
1. catastrophic harmful 
2. dry wet 
3. lenient rigid strict stringent 
4. communist leftist 
5. flexible hostile protracted unfriendly 
6. abrupt chaotic disastrous gradual 
turbulent vigorous 
7. affluent affordable inexpensive 
prosperous 
8. outrageous 
9. capitalist socialist 
10. dismal gloomy pleasant
11. generous insufficient meager scant 
slim 
12. delicate fragile 
13. brilliant energetic 
14. dual multiple stupid 
15. hazardous toxic unreasonable 
unstable 
16. plain 
17. confusing 
18. clever 
19. endless 
20. clean dirty impoverished 
21. deadly fatal 
22. astonishing misleading stunning 
23. dumb fat smart
24. exotic 
Figure 5: Example clustering found by the 
system using no linguistic modules. 
the adjectives and the predictions are their groups: 
in machine translation, the predictions are the
translations of the words in the source language
text; in sense disambiguation, the predictions are
the senses assigned to the words of interest; in
part-of-speech tagging or in classification the
predictions are the tags or classes assigned to each
word. Because of this underlying similarity, the
comparative analysis presented in the paper is
relevant to all these problems. 
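The shared architecture described above can be reduced to a skeleton: preselect target words, extract related words from a corpus, count frequencies, and let a statistical model make the prediction. In the sketch below both the extraction model (a fixed window) and the predictor (cosine similarity of co-occurrence vectors) are deliberately naive stand-ins, not the system's actual components.

```python
import math
from collections import Counter

def extract(tokens, targets, window=2):
    """Extraction step: count words seen near each target word."""
    counts = {t: Counter() for t in targets}
    for i, tok in enumerate(tokens):
        if tok in targets:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tok][tokens[j]] += 1
    return counts

def cosine(c1, c2):
    """Prediction-step stand-in: similarity of two count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

tokens = "a dry climate , a wet climate , a strict rule".split()
counts = extract(tokens, {"dry", "wet", "strict"})
print(cosine(counts["dry"], counts["wet"]),     # similar contexts
      cosine(counts["dry"], counts["strict"]))  # less similar
```

Any of the applications listed above fills in the same two slots: its own definition of "related word" for the extraction step, and its own predictor over the resulting counts.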
For a concrete example, we examine the case 
of collocation extraction that has been addressed 
with statistical methods in the past. Smadja 
[1993] describes a system that initially uses the
"nouns in vicinity" extraction model to collect
cooccurrence information about words, and then 
identifies collocations on the basis of distributional 
criteria. A later component filters the retrieved col-
locations, removing the ones where the participat-
ing words are not used consistently in the same
syntactic relationship. This post-processing stage 
doubles the precision of the system. We believe 
that using from the start a more sophisticated ex- 
traction model to collect these pairs of related 
words will have similar positive effects. Other lin- 
guistic components, such as a morphology module 
that combines frequency counts, should also im-
prove the performance of the system. In this way,
we can benefit from linguistic knowledge without 
having to use a separate filtering process after ex-
pending the effort to collect the collocations.
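The distributional criterion mentioned above can be sketched in the spirit of Xtract's first stage (the exact statistics in Smadja [1993] differ): among the words co-occurring with a given word, keep those whose frequency lies well above the mean, measured in standard deviations. The threshold and the toy counts below are assumptions for the sketch.

```python
import math
from collections import Counter

def collocates(pair_counts, word, k=1.0):
    """Return co-occurring words whose frequency z-score exceeds k."""
    freqs = {w2: c for (w1, w2), c in pair_counts.items() if w1 == word}
    if len(freqs) < 2:
        return []
    mean = sum(freqs.values()) / len(freqs)
    var = sum((c - mean) ** 2 for c in freqs.values()) / len(freqs)
    std = math.sqrt(var)
    if std == 0:
        return []
    return [w for w, c in freqs.items() if (c - mean) / std > k]

# Invented co-occurrence counts for illustration.
pair_counts = Counter({("strong", "tea"): 20, ("strong", "table"): 1,
                       ("strong", "wind"): 12, ("strong", "the"): 3,
                       ("strong", "blue"): 1})
print(collocates(pair_counts, "strong"))   # ['tea']
```

A better extraction model changes what goes into pair_counts in the first place, which is why cleaner input pairs can substitute for some of the post-hoc filtering.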
Similarly, the sense disambiguation problem is 
typically attacked by comparing the distribution of 
the neighbors of a word's occurrence to prototypi- 
cal distributions associated with each of the 
word's senses [Gale et al., 1992, Schütze, 1992].
Usually, no explicit linguistic knowledge is used 
in defining these neighbors, which are taken as all 
words appearing within a window of fixed width 
centered at the word being disambiguated. Many
words unrelated to the word of interest are col- 
lected in this way. In contrast, identifying ap- 
propriate word classes that can be expected on lin- 
guistic grounds to convey significant information 
about the original word should increase the perfor- 
mance of the disambiguation system. Such classes 
might be modified nouns for adjectives, nouns in a 
subject or object position for verbs, etc. As we 
have shown in Section 5, less but cleaner infor-
mation increases the quality of the results.
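The contrast drawn above can be made concrete: collecting every word in a fixed window around an ambiguous word versus keeping only words in a chosen linguistic relation to it. The "relation" below is a crude stand-in (the noun immediately following an adjective); a real system would use a tagger or parser, and all names and data here are invented for the sketch.

```python
def window_context(tokens, i, width=2):
    """All words within `width` positions of token i (the usual approach)."""
    return [tokens[j] for j in range(max(0, i - width),
                                     min(len(tokens), i + width + 1))
            if j != i]

def modified_noun(tokens, i, nouns):
    """Only the noun the adjective at position i modifies (naive heuristic)."""
    if i + 1 < len(tokens) and tokens[i + 1] in nouns:
        return [tokens[i + 1]]
    return []

tokens = "the hard surface of the old table".split()
nouns = {"surface", "table"}
i = tokens.index("hard")
print(window_context(tokens, i))        # ['the', 'surface', 'of']
print(modified_noun(tokens, i, nouns))  # ['surface']
```

The window yields mostly function words; the linguistically selected context keeps only the informative neighbor, at the cost of collecting less data.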
An interesting topic is the identification of 
parallels of our linguistic modules for these ap-
plications, at least for those modules which, unlike
morphology, are not ubiquitous. Negative 
knowledge for example improves the performance 
of our system, supplementing the positive infor- 
mation provided by adjective-noun pairs. It could 
be useful for other systems as well if an ap- 
propriate application-dependent method of extract-
ing such information is identified.
7. CONCLUSIONS AND FUTURE WORK 
We have shown that all linguistic features
considered in this study had a positive contribution
to the performance of the system. Except for spell-
checking, all these contributions were both statis-
tically significant and large enough to make a dif-
ference in practical situations. Furthermore, the
results can be expected to generalize to a wide
variety of corpus-based systems for different applications.
The cost of incorporating the linguistics-based 
modules in the system is not prohibitive. The ef- 
fort needed to implement all the linguistic modules 
was about 5 person-months, in contrast with 7 
person-months needed to develop the basic statis- 
tical system. Furthermore, the run-time overhead 
caused by the linguistic modules is not significant.
Each takes from 1 to 7 minutes on a Sun SparcSta-
tion 10 to process a million entries (words or
pairs), and all except the negative knowledge
module need to process a corpus only once, reusing
the same information for different adjective sets. 
This should be compared to the approximately 15 
minutes needed by the statistical component for 
grouping about 40 adjectives. 
In the future, we plan to extend the results dis- 
cussed in this paper by an analysis of the depen- 
dence of the effects of each parameter on the 
values of the other parameters. We are currently 
stratifying the experimental data obtained to study 
trends in the magnitude of parameter effects as 
other parameters vary in a controlled manner, and 
we will examine the interactions with corpus size 
and specificity of clustered adjectives. We are 
also interested in providing similar quantitative 
results for other applications, to corroborate our 
belief in the generality of the importance of easily 
obtainable linguistic knowledge for statistical sys- 
tems. 
ACKNOWLEDGEMENTS 
This work was supported jointly by ARPA and 
ONR under contract N00014-89-J-1782, by NSF 
GER-90-24069, and by New York State Center for
Advanced Technology contract NYSSTF-
CAT(91)-053. I wish to thank Kathy McKeown,
Jacques Robin, and the workshop organizers for 
providing useful comments on earlier versions of 
the paper. 

REFERENCES 
Brown P., Della Pietra V., deSouza P., Lai J., and Mer-
cer R. (1992). Class-Based n-gram Models of
Natural Language. Computational Linguistics,
18:4, 467-479.
Draper, N. R. and Smith, H. (1981). Applied Regres-
sion Analysis (2nd ed.). New York: Wiley.
Elhadad, Michael. (1991). Generating Adjectives to
Express the Speaker's Argumentative Intent.
Proceedings of the 9th National Conference on
Artificial Intelligence (AAAI-91). Anaheim.
Gale, W. A., Church, K. W., and Yarowsky, D. (1992).
Work on Statistical Methods for Word Sense Dis-
ambiguation. Probabilistic Approaches to
Natural Language: Papers from the 1992 Fall
Symposium. AAAI.
Gibbons, Jean Dickinson and Chakraborti, Subhabrata.
(1992). Nonparametric Statistical Inference (3rd
ed.). New York: Marcel Dekker.
Hatzivassiloglou, Vasileios and McKeown, Kathleen.
(June 1993). Towards the Automatic Identifica-
tion of Adjectival Scales: Clustering Adjectives
According to Meaning. Proceedings of the 31st
Annual Meeting of the ACL. Columbus, Ohio:
Association for Computational Linguistics.
Hicks, C. R. (1973). Fundamental Concepts in the
Design of Experiments. New York: Holt,
Rinehart and Winston.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding
Groups in Data: An Introduction to Cluster
Analysis. New York: Wiley.
Kendall, M. G. (1938). A New Measure of Rank Cor-
relation. Biometrika, 30, 81-93.
Liddy, Elizabeth D. and Paik, Woojin. (1992).
Statistically-Guided Word Sense Disambiguation.
Probabilistic Approaches to Natural Language:
Papers from the 1992 Fall Symposium. AAAI.
Pereira F., Tishby N., and Lee L. (June 1993). Dis-
tributional Clustering of English Words.
Proceedings of the 31st Annual Meeting of the
ACL. Columbus, Ohio: Association for Com-
putational Linguistics.
Schütze, Hinrich. (July 1992). Word Sense Dis-
ambiguation With Sublexical Representations.
Proceedings of the AAAI-92 Workshop on
Statistically-Based NLP Techniques. AAAI.
Smadja, Frank. (March 1993). Retrieving Collocations
from Text: Xtract. Computational Linguistics,
19:1, 143-177.
Spath, Helmuth. (1985). Cluster Dissection and
Analysis: Theory, FORTRAN Programs,
Examples. Chichester, West Sussex, England:
Ellis Horwood.
Van Rijsbergen, C. J. (1979). Information Retrieval
(2nd ed.). London: Butterworths.
