Generalizing Automatically Generated Selectional 
Patterns 
Ralph Grishman and John Sterling 
Comlmter Science Department, New York University 
715 Broadway, 7th Floor, New York, NY 10003, U.S.A. 
{grishlnan,sterling } (O) cs.nyu.cdu 
Abstract 
Frequency information on co-occurrence patterns can 
be atttomatically collected from a syntactically ana- 
lyzed corpus; this information can then serve as the ba- 
sis for selectional constraints when analyzing new text; 
from the same domain. Tiffs information, however, is 
necessarily incomplete. We report on measurements of 
the degree of selectional coverage obtained with ditt\~r- 
ent sizes of corpora. We then describe a technique for 
using the corpus to identify selectionally similar terms, 
and for using tiffs similarity to broaden the seleetional 
coverage for a tixed corpus size. 
1 Introduction 
Selectional constraints specify what combinations of 
words are acceptable or meaningful in particular syn- 
tactic relations, such as subject-verb-object or head- 
modifier relations. Such constraints are necessary for 
the accurate analysis of natural language text+ Accord- 
ingly, the acquisition of these constraints is an essen- 
tial yet time-consuming part of porting a natural lan- 
guage system to a new domain. Several research groups 
have attempted to automate this process by collecting 
co-occurrence patterns (e.g., subject-verb-ol)ject pat- 
terns) from a large training corpus. These patterns 
are then used as the source of seleetional constraints 
in attalyzing new text. 
The initial successes of this approach raise the ques- 
tion of how large a training corpus is required. Any 
answer to this question must of course be relative to 
the degree of coverage required; the set of selectional 
patterns will never be 100% complete, so a large cor- 
pus will always provide greater coverage. We attempt 
to shed to some light on this question by processing 
a large corpus of text from a broad domain (business 
news) and observing how selectional coverage increases 
with domain size. 
In many cases, there are limits on the amount of 
training text, available. We therefore also consider how 
coverage can be increased using a tixed amount of text. 
The most straightforward acquisition procedures build 
selectional patterns containing only the specific word 
combinations found in the training corpus. (areater 
coverage can be obtained by generalizing fl'om the 
patterns collected so that patterns with semantically 
related words will also be considered acceptable. In 
most cases this has been (lotto using manually-created 
word classes, generalizing fi'oul specific words to their 
classes \[12,1,10\]. If a pre-existing set of classes is used 
(as in \[10\]), there is a risk that the classes awdlable 
may not match the needs of the task. If classes are 
created specifically to capture selectional constraints, 
there lnay be a substantial manual I>urden in moving 
to a new domain, since at least some of the semantic 
word classes will be domain-specillc. 
We wish to avoid this manual component by auto: 
maritally identifying semantically related words. This 
can be done using the co-occurrence data, i.e., by idea: 
tifying words which occur in the same contexts (for ex- 
ample, verbs which occur with the same subjects and 
objects). From the co-occurrence data o110 Call coiil.- 
pute a similarity relation between words \[8,7\]. This 
similarity information can then be used in several ways. 
One approach is to form word clusters based on this 
similarity relation \[8\]. This approach was taken by 
Sekine et al. at UMIST, who then used these chlsters 
to generalize the semantic patterns \[11\]. l'ereira et al. 
\[9\] used a variant of this approach, "soft clusters", in 
which words can be members of difl'erent clusters to 
difl'eren t degrees. 
An alternative approach is to use the word similar- 
ity information directly, to inDr information about the 
likelihood of a co-occurrence pattern from information 
abont patterns involving similar words. This is the 
approach we have adopted for our current experiments 
\[6\], and which has also been employed by 17)agan et 
al. \[2\]. We corl:lttttl;e from the co+occurrence data a 
"confitsion matrix", which measures the interchange- 
ability of words in particular contexts. We then use 
the confllsion matrix directly to geueralize the selllan- 
tic patterns. 
2 Acquiring Semantic Patterns 
Based on a series of experitnents over tile past two 
years \[5,6\] we have developed the following procedure 
742 
for acquiring s(;m&lll;ic i)attern.q from a 1;ext (:()i'l)US: 
l, l)~rsc the trMning corpus using a hro~d-cover~g(~ 
~r~i, iilll\]~tr) a\[ld i'(~ll\]~triT, e Lli(! |)arse,~ I,() I)roduce 
solllethil/g Mdil to ~ul l,F(l f-.sl, ructure, with explic 
il,ly lal)olod synl;icctic i'ela, tions su0h a,q SUII.1 li\]CT 
and O l:lJ F, CT. i 
2. Extract froHi (,he regul~rizcd l);u'se ;~ series of 
~riplcs of I;hc forni 
helm syntactic-reh~tion head-otLargumenl, 
/modi{icr 
Wc will use l, he iiotM, ion< 'lui~' 'll/j > for ~ilch ;t 
triple, ~li(I < r u:j > for a rt~laJ:ion--arguin<~nt pah'. 
3. (',Olnptite tho h'eqliency P' of e~tc:h head and e~ch 
triple in the corpllS. If a, SelitCilce prodliCe8 N 
pm'ses, ~ l, rilAe gOlier~ti,ed froln a, singh~ pa, rse has 
weight 1IN in the t, of, M. 
Wc will IlS(~ th(~ liot,lttion 1'~(< 'll)irw j >) \['or tim fro 
qlion0y of a triple, ;Uld P'A(,,l('wi) t()r the frl!qilellCy 
wil, h w\[iich w i a4il)eai's as ~t hea,{t iii a plti'se i,I'C(L 7 
For exaniple> the S(~IIIrOI/C.{2 
Mary likes young lhiguists froni I,hn(n'ick. 
would produc(: the regul~trized synthetic structure 
(s like (subjcci. (up Mary)) 
(/,bj(,ci; (nil liuguist (a-pos young) 
(ffol,i (np I,i:,,Mck))))) 
fl'Olli which i,hc following folir t,ripl0s arc gjelit~i';tl,e(\[: 
like subject Mary 
like ,lbj(~ot liugliist 
lhig;uist *>t)os yOIlll if, 
linguist froin l,iulerick 
(~iwm tJlc fre(lllCncy hiforiilM;ioll l+'~ we C~tll l, hen 
estiin~tte the prob;d)ilii,y i, ha, i, ~ particular head wi 
al)pem's with a parl, iculnr ;tl'~lilllOiil, or uiodilier 
< r 'wj > : 
/"_(< "2~ "~"~, >)~ 
Tliis probahility infornl~ctiou would l, hcn bc used in 
scoi'hlg altei'naA;ive p~trse (,rots, \[,'or (,lie eva, itla, tiOll l)0- 
lOW: howcvcr, wc will I18(7 l,h(~ t'r(~tltl011cy tl&~a, l" directly. 
Stall 3 (i, he I, riples exLr~-tcl,i{ni) inchl(tes ;t liilllil)(~r of 
spccial cases: 
l-hie wlt.h SOlllowh&l, ii1o1"1: i'egli|aril,;tl01oil \[|l{tli iS I'l()litl ill I J"( i; 
in pal'ticill&r~ l)a.ssivl: strll(:l.lircs &l'(l CilllVCl>t.ed tO clnTeSllOlidillg 
active f( )\['illS, 
~N,)Ce that l'),<,<,~(wi) is 
different ~¥oni /a(w i apl>ears as a head in a triple) sin(c a single 
hc&d in a l)&rse trec ln~ly l)rodu(:e sorer&| slich triples, one for 
eaCll &rgtllllClit OF nio(li|i(!r of tlutt head. 
(a) if ;t vm'b has a sep~l~i-M)le lmrticlc (e.g., "ouL" in 
"c,~u-ry out"), this is hi;filched t,o the head (to crc- 
~,i;e the ticked carry-oul) ttll(I l\]()tu IH'o~Li, cd sis ;% Sel)~t 
rate rclatioii. I)ilfereilt p~vq.iclcs often corl'oSpolid 
to very dilfcrenl, senses of ~ w~rb, so this avoids 
coutlt~thig i, he subject ~md object distrilnitious of 
these different selises. 
(b) if the Vel'll is (<tie >>, We genera, l;(~ a i'eltttion bc- 
COml)lcmcnl bei,weeu the suliject and the pre(licat(~ 
COil liJiClll(!lli;. 
((:) triples in which either I,he heard or l, he a, rg~ttlll(!ll{ is 
;3. I)l'OliOtlll ;IA'(~ disc~crdcd 
(d) l,rilflCs in which the aa'<gliln(~nt i~ ;i, su\[ioi'din~tte 
(:l~uise ~cre disc*ci'dcd (lihis iuchides sut)or(Ihi;~l;e 
toll junctions ;uul wn't)s l,a, king cl~tils;_c\] ~_n'gunienl,s) 
(e) l.l'iples indic:cling negMiio/i (with &ll ;M'<ff.iil\[10ill, Of 
"not" or "newer") are ignored 
3 Generalizing Semantic Pat- 
terns 
The proc.rdurc described M)ow ~, producers a. set, of I~'c 
(luencies 3,11d I)r(dmbilit.y esl.intatcs based on Sl)ccilic 
wordm The "trnditi<mM" ~tpproach to gcnerMizing tiff<; 
inl\:,rma, tion ha~ I:,ccu i;o assign the word,'-; t,() a set or 
ScIIl~tllti(' C\[asges, g%n(\[ thrill \[,0 collect the freqileiicy in- 
\[br|m~tion on COlnbinations of sen,antic cla.sscs \[ 12, 1\]. 
Since ~t; least some of t, hese classes will be domain 
Sl)ecilic , there has \[)t!ell inl.erest in mltomating the ac- 
quisition of these classes ~ts well. This c~m be done 
by ch,st,(,ring l,ogetohcr words which appear in the s;m,c 
contexl.. Sl.arting from the lile of l;riplcs, thi:s involves: 
I. collecting for e;~l:h woM i;he \['re(lucilcy with which 
it occurs in each possilAc context; f.r cx~mplc, for 
n nou,l we wouhl collect the frequency with which 
it occurs as the slll)jeci; ;till\[ 1.he object ot'(~ach verb 
2. delini,g a similarity lll{~tSill'(': between words, 
which rellccts thc tlllltll'Jer Of CO\]\[III\[IOll COIIt(!XL8 ill 
which l;ticy nppc~r 
3. foruiing clusters Imsod on this similarity m{~asul'c 
Such a procedure was performed by Sekinc ct M. ~tt 
\[lMIS'l' \[ll\]; these chnsl.crs were then manually r(! 
viewed ~tnd the i'eStlltilt~ clusters wet'(! used 1() F, Clml'- 
Mizc SelC:Ctioll:rll pa, tllern,'s. A sitnila, r afq)roa,ch Lo word 
cluster rormatiou w~s dcscril}cd by llirschum.u (,t al. 
in 1975 V\]. M(iro ,'(...<y, rer,,ira, .t ;~1. \[(4 l,.v. (l,,- 
sci'ibcd ;~ word chlstei'hlg nicthod ushig "soft cinsl,ers': 
hi which a, word C;lll belong to several chlsl,er,q, with 
dilN~i'enl, chtsl,er menll)ership I,'obalfilith~s, 
(Jlusl, er creal;iou has (,he ;Ldwull, ago l,ha.I, the clusl, ers 
;tr0 aAIlCll~tl)lc l;o ln~Ullt;cl review and correction, ()ll 
Lhe other haucl, olir experience illdicates 1,h;~t stlcccs.q 
rul chlster g~01if;ra, l, ioil depends Oil ral;her dclh:~i;c ad- 
jltsl;li~ienl, of the chlstcrillg criteria. We haw~ l, hcrcfore 
743 
elected to try an approach which directly uses a form 
of similarity measnre to smooth (generalize) the prob-. 
abilities. 
Co-occurrence smoothing is a method which has 
been recently proposed for smoothing n-gram models 
\[3\].a The core of this method involves the computation 
of a co-occurrence matrix (a matrix of eonfl, sion prob- 
abilities) Pc:(wj Iw0, which indicates the prol)ability of 
word wj occurring in contexts in which word wi occurs, 
averaged over these contexts. 
:L,(wj lwi) : ~ P(wjI,~)P(.~I'~,O 
E. P(,o~ Is)v(,,,~l,~)V(s) 
p(wi) 
where the sum is over the set of all possible contexts s. 
In applying this technique to the triples we have col- 
lected, we have initially chosen to generalize (smooth 
over) the first element of triple. Thus, in triples of the 
form wordl relation word2 we focus on wordl, treating 
relation attd word2 as the context: 
:'.(~,~lw5 
= ~ rO.:l < ,'~j >). v(< ,.w, > In,:) 
r:v~ 
F(< ~ ,- ~j >) /,(< w~ ,, ~"5 >) 
Informally, we ear, say that a large value of /)C'(,,il,)I) 
indicates that wi is selectionally (semantically) accept- 
able in the syntactic contexts where word w~ appears. 
For example, looking at the verb "convict", we see that 
the largest values of P(:(eonvict, x) are for a: = "acquit" 
and x = "indict", indicating that "convict" is selec- 
tionally acceptable in contexts where words "acquit" 
or "indict" appear (see Figure 4 for a larger example). 
How do we use this information to generalize the 
triples obtained from the corpus'? Suppose we are in- 
terested in determining (.he acceptability of the pattern 
convict-object-owner, even though this triple does not 
apl)ear in our training corpus. Since "convict" can 
appear in contexts in which "acquit" or "indict" ap 
pear, and the patterns acquit-object-owner and indicb 
o/)ject-owner appear in the corpus, we can conchlde 
thai, the pattern convict-object-owner is acceptable 
too. More formally, we compute a smoothed triples 
frequency lP.s' from the observed frequency /i' by aver- 
aging over all words w~, incorporating frequency infor- 
mation for w~ to the extent that its contexts are also 
suitable contexts for wi: 
:':~*(< *,:i ,. ,,,j >) -- ~ r"("'il*";)" ::(< ,,,~ ,, ,,:j >) 
~tJ 
lit or(ler to avoid the generation of confltsion table en- 
tries from a single shared context (which quite often 
awe wish to thank Richard Schwartz of BBN for referring us 
to this method &lid article. 
is the result of an incorrect I)arse), we apply a filter 
in generating Pc: for i ¢ j, we generate a non-zero 
Pc(wj Iwj only if the wi and wj appear it* at leant two 
eoitllnon contexts, and there is some eOlnnlon context 
in which both words occur at least twice, l,'urther- 
more, if the value computed by the formula for Pc' is 
less than some thresbold re:, the value is taken to be 
zero; we have used rc = 0.001 in the experiments re- 
ported below. (These tilters are not applied for the 
case i = j; the diagonal elements of the confusion 
matrix are always eomputed exactly.) Because these 
filters may yeild an an-normalized confltsion matrix 
(i.e., E~ t>(*vJlv'i) < l), we renorn, alize the n\]atrix 
so that }~,.j \[g,(wi\[wi ) = 1. 
A similar approach to pattern generalization, using 
a sirnilarity measnre derived fi'om co-occurrence data, 
has been recently described by l)agan et a\]. \[2\]. Theh' 
approach dill'ers from the one described here in two 
sign*titan* regards: their co-occurrence data is based 
on linear distance within the sentence, rather than on 
syntactic relations, and they use a different similarity 
measure, based on mutual information. The relative 
merits of the two similarity rneasures may need to be 
resolved empirically; however, we believe *bat, there is 
a virtue to our llOn-sylnlnetric lileaSlll'e~ becatlse 8tll)- 
stitutibility in seleetional contexts is not a symmetric 
relation .4 
4 Evaluation 
4.1 Evaluation Metric 
We have previously \[5\] described two methods for the 
evaluation of semantic constraints. For tile current ex-- 
periments, we have used one of these methods, where 
the constraints are evaluated against a set of manually 
classitied semantic triples. '~ 
For this (waluation, we select a small test corpus sep- 
arate fl'om the training corpus. We parse the corpus, 
regularize the parses, and extract triples just as we did 
tbr the semantic acquisition phase. We then manually 
classify each triph" as valid or invalid, depending on 
whether or not it arises fl'om the correct parse for the 
sentence. G 
We then estahlish a threshold 7' for the weighted 
triples counts in our training set, and deline 
4 If v:l allows a hi'o,taler range of argulnents than w2, then 
we can replace w2 by vq, but llOIb giC(~ versa, For (':xanlple~ w(; 
can repla(:e "speak" (which takes a human subject) by "sleep" 
(which takes an animate subject), and still have a selectionally 
valid pattern, \])tit. not the other wety around. 
~"l'his is similar to tests conducted by Pcreira ct al. \[9\] and 
l)agan et al. \[2\]. The cited tests, howevcl', were based ,m selected 
words or word pairs of high frequency, whereas ore" test sets 
involve a representative set of high and low frequency triples. 
stiffs is a different criterion fl'om the one used in our earlier 
papers. In our earlier work, we marked a triple as wdid if it 
could be valid in some sentence in the domain. We found that it 
was very (lilIicult to apply such a standard consistmltly, and have 
therefore changed to a criterion based on an individual sentence. 
744 
recall 
0.70 
0.60 
0.50 
0.40 
0.30 
0.20 
0.10 
RIIlI\]O 0 0 ° 
°'boo 
0 o 
~%o 
O 
o°% 
0.00 i- -7 l • • 
0,60 0.65 0.70 0.75 0.80 0.85 
precision 
Figm'c 1: l/.acall/prccision trade-oil using eutire cor- 
pus. 
vq numl',er of l.rilllcs in test set which were ('.lassitied 
as vMid and which a,F,l)em'ed iu training sct with 
count > "/' 
V__ llllllll)or oV tril)lcs in tcsl, set which were classilicd 
as valid m,d which nl)pearc(I in training set with 
COIlI/I. < ~/' 
i I- lmn,I)er of tril)lcs in t,est set. which were classitlcd 
as inwdid and which al)peared ilt trahfing set with 
CO\[llti, > "\[' 
and then delinc 
v -I 
recall = .... 
tJi + v_ 
I)reci,~hm _-: _ vq~ .. 
v+ + iq 
By wu'ying the l, hreshold, we can a~lcct dill\went 
trade-olfs ()f recall and precisioli (at high threshold, we 
seh~ct: only a small n,,mher of triph:s which apl)eared 
frequ(mtly and in which we l.hereforc have \]ligh conli-- 
(h!nce, t;hus obtaining a high precision lm(, \]()w recall; 
conversely, at a h)w t, hrcshohl we adndt a uuuch larger 
nund)er of i.riplcs, obt,aiuiug ~ high recall but lower 
precisiol 0. 
4.2 .t .s~ Data 
The trai,fing and Icst corpora were taken from the Wall 
Street ,hmrnaJ. In order to get higher-quality parses 
or I,\]lcse ,q(ml;elices, we disahlcd some of the recovery 
mechanisms normally t>ed in our parser. Of the 57,366 
scnte,lCCS hi our t,rMidng corpus, we ohtMned comph%e 
pars('s Ibr 34,414 and parses of initial substrings for an 
additional 12,441 s(mtenccs. These i)m'ses were th(m 
regularized aim reduced to t,riph~s. Wc gcnerat;(;d a 
total of 27q,233 distinct triples from the corpus. 
The test corpus used to generate l, he triph~s which 
were mamlally classified consisl,ed of l0 artMcs, also 
0.60 
0.58 
0.56 
0.54 
0.52 
0.50 
0.48 
recall 0.46 
0.44 
0.42 
0.40 
0.38 
0.36 
0.34 
0 × 
0 
X 
0 
0 
x 
...... 1- .... 7 ....... 1- -I 
0 25 50 75 100 
percent;~ge of corplls 
Figure 2: Growth of recall as a fimction of corpus size 
(percentage of totM corpus used), o = at 72% l)reci - 
siou; • = maximmn reca\[\], regardless of precision; x :- 
I)redicted values R)r m~Lximum recall 
D<)m the Wall S~;reet Journal, distinct from those in 
the training set. These articles produced a tcs(. set; 
containing a totM of i{)32 triples, of which 1107 were 
valid ;rod 825 were \[nvMid. 
4.3 Results 
4.3.1 Growth with Corl)us Size 
Wc began by generating triples from the entire corpus 
and cwdmLt, ing the selectional patterns as <lescribed 
above; tile resulth/g recall/l)recision curve generated 
by wu'ying the threshold is shown in Figure 1. 
To see how pattern coverage iwl)roves with corpus 
size, we divided our training corpus into 8 segments 
and coHll/uted sets of tril)lcs based on the lirst Seglllell|,, 
the Ih'st two segments, etc. We show iu Figure 2 a plot 
of recall vs. corpus size, both at ~ consl, ant precision of 
72% and for maximum recall regardless of precision .7 
The rate of g;rowth of the maximum recall cau be 
understood in teruls of the frequency distribution of 
triples. In our earlier work \[4\] we lit the growth data 
to curw~s of the form 1 -exp(-fia:), on tile assump- 
t.ion that all selectional imtterns are t~qually likely. 
This lttay have 1)ee|l a roughly accurate assumption for 
that app\]ication, involving semantic-class based pat- 
terns (rather t, han word-based l);-ttl;erns), and a rather 
sharply circumscribed sublanguage (m(xlical reports). 
For the (word level) pal;i,crlls described here, howevcr, 
the distribution is quite skewed, with a small number 
of very-high-frequency l)atl,erns, a which results in di\[: 
rN,, (1,tta point is shown for 72% precision for the first seg- 
lll(:ilt &lone };e(:allSe we ~tl'c nl)\[ able to re&oh ;t prcci.%lOll of 72~ 
with a single seglnent. 
a'l'hc number of highq're(luency patterns is m:(:enl, u;tted by 
745 
1000 
number 100 
of 
triples 
with 
this 
frequency i0 
% 
% 
% 
** 
..-.. 
I00 
f 
10 
fl'equency of triple in training corpus 
l"igure 3: Distribution of fre(tuencies of triples in train- 
ing corpus. Vertical scale shows number of triples with 
a given frequency. 
fereat growth curves. Figure 3 plots the number of 
distinct triples per unit frequency, as a function of fi'e- 
quency, for the entire training corpus, This data can 
be very closely approximated by a fimction of tile form 
N(t,') = al ;'-~, where r~ = 2.9. 9 
q'o derive a growth curve for inaxinmln recall, we 
will assunle that the fl'equeney distribution for triples 
selected at random follows the same tbrm. Let I)(7) 
represent the probability that a triple chosen at ran- 
dorn is a particular triple T. l,et P(p) be the density 
of triples with a given probability; i.e., the nmnber of 
triples with probal)ilities between p and p + ( is eP(p) 
(for small e). Then we are ass,,ming that P(p) = ~p-~, 
for p ranging fl'om some minimum probability Pmin to 
1. For a triple T, the probability that we would lind 
at least one instance of it in n corpus of w triples is 
approximately i -- c -~p(T). The lnaximum recall for a 
corpus of ~- triples is the probability of a given triple 
(the "test triple") being selected at random, multiplied 
by the probability that that triple was found in the 
training corpus, summed over a.ll triples: 
~)(r). (1 - e-~"("')) 
7' 
which can be coral)uteri using the density function 
~ 1 P' P(P)' (1 e-"V)dp 
m, n 
f l -~(1 c -T~' = rap. p - . )alp 
,,~ia 
By selecting an appropriate value of a (and correspond- 
ing l),~i,~ so that the total probability is 1), we can get a 
the fact that our lcxicM scmmcr replaces all identitiablc COlll- 
lYally lllLllleS by tile token a-company, all C/llTellcy values by a- 
currency, etc. Many of the highest frequency triples involve such 
tokens. 
9Thls is quite shnilm' to a Zipf's law distribution, for which 
w f'c(bondlw) 
eurobond 
foray 
mortgage 
objective 
marriage 
note 
maturity 
subsidy 
veteran 
commitment 
debenture 
activism 
mile 
coupon 
security 
yield 
issue 
0.133 
0.128 
0.093 
0.089 
0.071 
0.068 
0.057 
0.0d6 
0.046 
0.046 
0.044 
0.043 
().038 
0.038 
0.037 
0.036 
0.035 
Figure 4: Nouns closely related to the IIOUII 'q)ond": 
ranked by t):. 
good match to the actual maximum recall values; these 
computed values are shown as x in Figure 2. Except 
\['or the smallest data set, the agreement is quite good 
considering the wwy simple assumpt, ions made. 
4.3.2 Smoothing 
In order to increase our coverage (recall), we then ap- 
plied the smoothing procedure to the triples fi'om our 
training corpus. In testing our procedure, we lirst gen- 
erated the confusioll matrix Pc and examined some of 
the entries, l"igure 4 shows the largest entries in f'c for 
the noun "bond", a common word in the Wall Street 
Journal. It is clear that (with some odd exceptions) 
most of tile words with high t): wtlues are semanti- 
cally related to the original word. 
'lk) evaluate the etl\~ctiveness of our smoothing pro- 
cedure, we have plotted recall vs. precision graphs for 
both unsmoothed and smoothed frequency data. The 
results are shown in l,'igure 5. Over tile range of preci- 
sions where the two curves overlap, the smoothed data 
performs better at low precision/high recall, whereas 
the unsmoothed data is better at high precision/low 
recall. In addition, smoothing substantially extends 
the level of recall which can be achieved for a given 
corpus size, although at some sacrilice in precision. 
Intuitively we can understand why these curves 
should (:ross as they do. Smoothing introduces a cer- 
tain degree of additional error. As is evident from Fig- 
ure 4, some of the confllsion matrix entries arc spuri- 
ous, arising from such SOllrces as incorrect l)arses and 
the conIlation of word senses. In addition, some of the 
triples being generalized are themselves incorrect (note 
that even at high threshold the precision is below 90%). 
The net result is that a portion (roughly 1/3 to 1/5) of 
746 
0.80 
recldl 
0.70 
0.60 
0.50 
0.40 
O.3O 
0.20 
0.6O 
o 
~' .+ l,,ll, + + 
°~°- o o o 
*~. OaO 0 
o 
o 
*. o 
"t 8 
• , • °o o 
-- \[ .... I -- T-- • 
0.65 0.70 0,75 0.80 
precision 
l"ip;ure 5: Benel:its of stnoot,hiug for la.r,gcsl eorl)us: o 
-- unstn<>othe(I da.i,&, • =: sH,oothed (tati~. 
the tril>les added by smoothing ~l'e incorreet. At low 
levels of 1)recision, (,his l)ro({uces a. net gldn on t.he l)re - 
eision/rec+dl curve; +tt, highe,' levels o1' precision, '°here 
ix a. net loss. In a.ny event, smoothing (toes allow for 
sul)stlml, ially higher levels o1' recall than are possible 
without+ smoothing. 
5 Conclusion 
We \]tltve demonstrated how select.tonal p;~tl;erns ca.n lye 
attt;otn~tic;dly acquired from it corpus, ~md how selee 
tiond e,:)w~ragc gr~t(hlally it~ere~.tses with the size (,f the 
tra.ining eorl+us. We h~tve +dso domonst;r;tted that 
I'~)r it given corpus size eovcr++ge can I)e sig,tilicantly 
improved 1)y using 1,he corpus 1;o identify selectionally 
related ternts, and using these simila.rit.ies to generltlize 
the patterns tybserved in the training eorums. 'l'his is 
consisteut with other reeo,tt results using ,+eHtted l,eeh- 
niques \[2,9\]. 
We believe th+tl; Lhese Lt:ehniques e+tn I)e ft,t'ther im- 
proved in several wa.ys. The exl)erin,ent.s rel)orl.e+l 
above ha+ve only get'~ra\]ized over t, ht, lit'st; (head) po- 
sitioI, of t,he triples; we need to melrstu'o the eIl'<~ct of 
gcncrldizhtg ow~r the al'gum<mt l~OSh;ion as w<ql. With 
btrger eorpor~ it, I~,~ty itlso be I>asil)l{~ to use lirl+ger pat- 
terns, including in p~trtieular st,b.jeet-verl)-~d).ieet i)~tl;. 
terns, ;tnd thus reduce the confusion due to tre~tt, ing 
(li|t'et'e,lL words senses its eOlllltlOll eontexLs. 
References 
\[I\] .ling-Shin (~haatg, Yih-l:en l,uo, +rod l(eh--Yih 
Su. G\[)SM: A eo;enera.lized l)robtd)ilislic s<.~tllltllt;ic 
model for n.lnbiguil, y resolution. It+ l+rocccdiP+\[I s of 
the ,\[\]Olh Annual Meeting of tile Assn. for (/o~tpu- 
rational Linguistic.s, l);tges 177 18/t, Newark, I)E, 
June 1992. 
\[2\] Ido l)aglm, Hh~ml M~u'cus, ;tud ~qh~ul M~rkovilch. 
(3ontextual word similarity ~md estintnti(m from 
spa.rse (lal, c~. In l"l"oeeedings of the ,"7/.st An~lual 
Mt:t'.li~g +4" lht Assn. for Co~ltp'atatio~tal Li?Jg'llis- 
tics, pa.ges 3l 37,(lolunibus, 011,.lune 1993. 
\[5;\] U. Essen +rod V. Steinbiss. (~ooccta'renc~ 
stnoothing for st.oehltst;ie langua.ge rnodeliug. It, 
1(7A,%',<¢1'9°~?, pages I 161 I 164, S;I.tl I"rll.tl{:isel~, 
(3A, M;i.y 1992. 
\[14\] R. (;visluna.n, I,. l/irschma.n, and N.T. Nllctn. Dis-. 
eovery procedures for sul)l~mgui~ge selcet, iot~;tl pitt- 
terns: Initi+d exl)eriments. (/oTitp+ttatio'~tal Li'a- 
(luislic.% 1213):205 16, 1986. 
\[5\] l{+~rll)h (lrishm+m aztd ,lohn Sterling. A cquisitiott 
of seleetion~d i)aAA,erns. In l'roc. 14th lnt'l (/o~+\]~ 
(/o~l~pulatio?lal Ling~lislics (COLIN(? 92), N~mtes, 
France, July 1!)92. 
\[6\] R.~dph (~t'ishmatl ~md John S(.erlhlg. SutootAting 
of autotn++t, ic.ally generltted selecl, ionttl consLr++int.s. 
Ill \[s?'occt,.diTt\[\]s oj Ihc \[htTtta~t L(t?ty\]~ta.qt' '\]'t!c\[ti~olo(\]\[q 
Workshop, l'rineel, on, N J, March 1993. 
\[7\] l)onldd Ilindle. Noun el~,ssitica, tAon I'rol,t 
predica, t+e-~+rgutnent struel;ures. In l'?'occeding.s o\] 
the ,?Slh An+l+tal Meeting of the Ass'it. \]br (;o+++pu- 
talio~.al Ling+ui.slics, i)a.ges 268 275, Jitne 1990. 
\[8\] l,ynet\[,e llirschtnlm, H.idl)h (h'isht,t+L,l, ;ttk(l Ni~ot,ti 
S;tger. (~r&tnlrlitl+ieltlly-I'~itsed ~tUl;Oln&Lic word 
cli~ss t'ortnlttion, l~tJ'or~ttalio?t l'~'occ.~si~t:l a',d 
Manage.mcnl, 11(1/2):3D 57, 1975. 
\[9\] l"ern+mdo l'ereiri% N~d't;;di Tishl)y, imd I,illilm Lee. 
l)istributiona, l clustering of English words. In 
l"r'occedi~l\[ls of lhe <Ylsl An~vual Mcelin:l of the 
Ass?t. j'o't" (/omp,zlalional Li?ig+tistics, pltges 183 
l'()0, (',olund)us, ()11, June 1903. 
\[10\] Philip R.esnik. A elass-b;tsed :tlH)t'oach to lexi- 
cal discovery, lu I'rocecdi?tgs of tile ,YOlh A?t'nual 
Mecli'ag oJ" tht Assn. for (\]o~lpulalional L4'~./li~is- 
lies, Newark, I)F,, .lun+~ 1992. 
\[1 I\] S~ttoshi Sekine, Soti~ A,~a.niadou, .lere,,,y (~*tr,'oll, 
;u,d .lun'iehi :l'sujii. 15nguistie knowl<>dg;e get,- 
el'i;~l, or. hi l'roc, l/~lh, htt'l (\]onfl (/omp,lialio?lal 
Li?tg~ti.slies ((/OLIN(I 9,°3), pa.ges 5G0 5(i6, N;tnl.es, 
I:,'a.nce, .luly 19.92. 
\[12\] l:'a.ola. Vela.rdi, M;tria. 'l'eros~t Pazienza, and 
Miehel+~ Fa.solo. lIow to encode semantic knowl 
edge: A lnethod for lne~+ning represent~tti{m and 
compuLer-~dded ~cquisiticm. (7o?np+tlalio'nal I,i~l- 
guislics, 17(2):153 170, I,1t(.)1. 
747 
