Symbolic word clustering for medium-size corpora 
Benoit Habert* and Elie Naulleau* ** and Adeline Nazarenko* 
*Equipe de Linguistique Inform~tique 
Ecole NorInale Sup&ieure de Fontenay-St Cloud 
31 av. bombart, F-92260 Fontenay-aux-Roses 
Firstname. Name@ens-fcl. fr 
**Direction des Etudes et Recherches - Electricitd de Fra, nce 
1, av. du G ~z de Gaulle, F-92141 Clamart 
F irstname. Name@der. edfgdf, fr 
Abstract 
When trying to identify essential con- 
cepts and relationships in a medium-size 
corpus, it is not always possible to rely 
on statistical methods, as the frequencies 
are too low. We present an alternative 
method, symbolic, based on the simplifi- 
cation of parse trees. We discuss the re- 
suits on nominal phrases of two technical 
corpora, analyzed by two different robust 
parsers used for terminology updating in 
an industrial company. We compare our 
results with Hindle's scores of similarity. 
Subjects Clustering, ontology development, ro- 
bust parsing, knowledge acquisition from corpora, 
computational terminology 
1 Identifying word classes in 
medium-size corpora 
In companies with a wide range of activities, such 
as EDF, the French electricity company, the rapid 
evolution of technical domains, the huge amount 
of textual data involved, its variation in length 
and style imply building or updating numerous 
terminologies as NLP resources. In this context, 
terminology acquisition is defined as a twofold 
process. On one hand, a terminologist must iden- 
tify the essential entities of the domain and their 
relationships, that is its ontology. On the other 
hand, (s)he must relate these entities and rela- 
tionships to their linguistic realizations, so as to 
isolate the lexical entries to be considered as cer- 
tified terms for the domain. 
In this paper, we concentrate on the first is- 
sue. Automatic exploration of a sublanguage cor- 
pus constitutes a first step towards identifying the 
semantic classes and relationships which are rele- 
vant for this sublanguage. 
In the past five years, important research on the 
automatic acquisition of word classes based on lex- 
ical distribution has been published (Church and 
Hanks, 1990; Hindle, 1990; Smadja, 1993; Grei~n- 
stette, 1994; Grishman and Sterling, 1994). Most 
of these approaches, however, need large or even 
very large corpora in order for word classes to be 
discovered 1 whereas it is often the case that the 
data to be processed are insufficient to provide re- 
liable lexical intbrmation. In other words, it is not 
always possible to resort to statistical methods. 
On the other hand, medium size corpora (between 
100,000 and 500,000 words: typically a reference 
manual) are already too complex and too long to 
rely on reading only, even with concordances. For 
this range of corpora, a pure symbolic approach, 
which recycles and simplifies analyses produced 
by robust parsers in order to classify words, offers 
a viable alternative to statistical methods. We 
present this approach in section 2. Section 3 de- 
scribes the results on two technical corpora with 
two different robust parsers. Section 4 compares 
our results to Itindle's ones (Hindle, 1990). 
2 Simplifying parse trees to 
classify words 
2.1 The need for normalized syntactic 
contexts 
As Hindle's work proves it, among others (Gr- 
ishman and Sterling, 1994; Grefenstette, 1994:), 
the mere existence of robust syntactic parsers 
makes it possible to parse large corpora in order 
to automate the discovery of syntactic patterns 
in the spirit of Harris's distributional hypothesis. 
Itowever, Harris' methodology implies also to sim- 
plify and transform each parse tree 2 , so as to ob- 
tain so-called "elementary sentences" exhibiting 
the main conceptual classes for the domain (Sager 
lIa'or instance, Hindle (Hindle, 1990) needs a six 
million word corpus in order to extract noun similari- 
ties from predicate-argunlent structures. 
2Changing passive into active sentences, using a 
verb instead of a nominalization, and so on. 
490 
NP\] 
NPa AP4 
I I 
Nr As 
I I stenose serre 
NPo 
PP2 
Pa NP6 I 
de 
D9 NPlo 
le NPll AP12 
t 
NPla AP14 A15 
I I I N~ A~r gauche 
I I tronc eorninun 
I?igure 1: Parse tree for stenose serre de le hone commun gauche 
et al., 1987). 
In order to ~mtomate this normalization, we 
propose to post-process parse trees so as to em- 
phasize the dependency relationships among the 
content words and to infer semantic classes. Our 
approach can be opposed to the a prior one 
which consists in building simplified representa- 
tions while parsing (Basili et al., 1994; Metzler 
and Haas, 1989; Smeaton and Sheridan, 19911). 
2.2 R.ecycling the results of robust 
parsers 
For the sake of reusability, we chose to add a 
generic post-processing treatment to the results of 
robust parsers. It ilnplies to transduce the trees 
resulting fl:om different parsers to a common for- 
nlat. 
We experimented so t~r two parsers: Aleth- 
(h:am and I,exl;er, which are being used at DER- 
EDI,' for terminology acquisition and updating. 
They both analyze corpora of arbitrary length. 
AlethGram has been developped winthin the 
GIIAAL project a. I,EXrI'ER has been developped 
at DER-EI)F (Bourigault, 1993). In this exper- 
inlent, we if)cussed on noun phrases, as they are 
central in most terminologies. 
2.3 The simplification algorithm 
The objective is then to reduce automatically the 
numerous and complex nominal phrases provided 
by AlethGram and LEXTEI{ to elementary trees, 
3The Eureka GRAAL project gathers in France 
(IC1-F, RLI (prime contractor), EDF, Aerospatiale and 
l{enanlt. 
which more readily exhibit the flmdamental bi- 
nary relations , and to classify words with respect 
to these simplified trees. 
For instance, from the parse tree for slenose 
serve de le tronc eommun gauche 4 (cf. fig. 2, in 
which non terminal nodes are indexed for refer- 
ence purposes), the algorithm 5 yields the set of 
elementary trees of figure 1. 'l'he trees a and c 
correspond to contiguous words in the original se- 
quence, whereas b and d only appear after modifier 
removal (see below). 
Two types of simplifications are applied when 
possible to a given tree: 
1. ,5'plitting: Each sub-tree immediately domi- 
nated by the root is extracted and possibly 
further simplified. For instance, removing 
node NP0 yields two sub-trees: NP\], which is 
elementary (see below) and PP2, which needs 
further simplification. 
2. Modifier removal: Within the whole tree, ev- 
ery phrase which represents a modified con- 
stituent is replaced by the corresponding non 
modified constituent. For example, in NP0, 
the adjectival modifier scrrc is removed, as 
well as the determiner and the adjectives 
4 Tight stcnosis of left common mainstem. In both 
parsers, {,he accents are removed during tile analy- 
sis, the lemmas are used instead of inflected fo,'ms. 
Additionally, fro' simplitication purposes, a contracted 
word like du is considered as a preposition- determiner 
sequ_enec. 
5See (Habet't el; al., 1.995) for a detailled presen- 
tal,ion. The corresponding software, SYCI,AI)E, has 
been developped by the tirst author. 
491 
N|)a, 
NP AP 
I I 
N A 
I I 
st, chose Serl?e 
NPv 
N P,, Nl)~t 
NP PP ~ 
I ~ NP AP NP A P 
N /' Xl' I I I I I I I 
N A N /I 
stc~ms~: de N \[ \[ \] I 
\[ troIlc COIflfft/l\[l ~Tollc .(\]ditch( ~ 
tronc 
l)'igm:e 2: I,\]lcmcntary trees for sl.cnose serve de lc tro,m co'm..mm~ (l(mchr. 
-_coronarien 
-._gauche 
-~------___._ a t t ein t e. de- 
~.~ diametre .de.... --------'-----I~ tr0nc 
~ ~,~;tenose de - / 
- coronarJ ell /\ / 
/ '~ m rena\],~ / 
5~ / ~ presence .de~ / 
• - corona±re 
-- . - coronarien / \ / 
." I / --_ell.staLe \ - coronarJ en L • - clrconrzexe / montre_de - \ -- de artere 
-\ -~\]roxlmal z -- - " -- " \] k de artere / ~ dJametre de .- 
-.joroximal \ / / ~dr~n \ / / 
- de intervent-riculaire /_ aorti~,e ~ ~ , ~' / .... / .......... / "~- corortarzen/\ | 
\ \[ / - coronarien --dkffus / \ I \ / / . ~ - circonflexe ...... / \ / 
k I / - lorpzma± - coronarien existence de - \ 
........ "ien~ | / / --dia on~\] ~7--- k / injec~q "--de artere ~ ~ ~/ - 
g ,c .. ..... ~ \ - c ..... rJ .... -- -- ~" - non-slgnlr icaEz£ / ~,. "~ , 
- de carotide4~ -r '~ ~ / ~ \ N mztra± - ~ ~ - eszaue± / ~ \ ~7 , 
S / N. ~ (llagnostlC de - 
-d2 cnat°;idetdulaJ !e ~ % c°;°ne arien'-de-'trOnc <°r°narien// I ~ 
- ---'. " - ' ' '. X -_ / -- " ~.- / b atheromateux persist .... e de- I k- de artere /.-_severe-- /I.- ....... ire 
/ / X-- ~ /severite de_- <- coronarien 
/ / ~ ,~ .// / i~ -severeX 
OOCtUS~ -Cgrl~arlen --- ce;°;earzen "~'o~ ~'~ ...... / / X 
-- de artere -- coronar\] en - corona±re - de artere - " - " --~ 
- de tronc { --severe - coronarlen 
- \ | --- de tronc -_im3ortant \\- 
severe 
-~de_ar 
Sl~tm~-~----___~_coronazzen 
- de artere --'---'!~ a~nte ~ 
mab~ 
1 
~ coronarien 
diagnostic de -~ 
frequence de - ,1 
a~asdemse 
Figure 3: Example of a strongly connected component ((\]MC corpus) 
492 
uiodifying I, ro~.c, which l('~ts (.o elenienl;ary 
I;ree b. 
W\[i(;li I;\]ie (:lirrolil. imee is clc'm.~;nlary, t\]io sini- 
plific~lJon process s~ops. I~efT)re I)rocessing lill(, 
s(;L ()f oril>;iuM \])aJ:se l;l:ees, OllC llillS\[; dec/a>re 1.|le 
l;l:ees which li:i/lsi; iv)l; I)e sinll)lilied a.lly fllrl.\]ier. Ill 
I:)iis exl)el:inlent; , a.rc (:om~id(:red as e/el\]i(;il\[,itl'y l,he 
ilOliiinaJ I;l'(;es which exhil)it, a. binary 1;ela.l;\]Oll I)e- 
I;w(;(;l~ l;wo "('oiil;elil." words, \[7)r iliSl.a.ii(:e I)ei;weeN 
/,wo N in &ll N \]) N SCqll011cc. 
2.4: lProni (:h:m(:nl;ary ('ont;(:xts I;o word 
( ".\].,}It .',4 S ( ~. S 
'l'lle i:esull;iug collo(:~lJons a,e tout, rolled I)y IJie 
synl.acl;ic felaJ;ionshit)s sl, rll('t;llrillg l, he l)a.r.'q(, t;l'ees, 
which is liOl. t;lie case For wi n(low--t)ased a.i)l)roaches 
((Jhurch a.nd lla.liks, 199())~ ev(;n wh(;n lJley use 
i)a.l:l,-oP-Sl)cech la.I)els (Sniasljn, 1993; I)a.ille, 1994). 
Ill l;he ("Xallll)\]( % .qaitr:hc is llOf i:elaJ,ed 1;o s/,<:ltosc', 
as il. does iiol~ liiOtti\[3~ this noun. 
'l'hc eleiiieii/a.ry l;rces I<~d 1,o oh>sos of syiiDI.C- 
(;ic COllt;C:;',.:Lq. I"or hlsl;a, ll(:e, frOl\[l {;h(; /£ec' corl'(;- 
s|)(:)ll(\[ill~ i;o s/,v/tosc ,~{~'#'~'c, t;wo (:lasses o1' (:Oll(,exl;s 
aye crca.lxxl. 'l'hc Ih'sl, ()tie, <s/cltos,': ~, iu which 
sl;a.nds Ior t, he lfiw)t, word, conl.a.ins serf+, whereas 
l.\[i(; second o11(~> N ."7C;P7'C~ (:Oiii;a.illS S{C'ItO.S<. kl; 
t;he end o1" l;he SilUl)lilic;tl, iou process, I, ll(;s(; classes 
ha,re I)(:ei~ cOUll)lelied ~licl olJi(;r oiler; (:rea l;e(I. VVe 
(-laim t,h~l, th(' s(,inant;i(', similaril,y I)elween two 
lcxical enl, ries is in i)l:Ol)orl;ioli wii;h I;lie mlillt)er 
of sha, red (:Olll;(:xl,s, \[,hi: insl;mlc(', in ol,, of ore' 
(:orl)ora. , ,s/,e~tosu ,'.;ha r(;s 8 conlie×l,s wit, h l(szom 
In order I,o get, ~ glohal vision of the similm:- 
il.ies relyi,g on elenient, ary conl.exD;, a. gi'ad)h is 
C, Olill)lil;c:(\]. Tim WOl~(ls CO\[lSl;il;llt;,:; l, hc IIO(Ics. A 
link corresl~onds 1.o a. Cel:l;&ili lliiliil)er oF shared 
c.oni;exl;s (a<:c.ordill~ l,O ~t. chosen I.hreshold). The 
edges are labclle, d wiiJi l, he sha.red coiit;cxls. The 
sl;l:oiigly colineclx;(I c.oinponeill~.s ~I a.nd t;hc cliques '/ 
a.l'('~ conil)ul.ed a.s woll, ~s t.hcy ~re l;he tiiosi; t'(;l(; 
Va.lll; l)a.rl,s oF {tie gra.i)h ~ oil i,opologica.I ~lX)lilidS, 
'\['lie un(l('.l:lying illl;liil;ion is l;h a,l; a~ COiliieclA;d (:Olll- 
\])Otleli/; I'C'\]itl,(:',S lil:':igjhl)ori/igj words (llollSC\]/ and 
Savit;ch, \]9!)5) m~d I, hal, the cliques tend l,o iso- 
Ial;c ,<dmih~i:il;y cla,ssc's. An ext;rm::t of a connc'ci;ed 
conll)onenl; , wil;h 3 as a, threshold, a,l)pears in lig- 
u r(; 3. 
s'\]'he sub-graphs hi which l.here is ~t 1)aLh I)cl.ween 
every pair of (lisl>hicl; liO(1CS. 
rThc sul><~ra, phs in wlfich l;here is a palJl I)et;wee\]l 
each lto(le and eve>r?/ olhcr noch: of l;he graph. 
3 Results 
3.1. Two corpora 
We haw; l.esl;ed olir niei.,hod on I.wo i;echnicM 
niedium-size col:pora.. The fii:sl; ()li(;> i;he INII-- 
cleaJ: '\[}x:hliOlogjy (}Ol.\])tls (N'I'C) of EI)I", is of 
a,I)olll; ,52,000 words. 'l'he second one, I,he (k)i;<) 
n~u;y Medicine (JOrl)US (CM(7), is of a.I)ou{, ($(), 000 
words. It was buill; for t, he l,;urol)ca.u M I'\]N El,AS 
t)rojccl. (Zweigenl)a, ilni, 19{)/I) and is used For 1)ilol. 
sl,udies in l;erminology exLra.clk)n s. 
3.2 A vlsnal lliap of {:OIICOp{s lUld 
relationships 
I@en if iio onl;ology (:~u/ I)c \['ully aJIl;o\]iiat, R:a.ily 
derived \[;l:Olil a. ('orl)iis (llaJ)erl; ;tll(\[ Na.zarelll,:o, 
1996), IJle ,gY( JI,A I)1,\] gra.I)hs ('AI.II I)e iise(I I.o I)()oi.- 
slma I) i.he I)ilihting of l, he onl;olog;y o(' a. dolila.iu. 
'l'he SY(',I,AI)I,; ii(fl, work gives a <glol>a,I view 
over t,hc COrl).S which etmhles {m all, ernal;e 
i)a, ra,digniaJ;ic a, nd sylil;agul~l;ic exl)lora.l, iou of I,he 
cont;cxl; o\[' a word. 'l'hc gl;al)h (;nat)les 1;o idenl,ify 
I;lic concel)t,s , I hcir possit)lc t, yl)icaJ I)rOl)erl, ies, a, lld 
also t, he rcla, l;ionshil)s b(;I;weCll 1;he selecl,cd COil- 
Cel)l,s. 
Tim cliques I)ring ()ill; sitia.ll i)ara.diglllai.ic scl.s 
of \['orins which, ill a. tirsl, sl.et) > Ca.ll Im iuLer 
I)relx;d as onl;ologh:M classes rellecl;ing coliCelfl..~. 
'l'he a.rc lal)Ns l.ticn help Ix) retilie I.llosc chlsses 
t)y acldiu S sOlile of the Sllrl'Olilidhl~ words whicli 
axe li()l, pa.i'l; ()\[' t;lie clkluc bul. which ileverllie-- 
le~s sha,r(; the iiIOSl; siguifica.nl; or SOllle Siiliila.r 
<Ollbexl,8. 1@0111 the clique {sl,+e~,o,sc, b.<Uos b ob- 
sl, r~ml, ion, altcinb:} (of. fig. 3), one ca, t\] build l lie 
cla.ss of all'eel;ions which arc Io(:al;ed in l, he I)odv as 
{Idam.:, occ1.<~7o., s/..~.Js<~, Ic,~7o., <:.l<:{li~:ali<.,, 
ob,~'l, rl,:l, ion, aZl, c'inl, c}. Siinila.rly, from l, he gt'al)h 
o\[' I;he (~'M(7 corpus, Oile (:a.li i(leni,i\['y l.ll~ classes 
of body" ~ii, cs { artcrc, I.'anchc, rcs+sa~l, "v<'ntri, ~dc, 
intc',rve, nlriculairc, cft'roli(l~,} , o(" diseases { 'malmli+ , 
arth, crvsclcrose} and oF chirul'gica/ m:ts { l)o*ll,(~gc, 
rcvasc.ularisatio'n, angioplastic}. 
Olic(; l.\]ieS(; (;Oli(:('pts axe identilied, t, hei r i)rolJei;- 
{,i('~S Ca, fl I)C lisl, e(l, t)y i llt;erl)l:el, ing ~ I;he l ld)cls of I lie 
links, 'l'he al;t;ri hu I~(, of the localizaJJon of l,h(' aJl'ec- 
l;ions is descril)ed l;\]irough three, kinds oF u:lodiliers 
(lig. ',/): ,io,,n,~ (,-, (t<'. { a,'Z,.~,'<~ ', t,','<,,u:)), ,,;,i,os <' ~,,:- 
lyrics (~ d(; {ca'rotTd<', #tl, crventrTculaTre} aud a(l- 
,iectivcs rela, l;cd to ;~ q)('(:ific aa'l;ery (~ {coro#utiru, 
co'ronaricn, diaqonal, ci'lvonfl<~:(;}). 'l'he a l, i;i:iblll;e 
(legr(:e of (;lie a, fl'ecl, ion is a, lso reveaJed IJlrougjh 
{~ si,qm{/ical, if, n<m-,siqnificati.l; severe, 7m, l)Orl.a.l., 
s<;veriZc} . 
"(41roupe '\['erniinologie el; lnl;elligence Ari;iticielle, 
I ~ I{C--(_I I )171. | ul;elligcnce A r@icicltc, (7 NI {S 
493 
etude~ ~evaluatlon 
a t~.alyse 
calcul 
etudeS5 
b N ~ essai 
analyse 
Figure 4: Polysemy of etude 
Last, relationships between concepts can be ex- 
tracted, such as the"part-of" relation between 
tronc and artere, and segment and artere (fig. 3). 
3.3 Distinguishing word meanings 
Polysemy and quasi-synonymy often makes the 
ontological reading of linguistic data difficult. 
However, through cliques and edge labels, the 
SYCLADE structured and documented map of 
the words helps to capture the word meaning level. 
Among a set of connected words where w is sim- 
ilar to wi and wj, cliques bring out coherent sub- 
sets where wi and wj are also similar to each other. 
We argue that the various cliques in which a word 
appears represent different axes of similarity and 
help to identify the different senses of that word. 
For instance, in the whole set of words connected 
to etude (study) in a strongly connected compo- 
nent of the NTC graph (analyse, evaluation, resul- 
tat, presentation, principe, calcul, travail...), some 
subsets form cliques with etude. Two of those 
cliques (resp. a and b in fig. 4 - threshold of 7) 
bring out a concrete and a more theoretical use of 
etude. 
The network also enables to distinguish the uses 
of quasi-synonyms such as eoronaire and coronar- 
ien in the CMC corpus. Even if they are among 
the most similar adjectives (7 shared contexts) 
and if they belong to the same clique {coronaire, 
eoronarien, diagonal, circonflexe}, the fact that 
eoronarien alone is connected to evaluation ad- 
jectives (severe, signifieatif and important) shows 
that they cannot always substitute to each other. 
4 Towards an adequate similarity 
esfimatation for the building of 
ontologies 
The comparison with the similarity score of (Hin- 
dle, 1990) shows that SYCLADE similarity in- 
dicator is specifically relevant for ontology boot- 
strap and tuning. Hindle uses the observed fre- 
quencies within a specific syntactic pattern (sub- 
ject/verb, and verb/object) to derive a cooccu,> 
rence score which is an estimate of mutual infor- 
mation (Church and Hanks, 1990). We adapted 
this score to noun phrase patterns) However the 
similarity measures based on cooccurrence scores 
and nominal phrase patterns are less relevant for 
an ontological analysis. The subgraph of the 
chirurgical acts words, which is easy to identify 
from the SYCLADE graph (fig. 5a), is split in dif- 
ferent parts in the similarity graph (fig. 5b). This 
difference stems from the fact that this cooccur- 
rence score overestimates rare events and under- 
lines the collocations specific to each form. 1° For 
instance, it appears that the relationship between 
stenose and lesion, which was central in figure 
3, with 8 shared contexts, almost diseappears if 
one considers the number of shared cooccurrences. 
Therefore, similarity measures based on cooccur- 
rences and similarity estimation based on shared 
contexts must not be used in place of each other. 
As opposed to Hindle's lists of similar words 
which are centered on pivot words whose neigh- 
bors are all on the same level, in SYCLADE 
graphs, a word is represented by its role in a whole 
syntactic and conceptual network. The graph en- 
ables to distinguish the various meanings of words, 
a crucial feature in the ontological perspective 
since the meaning level is closer to the concept 
level than the word level. In addition, the results 
are clear and more easily interpretable than those 
given by a statistical method, because the reader 
does not have to supply the explanation as to why 
and how the words are similar. 
The building of an ontology, which is a time- 
consuming task and which cannot be achieved 
automatically, can nevertheless be guided. The 
SYCLADE graphs based on shared contexts can 
facilitate this process. 
9For instance, for Na PN2 
CoocNi,N~ : log 2 ~~ 
where f(NIPN2) is the fi'equency of noun N1 occur- 
ring with N2 in a noun preposition pattern, f(N1) is 
the frequency of NI as head of any N1PN,~ sequence 
and f(N2) the frequency of N2 in modifier/argument 
position of auy N~PN2 sequence and k is the count 
of NxPN v elementary trees in the corpus. COOCNAda 
and CooeAd~N are similarly defined. 
1°The various cooccurrence scores retrieve sets of 
collocations which are sharply different fi'om the con- 
texts shown by SYCLADE connected components. 
The coll6cations which get the greatest cooccurrence 
scores seem to characterize medecine phraseology (fac- 
teur (de) risque, milieu hospitalier) but not the coro- 
nary diseases as such. 
494 
pontage angioplastie ~ artere \/ 
revascul~risation 
pontage angloplastle 
I her ed~tC~ionl~y!!: sme pont 
artere stenose 
l,'igure 5: Similarity among the chirurgical act 
family 
Acknowledgments 
We ~hank (\]hristian 3aequemin (IRIN), Di- 
dier Bourigault, Marie-Luce Herviou, Jean- 
David Sta (DER EDF), Marie-tl~51~ne Can- 
dito (TAI,ANA) and Sophie Aslanides (ELI) 
for their remarks on a previous version of 
this l)aper. We are very gratefid to Serge 
Heiden (ELI), who has developed G~aphX 
(ftp://mycroft. ens-f c:l.. fr/pub/graphx/), 
I;hc graph interactive handling software that en- 
abled us to visualize and handle the SYCLADI,3 
graphs. 
References 
I/.oberto Basili, Maria-Teresa Pazienza, and Paola 
Velardi. 1994. A "not-so-shMlow" parser for 
collocadonal analysis. In Proceedings of Col- 
ing'94, pages 447 453. 
Peter A. Bensch and Walter 3. Saviteh. 1995. An 
occurrence-based model of word categorization. 
Annuals of Mathematics and Artificial Intelli- 
gence, 14:1 16. 
I)idier Bonrigmdt. :1993. An endogenous corpus- 
based method for structural noun phrase dis- 
ambiguation. In 6th Euwpcan Chapter of thc 
Association Jbr Computational Linguistics. 
Kenneth W. Church ~md Patrick Hanks. 1990. 
Word association norms, mutual information, 
and lexicography. Computational Linguistics, 
16(1) :22 -29, march. 
Bdatrice Daille. 1994. App~)clw mixtc pour 
l'cxtraclion de tcrminologie : stalistique lexi- 
tale et filtres linguistiques. PhD Thesis, Paris 
VII \[lniversity, february. Supervisor: Laurence 
l)anlos. 
Gregory Grefenstette. 1994. Exploration in Auto- 
marie 7'hesaurus Discovery. Kluwer Academic 
Publishers. 
Ralph Grishman ~md John Sterling. 1994:. Gen- 
eralizing automatyically generated selectional 
patterns. In Proceedings of Coling'94, volume 3, 
pages 742-747, Kyoto. 
Benoit Habert and Adeline Nazarenko. 1996. 
La syntaxe comrne parche-pied de l'acquisition 
des connaissances. In Actes des Journdes 
d'Acquisition des Connaissances, S6te, May. 
Benoit Habert, Philippe 13arbaud, I,'ernande 
Dupuis, and Christian dacquemin. 1995. 
Simplifier des arbres d'analyse pour dd.gager 
les comportements syntaxico-s&nantiqnes des 
formes d'un corpus. Cahiers de Grammaire, (20). 
l)onnald Hindle. 1990. Noun classification from 
predicate-argument structures. In Proceedin.ils 
of the Association for Computational Linguis- 
tics', pages 268 275. 
Douglas P. Metzler and Stephanic W. Haas. 198!). 
The constituent object parser : Syntactic struc- 
ture matching for information retrieval. In l)~w- 
cccdings, 12th Annual International A(\]M 5'l- 
Gll~ Conference on l~csearch and Development 
in hfformation Retrieval (SIGIR'89), pages 
117 126, Cambridge, MA. 
Naomi Sager, Carol Friedman, and Margared 
S. Lyman (editors). 1987. Medical Language 
PTvcessing : Computer Management of Narra- 
tive Data. AddisomWesley. 
I,'ranck Smadja. :1993. Retrieving collocations 
fi'om text: Xtract. Computational Lingui~'iics, 
19(1.):143 177, march. Special Issue on Using 
Large Corpora: \[. 
Alan Smeaton and P. Sheridan. 1991. Using 
morpho-syntaetic language analysis in phrase 
matching, in Proceedings l~lA 0'9 l, pages 415 
429. 
Pierre Zweigenbaum. 1994. MENELAS: an ac- 
cess system for medicM records using natural 
language. Computer Methods and l'rograms in 
Biomedicine, 45:1;17 120. 
495 
