Acquisition of Lexical Information 
from a Large Textual Italian Corpus 
Nicoletta CALZOLARI - Remo BINDI 
Istituto di Linguistica Computazionale del CNR, Pisa 
Dipartimento di Linguistica, Universita" di Pisa 
Via deIla t'aggiola 32 
56100 PISA - ITALY 
e-mNl: GLOTTOLO @ ICNUCEVM.BITNET 
1. l.troduction: LDB requirements 
The creation and development of a large Lexical 
Database (LDB) which, until now, mainly reuses 
the data found in standard Machine Readable 
Dictionaries, has been going on in Pisa for a 
number of years (see Calzolari 1984, 1988, 
Calzolari, Picchi 1988). We are well aware that, in 
order to build a more powel-ful I.DB (or even a 
Le.'dcal Knowledge Base) to be used in different 
ComputationM l.inguistics (CL) applications, types 
of information other than those usually found in 
machine readable dictionaries are urgently needed. 
Different sources of information must therefore be 
exploited if we wemt to overcome the 'qchcal 
bottleneck ~" of Natural l..anguage Processing 
(NIP). 
In a trend which is becoming increasingly 
relevant both m c1 proper and in Literao" and 
Iinguistic Computing, we feel that very interesting 
data ibr our LI)Bs c:m be found b.v processing large 
textuM corpora, where the actual usage of the 
language can be truly investigated. Many research 
projects are nowadays collecting large amounts of 
textuM data, thus providing more and more 
material to be analyzed for descriptions based on 
measurable evidence of how language is actually 
used. We uhhnately aim at integrating lexical data 
extracted from the an',dysis of large textual corpora 
into the I,DB we are implementing. These data 
refer, typically, to: 
i) complementation relations introduced by 
prepositions (e.g. dividere < divide > 
subcategorizes for a PP headed by the preposition 
in < in > ha one sense, and by the preposition fra 
< among > in another sense); 
ii) lexically conditioned modification relations (tena 
macchina potente < powerful car >, un farmaco 
potente <potent medicine> and not /brte 
< strong >, while un cajfe" forte < strong coffee >, 
una moneta forte < strong currency > and not po- 
tente < powerful > ); 
iii) lefically significant collocations (premiere ut~a 
decisione < to take a decision > and not fare z~na 
decisione < to make >, prestare attenzione < to pay 
attention > and not dare < to give > ); 
iv) fixed phrases and idioms I (donna itz carriera, 
dottorato di ricerca, a propo~ito di); 
v) compounds ( tarola calda, ~ave scuo/a). 
All these types of data are a major issue of 
practical relevance, and particularly problematic, in 
many NIP applications in different areas. They 
should therefore be dvcn very lm'ge coverage in any 
useful LDB, and, moreover, should also be 
annotated, in a computerk,'ed lexicon, for the 
pe~inent t)equency information obtained fiom the 
processed corpus, and obviously updated fl"om time 
to time. As a matter of fact, dictionaries now tend 
to encode all the theoreticcd possibilities on a same 
level, but "if e;'e~ possibility in the diction:m, must 
be given equal weight, parsing is very diificult" 
(Church 1988, p.3): they should provide 
infornaation on what is more likely to occur, e.g. 
relative likelihood of alternate pm-ts of speech for a 
word or of ahernate word-senses, both out of 
context and it" possible taking into account 
contextu~d factors. 
Statistical anMyses of linguistic data were very 
popular in the "50s and '60s, mainly, even though 
not only, for literary types of analyses and for 
studies on the lexicon (Guiraud 1959, Muller 1964, 
Moskovich 1977). Stochastic approaches to 
linguistic analyses have been strongly reevaluated 
in the past few years, either for syntactic analysis 
(Gm'side et al. 1987, Church 1988), or for NLP 
applications (Brown et al. 1988), or for semantic 
analysis (Zemik 1989, Smadja 1989). Quantitative 
(not statistical) evidence on e.g. word-sense 
occurrences in a large corpus have been taken into 
account for lexicographic descriptions (Cobuild 
1%7). 
I llere and in the following we have not translated idiomatic phrases and compounds, because there is no point in 
giving the literal translation of the single words. 
54 1 
The claim of this paper is that the above types 
of linguistic information (i-v), to be made available 
for our LDB, can be partially extracted by 
processing and analyzing very large text corpora, 
with quantitativc/st,~zistic methods. 
2. The Italian Reference Corpus 
The corpus (see Zarnpolli 1988) on which we are 
now conducting our analysis is being produced by 
the ILC and an Italian publishing house (see Bindi 
et al. 1989). The project was begun in 1988. The 
corpus now contains about 12 milfion words, and 
the first goal is to reach 20 million words by the end 
of "90. When completed, the corpus will be 
balanced among journals, novels, manuals, 
scientific texts, 'grey' literature, etc. The corpus is 
presently unproportioned, because we first 
processed and inserted up to about 8 million words 
from journals, newspapers, mag~ines, etc., while 
we are nt)w inserting data fi'om novels and from the 
scientific and technical literature. 
The present study is conducted on the first 
section of the corpus, but we obviously intend to 
extend the mlMysis to the other sections as soon as 
they become available. 
We describe two types of quantitative anMyses 
whose aim is to extract information on: 
a) the strength ofassocialion between two words; 
b) fixed phrases or idioms. 
3. The strength of association within word-pairs 
As regards the first point we have used the 
method of measuring thc association ratio between 
two words as described by Church and llanks 
(1989). The value of the association ratio reflects 
the strength of the bond between the two words 
taken into account. The method is very simple. 
The association ratio between any two words x and 
y appearing together in a window of five words in 
the corpus is based on the concept of "mutuM 
information" defined as: 
I (x,y) = log~ 
P (x) P (y) 
where P is the probability. We refer to Church and 
Hank,.~ (1989, pp. 77-78) for a detailed explanation 
of the fommla and of how the association ratio 
slightly differs from it, given that we are more 
interested here in the linguistic m~d lexicographic 
evaluation of the numerical results deriving from its 
application. 
In addition to this we have introduced the 
measurement of the so~called dispersion, in order 
to obtain - linked to the association ratio - quanti~ 
tative information on the distribution of the second 
word of the word-pair in the selected window. We 
wanted in fact to complete the simple frequency 
notion for a word-pair with that of frequency 
stability or dispersion, i.e. to add to the frequency 
a measure of how it is distributed over the different 
positions of the window. In this way we evaluate 
the uniformity of repartition of frequency of the 
second word over the considered span. We have 
used the formula described in detail in Bortolini et 
al. (1971, pp. 23-31), even though used here for 
different purposes. 
We gj've some ex,'unples in Table 1, where fix,y) 
is the frequency of occurrence of words x and y 
together and in this order in a window of 5 words, 
gap is equal to the number of words between x and 
y (if gap = 0 then x and y are immediately adjacent), 
f(x) and f(y) are the frequencies of occu;rence of x 
and y independently in the corpus, ass.ratio is the 
result of application of the formula to x at~d y, 
dispersion calculates how tile second word is 
distributed within the considered window. 
This last information is very useflfl not only to 
evidence words belongi'ng to fixed phrases, but 
especially while trying to evidence syntactic 
relationships. If the dispersion is 0 or ncm" to !), all 
or most of tile occurrences of tile second word are 
concentrated in the sameposition. This means that 
the position and distance of the two words is 
always the same, and it is theretbre a strong 
measure for evidencing "fixed phrases" or 
"compounds" with no variation inside. When 
viceversa its value approaches 1, y is almost equally 
distributed in the four positions of the considered 
span. Thus, the combination of a not very lfigh 
(but above a certain level) ass.ratio with dispersion 
values near to 1 is more typical of syntactic types 
of collocations, giving e.g. information on 
prepositional government. 
We wish to highlight here some of the results 
achieved by the application of these statistical 
measures to the Italian corpus, and mainly to 
evaluate their linguistic relevance. 
Table 1. 
f(x,y) gap = 0 gap = 1 gap = 2 gap = 3 
Stall Uniti 2047 2042 0 
punto vista 832 0 831 
opinione pubblica 272 272 0 
presto a 33 14 9 
spendere per 36 8 8 
e" arnbizioso 20 5 5 
~x) 
0 5 5850 
0 1 4396 
0 0 657 
6 4 85 
9 11 183 
5 5 115476 
f(y) ass.ratio dispersion 
2159 10.34 0.003 
1974 9.58 0.002 
1315 11.30 0.000 
120969 4.68 0.736 
101862 3.95 0.921 
123 3.49 1.000 
2 55 
From the present corpus of 8,032,667 
occurrences (tokens) and 178,811 different word- 
forms (types), we obtained 26,473,263 word-pairs 
(tokens) in a window of 5 words (and not 
32,000,000, as the window is not extended beyond 
any strong punctuation mark) and 8,716,446 
different word-pairs (types). After discarding all the 
pairs with f(x,y) < 4, because they were too rare and 
of no linguistic relevance, 787,878 word-pairs were 
obtained, which were eventually reduced to 322,718 
after eliminating those with association ratio < 3 
(the pairs seem to be linguistically irrelevant below 
this level). 
We must also recall that the data to which we 
have applied our measures are articles from many 
different types of newspapers, journals, etc. - i.e. 
many short texts -, so that there is no bias towards 
clustering tendencies of words such as could appear 
in longer texts, like entire novels. 
If we order the word-pairs by decreasing value 
of the association ratio, and examine the types of 
word combinations appearing in the different 
positions of the file, we observe a different typology 
of word combinations according to the different 
levels: 
i) at the top; 
ii) in the center; 
iii) towards the lower interesting values, which 
for Italian seems to be a little higher than for 
English, i.e. around 35; 
iv) below this significant value, until reaching the 
few negative values. 
For example at the top, i.e. with very high values 
(ranging from 22.93 to about 15), we find the 
following categories of word-pairs: 
- proper nouns, titles, etc. (e.g. Oci Ciornie 20.6, 
Cvrano Bergerac 20.1, Montgomery Clift 20.1, 
Ursula Andress 19.9); 
foreign (usually English) compound words or 
fixed phrases (value added 19.8, pax Christi 17.7, 
teen ager 17.7, drug administration 17.3); 
Italian compounds of words belonging to 
specialized languages, which almost never occur in 
eve~day language (bismuto colloidale 20.1, 
tornografia assiale 19.8, marmitte catalitiche 19.6, 
nitrato amrnonio 19.5, accoppiatore acu_~tico 17.5); 
- co-occurring technical words, which again appear 
very rarely ( laringiti traeheiti 20.3, idrologia 
climatologia 20.2, capperi cetriolini 19.6, prefetti 
questori 18.5, antisettiche antispasmodiche 17.8); 
- fixed phrases or idioms whose component words 
are not frequent in ordinary language (volente 
nolente 20.6, specchietto allodole 18.8, bla bla 18.00, 
batter ciglio 17.2, cartoni animati 16.5, spron battuto 
15.5); 
modification relations between low" frequency 
Adjectives and Nouns (sostantivi plurali 19.9, fbr- 
bicine affilate 18.4, gradazione alco\[ica 18.1, giub~ 
botti antiproiettile 17.4, salmone affumicato 17.1); 
-. modification relations between Noun and Noun 
of a PP, both of tow frequency (cartina tornasole 
18.3, filetti alici 17.7, siepi bosso 15.9, spicchio aglio 
15.5). 
These word-pairs share the following properties: 
both the words are of very low frequency, and 
almost always appear only together in the same 
context. 
The characteristics of the different types of 
combinations appearing within the other ranges of 
the association ratio value, i.e. from ii) to iv) above 
(for example, at the value levels when more specific 
grammatical/syntactic information appears), are 
very different and present quite interesting 
properties. 
Thus, we have observed how the measure of the 
association ratio gives quantitative/statistical 
evidence to a number of lexical, syntactic and 
semantic relationships between word-pairs. These 
relationships are essential for codification in an 
LDB, and cannot be actfieved with the same 
"objectiveness", and certainly not to the same 
extent, by other means such as e.g. le~cographers' 
intuition. 
iMnong the syntactic relationships, particularly 
relevant is the data which regards the prepositions 
marking the different arguments of verbs, adjectives 
and nouns, together with their relative frequency. 
This is very important hfformation to be inserted 
in the LDB (especially of Italian), provided we have 
no dictionary source for this type of 
complementation as for example the Lon~nan 
dictionary for Emglish. Other syntactic data 
concern the type of sentential complementation, 
mainly for the verbs. 
We notice, for example, that in M1 their 
iIfflections the verb rischiare < to risk > and the 
noun rischio only subcategorize for the preposition 
di < of>; the same holds for the adjective capace 
< able >. Tiffs infom~ation is sinll?ly a 
confirmation of their only possible prepositional 
complementizer. The verb pensare < to think > is 
found with a, che, come, di < to, that, how, of>, 
i.e. with all its theoretictd possibilities of 
prepositional and sentential government, while 
parlare < to speak > is more frequently associated 
only with con, di < with, about >, and not with a 
< to >, which should be found in principle. DM- 
dere is mostly associated with con, da, bl < with, 
from, in>, and not with Ira < among>. These 
quantitative data can be associated to the different 
subcategorization frames and can be helpful for 
cornplementation rules, to decide on ambiguous 
attachments of PPs. 
As a next step, we are trying to con'elate the 
different eomplementation patterns evidenced by 
some word-pairs with other lexical information 
(fbund in the environment of these th'st word-pairs) 
which can be used as a clue for semantic 
disambiguation. For example, if we take the 
word-pairs dividere con, dividere da, dividere in, we 
must look at the surrounding context and see which 
generalizations can be done at the semantic level for 
56 3 
the three types of subcategorization. These may in 
fact correspond to different word-senses. 
Vmy useful data of both syntactic and 
lexical/semantic relevance concern the so-called 
support verbs (see Gross 1982) for Nouns (usually 
deverbal or Action nouns) or for Adjectives. WE 
observe for example: 
compiere accertamenti /0.8 
fare, afjTdamento 8.1 
ayere (lcee.~so 5.3 
condwre:'e/fetmare analisi 8.3/7.3 
avuto accoglienza 8.0 
prendere decisione 9.7 
rendere accettabile 8.9 
re:~dere accessibile 9.4 
This sort of intonnation on support verbs is of 
cssential importance for language generation (see 
Mel'cuk, Polguere 1988), and cannot be predicted 
in may other way, but can only be given either by 
observation or by introspection. The automatic 
collection of these data is thu:s an impovtm:t 
shortcut towards their extensive coverage in a I,DB. 
Their se.mantics cm~ be rather easih' inferred by the 
type of support verb (there is a iinite list of them) 
and given by rule. 
Purely semantic data mainly regard typical 
collocations, e.g. between Adjective and Noun (see 
below), or between Verb and :\dverb, or between 
\:art) and Lvpical Subjects and :or Objects (flondare 
cohmia /I.4, abbas:are co{esterolo \]1.3, di~toggere 
attet~.Tiotze 10.9, attirare aztet;2iotze /0.7, pre.~tare 
attenzione 10.5, sparo" co"po 10.6). 
Interesting dma are also found concerning the 
semantic field of ccrtain wo,ds, and obvioush.' 
words bel,mging to a fixed phrase. For co- 
occurrences of ?',ouns bclongina, to the same 
semantic field an exmnple is: 
abbig/iamemo acces.~ori 9.6 
chili acce.ssori 9.4 
her.u? aeces.~ori 9.3 
:carpe accessori 9.0 
Examples of fixed and or idiomatic phrases ::re: 
battuta arresto 11.7 (t)attuta d'arresto) 
pohnone acetate II.6 (pohnone d'acciaio) 
primo acchito 10.1 (di primo acchito) 
As this method is only used to work on couples of 
words, it is clear that we do not generally obtain the 
whole phrase. It is for this reason that we have 
developed, especially for this type of data, other 
quantitative tools which are described in section 4, 
whose results will supplcmcnt those providcd by 
this method. 
A number of different observations can be made 
for the word-pairs, according to whether they are 
sorted on the right or the left word. If we examine 
the left contexts (i.e. if' words arc ordered on tt,e 
right), we arc more likely to gather information on 
e.g. the Nouns which are typically modified by" a 
given tbllowing Adjective (sotriso accattivat2te \]13; 
luce accecante 10.8; h¢ce accesa 8.7, radio aceeaa 
9.7, co/ori aecesi 10.0, toni accesi 1/.2, forno acce:o 
10.7, ji~oco acceso 8.5). If vice-versa we examine 
the right contexts, it is easier to collect data on the 
Nouns which are typically modified by a given 
preceding Adjective (costante aumee~to 7.6. co.~lante 
contalto 6.4, costante miglioramemo 7.9, co.~tan\[e 
riferitnenlo 7.4, costante tet~peratura 8.1). 
In the left contexts again we find together data 
which regard which Adjectives m'e typical pre- 
modifiers of a given Noun (forte aecel2zo 8.6, 
inconfimdibile aecento 12.0," difficile accesso 5.3, 
facile accesso 5.7, libero accesso 7.5," buena acco- 
glienza 8.7; amice amore 4.& 1)uo:2 amo;e 3.,1. 
elerrzo amore 7.0, grande amo,,'e 5.2. im?ro~q%o 
amore 5..5, u#imo amore ,f.4, vecchio amore 3.7. ;'ere 
amore 3.7), or which types of Nouns arc the 
governors of PPs with a given Noun as head (co,> 
troI/o armamenti 8.9, limitazione armamet?ri /1.,?. 
riduzione armamenti 9.t, settore atw:a.,?ze~m ~ 0.9). 
When analyzhag the left contexts, we also find 
high association ratios for certain types :~!" 
gammatieal structures such as: comD.,ur~d \e.,bs 
(with essere < to be > or re'ere ," to have ;> as k'ft 
word), rcfle.,dve or intransitive propomina! -orbs 
(with tile par'icle .vi on tile left), reciprt~ca:! verb.,. 
{with the particles ci, ri), etc. A!t ',hose types of 
data are obviously iml,Onant for the c:cazio> <:~.~' ::n 
e×haustive LDB. 
As a final remark we can add that it would 
ce:ainlv be useihl to make the same ca!cul:,,tic,::s 
on a tagged (for POS) corpus, in order to ob~ait: 
relevant inf0nnation for the Iemmas; however, we 
rnust observe that different word-forms c,f the same 
~. OIYlbllid\[ (.3 ! ILtl len-ana often present very different ~ '- "' 
properties, both at the grammatical syntactic level 
and at the lexieal'.,,cmantic level. \Vh~;n compacting 
m, lommtic.n f0r a sin~c lcmma we must therefore 
be carclhl not to lose data wlfich are relevant to 
particular inflected forms. This kind of information 
is again particularly' importmu tbr practical NI~P 
applications. 
4. Fixed phrases and idioms 
Mairdy for the detection of "stereotypes" in texts 
we have implemented and are now refining other 
quantitative/statistical tools not limited to couples 
of words. 
In order to collect data concerning specifically 
fixed phrases or multi-word units, we first 
calculated the frequency of occunence in the 
corpus of all identical couples, triples, and so on, 
up to seven-word syntagqns. 
Also for this data wc calculated the di~;persio:~, 
and we also cMculated the so-called useNe. Also 
usage is defined according to Bortolini ct al. (1971) 
as: U = FI), i.e. \[;sage equal to Frequent> by 
4 57 
Table 2. 
'85 "86 '87 '88 Total l)isp. 17sage 
per la prima volta 136 119 111 123 489 0.96 468.13 
dal punto di vista 64 76 93 77 310 0.92 286.20 
in gusto il mondo 73 78 60 66 277 0.94 26l .22 
un vero e proprio 43 23 21 25 112 0.82 9\[.74 
(Novels) 
102 
21 
2 
29 
Dispersion. It is therefore equal to Frequency 
when the word is uniformly distributed in the 
different years (and genres), m:d is equal to 0 when 
Dispersion is 0, i.e. if all occurrences were 
concentrated in a sin~e year (or genre). L'sage is 
as nearer to Frequency as much the distribution is 
uniform, and decreases proportionally while 
I)ispersion is decreasing.. 
In this case dispersion and usage werc first 
calculated on the sections of the corpus which refer 
to the 4 3'cars of publication of the journals (from 
1985 to 1988), in c, rder to point out, :-~_mong others, 
the appearance (or disappearance) of phrases, 
compounds, and stereotypes in gcneral. We then 
compared a svbset of all the prcss data with a 
subset of novels of analogous size, and again 
calculated dispersion mid usage in order to evidence 
eventual difference of distributkm of these fixed 
phrases between t'ress and novels. 
'\[he data (of the two types) were then so~cd in 
different ways: by alphabetical order of the n- 
:uples. by frcque::cy oi occurrence of the n-tuples, 
by dispcr.qou, by :.:sage. l:rom each ordering we 
anther data ~hiei: can be used in a variety of ways 
or can evidence different bpes of phenomena. ,-~n 
cx~implc at :he bcginniug of the filc of the 
quad:u:'!cs e'.dcrcd b\ ~:'..,:-ge (in decreasing order) 
is I\mnd in Table 2 (\~ith iigurcs ior dispcrsion and 
usage c,::i\ cop.coming press data, i.e. the first four 
columns: the ,:elu,/v.n for Novels, of the s~m:e size 
:is each )-era" cc, lumn. has been inserted in the table 
from the second comparison just for curiosity). 
Fhc data i.e. :dI the n-tuples of different lengths, 
were aiso :nerged in a single file, to evidence the 
precise length of each Wen phrase. For exa_mple, 
veto e proyrio is in a ,'rex high position for its 
frequency in the set of tripies, but the fact that un 
veto e/~roprio is also in a very high position in the 
set of quadruples memas that this is the size of the 
'true fixed phrase'. Other observations on the 
linguistic results evidenced by this method will be 
made in the presentation. 
5. Final remarks 
In the next months we intend to experiment with 
other statistical formulas (e.g. those used by Smadja 
and Choueka) on the corpus (which will 'also 
contain the novels and other types of texts). 
The first stage of the research consists in a careful 
linguistic analysis of the results obtained by the 
different statisticN tools we are now implementing 
and applying. By this analysis performed 
according to different parameters, both from the 
slatistical and linguistic, lexicographic viewpoints - 
we a~im at achieving a twofold objective. On the 
one side we mm at setting up the beginning of a 
sound methodology to semi-automatize the 
extraction of at least part of the relevant 
syntactic:semantic relationships flom the corpus: 
on the other :side we hope we shall be able to build 
a model of the "actuat" modification and 
complementation relations (out of the theoretical 
a-priori possibilities), ">f the "actual" lcxical 
collocations, of the "actual" stcre(~types in the 
Italian language. 
One of the claims of this pro\];:ct i's ttmt the 
linguistic information embodi,ed in all these quite 
different types of lexicaI collocations - c, nc,: they 
have been supplied in a ~ystemati,: vc'tv by a 
computational lexicon which is also ar?p,.ot~:tted" !br 
frequency can be helpful t,.)r !c:,:ic,d 
disambiguation in analysis and cmciaI fbr lexical 
selection in generation. Our method should be 
seen as a strategy to obtain in a semi-automatic 
way, and for a large portion of the lexicen, a 
fi~rmali.,:ation of many of the types of lexical 
relations coded, ior example, in the _MeI'cuk 
lexicon. "Il-,is should be an enhancement both for 
a rnore concr<e and objective lexicography (the 
results will be in fact evaluated in the next months 
in a true lexicographic environment), and tbr a 
more comprehensive :rod q\]ata-based" li::g:dstics. 
Acknmvledgment 
We wish to th:mk our Referees for useful 
comments :rod suggestions, and A. Zampo!li for 
helpful discussions. 

References 

Bindi, R., M..Menachini, P. Orsolini, (lq8%, 
"Italian Reference Corpus", lI.C-'lI.N-3, Pisa. 

Bortolini, U., C. Tagliavini, A. Zampolli, (t971), 
Lessico di Frequenza della LiINua ltalia~Ta 
Contemporanea, Garzanti, Nlilano. 

Brown, P., Cocke, J., Della Pietra, S., Della Pietra, 
V., Jelinek, F., Mercer, R., Roossin, P., (1988), 
"A statisticN approach to language translation", 
Proceedings oJ" the 27th b2ternational Cotference 
on Computafio~zal Linguistics, Budapest. 

Calzolari, N., (1984), "Detecting patterns in a 
I~exical Database", Proceedings of the /Otk 
International Conference on Computational 
Linguistics, Stanford (CA), 170-173. 

Calz.olm-i, N., (1988), ':Fhe dictionary and the 
thesaurus can be combined", in M. Evens (ed.), 
Relational Models of the Lexicon, Studies in 
Natural I,anguage Processing, Cambridge 
University Press, Cambridge, 75-96. 

Calzolari, N., (1989), '%exical Databases ,and 
Textual Corpora: perspectives of integration for 
a Lexical Knowledge Base", Proceedings of the 
/st International Lexical Acquisition Workshop, 
1)etroit, Michigan. 

Calzolari, N., E. Picchi, (1988), "Acquisition of 
semantic intbrmation from ml on-line 
dictionary", Proceedings of the I2th 
International Conference on Computational 
l.inguistics, Budapest, I Iungary, 87-92. 

Choueka, Y., (1988), "I.ooking for needles in a 
haystack", Proceedings of the RIAO, 609-623. 

Church, K., (1988), "e\ stochastic parts program 
and noun phrase parser tbr unrestricted text", 
Proceedings of the 2nd A CL Conference on 
Applied Yatural Langucc¢e Processing. 

Church K., P. Ihmks, (1989), "%Vord association 
norms, mutual intbrrnation and lexicography", 
Proceedings of the 27th Annual Meeting of the 
Association for Computational Linguistics, 
Vancouver, British Columbia, 76-83. 

Cobuild, (1987), Collins Cobuild English Language 
Dictionary, Collins, Glasgow. 

Garside, R., Leech, G., Sampson, G., (1987), The 
Computational Analysis of English - a corpus 
based approach, Lonmnan, l.ondon. 

Gross, M., (1982), "On the notion of support verb", 
semhmr at the Simon Fraser University, B.C. 
Canada. 

Guiraud, P., (1959), Problemes et methodes de la 
statistique linguistique, D.Reidel, I)ordrecht. 

Ilindlc, D., (1989), "Acquiring disambiguation rules 
from text", Proceedings of the 27th Annual 
Meeting of the Association for Computational 
Linguistics, Vancouver, British Columbia, 
118-125. 

Mel'cuk, I., Polgmere, A., (1988), "A formal Icxicon 
in Memfing-Text theory", Computational 
Linguistics, 13(3-4). 

Muller, Ch., (1964), Essai de Statistique Lexica/e, 
Klincksiek, Paris. 

Muller, Ch., (1965), "Frequencc, dispersion et 
usage: apropos des dictionnaires de fiequence", 
Cahiers de, Lexicologie, I1, 32-42. 

Moskovitch, W.A., (1977), "Polyscmy in natural 
and artificial (phmned) languages", S,~///., 
Journal of Linguistic Cak'u/us, Skriptor, 1, 5-2g. 

Smadja, J., (1989), "Macrocoding the lexicon with 
co-occurrence knowledge", Proceeding.s" q: z/:e 
\]qrst International l.exical :1 cqui.ritio:z 
Workshop, Detroit, Michigan. 

Webster, M., M. Marcus, (1989), ":\uto:natic 
acquisition of the lexical semantics of verbs 
from sentence frames", Proceedings O~ dw 2 ~':/t 
Annual Meeting o/" the Association _/br 
Computational l.i:~guistics, Vmacouver, lh-i~ish 
Columbia, 177-184. 

Zarnpolli, A., (1988), "Progetto Strategic, .Uetodi 
e strumenti per l:lndustria delle Li~zgue ::ella 
collaborazione internaAonale", I LC- C N R, Pica. 

Zemik, U., (1989), "Paradigms in lexical 
acquisition", Proceedinsds o J" the \['ir.~t 
International Lexica/ .,i cquifftion I Vorks/',o/), 
Detroit, Michigan. 
