<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-3010">
  <Title>Acquisition of Lexical Information from a Large Textual Italian Corpus</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
2. The Italian Reference Corpus
</SectionTitle>
    <Paragraph position="0"> The corpus (see Zampolli 1988) on which we are now conducting our analysis is being produced by the ILC and an Italian publishing house (see Bindi et al. 1989). The project was begun in 1988. The corpus now contains about 12 million words, and the first goal is to reach 20 million words by the end of 1990. When completed, the corpus will be balanced among journals, novels, manuals, scientific texts, 'grey' literature, etc. The corpus is presently unbalanced, because we first processed and inserted about 8 million words from journals, newspapers, magazines, etc., while we are now inserting data from novels and from the scientific and technical literature.</Paragraph>
    <Paragraph position="1"> The present study is conducted on the first section of the corpus, but we obviously intend to extend the analysis to the other sections as soon as they become available.</Paragraph>
    <Paragraph position="2"> We describe two types of quantitative analyses whose aim is to extract information on: a) the strength of association between two words; b) fixed phrases or idioms.</Paragraph>
    <Paragraph position="3"> 3. The strength of association within word-pairs
As regards the first point, we have used the method of measuring the association ratio between two words as described by Church and Hanks (1989). The value of the association ratio reflects the strength of the bond between the two words taken into account. The method is very simple.</Paragraph>
    <Paragraph position="4"> The association ratio between any two words x and y appearing together in a window of five words in the corpus is based on the concept of "mutual information", defined as: $I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$</Paragraph>
    <Paragraph position="6"> where P is the probability. We refer to Church and Hanks (1989, pp. 77-78) for a detailed explanation of the formula and of how the association ratio slightly differs from it, given that we are more interested here in the linguistic and lexicographic evaluation of the numerical results deriving from its application.</Paragraph>
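As an illustration only, the following Python sketch estimates a mutual-information-style score, log2 of P(x, y) over P(x) P(y), for ordered word-pairs in a window of five words. The function name, the use of plain relative frequencies as probability estimates, and the absence of any window-size normalization are assumptions of this sketch; it does not reproduce the exact association ratio of Church and Hanks (1989).

```python
import math
from collections import Counter


def association_ratios(tokens, window=5):
    """Return {(x, y): log2(P(x, y) / (P(x) * P(y)))} for ordered pairs.

    A pair (x, y) is counted each time y follows x inside the window,
    i.e. with 0 to window - 2 words between them. Probabilities are
    plain relative frequencies over the token count (an assumption).
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, x in enumerate(tokens):
        # The window holds x plus the next (window - 1) words.
        for y in tokens[i + 1 : i + window]:
            pair_freq[(x, y)] += 1
    return {
        (x, y): math.log2((f_xy / n) / ((word_freq[x] / n) * (word_freq[y] / n)))
        for (x, y), f_xy in pair_freq.items()
    }
```

Two words that each occur once, and only together, in a text of N tokens receive the score log2(N), which is why rare pairs that always co-occur rank at the top of the ordering.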
    <Paragraph position="7"> In addition to this we have introduced the measurement of the so-called dispersion, in order to obtain, linked to the association ratio, quantitative information on the distribution of the second word of the word-pair in the selected window. We wanted in fact to complement the simple notion of frequency for a word-pair with that of frequency stability or dispersion, i.e. to add to the frequency a measure of how it is distributed over the different positions of the window. In this way we evaluate how uniformly the frequency of the second word is spread over the considered span. We have used the formula described in detail in Bortolini et al. (1971, pp. 23-31), even though it is used here for different purposes.</Paragraph>
    <Paragraph position="8"> We give some examples in Table 1, where f(x,y) is the frequency of occurrence of words x and y together and in this order in a window of 5 words, gap is the number of words between x and y (if gap = 0 then x and y are immediately adjacent), f(x) and f(y) are the frequencies of occurrence of x and y independently in the corpus, ass.ratio is the result of applying the formula to x and y, and dispersion measures how the second word is distributed within the considered window.</Paragraph>
    <Paragraph position="9"> This last piece of information is very useful not only for highlighting words belonging to fixed phrases, but especially for highlighting syntactic relationships. If the dispersion is 0 or near to 0, all or most of the occurrences of the second word are concentrated in the same position. This means that the position and distance of the two words are always the same, and it is therefore a strong indicator of "fixed phrases" or "compounds" with no internal variation. When, vice versa, its value approaches 1, y is almost equally distributed over the four positions of the considered span. Thus, the combination of a not very high (but above a certain level) ass.ratio with dispersion values near to 1 is more typical of syntactic types of collocations, giving e.g. information on prepositional government.</Paragraph>
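The behaviour described above (0 when all occurrences fall in one position, 1 when they are spread evenly over the four positions) can be sketched in Python as follows. The formula used here is a Juilland-style dispersion, D = 1 - V / sqrt(n - 1), chosen because it has exactly these two end points; whether it matches the formula of Bortolini et al. (1971) in every detail is an assumption, and the function name is hypothetical.

```python
import math


def dispersion(position_counts):
    """Juilland-style dispersion of a frequency over equal-sized slots.

    Assumed form: D = 1 - V / sqrt(n - 1), where V is the coefficient of
    variation (population standard deviation divided by the mean) and n
    is the number of slots. D is 0 when every occurrence falls in one
    slot and 1 when the occurrences are spread evenly.
    """
    n = len(position_counts)
    total = sum(position_counts)
    if n == 1 or total == 0:
        return 0.0  # dispersion is not meaningful in these cases
    mean = total / n
    variance = sum((c - mean) ** 2 for c in position_counts) / n
    return 1.0 - (math.sqrt(variance) / mean) / math.sqrt(n - 1)


# Counts of the second word at gap 0, 1, 2, 3 after the first word.
fixed_phrase = dispersion([8, 0, 0, 0])  # always adjacent
free_collocation = dispersion([2, 2, 2, 2])  # evenly spread
```

With these inputs the first value is 0 (up to floating-point rounding) and the second is 1, matching the two extreme cases in the text.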
    <Paragraph position="10"> We wish to highlight here some of the results achieved by the application of these statistical measures to the Italian corpus, and mainly to evaluate their linguistic relevance.</Paragraph>
    <Paragraph position="11">  From the present corpus of 8,032,667 occurrences (tokens) and 178,811 different word-forms (types), we obtained 26,473,263 word-pairs (tokens) in a window of 5 words (and not 32,000,000, as the window is not extended beyond any strong punctuation mark) and 8,716,446 different word-pairs (types). After discarding all the pairs with f(x,y) &lt; 4, because they were too rare and of no linguistic relevance, 787,878 word-pairs were obtained, which were eventually reduced to 322,718 after eliminating those with association ratio &lt; 3 (the pairs seem to be linguistically irrelevant below this level).</Paragraph>
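The figures above follow from the windowing rule: an unrestricted window of 5 would yield roughly four pairs per token (8,032,667 × 4, about 32 million), and stopping the window at strong punctuation lowers the count. A minimal Python sketch of this extraction and of the two cut-offs is given below. The set of marks treated as strong punctuation and both function names are assumptions of the sketch, not details taken from the paper.

```python
from collections import Counter

# Assumed set of strong punctuation marks; the paper does not list them.
STRONG_PUNCTUATION = {".", "!", "?", ";", ":"}


def count_pairs(tokens, window=5):
    """Count ordered pairs (x, y) with y inside the window after x,
    never letting a window cross a strong punctuation mark."""
    pair_freq = Counter()
    segment = []
    for token in list(tokens) + ["."]:  # sentinel closes the last segment
        if token in STRONG_PUNCTUATION:
            for i, x in enumerate(segment):
                for y in segment[i + 1 : i + window]:
                    pair_freq[(x, y)] += 1
            segment = []
        else:
            segment.append(token)
    return pair_freq


def keep_relevant(pair_freq, ratios, min_freq=4, min_ratio=3.0):
    """Apply the two cut-offs from the text: drop pairs whose f(x, y) is
    below 4 and pairs whose association ratio is below 3."""
    return {
        pair: f
        for pair, f in pair_freq.items()
        if f >= min_freq and ratios.get(pair, float("-inf")) >= min_ratio
    }
```

Here `ratios` is expected to map each pair to its association ratio, however that ratio was computed.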
    <Paragraph position="12"> We must also recall that the data to which we have applied our measures are articles from many different types of newspapers, journals, etc. - i.e. many short texts -, so that there is no bias towards clustering tendencies of words such as could appear in longer texts, like entire novels.</Paragraph>
    <Paragraph position="13"> If we order the word-pairs by decreasing value of the association ratio, and examine the types of word combinations appearing in the different positions of the file, we observe a different typology of word combinations according to the different levels: i) at the top; ii) in the center; iii) towards the lowest values of interest, where the threshold for Italian seems to be a little higher than for English, i.e. around 3.5; iv) below this significant value, down to the few negative values.</Paragraph>
    <Paragraph position="14"> For example at the top, i.e. with very high values (ranging from 22.93 to about 15), we find the following categories of word-pairs:
- proper nouns, titles, etc. (e.g. Oci Ciornie 20.6, Cyrano Bergerac 20.1, Montgomery Clift 20.1, Ursula Andress 19.9);
- foreign (usually English) compound words or fixed phrases (value added 19.8, pax Christi 17.7, teen ager 17.7, drug administration 17.3);
- Italian compounds of words belonging to specialized languages, which almost never occur in everyday language (bismuto colloidale 20.1, tomografia assiale 19.8, marmitte catalitiche 19.6, nitrato ammonio 19.5, accoppiatore acustico 17.5);
- co-occurring technical words, which again appear very rarely (laringiti tracheiti 20.3, idrologia climatologia 20.2, capperi cetriolini 19.6, prefetti questori 18.5, antisettiche antispasmodiche 17.8);
- fixed phrases or idioms whose component words are not frequent in ordinary language (volente nolente 20.6, specchietto allodole 18.8, bla bla 18.00, batter ciglio 17.2, cartoni animati 16.5, spron battuto 15.5);
- modification relations between low-frequency Adjectives and Nouns (sostantivi plurali 19.9, forbicine affilate 18.4, gradazione alcolica 18.1, giubbotti antiproiettile 17.4, salmone affumicato 17.1);
- modification relations between Noun and Noun of a PP, both of low frequency (cartina tornasole 18.3, filetti alici 17.7, siepi bosso 15.9, spicchio aglio 15.5).</Paragraph>
    <Paragraph position="15"> These word-pairs share the following properties: both the words are of very low frequency, and almost always appear only together in the same context.</Paragraph>
    <Paragraph position="16"> The characteristics of the different types of combinations appearing within the other ranges of the association ratio value, i.e. from ii) to iv) above (for example, at the value levels when more specific grammatical/syntactic information appears), are very different and present quite interesting properties.</Paragraph>
    <Paragraph position="17"> Thus, we have observed how the measure of the association ratio gives quantitative/statistical evidence of a number of lexical, syntactic and semantic relationships between word-pairs. These relationships are essential for codification in an LDB, and cannot be achieved with the same "objectiveness", and certainly not to the same extent, by other means such as e.g. lexicographers' intuition.</Paragraph>
    <Paragraph position="18"> Among the syntactic relationships, particularly relevant are the data regarding the prepositions marking the different arguments of verbs, adjectives and nouns, together with their relative frequency. This is very important information to be inserted in the LDB (especially for Italian), given that we have no dictionary source for this type of complementation comparable, for example, to the Longman dictionary for English. Other syntactic data concern the type of sentential complementation, mainly for the verbs.</Paragraph>
    <Paragraph position="19"> We notice, for example, that in all their inflections the verb rischiare &lt;to risk&gt; and the noun rischio only subcategorize for the preposition di &lt;of&gt;; the same holds for the adjective capace &lt;able&gt;. This information simply confirms their only possible prepositional complementizer. The verb pensare &lt;to think&gt; is found with a, che, come, di &lt;to, that, how, of&gt;, i.e. with all its theoretical possibilities of prepositional and sentential government, while parlare &lt;to speak&gt; is more frequently associated only with con, di &lt;with, about&gt;, and not with a &lt;to&gt;, which should be found in principle. Dividere is mostly associated with con, da, in &lt;with, from, in&gt;, and not with tra &lt;among&gt;. These quantitative data can be associated with the different subcategorization frames and can be helpful for complementation rules, to decide on ambiguous attachments of PPs.</Paragraph>
    <Paragraph position="20"> As a next step, we are trying to correlate the different complementation patterns shown by some word-pairs with other lexical information (found in the environment of these first word-pairs) which can be used as a clue for semantic disambiguation. For example, if we take the word-pairs dividere con, dividere da, dividere in, we must look at the surrounding context and see which generalizations can be made at the semantic level for the three types of subcategorization. These may in fact correspond to different word-senses.</Paragraph>
    <Paragraph position="21"> Very useful data of both syntactic and lexical/semantic relevance concern the so-called support verbs (see Gross 1982) for Nouns (usually deverbal or Action nouns) or for Adjectives. We observe for example:
compiere accertamenti 10.8
fare affidamento 8.1
avere accesso 5.3
condurre/effettuare analisi 8.3/7.3
avuto accoglienza 8.0
prendere decisione 9.7
rendere accettabile 8.9
rendere accessibile 9.4
This sort of information on support verbs is of essential importance for language generation (see Mel'cuk, Polguere 1988), and cannot be predicted in any other way, but can only be obtained either by observation or by introspection. The automatic collection of these data is thus an important shortcut towards their extensive coverage in an LDB. Their semantics can be rather easily inferred from the type of support verb (there is a finite list of them) and given by rule.</Paragraph>
    <Paragraph position="22"> Purely semantic data mainly regard typical collocations, e.g. between Adjective and Noun (see below), or between Verb and Adverb, or between Verb and typical Subjects and/or Objects (fondare colonia 11.4, abbassare colesterolo 11.3, distogliere attenzione 10.9, attirare attenzione 10.7, prestare attenzione 10.5, sparare colpo 10.6).</Paragraph>
    <Paragraph position="23"> Interesting data are also found concerning the semantic field of certain words, and obviously words belonging to a fixed phrase. For co-occurrences of Nouns belonging to the same semantic field an example is:
abbigliamento accessori 9.6
chili accessori 9.4
[illegible] accessori 9.3
scarpe accessori 9.0
Examples of fixed and/or idiomatic phrases are:
battuta arresto 11.7 (battuta d'arresto)
polmone acciaio 11.6 (polmone d'acciaio)
primo acchito 10.1 (di primo acchito)
As this method only works on pairs of words, it is clear that we do not generally obtain the whole phrase. It is for this reason that we have developed, especially for this type of data, other quantitative tools which are described in section 4, whose results will supplement those provided by this method.</Paragraph>
    <Paragraph position="24"> A number of different observations can be made for the word-pairs, according to whether they are sorted on the right or the left word. If we examine the left contexts (i.e. if words are ordered on the right), we are more likely to gather information on e.g. the Nouns which are typically modified by a given following Adjective (sorriso accattivante 11.3; luce accecante 10.8; luce accesa 8.7, radio accesa 9.7, colori accesi 10.0, toni accesi 11.2, forno acceso 10.7, fuoco acceso 8.5). If vice versa we examine the right contexts, it is easier to collect data on the Nouns which are typically modified by a given preceding Adjective (costante aumento 7.6, costante contatto 6.4, costante miglioramento 7.9, costante riferimento 7.4, costante temperatura 8.1).</Paragraph>
    <Paragraph position="25"> In the left contexts we again find grouped together data on which Adjectives are typical pre-modifiers of a given Noun (forte accento 8.6, inconfondibile accento 12.0; difficile accesso 5.3, facile accesso 5.7, libero accesso 7.5; buona accoglienza 8.7; amice amore [illegible], buon amore [illegible], eterno amore 7.0, grande amore 5.2, improvviso amore 5.5, ultimo amore [illegible], vecchio amore 3.7, vero amore 3.7), or which types of Nouns are the governors of PPs with a given Noun as head (controllo armamenti 8.9, limitazione armamenti [illegible], riduzione armamenti [illegible], settore armamenti [illegible]). When analyzing the left contexts, we also find high association ratios for certain types of grammatical structures such as: compound verbs (with essere &lt;to be&gt; or avere &lt;to have&gt; as left word), reflexive or intransitive pronominal verbs (with the particle si on the left), reciprocal verbs (with the particles ci, vi), etc. All these types of data are obviously important for the creation of an exhaustive LDB.</Paragraph>
    <Paragraph position="26"> As a final remark we can add that it would certainly be useful to make the same calculations on a tagged (for POS) corpus, in order to obtain relevant information for the lemmas; however, we must observe that different word-forms of the same lemma often present very different properties, both at the grammatical/syntactic level and at the lexical/semantic level. When compacting information for a single lemma we must therefore be careful not to lose data which are relevant to particular inflected forms. This kind of information is again particularly important for practical NLP applications.</Paragraph>
    <Paragraph position="27"> 4. Fixed phrases and idioms
Mainly for the detection of "stereotypes" in texts, we have implemented and are now refining other quantitative/statistical tools not limited to pairs of words.</Paragraph>
    <Paragraph position="28"> In order to collect data concerning specifically fixed phrases or multi-word units, we first calculated the frequency of occurrence in the corpus of all identical couples, triples, and so on, up to seven-word syntagms.</Paragraph>
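Counting identical couples, triples, and so on up to seven-word sequences can be sketched in Python as below. The function name and the in-memory dictionary of counters are illustrative assumptions; a corpus of several million words would probably need a more memory-conscious implementation.

```python
from collections import Counter


def count_ntuples(tokens, min_n=2, max_n=7):
    """Return {n: Counter of n-word tuples} for every n from min_n to max_n."""
    counts = {}
    for n in range(min_n, max_n + 1):
        counts[n] = Counter(
            tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


tokens = "un vero e proprio caso un vero e proprio".split()
counts = count_ntuples(tokens)
triple_freq = counts[3][("vero", "e", "proprio")]
quadruple_freq = counts[4][("un", "vero", "e", "proprio")]
```

In this toy input the triple and the quadruple containing it have the same frequency, which is the kind of evidence that points to the longer sequence as the complete phrase.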
    <Paragraph position="29"> For these data too we calculated the dispersion, and we also calculated the so-called usage. Usage is likewise defined according to Bortolini et al. (1971) as: U = F × D, i.e. Usage equals Frequency multiplied by Dispersion. It is therefore equal to Frequency when the word is uniformly distributed over the different years (and genres), and is equal to 0 when Dispersion is 0, i.e. if all occurrences are concentrated in a single year (or genre). The more uniform the distribution, the closer Usage is to Frequency; it decreases proportionally as Dispersion decreases.</Paragraph>
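A small Python sketch of U = F × D over corpus sections follows. The dispersion used is a Juilland-style measure, D = 1 - V / sqrt(n - 1), assumed here because it is 1 for a uniform distribution and 0 when all occurrences fall in one section; it may differ in detail from the formula of Bortolini et al. (1971). The function name is hypothetical, and the sections are assumed to be of equal size.

```python
import math


def usage(counts_per_section):
    """Return (F, D, U) for a phrase, given its count in each section.

    F is the total frequency, D a Juilland-style dispersion (assumed
    form, equal-sized sections), and U = F * D as stated in the text.
    """
    n = len(counts_per_section)
    frequency = sum(counts_per_section)
    if n == 1 or frequency == 0:
        return frequency, 0.0, 0.0
    mean = frequency / n
    std = math.sqrt(sum((c - mean) ** 2 for c in counts_per_section) / n)
    d = 1.0 - (std / mean) / math.sqrt(n - 1)
    return frequency, d, frequency * d


# Hypothetical counts of one phrase in four yearly sections.
steady = usage([10, 10, 10, 10])  # evenly spread: U equals F
one_year_only = usage([40, 0, 0, 0])  # concentrated: U is about 0
```

Both phrases have the same frequency of 40, but only the evenly distributed one keeps a usage of 40, so sorting by usage pushes phrases confined to a single year or genre towards the bottom.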
    <Paragraph position="30"> In this case dispersion and usage were first calculated on the sections of the corpus which correspond to the 4 years of publication of the journals (from 1985 to 1988), in order to point out, among other things, the appearance (or disappearance) of phrases, compounds, and stereotypes in general. We then compared a subset of all the press data with a subset of novels of analogous size, and again calculated dispersion and usage in order to highlight any differences in the distribution of these fixed phrases between press and novels.</Paragraph>
    <Paragraph position="31"> The data (of the two types) were then sorted in different ways: by alphabetical order of the n-tuples, by frequency of occurrence of the n-tuples, by dispersion, by usage. From each ordering we gather data which can be used in a variety of ways or can highlight different types of phenomena. An example from the beginning of the file of the quadruples ordered by usage (in decreasing order) is found in Table 2 (with figures for dispersion and usage concerning press data only, i.e. the first four columns; the column for Novels, of the same size as each year column, has been inserted in the table from the second comparison just for curiosity).</Paragraph>
    <Paragraph position="32"> The data, i.e. all the n-tuples of different lengths, were also merged in a single file, to highlight the precise length of each given phrase. For example, vero e proprio is in a very high position for its frequency in the set of triples, but the fact that un vero e proprio is also in a very high position in the set of quadruples means that this is the size of the 'true fixed phrase'. Other observations on the linguistic results brought out by this method will be made in the presentation.</Paragraph>
  </Section>
</Paper>