Study and Implementation of Combined Techniques for 
Automatic Extraction of Terminology 
Bdatrice Daille 
TALANA 
University Paris 7 
Case 7003 
2, Place Jussieu 
F-75251 Paris Cedex 05 
France 
daille@linguist, j ussieu, fr 
Abstract 
This paper presents an original method and its imple- 
mentation to extract terminology from corpora by com- 
bining linguistic filters and statistical methods. Start- 
ing from a linguistic study of the terms of telecommu- 
nication domain, we designed a number of filters which 
enable us to obtain a first selection of sequences that 
may be considered as terms. Various statistical scores 
are applied to this selection and results are evaluated. 
This method has been applied to French and to English, 
but this paper deals only with French. 
Introduction 
A terminology bank contains the vocabulary of a tech- 
nical domain: terms, which refer to its concepts. 
Building a terminological bank requires a lot of time 
and both linguistic and technical knowledge. The is- 
sue, at stake, is the automatic extraction of termi- 
nology of a specific domain from a corpus. Cur- 
rent research on extracting terminology uses either lin- 
guistic specifications or statistical approaches. Con- 
cerning the former, \[Bouriganlt, 1992\] has proposed 
a program which extracts automatically from a cor- 
pus sequences of lexical units whose morphosyntax 
characterizes maximal technical noun phrases. This 
list of sequences is given to a terminologist to be 
checked. For the latter, several works (\[Lafon, 1984\], 
\[Church and Hanks, 1990\], \[Calzolari and Bindi, 1990\], 
\[Smadja and McKeown, 1990\]) have shown that statis- 
tical scores are useful to extract collocations from cor- 
pora. The main problem with one or the other approach 
is the "noise": indeed, morphosyntactic criteria are not 
sufficient to isolate terms, and collocations extracted 
thanks to statistical methods belong to various types of 
associations: functional, semantical, thematical or un- 
characterizable ones. 
Our goal is to use statistical scores for extracting tech- 
nical compounds only and to forget about the other 
types of collocations. We proceed in two steps: first, 
apply a linguistic filter which selects candidates from 
the corpus; then, apply statistical scores to rank these 
candidates and select the scores which fit our purpos(~ 
best, in other words scores that concentrate their high 
values to terms and their low values to co-occurrcuccs 
which are not terms. 
Linguistic Data 
In a first part, we therefore study the linguistic specifi- 
cations on the nature of terms in the technical domain 
of telecommunications for French. Then, taking into ac- 
count these linguistics results, we present the method 
and the program which extracts andcounts the candi- 
date terms. 
Linguistic specifications 
Terms are mainly multi-word units of nominal type that 
could be characterized by a range of morphological, syn- 
tactic or semantic properties. The main property of 
nominal terms is the morphosyntactic one: its str,c- 
ture belongs to well-known morphosyntactic structures 
such asN ADJ, N1 de N2, etc. that have been studied by 
\[Mathieu-Colas, 1988\] for French. Some graphic indica- 
tions (hyphen), morphological indications (restrictious 
in flexion) and syntactic ones (absence of determiners) 
could also be good clues that a noun phrase is a term. 
We have also employed a semantic criteria: the criterion 
of unique referent. A term refers to an unique and uni- 
versal concept. However, it is not obvious to apply this 
criterion to a technical domain where we are not expert. 
So, we have interpreted the criterion of unique referent 
by the one of unique translation. A French term is al- 
ways identically translated, mostly by a compound or 
a simple noun in English. We have extracted mare,ally 
terms following these criteria from our bilingual corpus, 
29 
available in French and English, the Satellite Commu- 
nication Handbook (SCH) containing 200 000 words in 
,,ach language. Then, we have classified terms following 
their lengths; the length of a term is defined as the num- 
b,,r of main items it contains. 1 From this classification, 
it at)p('ars that terms of length 2 are by far the most 
frcquenl, ones. As statistical methods ask for a good 
rcl)rcsentatiou in number of the samples, we decided to 
extract in a first round only terms of length 2 that we 
will call base-term which matched a list of previously 
determined patterns: 
N ADJ station terrienne (Earth station) 
NI de (D~T) N2 zone de eouverture (coverage zone) 
Nt h (DET) N2 rdflecteur d grille (grid reflector) 
Nt PREP N~ liaison par satellite (satellite link) 
Ni N2 diode tunnel (tunnel diode) 
Of course, terms exist whose length is greater than 2. 
But the majority of terms of length greater than 2 are 
created recursively from base-terms. We have distin- 
guished three operations that lead to a term of length 3 
from a term of length 1 or 2: "overcomposition", modifi- 
catio,, and coordination. We illustrate now these opera- 
tions with a few examples where the base-terms appear 
inside brackets: 
I. Ovcrcomposltion 
Two kinds of overcomposition have been pointed out: 
ow'rcomposition by juxtaposition and overcomposi- 
tion by substitution. 
(a) .I u xtaposition 
A term obtained by juxtaposition is built with at 
least one base-term whose structure will not be al- 
tered. The example below illustrate the juxtaposi- 
tion of a base-term and a simple noun: 
Ni PREPI \[N2 PREP2 N3\] modulation par \[d~placement de phase\] (\[phase 
shift,\] keying) 
(b) Substitution 
(living a base-term, one of its main item is sub- 
stituted by a base-term whose head is this main 
item. For example, in the N1 PREP1 N2 structure, 
N1 is subtituted by a base-term of N1 PREP2 N3 
structure to create a term of N1 PREPs N3 PREP1 
N~ structure: 
rdscau d satellites + rgseau de transit ~ rgseau de 
transit h satellites (satellite transit network). 
We notice in the above example that the structure 
of r~seau h satellites (satellite network) is altered. 
2. Modification 
Modifiers that could generate a new term from a base- 
term appear either inside or after it. 
(a) Insertion of modifiers 
Adjectives and adverbs are the current modifiers 
' Main items are nouns, adjectives, adverbs, etc. Neither 
I,r,'l~osil.ions nor deterndners are main items 
that could be inserted inside a base-term structure: 
adjectives in the N1 PREP (DI~T) N2 structure and 
adverbs in the N ADJ one: 
liaisons multiples par satellite (multiple \[satellite 
links\]) 
rdseaux enti/~rement numdriques (all \[digital net- works\]) 
(b) Post-modifieation 
Adjectives and adverbial prepositionnal phrase of 
PREP ADJ N are the main modifiers that lead to the 
creation of new terms: post-adjectives can modify 
any kind of base-terms; for example, \[station ter- 
rienne\] brouilleuse (interfering \[earth(-)station\]). 
Adverbial prepositional phrases modify either sim- 
ple nouns or base-terms2: amplificateur(s) \[d 
faible bruit\] (flow noise\] amplifier(s)), \[interface(s) 
usager-rgseau\] \[~i usage multiple\] (\[multipurpose\] 
\[user-network interface(s)\]). 
3. Coordination 
Coordination is a rather complex syntactic phe- 
nomenon (term coordination have been studied in 
\[Jacquemin, 1991\]) and seldom generates new terms. 
Let us examine a rare example of a term of length 3 
obtained by coordination : 
N1 de Ns + N2 de N3 --+ N1 et N~ de N3 
assemblage de paquet + dgsassemblage de paquets --r 
assemblage et dgsassemblage de paquets (packet as- 
sembly/desassembly) 
It is difficult to determine whether a modified or over- 
composed base-term is or is not a term. Take for ex- 
ample bande latgrale unique (single side-band): bande 
latgrale (side-band} is a base-term of structure N ADJ 
and unique (single) a very common post-modifier ad- 
jective in French. The fact that bande latgrale unique 
is a term is indicated by the presence of the abbrevia- 
tion BLU (SSB). As abbreviations are not introduced 
for all terms, the right attitude is surely to extract first 
base-terms, i.e. bande latgrale (side-band}. Once you 
have base-terms, you can easily extract from the corpus 
terms of length greater than 2, at least post-modified 
base-terms and overcomposed base-terms by juxtaposi- 
tion. But, even if we have decided to extract only base- 
terms (length 2), we have to take into account their 
variations, at least some of them. Variants of base- 
terms are classified under the following categories: 
1. Graphical and orthographic variants 
By graphical variants, we mean either the use or not 
of capitalized letters (Service national or service na- 
tional ((D/cl)omestic service), or the presence or not 
of an hyphen inside the Ni N2 structure (mode pa- 
quet or mode.paquet (packet(-)mode) ). 
Orthographic variants concern N1 PREP N2 struc- 
ture. For this structure, the number of N2 is gen- 
erally fixed, either singular or plural. However, we 
2In this case, the length of the term is equal to 4 
30 
have encountered some exceptions: rdseau(x) ~ satel- 
lite, rgseaux(x) fi satellites (satellite network(s)). 
2. Morphosyntactic variants 
Morphosyntactic variants refer to the presence or 
not of an article before the N2 in the N1 PREP N~ 
structure: ligne d'abonng, lignes de l'abonng (sub- 
scriber lines), to the optional character of the prepo- 
sition: tension hdlice, tension d'hdlice (helix voltage) 
and to synonymy relation between two base-terms 
of different structures: for example N ADJ and N1 d 
N2: rgseau commutd, rgseau d commutation (switched 
network) 
3. Elliptical variants 
A base-term of length 2 could be called up by an 
elliptic form: for example: ddbit which is used instead 
of dgbit binaire (bit rate). 
After this linguistic investigation, we decide to concen- 
trate on terms of length 2 (base-terms) which seem 
by far the most frequent ones. Moreover, the major- 
ity of terms whose length is greater than 2 are built 
from base-terms. A statistical approach requires a good 
sampling that base-terms provide. To filter base-terms 
from the corpus, we use their morphosyntaetic struc- 
tures. For this task, we need a tagged corpus where 
each item comes with its part-of-speech and its lemma. 
The part-of-speech is used to filter and the lemma to 
obtain an optimal sampling. We have use the stochas- 
tic tagger and the lemmatizer of the Scientific Center of 
IBM-France developed by the speech recognition team 
(\[Ddrouault, 1985\] and \[E1-B~ze, 19931). 
Linguistic filters 
We now face a choice: we can either isolate collocations 
using statistics and then apply linguistic filters, or ap- 
ply linguistic filters and then statistics. It is the latter 
strategy that has been adopted: indeed, the former asks 
for the use of a window of an arbitrary size; if you take 
a small window size, you will miss a lot of occurrences, 
mainly morphosyntactic variants, base-terms modified 
by an inserted modifier, very frequent in French, and 
coordinated base-terms; if you take a longer one, you 
will obtain occurrences that do not refer to the same 
conceptual entity, a lot of ill-formed sequences which do 
not characterizes terms, and moreover wrong frequency 
counts as several short sequences are masked by only 
one long sequence. Using first linguistic filters based on 
part-of-speech tags appears as the best solution. More- 
over, as patterns that characterizes base-terms can be 
described by regular expressions, the use of finite au- 
tomata seems a natural way to extract and count the 
occurrences of the candidate base-terms. 
The frequency counts of the occurrences of the candi- 
date terms are crucial as they are the parameters of 
the statistical scores. A wrong frequency count im- 
plies wrong or not relevant values of statistical scores. 
The objective is to optimize the count of base-terms 
occurrences and to minimize the count of incorrect oc- 
currences. Graphical, orthographic and xnorpho.sy.t;w- 
tic variants of base-terms (except synomymic varbmt,~) 
are taken into account as well as some syntactic vari- 
ations that affect the base-terms structure: coordhm- 
tion and insertion of modifiers. Coordimttion of two 
base-terms rarely leads to the creation of a new tcrnt 
of length greater than 2, so it is reasonable to thi.k 
that the sequence gquipements de modulation et d,' d, ~- 
modulation (modulation and demodulation cqaipmvnls) 
is equivalent to the sequence gquipement ,h. modul.tt,m 
et dquipement de d~modulation (modulation equipment 
and demodulation equipment). Insertion of moditicrs in- 
side a base-term structure does not raise problem, ex- 
pect when this modifier is an adjective inserted inside a 
N1 PREP N2 structure. Let us examine the sequence an- 
tenne parabolique de rdception (parabolic receiving an- 
tenna), this sequence could be a term of length 3 (ob- 
tained either by over-composition or by modification) 
or a modified base-term, namely antenne de rgception 
modified by the inserted adjective parabolique. On one 
hand, we don't want to extract terms of length greater 
than 2, but on the other hand, it is not possible to 
ignore adjective insertion. So, we have chosen to ac- 
cept insertion of adjective inside N1 PREP N~ structure. 
This choice implies the extraction of terms of length 3 
of N 1 ADJ PREP N2 structure that are considered as 
terms of length 2. However, such cases are rare and 
the majority of N1 ADJ PREP N2 sequences refer to a 
N1 PREP N2 base-term modified by an adjective. 
Each occurrence of a base-terms is counted equally; we 
consider that there is equiprobability of the term ap- 
pearance in the corpus. The occurrences of morpholog- 
ical sequences which characterize base-terms are classi- 
fied under pairs: a pair is composed of two main items 
in a fixed order and collects all the sequences where 
the two lemmas of the pair appear in one of the al- 
lowed morphosyntactic patterns; for example, the se- 
quences: ligne d'abonnd, lignes de l'abonnd (subscriber 
lines), ligne numgrique d'abonnd (digital subscriber line) 
are each one occurrence of the pair (llgne, abonn~). 
If we have the coordinated sequence lignes et services 
d'abonnd (subscriber lines and services), we count one 
occurrence for the pair (ligne, abonnd) and one occur- 
fence for the pair (service, abonnd). Our program scans 
the corpus and counts and extracts collocations whose 
syntax characterizes base-terms. Under each pair, we 
find all the different occurrences found with their fre- 
quencies and their location in the corpus (file, sentence, 
item). This program runs fast: for example, it took 2 
minutes to extract 8 000 pairs from our corpu s SCH 
(200 000 words) for the structure Nx de (DI,:T) N~ on a 
Sparc station ELC (SS1) under Sun-Os Release 4. ! .3. 
Now that we have obtained a set of pairs, each pair rep- 
resenting a candidate term, we apply statistical scores 
in order to distinguish terms from non-terms among the 
candidates. 
31 
Ratio 
number of good 
pairs divided 
by total number 
of an equivalence 
class (50 pairs) 
1 
0I S 
,1 
I I 
p F 
2 3 ... 30 223 
Pairs sorted accordin to fr__g...~,._ 
Figure 1: Frequency histogram 
Lexical Statistics 
The problem to solve now is to discover which sta- 
tistical score is the best to isolate terms among our 
list of Candidates. So, we compute several measures: 
frequencies, association criteria, Shannon diversity and 
distance scores. All these measures could not be used 
for the same purpose: frequencies are the parameters 
of the association criteria, association criteria propose 
a conceptual sort of the couples, and Shannon diversity 
an<i distance measures are not discriminatory scores but 
provide other types of informations. 
Frequencies and Association criteria 
From a statistical point of view, the two lemmas of a 
pair could be considered as two qualitative variables 
whose link has to be tested. A contingency table is 
defined for each pair (Li, Lj): 
I II Lj I Lj, with j' # j 
Li, with i' ¢ i c d 
where: 
a stands for the frequency of pairs involving both Li 
and Lj, 
b stands for the frequency of pairs involving Li and Lj,, 
c stands for tile frequency of pairs involving Li, and Lj, 
d stands for the frequency of pairs involving Li, and 
The statistical literature proposes many scores which 
can be used to test the strength of the bond be- 
tween the two variables of a contingency table. Some 
are well-known such as the association ratio, close 
to the concept of mutual information, introduced by 
\[Church and Hanks, 1990\]: 
a 
IM = log~ (a+b)Ca+c) (1) 
the O 2 coefficient introduced by 
\[Gale and Church, 1991\]: 
02 = (ad- be) 2 
(a + b)(a + c)(b + c)(b + d) (2) 
or the Loglike coefficient introduced by 
\[Dunning, 1993\]: 
Loglike = a loga + blogb + clogc + dlogd 
-( a + b)log(a + b) - ( a + c) logCa -I- c) 
-(b + d) logCb + d) - ( c + d) logCc + d) 
+(a+b+c+d)logCa+b+c+d) (3) 
A property of these scores is that their values increase 
with the strength of the bond of the lemmas. We have 
tried out several scores (more than ten) including IM, 
• 2 and Loglike and we have sorted the pairs following 
the score value. Each score proposes a conceptual sort 
of the pairs. This sort, however, could put at the top 
of the list compounds that belong to general language 
rather than to the telecommunication domain. As we 
want to obtain a list of telecommunication terms, it is 
32 
Pairs of Ni (PREP (VET)) N2 
structure 
(largeur, bande) 
(tempt6rature, bruit) 
(bande, base) 
(amplificateur, puissance) 
(temps, propagation) 
(r~glement, radiocommunication) 
(produit, intermodulation) 
(taux, erreur) 
(miss, ~uvre) 
(tdldcommunication, satellite) 
(bilan, liaison) 
The most frequent pair sequence 
largeur de bands (197) 
(bandwidth) 
tempdrature de bruit (110) 
(noise temperature) 
bande de base (142) 
(baseband) 
amplifieateur(s) de puissance (137) 
(power amplifier) 
temps de propagation (93) 
(propagation delay) 
r~glement des radiocommunications (60) 
(radio regulation) 
produit(s) d'intermodulation (61) 
( intermodulation product) 
taux d' erreur (70) 
(error ratio) 
raise en omvre (47) 
(implementation) 
t~l~communication(s) par satellite (88) 
(satellite communication(s) ) 
bilanfs) de liaison (37) 
(link budget) 
Figure 2: Topmost pairs 
I Logl Nbc IM 
1328 223 5.74 
777 126 6.18 
745 145 5.52 
728 137 5.66 
612 94 6.69 
521 60 8.14 
458 61 7.45 
420 70 6.35 
355 47 7.49 ! 
353 99 4.09 
344 55 6.42 
essential to evaluate the correlation between the score 
values and the pairs and to find out which scores are 
the best to extract terminology. Therefore, we compare 
the values obtained for each score to a reference list of 
the domain. We have used the terminology data bank 
of the EEC, telecommunication section, which has been 
elaborated by experts. This evaluation has been done 
for 2 200 French pairs3of N1 de (DEW) N 2 structure ex- 
tracted from our corpus SCH (200 000 words). Each 
score provides as a result a list where the candidates 
are sorted following the score value. We have defined 
equivalence classes which generally collect 50 succes- 
sive pairs of the list. The results of a score are rep- 
resented graphically thanks to an histogram in which 
the x-axis represents the pairs sorted according to the 
score value, and y-axis the ratio of the number of pairs 
belonging to the reference list divided by the number 
of pairs per equivalence class, i.e. generally 50 pairs. If 
all the pairs of an equivalence class belong to the ref- 
erence list, we obtain the maximum ratio of 1; if none 
of the pairs appear in the reference list, the minimum 
ratio of 0 is reached. The ideal score should assign its 
high values (resp. low) to good (resp. bad) pairs, i.e. 
candidates which belong (resp. which don't belong) to 
the reference list. In other words, the histogram of the 
ideal score should assign to equivalence classes contain- 
ing the high values (resp. low values) of the score a ratio 
close to 1 (resp. 0). We are not going to present here 
all the histograms obtained (see \[Daille, 1994\]). All of 
3Only pairs which appear at least twice in the corpus 
have been retained. 
them show a general growing trend that confirm that 
the score values increase with the strength of the bond 
of the \]emma. However, the growth is more or less clear, 
with more or less sharp variations. The most beautiSd 
histogram is the simple frequency of the pair (see Fig- 
ure 1). This histogram shows that more frequent tile 
pair is, the more likely the pair is a term. Frequency is 
the most significant score to detect terms of a technical 
domain. This results contradicts numerous results of 
lexical resources, which claim that association criteria 
are more significant than frequency: for example, all the 
most frequent pairs whose terminological status is un- 
doubted share low values of association ratio (formula 
1) as for example rdseau d satellites (satellite network} 
IM=2.57, liaison par satellite (satellite link) IM=2.72, 
circuit tglgphonique (telephone circuit )IM=3.32, sta- 
tion spatiale (space station) IM=l.17 etc. Tile remain- 
ing problem with the sort proposed by frequency is that 
it integrates very quickly bad candidates, i.e. pairs 
which are not terms. So, we have preferred to elect the 
Loglike coefficient (formula 3) the best score. Indeed, 
Loglike coefficient which is a real statistical test, takes 
into account the pair frequency but accepts very little 
noise for high values. To give an element of comparison, 
the first bad candidate with frequency for the general 
pattern N1 (PREP (DEW)) N2 is the pair (cas, transmis- 
sion) which appears in 56th place; this pair, which is 
also the first bad candidate with Loglike, appears in 
176th place. We give in figure 2 the topmost 11 french 
pairs sorted by the Loglike coefficient (Logl) (Nbc is 
the number of the pair occurrences and IM the value of 
33 
association ratio). 
Diversity 
Diversity has been introduced by \[Shannon, 1948\] and 
chara,~terizes the marginal distribution of the lemma of 
a pair through the range of pairs. Its computation uses 
a contingency table of length n: we give below as an 
,xample the contingency table which is associated to 
the pairs of N ADJ structure: 
progressi\] \] porteur L.:2:._l\] Total .~ 
corm't 9 0 
The line counts nbi., which are found in the last column, 
represent the distribution of the adjectives with regards 
to a given noun. The columns counts nb.j, which are 
found on the last line, represent the distribution of the 
.ouns with regards to a given adjective. These distri- 
butio,s arc called "marginal distributions" of the nouns 
and the adjectives for the N ADJ structure. Diversity 
is computed for each lemma appearing in a pair, using 
the fornmla: 
Hi = nbi. log hi. - ~ nblj log nblj (4) 
j=l 
ilj = nb.j log n4 - ~ nbo log nblj 
i=1 
For example, using the contingency table of the N ^vJ 
structure above, diversity of the noun onde is equal to: 
H{onde..) = nb(onde,.) lognb(onde,.) - 
( nb( onde,progre~,i l ) log nb( onde,pr ogre,ai \] ) "k 
nb(onde,porteur) log nb(onde,porteur) + • • .) 
We note H1, diversity of the first lemma of a pair and 
!t2 diversity of the second lemma. We take into account 
the diversity normalized by the number of occurrences 
of the pairs: 
Hi hi -- (5) 
nij 
Hj hj = -- 
nij 
The normalized diversities hi and h2 are defined from 
Ill and H2. 
'l'h~, normalized diversity provides interesting informa- 
tions about the distribution of the pair lemmas in the 
set of pairs. A lemma with a high diversity means that 
it appears in several pairs in equal proportion; con- 
w'rscly, a lemma which appear only in one pair owns 
a zero diversity (minimal value) and this, whatever is 
the frequency of the pair. High values of hi applied to 
the pairs of N ^DJ structure characterizes nouns that 
could l)c seen as key-words of the domain: r#sean (net- 
work), s~gnal, antenne (antenna), satellite. Conversely, 
high values of h~ applied to the pairs of N ADJ structure 
characterizes adjectives which do not take part to base- 
MWVs as n&essaire (necessary), suivant (following), 
important, different (various), tel (such), etc. The pairs 
with a zero diversity on one of their lemma receive high 
values of association ratio and other association crite- 
ria and a non-definite value of Loglike coefficient. How- 
ever, the diversity is more precise because it indicates 
if the two lemmas appear only together as for (ocEan, 
indien) (indian ocean) (Hl=hl=H2=h2=0), or if not, 
which of the two lemmas appear only with the other, as 
for (r&,eau, maill~) (mesh network) (H2=hz=0), where 
the adjective mailM apl:wears only with rdseau or for 
(¢odeur, id&al) (ideal coder) (Hi=hi=0) where the noun 
codeur appears only with the adjective ideal. Other 
examples are: Oh, salomon) (solomon island), (h~lium, 
gazeux) (helium gas), (suppresscur, bzho) (echo suppres- 
sor). These pairs collects many frozen compounds and 
collocations of the current language. In future work, 
we will investigate how to incorporate the nice results 
provided by diversity into an automatic extraction al- 
gorithm. 
Distance Measures 
French base-terms often accept modifications of their 
internal structure as it has been demonstrated previ- 
ously. Each time, an occurrence of a pair is extracted 
and counted, two distances are computed: the num- 
ber of items Dist and the number of main items MDist 
which occur between the two lemmas. Then, for each 
couple, the mean and the variance of the number of 
items and main items are computed. The variance for- 
mula is: 
v(x) = 
= 
The distance measures bring interesting informations 
which concern the morphosyntactic variations of the 
base-terms, but they don't allow to take a decision upon 
the status of term or non-term of a candidate. A pair 
which has no distance variation, whatever is the dis- 
tance, is or is not a term; we give now some examples 
of pairs which have no distance variations and which 
are not terms: paire de signal (a pair of signaO, type 
d'antenne (a type off antenna), organigramme de la fig- 
ure (diagram of the figure), etc. We illustrate below 
how the distance measures allow to attribute to a pair 
its elementary type automatically, for example, either 
N1 N2, N1 PREP N2, N1 PREP DET N2, or Ni ADJ PREP (VET) 
N2 for the general N1 (PREP (VET)) N2 structure. 
1. Pairs with no distance variation V(X) = 0 
(a) N1 N2 : Dist = 2 MDist = 2 
• liaison sdmaphore, liaisons sdmaphores 
(common signalling link(s)) 
• canal support, canaux support, canaux supports 
(bearer channe 0 
34 
(b) N\] pReP N2 : Dist = 3 MDist = 2 
* accusg(s) de rgception 
(acknowledgement of receipt) 
• refroidissement d air, re/roidissement par air 
(cooling by air) 
(c) N1 PREP DF.T N~ : Dist = 4 MDist = 2 
• sensibilitd au bruit (susceptibility to noise) 
• reconnaissance des signaux (signal recognition) 
(d) N1 ADJ PgEP N~ : Dist = 4 MDist = 3 
• rgseau local de lignes, rgseaux locaux de lignes 
(local line network(s)) 
• service fixe par satellite (fixed-satellite service) 
2. Pairs with distance variations V(X) ~ 0 
• (liaison, satellite) 
liaison par satellite, liaisons par satellite 
liaisons (trds rapides + numgriques 4- 
phoniques nationales) par satellite 
liaisons numdriques par satellites 
liaisons satellite 
liaisons entre satellites 
• (ligne,.abonn6) 
ligne d' abonng, lignes d'abonng 
ligne de l'abonng, lignes de l'abonnd 
ligne d'abonnds, lignes des abonngs 
ligne(s) (tdlgphonique(s) + numdriques(s) 
analogique(s)) d'abonnd 
ligne(s) (numgrique(s) + analogique(s)) 
l ' abonng 
lignes et services d'abonng 
t#l~- 
+ 
de 
Conclusion 
We presented a combining approach for automatic term 
extraction. Starting from a first selection of lemma 
pairs representing candidate terms from a morphosyn- 
tactic point of view, we have applied and evaluated sev- 
eral statistical scores. Results were surprising: most 
association criteria (for example, mutual association) 
didn't give good results contrary to frequency. This bad 
behavior of the association criteria could be explained 
by the introduction of linguistic filters. We can no- 
tice anyway that frequency characterizes undoubtedly 
terms, contrary to association criteria which select in 
their high values frozen compounds belonging to gen- 
eral language. However, we preferred to elect the Log- 
like criterion rather than frequency as the best score. 
This latter takes into account frequency of the pairs 
but provide a conceptual sort of high accuracy. Our 
system which uses finite automata allows to increase 
the results of the extraction of lexical resources and to 
demonstrate the efficiency to incorporate linguistics in 
a statistic system. This method has been extended to 
bilingual terminology extraction using aligned corpora 
(\[Daille et al., 1994\]). 
Acknowledgments 
I would like to thank the IBM-France team, aJl<l i. 
particular l~ric Gaussier and Jean-Marc Laug6, Ibr tl,~ 
tagged and lemmatized version of the French corpus all(I 
for their evaluation of statistics; Owen Rainbow f.r r~.- 
viewing. This research was supported by the Eurol)~'at~ 
Commission and IBM-France, I.hrough I.Iw IC1'-10/~;.1 
project,. 

References 
\[Bourigault, 1992\] Didier Bourigault. 1992. Surfa~'c 
grammatical analysis for the extraction of termino- 
logical noun phrases. In Proceedings of the Fourte,'nth 
International Conference on Computational Linguis- 
tics (COLING-9~), Nantes, France. 
\[Calzolari and Bindi, 1990\] Nicoletta Calzolari a,d 
Remo Bindi. 1990. Acquisition of lexical information 
from a large textual Italian corpus. In Proceedings of 
the Thirteenth International Conference on ComPu- 
tational Linguistics, Helsinki, Finland. 
\[Church and Hanks, 1990\] Kenneth Ward Church and 
Patrick Hanks. 1990. Word association norms, mu- 
tual information, and lexicography. Computational 
Linguistics, vol. 16, n ° 1, pp. 22-29. 
\[Daille, 1994\] B6atrice Daille. 1994. Approche mixtc 
pour l'extraction automatique de terminoloqie : 
statistique lexicale et filtres linguistiques. Ph D thesis, 
University Paris 7, France. 
\[Daille et al., 1994\] 1994. Bdatrice Daille, I~;ric 
Gaussier and Jean-Marc Lang6. Towards Automatic 
Extraction of Monolingual and Bilingual Ternfinol- 
ogy. COLING-94, Kyoto, Japon. 
\[D~rouault, 1985\] Anne-Marie Ddrouault. 1985. Mod- 
glisation d'une langue naturelle pour la ddsambigua- 
tion des chaines phondtiques. PhD thesis, University 
Paris VII, France. 
\[Dunning, 1993\] Ted Dunning. 1993. Accurate Meth- 
ods for the Statistics of Surprise and Coincidence. 
Computational Linguistics, vol. 19, n ° 1. 
\[EI-B6ze, 1993\] Marc E1-B/~ze. 1093. Lea Mod- 
dies de Langage Probabilistes : Quelques Domaines 
d'Applications. Habilitation b. diriger des reeherches 
(Thesis required in France to be a professor), Univer- 
sity Paris-Nord, France. 
\[Jacquemin, 1991\] Christian Jacquemin. 1991. 7hans- 
formations des noms composds. PhD thesis, Univer- 
sity Paris 7, France. 
\[La.fon, 1984\] Pierre Lafon. 1984. Ddpouilleme,ts 
et Statistiques en Lexicomdtrie, Gen~ve, Slatkine, 
Champion. 
\[Gale and Church, 1991\] William A.Gale and K~,n- 
neth W.Church. 1991. Concordances for parallel 
texts. In Proceedings of the Seventh Annual Co,re'f- 
ence of the UW Centre for the New OED and 71'.rt 
Research, Usin 9 Corpora, pp. 40-62, Oxford, (J.K. 
36 
\[Mathieu-Colas, 1988\] Michel Mathieu-Colas. 1988. 
Typologie de~ noms compos6s, Technical report n ° 7, 
Paris, Programme de Recherches Coordonndes "In. 
formalique ct Linguistique", University Paris 13. 
\[Slmnnon, 1948\] C. Shannon. 1948. The mathemati- 
cal theory of communication. Bell Systems Technical 
.Journal, 27. 
\[Smadja and McKeown, 1990\] Frank A. Smadja and 
Kathleen R. McKeown. 1990. Automatically extract- 
i llg ~md representing collocations for language gener- 
ation. In : Proceedings of the ~,Sth Annual Meeting 
of. the Association for Computational Linguistics, pp. 
252-259, Pittsburgh. 
