Toward,q Automatic Extraction of Monolingual and Bilingual 
q?erminology 
B(atricc I)AILLI';* - l::,'ic OAUSSIEll - J(.'ali-Mal(; LANCI;; 
(*)TALANA Univ. Pa,'is 7 k IBM-France, See 3099, 75592 Paris Cedcx 12 
Emaih j ml((.0vnet.ibm.conl 
Abstract 
In this paper, we make use of linguistic 
knowledge to identify certain noun phrases, 
both in English and French, which are likely 
to be terms. We then test and cmnl)are (lifl'e.r- 
ent statistical scores to select the "good" ones 
among tile candidate terms, and finally propose 
a statistical method to build correspondences of 
multi-words units across languages. 
Acknowledgement 
Most of this work was carried out under 
project EUII.OTP~A ET-10/63, co-sponsored by 
the European Economic Conmmnity. 
1 Introduction 
A technical lexicon consists of simple as well 
as complex lexical units. Complex lexical units 
are mainly compounds, commonly character- 
ized by a set of morphosyntactic and semantic 
criteria (see for example \[llenveniste, 1966\] for 
Dench and \[tlatcher, 1960\] for English). The 
constitution of a list of technical terms from a 
specific domain is an important issue in Nat- 
urM Language Processing, for example in ap- 
plications such as Machine Translation. Sev- 
eral methods for the extraction of terms have 
been proposed, relying on linguistic or statistical 
approaches. To build a monolingual terminol- 
ogy bank, \[Bourigault, 1{)92\] has implemented a 
software package which provides it list of likely 
terminologicM units, based on linguistic spec- 
ifications. Different statistical methods have 
Mso been tested to extract collocations from 
large corpora, as \[Church and lianks, 1990, 
Smadja aim McKeown, 1990\]. Our work makes 
use of both linguistic and statistical knowledge. 
The outline of this paper is the following: lirst, 
given linguistic specifh:ations of English and 
l;'rencli terms, we select noun phrases which are 
likely ta be terms, and which we will call can- 
didate terms. To build a monolingual termi- 
nology bank, we then apply different statistical 
scores to select the good ones among the candi- 
date terms. Finally, we. test statistical ineth- 
ods to obtMn ;~ list of bilingual terms, using 
aligned corpora in l"cench and Ex@ish. All work 
and tests are carried out on a parallel bilin- 
gual corpus that is tagged by the words' part- 
of-speech and lemma. This corpus consists of 
Mmut 20(I,000 words for each language in the 
tieht of telecommunications. Simil,~r work can 
be found in \[Van der Eijk, 1993\]. 
2 Linguistic specifications of 
French and English terms 
2.1 Basic multi-words units 
From an investigation of the cori)us, and 
from the tenninologists' point of view (for exam- 
pie \[Nkwenti-Azeh, 1992\]), it appears that most 
of tile terms encountered in technical tiehls are 
noun phrases corresponding to ~t limited num- 
I}er of syntactic patterns. These patterns detine 
the morphosyntactic structures of multi-words 
units (MWU) which are likely to be terms. We 
detine the length of a MWU as the number of 
main items it contains. A main item is either a 
noun, an adjective, a verb or an adverb; thus, 
neither deternliners nor prepositions are consid- 
ered main items. MWU of length 1 often corre- 
spond to words connected with a hyphen or an 
apostrophe, and are easy to identify. We will not 
deal with them in this paper. Nor are we con- 
cerned with mono-word terms (e.g. telematics), 
the extraction of which is a different -possibly 
515 
516 
snore complex- problem. We present below the 
patterns retained for MWU of length 2, which 
we will refer to as base MWU. For each pattern 
we give an example followed by its translation 
in parentheses. The interpretation of the abbre- 
viations is strMgthforward. 
French 
• N Adj 
orbite gdostationnaire 
(geostationary orbit) 
• N1 N2 
diode tunnel (tunnel diode) 
• N1 de (det) N2 
bande de frdquenee (frequency band) 
• N1 prep (det) N2 
assignation e't la demande 
(demand-assignment) 
English 
• AdjN 
multiple access (ace,s multiple) 
• N1 N2 
data transmission 
(transmission de donndes) 
2.2 Operations on base MWU 
Although MWU of length greater than 2 do 
exist, they are most of the time built from base 
MWU by one of the following operations, en- 
countered both in French and English (the base 
MWU is bracketed in the examples): 
• overcomposition: 
rdgdngration des \[lobes latdraux\] 
\[side lobe\] regrowth 
• modification: 
\[station terrienne\] brouilleuse 
interfering \[earth(-)station\] 
. coordination: 
assemblage et dgsassemblage de paquets 
packet assembly/disassembly 
0vercomposition and modification do not 
always preserve the base-MWU structure: in 
French, an adjective modifier is often inserted 
inside a base-MWU of Igl prop /q2 structure; 
for example, national (domestic) cast be inserted 
inside rdseau fi satellites (satellite network) to 
produce r'dseau national it satellites (domestic 
satellite network) (notice that in English, the 
adjective precedes the base MWU most of the 
time); in l';nglish, the Adj Ig structure is altered 
when two base-MWU, one of Adj N1 struc- 
ture and one of N2 N:t structure, are overcom- 
posed: geostationnary satellite and telecommu- 
nication satellite yield geostationnary telecom- 
munication satellite. Coordination, which re- 
quires two base-MWU, always breaks the struc- 
ture of one of them. These operations are not 
used with the same frequency, composition and 
modification being more frequent than coordi- 
nation. 
From the preceding considerations, it seems 
natural to focus on base MWU. But, we tat~e 
into account the cases where base-MWU struc- 
ture is broken ms well as their variants. Vari- 
ants of base-MWU are mainly graphical, ortho- 
graphical and morphosyntactic (for more details 
see \[l)aille, 1994\]). In English, we have also ac- 
eel)ted the transformation of a base-MWU of N2 
lql structure into a N:I. of lg2 structure. 
2.3 Extraction of base MWU 
To extract and count the occurrences of base 
MWU (under their various forms) we use their 
morphosyntactic structures. IIappily enough, 
our corpus is tagged: we can rely on tile se- 
quences of part-of-speech to extract the rele- 
vant candidates. For this purpose, we have im- 
plemented finite automata associated to regular 
expressions covering most occurrences of mor- 
phosyntactic structures, including modilications 
clue to the afore mentionned operations. The 
output of the program is a list of pairs composed 
of two lemmas. Under each pair are stored all 
occurrences of the base MWU extracted from 
the corpus: for example, for the pair 
(interference, level), 
we get the following sequences: 
interference level, interference levels, level of in- 
terference, levels of interference; 
for the pair 
(circuit, num6rique), 
we get: 
circuit numdrique, circuits numdriques, ciTvuit 
enti~rement numdrique, circuits analogiques et 
numdriques. 
Unfortunately, the pairs obtained through 
this method are not always "good" terms; we 
therefore have to introduce additional filters: 
statistical scores which use the number of oc- 
currences of the pairs as input. 
Statistical scores to se- 
lect good monolingual can- 
didates 
Since the MWU extracted after the lin- 
guistic specifications still contain noise, we use 
statistical scores as an additional tilter in of 
der to choose the ~good" ones among the can- 
didate base MWU (or candidates for short). 
These scores apply on candi&~tc lcmma pairs 
for a given morphosyntactic pattern. We con> 
puted different scores: frequencies, associa.tiol~ 
criteria, Shaflnon diversity and distance mea~ 
sures. We discuss frequencies and association 
criteria below. Informations provided by Shan- 
non diversity and distance measures are pre- 
sented in \[Daille, 1994\]. Most of the associa- 
tion criteria can be found in the classic liter- 
ature, such as \[Clifford and Stephenson, 1975\], 
and are based on the so-called "contingency ta- 
bles". In the field of eomputationa.1 linguis- 
tics, mutual information \[Brown et al., 1988\], 
¢2 \[Church and Hanks, 1990\], or a likelihood ra- 
tio test \[Dunning, 199a\] are suggested. 
Our testing method consists in comparing 
our result list, sorted according to a specified 
score, with a reference list containing only valid 
terms. In order to l)uild the reference list, 
we augmented an existing terminology database 
(EURODICAUTOM) with hand work: we se- 
lected those candidates for which at least two 
judges out of three agreed on their goodness, 
and included them in the reference llst. A candi- 
date will be considered "good" when it is found 
in this reference list. 
The program for the extraction of candi- 
dates was run on a 240,000 words French corpus, 
divided into 9,541 sentences. It yiehled 2,400 
candidates that appear at least three times in 
the text. After sorting the candidates accord- 
ing to the particular score examined, we build 
a graphic representation of the prol)ortion of 
"good" candidates per range of score w~lues, in 
the form of a histogram. Figure 1 shows the his- 
tograms obtMned on the French Iqt de 1~2 can- 
didates with two different scores: the likelihood 
ratio and mutuM information. The vertical axis, 
which represents the proportion of good candio 
dates, is bounded between 0 and 1. llut, since 
we do not pretend to have built an exhaustive 
reference list, we can't actually expect values of 
more than 0.8. The horizontM axis grows with 
the score. The likelihood ratio test (LOG) (the 
use of a likelihood ratio test leads to a log likeli- 
hood statistics, which explains tile LOG abbre- 
vi:ttion) shows a curve that grows slowly at tirst, 
and then decidedly. There is a true ol)position 
ill the behaviours of small and big v~Llues, the 
representativity of good candidates approaching 
0.8 in that case. On tile contrary, mutual infor- 
mat, ion (MI)is quite uniformeiy distributed, and 
the representativity never exceeds 0.5. Thus, if 
one had to choose between these two scores, the 
first one would undoubtedly be selected. 
The outcome of tile study of tile histograms 
is that very few searing methods will meet our 
objective. The most signilicant seem to be tit('. 
likelihood ratio test, and the number of occur- 
fences of the pair. If it were for the sole his- 
tograms, one could be tempted to keep only the 
frequency of the pair, but due to its definition 
unfrequent terms won't stand out; therefore, we 
have to keep several methods. Anyhow, the per- 
formance of our scoring functions indicate tile 
best method to follow: start any task (human 
post-editing for example) with the highest fre- 
quency values on monolingual candidates pre- 
viously identified via linguistic patterns, thus 
maximizing the odds that the candidates will 
indeed be "good" candidates. 
Is the best score a combination of 
s(:ores? An extra data analysis method 
To complete the evaluation and comparison 
of the different scores used to select actual ternls 
from our candidates list, we have applied to 
these scores specific data analysis methods. In 
our case, tile techniques of data anMysis can pro- 
vide at least two things: a probabillstic study 
of the correlations between scores, in order to 
select tile relevant ones, and the possibility of 
coml)ining scores, in order to improve the re- 
sults. 
We carried out Principal Component Analy- 
sis (PCA) with a set of fifteen variables, i.e. tile 
scores, on a set of 1,230 individuals, i.e. the can- 
517 
0.8 
0.6 
0.4 
0.2 
0 
0.5 
0.4 
0.5 
0.2 
0.1 
0 t .... 
Figure 1: Itistograms corresponding to the log likelihood (left), and to mntnal information (right) 
didates. The first objective of PCA is to reduce 
the number of variables with as little loss of in- 
formation as possible. Doing so shouhl enable 
us to visualize our data more easily. From the 
study of the correlation matrix of variables and 
their part in the principal axis, it appeared that 
variables could be divided into four classes. In 
each class, we wanted to retain the most signifi- 
cant variable with regard to our task. This was 
done by selecting in each class the variable with 
the best score-histogram. All other w~riables 
can be confidently connected to one of these. 
With the remaining four variables we tried 
to investigate the set of candidates. The diffi- 
culty of this task comes from the high nulnber 
of candidates, so we decided to work on smaller 
sets. We divided the whole set of candidates 
into six subsets depending on the number of oc- 
currences of the candidate (respectively with a 
number of occurrences in the text equal to 2, 3, 
4, 5, between 6 and 10, and over 10). This di- 
vision was used to counterbalance the fact that 
most scores tend to privilege low frequencies. 
We then visualized the candidates graphically 
(with different tick marks distinguishing terms 
and non-terms) on several two-dimensional dia- 
grams, the axis of which are linear combinations 
of the different variables. The results show series 
of marks overlapping each other, without any 
clear separation between terms and non-terms. 
The PCA allowed us to go further in the 
study of the statistical scores, and confirmed 
that only a few are relevant to our task; more- 
over, we found no satisfactory combination of 
variables that could account for termhood. 
4 Bilingual candidate term 
alignment 
The problenl is now one of finding correspon- 
dences between candidates across languages (be- 
tween English and I"rench in our case). As it has 
been shown in the previous section, it is difllcult 
to restrict the set of candidates to the "good" 
ones. l'~urthermore, if a candidate is really a 
term, it shouht be easy to find the equivalent in 
other languages, or more precisely, in our case, 
in aligned sentences (supposing that terms are 
always translated into the same lexical unit). 
For these reasons, we tried to pick out associa- 
tions between the whole sets of candidates, both 
l?rench and English. Sticking to our general phi- 
losophy of harmonizing linguistic and statistical 
contributions, we made experiments with two 
methods. 
4.1 A simple count method, and a 
possible improvement 
This method basically counts how often a 
source candidate and a target candidate occur 
in aligned sentences; we shall cMl the resulting 
score bilingual count. For a given source lan- 
guage candidate, the most frequently aligned 
target candidates are sorted according to the 
score; one can then apply the criterion defined 
in \[Gaussier et al., 1992\], which consists in ex- 
tracting a bilingual pair only when the target 
candidate has no stronger association with any 
other source candidate. Such a naive frequency 
collection gives satisfactory results, providing a 
sorted list where the top half candidates are 
518 
mostly good. We can however try to improve on 
that result by taking adv,'tntage of available lin- 
guistic inforIffation, namely the part-of-speech 
patterns of the candidates. We estimated, by 
counting pattern alignments on a bilingual cor- 
pus, pattern aJfinity, which we define as the 
probability that the pattern of the source candi- 
date gets translated into the pattern of the tar- 
get candidate (e.g. terms with pattern NdcN in 
French get translated to terms with pattern NN 
in English in 81% of the eases). Now, instead of 
simple counts of the occurrences of the source 
and target candidates, we use counts that are 
weighted with a function of the pattern aftin- 
ity of candidates, as explained in the following 
formula: 
p,,o:,~.~(p, t 7,.,) x c(t) C(s, t) = ~,. 
~.p,.ol,~4> 1>) x c(>) 
where C() means "count of", and p "pattern 
of"; probes stands for the probabilities we es- 
timated, and sent represents the set of aligned 
sentences in which both the source candidate, s, 
and the target candidate, t, appear. Results are 
discussed below. 
4.2 Another method: using word 
alignment scores 
This method consists essentially in align- 
ing terms using bilingual associations of 
single words, that have been previously 
cmnputed through statistical methods (see 
\[Gaussier et al., 1992\]). Let wordcnl and 
worden2 denote the two words composing an 
English candidate an(l, similarly, wordfrl and 
wordfr2 those for a given French candidate. 
Each time a source and target candidate occur 
in aligned sentences, we define the association 
score between the two candidates as the sum of 
all the possible single-word-associations between 
the words composing the candidates: 
Assoc(wordenl, wordf rj) 
L)3 
This method relics on the assumption that if 
two terms are translation of one another, then 
wordfrl is likely to be the translation of, or at 
least associated to, either wordcnl or wordcn2, 
and so is wordfr> For example, l)oth French 
words station and terrienne belong to the lists 
of single-word translation candidates associated 
to the English words earth and station. 
4.3 It\]valuation and results 
To evaluate the results, we manually con- 
structed a reference list which contains 1,238 
correspondences of MWU2 we believe to be 
terms (a.ppearing at least twice). For any list 
of candidate pairs to be tested (extracted and 
sorted according to some scores), we define its 
precision as the ratio of the number of elements 
found in the reference list to the total numl)er 
of elements, and its recall as the ratio of the 
number of dements found in the reference list 
to the total number of elements in the reference 
list. In so far as these parameters ;u'e affected by 
the length of the list, we divided each list into 
units of the same length, considering the first 
100 elements, the lirst 2(}0 elements and so on, 
till we reach the end of one list. For each unit, 
we calcula.ted its recall a.nd its precision. The 
scores can then be compared in graphs showing 
the evolution of precision and recall with the 
number of candidates considered. Using such 
graphs, we found out that adding pattern prob- 
abilities gives only slightly better or equal re- 
sults than the silnple frequency method, which 
does not entirely satisfy our expectations as to 
their role on the quality of term alignment. 
The second method presented above, which 
does not make use of pattern probabilities, be- 
haves better for candidates of rank 1,00(} and 
more, hut worse for the first 100 best candi- 
dates. The examination of the top 100 candi- 
dates for both methods brings another surprise: 
only 6 term pairs are cmnmon to the two metL 
ods. Of course this could be due to the restric- 
tion to the very hest candidates; but, suspecting 
that there is nmre to it, we combined the can- 
didates from both methods and thus obtained 
an improvement of precision in all the range of 
candidates. The comparison between this com- 
bination of methods and the method involving 
pattern atlinities is shown in figure 2. The plain 
line in figure 2 shows precision and recall for the 
first method involving pattern aftinitles (which 
is referred to as normal on the figure). We think 
that such results could be even better if we chose 
the right combination of two -or, possibly, more 
than two- scoring methods. 
All in all, our method provides 70% preci- 
sion for a list of more than 1,00(} candidates, 
or 80% if one restricts to the topmost 500. it 
gives a good starting point for terminologists 
519 
100 
80 
g ~- 60 
,q 
.o 
"~ 40 
20 
"~ Precision 
,1"" 
....J" \[__~_ normal method \] 
/..~a "j~" \[ _ ,. combined melh0dsJ 
, I , \] , I , . I 
250 508 750 1000 1250 
N" of candidates 
Figure 2: Precision and Recall when using first method vs. combined methods 
.520 
seeking to extract bilingual terms from texts. 
It has to be noted that the remaining "wrong" 
alignments are generally somehow relevant, but 
either incomplete because one of the candidates 
has a length of more than 2, or trivial because 
we also extract pairs such as following formula 
/ formule suivante. 
5 Conclusion 
For monolingual terminology extraction, we 
have first defined the linguistic specifications of 
multi-word unit terms. These specifications are 
used to extract candidate pairs from the cor- 
pora. I~om these pairs and using frequency 
counts, we have applied a range of statistic~d 
scores, in order to find the score that would 
best discriminate between terms and non-terms 
among the candidates. The simple frequency 
count of the pair turns out to be the best 
score. We then carried out bilingual terminol- 
ogy extraction using results from the nmnolin- 
gum step. Two scoring methods augmented by 
the use of prior knowledge about the bilingual 
correspondences of linguistic patterns were ap- 
plied to bilingual pairs of monolingua\[ candidate 
pairs. In this case, the best way to proceed is 
derived from the combination of the two meth- 
ods. 
With regard to the future, we will focus 
on monolingual extraction of MWU of length 
a and more (based on the identified base MWU 
and linguistic specifications on term composi- 
tion), and on the improvement of term align- 
ment in order to deal with terms of heteroge- 
neous length. 
References 
\[Benveniste, 1966\] Benveniste B. (1966). 
Formes nouvelles de l~t composition nomi- 
nale. \[n Gallimard Ed., Probldmes de lin- 
gulstique gdndrale, pp. 163-173. 
\[Bourigault, 1992\] Bourigault D. (1992). Sur- 
face Grammatical Analysis for the Extrac- 
tion of TerminologicM Noun Phrases. Pro- 
ceedings of the l~lh International Confer- 
ence on Computational Linguistics. 
\[Brown et el., 1988\] Brown, P. F., Co&e, J., 
I)ellaPietra, S. A., DellaPietra, V. J., 3t- 
linek, F., Mercer, R. L., and Roossin, P. S. 
(1988). A statistical approach to language 
translation. Proceedings of the 12th In- 
ternationol Conference an Computational 
Linguistics. 
\[Church and llanks, 1990\] Church K. W. and 
Ilanks P., 1990\] (1990) Word Association 
Norms, Mutual Information, And Lexicog- 
raphy. Computational Linguistics, Vol. 16, 
Number I. 
\[Olifford and Stephenson, 1975\] Clifford 
II.T. and Stephenson W. (1975) An Intro- 
duction to Numerical Classification, Aca- 
demic Press. 
\[Daille, 1994\] Daille B. (1994). Study and fin- 
plementation of combined techniques for 
Automatic Extraction of Terminology. In 
The Balancing Act: Combining Symbolic 
and Statistical Approaches to Language. 
Workshop at the 32nd Annual Meeting of 
the ACL. 
\[Dunning, 1993\] Dunning T. (1993). Accurat,, 
Methods for the Statistics of Surprise 
attd Coincidence. Computational \],inguis- 
tics, Vol. 19, N~m~ber 1. 
\[G,~ussier et al., 1992\] Gaussier E., Langd J.-M. 
and Meunier F. (1992). Towards Bilingual 
Terminology. Proceedings of ALLC/ACH 
CO@FCT"tCe, 
\[Ilatcher, 1960\] Hatcher A. J. (1960). An in- 
troduction to the analysis of English noun 
compounds. In Word, i6, 356-373. 
\[J,~cquemin, 1991\] Jacquemin C. (1991). Trans- 
formation des noms composds. Th~se de 
doctorat en Informatique Fondamentale, 
Universitd Paris Z 
\[Mathieu-Colas, 1988\] M,~thieu-Colas M. 
(1988). Typologie des noms composds. 
Technical l~.eport of the "Programme de 
Reeherches Coordonn~es Informatique Lin- 
guistiquc', C.N.R.S. 
\[Nkwenti-Azeh, 1992\] Nkwenti-Azeh B. (1992). 
Positional and Combinational Characteris- 
tics of S~tellite Communications Terms. In- 
terim Report, CCL-UMIST. 
\[Smadja and McKeown, 1990\] Slnadja I". A. 
and McKeown K. 11~. (1990). Automatically 
Extracting And Representing Collocations 
For Language Generation. Proceedings of 
the 28th annual Meeting of thc A CL. 
\[Van der Eijk, 1993\] V~m der Eijk P. (I903) Au- 
tomating the Acquisition of BilinguM Ter- 
minology. Proceedings of Em'opean A CL. 
521 

Computational Linguistics 

