A step towards the detection of semantic variants of terms in 
technical documents 
Thierry Hamon and Adeline Nazarenko 
Laboratoire d'Informatique de Paris-Nord 
Universit~ Paris-Nord 
Avenue J-B Clement 
93430 Villetaneuse, FRANCE 
thierry.hamon@lipn.univ-parisl3.fr 
adeline.nazarenko@lipn.univ-parisl3.fr 
C~cile Gros 
EDF-DER-IMA-TIEM-SOAD 
1 Avenue du G~n~ral de Gaulle 
92141 Clamart CEDEX, FRANCE 
cecile.gros@der.edfgdf.fr 
Abstract 
This paper reports the results of a preliminary 
experiment on the detection of semantic vari- 
ants of terms in a French technical document. 
The general goal of our work is to help the struc- 
turation of terminologies. Two kinds of seman- 
tic variants can be found in traditional termi- 
nologies : strict synonymy links and fuzzier re- 
lations like see-also. We have designed three 
rules which exploit general dictionary informa- 
tion to infer synonymy relations between com- 
plex candidate terms. The results have been 
examined by a human terminologist. The ex- 
pert has judged that half of the overall pairs of 
terms are relevant for the semantic variation. 
He validated an important part of the detected 
links as synonymy. Moreover, it appeared that 
numerous errors are due to few mis-interpreted 
links: they could be eliminated by few exception 
rules. 
1 Introduction 
1.1 Structuring a terminology 
The work presented here is a part of an indus- 
trial project of Technical Document Consulta- 
tion System (Gros et al., 1996) at the French 
electricity company EDF. The goal is to develop 
tools to help a terminologist in the construction 
of a structured terminology (cf. figure 1) pro- 
viding : 
• terms of a domain, i.e. simple or com- 
plex lexical units pointing out accurate con- 
cepts in a technical document, (Bourigault, 
1992); 
• semantic links such as the see-also relation. 
This can be viewed as a two-step process. The 
candidate terms (i.e. lexical units which can 
be terms if a domain expert validates them) are 
first automatically extracted from the technical 
document with a Terminology Extraction Soft- 
ware (LEXTER) (Bourigault, 1992). The list 
of candidate terms is then structured into a se- 
mantic network. We focus on the latter point 
by detecting semantic variants, especially syn- 
onyms. 
ligne a~rienne (overhead line) 
See_also : D~part a~rien (overhead outlet) 
Synonym : Liaison ~lectrique a~rienne 
(overhead electric link) 
Ligne simple (single circuit line) 
Is_a : Ligne a~rienne (overhead line) 
Ligne multiterne (multiple circuit line) 
ls_a : Ligne a~rienne (overhead line) 
Synonym : Ligne double (double circuit line) 
Figure 1: Example of a structured terminology 
in the electric domain. 
In order to build a structured terminology, 
we thus attempt to link candidate terms ex- 
tracted from a French technical document 1. 
For instance, from synonyms such as matgriel (equipment) 
/ dquipement (fittings), marche 
(running) /fonctionnement (working) and nor- 
mal (normal) / bon (right), we infer a synonymy 
link between candidate terms matdriel dlec- 
trique (electric equipment) / dquipement dlec- 
trique (electrical fittings) and marche normale 
(normal running) / bon fonctionnement (right 
working). 
As the terms used in this paper have been extracted 
from French documents, their translation, especially for 
the synonymy, does not always show the same nuance 
than originally. 
498 
modNe (model) : < 1 > canon (canon), ~talon (standard), 
exemplaire (copy), example (example), 
plan (plan) 
< 2 > sujet (subject), maquette (maquette) 
< 3 > h~ros (hero), type (type) 
< 4 > 4chantillon (sample), specimen (sample) 
< 5 > standard (standard), type (type), 
prototype (prototype) 
< 6 > maquette (model) 
< 7 > gabarit (size), moule (mould), patron (pattern) 
Figure 2: Example of a word entry from the dictionary Le Robert. 
1.2 Using a general language dictionary 
for specialized corpora 
As domain specific semantic information is sel- 
dom available, our aim is to evaluate the rel- 
evance and usefulness of general semantic re- 
sources for the detection of synonymy between 
candidate terms. 
For this study, we used a French general 
dictionary Le Robert supplied by the Institut 
National de la Langue Franqaise (INaLF). It 
provides synonyms and analogical words dis- 
tributed among the different senses (cf. figure 2) 
of each word entry. It is exploited as a machine- 
readable synonym dictionary. 
We use a 200 000 word corpus about electric 
power plant. Its size is typical of the technical 
documents. It is very technical if one consid- 
ers the dictionary lemma coverage for this cor- 
pus (45%). Concerning two other available doc- 
uments dealing with software engineering and 
electric network planning, the dictionary lemma 
coverage is respectively of 65% and 57%. In that 
respect the chosen corpus is the worse case for 
this experiment. 
The present corpus has been analyzed by 
the Terminology Extraction Software LEX- 
TER which extracted 12 043 candidate terms 
(2 831 nouns, 597 adjectives and 8 615 noun 
phrases). Each complex candidate term (ligne 
d'alimentation, supply line) is analyzed into a 
head (ligne, line) and an expansion (alimenta- 
tion, supply). It is part of a syntactic network 
(cf. figure 3). 
2 Method for the detection of 
synonymous terms 
The terminological variation include morpho- 
logical (fiectional, derivational) variants, syn- 
tactic variant (coordinated and compound 
terms) but also semantic variant (synonyms, hy- 
peronyms) of controlled terms. In this experi- 
ment, we attempt to infer synonymy links be- 
tween candidate terms. 
2.1 Semantic variation and synonymy 
relation 
Semantic variation The semantic variation 
includes relations (e.g. synonymy and see-also) 
between words of the same grammatical cate- 
gory, even if one may also take into consider- 
ation phenomena such as elliptic relations or 
combination of synonymy and derivation rela- 
tions (e.g. heat and thermal) where the cate- 
gories may be different. 
Fuzzier relations such as the traditional see- 
also relations of terminologies are also very use- 
ful. Once a link is established between two 
terms, it is sometimes easy to interpret for the 
terminology users. Moreover, for applications 
such as document retrieval, the link itself is of- 
ten more important than its very type. 
Synonymy We use a synonymy definition 
close to that of WordNet (Miller et al., 1993). 
It is defined as an equivalence relation between 
terms having the same meaning in a particu- 
lar context. The transitivity rule cannot be ap- 
plied to the links extracted from the dictionary. 
Indeed, while the synonymy is sometimes very 
contextual in the dictionary, the links appear in 
the data without context information and would 
produce a great deal of errors. Thus, for in- 
stance, the synonymy links between the adjec- 
tives polaire (polar) and glacial (icy) and the ad- 
jectives glacial (cold) and insensible (insensitive) 
would allow to deduce a wrong synonym link 
between polaire and insensible. 
Moreover, tests carried out on dictionary 
samples show that the relevant links which 
499 
Y 
ligne (line) 
ligne a@rienne 
(overhead line) • 
ligne simple (single line) 
ligne double (double line) 
ligne d'alimentation 
H 
(supply line) 
(...) 
ligne a~rienne haute tension 
(hight voltage overhead line) 
ligne a~rienne moyenne tension 
(middle voltage overhead line) 
alimentation (supply) 
capacit@ de transit de la ligne (transit capacity of the line) 
cofit d'investissement de la ligne 
(cost of investissement of the line) 
d6clenchement de la ligne 
9 (tripping of the line) E 
longueur de la ligne (size of the line) 
puissance caract@ristique de la ligne (caracteristic power of the line) 
ordre de d~clenchement 
(order of tripping) 
de la ligne (of the line) 
...) 
Figure 3: Fragment of the syntactic network (H = head, E = expansion). 
Number of simple terms extracted 
Number of retained words 
at the filtering step 
Percentage of retained words 
at the filtering step 
Nouns Adjectives Total 
2 831 597 3 428 
1 134 408 1 542 
40% 68% 45% 
Table 1: Coverage of the corpus by the dictionary. 
could be added thanks to the transitivity rules 
already exist in the dictionary. For instance the 
following words are synonymous pairwise: lo- 
gement (accommodation), demeure (residence), 
domicile (residence) and habitation (house). 
We consider all links provided by the dictio- 
nary as expressing synonymy relation between 
simple candidate terms and design a two-step 
automatic method to infer links between com- 
plex candidate terms. 
2.2 First step: Dictionary data filtering 
In order to reduce the database, we first fil- 
ter the relevant dictionary links for the stud- 
ied document. For instance, the link matdriel 
(equipment) / dquipement (fittings) is selected 
because its both ends, materiel and 6quipement 
exist in the studied corpus. For this document, 
3 369 synonymy links between 1 542 simple 
terms are preserved. 
Table 1 shows the results of the filtering step 
in regard to the coverage of our corpus by the 
dictionary. 
2.3 Second step: Detection of 
synonymous candidate terms 
Assuming that the semantics and the synonymy 
of the complex candidate terms are composi- 
tional, we design three rules to detect synonymy 
relations between candidate terms. Consider- 
ing two candidates terms, if one of the following 
conditions is met, a synonymy link is added to 
the terminological network: 
- the heads are identical and the expansions 
are synonymous (collecteurg~ndral (general 
collector) / collecteur commun (common 
collector)); 
-the heads are synonymous and the ex- 
pansions are identical (matdriel dlectrique 
(electric equipment) / dquipement ~lectrique 
(electrical fittings)); 
- the heads are synonymous and the expansions 
are synonymous (marche normale (normal 
running) / bon fonctionnement (right work- ing)); 
500 
We first use the dictionary links as a boot- 
strap to detect synonymy links between com- 
plex candidate terms. Then, we iterate the pro- 
cess by including the newly detected links in 
our base until no new link can be found. In the 
present experiment, the process ends up after 
three iterations. 
3 Results and study of the detected 
links 
3.1 Various detected links 
Synonymy links 396 links between complex 
candidate terms (i.e. noun phrases) are inferred 
by this method. An expert of the domain vali- 
dated 37% of them (i.e. 146 links, cf. table 2) 
as real synonymy links: hauteur d'eau (water 
height) / niveau d'eau (level of water), d~t~ri- 
oration notable (notable deterioration) / d6gra- 
dation importante (important damage) (cf. fig- 
ure 4). 
Number Percentage 
Validated links 146 37% 
Unvalidated links 250 63% 
Total 396 100% 
Table 2: Results of the link validation. 
Most of the synonymy links between candi- 
date terms are detected at the first iteration 
(383 liens out of 396). The majority of the val- 
idated links are given by the two first rules: 89 
validated links out of 206 with the first rule (ad- 
mission d'air (air intake) / entrde d'air (air en- 
try)), 49 out of 105 with the second (toitflottant 
(floating roof) / toil mobile (movable roof) and 
collecteur gdndral (general collector) / colleeteur 
commun (common collector)). Obviously, the 
last rule has a lower precision rate: 8 out of 
85 (fausse manoeuvre (wrong operation) / mau- 
valse manipulation (bad handling)). However, 
it infers important links which are difficult to 
detect by hand. 
Other useful links On the whole, the expert 
judged that half of the detected links are useful 
for the terminology structuration even if he re- 
jected some of them as real synonymy links (cf. 
figure 5). Our method detects different types of 
links: meronymy, antonymy, relations between 
close concepts, connected parts of a whole mech- 
anism, etc. 
The meronymy links are the most numerous 
after synonymy (rapport de s~retd (safety report) / 
analyse de s~retd (safety analysis)). In the 
previous example, whereas rapport (report) and 
analyse (analysis) are given as synonyms by the 
general language dictionary (which is context- 
free), their technical meanings in our document 
are more specific. Therefore, rapport de s~retd 
is a meronym rather than a synonym of analyse 
de s~retd in the studied document. 
Other detected links allow to group the can- 
didate terms which refer to related concepts. 
For instance, we detected a link between the 
device ligne de vidange (draining line) and the 
place point de purge (blow-down point) which is 
relevant since a draining line ends at a blow- 
down point. Likewise, it is useful to link fin de 
vidange (draining end) which designates an op- 
eration and destination des purges (blow-down 
destination) which is the corresponding equip- 
ment. 
The expert considered that the link be- 
tween the candidate terms (commande md- 
canique (mechanical control) / ordre automa- 
tique (automatic order)) expresses an antonymy 
relation, although it is infered from the syn- 
onymy relation of the dictionary mdeanique 
(mechanical) / automatique (automatic). It ap- 
pears that those adjectives have a particular 
meaning in the present corpus. Therefore, ev- 
ery link detected from this "synonymy" link is 
an antonymy one. 
Those links express various relations some- 
times difficult to name, even by the expert. 
Such links are important in a terminology. 
3.2 Polysemy, elision and metaphor 
Most real errors are due to the lack of con- 
text information for polysemic words and the 
noisy data existing in the dictionary. For in- 
stance the French word temps means either 
time or weather. According to the dictio- 
nary, temps (weather) is a synonym of temper- 
ature (temperature) 2, but this meaning is ex- 
cluded from the present corpus. Since we can- 
not distinguish the different meanings, the syn- 
onymy of temps / time and temperature is taken 
for granted. Temps attendu (expected time) 
and tempdrature attentive (expected tempera- 
2 It would be more precise to interpret it as analogous 
words. 
501 
Term 1 Term 2 
d~t~rioration notable 
(notable deterioration) 
fausse manoeuvre (wrong operation) 
action de l'op~rateur 
(action of the operator) 
capacit~ interne (internal capacity) 
capacit~ totale (total capacity) 
capacit~ utile (useful capacity) 
limite de solubilit~ (limit of solubility) 
marche manuelle (manual running) 
tests p~riodiques (periodic tests) 
hauteur d'eau (water height) 
panneau de commande (control panel) 
d~gradation importante 
(important damage) 
mauvaise manipulation (bad handling) 
intervention de l'op6rateur 
(intervention of the operator) 
volume interne (internal volume) 
volume total (total volume) 
volume utile (useful volume) 
seuil de solubilit6 (solubility threshold) 
fonctionnement manuel (manual working) 
essais p~riodiques (periodic trials) 
niveau d'eau (level of water) 
tableau de commande (control board) 
Figure 4: Examples of synonymy links between complex candidate terms. 
Term 1 Term 2 
essai en usine (test in plant) 
ligne de vidange (draining line) 
fonction d'un temps (fonction of a time) 
froid normal (normal cold) 
rapport de sfiret~ (safety report) 
solution d'acide borique 
(solution of boric acid) 
temperature attendue 
(expected temperature) 
temperature normale (normal temperature) 
organes de commande (control devices) 
gros d~bit (big flow) 
activit~ importante (important activity) 
commande m~canique (mechanical control) 
risques de corrosion (risk of corrosion) 
experience d'exploitation 
(experiment of exploitation) 
point de purge (blow-down point) 
effet d'une temperature 
(effect of a temperature) 
refroidissement correct (correct cooling) 
analyse de sfiret~ (safety analysis) 
dissolution de l'acide borique 
(dissolving of the boric acid) 
temps attendu (expected time) 
temps normal (normal time) 
organes d'ordre (order devices) 
plein d~bit (full flow) 
activit~ ~lev~e (high activity) 
ordre automatique (automatic order) 
risques de destruction (risk of destruction) 
Figure 5: Examples of rejected links 
ture) are thus given as synonymous. This type 
of wrong links is rather important in the list 
presented to the expert: between 10 to 20 links 
out of 396. 
On the contrary, about ten wrong links are 
due to the elision of common terms in the do- 
main. For instance, the term activitd (activity) 
which actually corresponds to the term radioac- 
tivitd (radioactivity) in the document is given as 
a synonym of gnergie (energy) in the dictionary. 
between complex candidate terms. 
We have detected links such as activitd haute 
(high activity) / haute dnergie (high energy). 
As regards metaphor, we have observed that 
it preserves semantic relation. For instance, in 
graph theory, the link (arbre (tree) / feuille 
(leaf)) can be inferred from the meronyny in- 
formation of general dictionary. 
Those types of wrong links are easily iden- 
tified during the validation. Some exceptions 
rules can be designed to first regroup those links 
502 
and then eliminate them. With that aim, we 
plan to use dictionary definitions. 
3.3 Evaluation 
The inferred links express not only synonymy, 
but also other relations which may be difficult 
to name. Apart from real errors, these fuzzy 
see-also relations are useful in the context of a 
consultation system. 
The results of this first experiment are en- 
couraging. Although the precision rate and the 
number of links are low (37%, 396 links), the 
use of complementary methods (e.g. detection 
of syntactic variants) would allow to propagate 
these links and increase their number. Also, 
the use of other knowledge sources or different 
methods (Habert et al., 1996) is necessary to 
increase precision rate and find links between 
more technical candidate terms. 
As regards the improvement of such a 
method, the terminology acquisition by an ex- 
pert will take tens of hours while the automatic 
extraction takes one hour and the validation of 
the links has been done in two hours. 
The main difficulty is to evaluate the recall in 
the results because there is no standard refer- 
ence in that matter, giving the overall relevant 
relations in the document. One may think that 
the comparison with links manually detected by 
an expert is the best evaluation, but such man- 
ual detection is subjective. Regarding the vali- 
dation by several experts, it is well-known that 
such validation would give different results de- 
pending on the background of each expert (Sz- 
pakowicz et al., 1996). So, we are reduced to 
compare our results with those obtained by dif- 
ferent methods even though they are not perfect 
either. We are planning to compare the clusters 
found by our method with the clustering one of 
(Assadi, 1997) to study how the results overlap 
and are complementary. 
4 Related works 
The variant detection in specialized corpora 
must be taken into account for information re- 
trieval. This complex operation involves the 
semantic as well as the morphological and 
syntactic level. (Jacquemin, 1996) design a 
unification-based partial parser FASTER which 
analyses raw technical text while meta-rules 
detect morpho-syntactic variants of controlled 
terms (blood cell, blood mononuclear cell). By 
using morphological and part-of-speech mod- 
ules, the system are extended to the verbal 
phrases (tree cutting, tree have been cut down) 
(Klavans et al., 1997). Dealing with syntac- 
tic paraphrase in the general language, (Dras, 
1997) propose a similar representation by using 
the STAG formalism to detect syntactic related 
sentences. Because we deal with the semantic 
level, our work is complementary of those. 
Semantic variation is rarely studied in spe- 
cialized domains. Works on word similarity and 
word sense disambiguation are generally based 
on statistical methods designed for large or even 
very large corpora (Hindle, 1990; Agirre and 
Rigau, 1996). Therefore, they cannot be ap- 
plied for technical documents which usually are 
medium size corpora. However, dealing with 
already linguistic filtered data, (Assadi, 1997) 
aims at statistically build rough clusters sup- 
posing that similar candidate terms have similar 
expansions. Then he relies on human expertise 
for the semantic interpretation. It differs from 
our work which tries to automatically explicit 
the semantic relations. In order to disambiguate 
noun objects in a short text (30 000 words), 
(Li et al., 1995) design heuristic rules using se- 
mantic similarity information in WordNet and 
verbs as context. Their system disambiguate an 
encouraging number on noun-verb pairs if one 
considers single and multiple sense assigned to 
a word. 
In (Basili et al., 1997), the lexical knowledge 
base WordNet (Miller et al., 1993) is used as a 
bootstrap for verb disambiguation. They tune 
it to the domain of the studied document by 
taking into account the contexts in which the 
verbs are used. This tuning leads both to elimi- 
nate certain semantic categories and to add new 
ones. For instance, the category contact is cre- 
ated for the verb to record. The resulted sense 
classification is thus a better description of the 
verb specialized meanings. 
Our symbolic and dictionary-based approach 
is close that of (Basili et al., 1997). They both 
use general language information (traditional 
dictionary vs. WordNet) for specialized cor- 
pora. However, their goals differ: disambigua- 
tion vs. semantic relation identification. 
503 
5 Conclusion and future works 
The use of a synonym dictionary and the rules of 
synonymous candidate terms detection we have 
designed allow to extract an encouraging num- 
ber of links in a very technical corpus. An ex- 
pert validated these links. More than one third 
of the detected links are synonymy relations. 
Beside synonymy, our method detects various 
kinds of semantic variants. Wrong links due to 
the polysemy can be easily eliminated with ex- 
ception rules by comparing selectional patterns 
and generalized contexts (Basili et al., 1993; Gr- 
ishman and Sterling, 1994). 
Our work shows that general semantic data 
are useful for the terminology structuration and 
the synonym detection in a corpus of specialized 
language. The results show that semantic vari- 
ants can be automatically detected. Of course, 
the number of acquired links is relatively low 
but our method is not to be used in isolation. 
Acknowledgment 
This work is the result of a collaboration with 
the Direction des Etudes et Recherche (DER) 
d'Electricit~ de France (EDF). We thank Marie- 
Luce Picard from EDF and Benoit Habert from 
ENS Fontenay-St Cloud for their help, Didier 
Bourigault and Jean-Yves Hamon from the In- 
stitut de la Langue Fran~aise (INaLF) for the 
dictionary and Henry Boccon-Gibod for the val- 
idation of the results. 

References 
E. Agirre and G. Rigau. 1996. Word sense 
disambiguation using conceptual density. In 
Proceedings of COLING'96, pages 16-22, 
Copenhagen, Danmark. 
H. Assadi. 1997. Knowledge acquisition from 
texts: Using an automatic clustering method 
based on noun-modifier relationship. In 
Proceedings of ACL'97- Student Session, 
Madrid, Spain. 
Roberto Basili, Maria Teresa Pazienza, and 
Paola Velardi. 1993. Acquisition of selec- 
tional patterns in sublanguages. Machine 
Translation, 8:175-201. 
Roberto Basili, Michelangelo Della Rocca, and 
Maria Teresa Pazienza. 1997. Contextual 
word sense tunig and disambiguation. Ap- 
plied Artificial Intelligence, 11:235-262. 
D. Bourigault. 1992. Surface grammatical 
analysis for the extraction of terminological 
noun phrases. In Proceedings of COLING'92, 
pages 977-981, Nantes, France. 
Mark Dras. 1997. Representing paraphrases us- 
ing synchronous tree adjoining grammars. In 
proceedings of the 1997 Australian NLP Sum- 
mer Workshop, Syndney, Australia. 
Ralph Grishman and John Sterling. 1994. Gen- 
eralizing automatically generated selectional 
patterns. In Proceedings of Coling'94, vol- 
ume 3, pages 742-747, Kyoto. 
C. Gros, H. Assadi, N. Aussenac-Gilles, and 
A. Courcelle. 1996. Task models for techni- 
cal documentation accessing. In Proceedings 
of EKA W'96, Nottingham. 
Beno~t Habert, Elie Naulleau, and Adeline 
Nazarenko. 1996. Symbolic word cluster- 
ing for medium-size corpora. In Proceedings 
of COLING'96, volume 1, pages 490-495, 
Copenhagen, Danmark, August. 
D. Hindle. 1990. Noun classification from 
predicate-argument structures. In Proceed- 
ings of ACL'90, pages 268-275, Pittsburgh, 
PA. 
C. Jacquemin. 1996. A symbolic and surgi- 
cal acquisition of terms through variation. In 
E. Riloff et G. Scheler S. Wermter, editor, 
Connectionist, Statistical and Symbolic Ap- 
proaches to Learning/or Natural Language 
Processing, pages 425-438, Springer. 
J. Klavans, C. Jacquemin, and E. Tzouker- 
mann. 1997. A natural language approach to 
multi-word term conflation. In Proceedings of 
the third Delos Workshop - Cross-Language 
Information Retrieval. 
Xiaobin Li, Stan Szpakowicz, and Stan Matwin. 
1995. WordNet-based algorithm word sense 
disambiguation. In Proceedings of IJCAI-95, 
pages 1368-1374, Montreal, Canada. 
G. A. Miller, R. Beckwith, C. Fellbaum, 
D. Gross, and K. Miller. 1993. Introduc- 
tion to WordNet: An on-line lexical database. 
Technical Report CSL Report 43, Cognitive 
Science Laboratory, Princeton. 
Stan Szpakowicz, Stan Matwin, and Ken 
Barker. 1996. WordNet-based word sense 
disambiguation that works for short texts. 
Technical Report TR-96-03, Department of 
Computer Science, University of Ottawa, On- 
tario, Canada. 
