Syntagmatic and Paradigmatic Representations of Term Variation 
Christian Jacquemin 
LIMSI-CNRS 
BP 133 
91403 ORSAY Cedex 
FRANCE 
jacquemin@limsi.fr
Abstract 
A two-tier model for the description of morphologi- 
cal, syntactic and semantic variations of multi-word 
terms is presented. It is applied to term normal- 
ization of French and English corpora in the medi- 
cal and agricultural domains. Five different sources 
of morphological and semantic knowledge are exploited
(MULTEXT, CELEX, AGROVOC, WordNet 1.6, and the
Microsoft Word97 thesaurus).
1 Introduction 
In the classical approach to text retrieval, terms 
are assigned to queries and documents. The terms 
are generated by a process called automatic index- 
ing. Then, given a query, the similarity between the 
query and the documents is computed and a ranked 
list of documents is produced as output of the system 
for information access (Salton and McGill, 1983). 
The similarity between queries and documents de- 
pends on the terms they have in common. The 
same concept can be formulated in many different 
ways, known as variants, which should be conflated 
in order to avoid missing relevant documents. For 
this purpose, this paper proposes a novel model of 
term variation that integrates linguistic knowledge 
and performs accurate term normalization. It re- 
lies on previous or ongoing linguistic studies on this 
topic (Sparck Jones and Tait, 1984; Jacquemin et 
al., 1997; Hamon et al., 1998). Terms are described 
in a two-tier framework composed of a paradigmatic 
level and a syntagmatic level that account for the 
three linguistic dimensions of term variability (mor- 
phology, syntax, and semantics). Term variants are 
extracted from tagged corpora through FASTR^1, a
unification-based transformational parser described 
in (Jacquemin et al., 1997). 
Four experiments are performed on the French and
English languages and a measure of precision is
provided for each of them. Two experiments are made on
a French corpus [AGRIC] composed of 1.2 × 10^6 words of
scientific abstracts in the agricultural domain and two
on an English corpus [MEDIC] composed of 1.3 × 10^6
words of scientific abstracts in the medical domain.
The two experiments in the French language are
[AGRIC] + Word97 and [AGRIC] + AGROVOC. In the former,
synonymy links are extracted from the Microsoft Word97
thesaurus; in the latter, semantic classes are
extracted from the AGROVOC thesaurus, a thesaurus
specialized in the agricultural domain (AGROVOC, 1995).
In both experiments, morphological data are produced by
a stemming algorithm applied to the MULTEXT lexical
database (MULTEXT, 1998). The two experiments on the
English language are [MEDIC] + WordNet 1.6 and
[MEDIC] + Word97; they correspond to two different
sources of semantic knowledge. In both cases, the
morphological data are extracted from CELEX
(CELEX, 1998).

^1 FASTR can be downloaded from
www.limsi.fr/Individu/jacquemi/FASTR.
2 Term Variation: Representation 
and Exploitation 
Terms and variations are represented in two parallel
frameworks illustrated by Figure 1. While
terms are described by a unique pair composed of 
a structure--at the syntagmatic level--and a set of 
lexical items--at the paradigmatic level--, a varia- 
tion is represented by a pair of such pairs: one of 
them is the source term (or normalized term) and 
the other one is the target term (or variant). 
The syntagmatic description of a term is a con- 
text free rule; it is complemented with lexical infor- 
mation embedded in a feature structure denoted by 
constraints between paths and values. For instance, 
the term speed measurement is represented by: 
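$$
\left\{
\begin{array}{ll}
\text{Syntagm:} & \{ N_0 \rightarrow N_2\ N_1 \} \\
\text{Paradigm:} & \left\{ \begin{array}{l}
  \langle N_1\ \mathit{lemma} \rangle = \textit{measurement} \\
  \langle N_2\ \mathit{lemma} \rangle = \textit{speed}
\end{array} \right\}
\end{array}
\right.
\tag{1}
$$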
This term is a noun phrase composed of a head noun 
N1 and a modifier N2; the lemmas are given by the 
constraints at the paradigmatic level. This frame- 
work is similar to the unification-based representa- 
tion of context-free grammars of (Shieber, 1992). 
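
The two-tier description of a term can be pictured as a small data structure. The following is only a minimal sketch, not the FASTR formalism: the class name `Term`, its fields, and the helper `surface_lemmas` are hypothetical names chosen for illustration. The rule and the lemma constraints are those of the term speed measurement in (1).

```python
from dataclasses import dataclass


@dataclass
class Term:
    """A term as a pair (syntagm, paradigm); hypothetical illustration only."""
    rule: tuple   # syntagmatic level: context-free rule (mother, daughters)
    lemmas: dict  # paradigmatic level: constraints <category lemma> = value


# The term "speed measurement": N0 -> N2 N1, head N1, modifier N2.
speed_measurement = Term(
    rule=("N0", ("N2", "N1")),
    lemmas={"N1": "measurement", "N2": "speed"},
)


def surface_lemmas(term):
    """Return the lemmas in the order given by the right-hand side of the rule."""
    _mother, daughters = term.rule
    return [term.lemmas[category] for category in daughters]
```

Reading the lemmas through the rule yields the term in its surface order, which is how the two levels are meant to complement each other.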
[Figure: a term variation pairs a normalized term with a variant. At the syntagmatic level, a transformation relates the two structures; at the paradigmatic level, morphological and semantic links relate the lexical items L1, L2 of the source to the items L1', L2' of the target. The source is instantiated with the term speed measurement.]
Figure 1: Two level description of terms and variations
At the syntagmatic level, variations are repre- 
sented by a source and a target structure. At the 
paradigmatic level, the lexical elements of variations 
are not instantiated in order to ensure higher gener- 
ality. Instead, links between lexical elements are pro- 
vided. They denote morphological and/or semantic 
relations between lexical items in the source and
target structures of the variation. For example, the
variation that associates a Noun-Noun term such as the
preceding term speed$_{N_2}$ measurement$_{N_1}$ with a
verbal form of the head word and a synonym of the
argument such as measuring$_{V_1}$ maximal$_A$
shortening$_N$ velocity$_{N'_2}$ is given by:
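$$
\left\{
\begin{array}{ll}
\text{Syntagm:} & (N_0 \rightarrow N_2\ N_1) \Rightarrow (V_0 \rightarrow V_1\ (\mathrm{Prep}?\ \mathrm{Det}?\ (A|N|\mathrm{Part})^{*})\ N'_2) \\
\text{Paradigm:} & \left\{ \begin{array}{l}
  \langle N_1\ \mathit{root} \rangle = \langle V_1\ \mathit{root} \rangle \\
  \langle N_2\ \mathit{sem} \rangle = \langle N'_2\ \mathit{sem} \rangle
\end{array} \right\}
\end{array}
\right.
\tag{2}
$$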
If this variation is instantiated with the term given 
in (1), it recognizes the lexico-syntactic structure 
$$V_1\ (\mathrm{Prep}?\ \mathrm{Det}?\ (A|N|\mathrm{Part})^{*})\ N'_2 \tag{3}$$
in which $V_1$ and measurement are morphologically
related, and $N'_2$ and speed are semantically related.
The target structure is under-specified in order to 
describe several possible instantiations with a single 
expression and is therefore called a candidate varia- 
tion. In this example, a regular expression is used to 
under-specify the structure^2; another solution would
be to use quasi-trees with extended dependencies 
(Vijay-Shanker, 1992). 
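
To illustrate how a regular expression can under-specify the target structure, the sketch below checks the candidate structure (3) against a sequence of part-of-speech tags. It is an illustration only: FASTR is a unification-based parser and is not implemented this way, and the function name `match_candidate` and the tag encoding are assumptions. Only the syntagmatic side is checked here; the morphological and semantic links of the paradigmatic level are left out.

```python
import re

# Pattern (3): V1 (Prep? Det? (A|N|Part)*) N2'
# Tags are joined with single spaces, so "A " cannot match the tag "Adv".
PATTERN_3 = re.compile(r"V (Prep )?(Det )?((A|N|Part) )*N")


def match_candidate(tagged):
    """Return (verb, final noun) if the tag sequence fits pattern (3), else None.

    `tagged` is a list of (word, tag) pairs for one candidate phrase.
    """
    tags = " ".join(tag for _word, tag in tagged)
    if PATTERN_3.fullmatch(tags) is None:
        return None
    return tagged[0][0], tagged[-1][0]


candidate = [("measuring", "V"), ("maximal", "A"),
             ("shortening", "N"), ("velocity", "N")]
print(match_candidate(candidate))
```

The two words returned are the ones that still have to be related to the source term through the paradigmatic links.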
3 Paradigmatic relations 
As illustrated by Figure 2 and Formula (2), there are
two types of paradigmatic relations between lemmas
involved in the definition of term variations:
morphological and semantic relations. The morphological
family of a lemma $l$ is denoted by the set $F_M(l)$
and its semantic family by the set $F_{S_L}(l)$ or
$F_{S_C}(l)$.

^2 A stands for adjective, N for noun, Prep for
preposition, V for verb, Det for determiner, Part for
participle, and Adv for adverb.
[Figure: lemmas linked through their morphological families and semantic families; velocity appears as a member of a semantic family.]
Figure 2: Paradigmatic links between lemmas
Roughly speaking, two words are morphologi- 
cally related if and only if they share the same root. 
In the preceding example, to measure and measure- 
ment are in the same morphological family because 
their common root is to measure. Let $\mathcal{L}$ be
the set of lemmas. Morphological roots define a binary
relation $M$ from $\mathcal{L}$ to $\mathcal{L}$ that
associates each lemma with its root(s):
$M \in \mathcal{L} \leftrightarrow \mathcal{L}$. $M$ is
not a function because compound lemmas have more than
one root.
The morphological family $F_M(l)$ of a lemma $l$ is
the set of lemmas (including $l$) which share a common
root with $l$:
$$\forall l \in \mathcal{L},\quad F_M(l) = \{ l' \in \mathcal{L} \mid \exists r \in \mathcal{L},\ (l, r) \in M \wedge (l', r) \in M \} = M^{-1}(M(\{l\})) \tag{4}$$
($\mathbb{P}(\mathcal{L})$ is the power-set of $\mathcal{L}$, the set of its subsets.)
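There are principally two types of semantic relations:
direct links through a binary relation
$S_L \in \mathcal{L} \leftrightarrow \mathcal{L}$ or
classes $C \in \mathbb{P}(\mathbb{P}(\mathcal{L}))$.
In the case of semantic links, the semantic family
$F_{S_L}(l)$ of a lemma $l$ is the set of lemmas
(including $l$) which are linked to $l$:
$$F_{S_L} \in \mathcal{L} \rightarrow \mathbb{P}(\mathcal{L})$$
$$\forall l \in \mathcal{L},\quad F_{S_L}(l) = \{ l' \in \mathcal{L} \mid (l, l') \in S_L \} \cup \{l\} = S_L(\{l\}) \cup \{l\} \tag{5}$$
In the case of semantic classes, the semantic family
$F_{S_C}(l)$ of a lemma $l$ is the union of all the
classes to which it belongs:
$$\forall l \in \mathcal{L},\quad F_{S_C}(l) = \bigcup_{(c \in C) \wedge (l \in c)} c\ \cup \{l\} \tag{6}$$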
Links and classes are equivalent; the choice of either
model depends on the type of available semantic data.
In the experiments reported here, direct links are used
to represent data extracted from
the word processor Microsoft Word97 because they 
are provided as lists of synonyms associated with 
each lemma. Conversely, the synsets extracted from 
WordNet 1.6 (Fellbaum, 1998) are classes of disam- 
biguated lemmas and, therefore, correspond to the 
second technique. 
With respect to the definitions of semantic 
and morphological families given in this section, 
the candidate variant (3) is such that
$V_1 \in F_M(\textit{measurement})$ and
$N'_2 \in F_{S_L}(\textit{speed})$ or
$N'_2 \in F_{S_C}(\textit{speed})$.
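
The three family definitions can be sketched directly as set operations. This is a minimal sketch under stated assumptions: relations are encoded as Python sets of pairs, classes as a list of sets, and the function names are hypothetical. The toy data below are only a small illustrative subset; in particular, the grouping of words into the two classes of `C` is invented for the example and is not taken from WordNet.

```python
def morphological_family(lemma, M):
    """Formula (4): lemmas that share at least one root with `lemma`.

    M is a set of (lemma, root) pairs. As in the formula, a lemma with no
    root recorded in M gets an empty family.
    """
    roots = {r for (l, r) in M if l == lemma}
    return {l for (l, r) in M if r in roots}


def family_from_links(lemma, SL):
    """Formula (5): lemmas directly linked to `lemma`, plus `lemma` itself."""
    return {l2 for (l1, l2) in SL if l1 == lemma} | {lemma}


def family_from_classes(lemma, C):
    """Formula (6): union of all classes containing `lemma`, plus `lemma`."""
    family = {lemma}
    for c in C:
        if lemma in c:
            family |= c
    return family


# Toy data (illustrative subset; the two classes in C are hypothetical).
M = {("measurement", "measure"), ("measure", "measure"),
     ("measurable", "measure"), ("speed", "speed")}
SL = {("speed", "velocity"), ("speed", "swiftness")}
C = [{"speed", "velocity"}, {"speed", "swiftness", "fastness"}]

print(sorted(morphological_family("measurement", M)))
print(sorted(family_from_links("speed", SL)))
print(sorted(family_from_classes("speed", C)))
```

Because words are not disambiguated, the class-based family simply merges every class in which the lemma occurs, which is why the two models can be used interchangeably.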
4 Morphological and Semantic 
Families 
In the experiments on the English corpus, the CELEX
database is used to calculate morphological families.
As for semantic families, either WordNet 1.6 or the
thesaurus of Microsoft Word97 is used.
Morphological Links from CELEX 
In the CELEX morphological database (CELEX, 1998),
each lemma is associated with a morphological structure
that contains one or more root lemmas. These roots are
used to calculate morphological families according to
Formula (4). For example, the morphological family
$F_M(\textit{measurement}_N)$ of the lemmas with
measure$_V$ as root word is {commensurable$_A$,
commensurably$_{Adv}$, countermeasure$_N$,
immeasurable$_A$, immeasurably$_{Adv}$,
incommensurable$_A$, measurable$_A$, measurably$_{Adv}$,
measure$_N$, measureless$_A$, measurement$_N$,
mensurable$_A$, tape-measure$_N$, yard-measure$_N$,
measure$_V$}.
Semantic Classes from WordNet 
Two sources of semantic knowledge are used for 
the English language: the WordNet 1.6 thesaurus 
and the thesaurus of the word processor Microsoft 
Word97. In the WordNet thesaurus, disambiguated 
words are grouped into sets of synonyms--called 
synsets--that can be used for a class-based ap- 
proach to semantic relations. For example, each of 
the five disambiguated meanings of the polysemous 
noun speed belongs to a different synset. In our 
approach, words are not disambiguated and, there- 
fore, the semantic family of speed is calculated as 
the union of the synsets in which one of its senses is 
included. Through Formula (6), the semantic family of
speed based on WordNet is: $F_{S_C}(\textit{speed}_N)$ =
{speed$_N$, speeding$_N$, hurrying$_N$, hastening$_N$,
swiftness$_N$, fastness$_N$, velocity$_N$,
amphetamine$_N$}.
Semantic Links from Microsoft Word97 
To assist document editing, the word processor
Microsoft Word97 has a command that returns the
synonyms of a selected word. We have used this facility
to build lists of synonyms. For example,
$F_{S_L}(\textit{speed}_N)$ = {speed$_N$, swiftness$_N$,
velocity$_N$, quickness$_N$, rapidity$_N$,
acceleration$_N$, alacrity$_N$, celerity$_N$}
(Formula (5)). Eight other synonyms of
the word speed are provided by Word97, but they are 
not included in this semantic family because they are 
not categorized as nouns in CELEX. 
5 Variations 
The linguistic transformations for the English language
presented in this section are somewhat simplified for
the sake of conciseness. First, we focus on binary
terms that represent 91.3% of the occurrences of
multi-word terms in the English corpus [MEDIC].
Then, simplifications in the combinations of types 
of variations are motivated by corpus explorations 
in order to focus on the most productive families of 
variations. 
The 3 Dimensions of Linguistic Variations 
There are as many types of morphological re- 
lations as pairs of syntactic categories of content 
words. Since the syntactic categories of content 
words are noun (N), verb (V), adjective (A), and 
adverb (Adv), there are potentially sixteen different 
pairs of morphological links. (Associations of iden- 
tical categories must be taken into consideration. 
For example, Noun-Noun associations correspond to 
morphological links between substantive nouns such 
as agent/process: promoter/promotion.) Morphological
relations are further divided into simple relations if
they associate two words in the same position and
crossed relations if they associate a head
word and an argument. Combining categories and 
positions, there are, in all, 64 different types of mor- 
phological relations. 
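
The count of 64 can be checked by enumeration. The sketch below assumes that the total is obtained as 16 category pairs times 4 position pairs (head or argument in the source, head or argument in the target); the text gives only the total, so this decomposition is an assumption, as are the variable names.

```python
from itertools import product

CATEGORIES = ["N", "V", "A", "Adv"]  # content-word categories
POSITIONS = ["Head", "Arg"]          # positions in a binary term

# Ordered (source, target) pairs of categories, identical pairs included.
category_pairs = list(product(CATEGORIES, repeat=2))
# Ordered (source, target) pairs of positions.
position_pairs = list(product(POSITIONS, repeat=2))
# One morphological relation type per combination (assumed decomposition).
relation_types = list(product(category_pairs, position_pairs))


def is_simple(position_pair):
    """Simple relation: same position on both sides; otherwise crossed."""
    return position_pair[0] == position_pair[1]


print(len(category_pairs), len(relation_types))
```

Under this reading, half of the 64 types are simple relations and half are crossed relations.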
In (Hamon et al., 1998), three types of semantic 
relations are studied: a link between the two head 
words, a link between the two arguments, or two 
parallel links between heads and arguments. These 
authors report that double links are rare and that 
their quality is low. They only represent 5% of the 
semantic variations on a French corpus and they are 
extracted with a precision of 9% only. We will there- 
fore focus on single semantic links. Since we are only 
concerned with synonyms, only two types of seman- 
tic links are studied: synonymous heads or synony- 
mous arguments. 
The last dimension of term variability is the 
structural transformation at the syntagmatic 
level. The source structure of the variation must 
match a term structure. There are basically two
structures of binary terms: $X_1 N_2$ compounds in
which $X_1$ is a noun, an adjective or a participle,
and $N_1$ Prep $N_2$ terms. According to (Jacquemin et
al., 1997), there are three types of syntactic
variations in French: coordinations (Coor), insertions
of modifiers (Modif), and compounding/decompounding
(Comp). Each of these syntactic variations is further
subdivided into finer categories.
Multi-dimensional Linguistic Variations 
The overall picture of term variations is obtained by 
combining the 64 types of morphological relations, 
the two types of semantic relations and the three 
types of syntactic variations (and their sub-types). 
There are different constraints on these combina- 
tions that limit the number of possible variations: 
1. Morphological and semantic links must operate 
on different words. For example, if the head 
word is transformed by a morphological link, 
the only word available for a semantic link is 
the argument word. 
2. The target syntactic structure must be com- 
patible with the morphological transformations. 
For example, if a noun is transformed into 
a verb, the target structure must be a verb 
phrase. 
These two constraints influence the way in which 
a variation can be defined by combining different 
types of elementary modifications. Firstly, lexical 
relations are defined at the paradigmatic level: mor- 
phological links, semantic links or identical words. 
Then a syntactic structure that is compatible with 
the categories of the target words is chosen. 
The list of variations used for binary compound terms
in English is given in Table 1.^3 It has been
experimentally refined through a progressive
corpus-based tuning. The Synt column gives the target
syntactic structure. The Morph column describes the
morphological link: a source and a target syntactic
category and the syntactic positions of the source and
target lemmas. The Sem column indicates whether the
variation involves a semantic link and the position of
the lemmas concerned by the link (both lemmas must have
an identical position). The Pattern column gives the
target syntactic structure as a function of the source
structure which is either $X_1 N_2$, $A_1 N_2$, or
$N_1 N_2$.

^3 Punctuations are noted Pu and coordinating
conjunctions CC.
For example, Variation #42 transforms an
Adjective-Noun term $A_1 N_2$ into
$$N_1\ ((\mathrm{CC}\ \mathrm{Det}?)?\ \mathrm{Prep}\ \mathrm{Det}?\ (A|N|\mathrm{Part})^{0\text{-}3})\ N'_2$$
$N_1$ is a noun in the morphological family of $A_1$
(noted $F_M(A_1)_N$) and $N'_2$ is semantically related
with $N_2$ (noted $F_S(N_2)$). This variation recognizes
malignancy in orbital tumours as a variant of malignant
tumor because malignancy and malignant are
morphologically related, tumour and tumor are
semantically related, and malignancy$_N$ in$_{Prep}$
orbital$_A$ tumours$_N$ matches the target pattern.
Variation #56 is a more elaborate version of variation
(2) given in Section 2.
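
The way Variation #42 combines a target pattern with the two kinds of paradigmatic links can be sketched as follows. This is a simplified illustration, not the FASTR metarule: the family tables `FM` and `FS` are toy dictionaries holding only the links mentioned in the example, the candidate is assumed to be already lemmatized and tagged, and the function name `is_variant_42` is hypothetical.

```python
import re

# Target pattern of Variation #42 over part-of-speech tags:
# N ((CC Det?)? Prep Det? (A|N|Part){0,3}) N
PATTERN_42 = re.compile(r"N ((CC )(Det )?)?Prep (Det )?((A|N|Part) ){0,3}N")

# Toy families (assumed data, restricted to the example in the text).
FM = {"malignant": {"malignant", "malignancy"}}  # morphological family
FS = {"tumor": {"tumor", "tumour"}}              # semantic family


def is_variant_42(term, candidate):
    """Check whether `candidate` is a #42 variant of the Adjective-Noun `term`.

    Both arguments are lists of (lemma, tag) pairs.
    """
    if [tag for _lemma, tag in term] != ["A", "N"]:
        return False
    a1, n2 = term[0][0], term[1][0]
    tags = " ".join(tag for _lemma, tag in candidate)
    if PATTERN_42.fullmatch(tags) is None:
        return False
    first, last = candidate[0][0], candidate[-1][0]
    # Morphological link on the first noun, semantic link on the last noun.
    return first in FM.get(a1, set()) and last in FS.get(n2, set())


term = [("malignant", "A"), ("tumor", "N")]
variant = [("malignancy", "N"), ("in", "Prep"),
           ("orbital", "A"), ("tumour", "N")]
print(is_variant_42(term, variant))
```

The syntactic test and the lexical tests are independent, which mirrors the separation between the syntagmatic and paradigmatic levels.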
Sample Syntactico-semantic Variants from [MEDIC]
The first 36 variations in Table 1 do not contain 
any morphological link. They are built as follows. 
Firstly, the different structures of noun phrases are 
used as target structures. Twelve structures are pro- 
posed: head coordination (#1), argument coordina- 
tion (#4), enumeration with conjunction (#7), enu- 
meration without conjunction (#10), etc. 
Then each transformation is enriched with ad- 
ditional semantic links between the head words 
or between the argument words. Semantic links 
between argument words are found in variations 
$\#(3n+2)_{0 \leq n \leq 11}$ and between head words in
variations $\#(3n)_{1 \leq n \leq 12}$. (Due to the lack
of space, only variations #2 and #3 constructed on top
of variation #1 are shown in Table 1.) Sample variants
from [MEDIC] for the first 36 variations are given in
Table 2. Some variations have not matched any variant
in the whole corpus.
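
The numbering scheme of the first 36 variations can be written out explicitly. In this small sketch, the index ranges are read as 0 ≤ n ≤ 11 and 1 ≤ n ≤ 12, and the numbering of the twelve base structures as #(3n + 1) is inferred from the examples (#1, #4, #7, #10) rather than stated in the text; the variable names are illustrative.

```python
# Twelve base structures without semantic link (inferred numbering).
base = [3 * n + 1 for n in range(0, 12)]
# Same structures with a semantic link between argument words.
arg_links = [3 * n + 2 for n in range(0, 12)]
# Same structures with a semantic link between head words.
head_links = [3 * n for n in range(1, 13)]

print(base)
print(arg_links)
print(head_links)
```

Together the three lists cover the numbers 1 to 36 exactly once, so each base structure #k is followed by its two semantic refinements #(k + 1) and #(k + 2).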
Sample Morpho-syntactico-semantic 
Variants 
Morpho-syntactico-semantic variations are num- 
bered #37 to #62 in Table 1. Only 10 of the 64 
possible morphological associations are found in the 
list of morphological links: Noun to Adjective on 
arguments (#37), Adjective to Noun on arguments 
(#39), etc. Each of these variations is doubled by 
adding a semantic link between the words that are 
not morphologically related. For example, variation 
(#40) is deduced from variation (#39) by adding 
a semantic link between the head words. Sample 
variants are given in Table 3. 
Table 1: Patterns of semantic variation for terms of structure X1 N2.

#   Synt.  Morph.        Sem.  Pattern
1   Coor   --            --    X1[sin] ((A|N|Part)^{0-3} N Pu[',']? CC) N2
2   Coor   --            Arg   Fs(X1)[sin] ((A|N|Part)^{0-3} N Pu[',']? CC) N2
3   Coor   --            Head  X1[sin] ((A|N|Part)^{0-3} N Pu[',']? CC) Fs(N2)
4   Coor   --            --    X1[sin] (CC (A|N|Part)^{0-3}) N2
7   Coor   --            --    X1 (Pu (A|N|Part) Pu? CC (A|N|Part)) N2
10  Coor   --            --    X1[sin] (Pu (A|N|Part) Pu (A|N|Part) Pu? CC (A|N|Part)) N2
13  Coor   --            --    X1[sin] ((A|N|Part)^{0-3} N Pu[','] CC) N2
16  Modif  --            --    X1[sin] ((A|N|Part)^{0-3}) N2
19  Modif  --            --    X1[sin] (N Prep Det? A?) N2
22  Modif  --            --    X1[sin] (Pu[')'] (A|N|Part)?) N2
25  Modif  --            --    X1[sin] (Pu['('] CC? (A|N|Part)^{1-2} Pu[')']) N2
28  Modif  --            --    X1[sin] (Pu[','] (A|N|Part)) N2
31  Perm   --            --    N2 (V['be']|Pu['(']) X1
34  Perm   --            --    N2 (V? Prep Det? (A|N|Part)^{0-3} ((N) CC Det?)?) N1
37  Modif  N→A (Arg)     --    FM(N1)_A ((A|N|Part)^{0-3}) N2
38  Modif  N→A (Arg)     Head  FM(N1)_A ((A|N|Part)^{0-3}) Fs(N2)
39  Modif  A→N (Arg)     --    FM(A1)_N ((A|N|Part)^{0-3}) N2
40  Modif  A→N (Arg)     Head  FM(A1)_N ((A|N|Part)^{0-3}) Fs(N2)
41  Perm   A→N (Arg)     --    FM(A1)_N ((CC Det?)? Prep Det? (A|N|Part)^{0-3}) N2
42  Perm   A→N (Arg)     Head  FM(A1)_N ((CC Det?)? Prep Det? (A|N|Part)^{0-3}) Fs(N2)
43  Perm   A→N (Arg)     --    N2 ((Prep Det?)? (A|N|Part)^{0-3}) FM(A1)_N
44  Perm   A→N (Arg)     Head  Fs(N2) ((Prep Det?)? (A|N|Part)^{0-3}) FM(A1)_N
45  Modif  A→Adv (Arg)   --    FM(A1)_Adv ((A|N|Part)^{0-3}) N2
46  Modif  A→Adv (Arg)   Head  FM(A1)_Adv ((A|N|Part)^{0-3}) Fs(N2)
47  Modif  A→A (Arg)     --    FM(A1)_A ((A|N|Part)^{0-3}) N2
48  Modif  A→A (Arg)     Head  FM(A1)_A ((A|N|Part)^{0-3}) Fs(N2)
49  Modif  N→N (Head)    --    X1 ((A|N|Part)^{0-3}) FM(N2)_N
50  Modif  N→N (Head)    Arg   Fs(X1) ((A|N|Part)^{0-3}) FM(N2)_N
51  Modif  N→N (Arg)     --    FM(N1)_N ((A|N|Part)^{0-3}) N2
52  Modif  N→N (Arg)     Head  FM(N1)_N ((A|N|Part)^{0-3}) Fs(N2)
53  Perm   N→N (Head)    --    FM(N2)_N (Prep (A|N|Part)^{0-3}) N1
54  Perm   N→N (Head)    Arg   FM(N2)_N (Prep (A|N|Part)^{0-3}) Fs(N1)
55  VP     N→V (Head)    --    FM(N2)_V (Adv? Prep? (Det (N)? Prep)? Det? (A|N|Part)^{0-3}) N1
56  VP     N→V (Head)    Arg   FM(N2)_V (Adv? Prep? (Det (N)? Prep)? Det? (A|N|Part)^{0-3}) Fs(N1)
57  VP     N→V (Head)    --    N1 ((N)? V['be']?) FM(N2)_V
58  VP     N→V (Head)    Arg   Fs(N1) ((N)? V['be']?) FM(N2)_V
59  NP     N→V (Head)    --    A1 ((A|N|Part)^{0-2} ((N) Prep)?) FM(N2)_V
60  NP     N→V (Head)    Arg   Fs(A1) ((A|N|Part)^{0-2} ((N) Prep)?) FM(N2)_V
61  NP     V→N (Arg)     --    FM(V1)_N ((A|N|Part)^{0-3}) N2
62  NP     V→N (Arg)     Head  FM(V1)_N ((A|N|Part)^{0-3}) Fs(N2)
6 Evaluation 
We provide two evaluations of term variant confla- 
tion. First, we calculate precision rates through a 
manual scanning of the variants. Secondly, we eval- 
uate the numbers of variations extracted through the 
four experiments. 
Precision 
Because of the large volumes of data, only experiments
on the French corpus are evaluated. [AGRIC] + AGROVOC
produces 2,739 variations and 2,485 of them are
selected as correct. Since the number of synonym links
proposed by Word97 is higher, the number of variants
produced by [AGRIC] + Word97 is higher: 3,860, of which
3,110 are accepted after human inspection.
The two experiments produce the same set of non- 
semantic variants (syntactic and morpho-syntactic 
variants). Associated values of precision are re- 
ported in Tables 4 and 5. The semantic variations 
are divided into two subsets: "pure" semantic vari- 
ations and semantic variations involving a syntactic 
transformation and/or a morphological link. Their 
precisions are given in Tables 6 and 7. 
As far as precision is concerned, these tables show
that variations are divided into two levels of quality.
On the one hand, syntactic, morpho-syntactic and pure
semantic variations are extracted with a high level of
precision (above 78%, see the "Total" values in Tables
4 to 6). On the other hand, the combination of semantic
links with syntax or with morphology results in poor
precision (55% precision on average with the AGROVOC
semantic links and 29.4% precision with the Word97
links, see line "Total" in Table 7).
The lower precision of hybrid variations is due to a
cumulative effect of semantic shift through combined
variations. For instance, former un réseau continu
(build a continuous network) is incorrectly extracted
as a variant of formation permanente (continuing
education) through a Noun-to-Verb variation with a
semantic link between argument words. The verb former
and the associated deverbal noun formation are two
polysemous words. In formation permanente, the meaning
is related to a human activity (to train) while, in
former un réseau continu, the meaning is related to a
physical construction (to build).
Despite the relatively poor precision of hybrid
variations, the average precision of term conflation is
high because hybrid variations only represent a small
fraction of term variations (5.4% and 0.9%, see lines
"% sem" in Table 8 below). The average precision on
[AGRIC] + Word97 is 79.8% and the average precision on
[AGRIC] + AGROVOC is 91.1%.
The exploitation of semantic links extracted from
WordNet in term variant extraction does not suffer from
the problem of ambiguity pointed out for query
expansion in (Voorhees, 1998). The robustness to
polysemy is due to the fact that we are dealing with
multiword terms that build restricted linguistic
contexts in which words are disambiguated.

Table 2: Sample variants from [MEDIC] using the variations from Table 1 (#1 to #36).

#   Term                        Variant
1   cell differentiation        cell growth and differentiation
2   primary response            basal secretory activity and response
3   pressure decline            pressure rise and fall
4   adipose tissue              adipose or fibroadipose tissue
5   extensive resection         wide or radical resection
6   clinical test               clinical and histologic examinations
7   adipic acid                 adipic, suberic and sebacic acids
8   morphological change        morphologic, ultrastructural and immunologic changes
9   clinical test               clinical, radiographic, and arthroscopic examination
10  electrical property         electrical, mechanical, thermal and spectroscopic properties
12  hypothesis test             hypothesis, comparability, randomized and non-randomized trials
16  acidic protein              acidic epidermal protein
17  absorbed dose               ingested human doses
18  cylindrical shape           cylindrical fiberglass cast
19  assisted ventilation        assisted modes of mechanical ventilation
20  genetic disease             hereditary transmission of the disease
21  early pregnancy             early stage of gestation
22  intertrochanteric fracture  intertrochanteric ) femoral fractures
25  arteriovenous fistula       arteriovenous (AV) fistulas
27  pressure measurement        pressure (SBP) measure
28  identification test         identification, sensory tests
29  electrical stimulus         electric, acoustic stimuli
31  combined treatment          treatments were combined
32  genetic disease             disease is familial
33  increased dose              dosage was increased
34  acrylonitrile copolymer     copolymer of acrylonitrile
35  development area            areas of growth
36  cell death                  destruction of the virus-infected cell

Table 3: Sample variants from [MEDIC] using the variations from Table 1 (#37 to #62).

#   Term                        Variant
37  cell component              cellular component
38  work place                  workable space
39  embryonic development       embryo development
40  angular measurement         angles measure
41  deficient diet              deficiency in the diet
42  malignant tumor             malignancy in orbital tumours
43  cerebral cortex             cortex of the cerebrum
44  surgical advancement        advance in middle ear surgery
45  inappropriate secretion     inappropriately high TSH secretion
46  genetic variant             genetically determined variance
47  fatty meal                  fat meals
48  optical system              optic Nd-YAG laser unit
49  drug addiction              drug addicts
50  simultaneous measurement    concurrent measures
51  saline solution             salt solution
52  flow limit                  airflow limitation
53  bile reflux                 flux of bile
55  measurement technique       measuring technique
57  age estimation              estimating gestational age
58  density measurement         measured COHb concentrations
59  blood coagulation           blood coagulated
60  concentration measurement   density was measured
61  combined treatment          combination treatment

Table 4: Precision of syntactic variant extraction ([AGRIC] corpus).

Coor    Modif   Comp    Total
97.2%   88.7%   98.0%   95.7%

Table 5: Precision of morpho-syntactic variant extraction ([AGRIC] corpus).

A to N  N to A  N to N  N to V  Total
68.5%   69.6%   92.1%   75.3%   84.6%

Table 6: Precision of semantic variant extraction ([AGRIC] corpus).

           Word97  AGROVOC
Sem Arg    76.3%   88.9%
Sem Head   82.7%   91.3%
Total      78.1%   91.0%

Table 7: Precision of semantico-syntactic variant extraction ([AGRIC] corpus).

               Word97  AGROVOC
Coor + sem     44.8%   62.6%
Modif + sem    55.6%   87.5%
A to N + sem   44.9%   0.0%
N to A + sem   21.3%   0.0%
N to N + sem   0.0%    60.0%
N to V + sem   24.2%   44.4%
Total          29.4%   55.0%

Numbers of Variants
Table 8 shows the numbers of term variants extracted by
the four experiments. For each experiment and for each
type of variation, three values are reported: the
number of variants $v$ of this type and two percentages
indicating the ratio of these variants. The first
percentage is $\frac{v}{V}$ in which $V$ is the total
number of variants produced by this experiment. The
second percentage is $\frac{v}{V+T}$ in which $T$ is
the number of (non-variant) term occurrences extracted
by this experiment.
The last line of Table 8 shows that variants represent
a significant proportion of term occurrences (from
27.3% to 37.3%). The distribution of the different
types of variants depends on the semantic database and
on the language under study. WordNet 1.6 is a
productive source of knowledge for the extraction of
semantic variants: in the experiment [MEDIC] + WordNet,
semantic variants represent 58.6% of the variants,
while they only represent 4.9% of the variants in the
[AGRIC] + AGROVOC experiment. These values are reported
in the line "Tot. Sem" of Table 8. Such results confirm
the relevance of non-specialized semantic links in the
extraction of specialized semantic variants (Hamon et
al., 1998).
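
The two percentages reported for each type of variation can be computed as in the short sketch below. It assumes that the first ratio is v/V and that the second is v/(V+T), that is, the share of this type of variant among all term occurrences, variants included; the second formula is a reading of the text rather than a quoted definition. The function name and the counts used in the example are hypothetical and are not taken from Table 8.

```python
def variant_ratios(v, V, T):
    """Return (share among variants, share among all term occurrences).

    v: number of variants of one type
    V: total number of variants produced by the experiment
    T: number of non-variant term occurrences in the experiment
    """
    return v / V, v / (V + T)


# Hypothetical counts, for illustration only.
share_of_variants, share_of_occurrences = variant_ratios(v=50, V=200, T=800)
print(f"{share_of_variants:.1%} of variants, "
      f"{share_of_occurrences:.1%} of term occurrences")
```

With v = V, the second ratio gives the overall proportion of variants among term occurrences, which is the quantity discussed for the last line of the table.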
7 Conclusion 
The model proposed in this study offers a simple 
and generic framework for the expression of com- 
plex term variations. The evaluation proposed at 
the end of this paper shows that term variations are 
extracted with an excellent precision for the three 
types of elementary variations: syntactic, morpho- 
syntactic and semantic variations. The best perfor- 
mance is obtained with WordNet as source of seman- 
tic knowledge. Ongoing work on German, Japanese 
and Spanish shows that such a transformational and
paradigmatic description of term variability applies
to languages other than French and English, the two
languages reported in this study.
Acknowledgement 
We would like to thank Jean Royauté and Xavier Polanco
(INIST-CNRS) for their helpful collaboration. We are
also grateful to Béatrice Daille (IRIN) for running her
termer ACABIT on the data and to Olivier Ferret (LIMSI)
for the Word97 macro-function used to extract the
thesaurus.

References 
AGROVOC. 1995. Thésaurus Agricole Multilingue.
Organisation des Nations Unies pour l'Alimentation et
l'Agriculture, Roma.
CELEX. 1998.
www.ldc.upenn.edu/readme_files/celex.readme.html.
Consortium for Lexical Resources, UPenn.
Christiane Fellbaum, editor. 1998. WordNet: An 
Electronic Lexical Database. MIT Press, Cam- 
bridge, MA. 
Thierry Hamon, Adeline Nazarenko, and Cécile Gros.
1998. A step towards the detection of semantic variants
of terms in technical documents. In Proceedings,
COLING-ACL'98, pages 498-504, Montreal.
Christian Jacquemin, Judith L. Klavans, and Eve- 
lyne Tzoukermann. 1997. Expansion of multi- 
word terms for indexing and retrieval using mor- 
phology and syntax. In ACL - EACL'97, pages 
24-31, Madrid. 
MULTEXT. 1998. www.lpl.univ-aix.fr/projects/multext/.
Laboratoire Parole et Langage, Aix-en-Provence.
Gerard Salton and Michael J. McGill. 1983. In- 
troduction to Modern Information Retrieval. Mc- 
Graw Hill, New York, NY. 
Stuart N. Shieber. 1992. Constraint-Based For- 
malisms. A Bradford Book. MIT Press, Cam- 
bridge, MA. 
Karen Sparck Jones and John I. Tait. 1984. Auto- 
matic search term variant generation. Journal of 
Documentation, 40(1):50-66. 
K. Vijay-Shanker. 1992. Using descriptions of trees 
in a Tree Adjoining Grammar. Computational 
Linguistics, 18(4):481-518, December. 
Ellen M. Voorhees. 1998. Using WordNet for text
retrieval. In Christiane Fellbaum, editor, WordNet: An
Electronic Lexical Database, pages 285-303. MIT Press,
Cambridge, MA.
