Improving Automatic Indexing through Concept Combination 
and Term Enrichment 
Christian Jacquemin* 
LIMSI-CNRS 
BP 133, F-91403 ORSAY Cedex, FRANCE 
jacquemin@limsi.fr
Abstract 
Although indexes may overlap, the output of
an automatic indexer is generally presented as
a flat and unstructured list of terms. Our
purpose is to exploit term overlap and
embedding so as to yield a substantial qualitative
and quantitative improvement in automatic
indexing through concept combination. The
increase in the volume of indexing is 10.5% for
free indexing and 52.3% for controlled indexing.
The resulting structure of the indexed corpus is
a partial conceptual analysis.
1 Overview 
The method, proposed here for improving au- 
tomatic indexing, builds partial syntactic stru- 
ctures by combining overlapping indexes. It is 
complemented by a method for term acquisition 
which is described in (Jacquemin, 1996). The 
text, thus structured, is reindexed; new indexes 
are produced and new candidates are discove- 
red. 
Most NLP approaches to automatic indexing 
concern free indexing and rely on large-scale 
shallow parsers with a particular concern for 
dependency relations (Strzalkowski, 1996). For 
the purpose of controlled indexing, we exploit 
the output of an NLP-based indexer and the
structural relations between terms and variants in
order to (1) enhance the coverage of the in- 
dexes, (2) incrementally build an a posteriori 
conceptual analysis of the document, and, (3) 
interweave controlled indexing, free indexing, 
and thesaurus acquisition. These 3 goals are 
achieved by CONPARS (CONceptual PARSer), 
presented in this paper and illustrated by
Figure 1. CONPARS is based on the output of
a part-of-speech tagger for French described in
(Tzoukermann and Radev, 1997) and FASTR,
a controlled indexer (Jacquemin et al., 1997).
All the experiments reported in this paper are
performed on data in the agricultural domain:
[AGRIC], a 1.18-million-word corpus; [AGROVOC],
a 10,570-term controlled vocabulary; and
[AGR-CAND], a 15,875-term list acquired by
ACABIT (Daille, 1997) from [AGRIC].

* We thank INIST-CNRS for providing us with thesauri
and corpora in the agricultural domain and AFIRST for
supporting this research through the SKETCHI project.
Figure 1: Overall Architecture of CONPARS
(diagram not reproduced; surviving label: "Augmented indexing")
2 Basic Controlled Indexing 
The preprocessing of the corpus by the tag- 
ger yields a morphologically analyzed text, 
with unambiguous syntactic categories. Then, 
the tagged corpus is automatically indexed by 
FASTR which retrieves occurrences of multi- 
word terms or variants (see Table 1). 
Table 1: Indexing of a Sample Sentence

La variation mensuelle de la respiration du sol et
ses rapports avec l'humidité et la température du
sol ont été analysées dans le sol superficiel d'une
forêt tropicale. (The monthly variation of the
respiration of the soil and its connections with the
moisture and the temperature of the soil have been
analyzed in the surface soil of a tropical forest.)

Index  Identifier  Term                Variation      Occurrence in the sentence
i1     007019      Respiration du sol  Occurrence     respiration du sol (respiration of the soil)
i2     002904      Sol de forêt        Embedding2     sol superficiel d'une forêt (surf. soil of a forest)
i3     012670      Humidité du sol     Coordination1  humidité et la température du sol (moisture and the temperature of the soil)
i4     007034      Température du sol  Occurrence     température du sol (temperature of the soil)
i5     007035      Analyse de sol      VerbTransf1    analysées dans le sol (analyzed in the soil)
i6     007809      Forêt tropicale     Occurrence     forêt tropicale (tropical forest)
Each variant is obtained by generating term
variations through local transformations
composed of an input lexico-syntactic structure
and a corresponding output transformed
structure. Thus, VerbTransf1 is a verbalization which
transforms a Noun-Preposition-Noun term into
a verb phrase represented by the variation
pattern V4 (Adv? (Prep? Art | Prep) A?) N3 (see footnote 1):

\[
\mathrm{VerbTransf1}(N_1\ \mathrm{Prep}_2\ N_3)
  = V_4\ (\mathrm{Adv}?\ (\mathrm{Prep}?\ \mathrm{Art} \mid \mathrm{Prep})\ A?)\ N_3
  \quad \{\mathrm{MorphFamily}(N_1) = \mathrm{MorphFamily}(V_4)\}
  \tag{1}
\]

The constraint following the output structure
states that V4 belongs to the same morphological
family as N1, the head noun of the term.
VerbTransf1 recognizes analysées[V] dans[Prep]
le[Art] sol[N] (analyzed in the soil) as a variant
of analyse[N] de[Prep] sol[N] (soil analysis).
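The matching step of rule (1) can be sketched as follows. This is only an illustration under stated assumptions, not the FASTR implementation: the sentence is assumed to arrive as (word, tag) pairs, the `MORPH_FAMILY` lexicon and the function name `verb_transf1` are hypothetical, and N3 is compared by surface form rather than by lemma.

```python
import re

# Hypothetical toy lexicon: word form -> morphological family label.
MORPH_FAMILY = {"analyse": "analys", "analysées": "analys"}

# Output side of rule (1), V4 (Adv? (Prep? Art | Prep) A?) N3, written as a
# regular expression over tag names that are each followed by one space.
VERB_TRANSF1 = re.compile(r"V (?:Adv )?(?:(?:Prep )?Art |Prep )(?:A )?N ")


def verb_transf1(term, tagged):
    """Return (start, end) word spans of `tagged` matching rule (1) for `term`.

    term   -- (N1, Prep2, N3) triple, e.g. ("analyse", "de", "sol")
    tagged -- list of (word, tag) pairs for one sentence
    """
    n1, _prep2, n3 = term
    family = MORPH_FAMILY.get(n1)
    spans = []
    for start in range(len(tagged)):
        tags = "".join(tag + " " for _, tag in tagged[start:])
        match = VERB_TRANSF1.match(tags)
        if match is None:
            continue
        end = start + match.group(0).count(" ")
        verb, noun = tagged[start][0], tagged[end - 1][0]
        # Constraint of rule (1): the verb and N1 share a morphological family.
        # Comparing N3 by surface form is a simplification of this sketch.
        if family is not None and MORPH_FAMILY.get(verb) == family and noun == n3:
            spans.append((start, end))
    return spans


tagged = [("ont", "V"), ("été", "V"), ("analysées", "V"), ("dans", "Prep"),
          ("le", "Art"), ("sol", "N"), ("superficiel", "A")]
print(verb_transf1(("analyse", "de", "sol"), tagged))
```

On this fragment the only span returned covers analysées dans le sol; the auxiliaries ont and été are rejected because no Prep or Art follows them directly.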
Six families of term variations are accounted 
for by our implementation for French: coordina- 
tion, compounding/decompounding, term em- 
bedding, verbalization (of nouns or adjectives), 
nominalization (of nouns, adjectives, or verbs), 
and adjectivization (of nouns, adjectives, or 
verbs). Each index in Table 1 corresponds to
a unique term; it is referenced by its identifier,
its string, and a unique variation of one of the
aforementioned types (or a plain occurrence).

1 The following abbreviations are used for the
categories: V = verb, N = noun, Art = article, Adv = adverb,
Conj = conjunction, Prep = preposition, Punc = punctuation.
3 Conceptual Phrase Building 
The indexes extracted at the preceding step are 
text chunks which generally build up a correct 
syntactic structure: verb phrases for verbaliza- 
tions and, otherwise, noun phrases. When over- 
lapping, these indexes can be combined and re- 
placed by their head words so as to condense 
and structure the documents. This process is 
the reverse operation of the noun phrase decom- 
position described in (Habert et al., 1996). 
The purpose of automatic indexing entails the 
following characteristics of indexes: 
• frequently, indexes overlap or are embedded
one in another (with [AGR-CAND],
35% of the indexes overlap with another
one and 37% of the indexes are embedded
in another one; with [AGROVOC], the
rates are respectively 13% and 5%),
• generally, indexes cover only a small
fraction of the parsed sentence (with
[AGR-CAND], the indexes cover, on average, 15%
of the surface; with [AGROVOC], the
average coverage is 3%),
• generally, indexes do not correspond to 
maximal structures and only include part 
of the arguments of their head word. 
Because of these characteristics, the construc- 
tion of a syntactic structure from indexes is like 
solving a puzzle with only part of the clues, and 
with a certain overlap between these clues. 
Text Structuring 
The construction of the structure consists of the 
following 3 steps: 
Step 1. The syntactic head of terms is
determined by a simple noun phrase grammar of the
language under study. For French, the following
regular expression covers 98% of the term
structures in the database [AGROVOC] (Mod is any
adjectival modifier and the syntactic head is the
first noun N):

    Mod* N N? (Mod | (Prep Art? Mod* N N? Mod*))*
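A minimal sketch of how this expression can locate the head of a term, assuming the term is already tagged with the categories Mod, N, Prep, and Art. The function name `syntactic_head` and the tagged-term representation are illustrative choices, not part of CONPARS.

```python
import re

# Step 1 pattern over tag names that are each followed by one space. The
# named group captures the pre-modifiers, so its length gives the position
# of the head noun (the first N).
NP = re.compile(
    r"(?P<pre>(?:Mod )*)N (?:N )?"
    r"(?:Mod |Prep (?:Art )?(?:Mod )*N (?:N )?(?:Mod )*)*$"
)


def syntactic_head(tagged_term):
    """Return the head word of a tagged term, or None if the pattern fails."""
    tags = "".join(tag + " " for _, tag in tagged_term)
    match = NP.match(tags)
    if match is None:
        return None
    head_position = match.group("pre").count(" ")
    return tagged_term[head_position][0]


print(syntactic_head([("sol", "N"), ("de", "Prep"), ("forêt", "N")]))
print(syntactic_head([("importante", "Mod"), ("activité", "N")]))
```

The two calls print sol and activité: post-modifiers and prepositional arguments never displace the first noun as head.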
The second source of knowledge about
syntactic heads is embodied in transformations. For
instance, the syntactic head of the verbalization
in (1) is the verb V4.
Step 2. A partial relation between the indexes 
of a sentence is now defined in order to rank 
in priority the indexes that should be grouped 
first into structures (the most deeply embedded 
ones). This definition relies on the relative spa- 
tial positions of two indexes i and j and their 
syntactic heads H(i) and H(j): 
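Definition 3.1 (Index priority) Let i and j
be two indexes in the same sentence. The
relative priority ranking of i and j is:

\[
i \,\mathcal{R}\, j \;\Leftrightarrow\; (i = j)
  \;\lor\; \bigl(H(i) = H(j) \land i \subseteq j\bigr)
  \;\lor\; \bigl(H(i) \neq H(j) \land H(i) \in j \land H(j) \notin i\bigr)
\]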
This relation is obviously reflexive. It is
neither transitive nor antisymmetric. It can,
however, be shown that this relation is not cyclic for
3 elements: $i \,\mathcal{R}\, j \land j \,\mathcal{R}\, k \Rightarrow \neg(k \,\mathcal{R}\, i)$. (This
property is not demonstrated here, due to
lack of space.)
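Definition 3.1 translates almost directly into code. The sketch below assumes that an index is represented by the set of word positions it covers plus the position of its head; the `Index` class, the `span` helper, and the tokenization of the sample sentence (with l' and d' as separate tokens, numbered from 0) are assumptions of this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    name: str
    words: frozenset  # positions of the words covered by the index
    head: int         # position of the syntactic head H(i)


def span(name, first, last, head):
    """Build an index covering word positions first..last (inclusive)."""
    return Index(name, frozenset(range(first, last + 1)), head)


def has_priority(i, j):
    """True when i R j in the sense of Definition 3.1."""
    if i == j:
        return True
    if i.head == j.head:
        return i.words <= j.words          # H(i) = H(j) and i included in j
    return i.head in j.words and j.head not in i.words


# The six indexes of Table 1, positioned in the sample sentence.
INDEXES = [
    span("i1", 5, 7, 5),     # respiration du sol
    span("i2", 24, 28, 24),  # sol superficiel d' une forêt
    span("i3", 13, 18, 13),  # humidité et la température du sol
    span("i4", 16, 18, 16),  # température du sol
    span("i5", 21, 24, 21),  # analysées dans le sol
    span("i6", 28, 29, 28),  # forêt tropicale
]

pairs = sorted((a.name, b.name) for a in INDEXES for b in INDEXES
               if a != b and has_priority(a, b))
print(pairs)
```

Under these assumptions, the related pairs are (i2, i5), (i4, i3), and (i6, i2), all three by the argument-embedding clause.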
The linguistic motivations of Definition 3.1 
are linked to the composite structure built at 
Step 3 according to the relative priorities stated 
by $\mathcal{R}$. We now examine, in turn, the 4 cases of
term overlap: 
1. Head embedding: 2 indexes i and j, with
a common head word and such that i is
embedded into j, build a 2-level structure:

    [ [ ... H(i) ... ]i ... ]j    (both levels headed by H(i) = H(j))

This structuring is illustrated by nappe
d'eau (sheet of water) which combines
with nappe d'eau souterraine (underground
sheet of water) and produces the 2-level
structure [[nappe d'eau] souterraine]
([underground [sheet of water]]). (The head word
is nappe, sheet.) In this case, i has a higher
priority than j; it corresponds to
$(H(i) = H(j) \land i \subseteq j)$ in Definition 3.1.
2. Argument embedding: 2 indexes i and j,
with different head words and such that the
head word of i belongs to j and the head
word of j does not belong to i, combine as
follows:

    [ ... H(j) ... [ ... H(i) ... ]i ... ]j    (headed by H(j))

This structuring is illustrated by nappe
d'eau which combines with eau souterraine
(underground water) and produces
the structure [nappe d'[eau souterraine]]
([sheet of [underground water]]). Here, i
has a higher priority than j; it corresponds
to $(H(i) \neq H(j) \land H(i) \in j \land H(j) \notin i)$
in Definition 3.1.
3. Head overlap: 2 indexes i and j, with
a common head word and such that i
and j partially overlap, are also combined
at Step 3 by making j a substructure
of i. This combination is, however,
non-deterministic since no priority ordering is
defined between these 2 indexes. Therefore,
it does not correspond to a condition
in Definition 3.1. (The resulting tree is
headed by H(i) = H(j).)
In our experiments, this structure
corresponds to only one situation: a head
word with pre- and post-modifiers such
as importante activité (intense activity)
and activité de dégradation métabolique
(activity of metabolic degradation).
With [AGR-CAND], this configuration
is encountered only 27 times (0.1% of
the index overlaps) because premodifiers
rarely build correct term occurrences in
French. Premodifiers generally correspond
to occasional characteristics such as size,
height, rank, etc.
4. The remaining case of overlapping indexes
with different head words and reciprocal
inclusions of head words is never encountered.
Its presence would undeniably denote
a flaw in the computation of head words.
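The four cases above can be told apart mechanically from the word positions and head positions of two indexes. The following sketch assumes each index is a pair (set of word positions, head position); the function name `overlap_type` and the extra labels "disjoint" and "other overlap" (for configurations the text does not discuss) are illustrative additions.

```python
def overlap_type(i, j):
    """Name the overlap case for two indexes given as (positions, head)."""
    (words_i, head_i), (words_j, head_j) = i, j
    if not words_i & words_j:
        return "disjoint"
    if head_i == head_j:
        # Common head word: one index contains the other (case 1) or not (case 3).
        if words_i <= words_j or words_j <= words_i:
            return "head embedding"
        return "head overlap"
    i_in_j, j_in_i = head_i in words_j, head_j in words_i
    if i_in_j and j_in_i:
        return "reciprocal head inclusion"   # case 4, never observed
    if i_in_j or j_in_i:
        return "argument embedding"          # case 2
    return "other overlap"                   # not covered by the 4 cases


# nappe(0) d'(1) eau(2) souterraine(3)
nappe_d_eau = (frozenset({0, 1, 2}), 0)
nappe_d_eau_souterraine = (frozenset({0, 1, 2, 3}), 0)
eau_souterraine = (frozenset({2, 3}), 2)

print(overlap_type(nappe_d_eau, nappe_d_eau_souterraine))
print(overlap_type(eau_souterraine, nappe_d_eau))
```

The two calls print "head embedding" and "argument embedding", matching the examples given for cases 1 and 2.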
Step 3. A bottom-up structure of the sentences
is incrementally built by replacing indexes by
trees. The indexes which are ranked highest by
Step 2 are processed first according to the
following bottom-up algorithm:
1. build a depth-1 tree whose daughter nodes 
are all the words in the current sentence 
and whose head node is S, 
2. for all the indexes i in the current sentence, 
selected by decreasing order of priority, 
(a) mark all the depth-1 nodes which
are a lexical leaf of i or which are the 
head node of a tree with at least one 
leaf in i, 
(b) replace all the marked nodes by a 
unique tree whose head features are 
the features of H(i), and whose depth- 
1 leaves are all the marked nodes. 
When considering the sentence given in 
Table 1, the ordering of the indexes after Step 2 
is the following: i2 > i5, i6 > i2, and i4 > i3. 
(They all result from the argument embedding 
relation.) The algorithm yields the following 
structure of the sample sentence: 
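...la [respiration du sol] et ses rapports avec
l'[humidité et la [température du sol]] ont été
[analysées dans le [sol superficiel d'une [forêt tropicale]]]

(Each bracketed tree is represented at the level above
by its head word, so that the top level reads: ...la
respiration et ses rapports avec l'humidité ont été analysées.)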
Text Condensation 
The text structure resulting from this algorithm 
condenses the text and brings closer words that 
would otherwise remain separated by a large 
number of arguments or modifiers. Because of 
this condensation, a reindexing of the structu- 
red text yields new indexes which are not ex- 
tracted at the first step. 
Let us illustrate the gains from reindexing
on a sample utterance: l'évolution au cours du
temps du sol et des rendements (temporal
evolution of soils and productivity). At the first
step of indexing, évolution au cours du temps
(lit. evolution over time) is recognized as a
variant of évolution dans le temps (lit. evolution
with time). At the second step of indexing, the
daughter nodes of the top-most tree build the
condensed text: l'évolution du sol et des
rendements (evolution of soils and productivity):
1st step:  l'[évolution au cours du temps] du sol et des rendements
2nd step:  l'évolution du sol et des rendements
           (évolution stands for the tree évolution au cours du temps)
This condensed text allows for another index
extraction: évolution du sol et des rendements, a
Coordination variant of évolution du rendement
(evolution of productivity). This index was not
visible at the first step because of the additional
modifier au cours du temps (temporal).
(Reiterated indexing is preferable to overly
unconstrained transformations, which burden the system
with spurious indexes.)
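The effect of condensation on reindexing can be illustrated with a small sketch. The `condense` function follows the description above (each index is replaced by its head word); the `within_window` matcher is only a crude stand-in for a real variation rule such as Coordination, introduced here to show why a bounded pattern fails on the original text and succeeds on the condensed one. Both names, the window size, and the tokenization are assumptions.

```python
def condense(words, indexes):
    """Replace each index by its head word.

    indexes -- (first, last, head) word positions, inclusive; the spans are
               assumed not to overlap.
    """
    condensed, position = [], 0
    for first, last, head in sorted(indexes):
        condensed.extend(words[position:first])
        condensed.append(words[head])
        position = last + 1
    condensed.extend(words[position:])
    return condensed


def within_window(words, head, argument_forms, window):
    """Crude stand-in for a variation rule: is `head` followed, at most
    `window` tokens later, by one of `argument_forms`?"""
    for position, word in enumerate(words):
        following = words[position + 1:position + 1 + window]
        if word == head and any(w in argument_forms for w in following):
            return True
    return False


text = "l' évolution au cours du temps du sol et des rendements".split()
first_step = [(1, 5, 1)]            # évolution au cours du temps, head évolution
condensed = condense(text, first_step)
forms = {"rendement", "rendements"}

print(" ".join(condensed))
print(within_window(text, "évolution", forms, 5),
      within_window(condensed, "évolution", forms, 5))
```

With a window of five tokens, the argument rendements is out of reach in the original utterance but within reach once au cours du temps has been folded into its head.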
Both processes--text structuring, presented 
here, and term acquisition, described in (Jac- 
quemin, 1996)--reinforce each other. On the 
one hand, acquisition of new terms increases the 
volume of indexes and thereby improves text 
structuring by decreasing the non-conceptual 
surface of the text. On the other hand, text 
condensation triggers the extraction of new in- 
dexes, and thereby furnishes new possibilities 
for the acquisition of terms. 
4 Evaluation 
Quantitative evaluation: The volume of
indexing is characterized by the surface of the
text occupied by terms or their combinations;
we call it the conceptual surface. Figure 2
shows the distribution of the sentences in
relation to their conceptual surface. For instance,
in 8,449 sentences among the 62,460 sentences
of [AGRIC], the indexes occupy from 20 to 30%
of the surface (3rd column).
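A minimal sketch of the conceptual surface of one sentence, assuming that the surface is measured in words (the text does not state the unit) and that each index is given by its first and last word positions. The function name and the tokenization of the sample sentence of Table 1 into 30 tokens are assumptions.

```python
def conceptual_surface(n_words, index_spans):
    """Fraction of the words of a sentence covered by at least one index."""
    covered = set()
    for first, last in index_spans:
        covered.update(range(first, last + 1))
    return len(covered) / n_words


# Sample sentence of Table 1: 30 tokens, six indexes (i1 to i6).
spans = [(5, 7), (24, 28), (13, 18), (16, 18), (21, 24), (28, 29)]
surface = conceptual_surface(30, spans)
print(f"{surface:.0%}")
```

Overlapping and embedded indexes are counted once, since the covered positions are collected in a set.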
This figure indicates that the structures built 
from free indexing are significantly richer than 
those obtained from controlled indexing. The 
number of sentences is a decreasing exponen- 
tial function of their conceptual surface (a linear 
function with a log scale on the y axis). 
Figure 2: Conceptual Surface of Sentences
(plot not reproduced; x axis: % of conceptual surface, from 0 to 100;
y axis: number of sentences, log scale; curves: free indexing and
controlled indexing)

Figure 3 illustrates how the successive steps
of the algorithm contribute to the final size of
the incremental indexing. For each mode of
indexing two curves are plotted: the phrases
resulting from initial indexing and from
reindexing due to text condensation (circles) and
the phrases due to term acquisition (asterisks).
For instance, at step 3, free indexing yields 309
indexes and reindexing 645. The corresponding
percentages are reported in Table 2.

Table 2: Increase in the volume of indexing

             Acquisition  Condensation  Total
Controlled   49.3%        3.0%          52.3%
Free         5.8%         4.7%          10.5%
The indexing with the poorest initial volume
(controlled indexing) is the one that benefits
most from term acquisition. Thus, concept
combination and term enrichment tend to
compensate for the deficiencies of the initial term list by
extracting more knowledge from the corpus.
Figure 3: Step-by-step Number of Phrases
(plot not reproduced; x axis: step number, from 2 to 8; y axis: log scale;
curves: free indexing, free acquisition, controlled indexing,
controlled acquisition)
Qualitative evaluation: Table 3 indicates the
number of overlapping indexes in relation to
their type. It provides, for each type, the rate of
success of the structuring algorithm. This
evaluation results from a human scanning of 542
randomly chosen structures.

Table 3: Incremental Structure Building

               Head        Argument    Total
               embedding   embedding
Distribution   27.0%       73.0%       100%
# correct      128         346         474
Precision      79.0%       91.1%       87.5%
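As a worked check on the figures of Table 3 (a calculation added for illustration, using only the counts reported there), the overall precision is the number of correct structures divided by the number of structures examined:

```latex
\[
\text{Precision}_{\text{total}}
  = \frac{\#\,\text{correct}}{\#\,\text{examined}}
  = \frac{128 + 346}{542}
  = \frac{474}{542}
  \approx 87.5\%
\]
```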
5 Conclusion 
This study has presented CONPARS, a tool 
for enhancing the output of an automatic in- 
dexer through index combination and term en- 
richment. Ongoing work intends to improve the 
interaction of indexing and acquisition through 
self-indexing of automatically acquired terms. 

References 
Béatrice Daille. 1997. Study and implementation
of combined techniques for automatic extraction
of terminology. In J. L. Klavans and P. Resnik,
ed., The Balancing Act: Combining Symbolic and
Statistical Approaches to Language, p. 49-66.
MIT Press, Cambridge.
Benoît Habert, Elie Naulleau, and Adeline
Nazarenko. 1996. Symbolic word clustering for
medium size corpora. In Proceedings of
COLING'96, p. 490-495, Copenhagen.
Christian Jacquemin, Judith L. Klavans, and
Evelyne Tzoukermann. 1997. Expansion of
multi-word terms for indexing and retrieval
using morphology and syntax. In Proceedings
of ACL-EACL'97, p. 24-31.
Christian Jacquemin. 1996. A symbolic and
surgical acquisition of terms through variation.
In S. Wermter, E. Riloff, and G. Scheler, ed.,
Connectionist, Statistical and Symbolic
Approaches to Learning for NLP, p. 425-438.
Springer, Heidelberg.
Tomek Strzalkowski. 1996. Natural language
information retrieval. Information Processing
& Management, 31(3):397-417.
Evelyne Tzoukermann and Dragomir R. Radev.
1997. Use of weighted finite state transducers
in part of speech tagging. In A. Kornai, ed.,
Extended Finite State Models of Language.
Cambridge University Press.
