Dictionaries merger for text expansion in question answering
Bernard JACQUEMIN
b jacquemin@yahoo.fr
Abstract
This paper presents an original way to add new
data in a reference dictionary from several other
lexical resources, without loosing any consis-
tence. This operation is carried in order to get
lexical information classifled by the sense of the
entry. This classiflcation makes it possible to
enrich utterances (in QA: the queries) following
the meaning, and to reduce noise. An analysis
of the experienced problems shows the interest
of this method, and insists on the points that
have to be tackled.
1 Introduction
Our society is currently facing an increasing
amount of textual data, that no-one can store
up or even read. Many automatic systems are
designed to flnd a requested piece of informa-
tion. All the current systems use dictionaries
to identify data in texts or in queries. QA soft-
wares, which are particularly demanding about
data from the dictionary, have a similar mode
of working: they process an utterance (generally
the query) in order to provide the largest num-
ber of way to express the same meaning. Then
they try to flnd a match between the expanded
utterance and a text. For example, (Hull, 1999)
expands synonymically the ’signiflcant’ vocab-
ulary of the question. QUALC (Ferret et al.,
1999) adds stemming expansion prior to using
a search engine. The Falcon system (Moldovan
et al., 2000) uses some semantic relations from
WordNet when it expands the question.
In this paper, I present a way to process
dictionaries to make them consistent with the
needs of the application. I flrst describe the
lexical needs of my QA application. I secondly
outline the issue of the use of several incom-
patible dictionaries. Then I show the way I dis-
tribute information from additional dictionaries
to a reference one: synonyms, derivative forms
and taxonomy. Finally, I present the problems
and di–culties I found.
2 What the QA method needs
The QA system (Jacquemin, 2003) is based on
a matching procedure between query and text
segment. As most of the other approaches, my
methodology solves the problem of the difierent
ways to express the same idea by adding to the
utterance (’enrichments’) synonyms, derivatives
or words belonging to the same taxonomy.
My method entails two new features: First, it
uses semantic disambiguation in order to choose
the right meaning to each word in the sentences.
I notice that most of the QA systems try to
give as many enrichments as possible to a word
rather than to a meaning. The answers often
correspond to a sense difierent from the original
one. But if each enrichment has the same sense
as the original one, the noise decreases.
The fact that a semantic disambiguator needs
a large context to the word to be disambiguated
(Weaver, 1949) provides the second feature: the
query generally comprises few words. I decided
to process the documents to build an enriched
informative structure (Jacquemin, 2004). But
this feature falls outside the scope of this paper.
My semantic disambiguator (Jacquemin et
al., 2002) is an evolution of a tool previously de-
veloped for both French and English at XRCE
(Brun, 2000; Brun et al., 2001). The idea is to
use a dictionary as a tagged corpus to extract
semantic disambiguation rules. The contextual
data (syntactic, lexico-syntactic and semantico-
syntactic) for a given sense of a word are seen as
difierential indications. So when the schema is
found in the context of this word in a sentence,
the corresponding sense is assigned.
In flgure 1, we can see how a disambiguation
rule is extracted from an informative fleld of
Dubois’ French dictionary (Dubois and Dubois-
Charlier, 1997). From the instance fleld of the
entry remporter in its second sense gagner (to
win), the XIP parser (A˜‡t-Mokhtar et al., 2002)
extracts a lexico-syntactic schema: VARG[DIR]
Example from Dubois’ dictionary (entry: rem-
porter):
On remporte la victoire sur ses adversaires
(sense nb 2 : gagner)
We win a victory over our adversaries.
Dependency containing remporter:
VARG[DIR](remporter,victoire)
Corresponding disambiguation rule:
remporter:VARG[DIR](remporter,victoire)
==> sense gagner
Figure 1: Extraction of a semantic disambigua-
tion rule.
means that the argument victoire is a direct ob-
ject of the argument remporter. The rule built
from this dependency indicates that the sense
of the word remporter, in a context where vic-
toire (victory) is the direct object, is the second
sense gagner (to win). Two other types of rules
exists: the flrst type puts lexical rules into gen-
eral use, replacing lexical arguments by corre-
sponding semantic classes. The other one uses
syntactic schemas stipulated by the dictionary
(for instance: transitive, re exive, etc.).
The dictionary needs of both QA system
and semantic disambiguator are of two natures.
First, the dictionary is required to share out
data following sense and not following lemma:
The data are difierential indications. Second,
the dictionary is required to contain contextual
information as much as possible: examples or
collocations (lexical rules), semantic classes or
application domains (generalized rules), subcat-
egorisation...The Dubois’ dictionary yields to
these demands, and moreover it contains some
data that could be helpful to enrich an utter-
ance: synonyms, instructions for derivations...
3 Enrichment problems
Several expansion solutions are proposed by the
QA approaches: use of synonyms or taxonomy’s
members, stemming or use of derivatives...
Dubois’ dictionary contains some synonyms
linked with a sense of the word they are syn-
onymous with. But these synonyms are too
few to provide su–cient enrichments. The sys-
tem needs one or more synonyms dictionaries
to complete Dubois’ gaps. No synonyms dictio-
naryshares out thesynonymsbysense of theen-
try, except EuroWordNet (Catherin, 1999). But
EuroWordNet’s sharing out into senses does not
match Dubois’ senses. Thus the question is to
distribute the available synonyms of each word
to the right sense in Dubois’.
The stemming, which considers two words
with the same stem nearly synonyms, is too un-
predictable to be used in a methodology that
tries to avoid noise. As Dubois’ provides in-
structions to form derivatives from lemma and
su–xes for some senses, the derivation is pre-
ferred to the stemming. But the instructions are
often vague, and indicate only the su–x to use
and the new part-of-speech. It is not su–cient
to be used automatically. Thus the derivation
procedure needs an extra tool able to propose
derivatives, including the right one. Dubois’ in-
formation is su–cient to fllter and classify them.
Finally, Dubois’ does not provide taxonomy,
and the French resources containing a seman-
tic hierarchy do not supply contextual informa-
tion. The taxonomy has to be found in another
resource, which is not consistent with the ref-
erence dictionary. The compatibility between
senses of all these resources is the objective.
4 How to make the dictionaries
compatible
The main di–culty is to share out information
collected from extra dictionaries. The dictio-
naries are incompatible with Dubois’, but new
data have to be distributed following the senses
of the entries of the reference dictionary.
4.1 Synonyms
Three resources are at my disposal: Bailly’s
dictionary (Bailly, 1947), an electronic dictio-
nary designed by Memodata, and the French
EuroWordNet (Catherin, 1999). The expansion
methods commonly use all the available syn-
onyms for a word, but my approach has to keep
only the synonyms corresponding to the current
sense of the word. For each considered sense for
a word, Dubois’ provides semantic features: a
semantic class and an application domain.
The synonyms from the extra dictionaries are
proposals. A proposal for a lemma in Dubois’
dictionary is kept for a given sense only if one
sense at least of the Dubois’ entry correspond-
ing to the proposal matches the semantic fea-
tures of the given sense. If no sense of the pro-
posal matches the semantic features of the given
sense, the proposal is rejected for this sense.
In flgure 2, the problem is to determine which
proposal matches the word ravir in the sense nb
2 voler (to steal). The semantic features of this
Semantic features of the entry:
Domain Class
ravir (2) SOC S4
Semantic features of synonyms:
Domain Class
charmer PSY P2
voler SOC S4
Figure 2: Selection of the synonyms.
sense are the application domain SOC (sociol-
ogy) and the semantic class S4 (to grip, to own).
The proposal charmer, which features are PSY
(psychology) and P2 (psychological verb) does
not match the features of ravir 2. The proposal
d¶erober in its second and fourth senses has the
same features. This proposal is conflrmed for
ravir in sense nb 2. It will be used as enrichment
when the sense nb 2 of ravir is detected in an
utterance by the semantic disambiguator. This
procedure is applied for all the proposed syn-
onyms for all the senses of each entry in Dubois’.
4.2 Derivatives
The derivation fleld in the Dubois’ provides suf-
flcient indications to recognize the stipulated
derivatives of an entry in a determined sense.
Thus, the need is a resource or a tool provid-
ing all the potential derivative from a word.
Resources are rare and incomplete for French,
but I have to my disposal a tool (Gaussier et
al., 2000) able to construct su–xal derivatives
from a word. If the only constraint requires the
derivatives belong to the lexicon, all the right
su–xal forms are provided among the incor-
rect proposals. When all the proposals are pro-
duced, the su–x of each proposal is compared
with the instructions in the dictionary. When
they match, the derivative is kept for the cur-
rent sense. If not, the derivative is rejected.
Derivatives for the verb couper:
Proposed Instruction Retained
derivatives sense nb 1 derivatives
coupure ure coupure
coupable | removed
coupage (age sense 5) removed
coupeur eur coupeur
coupant ant coupant
... ... ...
Figure 3: Selection of the derivatives.
Theflgure3showstheworkingofthemethod.
For the verb couper in the sense trancher (to
cut, to slice), Dubois’ indicates derivatives with
su–xes -ure: coupure (break), -ant: coupant
(sharp) and -eur: coupeur (cutter). But no in-
struction is given for a su–x -able. The wrong
derivative coupable (guilty) is rejected.
4.3 Taxonomy
Only two resources containing taxonomy exist
for French. AlethDic (Gsi, 1993) is known for
its very bad quality. The hierarchy is neither
very deep, nor very large. The semantic rela-
tions are not strictly deflned inside the hierar-
chy. Because of this, I rejected AlethDic.
The other resource is EuroWordNet. Two
kind of taxonomic relations are deflned: hyper-
onymy (and hyponymy), and meronymy (and
holonymy). The other semantic relations of this
resource fall outside the scope of this paper.
The taxonomic relations link synsets to-
gether. The synsets contain synonymous words
for at least one of their senses. The taxonomy is
usable by the QA system only if the sense of the
whole synset can be identifled, and if the sense
matches at least one of the sense of the word
under consideration in Dubois’ dictionary.
So each word in Dubois’ has to be linked with
a synset to be inserted into a taxonomic hierar-
chy. That amounts to match senses in Dubois’
and synsets in EuroWordNet. We already have
some senses in Dubois’ matching sets of syn-
onyms in EuroWordNet. It is easy to use the
additional synonyms from EuroWordNet to set
up a correspondence between sense of Dubois’
dictionary and synsets of EuroWordNet.
The procedure is to examine all the synsets
whereaconsideredwordappears. Foreachofits
sense, if the majority of the synonyms obtained
from EuroWordNet are contained in a synset,
the meaning illustrated by the synset and this
sense of the word are considered to be equiva-
lent. In this case, the word under consideration
is inserted in this place into the taxonomic hi-
erarchy. Otherwise, the synset is not seen to
match the sense, and it is rejected.
5 Experienced problems
The di–culties met difier for each kind of pro-
cessed data. In the sharing out of the synonyms,
the system cannot determine automatically the
meaning of a multiword expression. Dubois’
only lists single words, and no semantic feature
can be allocated to a multiword expression. A
multiword proposal is considered to have all the
meaning of the word to which it is synonym.
I have no real evaluation of this procedure:
the division into senses of the reference dictio-
nary is as always open to doubt. Considering
the result, examiners never agree with each syn-
onyms for a sense. But when they agree (three
examiners where consulted), they where satis-
fled by more than 80% of the synonyms.
The derivation tool provides nearly all the
derivatives from a word when no constraint is
deflned. Most of the wrong derivatives (about
97%) are screened by the instructions supplied
by Dubois’ dictionary. However, these flgure
are not valid for short words: the tool is desig-
nated in such a way that derivatives with a rad-
ical shorter than 3 letters are generally wrong.
Moreover, the instructions are often incomplete
in the dictionary, above all nominal entries.
The promising procedure using taxonomy,
presented above, is still a suggestion. I am fac-
ing with the problem that EuroWordNet cov-
ers only a small part of the French lexicon. A
more proper trial should use WordNet (Fell-
baum, 1998), that covers a huge part of the En-
glish lexicon, in an English QA application.
6 Conclusion
In this paper, I present an original method to
merge several dictionaries in such a way that
all the informative flelds become semantically
consistent. This need comes from an expansion
method for QA, which uses as enrichment only
the synonyms, derivatives, and taxonomy that
match the sense determined by a semantic dis-
ambiguator. The method to share out informa-
tion is particular for each kind of enrichment.
An analysis of the experienced problems shows
the interest of the method, and insists on the
points that still have to be tackled.
In a more general perspective, it is known
that no perfect dictionary exists. Each dictio-
nary used bythis methodhas gaps. The method
used to mix information is flltering data from
extra dictionaries by data from a reference dic-
tionary, but errors in the reference are passed on
to the added data. The right solution should be
to use more than one reference dictionary.

References
Salah A˜‡t-Mokhtar, Jean-Pierre Chanod, and
Claude Roux. 2002. Robustness beyond shal-
lowness: incremental deep parsing. Natural
Language Engineering, 8(2/3):121{144.
Ren¶e Bailly. 1947. Dictionnaire des synonymes
de la langue fran»caise. Larousse, Paris.
Caroline Brun, Bernard Jacquemin, and
Fr¶ed¶erique Segond. 2001. Exploitation
de dictionnaires ¶electroniques pour la
d¶esambigu˜‡sation s¶emantique lexicale. TAL,
42(3):667{690.
Caroline Brun. 2000. A client/server archi-
tecture for word sense disambiguation. In
Proceedings of Coling’2000, pages 132{138,
Saarbr˜ucken, Deutschland.
Laurent Catherin. 1999. The french wordnet.
Technical report, EuroWordNet.
Jean Dubois and Fran»coise Dubois-Charlier.
1997. Dictionnaire des verbes fran»cais.
Larousse, Paris. Electronic version, with its
complement: Dictionnaire des mots.
Christiane Fellbaum, editor. 1998. WordNet:
an electronic lexical database. Language,
Speech and Communication. The MIT Press,
Cambridge, Massachusetts.
Olivier Ferret, Brigitte Grau, Gabriel Illouz,
Christian Jacquemin, and N. Masson. 1999.
Qualc. The question-answering program of
the langage et cognition group at limsi-cnrs.
In Proceedings of TREC-8, pages 455{464.
¶Eric Gaussier, Gregory Grefenstette, David
Hull, and Claude Roux. 2000. Recherche
d’information en fran»cais et traitement au-
tomatique des langues. TAL, 41(2):473{493.
Gsi-Erli, France, 1993. Le dictionnaire
AlethDic, 1.5 edition, Mars.
David A. Hull. 1999. Xerox trec-8 question
answering track report. In Proceedings of
TREC-8, pages 743{752.
Bernard Jacquemin, Caroline Brun, and Claude
Roux. 2002. Enriching a text by semantic
disambiguationforinformationextraction. In
LREC 2002 Workshop Proceedings.
Bernard Jacquemin. 2003. Construction et in-
terrogation de la structure informationnelle
d’une base documentaire en fran»cais. Ph.D.
thesis, Universit¶e de la Sorbonne Nouvelle.
Bernard Jacquemin. 2004. Analyse et expan-
sion des textes en question-r¶eponse. In Actes
des 7 emes JADT, volume 2, pages 633{641.
Dan Moldovan, Sanda Harabagiu, Marius Psca,
Rada Mihalcea, Richard Goodrum, Roxana
Grju, and Vasile Rus. 2000. The structure
and performance of an open-domain question
answering system. In Proceedings of the 38th
Annual Meeting of the ACL, pages 563{570.
Warren Weaver, 1949. Machine translation of
languages, chapter Translation, pages 15{23.
John Wiley and Sons, New York.
