Qualitative Evaluation of Automatically Calculated Acception
Based MLDB
Aree Teeraparbseree
GETA, CLIPS, IMAG
385, rue de la Bibliothèque
B.P. 53 - 38041 Grenoble Cedex 9, France
aree.teeraparbseree@imag.fr
Abstract
In the context of the Papillon project, which
aims at creating a multilingual lexical database
(MLDB), we have developed Jeminie, an adapt-
able system that helps automatically building
interlingual lexical databases from existing lex-
ical resources. In this article, we present a tax-
onomy of criteria for evaluating a MLDB, that
motivates the need for arbitrary compositions
of criteria to evaluate a whole MLDB. A quality
measurement method is proposed, that is adapt-
able to different contexts and available lexical
resources.
1 Introduction
The Papillon project1 aims at creating a cooper-
ative, free, permanent, web-oriented environ-
ment for the development and the consultation
of a multilingual lexical database. The macro-
structure of Papillon is a set of monolingual dic-
tionaries (one for each language) of word senses,
called lexies, linked through a central set of in-
terlingual links, called axies. Axies, also called
interlingual acceptions, are not concepts, but
simply interlingual links between lexies, motived
by translations found in existing dictionaries or
proposed by the contributors. Figure 1 repres-
ents an interlingual database that links monolin-
gual resources in three languages: French, Eng-
lish and Japanese. The interlingual acceptions
(axies) are linked to lexies from each language.
For instance, a lexie for the French word “terre”
is linked through an axie to two lexies for the
English words “earth” and “soil” and to a lexie
for the Japanese word “tsuchi”. Note that an
axie can be refined into a set of axies. For in-
stance, a lexie for the English word “chair” is
linked through axie1 to two lexies for the French
words “fauteuil” and “chaise”. Axie1 can be re-
fined into two axies axie11 and axie12 as illus-
trated in figure 2.
1http://www.papillon-dictionary.org/
axie3axie2
axie1
interlingual
lexie1lexie2
Japanese monolingual
lexie3lexie2
lexie1
French monolingual
terre
lexie3lexie2
lexie1
lexie4
earth
lexie2lexie1
land
lexie1lexie2soil
English monolingualFigure 1: An example interlingual database
lexie1chaise
lexie1lexie2fauteuil
French monolingual
lexie3lexie2
lexie1
English monolingual
chair
axie11
axie12axie1
interlingual
Figure 2: An example of refined axies
This pivot macrostructure has been defined
by (Sérasset, 1994) and experimented by (Blanc,
1999) in the PARAX mockup. The mi-
crostructure of the monolingual dictionaries
is the “DiCo” structure, which is a simpli-
fication of Mel’cuk’s (Mel’cuk et al., 1995)
DEC (Explanatory-Combinatorial Dictionary)
designed by Polguère & Mel’cuk (Polguère,
2000) to make it possible to construct large,
detailed and principled dictionaries in tractable
time.
The building method of the Papillon lexical
database is based on one hand on 1) reusing ex-
isting lexical resources, and on the other hand on
2) contributions of volunteers working through
Internet. In order to automate the first step,
we have developed Jeminie (cf. section 2), a
flexible software system that helps create (semi-
) automatically interlingual lexical databases.
As there are several possible techniques for the
creation of axies that can be implemented in
Jeminie, it is necessary to evaluate and compare
these techniques to understand their strengths
and weaknesses and to identify possible im-
provements. This article proposes an approach
for the automatic qualitative evaluation of an
automatically created MLDB, for instance cre-
ated by Jeminie, that relies on an evaluation
software system that adapts to the measured
MLDB.
The next section of this article provides an
overview of the Jeminie system and the strategy
it implements to create interlingual lexical data-
bases. The third section presents in detail eval-
uation criteria for an MLDB. The fourth section
describes the evaluation system that we propose
and the metrics and criteria to evaluate the qual-
ity of MLDB. Last sections discuss the measure-
ment strategy and conclude.
2 Jeminie
Jeminie is a software system that helps build-
ing interlingual databases. Its first function is
to automatically extract information from ex-
isting monolingual dictionaries, at least one for
each considered language, and to normalize it
into lexies. The second function of Jeminie
is to automatically link lexies that have the
same sense into axies. The prominent feature of
Jeminie is the ability to arbitrarily combine sev-
eral axie creation techniques (Teeraparbseree,
2003).
An axie creation technique is an algorithm
that creates axies to link a set of existing lex-
ies. An algorithm may use existing additional
lexical resources, such as: bilingual dictionaries,
parallel corpora, synonym dictionaries, and ant-
onym dictionaries. Algorithms that do not rely
on additional lexical resources consider only in-
formation available from the monolingual data-
bases, and include vectorial algorithms such as
calculating and comparing conceptual vectors
for each lexie (Lafourcade, 2002).
The use of one algorithm alone is not suf-
ficient, in practice, to produce a good quality
MLDB. For instance, using only one algorithm
that uses bilingual dictionaries, one obtains a
lexical database on the level of words but not on
the level of senses of words. The Jeminie system
tackles this problem from a software engineering
point of view. In Jeminie, an axie creation al-
gorithm is implemented in a reusable software
module. Jeminie allows for arbitrary composi-
tion of modules, in order to take advantage of
each axie creation algorithm, and to create a
MLDB of the best possible quality. We call a
MLDB production process, a sequence of exe-
cutions of axie creation modules. A process is
specified using a specific language that provides
high-level abstractions. The Jeminie architec-
ture is divided into three layers. The core layer
is a library that is used to implement axie cre-
ation modules at the module layer. The pro-
cesses interpreter starts the execution of mod-
ules according to processes specified by linguists.
The interpreter is developed using the core layer.
Jeminie has been developed in Java following
object-oriented design techniques and patterns.
Each execution of an axie creation module
progressively contributes to create and filter the
intermediate set of axies. The final MLDB is
obtained after the last module execution in a
process. The quality of a MLDB can be eval-
uated either 1) on the final set of axies after
a whole process has been executed, or 2) on
an intermediate set of of axies after a module
has been executed in a process. The modularity
in MLDB creation provided by Jeminie there-
fore allows for a wide range of quality evalu-
ation strategies. The next sections describe the
evaluation criteria that we consider for MLDBs
created using Jeminie.
3 Taxonomy of evaluation criteria
Here, we propose metrics for the qualitative
evaluation of multilingual lexical databases, and
give an interpretation for these measures. We
propose a classification of MLDB evaluation cri-
teria into four classes, according to their nature.
3.1 Golden-standard-based criteria
In the domain of machine translation systems,
an increasingly accepted way to measure the
quality of a system is to compare the out-
puts it produces with a set of reference trans-
lations, considered as an approximation of a
golden standard (Papineni et al., 2002; hovy et
al., 2002). By analogy, one can define a golden
standard multilingual lexical database to com-
pare to a database generated by a system such as
Jeminie, that both contain axies that link to lex-
ies in the same monolingual databases. Consid-
ering that two axies are the same if they contain
links to exactly the same lexies, the quality of a
machine generated multilingual lexical database
would then be measured with two metrics adap-
ted from machine translation system evaluation
(Ahrenberg et al., 2000): recall and precision.
Recall (coverage) is the number of axies that
are defined in both the generated database and
in the golden standard database, divided by the
number of axies in the golden standard.
Precision is the number of axies that are
defined in both the generated database and in
the golden standard database, divided by the
number of axies in the generated database.
However, (Aimelet et al., 1999) highlighted
the limits of the golden standard approach, as
it is often difficult to manually produce precise
reference resources. In the context of the Papil-
lon project, a golden standard multilingual lex-
ical database would deal with nine languages
(English, French, German, Japanese, Lao, Thai,
Malay, Vietnamese and Chinese), which makes
it extremely difficult to produce. Furthermore,
since the produced multilingual lexical data-
base in Papillon will define at least 40000 ax-
ies, using heterogeneous resources, a comparison
with a typical golden standard of only 100 ax-
ies seems not relevant. Instead of producing a
golden standard for a whole multilingual lexical
database, we propose to consider partial golden
standard that concerns only a part of a MLDB.
For instance, a partial golden standard can be
produced using a bilingual dictionary that con-
cerns only two languages in the database. Sev-
eral partial golden standard MLDBs could be
produced using several bilingual dictionaries, in
order to cover all languages in the multilingual
lexical database.
3.2 Structural criteria
Structural evaluation criteria consider the state
of links between lexies and axies. We define sev-
eral general structural criteria:
• CLAave, the average number of axies linked
to each lexie. Here, we consider only lexies
that are linked to axies. CLAave should be
1. If it is > 1, several axies have the same
sense, i.e. the produced MLDB is ambigu-
ous. If it is < 1, the produced MLDB may
not be precise enough, as it does not cover
all the lexies. Actually, we should also con-
sider the standard deviation of that num-
ber, because a MLDB would be quite bad if
CLAave = 2 for half the lexies and CLAave
= 0 for the rest, although the global value
of CLAave is 1.
• for each language, ADLlang, the ratio of the
number of axies to the number of lexies in
that language. If it is too low, the axies
may represent fuzzy acceptions. If it is too
high, axies may overlap, i.e. several axies
may represent the same acception. Typ-
ically, it should be about 1.2 (cf. large
MLDB such as EDR - the Electronic Dic-
tionary Research project in Japan). This
metrics should be calculated for each lan-
guage independently, because the number
of lexies may significantly vary between two
languages, making this metrics irrelevant if
calculated using the total number of lexies
and axies in a database.
• CALave, the average number of lexies of
each language linked to each axie. It should
be about 1.2. If it is > 1 for a language,
axies may represent a fuzzy acception or
there is synonymy, as illustrated in figure
3. If it is < 1 for a language, axies may not
cover that language precisely. Note that
CALave may help us locate places in the
“axie” set where an axie is refined by one
or more axies. Each CALave may then be
far from CALave global, but their average
should still be near CALave global for the
considered set.
lexie1(place)
lexie2(fish)
lexie2(fish)
lexie3(measure)
axie1
lexie1(place)
French monolingual
interlingual
lieu
bar
Figure 3: Example of two lexies that are syn-
onym in the same language and linked to the
same axie
Such metrics are complementary and can eas-
ily be measured, and are among the rare metrics
that concern a whole MLDB. They, however, do
not help evaluating the quality of links between
axies and lexies in terms of semantics.
3.3 Human-based criteria
This class of evaluation criteria is based on the
measurement of the number and nature of the
corrections made by a linguist on a part of a
produced MLDB. For instance, one can measure
the ratio of the number of corrections made by
a linguist, to the total number of links between
the considered axies and lexies. The closer the
ratio is to zero, the higher is the quality of the
multilingual lexical database. A high correction
ratio implies a low MLDB quality.
However, this class of criteria assumes that
the produced MLDB are homogeneous. In the
context of Papillon, the database will be pro-
duced using several techniques and heterogen-
eous lexical resources, which limits the relevance
of such criteria.
This approach is similar to the golden-
standard approach described above, although
the golden-standard approach is automatic.
3.4 Non-resource-based semantic
criteria
In this class, criteria evaluate the quality of the
semantics of the links between axies and lex-
ies, and do not rely on additional lexical re-
sources. One of the metrics that we consider
is the distance between conceptual vectors of
lexies linked to the same axie. A conceptual
vector for a lexie is calculated by projecting the
concepts associated with this lexie into a vector
space, where each dimension corresponds to a
leaf concept of a thesaurus (Lafourcade, 2002).
The concepts associated with a lexie are identi-
fied by analyzing the lexie definition. The lower
the distance between the conceptual vectors of
two lexies is, the closer are those lexies (word-
senses). As a metrics, we therefore consider the
average conceptual distance between each pair
of lexies linked to the same axie. The lower
that value is, the better the MLDB is, in terms
of the semantics of the links between axies and
lexies. However, a reliable computation of con-
ceptual vectors relies on the availability precise
and rich definitions in lexies, and on large lexical
resources to compute initial vectors, which are
difficult to gather for all languages in practice.
3.5 Discussion
As a more general conceptual framework, we
define a classification of evaluation criteria along
four dimensions, or characteristics:
• automation: a criterion is either automat-
ically evaluated, or relies on linguists.
• scope: a criterion evaluates either a part of
a MLDB, or a whole MLDB.
• semantics: a criterion considers either the
structure of a MLDB, or the semantics of
the links between axies and lexies.
• resource: a criterion relies on additional
lexical resources, or not.
Multilingual lexical databases such as Papillon
can be used in different contexts, e.g. in ma-
chine translation systems or in multilingual in-
formation retrieval systems. The criteria used
for evaluating a multilingual lexical database
should be adapted to the context in which the
database is used. For instance, if a multilin-
gual lexical database is very precise and good at
French and Japanese acceptions, but not good
at other languages, it should be judged as a good
lexical database by users who evaluate a usage
of French and Japanese only, but it should be
judged as a bad multilingual lexical database
globally.
Since the Papillon database generated by
Jeminie will not be tied to specific usages, the
database production system must not impose
predefined evaluation criteria. We propose in-
stead to allow for the use of any criterion at any
point in the four dimensions above and for arbit-
rary composition of evaluation criteria to adapt
to different contexts. However, since we aim at
performing an automatic evaluation, we do not
consider human-based criteria, although human
evaluation is certainly valid. Our approach is
similar to the approach chosen in Jeminie for
the creation of axies. We tackle this problem of
criteria composition from a software engineering
point of view, by using object oriented program-
ming techniques to design and implement mod-
ular and reusable criterion software modules.
4 Adaptable evaluation system
By analogy with the Jeminie modules that im-
plement algorithms to create axies, we propose
a system that allows for the implementation in
Java of reusable software modules that imple-
ment algorithms to measure MLDB. In this sys-
tem, we consider that each criterion is imple-
mented as a module. Criterion modules are of
a different kind, and are developed differently
from Jeminie axie creation modules. As a con-
vention, we define that each criterion module
returns a numeric value as the result of a meas-
urement, noted Qi. The higher that value, the
better the evaluated database.
4.1 Axie-creation-related criteria
As the strategy we have chosen in Jeminie is to
combine complementary axie creation modules
to produce axies in a multilingual lexical data-
base, we consider that each axie creation mod-
ule encapsulates its own quality criterion that it
tends to optimize, explicitly or implicitly. Since
each module implements an algorithm to decide
whether to create an axie, we consider that such
an algorithm can also be used as a criterion to
decide whether an existing axie is correct. An
axie creation module can not be reused as is as a
criterion module, however its decision algorithm
can be easily reimplemented in a criterion mod-
ule. For each algorithm, we define the following
four metrics, adapted from (Bédécarrax, 1989):
A1 the number of internal adjustments, i.e. the
number of axies that would be created ac-
cording to the algorithm, and that have ac-
tually been created.
A2 the number of external adjustments, i.e. the
number of axies that would not be created
according to the algorithm, and that have
actually not been created.
E1 the number of internal errors, i.e. the num-
ber of axies that would not be created ac-
cording to the algorithm, and that have ac-
tually been created.
E2 the number of external errors, i.e. the num-
ber of axies that would be created according
to the algorithm, and that have actually not
been created.
For each algorithm, the quality criteria are to
maximize A1 + A2, to minimize E1 + E2, or to
maximize (A1 +A2) − (E1 +E2).
Resource-based algorithms
For instance, following are the definitions of A1,
A2, E1 and E2 for the axie creation algorithm
that uses a bilingual dictionary between lan-
guages X and Y:
A1 the number of pairs of lexies of languages
X and Y that are linked to the same axie
and which words are mutual translations
according to the bilingual dictionary.
A2 the number of pairs of lexies of languages X
and Y that are not linked to the same axie
and which words are not mutual transla-
tions according to the bilingual dictionary.
E1 the number of pairs of lexies of languages X
and Y that are linked to the same axie and
which words are not mutual translations ac-
cording to the bilingual dictionary.
E2 the number of pairs of lexies of languages
X and Y that are not linked to the same
axie and which words are mutual transla-
tions according to the bilingual dictionary.
However, resources used by resource-based cre-
ation algorithms have a number of entries that
is often significantly lower than the number of
lexies and axies in a multilingual lexical data-
base. For instance, the number of translation
entries in a bilingual dictionary is typically lower
than the number of available monolingual accep-
tions in the source language, because that set of
lexies may be constructed by combining a set
of rich monolingual dictionaries. For instance,
our monolingual database for French contains
about 21000 headwords and 45000 lexies ex-
tracted from many definition dictionaries such
as Hachette, Larousse, etc. Our monolingual
database for English contains about 50000 head-
words and 90000 lexies extracted from English
WordNet 1.7.1. However, the bilingual French-
English dictionary that we use is based on the
FeM2 multilingual dictionary, and defines only
15000 French headwords.
lexical database number of headwords
French monolingual 21000
English monolingual 50000
FeM 15000
Table 1: Comparing the number of entries in
monolingual lexical databases with the number
of entries in the multilingual lexical database
According to the example above, measuring
the number of external adjustments A2 and in-
ternal errors E1 is therefore not relevant. For
example, a criterion can not decide if the words
of a French lexie and of an English lexie that are
linked together, are translations of each other,
since the bilingual dictionary used is not precise
enough. We therefore propose a simplified qual-
ity criterion for resource-based algorithms, that
is to maximize A1and to minimize E2.
2French-English-Malay dictionary http://www-
clips.imag.fr/geta/services/fem
Vectorial algorithms
This measure can also be adapted to the com-
parison of the conceptual distance between lex-
ies:
A1 the number of pairs of lexies that are linked
to the same axie and which conceptual vec-
tor distance is below a given threshold.
A2 the number of pairs of lexies that are not
linked to the same axie and which concep-
tual vector distance is above the threshold.
E1 the number of pairs of lexies that are linked
to the same axie and which conceptual vec-
tor distance is above the threshold.
E2 the number of pairs of lexies that are not
linked to the same axie and which concep-
tual vector distance is below the threshold.
This algorithm is not limited by the size of
an additional lexical resource, and can decide
whether any pair of lexies should be linked or
not. It is therefore possible to evaluate A2 and
E1 in addition to A1 and E2.
Synthesis
We specify that the value returned by such axie-
creation-related criteria is calculated as Qi =
A1−E2 for resource-based criteria, and as Qi =
(A1+A2)−(E1+E2) for any other axie-creation-
related criteria, as those formulas reflect both
the number of adjustments and the number of
errors.
4.2 Structural criteria
As described above, structural criteria consider
the structure of each axie in a whole multilin-
gual lexical database. We propose to implement
such algorithms also as modules in our system.
For example, we define one criterion module to
calculate the following value:
Qi = 1
0.01 +
vextendsinglevextendsingle
vextendsinglevextendsingle
vextendsinglevextendsingle1 −
nblexiessummationtext
k = 1
nblinkedaxiesk
nblexies
vextendsinglevextendsingle
vextendsinglevextendsingle
vextendsinglevextendsingle
where nblexies is the total number of lexies in
the database, and nblinkedaxiesk is the number
of axies linked to a lexie k. Qi is comprised
between 0 and 100.
4.3 Global criteria
A global quality value Q can be calculated as
the sum of each quality value measured by each
measurement module. The choice of the meas-
urement modules corresponds to a given usage
context of the evaluated database, and the posit-
ive weight of each metric module in this context
is specified as a factor in the sum:
Q =
nbmodulessummationtext
i = 1
weighti · Qi
The objective is to maximize Q. The weight
for each module can be chosen to emphasize the
importance of selected criteria in the context of
evaluation. For instance, when specifically eval-
uating the quality of axies between French and
English lexies, the weight for a bilingual EN-
FR dictionary-based criterion module could be
higher than the weights for the other criterion
modules. In addition, the values returned by
different criterion modules are not normalized.
It is therefore necessary to adapt the weights to
compensate the difference of scale between Qi
values.
5 Evaluation method
One can evaluate the quality of a MLDB after
it has been created or enhanced through the ex-
ecution of an axie creation process by Jeminie.
Such a quality measure can be used by linguists
to decide whether to execute another axie cre-
ation process to enhance the quality of the data-
base, or to stop if the database has reached the
desired quality. The creation of an axie database
is therefore iterative, alternating executions of
axie creation processes, quality evaluations, and
decisions.
It should be noted that the execution of an
axie creation process may not always imply a
monotonous increase of the measured quality.
Since axie creation algorithms may not be mu-
tually coherent, the order of executions of mod-
ules, in a process or in several consecutively ex-
ecuted processes, has an impact on the meas-
ured global quality. More precisely, the addi-
tional resources used by axie creation modules,
and/or by quality criteria modules, may contain
errors and be mutually incoherent. The execu-
tion of a resource-based axie creation module
using a resource R1, can cause a drop of the
A1 value and an increase of the E2 value meas-
ured by a resource-based criterion module using
a resource R2 incoherent with R1. This may sig-
nificantly decrease the evaluated global quality.
The database may however be actually of a bet-
ter quality if R2 has a poor quality and R1 has a
good quality. This highlights the need for good
quality resources for both creating the database
and evaluating its quality.
Another problem is that the additional lex-
ical resources used, such as bilingual dictionar-
ies, generally provide information at the level of
words, not at the level of senses. It is thus ne-
cessary to complement these resource-based axie
creation modules, for instance by using vectorial
modules. Moreover, it is necessary to develop
new algorithms to increase the internal consist-
ence of an axie database, for example one that
merges all the axies that link to the same lexie.
6 Example processes
Figure 4 illustrates the two sets of axies created
by a process A and a process B to link to lexies
retrieved from a French and an English mono-
lingual dictionaries. Process A consists of the
execution of only module Mbidict, that uses a
bilingual dictionary FR-EN extracted from FeM
dictionary and partially illustrated in figure 5.
The set of axies produced by process A consists
of axie1 to axie7. Process B consists of the exe-
cution of the same module Mbidict as in process
A, then of a module Mvect that implements a
conceptual vector comparison algorithm for fil-
tering some bad links. Process B produces only
axie1, axie4, axie5 and axie7. Note that processes
A and B were hand-simulated in this example.
lexie1
banque
lexie2(person)
lexie1(fruit)
avocat
lexie1
admirable
French monolingual
lexie2(river)
lexie1(office)
bank
lexie1
advocate
lexie1
avocado
lexie1
admirable
English monolingual
axie7
axie6
axie5
axie4
axie3
axie2
axie1
link created by process B
link created by process AFigure 4: Axies created by processes A and B
The two same criterion modules are used to
evaluate both processes: 1) an axie-creation-
related criterion module using the same bilin-
gual dictionary as the one used in the axie
creation modules in processes, and calculating
Admirable (a.)Avocat (n.m.)
Avocat (n.m.)
Admirable (a.)Advocate (n.)
Avocado (n.)
Bank (n.)Banque (n.f.)
Bilingual Dictionary FR−EN(FeM) (Le Robert & Collins)Bilingual Dictionary EN−FR 
Bank (n.)Bank (n.)
Admirable (a.)Advocate (n.)
Avocado (n.)
Banque (n.f.)Rive (n.f.)
Admirable (a.)Avocat (n.m.)
Avocat (n.m.)
Figure 5: Bilingual dictionaries
a Qbidict value, and 2) the structural criterion
module described in section 4.2, and calculating
a Qstruct value. The global evaluated quality
value for the set of axies created by each pro-
cess is:
Q = α · Qbidict +β · Qstruct
The actually evaluated values of Qbidict and
Qstruct, and of Q for several combinations of α
and β, are shown in table 2.
process A process B
Qbidict 7 1
Qstruct 1.76 8.25
Q (α=1, β=1) 8.76 9.25
Q (α=1, β=2) 10.52 17.5
Q (α=2, β=1) 15.76 10.25
Table 2: The results of qualitative evaluations
Axie creation module Mbidict considers only
words, but not senses of words. It therefore cre-
ates several axies linked to each lexie, some of
which are not correct because they do not dis-
tinguish between the lexies of a given transla-
tion word. In process B, module Mvect is ex-
ecuted to suppress links and axies that are se-
mantically incorrect. The structural quality, as
given in Qstruct, is therefore better with process
B than with process A, and intuitively the global
quality has actually increased. However, execut-
ing module Mvect reduces the quality from the
point of view of a bilingual translation that con-
siders only words and not acceptions, as given
in Qbidict.
This illustrates that not all quality criteria
should be maximized to attain the best possible
quality. Weight factors for each criterion mod-
ule should be carefully chosen, according to the
scale of the values returned by each module, and
to the linguistic objectives. For instance, as il-
lustrated in table 2, setting a weight too high
for the bilingual translation criterion lets the
evaluated global quality decrease, while it has
actually increased.
7 Conclusion
This article presents the problem of the auto-
matic creation and evaluation of interlingual
multilingual lexical databases (MLDB), in the
context of the Papillon project. It describes the
Jeminie software system, that we are develop-
ing, for the automatic creation of interlingual
acceptions (axies). It can adapt to different con-
texts, e.g. to different lexical resources and dif-
ferent languages, by providing a means to arbit-
rarily compose axie creation modules.
We have proposed a taxonomy of criteria for
the automatic evaluation of a MLDB. One cri-
teria alone is not sufficient to significantly eval-
uate the quality of a whole database. We there-
fore propose a method for the arbitrary compos-
ition of evaluation criteria, following the same
principles as the Jeminie system.
The proposed method will be implemented in
a software framework, along with a library of
modules that implement a variety of evaluation
criteria, and that can be freely composed. This
framework will be integrated with Jeminie, in
order to allow for the automatic evaluation of a
MLDB during its creation.

References
Lars Ahrenberg, Magnus Merkel, Anna Sagvall
Hein, and Jorg Tiedemann. 2000. Evalu-
ation of word alignment systems. In Proceed-
ing of LREC’2000, pages 1255–1261, Athens,
Greece.
Elisabeth Aimelet, Veronika Lux, Corinne Jean,
and Frédérique Segond. 1999. WSD evalu-
ation and the looking-glass. In Proceedings of
TALN’1999, Cargèse, France.
Chantal Bédécarrax. 1989. Classification auto-
matique en analyse relationnelle : la quadri-
décomposition et ses applications. thesis, Uni-
versité Paris 6.
Etienne Blanc. 1999. PARAX-UNL: A large
scale hypertextual multilingual lexical data-
base. In Proceedings of 5th Natural Language
Processing Pacific Rim Symposium, pages
507–510, Beijing. Tsinghua University Press.
Eduard hovy, Margaret King, and Andrei
Popescu-Belis. 2002. Principles of context-
based machine translation evaluation. Ma-
chine Translation, 17(1):43–75.
Mathieu Lafourcade. 2002. Automatically pop-
ulating acception lexical databases through
bilingual dictionaries and conceptual vectors.
In Papillon’2002 Seminar, Tokyo, Japan.
Igor Mel’cuk, André Clas, and Alain Polguère.
1995. Introduction à la lexicologie explicative
et combinatoire. Duculot, Louvain-la-Neuve.
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: a method
for automatic evaluation of machine transla-
tion. In Proceeding of ACL’2002, pages 311–
318, Philadelphia.
Alain Polguère. 2000. Towards a theoret-
ically motivated general public dictionary
of semantic derivations and collocations for
French. In Proceedings of EURALEX’2000,
pages 517–527, Stuttgart.
Gilles Sérasset. 1994. Interlingual lexical or-
ganisation for multilingual lexical database
in NADIA. In Proceedings of COLING’94,
volume 1/2, pages 278–282, Kyoto, Japan.
Aree Teeraparbseree. 2003. Jeminie: A flexible
system for the automatic creation of inter-
lingual database. In Papillon’2003 Seminar,
Sapporo, Japan.
