<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2203">
  <Title>Qualitative Evaluation of Automatically Calculated Acception Based MLDB Aree Teeraparbseree</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DEC (Explanatory-Combinatorial Dictionary)
</SectionTitle>
    <Paragraph position="0"> designed by Polguere &amp; Mel'cuk (Polguere, 2000) to make it possible to construct large, detailed and principled dictionaries in tractable time.</Paragraph>
    <Paragraph position="1"> The building method of the Papillon lexical database is based, on the one hand, on 1) reusing existing lexical resources and, on the other hand, on 2) contributions of volunteers working over the Internet. In order to automate the first step, we have developed Jeminie (cf. section 2), a flexible software system that helps create interlingual lexical databases (semi-)automatically. As several techniques for the creation of axies can be implemented in Jeminie, it is necessary to evaluate and compare these techniques to understand their strengths and weaknesses and to identify possible improvements. This article proposes an approach for the automatic qualitative evaluation of an automatically created MLDB, for instance one created by Jeminie, that relies on an evaluation software system that adapts to the measured MLDB.</Paragraph>
    <Paragraph position="2"> The next section of this article provides an overview of the Jeminie system and the strategy it implements to create interlingual lexical databases. The third section presents in detail evaluation criteria for an MLDB. The fourth section describes the evaluation system that we propose and the metrics and criteria used to evaluate the quality of an MLDB. The last sections discuss the measurement strategy and conclude.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Jeminie
</SectionTitle>
    <Paragraph position="0"> Jeminie is a software system that helps build interlingual databases. Its first function is to automatically extract information from existing monolingual dictionaries, at least one for each considered language, and to normalize it into lexies. The second function of Jeminie is to automatically link lexies that have the same sense into axies. The prominent feature of Jeminie is the ability to arbitrarily combine several axie creation techniques (Teeraparbseree, 2003).</Paragraph>
    <Paragraph position="1"> An axie creation technique is an algorithm that creates axies to link a set of existing lexies. An algorithm may use existing additional lexical resources, such as bilingual dictionaries, parallel corpora, synonym dictionaries, and antonym dictionaries. Algorithms that do not rely on additional lexical resources consider only information available from the monolingual databases, and include vectorial algorithms such as calculating and comparing conceptual vectors for each lexie (Lafourcade, 2002).</Paragraph>
    <Paragraph position="2"> In practice, one algorithm alone is not sufficient to produce a good-quality MLDB. For instance, using only an algorithm based on bilingual dictionaries, one obtains a lexical database at the level of words but not at the level of word senses. The Jeminie system tackles this problem from a software engineering point of view. In Jeminie, an axie creation algorithm is implemented in a reusable software module. Jeminie allows for arbitrary composition of modules, in order to take advantage of each axie creation algorithm and to create an MLDB of the best possible quality. We call a sequence of executions of axie creation modules an MLDB production process. A process is specified using a specific language that provides high-level abstractions. The Jeminie architecture is divided into three layers. The core layer is a library that is used to implement axie creation modules at the module layer. The process interpreter starts the execution of modules according to processes specified by linguists. The interpreter is developed using the core layer.</Paragraph>
    <Paragraph position="3"> Jeminie has been developed in Java following object-oriented design techniques and patterns.</Paragraph>
    <Paragraph position="4"> Each execution of an axie creation module progressively contributes to creating and filtering the intermediate set of axies. The final MLDB is obtained after the last module execution in a process. The quality of an MLDB can be evaluated either 1) on the final set of axies after a whole process has been executed, or 2) on an intermediate set of axies after a module has been executed in a process. The modularity in MLDB creation provided by Jeminie therefore allows for a wide range of quality evaluation strategies. The next sections describe the evaluation criteria that we consider for MLDBs created using Jeminie.</Paragraph>
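    <Paragraph> The notion of a process as a sequence of module executions can be sketched as follows. This is an illustrative Python sketch, not Jeminie code (Jeminie itself is written in Java): the names `run_process`, `bidict_module` and `filter_module`, the lexie identifiers, and the representation of an axie as a frozen set of lexie identifiers are all assumptions made for the example.</Paragraph>
    <Paragraph><![CDATA[
```python
from typing import Callable, FrozenSet, Iterable, List, Set

Axie = FrozenSet[str]                      # assumption: an axie is the set of lexie ids it links
Module = Callable[[Set[Axie]], Set[Axie]]  # a module maps an axie set to a new axie set


def bidict_module(axies: Set[Axie]) -> Set[Axie]:
    """Toy creation module: adds axies for word-level translation pairs."""
    pairs = [("fr:carte#1", "en:map#1"), ("fr:carte#1", "en:card#1")]
    return axies | {frozenset(p) for p in pairs}


def filter_module(axies: Set[Axie]) -> Set[Axie]:
    """Toy filtering module: removes axies judged semantically wrong."""
    rejected = {frozenset(("fr:carte#1", "en:card#1"))}
    return axies - rejected


def run_process(modules: Iterable[Module]) -> List[Set[Axie]]:
    """Execute modules in order and keep every intermediate axie set."""
    axies: Set[Axie] = set()
    history = []
    for module in modules:
        axies = module(axies)
        history.append(set(axies))
    return history


history = run_process([bidict_module, filter_module])
```
]]></Paragraph>
    <Paragraph> Keeping every intermediate set is one way to support both evaluation strategies mentioned above: the last entry of `history` is the final set of axies, and earlier entries are the intermediate sets.</Paragraph>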
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Taxonomy of evaluation criteria
</SectionTitle>
    <Paragraph position="0"> Here, we propose metrics for the qualitative evaluation of multilingual lexical databases, and give an interpretation for these measures. We propose a classification of MLDB evaluation criteria into four classes, according to their nature.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Golden-standard-based criteria
</SectionTitle>
      <Paragraph position="0"> In the domain of machine translation systems, an increasingly accepted way to measure the quality of a system is to compare the outputs it produces with a set of reference translations, considered as an approximation of a golden standard (Papineni et al., 2002; Hovy et al., 2002). By analogy, one can define a golden standard multilingual lexical database and compare it to a database generated by a system such as Jeminie, provided that both contain axies that link to lexies in the same monolingual databases. Considering that two axies are the same if they contain links to exactly the same lexies, the quality of a machine-generated multilingual lexical database would then be measured with two metrics adapted from machine translation system evaluation (Ahrenberg et al., 2000): recall and precision.</Paragraph>
      <Paragraph position="1"> Recall (coverage) is the number of axies that are defined in both the generated database and in the golden standard database, divided by the number of axies in the golden standard.</Paragraph>
      <Paragraph position="2"> Precision is the number of axies that are defined in both the generated database and in the golden standard database, divided by the number of axies in the generated database.</Paragraph>
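      <Paragraph> The two definitions above can be sketched in a few lines of Python. The sketch assumes, as the text does, that two axies are the same when they link to exactly the same lexies; the function name `recall_precision` and the lexie identifiers are hypothetical.</Paragraph>
      <Paragraph><![CDATA[
```python
from typing import FrozenSet, Set, Tuple

Axie = FrozenSet[str]  # an axie is identified by the set of lexies it links


def recall_precision(generated: Set[Axie], golden: Set[Axie]) -> Tuple[float, float]:
    """Return (recall, precision) of a generated axie set against a golden standard."""
    common = len(generated & golden)  # axies defined in both databases
    recall = common / len(golden) if golden else 0.0
    precision = common / len(generated) if generated else 0.0
    return recall, precision


golden = {frozenset({"fr:a", "en:a"}), frozenset({"fr:b", "en:b"})}
generated = {
    frozenset({"fr:a", "en:a"}),
    frozenset({"fr:b", "en:c"}),
    frozenset({"fr:d", "en:d"}),
}
recall, precision = recall_precision(generated, golden)
```
]]></Paragraph>
      <Paragraph> In this toy example one axie out of two golden axies is found (recall 0.5), and one out of three generated axies is correct (precision 1/3).</Paragraph>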
      <Paragraph position="3"> However, (Aimelet et al., 1999) highlighted the limits of the golden standard approach, as it is often difficult to manually produce precise reference resources. In the context of the Papillon project, a golden standard multilingual lexical database would deal with nine languages (English, French, German, Japanese, Lao, Thai, Malay, Vietnamese and Chinese), which makes it extremely difficult to produce. Furthermore, since the multilingual lexical database produced in Papillon will define at least 40000 axies, using heterogeneous resources, a comparison with a typical golden standard of only 100 axies does not seem relevant. Instead of producing a golden standard for a whole multilingual lexical database, we propose to consider a partial golden standard that concerns only a part of an MLDB.</Paragraph>
      <Paragraph position="4"> For instance, a partial golden standard can be produced using a bilingual dictionary that concerns only two languages in the database. Several partial golden standard MLDBs could be produced using several bilingual dictionaries, in order to cover all languages in the multilingual lexical database.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Structural criteria
</SectionTitle>
      <Paragraph position="0"> Structural evaluation criteria consider the state of links between lexies and axies. We define several general structural criteria: * CLAave, the average number of axies linked to each lexie. Here, we consider only lexies that are linked to axies. CLAave should be 1. If it is &gt; 1, several axies have the same sense, i.e. the produced MLDB is ambiguous. If it is &lt; 1, the produced MLDB may not be precise enough, as it does not cover all the lexies. We should also consider the standard deviation of that number, because an MLDB would be quite bad if the number of linked axies were 2 for half the lexies and 0 for the rest, although the global value of CLAave is 1.</Paragraph>
      <Paragraph position="1"> * for each language, ADLlang, the ratio of the number of axies to the number of lexies in that language. If it is too low, the axies may represent fuzzy acceptions. If it is too high, axies may overlap, i.e. several axies may represent the same acception. Typically, it should be about 1.2 (cf. large MLDBs such as EDR, the Electronic Dictionary Research project in Japan). This metric should be calculated for each language independently, because the number of lexies may vary significantly between two languages, making this metric irrelevant if calculated using the total number of lexies and axies in a database.</Paragraph>
      <Paragraph position="2"> * CALave, the average number of lexies of each language linked to each axie. It should be about 1.2. If it is &gt; 1 for a language, axies may represent a fuzzy acception or there is synonymy, as illustrated in figure 3. If it is &lt; 1 for a language, axies may not cover that language precisely. Note that CALave may help us locate places in the "axie" set where an axie is refined by one or more axies. Each CALave may then be far from CALave global, but their average should still be near CALave global for the considered set.</Paragraph>
      <Paragraph position="3"> [Figure 3: synonyms in the same language linked to the same axie] Such metrics are complementary and easy to measure, and are among the rare metrics that concern a whole MLDB. However, they do not help evaluate the quality of links between axies and lexies in terms of semantics.</Paragraph>
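      <Paragraph> As a rough illustration, the three structural metrics can be computed from a list of (axie, lexie) links. The Python sketch below uses hypothetical toy data and makes two assumptions that the text leaves open: for ADLlang only axies linked to at least one lexie of the language are counted, and CALave for a language is averaged over all axies.</Paragraph>
      <Paragraph><![CDATA[
```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy data: language of each lexie, and (axie, lexie) links.
lexie_lang = {"fr1": "fr", "fr2": "fr", "fr3": "fr", "en1": "en", "en2": "en"}
links = {("a1", "fr1"), ("a1", "en1"), ("a2", "fr2"), ("a2", "en1"), ("a2", "en2")}

axies_of = defaultdict(set)   # lexie -> axies linked to it
lexies_of = defaultdict(set)  # axie -> lexies linked to it
for axie, lexie in links:
    axies_of[lexie].add(axie)
    lexies_of[axie].add(lexie)

# CLAave: mean number of axies per linked lexie, with its standard deviation.
counts = [len(a) for a in axies_of.values()]
cla_ave, cla_std = mean(counts), pstdev(counts)

adl, cal = {}, {}
for lang in sorted(set(lexie_lang.values())):
    n_lexies = sum(1 for lg in lexie_lang.values() if lg == lang)
    # Assumption: count only axies linked to at least one lexie of this language.
    n_axies = sum(
        1 for ls in lexies_of.values() if any(lexie_lang[x] == lang for x in ls)
    )
    adl[lang] = n_axies / n_lexies
    # CALave for this language: mean number of its lexies per axie (all axies).
    cal[lang] = mean(
        sum(1 for x in ls if lexie_lang[x] == lang) for ls in lexies_of.values()
    )
```
]]></Paragraph>
      <Paragraph> Reporting `cla_std` alongside `cla_ave` follows the remark above that the average alone can hide a very uneven distribution of links.</Paragraph>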
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Human-based criteria
</SectionTitle>
      <Paragraph position="0"> This class of evaluation criteria is based on measuring the number and nature of the corrections made by a linguist on a part of a produced MLDB. For instance, one can measure the ratio of the number of corrections made by a linguist to the total number of links between the considered axies and lexies. The closer the ratio is to zero, the higher the quality of the multilingual lexical database. A high correction ratio implies a low MLDB quality.</Paragraph>
      <Paragraph position="1"> However, this class of criteria assumes that the produced MLDB is homogeneous. In the context of Papillon, the database will be produced using several techniques and heterogeneous lexical resources, which limits the relevance of such criteria.</Paragraph>
      <Paragraph position="2"> This approach is similar to the golden-standard approach described above, although the golden-standard approach is automatic.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Non-resource-based semantic criteria
</SectionTitle>
      <Paragraph position="0"> In this class, criteria evaluate the quality of the semantics of the links between axies and lexies, and do not rely on additional lexical resources. One of the metrics that we consider is the distance between conceptual vectors of lexies linked to the same axie. A conceptual vector for a lexie is calculated by projecting the concepts associated with this lexie into a vector space, where each dimension corresponds to a leaf concept of a thesaurus (Lafourcade, 2002).</Paragraph>
      <Paragraph position="1"> The concepts associated with a lexie are identified by analyzing the lexie definition. The lower the distance between the conceptual vectors of two lexies, the closer those lexies (word senses) are. As a metric, we therefore consider the average conceptual distance between each pair of lexies linked to the same axie. The lower that value is, the better the MLDB is in terms of the semantics of the links between axies and lexies. However, a reliable computation of conceptual vectors relies on the availability of precise and rich definitions in lexies, and on large lexical resources to compute initial vectors, which are difficult to gather for all languages in practice.</Paragraph>
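      <Paragraph> A minimal Python sketch of this metric follows. The text does not specify which distance is used; the sketch assumes an angular distance (the arc cosine of the cosine similarity) and non-zero vectors. The function names, the two-dimensional vectors and the identifiers are hypothetical.</Paragraph>
      <Paragraph><![CDATA[
```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors (assumed distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.acos(cos)


def average_intra_axie_distance(
    axies: Dict[str, List[str]], vectors: Dict[str, Sequence[float]]
) -> float:
    """Mean distance over all pairs of lexies linked to the same axie."""
    distances = [
        angular_distance(vectors[x], vectors[y])
        for lexies in axies.values()
        for x, y in combinations(lexies, 2)
    ]
    return sum(distances) / len(distances) if distances else 0.0


vectors = {"fr1": [1.0, 0.0], "en1": [1.0, 0.0], "en2": [0.0, 1.0]}
axies = {"a1": ["fr1", "en1"], "a2": ["fr1", "en2"]}
avg = average_intra_axie_distance(axies, vectors)
```
]]></Paragraph>
      <Paragraph> Here axie `a1` links two lexies with identical vectors (distance 0) and `a2` links two orthogonal ones (distance pi/2), so the average is pi/4; a lower value would indicate semantically tighter axies.</Paragraph>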
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Discussion
</SectionTitle>
      <Paragraph position="0"> As a more general conceptual framework, we define a classification of evaluation criteria along four dimensions, or characteristics: * automation: a criterion is either automatically evaluated, or relies on linguists.</Paragraph>
      <Paragraph position="1"> * scope: a criterion evaluates either a part of a MLDB, or a whole MLDB.</Paragraph>
      <Paragraph position="2"> * semantics: a criterion considers either the structure of a MLDB, or the semantics of the links between axies and lexies.</Paragraph>
      <Paragraph position="3"> * resource: a criterion relies on additional lexical resources, or not.</Paragraph>
      <Paragraph position="4"> Multilingual lexical databases such as Papillon can be used in different contexts, e.g. in machine translation systems or in multilingual information retrieval systems. The criteria used for evaluating a multilingual lexical database should be adapted to the context in which the database is used. For instance, if a multilingual lexical database is very precise and good at French and Japanese acceptions, but not good at other languages, it should be judged as a good lexical database by users who evaluate a usage of French and Japanese only, but it should be judged as a bad multilingual lexical database globally.</Paragraph>
      <Paragraph position="5"> Since the Papillon database generated by Jeminie will not be tied to specific usages, the database production system must not impose predefined evaluation criteria. We propose instead to allow the use of any criterion at any point in the four dimensions above, and the arbitrary composition of evaluation criteria, in order to adapt to different contexts. However, since we aim at performing an automatic evaluation, we do not consider human-based criteria, although human evaluation is certainly valid. Our approach is similar to the approach chosen in Jeminie for the creation of axies. We tackle this problem of criteria composition from a software engineering point of view, by using object-oriented programming techniques to design and implement modular and reusable criterion software modules.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Adaptable evaluation system
</SectionTitle>
    <Paragraph position="0"> By analogy with the Jeminie modules that implement algorithms to create axies, we propose a system that allows for the implementation in Java of reusable software modules that implement algorithms to measure an MLDB. In this system, we consider that each criterion is implemented as a module. Criterion modules are of a different kind, and are developed differently from Jeminie axie creation modules. As a convention, we define that each criterion module returns a numeric value as the result of a measurement, denoted Qi. The higher that value, the better the evaluated database.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Axie-creation-related criteria
</SectionTitle>
      <Paragraph position="0"> As the strategy we have chosen in Jeminie is to combine complementary axie creation modules to produce axies in a multilingual lexical database, we consider that each axie creation module encapsulates its own quality criterion that it tends to optimize, explicitly or implicitly. Since each module implements an algorithm to decide whether to create an axie, we consider that such an algorithm can also be used as a criterion to decide whether an existing axie is correct. An axie creation module cannot be reused as is as a criterion module; however, its decision algorithm can easily be reimplemented in a criterion module. For each algorithm, we define the following four metrics, adapted from (Bedecarrax, 1989): A1 the number of internal adjustments, i.e. the number of axies that would be created according to the algorithm, and that have actually been created.</Paragraph>
      <Paragraph position="1"> A2 the number of external adjustments, i.e. the number of axies that would not be created according to the algorithm, and that have actually not been created.</Paragraph>
      <Paragraph position="2"> E1 the number of internal errors, i.e. the number of axies that would not be created according to the algorithm, and that have actually been created.</Paragraph>
      <Paragraph position="3"> E2 the number of external errors, i.e. the number of axies that would be created according to the algorithm, and that have actually not been created.</Paragraph>
      <Paragraph position="4"> For each algorithm, the quality criteria are to maximize the numbers of adjustments (A1 and A2) and to minimize the numbers of errors (E1 and E2).</Paragraph>
      <Paragraph position="6"> For instance, the definitions of A1, A2, E1 and E2 for the axie creation algorithm that uses a bilingual dictionary between languages X and Y are as follows: A1 the number of pairs of lexies of languages X and Y that are linked to the same axie and whose words are mutual translations according to the bilingual dictionary.</Paragraph>
      <Paragraph position="7"> A2 the number of pairs of lexies of languages X and Y that are not linked to the same axie and whose words are not mutual translations according to the bilingual dictionary. E1 the number of pairs of lexies of languages X and Y that are linked to the same axie and whose words are not mutual translations according to the bilingual dictionary.</Paragraph>
      <Paragraph position="8"> E2 the number of pairs of lexies of languages X and Y that are not linked to the same axie and whose words are mutual translations according to the bilingual dictionary. However, resources used by resource-based creation algorithms have a number of entries that is often significantly lower than the number of lexies and axies in a multilingual lexical database. For instance, the number of translation entries in a bilingual dictionary is typically lower than the number of available monolingual acceptions in the source language, because that set of lexies may be constructed by combining a set of rich monolingual dictionaries. For instance, our monolingual database for French contains about 21000 headwords and 45000 lexies extracted from many definition dictionaries such as Hachette and Larousse. Our monolingual database for English contains about 50000 headwords and 90000 lexies extracted from English WordNet 1.7.1. However, the bilingual French-English dictionary that we use is based on the ... [Table: ... monolingual lexical databases with the number of entries in the multilingual lexical database] According to the example above, measuring the number of external adjustments A2 and internal errors E1 is therefore not relevant. For example, a criterion cannot decide whether the words of a French lexie and of an English lexie that are linked together are translations of each other, since the bilingual dictionary used is not precise enough. We therefore propose a simplified quality criterion for resource-based algorithms, which is to maximize A1 and to minimize E2.</Paragraph>
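      <Paragraph> The four counts for a bilingual-dictionary criterion can be sketched as below. The data is a hypothetical toy example in which languages X and Y are French and English; the lexie identifiers, the dictionary content and the helper `linked` are assumptions made for illustration only.</Paragraph>
      <Paragraph><![CDATA[
```python
from itertools import product

# Hypothetical toy data: lexie id -> headword, for languages X and Y.
lexies_x = {"fr:carte#1": "carte", "fr:carte#2": "carte"}
lexies_y = {"en:map#1": "map", "en:card#1": "card", "en:menu#1": "menu"}
bidict = {("carte", "map"), ("carte", "card")}  # word-level translation pairs
axies = [{"fr:carte#1", "en:map#1"}, {"fr:carte#2", "en:menu#1"}]


def linked(x: str, y: str) -> bool:
    """True if lexies x and y are linked to the same axie."""
    return any(x in axie and y in axie for axie in axies)


counts = {"A1": 0, "A2": 0, "E1": 0, "E2": 0}
for (x, wx), (y, wy) in product(lexies_x.items(), lexies_y.items()):
    translated = (wx, wy) in bidict
    if linked(x, y):
        # Linked pair: adjustment if the dictionary agrees, internal error otherwise.
        counts["A1" if translated else "E1"] += 1
    else:
        # Unlinked pair: external error if the dictionary says they translate.
        counts["E2" if translated else "A2"] += 1
```
]]></Paragraph>
      <Paragraph> In this example the single E1 comes from a correct link (carte, menu) that the small dictionary does not contain, which is the situation that motivates ignoring A2 and E1 for resource-based criteria.</Paragraph>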
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Vectorial algorithms
</SectionTitle>
      <Paragraph position="0"> This measure can also be adapted to the comparison of the conceptual distance between lexies: A1 the number of pairs of lexies that are linked to the same axie and whose conceptual vector distance is below a given threshold.</Paragraph>
      <Paragraph position="1"> A2 the number of pairs of lexies that are not linked to the same axie and whose conceptual vector distance is above the threshold. E1 the number of pairs of lexies that are linked to the same axie and whose conceptual vector distance is above the threshold.</Paragraph>
      <Paragraph position="2"> E2 the number of pairs of lexies that are not linked to the same axie and whose conceptual vector distance is below the threshold. This algorithm is not limited by the size of an additional lexical resource, and can decide whether any pair of lexies should be linked or not. It is therefore possible to evaluate A2 and E1 in addition to A1 and E2.</Paragraph>
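      <Paragraph> The same counting scheme can be written once and parameterized by the decision rule. The Python sketch below does this for the threshold rule; the function `count_adjustments`, the precomputed distances and the threshold value are hypothetical, and a distance exactly equal to the threshold is arbitrarily treated as "above".</Paragraph>
      <Paragraph><![CDATA[
```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Tuple


def count_adjustments(
    pairs: Iterable[Tuple[str, str]],
    is_linked: Callable[[str, str], bool],
    should_link: Callable[[str, str], bool],
) -> Dict[str, int]:
    """Count A1, A2, E1, E2 for any decision rule `should_link`."""
    counts = {"A1": 0, "A2": 0, "E1": 0, "E2": 0}
    for x, y in pairs:
        if is_linked(x, y):
            counts["A1" if should_link(x, y) else "E1"] += 1
        else:
            counts["E2" if should_link(x, y) else "A2"] += 1
    return counts


# Hypothetical precomputed conceptual-vector distances and threshold.
distance = {("en1", "fr1"): 0.2, ("en2", "fr1"): 1.1, ("en1", "en2"): 0.9}
THRESHOLD = 0.5
axies = [{"fr1", "en1"}, {"fr1", "en2"}]

lexies = sorted({"fr1", "en1", "en2"})  # sorted so pairs match the keys above
counts = count_adjustments(
    combinations(lexies, 2),
    is_linked=lambda x, y: any(x in a and y in a for a in axies),
    should_link=lambda x, y: distance[(x, y)] < THRESHOLD,
)
```
]]></Paragraph>
      <Paragraph> Because the rule can be evaluated for every pair of lexies, all four counts are meaningful here, unlike in the dictionary-based case.</Paragraph>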
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Synthesis
</SectionTitle>
      <Paragraph position="0"> We specify that the value returned by such axie-creation-related criteria is calculated as $Q_i = A_1 - E_2$ for resource-based criteria, and as $Q_i = (A_1 + A_2) - (E_1 + E_2)$ for any other axie-creation-related criteria, as those formulas reflect both the number of adjustments and the number of errors.</Paragraph>
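      <Paragraph> Reading the two formulas above as differences between adjustments and errors, the returned values can be computed as in this short Python sketch. The function names and the counts are hypothetical.</Paragraph>
      <Paragraph><![CDATA[
```python
from typing import Mapping


def q_resource_based(c: Mapping[str, int]) -> int:
    """Simplified criterion: only A1 and E2 are considered reliable."""
    return c["A1"] - c["E2"]


def q_other(c: Mapping[str, int]) -> int:
    """Full criterion: all adjustments minus all errors."""
    return (c["A1"] + c["A2"]) - (c["E1"] + c["E2"])


counts = {"A1": 12, "A2": 30, "E1": 4, "E2": 5}  # hypothetical counts
q_bidict = q_resource_based(counts)
q_vect = q_other(counts)
```
]]></Paragraph>
      <Paragraph> Both values grow with the number of adjustments and shrink with the number of errors, in line with the convention that a higher Qi means a better database. Note that they are raw counts and therefore depend on the size of the database.</Paragraph>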
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Structural criteria
</SectionTitle>
      <Paragraph position="0"> As described above, structural criteria consider the structure of each axie in a whole multilingual lexical database. We propose to implement such algorithms also as modules in our system.</Paragraph>
      <Paragraph position="1"> For example, we define one criterion module to calculate the following value:  where nblexies is the total number of lexies in the database, and nblinkedaxiesk is the number of axies linked to a lexie k. Qi lies between 0 and 100.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Global criteria
</SectionTitle>
      <Paragraph position="0"> A global quality value Q can be calculated as the weighted sum of the quality values measured by the measurement modules. The choice of the measurement modules corresponds to a given usage context of the evaluated database, and the positive weight of each metric module in this context is specified as a factor in the sum: $Q = \sum_i w_i Q_i$, where $w_i &gt; 0$ is the weight of module i and $Q_i$ is the value it returns.</Paragraph>
      <Paragraph position="2"> The objective is to maximize Q. The weight for each module can be chosen to emphasize the importance of selected criteria in the context of evaluation. For instance, when specifically evaluating the quality of axies between French and English lexies, the weight for a bilingual EN-FR dictionary-based criterion module could be higher than the weights for the other criterion modules. In addition, the values returned by different criterion modules are not normalized.</Paragraph>
      <Paragraph position="3"> It is therefore necessary to adapt the weights to compensate for the differences in scale between Qi values.</Paragraph>
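      <Paragraph> The weighted combination can be sketched as follows in Python. The module names, the Qi values and the weights are hypothetical; the example only illustrates how a weight can compensate for a difference of scale between two modules.</Paragraph>
      <Paragraph><![CDATA[
```python
from typing import Mapping


def global_quality(values: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of the Qi values of the selected criterion modules."""
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    # Only modules that have a weight take part in the evaluation context.
    return sum(weights[name] * values[name] for name in weights)


# Hypothetical Qi values on different scales: a raw count and a 0..100 score.
values = {"bidict": 7.0, "struct": 80.0}
balanced = global_quality(values, {"bidict": 10.0, "struct": 1.0})
struct_only = global_quality(values, {"struct": 1.0})
```
]]></Paragraph>
      <Paragraph> Selecting a subset of modules, as in `struct_only`, corresponds to choosing the measurement modules for a given usage context.</Paragraph>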
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation method
</SectionTitle>
    <Paragraph position="0"> One can evaluate the quality of a MLDB after it has been created or enhanced through the execution of an axie creation process by Jeminie.</Paragraph>
    <Paragraph position="1"> Such a quality measure can be used by linguists to decide whether to execute another axie creation process to enhance the quality of the database, or to stop if the database has reached the desired quality. The creation of an axie database is therefore iterative, alternating executions of axie creation processes, quality evaluations, and decisions.</Paragraph>
    <Paragraph position="2"> It should be noted that the execution of an axie creation process may not always imply a monotonic increase of the measured quality.</Paragraph>
    <Paragraph position="3"> Since axie creation algorithms may not be mutually coherent, the order of executions of modules, in a process or in several consecutively executed processes, has an impact on the measured global quality. More precisely, the additional resources used by axie creation modules, and/or by quality criteria modules, may contain errors and be mutually incoherent. The execution of a resource-based axie creation module using a resource R1, can cause a drop of the A1 value and an increase of the E2 value measured by a resource-based criterion module using a resource R2 incoherent with R1. This may significantly decrease the evaluated global quality. The database may however be actually of a better quality if R2 has a poor quality and R1 has a good quality. This highlights the need for good quality resources for both creating the database and evaluating its quality.</Paragraph>
    <Paragraph position="4"> Another problem is that the additional lexical resources used, such as bilingual dictionaries, generally provide information at the level of words, not at the level of senses. It is thus necessary to complement these resource-based axie creation modules, for instance by using vectorial modules. Moreover, it is necessary to develop new algorithms to increase the internal consistency of an axie database, for example one that merges all the axies that link to the same lexie.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Example processes
</SectionTitle>
    <Paragraph position="0"> Figure 4 illustrates the two sets of axies created by a process A and a process B to link lexies retrieved from French and English monolingual dictionaries. Process A consists of the execution of only module Mbidict, which uses an FR-EN bilingual dictionary extracted from the FeM dictionary and partially illustrated in figure 5.</Paragraph>
    <Paragraph position="1"> The set of axies produced by process A consists of axie1 to axie7. Process B consists of the execution of the same module Mbidict as in process A, then of a module Mvect that implements a conceptual vector comparison algorithm for filtering some bad links. Process B produces only axie1, axie4, axie5 and axie7. Note that processes ... [Figure 4: Axies created by processes A and B; the legend distinguishes links created by process A from links created by process B] The same two criterion modules are used to evaluate both processes: 1) an axie-creation-related criterion module using the same bilingual dictionary as the one used in the axie creation modules of the processes, and calculating a Qbidict value, and 2) the structural criterion module described in section 4.2, and calculating a Qstruct value. The global evaluated quality value for the set of axies created by each process is: $Q = a \cdot Q_{bidict} + b \cdot Q_{struct}$, where a and b are the weights of the two criterion modules.</Paragraph>
    <Paragraph position="3"> The actually evaluated values of Qbidict and Qstruct, and of Q for several combinations of a and b, are shown in table 2.</Paragraph>
    <Paragraph position="4"> [Table 2: values of Qbidict, Qstruct and Q for process A and process B] Axie creation module Mbidict considers only words, not senses of words. It therefore creates several axies linked to each lexie, some of which are not correct because they do not distinguish between the lexies of a given translation word. In process B, module Mvect is executed to suppress links and axies that are semantically incorrect. The structural quality, as given by Qstruct, is therefore better with process B than with process A, and intuitively the global quality has actually increased. However, executing module Mvect reduces the quality from the point of view of a bilingual translation that considers only words and not acceptions, as given by Qbidict.</Paragraph>
    <Paragraph position="5"> This illustrates that not all quality criteria should be maximized to attain the best possible quality. Weight factors for each criterion module should be carefully chosen, according to the scale of the values returned by each module and to the linguistic objectives. For instance, as illustrated in table 2, setting too high a weight for the bilingual translation criterion makes the evaluated global quality decrease although the actual quality has increased.</Paragraph>
  </Section>
</Paper>