<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1507">
  <Title>Multilingual ISLE Lexical Entry</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Language Technology (HLT)
</SectionTitle>
    <Paragraph position="0"> programme in collaboration between American and European groups in the framework of the EU-US International Research Co-operation, supported by NSF and EC. We concentrate in this paper on the current position of the</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ISLE Computational Lexicon Working
</SectionTitle>
    <Paragraph position="0"> Group. We provide a short description of the EU SIMPLE lexicons built on the basis of previous EAGLES recommendations. We then point at a few basic methodological principles applied in previous EAGLES phases, and describe a few principles to be followed in the definition of a</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Multilingual ISLE Lexical Entry
</SectionTitle>
      <Paragraph position="0"> The ISLE project is a continuation of the long-standing EAGLES initiative (Calzolari et al., 1996), carried out through a number of successive projects funded by the European Commission (EC) since 1993. EAGLES stands for Expert Advisory Group for Language Engineering Standards and was launched within</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
EC Directorate General XIII's Linguistic
</SectionTitle>
    <Paragraph position="0"> Research and Engineering (LRE) programme, continued under the Language Engineering (LE) programme, and now under the Human Language Technology (HLT) programme as ISLE, since January 2000. ISLE stands for International Standards for Language Engineering, and is carried out in collaboration between American and European groups in the framework of the EU-US International Research Co-operation, supported by NSF and EC. ISLE was built on joint preparatory EU-US work over the previous two years towards setting up a transatlantic standards-oriented initiative for HLT.</Paragraph>
    <Paragraph position="1"> The objective of the project is to support international and national HLT R&amp;D projects, and the HLT industry, by developing, disseminating and promoting widely agreed and urgently demanded HLT standards and guidelines for infrastructural language resources (see Zampolli, 1998, and Calzolari, 1998), tools that exploit them and LE products. The aim of EAGLES/ISLE is thus to accelerate the provision of standards, common guidelines and best-practice recommendations for: * very large-scale language resources (such as text corpora, computational lexicons, speech corpora (Gibbon et al., 1997), multimodal resources); * means of manipulating such knowledge, via computational linguistic formalisms, mark-up languages and various software tools; * means of assessing and evaluating resources, tools and products (EAGLES, 1996).</Paragraph>
    <Paragraph position="2"> The basic idea behind EAGLES work is for the group to act as a catalyst in order to pool concrete results coming from current major international/national/industrial projects. Relevant common practices or upcoming standards are being used where appropriate as input to EAGLES/ISLE work, particularly in the areas of computational lexicons, text, speech, and multimodal annotation, and evaluation.</Paragraph>
    <Paragraph position="3"> Numerous theories, approaches, and systems are being taken into account, where appropriate, as any recommendation for harmonisation must take into account the needs and nature of the different major contemporary approaches.</Paragraph>
    <Paragraph position="4"> EAGLES is also drawing strong inspiration from major projects whose results have contributed to advancing our understanding of harmonisation issues.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 A quick Overview of the ISLE
Work
</SectionTitle>
      <Paragraph position="0"> The current ISLE project (see http://www.ilc.pi.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm) targets the three areas of multilingual computational lexicons, natural interaction and multimodality (NIMM), and evaluation of HLT systems. These areas were chosen not only for their relevance to the HLT call but also for their long-term significance. * For multilingual computational lexicons, ISLE is working to: extend EAGLES work on lexical semantics, necessary to establish inter-language links; design and propose standards for multilingual lexicons; develop a prototype tool to implement lexicon guidelines and standards; create exemplary EAGLES-conformant sample lexicons and tag exemplary corpora for validation purposes; and develop standardised evaluation procedures for lexicons.</Paragraph>
      <Paragraph position="1"> * For NIMM, a rapidly innovating domain urgently requiring early standardisation, ISLE work is targeted to develop guidelines for: the creation of NIMM data resources; interpretative annotation of NIMM data, including spoken dialogue in NIMM contexts; and annotation of discourse phenomena.</Paragraph>
      <Paragraph position="2"> * For evaluation, ISLE is working on: quality models for machine translation systems; and maintenance of previous guidelines - in an ISO based framework (ISO 9126, ISO 14598).</Paragraph>
      <Paragraph position="3"> Three Working Groups, and their subgroups, carry out the work, according to the already proven EAGLES methodology, with experts from both the EU and US, working and interacting within a strongly co-ordinated framework. International workshops are used as a means of achieving consensus and advancing work. Results will be widely disseminated and published, after due validation in collaboration with EU and US HLT R&amp;D projects, National projects, and industry.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.3. The Computational Lexicon
Working Group
</SectionTitle>
      <Paragraph position="0"> We concentrate in the following on the current position of the ISLE Computational Lexicon Working Group (CLWG).</Paragraph>
      <Paragraph position="1"> EAGLES work towards de facto standards has already allowed the field of Language Resources to establish broad consensus on key issues for some well-established areas -- and will allow similar consensus to be achieved for other important areas through the ISLE project -- thus providing a key opportunity for further consolidation and a basis for technological advance. Previous EAGLES results have already become de facto standards. To mention several key examples: the LE PAROLE/SIMPLE resources (morphological/syntactic/semantic lexicons and corpora for 12 EU languages, Ruimy et al., 1998, Lenci et al., 1999, Bel et al., 2000) rely on EAGLES results (Sanfilippo, A. et al., 1996 and 1999), and are now being enlarged at the national level through many National Projects; the ELRA Validation Manuals for Lexicons (Underwood and Navarretta, 1997) and Corpora (Burnard et al., 1997) are based on EAGLES guidelines; morpho-syntactic tagging of corpora in a very large number of EU, international and national projects -- and for more than 20 languages -- is conformant to EAGLES recommendations (Leech and Wilson, 1996).</Paragraph>
      <Paragraph position="2"> The first priority of the CLWG in the first phase of the ISLE project was to carry out a comprehensive survey of existing multilingual lexicons. To this end the European and the American members decided, among other things, i) to prepare a grid for lexicon description to classify the content and structure of the surveyed resources on the basis of a number of agreed parameters of description, and ii) to provide a list of cross-lingual lexical phenomena that could be used to focus the survey. The inventory (survey) of what exists and is available (semantic and bilingual/multilingual lexicons, printed bilingual dictionaries) is now being completed, and will soon be made available on the Web. Each participant undertook to survey a number of resources. A list of the main applications that use lexical resources was also established, to focus the survey and subsequent recommendations around them. Each summary of a particular bilingual or multilingual dictionary includes: i) a description of the surveyed dictionary structure (on the basis of the common grid), and ii) for one or two examples from the cross-lingual lexical phenomena, an explanation of how these examples are handled by this dictionary.</Paragraph>
      <Paragraph position="3"> 2 The structure of the prospective Multilingual ISLE Lexical Entry The main goal of the CLWG is the definition of a Multilingual ISLE Lexical Entry (henceforth MILE). This is the main focus of the second year of the project, the so-called "recommendation phase".</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Basic EAGLES principles
</SectionTitle>
      <Paragraph position="0"> We recall here just a few basic methodological principles derived from and applied in previous EAGLES phases. They have proven useful in the process of reaching consensual de facto standards in a bottom-up approach and will also form the basis of ISLE work.</Paragraph>
      <Paragraph position="1"> The MILE is envisaged as a highly modular and possibly layered structure, with different levels of recommendations. Such an architecture has proven useful in previous EAGLES work, e.g. in the EAGLES morphosyntactic recommendations (Monachini and Calzolari, 1996), which embody three levels of linguistic information: obligatory, recommended and optional (the optional level is further split into language-independent and language-dependent information). This modularity would enhance: the flexibility of the representation; the ease of customisation and integration of existing resources (developed under different theoretical frameworks or for different applications); the usability by different systems which need different portions of the encoded data; and the compliance with the proposed standards of partially instantiated entries as well.</Paragraph>
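      <Paragraph> The layered scheme described above can be illustrated with a minimal sketch. It assumes a toy feature inventory (the names FEATURE_LEVELS, conformance and is_conformant, and the features themselves, are hypothetical and not taken from the EAGLES recommendations) and shows how a partially instantiated entry can still be judged conformant as long as no obligatory feature is missing.

```python
from enum import Enum


class Level(Enum):
    OBLIGATORY = "obligatory"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


# Hypothetical feature inventory; the real EAGLES inventories differ.
FEATURE_LEVELS = {
    "pos": Level.OBLIGATORY,
    "gender": Level.RECOMMENDED,
    "number": Level.RECOMMENDED,
    "politeness": Level.OPTIONAL,
}


def conformance(entry: dict) -> dict:
    """Report, per level, which known features the entry leaves unfilled."""
    missing = {level: [] for level in Level}
    for feature, level in FEATURE_LEVELS.items():
        if feature not in entry:
            missing[level].append(feature)
    return missing


def is_conformant(entry: dict) -> bool:
    """A partial entry conforms if no obligatory feature is missing."""
    return not conformance(entry)[Level.OBLIGATORY]


partial = {"lemma": "casa", "pos": "noun"}
print(is_conformant(partial))
print(conformance(partial)[Level.RECOMMENDED])
```

Under this reading, a system that only needs part-of-speech information can consume the partial entry directly, while the report of missing recommended features tells a lexicon builder what remains to be encoded.</Paragraph>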
      <Paragraph position="2"> The MILE recommendations should also be very granular, in the sense of reaching a maximal decomposition into the minimal basic information units that reflect the phenomena we are dealing with. This principle was previously recommended and used to allow easier reusability or mappability into different theoretical or system approaches (Heid and McNaught, 1991): small units can be assembled, in different frameworks, according to different (theory/application dependent) generalisation principles. Such basic notions must be established before considering any system-specific generalisations, otherwise our work may be too conditioned by system-specific approaches. For example, 'synonymy' can be taken as a basic notion; however, the notion of 'synset' is a generalisation, closely associated with the WordNet approach. 'Qualia relations' are another example of a generalisation, whereas 'semantic relation' is a basic notion. Modularity is also a means to achieve better granularity.</Paragraph>
      <Paragraph position="3"> On the other hand, past EAGLES experience has shown that it is useful in many cases to accept underspecification with respect to recommendations for the representation of some phenomenon (and hierarchical structure of the basic notions, attributes, values, etc.), i) to allow for agreement on a minimal level of specificity, especially in cases where we cannot reach wider agreement, and/or ii) to enable mappability and comparability of different lexicons, with different granularity, at the minimal common level of specificity (or maximal generality). For example, the work on syntactic subcategorisation in EAGLES proved that it was problematic to reach agreement on a few notions, e.g. it seemed unrealistic to agree on a set of grammatical functions. This led to an underspecified recommendation, but nevertheless one that was useful.</Paragraph>
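      <Paragraph> One possible reading of comparison "at the minimal common level of specificity" is sketched below. It assumes that the basic notions are arranged in a simple tree; the hierarchy PARENT and the value names are invented for illustration and are not part of any EAGLES recommendation. Two lexicons that encode a relation at different granularity are then compared at the most specific notion that subsumes both values.

```python
# Hypothetical value hierarchy: child -> parent (None marks the root).
PARENT = {
    "semantic_relation": None,
    "synonymy": "semantic_relation",
    "meronymy": "semantic_relation",
    "part_meronymy": "meronymy",
    "member_meronymy": "meronymy",
}


def ancestors(value):
    """Return the chain from value up to the root, most specific first."""
    chain = []
    while value is not None:
        chain.append(value)
        value = PARENT[value]
    return chain


def common_level(a, b):
    """Most specific notion that subsumes both values."""
    seen = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in seen:
            return candidate
    raise ValueError("values share no common notion")


print(common_level("part_meronymy", "member_meronymy"))
print(common_level("part_meronymy", "synonymy"))
```

A lexicon that only distinguishes meronymy in general can thus still be aligned with one that distinguishes its subtypes, at the cost of comparing at the coarser level.</Paragraph>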
      <Paragraph position="4"> One of the first objectives of the CLWG will be to discover and list the (maximal) set of (minimal/more granular) basic notions needed to describe the multilingual level. This task will be facilitated by the survey of existing lexicons, accompanied by the analysis of the requirements of a few multilingual applications, and by the parallel analysis of typical multilingual complex phenomena. Most or part of these basic notions should already be included in previous EAGLES recommendations, and, with different distribution, in the existing and surveyed lexicons. We therefore have to revisit earlier linguistic layers (previous EAGLES work, essentially monolingual) to see what we need to change/add or what we can reuse for the multilingual layer. The multilingual layer thus depends on monolingual layers.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The MILE architecture
</SectionTitle>
      <Paragraph position="0"> The MILE is intended as a meta-entry, acting as a common representational layer for multilingual lexical resources. The key ideas underlying the design of a meta-entry can be summarized as follows. Different theoretical frameworks appear to impose different requirements on how lexical information should be represented. One way of tackling the issue of theoretical compatibility stems from the observation that existing representational frameworks mostly differ in the way pieces of linguistic information are mutually implied, rather than in the intrinsic nature of this information. To give a concrete example, almost all theoretical frameworks claim that lexical items have a complex semantic organization, but some of them try to describe it through a multidimensional internal structure (cf. the qualia structure in the Generative Lexicon, Pustejovsky 1995), others by specifying a network of semantic relations (cf. WordNet, Miller et al. 1990), and others in terms of argumental frames (cf. FrameNet, Baker et al. 1998; Lexical Conceptual Structures, Jackendoff 1992; etc.).</Paragraph>
      <Paragraph position="1"> A way out of this theoretical variation is to augment the expressive power of the lexical representation language both horizontally, i.e. by distributing the linguistic information over mutually independent "coding layers", and vertically, by further specifying the information conveyed by each such layer. This solution will help to resolve the issues raised by theoretical variation by defining a common level onto which different types of resources can be mapped without loss of information.</Paragraph>
      <Paragraph position="2"> This appears to be a necessary condition to guarantee an efficient re-use and interchange of lexical data, often coming from resources developed according to very different architectural and theoretical criteria.</Paragraph>
      <Paragraph position="3"> With respect to this issue, the MILE is designed to meet the following desiderata:  solutions.</Paragraph>
      <Paragraph position="4"> All these requirements serve the main purpose of making the lexical meta-entry open to task- and system-dependent parameterization.</Paragraph>
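      <Paragraph position="5"> The MILE is modular along at least three dimensions: A. the macrostructure and general architecture of the MILE; B. the microstructure of the MILE; C. the specific microstructure of the MILE word sense.</Paragraph>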
      <Paragraph position="6"> A. Modularity in the macrostructure and general architecture of the MILE - The following modules should at least be envisaged, referring to the macrostructure of a multilingual system:  1. Meta-information - versioning of the lexicon, languages, updates, status, project, origin, etc. (see e.g. OLIF (Thurmair, 2000), GENELEX).</Paragraph>
      <Paragraph position="7"> 2. Possible architecture(s) of bilingual/multilingual lexicon(s): we must analyse the interactions of the different modules, and the general structure in which they are inserted, both in the interlingua- and transfer-based approaches, and in possibly hybrid solutions. An open issue is also the relation between the source language (SL) and target language (TL) portions of a lexicon.</Paragraph>
      <Paragraph position="8">  B. Modularity in the microstructure of the MILE - The following modules should at least be envisaged, referring to the global microstructure of MILE: 1. Monolingual linguistic representation - this includes the morphosyntactic, syntactic, and semantic information characterizing the MILE in a certain language. It generally corresponds to the typology of information contained in existing lexicons, such as PAROLE-SIMPLE, (Euro)WordNet (EWN), COMLEX, and FrameNet. Following the general organization of computational lexicons like PAROLE-SIMPLE, which in turn instantiates the GENELEX framework (GENELEX, 1994), at the monolingual level the MILE sorts the linguistic information into three layers, for the morphological, syntactic and semantic dimensions respectively. Typologies of information to be part of this module include (not an exhaustive list):  speech relations (e.g. intelligent - intelligence; writer - to write). The expressive power of the semantic layer is of the utmost importance for the multilingual layer. A general issue discussed in ISLE concerns whether consensus has to be pursued at the generic level of "type" of information or also at the level of its "values" or actual ways of representation. The answer may be different for different notions, e.g. try to reach the more specific level of agreement also on values for types of meronymy, but not for types of ontology.</Paragraph>
      <Paragraph position="9"> 2. Collocational information - This module includes more or less typical and/or fixed syntagmatic patterns including the lexical head defined by the MILE, which can contribute to characterise its use, or to perform more subtle and/or domain specific characterisations. It includes at least:  previous EAGLES - is critical in a multilingual context both to characterise a word-sense in a more granular way and to make it possible to perform a number of operations, such as WSD or translation in a specific context. Here, synergies with the NSF-XMELLT project on multi-word expressions are exploited. First proposals for the representation of support verbs and noun-noun compounds in multilingual computational lexicons have been laid out, and are now being tested on some language pairs.</Paragraph>
      <Paragraph position="10"> 3. Multilingual apparatus - This represents the focal part of the CLWG activities; the group will concentrate its main effort on proposing a general framework for the expression of multilingual transfers. Some of the main issues at stake here are: * identify a typology of the most common cases of problematic transfer (this task was in fact partially performed during the survey phase of the project); * identify which conditions must be expressible and which transformation actions are necessary, in order to establish the correct multilingual mappings; * select which types of information these conditions must access in modules (1) and (2) above; * identify the various methods of establishing SL --&gt; TL equivalence; * examine the variability of granularity needed when translating into different languages, and the architectural implications of this.</Paragraph>
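      <Paragraph> The idea of transfer conditions that access monolingual and collocational information can be sketched as follows. This is an illustration only, not the CLWG framework: the class TransferRule, the rule list RULES, the context key object_type and the English to Italian example are all assumptions introduced here, and the only transformation action modelled is the choice of a target lemma.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TransferRule:
    source: str                        # SL lemma
    target: str                        # TL equivalent
    condition: Callable[[dict], bool]  # test over the SL context


# Hypothetical English -> Italian rules, ordered from most to least specific.
RULES = [
    TransferRule("know", "conoscere", lambda ctx: ctx.get("object_type") == "person"),
    TransferRule("know", "sapere", lambda ctx: True),  # default equivalent
]


def transfer(lemma: str, context: dict) -> Optional[str]:
    """Return the first TL equivalent whose condition holds in the context."""
    for rule in RULES:
        if rule.source == lemma and rule.condition(context):
            return rule.target
    return None


print(transfer("know", {"object_type": "person"}))
print(transfer("know", {"object_type": "fact"}))
```

Ordering the rules from specific to general keeps the default equivalent as a fallback; a fuller treatment would also need actions that restructure the target, which this sketch leaves out.</Paragraph>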
      <Paragraph position="11"> C. Modularity in the specific microstructure of the MILE word-sense (the word-sense is the basic unit at the multilingual level) - Senses should also have a modular structure (i.e. the above distinction between modules (B.1.) and (B.2.) must be understood at the word-sense level):  1. Coarse-grained (general purpose) characterisation in terms of prototypical properties, captured by the formal means in (B.1.) above, which serves to partition the meaning space into large areas and is sufficient for some NLP tasks.</Paragraph>
      <Paragraph position="12"> 2. Fine-grained (domain- or text-dependent) characterisation mostly in terms of collocational/syntagmatic properties (B.2.), which is especially useful for specific tasks, such as WSD and translation. Different types of information may thus have different operational specialisations.</Paragraph>
    </Section>
  </Section>
</Paper>