<?xml version="1.0" standalone="yes"?>
<Paper uid="C86-1100">
  <Title>TOWARD INTEGRATED DICTIONARIES FOR M(a)T: motivations and linguistic organisation</Title>
  <Section position="1" start_page="0" end_page="423" type="metho">
    <SectionTitle>
TOWARD INTEGRATED DICTIONARIES FOR M(a)T:
</SectionTitle>
    <Paragraph position="0"> In the framework of Machine (aided) Translation systems, two types of lexical knowledge are used, &quot;natural&quot; and &quot;formal&quot;, in the form of on-line terminological resources for human translators or revisors and of coded dictionaries for Machine Translation proper.</Paragraph>
    <Paragraph position="1"> A new organization is presented, which makes it possible to integrate both types in a unique structure, called &quot;fork&quot; integrated dictionary, or FID. A given FID is associated with one natural language and may give access to translations into several other languages.</Paragraph>
    <Paragraph position="2"> The FIDs associated with languages L1 and L2 contain all information necessary to generate coded dictionaries of M(a)T systems translating from L1 into L2 or vice-versa. The skeleton of a FID may be viewed as a classical monolingual dictionary, augmented with one (or several) bilingual dictionaries. Each item is a tree structure, constructed by taking the &quot;natural&quot; information (a tree) and &quot;grafting&quot; onto it some &quot;formal&quot; information. Various aspects of this design are refined and illustrated by detailed examples, several scenarios for the construction of FIDs are presented, and some problems of organization and implementation are discussed. A prototype implementation of the FID structure is under way in Grenoble.</Paragraph>
    <Paragraph position="3"> Key-words: Machine (aided) Translation, Fork Integrated Dictionary, Lexical Data Base, Specialized Languages for Linguistic Programming.</Paragraph>
    <Paragraph position="5"> Integrated Machine (aided) Translation (&quot;M(a)T&quot;) systems include two types of translator aids. First, there is a sort of translator &quot;workstation&quot;, relying on a text processing system augmented with special functions and giving access to one or several &quot;natural&quot; on-line &quot;lexical resources&quot; [4,7], such as dictionaries, terminology lists or data banks, and thesauri. This constitutes the Machine Aided Human Translation (&quot;MAHT&quot;) aspect. Second, there may be a true Machine Translation (&quot;MT&quot;) system, whose &quot;lingware&quot; consists of &quot;coded&quot; grammars and dictionaries. This is the (human aided) MT aspect, abbreviated as &quot;HAMT&quot;, or simply &quot;MT&quot;, because human revision is necessary even more for machine translations than for human translations.</Paragraph>
    <Paragraph position="6"> The term &quot;coded&quot; doesn't only mean that MT grammars and dictionaries are written in Specialized Languages for Linguistic Programming (&quot;SLLP&quot;), but also that the grammatical and lexical information they contain is of a more &quot;formal&quot; nature. In some systems, the formal lexical information is a reduction (and perhaps an oversimplification) of the information found in usual dictionaries. But, in all sophisticated systems, it is far more detailed, and relies on some deep analysis of the language. Moreover, the access keys may be different: classical dictionaries are accessed by lemmas, while formal dictionaries may be accessed by morphs (roots, affixes...), lemmas, lexical units, and even other linguistic properties. In many systems written in ARIANE-78 [1], lemmas are not directly used.</Paragraph>
    <Paragraph position="7"> Efforts have been made to devise data base systems for the natural or the formal aspect, separately.</Paragraph>
    <Paragraph position="8"> Multilingual terminological data bases, such as TERMIUM [8] or EURODICAUTOM, illustrate the first type.</Paragraph>
    <Paragraph position="9"> On the other hand, the Japanese and the French National MT projects have developed specialized lexical data base systems (&quot;LEXDB&quot;), in which the (formal) information is entered, and from which MT dictionaries are produced. More precisely, there is a data base for each language (L), and for each pair of languages (L1,L2) handled by the MT system. From the first LEXDB, analysis and synthesis MT dictionaries for L are automatically constructed, while transfer dictionaries for (L1,L2) are produced from the second.</Paragraph>
    <Paragraph position="10"> In an integrated M(a)T system, it would be useful to maintain the two types of dictionaries in a unique structure, in order to ensure coherency. This structure would act as a &quot;pivot&quot;, being the source of the &quot;natural&quot; view as well as of the &quot;formal&quot; dictionaries. Moreover, it would be interesting, for the same reasons, to reduce the number of LEXDBs. With the technique mentioned above, there are n**2 of them for n languages (n data bases for the languages and n(n-1) for the ordered pairs of languages).</Paragraph>
    <Paragraph position="11"> The authors began research along those lines in 1982 [6]. In 1985, this led to a tentative (small-scale) implementation of a first prototype, adapted to the aims of a Eurotra contract.</Paragraph>
    <Paragraph position="12"> At the time of revision of this paper, work on specification and implementation was being continued by a small team trying to construct a Japanese-French-English LEXDB, for a particular domain. This is why some details given in this paper are already obsolete. However, the spirit has remained the same.</Paragraph>
    <Paragraph position="13"> The main idea of the new organization is to integrate both types of dictionaries in a unique structure, called &quot;fork&quot; integrated dictionary, or &quot;FID&quot;. A given FID is associated with one natural language and may give access to translations into several other languages.</Paragraph>
    <Paragraph position="14"> Hence, there would be only n FIDs for n languages. The form of the &quot;natural&quot; part has been designed to reflect the organization of current modern usual dictionaries.</Paragraph>
    <Paragraph position="15"> This is why we have limited ourselves to the &quot;fork&quot; architecture, and have not attempted to construct a unique structure for n languages.</Paragraph>
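    <Paragraph>The difference in the number of structures to maintain can be checked with a small sketch. It assumes that the separate-LEXDB organization needs one data base per language plus one per ordered pair of distinct languages, and that the fork organization needs one FID per language; the function names are ours and purely illustrative.

```python
def separate_lexdb_count(n: int) -> int:
    """Data bases needed with one LEXDB per language and one per ordered pair.

    Assumption: pairs are ordered, i.e. (L1, L2) and (L2, L1) are distinct.
    """
    monolingual = n
    bilingual = n * (n - 1)  # ordered pairs of distinct languages
    return monolingual + bilingual  # algebraically equal to n ** 2


def fid_count(n: int) -> int:
    """Fork integrated dictionaries needed: one per language."""
    return n


if __name__ == "__main__":
    for n in (2, 3, 9):
        print(n, separate_lexdb_count(n), fid_count(n))
```

Under these assumptions, three languages would require nine separate data bases but only three FIDs.</Paragraph>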
    <Paragraph position="16"> In the first part, we present the &quot;skeleton&quot; of a FID item. Part II shows how to &quot;graft&quot; codes onto it, and discusses the nature and place of those codes. Finally, some problems of organization and implementation are discussed in part III. An annex gives a complete example for the lemmas associated with the lexical unit COMPTER.</Paragraph>
    <Paragraph position="17"> I. USING A &quot;NATURAL&quot; SKELETON. After having studied the structures of several classical dictionaries, including LOGOS, LAROUSSE, ROBERT, HARRAP'S, WEBSTER, SACHS, etc., we have proposed a standard form for the &quot;natural skeleton&quot; of a FID item. Items are accessed by the lemmas, but the notion of lexical unit (&quot;LU&quot;, or &quot;UL&quot; in French) is present. Lemmas are &quot;normal forms&quot; of words (in English, infinitive for verbs, singular for nouns, etc.). A lexical unit is the main element of a derivational family, and is usually denoted by the main lemma of this family. Lexical units are useful in MT systems, for paraphrasing purposes.</Paragraph>
    <Paragraph position="18"> constr 2 : QUANTIFIE sens 3 : def &quot;unité de pression&quot; ex &quot;une pression de 2 atmosphères&quot;</Paragraph>
    <Paragraph position="20"> def &quot;décider, préparer avec calcul&quot; ex &quot;le pharmacien avait prémédité la rupture&quot; There are three types of elements in the examples. Keywords are underlined. They show the articulation of the standard structure. In case of repetition at the same level, numbers are used (e.g. trad 1).</Paragraph>
    <Paragraph position="21"> Identifiers are in uppercase (and should be in italic, but for the limitations of our printer). They correspond to the list of abbreviations which is usually placed at the beginning of a classical dictionary. They may contain some special signs such as &quot;.&quot; or &quot;-&quot;. Strings are shown between double quotes. They correspond to the data. We use our &quot;local&quot; transcription, based on ISO-025 (French character set).</Paragraph>
  </Section>
  <Section position="2" start_page="423" end_page="424" type="metho">
    <SectionTitle>
2. FORM OF AN ITEM
</SectionTitle>
    <Paragraph position="0"> 2.1. Keys, lemmas, lexical units. As illustrated above, an item may consist of several lemmas, because of possible ambiguities between two canonical forms (e.g. LIGHT-noun and LIGHT-adjective).</Paragraph>
    <Paragraph position="1"> The corresponding LU is always given. The symbol &quot;--&quot; stands for the key of the item. Confusion should be avoided in the denotation of LUs. For example, for lemmas LIGHT, we could denote the LU corresponding to the first (the noun) by &quot;-- lm 1&quot; or &quot;-- CL N.&quot;. 2.2. Constructions, refinements, meanings. The preceding items have been chosen for their relative simplicity. In general, a lemma may lead to several constructions, a construction to several refinements, each defined as a &quot;meaning&quot;, for lack of a better word.</Paragraph>
    <Paragraph position="2"> Further refinements may be added, to select various translations for a given meaning. The following diagram illustrates the idea.</Paragraph>
    <Paragraph position="3"> [Diagram: tree structure of an item. From left to right: key, lemma, construction, refinement, meaning/translation; each meaning/translation node branches into constructions of the target languages ANG, RUS and ALM.] Intuitively, constraints are more local to the left than to the right. The presence of a construction may be tested in a sentence, but the notion of domain of discourse or of level of language is obviously more global.</Paragraph>
    <Paragraph position="4"> The notion of construction is fundamental. In particular, predicative words cannot be translated in isolation, and it is necessary to translate expressions of the form P(x,y,z), P being the predicate and x, y, z its arguments, possibly with conditions on the arguments. Note that idioms or locutions are particular forms of constructions.</Paragraph>
    <Paragraph position="5"> In general, refinements may be local or global. Local refinements often consist in restrictions on the semantic features of the arguments (&quot;to count on somebody&quot; vs. &quot;to count on something&quot;). Global refinements concern the domain, the style (level of discourse), or the typology (abstract, bulletin, article, check-list...). In our view, a meaning in L1 is translated by one or several constructions in L2.</Paragraph>
    <Paragraph position="6"> We have thus avoided translating a meaning by a meaning, which might seem more logical. But this would have forced us to describe the corresponding cascade of constraints in L2. As a matter of fact, it is usually possible to reconstruct it, from the constraints in L1 and contrastive knowledge about L1 and L2. Hence, we follow the practice of usual dictionaries.</Paragraph>
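    <Paragraph>The cascade from key to translations can be pictured with a minimal data-structure sketch. It is not the notation of the paper: the class names, field names and sample glosses are hypothetical, and only the nesting (key, lemma, construction, refinement, meaning, target-language constructions) follows the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Meaning:
    definition: str
    # target language identifier (e.g. "ANG") -> constructions in that language
    translations: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Refinement:
    condition: str  # local (argument features) or global (domain, style, typology)
    meanings: List[Meaning] = field(default_factory=list)


@dataclass
class Construction:
    pattern: str
    refinements: List[Refinement] = field(default_factory=list)


@dataclass
class Lemma:
    form: str
    lexical_unit: str
    constructions: List[Construction] = field(default_factory=list)


@dataclass
class Item:
    key: str
    lemmas: List[Lemma] = field(default_factory=list)


def collect_translations(item: Item, target: str) -> List[str]:
    """Walk the tree from the key and gather all constructions for one target."""
    found: List[str] = []
    for lemma in item.lemmas:
        for construction in lemma.constructions:
            for refinement in construction.refinements:
                for meaning in refinement.meanings:
                    found.extend(meaning.translations.get(target, []))
    return found


def example_item() -> Item:
    """Illustrative item only; the glosses are not taken from the paper's annex."""
    somebody = Refinement(
        "argument is SOMEBODY",
        [Meaning("rely on a person", {"ANG": ["to count on somebody"]})],
    )
    something = Refinement(
        "argument is SOMETHING",
        [Meaning("expect a thing", {"ANG": ["to count on something"]})],
    )
    construction = Construction("compter sur X", [somebody, something])
    lemma = Lemma("compter", "COMPTER", [construction])
    return Item("compter", [lemma])
```

Note that a meaning points to constructions of the target language, not to meanings, which mirrors the choice explained above.</Paragraph>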
    <Paragraph position="7"> 2.3. Translations: &quot;fork&quot; dictionaries. We have shown how to include in an item its translations into several target languages. Hence the term &quot;fork&quot;. The &quot;handle&quot; of the item consists in all information concerning the source language (L1). In order for such an organization to work, we must have at least 2 such dictionaries, for L1 and L2, as no detailed information about L2 is included in the L1-based dictionary. This information may be found in the L2-based dictionary, by looking up the appropriate item and locating the construction: the path from the key to the construction contains it.</Paragraph>
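    <Paragraph>The look-up in the L2-based dictionary can be sketched as follows. This is only an illustration under simplifying assumptions: a FID is reduced to a mapping from keys to (path, construction) pairs, and the English fragment, the path labels and the function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]
Fid = Dict[str, List[Tuple[Path, str]]]

# Hypothetical fragment of an English-based FID: key -> (path, construction).
ENGLISH_FID: Fid = {
    "count": [
        (("lemma COUNT (verb)", "argument: SOMEBODY"), "to count on somebody"),
        (("lemma COUNT (verb)", "argument: SOMETHING"), "to count on something"),
    ],
}


def locate(fid: Fid, key: str, construction: str) -> Optional[Path]:
    """Return the path from the key to the construction, or None if absent.

    The path carries the detailed L2 information that the L1-based
    dictionary deliberately does not store.
    """
    for path, candidate in fid.get(key, []):
        if candidate == construction:
            return path
    return None


if __name__ == "__main__":
    print(locate(ENGLISH_FID, "count", "to count on something"))
```

A French-based item would then only need to record the target key and construction; everything else is recovered from the English-based FID.</Paragraph>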
  </Section>
  <Section position="3" start_page="424" end_page="425" type="metho">
    <SectionTitle>
3. FACTORIZATION AND REFERENCE
</SectionTitle>
    <Paragraph position="0"> As seen in the examples, we introduce some possibilities of naming subparts of a given lemma, by simply numbering them (sens 3 refers to trad 1 in &quot;atmosphère&quot;).</Paragraph>
    <Paragraph position="1"> This makes it possible not only to factorize some information, such as translations, but also to defer certain parts of the item. For example, translations might be grouped at the end of the (linear) writing of an item. The same can be said of the formal part of the information (see below).</Paragraph>
    <Paragraph position="3"> The formalized information may correspond to several distinct linguistic theories. Such a theory is defined by a set of formal attributes, each of a well-defined type.</Paragraph>
    <Paragraph position="4"> For example, the morphosyntactic class might be defined as a scalar attribute: CATMS = (VERB, NOUN, ADJECTIVE, ADVERB, CONJUNCTION, etc.). The gender might be defined as a set attribute: GENDER = ens (MASCULIN, FEMININ, NEUTRE).</Paragraph>
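    <Paragraph>The distinction between scalar and set attributes can be made concrete with a short sketch. It assumes that a scalar attribute takes exactly one value and that a set attribute takes any subset of its values; the Python names are ours, and CATMS is limited here to the values explicitly listed in the text.

```python
from enum import Enum
from typing import FrozenSet, Iterable


class Catms(Enum):
    """Scalar attribute: an item carries exactly one of these values."""
    VERB = "VERB"
    NOUN = "NOUN"
    ADJECTIVE = "ADJECTIVE"
    ADVERB = "ADVERB"
    CONJUNCTION = "CONJUNCTION"


class Gender(Enum):
    """Values of the set attribute GENDER."""
    MASCULIN = "MASCULIN"
    FEMININ = "FEMININ"
    NEUTRE = "NEUTRE"


def gender_set(names: Iterable[str]) -> FrozenSet[Gender]:
    """Build a value of the set attribute; unknown names raise KeyError."""
    return frozenset(Gender[name] for name in names)


if __name__ == "__main__":
    category = Catms["NOUN"]
    genders = gender_set(["MASCULIN", "FEMININ"])
    print(category.name, sorted(g.name for g in genders))
```

A set-valued gender is convenient for a noun that accepts both masculine and feminine agreement, whereas the class of a given lemma remains a single value.</Paragraph>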
    <Paragraph position="5"> Each theory may give rise to several implementations (lingwares), each of them having a particular notation for representing these attributes and their values. Moreover, in a given lingware, the information relative to an item may be distributed among several components, such as analysis, transfer and synthesis dictionaries. Usually, combinations of particular properties (or attribute/value pairs) are given names and called classes. For example, in ARIANE-78, there are the &quot;morphological&quot; and &quot;syntactic&quot; &quot;formats&quot;, abbreviated as FTM and FTS, in the AM (morphological analysis) dictionaries. Special questionnaires, called &quot;indexing charts&quot;, lead to the appropriate class, by asking global questions (vs. one particular question for each possible attribute).</Paragraph>
    <Paragraph position="6"> 1.2. Form of what is grafted. In the simplest case, there is one theory, and one corresponding lingware. The grafted part will be of the form: app info (properties in the theory) code (codes: classes and possibly basic properties). The keyword app means &quot;appended&quot;.</Paragraph>
    <Paragraph position="7"> In a less simple case, there might be two theories of French, called A and B. Suppose that there is an analyzer, FR1, and a synthesizer, FRA, corresponding to A, and two analyzers and a synthesizer (FR2, FR3, FRB), relative to B. The grafted part will be of the form:</Paragraph>
    <Paragraph position="9"> &quot;AM&quot; must be known as an introducer of codes for morphological analysis in ARIANE-78-based lingwares.</Paragraph>
    <Paragraph position="10"> Formal parts may be attached at all levels of an item, for factorization purposes. The information is supposed to be cumulated along a path from a key to a &quot;meaning&quot; or to a translation. If two pieces of information are contradictory, the most recent one (rightmost in our diagrams) takes precedence.</Paragraph>
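    <Paragraph>The cumulation rule just stated can be sketched in a few lines. This is a minimal illustration, assuming that the formal part attached to each node is a flat set of attribute/value pairs; the function name and the sample values are hypothetical.

```python
from typing import Dict, Iterable


def cumulate(path: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge the formal parts met from the key (first) to the meaning (last).

    When two nodes give contradictory values for the same attribute,
    the one met later on the path (rightmost in the diagrams) wins.
    """
    result: Dict[str, str] = {}
    for node_info in path:
        result.update(node_info)  # later values replace earlier ones
    return result


if __name__ == "__main__":
    formal_parts = [
        {"CATMS": "NOUN", "GENDER": "MASCULIN"},  # attached to the lemma
        {},                                       # nothing on the construction
        {"GENDER": "FEMININ"},                    # attached to a refinement
    ]
    print(cumulate(formal_parts))
```

Attaching a property high in the tree therefore factorizes it for every meaning below, while a deeper node can still override it locally.</Paragraph>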
    <Paragraph position="11"> Taking again the example of systems written in ARIANE-78, we may suggest distributing the codes in the following fashion. One could attach: - the morphological codes (FTM) and the &quot;morphs&quot; to the roots (&quot;bases&quot;) or to the lemmas; - the &quot;local&quot; syntaxo-semantic codes (FTS) to the lemmas or to the constructions; - the &quot;global&quot; syntactic codes (concerning the typology) to the various levels of refinement; - the codes concerning the derivations to the deriv parts, wherever they appear in the item.</Paragraph>
  </Section>
  <Section position="4" start_page="425" end_page="425" type="metho">
    <SectionTitle>
3. CONSTRUCTION OF INTEGRATED DICTIONARIES
</SectionTitle>
    <Paragraph position="0"> Suppose the natural skeleton of an item is obtained by using available dictionaries. There are two main methods for constructing the app parts.</Paragraph>
    <Paragraph position="1"> First, one may begin by filling the info parts. This is the technique followed by the two afore-mentioned national projects. For this, people without special background in computational linguistics may be employed. They fill in questionnaires (on paper or on screen) asking questions directly related to the formal attributes. This information is checked and inserted in the info parts at the proper places, which are determined by knowing the relation between the &quot;natural&quot; information and the &quot;theory&quot;.</Paragraph>
    <Paragraph position="2"> In a second stage, programs knowing the relation between the theory and a particular lingware will fill the code parts.</Paragraph>
    <Paragraph position="3"> The second method tries to make better use of existing MT dictionaries. First, the relation between the elements of a lingware and the &quot;natural&quot; system is defined, and programs are constructed to extract the useful information from the MT dictionaries and to distribute it at the appropriate places. Then, knowing the relation between the &quot;coded&quot; information and the theory, info parts may be constructed or completed.</Paragraph>
    <Paragraph position="4"> At the time this paper was revised, M. DYMETMAN was implementing such a program to construct a FID from our current Russian-French MT system. His results and conclusions should be the theme of a forthcoming paper.</Paragraph>
    <Paragraph position="5"> Inconsistencies may be detected at various stages in the construction of a FID, and the underlying DB (data base) system must provide facilities for constructing checks, using them to locate incorrect parts, and modifying the item.</Paragraph>
    <Paragraph position="6"> III. PROBLEMS OF DESIGN AND IMPLEMENTATION. The construction of an implemented &quot;mock-up&quot; has led us to identify some problems in the design, to wonder whether there is any available DBMS (data base management system) adequate for our purposes, and to ask what should be done about the representation of characters, in a multilingual setting.</Paragraph>
    <Paragraph position="7"> 1. RELATION BETWEEN NATURAL AND FORMAL INFORMATION. The relation between the formal information of a theory and the formal information of an implemented model of it (a lingware) is simple: the latter is a notational variant of (a subset of) the former.</Paragraph>
    <Paragraph position="8"> By contrast, it is not so easy to define and use the relation between a formal theory and the &quot;natural&quot; information. The theory might ignore some aspects, such as phonology, or etymology, while it would use &quot;semantic&quot; categories (such as COUNTABLE, TOOL, HUMAN, PERSONNIFIABLE, CONCRETE, ABSTRACT...) far more detailed than the &quot;natural&quot; ones (SOMEBODY, SOMETHING...). In order for the construction of such a FID to be possible, we must at least require that all &quot;selective&quot; information, which guides the choice of a meaning and of a translation, be in some sense common to the natural and the formal systems.</Paragraph>
    <Paragraph position="9"> Hence, these systems must share a certain degree of homogeneity. Dictionaries containing very little grammatical information (e.g. only the class) cannot be used as skeletons for FIDs integrating the lexical data base of a (lexically) sophisticated MT system.</Paragraph>
    <Paragraph position="10"> Another problem is just how to express the relation between the systems, in such a way that it is possible: to reconstruct (part of) the skeleton of an item from the &quot;coded&quot; information; to compute (part of) the formal information on a path of the skeleton.</Paragraph>
    <Paragraph position="11"> For the time being, we can write ad hoc programs to perform these tasks, for a particular pair of systems, but we have no satisfactory way to &amp;quot;declare&amp;quot; the relation and to automatically generate programs from it.</Paragraph>
  </Section>
  <Section position="5" start_page="425" end_page="425" type="metho">
    <SectionTitle>
2. TYPE OF UNDERLYING DATA-BASE SYSTEM
</SectionTitle>
    <Paragraph position="0"> P. Vauquois (a son of B. Vauquois) and D. Bachut have implemented the above-mentioned mock-up in Prolog-CRISS, a dialect of Prolog which provides facilities for the manipulation of &quot;banks&quot; of clauses. It is possible to represent directly the tree structure of an item by a (complex) term, making it easy to program the functions associated with a FID directly in Prolog.</Paragraph>
    <Paragraph position="1"> However, Prolog is not a DBMS, and, at least with the current implementations of Prolog, a large scale implementation would be very expensive to use (in terms of time and space), or perhaps even impossible to realize.</Paragraph>
    <Paragraph position="2"> As FIDs would certainly grow to at least 50000 items (perhaps to 200000 or more), it might be preferable to implement them in a commercially available DBMS, such as DL1, SOCRATE, etc. A numeric simulation made by E. de goussineau shows that a (1-2) FID of about 100000 lemmas could be implemented in a Socrate DB, of the network type, in one or two &quot;virtual spaces&quot;. No experiment has yet been conducted to evaluate the feasibility of the method and its cost.</Paragraph>
    <Paragraph position="3"> Other possibilities include relational and specialized DBMS. In a relational DBMS, each Socrate entity would give rise to a relation. Specialized DBMS have been developed for terminological data banks, such as TERMIUM or EURODICAUTOM. There is a general tool for building terminological DB, ALEXIS [3].</Paragraph>
  </Section>
  <Section position="6" start_page="425" end_page="426" type="metho">
    <SectionTitle>
3. CHARACTER SETS
</SectionTitle>
    <Paragraph position="0"> None of the above-mentioned systems provides facilities for handling multilingual character sets.</Paragraph>
    <Paragraph position="1"> Hence, all strings representing units of the considered natural languages, including the keys, must be represented by appropriate transcriptions.</Paragraph>
    <Paragraph position="2"> This is clumsy for languages written in the Roman alphabet, and almost unacceptable for other languages, alphabetical or ideographic. Supposing that bit-map terminals and printers are available, two solutions may be envisaged: define appropriate ASCII or EBCDIC transcriptions, and equip the DBMS with corresponding interfaces; modify the DBMS itself to represent and handle several (possibly large) character sets. This is what has been done in Japan, where programming languages, text processing systems and operating systems have been adapted to the 16-bit JIS (or JES) standard.</Paragraph>
  </Section>
  <Section position="7" start_page="426" end_page="426" type="metho">
    <SectionTitle>
CONCLUSION
</SectionTitle>
    <Paragraph position="0"> We have presented and illustrated the new concept of FID, or Fork Integrated Dictionary. To our knowledge, this is the first attempt to unify classical and MT dictionaries. However, only a small mock-up has been implemented, and some problems of design and implementation have been detected. It remains to be seen whether large scale FIDs can be constructed and used in an operational setting.</Paragraph>
  </Section>
</Paper>