<?xml version="1.0" standalone="yes"?> <Paper uid="C90-2006"> <Title>Towards Personal MT: general design, dialogue structure, potential role of speech</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> Towards Personal MT: </SectionTitle> <Paragraph position="0"> general design, dialogue structure, potential role of speech</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Christian BOITET GETA, IMAG Institute (UJF &amp; CNRS) </SectionTitle> <Paragraph position="0"/> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Personal MT (PMT) is a new concept in dialogue-based MT (DBMT), which we are currently studying and prototyping in the LIDIA project. Ideally, a PMT system should run on PCs and be usable by everybody.</Paragraph> <Paragraph position="1"> To get his/her text translated into one or several languages, the writer would agree to cooperate with the system in order to standardize and clarify his/her document. There are many interesting aspects in the design of such a system. The paper briefly presents some of them (HyperText, distributed architecture, guided language, hybrid transfer/interlingua), then goes on to study in more detail the structure of the dialogue with the writer and the place of speech synthesis [1].</Paragraph> <Paragraph position="2"> A first classification of MAT (Machine Aided Translation) systems is by user. &quot;Classical&quot; MAT systems are for the watcher, for the revisor (post-editor), or for the translator. A new concept is that of &quot;personal MT&quot;, or MAT for the writer.</Paragraph> <Paragraph position="3"> MT for the watcher appeared in the sixties. 
Its purpose is to provide informative rough translations of large amounts of unrestricted texts for the end user.</Paragraph> <Paragraph position="4"> MT for the revisor appeared in the seventies. It aims at producing raw translations good enough to be revised by professionals in a cost-effective way. This implies that the system needs to be specialized for a certain sublanguage. For a system to be cost-effective, it is generally agreed that at least 20000 pages must be handled (e.g. 10000 pages/year for at least 2 years).</Paragraph> <Paragraph position="5"> Leaving &quot;heavy MT&quot;, not adapted to small volumes of heterogeneous texts, several firms have developed MAT systems for translators, in the form of tools (e.g.</Paragraph> <Paragraph position="6"> Mercury-Termex™), or of integrated environments (e.g.</Paragraph> <Paragraph position="7"> Alps TSS™).</Paragraph> <Paragraph position="8"> The concept of MT for the author (writer/speaker) has recently crystallized, building on previous studies on interactive MT, text critiquing and dialog structures [5, 6, 7, 9, 12]. Its aim is to provide high quality translation/interpretation services to end users with no knowledge of the target languages or linguistics.</Paragraph> <Paragraph position="9"> A second classification of MAT systems is by the types of knowledge felt to be central to their functioning. Linguistic Based MT uses : core knowledge about the language ; specific knowledge about the corpus (domain, typology) ; intrinsic semantics (a term coined by J.P. Desclés to cover all information formally marked in a natural language, but which refers to its interpretation, such as semantic features or relations : concreteness, location, cause, instrument...) ; but not : extrinsic semantics (static knowledge describing the domain(s) of the text, e.g. 
in terms of facts and rules) ; situational semantics (describing the dynamic situations and their actors) ; pragmatics (overt or covert intentions in the communicative context).</Paragraph> <Paragraph position="10"> Knowledge-Based MT uses extralinguistic knowledge on top of linguistic knowledge. Finally, Dialogue-Based MT insists on extracting knowledge from a human (the author or a specialist). These options are not exclusive, however. In KBMT-89 [7], for example, ambiguities persisting after using linguistic and extralinguistic knowledge are solved through a dialogue with the writer initiated by the &quot;augmenter&quot;. In ATR's Machine Interpretation project, the dialogues center around a well-defined task (organization of international conferences), but may also concern extraneous matters (cultural events, health problems...). This feature, added to the enormous ambiguity inherent in speech input, will likely force such systems to be dialogue-based as well as knowledge-based [5]. In Personal MT, we may rely on some core extralinguistic knowledge base, but not on any detailed expertise, because the domains and types of text should be unrestricted. Hence, Personal MT must be primarily dialogue-based.</Paragraph> <Paragraph position="11"> A third classification of MAT systems is by their internal organization (direct/transfer/interlingua, use of classical or specialized languages, procedurality / declarativeness...) through which so-called &quot;generations&quot; have been distinguished. This level of detail will not be too relevant in this paper.</Paragraph> <Paragraph position="12"> I. A project in Personal MT 1. 
Goals LIDIA (Large Internationalization of the Documents by Interacting with their Authors) aims at studying the theoretical and methodological issues of the PMT approach, to be experimented on by first building a small prototype, and more generally at promoting this concept within the MT community.</Paragraph> <Paragraph position="13"> We are trying to develop an architecture which would be suitable for very large applications, to be upscaled later with industrial partners if results are promising enough. For example, we don't intend to incorporate more than a few hundred or thousand words in the prototype's (LIDIA-1) dictionaries, although we try to develop robust indexing schemes and to implement the lexical data base in a way which would allow supporting on the order of 1 to 10 Mwords in 10 languages. The same goes for the grammars.</Paragraph> <Paragraph position="14"> Even in a prototype, however, the structure of the dialogue with the author must be studied with care, and offers interesting possibilities. Clearly, the writer should be allowed to write freely, and to decide for himself when and on which part of his document to start any kind of interaction. But changes in the text should be controlled so that not all changes would force the system to start the interaction anew.</Paragraph> <Paragraph position="15"> From a linguistic point of view, it is extremely exciting to see, at last, a possibility to experiment with Zemb's theme/rheme/pheme &quot;statutory&quot; articulation of propositions [13], and/or Prague's topic/focus opposition, which are claimed to be of utmost importance for translation : both are almost impossible to compute automatically, because the tests are very often expressed in terms of possible transformations in a given discourse context. But, in PMT, we may ask the author.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. 
Outline </SectionTitle> <Paragraph position="0"> The prototype system for LIDIA-1 is constrained as follows.</Paragraph> <Paragraph position="1"> Translation from French into Russian, German and English (the reverse direction of previous systems), with other target languages being studied in cooperative frameworks ; Small corpus from the Ariane-G5 user interface (containing some on-line documentation), in HyperCard form ; Distributed computer architecture : writer workstation on a Macintosh (Plus or SE), MT server on a mini (IBM-4361) ; Guided Language approach, as opposed to Free Text or Controlled Language ; Linguistic architecture : hybrid Transfer/Interlingua.</Paragraph> <Paragraph position="2"> HyperText The choice of HyperCard reflects the fact that Hypertexts are becoming the favorite supports for technical documentation. It also relies on the assumption that writers will more readily agree to participate in a dialogue if the tool they are using is very interactive than if they use a more classical text processor. Finally, there are some linguistic advantages. First, the textual parts are clearly isolated in fields, and not cluttered with images, formulas, tabs, markups, etc. Scripts should not be translated -- if they generate messages, these must be taken from normal fields, and not directly generated (linguistic requirements may lead to better programming practices!).</Paragraph> <Paragraph position="3"> Second, the textual parts may be typed, thus greatly facilitating analysis. For example, a given field may contain only titles, another only menu items, another only sentences without the initial subject (which is often contained in another field), etc. 
A distinct possibility is to define microlanguages as types of very short textual fragments (less than 2 or 3 lines, to be concrete), and to define sublanguages as structured collections of microlanguages for longer textual fragments.</Paragraph> <Paragraph position="4"> Distributed architecture The idea of using a distributed architecture has both a practical and a theoretical basis. First, we want to use the Ariane-G5 system, a comprehensive generator of MT systems developed over many years [11]. Although some micros can support this system (PC-AT/370, PS2/7437), their user-friendliness and availability are no match for those of the Mac.</Paragraph> <Paragraph position="5"> Second, looking at some other experiences (Alps, Weidner), we have concluded that some parts of sophisticated natural language processing cannot be performed in real time on small and cheap machines without oversimplifying the linguistic parts and degrading quality down to near uselessness. Rather, it should be possible to perform the &quot;heavy&quot; parts in an asynchronous but still user-friendly way, as IBM researchers have done for the Critique system [9]. Of course, this idea could be implemented on a single machine running under a multitasking operating system, if such a system were available on the most popular micros, and provided the heavy linguistic computations don't take hours.</Paragraph> <Paragraph position="6"> Guided Language The &quot;guided language approach&quot; is a middle road between free and controlled text. The key to quality in MT, as in other areas of AI, is to restrict the domain in an acceptable way.</Paragraph> <Paragraph position="8"> By &quot;controlled language&quot;, we understand a subset of natural language restricted in such a way that ambiguities disappear. That is the approach of the TITUS system : no text is accepted unless it completely conforms to one predefined sublanguage. 
While this technique works very well in a very restricted domain, with professionals producing the texts (technical abstracts in textile, in this case), it seems impossible to generalize it to open-ended uses involving the general public.</Paragraph> <Paragraph position="9"> What seems possible is to define a collection of microlanguages or sublanguages, to associate one with each unit of translation, and to induce the writer/speaker to conform to it, or else to choose another one.</Paragraph> <Paragraph position="10"> Hybrid Transfer/Interlingua By &quot;hybrid Transfer/Interlingua&quot;, we mean that the interface structures produced by analysis are multilevel structures of the source language, in the sense of Vauquois [4, 11, see also 2, 3], where some parts are universal (logico-semantic relations, semantic features, abstract time, discourse type...), while others are language-specific (morphosyntactic class, gender, number, lexical elements, syntactic functions...). In PMT, because of the necessity of lexical clarification, we should go one step further toward interlingua by relating the &quot;word senses&quot; of the vocabularies of all the languages considered in the system and making them independent objects in the lexical data base.</Paragraph> <Paragraph position="11"> II. Structure of the dialogue with the terminology and style Hence, the first interaction planned in LIDIA concerns typology : given a stack, the system will first construct a &quot;shadow&quot; file. For each textual field, it will ask its typology (microlanguage for very small texts, sublanguages for others), and attach it to the corresponding shadow record. 
In the case of &quot;incomplete&quot; texts, where for example the subject of the first sentence is to be taken from another field (as in tables containing command names and their explanations), it will ask how to construct a complete text for translation, and attach the corresponding rule to the shadow record.</Paragraph> <Paragraph position="12"> The second level of interaction concerns spelling.</Paragraph> <Paragraph position="13"> Any spellchecker will do. However, it would be best to use a lemmatizer relying on the lexical database of the system, as the user must be allowed to enter new words and will expect a coherent behavior of the entire system. Level three concerns terminology. The lexical database should contain thesaurus relations, indicating among other things the preferred term among a cluster of (quasi-)synonyms (e.g. plane/aircraft/ship/plane). Which term is preferred often depends on local decisions : it should be easy to change it for a particular stack, without of course duplicating the thesaurus. Note that the lexical database should contain a great variety of terms, even incorrect or dubious, whereas terminological databases are usually restricted to normalized or recommended terms. In PMT, we only want to guide the author : if s/he prefers to use a non standard term, that should be allowed.</Paragraph> <Paragraph position="14"> Level four concerns style, understood in a simply quantitative way (average length of sentences, frequency of complex conjuncts/disjuncts, rare verbal forms, specific words like dont in French, relative frequency of nouns/articles, etc.). From the experience of CRITIQUE [9], it seems that such methods, which work in real time, may be very useful as a first step to guide towards the predetermined text types (micro- or sub-languages).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. 
Interactions concerning syntax, </SectionTitle> <Paragraph position="0"> semantics and pragmatics Until now, the system has worked directly with the text as written by the author. For the remaining types of interaction, it will work on a transcription contained in the shadow record, as well as with some intermediate forms of processing stored in associated records of the shadow file. This forces the system to lock the original textual field (unless the author decides to change it and agrees to start again from level two).</Paragraph> <Paragraph position="1"> Level five concerns the fixed forms. It is quite usual, especially in technical documentation, that some groups of words take a fixed meaning in certain contexts, with specific, non-compositional translations. For example, &quot;Save as&quot; as a menu item Save as... is translated in French as Enregistrer sous..., and not as &quot;Sauver comme&quot;, which would be correct for other uses. As a menu item, this group functions as a proper noun, not as a verbal phrase. The writer should be asked whether a given occurrence of each such group is to be treated as fixed or not. In the first case, an adequate transcription should be generated in the shadow record (&quot;&amp;FXD_Save as&quot;, for example). Certain elements (such as menu items) should be automatically proposed for insertion in the list.</Paragraph> <Paragraph position="2"> Level six concerns lexical clarification. First, polysemies are to be solved by asking the writer. For example, the word &quot;diplôme&quot; is not ambiguous in French. However, if translating from French into English, 2 possibilities should be given : &quot;diplôme non terminal&quot; (&quot;diploma&quot;) or &quot;diplôme terminal&quot; (&quot;degree&quot;). Some polysemies are source language specific, some depend on the target languages. 
We want to treat them in a uniform way, by maintaining in the lexical database the collection of all &quot;word senses&quot; (&quot;acceptions&quot;, not really concepts of an ontology as in KBMT-89), linked by disambiguating questions/definitions to the words/terms of the languages supported by the system.</Paragraph> <Paragraph position="3"> Lexical ellipses can also be treated at that level. This problem is particularly annoying in MT. Suppose a text is about a space ship containing a &quot;centrale électrique&quot; (&quot;electric plant&quot;) and a &quot;centrale inertielle&quot; (&quot;inertial guidance system&quot;). The complete form is often replaced by the elided one: &quot;centrale&quot;. Although it is vital to</Paragraph> </Section> </Paper>