<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1602"> <Title>Coedition to share text revision across languages and improve MT a posteriori</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Christian BOITET GETA, CLIPS, IMAG </SectionTitle> <Paragraph position="0"> 385 rue de la Bibliotheque, BP 53 38041 Grenoble cedex 9, France</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> TSAI Wang-Ju GETA, CLIPS, IMAG </SectionTitle> <Paragraph position="0"> 385 rue de la Bibliotheque, BP 53 38041 Grenoble cedex 9, France</Paragraph> </Section> <Section position="4" start_page="0" end_page="1" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Coedition of a natural language text and its representation in some interlingual form seems the best and simplest way to share text revision across languages. For various reasons, UNL graphs are the best candidates in this context. We are developing a prototype where, in the simplest sharing scenario, naive users interact directly with the text in their language (L0), and indirectly with the associated graph. The modified graph is then sent to the UNL-L0 deconverter and the result shown. If it is satisfactory, the errors were probably due to the graph, not to the deconverter, and the graph is sent to deconverters in other languages. Versions in some other languages known by the user may be displayed, so that improvement sharing is visible and encouraging. As new versions are added with appropriate tags and attributes in the original multilingual document, nothing is ever lost, and cooperative working on a document is rendered feasible. On the internal side, liaisons are established between elements of the text and the graph by using broadly available resources such as an L0-English or, better, an L0-UNL dictionary, a morphosyntactic parser of L0, and a canonical graph2tree transformation. 
Establishing a &quot;best&quot; correspondence between the &quot;UNL-tree+L0&quot; and the &quot;MS-L0 structure&quot;, a lattice, may be done using the dictionary and trying to align the tree and the selected trajectory with as few crossing liaisons as possible. A central goal of this research is to merge approaches from pivot MT, interactive MT, and multilingual text authoring.</Paragraph> <Paragraph position="1"> Keywords: revision sharing, interlingual representation, text / UNL coedition, multilingual communication</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> Resume </SectionTitle> <Paragraph position="0"> La coedition d'un texte en langue naturelle et de sa representation dans une forme interlingue semble le moyen le meilleur et le plus simple de partager la revision du texte vers plusieurs langues. Pour diverses raisons, les graphes UNL sont les meilleurs candidats dans ce contexte. Nous developpons un prototype ou, dans le scenario avec partage le plus simple, des utilisateurs &quot;naifs&quot; interagissent directement avec le texte dans leur langue (L0), et indirectement avec le graphe associe. Le graphe modifie est ensuite envoye au deconvertisseur UNL-L0 et le resultat est affiche. S'il est satisfaisant, les erreurs etaient probablement dues au graphe et non au deconvertisseur, et le graphe est envoye aux deconvertisseurs vers d'autres langues. Les versions dans certaines autres langues connues de l'utilisateur peuvent etre affichees, de sorte que le partage de l'amelioration soit visible et encourageant. Comme les nouvelles versions sont ajoutees dans le document multilingue original avec des balises et des attributs appropries, rien n'est jamais perdu, et le travail cooperatif sur un meme document est rendu possible. 
Du cote interne, des liaisons sont etablies entre des elements du texte et du graphe en utilisant des ressources largement disponibles comme un dictionnaire L0-anglais, ou mieux L0-UNL, un analyseur morphosyntaxique de L0, et une transformation canonique de graphe UNL a arbre. On peut etablir une &quot;meilleure&quot; correspondance entre &quot;l'arbre-UNL+L0&quot; et la &quot;structure MS-L0&quot;, une treille, en utilisant le dictionnaire et en cherchant a aligner l'arbre et une trajectoire avec aussi peu que possible de croisements de liaisons. Un but central de cette recherche est de fusionner les approches de la TA par pivot, de la TA interactive, et de la generation multilingue de texte.</Paragraph> <Paragraph position="1"> Mots-cles: revision partagee, representation interlingue, coedition texte / UNL, communication multilingue Introduction Creating and maintaining aligned multilingual documents is a growing necessity. In current practice, a multilingual document consists of many parallel monolingual files, which may be technical documentation as well as help files, message files, or simply thematic information put on the web and intended for a multilingual audience (medicine, cooking, travel...). The task is difficult even for a document managed in a centralized manner.</Paragraph> <Paragraph position="2"> Usually, it is first created in a single source language, and translated into several target languages. There must be a way to keep track of modifications, possibly done at various places on different linguistic versions. From time to time, somebody has to decide which modifications to integrate into the next release of the document. For that, modifications done in target languages have to be translated back into the source language. 
The new and the old source versions are then compared using (fuzzy) matching techniques, so that only really new segments are sent for translation.</Paragraph> <Paragraph position="3"> The problem is even more acute if the documents are not managed centrally, since the monolingual files are then often in various formats (Word, EgWord, Interleaf, FileMaker, DBMS formats, etc.).</Paragraph> <Paragraph position="4"> A. Assimi [1, 2] has shown how to &quot;realign&quot; parallel decentralized documents and apply the methodology sketched above. However, in both cases, human translators have to retranslate the modified or new source segments, or to revise them if they are retranslated by a quality MT system.</Paragraph> <Paragraph position="5"> Contrary to what is often said, quality MT exists, but for specific contexts only (see [14]).</Paragraph> <Paragraph position="6"> What we would like to do is to make it possible to share the revision work across languages, whatever the domain and the context. It is clearly impossible to reflect changes on a file in language L0 into files in L1, ..., Ln automatically and faithfully, without any intermediate structure to bridge the gap, because that would necessitate at least a perfect fine-grained aligner in case of changing articles or common nouns (provided the gender and number stay the same in each Li version). When a verb is replaced by another with a different valency frame in a target Li, the sentence in Li would have to be reanalyzed, transformed accordingly, and regenerated without introducing any new error or imprecision, thereby keeping the manual improvements coming from previous manual revisions. 
Or we would need a more than perfect MT system, namely one which would be able to analyze the changed utterance in L0, and to transfer and generate it into a sentence of Li as close as possible to the previous sentence in Li, which again could have been improved manually before.</Paragraph> <Paragraph position="7"> The best and simplest way to go seems to be to use some formalized interlingua IL and to (1) reflect the modifications from L0 to the IL, (2) regenerate into L1, ..., Ln from the IL.</Paragraph> <Paragraph position="8"> We should also allow for direct manual improvements, considering that the IL form will not always be present, or not always improvable enough for lack of expressivity, or that generators will never be perfect. We choose UNL [3, 4, 10, 11] as our IL for various reasons: (1) it is specifically designed for linguistic and semantic machine processing, (2) it derives with many improvements from H. Uchida's pivot used in ATLAS-II (Fujitsu) [13], still evaluated as the best quality MT system for English-Japanese, with a large coverage (586,000 lexical entries in each language), (3) participants in the UNL project (http://unl.ias.unu.edu) have built &quot;deconverters&quot; from UNL into about 12 languages, and at least the Arabic, Indonesian, Italian, French, Russian, Spanish, and Thai deconverters were accessible for experimentation through a web interface at the time of writing, (4) although formal, UNL graphs (see below) are quite easy to understand with little training and may be presented in a &quot;localized&quot; way to naive users by translating UNL symbols (semantic relations, attributes) and lexemes (UWs) into symbols and lexemes of their language, (5) the UNL project has defined a format embedded in html for files containing a complete multilingual document aligned at the level of utterances, and produced a &quot;visualizer&quot; transforming a UNL file into as many html files as languages, and sending them to any web 
browser.</Paragraph> <Paragraph position="9"> The UNL representation of a text is a list of &quot;semantic graphs&quot;, each expressing the meaning of a natural language utterance. Nodes contain lexical units and attributes; arcs bear semantic relations.</Paragraph> <Paragraph position="10"> Connected subgraphs may be defined as &quot;scopes&quot;, so that a UNL graph may be a hypergraph.</Paragraph> <Paragraph position="11"> The lexical units, called Universal Words (UW), represent (sets of) word meanings, something less ambitious than concepts. Their denotations are built to be intuitively understood by developers knowing English, that is, by all developers in NLP. A UW is an English term or special symbol (number...) possibly completed by semantic restrictions: the UW &quot;process&quot; represents all word meanings of that lemma, seen as citation form (verb or noun here), and &quot;process(icl>do, agt>person)&quot; covers only the meanings of processing, working on, etc.</Paragraph> <Paragraph position="12"> The attributes are the (semantic) number, genre, time, aspect, modality, etc., and the 40 or so semantic relations are traditional &quot;deep cases&quot; such as agent, (deep) object, location, goal, time, etc.</Paragraph> <Paragraph position="13"> One way of looking at a UNL graph corresponding to an utterance in language L is to say that it represents the abstract structure of an equivalent English utterance &quot;seen from L&quot;, that is, where semantic attributes not necessarily expressed in L may be absent (e.g., aspect coming from French, determination or number from Japanese, etc.).</Paragraph> <Paragraph position="14"> We will first present scenarios of increasing internal complexity for the situation where somebody reads a UNL document in her language, corrects it, and wants the corrections to carry over to the corresponding fragment in other languages. 
We will then study more precisely the correspondence between a text in language L0 and its representation in UNL, and show the advantage of breaking it into 3 parts: text - morpho-syntactic lattice or chart - abstract &quot;UNL-tree&quot; - UNL graph. Finally, we present the current status of this work: an experimentation web site, a method to establish the second part of the correspondence, and related research.</Paragraph> <Paragraph position="15"> 1. Scenarios for sharing revision across languages Suppose a collection of multilingual documents is stored on a server as multilingual files in UNL-html format, or in any other form, e.g. in a database, provided (1) it is possible to easily produce the version in any language contained in the document, (2) the versions are aligned at the level of utterance-like segments (a segment may contain more than 1 utterance), (3) UNL-graphs may be stored and aligned with the segments. Here is a slightly simplified example of a file in UNL-html format.</Paragraph> <Paragraph position="17"> Ich lief gestern im Park. {/de} {es dtime=20020130-2031, deco=UNL-SP} Yo corri ayer en el parque.{/es} {fr dtime=20020131-0805, deco=UNL-FR} J'ai couru dans le parc hier. {/fr}[/S]</Paragraph> <Paragraph position="19"> Mein Hund bellte zu mir.{/de} {fr dtime=20020131-0806, deco=UNL-FR} Mon chien aboya pour moi. [/S] [/P][/D] &lt;/BODY&gt;&lt;/HTML&gt; The French versions have been produced automatically, the German and Chinese manually. The output of the UNL viewer for French is: and will probably be displayed by a browser as: Example 1 El/UNL J'ai couru dans le parc hier. Mon chien aboya pour moi.</Paragraph> <Paragraph position="20"> and similarly for all other languages. In all scenarios, the user is reading the text in the normal display, not seeing any tags, and wants to make some modification, such as moving &quot;hier&quot; after &quot;couru&quot; and changing &quot;pour&quot; to &quot;vers&quot;. 
Activating some button or menu item, she enters a revision interface.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 1.1 Multiple revision without sharing </SectionTitle> <Paragraph position="0"> In this first scenario, we do not assume that there are UNL graphs associated with the segments. The problem is to transmit and add the user's modifications to the original form of the multilingual document. That cannot be done by editing the html documents displayed, because they have no links to the original form. The UNL-html format predates XML, hence the special tags like [S] and {unl}, but we may transform it into an equivalent &quot;UNL-xml&quot; format. Then, using DOM and JavaScript, it is possible to produce various views: that of a viewer, a bilingual or multilingual editable presentation, and a revision (coedition) interface.</Paragraph> <Paragraph position="1"> This is an example from an experiment performed for the &quot;Forum Barcelona 2004&quot; on Spanish, Italian, Russian, French and Hindi.</Paragraph> <Paragraph position="2"> Hindi and Russian are not shown, but Japanese has been added by hand. The XML form is simplified.</Paragraph> <Paragraph position="3"> Correct sentences are produced by the deconverters from correct and complete UNL graphs. We suppose here that the UNL graph has been produced from a Chinese version, and does not contain definiteness and aspectual information. Now all results are wrong wrt articles, and some wrt</Paragraph> <Paragraph position="5"> <unl:el> After a Forum, a city will retrieve a coastal zone.</unl:el> <unl:es> Ciudad recobrara una zona de costal despues Foro. </unl:es> <unl:fr> Une cite retrouvera une zone cotiere apres un forum. </unl:fr> <unl:it> Citta ricuperara une zona costiera dopo Forum. </unl:it> In the second scenario, there is a UNL graph associated with the modified segment. 
In order to share the revisions across languages, we should reflect them on the UNL graph, e.g.</Paragraph> <Paragraph position="6"> * add &quot;.@def&quot; on the nodes &quot;city&quot; and &quot;Forum&quot;. * replace &quot;retrieve&quot; by &quot;recover&quot; and add &quot;.@complete&quot; on the node containing it.</Paragraph> <Paragraph position="7"> It is not possible in principle to deduce the modification on the graph from a modification on the text. For example, replacing &quot;un&quot; (&quot;a&quot;) by &quot;le&quot; (&quot;the&quot;) does not entail that the following noun is determined (.@def), because it can also be generic (&quot;il aime la montagne&quot; = &quot;he likes mountains&quot;). Hence, the technique envisaged is as follows: * revision is not done by directly modifying the text, but by using a menu system, * the menu items have a &quot;language side&quot; and a hidden &quot;UNL side&quot;, * when a menu item is chosen, only the graph is transformed, and the action to be done on the text is stored and shown next to its focus,</Paragraph> <Paragraph position="8"> * at any time, the new graph may be sent to the L0 deconverter and the result shown. 
If it is satisfactory, that shows that errors were due to the graph and not to the deconverter, and the graph may be sent to deconverters in other languages.</Paragraph> <Paragraph position="9"> Versions in some other languages known by the user may be displayed, so that improvement sharing is visible and encouraging.</Paragraph> <Paragraph position="10"> New versions will be added with appropriate tags and attributes in the multilingual document in UNL-xml format, or in a DBMS, so that nothing is lost, and cooperative working on a document is feasible.</Paragraph> <Paragraph position="11"> 1.3 Revision on more than the texts For the above method to work, the text has to be preprocessed, at least by segmenting, lemmatizing, and computing morpho-syntactic classes (POS and actualization attributes), to avoid many spurious menus. Because we want our technique to be widely applicable, this preprocessing should be such that it can be performed by large coverage tools freely available for many languages. That is the case for morphosyntactic analyzers (MSA), but not yet for full or even shallow parsers.</Paragraph> <Paragraph position="12"> We also propose that the revision interface should allow access not only to the texts, but to editable representations of the UNL graph, of the result of the MSA, and of any other available structure such as a tree derived from the UNL graph.</Paragraph> <Paragraph position="13"> [Figure: revision interface with buttons Quit, Save, Multiple text view, Simple text view]</Paragraph> <Paragraph position="15"> [Figure, annotation rows under the text: indef art, noun, verb, indef art, noun, adj, prop, indef art, noun / sin, sin, future, sin, sin, sin, sin, sin] For users not wanting to see anything other than text, the previous scenario will always be usable. 
But there are good reasons to &quot;open the black box&quot;: (1) the UNL Spanish group has successfully experimented with an interface for interactive UNL graph creation using an MSA and a graph editor showing the UNL graph in a &quot;localized&quot; way (symbols and lexemes appear in Spanish), (2) it is sometimes much quicker to change something on another representation than on a text: for example, to merge two nodes in order to change &quot;Mary likes Mary's daughter&quot; into &quot;Mary likes her daughter&quot;, (3) it may even be necessary, if the correspondence is faulty and cannot be improved because the text is very far from any reasonable deconversion obtainable from the graph, (4) user interface technology has made much progress, and offers tools to build user-friendly direct manipulation environments, (5) last but not least, the younger generation manipulates complex interfaces very naturally and expertly, far better than its elders!</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 1.4 What can and cannot be done </SectionTitle> <Paragraph position="0"> We identify 4 common types of errors in the corpus we have analysed so far: (1) graphs containing false information: wrong attachment, wrong choice of UW, wrong attribute, wrong semantic relation..., (2) graphs with missing information, as above, (3) absence of text because the UNL graph is formally incorrect (due to some wrong human manipulation, some bug in a deconverter...): missing parenthesis, missing entry node in a scope, disconnected graph..., (4) deconversion errors.</Paragraph> <Paragraph position="1"> Our method can be used for correcting the first 2 types of errors only. If a graph is formally incorrect, it may be displayable or not. In the first case, it should be possible to manipulate and correct it graphically, e.g. by connecting 2 disconnected parts or choosing an entry node. 
In the second case, it is necessary to work on a textual representation. If errors come from the deconverter, the user may still correct the text by hand (last zone).</Paragraph> <Paragraph position="2"> 2. Establishing a text-graph correspondence</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 The nature of correspondences </SectionTitle> <Paragraph position="0"> The correspondence between a text and a UNL graph may be decomposed into less complex liaisons, which are often not simple links, even between words and nodes. We found the following types in this case.</Paragraph> </Section> <Section position="5" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Division in 3 subcorrespondences </SectionTitle> <Paragraph position="0"> We have already begun to break down the correspondence into 2 parts: text - MS-structure - UNL graph. The MS structure may always be embedded in a loop-free graph with information on the nodes (lattice) or on the arcs (chart), so that the first part of the correspondence is made of liaisons between substrings of the text (not necessarily always contiguous) and elements (nodes or arcs) on the trajectory corresponding to the preferred interpretation (in case of ambiguity).</Paragraph> <Paragraph position="1"> It is perhaps possible to compute a direct correspondence between the MS lattice and the UNL graph, but it is not clear how to represent the liaisons between phrases and subgraphs. For that purpose, a tree structure is far better. Because there is no available large-scale and free syntactico-semantic analyzer for the vast majority of languages, we cannot even use a tree produced by a shallow parser. 
But it is possible to associate a &quot;standard UNL-tree&quot; to any UNL graph by a reversible algorithmic transformation [3, 4, 10]: start at the outer entry node, and traverse the graph and its scopes (subgraphs) recursively, thereby creating auxiliary nodes for scopes, &quot;inverse&quot; semantic relations for arcs in the &quot;wrong&quot; direction, and coindexing symbols to represent reentrancy without duplication.</Paragraph> <Paragraph position="2"> We can also take advantage of having one more structure by enriching it with lexical units of L0.</Paragraph> <Paragraph position="3"> Now the correspondence is broken into 3 parts: produced by modifying the standard reversible graph2tree transformation).</Paragraph> <Paragraph position="4"> Another advantage of introducing this tree structure is that the correspondences between strings and abstract trees have been much studied [5, 15, 16]. They can be encoded within the trees by 2 attributes expressing what a node covers lexically (SNODE) and as root of a subtree (STREE).</Paragraph> <Paragraph position="5"> 3. Current status and related research</Paragraph> </Section> <Section position="6" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Experimental platform </SectionTitle> <Paragraph position="0"> We have implemented a web site called SWIIVRE-</Paragraph> </Section> </Section> </Paper>