<?xml version="1.0" standalone="yes"?>
<Paper uid="C82-1004">
  <Title>AN INTEGRATED SYSTEM FOR AUTOMATED TRANSLATION AND HUMAN REVISION</Title>
  <Section position="3" start_page="19" end_page="19" type="metho">
    <SectionTitle>
IMPLEMENTATION AND CONVERSATIONAL ENVIRONMENT OF ARIANE 78.4
</SectionTitle>
    <Paragraph position="0"> In each step, the linguistic data may be of four different kinds : - grammatical &quot;variables&quot; (like gender, number, semantic type), which define the attributes present on each node of a (linguistic) tree structure ; - classes, describing useful combinations of values of variables ; - dictionaries ; - grammars, containing the rules and the strategy for using them. They are expressed in a metalanguage. Their syntax and coherence are first checked by the corresponding compiler, which then generates a compact intermediate code. At run-time, this code is interpreted.</Paragraph>
    <Paragraph position="1"> The conversational monitor ARIANE is a transparent interface with the user. It handles the data-base of linguistic files and of texts, with their intermediate results. The entire system exists in two versions (french and english). Any user space is implemented as a virtual machine (under VM/CMS), and may support any number of analyses, transfers and generations. A &quot;source language code&quot; (a &quot;target language code&quot;) is associated with each analysis (generation), and a pair &quot;source code-target code&quot; with each transfer.</Paragraph>
    <Paragraph position="2"> Once a user has logged in, he (or she) types in &quot;ARIANE&quot; and enters level 0 of the monitor, which is constructed as a hierarchy of subenvironments, with corresponding menus. At any time, entering a null line pops to the next higher environment, if any, and &quot;STOP&quot; exits from any depth. A trace of the session is always constructed and may be printed or discarded (the normal type of terminal is a screen).</Paragraph>
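The navigation scheme just described can be sketched as follows. This is an illustrative reconstruction in Python, not ARIANE's code; the function and the environment names are invented for the example.

```python
# Hypothetical sketch of ARIANE's menu navigation: a stack of nested
# environments, where an empty input line pops one level and "STOP"
# exits from any depth.
def navigate(commands, root="ARIANE level 0"):
    """Return the sequence of environments visited for the given inputs."""
    stack = [root]          # level 0 of the monitor
    visited = [root]
    for cmd in commands:
        if cmd == "STOP":   # exit from any depth
            return visited + ["<exit>"]
        if cmd == "":       # null line: pop to the next higher environment
            if len(stack) > 1:
                stack.pop()
            visited.append(stack[-1])
        else:               # enter a subenvironment (e.g. PRTXT)
            stack.append(cmd)
            visited.append(cmd)
    return visited
```

For instance, entering PRTXT, then a null line, then STOP visits level 0, PRTXT, level 0 again, and then exits.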
    <Paragraph position="3"> ARIANE handles certain global variables, like the dialog mode (brief or detailed), the source and/or target language codes, or the name of the current corpus (a homogeneous set of texts with the same method of structuring). The system asks for their values if there is no default, and they may be explicitly assigned by the user. All environments provide a help function. Hitting '?' gives a brief explanation, and 'DET' a more complete one (not all environments need it, however).</Paragraph>
    <Paragraph position="4"> These features enable users with no particular computer science background to use the system. The only things to know are the standard editor (EDGAR or XEDIT), and the metalanguages of the different components.</Paragraph>
    <Paragraph position="5">  switches between brief and detailed mode.</Paragraph>
    <Paragraph position="6"> resets the language code(s).</Paragraph>
    <Paragraph position="7"> resets the current corpus name.</Paragraph>
    <Paragraph position="8"> is the same as LG + INDGTXT.</Paragraph>
    <Paragraph position="9"> 2 - Information on the data base The information asked may concern the linguistic data, the corpora, the texts, the (partial) results and the lexical units.</Paragraph>
    <Paragraph position="10"> Linguistic data LIENS gives all analyses, transfers and generations known to ARIANE in the user space.</Paragraph>
    <Paragraph position="11"> 22 CH. BOITET, P. GUILLAUME and M. QUEZEL-AMBRUNAZ PRET visualizes the state of any or all steps of these applications. A given step must be completely compiled in order to execute it.</Paragraph>
    <Paragraph position="12"> LISGEN (LISVAR) lists all the linguistic files (only the variables) of a given application, from one step to another, e.g., with &quot;AS TS&quot;, from structural analysis to structural transfer.</Paragraph>
    <Paragraph position="13"> Corpora and texts LINDG, LTLG and LTOT list respectively all corpus names, the names of all texts in a given corpus, or the names of all texts in all corpora.</Paragraph>
    <Paragraph position="14"> LTXIG and LTXTIG list respectively all texts of a given corpus, or all texts of all corpora. It is not unusual to have more than a dozen corpora and some hundreds of texts in one of them.</Paragraph>
    <Paragraph position="15"> Results of different processings RESULT allows printing a rough translation and/or its revision, with or without the source text. The output is formatted (using the standard SCRIPT system), and contains the date when the rough translation was obtained (if any), the dates (of compilation) and language codes of the analyses, transfers and generations used, as well as the date of the last (manual) revision, if any. The rough output itself may not be altered manually.</Paragraph>
    <Paragraph position="16"> RESGT indicates, for given language code(s), for one (or all) corpus, and for a given step (AS, TS, GS, GM or RV - for revision), the names of the texts having a corresponding result. All intermediate results (not the translations or revisions, however) are erased when the corresponding linguistic data is modified. The rough translation (not the revision) is erased when the source text is modified. If TRAD or REVIS is used instead of GM or RV, RESGT acts like a global RESULT, which allows printing all results of a given type, for one or all corpora, in one command.</Paragraph>
    <Paragraph position="17"> Lexical units LTULS (or LTULC) gives tables indicating where the source (or target) lexical units have been referenced. For instance, it is possible to know, in the case when a given generation is shared by several applications (e.g. russian-french, english-french), for which source languages a given target lexical unit has been an equivalent in one of the transfer dictionaries.</Paragraph>
  </Section>
  <Section position="4" start_page="19" end_page="19" type="metho">
    <SectionTitle>
3 - Global actions
</SectionTitle>
    <Paragraph position="0"> DUPLG allows copying a given application (or part of it) onto another one, which may or may not already exist. This is of course very useful during the development of a project.</Paragraph>
    <Paragraph position="1"> DESTRUC is the opposite function, and allows (with a lot of warnings!) erasing a given application (or part of it).</Paragraph>
    <Paragraph position="2"> ELIMIG allows erasing an entire corpus, with all related results, in much the same way.</Paragraph>
    <Paragraph position="3"> III - PREPARATION OF THE TEXTS This environment is called PRTXT. Its subcommands are either proper or general. The data base of texts is divided into corpora. For each corpus, there is an associated structuring method, defined by a hierarchy of separators.</Paragraph>
  </Section>
  <Section position="5" start_page="19" end_page="19" type="metho">
    <SectionTitle>
IMPLEMENTATION AND CONVERSATIONAL ENVIRONMENT OF ARIANE 78.4
</SectionTitle>
    <Paragraph position="0"> Usual interpretations are in terms of sections, paragraphs and sentences, but no interpretation is forced on the user. Hence, a text always appears with an associated tree-like structure of decomposition.</Paragraph>
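A structuring method of this kind - an ordered hierarchy of separators producing a decomposition tree - can be sketched as follows. This is an illustrative reconstruction, not ARIANE's implementation; the separator choices are arbitrary examples.

```python
# Sketch of a corpus "structuring method": an ordered hierarchy of
# separators turns a flat text into a tree (e.g. sections containing
# paragraphs containing sentences).
def structure(text, separators):
    """Recursively split `text` by the first separator, then structure
    each non-empty piece with the remaining separators."""
    if not separators:
        return text.strip()          # leaf: lowest-level unit
    head, rest = separators[0], separators[1:]
    return [structure(part, rest) for part in text.split(head) if part.strip()]
```

For example, with separators `["|", ";"]`, the text `"a|b;c"` decomposes into two top-level units, the second of which contains two sub-units.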
    <Paragraph position="2"> CREAT, ELIM and MODIF are respectively used for creating, erasing or modifying a text (under a full screen editor).</Paragraph>
    <Paragraph position="3"> REGROUP allows grouping several texts into one.</Paragraph>
    <Paragraph position="4"> CARTES is used to enter texts via a tape or a card reader.</Paragraph>
    <Paragraph position="5"> DESCRT is used to display or to modify the &amp;quot;descriptor&amp;quot; of the current corpus, which contains the ordered list of the separators defining the structuring method of the corpus.</Paragraph>
    <Paragraph position="6"> LISTE, with subcommands LITEX, LTDES, LIOCC, allows printing a text, in a formatted or unformatted way, its tree structure (deduced from the structuring method), and the sorted list of its &quot;occurrences&quot; (&quot;words&quot;, or &quot;forms&quot;).</Paragraph>
  </Section>
  <Section position="6" start_page="19" end_page="19" type="metho">
    <SectionTitle>
2 - General subcommands
</SectionTitle>
    <Paragraph position="0"> Under this environment, it is also possible to ask global queries relative to the corpora and the texts : LINDG, LTLG, LTOT, LTXIG, LTXTIG (see above).</Paragraph>
    <Paragraph position="1"> IV - PREPARATION OF THE LINGUISTIC DATA There are 6 subenvironments, denoted by PR&lt;name of phase&gt; : PRAM, PRAS, PRTL, PRTS,  PRGS, PRGM. The subcommands are either a global action, or some acronym for a subset of the components of the linguistic data.</Paragraph>
    <Paragraph position="2"> 1 - Commands relative to the components of the linguistic data These components vary with the metalanguage of the given step. In general, they comprise : - the declaration of the &amp;quot;variables&amp;quot;, which would be better called &amp;quot;attribute types&amp;quot;. The fundamental construct, in any given step, is a decoration type (a hierarchical collection of attributes). Each node of a tree structure bears a decoration of the declared type. The &amp;quot;variables&amp;quot; (&amp;quot;attribute types&amp;quot;) are either non-terminal, and correspond to subdecorations, or elementary, and may then be of types akin to PASCAL scalar, set and integer types. See \[9\] or \[11\] for examples. We use DV (DVM and DVS in AM, where there are two sets of variables).</Paragraph>
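The notion of a decoration type - each tree node bearing a collection of scalar-, set- and integer-valued attributes - can be rendered as follows. This is a hypothetical illustration; the attribute names (gender, number, semantic types) are taken from the paper's examples, but the structure is our reconstruction, not the metalanguage's.

```python
# Hypothetical rendering of a "decoration type": each node of a
# linguistic tree carries attributes ("variables") whose types are
# akin to PASCAL scalar, set and integer types.
from dataclasses import dataclass, field

@dataclass
class Decoration:
    gender: str = "none"                              # scalar-valued
    number: str = "none"                              # scalar-valued
    semantic_types: set = field(default_factory=set)  # set-valued
    position: int = 0                                 # integer-valued

@dataclass
class Node:
    label: str
    decoration: Decoration
    children: list = field(default_factory=list)
```

A node of the tree then pairs a label with its decoration, and subtrees hang off `children`.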
    <Paragraph position="3"> - formats, which are constants of such record structures. Linguistically speaking, a format corresponds to a &quot;class&quot; (a combination of certain values of the variables). We use FTM, FTS and FTSG in AM, and FAF elsewhere.</Paragraph>
    <Paragraph position="4"> - boolean procedures, which may express any boolean conditions on the values of the variables of one or several decorations. We use PCP as acronym.</Paragraph>
    <Paragraph position="5"> - assignment procedures, which express assignments (of values of variables) from one set of decorations to another. We use the name PAF.</Paragraph>
    <Paragraph position="6"> - dictionaries (DIC). All phases (AM, TL, GM) using dictionaries may declare several dictionaries.</Paragraph>
    <Paragraph position="7"> - grammars (GR).</Paragraph>
    <Paragraph position="8">  We now give the components which are expected for each step of the translation process.</Paragraph>
    <Paragraph position="9"/>
    <Paragraph position="11"> The division into different dictionaries is used for strategic purposes. At TL, for instance, it is possible to choose, before execution, any ordered subset of the present dictionaries. The induced priorities are used for the choice of equivalents.</Paragraph>
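The priority mechanism can be sketched as a simple lookup in an ordered list of dictionaries. This is an assumption about the intended behaviour, not ARIANE's code; the example entries are invented.

```python
# Sketch of dictionary priorities at TL: the user selects an ordered
# subset of dictionaries, and the first one containing the lexical
# unit supplies its equivalent.
def choose_equivalent(lexical_unit, ordered_dictionaries):
    """Look the unit up in each selected dictionary, in priority order."""
    for dic in ordered_dictionaries:
        if lexical_unit in dic:
            return dic[lexical_unit]
    return None  # unknown word: no equivalent found
```

With a domain-specific dictionary placed before a general one, the domain equivalent wins whenever both contain the unit.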
    <Paragraph position="12"> The following subcommands apply to any of these components : M (modify) calls the editor and deletes the invalidated compiled files.</Paragraph>
    <Paragraph position="13"> V (visualize) calls the editor, but no modification is allowed.</Paragraph>
    <Paragraph position="14"> L (list) prints the data, with a variety of options.</Paragraph>
    <Paragraph position="15"> C (compile) compiles the data, with a variety of options.</Paragraph>
    <Paragraph position="16"> For the dictionaries, a number of predefined sorting options are defined, and the</Paragraph>
    <Paragraph position="18"> In any PRXX environment, LISTEF will print all the data, with or without the dictionaries. In PRAM, PRTL and PRGM, the commands CMPDIC and EFDIC are used to compile all the dictionaries, or to erase some of them. In PRGM, the commands CMPGRAN and EFGRAM are used in much the same way for the grammars.</Paragraph>
    <Paragraph position="20"> There are 6 subenvironments, denoted by EX&lt;name of phase&gt; : EXAM, EXAS, EXTL, EXTS, EXGS, EXGM. They are used to debug the translation process up to the indicated phase. If some previous intermediate result already exists for the processed text, the system asks the user whether or not to use it as a starting point. The possible subcommands are relative to the preparation of a text (typically, a short text for testing a particular phenomenon), or to the execution proper.</Paragraph>
    <Paragraph position="21"> 1 - Commands relative to the submitted text These commands are a subset of the PRTXT subcommands, namely CREAT, ELIM, MODIF, LISTE. RESULT is also possible (so that one can visualize the old results before starting a new execution).</Paragraph>
  </Section>
  <Section position="7" start_page="19" end_page="19" type="metho">
    <SectionTitle>
2 - Execution proper
</SectionTitle>
    <Paragraph position="0"> Execution is called by the EXECUT command. It may concern the totality of the text, or part of it. This part is expressed in a logical or in a physical way, by giving either the first and last units, or the first and last segments. A unit is any part or subpart of a text, as defined by the structuring method of the corpus. A segment is an integer number of units of the highest possible level such that the number of words of the corresponding fragment of text lies between two specified limits (which are parameters of ARIANE and depend on the size of the virtual memory of the virtual machine). A trace of the segmentation may be provided, as well as CPU times for the different steps. The EXECUT subenvironment asks the user to give, for each step, tracing and output parameters. ATEF and ROBRA provide a variety of possible traces. With some options, execution may be followed step by step. In ROBRA, the trace parameters may be different for each grammar call (a transformational system is a structured collection of such grammars). To give an example, it is quite usual to produce no trace at all for a large part of a process, then a reduced one for the grammar preceding the one being debugged, and then a fairly complete one for some grammar calls, and so on if the behaviour of several grammars must be investigated. This point is quite crucial for the development of applications of reasonable size and scope.</Paragraph>
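The segmentation described above can be sketched as a greedy packing of whole units into word-bounded segments. This is an illustrative reconstruction under a simplifying assumption (a single upper word limit rather than two limits), not the system's actual algorithm.

```python
# Greedy sketch of ARIANE's segmentation: pack consecutive whole units
# into segments whose word counts stay at or below an upper limit (a
# parameter tied to the virtual machine's memory size).
def segment(units, max_words):
    """Group consecutive units into segments of at most max_words words;
    a unit longer than the limit forms a segment of its own."""
    segments, current, count = [], [], 0
    for unit in units:
        words = len(unit.split())
        if current and count + words > max_words:
            segments.append(current)     # close the current segment
            current, count = [], 0
        current.append(unit)
        count += words
    if current:
        segments.append(current)
    return segments
```

Units are never split: a segment always contains an integer number of them, as the text specifies.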
  </Section>
  <Section position="8" start_page="19" end_page="19" type="metho">
    <SectionTitle>
IMPLEMENTATION AND CONVERSATIONAL ENVIRONMENT OF ARIANE 78.4
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> Due to the variety of components of linguistic data used in a given application (up to 21 dictionaries and usually 5 grammars, with variables, formats, procedures), the global operator COMPGEN is very useful in practice. The user indicates the first and last steps to be compiled (e.g. AM-GM, or AS-TS, etc.), with the compilation options.</Paragraph>
    <Paragraph position="3"> Should an error occur in a given component, the system asks the user whether he wants to modify it. If no, compiling may continue (e.g. when a dictionary contains some errors), or not (other cases). If yes, the editor is called, modifications are made, and the global compilation may proceed.</Paragraph>
    <Paragraph position="4"> 2 - Processing of a quantity of texts This is the TRAGEN global command, with arguments A, T, GS and GM. These arguments indicate up to which step execution must proceed. In one line, the user gives all general parameters (source and target languages, output units, if any, for the unknown words, the results, the CPU times, and the trace of segmentation ; and form of the result - with or without the source text).</Paragraph>
    <Paragraph position="5"> The system then asks which texts are to be processed. It is possible to give them explicitly, by entering their names and their corpus, or to specify an entire corpus, with or without the possibility to amend (under the editor) the list proposed by the system.</Paragraph>
    <Paragraph position="6"> Hence, TRAGEN (GM) may be used as a production environment &quot;cranking out&quot; rough translations of texts already in the data base. An operational production environment, with constant input flow of texts (from tapes, diskettes, terminal, etc.) and output flow of translations, is currently being developed. It will also include a more sophisticated revision subenvironment than ARIANE.</Paragraph>
  </Section>
  <Section position="9" start_page="19" end_page="19" type="metho">
    <SectionTitle>
3 - Human revision
</SectionTitle>
    <Paragraph position="0"> Being used as a system for automated translation, ARIANE provides an environment for human revision, called RI~gTS~II)ITT. The rough machine output may or may not exist.</Paragraph>
    <Paragraph position="1"> In the latter case, revision is the same as human translation and revision. In the former case, obviously the more interesting, a file for the revised translation is created by copying the rough translation. The rough translation itself may never be altered by the user.</Paragraph>
    <Paragraph position="2"> Revision is done under the standard editor, with 2 or 3 windows on the same screen, on which the source text, the rough translation (optionally) and the revised text appear. In the future, a real text processing system may be used instead of the editor, with functions specialized to frequent actions done during revision, such as erasing, inserting or swapping words, sentences or paragraphs.</Paragraph>
    <Paragraph position="3"> VII - STATIC AND DYNAMIC COSTS We don't consider here the cost of preparing the grammars and dictionaries for a given application, but rather the static and dynamic costs incurred while developing, maintaining and using an application.</Paragraph>
    <Paragraph position="4"> 1 - Static costs We refer to them only in terms of space on secondary storage (disks).</Paragraph>
    <Paragraph position="5"> This cost is divided into 2 parts : - the space for the files containing the basic software in executable form. This space is shared by all user spaces, and amounts to roughly 7 Mbytes for either of the 2 (english and french) versions of ARIANE-78.4.</Paragraph>
    <Paragraph position="6"> - the space for the texts and the linguistic data.</Paragraph>
    <Paragraph position="7"> The space taken for a source text is just the size of the file containing it, and the same goes for its rough translation and its revision, if any. If an intermediate result is kept, its size is roughly 4 times the size of the source text (it contains some representation of the associated tree structure). For any text and any language pair, up to 3 intermediate results may exist (after analysis, transfer, or syntactic generation). Obviously, the space taken by the texts does not measure the complexity or validity of an application. It is only a measure of the size of the corpus (corpora) used for developing and debugging an application. In the current russian-french application, for example, the source texts occupy roughly 12 Mbytes on disk. This corresponds to roughly 700000 words, because of some redundancies.</Paragraph>
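The paper's space figures for texts can be wrapped in a small worst-case estimator. The numbers (4x the source size per intermediate result, up to 3 such results) are the paper's; the helper function itself is our illustration.

```python
# Back-of-the-envelope disk footprint for one text and one language
# pair, using the paper's figures: each kept intermediate result is
# roughly 4 times the source text size, and at most 3 may exist
# (after analysis, transfer, and syntactic generation).
def disk_footprint(source_bytes, kept_intermediate_results=3):
    """Worst-case bytes for a source text plus its intermediate results."""
    assert 0 <= kept_intermediate_results <= 3
    return source_bytes + 4 * source_bytes * kept_intermediate_results
```

A 1 Kbyte source text with all three intermediate results kept thus occupies on the order of 13 Kbytes.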
    <Paragraph position="8"> The space taken by the linguistic data is first divided into two parts : the source data and the compiled data. Second, one must distinguish between data relative to the linguistic model and the typology (variables, formats, procedures, grammars), and the dictionaries, the size of which may expand if domains are added, or better covered. The following table gives this partition for a previous russian-french version (1980), using roughly 4000 lexical units, and reasonably large grammars.  The marginal space for adding one lexical unit (in the dictionaries of AM, TL and GM) is roughly 310 bytes/UL in source form, and 120 bytes/UL in compiled form. The compression factor is higher for dictionaries than for grammars.</Paragraph>
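The marginal-space figures just quoted can be turned into a tiny growth estimator. The per-unit constants are the paper's; the function wrapping them is ours.

```python
# The paper's marginal dictionary-space figures: roughly 310 bytes per
# lexical unit (UL) in source form and 120 bytes compiled, across the
# AM, TL and GM dictionaries together.
SOURCE_BYTES_PER_UL = 310
COMPILED_BYTES_PER_UL = 120

def dictionary_growth(extra_units):
    """Extra disk space, as (source_bytes, compiled_bytes), for
    adding the given number of lexical units."""
    return (extra_units * SOURCE_BYTES_PER_UL,
            extra_units * COMPILED_BYTES_PER_UL)
```

This matches the later remark that adding 1000 UL grows the compiled tables by roughly 120 Kbytes.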
  </Section>
  <Section position="10" start_page="19" end_page="19" type="metho">
    <SectionTitle>
2 - Dynamic costs
</SectionTitle>
    <Paragraph position="0"> They may be measured in terms of the CPU time and the size of the virtual memory used while executing a translation, as this computation dominates the costs.</Paragraph>
    <Paragraph position="1"> A main principle of the implementation has been to use the virtual memory facility in such a way that, should the real memory be big enough, the only I/O incurred for executing a translation are due to loading the executable module generated by ARIANE, and reading/writing the input text, the intermediate results and the translation on disk files. In other terms, everything is done in central memory.</Paragraph>
    <Paragraph position="2"> For an application like the one mentioned above, a 2 Mbytes virtual memory is enough to contain the OS (CMS) of the virtual machine, the compiled tables, ARIANE's programs, and the work areas. As we said before, adding 1000 UL causes this size to augment by roughly 120K.</Paragraph>
    <Paragraph position="3"> As far as CPU times are concerned, only virtual CPU time is significant. As it varies considerably from one type of machine to another, we prefer to give estimates in millions of (virtual) machine instructions necessary to translate one word, or Mi/w. If we consider various applications (russian-french, french-english, english-malay or english-chinese), this complexity lies between 3 and 7 Mi/w.</Paragraph>
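The Mi/w figure converts directly into a rough virtual CPU time once a machine speed is assumed. The 3-7 Mi/w range is the paper's; the conversion helper and any MIPS value plugged into it are our assumptions.

```python
# Sketch converting the paper's complexity figure (millions of virtual
# machine instructions per translated word, Mi/w) into a rough virtual
# CPU time, given a machine speed in MIPS (million instructions/s).
def virtual_cpu_seconds(words, mi_per_word, mips):
    """Seconds of virtual CPU needed to translate `words` words."""
    return words * mi_per_word / mips
```

For instance, at 5 Mi/w on a hypothetical 1-MIPS machine, a 1000-word text needs about 5000 seconds of virtual CPU, before the simulation and paging overhead mentioned below.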
    <Paragraph position="4"> According to the size and load of the computer system, a certain overhead (for simulation and paging) has to be added to obtain the total CPU time. No uniform rule may be given.</Paragraph>
  </Section>
</Paper>