<?xml version="1.0" standalone="yes"?>
<Paper uid="J85-2003">
  <Title>Vasconcellos, Muriel 1985a Management of the Machine Translation Environment: Interaction of Functions at the Pan American Health</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SPANAM AND ENGSPAN: MACHINE TRANSLATION
AT THE PAN AMERICAN HEALTH ORGANIZATION
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 PROJECT HISTORY AND CURRENT STATUS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 OVERVIEW
</SectionTitle>
      <Paragraph position="0"> Spanish-English machine translation (SPANAM) has been operational at the Pan American Health Organization (PAHO) since 1980. As of May 1984, the system's services had been requested by 87 users under 572 job orders, and the project's total output corresponded to 7,040 pages (1.76 million words) that had actually been used in the service of PAHO's activities. The translation program runs on an IBM mainframe computer (4341 DOS/VSE), which is used for many other purposes as well. Texts are submitted and retrieved using the ordinary word-processing workstation (Wang OIS/140) as a remote job-entry terminal. Production is in batch mode only. The input texts come from the regular flow of documentation in the Organization, and there are no restrictions as to field of discourse or type of syntax. A trained full-time post-editor, working at the screen, produces polished output of standard professional quality at a rate between two and three times as fast as traditional translation (4,000-10,000 words a day versus 1,500-3,000 for human translation). The post-edited output is ready for delivery to the user with no further preparation required.</Paragraph>
      <Paragraph position="1"> The SPANAM program is written in PL/I. It is executed on the mainframe at speeds as high as 700 words per minute in clock time (172,800 words an hour in CPU time), and it runs with a size parameter of 215 K. Its source and target dictionaries (60,150 and 57,315 entries, respectively, as of May 1984) are on permanently mounted disks and occupy about 9 MB each.</Paragraph>
      <Paragraph position="2"> While SPANAM continues to build its reputation as a workhorse, at the same time development is well advanced on a parallel system that translates from English into Spanish, ENGSPAN. Also written in PL/I, ENGSPAN uses essentially the same modular system architecture that has been developed for SPANAM, but it is conceived on the basis of up-to-date linguistic theory leading to rule-based strategies for the parsing of syntactic and semantic information. The overall policy is to regularly upgrade SPANAM as breakthroughs become available in the more sophisticated ENGSPAN. In this way it has been possible to maintain ongoing production with SPANAM while its capabilities are gradually enhanced and expanded. Because of this dynamic mode of development, information about the theoretical status of either SPANAM or ENGSPAN is necessarily short-lived.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1.2 EARLY HISTORY: 1976-1979
</SectionTitle>
    <Paragraph position="0"> The Pan American Health Organization, with headquarters in Washington, D.C., is the specialized international agency in the Americas that has responsibility for action in the field of public health. It comes under the umbrellas of both the Inter-American System and the UN family, serving in the latter instance as Regional Office of the World Health Organization. In addition to its headquarters staff of 546 in Washington, PAHO has a field staff of 652 that supports both the operational programs in its 10 Pan American centers, located in eight different countries, and its 30 representational offices, in 28 countries.</Paragraph>
    <Paragraph position="1"> Business may be conducted in any of the four official languages: Spanish, English, Portuguese, and French.</Paragraph>
    <Paragraph position="2"> The translation demand is greatest into Spanish, which over the years has corresponded to more than half the total workload, and, after that, into English. The demand for Portuguese is considerably smaller, and there is only an occasional requirement for French.</Paragraph>
    <Paragraph position="3"> In 1975 the Organization's administrators undertook a feasibility study and determined that MT might be a means of reducing the expenditure for translation. There was already a mainframe computer, then an IBM 360, at the headquarters site, and the decision was made to 1 Respectively, Chief, Terminology and Machine Translation Program, PAHO, and Senior Computational Linguist, Machine Translation Project, PAHO.</Paragraph>
    <Paragraph position="4"> Copyright 1985 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage and the CL reference and this copyright notice are included on the first page. To copy otherwise, or to republish, requires a fee and/or specific permission.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
122 Computational Linguistics, Volume 11, Numbers 2-3, April-September 1985
</SectionTitle>
    <Paragraph position="0"> Muriel Vasconcellos and Marjorie León SPANAM and ENGSPAN develop an MT system that would run on this installation on a time-sharing basis. Work was to focus on the Spanish-English and English-Spanish combinations. The effort was to be supported under the Organization's regular budget.</Paragraph>
    <Paragraph position="1"> The intention from the outset was that MT should articulate with the routine flow of text in PAHO. Post-editing was considered to be unavoidable, since the system would have to deal with free syntax, with any vocabulary normally used in the Organization, and, in time, with a large range of subjects and different genres of discourse. No serious thought was given to a mode of operation that would require pre-editing.</Paragraph>
    <Paragraph position="2"> Initial work was begun in 1976. A team of three part-time consultants worked for the Organization for two years, and one of these consultants remained with the project for a third year. In the beginning the approach drew upon a number of the principles that had evolved at Georgetown University in the late 1950s and early 1960s in the course of work on the Russian-English system known as GAT (Georgetown Automatic Translation, described in Zarechnak 1979).</Paragraph>
    <Paragraph position="3"> The first language combination to be addressed was Spanish-English. The consultants had opted for this direction recognizing that results could be available earlier than if they had started with English as the source. Parallel efforts were concentrated on the architecture itself of the system and the extensive supporting software. The period 1976-1978 saw the mounting of this architecture and the writing of a basic algorithm for the translation of Spanish into English. At the end of three years the Spanish-English algorithm was in place, as well as eight other PL/I support programs that performed a variety of related tasks. The Spanish source dictionary had been built to a level of 48,000 entries (at that time the verbs required full-form entries), with corresponding English glosses in the separate target dictionary. Work on the dictionaries was supported by mnemonic, user-friendly software developed in 1978-1979 to facilitate the operations of updating, side-by-side printing, and retrieval of individual records. A corpus of about 50,000 words had been translated from Spanish into English.</Paragraph>
    <Paragraph position="4"> Efforts from English into Spanish had produced one page of text.</Paragraph>
    <Paragraph position="5"> Human resources during the period 1976-1978 consisted of the three part-time consultants together with PAHO's contribution in the form of dictionary manpower (total of 24 staff-months in the three-year period) and, starting in 1977, half-time participation of the staff terminologist, who assumed the responsibility of coordination. A full-time computational linguist was recruited and assigned to the project in 1979.</Paragraph>
    <Paragraph position="6"> The year 1980 was a turning point for MT at PAHO.</Paragraph>
    <Paragraph position="7"> Advances came together which made it possible to move into a production mode. To begin with, the computational linguist took full charge of the system software, replacing the consultants. Up to that time production had not been feasible because there was no morphological analysis of verbs: the failure to find a high percentage of inflected verbs had meant that many sentences in random text were barred from even the most rudimentary analysis. Thus the first order of business was to develop the needed morphological lookup.</Paragraph>
    <Paragraph position="8"> At the same time, the operational problems of text input, another major impediment to production, were also resolved. An interface established between the IBM mainframe and the Organization's word-processing facility (then a Wang System 30) enabled MT to take its place in the text-processing chain and tap into a large body of text that had been made machine-readable for other purposes. A conversion program was written which handled the differences in representation of characters, solved ambiguities of punctuation, and made certain decisions about the format. From the time this program was installed, any Spanish text that had been keyed onto the word processor, regardless of the purpose for which it was entered, was available for machine translation.</Paragraph>
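    <Paragraph position="9"> The morphological lookup described earlier in this section can be illustrated with a minimal sketch. It is not the SPANAM routine (which was written in PL/I and is not reproduced in this paper); it only assumes, for illustration, that a verb is found by splitting the inflected form into a dictionary stem and a known ending, with full-form entries as a fallback. The tables STEMS, ENDINGS, and FULL_FORMS and the function lookup_verb are hypothetical names introduced here.

```python
# Hypothetical stem dictionary: stem -> (English gloss, conjugation class).
STEMS = {
    "inici": ("initiate", "ar"),
    "busc": ("seek", "ar"),
    "com": ("eat", "er"),
}

# Hypothetical ending table: (conjugation class, ending) -> features.
ENDINGS = {
    ("ar", "an"): "present, 3rd person plural",
    ("ar", "aron"): "preterite, 3rd person plural",
    ("er", "en"): "present, 3rd person plural",
    ("er", "ieron"): "preterite, 3rd person plural",
}

# Hypothetical full-form entries for verbs that cannot be stemmed.
FULL_FORMS = {"fue": ("be/go", "preterite, 3rd person singular")}


def lookup_verb(word):
    """Return (gloss, features) for an inflected verb, or None if not found."""
    if word in FULL_FORMS:
        return FULL_FORMS[word]
    # Try every split point, longest candidate stem first.
    for cut in range(len(word) - 1, 0, -1):
        stem, ending = word[:cut], word[cut:]
        entry = STEMS.get(stem)
        if entry is None:
            continue
        gloss, verb_class = entry
        features = ENDINGS.get((verb_class, ending))
        if features is not None:
            return gloss, features
    return None
```

With stem entries of this kind, a single dictionary record covers every regular inflection of a verb, which is consistent with the later remark that most source entries are bases or stems rather than full forms.</Paragraph>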
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1.3 OPERATIONAL PHASE: 1980-PRESENT
</SectionTitle>
    <Paragraph position="0"> As production gained momentum, the MT staff was increased by the assignment of a full-time post-editor and by greater participation of the terminologist as head of the project.</Paragraph>
    <Paragraph position="1"> Over the next two years, the sources of machine-readable text for SPANAM increased at a steady pace. The use of word processing at PAHO expanded, and, in addition, another mode of input became possible through optical character recognition (OCR). Whereas word processing had previously been restricted to special services provided by a typing pool, after the installation of word-processing hardware throughout the headquarters building (Wang OIS/140), all program units eventually came to participate in the text-processing chain. Furthermore, the optical character reader (a Compuscan Alphaword II), previously used only for Telex transmission, was interfaced with the word-processing system; this meant that existing typewriters could also be used as input devices, and therefore that texts could be prepared in the field and machine-read in Washington.</Paragraph>
    <Paragraph position="2"> With accelerated production, improvements to SPANAM have followed in tandem. From the beginning it has been the policy, and continues to be so today, that the output from production serves not only to meet the purpose for which it was requested but also to provide feedback for further development of the algorithm and dictionaries. As post-editing proceeds, note is made of recurring problems at all levels. Capture of this information at the time of post-editing saves much work later on. The messages written by the post-editor on the side-by-side text serve as a basis both for updating the dictionaries and for making enhancements, as feasible, in the algorithm.</Paragraph>
    <Paragraph position="3"> In this way the Spanish source dictionary had grown to a total of 60,120 entries as of May 1984. Of this total, 94% were bases or stems and 6% were full forms, all with corresponding entries in the English target. Since 1981 the incidence of not-found words in random text has been well under 1% - limited usually to proper names, scientific names, new acronyms, and nonce formations. Through coordination with the terminology side of the program, the glosses have been increasingly tailored to the specific requirements of PAHO. In addition, microglossaries have been established for various users, so that specialized translations can be elicited.</Paragraph>
    <Paragraph position="4"> In its four years of operation, SPANAM has become not only wiser but more efficient as well. The program's speed of run time has increased from 160 words per minute to over 700 wpm. Yet the algorithm, even though it has sustained a major reorganization into modular structure and regularly undergoes enhancement, remains approximately the same size (2,085 statements as of May 1984).</Paragraph>
    <Paragraph position="5"> Further details about the working environment of SPANAM are given in sections 2 and 6.</Paragraph>
    <Paragraph position="7"> In early 1981 a long-range strategy was decided on for the continued improvement of SPANAM and the development of a parallel system from English into Spanish.</Paragraph>
    <Paragraph position="8"> Two consultants from Georgetown University, Professors Ross Macdonald and Michael Zarechnak, undertook separate evaluations of SPANAM at that time. Their recommendations led to the adoption of a combined working mode in which improvements were to be introduced in SPANAM according to a predetermined schedule while at the same time development began on the other system, ENGSPAN. Recognizing that each language combination imposed a different set of linguistic priorities, the consultants nevertheless emphasized that greatly expanded parsing was needed in both cases, especially in the analysis of English as a source language. Such parsing, in turn, called for revision of the dictionary record in order to allow for a broader range of syntactic and semantic coding. It was felt that the basic modular architecture of SPANAM, as well as the dictionary record in its essential format, should also be used for ENGSPAN. A common architecture for the two systems meant that they could continue to share the same supporting software.</Paragraph>
    <Paragraph position="9"> Thus, improvements could migrate readily from one system to the other; it would be easy for them to cross-fertilize. Having adopted this approach to development, with each side to benefit systematically from the work being done on the other, the project addressed its attention in 1981 to the enhancements that had been recommended for SPANAM. Then, as the SPANAM effort tapered off, time was devoted increasingly to ENGSPAN. By the end of 1982, the ENGSPAN program and dictionaries (about 40,000 source entries, most of them with acceptable glosses in the Spanish target) were in place.</Paragraph>
    <Paragraph position="10"> Translation from English into Spanish has special importance for public health in the developing countries, and this fact provided the incentive for seeking extrabudgetary support from the U.S. Agency for International Development (AID). In August 1983, AID gave the Organization a two-year grant for the accelerated development of ENGSPAN. 2 This funding has made it possible to have a second computational linguist for the grant period, as well as consultants and part-time dictionary assistants who have undertaken specific tasks within the approved plan of work.</Paragraph>
    <Paragraph position="11"> With the added manpower, the project has made significant progress on the English-Spanish algorithm.</Paragraph>
    <Paragraph position="12"> Particular focus has been placed on the development of a parser using an augmented transition network (ATN), which as of April 1984 was integrated into the rest of the ENGSPAN program. The dictionary record has been modified, without any increase in its overall size, so that it can now accommodate 211 fields, as compared with 82 in the 1980 version of SPANAM. Deep syntactic and semantic coding has been introduced for dictionary entries corresponding to a sizable proportion of the experimental corpus of 50,000 running words.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 APPLICATION ENVIRONMENT
2.1 PRE-EDITING POLICY
</SectionTitle>
    <Paragraph position="0"> As indicated above, it has always been expected that the output of SPANAM and ENGSPAN would have to be post-edited. There was no application of MT at PAHO for which a customized language would be feasible.</Paragraph>
    <Paragraph position="1"> Since post-editing was inevitable, it was felt that a pre-editing step would be anti-economic: the advantages to be gained would not be sufficient to offset the added cost of a second human pass. Moreover, in order for pre-editing to be worthwhile, the process would have to draw on a high degree of linguistic sophistication, and adequate manpower for this purpose would be scarce: the pre-editor would need to be well prepared not only in translation skills but also in knowledge of the algorithm, or at least a number of its capabilities and limitations.</Paragraph>
    <Paragraph position="2"> Thus, pre-editing in the linguistic sense has been ruled out for SPANAM and ENGSPAN. In theory, a document can be sent for execution by SPANAM without being seen by any human eyes. If the operator has keyed in the original Spanish document using normal in-house typing conventions, no adjustments whatsoever are required.</Paragraph>
    <Paragraph position="3"> With inexperienced operators, however, and with texts read automatically by the OCR, the precaution is taken to check the format, particularly the line-spacing and page width, since deviations from the standard at that level can disrupt the work of the algorithm.</Paragraph>
    <Paragraph position="4"> Production texts are run only once. Demonstrations are always performed on random text.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.2 POST-EDITING POLICY
2.2.1 GENERAL POSITION
</SectionTitle>
    <Paragraph position="0"> The multifaceted approach to post-editing is an important feature of SPANAM. On one level, consideration is given to the user's needs and capabilities and to the purpose of the translation. At the same time, specific linguistic strategies developed by the project are often used in order to minimize the recasting of certain unwieldy constructions which frequently recur - mainly the result of verbs in sentence-initial position in Spanish.</Paragraph>
    <Paragraph position="1"> Finally, there is a series of word-processing aids that help to speed up the physical process of editing and to deal with pragmatic decisions in the SPANAM output which are not handled by syntactic rules.</Paragraph>
    <Paragraph position="2"> The degree of post-editing is determined by:  1. the purpose of the translation, 2. the user's own resources for editing, 3. the time frame, and 4. structural linguistic considerations in the text itself.  A text may be needed for information only, for publication, or for a variety of uses between these two extremes. If it is to be edited by the requesting office, only the most glaring problems are dealt with by the post-editor. On the other hand, if it is to be published without much further review, the post-editor devotes careful attention to the quality of the text. These factors are determined in a conference with the user at the time the job is submitted. As to time constraints, it may happen that the work has to be delivered under considerable pressure: information-only translations of 20-25 pages may have to be delivered within a couple of hours, and once a 40-page proposal for funding was delivered in polished form the same day it was requested. With translation for publication, however, longer periods are negotiated.</Paragraph>
    <Paragraph position="3"> Contrary to what might be deduced from the nature of PAHO's mission, SPANAM is asked to cope with a wide range of subject areas and types of text. There have been: documents for meetings, international agreements, technical and administrative reports, proposals for funding, summaries and protocols for international data bases, journal articles and abstracts, published proceedings of scientific meetings, training manuals, letters, lists of equipment, material for newsletters - even film scripts.</Paragraph>
    <Paragraph position="4"> In an open, &quot;try-anything&quot; (Lawson 1982:5) system such as SPANAM, with its highly varied applications, experience has led to the conclusion that post-editing requires a trained professional translator. Whereas Martin Kay (1982:74) suggests that the person who interprets machine output &quot;would not have to be a translator and could quite possibly be drawn from a much larger segment of the labor pool&quot;, the SPANAM experience suggests that this conclusion would be valid only for technical experts working on a text for information purposes only. Even in such cases, the technicians at PAHO are encouraged to request a more careful translation of passages that are of particular interest. Only an experienced translator will be aware of the words whose variable meanings are dependent on extra-linguistic context. For example, proyecto in Spanish can mean 'project', 'proposal', or 'draft', and the choice depends on full knowledge of the situation to which the text refers. Esperar can mean 'hope' or 'expect', and the distinction is essential in English - sometimes even crucial. Such ambiguities require the attention of a translator with training, experience, good knowledge of the subject matter vocabulary in both languages, and a technical understanding of what is meant by the text. Only a person with this combined background is in a position to make the choices that will fully reflect the intention of the original author. Another area in which the translator's role is important is in interpretation of the degree of intensity associated with relative terms. For example, trascendente in Spanish can have much less force than its English cognate, and the entire tone of a message may be over- or underdrawn, depending on the interpretation given to a key term of this nature. Indeed, it has been the experience of SPANAM that users, even technical experts, can misinterpret the glosses appearing in the machine output and assign an altogether incorrect meaning in the process of &quot;correcting&quot; the text. The role of the experienced translator is not to be underestimated.</Paragraph>
    <Paragraph position="5"> In addition to experience in the interpretation of nuances, the post-editor needs a strong linguistic background in order to master the particular strategies that have proved to be effective in the processing of SPANAM output. For the inexperienced post-editor, the most time-consuming task is the recasting that is deemed to be &quot;required&quot; when machine-translated constructions turn out to be ungrammatical or intolerably awkward in the English output. SPANAM has addressed this problem by developing a series of &quot;quick-fix&quot; post-editing expedients (QFP) for dealing with the typical problems of Spanish-to-English translation. For example, certain maneuvers are suggested as being useful in the case of fronted verb constructions in Spanish, which occur frequently and present difficulties for the standard SVO pattern in English. The purpose of the QFP is to minimize the number of steps required in order to make the sentence work. Since it was a V(S)O construction that triggered the problem in the first place, any solution that avoids reordering will necessarily depart from one-on-one syntactic fit. In other words, in the example of the fronted verb, one might try to see if the opening phrase, which will be a discourse adjunct, a cognitive adjunct (terms from Halliday 1967), or the main verb itself, could be nominalized so that it might serve as the subject of the sentence in English. Such an approach manages to preserve in the theme position (Halliday 1967) the cognitive material which had been thematic in the source text, usually with a parallel effect on the focus position as well (Vasconcellos 1985b). For this reason, the result is often quite satisfactory, even compared with a translation that is syntactically more &quot;faithful&quot; (see Section 6 below and also Vasconcellos, in preparation). The examples below compare QFPs with solutions that were actually proposed by translator-post-editors (traditional human translation - THT).</Paragraph>
    <Paragraph position="6"> In example (1), the semantic content of the fronted verb is reworked into a noun phrase that can serve as the subject of the sentence. Time is saved by leaving the rheme of the sentence untouched; only a few characters, highlighted inside the box, were changed. Moreover, additional speed was gained by making changes from left to right, in the same direction in which the text is being reviewed.</Paragraph>
    <Paragraph position="7"> (1) Durante 1983 se inició ya la transformación paulatina de estos planteamientos en acciones.</Paragraph>
    <Paragraph position="8"> MT: During 1983 [there was initiated already] the gradual transformation of these proposals into actions.</Paragraph>
    <Paragraph position="9"> THT: During 1983 these proposals already began to be gradually transformed into actions. (62 keystrokes)
QFP: During 1983 [progress began toward] the gradual transformation of these proposals into actions. (27 keystrokes)
In example (2), on the other hand, the adjunct itself is nominalized, again with a significant saving of time and keystrokes.</Paragraph>
    <Paragraph position="10"> (2) En este estudio se buscará contestar dos preguntas fundamentales:
MT: In this study [it will be sought] to answer two fundamental questions:
THT: In this study answers to two fundamental questions will be sought: (53 keystrokes)
QFP: This study [seeks] to answer two fundamental questions: (14 keystrokes)
Use of the foregoing approach, wherever feasible, adds up to substantial economy, with apparently little or no deterioration in the quality of the translation (see Section 6 below). However, knowing when and how to make such changes requires considerable skill. This is one more reason why the post-editor should have a strong background both in translation and, if possible, in linguistics as well. It is always emphasized in SPANAM that editorial changes should be kept to the minimum needed in order to make the output intelligible and acceptable for its intended purpose.
The SPANAM post-editors work directly on-screen.</Paragraph>
    <Paragraph position="11"> Experience has shown that post-editing on hard copy, with the changes entered by a &quot;word-processing operator&quot;, is not a highly efficient mode. Accordingly, attention has also been given to speeding up the post-edit by automating as many of the recurring operations as possible.</Paragraph>
    <Paragraph position="12"> The SEARCH-and-REPLACE function on the word processor is heavily used in post-editing. In addition, SPANAM has a set of special aids developed for the purposes of MT. Besides a full set of possible word switches (1x1, 1x2, 2x1, 2x2, 1x3, 3x1, 3x3, etc.), there are routines that deal with the character strings that most often have to be changed in SPANAM output. For example, only a single &quot;glossary&quot; keystroke is needed to perform the following editorial operations:
SEARCH-and-DELETE: the, of, there, to, in order to
SEARCH-and-REPLACE: from/of, for/of, for/by, in order to -/for -ing, a/the, which/that, who/that, every/each, among/between, such as/as, some of the/some
The inventory can be changed or expanded at will.</Paragraph>
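    <Paragraph position="13"> The word switches and single-keystroke operations listed above can be sketched as follows. This is an illustrative reconstruction only: the actual aids were word-processor glossary routines whose implementation is not given in this paper. The function names switch_words, replace_next, and delete_next are hypothetical, and the cursor is assumed to be a character offset into the text.

```python
import re


def switch_words(words, start, m, n):
    """An m-by-n word switch: swap the m words at `start` with the next n words."""
    first = words[start:start + m]
    second = words[start + m:start + m + n]
    if len(first) != m or len(second) != n:
        raise ValueError("not enough words for the requested switch")
    return words[:start] + second + first + words[start + m + n:]


def replace_next(text, cursor, old, new):
    """SEARCH-and-REPLACE: change the next whole-word `old` at or after `cursor`."""
    match = re.compile(r"\b" + re.escape(old) + r"\b").search(text, cursor)
    if match is None:
        return text, cursor  # nothing found; leave text and cursor unchanged
    edited = text[:match.start()] + new + text[match.end():]
    return edited, match.start() + len(new)


def delete_next(text, cursor, word):
    """SEARCH-and-DELETE: remove the next whole-word `word` and one trailing space."""
    match = re.compile(r"\b" + re.escape(word) + r"\b ?").search(text, cursor)
    if match is None:
        return text, cursor
    return text[:match.start()] + text[match.end():], match.start()
```

In this sketch each operation searches forward from the cursor and leaves the cursor at the point of the edit, which mirrors the left-to-right working direction described for the quick-fix expedients.</Paragraph>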
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.2.4 OTHER TIME-SAVERS
</SectionTitle>
    <Paragraph position="0"> From the discussion above, it can be seen that speed in post-editing is achieved by a combination of strategies.</Paragraph>
    <Paragraph position="1"> Some of the points made may appear on the surface to be almost trivial, yet they can add up to a significant difference. One example of an apparently trivial factor is the method of positioning the cursor under the string to be modified. Delays at this point can add up to a surprising proportion of total time spent on post-editing, since they will occur with every change that is made. Informal experiments suggest that the most efficient approach for positioning the cursor is to always use the SEARCH key.</Paragraph>
    <Paragraph position="2"> The &quot;mouse&quot; and the light pencil appear to be less effective. The slowest method, unfortunately, seems to be the one that is most often used, namely simple manual striking of the directional keys. Since people tend to rely on the directional keys unless otherwise trained, this point is emphasized with the post-editors who work on SPANAM.</Paragraph>
    <Paragraph position="3"> The staff of the project are constantly on the lookout for new ways of saving time. All tasks are streamlined as much as possible. A series of programs have been developed on the word processor for automating the housekeeping support that has to be done apart from post-editing, and recently some of this work was made even more efficient by passing it on to the mainframe computer. Printing is kept to a minimum; finished production is delivered to the user either on a diskette or by a telephone call notifying the office that the job is available on the system.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.3 POST-EDITING VIS-À-VIS OTHER ASPECTS
OF THE SYSTEM
</SectionTitle>
    <Paragraph position="0"> In the SPANAM environment there is a close link between post-editing and the other aspects of the system.</Paragraph>
    <Paragraph position="1"> The staff post-editor has been trained to update the dictionaries, and currently almost all the dictionary work on SPANAM is done by this person. Required changes to the dictionaries are proposed at the time of post-editing.</Paragraph>
    <Paragraph position="2"> Hence there is no need to go through the text a second time. Also, glosses and other solutions seem to come to mind most readily when the whole text is actually being worked on. The post-editor, if adequately trained in updating, is in the best position to see what dictionary changes are necessary in order to deal with the specific constructions that tend to recur in production translations. The post-editor also alerts the computational linguist to areas where the algorithm needs improvement.</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.4 INTEGRATION OF THE SYSTEM INTO A COMPLETE
TRANSLATION ENVIRONMENT
</SectionTitle>
    <Paragraph position="0"> SPANAM/ENGSPAN, the terminology activity, and the traditional human translation unit are in the process of being merged into a single program of language services at PAHO. While human and machine translation even now coordinate the workload to a certain extent, as of May 1984 there was not yet any centralized screening of incoming jobs. It is expected that such a triage will make it possible to maximize the effectiveness and efficiency of the respective services.</Paragraph>
    <Paragraph position="1"> Combination of the two activities will also make for a more rational utilization of the manpower available at any given time, with the staff being assigned to a variety of different duties, depending on both needs and skills.</Paragraph>
    <Paragraph position="2"> It is also a goal to reduce a given person's day in front of the screen from eight hours to six through the rotation of assignments.</Paragraph>
    <Paragraph position="3"> In the area of management, the SPANAM/ENGSPAN programs on the mainframe computer are also helping with labor-intensive operations for which the human translation service is responsible: they are already performing automatic word counts, and spelling check systems are being developed for both English and Spanish.</Paragraph>
    <Paragraph position="4"> In terms of the linguistic work of the human translator, SPANAM/ENGSPAN can help to lighten the load in a number of ways. To begin with, technical and scientific terms are retrieved in context, which means that MT is a sort of very efficient lexical data base. With an ordinary LDB, the translator has to go to the terminal (which is not usually at his desk for his use alone), sign on, and initiate a search. After he has performed all the mechanical steps, there is still the possibility that the term is not in the data base at all, and his effort will have been wasted. When this happens repeatedly over time, a cumulative frustration builds up. With SPANAM/ENGSPAN, on the other hand, not only does the translator know immediately what translation has been assigned to the term, he is also told its degree of reliability and whether or not it is in the WHOTERM data base. The status of the terms is indicated by small superscript symbols which can be requested at the time the text is sent for translation.</Paragraph>
    <Paragraph position="5"> WHOTERM is also on the word-processing system. Its general file contains: a definition of each term in English, translation equivalents in up to four languages besides English, synonyms if there are any, a reliability code for the primary term in each language, scope notes, and a subject code. In addition to the general file, it has files with: names of organizational entities, full equivalents for abbreviations, scientific names of pathogens, generic names of drugs in three languages, and chemical names of pesticides with trade names cross-referenced to them (Ahlroth &amp; Lowe 1983).</Paragraph>
    <Paragraph position="6"> SPANAM also aids the translator with its system of microglossaries for specialized subject areas (see Section 4.3 below). When a text is known to deal with a certain subject, the translator can request a corresponding microglossary which will contain alternate glosses. One or more of these microglossaries can be specified at the time the job is submitted. The translator can also have a microglossary of his own in which he can store special terms he prefers to use.</Paragraph>
    <Paragraph position="7"> It is also possible for SPANAM to provide alternate choices in the output entry, such as project/proposal/draft, hope/expect, time/weather, etc., although this is not the regular policy. These alternatives can be stored in a microglossary. In the output, the undesired translation is eliminated by striking a single glossary key. If the translator provides feedback in the form of suggested or requested changes in the dictionaries, the updating can be done immediately. Some of SPANAM's users have developed the habit of providing regular feedback, and this means that their translations become increasingly tailored to their specific requirements.</Paragraph>
    <Paragraph position="8"> While there is no doubt that SPANAM/ENGSPAN reach their maximum efficiency when post-edited on-screen, at the same time studies are being done on ways in which a translator can dictate his changes so that they can be entered by a word-processing operator working from a tape.</Paragraph>
    <Paragraph position="9"> The human translation service stands to benefit, also, from the sophisticated facilities that have been developed on the word processor for editing and housekeeping support.</Paragraph>
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 GENERAL TRANSLATION APPROACH
</SectionTitle>
    <Paragraph position="0"> Since the bulk of the Organization's translation work involves only Spanish and English, the machine translation system was developed specifically for this pair of languages. No consideration was given to using the interlingua approach. The broad range of subject areas to be dealt with precluded the use of a knowledge-based approach or one based on a representation of the meaning of the text. Although the systems are currently language-specific, significant portions of the algorithm could be adapted for use in a system involving Portuguese or French, the other official languages of the Organization. Because SPANAM and ENGSPAN were developed separately, they reflect different theoretical orientations and utilize different computational techniques. At the same time, they have many features in common.</Paragraph>
    <Paragraph position="1"> SPANAM was originally designed as a direct translation system. The translation is produced through a series of operations which analyze the Spanish source string, transform the surface structure to produce a syntactic frame for the English target string, substitute the English glosses indicated by the results of the analysis, insert and/or delete certain grammatical morphemes, and synthesize the required endings on the English words. The principal stages involved in the translation algorithm are: morphological analysis and single-word lookup, gap analysis, multi-word unit lookup, homograph resolution, subject identification, treatment of prepositions, object pronoun movement, verb string analysis, subject insertion, do-insertion, noun phrase rearrangement, target lookup, target synthesis.</Paragraph>
    <Paragraph position="2"> ENGSPAN is a lexical and syntactic transfer system based on the slot-and-filler approach to language structure. It performs a separate analysis of the English source string, applies transfer routines based on the contrastive analysis of English and Spanish, and then synthesizes the Spanish target string. The principal stages of this algorithm are: morphological analysis and single-word lookup, gap analysis, substitution and analysis unit lookup, sentence-level parse, transfer unit lookup, target lookup, syntactic transfer, and target synthesis.</Paragraph>
    <Paragraph position="3"> The program includes backup modules for homograph resolution, verb string analysis, and noun phrase analysis, which are called in if the sentence-level parse is unsuccessful.</Paragraph>
  </Section>
  <Section position="12" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 LINGUISTIC TECHNIQUES
4.1 MORPHOLOGICAL ANALYSIS
</SectionTitle>
    <Paragraph position="0"> SPANAM's morphological lookup procedure makes it possible to find most Spanish words in their stem forms.</Paragraph>
    <Paragraph position="1"> The algorithm recognizes plural and feminine endings for nouns, pronouns, determiners, quantifiers, and adjectives; person, number and tense endings for verbs; and derivational endings such as -mente/-ly. Bound clitic pronouns are separated from verb forms, and any accent mark related to the presence of the clitic is removed.</Paragraph>
    <Paragraph position="2"> Another subroutine adds missing accent marks when the source word is written with an initial capital or in all capital letters. The components of compounds formed with hyphens or slashes are looked up as separate words. A few prefixes are also removed from words without a hyphen.</Paragraph>
    <Paragraph position="3"> ENGSPAN's morphological analysis procedure, known as LEMMA, is called if the full-form is not found in the dictionary and the word consists of at least four alphabetic characters. This procedure checks for the presence of a number of different endings, including -'s, -s', -s, -ly, -ed, -ing, -er, -est, and -n't. Each time an ending is removed, the new form of the word is looked up.</Paragraph>
    <Paragraph position="4"> LEMMA uses morphological and spelling rules and short lists of exceptions in order to determine when to remove or add a final -e, when the word ends in a double consonant, etc. If a lemmatized form of the word is found in the dictionary, its record is checked to make sure that its part of speech corresponds with the ending which was removed. If LEMMA exhausts all its possibilities, the word is checked against a small list of prefixes (re-, non-, un-, sub-, and pre-). If one of these prefixes can be removed, another lookup is performed. If this final lookup is unsuccessful, a dummy record is created for the word and a gap analysis routine is called. "Not-found" words are initially considered to be nouns and given the possibility of also functioning as verbs and adjectives.</Paragraph>
    <Paragraph position="5"> Information from both LEMMA and derivational suffixes is used in order to confirm or reassign the main part of speech, as well as to confirm, remove, or add possibilities for ambiguities.</Paragraph>
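The LEMMA lookup sequence described above can be sketched roughly as follows. This is a minimal illustration, not the PAHO implementation: the toy dictionary, the pairing of endings with parts of speech, and the function names are all assumptions, and the spelling rules are reduced to restoring a final -e and undoing a doubled consonant.

```python
# Hypothetical toy dictionary: stem mapped to its possible parts of speech.
DICTIONARY = {
    "expect": {"verb"},
    "quick": {"adjective"},
    "health": {"noun"},
    "write": {"verb"},
    "stop": {"verb"},
}

# Endings named in the text; the compatible parts of speech are assumed.
ENDINGS = [
    ("n't", {"verb"}), ("'s", {"noun"}), ("s'", {"noun"}),
    ("ing", {"verb"}), ("est", {"adjective"}), ("ed", {"verb"}),
    ("er", {"adjective"}), ("ly", {"adjective"}), ("s", {"noun", "verb"}),
]
PREFIXES = ["re", "non", "un", "sub", "pre"]


def candidate_stems(word, ending):
    """Yield the bare stem, the stem with a final -e restored, and the
    stem with a doubled final consonant undone."""
    stem = word[: len(word) - len(ending)]
    yield stem
    yield stem + "e"
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        yield stem[:-1]


def strip_ending(word):
    """Return (stem, ending) for the first ending whose stem is in the
    dictionary with a compatible part of speech, else None."""
    for ending, allowed in ENDINGS:
        if not word.endswith(ending):
            continue
        for stem in candidate_stems(word, ending):
            pos = DICTIONARY.get(stem)
            if pos and not pos.isdisjoint(allowed):
                return stem, ending
    return None


def lemma(word):
    """Return (stem, prefix, ending, parts_of_speech) for a word."""
    if word in DICTIONARY:  # full form found, no analysis needed
        return word, "", "", DICTIONARY[word]
    if sum(ch.isalpha() for ch in word) >= 4:
        hit = strip_ending(word)
        if hit:
            return hit[0], "", hit[1], DICTIONARY[hit[0]]
        for prefix in PREFIXES:  # last resort: try removing a prefix
            if word.startswith(prefix):
                rest = word[len(prefix):]
                if rest in DICTIONARY:
                    return rest, prefix, "", DICTIONARY[rest]
                hit = strip_ending(rest)
                if hit:
                    return hit[0], prefix, hit[1], DICTIONARY[hit[0]]
    # Not found: dummy record, noun by default, possibly verb or adjective.
    return word, "", "", {"noun", "verb", "adjective"}
```

With this toy data, `lemma("stopped")` resolves to the stem `stop` with ending `ed`, while `lemma("quicks")` falls through to the dummy record because the dictionary entry for `quick` is an adjective and does not fit the ending `-s`.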
    <Paragraph position="6"> The lookup strategy used in both SPANAM and ENGSPAN keeps down the size of the dictionary while allowing a good deal of flexibility. The dictionary coder has the option of entering a word in its full form, in one or more of its inflected forms, or in its stem form. With irregular forms and homographs, the full form must be used. For example, in the Spanish source dictionary the only entries for the word esperar are the stem esper and the verb/noun homograph espera. The English source dictionary contains an entry for expect and unexpected, but not for expects, expected, expecting, or unexpectedly.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 HOMOGRAPH RESOLUTION
</SectionTitle>
      <Paragraph position="0"> SPANAM deals with homographs at several different stages of the program. Ambiguities that can be resolved by morphological clues or capitalization are handled by the lookup procedure. Proper names are also identified at this stage. One-character words are distinguished from letters of the alphabet after the lookup has been completed. The homograph resolution module handles other types of homographs by examining the surrounding context.</Paragraph>
      <Paragraph position="1"> The possible parts of speech for a word are indicated in the dictionary record in a series of bit fields which include: verb, noun, adjective, pronoun, determiner, numerative, preposition, modifier, adverb, conjunction, auxiliary, and prefix. Any combination of two or more bits may be coded. Other sequences of bit codes are used to distinguish between different types of pronouns, adverbs, and conjunctions: relative, interrogative, nominal, adverbial, connector, compound, and coordinate.</Paragraph>
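As an illustration of the bit-field coding just described, the sketch below models the twelve part-of-speech bits with a Python `Flag`. The bit positions and the helper name are assumptions; the paper does not give the actual record layout.

```python
from enum import Flag, auto


class POS(Flag):
    """One bit per possible part of speech (bit order is assumed)."""
    VERB = auto()
    NOUN = auto()
    ADJECTIVE = auto()
    PRONOUN = auto()
    DETERMINER = auto()
    NUMERATIVE = auto()
    PREPOSITION = auto()
    MODIFIER = auto()
    ADVERB = auto()
    CONJUNCTION = auto()
    AUXILIARY = auto()
    PREFIX = auto()


def is_homograph(flags):
    """A word is a part-of-speech homograph when two or more bits are set."""
    return bin(flags.value).count("1") >= 2


# Hypothetical coding for the verb/noun homograph "espera".
espera = POS.VERB | POS.NOUN
```

Here `POS.NOUN in espera` is true and `is_homograph(espera)` is true, so a record coded this way would be passed on to homograph resolution.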
      <Paragraph position="2"> The use of multiple-word substitution units reduces the number of lexical ambiguities which must be resolved by the algorithm. Analysis units may also be used to selectively specify the part of speech of any or all of the words covered by the unit.</Paragraph>
      <Paragraph position="3"> ENGSPAN's front-line approach to homograph resolution is embodied in the ATN parser, described in Section 5.3. The English words can be coded for the same possible parts of speech as in SPANAM. Determination of the function of each word depends on the path taken through the network. The sequence of parts of speech which leads to the first successful parse is used as the basis for the transfer stage.</Paragraph>
      <Paragraph position="4"> There are three ways in which lexical information from the dictionary is used to help the parser arrive at the correct analysis. Substitution units compress idioms into one record with a single part of speech. Analysis units can be used to indicate that a group of words can be expected to occur in a collocation with a particular function. This information may be overridden, whenever necessary, by the parser. An individual word may also be coded to indicate which of its possible parts of speech is statistically most frequent. Again, the final decision is made by the parser based on the results of the sentence-level analysis.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 POLYSEMY
</SectionTitle>
      <Paragraph position="0"> SPANAM/ENGSPAN have two principal tools for dealing with polysemy: microglossaries and transfer units.</Paragraph>
      <Paragraph position="1"> Substitution units and analysis units are also used when common collocations are involved.</Paragraph>
      <Paragraph position="2"> A microglossary is a sub-dictionary of target glosses which can be set up for a particular subject area, discourse register, or specific user. Glosses pertaining to the subject area of international public health form part of the main dictionary. Microglossaries are in use for special translations of terms in the fields of law, finance, sanitary engineering, statistics, and scientific research. The system may have up to 99 microglossaries with any number of entries in each one. The microglossaries to be consulted during the translation of a particular text are specified at run time. The existence of a specific microglossary entry is indicated in the target record containing the principal gloss for the word. Thus, no time is wasted looking for special translations of every word. More than one microglossary may be activated for the same translation, in which case they are listed and consulted in order of priority.</Paragraph>
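The priority-ordered consultation of microglossaries can be sketched as below. The data layout, the glossary numbers, and the sample glosses are invented for illustration; only the behavior (a flag in the main record, run-time selection, priority order) follows the description above.

```python
# Main target dictionary: source word mapped to its principal gloss and
# the set of microglossary numbers that hold a special gloss for it.
MAIN = {
    "tiempo": ("time", {7}),
    "proyecto": ("project", {3, 12}),
    "agua": ("water", set()),
}

# Microglossaries keyed by number (1 to 99); contents are invented.
MICROGLOSSARIES = {
    3: {"proyecto": "draft"},
    7: {"tiempo": "weather"},
    12: {"proyecto": "proposal"},
}


def gloss(word, active):
    """Return the gloss for `word`, consulting the active microglossaries
    in priority order and falling back to the principal gloss."""
    principal, flagged = MAIN[word]
    for number in active:
        # The flag in the main record means no lookup is wasted on
        # microglossaries that have no entry for this word.
        if number in flagged:
            return MICROGLOSSARIES[number][word]
    return principal
```

With these toy entries, `gloss("proyecto", [12, 3])` gives `proposal`, whereas reversing the priority to `[3, 12]` gives `draft`.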
      <Paragraph position="3"> The transfer unit is a rule that is stored in the source dictionary and is retrieved after the analysis of the sentence has been completed. The existence of a transfer unit is indicated in the record corresponding to the individual source word. A transfer unit contains a condition to be tested and an action to be performed. Examples of conditions are:  * Subject of this verb has X feature(s) or is word W.</Paragraph>
      <Paragraph position="4"> * Object of this verb (or preposition) has X feature(s) or is word W.</Paragraph>
      <Paragraph position="5"> * This word modifies a word with X feature(s) or modifies word W.</Paragraph>
      <Paragraph position="6"> * This word has N object(s).</Paragraph>
      <Paragraph position="7"> * Context N word(s) to left/right contains word with X  feature(s).</Paragraph>
      <Paragraph position="8"> Transfer units are explicitly ordered in the dictionary. The action may either select an alternative translation, insert a word such as a preposition, or delete one or more words. The action also indicates whether or not additional transfer entries should be sought for the same word.</Paragraph>
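A transfer unit, as characterized above, pairs a condition with an action and a flag saying whether further entries should be sought. The following sketch assumes a hypothetical sentence representation (a list of word records with `features` and an optional `object` index); the know/saber/conocer rule is the example the paper gives in Section 5.1.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TransferUnit:
    condition: Callable          # called as condition(sentence, index)
    action: Callable             # called as action(sentence, index)
    keep_looking: bool = False   # seek further transfer entries?


def object_has(feature):
    """Condition: the object of this word carries `feature`."""
    def test(sentence, i):
        obj = sentence[i].get("object")
        return obj is not None and feature in sentence[obj]["features"]
    return test


def select_gloss(new_gloss):
    """Action: choose an alternative translation for this word."""
    def act(sentence, i):
        sentence[i]["gloss"] = new_gloss
    return act


# Transfer units are explicitly ordered for each source word (toy data).
TRANSFER_UNITS = {
    "know": [TransferUnit(object_has("+Human"), select_gloss("conocer"))],
}


def apply_transfer(sentence):
    """Run the ordered transfer units for every word of the sentence."""
    for i, word in enumerate(sentence):
        for unit in TRANSFER_UNITS.get(word["source"], []):
            if unit.condition(sentence, i):
                unit.action(sentence, i)
                if not unit.keep_looking:
                    break
    return sentence
```

A record for `know` whose object is coded +Human has its gloss changed from `saber` to `conocer`; otherwise the default gloss is left untouched.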
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 SYNTACTIC AND SEMANTIC FEATURES
</SectionTitle>
      <Paragraph position="0"> The dictionary record for each lexical item (including substitution units) contains bit fields that are used to store information about its syntactic and semantic features. These features are used in both the analysis and transfer stages of the translation. For example, verbs and deverbal nouns are specified as occurring with one or more of the following codes: no object, one object, two objects, complement, no passive, locative, marked infinitive, unmarked infinitive, declarative clause, imperative clause, interrogative clause, gerund, adjunct, bound preposition, and object followed by bound preposition. Subject and object preferences can be specified as ±Human, ±Animate, and ±Concrete. Other fields are reserved for case frames. Features which can be coded for nouns include Count, Bulk, Concrete, Human, Animate, Feminine, Proper, Collective, Device, Location, Time, Quantity, Scale, Color, Nationality, Material, Apposition, Body part, Condition, and Treatment.</Paragraph>
      <Paragraph position="1"> Adjectives are coded for many of the same features mentioned above. In addition, they can be coded as Inflectable, Optionally Inflectable, General, Temporary condition, Positive connotation, and Negative connotation. Adverbs can be coded as Time, Place, Manner, Motive, Interruptive, and Connector. One of the refer-</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 ANNOTATED SURFACE STRUCTURE NODES
</SectionTitle>
      <Paragraph position="0"> In ENGSPAN, the structure produced by the parser consists of a graph containing nodes corresponding to each clause and phrase. Each node contains a list of its constituents, their roles, and their locations. If the constituent is a lexical item, the location is a word number; if it is a phrase or a clause, the location is the pointer to the appropriate node. Each node is annotated with features applicable to the type of phrase or clause involved. These features include Type, Mood, Person, Number, Tense, Aspect, and Voice.</Paragraph>
      <Paragraph position="1"> Both the ATN formalism and the structural representation used in ENGSPAN draw heavily on the presentation of ATN parsers and systemic grammar in Winograd (1983). Winograd's discussion, in turn, is based on the work of Woods (1970, 1973) and Kaplan (1973). Of course, the ATN parser has necessarily had to be adapted to the needs and computational environment of the PAHO project.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.6 SPANISH VERB SYNTHESIS
</SectionTitle>
      <Paragraph position="0"> The procedure for the synthesis of Spanish verb forms is based on principles of generative morphology and phonology. The program synthesizes all regular and most of the irregular verbs, in all tenses and moods except the future subjunctive, and in all persons except the second person plural. The verb is entered in the target dictionary in its stem form. Binary codes are used to specify the conjugation class and the 11 exception features which govern the synthesis of the irregular forms. Only one dictionary entry is needed for each verb. A small number of highly irregular stems and full forms (74 in all) are listed in a table. The majority of "stem-changing" verbs require no special synthesis coding. The procedure consists of a series of morphological spellout rules; raising, lowering, diphthongization, and deletion rules based on phonological processes; stress assignment rules; and orthographic rules to handle predictable spelling changes.</Paragraph>
    </Section>
  </Section>
  <Section position="13" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 COMPUTATIONAL TECHNIQUES
5.1 DICTIONARIES
</SectionTitle>
    <Paragraph position="0"> The SPANAM/ENGSPAN dictionaries are VSAM files stored on a permanently mounted disk. The source and target dictionaries are separate files. The basic record has a fixed length of 160 bytes. The source entry is linked to its target gloss by means of a 12-digit lexical number (LEX). The first six digits of the LEX are the unique identification number assigned to each pair when it is added to the dictionary. The second half of the LEX is used to specify alternative target glosses associated with the same source entry. The main or default target gloss for each pair has zeroes in these positions.</Paragraph>
    <Paragraph position="1">  The key for a source entry is the lexical item itself, which may be up to 30 characters in length. The source dictionary is arranged alphabetically. The key for a target entry is the LEX, and the target dictionary is arranged in numerical order.</Paragraph>
    <Paragraph position="2"> Words may be entered in the source dictionary either with or without inflectional endings. Most nouns are entered only in the singular and adjectives only in the masculine singular. Verbs are entered as stems. Full-form entries are required for auxiliary verbs, words with highly irregular morphology, and homographs.</Paragraph>
    <Paragraph position="3"> Several source items may be linked to the same target gloss by assigning them to the same LEX. For example, irregular forms of the same verb or alternative spellings of a word require only one entry in the target dictionary. Likewise, more than one target gloss can be linked to the same source word through the lexical number. In this case, each alternative gloss is distinguished by coding in the second half of the LEX. Two positions are used to designate terms belonging to microglossaries, two for glosses corresponding to different parts of speech, and two for context-sensitive glosses which are triggered by transfer units.</Paragraph>
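The structure of the lexical number can be made concrete with a small sketch. The split into six identifying digits plus three two-digit fields follows the text; the order of the three fields within the second half, and all names below, are assumptions.

```python
from typing import NamedTuple


class Lex(NamedTuple):
    pair_id: int        # first six digits: unique source/target pair
    microglossary: int  # two digits, 00 for the main gloss
    pos_variant: int    # two digits, 00 for the main gloss
    context: int        # two digits, 00 for the main gloss


def parse_lex(text):
    """Split a 12-digit LEX string into its fields."""
    if len(text) != 12 or not text.isdigit():
        raise ValueError("a LEX is a 12-digit string")
    return Lex(int(text[:6]), int(text[6:8]), int(text[8:10]), int(text[10:12]))


def format_lex(lex):
    """Render the fields back into the 12-digit key form."""
    return "%06d%02d%02d%02d" % tuple(lex)


def is_default(lex):
    """The main gloss has zeroes in the second half of the LEX."""
    return tuple(lex[1:]) == (0, 0, 0)
```

Selecting a context-sensitive gloss then amounts to changing one field, for example `format_lex(parse_lex("004217000000")._replace(context=1))`, which yields `004217000001` and would serve as the key for the target lookup.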
    <Paragraph position="4">  The dictionaries contain four types of multiple-word entries: substitution units (SU), analysis units (AU), delayed substitution units (DSU), and transfer units (TU). The key for a multiple-word entry in the source dictionary is a string consisting of the first six digits of the LEX for each word in the unit. In the case of an SU or an AU, the words must occur consecutively in the sentence in order for the unit to be activated. A DSU or a TU can cover either a continuous or discontinuous string.</Paragraph>
    <Paragraph position="5"> The basic SU contains from two to five words. A different record structure is used for longer entries, such as names of organizations and titles of publications.</Paragraph>
    <Paragraph position="6"> When an SU is retrieved, the dictionary records corresponding to the individual words are replaced with one record corresponding to the entire sequence. The gloss for the unit is also found in a single entry in the target dictionary. This type of unit is essential in order to obtain the correct translation of names of organizations, titles of publications, slogans, etc., and is an efficient way of handling some fixed idioms, phrasal prepositions, and certain technical terminology. An SU record has the same format as a single-word entry. In addition, it contains a character string which indicates the part of speech of each of its members. This information can be used by the parser if it is unable to parse the sentence using the single part of speech specified by the unit.</Paragraph>
    <Paragraph position="7"> Examples of phrases entered as SUs are by leaps and bounds, International Drinking Water Supply and Sanitation Decade, and Health for All by the Year 2000.</Paragraph>
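The retrieval of substitution units might look roughly like the sketch below. The key built from six-digit identifiers and the two-to-five-word span follow the description above; the identifiers themselves, the placeholder for unknown words, and the choice to try the longest span first are assumptions.

```python
# Hypothetical six-digit pair identifiers for single words.
WORD_ID = {"by": "000101", "leaps": "000202", "and": "000303",
           "bounds": "000404"}

# Substitution units keyed by the concatenated identifiers of their words.
SUBSTITUTION_UNITS = {
    "000101" + "000202" + "000303" + "000404": "by leaps and bounds",
}


def apply_substitution_units(words):
    """Replace each run of two to five consecutive words that forms a
    unit with a single record (here simply the phrase as one string)."""
    ids = [WORD_ID.get(w, "??????") for w in words]
    out, i = [], 0
    while i != len(words):
        # Try the longest span first (an assumption), down to two words.
        for span in range(min(5, len(words) - i), 1, -1):
            key = "".join(ids[i:i + span])
            if key in SUBSTITUTION_UNITS:
                out.append(SUBSTITUTION_UNITS[key])
                i += span
                break
        else:
            out.append(words[i])  # no unit starts here
            i += 1
    return out
```

Applied to `["growing", "by", "leaps", "and", "bounds"]`, the four-word idiom is collapsed into one record and the result has two elements.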
    <Paragraph position="8"> The AU, which also contains from two to five words, has several functions. At the very least, it alerts the analysis routines to the possible presence of a common phrase and provides information on its length and function. It can also be used to resolve the part-of-speech ambiguity of any of its members. Finally, it can specify an alternative translation for one or more of its parts.</Paragraph>
    <Paragraph position="9"> The AU is an entry in the source dictionary but has no counterpart in the target dictionary. The record for each source word is retained in the representation of the sentence, but the last two digits of its lexical number are modified if a translation other than the main gloss is desired. When the target lookup is performed, the gloss for each word is retrieved separately. This ensures that the rules for analysis and synthesis of conjoined modifiers will be able to access information about the individual words of the phrase. It also makes it possible for the parser to determine whether or not the individual words are being used as a unit in the given context. Examples of phrases entered as AUs are drinking water and patient care. The algorithm is still able to correctly analyze sequences such as the children have been drinking water with a high fluoride content, and it is essential that the patient care for himself.</Paragraph>
    <Paragraph position="10"> The DSU is used to handle lexical items such as phrasal verbs which are likely to occur as noncontiguous words in the input. The existence of a DSU is indicated in the source record of the first word of the unit. The unit is retrieved from the dictionary during the sentence-level parse. The decision of whether or not to accept the unit is based on both syntactic and semantic requirements of the parser. If the unit is accepted, it replaces the individual records and causes a different target gloss to be retrieved. Examples of DSUs are look up, put on, and carry out.</Paragraph>
    <Paragraph position="11"> The TU is used to specify an alternate translation of a word or words which depends on the occurrence of a specific word or set of features in one of its arguments or in a specified environment. These entries are stored in the source dictionary only and are retrieved after the analysis has been completed. If the conditions specified in the transfer entry are met, the corresponding lexical numbers are modified so that the desired target gloss is selected during the target lookup. For example, if the object of know is coded as +Human, the verb is translated as conocer instead of saber. If female and male modify a noun coded as -Human, they are translated as hembra and macho instead of mujer and hombre.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 GRAMMAR RULES
</SectionTitle>
      <Paragraph position="0"> SPANAM, up to now, uses two basic types of grammar rules: pattern matching and transformations. Pattern matching is used for the recognition and reordering of noun phrases. The grammatical patterns are stored in a file which can be updated without recompiling the program. The patterns are applied by searching for the longest match first. Transformations are used to identify and synthesize the verb phrases and clitic pronouns. The rules are expressed in PL/I code and are grouped in modules according to the part of speech of the head word. Each group of rules is tested once for each sentence. The structural description of each rule is compared with the input string. The description may require a match of parts of speech, syntactic features, or specific lexical items. If a match is found, the rule is applied. The rule may permute, add, delete, or substitute lexical items or features associated with them.</Paragraph>
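The longest-match-first application of noun phrase patterns can be illustrated as follows. The two patterns and the tag names are invented; in SPANAM the patterns are read from a file, which is represented here by a plain table.

```python
# Each pattern maps a sequence of parts of speech to the order in which
# the matched words are emitted (indices into the match). Invented rules.
PATTERNS = {
    ("NOUN", "ADJ"): (1, 0),
    ("NOUN", "ADJ", "CONJ", "ADJ"): (1, 2, 3, 0),
}
MAX_LEN = max(len(p) for p in PATTERNS)


def reorder(tagged):
    """Reorder a list of (word, part_of_speech) pairs into target order,
    always preferring the longest pattern that matches at each position."""
    out, i = [], 0
    while i != len(tagged):
        for n in range(min(MAX_LEN, len(tagged) - i), 0, -1):
            tags = tuple(pos for _, pos in tagged[i:i + n])
            order = PATTERNS.get(tags)
            if order is not None:
                out.extend(tagged[i + k][0] for k in order)
                i += n
                break
        else:
            out.append(tagged[i][0])  # no pattern matched: copy the word
            i += 1
    return out
```

For the tagged input `agua/NOUN limpia/ADJ y/CONJ potable/ADJ`, the four-element pattern wins over the two-element one, so both adjectives are moved in front of the noun together.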
      <Paragraph position="1"> As indicated earlier, ENGSPAN's grammar rules are expressed in the form of an ATN. The network configuration indicates possible sequences of constituents. The rules governing the acceptability of any specific input string are contained in the conditions attached to the various arcs of the network. The building of the nodes of the structural representation and the assignment of features and roles is determined by actions associated with each arc. The conditions and actions are contained in separate modules which are part of the compiled program. The configuration of states and arcs is specified in a file which is updated on-line. The contents of this file also determine which of the conditions and actions are actually attached to specific arcs for a particular run.</Paragraph>
      <Paragraph position="2"> As of May 1984 the ATN grammar had seven networks: sentence, clause, noun phrase, verb phrase, sentence nominalization, hyphenated compound, and prepositional phrase. Each network consists of a set of states connected by arcs. Four types of arcs are used: category arcs, which can be taken if the part of speech matches that of the input word; jump arcs, which can be taken without matching a word of the input; seek arcs, which initiate recursive calls to a network; and send arcs, which return control to the calling network after the successful parsing of a constituent.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 PARSING ALGORITHM
</SectionTitle>
      <Paragraph position="0"> The ENGSPAN algorithm performs a top-down, left-to-right sequential parse using a combination of chronological and explicit backtracking. The parser stops after completing the first successful parse. The path taken through the network depends on the ordering of the arcs at each state, the structural information already determined by the parser, and the codes contained in the dictionary record for each lexical item and multiple-word entry. Also available to the parser is information regarding sentence punctuation, capitalization, parenthetical material, etc., which has been gathered by an earlier procedure. The algorithm processes the words of the input string one at a time, moving from left to right. At each state, all arcs are tested to determine whether they may be taken for the current word. The possible arcs are placed on a pushdown stack and the top arc on the stack is taken. The parser continues through the input string as long as it can find an arc that it is allowed to take. If no arc is found for the current word, the parser backtracks.</Paragraph>
      <Paragraph position="1"> Which of the alternative arcs is taken off the stack depends on the situation which caused the parser to backtrack. If the end of the string is reached and the algorithm is at a final state in the network, the parse is successful. If no path can be found through the network, the parse fails.</Paragraph>
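A heavily simplified sketch of a top-down, left-to-right network parse with the four arc types is given below. It is not the ENGSPAN parser: the two toy networks and the lexicon are invented, registers, conditions, and actions are omitted, and backtracking is obtained from Python generators rather than from the explicit pushdown stack described above.

```python
LEXICON = {"the": {"DET"}, "nurse": {"NOUN", "VERB"},
           "patients": {"NOUN"}, "treats": {"VERB"}}

# Network name, then state, then the ordered list of arcs at that state.
NETWORKS = {
    "S": {
        "s0": [("seek", "NP", "s1")],
        "s1": [("cat", "VERB", "s2")],
        "s2": [("seek", "NP", "s3"), ("jump", "s3")],
        "s3": [("send",)],
    },
    "NP": {
        "n0": [("cat", "DET", "n1"), ("jump", "n1")],
        "n1": [("cat", "NOUN", "n2")],
        "n2": [("send",)],
    },
}
START = {"S": "s0", "NP": "n0"}


def walk(net, state, words, pos, found):
    """Yield (next_position, constituents) for every way of completing
    network `net` from `state`; arcs are tried in their listed order."""
    for arc in NETWORKS[net][state]:
        kind = arc[0]
        if kind == "cat":    # consume one word of the right category
            if pos != len(words) and arc[1] in LEXICON.get(words[pos], ()):
                yield from walk(net, arc[2], words, pos + 1,
                                found + [(arc[1], words[pos])])
        elif kind == "jump":  # move on without consuming input
            yield from walk(net, arc[1], words, pos, found)
        elif kind == "seek":  # recursive call to another network
            for end, sub in walk(arc[1], START[arc[1]], words, pos, []):
                yield from walk(net, arc[2], words, end,
                                found + [(arc[1], sub)])
        elif kind == "send":  # return control to the calling network
            yield pos, found


def parse(words):
    """Return the first parse that consumes the whole input, or None."""
    for end, tree in walk("S", "s0", words, 0, []):
        if end == len(words):
            return ("S", tree)
    return None
```

For `the nurse treats`, the seek arc for an object noun phrase fails at the end of the input, the parser falls back to the jump arc, and the first complete path is returned; for `nurse patients` no path exists and the result is `None`.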
      <Paragraph position="2"> Long-distance dependencies such as those involved in relative clauses and WH-questions are parsed by using a hold list. When the parser encounters a noun phrase followed by a relative pronoun, a copy of the phrase is placed on the hold list. When a question is being parsed, the questioned element is placed on the list. When a gap is detected in the relative clause or interrogative sentence, the phrase on the hold list is used to fill the appropriate slot.</Paragraph>
      <Paragraph position="3"> Whenever backtracking is required, a well-formed phrase list is used to save a copy of the phrases that have been completed but are about to be modified or rejected.</Paragraph>
      <Paragraph position="4"> For all seek arcs, the parser checks to see if a phrase of the appropriate type is already on the well-formed phrase list. If there are several phrases on the list that begin with the same word, the longest phrase is tried first. A new phrase is parsed only if there is nothing on the well-formed phrase list that satisfies the seek arc. In this way, large amounts of reparsing are avoided.</Paragraph>
      <Paragraph position="5"> Conjoining is currently being handled by a configuration of arcs at the end of each subnetwork which allows additional phrases of the same type to be parsed recursively. When partial phrases are conjoined, the end of the subnetwork is reached by traversing one or more jump arcs.</Paragraph>
      <Paragraph position="6"> In the event of an unsuccessful parse, ENGSPAN is still expected to produce a translation. The longest successful path is always saved, and information from this "partial parse" can be used by the synthesis routines. Local routines are used to analyze the remainder of the input string. These routines function as a "safety net". They resolve homograph ambiguities and analyze verb strings and noun strings, adding as much information as they can to the structural description of the sentence as a whole.</Paragraph>
      <Paragraph position="7"> The ATN parsing algorithm is being developed in an independent PL/I program, using the ENGSPAN input and dictionary lookup modules. 3 It is also totally compatible with SPANAM. The network grammar is read in at runtime, making it possible to experiment with different network configurations without recompiling the program.</Paragraph>
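      <Paragraph> The benefit of reading the network at runtime can be shown with a small Python sketch. The four-column text format, the state names, and the three arc types used here (cat, jump, pop) are assumptions chosen for brevity; they do not reproduce the ENGSPAN grammar file, and the recognizer handles only a sequence of word categories.

```python
# A hypothetical network description: state, arc type, label, target state.
GRAMMAR = """
# noun phrase subnetwork
NP/start cat  det  NP/det
NP/start jump -    NP/det
NP/det   cat  adj  NP/det
NP/det   cat  noun NP/end
NP/end   pop  -    -
"""


def load_network(text):
    """Read arcs from text into a dictionary: state -> list of arcs."""
    arcs = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        state, arc_type, label, target = line.split()
        arcs.setdefault(state, []).append((arc_type, label, target))
    return arcs


def accepts(arcs, state, categories, position=0):
    """Depth-first traversal with backtracking over the loaded arcs."""
    for arc_type, label, target in arcs.get(state, []):
        if arc_type == "pop":
            if position == len(categories):
                return True
        elif arc_type == "jump":
            # A jump arc moves to the target state without consuming input.
            if accepts(arcs, target, categories, position):
                return True
        elif arc_type == "cat":
            if position != len(categories) and categories[position] == label:
                if accepts(arcs, target, categories, position + 1):
                    return True
    return False
```

Editing the text of the network, for example deleting the jump arc so that a determiner becomes obligatory, changes what the recognizer accepts without any change to the program itself.</Paragraph>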
      <Paragraph position="8"> Each time an enhanced version of the parser has been tested and debugged, it replaces the working version in the ENGSPAN program. The diagram in Figure 1 shows the relationship between the parser and the "safety net" routines in ENGSPAN. The parser is to be incorporated in a similar way into SPANAM as well.</Paragraph>
      <Paragraph position="9"> A complete description of the ATN grammar and parser will be found in the report to be submitted to the U.S. Agency for International Development at the end of the grant period (October 1985).</Paragraph>
    </Section>
  </Section>
  <Section position="14" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 PRACTICAL EXPERIENCE
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 SYSTEM MAINTENANCE
</SectionTitle>
      <Paragraph position="0"> As of May 1984, the Spanish source dictionary had 60,150 entries and the English target had 57,315. The program for updating the SPANAM dictionaries is user-friendly. Many default codes are entered by the update program automatically. Even though there are now 211 possible fields in which codes can be entered, as opposed to the original 82, almost all of them can be specified using mnemonic descriptors and code names.</Paragraph>
      <Paragraph position="1"> Today, updating is done almost exclusively on the basis of production text. Every job reveals ways in which the dictionaries can be improved, with either new glosses for individual words or idiomatic phrases, especially in the case of technical terminology, or deeper coding of existing entries. The steady, ongoing development of the dictionaries (Table 1) has ensured both a decrease in not-found words, with advantages for program effectiveness, and closer correspondence to the type of language used in the Organization, leaving less work for the post-editor. As indicated earlier, it is the post-editor who notes the changes needed at the time of post-editing and who later updates the dictionaries. An hour is reserved for this work at the end of the day. When production permits, the post-editor may spend extra time on dictionary work; if there is pressure, the work may have to be postponed for a while.</Paragraph>
    </Section>
  </Section>
  <Section position="15" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A major portion of the parsing routines have been developed by Lee
</SectionTitle>
    <Paragraph position="0"> Ann Schwartz, who has participated in this activity on a full-time basis since August 1983.</Paragraph>
    <Paragraph position="1"> Because of the integration of dictionary-building into the work of the post-editor, the cost is no longer an element that can be clearly identified.</Paragraph>
    <Paragraph position="2">  ENGSPAN has the same user-friendly software as SPANAM for updating its dictionaries. As of May 1984, the English source dictionary had 41,210 entries and the Spanish target had 42,638.</Paragraph>
    <Paragraph position="3"> The AID project has provision for two half-time dictionary assistants, one a linguist of English mother tongue and one a translator of Spanish mother tongue. A new deeply coded source entry costs from $0.60 to $1.00; Spanish target glosses that require research are about the same. Simple changes in existing entries average about $0.25 each.</Paragraph>
  </Section>
</Paper>