<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2015">
  <Title>NEW POSSIBILITIES IN MACHINE TRANSLATION</Title>
  <Section position="4" start_page="99" end_page="99" type="metho">
    <SectionTitle>
MT SYSTEMS: COMPONENTS AND APPROACHES
THE COMPONENTS OF AN MT SYSTEM
</SectionTitle>
    <Paragraph position="0"> In order to build an MT system, the following program modules or components are needed:  In all MT systems, these modules are related essentially as shown in Figure 1. We briefly discuss each module below.</Paragraph>
    <Paragraph position="1"> Parser: Sentences of the source text are parsed into some internal form by the parser. In almost all current MT systems, the internal form represents both syntactic and semantic aspects of the input. Interlanguage Translation Rules: Many MT systems contain a set of rules that transform certain aspects of the internal representation of the input to make them conform to the requirements of the target language. Such MT systems are known as transfer-based. An alternative approach is to build MT systems without transfer rules, using a single intermediate representation form called an interlingua; the generality and power of such systems depends on the expressiveness of the interlingua used. Generator: The (modified) internal representation of the input is generated as sentence(s) of the target language by the generator. The output must express the semantic content of the internal form, and if possible should use syntactic forms equivalent to those present in the input.</Paragraph>
    <Paragraph position="2"> Grammars: In some systems, the grammars (syntactic information) are intrinsic parts of the parser and generator; in others, the grammars can be separated from the procedural mechanism. In bidirectional systems, the parser and generator use the same grammar to analyze and produce each language. Such systems are desirable because they do not duplicate syntactic information and are therefore more  maintainable. True bidirectional grammars have proven hard to build, not least because existing knowledge representation formalisms do not provide some capabilities (such as inference over disjunction.) that facilitate parsing and generation.</Paragraph>
    <Paragraph position="3"> Semantic Knowledge Base: All sophisticated MT systems make heavy use of a knowledge base (representing underlying semantic information) containing the ontology of the application domain: the entities and their possible interrelationships. Among other uses, the parser requires these entities to perform semantic disambiguation and the generator uses them to determine acceptable paraphrases where exact 'literal' formulations are not possible.</Paragraph>
    <Paragraph position="4"> Lexicons: All MT systems require a lexicon for the source language and one for the target language. In simple systems, corresponding entries in the two lexicons are directly linked to each other; in more sophisticated systems, lexicon entries are either accessed by entities represented in the knowledge base, or are indexed by characteristic collections of features (as built up by the parser).</Paragraph>
  </Section>
  <Section position="5" start_page="99" end_page="102" type="metho">
    <SectionTitle>
APPROACHES TO MT
</SectionTitle>
    <Paragraph position="0"> Using these basic modules, a number of different approaches to the problem of MT are possible.</Paragraph>
    <Paragraph position="1"> The Lexical Approach: Many of the early MT systems, as well as some existing projects, base their approach on the lexicon to a large extent. Typically, in such systems one finds a proliferation of highly specific translation rules spread throughout the lexicon; in fact, the size and complexity of lexical entries can be used as a touchstone for the degree to which the system is lexically based or not. While this approach may work for a time for any specific domain, it lacks the power that comes from a general, well-founded theoretical underpinning. This is the reason such systems tend to become larger and seemingly less defined as they grow, while not necessarily exhibiting greatly increased performance.</Paragraph>
    <Paragraph position="2"> The Interllngua Approach: The second approach is to use an interlingua as a language into which to parse and from which to generate. Early attempts at an interlingua (such as the Conceptual Dependency representation \[Schank 75\]) did not lead to much success primarily due to the difficulty of dealing with terms on a very primitive (in the sense of basic or fundamental) level: sentences, when parsed, had to be decomposed into configurations of the basic elements, and to be generated, had to be reassembled again.</Paragraph>
    <Paragraph position="3"> Given the basic level of the elements used at the time, this task was too complex to support successful MT.</Paragraph>
    <Paragraph position="4"> The Transfer Approach: Many later systems relied less on translation rules hidden in the lexicon and more on representation-transforming rules associated with representational features. This approach gained popularity when early experiments with interlinguas failed due to researchers' inability to develop powerful enough language-neutral representation terms. However, the approach also suffers from a proliferation of rules, especially when more than two languages are present: for n languages, O(n 2) sets of translation rules are required.</Paragraph>
    <Paragraph position="5"> At present, no single approach is the clear winner. The systems with the most practical utility at present, the commercially available Japanese systems, all use a relatively crude lexical approach and derive their power from the brute force provided by tens of thousands of rules. Most promising for newer more general systems seems to be a mixture of the interlingua and transfer approaches.</Paragraph>
  </Section>
  <Section position="6" start_page="102" end_page="104" type="metho">
    <SectionTitle>
WHY A NEW ATTEMPT AT MAT?
</SectionTitle>
    <Paragraph position="0"> The time is ripe for a new initiative in the investigation of MAT in the U.S.A. The principal reasons for this are both strategic and technical. In the first instance, a large amount of MT work is being done in Europe (including such multinational projects as the EEC-wide EUROTRA) and Japan, with increasing success; little MT work is done in the U.S. (most of which is funded by Japanese money).</Paragraph>
    <Paragraph position="1"> In the second instance, recent technical breakthroughs, coupled with the steady advances of the past 25 years, make possible the establishment of small MAT projects and their rapid growth to achieve a high level of sophistication.</Paragraph>
    <Paragraph position="2"> These advances, discussed in more detail below, are the following:</Paragraph>
    <Section position="1" start_page="102" end_page="104" type="sub_section">
      <SectionTitle>
Representation Languages:
</SectionTitle>
      <Paragraph position="0"> Advances have been made in the theory of representation languages which make possible a new integrated treatment of syntax and semantics. Usually, semantic knowledge is represented in knowledge representation languages such as those of the KL-ONE family. Syntactic knowledge, on the other hand, is hardly ever (if at all) represented in these languages, and neither are the numerous intcrrnediate structures built by parsers. This is because disjunction (the logical operator OR) has generally not been included in the language capabilities. The result is a serious problem, since parsers necessarily deal with multiple options due to the structural and semantic ambiguities inherent in language. The inability to represent both syntactic and semantic knowledge in the same system has precluded the development of parsers using a single inferencing technique to perform their work in a homogeneous and unified manner. Thus the lack of a general framework for computing with disjunctive knowledge structures has always been a hindrance to the development of parsing technology.</Paragraph>
      <Paragraph position="1"> Work is currently under way to incorporate inference over disjunctions into Loom \[MacGregor &amp; Bates 87\], a newly developed exemplar of the KL-ONE-Iike languages, at ISI. This work extends the capabilities of earlier methods for handling disjunctive descriptions in unification-based parsers (see \[Kasper 87, Kaspcr 88\]). It is expected to be completed by the end of 1989. This breakthrough will have two major effects: greatly simplified parsers and enhanced processing speed and efficiency.</Paragraph>
      <Paragraph position="2"> In more detail, this innovation makes possible, in a single KL-ONE-Iike representation system, the representation of both semantic and syntactic knowledge. In this scheme, the automatic concept classifier will be used as a powerful resource to perform simultaneous syntactic and semantic-based classificatory inference under control of the parser. Until now, the flow of control between syntactic and semantic processing has always been a vexing question for parsers: for semantic processing, they have used classificatory inference of various kinds, and for syntactic processing, a variety of other methods, including unification. Since syntactic ambiguities are often resolved by semantic information, and vice versa, it is important to make the results of each type of processing available to the other as soon as possible.</Paragraph>
      <Paragraph position="3">  Difficulties in doing so have always meant that one or the other process is made to perform more work (in some cases significantly more) than necessary, requiring the maintenance of numerous alternatives of interpretation. Under the new scheme, the representation of syntactic and semantic knowledge in the same representation system simplifies the parsing process considerably, since there is then only one inference process and its results are represented in a single formalism. Also, the speed and efficiency of the parser is increased, since each type of processing can be performed as soon as possible and no additional work need be done. This new integrated approach, enabled by the ability to handle inference over disjunction, has not been developed before.</Paragraph>
      <Paragraph position="4"> Melding interlingua and transfer approaches: A second breakthrough is the maturation of a representation scheme which enables the melding of the best features of the interlingua and transfer approaches. Problems arise with the interlingua approach either when the interlingua is too 'shallow' to capture more than the surface form of the source text (and hence requires nuance-specific translation rules) or when it is too 'deep' to admit easy parsing and generation, as is the case with Conceptual Dependency \[Schank 75\].</Paragraph>
      <Paragraph position="5"> Knowledge representation experience over the past 15 years has resulted in a much better understanding of the different types of representation schemes and of the ways to define representation terms that support the tasks at hand (the literature contains much work in this regard; see for example \[Hobbs 85, Hobbs et al. 86\]). However, the organization of such terms to facilitate optimal language translation has been a problem until the recent recognition that a taxonomy of abstract generalizations of linguistically motivated classes can be used as a type of generalized transfer rule. It has become clear that, using the abstract conceptual categories necessary to support the generation of the source and target languages (such as the Upper Model for English in the Penman system; see \[Bateman et al. 89c\]), it is possible to exploit the commonalities across languages to bypass the need for numerous transfer rules. To the extent that English shares with the other languages a linguistically motivated underlying ontology of the world (especially at the more abstract levels, taxonomizing the world into objects, qualities, and processes such as actions, events, and relations), such a conceptual model can act as a type of interlingua in an MAT system, where differences are taken care of by transfer rules of the normal type. For example, the fact that actions have actors is general enough to be part of the generalized 'interlingua', while particularities of tenses in various languages is not.</Paragraph>
      <Paragraph position="6"> By building a suitable taxonomic organization of these terms, both the abovementioned problems can be avoided: by defining enough specific terms in the taxonomy, nuances present in the domain can be represented; and by basing the terms of the taxonomy on linguistically derived generalizations (instead of, say, on notions about the underlying reality of the physical universe as in the case of CD), the ease of parsing and generation can be guaranteed. The use of such a taxonomy for MAT has been investigated; a pilot study is reported in \[Bateman et al. 89a, Bateman et al. 89b\]. The central ideas are described in some considerable detail in \[Bateman 89\]. This semi-interlingua approach is preferable to the lexically based and pure transfer approaches, since it minimizes the number of special-purpose rules required to customize the system to a new domain, and hence increases the power and portability of the MAT system.</Paragraph>
      <Paragraph position="7"> Grammar development: One of the steady advances in the field of Natural Language Processing is the development of more complete grammars. There exist today computational grammars that cover English (and other languages such as German, Chinese, Japanese, and French) far more extensively than the most comprehensive grammars of 20 years ago did. Modern MAT system developers thus need spend much less effort on grammar development and can concentrate on less understood issues.</Paragraph>
      <Paragraph position="8"> Generation and parsing technology: Another advance is in generation and parsing technology. The issues in single-sentence parsing and  generation have been studied to the point where a number of well-established paradigms and algorithms exist, each with known strengths and weaknesses (in fact, in the last 5 years a number of general-purpose generators have been distributed, including Penman \[Penman 88\], MUMBLE \[Meteer et al. 87\], and SEMSYN \[RSsner 88\]). Obviously, this situation greatly facilitates the construction of new MAT systems.</Paragraph>
      <Paragraph position="9"> Knowledge about Machine Translation: The amount of knowledge about MT available today is much larger than it was 20 years ago. More than one journal is devoted to the topic (for example, Computers and Translation). Books on the subject include \[Nirenburg 87, Slocum 88, Hutchins 86\]. Some larger MT systems developed over the past decade are the EEC-sponsored EUROTRA project \[Arnold ,~ des Tombe 87, Arnold 86\], the METAL project \[Bennett 82\], the Japanese-German SEMTEX-SEMSYN project \[RSsner 88\]. Two current MT projects in the U.S. are KBMT \[Nirenburg et al. 89\] and a project at the CRL (New Mexico State University).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="104" end_page="105" type="metho">
    <SectionTitle>
WHAT WOULD AN MAT PROGRAM INVOLVE?
</SectionTitle>
    <Paragraph position="0"> The three cornerstones of an MT system are the parser, the generator, and the knowledge representation system. Computational Linguistics research has developed far enough today that there are available in the world at least four general-purpose language generators, two of them English-based, and a number of limited-purpose parsers. A number of knowledge representation systems are also available, some of which commercially (such as KEE, manufactured by Intellicorp), and others in the public domain (such as NIKL and Loom \[Kaczmarek et al. 86, MacGregor &amp; Bates 87\]).</Paragraph>
    <Paragraph position="1"> In essence, an MAT project under the new effort would perform six steps:  of the various approaches and to identify the major problems facing MT. A standard set of MT tests should be applied at various stages of the program.</Paragraph>
  </Section>
  <Section position="8" start_page="105" end_page="105" type="metho">
    <SectionTitle>
A SCENARIO FOR AN MAT PROGRAM
</SectionTitle>
    <Paragraph position="0"> This section outlines how a solid MAT capability can be achieved in the next five years for a relatively small investment.</Paragraph>
  </Section>
  <Section position="9" start_page="105" end_page="105" type="metho">
    <SectionTitle>
WHAT SHOULD THE PROGRAM AIM FOR?
</SectionTitle>
    <Paragraph position="0"> The program should aim at establishing a small MAT program in the U.S. to conduct good research on the basic issues, to stay abreast of the developments happening elsewhere in the world, to develop and exploit current breakthroughs in the technology in the form of prototypes that perform machine-aided translation, and to foster collaborations among the various Darpa-supported NLP projects. Its goals  should be: 1. to stimulate development and incorporation of the newest techniques in order to identify and push the limits of MT possible today, 2. to focus on technologies that provide general, extensible capabilities, thereby surpassing less general foreign efforts, 3. to develop prototype systems that exemplify this work in limited domains, . to use the tests developed by various other MT projects (such as the EEC project EUROTRA)  to measure the progress and success of the current technology, and to identify its most serious bottlenecks and limitations, 5. to stimulate collaborations and software sharing among various groups developing appropriate NLP-related software and theories in this country.</Paragraph>
    <Paragraph position="1"> The program should not aim at the development of single-sentence translation systems with wide coverage of narrow domains that possess little generality (as proven by the commercially available Japanese systems, this can be achieved by brute force). It should instead aim at the development of prototype systems that illustrate the translation of multipage texts, and that are general, easily portable, and accommodate new domains and languages with a minimum of effort. That is, generality and feasibility are the properties that will propel this effort beyond current technology.</Paragraph>
  </Section>
  <Section position="10" start_page="105" end_page="106" type="metho">
    <SectionTitle>
OVERVIEW OF THE PROPOSED PROGRAM
</SectionTitle>
    <Paragraph position="0"> This subsection describes some important facets of the proposed MAT program.</Paragraph>
    <Paragraph position="1"> Given the amount of existing NLP technology, a relatively small investment can result in a significant effort in MAT over a period of 5 years. By making use of existing parsers and generators and grammars, individual projects can be kept reasonably small in manpower (on the order of four to five people per project).</Paragraph>
    <Paragraph position="2"> Limiting project size enables the support of a greater number of projects. This is important because, due to differences in their theoretical approach, systems can be variously successful simply by virtue of  the limitations of the theory they embody, and can thus hinder the principal task, which is to identify the major bottlenecks of MT. Therefore the program should encourage two or three different theoretical approaches in order to help find the best one and to promote the development of technology which will deliver near-term machine-aided translation and lay the foundation for full machine translation in the long run.</Paragraph>
    <Paragraph position="3"> The program should specify a domain of application for the MAT systems which is easily modeled and represented, and for which the language typically used is clear and relatively unambiguous. A popular domain for existing MAT systems is that of technical documents such as computer manuals, descriptions of computer architectures or operating systems, etc. Another alternative domain is intelligence reports. Beyond the obvious advantages of such domains is the fact that evaluation techniques and tests have already been developed for translated technical documents by such projects as EUROTRA.</Paragraph>
    <Paragraph position="4"> In order to ensure that the systems developed are reasonably flexible and general, they should be encouraged to be more than bilingual. This can be achieved by developing the systems first to handle English and one other language and then to incorporate a third afterward. This suggests a 5-year plan broken into three phases: a startup phase of one year for English-to-English paraphrasing, a second phase of two years to include a second language, and a final phase of two years to refine the second language and include a third language.</Paragraph>
    <Paragraph position="5"> This scenario involves a 5-year plan, at an investment of between $1 million and $2.5 million per year, as follows:  This money should support three groups of between 3 and 5 people per group. At various times, each group would use the services of a parser specialist, a generator specialist, a knowledge representation specialist, and a text specialist, as well as of programmers. Since it is unlikely that any single group will have available such depth of experience, this requirement would foster collaborations among NLP research projects in this country.</Paragraph>
  </Section>
  <Section position="11" start_page="106" end_page="108" type="metho">
    <SectionTitle>
PROGRAM TIMETABLE
</SectionTitle>
    <Paragraph position="0"> In order to minimize the amount of wasted effort, projects under this program should be encouraged to use as much existing NLP technology as possible. This is to some degree enforced by the requirement of a demonstration after 3 years, which is quite reasonable given the availability of general-purpose generators, parsing techniques, and knowledge representation systems. An additional saving of effort can be achieved by using, as second and third languages, grammars that have been developed by grammarians and computational linguists in other countries. It is suggested that German be used as the second language, since a number of computational grammars of German exist in the public domain, and since German is structurally very close to English. The third language could be the choice of individual projects so as to allow them to capitalize on their strengths, but should be a language structurally quite different from English (such as Japanese or Chinese), so as to test the generality of the underlying theoretical approach.</Paragraph>
    <Paragraph position="1"> Thus the program can be structured as follows:  Year 1: * Selection and adaptation of a parser.</Paragraph>
    <Paragraph position="2"> * Selection and adaptation of a generator.</Paragraph>
    <Paragraph position="3"> * Selection and incorporation of the English grammar(s).</Paragraph>
    <Paragraph position="4"> * Representation of the domain, construction of a domain model.</Paragraph>
    <Paragraph position="5"> * Selection and establishment of the English lexicon.</Paragraph>
    <Paragraph position="6"> * Demonstration of the first stage of the system by a limited paraphrase task: parsing English texts and then generating English paraphrases of them.</Paragraph>
    <Paragraph position="7"> Year 2: * Selection and initial incorporation of the German grammar.</Paragraph>
    <Paragraph position="8"> * Selection and incorporation of the German lexicon.</Paragraph>
    <Paragraph position="9"> * Integration of the initial German grammar with the parser and generator.</Paragraph>
    <Paragraph position="10"> * Demonstration of the second stage of the system by parsing some German texts and generating English equivalents and vice versa.</Paragraph>
    <Paragraph position="11"> Year 3: * Refinement of the German grammar.</Paragraph>
    <Paragraph position="12"> * Refinement of the German lexicon.</Paragraph>
    <Paragraph position="13"> * Completion of additional data sources such as domain models, transfer rules, etc. * Integration of the completed German grammar with the parser and generator. * Demonstration of the third stage of the system by parsing German texts and generating English equivalents and vice versa.</Paragraph>
    <Paragraph position="14"> * Establishment of a prototype English-German MAT system.</Paragraph>
    <Paragraph position="15"> Year 4: * Selection and incorporation of the third language grammar (e.g. Japanese) * Selection and incorporation of the third language lexicon.</Paragraph>
    <Paragraph position="16"> * Refinement of the English-German translations by development of additional techniques and transfer rules.</Paragraph>
    <Paragraph position="17"> * Demonstration of the refined English-German MAT system.</Paragraph>
    <Paragraph position="18"> Year 5: * Completion of the third language grammar.</Paragraph>
    <Paragraph position="19"> * Completion of the third language lexicon.</Paragraph>
    <Paragraph position="20">  * Completion of additional data sources such as domain models, transfer rules, etc. * Integration of the completed third language grammar with the parser and generator. * Demonstration of the final stage of the system, comprising translation in all six directions between English, German, and the third language.</Paragraph>
    <Paragraph position="21"> * Evaluation of the coverage and sophistication of the translations using the test and measures developed by EUROTRA, as applicable to the domain.</Paragraph>
    <Paragraph position="22"> * Reports of the major shortcomings and bottlenecks that stand in the way of more complete MT.  The program should encourage evaluation of the prototype systems at every stage, using a well-conceived set of measures such as those of the EUROTRA project. One measure of evaluation is to count the number of sentences translated correctly (i.e., without requiring other than stylistic changes by the editor). This measure can be subdivided according to the type(s) of error made: syntactic, semantic, lexical, unknown word (lexical), unknown concept (semantic), etc. The projects should aim at a 50% correct sentence rate by the end of phase 2 and for a 75% rate for the German translation by the end of the program. Another measure is to compare the time required to translate a piece of text by a human alone with the time taken by a human in conjunction with the system. Existing commercial systems, using brute-force techniques, claim a speedup rate of 50%, establishing a bottom line which the projects' prototypes can strive to improve.</Paragraph>
  </Section>
class="xml-element"></Paper>