<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2201">
  <Title>PaTrans - A Patent Translation System</Title>
  <Section position="4" start_page="1115" end_page="1115" type="metho">
    <SectionTitle>
3 An overview of the Translation Process
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1115" end_page="1115" type="sub_section">
      <SectionTitle>
3.1 Document handling
</SectionTitle>
      <Paragraph position="0"> The document handling step has four main functions: * Format Preservation Input to document handling is a text from a text processing system which has been marked up in SGML. The SGML codes denote e.g. titles, paragraphs, text segments that should not be translated, etc. All information about document layout is stored separately and taken away from the translation process.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1115" end_page="1115" type="metho">
    <SectionTitle>
* Formula Recognition
</SectionTitle>
    <Paragraph position="0"> The document handler automatically recognises certain text-typical untranslatable units, such as chemical formulas and tables.</Paragraph>
  </Section>
  <Section position="6" start_page="1115" end_page="1116" type="metho">
    <SectionTitle>
* Term Recognition
</SectionTitle>
    <Paragraph position="0"> Terms and multi-word units are also recognised at this stage. In this context, words are treated as terms if they are subject specific or if they have a unique translation in the given text type. They are recognised during text handling and have their translation equivalent attached to them along with morphosyntactic information for both source and target language.</Paragraph>
    <Paragraph position="1"> * Segmentation Finally the text is separated into units for translation, i.e. sentences, for which various recognition patterns have been set up. In some patent texts of specific subject fields, the sentences are incredibly long. In these cases, there is no point in trying to arrive at a complete parse of the whole sentence, since the parse is most likely to fail and processing will be too space and time consuming. Therefore the document handler attempts to arrive at a meaningful partition of the sentences by identifying sentence internal boundaries and submitting the individual subparts for translation.</Paragraph>
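The partitioning step described above can be sketched as follows. This is a minimal illustration, not the PaTrans implementation: the paper does not list its recognition patterns, so the boundary pattern (a semicolon, or a comma followed by "wherein" or "whereby") and the word threshold are assumptions chosen only to make the idea concrete.

```python
import re

# Hypothetical boundary pattern: split after a semicolon, or at a comma
# that is followed by "wherein" or "whereby". Not taken from the paper.
BOUNDARY = re.compile(r";\s+|,\s+(?=where(?:in|by)\b)")


def partition(sentence: str, max_words: int = 12) -> list[str]:
    """Return short sentences whole; split long ones at internal boundaries."""
    if len(sentence.split()) > max_words:
        parts = [p.strip() for p in BOUNDARY.split(sentence)]
        return [p for p in parts if p]
    return [sentence]
```

Each returned subpart would then be submitted to the parser separately, which keeps the search space small for very long patent sentences.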
    <Paragraph position="2"> Before the text is passed on to the parser, it is subjected to a thorough process of disambiguation. This is one of the new features of PaTrans compared to the EUROTRA model and will be discussed in detail below.</Paragraph>
    <Paragraph position="3"> Since PaTrans is based on the transfer translation model, the surface strings of the text are sequentially transformed into an intermediate representation defined by several mapping principles. During source language analysis the sentences are assigned a surface syntactic structure. This surface syntactic structure is converted into a language-neutral transfer representation ordering the constituents of the sentence in a canonical order with heads preceding arguments and arguments preceding modifiers (Copeland et al., 1991a). The transfer representation is a reflection of the argument structure of the predicates where information about surface syntactic realization appears as features on the individual nodes. Function words (conjunctions, determiners, prepositional case markers) are featurized and tense/aspect and negation represented in language-neutral features.</Paragraph>
    <Paragraph position="4"> The output of source language analysis is thus a tree with multilayered information including syntactic and morphosyntactic features, as well as the syntactic/semantic relationships between the predicators and the arguments. At all levels, sets of preference rules based on heuristic principles select among competing analyses, e.g. for PP-attachment (Bennett and Paggio, 1993).</Paragraph>
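The canonical ordering of constituents (heads before arguments, arguments before modifiers) can be illustrated with a small sketch. The role labels and the Node type below are hypothetical; the paper states only the ordering principle, not a data structure.

```python
from dataclasses import dataclass

# Illustrative role labels; only the head/argument/modifier ordering
# comes from the text.
ROLE_RANK = {"head": 0, "arg": 1, "mod": 2}


@dataclass
class Node:
    role: str
    lemma: str


def canonical_order(constituents: list[Node]) -> list[Node]:
    """Stable sort: heads first, then arguments, then modifiers."""
    return sorted(constituents, key=lambda n: ROLE_RANK[n.role])
```

Because Python's sort is stable, arguments keep their relative surface order, which is one possible design choice rather than something the paper specifies.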
    <Paragraph position="5"> PaTrans adheres to simple transfer, i.e. the substitution of source language lexical units with target language lexical units by means of lexical transfer rules (see footnote 2), while the source language structural representation is mapped directly onto the target language transfer representation which is input to the generation module. There are two main reasons why complex transfer (i.e. transfer where the structure of the input representation is altered) is kept at a minimum: * Complex transfer is costly inasmuch as the general applicability of the rules is usually very restricted.</Paragraph>
    <Paragraph position="6"> * A transfer rule applies to any object matching its left-hand side and performs the mapping defined on the right-hand side. Due to the 'fail-soft' mechanism (discussed below), the structure of the objects which the transfer rules must apply to cannot be fully predicted. In order for complex transfer to work in all cases, rules must be set up not only for correctly parsed input structures, but also for the special fail-soft structures. For this reason, complex transfer is costly and is only used for frequent phenomena considered crucial for good translation, e.g. converting certain English ing-forms into Danish relative clauses.</Paragraph>
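Simple transfer, as opposed to complex transfer, can be sketched as a tree copy in which only the lexical units change. The dictionary entries, the dict-based node layout and the pass-through of unknown words are all assumptions made for illustration.

```python
# Toy English-to-Danish lexical transfer rules; illustrative entries only.
LEXICAL_RULES = {"valve": "ventil", "open": "åbne"}


def simple_transfer(node: dict) -> dict:
    """Copy the tree, replacing lexical units and leaving the structure as is."""
    return {
        # Assumption: a word without a rule passes through unchanged.
        "lu": LEXICAL_RULES.get(node["lu"], node["lu"]),
        "features": dict(node.get("features", {})),
        "children": [simple_transfer(c) for c in node.get("children", [])],
    }
```

Since the function never inspects the shape of the tree, it applies equally to well-formed parses and to fail-soft structures, which is the property that makes simple transfer cheap compared with structure-changing rules.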
    <Paragraph position="7"> 3.1.4 Target syntactic generation During generation, the transfer representation is mapped onto a target syntactic structure through intermediate representational levels. At the first level, the target language lexical units are looked up in the lexical database and monolingually relevant features are calculated on the basis of the language-neutral representation, e.g. tense and aspect. Footnote 2: Recall that this only applies to words of the general vocabulary which require disambiguation during analysis and not to terms.</Paragraph>
    <Paragraph position="8"> At the second level (the relational level) surface syntactic functions are calculated and certain function words, such as prepositional markers, are inserted. Finally, the relational structure is mapped onto the level defining the constituent structure of the target language sentence. At this level all information with independent lexical expressions is present.</Paragraph>
    <Paragraph position="9"> PaTrans has a highly developed morphological module which provides an almost complete coverage of Danish inflectional morphology. The module is based on structure building rules which allow for downwards expansion. Regular inflection, syncope and gemination are accounted for, while only completely irregular word forms will have to be coded in their entirety. PaTrans also has a limited strategy for translating compounds compositionally. Generally, compounds are coded in the (terminological) dictionaries, but the parser tries to translate compounds which are not coded in the dictionaries by translating their individual subparts.</Paragraph>
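The fallback strategy for compounds can be sketched as a dictionary lookup followed by part-by-part translation. The dictionary contents and the plain concatenation of the translated parts are assumptions; the paper does not say how the target compound is assembled.

```python
from typing import Optional

# Toy dictionaries; the entries are illustrative only.
TERM_DICT = {"ball bearing": "kugleleje"}
WORD_DICT = {"valve": "ventil", "housing": "hus"}


def translate_compound(compound: str) -> Optional[str]:
    """Use the coded compound if present, else translate it part by part."""
    if compound in TERM_DICT:
        return TERM_DICT[compound]
    parts = compound.split()
    if all(p in WORD_DICT for p in parts):
        # Assumption: the target compound is formed by plain concatenation.
        return "".join(WORD_DICT[p] for p in parts)
    return None  # no compositional translation available
```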
    <Paragraph position="10"> Finally, the document generation module inserts all SGML-markers and all items which have been marked as untranslatable (tables, formulas, numbers etc.), and a separate conversion programme converts the output into WordPerfect format (see footnote a).</Paragraph>
  </Section>
  <Section position="7" start_page="1116" end_page="1116" type="metho">
    <SectionTitle>
4 The lexica
</SectionTitle>
    <Paragraph position="0"> PaTrans distinguishes two kinds of vocabularies: the general vocabulary and the terminological vocabularies. * The general vocabulary is stored in a monolingual English dictionary, a monolingual Danish dictionary separated into a syntactic and a morphological level, and a bilingual transfer dictionary.</Paragraph>
    <Paragraph position="1"> * The terminology is divided into subject specific databases. As PaTrans is used for a number of different subject fields, the priority of the databases is user-defined and flexible. The user specifies which term bases are to be used for a translation job, and in which order of priority. When a term is found in one term base, it is not looked up further in the subsequent databases.</Paragraph>
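The prioritised term base lookup can be sketched as a first-match search over an ordered list. The function name and the representation of a term base as a plain dictionary are assumptions for illustration.

```python
from typing import Optional


def lookup_term(term: str, term_bases: list[dict[str, str]]) -> Optional[str]:
    """Search term bases in user-defined priority order; stop at the first hit."""
    for base in term_bases:
        if term in base:
            return base[term]
    return None
```

With a hypothetical chemistry base {"solution": "opløsning"} and a general technical base {"solution": "løsning"}, listing the chemistry base first yields "opløsning" and reversing the order yields "løsning", so the same term can be translated differently per job without editing any database.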
    <Paragraph position="2"> Footnote a: Until now, all texts have been delivered in WordPerfect, but the conversion programme may of course be adapted to other text processing systems.</Paragraph>
    <Section position="1" start_page="1116" end_page="1116" type="sub_section">
      <SectionTitle>
4.1 PaTerm Coding Tool
</SectionTitle>
      <Paragraph position="0"> For ease of maintenance and updating, PaTrans has a special coding tool. As mentioned above, the PaTrans term bases contain terms as well as words and expressions which behave like terms, i.e. which have unique translations. New terms occur in each and every patent document which is submitted for translation. Consequently, it is important that the user, who is not necessarily a computational linguist, can encode terms in an efficient and precise way. The PaTerm coding tool provides a screen with fields to fill in, and in most cases an answer is proposed by the system, so that the user has to make just one acceptance keystroke. Care has been taken to present the most frequent, and therefore most probable, answer on the top of the list. PaTerm asks the minimum number of questions and computes the remaining linguistic information from the answers received.</Paragraph>
      <Paragraph position="1"> This also saves time for the user.</Paragraph>
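Presenting the most frequent answer first can be sketched with a frequency count over earlier answers. How PaTerm actually derives its proposals is not described, so the use of previously chosen answers as the frequency source is an assumption.

```python
from collections import Counter


def ranked_proposals(previous_answers: list[str]) -> list[str]:
    """Order candidate answers by how often they were chosen before."""
    return [answer for answer, _ in Counter(previous_answers).most_common()]
```

The first element of the returned list would be the pre-selected proposal, so that the common case costs the user a single acceptance keystroke.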
    </Section>
  </Section>
  <Section position="8" start_page="1116" end_page="1117" type="metho">
    <SectionTitle>
5 Special Features
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1116" end_page="1116" type="sub_section">
      <SectionTitle>
5.1 Error Recovery
</SectionTitle>
      <Paragraph position="0"> Since the system runs in a practical environment, it must never fail to produce an output, even if it encounters an unanalysable sentence. Consequently, a fail-soft mechanism was introduced.</Paragraph>
      <Paragraph position="1"> The fail-soft mechanism works at all levels of representation. If the parser fails to assign a well-formed structure to the input, a path is selected from the chart which spans the greatest amount of the input and already created constituents are collected. The quality of fail-soft output varies considerably and recent work has attempted to improve the results of fail-soft. Disambiguation of individual words, the selection of appropriate readings and the determination of individual constituents at a very early stage are crucial in arriving at a 'best-fit' parse.</Paragraph>
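One way to read "a path which spans the greatest amount of the input" is as a selection of non-overlapping chart edges with maximal total coverage. The sketch below implements that reading with dynamic programming over token positions; the edge format (start, end, label) and the tie-breaking are assumptions, and the real PaTrans chart is certainly richer.

```python
Edge = tuple[int, int, str]  # (start, end, label), covering tokens start..end-1


def best_cover(edges: list[Edge], n: int) -> list[Edge]:
    """Pick non-overlapping chart edges that cover the most of n input tokens."""
    # best[i] holds (covered token count, chosen edges) for the first i tokens.
    best = [(0, [])] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option: leave token i-1 uncovered
        for start, end, label in edges:
            if end == i:
                covered = best[start][0] + (end - start)
                if covered > best[i][0]:
                    best[i] = (covered, best[start][1] + [(start, end, label)])
    return best[n][1]
```

The constituents on the returned path are the "already created constituents" that a fail-soft analysis would collect; tokens not covered by any chosen edge are simply left out of this sketch.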
      <Paragraph position="2"> Interestingly, there are some fundamental difficulties in combining advanced MT with fail-soft strategies. The most striking example of this is the fact that PaTrans aims at a very deep analysis of the source text, and at the same time the formalism allows for non-monotonic mappings between levels of representation. Due to the unexpected and to some extent unpredictable structure of fail-soft analyses, subsequent grammar rules may fail to apply, resulting in output representations where information e.g. about the degree of adjectives and other information stemming from function words has been lost. Current efforts consequently aim at preserving information at all levels.</Paragraph>
    </Section>
    <Section position="2" start_page="1116" end_page="1117" type="sub_section">
      <SectionTitle>
5.2 Tagging
</SectionTitle>
      <Paragraph position="0"> Before the text is submitted to the parser, the text is tagged, i.e. the tagger tries to determine the part-of-speech of the individual words based on local cooccurrence restrictions. There are two reasons why the tagger has been integrated into the system: * Since the overall translation system is unification-based, words are disambiguated by the application of all possible rules, which is highly inefficient.</Paragraph>
      <Paragraph position="1"> * If the sentence is fail-softed, one intermediate analysis is picked from the chart, which means that all words may not have been disambiguated properly by the grammar rules.</Paragraph>
      <Paragraph position="2"> If, however, the words have been disambiguated and impossible readings have been discarded prior to parsing, the 'best-fit' parse is considerably better than it would otherwise have been.</Paragraph>
      <Paragraph position="3"> The tagger is a public-domain, rule based tagger. It has been trained on a corpus of the Wall Street Journal and on patent texts within the subject field. In addition, it has been augmented with several 'local' contextual rules developed by the linguists working with PaTrans. The integration of the tagger has not only provided for more efficient processing but, more importantly, also for a higher quality of the translations of fail-softed sentences. Current efforts aim at improving the performance of the tagger.</Paragraph>
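The effect of tagging before parsing can be sketched as a filter over the lexical readings of each token. The reading format (part-of-speech, lemma) and the fallback that keeps all readings when none agrees with the tag are assumptions, not documented PaTrans behaviour.

```python
Reading = tuple[str, str]  # (part_of_speech, lemma)


def prune(candidates: list[list[Reading]], tags: list[str]) -> list[list[Reading]]:
    """Keep only the readings that agree with the tagger's choice per token."""
    pruned = []
    for readings, tag in zip(candidates, tags):
        kept = [r for r in readings if r[0] == tag]
        # Assumption: never leave a token without any reading.
        pruned.append(kept or readings)
    return pruned
```

Fewer readings per token mean fewer rule applications in a unification-based parser, and any analysis later picked from the chart by fail-soft starts from already disambiguated words.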
    </Section>
    <Section position="3" start_page="1117" end_page="1117" type="sub_section">
      <SectionTitle>
5.3 Preparsing
</SectionTitle>
      <Paragraph position="0"> The original EUROTRA-parser has been augmented with special rules which apply before the actual grammar rules (Music, 1993). The goal is to enable more efficient handling of long sentences that are otherwise unprocessable given moderate resources. With pre-rules, sentences are segmented via pattern-matching, before they are sent to the parser. In this way, the number of parse paths that the system has to consider is reduced considerably.</Paragraph>
      <Paragraph position="1"> To give greater power to the preparser, pre-rule application has been made cyclic. This means that the output from one rule application (or one application cycle) is used as input to a new cycle which starts at the beginning of the rule set. In principle then, any rule can feed (i.e. create the preconditions needed for application of) any other rule, while at the same time allowing prioritization of rules. The pre-rules not only add structure to the input, they are also used for lexical disambiguation based on collocatives and immediate context. Where the rule based tagger described above is able to determine the part-of-speech of individual words based on prior training and contextual rules, pre-rules can select individual readings of words within the same part-of-speech. Pre-rules have been developed for lexical disambiguation and for parsing of adverbial phrases, complex verb groups, coordinated that-clauses, indexed lists, valency-bound prepositional phrases and explicitly marked intervals (e.g. from ... to, between ... and). The effects of pre-rules are twofold: on the one hand they assign structure to the input at a shallow level, which nevertheless increases processing efficiency considerably; on the other hand they also improve fail-soft results since inappropriate readings of words in a given context are discarded at an early stage.</Paragraph>
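Cyclic rule application with prioritisation can be sketched as a loop that restarts at the top of the rule set whenever a rule fires. The two toy rules, the token notation (NUM:, INTERVAL:) and the cycle limit are hypothetical; only the control flow reflects the description above.

```python
from typing import Callable, Optional

Tokens = list[str]
Rule = Callable[[Tokens], Optional[Tokens]]


def mark_number(tokens: Tokens) -> Optional[Tokens]:
    """Rewrite the first bare number n as NUM:n."""
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            return tokens[:i] + ["NUM:" + tok] + tokens[i + 1:]
    return None


def mark_interval(tokens: Tokens) -> Optional[Tokens]:
    """Group 'from NUM:a to NUM:b' into a single INTERVAL token."""
    for i in range(len(tokens) - 3):
        a, b = tokens[i + 1], tokens[i + 3]
        if (tokens[i] == "from" and tokens[i + 2] == "to"
                and a.startswith("NUM:") and b.startswith("NUM:")):
            return tokens[:i] + ["INTERVAL:" + a[4:] + "-" + b[4:]] + tokens[i + 4:]
    return None


def apply_cyclic(tokens: Tokens, rules: list[Rule], max_cycles: int = 100) -> Tokens:
    """After any rule fires, start a new cycle at the highest-priority rule."""
    for _ in range(max_cycles):
        for rule in rules:
            result = rule(tokens)
            if result is not None:
                tokens = result
                break  # restart from the beginning of the rule set
        else:
            return tokens  # no rule applied in this cycle: done
    return tokens
```

Calling apply_cyclic on the tokens of "heated from 20 to 80 degrees" with the rules ordered [mark_interval, mark_number] shows feeding: mark_interval cannot apply until mark_number has fired twice, yet it is tried first in every cycle because it has higher priority.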
    </Section>
  </Section>
</Paper>