PaTrans - A Patent Translation System 
Bjarne Orsnes, Bradley Music & Bente Maegaard 
Center for Language ~lbchnology 
Njalsgade 80 
2300 Copenhagell S 
Denmark 
{ bj arne,music,bente } ¢~_cst. ku. dk 
Abstract 
This paper describes Pa~lh'ans - a fully 
automatic production MT system de- 
signed for producing raw translations 
of patent texts fl'om English into Dan- 
ish. First we describe the backbone of 
tile system: the EUROTRA research 
project, and prototype. Then we give 
an overview of the trauslat, ion process 
and the basic flmetionality of Pa'I~'ans, 
and finally we describe some recent ex- 
tensions for improving processing effi- 
ciency and the translation quality of un- 
exl)ected input encountered in real-lit~ 
texts. 
1 Introduction 
Pa\]~'ans 1 is a fully-automatic machine transla- 
tion system designed for English-Danish transla- 
tion of patent, texts. It is based on the linguistic 
specifications and to some extent on the software 
of the EUROTRA project of the European Com- 
munity (Copeland et al., 1991a; Copeland et al., 
1991b). Pa'IYans consists of a core grammar and 
translation module and a host of peripheral util- 
ities: terin databases, general databases, editors 
for pre- and postediting, document handling fa- 
cilities, facilities for creating and updating term 
databases. In this short presentation we will con- 
centrate on the grammar, lexicon and translation 
module and on some of the new features of Pa- 
~i~'ans. 
2 From EUROTRA to PaTrans 
EUROTRA was the European Community MT 
research programme. The Community started the 
programme in 1982, with the goal of creating an 
advanced systeln for automatic translation capa- 
ble of treating all the otficial working languages of 
the Community. When the programme finished in 
1992, it had delivered a huge amount of research 
1paTrans was developed for Lingtech A/S. 
results and an implemented prototype of a multi- 
lingual translation system. The PaTrans develop- 
meat relics on the prototype resources (Macgaard 
and Hansen, 1995), the system architecture and 
linguistic specifications, as well as on the experi- 
enced staff created by EUROTRA. 
2.1 The EUROTRA Prototype 
EUROTI{A was a transfer-based multilingual MT 
project. Because of the multilinguality, the proto- 
type was quite "clean" in terms of separate mod- 
ules for analysis, transtL'r and synthesis of the var- 
ious languages and language pairs. 
2.1.1 Sottware 
The software component consisted of the t;lans- 
lation kernel, used tbr analysis, transfer and gen- 
eration. The trmisb~tion kernel had mechanisms 
for treating grammar rules, dictionary informa- 
tion and mapping rules. 
2.1.2 Lingware 
For all languages, the project produced a 
large grammar and a general language dictionary. 
Though insufficient for the task at hand, the Pa: 
'lk'ans development eould buil<l on the English and 
Danish grammars and dictionaries, as well as on 
the transfer module from English into Daifish. 
2.2 Customizing EUROTRA 
Patent texts are characterised by the vocabulary 
they contain: terlns belonging t;o the fiehl tt'eated , 
e.g. chemistry, and patent document terms of a 
more legal nature. But; patent documents are also 
charaeterised by tile frequency of some linguistic 
phenomena and the absence of others, e.g. we had 
to develop ~ treatnmnt of lists and emmmration, 
and conversely we could simplify the treatment of 
modality considerably. The current maintenance 
and further development of the system continues 
this text type specific lille. The success of the sys- 
tem is mainly based on this fundamental lninciple 
of tailoring it; to a specific text type and sub jeer 
field. 
1115 
3 An overview of the Translation 
Process 
3.1 Document handling 
The document handling step has four main flmc- 
tions: 
• Format Preservation Input to docuinent 
handling is a text from a text processing sys- 
tem which has been marked up in SGML. Tile 
SGML codes denote e,.g. titles, paragraphs, 
text segments that should not be translated, 
etc. All information about doc, ument layout 
is stored separately and taken away from the 
translation process. 
• Formula Recognition The docmnent han- 
dler automatically recognises certain text 
typical untranslatable units, such as chemi- 
cal formulas and tables. 
• Term Reeognition Terms and multi-word 
units are also recognised at this stage, in this 
context, words are treated as terms if they are 
subject specific or if they have a unique trans- 
lation in the given text type. They are recog- 
nised during text handling and have their 
translation equivalent attaehed to them along 
with inorphosyntactic information for both 
source and target language. 
• Segmentation Finally tile text, is separated 
into units for translation i.e. sentences for 
which various recognition patterns haw ~. been 
set up. In some patent texts of specfic sub- 
ject tields, tile sentences are incredibly long. 
In these cases, there is no point in trying to 
arrive at a complete parse of the whole sen- 
tence, since the parse is most likely to fail 
and processing will be too space and time 
consuming. Therefore the docmnent handler 
attempts to arrive at a meaningflfl partition 
of the sentences by identifying sentence inter- 
nal boundaries and submitting the individual 
subparts for translation. 
3.1.1 Disambiguation 
Before the text is passed on to the parser, it 
is subjected to a thorough process of disambigua- 
tion. This is one of the new features of PaTrans 
compared to the EUR()TRA model and will be 
discussed in detail below. 
3.1.2 Source language analysis 
Since PaTrans is based on the transfer transla- 
tion model tile surface strings of the text are se- 
quentially transformed into an interinediate repre- 
sentation defined by several mapping principles. 
During source language analysis the sentences 
are assigned a surface syntactic structure. This 
surface syntactic structure is converted into a 
language-neutral transfer represent, ation ordering 
the constituents of the sentence in a canonical 
order with heads preceeding arguments and ar= 
guments preceding modifiers (Copeland et al., 
1991a). The, transfer representation is a re- 
flection of tile argument structure of the pred- 
icates where iuformation about surface syntac- 
tic realization appears as features on the indi- 
vidual nodes. Function words (coRjmwtions, de- 
terminers, prepositional case markers) are featur- 
ized and tense/aspect and negation represented in 
language-neutral features. 
The output of source language analysis is thus a 
tree with multilwered information including syn- 
tactic and morphosyntactic features, as well as 
the syntactic/semantic relationships between the 
predicators and the arguments, 
At, all levels, sets of preference rules based on 
heuristic principles select among competing analy- 
ses, e.g. for PP-attachment (Bennett and Paggio, 
1993). 
3.1.3 Transfer 
PaTrans adheres to simple transfer, i.e. the 
substitution of source language lexical units with 
target language lexical units by means of lexical 
transfer rules, 9 while the source language stru<> 
tural representation is mapped directly onto the 
target language transfer representation which is 
input to tile generation module. There are two 
main reasons why complex transfer (i.e. transfer 
where the strucl;ure of the input representation is 
altere(t) is kept at a minimum: 
• Complex transfer is costly inasmuch as the 
general applicability of the rules is usually 
very restricted. 
• A transfer rule applies to any object matching 
its left-hand side and performs the mapping 
defined on the right-hand side. Due to the 
'fail-soft'-mechanisin (discussed below), the 
structure of the objects which the transfer 
rules nmst apply to cannot he flflly predicted. 
In order for complex transfer to work in all 
cases, rules must be set up not only for cor- 
rectly parsed input structures, but also for 
tile special fail-soft structures. For this rea- 
son, complex transfer is costly and is only 
used for frequent phenomena considered cru- 
cial for good translation, e.g. converting cer- 
tain English ing-forins into l)anish relative 
clauses. 
3.1.4 Target syntactic generation 
During gelmration, the transti;r representa- 
tion is mat)ped onto a target syntactic structure 
through intermediate representational lewfls. At, 
the first level, the target language lexical units 
are looked up in the lexical database and mon(}- 
lingually relevant features are calculated on the 
2Recall theft this only applies to words of the gen- 
eral vocabulary which require disaint}iguation during 
analysis and not to terms 
1116 
basis of the language-neuLral representation, e.g. 
tense and asl)eet. 
At Lhe second level (Lhe relational level) sur- 
face syntactic flmcLions are (:alculaLed and cer- 
tain flmcLion words, sut:h as t)reposiLional mark- 
ers are inserted. Finally, the relational sLru(:ture 
is mapped onto the level defining tim constituenL 
sLructure of Lhe target language sentent:e. At; Lhis 
level all informaLion wiLh indetmndenL lexical ex- 
pressions is t)resent. 
3.1.5 Target morphological generation 
PaqA'ans has a highly develot)ed mori)hological 
module which l)rovi(les an almost eomt)leLe cover- 
age of Dmfish inflecLional morl)hoh)gy. The mod- 
ule is based on sLrueture, buihling rules whi(:b al- 
low for downwards ext)ansion. Regular inflection, 
syncope and gemination is accounLed for while 
only completely irregular word forms will have, to 
be coded in their entirety. PaTrans also has a 
limited strategy for LranslaLing (:ompounds com- 
posil, ionally. Generally, comI)ounds are co(led in 
the (terminoh)gical) dictionari('.s, 1)uL the t)arser 
tries to translate (:ompom~ds which are not code(t 
in the dictionarie.s by translating their individual 
subparts. 
3.1.6 Document generation 
Finally, the doemnent generation module in- 
serLs ~fll SGML-inarkers anti all iLems which have 
been inarke.d as mlLranslatable (tal)les, formulas, 
illlllflbe, rs el;(;.), and a separate conversion pro- 
gramme converts the output into WoldPerfecL for- 
HIaL. a 
4 The lexica 
l'a~iYans distinguishes two kinds of voealmlm'ies: 
the general vocabulary and Lhe Lerminologi(:al vo- 
cabulm'ies. 
• The general vocabulary is stored in a mono- 
lingual English dictionary, a monolingual 
l)anish dictionary separated into a. inLo syn- 
tactic and a morphological level, and a t)ilin- 
gual transfer dictionary. 
• The terminology is divided into sul).ject spe- 
cific databases. As PaTrans is used for a 
numl)er of ditferenL subject fields, the prioriLy 
of the databases is user-defined and flexible, 
The user specifies which term bases are to be 
used for a translation .job, and in wtfich or- 
der of prioriLy. When a term is fomld in one 
tel'in base, it; is not looked up fllrLher in the 
subsequenL databases. 
auntil now, all texts have been dcliv('.r('.d in Word- 
Perfect, lint the conversion programme, may of (;oursc 
l)e adat)tcd to odmr t;t.'xl; processing syst,ems, 
4.1 PaTerm Coding Tool 
For ease of mainLenance and updating, PaTrans 
has a special coding; tool. As mentioned above, 
Lhe l'aTrans term 1)ases conLain terms as well as 
words aim expressions which behave like terms, 
i.e. which have unique translations. New terms 
occur in each and every pate.nt documenL whict~ 
is submitted for trmlsladon. Consequently, it; is 
iml)ortant thaL Lhe use, r, who is noL necessarily a 
(;onll)Htal;ional linguisL, (;all elIcode L(;rtns ill a.n ef- 
ficient and precise way. The PaTerm coding tool 
provides a screen wiLh fiehls Lo fill in, and in most; 
cases an atlswer is proposed by t;he system, st) Lhat 
Lit(', user llas to make jllSt one accet)Lance ke, y- 
sta'olce. Care has been taken (;o t)resent Lhe mosL 
frequenL, and therefore ntosL t)robable, answer on 
tim Lop of the. list, Pa'l~erln asks Lhe. minimum 
number of quest, ions and COmlmtes the, remaining 
linguisLic information from the answers re.ceived. 
This also saves Lime tbr the user. 
5 Special Features 
5.1 Error Recovery 
Since the system runs in a praetical environment, 
it must, ne, ver fail to I)roduce, an olltput, even 
if iL encounLers an unanalysable sentence. Con- 
sequenLly, a f~dl-sofl: inechanism was inLroduce, d. 
Tim fail-soft; mt'.ehanism works at all levels of rep- 
resentation. If the parser fiJls to assign a well- 
forme(t sLr|le\[;urc Lo the input, a path is selected 
i\]om tim chart which spans the greatest: amount 
of dm inlmL ~ril(l already c.reated constituents are 
collecLed. Tim qualiLy of fail-selL output; varies 
considerably and recent work has attempLed Lo 
improve the results of fail-soft;. Disambiguadon 
of individual words, the selection of al)propriaLe 
readings and Lhe determinaLion of individual (xm- 
sLituents at a very early stage are (:rueial in arriv- 
ing aL a 'l)esL-tit' lmrse. 
Interestingly, Lhere are some flmdamental diili- 
eulties in combining advanced MT with fail-soft, 
straLegies. The most sLriking example of this is 
the fact; that PaTrans aims at a very deep anal- 
ysis of the source, text, and aL the same Lime t;he 
formalism alh)ws for non-lnonotoni(; mappings l)e- 
Lweell levels of represenLadon. Due Lo Lhe minx- 
petted mid 1;() some extent Ulq)re, dictat)le, strllctlne 
of tSil-sofl; analyses, snl)seqllent granlnlar rllles 
may fail to al)ply ,resulLing in ouLput represenl;a- 
Lions where inforination e.g. about Lhc degree of 
adjectives an(1 other inforlnatiol~ stemming fl'om 
flmction words has been lost, Current efforts (;on- 
sequently aim at preserving informaLion at all lev- 
els. 
5.2 'Fagging 
llefore Lhe Lext is submiLted to the parser, the 
Lext, is Lagged, i.e,. dm tagger t, rics to determine 
the t)arl;-of-st)e(w.h of the individual words based 
1117 
on local cooccurrence restrictions. There are two 
reasons why the tagger has been integrated into 
the system: 
• Since the overall translation system is 
unification-based, words are disambiguated 
by the application of all possible rules, which 
is highly inefficient. 
• If the sentence is fail-sorted, one intermedi- 
ate analysis is picked from the chart, which 
means that all words may not have been dis- 
ambiguated properly by the grammar rules. 
If, however, the words have been disam- 
biguated and impossible readings have been 
discarded prior to parsing the 'best-fit'-parse 
is considerably better than it would otherwise 
have been. 
The tagger is a public-domain, rule based tag- 
ger. It has been trained on a corpus of the Wall 
Street Journal and on patent texts within the sub- 
ject field. In addition, it has been augmented with 
several 'local' contextual rules developed by the 
linguists working with PaTrans. The integration 
of the tagger has not only provided for more ef- 
fecient processing but, more importantly, also for 
a higher quality of the translations of fail-softed 
sentences. Current efforts aim at improving the 
performance of the tagger. 
5.3 Preparsing 
The original EUROTRA-parser has been aug- 
mented with special rules which apply before the 
actual grammar rules (Music, 1993). The goal is 
to enable more efficient handling of long sentences 
that are otherwise unprocessable given moder- 
ate resources. With pre-rules, sentences are seg- 
mented via pattern-matching, before they are sent 
to the parser. In this way, the number of parse 
paths that the system has to consider is reduced 
considerably. 
To give greater power to the preparser, pre-rule 
application has been made cyclic. This means 
that the output from one rule application (or one 
application cycle) is used as input to a new cy- 
cle which starts at the beginning of the rule set. 
In principle then, any rule can feed (i.e. create 
the preconditions needed for application of) any 
other rule, while at the same time allowing pri- 
oritization of rules, The pre-rules not only add 
structure to the input, they are also used for lex- 
ical disambiguation based on collocatives and im- 
mediate context. Where the rule based tagger 
described above is able to determine the part-of- 
speech of individual words based on prior train- 
ing and contextual rules, pre-rules can select in- 
dividual readings of words within the same part- 
of-speech. Pre-rules have been developed for lex- 
teal disambiguation and for parsing of adverbial 
phrases, complex verb groups, coordinated that- 
clauses, indexed lists, valency-bound prepositional 
phrases and explicitly marked intervals (e.g. from 
•.. to, between.., and). The effects of pre-rules are 
twofold: On tile one hand they assign structure to 
tile input at a shallow level, which nevertheless in- 
creases processing efficiency considerably, on the 
other hand they also improve fail-soft results since 
inappropriate readings of words in a given context 
are discarded at an early stage. 
6 Performance 
PaTrans is in everyday use at the translation 
agency Lingtech where it is being used for all texts 
which are suited for it in its current version, i.e. 
chemical, biochemical, medical etc. patents, and 
gradually also a considerable amount of mechan- 
ical patents. PaTrans is making the translation 
process faster and more efficient, and it has proven 
to be a good business for Lingteeh, saving around 
50% of the raw translator cost. 
7 Conclusion 
PaTrans is a running production translation sys- 
tem producing cost-effective raw translations of 
patent texts. But PaTrans is also a project which 
combines academic research and practical appli- 
cations and which has shown that MT is viable in 
limited domains. Current work concentrates on 
improving the coordination of the rule-based part 
of the systeln and the fail-soft component. 
References 
Bennett, P. and Paggio, P., editors (1993). Prc- 
\]erenee in EuTvtra, volume 3 of Studies in Ma- 
chine Translation and Natural Language Pro- 
cessing. Commission of the European Commu- 
nities, Luxembourg. 
Copeland, C., Durand, J., Krauwer, S., and 
Maegaard, B., editors (1991a). The Eurotra 
Linguistic Specifications, volume 1 of Studies 
in Machine Translation and Natural Language 
Processing. Commission of the European Com- 
munities, Luxembourg. 
Copeland, C., Durand, J., Krauwer, S., and Mac- 
gaard, B., editors (1991b). The Eurotra For- 
mal Specifications, volume 2 of Studies in Ma- 
chine Translation and Natural Language Pro- 
tossing. Commission of the European Commu- 
nities, Luxembourg. 
Maegaard, B. and Hansen, V. (1995)• PaTrans - 
Machine Translation of Patent Texts. From Re- 
search to Practical Application. In Convention 
Digest: Second Language Engineering Conven- 
tion, London, pages 1--8. 
Music, B. (1993). Preparsing in the PaTrans MT 
System. In Bits ~d Bytes: Datalingvistisk Foren- 
ings ~rsmCde nr. 3, pages 82 90. Institut for 
Sprog og Kommunikation, Odense Universitet. 
1118 
