DII, EMMA- AN INSTANT LEXICOGItAPIlER 
IIANS KAItLGREN, JUSSI I(~AII.I,GREN, NIAGNUS NORDSTR()M, PAUL PETTI,;R, SSON, P, ENGT ~'VAIlItOI,I{N 
di?-emma~s±cs, se 
Swedish Institute of Computer Science 
Box 1263, 164 28 Kista, Stockholm, Sweden 
Introduction 
Dilemma is intended to enhance quality and increase 
productivity of expert human translators by prese,,ti,,g 
to the writer relewmt lexical information ,necha,dcally 
extracted from comparable existing translations, thus 
replacing - or compensating for the ahsence of- a lex- 
icographer and stand-by terminologist rather than the 
translator. Using statistics and crude surface analysis 
and a minimum of prior information, Dihnn,,,a identi- 
ties instances and suggests their counterparts in I)aralh!l 
source and target texts, on all levels down to individual 
words. 
Dilemma forms part of a tool kit for t,'anslation 
where focus is on text structure and over-all consis- 
tency in large text volumes rather than on framing sen- 
tences, on interaction between many actors in a large 
project rather than on retriewd of machine-stored data 
and on decision making rather than on application of 
give,) rules. 
In particular, the system has been tuned to the 
needs of the ongoing translation of European Con,n,u- 
nity legislation into the languages of candidate me,,,her 
countries. '/'he system has been demonstrated to and 
used by professional translators with promising results. 
Instant Lexicographer 
The design of translation aids beyond ordinary texl. pro- 
cessing and database accession and maintenance tools 
is mostly based on the same sinai)lifted view which .... 
for compelling reasons -- has been the worki,,g hyl>oth- 
esis of machine translation: that the source text. has a 
well-determined meaning and that there exists in the 
target language at least one correct and adequate ways 
of expressing that meaning. 
When these assumptions are reasonably well justi- 
fled, translation is relatively easy, fast and cheap with 
traditional methods and mechanization not rarely fea- 
sible with methods now known or envisaged. Typi- 
cally, however, the translator must do more tha.n re- 
trieve and operate on we-established a,,d in principle 
pre-storable correspondences. Thus, lexical correspon- 
dences do not exist for all items; it is an essential part 
of translation to establish them. Legal texts, factual 
and stereotype though they may see,n, re.gularly repre- 
sent thoughts, attitudes and arguments which do not 
haw'. any counterparts in the. target language prior to 
translation. This is particulary true in the huge project 
to translate the European Community legislation into 
the languages of countries which are not yet members 
of tile (\]omnmnity and which currently have a partly 
different legal conceptual framework. 
What human translators need is decision supl)ort. 
The most imlmrtant tools are telelfl,ones , electroni- 
cal ¢'onfi.'rencing systems and good and relevant dic- 
tiol,aries. Unfortunately, there are not always at ew~'ry 
point of time ltnowledgalAe and cooperative colleagues 
or othe.r experts to call, eh.'ctronical networks ~tre only 
recently being established in some domMns, and the 
intelligent and comprehensive dictionaries, which can 
serve as a writer's digest to the cumulative literature 
it, a fiehl are few and far between. Answers are ofl.en 
to be found in a text translated late at night the day 
beR)re - or in the preceding sections of the text at hand. 
R.ather than all autolnated writer, we need an instant 
h!xicographer. 
Recycling Translations 
In practh:e, existing translations are being used as a 
major source (Shgvall llein et at, 1990; Merkel 1993). 
Often in the hope to be able to avoid duplication of 
costs - or of getting paid twice for the same ellbrt - 
by findirlg identical or near-identical texts or passages, 
hut, more hnportantly, to ensure consistency or getting 
good suggestions, to follow or argue against. Synonymy 
wu'ial.iou for the salne concept is not al)lU'eclated in 
technical and legal prose and avoided as anxiously :m 
ilOlllOnyllly, The ideal is I: I corresllondeltces \])etweell 
expressions at least within one pair of documents and 
to eliminate "forks", i.e.., one expression being trans- 
lated hlto or beil,g the trm,slation of more than one 
counte.rpart in the other language (Karlgren, 1988). 
We slmll call a c.o,,pled pair of source and target text 
a bitc~'Z (Isabelle, 1992). Wlmt is said here abou't bitexts 
ca,, be generalized to n-tnples of parallel texts, claimed 
to dilfer "only" in hmguage. Such n-tuph.'s exist: in the 
l!htropean Comnn,nity, a major part of the legislatiou 
is available in 9 "authenth:" versions, which in (legal 
and political) theory are equivalent, and according to 
plans the number of "authentic" will soon rise to 12 
or more. l,ittle efforts have previously been made to 
systematically exploit this redunda,,cy by means of po- 
82 
tent multi-lingual procedures for retrieving faet.s or ex- 
l)ressions, even when surprisingly simple methods show 
l)romise of surprisingly useful results (l)ahlqvist, 1994). 
Steps in the '\]h-anslation Process 
l'roducing target language text is only a small l)ropor - 
lion of the translation l)rocess. Eml)irically, good econ- 
omy is achieved if about the same proportion of work 
is put into each of the stages Preparation, Text pro 
ductiou and Verification, a trichotomy reminiscent of 
tile classical person-time breakdown of software devel 
opment (Brooks, 1975). The Dilemma tool is usct'ul for 
some t~usks in each of the three stages. 
Funct;ionality 
A typical question for translators while actually wt'it.inu, 
is how a word or phrase has been used or translated in 
l)reviously processed texts. Conversely, they may ask 
for the source languages counl.erl)art:; of given target 
language expression, to lnal¢.e sure that homonyn,y is 
not introduced. Similarly, during the I)reparation and 
verification phases, a translator or editor scans through 
the text for words and phrases that need to be resolw>d 
or treated specially. 
Text Production Phase 
Navigating in Bitexts 
The first service is to enM)le the translator to I)rowse 
through the bitext and look at text elements pairwise, 
to cheek |br conventions of usage that are unfamiliar or 
unexpected. 
Pointing at a shorter or longer string ill eitl,cr lan- 
guage the user can film successively larger contexts and 
their counterparts or covnlerlezls in the other l;mguage 
version. This service is available to the user from within 
a word processor, tile allsWeF pres(.qlted ill a selmi'ate 
wiudow. 
COllllt i!l'IVor(|s 
The second service is to assess the word-lcwq COlllltol'- 
parts or "eounterwords" so far used for a given word. 
llere the system performs, crudely but instantly, the 
job of a terminologist or lexic.ographer. It uses a statis- 
tical matching process which offers the translator a list, 
of candidate counterl)arts. 
Verification phase 
Translation Vm'illeatlon 
In this l)hasc a revisor reads the text to detect inad- 
equacies and inconsistencies. Often, there is no (.rue 
answer to a terminological question: either one of a 
fe.w options may be equally good i)er se but unintelMed 
variation is disturbing and lnay be misleading. Verifi- 
cation, therefore, is not a matter of local eorrecl.ll(?ss or 
of compliance with a given dictionary or otll~r norm, 
and reading one passage at ;~ Lime ,,v}\[} not reve;d the 
dc(iehmcy of the translation (Karlgrenl 1988). 
One way of resolving or detecting dubious cases is 
to compare how a word or phrase has been used in a 
multitude of previous contexts aim how it was remlered 
in their respective countertcxts. 
Preparation phase 
Text-and Domain-specilh'. 1)hrase Lists 
in the preparation phase the translator or editor has 
to estal)lish text lind domai,l-spccilic word and phrase 
lists. In a batch mode, l)ile.mma produces draft lists on 
the basis of previously translated material ill the same 
domain. 
Structuring Bitexts 
l,'or bltexts to be exploitable as information sources, 
text constituents in t.lm two versions must I)e paired 
Oil SOtll(! hicrarchicM levels - l)hl'aSC., ClallSg~, Selll,(?llce, 
paragraph, etc. ~Ve lllllst creat.e a structured bitext, 
with links fro,n eacl~ constituent not only to its prede- 
Ce,qSOl' illld SllCCessor bill also tO its (;Otlllterl)itrt ill the 
other language version. This cross-latlglla,gc slA'llCtllre 
(:;Ul I)e rather easily captured when the translation is 
lining typed, but we ueed to be al)le to derive the pairs 
from two given coluplete texts. I)ilemma does so auto- 
maritally. 
We Inake three linguistic assumptions: 
1. 'l'he two texts c~m I)e segmented into hierarchical 
cons(.ituenl.s so Ilia| most constituents in one (.cxt 
hltve a COllllterl)art ill I, he other. 
2. For all levels, except the lowest level, co,mterparts 
occur in the sa.me mutual order. 
3. The counterp~rt.s on the lowest level, "counter-. 
words", appear apl)roxinmtely in the saIlle tl-ltlt, ll&l 
ord(w. 
We do not assume every (:onstil.u(ml. on any level to 
llavc it ('OUllt,~'rl)al't , ilor collstitlletlts 1,o I)e sel);wate(I 
by uniqlm delhlliters. Thus, i\[' I)aragral)hs are separated 
})y a blank line and sentences I)y a full stop folh)wed by 
a sl)aee , we do lie| exchlde that, say, ~ paragral)h in 
()lie lallgllage is SOilletilll(}S i'ell(ler{}(I as all enlll-flcration, 
separated by blank lines and that "i,\[')" is Ilow and |hell 
typed as "1. 5". The procedure is robust in that it, 
tolerates gaps all(l llOllC too \['rC(lllCllt deviations from 
the prevalellt lmttern. 
We al)l)ly two statistical procedures, ore.' of align- 
mmfl. for higher levels :rod one of assignllmnt for the 
lowest, "ph rasc", level. 
Alignment; 
The general i)robleln of order-i)reserving alignment on 
(me linguistic level reduces to the string correction l)rob - 
hnn (Wagner and l,'ischer, 1974). The l)ractical solutio,i 
is not trivial, however, duc to the extremely large sere'oh 
space ew'n for small texts. We use. :m algorithm with 
83 
search space constraining heuristics not entirely unlike 
the one published by Church and Gale (1990) but us- 
ing linguistic information on more levels. Using a min- 
imum of prior information, texts are aligned down to 
phrase level. Recognizing identity or similarity of a few 
pnnetuation marks, mlmerals and tile nmnber of words 
between these suffices for a crude alignment. 
Word Assignment 
When the two texts have been aligned on higher levels, 
correspondences are established between counterwor<ls, 
which do not necessarily appear in the same order in the;" 
two language versions. For this purpose Dilemma use.~ 
an association function which is a weighte<l sum of mea- 
sures of agreement of word position within the phrase, 
relative frequency of occnrrence, al,d, optionally, some 
other properties. The weighting of the parameters is 
set after text genre specific experimental.ion. />airs of 
terrns with a high association value are candidate coun- 
terwords (NordstriSm and Petterson, 19!t:1). 
The procedure is self-evaluating since uncertainty is 
reflected by a low maxinmm association wtlue. Only 
items which have a score above a cut-off threshold are 
presented to the user. The proeednre yields sonre 90 
per cent successfifl assignments among those presented 
on the basis of as little material as a single 10 page doc- 
ument, but for rare words the assignment becomes less 
certain. In a material of 10 000 pages of legal documents 
related to the European Economic Space as much as 50 
per cent of the word tokens were hapax legomena and 75 
per cent occurred less than 5 times, providing a meagre 
basis for statistical analysis. These. results can be im- 
proved if other properties are taken into account. When 
a word length was included as a parameter in the asso- 
ciation evaluation, the results became marginally more 
adequate. Syntactical tagging, vide infra, is expected 
to affect assignment more. 
Tagging 
In tim llrst release of l)ilemma, alignment and as- 
signment was perfornled on umnodified tyI)ogr;q)hie;d 
strings but naturally tim procedures were intended to 
be applied after monolingual preproressing. Trivially, 
results become l)ractically much more ade(luat(' and the 
statistical analysis more effective if, say, making and 
made and the infinitive make are subsumed under one 
item and the infinitive and the noun make are kel>t sei>a- 
rate. Without any change of method, the p,'ocedure can 
be applied to strings of words tagged morphologically 
and syntactically. The tools chosen for this l>urpose are 
the parsers for English, French and Swedish develol>ed 
at Ilelsinki University (Voutilainen el al, 1993). 
hnplementational Status 
Dilemma has been iml>lernented in C-I--I- aim runs ira-- 
de,' Microsoft Windows on a regular-size personal C(ml- 
puter. Dilemma is currently being ewduated and tested 
by translators currently involve(1 in the translation of 
large amounts of legal docnments info Scandinavian 
languages in tile context of the proposed accession to 
the l';nr(>pean Economic Commu,fity. 
References 
l,¥e<lerlek Phillips Brooks. 1975. "lTle mythical man- 
month essays on software engineering, Reading, Mas- 
sachusetts: Addison-Wesley. 
Bengt Dahlqulst. 1994. TS,'SA 2.0 A PC Program for 
Text Segmentation and Sorting, \])epartment of Lin= 
guistics, Uppsala University, Uppsala. 
Gale, W.A. and K.W. Church. ;1991. "ldentifyiag 
Word CorrespoiMences in Par;dlel Texts", in Proeecd- 
in\[is of the/dh £'peech and Natural Language l'Vorkshop, 
I)ARPA, Morgan Kaufinann. 
Brian Harris. 1!)92. "l~itext", in Proceedings of "'l}'ans- 
latiorl and the Em'opean Communities", lliskops:Arld~, 
.qtockhohn: \["AT ('\['he Swedish Assoclatiol, for Autho= 
rized 'l'r~mslators). 
Pierre Isabelle. 19!)2. "Bitexts: Aids for Transhttors', 
Screening Words: User hlteTJacc, for Text, 8th Anmtal 
Conferen<:e of the L'W centre of the New OED and Text 
Research, Walerloo, Canada: W;d.erloo University. 
Ihms Karlgren. 1987. "Making Qood Use of I>oor "Pra.ns- 
lotions", in \[NTEI|NATIONAI. }?OlllJl'vl ()N \[NFOItMATION 
AND I)OCUMENTATION, 12:4, Moscow: I"ID. 
Ilans I(arlL, ren. 1988. "Terni-Tuning, a lVletlmd I'(ir the 
Compl~tel'-Aidcd Revision of Multi-l,ingual Texts", in 
INTEnNA'\['IONAL I;'OI~UM ON INFORMATION ANO I)OC- 
UMICN'I'A'I'ION, 13:~, Moscow: l"II). 
Ilans Km'lgren. 1,<t81. "(-,'()Jill)tiLer Aids in ¢\]'r~tnsl;t- 
tlon" with the llanzhlelle l)ccl~tration, in Sigurd and 
Svartvik (eds.), AILA F'roceedings, pp 86-101, Lund: 
AILA. 
Martin l(ay. |980. "The Proper Place of Men and Mr= 
chines in I,angllage Translation", Xerox report (/$1,- 
80-11, PMo Alto: Xerox l>alo Alto llesear<:h Center. 
Magnus Merkel 1993. "When and Wily Should Transla- 
tions I>e Reused?", Papers fl'om the 18lh UAA KI .gym. 
posium on I,.<;P, Thcorg of 7}'anshltion and ()Omlmlers, 
VSyri. 
Magnus Nordstrihu and Paul Petterson. 1993. "A 
Tool I'or Rapid ManuM Translation", Master5 Thesis 
al the University of Uplmala , Upl)sala:lJnivcrsity of lJp- 
psala. 
Amu~ S,igvall Iteln , Amw.tte (;}stling~ 
Eva Wlklmhn. 1990. "Phrases i,i the Core Voeabn- 
lary". A Report fi'OUl 7'he Project Multilingual Support 
for "l}'anslation and Writing. Report no. U(~I)I,-l/- 
90-1. Center for C<>nllmtational I,inguistics. Uppsaht 
U n iw~rsity. 
Atro Voutihfinen an(l Pasi Tapanainen. 1993. "Am- 
biguity l\[esohl/.ion in a I(cductionistie Parser", Pro= 
ccc.dings of the 6th Confl?rencc of the l';uropcar~ Chapter 
of the AC'L, pp. 39,1-,103, Utrecht:ACL. 
Rol)ert A. Wagner and M. J. Fischer. 1974. "The 
.qtring-to-Strlng Corre<:tion Prol)lem", JOIJRNAL ()F 
'l'm.: ACM, 21:I, pp 168-173, New York:ACM. 
84 
