AUTOMATIC TRANSLATION OF NOUN COMPOUNDS
ULRIKE RACKOW
IBM Scientific Center
Institute for Knowledge
Based Systems
Heidelberg, Germany
rackow@dhdibm1.bitnet
IDO DAGAN
Computer Science Department
Technion, Haifa, Israel and
IBM Advanced Solution Center
Haifa, Israel
dagan@cs.technion.ac.il
ULRIKE SCHWALL
IBM Scientific Center
Institute for Knowledge
Based Systems
Heidelberg, Germany
schwall@dhdibm1.bitnet
Abstract 
This paper describes the treatment of nominal compounds in a transfer based machine translation system; it presents a new approach for resolving ambiguities in compound segmentation and constituent structure selection, using a combination of linguistic rules and statistical data. An introduction to the general as well as to the German-English-specific problems of compound translation is given (sect. 1). In section 2, the analysis phase is described with its linguistic as well as its computational aspects. Section 3 deals with the transfer and generation process, focussing on corpus based techniques.
1 Introduction
It is widely known that the word formation mechanism of compounding is highly productive, in German as well as in English, and that efficient strategies have to be developed to deal with this linguistic phenomenon in any kind of NLP system. Although this fact is generally agreed upon and a lot of linguistic research has been done, it has not been possible so far to develop a general and overall procedure to solve the problem in a satisfactory and adequate way (cf. [Ananiadou/McNaught 1990]).
Two special aspects of the problem of compounding phenomena arise, among others, within the framework of machine translation (MT), here the translation from German into English. The first problem that has to be dealt with in this case is the correct segmentation of the German compound word. The constituents having been found, the next step we have to deal with consists in translating them correctly. Correctness refers here a) to the choice of the appropriate target lexemes and b) to the selection of the right target construction type.
Of course, there are a lot of other problems to be resolved for the treatment of compounds in MT, e.g. semantic interpretation of the relation between the constituents, the question in how far this point is relevant for translation, depth of analysis, etc. In this paper, however, we will mainly concentrate on the two problems mentioned above.
An important property of our approach for segmentation (cf. 2) is optimizing the process by using the type of the juncture between the compound constituents to formulate restrictions on their possible position (front, middle and/or end) in the compound word. Another novel characteristic of our approach is that there is no need of finding out the correct constituent structure during analysis phase. This problem is transferred to the process of selecting the correct target compound construction (cf. 3.3). The solutions we propose are based on an investigation of examples which were extracted, in part randomly, from real text corpora.¹ Contrary to the approach of example based machine translation (e.g. cf. [Sumita 1991]), we don't use a bilingual corpus, but a monolingual target corpus which is much easier to obtain in a very large size. The last feature of our approach we would like to point out here is its multilinguality: on the one hand, the results of German compound analysis can serve as input for all target languages; and, on the other hand, the features of the English construction types associated with the target entries for English nouns can also be used for source languages other than German, and what is important, for NLP-applications other than MT.
The several components of our model are currently being tested separately, and an integration is planned. Preliminary results indicate that the corpus based techniques achieve high accuracy, but they are not fully analyzed yet. We plan to report complete results in a future paper.
2 Automatic Analysis of German 
Compounds 
2.1 Preliminary Remarks
Our work focuses on nominal compounds; in our first approach, we narrowed this type even more to noun-noun(-noun...) compounds, these constructions being again the most frequent type of nominal compounds in both languages (cf. [Rackow 1992]). This restriction to nouns gives us the possibility of using the part of speech in the segmentation algorithm to reduce the number of possible segmentation results; in any case, certain personal or possessive pronouns, conjunctions etc. can be excluded explicitly for they never occur in productive composition types. This way, we can avoid wrong decompositions, such as *Uns-Innigkeits-Vorwurf ('us intimacy reproach') instead of Unsinnigkeits-vorwurf ('nonsense reproach').
Only those compounds which are not lexicalized are treated, i.e. the segmentation and translation algorithm is only called upon if an input word has not been found in the system's lexicon.
¹ The German examples are partly taken from the SPRING Corpus which was kindly put at our disposal by the Speech Recognition Group of the German IBM Science Center Heidelberg. The English data were extracted from the corpora maintained by the speech group of IBM Watson Research Center, Yorktown Heights.
ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 1249 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
With German as the source language, the first problem in the treatment of compound words arises from the fact that German compounds are written in one word and that in many cases, the form of the words in a compound differs from the base form in that either a so called Fugenelement (connecting element or juncture morpheme) is added to the modifying word or that one or more letters are taken away from the ending of these words. In order to allow for a correct segmentation of the compounds, a code has to be added next to the morphological declension code of the entries in the analysis part of the lexicon pointing to the corresponding morpheme types.
2.2 Code for the Connecting Element 
The importance of the correct encoding of the connecting element is shown in the following example. Suppose a word like Arbeitsamt 'job center' would not have an entry in the lexicon and Arbeit would not be encoded with the connecting morpheme 's'. The system would then decompose the unknown word into Arbeit ('job, work') which is still correct, and Samt ('velvet'), which is obviously not the expected second constituent (which has to be Amt ('office, department, center')) because the 's' is not interpreted as a morpheme but as the first letter of the second constituent.² For several reasons, the correct encoding of the connecting morphemes (Fugen-code) is not as trivial as it might appear. First, there are various types of these elements: zero morpheme: Umwelt → Umwelt-bewegung; addition of a form of the inflectional paradigm of the word, e.g. the plural ending: Diskette → Diskette+n-laufwerk; addition of a letter which is not in the inflectional paradigm: Installation → Installation+s-programm; deletion of the ending: Schule → Schul-hof; deletion of the ending and addition of another letter: Weihnachten → Weihnacht+s-konzert.
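The five types of connecting element listed above can be illustrated with a minimal Python sketch. It is not the Fugen-code used in the LMT lexicon, whose actual format the paper does not give; the `Juncture` record, the `FUGEN` table and `modifier_form` are hypothetical names chosen for illustration, and each rule simply states how many final letters of the base form are dropped and which letters are added.

```python
from typing import NamedTuple


class Juncture(NamedTuple):
    """Hypothetical juncture rule: delete letters, then add letters."""
    delete: int  # number of letters removed from the end of the base form
    add: str     # letters appended after the deletion


# Toy lexicon fragment; the entries mirror the examples in the text.
FUGEN = {
    "Umwelt": Juncture(0, ""),         # zero morpheme
    "Diskette": Juncture(0, "n"),      # plural ending (in the paradigm)
    "Installation": Juncture(0, "s"),  # letter not in the paradigm
    "Schule": Juncture(1, ""),         # deletion of the ending
    "Weihnachten": Juncture(2, "s"),   # deletion plus another letter
}


def modifier_form(base: str) -> str:
    """Return the form a noun takes as a modifying constituent."""
    rule = FUGEN[base]
    stem = base[: len(base) - rule.delete]
    return stem + rule.add
```

Under these assumptions, `modifier_form("Schule") + "hof"` gives the surface string of Schul-hof, and `modifier_form("Weihnachten")` gives the modifier part of Weihnacht+s-konzert. A word that takes more than one connecting element would need several rules per entry, which this sketch leaves out.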
There are quite a lot of words, however, which can take more than one type of connecting morpheme. In some cases, it is only a question of usage, depending on the head noun, in which form the word appears; in other cases, the type of juncture morpheme has significance in meaning distinction. The noun Geschichte ('story/history') is an example for such a case (cf. [Fleischer 1982]):
Geschicht+s-buch 'history book'
Geschichte+n-buch 'story book'
This fact which can help disambiguation has to be represented in the lexicon as a transfer constraint for compound nouns. The type of juncture element is not predictable from other formal aspects of the noun, e.g. from gender, declension code, etc. There are certain regularities, but they are not consistent enough to allow for an automatic encoding. It is just as little possible to derive the connecting elements completely from existing machine readable dictionaries (MRD); as a prerequisite, all words would have to appear in an MRD in all their possible forms as modifying elements of compound words.
² More examples can be found in ([Luckhardt/Zimmermann 1991], 116f).
The codes which we assigned to the connecting elements relate only to the form of the morpheme. As far as the implementation is concerned, the formal identity of some connecting elements and inflectional morphemes on the one hand is used to simplify the segmentation algorithm, and, on the other hand, the difference between connecting elements which are in the inflectional paradigm and those which are not is used to make predictions on the possible position of a constituent in a compound word.
2.3 Possible Positions of Compound Constituents
It is possible to draw certain conclusions from the type of connecting element on the possible position of a constituent in a compound word. Depending on whether the juncture morpheme has the same form as a form of the inflectional paradigm of the word or not, or whether the ending of the base form of the word is deleted or not, the word with its juncture can be positioned as a modifying constituent in the beginning or in the middle of the compound, or as the modified constituent (the head) at the end, or in any combination of the mentioned positions. The following examples will make the idea clearer.
□ Words with zero juncture can be at any position in a compound word:
Import-beschränkung ('import restriction')
Fisch-import ('fish import')
Fisch-import-beschränkung
□ Words of which the connecting element is in the inflectional paradigm can also be at any position in a compound word:
Parlament+s-debatte ('parliamentary debate')
(der Sitz des) Bundes-parlament+s ('(the seat of the) federal parliament')
□ Words of which the ending is deleted can only be in front or middle position:
Schul-hof ('school yard')
*Musik-schul, but -schule ('music school')
□ Words of which the connecting element is not in the inflectional paradigm can only be in front or middle position:
Information+s-material ('inform. material')
*Studenten-information+s, but -information ('information for students')
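The four cases above amount to a small lookup from the kind of juncture to the set of admissible positions. The following Python sketch is one way to state that check; the `Kind` enumeration and the function name are assumptions made for illustration and do not reproduce the COMPGE implementation.

```python
from enum import Enum
from typing import List


class Kind(Enum):
    """Hypothetical labels for the four cases listed above."""
    ZERO = "zero juncture"
    IN_PARADIGM = "connecting element in the inflectional paradigm"
    DELETED = "ending deleted"
    NOT_IN_PARADIGM = "connecting element not in the inflectional paradigm"


ANY = frozenset({"front", "middle", "end"})
NON_FINAL = frozenset({"front", "middle"})

POSITIONS = {
    Kind.ZERO: ANY,
    Kind.IN_PARADIGM: ANY,
    Kind.DELETED: NON_FINAL,
    Kind.NOT_IN_PARADIGM: NON_FINAL,
}


def segmentation_is_plausible(kinds: List[Kind]) -> bool:
    """Check every constituent form against its position in the compound."""
    last = len(kinds) - 1
    for index, kind in enumerate(kinds):
        if index == last:
            position = "end"
        elif index == 0:
            position = "front"
        else:
            position = "middle"
        if position not in POSITIONS[kind]:
            return False
    return True
```

With this table, a candidate such as *Musik-schul is rejected because a form with a deleted ending appears in final position, while Schul-hof passes. Rejecting such candidates early is what lets the juncture type reduce the number of segmentation results.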
2.4 The Segmentation Procedure of 
COMPGE in LMT-GE 
The general framework for our research work and implementation is the machine translation system LMT developed by Michael McCord.³ LMT is a lexicalistic, source based transfer system. In this section, we concentrate on the performance of the PROLOG algorithm 'Compound Interpretation COMPGE' as a hook up component to LMT-GE (German-English).
The segmentation and translation algorithm COMPGE is only called upon if an input word (with more than five letters) has not been found in the system's lexicon or in the on-line accessible MRD Collins German-English⁴, i.e. when lookup and the regular morphological analysis fail. The segmentation is then carried out from left to right, beginning after the third letter. The decomposition process continues until the first word is found in the lexicon; the dictionary entry contains, among other data, information about the connecting element (Fugen-code). The algorithm then takes the complete dictionary entry with source and target word and all information contained in it, stores the word and continues by looking up the rest as a whole. If an entry is found, it is stored as well, together with the relevant morphological, syntactic, and semantic information. If there is, on the other hand, no entry for the remainder as a whole, the segmentation is carried on letter for letter, the same way as for the first constituent until an analysis for an existing entry is derived.
³ LMT and related projects are described in detail in ([McCord 1989]; [Rimon et al. 1991]; [Schwall 1991]).
⁴ For further information, cf. ([Neff/McCord 1990]).
When all constituents are found, the words are stored, and segmentation is started again in order to allow, in ambiguous cases, for more than one possible segmentation. Let us look at the word Messerattentat. The result of the first decomposition would be Messe-ratten-tat ('mass-rat-action'), in accordance with the Fugen-codes of the segments; the second result would be Messer-attentat ('knife-attack'), also in accordance with the Fugen-codes. The system which then has to choose between the two possibilities would take the second result following the general strategy that compounds with two nominal constituents are much more frequent than those with three elements, those with three more frequent than those with four, etc. (cf. [Jeziorski 1982], [Müller 1977]). When segmentation is finished, the algorithm begins with the semantic interpretation of the compound before starting transfer.
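The left-to-right decomposition and the preference for fewer constituents can be sketched in a few lines of Python. This is a simplification under stated assumptions, not the PROLOG algorithm COMPGE: the toy `LEXICON` holds surface forms with their juncture already attached, the Fugen-codes and position checks are left out, and `MIN_LEN`, `segmentations` and `preferred` are hypothetical names.

```python
from typing import List, Optional

# Toy lexicon of surface forms; a stand-in for the system's lexicon.
LEXICON = {"messe", "messer", "ratten", "tat", "attentat"}
MIN_LEN = 3  # splitting begins after the third letter


def segmentations(word: str) -> List[List[str]]:
    """Enumerate every split of `word` into lexicon entries, left to right."""
    word = word.lower()
    results = []
    for cut in range(MIN_LEN, len(word) + 1):
        head, rest = word[:cut], word[cut:]
        if head not in LEXICON:
            continue
        if not rest:
            results.append([head])
        else:
            # Analyse the remainder the same way as the first constituent.
            results.extend([head] + tail for tail in segmentations(rest))
    return results


def preferred(word: str) -> Optional[List[str]]:
    """Prefer the analysis with the fewest constituents, if any exists."""
    candidates = segmentations(word)
    return min(candidates, key=len) if candidates else None
```

For Messerattentat the sketch finds both analyses discussed above, Messe-ratten-tat first and Messer-attentat second, and `preferred` picks the two-constituent reading.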
2.5 Syntactic and Semantic Implications 
Since, in non-lexicalized compounds, the compound is generally a member of the syntactic and semantic class to which its head word belongs, this information can be passed on to the whole compound. As mentioned earlier, the entry for each constituent of the compound is extracted from the lexicon. Then the relevant morphological, syntactic and semantic information of the last constituent, the head noun, is attributed to the compound word as a whole.
The following example Umweltbewegung illustrates the procedure: Whereas Umwelt has the semantic type physical⁵, Bewegung gets the type abstract. Consequently, the compound word is attributed the semantic type abstract, too. This passing on of semantic information⁶ can be used, for instance, for target lexeme selection using semantic constraints or for anaphora resolution.
⁵ On the semantic type hierarchy used in LMT-GE, cf. [Breidt 1991].
⁶ Since we intend to treat only non-lexicalized compounds this way, a false semantic analysis, as it might occur in trying to translate the word Frauenzimmer (not 'women's room', but rather an archaic/derogatory term for 'woman') this way, is not very probable, given the fact that these kinds of words can be found in the LMT lexicon or in on-line accessible dictionaries.
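The passing on of head information described in this section can be pictured with a short Python sketch. The `Entry` record and its fields are assumptions made for illustration (the paper does not show the layout of LMT lexicon entries); the semantic types for Umwelt and Bewegung are the ones given in the text.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Entry:
    """Hypothetical lexicon entry with a few head-relevant features."""
    lemma: str
    pos: str
    gender: str
    sem_type: str


LEXICON = {
    "Umwelt": Entry("Umwelt", "noun", "f", "physical"),
    "Bewegung": Entry("Bewegung", "noun", "f", "abstract"),
}


def compound_entry(constituents: List[str]) -> Entry:
    """Build an entry for the compound from its last constituent, the head."""
    head = LEXICON[constituents[-1]]
    # Junctures are ignored here; the example has a zero juncture.
    lemma = constituents[0] + "".join(c.lower() for c in constituents[1:])
    return replace(head, lemma=lemma)
```

Calling `compound_entry(["Umwelt", "Bewegung"])` yields an entry for Umweltbewegung that carries the semantic type abstract of its head, regardless of the type of the modifier.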
3 Transfer and Generation 
Transfer in LMT is divided into two parts: the compositional transfer which is part of the shell, and the language pair dependent restructuring transfer. The translation of compound words is done during compositional transfer.
In order to translate German compounds correctly into English, contrastive research studies had to be carried out on compounding phenomena. We first set up a typology of German and English morphological correspondences of compound constructions. Analysis was first done on the basis of 17,400 nominal compounds extracted from the MRD Collins German-English. In a second phase, in order to compensate for the fact that there are also lexicalized, non-productive compound types in the dictionary, monolingual corpus based research was carried out (cf. 3.3).
3.1 Feature Transfer 
Morphological and syntactic information on the source head word is passed on to the corresponding target word. If there is a specific feature of the target word coded in the transfer part of the lexicon which contradicts a source feature, the last one is overwritten by the target feature. If for instance the target word only occurs in the singular, but the source head word of the compound has the feature plural, the target word feature is preferred over the source word feature, and the compound will appear in the singular, e.g. the plural word Industrieinformationen becomes a singular in English, industry information, because of the transfer lexicon part < t (information)/sg.
Other information that goes with the target head word entry such as information on subcategorization is passed on to the target compound construction as well.⁷
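The override rule for features can be reduced to a dictionary merge in which the target side wins. The Python sketch below is only an illustration of that precedence; the feature names and the dictionary representation are assumptions, not the LMT transfer lexicon format.

```python
from typing import Dict


def transfer_features(source_head: Dict[str, str],
                      target_entry: Dict[str, str]) -> Dict[str, str]:
    """Source head features are the default; target lexicon features win."""
    return {**source_head, **target_entry}


# Industrieinformationen: the source head is plural, but the transfer
# lexicon marks the target word 'information' as singular only.
source_head = {"pos": "noun", "number": "pl"}
target_entry = {"number": "sg"}
result = transfer_features(source_head, target_entry)
```

Here `result` keeps the part of speech of the source head and takes the number from the target entry, so the compound is generated in the singular.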
3.2 Analysis of the Compounds of a
Bilingual Dictionary
The aim of our contrastive study was to find out correspondences between morphological types of German and English compound nouns. Therefore, a classification was set up where six types of German nominal compounds were contrasted with twelve types of English nominal compound constructions. These types contained information on the POS of the compound constituents, i.e. on the internal structure of the compounds in both languages.
After encoding 17,400 German compounds with their English correspondences according to these types, an evaluation was made which led to the following results: The noun-noun construction is the most frequent type in German as well as in English. What is even more important for the translation strategy is the fact that 54.4% of the German noun-noun compounds are translated into the same English construction type, i.e. into noun-noun compounds as well. They are followed by the adjective-noun-type (17.2%) and the noun-of-noun-type (14.3%). Considering all German nominal compounds and not only noun-noun-compounds, 44.4% of them were translated into the English noun-noun-type.⁸
⁷ In certain cases a slot of the frame is filled by the modifier of the head noun of a compound. Nevertheless, this is not always the case; therefore, we prefer passing on the subcategorization frame (cf. [Fanselow 1981]).
These are the data which formed the basis for our first translation strategy, namely to translate German nominal compounds per default into English noun-noun constructions. Since about 50% would then not be translated correctly, i.e. not according to language usage, this first approach has been augmented by corpus based techniques which are currently at an experimental level.
3.3 Corpus Based Techniques
3.3.1 Selecting the Target Construction
Recognizing that selecting the preferred target construction for a certain compound is in part an arbitrary decision of each language, it seems suitable to look for the information in a target language corpus. The idea is that when the English compound we should generate does not appear in the system's lexicon we will try to match it against the corpus and select a preferred construction according to the information found.⁹ It should be noted at this point that in many cases there are several legitimate constructions that may be selected. However, the system cannot always distinguish these cases from cases where there is only one legitimate choice in the specific context. Therefore, it is always necessary to make a selection, and our strategy is to prefer the construction that seems most probable for being a legitimate choice. This strategy has also a stylistic advantage, as it prefers the more commonly used constructions.
The most simple and accurate method to start with is to search the corpus for explicit examples of the complete compound and prefer that construction which is most frequent. For instance, the German compound 'Oppositionsgruppe' may in principle be translated (according to the findings described in the previous section) to either 'opposition group', 'group of opposition', 'oppositional group' or 'opposition's group'. Consulting a corpus of 40 million words of The Washington Post articles enables us to prefer the first ('noun-noun') option as it occurs 89 times in the corpus, while the second option occurs only 3 times and the other options do not occur at all. On the other hand, in translating the compound 'Parlamentsdebatte' the statistics prefer the construction 'parliamentary debate' (23 occurrences), where the modifier appears in its adjectival form. In this case, the 'noun-noun' form, 'parliament debate', does not occur in the corpus, and the form 'debate in parliament' occurs 3 times.
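Selection by frequency of the complete compound can be written as a simple maximum over corpus counts. The Python sketch below uses the counts reported above for The Washington Post corpus; the function name and the dictionary layout are assumptions for illustration, and counting over morphological inflections is taken to have happened already.

```python
from typing import Dict, Optional


def prefer_construction(counts: Dict[str, int]) -> Optional[str]:
    """Return the candidate with the highest corpus count, or None if
    no candidate was observed at all."""
    if not counts:
        return None
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


# Counts reported in the text.
oppositionsgruppe = {
    "opposition group": 89,
    "group of opposition": 3,
    "oppositional group": 0,
    "opposition's group": 0,
}
parlamentsdebatte = {
    "parliamentary debate": 23,
    "parliament debate": 0,
    "debate in parliament": 3,
}
```

With these counts the sketch prefers 'opposition group' and 'parliamentary debate'. It returns None when nothing was observed; a realistic version would also need a significance threshold, which the paper mentions but does not quantify.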
In the cases mentioned above, the corpus provides enough examples of the exact compound we are looking for. The only generalization that was used is to take into account the morphological inflections of the words (e.g. counting also occurrences of 'parliamentary debates', with the plural form of 'debate'). However, many compounds are too rare and do not occur a significant number of times in the corpus. In these cases it is necessary to use various generalizations over the constituents of the compound in order to obtain some relevant information. A suitable solution is to generalize over the part of speech of some of the words of the compound. For example, the compound 'Umweltbewegung' may be translated (among other options) to 'ecology movement' or 'ecological movement'. This compound occurs only once in The Washington Post corpus, in the form 'ecological movement', but this is not significant enough to make a selection. In order to obtain more information we look for compounds in which either 'ecology' or 'ecological' serves as a prenominal modifier, with no restriction on the specific word which serves as the head noun. This information was searched for in the first 100,000 sentences of the Hansard corpus of the proceedings of the Canadian parliament, which was tagged with part of speech using a stochastic tagger [Merialdo 1991]. In these sentences, the form 'ecological (noun)' was observed 11 times while the form 'ecology (noun)' only once. Using these statistics we regard the adjectival form 'ecological' as the default form whenever the two alternatives are encountered and there are not enough examples of the complete compound. For instance, this default will be used also when translating 'Umweltprobleme' to 'ecological problems' or 'Umweltreserven' to 'ecological reserve' (and not inappropriately to 'ecology problems/reserve'). The use of such defaults enables us to increase the coverage of the statistical method and treat infrequent compounds of the target language.
⁸ The contrastive studies and their results are described in detail in [Rackow 1992].
⁹ This approach is applicable for any natural language generation task, hence the relevance of this section is not restricted to the application of machine translation.
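The generalization over the head noun can be sketched as a count of how often each alternative modifier form directly precedes a token tagged as a noun. The Python code below is a minimal illustration on an invented toy sample; the tag names, the data layout and both function names are assumptions, and the sample sentences are not taken from the Hansard corpus.

```python
from typing import Dict, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # one sentence as (word, tag) pairs


def modifier_counts(sentences: List[Tagged],
                    alternatives: List[str]) -> Dict[str, int]:
    """Count each alternative where it directly precedes a noun."""
    counts = dict.fromkeys(alternatives, 0)
    for sentence in sentences:
        for (word, _tag), (_next, next_tag) in zip(sentence, sentence[1:]):
            if word.lower() in counts and next_tag == "NOUN":
                counts[word.lower()] += 1
    return counts


def default_form(counts: Dict[str, int]) -> Optional[str]:
    """Pick the most frequent prenominal form, or None without evidence."""
    best = max(counts, key=counts.get, default=None)
    return best if best is not None and counts[best] > 0 else None


# Invented toy sample with assumed tag names.
sample = [
    [("ecological", "ADJ"), ("problems", "NOUN")],
    [("the", "DET"), ("ecology", "NOUN"), ("is", "VERB")],
    [("ecological", "ADJ"), ("damage", "NOUN")],
    [("ecology", "NOUN"), ("groups", "NOUN")],
]
toy_counts = modifier_counts(sample, ["ecology", "ecological"])
```

On the toy sample the adjectival form wins, and the same holds for the counts reported in the text (11 against 1), so 'ecological' would become the default modifier form for translations of Umwelt-compounds that are too rare to be decided directly.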
Another important purpose for using default constructions for single words is to save storage space. Without defaults, we would have to store in our statistical data base the most frequent construction for every specific compound which occurs in the training corpus a significant number of times. This might require too much space when training the system on the very large corpora which are necessary to get high coverage and precision of the method. On the other hand, if we store the default constructions for single words, then we should store specific compounds, i.e. combinations of words, only when the preferred construction for these combinations conflicts with the defaults for single words.
This leads to the following implementation scheme: During the training phase, the (tagged) corpus will be processed twice. In the first pass, the default constructions for single words will be collected. In the second pass, all the specific compounds will be identified, but only those which conflict with the default constructions will be stored in an exception list. When translating a new German compound (during the actual translation phase), the exception list will first be consulted to check whether one of its items matches one of the possible alternatives for translation. Only if there is no relevant item, the defaults for the single constituents will be used.
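The two-pass scheme can be sketched as follows in Python. This is an illustration under simplifying assumptions, not the planned implementation: observations are given as (modifier, head, construction) triples already extracted from the tagged corpus, defaults are kept per modifier only, no significance threshold is applied, and the sample data are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Observation = Tuple[str, str, str]  # (modifier, head, construction)


def train(observations: List[Observation]):
    """Two passes: word defaults first, then conflicting compounds only."""
    # Pass 1: default construction for each single (modifier) word.
    per_word: Dict[str, Counter] = defaultdict(Counter)
    for modifier, _head, construction in observations:
        per_word[modifier][construction] += 1
    defaults = {w: c.most_common(1)[0][0] for w, c in per_word.items()}

    # Pass 2: keep a specific compound only if it conflicts with the default.
    per_compound: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for modifier, head, construction in observations:
        per_compound[(modifier, head)][construction] += 1
    exceptions = {}
    for (modifier, head), counter in per_compound.items():
        best = counter.most_common(1)[0][0]
        if best != defaults[modifier]:
            exceptions[(modifier, head)] = best
    return defaults, exceptions


def choose(modifier: str, head: str, defaults, exceptions) -> Optional[str]:
    """Consult the exception list first, then fall back to the default."""
    return exceptions.get((modifier, head), defaults.get(modifier))


# Invented observations for illustration only.
data = (
    [("ecology", "problem", "adj-noun")] * 3
    + [("ecology", "damage", "adj-noun")] * 2
    + [("ecology", "group", "noun-noun")] * 2
)
defaults, exceptions = train(data)
```

In this toy run only one compound ends up in the exception list, because the other two agree with the default for their modifier. This is the storage saving the scheme aims at: the exception list grows with the number of conflicts, not with the number of distinct compounds.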
3.3.2 Selecting the Target Lexemes
We relate to the problem of selecting the appropriate target words for the constituents of the compound as a special case of the problem of target word selection in machine translation (which itself is a variant of lexical disambiguation). As such, these ambiguities will be treated by the general method described in [Dagan et al. 1991], which uses statistical data on lexical cooccurrence within specific syntactic relations in a target language corpus.
Consider the following example given for illustration. The German compound 'Reformprozeß' ('reform process') has in principle 9 possible translations. There are three possible English constructions, 'noun noun', 'noun of noun', 'noun's noun', and three possible translations for the word 'Prozeß', 'process', 'case' and 'trial'. Out of these 9 alternatives, the compound 'reform process' occurs 5 times in the first half of The Washington Post corpus, while all the other alternatives ('process of reform', 'case of reform', 'reform case' etc.) never occur. Using these statistics, the algorithm described in [Dagan et al. 1991] selects 'reform process' as the preferred translation. It should be noted that the information which is used for lexical disambiguation may come from either within the compound, as in this example, or from the surrounding context, such as using the verb which interacts with the compound.
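The nine alternatives in this example arise as the product of three construction types and three translations of the head. The Python sketch below enumerates them and picks the most frequent one by a plain maximum; it does not reproduce the statistical algorithm of [Dagan et al. 1991], and the template strings and function names are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List

# Hypothetical surface templates for the three construction types.
TEMPLATES = {
    "noun noun": "{mod} {head}",
    "noun of noun": "{head} of {mod}",
    "noun's noun": "{mod}'s {head}",
}


def alternatives(modifiers: List[str], heads: List[str]) -> List[str]:
    """Combine every construction with every lexeme choice."""
    return [
        template.format(mod=mod, head=head)
        for template, mod, head in product(TEMPLATES.values(), modifiers, heads)
    ]


def select(candidates: List[str], counts: Dict[str, int]) -> str:
    """Pick the candidate with the highest corpus count (unseen counts as 0)."""
    return max(candidates, key=lambda c: counts.get(c, 0))


candidates = alternatives(["reform"], ["process", "case", "trial"])
corpus_counts = {"reform process": 5}  # count reported in the text
best = select(candidates, corpus_counts)
```

Because construction and lexeme are chosen jointly over the same candidate list, one corpus lookup settles both questions in this example. When no candidate occurs at all, `select` falls back to the first candidate, a case in which the surrounding context would have to supply the evidence.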
4 Conclusions 
This paper demonstrates that the translation of noun compounds is a difficult task. Having German as the source language adds the problem of segmenting the compound into its constituents, a problem which does not exist in many other languages. The solution for these problems seems to require various levels of information, involving morphological, syntactic, semantic and stylistic criteria.
Though these levels are general for every natural language processing task, we have shown how a detailed analysis of the specific linguistic phenomena can lead to an efficient hybrid architecture which uses the partial information available computationally. This architecture combines formal syntactic and morphological rules, wherever they can be specified accurately, with empirical data which reflects some of the semantic and stylistic considerations. In this sense, this paper promotes the integration of the sometimes diverging streams, namely the use of symbolic, manually stipulated linguistic rules versus the use of statistical data which is extracted automatically from corpora. In our view, these two disciplines complement each other and are both essential to achieve high performance in practical natural language processing systems.
Acknowledgements: We would like to thank Eran Amir from Haifa, Peter Brown from Yorktown and Mark Beers and Myriam Welschbillig from Heidelberg for their help and comments.
References 
[1] S. Ananiadou and J. McNaught. Treatment of Compounds in a Transfer based Machine Translation System. In Proc. of the 3rd Int. Conf. on Theoretical and Methodological Issues in MT of NL, Univ. of Texas, Austin, 1990.
[2] H. U. Block. Maschinelle Übersetzung komplexer französischer Nominalsyntagmen ins Deutsche. Niemeyer, Tübingen, 1984.
[3] E. Breidt. Die Behandlung von mehrdeutigen Verben in der maschinellen Übersetzung. IBM IWBS Report 158, Stuttgart/Heidelberg, 1991.
[4] I. Dagan, A. Itai, and U. Schwall. Two languages are more informative than one. In Proc. of the 29th Meeting of the ACL, pages 130-137, Berkeley, 1991.
[5] P. Downing. On the creation and use of English compound nouns. Language, (53):810-842, 1977.
[6] G. Fanselow. Zur Syntax und Semantik der Nominalkomposition. Niemeyer, Tübingen, 1981.
[7] W. Fleischer. Wortbildung der deutschen Gegenwartssprache. Niemeyer, Tübingen, 1982.
[8] J. Jeziorski. Strukturmodelle der deutschen Nominalkomposita vom Typ 'Substantiv + Substantiv'. Wirkendes Wort, (4):235-238, 1982.
[9] H.-D. Luckhardt and H. H. Zimmermann. Computergestützte und Maschinelle Übersetzung. Saarbrücken, 1991.
[10] M. C. McCord. The slot grammar system. In J. Wedekind and Ch. Rohrer, editors, Unification in Grammar, MIT Press. To appear.
[11] M. C. McCord. A New Version of the Machine Translation System LMT. Literary and Linguistic Computing, (4):218-229, 1989.
[12] B. Merialdo. Tagging text with a probabilistic model. ICASSP, 1991.
[13] H. S. Müller. Einige statistische Angaben über 'zusammengesetzte Substantive' im Deutschen. Germanist. Linguistik, (1/2):171-198, 1977.
[14] M. Neff and M. McCord. Acquiring lexical data from machine-readable dictionary resources for machine translation. In Proc. of the 3rd Int. Conf. on Theoretical and Methodological Issues in MT of NL, pages 87-92, Univ. of Texas, Austin, 1990.
[15] H. Ortner and L. Ortner. Zur Theorie und Praxis der Kompositaforschung. Narr, Tübingen, 1984.
[16] U. Rackow. On the Treatment of Compounds in Machine Translation. A Study. IBM IWBS Technical Report, Heidelberg, 1992. To appear.
[17] M. Rimon, M. McCord, U. Schwall, and P. Martinez. Advances in Machine Translation Research in IBM. In Proceedings of the MT Summit III, pages 11-18, Washington, 1991.
[18] U. Schwall. LMT Machine Translation Demonstration. IBM IWBS Report 177, Stuttgart/Heidelberg, 1991.
[19] Eiichiro Sumita and Hitoshi Iida. Experiments and Prospects of Example based Machine Translation. In Proc. of the 29th Meeting of the ACL, pages 185-192, Berkeley, 1991.
