<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4201">
  <Title>AUTOMATIC TRANSLATION OF NOUN COMPOUNDS</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
AUTOMATIC TRANSLATION OF NOUN COMPOUNDS
ULRIKE RACKOW
IBM Scientific Center
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper describes the treatment of nominal compounds in a transfer-based machine translation system; it presents a new approach for resolving ambiguities in compound segmentation and constituent structure selection, using a combination of linguistic rules and statistical data. An introduction to the general as well as to the German-English-specific problems of compound translation is given (sect. 1). In section 2, the analysis phase is described with its linguistic as well as its computational aspects. Section 3 deals with the transfer and generation process, focussing on corpus-based techniques.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> It is widely known that the word formation mechanism of compounding is highly productive, in German as well as in English, and that efficient strategies have to be developed to deal with this linguistic phenomenon in any kind of NLP system. Although this fact is generally agreed upon and a lot of linguistic research has been done, it has not been possible so far to develop a general and overall procedure to solve the problem in a satisfactory and adequate way (cf.</Paragraph>
    <Paragraph position="1"> [Ananiadou/McNaught 1990]).</Paragraph>
    <Paragraph position="2"> Two special aspects of the problem of compounding phenomena arise, among others, within the framework of machine translation (MT), here the translation from German into English. The first problem that has to be dealt with in this case is the correct segmentation of the German compound word. The constituents having been found, the next step we have to deal with consists in translating them correctly.</Paragraph>
    <Paragraph position="3"> Correctness refers here a) to the choice of the appropriate target lexemes and b) to the selection of the right target construction type.</Paragraph>
    <Paragraph position="4"> Of course, there are a lot of other problems to be resolved for the treatment of compounds in MT, e.g.</Paragraph>
    <Paragraph position="5"> semantic interpretation of the relation between the constituents, the question in how far this point is relevant for translation, depth of analysis, etc. In this paper, however, we will mainly concentrate on the two problems mentioned above. An important property of our approach for segmentation (cf. 2) is optimizing the process by using the type of the juncture between the compound constituents to formulate restrictions on their possible position (front, middle and/or end) in the compound word. Another novel characteristic of our approach is that there is no need of finding out the correct constituent structure during the analysis phase. This problem is transferred to the process of selecting the correct target compound construction (cf. 3.3). The solutions we propose are based on an investigation of examples which were extracted, in part randomly, from real text corpora 1. Contrary to the approach of example-based machine translation (e.g. cf. [Sumita 1991]), we don't use a bilingual corpus, but a monolingual target corpus which is much easier to obtain in a very large size. The last feature of our approach we would like to point out here is its multilinguality: on the one hand, the results of German compound analysis can serve as input for all target languages; and, on the other hand, the features of the English construction types associated with the target entries for English nouns can also be used for source languages other than German and, what is important, for NLP applications other than MT.</Paragraph>
    <Paragraph position="6"> The several components of our model are currently being tested separately, and an integration is planned.</Paragraph>
    <Paragraph position="7"> Preliminary results indicate that the corpus-based techniques achieve high accuracy, but they are not fully analyzed yet. We plan to report complete results in a future paper.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Automatic Analysis of German
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Compounds
2.1 Preliminary Remarks
</SectionTitle>
      <Paragraph position="0"> Our work focuses on nominal compounds; in our first approach, we narrowed this type even more to noun-noun(-noun...) compounds, these constructions being again the most frequent type of nominal compounds in both languages (cf. [Rackow 1992]). This restriction to nouns gives us the possibility of using the part of speech in the segmentation algorithm to reduce the number of possible segmentation results; in any case, certain personal or possessive pronouns, conjunctions etc. can be excluded explicitly, for they never occur in productive composition types.</Paragraph>
      <Paragraph position="1"> This way, we can avoid wrong decompositions, such as *Uns-Innigkeits-Vorwurf ('us intimacy reproach') instead of Unsinnigkeits-vorwurf ('nonsense reproach').</Paragraph>
      <Paragraph position="2"> Only those compounds which are not lexicalized are treated, i.e. the segmentation and translation algorithm is only called upon if an input word has not been found in the system's lexicon. 1 The German examples are partly taken from the SPRING Corpus, which was kindly put at our disposal by the Speech Recognition Group of the German IBM Science Center Heidelberg. The English data were extracted from the corpora maintained by the speech group of IBM Watson Research Center, Yorktown Heights.</Paragraph>
      <Paragraph position="3"> ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 1249 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992</Paragraph>
      <Paragraph position="4"> With German as the source language, the first problem in the treatment of compound words arises from the fact that German compounds are written in one word and that in many cases, the form of the words in a compound differs from the base form in that either a so-called Fugenelement (connecting element or juncture morpheme) is added to the modifying word or that one or more letters are taken away from the ending of these words. In order to allow for a correct segmentation of the compounds, a code has to be added next to the morphological declension code of the entries in the analysis part of the lexicon pointing to the corresponding morpheme types.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Code for the Connecting Element
</SectionTitle>
      <Paragraph position="0"> The importance of the correct encoding of the connecting element is shown in the following example.</Paragraph>
      <Paragraph position="1"> Suppose a word like Arbeitsamt ('job center') would not have an entry in the lexicon and Arbeit would not be encoded with the connecting morpheme 's'. The system would then decompose the unknown word into Arbeit ('job, work'), which is still correct, and Samt ('velvet'), which is obviously not the expected second constituent (which has to be Amt ('office, department, center')), because the 's' is not interpreted as a morpheme but as the first letter of the second constituent 2. For several reasons, the correct encoding of the connecting morphemes (Fugen-code) is not as trivial as it might appear. First, there are various types of these elements: zero morpheme: Umwelt → Umwelt-bewegung; addition of a form of the inflectional paradigm of the word, e.g. the plural ending: Diskette → Diskette+n-laufwerk; addition of a letter which is not in the inflectional paradigm: Installation → Installation+s-programm; deletion of the ending: Schule → Schul-hof; deletion of the ending and addition of another letter: Weihnachten → Weihnacht+s-konzert.</Paragraph>
      <Paragraph position="2"> There are quite a lot of words, however, which can take more than one type of connecting morpheme. In some cases, it is only a question of usage, depending on the head noun, in which form the word appears; in other cases, the type of juncture morpheme has significance in meaning distinction. The noun Geschichte ('story/history') is an example for such a case (cf.</Paragraph>
      <Paragraph position="3"> [Fleischer 1982]): Geschicht+s-buch 'history book', Geschichte+n-buch 'story book'. This fact, which can help disambiguation, has to be represented in the lexicon as a transfer constraint for compound nouns. The type of juncture element is not predictable from other formal aspects of the noun, e.g. from gender, declension code, etc. There are certain regularities, but they are not consistent enough to allow for an automatic encoding. It is just as little possible to derive the connecting elements completely from existing machine-readable dictionaries (MRD); as a prerequisite, all words would have to appear in an MRD in all their possible forms as modifying elements of compound words.</Paragraph>
      <Paragraph position="4"> 2 More examples can be found in ([Luckhardt/Zimmermann 1991], 116f).</Paragraph>
      <Paragraph position="5"> The codes which we assigned to the connecting elements relate only to the form of the morpheme. As far as the implementation is concerned, the formal identity of some connecting elements and inflectional morphemes on the one hand is used to simplify the segmentation algorithm, and, on the other hand, the difference between connecting elements which are in the inflectional paradigm and those which are not is used to make predictions on the possible position of a constituent in a compound word.</Paragraph>
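      <Paragraph> The juncture types listed in this subsection can be pictured as a small data structure. The following is only a minimal sketch: the class name Fuge, the table FUGEN and the function modifier_forms are hypothetical and do not reproduce the actual Fugen-codes of the lexicon, which the text does not spell out. Each code records how many letters are cut from the base form and which letters are added.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fuge:
    """Hypothetical juncture code: letters cut from the base form, then letters added."""
    cut: int = 0
    add: str = ""

# Illustrative lexicon fragment built from the examples given in the text.
FUGEN = {
    "umwelt": [Fuge()],                     # zero morpheme
    "diskette": [Fuge(add="n")],            # inflectional (plural) ending
    "installation": [Fuge(add="s")],        # letter outside the paradigm
    "schule": [Fuge(cut=1)],                # deletion of the ending
    "weihnachten": [Fuge(cut=2, add="s")],  # deletion plus addition
    "geschichte": [Fuge(cut=1, add="s"), Fuge(add="n")],  # two junctures
}

def modifier_forms(base):
    """Return every surface form the noun can take as a modifying constituent."""
    forms = []
    for fuge in FUGEN.get(base, []):
        stem = base[:len(base) - fuge.cut]
        forms.append(stem + fuge.add)
    return forms
```

A noun with several junctures, such as Geschichte, simply carries several codes; a real lexicon would additionally attach the transfer constraint that distinguishes 'history' from 'story'.</Paragraph>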
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.3 Possible Positions of Compound
Constituents
</SectionTitle>
    <Paragraph position="0"> It is possible to draw certain conclusions from the type of connecting element on the possible position of a constituent in a compound word. Depending on whether the juncture morpheme has the same form as a form of the inflectional paradigm of the word or not, or whether the ending of the base form of the word is deleted or not, the word with its juncture can be positioned as a modifying constituent in the beginning or in the middle of the compound, or as the modified constituent (the head) at the end, or in any combination of the mentioned positions. The following examples will make the idea clearer.</Paragraph>
    <Paragraph position="1"> The general framework for our research work and implementation is the machine translation system LMT developed by Michael McCord 3. LMT is a lexicalistic, source-based transfer system. In this section, we concentrate on the performance of the PROLOG algorithm 'Compound Interpretation COMPGE' as a hook-up component to LMT-GE (German-English). 3 LMT and related projects are described in detail in ([McCord 1989]; [Rimon et al. 1991]; [Schwall 1991]).</Paragraph>
    <Paragraph position="2"> The segmentation and translation algorithm COMPGE is only called upon if an input word (with more than five letters) has not been found in the system's lexicon or in the on-line accessible MRD Collins German-English 4, i.e. when lookup and the regular</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 For further information, cf. ([Neff/McCord 1fl911]).
</SectionTitle>
    <Paragraph position="0"> morphological analysis fail. The segmentation is then carried out from left to right, beginning after the third letter. The decomposition process continues until the first word is found in the lexicon; the dictionary entry contains, among other data, information about the connecting element (Fugen-code). The algorithm then takes the complete dictionary entry with source and target word and all information contained in it, stores the word and continues by looking up the rest as a whole. If an entry is found, it is stored as well, together with the relevant morphological, syntactic, and semantic information. If there is, on the other hand, no entry for the remainder as a whole, the segmentation is carried on letter for letter, the same way as for the first constituent, until an analysis for an existing entry is derived.</Paragraph>
    <Paragraph position="1"> When all constituents are found, the words are stored, and segmentation is started again in order to allow, in ambiguous cases, for more than one possible segmentation. Let us look at the word Messerattentat. The result of the first decomposition would be Messe-ratten-tat ('mass-rat-action'), in accordance with the Fugen-codes of the segments; the second result would be Messer-attentat ('knife-attack'), also in accordance with the Fugen-codes. The system, which then has to choose between the two possibilities, would take the second result, following the general strategy that compounds with two nominal constituents are much more frequent than those with three elements, those with three more frequent than those with four, etc. (cf. [Jeziorski 1982], [Müller 1977]). When segmentation is finished, the algorithm begins with the semantic interpretation of the compound before starting transfer.</Paragraph>
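    <Paragraph> The left-to-right decomposition and the preference for fewer constituents can be sketched as follows. This is an illustrative Python sketch, not the PROLOG algorithm COMPGE itself: the toy sets MODIFIER_FORMS and HEADS and both function names are assumptions, and a real lexicon lookup would check the Fugen-code of each entry instead of listing surface forms.

```python
# Hypothetical toy lexicon: surface forms allowed as modifiers (base form plus
# juncture) and base forms allowed as heads. Real entries carry Fugen-codes.
MODIFIER_FORMS = {"messe", "messer", "ratten", "arbeits"}
HEADS = {"tat", "attentat", "amt"}

def segmentations(word, min_len=3):
    """Return every split of word into modifier forms followed by a head."""
    results = []
    if word in HEADS:
        results.append([word])
    # Scan left to right, starting after the third letter.
    for cut in range(min_len, len(word) - min_len + 1):
        first, rest = word[:cut], word[cut:]
        if first in MODIFIER_FORMS:
            for tail in segmentations(rest, min_len):
                results.append([first] + tail)
    return results

def best_segmentation(word):
    """Prefer the analysis with the fewest constituents; None if there is none."""
    candidates = segmentations(word.lower())
    return min(candidates, key=len) if candidates else None
```

For Messerattentat the sketch finds both analyses discussed in the text and returns the two-constituent one.</Paragraph>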
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Syntactic and Semantic Implications
</SectionTitle>
      <Paragraph position="0"> Since, in non-lexicalized compounds, the compound is generally a member of the syntactic and semantic class to which its head word belongs, this information can be passed on to the whole compound. As mentioned earlier, the entry for each constituent of the compound is extracted from the lexicon. Then the relevant morphological, syntactic and semantic information of the last constituent, the head noun, is attributed to the compound word as a whole.</Paragraph>
      <Paragraph position="1"> The following example, Umweltbewegung, illustrates the procedure: whereas Umwelt has the semantic type physical 5, Bewegung gets the type abstract.</Paragraph>
      <Paragraph position="2"> Consequently, the compound word is attributed the semantic type abstract, too. This passing on of semantic</Paragraph>
      <Paragraph position="3"> information 6 can be used, for instance, for target lexeme selection using semantic constraints or for anaphora resolution.</Paragraph>
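      <Paragraph> As a small illustration of this head-feature inheritance, the following sketch copies the entry of the last constituent to the compound. The dictionary LEXICON, its feature names and the function compound_features are hypothetical; only the semantic types physical and abstract come from the text.

```python
# Hypothetical lexicon entries; the feature names are illustrative only.
LEXICON = {
    "umwelt": {"pos": "noun", "gender": "f", "sem_type": "physical"},
    "bewegung": {"pos": "noun", "gender": "f", "sem_type": "abstract"},
}

def compound_features(constituents):
    """Copy the features of the last constituent (the head) to the compound."""
    head = constituents[-1]
    return dict(LEXICON[head])
```

Calling compound_features(["umwelt", "bewegung"]) yields the features of Bewegung, so the compound is typed abstract regardless of its modifier.</Paragraph>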
      <Paragraph position="4"> 5 On the semantic type hierarchy used in LMT-GE, cf. [Breidt 1991]. 6 Since we intend to treat only non-lexicalized compounds this way, a false semantic analysis, as it might occur in trying to translate the word Frauenzimmer (not 'women's room', but rather an archaic/derogatory term for 'woman') this way, is not very probable, given the fact that these kinds of words can be found in the LMT lexicon or in on-line accessible dictionaries.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Transfer and Generation
</SectionTitle>
    <Paragraph position="0"> Transfer in LMT is divided into two parts: the compositional transfer, which is part of the shell, and the language-pair-dependent restructuring transfer. The translation of compound words is done during compositional transfer.</Paragraph>
    <Paragraph position="1"> In order to translate German compounds correctly into English, contrastive research studies had to be carried out on compounding phenomena. We first set up a typology of German and English morphological correspondences of compound constructions. Analysis was first done on the basis of 17,400 nominal compounds extracted from the MRD Collins German-English. In a second phase, in order to compensate for the fact that there are also lexicalized, non-productive compound types in the dictionary, monolingual corpus-based research was carried out (cf.</Paragraph>
    <Paragraph position="2"> 3.3).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Feature Transfer
</SectionTitle>
      <Paragraph position="0"> Morphological and syntactic information on the source head word is passed on to the corresponding target word. If there is a specific feature of the target word coded in the transfer part of the lexicon which contradicts a source feature, the latter is overwritten by the target feature. If for instance the target word only occurs in the singular, but the source head word of the compound has the feature plural, the target word feature is preferred over the source word feature, and the compound will appear in the singular, e.g. the plural word Industrieinformationen becomes a singular in English, industry information, because of the transfer lexicon part &lt; t (information)/sg.</Paragraph>
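      <Paragraph> The override rule can be sketched as a dictionary merge in which features coded on the target entry win over features inherited from the source head word. The function transfer_features and the feature names are assumptions made for illustration; they are not LMT's internal representation.

```python
def transfer_features(source_head_features, target_entry_features):
    """Pass source features on, letting features coded on the target entry win."""
    merged = dict(source_head_features)
    merged.update(target_entry_features)
    return merged

# Hypothetical feature names; 'information' is coded as singular-only.
source = {"pos": "noun", "number": "pl"}   # Industrieinformationen
target = {"number": "sg"}                  # transfer entry for 'information'
result = transfer_features(source, target)
```

Here result keeps the part of speech from the source side but carries the singular number from the target entry.</Paragraph>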
      <Paragraph position="1"> Other information that goes with the target head word entry, such as information on subcategorization</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Analysis of the Compounds of a
Bilingual Dictionary
</SectionTitle>
      <Paragraph position="0"> The aim of our contrastive study was to find out correspondences between morphological types of German and English compound nouns. Therefore, a classification was set up where six types of German nominal compounds were contrasted with twelve types of English nominal compound constructions. These types contained information on the POS of the compound constituents, i.e. on the internal structure of the compounds in both languages.</Paragraph>
      <Paragraph position="1"> After encoding 17,400 German compounds with their English correspondences according to these types, an evaluation was made which led to the following results: The noun-noun construction is the most frequent type in German as well as in English. What is even more important for the translation strategy is the fact that 54.4% of the German noun-noun compounds are translated into the same English construction type, i.e. into noun-noun compounds as well. 7 In certain cases a slot of the frame is filled by the modifier of the head noun of a compound. Nevertheless, this is not always the case; therefore, we prefer passing on the subcategorization frame (cf. [Fanselow 1981]).</Paragraph>
      <Paragraph position="2"> They are followed by the adjective-noun type (17.2%) and the noun-of-noun type (14.3%). Considering all German nominal compounds and not only noun-noun compounds, 44.4% of them were translated into the English noun-noun type 8. These are the data which formed the basis for our first translation strategy, namely to translate German nominal compounds per default into English noun-noun constructions. Since about 50% would then not be translated correctly, i.e. not according to language usage, this first approach has been augmented by corpus-based techniques which are currently at an experimental level.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.3 Corpus Based Techniques
3.3.1 Selecting the Target Construction
</SectionTitle>
    <Paragraph position="0"> Recognizing that selecting the preferred target construction for a certain compound is in part an arbitrary decision of each language, it seems suitable to look for the information in a target language corpus. The idea is that when the English compound we should generate does not appear in the system's lexicon, we will try to match it against the corpus and select a preferred construction according to the information found 9. It should be noted at this point that in many cases there are several legitimate constructions that may be selected. However, the system cannot always distinguish these cases from cases where there is only one legitimate choice in the specific context. Therefore, it is always necessary to make a selection, and our strategy is to prefer the construction that seems most probable for being a legitimate choice. This strategy has also a stylistic advantage, as it prefers the more commonly used constructions.</Paragraph>
    <Paragraph position="1"> The most simple and accurate method to start with is to search the corpus for explicit examples of the complete compound and prefer that construction which is most frequent. For instance, the German compound 'Oppositionsgruppe' may in principle be translated (according to the findings described in the previous section) to either 'opposition group', 'group of opposition', 'oppositional group' or 'opposition's group'. Consulting a corpus of 40 million words of The Washington Post articles enables us to prefer the first ('noun-noun') option, as it occurs 89 times in the corpus, while the second option occurs only 3 times and the other options do not occur at all. On the other hand, in translating the compound 'Parlamentsdebatte' the statistics prefer the construction 'parliamentary debate' (23 occurrences), where the modifier appears in its adjectival form. In this case, the 'noun-noun' form, 'parliament debate', does not occur in the corpus, and the form 'debate in parliament' occurs 3 times.</Paragraph>
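    <Paragraph> The frequency-based choice can be sketched in a few lines. The function prefer_construction is a hypothetical name, and the counts table holds only the figures quoted in the text for The Washington Post corpus; how the counts are gathered (including the handling of inflected forms) is left out of this sketch.

```python
from collections import Counter

def prefer_construction(candidates, corpus_counts):
    """Return the candidate phrase seen most often, or None if none occurs."""
    best = max(candidates, key=lambda phrase: corpus_counts[phrase])
    return best if corpus_counts[best] > 0 else None

# Counts quoted in the text for the Washington Post corpus.
counts = Counter({
    "opposition group": 89,
    "group of opposition": 3,
    "parliamentary debate": 23,
    "debate in parliament": 3,
})
```

A Counter is used because it reports zero for unseen phrases, which matches the case where an alternative does not occur in the corpus at all.</Paragraph>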
    <Paragraph position="2"> In the cases mentioned above, the corpus provides enough examples of the exact compound we are looking for. The only generalization that was used is to take into account the morphological inflections of the words (e.g. counting also occurrences of 'parliamentary debates', with the plural form of 'debate'). 8 The contrastive studies and their results are described in detail in [Rackow 1992].</Paragraph>
    <Paragraph position="3"> 9 This approach is applicable for any natural language generation task, hence the relevance of this section is not restricted to the application of machine translation.</Paragraph>
    <Paragraph position="4"> However, many compounds are too rare and do not occur a significant number of times in the corpus. In these cases it is necessary to use various generalizations over the constituents of the compound in order to obtain some relevant information. A suitable solution is to generalize over the part of speech of some of the words of the compound. For example, the compound 'Umweltbewegung' may be translated (among other options) to 'ecology movement' or 'ecological movement'. This compound occurs only once in The Washington Post corpus, in the form 'ecological movement', but this is not significant enough to make a selection. In order to obtain more information we look for compounds in which either 'ecology' or 'ecological' serves as a prenominal modifier, with no restriction on the specific word which serves as the head noun. This information was searched for in the first 100,000 sentences of the Hansard corpus of the proceedings of the Canadian parliament, which was tagged with part of speech using a stochastic tagger [Merialdo 1991]. In these sentences, the form 'ecological (noun)' was observed 11 times, while the form 'ecology (noun)' was observed only once. Using these statistics we regard the adjectival form 'ecological' as the default form whenever the two alternatives are encountered and there are not enough examples of the complete compound. For instance, this default will be used also when translating 'Umweltprobleme' to 'ecological problems' or 'Umweltreserven' to 'ecological reserve' (and not inappropriately to 'ecology problems/reserve'). The use of such defaults enables us to increase the coverage of the statistical method and treat infrequent compounds of the target language.</Paragraph>
    <Paragraph position="5"> Another important purpose for using default constructions for single words is to save storage space.</Paragraph>
    <Paragraph position="6"> Without defaults, we would have to store in our statistical data base the most frequent construction for every specific compound which occurs in the training corpus a significant number of times. This might require too much space when training the system on the very large corpora which are necessary to get high coverage and precision of the method. On the other hand, if we store the default constructions for single words, then we should store specific compounds, i.e. combinations of words, only when the preferred construction for these combinations conflicts with the defaults for single words.</Paragraph>
    <Paragraph position="7"> This leads to the following implementation scheme: During the training phase, the (tagged) corpus will be processed twice. In the first pass, the default constructions for single words will be collected. In the second pass, all the specific compounds will be identified, but only those which conflict with the default constructions will be stored in an exception list. When translating a new German compound (during the actual translation phase), the exception list will first be consulted to check whether one of its items matches one of the possible alternatives for translation. Only if there is no relevant item, the defaults for the single constituents will be used.</Paragraph>
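    <Paragraph> The two-pass scheme with an exception list can be sketched as follows. Everything here is an assumption made for illustration: the input is taken to be a list of (modifier, construction, head) triples already extracted from the tagged corpus, the construction labels are free strings, and the significance threshold mentioned in the text is omitted.

```python
from collections import Counter, defaultdict

def train(observations):
    """observations: (modifier lemma, construction, head) triples from a tagged corpus."""
    # Pass 1: the default construction of each modifier, over all heads.
    per_modifier = defaultdict(Counter)
    for modifier, construction, _head in observations:
        per_modifier[modifier][construction] += 1
    defaults = {m: c.most_common(1)[0][0] for m, c in per_modifier.items()}

    # Pass 2: keep a specific compound only if it conflicts with the default.
    per_compound = defaultdict(Counter)
    for modifier, construction, head in observations:
        per_compound[(modifier, head)][construction] += 1
    exceptions = {}
    for (modifier, head), c in per_compound.items():
        preferred = c.most_common(1)[0][0]
        if preferred != defaults[modifier]:
            exceptions[(modifier, head)] = preferred
    return defaults, exceptions

def choose(modifier, head, defaults, exceptions):
    """Consult the exception list first, then fall back to the default."""
    return exceptions.get((modifier, head), defaults.get(modifier))
```

Only compounds whose preferred construction differs from the single-word default end up in exceptions, which is the storage saving the text describes.</Paragraph>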
    <Paragraph position="8"> 3.3.2 Selecting the Target Lexemes We relate to the problem of selecting the appropriate target words for the constituents of the compound as a special case of the problem of target word selection in machine translation (which itself is a variant of lexical disambiguation). As such, these ambiguities will be treated by the general method described in [Dagan et al. 1991], which uses statistical data on lexical cooccurrence within specific syntactic relations in a target language corpus.</Paragraph>
    <Paragraph position="9"> Consider the following example given for illustration. The German compound 'Reformprozeß' ('reform process') has in principle 9 possible translations. There are three possible English constructions, 'noun noun', 'noun of noun' and 'noun's noun', and three possible translations for the word 'Prozeß': 'process', 'case' and 'trial'. Out of these 9 alternatives, the compound 'reform process' occurs 5 times in the first half of The Washington Post corpus, while all the other alternatives ('process of reform', 'case of reform', 'reform case' etc.) never occur. Using these statistics, the algorithm described in [Dagan et al.</Paragraph>
    <Paragraph position="10"> 1991] selects 'reform process' as the preferred translation. It should be noted that the information which is used for lexical disambiguation may come from either within the compound, as in this example, or from the surrounding context, such as using the verb which interacts with the compound.</Paragraph>
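    <Paragraph> The nine alternatives of the example can be enumerated mechanically. The sketch below is a deliberate simplification: it picks the candidate with the highest raw count, whereas the method of [Dagan et al. 1991] relies on statistical data about lexical cooccurrence that is not reproduced here. The templates in CONSTRUCTIONS and the function names are hypothetical.

```python
from itertools import product

# Hypothetical templates for the three English constructions named in the text.
CONSTRUCTIONS = {
    "noun noun": "{mod} {head}",
    "noun of noun": "{head} of {mod}",
    "noun's noun": "{mod}'s {head}",
}

def alternatives(modifier_translations, head_translations):
    """Enumerate every construction for every pair of candidate lexemes."""
    return [
        template.format(mod=mod, head=head)
        for mod, head, template in product(
            modifier_translations, head_translations, CONSTRUCTIONS.values()
        )
    ]

def select(candidates, counts):
    """Pick the most frequent candidate; None when nothing is attested."""
    best = max(candidates, key=lambda c: counts.get(c, 0))
    return best if counts.get(best, 0) > 0 else None

candidates = alternatives(["reform"], ["process", "case", "trial"])
chosen = select(candidates, {"reform process": 5})
```

With one translation for the modifier and three for the head, the three constructions give the nine candidates mentioned in the text, and the single attested phrase is selected.</Paragraph>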
  </Section>
</Paper>