<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1004"> <Title>INTERPRETING COMPOUNDS FOR MACHINE TRANSLATION</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> INTERPRETING COMPOUNDS FOR MACHINE TRANSLATION BARBARA GAWRONSKA, CHRISTER JOHANSSON, ANDERS NORDNER, CAROLINE WILLNERS </SectionTitle> <Paragraph position="0"> Dept. of Linguistics, University of Lund, Helgonabacken 12, S-223 62 LUND, Sweden SUMMARY: The paper presents a procedure for interpretation of English compounds and for automatic translation of such compounds into Slavic languages and French. In the target languages, a compound nominal is as a rule to be rendered by an NP with an adjective or genitive attribute, or with an attributive participle construction. The model is based on Bierwisch's theory of word formation, which in turn is inspired by categorial grammar. The procedure is applied to a specific domain (asthma research).</Paragraph> <Paragraph position="1"> 0. INTRODUCTION The need for a component interpreting complex lexical items in an MT system translating from Germanic languages into e.g. French or Slavic languages is obvious. Many rules (or patterns) of word formation are highly productive, which makes it impossible to store all complex lexical entries in a static lexicon. An effective MT system must also be able to match the interpretation of a complex entry with the correct morphosyntactic pattern in the target language. For example, a program translating from German into Polish must distinguish the relations between the parts of a compound like Universitätslehrer (university teacher) from the relations holding between Musik and Lehrer in Musiklehrer (teacher of music). The first-mentioned compound is to be translated as a noun followed by an adjective (nauczyciel uniwersytecki - 'teacher university+adjective ending'), the latter one as a noun and a genitive attribute (nauczyciel muzyki - 'teacher music+gen'). 
</Paragraph> <Paragraph position="2"> Similar problems occur when translating into French or Czech: cf. Musikabend - Fr. soirée musicale (n a), Cz. hudební večer (a n); Musiklehrer - Fr. professeur de musique (n prep n), Cz. učitel hudby (n n+gen).</Paragraph> <Paragraph position="3"> The models for compound interpretation and generation proposed by general linguists (cf. Lees 1960, Selkirk 1982, Fanselow 1988, Bierwisch 1989) require as a rule several modifications in order to be applicable in an MT system.</Paragraph> <Paragraph position="4"> Since, in our opinion, a model aimed to serve as an efficient tool for NLP and MT must be linguistically valid, we will discuss a number of theoretical questions and relate our model to general linguistics before presenting our experimental procedure for domain-restricted compound translation.</Paragraph> <Paragraph position="5"> 1. THE STATUS OF WORD FORMATION RULES 1.1. 'Where's morphology?' The above question, put by Stephen Anderson 1982, is still waiting for a definitive answer. Word formation rules have been claimed to obey syntactic principles and hence to be a part of UG (Lees 1960, Pesetsky 1985), to form a grammatical level of their own (Di Sciullo &amp; Williams 1987), to be explainable in semantic terms solely (Fanselow 1988) or to belong to the lexicon (Chomsky 1970, Jackendoff 1975, Bierwisch 1989).</Paragraph> <Paragraph position="6"> We will propose a quite simple answer to Anderson's question: morphology shall be seen as a component of the grammar, the notion 'grammar' to be understood as an integrated model where no borders are drawn between syntax, morphology, and semantics.</Paragraph> <Paragraph position="7"> 1.2. 
Towards an integration of syntax, semantics and morphology Fanselow (1985, 1988) argues, on the basis of psycholinguistic evidence, for treating word formation rules not as generative processes, but as a 'primitive' process of concatenating morphematic items, a very easily learnable procedure. His argumentation is restricted to morphology in the traditional sense of the term. We would like to go even further and claim that the grammar as a whole can be regarded as a set of patterns of concatenation or cooccurrence of lexical items, each concatenation pattern associated with principles of semantic interpretation. This approach is to some extent inspired by (but far from identical to) categorial grammar and Bierwisch's lexicon theory (Bierwisch 1989).</Paragraph> <Paragraph position="8"> At the same time, it is in its very essence not incompatible with Constraint Grammar (Karlsson 1990, Koskenniemi 1990).</Paragraph> <Paragraph position="9"> 1.3. Compounds as collocations English compounds provide an argument in favour of our approach to grammar. It seems impossible to draw a clear-cut borderline between strings traditionally labelled as compounds and those classified as noun phrases. Cf. the following examples, taken from a corpus of medical abstracts: ragweed allergic rhinitis house-dust-allergic asthma house dust asthma patient daily symptom diary cards fluticasone propionate aqueous nasal spray In most grammatical descriptions, strings consisting of nouns (like house dust asthma) are treated as compound nouns, whereas a complex including an adjective followed by a noun is normally labelled as an NP. The above examples show, however, that such a distinction is not unproblematic. 
Phrases like house-dust-allergic asthma and fluticasone propionate aqueous nasal spray may be analysed either as NPs containing a compound adjective and a head noun, or as compounds including optional adjective constituents (house dust asthma and fluticasone spray are perfectly well-formed). Furthermore, parts of an English compound may provide referents for elliptic constructions, as in the following examples: The variations in provocation concentrations ... were small during both placebo and active drug treatment; the difference between a single allergen provocation and continuous exposure...</Paragraph> <Paragraph position="10"> Thus, a noun included in a compound can still have a referent of its own, an ability normally associated with nominal phrases. Such facts indicate that there is no absolute distinction to be drawn between compound nominals and complex nominal phrases in English. It seems more appropriate to talk about more or less lexicalized collocations. However, in the following the traditional term 'compound' will be used.</Paragraph> </Section> <Section position="2" start_page="0" end_page="49" type="metho"> <SectionTitle> 2. AUTOMATIC INTERPRETATION AND TRANSLATION OF COMPOUNDS </SectionTitle> <Paragraph position="0"> 2.1. The theoretical foundation Bierwisch (1989; cf. also Olsen 1991) regards the process of compounding as a functional application, where one of the thematic roles of the head noun becomes 'absorbed'. For example, a noun like payer is supposed to have the following interpretation: λyλx[z INST [x PAY y]], where y is the external theta-role, x the internal one, and z represents the 'referential role'. 
In a compound like bill payer, the internal role of pay becomes instantiated: λx[z INST [x PAY BILL]].</Paragraph> <Paragraph position="1"> Our analysis of compounds is not incompatible with Bierwisch's approach.</Paragraph> <Paragraph position="2"> However, for the purpose of MT, a classification of valency in terms of three kinds of theta roles only (external, internal and referential) seems insufficient. A procedure for compound interpretation must also take into account optional thematic roles, e.g. location (university teacher). It must in addition be able to deal with compounds that do not include deverbal components. Hence, we decided to modify the theory proposed by Bierwisch at two main points: a. the valency of a verbal stem is to be represented not in terms of external and internal theta roles, but in terms of the components of the event or situation the verb may refer to; b. the interpretation of compounds that do not contain deverbal elements is based on morpho-semantic patterns specifying the default readings of combinations that include members of different semantic categories.</Paragraph> <Section position="1" start_page="0" end_page="49" type="sub_section"> <SectionTitle> 2.2 An experimental procedure for understanding derived nouns and compounds </SectionTitle> <Paragraph position="0"> In an experimental program, implemented in LPA MacProlog, we structured a very restricted lexicon of Swedish stems and affixes (basal lexical entries, BLE) according to the approach outlined above. Each verbal stem was provided with a list of elements of its typical event referent, e.g.: lex([lär],m(teach,stem),v,vt,[agent,sem_object,domain,place,time,result],[]). Affixes were specified with respect to the following features: - the category or categories of stems the affix may be combined with - the resulting category, including the morpho-syntactic specification - the default semantic interpretation of the affix. 
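The entry format just described can be illustrated with a small sketch. It is written in Python rather than in the Prolog of the experimental program, and the class names, field names and the derive function are assumptions introduced only for illustration; the idea that the affix fills one event element and leaves the others open is likewise an assumed reading of the procedure, not a documented detail.

```python
from dataclasses import dataclass, field


@dataclass
class StemEntry:
    """A stem with the elements of its typical event referent."""
    form: str
    meaning: str
    category: str
    event_elements: list = field(default_factory=list)


@dataclass
class AffixEntry:
    """An affix: stem category it takes, resulting category, default reading."""
    form: str
    combines_with: str
    result_category: str
    default_role: str


# Mirrors lex([lär],m(teach,stem),v,vt,[agent,sem_object,...],[]).
TEACH = StemEntry("lär", "teach", "v",
                  ["agent", "sem_object", "domain", "place", "time", "result"])

# Hypothetical agentive suffix entry; the field layout is illustrative only.
AGENTIVE = AffixEntry("are", "v", "n", "agent")


def derive(stem, affix):
    """Attach an affix to a stem; return the derived reading, or None."""
    if affix.combines_with != stem.category:
        return None
    if affix.default_role not in stem.event_elements:
        return None
    # Assumption: the affix fills one event element, the rest stay open
    # for a modifier noun to be matched against later.
    remaining = [e for e in stem.event_elements if e != affix.default_role]
    return {"form": stem.form + affix.form,
            "category": affix.result_category,
            "role": affix.default_role,
            "open_elements": remaining}
```

Under these assumptions, derive(TEACH, AGENTIVE) yields a noun reading whose open elements (sem_object, domain, place and so on) remain available for the first member of a compound. 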
For example, the Swedish agentive suffix -are was represented as: slex([are],suff(n,agr(sg,re,indef)),v,agent,[]). Underived nouns got a quite simplified semantic specification formulated in traditional terms like 'human', 'animate', 'abstract', 'concrete', 'potential location' etc. On this basis, the interpretation procedure tried to match the semantic specification of the affix or of the noun and associate the morphematic entries attached to the verb stem with the most probable elements of the stem's semantic valency. The program distinguished correctly between compounds like grammatiklärare (teacher of grammar) and universitetslärare (university teacher), as shown in the following outprint.</Paragraph> <Paragraph position="2"> category: n agr(sg, re, indef) constituents [grammatik, lärare, [lär, are]]</Paragraph> <Paragraph position="4"> category: n agr(sg, re, indef) constituents [universitet, lärare, [lär, are]] The program was also able to interpret somewhat unusual, but fully possible compounds like universitetsmördare (university killer). In the case of 'university killer', three alternative interpretations were given, all of them acceptable in Swedish: 1) a person who kills in university buildings, 2) somebody who causes destruction of a university, 3) somebody who uses a university for destructive purposes.</Paragraph> <Paragraph position="5"> The flexibility of the quite simple interpretation procedure and its ability to 'understand' even unusual complex words encouraged us to apply the method tested by means of the toy program for a more serious goal, viz. for interpretation and translation of medical abstracts dealing with asthma and allergy research.</Paragraph> <Paragraph position="6"> 2.3. 
Translation of compounds within a restricted domain (medical texts on asthma and allergy research) In order to construct a domain-specific lexicon and to design appropriate parsing and translation algorithms, we investigated a corpus of about 140 medical abstracts. Already the preliminary inspection provided evidence for the need of a special procedure for compound interpretation. The frequency of compounds in the texts was extremely high. Cf. the following sample: A large-scale multicenter investigation was undertaken in 3 cities with comparable pollen seasons and atmospheric pollen concentrations in order to obtain more definite information about the safety and efficacy of cromolyn sodium in the treatment of pollen-induced seasonal rhinitis.</Paragraph> <Paragraph position="7"> Complex names of chemical substances, such as cromolyn sodium, do not pose especially great problems to an MT system, since chemical symbols may be efficiently used as interlingual representations. Highly lexicalized and highly idiosyncratic compounds, like airways or hay fever, may also be stored in the basic lexicon. The main difficulty lies rather in the translation of productive compounds referring to different allergic syndromes, types of medical treatment and patient groups (ragweed pollen asthma, late-summer rhinitis, flunisolide test, flunisolide patient group etc.). In different texts, the same syndrome may be referred to by different phrases, e.g. ragweed asthma, ragweed-induced asthma, ragweed pollen asthma, ragweed-allergic asthma etc. A correct interpretation of the semantic relations between the constituents of such collocations is necessary for correct translation. Otherwise, a phrase like childhood asthma would be translated into French not as asthme des enfants, but as asthme induit par enfance (lit. asthma induced by childhood, by analogy to e.g. pyrethrum asthma - asthme induit par pyréthrines). 
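The contrast between the two French renderings can be made concrete with a minimal Python sketch. The French strings are copied from the examples above; the category labels, the three tables and the function to_french are hypothetical and serve only to show how a semantic category on the modifier could select the target pattern.

```python
# Hypothetical domain categories for the modifier nouns.
CATEGORY = {"pyrethrum": "allergen", "childhood": "time_period"}

# French material copied from the examples in the text.
FRENCH = {"asthma": "asthme", "pyrethrum": "pyréthrines",
          "childhood": "enfants"}

# Target patterns keyed by modifier category; this table is an assumption.
FRENCH_PATTERN = {
    "allergen": "{head} induit par {mod}",
    "time_period": "{head} des {mod}",
}


def to_french(modifier, head):
    """Render modifier + head with the pattern chosen by modifier category."""
    pattern = FRENCH_PATTERN[CATEGORY[modifier]]
    return pattern.format(head=FRENCH[head], mod=FRENCH[modifier])
```

In this sketch to_french("pyrethrum", "asthma") and to_french("childhood", "asthma") differ only because the two modifiers carry different categories; without that lookup both compounds would fall under the same pattern. 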
A procedure for interpretation of compounds and complex NPs must therefore include a kind of domain knowledge, preferably encoded in the lexicon.</Paragraph> <Paragraph position="8"> 2.3.2. The lexicon An MT system aimed at translation of scientific texts should give the user a possibility of adding new entries to the lexicon in a simple way. A system for medical abstract translation would not be really useful if the user could not introduce names of new medicines, new terms denoting syndromes, symptoms, treatment methods etc.</Paragraph> <Paragraph position="9"> Since the users of such a system would, with a high degree of probability, be non-linguists, the linguist designing the method for lexicon extension must adapt the form of interactions to the expected competence of the user.</Paragraph> <Paragraph position="10"> It would be naive to believe that a non-linguist could manage to specify the lexical items in terms of internal and external theta-roles. Even terms like agent, theme and semantic object would probably cause confusion. Hence, it seems most reasonable to formulate the semantic classification in domain-specific terms (in our case, in terms like allergen, syndrom, body-part etc.). There are actually linguistic reasons for this solution, as scientific sublanguages differ semantically from each other as well as from the everyday conversation language. 
For a botanist, pyrethrum is primarily a plant belonging to the chrysanthemum family, whereas an allergy researcher regards pyrethrum as an allergy-inducing factor, having much in common with grass pollen and house dust.</Paragraph> <Paragraph position="11"> In the preliminary model of the lexicon developed until now we classify nouns as members of the following categories:
- syndrom (asthma, rhinitis)
- symptom (sneezing, irritation)
- allergen (pyrethrum, ragweed)
- body part (airways, skin)
- body function (inhalation)
- chemical substance: medicine (antihistamine) or not used as medicine (histamine)
- medical treatment (injection)
- scientific method (measurement, test)
- time period (season, childhood)
- human: patient or not (the latter distinction is needed for correct interpretation of e.g. asthma patient and asthma researcher)
- amount: mass or countable (dose, group)
- others: concrete or abstract</Paragraph> <Paragraph position="12"> 2.3.3. Interactive lexicon extension The user has the possibility to classify new nouns to be added to the lexicon by marking the desired alternative in an interaction window. The same entry may be marked as belonging to several categories. For example, inhalation may be regarded as both body function and medical treatment (house dust inhalation - steroid inhalation). When adding a compound, the user is asked to specify its constituents according to the category list above. New words may be typed in by the user or read in from a text file.</Paragraph> <Paragraph position="13"> It is assumed that the lexical entries to be added will belong to open lexical classes: nouns, verbs and adjectives. To distinguish between these three classes is not an impossible task for a non-linguist, especially if an appropriate instruction is provided. Adjectives are classified in a way similar to nouns, e.g. 
nasal, bronchial - denoting body part; stuffy, runny (as in stuffy nose) - denoting symptom and attribute of body part.</Paragraph> <Paragraph position="14"> A user-adapted classification of verbs is more difficult to achieve. In our preliminary model, the user is presented questions combined with example patterns, for instance: 'Does the verb take an object, like investigate the effects?' 'Does it also take a complement with a certain preposition like: shield the patient from house dust?' 'What preposition is required?' If the verb in question turns out to be transitive, a further question is asked about the semantic category of the typical object, according to the standard category list. The specification of verbs takes more time than the one of nouns and adjectives. However, the need of introducing new verbs is usually not as great as the need of adding new nouns.</Paragraph> <Paragraph position="15"> generation of target equivalents The present program covers the most frequent types of compounds found in the corpus. After having filtered out the most frequent verbs (auxiliaries, modals) and items belonging to closed lexical classes (pronouns, articles, prepositions etc.), we first investigated word frequencies, and then the (unfiltered) environment of about the thirty most frequent words. On this basis, we could state that the most usual compounds containing the most frequent nouns (disregarding names of chemical substances) display the following patterns: i. (attribute, concrete)-allergen-(adj)-syndrom viii. (medicine/allergen)-medical treatment/body function-time period: steroid treatment period, house dust inhalation period The procedure for compound interpretation is based on a Prolog formalization of the most frequent patterns. The following program fragment shows what the format for basal lexical entries looks like and how the interpretation rules are constructed.</Paragraph> <Paragraph position="16"> lex([asthma],n,[syndrom], ... 
).</Paragraph> <Paragraph position="17"> lex([dust],n,[allergen], ...).</Paragraph> <Paragraph position="18"> lex([pollen],n,[allergen], ...).</Paragraph> <Paragraph position="19"> lex([patient],n,[patient], ...).</Paragraph> <Paragraph position="20"> lex([season],n,[time_period], ...). lex([steroid],n,[medicine], ...).</Paragraph> <Paragraph position="21"> lex([grass],n,[concrete], ...). /* pattern: grass pollen */ tlex([G,P],mean([G,P]),n,[allergen],F1,F2,F3) :- lex([G],_,[concrete], ...), lex([P],n,[allergen], ...).</Paragraph> <Paragraph position="22"> The rules simply specify the default interpretation of a sequence of nouns and deliver a semantic representation coded in 'Machinese English', as shown in the outprints below: :- interpret([house, dust, asthma, patient]) mean([patient, suffering_from, syndrom([asthma, because_of, allergen([house, dust])])]) grammatical category: n semantic category: [patient] :- interpret([house, dust, inhalation]) mean([inhalation, of_object, allergen([house, dust])]) grammatical category: n semantic category: [body_function] The Machinese representations can without difficulties be matched with the appropriate target morphosyntactic patterns. For example, the semantic representation of grass pollen asthma patient becomes associated with the Polish pattern (simplified notation): patient, suffering_from, allergy-ins prep pollen-acc flower-adj grass-gen In a similar way, the program disambiguates ragweed asthma and childhood asthma when translating into French. Still, certain ambiguities may remain: the present program can, for example, not decide whether grass pollen asthma should be translated into French as asthme induit par pollen des graminées or par pollen de l'herbe. 
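The default-pattern interpretation shown in the outprints can be approximated by the following Python sketch. It is not the Prolog program itself: the pattern table, the left-to-right folding strategy and the tuple encoding of the meaning terms are assumptions made for illustration, while the relation names because_of and suffering_from and the category labels are taken from the text.

```python
# Toy lexicon: word to semantic category (labels follow the category list).
LEXICON = {
    "house": "concrete", "grass": "concrete",
    "dust": "allergen", "pollen": "allergen", "ragweed": "allergen",
    "asthma": "syndrom", "patient": "patient",
    "childhood": "time_period",
}

# Default readings for (modifier category, head category). The relation
# names come from the outprints; the table layout is an assumption.
PATTERNS = {
    ("concrete", "allergen"): None,          # fuse: grass pollen, house dust
    ("allergen", "syndrom"): "because_of",
    ("syndrom", "patient"): "suffering_from",
}


def interpret(words):
    """Fold a noun sequence left to right into a nested meaning term."""
    first = words[0]
    meaning, category = [first], LEXICON[first]
    for word in words[1:]:
        head_cat = LEXICON[word]
        key = (category, head_cat)
        if key not in PATTERNS:
            return None                      # no default reading known
        relation = PATTERNS[key]
        if relation is None:
            meaning = meaning + [word]       # flat allergen name
        else:
            meaning = [word, relation, (category, meaning)]
        category = head_cat
    return (category, meaning)
```

Under these assumptions interpret(["house", "dust", "asthma", "patient"]) builds a term of the same shape as the first outprint, and childhood asthma is left without a reading rather than being treated as allergen-induced, because the sketch has no pattern for a time period modifying a syndrome. The sketch stops at the semantic representation; lexical choices in the target language, such as the two French renderings of grass pollen asthma, are outside its scope. 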
The decision has to be made by the user.</Paragraph> <Paragraph position="23"> Translation of frequent compounds of the type noun+past participle (allergen-shielded, allergen-tested, placebo-controlled) is handled in a way similar to the one used in the prototype program when translating compounds like university teacher and university killer. The semantic category of the noun is compared with the semantic specification of the valency of the verb stem and the noun is associated with the most probable verbal argument. Thus, allergen shielded room is interpreted as 'a room shielded from allergen', while allergen tested skin gets the reading 'skin tested by exposure to allergen'.</Paragraph> </Section> </Section> </Paper>