<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0409"> <Title>Integrating Morphology with Multi-word Expression Processing in Turkish</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Multi-word expressions in Turkish </SectionTitle> <Paragraph position="0"> Turkish is a Ural-Altaic language, having agglutinative word structures with productive inflectional and derivational processes. Most derivational phenomena take place within a word form, but there are certain derivations involving partial or full reduplications that are best considered under the notion of multi-word expressions.</Paragraph> <Paragraph position="1"> Turkish word forms consist of morphemes concatenated to a root morpheme or to other morphemes, much like beads on a string. Except in a very few cases, the surface realizations of the morphemes are conditioned by various morphophonemic processes such as vowel harmony and vowel and consonant elisions. The morphotactics of word forms can be quite complex when multiple derivations are involved. For instance, the derived mod-</Paragraph> <Paragraph position="3"> This word starts out with an adjective root and, after five derivations, ends up with the final part-of-speech adjective, which determines its role in the sentence.</Paragraph> <Paragraph position="4"> Turkish employs multi-word expressions in essentially four different forms: 1. Lexicalized Collocations where all components of the collocations are fixed, 2. Semi-lexicalized Collocations where some components of the collocation are fixed and some can vary via inflectional and derivational morphology processes and the (lexical) semantics of the collocation is not compositional, 3. Non-lexicalized Collocations where the collocation is mediated by a morphosyntactic pattern of duplicated and/or contrasting components, hence the name non-lexicalized, and 4. 
Multi-word Named-entities which are multi-word proper names for persons, organizations, places, etc.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Lexicalized Collocations </SectionTitle> <Paragraph position="0"> Under the notion of lexicalized collocations, we consider the usual fixed multi-word expressions whose resulting syntactic function and semantics are not readily predictable from the structure and the morphological properties of the constituents.</Paragraph> <Paragraph position="1"> Footnote 1: Literally, '(the thing existing) at the time we caused (something) to become strong'. Obviously this is not a word that one would use every day. Turkish words (excluding non-inflecting frequent words such as conjunctions, clitics, etc.) found in typical text average about 10 letters in length.</Paragraph> <Paragraph position="2"> Footnote 2: Please refer to the list of morphological features given in Appendix A for the semantics of some of the non-obvious symbols used here.</Paragraph> <Paragraph position="3"> Here are some examples of the multi-word expressions that we consider under this grouping:3;4</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Semi-lexicalized Collocations </SectionTitle> <Paragraph position="0"> Multi-word expressions that are considered under this heading are compound and support verb formations where there are two or more lexical items, the last of which is a verb or is a derivation involving a verb. 
These are formed by a lexically adjacent direct or oblique object and a verb, which, for the purposes of syntactic analysis, may be considered a single lexical item: e.g., saygı dur- (literally 'to stand (in) respect', meaning 'to pay respect'), kafayı ye- (literally 'to eat the head', meaning 'to get mentally deranged'), etc.5 Even though the other components can themselves be inflected, they can be assumed to be fixed for the purposes of the collocation, and the collocation assumes its morphosyntactic features from the last verb, which itself may undergo any morphological derivation or inflection process. For instance in rest of the suffixes for any inflectional and derivational markers. kafayı_ye+Verb...</Paragraph> <Paragraph position="1"> 'get mentally deranged' (literally 'eat the head') the first part of the collocation, the accusative marked noun kafayı, is the fixed part and the part starting with the verb ye- is the variable part, which may be inflected and/or derived in myriad ways.</Paragraph> <Paragraph position="2"> For example, the following are some possible forms of the collocation: kafayı yedim 'I got mentally deranged'; kafayı yiyeceklerdi 'they were about to get mentally deranged'; kafayı yiyenler 'those who got mentally deranged'; kafayı yediği 'the fact that (s/he) got mentally deranged'. Under certain circumstances, the fixed part may actually vary in a rather controlled manner subject to certain morphosyntactic constraints, as in the idiomatic verb: (4) kafa(yı) çek- kafa(head)+Noun+A3sg+Pnon+Acc çek(pull)+Verb...</Paragraph> <Paragraph position="3"> kafa_çek+Verb...</Paragraph> <Paragraph position="4"> 'consume alcohol' (but literally 'to pull the head') (5) kafaları çek-</Paragraph> <Paragraph position="6"> 'consume alcohol' (but literally 'to pull the heads') where the fixed part can be in the nominative or the accusative case, and if it is in the accusative case, it may be marked plural, in which case the verb has to have some kind of plural agreement (i.e., first, second or third person plural), 
but no possessive agreement markers are allowed.</Paragraph> <Paragraph position="7"> In their simplest forms, it is sufficient to recognize a sequence of tokens, one of whose morphological analyses matches the corresponding pattern, and then coalesce these into a single multi-word expression representation. However, some or all variants of these and similar semi-lexicalized collocations present further complications brought about by the relative freeness of the constituent order in Turkish, and by the interaction of various clitics with such collocations.6 When such multi-word expressions are coalesced into a single morphological entity, the ambiguity in morphological interpretation is reduced, as we see in the following example: '(he) continued' (literally 'made a continuation') Here, when this semi-lexicalized collocation is recognized, other morphological interpretations of the components (marked with a * above) can safely be removed, contributing to overall morphological ambiguity reduction.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Non-lexicalized Collocations </SectionTitle> <Paragraph position="0"> Turkish employs quite a number of non-lexicalized collocations where the sentential role of the collocation has (almost) nothing to do with the parts-of-speech and the morphological features of the individual forms involved. Almost all of these collocations involve partial or full duplications of the forms involved and can actually be viewed as morphological derivational processes mediated by reduplication across multiple tokens.</Paragraph> <Paragraph position="1"> The morphological feature representations of such multi-word expressions follow one of the patterns:</Paragraph> <Paragraph position="3"> where ω is the duplicated string comprising the root, its part-of-speech and possibly some additional morphological features encoded by any suffixes. 
X and Y are further duplicated or contrasted morphological patterns and Z is a certain clitic token. In duplications of type 4, it is possible that ω1 is different from ω2.</Paragraph> <Paragraph position="4"> Footnote 6: The question and the emphasis clitics, which are written as separate tokens, can occasionally intervene between the components of a semi-lexicalized collocation. We omit the details of these due to space restrictions.</Paragraph> <Paragraph position="5"> Below we present a list of the more interesting non-lexicalized expressions along with some examples and issues.</Paragraph> <Paragraph position="6"> When a noun appears in duplicate following the first pattern above, the collocation behaves like a manner adverb, modifying a verb usually to the right. Although this pattern does not necessarily occur with every possible noun, it may occur with many (countable) nouns without much of a further semantic restriction. Such a sequence has to be coalesced into a representation indicating this derivational process, as we see below.</Paragraph> <Paragraph position="8"> 'house by house' (literally 'house house') When an adjective appears in duplicate, the collocation behaves like a manner adverb (with the semantics of -ly adverbs in English), modifying a verb usually to the right. Thus such a sequence has to be coalesced into a representation indicating this derivational process.</Paragraph> <Paragraph position="10"> 'slowly' (literally 'slow slow') This kind of duplication can also occur when the adjective is a derived adjective as in</Paragraph> <Paragraph position="12"> 'rapidly' (literally 'with-speed with-speed') Turkish has a fairly large set of onomatopoeic words which always appear in duplicate and function as manner adverbs. The words by themselves have no other usage or literal meaning, and mildly resemble sounds produced by natural or artificial objects. 
In these cases, the root word is almost always reduplicated, though it need not be; both words should, however, be of the part-of-speech category +Dup that we use to mark such roots.</Paragraph> <Paragraph position="13"> (10) harıl hurul (ω1+X ω2+X) harıl+Dup hurul+Dup harıl_hurul+Adverb+Resemble 'making rough noises' (no literal meaning) Duplicated verbs with optative mood and third person singular agreement function as manner adverbs, indicating that another verb is executed in a manner indicated by the duplicated verb:</Paragraph> <Paragraph position="15"> 'by running' (literally 'let him run let him run') Duplicated verbs in aorist mood with third person agreement and first positive then negative polarity function as temporal adverbs with the semantics 'as soon as one has verbed'</Paragraph> <Paragraph position="17"> 'as soon as (he) sleeps' (literally '(he) sleeps (he) does not sleep') It should be noted that for most of the non-lexicalized collocations involving verbs (like (11) and (12) above), the verbal portion before the inflectional mood marking can have additional derivational markers, and all such markers have to duplicate. 'as soon as (he) fortifies (causes to become strong)' Another interesting point is that non-lexicalized collocations can interact with semi-lexicalized collocations since they both usually involve verbs. 
For instance, when the verb of the semi-lexicalized collocation example in (5) is duplicated in the form of the non-lexicalized collocation in (12), we get (14) kafaları çeker çekmez In this case, first the non-lexicalized collocation has to be coalesced into (15) kafaları çek+Verb+Pos^DB+Adverb+AsSoonAs and then the semi-lexicalized collocation kicks in, to give (16) kafa_çek+Verb+Pos^DB+Adverb+AsSoonAs ('as soon as (we/you/they) get drunk') Finally, the following collocation of adjectival forms, involving duplication and a question clitic, is an example of the last type of non-lexicalized collocation.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Named-entities </SectionTitle> <Paragraph position="0"> Another class of multi-word expressions that we process is the class of multi-word named-entities denoting persons, organizations and locations. We essentially treat these just like the semi-lexicalized collocations discussed earlier, in that, when such named-entities are used in text, all but the last component are fixed and the last component will usually undergo certain morphological processes demanded by the syntactic context, as in (18) Türkiye Büyük Millet Meclisi'nde ....7 Here, the last component is case marked and this represents a case marking on the whole named-entity. 
We package this as</Paragraph> <Paragraph position="2"> To recognize these named entities we use a rather simple approach employing a rather extensive database of person, organization and place names, developed in the context of a previous project, instead of using a more sophisticated named-entity extraction scheme.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Structure of the Multi-word </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Expression Processor </SectionTitle> <Paragraph position="0"> Our multi-word expression processor is a multi-stage system as depicted in Figure 1. The first component is a standard tokenizer which splits input text into constituent tokens. These then go into a wide-coverage morphological analyzer (Oflazer, 1994) implemented using Xerox finite state technology (Karttunen et al., 1997), which generates, for all tokens, all possible morphological analyses. This module also performs unknown-word processing by postulating possible noun roots and then trying to parse the rest of a word as a sequence of possible Turkish suffixes. The morphological analysis stage also performs a very conservative non-statistical morphological disambiguation to remove some very unlikely parses based on unambiguous contexts. Figure 2 shows a sample Turkish text that comes out of morphological processing, about to go into multi-word expression extraction.</Paragraph> <Paragraph position="1"> Footnote 7: 'In the Turkish Grand National Assembly.'</Paragraph> <Paragraph position="2"> The multi-word expression extraction processor has three stages with the output of one stage feeding into the next stage: locations. The reason semi-lexicalized collocations are handled last is that any duplicate verb formations have to be processed before compound verbs are combined with their lexicalized complements (cf. 
examples (14) to (16) above).</Paragraph> <Paragraph position="3"> The output of the multi-word expression extraction processor for the relevant segments in Figure 2 is given in Figure 3.</Paragraph> <Paragraph position="4"> The multi-word expression extraction processor has been implemented in Perl. The rule bases for the three stages are maintained separately and then compiled offline into regular expressions which are then used by Perl at runtime.</Paragraph> <Paragraph position="5"> Table 1 presents statistics on the current rule base of our multi-word expression extraction processor: For named entity recognition, we use a list of about</Paragraph> </Section> </Section> </Paper>