<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1042"> <Title>Statistical Morphological Disambiguation for Agglutinative Languages</Title> <Section position="4" start_page="285" end_page="286" type="metho"> <SectionTitle> 3 Turkish </SectionTitle>
<Paragraph position="0"> Turkish is a free constituent order language. The order of the constituents may change freely according to the discourse context, and the syntactic role of the constituents is indicated by their case marking.</Paragraph>
<Paragraph position="1"> Turkish has agglutinative morphology with productive inflectional and derivational suffixations. The number of word forms one can derive from a Turkish root form may be in the millions (Hankamer, 1989).</Paragraph>
<Paragraph position="2"> Hence, the number of distinct word forms, i.e., the vocabulary size, can be very large. For instance, Table 1 shows the size of the vocabulary for 1 and 10 million word corpora of Turkish, collected from on-line newspapers. This large vocabulary is the reason for a serious data sparseness problem and also significantly increases the number of parameters to be estimated, even for a bigram language model. The size of the vocabulary also causes the perplexity to be large (although this is not an issue in morphological disambiguation). Table 2 lists the training and test set perplexities of trigram language models trained on 1 and 10 million word corpora of Turkish.</Paragraph>
<Paragraph position="3"> For each corpus, the first column is the perplexity for the data the language model is trained on, and the second column is the perplexity for previously unseen test data of 1 million words. Another major reason for the high perplexity of Turkish is the high percentage of out-of-vocabulary words (words in the test data which did not occur in the training data); this results from the productivity of the word formation process.</Paragraph>
<Paragraph position="4"> [Table 2 caption: training and test set perplexities of word-based trigram language models.]</Paragraph>
<Paragraph position="5"> The issue of large vocabulary brought in by productive inflectional and derivational processes also makes tagset design an important issue. In languages like English, the number of POS tags that can be assigned to the words in a text is rather limited (less than 100; some researchers have used larger tag sets to refine granularity, but these are still small compared to Turkish). But such a finite tagset approach for languages like Turkish may lead to an inevitable loss of information. The reason for this is that the morphological features of intermediate derivations can contain markers for syntactic relationships. Thus, leaving out this information within a fixed-tagset scheme may prevent crucial syntactic information from being represented (Oflazer et al., 1999). For example, it is not clear what POS tag should be assigned to the word sağlamlaştırmak (below) without losing any information: the category of the root (Adjective), the final category of the word as a whole (Noun), or one of the intermediate categories (Verb).1 (Footnote 1: The morphological features other than the POSs are: +Become: become verb, +Caus: causative verb, +Pos: positive polarity, +Inf: marker that derives an infinitive form from a verb, +A3sg: 3sg number-person agreement, +Pnon: no possessive agreement, and +Nom: nominative case. ^DBs mark derivation boundaries.)</Paragraph>
<Paragraph position="6"> sağlamlaştırmak: sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos^DB+Noun+Inf+A3sg+Pnon+Nom</Paragraph>
<Paragraph position="7"> to cause (something) to become strong / to strengthen/fortify (something). Ignoring the fact that the root word is an adjective may sever any relationships with an adverbial modifier modifying the root. Thus, instead of a simple POS tag, we use the full morphological analyses of the words, represented as a combination of features (including any derivational markers), as their morphosyntactic tags. For instance, in the example above, we would use everything including the root form as the morphosyntactic tag.</Paragraph>
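The contrast between a conventional reduced tag and the full morphosyntactic tag used here can be made concrete with a small sketch. The following Python fragment is illustrative only: the analysis-string format (features joined with + and derivations marked with ^DB) follows the representation described in this paper, but the helper functions and their names are our own assumptions, not part of the authors' implementation.

```python
# Minimal sketch: a fixed-tagset view vs. the full morphosyntactic tag.
# The analysis string is the example word sağlamlaştırmak; the helper
# functions are hypothetical and only illustrate the distinction.

FULL_ANALYSIS = "sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos^DB+Noun+Inf+A3sg+Pnon+Nom"

def final_pos(analysis: str) -> str:
    """Fixed-tagset view: keep only the part of speech of the final derivation."""
    last_group = analysis.split("^DB+")[-1]   # "Noun+Inf+A3sg+Pnon+Nom"
    return last_group.split("+")[0]           # "Noun"

def full_tag(analysis: str) -> str:
    """The approach taken here: the entire analysis, root included, is the tag."""
    return analysis

print(final_pos(FULL_ANALYSIS))  # Noun -- loses the Adj root and the Verb derivations
print(full_tag(FULL_ANALYSIS))   # the full analysis is retained as the tag
```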
<Paragraph position="8"> In order to alleviate the data sparseness problem, we break down the full tags. We represent each word as a sequence of inflectional groups (IGs hereafter), separated by ^DBs denoting derivation boundaries, as described by Oflazer (1999). Thus a morphological parse would be represented in the following general form:</Paragraph>
<Paragraph position="9"> root+IG1^DB+IG2^DB+...^DB+IGn [Table 3: Numbers of tags and IGs]</Paragraph>
<Paragraph position="10"> where IGi denotes the relevant inflectional features of the i-th inflectional group, including the part-of-speech for the root or any of the derived forms.</Paragraph>
<Paragraph position="11"> For example, the infinitive form sağlamlaştırmak given above would be represented with the adjective reading of the root sağlam and the following 4 IGs: 1. Adj 2. Verb+Become 3. Verb+Caus+Pos 4. Noun+Inf+A3sg+Pnon+Nom. Table 3 provides a comparison of the number of distinct full morphosyntactic tags (ignoring the root words in this case) and IGs, generatively possible and observed in a corpus of 1M words (considering all ambiguities). One can see that the number of observed full tags ignoring the root words is very high, significantly higher than the numbers quoted for Czech by Hajič and Hladká (1998).</Paragraph> </Section>
<Section position="5" start_page="286" end_page="287" type="metho"> <SectionTitle> 4 Statistical Morphological Disambiguation </SectionTitle>
<Paragraph position="0"> Morphological disambiguation is the problem of finding the corresponding sequence of morphological parses (including the root), T = t_1^n = t_1, t_2, ..., t_n, given a sequence of words W = w_1^n = w_1, w_2, ..., w_n. Our approach is to model the distribution of morphological parses given the words, using a hidden Markov model, and then to seek the T that maximizes P(T | W):</Paragraph>
<Paragraph position="1"> argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W)</Paragraph>
<Paragraph position="2"> The term P(W) is a constant for all choices of T, and can thus be ignored when choosing the most probable T. We can further simplify the problem using the assumption that words are independent of each other given their tags. In Turkish we can use the additional simplification that P(w_i | t_i) = 1, since t_i includes the root form and all the morphosyntactic features and thus uniquely determines the word form.2 (Footnote 2: That is, we assume that there is no morphological generation ambiguity. This is almost always true. There are a few word forms like gelirkene and burda, which have the same morphological parses as the word forms gelirken and burada, respectively, but are pronounced (and written) slightly differently. These are rarely seen in written texts, and can thus be ignored.) Since P(W | T) = ∏_{i=1}^{n} P(w_i | t_i) = 1 under these assumptions, we have:</Paragraph>
<Paragraph position="3"> argmax_T P(T | W) = argmax_T P(T)</Paragraph>
<Paragraph position="4"> Simplifying further with the trigram tag model, we get:</Paragraph>
<Paragraph position="5"> argmax_T P(T) = argmax_T ∏_{i=1}^{n} P(t_i | t_{i-2}, t_{i-1})</Paragraph>
<Paragraph position="6"> If we consider morphological analyses as a sequence of root and IGs, each parse t_i can be represented as (r_i, IG_{i,1}, ..., IG_{i,n_i}), where n_i is the number of IGs in the i-th word.3 (Footnote 3: In our training and test data, the number of IGs in a word form is 1.6 on the average; therefore, n_i is usually 1 or 2. We have occasionally seen word forms with 5 or 6 inflectional groups.) This representation changes the problem as shown in Figure 1, where the chain rule has been used to factor out the individual components.</Paragraph>
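To make the (r_i, IG_{i,1}, ..., IG_{i,n_i}) decomposition above concrete, here is a minimal Python sketch that splits one morphological analysis at its derivation boundaries. The analysis-string format and the helper name are assumptions made for illustration; they are not taken from the authors' system.

```python
# Illustrative sketch: splitting a full morphological analysis of the form
# root+IG1^DB+IG2^DB+...^DB+IGn into its root and inflectional groups (IGs).
# The exact analysis-string format is an assumption for this example.

from typing import List, Tuple

def split_into_igs(analysis: str) -> Tuple[str, List[str]]:
    """Return (root, [IG1, IG2, ..., IGn]) for one morphological parse."""
    chunks = analysis.split("^DB+")           # split at derivation boundaries
    root, first_ig = chunks[0].split("+", 1)  # "sağlam+Adj" -> root and first IG
    return root, [first_ig] + chunks[1:]

if __name__ == "__main__":
    parse = "sağlam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos^DB+Noun+Inf+A3sg+Pnon+Nom"
    root, igs = split_into_igs(parse)
    print(root)   # sağlam
    print(igs)    # ['Adj', 'Verb+Become', 'Verb+Caus+Pos', 'Noun+Inf+A3sg+Pnon+Nom']
```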
<Paragraph position="7"> This formulation still suffers from the data sparseness problem. To alleviate this, we make the following simplifying assumptions: 1. A root word depends only on the roots of the previous words, and is independent of the inflectional and derivational productions on them:</Paragraph>
<Paragraph position="8"> P(r_i | (r_{i-2}, IG_{i-2,1}, ..., IG_{i-2,n_{i-2}}), (r_{i-1}, IG_{i-1,1}, ..., IG_{i-1,n_{i-1}})) = P(r_i | r_{i-2}, r_{i-1})</Paragraph>
<Paragraph position="9"> The intention here is that this will be useful in the disambiguation of the root word when a given form has morphological parses with different root words. So, for instance, for disambiguating the surface form adam with the following two parses:</Paragraph>
<Paragraph position="11"> 1. adam+Noun+A3sg+Pnon+Nom ('man') 2. ada+Noun+A3sg+P1sg+Nom ('my island')</Paragraph>
<Paragraph position="12"> in the noun phrase kırmızı kazaklı adam (the man with a red sweater), only the roots (along with the part-of-speech of the root) of the previous words will be used to select the right root.</Paragraph>
<Paragraph position="13"> Note that the selection of the root has some impact on what the next IG in the word is, but we assume that IGs are determined by the syntactic context and not by the root.</Paragraph>
<Paragraph position="14"> 2. An interesting observation that we can make about Turkish is that, when a word is considered as a sequence of IGs, syntactic relations are between the last IG of a (dependent) word and some (including the last) IG of the (head) word on the right (with minor exceptions) (Oflazer, 1999).</Paragraph>
<Paragraph position="15"> Based on these assumptions and the equation in Figure 1, we define three models, all of which are based on word level trigrams: 1. Model 1: The presence of IGs in a word only depends on the final IGs of the previous words.</Paragraph>
<Paragraph position="16"> This model ignores any morphotactical relation between an IG and any previous IG in the same word.</Paragraph>
<Paragraph position="17"> 2. Model 2: The presence of IGs in a word only depends on the final IGs of the previous words and the previous IG in the same word. In this model, we consider morphotactical relations and assume that an IG (except the first one) in a word form has some dependency on the previous IG. Given that on the average a word has about 1.6 IGs, IG bigrams should be sufficient.</Paragraph>
<Paragraph position="18"> 3. Model 3: This is the same as Model 2, except that the dependence on the previous IG in a word is assumed to be independent of the dependence on the final IGs of the previous words. This allows the formulation to separate the contributions of the morphotactics and the syntax.</Paragraph>
<Paragraph position="19"> The equations for these models, when tags are decomposed into inflectional groups, are shown in Figure 2. We have also built a baseline model based on the standard definition of the tagging problem in Equation 2. For the baseline, we have assumed that the part of the morphological analysis after the root word is the tag in the conventional sense (and the assumption that P(w_i | t_i) = 1 no longer holds).</Paragraph> </Section> </Paper>
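The following Python sketch, appended here for illustration only, shows one way the Model 1 factorization described above could be scored for a single word: the root is conditioned on the two previous roots (assumption 1), and each IG is conditioned on the final IGs of the two previous words. The probability tables, the smoothing floor, and all names are assumptions made for this sketch; the authors' actual formulations are the equations in Figure 2.

```python
# Hypothetical sketch of a Model-1-style trigram score for the i-th word.
# p_root and p_ig stand in for conditional probability tables estimated from
# training data; they are placeholders, not the paper's implementation.
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]

def model1_word_score(
    root: str,                      # r_i
    igs: List[str],                 # IG_{i,1}, ..., IG_{i,n_i}
    prev2: Tuple[str, str],         # (root, final IG) of word i-2
    prev1: Tuple[str, str],         # (root, final IG) of word i-1
    p_root: Dict[Trigram, float],   # P(r_i | r_{i-2}, r_{i-1})
    p_ig: Dict[Trigram, float],     # P(IG | final IG of i-2, final IG of i-1)
) -> float:
    """Score one candidate parse: a root trigram times a product over its IGs."""
    score = p_root.get((prev2[0], prev1[0], root), 1e-9)  # crude floor, not real smoothing
    for ig in igs:
        # Each IG depends only on the final IGs of the two previous words (Model 1).
        score *= p_ig.get((prev2[1], prev1[1], ig), 1e-9)
    return score
```

In a full disambiguator this per-word score would be combined over the whole sentence, for example with a Viterbi-style search over the candidate parses of each word, but that machinery is omitted from the sketch.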