<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1037"> <Title>Memory-Based Morphological Analysis</Title> <Section position="3" start_page="285" end_page="285" type="metho"> <SectionTitle> 2 Dutch Morphology </SectionTitle> <Paragraph position="0"> The processes of Dutch morphology include inflection, derivation, and compounding. Inflection of verbs, adjectives, and nouns is mostly achieved by suffixation, but a circumfix also occurs in the Dutch past participle (e.g. ge+werk+t as the past participle of the verb werken, to work). Irregular inflectional morphology is due to relics of ablaut (vowel change) and to suppletion (mixing of different roots in inflectional paradigms). Processes of derivation in Dutch morphology occur by means of prefixation and suffixation. Derivation can change the syntactic class of wordforms. Compounding in Dutch is concatenative (as in German and Scandinavian languages): words can be strung together almost without limit, with only a few morphotactic constraints, e.g., rechtsinformaticatoepassingen (applications of computer science in Law). In general, a complex wordform inherits its syntactic properties from its right-most part (the head). Several spelling changes occur: apart from the closed set of spelling changes due to irregular morphology, a number of spelling changes are predictably due to morphological context. The spelling of long vowels varies between double and single (e.g. ik loop, I run, versus wij loop+en, we run); the spelling of root-final consonants can be doubled (e.g. ik stop, I stop, versus wij stopp+en, we stop); there is variation between s and z and between f and v (e.g. huis, house, versus huizen, houses). Finally, between the parts of a compound, a linking morpheme may appear (e.g. staat+s+loterij, state lottery).</Paragraph> <Paragraph position="1"> For a detailed discussion of morphological phenomena in Dutch, see De Haas and Trommelen (1993). 
Previous approaches to Dutch morphological analysis have been based on finite-state transducers (e.g., XEROX's morphological analyzer), or on parsing with context-free word grammars interleaved with exploration of possible spelling changes (e.g. Heemskerk and van Heuven (1993); or see Heemskerk (1993) for a probabilistic variant).</Paragraph> </Section> <Section position="4" start_page="285" end_page="286" type="metho"> <SectionTitle> 3 Applying memory-based learning to morphological analysis </SectionTitle> <Paragraph position="0"> Most linguistic problems can be seen as context-sensitive mappings from one representation to another (e.g., from text to speech; from a sequence of spelling words to a parse tree; from a parse tree to logical form; from source language to target language, etc.) (Daelemans, 1995). This is also the case for morphological analysis. Memory-based learning algorithms can learn mappings (classifications) if a sufficient number of instances of these mappings is presented to them.</Paragraph> <Paragraph position="1"> We drew our instances from the CELEX lexical data base (Baayen et al., 1993). CELEX contains a large lexical data base of Dutch wordforms, and features a full morphological analysis for 247,415 of them. We took each wordform and its associated analysis, and created task instances using a windowing method (Sejnowski and Rosenberg, 1987). Windowing transforms each wordform into as many instances as it has letters. Each instance focuses on one letter, and includes a fixed number of left and right neighbor letters, chosen here to be five. Consequently, each instance spans eleven letters, which is also the average word length in the CELEX data base. 
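As a concrete illustration of this windowing scheme (one instance per letter, with five letters of left and right context), the following sketch generates fixed-width instances from a wordform. The padding symbol and the tuple representation are our assumptions for illustration, not necessarily those of the original MBMA implementation.

```python
def window_instances(word, context=5, pad="_"):
    """Turn a wordform into one fixed-width instance per letter.

    Each instance is the focus letter plus `context` letters of left and
    right context, padded with `pad` at the word boundaries (the padding
    symbol is an assumption made here for illustration).
    """
    padded = pad * context + word + pad * context
    instances = []
    for i in range(len(word)):
        # window of 2*context + 1 letters centered on letter i of the word
        instances.append(tuple(padded[i:i + 2 * context + 1]))
    return instances

# The example word from Table 1 yields 15 instances of 11 letters each.
instances = window_instances("abnormaliteiten")
```

In the full task each such instance would additionally carry a class label such as "A+Da" for the first letter of abnormaliteiten.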
Moreover, we estimated from exploratory data analysis that this context would contain enough information to allow for adequate disambiguation.</Paragraph> <Paragraph position="2"> To illustrate the construction of instances, Table 1 displays the 15 instances derived from the Dutch example word abnormaliteiten (abnormalities) and their associated classes. The class of the first instance is &quot;A+Da&quot;, which says that (i) the morpheme starting in a is an adjective (&quot;A&quot;)1, and (ii) an a was deleted at the end (&quot;+Da&quot;). The coding thus tells that the first morpheme is the adjective abnormaal. [Footnote 1: the syntactic tags include adjective (A), quantifier/numeral (Q), verb (V), article (D), pronoun (O), adverb (B), preposition (P), conjunction (C), interjection (J), and abbreviation (X).]</Paragraph> <Paragraph position="3"> The second morpheme, iteit, has class &quot;N_A,&quot;. This complex tag indicates that when iteit attaches right to an adjective (encoded by &quot;A,&quot;), the new combination becomes a noun (&quot;N_&quot;). Finally, the third morpheme is en, which is a plural inflection (labeled &quot;m&quot; in CELEX). This way we generated an instance base of 2,727,462 instances.</Paragraph> <Paragraph position="4"> Within these instances, 2422 different class labels occur. The most frequently occurring class label is &quot;0&quot;, occurring in 72.5% of all instances. The three most frequent non-null labels are &quot;N&quot; (6.9%), &quot;V&quot; (3.6%), and &quot;m&quot; (1.6%). 
Most class labels combine a syntactic or inflectional tag with a spelling change, and generally have a low frequency.</Paragraph> <Paragraph position="5"> When a wordform is listed in CELEX as having more than one possible morphological labeling (e.g., a morpheme may be N or V, the inflection -en may be plural for nouns or infinitive for verbs), these labels are joined into ambiguous classes (&quot;N/V&quot;) and the first generated example is labeled with this ambiguous class.</Paragraph> <Paragraph position="6"> Ambiguity in syntactic and inflectional tags occurs in 3.6% of all morphemes in our CELEX data.</Paragraph> <Paragraph position="7"> The memory-based learning algorithm used within MBMA is IB1-IG (Daelemans and Van den Bosch, 1992; Daelemans et al., 1997), an extension of IB1 (Aha et al., 1991). IB1-IG constructs a data base of instances in memory during learning. New instances are classified by IB1-IG by matching them to all instances in the instance base, and calculating with each match the distance between the new instance X and the memory instance Y, Δ(X,Y) = Σ_{i=1}^{n} W(f_i) δ(x_i,y_i), where W(f_i) is the weight of the ith feature, and δ(x_i,y_i) is the distance between the values of the ith feature in instances X and Y. When the values of the instance features are symbolic, as with our linguistic tasks, the simple overlap distance function δ is used: δ(x_i,y_i) = 0 if x_i = y_i, else 1. The (most frequently occurring) classification of the memory instance Y with the smallest Δ(X,Y) is then taken as the classification of X.</Paragraph> <Paragraph position="8"> The weighting function W(f_i) computes for each feature, over the full instance base, its information gain, a function from information theory; cf. Quinlan (1986). In short, the information gain of a feature expresses its relative importance compared to the other features in performing the mapping from input to classification. 
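The IB1-IG distance and the information-gain weighting can be sketched as follows. This is a minimal illustration of the two formulas, not the paper's implementation; in particular, ties among equidistant neighbors are broken here by first occurrence, a simplification of the most-frequent-class rule described above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, i):
    """Information gain of feature i (cf. Quinlan, 1986): class entropy
    minus the value-weighted entropy after splitting on feature i."""
    split = {}
    for x, y in zip(instances, labels):
        split.setdefault(x[i], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

def classify(memory, labels, weights, x):
    """1-NN classification with the weighted overlap distance
    Delta(X,Y) = sum_i W(f_i) * delta(x_i,y_i); delta is 0 on a match, 1 otherwise."""
    def distance(y):
        return sum(w for w, a, b in zip(weights, x, y) if a != b)
    nearest = min(range(len(memory)), key=lambda j: distance(memory[j]))
    return labels[nearest]
```

With information gain as the weights, instances matching on informative features come out closer than instances matching on uninformative ones, which is exactly the effect described in the text.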
When information gain is used in the similarity function, instances that match on important features are regarded as more alike than instances that match on unimportant features.</Paragraph> <Paragraph position="9"> In our experiments, we are primarily interested in the generalization accuracy of trained models, i.e., the ability of these models to use their accumulated knowledge to classify new instances that were not in the training material. A method that gives a good estimate of the generalization performance of an algorithm on a given instance base is 10-fold cross-validation (Weiss and Kulikowski, 1991). This method generates, on the basis of an instance base, 10 successive partitionings into a training set (90%) and a test set (10%), resulting in 10 experiments.</Paragraph> </Section> <Section position="5" start_page="286" end_page="289" type="metho"> <SectionTitle> 4 Experiments: MBMA of Dutch wordforms </SectionTitle> <Paragraph position="0"> As described, we performed 10-fold cross-validation experiments in an experimental matrix in which MBMA is applied to the full instance base, using a context width of five left and right context letters. We structure the presentation of the experimental outcomes as follows. First, we give the generalization accuracies on test instances and test words obtained in the experiments, including measurements of generalization accuracy when class labels are interpreted at lower levels of granularity. While the latter measures give a rough idea of system accuracy, more insight is provided by two additional analyses: precision and recall rates of morphemes are given, and we then provide prediction accuracies of syntactic word classes. 
Finally, we provide estimations on free-text accuracies.</Paragraph> <Section position="1" start_page="286" end_page="287" type="sub_section"> <SectionTitle> 4.1 Generalization accuracies </SectionTitle> <Paragraph position="0"> The percentages of correctly classified test instances are displayed in the top line of Table 2, showing an error on test instances of about 4.1% (which is markedly better than the baseline error of 27.5% when guessing the most frequent class &quot;0&quot;), which translates into an error at the word level of about 35%. The output of MBMA can also be viewed at lower levels of granularity.</Paragraph> <Paragraph position="1"> We have analyzed MBMA's output at the three following lower granularity levels: 1. Only decide, per letter, whether a segmentation occurs at that letter, and if so, whether it marks the start of a derivational stem or an inflection. This can be derived straightforwardly from the full-task class labeling.</Paragraph> <Paragraph position="2"> 2. Only decide, per letter, whether a segmentation occurs at that letter. Again, this can be derived straightforwardly. This task implements segmentation of a complex word form into morphemes.</Paragraph> <Paragraph position="3"> 3. Only check whether the desired spelling change is predicted correctly. Because of the irregularity of many spelling changes this is a hard task.</Paragraph> <Paragraph position="4"> [Table caption fragment: &quot;...analyzed as [abnormaal]A[iteit]N_A,[en]m.&quot;]</Paragraph> <Paragraph position="5"> The results from these analyses are displayed in Table 2 under the top line. First, Table 2 shows that the error on the lower-granularity tasks that exclude detailed syntactic labeling and spelling-change prediction is about 1.1% on test instances, and roughly 10% on test words. Second, making the distinction between inflections and other morphemes is almost as easy as just determining whether there is a boundary at all. 
Third, the relatively low score on correctly predicted spelling changes, 80.95%, indicates that it is particularly hard to generalize from stored instances of spelling changes to new ones. This is in accordance with the common linguistic view on spelling-change exceptions. When, for instance, a past-tense form of a verb involves a real exception (e.g., the past tense of Dutch brengen, to bring, is bracht), this exception typically generalizes only to a few other forms of the same verb (brachten, gebracht) and not to any word that is not derived from the same stem; the memory-based learning approach is not aware of such constraints. A post-processing step that checks whether the proposed morphemes are also listed in a morpheme lexicon would correct many of these errors, but has not been included here.</Paragraph> </Section> <Section position="2" start_page="287" end_page="288" type="sub_section"> <SectionTitle> 4.2 Precision and recall of morphemes </SectionTitle> <Paragraph position="0"> Precision is the percentage of morphemes predicted by MBMA that are actually morphemes in the target analysis; recall is the percentage of morphemes in the target analysis that are also predicted by MBMA. Precision and recall of morphemes can again be computed at different levels of granularity. Table 3 displays these computed values. The results show that both precision and recall of fully-labeled morphemes within test words are relatively low. It comes as no surprise that the level of 84% recalled fully labeled morphemes, including spelling information, is not much higher than the level of 80% correctly recalled spelling changes (see Table 2). 
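The morpheme-level precision and recall defined above can be sketched as a multiset comparison between predicted and target analyses. The (morpheme, label) pair representation is our assumption for illustration; at a coarser granularity the labels would simply be dropped.

```python
from collections import Counter

def precision_recall(predicted, target):
    """Morpheme-level precision and recall.

    `predicted` and `target` are lists of (morpheme, label) pairs per word;
    this pairing is an illustrative representation, not MBMA's internal one.
    Precision = correctly predicted morphemes / all predicted morphemes;
    recall = correctly predicted morphemes / all target morphemes.
    """
    pred, gold = Counter(predicted), Counter(target)
    overlap = sum((pred & gold).values())  # multiset intersection
    return overlap / sum(pred.values()), overlap / sum(gold.values())

# One of three morphemes mislabeled: precision and recall are both 2/3.
p, r = precision_recall(
    [("abnormaal", "A"), ("iteit", "N"), ("en", "m")],
    [("abnormaal", "A"), ("iteit", "N_A,"), ("en", "m")],
)
```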
When word-class information, type of inflection, and spelling changes are discarded, precision and recall of basic segment types become quite accurate: over 94%.</Paragraph> <Paragraph position="1"> [Table 2 caption fragment: &quot;...and words, with standard deviations (+) of MBMA applied to full Dutch morphological analysis and three lower-granularity tasks derived from MBMA's full output. The example word abnormaliteiten is shown according to the different labeling granularities, and only its single spelling change at the bottom line).&quot;]</Paragraph> <Paragraph position="2"> [Table 3 caption fragment: &quot;...rived from the classification output of MBMA applied to the full task and two lower-granularity variations of Dutch morphological analysis, using a context width of five left and right letters.&quot;]</Paragraph> </Section> <Section position="3" start_page="288" end_page="288" type="sub_section"> <SectionTitle> 4.3 Predicting the syntactic class of wordforms </SectionTitle> <Paragraph position="0"> Since MBMA predicts the syntactic label of morphemes, and since complex Dutch wordforms generally inherit their syntactic properties from their right-most morpheme, MBMA's syntactic labeling can be used to predict the syntactic class of the full wordform. When accurate, this functionality can be an asset in handling unknown words in part-of-speech tagging systems. The results, displayed in Table 4, show that about 91.2% of all test words are assigned the exact tag they also have in CELEX (including ambiguous tags such as &quot;N/V&quot;; 1.3% of wordforms in the CELEX dataset have an ambiguous syntactic tag). When MBMA's output is also considered correct if it predicts at least one of the possible tags listed in CELEX, the accuracy on test words is 91.6%. 
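The head-inheritance rule described above (the wordform takes its class from its right-most derivational morpheme) can be sketched as follows. The set of inflection labels is a hypothetical subset of CELEX's codes, chosen only to make the example from Table 1 work; the complex-tag convention follows the paper's "N_A," example, where the part before the underscore is the resulting class.

```python
def wordform_class(morpheme_labels):
    """Predict a wordform's syntactic class from the label of its right-most
    non-inflectional morpheme (the head). For a complex tag such as "N_A,"
    the resulting class is the part before the underscore."""
    INFLECTION_LABELS = {"m"}  # e.g. "m" = plural; hypothetical subset of CELEX codes
    for label in reversed(morpheme_labels):
        if label in INFLECTION_LABELS:
            continue  # inflections do not determine the word class
        return label.split("_")[0]
    return None  # only inflectional material found

# abnormaliteiten = [abnormaal]A [iteit]N_A, [en]m  ->  noun
```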
These accuracies compare favorably with a related (yet strictly incomparable) approach that predicts the word class from the (ambiguous) part-of-speech tags of the two surrounding words, the first letter, and the final three letters of Dutch words, viz. 71.6% on unknown words in texts (Daelemans et al., 1996a).</Paragraph> <Paragraph position="2"> [Table 4 caption fragment: &quot;...standard deviations) of MBMA on syntactic classes of test words. The top line displays exact matches with CELEX tags; the bottom line also includes predictions that are among CELEX alternatives.&quot;]</Paragraph> </Section> <Section position="4" start_page="288" end_page="289" type="sub_section"> <SectionTitle> 4.4 Free text estimation </SectionTitle> <Paragraph position="0"> Although some of the above-mentioned accuracy results, especially the precision and recall of fully-labeled morphemes, seem not very high, they should be seen in the context of the test they are derived from: they stem from held-out portions of dictionary words. In texts sampled from real-life usage, words are typically shorter and morphologically less complex, and a relatively small set of words re-occurs very often.</Paragraph> <Paragraph position="1"> It is therefore relevant for our study to have an estimate of the performance of MBMA on real texts. We generate such an estimate following these considerations: New, unseen text is bound to contain a lot of words that are in the 245,000-word CELEX data base, but also some number of unknown words. The morphological analyses of known words are simply retrieved by the memory-based learner from memory. Due to some ambiguity in the class labeling in the data base itself, retrieval accuracy will be somewhat below 100%. The morphological analyses of unknown words are assumed to be as accurate as was measured in the above-mentioned experiments: they can be taken to be of the same type as the dictionary words in the 10% held-out test sets of the 10-fold cross-validation experiments. 
CELEX bases its wordform frequency information on word counts made on the 42,380,000-word Dutch INL corpus. 5.06% of these wordform tokens occur only once. We assume that this can be extrapolated to the estimate that in real texts, 5% of the words do not occur in the 245,000 words of the CELEX data base.</Paragraph> <Paragraph position="2"> Therefore, a sensible estimate of the accuracy of memory-based learners on real text is a weighted sum comprising 95% of the reproduction accuracy (i.e., the accuracy on the training set itself) and 5% of the generalization accuracy as reported earlier.</Paragraph> <Paragraph position="3"> Table 5 summarizes the estimated generalization accuracy results computed on the results of MBMA. First, the percentage of correct instances is estimated to be above 98% for the full task; in terms of words, it is estimated that 84% of all words are fully correctly analyzed. When lower-granularity classification tasks are discerned, accuracies on words are estimated to exceed 96% (on instances, fewer than 1% errors are estimated). Moreover, precision and recall of morphemes on the full task are estimated to be above 93%. A considerable surplus is obtained by memory retrieval in the estimated percentage of correct spelling changes: 93%. Finally, the prediction of the syntactic tags of wordforms would be about 97% correct according to this estimate.</Paragraph> <Paragraph position="4"> We briefly note that Heemskerk (1993) reports a correct word score of 92% on free-text test material yielded by the probabilistic morphological analyzer MORPA. MORPA segments wordforms, decides whether a morpheme is a stem, an affix or an inflection, detects spelling changes, and assigns a syntactic tag to the wordform. We have not made a conversion of our output to Heemskerk's (1993) output format. 
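The weighted free-text estimate described above (95% of tokens retrieved from memory, 5% handled by generalization) amounts to the following one-line computation. The input accuracies in the example are illustrative: a 0.99 reproduction accuracy stands in for "somewhat below 100%", and 0.65 is the word-level generalization accuracy implied by the roughly 35% word error reported earlier.

```python
def free_text_estimate(reproduction_acc, generalization_acc, unknown_rate=0.05):
    """Estimated accuracy on running text: known words (1 - unknown_rate of
    tokens) are retrieved at the reproduction accuracy; unknown words are
    analyzed at the held-out generalization accuracy. The 5% default follows
    the hapax-based estimate from the INL corpus counts."""
    return (1.0 - unknown_rate) * reproduction_acc + unknown_rate * generalization_acc

# e.g. free_text_estimate(0.99, 0.65) with illustrative inputs
```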
Moreover, a proper comparison would demand the same test data, but we believe that the 92% corresponds roughly to our MBMA estimates of 97.2% correct syntactic tags, 93.1% correct spelling changes, and 96.7% correctly segmented words.</Paragraph> <Paragraph position="5"> [Table 5 header fragment: Estimate | correct instances, full task | correct words, full task]</Paragraph> </Section> </Section> </Paper>