<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0109"> <Title>Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The WordFrame Algorithm </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Motivation </SectionTitle>
<Paragraph position="0"> The supervised morphological learner presented in Yarowsky and Wicentowski (2000) modeled lemmatization as a word-final stem change plus a suffix taken from a (possibly empty) list of potential suffixes. Though effective for suffixation, this end-of-string (EOS) based model cannot model other morphological phenomena, such as prefixation.</Paragraph>
<Paragraph position="1"> By including a pre-specified list of prefixes, we can extend the EOS model to handle simple prefixation: for each inflection, an analysis is performed on the original string, plus on each substring resulting from removing exactly one matching prefix taken from the list of prefixes. While effective for some simple prefixal morphologies, this extension cannot model word-initial stem changes at the point of prefixation. In contrast, the WordFrame (WF) algorithm can isolate a potential prefix and model any potential point-of-prefixation stem changes directly, without pre-specified lists of prefixes.</Paragraph>
<Paragraph position="2"> The EOS model also fails to capture the word-internal vowel changes found in many languages.</Paragraph>
<Paragraph position="3"> The WF model directly models stem-internal vowel changes in order to learn higher-quality, less sparse transformation rules.</Paragraph>
<Paragraph position="4"> Table 1 (columns: training pair, EOS analysis, WF analysis): training pairs that are incorrectly analyzed by the EOS algorithm, which results in learning rules with low productivity; the WF algorithm is able to identify the productive ue → o stem-internal vowel change.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Required and Optional Resources </SectionTitle>
<Paragraph position="0"> a. Training data of the form <inflection,root> is required for the WordFrame algorithm. Ideally, this data should be high-quality and noise-free, but the algorithm is robust to noise, which allows one to use lower-quality pairs extracted using unsupervised techniques.</Paragraph>
<Paragraph position="1"> b. Pre-specified lists of prefixes and suffixes can be incorporated, but are not required.</Paragraph>
<Paragraph position="2"> c. Precision can be improved (at the expense of coverage) by providing a list of potential roots extracted from a dictionary or large corpus.</Paragraph>
<Paragraph position="3"> d. In order to allow for word-internal vowel changes, the WordFrame model requires a list of the vowels of the language. (These resources are summarized in the sketch following this list.)</Paragraph> </Section>
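The resources above can be thought of as a single configuration bundle passed to the learner. Below is a minimal Python sketch of such a bundle; the class and field names are illustrative assumptions and are not taken from the original system.

from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

@dataclass
class WordFrameResources:
    """Inputs consumed by the WordFrame learner (field names are illustrative only)."""
    training_pairs: List[Tuple[str, str]]   # (a) <inflection, root> pairs, possibly noisy
    vowels: Set[str]                         # (d) vowel inventory of the language
    prefixes: Optional[Set[str]] = None      # (b) optional pre-specified prefix list
    suffixes: Optional[Set[str]] = None      # (b) optional pre-specified suffix list
    root_list: Optional[Set[str]] = None     # (c) optional root wordlist, trades coverage for precision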
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Formal Presentation </SectionTitle>
<Paragraph position="0"> The WordFrame model is constructed explicitly as an extension of the end-of-string model proposed by Yarowsky and Wicentowski (2000); as such, we first give a brief presentation of that model, then introduce the WordFrame model.</Paragraph>
<Paragraph position="1"> In the discussion below, if affix lists are not explicitly provided, they are assumed to contain the single element ε (the empty string).</Paragraph>
<Paragraph position="2"> 3.3.1 The end-of-string model. The end-of-string model makes use of two optional externally provided sets: a set of acceptable suffixes, Ψ′s, and a set of &quot;canonical root endings&quot;, Ψs. The inclusion of a list of canonical root endings is motivated by languages where verb roots can end in only a limited number of ways (e.g. -er, -ir and -re in French).</Paragraph>
<Paragraph position="3"> From inflection-root training pairs, a deterministic analysis is made by removing the longest matching suffix (ψ′s ∈ Ψ′s) from the inflection, removing the longest matching canonical ending (ψs ∈ Ψs) from the root, and removing the longest common initial substring (γ) from both words. The remaining strings represent the word-final stem change (δ′s → δs) necessary to transform the inflection (γδ′sψ′s) into the root (γδsψs). The word-final stem changes are stored in a hierarchically-smoothed suffix trie representing P(δ′s → δs | γδ′s).</Paragraph>
<Paragraph position="4"> A simple extension allows the EOS model to handle purely concatenative prefixation: the analysis begins by removing the longest matching prefix taken from a given set of prefixes (ψ′p ∈ Ψ′p), then continues as above. This changes the inflection to ψ′pγδ′sψ′s, and leaves the root as γδsψs. (See Table 2 for an overview of this notation.) Given a previously unseen inflection, one finds the root that maximizes P(γδsψs | ψ′pγδ′sψ′s). By making strong independence assumptions and some approximations, and assuming that all prefixes and suffixes are equally likely, this is equivalent to: (Footnote 1: Note that we are using a slightly different, but equivalent, notation to that used in Yarowsky and Wicentowski (2000); simply, we use ψ′s rather than s, and δ′s → δs rather than a → b.)</Paragraph>
<Paragraph position="6"/>
<Paragraph> Table 2 (columns: prefix, point-of-prefixation change, primary common substring, vowel change, secondary common substring, point-of-suffixation change, suffix/ending): the notation for the end-of-string model, the EOS model extended to allow for simple prefixation, and the WordFrame model. If lists of prefixes, suffixes and endings are not specified, the prefix, suffix and ending are set to ε.</Paragraph>
<Paragraph> The WordFrame model fills two major gaps in the EOS model: the inability to model prefixation without a list of provided prefixes, and the inability to model stem-internal vowel shifts.</Paragraph>
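Before the WordFrame extension is formalized, the end-of-string analysis just described can be made concrete. The Python function below is only an illustration of the procedure (longest matching suffix, longest matching canonical ending, longest common initial substring), not the authors' implementation; the helper names and the default empty affix lists are assumptions.

def eos_analyze(inflection, root, suffixes=("",), endings=("",)):
    """Decompose an inflection-root pair under the end-of-string model.

    Returns (gamma, stem change on the inflection side, stem change on the
    root side, removed suffix, removed canonical ending).  Affix lists
    default to the empty string, as in the text."""
    # Remove the longest matching suffix from the inflection.
    suf = max((s for s in suffixes if inflection.endswith(s)), key=len, default="")
    infl = inflection[:len(inflection) - len(suf)]
    # Remove the longest matching canonical ending from the root.
    end = max((e for e in endings if root.endswith(e)), key=len, default="")
    rt = root[:len(root) - len(end)]
    # Strip the longest common initial substring (gamma) from both words.
    i = 0
    while i < min(len(infl), len(rt)) and infl[i] == rt[i]:
        i += 1
    # What remains is the word-final stem change.
    return infl[:i], infl[i:], rt[i:], suf, end

# e.g. eos_analyze("hopped", "hop", suffixes=("", "ed"))
# -> ("hop", "p", "", "ed", ""): gamma = "hop", stem change p -> (empty), suffix "ed".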
<Paragraph position="7"> While not required, the WordFrame model does allow for the inclusion of lists of prefixes, and when provided, it can automatically discover the point-of-prefixation stem change, δ′p → δp. When a list of prefixes is not provided, the word-initial stem change will model both the prefix and the stem change.</Paragraph>
<Paragraph position="8"> Formally, this requires the inclusion of the point-of-prefixation stem change in the notation used for the EOS model. When presented with an inflection-root pair, the longest common substring of the inflection and root, γ, is assumed to be the stem. The string preceding the stem is the prefix and point-of-prefixation stem change, ψ′pδ′p; the string following the stem is the point-of-suffixation stem change and suffix, δ′sψ′s. Combining these parts, the inflection can be represented as ψ′pδ′pγδ′sψ′s, and the root as δpγδsψs.</Paragraph>
<Paragraph position="9"> In addition, the WordFrame model allows for a single word-internal vowel change within the stem.</Paragraph>
<Paragraph position="10"> To accommodate this, the longest common substring of the inflection and root, γ, is allowed to be split in a single location to allow the vowel change δ′v → δv, where δ′v and δv are taken from a predetermined list of vowels for the language. (Footnote 2: If one wishes to model arbitrary internal changes, this &quot;vowel&quot; list could be made to include every letter in the alphabet; results are not presented for this configuration.) The portions of the stem located before and after the vowel change are now γp and γs, respectively.</Paragraph>
<Paragraph position="11"> Both δ′v and δv may contain more than one vowel, thereby allowing vowel changes such as ee → e.</Paragraph>
<Paragraph position="12"> However, as presented here, the WF model does not allow for the insertion of vowels into the stem where there were no vowels previously; more formally, both δ′v and δv must contain at least one vowel, or they both must be ε. Though this restriction can be removed, initial results (not presented here) indicated a significant drop in accuracy when entire vowel clusters could be removed or inserted. In addition, the vowel change must be internal to the stem, and cannot be located at the boundary of the stem; formally, unless both δ′v and δv are ε, both portions of the split stem (γp and γs) must contain at least one letter. This prevents confusion between &quot;stem-internal&quot; vowel changes and stem changes at the point of affixation.</Paragraph>
<Paragraph position="13"> As with the EOS model, a deterministic analysis is made from inflection-root training pairs. If provided, the longest matching prefix and suffix are removed from the inflection, and the longest matching canonical ending is removed from the root. (Footnote 3: A canonical prefix is not included in the model because we knew of no language in which this occurred; introducing it to the model would be straightforward.) The remaining string must then be analyzed to find the longest common substring with at most one vowel change, which we call the WordFrame.</Paragraph>
<Paragraph position="14"> The WordFrame (γpδ′vγs, γpδvγs) is defined to be the longest common substring with at most one internal vowel cluster (V* → V*) transformation.</Paragraph>
<Paragraph position="15"> Should there be multiple &quot;longest&quot; substrings, the substring closest to the start of the inflection is chosen. (Footnote 4: This places a bias in favor of end-of-string changes and is motivated by the number of languages which are suffixal and the relatively few that are not; this could be adjusted for prefixal languages.) In practice, there is rarely more than one such &quot;longest&quot; substring.</Paragraph>
<Paragraph position="16"> The remaining strings at the start and end of the common substring form the point-of-prefixation and point-of-suffixation stem changes.</Paragraph>
<Paragraph position="17"> The final representation of the inflection-root pair in the WF model is shown in Table 2.</Paragraph>
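The deterministic WordFrame analysis can likewise be sketched as a brute-force search over decompositions. The code below is only an illustration of the definition above: it assumes affixes and canonical endings have already been stripped, uses a toy English vowel inventory, and enumerates all splits rather than using an efficient alignment. Function and variable names are hypothetical, not the authors' implementation.

import itertools

def is_vowel_cluster(s, vowels):
    """True when s is a non-empty string made entirely of vowels."""
    return len(s) > 0 and all(ch in vowels for ch in s)

def wordframe_analyze(infl, root, vowels=frozenset("aeiou")):
    """Brute-force WordFrame decomposition of an affix-stripped pair.

    infl = dp_i + gp + v_i + gs + ds_i     root = dp_r + gp + v_r + gs + ds_r
    where (gp, v_i/v_r, gs) is the longest common substring with at most one
    internal vowel-cluster change, and the leftovers are the point-of-prefixation
    and point-of-suffixation stem changes."""
    best = None
    for a, b, c, d in itertools.combinations_with_replacement(range(len(infl) + 1), 4):
        gp, v_i, gs = infl[a:b], infl[b:c], infl[c:d]
        for a2, b2, c2, d2 in itertools.combinations_with_replacement(range(len(root) + 1), 4):
            if root[a2:b2] != gp or root[c2:d2] != gs:
                continue
            v_r = root[b2:c2]
            if v_i == "" and v_r == "":
                pass                                   # no internal vowel change
            elif (is_vowel_cluster(v_i, vowels) and is_vowel_cluster(v_r, vowels)
                  and gp != "" and gs != ""):
                pass                                   # change is properly stem-internal
            else:
                continue
            score = (len(gp) + len(gs), -a)            # longest frame, then nearest word start
            if best is None or score > best[0]:
                best = (score, (infl[:a], root[:a2],   # point-of-prefixation change
                                gp, v_i, v_r, gs,      # the WordFrame itself
                                infl[d:], root[d2:]))  # point-of-suffixation change
    return best[1] if best else None

# e.g. wordframe_analyze("stunk", "stink")
# -> ("", "", "st", "u", "i", "nk", "", ""): vowel change u -> i inside the stem.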
<Paragraph position="18"> Given an unseen inflection, one finds the root that maximizes P(δpγpδvγsδsψs | ψ′pδ′pγpδ′vγsδ′sψ′s). If we make the simplifying assumption that all prefixes, suffixes and endings are equally likely, and remove the longest possible affixes deterministically, this is equivalent to:</Paragraph>
<Paragraph/>
<Paragraph> Table 3: the EOS analysis yields non-productive rules such as gestunk-stink; the WF analysis captures the productive Spanish vowel change ue → o, the German prefix ge-, and the English vowel changes e → ee and a → i.</Paragraph>
<Paragraph position="22"> This can be expanded using the chain rule. As before, the point-of-suffixation probabilities are implicitly conditioned on the applicability of the change to δ′pγpδ′vγsδ′s, and are taken from a suffix trie created during training. The point-of-prefixation probabilities are implicitly conditioned on the applicability of the change to δ′pγpδ′vγs, i.e. once δ′s has been removed, and are taken from an analogous prefix trie. The vowel change probability is conditioned on the applicability of the change to γpδ′vγs. In the current implementation, this is approximated using the conditional probability of the vowel change P(δv | δ′v), without regard to the local context. This is a major weakness in the current system and one that will be addressed in future work.</Paragraph>
<Paragraph position="23"> The WordFrame model's ability to capture stem-internal vowel changes allows for the proper analysis of the Spanish examples from Table 1, and also allows for the analysis of prefixes without the use of a pre-specified list of prefixes, as shown in Table 3.</Paragraph>
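Under the independence assumptions above, scoring one candidate decomposition of an unseen inflection reduces to a product of three conditional probabilities. The sketch below shows that product only; the probability functions stand in for the smoothed suffix trie, prefix trie and vowel-change table built during training, and all names and signatures are hypothetical.

from typing import Callable, NamedTuple, Tuple

class Analysis(NamedTuple):
    dp_i: str   # point-of-prefixation change, inflection side
    dp_r: str   # point-of-prefixation change, root side
    gp: str     # primary common substring
    v_i: str    # internal vowel cluster, inflection side
    v_r: str    # internal vowel cluster, root side
    gs: str     # secondary common substring
    ds_i: str   # point-of-suffixation change, inflection side
    ds_r: str   # point-of-suffixation change, root side

def score_analysis(a: Analysis,
                   p_suffix_change: Callable[[Tuple[str, str], str], float],
                   p_prefix_change: Callable[[Tuple[str, str], str], float],
                   p_vowel_change: Callable[[str, str], float]) -> float:
    """Product of the three conditional probabilities described above."""
    # P(point-of-suffixation change | remaining inflection material)
    suffix_term = p_suffix_change((a.ds_i, a.ds_r),
                                  a.dp_i + a.gp + a.v_i + a.gs + a.ds_i)
    # P(point-of-prefixation change | the same material once ds_i is removed)
    prefix_term = p_prefix_change((a.dp_i, a.dp_r),
                                  a.dp_i + a.gp + a.v_i + a.gs)
    # P(v_r | v_i), approximated without local context (the weakness noted above)
    vowel_term = p_vowel_change(a.v_r, a.v_i)
    return suffix_term * prefix_term * vowel_term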
</Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Evaluation </SectionTitle>
<Paragraph position="0"> All of the experimental results presented here were obtained using 10-fold cross-validation on the training data. The majority of the training data used here was obtained from web sources, although some has been hand-entered or scanned from printed materials and then hand-corrected. All of the data used were inflected verbs; there was no derivational morphology in this evaluation. (Footnote 5: Examples of derivational morphology, as well as nominal and adjectival inflectional morphology, are excluded from this presentation due to the lack of available training data for more than a small number of well-studied languages.) Unless otherwise specified, all results are system accuracies at 100% coverage; Section 5.3 addresses precision at lower coverages.</Paragraph>
<Paragraph> Table 4: analysis of the German examples listed in Table 3 (point-of-prefixation change ge → ε; point-of-suffixation change ε → l).</Paragraph>
<Paragraph position="1"> Space limits the number of results that can be presented here, since most of the evaluations have been carried out in each of the 32 languages. Therefore, in comparing the models, results will be shown for only a representative subset of the languages. When appropriate, a median or average for all languages will also be given. Table 10 presents the final results for all languages.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 End-of-string vs. WordFrame </SectionTitle>
<Paragraph position="0"> The most striking difference in performance between the EOS model and the WordFrame model comes from the evaluation of languages with prefixal morphologies. The EOS model cannot handle prefixation without pre-specified lists of prefixes, so when these are omitted, the WF model drastically outperforms the EOS model (Table 5).</Paragraph>
<Paragraph position="1"> Table 5: accuracy of each model without and with pre-specified lists of affixes (if available for that language).</Paragraph>
<Paragraph position="2"> Table 5 also shows that the simple EOS model can sometimes significantly outperform the WF model (e.g. in Spanish). Making things more difficult, predicting which model will be more successful for a particular language and set of training data may not be possible, as illustrated by the fact that the EOS model performed better for Spanish, but the closely-related Portuguese was better handled by the WF model. Additionally, as illustrated by the Portuguese example, it is not always beneficial to include lists of affixes, making selection of the model problematic.</Paragraph>
<Paragraph position="3"> Lists of prefixes and suffixes were not available for all languages. However, for the 25 languages where such lists were available, the WordFrame model performed as well as or better than the EOS model on only 17 (68%). Evidence suggests that this occurs when the affix lists have missing prefixes or suffixes. Since these lists were extracted from printed grammars, such gaps were unavoidable.</Paragraph>
<Paragraph position="4"> Regardless of whether or not affix lists were included, the WordFrame model only outperformed the EOS model for just over half the languages. An examination of the output of the WF model suggests that the relative parity in performance of the two models is due to the poor estimation of the vowel change probability, which is approximated without regard to contextual clues.</Paragraph> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 WordFrame + EOS </SectionTitle>
<Paragraph position="0"> One of our goals in designing the WordFrame model was to reduce or eliminate the dependence on externally supplied affix lists. However, the results presented in Section 4.1 indicate that the WF model outperforms the EOS model for just over half (17/32) of the evaluated languages, even when affix lists are included.</Paragraph>
<Paragraph position="1"> Predicting which model worked better for a particular language proved difficult, so we created a new analyzer by combining our WordFrame model with the end-of-string model. For each inflection, the root which received the highest probability under an equally-weighted linear combination of the two models was selected as the final analysis.</Paragraph>
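A minimal sketch of the combination step just described: each component model assigns a probability to every candidate root, and the root with the highest equally-weighted average is returned. The function and argument names are hypothetical, and candidate generation is assumed to happen elsewhere.

def combined_root(inflection, candidate_roots, models, weights=None):
    """Select the root whose equally-weighted average probability across the
    component analyzers is highest.  `models` is a sequence of callables
    prob(inflection, root) -> float (e.g. the EOS and WordFrame scorers);
    names and signatures here are illustrative only."""
    weights = weights or [1.0 / len(models)] * len(models)
    def score(root):
        return sum(w * m(inflection, root) for w, m in zip(weights, models))
    return max(candidate_roots, key=score)

# The same mixture extends to four components (EOS and WF, each with and
# without affix lists), as used for the WordFrame+EOS classifier below.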
<Paragraph position="2"> This new combination analyzer outperformed both stand-alone models for 21 of the 25 languages, with significant overall accuracy improvements, as shown in Table 6(a).</Paragraph>
<Paragraph position="3"> Table 6: the individual models vs. the combined model, (a) with and (b) without affix lists.</Paragraph>
<Paragraph position="4"> When affix lists are available, combining the WordFrame model and the end-of-string model yielded very similar results: the combined model outperformed either model on its own for 23 of the 25 languages. Of the two remaining languages, the stand-alone WF model outperformed the combined model by just one example out of 5197 in Danish, and by just 4 examples out of 9497 in Tagalog. As before, the combined model showed significant accuracy increases over either stand-alone model, as shown in Table 6(b).</Paragraph>
<Paragraph position="5"> Finally, we built the WordFrame+EOS classifier by combining all four individual classifiers (EOS with and without affix lists, and WF with and without affix lists) using a simple equally-weighted linear combination. This is motivated by our initial observation that using affix lists does not always improve overall accuracy. Cumulative results are shown below in Table 7, and results for each individual language are shown in Table 10.</Paragraph>
<Paragraph position="6"> Table 7: cumulative results of the combination of the combined models in the 25 languages for which affix lists were available.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Robustness to Noise </SectionTitle>
<Paragraph position="0"> The WordFrame model was designed as an alternative to the end-of-string model. In Yarowsky and Wicentowski (2000), the end-of-string model is trained from inflection-root pairs acquired through unsupervised methods. None of those previously presented unsupervised models yielded high accuracies on their own, so it was important that the end-of-string model was robust enough to learn string transduction rules even in the presence of large amounts of noise.</Paragraph>
<Paragraph position="1"> In order for the WF+EOS model to be an adequate replacement for the end-of-string model, it must also be robust to noise. To test this, we first ran the WF+EOS model as before on all of the data using 10-fold cross-validation. Then, we introduced noise by randomly assigning a certain percentage of the inflections to the roots of other inflections. For example, the correct pair menaced-menace became the incorrect pair menaced-move. The results of introducing this noise are presented in Table 9; the model maintains high accuracy in the presence of noise.</Paragraph>
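The noise-injection procedure can be sketched directly: a chosen fraction of training pairs has its root replaced (not supplemented) by the root of another randomly chosen pair. The helper below is an illustration of that description; the function name and the fixed random seed are assumptions, not part of the original experimental code.

import random

def corrupt_pairs(pairs, noise_rate, seed=0):
    """Replace (not supplement) a fraction of the pairs' roots with roots
    drawn from other pairs, e.g. menaced-menace -> menaced-move.
    `pairs` is a list of (inflection, root); 0.0 <= noise_rate <= 1.0."""
    rng = random.Random(seed)
    pairs = list(pairs)
    roots = [r for _, r in pairs]
    n_noisy = int(round(noise_rate * len(pairs)))
    for i in rng.sample(range(len(pairs)), n_noisy):
        wrong = rng.choice(roots)
        # re-draw if we happened to pick the pair's own (correct) root
        while wrong == pairs[i][1] and len(set(roots)) > 1:
            wrong = rng.choice(roots)
        pairs[i] = (pairs[i][0], wrong)
    return pairs

# e.g. corrupt_pairs(training_pairs, 0.50) mirrors the 50% noise condition described below.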
<Paragraph position="2"> Above, up to 75% of the inflections in the training data have been assigned incorrect roots.</Paragraph>
<Paragraph position="3"> As one might expect, the effect of introducing noise is particularly pronounced for highly inflected languages such as Estonian, as well as with the vowel-harmony morphology found in Turkish. (Footnote 7: All of the data is inflectional verb morphology, making the Turkish task substantially easier than most other attempts at modeling Turkish morphology.)</Paragraph>
<Paragraph position="4"> However, languages with minimal inflection (English) or a fairly regular inflection space (French) show much less pronounced drops in accuracy as noise increases.</Paragraph>
<Paragraph position="6"> Table 9: introducing noise yields only a 5% reduction in performance even when 50% of the training samples are replaced with noise.</Paragraph>
<Paragraph position="7"> It is important to point out that the incorrect pairs were not added in addition to the correct pairs; rather, they replaced the correct pairs. For example, the Estonian training data was comprised of 5932 inflection-root pairs. When testing at 50% noise, there were only 2966 correct training pairs and 2966 incorrect pairs. This means that the real size of the training data was also reduced, further lowering accuracy and making the model's effective robustness to noise more impressive.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Regular vs. Irregular Inflections </SectionTitle>
<Paragraph position="0"> For 13 of the languages evaluated, the inflections were classified as either regular, irregular, or semi-regular. As an example, the English pair jumped-jump was classified as regular, the pair hopped-hop was semi-regular (because of the doubling of the final -p), and the pair threw-throw was labeled irregular. Table 8 shows the accuracy of the WF+EOS model in each of the three categories, as well as for all data in total. (Footnote 9: The difference with respect to Table 10 is due to the fact that some of the inflection-root pairs were not labeled; the &quot;All&quot; column of Table 8 reflects only labeled inflections.) As expected, the WF+EOS model performs very well on regular inflections and reasonably well on the semi-regular inflections for most languages.</Paragraph>
<Paragraph position="1"> The performance on the irregular verbs, though clearly not as good as on the regular or semi-regular verbs, was surprisingly good, most notably in French, and to a lesser extent, Spanish and Italian.</Paragraph>
<Paragraph position="2"> This is due in large part to the fact that our test set included many irregular verbs which shared the same irregularity. For example, in French, the inflection-root pair prit-prendre is irregular; however, the pairs apprit-apprendre and comprit-comprendre both follow the same irregular rule. The inclusion of just one of these three pairs in the training data will allow the WF+EOS model to correctly find the root form of the other two. Our French test set included many examples of this, including roots ending in -tenir, -venir, -mettre, and -duire.</Paragraph>
<Paragraph position="3"> For most languages, however, the performance on the irregular set was not that good.
We propose no new solutions to handling irregular verb forms, but suggest using non-string-based techniques, such as those presented in Yarowsky and Wicentowski (2000), Baroni et al. (2002), and Wicentowski (2002).</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Accuracy, Precision and Coverage </SectionTitle>
<Paragraph position="0"> All of the previous results assumed that each inflection must be aligned to exactly one root, though one can improve precision by relaxing this constraint.</Paragraph>
<Paragraph position="1"> The WF+EOS model transforms an inflection into a new string which we can compare against a dictionary, wordlist, or large corpus. In determining the final inflection-root alignment, we can downweight, or even throw away, all proposed roots which are not found in such a wordlist. While this will adversely affect coverage, precision may be more important in early iterations of co-training.</Paragraph>
<Paragraph position="2"> Given a sufficiently large wordlist, such a weighting scheme cannot discard correct analyses. In addition, a large majority of the incorrectly analyzed inflections are proposed roots which are not actually words. By excluding all proposed roots which were not found in a broad-coverage wordlist (available for 19 languages), median coverage fell to 97.4%, but median precision increased from 97.5% to 99.1%.</Paragraph> </Section> </Section> </Paper>