<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0606">
  <Title>Unsupervised discovery of morphologically related words based on orthographic and semantic similarity</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> ... process corpus data for an analysis to be performed by a human morphologist, or as the first step of a fully automated morphological learning program, to be followed, for example, by a rule induction procedure that extracts correspondence patterns from paired forms. See the last section of this paper for further discussion of possible applications.</Paragraph>
    <Paragraph position="1"> We tested our model with German and English input. Our results indicate that the algorithm is able to identify a number of pairs related by a variety of derivational and inflectional processes with a remarkably high precision rate. The algorithm is also discovering morphological relationships (such as German plural formation with umlaut) that would probably be harder to discover using affix-based approaches. null The remainder of the paper is organized as follows: In section 2, we shortly review related work.</Paragraph>
    <Paragraph position="2"> In section 3, we present our model. In section 4, we discuss the results of experiments with German and English input. Finally, in section 5 we summarize our main results, we sketch possible directions that our current work could take, and we discuss some potential uses for the output of our algorithm.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Related work
</SectionTitle>
    <Paragraph position="0"> For space reason, we discuss here only three approaches that are closely related to ours. See, for example, Goldsmith (2001) for a very different (possibly complementary) approach, and for a review of other relevant work.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Jacquemin (1997)
</SectionTitle>
      <Paragraph position="0"> Jacquemin (1997) presents a model that automatically extracts morphologically related forms from a list of English two-word medical terms and a corpus from the medical domain.</Paragraph>
      <Paragraph position="1"> The algorithm looks for correspondences between two-word terms and orthographically similar pairs of words that are adjacent in the corpus. For example, the list contains the term artificial ventilation, and the corpus contains the phrase artificially ventilated. Jacquemin's algorithm thus postulates the (paired) morphological analyses artificial ventilation and artificial-ly ventilat-ed.</Paragraph>
      <Paragraph position="2"> Similar words, for the purposes of this pairing procedure, are simply words that share a common left substring (with constraints that we do not discuss here).</Paragraph>
      <Paragraph position="3"> Jacquemin's procedure then builds upon these early steps by clustering together sets that follow the same patterns, and using these larger classes to look for spurious analyses. Finally, the algorithm tries to cluster classes that are related by similar, rather than identical, suffixation patterns. Again, we will not describe here how this is accomplished.</Paragraph>
      <Paragraph position="4"> Our basic idea is related to that of Jacquemin, but we propose an approach that is more general both in terms of orthography and in terms of semantics.</Paragraph>
      <Paragraph position="5"> In terms of orthography, we do not require that two strings share the left (or right) substring in order to constitute a candidate pair. Thus, we are not limited to affixal morphological patterns. Moreover, our algorithm extracts semantic information directly from the input corpus, and thus it does not require a pre-compiled list of semantically related pairs.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Schone and Jurafsky (2000)
</SectionTitle>
      <Paragraph position="0"> Schone and Jurafsky (2000) present a knowledge-free unsupervised model in which orthography-based distributional cues are combined with semantic information automatically extracted from word co-occurrence patterns in the input corpus.</Paragraph>
      <Paragraph position="1"> They first look for potential suffixes by searching for frequent word-final substrings. Then, they look for potentially morphologically related pairs, i.e., pairs that end in potential suffixes and share the left substring preceding those suffixes. Finally, they look, among those pairs, for those whose semantic vectors (computed using latent semantic analysis) are significantly correlated. In short, the idea behind the semantic component of their model is that words that tend to co-occur with the same set of words, within a certain window of text, are likely to be semantically correlated words.</Paragraph>
      <Paragraph position="2"> While we follow Schone and Jurafsky's idea of combining orthographic and semantic cues, our algorithm differs from them in both respects. From the point of view of orthography, we rely on the comparison between individual word pairs, without requiring that the two pairs share a frequent affix, and indeed without requiring that they share an affix at all.</Paragraph>
      <Paragraph position="3"> From the point of view of semantics, we compute scores based on mutual information instead of latent semantic analysis. Thus, we only look at the co-occurrence patterns of target words, rather than at the similarity of their contexts.</Paragraph>
      <Paragraph position="4"> Future research should try to assess to what extent these two approaches produce significantly different results, and/or to what extent they are complementary. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Yarowsky and Wicentowski (2000)
</SectionTitle>
      <Paragraph position="0"> Yarowsky and Wicentowski (2000) propose an algorithm that extracts morphological rules relating roots and inflected forms of verbs (but the algorithm can be extended to other morphological relations).</Paragraph>
      <Paragraph position="1"> Their algorithm performs unsupervised, but not completely knowledge-free, learning. It requires a table of canonical suffixes for the relevant parts of speech of the target language, a list of the content word roots with their POS (and some information about the possible POS/inflectional features of other words), a list of the consonants and vowels of the language, information about some characteristic syntactic patterns and, if available, a list of function words.</Paragraph>
      <Paragraph position="2"> The algorithm uses a combination of different probabilistic models to find pairs that are likely to be morphologically related. One model matches root + inflected form pairs that have a similar frequency profile. Another model matches root + inflected form pairs that tend to co-occur with the same subjects and objects (identified using simple regular expressions). Yet another model looks for words that are orthographically similar, in terms of a minimum edit distance score that penalizes consonant changes more than vowel changes. Finally, the rules relating stems and inflected forms that the algorithm extracts from the pairs it finds in an iteration are used as a fourth probabilistic model in the subsequent iterations. null Yarowsky and Wicentowski show that the algorithm is extremely accurate in identifying English root + past tense form pairs, including those pairs that are related by non-affixal patterns (e.g., think/thought.) The main issue with this model is, of course, that it cannot be applied to a new target language without having some a priori knowledge about some of its linguistic properties. Thus, the algorithm cannot be applied in cases in which the grammar of the target language has not been properly described yet, or when the relevant information is not available for other reasons. Moreover, even when such information is in principle available, trying to determine to what extent morphology could be learned without relying on any other knowledge source remains an interesting theoretical pursuit, and one whose answer could shed some light on the problem of human language acquisition.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The current approach: Morphological relatedness as a function of orthographic and semantic similarity
</SectionTitle>
    <Paragraph position="0"> relatedness as a function of orthographic and semantic similarity The basic intuition behind the model presented here is extremely simple: Morphologically related words tend to be both orthographically and semantically similar. Obviously, there are many words that are orthographically similar, but are not morphologically related; for example, blue and glue. At the same time, many semantically related words are not morphologically related (for example, blue and green). However, if two words have a similar shape and a related meaning (e.g., green and greenish), they are very likely to be also morphologically related.</Paragraph>
    <Paragraph position="1"> In order to make this idea concrete, we use minimum edit distance to identify words that are orthographically similar, and mutual information between words to identify semantically related words.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Outline of the procedure
</SectionTitle>
      <Paragraph position="0"> Given an unannotated input corpus, the algorithm (after some elementary tokenization) extracts a list of candidate content words. This is simply a list of all the alphabetic space- or punctuation-delimited strings in the corpus that have a corpus frequency below .01% of the total token count.2 Preliminary experiments indicated that our procedure does not perform as well without this trimming. Notice in any case that function words tend to be of little morphological interest, as they display highly lexicalized, often suppletive morphological patterns.</Paragraph>
      <Paragraph position="1"> The word list extracted as described above and the input corpus are used to compute two lists of word pairs: An orthographic similarity list, in which the 2In future versions of the algorithm, we plan to make this high frequency threshold dependent on the size of the input corpus. null pairs are scored on the basis of their minimum edit distance, and a semantic similarity list, based on mutual information. Because of minimum thresholds that are enforced during the computation of the two measures, neither list contains all the pairs that can in principle be constructed from the input list.</Paragraph>
      <Paragraph position="2"> Before computing the combined score, we get rid of the pairs that do not occur in both lists (the rationale being that we do not want to guess the morphological status of a pair on the sole basis of orthographic or semantic evidence).</Paragraph>
      <Paragraph position="3"> We then compute a weighted sum of the orthographic and semantic similarity scores of each remaining pair. In the experiments reported below, the weights are chosen so that the maximum weighted scores for the two measures are in the same order of magnitude (we prefer to align maxima rather than means because both lists are trimmed at the bottom, making means and other measures of central tendency less meaningful).</Paragraph>
      <Paragraph position="4"> The pairs are finally ranked on the basis of the resulting combined scores.</Paragraph>
      <Paragraph position="5"> In the next subsections, we describe how the orthographic and semantic similarity lists are constructed, and some properties of the measures we adopted.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Scoring the orthographic similarity of word pairs
</SectionTitle>
      <Paragraph position="0"> word pairs Like Yarowsky and Wicentowski, we use minimum edit distance to measure orthographic similarity. The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other (see section 5.6 of Jurafsky and Martin (2000) and the references quoted there).</Paragraph>
      <Paragraph position="1"> Unlike Yarowsky and Wicentowski, we do not attempt to define a phonologically sensible edit distance scoring function, as this would require making assumptions about how the phonology of the target language maps onto its orthography, thus falling outside the domain of knowledge-free induction. Instead, we assign a cost of 1 to all editing operations, independently of the nature of the source and target segments. Thus, in our system, the pairs dog/Dog, man/men, bat/mat and day/dry are all assigned a minimum edit distance of 1.3 Rather than computing absolute minimum edit distance, we normalize this measure by dividing it by the length of the longest string (this corresponds to the intuition that, say, two substitutions are less significant if we are comparing two eightletter words than if we are comparing two three-letter words). Moreover, since we want to rank pairs on the basis of orthographic similarity, rather than dissimilarity, we compute (1 - normalized minimum edit distance), obtaining a measure that ranges from 1 for identical forms to 0 for forms that do not share any character.</Paragraph>
      <Paragraph position="2"> This measure is computed for all pairs of words in the potential content word list. However, for reasons of size, only pairs that have a score of :5 or higher (i.e., where the two members share at least half of their characters) are recorded in the output list.</Paragraph>
      <Paragraph position="3"> Notice that orthographic similarity does not favor concatenative affixal morphology over other types of morphological processes. For example, the pairs woman/women and park/parks both have an orthographic similarity score of :8.</Paragraph>
      <Paragraph position="4"> Moreover, orthographic similarity depends only on the two words being compared, and not on global distributional properties of these words and their substrings. Thus, words related by a rare morphological pattern can have the same score as words related by a very frequent pattern, as long as the minimum edit distance is the same. For example, both nucleus/nuclei and bench/benches have an orthographic similarity score of :714, despite the fact that the latter pair reflects a much more common pluralization pattern.</Paragraph>
      <Paragraph position="5"> Of course, this emancipation from edge-anchored concatenation and global distributional salience also implies that orthographic similarity will assign high 3Following a suggestion by two reviewers, we are currently experimenting with an iterative version of our algorithm, along the lines of the one described by Yarowsky and Wicentowski. We start with the cost matrix described in the text, but we re-estimate the editing costs on the basis of the empirical characterto-character (or character-to-zero/zero-to-character) probabilities observed in the output of the previous run of the algorithm. Surprisingly, the revised version of the algorithm leads to (moderately) worse results than the single-run version described in this paper. Further experimentation with edit cost re-estimation is needed, in order to understand which aspects of our iterative procedure make it worse than the single-run model, and how it could be improved.</Paragraph>
      <Paragraph position="6"> scores to many pairs that are not morphologically related - for example, the pair friends/trends also has an orthographic similarity score of :714.</Paragraph>
      <Paragraph position="7"> Furthermore, since in most languages the range of possible word lengths is narrow, orthographic similarity as a ranking measure tends to suffer of a &amp;quot;massive tying&amp;quot; problem. For example, when pairs from the German corpus described below are ranked on the sole basis of orthographic similarity, the resulting list is headed by a block of 19,597 pairs that all have the same score. These are all pairs where one word has 9 characters, the other 9 or 8 characters, and the two differ in only one character.4 For the above reasons, it is crucial that orthographic similarity is combined with an independent measure that allows us to distinguish between similarity due to morphological relatedness vs. similarity due to chance or other reasons.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Scoring the semantic similarity of word pairs
</SectionTitle>
      <Paragraph position="0"> pairs Measuring the semantic similarity of words on the basis of raw corpus data is obviously a much harder task than measuring the orthographic similarity of words.</Paragraph>
      <Paragraph position="1"> Mutual information (first introduced to computational linguistics by Church and Hanks (1989)) is one of many measures that seems to be roughly correlated to the degree of semantic relatedness between words. The mutual information between two words A and B is given by:</Paragraph>
      <Paragraph position="3"> Intuitively, the larger the deviation between the empirical frequency of co-occurrence of two words and the expected frequency of co-occurrence if they were independent, the more likely it is that the occurrence of one of the two words is not independent from the occurrence of the other.</Paragraph>
      <Paragraph position="4"> Brown et alii (1990) observed that when mutual information is computed in a bi-directional fashion, and by counting co-occurrences of words within a 4Most of the pairs in this block - 78% - are actually morphologically related. However, given that all pairs contain words of length 9 and 8/9 that differ in one character only, they are bound to reflect only a very small subset of the morphological processes present in German.</Paragraph>
      <Paragraph position="5"> relatively large window, but excluding &amp;quot;close&amp;quot; co-occurrences (which would tend to capture collocations and lexicalized phrases), the measure identifies semantically related pairs.</Paragraph>
      <Paragraph position="6"> It is particularly interesting for our purposes that most of the examples of English word clusters constructed on the basis of this interpretation of mutual information by Brown and colleagues (reported in their table 6) include morphologically related words.</Paragraph>
      <Paragraph position="7"> A similar pattern emerges among the examples of German words clustered in a similar manner by Baroni et alii (2002). Rosenfeld (1996) reports that morphologically related pairs are common among words with a high (average) mutual information.</Paragraph>
      <Paragraph position="8"> We computed mutual information by considering, for each pair, only co-occurrences within a maximal window of 500 words and outside a minimal window of 3 words. Given that mutual information is notoriously unreliable at low frequencies (see, for example, Manning and Sch&amp;quot;utze (1999), section 5.4), we only collected mutual information scores for pairs that co-occurred at least three times (within the relevant window) in the input corpus. Obviously, occurrences across article boundaries were not counted. Notice however that the version of the Brown corpus we used does not mark article boundaries. Thus, in this case the whole corpus was treated as a single article.</Paragraph>
      <Paragraph position="9"> Our &amp;quot;semantic&amp;quot; similarity measure is based on the notion that related words will tend to often occur in the nears of each other. This differs from the (more general) approach of Schone and Jurafsky (2000), who look for words that tend to occur in the same context. It remains an open question whether the two approaches produce complementary or redundant results.5 Taken by itself, mutual information is a worse predictor of morphological relatedness than minimum edit distance. For example, among the top one hundred pairs ranked by mutual information in each language, only one German pair and five English pairs are morphologically motivated. This poor performance is not too surprising, given that there are 5We are currently experimenting with a measure based on semantic context similarity (determined on the basis of class-based left-to-right and right-to-left bigrams), but the current implementation of this requires ad hoc corpus-specific settings to produce interesting results with both our test corpora.</Paragraph>
      <Paragraph position="10"> plenty of words that often co-occur together without being morphologically related. Consider for example (from our English list) the pairs index/operand and orthodontist/teeth.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Empirical evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Materials
</SectionTitle>
      <Paragraph position="0"> We tested our procedure on the German APA corpus, a corpus of newswire containing over twenty-eight million word tokens, and on the English Brown corpus (KuVcera and Francis, 1967), a balanced corpus containing less than one million two hundred thousand word tokens. Of course, the most important difference between these two corpora is that they represent different languages. However, observe also that they have very different sizes, and that they are different in terms of the types of texts constituting them.</Paragraph>
      <Paragraph position="1"> Besides the high frequency trimming procedure described above, for both languages we removed from the potential content word lists those words that were not recognized by the XEROX morphological analyzer for the relevant language. The reason for this is that, as we describe below, we use this tool to build the reference sets for evaluation purposes.</Paragraph>
      <Paragraph position="2"> Thus, morphologically related pairs composed of words not recognized by the analyzer would unfairly lower the precision of our algorithm.</Paragraph>
      <Paragraph position="3"> Moreover, after some preliminary experimentation, we also decided to remove words longer than 9 characters from the German list (this corresponds to trimming words whose length is one standard deviation or more above the average token length). This actually lowers the performance of our system, but makes the results easier to analyze - otherwise, the top of the German list would be cluttered by a high number of rather uninteresting morphological pairs formed by inflected forms from the paradigm of very long nominal compounds (such as Wirtschaftsforschungsinstitut 'institute for economic research'). Unlike high frequency trimming, the two operations we just described are meant to facilitate empirical evaluation, and they do not constitute necessary steps of the core algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Precision
</SectionTitle>
      <Paragraph position="0"> In order to evaluate the precision obtained by our procedure, we constructed a list of all the pairs that, according to the analysis provided by the XEROX analyzer for the relevant language, are morphologically related (i.e., share one of their stems).6 We refer to the lists constructed in the way we just described as reference sets.</Paragraph>
      <Paragraph position="1"> The XEROX tools we used do not provide derivational analysis for English, and a limited form of derivational analysis for German. Our algorithm, however, finds both inflectionally and derivationally related pairs. Thus, basing our evaluation on a comparison with the XEROX parses leads to an underestimation of the precision of the algorithm. We found that this problem is particularly evident in English, since English, unlike German, has a rather poor inflectional morphology, and thus the discrepancies between our output and the analyzer parses in terms of derivational morphology have a more visible impact on the results of the comparison. For example, the English analyzer does not treat pairs related by the adverbial suffix -ly or by the prefix un- as morphologically related, whereas our algorithm found pairs such as soft/softly and load/unload.</Paragraph>
      <Paragraph position="2"> In order to obtain a more fair assessment of the algorithm, we went manually through the first 2,000 English pairs found by our algorithm but not parsed as related by the analyzer, looking for items to be added to the reference set. We were extremely conservative, and we added to the reference set only those pairs that are related by a transparent and synchronically productive morphological pattern. When in doubt, we did not correct the analyzer-based analysis. Thus, for example, we did not count pairs such as machine/machinery, variables/varies or electric/electronic as related.</Paragraph>
      <Paragraph position="3"> We did not perform any manual post-processing on the German reference set.</Paragraph>
      <Paragraph position="4"> Tables 1 and 2 report percentage precision (i.e., the percentage of pairs that are in the reference set over the total number of ranked pairs up to the relevant threshold) at various cutoff points, for German and English respectively.</Paragraph>
      <Paragraph position="6"> For both languages we notice a remarkably high precision rate (&gt; 90%) up to the 1500-pair cutoff point.</Paragraph>
      <Paragraph position="7"> After that, there is a sharper drop in the English precision, whereas the decline in German is more gradual. This is perhaps due in part to the problems with the English reference set we discussed above, but notice also that English has an overall poorer morphological system and that the English corpus is considerably smaller than the German one. Indeed, our reference set for German contains more than ten times the forms in the English reference set.</Paragraph>
      <Paragraph position="8"> Notice anyway that, for both languages, the precision rate is still around 50% at the 5000-pair cutoff.7 7Yarowsky and Wicentowski (2000) report an accuracy of over 99% for their best model and a test set of 3888 pairs. Our precision rate at a comparable cutoff point is much lower (58% at the 4000-pair cutoff). However, Yarowksy and Wicentowski restricted the possible matchings to pairs in which one member is an inflected verb form, and the other member is a potential verbal root, whereas in our experiments any word in the corpus (as long as it was below a certain frequency threshold, and it was recognized by the XEROX analyzer) could be matched with any other word in the corpus. Thus, on the one hand, Yarowsky and Wicentowski forced the algorithm to produce a matching for a certain set of words (their set of inflected forms), whereas our algorithm was not subject to an analogous constraint. On the other hand, though, our algorithm had to explore a much larger possible matching space, and it could (and did) make a high number of mistakes on pairs (such as, e.g., sorry and worry) that Of course, what counts as a &amp;quot;good&amp;quot; precision rate depends on what we want to do with the output of our procedure. We show below that even a very naive morphological rule extraction algorithm can extract sensible rules by taking whole output lists as its input, since, although the number of false positives is high, they are mostly related by patterns that are not attested as frequently in the list as the patterns relating true morphological pairs. In other words, true morphological pairs tend to be related by patterns that are distributionally more robust than those displayed by false positives. Thus, rule extractors and other procedures processing the output of our algorithm can probably tolerate a high false positive rate if they take frequency and other distributional properties of patterns into account.</Paragraph>
      <Paragraph position="9"> Notice that we discussed only precision, and not recall. This is because we believe that the goal of a morphological discovery procedure is not to find the exhaustive list of all morphologically related forms in a language (indeed, because of morphological productivity, such list is infinite), but rather to discover all the possible (synchronically active and/or common) morphological processes present in a language. It is much harder to measure how good our algorithm performed in this respect, but the qualitative analysis we present in the next subsection indicates that, at least, the algorithm discovers a varied and interesting set of morphological processes.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Morphological patterns discovered by the algorithm
</SectionTitle>
      <Paragraph position="0"> algorithm The precision tables confirm that the algorithm found a good number of morphologically related pairs. However, if it turned out that all of these pairs were examples of the same morphological pattern (say, nominal plural formation in -s), the algorithm would not be of much use. Moreover, we stated at the beginning that, since our algorithm does not assume an edge-based stem+affix concatenation model of morphology, it should be well suited to discover relations that cannot be characterized in these Yarowksy and Wicentowski's algorithm did not have to consider. Schone and Jurafsky (2000) report a maximum precision of 92%. It is hard to compare this with our results, since they use a more sophisticated scoring method (based on paradigms rather than pairs) and a different type of gold standard. Moreover, they do not specify what was the size of the input they used for evaluation.</Paragraph>
      <Paragraph position="1"> terms (e.g., pairs related by circumfixation, stem changes, etc.). It is interesting to check whether the algorithm was indeed able to find relations of this sort.</Paragraph>
      <Paragraph position="2"> Thus, we performed a qualitative analysis of the output of the algorithm, trying to understand what kind of morphological processes were captured by it.</Paragraph>
      <Paragraph position="3"> In order to look for morphological processes in the algorithm output, we wrote a program that extracts &amp;quot;correspondence rules&amp;quot; in the following simple way: For each pair, the program looks for the longest shared (case-insensitive) left- and right-edge substrings (i.e., for a stem + suffix parse and for a prefix + stem parse). The program then chooses the parse with the longest stem (assuming that one of the two parses has a non-zero stem), and extracts the relevant edge-bound correspondence rule. If there is a tie, the stem + suffix parse is preferred. The program then ranks the correspondence rules on the basis of their frequency of occurrence in the original output list.8 We want to stress that we are adopting this procedure as a method to explore the results, and we are by no means proposing it as a serious rule induction algorithm. One of the most obvious drawbacks of the current rule extraction procedure is that it is only able to extract linear, concatenative, edge-bound suffixation and prefixation patterns, and thus it misses or fails to correctly generalize some of the most interesting patterns in the output. Indeed, looking at the patterns missed by the algorithm (as we do in part below) is as instructive as looking at the rules it found.</Paragraph>
      <Paragraph position="4"> Tables 3 and 4 report the top five suffixation and prefixation patterns found by the rule extractor by taking the entire German and English output lists as its input.</Paragraph>
      <Paragraph position="5"> These tables show that our morphological pair scoring procedure found many instances of various common morphological patterns. With the exception of the German &amp;quot;prefixation&amp;quot; rule ers$drit (actually relating the roots of the ordinals 'first' and 'second'), and of the compounding pattern $&amp;quot;Ol ('Oil'), all the rules in these lists correspond to realistic affixation patterns. Not surprisingly, in both  prefixation patterns languages many of the most frequent rules (such as, e.g., $s) are poly-functional, corresponding to a number of different morphological relations within and across categories.</Paragraph>
      <Paragraph position="6"> The results reported in these tables confirm that the algorithm is capturing common affixation processes, but they are based on patterns that are so frequent that even a very naive procedure could uncover them9 More interesting observations emerge from further inspection of the ranked rule files. For example, among the 70 most frequent German suffixation rules extracted by the procedure, we encounter those in table 5.10 The patterns in this table show that our algorithm is capturing the non-concatenative plural formation 9For example, as shown by a reviewer, a procedure that pairs words that share the same first five letters, and extracts the diverging substrings following the common prefix from each pair. 10In order to find the set of rules presented in table 5 using the naive algorithm described in the previous footnote, we would have to consider the 2672 most frequent rules. Most of these 2672 rules, of course, do not correspond to true morphological patterns - thus, the interesting rules would be buried in noise.  changes found by the rule extractor process involving fronting of the stem vowel plus addition of a suffix (-e/-en). A smarter rule extractor should be able to generalize from patterns like these to a smaller number of more general rules capturing the discontinuous change. Other umlaut-based patterns that do not involve concomitant suffixation such as in Mutter/M&amp;quot;utter - were also found by our core algorithm, but they were wrongly parsed as involving prefixes (e.g., Mu$M&amp;quot;u) by the rule extractor. null Finally, it is very interesting to look at those pairs that are morphologically related according to the XEROX analyzer, and that were discovered by our algorithm, but where the rule extractor could not posit a rule, since they do not share a substring at either edge. These are listed, for German, in table 6.  do not share an edge found by the basic algorithm We notice in this table, besides three further instances of non-affixal morphology, a majority of pairs involving circumfixation of one of the members. null While a more in-depth qualitative analysis of our results should be conducted, the examples we discussed here confirm that our algorithm is able to capture a number of different morphological patterns, including some that do not fit into a strictly concatenative edge-bound stem+affix model.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>