<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0106">
  <Title>Induction of a Simple Morphology for Highly-Inflecting Languages</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We report experiments on Finnish and English corpora. The new category-learning algorithm is compared to two other algorithms, namely the baseline segmentation algorithm presented in (Creutz, 2003), which was also utilized for initializing the segmentation in the category-learning algorithm, and the Linguistica algorithm (Goldsmith, 2001).3</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Data sets
</SectionTitle>
      <Paragraph position="0"> The Finnish corpus consists mainly of news texts from the CSC (The Finnish IT Center for Science)4 and the Finnish News Agency. The corpus contains 32 million words and was divided into a development set and a test set, each containing 16 million words. For the experiments on English we have used the Brown corpus5. It contains one million words, divided into a development set of 250 000 words and a test set of 750 000 words.</Paragraph>
      <Paragraph position="1"> The development sets were utilized for optimizing the algorithms and for selecting parameter values, whereas the test sets were used solely in the final evaluation. The algorithms were evaluated on different subsets of the test set to produce the precision-recall curves in Figure 2. The sizes of the subsets are shown in Table 1. As can be seen, the Finnish and English data sets contain the same number of word tokens (words of running text), but the number of word types (distinct word forms) is higher in the Finnish data. The word type figures are important, since what was referred to as a 'corpus' in the previous sections is actually a word list. That is, one occurrence of each distinct word form in the data is picked for the morphology learning task.</Paragraph>
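The token/type distinction above can be made concrete with a short sketch (illustrative code, not the authors' implementation): running text is collapsed into the word list that the morphology learner actually sees.

```python
from collections import Counter

# Illustrative sketch (not the authors' code): the 'corpus' used for
# morphology learning is really a word list, i.e. one occurrence of each
# distinct word form (type) picked from the running text (tokens).
def corpus_to_word_list(tokens):
    counts = Counter(tokens)                 # word type -> token frequency
    return sorted(counts), len(tokens), len(counts)

tokens = "the cat sat on the mat the cat".split()
word_list, n_tokens, n_types = corpus_to_word_list(tokens)
# 8 tokens of running text, but only 5 distinct word types
```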
      <Paragraph position="2"> The word forms in the test sets for which there are no gold standard segmentations are simply left out of the evaluation. The proportions of such word forms are 5%, 6%, 8%, and 15% in the Finnish sets of size 10 000, 50 000, 250 000 and 16 million words, respectively. For English the proportions are 5%, 9%, and 14% for the data sets (in growing order).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Parameters
</SectionTitle>
      <Paragraph position="0"> The development sets were used for setting the values of the parameters of the algorithms. As a criterion for selecting the optimal values, we used the (equally weighted) F-measure, which is the harmonic mean of the precision and recall of detected morpheme boundaries. For each data size and language separately, we selected the configuration yielding the best F-measure on the development set. These values were then fixed and utilized when evaluating the performance of the algorithms on the test set of corresponding size.</Paragraph>
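The selection criterion above can be sketched as follows. This is an illustrative reimplementation, under the assumption that segmentations are written with '+' between morphs; it is not the evaluation script actually used in the paper.

```python
def boundary_set(segmentation):
    """Character positions of morpheme boundaries in a '+'-delimited word."""
    pos, bounds = 0, set()
    for morph in segmentation.split('+')[:-1]:
        pos += len(morph)
        bounds.add(pos)
    return bounds

def evaluate(pairs):
    """Boundary precision, recall, and equally weighted F-measure
    (harmonic mean of precision and recall) over (gold, predicted) pairs."""
    tp = fp = fn = 0
    for gold, pred in pairs:
        g, p = boundary_set(gold), boundary_set(pred)
        tp += len(g & p)                     # correctly detected boundaries
        fp += len(p - g)                     # spurious boundaries
        fn += len(g - p)                     # missed boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

pairs = [("un+expect+ed+ly", "un+expected+ly"),
         ("photograph+er+s", "photograph+ers")]
precision, recall, f = evaluate(pairs)
```

Here every proposed boundary is correct (precision 1.0), but two gold boundaries are missed (recall 0.6), giving F = 0.75.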
      <Paragraph position="1"> In the Baseline algorithm, we optimized the prior morph length distribution. The prior morph frequency distribution was left at its default value.</Paragraph>
      <Paragraph position="2"> The Category algorithm has four parameters: a, b, c, and d; cf. Equations 4 and 5. The constant values c = 2, d = 3.5 work well for every data set size and language, as does the relation a = 10/b. The perplexity threshold, b, assumes values between 5 and 100 depending on the data set. Conveniently, the algorithm is robust with respect to the value of b, and the result is always better than that of the Baseline algorithm, except for values of b that are orders of magnitude too large.</Paragraph>
      <Paragraph position="3"> [Figure 2: precision-recall curves for the Finnish (a) and English (b) data. Each data point is an average of 4 runs on separate test sets, with the exception of the 16M (16 million) words for Finnish (1 test set) and the 250k (250 000) words for English (3 test sets); in these cases the lack of test data constrained the number of runs. The standard deviations of the averages are shown as intervals around the data points. There is no 16M data point for Linguistica on Finnish, because the algorithm is very memory-consuming and we could not run it on data sizes larger than 250 000 words on our PC. In most curves, recall rises as the data size increases; an exception is the Baseline curve for Finnish, where precision rises while recall drops.]</Paragraph>
      <Paragraph position="4"> In the Linguistica algorithm, we used the commands 'Find suffix system' and 'Find prefixes of suffixal stems'.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> Figure 2 depicts the precision and recall of the algorithms on test sets of different sizes.</Paragraph>
      <Paragraph position="1"> When studying the curves for Finnish (Fig. 2a), we observe that the Baseline and Category algorithms perform on a similar level on the smallest data set (10k). However, from there the performances diverge: the Category algorithm improves on both precision and recall, whereas the Baseline algorithm displays a strong increase in precision while recall actually decreases. This means that words are split less often but the proposed splits are more often correct. This is due to measuring the cost of both the lexicon and the data in the optimization function: with a much larger corpus (more data) the optimal solution contains a much larger morph lexicon. Hence, less splitting ensues. The effect is not seen on the English data (Fig. 2b), but this might be due to the smaller corpus sizes.</Paragraph>
      <Paragraph position="2"> For Linguistica, an increase in the amount of data is reflected in higher recall, but lower precision.</Paragraph>
      <Paragraph position="3"> Linguistica only suggests a morpheme boundary between a stem and an affix if the same stem has been observed in combination with at least one other affix. This leads to a "conservative word-splitting behavior", with rather low recall on small data sets but high precision. As the amount of data increases, the sparsity of the data decreases and more morpheme boundaries are suggested. This results in higher recall, but unfortunately lower precision.</Paragraph>
      <Paragraph position="4"> As Linguistica was not designed for discovering the boundaries within compound words, it misses a large number of them.</Paragraph>
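The conservative behavior described above can be sketched as follows, under the simplifying assumption of a small made-up suffix inventory; this is not Linguistica's actual signature machinery, only an illustration of the "split only with corroborating evidence" idea.

```python
from collections import defaultdict

# Hedged sketch of conservative word splitting: propose a stem+suffix split
# only if the stem is also attested with at least one other affix (or bare).
# The suffix inventory is a made-up illustration, not learned from data.
SUFFIXES = ("s", "ed", "ing")

def conservative_split(word, word_list):
    stems_seen = defaultdict(set)            # stem -> affixes attested with it
    for w in word_list:
        stems_seen[w].add("")                # the bare form counts as evidence
        for s in SUFFIXES:
            if w.endswith(s) and len(w) > len(s):
                stems_seen[w[:-len(s)]].add(s)
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            stem = word[:-len(s)]
            if stems_seen[stem] - {s}:       # stem also seen with another affix?
                return stem + "+" + s
    return word                              # no supporting evidence: leave unsplit

words = ["walk", "walks", "walked", "dogs"]
# 'walk' is attested bare and with 'ed', so 'walks' is split; the stem of
# 'dogs' is never seen with another affix, so 'dogs' stays whole.
```

With sparse data most stems lack corroboration, hence low recall but high precision; as the word list grows, more stems gather evidence and more boundaries are proposed.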
      <Paragraph position="5"> For Finnish, the Category algorithm is better than the other two algorithms when compared on data sets of the same size. We interpret a result as better, even though precision might be somewhat lower, if recall is significantly higher (or vice versa). As an example, for the 16 million word set, the Category algorithm achieves 79.0% precision and 71.0% recall, whereas the Baseline achieves 88.5% precision but only 45.9% recall. T-tests show significant differences at the level of 0.01 between all algorithms on Finnish, except for Categories vs. Baseline at 10 000 words.</Paragraph>
      <Paragraph position="6"> For English, the Baseline algorithm generally performs worst, but it is difficult to say which of the other two algorithms performs best. According to T-tests, there are no significant differences at the level of 0.05 between the following: Categories vs. Linguistica (50k &amp; 250k), and Categories vs. Baseline (10k). However, if one were to extrapolate from the current trends to a larger data set, it would seem likely that the Category algorithm would outperform Linguistica.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Computational requirements
</SectionTitle>
      <Paragraph position="0"> The Baseline and Category algorithms are implemented as Perl scripts. On the Finnish 250 000 word set, the Baseline algorithm runs in 45 minutes, and the Category algorithm additionally takes 20 minutes on a 900 MHz AMD Duron processor with a maximum memory usage of 20 MB. The Linguistica algorithm is a compiled Windows program, which uses 500 MB of memory and runs in 90 minutes, of which 80 minutes(!) are taken up by the saving of the results.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> It is worth remembering that the gold standard splitting used in these evaluations is based on a traditional morphology. If the segmentations were evaluated using a real-world application, perhaps somewhat different segmentations would be most useful.</Paragraph>
    <Paragraph position="1"> For example, the tendency to keep common words together, seen in the Baseline model and generally in Bayesian or MDL-based models, might not be at all troublesome, e.g., in speech recognition or machine translation applications. In contrast, excessive splitting might be a problem in both applications.</Paragraph>
    <Paragraph position="2"> When compared to the gold standard segmentation used here, the Baseline algorithm produces three types of errors that are prominent: (i) excessive segmentation especially when trained on small amounts of data, (ii) too little segmentation especially with large amounts of data, and (iii) erroneous segments suggested in the beginning of words due to the fact that the same segments frequently occur at the end of words (e.g. 's+wing'). The Category algorithm is able to clearly reduce these types of errors due to its following properties: (i) the joining of noise morphs with adjacent morphs, (ii) the removal of redundant morphs by splitting them into sub-morphs, and (iii) the simple morphotactics involving three categories (stem, prefix, and suffix) implemented as an HMM. Furthermore, (iii) is necessary for being able to carry out (i) and (ii).</Paragraph>
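The simple three-category morphotactics in point (iii) can be pictured as a small finite-state constraint on category sequences. The transition table below is an assumption inspired by the description; the paper's model is an HMM with probabilities, not hard constraints.

```python
# Minimal sketch of three-category morphotactics (prefix, stem, suffix) as a
# finite-state constraint. The allowed transitions are an assumption inspired
# by the paper's description, not its exact HMM.
ALLOWED = {
    "#": {"prefix", "stem"},          # '#' marks the word boundary
    "prefix": {"prefix", "stem"},     # a prefix must be followed by something
    "stem": {"stem", "suffix", "#"},  # stem+stem allows compounds
    "suffix": {"suffix", "#"},        # suffixes may stack ('un+expect+ed+ly')
}

def valid_tagging(categories):
    """True if the category sequence obeys the morphotactics above."""
    prev = "#"
    for cat in categories:
        if cat not in ALLOWED[prev]:
            return False
        prev = cat
    return "#" in ALLOWED[prev]       # the word must be allowed to end here
```

Constraints like these rule out analyses such as 's+wing', where a frequent word-final segment is proposed at the beginning of a word.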
      <Paragraph position="3"> The Category algorithm does a good job in finding morpheme boundaries and assigning categories to the morphs, as can be seen in the examples in Figure 3, e.g., 'photograph+er+s', 'un+expect+ed+ly', 'aarre+kammio+i+ssa' ("in treasure chambers"; 'i' is a plural marker and 'ssa' marks the inessive case), and 'bahama+saar+et' ("[the] Bahama islands"; 'saari' means "island" and 'saaret' is the plural form). The reader interested in the analyses of other words can try our on-line demo at http://www.cis.hut.fi/projects/morpho/.</Paragraph>
      <Paragraph position="5"> It is nice to see that the same morph can be tagged differently in different contexts: 'pää' is a prefix in 'pää+aihe+e+sta' ("about [the] main topic"), whereas 'pää' is a stem in 'pää+hän' ("in [the] head"). In this case the morph categories also resolve the semantic ambiguity of the morph 'pää'. Occasionally, the segmentation is correct but the category tagging differs from the linguistic convention, e.g., 'taka+penkki+läis+et' ("[the] ones in [the] back seat"), where 'läis' is tagged as a stem instead of a suffix.</Paragraph>
      <Paragraph position="6"> The segmentation of 'pääaiheesta' is not entirely correct: 'pää+aihe+e+sta' contains a superfluous morph ('e'), which should be part of the stem, i.e., 'pää+aihee+sta'. This mistake is explained by a comparison with the plural form 'pää+aihe+i+sta', which is correct. As the singular and plural differ in only one letter, 'e' vs. 'i', the algorithm has found a solution where the alternating letter is treated as an independent "number marker": 'e' for singular, 'i' for plural.</Paragraph>
      <Paragraph position="7"> In the Linguistica algorithm, stems and suffixes are grouped into so-called signatures, which can be thought of as inflectional paradigms: a certain set of stems goes together with a certain set of suffixes.</Paragraph>
      <Paragraph position="8"> Words will be left unsplit unless the potential stem and suffix fit into a signature. As a consequence, if only the plural of some particular English noun occurs in the data, but not the singular, Linguistica will not split the noun into a stem and the plural 's', since this does not fit into any signature. In this respect, our category-based algorithm is better at coping with data sparsity, which is especially important for highly-inflecting languages such as Finnish. In contrast with Linguistica, our algorithms can incorrectly "overgeneralize" and suggest a suffix where there is none, e.g., 'maclare+n' ("MacLaren"). Furthermore, nonsensical sequences of suffixes (which in other contexts are true suffixes) can be suggested, e.g., 'prot+e+iin+i', which should be 'proteiini' ("protein"). A model with more fine-grained categories might reduce such shortcomings, in that it could model morphotactics more accurately.</Paragraph>
      [Figure 3: example segmentations of Finnish and English words, e.g., 'aarre+kammio+i+ssa', 'jäädy+ttä+ä', 'bahama+saar+et', 'edes+autta+isi+vat', 'haap+a+koske+en', 'pää+aihe+e+sta', 'taka+penkki+läis+et', 'abandon+ed', 'calculat+ion+s', 'long+fellow+'s', 'micro+organ+ism+s', 'photograph+er+s', 'un+expect+ed+ly'. Discovered stems are underlined, suffixes are slanted, and prefixes are rendered in the standard font.]
      <Paragraph position="9"> Another aspect requiring attention in the future is allomorphy. Currently each discovered segment (morph) is assigned a role (prefix, stem, or suffix), but no further "meaning" or relation to other morphs. Figure 3 contains some examples of allomorphs, morphs representing the same morpheme, i.e., morphs having the same meaning but used in complementary distribution. The current algorithm has no means for discovering that 'on' and 'en' mark the same case, namely the illative, in 'aarre+kammio+on' ("into [the] treasure chamber") and 'haap+a+koske+en' ("to Haapakoski"). (Furthermore, the algorithm cannot deduce that the illative is actually realized as vowel lengthening + 'n': 'kammioon' vs. 'koskeen'.) To enable such discovery in principle, one would probably need to look at the contexts of nearby words, not just the word-internal context. Additionally, one should allow the learning of a model with a richer category structure. Moreover, 'on' and 'en' do not always mark the illative case: in 'bahama+saari+en' the genitive is marked by 'en', and in 'edes+autta+ko+on' ("may he/she help") 'on' marks the third person singular. Similar examples can be found for English, e.g., 'ed' and 'd' are allomorphs in 'invit+ed' vs. 'phrase+d', and so are 'es' and 's' in 'invit+es' vs. 'phrase+s'. However, the meaning of 's' is often ambiguous: it can mark either the plural of a noun or the third person singular of a verb in the present tense. But this kind of ambiguity is in principle solvable in the current model; the Category algorithm resolves similar, also semantic, ambiguities occurring between the three current categories: prefix, stem, and suffix.</Paragraph>
  </Section>
</Paper>