<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0603"> <Title>Unsupervised Discovery of Morphemes</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> We compared the two proposed methods as well as Goldsmith's program Linguistica5 on both Finnish and English corpora. The Finnish corpus consisted of newspaper text from CSC6. A morphosyntactic analysis of the text was performed using the Conexor FDG parser7. All characters were converted to lower case, and words containing other characters than a through z and the Scandinavian letters @a, &quot;a and &quot;o were removed. Other than morphemic tags were removed from the morphological analyses of the words. The remaining tags correspond to inflectional affixes (i.e. endings and markers) and clitics. Unfortunately the parser does not distinguish derivational affixes. The first 100 000 word tokens were used as training data, and the following 100 000 word tokens were used as test data.</Paragraph> <Paragraph position="1"> The test data contained 34 821 word types.</Paragraph> <Paragraph position="2"> The English corpus consisted of mainly newspaper text from the Brown corpus8. A morphological analysis of the words was performed using the Lingsoft ENGTWOL analyzer9. In case of multiple alternative morphological analyses, the shortest analysis was selected. All characters were converted to lower case, and words containing other characters than a through z, an apostrophe or a hyphen were removed. Other than morphemic tags were removed from the morphological analyses of the words. The remaining tags correspond to inflectional or derivational affixes. A set of 100 000 word tokens from the corpus sections Press Reportage and Press Editorial were used as training data. A separate set of 100 000 word tokens from the sections Press Editorial, Press Reviews, Religion, and Skills Hobbies were used as test data. The test data contained 12 053 word types.</Paragraph> <Paragraph position="3"> Test results for the three methods and the two languages are shown in Table 2. We observe different tendencies for Finnish and English. For Finnish, there is a correlation between the compression of the corpus and the linguistic generalization capacity to new word forms. The Recursive splitting with the MDL cost function is clearly superior to the Sequential splitting with ML cost, which in turn is superior to Linguistica. The Recursive MDL method is best in terms of data compression: it produces the smallest morph lexicon (codebook), and the codebook naturally occupies a small part of the total cost. It is best also in terms of the linguistic measure, the total alignment distance on test data. Linguistica, on the other hand, employs a more restricted segmentation, which leads to a larger codebook and to the fact that the codebook occupies a large part of the total MDL cost. This also appears to lead to a poor generalization ability to new word forms. The linguistic alignment distance is the highest, and so is the percentage of aligned morph/morphemic label pairs that were never observed in the training set. On the other hand, Linguistica is the fastest program10.</Paragraph> <Paragraph position="4"> Also for English, the Recursive MDL method achieves the best alignment, but here Linguistica achieves nearly the same result. 
<Paragraph position="4"> Also for English, the Recursive MDL method achieves the best alignment, but here Linguistica achieves nearly the same result. The rate of compression follows the same pattern as for Finnish, in that Linguistica produces a much larger morph lexicon than the methods presented in this paper. In spite of this, the percentage of unseen morph/morphemic label pairs is about the same for all three methods. This suggests that in a morphologically poor language such as English, a restrictive segmentation method such as Linguistica can compensate for new word forms (which it does not recognize at all) with old, familiar words that it &quot;gets just right&quot;. In contrast, the methods presented in this paper produce a morph lexicon that is smaller and generalizes better to new word forms, but has somewhat lower accuracy for already observed word forms.</Paragraph> <Paragraph position="5"> Visual inspection of a sample of words. In an attempt to analyze the segmentations more thoroughly, we randomly picked 1000 different words from the Finnish test set. The total number of occurrences of these words constitutes about 2.5% of the whole set. We inspected the segmentation of each word visually and classified it into one of three categories: (1) correct and complete segmentation (i.e., all relevant morpheme boundaries were identified), (2) correct but incomplete segmentation (i.e., not all relevant morpheme boundaries were identified, but no proposed boundary was incorrect), (3) incorrect segmentation (i.e., some proposed boundary did not correspond to an actual morpheme boundary).</Paragraph> <Paragraph position="6"> The results of the inspection for each of the three segmentation methods are shown in Table 3. The Recursive MDL method performs best and segments about half of the words correctly. The Sequential ML method comes second and Linguistica third, with a share of 43% correctly segmented words. When considering the incomplete and incorrect segmentations, the methods behave differently. The Recursive MDL method leaves very common word forms unsplit, and often produces excessive splitting for rare words. The Sequential ML method is more prone to excessive splitting, even for words that are not rare.</Paragraph> <Paragraph position="7"> Linguistica, on the other hand, employs a more conservative splitting strategy, but makes incorrect segmentations for many common word forms.</Paragraph>
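The three inspection categories above reduce to simple set logic over morpheme-boundary positions. The following sketch restates that decision rule, assuming boundaries are represented as character offsets within a word; the helper name and the example word are illustrative, not from the paper.

```python
def classify_segmentation(proposed, gold):
    """Three-way judgement from the visual inspection: proposed and gold are
    sets of morpheme-boundary positions (character offsets within a word)."""
    if proposed - gold:                # some proposed boundary is not a real one
        return "incorrect"
    if gold - proposed:                # no wrong boundaries, but some are missing
        return "correct but incomplete"
    return "correct and complete"

# Example: a gold analysis of Finnish autoissa as auto+i+ssa gives boundaries {4, 5}.
print(classify_segmentation({4, 5}, {4, 5}))   # correct and complete
print(classify_segmentation({4}, {4, 5}))      # correct but incomplete (auto+issa)
print(classify_segmentation({3, 5}, {4, 5}))   # incorrect (aut+oi+ssa)
```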
<Paragraph position="8"> The behaviour of the methods is illustrated by the example segmentations in Table 4. The Recursive MDL method often produces complete and correct segmentations. However, both it and the Sequential ML method can produce excessive splitting, as is shown for the latter, e.g. affecti + on + at + e. In contrast, Linguistica refrains from splitting words even when they should be split, e.g., the Finnish compound words in the table.</Paragraph> </Section> </Paper>