<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1152">
  <Title>Efficient Unsupervised Recursive Word Segmentation Using Minimum Description Length</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In this section we give initial results for the above algorithm on English and Turkish, showing how meaningful morphs are extracted using different greedy MDL criteria. Recall that the models and algorithm described in this paper are intended as parts of a more comprehensive morphological analysis system, as we describe below under future work.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 English
</SectionTitle>
      <Paragraph position="0"> For evaluation in English, we used the standard Reuters-21578 corpus of news articles (comprising 1.7M word tokens and 32,811 unique words). For each of the 3 models described above, we extracted morphs either by resegmenting on prefixes or on suffixes (looking at the words reversed). When segmenting according to Models 1 and 2, a minimum prefix length of 2 was enforced, to improve morph quality (though not for suffixes, since in English there are some one-letter suffixes such as -s).</Paragraph>
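      <Paragraph> The prefix/suffix setup described above can be illustrated with a minimal sketch. It only enumerates and counts candidate affixes, treating suffixes as prefixes of the reversed word and applying a minimum length; it does not reproduce the MDL scoring of Models 1, 1a, or 2. The function name candidate_affixes, its parameters, and the toy word list are hypothetical and are not taken from the paper.

```python
from collections import Counter

def candidate_affixes(words, min_len=1, suffixes=False):
    """Count every proper prefix (or suffix) of at least min_len letters.

    Hypothetical helper for illustration only; the MDL criterion that
    chooses among these candidates is not modeled here.
    """
    counts = Counter()
    for word in words:
        # Suffixes are handled as prefixes of the reversed word.
        w = word[::-1] if suffixes else word
        # Stop before len(w) so the whole word is never its own affix.
        for cut in range(min_len, len(w)):
            affix = w[:cut]
            # Restore the original letter order for suffixes.
            counts[affix[::-1] if suffixes else affix] += 1
    return counts

# Toy vocabulary (assumed, not from the Reuters-21578 corpus).
words = ["walks", "talks", "inform", "inline"]

# Minimum length 2 for prefixes, as in the text; 1 for suffixes,
# so that one-letter suffixes such as -s remain candidates.
prefixes = candidate_affixes(words, min_len=2)
suffixes = candidate_affixes(words, min_len=1, suffixes=True)
```

      With this toy vocabulary, one-letter strings are excluded from the prefix candidates while the one-letter suffix s remains available, mirroring the asymmetry described in the text.</Paragraph>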
      <Paragraph position="1"> First, consider morphs found by Model 1 (Fig. 2).</Paragraph>
      <Paragraph position="2"> The prefix morphs found are surprisingly good for this simple model, with only one wrong among the first 15 extracted. That erroneous morph is ter-, which is part of inter-; it arises because in- was extracted first. This kind of error could be ameliorated by a merging postprocessing step. The suffixes are similarly good, although oddly the system did not find -s, which caused it to find several composite morphs, such as -ers and -ions; these can be resegmented into their components (-er+s and -ion+s) later.</Paragraph>
      <Paragraph position="3"> Model 1a also performs extremely well, for different values of a (we show just a = 1 and a = 2 in Fig. 3, for lack of space). Note that the morphs found by this model differ qualitatively from those found by Model 1, in that we get longer morphs more related to agglutination than to regular inflection patterns. This suggests that multiple statistical models should be used together to extract different facets of a language's morphological composition.</Paragraph>
      <Paragraph position="4"> Finally, morphs from the more complex Model 2 are given in Fig. 4. As in Model 1a, Model 2 gives more agglutinative morphs than inflective morphs, and has a greater tendency to segment complex morphs (such as -ification-), which presumably will later be resegmented into their component parts (e.g., -if+ic+at+ion). This may enable construction of hierarchical models of morphological composition in the future.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Turkish
</SectionTitle>
      <Paragraph position="0"> In addition to English, we tested the method's ability to extract meaningful morphs on a small corpus of Turkish texts from the Turkish Natural Language Processing Initiative (Oflazer, 2001), which consists of one foreign ministry press release, texts of two treaties, and three journal articles on translation.</Paragraph>
      <Paragraph position="1"> The corpus comprises 20,284 individual words, of which 5,961 are unique. Turkish is a highly agglutinative language, and hence a prime candidate for recursive morphological segmentation. Results for Models 1 and 2 are shown in Tables 5-8. Meaningful morphs were found using all models, with Model 2 finding longer morphs, as in English. We do note some issues with boundary letters for Model 2 prefixes, however.</Paragraph>
    </Section>
  </Section>
</Paper>