<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0504">
  <Title>The SED heuristic for morpheme discovery: a look at Swahili</Title>
  <Section position="3" start_page="0" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes work on a technique for the unsupervised learning of the morphology of natural languages which employs the familiar string edit distance (SED) algorithm (Wagner and Fischer 1974 and elsewhere) in its first stage; we refer to it here as the SED heuristic.</Paragraph>
    <Paragraph position="1"> The heuristic induces 3- and 4-state finite state automata (FSAs) from untagged corpora. We focus on Swahili, a Bantu language of East Africa, because of its very high average number of morphemes per word, especially in the verbal system, which presents a real challenge to other approaches discussed in the literature.</Paragraph>
    <Paragraph position="2"> In Section 2, we present the SED heuristic, with precision and recall figures for its application to a corpus of Swahili. In Section 3, we propose three elaborations and extensions of this approach, and in Section 4, we describe and evaluate the results from applying these extensions to the corpus of Swahili.</Paragraph>
    <Paragraph position="3"> An earlier version of this paper, with a more detailed discussion of the material presented in Section 3, is available at Goldsmith et al. (2005).</Paragraph>
  </Section>
</Paper>