<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0504"> <Title>The SED heuristic for morpheme discovery: a look at Swahili</Title> <Section position="3" start_page="0" end_page="2" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper describes work on a technique for the unsupervised learning of the morphology of natural languages. The technique employs the familiar string edit distance (SED) algorithm (Wagner and Fischer 1974 and elsewhere) in its first stage; we refer to it here as the SED heuristic.</Paragraph> <Paragraph position="1"> The heuristic finds 3- and 4-state finite-state automata (FSAs) from untagged corpora. We focus on Swahili, a Bantu language of East Africa, because of its very high average number of morphemes per word, especially in the verbal system, which presents a real challenge to other systems discussed in the literature.</Paragraph> <Paragraph position="2"> In Section 2, we present the SED heuristic, with precision and recall figures for its application to a corpus of Swahili. In Section 3, we propose three elaborations and extensions of this approach, and in Section 4, we describe and evaluate the results from applying these extensions to the corpus of Swahili.</Paragraph> <Paragraph position="3"> An earlier version of this paper, with a more detailed discussion of the material presented in Section 3, is available as Goldsmith et al. (2005).</Paragraph> </Section> </Paper>