File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1152_intro.xml
Size: 1,937 bytes
Last Modified: 2025-10-06 14:02:13
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1152"> <Title>Efficient Unsupervised Recursive Word Segmentation Using Minimum Description Length</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1.2 Related work </SectionTitle> <Paragraph position="0"> Several systems for unsupervised learning of morphology have been developed over the last decade or so. Déjean (1998), extending ideas in Harris (1955), describes a system that finds the most frequent affixes in a language and identifies possible morpheme boundaries using frequency bounds on the number of distinct characters that can follow a given character sequence. Brent et al. (1995) give an information-theoretic method for discovering meaningful affixes, which was later extended to enable a novel search algorithm based on a probabilistic word-generation model (Snover et al., 2002). Goldsmith (2001) gives a comprehensive heuristic algorithm for unsupervised morphological analysis, which uses an MDL criterion to segment words and find morphological paradigms (called signatures). Similarly, Creutz and Lagus (2002) use an MDL formulation for word segmentation. All of these approaches assume a stem+affix morphological paradigm.</Paragraph> <Paragraph position="1"> Further, the above approaches consider only the information in words' character sequences to improve morphological segmentation, and do not consider syntactic or semantic context. Schone and Jurafsky (2000) extend this by using latent semantic analysis (Dumais et al., 1988) to require that a word be sufficiently semantically similar to its proposed stem before the stem+affix split is accepted. A conceptually similar approach is taken by Baroni et al. (2002), who combine edit distance, as a measure of orthographic similarity, with mutual information, as a measure of semantic similarity, to determine morphologically related word pairs.</Paragraph> </Section> </Paper>