<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-2012">
  <Title>A Framework for Unsupervised Natural Language Morphology Induction</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Previous Work
</SectionTitle>
    <Paragraph position="0"> It is possible to organize much of the recent work on unsupervised morphology induction by considering the bias each approach has toward discovering morphologically related words that are also orthographically similar.</Paragraph>
    <Paragraph position="1"> Yarowsky et al. (2001), who acquire a morphological analyzer for a language by projecting the morphological analysis of a second language onto the first through a clever application of statistical machine translation style word alignment probabilities, place no constraints on the orthographic shape of related word forms.</Paragraph>
    <Paragraph position="2"> Next along the spectrum of orthographic similarity bias is the work of Schone and Jurafsky (2000; 2001), who first acquire a list of potential morphological variants using an orthographic similarity technique due to Gaussier (1999) in which pairs of words with the same initial string are identified.</Paragraph>
    <Paragraph position="3"> They then apply latent semantic analysis (LSA) to score the potential morphological variants with a semantic distance. Word forms with small semantic distance are proposed as morphological variants of one another.</Paragraph>
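    <Paragraph> As a minimal sketch of the first step only, the pairing of words with the same initial string can be illustrated as follows. The function name `candidate_pairs`, the fixed prefix length `min_prefix`, and the toy vocabulary are assumptions made for illustration; they are not taken from Gaussier (1999) or from Schone and Jurafsky, and the subsequent LSA scoring is not shown.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(vocabulary, min_prefix=4):
    """Pair up words that share their first `min_prefix` characters.

    `min_prefix` is a hypothetical parameter chosen for this sketch.
    """
    by_prefix = defaultdict(list)
    for word in sorted(set(vocabulary)):
        # Words shorter than the prefix length cannot share a full prefix.
        if len(word) >= min_prefix:
            by_prefix[word[:min_prefix]].append(word)

    pairs = []
    for words in by_prefix.values():
        # Every two words in the same prefix bucket form a candidate pair.
        pairs.extend(combinations(words, 2))
    return sorted(pairs)


# Toy vocabulary, invented for illustration.
toy_vocabulary = ["walk", "walks", "walked", "wall", "talk", "talks"]
toy_pairs = candidate_pairs(toy_vocabulary, min_prefix=4)
```

    Each resulting pair is only a candidate; in the approach described above, a semantic distance would then decide which candidates are kept.</Paragraph>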
    <Paragraph position="4"> Goldsmith (2001), by searching over a space of morphology models limited to substitution of suffixes, ties morphology yet closer to orthography. Segmenting word forms in a corpus, Goldsmith creates an inventory of stems and suffixes. Suffixes that can interchangeably concatenate onto a set of stems form a signature. After defining the space of signatures, Goldsmith searches for the choice of word segmentations that yields a local optimum in description length.</Paragraph>
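    <Paragraph> The notion of a signature can be illustrated with a small sketch. It assumes the word segmentations are already given as (stem, suffix) pairs and simply groups stems by the exact set of suffixes they take; the function name `signatures` and the toy data are hypothetical, and the minimum description length search itself is not implemented here.

```python
from collections import defaultdict


def signatures(segmentations):
    """Map each suffix set (signature) to the stems that take exactly that set.

    `segmentations` is an iterable of (stem, suffix) pairs; the empty
    string stands for the null suffix.
    """
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in segmentations:
        suffixes_by_stem[stem].add(suffix)

    stems_by_signature = defaultdict(list)
    for stem, suffixes in suffixes_by_stem.items():
        # A sorted tuple makes the suffix set usable as a dictionary key.
        stems_by_signature[tuple(sorted(suffixes))].append(stem)

    return {sig: sorted(stems) for sig, stems in stems_by_signature.items()}


# Toy segmentations, invented for illustration.
toy_segmentations = [
    ("jump", ""), ("jump", "s"), ("jump", "ed"),
    ("walk", ""), ("walk", "s"), ("walk", "ed"),
    ("cat", ""), ("cat", "s"),
]
toy_signatures = signatures(toy_segmentations)
```

    In this toy example "jump" and "walk" share one signature, while "cat" falls under a different one because it lacks the "ed" suffix.</Paragraph>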
    <Paragraph position="5"> Finally, the work of Harris (1955; 1967), and later Hafer and Weiss (1974), has direct bearing on the approach taken in this paper. Couched in modern terms, their work involves first building tries over a corpus vocabulary and then selecting, as morpheme boundaries, those character boundaries with a correspondingly high branching count in the tries.</Paragraph>
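    <Paragraph> The branching-count idea can be sketched as follows. A dictionary keyed by prefix stands in for the trie, since each prefix corresponds to one trie node and its distinct following characters to that node's branches. The end-of-word marker `END`, the function names, the toy vocabulary, and the fixed `threshold` are all assumptions made for illustration, not criteria taken from the cited papers.

```python
from collections import defaultdict

END = "#"  # hypothetical end-of-word marker, assumed absent from the vocabulary


def successor_counts(vocabulary):
    """Count, for every prefix, the distinct characters that can follow it."""
    successors = defaultdict(set)
    for word in vocabulary:
        marked = word + END
        for i in range(len(marked)):
            # marked[:i] is a trie node; marked[i] is one branch out of it.
            successors[marked[:i]].add(marked[i])
    return {prefix: len(chars) for prefix, chars in successors.items()}


def boundaries(word, counts, threshold):
    """Return split positions strictly inside `word` whose prefix has a
    branching count of at least `threshold`."""
    return [i for i in range(1, len(word)) if counts[word[:i]] >= threshold]


# Toy vocabulary, invented for illustration.
toy_vocabulary = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]
toy_counts = successor_counts(toy_vocabulary)
toy_splits = boundaries("walking", toy_counts, threshold=3)
```

    In the toy vocabulary the prefix "walk" can be followed by the end marker, "s", "e", or "i", so position 4 in "walking" is proposed as a boundary, giving "walk" plus "ing".</Paragraph>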
    <Paragraph position="6"> The work in this paper also has a strong bias toward discovering morphologically related words that share a similar orthography. In particular, the morphology model I use is, akin to Goldsmith's, limited to suffix substitution. The novel proposal I bring to the table, however, is a formalization of the full search space of all candidate inflection classes. With this framework in place, defining search strategies for morpheme discovery becomes a natural and straightforward activity.</Paragraph>
  </Section>
</Paper>