<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0603">
  <Title>Unsupervised Discovery of Morphemes</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
Introduction
</SectionTitle>
    <Paragraph position="0"> word for 'also for [the] coffee drinker'.</Paragraph>
    <Paragraph position="1"> Word kahvinjuojallekin Morphs kahvi n juo ja lle kin Transl. coffee of drink -er for also The problem is further compounded as languages evolve, new words appear and grammatical changes take place. Consequently, it is important to develop methods that are able to discover a morphology for a language based on unsupervised analysis of large amounts of data.</Paragraph>
    <Paragraph position="2"> As the morphology discovery from untagged corpora is a computationally hard problem, in practice one must make some assumptions about the structure of words. The appropriate specific assumptions are somewhat language-dedependent. For example, for English it may be useful to assume that words consist of a stem, often followed by a suffix and possibly preceded by a prefix. By contrast, a Finnish word typically consists of a stem followed by multiple suffixes. In addition, compound words are common, containing an alternation of stems and suffixes, e.g., the wordkahvinjuojallekin(Engl.</Paragraph>
    <Paragraph position="3"> 'also for [the] coffee drinker'; cf. Table 1)1. Moreover, one may ask, whether a morphologically complex word exhibits some hierarchical structure, or whether it is merely a flat concatenation of stems and suffices.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Previous Work on Unsupervised Segmentation
</SectionTitle>
      <Paragraph position="0"> Many existing morphology discovery algorithms concentrate on identifying prefixes, suffixes and stems, i.e., assume a rather simple inflectional morphology. null D'ejean (1998) concentrates on the problem of finding the list of frequent affixes for a language rather than attempting to produce a morphological analysis of each word. Following the work of Zellig Harris he identifies possible morpheme boundaries by looking at the number of possible letters following a given sequence of letters, and then utilizes frequency limits for accepting morphemes.</Paragraph>
      <Paragraph position="1"> 1For a comprehensive view of Finnish morphology, see (Karlsson, 1987).</Paragraph>
      <Paragraph position="2"> Goldsmith (2000) concentrates on stem+suffixlanguages, in particular Indo-European languages, and tries to produce output that would match as closely as possible with the analysis given by a human morphologist. He further assumes that stems form groups that he calls signatures, and each signature shares a set of possible affixes. He applies an MDL criterion for model optimization.</Paragraph>
      <Paragraph position="3"> The previously discussed approaches consider only individual words without regard to their contexts, or to their semantic content. In a different approach, Schone and Jurafsky (2000) utilize the context of each term to obtain a semantic representation for it using LSA. The division to morphemes is then accepted only when the stem and stem+affix are sufficiently similar semantically. Their method is shown to improve on the performance of Goldsmith's Linguistica on CELEX, a morphologically analyzed English corpus.</Paragraph>
      <Paragraph position="4"> In the related field of text segmentation, one can sometimes obtain morphemes. Some of the approaches remove spaces from text and try to identify word boundaries utilizing e.g. entropy-based measures, as in (Redlich, 1993).</Paragraph>
      <Paragraph position="5"> Word induction from natural language text without word boundaries is also studied in (Deligne and Bimbot, 1997; Hua, 2000), where MDL-based model optimization measures are used. Viterbi or the forward-backward algorithm (an EM algorithm) is used for improving the segmentation of the corpus2. null Also de Marcken (1995; 1996) studies the problem of learning a lexicon, but instead of optimizing the cost of the whole corpus, as in (Redlich, 1993; Hua, 2000), de Marcken starts with sentences.</Paragraph>
      <Paragraph position="6"> Spaces are included as any other characters.</Paragraph>
      <Paragraph position="7"> Utterances are also analyzed in (Kit and Wilks, 1999) where optimal segmentation for an utterance is sought so that the compression effect over the segments is maximal. The compression effect is measured in what the authors call Description Length Gain, defined as the relative reduction in entropy.</Paragraph>
      <Paragraph position="8"> The Viterbi algorithm is used for searching for the optimal segmentation given a model. The input ut2The regular EM procedure only maximizes the likelihood of the data. To follow the MDL approach where model cost is also optimized, Hua includes the model cost as a penalty term on pure ML probabilities.</Paragraph>
      <Paragraph position="9"> terances include spaces and punctuation as ordinary characters. The method is evaluated in terms of precision and recall on word boundary prediction.</Paragraph>
      <Paragraph position="10"> Brent presents a general, modular probabilistic model structure for word discovery (Brent, 1999).</Paragraph>
      <Paragraph position="11"> He uses a minimum representation length criterion for model optimization and applies an incremental, greedy search algorithm which is suitable for on-line learning such that children might employ.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Our Approach
</SectionTitle>
      <Paragraph position="0"> In this work, we use a model where words may consist of lengthy sequences of segments. This model is especially suitable for languages with agglutinative morphological structure. We call the segments morphs and at this point no distinction is made between stems and affixes.</Paragraph>
      <Paragraph position="1"> The practical purpose of the segmentation is to provide a vocabulary of language units that is smaller and generalizes better than a vocabulary consisting of words as they appear in text. Such a vocabulary could be utilized in statistical language modeling, e.g., for speech recognition. Moreover, one could assume that such a discovered morph vocabulary would correspond rather closely to linguistic morphemes of the language.</Paragraph>
      <Paragraph position="2"> We examine two methods for unsupervised learning of the model, presented in Sections 2 and 3. The cost function for the first method is derived from the Minimum Description Length principle from classic information theory (Rissanen, 1989), which simultaneously measures the goodness of the representation and the model complexity. Including a model complexity term generally improves generalization by inhibiting overlearning, a problem especially severe for sparse data. An incremental (online) search algorithm is utilized that applies a hierarchical splitting strategy for words. In the second method the cost function is defined as the maximum likelihood of the data given the model. Sequential splitting is applied and a batch learning algorithm is utilized.</Paragraph>
      <Paragraph position="3"> In Section 4, we develop a method for evaluating the quality of the morph segmentations produced by the unsupervised segmentation methods. Even though the morph segmentations obtained are not intended to correspond exactly to the morphemes of linguistic theory, a basis for comparison is provided by existing, linguistically motivated morphological analyses of the words.</Paragraph>
      <Paragraph position="4"> Both segmentation methods are applied to the segmentation of both Finnish and English words.</Paragraph>
      <Paragraph position="5"> In Section 5, we compare the results obtained from our methods to results produced by Goldsmith's Linguistica on the same data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>