<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-3002">
  <Title>A Statistical Model for Word Discovery in Transcribed Speech</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2. Related Work
</SectionTitle>
    <Paragraph position="0"> While there exists a reasonable body of literature regarding text segmentation, especially with respect to languages such as Chinese and Japanese that do not explicitly include spaces between words, most of the statistically based models and algorithms tend to fall into the supervised learning category. These require the model to be trained first on a large corpus of text before it can segment its input. 3 It is only recently that interest in unsupervised algorithms for text segmentation seems to have gained ground.</Paragraph>
    <Paragraph position="1"> A notable exception in this regard is the work by Ando and Lee (1999) which tries to infer word boundaries from character n-gram statistics of Japanese Kanji strings.</Paragraph>
    <Paragraph position="2"> For example, a decision to insert a word boundary between two characters is made solely based on whether character n-grams adjacent to the proposed boundary are relatively more frequent than character n-grams that straddle it. This algorithm, however, is not based on a formal statistical model and is closer in spirit to approaches based on transitional probability between phonemes or syllables in speech. One such approach derives from experiments by Saffran, Newport, and Aslin (1996) suggesting that young children might place word boundaries between two syllables where the second syllable is surprising given the first. This technique is described and evaluated in Brent (1999). Other approaches not based on explicit probability models include those based on information theoretic criteria such as minimum description length (Brent and Cartwright 1996; de Marcken 1995) and simple recurrent networks (Elman 1990; Christiansen, Allen, and Seidenberg 1998). The maximum likelihood approach due to Olivier (1968) is probabilistic in the sense that it is geared toward explicitly calculating the most probable segmentation of each block of input utterances (see also Batchelder 1997). However, the algorithm involves heuristic steps in periodic purging of the lexicon and in the creation in the lexicon of new words. Furthermore, this approach is again not based on a formal statistical model.</Paragraph>
    <Paragraph position="3"> Model Based Dynamic Programming, hereafter referred to as MBDP-1 (Brent 1999), is probably the most recent work that addresses exactly the same issue as that considered in this paper. Both the approach presented in this paper and Brent's MBDP-1 are unsupervised approaches based on explicit probability models. Here, we describe only Brent's MBDP-1 and direct the interested reader to Brent (1999) for an excellent review and evaluation of many of the algorithms mentioned above.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Brent's model-based dynamic programming method
</SectionTitle>
      <Paragraph position="0"> Brent (1999) describes a model-based approach to inferring word boundaries in child-directed speech. As the name implies, this technique uses dynamic programming to  infer the best segmentation. It is assumed that the entire input corpus, consisting of a concatenation of all utterances in sequence, is a single event in probability space and that the best segmentation of each utterance is implied by the best segmentation of the corpus itself. The model thus focuses on explicitly calculating probabilities for every possible segmentation of the entire corpus, and subsequently picking the segmentation with the maximum probability. More precisely, the model attempts to calculate</Paragraph>
      <Paragraph position="2"> for each possible segmentation of the input corpus where the left-hand side is the exact probability of that particular segmentation of the corpus into words Wm = WlW2 &amp;quot;'&amp;quot; Win; and the sums are over all possible numbers of words n, in the lexicon, all possible lexicons L, all possible frequencies f, of the individual words in this lexicon and all possible orders of words s, in the segmentation. In practice, the implementation uses an incremental approach that computes the best segmentation of the entire corpus up to step i, where the ith step is the corpus up to and including the ith utterance.</Paragraph>
      <Paragraph position="3"> Incremental performance is thus obtained by computing this quantity anew after each segmentation i - 1, assuming however, that segmentations of utterances up to but not including i are fixed.</Paragraph>
      <Paragraph position="4"> There are two problems with this approach. First, the assumption that the entire corpus of observed speech should be treated as a single event in probability space appears rather radical. This fact is appreciated even in Brent (1999), which states &amp;quot;From a cognitive perspective, we know that humans segment each utterance they hear without waiting until the corpus of all utterances they will ever hear becomes available&amp;quot; (p. 89). Thus, although the incremental algorithm in Brent (1999) is consistent with a developmental model, the formal statistical model of segmentation is not.</Paragraph>
      <Paragraph position="5"> Second, making the assumption that the corpus is a single event in probability space significantly increases the computational complexity of the incremental algorithm. The approach presented in this paper circumvents these problems through the use of a conservative statistical model that is directly implementable as an incremental algorithm. In the following section, we describe the model and how its 2-gram and 3-gram extensions are adapted for implementation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>