A Statistical Model for Word Discovery in Transcribed Speech

3. Model Description

The language model described here is a fairly standard one. The interested reader is referred to Jelinek (1997, 57-78), where a detailed exposition can be found. Basically, we seek

$$
\hat{W} \;=\; \arg\max_{W} P(W) \;=\; \arg\max_{W} \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}),
$$

where W = w_1, ..., w_n with w_i ∈ L denotes a particular string of n words belonging to a lexicon L.

The usual n-gram approximation is made by grouping histories w_1, ..., w_{i-1} into equivalence classes, allowing us to collapse contexts into histories at most n - 1 words backwards (for n-grams). Estimates of the required n-gram probabilities are then obtained from relative frequencies, using back-off to lower-order n-grams when a higher-order estimate is not reliable enough (Katz 1987). Back-off is done using the Witten and Bell (1991) technique, which allocates a probability of N_i/(N_i + S_i) to unseen i-grams at each stage, with the final back-off from unigrams being to an open vocabulary in which word probabilities are calculated as a normalized product of phoneme or letter probabilities. Here, N_i is the number of distinct i-grams and S_i is the sum of their frequencies. The model can be summarized as follows:

$$
P(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-2}\,w_{i-1}\,w_i)}{C(w_{i-2}\,w_{i-1})} & \text{if } C(w_{i-2}\,w_{i-1}\,w_i) > 0 \\[2ex]
\dfrac{N_3}{N_3 + S_3}\; P(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
\tag{5}
$$

$$
P(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1}\,w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}\,w_i) > 0 \\[2ex]
\dfrac{N_2}{N_2 + S_2}\; P(w_i) & \text{otherwise}
\end{cases}
\tag{6}
$$

$$
P(w_i) =
\begin{cases}
\dfrac{C(w_i)}{N_1 + S_1} & \text{if } C(w_i) > 0 \\[2ex]
\dfrac{N_1}{N_1 + S_1} \cdot \dfrac{r(\#)}{1 - r(\#)} \displaystyle\prod_{j=1}^{k_i} r(w_i[j]) & \text{otherwise}
\end{cases}
\tag{7}
$$

where C(·) denotes the count or frequency function, k_i denotes the length of word w_i excluding the sentinel character #, w_i[j] denotes its jth phoneme, and r(·) denotes the relative frequency function. The normalization by dividing by 1 - r(#) in Equation (7) is necessary because otherwise

$$
\sum_{w} r(\#) \prod_{j=1}^{k_w} r(w[j]) \;=\; \sum_{k \geq 1} \bigl(1 - r(\#)\bigr)^{k}\, r(\#) \;=\; 1 - r(\#) \;<\; 1,
$$

where the outer sum ranges over all nonempty phoneme strings w and k_w denotes the length of w. Since we estimate P(w[j]) by r(w[j]), dividing by 1 - r(#) will ensure that Σ_w P(w) = 1.

4. Method

As in Brent (1999), the model described in Section 3 is presented as an incremental learner. The only knowledge built into the system at start-up is the phoneme table, with a uniform distribution over all phonemes, including the sentinel phoneme. The learning algorithm considers each utterance in turn and computes the most probable segmentation of the utterance using a Viterbi search (Viterbi 1967) implemented as a dynamic programming algorithm, as described in Section 4.2. The most likely placement of word boundaries thus computed is committed to before the algorithm considers the next utterance. Committing to a segmentation involves learning unigram, bigram, and trigram frequencies, as well as phoneme frequencies, from the inferred words; these are used to update the respective tables.
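To make the procedure concrete, the following is a minimal Python sketch of this incremental loop. The class and function names (NGramTables, learn, best_segmentation) are illustrative, not those of the original implementation: each utterance is segmented with the current model, the segmentation is committed to, and the n-gram and phoneme tables are updated before the next utterance is considered.

```python
from collections import Counter

class NGramTables:
    """Frequency tables for unigrams, bigrams, trigrams, and phonemes (illustrative sketch)."""

    def __init__(self, phoneme_inventory):
        # phoneme_inventory is assumed to include the sentinel phoneme '#'.
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.trigrams = Counter()
        # Start-up knowledge: a uniform distribution over all phonemes,
        # including the sentinel, represented here as equal counts.
        self.phonemes = Counter({p: 1 for p in phoneme_inventory})

    def update(self, words):
        """Commit to a segmentation: learn n-gram and phoneme frequencies from the inferred words."""
        self.unigrams.update(words)
        self.bigrams.update(zip(words, words[1:]))
        self.trigrams.update(zip(words, words[1:], words[2:]))
        for w in words:
            self.phonemes.update(w)   # every phoneme of the inferred word
            self.phonemes['#'] += 1   # plus one sentinel per inferred word


def learn(utterances, phoneme_inventory, best_segmentation):
    """Incremental learner: segment each utterance with the current model,
    commit to that segmentation, and update the tables before moving on."""
    tables = NGramTables(phoneme_inventory)
    segmentations = []
    for utterance in utterances:
        words = best_segmentation(utterance, tables)  # Viterbi/DP search (Section 4.2)
        tables.update(words)
        segmentations.append(words)
    return segmentations, tables
```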
To account for effects that any specific ordering of input utterances may have on the segmentations that are output, the performance of the algorithm is averaged over 1,000 runs, with each run receiving as input a random permutation of the input corpus.

4.1 The input corpus

The corpus, which is identical to the one used by Brent (1999), consists of orthographic transcripts made by Bernstein-Ratner (1987) from the CHILDES collection (MacWhinney and Snow 1985). The speakers in this study were nine mothers speaking freely to their children, whose ages averaged 18 months (range 13-21). Brent and his colleagues transcribed the corpus phonemically (using the ASCII phonemic representation given in the appendix to this paper), minimizing the number of subjective judgments about the pronunciation of words by transcribing every occurrence of the same word identically. For example, "look", "drink", and "doggie" were always transcribed "lUk", "driNk", and "dOgi", regardless of where in the utterance they occurred and which mother uttered them in what way. Thus transcribed, the corpus consists of a total of 9,790 utterances and 33,399 words, and includes one space after each word and one newline after each utterance. For purposes of illustration, Table 1 lists the first 20 utterances from a random permutation of this corpus.

It should be noted that the choice of this particular corpus for experimentation is motivated purely by its use in Brent (1999). As reviewers of an earlier version of this paper have pointed out, the algorithm is equally applicable to plain text in English or other languages. The main advantage of the CHILDES corpus is that it allows ready comparison with results hitherto obtained and reported in the literature. Indeed, the relative performance of all the algorithms discussed is mostly unchanged when they are tested on the 1997 Switchboard telephone speech corpus with disfluency events removed.

4.2 Algorithm

The dynamic programming algorithm finds the most probable word sequence for each input utterance by assigning to each segmentation a score equal to its probability and committing to the segmentation with the highest score. In practice, the implementation computes the negative logarithm of this score and thus commits to the segmentation with the smallest negative log probability.

Figure 1: Recursive optimization algorithm to find the best segmentation of an input utterance using the unigram language model described in this paper.

Figure 2: The function to compute -log P(w) of an input word w, passed by reference as w[0..k], where the w[i] are its phonemes. L stands for the lexicon object. If the word is novel, the function backs off to a distribution over the phonemes in the word.
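For illustration, the following is a rough Python sketch of the word-scoring step of Figure 2 for the unigram case, assuming the Witten-Bell-style back-off summarized in Section 3. The function name neg_log_prob_word and the Counter-based tables are assumptions of this sketch, not details of the paper's implementation.

```python
import math

def neg_log_prob_word(word, unigrams, phonemes):
    """Sketch of the word scoring in Figure 2: return -log P(word) under the
    unigram model of Section 3. 'unigrams' and 'phonemes' are Counter-like
    frequency tables; '#' is the sentinel phoneme and is assumed to be present
    in 'phonemes', along with every phoneme of 'word'."""
    n1 = len(unigrams)               # N1: number of distinct words seen so far
    s1 = sum(unigrams.values())      # S1: sum of their frequencies
    if unigrams[word] > 0:
        # Familiar word: Witten-Bell-discounted relative frequency.
        return -math.log(unigrams[word] / (n1 + s1))
    # Novel word: back off to a normalized product of phoneme probabilities.
    total = sum(phonemes.values())
    r = {p: c / total for p, c in phonemes.items()}        # relative frequencies
    log_p = math.log(n1 / (n1 + s1)) if n1 > 0 else 0.0    # back-off weight (no discount before any word is known)
    for ph in word:
        log_p += math.log(r[ph])
    log_p += math.log(r['#']) - math.log(1.0 - r['#'])     # sentinel and the 1 - r(#) normalization
    return -log_p
```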
The algorithm for the unigram language model is presented in recursive form in Figure 1 for readability. The actual implementation, however, used an iterative version. The algorithm to evaluate the back-off probability of a word is given in Figure 2. Algorithms for the bigram and trigram language models are straightforward extensions of that given for the unigram model.

Essentially, the algorithm can be summed up semiformally as follows: For each input utterance u, we evaluate every possible way of segmenting it as u = u' + w, where u' is a subutterance extending from the beginning of the original utterance up to some point within it and w, the lexical difference between u and u', is treated as a word. The subutterance u' is itself evaluated recursively using the same algorithm. The base case for the recursion, where the algorithm rewinds, is reached when a subutterance cannot be split further into a smaller subutterance and word, that is, when its length is zero. Suppose, for example, that a given utterance is abcde, where the letters represent phonemes. If seg(x) represents the best segmentation of the utterance x and word(x) denotes that x is treated as a word, then

    seg(abcde) = best of { word(abcde),
                           seg(a)    + word(bcde),
                           seg(ab)   + word(cde),
                           seg(abc)  + word(de),
                           seg(abcd) + word(e) },

where "best" denotes the candidate with the highest probability under the current language model. The evalUtterance algorithm in Figure 1 does precisely this. It initially assumes the entire input utterance to be a word on its own by assuming a single segmentation point at its right end. It then compares the log probability of this segmentation successively to the log probabilities of segmenting it into all possible subutterance-word pairs.

The implementation maintains four separate tables internally, one each for unigrams, bigrams, and trigrams, and one for phonemes. When the procedure is initially started, all the internal n-gram tables are empty; only the phoneme table is populated, with equipossible phonemes. As the program considers each utterance in turn and commits to its best segmentation according to the evalUtterance algorithm, the various internal n-gram tables are updated correspondingly. For example, after some utterance abcde is segmented into a bc de, the unigram table is updated to increment the frequencies of the three entries a, bc, and de, each by 1; the bigram table to increment the frequencies of the adjacent bigrams a bc and bc de; and the trigram table to increment the frequency of the trigram a bc de. Furthermore, the phoneme table is updated to increment the frequencies of each of the phonemes in the utterance, including one sentinel for each word inferred. Of course, incrementing the frequency of a currently unknown n-gram is equivalent to creating a new entry for it with frequency 1. Note that the very first utterance is necessarily segmented as a single word: since all the n-gram tables are empty when the algorithm attempts to segment it, all probabilities are necessarily computed from the level of phonemes up, and so the more words a candidate segmentation of the first utterance contains, the more sentinel characters are included in the probability calculation and the lower the corresponding segmentation probability. As the program works its way through the corpus, n-grams inferred correctly, by virtue of their relatively greater preponderance compared to noise, tend to dominate their respective n-gram distributions and thus dictate how future utterances are segmented.
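The following is a minimal iterative sketch, in the same spirit, of the dynamic programming search that evalUtterance performs for the unigram model. It reuses the hypothetical neg_log_prob_word and NGramTables from the sketches above and omits the bigram and trigram conditioning of the full model.

```python
def best_segmentation(utterance, tables):
    """Iterative dynamic programming version of the evalUtterance search:
    find the segmentation of 'utterance' (a string of phonemes) with the
    smallest total -log probability under the unigram model. Reuses the
    hypothetical neg_log_prob_word() sketched above."""
    n = len(utterance)
    best_cost = [0.0] + [float('inf')] * n    # best_cost[i]: best score of utterance[:i]
    back_ptr = [0] * (n + 1)                  # start index of the last word in the best split of utterance[:i]
    for end in range(1, n + 1):
        for start in range(end):
            cost = best_cost[start] + neg_log_prob_word(
                utterance[start:end], tables.unigrams, tables.phonemes)
            if cost < best_cost[end]:
                best_cost[end] = cost
                back_ptr[end] = start
    # Follow the back-pointers from the right end to recover the inferred words.
    words, end = [], n
    while end > 0:
        start = back_ptr[end]
        words.append(utterance[start:end])
        end = start
    return list(reversed(words))
```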
One can easily see that the running time of the program is O(mn^2) in the total number of utterances (m) and the length of each utterance (n), assuming an efficient hash table implementation with nearly constant lookup time is available. Since individual utterances typically tend to be short, especially in child-directed speech, as evidenced in Table 1, the algorithm is in practice close to a linear-time procedure. A single run over the entire corpus typically completes in under 10 seconds on a 300 MHz i686-based PC running Linux 2.2.5-15.

Although the algorithm is presented as an unsupervised learner, a further experiment to test the responsiveness of each algorithm to training data is also described here. The procedure involves reserving increasing amounts of the input corpus for training, starting from 0% and increasing in steps of approximately 1% (100 utterances). During the training period, the algorithm is presented with the correct segmentation of each input utterance, which it uses to update trigram, bigram, unigram, and phoneme frequencies as required. After the initial training segment of the input corpus has been considered, subsequent utterances are processed in the normal way.

4.3 Scoring

In line with the results reported in Brent (1999), three scores are reported: precision, recall, and lexicon precision. Precision is defined as the proportion of predicted words that are actually correct. Recall is defined as the proportion of correct words that were predicted. Lexicon precision is defined as the proportion of words in the predicted lexicon that are correct. In addition, the numbers of correct and incorrect words in the predicted lexicon were computed, but they are not graphed here because lexicon precision is a good indicator of both.

Precision and recall scores were computed incrementally and cumulatively within scoring blocks, each consisting of 100 utterances. These scores were computed and averaged only over the utterances within each block, and thus represent the performance of the algorithm only on that block, occurring in its exact context among the other scoring blocks. Lexicon scores were carried over blocks cumulatively. In cases where the algorithm used varying amounts of training data, precision, recall, and lexicon precision scores are computed over the entire corpus. All scores are reported as percentages.
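As a concrete reading of these definitions, the Python sketch below computes precision and recall for one scoring block, plus lexicon precision. It assumes that a predicted word counts as correct when it spans exactly the same phonemes as a word in the reference segmentation; this interpretation of "correct", and the function names used here, are assumptions of the sketch rather than details taken from the paper.

```python
def word_spans(words):
    """Convert a segmentation (a list of words) into the set of
    (start, end) phoneme spans those words occupy in the utterance."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def score_block(predicted, reference):
    """Precision and recall (as percentages) over one 100-utterance scoring
    block. 'predicted' and 'reference' are lists of segmentations, one list
    of words per utterance; a predicted word is correct when it matches a
    reference word spanning the same phonemes (an assumption of this sketch)."""
    correct = num_predicted = num_reference = 0
    for pred_words, ref_words in zip(predicted, reference):
        pred, ref = word_spans(pred_words), word_spans(ref_words)
        correct += len(pred & ref)
        num_predicted += len(pred)
        num_reference += len(ref)
    precision = 100.0 * correct / num_predicted if num_predicted else 0.0
    recall = 100.0 * correct / num_reference if num_reference else 0.0
    return precision, recall


def lexicon_precision(predicted_lexicon, true_lexicon):
    """Proportion (as a percentage) of distinct words in the predicted
    lexicon that also occur in the true lexicon."""
    if not predicted_lexicon:
        return 0.0
    correct = sum(1 for w in predicted_lexicon if w in true_lexicon)
    return 100.0 * correct / len(predicted_lexicon)
```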