<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2048">
<Title>Some Applications of Tree-based Modelling to Speech and Language</Title>
<Section position="3" start_page="343" end_page="345" type="metho">
<SectionTitle> 3. Stop Classification </SectionTitle>
<Paragraph position="0"> The first application has already been partially introduced. Figure 4 shows a more complete tree than Figure 1 for deciding whether a stop is voiced or voiceless. This tree size was selected for the reasons given above.</Paragraph>
<Paragraph position="1"> This tree was grown from 3313 stop+vowel examples taken from male speakers in the TIMIT database. The classification task is to decide whether a given stop was labelled voiced (b, d, g) or unvoiced (p, t, k) by the TIMIT transcribers.</Paragraph>
<Paragraph position="2"> The features (possible predictors) considered were: The first three features were computed directly from the TIMIT labellings. The zero-crossing rate was the mean rate between release and onset. The formant and F0 values were computed using David Talkin's formant and pitch extraction programs [Talkin 1987].</Paragraph>
<Paragraph position="3"> Coding the phonetic context required special consideration, since more than 50 phones (using the TIMIT labelling) can precede a stop in this context. If this were treated as a single feature, more than 2^50 binary partitions would have to be considered for this variable at each node, clearly making the approach impractical. Chou [1987] proposes one solution, which is to use k-means clustering to find sub-optimal, but good, partitions in linear complexity.</Paragraph>
<Paragraph position="4"> The solution adopted here is to classify each phone in terms of four features: consonant manner, consonant place, "vowel manner", and "vowel place", each taking on about a dozen values. Consonant manner takes on the usual values such as voiced fricative, unvoiced stop, nasal, etc. Consonant place takes on values such as bilabial, dental, velar, etc. "Vowel manner" takes on values such as monophthong, diphthong, glide, liquid, etc., and "vowel place" takes on values such as front-low, central-mid-high, back-high, etc. All four can take on the value "n/a" if they do not apply; e.g., when a vowel is being represented, consonant manner and place are assigned "n/a". In this way, every segment is decomposed into four multi-valued features that have acceptable complexity for the classification scheme and that have some phonetic justification.</Paragraph>
<Paragraph position="5"> The tree in Figure 4 correctly classifies about 91% of the stops as voiced or voiceless. All percent figures quoted in this paper are cross-validated unless otherwise indicated; in other words, they are measured on data distinct from the training data.</Paragraph>
<Paragraph position="6"> In an informal experiment, the author listened to 1000 of these stops, cut at 30 msec before the stop closure and 30 msec after the vowel onset. He correctly classified 90% of these stops as voiced or voiceless. This suggests that the input features selected were appropriate to the task and that the classification tree was a reasonable structure for exploiting the information they carry.</Paragraph>
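To make the coding concrete, here is a minimal sketch in Python of the four-feature phone decomposition; the table entries are illustrative guesses, not the paper's actual coding, and a real table would cover all 50+ TIMIT phones:

```python
# A minimal sketch of the four-feature phone decomposition described above.
# The table entries are illustrative guesses, not the paper's actual coding.
from typing import NamedTuple

class PhoneFeatures(NamedTuple):
    cons_manner: str   # e.g. "voiced-fricative", "unvoiced-stop", "nasal", or "n/a"
    cons_place: str    # e.g. "bilabial", "dental", "velar", or "n/a"
    vowel_manner: str  # e.g. "monophthong", "diphthong", "glide", "liquid", or "n/a"
    vowel_place: str   # e.g. "front-low", "central-mid-high", "back-high", or "n/a"

# Hypothetical entries for a few TIMIT phones; a real table covers all 50+.
PHONE_TABLE = {
    "b":  PhoneFeatures("voiced-stop",   "bilabial", "n/a",         "n/a"),
    "t":  PhoneFeatures("unvoiced-stop", "alveolar", "n/a",         "n/a"),
    "m":  PhoneFeatures("nasal",         "bilabial", "n/a",         "n/a"),
    "iy": PhoneFeatures("n/a",           "n/a",      "monophthong", "front-high"),
    "ay": PhoneFeatures("n/a",           "n/a",      "diphthong",   "central-low"),
    "l":  PhoneFeatures("n/a",           "n/a",      "liquid",      "n/a"),
}

def decompose(phone: str) -> PhoneFeatures:
    """Map one phone label to four modest-cardinality categorical features.

    A tree grower then searches binary partitions of roughly a dozen values
    per feature instead of the ~2^50 partitions of one 50-valued feature.
    """
    return PHONE_TABLE[phone]

print(decompose("t"))  # PhoneFeatures(cons_manner='unvoiced-stop', ...)
```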
<Paragraph position="7"> 4. Segment duration modelling for speech synthesis
400 utterances from a single speaker and 4000 utterances from 400 speakers (the TIMIT database) of American English were used separately to build regression trees that predict segment durations based on the following features:
* Segment Context:
-- Segment to predict
-- Segment to left
-- Segment to right
* Stress (0, 1, 2)
* Word Frequency (rel. 25M AP words)
* Lexical Position:
-- Segment count from start of word
-- Segment count from end of word
-- Vowel count from start of word
-- Vowel count from end of word
* Phrasal Position:
-- Segment count from start of phrase
-- Segment count from end of phrase
* Dialect: N, S, NE, W, SMid, NMid, NYC, Army Brat
* Speaking Rate (rel. to calibration sentences)
The coding of each segment was decomposed into the four features described above. The word frequency was included as a crude function-word detector and was based on six months of AP news text. The last two features were used only for the multi-speaker database. The stress was obtained from a dictionary (which is easy, but imperfect). The dialect information was provided with the TIMIT database. The speaking rate is specified relative to the mean duration of the two calibration sentences, which were spoken by every speaker.
Over 70% of the durational variance for the single speaker, and over 60% for the multiple speakers, was accounted for by these trees. Figure 5 shows durations and duration residuals for all the segments together. Figure 6 shows these broken down into particular phonetic classes. The large tree sizes here, many hundreds of nodes, make the trees themselves uninteresting to display.</Paragraph>
<Paragraph position="8"> These trees were used to derive durations for a text-to-speech synthesizer and were often found to give more faithful results than the existing heuristically derived duration rules [cf. Klatt 1976]. Since tree building and evaluation are rapid once the data are collected and the candidate features specified, this technique can be readily applied to other feature sets and to other languages.</Paragraph>
</Section>
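As a rough illustration of the procedure, the sketch below grows such a duration regression tree using scikit-learn's CART implementation rather than the paper's original software; the data rows, feature layout, and parameters are invented stand-ins, and a real feature vector would also carry the lexical and phrasal position counts:

```python
# Sketch: growing a segment-duration regression tree with scikit-learn's CART
# implementation (not the paper's original software). Rows and durations are
# invented; a real feature vector would also carry the positional counts.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical rows: (segment, segment to left, segment to right, stress).
# "#" marks a phrase boundary.
X_raw = np.array([
    ["t",  "s",  "iy", "1"],
    ["iy", "t",  "n",  "1"],
    ["n",  "iy", "#",  "0"],
    ["s",  "#",  "t",  "0"],
])
y = np.array([0.062, 0.145, 0.071, 0.118])  # observed durations in seconds

# CART needs numeric inputs, so one-hot code the categorical features.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_raw)

tree = DecisionTreeRegressor(min_samples_leaf=2).fit(X, y)

# R^2, i.e. the fraction of durational variance accounted for, analogous to
# the 60-70% figures above (which were measured on held-out data).
print(tree.score(X, y))
```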
<Section position="4" start_page="345" end_page="351" type="metho">
<SectionTitle> 5. Phoneme-to-phone prediction </SectionTitle>
<Paragraph position="0"> The task here is: given a phonemic transcription of an utterance (e.g., based on dictionary lookup), predict the phonetic realization produced by a speaker [see also Lucassen et al. 1984; Chou 1987]. For example, when will a T be released or flapped? Figure 7 shows a tree grown to decide this question based on the TIMIT database. The features used for this tree, and for a larger tree made for all phones, were:
* Phonemic Context:
-- Phoneme to predict
-- Three phonemes to left
-- Three phonemes to right
* Stress (0, 1, 2)
* Word Frequency (rel. 25M AP words)
* Dialect: N, S, NE, W, SMid, NMid, NYC, Army Brat
* Lexical Position:
-- Phoneme count from start of word
-- Phoneme count from end of word
* Phonetic Context: phone predicted to left
The phonemic context was coded in a seven-segment window centered on the phoneme to realize, again using the four-feature decomposition described above (and labelled cm-3, cm-2, ..., cm3; cp-3, cp-2, ..., cp3; etc.). The other features are similar to those in the duration prediction problem. Ignore the last feature for the moment.
In Figure 7, we can see several cases of how a phonemic T is realized. The first split is roughly whether the segment after the T is a vowel or a consonant. If we take the right branch, the next split (to Terminal Node 7) indicates that a T before a nasal, R, or L is almost always unreleased (2090 out of 2250 cases). Terminal Node 11 indicates that if the segment preceding the T is a stop or "blank" (beginning of utterance), the T closure is unlabelled, which is the convention adopted by the transcribers. Terminal Node 20 indicates that an intervocalic T preceding an unstressed vowel is often flapped.</Paragraph>
<Paragraph position="1"> This tree predicts about 75% of the phonetic realizations of T correctly. The much larger tree for all phonemes predicts on average 84% of the TIMIT labellings exactly. A large percentage of the errors are on the precise labelling of reduced vowels as either IX or AX.</Paragraph>
<Paragraph position="2"> A list of alternative phonetic realizations can also be produced from the tree, since the relative frequencies of the different phones appearing at a given terminal node can be computed. Figure 8 shows such a listing for the utterance "Would your name be Tom?". It indicates, for example, that the D in "would" is most likely uttered as a DCL JH in this context (59% of the time), followed by DCL D (28%). On average, four alternatives per phoneme are sufficient to cover 99% of the possible phonetic realizations. This can be used, for example, to greatly constrain the number of alternatives that must be considered in automatic segmentation when the orthography is known.</Paragraph>
<Paragraph position="3"> These a priori probabilities, however, take into account only the phonemic context, not the phonetic context. For example, if DCL JH is uttered for the D in the example in Figure 8, then the Y is most likely deleted and not uttered. However, averaged over all realizations of the D (DCL JH, DCL D, etc.), the most probable outcome in that phonemic context is that the Y is uttered. The point is that to capture facts such as "D goes to DCL JH implies Y usually deletes", transition probabilities must be taken into account. This can be done by including an additional feature for the phonetic identity of the previous segment (the last feature in the list above). The output listing then becomes a transition matrix for each phoneme, and the best path through such a lattice can be found by dynamic programming, as sketched below. This approach would give the best results for an automatic segmenter. Coupled with a dictionary, it can also be used for letter-to-sound rules for a synthesizer (when the entry is present in the dictionary).</Paragraph>
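The dynamic programming step can be made concrete with a small Viterbi-style sketch over such a lattice; the candidate phones and transition probabilities below are invented for illustration (loosely echoing the "Would your" example), not taken from the paper's tables:

```python
# Sketch: Viterbi search for the best path through a phoneme-by-phoneme
# lattice of phone alternatives, scored with transition probabilities as the
# text proposes. The probabilities are invented for illustration.
import math

# One dict per phoneme slot: candidate phone -> {previous phone: P(phone|prev)}.
# "#" is the start of the utterance; "-" stands for a deleted segment.
lattice = [
    {"dcl jh": {"#": 0.59}, "dcl d": {"#": 0.28}},   # the D of "would your"
    {"y": {"dcl jh": 0.05, "dcl d": 0.90},           # Y usually deletes after DCL JH
     "-": {"dcl jh": 0.95, "dcl d": 0.10}},
]

def best_path(lattice):
    """Return the maximum log-probability phone sequence through the lattice."""
    scores = {"#": 0.0}  # log-prob of the best path ending in each phone
    back = []            # per-slot backpointers
    for slot in lattice:
        new_scores, pointers = {}, {}
        for phone, trans in slot.items():
            candidates = [(scores[prev] + math.log(p), prev)
                          for prev, p in trans.items() if prev in scores]
            if candidates:
                new_scores[phone], pointers[phone] = max(candidates)
        scores = new_scores
        back.append(pointers)
    phone = max(scores, key=scores.get)  # best final phone
    path = [phone]
    for pointers in reversed(back[1:]):  # trace back, stopping before "#"
        path.insert(0, pointers[path[0]])
    return path

print(best_path(lattice))  # ['dcl jh', '-']: the Y deletes after DCL JH
```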
<Paragraph position="4"> The effect of using the TIMIT database for this latter purpose is a somewhat folksy-sounding synthesizer. Having the D in "Would your" uttered as a JH may be correct for fluent English, but it sounds a bit forced for existing synthesizers. Too much else is wrong! A very carefully uttered database by a professional speaker would give better results for this application of the phoneme-to-phone tree.</Paragraph>
<Paragraph position="5"> 6. End of sentence detection
As a final example, consider the not-so-simple problem of deciding when a period in text corresponds to the end of a declarative sentence. While a period, by convention, must occur at the end of a declarative sentence, one can also occur in an abbreviation. Abbreviations can also occur at the end of a sentence! The two-space rule after an end stop is often ignored and is entirely absent from many text sources (e.g., the AP news). The tagged Brown corpus of a million words indicates that about 90% of periods occur at the end of a sentence, about 10% at the end of an abbreviation, and about 1/2% as both.</Paragraph>
<Paragraph position="6"> The following features were used to generate a classification tree for this task: --e.g., month name, unit-of-measure, title, address name, etc.</Paragraph>
<Paragraph position="7"> The choice of these features was based on what humans appear to use (at least when constrained to looking at a few words around the "."). Questions such as "Is the word after the '.' capitalized?", "Is the word with the '.' a common abbreviation?", "Is the word after the '.' likely to be found at the beginning of a sentence?", etc. can be answered with these features.</Paragraph>
<Paragraph position="8"> The word probabilities indicated above were computed from the 25 million words of AP news, a much larger (and independent) text database. (In fact, these probabilities were for the beginnings and ends of paragraphs, since these are explicitly marked in the AP, while ends of sentences, in general, are not.) The resulting classification tree correctly identifies whether a word ending in a "." is at the end of a declarative sentence in the Brown corpus with 99.8% accuracy. The majority of the errors are on difficult cases, e.g., a sentence that ends with "Mrs." or begins with a numeral (it can happen!). A sketch of the kind of feature coding involved appears below.</Paragraph>
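The following is a rough illustration of the feature vector such a period classifier might split on; the probability tables and abbreviation classes are invented stand-ins for the AP-derived statistics described above:

```python
# Sketch: the kind of feature vector a period classifier might split on.
# The probability tables and abbreviation classes are invented stand-ins for
# the AP-derived statistics described above.

# P(word begins a paragraph) and P(word ends one), per the AP counts.
P_START = {"the": 0.11, "he": 0.08, "figure": 0.02}
P_END = {"said": 0.09, "ounces": 0.03}
ABBREV_CLASS = {"mr.": "title", "jan.": "month", "oz.": "unit-of-measure"}

def period_features(before: str, after: str) -> dict:
    """Features for deciding whether a '.' ends a declarative sentence."""
    return {
        "before_is_abbrev": before.lower() in ABBREV_CLASS,
        "abbrev_class": ABBREV_CLASS.get(before.lower(), "n/a"),
        "after_capitalized": after[:1].isupper(),
        "p_before_ends_sentence": P_END.get(before.lower().rstrip("."), 0.0),
        "p_after_starts_sentence": P_START.get(after.lower(), 0.0),
    }

print(period_features("said", "The"))   # looks like a sentence boundary
print(period_features("Mr.", "Smith"))  # title abbreviation, likely not
```
</Section>
</Paper>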