<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2007">
  <Title>Active Learning for Classifying Phone Sequences from Unsupervised Phonotactic Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Unsupervised Phone Recognition
</SectionTitle>
    <Paragraph position="0"> Unsupervised recognition of phone sequences is carried out according to the method described by Alshawi (2003). In this method, the only input to recognition model training is the set of audio files recorded from the application.</Paragraph>
    <Paragraph position="1"> The recognition training phase is an iterative procedure in which a phone n-gram model is successively refined: the phone strings resulting from the current pass over the speech files are used to construct the phone n-gram model for the next iteration. We currently re-estimate only the n-gram model, so the same general-purpose HMM acoustic model is used for ASR decoding in all iterations.</Paragraph>
    <Paragraph position="2"> Recognition training can be briefly described as follows. First, set the phone sequence model to an initial phone string model. This initial model can be an unweighted phone loop or a general-purpose phonotactic model for the language being recognized. Then, for successively larger n-grams, produce the output set of phone sequences from recognizing the training speech files with the current phone sequence model, and train the next larger n-gram phone sequence model on this output corpus.</Paragraph>
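The bootstrapping procedure above can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: decode represents HMM-based decoding with the current phone sequence model (passing model=None for the initial unweighted phone loop), and the raw n-gram counts are a toy substitute for a smoothed phone n-gram language model.

```python
from collections import Counter

def train_ngram(phone_strings, n):
    # Toy n-gram model: raw counts of length-n phone windows,
    # padded with BOS/EOS boundary tokens.
    counts = Counter()
    for phones in phone_strings:
        padded = ["BOS"] * (n - 1) + list(phones) + ["EOS"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

def iterative_phone_training(audio_files, decode, max_n=4):
    # Unsupervised recognition training: start from a phone loop
    # (model=None) and grow the n-gram order on each pass, retraining
    # the phone sequence model on the freshly decoded corpus.
    model = None  # initial unweighted phone loop
    for n in range(2, max_n + 1):
        decoded = [decode(f, model) for f in audio_files]
        model = train_ngram(decoded, n)
    return model
```

The acoustic model stays fixed throughout, matching the paper's note that only the n-gram model is re-estimated between iterations.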
    <Paragraph position="3"> 3 Training phone sequence classifiers with active selection of examples
The method we use for training the phone sequence classifier is as follows.</Paragraph>
    <Paragraph position="4">  1. Choose an initial subset S of training recordings at random; assign class label(s) to each example.</Paragraph>
    <Paragraph position="5"> 2. Recognize these recordings using the phone recognizer described in section 2.</Paragraph>
    <Paragraph position="6"> 3. Train an initial classifier C on the pairs (phone string, class label) of S.</Paragraph>
    <Paragraph position="7"> 4. Run the classifier on the recognized phone strings of the training corpus, obtaining confidence scores for each classification.</Paragraph>
    <Paragraph position="8"> 5. While labeling effort is available, or until performance on a development corpus reaches some threshold:
(a) Choose the next subset S′ of examples from the training corpus, on the basis of the confidence scores or other indicators. (Selection criteria are discussed later.)
(b) Assign class label(s) to each selected example.</Paragraph>
    <Paragraph position="9"> (c) Train classifier C′ on all the data labeled so far.
(d) Run C′ on the whole training corpus, obtaining confidence scores for each classification.</Paragraph>
    <Paragraph position="10"> (e) Optionally test C′ on a separate test corpus.</Paragraph>
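Steps 1-5 can be sketched as a generic loop. The callables recognize, label, train, and classify are hypothetical stand-ins for the phone recognizer of section 2, the human labeler, and the classifier of section 4.3; classify is assumed to return a (prediction, confidence) pair.

```python
import random

def active_learning_loop(corpus, recognize, label, train, classify,
                         init_size, batch_size, rounds, seed=0):
    rng = random.Random(seed)
    phones = {u: recognize(u) for u in corpus}            # step 2
    labeled = set(rng.sample(sorted(corpus), init_size))  # step 1
    data = [(phones[u], label(u)) for u in labeled]
    clf = train(data)                                     # step 3
    for _ in range(rounds):                               # step 5
        # Steps 4 / 5d: score every still-unlabeled example.
        scores = {u: classify(clf, phones[u])[1]
                  for u in corpus if u not in labeled}
        # Step 5a: pick the least-confident batch.
        batch = sorted(scores, key=scores.get)[:batch_size]
        labeled.update(batch)                             # step 5b
        data = [(phones[u], label(u)) for u in labeled]
        clf = train(data)                                 # step 5c
    return clf, labeled
```

The loop retrains from scratch on all labeled data each round, as in step 5(c); the optional held-out evaluation of step 5(e) is omitted for brevity.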
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> The datasets tested on and the classifier used are the same as those in the experiments on phone sequence classification reported by Alshawi (2003). The details are briefly restated here.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> Two collections of utterances from two domains were used in the experiments: 1. Customer care utterances (HMIHY). These utterances are the customer side of live English conversations between AT&amp;T residential customers and an automated customer care system. This system is open to the public, so the number of speakers is large (several thousand).</Paragraph>
      <Paragraph position="1"> The total number of training utterances was 40,106.</Paragraph>
      <Paragraph position="2"> All tests use 9,724 test utterances. Average utterance length was 11.19 words; there were 56 classes, with an average of 1.09 classes per utterance.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Text-to-Speech Help Desk utterances (TTSHD).
</SectionTitle>
    <Paragraph position="0"> This is a smaller database of utterances in which customers called an automated information system primarily to find out about AT&amp;T Natural Voices text-to-speech synthesis products.</Paragraph>
    <Paragraph position="1"> The total number of possible training utterances was 10,470. All tests use 5,005 test utterances. Average utterance length was 3.95 words; there were 54 classes, with an average of 1.23 classes per utterance.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Phone sequences
</SectionTitle>
      <Paragraph position="0"> The phone sequences used for testing and training are those obtained using the phone recognizer described in section 2. Since the phone recognizer is trained without labeling of any sort, we can use all available training utterances to train it, that is, 40,106 in the HMIHY domain and 10,470 in the TTSHD domain. The initial model used to start the iteration is, as in (Alshawi, 2003), an unweighted phone loop.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Classifier
</SectionTitle>
      <Paragraph position="0"> For the experiments reported here we use the BoosTexter classifier (Schapire and Singer, 2000). The features used were identifiers corresponding to prompts, and phone n-grams up to length 4. Following Schapire and Singer (2000), the confidence level for a given prediction is taken to be the difference between the scores assigned by BoosTexter to the highest ranked action (the predicted action) and the next highest ranked action.</Paragraph>
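The margin-style confidence described above can be computed from per-class scores as the gap between the top two. A minimal sketch, where the score dictionary is illustrative rather than BoosTexter's actual API:

```python
def margin_confidence(scores):
    # scores: mapping from class label to classifier score.
    # Confidence = highest score minus runner-up score; a small
    # margin means the classifier nearly preferred another class.
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]
```

A confident prediction yields a large margin, so low-margin examples are the natural candidates for the active selection described next.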
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Selection criteria
</SectionTitle>
      <Paragraph position="0"> Subsets of the recognized phone sequences were selected to be assigned class labels and used in training the classifiers. Examples were selected in order of BoosTexter confidence score, least confident first. Selection by utterance length was also used in some experiments, such that only recognized utterances with fewer than a given number of phones were selected.</Paragraph>
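The combined criterion can be sketched as follows; max_phones is a hypothetical name for the length cutoff, and candidates is assumed to be a list of (confidence, phone_sequence) pairs.

```python
def select_examples(candidates, batch_size, max_phones=None):
    # candidates: list of (confidence, phone_sequence) pairs.
    def short_enough(phones):
        # True when the utterance has fewer phones than the cutoff
        # (or when no cutoff is given).
        return max_phones is None or len(phones) in range(max_phones)
    pool = [c for c in candidates if short_enough(c[1])]
    pool.sort(key=lambda c: c[0])  # least confident first
    return pool[:batch_size]
```

Filtering by length keeps labeling effort focused on short utterances, which are cheaper to transcribe and less likely to contain multiple classes.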
    </Section>
  </Section>
</Paper>