<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-3007">
  <Title>Word Fragment Identification Using Acoustic-Prosodic Features in</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> Our goal is to investigate whether there are some reliable acoustic-prosodic features for word fragments. The task of word fragment identification is viewed as a statistical classification problem, i.e. for each word boundary, a classifier determines whether the word before the boundary is a word fragment or not. For such a classification task, we develop an inventory of input features for the statistical classifier. A CART decision tree classifier is employed to enable easy interpretation of results.</Paragraph>
      <Paragraph position="1"> Missing features are allowed in the decision trees. To avoid globally suboptimal feature combinations in decision trees, we used a feature selection algorithm to search for an optimal subset of input features (Shriberg et al., 2000).</Paragraph>
      <Paragraph position="2"> We used conversational telephone speech Switchboard corpus (Godfrey et al., 1992) for our experiments. In the human transcriptions, word fragments are identified (around 0.7% of the words are word fragments). We use 80% of the data as the training data, and the left 20% for testing. In order to avoid the bias toward the complete words (which are much more frequent than word fragments), we downsampled the training data so that we have an equal amount number of word fragments and complete words. Downsampling makes the decision tree model more sensitive to the inherent features of the minority class.</Paragraph>
      <Paragraph position="3"> We generated forced alignments using the provided human transcriptions, and derived the prosodic and voice quality features from the resulting phone-level alignments and the speech signal. The reason that we used human transcriptions is because the current recognition accuracy on such telephone speech is around 70%, which will probably yield inaccurate time marks for the word hypotheses, and thus affect the feature extraction results and also make the evaluation difficult (e.g. determine which word hypothesis should be a word fragment). Even if the human transcription and the forced alignment are used to obtain the word and phone level alignments, the alignments could still be error-prone because the recognizer used for obtaining the alignments does not have a model for the word fragments. Note that we only used transcriptions to get the word and phone level alignments for computing prosodic and voice quality features. We did not use any word identity information in the features for the classification task.</Paragraph>
      <Paragraph position="4"> At each boundary location, we extracted prosodic features and voice quality measures as described in Section 2. We trained a decision tree classifier from the down-sampled training set that contains 1438 samples, and tested it on the downsampled test set with 288 samples (50% of the samples in the training and test set are word fragments).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> In Table 1 the results for word fragments vs. complete words classification are shown. The precision and recall for this fragment detection task are 74.3% and 70.1% respectively. The overall accuracy for all the test samples is 72.9%, which is significantly better than a chance performance of 50%. These results suggest some acoustic-prosodic features are indicative for word fragment detec- null reference complete 109 35fragment 43 101 Figure 1 shows the pruned decision tree for this task.</Paragraph>
      <Paragraph position="1"> An inspection of the decision tree's feature usage in the results can further reveal the potential properties that distinguish word fragments from complete words. In Table 2 we report the feature usage as the percentage of decisions that have queried the feature type. Features that are used higher up in the decision tree have higher usage values.</Paragraph>
      <Paragraph position="2">  tion using the Switchboard data.</Paragraph>
      <Paragraph position="2">
        Table 2. Feature usage (fraction of decisions that queried each feature).
          jitter: 0.272
          energy slope difference between the current word and the following word: 0.241
          log ratio between the minimum median-filtered F0 in a window before the boundary and the maximum value after the boundary: 0.238
          average OQ: 0.147
          position of the current turn: 0.084
          pause duration after the word: 0.018
        Among the voice quality features, jitter is queried the most by the decision tree. We think that when the speaker suddenly cuts off in the middle of a word, there is an abnormality of the vocal folds, in particular in the pitch periods, and this is captured by jitter.</Paragraph>
      <Paragraph position="2"> fragments. The indent represents the tree structure. Each line corresponds to a node in the tree. A question regarding one feature is associated with each node. The decision is made in the leaf nodes; however, in the figure we also show the majority class passing along an internal node in the tree.</Paragraph>
      <Paragraph position="3"> and this is captured by jitter. The average of OQ is also chosen as a useful feature, suggesting that a mid-word interruption generates some creaky or breathy voice. The questions produced by the decision tree show that word fragments are hypothesized if the answer is positive to the questions such as 'jitter a0 0.018053', 'average OQ a1 0.020956?' and 'average OQ a0 0.60821?'. All these questions imply abnormal voice quality. We have also conducted the same classification experiments by only using jitter and average OQ two features, and we obtained a classification accuracy of 68.06%.</Paragraph>
      <Paragraph position="4"> We also observe from the table that one energy feature and one F0 feature are queried frequently. However, we may need to be careful of interpreting these prosodic features, because some word fragments are more likely to have a missing (or undefined) value for the stylized F0 or energy features (due to the short duration of the word fragments and the unvoiced frames). For example, in one leaf of the decision tree, word fragment is hypothesized if the energy slope before the boundary is an undefined value (as shown in Figure 1, the question is 'EN-</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ERGY PATTERN BOUNDARY in Xr, Xf?', where 'X'
</SectionTitle>
    <Paragraph position="0"> means undefined value).</Paragraph>
    <Paragraph position="1"> Notice that the usage of the pause feature is very low, although a pause is expected after a sudden closure of the speaker. One reason for this is that the recognizer is more likely not to generate a pause in the phonetic alignment results when the pause after the mid-word interruption is very short. For example, around 2/3 of the word fragments in our training and test set are not followed by a pause based on the alignments. Additionally, there are many other places (e.g. sentence boundaries or filled pauses) that are possible to be followed by a pause, therefore being followed by a pause cannot accurately distinguish between a word fragment and other complete words.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>