<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2906"> <Title>Assessing Prosodic and Text Features for Segmentation of Mandarin Broadcast News</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Data Set </SectionTitle> <Paragraph position="0"> We utilize the Topic Detection and Tracking (TDT) 3 (Wayne, 2000) collection Mandarin Chinese broadcast news audio corpus as our data set. Story segmentation in Mandarin and English broadcast news and newswire text was one of the TDT tasks and also an enabling technology for other retrieval tasks. We use the segment boundaries provided with the corpus as our gold standard labeling. Our collection comprises 3014 stories drawn from approximately 113 hours over three months (October-December 1998) of news broadcasts from the Voice of America (VOA) in Mandarin Chinese. The transcriptions span approximately 740,000 words. The audio is stored in NIST Sphere format sampled at 16 kHz with 16-bit linear encoding.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Prosodic Features </SectionTitle> <Paragraph position="0"> We employ four main classes of prosodic features: pitch, intensity, silence, and duration. Pitch, as represented by f0 in Hertz, was computed by the To Pitch function of the Praat system (Boersma, 2001). We then applied a 5-point median filter to smooth out local instabilities in the signal such as vocal fry or small regions of spurious doubling or halving. Analogously, we computed the intensity in decibels for each 10ms frame with the Praat To Intensity function, followed by similar smoothing.</Paragraph> <Paragraph position="1"> For consistency and to allow comparability, we computed all figures for word-based units, using the ASR transcriptions provided with the TDT Mandarin data. 
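The 5-point median smoothing can be sketched as follows. This is an illustrative stand-in, not the authors' code (Praat performs the actual pitch and intensity extraction); it assumes the f0 or intensity track is simply a list of per-frame values:

```python
from statistics import median

def median_smooth(track, width=5):
    """Running median of the given width (5 points here) over a
    frame-level f0 or intensity track, damping isolated glitches such
    as spurious pitch halving/doubling or vocal fry."""
    half = width // 2
    smoothed = []
    for i in range(len(track)):
        lo = max(0, i - half)             # shrink the window at the edges
        hi = min(len(track), i + half + 1)
        smoothed.append(median(track[lo:hi]))
    return smoothed

# Hypothetical f0 track (Hz) with a spurious halving at one frame;
# the median filter replaces the outlier with the local median.
f0 = [200.0, 201.0, 100.5, 199.0, 202.0, 198.0, 200.0]
print(median_smooth(f0))
```

The window is truncated at the track edges so the output has the same length as the input.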
The words are used to establish time spans for computing pitch or intensity mean or maximum values, to enable durational normalization and pairwise comparison, and to identify silence duration.</Paragraph> <Paragraph position="2"> It is well-established (Ross and Ostendorf, 1996) that for robust analysis pitch and intensity should be normalized by speaker, since, for example, average pitch is largely incomparable for male and female speakers. In the absence of speaker identification software, we approximate speaker normalization with story-based normalization, computed as (x - mean)/mean, assuming one speaker per topic.1 For duration, we consider both absolute and normalized word duration, where average word duration is used as the mean in the calculation above.</Paragraph> <Paragraph position="3"> Mandarin Chinese is a tone language in which lexical identity is determined by a pitch contour, or tone, associated with each syllable. This additional use of pitch raises the question of the cross-linguistic applicability of the prosodic cues, especially pitch cues, identified for non-tone languages. Specifically, do we find intonational cues in tone languages? We have found highly significant differences based on a two-tailed paired t-test for words in segment-final position, relative to the same word in non-final positions (Levow, 2004). Specifically, word duration, normalized mean pitch, and normalized mean intensity all differ significantly for words in topic-final position relative to occurrences throughout the story. Word duration increases, while both pitch and intensity decrease. 
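The story-based normalization can be sketched as follows, assuming it expresses each word-level measurement as its deviation from the story mean, scaled by that mean, i.e. (x - mean)/mean; the duration values are hypothetical:

```python
from statistics import mean

def story_normalize(values):
    """Story-based stand-in for speaker normalization: each word-level
    measurement (pitch, intensity, or duration) becomes its deviation
    from the story mean divided by that mean, assuming one speaker per
    story. Assumed form: (x - mean) / mean."""
    m = mean(values)
    return [(v - m) / m for v in values]

# Hypothetical word durations (seconds) for one story.
durations = [0.30, 0.25, 0.35, 0.40, 0.20]
print(story_normalize(durations))
```

A word at exactly the story mean maps to 0; values above the mean become positive, below become negative, making measurements comparable across stories (and thus, approximately, across speakers).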
Importantly, reduction in pitch as a signal of topic finality is robust across the typological contrast of tone and non-tone languages, such as English (Nakatani et al., 1995) and Dutch (Swerts, 1997).</Paragraph> <Paragraph position="4"> 1This is an imperfect approximation as some stories include off-site interviews, but seems a reasonable choice in the absence of automatic speaker identification.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Classification </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Prosodic Feature Set </SectionTitle> <Paragraph position="0"> The contrasts above indicate that duration, pitch, and intensity should be useful for automatic prosody-based identification of topic boundaries. To facilitate cross-speaker comparisons, we use normalized representations of average pitch, average intensity, and word duration.</Paragraph> <Paragraph position="1"> These features form a word-level context-independent feature set.</Paragraph> <Paragraph position="2"> Since segment boundaries and their cues exist to contrastively signal the separation between topics, we augment these features with local context-dependent measures. Specifically, we add features that measure the change between the current word and the next word.2 This contextualization adds four contextual features: change in normalized average pitch, change in normalized average intensity, change in normalized word duration, and duration of following silence.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Text Feature Set </SectionTitle> <Paragraph position="0"> In addition to the prosodic features, we also consider a set of features that exploit the textual similarity of regions to identify segment boundaries. 
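The context-dependent augmentation of the prosodic feature set above can be sketched as follows; the record fields and feature names are illustrative, not the paper's, and each word is assumed to carry its normalized measurements plus start/end times:

```python
def contextual_features(words):
    """For each word, combine the context-independent measurements with
    the four local context-dependent measures described above: change in
    normalized mean pitch, change in normalized mean intensity, change
    in normalized duration, and duration of the following silence.
    Uses a single-word lookahead, so the last word gets no feature row."""
    feats = []
    for cur, nxt in zip(words, words[1:]):
        feats.append({
            "pitch": cur["pitch"],
            "intensity": cur["intensity"],
            "duration": cur["duration"],
            "d_pitch": nxt["pitch"] - cur["pitch"],
            "d_intensity": nxt["intensity"] - cur["intensity"],
            "d_duration": nxt["duration"] - cur["duration"],
            "next_silence": nxt["start"] - cur["end"],  # following pause
        })
    return feats

# Two hypothetical words: a long, low, quiet word followed by a pause
# is the classic topic-final pattern.
words = [
    {"pitch": 0.2, "intensity": 0.1, "duration": 0.0, "start": 0.0, "end": 0.3},
    {"pitch": -0.4, "intensity": -0.3, "duration": 0.5, "start": 0.9, "end": 1.2},
]
print(contextual_features(words)[0])
```

The single-word lookahead mirrors the framing in footnote 2: boundary detection is posed as finding segment-final words, so each word's features peek one word ahead.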
We build on the standard information retrieval measures for assessing text similarity. Specifically, we consider a tf-idf weighted cosine similarity measure across 50 and 30 word windows. We also explore a length-normalized word overlap within the same region size. We use the words from the ASR transcription as our terms and perform no stopword removal.</Paragraph> <Paragraph position="1"> We expect that these measures will be minimized at topic boundaries, where changes in topic are accompanied by changes in topical terminology.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Classifier Training and Testing Configuration </SectionTitle> <Paragraph position="0"> We employed Quinlan's C4.5 (Quinlan, 1992) decision tree classifier to provide a readily interpretable classifier.</Paragraph> <Paragraph position="1"> Now, the vast majority of word positions in our collection are non-topic-final. So, in order to focus training and test on topic boundary identification, we downsample our corpus to produce training and test sets with a 50/50 split of topic-final and non-topic-final words. We trained on 2789 topic-final words3 and 2789 non-topic-final words, not matched in any way, drawn randomly from the full corpus. We tested on a held-out set of 200 topic-final and non-topic-final words.</Paragraph> <Paragraph position="2"> 2We have posed the task of boundary detection as the task of finding segment-final words, so the technique incorporates a single-word lookahead. We could also re-pose the task as identification of topic-initial words and avoid the lookahead to have a more on-line process. 
This is an area for future research.</Paragraph> <Paragraph position="3"> 3We excluded a small proportion of words for which the pitch tracker returned no results.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Classifier Evaluation 5.4.1 Prosody-only Classification </SectionTitle> <Paragraph position="0"> The resulting classifier achieved 95.8% accuracy on the held-out test set, closely approximating pruned tree performance on the training set. This effectiveness is a substantial improvement over the sample baseline of 50%. Inspection of the classifier indicates the key role of silence as well as the use of both contextual and purely local features of both pitch and intensity. Durational features play a lesser role in the classifier.</Paragraph> <Paragraph position="1"> In a comparable experiment, we employed only the text similarity and silence duration features to train and test the classifier. These features similarly achieved a 95.5% overall classification accuracy. Here the best classification accuracy was achieved by the text similarity measure that was based on the tf-idf weighted 50-word window. The text similarity measures based on tf-idf in the 30-word window and on length-normalized overlap performed similarly. The combination of all three text-based features did not improve classification over the single best measure.</Paragraph> <Paragraph position="2"> Finally, we built a combined classifier integrating all prosodic and textual features. This classifier yielded an accuracy of 97%, the best overall effectiveness. The decision tree utilized all classes of prosodic features and performed comparably with only the tf-idf features and with all text features. 
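The tf-idf weighted cosine similarity over paired word windows can be sketched as follows. This is a sketch, not the paper's implementation: the document-frequency table and collection size are assumed to be given, and the two windows are the 50 (or 30) words on either side of a candidate boundary:

```python
import math
from collections import Counter

def tfidf_cosine(left_words, right_words, df, n_docs):
    """tf-idf weighted cosine similarity between the word windows on
    either side of a candidate boundary. df maps a term to its document
    frequency over the collection; n_docs is the collection size. Low
    similarity suggests a topic shift."""
    def vec(words):
        tf = Counter(words)
        # Classic tf * log(N/df) weighting; unseen terms default to df=1.
        return {w: tf[w] * math.log(n_docs / df.get(w, 1)) for w in tf}
    a, b = vec(left_words), vec(right_words)
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0
```

Identical windows score 1.0 and fully disjoint windows score 0.0, so scanning this value across word positions and thresholding its dips is one way to flag candidate story boundaries.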
A portion of the tree is reproduced in Figure 1.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.5 Feature Comparison </SectionTitle> <Paragraph position="0"> We also performed a set of contrastive experiments with different subsets of available features to assess the dependence on these features.4 We grouped features into five sets: pitch, intensity, duration, silence, and text similarity. For each of the prosody-only, text-only, and combined prosody and text-based classifiers, we successively removed the feature class at the root of the decision tree and retrained with the remaining features (Table 1).</Paragraph> <Paragraph position="1"> We observe that although silence duration plays a very significant role in story boundary identification for all feature sets, the richer prosodic and mixed text-prosodic classifiers are much more robust to the absence of silence information. Further, we observe that intensity and then pitch play the next most important roles in classification.</Paragraph> <Paragraph position="2"> Table 1 reports classification accuracy as each feature class is newly removed from the set of available features.</Paragraph> </Section> </Section> </Paper>