<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1115"> <Title>Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Prosody and Mandarin </SectionTitle> <Paragraph position="0"> In this paper we focus on topic segmentation in Mandarin Chinese broadcast news. Mandarin Chinese is a tone language in which lexical identity is determined by a pitch contour, or tone, associated with each syllable. This additional use of pitch raises the question of whether the prosodic cues, especially pitch cues, identified for non-tone languages apply cross-linguistically. Specifically, do we find intonational cues in tone languages? The fact that emphasis is marked intonationally by expansion of pitch range even in the presence of Mandarin lexical tone (Shen, 1989) is encouraging: it suggests that prosodic, intonational cues to other aspects of information structure might also prove robust in tone languages.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Data Set </SectionTitle> <Paragraph position="0"> We use the Mandarin Chinese broadcast news audio corpus from the Topic Detection and Tracking (TDT) 3 collection (Wayne, 2000) as our data set. Story segmentation in Mandarin and English broadcast news and newswire text was one of the TDT tasks and is also an enabling technology for other retrieval tasks. We use the segment boundaries provided with the corpus as our gold standard labeling. Our collection comprises 3014 news stories drawn from approximately 113 hours of Voice of America (VOA) news broadcasts in Mandarin Chinese over three months (October-December 1998), along with 800 regions of other program material, including musical interludes and teasers. The transcriptions span approximately 750,000 words.
Stories average approximately 250 words in length, and each segment spans a full story.</Paragraph> <Paragraph position="1"> No subtopic segmentation is performed. The audio is stored in NIST Sphere format, sampled at 16 kHz with 16-bit linear encoding.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Prosodic Features </SectionTitle> <Paragraph position="0"> We consider four main classes of prosodic features for our analysis and classification: pitch, intensity, silence, and duration. Pitch, represented by f0 in Hertz, was computed with the &quot;To pitch&quot; function of the Praat system (Boersma, 2001). We selected the highest-ranked pitch candidate value in each voiced region. We then applied a 5-point median filter to smooth out local instabilities in the signal, such as vocal fry or small regions of spurious doubling or halving. Analogously, we computed the intensity in decibels for each 10 ms frame with the Praat &quot;To intensity&quot; function, followed by similar smoothing.</Paragraph> <Paragraph position="1"> For consistency and comparability, we compute all figures for word-based units, using the automatic speech recognition transcriptions provided with the TDT Mandarin data. The words are used to establish time spans for computing pitch or intensity mean or maximum values, to enable durational normalization and the pairwise comparisons reported below, and to identify silence or non-speech duration.</Paragraph> <Paragraph position="2"> It is well established (Ross and Ostendorf, 1996) that for robust analysis pitch and intensity should be normalized by speaker, since, for example, average pitch is largely incomparable for male and female speakers.
In the absence of speaker identification software, we approximate speaker normalization with story-based normalization, computed as $\frac{\mathit{val} - \mathit{mean}}{\mathit{mean}}$, assuming one speaker per topic. For duration, we consider both absolute and normalized word duration, where average word duration is used as the mean in the calculation above.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Prosodic Analysis </SectionTitle> <Paragraph position="0"> To evaluate the potential applicability of prosodic features to story segmentation in Mandarin Chinese, we performed an initial data analysis to determine whether words in story-final position differed from the same words used elsewhere in the same news story. This lexical match allows direct pairwise comparison. We anticipated that, since words in Mandarin vary not only in phoneme sequence but also in tone sequence, a direct comparison might be particularly important to eliminate sources of variability. Features that differed significantly would form the basis of our classifier feature set.</Paragraph> <Paragraph position="1"> We found significant differences for each of the features we considered. Specifically, word duration, normalized mean pitch, and normalized mean intensity all differed significantly for words in topic-final position relative to occurrences throughout the story (paired t-test, two-tailed, with a separate significance level for each feature). Word duration increased, while both pitch and intensity decreased.
A small side experiment using 15 hours of English broadcast news from the TDT collection shows similar trends, though the magnitude of the change in intensity is smaller than that observed for the Chinese. Furthermore, comparison of average pitch and average intensity for 1-, 5-, and 10-word windows at the beginning and end of news stories finds that pitch and intensity are both significantly higher at the start of stories than at the end.</Paragraph> <Paragraph position="2"> These contrasts are consistent with, though in some cases stronger than, those identified for English (Nakatani et al., 1995) and Dutch (Swerts, 1997). The relatively large size of the corpus enhances the salience of these effects. We find, importantly, that reduction in pitch as a signal of topic finality is robust across the typological contrast of tone and non-tone languages. These findings demonstrate highly significant intonational effects even in tone languages and suggest that prosodic</Paragraph> </Section> </Paper>