Assessing Prosodic and Text Features for Segmentation of Mandarin
Broadcast News
Gina-Anne Levow
University of Chicago
levow@cs.uchicago.edu
Abstract
Automatic topic segmentation, separation of
a discourse stream into its constituent sto-
ries or topics, is a necessary preprocessing
step for applications such as information re-
trieval, anaphora resolution, and summariza-
tion. While signi cant progress has been made
in this area for text sources and for English au-
dio sources, little work has been done in au-
tomatic segmentation of other languages us-
ing both text and acoustic information. In
this paper, we focus on exploiting both textual
and prosodic features for topic segmentation of
Mandarin Chinese. As a tone language, Man-
darin presents special challenges for applica-
bility of intonation-based techniques, since the
pitch contour is also used to establish lexical
identity. However, intonational cues such as re-
duction in pitch and intensity at topic bound-
aries and increase in duration and pause still
provide signi cant contrasts in Mandarin Chi-
nese. We  rst build a decision tree classi-
 er that based only on prosodic information
achieves boundary classi cation accuracy of
89-95.8% on a large standard test set. We
then contrast these results with a simple text
similarity-based classi cation scheme. Finally
we build a merged classi er,  nding the best
effectiveness for systems integrating text and
prosodic cues.
1 Introduction
Natural spoken discourse is composed of a sequence
of utterances, not independently generated or randomly
strung together, but rather organized according to basic
structural principles. This structure in turn guides the in-
terpretation of individual utterances and the discourse as
a whole. Formal written discourse signals a hierarchical,
tree-based discourse structure explicitly by the division
of the text into chapters, sections, paragraphs, and sen-
tences. This structure, in turn, identi es domains for in-
terpretation; many systems for anaphora resolution rely
on some notion of locality (Grosz and Sidner, 1986).
Similarly, this structure represents topical organization,
and thus would be useful in information retrieval to se-
lect documents where the primary sections are on-topic,
and, for summarization, to select information covering
the different aspects of the topic.
Unfortunately, spoken discourse does not include the
orthographic conventions that signal structural organiza-
tion in written discourse. Instead, one must infer the hi-
erarchical structure of spoken discourse from other cues.
Prior research (Nakatani et al., 1995; Swerts, 1997) has
shown that human labelers can more sharply, consis-
tently, and con dently identify discourse structure in a
word-level transcription when an original audio record-
ing is available than they can on the basis of the tran-
scribed text alone. This  nding indicates that substan-
tial additional information about the structure of the dis-
course is encoded in the acoustic-prosodic features of the
utterance. Given the often errorful transcriptions avail-
able for large speech corpora, we choose to focus here
on fully exploiting the prosodic cues to discourse struc-
ture present in the original speech in addition to possibly
noisy textual cues. We then compare the effectiveness of
a pure prosodic classi cation to text-based and mixed text
and prosodic based classi cation.
In the current set of experiments, we concentrate on se-
quential segmentation of news broadcasts into individual
stories. This level of segmentation can be most reliably
performed by human labelers and thus can be considered
most robust, and segmented data sets are publicly avail-
able.
Furthermore, we consider the relative effectiveness
prosodic-based, text-based, and mixed cue-based seg-
mentation for Mandarin Chinese, to assess the relative
utility of the cues for a tone language. Not only is the use
of prosodic cues to topic segmentation much less well-
studied in general than is the use of text cues, but the use
of prosodic cues has been largely limited to English and
other European languages.
2 Related Work
Most prior research on automatic topic segmentation has
been applied to clean text only and thus used textual fea-
tures. Text-based segmentation approaches have utilized
term-based similarity measures computed across candi-
date segments (Hearst, 1994) and also discourse markers
to identify discourse structure (Marcu, 2000).
The Topic Detection and Tracking (TDT) evaluations
focused on segmentation of both text and speech sources.
This framework introduced new challenges in dealing
with errorful automatic transcriptions as well as new op-
portunities to exploit cues in the original speech. The
most successful approach (Beeferman et al., 1999) pro-
duced automatic segmentations that yielded retrieval re-
sults comparable to those with manual segmentations, us-
ing text and silence features. (Tur et al., 2001) applied
both a prosody-only and a mixed text-prosody model
to segmentation of TDT English broadcast news, with
the best results combining text and prosodic features.
(Hirschberg and Nakatani, 1998) also examined auto-
matic topic segmentation based on prosodic cues, in the
domain of English broadcast news.
Work in discourse analysis (Nakatani et al., 1995;
Swerts, 1997) in both English and Dutch has identi ed
features such as changes in pitch range, intensity, and
speaking rate associated with segment boundaries and
with boundaries of different strengths.
3 Data Set
We utilize the Topic Detection and Tracking (TDT)
3 (Wayne, 2000) collection Mandarin Chinese broadcast
news audio corpus as our data set. Story segmentation in
Mandarin and English broadcast news and newswire text
was one of the TDT tasks and also an enabling technol-
ogy for other retrieval tasks. We use the segment bound-
aries provided with the corpus as our gold standard label-
ing. Our collection comprises 3014 stories drawn from
approximately 113 hours over three months (October-
December 1998) of news broadcasts from the Voice of
America (VOA) in Mandarin Chinese. The transcriptions
span approximately 740,000 words. The audio is stored
in NIST Sphere format sampled at 16KHz with 16-bit lin-
ear encoding.
4 Prosodic Features
We employ four main classes of prosodic features: pitch,
intensity, silence and duration. Pitch, as represented by f0
in Hertz, was computed by the  To pitch function of the
Praat system (Boersma, 2001). We then applied a 5-point
median  lter to smooth out local instabilities in the signal
such as vocal fry or small regions of spurious doubling or
halving. Analogously, we computed the intensity in deci-
bels for each 10ms frame with the Praat  To intensity 
function, followed by similar smoothing.
For consistency and to allow comparability, we com-
puted all  gures for word-based units, using the ASR
transcriptions provided with the TDT Mandarin data. The
words are used to establish time spans for computing
pitch or intensity mean or maximum values, to enable
durational normalization and pairwise comparison, and
to identify silence duration.
It is well-established (Ross and Ostendorf, 1996) that
for robust analysis pitch and intensity should be nor-
malized by speaker, since, for example, average pitch
is largely incomparable for male and female speak-
ers. In the absence of speaker identi cation software,
we approximate speaker normalization with story-based
normalization, computed as a0a2a1a4a3a6a5a8a7a10a9a11a1a2a12a7a10a9a11a1a13a12 , assuming one
speaker per topic1. For duration, we consider both abso-
lute and normalized word duration, where average word
duration is used as the mean in the calculation above.
Mandarin Chinese is a tone language in which lexi-
cal identity is determined by a pitch contour - or tone
- associated with each syllable. This additional use of
pitch raises the question of the cross-linguistic applicabil-
ity of the prosodic cues, especially pitch cues, identi ed
for non-tone languages. Speci cally, do we  nd intona-
tional cues in tone languages?
We have found highly signi cant differences based
on paired t-test two-tailed, (a14a16a15a18a17a20a19a21a19a23a22a25a24a27a26a29a28a31a30a32a24a34a33a24a25a24a25a35a21a36 )
for words in segment- nal position, relative to the same
word in non- nal positions. (Levow, 2004). Speci cally,
word duration, normalized mean pitch, and normalized
mean intensity all differ signi cantly for words in topic-
 nal position relative to occurrences throughout the story.
Word duration increases, while both pitch and intensity
decrease. Importantly, reduction in pitch as a signal of
topic  nality is robust across the typological contrast of
tone and non-tone languages, such as English (Nakatani
et al., 1995) and Dutch (Swerts, 1997).
1This is an imperfect approximation as some stories include
off-site interviews, but seems a reasonable choice in the absence
of automatic speaker identi cation.
5 Classi cation
5.1 Prosodic Feature Set
The contrasts above indicate that duration, pitch, and
intensity should be useful for automatic prosody-based
identi cation of topic boundaries. To facilitate cross-
speaker comparisons, we use normalized representations
of average pitch, average intensity, and word duration.
These features form a word-level context-independent
feature set.
Since segment boundaries and their cues exist to con-
trastively signal the separation between topics, we aug-
ment these features with local context-dependent mea-
sures. Speci cally, we add features that measure the
change between the current word and the next word. 2
This contextualization adds four contextual features:
change in normalized average pitch, change in normal-
ized average intensity, change in normalized word dura-
tion, and duration of following silence.
5.2 Text Feature Set
In addition to the prosodic features, we also consider a set
of features that exploit the textual similarity of regions to
identify segment boundaries. We build on the standard
information retrieval measures for assessing text similar-
ity. Speci cally we consider aa37a15a39a38a41a40a16a14a42a15 weighted cosine
similarity measure across 50 and 30 word windows. We
also explore a length normalized word overlap within the
same region size. We use the words from the ASR tran-
scription as our terms and perform no stopword removal.
We expect that these measures will be minimized at topic
boundaries where changes in topic are accompanied by
changes in topical terminology.
5.3 Classi er Training and Testing Con guration
We employed Quinlan’s C4.5 (Quinlan, 1992) decision
tree classi er to provide a readily interpretable classi er.
Now, the vast majority of word positions in our collec-
tion are non-topic- nal. So, in order to focus training and
test on topic boundary identi cation, we downsample our
corpus to produce training and test sets with a 50/50 split
of topic- nal and non-topic- nal words. We trained on
2789 topic- nal words 3 and 2789 non-topic- nal words,
not matched in any way, drawn randomly from the full
corpus. We tested on a held-out set of 200 topic- nal and
non-topic- nal words.
2We have posed the task of boundary detection as the task
of  nding segment- nal words, so the technique incorporates a
single-word lookahead. We could also repose the task as iden-
ti cation of topic-initial words and avoid the lookahead to have
a more on-line process. This is an area for future research.
3We excluded a small proportion of words for which the
pitch tracker returned no results.
5.4 Classi er Evaluation
5.4.1 Prosody-only Classi cation
The resulting classi er achieved 95.8% accuracy on
the held-out test set, closely approximating pruned tree
performance on the training set. This effectiveness is
a substantial improvement over the sample baseline of
50%. Inspection of the classi er indicates the key role
of silence as well as the use of both contextual and purely
local features of both pitch and intensity. Durational fea-
tures play a lesser role in the classi er.
5.4.2 Text and Silence-based Classi cation
In a comparable experiment, we employed only the
text similarity and silence duration features to train and
test the classi er. These features similarly achieved a
95.5% overall classi cation accuracy. Here the best clas-
si cation accuracy was achieved by the text similarity
measure that was based on thea37a15a43a38a44a40a42a14a42a15 weighted 50 word
window. The text similarity measures based ona37a15a39a38a45a40a42a14a42a15
in the 30 word window and on length normalized overlap
performed similarly. The combination of all three text-
based features did not improve classi cation over the sin-
gle best measure.
5.4.3 Combined Prosody and Text Classi cation
Finally we built a combined classi er integrating all
prosodic and textual features. This classi er yielded an
accuracy of 97%, the best overall effectiveness. The deci-
sion tree utilized all classes of prosodic features and per-
formed comparably with only thea37a15a46a38a47a40a42a14a16a15 features and
with all text features. A portion of the tree is reproduced
in Figure 1.
5.5 Feature Comparison
We also performed a set of contrastive experiments with
different subsets of available features to assess the de-
pendence on these features. 4 We grouped features
into 5 sets: pitch, intensity, duration, silence, and text-
similarity. For each of the prosody-only, text-only, and
combined prosody and text-based classi ers, we succes-
sively removed the feature class at the root of the decision
tree and retrained with the remaining features (Table 1).
We observe that although silence duration plays a very
signi cant role in story boundary identi cation for all
feature sets, the richer prosodic and mixed text-prosodic
classi ers are much more robust to the absence of silence
information. Further we observe that intensity and then
pitch play the next most important roles in classi cation.
4For example, VOA Mandarin has been observed stylisti-
cally to make idiosyncratically large use of silence at story
boundaries. (personal communication, James Allan).
Figure 1: Decision tree classi er labeling words as segment- nal or non-segment- nal, using text and prosodic features
Prosody-only Texta48 Silence Texta48 Prosody
Accuracy Pct. Change Accuracy Pct. Change Accuracy Pct. Change
All 95.8% 0 95.5% 0 97% 0
Silence 89.4% -6.7% 75.5% -21% 91.5% 5.7%
Intensity 82.2% -14.2% 86.4% -11%
Pitch 64% -33.2% 77% -20.6%
Table 1: Reduction in classi cation accuracy with removal of features. Each row is labeled with the feature that is
newly removed from the set of available features.
6 Conclusion and Future Work
We have demonstrated the utility of prosody-only, text-
only, and mixed text-prosody features for automatic topic
segmentation of Mandarin Chinese. We have demon-
strated the applicability of intonational prosodic features,
speci cally pitch, intensity, pause and duration, to the
identi cation of topic boundaries in a tone language. We
observe similar effectiveness for all feature sets when all
features are available, with slightly better classi cation
accuracy for the hybrid text-prosody approach. These re-
sults indicate a synergistic combination of meaning and
acoustic features. We further observe that the prosody-
only and hybrid feature sets are much less sensitive to the
absence of individual features, and, in particular, to si-
lence features. These  ndings indicate that prosodic fea-
tures are robust cues to topic boundaries, both with and
without textual cues.
There is still substantial work to be done. We would
like to integrate speaker identi cation for normalization
and speaker change detection. We also plan to explore the
integration of text and prosodic features for the identi ca-
tion of more  ne-grained sub-topic structure, to provide
more focused units for information retrieval, summariza-
tion, and anaphora resolution. We also plan to explore
the interaction of prosodic and textual features with cues
from other modalities, such as gaze and gesture, for ro-
bust segmentation of varied multi-modal data.

References
D. Beeferman, A. Berger, and J. Lafferty. 1999. Statisti-
cal models for text segmentation. Machine Learning,
34((1-3)):177 210.
P. Boersma. 2001. Praat, a system for doing phonetics
by computer. Glot International, 5(9 10):341 345.
B. Grosz and C. Sidner. 1986. Attention, intention, and
the structure of discourse. Computational Linguistics,
12(3):175 204.
M. Hearst. 1994. Multi-paragraph segmentation of ex-
pository text. In Proceedings of the 32nd Annual Meet-
ing of the Association for Computational Linguistics.
Julia Hirschberg and Christine Nakatani. 1998. Acoustic
indicators of topic segmentation. In Proceedings on
ICSLP-98.
Gina-Anne Levow. 2004. Prosody-based topic segmen-
tation for mandarin broadcast news. In Proceedings of
HLT-NAACL 2004, Volume 2.
D. Marcu. 2000. The Theory and Practice of Discourse
Parsing and Summarization. MIT Press.
C. H. Nakatani, J. Hirschberg, and B. J. Grosz. 1995.
Discourse structure in spoken language: Studies on
speech corpora. In Working Notes of the AAAI Spring
Symposium on Empirical Methods in Discourse Inter-
pretation and Generation, pages 106 112.
J.R. Quinlan. 1992. C4.5: Programs for Machine Learn-
ing. Morgan Kaufmann.
K. Ross and M. Ostendorf. 1996. Prediction of ab-
stract labels for speech synthesis. Computer Speech
and Language, 10:155 185.
Marc Swerts. 1997. Prosodic features at discourse
boundaries of different strength. Journal of the Acous-
tical Society of America, 101(1):514 521.
G. Tur, D. Hakkani-Tur, A. Stolcke, and E. Shriberg.
2001. Integrating prosodic and lexical cues for auto-
matic topic segmentation. Computational Linguistics,
27(1):31 57.
C. Wayne. 2000. Multilingual topic detection and track-
ing: Successful research enabled by corpora and eval-
uation. In Language Resources and Evaluation Con-
ference (LREC) 2000, pages 1487 1494.
