<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1115"> <Title>Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News</Title> <Section position="9" start_page="0" end_page="0" type="evalu"> <SectionTitle> 7 Classification </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.1 Prosodic Feature Set </SectionTitle> <Paragraph position="0"> The results above indicate that duration, pitch, and intensity should be useful for automatic prosody-based identification of topic boundaries. To facilitate cross-speaker comparisons, we use normalized representations of average pitch, average intensity, and word duration. We also include absolute word duration. These features form a word-level context-independent feature set.</Paragraph> <Paragraph position="1"> Since segment boundaries and their cues exist to contrastively signal the separation between topics, we augment these features with local context-dependent measures. Specifically, we add features that measure the change between the current word and the next word.2 This contextualization adds four contextual features: change in normalized average pitch, change in normalized average intensity, change in normalized word duration, and duration of following silence or non-speech region.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.2 Text Feature Set </SectionTitle> <Paragraph position="0"> In addition to the prosodic features which are our primary interest, we also consider a set of features that exploit textual similarity to identify segment boundaries. Motivated by text and topic similarity measures in the vector space model of information retrieval (Salton, 1989), we compute a vector representation of the words in 50 word windows preceding and following the current potential boundary position. We compute the cosine similarity of these two vectors. 
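The window-comparison step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the `idf` lookup table are assumptions, and in the paper the windows are 50 words drawn from the ASR transcription.

```python
import math
from collections import Counter

def cosine_tfidf(left_window, right_window, idf):
    """Cosine similarity of tf-idf vectors for the two word windows
    flanking a candidate boundary. `idf` maps word -> idf weight;
    unseen words default to 1.0 (an assumption for this sketch)."""
    tf_left, tf_right = Counter(left_window), Counter(right_window)
    vec_left = {w: tf * idf.get(w, 1.0) for w, tf in tf_left.items()}
    vec_right = {w: tf * idf.get(w, 1.0) for w, tf in tf_right.items()}
    # Dot product over the shared vocabulary only.
    dot = sum(vec_left[w] * vec_right[w] for w in vec_left.keys() & vec_right.keys())
    norm = (math.sqrt(sum(v * v for v in vec_left.values()))
            * math.sqrt(sum(v * v for v in vec_right.values())))
    return dot / norm if norm else 0.0
```

At a true topic boundary the two windows share little vocabulary, so this value should be low; within a story it should be higher.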
We employ a tf-idf weighting; each term is weighted by the product of its frequency in the window (tf) and its inverse document frequency (idf) as a measure of topicality. We also consider the same similarity measure computed across 30 word windows. The final text similarity measure we consider is simple word overlap, counting the number of words that appear in both 50 word windows defined above. We did not remove stopwords, and used the word-based units from the ASR transcription directly as our term units. We expect that these measures will be minimized at topic boundaries, where changes in topic are accompanied by changes in topical terminology.</Paragraph> <Paragraph position="1"> Finally, we identified a small set of word unigram features occurring within a ten-word window immediately preceding or following a story boundary that were indicative of such a boundary.3 (2: We have posed the task of boundary detection as the task of finding segment-final words, so the technique incorporates a single-word lookahead. We could also repose the task as identification of topic-initial words and avoid the lookahead.) These features include the Mandarin Chinese words for &quot;audience&quot;, &quot;reporting&quot;, and &quot;Voice of America.&quot; We used a boolean feature for each such word, corresponding to its presence or absence in the current word's environment, in the classifier formulation.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.3 Classifier Training and Testing Configuration </SectionTitle> <Paragraph position="0"> We employed Quinlan's C4.5 (Quinlan, 1992) decision tree classifier to provide a readily interpretable classifier. Now, the vast majority of word positions in our collection are non-segment-final.
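The boolean cue-word features described above might be computed as in the following sketch. The Mandarin surface forms shown are illustrative guesses for the English glosses the paper lists ("audience", "reporting", "Voice of America"), not the paper's actual lexicon.

```python
# Hypothetical cue-word lexicon; the paper gives only English glosses.
CUE_WORDS = {"观众", "报道", "美国之音"}

def unigram_features(words, i, window=10):
    """One boolean feature per cue word: does it occur within the ten
    words immediately preceding or following position i?"""
    context = set(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
    return {w: (w in context) for w in CUE_WORDS}
```

Each feature is true only when the cue word falls inside the local environment of the candidate boundary word, matching the presence/absence encoding described in the text.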
So, in order to focus training and testing on segment boundary identification and to assess the discriminative capability of the classifier, we downsample our corpus to produce a 50/50 split of segment-final and non-final words. We train on 3500 segment-final words4 and 3500 non-final words, not matched in any way, drawn randomly from the full corpus. We test on a similarly balanced test set of 500 instances.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.4 Classifier Evaluation </SectionTitle> <Paragraph position="0"> The resulting classifier achieved 95.6% accuracy, with 2% missed boundaries and 7% false alarms.</Paragraph> <Paragraph position="1"> This effectiveness is a substantial improvement over the sample baseline of 50%. A portion of the decision tree is reproduced in Figure 1. Inspection of the tree indicates the key role of silence, as well as the use of both contextual and purely local features of pitch, intensity, and duration. The classifier relies</Paragraph> <Paragraph position="2"> on the theoretically and empirically grounded features of pitch, intensity, duration, and silence, where it has been suggested that higher pitch and wider range are associated with topic initiation and lower pitch or narrower range with topic finality. We performed a set of contrastive experiments to explore the impact of different lexical tones on classification accuracy. We grouped words based on the lexical tone of the initial syllable into high, rising, low, and falling. We found no tone-based differences in classification, with all groups achieving 94-96% accuracy. Since the magnitude of the difference in pitch based on discourse position is comparable to that based on lexical tone identity, and the overlap between pitch values in non-final and final positions is relatively small, we obtain consistent results.
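The balanced downsampling described in Section 7.3 can be sketched as follows. This is an illustration under the assumption that each instance is a (features, is_segment_final) pair; the paper gives no implementation details beyond the 50/50 split and random draw.

```python
import random

def balanced_sample(instances, n_per_class, seed=0):
    """Downsample to a 50/50 split of segment-final and non-final words.
    Each instance is assumed to be a (features, is_final) pair."""
    rng = random.Random(seed)
    finals = [x for x in instances if x[1]]
    non_finals = [x for x in instances if not x[1]]
    # Draw n_per_class of each without replacement, then shuffle.
    sample = rng.sample(finals, n_per_class) + rng.sample(non_finals, n_per_class)
    rng.shuffle(sample)
    return sample
```

With n_per_class=3500 for training and 250 per class for the 500-instance test set, this reproduces the setup the section describes.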
In a comparable experiment, we employed only the text similarity, text unigram, and silence duration features to train and test the classifier. These features similarly achieved a 95.6% overall classification accuracy, with 3.6% miss and 5.2% false alarm rates. Here the best classification accuracy was achieved by the tf-idf weighted 50 word window based text similarity measure. The text unigram features also contributed to overall classifier effectiveness. A portion of the decision tree classifier using text-based features is reproduced in Figure 2.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.4.3 Combined Prosody and Text Classification </SectionTitle> <Paragraph position="0"> Finally, we built a combined classifier integrating all prosodic, textual, and silence features. This classifier yielded an accuracy of 96.4%, somewhat better effectiveness, though still with more than twice as many false alarms as missed detections. The decision tree utilized all prosodic features. tf-idf weighted cosine similarity alone performed as well as any of the other text similarity or overlap measures. The text unigram features also contributed to overall classifier effectiveness. A portion of the decision tree classifier using prosodic, textual, and silence features is reproduced in Figure 3.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.5 Feature Comparison </SectionTitle> <Paragraph position="0"> We also performed a set of contrastive experiments with different subsets of available features to assess the dependence on these features.5 We grouped features into 5 sets: pitch, intensity, duration, silence, and text-similarity.
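One illustrative encoding of this five-way grouping is shown below. The feature names are assumptions chosen to match the features described in Sections 7.1 and 7.2, not the paper's actual labels.

```python
# Hypothetical mapping from feature class to member features.
FEATURE_GROUPS = {
    "pitch": ["norm_pitch", "delta_pitch"],
    "intensity": ["norm_intensity", "delta_intensity"],
    "duration": ["word_dur", "norm_word_dur", "delta_word_dur"],
    "silence": ["following_silence"],
    "text-similarity": ["tfidf_sim_50", "tfidf_sim_30", "overlap_50"],
}

def drop_group(features, group):
    """Remove one feature class from an instance's feature dict,
    as done before re-training in the ablation experiments."""
    dropped = set(FEATURE_GROUPS[group])
    return {name: v for name, v in features.items() if name not in dropped}
```

Successively applying drop_group to the class at the root of the learned tree and re-training yields the ablation series the section goes on to describe.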
For each of the prosody-only, text-only, and combined prosody and text-based classifiers, we successively removed the feature class at the root of the decision tree and re-trained with the remaining features (Table 1).</Paragraph> <Paragraph position="1"> We observe that although silence duration plays a very significant role in story boundary identification for all feature sets, the richer prosodic and mixed text-prosodic classifiers are much more robust to the absence of silence information. Further, we observe that intensity and then pitch play the next most important roles in classification. This behavior can be explained by the observation that, like silence or non-speech regions, pitch and intensity changes provide sharp, local cues to topic finality or initiation. Thus the prosodic features provide some measure of redundancy for the silence feature. In contrast, the text similarity measures apply to relatively wide regions, comparing pairs of 50 word windows.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.6 Further Feature Integration </SectionTitle> <Paragraph position="0"> Finally, we considered effectiveness on a representative test sampling of the full data set, rather than the downsampled, balanced set, adding a proportional number of unseen non-final words to the test set.</Paragraph> <Paragraph position="1"> We observed that although the overall classification accuracy was 95.6% and we were missing only 2% of the story boundaries, we produced a high level of false alarms (4.4%), accounting for most of the observed classification errors. Given the large predominance of non-boundary positions in the real-world distribution, we sought to better understand and reduce this false alarm rate, and hopefully reduce the overall error rates.
At the same time, we hoped to avoid a dramatic increase in the miss rate.</Paragraph> <Paragraph position="2"> (5: It has been suggested that this broadcast source makes idiosyncratically large use of silence at story boundaries; personal communication, James Allan.)</Paragraph> <Paragraph position="3"> To explore this question, we considered the contribution of each of the three main feature types - prosody, text, and silence - and their combined effects on false alarms. We constructed independent feature-set-specific decision tree classifiers for each of the feature types and compared their independent classifications to those of the integrated classifier.</Paragraph> <Paragraph position="4"> We found that while there was substantial agreement across the different feature-based classifiers in the cases of correct classification, erroneous classifications often occurred when the assignment was a minority decision. Specifically, one-third of the false alarms were based on a minority assignment, where only the fully integrated classifier deemed the position a boundary or where it agreed with only one of the feature-set-specific classifiers.</Paragraph> <Paragraph position="5"> Based on these observations, we completed our multi-feature integration by augmenting the decision tree based classification with a voting mechanism. In this configuration, a boundary was only assigned in cases where the integrated classifier agreed with at least two of the feature-set-specific classifiers. This approach reduced the false alarm rate by one-third, to 3.15%, while the miss rate rose only to 2.8%. The overall accuracy on a representative sample distribution reached 96.85%.</Paragraph> </Section> </Section> </Paper>