File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/n06-2050_evalu.xml
Size: 5,748 bytes
Last Modified: 2025-10-06 13:59:38
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2050"> <Title>Comparing the roles of textual, acoustic and spoken-language features on spontaneous-conversation summarization</Title> <Section position="6" start_page="197" end_page="199" type="evalu"> <SectionTitle> 4 Experimental results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="197" end_page="198" type="sub_section"> <SectionTitle> 4.1 Experiment settings </SectionTitle> <Paragraph position="0"> The data used for our experiments come from SWITCHBOARD. We randomly select 27 conversations, containing around 3660 utterances. The important utterances of each conversation are manually annotated. We use f-score and the ROUGE score as evaluation metrics. Ten-fold cross validation is applied to obtain the results presented in this section.</Paragraph> </Section> <Section position="2" start_page="198" end_page="198" type="sub_section"> <SectionTitle> 4.2 Summarization performance </SectionTitle> <Paragraph position="0"> Table-1 shows the f-score of logistic regression (LR) based summarizers, under different compression ratios, and with incremental features Both tables show that the performance of summarizers improved, in general, with more features used. The use of lexicon and structural features outperforms MMR, and the speech-related features, acoustic features and spoken language features produce additional improvements.</Paragraph> <Paragraph position="1"> The following tables provide the ROUGE-1 scores: The ROUGE-1 scores show similar tendencies to the f-scores: the rich features improve summarization performance over the baseline MMR summarizers. Other ROUGE scores like ROUGE-L show the same tendency, but are not presented here due to the space limit.</Paragraph> <Paragraph position="2"> Both the f-score and ROUGE indicate that, in general, rich features incrementally improve summarization performance.</Paragraph> </Section> <Section position="3" start_page="198" end_page="199" type="sub_section"> <SectionTitle> 4.3 Comparison of features </SectionTitle> <Paragraph position="0"> To study the effectiveness of individual features, the receiver operating characteristic (ROC) curves of these features are presented in Figure-1 below.</Paragraph> <Paragraph position="1"> The larger the area under a curve is, the better the performance of this feature is. To be more exact, the definition for the y-coordinate (sensitivity) and where TP, FN, TN and FP are true positive, false negative, true negative, and false positive, respectively.</Paragraph> <Paragraph position="2"> Figure-1. ROC curves for individual features Lexicon and MMR features are the best two individual features, followed by spoken-language and acoustic features. The structural feature is least effective.</Paragraph> <Paragraph position="3"> Let us first revisit the problem (2) discussed above in the introduction. The effectiveness of the structural feature is less significant than it is in broadcast news. According to the ROC curves presented in Christensen et al. (2004), the structural feature (utterance position) is one of the best features for summarizing read news stories, and is less effective when news stories contain spontaneous speech. Both their ROC curves cover larger area than the structural feature here in figure 1, that is, the structure feature is less effective for summarizing spontaneous conversation than it is in broadcast news. This reflects, to some extent, that information is more evenly distributed in spontaneous conversations.</Paragraph> <Paragraph position="4"> Now let us turn to the role of speech disfluencies, which are very common in spontaneous conversations. Previous work detects and removes disfluencies as noise. Indeed, disfluencies show regularities in a number of dimensions (Shriberg, 1994). They correlate with many factors including the topic difficulty (Bortfeld et al, 2001). Tables 14 above show that they improve summarization performance when added upon other features.</Paragraph> <Paragraph position="5"> Figure-1 shows that when used individually, they are better than the structural feature, and also better than acoustic features at the left 1/3 part of the figure, where the summary contains relatively fewer utterances. Disfluencies, e.g., pauses, are often inserted when speakers have word-searching problem, e.g., a problem finding topic-specific keywords: Speaker A: with all the uh sulfur and all that other stuff they're dumping out into the atmosphere.</Paragraph> <Paragraph position="6"> The above example is taken from a conversation that discusses pollution. The speaker inserts a filled pause uh in front of the word sulfur. Pauses are not randomly inserted. To show this, we remove them from transcripts. Section-2 of SWITCHBOARD (about 870 dialogues and 189,000 utterances) is used for this experiment. Then we insert these pauses back randomly, or insert them back at their original places, and compare the difference. For both cases, we consider a window with 4 words after each filled pause. We average the tf.idf scores of the words in each of these windows. Then, for all speaker-inserted pauses, we obtain a set of averaged tf.idf scores. And for all randomlyinserted pauses, we have another set. The mean of the former set (5.79 in table 5) is statistically higher than that of the latter set (5.70 in table 5). We can adjust the window size to 3, 2 and 1, and then get the following table.</Paragraph> <Paragraph position="7"> The above table shows that instead of randomly inserting pauses, real speakers insert them in front of words with higher tf.idf scores. This helps explain why disfluencies work.</Paragraph> </Section> </Section> class="xml-element"></Paper>