<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3401"> <Title>Prosodic Correlates of Rhetorical Relations</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Previous Research </SectionTitle> <Paragraph position="0"> Early work on automatic RST analysis relied heavily on discourse cues to identify relations (Corston-Oliver, 1998; Knott and Sanders, 1998; Marcu, 1997; Marcu, 1999; Marcu, 2000), e.g., "however" signalling an antithesis or contrast relation. As mentioned above, this approach is limited by the fact that rhetorical relations are often not explicitly signalled, and discourse markers can themselves be ambiguous. A novel approach was described by Marcu and Echihabi (2002), who used an unsupervised training technique: relations that were explicitly and unambiguously signalled were extracted, and those examples were automatically labelled as the training set. This unsupervised technique allowed the authors to label a very large amount of data, using pairs of words found in the nucleus and satellite as the features of interest. The authors reported very encouraging pairwise classification results using these word-pair features, though subsequent work using the same bootstrapping technique has fared less well (Sporleder and Lascarides, to appear 2006).</Paragraph> <Paragraph position="1"> There is little precedent for applying RST to speech dialogues, though Taboada (2004) describes rhetorical analyses of Spanish and English spoken dialogues, with in-depth corpus analyses of discourse markers and their corresponding relations.</Paragraph> <Paragraph position="2"> Noordman et al. (1999) use short read texts to explore the relationship between prosody and the level of hierarchy in an RST tree. The authors report that higher levels in the hierarchy are associated with longer pause durations and higher pitch. Similar results are reported by den Ouden (2004), who additionally found significant prosodic differences between causal and non-causal relations and between semantic and pragmatic relations.</Paragraph> <Paragraph position="3"> Litman and Hirschberg (1990) investigated whether prosodic features could be used to disambiguate sentential versus discourse instances of certain discourse markers such as "incidentally." Passonneau and Litman (1997) explored the discourse structure of spontaneous narrative monologues, with a particular interest in both manual and automatic segmentation of narratives into coherent discourse units, using both lexical and prosodic features. Grosz and Hirschberg (1992) found that read AP news stories annotated for discourse structure in the Grosz and Sidner (1986) framework showed strong correlations between prosodic features and both global and local structure. Also in the Grosz and Sidner framework, Hirschberg and Nakatani (1996) found that utterances from direction-giving monologues differed significantly in prosody depending on whether they were segment-initial, segment-medial or segment-final.</Paragraph> </Section> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Defining the Relations </SectionTitle> <Paragraph position="0"> Following Marcu and Echihabi's work, we included contrast, elaboration and cause relations in our research. We chose to exclude condition because it is always explicitly signalled and therefore trivial for classification purposes.
We also include a summary relation, which is of particular interest here because we hope that the classification of rhetorical relations will aid an automatic speech summarization system.</Paragraph> <Paragraph position="1"> Following Segmented Discourse Representation Theory (SDRT) (Asher and Lascarides, 2004), an alternative framework for representing text structure, we also added question/answer to our list of relations. All training and testing pairs consist of a nucleus followed by a satellite, and the relations are defined as follows:
* Contrast: The information in the satellite contradicts or is an exception to the information in the nucleus. Example:
- Speaker 1: You use it as a tool
- Speaker 1: Not an end user
* Elaboration: The information from the nucleus is discussed in greater detail in the satellite. Example:
- Speaker 1: The last time I looked at it was a while ago
- Speaker 1: Probably a year ago
* Cause: The situation described in the satellite results from the situation described in the nucleus. Example:
- Speaker 1: So the GPS has crashed as well
- Speaker 1: So the first person has to ask you where you are
* Summary: The information in the satellite is semantically equivalent to the information in the nucleus. It is not necessarily more succinct. Example:
- Speaker 1: The whole point is that the text and lattice are isomorphic
- Speaker 1: They represent each other completely
* Question/Answer: The satellite fulfills an information need explicitly stated in the nucleus. Example:
- Speaker 1: What does the P stand for anyway?
- Speaker 2: I have no idea
We also took the simplifying step of concentrating only on dialogue acts that do not internally contain the relations defined above, since embedded relations could confound the analysis. For example, a dialogue act might serve as a contrast to the preceding dialogue act while also containing a cause relation within its own text span.</Paragraph> </Section> <Section position="5" start_page="2" end_page="4" type="metho"> <SectionTitle> 4 Experimental Setup </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Corpus Description </SectionTitle> <Paragraph position="0"> All data was taken from the ICSI Meetings corpus (Janin et al., 2003), a corpus of 75 unrestricted-domain meetings averaging about an hour in length each. Both native and non-native English speakers participate in the meetings. The following experiments used manual meeting transcripts and relied on manual dialogue act segmentation (Shriberg et al., 2004). A given meeting can contain between 1000 and 1600 dialogue acts. All rhetorical relation examples in the training and test sets are pairs of adjacent dialogue acts.</Paragraph> </Section> <Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 4.2 Features </SectionTitle> <Paragraph position="0"> Seventy-five prosodic features were extracted in all, relating to pitch (or F0) contour, pitch variance, energy, rate-of-speech, pause and duration. To approximate the pitch contour of a dialogue act, we measure the pitch slope at multiple points within the dialogue act: the overall slope, the slope of the first 100 and 200 ms, the last 100 and 200 ms, the first and second halves of the dialogue act, and each quarter of the dialogue act. The pitch standard deviation is measured at the same dialogue act subsections. For each of the four quarters of the dialogue act, the energy level is measured and compared to the overall dialogue act energy level, and the number of silent frames is totalled for each quarter as well. The maximum F0 for each dialogue act is included, as are the lengths of the dialogue act both in seconds and in number of words. A very rough rate-of-speech feature is employed, consisting of the number of words divided by the length of the dialogue act in seconds. We also include a feature of pause length between the nucleus and the satellite, as well as a feature indicating whether or not the speakers of the nucleus and the satellite are the same.</Paragraph> <Paragraph position="1"> Finally, the cosine similarity of the nucleus feature vector and the satellite feature vector is included, which provides a measure of the general prosodic similarity between the two dialogue acts. The motivation for this last feature is that some relations such as question would be expected to have very different prosodic characteristics in the satellite versus the nucleus, whereas other relations such as summary might have a nucleus and satellite with very similar prosody.</Paragraph>
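To make the feature definitions above concrete, the following is a minimal sketch of how the subsection-based measurements and the nucleus/satellite cosine similarity might be computed from a frame-level F0 and energy track. The 10 ms frame step, the function names and the use of unvoiced (zero-F0) frames as the silence criterion are illustrative assumptions of this sketch, not details taken from the paper.

import numpy as np

FRAME_STEP = 0.010  # assumed 10 ms analysis frame step; the paper does not specify one

def f0_slope(f0):
    """Least-squares F0 slope over the voiced frames of a region."""
    voiced = f0 > 0
    if voiced.sum() < 2:
        return 0.0
    t = np.arange(len(f0))[voiced]
    return float(np.polyfit(t, f0[voiced], 1)[0])

def dialogue_act_features(f0, energy, n_words):
    """Illustrative subset of the prosodic features described above."""
    n = len(f0)
    duration = n * FRAME_STEP
    feats = {
        "slope_overall": f0_slope(f0),
        "slope_first_200ms": f0_slope(f0[: int(0.2 / FRAME_STEP)]),
        "slope_last_200ms": f0_slope(f0[-int(0.2 / FRAME_STEP):]),
        "slope_first_half": f0_slope(f0[: n // 2]),
        "slope_second_half": f0_slope(f0[n // 2:]),
        "f0_max": float(f0.max()),
        "duration_sec": duration,
        "n_words": n_words,
        "rate_of_speech": n_words / duration,  # words per second, as in the text
    }
    mean_energy = float(energy.mean())
    # Per-quarter pitch variance, relative energy and silent-frame counts.
    for i, idx in enumerate(np.array_split(np.arange(n), 4), start=1):
        voiced = f0[idx][f0[idx] > 0]
        feats[f"f0_std_q{i}"] = float(voiced.std()) if voiced.size else 0.0
        feats[f"energy_ratio_q{i}"] = float(energy[idx].mean()) / mean_energy
        feats[f"silent_frames_q{i}"] = int((f0[idx] == 0).sum())
    return feats

def prosodic_cosine(nucleus_vec, satellite_vec):
    """Cosine similarity between nucleus and satellite feature vectors."""
    u = np.asarray(nucleus_vec, float)
    v = np.asarray(satellite_vec, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

The pause length between nucleus and satellite and the same-speaker flag would come from the dialogue act time stamps and speaker labels rather than from the signal itself.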
<Paragraph position="2"> While there are certainly informative lexical cues to be exploited based on previous research, this pilot study is expressly interested in how effective prosody alone is in automatically classifying such rhetorical relations. For that reason, the feature set is limited solely to the prosodic characteristics described above.</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.3 Training Data </SectionTitle> <Paragraph position="0"> Using the PyML machine learning tool, support vector machines with polynomial kernels were trained on the multiple training sets described below, using the default libsvm solver, a sequential minimal optimization (SMO) method. Feature normalization and feature subset selection using recursive feature elimination were carried out on the data. The following subsections describe the various training approaches we experimented with.</Paragraph> <Paragraph position="1"> For the first experiment, a very small set of manually labelled relations was constructed. Forty examples of each relation were annotated, for a total training set of 200 examples. Each relation has training examples that are explicitly and non-explicitly signalled, since we want to discover prosodic cues for each relation that are not dependent on how lexically explicit the relation tends to be. The percentage of either unsignalled or ambiguously signalled relations across all of the manually-labelled datasets is about 57%, though this varies considerably depending on the relation. For example, only just over 20% of questions are unsignalled or ambiguously signalled, whereas nearly 70% of elaborations are unsignalled.</Paragraph> <Paragraph position="2"> Following Marcu and Echihabi, we employ a bootstrapping technique wherein we extract cases which are explicitly signalled lexically and use those as our automatically labelled training set. Because those lexical cues are sometimes ambiguous or misleading, the data will necessarily be noisy, but this approach allows us to create a large training set without the time and cost of manual annotation.</Paragraph>
Whereas Marcu and Echihabi used this method to extract relation examples and learn further lexical information about the relation pairs, we are using similar templates based on discourse markers but subsequently exploring the extracted relation pairs in terms of prosodic features. Three hundred examples of each relation were extracted and automatically labelled, for a training set of 1500 examples, more than ten times the size of the manually labelled training set. Examples of the explicit lexical cues used to construct the training set are provided in Table 1:
Table 1: Explicit lexical cues used to construct the automatically labelled training set.
Contrast: Whereas ... | ... But ... | ... Except ... | ... Although ...
Cause: ... Therefore ... | ... As a result ... | ... And so ... | ... Subsequently ...
Elaboration: ... Which ... | ... For Example ... | ... Specifically ...
Summary: ... Basically ... | ... In other words ... | ... I mean ... | ... In short ...
Finally, the two training sets discussed above were combined to create a set of 1700 training examples.</Paragraph> </Section> <Section position="4" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 4.4 Development and Testing Data </SectionTitle> <Paragraph position="0"> For the development set, 35 examples of each relation were annotated, for a total set size of 175 examples. We repeatedly tested on the development set as we increased the prosodic database and experimented with various classifier types. The smaller final test set consists of 15 examples of each relation, for a total set size of 75 examples. Both the test set and the development set consist of explicitly and non-explicitly signalled relations. As mentioned above, the percentage of either unsignalled or ambiguously signalled relations across all of the manually-labelled datasets is about 57%. Both pairwise and multi-class classification were carried out. The former set of experiments simply aimed to determine which relation pairs were most confusable with each other; however, it is the latter multi-class experiments that are most indicative of the real-world usefulness of rhetorical classification using prosodic features. Since our goal is to label meeting transcripts with rhetorical relations as a preprocessing step for automatic summarization, multi-class classification must be quite good to be at all useful.</Paragraph> </Section> </Section>
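The classification setup in Section 4.3 (normalization, recursive feature elimination, a polynomial-kernel SVM, one-against-the-rest multi-class) was run in PyML; purely as an illustration, an equivalent pipeline might look as follows in scikit-learn, which we substitute here only because it is more widely known. The polynomial degree, the number of retained features, and the linear SVM used to rank features inside RFE are all assumptions of this sketch, not settings reported in the paper.

from sklearn.feature_selection import RFE
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_classifier(n_features: int = 40):
    """Normalize, select features by recursive elimination, then classify
    with a polynomial-kernel SVM in one-against-the-rest fashion."""
    return make_pipeline(
        StandardScaler(),  # feature normalization
        # RFE needs per-feature weights, so a linear SVM ranks the features;
        # the paper does not say which estimator drove its elimination step.
        RFE(SVC(kernel="linear"), n_features_to_select=n_features),
        OneVsRestClassifier(SVC(kernel="poly", degree=2)),
    )

# X_train: (n_pairs, n_prosodic_features) array; y_train: relation labels.
# clf = build_classifier()
# clf.fit(X_train, y_train)
# accuracy = clf.score(X_dev, y_dev)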
<Section position="6" start_page="4" end_page="5" type="metho"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> The following subsections give results on a development set of 175 relation pairs and on a test set of 75 relation pairs.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.1 Development Set Results </SectionTitle> <Paragraph position="0"> The pairwise classification results on the development set are quite encouraging, showing that prosodic cues alone can yield an average of 68% classification success. Because equal class sizes were used in all data sets, the baseline classification accuracy is 50%. The manually-labelled training data resulted in the highest accuracy, with the unsupervised technique performing slightly worse and the combination approach showing no added benefit over manually-labelled data alone. Relation pairs involving the question relation generally perform best, with the single highest pairwise classification accuracy being between elaboration and question. Elaboration is also generally discernible from contrast and summary.</Paragraph> <Paragraph position="1"> The multi-class classification on the development set attained an accuracy of 0.35 using a one-against-the-rest classification approach, with chance-level classification being 0.20. The confusion matrix in Table 3 illustrates the difficulty of multi-class classification; while cause, contrast and question relations are classified with considerable success, the elaboration relation pairs are often misclassified as cause and the summary pairs misclassified as question.</Paragraph> </Section> <Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 5.2 Test Set Results 5.2.1 Pairwise </SectionTitle> <Paragraph position="0"> The pairwise results on the test set are similar to those on the development set, with the manually-labelled training set yielding superior results to the other two approaches, and relation pairs involving question and elaboration relations being particularly discernible. The average accuracy of the supervised approach applied to the test set is 67%, which closely mirrors the results on the development set.</Paragraph> <Paragraph position="1"> The most confusable pairs are summary/question and cause/elaboration; the former is quite surprising in that the question nucleus would be expected to have a prosody quite distinct from the others.</Paragraph> <Paragraph position="2"> The multi-class classification on the test set was considerably worse than on the development set, with a success rate of only 0.24 (baseline: 0.20).</Paragraph> </Section> <Section position="3" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 5.3 Features Analysis </SectionTitle> <Paragraph position="0"> This section details the prosodic characteristics of the manually labelled relations in the training, development and test sets.</Paragraph> <Paragraph position="1"> The contrast relation is typically realized with a low rate-of-speech for the nucleus and a high rate-of-speech for the satellite, little or no pause between nucleus and satellite, a relatively flat overall F0 slope for the nucleus, and a satellite that increases in energy from the beginning to the end of the dialogue act. Of the manually labelled data sets, 74% of the examples are within a single speaker's turn.</Paragraph> <Paragraph position="2"> The cause relation typically has a very long nucleus, a large amount of which is silence. The slope of the nucleus is typically flat and the nuclear rate-of-speech is low. The satellite has a low rate-of-speech, a large amount of silence, a high maximum F0 and a long duration. There is typically a long pause between nucleus and satellite, and the speakers of the nucleus and the satellite are the same. Of the manually labelled data sets, nearly 94% of the examples are within a single speaker's turn.</Paragraph> <Paragraph position="3"> The elaboration relation is often realized with a long nuclear duration, a long satellite duration, a long pause in between and a low rate-of-speech for the satellite. The satellite typically has a high maximum F0, and the speakers of the nucleus and satellite are the same.
95% of the manually labelled examples occur within a single speaker's turn.</Paragraph> <Paragraph position="5"> With the summary relation, the nucleus typically has a steeply falling overall F0 while the satellite has a rising overall F0. There is a short pause, and a short duration for both nucleus and satellite. The rate-of-speech for the satellite is typically very high and there is little silence. 48% of the manually labelled examples occur within a single speaker's turn.</Paragraph> <Paragraph position="6"> Finally, the question relation has a number of unique characteristics. The rate-of-speech of the nucleus is very high and there is very little silence.</Paragraph> <Paragraph position="7"> Surprisingly, these examples do not have canonical question intonation, instead having a low maximum F0 for the nucleus and a declining slope at the end of the nucleus. The overall F0 of the satellite steeply declines and has a high standard deviation. The energy levels of the second and third quarters of the satellite are high compared with the average satellite energy, and there is very little silence in the satellite as a whole. There is little or no pause between nucleus and satellite, and both nucleus and satellite have relatively short durations. The maximum F0 for the satellite is typically low, and the speaker of the satellite is almost always different from the speaker of the nucleus (99% of the time).</Paragraph> </Section> </Section> <Section position="7" start_page="5" end_page="6" type="metho"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> These experiments attempted to classify five rhetorical relations between dialogue acts in meeting speech using prosodic features. We primarily focused on pitch contour, using numerous features of pitch slope and variance intended to approximate the contour. In addition, we incorporated pause, energy, rate-of-speech and duration into our feature set. Using an unsupervised bootstrapping approach, we automatically labelled a large amount of training data and compared this approach to using a very small training set of manually labelled data. Whereas Marcu and Echihabi used such a bootstrapping approach to learn additional lexical information about relation pairs, we used the automatically labelled examples to learn the prosodic correlates of the relations. However, even a small amount of manually-labelled training data outperformed the unsupervised method, which is the same conclusion reached by Sporleder and Lascarides (to appear 2006), and a combined training method gave no additional benefit. One possible explanation for the poor performance of the bootstrapping approach is that some of the templates were inadvertently ambiguous, e.g., "I mean" can signal either an elaboration or a summary, and "which" can signal either an elaboration or the beginning of a question relation. Furthermore, one possible drawback in employing this bootstrapping method is that there may be a complementary distribution between prosodic and lexical features. We are using explicit lexical cues to build an automatically labelled training set, but such explicitly cued relations may not be prosodically distinct. For example, a question that is signalled by "Who" or "What" may not have canonical question intonation precisely because it is lexically signalled.
This relates to a finding of Sporleder and Lascarides, who report that the unsupervised method of Marcu and Echihabi only generalizes well to relations that are already explicitly signalled, i.e., those which could be found just by using the templates themselves.</Paragraph> <Paragraph position="1"> The pairwise results were quite encouraging, with the supervised training approach yielding average accuracies of 68% and 67% on the development and test sets respectively. This illustrates that prosody alone is quite indicative of certain rhetorical relations between dialogue acts. However, the multi-class classification performance was not far above chance levels. If this automatic rhetorical analysis is to aid an automatic summarization system, we will need to expand the prosodic database and perhaps couple this approach with a limited lexical/discourse approach in order to improve the multi-class classification accuracy. Most importantly, if even a small amount of training data leads to decent pairwise classification using only prosodic features, then greatly increasing the amount of manual annotation should provide considerable improvement.</Paragraph> </Section> <Section position="8" start_page="6" end_page="6" type="metho"> <SectionTitle> 7 Acknowledgements </SectionTitle> <Paragraph position="0"> Thanks to Mirella Lapata and Caroline Sporleder for valuable feedback. Thanks to two anonymous reviewers for helpful suggestions. This work was partly supported by the European Union 6th FWP</Paragraph> </Section> </Paper>