<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2003"> <Title>Museli: A Multi-Source Evidence Integration Approach to Topic Segmentation of Spontaneous Dialogue</Title> <Section position="5" start_page="9" end_page="11" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> In this section we evaluate Museli in comparison to the best-performing state-of-the-art approaches, demonstrating that our hybrid Museli approach outperforms all of these approaches on two different dialogue corpora by a statistically significant margin (p < .01), in one case reducing the probability of error as measured by Beeferman's P k .</Paragraph> <Section position="1" start_page="9" end_page="10" type="sub_section"> <SectionTitle> 4.1 Experimental Corpora </SectionTitle> <Paragraph position="0"> We used two different dialogue corpora for our evaluation. The first, which we refer to as the Olney & Cai corpus, is a set of dialogues selected randomly from the same corpus from which Olney and Cai (2005) drew theirs. The second is a locally collected corpus of thermodynamics tutoring dialogues, which we refer to as the Thermo corpus. This corpus is particularly appropriate for addressing the research question of how to automatically segment dialogue, for two reasons. First, the exploratory task that students and tutors engaged in together is more loosely structured than many of the task-oriented domains typically investigated in the dialogue community, such as flight reservation or meeting scheduling. Second, because the tutor and student play asymmetric roles in the interaction, this corpus allows us to explore how conversational role affects the way speakers mark topic shifts.</Paragraph> <Paragraph position="1"> Table 1 presents statistics describing characteristics of these two corpora. 
Similar to (Passonneau and Litman, 1993), we adopt a flat model of topic segmentation for our gold standard based on discourse segment purpose, where a shift in topic corresponds to a shift in purpose that is acknowledged and acted upon by both conversational agents. We evaluated inter-coder reliability over 10% of the Thermo corpus mentioned above. Three annotators were given a 10-page coding manual with an explanation of our informal definition of shared discourse segment purpose as well as examples of segmented dialogues. Pairwise inter-coder agreement was above 0.7 kappa for all pairs of annotators.</Paragraph> </Section> <Section position="2" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.2 Baseline Approaches </SectionTitle> <Paragraph position="0"> We evaluate Museli against the following algorithms: (1) Olney and Cai (Ortho), (2) Barzilay and Lee (B&L), (3) TextTiling (TT), and (4) Foltz.</Paragraph> <Paragraph position="1"> Unlike the other baseline algorithms, Olney and Cai (2005) applied their orthonormal basis approach specifically to dialogue and, prior to this work, reported the highest numbers for topic segmentation of dialogue. Barzilay and Lee's approach is the state of the art in modeling topic shifts in monologue text. Our application of B&L to dialogue attempts to harness any existing and recognizable redundancy in topic flow across our dialogues for the purpose of topic segmentation.</Paragraph> <Paragraph position="2"> We chose TextTiling for its seminal contribution to monologue segmentation. TextTiling and Foltz consider lexical cohesion as their only evidence of topic shifts. Applying these approaches to dialogue segmentation sheds light on how term distribution in dialogue differs from that of expository monologue text (e.g., news articles).</Paragraph> <Paragraph position="3"> The Foltz and Ortho approaches require a trained LSA space, which we prepared as described in (Olney and Cai, 2005). 
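As a rough illustration of the kind of LSA space required by the Foltz and Ortho baselines, the following is a minimal sketch using invented toy contributions; it follows the general recipe (term-by-unit matrix plus truncated SVD, with each dialogue contribution as one atomic text unit) rather than the exact training procedure of Olney and Cai (2005).

```python
import numpy as np

# Toy data (hypothetical): each dialogue contribution is one text unit.
contributions = [
    "the gas expands at constant pressure",
    "the pressure stays constant while the gas expands",
    "what grade did you get on the quiz",
]
vocab = sorted({w for c in contributions for w in c.split()})

# Term-by-contribution count matrix.
A = np.array([[c.split().count(w) for c in contributions] for w in vocab],
             dtype=float)

# Truncated SVD: keep the top-k singular dimensions as the LSA space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # term vectors in the reduced space

def project(text):
    """Fold a contribution into the LSA space as the sum of its term vectors."""
    return sum((term_vecs[vocab.index(w)] for w in text.split() if w in vocab),
               np.zeros(k))

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
```

Under such a space, adjacent contributions with low cosine similarity would be candidate topic boundaries, which is essentially how the cohesion-based baselines use it.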
Any parameter tuning for approaches other than our hybrid approach was performed over the entire test set, giving competing algorithms the maximum advantage.</Paragraph> <Paragraph position="4"> In addition to these approaches, we include segmentation results from three degenerate approaches: (1) classifying all contributions as NEW_TOPIC (ALL), (2) classifying no contributions as NEW_TOPIC (NONE), and (3) classifying contributions as NEW_TOPIC at uniform intervals (EVEN), corresponding to the average reference topic length (see Table 1).</Paragraph> <Paragraph position="5"> As a means of comparison, we adopt two evaluation metrics: P k and f-measure. An extensive argument for P k 's robustness (when k is set to half the average reference topic length) is presented in (Beeferman et al., 1999). P k measures the probability of misclassifying two contributions a distance of k contributions apart, where the classification question is whether or not the two contributions belong to the same topic segment. Lower P k values are preferred over higher ones. It captures the effect of false negatives and false positives equally, and it favors near misses. F-measure, by contrast, punishes all false positives equally, regardless of their distance to the reference boundary.</Paragraph> </Section> <Section position="3" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> Results for all approaches are displayed in Table 2. Note that lower values of P k are preferred over higher ones. The opposite is true of F-measure. In both corpora, Museli performed significantly better than all other approaches (p < .01).</Paragraph> </Section> <Section position="4" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 4.4 Error Analysis </SectionTitle> <Paragraph position="0"> Results for all approaches are better on the Olney and Cai corpus than the Thermo corpus. 
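For reference, the P k metric adopted in Section 4.2 can be sketched as follows; this is a minimal illustration over per-contribution segment labels under the convention of (Beeferman et al., 1999), not the evaluation code used in this work.

```python
def p_k(reference, hypothesis, k=None):
    """Beeferman's P_k: the probability that two positions k apart are
    classified inconsistently (same segment vs. different segments) by
    the hypothesis relative to the reference.

    `reference` and `hypothesis` are lists of segment ids, one per
    dialogue contribution, e.g. [0, 0, 1, 1, 1, 2].
    """
    n = len(reference)
    if k is None:
        # Convention: k is half the average reference segment length.
        num_segments = len(set(reference))
        k = max(1, round(n / num_segments / 2))
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        if same_ref != same_hyp:
            errors += 1
    return errors / (n - k)
```

A perfect segmentation scores 0.0; both missed boundaries and spurious ones raise the score, and boundaries placed near (but not exactly at) a reference boundary are penalized less than distant ones.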
The Thermo corpus differs profoundly from the Olney and Cai corpus in ways that very likely influenced performance. For instance, in the Thermo corpus each dialogue contribution averages 5 words, whereas in the Olney and Cai corpus each contribution averages 28 words. Thus, the vector space representation of the dialogue contributions is much sparser in the Thermo corpus, which makes shifts in lexical coherence less reliable as topic shift indicators. In terms of P k , TextTiling (TT) performed worse than the degenerate algorithms. TextTiling measures the term overlap between adjacent regions in the discourse. However, dialogue contributions are often terse or even contentless. This produces many islands of contribution sequences for which the local lexical cohesion is zero. TextTiling wrongly classifies all of these as starts of new topics. A heuristic designed to prevent TextTiling from placing topic boundaries at every point along such a sequence failed to produce a statistically significant improvement. The Foltz and the orthonormal basis approaches rely on LSA to provide strategic semantic generalizations. Following (Olney and Cai, 2005), we built our LSA space using dialogue contributions as the atomic text unit. However, in corpora such as the Thermo corpus, this may not be effective because of the brevity of contributions.
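The zero-cohesion failure mode described above can be illustrated with a toy term-overlap computation in the spirit of TextTiling's block comparison (the dialogue snippet is invented, and this is not Hearst's full windowing algorithm):

```python
from collections import Counter
import math

def cohesion(a, b):
    """Cosine term overlap between the bags of words of two adjacent
    stretches of discourse -- the core evidence TextTiling relies on."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Terse, contentless contributions yield zero cohesion even within one topic:
dialogue = ["ok", "yeah", "so pressure is constant", "right", "mhm"]
scores = [cohesion(a, b) for a, b in zip(dialogue, dialogue[1:])]
# Every adjacent pair above has cohesion 0.0, so a purely cohesion-based
# segmenter sees a "boundary-like" dip at each of these points.
```

On expository monologue, adjacent sentences typically share enough vocabulary for such scores to be informative; on 5-word dialogue contributions they collapse to zero, which is consistent with TT's poor P k on the Thermo corpus.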
We also noticed a number of instances in the dialogue corpora where participants referred to information from previous topic segments, which may have blurred the distinction between the language models assigned to different topics.</Paragraph> </Section> </Section> </Paper>