<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1035"> <Title>Automatic Segmentation of Multiparty Dialogue</Title> <Section position="4" start_page="273" end_page="275" type="metho"> <SectionTitle> 3 Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="273" end_page="273" type="sub_section"> <SectionTitle> 3.1 Data </SectionTitle> <Paragraph position="0"> In this study, we used the ICSI meeting corpus (LDC2004S02). Seventy-five natural meetings of ICSI research groups were recorded using close-talking far field head-mounted microphones and four desktop PZM microphones. The corpus includes human transcriptions of all meetings. We addedASRtranscriptions ofall75meetingswhich were produced by Hain (2005), with an average WER of roughly 30%.</Paragraph> <Paragraph position="1"> The ASR system used a vocabulary of 50,000 words, together with a trigram language model trained on a combination of in-domain meeting data, related texts found by web search, conversational telephone speech (CTS) transcripts and broadcast news transcripts (about 109 words in total), resulting in a test-set perplexity of about 80. The acoustic models comprised a set of context-dependent hidden Markov models, using gaussian mixture model output distributions. These were initially trained on CTSacoustic training data, and were adapted to the ICSI meetings domain using maximum a posteriori (MAP) adaptation. Further adaptation to individual speakers was achieved using vocal tract length normalization and maximum likelihood linear regression. A four-fold cross-validation technique was employed: four recognizers were trained, with each employing 75% of the ICSI meetings as acoustic and language model training data, and then used to recognize the remaining 25% of the meetings.</Paragraph> </Section> <Section position="2" start_page="273" end_page="274" type="sub_section"> <SectionTitle> 3.2 Fine-grained and coarse-grained topics </SectionTitle> <Paragraph position="0"> We characterize a dialogue as a sequence of topical segments that may be further divided into subtopic segments. For example, the 60 minute meeting Bed003, whose theme is the planning of aresearch project on automatic speech recognition can be described by 4 major topics, from &quot;opening&quot; to &quot;general discourse features for higher layers&quot; to &quot;how to proceed&quot; to &quot;closing&quot;. Depending on the complexity, each topic can be further divided into a number of subtopics. For example, &quot;how to proceed&quot; can be subdivided to 4 subtopic segments, &quot;segmenting off regions of features&quot;, &quot;ad-hoc probabilities&quot;, &quot;data collection&quot; and &quot;experimental setup&quot;.</Paragraph> <Paragraph position="1"> Three human annotators at our site used a tailored tool to perform topic segmentation in which they could choose to decompose a topic into subtopics, with at most three levels in the resulting hierarchy. Topics are described to the annotators as what people in a meeting were talking about.</Paragraph> <Paragraph position="2"> Annotators were asked to provide a free text label for each topic segment; they were encouraged to use keywords drawn from the transcription in these labels, and we provided some standard labels for non-content topics, such as &quot;opening&quot; and &quot;chitchat&quot;, to impose consistency. 
<Paragraph position="3"> For our initial experiments with automatic segmentation at different levels of granularity, we flattened the subtopic structure and considered only two levels of segmentation: top-level topics and all subtopics.</Paragraph> <Paragraph position="4"> To establish the reliability of our annotation procedure, we calculated kappa statistics between the annotations of each pair of coders. Our analysis indicates that human annotators achieve k = 0.79 agreement on top-level segment boundaries and k = 0.73 agreement on subtopic boundaries. This level of agreement confirms good replicability of the annotation procedure.</Paragraph> </Section> <Section position="3" start_page="274" end_page="274" type="sub_section"> <SectionTitle> 3.3 Probabilistic models </SectionTitle> <Paragraph position="0"> Our goal is to investigate the impact of ASR errors on the selection of features and the choice of models for segmenting topics at different levels of granularity. We compare two segmentation models: (1) an unsupervised lexical cohesion-based model (LM) using solely lexical cohesion information, and (2) feature-based combined models (CM) that are trained on a combination of lexical cohesion and conversational features.</Paragraph> <Paragraph position="1"> 3.3.1 Lexical cohesion-based model In this study, we use Galley et al.'s (2003) LCSeg algorithm, a variant of TextTiling (Hearst, 1997). LCSeg hypothesizes that a major topic shift is likely to occur where strong term repetitions start and end. The algorithm works with two adjacent analysis windows, each of a fixed size that is empirically determined. For each utterance boundary, LCSeg calculates a lexical cohesion score by computing the cosine similarity at the transition between the two windows. Low similarity indicates low lexical cohesion, and a sharp change in the lexical cohesion score indicates a high probability of an actual topic boundary. The principal difference between LCSeg and TextTiling is that LCSeg measures similarity in terms of lexical chains (i.e., term repetitions), whereas TextTiling computes similarity using word counts.</Paragraph> </Section> <Section position="4" start_page="274" end_page="274" type="sub_section"> <SectionTitle> 3.3.2 Integrating lexical and conversation-based features </SectionTitle> <Paragraph position="0"> We also used machine learning approaches that integrate features into a combined model, casting topic segmentation as a binary classification task. Under this supervised learning scheme, a training set in which each potential topic boundary is labelled as either positive (POS) or negative (NEG) is used to train a classifier to predict whether each unseen example in the test set belongs to the class POS or NEG. (In this study, the end of each speaker turn is a potential segment boundary. If there is a pause of more than 1 second within a single speaker turn, the turn is divided at the beginning of the pause, creating a potential segment boundary.) Our objective here is to determine whether the advantage of integrating lexical and conversational features also holds for automatic topic segmentation at the finer granularity of subtopic levels, as well as when ASR transcriptions are used.</Paragraph> <Paragraph position="1"> For this study, we trained decision trees (C4.5) to learn the best indicators of topic boundaries.</Paragraph> <Paragraph position="2"> We first used features extracted with the optimal window size reported to perform best in Galley et al. (2003) for segmenting meeting transcripts into major topical units.</Paragraph>
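Both the unsupervised LM model of Section 3.3.1 and the lexical cohesion features listed next rest on a window-based cohesion score. The following is a minimal sketch under simplifying assumptions of our own: it uses raw word counts rather than LCSeg's weighted lexical chains, a fixed window size, and a crude depth-score threshold in place of LCSeg's tuned boundary detection, so it is an illustration of the idea rather than the authors' implementation.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of terms."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cohesion_scores(utterances, window=10):
    """Lexical cohesion at each utterance boundary: similarity of the terms in the
    `window` utterances before the boundary to those in the `window` after it.
    `utterances` is a list of token lists; stemming and stopword removal are omitted."""
    scores = []
    for i in range(1, len(utterances)):
        left = Counter(t for u in utterances[max(0, i - window):i] for t in u)
        right = Counter(t for u in utterances[i:i + window] for t in u)
        scores.append(cosine(left, right))
    return scores

def hypothesized_boundaries(scores):
    """Mark boundaries where cohesion drops sharply relative to both neighbours
    (a crude stand-in for LCSeg's depth-score thresholding). Returns indices of
    utterances that start a new hypothesized segment."""
    depths = [(scores[i - 1] - scores[i]) + (scores[i + 1] - scores[i])
              for i in range(1, len(scores) - 1)]
    if not depths:
        return []
    threshold = sum(depths) / len(depths)
    return [i + 1 for i, d in zip(range(1, len(scores) - 1), depths) if d > threshold]
```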
<Paragraph position="3"> In particular, this study uses the following features: (1) lexical cohesion features: the raw lexical cohesion score and the probability of a topic shift indicated by the sharpness of the change in the lexical cohesion score; and (2) conversational features: the number of cue phrases in an analysis window of 5 seconds preceding and following the potential boundary, and other interactional features, including similarity of speaker activity (measured as the change in the probability distribution of the number of words spoken by each speaker) within 5 seconds preceding and following each potential boundary, the amount of overlapping speech within 30 seconds following each potential boundary, and the amount of silence between speaker turns within 30 seconds preceding each potential boundary.</Paragraph> </Section> <Section position="5" start_page="274" end_page="275" type="sub_section"> <SectionTitle> 3.4 Evaluation </SectionTitle> <Paragraph position="0"> To compare to prior work, we perform a 25-fold leave-one-out cross-validation on the set of 25 ICSI meetings that were used in Galley et al. (2003). We repeated the procedure to evaluate the accuracy of the lexical cohesion and combined models on both human and ASR transcriptions. In each evaluation, we trained the automatic segmentation models for two tasks: predicting subtopic boundaries (SUB) and predicting only top-level boundaries (TOP).</Paragraph> <Paragraph position="1"> In order to be able to compare our results directly with previous work, we first report our results using the standard error rate metrics Pk and Wd. Pk (Beeferman et al., 1999) is the probability that two utterances drawn randomly from a document (in our case, a meeting transcript) are incorrectly identified as belonging to the same topic segment. WindowDiff (Wd) (Pevzner and Hearst, 2002) calculates the error rate by moving a sliding window across the meeting transcript and counting the number of times the hypothesized and reference segment boundaries differ.</Paragraph> <Paragraph position="2"> To compute a baseline, we follow Kan (2003) and Hearst (1997) in using Monte Carlo simulated segments. For the corpus used as training data in the experiments, the probability of a potential topic boundary being an actual one is approximately 2.2% for all subtopic segments, and 0.69% for top-level topic segments. Therefore, the Monte Carlo simulation algorithm predicts that a speaker turn is a segment boundary with these probabilities for the two different segmentation tasks. We executed the algorithm 10,000 times on each meeting and averaged the scores to form the baseline for our experiments.</Paragraph> <Paragraph position="3"> For the 24 meetings that were used in training, we have top-level topic boundaries annotated by coders at Columbia University (Col) and in our lab at Edinburgh (Edi). We take the majority opinion on each segment boundary from the Col annotators as reference segments. For the Edi annotations of top-level topic segments, where multiple annotations exist, we choose one randomly. The topline is then computed as the Pk score comparing the Col majority annotation to the Edi annotation.</Paragraph>
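For reference, the two error metrics can be sketched in a few lines. This is a schematic implementation under our own conventions (a segmentation is a list of segment IDs, one per potential boundary unit such as a speaker turn, with a distinct ID per segment); exact window-size handling differs slightly across published implementations, so the numbers it produces are not guaranteed to match the evaluation code used in the paper.

```python
def _window_size(reference):
    """Half the average reference segment length, following Beeferman et al. (1999)."""
    num_segments = len(set(reference))
    return max(1, round(len(reference) / (2 * num_segments)))

def pk(reference, hypothesis, k=None):
    """Probability that two units k apart are wrongly judged to be in the
    same/different segments. reference/hypothesis: lists of segment IDs."""
    k = k or _window_size(reference)
    trials = len(reference) - k
    errors = sum(
        (reference[i] == reference[i + k]) != (hypothesis[i] == hypothesis[i + k])
        for i in range(trials)
    )
    return errors / trials

def windowdiff(reference, hypothesis, k=None):
    """Pevzner and Hearst (2002): penalize windows in which the number of
    boundaries differs between reference and hypothesis."""
    k = k or _window_size(reference)
    ref_b = [int(reference[i] != reference[i + 1]) for i in range(len(reference) - 1)]
    hyp_b = [int(hypothesis[i] != hypothesis[i + 1]) for i in range(len(hypothesis) - 1)]
    trials = len(ref_b) - k + 1
    errors = sum(sum(ref_b[i:i + k]) != sum(hyp_b[i:i + k]) for i in range(trials))
    return errors / trials

# A Monte Carlo baseline in the spirit of Section 3.4 would predict a boundary at
# each speaker turn with the empirical prior (e.g., 0.022 for subtopics), repeat
# the simulation many times, and average the resulting Pk scores.
```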
</Section> </Section> <Section position="5" start_page="275" end_page="277" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="275" end_page="277" type="sub_section"> <SectionTitle> 4.1 Experiment 1: Predicting top-level and subtopic segment boundaries </SectionTitle> <Paragraph position="0"> The meetings in the ICSI corpus last approximately 1 hour and have an average of 8-10 top-level topic segments. In order to facilitate meeting browsing and question-answering, we believe it is useful to include subtopic boundaries in order to home in more accurately on the portion of the meeting that contains the information the user needs. Therefore, we performed experiments aimed at analysing how the LM and CM segmentation models behave in predicting segment boundaries at the two different levels of granularity. All of the results are reported on the test set.</Paragraph> <Paragraph position="1"> Table 1 shows the performance of the lexical cohesion model (LM) and the combined model (CM) integrating the lexical cohesion and conversational features discussed in Section 3.3.2. (We do not report Wd scores for the combined model (CM) on ASR output because this model predicted 0 segment boundaries when operating on ASR output. In our experience, CM routinely underpredicted the number of segment boundaries, and due to the nature of the Wd metric, it should not be used when there are 0 hypothesized topic boundaries.) For the task of predicting top-level topic boundaries from human transcripts, CM outperforms LM. LM tends to over-predict at the top level, resulting in a higher false alarm rate. However, for the task of predicting subtopic shifts, LM alone is considerably better than CM.</Paragraph> <Paragraph position="2"> In order to support browsing during the meeting or shortly thereafter, automatic topic segmentation will have to operate on the transcriptions produced by ASR. First note from Table 1 that the preference of models for segmentation at the two different levels of granularity is the same for ASR and human transcriptions. CM is better for predicting top-level boundaries and LM is better for predicting subtopic boundaries. This suggests that these are two distinct tasks, regardless of whether the system operates on human-produced transcription or ASR output. Subtopics are better characterized by lexical cohesion, whereas top-level topic shifts are signalled by conversational features as well as lexical cohesion-based features.</Paragraph> <Paragraph position="3"> Predicting from human transcripts. Next, we wish to determine which features in the combined model are most effective for predicting topic segments at the two levels of granularity. Table 2 gives the average Pk for all 25 meetings in the test set, using the features described in Section 3.3.2. We group the features into four classes: (1) lexical cohesion-based features (LF), including the lexical cohesion value (LCV) and the estimated posterior probability (LCP); (2) interaction features (IF): the amount of overlapping speech (OVR), the amount of silence between speaker segments (GAP), and similarity of speaker activity (ACT); (3) the cue phrase feature (CUE); and (4) all available features (ALL). For comparison we also report the baseline (see Section 3.4.2) generated by the Monte Carlo algorithm (MC-B). All of the models using one or more features from these classes outperform the baseline model.
A one-way ANOVA revealed this effect to be reliable for the top-level segmentation task. [Table 2: average Pk of models using different feature classes for predicting topic boundaries from human transcripts; MC-B is the randomly generated baseline.] As shown in Table 2, the best performing model for predicting top-level segments is the one using all of the features (ALL). This is not surprising, because these were the features that Galley et al. (2003) found to be most effective for predicting top-level segment boundaries in their combined model. Looking at the results in more detail, we see that when we begin with LF features alone and add other features one by one, the only model (other than ALL) that achieves a significant improvement (p < 0.05) over LF is LF+CUE, the model that combines lexical cohesion features with cue phrases.</Paragraph> <Paragraph position="4"> When we look at the results for predicting subtopic boundaries, we again see that the best performing model is the one using all features (ALL). Models using lexical cohesion features alone (LF) and lexical cohesion features with cue phrases (LF+CUE) both yield significantly better results than using interactional features (IF) alone (p < 0.01), or using them with cue phrase features (IF+CUE) (p < 0.01). Again, none of the interactional features used in combination with LF significantly improves performance. Indeed, adding speaker activity change (LF+ACT) degrades performance (p < 0.05).</Paragraph> <Paragraph position="5"> Therefore, we conclude that for predicting both top-level and subtopic boundaries from human transcriptions, the most important features are the lexical cohesion-based features (LF), followed by cue phrases (CUE), with interactional features contributing to improved performance only when used in combination with LF and CUE.</Paragraph> <Paragraph position="6"> However, a closer look at the Pk scores in Table 2 adds further evidence to our hypothesis that predicting subtopics may be a different task from predicting top-level topics. Subtopic shifts occur more frequently, and often without clear conversational cues. This is suggested by the fact that absolute performance on subtopic prediction degrades when any of the interactional features are combined with the lexical cohesion features.</Paragraph> <Paragraph position="7"> In contrast, the interactional features slightly improve performance when predicting top-level segments. Moreover, the fact that the feature OVR has a positive impact on the model for predicting top-level topic boundaries, but does not improve the model for predicting subtopic boundaries, reveals that reduced overlapping speech is a more prominent phenomenon in major topic shifts than in subtopic shifts.</Paragraph> <Paragraph position="8"> Predicting from ASR output. Features extracted from ASR transcripts are distinct from those extracted from human transcripts in at least three ways: (1) incorrectly recognized words incur erroneous lexical cohesion features (LF), (2) incorrectly recognized words incur erroneous cue phrase features (CUE), and (3) the ASR system recognizes less overlapping speech (OVR).</Paragraph> <Paragraph position="9"> In contrast to the finding that integrating conversational features with lexical cohesion features is useful for prediction from human transcripts, Table 3 shows that when operating on ASR output, neither adding interactional nor cue phrase features improves the performance of the model using only lexical cohesion features.
In fact, the model using all features (ALL) is significantly worse than the model using only lexical cohesion-based features (LF). This suggests that we must explore new features that can lessen the perplexity introduced by ASR output in order to train a better model.</Paragraph> <Paragraph position="10"> [Table 3: results for predicting topic boundaries from ASR output.]</Paragraph> </Section> <Section position="2" start_page="277" end_page="277" type="sub_section"> <SectionTitle> 4.2 Experiment 2: Statistically learned cue phrases </SectionTitle> <Paragraph position="0"> In prior work, Galley et al. (2003) empirically identified cue phrases that are indicators of segment boundaries, and then eliminated all cues that had not previously been identified as cue phrases in the literature. Here, we conduct an experiment to explore how different ways of identifying cue phrases can help identify useful new features for the two boundary prediction tasks.</Paragraph> <Paragraph position="1"> In each fold of the 25-fold leave-one-out cross-validation, we use a modified Chi-square test to calculate statistics for each word (unigram) and word pair (bigram) that occurred in the 24 training meetings. (In order to satisfy the mathematical assumptions underlying the test, we removed cases with an expected value under a threshold (in this study, we use 1), and we apply Yates' correction: (|Observed Value - Expected Value| - 0.5)^2 / Expected Value.) We then rank unigrams and bigrams according to their Chi-square scores, filtering out those with values under 6.64, the threshold for the Chi-square statistic at the 0.01 significance level.</Paragraph> <Paragraph position="2"> The unigrams and bigrams in this ranked list are the learned cue phrases. We then use the occurrence counts of cue phrases in an analysis window around each potential topic boundary in the test meeting as a feature.</Paragraph> <Paragraph position="3"> Table 4 shows the performance of models that use statistically learned cue phrases in their feature sets, compared with models using no cue phrase features and Galley's model, which only uses cue phrases that correspond to those identified in the literature (Col-cue). We see that for predicting subtopics, models using the cue word features (1gram) and the combination of cue words and bigrams (1+2gram) yield a 15% and an 8.24% improvement, respectively, over models using no cue features (NOCUE) (p < 0.01), while models using only cue phrases found in the literature (Col-cue) improve performance by just 3.18%. In contrast, for predicting top-level topics, the model using cue phrases from the literature (Col-cue) achieves a 4.2% improvement, and this is the only model that produces statistically significantly better results than the model using no cue phrases (NOCUE).</Paragraph> <Paragraph position="4"> The superior performance of models using statistically learned cue phrases as features for predicting subtopic boundaries suggests there may exist a different set of cue phrases that serve as segmentation cues for subtopic boundaries.</Paragraph> </Section> </Section>
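As an illustration of the cue-phrase selection procedure in Section 4.2, the sketch below ranks candidate n-grams by a Chi-square score computed from their counts in windows around reference boundaries versus elsewhere. The paper specifies the expected-count filter, the Yates correction, and the 6.64 cut-off, but not the exact contingency layout or windowing, so the way counts are split into boundary and non-boundary occurrences here is our assumption.

```python
from collections import Counter

def yates_chi_square(observed, expected, min_expected=1.0):
    """Per-cell Chi-square contribution with Yates' continuity correction,
    dropping cells whose expected count falls below the threshold (here 1)."""
    if expected < min_expected:
        return 0.0
    return (abs(observed - expected) - 0.5) ** 2 / expected

def rank_cue_candidates(boundary_ngrams, other_ngrams, threshold=6.64):
    """Rank candidate cue n-grams by how unevenly they are distributed between
    windows around reference boundaries and the rest of the training meetings.
    boundary_ngrams / other_ngrams are Counters of n-gram occurrences."""
    n_boundary = sum(boundary_ngrams.values())
    n_other = sum(other_ngrams.values())
    total = n_boundary + n_other
    if total == 0:
        return []
    kept = []
    for ngram in set(boundary_ngrams) | set(other_ngrams):
        obs_b, obs_o = boundary_ngrams[ngram], other_ngrams[ngram]
        row = obs_b + obs_o
        # Expected counts if the n-gram were spread proportionally to window mass.
        exp_b = row * n_boundary / total
        exp_o = row * n_other / total
        score = yates_chi_square(obs_b, exp_b) + yates_chi_square(obs_o, exp_o)
        if score > threshold:  # 6.64 ~ Chi-square cut-off at the 0.01 level
            kept.append((score, ngram))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [ngram for _, ngram in kept]
```

The returned list would then supply the cue-phrase counts used as a feature around each potential boundary in the held-out test meeting.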
<Section position="6" start_page="277" end_page="278" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> As observed in the corpus of meetings, the lack of macro-level segment units (e.g., story breaks, paragraph breaks) makes the task of segmenting spontaneous multiparty dialogue, such as meetings, different from segmenting text or broadcast news. Compared to the task of segmenting expository texts reported in Hearst (1997), with a 39.1% chance of each paragraph end being a target topic boundary, the chance of each speaker turn being a top-level or subtopic boundary in our ICSI corpus is just 0.69% and 2.2% respectively. The imbalanced class distribution has a negative effect on the performance of machine learning approaches. In a pilot study, we investigated sampling techniques that rebalance the class distribution in the training set. We found that sampling techniques previously reported in Liu et al. (2004) as useful for dealing with an imbalanced class distribution in the tasks of disfluency detection and sentence segmentation do not work for this particular data set. The implicit assumption of some classifiers (such as pruned decision trees) that the class distribution of the test set matches that of the training set, and that the costs of false positives and false negatives are equivalent, may account for the failure of these sampling techniques to yield improvements in performance, when measured using Pk and Wd.</Paragraph> <Paragraph position="1"> Another approach that copes with the imbalanced class prediction problem, but does not change the natural class distribution, is to increase the size of the training set. We conducted an experiment in which we incrementally increased the training set size by randomly choosing ten meetings each time until all meetings were selected.</Paragraph> <Paragraph position="2"> We executed the process three times and averaged the scores to obtain the results shown in Figure 1. [Figure 1: scores over the increase of the training set size.]</Paragraph> <Paragraph position="3"> However, increasing the training set size adds to the perplexity in the training phase. We see that increasing the size of the training set only improves the accuracy of segment boundary prediction for predicting top-level topics on ASR output. The figure also indicates that training a model to predict top-level boundaries requires no more than fifteen meetings in the training set to reach a reasonable level of performance.</Paragraph> </Section>
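A schematic of the incremental training-set experiment just described, assuming a caller-supplied `train_and_score` function that trains the combined model on the given meetings and returns its error on the held-out test meetings; that function, and the exact randomization used, are not specified in the paper.

```python
import random

def learning_curve(meetings, train_and_score, step=10, repeats=3, seed=0):
    """Average score as the training set grows by `step` randomly chosen
    meetings per increment, repeated `repeats` times and averaged."""
    rng = random.Random(seed)
    sizes = list(range(step, len(meetings) + 1, step))
    if sizes and sizes[-1] != len(meetings):
        sizes.append(len(meetings))  # make sure the full training set is included
    curve = {size: [] for size in sizes}
    for _ in range(repeats):
        order = list(meetings)
        rng.shuffle(order)  # fresh random ordering for each repetition
        for size in sizes:
            curve[size].append(train_and_score(order[:size]))
    return {size: sum(scores) / len(scores) for size, scores in curve.items()}
```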
<Section position="7" start_page="278" end_page="279" type="metho"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> Discovering major topic shifts and finding nested subtopics are essential for the success of speech document browsing and retrieval. Meeting records contain rich information, in both their content and their conversational behaviour, that enables automatic topic segmentation at different levels of granularity. The current study demonstrates that the two tasks - predicting top-level and subtopic boundaries - are distinct in many ways: (1) for predicting subtopic boundaries, the lexical cohesion-based approach achieves results that are competitive with the machine learning approach that combines lexical and conversational features; (2) for predicting top-level boundaries, the machine learning approach performs best; and (3) many conversational cues, such as overlapping speech and the cue phrases discussed in the literature, are better indicators of top-level topic shifts than of subtopic shifts, but new features such as cue phrases can be learned statistically for the subtopic prediction task. Even in the presence of a relatively high word error rate, using ASR output makes no difference to which model is preferred for the two tasks. The conversational features also did not help improve performance when predicting from ASR output.</Paragraph> <Paragraph position="1"> In order to further identify useful features for automatic segmentation of meetings at different levels of granularity, we will explore the use of multimodal, i.e., acoustic and visual, cues. In addition, in the current study we only extracted features from within the analysis windows immediately preceding and following each potential topic boundary; we will explore models that take into account features with longer-range dependencies.</Paragraph> </Section> <Section position="8" start_page="279" end_page="279" type="metho"> <SectionTitle> 7 Acknowledgements </SectionTitle> <Paragraph position="0"> Many thanks to Jean Carletta for her invaluable help in managing the data, and for advice and comments on the work reported in this paper.</Paragraph> <Paragraph position="1"> Thanks also to the AMI ASR group for producing the ASR transcriptions, and to the anonymous reviewers for their helpful comments. This work was supported by the European Union 6th FWP</Paragraph> </Section> </Paper>