<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2012"> <Title>DETECTION OF AGREEMENT vs. DISAGREEMENT IN MEETINGS: TRAINING WITH UNLABELED DATA</Title>
<Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> DETECTION OF AGREEMENT vs. DISAGREEMENT IN MEETINGS: TRAINING WITH UNLABELED DATA </SectionTitle> <Paragraph position="0"/> </Section>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle>
<Paragraph position="0"> To support summarization of automatically transcribed meetings, we introduce a classifier to recognize agreement or disagreement utterances, utilizing both word-based and prosodic cues. We show that hand-labeling efforts can be minimized by using unsupervised training on a large unlabeled data set combined with supervised training on a small amount of data.</Paragraph>
<Paragraph position="1"> For ASR transcripts with over 45% WER, the system recovers nearly 80% of agree/disagree utterances with a confusion rate of only 3%.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Meetings are an integral component of life in most organizations, and records of meetings are important for helping people recall (or learn for the first time) what took place in a meeting. Audio (or audio-visual) recordings of meetings offer a complete record of the interactions, but listening to the complete recording is impractical. To facilitate browsing and summarization of meeting recordings, it is useful to automatically annotate topic and participant interaction characteristics. Here, we focus on interactions, specifically identifying agreement and disagreement. These categories are particularly important for identifying decisions in meetings and inferring whether the decisions are controversial, which can be useful for automatic summarization. In addition, detecting agreement is important for associating action items with meeting participants and for understanding social dynamics. In this study, we address detection using both prosodic and language cues, contrasting results for hand-transcribed and automatically transcribed data.</Paragraph>
<Paragraph position="1"> The agreement/disagreement labels can be thought of as a sort of speech act categorization. Automatic classification of speech acts has been the subject of several studies. Our work builds on that of Shriberg et al. (1998), who showed that prosodic features are useful for classifying speech acts and lead to increased accuracy when combined with word-based cues. Other studies look at prediction of speech acts primarily from word-based cues, using language models or syntactic structure and discourse history (Chu-Carroll, 1998; Reithinger and Klesen, 1997).</Paragraph>
<Paragraph position="2"> Our work is informed by these studies, but departs significantly by exploring unsupervised training techniques.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Approach </SectionTitle>
<Paragraph position="0"> Our experiments are based on a subset of meeting recordings collected and transcribed by ICSI (Morgan et al., 2001). Seven meetings were segmented (automatically, but with human adjustment) into 9854 total spurts. We define a 'spurt' as a period of speech by one speaker that has no pauses greater than one half second (Shriberg et al., 2001).
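The half-second pause rule lends itself to a direct procedural statement. The following minimal sketch illustrates it, assuming each word arrives as a time-stamped (start, end, token) tuple for a single speaker in time order; the function name and data layout are illustrative, not from the paper.

```python
MAX_PAUSE = 0.5  # a spurt ends at any pause longer than half a second

def segment_spurts(words):
    """Group one speaker's time-ordered words into spurts."""
    spurts, current = [], []
    for start, end, token in words:
        # Compare this word's start time against the previous word's end time.
        if current and start - current[-1][1] > MAX_PAUSE:
            spurts.append(current)  # pause exceeded: close the current spurt
            current = []
        current.append((start, end, token))
    if current:
        spurts.append(current)
    return spurts
```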
Spurts are used here, rather than sentences, because our goal is to use ASR outputs and unsupervised training paradigms, where hand-labeled sentence segmentations are not available.</Paragraph>
<Paragraph position="1"> We define four categories: positive, backchannel, negative, and other. Frequent single-word spurts (specifically, yeah, right, yep, uh-huh, and ok) are separated out from the 'positive' category as backchannels because of the trivial nature of their detection and because they may reflect encouragement for the speaker to continue more than actual agreement. Examples include:
Neg (6%): &quot;This doesn't answer the question.&quot;
Pos (9%): &quot;Yeah, that sounds great.&quot;
Back (23%): &quot;Uh-huh.&quot;
Other (62%): &quot;Let's move on to the next topic.&quot;
The first 450 spurts in each of four meetings were hand-labeled with these four categories based on listening to the speech while viewing transcripts (so a sarcastic &quot;yeah, right&quot; is labeled as a disagreement despite the positive wording). Comparing tags on 250 spurts from two labelers produced a kappa coefficient (Siegel and Castellan, 1988) of .6, which is generally considered acceptable.</Paragraph>
<Paragraph position="2"> Additionally, unlabeled spurts from six hand-transcribed training meetings are used in unsupervised training experiments, as described later. The total number of automatically labeled spurts (8094) is about five times the amount of hand-labeled data.</Paragraph>
<Paragraph position="3"> For system development and as a control, we use hand transcripts in learning word-based cues and in training. We then evaluate the model with both hand-transcribed words and ASR output. The category labels from the hand transcriptions are mapped to the ASR transcripts, assigning an ASR spurt to a hand-labeled reference if more than half (by time) of the ASR spurt overlaps the reference spurt.</Paragraph>
<Paragraph position="4"> Feature Extraction. The features used in classification include heuristic word types and counts, word-based features derived from n-gram scores, and prosodic features. Simple word-based features include: the total number of words in a spurt, the number of &quot;positive&quot; and &quot;negative&quot; keywords, and the class (positive, negative, backchannel, discourse marker, other) of the first word, based on the keywords. The keywords were chosen based on an &quot;effectiveness ratio,&quot; defined as the frequency of a word (or word pair) in the desired class divided by its frequency over all dissimilar classes combined. A minimum of five occurrences was required, and all candidates with a ratio greater than .6 were then selected as keywords.</Paragraph>
<Paragraph position="5"> Other word-based features are found by computing the perplexity (derived from the average log probability) of the sequence of words in a spurt using a bigram language model (LM) for each of the four classes. The perplexity indicates the goodness of fit of a spurt to each class. We used both word and class LMs (with part-of-speech classes for all words except keywords). In addition, the word-based LM is used to score the first two words of the spurt, which often contain the most information about agreement and disagreement. The label of the most likely class for each type of LM is a categorical feature, and we also compute the posterior probability for each class.</Paragraph>
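To make the LM features concrete, the sketch below scores a spurt against each class's bigram model and derives the perplexity, posterior, and most-likely-class features. The dictionary-based LM, probability floor, and uniform class prior are assumptions of the sketch, not details taken from the paper.

```python
import math

def spurt_log_prob(tokens, bigram_probs, floor=1e-6):
    """Total log probability of a spurt under one class's bigram LM."""
    logp = 0.0
    # Pair each token with its predecessor, starting from a sentence-start tag.
    for prev, cur in zip(["<s>"] + tokens, tokens):
        logp += math.log(bigram_probs.get((prev, cur), floor))
    return logp

def lm_features(tokens, class_lms):
    """Perplexity, posterior, and best-class features for one spurt."""
    logps = {c: spurt_log_prob(tokens, lm) for c, lm in class_lms.items()}
    n = max(len(tokens), 1)
    perplexity = {c: math.exp(-lp / n) for c, lp in logps.items()}
    m = max(logps.values())  # shift by the max log prob for numerical stability
    shifted = {c: math.exp(lp - m) for c, lp in logps.items()}
    total = sum(shifted.values())
    posterior = {c: v / total for c, v in shifted.items()}  # uniform class prior
    best = max(logps, key=logps.get)  # categorical feature: most likely class
    return perplexity, posterior, best
```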
<Paragraph position="6"> Prosodic features include pause, fundamental frequency (F0), and duration (Baron et al., 2002). Features are derived for the first word alone and for the entire spurt. Average, maximum, and initial pause duration features are used. The F0 average and maximum features are computed using several methods for normalizing F0 relative to a speaker-dependent baseline, mean, and maximum.</Paragraph>
<Paragraph position="7"> For duration, the average and maximum vowel duration from a forced alignment are used, both unnormalized and normalized for vowel identity and phone context. Spurt length in terms of number of words is also used.</Paragraph>
<Paragraph position="8"> Classifier design and feature selection. The overall approach to classifying spurts uses a decision tree classifier (Breiman et al., 1984) to combine the word-based and prosodic cues. To facilitate learning of cues for the less frequent classes, the data were upsampled (duplicated) so that each class had the same number of training points. The decision tree size was determined using error-based cost-complexity pruning with 4-fold cross-validation. To reduce our initial candidate feature set, we used an iterative feature selection algorithm that involved running multiple decision trees (Shriberg et al., 2000). The algorithm combines elements of brute-force search (in a leave-one-out paradigm) with previously determined heuristics for narrowing the search space. We used the entropy reduction of the tree after cross-validation as the criterion for selecting the best subtree.</Paragraph>
<Paragraph position="9"> Unsupervised training. To train the models with as much data as possible, we used an unsupervised clustering strategy for incorporating unlabeled data. Four bigram models, one for each class, were initialized by dividing the hand-transcribed training data into the four classes based on keywords. First, all spurts that contain negative keywords are assigned to the negative class. Backchannels are then pulled out: a spurt is labeled a backchannel if it consists of a single word from the backchannel list. Next, spurts are selected as agreements if they contain positive keywords. Finally, the remaining spurts are assigned to the &quot;other&quot; class.</Paragraph>
<Paragraph position="10"> The keyword separation gives an initial grouping; further regrouping involves unsupervised clustering using a maximum likelihood criterion. A preliminary language model is trained for each of the initial groups. Then, by evaluating each spurt in the corpus against each of the four language models, new groups are formed by associating each spurt with the language model that produces the lowest perplexity. New language models are then trained on the reorganized groups, and the process is iterated until there is no movement between groups. The final class assignments are used as &quot;truth&quot; for unsupervised training of language and prosodic models, as well as contributing features to decision trees.</Paragraph>
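This grouping loop amounts to hard clustering with per-class LMs. A minimal sketch appears below, assuming hypothetical train_lm and perplexity helpers in place of the paper's LM training and scoring, and with abbreviated, illustrative keyword lists.

```python
NEGATIVE_KEYWORDS = {"no", "nope"}           # abbreviated, illustrative lists,
POSITIVE_KEYWORDS = {"yeah", "right", "ok"}  # not the paper's actual keywords
BACKCHANNELS = {"yeah", "right", "yep", "uh-huh", "ok"}

def initial_class(tokens):
    """Keyword-based initial grouping, applied in the order described above."""
    if any(t in NEGATIVE_KEYWORDS for t in tokens):
        return "negative"
    if len(tokens) == 1 and tokens[0] in BACKCHANNELS:
        return "backchannel"
    if any(t in POSITIVE_KEYWORDS for t in tokens):
        return "positive"
    return "other"

def cluster_spurts(spurts, train_lm, perplexity):
    """Retrain per-class LMs and reassign each spurt to its
    lowest-perplexity class until no spurt changes group."""
    labels = [initial_class(s) for s in spurts]
    while True:
        lms = {c: train_lm([s for s, l in zip(spurts, labels) if l == c])
               for c in set(labels)}
        new_labels = [min(lms, key=lambda c: perplexity(s, lms[c]))
                      for s in spurts]
        if new_labels == labels:  # no movement between groups: converged
            return labels
        labels = new_labels
```
</Section> </Paper>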