<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1085">
  <Title>Identifying Agreement and Disagreement in Conversational Speech: Use of Bayesian Networks to Model Pragmatic Dependencies</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Adjacency Pairs
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> Adjacency pairs (AP) are considered fundamental units of conversational organization (Schegloff and Sacks, 1973). Their identification is central to our problem, since we need to know the identity of addressees in agreements and disagreements, and adjacency pairs provide a means of acquiring this knowledge. An adjacency pair is said to consist of two parts (later referred to as A and B) that are ordered, adjacent, and produced by different speakers.</Paragraph>
      <Paragraph position="1"> The first part makes the second one immediately relevant, as a question does with an answer, or an offer does with an acceptance. Extensive work in conversational analysis uses a less restrictive definition of adjacency pair that does not impose any actual adjacency requirement; this requirement is problematic in many respects (Levinson, 1983). Even when APs are not directly adjacent, the same constraints between pairs and mechanisms for selecting the next speaker remain in place (e.g. the case of embedded question and answer pairs). This relaxation on a strict adjacency requirement is particularly important in interactions of multiple speakers since other speakers have more opportunities to insert utterances between the two elements of the AP construction (e.g. interrupted, abandoned or ignored utterances; backchannels; APs with multiple second elements, e.g. a question followed by answers of multiple speakers).2 Information provided by adjacency pairs can be used to identify the target of an agreeing or disagreeing utterance. We define the problem of AP  contiguous parts is about 21%.</Paragraph>
      <Paragraph position="2"> identification as follows: given the second element (B) of an adjacency pair, determine who is the speaker of the first element (A). A quite effective baseline algorithm is to select as speaker of utterance A the most recent speaker before the occurrence of utterance B. This strategy selects the right speaker in 79.8% of the cases in the 50 meetings that were annotated with adjacency pairs. The next sub-section describes the machine learning framework used to significantly outperform this already quite effective baseline algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Maximum Entropy Ranking
</SectionTitle>
      <Paragraph position="0"> We view the problem as an instance of statistical ranking, a general machine learning paradigm used for example in statistical parsing (Collins, 2000) and question answering (Ravichandran et al., 2003).3 The problem is to select, given a set of a0 possible candidates a1a3a2a5a4a7a6a7a8a9a8a9a8a9a6a10a2a12a11a14a13 (in our case, potential A speakers), the one candidate a2a16a15 that maximizes a given conditional probability distribution.</Paragraph>
      <Paragraph position="1"> We use maximum entropy modeling (Berger et al., 1996) to directly model the conditional probability a17a19a18a20a2a21a15a23a22a24a26a25 , where each a27a5a15 in a24a29a28a30a18a31a27a32a4a33a6a7a8a9a8a9a8a9a6a23a27a34a11a14a25 is an observation associated with the corresponding speaker a2 a15 . a27 a15 is represented here by only one variable for notational ease, but it possibly represents several lexical, durational, structural, and acoustic observations. Given a35 feature functions a36a12a37a38a18a31a24a39a6a10a2a33a15a40a25 and a35 model parameters a41a42a28a43a18a45a44a46a4a33a6a7a8a9a8a9a8a9a6a47a44a49a48a50a25 , the probability of the maximum entropy model is defined as:</Paragraph>
      <Paragraph position="3"> The only role of the denominator a57 a18a31a24a54a25 is to ensure that a17a49a51 is a proper probability distribution. It is defined as:</Paragraph>
      <Paragraph position="5"> To find the most probable speaker of part A, we use the following decision rule:</Paragraph>
      <Paragraph position="7"> Note that we have also attempted to model the problem as a binary classification problem where  candidates are assigned an initial rank beforehand. each speaker is either classified as speaker A or not, but we abandoned that approach, since it gives much worse performance. This finding is consistent with previous work (Ravichandran et al., 2003) that compares maximum entropy classification and re-ranking on a question answering task.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Features
</SectionTitle>
      <Paragraph position="0"> We will now describe the features used to train the maximum entropy model mentioned previously. To rank all speakers (aside from the B speaker) and to determine how likely each one is to be the A speaker of the adjacency pair involving speaker B, we use four categories of features: structural, durational, lexical, and dialog act (DA) information. For the remainder of this section, we will interchangeably use A to designate either the potential A speaker or the most recent utterance4 of that speaker, assuming the distinction is generally unambiguous. We use B to designate either the B speaker or the current spurt for which we need to identify a corresponding A part.</Paragraph>
      <Paragraph position="1"> The feature sets are listed in Table 1. Structural features encode some helpful information regarding ordering and overlap of spurts. Note that with only the first feature listed in the table, the maximum entropy ranker matches exactly the performance of the baseline algorithm (79.8% accuracy). Regarding lexical features, we used a count-based feature selection algorithm to remove many first-word and last-word features that occur infrequently and that are typically uninformative for the task at hand. Remaining features essentially contained function words, in particular sentence-initial indicators of questions (&amp;quot;where&amp;quot;, &amp;quot;when&amp;quot;, and so on).</Paragraph>
      <Paragraph position="2"> Note that all features in Table 1 are &amp;quot;backwardlooking&amp;quot;, in the sense that they result from an analysis of context preceding B. For many of them, we built equivalent &amp;quot;forward-looking&amp;quot; features that pertain to the closest utterance of the potential speaker A that follows part B. The motivation for extracting these features is that speaker A is generally expected to react if he or she is addressed, and thus, to take the floor soon after B is produced.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.4 Results
</SectionTitle>
    <Paragraph position="0"> We used the labeled adjacency pairs of 50 meetings and selected 80% of the pairs for training. To train the maximum entropy ranking model, we used the generalized iterative scaling algorithm (Darroch and Ratcliff, 1972) as implemented in YASMET.5  ranker on the test data with different feature sets: the performance is 89.39% when using all feature sets, and reaches 90.2% after applying Gaussian smoothing and using incremental feature selection as described in (Berger et al., 1996) and implemented in the yasmetFS package.6 Note that restricting ourselves to only backward looking features decreases the performance significantly, as we can see in Table 2.</Paragraph>
    <Paragraph position="1"> We also wanted to determine if information about  dialog acts (DA) helps the ranking task. If we hypothesize that only a limited set of paired DAs (e.g. offer-accept, question-answer, and apologydownplay) can be realized as adjacency pairs, then knowing the DA category of the B part and of all potential A parts should help in finding the most meaningful dialog act tag among all potential A parts; for example, the question-accept pair is admittedly more likely to correspond to an AP than e.g. backchannel-accept. We used the DA annotation that we also had available, and used the DA tag sequence of part A and B as a feature.7 When we add the DA feature set, the accuracy reaches 91.34%, which is only slightly better than our 90.20% accuracy, which indicates that lexical, durational, and structural features capture most of the informativeness provided by DAs. This improved accuracy with DA information should of course not be considered as the actual accuracy of our system, since DA information is difficult to acquire automatically (Stolcke et al., 2000).</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Agreements and Disagreements
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Overview
</SectionTitle>
      <Paragraph position="0"> This section focusses on the use of contextual information, in particular the influence of previous agreements and disagreements and detected adjacency pairs, to improve the classification of agreements and disagreements. We first define the classification problem, then describe non-contextual features, provide some empirical evidence justifying our choice of contextual features, and finally evaluate the classifier.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Agreement/Disagreement Classification
</SectionTitle>
      <Paragraph position="0"> We need to first introduce some notational conventions and define the classification problem with the agreement/disagreement tagset. In our classification problem, each spurt a0a33a15 among the a1 spurts of a meeting must be assigned a tag a0a21a15a3a2 a1 AGREEa6 DISAGREEa6 BACKCHANNELa6 OTHERa13 .</Paragraph>
      <Paragraph position="1"> To specify the speaker of the spurt (e.g. speaker B), the notation will sometimes be augmented to incorporate speaker information, as with a0 a4</Paragraph>
      <Paragraph position="3"> to designate the addressee of B (e.g. listener A), we will use the notation a0 a4a6a5a8a7  it obvious that we do not necessarily assume that agreements and disagreements are reflexive 7The annotation of DA is particularly fine-grained with a choice of many optional tags that can be associated with each DA. To deal with this problem, we used various scaled-down versions of the original tagset.</Paragraph>
      <Paragraph position="4"> relations. We define:  is produced by Y and addresses X. This definition will help our multi-party analyses of agreement and disagreement behaviors.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Local Features
</SectionTitle>
      <Paragraph position="0"> Many of the local features described in this subsection are similar in spirit to the ones used in the previous work of (Hillard et al., 2003). We did not use acoustic features, since the main purpose of the current work is to explore the use of contextual information. null Table 3 lists the features that were found most helpful at identifying agreements and disagreements. Regarding lexical features, we selected a list of lexical items we believed are instrumental in the expression of agreements and disagreements: agreement markers, e.g. &amp;quot;yes&amp;quot; and &amp;quot;right&amp;quot;, as listed in (Cohen, 2002), general cue phrases, e.g. &amp;quot;but&amp;quot; and &amp;quot;alright&amp;quot; (Hirschberg and Litman, 1994), and adjectives with positive or negative polarity (Hatzivassiloglou and McKeown, 1997). We incorporated a set of durational features that were described in the literature as good predictors of agreements: utterance length distinguishes agreement from disagreement, the latter tending to be longer since the speaker elaborates more on the reasons and circumstances of her disagreement than for an agreement (Cohen, 2002). Duration is also a good predictor of backchannels, since they tend to be quite short.</Paragraph>
      <Paragraph position="1"> Finally, a fair amount of silence and filled pauses is sometimes an indicator of disagreement, since it is a dispreferred response in most social contexts and can be associated with hesitation (Pomerantz, 1984).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Contextual Features: An Empirical Study
</SectionTitle>
      <Paragraph position="0"> We first performed several empirical analyses in order to determine to what extent contextual information helps in discriminating between agreement and disagreement. By integrating the interpretation of the pragmatic function of an utterance into a wider context, we aim to detect cases of mismatch between a correct pragmatic interpretation and the surface form of the utterance, e.g. the case of weak or &amp;quot;empty&amp;quot; agreement, which has some properties of downright agreement (lexical items of positive polarity), but which is commonly considered to be a disagreement (Pomerantz, 1984).</Paragraph>
      <Paragraph position="1"> While the actual classification problem incorporates four classes, the BACKCHANNEL class is ig-Structural features: a0 is the previous/next spurt of the same speaker? a0 is the previous/next spurt involving the same B speaker?  nored here to make the empirical study easier to interpret. We assume in that study that accurate AP labeling is available, but for the purpose of building and testing a classifier, we use only automatically extracted adjacency pair information. We tested the validity of four pragmatic assumptions:  1. previous tag dependency: a tag a0a33a15 is influenced by its predecessor a0a7a15a1a0 a4 2. same-interactants previous tag depen- null a60 , the most recent tag of the same speaker addressing the same listener; for example, it might be reasonable to assume that if speaker B disagrees with A, B is likely to disagree with A in his or her next speech  addressing A.</Paragraph>
      <Paragraph position="2"> 3. reflexivity: a tag a0 a4a6a5a8a7</Paragraph>
      <Paragraph position="4"> is influenced by the polarity (agreement or disagreement) of what A said last to B.</Paragraph>
      <Paragraph position="5"> 4. transitivity: assuming there is a speaker a2  ample of such an influence is a case where speaker a2 first agrees with a3 , then speaker a4 disagrees with a2 , from which one could possibly conclude that a4 is actually in disagreement with a3 .</Paragraph>
      <Paragraph position="6"> Table 4 presents the results of our empirical evaluation of the first three assumptions. For comparison, the distribution of classes is the following: 18.8% are agreements, 10.6% disagreements, and 70.6% other. The dependencies empirically evaluated in the two last columns are non-local; they create dependencies between spurts separated by an arbitrarily long time span. Such long range dependencies are often undesirable, since the influence of one spurt on the other is often weak or too difficult to capture with our model. Hence, we made a Markov assumption by limiting context to an arbitrarily chosen value a0 . In this analysis subsection and for all classification results presented thereafter, we used a value of a0 a28 a56a6a5 .</Paragraph>
      <Paragraph position="7"> The table yields some interesting results, showing quite significant variations in class distribution when it is conditioned on various types of contextual information. We can see for example, that the proportion of agreements and disagreements (respectively 18.8% and 10.6%) changes to 13.9% and 20.9% respectively when we restrict the counts to spurts that are preceded by a DISAGREE. Similarly, that distribution changes to 21.3% and 7.3% when the previous tag is an AGREE. The variable is even more noticeable between probabilities a17a19a18 a0a21a15a40a25</Paragraph>
      <Paragraph position="9"> cases where a given speaker B disagrees with A, he or she will continue to disagree in the next exchange involving the same speaker and the same listener.</Paragraph>
      <Paragraph position="10"> Similarly with the same probability distribution, a tendency to agree is confirmed in 25% of the cases.</Paragraph>
      <Paragraph position="11"> The results in the last column are quite different from the two preceding ones. While agreements in response to agreements (a17a19a18 AGREEa22AGREEa25 a28 a8 a56a8a7a10a9 ) are slightly less probable than agreements without conditioning on any previous tag (a17a26a18 AGREEa25 a28  a56a12a11a13a11 ), the probability of an agreement produced in response to a disagreement is quite high (with 23.4%), even higher than the proportion of agreements in the entire data (18.8%). This last result would arguably be quite different with more quarrelsome meeting participants.</Paragraph>
      <Paragraph position="12"> Table 5 represents results concerning the fourth pragmatic assumption. While none of the results characterize any strong conditioning of a0a8a14 by a0a70a15 and a0a53a37 , we can nevertheless notice some interesting phenomena. For example, there is a tendency for agreements to be transitive, i.e. if X agrees with A and B agrees with X within a limited segment of speech, then agreement between B and A is confirmed in 22.5% of the cases, while the probability of the agreement class is only 18.8%. The only slightly surprising result appears in the last column of the table, from which we cannot conclude that disagreement with a disagreement is equivalent to agreement. This might be explained by the fact that these sequences of agreement and disagreement do not necessarily concern the same propositional content. null The probability distributions presented here are admittedly dependent on the meeting genre and particularly speaker personalities. Nonetheless, we believe this model can as well be used to capture salient interactional patterns specific to meetings with different social dynamics.</Paragraph>
      <Paragraph position="13"> We will next discuss our choice of a statistical model to classify sequence data that can deal with non-local label dependencies, such as the ones tested in our empirical study.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Sequence Classification with Maximum
Entropy Models
</SectionTitle>
      <Paragraph position="0"> Extensive research has targeted the problem of labeling sequence information to solve a variety of problems in natural language processing. Hidden Markov models (HMM) are widely used and considerably well understood models for sequence labeling. Their drawback is that, as most generative models, they are generally computed to maximize the joint likelihood of the training data. In order to define a probability distribution over the sequences of observation and labels, it is necessary to enumerate all possible sequences of observations.</Paragraph>
      <Paragraph position="1"> Such enumeration is generally prohibitive when the model incorporates many interacting features and long-range dependencies (the reader can find a discussion of the problem in (McCallum et al., 2000)).</Paragraph>
      <Paragraph position="2"> Conditional models address these concerns.</Paragraph>
      <Paragraph position="3"> Conditional Markov models (CMM) (Ratnaparkhi, 1996; Klein and Manning, 2002) have been successfully used in sequence labeling tasks incorporating rich feature sets. In a left-to-right CMM as shown in Figure 1(a), the probability of a sequence of L tags a0 a28a43a18 a0a12a4a33a6a7a8a9a8a9a8a9a6 a0a2a1 a25 is decomposed as:</Paragraph>
      <Paragraph position="5"> a24 a28 a18a31a27a32a4a33a6a7a8a9a8a9a8a9a6a23a27a7a1a46a25 is the vector of observations and each a9 is the index of a spurt. The probability distribution a17a19a18 a0a64a15a10a22a0a47a15a1a0 a4a33a6a23a27a5a15a45a25 associated with each state of the Markov chain only depends on the preceding tag a0a70a15a1a0 a4 and the local observation a27a34a15 . However, in order to incorporate more than one label dependency and, in particular, to take into account the four pragmatic c 1 c 2 c 1 c 2 c 3  Bayesian network. Assuming for example that a8 a88a10a9 a8a12a11a14a13a16a15a88 and a8a18a17 a9 a8 a11a19a13a16a15a17 , there is then a direct dependency between a8 a88 and a8a18a17 , and the probability model becomes</Paragraph>
      <Paragraph position="7"> plifying example; in practice, each label is dependent on a fixed number of other labels.</Paragraph>
      <Paragraph position="8"> contextual dependencies discussed in the previous subsection, we must augment the structure of our model to obtain a more general one. Such a model is shown in Figure 1(b), a Bayesian network model that is well-understood and that has precisely defined semantics.</Paragraph>
      <Paragraph position="9"> To this Bayesian network representation, we apply maximum entropy modeling to define a probability distribution at each node (a0a33a15 ) dependent on the observation variable a27a76a15 and the five contextual tags used in the four pragmatic dependencies.8 For notational simplicity, the contextual tags representing these pragmatic dependencies are represented here as a vector a35 (a0 a15a1a0 a4 ,  Given a35 feature functions a36a21a37a38a18a3a35a39a6a23a27a38a15a53a6 a0a70a15a40a25 (both local and contextual, like previous tag features) and a35 model parameters a41 a28 a18a45a44 a4 a6a7a8a9a8a9a8a9a6a47a44 a48 a25 , the probability of the model is defined as:</Paragraph>
      <Paragraph position="11"> Again, the only role of the denominator a57 a18a31a24a26a25 is to ensure that a17 a51 sums to 1, and need not be computed when searching for the most probable tags. Note that in our case, the structure of the Bayesian network is known and need not be inferred, since AP identification is performed before the actual agreement and disagreement classification. Since tag sequences are known during training, the inference of a model for sequence labels is no more difficult than inferring a model in a non-sequential case.</Paragraph>
      <Paragraph position="12"> We compute the most probable sequence by performing a left-to-right decoding using a beam search. The algorithm is exactly the same as the one described in (Ratnaparkhi, 1996) to find the most probable part-of-speech sequence. We used a large beam of size a0 =100, which is not computationally prohibitive, since the tagset contains only four ele8The transitivity dependency is conditioned on two tags, while all others on only one. These five contextual tags are defaulted to OTHER when dependency spans exceed the threshold of a36 a9a38a37a34a39 .</Paragraph>
      <Paragraph position="13">  ments. Note however that this algorithm can lead to search errors. An alternative would be to use a variant of the Viterbi algorithm, which was successfully used in (McCallum et al., 2000) to decode the most probable sequence in a CMM.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML