<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1022">
  <Title>Addressee Identification in Face-to-Face Meetings</Title>
  <Section position="3" start_page="169" end_page="169" type="metho">
    <SectionTitle>
2 Addressing in face-to-face meetings
</SectionTitle>
    <Paragraph position="0"> When a speaker contributes to the conversation, all those participants who happen to be in perceptual range of this event will have &amp;quot;some sort of participation status relative to it&amp;quot;. The conversational roles that the participants take in a given conversational situation make up the &amp;quot;participation framework&amp;quot; (Goffman, 1981b).</Paragraph>
    <Paragraph position="1"> Goffman (1976) distinguished three basic kinds of hearers: those who overhear, whether their unratified participation is unintentional or encouraged; those who are ratified but not specifically addressed by the speaker (also called unaddressed recipients (Goffman, 1981a)); and those ratified participants who are addressed. Ratified participants are those participants who are allowed to take part in the conversation. Regarding hearers' roles in meetings, we focus only on ratified participants. Therefore, the problem of addressee identification amounts to the problem of distinguishing addressed from unaddressed participants for each dialogue act that speakers perform.</Paragraph>
    <Paragraph position="2"> Goffman (1981a) defined addressees as those &amp;quot;ratified participants () oriented to by the speaker in a manner to suggest that his words are particularly for them, and that some answer is therefore anticipated from them, more so than from the other ratified participants&amp;quot;. According to this, it is the speaker who selects his addressee; the addressee is the one who is expected by the speaker to react to what the speaker says and to whom, therefore, the speaker is giving primary attention in the present act.</Paragraph>
    <Paragraph position="3"> In meeting conversations, a speaker may address his utterance to the whole group of participants present in the meeting, to a particular subgroup of them, or to a single participant. A speaker can also just think aloud or mumble to himself without really addressing anybody (e.g. &amp;quot;What else do I want to say?&amp;quot;, while trying to evoke more details about the issue that he is presenting). We excluded self-addressed speech from our study.</Paragraph>
    <Paragraph position="4"> Addressing behavior is the behavior by which speakers express to whom they are addressing their speech. Whether explicit addressing behavior is called for depends on the course of the conversation, the participants' state of attention and current involvement in the discussion, as well as on what the participants know about each other's roles and knowledge. Using a vocative is the explicit verbal way to address someone. In some cases the speaker identifies the addressee of his speech by looking at the addressee, sometimes accompanied by deictic hand gestures. Addressees can also be designated by the manner of speaking: by whispering, for example, a speaker can select a single individual or a group of people as addressees. Addressees are often designated by the content of what is being said. For example, when making the suggestion &amp;quot;We all have to decide together about the design&amp;quot;, the speaker is addressing the whole group.</Paragraph>
    <Paragraph position="5"> In meetings, people may perform various group actions (termed meeting actions) such as presentations, discussions or monologues (McCowan et al., 2003). The type of group action that meeting participants are engaged in may influence the speaker's addressing behavior. For example, speakers may show different behavior when addressing an individual during a presentation than during a discussion: even if a presenter has his back turned to a participant in the audience, he most probably addresses his speech to the group including that participant, whereas the same behavior during a discussion, in many situations, indicates that that participant is unaddressed.</Paragraph>
    <Paragraph position="6"> In this paper, we focus on speech and gaze aspects of addressing behavior as well as on contextual aspects such as conversational history and meeting actions.</Paragraph>
  </Section>
  <Section position="4" start_page="169" end_page="171" type="metho">
    <SectionTitle>
3 Cues for addressee identification
</SectionTitle>
    <Paragraph position="0"> In this section, we present our motivation for feature selection, referring also to some existing work on the examination of cues that are relevant for addressee identification.</Paragraph>
    <Paragraph position="1"> Adjacency pairs and addressing - Adjacency pairs (APs) are minimal dialogic units that consist of pairs of utterances, the &amp;quot;first pair-part&amp;quot; (or a-part) and the &amp;quot;second pair-part&amp;quot; (or b-part), produced by different speakers. Examples include question-answer or statement-agreement pairs.</Paragraph>
    <Paragraph position="2"> In the exploration of conversational organization, special attention has been given to the a-parts, which are used as one of the basic techniques for selecting a next speaker (Sacks et al., 1974). For addressee identification, the main focus is on b-parts and their addressees. It is to be expected that the a-part provides a useful cue for identification of the addressee of the b-part (Galley et al., 2004). However, it does not imply that the speaker of the a-part is always the addressee of the b-part. For example, A can address a question to B, whereas B's reply to A's question is addressed to the whole group. In this case, the addressee of the b-part includes the speaker of the a-part.</Paragraph>
    <Paragraph position="3"> Dialogue acts and addressing - When designing an utterance, a speaker intends not only to perform a certain communicative act that contributes to a coherent dialogue (in the literature referred to as a dialogue act), but also to perform that act toward particular others. Within a turn, a speaker may perform several dialogue acts, each of which has its own addressee (e.g. I agree with you [agreement; addressed to a previous speaker] but is this what we want [information request; addressed to the group]). Dialogue act types can provide useful information about addressing types, since some types of dialogue acts - such as agreements or disagreements - tend to be addressed to an individual rather than to a group. More information about the addressee of a dialogue act can be induced by combining the dialogue act information with lexical markers that are used as addressee &amp;quot;indicators&amp;quot; (e.g. you, we, everybody, all of you) (Jovanovic and op den Akker, 2004).</Paragraph>
    <Paragraph position="4"> Gaze behavior and addressing - Analyzing dyadic conversations, researchers into social interaction observed that gaze is used for several purposes: to control communication, to provide visual feedback, to communicate emotions and to communicate the nature of relationships (Kendon, 1967; Argyle, 1969).</Paragraph>
    <Paragraph position="5"> Recent studies into multi-party interaction emphasized the relevance of gaze as a means of addressing. Vertegaal (1998) investigated to what extent the focus of visual attention might function as an indicator of the focus of &amp;quot;dialogic attention&amp;quot; in four-participant face-to-face conversations. &amp;quot;Dialogic attention&amp;quot; refers to attention while listening to a person as well as attention while talking to one or more persons. Empirical findings show that when a speaker is addressing an individual, there is a 77% chance that the gazed-at person is the addressee.</Paragraph>
    <Paragraph position="6"> When addressing a triad, speaker gaze seems to be evenly distributed over the listeners when the participants are seated around a table. It is also shown that, on average, a speaker spends significantly more time gazing at an individual when addressing the whole group than at others when addressing a single individual. When addressing an individual, people gaze 1.6 times more while listening (62%) than while speaking (40%). When addressing a triad, the amount of speaker gaze increases significantly, to 59%. Given these estimates, we can expect gaze directional cues to be good indicators for addressee prediction.</Paragraph>
    <Paragraph position="7"> However, these findings cannot be generalized to situations where objects of interest are present in the conversational environment, since the amount of time spent looking at persons is then expected to decrease significantly. As shown in (Bakx et al., 2003), in a situation where a user interacts with a multimodal information system while also talking to another person, the user looks most of the time at the system, both when talking to the system (94%) and when talking to the other person (57%). The other person, in turn, looks at the system in 60% of cases when talking to the user. Bakx et al. (2003) also showed that some improvement in addressee detection can be achieved by combining utterance duration with gaze.</Paragraph>
    <Paragraph position="8"> In meeting conversations, the contribution of gaze direction to addressee prediction is also affected by the current meeting activity and the seating arrangement (Jovanovic and op den Akker, 2004). For example, when giving a presentation, a speaker most probably addresses his speech to the whole audience, although he may only look at a single participant in the audience. The seating arrangement determines a visible area for each meeting participant. During a turn, a speaker mostly looks at the participants who are in his visible area.</Paragraph>
    <Paragraph position="9">  Moreover, the speaker frequently looks at a single participant in his visible area when addressing a group. However, when he wants to address a single participant outside his visible area, he will often turn his body and head toward that participant.</Paragraph>
    <Paragraph position="10"> In this paper, we explored not only the effectiveness of the speaker's gaze direction, but also the effectiveness of the listeners' gaze directions as cues for addressee prediction.</Paragraph>
    <Paragraph position="11"> Meeting context and addressing - As Goffman (1981a) noted, &amp;quot;the notion of a conversational encounter does not suffice in dealing with the context in which words are spoken; a social occasion involving a podium event or no speech event at all may be involved, and in any case, the whole social situation, the whole surround, must always be considered&amp;quot;. The set of various meeting actions that participants perform in meetings is one aspect of the social situation that differentiates meetings from other contexts of talk, such as ordinary conversations, interviews or trials. As noted above, it influences addressing behavior as well as the contribution of gaze to addressee identification. Furthermore, distributions of addressing types vary across different meeting actions. Clearly, the percentage of utterances addressed to the whole group during a presentation is expected to be much higher than during a discussion.</Paragraph>
  </Section>
  <Section position="5" start_page="171" end_page="172" type="metho">
    <SectionTitle>
4 Data collection
</SectionTitle>
    <Paragraph position="0"> To train and test our classifiers, we used a small multimodal corpus developed for studying addressing behavior in meetings (Jovanovic et al., 2005). The corpus contains 12 meetings recorded in the IDIAP smart meeting room within the research programs of the M4 and AMI projects. The room is equipped with fully synchronized multi-channel audio and video recording devices, a whiteboard and a projector screen. The seating arrangement places two participants at each of two opposite sides of a rectangular table. The total amount of recorded data is approximately 75 minutes. For the experiments presented in this paper, we selected meetings from the M4 data collection. These meetings are scripted in terms of the type and schedule of group actions, but their content is natural and unconstrained.</Paragraph>
    <Paragraph position="1"> The meetings are manually annotated with dialogue acts, addressees, adjacency pairs and gaze direction. Each type of annotation is described in detail in (Jovanovic et al., 2005). Additionally, the available annotations of meeting actions for the M4 meetings were converted into the corpus format and included in the collection.</Paragraph>
    <Paragraph position="2"> The dialogue act tag set employed for the corpus creation is based on the MRDA (Meeting Recorder Dialogue Act) tag set (Dhillon et al., 2004). The MRDA tag set represents a modification of the SWDB-DAMSL tag set (Jurafsky et al., 1997) for application to multi-party meeting dialogues. The tag set used for the corpus creation was made by grouping the MRDA tags into 17 categories that are divided into seven groups: acknowledgments/backchannels, statements, questions, responses, action motivators, checks and politeness mechanisms. A mapping between this tag set and the MRDA tag set is given in (Jovanovic et al., 2005). Unlike MRDA, where each utterance is marked with a label made up of one or more tags from the set, each utterance in the corpus is marked as Unlabeled or with exactly one tag from the set. Adjacency pairs are labeled by marking the dialogue acts that occur as their a-part and b-part. Since all meetings in the corpus consist of four participants, the addressee of a dialogue act is labeled as Unknown or with one of the following addressee tags: an individual Px, a subgroup of participants Px,Py or the whole audience Px,Py,Pz.</Paragraph>
    <Paragraph position="3"> Labeling gaze direction means labeling gazed-at targets for each meeting participant. As the only targets of interest for addressee identification are the meeting participants, the meetings were annotated with a tag set that contains a tag linked to each participant Px and the NoTarget tag, which is used when the speaker does not look at any of the participants.</Paragraph>
    <Paragraph position="4"> Meetings are annotated with the set of meeting actions described in (McCowan et al., 2003): monologue, presentation, white-board, discussion, consensus, disagreement and note-taking.</Paragraph>
    <Paragraph position="5"> Reliability of the annotation schema - As reported in (Jovanovic et al., 2005), gaze annotation has been reproduced reliably (segmentation  groups that annotated two different sets of meeting data.</Paragraph>
  </Section>
  <Section position="6" start_page="172" end_page="174" type="metho">
    <SectionTitle>
5 Addressee classification
</SectionTitle>
    <Paragraph position="0"> In this section we present the results on addressee classification in four-person face-to-face meetings using Bayesian Network and Naive Bayes classifiers.</Paragraph>
    <Section position="1" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
5.1 Classification task
</SectionTitle>
      <Paragraph position="0"> In a dialogue situation - an event that lasts as long as the dialogue act performed by the speaker in that situation - the class variable is the addressee of the dialogue act (ADD). Since there are only a few instances of subgroup addressing in the data, we removed them from the data set and excluded all possible subgroups of meeting participants from the set of class values. Therefore, we define addressee classifiers to identify one of the following class values: an individual Px, where x ∈ {0,1,2,3}, or ALLP, which denotes the whole group.</Paragraph>
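As an illustration, the filtering of class values described above can be sketched as follows; the instance format and field names are our own hypothetical simplification of the corpus annotations, not its actual format:

```python
# Keep only instances addressed to a single participant (P0..P3) or to the
# whole group (ALLP); instances labeled Unknown or with subgroup addressee
# tags (e.g. "P1,P2") are discarded, as described above.
CLASS_VALUES = {"P0", "P1", "P2", "P3", "ALLP"}

def filter_instances(instances):
    # instances: list of dicts with an "addressee" field (illustrative format)
    return [inst for inst in instances if inst["addressee"] in CLASS_VALUES]

sample = [
    {"addressee": "P0"},
    {"addressee": "P1,P2"},    # subgroup -> dropped
    {"addressee": "ALLP"},
    {"addressee": "Unknown"},  # -> dropped
]
kept = filter_instances(sample)
```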
    </Section>
    <Section position="2" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
5.2 Feature set
</SectionTitle>
      <Paragraph position="0"> To identify the addressee of a dialogue act, we initially used three sorts of features: conversational context features (later referred to as contextual features), utterance features and gaze features.</Paragraph>
      <Paragraph position="1"> Additionally, we conducted experiments with an extended feature set including a feature that conveys information about meeting context.</Paragraph>
      <Paragraph position="2"> Contextual features provide information about the preceding utterances. We experimented with using information about the speaker, the addressee and the dialogue act of the immediately preceding utterance on the same or a different channel (SP-1, ADD-1, DA-1), as well as information about the related utterance (SP-R, ADD-R, DA-R). A related utterance is the utterance that is the a-part of an adjacency pair whose b-part is the current utterance. Information about the speaker of the current utterance (SP) has also been included in the contextual feature set.</Paragraph>
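A minimal sketch of assembling these contextual features, assuming utterances are dicts with hypothetical "speaker"/"addressee"/"da" fields (our own simplification, not the corpus format):

```python
def contextual_features(current, previous, related):
    """Build the contextual feature vector for `current`.

    `previous` is the immediately preceding utterance (same or different
    channel); `related` is the a-part of the adjacency pair whose b-part is
    `current`. Either may be None when unavailable.
    """
    def fields(utt, suffix):
        return {
            "SP" + suffix: utt["speaker"] if utt else None,
            "ADD" + suffix: utt["addressee"] if utt else None,
            "DA" + suffix: utt["da"] if utt else None,
        }
    feats = {"SP": current["speaker"]}
    feats.update(fields(previous, "-1"))
    feats.update(fields(related, "-R"))
    return feats
```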
      <Paragraph position="3"> As utterance features, we used a subset of the lexical features presented in (Jovanovic and op den Akker, 2004) as useful cues for determining whether the utterance is single or group addressed.</Paragraph>
      <Paragraph position="4"> The subset includes the following features:
* does the utterance contain the personal pronouns &amp;quot;we&amp;quot; or &amp;quot;you&amp;quot;, both of them, or neither of them?
* does the utterance contain possessive pronouns or possessive adjectives (&amp;quot;your/yours&amp;quot; or &amp;quot;our/ours&amp;quot;), their combination or neither of them?
* does the utterance contain indefinite pronouns such as &amp;quot;somebody&amp;quot;, &amp;quot;someone&amp;quot;, &amp;quot;anybody&amp;quot;, &amp;quot;anyone&amp;quot;, &amp;quot;everybody&amp;quot; or &amp;quot;everyone&amp;quot;?
* does the utterance contain the name of participant Px?
Utterance features also include information about the utterance's conversational function (DA tag) and about the utterance duration, i.e.</Paragraph>
      <Paragraph position="5"> whether the utterance is short or long. In our experiments, an utterance is considered short if its duration is at most 1 second.</Paragraph>
      <Paragraph position="6"> We experimented with a variety of gaze features. In the first experiment, for each participant Px we defined a set of features of the form Px-looks-Py and Px-looks-NT, where x, y ∈ {0,1,2,3} and x ≠ y; Px-looks-NT represents that participant Px does not look at any of the participants. The value set represents the number of times that participant Px looks at Py or looks away during the time span of the utterance: zero for 0, one for 1, two for 2 and more for 3 or more times. In the second experiment, we defined a feature set that incorporates only information about the gaze direction of the current speaker (SP-looks-Px and SP-looks-NT), with the same value set as in the first experiment. As to meeting context, we experimented with different values of the feature that represents meeting actions (MA-TYPE). First, we used the full set of speech-based meeting actions that was applied for the manual annotation of the meetings in the corpus: monologue, discussion, presentation, white-board, consensus and disagreement. As the results on modeling group actions in meetings presented in (McCowan et al., 2003) indicate that consensus and disagreement were mostly misclassified as discussion, we also conducted experiments with a set of four values for MA-TYPE, in which the consensus, disagreement and discussion meeting actions were grouped into one category marked as discussion.</Paragraph>
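Bucketing the gaze counts into the four values zero/one/two/more might look like this; the event format is a hypothetical simplification of the gaze annotation:

```python
from collections import Counter

PARTICIPANTS = ["P0", "P1", "P2", "P3"]

def bucket(n):
    # Map a raw count onto the four-valued set described above.
    return ("zero", "one", "two")[n] if n < 3 else "more"

def gaze_features(px, gaze_targets):
    """gaze_targets: the sequence of targets participant `px` looked at during
    the utterance, each another participant id or "NT" (no target)."""
    counts = Counter(gaze_targets)
    feats = {f"{px}-looks-{py}": bucket(counts[py])
             for py in PARTICIPANTS if py != px}
    feats[f"{px}-looks-NT"] = bucket(counts["NT"])
    return feats
```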
    </Section>
    <Section position="3" start_page="173" end_page="174" type="sub_section">
      <SectionTitle>
5.3 Results and Discussions
</SectionTitle>
      <Paragraph position="0"> To train and test the addressee classifiers, we used the hand-annotated M4 data from the corpus. After we had discarded the instances labeled with Unknown or subgroup addressee tags, there were 781 instances left available for the experiments.</Paragraph>
      <Paragraph position="1"> The distribution of the class values in the selected data is presented in Table 2.</Paragraph>
      <Paragraph position="2">  For learning the Bayesian Network structure, we applied the K2 algorithm (Cooper and Herskovits, 1992). The algorithm requires an ordering on the observable features; different orderings lead to different network structures. We conducted experiments with several orderings of the feature types, as well as with different orderings of features of the same type. The classification results obtained for the different orderings were nearly identical. For learning the conditional probability distributions, we used the algorithm implemented in the WEKA toolbox, which produces direct estimates of the conditional probabilities.</Paragraph>
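"Direct estimates of the conditional probabilities" amount to maximum-likelihood frequency counts. A minimal Naive Bayes sketch in that spirit follows; this is our own simplification, and WEKA's actual estimator details (e.g. smoothing) may differ:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """X: list of feature dicts; y: list of class labels (P0..P3, ALLP).
    Returns a predict function scoring P(class) * prod P(feature=value | class),
    with all probabilities estimated directly from relative frequencies."""
    n = len(y)
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    cond = defaultdict(Counter)  # (feature, class) -> Counter of values
    for xi, yi in zip(X, y):
        for f, v in xi.items():
            cond[(f, yi)][v] += 1

    def predict(x):
        def score(c):
            s = prior[c]
            for f, v in x.items():
                total = sum(cond[(f, c)].values())
                s *= cond[(f, c)][v] / total if total else 0.0
            return s
        return max(prior, key=score)

    return predict
```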
      <Paragraph position="3"> The performances of the classifiers were measured using different feature sets. First, we measured the performances of the classifiers using utterance features, gaze features and contextual features separately. Then, we conducted experiments with all possible combinations of the different types of features. For each classifier, we performed 10-fold cross-validation. Table 3 summarizes the accuracies of the classifiers (with 95% confidence intervals) for the different feature sets, (1) using gaze information of all meeting participants and (2) using only information about the speaker's gaze direction.</Paragraph>
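As a reading aid, a 95% confidence interval for an accuracy estimated on the 781 instances can be computed with the usual normal approximation; the paper does not state which interval method it used, so this is an assumption:

```python
import math

def accuracy_ci(p_hat, n, z=1.96):
    # Normal-approximation interval: p_hat +/- z * sqrt(p_hat*(1-p_hat)/n)
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# e.g. the best reported BN accuracy, evaluated on the 781 instances
lo, hi = accuracy_ci(0.8259, 781)
```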
      <Paragraph position="4"> The results show that the Bayesian Network classifier outperforms the Naive Bayes classifier for all feature sets, although the difference is significant only for the feature sets that include contextual features.</Paragraph>
      <Paragraph position="5"> For the feature set that contains only information about gaze behavior combined with information about the speaker (Gaze+SP), both classifiers perform significantly better when exploiting the gaze information of all meeting participants. In other words, when solely the focus of visual attention is used to identify the addressee of a dialogue act, the listeners' focus of attention provides valuable information for addressee prediction. The same conclusion can be drawn when adding information about utterance duration to the gaze feature set (Gaze+SP+Short), although for the Bayesian Network classifier the difference is not significant. For all other feature sets, the classifiers do not perform significantly differently when including or excluding the listeners' gaze information. Moreover, both classifiers perform better using only speaker gaze information in all cases except when combined utterance and gaze features are exploited (Utterance+Gaze+SP).</Paragraph>
      <Paragraph position="6"> The Bayesian Network and Naive Bayes classifiers show the same changes in performance over the different feature sets. The results indicate that the selected utterance features are less informative for addressee prediction (BN:52.62%, NB:52.50%) than the contextual features (BN:73.11%, NB:68.12%) or the features of gaze behavior (BN:66.45%, NB:64.53%).</Paragraph>
      <Paragraph position="7"> The results also show that adding information about utterance duration to the gaze features slightly increases the accuracies of the classifiers (BN:67.73%, NB:65.94%), which confirms the findings presented in (Bakx et al., 2003). Combining the information from the gaze and speech channels significantly improves the performances of the classifiers (BN:70.68%, NB:69.78%) in comparison to the performances obtained from each channel separately. Furthermore, higher accuracies are gained by adding contextual features to the utterance features (BN:76.82%, NB:72.21%), and even more so by adding them to the features of gaze behavior (BN:80.03%, NB:77.59%). As expected, the best performances are achieved by combining all three types of features (BN:82.59%, NB:78.49%), although these are not significantly better than those of the combined contextual and gaze features.</Paragraph>
      <Paragraph position="8"> We also explored how well the addressee can be predicted when excluding information about the related utterance (i.e. AP information). The best performances are achieved by combining speaker gaze information with contextual and utterance features (BN:79.39%, NB:76.06%). The small decrease in classification accuracy when excluding AP information (about 3%) indicates that the remaining contextual, utterance and gaze features capture most of the useful information provided by APs.</Paragraph>
      <Paragraph position="9"> Error analysis - Further analysis of the confusion matrices for the best-performing BN and NB classifiers shows that most misclassifications were between addressing types (individual vs. group): each Px was more often confused with ALLP than with another Py. A similar type of confusion has been observed between human annotators with regard to addressee annotation (Jovanovic et al., 2005). Of all misclassified cases for each classifier, individual addressing (Px) was, on average, misclassified as group addressing (ALLP) in 73% of cases for NB and 68% of cases for BN.</Paragraph>
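The individual-vs-group error statistic reported above can be computed from predictions like this (a sketch over hypothetical label lists):

```python
def share_confused_with_group(y_true, y_pred):
    """Among misclassified instances whose true addressee is an individual Px,
    return the fraction predicted as ALLP rather than as some other Py."""
    errors = [(t, p) for t, p in zip(y_true, y_pred)
              if t != p and t != "ALLP"]
    if not errors:
        return 0.0
    return sum(1 for _, p in errors if p == "ALLP") / len(errors)
```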
      <Paragraph position="10">  We examined whether meeting context information can aid the classifiers' performances. First, we conducted experiments using the six-value set for the MA-TYPE feature. Then, we experimented with the reduced set of four types of meeting actions (see Section 5.2). The accuracies obtained by combining the MA-TYPE feature with contextual, utterance and gaze features are presented in Table 4.</Paragraph>
      <Paragraph position="11"> The results indicate that adding meeting context information to the initial feature set slightly, but not significantly, improves the classifiers' performances. The highest accuracy (83.74%) is achieved by the Bayesian Network classifier, combining the four-value MA-TYPE feature with contextual, utterance and the speaker's gaze features.</Paragraph>
    </Section>
  </Section>
</Paper>