<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1070"> <Title>Towards a Model of Face-to-Face Grounding</Title> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Empirical Study </SectionTitle> <Paragraph position="0"> In order to get an empirical basis for modeling face-to-face grounding, and implementing an ECA, we analyzed conversational data in two conditions.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Experiment Design </SectionTitle> <Paragraph position="0"> Based on previous direction-giving tasks, students from two different universities gave directions to campus locations to one another. Each pair had a conversation in a (1) Face-to-face condition (F2F): where two subjects sat with a map drawn by the direction-giver sitting between them, and in a (2) Shared Reference condition (SR): where an L-shaped screen between the subjects let them share a map drawn by the direction-giver, but not to see the other's face or body.</Paragraph> <Paragraph position="1"> Interactions between the subjects were videorecorded from four different angles, and combined by a video mixer into synchronized video clips.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Data Coding </SectionTitle> <Paragraph position="0"> 10 experiment sessions resulted in 10 dialogues per condition (20 in total), transcribed as follows.</Paragraph> <Paragraph position="1"> Coding verbal behaviors: As grounding occurs within a turn, which consists of consecutive utterances by a speaker, following [13] we tokenized a turn into utterance units (UU), corresponding to a single intonational phrase [14]. Each UU was categorized using the DAMSL coding scheme [15]. In the statistical analysis, we concentrated on the following four categories with regular occurrence in our data: Acknowledgement, Answer, Information request (Info-req), and Assertion.</Paragraph> <Paragraph position="2"> Coding nonverbal behaviors: Based on previous studies, four types of behaviors were coded: Gaze At Partner (gP): Looking at the partner's eyes, eye region, or face.</Paragraph> <Paragraph position="3"> Gaze At Map (gM): Looking at the map Gaze Elsewhere (gE): Looking away elsewhere Head nod (Nod): Head moves up and down in a single continuous movement on a vertical axis, but eyes do not go above the horizontal axis.</Paragraph> <Paragraph position="4"> By combining Gaze and Nod, six complex categories (ex. gP with nod, gP without nod, etc) are generated. In what follows, however, we analyze only categories with more than 10 instances. In order to analyze dyadic behavior, 16 combinations of the nonverbal behaviors are defined, as shown in Table 1. Thus, gP/gM stands for a combination of speaker gaze at partner and listener gaze at map.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> We examine differences between the F2F and SR conditions, correlate verbal and nonverbal behaviors within those conditions, and finally look at correlations between speaker and listener behavior.</Paragraph> <Paragraph position="1"> Basic Statistics: The analyzed corpus consists of 1088 UUs for F2F, and 1145 UUs for SR. The mean length of conversations in F2F is 3.24 minutes, and in SR is 3.78 minutes (t(7)=-1.667 p<.07 (one-tail)). 
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> We examine differences between the F2F and SR conditions, correlate verbal and nonverbal behaviors within those conditions, and finally look at correlations between speaker and listener behavior.</Paragraph> <Paragraph position="1"> Basic statistics: The analyzed corpus consists of 1088 UUs for F2F and 1145 UUs for SR. The mean length of conversations is 3.24 minutes in F2F and 3.78 minutes in SR (t(7) = -1.667, p < .07, one-tailed). The mean length of utterances in F2F (5.26 words per UU) is significantly longer than in SR (4.43 words per UU) (t(7) = 3.389, p < .01, one-tailed). For the nonverbal behaviors, the number of shifts between the statuses in Table 1 was compared (e.g., an NV status shift from gP/gP to gM/gM counts as one shift). There were 887 NV status shifts in F2F and 425 in SR; the number of shifts in SR is less than half that in F2F (t(7) = 3.377, p < .01, one-tailed).</Paragraph> <Paragraph position="2"> These results indicate that visual access to the interlocutor's body affects the conversation, suggesting that these nonverbal behaviors are used as communicative signals. In SR, where the mean UU length is shorter, speakers present information in smaller chunks than in F2F, leading to more chunks and a slightly longer conversation. In F2F, on the other hand, participants convey more information in each UU.</Paragraph> <Paragraph position="3"> Correlation between verbal and nonverbal behaviors: We analyzed NV status shifts with respect to the type of verbal communicative action and the experimental condition (F2F/SR). To look at the continuity of NV status, we also analyzed the amount of time spent in each NV status. For gaze, transitions and time spent gave similar results; because head nods are so brief, however, we discuss the data in terms of transitions. Table 2 shows the most frequent target NV status (the status shifted to from others) for each speech-act type in F2F. Numbers in parentheses indicate the proportion of the total number of transitions.</Paragraph> <Paragraph position="4"> <Acknowledgement> Within a UU, the dyad's NV status most frequently shifts to gMwN/gM (e.g., the speaker utters &quot;OK&quot; while nodding, and the listener looks at the map). At pauses, a shift to gM/gM is most frequent. The same results were found in SR, where the listener could not see the speaker's nod. These findings suggest that Acknowledgement is likely to be accompanied by a head nod, and that this behavior may function introspectively as well as communicatively.</Paragraph> <Paragraph position="5"> <Answer> In F2F, the most frequent shift within a UU is to gP/gP. This suggests that speakers and listeners rely on mutual gaze (gP/gP) to ensure an answer is grounded, a strategy unavailable in SR. In addition, we found that speakers frequently look away at the beginning of an answer, as they plan their reply [7].</Paragraph> <Paragraph position="6"> <Info-req> In F2F, the most frequent shift within a UU is to gP/gM, while at pauses between UUs a shift to gP/gP is the most frequent. This suggests that speakers obtain mutual gaze after asking a question, to ensure that the question is clear, before the turn is transferred to the listener to reply. In SR, however, there is rarely any NV status shift, and participants continue looking at the map.</Paragraph> <Paragraph position="7"> <Assertion> In both conditions, listeners look at the map most of the time and sometimes nod.</Paragraph> <Paragraph position="8"> However, speakers' nonverbal behavior differs markedly across conditions. In SR, speakers look either at the map or elsewhere. By contrast, in F2F they frequently look at the listener, so that a shift to gP/gM is the most frequent within a UU. This suggests that, in F2F, speakers check whether the listener is paying attention to the referent mentioned in the Assertion. This implies that not only the listener's gaze at the speaker, but also the listener's attention to the referent, serves as positive evidence of understanding in F2F.</Paragraph>
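Table 2 is, in effect, a cross-tabulation of within-UU NV status shifts by speech-act type. A minimal Python sketch of that tabulation, assuming each coded UU record carries its DAMSL label and the dyadic NV statuses shifted to during the UU (this record layout is an assumption for illustration, not the paper's format):

```python
from collections import Counter, defaultdict

# Toy records: one DAMSL act label per UU plus the dyadic NV statuses the dyad
# shifted to while the UU was uttered (layout assumed for illustration).
coded_uus = [
    {"act": "Acknowledgement", "nv_targets": ["gMwN/gM"]},
    {"act": "Assertion",       "nv_targets": ["gP/gM", "gM/gM"]},
    {"act": "Info-req",        "nv_targets": ["gP/gM"]},
]

def most_frequent_targets(uus):
    """For each speech-act type, return the most frequent target NV status and
    its share of all transitions (the proportion reported in Table 2)."""
    counts = defaultdict(Counter)
    for uu in uus:
        counts[uu["act"]].update(uu["nv_targets"])
    result = {}
    for act, counter in counts.items():
        status, n = counter.most_common(1)[0]
        result[act] = (status, n / sum(counter.values()))
    return result

print(most_frequent_targets(coded_uus))
```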
<Paragraph position="9"> In summary, it is already known that eye gaze can signal a turn-taking request [16], but turn-taking cannot account for all of our results. Gaze direction changes within as well as between UUs, and the use of these nonverbal behaviors differs depending on the type of conversational action.</Paragraph> <Paragraph position="10"> Note that subjects rarely demonstrated communication failures, implying that these nonverbal behaviors represent positive evidence of grounding.</Paragraph> <Paragraph position="11"> Correlation between speaker and listener behavior: Thus far we have demonstrated differences in the distribution of nonverbal behaviors with respect to conversational action and visibility of the interlocutor. To uncover the function of these nonverbal signals, however, we must examine how the listener's nonverbal behavior affects the speaker's subsequent action. We therefore looked at pairs of consecutive Assertion UUs by a direction-giver and analyzed the relationship between the NV status during the first UU and the direction-giving strategy in the second UU. The giver's second UU is classified as go-ahead if it gives the next leg of the directions, or as elaboration if it gives additional information about the first UU, as in the following example: [U1]S: And then, you'll go down this little corridor.</Paragraph> <Paragraph position="12"> [U2]S: It's not very long.</Paragraph> [Figure 2: giver's next verbal behavior by listener's NV status] <Paragraph position="13"> Results are shown in Figure 2. When the listener begins to gaze at the speaker somewhere within a UU and maintains that gaze until the pause after the UU, the speaker's next UU is an elaboration of the previous UU 73% of the time. On the other hand, when the listener keeps looking at the map during a UU, the next UU is an elaboration only 30% of the time (z = 3.678, p < .01). Moreover, when a listener keeps looking at the speaker, the speaker's next UU is a go-ahead only 27% of the time, whereas when a listener keeps looking at the map, the speaker's next UU is a go-ahead 52% of the time (z = …). (The percentages for map do not sum to 100% because some of the UUs are cue phrases or tag questions that are part of the next leg of the directions but do not convey content.) These results suggest that speakers interpret listeners' continuous gaze as evidence of not-understanding, and they therefore add more information about the previous UU. Similar findings were reported for a map task by [17], who suggested that, at times of communicative difficulty, interlocutors are more likely to utilize all the channels available to them. In terms of floor management, gazing at the partner is a signal of giving up a turn, and here it indicates that listeners are trying to elicit more information from the speaker.</Paragraph> <Paragraph position="14"> In addition, listeners' continuous attention to the map is interpreted as evidence of understanding, and speakers go ahead to the next leg of the directions. We also analyzed pairs of consecutive Answer UUs from a giver, and found that when the listener looks at the speaker at a pause, the speaker elaborates the Answer 78% of the time; when the listener looks at the speaker during the UU and at the map after the UU (positive evidence), the speaker elaborates only 17% of the time.</Paragraph> </Section> <Section position="4" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 3.3 A Model of Face-to-Face Grounding </SectionTitle> <Paragraph position="0"> Analyzing spoken dialogues, [18] reported that grounding behavior is more likely to occur at an intonational boundary, which is what we use to identify UUs. This implies that multiple grounding behaviors can occur within a turn if it consists of multiple UUs. In previous models, however, information is grounded only when the listener returns verbal feedback, and acknowledgement marks the smallest scope of grounding. If we apply such a model to the example in Figure 1, none of the UUs has been grounded, because the listener has not returned any spoken grounding cues.</Paragraph> <Paragraph position="1"> In contrast, our results suggest that considering the role of nonverbal behavior, especially eye gaze, allows a more fine-grained model of grounding, one that employs the UU as the unit of grounding.</Paragraph> <Paragraph position="2"> Our results also suggest that speakers actively monitor positive evidence of understanding, as well as the absence of negative evidence of understanding (that is, signs of miscommunication). When listeners continue to gaze at the task, speakers continue on to the next leg of the directions. Because of the incremental nature of grounding, we implement nonverbal grounding functionality in an embodied conversational agent using a process model that describes how the system judges whether the user understands its contribution: (1) Preparing for the next UU: according to the speech-act type of the next UU, the nonverbal positive or negative evidence that the agent expects to receive is specified. (2) Monitoring: the agent monitors the user's nonverbal status and signals during the UU; after speaking, it continues monitoring until it has obtained sufficient evidence of understanding or not-understanding from the user's nonverbal status and signals. (3) Judging: once the agent has sufficient evidence, it judges groundedness as soon as possible. Previous studies report that the pause between UUs lies between 0.4 and 1 second [18, 19]; the time-out for judgment is therefore 1 second after the end of the UU. If the agent still has no evidence at that point, the UU remains ungrounded.</Paragraph>
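A minimal Python sketch of this Prepare-Monitor-Judge loop follows. The step structure and the 1-second time-out come from the description above; the evidence table (shown only for Assertion), the polling callables, and the sampling rate are assumptions made for illustration, and judgment is simplified to the interval after the UU ends.

```python
import time

JUDGE_TIMEOUT = 1.0   # judge within 1 s of the end of the UU [18, 19]
SAMPLE_PERIOD = 0.05  # inputs are sampled continuously; 50 ms is an assumption

# Step (1) Preparing: nonverbal evidence the agent expects for the next UU,
# keyed by speech-act type. Only Assertion is sketched; the exact table is an
# illustrative assumption, not the paper's specification.
EXPECTED_EVIDENCE = {
    "Assertion": {"positive": {"gM", "gMwN"},   # listener attends to the map
                  "negative": {"gP", "gPwN"}},  # listener keeps gazing at the agent
}

def monitor_and_judge(act_type, get_listener_status, uu_is_finished):
    """Steps (2) Monitoring and (3) Judging: poll the listener's nonverbal
    status and judge the UU as soon as enough evidence is available.
    Returns "grounded" (go ahead), "elaborate" (negative evidence, the
    contribution stays provisional), or "ungrounded" (time-out)."""
    expected = EXPECTED_EVIDENCE[act_type]
    deadline = None
    while True:
        status = get_listener_status()            # e.g. "gM", "gMwN", "gP", "gE"
        if uu_is_finished():
            if deadline is None:
                deadline = time.time() + JUDGE_TIMEOUT
            if status in expected["positive"]:
                return "grounded"
            if status in expected["negative"]:
                return "elaborate"
            if time.time() > deadline:
                return "ungrounded"               # no evidence within 1 s
        time.sleep(SAMPLE_PERIOD)
```

In the agent, a "grounded" result would license the next leg of the directions, while "elaborate" would trigger additional information about the previous UU, matching the behavior observed in the corpus.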
<Paragraph position="3"> This model is based on the information state approach [3], with update rules that revise the state of the conversation based on the inputs the system receives. In our case, however, the inputs are sampled continuously, include the nonverbal state, and only some require updates. Other inputs indicate that the last utterance is still pending and allow the agent to wait further. In particular, task attention over an interval following the utterance triggers grounding. Gaze during that interval means that the contribution stays provisional and triggers an obligation to elaborate. Likewise, if the system times out without recognizing any user feedback, the segment remains ungrounded. This process allows the system to keep talking across multiple utterance units without receiving verbal feedback from the user. From the user's perspective, explicit acknowledgement is not necessary, and eliciting elaboration involves minimal cost.</Paragraph> </Section> </Section> </Paper>