<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0205">
  <Title>A Comparison of Tutor and Student Behavior in Speech Versus Text Based Tutoring</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Typed Human-Human Tutoring Corpus
</SectionTitle>
    <Paragraph position="0"> The Why2-Atlas Human-Human Typed Tutoring Corpus is a collection of typed tutoring dialogues between a (human) tutor and a student, collected via a typed interface in which the tutor plays the same role that Why2-Atlas is designed to perform. The experimental procedure is as follows: 1) students are given a pretest measuring their knowledge of physics, 2) students are asked to read through a small document of background material, 3) students work through a set of up to 10 Why2-Atlas training problems with the human tutor, and 4) students are given a post-test that is similar to the pretest. The entire experiment takes no more than 15 hours per student, and is usually performed in 1-3 sessions of no more than 4 hours each. Data collection began in the Fall 2002 semester and is continuing in the Spring 2003 semester. The subjects are all University of Pittsburgh students who have never taken any college physics courses. One tutor currently participates in the study.</Paragraph>
    <Paragraph position="1"> As in the Why2-Atlas system, when the tutoring session begins, the student first types an essay answering a qualitative physics problem. Once the student submits his/her essay, the tutor engages the student in a typed natural language dialogue to provide feedback, correct misconceptions, and elicit more complete explanations. This instruction takes the form of a dialogue between the student and the tutor through a text based chat interface, with student and tutor in separate rooms.</Paragraph>
    <Paragraph position="2"> At key points in the dialogue, the tutor asks the student to revise the essay. This cycle of instruction and revision continues until the tutor is satisfied with the student's essay. A sample tutoring dialogue from the Why2-Atlas typed human-human tutoring corpus is displayed in Figure 1.</Paragraph>
    <Paragraph position="3"> The tutor was instructed to cover the expectations for each problem, to watch for the specific set of expectations and misconceptions associated with the problem, and to end the discussion of each problem by showing the ideal essay to the student. He was encouraged to avoid lecturing the student and to attempt to draw out the student's own reasoning. He knew that transcripts of his tutoring would be analyzed. Nevertheless, he was not required to follow any prescribed tutoring strategies, so his tutoring style was much more naturalistic than in previous studies such as the BEE study (Rosé et al., 2001), in which two specific tutoring styles, namely Socratic and Didactic, were contrasted. The results of that study revealed a trend for students in the Socratic condition to learn more than those in the Didactic condition. A further analysis of the corpus collected during the BEE study (Core et al., 2002) verified that the Socratic dialogues from the BEE study were more interactive than the Didactic ones.</Paragraph>
    <Paragraph position="4"> The biggest reliable difference between the two sets of tutoring dialogues was the percentage of words spoken by the student, i.e., the number of student words divided by the total number of words. The Didactic dialogues contained on average 26% student words, whereas the Socratic dialogues contained 33% student words. With respect to percentage of student words, the dialogues in our text based human tutoring corpus were on average more like the Didactic dialogues from the BEE study, with an average percentage of student text of 27%. Nevertheless, because the tutor was not constrained to follow a prescribed tutoring style, the level of interactivity varied widely throughout the transcripts, at times being highly Socratic, and at other times being highly Didactic.</Paragraph>
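The percentage-of-student-words measure defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (speaker, text) layout of a dialogue, the speaker labels, whitespace tokenization, and the three-turn example are all hypothetical and are not taken from the corpus.

```python
from collections import Counter


def words_by_speaker(turns):
    """Count whitespace-separated tokens per speaker.

    `turns` is a list of (speaker, text) pairs; this layout is an
    assumption made for the sketch, not the corpus format.
    """
    counts = Counter()
    for speaker, text in turns:
        counts[speaker] += len(text.split())
    return counts


def student_word_percentage(turns):
    """Student words divided by total words, as a percentage."""
    counts = words_by_speaker(turns)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dialogue contains no words")
    return 100.0 * counts["student"] / total


# Made-up dialogue, for illustration only.
example = [
    ("tutor", "What forces act on the ball after release"),
    ("student", "Only gravity"),
    ("tutor", "Right so what is its acceleration"),
]

print(student_word_percentage(example))
```

A real analysis would need a tokenizer that matches however words were counted in the study, which the text does not specify.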
    <Paragraph position="5"> Pre and post tests were used to measure learning gains to be used for evaluating the effectiveness of various features of tutorial dialogue found in our corpora. Thus, we developed two tests: versions A and B, which were isomorphic to one another. That is, the problems on test A and B differed only in the identities of the objects (e.g., cars vs. trucks) and other surface features that should not affect the reasoning required to solve them. Each version of the test (A and B) consisted of 40 multiple choice questions. Each multiple choice question was written to address a single expectation covered in the training problems. Some students were not able to complete all 10 problems before they reached the end of their participation time. Thus, they took the post-test after only working through a subset of the training problems.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Spoken Human-Human Tutoring Corpus
</SectionTitle>
    <Paragraph position="0"> The ITSPOKE Human-Human Spoken Tutoring Corpus is a parallel collection of spoken tutoring dialogues collected via a web interface supplemented with a high quality audio link, where a human tutor performs the same task that our ITSPOKE system is being designed to perform. The experimental procedure used to collect the corpus is exactly the same as the procedure used to gather the Why2-Atlas Human-Human Corpus: the same tutor is used, the same subject pool (footnote 1) is used, the same pre-test and post-test are used, and the same set of physics problems is used. Only the modality differs. In particular, once the tutoring session begins and the student submits his/her typed essay, the tutor and student then discuss the student's essay using spoken English. In contrast to the text condition, where strict turn-taking is enforced, in the spoken condition, interruptions and overlapping speech are common. An example excerpt from the corpus is shown in Figure 2. Note that turns ending in - indicate speech overlapping with the following turn. Eventually, the student will edit his/her typed explanation. As in the text condition, the tutor then either expresses satisfaction and ends the tutoring for the current problem, or continues with another round of spoken dialogue interaction and typed essay revision. As in the text condition, students are presented with the ideal essay answer for a problem upon completing that problem.</Paragraph>
    <Paragraph position="1"> Footnote 1: We assigned a greater percentage of students to the text based condition as part of a separate experiment. Thus, the text based corpus is larger than the speech based corpus.</Paragraph>
    <Paragraph position="2"> PROBLEM: Suppose that you released 3 identical balls of clay in a vacuum at exactly the same instant. They would all hit the ground at the same instant. Now you stick two of the balls together, forming one ball that is twice as heavy as the remaining, untouched clay ball. Both balls are released in a vacuum at exactly the same instant. Which ball hits the ground first? ESSAY: Both balls will hit the ground at the same time. The balls are in free fall (only gravitational forces). The ratio of the masses and weight are equal.</Paragraph>
    <Paragraph position="3"> . . . excerpt from 2 minutes into a typed dialogue . . .</Paragraph>
    <Paragraph position="4"> 5 Differences between Typed and Spoken Human-Tutoring
Rosé et al. (submitted) present an analysis to uncover which aspects of the tutorial dialogue were responsible for its effectiveness in the text based condition. Longer student answers to tutor questions reveal more of a student's reasoning. Very short answers, i.e., 10 words or fewer, are normally composed of a single clause at most.</Paragraph>
    <Paragraph position="5"> Longer, multi-clausal answers have the potential to communicate many more inter-connections between ideas.</Paragraph>
    <Paragraph position="6"> Thus, if a tutor is attending to and responding directly to the student's revealed knowledge state, it would be expected that the effectiveness of the tutor's instruction would increase as average student turn length increases.</Paragraph>
    <Paragraph position="7"> PROBLEM: If a car is able to accelerate at 2 m/s², what acceleration can it attain if it is towing another car of equal mass? ESSAY: If the car is towing another car of equal mass, the maximum acceleration would be the same because the car would be towed behind and the friction caused would only be by the front of the first car. . . . excerpt from 6.5 minutes into spoken dialogue . . .</Paragraph>
    <Paragraph position="8"> To test this prediction, we computed a linear regression of the sequence of student turn lengths over time for each student in the text based condition in order to obtain an intercept and a slope, since student turn lengths have been observed to decline on average over the course of their interaction with the tutor. We then computed a multiple regression with pre-test score, intercept, and slope as independent variables and post-test score as the dependent variable. We found a reliable correlation between intercept and learning, with pre-test scores and slopes regressed out (R=.836; p &lt; .05). This result is consistent with Core et al. (2002), where percentage of student talk is strongly correlated with learning. In line with this, we also found a strong and reliable correlation between the ratio of student words to tutor words and learning (see footnote 2): the ratio correlated with post-test score after pre-test scores were regressed out (R=.866, p &lt; .05).</Paragraph>
    <Paragraph position="9"> One of our current research objectives is to compare the relative effectiveness of speech based and text based tutoring. Thus, when we have enough speech data, we would like to compare learning gains between the speech and text based conditions to test whether or not speech based tutoring is more effective than text based tutoring. Footnote 2: Note that ratio of student words to tutor words is number of student words divided by number of tutor words, whereas percentage of student words is number of student words divided by total number of words.</Paragraph>
    <Paragraph position="10"> We also plan to test whether the same features that correlate with learning in the text based condition also correlate with learning in the speech based condition. Since both average student turn length and overall ratio of student words to tutor words correlated strongly with learning gains in the text based condition, in this paper we compare these two measures between the text based tutoring condition and the speech based tutoring condition, but not yet in connection with learning gains in the speech-based corpus.</Paragraph>
    <Paragraph position="11"> Since strict turn taking was not enforced in the speech condition, turn boundaries were manually annotated (based on consensus labellings from two coders) when (1) the speaker stopped speaking and the other party in the dialogue began to speak, (2) the speaker asked a question and stopped speaking to wait for an answer, or (3) the other party in the dialogue interrupted the speaker and the speaker paused to allow the other party to speak.</Paragraph>
    <Paragraph position="12"> Currently, 13 students have started the typed human-human tutoring experiment, 7 of whom have finished.</Paragraph>
    <Paragraph position="13"> We have so far collected 78 typed dialogues from the text based condition, 69 of which were used in our analysis. Nine students have started the spoken human-human tutoring experiment, 6 of whom have finished. Thus, we have collected 63 speech based dialogues (1290 minutes of speech from 4 female and 4 male subjects), and have transcribed 25 of them. We hope to have an analysis covering all of our data in both conditions by the time of the workshop.</Paragraph>
    <Paragraph position="14"> As shown in Table 1, analysis of the data that has been collected and transcribed to date is already showing interesting differences between the ITSPOKE (spoken) and WHY2-ATLAS (text) corpora of human-human dialogues. The #trns columns show mean and standard deviation for the total number of turns taken by the students or tutor in each problem dialogue, while the next pair of columns show the mean and standard deviation for the total number of words spoken or typed by the students or tutor (#wds) in each problem dialogue. The last pair of columns show mean and standard deviation for the average number of student or tutor words per turn in each problem dialogue.</Paragraph>
    <Paragraph position="15"> Because data is still being collected for both corpora (and because the speech corpus also requires manual transcription), the sizes of the two data sets represented in the table differ somewhat. However, even at this early stage in the development of both corpora, these figures already show that the style of the interactions is very different in each modality. In particular, in spoken tutoring, both student and tutor take more turns on average than in text based tutoring, but these spoken turns are on average shorter. Moreover, in spoken tutoring both student and tutor on average use more words to communicate than in text based tutoring. Another interesting difference is that although in the speech condition both student and tutor take more turns, students finish the speech condition in less time. In particular, on average, students in the text based tutoring condition require 370.58 minutes to finish the training problems, with a standard deviation of 134.29 minutes, whereas students in the speech condition require only 159.9 minutes on average, with a standard deviation of 58.6 minutes. We measured the statistical reliability of the difference between the two measures that correlated reliably with learning in the text-based condition. A 2-tailed unpaired t-test indicates that this difference is significant (t(30)=8.99, p &lt; .01). There are also similarities across the two conditions. In particular, a 2-tailed unpaired t-test shows that the relative proportions of student and tutor words or turns do not differ significantly on average across the two modalities (t(13)=1.225, p=.242). As an illustration, Table 2 shows mean and standard deviation for the ratios of the total number of student and tutor words (#Swds/#Twds) and turns (#Strns/#Ttrns) in each problem dialogue3.</Paragraph>
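For readers unfamiliar with the unpaired t-test used above, the statistic can be sketched as follows. This is a sketch under assumptions: it uses the pooled-variance (Student) form, which the text does not confirm, and the two samples are made-up numbers rather than study data. It returns only the t statistic and degrees of freedom; converting these to a p-value requires the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def unpaired_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    Assumes equal population variances (Student's form); whether the
    study used this form is not stated, so treat it as an assumption.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t, df


# Made-up completion times in minutes, for illustration only.
text_minutes = [300, 400, 500]
speech_minutes = [100, 200, 150]
t, df = unpaired_t(text_minutes, speech_minutes)
print(round(t, 3), df)
```

The degrees of freedom in this form are the two sample sizes summed minus two, which is how a value such as t(13) should be read.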
    <Paragraph position="16"> Average student turn length is significantly lower in the speech based condition (t(13) = 4.5, p &lt; .001). This might predict that speech based tutoring may be less effective than text based tutoring. However, since ratio of student words to tutor words does not differ significantly, this would predict that learning will also not differ significantly between conditions. The total number of words uttered in the speech condition is larger than in the text based condition, as is the total number of turns. This difference between the two conditions will likely be even more pronounced in the human-computer comparison, due to noisy student input that results from use of automatic speech recognition. For example, clarifications and corrections made necessary by this will likely lead to an increase in dialogue length. More careful analysis is required to determine whether this means that more self-explanation took place overall in the speech based condition. If so, this would predict that the speech based condition would be more effective for learning than the text based condition. Thus, much interesting exploration is left to be done after we have collected enough speech data to compute a reliable comparison between the two conditions.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Current Directions
</SectionTitle>
    <Paragraph position="0"> Currently we are continuing to collect data in both the speech and text based human tutoring conditions. Since human tutors differ from each other with respect to both their tutoring styles and their conversational styles, we plan to collect data using several different human tutors in order to test the robustness of our comparisons between speech and text based human tutoring. Another possible direction for further inquiry would be to contrast naturalistic speech (where strict turn taking is not enforced, as in this data collection effort) with a speech condition in which strict turn taking is enforced, in order to separate the effects of speech on learning from the effects of alternative turn taking policies.</Paragraph>
    <Paragraph position="1"> As discussed in Section 2, we are currently developing both text based and speech based human-computer tutorial systems. Our ultimate goal is to test the relative effectiveness of speech versus text based computer tutors.</Paragraph>
    <Paragraph position="2"> We expect differences both between text and speech conditions in the human-computer data and between human-human and human-computer data. One of our first tasks will thus be to use the baseline version of ITSPOKE described in Section 2 to generate a corpus of human-computer spoken dialogues, using a process comparable to the human-human corpus collection described here.</Paragraph>
    <Paragraph position="3"> This will allow us to 1) compare the ITSPOKE human-human and human-computer corpora, 2) compare the ITSPOKE human-computer spoken corpus with a comparable Why2-Atlas text corpus, e.g., by expanding on the just described pilot study of the two human-human corpora, and 3) use the ITSPOKE human-computer corpus to guide the development of a new version of ITSPOKE that will attempt to increase its performance by taking advantage of information that is only available in speech, and by modifying its behavior in other ways to respect the interaction differences in item 2.</Paragraph>
  </Section>
</Paper>