<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1303">
  <Title>Building Effective Question Answering Characters</Title>
  <Section position="4" start_page="18" end_page="18" type="metho">
    <SectionTitle>
2 SGT Blackwell
</SectionTitle>
    <Paragraph position="0"> A user talks to SGT Blackwell using a head-mounted, close-capture USB microphone. The user's speech is converted into text using an automatic speech recognition (ASR) system. We used the Sonic statistical speech recognition engine from the University of Colorado (Pellom, 2001), with acoustic and language models provided to us by our colleagues at the University of Southern California (Sethy et al., 2005). The answer selection module analyzes the speech recognition output and selects the appropriate response. The character can deliver 83 spoken lines, ranging from a single word to monologues a couple of paragraphs long. There are three kinds of lines SGT Blackwell can deliver: content, off-topic, and prompts. The 57 content-focused lines cover the identity of the character, its origin, its language and animation technology, its design goals, our university, the conference setup, and some miscellaneous topics, such as "what time is it?" and "where can I get my coffee?" When SGT Blackwell detects a question that cannot be answered with one of the content-focused lines, it selects one of 13 off-topic responses (e.g., "I am not authorized to comment on that"), indicating that the user has ventured out of the allowed conversation domain. If the user persists in asking questions for which the character has no informative response, the system tries to nudge the user back into the conversation domain by suggesting a question for the user to ask: "You should ask me instead about my technology." There are 7 different prompts in the system.</Paragraph>
    <Paragraph position="1"> One topic can be covered by multiple answers, so asking the same question again often results in a different response, introducing variety into the conversation. The user can specifically request alternative answers by asking something along the lines of "do you have anything to add?" or "anything else?" This is the first of two types of command-like expressions SGT Blackwell understands. The second type is a direct request to repeat the previous response, e.g., "come again?" or "what was that?" If the user persists in asking the same question over and over, the character might be forced to repeat its answer. It indicates this by preceding the answer with one of four "pre-repeat" lines acknowledging that the incoming response has been heard recently, e.g., "Let me say this again..."</Paragraph>
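As an illustration of the dialogue behaviour described above, here is a minimal sketch of how such a response policy could be organized. The class name, the off-topic-streak heuristic, and the ten-line memory of recent responses are our own assumptions, not details taken from SGT Blackwell.

```python
# Illustrative sketch of a response policy: content answers first, off-topic
# fallbacks when nothing matches, prompts after repeated off-topic questions,
# and "pre-repeat" lines before a repeated answer.  All thresholds and data
# structures are assumptions made for this example.
import random

class ResponsePolicy:
    def __init__(self, content, off_topic, prompts, pre_repeat):
        self.content = content          # answer id -> list of spoken lines
        self.off_topic = off_topic      # e.g. "I am not authorized to comment on that."
        self.prompts = prompts          # e.g. "You should ask me instead about my technology."
        self.pre_repeat = pre_repeat    # e.g. "Let me say this again..."
        self.recent = []                # recently delivered lines
        self.off_topic_streak = 0

    def respond(self, answer_id):
        if answer_id is None:                      # nothing appropriate was found
            self.off_topic_streak += 1
            if self.off_topic_streak > 2:          # nudge the user back on topic
                return random.choice(self.prompts)
            return random.choice(self.off_topic)
        self.off_topic_streak = 0
        lines = self.content[answer_id]
        fresh = [l for l in lines if l not in self.recent]
        if fresh:                                  # alternative answer for the same topic
            line = random.choice(fresh)
        else:                                      # forced to repeat: prefix a pre-repeat line
            line = random.choice(self.pre_repeat) + " " + random.choice(lines)
        self.recent = (self.recent + [line])[-10:]
        return line
```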
  </Section>
  <Section position="5" start_page="18" end_page="20" type="metho">
    <SectionTitle>
3 Answer Selection
</SectionTitle>
    <Paragraph position="0"> The main problem with answer selection is uncertainty. There are two sources of uncertainty in a spoken dialog system: the first is the complex nature of natural language (including ambiguity, vagueness, underspecification, indirect speech acts, etc.), which makes it difficult to compactly characterize the mapping from the surface form of the text to its meaning; the second is the error-prone output of the speech recognition module. One possible approach to creating a language understanding system is to design a set of rules that select a response given an input text string (Weizenbaum, 1966). Because of this uncertainty, such an approach quickly becomes intractable for anything but the most trivial tasks. An alternative is to create an automatic system that uses a set of training question-answer pairs to learn the appropriate question-answer matching algorithm (Chu-Carroll and Carpenter, 1999). We have tried three different methods for the latter approach, described in the rest of this section.</Paragraph>
    <Section position="1" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
3.1 Text Classification
</SectionTitle>
      <Paragraph position="0"> The answer selection problem can be viewed as a text classification task. We have a question text as input and a finite set of answers (classes); we build a system that selects the most appropriate class or set of classes for the question. Text classification has been studied in Information Retrieval (IR) for several decades (Lewis et al., 1996). The distinct properties of our setup are (1) the very small size of the texts (the questions are very short) and (2) the large number of classes, e.g., 60 responses for SGT Blackwell.</Paragraph>
      <Paragraph position="1"> An answer defines a class. The questions corresponding to the answer are represented as vectors of term features. We tokenized the questions and stemmed them using the KStem algorithm (Krovetz, 1993). We used a tf x idf weighting scheme to assign values to the individual term features (Allan et al., 1998). Finally, we trained a multi-class Support Vector Machines (SVMstruct) classifier with an exponential kernel (Tsochantaridis et al., 2004). We also experimented with a linear kernel function, various parameter values for the exponential kernel, and different term weighting schemes. The reported combination of kernel and weighting scheme showed the best classification performance. This approach is well known in the community and has been shown to work very well in numerous applications (Leuski, 2004). In fact, SVM is generally considered to be one of the best-performing methods for text classification. We believe it provides us with a very strong baseline.</Paragraph>
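For concreteness, a minimal baseline in the spirit of this setup could be put together as follows. scikit-learn's SVC with an RBF kernel stands in for the SVMstruct classifier with an exponential kernel, no KStem stemming is applied, and the training examples are invented; all of these are assumptions, not the paper's exact configuration.

```python
# Sketch of a tf-idf + multi-class SVM question classifier (Section 3.1 spirit).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_questions = ["what is your name", "who are you", "what time is it"]
train_answers   = ["identity", "identity", "misc"]   # one class per answer

clf = make_pipeline(
    TfidfVectorizer(lowercase=True),   # tokenize and weight terms by tf-idf
    SVC(kernel="rbf", C=10.0),         # multi-class SVM (RBF kernel as a stand-in)
)
clf.fit(train_questions, train_answers)
print(clf.predict(["could you tell me your name"]))
```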
    </Section>
    <Section position="2" start_page="19" end_page="20" type="sub_section">
      <SectionTitle>
3.2 Answer Retrieval
</SectionTitle>
      <Paragraph position="0"> The answer selection problem can also be viewed as an information retrieval problem. We have a set of answers, which we call documents in accordance with information retrieval terminology. Let the question be the query; we compare the query to each document in the collection and return the most appropriate set of documents.</Paragraph>
      <Paragraph position="1"> Presently the best-performing IR techniques are based on the concept of Language Modeling (Ponte and Croft, 1997). The main strategy is to view both a query and a document as samples from probability distributions over the words in the vocabulary (i.e., language models) and to compare those distributions. These probability distributions can rarely be computed directly. The "art" of the field is to estimate the language models as accurately as possible given the observed queries and documents.</Paragraph>
      <Paragraph position="2"> Let Q = q_1...q_m be the question received by the system, let R_Q be the set of all answers appropriate to that question, and let P(w|R_Q) be the probability that a word randomly sampled from an appropriate answer is the word w.</Paragraph>
      <Paragraph position="3"> The language model of Q is the set of probabilities P(w|R_Q) for every word in the vocabulary. If we knew the answer set for the question, we could easily estimate the model. Unfortunately, we only know the question and not the answer set R_Q. We approximate the language model with the conditional distribution:</Paragraph>
      <Paragraph position="4"> P(w|R_Q) ≈ P(w|Q) = P(w, q_1,...,q_m) / P(q_1,...,q_m)   (1)</Paragraph>
      <Paragraph position="5"> The next step is to calculate the joint probability of observing a string: P(W) = P(w_1,...,w_n).</Paragraph>
      <Paragraph position="6"> Different methods for estimating P(W) have been suggested, starting with the simple unigram approach, where the occurrences of individual words are assumed to be independent of each other: P(W) = Π_{i=1..n} P(w_i). Other approaches include Probabilistic Latent Semantic Indexing (PLSI) (Hoffman, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). The main goal of these different estimations is to model the interdependencies that exist in the text and to make the estimation feasible given the finite amount of training data.</Paragraph>
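As a toy illustration of the unigram independence assumption, the following sketch estimates maximum-likelihood unigram probabilities from a made-up corpus and multiplies them to obtain P(W); the corpus and the query are placeholders, not data from the paper.

```python
# Toy illustration of P(W) = prod_i P(w_i) under a unigram model.
from collections import Counter
from math import prod

corpus = "the sergeant answers questions about the sergeant".split()
counts = Counter(corpus)
total = sum(counts.values())

def p_unigram(word):
    # maximum-likelihood unigram probability estimated from the tiny corpus
    return counts[word] / total

query = ["the", "sergeant"]
p_W = prod(p_unigram(w) for w in query)   # P(W) under the unigram assumption
print(p_W)
```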
      <Paragraph position="7"> In this paper we adapt an approach suggested by Lavrenko (Lavrenko, 2004). He assumed that all the word dependencies are defined by a vector of possibly unknown parameters of the language model. Using de Finetti's representation theorem and kernel-based probability estimation, he derived the following estimate for the query language model:</Paragraph>
      <Paragraph position="8"> P(w|q_1...q_m) = ( Σ_{s∈S} π_s(w) Π_{i=1..m} π_s(q_i) ) / ( Σ_{s∈S} Π_{i=1..m} π_s(q_i) )   (2)</Paragraph>
      <Paragraph position="9"> Here we sum over all training strings s ∈ S, where S is the set of training strings. π_s(w) is the probability of observing word w in the string s, which can be estimated directly from the training data. Generally the unigram maximum likelihood estimator is used with some smoothing factor:</Paragraph>
      <Paragraph position="10"> π_s(w) = λ_π #(w,s)/|s| + (1 − λ_π) Σ_{s∈S} #(w,s) / Σ_{s∈S} |s|   (3)</Paragraph>
      <Paragraph position="11"> where #(w,s) is the number of times word w appears in string s, |s| is the length of the string s, we sum over all training strings s ∈ S, and the constant λ_π is a tunable parameter that can be determined from the training data.</Paragraph>
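A small sketch of how equations 2 and 3 could be computed: a smoothed unigram estimator π_s(w) for each training string and the kernel-style mixture estimate of the question language model. The tiny training set and the smoothing value are placeholders introduced for this example.

```python
# Sketch of equations 2 and 3: smoothed per-string unigram estimators pi_s(w)
# and the mixture estimate of P(w | q_1...q_m) over training strings.
from collections import Counter
from math import prod

strings = ["what is your name", "who built you", "what time is it"]
tokens = [s.split() for s in strings]
counts = [Counter(t) for t in tokens]
bg = Counter(w for t in tokens for w in t)   # background counts over all strings
bg_total = sum(bg.values())
lam = 0.8                                    # smoothing parameter lambda_pi (placeholder)

def pi(s_idx, w):
    # Equation 3: smoothed maximum-likelihood estimate of P(w) in string s
    c, n = counts[s_idx], len(tokens[s_idx])
    return lam * c[w] / n + (1 - lam) * bg[w] / bg_total

def p_w_given_q(w, query):
    # Equation 2: mixture over training strings, each weighted by how well it
    # explains the query terms
    weights = [prod(pi(i, q) for q in query) for i in range(len(strings))]
    z = sum(weights)
    return sum(weights[i] * pi(i, w) for i in range(len(strings))) / z

print(p_w_given_q("name", "what is your name".split()))
```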
      <Paragraph position="12"> We know all the possible answers, so the answer language model P(w|A) can be estimated from the data:</Paragraph>
      <Paragraph position="14"> P(w|A) = π_A(w)   (4)</Paragraph>
    </Section>
    <Section position="3" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.3 Ranking criteria
</SectionTitle>
      <Paragraph position="0"> To compare two language models we use the Kullback-Leibler divergence:</Paragraph>
      <Paragraph position="1"> D(p_Q || p_A) = Σ_w P(w|Q) log( P(w|Q) / P(w|A) )   (5)</Paragraph>
      <Paragraph position="2"> which can be interpreted as the relative entropy between the two distributions. Note that the Kullback-Leibler divergence is a dissimilarity measure; we use −D(p_Q||p_A) to rank the answers.</Paragraph>
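A minimal sketch of ranking by −D(p_Q||p_A) as in equation 5, assuming both models are dictionaries over the same vocabulary and the answer model assigns no zero probabilities to words the question model covers:

```python
# Ranking answers by negative Kullback-Leibler divergence (equation 5).
from math import log

def neg_kl(p_q, p_a):
    # -D(p_Q || p_A) = - sum_w P(w|Q) * log( P(w|Q) / P(w|A) )
    return -sum(p * log(p / p_a[w]) for w, p in p_q.items() if p > 0)

def rank_answers(p_q, answer_models):
    # Higher (less negative) score means a more similar, better-ranked answer.
    return sorted(answer_models, key=lambda a: neg_kl(p_q, answer_models[a]),
                  reverse=True)
```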
      <Paragraph position="3"> So far we have assumed that both questions and answers use the same vocabulary and have the same a priori language models. Clearly, this is not the case. For example, consider the exchange "what happened here?" answered by "well, ma'am, someone released the animals this morning." While the answer is likely to be very appropriate to the question, there is no word overlap between the two sentences. This is an example of what is known in information retrieval as the vocabulary mismatch between the query and the documents.</Paragraph>
      <Paragraph position="4"> In a typical retrieval scenario a query is assumed to look like a part of a document. We cannot make the same assumption about questions because of the rules of the language: e.g., "what", "where", and "why" are likely to appear much more often in questions than in answers. Additionally, a typical document is much larger than any of our answers and has a higher probability of sharing words with the query. Finally, a typical retrieval scenario is completely context-free, and a user is encouraged to specify her information need as accurately as possible. In a dialog, a portion of the information is assumed to be well known to the participants and remains unverbalized, which sometimes leads to very brief questions and answers.</Paragraph>
      <Paragraph position="5"> We believe this vocabulary mismatch is so significant that we view the participants as speaking two different "languages": a language of questions and a language of answers. We model the problem as a cross-lingual information retrieval task, where one has a query in one language and wishes to retrieve documents in another. There are two ways to solve it: we can translate the answers into the question language by building a representation of each answer in the question vocabulary, or we can build question representations in the answer language.</Paragraph>
    </Section>
    <Section position="4" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.4 Question domain
</SectionTitle>
      <Paragraph position="0"> We create an answer representation in the question vocabulary by merging together all the training questions that are associated with the answer into one string: a pseudo-answer. We use equations 5, 2, 3, and 4 to compare and rank the pseudo-answers. Note that in equation 2 s iterates over the set of all pseudo-answers.</Paragraph>
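A sketch of the pseudo-answer construction described above; the data structures and example pairs are invented for illustration.

```python
# Building pseudo-answers: all training questions linked to the same answer are
# concatenated into one string, which is then treated as a "document" in the
# question vocabulary.
from collections import defaultdict

def build_pseudo_answers(training_pairs):
    """training_pairs: iterable of (question_text, answer_id)."""
    merged = defaultdict(list)
    for question, answer_id in training_pairs:
        merged[answer_id].append(question)
    return {answer_id: " ".join(qs) for answer_id, qs in merged.items()}

pairs = [("what is your name", "identity"),
         ("who are you", "identity"),
         ("what time is it", "misc")]
print(build_pseudo_answers(pairs)["identity"])
```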
    </Section>
    <Section position="5" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.5 Answer domain
</SectionTitle>
      <Paragraph position="0"> Let us look at the question language model P(w|Q) again, but now we take into account that w and Q come from different vocabularies and have potentially different distributions:</Paragraph>
      <Paragraph position="1"> P(w|q_1...q_m) = ( Σ_s α_s(w) Π_{i=1..m} π_s(q_i) ) / ( Σ_s Π_{i=1..m} π_s(q_i) )</Paragraph>
      <Paragraph position="2"> Here s iterates over the training set of question-answer pairs {Q_s, A_s}, and α_s(w) is the experimental probability distribution on the answer vocabulary, given by an expression similar to equation 3:</Paragraph>
      <Paragraph position="3"> α_s(w) = λ_α #(w,A_s)/|A_s| + (1 − λ_α) Σ_s #(w,A_s) / Σ_s |A_s|</Paragraph>
      <Paragraph position="4"> and the answer language model P(w|A) can be estimated from the data as</Paragraph>
      <Paragraph position="5"> P(w|A) = α_A(w)</Paragraph>
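A sketch of the answer-domain estimate under our reading of the text above: the question model is built directly in the answer vocabulary by mixing the answer-side distributions α_s(w) of the training pairs {Q_s, A_s}, weighted by how well the question side of each pair explains the query. That α_s is a smoothed unigram estimator of A_s with its own constant λ_α is an assumption, as are the example pairs.

```python
# Sketch of the cross-lingual (answer-domain) question language model.
from collections import Counter
from math import prod

pairs = [("what is your name", "my name is sgt blackwell"),
         ("who built you", "i was built at the university")]
q_tok = [q.split() for q, _ in pairs]
a_tok = [a.split() for _, a in pairs]
q_counts = [Counter(t) for t in q_tok]
a_counts = [Counter(t) for t in a_tok]
q_bg = Counter(w for t in q_tok for w in t)
a_bg = Counter(w for t in a_tok for w in t)
q_total, a_total = sum(q_bg.values()), sum(a_bg.values())
lam_pi, lam_alpha = 0.8, 0.8          # tunable smoothing parameters (placeholders)

def pi(i, w):
    # question-side smoothed estimator, as in equation 3
    return lam_pi * q_counts[i][w] / len(q_tok[i]) + (1 - lam_pi) * q_bg[w] / q_total

def alpha(i, w):
    # answer-side smoothed estimator (assumed analogous form)
    return lam_alpha * a_counts[i][w] / len(a_tok[i]) + (1 - lam_alpha) * a_bg[w] / a_total

def p_w_given_q_answer_vocab(w, query):
    weights = [prod(pi(i, q) for q in query) for i in range(len(pairs))]
    z = sum(weights)
    return sum(weights[i] * alpha(i, w) for i in range(len(pairs))) / z

print(p_w_given_q_answer_vocab("blackwell", "what is your name".split()))
```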
    </Section>
  </Section>
  <Section position="6" start_page="20" end_page="21" type="metho">
    <SectionTitle>
4 Algorithm comparison
</SectionTitle>
    <Paragraph position="0"> We have a collection of questions for SGT Blackwell, each linked to a set of appropriate responses.</Paragraph>
    <Paragraph position="1"> Our script writer defined the first question or two for each answer. We expanded the set by (a) paraphrasing the initial questions and (b) collecting questions from users by simulating the final system in a Wizard of Oz (WOZ) study. There are 1,261 questions in the collection linked to 72 answers (57 content answers, 13 off-topic responses, and 2 command classes; see Section 2). For this study we considered all of our off-topic responses equally appropriate to an off-topic question, so we collapsed the corresponding responses into one class. Thus we have 60 response classes.</Paragraph>
    <Paragraph position="2"> We divided our collection of questions into training and testing subsets following a 10-fold cross-validation scheme. The SVM system was trained to classify test questions into one of the 60 classes.</Paragraph>
    <Paragraph position="3"> Both retrieval techniques produce a ranked list of candidate answers ordered by the −D(p_Q||p_A) score. We only select the answers whose scores exceed a given threshold, −D(p_Q||p_A) &gt; t. If the resulting answer set is empty, we classify the question as off-topic, i.e., we set the candidate answer set to contain an off-topic response. We determine the language model smoothing parameters (the λs) and the threshold t on the training data.</Paragraph>
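A sketch of the thresholded selection rule just described; the threshold value and the off-topic marker are illustrative placeholders.

```python
# Keep candidates whose score -D(p_Q||p_A) exceeds a threshold t; fall back to
# an off-topic response when no candidate survives.
OFF_TOPIC = "off-topic"

def select_candidates(scored_answers, threshold):
    """scored_answers: list of (answer_id, score) with score = -D(p_Q||p_A)."""
    kept = [(a, s) for a, s in scored_answers if s > threshold]
    if not kept:
        return [(OFF_TOPIC, None)]
    return sorted(kept, key=lambda x: x[1], reverse=True)

print(select_candidates([("a1", -2.3), ("a2", -0.7)], threshold=-1.0))
```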
    <Paragraph position="4"> We consider two statistics when measuring the performance of the classification. First, we measure its accuracy. For each test question, the first response returned by the system (the class from the SVM system, or the top-ranked candidate answer returned by either the LM or CLM method) is considered correct if there is a link between the question and the response. The accuracy is the proportion of correctly answered questions among all test questions.</Paragraph>
    <Paragraph position="5"> The second statistic is precision. Both the LM and CLM methods may return several candidate answers ranked by their scores. That way a user will get a different response if she repeats the question. For example, consider a scenario where the first response is incorrect. The user repeats her question and the system returns a correct response, creating the impression that the QA character simply did not hear the user correctly the first time. We want to measure the quality of the ranked list of candidate answers, i.e., the proportion of appropriate answers among all the candidate answers, but we should also prefer candidate sets that list all the correct answers before all the incorrect ones.</Paragraph>
    <Paragraph position="6"> A well-known IR technique is to compute average precision: for each position in the ranked list, compute the proportion of correct answers among all preceding answers, then average those values.</Paragraph>
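For concreteness, a sketch of the two statistics: top-answer accuracy and a standard non-interpolated average precision (precision averaged at the ranks of the correct answers), which may differ in minor details from the exact computation used in the paper. The example inputs are invented.

```python
# Top-answer accuracy and average precision over ranked candidate lists.
def accuracy(ranked_lists, relevant_sets):
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant_sets)
               if ranked and ranked[0] in rel)
    return hits / len(ranked_lists)

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, answer in enumerate(ranked, start=1):
        if answer in relevant:
            hits += 1
            total += hits / i          # precision at this correct answer's rank
    return total / hits if hits else 0.0

print(average_precision(["a1", "a3", "a2"], {"a1", "a2"}))   # ~0.833
```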
    <Paragraph position="7"> Table 1 shows the accuracy and average precision numbers for the three answer selection methods on the SGT Blackwell data set. We observe a significant improvement in accuracy for the retrieval methods over the SVM technique. The differences shown are statistically significant by a t-test with the cutoff set to 5% (p &lt; 0.05).</Paragraph>
    <Paragraph position="8"> We repeated our experiments on the QA characters we are developing for another project. There we have 7 different characters with varying numbers of responses. The primary difference from the SGT Blackwell data is that in the new scenario each question is assigned to one and only one answer. Table 2 shows the accuracy numbers for the answer selection techniques on those data sets. These performance numbers are generally lower than the corresponding numbers on the SGT Blackwell collection. We have not yet collected as many training questions as for SGT Blackwell.</Paragraph>
    <Paragraph position="9"> We observe that the retrieval approaches are more successful for problems with more answer classes and more training data. The table shows the percent improvement in classification accuracy for each LM-based approach over the SVM baseline.</Paragraph>
    <Paragraph position="10"> The asterisks indicate statistical significance using a t-test with the cutoff set to 5% (p &lt; 0.05).</Paragraph>
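A sketch of how such a significance test could be run, assuming per-fold accuracy values and a paired t-test (the paper does not specify the exact test variant); the score arrays are placeholders.

```python
# Paired t-test over per-fold accuracies of two methods, with a 5% cutoff.
from scipy import stats

svm_scores = [0.52, 0.55, 0.51, 0.54, 0.53]   # e.g., per-fold accuracies
clm_scores = [0.60, 0.63, 0.59, 0.64, 0.62]

t_stat, p_value = stats.ttest_rel(clm_scores, svm_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```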
  </Section>
  <Section position="7" start_page="21" end_page="22" type="metho">
    <SectionTitle>
5 Effect of ASR
</SectionTitle>
    <Paragraph position="0"> In the second set of experiments for this paper we studied how robust the CLM answer selection technique in the SGT Blackwell system is to the disfluencies of normal conversational speech and to speech recognition errors. We conducted a user study in which people interviewed SGT Blackwell, and we analyzed the results. Because the original system was meant for one of three demo "reporters" to ask SGT Blackwell questions, specialized acoustic models were used to ensure the highest accuracy for these three (male) speakers. Consequently, for other speakers (especially female speakers), the error rate was much higher than for a standard recognizer. This allowed us to assess the effect of a range of speech recognition error rates on classifier performance.</Paragraph>
    <Paragraph position="1"> For this experiment, we recruited 20 participants (14 male, 6 female, ages 20 to 62) from our organization who were not members of this project. All participants spoke English fluently; their native languages included English, Hindi, and Chinese.</Paragraph>
    <Paragraph position="2"> After filling out a consent form, participants were "introduced" to SGT Blackwell and shown the proper technique for asking him questions (i.e., when and how to activate the microphone and how to adjust the microphone position).</Paragraph>
  </Section>
  <Section position="8" start_page="22" end_page="23" type="metho">
    <SectionTitle>
Tables 1 and 2: SVM, LM, and CLM results
</SectionTitle>
    <Paragraph position="0"> Table 1: Performance of the three answer selection methods on the SGT Blackwell data set (all numbers are percentages): SVM accuracy 53.13; LM accuracy 57.80, improvement over SVM 8.78, average precision 63.88; CLM accuracy 61.99, improvement over SVM 16.67, average precision 65.24.</Paragraph>
    <Paragraph position="1"> Table 2: The number of questions and the number of answers collected for each character, with the SVM, LM, and CLM accuracy and each LM-based method's improvement over the SVM baseline; the accuracy and improvement numbers are given in percentages.</Paragraph>
    <Paragraph position="2"> Next, the participants were given a scenario wherein they would act as a reporter about to interview SGT Blackwell. The participants were then given a list of 10 pre-designated questions to ask of SGT Blackwell. These questions were selected from the training data. They were then instructed to take a few minutes to write down an additional five questions to ask SGT Blackwell. Finally, they were informed that after asking the fifteen written-down questions, they would have to spontaneously generate and ask five additional questions, for a total of 20 questions altogether. Once the participants had written down their fifteen questions, they began the interview with SGT Blackwell. Upon completion of the interview, the participants were asked a short series of survey questions by the experimenter about SGT Blackwell and the interview. Finally, participants were given an explanation of the study and then released. Voice recordings were made of each interview, as well as the raw data collected from the answer selection module and the ASR. This is our first set of question-answer pairs; we call it the ASR-QA set.</Paragraph>
    <Paragraph position="3"> The voice recordings were later transcribed. We ran the transcriptions through the CLM answer selection module to generate answers for each question. This produced question-answer pairs reflecting how the system would have responded to the participants' questions if the speech recognition had been perfect. This is our second set of question-answer pairs, the TRS-QA set. Appendix B shows a sample dialog between a participant and SGT Blackwell.</Paragraph>
    <Paragraph position="4"> Next we used three human raters to judge the appropriateness of both sets. Using a scale of 1-6 (see Appendix A), each rater judged the appropriateness of SGT Blackwell's answers to the questions posed by the participants. We evaluated the agreement between raters by computing Cronbach's alpha score, which measures consistency in the data. The alpha score is 0.929 for TRS-QA and 0.916 for ASR-QA, which indicates high consistency among the raters.</Paragraph>
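A sketch of Cronbach's alpha with raters playing the role of "items" and question-answer pairs as observations; the ratings matrix below is invented for the example.

```python
# Cronbach's alpha for inter-rater consistency.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array, rows = question-answer pairs, columns = raters."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of raters
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

ratings = [[5, 5, 6], [4, 4, 4], [2, 3, 2], [6, 6, 5]]
print(cronbach_alpha(ratings))
```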
    <Paragraph position="5"> The average appropriateness score is 4.83 for TRS-QA and 4.56 for ASR-QA. The difference in the scores is statistically significant according to a t-test with the cutoff set to 5%. It may indicate that ASR quality has a significant impact on answer selection.</Paragraph>
    <Paragraph position="6"> We computed the Word Error Rate (WER) between the transcribed question text and the ASR output. Thus each question-answer pair in the ASR-QA and TRS-QA data sets has a WER score assigned to it. The average WER score is 37.33%.</Paragraph>
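A sketch of the standard word-level WER computation (word-level edit distance divided by the reference length); the example sentences are made up.

```python
# Word Error Rate: substitutions, insertions, and deletions between the
# reference transcription and the ASR output, divided by the reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("what is your name", "what is her name"))   # 0.25
```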
    <Paragraph position="7"> We analyzed the sensitivity of the appropriateness score to input errors. Figures 1a and 1b show plots of the cumulative average appropriateness score (CAA) as a function of WER: for each WER value t we average the appropriateness scores of all question-answer pairs with a WER score less than or equal to t.</Paragraph>
    <Paragraph position="8"> [Figure 1: Cumulative average appropriateness score of the (a) pre-designated and (b) user-designated question-answer pairs as a function of the ASR's output word error rate. Scores are shown for TRS-QA (dotted black line) and ASR-QA (solid black line), together with the percentage of question-answer pairs with a WER score below a given value ("# of QA") as a gray line with the corresponding values on the right Y axis.]</Paragraph>
    <Paragraph position="9"> CAA(t) = ( Σ_{p : WER(p) ≤ t} A(p) ) / |{p : WER(p) ≤ t}|</Paragraph>
    <Paragraph position="10"> where p is a question-answer pair, A(p) is the appropriateness score for p, and WER(p) is the WER score for p. CAA(t) is the expected value of the appropriateness score if the ASR WER were at most t.</Paragraph>
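A sketch of computing the CAA(t) curve from (WER, appropriateness) pairs; the data points and thresholds are placeholders.

```python
# Cumulative average appropriateness: for each WER threshold t, average the
# appropriateness scores of all pairs whose WER is at most t.
def caa_curve(pairs, thresholds):
    """pairs: list of (wer, appropriateness); returns {t: CAA(t)}."""
    curve = {}
    for t in thresholds:
        scores = [a for wer, a in pairs if wer <= t]
        curve[t] = sum(scores) / len(scores) if scores else None
    return curve

pairs = [(0.0, 5.6), (0.2, 5.1), (0.45, 4.9), (0.8, 3.2)]
print(caa_curve(pairs, [0.2, 0.5, 1.0]))
```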
    <Paragraph position="11"> Both figures show the CAA values for TRS-QA (dotted black line) and ASR-QA (solid black line). Both figures also show the percentage of question-answer pairs with a WER score below a given value, i.e., the cumulative distribution function (CDF) of the WER, as a gray line with the corresponding values depicted on the right Y axis.</Paragraph>
    <Paragraph position="12"> Figure 1a shows these plots for the pre-designated questions. The values of CAA for TRS-QA and ASR-QA are approximately the same between 0 and 60% WER. CAA for ASR-QA decreases for WER above 60%: as the input becomes more and more garbled, it becomes more difficult for the CLM module to select an appropriate answer. We confirmed this observation by calculating t-test scores at each WER value: the differences between the CAA(t) scores are statistically significant for t &gt; 60%. This indicates that until WER exceeds 60% there is no noticeable effect on the quality of answer selection, which means that our answer selection technique is robust to the quality of the input.</Paragraph>
    <Paragraph position="13"> Figure 1b shows the same plots for the user-designated questions. Here the system has to deal with questions it has never seen before. CAA values decrease for both TRS-QA and ASR-QA as WER increases. Both the ASR and the CLM were trained on the same data set, so out-of-vocabulary words that affect ASR performance affect CLM performance as well.</Paragraph>
  </Section>
</Paper>