<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2008"> <Title>Towards Conversational QA: Automatic Identification of Problematic Situations and User Intent</Title> <Section position="5" start_page="57" end_page="59" type="metho"> <SectionTitle> 3 User Studies </SectionTitle> <Paragraph position="0"> We conducted a user study to collect data concerning user behavior in a basic interactive QA setting. We are particularly interested in how users respond to different system performance and in what this implies for identifying problematic situations and user intent. As a starting point, we characterize system performance as either problematic, which indicates that the answer has some problem, or error-free, which indicates that the answer is correct.</Paragraph> <Paragraph position="1"> In this section, we first describe the methodology and the system used in this effort and then discuss the observed user behavior and its relation to problematic situations and user intent.</Paragraph> <Section position="1" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 3.1 Methodology and System </SectionTitle> <Paragraph position="0"> The system used in our experiments has a user interface that takes a natural language question and presents an answer passage. Currently, our interface presents only the top retrieved result to the user. This simplification, on the one hand, helps us focus on investigating user responses to different system performance and, on the other hand, represents situations where presenting a list of potential answers may not be practical (e.g., on a PDA or over a telephone line).</Paragraph> <Paragraph position="1"> We implemented a Wizard-of-Oz (WOZ) mechanism in the interaction loop to control and simulate problematic situations. Users were not aware of the existence of this human wizard and were led to believe they were interacting with a real QA system. This controlled setting allowed us to focus on the interaction aspect rather than the information retrieval or answer extraction aspects of question answering. More specifically, during the interaction, after each question was issued, a random number generator was used to decide whether a problematic situation should be introduced. If the number indicated no, the wizard would retrieve a passage from a database of correct question/answer pairs. Note that in our experiments we used specific task scenarios (described later), so it was possible to anticipate user information needs and create this database. If the number indicated that a problematic situation should be introduced, then the Lemur retrieval engine was used on the AQUAINT collection to retrieve the answer. Our assumption is that the AQUAINT data are not likely to provide an exact answer given our specific scenarios, but they can provide the passage most related to the question. The random number generator was used to control the ratio between the occurrence of problematic and error-free situations. In our initial investigation, since we are interested in observing user behavior in problematic situations, we set the ratio to 50/50. In our future work, we will vary this ratio (e.g., 70/30) to reflect the performance of state-of-the-art factoid QA and investigate the implications of this ratio for automated performance assessment.</Paragraph>
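<Paragraph position="2"> As a minimal sketch of this control logic (the curated answer database, the retrieval placeholder, and all names below are illustrative assumptions rather than the actual wizard setup, which relied on a human wizard), the decision made for each question can be summarized as follows:

import random

PROBLEMATIC_RATIO = 0.5  # 50/50 in this study; e.g., 0.3 for a 70/30 split

# Stand-in for the database of anticipated questions and known-correct passages.
CURATED_QA = {
    "when was tom cruise born?": "Tom Cruise was born on July 3, 1962, in Syracuse, New York.",
}

def retrieve_from_aquaint(question: str) -> str:
    """Placeholder for Lemur retrieval over the AQUAINT collection."""
    return "[AQUAINT passage most related to: %s]" % question

def wizard_answer(question: str) -> tuple[str, bool]:
    """Return (passage, is_problematic) for one user question."""
    if PROBLEMATIC_RATIO > random.random():
        # Problematic situation: fall back to open retrieval, which is
        # likely to return a related but inexact passage.
        return retrieve_from_aquaint(question), True
    # Error-free situation: serve the curated correct passage.
    return CURATED_QA.get(question.lower(), retrieve_from_aquaint(question)), False
</Paragraph>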
</Section> <Section position="2" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 3.2 Experiments </SectionTitle> <Paragraph position="0"> Eleven users participated in our study. Each user was asked to interact with our system to complete information-seeking tasks related to four specific scenarios: the 2004 presidential debates, Tom Cruise, Hawaii, and Pompeii. The experimental scenarios were further divided into two types: structured and unstructured. In the structured task scenarios (for the topics Tom Cruise and Pompeii), users had to fill in blanks on a diagram pertaining to the given topic. The diagram was used to avoid influencing the way users formulated the relevant questions. Because users had to find certain information, they were constrained in the range of questions they could ask, but not in the way they asked those questions. The task was completed when all of the blanks on the diagram were filled. The structured scenarios were designed to mimic real information-seeking practice, in which users have genuine motivation to find specific information related to their information goals. In the unstructured scenarios (for the topics of the 2004 presidential debates and Hawaii), users were given a general topic to investigate but were not required to find specific information. This gave users the ability to ask a much wider range of questions than in the structured scenarios. Users were generally in an exploration mode when performing these unstructured tasks. They were not motivated to find specific information and were content with any information provided by the system. In our view, the unstructured scenarios are less representative of true information-seeking situations.</Paragraph> </Section> <Section position="3" start_page="58" end_page="59" type="sub_section"> <SectionTitle> 3.3 Observations and Analysis </SectionTitle> <Paragraph position="0"> From our studies, we collected a total of 44 interaction sessions with 456 questions. Figure 1 shows a fragment of an example interaction related to Tom Cruise. In this example, both the situation labels applied to answers (Problematic or Error-Free) and the user intent labels (described later) applied to questions are annotated.</Paragraph> <Paragraph position="1"> There are several observations from this data.</Paragraph> <Paragraph position="2"> First, questions formed during interactive QA tend to be self-contained and free of definite noun phrases, pronouns, or ellipsis. Only one question in the entire data set has a pronoun (i.e., What are the best movies with Tom Cruise in them?).</Paragraph> <Paragraph position="3"> Even in this case, the pronoun them did not refer to any entities that occurred previously in the interaction.</Paragraph> </Section> </Section> <Section position="6" start_page="59" end_page="60" type="metho"> <SectionTitle> # Question/Answer Annotation </SectionTitle> <Paragraph position="0"> [Figure 1: a fragment of an example interaction from the Tom Cruise scenario, shown as a table with columns #, Question/Answer, and Annotation; each retrieved answer passage is annotated as Error-Free or Problematic.] </Paragraph>
<Paragraph position="1"> One possible explanation is how the answers are presented. Unlike specific answer entities, the answer passages provided by our system do not support the natural use of referring expressions in follow-up questions. Another possible explanation could be that in an interactive environment users seem to be more aware of the potential limitations of a computer system and thus tend to specify self-contained questions in the hope of reducing the system's inference load. The second observation concerns user behavior in response to different system performance (i.e., problematic or error-free situations). We were hoping to see the different strategies users might apply to deal with problematic situations. However, based on the data, we found that when a problem occurred, users either rephrased their questions (i.e., expressed the same question in a different way) or gave up on the question and went on to specify a new question. (Here we use Rephrase and New to denote these two kinds of behavior.) We have not observed any sub-dialogs initiated by the user to clarify a previous question or answer.</Paragraph> <Paragraph position="2"> One possible explanation is that the current investigation was conducted in a basic interactive mode where the system was only capable of providing some sort of answer. This may limit users' expectations of the kinds of questions that can be handled by the system. Our assumption is that, once QA systems become more intelligent and able to carry on a conversation, different types of questions (i.e., other than rephrased or new questions) will be observed. This hypothesis certainly needs to be validated in a conversational setting.</Paragraph> <Paragraph position="3"> The third observation is that rephrased questions seem to correlate strongly, although not perfectly, with problematic situations. New questions cannot distinguish a problematic situation from an error-free situation. Table 1 shows the statistics from our data for the different combinations of New/Rephrase questions and performance situations. What is interesting is that these different combinations can reflect different types of user intent behind the questions.
More specifically, given a question, four types of user intent can be captured with respect to the context (e.g., the previous question and answer):</Paragraph> <Paragraph position="4"> Continue indicates that the user is satisfied with the previous answer and now moves on to this new question.</Paragraph> <Paragraph position="5"> Switch indicates that the user has given up on the previous question and now moves on to this new question.</Paragraph> <Paragraph position="6"> Re-try indicates that the user is not satisfied with the previous answer and now tries to get a better answer.</Paragraph> <Paragraph position="7"> Negotiate indicates that the user is not satisfied with the previous answer (although it appears to be correct from the system's point of view) and now tries to get a better answer for his/her own needs.</Paragraph> <Paragraph position="8"> Table 1 summarizes these different types of intent together with the corresponding number of occurrences from the unstructured scenarios, the structured scenarios, and the entire data set. (The last question from each interaction session is not included in these statistics because there is no follow-up question after it.) Since in the unstructured scenarios it was hard to anticipate users' questions and therefore to take the correct action in response to a problematic/error-free situation, the distribution of these two situations is much more skewed than the distribution for the structured scenarios. Also, as mentioned earlier, in the unstructured scenarios users lacked the motivation to pursue specific information, so the ratio between Switch and Re-try is much larger than that observed in the structured scenarios. Nevertheless, we did observe different user behavior in response to different situations. As discussed later in Section 5, identifying these fine-grained intents will allow QA systems to be more proactive in helping users find satisfactory answers.</Paragraph> </Section> <Section position="7" start_page="60" end_page="63" type="metho"> <SectionTitle> 4 Automatic Identification of Problematic Situations and User Intent </SectionTitle> <Paragraph position="0"> Given the discussion above, the next question is how to automatically identify problematic situations and user intent. We formulate this as two classification problems. (1) Automatic identification of problematic situations is to decide, given a question Q_{i+1} and the interaction context, whether the answer to the previous question Q_i is problematic. This is a binary classification problem. (2) Automatic identification of user intent is to identify the intent of Q_{i+1} given the interaction context. Because we only have very limited instances of Negotiate (see Table 1), we currently merge Negotiate with Re-try, since both represent a situation where a better answer is requested. Thus, this problem becomes a trinary classification problem.</Paragraph>
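<Paragraph position="1"> As a minimal sketch of this setup (scikit-learn's LogisticRegression is used here only as a stand-in for a maximum entropy classifier, and all names are illustrative; the feature vectors are those described in Section 4.1 below):

from sklearn.linear_model import LogisticRegression

# Task 1: binary classification of the situation behind the previous answer.
SITUATIONS = ["Error-Free", "Problematic"]

# Task 2: user intent, with Negotiate merged into Re-try/Negotiate
# because of its very limited number of instances.
INTENTS = ["Continue", "Switch", "Re-try/Negotiate"]

def merge_negotiate(label):
    """Collapse Re-try and Negotiate into the merged Re-try/Negotiate label."""
    return "Re-try/Negotiate" if label in ("Re-try", "Negotiate") else label

def train_classifier(feature_vectors, labels):
    """Train one classifier (situation or intent) on per-question feature vectors."""
    model = LogisticRegression(max_iter=1000)
    model.fit(feature_vectors, labels)
    return model
</Paragraph>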
<Paragraph position="2"> To build these classifiers, we identified a set of features, which are described next.</Paragraph> <Section position="1" start_page="60" end_page="61" type="sub_section"> <SectionTitle> 4.1 Features </SectionTitle> <Paragraph position="0"> Given a question Q_{i+1}, the following set of features is used:</Paragraph> <Paragraph position="1"> Target matching (TM): a binary feature indicating whether the target type of Q_{i+1} is the same as the target type of Q_i. Our data shows that the repetition of the target type may indicate a rephrase, which could signal that a problematic situation has just happened.</Paragraph> <Paragraph position="2"> Named entity matching (NEM): a binary feature indicating whether the named entities in Q_{i+1} match those in Q_i.</Paragraph> <Paragraph position="3"> Similarity between questions (SQ): a numeric feature measuring the similarity between Q_{i+1} and Q_i. Our assumption is that the higher the similarity is, the more likely the current question is a rephrase of the previous one.</Paragraph> <Paragraph position="4"> Similarity between content words of questions (SQC): this feature is similar to the previous feature (i.e., SQ), except that the similarity measurement is based on the content words excluding named entities. This prevents the similarity measurement from being dominated by the named entities.</Paragraph> <Paragraph position="5"> Similarity between the question and the answer (SA): this feature measures how closely the retrieved passage matches the question. Our assumption is that although a retrieved passage is the most relevant passage compared to others, it still may not contain the answer (e.g., when an answer does not even exist in the data collection).</Paragraph> <Paragraph position="6"> Similarity between the question and the answer based on the content words (SAC): this feature is essentially the same as the previous feature (SA), except that the similarity is calculated after named entities are removed from the questions and answers.</Paragraph> <Paragraph position="7"> Note that since our data is currently collected from simulation studies, we do not have a confidence score from the retrieval engine associated with every answer. In practice, the confidence score can be used as an additional feature.</Paragraph> <Paragraph position="8"> Since our focus is not on the similarity measurement itself but rather on the use of the measurement in the classification models, our current similarity measurement is based on a simple approach that measures commonality and difference between two objects, as proposed by Lin (1998). More specifically, the following equation is applied to measure the similarity between two chunks of text T_1 and T_2: sim(T_1, T_2) = \frac{2 \times \sum_{w \in T_1 \cap T_2} \log P(w)}{\sum_{w \in T_1} \log P(w) + \sum_{w \in T_2} \log P(w)}, where P(w) was calculated based on the data used in the previous TREC evaluations.</Paragraph>
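<Paragraph position="9"> As a minimal sketch of how these features could be computed (the tokenization, the named-entity and target-type inputs, and the unigram probabilities P(w) below are illustrative stand-ins, not the actual components used in our system), consider:

import math

def lin_similarity(tokens1, tokens2, unigram_prob):
    """Lin (1998)-style similarity over word overlap.

    unigram_prob maps a word w to P(w), e.g. estimated from TREC data;
    unseen words fall back to a small constant probability.
    """
    def info(words):
        return sum(math.log(unigram_prob.get(w, 1e-6)) for w in words)

    t1, t2 = set(tokens1), set(tokens2)
    denom = info(t1) + info(t2)
    return (2.0 * info(t1.intersection(t2)) / denom) if denom != 0 else 0.0

def extract_features(q_next, q_prev, a_prev, unigram_prob):
    """Build the TM/NEM/SQ/SQC/SA/SAC features for the current question q_next.

    q_next, q_prev, a_prev are assumed to be dicts with 'tokens',
    'content_tokens' (content words with named entities removed),
    'entities' (a set), and, for questions, 'target_type'.
    SA/SAC are assumed to compare the previous question with its retrieved passage.
    """
    return {
        "TM": float(q_next["target_type"] == q_prev["target_type"]),
        "NEM": float(q_next["entities"] == q_prev["entities"]),
        "SQ": lin_similarity(q_next["tokens"], q_prev["tokens"], unigram_prob),
        "SQC": lin_similarity(q_next["content_tokens"], q_prev["content_tokens"], unigram_prob),
        "SA": lin_similarity(q_prev["tokens"], a_prev["tokens"], unigram_prob),
        "SAC": lin_similarity(q_prev["content_tokens"], a_prev["content_tokens"], unigram_prob),
    }

For instance, a rephrased question would typically yield TM = 1 and high SQ/SQC, while a low SA can indicate that the retrieved passage does not actually contain the answer.</Paragraph>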
</Section> <Section position="2" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 4.2 Identification of Problematic Situations </SectionTitle> <Paragraph position="0"> To identify problematic situations, we experimented with three different classifiers, among them a Decision Tree model and a Maximum Entropy Model (MEM) from MALLET. A leave-one-out validation was applied, where one interaction session was used for testing and the remaining interaction sessions were used for training.</Paragraph> <Paragraph position="1"> Table 2 shows the performance of the three models, based on different combinations of features, in terms of classification accuracy. The baseline result is the performance achieved by simply assigning the most frequent class. For the unstructured scenarios, the performance of the classifiers is rather poor, which indicates that it is quite difficult to make any generalization based on the current feature sets when users are less motivated to find specific information. For the structured scenarios, the best performance for each model is highlighted in bold in Table 2. The Decision Tree model achieves the best performance of 77.8% in identifying problematic situations, which is more than 25% better than the baseline performance.</Paragraph> </Section> <Section position="3" start_page="61" end_page="63" type="sub_section"> <SectionTitle> 4.3 Identification of User Intent </SectionTitle> <Paragraph position="0"> To identify user intent, we formulate the problem as follows: given an observation feature vector f, where each element of the vector corresponds to a feature described earlier, the goal is to identify an intent c* from the set of intents I = {Continue, Switch, Re-try/Negotiate} that satisfies the following equation: c* = \arg\max_{c \in I} P(c | f). Our assumption is that the user intent for a question can potentially be influenced by the intent of the preceding question. For example, Switch is likely to follow Re-try. Therefore, we have implemented a Maximum Entropy Markov Model (MEMM) (McCallum et al., 2000) to take the sequence of interactions into account.</Paragraph> <Paragraph position="1"> Given a sequence of questions Q_1, ..., Q_n with corresponding feature vectors f_1, ..., f_n, the goal is to find the intent sequence c_1, ..., c_n that maximizes P(c_1, ..., c_n | f_1, ..., f_n). This can be calculated by a dynamic optimization procedure similar to the Viterbi algorithm in the Hidden Markov Model: \delta_t(c) = \max_{c'} [\delta_{t-1}(c') \times P(c | c', f_t)], where P(c | c', f_t) is estimated by the Maximum Entropy Model (a sketch of this decoding procedure is given below).</Paragraph> <Paragraph position="2"> Table 3 shows the best results of identifying user intent based on the Maximum Entropy Model and the MEMM using the leave-one-out approach. The results show that neither model worked well for the data collected from the unstructured scenarios (the baseline accuracy for intent identification there is 63.4%). For the structured scenarios, in terms of overall accuracy, both models performed significantly better than the baseline (49.3%). The MEMM worked only slightly better than the MEM. Given our limited data, it is not conclusive whether the transitions between questions help identify user intent in a basic interactive mode. However, we expect to see more influence from the transitions in fully conversational QA.</Paragraph>
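<Paragraph position="3"> As a minimal sketch of the MEMM decoding step described above (the local_prob interface, which returns P(c | c_prev, f) as estimated by a maximum entropy model, and the function name are illustrative assumptions, not the implementation used in our experiments):

def memm_viterbi(feature_seq, intents, local_prob):
    """Find the most likely intent sequence c_1..c_n for a sequence of questions.

    feature_seq: list of feature vectors f_1..f_n, one per question
    intents:     candidate labels, e.g. ["Continue", "Switch", "Re-try/Negotiate"]
    local_prob:  function (c, c_prev, f) returning P(c | c_prev, f), with
                 c_prev = None for the first question in a session
    """
    # delta[c] holds the best score of any label sequence ending in intent c;
    # back[t][c] remembers the predecessor label that achieved that score.
    delta = {c: local_prob(c, None, feature_seq[0]) for c in intents}
    back = [{}]
    for f in feature_seq[1:]:
        new_delta, pointers = {}, {}
        for c in intents:
            best_prev = max(intents, key=lambda cp: delta[cp] * local_prob(c, cp, f))
            new_delta[c] = delta[best_prev] * local_prob(c, best_prev, f)
            pointers[c] = best_prev
        delta, back = new_delta, back + [pointers]
    # Recover the best sequence by following the back-pointers from the end.
    last = max(intents, key=lambda c: delta[c])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path))
</Paragraph>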
<Paragraph position="4"> Automated identification of problematic situations and user intent has potential implications for the design of conversational QA systems. Identification of problematic situations can be considered a form of implicit feedback. The system can use this feedback to improve its answer retrieval performance and proactively adapt its strategies to cope with problematic situations. One might think that an alternative is to explicitly ask users for feedback. However, this explicit approach would defeat the purpose of intelligent conversational systems. Soliciting feedback after each question would not only frustrate users and lengthen the interaction, but also interrupt the flow of user thoughts and conversation. Therefore, our focus here is to investigate the more challenging end of implicit feedback. In practice, explicit and implicit feedback should be intelligently combined.</Paragraph> <Paragraph position="5"> For example, if the confidence in automatically identifying a problematic or error-free situation is low, then explicit feedback can perhaps be solicited.</Paragraph> <Paragraph position="6"> Automatic identification of user intent also has important implications for building intelligent conversational QA systems. For example, if Continue is identified during interaction, the system can automatically collect the question-answer pair for potential future use. If Switch is identified, the system may put aside the question that has not been correctly answered and proactively come back to it later, after more information is gathered. If Re-try is identified, the system may avoid repeating the same answer and may at the same time take the initiative to guide the user on how to rephrase the question. If Negotiate is identified, the system may want to investigate the user's particular needs, which may differ from the general needs. Overall, different strategies can be developed to address problematic situations and different intents. We will investigate these strategies in our future work.</Paragraph> <Paragraph position="7"> This paper reports our initial effort in investigating interactive QA from a conversational point of view. The current investigation has several simplifications. First, our work has focused on factoid questions, for which it is relatively easy to judge whether a situation is problematic or error-free. However, as discussed in earlier work (Small et al., 2003), it is sometimes very hard to judge the truthfulness of an answer, especially for analytical questions. Therefore, our future work will examine the new implications of problematic situations and user intent for analytical questions. Second, our current investigation is based on a basic interactive mode. As mentioned earlier, once QA systems become more intelligent and conversational, more varieties of user intent are anticipated. How to characterize and automatically identify more complex user intent under these different situations is another direction of our future work.</Paragraph> </Section> </Section> </Paper>