File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-3002_intro.xml
Size: 11,010 bytes
Last Modified: 2025-10-06 14:04:09
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3002"> <Title>WoZ Simulation of Interactive Question Answering</Title> <Section position="2" start_page="0" end_page="11" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Open-domain question answering (QA) technologies allow users to ask a question in natural language and obtain the answer itself rather than a list of documents that contain the answer (Voorhees et al. 2000). While early research in this field concentrated on answering factoid questions one by one in isolation, recent research appears to be moving in several new directions. Using QA systems in an interactive environment is one of those directions. At TREC 2001, a context task was attempted in order to evaluate systems' ability to track context in support of interactive user sessions (Voorhees 2001). Since TREC 2004, questions in the task have been given as collections of questions related to common topics, rather than as isolated, independent questions (Voorhees 2004). It is important for researchers to recognize that asking questions in such a cohesive manner is natural in QA, although the task itself is not intended to evaluate context processing abilities: because the common topic is given, sophisticated context processing is not needed.</Paragraph> <Paragraph position="1"> Such a direction has also been envisaged in a research roadmap, in which QA systems become more sophisticated and can be used by professional reporters and information analysts (Burger et al. 2001). At some stage of that sophistication, a young reporter writing an article on a specific topic will be able to translate the main issue into a set of simpler questions and pose those questions to the QA system. Another research trend in interactive QA has been observed in several projects that are part of the ARDA AQUAINT program. 
These studies concern scenario-based QA, the aim of which is to handle non-factoid, explanatory, analytical questions posed by users with extensive background knowledge. Issues include managing clarification dialogues to disambiguate users' intentions and interests, and decomposing questions to obtain simpler and more tractable ones (Small et al. 2003) (Hickl et al. 2004).</Paragraph> <Paragraph position="2"> The nature of the questions posed by users and the patterns of interaction vary depending on who uses a QA system and on the environment in which it is used (Liddy 2002). The user may be a young reporter, a trained analyst, or an ordinary person without special training. Questions can be answered by simple names and facts, such as those handled in early TREC conferences (Chai et al. 2004), or by short retrieved passages, as in some systems developed in the AQUAINT program (Small et al. 2003). The situation in which a QA system is supposed to be used is an important factor in system design, and the evaluation must take that factor into account.</Paragraph> <Paragraph position="3"> QACIAD (Question Answering Challenge for Information Access Dialogue) is an objective and quantitative evaluation framework for measuring the ability of QA systems used interactively to participate in dialogues for accessing information (Kato et al. 2004a) (Kato et al. 2006). It assumes a situation in which users interactively collect information using a QA system in order to write a report on a given topic, and it evaluates, among other things, the capabilities needed under such circumstances, i.e. 
proper interpretation of questions under a given dialogue context; in other words, context processing capabilities such as anaphora resolution and ellipsis handling.</Paragraph> <Paragraph position="4"> We are interested in examining the assumptions made by QACIAD, and conducted an experiment in which dialogues under the situation QACIAD assumes were simulated using the WoZ (Wizard of Oz) technique (Fraser et al. 1991) and then analyzed. In WoZ simulation, which is frequently used to collect dialogue data for designing speech dialogue systems, the dialogues that would become possible once a system has been developed are simulated by a human, a WoZ, who plays the role of the system, and a subject, who plays the role of its user and is not informed that a human is behaving as the system. By analyzing the characteristics of the language expressions and pragmatic devices used by users, we examine whether QACIAD is a proper framework for evaluating QA systems used in the situation it assumes. We also examine what functions such QA systems will need by analyzing the intelligent behavior of the WoZs.</Paragraph> <Paragraph position="5"> 2 QACIAD and the previous study QACIAD was proposed by Kato et al. as a task of QAC, which is a series of challenges for evaluating QA technologies in Japanese (Kato et al. 2004b). QAC covers factoid questions in the form of complete sentences with interrogative pronouns. All answers to those questions should be names. Here, &quot;names&quot; means not only names of proper items, including date expressions and monetary values (called &quot;named entities&quot;), but also common names such as those of species and body parts. Although the syntactic range of the names corresponds approximately to compound nouns, some of them, such as the titles of novels and movies, deviate from that range. The underlying document set consists of newspaper articles. 
Given various open-domain questions, systems are requested to extract exact answers rather than text snippets that contain the answers, and to return each answer along with the newspaper article from which it was extracted. The article should guarantee the legitimacy of the answer to the given question.</Paragraph> <Paragraph position="6"> In QACIAD, which assumes interactive use of QA systems, systems are requested to answer series of related questions. A series of questions and the answers to those questions comprise an information access dialogue. All questions except the first one of each series contain anaphoric expressions, which may be zero pronouns, while each question is within the range of those handled in QAC. Although the systems are supposed to participate in dialogue interactively, the interaction is only simulated; systems answer a series of questions in batch mode. Such a simulation may neglect the inherent dynamics of dialogue, as the dialogue evolution is fixed beforehand and therefore not something that the systems can control. It is, however, a practical compromise for an objective evaluation. Since all participants must answer the same set of questions in the same context, the results for the same test set are comparable with each other, and the test sets of the task are reusable by pooling the correct answers.</Paragraph> <Paragraph position="7"> Systems are requested to return one list consisting of all and only correct answers. 
Since the number of correct answers differs for each question and is not given, a modified F measure, which takes into account both precision and recall, is used for the evaluation.</Paragraph> <Paragraph position="8"> Two types of series were included in QACIAD, corresponding to two extremes of information access dialogue: a gathering type, in which the user has a concrete objective such as writing a report and summary on a specific topic and asks a system a series of questions related to that topic; and a browsing type, in which the user does not have any fixed topic of interest. Although QACIAD assumes that users are interactively collecting information on a given topic and gathering-type dialogue mainly occurs under such circumstances, browsing-type series are included in the task based on the observation that, even when focusing on information access dialogue for writing reports, systems must handle the focus shifts that appear in browsing-type series. The systems must identify the type of series, as it is not given, although they need not identify changes of series, as the boundary is given. The systems must not look ahead to questions following the one currently being handled. This restriction reflects the fact that QACIAD is a simulation of interactive use of QA systems in dialogues. Examples of QACIAD series are shown in Figure 1. The original questions are in Japanese and the figure shows their direct translations.</Paragraph> <Paragraph position="9"> The evaluation of QA technologies based on QACIAD was conducted twice, in QAC2 and QAC3, which are part of the NTCIR-4 and NTCIR-5 workshops¹, respectively (Kato et al. 2004b) (Kato et al. 2005). It was one of the three tasks of QAC2 and the only task of QAC3. On each occasion, several novel techniques were proposed for interactive QA.</Paragraph> <Paragraph position="10"> Kato et al. 
conducted an experiment to confirm the reality and appropriateness of QACIAD, in which subjects were presented with various topics and asked to write down series of questions in Japanese to elicit information for a report on each topic (Kato et al. 2004a) (Kato et al. 2006). The report was supposed to describe facts on a given topic, rather than state opinions or prospects on the topic. The questions were restricted to wh-type questions, and a natural series of questions that may contain anaphoric expressions and ellipses was constructed. Analysis of the question series collected in this manner showed that 58% to 75% of questions for writing reports could be answered by values or names; that a wide range of reference expressions is observed in questions in such a situation; and that sequences of questions are sometimes very complicated and include subdialogues and focus shifts. From these observations they confirmed the reality and appropriateness of QACIAD and validated the need for browsing-type series in the task. ¹ The NTCIR Workshop is a series of evaluation workshops designed to enhance research in information access technologies including information retrieval, QA, text summarization, extraction, and so on (NTCIR 2006).</Paragraph> <Paragraph position="11"> Series 30002: What genre does the &quot;Harry Potter&quot; series belong to? Who is the author? Who are the main characters in the series? When was the first book published? What was its title? How many books had been published by 2001? How many languages has it been translated into? How many copies have been sold in Japan?</Paragraph> <Paragraph position="12"> One of the objectives of our experiment is to confirm these results in a more realistic situation. The previous experimental setting, in which subjects had to write down their questions without getting the answers, is far from the actual situations in which QA systems are used. 
Using WoZ simulation, we examine whether this difference affected the results. Moreover, by observing the behavior of the WoZs, we investigate the capabilities and functions needed for QA systems used in such a situation.</Paragraph> </Section> </Paper>