<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3007"> <Title>User-Centered Evaluation of Interactive Question Answering Systems</Title> <Section position="6" start_page="53" end_page="55" type="evalu"> <SectionTitle> 4 Discussion </SectionTitle>
<Paragraph position="0"> In this section, we present key findings with regard to the effectiveness of these data collection techniques in discriminating between systems.</Paragraph>
<Paragraph position="1"> Corpus. The corpus consisted of a specialized collection of CNS and Web documents. Although this combination resulted in a larger, more diverse corpus, it was not identical to the kinds of corpora analysts use in their daily jobs. In particular, analysts search corpora of confidential government documents, which are obviously not readily available for QA system evaluation.</Paragraph>
<Paragraph position="2"> Thus, creating a realistic corpus of the kinds of documents analysts are accustomed to is a significant challenge.</Paragraph>
<Paragraph position="3"> Scenarios. Scenarios were developed by two consultants from the Rome AFRL. Developing appropriate and robust scenarios that mimicked real-world tasks was a time-intensive process. As noted earlier, we discovered that despite this process, scenarios were still missing important contextual details that govern report generation. Thus, creating scenarios involves more than identifying the content and scope of the information sought; it also requires identifying information such as customer, role, and deadline.</Paragraph>
<Paragraph position="4"> Analysts. Analysts in this experiment were naval reservists, recruited by email solicitation of a large pool of potential volunteers; the first 8 positive responders were inducted into the study. Such self-selection is virtually certain to produce a non-random sample. However, the sample was drawn from the target population, which adds to the validity of the findings. We recommend that decision makers evaluating systems expend substantial effort to recruit analysts typical of those who will be using the system, and be aware that self-selection biases are likely to be present. Care should be taken to ensure that subjects have a working knowledge of basic tasks and systems, such as using browsers, Microsoft Word, and possibly Microsoft Excel.</Paragraph>
<Paragraph position="5"> Experimental Design. We used a great deal of randomization in our experimental design; the purpose was to obtain more valid statistical results.</Paragraph>
<Paragraph position="6"> All statistical results are conditioned by the statement &quot;if the analysts and tasks used are a random sample from the universe of relevant analysts and tasks.&quot; Scenarios were not a random selection among possible scenarios; instead, they were tailored to the corpus. Similarly, analysts were not a random sample of all possible analysts, since they were in fact self-selected from a smaller pool of potential volunteers. The randomization in the experimental rotation allowed us to mitigate biases introduced by non-probability sampling techniques across systems, as well as curtail any potential bias introduced by observers.</Paragraph>
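<Paragraph> To make the idea of a counterbalanced rotation concrete, the sketch below shows one way such a rotation can be generated in Python; the system names, analyst identifiers, and counts are hypothetical placeholders, and the fragment illustrates the general technique rather than the exact design used in this study.
    import random

    # Hypothetical identifiers; the actual study used its own systems and subject pool.
    SYSTEMS = ["SystemA", "SystemB", "SystemC", "SystemD"]
    ANALYSTS = ["analyst1", "analyst2", "analyst3", "analyst4",
                "analyst5", "analyst6", "analyst7", "analyst8"]

    def latin_square_rotation(systems, analysts, seed=0):
        """Give each analyst a cyclically rotated ordering of systems, so that every
        system appears in every position equally often when the number of analysts
        is a multiple of the number of systems."""
        rng = random.Random(seed)
        base = systems[:]
        rng.shuffle(base)                  # randomize the starting order once
        rotations = {}
        for i, analyst in enumerate(analysts):
            k = i % len(base)              # shift the base order by one step per analyst
            rotations[analyst] = base[k:] + base[:k]
        return rotations

    if __name__ == "__main__":
        for analyst, order in latin_square_rotation(SYSTEMS, ANALYSTS).items():
            print(analyst, order)
Rotations of this kind spread order effects such as learning and fatigue evenly across systems.</Paragraph>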
<Paragraph position="7"> Data Collection. We employed a wide variety of data collection techniques. Key findings with respect to each technique are presented below.</Paragraph>
<Paragraph position="8"> Questionnaires were powerful discriminators across the range of hypotheses tested. They were also relatively economical to develop and analyze.</Paragraph>
<Paragraph position="9"> Most analysts were comfortable completing questionnaires, although with eight repetitions they sometimes became fatigued. Questionnaires also provided a useful opportunity to check the validity of experimental materials such as scenarios.</Paragraph>
<Paragraph position="10"> The NASA TLX was sensitive in assessing analysts' workloads for each scenario. It was cheap to administer and analyze, and it has established validity and reliability as an instrument, albeit in a different arena, where there are real-time pressures to control a mechanical system.</Paragraph>
<Paragraph position="11"> Formative techniques, such as interviews and focus groups, provided the most useful feedback, especially to system developers. Interview and focus group data usually provide researchers with important information that supplements, qualifies, or elaborates on data obtained through questionnaires.</Paragraph>
<Paragraph position="12"> With questionnaires, users are forced to quantify their attitudes using numeric values. Data collection methods designed to gather qualitative data, such as interviews, give users opportunities to elaborate on and qualify their attitudes and opinions. One effective technique used in this evaluation was to ask analysts to elaborate on some of their numeric ratings from the questionnaires; this allowed us to understand more about why analysts chose particular values to describe their attitudes and experiences. It is important to note that analysis of qualitative data is costly: interviews had to be transcribed, and training is needed to analyze and interpret the data. Training is also necessary to conduct such interviews. Because researchers are essentially the 'instrument', it is important to learn to moderate one's own beliefs and behaviors while interviewing. It is particularly important that interviewers not be seen by their interviewees as &quot;invested in&quot; any particular system; having individuals who are not system developers conduct the interviews is essential.</Paragraph>
<Paragraph position="13"> The SmiFro Console was not effective as implemented; capturing analysts' in situ thoughts with minimal disruption remains a challenge. Although the SmiFro Console was not particularly effective, status report data were easy to obtain and somewhat effective, but defied analysis.</Paragraph>
<Paragraph position="14"> Cross-evaluation of reports was a sensitive and reliable method for evaluating the work product. Complementing questionnaires, it is a good method for assessing the quality of the analysts' work products. The method is somewhat costly in terms of analysts' time (contributing approximately 8% of the total time required from subjects), and analysis requires skill in statistical methods.</Paragraph>
<Paragraph position="15"> System logs answered several questions that were not addressable with other methods, including the Glass Box. However, logging is expensive, rarely reusable, and often unwieldy when particular measures must be extracted. Development of a standard logging format for interactive QA systems is advisable; a sketch of one possible format appears below. The Glass Box provided data on user interaction across all systems at various levels of granularity. The cost of collection is low, but the cost of analysis is probably prohibitive in most cases. NIST's previous experience using the Glass Box allowed for more rapid extraction, analysis, and interpretation of the data, although this remained a very time-consuming and laborious process. Other commercial tools are available that capture some of the same data, and we recommend that research teams evaluate such tools for their own evaluations.</Paragraph>
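<Paragraph> As an illustration only, the following sketch shows one possible shape for such a standard log record, written in Python; the field names and values are hypothetical placeholders and do not describe the log format of any system evaluated here or of the Glass Box.
    import json
    import time
    from dataclasses import dataclass, asdict

    @dataclass
    class QAInteractionEvent:
        """One hypothetical log record for an interactive QA session."""
        timestamp: float   # seconds since the epoch
        analyst_id: str    # anonymized subject identifier
        system_id: str     # which QA system was in use
        scenario_id: str   # which scenario the analyst was working on
        action: str        # e.g. "query", "view_document", "copy_to_report"
        detail: str        # query text, document id, or other payload

    def log_event(path, event):
        """Append one event as a JSON line so that logs produced by different
        systems can be merged and filtered with the same tools."""
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(asdict(event)) + "\n")

    if __name__ == "__main__":
        log_event("session.log", QAInteractionEvent(
            timestamp=time.time(), analyst_id="analyst1", system_id="SystemA",
            scenario_id="scenario3", action="query",
            detail="example query text"))
A shared, line-oriented record of this kind would make logs easier to reuse across evaluations and would reduce the per-system effort of extracting particular measures.</Paragraph>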
<Paragraph position="16"> Hypotheses. We started this study with hypotheses about the types of interactions that a good QA system should support. Of course, different methods were more or less appropriate for assessing different hypotheses. Table 4 displays a portion of our results for each of the example hypotheses presented above in Table 1.</Paragraph>
<Paragraph position="17"> Although not reported here, we note that the performance of each of the systems evaluated in this study varied according to hypothesis; in particular, some systems did well according to some hypotheses and poorly according to others.</Paragraph>
<Paragraph position="18"> Interaction. Finally, while the purpose of this paper was to present our evaluation method for interactive question answering systems, our instruments also elicited interesting results about analysts' perceptions of interaction. Foremost among these, users of interactive systems expect systems to exhibit behaviors that can be characterized as understanding what the user is looking for, what the user has done, and what the user knows. Analysts in this study expected interactive systems to track their actions over time, both with the system and with information.</Paragraph> </Section> </Paper>