<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3007"> <Title>User-Centered Evaluation of Interactive Question Answering Systems</Title> <Section position="3" start_page="49" end_page="51" type="intro"> <SectionTitle> 2 Evaluation Approach </SectionTitle> <Paragraph position="0"> This evaluation was conducted as a two-week workshop. The workshop format gave analysts an opportunity to interact fully with all four systems and to complete time-intensive tasks similar to their normal work, while letting us evaluate a range of methods and metrics.</Paragraph> <Paragraph position="1"> The researchers spent approximately 3 weeks onsite preparing and administering the workshop. Intelligence analysts, the study participants, spent 2 weeks onsite. The evaluation employed 8 analysts, 8 scenarios in the chemical/biological WMD domain, and 4 systems: 3 QA systems and a Google baseline. Each analyst used each system to analyze 2 scenarios and wrote a pseudo-report containing enough structure and content for it to be judged by peer analysts.</Paragraph> <Paragraph position="2"> During the planning stage, we generated hypotheses about interactive QA systems to guide development of methods and metrics for measuring system effectiveness. Fifteen hypotheses were selected, of which 13 were operationalized. Example hypotheses are presented in Table 1.</Paragraph> <Paragraph position="3"> A good interactive QA system should ...</Paragraph> <Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 2.1 Evaluation Environment </SectionTitle> <Paragraph position="0"> The experiment was conducted at the Pacific Northwest National Laboratory (PNNL) in Richland, WA. 
We used one room housing support servers, four rooms each containing two copies of one system, and a conference room seating 20 for general meetings, focus group discussions, meetings among observers, meetings among developers, etc.</Paragraph> <Paragraph position="1"> Any mention of commercial products or companies is for information only and does not imply recommendation or endorsement by NIST.</Paragraph> </Section> <Section position="2" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 2.2 QA Systems </SectionTitle> <Paragraph position="0"> Three end-to-end interactive QA systems and a Google baseline were used. System developers were each assigned a room and installed their systems on two workstations in that room.</Paragraph> <Paragraph position="1"> Before analysts used each system, they were trained by the system developer. Training included a skills-check test and free experimentation.</Paragraph> <Paragraph position="2"> Training methods included: a script with trainees reproducing steps on their own workstations, a slide presentation with scripted activities, a presentation from a printed manual, and an oral, participatory presentation guided by a checklist. The workstations used during the experiment were Dell workstations configured with Windows.</Paragraph> </Section> <Section position="3" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 2.3 Subjects </SectionTitle> <Paragraph position="0"> Analysts who participated in the study were volunteers serving their yearly two-week service requirement as U.S. Naval Reservists. Analysts were recruited by email solicitation of a large pool of potential volunteers. 
The first 8 positive respondents were inducted into the study.</Paragraph> <Paragraph position="1"> We collected the following data from analysts: age, education level, job type, number of years in the military, number of years conducting analysis work, computer usage, computer expertise, and experience with querying systems. Data about analysts characterizes them on several dimensions.</Paragraph> <Paragraph position="2"> With small samples, this step is critical, but it is also important in studies with larger samples. This type of data lets us describe participants in published reports and ask whether individual differences affect study results. For instance, one might look for a relationship between computer experience and performance.</Paragraph> </Section> <Section position="4" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 2.4 Scenarios </SectionTitle> <Paragraph position="0"> Scenarios were developed by a team of analysts from the Air Force Rome Research Lab, and were vetted to produce 14 scenarios appropriate to the collection and the target participants. We found after the first two scenarios that, while the scenario descriptions adequately specified the content of each task, they lacked important contextual information about the report to be produced, such as its customer and length. This omission created ambiguity in report writing and left analysts uncertain about how to proceed with the task. Thereafter, analysts met as a group in the conference room to agree on additional specifications for each scenario when it was assigned. In addition, the project director and one analyst worked together to design a report template, which established a uniform report structure and specified formatting guidelines such as headings and length. An example scenario is displayed in Figure 1.</Paragraph> <Paragraph position="1"> Scenario B: [country] Chemical Weapons Program. Before a U.S. 
military presence is reestablished in [country], a current, thorough study of [country] chemical weapons program must be developed. Your task is to produce a report for the Secretary of the United States Navy regarding general information on [country] and the production of chemical weapons.</Paragraph> <Paragraph position="2"> Provide information regarding [country] access to chemical weapons research, their current capabilities to use and deploy chemical weapons, reported stockpiles, potential development for the next few years, any assistance they have received for their chemical weapons program, and the impact that this information will have on the United States. Please add any other related information to your report.</Paragraph> </Section> <Section position="5" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 2.5 Corpus </SectionTitle> <Paragraph position="0"> Using the live Web would make it impossible to replicate the experiment, so we started with the AQUAINT corpus from the Center for Non-Proliferation Studies (CNS). The CNS data consists of the January 2004 distribution of the Eye on Proliferation CD, which has been &quot;disaggregated&quot; by CNS into about 40,000 documents. Once the initial 14 scenarios were delivered to NIST, they were characterized with respect to how well the CNS corpus could support them. Several scenarios had fewer than 100 documents in the CNS corpus, so to increase the number of documents available for each scenario we supplemented the corpus by mining the Web.</Paragraph> <Paragraph position="1"> Documents were collected from the Web by semi-automated querying of Google and manual retrieval of the documents listed in the results. 
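The per-extension tallies reported in Table 2 were derived from the downloaded files' extensions. A minimal sketch of that bookkeeping step might look like the following; the directory name and the grouping of extensionless files are hypothetical, not details from the study.

```python
# Hypothetical sketch: tally a downloaded Web corpus by file extension,
# as was done to produce approximate per-type counts.
from collections import Counter
from pathlib import Path

def count_by_extension(corpus_dir):
    """Return a Counter mapping lowercased file extensions to counts."""
    counts = Counter()
    for path in Path(corpus_dir).rglob("*"):
        if path.is_file():
            # Files with no extension are grouped under "(none)"
            # (an illustrative choice, not the study's).
            counts[path.suffix.lower() or "(none)"] += 1
    return counts

if __name__ == "__main__":
    corpus = Path("web_corpus")  # hypothetical download directory
    if corpus.is_dir():
        for ext, n in count_by_extension(corpus).most_common():
            print(ext, n)
```

Lowercasing the suffix merges variants such as .HTML and .html, which matters when files come from many different Web servers.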
A few unusually large and useless items, such as CD images, pornography, and word lists, were deleted.</Paragraph> <Paragraph position="2"> The approximate counts of different kinds of files, as determined by their file extensions, are summarized in Table 2.</Paragraph> </Section> <Section position="6" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 2.6 Experimental Design </SectionTitle> <Paragraph position="0"> The evaluation workshop included four two-day blocks. In each block, a pair of analysts was assigned to each room, and a single observer was assigned to the pair of analysts. Analysts used the two machines in each room to work independently during the block. After each block, analysts and observers rotated to different system rooms, so that analysts were paired together only once and observers observed different analysts during each block. The goal in using designed experiments is to minimize the second-order interactions, so that estimates of the main effects can be obtained from a much smaller set of observations than is required for a full factorial design. For instance, one might imagine potential interaction effects of system and scenario (some systems might be better for certain scenarios); system and analyst (some analysts might adapt more quickly to a system); and analyst and scenario (some analysts might be more expert for certain scenarios). To control these potential interactions, we used a modified Greco-Latin 4x4 design.</Paragraph> <Paragraph position="1"> This design ensured that each analyst was observed by each of the four observers, and used each of the four systems. This design also ensured that each system was, for some analyst, the first, second, third, or last to be encountered, and that no analyst did the same pair of scenarios twice. Analyst pairings were unique across blocks. Following standard practice, analysts and scenarios were randomly assigned codenames (e.g. 
A1 and Scenario A), and systems were randomly assigned to the rows of Table 3. Although observers were simply rotated across the system rows, the assignment of individuals to code numbers was random.</Paragraph> </Section> </Section> </Paper>