<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0502">
  <Title>Evaluation of Restricted Domain Question-Answering Systems</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Initiating a restricted domain evaluation
</SectionTitle>
    <Paragraph position="0"> evaluation When it came time to evaluate the KAAS system, we initially defaulted to the TREC style QA evaluation with short, fact-based questions, adjudicated answers to these questions, and a test collection in which to find those answers. This choice of evaluation was not surprising since early versions of our system grew out of that environment. However, it quickly became apparent that this evaluation style posed problems for our restricted-domain, specific purpose system.</Paragraph>
    <Paragraph position="1"> Developing a set of test questions was easier said than done. Unlike the open domain evaluations, where test questions can be mined from question logs (Encarta, Excite, AskJeeves), no question sets are at the disposal of restricted-domain evaluators. To build a set of test questions, we hired two sophomore aerospace engineering students. Based on class project papers of the previous semester and examples of TREC questions, the students were asked to create as many short factoid questions as they could, i.e &amp;quot;What is APAS?&amp;quot; However, the real user questions that we collected later did not look anything like the short test questions in this initial evaluation set. The user questions were much more complex, e.g. &amp;quot;How difficult is it to mold and shape graphite-epoxies compared with alloys or ceramics that may be used for thermal protective applications?&amp;quot; A more in depth analysis of KAAS question types can be found in Diekema et al. (to appear).</Paragraph>
    <Paragraph position="2"> Establishing answers for the initial test questions proved difficult as well. The students did fine at collecting the questions that they had while reading the papers, but lacked sufficient domain expertise to establish answer correctness. Another issue was determining recall because it wasn't always clear whether the (small) corpus simply did not contain the answer or whether the system was not able to find it. A third student, a doctoral student in aerospace engineering, was hired to help with these issues. To facilitate automatic evaluation we wanted to represent the answers in simple patterns but found that complex answers are not necessarily suitable for such a representation, even though patterns have proven feasible for TREC systems.</Paragraph>
    <Paragraph position="3"> While a newswire document collection for general domain evaluation is easy to find, a collection in our specialized domain needed to be created from scratch. Not only did the collection of documents take time, the conversion of most of these documents to text proved to be quite an unexpected hurdle as well.</Paragraph>
    <Paragraph position="4"> As is evident, the TREC style QA evaluation did not suit our restricted domain system. It also leaves out the user entirely. While information-based evaluations are necessary to establish the ability of the system to answer questions correctly, we felt that they were not sufficient for evaluating a system with real users.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 User-based evaluation dimensions
</SectionTitle>
    <Paragraph position="0"> Restricted domain systems tend to be situated not only within a specific domain, but also within a certain user community and within a specific task domain. A generic evaluation is neither sufficient nor suitable for a restricted domain system. The environment in which KAAS is situated should drive the evaluation. Unlike many of the systems that participate in a TREC QA evaluation, the KAAS system has to function in real time with real users, not in batch mode with surrogate relevance assessors. This brings with it additional evaluation criteria such as utility and system speed (Nyberg and Mitamura, 2003).</Paragraph>
    <Paragraph position="1"> KAAS users were asked in two separate surveys about their use and experiences with the system. The surveys were part of larger scale, cross-university course evaluations which looked at the students' perceptions of distance learning, collaboration at a distance, the collaborative software package, the KAAS, and each participating faculty member. While there was some structure and guidance in the user survey of the QA system, it was minimal and the survey is mainly characterized by the open nature of the responses. There were 25 to 30 students participating in each full course survey, but since we do not have the actual surveys that were turned in, we are not certain as to exactly how many students completed the survey section on the KAAS. However, it appears that most, if not all of the students provided feedback.</Paragraph>
    <Paragraph position="2"> Given the free text nature of the responses, it was decided that the three researchers would do a content analysis of the responses and independently derive a set of evaluation dimensions that they detected in the students' responses. Through content analysis of the user responses and follow-up discussion, we identified 5 main areas of importance to KAAS users when using the system: system performance, answers, database content, display, and expectations (see Figure 1). Each of the categories will be described in more detail below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 System Performance
</SectionTitle>
      <Paragraph position="0"> with system speed and system availability. Users indicated that the speed with which answers were returned to them mattered. While they did not necessarily expect an immediate answer, they also did not want to wait, e.g. &amp;quot;took so long, so I gave up&amp;quot;. Whenever users have a question, they want to find an answer immediately. If the system is down or not available to them at that moment, they will not come back later and try again.</Paragraph>
      <Paragraph position="1"> Possible system performance metrics are the &amp;quot;answer return rate&amp;quot;, and &amp;quot;up time&amp;quot;. The answer return rate measures how long it takes (on average) to return an answer after the user has submitted a question. &amp;quot;Up-time&amp;quot; measures for a certain time period how often the system is</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Completeness
2.2 Accuracy
2.3 Relevance
2.4 Applicability to task / utility / usefulness
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Database Content
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Authority / provenance / Source quality
3.2 Scope /extensiveness / coverage
3.3 Size
3.4 Updatedness
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Display (UI)
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Input
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Output
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Expectations
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Googleness
4.2 Answers
</SectionTitle>
      <Paragraph position="0"> What users find important in an answer is captured in the Answers category. The users not only wanted answers to be accurate, they also wanted them to be complete and, something that is not tested at all in a regular evaluation, applicable to their task. e.g. &amp;quot;in general what I received was helpful and accurate&amp;quot;, &amp;quot;it [the system] was useful for the Columbia incident exercise...&amp;quot;.</Paragraph>
      <Paragraph position="1"> Possible metrics concerning answers are &amp;quot;accuracy or correctness&amp;quot;, &amp;quot;completeness&amp;quot;, &amp;quot;relevance&amp;quot;, or &amp;quot;task suitability&amp;quot;. While the first three metrics are used in some shape or form in the TREC evaluations, &amp;quot;task suitability&amp;quot; is not. Perhaps this measure requires a certain task description with a question to test whether the answer provided by the system allowed the user to complete the task.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Database Content
</SectionTitle>
      <Paragraph position="0"> Users also shared thoughts about the Database Content or source documents that are searched for answers. They find it important that these documents are reputable. They also shared concerns about the size of the database, fearing that a limit in size would restrict the number of answerable questions, e.g. &amp;quot;it needs more documents&amp;quot;. The same is true for the scope of the collection. Users desired extended coverage to ensure that a wide range of questions could be fielded by the collection, e.g. &amp;quot;I found the data too limited in scope&amp;quot;.</Paragraph>
      <Paragraph position="1"> Possible database content metrics are &amp;quot;authority&amp;quot;, &amp;quot;coverage&amp;quot;, &amp;quot;size&amp;quot;, and &amp;quot;up-todateness&amp;quot;. To measure &amp;quot;authority&amp;quot; one would first have to identify the core authors for a domain through citation analysis. Once that is established, one could measure the percentage of database content created by these core researchers. &amp;quot;Coverage&amp;quot; could be measured in a similar way after the main research areas within a domain are identified. &amp;quot;Size&amp;quot; could simply be measured in megabytes or gigabytes. &amp;quot;Up-todateness&amp;quot; could be measured by calculating the number of articles per year or simply noting the date of the most recent article.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 User Interface
</SectionTitle>
      <Paragraph position="0"> The User Interface of a system was also found of importance. Users were critical about the way they were asked to input their questions. They did not always want to phrase their question as a question but sometimes preferred to use keywords, e.g. &amp;quot;a keyword search would be more useful&amp;quot;. They also expected the system to prompt them with assistance in case they misspelled terms, or when the system did not understand the question, e.g &amp;quot;sometimes very good at correcting you to what you need, other times not very good&amp;quot;. Users also care about the way in which the results are presented to them and whether the system desires any additional responses from them. They did not like being prompted for feedback on a document's relevance for example, e.g. &amp;quot;...the 'was this useful' window was disruptive&amp;quot;.</Paragraph>
      <Paragraph position="1"> Measuring UI related aspects can be done through observation, questionnaires and interviews and does not typically result in actual metrics but rather a set of recommendations that can be implemented in the next version of the system.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Expectations
</SectionTitle>
      <Paragraph position="0"> Another interesting aspect of user criteria is Expectations , e.g. &amp;quot;the documents in the e-Query database were useful, but Google is much faster&amp;quot;. All users are familiar with Google and tend to have little patience with systems that have a different look and feel.</Paragraph>
      <Paragraph position="1"> Expectations can be captured by survey so that it can be established whether these expectations are reasonable and whether they can be met.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Restricted domain QA Evaluation
</SectionTitle>
    <Paragraph position="0"> If we consider a restricted domain QA system as a system developed for a certain application, it is clear that these systems require a situated evaluation. The evaluation has to be situated in the task, domain, and user community for which the system is developed.</Paragraph>
    <Paragraph position="1"> How then can a restricted domain system best be evaluated? We believe that the evaluation should be driven by the dimensions identified by the users as important: system performance, answers, database content, display, and expectations.</Paragraph>
    <Paragraph position="2"> The system should be evaluated on its performance. How many seconds does it take to answer a question? Once the speed is known, one can determine how long users are willing to wait for an answer. It may very well be that the answer-finding capability of a system will need to be simplified in order to speed up the system and satisfy its users. Similarly, tests to determine robustness need to be part of the system performance evaluation. Users tend to shy away from systems that are periodically unavailable or slow to a crawl during peak usage hours.</Paragraph>
    <Paragraph position="3"> Systems should also be evaluated on their answer providing ability. This evaluation should include measures for answer completeness, accuracy, and relevancy. Test questions should be within the domain of the QA system in order to test the answer quality for that domain.</Paragraph>
    <Paragraph position="4"> Answers to certain questions require a more fine-grained scoring procedure: answers that are explanations or summaries or biographies or comparative evaluations cannot be meaningfully rated as simply right or wrong. The answer providing capability should be evaluated in light of the task or purpose of the system. For example, users of the KAAS are learners in the field and are not well served with exact answer snippets. For their task, they need answer context information to be able to learn from the answer text.</Paragraph>
    <Paragraph position="5"> The evaluation should also include measures of the Database Content. Rather than assuming relevancy of a collection, it should be evaluated whether the content is regularly updated, whether the contents are of acceptable quality to the users, and whether the coverage of the restricted domain is extensive enough.</Paragraph>
    <Paragraph position="6"> Another system component that should be evaluated is the User Interface. Is the system easy to use? Does the interface provide clear guidance and/or assistance to the user? Does it allow users to search in multiple ways? Finally, it may be pertinent to evaluate how far the system goes in living up to user expectations. Although it is impossible to satisfy everybody, the system developers need to know whether there is a large discrepancy between user expectations and the actual system, since this may influence the use of the system.</Paragraph>
    <Paragraph position="7"> 6 Cross-fertilization between evaluations How different are restricted-domain evaluations from open-domain evaluations? Are they so diametrically opposed that restricted-domain systems require separate evaluations from open-domain systems and vice versa? As pointed out in Section 1, we stopped participating in the TREC QA evaluations because that evaluation was not well suited to our restricted-domain system. However, we regretted this as we believe we could, nevertheless, have gained valuable insights.</Paragraph>
    <Paragraph position="8"> Clearly, open-domain systems would benefit from the evaluation dimensions discussed in Section 4. The difference would be that the test questions used for evaluation would be general rather than tailored to a specific domain.</Paragraph>
    <Paragraph position="9"> Additionally, it may be harder to evaluate the database content (i.e. the collection) for a general domain system than would be the case for restricted-domain systems.</Paragraph>
    <Paragraph position="10"> To make open-domain evaluations more applicable to restricted-domain systems, they could be extended to include metrics about answer speed, and the ability of answering within a certain task. For example, the evaluation could include system performance to get an indication as to how much processing time, given certain hardware, is required in getting the answers. As for answer correctness itself, it may be interesting to require extensive use of task scenarios that would determine aspects such as answer length and level of detail. It may also be desirable to evaluate runs without redundancy techniques separately. Ideally, users would be incorporated into the evaluation to assess the user interface and the ability of the system to assist them in completion of a certain task.</Paragraph>
  </Section>
class="xml-element"></Paper>