<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1009">
<Title>A Probabilistic Rasch Analysis of Question Answering Evaluations Rense Lange Integrated Knowledge Systems</Title>
<Section position="5" start_page="4" end_page="4" type="concl">
<SectionTitle> 5 Conclusions </SectionTitle>
<Paragraph position="0"> In this paper we have described the Rasch model for binary data and applied it to the 2002 TREC QA results.</Paragraph>
<Paragraph position="1"> We addressed the estimation of question difficulty and system ability, the estimation of standard errors for these parameters, and how to assess the fit of individual questions and systems. Finally, we presented a simulation that demonstrated the advantage of using Rasch modeling for the calibration of question sets.</Paragraph>
<Paragraph position="2"> Based on our findings, we recommend that test equating be introduced into formal evaluations of HLT. In particular, for the QA track of the TREC competition, we propose that NIST include a set of questions to be reused in the following year for calibration purposes.</Paragraph>
<Paragraph position="3"> For instance, after evaluating the systems' performance in the 2004 competition, NIST would select a set of questions consistent with the criteria outlined above.</Paragraph>
<Paragraph position="4"> Twenty to fifty questions from a set of 500 will probably be sufficient, especially when misfitting questions are eliminated. When the results are released to the participants, they would be asked not to look at these equating questions and not to use them to train their systems in the future. These equating questions would then be included in the 2005 question set so as to place the 2004 and 2005 results on the same logit scale. The process would continue in each subsequent year.</Paragraph>
<Paragraph position="5"> The approach outlined above serves several purposes. For instance, the availability of equated tests would increase confidence that the testing indeed measures progress, and not simply the unavoidable variation in difficulty across each year's question set. Additionally, it would support the goal of making each competition increasingly challenging by correctly identifying easy and difficult questions. Further, calibrated questions could be combined into increasingly large corpora, and these corpora could then be used to give researchers immediate performance feedback on the same metric as the NIST evaluation scale. The availability of large corpora of equated questions might also provide the basis for developing methods to predict question difficulty, thus stimulating important theoretical research in QA.</Paragraph>
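The following is a minimal Python sketch, not the authors' implementation, of the two steps summarized above: a crude joint maximum-likelihood fit of system abilities and question difficulties for a binary response matrix, and a common-item mean shift of the kind proposed for year-to-year equating. The function names, the gradient-ascent estimation routine, and the toy data are illustrative assumptions; a real analysis would use dedicated Rasch software with bias corrections, standard errors, and fit statistics.

```python
import numpy as np

def rasch_prob(ability, difficulty):
    """Dichotomous Rasch model: P(correct) = 1 / (1 + exp(-(b - d)))."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def fit_rasch(X, n_iter=500, lr=0.01):
    """Crude joint maximum-likelihood fit (gradient ascent) of system
    abilities b (rows of X) and question difficulties d (columns of X)
    for a binary right/wrong matrix X.  Illustrative only."""
    n_sys, n_q = X.shape
    b = np.zeros(n_sys)
    d = np.zeros(n_q)
    for _ in range(n_iter):
        P = rasch_prob(b[:, None], d[None, :])
        resid = X - P                      # observed minus expected
        b += lr * resid.sum(axis=1)        # d(logL)/db_i =  sum_j (x_ij - P_ij)
        d -= lr * resid.sum(axis=0)        # d(logL)/dd_j = -sum_i (x_ij - P_ij)
        d -= d.mean()                      # anchor the scale: mean difficulty = 0
    return b, d

def equate_shift(d_old, d_new, common_old, common_new):
    """Mean shift that places a new year's difficulties on the previous
    year's logit scale, using the indices of the reused equating questions."""
    shift = d_old[common_old].mean() - d_new[common_new].mean()
    return d_new + shift

# Toy usage: 30 systems answering 500 questions generated from the model itself.
rng = np.random.default_rng(0)
true_b = rng.normal(0.0, 1.0, 30)
true_d = rng.normal(0.0, 1.0, 500)
X = (rng.random((30, 500)) < rasch_prob(true_b[:, None], true_d[None, :])).astype(float)
b_hat, d_hat = fit_rasch(X)
```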
<Paragraph position="6"> The work presented here only begins to scratch the surface of adopting a probabilistic approach such as the Rasch model for the evaluation of human language technologies. First, as discussed above, questions displaying unexpectedly large or small Outfit values can be identified for further study. The questions themselves can be analyzed in terms of both content and linguistic expression. With the objective of beginning to form a theory of question difficulty, questions can be analyzed in concert with the occurrence of correct answers in the document corpus and the incorrect answers returned by systems. Also, experimentation with more complex scaling models could be conducted to uncover information beyond questions' difficulty levels. For example, so-called 2-parameter IRT models (see, e.g., Hambleton and Swaminathan, 1985) would allow the estimation of a discrimination parameter together with the difficulty parameter for each question. More direct methods for diagnosing systems' skill defects are described in Stout (2002).</Paragraph>
<Paragraph position="7"> It is also possible to incorporate into the model other factors and variables that affect a system's performance. Rasch modeling can be extended to many other HLT evaluation contexts, since Rasch measurement procedures exist for multi-level responses, counts, proportions, and rater effects. Of particular interest is the application to technology areas that use metrics other than the percentage of items processed correctly. Measures such as average precision, R-precision, and precision at fixed document cutoff used in Information Retrieval (Voorhees and Harman, 1999), the BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002) used in Machine Translation, and the F-measure (Van Rijsbergen, 1979) commonly used to evaluate a variety of NLP tasks are just a few of the evaluation metrics that could benefit from Rasch scaling and related techniques.</Paragraph>
</Section>
</Paper>
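As a companion to the fit diagnostics and 2-parameter models mentioned in the last two paragraphs, here is a minimal Python sketch, again an illustrative assumption rather than the paper's code, of the unweighted Outfit mean-square used to flag misfitting questions and of the 2PL item response function, which adds a discrimination parameter a to the Rasch difficulty d.

```python
import numpy as np

def outfit(X, abilities, difficulties):
    """Unweighted (Outfit) mean-square per question: the average squared
    standardized residual.  Values near 1 indicate good fit; values well
    above 1 flag questions that draw unexpected right/wrong answers."""
    P = 1.0 / (1.0 + np.exp(-(abilities[:, None] - difficulties[None, :])))
    z = (X - P) / np.sqrt(P * (1.0 - P))   # standardized residuals
    return (z ** 2).mean(axis=0)

def two_pl_prob(ability, difficulty, discrimination):
    """2-parameter IRT (2PL) model: P(correct) = 1 / (1 + exp(-a * (b - d)));
    the Rasch model is the special case a = 1 for every question."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))
```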