<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1009">
  <Title>A Probabilistic Rasch Analysis of Question Answering Evaluations</Title>
  <Section position="3" start_page="0" end_page="4" type="metho">
    <SectionTitle>
2 The Rasch Model for Binary Data
</SectionTitle>
    <Paragraph position="0"> For binary results, Rasch's (1960/1980) measurement requires that the outcome of an encounter between computer systems (1, ..., s, ..., n s ) and questions (1, ..., q, ...,</Paragraph>
    <Paragraph position="2"> ) should depend solely on the differences between these systems' abilities (S s ) and the questions' difficulties (Q q ). Together with mild and standard scaling assumptions, the preceding implies that:  correctly. For a rigorous derivation of Equation 1 and an overview of the assumptions involved, we refer the reader to work by Fisher (1995). Fisher also proves that sum-scores (and hence percentages of correct answers) are sufficient performance statistics if and only if the assumptions of the Rasch model are satisfied.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Response Curves
</SectionTitle>
      <Paragraph position="0"> The binary Rasch model has several interesting properties. First, as is illustrated by the three solid lines in Figure 1, Equation 1 defines a set of non-intersecting logistic response-curves such that P</Paragraph>
      <Paragraph position="2"> . In the following, such points are also referred to as question's locations. For instance, the locations of the three questions depicted in Figure 1 are -5, 0, and 2.</Paragraph>
      <Paragraph position="3"> Second, for each pair of systems i and j with S</Paragraph>
      <Paragraph position="5"> any question q, system i has a better chance of responding correctly than system j, i.e., P</Paragraph>
      <Paragraph position="7"> . Thus, the questions that are answered correctly by less capable systems always form a probabilistic subset of those answered correctly by more capable systems. Third, restating Equation 1 as is shown below highlights that the question and system parameters are additive and expressed in a common metric:</Paragraph>
      <Paragraph position="9"> (2) Given the left-hand side of Equation 2, this metric's units are called Logits. Note that this equation further implies that S s and Q q are determined up to an additive constant only (i.e., their common origin is arbitrary). Finally, efficient maximum-likelihood procedures exist to estimate S s and Q q independently, together with their respective standard errors SE</Paragraph>
      <Paragraph position="11"> Wright and Stone, 1979). These procedures do not require any assumptions about the magnitudes or the distribution of the S s in order to estimate the Q q , and viceversa. Accordingly, systems' abilities can be determined in a &amp;quot;question free&amp;quot; fashion, as different sets of questions from the same pool will yield statistically</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
equivalent S
</SectionTitle>
      <Paragraph position="0"> s estimates. Analogously, questions' locations can be estimated in a &amp;quot;system free&amp;quot; fashion, as similarly spaced Q q estimates should be obtained across different samples of systems. In this paper, the model parameters, together with their errors of estimate, will be computed via the Winsteps Rasch scaling software (Linacre, 2003).</Paragraph>
      <Paragraph position="2"> 2. Questions w ith Outfit &gt; 2.0 are show n as X 1. The size of the dots is proportional to SE q Note: 3. Symbols have been slightly jittered along the  We used the results from the Question Answering track of the 2002 TREC competition to test the feasibility of applying Rasch modeling to QA evaluation. Sixty-seven systems participated, and answered 500 questions by returning a single precise response extracted from a 3gigabyte corpus of texts. While the NIST judges assessed each answer as correct, incorrect, non-exact, or unsupported, we created binary responses by treating each of these last three assessments as incorrect. Ten questions were excluded from all analyses, as these were not answered correctly by any system.</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="3" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Question Difficulty and System Ability
</SectionTitle>
      <Paragraph position="0"> Maximum-likelihood estimates of the questions' difficulty and the systems' abilities were computed via Winsteps. Figure 2 displays the results associated with the questions, whereas Figure 3 addresses the systems. Each dot in these displays corresponds to one question or one system. For questions, the Y-value gives the estimate of the questions' difficulty (i.e., Q q ); for systems, the Y-value reflects the estimate of systems' ability (S s ). For questions, lower values correspond to easier questions, while higher values to difficult questions. For systems, higher values correspond to greater ability. As is customary, the mean difficulty of the questions is set at zero, thereby implicitly fixing the origin of the estimated system abilities at -1.98. As was noted earlier, a incorrectly), the parameter Q q cannot be estimated. Note that raw-score approaches implicitly ignore such questions as well since including them does not affect the order of systems' &amp;quot;number right.&amp;quot; Of course, by changing the denominator, percentages of right or wrong questions are affected. constant value can be added to each Q</Paragraph>
      <Paragraph position="2"> thereby affecting the validity of the results. The X-axes of Figures 2 and 3 refer to the quality of fit, as described in section 3.3 below.</Paragraph>
      <Paragraph position="3"> As an example, consider a question with difficulty level -2. This means that a system whose ability level is -2 has a probability of .5 (odds=1) of getting this question correct. The odds of a system with ability of -1 getting this same question correct would increase by a factor  of 2.72, thus increasing the probability of a correct answer to P sq = 2.72/(1+2.72) = .73. For a system at ability level 0, the odds increase by another factor of 2.72 to 7.39, giving a probability of .88. On the other hand, a system with an ability of -3, would have the even odds decrease by a factor of 2.72 to .369, yielding</Paragraph>
      <Paragraph position="5"> = .27. In other words, increasing (decreasing) questions' difficulties or decreasing (increasing) systems' abilities by the same amounts affects the log-odds in the same fashion. The preceding thus illustrates that question difficulty and system ability have additive properties on the log-odds, or, Logit, scale.</Paragraph>
      <Paragraph position="6">  The smoothed densities in Figure 4 summarize the locations of the 490 questions (dotted distribution) and the 67 systems (solid). It can be seen that question difficulties range approximately from -3 to +3, and that most questions fall in a region about -1. Systems' abilities mostly cover a lower range such that the questions'  questions are very hard for these systems. The vast majority of systems (those located near -1 or below) have only a small chance (below 15%) of answering a significant portion of the questions (those located above 1), and an even smaller chance (below 5%) on a non-negligible number of questions (those above 2). Of those systems, a large portion (those at -2 or below) will have even less of a chance on these questions.</Paragraph>
      <Paragraph position="7">  Note that a number of measures used in the physical sciences likewise achieve additivity by adopting a log scale.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Standard Error of Estimate
</SectionTitle>
      <Paragraph position="0"> The two U-shaped curves in Figure 4 reflect the estimates of error, SE q for questions (dotted curve) and SE s for systems (solid curve), as a function of their estimated locations (X-axis). As is also reflected by the size of the dots in Figure 3, it can be seen that SE  nature of the TREC evaluation well since the top performing systems are assessed with nearly optimal precision. While the most capable system shows somewhat  &gt; 1 Logit. Thus, the locations of the hardest questions are known with very poor precision only.</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.3 Question and System Fit
</SectionTitle>
      <Paragraph position="0"> According to the Rasch model, a system, A, with middling performance is expected to perform well on the easier questions and poorly on the harder questions.</Paragraph>
      <Paragraph position="1"> However, it is possible that some system, B, achieved the same score on the test by doing poorly on the easy questions and well on the harder questions. While the behavior of system A agrees with the model, system B does not. Accordingly, the fit of system B is said to be poor as this system contradicts the knowledge embodied in the data as a whole. Analogous comments can be made with respect to questions. Rasch modeling formalizes the preceding account in a statistical fashion. In particular, for each response to Question q by System s, Equation 1 allows the computation of a standardized residual z sq , which is the difference between an observed datum (i.e., 0 or 1) and the probability estimate P sq after division by its standard deviation. Since the z sq follow an approximately normal distribution, unexpected results (e.g., |z sq |&gt;3) are easily identified. The overall fit for systems (across questions) and for questions (across systems) is quantified by their Outfit.</Paragraph>
      <Paragraph position="2">  Additionally, systems' (or questions') &amp;quot;Infit&amp;quot; statistic is defined by weighting the z  sq contributions inversely to their distance to the contributing questions (or systems). As such, Infit statistics are less sensitive to outlying observations. Since this paper focuses on overall model fit, Infit statistics are not reported.</Paragraph>
      <Paragraph position="3"> from 0 to [?], with an expected value of 1. As a rule of thumb, for rather small samples such as the present, Outfit values in the range 0.6 to 1.6 are considered as being within the expected range of variation. Figure 2 shows 46 questions whose Outfit exceeds 1.6 (those to the right of the rightmost dashed vertical line) and the Outfit values of 24 of these exceed 2.0 (shown in the graph by Xs, plotted at the right with horizontal jitter). These are questions on which low performing systems performed surprisingly well, and/or high performing systems performed unexpectedly poorly. Thus, there is a clear indication that these questions have characteristics that differentiate them from typical questions. Such questions are worthy of individual attention by the system developers. Questions and systems with uncharacteristically small Outfit values are of interest as well. For instance, in the present context it seems plausible that some questions simply cannot be answered by systems lacking certain capabilities (e.g., pronominal anaphora resolution, acronym expansion, temporal phrase recognition), while such questions are easily answered by systems that possess such capabilities. We might find that these questions would be answered by the vast majority, if not all, of the high performing systems and very few if any of the low ability systems. The estimated fit would be much better (small Outfit) than expected by chance. Again, the identification and analysis of such overfitting questions and, similiarly, underfitting systems may greatly enhance our understanding of both.</Paragraph>
    </Section>
    <Section position="6" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.4 Example of System with large Outfit
</SectionTitle>
      <Paragraph position="0"> Note that Figure 3 above shows that the best performing system also exhibits the largest Outfit (2.68), and we investigated this system's residuals in detail. Table 1 indicates that this system failed (Datum = 0) on eight questions (q) where its modeled probability of success was very high (P sq &gt; 0.98). Thus, the misfit results from this system's failure to answer correctly questions that proved quite easy for most other systems.</Paragraph>
      <Paragraph position="1">  This situation should be highly relevant to the system's developers. Informally speaking, the best system studied here &amp;quot;should have gotten these questions right,&amp;quot; and it might thus prove fruitful to determine exactly why the system failed. Even if no obvious mistakes can be identified, doing so could reveal strategies for system improvement by focusing on seemingly &amp;quot;easy&amp;quot; issues first. Alternatively, it might turn out that precisely those aspects of the system that enable it do so well overall also cause it to falter on the easier questions. Ascertaining this might or might not be of help to the system's designers, but it would certainly foster the development of a scientific theory of automatic question answering.</Paragraph>
    </Section>
    <Section position="7" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.5 Impact of Misfit
</SectionTitle>
      <Paragraph position="0"> The existence of misfitting entities raises the possibility that the estimated Rasch system abilities are distorted by the question and system misfit. We therefore recomputed systems' locations by iteratively removing the worst fitting questions until 372 questions with Outfit q &lt; 1.6 remained. In support of the robustness of the Rasch model, Figure 5 shows that the correlation between the two sets of estimates is nearly perfect (r = 0.99), indicating that the original and the &amp;quot;purified&amp;quot; questions produce essentially equivalent system evaluations. Thus, the observed misfit had negligible effect on the system parameter estimates.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="4" end_page="4" type="metho">
    <SectionTitle>
4 Test Equating Simulation
</SectionTitle>
    <Paragraph position="0"> A major motivation for introducing Rasch models in educational assessment was that this allows for the calibration, or equating, of different tests based on a limited set of common (i.e., repeated) questions. The purpose of equating is to achieve equivalent test scores across different test sets. Thus, equating opens the door to calibrating the difficulty of questions and the performance of systems across the test sets used in different years.</Paragraph>
    <Paragraph position="1"> Since appropriate data from different years are lacking, a simulation study was performed based on different subsets of the 490 available questions. We show how system abilities can be expressed in the same metric, even though systems are evaluated with a completely different set of questions. To rule out the possibility that such a correspondence might come about by chance (e.g., equally difficult sets of questions might accidentally be produced), a worst-case scenario is used.</Paragraph>
    <Paragraph position="2"> The simulation also provides a powerful means to demonstrate the superiority of the Rasch Logit metric compared to raw scores as indices of system performance.</Paragraph>
    <Paragraph position="3"> To this end, we divide the available questions from TREC 2002 into two sets of equal size. The Easy set contains the easiest questions (lowest Q q ) as identified in earlier sections. For the simulation, this will be the test set for one year's evaluation. A second, Hard set, serves as the test set for a subsequent evaluation, and it contains the remaining 50% of the questions (highest</Paragraph>
    <Paragraph position="5"> ). By design, the difference in questions' difficulties is far more extreme than is likely to be encountered in practice. The Rasch model is then fitted to the responses to the Easy set of questions. Next, based on questions' difficulties and their fit to the Rasch model, a subset of the Easy questions is selected for inclusion in the second test in conjunction with the Hard question set.</Paragraph>
    <Paragraph position="6"> These questions are said to comprise the Equating set, as they serve to fix the overall locations of the questions in the Hard set.</Paragraph>
    <Paragraph position="7"> Normally, this second test would be administered to a new set of systems (some completely new, others improved versions of systems evaluated previously). However, for the purposes of this simulation, we &amp;quot;administer&amp;quot; the second test to the same systems. The Rasch model is then applied to the Hard and Equating questions combined, while fixing the locations of the Equating questions to those derived while scaling the Easy set. The Winsteps software achieves this by shifting the locations in the Hard set to be consistent with the Equating set - but without adjusting the spacing of the questions in the Hard or Easy sets. If the assumptions of the Rasch model hold, then the Easy and Hard question sets will now behave as if their levels had been estimated simultaneously. Since the same systems are used in the simulation, and the questions have been calibrated to be on the same scale, the estimated system</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
abilities S
</SectionTitle>
      <Paragraph position="0"> s as derived from the Easy and Hard question sets should be statistically identical. That is, these two estimates should show a high linear correlation and they should have very similar means and standard deviations (see e.g., Wright and Stone, 1979, p. 83-126).</Paragraph>
      <Paragraph position="1"> Common wisdom in the Rasch scaling community holds that relatively few questions are needed to achieve satisfactory equating results. For this reason, the simulation study was performed three times, using Equating sets with 20, 30, and 50 questions, respectively (i.e., about 4, 6, and 10% of the total number of questions in the present study).</Paragraph>
    </Section>
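A minimal mean-shift version of the equating step (illustrative Python; Winsteps' anchored maximum-likelihood estimation is more sophisticated, but the translate-without-respacing idea is the same):

```python
# Equating sketch: difficulties from the second test (Hard + Equating items)
# are shifted so that the Equating items match, on average, the values they
# received when the Easy set was calibrated. Spacing is left untouched, so
# abilities estimated against either test land on a common Logit scale.
import numpy as np

def equate(Q_second: np.ndarray, anchor_idx: np.ndarray,
           Q_anchor_calibrated: np.ndarray) -> np.ndarray:
    """Translate second-test difficulties onto the first test's scale."""
    shift = (Q_anchor_calibrated - Q_second[anchor_idx]).mean()
    return Q_second + shift

# Hypothetical usage: the first 20 questions of the second test were already
# calibrated on the first test; their old values fix the overall origin.
# Q_equated = equate(Q_second, np.arange(20), Q_anchor_calibrated=Q_old)
```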
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Findings
</SectionTitle>
      <Paragraph position="0"> The simulation results are summarized in Table 2, whose rows reflect the sizes of the respective Equating sets (i.e., 20, 30, and 50). Each first sub-row reports the Rasch scaling results, while the second sub-row reports the raw-score (i.e., number correct) findings. The columns report a number of basic statistics, including the mean (M) and standard deviations (SD) of the Logit and raw-score values in the Easy and Hard sets, and the correlation (r linear ) between systems' estimated abilities based on the Easy and Hard sets.</Paragraph>
      <Paragraph position="1">  The major findings are as follows. First, inspection of the r linear columns indicates that Rasch equating consistently produced higher correlations between systems' estimated abilities as estimated via the Easy and Hard question sets than did the raw scores for each set. Second, for obvious reasons the raw-score estimates based on the Easy sets are considerably higher than those based on the Hard sets. However, Table 2 also shows that the standard deviations of the number correct estimates obtained for the Easy sets exceed those of the Hard sets as well (sometimes by over 100%). In other words, when raw scores (or percentages) are used, the &amp;quot;spacing&amp;quot; of the systems is affected by the difficulty of the questions.</Paragraph>
      <Paragraph position="2"> Third, the Rasch approach by contrast produces very similar means and standard deviations for the capability estimates based on the Easy and Hard question sets. This holds regardless of the size of the Equating sets. For instance, when 50 equating questions are used, the estimated abilities based on the Easy and Hard sets have nearly identical SD (i.e., 1.11 and 1.18 Logits, respectively). The corresponding averages for this case are -0.78 and -0.77 Logits, i.e., a standardized difference (effect size) of less than 0.01 SD. Similarly small effects sizes are obtained for the other cases. Further, given the superior values of r linear , it thus appears that Rasch equating provides an acceptable equating mechanism even when maximally different question sets are used. In fact, already for Equating sets of size 20 a correlation of 0.90 is produced.</Paragraph>
      <Paragraph position="3"> Fourth, as a check on the results, scatter plots of the various cases summarized in Table 2 were produced. The left panel of Figure 6 shows the Rasch capability estimates obtained for the Hard question set plotted against those for the Easy set, and it can be seen that these estimates are highly correlated (r linear = 0.94). The corresponding raw scores are plotted in the right panel of Figure 6. In addition to showing a lower correlation (r linear = 0.82), raw scores also clearly posses a non-linear component, and in fact the quadratic trend is highly significant (t quad = 13.10, p &lt; .001). Thus, in addition to being question-dependent, raw score and percentage based comparisons suffer from pronounced non-linearity.</Paragraph>
      <Paragraph position="4"> Despite the favorable results, we remind the reader that the above simulations represented a worse-case scenario. Indeed, more realistic simulations not reported here indicate that Rasch equating can further be improved by omitting misfitting questions and by using less extreme question sets.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>