File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/ackno/93/x93-1018_ackno.xml
Size: 9,145 bytes
Last Modified: 2025-10-06 13:51:57
<?xml version="1.0" standalone="yes"?> <Paper uid="X93-1018"> <Title>COMPARING HUMAN AND MACHINE PERFORMANCE FOR NATURAL LANGUAGE INFORMATION EXTRACTION: Results from the Tipster Text Evaluation</Title> <Section position="19" start_page="191" end_page="193" type="ackno"> <SectionTitle> ACKNOWLEDGEMENTS </SectionTitle> <Paragraph position="0"> The following persons contributed to the effort resulting in the human performance measurements reported here: Deborah Johnson, Catherine Steiner, Diane Heavener, and Mario Severino served as analysts for the English Microelectronics material, and Mary Ellen Okurowski made keys to allow comparison with all analysts. Susanne Smith served as a technical consultant on microelectronics fabrication. Beth Sundheim, Nancy Chinchor, and Kathy Daley helped in various ways, particularly with respect to the scoring program used. Nancy Chinchor also provided some statistical advice. Boyan Onyshkevych also helped in defining the problem and approaches to attacking it, and was a coauthor on some early presentations of pilot work on human analyst performance at the Tipster 12-month meeting in September 1992 and the National Science Foundation Workshop on Machine Translation Evaluation on November 2-3, 1992, both in San Diego. Mary Ellen Okurowski provided valuable discussions about the human performance work and comments on this paper.</Paragraph> <Paragraph position="1"> Larry Reeker helped with project management of the overall template collection effort, and provided comments on this paper.</Paragraph> <Paragraph position="2"> APPENDIX: Details of Statistical Measurements and Tests. Performance is expressed in terms of error per response fill, using the methodology described by Nancy Chinchor and Beth Sundheim [7] and implemented by the SAIC scoring program [8].
Error is defined in this methodology by the following formula: Error = (incorrect + (partial x 0.5) + missing + spurious) / (correct + partial + incorrect + missing + spurious), where each variable represents a count of the number of responses falling into each category. A correct response occurs when the response for a particular slot matches exactly the key for that slot. A partial response occurs when the response is similar to the key, according to certain rules used by the scorer. An incorrect response does not match the key. A spurious response occurs when a response is nonblank but the key is blank, while a missing response occurs when a response is blank but the key is nonblank. The scoring program is typically given a set of templates and provides an error score based on all slots in all of the templates. In this paper data is usually reported as means in terms of this error score. However, statistical parameters describing variability are estimated by having the scoring program generate scores for each template, even though the means of the data reported here are calculated across a set of templates. Only about 80% of templates produce an independent score, and only those templates are used in estimating statistical parameters. Thus, in many cases two Ns are given: the larger is the number of templates scored, and the smaller is the number of individual template scores used in estimating the variance, calculating the standard error and confidence intervals, and performing statistical tests.</Paragraph> <Paragraph position="3"> In the remainder of this Appendix, details are provided for data presented in each Figure, as indicated: Figure 2: In the &quot;primary&quot; and &quot;secondary&quot; conditions, 120 templates were scored, 30 for each analyst. In the &quot;other&quot; condition, 240 templates were scored, 60 for each analyst. The mean for the primary condition was statistically different from zero at a level of p &lt; .0001 (z=6.74).
The standard error of the mean for the primary condition was 2.30 and the 95% confidence interval (indicating that 95% of the time the true population mean can be found within this interval) was from 10.9 to 20.7. Of the 120 templates (each of which contributed to the score shown as the mean), 92 templates could be scored independently, and thus N=92 was used for statistical tests. The standard error of the mean for the secondary condition was 2.76 and the 95% confidence interval was from 21.5 to 32.5. N was 95. The means for the primary and secondary conditions are statistically different at a level of p &lt; .01 (t=3.19). The standard error of the mean for the &quot;other&quot; condition was 2.13, and the 95% confidence interval was from 33.2 to 41.6, with an N of 193. The means for the secondary and other conditions were significantly different (p &lt; .01, t=2.88).</Paragraph> <Paragraph position="4"> condition for the 4 analysts (A, B, C, and D, respectively) was as follows: 2.8, 2.6, 3.1, and 2.9. For the &quot;Other analysts&quot; condition: 3.4, 3.2, 3.7, and 3.3. For &quot;Independent analysts&quot;: 4.1, 3.7, 3.9, and 5.7. For &quot;5th analyst&quot;: 4.7, 4.5, 4.1, and 4.4.</Paragraph> <Paragraph position="5"> condition is 25.3, with a standard error of the mean of 1.45 and a 95% confidence interval from 22.4 to 28.2 (N=374). The mean across analysts in the &quot;Other Analysts&quot; condition is 29.8, with a standard error of the mean of 1.74 and a 95% confidence interval from 26.39 to 33.21 (N=283). The mean across analysts in the &quot;Independent Analysts&quot; condition is 33.2, with a standard error of the mean of 2.14 and a 95% confidence interval from 29.0 to 37.4 (N=190). The mean across analysts in the &quot;5th Analyst&quot; condition is 28.3, with a standard error of the mean of 2.24 and a 95% confidence interval from 24.0 to 32.6 (N=187).
The mean of the &quot;All analysts&quot; condition is significantly different from that of the &quot;Other analysts&quot; condition (t=2.00), while the mean of the &quot;Other analysts&quot; condition is not significantly different from that of the &quot;Independent analysts&quot; condition.</Paragraph> <Paragraph position="6"> Figure 7: In the &quot;All Analysts&quot; condition, analyst A had recall and precision scores of 84 and 86.5, respectively, analyst B 81 and 88.5, analyst C 82.5 and 85.5, and analyst D 82 and 86.5. In the &quot;Other Analysts&quot; condition, analyst A had recall and precision scores of 79 and 79, analyst B 72 and 81, analyst C 79 and 75, and analyst D 78 and 82. In the &quot;Independent Analysts&quot; condition, analyst A had recall and precision scores of 81 and 78, analyst B 72 and 83, analyst C 79 and 79, and analyst D 73 and 75, respectively. In the &quot;5th analyst&quot; condition, analyst A had recall and precision scores of 81 and 83, analyst B 69 and 80, analyst C 86 and 86, and analyst D 81 and 86, respectively.</Paragraph> <Paragraph position="7"> was 2.14, and the 95% confidence interval was from 29.0 to 37.4 (N=190). The mean error for system X (Vishnu) was 62%, with a standard error of 2.41 and a 95% confidence interval from 57.28 to 66.72. The mean error for system Y (Shiva) was 63%, with a standard error of 2.19 and a 95% confidence interval from 58.71 to 67.29, while the mean error for system Z (Brahma) was 68%, with a standard error of 2.27 and a 95% confidence interval from 63.55 to 72.45. The difference between the mean human scores and the mean for the best machine was statistically significant (p &lt; .001, t=8.58).</Paragraph> <Paragraph position="8"> Figure 12: The human analysts had recall and precision scores of 79 and 79%, 72 and 81%, 78 and 82%, and 79 and 75%, respectively. The three best machines, in contrast, had recall and precision scores of 45 and 57%, 53 and 49%, and 41 and 51%, respectively.
These data differ slightly from the official scoring for machine performance because they use only the 120-article subset, not the full 300-article test set. In addition, the official scoring of the machines used interactive scoring, in which human scorers were allowed to give partial credit for some answers, while this scoring was done noninteractively. Note, however, that noninteractive scoring was used for all data in this paper, so comparisons between humans and machines are possible. The use of noninteractive scoring for both machine and human data could bias the result slightly, because of the possibility that humans might be better at providing partially or fully correct answers that don't obviously match the key, but again the difference is likely to be slight.</Paragraph> <Paragraph position="9"> Figure 13: 120 templates were used in the &quot;Normal key&quot; condition, while 480 templates were used in the &quot;Orig. Coding&quot; condition in the calculation of the mean.</Paragraph> <Paragraph position="10"> Figure 14: 360 templates were used in the &quot;Normal key&quot; condition, and 1440 were used in the &quot;Orig. Coding&quot; condition in calculating the mean. 320 and 1287 were used, respectively, in calculating the error.</Paragraph> </Section> </Paper>