<?xml version="1.0" standalone="yes"?> <Paper uid="M95-1004"> <Title>STATISTICAL SIGNIFICANCE OF MUC-6 RESULTS</Title> <Section position="3" start_page="0" end_page="39" type="metho"> <SectionTitle> STATISTICAL SIGNIFICANCE TESTING </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Method </SectionTitle> <Paragraph position="0"> The general method employed to analyze the MUC-6 results is the Approximate Randomization method described in [3]. It is a computer-intensive method that approximates the entire sample space in such a way as to allow us to determine the significance of the differences in F-Measures between each pair of systems and the confidence in that significance. The method was applied on the basis of a message-by-message shuffling of a pair of MUC systems' responses to rule out differences that could have occurred by chance and to give us a picture of how similarly the systems perform.</Paragraph> <Paragraph position="1"> The method sorts systems into like and unlike categories. The results are shown in the following three tables for Named Entity, Template Element, and Scenario Template. These three tasks all use the F-Measure as the single measure for systems, as defined in [4] and in the MUC-6 Test Scores appendix to these proceedings. The parameters of the F-Measure are set so that recall and precision are combined with equal weighting. Note that Coreference was not characterized by F or any other unified measure because of the linkages that were being evaluated.
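The equal-weighting combination referred to above is the standard F-Measure, F = ((beta^2 + 1) * P * R) / (beta^2 * P + R), with beta = 1. A minimal sketch in Python (the function name and signature are ours for illustration, not taken from the MUC scoring software):

```python
def f_measure(recall, precision, beta=1.0):
    """F-Measure combining recall and precision; beta == 1.0 gives
    the equal-weighting case used for Named Entity, Template Element,
    and Scenario Template in MUC-6."""
    if recall == 0.0 and precision == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this reduces to the harmonic mean of recall and precision, so a system cannot compensate for very low recall with very high precision, or vice versa.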
Of course, an F-Measure is calculable, but more research is necessary before we can conclude that it would combine recall and precision in a way that is meaningful for these evaluations.</Paragraph> <Paragraph position="2"> The statistical results reported here are based on the strictest cutoff point for significance level (0.01) and high confidence in the assigned level (at least 99%). What this method does not tell us is a numerical range within which F is not a significant distinguisher (such as plus or minus 3%); instead, it provides lists of similar systems. We have to be careful not to confuse the numerical order of the F-Measures with a ranking of systems and to look instead at the groupings on these charts. If a group or a single system is off by itself, then that group or single system is significantly different from its non-members. However, if there is overlap (and there is a lot of it in these results), then ranking the grouped systems is impossible. In addition, two similarly performing systems could use very different approaches to data extraction, so there may be some other value, not measured in MUC-6, that distinguishes these systems.</Paragraph> </Section> <Section position="2" start_page="0" end_page="39" type="sub_section"> <SectionTitle> Processing </SectionTitle> <Paragraph position="0"> To prevent human error, the entire process of doing the statistical analysis is automated. An awk program extracts the tallies that appear in the score report output by the scoring software and puts them in a file to be fed to the C program for approximate randomization. The C program recalculates the F-Measure, recall, and precision from the raw tallies at higher accuracy for use in the approximate randomization comparisons. The scoring program itself, written in Emacs Lisp, is slow and would be slowed further by higher-accuracy calculations.
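The message-by-message shuffling described here can be sketched as follows. This is a Python illustration of the approximate randomization procedure, not the original awk/C pipeline; the per-message tally layout (correct, actual, possible) and all function names are assumptions made for the sketch:

```python
import random

def f_from_tallies(correct, actual, possible):
    """Recompute recall, precision, and equal-weighted F directly
    from raw tallies, as the C program does."""
    recall = correct / possible if possible else 0.0
    precision = correct / actual if actual else 0.0
    if recall + precision == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def approx_randomization(tallies_a, tallies_b, shuffles=10000, rng=None):
    """Approximate randomization test on the difference in overall F
    between two systems. Each tallies_* argument is a list of
    per-message (correct, actual, possible) tuples; returns the
    estimated significance level (p-value)."""
    rng = rng or random.Random(0)

    def overall_f(tallies):
        c = sum(t[0] for t in tallies)
        a = sum(t[1] for t in tallies)
        p = sum(t[2] for t in tallies)
        return f_from_tallies(c, a, p)

    observed = abs(overall_f(tallies_a) - overall_f(tallies_b))
    at_least_as_large = 0
    for _ in range(shuffles):
        shuffled_a, shuffled_b = [], []
        for ta, tb in zip(tallies_a, tallies_b):
            # For each message, swap the two systems' responses
            # with probability 1/2.
            if rng.getrandbits(1):
                shuffled_a.append(tb)
                shuffled_b.append(ta)
            else:
                shuffled_a.append(ta)
                shuffled_b.append(tb)
        if abs(overall_f(shuffled_a) - overall_f(shuffled_b)) >= observed:
            at_least_as_large += 1
    # Standard add-one correction for an approximate (sampled) test.
    return (at_least_as_large + 1) / (shuffles + 1)
```

A small p-value means the observed F difference would rarely arise from chance reassignments of the per-message responses, so the two systems are placed in unlike categories; a large p-value places them in the same group.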
The statistical program outputs the significance and confidence levels in a matrix format for the analyst to inspect. Although 10,000 shuffles are carried out, the C program is fast. Results are depicted in lists of systems that are all equivalent, i.e., systems whose differences in score were due to chance.</Paragraph> </Section> <Section position="3" start_page="39" end_page="39" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> The results are reported in a tabular format. The row headings contain the F-Measures for the systems, and the rows are ordered from highest to lowest F. The columns are ordered in the same way as the rows, but the column headers contain the numerical rank of the F values rather than the F values themselves because of the size of the table on the page.</Paragraph> <Paragraph position="1"> To use the table, first determine which system you are interested in and identify its F-Measure in the left column; then look across the row or down the corresponding column to see which systems' F-Measures are not significantly different from it. The systems that make up that group can be considered to have gotten their different F-Measures just by chance.</Paragraph> <Paragraph position="2"> You can see, for instance, that among the Named Entity systems, the two lowest scoring systems are significantly different from each other and from all of the other systems. The two systems above them form a group whose members are significantly different from the other systems, but not from each other. A similar case appears in Template Element at the low and high ends of the scores. However, the important thing to note is that there is a large amount of overlap otherwise. The Scenario Template test shows even more overlap than the other two tasks.</Paragraph> </Section> </Section> </Paper>