<?xml version="1.0" standalone="yes"?> <Paper uid="X93-1018"> <Title>COMPARING HUMAN AND MACHINE PERFORMANCE FOR NATURAL LANGUAGE INFORMATION EXTRACTION: Results from the Tipster Text Evaluation</Title> <Section position="4" start_page="179" end_page="181" type="metho"> <SectionTitle> THE PREPARATION OF TEMPLATES </SectionTitle> <Paragraph position="0"> The development of templates for the English Microelectronics corpus began in the fall of 1992. It began with an interagency committee that developed the basic template structure and, when ~t had evolved to the point that ~t was relatively stable, two experienced analysts were added to the project so that they could begin training toward eventual production work. About two months after that, two more production analysts joined the project.</Paragraph> <Paragraph position="1"> The template structure was object-oriented, with a template consisting of a number of objects, or template building blocks with related information. Each object consisted of slots, which could be either fields containing specific information or pointers, that is, references, to other objects. Slot fields can either be set fills, which are filled by one or more categorical choices defined for that field, or by string fills, in which text is copied verbatim from the original article. In cases of ambiguity, analysts could provide alternative fills for a given slot. In addition, analysts could include comments when desired to note unusual problems or explain why a particular coding was selected. Comments were not scored and were primarily used when analysts compared two or more codings of a given article to determine which was the more correct. For more information on the template design see \[4\]. Also see \[5\] for a discussion of selection of the articles in the corpora and preparation of the data, and \[6\] for a discussion of the different extraction tasks, domains, and languages.</Paragraph> <Paragraph position="2"> Previous experience with the English and Japanese Joint Ventures corpus had made it clear that producing templates with a high degree of quality and consistency is a difficult and time-consuming task, and we attempted to make the best use of what had been learned in that effort in producing templates for English Microelectronics with quality and consistency appropriate to both the needs of the project and the resources we had available.</Paragraph> <Paragraph position="3"> &quot;Quality&quot; refers to minimizing the level of actual error by each analyst. &quot;Error&quot; includes the following: (1) Analysts missing information contMned in or erroneously interpreting the meaning of an article; (2) Analysts forgetting or misapplying a fill rule; (3) Analysts misspelling a word or making a keyboarding (typographical) error or the analogous error with a mouse; and (4) Analysts making an error in constructing the object-oriented structure, such as failing to create an object, failing to reference an object, providing an incorrect reference to an object, or creating an extraneous object.</Paragraph> <Paragraph position="4"> &quot;Consistency&quot; refers to minimizing the level of legitimate analytical differences among different analysts. 
<Paragraph position="4"> &quot;Consistency&quot; refers to minimizing the level of legitimate analytical differences among different analysts. &quot;Legitimate analytical differences&quot; include the following: (1) Different interpretations of ambiguous language in an article; (2) Differences in the extent to which analysts were able or willing to infer information from the article that is not directly stated; and (3) Different interpretations of a fill rule and how it should be applied (or the ability or willingness to infer a rule if no rule obviously applies).</Paragraph> <Paragraph position="5"> To improve quality and consistency, three steps were taken:</Paragraph> <Section position="1" start_page="180" end_page="180" type="sub_section"> <SectionTitle> Development of Fill Rules </SectionTitle> <Paragraph position="0"> First, a set of relatively detailed rules for extracting information from articles and structuring it as an object-oriented template was developed (for English Microelectronics, a 40-page, single-spaced document). These rules were created by a group of analysts who met periodically to discuss problems and to agree on how to handle particular cases via a general rule. One person (who was not one of the production analysts) served as the primary person maintaining the rules. Because of the highly technical nature of the topic domain, an expert in microelectronics fabrication also attended the meetings and resolved many problems that required technical knowledge.</Paragraph> </Section> <Section position="2" start_page="180" end_page="180" type="sub_section"> <SectionTitle> Coding by Multiple Analysts </SectionTitle> <Paragraph position="0"> The second step was the development of a procedure in which two analysts participated in coding nearly half of the articles, and the reconciliation of different codings to produce final versions. For 300 articles in a &quot;high quality&quot; development set and for the 300 articles in the test set, the following procedure was followed: Two analysts first independently coded each article, with the resulting codings provided to one of these same analysts, who produced a final version, or &quot;key&quot;. The remaining 700 development templates were coded by only one analyst, with each of four analysts coding some portion of the 700 articles. The purpose of the two-analyst procedure was to correct inadvertent errors in the initial coding and to promote consistency, by allowing the final analyst to change his or her coding after seeing an independent coding by a different analyst. The procedure also promoted consistency in the long run by providing analysts with examples of codings made by other analysts so that they could see how other analysts handled a given problem. It also helped improve the fill rules by allowing analysts to detect recurring problems that could be discussed at a meeting and lead to a change in the fill rules.</Paragraph> </Section> <Section position="3" start_page="180" end_page="181" type="sub_section"> <SectionTitle> Software Support Tools </SectionTitle> <Paragraph position="0"> The third step was the development of software tools that helped analysts to minimize errors, detect certain kinds of errors, and support the process of comparing initial codings.</Paragraph>
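The kind of automatic checking these tools provided can be illustrated with a short sketch. The function below is a hypothetical reimplementation, not code from the Locke tool or the SAIC scorer described below: using the illustrative Template and Pointer classes sketched earlier, it reports pointers to missing objects and objects that are never referenced, two of the structural problems the checking facilities were designed to catch.

```python
from typing import List

# Uses the hypothetical Template / Pointer classes from the earlier sketch.

def check_template(template: "Template") -> List[str]:
    """Report structural problems of the kind the support tools flagged:
    pointers to objects that do not exist, and objects that nothing
    references (other than objects of a hypothetical top-level type)."""
    problems = []
    referenced = set()

    for obj in template.objects.values():
        for slot in obj.slots.values():
            for fill in slot.alternatives:
                if isinstance(fill, Pointer):
                    referenced.add(fill.object_id)
                    if fill.object_id not in template.objects:
                        problems.append(
                            f"{obj.object_id}.{slot.name}: pointer to "
                            f"missing object {fill.object_id}")

    for object_id, obj in template.objects.items():
        # Assume top-level objects carry a hypothetical type name "root".
        if object_id not in referenced and obj.object_type != "root":
            problems.append(f"{object_id}: object is never referenced")

    return problems
```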
<Paragraph position="1"> One such tool was the template-filling tool developed by Bill Ogden and Jim Cowie at New Mexico State University (known as Locke in the version designed for English Microelectronics). This tool, which ran on a Sun workstation and used the X Windows graphical user interface, provided an interface that allowed analysts to easily visualize the relationships among objects and thus avoid errors in linking objects together. The tool also allowed analysts to copy text from the original article by selecting it with a mouse and entering it verbatim into a template slot, thus eliminating keystroke errors. In addition, the Locke tool had checking facilities that allowed analysts to detect such problems as unreferenced or missing objects. A second tool was the Tipster scoring program (developed by Nancy Chinchor and Gary Dunca at SAIC \[8\]), which provided analysts making keys with a printout of possible errors and differences between the initial codings. Another program, written by Gerry Reno at the Department of Defense at Fort Meade, did final checking of the syntax of completed keys.</Paragraph> <Paragraph position="2"> The four analysts who coded templates all had substantial experience as analysts for U.S. government agencies. In all cases analysts making the keys were unaware of the identity of the analyst producing a particular coding. Analysts often claimed that they could identify the analyst coding a particular article by the comments included in the template coding or the number of alternative fills added, although when this was investigated further it appeared that they were not necessarily correct in their identifications.</Paragraph> <Paragraph position="3"> In addition to the templates and keys created for the development and test sets described above, a small number of codings and keys were made for the purpose of studying human performance on the extraction task. In February, 1993, at about the time of the 18-month Tipster evaluation, a set of 40 templates in the development set was coded by all analysts for this purpose. Similarly, for 120 templates of the 300-template test set, which was coded in June and July, 1993, extra codings were made by the two analysts who would not normally have participated in coding those articles, resulting in codings by all 4 analysts for 120 articles.</Paragraph> </Section> </Section> <Section position="5" start_page="181" end_page="181" type="metho"> <SectionTitle> INFLUENCE OF ANALYSTS PLAYING DIFFERENT ROLES ON KEY </SectionTitle> <Paragraph position="0"> The two-analyst procedure for making keys used for English Microelectronics was intended as an efficient compromise between the extremes of using a single analyst and the procedure that had been used for English Joint Ventures, in which two analysts would independently make codings and provide the codings to a third analyst who would make a key.</Paragraph> <Paragraph position="1"> It is of interest to know whether this form of checking is achieving its intended result--that of improving the quality and consistency of templates. We can investigate this indirectly by measuring the level of influence analysts playing different roles (and producing different codings) have on the key that is produced. The question of influence is also of interest--as will be seen in the next section--for its implications in understanding the characteristics of different methods for measuring human performance.</Paragraph> <Paragraph position="2"> To investigate this influence, data from the set of 120 templates that all analysts coded was analyzed separately based on the role in the production of the key played by the particular coding. Figure 1 shows the relationship between different codings and the analysts producing them. Analyst 1 produces, based on the original article, what will be called here the primary coding of the article. Analyst 2 independently produces the secondary coding of the article. 
The primary and secondary codings are then provided to Analyst 1, who produces the key. Analysts 3 and 4 also produce other codings of the article that have no effect on the key. Each analyst plays a particular role (Analyst 1, 2, 3, or 4) for 30 of the 120 templates.</Paragraph> <Paragraph position="3"> Note that when Analyst 1 uses the primary and secondary codings in making the final coding, or key, there is a natural bias toward the primary coding. This is primarily because Analyst 1 created that coding, but also because the analyst typically does not create the key from scratch with the Locke tool, but modifies the primary coding (probably reinforcing the tendency to use the primary coding unless there is a significant reason for changing it).</Paragraph> <Paragraph position="4"> Figure 2 shows the results of this analysis, with performance expressed in terms of error per response fill, as calculated with the methodology described by Nancy Chinchor and Beth Sundheim \[7\] and implemented by the SAIC scoring program \[8\]. All objects in the template were scored, and scoring was done in &quot;key to key&quot; mode, meaning that both codings were allowed to contain alternatives for each slot.</Paragraph> </Section> <Section position="6" start_page="181" end_page="182" type="metho"> <SectionTitle> [Figure label fragments: Analyst 1, Analyst 2, Secondary Coding] </SectionTitle> <Paragraph position="0"> [Figure 2 caption (fragment): ... upon Role of Coding in Making Keys for the 120 Articles Coded by All Analysts. (See the Appendix for details of producing the error scores and calculation of statistical parameters and tests.)] </Paragraph> <Paragraph position="2"> The data is shown for three conditions, with each condition reflecting the accuracy, in terms of percent error, of a coding playing a particular role (or no role) in producing a key. The conditions are: (1) error for the primary coding when measured against the key (shown as an unfilled vertical bar); (2) error for the secondary coding when measured against the key (shown as a light gray vertical bar); and (3) error for other codings when measured against the key (shown as a dark gray vertical bar).</Paragraph> <Paragraph position="3"> Also shown for all conditions in the form of error bars is the standard error of the mean. Because the mean shown is calculated from a small sample, it can be different from the desired true population mean, with the sample mean only the most likely value of the population mean. The standard error bars show the range expected for the true population mean. That mean can be expected to be within the error bars shown 68% of the time. See the Appendix for details about how the standard error of the mean was calculated.</Paragraph> <Paragraph position="4"> Figure 3 makes clear the role of different codings: the primary coding is made by the analyst who also later made the key, while the secondary and other codings were made by other analysts. The primary and secondary codings are both used in making the key, while the other codings are not used.</Paragraph> </Section>
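For reference, the scoring quantities used throughout this paper can be written out as follows. These formulas reflect our reading of the MUC-5 metrics defined by Chinchor and Sundheim \[7\]; the abbreviations COR, PAR, INC, MIS, and SPU (counts of correct, partially correct, incorrect, missing, and spurious fills) are introduced here only for compactness, and \[7\] should be consulted for the authoritative definitions. The standard error of the mean is the usual estimate from a sample of size n with sample standard deviation s.

```latex
\begin{align*}
\mathrm{error\ per\ response\ fill} &=
  \frac{\mathrm{INC} + \tfrac{1}{2}\,\mathrm{PAR} + \mathrm{MIS} + \mathrm{SPU}}
       {\mathrm{COR} + \mathrm{PAR} + \mathrm{INC} + \mathrm{MIS} + \mathrm{SPU}} \\[4pt]
\mathrm{recall} &=
  \frac{\mathrm{COR} + \tfrac{1}{2}\,\mathrm{PAR}}
       {\mathrm{COR} + \mathrm{PAR} + \mathrm{INC} + \mathrm{MIS}} \\[4pt]
\mathrm{precision} &=
  \frac{\mathrm{COR} + \tfrac{1}{2}\,\mathrm{PAR}}
       {\mathrm{COR} + \mathrm{PAR} + \mathrm{INC} + \mathrm{SPU}} \\[4pt]
\mathrm{SEM} &= \frac{s}{\sqrt{n}}
\end{align*}
```

A sample mean plus or minus one SEM covers the true population mean roughly 68% of the time, which is the sense in which the error bars in Figure 2 are drawn.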
<Section position="7" start_page="182" end_page="182" type="metho"> <SectionTitle> [Figure 2 bar labels: Primary, Secondary, Other] </SectionTitle> <Paragraph position="0"> The result (in Figure 2) that the primary coding when compared to the key shows a mean error considerably above zero indicates that analysts quite substantially change their coding from their initial version in producing the key. Presumably, this results from the analyst finding errors or more desirable ways of coding, and means that quality and consistency are improved in the final version. (All differences claimed here are statistically significant--see the Appendix for details.)</Paragraph> <Paragraph position="1"> The result that the secondary coding when compared to the key shows a mean error that is substantially above that of the primary coding condition indicates that the analyst's original (primary) coding does in fact influence the key more strongly than does the secondary coding (produced by another analyst). At the same time, it is clear that the secondary coding does itself substantially influence the key, since the mean error for the secondary coding is substantially less than that for the &quot;other&quot; codings, which are not provided to the analyst making the key and thus have no influence on it. This provides good evidence that analysts are indeed making use of the information in the secondary coding to a substantial extent. This probably resulted in an improvement in both quality and consistency of the templates above what would be the case if only a single coder (even with repeated checking) was used, although we do not have direct evidence of such improvement and the extent of its magnitude is not clear.</Paragraph> </Section> <Section position="8" start_page="182" end_page="185" type="metho"> <SectionTitle> METHODS FOR SCORING HUMAN PERFORMANCE </SectionTitle> <Paragraph position="0"> Before human performance for information extraction can be effectively compared with machine performance, it is necessary to develop a method for scoring responses by human analysts.</Paragraph> <Paragraph position="1"> The problem of measuring machine performance has been solved in the case of the MUC-5 and Tipster evaluations by providing (1) high-quality answer keys produced in the manner described in the previous section; and (2) a scoring methodology and associated computer program.</Paragraph> <Paragraph position="2"> The primary additional problem posed when attempting to measure the performance of humans performing extraction is that of &quot;who decides what the correct answer is?&quot; In the case of the English Microelectronics analysts, the four analysts whose performance we are attempting to measure became--once they had substantial training and practice--the primary source of expertise about the task, with their knowledge and skill often outstripping that of others who were supervising and advising them. This made it especially difficult to measure the performance of particular analysts.</Paragraph> <Paragraph position="5"> We approached the problem of determining the best method for scoring humans empirically: We compared experimentally four different methods for scoring codings by human analysts. In general, the criteria are objectivity, statistical reliability, and a perhaps difficult-to-define &quot;fairness&quot;, or plausibility of making appropriate comparisons, both between different human analysts and between humans and machines.</Paragraph> <Paragraph position="6"> In evaluating different scoring methods, the 120 templates in the Tipster/MUC-5 test set that had been coded by all four analysts were used. As was described in the previous section, keys for each template in this set were made by one of the analysts, using as inputs codings done independently by the analyst making the key and one other analyst. 
Each of the 4 analysts made keys for 30 of the 120 templates and also served as the independent analyst providing a coding to the analyst making the keys for a different set of 30 templates.</Paragraph> <Paragraph position="7"> In addition, for 60 of the 120 templates, a fifth analyst made a second key from codings made by the four analysts.</Paragraph> <Paragraph position="8"> Figure 4 shows data comparing the four scoring methods for each of the four analysts. The data is shown in terms of percent error, with all objects scored, and with &quot;key to response&quot; scoring being used. In key to response scoring, alternative fills for slots are allowed only in the key, but not in the coding being scored. Since the analysts in the data collected here did normally add alternative fills (their goal was to make keys), these alternatives were removed before scoring, with the first alternative listed assumed to be the most likely one and thus kept, and the others deleted. The purpose of using key-to-response scoring was so that the resulting data could be directly compared with data from machine systems, which produced only one fill for each slot. Scoring was done in batch mode, meaning that human analysts were not used (as they are in interactive mode) to judge cases in which strings did not completely match.</Paragraph> <Paragraph position="9"> In the &quot;All Analysts&quot; condition, all 120 templates made by each analyst were scored, using as keys those made by all 4 analysts (including the analyst being scored). In the &quot;Other Analysts&quot; condition, only templates with keys made by analysts other than the analyst being scored were used in scoring each analyst (so that 90 of the 120 templates coded by each analyst were scored). In the &quot;Independent Analysts&quot; condition, only templates for which the analyst being scored neither made the key nor produced a coding that was used as an input for making the key were used in scoring each analyst. (This resulted in between 30 and 80 templates being scored for each analyst, depending upon the analyst.) In the &quot;5th Analyst&quot; condition, a 5th analyst made the answer keys (with 60 templates scored in this condition).</Paragraph> <Paragraph position="10"> This 5th analyst did not code production templates but was in charge of maintaining the fill rules and the overall management of the English Microelectronics template coding effort.</Paragraph>
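The four scoring conditions just described differ only in which templates are admitted when a particular analyst is scored. The sketch below restates that selection logic as code; the record fields (key_maker, secondary, has_fifth_analyst_key) are hypothetical names for the bookkeeping described above and are not part of the SAIC scoring program.

```python
from dataclasses import dataclass

@dataclass
class TemplateRecord:
    doc_id: str
    key_maker: str                 # analyst who produced the key (also the primary coder)
    secondary: str                 # analyst whose coding was given to the key maker
    has_fifth_analyst_key: bool    # True for the 60 templates with a 5th-analyst key

def counts_for_condition(rec: TemplateRecord, scored: str, condition: str) -> bool:
    """Decide whether this template is scored for analyst `scored`
    under each of the four conditions discussed in the text."""
    if condition == "all_analysts":
        # every template, with keys made by any of the four analysts
        return True
    if condition == "other_analysts":
        # exclude templates whose key was made by the analyst being scored
        return rec.key_maker != scored
    if condition == "independent_analysts":
        # exclude templates where the scored analyst made the key
        # or supplied the secondary coding used to make it
        return scored not in (rec.key_maker, rec.secondary)
    if condition == "fifth_analyst":
        # only the 60 templates for which the 5th analyst made a second key
        return rec.has_fifth_analyst_key
    raise ValueError(f"unknown condition: {condition}")
```

Filtering the 120 templates this way reproduces the sample sizes quoted above (120, 90, roughly 30 to 80, and 60 templates per analyst, respectively), given the role rotation described earlier.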
<Paragraph position="11"> The &quot;All Analysts&quot; condition showed the most consistent performance across analysts, with a variance calculated from the means for each analyst of 1.82 (N=4). The &quot;Other Analysts&quot; condition was nearly as consistent, with a variance of 3.16. The &quot;Independent Analysts&quot; and &quot;5th Analyst&quot; conditions were much less consistent, with variances of 9.08 and 30.19, respectively. The high variance of the &quot;Independent Analysts&quot; condition, however, resulted only from the performance of analyst D, who had a very small sample size, only 30 templates. If analyst D is left out, the variance becomes only 0.32 for this condition. The high variability across analysts for the 5th analyst could be a result either of the small sample size or, more likely, of a tendency for the 5th analyst to code articles in a manner more similar to some analysts (especially analyst C) than others (especially analyst B).</Paragraph> <Paragraph position="12"> The subjective opinions of the analysts and their co-workers suggested that all English Microelectronics analysts here were generally at the same level of skill, which is consistent with the above data. (This was not true of the English Joint Ventures analysts, for example, where both the data and the opinions of analysts and others suggested considerable variability of skill among analysts.) However, it should be noted that all of the conditions in which analysts are being scored by other analysts run the risk of making the differences among analysts artificially low. Consider, for example, the case of a very skilled analyst being scored against a key made by an analyst who is poorly skilled. The more skilled analyst is likely to have some correct responses scored incorrectly, while a less-skilled analyst may have his or her incorrect responses scored as correct. However, the patterns of analyst skill elicited by the different scoring methods do not show any reliable evidence of such differences in skill, and it appears that analysts have similar levels of skill and that any effect of a &quot;regression toward the mean&quot; of mean analyst scores is minimal.</Paragraph> <Paragraph position="13"> Figure 5 shows the same data comparing scoring methods that was shown in the previous figure, but in this figure the data has been combined to show means for all analysts in each of the scoring conditions. This combining allows the overall differences among the different scoring methods to be seen more clearly. In addition, combining the data in this way increases the reliability of the overall mean.</Paragraph> <Section position="1" start_page="184" end_page="185" type="sub_section"> <SectionTitle> All Analysts </SectionTitle> <Paragraph position="0"> Figure 6 shows a summary of the characteristics of the different scoring methods as discussed above. The &quot;All Analysts&quot;, &quot;Other Analysts&quot;, and &quot;Independent Analysts&quot; methods all use the expertise of the most practiced (production) analysts to make the key, while the &quot;5th Analyst&quot; method uses an analyst whose expertise is somewhat more questionable because of putting much less time into actually coding articles. The &quot;All Analysts&quot; method appeared to have substantial bias (producing artificially low error scores) from analysts scoring their own codings, while the &quot;Other Analysts&quot; method appeared to produce some (but less) such bias. Neither the &quot;Independent Analysts&quot; nor the &quot;5th Analyst&quot; method suffered from this kind of bias. The &quot;All Analysts&quot;, &quot;Other Analysts&quot;, and &quot;Independent Analysts&quot; methods are unbiased with respect to particular analysts (because of counterbalancing to control for this), but the &quot;5th Analyst&quot; method appeared to have some bias, presumably because of a coding style more similar to some analysts than others.</Paragraph> <Paragraph position="2"> Finally, the &quot;All Analysts&quot;, &quot;Other Analysts&quot;, and &quot;Independent Analysts&quot; methods had relatively high statistical reliability, while the &quot;5th Analyst&quot; method had much less reliability.</Paragraph> <Paragraph position="3"> Figure 7 shows a Recall-Precision scatterplot for the four analysts and for each of the four conditions shown in Figure 4. Analysts scored by the &quot;All Analysts&quot; method are shown as solid circles, while analysts scored by the &quot;Other Analysts&quot; method are shown as solid triangles. Analysts scored by the &quot;Independent Analysts&quot; method are shown as deltas, and analysts scored by the &quot;5th Analyst&quot; method are shown as solid squares. 
Note that only the upper right-hand quadrant of the usual 0-100% recall-precision graph is shown.</Paragraph> <Paragraph position="4"> Performance is expressed in terms of recall and precision, which are measures borrowed from information retrieval that allow assessment of two independent aspects of performance. Recall is a measure of the extent to which all relevant information in an article has been extracted, while precision is a measure of the extent to which information that has been entered into a template is correct. Details of the method for calculating recall-precision scores for the Tipster/MUC-5 evaluation can be found in the paper by Chinchor and Sundheim \[7\].</Paragraph> </Section> </Section> <Section position="9" start_page="185" end_page="186" type="metho"> <SectionTitle> THE DEVELOPMENT OF ANALYST SKILL </SectionTitle> <Paragraph position="1"> In interpreting the levels of performance shown by analysts for the extraction task, and, particularly, when comparing human performance with that of machines, it is important to know how skilled the analysts are compared to how they might be with additional training and practice. Comparing machine performance with humans who are less than fully skilled would result in overstating the comparative performance of the machines.</Paragraph> <Paragraph position="2"> Four analysts were used in production template coding, all having had experience as professional analysts. One analyst had about 6 years of such experience, another 9 years of experience, a third 10 years of experience, and the fourth about 30 years of experience. All were native speakers of English. None of the analysts had any expertise in microelectronics fabrication.</Paragraph> <Paragraph position="3"> We compared the skill of analysts at two different stages in their development by analyzing two sets of templates, each coded at a different time. The first, or &quot;18 month&quot;, set was coded in early February, 1993, at about the same time as the 18 month Tipster machine evaluation, after analysts had been doing production coding for about 3 months. The second, or &quot;24 month&quot;, set was coded in June and July, 1993, somewhat before the 24 month Tipster/MUC-5 machine evaluation, and toward the end of the template coding process, when fill rules were at their most developed stage and analysts at their highest level of skill. 
There was some difference in expected skill between the two pairs of analysts, since one pair (analysts A and B) had begun work in September, although their initial work primarily involved coding of templates on paper and attending meetings to discuss the template design and fill rules, and during this period they did not code large numbers of templates. The second pair began work in November, and did not begin production coding of templates until a few weeks after the first analysts. Data for the 18 month condition was produced by first having all analysts code all articles of a 40-article set in the development set. Each analyst was then provided with printouts of all 4 codings for a set of 10 articles, and asked to make keys for those articles. Data for the 24 month condition was produced as described previously for the &quot;All Analysts&quot; condition, using the 120 templates that all analysts coded in the test set, with each of the 4 analysts making keys for 30 of the 120 templates. Note that analysts making the keys in the 18 month condition used as inputs the codings of all 4 analysts, while analysts making the keys in the 24 month condition used as inputs the codings of only 2 analysts. In both conditions and for all analysts, &quot;key to key&quot; scoring was used, in which all alternatives in both codings were used in determining the differences between template codings.</Paragraph> <Paragraph position="4"> Figure 8 shows data, in terms of percent error, for each of the two pairs of analysts in both the 18 month and 24 month conditions. The pairing of analysts is based on when they started work, with analysts A and B (&quot;Early Starting Analysts&quot;) beginning work on the project before analysts C and D (&quot;Later Starting Analysts&quot;). Note that analysts who started early appeared to make slightly fewer errors in the 18 month condition (27%) than in the 24 month condition (28.3%), although the difference is not statistically significant.</Paragraph> <Section position="1" start_page="185" end_page="186" type="sub_section"> <SectionTitle> Who Started Early </SectionTitle> <Paragraph position="0"> This difference can be explained at least in part by the difference in the method of making the keys. In the 18 month condition, all 4 analyst codings influenced the key, while in the 24 month condition only 2 of the analyst codings influenced the key. This results in the 18 month condition producing scores that are artificially low in terms of errors, compared to the 24 month condition. The difference, based on the data in Figure 5, can be estimated at from about 4 to 8 percentage points. Thus, it appears that analysts who started early did not improve their skill, or improved it minimally, between the 18 and 24 month tests. However, analysts who started later did appear to learn substantially, with error scores of 36% in the 18 month condition and 30.5% in the 24 month condition, with the amount of learning for the analysts who started later probably somewhat more than shown because of the difference in the method of making keys. Because of the differences in scoring methods between the conditions and the small sample size in the 24 month condition, the above results are only suggestive.</Paragraph>
</Section> <Section position="2" start_page="186" end_page="186" type="sub_section"> <SectionTitle> Two Matrices </SectionTitle> <Paragraph position="0"> An alternative approach to assessing the development of skill of analysts (one that does not require keys for scoring) compares the pattern of disagreement among analysts for the 18-month and 24-month tests, and is more convincing. Such a pattern can be constructed by running the scoring program for a given set of templates in &quot;key to key&quot; mode (so that it calculates a measure of disagreement between two codings) for all pairs of codings for the four analysts.</Paragraph> <Paragraph position="1"> Figure 9 shows such patterns, termed here &quot;Disagreement Matrices&quot;, for the 18- and 24-month tests, along with a third &quot;Difference&quot; matrix (shown at the far right) created by subtracting scores in the 24-month matrix from those in the 18-month one, resulting in a measure of the extent to which consistency between particular analysts has improved. Note that all cells of the Difference matrix have positive scores, indicating that consistency between all pairs of analysts has increased.</Paragraph> <Paragraph position="2"> Figure 10 shows comparisons of the scores from the Difference matrix for three specific cases of analyst pairs. For the Early Starting Analysts pair (A and B), shown at the far left, consistency between the two analysts increased by only one percentage point, suggesting that even at the 18-month test, these analysts were already near their maximum level of skill. For the Later Starting Analysts (C and D), however, shown at the far right, consistency between the two analysts increased by 6 percentage points, indicating that these analysts were still developing their skill. For the case where the mean of all pairs of early-late analysts (AC, AD, BC, and BD) is calculated, shown as the middle vertical bar, consistency increased by an average of 4.375 percentage points, indicating that the less-skilled analysts had increased their consistency with the more-skilled analysts.</Paragraph> <Paragraph position="3"> The general finding here is that (1) the analysts who started earlier improved their skill minimally from the 18 to 24 month tests; and (2) analysts who started later improved their skill considerably. Because by the time of the 24 month test the later starting analysts had as much or more practice coding templates as the early starting analysts had at the time of the 18 month test, it is reasonable to assume that their increase in skill reflects an early part of the learning curve and that by the 24 month test all analysts had started to reach an asymptotic level of skill.</Paragraph>
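The disagreement-matrix construction described above reduces to running a key-to-key comparison over all pairs of analysts and then differencing the two resulting matrices. The sketch below assumes a hypothetical keytokey_error(analyst_a, analyst_b, test_set) function standing in for a run of the scoring program; the matrix-building and differencing steps are the point of the illustration, not the scorer itself.

```python
from typing import Callable, Dict, List, Tuple

Matrix = Dict[Tuple[str, str], float]

def disagreement_matrix(
    analysts: List[str],
    test_set: str,
    keytokey_error: Callable[[str, str, str], float],
) -> Matrix:
    """Pairwise key-to-key error (a disagreement measure) between all
    pairs of analysts for one test (e.g. the 18-month or 24-month set)."""
    return {
        (a, b): keytokey_error(a, b, test_set)
        for i, a in enumerate(analysts)
        for b in analysts[i + 1:]
    }

def difference_matrix(m18: Matrix, m24: Matrix) -> Matrix:
    """Subtract 24-month disagreement from 18-month disagreement;
    positive cells indicate that consistency between a pair improved."""
    return {pair: m18[pair] - m24[pair] for pair in m18 if pair in m24}
```

With four analysts this yields the six analyst pairs (AB, CD, and the four early-late pairs) discussed in connection with Figures 9 and 10.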
<Paragraph position="4"> The evidence in the literature for the development of skill in humans suggests that skill continues to develop, if slowly, for years or decades even on simple tasks, and it can be expected that continued practice on information extraction by these analysts would increase their level of skill. However, it does appear that the analysts were very highly skilled by the end of the study and were at an appropriate level of skill for comparison with machine performance.</Paragraph> </Section> </Section> <Section position="10" start_page="186" end_page="187" type="metho"> <SectionTitle> COMPARISON OF HUMAN AND MACHINE PERFORMANCE </SectionTitle> <Paragraph position="0"> The most critical question in the Tipster/MUC-5 evaluation is that of how the performance of the machine extraction systems compares with that of humans performing the same task.</Paragraph> <Paragraph position="1"> Figure 11 shows mean performance, in percent error, for the 4 human analysts, using the &quot;Independent Analysts&quot; condition discussed in a previous section and shown in Figure 5, for the 120 articles coded by all analysts from the English Microelectronics test set. Also shown is the corresponding machine performance for the same articles for the three machine systems in the Tipster/MUC-5 evaluation that had the best official scores for English Microelectronics.</Paragraph> <Paragraph position="2"> The differences are very clear, with the mean error for human analysts about half that of the machine scores.</Paragraph> </Section> <Section position="11" start_page="187" end_page="187" type="metho"> <SectionTitle> [Figure 11 labels: Mean Human Analysts; Independent Analysts Made Keys; All Analysts Made Keys] </SectionTitle> </Section> <Section position="12" start_page="187" end_page="187" type="metho"> <SectionTitle> [Figure 11 label: 3 Best Machines] </SectionTitle> <Paragraph position="0"> Both the human and machine scores are highly reliable, as is shown by the standard error bars.</Paragraph> <Paragraph position="1"> Figure 12 shows essentially the same data expressed in terms of recall and precision.</Paragraph> <Paragraph position="2"> What is surprising about this data is not that the machines have a seemingly rather high error rate, but that the human rate is so high. The recall-precision diagram suggests that machines can have even more similar performance to humans on either recall or precision, if one is willing to trade off the other to achieve it. Machine performance is likely to be at least somewhat better than this in a real system, since resource constraints forced developers to run incomplete systems (that, for example, did not fill in slots for which information was infrequently encountered).</Paragraph> <Paragraph position="3"> The performance data shown in the figure, other data, and the subjective accounts of individual analysts and their co-workers support the general conclusion that for this group of analysts the level of skill for information extraction was very similar for each analyst. This uses the &quot;Other Analysts&quot; scoring method, with recall and precision scores for individual analysts not particularly meaningful for the otherwise more reliable &quot;Independent Analysts&quot; condition. 
(See Figure 7 for recall and precision scores for all scoring conditions).</Paragraph> </Section> <Section position="13" start_page="187" end_page="188" type="metho"> <SectionTitle> EFFECT OF METHOD OF KEY PREPARATION ON MACHINE PERFORMANCE </SectionTitle> <Paragraph position="0"> A practical consideration in evaluating machine performance, of importance for future evaluations (such as MUC-6), is the extent to which it is necessary or desirable to use elaborate checking schemes to prepare test templates, or whether templates prepared by a single analyst will serve as well.</Paragraph> <Paragraph position="1"> In an attempt to provide some data relevant to this issue, the performance of the three best machine systems was measured using two different sets of keys. In one condition (&quot;Normal Key&quot;), the keys used for evaluating the machines were those normally used in the 24 month evaluation for the 120-article set: templates were coded by all analysts, and a checked version was produced by a particular analyst using the codings of multiple analysts. In the other condition (&quot;Orig. Coding&quot;), the keys used for evaluating the machines were the original unchecked templates coded by all 4 analysts. Figure 13 shows the resulting data for both conditions for each of the three machines. For all machines, there is little difference (and none that is significant) in performance between the two conditions.</Paragraph> <Paragraph position="2"> Figure 14 shows the same data combined across the three machines. Again, there is no significant difference, and because of the large sample size and resulting small standard error, the result is highly reliable. This finding may seem surprising given the results presented earlier that show substantial differences between original and (checked) final codings. The difference can be explained by the relative precision involved. Comparisons between original and final codings by analysts might be seen as analogous to different shades of colors: if an original analyst codes light green, while a second analyst produces a checked version of dark green, a measure of differences may show a substantial magnitude. At the same time, the machines may be producing codings ranging from blue to orange. While comparing light green with orange may yield considerable differences, it is plausible that there may be little or no difference between the magnitudes resulting when orange is compared first with light green and then with dark green. 
It can be expected that as machine performance improves, there will be an increasing difference between evaluations using original and checked keys.</Paragraph> </Section> <Section position="14" start_page="188" end_page="189" type="metho"> <SectionTitle> AGREEMENT ON DIFFICULTY OF PARTICULAR TEMPLATES </SectionTitle> <Paragraph position="0"> The extent to which different analysts (and machines) agree on which templates are difficult and which are easy is of interest in understanding the task and human and machine performance for the task.</Paragraph> <Paragraph position="1"> This was measured first by obtaining scores for different analysts for particular templates, and calculating Pearson product-moment correlation coefficients for corresponding templates between pairs of analysts.</Paragraph> <Paragraph position="2"> Figure 15 shows these correlations, with correlations between the 4 analysts shown at the far left, correlations between the 3 best machines shown in the center, and correlations between randomly selected pairs of humans and machines shown at the far right.</Paragraph> <Paragraph position="3"> Correlations among humans were relatively low, with R^2 from .04 to .20 (median at 0.13). Correlations among machines were moderate, with R^2 from .21 to .44.</Paragraph> <Paragraph position="4"> Correlations between a particular human and a particular machine were low to moderate, with R^2 from .07 to .21.</Paragraph> <Section position="1" start_page="189" end_page="189" type="sub_section"> <SectionTitle> Individual Analysts and Machines </SectionTitle> <Paragraph position="0"> Agreement was also examined by dividing a set of 93 templates into two groups, either &quot;easy&quot; or &quot;hard&quot;. The division was made by first calculating an error score for each template when one analyst is measured against another as a key. This was done for all 93 templates for 2 pairs of analysts, with the mean difference calculated for both pairs for each template. The templates were then divided into &quot;easy&quot; and &quot;hard&quot; groups, with the &quot;easy&quot; group consisting of those templates with the lower mean difference scores and the &quot;hard&quot; group consisting of those templates with the higher scores.</Paragraph> <Paragraph position="1"> This was intended as a way of constructing a simulation of a corpus and task that was easier (and harder) than the Tipster task, which was viewed by many in the Tipster project as excessively demanding. The hypothesis was that the machines might do comparatively better than humans on the &quot;easy&quot; set than on the &quot;hard&quot; set.</Paragraph>
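The easy/hard split described above can be restated as a short procedure. The sketch below assumes a hypothetical pairwise_error(template_id, analyst_a, analyst_b) function standing in for a key-to-key run of the scoring program, and it assumes a median split into the two groups (the text specifies only that templates with lower mean difference scores formed the "easy" group); it is an illustration of the splitting logic, not the code actually used.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Tuple

def split_easy_hard(
    template_ids: List[str],
    analyst_pairs: List[Tuple[str, str]],
    pairwise_error: Callable[[str, str, str], float],
) -> Tuple[List[str], List[str]]:
    """Split templates into 'easy' and 'hard' groups based on how much
    pairs of analysts disagree on each template (key-to-key error)."""
    # Mean inter-analyst error per template, over the supplied analyst pairs.
    difficulty: Dict[str, float] = {
        t: mean(pairwise_error(t, a, b) for a, b in analyst_pairs)
        for t in template_ids
    }
    cutoff = median(difficulty.values())
    easy = [t for t in template_ids if difficulty[t] <= cutoff]
    hard = [t for t in template_ids if difficulty[t] > cutoff]
    return easy, hard
```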
<Paragraph position="2"> The results are shown in Figure 16 for two analysts and two machines. The opposite of the expected (and hoped-for) outcome appeared to be the case. The human analysts produced roughly twice as many errors on the &quot;hard&quot; set of templates as on the &quot;easy&quot; set, while the machines were only somewhat better on the &quot;easy&quot; versus the &quot;hard&quot; set. Figure 17 shows the same data, but in terms of the means for each pair of humans and machines. In addition, data (at the far right) is presented in which the mean machine error is divided by the mean human error for each set, thus normalizing the difference. Here the comparative difference between machine and human was much larger for the &quot;easy&quot; set compared with the &quot;hard&quot; set.</Paragraph> <Paragraph position="3"> Whether this method allows a realistic simulation of the effects of difficulty of the text and task is unclear, and the meaning of this data is hard to interpret. It would be valuable to develop test sets for future MUC and Tipster evaluations that could effectively assess the effect and nature of text and task difficulty.</Paragraph> </Section> </Section> <Section position="15" start_page="189" end_page="190" type="metho"> <SectionTitle> [Figure 17 labels: Mean Humans; Mean Machines; Mean Machine / Mean Human] </SectionTitle> </Section> <Section position="16" start_page="190" end_page="190" type="metho"> <SectionTitle> COMPARING HUMAN AND MACHINE PERFORMANCE FOR SPECIFIC INFORMATION </SectionTitle> <Paragraph position="0"> A particularly interesting question about human and machine performance is that of how the two compare for different aspects of the extraction task. Such differences are most easily seen by comparing human and machine performance on different slots in a template object or on an entire object.</Paragraph> <Paragraph position="1"> This issue was investigated by making use of a 60-template subset of the MUC-5/Tipster test set that was coded by all analysts and for which a key was made by the 5th analyst.</Paragraph> <Paragraph position="2"> The scoring program, in addition to calculating overall scores for templates or sets of templates, also provides scores for each individual slot and object in the template. Scores for each of the four human analysts and the three best machine systems were obtained for each object and slot using the scoring program. Only those objects and slots with at least 10 examples of nonblank responses in the keys were further scored.</Paragraph> <Paragraph position="3"> Because of the wide disparity between scores for humans and for machines, the data representing performance on each slot and object were normalized in the following manner: First, performance for each slot and object for the 4 human analysts was averaged by calculating a mean error for each slot and object. A rank order score was then assigned to each slot and object, reflecting the lowest to highest comparative performance for humans for that slot or object.</Paragraph> <Paragraph position="4"> Finally, a calculation was made of the comparative difference in performance for particular slots and objects between humans and machines, by subtracting the rank order score for machines from the rank order score for humans for the corresponding slot or object.</Paragraph> <Paragraph position="5"> Figure 18 shows the results, listing the 5 comparatively &quot;worst&quot; slots from the point of view of the machines and then the 5 comparatively &quot;best&quot; slots from the point of view of the machines. For further comparison, one slot on which the performance of humans and machines is comparatively equal is listed.</Paragraph>
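The rank-order normalization just described can be summarized in a few lines. The sketch below follows our reading of the procedure: it assumes hypothetical dictionaries mapping each slot (or object) name to a mean percent error for the human analysts and for the machines, assumes that machine rank-order scores are assigned by the same procedure as the human ones, and uses a sign convention chosen so that large positive differences mark slots on which the machines did comparatively worse (matching the +30 for the worst slot in Figure 18).

```python
from typing import Dict

def performance_ranks(mean_error: Dict[str, float]) -> Dict[str, int]:
    """Rank slots/objects from lowest performance (rank 1, highest error)
    to highest performance (rank N, lowest error)."""
    ordered = sorted(mean_error, key=mean_error.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}

def rank_order_differences(
    human_error: Dict[str, float], machine_error: Dict[str, float]
) -> Dict[str, int]:
    """Subtract the machine rank-order score from the human one for each
    slot. Large positive values mark slots on which the machines did
    comparatively worse than humans; large negative values, comparatively
    better."""
    human_rank = performance_ranks(human_error)
    machine_rank = performance_ranks(machine_error)
    return {
        name: human_rank[name] - machine_rank[name]
        for name in human_rank
        if name in machine_rank
    }
```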
<Paragraph position="6"> The column at the left shows the rank order difference score, with +30 indicating <equip>name, the slot for which machines did the worst compared with humans. The two columns at the right list the error scores for humans and machines, with the <equip>name slot resulting in a machine score of 84.3% error, but a human score of 18.0% error. Note that this extreme comparative difference results from two factors: the machines did particularly badly on this slot (84.3%) compared to their overall performance (68.7%), and humans did particularly well on the slot (18.0%) compared to their overall performance (29.3%).</Paragraph> </Section> <Section position="17" start_page="190" end_page="191" type="metho"> <SectionTitle> [Figure 18 column headings: Machine Comparatively Worse; Diff; Slot; Machine; Human] </SectionTitle> <Paragraph position="0"> This data is essentially a pilot experiment toward investigating the question of how human and machine performance might compare on specific tasks, with slot and object fills the best available way of obtaining this information with the given data. Without detailed investigation, we can only speculate on the reasons for the results. It appears, however, that many or all machine developers, pressed for time particularly in the case of microelectronics, simply did not bother to code specific slots, viewing them as unimportant to the final score. Those slots would likely appear in a list of slots that machines did comparatively badly on (though it may also be necessary for humans to do particularly badly on the slots as well). It appears that, in the case of the slots that machines did comparatively well on, these were slots with large sets of categorical fills, with the set sufficiently large and the items sufficiently obscure that humans had a difficult time remembering them well enough to effectively detect them when they appeared in text. Because these words (or acronyms) tended to be context-free, relatively simple strategies for detecting these keywords and matching them to slots could be used. This does suggest that the abilities of humans and machines are quite different, and that an approach in which an integrated human-machine system is used rather than a machine-only system, as is described in \[3\], might be appropriate.</Paragraph> </Section> </Paper>