<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1006">
  <Title>COMPARING HUMAN AND MACHINE PERFORMANCE FOR NATURAL LANGUAGE INFORMATION EXTRACTION: Results for English Microelectronics from the MUC-5 Evaluation</Title>
  <Section position="3" start_page="53" end_page="55" type="metho">
    <SectionTitle>
THE PREPARATION OF TEMPLATES
</SectionTitle>
    <Paragraph position="0"> The development of templates for the English Microelectronics corpus began in the fall of 1992. The effort began with an interagency committee that developed the basic template structure and, when it had evolved to the point that it was relatively stable, two experienced analysts were added to the project so that they could begin training toward eventual production work. About two months after that, two more production analysts joined the project.</Paragraph>
    <Paragraph position="1"> The template structure was object-oriented, with a template consisting of a number of objects, or template building blocks with related information. Each object consisted of slots, which could be either fields containing specific information or pointers, that is, references, to other objects. Slot fields can be either set fills, which are filled by one or more categorical choices defined for that field, or string fills, in which text is copied verbatim from the original article. In cases of ambiguity, analysts could provide alternative fills for a given slot. In addition, analysts could include comments when desired to note unusual problems or explain why a particular coding was selected. Comments were not scored and were primarily used when analysts compared two or more codings of a given article to determine which was the more correct. For more information on the template design see [4]. Also see [5] for a discussion of the corpora selection and data preparation and [6] for a discussion of the different extraction tasks, domains, and languages. Previous experience with the English and Japanese Joint Ventures corpus had made it clear that producing templates with a high degree of quality and consistency is a difficult and time-consuming task [1], and we attempted to make the best use of what had been learned in that effort in producing templates for English Microelectronics with quality and consistency appropriate to both the needs of the project and the resources we had available.</Paragraph>
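As a concrete, if simplified, illustration, the object-oriented structure just described might be modeled as below. This is a minimal sketch in Python, not the actual Tipster template interchange format; all class and field names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Union

# A slot fill is a set fill (categorical choice), a string fill (text copied
# verbatim from the article), or a pointer (a reference to another object).
@dataclass
class Pointer:
    target_id: str                # id of the referenced template object

@dataclass
class Slot:
    name: str
    # Analysts could supply alternative fills for ambiguous cases;
    # the first alternative is treated as the most likely one.
    alternatives: list[Union[str, Pointer]] = field(default_factory=list)

@dataclass
class TemplateObject:
    object_id: str
    object_type: str              # the kind of building block this object is
    slots: list[Slot] = field(default_factory=list)
    comment: str = ""             # unscored analyst comment

@dataclass
class Template:
    doc_id: str
    objects: list[TemplateObject] = field(default_factory=list)
```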
    <Paragraph position="2"> &amp;quot;Quality&amp;quot; refers to minimizing the level of actual error by each analyst. &amp;quot;Error&amp;quot; includes the following : (1 ) Analysts missing information contained in or erroneously interpreting the meaning of an article ; (2) Analysts forgetting or misapplying a fill rule ; (3) Analysts misspelling a word or making a keyboarding (typographical) error or th e analogous error with a mouse; and (4) Analysts making an error in constructing the object-oriented structure, such as failing to create an object, failing to reference an object, providing an incorrect reference to an object . or creating a n extraneous object.</Paragraph>
    <Paragraph position="3"> &amp;quot;Consistency&amp;quot; refers to minimizing the level of legitimate analytical differences among different analysts . &amp;quot;Legitimate analytical differences&amp;quot; include the following : (1) Different interpretations of ambiguous language in an article; (2) Differences in the extent to which analysts were able or willing to infer information from the article that i s not directly stated ; and (3) Different interpretations of a fill rule and how it should be applied (or the ability or willing ness to infer a rule if no rule obviously applies).</Paragraph>
    <Paragraph position="4"> To improve quality and consistency, three steps were taken. First, a set of relatively detailed rules for extracting information from articles and structuring it as an object-oriented template was developed (for English Microelectronics, the rules filled a 40-page, single-spaced document).</Paragraph>
    <Paragraph position="5"> These rules were created by a group of analysts who met periodically to discuss problems and to agree on how to handle particular cases via a general rule. One person (who was not one of the production analysts) served as the primary maintainer of the rules. Because of the highly technical nature of the topic domain, an expert in microelectronics fabrication also attended the meetings and resolved many problems that required technical knowledge.</Paragraph>
    <Paragraph position="6"> The second step was the development of a procedure in which two analysts participated in coding nearly half of the articles. For 300 articles in a "high quality" development set and for the 300 articles in the test set, the following procedure was followed: two analysts first independently coded each article, with the resulting codings provided to one of these same analysts, who produced a final version, or "key". The remaining 700 development templates were coded by only one analyst, with each of four analysts coding some portion of the 700 articles. The purpose of the two-analyst procedure was to correct inadvertent errors in the initial coding and to promote consistency, by allowing the final analyst to change his or her coding after seeing an independent coding by a different analyst. The procedure also promoted consistency in the long run by providing analysts with examples of codings made by other analysts so that they could see how other analysts handled a given problem. It also helped improve the fill rules by allowing analysts to detect recurring problems that could be discussed at a meeting and lead to a change in the fill rules.</Paragraph>
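The two-analyst procedure can be sketched as a small pipeline. The function and method names below (`code`, `reconcile`) are hypothetical stand-ins for an analyst's coding pass and the key-making pass described above, not anything from the actual project tooling.

```python
def make_key(article, analyst_1, analyst_2):
    """Two-analyst keying procedure (a sketch; names are hypothetical).

    Both analysts code the article independently; the first analyst then
    sees both codings and produces the final key.
    """
    primary = analyst_1.code(article)      # independent first coding
    secondary = analyst_2.code(article)    # independent second coding
    # Analyst 1 revises with the benefit of the independent coding,
    # correcting inadvertent errors and converging on consistent choices.
    return analyst_1.reconcile(primary, secondary)

def code_development_remainder(articles, analysts):
    """The remaining 700 development articles got a single coding each,
    split across the four analysts (the exact split is not specified)."""
    return [analysts[i % len(analysts)].code(a) for i, a in enumerate(articles)]
```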
    <Paragraph position="7"> The third step was the development of software tools that helped analysts to minimize errors, detect certain kinds of errors, and support the process of comparing initial codings. One such tool was the template-filling tool developed by Bill Ogden and Jim Cowie at New Mexico State University (known as Locke in the version designed for English Microelectronics). This tool, which runs on a Sun workstation and uses the X Windows graphical user interface, provided an interface that allowed analysts to easily visualize the relationships among objects and thus avoid errors in linking objects together. The tool also allowed analysts to copy text from the original article by selecting it with a mouse and entering it verbatim into a template slot, thus eliminating keystroke errors. In addition, the Locke tool has checking facilities that allowed analysts to detect such problems as unreferenced or missing objects. A second tool was the Tipster scoring program (developed by Nancy Chinchor and Gary Dunca at SAIC [8]), which provided analysts making keys with a printout of possible errors and differences between the initial codings. Another program, written by Gerry Reno at the Department of Defense at Fort Meade, did final checking of the syntax of completed keys.</Paragraph>
    <Paragraph position="8"> The four analysts who coded templates all had substantial experience as analysts for U.S. government agencies. In all cases analysts making the keys were unaware of the identity of the analyst producing a particular coding. Analysts did often claim that they could identify the analyst coding a particular article by the comments included in the template coding or the number of alternative fills added, although when this was investigated further it appeared that they were not necessarily correct in their identifications.</Paragraph>
    <Paragraph position="9"> In addition to the templates and keys created for the development and test sets described above, a small number of codings and keys were made for the purpose of studying human performance on the extraction task. In February 1993, at about the time of the 18-month Tipster evaluation, a set of 40 templates in the development set were coded by all analysts for this purpose. Similarly, for 120 templates of the 300-template test set that was coded in June and July 1993, extra codings were made by the two analysts who would not normally have participated in coding those articles, resulting in codings by all four analysts for 120 articles.</Paragraph>
  </Section>
  <Section position="4" start_page="55" end_page="56" type="metho">
    <SectionTitle>
INFLUENCE OF ANALYSTS PLAYING DIFFERENT ROLES ON KEY
</SectionTitle>
    <Paragraph position="0"> The two-analyst procedure for making keys used for English Microelectronics was intended as an efficient compromise between the extremes of using a single analyst and the procedure that had been used for English Joint Ventures, in which two analysts would independently make codings and provide the codings to a third analyst who would make a key.</Paragraph>
    <Paragraph position="1"> It is of interest to know whether this form of checking is achieving its intended result--that of improving the quality and consistency of templates. We can investigate this indirectly by measuring the level of influence analysts playing different roles (and producing different codings) have on the key that is produced. The question of influence is also of interest--as will be seen in the next section--for its implications in understanding the characteristics of different methods for measuring human performance.</Paragraph>
    <Paragraph position="2"> To investigate this influence, data from the set of 120 templates that all analysts coded were analyzed separately based on the role in the production of the key played by the particular coding. Figure 1 shows the relationship between different codings and the analysts producing them. Analyst 1 produces, based on the original article, what will be called here the primary coding of the article. Analyst 2 independently produces the secondary coding of the article. The primary and secondary codings are then provided to Analyst 1, who produces the key.</Paragraph>
    <Paragraph position="3"> Analysts 3 and 4 also produce other codings of the article that have no effect on the key. Each analyst plays a particular role (Analyst 1, 2, 3, or 4) for 30 of the 120 templates.</Paragraph>
    <Paragraph position="4"> [Figure 1 caption fragment: "... the 120 Articles Coded by All Analysts"] Note that when Analyst 1 uses the primary and secondary codings in making the final coding, or key, there is a natural bias toward the primary coding. This is primarily because Analyst 1 created that coding, but also because the analyst making the key typically does not create the key from scratch with the Locke tool, but modifies the primary coding (probably reinforcing the tendency to use the primary coding unless there is a significant reason for changing it).</Paragraph>
    <Paragraph position="5"> Figure 2 shows the results of this analysis, with performance expressed in terms of error per response fill, as calculated with the methodology described by Nancy Chinchor and Beth Sundheim [7] and implemented by the SAIC scoring program [8]. All objects in the template were scored, and scoring was done in "key to key" mode, meaning that both codings were allowed to contain alternatives for each slot. (See the Appendix for details of producing the error scores and calculation of statistical parameters and tests.)</Paragraph>
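For reference, the MUC-5 error per response fill of Chinchor and Sundheim [7] is, roughly (stated here from memory of the MUC-5 metric definitions, not quoted from this paper; partially correct fills count half):

```latex
\mathrm{ERR} \;=\;
\frac{\mathrm{INC} + \mathrm{MIS} + \mathrm{SPU} + \tfrac{1}{2}\,\mathrm{PAR}}
     {\mathrm{COR} + \mathrm{PAR} + \mathrm{INC} + \mathrm{MIS} + \mathrm{SPU}}
```

where COR, PAR, INC, MIS, and SPU count the correct, partially correct, incorrect, missing, and spurious fills, respectively.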
  </Section>
  <Section position="5" start_page="56" end_page="56" type="metho">
    <SectionTitle>
[Figure 2 bar labels: Primary Coding vs Key; Secondary Coding vs Key; Other Coding vs Key]
</SectionTitle>
    <Paragraph position="0"> [Figure 2 axis title: "Role of Coding in Making Key"] The data is shown for three conditions, with each condition reflecting the accuracy, in terms of percent error, of a coding playing a particular role (or no role) in producing a key. The conditions are: (1) error for the primary coding when measured against the key (shown as an unfilled vertical bar); (2) error for the secondary coding when measured against the key (shown as a light gray vertical bar); and (3) error for other codings when measured against the key (shown as a dark gray vertical bar).</Paragraph>
    <Paragraph position="1"> Also shown for all conditions, in the form of error bars, is the standard error of the mean. Because the mean shown is calculated from a small sample, it can differ from the true population mean, with the sample mean only the most likely value of the population mean. The standard error bars show the range expected for the true population mean: that mean can be expected to fall within the error bars shown 68% of the time.</Paragraph>
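The error bars are the usual standard error of the mean: for n independent scores with sample standard deviation s,

```latex
\mathrm{SE} \;=\; \frac{s}{\sqrt{n}}
```

and, under a normal approximation, the interval \(\bar{x} \pm \mathrm{SE}\) covers the population mean about 68% of the time, which is the sense of the 68% figure above.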
    <Paragraph position="2"> [Figure caption fragment: "... codings are both used in making the key, while the other codings are not used."]</Paragraph>
    <Paragraph position="3"> The result (in Figure 2) that the primary coding, when compared to the key, shows a mean error considerably above zero indicates that analysts quite substantially change their coding from their initial version in producing the key. Presumably, this results from the analyst finding errors or more-desirable ways of coding, and means that quality and consistency are improved in the final version. (All differences claimed here are statistically significant; see Appendix for details.)</Paragraph>
    <Paragraph position="4"> The result that the secondary coding, when compared to the key, shows a mean error substantially above that of the primary coding condition indicates that the analyst's original (primary) coding does in fact influence the key more strongly than does the secondary coding (produced by another analyst). At the same time, it is clear that the secondary coding does itself substantially influence the key, since the mean error for the secondary coding is substantially less than that for the "other" codings, which are not provided to the analyst making the key and thus have no influence on it. This provides good evidence that analysts are indeed making use of the information in the secondary coding to a substantial extent. This probably results in an improvement in both quality and consistency of the templates above what would be the case if only a single coder (even with repeated checking) were used, although the extent of this improvement is not clear.</Paragraph>
  </Section>
  <Section position="6" start_page="56" end_page="58" type="metho">
    <SectionTitle>
METHODS FOR SCORING HUMAN PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> Before human performance for information extraction can be effectively compared with machine performance, it is necessary to develop a method for scoring responses by human analysts.</Paragraph>
    <Paragraph position="1"> The problem of measuring machine performance has been solved in the case of the MUC-5 and Tipster evaluations by providing (1) high-quality answer keys produced in the manner described in the previous section; and (2) a scoring methodology and associated computer program.</Paragraph>
    <Paragraph position="2"> The primary additional problem posed when attempting to measure the performance of humans performing extraction is that of "who decides what the correct answer is?" In the case of the English Microelectronics analysts, the four analysts whose performance we are attempting to measure became--once they had substantial training and practice--the primary source of expertise about the task, with their knowledge and skill often outstripping that of others who were supervising and advising them. This made it especially difficult to measure the performance of particular analysts.</Paragraph>
    <Paragraph position="3"> We approached the problem of determining the best method for scoring humans empirically: we compared experimentally four different methods for scoring codings by human analysts. In general, the criteria are objectivity, statistical reliability, and a perhaps difficult-to-define "fairness", or plausibility of making appropriate comparisons, both between different human analysts and between humans and machines.</Paragraph>
    <Paragraph position="4"> In evaluating different scoring methods, the 120 templates in the MUC-5/Tipster test set that had been coded by all four analysts were used. As was described in the previous section, keys for each template in this set were made by one of the analysts, using as inputs codings done independently by the analyst making the key and one other analyst. Each of the four analysts made keys for 30 of the 120 templates and also served as the independent analyst providing a coding to the analyst making the keys for a different set of 30 templates. In addition, for 60 of the 120 templates, a fifth analyst made a second key from codings made by the four analysts.</Paragraph>
    <Paragraph position="5"> Figure 4 shows data comparing the four scoring methods for each of the four analysts. The data is shown in terms of percent error, with all objects scored, and with "key to response" scoring being used. In key-to-response scoring, alternative fills for slots are allowed only in the key, not in the coding being scored. Since in the data collected here analysts did normally add alternative fills (since their goal was to make keys), these alternatives were removed before scoring, with the first alternative listed assumed to be the most likely one and thus kept, and the others deleted. The purpose of using key-to-response scoring was so that the resulting data could be directly compared with data from machine systems, which produced only one fill for each slot. Scoring was done in batch mode, meaning that human analysts were not used (as they are in interactive mode) to judge cases in which strings did not completely match.</Paragraph>
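The alternative-stripping step described above is mechanical. A sketch, reusing the illustrative `Template`/`Slot` classes from the earlier sketch (so, again, not the project's actual tooling):

```python
def strip_alternatives(template):
    """Reduce each slot to its first-listed (most likely) alternative, so a
    human coding can be scored key-to-response like a machine response."""
    for obj in template.objects:
        for slot in obj.slots:
            slot.alternatives = slot.alternatives[:1]   # keep first, drop rest
    return template
```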
  </Section>
  <Section position="7" start_page="58" end_page="59" type="metho">
    <SectionTitle>
[Figure 4 axis label: WHO MADE KEY]
</SectionTitle>
    <Paragraph position="0"> In the "All Analysts" condition, all 120 templates made by each analyst were scored, using as keys those made by all four analysts (including the analyst being scored). In the "Other Analysts" condition, only templates that have keys made by analysts other than the analyst being scored were used in scoring each analyst (with 90 of the 120 templates coded by each analyst scored). In the "Independent Analysts" condition, only templates for which the analyst being scored neither made the key nor produced a coding that was used as an input for making the key were used in scoring each analyst. (This resulted in from 30 to 80 templates being scored for each analyst, depending upon the analyst.) In the "5th Analyst" condition, a fifth analyst made the answer keys (with 60 templates scored in this condition). This fifth analyst did not code production templates but was in charge of maintaining the fill rules and the overall management of the English Microelectronics template coding effort.</Paragraph>
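The four conditions amount to different filters over the (coding, key) pairs. A sketch; `records` and its fields (`key_maker`, `secondary_coder`, `keys_by_5th`) are hypothetical names for the per-template bookkeeping, not anything from the evaluation software.

```python
def templates_for_condition(records, scored_analyst, condition):
    """Select which of the 120 templates count toward an analyst's score
    under each scoring condition described in the text."""
    if condition == "all_analysts":
        return records                                   # all 120 templates
    if condition == "other_analysts":
        return [r for r in records
                if r["key_maker"] != scored_analyst]     # 90 per analyst
    if condition == "independent_analysts":
        return [r for r in records                       # 30-80 per analyst
                if scored_analyst not in (r["key_maker"], r["secondary_coder"])]
    if condition == "5th_analyst":
        return [r for r in records if r["keys_by_5th"]]  # the 60 with 5th-analyst keys
    raise ValueError(f"unknown condition: {condition}")
```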
    <Paragraph position="1"> The "All Analysts" condition showed the most consistent performance across analysts, with a variance calculated from the means for each analyst of 1.82 (N=4). The "Other Analysts" condition was nearly as consistent, with a variance of 3.16. The "Independent Analysts" and "5th Analyst" conditions were much less consistent, with variances of 9.08 and 30.19, respectively. The high variance of the "Independent Analysts" condition, however, resulted only from the performance of analyst D, who had a very small sample size, only 30 templates. If analyst D is left out, the variance becomes only 0.32 for this condition. The high variability across analysts for the "5th Analyst" condition could be a result either of the small sample size or, more likely, of a tendency for the 5th analyst to code articles in a manner more similar to some analysts (especially analyst C) than others (especially analyst B).</Paragraph>
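The quoted variances are simply variances over the four per-analyst mean error scores. A sketch with hypothetical means (the paper does not state whether the N or N-1 divisor was used; the population-variance convention over N=4 is assumed here):

```python
import statistics

# Hypothetical per-analyst mean percent-error scores for one condition
analyst_means = [29.0, 30.5, 31.2, 32.1]

variance = statistics.pvariance(analyst_means)   # population variance, N = 4
print(f"variance across analysts: {variance:.2f}")
```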
    <Paragraph position="2"> The subjective opinions of the analysts and their co-workers suggested that all English Microelectronics analysts here were generally at the same level of skill, which is consistent with the above data. (This was not true of the English Joint Ventures analysts, for example, where both the data and the opinions of analysts and others suggested considerable variability of skill among analysts.) However, it should be noted that all of the conditions in which analysts are being scored by other analysts run the risk of making the differences among analysts artificially low.</Paragraph>
    <Paragraph position="3"> Consider, for example, the case of a very skilled analyst being scored against a key made by an analyst who is poorly skilled. The more skilled analyst is likely to have some correct responses scored incorrectly, while a less-skilled analyst may have his or her incorrect responses scored as correct. However, the patterns of analyst skill elicited by the different scoring methods do not show any reliable evidence of such differences in skill, and it appears that analysts have similar levels of skill and that any effect of a "regression toward the mean" of mean analyst scores is minimal.</Paragraph>
    <Paragraph position="4"> Figure 5 shows the same data comparing scoring methods that was shown in the previous figure, but in this figure the data has been combined to show means for all analysts in each of the scoring conditions. This combining allows the overall differences among the different scoring methods to be seen more clearly. In addition, combining the data in this way increases the reliability of the overall mean.</Paragraph>
  </Section>
  <Section position="8" start_page="59" end_page="60" type="metho">
    <SectionTitle>
[Figure axis label: WHO MADE KEY]
</SectionTitle>
    <Paragraph position="0"> The first three methods use experienced production analysts to make the key, while the "5th Analyst" method uses an analyst whose expertise is somewhat more questionable because of having put much less time into actually coding articles. The "All Analysts" method appeared to have substantial bias (producing artificially low error scores) from analysts scoring their own codings, while the "Other Analysts" method appeared to produce some (but less) such bias. Neither the "Independent Analysts" nor the "5th Analyst" method suffered from this kind of bias. The "All Analysts", "Other Analysts", and "Independent Analysts" methods are unbiased with respect to particular analysts (because of counterbalancing to control for this), but the "5th Analyst" method appeared to have some bias, presumably because of a coding style more similar to some analysts than others. Finally, the "All Analysts", "Other Analysts", and "Independent Analysts" methods had relatively high statistical reliability, while the "5th Analyst" method had much less reliability.</Paragraph>
    <Paragraph position="1"> Figure 7 shows a recall-precision scatterplot for the four analysts and for each of the four conditions shown in Figure 4. Analysts scored by the "All Analysts" method are shown as solid circles, while analysts scored by the "Other Analysts" method are shown as solid triangles. Analysts scored by the "Independent Analysts" method are shown as deltas, and analysts scored by the "5th Analyst" method are shown as solid squares. Note that only the upper right-hand quadrant of the usual 0-100% recall-precision graph is shown. Performance is expressed in terms of recall and precision, which are measures borrowed from information retrieval that allow assessment of two independent aspects of performance. Recall is a measure of the extent to which all relevant information in an article has been extracted, while precision is a measure of the extent to which information that has been entered into a template is correct. Details of the method for calculating recall-precision scores for the MUC-5/Tipster evaluation can be found in the paper by Chinchor and Sundheim [7].</Paragraph>
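In the MUC-5 metrics of [7], with partially correct fills given half credit, recall and precision are, roughly (again stated from memory of the metric definitions, not quoted from this paper):

```latex
\mathrm{Recall} \;=\;
\frac{\mathrm{COR} + \tfrac{1}{2}\,\mathrm{PAR}}
     {\mathrm{COR} + \mathrm{PAR} + \mathrm{INC} + \mathrm{MIS}},
\qquad
\mathrm{Precision} \;=\;
\frac{\mathrm{COR} + \tfrac{1}{2}\,\mathrm{PAR}}
     {\mathrm{COR} + \mathrm{PAR} + \mathrm{INC} + \mathrm{SPU}}
```

so missing fills hurt only recall, spurious fills hurt only precision, which is why the two axes can be traded against each other.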
  </Section>
  <Section position="9" start_page="60" end_page="62" type="metho">
    <SectionTitle>
THE DEVELOPMENT OF ANALYST SKILL
</SectionTitle>
    <Paragraph position="0"> In interpreting the levels of performance shown by analysts for the extraction task, and particularly when comparing human performance with that of machines, it is important to know how skilled the analysts are compared to how they might be with additional training and practice. Comparing machine performance with humans who are less than fully skilled would result in overstating the comparative performance of the machines.</Paragraph>
    <Paragraph position="1"> Four analysts were used in production template coding, all having had experience as professional analysts.</Paragraph>
    <Paragraph position="2"> One analyst had about 6 years of such experience, another 9 years, a third 10 years, and the fourth about 30 years. All were native speakers of English. None of the analysts had any expertise in microelectronics fabrication.</Paragraph>
    <Paragraph position="3"> We compared the skill of analysts at two different stages in their development by analyzing two sets of templates, each coded at a different time. The first, or "18 month", set was coded in early February 1993, at about the same time as the 18 month Tipster machine evaluation, after analysts had been doing production coding for about 3 months. The second, or "24 month", set was coded in June and July 1993, somewhat before the 24 month MUC-5/Tipster machine evaluation, and toward the end of the template coding process, when fill rules were at their most developed stage and analysts at their highest level of skill. There was some difference in expected skill between the two pairs of analysts, since one pair (analysts A and B) had begun work in September, although their initial work primarily involved coding of templates on paper and attending meetings to discuss the template design and fill rules, and during this period they did not code large numbers of templates. The second pair began work in November, and did not begin production coding of templates until a few weeks after the first analysts.</Paragraph>
    <Paragraph position="4"> Data for the 18 month condition was produced by first having all analysts code all articles of a 40-article set in the development set. Each analyst was then provided with printouts of all four codings for a set of 10 articles, and asked to make keys for those articles. Data for the 24 month condition was produced as described previously for the "All Analysts" condition, using the 120 templates that all analysts coded in the test set, with each of the four analysts making keys for 30 of the 120 templates. Note that analysts making the keys in the 18 month condition used as inputs the codings of all four analysts, while analysts making the keys in the 24 month condition used as inputs the codings of only two analysts. In both conditions and for all analysts, "key to key" scoring was used, in which all alternatives in both codings were used in determining the differences between template codings.</Paragraph>
    <Paragraph position="5"> Figure 8 shows data, in terms of percent error, for each of the two pairs of analysts in both the 18 month and 24 month conditions. The pairing of analysts is based on when they started work, with analysts A and B ("Early Starting Analysts") beginning work on the project before analysts C and D ("Later Starting Analysts"). Note that analysts who started early appeared to make slightly fewer errors in the 18 month condition (27%) than in the 24 month condition (28.3%), although the difference is not statistically significant.</Paragraph>
    <Paragraph position="6"> This difference can be explained at least in part by the difference in the method of making the keys. In the 18 month condition, all four analyst codings influenced the key, while in the 24 month condition only two of the analyst codings influenced the key. This results in the 18 month condition producing scores that are artificially low in terms of errors, compared to the 24 month condition. The difference, based on the data in Figure 5, can be estimated at from about 4 to 8 percentage points. Thus, it appears that analysts who started early did not improve their skill, or improved it minimally, between the 18 and 24 month tests. However, analysts who started later did appear to learn substantially, with error scores of 36% in the 18 month condition and 30.5% in the 24 month condition, with the amount of learning for the analysts who started later probably somewhat more than shown because of the difference in the method of making keys.</Paragraph>
    <Paragraph position="7"> Because of the differences in scoring methods between the conditions and the small sample size in the 18 month condition, the above results are only suggestive. An alternative approach to assessing the development of skill of analysts (one that does not require keys for scoring) compares the pattern of disagreement among analysts for the 18 month and 24 month tests, and is more convincing. Such a pattern can be constructed by running the scoring program for a given set of templates in "key to key" mode (so that it calculates a measure of disagreement between two codings) for all pairs of codings for the four analysts.</Paragraph>
    <Paragraph position="8"> Figure 9 shows such patterns, termed here "Disagreement Matrices", for the 18- and 24-month tests, along with a third "Difference" matrix (shown at the far right) created by subtracting scores in the 24-month matrix from those in the 18-month one, resulting in a measure of the extent to which consistency between particular analysts has improved. Note that all cells of the Difference matrix have positive scores, indicating that consistency between all pairs of analysts has increased.</Paragraph>
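Constructing the matrices amounts to running the key-to-key scorer over every pair of analysts and then differencing the two resulting matrices. A sketch, with `score_key_to_key` standing in for the SAIC scorer [8] (its real interface is not described here):

```python
from itertools import combinations

def disagreement_matrix(codings_by_analyst, score_key_to_key):
    """codings_by_analyst: dict mapping analyst -> list of template codings
    (same articles, same order). Returns pairwise mean error per response."""
    matrix = {}
    for a, b in combinations(sorted(codings_by_analyst), 2):
        scores = [score_key_to_key(x, y)
                  for x, y in zip(codings_by_analyst[a], codings_by_analyst[b])]
        matrix[(a, b)] = sum(scores) / len(scores)
    return matrix

def difference_matrix(m18, m24):
    """Positive entries mean pairwise consistency improved from 18 to 24 months."""
    return {pair: m18[pair] - m24[pair] for pair in m18}
```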
    <Paragraph position="9"> [Figure 9 caption fragment: "... and Differences Between the Two Matrices"] Figure 10 shows comparisons of the scores from the Difference matrix for three specific cases of analyst pairs. For the Early Starting Analysts pair (A and B), shown at the far left, consistency between the two analysts increased by only one percentage point, suggesting that even at the 18-month test these analysts were already near their maximum level of skill. For the Later Starting Analysts (C and D), however, shown at the far right, consistency between the two analysts increased by 6 percentage points, indicating that these analysts were still developing their skill. For the case where the mean of all pairs of early-late analysts (AC, AD, BC, and BD) is calculated, shown as the middle vertical bar, consistency increased by an average of 4.375 percentage points, indicating that the less-skilled analysts had increased their consistency with the more-skilled analysts.</Paragraph>
    <Paragraph position="10"> The general finding here is that (1) the analysts who started earlier improved their skill minimally from the 18 to 24 month tests; and (2) analysts who started later improved their skill considerably. Because by the time of the 24 month test the later starting analysts had as much or more practice coding templates as the early starting analysts had at the time of the 18 month test, it is reasonable to assume that their increase in skill reflects an early part of the learning curve and that by the 24 month test all analysts had reached an asymptotic level of skill.</Paragraph>
  </Section>
  <Section position="10" start_page="62" end_page="62" type="metho">
    <SectionTitle>
[Figure 10 bar labels: Between Early Starting Analysts; Between Early and Later Starting Analysts; Between Later Starting Analysts]
</SectionTitle>
    <Paragraph position="0"> The evidence in the literature for the development of skill in humans suggests that skill continues to develop, if slowly, for years or decades even on simple tasks, and it can be expected that continued practice on information extraction by these analysts would increase their level of skill. However, it does appear that the analysts were very highly skilled by the end of the study and were at an appropriate level of skill for comparison with machine performance.</Paragraph>
  </Section>
  <Section position="11" start_page="62" end_page="64" type="metho">
    <SectionTitle>
COMPARISON OF HUMAN AND MACHINE PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> The most critical question in the MUC-5/Tipster evaluation is that of how performance of the machine extraction systems compares with that of humans performing the same task.</Paragraph>
    <Paragraph position="1"> Figure 11 shows mean performance, in percent error, for the four human analysts, using the "Independent Analysts" condition discussed in a previous section and shown in Figure 5, for the 120 articles coded by all analysts from the English Microelectronics test set. Also shown is the corresponding machine performance on the same articles for the three machine systems in the MUC-5/Tipster evaluation that had the best official scores for English Microelectronics. The differences are very clear, with the mean error for human analysts about half that of the machine scores. Both the human and machine scores are highly reliable, as is shown by the standard error bars.</Paragraph>
    <Paragraph position="2"> Figure 12 shows essentially the same data expressed in terms of recall and precision.</Paragraph>
    <Paragraph position="3"> What is surprising about this data is not that the machines have a seemingly rather high error rate, but that the human rate is so high. The recall-precision diagram suggests that machines can come even closer to human performance on either recall or precision alone, if one is willing to trade off the other to achieve it. Machine performance is likely to be at least somewhat better than this in a real system, since resource constraints forced developers to run incomplete systems (that, for example, did not fill in slots for which information was infrequently present). [Figure 12 caption fragment: "... Using Recall and Precision Scores"] The performance data shown in the figure, other data, and the subjective accounts of individual analysts and their co-workers support the general conclusion that for this group of analysts the level of skill for information extraction was very similar for each analyst. This uses the "Other Analysts" scoring method, with recall and precision scores for individual analysts not particularly meaningful for the otherwise more reliable "Independent Analysts" condition. (See Figure 7 for recall and precision scores for all scoring conditions.)</Paragraph>
  </Section>
</Paper>