<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1004"> <Title>TIPSTER/MUC-5 INFORMATION EXTRACTION SYSTEM EVALUATION</Title> <Section position="4" start_page="0" end_page="27" type="metho"> <SectionTitle> THE EVALUATION PROCESS </SectionTitle> <Paragraph position="0"> The Tipster contractors were allowed access to the training corpus (articles and hand-coded templates for a given language-domain pair) and associated materials (documentation, software resources, lexical resources) as they were being prepared over the course of Phase 1. The articles and corresponding hand-coded templates from the test corpus were held in reserve for use as blind-test materials during evaluation periods; new test sets were used for each evaluation. A description of the training and test corpora is contained in [1]. Those MUC-5 evaluation participants who were not Tipster contractors were allowed access to training materials in March 1993, when major updates resulting from decisions made at the Tipster interim evaluation in February had been completed and permission for MUC-5 participants to use most of the copyrighted articles had been obtained.</Paragraph> <Paragraph position="1"> Table 1 identifies the MUC-5 evaluation participants and the language-domain pairs on which their systems were evaluated.</Paragraph> </Section> <Section position="5" start_page="27" end_page="31" type="metho"> <SectionTitle> Table 1 (partial). PARTICIPANT | CLASS OF PARTICIPATION | MUC-5 SYSTEM | EJV | EME | JJV | JME
BBN | Tipster | PLUM | X | X | X | X
GE/CMU | Tipster | SHOGUN | X | X | X | X </SectionTitle> <Paragraph position="0"> The evaluation participants (Tipster and non-Tipster) were also provided with evaluation software, prepared via NRaD contract to SAIC, to help them monitor the performance benefits of alternative software solutions they were exploring in their research [9]. The evaluation software, corpora, documentation, and miscellaneous other resources were distributed primarily through electronic mail and electronic file transfer. Virtually every item was updated numerous times, and updates continued on some of them right up to the start of final testing. Personnel at the Consortium for Lexical Research (New Mexico State University) and the Institute for Defense Analyses played critical roles in making these materials available for electronic transfer.</Paragraph> <Paragraph position="1"> At the start of the test week for each evaluation, the participants were supplied electronically with encoded test sets of articles, which they were to decode only when they were ready to begin testing. Testing was conducted by the participants at their own sites in accordance with a strict test protocol. After their systems processed the texts and produced the extracted information in the expected template format, the participants electronically transferred the templates to the Government for scoring.</Paragraph> <Paragraph position="2"> Using the evaluation software prepared by SAIC, evaluators may score templates fully automatically (batch mode) or partially interactively (interactive mode). Since interactive scoring produces a slightly more accurate performance evaluation, scoring for the formal evaluations is usually done in that way. Some of the same analysts who hand-coded the answer-key templates prepared written guidelines for conducting the interactive scoring and did the scoring.
SAIC conducted statistical significance tests on the 24-month/MUC-5 final test scores for the overall metrics of performance [3].</Paragraph> <Paragraph position="3"> Table 2 summarizes the Phase 1 evaluations in terms of the test sets, participating sites, and primary evaluation metrics.1 Since the JV template was especially complex, JV testing was done in two ways for each evaluation: (1) the core portion of the template, including the identification of tie-ups, entities, and relationships of entities within tie-ups, and (2) the full template. The first microelectronics test was conducted at the 18-month point; up until a few weeks prior to that test, the Tipster contractors had had only a small portion of the EME and JME corpora available to them. The first JME evaluation (at 18 months) was conducted using all but the <packaging> objects.</Paragraph> <Paragraph position="4"> The same test sets used for the Tipster 18-month evaluation were used for the MUC-5 dry run, with one exception: certain articles in the EJV test set had to be omitted because permission for non-Tipster MUC-5 participants to use those copyrighted articles had not been obtained. (Permission to use all but two of these sources was obtained in time for the MUC-5 final test.) The Tipster contractors did not participate in the dry run.</Paragraph> <Paragraph position="5"> 1 This tabulation ignores the fact that the period of time covered by Tipster Phase 1 also included MUC-4. The Tipster Phase 1 contractors were evaluated for MUC-4 in the terrorism domain even as they were beginning their Tipster research and development in the Tipster domains [MUC-4].</Paragraph> <Paragraph position="6"> The primary performance metrics changed in the course of Phase 1. These are discussed below.</Paragraph> <Section position="1" start_page="28" end_page="30" type="sub_section"> <SectionTitle> Evaluation Criteria and Metrics </SectionTitle> <Paragraph position="0"> In assessing the performance of the information extraction systems, we are interested in knowing the classes of errors made and the circumstances in which those errors were made. We are also interested in performance from an applications perspective in terms of the completeness and accuracy of the database fills generated by the system. The criteria are limited to those that can be measured without access to anything more than the templates that the systems generated. Using these criteria, we attempt to assess the current state of the art, measure progress relative to previous evaluations, and compare the task performance of machines with that of humans.</Paragraph> <Paragraph position="1"> The scoring software classifies each piece of extracted information into one of the following scoring categories: correct, partial, incorrect, spurious, missing, and noncommittal. Systems are penalized for having missed pertinent information, for having &quot;hallucinated&quot; more information than was actually pertinent, and for having otherwise extracted mismatching pieces of information. In order to reveal information about the circumstances affecting performance, the scoring software calculates scores at the following levels of granularity: for each slot in each template, for each object type in each template, and overall for each template; for each slot in the test set, for each object type in the test set, and overall for the test set.</Paragraph> <Paragraph position="2"> Two sets of metrics were in force for MUC-5 [4].
The first set of metrics is based on the classification error rate and includes an overall metric (error per response fill) and three secondary, diagnostic metrics (undergeneration, overgeneration, and substitution). These secondary metrics correspond to the three penalty situations described above. The error per response fill metric and the secondary metrics are together referred to as the error-based metrics.</Paragraph> <Paragraph position="3"> The second set of metrics measures the completeness (recall) and accuracy (precision) of the extracted information. These are supplemented by the undergeneration and overgeneration metrics mentioned above, which serve to isolate the system's shortfall in recall due to undergeneration and the system's shortfall in precision due to overgeneration. Recall and precision are combined into a weighted overall measure called the F-measure. Recall, precision, and F-measure are together referred to as the recall-precision-based metrics.</Paragraph> <Paragraph position="4"> 2 &quot;Core&quot; refers to a core set of 14 template slots: the <template>content slot; the <tie-up-relationship> status, entity, and joint-venture slots; the <entity> name, aliases, location, nationality, type, and entity-relationship slots; and the <entity-relationship> entity1, entity2, rel-ent2-to-ent1, and status slots.</Paragraph> <Paragraph position="5"> The UMass/Hughes team was tasked to work only in English (EJV, EME).</Paragraph> <Paragraph position="6"> All thirteen non-Tipster MUC-5 sites worked in just one domain; two worked in both languages, one worked in Japanese only, and ten worked in English only. See table 1.</Paragraph> <Paragraph position="7"> The error-based metrics served as the official metrics for the MUC-5 evaluation, meaning essentially that any ranking of systems by overall performance would be done on the basis of error per response fill rather than F-measure. However, as it turned out, statistical significance tests showed that system rankings on the basis of error per response fill are very consistent with those made on the basis of F-measure (see discussion below and [3]). Both sets of metrics play important roles in the discussion found in this paper.</Paragraph> <Paragraph position="8"> An appendix to this volume contains summary tallies and scores for each of the systems. The rightmost columns in the tables contain the scores for the error-based and recall-precision-based metrics; other columns contain the raw tallies. The rows in the top portion of the tables contain summary statistics for each slot and object; the rows at the bottom contain overall statistics. See the preface to the appendix and [4] for further information on reading the score reports.</Paragraph> <Paragraph position="9"> Updates to the Test Design and Data Each of the interim evaluations resulted in significant updates to the evaluation design. For the 12-month test, the evaluation software that had been used for MUC-4 was rewritten by SAIC to accommodate the object-oriented Tipster templates.
Issues that were addressed in the interim between the 12-month and the 18-month tests include JV template formatting (especially in the Japanese template), performance metrics (probability of false alarm as an alternative to precision, system-independent version of recall), object alignment by the evaluation software8 (content-based as well as threshold-based alignment options, alignment optimization based on score rather than on number correct), and evaluation software support for human performance studies (scoring of one set of hand-coded templates versus another).</Paragraph> <Paragraph position="10"> At the 18-month meeting, decisions were made regarding scoring for the 24-month/MUC-5 evaluation.9</Paragraph> <Paragraph position="11"> The principal decision was to supplant recall and precision with a modified formulation of the error rate metric that had been in experimental usage for the 18-month test. The revised metric was named error per response fill because it is system-dependent (i.e., the denominator in the formula varies across systems according to the number of spurious fills generated, and it also varies because the answer keys allow for a somewhat variable number of expected fills). Error per response fill became the primary measure of performance. However, a less system-dependent error rate metric was also implemented; this metric was termed the richness-normalized error. These changes to the metrics necessitated significant reprogramming of the evaluation software. In addition, the decision was made to convert portions of the JV template from objects to complex slots,10 and this resulted in significant updates to the evaluation software, JV corpora, and JV documentation. Another decision resulting from this meeting was to ease the object alignment criteria, largely because of the difficulty of setting valid threshold values given the sparseness of the fills in many of the objects in the answer key.</Paragraph> <Paragraph position="12"> The MUC-5 dry run was conducted after all these updates had been completed. Between the dry run and the final test two months later, further updates were made to the evaluation software, including a revised way of scoring two-part (complex) slots in the JV template. The new method gives separate scores to each part of the two-part fill, rather than giving one score to the complex fill as a whole. Another update was the implementation of a limited two-pass object alignment strategy, which results in slightly improved object alignments because more of the information on the interrelationships among entities is present when objects that reference the entities are aligned.</Paragraph> <Paragraph position="13"> The intention had been to eliminate some evaluation criteria before the MUC-5 effort began in earnest in March 1993; however, some of the decisions made at the 18-month meeting were tentative and, in the end, few simplifications were made at that time. The net result was that the number of performance measures has increased since MUC-4, and it is clear that there is still no clear answer as to the single most appropriate criterion to apply to assessing performance on an information extraction task.
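For concreteness, the error-based and recall-precision-based measures discussed above can be computed from the per-slot scoring-category tallies roughly as in the sketch below. The function and variable names are illustrative assumptions, the values are fractions rather than percentages, and the authoritative formulations (including the treatment of noncommittal fills) are those given in [4].

def muc5_metrics(cor, par, inc, spu, mis, beta=1.0):
    # Tallies of correct, partial, incorrect, spurious, and missing fills from a score report.
    possible = cor + par + inc + mis          # fills expected by the answer key
    actual = cor + par + inc + spu            # fills generated by the system
    # Error-based metrics (official for MUC-5).
    error_per_response_fill = (inc + 0.5 * par + spu + mis) / (cor + par + inc + spu + mis)
    undergeneration = mis / possible
    overgeneration = spu / actual
    substitution = (inc + 0.5 * par) / (cor + par + inc)
    # Recall-precision-based metrics (unofficial for MUC-5).
    recall = (cor + 0.5 * par) / possible
    precision = (cor + 0.5 * par) / actual
    f_measure = ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
    return (error_per_response_fill, undergeneration, overgeneration,
            substitution, recall, precision, f_measure)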
The good news is that the error per response fill and the F-measure provide consistent views of the relative performance of systems, and therefore technology consumers may choose to use whichever set of metrics they feel is most appropriate for their purposes. All this experimentation resulted in other useful information as well about system-independent metrics, object alignment approaches, and template design, among other things.</Paragraph> <Paragraph position="14"> 8 Object alignment as implemented for MUC-5 is discussed in the next section.</Paragraph> <Paragraph position="15"> 9 Since the MUC-5 evaluation was the 24-month evaluation for the Tipster contractors, the evaluation will hereafter be referred to simply as the MUC-5 evaluation.</Paragraph> <Paragraph position="16"> 10 This conversion affected three parts of the JV template: ownership percent, product/service, and activity site. After the conversion, each of these was represented in the template as a two-part slot rather than as an object with two slots.</Paragraph> </Section> <Section position="2" start_page="30" end_page="31" type="sub_section"> <SectionTitle> Alignment and Scoring </SectionTitle> <Paragraph position="0"> System-generated (response) templates must be aligned with the answer-key (key) templates for scoring.</Paragraph> <Paragraph position="1"> Alignment takes place at all levels where there exists more than one response instance of a given kind and/or more than one key instance. These levels include the template level, the object level, and the slot-fill level. In each case, the intent is to find the alignment that will provide the best content match between the key and the response. At the object level, there is also the intent to determine whether a response object should be rejected for alignment purposes for failing to show any substantial degree of match with a key object.</Paragraph> <Paragraph position="2"> Alignment at the template level is trivial; it is done on the basis of matching the <template> doc nr fills in the key and response. At the slot-fill level, when there is more than one key and/or response slot-fill for a given instance of a slot type, alignment is done on the basis of the degree of match between key and response fills. Slot-fill alignments that the alignment program can only guess at may be revised interactively during the scoring stage.</Paragraph> <Paragraph position="3"> Alignment at the object level is the most complicated and controversial aspect of alignment. It takes place prior to scoring, and it is normally done fully automatically because to do it interactively would be so time-consuming as to be virtually impossible. The criteria for establishing whether an object ought to be allowed to align at all are defined in a file external to the alignment process. The criteria are defined to apply across all instances of a given object type. However, it is difficult to specify the criteria in this manner since many instances in the keys contain little fill on which to base a comparison.</Paragraph> <Paragraph position="4"> Various object alignment schemes and minimal alignment criteria (also called the minimal mapping requirements) have been tried; for MUC-5, an alignment scheme called threshold-based was used, and the alignment criteria were loose. As used for MUC-5, this scheme allows nearly any matching fill in a given object type to enable an object alignment.
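A minimal sketch of this threshold-based idea follows. The dictionary representation of objects, the greedy search, and the names are illustrative assumptions only, and the ignored_slots parameter anticipates the default-fill exception described next; the actual SAIC aligner optimizes alignments on an approximate error per response fill score rather than on a simple count of matching fills.

def fill_matches(key_obj, resp_obj, ignored_slots=()):
    # Count slots (outside the ignored set) where at least one response fill
    # matches a key fill; objects are dicts mapping slot name -> list of fills.
    count = 0
    for slot, key_fills in key_obj.items():
        if slot in ignored_slots:
            continue
        if any(fill in key_fills for fill in resp_obj.get(slot, [])):
            count += 1
    return count

def align_objects(key_objs, resp_objs, ignored_slots=("type",)):
    # Loose threshold-based alignment: a single matching fill in any non-ignored
    # slot is enough to let a response object align to a key object.
    # Response objects left unaligned are later scored as spurious.
    alignments, used_keys = [], set()
    for r_idx, resp_obj in enumerate(resp_objs):
        best_score, best_key = 0, None
        for k_idx, key_obj in enumerate(key_objs):
            if k_idx in used_keys:
                continue
            score = fill_matches(key_obj, resp_obj, ignored_slots)
            if score > best_score:
                best_score, best_key = score, k_idx
        if best_key is not None:
            alignments.append((best_key, r_idx))
            used_keys.add(best_key)
    return alignments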
The only exception concerns certain slots for which an overwhelming default fill exists, e.g., <entity>type. Such slots are ignored in the alignment process.</Paragraph> <Paragraph position="5"> If there is no content match at all between a response object and a key object or if the only match is on a slot that is excluded from the threshold-based alignment criteria, the response object is marked as spurious. In such cases, the object's alignment status is termed connected, meaning that the object did not align but there existed a key object to which it could have mapped had it met the minimal alignment criteria. For a connected object under the official All-Objects scoring method (see [4] and the preface to the score-report appendix), response fills with no corresponding key fill are scored as spurious and those with a corresponding key fill (whether the fills themselves are a correct match or not) are scored as incorrect.</Paragraph> <Paragraph position="6"> For a given object type in a template, there may be more than one possible alignment of object instances that meet the alignment criteria. Such objects are aligned on the basis of the degree of slot-fill match, as coarsely determined by the alignment program. The program determines an approximate error per response fill score, which will be overridden during the actual scoring process following alignment.</Paragraph> <Paragraph position="7"> The alignment of objects in one pass results in suboptimal mappings of some object types, especially <entity> in the JV template, because advantage cannot be taken of useful information about the dependency between <entity> (or <person>) and <entity-relationship>. The solution implemented for MUC-5 was to align objects in two passes, with a few of the object types handled in both passes. However, despite the theoretical advantages of two-pass alignment, it is believed that the adopted solution results in only slightly improved object mappings over what can be done in a single pass. Two-pass alignment is only a partial solution to the problem, but the problem itself appears to be relatively minor.</Paragraph> </Section> </Section> <Section position="6" start_page="31" end_page="33" type="metho"> <SectionTitle> MEASURING TASK DIFFICULTY </SectionTitle> <Paragraph position="0"> With each new MUC, the evaluators have challenged technology to deal with a broader variety of texts and to do more with them. One of the ways in which MUC-5 distinguishes itself from previous evaluations is in the increased task realism, which manifests itself in a greater variety of data extraction requirements, in the requirement for translation of extracted information into entries from standard reference sources (unabridged gazetteers, the Standard Industrial Code manual, etc.), and in a richer template structure. However, the most distinctive feature of Tipster Phase 1 for extraction is the requirement to handle more than one language and more than one domain. This requirement generated a strong push in the direction of language- and domain-independence, while the task realism generated a strong push for maximizing task coverage with minimum time and effort.</Paragraph> <Paragraph position="1"> Major changes have been made to the evaluation design over the years, which complicates the issue of progress assessment. Not only have the metrics and scoring and alignment algorithms evolved and been replaced, but new extraction tasks have been defined [10].
The first two tasks were in the naval tactical domain, the next two were in the terrorist domain, and the Tipster/MUC-5 evaluation was conducted in the joint ventures and microelectronics domains. One could conceive of trying to compare the difficulty of these tasks in terms of human performance; however, at this point there exists reasonably sound performance data only for MUC-5 [11, 12]. One could also imagine trying to measure relative difficulty in some atheoretic or polytheoretic way in terms of the number of semantic patterns, inference rules, etc., required to carry out the task, but that idea is not a practical one.</Paragraph> <Paragraph position="2"> In a preliminary attempt to compare the difficulty of different extraction tasks, quantitative criteria were developed in support of MUC-3 that enable comparison in terms of superficial features of the texts, template definition, and template fill rules [5]. Comparison of the complexity of the terrorist task with the naval task in light of these criteria shows at least an order-of-magnitude increase for several of the criteria. Once allowances are made for changes to the scoring methods and the earlier evaluation results are recomputed, it is clear for the results of the top systems in each evaluation that MUC-3 system performance represents significant progress for extraction systems as a group over the previous evaluation.</Paragraph> <Paragraph position="3"> The criteria can be adapted to allow rough estimation of the relative difficulty of the MUC-5 joint ventures and microelectronics tasks compared to the MUC-3/MUC-4 terrorism task. Most of the adaptations reflect the shift from a flat-format template to an object-oriented template. Table 3 summarizes the comparison, using EJV as the MUC-5 point of comparison.</Paragraph> <Paragraph position="4"> The summarized data indicate that the EJV task is a somewhat more difficult task than the terrorism task along three of the four major dimensions. The dimensions measure difficulty in the following terms:
Text corpus complexity measures difficulty in terms of coverage of language features that may be encountered during testing. Measurement takes the following statistics from the training corpus into account:
* number of text types
* vocabulary size
* average sentence length
* average number of sentences per text
Text corpus dimensions measures difficulty in terms of the volume of material to be processed in order to achieve coverage and monitor system progress. Measurement takes the following statistics from the training corpus into account:
* number of texts
* number of sentences
* total number of words
Template fill characteristics measures difficulty in terms of features of the template structure and the amount of information to be extracted from a given test set. Measurement takes several statistics from the training corpus into account.
Nature of task measures difficulty in terms of the extraction task in general -- the elaborateness of the rules that the system must incorporate in order to conform to the template definition and fill rules, including relevance rules at the template, object, and slot level and the formatting specifications at the slot-fill level.
Measurement takes into account the following statistics from the training corpus and MUC-5 test set:
* percent nonrelevant texts
This measurement also takes the following statistics drawn from the task documentation into account:
* number of pages of relevance rules
* number of pages of template definition and template fill rules
The numerical factor corresponding to each dimension in table 3 represents a rough average of the factors assigned to the component criteria identified above. Some of the assumptions inherent in this approach to assessing relative difficulty are that longer sentences will be processed less accurately than shorter ones, that relevant texts with a greater amount of relevant information present more opportunities for error, that a greater variety of extraction requirements makes a task harder, and that extraction is harder when it goes beyond categorization of information into a set fill.</Paragraph> <Paragraph position="5"> The EJV task is harder by a factor of two on criteria such as the following:
* vocabulary size;
* average number of sentences per text;
* number of slots in the template;
* one of the types of slot (numeric/complex slots).</Paragraph> <Paragraph position="6"> Among the ways in which EJV is easier than the terrorism task are the following:
* sentences in the EJV corpus are shorter on average (18 words versus 27 words);
* there are so few nonrelevant EJV texts that relevance filtering plays a negligible role (~10% nonrelevant versus ~50% nonrelevant);
* there is a sparser amount of information in the EJV templates (~1 filler per slot versus ~1.5 per slot).
The greatest difference between the EJV and terrorism tasks concerns the text corpus dimensions. This dimension, which treats the volume of text as a measure of difficulty, could be viewed as less of an issue now than it was for MUC-3. In fact, with the increasing popularity of statistical techniques, large amounts of training data are sometimes required. Nonetheless, the challenge of making effective use of text increases with the quantity of text, since a large amount of text implies a broad domain, and most kinds of domain knowledge cannot currently be captured using automated training methods.</Paragraph> <Paragraph position="7"> 11 One other statistic that was used in comparing the naval and terrorism tasks, the number of template types, was not used in this comparison because the statistic is not pertinent to the way the Tipster templates are designed. 12 In the MUC-4 template, there were no objects, but there were groupings of slots into those that contained data on the perpetrator of the terrorist act, the physical target of the terrorist act, the human target of the terrorist act, and on the terrorist act itself. These four slot groupings were referred to as pseudo-objects.</Paragraph> <Paragraph position="8"> 13 Two other statistics that could be used if two object-oriented tasks were being compared--average number of objects per template and average number of slots per object--were not used in this comparison because there were no formal object types in the terrorist template.</Paragraph> <Paragraph position="9"> The percent nonrelevant texts criterion, which figures in two of the dimensions, is based on the view that the more a system's performance would suffer as a consequence of ignoring the text filtering (document detection) subtask, the harder the task.
The percentage of nonrelevant texts in EJV is so low (approximately 5% in th e training corpus and 10% in the MUC-5 test sets) that a system can almost ignore the text filtering subtas k without suffering a serious degradation in performance; the system can be optimized in favor of generating tie-ups even when it is not sure there is sufficient information in the text . This is not true of the terrorism task, where the percentage of nonrelevant and relevant texts is about equal . In conclusion, either a task such as EJV that places extremely little emphasis on text filtering or a task that places extremely high emphasis on text filtering i s considered to be less difficult than one such as the terrorism task, which places significant emphasis both on tex t filtering and information extraction .</Paragraph> <Paragraph position="10"> Within the context of the extraction subtask independent of the text filtering subtask, the more information there is to be extracted, the more difficult the task is judged to be . This is because richer texts present more opportunities to miss information and to confuse information about one reportable item with another. The most difficult comparison to make concerns the template fill characteristics, because of the switch to the object-oriented template . Furthermore, the overall difficulty of slot fill criterion is itself composed of several features . It is based on the number and distribution of the various types of slots : set-fill slots with no more than twelve possible fills, set-fill slots with more than twelve possible fills (for MUC-5, these were slots tha t referenced the gazetteer), numeric/complex slots (which includes some normalized fills and, in the case of MUC-5 , some two-part fills), string-fill slots (and normalized strings such as corporation names), and pointer-fill slots (i n the case of MUC-4, these are slots that require cross-references) . The more open-ended the extraction task, the harder it is judged to be . The EJV task is judged to be harder with regard to the numeric/complex slots in particular.</Paragraph> <Paragraph position="11"> In summary, the generalization may be that the EJV task is harder than the terrorism task in terms of th e template (number and nature of slots), the sheer volume of text (vocabulary size), and the discourse demand s (number of sentences per text), but a little easier in terms of the shorter sentence length, lesser proportion of relevant information in relevant texts (number of fills per slot), and very small proportion of nonrelevant texts .</Paragraph> </Section> <Section position="7" start_page="33" end_page="39" type="metho"> <SectionTitle> OVERALL RESULTS </SectionTitle> <Paragraph position="0"> The discussion of the MUC-5 evaluation results will be presented from various perspectives, using th e metrics that are most appropriate in each case . This paper presents some general views on the results . Results for individual sites are summarized in the papers in this volume that were prepared by the evaluation participants .</Paragraph> <Paragraph position="1"> Progress Assessment from MUC-4 to MUC-5 (EJV) Since the F-measure was in force for both MUC-4 (as an official metric) and for MUC-5 (as an unofficia l metric), a rough measure of progress can be obtained with that metric, using EJV as the representative MUC- 5 task. The purpose of the comparison is to gauge whether the field of NLP as a whole has progressed in terms o f overall performance achievable on extraction tasks. 
To that end, only the top-scoring systems are included in the comparison, namely those that were in one of the top two ranks statistically according to the F-measure.14 There were four systems in the top two ranks for MUC-4 (TST3 and TST4 test sets) [2] and three in the top two ranks for MUC-5 (EJV test set) [3]. These systems are GE, GE/CMU, UMass/Hughes, and SRI for MUC-4, and GE/CMU, BBN, and SRI for EJV MUC-5. The average F-measure score of the MUC-4 systems is 51.68; the average for the MUC-5 EJV systems is 47.12. If the one non-Tipster EJV system (SRI) is excluded from the EJV MUC-5 average, the average rises to 49.35.</Paragraph> <Paragraph position="2"> The greater level of difficulty of the MUC-5 EJV task and the fact that the F-measure scores are close to being as high as the MUC-4 F-measure scores indicate that performance of top MUC-5 EJV systems is at least comparable to performance of top MUC-4 systems. It is important to remember that the Tipster systems were achieving that level of performance for MUC-5 on EJV while working also in the microelectronics domain and, in most cases, also in Japanese. In that regard, it is notable that the GE/CMU system scored in the top rank in each language and domain pair; on the F-measure their scores were 52.75 for EJV, 60.07 for JJV, 49.18 for EME, and 56.31 for JME. It is also notable that SRI, which was a non-Tipster MUC-5 participant in both EJV and JJV, achieved F-measure scores of 42.67 and 44.21, respectively.</Paragraph> <Paragraph position="3"> The fact that relative task difficulty can be assessed only roughly, together with the fact that several MUC-5 sites worked on more than one task, means that too much importance should not be placed on comparison of scores between MUC-4 and MUC-5. However, whether or not difficulty factors and evaluation design changes are taken into account, there is at least one MUC-5 task on which performance can only be said to be outstanding, namely the JJV core-template task. Two systems achieved an F-measure score on the JJV core-template test in the 70-80 range -- 73.54 (corresponding to an error per response fill of 39) for the GE/CMU Shogun system and 77.94 (error per response fill of 34) for the GE/CMU optional test run with the TEXTRACT system. Top performance on the EJV core-template test was about 20 points worse. The relatively high performance on the JJV core-template task may be indicative not only of the relative simplicity of the core-template task compared to the full-template task but also of the relative simplicity of the JJV texts compared to the EJV texts. (Some of these language differences are discussed further in a later section.) Nonetheless, taken on the task's own terms, these JJV scores reflect strong performance.</Paragraph> <Paragraph position="4"> Comparison of Machine Performance with Human Performance Application perspective. The F-measure is a weighted combination of recall and precision. Recall and precision give an indication of system performance relative to the application goals of extracting all and only the information that should be extracted. Despite the fact that humans are subject to human factors limitations that inhibit their performance, the performance limits of humans on an information extraction task represent a good target for automated systems as well, since the shortfall of human performance from perfection is due not only to human factors but also to other factors, such as deficiencies in the task definition.
As reported in [11, 12], human performance and machine performance on 120 articles in the MUC-5 EME test set were measured. As part of the study, the performance of the four well-trained analysts and the top three MUC-5 systems (GE/CMU, BBN, and UManitoba) was compared.</Paragraph> <Paragraph position="5"> The four human analysts were able to extract up to 79% of the information expected (recall metric), and of all the information they extracted, at best 82% of it was judged to be correct (precision metric). Performance of the top systems fell far below human performance; the three systems used in the comparison were able to extract up to 53% of the information expected, and of all the information they extracted, at best 57% of it was judged to be correct. In terms of performance shortfall, the machines fell 19-38 points short of human performance on the recall measure and 18-31 points short of human performance on the precision measure.</Paragraph> <Paragraph position="6"> Increasing system recall and precision by another 20 points or so may not seem to be a difficult task -- after all, since systems managed to obtain an F-measure score in the 70s on the JJV core-template test, why not also on the EME task? But it may not be easy to increase both recall and precision by that amount simultaneously on a relatively difficult task such as EME, since the metrics are in tension with each other. The harder a system tries to extract all the expected information (i.e., the more aggressively configured it is), the more likely it is to extract erroneous information. The tension is reduced if the texts are easier to interpret, as the JJV texts apparently are (see the section below on handling two languages), and if the task is simpler, as the JJV core-template task undoubtedly is in comparison to the EME (full-template) task.</Paragraph> <Paragraph position="7"> The overall recall and precision scores hide the fact that there were not only slots on which human performance was relatively strong but also slots on which human performance was relatively weak. A study reported in [11] measured the degree of difference between human and machine performance for frequently-filled slots in a portion of the EME test set. The author's general conclusion was that machines did comparatively well on slots that may lend themselves to keyword analysis and that are to be filled with a set-fill category from a relatively long list; examples include the <layering> type and film slots.</Paragraph> <Paragraph position="8"> Speed. One respect in which systems showed an advantage over humans is speed. On average, the time required for a human to fill a template (using software tools tailored for the Tipster tasks) ranged from 15 minutes (for an EME template) to over 60 minutes (for a JJV template).
In contrast, timing information collected for the BBN PLUM system, the GE/CMU Shogun system, and the NMSU/Brandeis Diderot system shows that the average time required to process an article in the EME test set was between 75.0 seconds (Shogun on a Sparc 10 with 64 mb RAM) and 211.2 seconds (Diderot, which was not optimized for speed in English, on a Sparc 2 with 32 mb RAM) and that the average time required to process an article in the JJV test set was between 39.0 seconds (Diderot on a Sparc 2 with 32 mb RAM) and 140.8 seconds (PLUM on a Sparc 10 with 128 mb RAM).</Paragraph> <Section position="1" start_page="35" end_page="39" type="sub_section"> <SectionTitle> Predominant Classes of Error </SectionTitle> <Paragraph position="0"> The most frequent type of error committed by nearly all of the MUC-5 systems was to miss pertinent information. This class of error is captured by the undergeneration metric. The test results show that performance on this metric is a good indicator of performance on the overall metric of error per response fill. The effect of undergeneration in relation to the overgeneration and substitution metrics as well as to the error per response fill metric can be seen in figures 1 and 2, which graph the results of all MUC-5 systems for the EME and JME tests. From these graphs it is clear that undergeneration (UND) generally correlates with the overall error per response fill metric. Substitution is a lesser source of error than undergeneration and overgeneration, lesser even than overgeneration. Examination of the template-fill specifications sheds light on these data. Some slots and objects in the JV and ME templates have an essentially fixed number of fills, requiring one fill or allowing zero or one fill; others have a highly variable number, some requiring one or more fills and some allowing zero or more fills. Thus, for the slots having a highly variable number of fills, there is no absolute bound on the number of fills a system could potentially spuriously generate. This means that overgeneration on those slots could be quite high. Substitution errors, on the other hand, are accrued only when there exists a pairing between a fill in the key and a fill in the response, and the response is judged to be incorrect. Thus, the substitution score has as its upper bound the number of fills in the key.</Paragraph> <Paragraph position="1"> If the extraction of relatively little target information is indicative of poor overall performance, how and to what extent does the extraction of relatively much information -- good or bad -- correlate with overall performance? Are the aggressive systems just wildly guessing, or is their aggressiveness paying off for them on the overall metric? The data show that there is a correlation between generating lots of data and obtaining a relatively good (i.e., low) error per response fill score. This can be seen by computing the number of right and wrong fills generated by a system (this number is called the actual (ACT)) as a percentage of the total number of fills expected (termed the possible (POS)) and comparing that percentage with the overall error per response fill score.</Paragraph> <Paragraph position="2"> In figure 3, the EME results are sorted by increasing error per response fill on the vertical axis.
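The percentage just described can be read directly off the score-report tallies; a minimal sketch, with tally names mirroring the earlier sketch and otherwise hypothetical:

def actual_over_possible(cor, par, inc, spu, mis):
    # ACT: right and wrong fills the system generated; POS: fills expected by the key.
    actual = cor + par + inc + spu
    possible = cor + par + inc + mis
    return 100.0 * actual / possible  # values over 100 mean more fills generated than expected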
It is evident that the more fills generated by the system, the better its error per response fill score, even to the extent that the number of fills generated by the GE/CMU system exceeds the number expected, i.e., the system clearly generated a high proportion of spurious fills (as figure 1 bears out). The only clear exception to the generalization is the UMichigan system, which had a relatively high error per response fill score despite having generated relatively many fillers (more than the Language Systems, Inc. (LSI) system or the NMSU/Brandeis system). Figure 1 shows that the UMichigan system suffered from relatively high overgeneration as well.</Paragraph> <Paragraph position="3"> The results of the comparative performance study of machines (the GE/CMU, BBN, and UManitoba systems) and humans on part of the MUC-5 EME task show how far short of human performance the machines' performance fell. Well-trained humans are being compared with the best-performing MUC-5 systems. Since the MUC evaluations are designed to challenge research technology as well as to show a practical use of technology, it would probably be unreasonable to expect that any information extraction system participating in a MUC evaluation would perform at a level close to humans, and it is unlikely that any of the MUC-5 participants had comparability with humans as their primary development goal. Nonetheless, there may be evaluation data to help support speculation about how likely it would be that these systems could be developed to make up the shortfall. Figure 1 shows that all EME systems other than GE/CMU incurred more errors as a result of missing information than as a result of committing other types of error, and figure 3 shows that generating more data was generally beneficial in terms of improving overall performance. The fact that the BBN and UManitoba systems' overall performance is very close to GE/CMU's -- in fact, the differences among the three are statistically insignificant [3] -- provides evidence that relatively good performance does not necessarily come at the expense of high overgeneration15 and therefore that greater task coverage could make up for some of the shortfall from human performance. Further evidence of the room left for improvement of most, if not all, MUC-5 systems is found in the fact that there are slots which systems never filled in during the final test.16 Even in the case of Tipster systems, these unattempted slots can account for a sizeable proportion of the total number of missed pieces of information.17</Paragraph> <Paragraph position="4"> Measuring the Performance of Systems at Different Levels of Maturity Scoring of unfilled slots. An object that is instantiated in the answer key may not be fully filled; the corresponding text may not provide information to fill some of the slots defined for that object type. Cases where a template slot is correctly left unfilled by the system under evaluation are scored as noncommittal by the scoring software. Noncommittals are not included in the standard formulation of any of the performance measures. This is reasonable from a research perspective, if not from an applications perspective. The question comes down to whether systems normally leave a slot unfilled out of knowledge or whether they do so out of a lack of knowledge. Highly immature systems tend either to overgenerate to an extreme, leaving few slots unfilled, or to undergenerate to an extreme, leaving many slots unfilled. The latter type of immature system was very common at the MUC-5 evaluation and could have benefited unfairly from a metric that considers a noncommittal fill to be a correct fill, especially since there are many unfilled slots in the key templates.</Paragraph> <Paragraph position="5"> 15 This is not to fault the GE/CMU system for overgenerating. There are other systems with an equal or worse overgeneration score that come nowhere near matching the GE/CMU system in error per response fill. The GE/CMU system had undergeneration and overgeneration in close balance for the MUC-5 evaluation and evidently was optimized on both. Of the four language-pair systems they were required to field for MUC-5, three came out slightly better on balance on recall (which emphasizes minimizing undergeneration) and one, JN, came out slightly better on precision (which emphasizes minimizing overgeneration).</Paragraph> <Paragraph position="6"> 16 Count of unattempted slots (i.e., those where the system's &quot;actual&quot; equals zero) excludes those slots that were never filled in the key (i.e., those where the &quot;possible&quot; equals zero).</Paragraph> <Paragraph position="7"> 17 For example, BBN's JJV system made no attempt to fill 17 of the slots in the JJV template, which accounts for 25% of the total missing, and their JME system made no attempt to fill 12 of the JME slots, accounting for 24% of the total missing; the UMass/Hughes system made no attempt to fill 13 of the EJV slots and 11 of the EME slots, and in each case this accounts for 15% of the total missing.</Paragraph> <Paragraph position="8"> The effect of scoring noncommittal fills as correct fills is to give an inflated estimate of performance, at least for the systems that undergenerate to a relatively large extent. It also has the potential effect of giving a distorted cross-system view, since very immature systems could end up being ranked higher than is intuitively sensible.18 The latter effect was not evident, however, for MUC-5, despite the relatively large number of unfilled slots in the answer keys (EJV compared to MUC-4). Apparently, the potential effect on the MUC-5 evaluation was eliminated through the object structure. Since the MUC-5 templates consist of objects that are aligned separately, the scoring impact of producing an object that fails to meet the minimal alignment criteria is limited to just that one object. Such an object, which contains an insufficient amount of correct fill to warrant alignment, is not given credit for any &quot;correct&quot; fills.19 Thus, even though the object alignment criteria were loose for MUC-5, there were still objects that failed to align, and systems got no credit for any correct information that they may have contained.</Paragraph> <Paragraph position="9"> For MUC-4, on the other hand, there was no object alignment, only template alignment, and the template alignment criteria were fairly strict.
Thus, although no credit would be gained for correct fills in an unaligned template, the amount of credit that would be obtained for noncommittal fills in an aligned template would be fairly high on average, since the MUC-4 template is a larger structure than any of the objects in the MUC-5 templates.</Paragraph> <Paragraph position="10"> Figures 4 and 5 provide examples of the difference the treatment of the noncommittal scoring category can make in the MUC-5 results. They show the error per response fill scores for the Tipster systems on EME and JME MUC-5 using two formulations of the metric: the standard formulation, which disregards noncommittal fills, and the alternative formulation, which treats noncommittal fills as correct. The alternative formulation and the standard formulation provide consistent cross-system views of performance; as discussed above, the alternative formulation does not distort the cross-system perspective on the results.</Paragraph> <Paragraph position="11"> 18 When applied to MUC-4 systems, the standard formulation of error per response fill results in no significant reranking of the 17 systems. But a formulation that includes noncommittals would result in rerankings of all 17 systems. The most radical changes would be for immature systems whose number of noncommittals greatly outweighs all other categories of response.</Paragraph> <Paragraph position="12"> 19 Major differences between MUC-5 and MUC-4 in the alignment process do not play a role in this investigation of the scoring of noncommittal fills, since the investigation with respect to both MUC-5 and MUC-4 treated such fills as correct only in the scoring stage, not in the alignment stage. As far as scoring method goes, the two evaluations are not very different; both used the All-Objects method, which for MUC-4 was called All-Templates.</Paragraph> <Paragraph position="13"> Viewed in terms of the impact on the actual scores, the difference between the two formulations ranges from 14 to 18 points.20 As mentioned earlier, the alternative formulation inflates the scores of systems that greatly undergenerate. It is quite likely that such systems leave slots unfilled ignorantly more often than they do so knowingly. Nonetheless, actual performance of the systems may be estimated to lie somewhere between the two values, closer to the standard value for lesser developed systems and closer to the alternative value for more highly developed systems.</Paragraph> <Paragraph position="14"> Richness-Normalized Error. The alternative error per response fill formulation described above may provide better insight than the standard formulation into the potential performance level of systems that miss relatively little of the pertinent information in the texts. Similarly, the richness-normalized error, either in its standard formulation or in an alternative formulation, may provide better insight than the error per response fill into the potential performance level of systems that generate relatively little spurious information, i.e., that have a relatively low overgeneration score.
This metric views documents as streams of data of varying richness according to the number of fills in the key.21 The richness-normalized error metric is close to being a system-independent metric, meaning that the denominator disregards spurious responses because the number of such responses varies greatly from one system to the next.22 Since overgeneration was a significant problem for virtually all MUC-5 systems, this measure tends to distort cross-system comparisons by treating relatively harshly those systems that overgenerate to a relatively large extent. For this reason, this measure does not appear to offer a useful way of viewing the MUC-5 test results; however, it may be useful when the performance of the systems under evaluation is uniformly higher.</Paragraph> <Paragraph position="15"> Handling Two Languages Four of the five sites that were evaluated in both Japanese and English (see table 1) performed at least as well in Japanese as in English. Averaged across all the MUC-5 systems, JME error per response fill is better than EME by eight points, and JJV is better than EJV by eleven points. Averaged across all the sites and the two domains, there is a ten-point difference between Japanese and English.23 These findings are presented and analyzed in [7].</Paragraph> <Paragraph position="16"> System performance differences between the two languages in the JV domain are attributed largely to differences between the EJV and JJV text corpora in terms of overall text structure and style. Analysis of the Japanese text characteristics and their impact on extraction performance is presented in [6]. The JJV corpus is more nearly homogeneous and the texts and sentences more pattern-like, which reduces the discourse demands and generally facilitates extraction.</Paragraph> <Paragraph position="17"> In the ME domain, differences in scores between the two languages may be attributed largely to the fact that there was one-third less information to extract in JME than EME (average of 17 fills per template in JME, 25 fills per template in EME), including only about one microelectronics capability per template in JME as opposed
20 For EJV and JJV, the difference is somewhat less, ranging from 9-13 points.
21 Thus, data extraction is viewed as analogous to speech recognition. Just as in speech, where there are detectable and classifiable signals coming in, in data extraction there is extractable information coming in. The slot-fill count for a document is analogous to the word count for a stream of speech, and the slot fills in the key templates are analogous to the known words in the spoken sentences.
22 However, it is not entirely system-independent. A small amount of system dependence remains because of variability in the key templates, which capture some textual ambiguity by representing alternative correct answers, which may include an alternative number of slot fills or objects in a particular instance [4]. This situation may arise in speech as well, where hearers disagree on which words and how many words were uttered.
23 These statistics are based on the results for all MUC-5 sites. Consequently, English JV and ME averages are low because of the number of relatively underdeveloped, non-Tipster systems that were evaluated in English only.
If thestatistics are limited to those sites that worked in both languages (five JV, three ME), there is still a five-point difference between Japanese and English (six-point difference for JV and four-point difference for ME) .</Paragraph> <Paragraph position="16"> to about two per template in EME. Thus, the problem of object splitting and merging (discourse-related template effects) is lesser in JME . Discussion in [7] goes into more depth on this subject, showing interestin g performance differences between EME and JME in relation to the key templates that contain multipl e <process> objects.</Paragraph> </Section> <Section position="2" start_page="39" end_page="39" type="sub_section"> <SectionTitle> Handling Two Domains </SectionTitle> <Paragraph position="0"> The four Tipster sites (five systems, including the GE/CMU optional JJV and JME optional test runs usin g the CMU TEXTRACT system) were evaluated in both extraction domains. Although they had more time to work on JV than ME, their system performance was comparable across domains . Overall error per response fill scores for the UMass/Hughes system are the same for EJV and EME; the BBN system performed a little better on ME than JV (two points difference in both languages) ; the GE/CMU system scored worse on ME than JV (four points difference in both languages) ; and results for the NMSU/Brandeis system are mixed -- better on ME than JV for English (five points difference) and a little worse for Japanese (two points difference) . The biggest difference was shown on the GE/CMU optional test run (nine points worse on JME than JJV) .</Paragraph> <Paragraph position="1"> It would appear that the comparable results achieved by most of the systems are attributable primarily t o factors that kept JV performance down . Parts of the JV template underwent many changes, which may hav e caused sites to do less development on those parts. Some sites may have also skipped parts that represented a very small proportion of the overall task in terms of number of fills in the training corpus keys, especiall y skipping deeply embedded slots and/or objects . In addition, the fact that testing was conducted on a core portion of the template as well as on the full template may have caused sites to focus less development effort on the non core portions of the template .</Paragraph> <Paragraph position="2"> The net effect of these factors is that the sites essentially reduced the task to a manageable size and as a consequence, incurred errors by missing relatively more information in JV than ME . Although thi s generalization holds for most of the MUC-5 systems, among the Tipster systems it does not apply to th e GE/CMU Shogun English system, the GE/CMU TEXTRACT (optional) Japanese system, or th e NMSUBrandeis Japanese system . Statistics on the average degree of task reduction by the MUC-5 sites in each language-domain pair can be found in [7] .</Paragraph> </Section> </Section> <Section position="8" start_page="39" end_page="42" type="metho"> <SectionTitle> RESULTS FOR LIMITED JV TAS K </SectionTitle> <Paragraph position="0"> MUC-5 English and Japanese joint ventures testing was conducted in two configurations . In one configuration, the entire template was scored; in the other, only the core portion of the template was scored (se e footnotes to table 2) . 
Figures 6-9 graph error per response fill together with the diagnostic secondary metrics of undergeneration, overgeneration, and substitution for the Tipster systems for each of the two configurations.</Paragraph> <Paragraph position="1"> Across the EJV systems, the error per response fill scores on the core-template test range from seven to nine percentage points better (lower) than on the full-template test; for the JJV systems, the error per response fill scores on the core-template test range from fifteen to sixteen points lower than on the full-template test.</Paragraph> <Paragraph position="2"> The source of most of the difference in error per response fill is in the number of missed fills, which is reflected in better undergeneration scores on the core-template test; the range across Tipster systems is 6-15 points lower for EJV and 11-24 points lower for JJV. The only other sizeable differences (i.e., differences of more than five points) are the overgeneration score for the GE/CMU EJV and JJV systems (nine points lower on the core-template test for EJV and seven points lower for JJV) and both the overgeneration and substitution scores for the GE/CMU optional JJV run using the CMU TEXTRACT system (overgeneration nine points lower on the core-template test and substitution seven points lower). Thus, for all systems except GE/CMU's, the only score among the secondary metrics that differs considerably between the two test configurations is the undergeneration score.</Paragraph> <Paragraph position="3"> The difference in scores on the two configurations is more marked for Japanese than for English, with the best error per response fill scores posted for the whole evaluation by the GE/CMU Shogun system and the GE/CMU optional test run with the TEXTRACT system on the JJV core-template test (scores of 39 and 34, respectively). On the EJV and JJV full-template tests, most of the error per response fill scores are in the 50-70 range. As a point of reference, the error per response fill score of 61 posted by the GE/CMU system on the EJV full-template test corresponds to a recall of 57 and a precision of 49 (F-measure of 52.75).</Paragraph> <Section position="1" start_page="40" end_page="42" type="sub_section"> <SectionTitle> Slot-Level Performance </SectionTitle> <Paragraph position="0"> The JV core template includes fourteen slots, one-third as many slots as the full template; yet for the EJV MUC-5 full-template test, the slot fills from the core slots account for nearly two-thirds (around 63%) of the total slot fills. This distribution reflects the fact that the core-template slots cover some of the less idiosyncratic portions of the task. Since the MUC-5 test set is fairly representative of data seen in the training corpus, it is not surprising that participants would have dedicated more development effort to the core slots in the template and would have been able to leverage previous work that is applicable across a range of tasks.</Paragraph> <Paragraph position="1"> Therefore, it is not surprising that scores on the core slots are relatively good compared to other slots in the template. At least one of the four Tipster EJV systems had an error per response fill score of less than or equal to 50 on six of the 43 scored slots24; five out of the six slots are in the core part of the template. At least one of the systems scored between 51 and 75 on twenty other slots; nine of the twenty are in the core part of the template.
<Section position="1" start_page="40" end_page="42" type="sub_section"> <SectionTitle> Slot-Level Performance </SectionTitle> <Paragraph position="0"> The JV core template includes fourteen slots, one-third as many slots as the full template; yet for the EJV MUC-5 full-template test, the slot fills from the core slots account for nearly two-thirds (around 63%) of the total slot fills. This distribution reflects the fact that the core-template slots cover some of the less idiosyncratic portions of the task. Since the MUC-5 test set is fairly representative of data seen in the training corpus, it is not surprising that participants would have dedicated more development effort to the core slots in the template and would have been able to leverage previous work that is applicable across a range of tasks.</Paragraph> <Paragraph position="1"> Therefore, it is not surprising that scores on the core slots are relatively good compared to those on other slots in the template. At least one of the four Tipster EJV systems had an error per response fill score of less than or equal to 50 on six of the 43 scored slots24; five of those six slots are in the core part of the template. At least one of the systems scored between 51 and 75 on twenty other slots; nine of the twenty are in the core part of the template. Scores over 75 were obtained for many non-core slots but not for any core slots. Statistics for the Tipster EJV system that scored best on each slot and for the average across Tipster EJV systems are summarized in table 4.</Paragraph> <Paragraph position="2"> 24The <rate> slot is excluded from the total slot count, since there were no fills for it in the key for the EJV MUC-5 test.</Paragraph> <Paragraph position="3"> in error per response fill. Numbers in parentheses are for core-template slots.</Paragraph> <Paragraph position="4"> The fact that performance on the core slots is relatively good is evident if the template slots are divided into categories roughly according to their type: pointer, set fill, string fill, numeric fill, geographic place-name fill, temporal fill, and two-part (complex) fill. The core template contains slots of the following types: pointer, set fill, string fill, and geographic place-name fill. For each of the Tipster EJV systems it is generally the case that performance on the core slots of a given type is better than performance on any other slots of that type. Thus, for example, performance by each of the Tipster EJV systems on the four set-fill slots in the core set (<entity>type, <tie-up-relationship>status, <entity-relationship>status, <entity-relationship>rel-ent2-to-ent1) is better than performance on any of the four set-fill slots that are not in the core set (<industry>type, <facility>type, <person>position, <revenue>type).</Paragraph> <Paragraph position="5"> There are three minor exceptions, which affect only the NMSU/Brandeis and UMass/Hughes systems. Two of the exceptions show performance on a non-core set-fill slot slightly better (two points) than on <entity-relationship>rel-ent2-to-ent1. The third exception is that the UMass/Hughes system performed four points worse on <entity>aliases, a core string-fill slot, than on <person>name, which is a non-core string-fill slot.</Paragraph> <Paragraph position="6"> However, in addition to these minor exceptions, there are two core slots that represent major exceptions affecting all four of the systems: <tie-up-relationship>joint-venture and <entity-relationship>entity2.</Paragraph> <Paragraph position="7"> These are both pointer slots to an <entity> object. For each system, there is at least one non-core pointer slot (and as many as five) on which the system scored better than on these two core slots. Furthermore, there is a gap of at least nine points (and as many as seventeen) between the scores for these two core pointer slots and the scores for the other core pointer slots.</Paragraph> <Paragraph position="8"> The joint-venture and entity2 slots have similarities that indicate why performance on them is not as good as on the other core pointer slots: they both require making two-way role distinctions among entities found in the texts, and they both capture the less frequent of the two entity roles. In the case of the <tie-up-relationship> object, both the joint-venture and the entity slots point to an <entity> object, but the joint-venture slot is meant to be filled only when a tie-up results in the formation of a joint venture company, which is often not the case. In the case of the <entity-relationship> object, both the entity1 and entity2 slots point to an <entity> object, but the entity2 slot is meant to be filled only if a relationship exists other than partnership, which is the most common type of relationship. The lower scores on joint-venture and entity2 are therefore attributed in part to the relative difficulty of identifying specific roles of entities. The restricted use of the joint-venture and entity2 slots is reflected in the template definition: joint-venture and entity2 are constrained to contain either zero or one filler, while the entity and entity1 slots must contain at least one filler and may contain two or more. The system must decide not only what to fill the joint-venture and entity2 slots with but also whether to fill them at all. Thus, the system is likely to fill them only if it has found clear evidence, in order to avoid generating spurious data, and this can result in the opposing type of performance problem, namely missing relevant information.</Paragraph>
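The cardinality asymmetry described above can be pictured with a small schematic sketch in Python. It is illustrative only: the class and field names echo the slot names discussed in the text but are not the official MUC-5 template definition, and pointers to <entity> objects are represented simply as identifier strings.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class TieUpRelationship:
        # One or more pointers to <entity> objects taking part in the tie-up.
        entity: List[str] = field(default_factory=list)
        # Zero or one pointer; filled only when the tie-up results in the
        # formation of a joint venture company.
        joint_venture: Optional[str] = None

    @dataclass
    class EntityRelationship:
        # One or more pointers to <entity> objects in the more frequent role.
        entity1: List[str] = field(default_factory=list)
        # Zero or one pointer; filled only for relationships other than
        # partnership, the most common type.
        entity2: Optional[str] = None

Under such a definition, leaving joint_venture or entity2 empty is often the correct answer, which is why a conservative filling strategy that avoids spurious fills tends instead to miss relevant information, as noted above.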
<Paragraph position="9"> Apart from the joint-venture and entity2 slots, the only core slots that appear to have suffered relatively poor performance for all four systems compared to other core slots are the <entity>location and <entity>nationality slots, two of the three geographic place-name slots in the template. Although the systems scored better on these two place-name slots than on the non-core one, <facility>location, the fact that all systems appeared to have relative difficulty with those two core slots is notable, as it may reflect the practical difficulty of selecting the correct entry for an ambiguous place name from the large English gazetteer, as well as the linguistic difficulty of determining whether a mention of a place in association with an entity reflects the entity's location or its nationality.</Paragraph> <Paragraph position="10"> However, there is a problem with attributing the relatively low performance on the four core slots under discussion solely to the difficulty of determining the correct role of an entity or of a geographic place name. The problem is that the lower performance may also be partially explained by the fact that those slots are less frequently filled than any of the other core slots in the full-template test.25 All other core slots account for at least 3% each of the fills in the full-template test, with six core slots in the 3-4% range and five slots in the 5-10% range. Thus, even among the core slots, it can be expected that development efforts were not focused equally on all slots and that lower performance on some core slots may be a consequence not only of their relative difficulty but also of their lesser impact on the total evaluation.</Paragraph> </Section> </Section> </Paper>