<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1008">
  <Title>GTE'S TEXT INTERPRETATION AID (TIA): MUC-3 TEST RESULTS AND ANALYSIS</Title>
  <Section position="3" start_page="0" end_page="69" type="metho">
    <SectionTitle>
SCORING RESULTS
</SectionTitle>
    <Paragraph position="0"> The following section discusses MUC-TIA's evaluation scores collected during the final week of system development/testing. All scores were gathered by running MUC-TIA in its normal operational mode, i.e., no tradeoff testing configuration switches were used to optimize testing parameters (e.g., recall vs. precision, or precision vs. overgeneration). MUC-TIA operates under the direct assumption of maximized recall and precision, and minimized overgeneration and fallout. For a detailed discussion of the scoring metrics, see the MUC-3 Scoring System User Manual [3]. Tables 1.0, 2.0, and 3.0 display GTE's scores for the MUC-3 NL evaluation task.</Paragraph>
    <Paragraph position="1"> Recall Recall is a scoring metric to be maximized; it measures the amount of data extracted from messages and inserted into message templates during the parsing and extraction processes. During Phase II of MUC-3, overall recall (REC) for NOSC's test set "tst2-muc3" was computed to be 28% for "Matched Only"1, 11% for "Matched/Missing", and 11% for "All Templates." However, these results at first glance seemed inconsistent with our Phase I results from NOSC's test set "tst1-muc3", shown in Table 2.0, where TIA achieved a recall of 21%2, suggesting that recall decreased after Phase II development. To form a baseline for comparisons, GTE rescored NOSC's test set "tst1-muc3" using Phase II's scoring software; these results are shown in Table 3.0. Unfortunately, Phase II 1 "Matched Only" refers to the totals for templates which are matched, i.e., scores are not penalized for missing or spurious slot fillers (the template-id slot is an exception to this rule). "Matched/Missing" contains the totals for templates which are matched; however, scores are penalized for missing, but not spurious, slot fillers. "All Templates" contains totals for all templates, with penalization for both missing and spurious slot fillers. "Set Fills Only" contains the totals for only the slots filled from a finite set.</Paragraph>
    <Paragraph position="2">  2 The scoring software used during Phase I of MUC-3 was significantly modified to capture more precise scoring metrics. Phase I Grand Totals roughly correspond to Phase II's Matched/Missing template scores.</Paragraph>
    <Paragraph position="3">  discouraging results were confirmed after rescoring tst1-muc3, when recall decreased from 31% for "Matched Only"</Paragraph>
    <Paragraph position="5"> One interesting score that was consistent throughout the entire MUC-3 evaluation task (Phase I and Phase II) was precision. Precision (PRE) measures the correctness of the information extracted from the messages and placed in the templates during the parsing processes. The overall goal is to maximize precision. GTE's precision (for "tst2-muc3") was 43% for "Matched Only", 43% for "Matched/Missing", and 25% for "All Templates." Compared with the Phase I scores, precision did increase, although not significantly (by 1%). Moreover, the rescored "tst1-muc3" precision was 42% for "Matched Only", 42% for "Matched/Missing", and 18% for "All Templates".</Paragraph>
    <Paragraph position="6"> OverGeneration Overgeneration is the scoring metric which measures extraneous template fills, i.e., the percentage of templates which were incorrectly spawned during the parsing and extraction processes. This metric should be minimized. GTE scored 33% for tst2-muc3 "Matched Only", 33% for "Matched/Missing", and 61% for "All Templates". During Phase I testing, GTE scored 29% overgeneration. After rescoring tst1-muc3 (after Phase II development), overgeneration increased to 35% for "Matched Only", 35% for "Matched/Missing", and 72% for "All Templates". Overgeneration thus increased slightly with Phase II development.</Paragraph>
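The three metrics discussed above can be computed directly from the slot-count columns reported in the score tables (POS, ACT, COR, PAR, SPU). A minimal sketch, assuming the standard MUC-3 convention that partial fills earn half credit:

```python
# MUC-3 slot-level scoring metrics, following the conventions of the MUC-3
# Scoring System User Manual (partial matches count as half credit).
# Column names match the score tables: POS (possible), ACT (actual),
# COR (correct), PAR (partial), SPU (spurious).

def muc3_scores(pos, act, cor, par, spu):
    """Return (recall, precision, overgeneration) as percentages."""
    credit = cor + 0.5 * par              # partial fills earn half credit
    recall = 100.0 * credit / pos if pos else 0.0
    precision = 100.0 * credit / act if act else 0.0
    overgeneration = 100.0 * spu / act if act else 0.0
    return recall, precision, overgeneration

# Grand-total row of Table 1.0: POS=1161, ACT=580, COR=216, PAR=51, SPU=168
rec, pre, ovg = muc3_scores(1161, 580, 216, 51, 168)
print(round(rec), round(pre), round(ovg))   # 21 42 29
```

These formulas reproduce the REC/PRE/OVG columns of the grand-total rows in the tables to within rounding.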
    <Paragraph position="7">
Table 1.0: Official TST2-MUC3 Phase II Scores

SLOT                 POS   ACT   COR   PAR   INC   SPU   MIS   REC   PRE   OVG
template-id           95   135    59     0     0    76    36    62    44    56
incident-date         92    56    15    25    16     0    36    30    49     0
incident-type         95    59    36     1    22     0    36    38    62     0
category              66    59    29     0    12    18    25    44    49    30
indiv-perps           87    14     1     1    11     1    74     2    11     7
org-perps             58    15     7     3     2     3    46    15    57    20
perp-confidence       98    59    28     2    20     9    48    30    49    15
phys-target-ids       52     0     0     0     0     0    52     0   100     0
phys-target-num       40    26     2     0     4    20    34     5     8    77
phys-target-types     47     0     0     0     0     0    47     0   100     0
human-target-ids      94    32     2     4    17     9    71     4    12    28
human-target-num      68    32    12     0    15     5    41    18    38    16
human-target-types    76    31    15     1     8     7    52    20    50    22
target-nationality    23     0     0     0     0     0    23     0   100     0
instrument-types      17     0     0     0     0     0    17     0   100     0
incident-location     95    32     8    12    12     0    63    15    44     0
phys-effects          29     0     0     0     0     0    29     0   100     0
human-effects         29    30     2     2     6    20    19    10    10    67
GRAND TOTAL         1161   580   216    51   145   168   749    21    42    29

Table 2.0: TST1-MUC3 after Phase I development

SLOT                 POS   ACT   COR   PAR   INC   SPU   MIS   REC   PRE   OVG
template-id           91    85    27     0     0    58    64    30    32    68
incident-date         88    25    12     9     4     0    63    19    66     0
incident-type         91    27    18     0     9     0    64    20    67     0
category              62    27    10     0     7    10    45    16    37    37
indiv-perps           82    14     4     0     5     5    73     5    28    36
org-perps             57     5     3     0     0     2    54     5    60    40
perp-confidence       94    27    14     3     6     4    71    16    57    15
phys-target-ids       52     8     4     0     0     4    48     8    50    50
phys-target-num       40    17     2     0     3    12    35     5    12    70
phys-target-types     47     5     4     0     0     1    43     8    80    20
human-target-ids      89     2     1     0     1     0    87     1    50     0
human-target-num      64    12     1     0     9     2    54     2     8    17
human-target-types    72     2     0     1     1     0    70     1    25     0
target-nationality    23     2     0     0     2     0    21     0     0     0
instrument-types      17     0     0     0     0     0    17     0     *
</Paragraph>
  </Section>
  <Section position="4" start_page="69" end_page="69" type="metho">
    <SectionTitle>
JUSTIFICATION AND ANALYSIS OF SCORE S
</SectionTitle>
    <Paragraph position="0"> Although the scores stated above may seem rather discouraging or low, there are several valid justifications for them. The sections which follow explain each of these justifications.</Paragraph>
    <Paragraph position="1">  Phase I scores left GTE with somewhat artificial results for recall and precision. Several slot fillers were the direct result of system defaults. This filled many slots with correct fillers, but for the wrong reasons, a phenomenon which Grishman calls "uncoupling input and output." For example, during Phase I scoring, the MUC-TIA system defaulted the template slot Perpetrator: Confidence to the set-list filler "REPORTED AS FACT"; however, no real analysis was performed. Since "REPORTED AS FACT" was the most frequent correct slot filler, the score was artificially inflated.</Paragraph>
    <Paragraph position="2"> Backend Translation The MUC-TIA System's internal semantic representation of a parse consists of realizations of structured concepts. Structured concepts are frame-like knowledge representations which maintain slot fillers. During the semantic parsing process, structured concepts are realized (essentially instantiated) by slot fillers such as simple text strings, or by more complex fillers such as demons, which are spawned. For example, an event such as a bombing instantiates a structured concept bombing-p with slots for actor (who performed the bombing), theme (what was bombed), location (where the bombing took place), etc. These realized structured concepts in turn represent the parsed message and maintain the extracted data. A backend translation process then maps and normalizes the data maintained in the structured concepts and places it in the appropriate templates.</Paragraph>
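The frame-like structured concepts described above can be sketched roughly as follows. The class and slot names are hypothetical; TIA's actual representation is Lisp-based and is not shown in this paper.

```python
# Minimal sketch of a frame-like "structured concept" with named slots.
# Realization (instantiation) fills slots as the semantic parse proceeds.
# All names here are illustrative, not TIA's actual implementation.

class StructuredConcept:
    """A frame with named slots; realize() fills slots during parsing."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = {s: None for s in slots}   # unfilled slots start empty

    def realize(self, **fillers):
        """Instantiate the concept by filling whichever slots were recognized."""
        for slot, value in fillers.items():
            if slot not in self.slots:
                raise KeyError(f"{self.name} has no slot {slot!r}")
            self.slots[slot] = value
        return self

# A bombing event instantiates bombing-p with actor/theme/location slots;
# slots not found in the message simply remain unfilled.
bombing = StructuredConcept("bombing-p", ["actor", "theme", "location"])
bombing.realize(theme="EMBASSY", location="SAN SALVADOR")
print(bombing.slots["theme"])   # EMBASSY
```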
    <Paragraph position="3"> This domain-specific back-end translation module has not been fully tested and/or implemented. Many formatting issues still need to be resolved. Moreover, template merging techniques/heuristics are still being tested to determine optimal methods. Additionally, complete slot cross-referencing has not been completed and fully tested. As a result, many incorrect and partial matches occurred during the scoring process, with a detrimental effect on GTE's scores. Although the correct data was extracted from the message and maintained in the system's internal representation, i.e., structured concepts, the actual template slot was filled incorrectly due to the back-end translation process. For example, message TST2-MUC3-0034's correct HUMAN TARGET: TYPE slot filler is POLITICAL FIGURE: "JECAR NEGHME"; however, TIA's response template indicates "SPOKESMAN": "". After further review of TIA's internal representation of the message, a murder-p structured concept was properly instantiated with "JECAR NEGHME", a SPOKESMAN for the MIR, thereby properly identifying the appropriate human target.</Paragraph>
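The kind of translation failure described above can be illustrated with a toy back-end translation step that normalizes a concept's surface description to a template set fill. The mapping table and function names are invented for illustration; they are not GTE's actual translation rules.

```python
# Toy back-end translation step: map slot fillers from a realized structured
# concept into a MUC-3 template slot, normalizing a free-text description to
# the template's finite set fills. Table contents are illustrative only.

# Normalization from surface descriptions to template set fills. Omitting an
# entry here is exactly the kind of gap that yields a wrong template fill
# (e.g. emitting "SPOKESMAN" instead of "POLITICAL FIGURE").
HUMAN_TARGET_TYPES = {
    "SPOKESMAN": "POLITICAL FIGURE",
    "MAYOR": "GOVERNMENT OFFICIAL",
}

def translate_human_target(concept_slots):
    """Produce the HUMAN TARGET: TYPE template fill from concept slots."""
    desc = concept_slots.get("target-description")
    name = concept_slots.get("target-name", "")
    set_fill = HUMAN_TARGET_TYPES.get(desc, desc)   # normalize, else pass through
    return f'{set_fill}: "{name}"'

slots = {"target-description": "SPOKESMAN", "target-name": "JECAR NEGHME"}
print(translate_human_target(slots))   # POLITICAL FIGURE: "JECAR NEGHME"
```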
    <Paragraph position="4"> GTE has identified these "data extraction" problems with the back-end translator and recommends that this module be rewritten.</Paragraph>
    <Paragraph position="5"> New Semantics Partially Implemented During Phase II development, several new semantic ideas were implemented which were not fully tested. For instance, to assist in filling the PERPETRATOR: CONFIDENCE slot, a "mode-p" prediction prototype [1] was defined which maintains two slots: By-Whom-S and Insert-Mode-S. The By-Whom-S slot is filled by the authoritative figure found in the last act (this prediction is defined in the mode-p prediction prototype's control structure). The Insert-Mode-S slot's purpose is to inhibit the generation of a new template. For example, message TST2-MUC3-0011 states "The chief of the armed forces joint chiefs of staff have categorically denied that there are any rifts between Salvadoran army officers and U.S. military, as asserted by the Washington Post." Normally, the word rifts spawns a realization of an attack template; however, the phrase "denied that" inhibited the attack template. This experimental mechanism has not been fully tested.</Paragraph>
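The inhibition behavior described above can be sketched as a simple denial-context check: a denial marker seen in the sentence suppresses the template that a trigger word would otherwise spawn. The marker and trigger lists below are illustrative, not TIA's actual prediction machinery.

```python
# Sketch of the template-inhibition idea behind the "mode-p" prediction:
# a denial context suppresses the template a trigger word would normally
# spawn. All marker/trigger lists here are invented for illustration.

DENIAL_MARKERS = {"denied that", "denies that", "no evidence that"}
TRIGGERS = {"rifts": "attack", "bombing": "bombing", "kidnapped": "kidnapping"}

def spawn_templates(sentence):
    """Return template types triggered by the sentence, honoring denials."""
    text = sentence.lower()
    inhibited = any(marker in text for marker in DENIAL_MARKERS)
    found = [tpl for word, tpl in TRIGGERS.items() if word in text]
    return [] if inhibited else found

print(spawn_templates("The chiefs categorically denied that there are any rifts."))
# []  (denial context inhibits the attack template)
print(spawn_templates("There are rifts between army officers."))
# ['attack']
```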
    <Paragraph position="6"> Time of Domain-Specific System Development During the MUC-3 development period, several lexicon tools were implemented which facilitate development for new domains, e.g., terrorism, drug interdiction, third world launches, etc. These semi-automatic tools allow the lexicon developer to browse the message corpus and define lexical entries through a series of menus3. Additionally, sorting utilities were developed which operate on the automatically defined lexical entries. These tools are imperative for training any natural language processing system on a new domain, and they have greatly increased the lexicon developers' productivity while reducing debugging time. Since the majority of MUC-3's development time was devoted to tool implementation, a minimal amount of MUC-3 domain-specific system development was performed, which is reflected in GTE's scores.</Paragraph>
  </Section>
  <Section position="5" start_page="69" end_page="69" type="metho">
    <SectionTitle>
SYSTEM DEVELOPMENT EFFORT
</SectionTitle>
    <Paragraph position="0"> The majority of the MUC-3 system development effort involved lexicon development issues (discussed below).</Paragraph>
    <Paragraph position="1"> The construction of lexicon development tools and macros absorbed the majority of the development time.</Paragraph>
    <Paragraph position="2"> Approximately 200 of the 360 hours of system development were devoted to these tasks. The balance, approximately 160 hours, was devoted to actual MUC-3 task-specific system development. As a result, GTE's scores were adversely affected.</Paragraph>
  </Section>
  <Section position="6" start_page="69" end_page="72" type="metho">
    <SectionTitle>
LIMITING FACTORS
</SectionTitle>
    <Paragraph position="0"> The following sections describe some of the limiting factors and problems which GTE had to overcome in order to participate in the MUC-3 Project.</Paragraph>
    <Paragraph position="1">  3 This menu approach will be modified, and a human-machine interface using X11 and Motif will be implemented for the lexicon development tools in the near future.</Paragraph>
    <Paragraph position="2">  GTE devoted two software engineers to the MUC-3 Project for varied amounts of time. One software engineer (employed by GTE for six years) worked on the original TIA system, first established in 1985. During Phase II development, he devoted approximately 80 hours to MUC-3 domain-specific tasks. The other software engineer (employed by GTE for approximately one year) devoted approximately 280 hours to lexicon tool development, system administration (Sun 4/490 Sparc Server), MUC-3 domain-specific system development, and scoring and interpreting results. As a result, GTE was not able to dedicate the desired time to MUC-3 (domain-specific) system development.</Paragraph>
    <Section position="1" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
Syntactic Parser's Combinatorial Explosion Problem
</SectionTitle>
      <Paragraph position="0"> A second limiting factor which arose, and was eventually solved, was the syntactic parser's combinatorial explosion problem. This problem occurred due to the top-down exhaustive nature of the parser. The problem originally became apparent when several non-terminal syntactic constituents, e.g., regions and organizations, became extremely large and unwieldy. Since the parser expands non-terminals in a uniform, non-heuristic manner, all applicable grammar rules are fired, even rules which are not viable. For example, suppose two rules present in the syntactic grammar are of the form:  Since the string's parse fails at the non-terminal &lt;Region&gt; in the first &lt;Name-Position&gt; rule (because president cannot be a &lt;Region&gt;), the parse should not be permitted to try parsing using the second option of &lt;Name-Position&gt;. When the number of expansions for a single nonterminal is "small", this issue is not problematic. However, as the number of expansions becomes "large", the inefficiency degrades the parser dramatically.</Paragraph>
      <Paragraph position="1"> The problem was solved by establishing/marking the set of non-terminals which may contain a large number of expansions, and maintaining failed parse states within the current phrase's parse. If the current phrase being parsed is in a state which has failed at some prior time and the current nonterminal being expanded is "large", the system does not try to expand the current nonterminal using the current rule. This pruning of the search space does not alter the language recognized, i.e., all previously parsable constructs are still viable and are parsed appropriately. This solution produced dramatic results for several parses. Prior to this optimization, a sample parse of a phrase containing approximately three words which yield "large" nonterminals took the MUC-TIA system approximately 145 CPU seconds to run. After the optimization was implemented, the same phrase took approximately 0.4 CPU seconds, obviously a worthwhile improvement.</Paragraph>
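The pruning scheme above can be sketched as a memo of failed (position, nonterminal) states, consulted only for nonterminals marked "large" and reset for each phrase. The grammar and names below are illustrative, not TIA's actual grammar.

```python
# Sketch of the failed-state pruning described above: nonterminals marked
# "large" memoize (position, nonterminal) failures within the current
# phrase's parse, so a later rule never retries a known dead-end expansion.
# Grammar and names are illustrative, not TIA's actual implementation.

GRAMMAR = {
    # Both <Name-Position> options begin with the expensive <Region>.
    "Name-Position": [["Region", "Title"], ["Region", "Name"]],
    "Region": [["bolivar"], ["sucre"]],
    "Title": [["mayor"]],
    "Name": [["castellar"]],
}
LARGE = {"Region"}      # nonterminals with many expansions
failed = set()          # (position, nonterminal) dead ends, per phrase
expansions = 0          # how many times <Region> was genuinely expanded

def parse(symbol, tokens, pos):
    """Top-down exhaustive parse; returns position after `symbol` or None."""
    global expansions
    if symbol not in GRAMMAR:                     # terminal symbol
        return pos + 1 if pos < len(tokens) and tokens[pos] == symbol else None
    if symbol in LARGE:
        if (pos, symbol) in failed:
            return None                           # pruned: known dead end
        expansions += 1
    for rule in GRAMMAR[symbol]:
        p = pos
        for sub in rule:
            p = parse(sub, tokens, p)
            if p is None:
                break
        else:
            return p
    if symbol in LARGE:
        failed.add((pos, symbol))                 # remember the failure
    return None

def parse_phrase(tokens):
    """Parse one phrase, resetting the per-phrase failure memo."""
    global expansions
    failed.clear()
    expansions = 0
    return parse("Name-Position", tokens, 0)

# "president" cannot be a <Region>: the first rule fails, and the memo stops
# the second rule from re-expanding <Region> at the same position.
print(parse_phrase(["president"]), expansions)    # None 1
```

Without the memo, `<Region>` would be expanded once per rule that mentions it; with it, each (position, nonterminal) pair is attempted at most once per phrase, which is where the reported 145-second to 0.4-second improvement comes from.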
    </Section>
  </Section>
  <Section position="7" start_page="72" end_page="72" type="metho">
    <SectionTitle>
TRAINING
</SectionTitle>
    <Paragraph position="0"> As previously discussed, one MUC-TIA training task consisted of automating the process of lexicon development. GTE has developed two tools and several domain-specific macros to train the system, each discussed below in more detail.</Paragraph>
    <Paragraph position="1"> Lexicon Learner and Sorter Tools The lexicon learner tool/utility automates the process of entering unknown (essentially undefined) words and/or phrases into the appropriate syntactic lexicon with the appropriate syntactic and semantic features. Consider the following excerpt from one of the dev-muc3 messages.</Paragraph>
    <Paragraph position="2"> &amp;quot;Ricardo Alfonso Castellar, Mayor of (Achi .UNKNOWN), in the Norther n Department of Bolivar, who was kidnapped on 5 January, apparently by Army 7 3 of National Liberations (ELN) guerillas, was found (slaughtered .UNKNOWN) today, according to authorities. &amp;quot; When the lexicon learner encounters the unknown lexical entry &amp;quot;Achi&amp;quot;, the system prompts for the appropriat e syntactic and semantic information necessary to sufficiently define the lexical entry as shown below. The city Achi is defined by a Def-Region macro which maintains fields for grammar, syntax, part-of, and type . The grammar field is initialized to mu c 3 (the grammar for the MUC-3 project), s y n t ax (specifies the list of possible articulations for the lexical entry) is set to the list consisting of one element, (a chi) , part -o f (specifies the region's hierarchical constituents) is set to bo1 iva r, and the type field (specifies the region's demography, e.g., village, city, state, country, continent, etc .) is set to city.</Paragraph>
    <Paragraph position="3">  These two lexical entries are then appended to the appropriate lexicon file and are compiled into the MUC-TIA system during the next Make of the parser.</Paragraph>
    <Paragraph position="4"> The lexicon learner was run on approximately 750 dev-muc3 messages over a period of approximately one month. In that time, the MUC-TIA system lexicon grew from approximately 2000 lexical entries to over 25,000 lexical entries. During training development, all 1200 dev-muc3 messages could have been run through the lexicon learner; however, due to time constraints and the diminishing return of new words per unit of training time, GTE software engineers elected not to continue with the lexicon learning.</Paragraph>
    <Paragraph position="5"> Actual system development took place on approximately 3 dev-muc3 messages. Once again, this training statistic is due to time and budgetary constraints. GTE plans to continue development in this domain in expectation of participating in MUC-4 next year.</Paragraph>
    <Section position="1" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
Domain Specific Macros
</SectionTitle>
      <Paragraph position="0"> The second type of lexicon development facility implemented for the MUC-3 project was a series of specialized macros which facilitate the definition of regions, events, people, organizations, terrorist groups, last names, etc. Each macro performs its unique job by establishing the grammar, syntax, and several macro-dependent specialized fields. For example, def-region maintains part-of, type, and predicts fields. Moreover, the lexical entry "slaughtered" defined above establishes a predicts field of murder-p. This predicts field may trigger an instantiation (not necessarily a realization) of the murder-p structured concept.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="72" end_page="74" type="metho">
    <SectionTitle>
REUSABILIT Y
</SectionTitle>
    <Paragraph position="0"> The majority of the MUC-TIA System may be reusable for other terrorism domains; however, should an entirely new domain be needed (such as third world launches or aircraft tracking), approximately 75% of the lexicon would need to be replaced. This task is not as insurmountable as it once was (pre-MUC-3), thanks to the lexicon tools developed during the MUC-3 project.</Paragraph>
    <Paragraph position="1"> Additionally, since the back-end translator is very domain specific, a rewrite would be necessary to adapt it to a new domain's template structure.</Paragraph>
  </Section>
</Paper>