UNIVERSITY OF SHEFFIELD: DESCRIPTION OF THE
LaSIE-II SYSTEM AS USED FOR MUC-7
K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell,
H. Cunningham, Y. Wilks
1
Department of Computer Science
University of Sheffield
Regent Court, Portobello Road
Sheffield S1 4DP UK
{kwh,robertg,saliha,huyck,brianm,hamish,yorick}@dcs.shef.ac.uk
INTRODUCTION
The University of Sheffield NLP group took part in MUC-7 using the LaSIE-II system, an evolution of
the LaSIE (Large Scale Information Extraction) system first created for participation in MUC-6 [9] and part
of a larger research effort into information extraction underway in our group. LaSIE-II was used to carry
out all five of the MUC-7 tasks and was, in fact, the only system to take part in all of the MUC-7 tasks.
While LaSIE-II is significantly different from the earlier version (differences are detailed below), there are
no radical changes in the basic philosophy of the approach. This could be described as seeking a pragmatic
middle way in the shallow vs deep analysis debate which has characterised the last several MUCs. That
is, while aware that information extraction tasks may not require full text understanding, and hence that
systems should be optimised to make use of shallow techniques where appropriate, we have not wanted to
preclude the application of arbitrarily sophisticated linguistic analysis techniques where these may prove
useful. The result is an eclectic mixture of techniques including finite state recognition of domain-specific
lexical patterns, partial parsing using a restricted context-free grammar, simplified semantic representation
of each sentence in the text and a formal representation of the whole discourse from which all of the IE task
results and the coreference task results are derived. From our perspective, LaSIE-II should not be viewed
as the expression of a theory about how to do IE, but as a laboratory in which ongoing experiments with
different component NL processing techniques, and most importantly, their interaction are being carried
out. Seen this way, one of the most important developments in LaSIE-II is its modularised architecture and
integration into the GATE platform (see below), which has enabled us to gain much deeper insights into
the strengths and weaknesses of components of the system and the ways in which these interact.
OVERVIEW
LaSIE-II is a highly modularised system, made up of 9 TIPSTER-compliant modules, pictured in Figure
1 as executed interactively through the GATE Graphical Interface. The system is essentially a pipeline of
modules each of which processes the entire text before the next is invoked. The following is a brief description
of each of the component modules in the system:
Tokenizer Identifies token boundaries (as byte offsets into the text) and text section boundaries (text
header, text body and any zones to be excluded from processing).
1 Thanks for additional contributions from Sandy Robertson, Andrea Setzer, George Demetriou, Malcolm Crawford
({sandyr, andrea, demetri, malc}@dcs.shef.ac.uk) and Mette Nelson (mln.id@cbs.dk) from the Copenhagen Business School.
Figure 1: LaSIE-II Architecture
Gazetteer Lookup Looks for single and multi-word matches in multiple domain specific full name (locations,
organisations, etc.) and keyword (company designators, person first names, etc.) lists, and tags
matching phrases with appropriate name categories.
Sentence Splitter Identifies sentence boundaries in the text body.
Brill Tagger [4] Assigns one of the 48 Penn TreeBank part-of-speech tags to each token in the text.
Tagged Morph Simple morphological analysis to identify the root form and inflectional suffix for tokens
which have been tagged as noun, verb, or adjective.
buchart Parser Does two pass bottom-up chart parsing, pass one with a special named entity grammar,
and pass two with a general phrasal grammar. A `best parse' is then selected, which may be only a
partial parse, and a predicate-argument representation, or quasi-logical form (QLF), of each sentence
is constructed compositionally.
Name Matcher Matches variants of named entities across the text.
Discourse Interpreter Adds the QLF representation to a semantic net, which encodes the system's domain
model as a hierarchy of concepts. Additional information presupposed by the input is also added to the
model, then coreference resolution is performed between new and old instances, and finally information
consequent upon the input is added, producing an updated discourse model.
Template Writer Writes out the ST, TR, and TE results by traversing the discourse model and extracting
the required information.
NE and CO results are generated following the Discourse Interpreter by a generic SGML dump utility.
LaSIE-II CHANGES
Rather than duplicating much of the description of the LaSIE system as used for MUC-6 [9], the following
sections describe the major changes between LaSIE, referred to in the following as LaSIE-I, and LaSIE-II.
GATE/TIPSTER Architecture
The LaSIE-II system was developed using GATE, a General Architecture for Text Engineering [5],
Sheffield's implementation of the TIPSTER architecture specification [11]. GATE manages all the information
about the texts that is produced by each module, and provides graphical tools for visualising that
information, selecting control flow through different module combinations and running the IE system over sets of
texts. A major strength of the architecture is that it encourages reuse by insulating the various modules
from each other by means of a common data management substrate, a "document manager" in TIPSTER
terminology. It also enables reuse of visualisation code: modules producing similar sorts of information (e.g.
a PoS tagger and a named entity parser) share the same graphical viewing tool. GATE provides a
convenient GUI-based environment within which to develop diverse modules, unconstrained by implementation
language (LaSIE-II is made up of C, Perl and Prolog modules), with the architecture taking care of the
common engineering tasks (e.g. data storage) that are uninteresting from a language processing point of view.
Lastly, GATE provides a command-line interface for batch processing, enabling us to run a nightly build,
run, score, and report process as we developed the LaSIE-II system.
GATE is now in a one-and-a-half release (1.5) that adds Java support, SGML I/O, a manual annotation
tool, an annotation comparison tool and improved support for managing collections of documents. This
release is a half-way house between version 1, which was C++-based, and version 2, which will be
Java-based. Java modules that run under version 1.5 will run unchanged under version 2, while still accessing
all the modules and facilities available under version 1. SGML facilities are much improved, with input
via the University of Edinburgh's LT NSL toolkit [15] that uses the Sp parser. The manual annotation
tool allows hand-coding of annotations to use as training or test data, and the comparison tool provides
a straightforward way to score one annotation set against another. In cases where a dedicated scorer is
available, like MUC or Parseval [12], this can be integrated as a module in GATE, as shown in Figure
1 with a scorer module for each MUC task. The output of the MUC scorers can also be read into the
GATE database, allowing keys and errors to be displayed using the existing viewers, as shown for the CO
task scorer in Figure 2. Wrapper code for these and other modules is available from our ftp site (see
www.dcs.shef.ac.uk/research/groups/nlp/gate/ for more details, and to download GATE, which is
freely available for research purposes, and comes bundled with a version of our MUC-6 extraction system).
Lexical Preprocessing
With the exception of the Gazetteer Lookup stage, modules up to the buchart Parser are relatively
unchanged from LaSIE-I. Minor changes were required to make use of the structure of the MUC-7 NYT
texts, and to classify SGML and other special symbols.
The most noticeable change to the Gazetteer Lookup module is its change in position: from immediately
before the Parser in LaSIE-I to immediately after the Tokenizer in LaSIE-II. This change was made mainly
to improve the accuracy of the Sentence Splitter module, which could previously propose incorrect sentence
boundaries within known NEs, for example "3 p.m. EST", "St. Louis", etc. The Splitter has been modified
slightly to treat gazetteer matches as units in the same way as tokens.
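The effect of treating gazetteer matches as units can be illustrated with a minimal sketch. It is not the LaSIE-II Splitter: the function name, the token list, and the rule that any token ending in ".", "!" or "?" is a candidate boundary are all assumptions made for illustration. The only idea taken from the text is that no boundary may be proposed inside a gazetteer match.

```python
def split_sentences(tokens, gazetteer_spans):
    """Split a token list into sentences, never breaking inside a gazetteer match.

    gazetteer_spans holds (start, end) token index pairs, end exclusive.
    The boundary test is a deliberately naive stand-in for a real splitter.
    """
    # Indices after which no boundary may be proposed: every position
    # inside a gazetteer match except its final token.
    blocked = set()
    for start, end in gazetteer_spans:
        blocked.update(range(start, end - 1))

    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith((".", "!", "?")) and i not in blocked:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = "He arrived at 3 p.m. EST . She left .".split()
with_lookup = split_sentences(tokens, [(3, 6)])   # "3 p.m. EST" is one unit
without_lookup = split_sentences(tokens, [])      # "p.m." wrongly ends a sentence
```

With the match supplied, the sketch yields two sentences; without it, the abbreviation produces a spurious third.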
The move involved decoupling the Lookup stage from the PoS Tagger. Originally only tokens with
particular tags (nouns, adjectives, determiners, conjunctions, numerals, symbols) were matched against the
gazetteer lists, but this restriction has been removed. The Lookup stage now attempts to match all tokens,
and therefore no longer suffers from tagging errors. However, this does introduce a few spurious matches,
particularly with capitalised words in sentence initial position, for example the (Swedish) person first names
"Are" and "My". An additional filtering stage in the Tagger module handles these cases, removing any
gazetteer matches for sentence initial tokens not tagged as nouns or adjectives. This filtering stage also
attempts to correct some of the tagger's frequent mistagging of capitalised common nouns as proper nouns
in document headers, by reference to a list of common English nouns (also used in the Splitter module).
Figure 2: MUC CO Scorer output viewers, showing spurious coreferences during debugging
The Lookup stage has also been modified to allow much simpler integration of new lists of names, each
defining a distinct semantic category. A top-level configuration file specifies a set of plain text lists and type
and subtype values (e.g. organization:company) to be assigned to matches in each list. The module can now
be switched between domains simply by specifying alternative configuration files. For MUC-7 we used 55
lists comprising 23,000 entries in total.
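A configuration-driven lookup of this kind might be sketched as follows. The configuration syntax (one "listfile type:subtype" pair per line), the file names, the list entries, and the greedy longest-match strategy are all assumptions for illustration; the paper does not give the actual file format or matching algorithm.

```python
def parse_config(text):
    """Parse lines of the hypothetical form 'listfile type:subtype'."""
    lists = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        filename, category = line.split()
        sem_type, _, subtype = category.partition(":")
        lists[filename] = (sem_type, subtype or None)
    return lists


def build_gazetteer(config, list_contents):
    """Map each entry (as a tuple of tokens) to its (type, subtype) pair."""
    gazetteer = {}
    for filename, category in config.items():
        for entry in list_contents[filename]:
            gazetteer[tuple(entry.split())] = category
    return gazetteer


def lookup(tokens, gazetteer):
    """Greedy longest-match lookup; returns (start, end, type, subtype) spans."""
    max_len = max((len(key) for key in gazetteer), default=0)
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                matches.append((i, i + n) + gazetteer[key])
                i += n
                break
        else:
            i += 1   # no entry starts at this token
    return matches


config = parse_config("company.lst organization:company\ncity.lst location:city\n")
gazetteer = build_gazetteer(
    config,
    {"company.lst": ["Hughes Aircraft"], "city.lst": ["St. Louis"]},
)
matches = lookup("Hughes Aircraft moved to St. Louis".split(), gazetteer)
```

Switching domain then amounts to passing a different configuration text and set of lists, with no change to the lookup code.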
A further improvement to the Lookup module is a reduction in its case sensitivity. While initial
experiments with complete case insensitivity in matching against the lists produced too many spurious matches,
some reduction in sensitivity proved useful. In particular, sequences of all-uppercase tokens in the input are
now matched in the lists in both their original form and also with each token converted to a form where only
the initial character is uppercase. This significantly improves matches in the NYT headers.
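The two forms tried for an all-uppercase sequence can be sketched in a few lines. The function name is hypothetical, and the test for "all-uppercase" uses Python's `str.isupper`, which is an assumption about how such tokens would be detected (it rejects tokens with no cased characters, such as pure numbers).

```python
def case_variants(tokens):
    """Return the token sequences to try against the gazetteer lists.

    The original form is always tried. If every token is uppercase,
    an initial-capital form of the whole sequence is tried as well.
    """
    variants = [tuple(tokens)]
    if tokens and all(tok.isupper() for tok in tokens):
        variants.append(tuple(tok[0] + tok[1:].lower() for tok in tokens))
    return variants


header_variants = case_variants(["LATIN", "AMERICA"])
body_variants = case_variants(["Latin", "America"])
```

Mixed-case input is left alone, which keeps the matching case-sensitive everywhere except in fully capitalised runs such as headers.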
Parsing
As in LaSIE-I, parsing is still carried out in two stages, each stage using the same parsing mechanism
but a different grammar: first a specialist NE grammar, then a general phrasal grammar. The parser
itself is largely unchanged from MUC-6: a bottom-up chart parser written in Prolog which processes
context-free grammar rules with associated feature structures expressed as Prolog terms. A complete chart
is generated from which a single parse, quite possibly partial (i.e., a gapped sequence of phrasal subtrees),
is selected using a `best parse selection' heuristic when parsing ceases. Semantic interpretations in the form
of predicate-argument representations are built up compositionally from phrasal constituents during parsing
and the semantic interpretation of the #0Cnal selected analysis becomes the output of the parser and gets
passed on to the discourse interpreter.
The main changes introduced for MUC-7 are a significantly enhanced grammar development environment,
and a completely rewritten and extended grammar which now reflects a substantially different philosophy.
Figure 3: buchart Parser submodules
Grammar Development The modular facilities of GATE have been exploited to further compartmentalise
the grammars into a total of 17 subgrammars which during development can be individually executed
through the GATE graphical interface (see Figure 3). After each grammar is run, its results may be viewed
using a tree viewer, and then, if changes are required, the grammar may be edited and rerun without leaving
GATE for any recompilation process.
The first ten grammars shown in Figure 3 comprise the NE grammar and when the system is run in
production mode the rules from these ten grammars are compiled into a single grammar for use in the first
pass of the parser. The net effect is the same as running them separately since the ordering of the rules in
the compiled single grammar is the same as in the cascaded development version and the best parse selection
heuristic will, other things being equal, select the last of co-spanning analyses. The same holds for the next
seven subgrammars which form the phrasal grammar and are compiled into a single grammar for the second
pass of the parser in production mode.
Division of the grammar into smaller specialist chunks together with the new graphical tools had the
expected benefits of allowing more rapid development and verification of subgrammars, and supported
concurrent grammar development by different persons working on different subgrammars.
The Grammars The ten NE grammars consist of approximately 400 hand-coded rules that make use
of part of speech tags, semantic tags added in the gazetteer lookup stage, and if necessary lexical items
themselves. While significantly rewritten since MUC-6, the basic philosophy here is the same (see [17] for
details): patterns are detected in the texts and manually added to the grammar. The enhancements to the
grammar development cycle described above have eased and speeded this process, but otherwise there is
little change.
The phrasal grammars have been completely rewritten and compartmentalised, but there has also been a
significant change in the grammar acquisition process. For MUC-6 the grammar was obtained by extracting
the context-free grammar rules implicit in the bracketing of the Penn Tree Bank [14] and selecting a subset
of them by thresholding on frequency [6]. Features to enable semantic interpretation were then added by
hand to the extracted syntactic rules.
While this technique allowed us to obtain a grammar quickly, the resulting analyses were poor, and
the effort of manually annotating the rules with features for semantic interpretation substantially reduced
the benefits of the `grab-and-run' approach to grammar acquisition. Given that so much handcrafting was
going into the grammar, it seemed that we might as well get the benefit of more carefully handcrafting the
syntactic aspects of the rules too. So, we have effectively rewritten the grammar from the ground up, using
a combination of general principles [16] and iterative refinement using the MUC-7 training data. The result
is a phrasal grammar of about 150 rules.
The best parse selection heuristic chooses a single (possibly partial) analysis based on selecting a shortest
sequence of maximally spanning non-overlapping edges which are of a semantically interpretable category
(NP, VP, PP, S, and RelC). Where there are several equivalent alternatives, the last one generated is
selected. This approach eliminates any ambiguity detected in parsing and ensures that a single analysis is
passed on to the discourse interpreter. However, since our grammar does not rely on lexical semantic or
syntactic constraints to any great extent (e.g. we do not use lexical selectional restrictions or subcategorisation
information in parsing) it is very weak. As a consequence, we adopt a very conservative approach with
regard to phenomena such as attachment of complements, prepositional phrases and relative clauses, and
also apposition and co-ordination. We followed a philosophy of only adding those rules which were (almost)
certain never to generate errors in analysis, i.e. a high precision, possibly low recall, approach to grammatical
analysis. In theory, when grammatical relations such as complements (e.g. subjects and objects) are missed,
they are meant to be added during discourse interpretation, where lexical-semantic information is available,
and to some extent this happens (see below), though we have not developed this part of the system as
much as we would like. Exploring the boundaries between syntactic analysis, lexical-semantic analysis, and
discourse analysis is very much part of ongoing work in LaSIE-II.
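One possible reading of the best parse selection heuristic is sketched below as a small dynamic program over chart edges. It is an interpretation, not the buchart implementation (which is in Prolog). The sketch assumes that "maximally spanning" means covering as many tokens as possible, that "shortest" means using the fewest edges among equally covering sequences, and that remaining ties go to the edge generated later. The edge format and function name are hypothetical.

```python
INTERPRETABLE = {"NP", "VP", "PP", "S", "RelC"}


def best_parse(edges, n_tokens):
    """Select a possibly gapped sequence of non-overlapping chart edges.

    edges: (start, end, category) triples in the order the parser
    generated them, with token positions 0 <= start < end <= n_tokens.
    """
    # best[i] = (tokens covered, -edges used, chosen edges) for tokens[0:i]
    best = [(0, 0, [])]
    for i in range(1, n_tokens + 1):
        candidate = best[i - 1]          # leave token i-1 uncovered (a gap)
        for start, end, category in edges:
            if end != i or category not in INTERPRETABLE:
                continue
            covered, neg_count, chosen = best[start]
            option = (covered + end - start, neg_count - 1,
                      chosen + [(start, end, category)])
            # ">=" lets an edge generated later win an exact tie
            if option[:2] >= candidate[:2]:
                candidate = option
        best.append(candidate)
    return best[n_tokens][2]


chart = [(0, 2, "NP"), (3, 5, "NP"), (2, 5, "VP"), (2, 3, "V")]
selected = best_parse(chart, 5)
tie_broken = best_parse([(0, 2, "NP"), (0, 2, "S")], 2)
```

In the first example the NP followed by the VP covers all five tokens, so it is preferred over two NPs with a gap. In the second, two co-spanning edges tie and the later one is kept.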
The net effect of using a cascade of grammars, each of which aims to identify a `chunk' of a particular
category and is conservative with respect to attachment, is something very like the finite state models that
have been advocated by other MUC participants over the past few years [13,10], as well as others in the
language engineering community [1]. In fact, we believe our grammars are now regular and that our chart
parser could be replaced by a finite state parser, with a substantial increase in speed. We have not done
so to date due to restricted development resources and because the current grammar development/parsing
environment is quite habitable.
Discourse Interpretation
Apart from some gazetteer lists, and the corresponding grammar rules, all domain specific knowledge in
the system is concentrated in the Domain Model of the Discourse Interpreter. As in LaSIE-I, this model
is expressed using a semantic net whose nodes represent `concepts' (classes or instances), with associated
attribute-value structures recording properties and relations of the concept, and whose arcs model a concept
hierarchy and support property inheritance (see [7] for further details).
The initial domain model for the MUC-7 ST task was constructed directly from the template definition. A
launch event concept node was added, together with vehicle and payload nodes, each with a subhierarchy
listing the possible vehicle and payload types specified in the template definition:
entity(X)     ==> object(X) v event(X) v property(X).
object(X)     ==> artifact(X) v ...
event(X)      ==> launch_event(X) v ...
artifact(X)   ==> vehicle(X) v
                  payload(X).
vehicle(X)    ==> spacecraft(X) v
                  aircraft(X) v
                  ground_vehicle(X) v
                  water_vehicle(X).
spacecraft(X) ==> shuttle(X) v
                  rocket(X).
payload(X)    ==> satellite(X) v
                  missile(X) v
                  space_probe(X) v
                  material(X) v
                  personnel(X).
Property types were also added for each template slot, launch date, vehicle owner, etc., and the
Template Writer module was modified to read off and write out instances of the required types with the
required properties.
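To make the hierarchy listed above concrete, the following sketch encodes it as a Python dictionary and derives the ancestor chain used for inheritance. It is only an illustration of the data structure: the actual Domain Model is a semantic net implemented in Prolog, the names `CHILDREN`, `PARENT`, `ancestors` and `is_a` are hypothetical, and the alternatives elided with "..." in the listing are simply omitted.

```python
# Direct subclasses, transcribed from the rules above (elided "..." parts left out).
CHILDREN = {
    "entity": ["object", "event", "property"],
    "object": ["artifact"],
    "event": ["launch_event"],
    "artifact": ["vehicle", "payload"],
    "vehicle": ["spacecraft", "aircraft", "ground_vehicle", "water_vehicle"],
    "spacecraft": ["shuttle", "rocket"],
    "payload": ["satellite", "missile", "space_probe", "material", "personnel"],
}

# Invert the map so each concept points at its single parent.
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def ancestors(concept):
    """Return the chain of parent classes, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept, cls):
    """True if concept is cls or lies below cls in the hierarchy."""
    return concept == cls or cls in ancestors(concept)


rocket_chain = ancestors("rocket")
```

A property attached to `vehicle` would then be visible to any instance whose class satisfies `is_a(cls, "vehicle")`, which is the sense in which the arcs support property inheritance.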
Consequence properties (see below) were then added to hypothesise instances for each slot of a template
entity, given the appropriate textual trigger. For example, an instance of a launch event in a text causes
the hypothesis of a vehicle, a payload, a date and a launch site related to the event; an instance of a
vehicle causes the hypothesis of an owner and a manufacturer, etc. The Discourse Interpreter's general
coreference mechanism is then used to attempt to resolve hypothesised instances with instances mentioned in
the text. Running the system in this state, with absolutely no domain specific restrictions on the resolution
of hypotheses, gave an overall performance of 27.66 P&R on the ST training data. This result was achieved
using only the template definition and no training data to customise the system, and required only a few
hours' work. This baseline customisation of the system to a new IE task is largely mechanical, and we intend
to investigate the extent to which initial concept hierarchies and associated properties can be produced
automatically from a template definition.
A small set of predefined, or static, instances were also added to the Domain Model, encoding certain
world knowledge necessary to complete particular ST task slots. For example, NASA was predefined to
allow its use as a default value for the payload owner slot for American astronauts, as required by the
ST task specifications. Similarly, common spacecraft launch sites were predefined, restricting the selection
of launch site values. Much greater use of this facility could be made to encode other relevant world
knowledge.
During processing, the instances and properties from the semantic representation of a text (QLF)
produced by the Parser are added to the Domain Model. The semantic representation of each sentence is
processed in the following stages, illustrated by the submodules in the GATE interface shown in Figure 4,
gradually specialising the Domain Model to become a Discourse Model, the final version of which is passed
to the Template Writer.
Figure 4: Discourse Interpreter submodules
Add Semantics Instances from the QLF representation of a sentence are added below their parent classes
in the concept hierarchy. New concept nodes are created dynamically for classes not defined in advance, and
are added directly below the object node for instances introduced by nouns, and below the event node for
instances introduced by verbs. Properties in the input semantics are added to the attribute-value structures
associated with the instances to which they relate.
A new mechanism introduced in LaSIE-II is a word root to concept node mapping, used to establish the
parent nodes of new instances. Previously, concept nodes in the ontology were labelled with English word
roots onto which the QLF semantic representation was mapped directly, forcing subclass hierarchies to be
constructed even for synonymous terms. The introduction of a word-to-concept table, or dictionary, provides
a many-to-one mapping onto the concept nodes in the ontology, allowing synonym sets to be represented
straightforwardly in the table.
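The many-to-one table can be pictured as a plain dictionary from word roots to concept nodes, with dynamic node creation as the fallback for unknown roots. This is a sketch only: the table entries (including the French root, added to echo the multilingual use of the table) are invented examples, and `concept_for` is a hypothetical name, not part of LaSIE-II.

```python
# Illustrative entries only: several roots map onto one concept node.
WORD_TO_CONCEPT = {
    "launch": "launch_event",
    "liftoff": "launch_event",
    "lancement": "launch_event",
    "rocket": "rocket",
    "booster": "rocket",
}


def concept_for(root, pos, table, parents):
    """Return the concept node for a word root, creating one if needed.

    parents maps each concept to its parent class. Unknown roots get a new
    node directly below 'object' (nouns) or 'event' (verbs).
    """
    if root in table:
        return table[root]
    parents.setdefault(root, "object" if pos == "noun" else "event")
    return root


parents = {"launch_event": "event", "rocket": "spacecraft"}
known = concept_for("liftoff", "noun", WORD_TO_CONCEPT, parents)
unknown_noun = concept_for("gantry", "noun", WORD_TO_CONCEPT, parents)
unknown_verb = concept_for("postpone", "verb", WORD_TO_CONCEPT, parents)
```

Because synonyms share one target, no artificial subclass hierarchy is needed to relate them.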
The word-to-concept mapping also provides the ability to process QLF from non-English languages with
the same Discourse Interpreter and Domain Model. An experimental Multilingual LaSIE (M-LaSIE) system
has been constructed to process French, Spanish and English texts, producing templates or natural language
summaries in each language, using the word-to-concept table for output as well as input. Further details on
this system can be found in [8].
Add Presuppositions Each instance and property added from the current sentence semantics attempts
to inherit any presuppositions, predefined in the domain model, from its parent classes. Presupposition
properties of concept nodes are used to perform the following functions in the discourse model:
1. Additional Inferences Add inferred information to the current discourse model, in addition to the input
semantics. For example, a presupposition of the property name (proper name) is that any instance with
this property must be an instance of the object class, and also have a number property with the value
singular. If not already known, this information will be added to the discourse model.
Some inferences will be highly domain, or even template, specific, providing particular slot values based
on patterns of semantic relations in the input.
2. Entity Hypothesis Expected, or implicit, instances can be hypothesised to be resolved by the
coreference mechanism. Nominalisations of verbs can be identified by presuppositions and lead to the
hypothesis of the corresponding event, for example a hypothesised launch event from an instance
of a launch object. Such events, if they acquire the necessary properties, will be written out as ST
template entries. Similarly, instances of indirectly related scenario specific objects, such as mission,
can also give rise to the hypothesis of a launch event.
3. Word Sense Disambiguation Identify whether an instance of a particular class in the input is the same
as a known class in the ontology. For example, the ontology contains fall as a subclass of date,
but an instance of fall initially added here may be removed by a presupposition that identifies this
instance as, say, referring to a fall in share prices rather than a date. Scenario specific senses can
also be caught in this way, for example only fire event instances related to missiles are retained as
potential launch events.
4. Role Classification The ontology contains a hierarchy of person roles, including domain specific roles
such as astronaut, pilot, etc. A presupposition acts to reclassify instances of these nodes from the
text as instances of the person node, with a property indicating the role. This avoids the previous
requirement to specify person roles as subclasses of person to force semantic compatibility for
coreference. Subhierarchies of roles can now be defined, for example job roles and family roles, with cross
classification of instances permitted. The roles used for the ST task were obtained by identifying the
intersection of terms in the ST training data with entries in an electronic dictionary with an animate
feature.
5. Partial Parse Extension Missing or unattached properties will cause the hypothesis, and attempted
resolution, of required instances. A presupposition of the event node specifies that all instances below
it should have an lsubj (logical subject) property. This presupposition will cause the hypothesis of a
new instance for each event instance that the parser has failed to attach a subject relation to. The
general coreference mechanism is then used to attempt to resolve the hypothesis, applying various
restrictions, also expressed as properties in the domain model, such as requiring subjects to be before
an active verb in the same sentence. More specific event types can presuppose additional restrictions,
for example restricting a hypothesised subject of a crash event to be of type vehicle. This effectively
specifies semantic roles, or subcategorisation patterns, for particular event types. However, we currently
specify very few such restrictions.
The same mechanism can also be used to attempt prepositional attachment, where the parser has left
a phrase unattached. Phrases can be classified as temporal, locational, etc., using the semantic classes
of their heads, and then semantic role information used to identify potential attachment sites, as for
verb arguments. Again, however, we make little use of this facility in the current domain model.
Object Coreference The general coreference mechanism takes a set of instances newly added to the
discourse model, and compares each one with the set of instances already in the discourse model. For object
coreference, proper names, pronouns, and common nouns are handled separately, first attempting
intra-sentential coreference for each set, and then inter-sentential coreference. Each new-old pair of instances, if at
all compatible, has a similarity score calculated for it, based on the distance between the instances' parent
classes in the concept hierarchy, and the number of shared properties. The highest scoring pair, for each new
instance, is merged in the discourse model, deleting the instance with the least specific class in the ontology,
and combining the properties of both instances. This mechanism is basically unchanged from the LaSIE-I
MUC-6 system, but with the addition of several new features:
1. Long Distance Coreference In LaSIE-I antecedents for pronouns and bare nouns were sought only in
the current and immediately preceding paragraphs, and no attempt was made to find an antecedent in
earlier paragraphs even if the anaphor almost certainly required one, as in the case of pronouns. This
has been extended to search successively earlier paragraphs until an antecedent can be found. On the
30 dry run texts, this extension gave a 2% increase in recall for pronouns, and a 7% increase in recall
for bare nouns, with no significant change in precision.
2. Copular Constructions Constructions of the type NP1 be NP2 where NP1 should corefer with NP2
(`Predicate Nominals' in the CO task definition), e.g. The F14 "Tomcat" is the Navy's first-line fighter
aircraft, were not dealt with in LaSIE-I due to lack of development time, but they are now considered
by the coreference algorithm.
This necessitated reviewing all the coreference rules, to add exceptions for copular constructions. For
example, in general an indefinite noun phrase such as a president cannot have an antecedent, but this
needed to be relaxed so as not to apply to copulars.
Another important aspect of copular constructions is that they provide information to allow `unknown'
words, i.e. words whose semantic class is not in the ontology, to be classified during processing. This is
possible when an instance of an unknown class is coreferred in a copular construction with an instance
of a known class. For example, in Bill is a president, if president is not known as a concept in the
ontology and Bill is recognised as an instance of the known person node, then a president node
can be automatically added below person. Subsequent coreferences can then be more accurate by
preferring or preventing coreference with instances of the newly added class, for example, to prevent
subsequent occurrences of it from being resolved with instances of president.
3. Cataphora LaSIE-II handles two specific cases of pronouns occurring before their antecedents. Firstly,
pronouns within quotations, such as:
"I caught Reggie when he was much younger counting his dad's trophies," McNair said.
where I refers ahead to the speaker, McNair. However, the coreference is currently only possible if
a complete sentence is successfully parsed within the quotation. Secondly, pronouns within copular
constructions can also refer ahead, as in:
This is a mystery.
4. Coordinated NPs A change in the MUC-7 CO task specification was the introduction of certain
conjoined NPs as markables. In the LaSIE-II coreference algorithm, a new instance representing the set of
any coordinated instances is created in the discourse model, which can act as a potential antecedent.
The new instance will have a plural attribute and the semantic class will be the lowest common
parent class of the coordinated instances. For example, in Bruce and his boys three instances: e1, Bruce,
e2, his boys, and the set instance e3, Bruce and his boys, of type person will be represented in the
discourse model.
However, coreferences may fail or spurious coreferences be generated if the parser fails to correctly
recognise the coordinated phrases on which the set instance identification relies.
5. Header Coreference In LaSIE-I only proper noun coreference was attempted in text headers, but in
LaSIE-II both pronoun and bare noun coreferences are also attempted. However, coreference rules that
apply for normal sentences may fail for header sentences because of incomplete syntactic information
caused by the telegraphic style and omission of determiners common in headers. Relaxing some of the
coreference rules for headers gave a noticeable improvement in recall.
6. Generic Nouns A new subclass of bare nouns was introduced, with its own set of specific resolution
rules. These `generic' nouns occur as NP heads with no modifier or other relation which could allow
additional syntactic or semantic information about them to be inferred.
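The core scoring and merging step of the general coreference mechanism described above can be sketched as follows. The paper does not give the scoring formula or the compatibility test, so both are assumptions here: two classes are treated as compatible only if one subsumes the other, and the score is the number of shared properties minus the number of arcs between the classes. The instance format and function names are hypothetical, and none of the six extensions listed above is modelled.

```python
# A fragment of the concept hierarchy, child -> parent.
PARENT = {
    "object": "entity", "artifact": "object",
    "vehicle": "artifact", "payload": "artifact",
    "spacecraft": "vehicle", "rocket": "spacecraft", "satellite": "payload",
}


def path_to_root(concept):
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def class_distance(a, b):
    """Arcs between a and b if one subsumes the other, else None."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    if b in path_a:
        return path_a.index(b)
    if a in path_b:
        return path_b.index(a)
    return None


def similarity(new, old):
    """Assumed score: shared properties minus class distance."""
    distance = class_distance(new["class"], old["class"])
    if distance is None:
        return None                      # incompatible pair, never merged
    shared = sum(1 for k, v in new["props"].items() if old["props"].get(k) == v)
    return shared - distance


def resolve(new, candidates):
    """Merge new with its best-scoring candidate, or return None."""
    scored = [(similarity(new, old), i) for i, old in enumerate(candidates)]
    scored = [(score, i) for score, i in scored if score is not None]
    if not scored:
        return None
    old = candidates[max(scored)[1]]
    # Keep the more specific class (the one deeper in the hierarchy).
    specific = max(new["class"], old["class"], key=lambda c: len(path_to_root(c)))
    return {"class": specific, "props": {**old["props"], **new["props"]}}


new_instance = {"class": "vehicle", "props": {"number": "singular"}}
old_instances = [
    {"class": "satellite", "props": {"number": "singular"}},
    {"class": "rocket", "props": {"number": "singular", "name": "Ariane"}},
]
merged = resolve(new_instance, old_instances)
```

Here the satellite is rejected as incompatible with vehicle, and the merged instance keeps the more specific class together with the union of both property sets.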
The coreference mechanism still has a number of limitations. One of the most common pronoun errors is
related to first and second person pronouns (you, we, I), but there is currently no special treatment for this
class. Also, these pronouns typically occur within quotations, the treatment of which is still very limited.
We also do not attempt to handle type coercion and metonymy, and so fail to resolve pronouns like they in
cases such as:
The Navy informs me that they have been unable to find a common thread to these accidents.
We also need a more robust proper name matching module, since the coreference mechanism is heavily
dependent on this for proper name coreference. It currently fails to match the organisation National
Transportation Safety Board with Safety Board, while it does match Latin America with Latin, errors which
could be easily corrected.
Add Consequences The use of consequence rules parallels that of presuppositions, but since they apply
after object coreference, they may refer to information outside that of the current sentence. If an instance
in the current input is resolved with an instance elsewhere in the text it will acquire additional properties,
potentially allowing more accurate inferences.
The majority of template slot fills are proposed through consequence properties, and thus most of the
domain specific rules are applied at this stage. Consequences can hypothesise unknown instances in the
same way as presuppositions, but these hypotheses are not restricted to being resolved during the
processing of the current sentence. Unresolved hypotheses from consequence rules are retained and
resolution retried after each subsequent sentence. As a special case, unresolved hypotheses may be
removed by the introduction of new hypotheses of the same type but from a different source. Currently,
this removal only takes place for hypotheses from consequences of launch events, to reflect an
assumption that instances related to a first launch will not be introduced in a text following the
description of a second launch. If, however, two mentions of an event are subsequently coreferred, all
properties, including unresolved hypotheses, will be merged anyway.
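The retention and retry of unresolved hypotheses might be sketched as follows. This is an
illustration only: the Hypothesis and HypothesisStore names are hypothetical, and resolution is
reduced to a simple match on semantic type, which is an assumption of the example and much cruder
than resolution against a full discourse model.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    sem_type: str  # e.g. "vehicle" or "payload"
    source: str    # id of the event whose consequence rule proposed it


@dataclass
class HypothesisStore:
    pending: list = field(default_factory=list)

    def add(self, hyp, supersede=False):
        """Record a hypothesis; optionally drop same-type ones from other sources."""
        if supersede:
            self.pending = [
                h for h in self.pending
                if h.sem_type != hyp.sem_type or h.source == hyp.source
            ]
        self.pending.append(hyp)

    def retry(self, sentence_instances):
        """Retry resolution against the (id, sem_type) pairs of a new sentence."""
        available = list(sentence_instances)
        resolved, still_pending = [], []
        for hyp in self.pending:
            match = next((i for i in available if i[1] == hyp.sem_type), None)
            if match is None:
                still_pending.append(hyp)  # keep it for later sentences
            else:
                available.remove(match)
                resolved.append((hyp, match[0]))
        self.pending = still_pending
        return resolved
```

Calling retry once per sentence mirrors the behaviour described above, and the supersede flag
corresponds to the special case currently applied only to launch events.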
Event Coreference The final stage in the processing of each sentence is to attempt to merge launch
events introduced in the current sentence with any others in the discourse model. This includes both
hypothesised and explicitly mentioned events. Initially the criterion for merging two events was that
all related entities (vehicle, payload, site, date) of each event be equivalent. However, this proved
to be too strict, generating many spurious template fills due to failed merges, and the criterion was
gradually relaxed until it was required only that the payload of two events be the same. This makes a
strong assumption that a payload will never be related to more than one launch event. Other related
entities are not compared, and only the entities related to the earlier of two events are retained in
a successful merge.
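The difference between the initial and the relaxed merging criteria can be made concrete with a small
sketch. Events are represented here as plain dictionaries keyed by the four related entities; this
representation and the function names are assumptions made for the example.

```python
RELATED = ("vehicle", "payload", "site", "date")


def strict_match(a, b):
    """Initial criterion: every related entity must be equivalent."""
    return all(a.get(key) == b.get(key) for key in RELATED)


def payload_match(a, b):
    """Relaxed criterion: only the payloads must be the same."""
    return a.get("payload") is not None and a.get("payload") == b.get("payload")


def add_event(model, event, match=payload_match):
    """Merge `event` into the first earlier event it matches, else append it.

    On a successful merge only the entities of the earlier event are kept.
    """
    for earlier in model:
        if match(earlier, event):
            return earlier
    model.append(event)
    return event
```

Two mentions of a launch that share a payload but differ in the other fields are merged under
payload_match and kept apart under strict_match, which is how the strict criterion gives rise to
spurious template fills.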
RESULTS
The following table summarises LaSIE-II system performance on all five tasks. After the NE evaluation
a few modifications were made to the initial system (system `A') as a result of examining the IE
training data which had been the NE formal run data. In particular, a name list for `astronomical
bodies', none of which were identified during the NE evaluation run, was added when we realised we
should have been marking these as locations; we also added a list of rocket names to help with
identifying artifacts for the TE task. Thus, the `A' system results for NE are the official results;
the `B' system (the system run in the subsequent CO and IE evaluation) results are official for the
other tasks. We unofficially rescored the `B' system against the (no longer blind) NE evaluation data
to see what effect the changes had.
Task  System  Recall  Precision  P&R
NE    A       83      89         85.83
      B       87      94         90.41
CO    B       56.1    68.8       61.8
TE    B       75      80         77.17
TR    B       41      82         54.70
ST    B       47      42         44.04
WALKTHROUGH TEXTS
Named Entity
Task            System  Recall  Precision  P&R
NE Walkthrough  A       77      86         81.40
NE Overall      A       83      89         85.83
NE Walkthrough  B       88      88         88.41
NE Overall      B       87      94         90.41
Performance on the NE walkthrough text was slightly below the level across the whole NE formal run
test set. The system missed the organizations Intelsat and Globo, the locations Latin America and
Xichang, and the person Llennel Evangelista. It misclassified Hughes Electronics and MURDOCH SATELLITE
as persons, due to "Hughes" and "Murdoch" being listed as valid person first names followed by unknown
proper names. We spuriously identified MTV and CNN as organizations, because they were both present
in our company gazetteer and we made no attempt to disambiguate companies from TV channels. We also
identified "March" in the Long March rocket as a date.
Some of these errors are easily corrected with minor gazetteer additions. Experimentation to this
effect led to the correct classification of Intelsat and Hughes Electronics, and the avoidance of the
spurious date "March". We also then correctly classified Latin America as a location, but the
Namematch module then caused Latin and Latin American to also be classified as locations.
The remaining errors can be avoided by using the immediate context of the names. For example, Globo
is qualified by conglomerate, which, if added to our ontology of organization types, would allow Globo
to be classified. Also, Llennel Evangelista could be classified correctly if the appositional phrase
a spokesman for Intelsat were interpreted correctly.
Coreference
Task            Recall  Precision  P&R
CO Walkthrough  60.8    64.0       62.3
CO Overall      56.1    68.8       61.8
The coreference algorithm performed well on all definite NPs, and on the majority of proper names in
the walkthrough text. Only half of the pronouns are correctly resolved, however, and this is less than
the system was typically able to resolve on the training data. The main reasons are explained below.
More generally, the walkthrough recall result is higher than the average recall we obtained with the
training and the formal run data (respectively 56.5% and 56.1%). The precision result is lower than
the average obtained on the training and the formal run data (respectively 72.3% and 68.8%). We
concentrate below on the sources of errors that caused precision to drop.
The most common error is when a pronoun corefers to an entity that precedes it and has not been
ruled out as a possible antecedent by possessing an incompatible syntactic or semantic property (the
basic algorithm is `eager', in the sense that it will corefer a pronoun with the closest preceding
entity which cannot be ruled out as an antecedent). This happens for three pronouns in the walkthrough
text. For example in the following paragraph, its corefers to revenue while the correct antecedent is
Galaxy VIII, the main focus of the paragraph.
    Hughes' Galaxy VIII(I) plan would use one satellite, which the company estimates will cost $230
    million to build and launch. Hughes expects Galaxy VIII(I) will bring in $30 million in revenue
    in its first year and $58 million each year for the following 11 years, according to filings at
    the FCC.
The same problem occurs with the pronoun they that corefers to airwaves instead of functions, in the
following:
    Those functions are likely to be slowly shifted to another slice of spectrum, while the airwaves
    they've historically used are turned over, in part, to satellite services such as the ones planned
    by GE and GM.
This kind of spurious coreference is more likely to happen with words like revenue or airwaves, which
are not recorded in our domain model, as no information is available about such classes to rule out or
restrict coreference. Both examples suggest that a more complex mechanism is needed to detect which
entity the paragraph is about, i.e. the focus or center (Galaxy in the first example, and functions in
the second) and to constrain the pronouns to corefer with it. Substituting a focus-based approach
based on [2], that provides such a mechanism for pronoun resolution, correctly resolves these three
pronouns. However, such an approach has problems of its own: the result of applying the focus-based
algorithm across the whole text shows a slight improvement for precision, but a somewhat larger drop
in recall, due mainly to the fact that the focus-based approach is more restrictive in proposing
antecedents for pronouns [3]. At present we are carrying out experiments to see how combining the
advantages of both focus- and non-focus based approaches could lead to superior results for pronoun
resolution.
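The `eager' strategy, and the way an entity of unknown semantic class can capture a pronoun, can be
sketched as follows. The feature names (number, sem_class, allowed_classes) and the compatibility test
are assumptions made for this example; the actual system compares considerably richer syntactic and
semantic properties.

```python
def compatible(pronoun, entity):
    """Return False only if some known property rules the entity out."""
    if entity["number"] != pronoun["number"]:
        return False
    if entity["sem_class"] is None:
        # Class unknown to the domain model: nothing available to rule it out.
        return True
    return entity["sem_class"] in pronoun["allowed_classes"]


def resolve_eager(pronoun, preceding):
    """Corefer with the closest preceding entity that cannot be ruled out."""
    for entity in reversed(preceding):  # closest first
        if compatible(pronoun, entity):
            return entity
    return None


# Hypothetical encoding of the first example above.
its = {"number": "sg", "allowed_classes": {"satellite", "organization"}}
preceding = [
    {"name": "Galaxy VIII", "number": "sg", "sem_class": "satellite"},
    {"name": "$30 million", "number": "sg", "sem_class": "money"},
    {"name": "revenue", "number": "sg", "sem_class": None},
]
```

Under these assumptions resolve_eager(its, preceding) selects revenue, since nothing is known that
would exclude it. If revenue were given a class incompatible with the pronoun, the search would
continue back to Galaxy VIII.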
The mechanism which automatically adds nodes to the ontology for words of unknown semantic class
contributes strongly to recall. All the dynamically added nodes corefer correctly, i.e., the definite
noun phrases the allocation, the airwaves, the plan, whose head nouns were not known in our ontology,
are coreferred correctly.
Coreferring instances introduced by head nouns which are identical or of compatible semantic type is
complicated when they have different qualifiers or modifiers. We have been handling these cases
carefully in our coreference algorithm, but the walkthrough text provided some examples we had not met
in the training data. Correcting our algorithm for the qualifier comparison (to avoid coreferring
things like direct-to-home service and direct-to-home video satellite service) led to a small
improvement: R: 62.0; P: 68.1; P & R: 64.9 on the walkthrough text, and R: 56.0; P: 70.2; P & R: 62.3
when run across the whole formal run test set.
Another source of error is when the parse extension mechanism (described in the section on Discourse
Interpretation above) proposes the wrong argument for a verb. Such is the case in the example below,
where it and the allocation do not corefer because the parse extension mechanism, when looking for a
missing logical subject for use, hypothesises that it is the allocation; the allocation is then ruled
out as an antecedent of the pronoun, since the logical subject and logical object of the same verb
cannot corefer, except for particular cases such as the copula.
    Other companies that support the allocation and may use it include Lockheed Martin Corp.'s
    Loral Space and Communications, International Private Satellite Partners/Orion Atlantic Capital
    Corp., and Comsat Corp.
The system also failed to detect that the pronoun it that occurs in the pleonastic construction It is
... critical is not coreferential, and tried to corefer it anyway. Pleonastic constructions are
addressed in our system; however, in this case the parser was thrown off by the ellipsis.
Finally, incomplete noun group recognition of Hughes' Galaxy VIII(I), i.e., not including (I) as a
part of the name, resulted in identifying I as a pronoun, then coreferring it with Rupert Murdoch.
Information Extraction
Texts/Task         TE                  TR                  ST
              R    P    P&R       R    P    P&R       R    P    P&R
Walkthrough   75   87   80.68     39   77   51.97     42   40   41.18
Overall       75   80   77.17     41   82   54.70     47   42   44.04
Template Elements TE scores on the walkthrough article were slightly above the average across the
test set. On this article we were largely successful with all slots except for the Entity Descriptor
slot where scores were 50% precision and 21% recall. We will first explain the particular items we
failed on, and then discuss why our Entity Descriptor slots were so poor.
The system performed relatively poorly on the three artifacts in the walkthrough text. We did get the
Long March 3 artifact, but missed the B, so the name was incorrect. We identified both of the
artifacts without names (the Intelsat satellites) but incorrectly used Intelsat as the artifact name,
and got the descriptors wrong.
On organizations the system was much better, failing on only four (of twenty-three). We did not
corefer News Corp. and News Corporation, nor did we corefer TCI and Tele-Communications Inc., because
the name matcher module did not equate them. Therefore an extra template element was generated for
both TCI and News Corp. We also did not recognize ING Barings as a company, and so did not print a
template for it. Finally, we reported Organizacoes Globo as a city instead of a company. Since it was
not in our Gazetteer we did not know its type; it was adjacent to a country, so we guessed it was a
city.
The system also did quite well on persons, failing on only three (of ten). We did not classify Shayne
McGuire or Virnell Bruce as persons because their first names did not appear in our Gazetteer. We
failed to get the unnamed company spokesman element, because we had made a conscious decision only to
output templates for entities with a name field. This was done because our handling of descriptors was
so weak.
The system also did quite well on locations, failing on only four (of nineteen). We failed to print
Brazil since we printed Organizacoes Globo as a city with Brazil as its country. We failed to identify
Arlington as a city because it was in the Gazetteer as an organization and thus was attached to Space
Transportation Association; this led us to mistakenly print Virginia as a province. We printed U.S.
from U.S.-based, which seems correct. We simply missed French Guyana because it was not in the
Gazetteer.
The system did quite poorly on Entity Descriptors, getting only four right, two incomplete, two wrong,
and missing thirteen. We had incomplete answers with Televisa where we got Mexico instead of Mexico's
biggest broadcaster. This was due to our weak handling of the 's in the grammar. We were also
incomplete with Irving Goldstein where the Descriptor proposed was chief executive instead of director
general and chief executive of Intelsat. This was due to our poor handling of conjunction during
parsing.
The system proposed two spurious Descriptors. First, we gave Rupert Murdoch the descriptor group,
again due to a grammar confusion involving 's. Second, we gave TCI the descriptor media due to a
problem with name lists.
The system also missed thirteen Descriptors altogether. This was largely due to a combination of a
conservative approach to parsing and a lack of effort on filling the Descriptor field. Our basic
approach to parsing was to only make attachments when we were quite sure they were correct. This meant
that many of the complex Descriptor fields such as spokesman for the Space Transportation Association
of Arlington, Virginia were never successfully combined during parsing. We left these attachments to
the discourse interpreter, but we did not have enough time to develop discourse interpretation rules
to make these attachments.
Improving scores on Entity Descriptors would have improved other aspects of our TE performance. We
could not propose Elements based simply on Descriptors because our Descriptors were so poor. Furthermore,
Descriptors can help to classify entities, as in the case of Shayne McGuire, spokesman.
Template Relations TR scores on the walkthrough text were slightly below the test set average. In
the walkthrough text we correctly identified six of eight location_of relations, six of ten
employee_of relations and zero of two product_of relations.
In no formal run text did the system correctly identify a product_of relation. We had only one
discourse interpreter rule to find this relation and that relied on discovering a known aircraft
manufacturer. While quite restrictive, this rule had led to high precision in the air crash domain of
the dry run training data. We had believed that there would be a large degree of overlap between
aircraft and rocket manufacturers but we were incorrect. Perhaps we would have been more effective on
this problem if we had had some training data for Relations in the domain of launch events, but none
was made available.
Of the other six relations that we missed, three are due to failing to correctly categorize template
elements. That is, the TR task fails because the TE task fails. One employee_of relation is missed due
to our conservative parsing strategy; we do not attach for the Space Transportation Association to
spokesman in either the parser or discourse interpreter. One location_of relation is not generated due
to our weakness of 's processing. The final location_of relation is not generated because we do not
completely parse International Technology Underwriters and thus do not attach it to the adjacent
location.
In addition to including a list of rocket manufacturers, two major improvements could be made to
improve our TR scores. The first is to improve the TE scores, in particular by improving our
Descriptors. Secondly we could improve our phrasal attachment either in the grammar or in the
discourse interpreter. This would enable us to see more relations between entities that are nearby in
the text.
Scenario Template The walkthrough text is reasonably representative of our overall performance on the
ST task. The system proposed four launch events, twice as many as in the key, due to the
identification of four separate satellites, each of which was associated with an event which then
prevented event coreference.
One of the spurious satellites was mentioned as a qualifier: an Intelsat satellite launch. We
attempted to avoid these cases with a rule to prevent any hypothesised entity resolving with a noun
modifier within a nominal compound. This assumes that all entities required for the template will be
mentioned as head nouns at some point in the text. Unfortunately this rule failed altogether due to a
bug.
The other spurious satellite was caused by a coreference failure. The coreference mechanism includes a
strict rule that indefinite noun phrases do not have antecedents, and so no coreference was made
between an Intelsat satellite and a satellite built by Loral Corp. of New York for Intelsat, resulting
in a spurious launch event for the second case.
Our system's poor performance on descriptors also led to incorrect payload entities for the satellite,
despite identifying the correct instances from the input, e.g. my satellite as a descriptor for a
second Intelsat satellite. This was corrected by introducing different criteria for substantial
descriptors in the ST task than for the TE/TR tasks where we only output descriptors for entities that
also had names.
Correcting the hypothesis resolution rule and the descriptor selection increased both recall and
precision on this text, with a 4.5% increase in P&R.
Further errors included the system's failure to identify any FAILED launch events. All events proposed
for the walkthrough text were SCHEDULED. We also associated two launch events with the shuttle as a
vehicle, rather than Long March 3B, and we missed the B from the rocket name.
The system's (official, not debugged) ST output can be summarised as follows:
1. A civilian TV satellite is to be launched today from China.
2. A civilian TV satellite is to be launched in a year from the US by Long March 3.
3. A civilian TV satellite is to be launched in a year from China by the shuttle.
4. A civilian TV satellite is to be launched in a year from the US by the shuttle.
The production of these simple summaries is not yet automated, as in LaSIE-I for MUC-6, but we plan
to add this facility in the near future. The construction of a complete discourse model by the system
allows various results to be read off easily, not only the MUC output. For summarisation we therefore
have access to much more information than that contained in the template itself, with the ability to
control the level of detail. Such summaries can be used as a debugging tool, especially as a means for
non-technical users to provide error reports.
DISCUSSION
Before closing, two questions are worth considering. First, how did our MUC-7 results compare with our
MUC-6 results and what does this tell us? Second, what would we do differently or next in order to
improve on our MUC-7 performance? These questions are best discussed in relation to each of the tasks.
Named Entity
Our official MUC-6 NE scores were R: 84; P: 94; P & R: 89.06. Our system failed to process two texts
in the MUC-6 evaluation due to a bug totally unrelated to the system's language processing
capabilities; scoring done by the MUC scoring committee with this bug fixed yielded R: 89; P: 93;
P & R: 91.01, which is a more accurate reflection of the system's capabilities. This time our official
scores were R: 83; P: 89; P & R: 85.83, a drop of about 5%.
The first thing to note is that the scores of many of the MUC-6 veterans dropped comparably on NE in
MUC-7. One obvious cause of this was lack of training data in the domain of the final evaluation. In
our case, and in some other cases, one manifestation of this was the failure to recognise astronomical
bodies as locations, since there were none, or very few, astronomical bodies in the NE dry run data.
As noted above, adding this change, and one or two other small changes apparent from examining the NE
test data (which subsequently became the IE training data), led to improved scores of R: 87; P: 94;
P & R: 90.41.
Thus, on balance, a case can be made for claiming our NE scores have remained broadly the same
as in MUC-6. But, given the further effort that has gone into the system since MUC-6, one would have
expected scores to have improved. Why haven't they? Without further detailed failure analysis we
cannot say precisely. However, a few remarks can be made. First, it appears that the NYT texts used
for MUC-7 are more heterogeneous in their style, and hence there is more variation in form of NE
expressions. Designing a rule set to capture this variation is, consequently, more difficult. This
observation has been supported anecdotally by various MUC-7 participants. Second, the NE task was
harder in certain respects. In particular, relative temporal expressions were included in the MUC-7 NE
task, both for dates and times of day. Some of these expressions were very difficult indeed (e.g. less
than one hour after the American Airlines 757 slammed into the mountain side killing all 147
passengers) and it is not clear that the task guidelines had completely stabilised for this subtask.
What would we do differently/next to improve NE performance? Most obviously, since our recall is
considerably below precision, we need to concentrate on recall. Our system has a category of `unknown'
proper name (any proper name not resolved to one of the MUC categories) and even superficial reviewing
of system results shows that many proper name expressions falling into this category ought in fact to
have been assigned one of the MUC NE categories. Given that the system carries out a relatively deep
analysis, it should be possible to use further lexical semantic/conceptual knowledge in the discourse
interpreter to resolve some of these cases (see footnote 2).
Another area that needs work concerns the determinism of the NE grammars. The ten NE grammars
used in the MUC-7 system operate in a fixed order and the last co-spanning analysis always gets
chosen. In some cases, regardless of the order of the grammars, errors will result. For example,
consider the names Julian Hill and Pearl Harbour. As presented to our NE grammars both consist of a
known person first name followed by a location trigger word. If our person grammar is run after our
location grammar both come out as persons; if the location grammar is run after the person grammar
both come out as locations. Clearly, both could be either locations or persons, though most of us will
use world knowledge to make the choice. However, the best solution is to pass on the ambiguity and
allow a later module, with more contextual knowledge, to decide. Controlled propagation of ambiguity
into the discourse interpreter is thus a challenge for us.
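The order dependence can be reproduced with a toy pair of grammars. The two word lists and the
functions below are assumptions made for the sake of the example and stand in for the real NE
grammars; classify_ambiguous sketches the alternative of passing every analysis on to a later module.

```python
# Hypothetical gazetteer fragments, for illustration only.
FIRST_NAMES = {"Julian", "Pearl"}
LOCATION_TRIGGERS = {"Hill", "Harbour"}


def person_grammar(tokens):
    """Label the span a person if it starts with a known first name."""
    return "person" if tokens[0] in FIRST_NAMES else None


def location_grammar(tokens):
    """Label the span a location if it ends with a location trigger word."""
    return "location" if tokens[-1] in LOCATION_TRIGGERS else None


def classify(tokens, grammars):
    """Run grammars in a fixed order; the last co-spanning analysis wins."""
    label = None
    for grammar in grammars:
        result = grammar(tokens)
        if result is not None:
            label = result
    return label


def classify_ambiguous(tokens, grammars):
    """Keep every analysis so that a later module can choose between them."""
    return {label for label in (g(tokens) for g in grammars) if label is not None}
```

With the person grammar last both names come out as persons, and with the location grammar last both
come out as locations, so no fixed ordering gets both names right.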
Coreference
Our MUC-6 official coreference scores were: R: 51; P: 71 (P & R: 59.36; see footnote 3). As with NE
we missed processing two texts and when the results for these were added in our scores were R: 54;
P: 70 (P & R: 60.97). For MUC-7 our scores were R: 56.1; P: 68.8; P & R: 61.8.
Thus, there has been a very slight overall improvement. Considerable effort went into fine-tuning the
coreference mechanism and more detailed failure analysis will be necessary to determine why more
significant gains were not achieved and where further advances can be made. The final test set was
smaller in MUC-7 (20 articles, as opposed to 30 in MUC-6) and it may be that the extra coreference
phenomena we addressed in our MUC-7 system (see above) simply did not occur in the test set with
sufficient frequency to make a significant difference. Or it may be that the MUC-7 CO test was
significantly harder in some way that has not yet been determined.
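The coreference P & R figures quoted above can be checked with the balanced F-measure, taken here to
be the standard formula with recall and precision weighted equally. The function name and the beta
parameter are choices made for this sketch.

```python
def f_measure(recall, precision, beta=1.0):
    """Combined P & R score; beta=1 weights recall and precision equally."""
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


muc6_official = f_measure(51, 71)    # MUC-6 official coreference scores
muc6_complete = f_measure(54, 70)    # MUC-6 with the two missed texts added
muc7_official = f_measure(56.1, 68.8)
```

Rounded to the precision used in the text these give 59.36, 60.97 and 61.8 respectively. Values
recomputed from rounded recall and precision can differ slightly from scorer output: for the official
NE figures (R: 83; P: 89) the formula gives about 85.90 rather than the reported 85.83, presumably
because the scorer works from unrounded counts.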
Obvious areas for work include better handling of quoted speech, ascertaining whether the combination
of a focus mechanism and a semantic compatibility/recency mechanism can yield overall better results,
and improving the underlying grammatical analysis on which the coreference algorithm relies.
Information Extraction
Template Element For the template element task our official MUC-6 scores were: R: 66; P: 74; P & R:
69.8 and, with the addition of the two missed articles: R: 68; P: 74; P & R: 70.8. MUC-7 scores were:
R: 75; P: 80; P & R: 77.17.
Template element is the one MUC-7 task where we appear to have made clear and significant gains over
MUC-6. Of course, the major redefinition of this task since MUC-6 may mean direct comparison is not
sensible. However, it is worth briefly considering whether a reasonable story can be told about why
our scores have gone up.
Footnote 2: We did attempt to collect information from the training keys about the distribution of NE
classes occurring as complements of specific prepositions (e.g. what sort of NE's follow the
preposition in) and as noun phrases modified by PPs with a specific preposition and complement type
(e.g. what sort of NE's are modified by phrases of the form in LOCATION). However, none of these
patterns occurred reliably enough to be used in the final system.
Footnote 3: F-measures were not calculated for the coreference task in MUC-6, but were for MUC-7. The
P & R figures supplied here for MUC-6 are our calculations using the standard formula.
Our NE scores were no better than in MUC-6 and TE is crucially dependent on the ability to identify
locations and organizations. After analysis it appears that the increase is due to two factors. First,
a change in the TE task definition meant that the location information associated with an organization
in MUC-6 (the locale and country slots of the organization template element) was exported into the
location_of template relation in MUC-7. Since we did relatively poorly in identifying this information
in MUC-6 (relative to other slots in the organization template), it seems reasonable that our scores
on TE would go up when it was removed from the task. Second, our entity descriptor scores, while bad
in MUC-7, were much worse in MUC-6: in fact these scores improved by a factor of three in MUC-7. We
believe this is attributable to less ambitious, but more reliable phrasal parsing, and to the creation
of a more finely tuned set of rules in the discourse interpreter specifically geared to extracting
entity descriptors.
What would we do differently/next to improve TE performance? As noted in the discussion of the
walkthrough text above, the entity descriptor slot is where most effort needs expending, and to
improve this we need more reliable PP-attachment in descriptive phrases (many of our descriptors were
fragments of the correct descriptor) as well as more work on identifying which phrases are
descriptors. Other weak points were low recall on artifacts and low precision on locations.
Template Relation The template relation task was new for MUC-7, so there are no MUC-6 scores
to compare with. The information for one of the MUC-7 relations, location_of, was captured in the
template element for organization in MUC-6, but it is not clear how to compare meaningfully the locale
and country slot scores in the MUC-6 organization TE with the MUC-7 TR location_of object and slot
scores.
As discussed in relation to the walkthrough text above, the major improvements needed for TR are
first to develop appropriate rules for product_of relations, since we missed all such relations in the
final evaluation; second, to improve TE scores, since TR is parasitic on TE; and third, as with TE, to
enhance grammatical and semantic analysis to better detect TR's.
Scenario Template Our MUC-6 ST scores were: R: 37; P: 73; P & R: 48.96 (the addition of the two
missed articles made no difference as they contained no scenario events). In MUC-7 our ST scores were:
R: 47; P: 42; P & R: 44.04. Thus, we suffered an overall drop in f-measure of nearly five points and
while overall recall went up by ten, our precision, which had been the highest ST precision score at
MUC-6, went down by over thirty points.
We believe that the MUC-7 ST task was significantly more difficult than the MUC-6 one. This conclusion
appears to be borne out by the overall lower performance on ST in MUC-7 (high f-measure of 50.79 as
compared to 56.4 in MUC-6) and is the result of a number of factors. First, the MUC-7 texts were
longer on average and told more complex stories. Second, the template for MUC-7 was more complex,
consisting of 7 object types with a total of 32 non-optional slots, while the MUC-6 template consisted
of 5 object types with 20 non-optional slots. Third, the MUC-7 template required more fine-grained
distinctions to be made. For example, no fewer than five organizations had to be distinguished: the
launch vehicle owner, launch vehicle manufacturer, the payload owner, payload manufacturer and payload
recipient. These are subtler distinctions than those required in the MUC-6 management succession task.
Our drop in precision was largely due to being unable to appropriately merge multiple references to
the same event, as discussed in reference to the walkthrough article above. Our belief, though this
requires further analysis, is that this is more of a problem in the MUC-7 ST task because the texts
tend to refer to the same scenario event repeatedly more than the MUC-6 texts do. We also observed a
tendency for events in the MUC-7 scenario to be expressed more frequently by nominalisation ((rocket)
launch, (missile) attack, (shuttle) flight, even (shuttle) mission could all signal a launch event)
than the management succession events had been in MUC-6. Our attempts to handle these no doubt
contributed to recall but had a large negative impact on precision, as accurately merging multiple
nominalisations of the same event is difficult.
Without considerably more analysis of the results it is difficult to identify the most promising
avenues to improve LaSIE-II's MUC-7 ST performance. As ever, more time to implement the scenario would
have helped. We devoted under ten person days to this task and given its complexity the results are
not surprising (we produced no fill at all for several slots, due to lack of time to implement any
rules for them).
CONCLUSION
From our perspective the most encouraging result from MUC-7 was the vindication it supplied for the
effort we have put into the GATE architecture over the two years leading up to the evaluation. The
chief advantages GATE afforded were:
• a framework supporting a highly modular approach to language processing which in turn permitted a
  team with varying levels of programming skills, areas of expertise, and available time to work
  effectively together;
• reusability of interface code, especially results viewers, that allowed rapid creation of useful
  tools for gaining diagnostic insights.
Using GATE we were able to take part in all five of the MUC-7 tasks without excessive expenditure of
effort (everyone working on MUC-7 had a parallel full time commitment to other projects). Further,
components of the MUC-7 system were in active use in systems undergoing simultaneous development for
other language engineering projects, and GATE made managing this complexity straightforward (this
contrasts with LaSIE-I which was a monolithic system more or less dedicated to MUC-6).
As noted in the Introduction, LaSIE-II does not diverge radically from LaSIE-I in general approach, and
as we did not set out to test explicitly any hypothesis about language processing in the evaluation, we do
not see MUC-7 as allowing us to draw any strong conclusions confirming or disconfirming aspects of this
approach. Perhaps the most interesting insights we have gained as a result of participating in MUC-7 are
the following.
• Simple replacement of a semantic compatibility/recency approach to pronoun resolution with a
focus-based approach does more harm than good, perhaps because the focus-based approach relies on more
accurate/complete syntactic information than is available from a parser designed to perform robust
syntactic analysis on real texts.
• Using a hand-crafted grammar which aims only to do phrasal analysis up to the point of ambiguity,
as opposed to a grammar extracted from the PTB and thresholded on rule frequency (as we did for
MUC-6), produces less complete but more accurate syntactic analyses. How this feeds through to task
performance is hard to assess. More work needs to be done to see how and to what extent these partial
syntactic analyses can be extended using conceptual knowledge.
• Further techniques need to be devised for (semi-)automatically acquiring and refining lexical
semantic/conceptual knowledge in the domain.
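
To make the contrast in the second point concrete, the following is a minimal sketch of what extracting a grammar from a treebank and thresholding it on rule frequency might look like. It is not the LaSIE code: the toy trees, the function names (productions, threshold_grammar) and the threshold value are all assumptions made purely for illustration.

```python
from collections import Counter

# Toy parse trees as nested tuples: (label, child, child, ...).
# Leaves are (tag, word) pairs. These trees are invented for illustration
# and are not taken from the Penn Treebank.
TREES = [
    ("S",
     ("NP", ("DT", "the"), ("NN", "company")),
     ("VP", ("VBD", "launched"),
      ("NP", ("DT", "a"), ("NN", "satellite")))),
    ("S",
     ("NP", ("NNP", "Intelsat")),
     ("VP", ("VBD", "launched"),
      ("NP", ("DT", "the"), ("NN", "rocket")))),
]


def productions(tree):
    """Yield (lhs, rhs) phrasal rules from one tree, skipping tag -> word leaves."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal node: (tag, word), contributes no phrasal rule
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from productions(child)


def threshold_grammar(trees, min_count):
    """Count every phrasal rule and keep those seen at least min_count times."""
    counts = Counter(rule for tree in trees for rule in productions(tree))
    return {rule: n for rule, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical threshold of 2: rules seen only once are discarded.
    for (lhs, rhs), n in sorted(threshold_grammar(TREES, 2).items()):
        print(f"{lhs} -> {' '.join(rhs)}  [{n}]")
```

With a threshold of 2 the rule NP -> NNP, seen only once in the toy data, is discarded. Raising the threshold shrinks the grammar and removes rare rules, whether those rules are noise or legitimate but infrequent constructions. A hand-crafted partial grammar instead chooses its rules directly.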
MUC-7 was both a harder and a broader test than MUC-6, so we are neither surprised nor dismayed
by the lack of striking progress in 'bottom line' figures. The data, task definitions, and scoring software
produced for MUC-7 are a rich resource which we intend to mine for deeper insights for the foreseeable
future. From these insights further progress is sure to follow.
ACKNOWLEDGEMENTS
This work was partly supported by EPSRC grant GR/K25267 (GATE/LaSIE), EC DGXIII grant LE
2238-1 (AVENTINUS), GlaxoWellcome plc and Elsevier Science (EMPathIE).

REFERENCES
[1] S. Abney. Partial parsing via finite state cascades. Natural Language Engineering, 2(4):337–344, 1996.
[2] S. Azzam. Resolving anaphors in embedded sentences. In Proceedings of the 34th Meeting of the Association for Computational Linguistics (ACL'96), Santa Cruz, CA, 1996.
[3] S. Azzam, K. Humphreys, and R. Gaizauskas. Evaluating a focus-based approach to anaphora resolution. In Proceedings of COLING-ACL'98, pages 74–78, 1998.
[4] E. Brill. A simple rule-based part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pages 152–155, Trento, Italy, 1992.
[5] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. Software Infrastructure for Natural Language Processing. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97), pages 237–244, March 1997. Available as http://xxx.lanl.gov/ps/9702005.
[6] R. Gaizauskas. Investigations into the grammar underlying the Penn Treebank II. Research Memorandum CS-95-25, Department of Computer Science, University of Sheffield, 1995.
[7] R. Gaizauskas and K. Humphreys. Using a semantic network for information extraction. Journal of Natural Language Engineering, 3(2/3):147–169, 1997.
[8] R. Gaizauskas, K. Humphreys, S. Azzam, and Y. Wilks. Concepticons vs. lexicons: An architecture for multilingual information extraction. In M.T. Pazienza, editor, Proceedings of the Summer School on Information Extraction (SCIE-97), LNCS/LNAI, pages 28–43. Springer-Verlag, 1997.
[9] R. Gaizauskas, T. Wakao, K. Humphreys, H. Cunningham, and Y. Wilks. Description of the LaSIE system as used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 207–220. Morgan Kaufmann, 1995.
[10] R. Grishman. The NYU system for MUC-6 or where's the syntax. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 167–176. Morgan Kaufmann, 1995.
[11] R. Grishman. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA, 1997. Available at http://www.tipster.org/.
[12] P. Harrison, S. Abney, E. Black, D. Flickinger, C. Gdaniec, R. Grishman, D. Hindle, R. Ingria, M. Marcus, B. Santorini, and T. Strzalkowski. Evaluating syntax performance of parser/grammars of English. In Proceedings of the Workshop On Evaluating Natural Language Processing Systems. Association For Computational Linguistics, 1991.
[13] J.R. Hobbs, D. Appelt, M. Tyson, J. Bear, and D. Israel. Description of the FASTUS system as used for MUC-4. In Proceedings of the Fourth Message Understanding Conference (MUC-4), pages 268–275. Morgan Kaufmann, 1992.
[14] M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[15] D. McKelvie, C. Brew, and H. Thompson. Using SGML as a Basis for Data-Intensive NLP. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97), 1997.
[16] R. Quirk and S. Greenbaum. A University Grammar of English. Longman, Harlow, Essex, 1973.
[17] T. Wakao, R. Gaizauskas, and Y. Wilks. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of the 16th International Conference on Computational Linguistics (COLING96), pages 418–423, Copenhagen, 1996.
