NYU:
Description of the Proteus/PET System as Used for MUC-7 ST
Roman Yangarber and Ralph Grishman
Computer Science Department
New York University
715 Broadway, 7th Floor
New York, NY 10003, USA
{roman|grishman}@cs.nyu.edu
INTRODUCTION
Through the history of the MUCs, adapting Information Extraction (IE) systems to a new class of events has remained a time-consuming and expensive task. Since MUC-6, the Information Extraction effort at NYU has focused on the problem of portability and customization, especially at the scenario level. To begin to address this problem, we have built a set of tools which allow the user to adapt the system to new scenarios rapidly by providing examples of events in text, and examples of the associated database entries to be created. The system automatically uses this information to create general patterns appropriate for text analysis. The present system operates on two tiers:
- Proteus, the core extraction engine, an enhanced version of the one employed at MUC-6 [3];
- PET, the GUI front end through which the user interacts with Proteus (as described recently in [5, 6]).
It is our hope that the example-based approach will facilitate the customization of IE engines; we are particularly interested, as are other sites, in providing the non-technical user, such as a domain analyst unfamiliar with system internals, with the capability to perform IE effectively in a fixed domain.
In this paper we discuss the system's performance on the MUC-7 Scenario Template (ST) task. The topics covered in the following sections are: the Proteus core extraction engine; the example-based PET interface to Proteus; and a discussion of how these were used to accommodate the MUC-7 Space Launch scenario task. We conclude with the evaluation of the system's performance and observations regarding possible areas of improvement.
STRUCTURE OF THE PROTEUS IE SYSTEM
Figure 1 shows an overview of our IE system.[1] The system is a pipeline of modules, each drawing on attendant knowledge bases (KBs) to process its input and passing its output to the next module. The modular design ensures that control is encapsulated in immutable, domain-independent core components, while the domain-specific information resides in the knowledge bases. It is the latter which need to be customized for each new domain and scenario, as discussed in the next section.
The lexical analysis module (LexAn) is responsible for splitting the document into sentences, and the sentences into tokens. LexAn draws on a set of on-line dictionaries; these include the general COMLEX syntactic dictionary, and domain-specific lists of words and names. As a result, each token receives a reading, or a list of alternative readings in case the token is syntactically ambiguous. A reading consists of a list of features and their values (e.g., "syntactic category = Noun"). LexAn optionally invokes a statistical part-of-speech tagger, which eliminates unlikely readings for each token.

[1] For a detailed description of the system, see [3, 5].

Figure 1: IE system architecture
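To make the notion of alternative readings concrete, here is a minimal sketch in Python (the toy lexicon and function names are hypothetical; the actual LexAn data structures are not described in this paper):

```python
# A minimal, hypothetical sketch of LexAn output: each token carries one
# reading or a list of alternative readings; a reading is a set of
# feature-value pairs such as "syntactic category = Noun".

def readings_for(token):
    """Return the list of alternative readings from a toy lexicon."""
    lexicon = {
        "launch": [
            {"category": "Noun", "number": "singular"},
            {"category": "Verb", "tense": "present"},
        ],
        "satellite": [{"category": "Noun", "number": "singular"}],
    }
    return lexicon.get(token.lower(), [{"category": "Unknown"}])

def tag_filter(readings, allowed):
    """Mimic the optional POS tagger: drop unlikely readings."""
    kept = [r for r in readings if r["category"] in allowed]
    return kept or readings  # never eliminate every reading

ambiguous = readings_for("launch")          # two readings: Noun and Verb
resolved = tag_filter(ambiguous, {"Verb"})  # tagger keeps only the Verb reading
```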
The next three phases operate by deterministic, bottom-up, partial parsing, or pattern matching; the patterns are regular expressions which trigger associated actions. This style of text analysis, as contrasted with full syntactic parsing, has gained wider popularity due to limitations on the accuracy of full syntactic parsers, and the adequacy of partial, semantically constrained parsing for this task [3, 2, 1].
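The pattern-plus-action style can be illustrated with a small sketch (the toy patterns and helper names below are hypothetical, not the Proteus pattern language): each regular expression carries an action that fires when the pattern matches.

```python
import re

# A toy illustration of the "regular expression + associated action"
# style of partial parsing: when a pattern matches, its action runs
# over the matched segment.

PATTERNS = [
    # name-recognition style pattern: capitalized word + company suffix
    (re.compile(r"\b([A-Z][a-z]+ (?:Inc|Co)\.)"),
     lambda m, out: out.append(("company", m.group(1)))),
    # basic noun-group style pattern: determiner + modifiers + head noun
    (re.compile(r"\b(an? (?:\w+ )*satellite)\b"),
     lambda m, out: out.append(("np", m.group(1)))),
]

def scan(text):
    found = []
    for pattern, action in PATTERNS:
        for m in pattern.finditer(text):
            action(m, found)   # the associated action "fires" on a match
    return found

result = scan("Arianespace Co. has launched an Intelsat communications satellite.")
```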
The name recognition patterns identify proper names in the text by using local contextual cues, such as capitalization, personal titles ("Mr.", "Esq."), and company suffixes ("Inc.", "Co.").[2]
The next module finds small syntactic units, such as basic NPs and VPs. When it identifies a phrase, the system marks the text segment with semantic information, e.g. the semantic class of the head of the phrase.[3] The next phase finds higher-level syntactic constructions using local semantic information: apposition, prepositional phrase attachment, limited conjunctions, and clausal constructions.
The actions operate on the logical form representation (LF) of the discourse segments encountered so far. The discourse is thus a sequence of LFs corresponding to the entities, relationships, and events encountered in the analysis. An LF is an object with named slots (see the example in figure 2). One slot in each LF, named "Class", has distinguished status, and determines the number and type of other slots that the object may contain. E.g., an entity of class "Company" has a slot called "Name". It also contains a slot "Location" which points to another entity, thereby establishing a relation between the location entity and the matrix entity. Events are specific kinds of relations, usually having several operands.
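As an illustration, here is a minimal sketch of an LF along the lines described above, where the "Class" slot constrains the remaining slots (the class table and Python representation are hypothetical):

```python
# A minimal, hypothetical sketch of a logical form (LF): an object with
# named slots, where the "Class" slot determines which other slots the
# object may contain.

CLASS_SLOTS = {
    "Company":   {"Name", "Location"},
    "Satellite": {"Name", "Manufacturer", "Owner"},
}

class LF:
    def __init__(self, lf_class, **slots):
        allowed = CLASS_SLOTS[lf_class]
        for slot in slots:
            if slot not in allowed:
                raise ValueError(f"class {lf_class} has no slot {slot}")
        self.lf_class = lf_class
        self.slots = dict(slots)

# "a satellite built by Loral Corp. of New York for Intelsat" (cf. figure 2):
loral = LF("Company", Name="Loral Corp.", Location="New York")
sat = LF("Satellite", Manufacturer=loral, Owner="Intelsat")
```

Note how the Manufacturer slot points to another entity, establishing a relation between the two LFs.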
The subsequent phases operate on the logical forms built in the pattern matching phases. Reference resolution (RefRes) links anaphoric pronouns to their antecedents and merges other co-referring expressions. The discourse analysis module uses higher-level inference rules to build more complex event structures, where

[2] At present, the result of the NYU MENE system, as used in the NE evaluation, does not yet feed into the ST processing.
[3] These marks are pointers to the corresponding entities, which are created and added to the list of logical forms representing the discourse.
Slot          Value
Class         Satellite
Name          -
Manufacturer  Loral Corp.
Owner         Intelsat
...           ...

Figure 2: LF for the NP: "a satellite built by Loral Corp. of New York for Intelsat"
the information needed to extract a single complex fact is spread across several clauses. For example, there is a rule that merges a Mission entity with a corresponding Launch event. At this stage, we also convert all date expressions ("yesterday", "last month", etc.) to starting and ending dates as required for the MUC templates. Another set of rules formats the resultant LF into a form that is directly translatable, in a one-to-one fashion, into the MUC template structure; the translation is performed by the final template-generation phase.
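The date-conversion step can be sketched as follows (a toy rule with a hypothetical `normalize` helper anchored at the article's dateline; the actual Proteus rules are not shown in this paper):

```python
from datetime import date, timedelta

# A toy sketch of date normalization: relative expressions such as
# "yesterday" or "last month" become explicit (start, end) date pairs,
# anchored at the article's dateline.

def normalize(expr, dateline):
    if expr == "yesterday":
        d = dateline - timedelta(days=1)
        return (d, d)
    if expr == "last month":
        first_of_this_month = dateline.replace(day=1)
        end = first_of_this_month - timedelta(days=1)   # last day of prior month
        return (end.replace(day=1), end)
    raise ValueError(f"unhandled expression: {expr}")

# With a dateline of Feb. 15, 1998, "last month" spans all of January 1998:
start, end = normalize("last month", date(1998, 2, 15))
```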
PET USER INTERFACE
Our prior MUC experience has shown that building effective patterns for a new domain is a complex and time-consuming part of the customization process; it is highly error-prone, and usually requires detailed knowledge of system internals. With this in view, we have sought a disciplined method for the customization of knowledge bases, and the pattern base in particular.
Organization of Patterns
The pattern base is organized in layers, corresponding to different levels of processing. This stratification naturally reflects the range of applicability of the patterns. At the lowest level are the most general patterns; they are applied first, and capture the most basic constructs. These include proper names, temporal expressions, expressions for numeric entities, and currencies. At the next level are the patterns that perform partial syntactic analysis (noun and verb groups). These are domain-independent patterns, useful in a wide range of tasks. At the next level are domain-specific patterns, useful across a narrower range of scenarios, but still having considerable generality. These patterns find relationships among entities, such as between persons and organizations. Lastly, at the highest level are the scenario-specific patterns, such as the clausal patterns that capture events.
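The stratification can be pictured as an ordered list of layers, applied lowest level first (the layer names and contents below are illustrative only, not the actual Proteus pattern base):

```python
# An illustrative sketch of the layered pattern base: layers are applied
# in order, from the most general, built-in patterns up to the
# scenario-specific ones acquired through PET.

LAYERS = [
    ("names",    ["proper-name", "temporal", "numeric", "currency"]),
    ("groups",   ["noun-group", "verb-group"]),     # domain-independent
    ("domain",   ["person-org", "org-location"]),   # plugged-in pattern library
    ("scenario", ["launch-clause", "mission-np"]),  # acquired via PET
]

def apply_layers(tokens):
    analysis = list(tokens)
    applied = []
    for layer_name, patterns in LAYERS:   # lowest, most general level first
        # each pattern in the layer would rewrite `analysis` here;
        # this sketch only records the order of application
        applied.append(layer_name)
    return applied

order = apply_layers(["Arianespace", "Co.", "has", "launched"])
```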
Proteus treats the patterns at the different levels differently. The lowest-level patterns, having the widest applicability, are built in as a core part of the system. These change little when the system is ported. The mid-range patterns, applicable in certain commonly encountered domains, are provided as pattern libraries, which can be plugged in as required by the extraction task. For example, for the domain of "business/economic news", Proteus has a library with patterns that capture:

- entities: organization/company, person, location;
- relations: person/organization, organization/location, parent organization/subsidiary.
Lastly, the system acquires the most specific patterns directly from the user, on a per-scenario basis, through PET, a set of interactive graphical tools. In the process of building the custom pattern base, PET engages the user only at the level of surface representations, hiding the internal operation. The user's input is reduced to

- providing examples of events of interest in text, and
- describing the corresponding output structures to be created.
Arianespace Co.    has launched    an Intelsat communications satellite
[Company]          [vg]            [satellite]

Figure 3: Initial analysis
Based on this information, PET automatically

- creates the appropriate patterns to extract the user-specified structures from the user-specified text, and
- suggests generalizations for the newly created patterns to broaden coverage.
Pattern Acquisition
The initial pattern base consists of the built-in patterns and the plugged-in pattern libraries corresponding to the domain of interest. These serve as the foundation for example-based acquisition. The development cycle, from the user's perspective, consists of iteratively acquiring patterns to augment the pattern base. The acquisition process entails several steps:
Enter an example: the user enters a sentence containing a salient event (or copies and pastes text from a document through the corpus browser, a tool provided in the PET suite). We will consider the example "Arianespace Co. has launched an Intelsat communications satellite."
Choose an event template: the user selects from a menu of event names. A list of events, with their associated slots, must be given to the system at the outset, as part of the scenario definition. This example will generate an event called "Launch", with slots as in figure 4: Vehicle, Payload, Agent, Site, etc.
Apply existing patterns: the system applies the current patterns to the example, to obtain an initial analysis, as in figure 3. In the example shown, the system identified some noun/verb groups and their semantic types. For each element it matches, the system applies minimal generalization (in the sense that to be any less general, the element would have to match the example text literally). The system then presents the analysis to the user and initiates an interaction with her:

np(C-company) vg(Launch) np(Satellite)
Tune pattern elements: the user can modify each pattern element in several ways: choose the appropriate level of generalization of its concept class, within the semantic concept hierarchy; force the element to match the corresponding text in the original example literally; make the element optional; remove it; etc. In this example, the user should likely generalize "satellite" to match any phrase designating a payload, and generalize the verb "launch" to a class containing its synonyms (e.g. "fire"):

np(C-company) vg(C-Launch) np(C-Payload)
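The generalization step can be sketched against a toy concept hierarchy (hypothetical; the actual semantic hierarchy is part of the scenario-specific knowledge base):

```python
# A toy sketch of generalizing a pattern element within a semantic
# concept hierarchy: each element can be widened step by step toward
# more general classes, broadening what the pattern will match.

PARENT = {
    "satellite": "C-Payload",
    "C-Payload": "C-Object",
    "launch":    "C-Launch",
    "fire":      "C-Launch",
}

def generalize(concept, steps=1):
    """Climb `steps` levels up the hierarchy from `concept`."""
    for _ in range(steps):
        concept = PARENT.get(concept, concept)  # stop at the root
    return concept

def matches(cls, word):
    """Does pattern class `cls` cover `word` (the word itself or an ancestor)?"""
    while word is not None:
        if word == cls:
            return True
        word = PARENT.get(word)
    return False

payload_class = generalize("satellite")  # widened to C-Payload
verb_class = generalize("launch")        # widened to C-Launch, which also covers "fire"
```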
Fill event slots: the user specifies how pattern elements are used to fill slots in the event template. Clicking on an element displays its logical form (LF). The user can drag and drop the LF, or any sub-component thereof, into a slot in the target event, as in figure 4.
Build pattern: when the user "accepts" it, the system builds a new pattern to match the example, and compiles the associated action; the action will fire when the pattern matches, and will fill the slots in the event template as in the example. The pattern is then added to the pattern base, which can be saved for later use.
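The compiled result of this step might look roughly as follows (a toy rendering with hypothetical class definitions; the real Proteus pattern language and action code are richer):

```python
import re

# A toy sketch of a compiled pattern: the accepted element sequence
# np(C-company) vg(C-Launch) np(C-Payload) becomes a pattern whose
# action fills the Launch event slots.

COMPANIES = r"(?P<agent>\w+ Co\.)"
LAUNCH_VERBS = r"(?:has )?(?:launched|fired)"
PAYLOADS = r"(?P<payload>an? [\w ]*satellite)"

launch_pattern = re.compile(f"{COMPANIES} {LAUNCH_VERBS} {PAYLOADS}")

def launch_action(m):
    # fill the event template as in the accepted example
    return {"Class": "Predicate-Launch",
            "Agent": m.group("agent"),
            "Payload": m.group("payload")}

m = launch_pattern.search(
    "Arianespace Co. has launched an Intelsat communications satellite.")
event = launch_action(m)
```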
Syntactic generalization: in fact, the pattern base acquires much more than the basic pattern that the user accepted. The system applies built-in meta-rules [1, 4] to produce a set of syntactic transformations
Slot     Value
Class    Predicate-Launch
Agent    entity => <Arianespace>
Payload  entity => <Intelsat satellite>
Site     ...
...      ...

Figure 4: Event LF corresponding to a clause
from a simple active clause pattern or a bare noun phrase. For this active example, the pattern base will automatically acquire its variants: the passive, relative, relative passive, reduced relative, etc.[4] Proteus also inserts optional modifiers, such as sentence adjuncts, into the generated variants, to broaden the coverage of the pattern. In consequence, a passive pattern which the system acquires from this simple example will match the event in the walk-through message, "... said Televisa expects a second Intelsat satellite to be launched by Arianespace from French Guyana later this month ...", with the help of lower-level patterns for named objects, and locative and temporal sentence adjuncts.[5]
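The effect of the meta-rules can be sketched schematically (illustrative notation only; the actual transformations operate on full patterns, not strings):

```python
# A toy sketch of the meta-rule idea: from one accepted active-clause
# pattern, generate its syntactic variants (passive, relative, reduced
# relative). Patterns are shown as element lists in a made-up notation.

def meta_rules(active):
    """active = [subject, verb, object] pattern elements."""
    subj, verb, obj = active
    return {
        "active":           [subj, verb, obj],
        "passive":          [obj, f"be {verb}-en", f"by {subj}"],
        "relative":         [subj, "who/which", verb, obj],
        "reduced-relative": [obj, f"{verb}-en", f"by {subj}"],
    }

variants = meta_rules(["np(C-company)", "vg(C-Launch)", "np(C-Payload)"])
```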
PERFORMANCE ON THE LAUNCH SCENARIO
Scenario Patterns
This section describes how the Proteus/PET system was adapted to accommodate the MUC-7 scenario. The scenario-specific patterns were primarily of two types: those for launch events ("NASA launched a rocket.", "The terrorists fired a missile.") and those for missions ("the retrieval of a satellite"). Starting from patterns for simple active clauses, the system automatically generated patterns for the syntactic variants, such as the passive, relative, and reduced relative clauses. The missions added information regarding payloads and mission functions to a launch event, but did not directly generate a launch event. In some cases, the mission was syntactically tied to a particular launch event ("... launched the shuttle to deploy a satellite"). If there was no direct connection, the post-processing inference rules attempted to tie the mission to a launch event.
Inference Rules
Consider the event in figure 4: the surface representation contains a generic "Agent" role. The agent can be of several types: e.g., it can be a launch vehicle, an organization, or even a launch site, in case the agent is a country. In this case, the role is filled by an organization, which, in principle, further admits the possibility of either the payload owner or the vehicle owner. The scenario specification mandates that the function of the "agent" be precisely specified, although at the surface it is underspecified. In this case, the function can be determined on the basis of the semantic class of the agent, and the observation that the payload-owner slot is already occupied unambiguously by another organization entity.
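This disambiguation step can be sketched as a small rule (hypothetical function and slot names; the actual inference rules are scenario-specific and richer):

```python
# A toy sketch of the agent-function inference described above: classify
# a generic Agent by its semantic class, and use already-filled slots to
# disambiguate between payload owner and vehicle owner.

def agent_function(event, semantic_class):
    if semantic_class == "vehicle":
        return "launch_vehicle"
    if semantic_class == "site":
        return "launch_site"
    if semantic_class == "organization":
        # if the payload-owner slot is already unambiguously occupied by
        # another entity, the organization agent must be the vehicle owner
        if event.get("payload_owner") and event["payload_owner"] != event["agent"]:
            return "vehicle_owner"
        return "payload_owner"
    return "unknown"

event = {"agent": "Arianespace", "payload_owner": "Intelsat"}
role = agent_function(event, "organization")   # resolved to the vehicle owner
```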
This type of computation is performed by scenario-specific inference rules; in general, this determination can be quite complex. Translating the surface representations into those mandated by the task specification can involve many-to-many relations, such as the ones that exist between payloads and launch events, where multiple payloads correspond to a single event, and multiple launch events are concerned with a single payload.
One technique that appeared fruitful in the Launch scenario was extending our set of inference rules with heuristics. Often a slot in an event cannot be filled, as when patterns fail to find a syntactically suitable
[4] The expert user can view the variants which the system generates, and make changes to them directly.
[5] The tools can be used to acquire non-clausal patterns as well, e.g. patterns for noun groups and complex noun phrases, to extend an existing pattern library.
Task        Template Element   Scenario Template
Recall            71                  31
Precision         83                  68
F-Measure       76.50               42.73

Figure 5: NYU scores on MUC-7 tasks
candidate. The idea was to make intelligent guesses about fillers for these empty slots. For example, consider the first sentence of the walk-through message:

    Xichang, China, Feb. 15 (Bloomberg) - A Chinese rocket carrying an Intelsat satellite exploded as it was being launched today, delivering a blow ...
Here we find two similar problems, concerning the launch date and the launch site. Our patterns recognize the corresponding locative and temporal noun phrases; however, because neither stands in a direct syntactic relation to the main launch event clause (here, headed by the verb "explode"), they fail to fill the appropriate slots in the event. We use a simple heuristic rule to recover from this problem: if the launch event has an empty date, and if the sentence contains a unique expression of the correct type (i.e. a date), use that expression to fill the empty slot.
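The heuristic can be sketched as follows (hypothetical data structures; the real rule operates on the logical forms built during analysis):

```python
# A toy sketch of the "unique candidate" heuristic: if an event slot is
# empty and the sentence contains exactly one expression of the required
# type, use that expression to fill the slot.

def fill_by_heuristic(event, sentence_entities, slot, wanted_type):
    if event.get(slot) is not None:
        return event                      # slot already filled; do nothing
    candidates = [e for e in sentence_entities if e["type"] == wanted_type]
    if len(candidates) == 1:              # fire only on a unique candidate
        event[slot] = candidates[0]["value"]
    return event

# Entities recognized in the walk-through sentence, but not syntactically
# related to the launch clause:
entities = [{"type": "date", "value": "Feb. 15"},
            {"type": "location", "value": "Xichang"}]
launch = {"class": "Launch", "date": None, "site": None}
launch = fill_by_heuristic(launch, entities, "date", "date")
launch = fill_by_heuristic(launch, entities, "site", "location")
```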
We have experimented with a variety of heuristics for several slots, including organizations for vehicle and payload owners and manufacturers, dates, and sites. At present, the contribution of these heuristics to our score accounts for just under 10% of the F-measure. It is also apparent that some of the heuristics actually overgenerate, though we have yet to analyze their effect in detail.
We believe that the overall approach of example-based pattern acquisition is more appropriate than automatic training from annotated corpora, as the amount of training data for ST-level tasks is usually quite limited. We have found the pattern editing tool reasonably effective. However, we discovered that much of the task involved the creation and tuning of post-processing rules, and we do not yet have support in the tool for this activity. This consumed a considerable part of the customization effort. This points to an important problem that needs to be addressed, especially for tasks where the structure of output templates differs substantially from the structure of entities and events as picked up by the syntactic analysis.
We did not specifically focus on the TE task within the launch scenario, and simply used the same system we had used for the ST task. Figure 5 is a summary of the scores of our system.
Acknowledgement
We wish to thank Kristofer Franzén of Stockholm University for his assistance during the MUC-7 formal run.

REFERENCES
[1] Douglas Appelt, Jerry Hobbs, John Bear, David Israel, Megumi Kameyama, Andy Kehler, David Martin, Karen Meyers, and Mabry Tyson. SRI International FASTUS system: MUC-6 test results and analysis. In Proc. Sixth Message Understanding Conf. (MUC-6), Columbia, MD, November 1995. Morgan Kaufmann.
[2] Douglas Appelt, Jerry Hobbs, John Bear, David Israel, and Mabry Tyson. FASTUS: A finite-state processor for information extraction from real-world text. In Proc. 13th Int'l Joint Conf. Artificial Intelligence (IJCAI-93), pages 1172-1178, August 1993.
[3] Ralph Grishman. The NYU system for MUC-6, or where's the syntax. In Proc. Sixth Message Understanding Conf. (MUC-6), pages 167-176, Columbia, MD, November 1995. Morgan Kaufmann.
[4] Ralph Grishman. The NYU system for MUC-6, or where's the syntax? In Proc. Sixth Message Understanding Conf. (MUC-6), Columbia, MD, November 1995. Morgan Kaufmann.
[5] Ralph Grishman. Information extraction: Techniques and challenges. In Maria Teresa Pazienza, editor, Information Extraction, Lecture Notes in Artificial Intelligence. Springer-Verlag, Rome, 1997.
[6] Roman Yangarber and Ralph Grishman. Customization of information extraction systems. In Paola Velardi, editor, Proc. International Workshop on Lexically Driven Information Extraction, Frascati, Italy, July 1997. Università di Roma.
