PRC Inc.:
DESCRIPTION OF THE PAKTUS SYSTEM
USED FOR MUC-5
Bruce Loatman
Chih-King Yang
PRC Inc.
Technology Division
1500 PRC Drive
McLean, VA 22102
loatman_bruce@po.gis.prc.com
BACKGROUND
The PRC Adaptive Knowledge-based Text Understanding System (PAKTUS) was developed
as an Independent Research and Development project at PRC from 1984 through 1992. It includes
a core English lexicon and grammar, a concept network, processes for applying these to lexical,
syntactic, semantic, and discourse analysis, and tools that support the adaptation of the generic
core to new domains, primarily by acquiring sublanguage and domain-specific lexicon and topic
patterns. The lexical, syntactic, and semantic analysis components required little adaptation for
MUC-5, the most significant change being conversion of the task-specific semantic representations
to object-oriented form. The discourse analysis component was modified to operate on the
task-specific semantic structures, rather than the generic case frames.
APPROACH
The overall structure and operation of PAKTUS are shown in Figure 1. This is similar to the
"generic system" described in [1]. Processing proceeds mostly sequentially, with the exception of
the interaction between the syntactic and semantic components at the clause and noun phrase level,
and between the lexical analysis and preprocessor.
For the MUC-5 task, we added some bracketing capabilities to handle the special syntactic
phenomena in the financial application domain that might cause problems for the parser or later
extraction processes. These phenomena include company names, currency, temporal expressions,
and one percentage expression. For example, "BRIDGESTONE SPORTS CO.",
"BRIDGESTONE SPORTS TAIWAN CO", "UNION PRECISION CASTING CO", and "TAGA
CO." are recognized as company names during the bracketing phase; "20 MILLION NEW
TAIWAN DOLLARS" is bracketed as ((20000000 DOLLAR^CURRENCY BASE C^TAIWAN));
and "75 PCT BY" is treated as a preposition. The bracketing process significantly improved the
complete-sentence parse rate for the MUC-5 corpus, minimizing the need for complex
additions to the grammar.
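The currency case can be illustrated with a minimal Python sketch. The regular expression, the multiplier table, and the function name `bracket_currency` are assumptions made for this example; the paper does not give the actual PAKTUS bracketing rules, and only the shape of the output follows the bracketed form quoted above.

```python
import re

# Hypothetical multiplier words; the real PAKTUS tables are not shown in the paper.
MULTIPLIERS = {"THOUSAND": 10**3, "MILLION": 10**6, "BILLION": 10**9}

# Hypothetical pattern: <number> <multiplier> [NEW] <nation> DOLLARS
CURRENCY_RE = re.compile(
    r"(?P<num>\d+(?:\.\d+)?)\s+(?P<mult>THOUSAND|MILLION|BILLION)\s+"
    r"(?:NEW\s+)?(?P<nation>[A-Z]+)\s+DOLLARS"
)

def bracket_currency(text):
    """Return one (amount, unit, concept) tuple per currency expression found."""
    results = []
    for m in CURRENCY_RE.finditer(text):
        amount = int(float(m.group("num")) * MULTIPLIERS[m.group("mult")])
        results.append((amount, "DOLLAR^CURRENCY", "C^" + m.group("nation")))
    return results

print(bracket_currency("CAPITALIZED AT 20 MILLION NEW TAIWAN DOLLARS"))
```

Collapsing the phrase into a single token before parsing means the grammar never has to analyze the internal structure of the amount, which is the benefit the bracketing phase is described as providing.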
Unlike the generic system, PAKTUS has no text filter or preparser. Full parses are attempted
on all sentences, and the first syntactico-semantically successful parse of a sentence is accepted.
Parse time is limited to a linear function of the number of words, however, and parse fragments
are returned, implicitly conjoined, if a full parse cannot be produced in the allotted time. Full
parses were achieved for approximately 50 percent of the sentences in the MUC-5 corpus. Useful
information was obtained from the fragmentary parses of the other half, however. Based on
comparison of the MUC-5 error measures when fragmentary parses were included and excluded,
the fragmented parses yielded, on average, about one-third as much correct information as the full
parses.
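The time-limited parsing policy described above can be sketched as follows. This is an illustration under stated assumptions, not the PAKTUS implementation: the coefficients in `time_budget`, the `parser_steps` iterator protocol, and the injectable `clock` are all hypothetical.

```python
import time

def time_budget(num_words, base_seconds=1.0, seconds_per_word=0.5):
    """Linear time allowance; both coefficients are made-up values."""
    return base_seconds + seconds_per_word * num_words

def parse_sentence(words, parser_steps, clock=time.monotonic):
    """Run an incremental parser until it finishes or the budget runs out.

    parser_steps is a hypothetical iterator yielding (fragments, full_parse)
    after each unit of work; full_parse is None until a complete parse exists.
    """
    deadline = clock() + time_budget(len(words))
    fragments = []
    for fragments, full_parse in parser_steps:
        if full_parse is not None:
            # The first successful full parse is accepted.
            return ("full", full_parse)
        if clock() >= deadline:
            break
    # No full parse in time: return the fragments, implicitly conjoined.
    return ("fragments", fragments)

steps = iter([(["NP"], None), (["NP", "VP"], "S")])
print(parse_sentence(["IT", "SHIPS"], steps))
```

Passing the clock as a parameter is only a convenience for exercising the timeout path without waiting.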
Another variation on the generic system is that fragment combination and semantic
interpretation are integrated in PAKTUS, and semantic interpretation is divided into two distinct
modules: one that produces a generic representation of the complete sentence, and a subsequent
module that maps this into (possibly several) task-specific representations. In addition, at the
lexical analysis phase, as mentioned above, each word was associated with both syntactic and
semantic information. Lexical patterns were developed as an alternative to semantic interpretation
based on full syntactico-semantic parses. This takes advantage of the information-rich lexical
analysis. A pattern matcher uses the results of both the lexical analysis (as in Figure 2a) and the
syntactic analysis (as in Figure 3b) to extract information. This is invoked when the extraction
based on full parsing fails to yield any data. The patterns are defined as regular expressions that
are matched against the results of lexical analysis. When a match is found, the corresponding noun
phrases (produced by the full parser) are extracted and the task-specific semantic representation is
constructed from these.
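A minimal sketch of the lexical pattern idea follows, assuming a much simplified lexical analysis in which each word carries a single category. The tag encoding, the pattern itself, and the function names are hypothetical; in PAKTUS the matched positions select noun phrases produced by the parser, whereas this sketch simply returns the matched tokens.

```python
import re

def to_category_string(analysis):
    """Encode a lexical analysis as one '<CATEGORY>' tag per word."""
    return "".join("<%s>" % category for _, category in analysis)

# Hypothetical pattern: a company, any intervening words, then an activity.
PATTERN = re.compile(r"(?P<entity><COMPANY>)(?:<[^>]+>)*?(?P<venture><ACTIVITY>)")

def match_pattern(analysis):
    """Return the tokens bound to each named group, or None if no match."""
    encoded = to_category_string(analysis)
    m = PATTERN.search(encoded)
    if m is None:
        return None
    bindings = {}
    for name in m.groupdict():
        # Count the tags before the group start to recover the word index.
        word_index = encoded[:m.start(name)].count("<")
        bindings[name] = analysis[word_index][0]
    return bindings

s1 = [("BRIDGESTONE SPORTS CO", "COMPANY"), ("SAID", "TO-IO"),
      ("IT", "NEUTER"), ("SET UP", "MONOTRANS"), ("A", "DET"),
      ("JOINT VENTURE", "ACTIVITY")]
print(match_pattern(s1))
```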
PAKTUS has no separate lexical disambiguation module: that function is distributed across all
system modules. Initially, words are assigned all possible meanings available in the lexicon.
Senses that are inconsistent with any processing choice are filtered out when that choice is
considered.
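The filtering step can be pictured with a small Python sketch. The sense symbols are taken from Figure 2a; the function and the particular set of categories allowed for a noun-phrase head are assumptions for illustration only.

```python
def filter_senses(candidates, allowed_categories):
    """Keep only the senses whose lexical category fits the current choice."""
    return [s for s in candidates if s.split("^", 1)[1] in allowed_categories]

# All four senses of "concern" are fetched at lexical analysis time.
concern = ["CONCERN^COPULA", "CONCERN^MONOTRANS",
           "CONCERN^EMOTION", "CONCERN^BUSINESS"]

# Hypothetical choice: the parser is considering "concern" as a noun-phrase head.
noun_head = {"EMOTION", "BUSINESS"}
print(filter_senses(concern, noun_head))
```

Later choices (for example, a semantic constraint on a case role filler) would narrow the remaining senses further in the same way.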
[Figure 1 is a block diagram that did not survive text extraction. Recoverable labels follow.
Modules: Preprocessor, Parser, Semantic Analysis, Generic-to-Task Mapping, Reference Resolution, Output Generator.
Intermediate data: Doc Template, Words & Sentences, Internal Reps & Links, Semantic Frames, Semantic Structures, Task-specific Structures, Topics & Bindings, Filled Templates.
Knowledge bases: Lexicon & Morph Rules, Syntactic Structures, Semantic Mappings, Output Templates.]
Figure 1: PAKTUS Modules and Control Flow
PROCESSING OF MUC-5 DOCUMENT 0592
Figure 2a shows the raw, unprocessed text of the first sentence (S1) of article number 0592,
followed by its lexical analysis. This is the result of applying both the preprocessor and lexical
analysis modules. Each word has one or more senses, represented as a root symbol, which is
generally the concatenation of the English token, the "^" character, and the PAKTUS lexical
category (e.g., "Set^Monotrans"), or as a simple structure involving a root, lexical category,
inflectional mark, and sometimes a conceptual derivation (e.g., the structure "(Say2^Monotrans
L^Monotrans S^ed)" represents the -ed form of one sense of "say"). For each word, all senses in
the PAKTUS lexicon are fetched or derived at this time; disambiguation is generally delayed until
later phases. Many of the words are unknown to the PAKTUS lexicon; for these, PAKTUS makes guesses from
the context. An example of an ambiguous word is "concern." Figure 2b shows some of the
lexicon information for this word in PAKTUS. This includes a list of the four PAKTUS primitive
words corresponding to the token "concern" plus objects containing information for each primitive
word. These objects are embedded in a semantic network; they inherit much additional
information from it.
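The two sense notations described above can be decoded with a short Python sketch. The `Sense` record and `parse_sense` function are hypothetical names, and treating an unprefixed mark such as BASE as the inflectional mark is an assumption drawn from the examples in Figure 2a.

```python
from typing import NamedTuple, Optional

class Sense(NamedTuple):
    root: str
    category: Optional[str]
    inflection: Optional[str] = None
    derivation: Optional[str] = None

def parse_sense(sense):
    """Decode 'Token^Category' or a (root, 'L^cat', 'S^infl', 'C^concept') tuple."""
    if isinstance(sense, str):
        # Simple root symbol: the category follows the last '^'.
        return Sense(root=sense, category=sense.rsplit("^", 1)[1])
    root, *marks = sense
    category = inflection = derivation = None
    for mark in marks:
        prefix, sep, value = mark.partition("^")
        if sep and prefix == "L":
            category = value          # lexical category
        elif sep and prefix == "S":
            inflection = value        # inflectional suffix
        elif sep and prefix == "C":
            derivation = value        # conceptual derivation
        else:
            inflection = mark         # assumed: unmarked form such as BASE
    return Sense(root, category, inflection, derivation)

print(parse_sense("Set^Monotrans"))
print(parse_sense(("Say2^Monotrans", "L^Monotrans", "S^ed")))
```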
Sample PAKTUS grammar specifications relevant to S1 are shown in Figure 3a, and the
syntactic analysis of this sentence is shown in Figure 3b. This analysis is represented as a
configuration of syntactic registers (the main ones are shown in the figure) and register fillers. The
grammar fragment of Figure 3a recognized the bound clause in S1 ("... it has set up a joint
venture ...").
Several semantic frames that apply to S1 are shown in Figure 4a, and the generic semantic
analysis of this sentence is shown in Figure 4b. PAKTUS represents the semantic analysis as five
case frames: one for the sentence, one for each of the three subordinate clauses (only two of which
are displayed in the figure), and one for the "joint venture" NP. The semantic rules are distributed
in a network of objects like those in Figure 4a.
Information is organized in several hundred conceptual objects (e.g., C^CREATE, the concept
for "set up" in S1) and case roles (e.g., R^RESULT, the thing created). Information about how
to map from syntactic registers to case roles may be stored in concept frames, lexical frames, or
role frames, along with semantic constraints on allowable fillers. For example, the R^RESULT
role of C^CREATE is normally filled by the direct object (DO) register. This information is
inherited by R^RESULT from the more general R^OBJECT role.
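The inheritance of the default register can be traced with a small Python sketch. The AKO links and the DO default are transcribed from Figure 4a; the dictionary layout and the `inherited` function are assumptions made for this example, not the PAKTUS frame machinery.

```python
# Frames reduced to their AKO (a-kind-of) links and default register.
FRAMES = {
    "R^RESULT": {"AKO": "R^EFFECT"},
    "R^EFFECT": {"AKO": "R^OBJECT"},
    "R^OBJECT": {"AKO": "PROP-ROLE", "DEFAULT": "DO"},
    "PROP-ROLE": {},
}

def inherited(frame_name, slot):
    """Walk AKO links upward until some frame supplies the slot."""
    while frame_name is not None:
        frame = FRAMES.get(frame_name, {})
        if slot in frame:
            return frame[slot]
        frame_name = frame.get("AKO")
    return None

# R^RESULT has no default of its own; it finds DO two levels up, on R^OBJECT.
print(inherited("R^RESULT", "DEFAULT"))
```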
BRIDGESTONE SPORTS CO. SAID FRIDAY IT HAS SET UP A JOINT VENTURE IN TAIWAN
WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS
TO BE SHIPPED TO JAPAN.
*** lexical analysis:
(((UNKNOWN-WORD L^COMPANY BASE C^UNKNOWN . "BRIDGESTONE SPORTS CO"))
((SAY^INTRANS L^INTRANS S^ED) (SAY^TO-IO L^TO-IO S^ED)
(SAY2^MONOTRANS L^MONOTRANS S^ED))
(("17-NOV-89" L^TIME-DATE BASE)) (IT^NEUTER)
((HAVE^MONOTRANS L^MONOTRANS S^S) (HAVE2^INTRANS L^INTRANS S^S)
(HAVE^INTRANS L^INTRANS S^S) (HAVE^HAVE L^HAVE S^S)
(HAVE1^MONOTRANS L^MONOTRANS S^S))
(SET^COLLECTION (SET^MONOTRANS L^MONOTRANS S^ED) SET^MONOTRANS)
(UP^PARTICLE UP^PREP UP^DIRECTION) (A^DET)
(JOINT\ VENTURE^ACTIVITY)
(IN^PARTICLE IN^PREP) (TAIWAN^NATION)
(WITH^PARTICLE WITH^PREP) (A^DET) (LOCAL^SPACE-REL)
(CONCERN^COPULA CONCERN^MONOTRANS CONCERN^EMOTION CONCERN^BUSINESS)
(AND^CONJ) (A^DET)
((JAPAN^NATION L^LANGUAGE BASE C^CHAR-OF) (JAPAN^NATION L^ADJ BASE
C^IT-BE-FROM) (JAPAN^NATION L^INHABITANT BASE C^BE-FROM))
((UNKNOWN-WORD L^COMPANY BASE C^UNKNOWN . "TRADING HOUSE"))
(TO^PREP TO^PARTICLE) (PRODUCE^MONOTRANS)
((UNKNOWN-WORD VP BASE C^PRIMITIVE . "GOLF")
(UNKNOWN-WORD L^COMMON BASE C^UNKNOWN . "GOLF"))
((CLUB^GROUP L^GROUP S^S)
(UNKNOWN-WORD L^INTRANS S^S C^UNKNOWN . "CLUBS")
(UNKNOWN-WORD L^MONOTRANS S^S C^UNKNOWN . "CLUBS"))
(TO^PREP TO^PARTICLE)
(BE^BE BE^INTRANS BE^COPULA)
((SHIP^BITRANS L^BITRANS S^ED) (SHIP^MONOTRANS L^MONOTRANS S^ED))
(TO^PREP TO^PARTICLE) (JAPAN^NATION))
Figure 2a: Lexical Analysis of the First Sentence of Document Number 0592
"CONCERN"
(CONCERN^COPULA CONCERN^MONOTRANS CONCERN^EMOTION CONCERN^BUSINESS)
(CONCERN^COPULA (AKO (L^COPULA))
(COMPLEMENT (WH-CLAUSE WH-TO-INF NP))
(CONCEPT (C^BE-ABOUT))
(ROLES (R^AFFECTED R^PROPOSITION R^FOCUS))
(R^AFFECTED NIL (_ SUBJECT) (@ "INFO"))
(R^FOCUS NIL (_ COMP) (@ (TRUE))))
(CONCERN^MONOTRANS (AKO (L^MONOTRANS)) (CONCEPT (C^MOTIV)))
(CONCERN^EMOTION (AKO (L^EMOTION)))
(CONCERN^BUSINESS (AKO (L^BUSINESS)) (TYPE (COUNT LEFT-ADJ-OF-N)))
Figure 2b: Some PAKTUS Lexicon Data Used for S1
(C-S-T1E (TO-STATE (E^)) (AKO (ARC))
(INIT
( (A .MAIN-VERB.LEX.HAS-FEATURE '(L^VERB ZERO-THAT) )(*
. MOOD_ BOUND)) )
(RULE (L^S31RULE)) (LABEL (T1)) (FROM-STATE (C^))
(NAME ("cS\\tle")))
(L^S31RULE (AKO (L^SRULE)) (PRIORITY (0))
(THEN NIL (ACTIONS (^ .PROP_*))))
Figure 3a: Some PAKTUS Grammar Specifications Used for S1
(S (MAIN-VERB55 (SAY^TO-IO L^TO-IO S^ED))
(SUBJECT53
(NP (HEAD54
(UNKNOWN-WORD L^COMPANY BASE C^UNKNOWN . "BRIDGESTONE SPORTS CO"))))
(PROP11
(T1 (MAIN-VERB51 (ESTABLISH^MONOTRANS L^MONOTRANS S^ED))
(SUBJECT35 (NP (HEAD36 IT^NEUTER)))
(DO14
(NP (HEAD49 JOINT\ VENTURE^ACTIVITY) (DET50 A^DET))
(PROP30
(Z^ (MAIN-VERB47 PRODUCE^MONOTRANS)
(SUBJECT35 (NP (HEAD36 IT^NEUTER)))
(DO31
(NP (HEAD44 (CLUB^GROUP L^GROUP S^S))
(PROP32
(Z^ (MAIN-VERB39 (SHIP^MONOTRANS L^MONOTRANS S^ED))
(SUBJECT37 (NP (HEAD38 SOMEONE^SOME)))
(DO35 (NP (HEAD36 IT^NEUTER)))
(MODS40 (PP (PREP41 TO^PREP)
(PREP-OBJ33 (NP (HEAD34 JAPAN^NATION)))))))
(DESC46 (UNKNOWN-WORD L^COMMON BASE C^UNKNOWN . "GOLF"))))))
(MODS15
(PP (PREP29 WITH^PREP)
(PREP-OBJ18
(NP (HEAD25 CONCERN^BUSINESS) (DET28 A^DET)
(DESC27 LOCAL^SPACE-REL)
(CONJ19
(NP
(HEAD20 (UNKNOWN-WORD L^COMPANY BASE C^UNKNOWN . "TRADING HOUSE"))
(DET23 A^DET)
(DESC22 (JAPAN^NATION L^ADJ BASE C^IT-BE-FROM)))))))))
(MODS16
(PP (PREP17 IN^PREP)
(PREP-OBJ12 (NP (HEAD13 TAIWAN^NATION)))))))
(ADV56 ("17-NOV-89" L^TIME-DATE BASE)))
Figure 3b: Syntactic Analysis of S1
(C^CREATE (ROLES
(R^AGENT R^RECIPIENT R^INSTR R^RESULT R^PURPOSE R^MATERIAL))
(AKO (C^BEGIN))
(R^MATERIAL NIL (_ (PREP-OBJ 'FROM^PREP))))
(R^RESULT (AKO (R^EFFECT)))
(R^EFFECT (KINDSOF (R^RESULT R^EVENT)) (AKO (R^OBJECT)))
(R^OBJECT
(KINDSOF
(R^AFFECTED R^EXPERIENCER R^COMPANION R^EFFECT R^PROPOSITION
R^FOCUS R^PURPOSE R^MATERIAL R^RESISTANCE))
(_ NIL (DEFAULT DO)) (AKO (PROP-ROLE)))
Figure 4a: Some PAKTUS Generic Semantic Specifications Used for S1
(C^ASSERT
(R^AGENT53
(F53
(HEAD54 (UNKNOWN-WORD L^COMPANY BASE C^UNKNOWN . "BRIDGESTONE SPORTS CO"))))
(R^PROPOSITION11
(C^CREATE (R^INSTR35 (F35 (HEAD36 IT^NEUTER)))
(R^RESULT14
(C^ATTEMPT (HEAD49 JOINT\ VENTURE^ACTIVITY)
(R^INSTR14 @F14)
(R^COMPANION18
(C^ACT (HEAD25 CONCERN^BUSINESS)
(CONJ19
(F19
(HEAD20 (UNKNOWN-WORD L^COMPANY BASE C^UNKNOWN . "TRADING HOUSE"))))
(CONJOINER24 AND^CONJ) (R^INSTR18 @F18)))
(R^PURPOSE30
(C^CREATE (R^INSTR35 (F35 (HEAD36 IT^NEUTER)))
(R^RESULT31 (F31 (HEAD44 (CLUB^GROUP L^GROUP S^S))))))
(R^PLACE12 (F12 (HEAD13 TAIWAN^NATION)))
Figure 4b: Generic Semantic Analysis of S1
Figure 5a gives an example of a semantic mapping rule. This consists of a pattern component,
which in this case matches the semantic frame of Figure 4b, and a mapping specification (in the
"slots" component). Figure 5b shows the task-specific semantic representation that results from
applying this mapping to the generic semantic frame. The rule in Figure 5a is a refinement of one
used in the final test MUC-5 system. The final test version recognized the tie up relationship in
this sentence, but extracted only two of the three tie up entities. The new mapping rule better
illustrates system features without significantly changing the output.
The pattern component specifies a semantic structure with a C^TALK concept as its root (the
C^ASSERT concept shown in the semantic analysis of S1 is a specialization of C^TALK in
PAKTUS), and with R^AGENT and R^PROPOSITION roles. The filler of the R^AGENT role
will be bound to the pattern variable R^AGENT250. The R^PROPOSITION role must be filled
by an instance of a C^BEGIN frame (the C^CREATE concept in S1 is a specialization of
C^BEGIN), with an R^RESULT that is an instance of C^ATTEMPT (which a joint venture is). If
R^COMPANION or R^PURPOSE roles are filled in the C^ATTEMPT frame, information is
extracted from them as well (they are optional, which is not indicated in the figure, but is marked in
the actual mapping object).
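The matching behavior just described (specialization tests on concepts, variable binding on role fillers, and optional roles) can be sketched in Python. Everything below is a simplified illustration: the nested-dictionary frame format, the `AKO` fragment, the `optional` flag, and the variable name R^PURPOSE248 are assumptions, and the pattern is a cut-down version of Figure 5a rather than a transcription of it.

```python
# Hypothetical fragment of the concept hierarchy (child -> parent).
AKO = {"C^ASSERT": "C^TALK", "C^CREATE": "C^BEGIN"}

def is_a(concept, ancestor):
    """True if concept equals ancestor or specializes it through AKO links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = AKO.get(concept)
    return False

def match(pattern, frame, bindings):
    """Match a nested frame against a pattern, binding fillers to variables."""
    if not is_a(frame["concept"], pattern["concept"]):
        return None
    for role, spec in pattern.get("roles", {}).items():
        filler = frame.get("roles", {}).get(role)
        if filler is None:
            if spec.get("optional"):
                continue          # optional roles may be unfilled
            return None
        if "var" in spec:
            bindings[spec["var"]] = filler
        if "frame" in spec:
            if not isinstance(filler, dict) or match(spec["frame"], filler, bindings) is None:
                return None
    return bindings

pattern = {"concept": "C^TALK", "roles": {
    "R^AGENT": {"var": "R^AGENT250"},
    "R^PROPOSITION": {"frame": {"concept": "C^BEGIN", "roles": {
        "R^RESULT": {"frame": {"concept": "C^ATTEMPT", "roles": {
            "R^COMPANION": {"var": "R^COMPANION248", "optional": True},
            "R^PURPOSE": {"var": "R^PURPOSE248", "optional": True}}}}}}}}}

s1 = {"concept": "C^ASSERT", "roles": {
    "R^AGENT": "BRIDGESTONE SPORTS CO",
    "R^PROPOSITION": {"concept": "C^CREATE", "roles": {
        "R^RESULT": {"concept": "C^ATTEMPT", "roles": {
            "R^COMPANION": "LOCAL CONCERN"}}}}}}

print(match(pattern, s1, {}))
```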
The mapping portion of the rule specifies how to map the pattern variable bindings to a
task-specific semantic object. This rule says that the result is a tie up relationship with tie up entities
derived from the bindings of R^COMPANION248 and R^AGENT250, the joint venture taken as
the filler of the R^RESULT role of the C^BEGIN frame, etc. The type of relationship (existing,
former, etc.) is determined by the act-stage function, which computes the type from tense, aspect,
and modality registers.
(P-TIE_UP_RELATIONSHIP143
(AKO C^TALK)
(R^AGENT ((> R^AGENT250)))
(R^PROPOSITION
(P-TIE_UP_RELATIONSHIP144
(AKO ((CON250 IS C^BEGIN)))
(R^INSTR ((< R^INSTR248)))
(R^RESULT
(P-TIE_UP_RELATIONSHIP150
(AKO ((CON249 IS C^ATTEMPT)))
(R^INSTR @P-TIE_UP_RELATIONSHIP150)
(R^COMPANION ((R^COMPANION248 IS L^AGENT)))
(R^PURPOSE
(P-TIE_UP_RELATIONSHIP186
(AKO ((CON248 IS C^CREATE)))
(R^INSTR ((R^INSTR248 IS L^3RD-PERS-PRO)))
(R^RESULT ((> R^RESULT248)))))))
(R^PLACE ((R^PLACE248 IS L^LOCATION)))))
(SLOTS
(TIE_UP_RELATIONSHIP (TYPE (ACT-STAGE))
(TU-ENTITY R^COMPANION248 R^AGENT250)
(JOINT-VENTURE P-TIE_UP_RELATIONSHIP150)
(TU-ACTIVITY (INDUSTRY (I-TYPE PRODUCTION)
(PRODUCT/SERVICE R^RESULT248))
(SITE R^PLACE248)))
(ENTITY (ENTITY-RELATIONSHIP (ENTITY1 R^COMPANION248 R^AGENT250)
(ENTITY2 P-TIE_UP_RELATIONSHIP150)
(RELATIONSHIP CHILD)))))
Figure 5a: Generic-to-Task-Based Semantic Mapping Rule
Figure 5b shows the task-specific representation that the mapping rule in Figure 5a produced
when applied to the generic semantic representation of Figure 4b.
Another task-specific representation is shown in Figure 6, for the fourth sentence: THE NEW
COMPANY, BASED IN KAOHSIUNG, SOUTHERN TAIWAN, IS OWNED 75 PCT BY BRIDGESTONE SPORTS, 15 PCT BY UNION PRECISION CASTING CO. OF TAIWAN AND THE REMAINDER BY TAGA CO., A COMPANY ACTIVE IN TRADING
WITH TAIWAN, THE OFFICIALS SAID. This is shown to illustrate one of the limitations of the MUC-5
version of the system.
(SPEC35
(ENTITY-RELATIONSHIP
(SPEC37 (RELATIONSHIP "CHILD")
(ENTITY2 "JOINT VENTURE")
(ENTITY1 "LOCAL CONCERN")
(ENTITY1 "JAPANESE TRADING HOUSE")
(ENTITY1 "BRIDGESTONE SPORTS CO")))
(TU-ACTIVITY
(SPEC38 (SITE "TAIWAN")
(INDUSTRY
(SPEC39 (PRODUCT/SERVICE "GOLF CLUBS")
(I-TYPE "PRODUCTION")))))
(JOINT-VENTURE ("JOINT VENTURE"
(ENTITY-RELATIONSHIP SPEC37)))
(TU-ENTITY ("LOCAL CONCERN"
(ENTITY-RELATIONSHIP SPEC37)))
(TU-ENTITY ("JAPANESE TRADING HOUSE"
(ENTITY-RELATIONSHIP SPEC37)))
(TU-ENTITY ("BRIDGESTONE SPORTS CO"
(ENTITY-RELATIONSHIP SPEC37)))
(TYPE "EXISTING"))
Figure 5b: Task-Specific Semantic Representation of S1
(SPEC40
(ENTITY-RELATIONSHIP
(SPEC42 (ER-STATUS "CURRENT") (RELATIONSHIP "CHILD" )
(ENTITY2 "NEW COMPANY" )
(ENTITY1 "TAGA CO")
(ENTITY1 "UNION PRECISION CASTING CO" )
(ENTITY1 "REMAINDER" )
(ENTITY1 "BRIDGESTONE SPORTS")) )
(TU-ACTIVITY
(SPEC46 (SITE ("KAOHSIUNG" (APP "SOUTHERN TAIWAN")))) )
(OWNERSHIP
(SPEC43 (OWNERSHIP-% "BRIDGESTONE SPORTS") (OWNERSHIP-% 75 )
(OWNED "NEW COMPANY")) )
(OWNERSHIP
(SPEC44 (OWNERSHIP-% "UNION PRECISION CASTING CO") (OWNERSHIP-% 15)
(OWNED "NEW COMPANY")) )
(OWNERSHIP
(SPEC45 (OWNERSHIP-% "TAGA CO" )
(OWNED "NEW COMPANY")) )
(JOINT-VENTURE ("NEW COMPANY"
(ENTITY-RELATIONSHIP SPEC42)) )
(TU-ENTITY ("TAGA CO" (ENTITY-RELATIONSHIP SPEC42)) )
(TU-ENTITY ("UNION PRECISION CASTING CO" (LOC "TAIWAN" )
(ENTITY-RELATIONSHIP SPEC42)) )
(TU-ENTITY ("REMAINDER" (ENTITY-RELATIONSHIP SPEC42)) )
(TU-ENTITY ("BRIDGESTONE SPORTS" (ENTITY-RELATIONSHIP SPEC42)) )
(TYPE "EXISTING") )
Figure 6: Task-Specific Semantic Representation of S4
The complete filled templates for this article are shown in Figure 7. Some of the information,
such as ownership percentages, that was extracted as shown in Figure 6, does not appear in the
output templates. This is typical of the MUC-5 version of PAKTUS; we were unable to devote
resources sufficient to complete the output generator component (the final processing module in
Figure 1), so some information that was extracted was simply ignored by the final process.
<TEMPLATE-0592-84> :=
    DOC NR: 0592
    DOC DATE: 241189
    DOCUMENT SOURCE: "Jiji Press Ltd."
    CONTENT: <TIE_UP_RELATIONSHIP-0592-84>
    DATE TEMPLATE COMPLETED: 930820
<TIE_UP_RELATIONSHIP-0592-84> :=
    TIE-UP STATUS: EXISTING
    ENTITY: <ENTITY-0592-84>
        <ENTITY-0592-85>
        <ENTITY-0592-86>
        <ENTITY-0592-87>
        <ENTITY-0592-88>
    JOINT VENTURE CO: <ENTITY-0592-89>
    OWNERSHIP: <OWNERSHIP-0592-89>
    ACTIVITY: <ACTIVITY-0592-89>
<ENTITY-0592-84> :=
    NAME: CO
    TYPE: COMPANY
<ENTITY-0592-85> :=
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-89>
    NAME: UNION PRECISION CASTING CO
    LOCATION: TAIWAN (COUNTRY)
    TYPE: COMPANY
Figure 7: PAKTUS Template Fills for the Sample Document
<ENTITY_RELATIONSHIP-0592-89> :=
    ENTITY1: <ENTITY-0592-86>
    ENTITY2: <ENTITY-0592-89>
    REL OF ENTITY2 TO ENTITY1: CHILD
<ENTITY-0592-86> :=
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-89>
    NAME: TAGA CO
    TYPE: COMPANY
<ENTITY-0592-89> :=
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-90>
        <ENTITY_RELATIONSHIP-0592-89>
    NAME: BRIDGESTONE SPORTS TAIWAN CO
    LOCATION: KAOHSIUNG (UNKNOWN)
    TYPE: COMPANY
<ENTITY_RELATIONSHIP-0592-90> :=
    ENTITY1: <ENTITY-0592-87>
        <ENTITY-0592-88>
    ENTITY2: <ENTITY-0592-89>
    REL OF ENTITY2 TO ENTITY1: CHILD
<ENTITY-0592-87> :=
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-90>
    NAME: BRIDGESTONE SPORTS CO
    ALIASES: "BRIDGESTONE SPORTS"
    TYPE: COMPANY
<ENTITY-0592-88> :=
    NAME: TRADING HOUSE
    NATIONALITY: JAPAN (COUNTRY)
    TYPE: COMPANY
    ENTITY RELATIONSHIP: <ENTITY_RELATIONSHIP-0592-90>
<OWNERSHIP-0592-89> :=
    TOTAL-CAPITALIZATION: 20000000 TWD
    OWNED: <ENTITY-0592-89>
<ACTIVITY-0592-89> :=
    ACTIVITY-SITE: (TAIWAN (COUNTRY) - )
    INDUSTRY: <INDUSTRY-0592-90>
<INDUSTRY-0592-90> :=
    INDUSTRY-TYPE: PRODUCTION
    PRODUCT/SERVICE: (39 "GOLF CLUBS")
Figure 7 (cont.): PAKTUS Template Fills for the Sample Document
SYSTEM PERFORMANCE
Figure 8 summarizes PRC's scores for MUC-5. The MUC-5 version of the system was
incomplete at the time of the final testing . All modules were functional, but many task-specific
details were missing . The system was ready for testing on the development corpus only tw o
weeks prior to the final test. Performance was improving rapidly — about one point per day . The
main limiting factors for PRC were time and availability of people for development. We directed
most of our energy toward the basic engineering, such as generating the template formats, which
left little time to address the task-specific linguistic requirements . This resulted in severe
undergeneration, which accounted for most of the errors (73 percent undergeneration versus 8 3
percent overall error rate) .
Development Effort
Figure 9 enumerates our activities and level of effort in connection with the MUC-5 task, as
well as parallel non-Tipster-specific activities in developing our system. Our total development
effort in customizing our system for the MUC-5 testing was 2.1 months, with another month for
testing, file management, and other non-developmental activities.
SLOT                                POS    ACT | ERR  UND  OVG  SUB
-----------------------------------------------+-------------------
<template> subtotals                348    237 |  59   48   23    9
<tie-up-relationship> subtotals    1806    992 |  78   63   33   29
<entity> subtotals                 4146   1892 |  78   66   25   30
<entity-relationship> subtotals    2015    968 |  82   66   30   39
<activity> subtotals               1112    141 |  94   92   40   17
<industry> subtotals               1185    336 |  92   82   37   48
<facility> subtotals                340      7 |  98   98    0   21
<person> subtotals                  372      2 | 100  100  100    *
<ownership> subtotals               526     43 |  96   93   14   40
<revenue> subtotals                  57      0 | 100  100    *    *
<time> subtotals                    153      1 |  99   99    0    0
-----------------------------------------------+-------------------
ALL OBJECTS                       12060   4619 |  83   73   29   31
MATCHED ONLY                       4330   3151 |  51   34    9   21

                  RECALL  PRECISION
ALL OBJECTS           19         49
MATCHED ONLY          52         72
TEXT FILTERING        67         91

                P&R    2P&R    P&2R
F-MEASURES    27.06   36.95   21.34
Figure 8: PRC Score Summary
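For readers unfamiliar with the three F-measure columns, the following Python sketch shows the standard weighted F-measure, F = (beta^2 + 1)PR / (beta^2 P + R), with beta = 1 for P&R, beta = 0.5 for 2P&R (precision weighted more heavily), and beta = 2 for P&2R (recall weighted more heavily). This is the general formula, offered as an aid to reading Figure 8 and not taken from the scoring software. Because it is applied here to the rounded ALL OBJECTS recall and precision (19 and 49), the results come out slightly above the reported 27.06, 36.95, and 21.34, which were presumably computed from unrounded values.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted F-measure: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Rounded ALL OBJECTS scores from Figure 8.
precision, recall = 49, 19
for label, beta in (("P&R", 1.0), ("2P&R", 0.5), ("P&2R", 2.0)):
    print(label, round(f_measure(precision, recall, beta), 2))
```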
Non-linguistic engineering                     1.1 months
Task-specific linguistics with Tipster data    1.0
Documentation, publication prep.               0.3
Testing, scoring                               0.5
File management, communications                0.2
Non-Tipster-specific system development        1.9
TOTAL                                          5.0 months
Figure 9. Breakdown of Development Effort
More than half of the development effort involved non-linguistic engineering of the system for
MUC-5 requirements, such as the output template format generation. We also needed to convert
the extraction components of the system to accommodate the object-oriented nature of the MUC-5
templates. This left us with one month of effort for task-specific linguistic development. The
specific changes and additions to the PAKTUS knowledge bases for MUC-5 are enumerated in
Figure 10. In parallel with the MUC-5 activity, we devoted 1.9 months of effort to generic
development of our system, which may have had some impact on MUC-5 performance.
Limiting Factors
The two areas that could significantly improve performance with modest effort are the
definition of task-specific semantic mappings and the output generator. These are highlighted in
Figure 11. Very little was done here, however, due to limited time and resources. Only 147 of the
semantic mappings were defined. We estimate that about 1,000 would be needed to map the
generic linguistic data extracted by PAKTUS into the task-specific representations. That would
have required about another month of effort. Within the output generator module, which produces
the final output template fills, many functions were incomplete or entirely absent. These are not
difficult to implement, but do require time and effort, which were not available.
Knowledge Type        Core System   New/Mod for MUC-5
Words (Stems)              14,816                 387
Tokens                     18,928                 811
Compounds                     343                 110
Idioms                         88                   5
Verb categories                16                   0
Nominal categories            404                   0
Grammar Arcs                  273                   9
Grammar States                 90                   5
Concepts                      386                   1
Semantic Mappings               0                 147
Domain Template                 0                  11
Figure 10. Additions/Modifications to PAKTUS Knowledge Bases for MUC-5
Figure 11: What Would Improve Performance Quickly?
System Training
The PAKTUS modules were trained on varying parts of the MUC-5 corpus. The
preprocessing and lexical analysis modules were trained from concordances based on about 1,000
documents. This included the analysis of corporation names for bracketing. The syntactic and
semantic analysis modules were largely unchanged, as noted above. The little tailoring that was
done was based on a subset of the 86 documents in the dry run, part 1 set. None of this training
involved analysis of any complete text, since these modules operate only at the sentence level or
below. The careful analysis required for discourse analysis, including coreference resolution, was
performed on only two documents: numbers 0099 and 0102. These were selected because 0099
contained multiple tie up relationships, and the other contained a single tie up relationship with
some complex coreference phenomena. This combination seemed to maximize coverage of the
phenomena the system had to deal with, within our very limited resource constraints.
Reusability of the System
Almost all of PAKTUS is generic and can be applied to other applications. All of its processes
are at least partly generic, as illustrated in Figure 12. They operate on a set of object-oriented
knowledge bases, some of which are generic (common English grammar, lexicon, and concept
frames) and some of which are task-specific (input and output templates, semantic mappings, and
topic patterns). Even within the task-specific knowledge bases, however, the representation
schemes are generic, and we have tools that facilitate building them.
The primary tasks in applying PAKTUS to a new domain, or improving its performance in an
existing domain, are semantic mapping specification and output generator function development,
both of which are relatively easy (compared to changing the grammar, for example).
Two other tasks that must be done, but only once for each new domain, are to specify the input
document formats and to identify the output specifications. These are template-driven in
PAKTUS. For MUC-5 we converted the BNF specifications supplied by the Government to
template format, which is quite simple. We then added a function for each template slot to gather
information from our generic discourse data structures. Additional information was included
regarding output formats, default fills, etc. These templates are also used by a tool for building
semantic mappings.
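The template-driven arrangement, with one gathering function per slot plus a default fill, might look like the following Python sketch. The slot names are borrowed from Figure 7, but the discourse structure, its field names, the gathering functions, and the default fill shown here are all invented for illustration.

```python
# One (slot name, gathering function, default fill) entry per output slot.
# The default of EXISTING is a made-up example, not a documented PAKTUS default.
TIE_UP_TEMPLATE = [
    ("TIE-UP STATUS", lambda d: d.get("stage"), "EXISTING"),
    ("ENTITY", lambda d: d.get("entities"), None),
    ("ACTIVITY", lambda d: d.get("activity"), None),
]

def fill_template(template, discourse):
    """Apply each slot's gathering function; fall back to its default fill."""
    filled = {}
    for slot, gather, default in template:
        value = gather(discourse)
        if value is None:
            value = default
        if value is not None:
            filled[slot] = value   # slots with nothing to report are omitted
    return filled

# Hypothetical discourse structure for one tie-up relationship.
discourse = {"entities": ["BRIDGESTONE SPORTS CO", "TAGA CO"], "stage": "EXISTING"}
print(fill_template(TIE_UP_TEMPLATE, discourse))
```

Under this arrangement, a missing gathering function simply leaves its slot unfilled, which is consistent with the observation that extracted information could be lost when the output generator was incomplete.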
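[Figure 12 is a shaded block diagram that did not survive text extraction. Its legend distinguishes Generic from Partly Generic components; legible module labels include Preprocessor, Reference Resolution, and Output Generator.]
Figure 12: Generic, Reusable Components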
Lessons Learned from MUC-5
We confirmed our belief that PAKTUS is robust and adaptable. The more complex
components (syntactic, semantic, and discourse analysis modules) are stable and competent
enough to apply the system to different domains and produce useful results, by adding
domain-specific knowledge (lexicon and semantic mappings). We were once again pleased to learn that it
was not necessary to manually analyze much of the corpus in detail. This was done for only two
documents for MUC-5. The full development corpus was used only to customize the
preprocessing and lexical analysis components.
A task as complex as MUC-5 requires substantial investment in non-linguistic engineering
before the linguistic capabilities of a system can be applied. This detracts from linguistic
development that might otherwise have been done, and hides much of the linguistic competence of
the system if the engineering is incomplete, as in our case (e.g., correct information was clearly
obtained, as in Figure 6, but not reported due to an incomplete output function). We recognize the
need for such engineering if useful applications are to be achieved, but hope that this process is
standardized quickly so that it does not need to be completely reimplemented for each new
application.
REFERENCE
[1] Hobbs, J. "The Generic Information Extraction System", this volume.
