PRC Inc:
DESCRIPTION OF THE PAKTUS SYSTE M
USED FOR MUC-4
Bruce Loatman
PRC Inc.
Technology Division
1500 PRC Drive
McLean, VA 22102
loatman_bruce @po .gis .prc.com
BACKGROUND
The PRC Adaptive Knowledge-based Text Understanding System (PAKTUS) has been unde r
development as an Independent Research and Development project at PRC since 1984 . It includes
a core English lexicon and grammar, a concept network, processes for applying these to lexical,
syntactic, semantic, and discourse analysis, and tools that support the adaptation of the generic
core to new domains, primarily by acquiring sublanguage and domain-specific lexicon and
conceptual topic patterns of interest. The lexical, syntactic, and semantic analysis components
were completed before MUC-4 and required little adaptation . The discourse analysis component is
new and was completed in the course of applying the system to MUC-4, although it is generic .
The overall system is described in [1] . The present description concentrates on discourse analysis.
APPROACH
The overall structure and
operation of PAKTUS are shown in
Figure 1 . Processing proceed s
mostly sequentially through
preprocessing (the decomposition of
the text stream into individual
messages, message segments ,sentences, and words), lexica l
analysis (morphological analysis and
mapping of words into entries in the
lexicon which contain information
about their syntax and semantics) ,
syntactic analysis (using a parser an d
grammar), semantic analysis
(mapping the syntactic structures into
conceptual frames with roles fille d
by phrase constituents), discourse
analysis (identification of discourse
topics and noun phrase reference), Figure 1. PAKTUS Architecture
Lexicon &
Morph Rules
Internal
Reps & Links
Syntactic
Structures
Concept
	
Semantic
	
Semantic
Frames
	
Analysis
	
Structures
Conceptual Discourse
Patterns
	
Analysis
Filled
Templates
Doc Template
and finally extraction of information from discourse structures into domain-specific templates. The
primary exception to sequential control flow is the interaction between the syntactic and semantic
components at the clause and noun phrase level . This results in essentially deterministic parsing in
253
linear time: the first syntactico-semantically successful parse of a sentence is accepted ; others are
never generated. Moreover, parse time is restricted, and the longest substring, along with any
initial substring successfully parsed, is returned when parse time is exhausted .
Figure 2 shows the discourse analysis module, which was first used for MUC-4, and it s
interaction with the extraction module . The discourse module is generic for expository text, suc h
as news reports. In figure 2, only the conceptual patterns and filter are MUC-4-specific, and these
are part of the extraction component, not discourse analysis .
	Semantic Structures	 	 Common
°"°`YLO -11D
	
Role Filler
& Times?
Figure 2. Discourse Analysis and Extraction Details
The discourse module operates on the semantic structures (case frames) produced by the
semantic analysis module. It builds topic structures consisting of sets of case frames that have
common topic objects and times. Topic objects are defined as fillers of certain case roles,
specifically, 16 of the total 40 case roles used in PAKTUS, as illustrated in Figure 3. The most
notable case role that is excluded as a topic object is the Agent. This is because topic structures are
meant to represent information about entities that are being affected or focused upon in some way ,
whereas a single Agent can operate on several different entities . An example below, from the
MUC-4 corpus, will clarify the importance of excluding the Agent as a topic role .
A side effect of comparing topic objects for commonality, is that some noun phrases (NPs) will
be unified (i .e., considered by discourse analysis to have the same referent) . It is possible
(actually quite common) for two NPs to be considered common topic objects, but not be unifie d
(e.g., in one MUC-4 passage, PAKTUS considers "crime" and "killing" to have topi c
commonality, but does not unify them since "crime" is more general) .
After all case frames have been assigned to topic structures, domain-specific conceptua l
patterns are compared to the case frames, topic-by-topic, binding pattern variables to informatio n
that is extracted and put into event reports whose format is specified by a domain-specific template .
254
The NP unifications assist in this process, effectively consolidating information that may be widel y
dispersed in the text. Note that a single topic structure may contain information on multiple events.
The final step in the extraction process is to filter and merge the event reports (e.g., for MUC-4,
ignoring events that are too old, and merging events that can not be distinguished in time o r
location), and format the results to the output file.
Figure 3. Case Roles Determining Topic Objects in PAKTU S
EXAMPLE OF MUC-4 DOCUMENT PROCESSIN G
Message number 48 from the "test2" set, which is reprinted in Appendix F, will be used to
illustrate PAKTUS's operation for MUC-4. PAKTUS processes text sequentially, first strippin g
off the document header, then identifying sentences, which are processed syntactico-semantically
one at a time, after which all the results are passed to the discourse component .
Figure 4 shows the raw, unprocessed text of the first sentence (Si), followed by its lexica l
analysis. Each word has one or more senses, represented as a root symbol, which is generally the
concatenation of the English token, the "^" character, and the PAKTUS lexical category (e.g.,
"Condemn^Monotrans"), or as a simple structure involving a root, lexical category, inflectional
mark, and sometimes a conceptual derivation (e .g. the structure "(Condemn^Monotrans L^Effect-
mark Base C^It-got)" represents the adjective sense of "condemned") . For each word, all senses
in the PAKTUS lexicon are fetched or derived at this time ; disambiguation is generally delayed
until the syntactic and semantic phases .
R-A GEN T
R" INST RDOER
R'FOCUS
\R PURPOS E
R -MAT ER IA I .
R -ATTRIBUTE
R"OPPONENT -
E <R"OPPOSITIO N
PROP-ROLE
ER R"ORIGINR"SOURCE <R"DONO
R
R"DES T
R"RECIPIENT
R-WIZE N
R"TIME R FINISH
R'DURATION
R"WHEN-ADV
R"MANNER
R"BENEFICIARY
R"METHOD
R'CAUS E
R"ACCOMPLIC E
R"POSSESSED
R -PRECONDITIO N
R - TOO L
R" PLA C
PUR OSE
R"CO-EVEN T
R"LOC
. . ,
R"GOAL
R"AFFECTED
R"EXPERIENC ER
'12 COMPANION
R"RESUL T
Rf EVEN TR'OBJEC T REFFECT
R"PROPOSI T
R EXTEN T
Ft-PLACE R"PATH
<R"MEDWMR"CO-CON
MODAL-ROL E
255
*** raw sentence:
SALVADORAN PRESIDENT-ELECT ALFREDO CRISTIANI CONDEMNED THE
TERRORIST
KILLING OF ATTORNEY GENERAL ROBERTO GARCIA ALVARADO AN D
ACCUSED THE
FARABUNDO MARTI NATIONAL LIBERATION FRONT (FMLN) OF TH E
CRIME.
*** lexical analysis :
(((EL\ SALVADOR^NATION L^INHABITANT BASE C^BE-FROM)
(EL\ SALVADOR^NATION LAADJ BASE C^IT-BE-FROM) )
((PRESIDENT^SPECIALIST L^SPECIALIST BASE C^BE-LIKE) )
(ALFREDOAMALE) (CRISTIANI^PERSON)
((CONDEMNAMONOTRANS L^EFFECT-MARK BASE C^IT-GOT)
(CONDEMN^MONOTRANS LAMONOTRANS SAED) )
(THEADET) (TERRORIST^PERSON)
((KILL^MONOTRANS LAMONOTRANS S^ING)
(KILLAMONOTRANS LAABSTRACT BASE C^ACT-OF)
(KILLAMONOTRANS LAADJ BASE CADOES) )
(OF^PARTICLE OF^PREP) (ATTORNEY\ GENERAL^SPECIALIST)
(ROBERTOAMALE) (GARCIAAPERSON) (ALVARADOAPERSON)
(AND^CONJ)
((ACCUSEAMONOTRANS L^EFFECT-MARK BASE C^IT-GOT )
(ACCUSEAMONOTRANS LAMONOTRANS S^ED) )
(THE^DET)
(FARABUNDO\ MARTI\ NATIONAL\ LIBERATION\ FRONTATERRORIST-
GROUP)
(OF^PARTICLE OFAPREP) (THEADET) (CRIME^ACTIVITY) )
Figure 4. Lexical Analysis of the First Sentence of Test2 Document Number 48
The syntactic and conceptual analyses of this sentence are shown in Figure S . Note that
conceptual structures are produced for some nouns (notably here, "killing"), not just for verbs.
These conceptual structures are essential to the overall task of information extraction ; if no case
frames are produced for a sentence (i .e., the syntactico-semantic analysis failed), it is completely
ignored by the discourse analysis and extraction processes .
The syntactic analysis produced for Si is a configuration of syntactic registers (the main one s
are shown in the figure) and register fillers . In this case, the main clause has a Main-Ver b
(condemned), Subject (the NP whose Head is Cristiani), and Direct Object (the terrorist killin g
NP). The conjoined clause ("and accused the FMLN . . .") was correctly parsed, and its gap (no
explicit Subject) has been filled in with the Subject of the main clause .
PAKTUS produced four case frames for S1, one for each of the two clauses, one for the
"killing" NP and one for the "crime" NP .
This conceptual analysis will enable the discourse analysis module to determine that "the crime "
refers to "the killing" because 1) both are topic objects (as fillers of Focus roles), 2) "crime's"
concept CAAGGR is a generalization of "the killing's" concept of CAKILL in the PAKTUS
conceptual network, and 3) "crime" appears in a subordinate clause (all three conditions are
required for this reference resolution).
Determining that the crime refers to the killing here is important ; it enables PAKTUS to
identify the FMLN as the accused perpetrator of Alvarado's killing . It can also determine that the
accusation is made by an authority (president-elect Cristiani), thanks to the gap filling by th e
syntactic analysis.
256
The second sentence (S2 )
contains the phrase "Merino also
declared that the death of the
Attorney General . . ." whic h
PAKTUS recognizes as referring to
the killing in Si, so this phrase is
consolidated into the same topi c
structure . Merino is not, however,
a topic object, since he is in th e
Agent role, which is not a topi c
role . This is important, becaus e
later in the text there is a report of a
guerrilla attack on Merino's home.
Merino is a topic object there . That
he is not a topic object in S2 enable s
PAKTUS to recognize this later
passage as a different topic .
The complete filled templates for
this article are shown in Figure 6
(the ordering of the templates i s
immaterial) . They contain almost all
of the information that should have
been extracted . The only missin g
information is "no injury" to a
bodyguard in template 2 . Also, the
city in which one incident (fill 2 )
occurred is incorrectly reported, an d
a "terrorist" perpetrator is reported
redundantly in fill 2 .
n
70
Ta
	
ox
0
	
El
	
o
r w n
Z
	
N
	
y
o
	
(
	
n
	
C
I
	
W
	
0420
n
	
to0
	
T
	
I
	
tl~~
	
by t /AN ''A
71
	
wyjt'
	
w
T T
g
	
06 Z'sf M w
	
`.'
	
" r
r
	
ry
	
~ rw^ww w
C M rn
	
a
	
.o
l1
	
b
	
0 ~ oI i
n
	
I r ,0 o
7rJ xi Ma
	
b
	
o rG M y
N4 ls)
r 0
	
'40
	
co
	
rn PA N
N I rpiI
	
D.i
~`~ zg "
w NT
.4
	
nn{~~1K: L4y xcoMlico :4tK
	
-f
	
rnrnr''D J oco~ 3co
	
l I I
	
aan
	
Z Z~o o,ss
i x
	
I A Ms ~~orwq,Eo
	
oo-
	
y M
{,~ oao ~
p 9 WC Z M vo
yA
	
rH
	
Z~I o
Z 9
d to
	
e tD+l ~t~yr Z
	
.- _,
	
I .psi
	
!I
vid
	
n ;„04 'I t
	
m l lI
	
yip
E
	
t
	
z
	
> r'
~
	
.81f.m i
	
~t n
	
v ~I
	
%A
	
xi NJ rA
	
..z
	
rcon jj
	
a ~
	
•- »
r
~-a
y
	
a~~x = 1 W, ., ..
C~,d ~
!
4%
^H 9 Zt7~
	
OOr
	
N ty wy
9 Z~
	
Z.Mr'~
	
G
I y f
!
li `
	
~ ..~p N
C row
	
G
ti
	
-• ~>
	
rrr
	
Pa
r~
	
r2
	
~n
4l)b g ril
	
1 2 5_~ ..
	
r ~
VC
	
t
	
.4
0 •4
	
r.
	
0PA, 9 .. pl
~ .i Z
	
n
b
9
OZ
Z
Z Z '-I
~'t1dpCO
OM v
in n
Myrs:
WM r
y ..
o
Figure 5. Syntactic and Conceptual Analysis of S i
One item missing from template
1, compared to the official answer
key, is the FMLN as the
perpetrating organization. Nowhere
in the text is it stated or implied that
the FMLN attacked Merino's home ,
however. Much of the MUC-4-
specific knowledge that would b e
encoded in conceptual patterns t o
identify information in topi c
structures to be extracted in th e
template fills, was not entered int o
PAKTUS, due to our developmen t
time and effort limits . The fill s
shown include some informatio n
(underlined in the figure) derive d
with minor enhancements that were
made after the official test run for
MUC-4, specifically, the addition o f
three conceptual patterns, correction of a minor error in the output specification of another one, an d
correcting a lexicon entry to mark "vehicle's" type as "transport vehicle . "
25 7
S
	
. MES -GE:
	
D TST -MUC
1. MESSAGE: TEMPLATE 12. INCIDENT: DATE
14 APR 893. INCIDENT: LOCATION EL SALVADOR : SAN SALVADOR (CITY)
4. INCIDENT: TYPE BOMBING
5. INCIDENT: STAGE OF EXECUTION ACCOMPLISHED
6. INCIDENT: INSTRUMENT ID EXPLOSIVES'7. INCIDENT
: INSTRUMENT TYPE EXPLOSIVE: •EXPLOSIVES'8. PERP: INCIDENT CATEGORY TERRORIST ACT
9. PER': INDIVIDUAL ID GUERRILLAS'
10. PERP: ORGANIZATION ID
11. PERP: ORGANIZATION CONFIDENCE
12. PHYS TGT: ID MERINO'S HOME'13. PHYS TGT: TYPE CIVILIAN RESIDENCE: 'MERINO'S HOME'
14. PHYS TGT: NUMBER 1:
	
MERINO'S HOME'
15. PHYS TGT: FOREIGN NATION16. PHYS TOT: EFFECT OF INCIDENT
17. PHYS TGT: TOTAL NUMBER18. HUM TOT: NAME
19. HUM TGT: DESCRIPTION VICE PRESIDENT'S CHTT,TZRFI'
SEVEN CHILDREN•
'15-YEAR-OLD NIECE'CIVILIAN: 15-YEAR-OLD NIECE'
CIVILIAN: 'SEVEN CHILDREN'CIVILIAN: VICE PRESIDENT'S CHILDREN'
1: 15-YEAR-OLD NIECE•7: SEVEN CHILDREN'
4: VICE PRESIDENT'S CHILDREN'
22.HUM TGT: FOREIGN NATION
23.HUM TGT: EFFECT OF INCIDENT
	
INJURY: '15-YEAR-OLD NIECE'
24.HUM IGT: TOTAL NUMBER
----------------------------------- -
-
---------------------------------------------- -
20.HUM TOT: TYPE
21.HUM TGT: NUMBER
MESSAGE: ID
	
TST2-MUC4-0048
MESSAGE: TEMPLATEINCIDENT: DATE
INCIDENT: LOCATIONINCIDENT: TYPE
INCIDENT: STAGE OF EXECUTIONINCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORYPERP: INDIVIDUAL ID
0.1
.2.
3.
4.
5.6.
7.8.
9.
10. PERP: ORGANIZATION ID
11. PERP: ORGANIZATION CONFIDENCE(FMLN)'
12. PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENTPHYS TGT: TOTAL NUMBER
HUM TGT: NAMEHUM TGT: DESCRIPTION
20.HUM TOT: TYPE
21.HUM TGT: NUMBER
22.HUM TUT: FOREIGN NATION23.HUM TGT: EFFECT OF INCIDENT
24.HUM TOT: TOTAL NUMBER
2- 19 APR 89
EL SALVADOR: EL SALVADOR (CITY)
BOMBINGACCOMPLISHE
D
BOMB: 'BOMB'TERRORIST AC
T~SALVADORAN URBAN GUERRILLAS'
~GUERRILLA'
~FARABUNDO MARTI NATIONAL LIBERATION FRONT (FMLN) 'SUSPECTED OR ACCUSED BY AUTHORITIES : •FARABUNDO MARTI NATIONAL LIBERATION FRONT
•VEHICLE'TRANSPORTVRITCLE: 'VEHICLE'
1: VEHICLE•
SOME DAMAGE: 'VEHICLE'
ROBERTO GARCIA ALVARADO'
~ATIORNEY GENERAL' : 'ROBERTO GARCIA ALVARADO''
TWOAODYGUAROS'
~GARCIA ALVARADO'S DRIVER'GOVERNMENT OFFICIAL: *ROBERTO GARCIA ALVARADO'
SECURITY GUARD: •TWO BODYGUARDS'CIVILIAN: GARCIA ALVARADO'S DRIVER'
1: ROBERTO GARCIA ALVARADO"
2: •'I BODYGUARDS'1: *GARCIA ALVARADO'S DRIVER'
DEATH : 'ROBERTO GARCIA ALVARADO*
NO INJURY: *GARCIA ALVARADO'S DRIVER"
13.
14.
15.16.
17.
18.19.
Figure 6: Template Fills for Test2 Report 4 8
REFERENCE
[1] Loatman, B, "Description of the PAKTUS System Used for MUC-3", Proceedings of the 3rd
Message Understanding Conference, San Mateo, CA: Morgan Kaufmann, 1991 .
25 8
