ADVANCED DECISION SYSTEMS:
DESCRIPTION OF THE CODEX SYSTEM AS USED FOR MUC-3
Laura Blumer Balcom
Richard M. Tong
Advanced Decision Systems
1500 Plymouth Street
Mountain View, California 94043
lbalcom@ads.com
rtong@ads.com
(415)-960-7300
BACKGROUND
ADS has been developing an approach to text processing, called CODEX (COntext directed Data EXtraction),
that couples a concept-based, probabilistic keyword pattern matcher, called RUBRIC, with a probabilistic,
generalized graph composition chart parser, called CAUCUS. We configure these two key technologies together for a
variety of natural language sorting and gisting applications to provide greater depth of analysis (higher precision) than
keyword-based techniques alone, as well as higher throughput and greater breadth of coverage than parsing techniques
alone.
In a typical text data extraction application, a complete syntactic, semantic, and pragmatic analysis of relevant
text segments is required in order to achieve the precision necessary to reduce the level of human interaction required
for reliable system performance. This is the role of CAUCUS. Unfortunately, both the knowledge base development
and the computational costs are currently too high to apply this technique indiscriminately to all of the text entering a
typical data extraction system, where many documents may contain no relevant information, and potentially large
segments of relevant texts are also uninteresting. Coupling the keyword pattern matcher with the parser appropriately
minimizes the size of the required parsing knowledge bases as well as the amount of text that needs to be interpreted in
detail. In future research we expect to show that RUBRIC can improve CAUCUS' performance even further by altering
a priori confidences in various choices considered by the parser, based on the most probable concepts instantiated by
the keyword processing.
Prior to MUC-3 and apart from the development of RUBRIC for its original IR function, approximately five
staff years have gone into developing the CODEX system and CAUCUS knowledge bases used in MUC-3. A prototype
implementation of CODEX has been demonstrated in a military message domain, and an operational version of
that system has been partially deployed. The operational system, written in C to run efficiently on a Mac II, as a
run-time system only, could not be used for MUC-3 because it was not finished and because it was not designed to handle
the knowledge engineering tasks. Thus, the MUC-3 system evolved from the prototype implementation of the military
message handling system. Originally written in Allegro Common Lisp to run on a Sun3 workstation, the CAUCUS
module of this system was designed to accommodate knowledge engineering and NLP research, rather than processing
efficiency. Unfortunately, its inefficient use of memory had to be remedied in order to compile a 10,000 word lexicon
and parse an average MUC-3 sentence within a reasonable timeframe. Though we did eventually solve this problem,
we did not do so in time to get the reengineered CAUCUS to produce output for the official MUC-3 testing. Our MUC-3
results thus reflect only the output of a Profiler configured to find relevant text for the parser to analyze. It is
important to note that the output generated for MUC-3 official testing strongly reflects our Profiler/Analyzer strategy; we
could have extracted more template slot fillers from the Profiler output than we did, but we did not because we intended
to have the Analyzer produce these slot fillers.
SYSTEM ARCHITECTURE
Figure 1: CODEX - Status for MUC3 Phase 2
The CODEX architecture is depicted in Figure 1. This architecture separates linguistic and conceptual domain
knowledge (in the Profiler and Analyzer) from the particular information requirements of an application (in the
Controller) so that these knowledge bases can be used as interchangeable building blocks, reducing the incremental cost
of adding new domains, languages, and applications. We estimate that at least eighty percent of our lexicon development
time on MUC-3 was spent creating lexical entries that are generally applicable to any NL parsing problem. Only
the proper nouns do not have general applicability.
The Controller is an expert system that knows about document formats and the data templates to be filled. It
transforms and marks up the input stream into a canonical format for further processing, extracts data from formatted
fields that do not require language analysis, sends the text to the Profiler for keyword analysis, extracts relevant
information from the concept profile, sends relevant fragments to the Analyzer, extracts relevant information from the
text fragment interpretation, and puts the extracted information into the required output format.
The Profiler, based on ADS' RUBRIC information retrieval technology, takes the complete text of a document
and returns a profile indicating the Profiler's confidence in the presence of each relevant concept, as well as the
location of textual keyword evidence for the concept.
The Analyzer, which was disabled for official MUC-3 testing, performs detailed analysis of text to a depth
required to completely disambiguate and interpret it within the bounds of a specific domain and application. In its
current MUC-3 configuration, input to the Analyzer would have been a sentence to be analyzed had CAUCUS processing
been turned on. As we have not yet developed any linguistic modules for inter-sentential analysis, we did expect this
omission to have an impact on our MUC-3 scores. CAUCUS is designed so that it can also take as input a set of
contextual constraints and hypotheses about the content of the input text as determined by the Profiler and Controller. We
plan to use this feature to test the idea that CAUCUS can be made to produce a more accurate analysis in a shorter
time by using this additional input.
[Figure 1 contents]
Parser/Analyzer (CAUCUS), with KBs for:
• Extensive grammar (but still has some missing elements that would have affected TST2 results)
• Core semantic network (probably adequate for good TST2 results)
• ~10,000 words, most with limited semantics
Disabled for TST2 because of unanticipated problems in engineering the scale-up to a large lexicon.
Profiler (RUBRIC), with KBs for:
• Murder
• Bombing
• Kidnapping
• Hijacking
• Arson
• Robbery
• Attack
Controller, with Templates
Only text fragments that require depth analysis for the needs of the application are passed to the Analyzer, as
the deeper analysis is necessarily slower than the more shallow analysis. ADS' Analyzer, called CAUCUS, is only
partially implemented to date. Currently it consists of a parser that does both syntactic and semantic analysis. The
complete CAUCUS concept also has various asynchronous pragmatic processes such as discourse focus tracking and
plausibility analysis interacting with the parser through a global chart. The syntactic and semantic knowledge bases
for the parser are declarative and modular, making it possible to interchange domain-specific modules and opening up
the possibility of automatic and interactive incremental knowledge acquisition. The modules work together under a
best-first chart parsing strategy that optimizes speed without sacrificing recovery from unexpected input.
CAUCUS is based on the PATR family of graph unification chart parsers. It can be configured to behave like
a PATR parser, using graph unification and a top-down, left-to-right parsing strategy to find all possible parses of an
input. CAUCUS uses PATR-like specifications for rules, lexical entries, and templates, so that a PATR grammar and
lexicon could be implemented in CAUCUS in a straightforward manner. CAUCUS' extensions to PATR thus retain
the modular, declarative representation of linguistic knowledge as feature sets, but they provide for a great deal of
flexibility for exploring ways of improving the parser's scalability and robustness in the face of apparent ambiguities and
unexpected input.
[Figure 2 diagram labels]
• Text
• Lexical Analyzer
• Static KBs: grammar, lexicon, taxonomy, domain model
• Data Accessor
• Entries
• Prioritized Agenda (Tasks)
• Extend Edges
• Match Edges
• Chart: blackboard of phrase analyses and numeric confidences
• Interpretation
• Pragmatic/Application Reasoning Modules (not yet implemented): context model, reference resolution,
  plausibility analysis, expectations of context, other domain-specific analysis
Figure 2: CAUCUS Architecture
CAUCUS' architecture is depicted in Figure 2. As in the PATR parsers, CAUCUS has a chart of edges, which
contain directed graph representations of the phrase analyses that they represent. Also like the PATR parsers, edges
can be extended and/or matched to create longer edges. Unlike PATR parsers, though, instantiation of extend and
match tasks is decoupled from their execution, allowing the strategy for task selection to be determined by a separate
control function. This control function manipulates the placement of tasks on the prioritized agenda by giving them
numeric priorities reflecting the likelihood that executing the task next will result in reaching the correct parse in the
shortest time. Also unlike PATR parsers, chart edges have numeric confidences reflecting the likelihood that the
associated edge is part of the correct parse of the input string. These confidences come from (1) the distribution of
grammatical constructs in a representative corpus of text, (2) the degree to which the input matches the structure generated
by the combination of rules and lexical entries combined by the edge, and (3) the degree to which pragmatic processes
confirm the phrase meaning as composed by the edge. As this last source of numeric confidences might imply,
another significant difference between CAUCUS and the PATR parsers is the addition of pragmatics tasks to the parsing
agenda. Finally, a unique feature of the CAUCUS architecture is the generalization of graph unification to allow for
non-Boolean and non-string-matching composition functions. We call this generalization of Unification Grammar
"Generalized Composition Grammar" or "GCG."
To date, we have implemented composition functions in CAUCUS for nodes in a class lattice and nodes
representing the semantics of conjunctions, as well as the usual string match of straight graph unification. In addition to
these, we plan to implement composition functions that perform reference resolution, spatial and temporal reasoning,
part-whole reasoning, reasoning about measures, and other such special purpose semantic and pragmatic functions
which do not translate well to traditional feature structure unification. Some of these composition functions will
certainly yield non-Boolean results on the question of whether two nodes are composable. In addition to the node
composition functions, we also plan to implement a probabilistic feature set composition function, where the degree of
match between a top-down proposed structure and its constituents composed from the bottom up is inversely
proportional to the probability that constraints violated by a given input structure tend to be violated over a representative
corpus of text.
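As a rough illustration of what a composition function over class nodes might look like, the following Python sketch composes two taxonomy classes by returning the more specific one when one subsumes the other. The taxonomy, the function names, and the use of a simple tree in place of a full class lattice are all assumptions made for this example; none of them come from the CAUCUS knowledge bases.

```python
# Hypothetical mini-taxonomy (child -> parent); a tree-shaped stand-in for a class lattice.
PARENT = {
    "embassy": "building",
    "building": "physical-object",
    "bus": "vehicle",
    "vehicle": "physical-object",
    "physical-object": "thing",
}


def ancestors(node):
    """Return the node followed by all of its ancestors, most specific first."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def compose_classes(a, b):
    """Compose two class nodes.

    Succeeds with the more specific class when one class subsumes the other;
    returns None when the classes are incompatible.
    """
    if b in ancestors(a):
        return a
    if a in ancestors(b):
        return b
    return None


print(compose_classes("bus", "vehicle"))              # bus
print(compose_classes("physical-object", "embassy"))  # embassy
print(compose_classes("bus", "embassy"))              # None
```

This version still gives a yes/no answer. A non-Boolean variant of the kind described above would return a degree of compatibility rather than None.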
CAUCUS couples a semantic taxonomy with a declarative grammar and lexicon. The grammar/lexicon
specification is similar to Lexical Functional Grammar in that compiled entries are used by the parser to build distinct
syntactic constituent and functional structures for an input text segment, with the mapping between the constituent and
functional structure specified by the lexical entries. In addition, CAUCUS simultaneously composes a semantic
functional structure based on the semantic functional structure and selectional restrictions specified in the lexical entries
and semantic taxonomy. This knowledge base structure was designed for maximum portability to new discourse
domains, languages, and applications. For portability to new discourse domains, we maintain a core grammar, lexicon,
and taxonomy, to which we add domain-specific language constructs and concepts. Most of the lexical entries
developed for MUC-3 are generally applicable to any parsing problem and have been added to the core.
CAUCUS' probabilistic parsing strategy is similar to deterministic parsers in that the parser is directed to find
the most probable solution first, but it differs in two critical ways. First, probabilistic constraint relaxation is built into
a function that determines the degree of match between the input and a hypothesized interpretation structure. The
degree of match is then rolled into the probability of continuing along the same search path. This enables the parser to
exhibit both speed and robustness in the face of unexpected input. Second, instead of a procedurally determined,
hard-wired parse strategy, ADS' parsing strategy is determined primarily by usage frequencies stored with the rules and
word senses retrieved by the parser as they are needed. Thus, the parse strategy can be tuned dynamically to different
situations. Currently, the ability to use usage frequency has been implemented, but non-Boolean matching has not.
Probabilistic best-first chart parsing, made possible by our Generalized Composition Grammar and the
CAUCUS architecture, is designed to improve the scalability and robustness of natural language parsing. CAUCUS is an
active chart parser because phrase constituents are generated once only and made available to be matched with adjacent
(active against inactive) edges that might be generated in the future. Typically, a chart parser is used to generate all
possible parses in an efficient manner, so the active chart edge matching and proposing mechanisms are tightly coupled
in a deterministic top-down, left-to-right control mechanism. This is a sensible approach for testing the generative and
string recognition power of a grammar, but for a parser doing real text understanding tasks, the goal is to find the best
fit that the non-deterministic knowledge bases can make to the input string in the shortest time possible. In CAUCUS,
tasks are placed on a prioritized agenda as they are instantiated by the execution of other tasks. Executing a task may
cause the instantiation of any number of tasks, from zero to many, and where they are placed on the queue is a function
of an estimated likelihood that executing the task next will cause the parser to discover the best interpretation of the
input in the shortest amount of time.
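The agenda mechanism described above can be sketched as an ordinary priority queue. The Python below is a minimal illustration only: the class and function names, the toy task names, and the priority values are invented for the example and do not describe the CAUCUS implementation, which was written in Lisp.

```python
import heapq
import itertools


class Agenda:
    """Priority queue of pending tasks; the highest priority is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal priorities pop in insertion order

    def push(self, priority, task):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, next(self._counter), task))

    def pop(self):
        neg_priority, _, task = heapq.heappop(self._heap)
        return -neg_priority, task

    def __bool__(self):
        return bool(self._heap)


def run(agenda, execute, is_goal):
    """Execute tasks best-first; each execution may instantiate zero or more new tasks."""
    trace = []
    while agenda:
        _, task = agenda.pop()
        trace.append(task)
        if is_goal(task):
            break
        for priority, new_task in execute(task):
            agenda.push(priority, new_task)
    return trace


# Toy example: which tasks each task instantiates, with estimated priorities (made up).
SPAWNS = {
    "start": [(0.2, "unlikely"), (0.9, "likely")],
    "likely": [(0.95, "parse")],
}

agenda = Agenda()
agenda.push(1.0, "start")
trace = run(agenda, lambda t: SPAWNS.get(t, []), lambda t: t == "parse")
print(trace)  # ['start', 'likely', 'parse']
```

In this toy run the low-priority task is instantiated but never executed, which is the saving a best-first strategy aims for when it reaches a good interpretation early.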
In CAUCUS, there are currently two types of tasks: EXTEND an active or inactive edge, and MATCH an
active edge with an inactive edge, either to the right or to the left. Various parameters may be set to manipulate which
tasks actually get instantiated during the execution of one of these tasks, and the composition functions and the
functions that calculate the priority of a task can all be substituted freely without adversely affecting the parser's operation.
Currently, matching is decomposed into two stages in order to minimize the number of graph compositions that are
actually executed. As the number of specialized composition functions and pragmatic functions is increased, it may
become appropriate to break some of these out as additional task types.
In summary, ADS is developing a unique approach to natural language analysis, called CODEX, that
combines a probabilistic, concept-oriented keyword pattern matcher with a probabilistic, best-first, generalized graph
composition, active chart parser. These techniques were designed to address problems limiting the robustness, scalability,
and portability of current techniques for data extraction from text. Although we have not completed the research and
development on these techniques, we have been implementing them in an incremental fashion as modifications to
existing prototypes so that we can document the performance gains of each addition over the older techniques from which
these new ones have grown. Because the large MUC-3 lexicon required an upgrade to the parser that we were unable
to implement in time, the system we used for MUC-3 testing had the parser disabled.
FLOW OF CONTROL
In this section we show how CODEX minus CAUCUS produced templates for the MUC-3 test message
TST1-MUC3-0099.
When the Controller is invoked on a series of messages, it first breaks up the messages into a series of files,
one message per file. Then it breaks up each of these files into a dateline file and a message body file. Each message
body file is handed to the Profiler, which uses its knowledge base to locate relevant concepts in the text. The Profiler
knowledge base consists of two parts. The first part is a set of RUBRIC rules, and the second part is a set of concepts
that the Profiler is to report on. As can be seen in Figure 3, the rules may be invoked in a hierarchical manner. Although
some of the lower level rules corresponded to slot fillers, as may be seen in Figure 4, we did not use this information
to create template fills because our strategy was to have CAUCUS do this by analyzing sentences found by the Profiler.
For the MUC-3 testing, we configured the Profiler to report on the sentences that contained concepts of the various
incident types. The strategy was to overgenerate so that CAUCUS could make the final determination of relevance and
slot fills. The Profiler saves its output to a file for future use by the Controller. Figure 5 shows the message with those
words highlighted that triggered Profiler rules.
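The first Controller steps, separating the dateline from the message body and breaking the body into sentences for the Profiler to report on, might be sketched as follows. This is a simplified illustration that works on strings rather than files; the " -- " separator and "[TEXT]" marker are read off the sample message in Figure 5, and the function names and the naive sentence splitter are assumptions for the example.

```python
import re


def split_message(text):
    """Split a message into (dateline, body) at the first ' -- ' separator."""
    dateline, _, body = text.partition(" -- ")
    body = body.replace("[TEXT]", "", 1).strip()
    return dateline.strip(), body


def split_sentences(body):
    """Naive splitter: break after a period that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", body) if s.strip()]


message = (
    "LIMA, 25 OCT 89 (EFE) -- [TEXT] POLICE HAVE REPORTED THAT TERRORISTS TONIGHT "
    "BOMBED THE EMBASSIES OF THE PRC AND THE SOVIET UNION. "
    "THE BOMBS CAUSED DAMAGE BUT NO INJURIES."
)
dateline, body = split_message(message)
sentences = split_sentences(body)
print(dateline)        # LIMA, 25 OCT 89 (EFE)
print(len(sentences))  # 2
```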
Once the Profiler is finished with a message, the Controller loops through the incident types and generates a
template for each sentence that contains that incident. For TST1, we generated just one template per incident type
concept found in a message, but we found that recall would be higher if we produced one template per sentence containing
:IMPLIES
BOMBING-EVENT
:DEFINITION
(:OR (:AND BOMBING DESTRUCTION) (:AND BOMBING CASUALTIES))
:WEIGHT
100
:END-DEFINITION
:EVIDENCE
BOMBING
:DEFINITION
(:OR (:PHRASE "BLOWN" "UP") (:PHRASE "BLEW" "UP")
"BOMBING" "BOMBINGS" "BOMBED" "DYNAMITED")
:WEIGHT
100
:END-DEFINITION
:EVIDENCE
DESTRUCTION
:DEFINITION
(:OR "DESTROYED" "DAMAGED" (:PHRASE "BLOWN" "TO"))
:WEIGHT
60
:END-DEFINITION
Figure 3: Example Profiler Rules
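To make the rule format concrete, here is a small Python sketch that evaluates the three rules of Figure 3 against a tokenized sentence. It is not RUBRIC: the tuple encoding, the function names, the min/max treatment of :AND and :OR, the multiplicative weighting, and scoring undefined concepts (such as CASUALTIES) as 0 are all assumptions made for illustration. Because Figure 3 shows only example rules, the sketch is not expected to reproduce the event-level scores reported in Figure 4.

```python
import re


def C(name):
    """Mark a reference to another concept (an unquoted symbol in Figure 3)."""
    return ("CONCEPT", name)


# concept -> (definition, weight); quoted keywords are plain strings.
RULES = {
    "BOMBING-EVENT": (("OR", ("AND", C("BOMBING"), C("DESTRUCTION")),
                       ("AND", C("BOMBING"), C("CASUALTIES"))), 100),
    "BOMBING": (("OR", ("PHRASE", "BLOWN", "UP"), ("PHRASE", "BLEW", "UP"),
                 "BOMBING", "BOMBINGS", "BOMBED", "DYNAMITED"), 100),
    "DESTRUCTION": (("OR", "DESTROYED", "DAMAGED", ("PHRASE", "BLOWN", "TO")), 60),
}


def tokenize(text):
    """Upper-case the text and keep runs of letters, digits, and hyphens."""
    return re.findall(r"[A-Z0-9-]+", text.upper())


def evaluate(expr, tokens):
    """Return a confidence in 0..100 for an expression against a token list."""
    if isinstance(expr, str):                      # literal keyword
        return 100 if expr in tokens else 0
    op, *args = expr
    if op == "CONCEPT":                            # reference to another rule
        return score(args[0], tokens)
    if op == "PHRASE":                             # adjacent words, in order
        n = len(args)
        hit = any(tokens[i:i + n] == list(args) for i in range(len(tokens) - n + 1))
        return 100 if hit else 0
    scores = [evaluate(a, tokens) for a in args]
    return max(scores) if op == "OR" else min(scores)   # assumed calculus


def score(concept, tokens):
    """Evidence for a concept, scaled by the rule weight (assumed multiplicative)."""
    if concept not in RULES:                       # e.g. CASUALTIES is not in Figure 3
        return 0
    definition, weight = RULES[concept]
    return evaluate(definition, tokens) * weight // 100


s1 = tokenize("TERRORISTS TONIGHT BOMBED THE EMBASSIES OF THE PRC AND THE SOVIET UNION.")
s5 = tokenize("THE BOMBS BROKE WINDOWS AND DESTROYED THE TWO VEHICLES.")
print(score("BOMBING", s1), score("DESTRUCTION", s5))  # 100 60
```

The two printed values agree with the BOMBING entry for S1 and the DESTRUCTION entry for S5 in Figure 4, which is as far as the three example rules can be checked against the published profile.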
BOMBING            (S1 100) (S3 80) (S9 100) (S11 100)
ATTACK             (S8 100)
EXPLOSIVE-DEVICE   (S2 100) (S3 100) (S4 100) (S5 100) (S13 100)
DESTRUCTION        (S5 60)
CASUALTIES         (S12 50)
DEATH              (S11 100)
KILLING            (S12 100)
BURNING-RESULT     (S14 100)
SHINING-PATH       (S7 100) (S8 100) (S9 100) (S11 100) (S12 100) (S14 100)
TUPAC-AMARU        (S7 100)
BOMBING-EVENT      (S1 60) (S3 48) (S5 36) (S9 60) (S11 60)
ATTACK-EVENT       (S8 70)
MURDER-EVENT       (S12 60)
TERRORIST-EVENT    (S1 48) (S3 38) (S5 29) (S8 70) (S9 60) (S11 60) (S12 60)
Figure 4: Profiled TST1-MUC3-0099
LIMA, 25 OCT 89 (EFE) -- [TEXT] POLICE HAVE REPORTED THAT TERRORISTS TONIGHT BOMBED
THE EMBASSIES OF THE PRC AND THE SOVIET UNION. THE BOMBS CAUSED DAMAGE BUT NO INJURIES.
A CAR-BOMB EXPLODED IN FRONT OF THE PRC EMBASSY, WHICH IS IN THE LIMA RESIDENTIAL DISTRICT
OF SAN ISIDRO. MEANWHILE, TWO BOMBS WERE THROWN AT A USSR EMBASSY VEHICLE THAT WAS
PARKED IN FRONT OF THE EMBASSY LOCATED IN ORRANTIA DISTRICT, NEAR SAN ISIDRO.
POLICE SAID THE ATTACKS WERE CARRIED OUT ALMOST SIMULTANEOUSLY AND THAT THE BOMBS
BROKE WINDOWS AND DESTROYED THE TWO VEHICLES.
NO ONE HAS CLAIMED RESPONSIBILITY FOR THE ATTACKS SO FAR. POLICE SOURCES, HOWEVER, HAVE
SAID THE ATTACKS COULD HAVE BEEN CARRIED OUT BY THE MAOIST "SHINING PATH" GROUP OR THE
GUEVARIST "TUPAC AMARU REVOLUTIONARY MOVEMENT" (MRTA) GROUP. THE SOURCES ALSO SAID
THAT THE SHINING PATH HAS ATTACKED SOVIET INTERESTS IN PERU IN THE PAST.
IN JULY 1989 THE SHINING PATH BOMBED A BUS CARRYING NEARLY 50 SOVIET MARINES INTO THE PORT
OF EL CALLAO. FIFTEEN SOVIET MARINES WERE WOUNDED.
SOME 3 YEARS AGO TWO MARINES DIED FOLLOWING A SHINING PATH BOMBING OF A MARKET USED BY
SOVIET MARINES.
IN ANOTHER INCIDENT 3 YEARS AGO, A SHINING PATH MILITANT WAS KILLED BY SOVIET EMBASSY
GUARDS INSIDE THE EMBASSY COMPOUND. THE TERRORIST WAS CARRYING DYNAMITE.
THE ATTACKS TODAY COME AFTER SHINING PATH ATTACKS DURING WHICH LEAST 10 BUSES WERE
BURNED THROUGHOUT LIMA ON 24 OCT.
Figure 5: Scanned TST1-MUC3-0099
an incident. We also could have configured the Profiler to scope concepts at the paragraph level, but with the newswire
articles, this would produce very little difference in the result. If any of the incident types are found in a message, the
Controller also sends the name of the message file and the potentially relevant sentences to CAUCUS through another
file, as shown in Figure 6.
Generating templates based on Profiler output only was a fail-safe mechanism in case of parser failure. As it
turned out, this was our only output. Thus, final output for TST1-MUC3-0099 consists of five templates with only the
message-id, template-id, and incident-type slots filled out. A BOMBING template is produced for each of the first
three sentences in Figure 6, a MURDER template is produced for the fourth sentence in Figure 6, and an ATTACK
template is produced for the last. Given these sentences as input, the parser's analysis would have indicated that only
the first sentence contained relevant incidents, but it would have created two BOMBING templates, one for each of
the physical targets. Preceding the first sentence of each message in this input is a line indicating the number and the
dateline file of the new message. The parser uses this information to reset some global variables containing frames
representing the dateline information, which it may need to determine the time and location of incidents. With the parser
up and running as expected, then, the final output for this message would have been two bombing templates with the date,
perpetrator-id, physical-target, foreign-nation, and location slots filled in.
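The Profiler-only template generation described above can be sketched in Python using the event-level rows of Figure 4. Several things here are assumptions rather than facts from the system: the confidence cutoff of 50 (chosen only because it yields the five sentences listed in Figure 6), the order in which incident types are visited, the sequential template-id values, and the extra "sentence" field, which is bookkeeping for the example and not a MUC-3 slot.

```python
# Event-level rows of Figure 4: concept -> {sentence number: confidence}.
PROFILE = {
    "BOMBING-EVENT": {1: 60, 3: 48, 5: 36, 9: 60, 11: 60},
    "ATTACK-EVENT": {8: 70},
    "MURDER-EVENT": {12: 60},
}

# (incident type, Profiler concept) pairs in an assumed processing order.
INCIDENT_TYPES = [
    ("BOMBING", "BOMBING-EVENT"),
    ("MURDER", "MURDER-EVENT"),
    ("ATTACK", "ATTACK-EVENT"),
]

CUTOFF = 50  # hypothetical threshold; the paper does not state one


def make_templates(message_id, profile, cutoff=CUTOFF):
    """Emit one minimally filled template per sentence whose confidence exceeds the cutoff."""
    templates = []
    for incident_type, concept in INCIDENT_TYPES:
        hits = profile.get(concept, {})
        for sentence in sorted(hits):
            if hits[sentence] > cutoff:
                templates.append({
                    "message-id": message_id,
                    "template-id": len(templates) + 1,
                    "incident-type": incident_type,
                    "sentence": sentence,  # bookkeeping only, not a template slot
                })
    return templates


templates = make_templates("TST1-MUC3-0099", PROFILE)
print([(t["incident-type"], t["sentence"]) for t in templates])
```

Under these assumptions the sketch yields three BOMBING templates (sentences 1, 9, and 11), one MURDER template (sentence 12), and one ATTACK template (sentence 8), in the same order as the sentences in Figure 6.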
Since we never generated CAUCUS output for this message, we will not go into CAUCUS processing. Our
simple initial strategy was to have CAUCUS parse the sentences found by RUBRIC to have concepts of MUC-3
incidents, determine relevance, and change the template output, as described above. This strategy would have missed the
effects mentioned in the second sentence of the message, the neighborhood-level locations and secondary physical
target mentioned in the second paragraph, or the suspected perpetrator organization mentioned in the third paragraph. In
the future we will be experimenting with a feedback loop between the Controller and CAUCUS to provide additional
sentences to CAUCUS for analysis, depending on the absence of information in the CAUCUS template output. Thus,
only the minimum number of sentences would be parsed. In this case, all but two small sentences would be parsed, but
/prj/nlp/muc/picasso/F0000
POLICE HAVE REPORTED THAT
TERRORISTS TONIGHT BOMBED THE EMBASSIES OF THE PRC AND THE SOVIET
UNION.
IN JULY 1989 THE SHINING PATH BOMBED A BUS CARRYING NEARLY 50 SOVIET
MARINES INTO THE PORT OF EL CALLAO.
SOME 3 YEARS AGO TWO MARINES DIED FOLLOWING A SHINING PATH BOMBING OF
A MARKET USED BY SOVIET MARINES.
IN ANOTHER INCIDENT 3 YEARS AGO, A SHINING PATH MILITANT WAS KILLED
BY SOVIET EMBASSY GUARDS INSIDE THE EMBASSY COMPOUND.
THE SOURCES ALSO SAID THAT THE
SHINING PATH HAS ATTACKED SOVIET INTERESTS IN PERU IN THE PAST.
:END-MSG
:END-PROC
Figure 6: TST1-MUC3-0099 Input to CAUCUS
in other messages, this strategy should reduce the burden on the parser significantly.
SUMMARY
In this paper we have described ADS' CODEX flexible message understanding architecture, which consists
of a Profiler, a Parser/Analyzer, and an application-specific Controller. We also described the flow of control in
CODEX as it was instantiated for MUC-3, concentrating on the parts that contributed to our official MUC-3 test run. The
CAUCUS parser/analyzer is still quite new, and we ran into an unexpected problem in scaling up the lexicon to handle
as many entries as were required for MUC-3. Though we eventually solved this problem, we did not do so in time to
use the parser for MUC-3 testing.
