TRW:
DESCRIPTION OF THE DEFT SYSTEM AS USED FOR MUC-5
WILLIAM W. NOAH, Ph.D.
ROLLIN V. WEEKS
TRW SYSTEMS DEVELOPMENT DIVISION
ONE SPACE PARK
REDONDO BEACH, CA 90278
R2/2186
BACKGROUND
For the past three years, TRW has been developing a text analysis tool called DEFT--
Data Extraction from Text. Based on the Fast Data Finder (FDF), DEFT processes large
volumes of text at very high speeds, identifying patterns which serve as indicators
for the presence of relevant objects, relationships, or concepts in the data. These
indicators are processed by a series of system-supplied utilities or custom-written
functions which refine the data and re-formulate it into frames which can be
presented to a user for review, editing, and submission to a downstream application
or database.
Superficially, DEFT resembles a Natural Language Understanding (NLU) system;
however, there are key differences. DEFT entertains very limited goals in the
processing of natural language input. Although DEFT processes unconstrained
input, it is looking for textual entities which are tightly constrained and presented to
the system as a list of expressions or in a powerful pattern specification language. It
exploits expectations about how a small set of entities will be expressed to reduce the
amount of computation required to locate those-- and only those-- entities. The
broader question of the "meaning" of the text in the document is bypassed in favor of
rapid, robust processing that can be readily moved from domain to domain. As long
as the input for a particular domain is sufficiently predictable, data extraction with a
satisfactory level of recall and precision for many applications can be achieved. We
are currently installing three DEFT systems for a United States government agency;
initial reviews have been highly favorable.
Our involvement in MUC-5 derives from a request by the government to turn DEFT into
a COTS product, with the intent of having a fully-supported version of the system by
the end of the year. An analysis of the broader commercial and government market
for text extraction suggested that the scope of problems that DEFT should be able to
address needed to be expanded; however, it was established that replication of the
ongoing research and development work in the NLU community was an inappropriate
role for our development group. Rather, we wanted DEFT to be able to integrate with
systems already developed or in development for functionality which falls outside
the narrow boundaries of DEFT's pattern-based capabilities. At the same time, DEFT's
ability to express patterns needed to be extended from its current, highly effective
means for defining "atomic" patterns to the definition of patterns in relationship to
each other, permitting simple syntactic information to be added to DEFT's lexical
knowledge. Thus, DEFT would have the potential to find entities not expressly defined
in a lexicon, improve its ability to correctly determine the relation between entities,
and decrease the overgeneration that tends to be associated with approaches that rely
exclusively on pattern matching.
A mechanism was selected for enhancing pattern specification which was felt to be
compatible with the notion of integrating DEFT with third-party systems. As will be
described in some detail, DEFT is intrinsically an engineering shell which is intended
to facilitate such integration while making its rapid pattern-matching services
available to the other system components. Unfortunately, the software
implementing this concept was not available at the time of the final MUC-5
evaluation, the results of which therefore serve only to confirm our expectations
that the recognition of "simple" (i.e., isolated) patterns is woefully insufficient for
complex data extraction problems.
While we regret that the capabilities of the extended version of DEFT could not be
demonstrated for MUC-5, we feel that the outcomes justify our belief that real-world
message understanding problems necessitate an engineering solution that can pit a
choice of technologies against the specific problem at hand-- different technologies
being optimum for different tasks. We believe that DEFT's success in handling simple
data extraction problems can be extended, and that DEFT is well-suited to a role as an
integrator of text analysis capabilities. It is toward this end that we are focusing our
ongoing productization efforts.
SYSTEM DESCRIPTION
It is convenient to envision DEFT as a pipeline, as shown in Figure 1. At the head is a
standardized document interface to message handling systems. At the tail is a process
which generates frames and distributes these to the appropriate destinations on the
basis of content. In between is a series of text analysis "filters" which apply DEFT
lexicons (pattern searches) against the text (using the FDF) and call specific
extraction functions to process the textual fragments located by the lexicons. All
processes are controlled by means of external configuration files and a "workbench"
which contains tools for interacting with DEFT and the data DEFT extracts. We will
describe each of these major components in turn.
The Document Interface: Message Queuing. It is assumed that DEFT will be embedded
in an existing automated message handling (AMH) system. DEFT's interface with
these systems is called Message Queuing (MQ). Text is typically disseminated to MQ
(e.g., by a messaging system like TRW's ELCSS or KOALA that receives government
cables, wire service input, etc.) on the basis of subject matter, source, structure, or
other characteristics with salience for how the message's language will be analyzed.
MQ can also accommodate documents loaded from other sources, such as native wire
services, an existing full-text database, CD-ROM, OCR, and so on. Text is assumed to be
in ASCII or extended ASCII; in the near future, DEFT will build on work currently
underway to allow the FDF to accommodate Unicode for foreign character sets, such
as Japanese. Structural features, such as document boundaries, sentence boundaries,
paragraphs, tabularization, encoded tags (such as SGML), embedded non-textual
media, etc., can be defined for a particular document class using DEFT specification
files.
MQ utilizes a configuration file to assign a processing thread tailored to the problem
domain to each category of document classified by the dissemination system or by
whatever means (including manual) is used to route documents to DEFT. Documents
are associated with a processing thread by placing them in a particular MQ
"in-basket" (a standard Unix directory). Each in-basket is polled periodically, using a set
of criteria (time and number of messages since the last processing thread was
initiated) defined in the configuration file.
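
The polling decision described above can be sketched as follows. This is a minimal
illustration, not DEFT code: the names InBasketPolicy and should_start_thread, the
field names, and the choice to start a thread when either criterion is met are all
assumptions, since the text does not say how the two criteria are combined.

```python
from dataclasses import dataclass


@dataclass
class InBasketPolicy:
    """Hypothetical per-in-basket settings; field names are illustrative."""
    max_wait_seconds: float  # start a thread once this much time has passed
    max_batch_size: int      # ...or once this many messages are waiting


def should_start_thread(policy, seconds_since_last_start, messages_waiting):
    """Return True when there is work and either polling criterion is met."""
    if messages_waiting == 0:
        return False
    return (seconds_since_last_start >= policy.max_wait_seconds
            or messages_waiting >= policy.max_batch_size)


# Example policy: process every 5 minutes, or sooner if 50 messages pile up.
policy = InBasketPolicy(max_wait_seconds=300, max_batch_size=50)
```

A small max_wait_seconds pushes an in-basket toward "real-time" behavior, while a
large max_batch_size favors bigger batches.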
[Figure 1 not reproduced. Recoverable labels: Message receipt; Formatted portions;
Knowledge Base and Message Management Tools.]
Figure 1: DEFT Functional Architecture
Extracting Data: Text Analysis Filters. When MQ assigns a document to a processing
thread, it is subjected to a sequence of procedures which operate on the text to locate
patterns of interest and use these patterns as a guide to extract the data required for a
particular problem domain. This sequence of processes determines what is extracted
and how it is extracted. The sequence is defined as an ordered list of "extraction
phases" in a configuration file. This list can be changed at any time to substitute or
add new extraction phases to refine a text processing thread. New threads can be
modeled on existing ones, facilitating transitions to new problem areas.
Each extraction phase is an executable program. The behavior of a phase is
dependent on the order in which it is called (i.e., its relationship to the phases that
have been executed before it) and on parameters which are supplied in the
configuration file. In this way, a generalized extraction phase can be configured for
a specific analytic objective. DEFT has a library of extraction phases that perform the
most elementary analytic processes; new phases are written on a problem-specific
basis. DEFT provides an application programming interface (API) in the form of a
library of utilities which allows a custom extraction phase to interact with the data
structure which is common to all extraction phases, and which is used to
communicate between phases. This structure is the DEFT "Tag File."
The Tag File is a cumulative record of the processing performed by each extraction
phase. Each phase receives the Tag File from the preceding phase, and passes it to
the next. A "tag" represents a textual pattern identified by DEFT in the text or data
created by an extraction function.
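
The way phases hand a cumulative record to one another can be illustrated with a
short sketch. It assumes, purely for illustration, that a Tag File can be modeled
as a list of dictionaries and that each phase is a function; in DEFT each phase is
a separate executable and the real Tag File format is not described here. The phase
names and tag fields are hypothetical.

```python
def find_designators(text, tags):
    """Phase 1: tag every occurrence of a few corporate designators."""
    for designator in ("Inc.", "Co.", "S.A."):
        start = text.find(designator)
        while start != -1:
            tags.append({"type": "designator", "start": start,
                         "end": start + len(designator),
                         "phase": "find_designators"})
            start = text.find(designator, start + 1)
    return tags


def count_designators(text, tags):
    """Phase 2: add a tag holding data derived from the earlier phase's tags."""
    n = sum(1 for t in tags if t["type"] == "designator")
    tags.append({"type": "designator_count", "value": n,
                 "phase": "count_designators"})
    return tags


# In DEFT the ordered list of phases would come from a configuration file.
PHASES = [find_designators, count_designators]


def run_thread(text, phases=PHASES):
    tags = []                     # stands in for the cumulative Tag File
    for phase in phases:
        tags = phase(text, tags)  # each phase receives and passes on the record
    return tags


result = run_thread("Acme Inc. and Bolt Co. formed a venture.")
```

Because the second phase depends on tags written by the first, reordering PHASES
changes the outcome, which mirrors the order dependence noted above.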
Much of the power of DEFT comes from the ability to apply a mixture of extraction
phases that is optimally suited for a given class of document and extraction problem.
For example, one extraction phase might reason about the relative time of
occurrence of events located in the text, basing its analysis on the occurrence of
various forms of date/time indicators as well as the presence of such modifiers as
"last week," or "three years ago." Another phase might construct corporate names
on the basis of the occurrence of a known name or the presence of a designator (e.g.,
"Inc." or "S.A."). Yet another phase might act upon these names to reason about their
potential relationship in a joint venture.
Locating Data: DEFT Lexicons. The patterns that DEFT uses to locate data of interest in
the text are contained in DEFT's lexicons. Lexicons serve various purposes: to
identify potential frames, to determine the "scope" of a frame in the text (i.e., the
boundaries to be used to find data to fill the frame slots), to find the contents for a slot
in a frame, to determine structural elements (e.g., sentences, paragraphs, header
information), and to set the attributes of a text object (e.g., classification level).
Lexicons are of two types: list and pattern. The list lexicon associates a set of
synonyms (or spelling variants) with a given object. It is useful when the complete
set of strings associated with an object can be specified. The pattern lexicon is used
when the textual variations associated with an object cannot be enumerated. For
example, all possible monetary values cannot be conveniently enumerated, but a
single pattern describing monetary values in terms of digits, punctuation, and
denomination strings can be constructed.
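
The difference between the two lexicon types can be illustrated with a small sketch.
It uses a Python dictionary and a regular expression as stand-ins; DEFT lexicons are
actually expressed for the FDF in its own pattern language, whose syntax is not shown
in this paper. The entries, the denomination words, and the names LIST_LEXICON, MONEY,
and lookup_list are assumptions.

```python
import re

# List lexicon: every surface string is enumerated and mapped to one object.
LIST_LEXICON = {
    "united kingdom": "UNITED KINGDOM",
    "u.k.": "UNITED KINGDOM",
    "great britain": "UNITED KINGDOM",
}

# Pattern lexicon: one pattern built from digits, punctuation, and denomination
# strings covers values that cannot be listed one by one.
MONEY = re.compile(
    r"\d+(?:,\d{3})*(?:\.\d+)?"          # digits with optional commas/decimals
    r"(?:\s+(?:million|billion))?"       # optional scale word
    r"\s+(?:dollars|yen|pounds)",        # denomination string
    re.IGNORECASE,
)


def lookup_list(term):
    """Return the object for an enumerated synonym, or None if not listed."""
    return LIST_LEXICON.get(term.lower())


sample = "The venture is capitalized at 20 million yen, about 1,500,000 dollars."
money_hits = MONEY.findall(sample)
```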
Associated with lexicon entries are attributes, representing the semantics of the
problem domain. An attribute is a characteristic of the object represented in the
text by its synonym list or pattern. It might be the normalized form of a name or
other data about an object which is useful to map into a frame, such as the country
associated with a corporate name. In a list lexicon, these attributes are known
explicitly when an entry is created; they are not inferred from the text. In a pattern
lexicon, however, the attributes cannot be known in advance because it is not known
what exact value will hit against the pattern. For this reason, attributes must be
extracted for a pattern lexicon. Attribute extraction is handled by a C or C++ program
referred to as an "extraction function." For example, given the location of a
corporate designator, a function might reconstruct the corporate name.
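
As an illustration of what such an extraction function might do, the sketch below
walks leftward from a designator until it meets a stop word or a length limit. It is
written in Python for brevity, although DEFT extraction functions are C or C++
programs, and the heuristic, the stop-word list, and the names STOP_WORDS and
reconstruct_name are assumptions rather than DEFT's actual rules.

```python
# Hypothetical stop words that cannot be part of a corporate name.
STOP_WORDS = {"a", "an", "the", "with", "and", "of", "in", "by", "to", "said"}


def reconstruct_name(tokens, designator_index, max_preceding=4):
    """Build a corporate name by reasoning backwards from its designator.

    Tokens to the left of the designator are absorbed into the name until a
    stop word, the start of the text, or the length limit is reached.
    """
    start = designator_index
    while (start > 0
           and designator_index - start < max_preceding
           and tokens[start - 1].lower() not in STOP_WORDS):
        start -= 1
    return " ".join(tokens[start:designator_index + 1])


tokens = "A VENTURE WITH BRIDGESTONE SPORTS CO. IN TAIWAN".split()
name = reconstruct_name(tokens, tokens.index("CO."))
```

A heuristic this simple will also absorb verbs or adjectives that happen to precede
a designator, so a production rule set would need more than a stop-word list.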
The success of a data extraction system that relies on pattern matching and string
finding depends on how exhaustively it can search for the variations expected in
input language. DEFT has proved successful in its current applications in part
because its lexicons can be extremely large, thanks to the capabilities (in terms of
both functionality and performance) of the FDF.
Searching Text for Lexicon Entries: The FDF. DEFT uses the TRW-developed Fast Data
Finder to rapidly locate instances of a potentially enormous set of patterns in the
input text. The power of the FDF has two sources: the hardware architecture
and the expressiveness of its Pattern Specification Language (PSL).
The current generation FDF-3, now a COTS product manufactured by Paracel, Inc.,
uses a massively parallel architecture to stream text past a search pattern at disk
speeds (currently 3.5 million characters/second using a standard SCSI disk).
Searches are compiled into microcode for a proprietary chip set which can
accommodate up to 3,600 simultaneous character searches or Boolean operations.
Lexicons are broken into "pipelines" which fully fill the chip set; each pipeline is
run against all of the text in the set of documents currently being processed. MQ
batches messages as they come in so as to optimize the use of the FDF-- larger message
sets are processed more efficiently than several smaller ones. The tradeoff between
batching and "real-time" processing can be independently balanced in the MQ
configuration file for each in-basket and processing thread.
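
A rough cost model makes the batching tradeoff concrete. The two constants below come
from the figures quoted above; everything else is an assumption made for illustration,
in particular the idea that each pipeline pass carries a fixed setup cost
(setup_seconds) and that a lexicon's size can be counted in chip-set "cells."

```python
import math

FDF_CELLS = 3600            # simultaneous searches per chip set (from the text)
CHARS_PER_SECOND = 3.5e6    # streaming rate quoted for a standard SCSI disk


def pipelines_needed(cells_required):
    """Number of pipelines a lexicon must be split into to fit the chip set."""
    return math.ceil(cells_required / FDF_CELLS)


def scan_seconds(cells_required, batch_chars, setup_seconds=0.0):
    """Time to run every pipeline over one batch (hypothetical cost model)."""
    passes = pipelines_needed(cells_required)
    return passes * (setup_seconds + batch_chars / CHARS_PER_SECOND)


# A lexicon needing 10,000 cells is split into 3 pipelines. Streaming 35 million
# characters takes 10 s per pass, so 30 s in total when setup cost is ignored.
one_big_batch = scan_seconds(10_000, 35_000_000, setup_seconds=1.0)
ten_small_batches = 10 * scan_seconds(10_000, 3_500_000, setup_seconds=1.0)
```

Under this model the streaming time is identical either way; only the assumed
per-pass overhead makes one large batch cheaper than ten small ones.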
Search patterns are specified in PSL. Because the FDF uses a streaming approach,
PSL is not dependent on word boundaries. Extremely complex patterns can be
expressed, which can include such features as error tolerance, sliding windows,
multiple wildcard options, nested macros, character masking, ranging, and the usual
Boolean operations. Features that support "fuzzy matching," like error tolerance, are
extremely important for handling "noisy" input.
Output Generation: Frame Assembly and Routing. When the filters that comprise a
processing sequence have executed, the Tag File is passed to the "Frame Assembly and
Region Routing" (FARR) module. This program, which constitutes the "tail" of the
DEFT pipeline, assembles the data elements generated during the analysis thread into
frames based on an external definition file. This file specifies which slots are
associated with which frames, how to transform a data value for display to the user
(e.g., normalize "England" to "United Kingdom"), how to transform a value for storage
in a downstream database (e.g., abbreviate "England" as "UK"), how to validate a data
value, whether a data type can be multiply-occurring, and so on.
One issue that arises during frame assembly is when to associate a data value with an
instance of the frame class for which it is defined. In DEFT, this operation is
associated with "scoping." Scoping is the process of determining the extent in the
text of a concept associated with a pattern. For example, if a pattern of words
indicative of a joint venture is found, the scope of the "tie-up" frame might be taken
to be the location of the pattern plus or minus two sentences. The unit of scoping (in
this case, sentence) need not be a syntactic unit-- it can be any pattern stored in a
special type of lexicon used exclusively for determining frame scope. The unit of
scoping and its extent (e.g., "plus or minus n") can be determined independently for
each frame class.
When a pattern that gives rise to a slot value of a type defined for a given frame class
is found in the text, the slot is automatically mapped by FARR to any frame whose
scope encompasses the location of the pattern. Thus, if the name of a corporation
were to occur within the two-sentence range of the tie-up frame in our example, it
would appear in that frame. Of course, this may not be accurate-- DEFT has a
tendency to overgenerate slots through bogus associations that arise because of this
weak scoping mechanism.
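
The proximity-based scoping and slot mapping just described can be sketched in a few
lines. The sketch assumes sentences as the scoping unit and represents frames and slot
hits as dictionaries; the function names, field names, and company names are
hypothetical and are not taken from FARR.

```python
def frame_scope(hit_sentence, n, sentence_count):
    """Scope = hit sentence plus or minus n sentences, clipped to the document."""
    return max(0, hit_sentence - n), min(sentence_count - 1, hit_sentence + n)


def assign_slots(frames, slot_hits):
    """Attach each slot hit to every frame whose scope contains its location."""
    for frame in frames:
        lo, hi = frame["scope"]
        frame["slots"] = [s["value"] for s in slot_hits
                          if lo <= s["sentence"] <= hi]
    return frames


# A tie-up pattern found in sentence 2 of a 10-sentence message, scoped +/- 2.
frames = [{"class": "tie-up", "scope": frame_scope(2, 2, 10)}]
slot_hits = [
    {"value": "NEWS WIRE LTD.", "sentence": 0},  # e.g. a name in a source line
    {"value": "ALPHA CO.", "sentence": 2},       # a genuine participant
    {"value": "BETA CO.", "sentence": 7},        # a participant mentioned later
]
frames = assign_slots(frames, slot_hits)
```

The result shows both failure modes of a purely positional scope: an unrelated name
that falls inside the window is attached, and a relevant name outside it is lost.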
Another issue that is encountered is overlapping frames. The "best available"
resolution can be specified in the frame definition file. One alternative is simply to
accept both frames, since they may be describing separate concepts. If the frames
are of different classes, FARR supports the attribution of a priority to each class, and
only the frame with the highest priority need be retained. If the frames are of the
same class, FARR supports a "non-multiply occurring" attribute, which optionally
suppresses all but one of the frames. Unfortunately, the action taken is generalized
to all situations-- the specifics of a given case cannot be taken into account. Thus,
DEFT tends to either overgenerate or lose frames.
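
The three resolution options can be summarized in a short sketch. It assumes that
priorities and the "non-multiply occurring" attribute are supplied as simple lookup
structures; the function name, argument names, and the rule of keeping the first
frame of a suppressed pair are assumptions.

```python
def resolve_overlap(a, b, priority, single_occurrence):
    """Return the frames to keep when frames a and b overlap.

    priority: class name -> number, higher wins (used for different classes).
    single_occurrence: set of classes marked "non-multiply occurring".
    """
    if a["class"] != b["class"]:
        if a["class"] in priority and b["class"] in priority:
            return [max((a, b), key=lambda f: priority[f["class"]])]
        return [a, b]               # no priorities defined: accept both
    if a["class"] in single_occurrence:
        return [a]                  # suppress all but one frame of this class
    return [a, b]


tie_up = {"class": "tie-up", "id": 1}
entity = {"class": "entity", "id": 2}
other_tie_up = {"class": "tie-up", "id": 3}
```

Note that the decision depends only on the frame classes, never on the content of the
two frames, which is exactly the limitation described above.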
241
When a message's frames have been generated and ambiguities resolved (to the
extent that DEFT can resolve them), the frames (and the message) are routed to a
destination directory on the basis of their content. Routing instructions are defined
in a rule base using a normalized conjunctive form of field-value pairs. It should be
kept in mind that although DEFT's primary mission is extraction, not dissemination,
the routing capability (since it is based on knowledge representation) provides a
sensitive mechanism for determining the destination of a message and the structured
representation of its contents.
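
One way to read "a normalized conjunctive form of field-value pairs" is as a
conjunction of clauses, each clause being a set of alternative field-value tests. The
sketch below follows that reading, which is an assumption; the rule layout, the
destination names, and the functions matches and route are hypothetical.

```python
# Each rule: every clause must hold (AND); within a clause any test may hold (OR).
RULES = [
    {"destination": "joint_ventures",
     "clauses": [[("frame_class", "tie-up")],
                 [("country", "Japan"), ("country", "Taiwan")]]},
]


def matches(frame, clauses):
    """True when each clause has at least one satisfied field-value test."""
    return all(any(frame.get(field) == value for field, value in clause)
               for clause in clauses)


def route(frame, rules=RULES, default="unrouted"):
    """Return every destination whose rule matches, or a default destination."""
    return [r["destination"] for r in rules
            if matches(frame, r["clauses"])] or [default]
```

For example, a tie-up frame whose country slot is "Taiwan" would be routed to
"joint_ventures", while one whose country is "France" would fall through to the
default.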
Controlling the System: DEFT Tools and Specification Management. In order to make
DEFT portable to different computing environments and problem domains, the
definition of user-modifiable system characteristics has been exported to a set of
external specification files. These files govern the interface with the surrounding
message handling system, the output data model, FDF configuration, and other
"housekeeping" functions. Specification files are maintained using any convenient
text editor.
The most important system specifications from the standpoint of the end-user are
the lexicons and the frame routing rules. To facilitate lexicon development and
maintenance, a lexicon editor is bundled with DEFT that provides a graphic user
interface (under X/Motif) for interactively defining lexicons and entering/editing
lexicon entries. Lexicons can also be created/updated from databases or external
machine-readable files (e.g., gazetteers, corporate name lists) using a batch load
protocol.
Like the lexicon editor, the routing rule manager provides a GUI for maintaining
routing rules. It uses a spreadsheet metaphor to minimize the user's exposure to the
potentially complex Boolean logic that the rules can involve. Menus of valid values
and condition tests are automatically provided.
Another important DEFT tool is frame review. DEFT was developed under the
assumption that a user would always be in the loop; it was not intended to run
autonomously. This package therefore supports simultaneous display of messages
and the frames derived from them, providing highlights that show where slot values
were extracted. Menus of valid values drawn from the lexicons assist the user in
filling slots that were omitted by DEFT. Features for selectively deleting superfluous
slots and frames are particularly important, since DEFT (like other pattern-based
approaches to text analysis) tends to overgenerate data. A mechanism is also
provided to facilitate manually linking frames of different classes into higher-level
logical aggregations, since DEFT was not originally designed with an automated
linking capability. Clearly, these two design assumptions-- human interaction and
manual frame linking-- had an impact on working with the MUC-5 data.
DEFT as an Engineering Shell
This description of the DEFT system has emphasized that analysis threads are
composed of independent components which communicate through a common data
structure using a library of utilities that constitute an API. It is our contention that
DEFT's strengths are:
• A powerful pattern searching capability, which we are extending.
• The ability to integrate COTS, GOTS, and custom-written programs
within the DEFT architecture.
We believe that there will probably not be a single text analysis or NLU system that
meets the requirements of all conceivable applications. There will always be a
tradeoff between such factors as speed, depth of analysis, breadth of coverage,
portability, robustness, and analysis methodology that will favor one technology
over another for a particular problem. The real question is not "What is the best
system?", but "What is the best system at this moment?"
Our current development work on DEFT is chiefly targeted at its usefulness as an
integration tool. DEFT provides a high-speed pattern searching capability which can
successfully extract data from structured or tightly constrained textual inputs, while
providing pre-processing services (e.g., tagging words with part of speech or
semantic attributes) for third-party software which performs more extensive natural
language processing for unconstrained textual inputs. This approach should be
especially efficient for applications in which messages are mixed (formatted and
unformatted), text analysis tasks are varied in complexity, and throughput is a major
consideration.
Inherent Limitations in DEFT's Pattern-Matching Approach
Because DEFT was not originally intended for problems of the scope of MUC-5, its
simplistic approach posed some major problems. Among the most fundamental were:
Syntactical Patterns. DEFT has very powerful mechanisms for specifying "atomic"
patterns-- a corporate name, a place name, a set of words that indicate a joint
venture, etc. DEFT was not designed to have the capability of expressing
relationships among the patterns in its lexicons and providing for the assignment of
values defined with respect to these patterns to variables. These are essential
capabilities for the implementation of the most rudimentary semantic grammar. For
example, DEFT had no way to express: "Look for a corporate name followed by a joint
venture formation phrase and take the following corporate name as the partner in
the joint venture."
Frame Scoping. DEFT was designed to interpret the scope of a frame as a function of
proximity to the "hit location" of the pattern that resulted in a frame's instantiation.
The boundaries are determined by a pre-defined number of repetitions of a pattern
contained in a scope lexicon. An upper ceiling determined by a fixed number of
characters can also be specified, in case the scoping pattern is not detected within a
"reasonable" distance from the site of the hit. All occurrences of slots defined for a
frame within these boundaries are automatically included in the frame when it is
assembled by FARR.
For highly formatted text (e.g., messages in Military Text Format), such a mechanism
is adequate. For free text, it is not. In the MUC-5 evaluation, DEFT failed to report
valid objects that it located (notably entities) because they were not within the scope
of a tie-up, as DEFT measured scope.
Frame Linking. The original DEFT design assumed that a human operator would
perform this task. Automated linking is obviously needed for "unattended" operation
and is clearly useful even if there is a human in the loop.
Solutions
Current internal research and development work aimed at resolving each of these
problems for the eventual DEFT product adheres to the constraint that architectural
extensions must be philosophically compatible with the pattern-based approach,
while avoiding significant overlap with NLU (which we prefer to view as an
integratable component in a complex system). As noted earlier, key software being
developed under IR&D was not available for the MUC-5 final evaluation; however,
work continues and will be tested on the MUC-5 corpus in the near future to validate
the approach.
Syntactical Patterns. This is the specific area that was not developed in time for the
evaluation; unfortunately, it is also the most critical for dealing with even the simple
aspects of the MUC problem. The approach we selected is intended to be compatible
with the integration of more powerful text understanding components in the future,
while extending the range of problems DEFT can solve by itself. It exploits DEFT's
atomic pattern-recognition capabilities while separating the definition of a semantic
grammar into an independent extraction phase. This phase could easily be replaced
(or supplemented) with an NLU system which can optionally take advantage of the
DEFT lexical pre-processing while performing deep syntactic and semantic analyses.
This separation is in part intended to provide an initial test of our belief that the
integration of DEFT with an NLU component creates a symbiotic association with
better performance characteristics than either system by itself.
To stay within the (admittedly loosely defined) bounds of pattern matching, our
approach to exploiting syntax consists of providing DEFT with a simple mechanism
for expressing "meta-patterns"-- that is, patterns whose components may be the
atomic patterns (and, by reference, their attributes) located by the DEFT lexicons. We
decided to use a BNF specification to define a semantic grammar based on a
combination of literal strings and DEFT-identified tokens.
The key issue was how to pass the results of DEFT pattern-matching to the parser. An
integrated NLU component within the DEFT shell could interface directly with the
DEFT Tag File through the API; the component could also interface with the frames
generated by DEFT, providing a preliminary level of analysis on which to build. For
our prototype, however, we chose to mark terms in the text with SGML-like tags to
indicate their properties. The grammar directly references these tags, and routines
were provided within the parser for assigning text strings to slots by extracting DEFT
lexicon attributes (e.g., normalized values or semantic characteristics) or collecting
words intervening between two tags (of the same or different class). Additional
primitives for manipulating the strings prior to slot assignment were also built into
the parser infrastructure to control frame generation and the assignment of slots
(including pointers to other frames) to frames. This significantly improves on the
primitive scoping capability provided with the current version of DEFT.
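
The tagging-plus-grammar idea can be illustrated with a small sketch. It is not the
DEFT parser, which was driven by a BNF grammar not reproduced in this paper; here a
regular expression stands in for a single meta-pattern, and the tag names CORP and JV,
the norm attribute, the sample sentence, and all function names are assumptions.

```python
import re


def hit(text, phrase, tag, norm):
    """Build a (start, end, tag, normalized value) record for one lexicon hit."""
    start = text.index(phrase)
    return (start, start + len(phrase), tag, norm)


def tag_text(text, lexicon_hits):
    """Wrap each lexicon hit in an SGML-like element carrying its attribute."""
    out, pos = [], 0
    for start, end, tag, norm in sorted(lexicon_hits):
        out.append(text[pos:start])
        out.append(f'<{tag} norm="{norm}">{text[start:end]}</{tag}>')
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Meta-pattern: a corporate name, then a joint venture phrase, then a second
# corporate name, which is taken as the partner.
TIE_UP = re.compile(
    r'<CORP norm="(?P<entity>[^"]*)">[^<]*</CORP>'
    r'[^<]*<JV norm="[^"]*">[^<]*</JV>'
    r'(?P<between>[^<]*)<CORP norm="(?P<partner>[^"]*)">[^<]*</CORP>')


def find_tie_ups(tagged):
    """Fill slots from tag attributes and from words lying between two tags."""
    return [{"entity": m["entity"], "partner": m["partner"],
             "between": m["between"].strip()}
            for m in TIE_UP.finditer(tagged)]


text = "ALPHA CO. has set up a joint venture with BETA INC. in Taiwan."
tagged = tag_text(text, [hit(text, "ALPHA CO.", "CORP", "ALPHA CO."),
                         hit(text, "set up a joint venture", "JV", "TIE-UP"),
                         hit(text, "BETA INC.", "CORP", "BETA INC.")])
tie_ups = find_tie_ups(tagged)
```

Because the intermediate tagged string is ordinary text, it can be inspected by a
person or consumed by another program without access to the Tag File.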
The approach selected thus provides a vocabulary for expressing both the expected
contents of documents and the rules for instantiating and linking templates. At the
same time, its intermediate product is human-readable (and, in fact, could be used as a
general-purpose "tagger") and easily interpreted by other programs.
Frame Scoping. Fundamental changes in the DEFT frame-scoping mechanism are
planned which will exploit domain knowledge as well as limited syntactic (from the
meta-patterns) and semantic (from lexicon attributes) data. For MUC-5, the basic
DEFT mechanism was retained, with its inherent limitations.
Frame Linking. A primitive frame linking capability was added to DEFT. It was based
on frame scoping, however, and therefore suffered from the same limitations. The
DEFT frame definition file format was extended to accommodate hierarchical
relationships; any frame defined as a child of another frame had its generated frame
ID automatically included as a slot value in the parent frame if its "hit location" fell
within the scope of the parent frame. Of course, multiple and spurious associations
are easily generated in this way. In the future, frame linking will be improved by
combining syntactic and domain knowledge in a final extraction phase to resolve
inter-object relations.
RESULTS
The results of the final MUC evaluation were strongly influenced by the
unavailability of the parser, which was an essential component of the DEFT approach
to MUC-5. The resulting scores indicate the magnitude of the problems inherent in a
simple pattern-matching strategy which is not informed with even a crude semantic
grammar. It should be noted that a decision was made to focus only on a subset of
templates and slots required for the preliminary run. These were the document
template, tie-up-relationship, and entity. The F-measures for the final evaluation
were:
P&R     2P&R    P&2R
1.15    2.64    0.74
Not surprisingly, these were the lowest scores for any system in the evaluation. A
detailed analysis of the run is of little utility; however, there are some points of
interest in the walk-through sample document.
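
For reference, the three columns correspond to the standard weighted F-measure with
precision and recall weighted equally (P&R), precision weighted more heavily (2P&R),
and recall weighted more heavily (P&2R). The sketch below gives the usual formula; the
function name and the sample precision and recall values are illustrative only and are
not DEFT's actual figures, which are not reported in this section.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted F-measure: ((beta^2 + 1) * P * R) / (beta^2 * P + R).

    beta = 1.0 gives P&R, beta = 0.5 gives 2P&R, and beta = 2.0 gives P&2R.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical values chosen so that precision exceeds recall.
p, r = 0.8, 0.2
scores = {"P&R": f_measure(p, r, 1.0),
          "2P&R": f_measure(p, r, 0.5),
          "P&2R": f_measure(p, r, 2.0)}
```

When precision exceeds recall, the ordering is 2P&R > P&R > P&2R, which is the
ordering of the reported scores and is consistent with a system that undergenerates.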
Walkthrough Document
The identifying data (document number, source, and date) were correctly extracted.
Some simple atomic patterns were defined in a DEFT lexicon for tie-up relations.
These were to be factored into a semantic grammar; as noted, the parser was not
available at the time of the run. Therefore, the patterns were run as a simple search.
The search correctly identified the presence of a joint venture in the sample document,
but incorrectly instantiated two tie-up templates (one for each of two out of three
references to the venture) and entered their IDs in the content slot of the
document template. DEFT currently does not determine that multiple references have
a common object unless the frames overlap.
A single entity was mis-identified, "Jiji Press Ltd.," which is actually the document
source. This entity was incorrectly associated with the first tie-up. The foregoing
explanation of the DEFT scoping mechanism makes it clear why this false association
took place. The name of the "BRIDGESTONE SPORTS CO." was correctly reconstructed
from the corporate designator ("CO.") and assigned to the first tie-up. The name of the
joint venture, "BRIDGESTONE SPORTS TAIWAN CO.," was also constructed and associated
with the second tie-up instance. No other features were correctly identified.
Among the other corporate names, the algorithm used by DEFT would not have
identified "UNION PRECISION CASTING CO.," but did identify "TAGA CO." However, this
entity was considered out of scope of the tie-up templates and was (incorrectly) not
attached to one. DEFT had no facility for recognizing "BRIDGESTONE SPORTS" nor for
tracking the reference to "THE NEW COMPANY."
What Worked
DEFT was effective at recognizing literal strings and patterns contained in its
lexicons. DEFT frequently generated correct entity names that were not in the
corporate name lexicon using a set of heuristics that reasoned backwards from a
designator. For example, "BRIDGESTONE SPORTS CO." was constructed. DEFT of course
had little problem with the tagged items for the document template. These are
precisely the kinds of elemental functions that DEFT is expected to perform well.
DEFT recognized the occurrence of some of the joint ventures, based on a very limited
set of patterns that were originally defined for use in connection with a semantic
grammar. This set could have been extended to produce improved recall had we
known the parser would not be available. These few successes indicate that even a
simple pattern-based approach can recognize concepts of this type in restricted
cases.
What Failed
The lexicons and extraction phases that were rapidly developed for MUC-5 contained
some bugs that were not observed during training; some corporate names were
missed, for example, that should have been constructed. The chief failings were
inadequate lexicons for identifying joint ventures and inadequate scoping. These
two problems combined to suppress the instantiation of the many valid entities that
DEFT found, but could not associate with a tie-up relation and therefore did not
report. In general, the system was configured to reduce the anticipated
overgeneration, with the expectation that tie-ups and entity relations would be
identified and scoped by the semantic grammar; in the absence of the parser,
undergeneration became severe.
System Training and Resources Expended
The effort expended on MUC-5 testing and documentation was approximately two
person-weeks. System development activities undertaken independently of MUC-5
were exploited for the MUC-5 evaluation run. These included:
• Analysis: 1 person-month
• Lexicon Development and Data Definitions: 1.25 person-months
• Extraction Phases and Functions: 3 person-months
The total level of effort for all activities impacting MUC-5 was therefore roughly 5.5
person-months.
As we have noted, key system components were ultimately unavailable for the MUC-5
evaluation. Although we won't know "how we would have done" until the
components are completed and our internal tests against the MUC data are repeated, it
is our expectation that significant improvement will be obtained with a little
additional effort, although performance is neither expected nor required to
approach that of true NLU systems, given our view of DEFT as an integration
environment.
Most of the effort in creating a new DEFT application usually centers on lexicon
development. For MUC-5, most lexicons were batch loaded from the data supplied via
the Consortium for Lexical Research. A few lexicons for joint venture identification
and scoping were developed manually. These were quite simple and their actual
creation required minimal time.
Much of the time on MUC-5 was occupied with writing C code for extraction routines,
particularly for corporate names. The need to write so much code for a new
application is a current weakness in DEFT which will be remedied to a degree when
the parser becomes available.
Of course, a key activity was the analysis of the test corpus and development of a
semantic grammar appropriate to the EJV problem. The results of this analysis were
manifested in the tie-up relation lexicon and the BNF grammar for the parser. Only
the former was ready in time for the evaluation. Analysis was a cyclical, iterative
process; refinement continued during system training.
DEFT system training consisted of a series of runs against samples of the training
corpus, utilizing the frame review tool to examine the results. Lexicons were
manually refined as a result of missed objects and false hits. Early runs resulted in
changes to the batch loading sequence for some of the lexicons (e.g. the corporate
designators). Feedback into the grammar would also have been derived from this
process, had the parser been available and time permitted. As it was, time was
insufficient even for lexicon refinement; for example, a few key errors in the
corporate designator lexicon resulting from a bug in the program that prepared the
file provided through the Consortium for batch uploading were noted only after the
final evaluation run was analyzed. This was partially responsible for some of the
undergeneration.
What We Learned
It came as no surprise that simple patterns are inadequate to extract the complex
ideas expressed in the EJV documents. We view the results as validating the concept
that DEFT, operating as a standalone system, is best qualified to perform on problems
involving well-defined, constrained sets of text objects to be extracted, even with the
addition of a "meta-pattern" or grammatical capability. DEFT should excel on such
problems when throughput is a major consideration.
The selection (and on-going implementation) of a mechanism for expressing
meta-patterns that is compatible with all of the goals discussed earlier is a major outcome
of our MUC work, even though it was not available in time. We believe that this
approach will significantly empower DEFT and broaden the range of applications for
which it is a suitable tool, while increasing the flexibility with which it can be
integrated with other text analysis tools. This will prove highly valuable to our
current government customers, as well as future DEFT users in the government or
commercial sector.
DEFT's potential as an integration environment was underscored by the fact that we
successfully ran documents through:
• A complex set of extraction phases
• With extremely large lexicons
that are beyond the scope of anything that has been tried in existing DEFT
applications. The robustness of the architecture and efficiency of the pattern
searches were our major consolation in the MUC-5 evaluation. We therefore look for
opportunities to combine DEFT's system engineering and search capabilities with the
sophisticated analytical power of NLU-based solutions when real-world problems are
encountered which are out of scope of DEFT's simple extraction mechanisms.
