Diderot: TIPSTER Program, Automatic Data Extraction 
from Text Utilizing Semantic Analysis 
Y. Wilks, J. Pustejovsky t, J. Cowie 
Computing Research Laboratory, New Mexico State University, Las Cruces, NM 88003 
& 
Computer Science t, Brandeis University, Waltham, MA 02254 
PROJECT GOALS 
The Computing Research Laboratory at New Mexico 
State University, in collaboration with Brandeis Univer- 
sity, was one of four sites selected to develop systems 
to extract relevant information automatically from En- 
glish and Japanese texts. When we started, neither site 
had been involved in message understanding or informa- 
tion extraction. CRL had extensive experience in multi- 
lingual natural language processing and in the use of ma- 
chine readable dictionaries for system building, Brandeis 
had developed a theory of lexical semantics and prelim- 
inary methods for deriving this lexical information from 
corpora. Thus, our approach focused on applying new 
techniques to the information extraction task. In the 
last two years we have developed information extraction 
software for 5 five different subject area/language pairs. 
The system, Diderot, was to he extendible and the tech- 
niques used not explicitly tied to the two particular lan- 
guages, nor to the finance and electronics domains which 
are the initial targets of the Tipster project. To achieve 
this objective the project had as a primary goal the ex- 
ploration of the usefulness of machine readable dictionar- 
ies and corpora as source for the semi-automatic creation 
of data extraction systems. 
RECENT RESULTS 
The first version of the system was developed in five 
months and was evaluated in the 4th Message Under- 
standing Conference (MUC-4) wher e it extracted infor- 
mation from 200 texts on South American terrorism. At 
this point the system depended very heavily on statis- 
tical recognition of relevant sections of text and on the 
ability to recognize semantically significant phrases (e.g. 
a car bomb) and proper names. Much of this information 
was derived from the keys. 
The next version of the system used a semantically based 
parser to structure the information found in relevant sen- 
tences in the text. The parsing program was derived au- 
tomatically from semantic patterns. For English these 
were derived from the Longman Dictionary of Contem- 
porary English, augmented by corpus information and 
these were then hand translated to equivalent Japanese 
patterns. The Japanese patterns were confirmed using a 
phrasal concordance tool. A simple reference resolving 
module was also written. The system contained large 
lists of company names and human names derived from 
a variety of online sources. This system handled a subset 
of the joint venture template definition and was evalu- 
ated at twelve months into the project. 
Attention was then focused on the micro-electronics do- 
main. Much of the semantic information here was do 
rived from the extraction rules for the domain. A single 
phrase in micro-electronics can contribute to several dif- 
ferent parts of the template, to allow for this a new se- 
mantic unit the factoid was produced by the parser. This 
produced multiple copies of a piece of text, each marked 
with a key showing how the copy should be routed and 
processed in subsequent stages of processing. This rout- 
ing was performed by a new processing module, which 
transformed the output from the parser. The statistical 
based recognition of text relevance was used for micro- 
electronics only, as a much higher percentage of articles 
in the corpus were irrelevant. This system was evaluated 
at 18 months. 
Finally the improvements from micro-electronics were 
fed back to the joint venture system. An improved se- 
mantic unit recognizer was added to the parser. This 
handles conjunctions of names, possessives and bracket- 
ing. An information retrieval style interface to the Stan- 
dard Industrial Classification Manual was linked into the 
English system. The reference resolving mechanism was 
extended to handle a richer set of phenomenon (e.g. plu- 
ral references). This version was evaluated at 24 months. 
PLANS FOR THE COMING YEAR 
CRL is participating in Tipster Phase 2. This will in- 
volve participation in the development of the architec- 
ture for the Phase 2 system, user interfaces to the sys- 
tem, software to handle document markup and multi- 
lingual information retrieval. Brandeis are continuing 
work on tuning lexical entries using information from 
corpora. 
464 
