Cable Abstracting and INdexing S_ystem (CANIS) 
Prototype 
Ira Sider 
Jeffrey Baker 
Deborah Brady 
Lynne Higbie 
Tom Howard 
Lockheed Martin Corporation 
P.O. Box 8048 
Philadelphia, PA 19101 
sider, j baker, brady_d,higbie, howard@mds.lmco.com 
1. SUMMARY 
The CANIS customer receives cables from sites 
world-wide, indexes the entities mentioned in these 
cables and stores that information for access by ana- 
lysts at a later date. The CANIS customer indexes 
large quantifies of information mostly manually, and 
wishes to reduce the human rescmrces applied to this 
task. 
Incoming cables are processed, information is ex- 
tracted and stored in Corporate Databases. Cables 
with useful data are abstracted and indexed. 
Abstracting captures information about the cable 
itself: its document number, source, date, etc. 
Indexing captures information about the entities 
described in the cable: their names, dates of birth, 
locations, etc. 
The result of the abstracting and indexing process is a 
set of index records about the entities that were de- 
scribed in a cable. 
When a cable has been abstracted and indexed, its 
index record(s) are placed in a queue so that they will 
be stored. When file maintenance is performed (peri- 
odically overnight), the index records are stored in the 
Corporate Database. 
The abstracting and indexing process is a time con- 
suming and laborious task. Analysts must read every 
cable and extract the infcmnation that should be 
placed in the new index records or update existing 
records. Although the abstracting portion of the task 
has been automated, it is only a small part of the ab- 
stracting and indexing process. The majority of the 
effort is the indexing part of the process. At present, 
there is little or no automation support for the index- 
ing part of the process. 
The CANIS prototype is intended to assist the CANIS 
customer with the cable indexing task. CANIS auto- 
matically extracts entity information, builds and up- 
dates index records from cables, and presents it for 
review. CANIS' analysts can 1) approve the system 
generated index records, 2) add more information to 
the system generated index records, or 3) ignore the 
system generated index records and create their own. 
CANIS also extracts and stores relationship informa- 
tion (such as family relations, employment, and affili- 
ations). This information is not currently identified or 
stored during the manual indexing task. 
CANIS is compatible with both the input and output 
systems currently being used by the CANIS customer. 
CANIS runs on the customer-specified hardware and 
software plaffozms. However, the prototype CANIS 
system is a stand-alone system. 
2. BACKGROUND 
2.1. Concept Demonstration 
CANIS Phase I Contract was completed February 
1995. Lockheed Martin Management and Data Systems 
demonstrated the automatic extraction of the abstracted 
and indexed information from actual cables. Following 
the demonstration, M&DS developed requirements and 
design specifications for a prototype system that would 
meet the CANIS customers cable indexing and abstract- 
ing needs. 
69 
3. SYSTEM DESIGN 
( Cable Text + Cable Header J, CorporateSystemCable I Cable Text + Cable Header + Pro f'de Intb 
Cable Text + Cable Header 
Cable Deliv~) 
System Serv~ . \[ 
(LogsNogsl ~ - % 
Cable Text + 
Cable Header + 
Cable Delivery 
System Fields 
\] Sel~d ~bs~ of Cables 
copied from Doo~mom Ind~z 
Damb~ 
Document Index( 
Database Data \ 
CANIS / ~ H~d~+ Cable De~vay 4 
Prototype Amot~o~+ / 
t Relatiouel Dlua ~¢~ 
User Display 
( 
Figure 1.0 - CANIS External Interface Design 
The CANIS prototype, as illustrated in Figure 1.0, 
will take as input, Cable Text, Cable Header, and Cable 
Delivery System Server Fields. CANIS performs all 
processing and stores the results internally. Users can 
visualiTe the processed data via the User Display. 
70 
CANIS Prototype (Lores Notes) 
Ce0eceen IdA + Det-i~at Aam~e Cdl~ee IdelmV~ ÷ 
Dom~t Id~h~r 13oom~ It~ffl~ 
~~ File Server, ~J ~~r Display ) 
Figure 2.0 - CANIS Top Level Process Overview 
Figure 2.0 shows the top level design of the CANIS 
prototype. Cables are delivered to CANIS via the Cable 
Delivery System Server. This server acts as a commu- 
nication driven pipe to the CANIS Prototype. 
3.1. Comm Process CSCI 
The Comm Process CSCI retrieves the Cable Data 
from the Cable Delivery System Server at a constant 
given rate via a software timer. The Comm Process 
CSCI creates a Document from the Cable Data and 
passes the Document to the Document Manager Process 
CSCI which stores this information in a Document 
Collection. The Comm Process CSCI sends the Collec- 
tion Identifier and Document Identifier to the Extraction 
Process CSCI. The Comm Process CSCI also transfers 
the Cable Delivery System Header information to the 
SQL Database as it relates to the Document in the 
CoLlection. 
3.2. Extraction Process CSCI 
The Extraction Process CSCI processes each Collec- 
tion Identifier and Document Identifier passed to it. The 
Extraction Process CSCI passes the Collection and Doc- 
ument Identifiers to the Document Manager CSCI 
which retrieves and returns the document text. The Ex- 
traction Process CSCI extracts biographical entities 
found within the document using Lockheed Martin's 
NLToolset. The Extraction Process CSCI passes the ex- 
tracted entities to the Document Manager Process 
CSCI, which stores them as Annotations on the Docu- 
ment. The Extraction Process CSCI sends the Collec- 
tion Identifier and Document Identifier to the Analyst 
Data Setup Process CSCI. 
Upon initialization, the Extraction Process CSCI 
spawns the NLToolset Server Object, which ties all the 
NLToolset's data resources together into a single object, 
and loads into it the NLToolset System Specification 
File. This fde contains a set of entries that identify the 
71 
resources that should be loaded, the debug flags that 
should be set, the organization of the resources, and the 
sequence of operations that the NLToolset should per- 
form. The primary resources that are loaded are: 
• Lexicon 
• Ontology 
• Gazetteer 
• Abbreviation 
• Special Term List 
• Set of Lexiccr-Semantic Rule 
Packages 
The Extraction Process CSCI passes a Document 
through a series of NLToolset functions to perform the 
extraction. The steps are Tokeaization, Segmentation, 
Reduction, Extraction, Reference Resolution, and Post 
Processing. 
Tokenizatiou creates a buffer of tokens from the 
Document's text. All words, punctuation, numbers, etc. 
in a Document are processed into tokens. The informa- 
tion captured for each token is: physical string, token 
symbol, token type. symbol id, part of speech, string 
case type, and character start and end positions. 
Segmentation breaks the Document's token buffer 
into paragraphs and sentences based on multiple new- 
lines, tabs, periods, etc. 
Reduction performs multiple passes through the 
Document buffer looking for sequences of tokens that 
can be simplified into a single identifiable unit. These 
passes are used to identify specific pieces of information 
needed for Extraction. 
Extraction and Reference Resolution are the 
NLToolset functions that glue all individual pieces to- 
gether to create the entities. The information automati- 
cally extracted by CANIS from the Cables includes data 
of the following types: 
• Person Names (All Types) 
• Company/Organization Names 
• Locations 
• Dates 
• Phone Numbers 
• License Numbers 
• Identification Numbers (exam- 
ple: Social Security) 
• Gender 
• Country of Birth 
• Date of Birth 
• Occupation 
• Subject Line 
• File Numbers 
• Cable Numbers 
The following associations will be extracted by the 
CANIS Prototype from the Cables: 
• Family 
• Employment 
• Affiliations 
Post Processing creates Annotations for all the data 
items and entities. These Annotations are then attached 
to the Document and stored in the Document Manager 
Process CSCI. Appendix A contains the Annotation De- 
sign Specification for these entities. 
3.3. Analyst Data Setup Process CSCI 
The Analyst Data Setup Process CSCI processes 
each Collection Identifier and Document Identifier pair 
passed to it. It then passes the Collection and Document 
Identifiers to the Document Manager Process CSCI 
which retrieves and returns the Document and its An- 
notations. The Annotations for this document are 
placed into relational records in the CANIS Server (SQL 
Server). Names, Organizations and Associations enti- 
ties found within the extracted annotations are validated 
against existing entities. If the entity exists, then the 
new information is linked to the existing entities. If the 
entity does not exist, new relational records are created 
for that entity. 
The Analyst Data Setup Process CSCI collects and 
builds relations for each of the major entities (Person- 
nel, Organizations, and Associations) within the An- 
notations for the Document in the Collection. It vali- 
dates and connects different types of locations, numbers 
and biographic information. For each entity, the process 
validates against existing index relations. If the entity 
exists, all information is processed as an update to the 
existing records. If the entity does not exist, the record 
is added to the relational database as a currently known 
Index record. Biographical entities are connected to the 
named entity through the ODBC SQLServer API. Bio- 
graphical entity type connections are: gender, country 
of birth, date of birth, etc. 
Additionally, the process operates on address earl- 
ties and number entities and connects these to the named 
entities. It validates the address and number informa- 
tion against existing data relations. If the address or 
number exists, then all information is processed as a ref- 
erence to the existing records. Otherwise, anew record 
containing the information is added to the relations and 
connected to the named entity. The types of addresses 
captured are: location, residence, etc. The types of 
numbers captured are: phone, license, etc. 
The Analyst Data Setup Process CSCI links named 
entities together through relation links in the SQL data- 
72 
base. The process will link the following entity in- 
formation: Family (persons to family), Employment 
(persons to organizations), and Affiliation (persons to 
associations). 
The Analyst Data Setup Process CSCI validates File 
Numbers against existing relations and connects them to 
named entities. The types of Filing and Document Ref- 
erence data connected are: System Folder Objects and 
Document IDs. 
Finally, the Analyst Data Setup Process CSCI adds 
the Document to an analyst working queue for proces- 
sing by an analyst through the Analyst Interface Process 
CSCI. 
The Analyst Data Setup Process bridges the gap he- 
tween the information that was extracted from each 
Document and the information currently stored in the 
customer's database. 
3.4. Analyst Interface Process CSCI 
The Analyst Interface Process CSCI processes User 
Commands passed to it. These Commands allow an 
analyst to access and manipulate all the information 
stored in the CANIS prototype. When a Document is se- 
lected for display by an analyst, the Analyst Interface 
Process CSCI passes the Collection Identifier and Docu- 
ment Identifier for the Document to the Document Man- 
ager Process CSCI which retrieves and returns the Doc- 
ument and its relational records. 
The Analyst Interface Process CSCI displays a sum- 
mary list of the named entities associated with the se- 
lected Document. An analyst may select a given entity 
from the list and review the enfity's detailed informa- 
tion, delete the entity from the list, or lookup a new enti- 
ty found in the body of the Document. For each of the 
details available about a name, (ie. biographies, rela- 
tionships, id numbers, locations, phone numbers etc.) 
the analyst reviews, modifies the informatiou if neces- 
sary, and checks off the information. Some of the data, 
such as gender, citizenship, or relationship types, for ex- 
ample, have alternative choices available on a pull- 
down menu to minimize key strokes necessary to make 
changes. 
The Analyst Interface Process CSCI allows an ana- 
lyst to review and modify all information (Index re- 
cords, addressees, subject line, and Fding locations) 
about a Document. It will display a Document's text 
body and allow the analyst to travel through the proces- 
sing of the information about that Document. The fol- 
lowing functions are available to an analyst: Document 
Details Review, CANIS Prototype Process Logs Re- 
view, Name Lookup and Processing, and System Filing. 
Document Details Review displays the classifica- 
tion, addressees, and subject line associated with the 
current Document. The analyst may review and modify 
any of this information. 
CANIS Prototype Process Logs Review displays the 
logs generated by each of the CANIS System Processes 
in read-only mode. The information captured by these 
logs includes: document identifiers for documents pro- 
cessed, error messages, system generated messages (ie. 
debug). 
The Document Name Lookup and Processing allows 
the review and modification of named entities (Person- 
nel, Company, and Associations) of the selected Docu- 
ment. The options available to an analyst here are, a) 
Name Lookup, b) Index, Review Data records for this 
entity, c) Create Links between Entity names and re- 
viewed records: and d) Add and Modify Information 
associated with the entity (ie. gender, citizenship, loca- 
tions, phone numbers, etc.) 
Extraction errors found by the analyst during their 
processing of a document are appended to a log within 
the SQL database for review by an engineer for Extrac- 
tion CSCI package adjustments. 
3.5. Document Manager Process CSCI 
The Document Manager Process CSCI is a set of li- 
brary routines which provide a standard interface be- 
tween the CANIS Prototype and the persistent storage 
of documents. The Document Manager conforms to the 
concepts and specifications of the TIPSTER Phase II 
Architecture Design Document (version 1.15). The Li- 
brary routines of the Document Manager Process pro- 
vides all CSCI's of the CANIS Prototype with a standard 
interface (APD for accessing docmnents, and communi- 
caring annotation information about those documents. 
The Document Manager Process CSCI is imple- 
mented on top of a relational database with access to the 
database facilitated through ODBC library calls. The 
Document Manager Process CSCI uses Microsoft's 
ODBC library. Microsoft Access and Microsoft 
SQLServer. Other applications using Lockheed Mar- 
fin's Document Manager are being built on top of Sy- 
base and Oracle. 
4. NLTOOLSET 
The NLToolset is a framework of tools, techniques 
and resources for building text processing applications. 
The NLToolset is portable, extensible, robust, genetic 
and language independent. The NLToolset combines 
artificial intelligence (AD methods, especially NL pro- 
cessing, knowledge-based systems and information re- 
trieval techniques, with simpler methods, such as finite 
state machines, lexical analysis and word-based text 
73 
search to provide broad functionality without sacrific- 
ing robustness and speed. 
The NLToolset currently runs on SUN Microsys- 
tem's UN/X-based platforms and PCs (using Microsoft 
Windows NT). The NLToolset is coded in C++ and uses 
the COOL Object Library. The CANIS application is 
PC based and using the Microsoft Visual C++ compiler 
and Visual Basic on the PC. 
5. TESTING AND EVALUATION 
We are currently in the testing phase and developing 
the evaluation criteria in conjunction with the govern- 
ment These phases are scheduled to complete, July 
1996. 
Our Test Plan involves subsystem testing of each of 
the Comms Process CSCI, Extraction Process CSCI, 
Analyst Data Setup Process CSCI, and the Analyst In- 
terface Process CSCI. We are also performing System 
Level Integration Testing to validate the data passing 
through each process within the CANIS application. 
Evaluation will be performed by analysts at their site 
using real data. We are currently working with the cus- 
tomer to determine evaluation criteria. 
6, CONCLUSIONS 
The CANIS prototype will show the customer a new 
way of doing business. Analysts will see their tasks 
change from manually reading and creating index re- 
cords to verifying and updating automatically generated 
index records. Their daily process will involve more 
analysis than data entry and they will be able to process 
a larger number of documents in a single day. 
74 
APPENDIX A. 
ANNOTATION DESIGN SPECIFICATION 
annotation type r-fnllname 
annotation type r-descriptor 
annotation type r-surname 
annotation type r-birth-date 
annotation type r-rifle 
annotation type r-occupation-field 
annotation type r-occupation-text 
annotation type r-file-number 
annotation type r-phone-number 
annotation type r-org-name 
annotation type r-conntry 
annotation type r-address 
annotation type r-city 
annotation type r-id-number 
annotation type r-gender 
annotation type r-state 
annotatiOn type r-mailcode 
annotation type r-subject-line 
annotation type r-date 
annotation type r-text 
annotation type r-phone-text 
annotation type c-tperson 
{r-fidlname: r-fullname 
r-variation: sequence of r-filllname 
r-descriptor: r-descriptor 
r-surname: r-gl.llIIRme 
r-gender: r-gender 
r-birth-date: r-birth-date 
r-taddress: sequence of c-taddress 
r-tphone: sequence of c-tphone 
r-tidnum: sequence of c-tidnum 
r-tide: sequence of r-rifle 
r-occupation-field: r-occupation-field 
r-occupation-text: sequence of r-occupation-text 
r-text: sequence of r-text 
r-person-type {e.g., MAIDEN... } 
r-associated {Y} } 
annotation type c-taddress 
{r-address: r-address 
r-city: r-city 
r-state: r-state 
r-conntry: r-country 
r-mailcode: r-mailcode 
r-address-type: {BIRTH, LOC, ADDRESS } 
r-associated { Y } 
I 
annotation type c-tphone 
{r-phone-number: sequence of r-phone-number 
r-phone-type: {PHONE, FAX, TI~I.~X} 
r-phone-text: sequence of r-phone-text 
r-associated: {Y} 
\] 
annotation type c-tidnum 
{r-id-number: sequence of r-id-number 
r-id-type: {SSN, LICENSE} 
r-associated: {Y} } 
annotation type c-torgani~afion 
{r-org-name: r-org-name 
r-variation: sequence of r-org-name 
r-org-type: { COMPANY, GOVERNMENT, 
OTHER} 
r-descriptor: r-descriptor 
r-taddress: sequence of c-taddress 
r-tphone: sequence of c-tphone 
r-associated { Y \] 
I 
annotation type: c-ffamily 
{r-fullname: r-fullname 
r-suma.me: r-suillame 
r-taddress: sequence of c-taddress 
r-tphone: sequence of c-tphoue 
r-associated { Y } } 
annotation type c-parent-assoc 
{r-child: c-tperson 
r-parent: c-tperson 
r-descriptor: r-descriptor } 
annotation type c-fathex-assoc 
{r-child: c-tperson 
r-father: c-tperson 
r-descriptor: r-descriptor } 
75 
annotation type c-mother-assoc 
{ r--child: c-tperson 
r-mother: c-tperson 
r-descriptor: r-descriptor 
\] 
annotation type c-brother-assoc 
{r-sibling: c-tperson 
r-brother: c-tperson 
r-descriptor: r-descriptor 
\] 
annotation type c-sister-assoc 
{r-sibling: c-tperson 
r-sister: c-tperson 
r-descriptor: r-descriptor \] 
annotation type c-sibling-assoc 
{r-sibling-a: c-tperson 
r-sibling-b: c-tperson 
r-descriptor: r-descriptor \] 
annotation type c-married-assoc 
{r-spouse-a: c-tperson 
r-spouse-b: c-tpersou 
r-descriptor: r-descriptor 
l 
annotation type c-family-other-assoc) 
{r-person-a: c-tperson 
r-person-b: c-tperson 
r-descriptor: r-descriptor ) 
annotation type c-family-member-assoc 
{r-family: c-ffamily 
r-family-member: c-tperson 
r-descriptor: r-descriptor 
\] 
Annotation type c-maiden-persona-assoc 
{r-person-a: c-tperson 
r-person-b: c-tperson 
r-descriptor: r-descriptor \] 
annotation type c-other-persona-assoc 
{r-person-a: c-tperson 
r-person-b: c-tperson 
r-descriptor: r-descriptor 
} 
annotation type c-employment-assoc 
{r-person: c-tpersou 
r-organization: c-torganizafion 
r-descriptor: r-descriptor ) 
annotation type c-affiliafion-assoc 
{r-person: c-tperson 
r-organization: c-torganizafion 
r-affiliated-with: c-torganizafion 
r-descriptor: r-descriptor 
} 
annotation type c-tdocument 
{r-subject-line: r-subject-line 
r-reference-line: r-reference-line 
r-categories-type: sequence of {DRUGS, 
POLITICS, TERRORIST, OTHER\]} 
r-associations: sequence of c-agfiliation-assoc or 
c-employment-assoc or c-coutact-assoc or 
c-persona-assoc or c-family-member-assoc 
or c-family-other-assoc or c-married-assoc 
or c-sibling-assoc or c-sister-assoc or 
c-brother-assoc or c-mother-assoc or 
c-father-assoc or c-parent-assoc 
r-unassocpersons: sequence of c-tpersou 
r-unassocorgs: sequence of c-torganizafion 
r-unassocaddr: sequence of c-taddress 
r-tmassocphone: sequence of c-tphoue 
r-unassocidnum: sequence of c-fidnum 
r-unassocfamily: sequence of c-ffamily 
76 

REFERENCES 

1. CANIS System Requirements Specification: 
May 30, 1995. 

2. CANIS System Design Specification; 
November 30, 1995. 

3. TIPSTER Phase II Document Manager Snecifi- 
v cation (vl. 15) September, 1995. 
