<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1017">
  <Title>Figure 1.0 - CANIS External Interface Design</Title>
  <Section position="3" start_page="69" end_page="72" type="metho">
    <SectionTitle>
3. SYSTEM DESIGN
</SectionTitle>
    <Paragraph position="0"> The CANIS prototype, as illustrated in Figure 1.0, takes as input Cable Text, Cable Header, and Cable Delivery System Server Fields. CANIS performs all processing and stores the results internally. Users can visualize the processed data via the User Display.</Paragraph>
    <Paragraph position="1"> Cables are delivered to CANIS via the Cable Delivery System Server. This server acts as a communication-driven pipe to the CANIS Prototype.</Paragraph>
    <Section position="1" start_page="70" end_page="71" type="sub_section">
      <SectionTitle>
3.1. Comm Process CSCI
</SectionTitle>
      <Paragraph position="0"> The Comm Process CSCI retrieves the Cable Data from the Cable Delivery System Server at a constant, configured rate via a software timer. The Comm Process CSCI creates a Document from the Cable Data and passes the Document to the Document Manager Process CSCI, which stores this information in a Document Collection. The Comm Process CSCI sends the Collection Identifier and Document Identifier to the Extraction Process CSCI. The Comm Process CSCI also transfers the Cable Delivery System Header information to the SQL Database as it relates to the Document in the Collection.</Paragraph>
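The hand-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CANIS implementation: `Cable`, `DocumentManager`, `poll_once`, the in-memory queue, and the header table are all hypothetical names standing in for the real CSCIs and the SQL Database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Cable:
    """Hypothetical cable record: header fields plus body text."""
    header: Dict[str, str]
    text: str


@dataclass
class DocumentManager:
    """Hypothetical in-memory stand-in for the Document Manager Process CSCI."""
    collections: Dict[str, Dict[int, str]] = field(default_factory=dict)

    def create_document(self, collection_id: str, text: str) -> int:
        docs = self.collections.setdefault(collection_id, {})
        doc_id = len(docs) + 1
        docs[doc_id] = text
        return doc_id


def poll_once(pending: List[Cable], manager: DocumentManager, collection_id: str,
              extraction_queue: List[Tuple[str, int]],
              header_table: Dict[Tuple[str, int], Dict[str, str]]) -> Optional[int]:
    """One timer tick: fetch a cable, store it, and notify extraction."""
    if not pending:
        return None  # nothing was delivered since the last tick
    cable = pending.pop(0)
    doc_id = manager.create_document(collection_id, cable.text)
    # Header fields go to the relational store, keyed by the document.
    header_table[(collection_id, doc_id)] = cable.header
    # Only the identifiers are forwarded, not the document itself.
    extraction_queue.append((collection_id, doc_id))
    return doc_id
```

In a running system `poll_once` would be invoked repeatedly by a timer at the configured rate; passing identifiers rather than document text keeps the Document Manager as the single owner of stored documents.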
      <Paragraph position="1"> 3.2. Extraction Process CSCI  The Extraction Process CSCI passes the Collection and Document Identifiers to the Document Manager CSCI, which retrieves and returns the document text. The Extraction Process CSCI extracts biographical entities found within the document using Lockheed Martin's NLToolset. The Extraction Process CSCI passes the extracted entities to the Document Manager Process CSCI, which stores them as Annotations on the Document. The Extraction Process CSCI sends the Collection Identifier and Document Identifier to the Analyst Data Setup Process CSCI.  Upon initialization, the Extraction Process CSCI spawns the NLToolset Server Object, which ties all the NLToolset's data resources together into a single object, and loads into it the NLToolset System Specification File. This file contains a set of entries that identify the resources that should be loaded, the debug flags that should be set, the organization of the resources, and the sequence of operations that the NLToolset should perform. The primary resources that are loaded are:  through a series of NLToolset functions to perform the extraction. The steps are Tokenization, Segmentation, Reduction, Extraction, Reference Resolution, and Post Processing.</Paragraph>
      <Paragraph position="2"> Tokenization creates a buffer of tokens from the Document's text. All words, punctuation, numbers, etc. in a Document are processed into tokens. The information captured for each token is: physical string, token symbol, token type, symbol id, part of speech, string case type, and character start and end positions. Segmentation breaks the Document's token buffer into paragraphs and sentences based on multiple newlines, tabs, periods, etc.</Paragraph>
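A rough Python sketch of Tokenization and Segmentation follows. It is not the NLToolset code: the `Token` fields cover only a subset of the attributes listed above (symbol id and part of speech are omitted because they would need a lexicon), and the regular expression, the type names, and the sentence-ending rule are simplifying assumptions.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    string: str      # physical string
    token_type: str  # "word", "number", or "punct"
    case_type: str   # "upper", "lower", "capitalized", "mixed", or "none"
    start: int       # character start position (inclusive)
    end: int         # character end position (exclusive)


def case_of(s: str) -> str:
    """Classify the letter case of a token string."""
    if not any(ch.isalpha() for ch in s):
        return "none"
    if s.isupper():
        return "upper"
    if s.islower():
        return "lower"
    if s[0].isupper() and s[1:].islower():
        return "capitalized"
    return "mixed"


def tokenize(text: str) -> List[Token]:
    """Build the token buffer: runs of word characters or single punctuation marks."""
    tokens = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        s = m.group()
        if s.isdigit():
            kind = "number"
        elif s[0].isalnum() or s[0] == "_":
            kind = "word"
        else:
            kind = "punct"
        tokens.append(Token(s, kind, case_of(s), m.start(), m.end()))
    return tokens


def segment(tokens: List[Token]) -> List[List[Token]]:
    """Split the token buffer into sentences at '.', '!' and '?' tokens."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.string in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

Keeping character offsets on every token is what later allows extracted entities to be stored as Annotations that point back into the original Document text.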
      <Paragraph position="3"> Reduction performs multiple passes through the Document buffer looking for sequences of tokens that can be simplified into a single identifiable unit. These passes are used to identify specific pieces of information needed for Extraction.</Paragraph>
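The multi-pass idea behind Reduction can be illustrated with a small Python sketch. The patterns, the `TITLES` and `MONTHS` lists, and the `NAME` and `DATE` labels are invented for this example; the actual NLToolset reduction rules are not described in this paper.

```python
from typing import Callable, List, Tuple

Unit = Tuple[str, str]  # (label, text); raw tokens carry the label "TOKEN"

TITLES = {"Mr", "Mrs", "Dr"}             # illustrative list only
MONTHS = {"January", "July", "October"}  # illustrative list only


def merge_pass(units: List[Unit],
               is_first: Callable[[Unit], bool],
               is_second: Callable[[Unit], bool],
               label: str) -> List[Unit]:
    """One pass: replace each adjacent matching pair with a single labelled unit."""
    out, i = [], 0
    while len(units) > i:
        if len(units) > i + 1 and is_first(units[i]) and is_second(units[i + 1]):
            out.append((label, units[i][1] + " " + units[i + 1][1]))
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


def reduce_units(tokens: List[str]) -> List[Unit]:
    """Run the passes in sequence; each pass sees the output of the previous one."""
    units = [("TOKEN", t) for t in tokens]
    # Pass 1: a title followed by a capitalized word becomes one NAME unit.
    units = merge_pass(units, lambda u: u[1] in TITLES,
                       lambda u: u[1][:1].isupper(), "NAME")
    # Pass 2: a month followed by a number becomes one DATE unit.
    units = merge_pass(units, lambda u: u[1] in MONTHS,
                       lambda u: u[1].isdigit(), "DATE")
    return units
```

Because each pass consumes the previous pass's output, later passes can treat already-reduced units as single items, which is the property the Extraction step relies on.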
      <Paragraph position="4"> Extraction and Reference Resolution are the NLToolset functions that glue all individual pieces together to create the entities. The information automatically extracted by CANIS from the Cables includes data of the following types:  items and entities. These Annotations are then attached to the Document and stored in the Document Manager Process CSCI. Appendix A contains the Annotation Design Specification for these entities.</Paragraph>
    </Section>
    <Section position="2" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
3.3. Analyst Data Setup Process CSCI
</SectionTitle>
      <Paragraph position="0"> The Analyst Data Setup Process CSCI processes each Collection Identifier and Document Identifier pair passed to it. It then passes the Collection and Document Identifiers to the Document Manager Process CSCI, which retrieves and returns the Document and its Annotations. The Annotations for this document are placed into relational records in the CANIS Server (SQL Server). Name, Organization, and Association entities found within the extracted annotations are validated against existing entities. If the entity exists, then the new information is linked to the existing entity. If the entity does not exist, new relational records are created for that entity.</Paragraph>
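The validate-then-link-or-create step can be sketched as follows. This is an assumption-laden illustration: Python's built-in `sqlite3` module stands in for the SQL Server database reached through ODBC, the `entity` and `mention` tables are a hypothetical schema, and exact string matching replaces whatever validation the real system performs.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (
            id INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,
            name TEXT NOT NULL,
            UNIQUE (kind, name)
        );
        CREATE TABLE mention (
            entity_id INTEGER NOT NULL REFERENCES entity(id),
            collection_id TEXT NOT NULL,
            document_id INTEGER NOT NULL
        );
    """)
    return conn


def link_or_create(conn: sqlite3.Connection, kind: str, name: str,
                   collection_id: str, document_id: int) -> int:
    """Return the entity id, creating the entity record only if it is new."""
    row = conn.execute(
        "SELECT id FROM entity WHERE kind = ? AND name = ?", (kind, name)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO entity (kind, name) VALUES (?, ?)", (kind, name)
        )
        entity_id = cur.lastrowid
    else:
        entity_id = row[0]
    # In both cases the new document is linked to the entity record.
    conn.execute(
        "INSERT INTO mention (entity_id, collection_id, document_id) VALUES (?, ?, ?)",
        (entity_id, collection_id, document_id),
    )
    return entity_id
```

Calling `link_or_create` twice with the same kind and name returns the same id, so a second cable mentioning a known person adds a link rather than a duplicate record.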
    </Section>
    <Section position="3" start_page="71" end_page="71" type="sub_section">
      <Paragraph position="0"> The Analyst Data Setup Process CSCI collects and builds relations for each of the major entities (Personnel, Organizations, and Associations) within the Annotations for the Document in the Collection. It validates and connects different types of locations, numbers and biographic information. For each entity, the process validates against existing index relations. If the entity exists, all information is processed as an update to the existing records. If the entity does not exist, the record is added to the relational database as a currently known Index record. Biographical entities are connected to the named entity through the ODBC SQLServer API. Biographical entity type connections are: gender, country of birth, date of birth, etc.</Paragraph>
      <Paragraph position="1"> Additionally, the process operates on address entities and number entities and connects these to the named entities. It validates the address and number information against existing data relations. If the address or number exists, then all information is processed as a reference to the existing records. Otherwise, a new record containing the information is added to the relations and connected to the named entity. The types of addresses captured are: location, residence, etc. The types of numbers captured are: phone, license, etc.</Paragraph>
    </Section>
    <Section position="4" start_page="71" end_page="72" type="sub_section">
      <Paragraph position="0"> The Analyst Data Setup Process CSCI links named entities together through relation links in the SQL database. The process will link the following entity information: Family (persons to family), Employment (persons to organizations), and Affiliation (persons to associations).</Paragraph>
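The three link types named above could be represented along these lines. The sketch is hypothetical: the `Entity` class, the kind strings (including treating "family" as its own entity kind), and the tuple used for a link row are assumptions made for illustration, not the CANIS schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Link type mapped to the kind of entity a person may be linked to.
LINK_TARGETS: Dict[str, str] = {
    "Family": "family",
    "Employment": "organization",
    "Affiliation": "association",
}


@dataclass(frozen=True)
class Entity:
    entity_id: int
    kind: str  # "person", "family", "organization", or "association"
    name: str


def make_link(link_type: str, person: Entity, target: Entity) -> Tuple[str, int, int]:
    """Build one relation-link row, rejecting mismatched entity kinds."""
    if link_type not in LINK_TARGETS:
        raise ValueError("unknown link type: " + link_type)
    if person.kind != "person" or target.kind != LINK_TARGETS[link_type]:
        raise ValueError(link_type + " cannot link " + person.kind + " to " + target.kind)
    return (link_type, person.entity_id, target.entity_id)
```

Checking the entity kinds before a row is written keeps an Employment link, for example, from ever pointing at an association.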
    </Section>
    <Section position="5" start_page="72" end_page="72" type="sub_section">
      <Paragraph position="0"> The Analyst Data Setup Process CSCI validates File Numbers against existing relations and connects them to named entities. The types of Filing and Document Reference data connected are: System Folder Objects and Document IDs.</Paragraph>
      <Paragraph position="1"> Finally, the Analyst Data Setup Process CSCI adds the Document to an analyst working queue for processing by an analyst through the Analyst Interface Process CSCI.</Paragraph>
    </Section>
    <Section position="6" start_page="72" end_page="72" type="sub_section">
      <Paragraph position="0"> The Analyst Data Setup Process bridges the gap between the information that was extracted from each Document and the information currently stored in the customer's database.</Paragraph>
    </Section>
    <Section position="7" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
3.4. Analyst Interface Process CSCI
</SectionTitle>
      <Paragraph position="0"> The Analyst Interface Process CSCI processes User Commands passed to it. These Commands allow an analyst to access and manipulate all the information stored in the CANIS prototype. When a Document is selected for display by an analyst, the Analyst Interface Process CSCI passes the Collection Identifier and Document Identifier for the Document to the Document Manager Process CSCI which retrieves and returns the Document and its relational records.</Paragraph>
      <Paragraph position="1"> The Analyst Interface Process CSCI displays a summary list of the named entities associated with the selected Document. An analyst may select a given entity from the list and review the entity's detailed information, delete the entity from the list, or look up a new entity found in the body of the Document. For each of the details available about a name (e.g., biographies, relationships, ID numbers, locations, phone numbers, etc.), the analyst reviews the information, modifies it if necessary, and checks it off. Some of the data, such as gender, citizenship, or relationship types, have alternative choices available on a pull-down menu to minimize the keystrokes needed to make changes.</Paragraph>
      <Paragraph position="2"> The Analyst Interface Process CSCI allows an analyst to review and modify all information (Index records, addressees, subject line, and Filing locations) about a Document. It will display a Document's text body and allow the analyst to step through the processing of the information about that Document. The following functions are available to an analyst: Document Details Review, CANIS Prototype Process Logs Review, Name Lookup and Processing, and System Filing.</Paragraph>
    </Section>
    <Section position="8" start_page="72" end_page="72" type="sub_section">
      <Paragraph position="0"> Document Details Review displays the classification, addressees, and subject line associated with the current Document. The analyst may review and modify any of this information.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="72" end_page="72" type="metho">
    <Paragraph position="0"> CANIS Prototype Process Logs Review displays the logs generated by each of the CANIS System Processes in read-only mode. The information captured by these logs includes: document identifiers for documents processed, error messages, and system-generated messages (e.g., debug).</Paragraph>
    <Paragraph position="2"> The Document Name Lookup and Processing function allows the review and modification of named entities (Personnel, Company, and Associations) of the selected Document. The options available to an analyst here are: a) Name Lookup; b) Index, Review Data records for this entity; c) Create Links between Entity names and reviewed records; and d) Add and Modify Information associated with the entity (e.g., gender, citizenship, locations, phone numbers, etc.). Extraction errors found by the analyst while processing a document are appended to a log within the SQL database for review by an engineer, who can then adjust the Extraction CSCI package.</Paragraph>
    <Section position="1" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
3.5. Document Manager Process CSCI
</SectionTitle>
      <Paragraph position="0"> The Document Manager Process CSCI is a set of library routines that provide a standard interface between the CANIS Prototype and the persistent storage of documents. The Document Manager conforms to the concepts and specifications of the TIPSTER Phase II Architecture Design Document (version 1.15). The library routines of the Document Manager Process provide all CSCIs of the CANIS Prototype with a standard interface (API) for accessing documents and communicating annotation information about those documents. The Document Manager Process CSCI is implemented on top of a relational database, with access to the database facilitated through ODBC library calls. The Document Manager Process CSCI uses Microsoft's ODBC library, Microsoft Access, and Microsoft SQLServer. Other applications using Lockheed Martin's Document Manager are being built on top of Sybase and Oracle.</Paragraph>
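To make the role of this interface concrete, here is a small Python sketch of a document store with annotations. It is loosely inspired by the collection, document, and annotation concepts mentioned above, but it is not the TIPSTER API or the Lockheed Martin library: `DocumentStore`, its method names, and the in-memory dictionary that replaces the ODBC-backed database are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Annotation:
    ann_type: str  # e.g. "person"
    start: int     # character offset where the span begins
    end: int       # character offset where the span ends (exclusive)
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


class DocumentStore:
    """Hypothetical in-memory stand-in for the persistent document store."""

    def __init__(self) -> None:
        self._docs: Dict[Tuple[str, int], Document] = {}

    def add_document(self, collection_id: str, text: str) -> int:
        # Document ids count up from 1 within each collection.
        doc_id = 1 + sum(1 for key in self._docs if key[0] == collection_id)
        self._docs[(collection_id, doc_id)] = Document(text)
        return doc_id

    def get_document(self, collection_id: str, doc_id: int) -> Document:
        return self._docs[(collection_id, doc_id)]

    def annotate(self, collection_id: str, doc_id: int, ann: Annotation) -> None:
        doc = self.get_document(collection_id, doc_id)
        # The span must satisfy 0, start, end, len(text) in non-decreasing order.
        span_ok = ann.start in range(ann.end + 1) and len(doc.text) >= ann.end
        if not span_ok:
            raise ValueError("annotation span lies outside the document text")
        doc.annotations.append(ann)
```

Every other process addresses a document only by its collection and document identifiers, which is why those two values are what the CSCIs pass to one another.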
    </Section>
  </Section>
  <Section position="5" start_page="72" end_page="73" type="metho">
    <SectionTitle>
4. NLTOOLSET
</SectionTitle>
    <Paragraph position="0"> The NLToolset is a framework of tools, techniques and resources for building text processing applications.</Paragraph>
    <Paragraph position="1"> The NLToolset is portable, extensible, robust, generic and language independent. The NLToolset combines artificial intelligence (AI) methods, especially NL processing, knowledge-based systems and information retrieval techniques, with simpler methods, such as finite state machines, lexical analysis and word-based text search, to provide broad functionality without sacrificing robustness and speed.</Paragraph>
    <Paragraph position="2"> The NLToolset currently runs on Sun Microsystems' UNIX-based platforms and PCs (using Microsoft Windows NT). The NLToolset is coded in C++ and uses the COOL Object Library. The CANIS application is PC-based and uses the Microsoft Visual C++ compiler and Visual Basic on the PC.</Paragraph>
  </Section>
  <Section position="6" start_page="73" end_page="73" type="metho">
    <SectionTitle>
5. TESTING AND EVALUATION
</SectionTitle>
    <Paragraph position="0"> We are currently in the testing phase and are developing the evaluation criteria in conjunction with the government. These phases are scheduled for completion in July 1996.</Paragraph>
    <Paragraph position="1"> Our Test Plan involves subsystem testing of each of the Comm Process CSCI, Extraction Process CSCI, Analyst Data Setup Process CSCI, and the Analyst Interface Process CSCI.</Paragraph>
    <Section position="1" start_page="73" end_page="73" type="sub_section">
      <Paragraph position="0"> We are also performing System Level Integration Testing to validate the data passing through each process within the CANIS application.</Paragraph>
      <Paragraph position="1"> Evaluation will be performed by analysts at their site using real data. We are currently working with the customer to determine evaluation criteria.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="73" end_page="74" type="metho">
    <SectionTitle>
6. CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> The CANIS prototype will show the customer a new way of doing business. Analysts will see their tasks change from manually reading and creating index records to verifying and updating automatically generated index records. Their daily process will involve more analysis than data entry and they will be able to process a larger number of documents in a single day.</Paragraph>
  </Section>
</Paper>