<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1016">
  <Title>APPENDIX C: SGML TAG LISTING SGML Tag Description</Title>
  <Section position="4" start_page="61" end_page="61" type="metho">
    <SectionTitle>
3. SYSTEM ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> Figure 3-1 illustrates how ADEPT will be inserted into the Rich Open Source Environment Version 2 (ROSE) testbed environment at OIR. After a successful evaluation, ADEPT may be made operational.</Paragraph>
    <Paragraph position="1"> ADEPT will be connected to the ROSE-Feed servers via a 16MB/second Token-Ring Local Area</Paragraph>
  </Section>
  <Section position="5" start_page="61" end_page="61" type="metho">
    <SectionTitle>
ADEPT User Workstations
</SectionTitle>
    <Paragraph position="0"> Network (LAN). These servers currently receive streams of documents from five sources/providers: NEXIS, DIALOG, DataTimes, FBIS and Newswire. Refer to Appendix A for a sample document. After successfully parsing the document and extracting the required information, ADEPT will transmit an SGML-tagged document over a one-way fiber to the ROSE-Catcher, where the information will be archived and disseminated to the OIR user community. Refer to Appendix B for an example of a processed document.</Paragraph>
    <Paragraph position="1"> ADEPT will have the ability to process more than one thousand separate sources from the five current OIR providers, currently at an average of 80 megabytes and a maximum of 150 megabytes per day. These figures are estimated to increase by twelve percent per month.</Paragraph>
    <Paragraph position="2"> Over an average month, ADEPT will operate seven days per week, processing an expected 600,000 documents.</Paragraph>
    <Paragraph position="3"> Appendix C depicts the SGML tags which will be identified by ADEPT.</Paragraph>
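    <Paragraph position="4"> As a rough illustration of the volume figures quoted above, the following Python sketch projects the daily load under the stated twelve percent monthly increase. It assumes that the increase compounds month over month, which the text does not state explicitly, and the function name projected_daily_volume is hypothetical.
<![CDATA[
```python
def projected_daily_volume(start_mb: float, monthly_growth: float, months: int) -> float:
    """Daily volume in megabytes after `months` of compound monthly growth.

    Assumption: growth compounds each month; the source only gives the rate.
    """
    return start_mb * (1.0 + monthly_growth) ** months


# Figures from the text: 80 MB average and 150 MB maximum per day, 12% per month.
for label, start in (("average", 80.0), ("maximum", 150.0)):
    after_year = projected_daily_volume(start, 0.12, 12)
    print(f"{label}: {after_year:.0f} MB/day after 12 months")
```
]]>
    Under this compounding assumption the daily volume grows by a factor of roughly 3.9 over twelve months, which is the kind of headroom a capacity plan for ADEPT would need to consider.</Paragraph>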
  </Section>
  <Section position="6" start_page="61" end_page="63" type="metho">
    <SectionTitle>
4. SYSTEM DESIGN
</SectionTitle>
    <Paragraph position="0"> Figure 4-1 illustrates the design of ADEPT.</Paragraph>
    <Paragraph position="1"> ADEPT comprises eight processes, each performing a specialized task. These processes are: the Document Input (DI), the Document Processor (DP), the Document Manager (DM), the Management Information System Manager (MISM), the Problem Queue Manager (PQM), the System Adaptation Manager (SAM), the Administration Manager (AM), and the Output Manager (OM).</Paragraph>
    <Paragraph position="2"> 4.1. Document Input (DI) The DI process is the interface between ADEPT and the ROSE-Feed servers. Based on the source, a mapping template is selected. The DI identifies and separates the ROSE-Feed stream into documents. Each document and its relevant information are stored in local storage via the DM function calls.</Paragraph>
    <Paragraph position="3"> If the mapping template cannot be identified, the stream probably came from a source unknown to ADEPT. Streams from unknown sources are sent to the Problem Queue for user intervention.</Paragraph>
    <Paragraph position="4"> 4.2. Document Processor (DP) The DP identifies and extracts all SGML tags defined in the mapping template for the specific source. Each identified field value is validated and normalized (if required) before being stored as an annotation with the document via DM function calls. DP creates annotations with the value 'NA' (Not Available) for those non-required SGML tags not present in the document. If, while processing, DP is unable to identify a required SGML tag or to validate or normalize its contents, the document is identified as a problem document. DP does not stop processing the document upon encountering an error; it completes the document processing, identifying any remaining errors. For each problem SGML tag, DP generates diagnostic information, which contains an error explanation as well as suggested corrective actions. Problem documents are sent to the Problem Queue to await analysis.</Paragraph>
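    <Paragraph position="5"> The DP behaviour described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ADEPT's implementation: the TagRule and Result classes, the "Trigger: value" line format, the date rules, and every tag name other than Pubdate are hypothetical, and suggested corrective actions are omitted.
<![CDATA[
```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TagRule:
    """Hypothetical mapping-template entry for one SGML tag."""
    tag: str
    trigger: str                                    # field name preceding the value
    required: bool
    validate: Optional[Callable[[str], bool]] = None
    normalize: Optional[Callable[[str], str]] = None


@dataclass
class Result:
    annotations: dict = field(default_factory=dict)
    diagnostics: list = field(default_factory=list)

    @property
    def is_problem(self) -> bool:
        return bool(self.diagnostics)


def process(document: str, template: list) -> Result:
    """Apply every rule, collecting all errors instead of stopping at the first."""
    result = Result()
    for rule in template:
        # Assumed document layout: one "Trigger: value" pair per line.
        pattern = rf"^{re.escape(rule.trigger)}:[ \t]*(.+)$"
        match = re.search(pattern, document, re.MULTILINE)
        if match is None:
            if rule.required:
                result.diagnostics.append(
                    f"{rule.tag}: required trigger '{rule.trigger}' not found")
            else:
                result.annotations[rule.tag] = "NA"   # Not Available
            continue
        value = match.group(1).strip()
        if rule.validate is not None and not rule.validate(value):
            result.diagnostics.append(f"{rule.tag}: value '{value}' failed validation")
            continue
        result.annotations[rule.tag] = rule.normalize(value) if rule.normalize else value
    return result


def is_us_date(value: str) -> bool:
    return re.fullmatch(r"\d{2}/\d{2}/\d{4}", value) is not None


def to_iso_date(value: str) -> str:
    month, day, year = value.split("/")
    return f"{year}-{month}-{day}"


TEMPLATE = [
    TagRule("Pubdate", "Date", required=True, validate=is_us_date, normalize=to_iso_date),
    TagRule("Author", "Byline", required=False),
    TagRule("Headline", "Title", required=True),
]

good = process("Title: Sample headline\nDate: 06/21/1996\n", TEMPLATE)
bad = process("Date: June 21\n", TEMPLATE)
print(good.annotations, good.is_problem)
print(bad.diagnostics, bad.is_problem)
```
]]>
    In this sketch the second document yields two diagnostics, one for the invalid date and one for the missing headline, because the loop continues past the first error in the way the text describes.</Paragraph>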
    <Section position="1" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
4.3. Document Manager (DM)
</SectionTitle>
      <Paragraph position="0"> The DM, the heart of ADEPT, is composed of a set of library routines providing a standard interface between ADEPT and the collections of documents in persistent storage. The DM is TIPSTER compliant and utilizes Open Database Connectivity (ODBC) to store documents and document-relevant information in the Sybase System 11 database. ODBC adds a layer of flexibility to DM: with ODBC, the Sybase System 11 database can be replaced by any ODBC-compliant database on any platform.</Paragraph>
    </Section>
    <Section position="2" start_page="62" end_page="63" type="sub_section">
      <SectionTitle>
4.4. Management Information System Manager (MISM)
</SectionTitle>
      <Paragraph position="0"> The MISM process manages the quantitative MIS Statistical data used to monitor and evaluate ADEPT.</Paragraph>
      <Paragraph position="1"> MISM records the document's name, source, date/time stamp, and other relevant information when a:  * Document is transmitted to Main-ROSE Catcher. Additionally, ADEPT captures similar statistics on problem types and problems associated with each document. The ROSE Data Administrator can perform simple queries and execute quick reports against the collected data.</Paragraph>
    </Section>
    <Section position="3" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
4.5. Problem Queue Manager (PQM)
</SectionTitle>
      <Paragraph position="0"> The PQM is responsible for managing the problem queue of ADEPT. The problem queue is a visual representation of all problem document information contained in the database. An entry exists for each problem document; it contains the document identifier, source, problem class, status, mapping template identifier, date/time stamp, etc.</Paragraph>
      <Paragraph position="1"> At the ROSE Data Administrator's discretion, documents in the problem queue can be sorted and limited by source, date/time stamp, problem class, mapping template, or status.</Paragraph>
      <Paragraph position="2"> To investigate/resolve a problem document, the desired document must be selected. For each document selected, the document viewer GUI is invoked. The GUI displays: 1) the original document, 2) the current version of the SGML template for that document, 3) the linkages between the two, 4) diagnostic information associated with the document, and 5) suggestions for fixing the problem tag(s).</Paragraph>
      <Paragraph position="3"> The document viewer allows the user to modify problem tags based on system-supplied corrective actions. If the system's suggestions are rejected, tag values can be generated from user-supplied data. For cases where the original document trigger is garbled due to a transmission error, the user can elect to define a temporary trigger. Notes can be created and saved for each document. After the problems associated with a document are addressed, the document can be resubmitted to the system for reprocessing. PQM functions give the user the ability to select and resubmit multiple documents.</Paragraph>
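      <Paragraph position="4"> The sorting and limiting of problem queue entries described in this section might look like the Python sketch below. The QueueEntry fields mirror the entry contents listed above, but the field names, the view function, and the sample values are all hypothetical.
<![CDATA[
```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class QueueEntry:
    """One problem-queue entry; field names are illustrative."""
    doc_id: str
    source: str
    problem_class: str
    status: str
    template_id: str
    timestamp: str      # ISO 8601 text, so string order equals time order


def view(entries, sort_by="timestamp", **limits):
    """Return the entries matching every limit (field=value), sorted by one field."""
    selected = [
        entry for entry in entries
        if all(getattr(entry, name) == wanted for name, wanted in limits.items())
    ]
    return sorted(selected, key=attrgetter(sort_by))


queue = [
    QueueEntry("d3", "NEXIS", "missing-tag", "open", "t1", "1996-06-03T09:00"),
    QueueEntry("d1", "FBIS", "bad-date", "open", "t2", "1996-06-01T12:30"),
    QueueEntry("d2", "NEXIS", "bad-date", "resolved", "t1", "1996-06-02T08:15"),
]

# Limit to open NEXIS documents, oldest first.
open_nexis = view(queue, source="NEXIS", status="open")
print([entry.doc_id for entry in open_nexis])
```
]]>
      Passing the limits as keyword arguments keeps the sketch short; a real problem queue would presumably push the same selection down to the database.</Paragraph>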
    </Section>
    <Section position="4" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
4.6. System Adaptation Manager (SAM)
</SectionTitle>
      <Paragraph position="0"> The SAM process provides the capability to create, modify, and associate mapping templates with a specific data source. A mapping template contains the directions on how to parse a specific data source. It specifies the SGML tags (e.g., Pubdate), whether each tag is required, and any associated field names (triggers within a document) which must be used to extract the SGML tag value, as well as the corresponding format validation and normalization rules. There is one primary mapping template for each data source received by ADEPT.</Paragraph>
      <Paragraph position="1"> Once a mapping template has been created or modified, SAM allows the Data Administrator to test the changes against sample files of documents.</Paragraph>
    </Section>
    <Section position="5" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
4.7. Administrator Manager (AM)
</SectionTitle>
      <Paragraph position="0"> The AM manages the routine system administration of ADEPT. AM provides login control and user permissions, maintains the system's security and audit log, and enables backups/restores of the system databases.</Paragraph>
      <Paragraph position="1"> All user interactions (system adaptations and problem queue manipulation) are recorded in the AM's audit log, including a record of the change, the user identification, and a date/time stamp. Both the security and audit logs can be viewed via the AM GUI.</Paragraph>
      <Paragraph position="2"> From the AM GUI, the user can authorize others to print, display, search, consolidate, and delete the computer security audit log as well as add, delete or re-enable accounts by changing user permissions.</Paragraph>
      <Paragraph position="3"> 4.8. Output Manager (OM) The OM manages the output of successfully tagged documents for ADEPT. The OM's main capabilities include: * Creation of the SGML tagged version of the document, * Performing &quot;Special Processing&quot; (when required), * Providing an interface for passing the tagged document to the Main-ROSE Catcher, * Providing a GUI which will allow the ROSE Data Administrator to view the original document, the final tagged document and the linkages between the two for any document stored in local storage.</Paragraph>
      <Paragraph position="4"> OM retrieves successfully processed documents.</Paragraph>
      <Paragraph position="5"> For each document, OM walks through the annotations (SGML tags), accessing their associated SGML tag values. The set of SGML tags with their corresponding values constitutes the SGML template for that document. If the document is initially from the ROSE-Feed, OM will send the SGML template, conforming to a specific protocol, to the Main-ROSE Catcher. Successfully processed sample documents are saved to a UNIX file for future review.</Paragraph>
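      <Paragraph position="6"> The walk over annotations described above can be sketched in Python as shown here. The actual output layout is defined by the appendices, which are not reproduced in this section, so the DOC wrapper element, the render_sgml function name, and the choice to escape markup characters in values are assumptions made for illustration only.
<![CDATA[
```python
from html import escape


def render_sgml(annotations: dict) -> str:
    """Walk the annotations and emit one element per SGML tag.

    Assumptions: a hypothetical DOC wrapper element, and escaping of
    markup characters inside values.
    """
    lines = ["<DOC>"]
    for tag, value in annotations.items():
        lines.append(f"<{tag}>{escape(value, quote=False)}</{tag}>")
    lines.append("</DOC>")
    return "\n".join(lines)


template_text = render_sgml({"Pubdate": "1996-06-21", "Author": "NA"})
print(template_text)
```
]]>
      Non-required tags that were absent from the source document still appear in the output with the value NA, since the DP stores an annotation for them.</Paragraph>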
    </Section>
  </Section>
  <Section position="7" start_page="63" end_page="64" type="metho">
    <SectionTitle>
5. SYSTEM PROCESSING
</SectionTitle>
    <Paragraph position="0"> Information is passed to each process via collections stored within the TIPSTER-compliant Document Manager (DM). Collections act as the queues for the processes. A collection contains the information necessary for a process to perform its task (i.e., documents and document-relevant information). The DP, PQM, and OM processes each have their own unique collection. A process begins by accessing the first document in its collection. When processing is completed, the document is moved to another collection for the next process to continue. Since a document moves from collection to collection, each process depends only upon the documents in its own collection.</Paragraph>
    <Paragraph position="1"> As depicted in Figure 5-1, there are two categories of collections: production and adaptation. Documents supplied by the ROSE-Feed are processed in the production collections. Adaptation testing as well as documents from sample files are processed in the adaptation collections. These two categories of collections clearly separate adaptation documents from production documents. Documents in the production category will run at a higher priority than those in the adaptation category. Prioritizing enables ADEPT to process both categories of collections concurrently.</Paragraph>
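    <Paragraph position="2"> The collection-based flow described in this section can be sketched in Python as follows. This is an in-memory stand-in rather than the TIPSTER Document Manager: the CollectionStore class and run_dp_step function are hypothetical, and the sketch assumes strict priority (production documents are always taken first), whereas the text says only that production runs at a higher priority.
<![CDATA[
```python
from collections import deque

CATEGORIES = ("production", "adaptation")   # listed from highest to lowest priority
STAGES = ("DP", "PQM", "OM")


class CollectionStore:
    """In-memory stand-in for the collections held by the Document Manager."""

    def __init__(self):
        self.queues = {(c, s): deque() for c in CATEGORIES for s in STAGES}

    def put(self, category, stage, document):
        self.queues[(category, stage)].append(document)

    def take(self, stage):
        """Pop the first document for a stage, preferring production work."""
        for category in CATEGORIES:
            queue = self.queues[(category, stage)]
            if queue:
                return category, queue.popleft()
        return None


def run_dp_step(store, has_problem):
    """Process one DP document and move it to the PQM or OM collection."""
    item = store.take("DP")
    if item is None:
        return None
    category, document = item
    next_stage = "PQM" if has_problem(document) else "OM"
    store.put(category, next_stage, document)   # stays within its category
    return category, document, next_stage


store = CollectionStore()
store.put("adaptation", "DP", "sample-1")
store.put("production", "DP", "feed-1")
print(run_dp_step(store, lambda doc: False))
```
]]>
    Because each step reads only from its own stage's collections, the DP, PQM, and OM steps stay decoupled, and a document never crosses from the adaptation category into production.</Paragraph>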
  </Section>
  <Section position="8" start_page="64" end_page="68" type="metho">
    <SectionTitle>
6. STATUS
</SectionTitle>
    <Paragraph position="0"> The ADEPT project has completed the System Requirements Review (SRR) as well as the Preliminary Design Review (PDR). A Critical Design Review (CDR) is scheduled for late June 1996, to be followed by a TIPSTER Engineering Review. ADEPT will be installed in OIR's testbed environment in December 1996, where it will undergo a three-month evaluation period. After a successful evaluation, OIR will have the option to transition ADEPT to its production environment.</Paragraph>
  </Section>
</Paper>