<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0801">
  <Title>The Talent System: TEXTRACT Architecture and Data Model</Title>
  <Section position="1" start_page="10598" end_page="10598" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present the architecture and data model for TEXTRACT, a document analysis framework for text analysis components. The framework and components have been deployed in research and industrial environments for text analysis and text mining tasks.</Paragraph>
    <Paragraph position="1">
Introduction
In response to a need for a common infrastructure and basic services for a number of different, but coordinated, text analysis activities with a common set of requirements, the Talent (Text Analysis and Language ENgineering Tools) project at IBM Research developed the first TEXTRACT system in 1993. It featured a common C API and a tripartite data model, consisting of linked-list annotations and two hash-table extensible vectors for a lexical cache and a document vocabulary. The experience of productizing this system as part of IBM's well-known commercial product Intelligent Miner for Text (IM4T) in 1997, as well as new research requirements, motivated the migration of the analysis components to a C++ framework with a more modular architecture, modeled upon IBM's Software Solutions (SWS) Text Analysis Framework (TAF).</Paragraph>
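The tripartite data model described above (linked-list annotations plus two hash-indexed extensible vectors for the lexical cache and the document vocabulary) can be sketched as follows. This is a minimal illustrative C++ sketch; the class and method names are assumptions, not TEXTRACT's actual API.

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One linked-list node per annotation: a typed span over the
// character-based document stream (simplified illustration).
struct Annotation {
    std::string type;        // e.g. "Sentence", "NamedEntity"
    std::size_t begin, end;  // character offsets into the document
};

// Hypothetical sketch of the tripartite store: annotations in a linked
// list, plus two hash-indexed extensible vectors for the lexical cache
// and the document vocabulary.
class DocumentStore {
public:
    void addAnnotation(Annotation a) { annotations_.push_back(std::move(a)); }

    // Lexical cache: one entry per distinct token form; returns the
    // entry's index, reusing it if the form was seen before.
    std::size_t cacheLexeme(const std::string& form) {
        auto it = lexIndex_.find(form);
        if (it != lexIndex_.end()) return it->second;
        lexCache_.push_back(form);
        return lexIndex_[form] = lexCache_.size() - 1;
    }

    // Document vocabulary: canonical-form entries with frequencies.
    void addVocabulary(const std::string& canon) { ++vocab_[canon]; }

    std::size_t annotationCount() const { return annotations_.size(); }
    std::size_t lexemeCount() const { return lexCache_.size(); }

private:
    std::list<Annotation> annotations_;
    std::vector<std::string> lexCache_;
    std::unordered_map<std::string, std::size_t> lexIndex_;
    std::unordered_map<std::string, std::size_t> vocab_;
};
```

The point of the hash index over the extensible vector is that repeated token forms map to a single cache entry while entries remain addressable by a stable integer index.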
    <Paragraph position="2"> The current version of TEXTRACT that we outline here is significantly different from the one in IM4T; however, it still retains the tripartite model of the central data store.</Paragraph>
    <Paragraph position="3"> In this paper, we first give an overview of the TEXTRACT architecture. Section 3 outlines different operational environments in which the architecture can be deployed. In Section 4, we describe the tripartite data model. In Section 5, we illustrate some fundamentals of plugin design, by focusing on Talent's Finite State Transducer component and its interaction with the architecture and data model. Section 6 reviews related work. Finally, we conclude and chart future directions.
The TEXTRACT Architecture: Overview
TEXTRACT is a robust document analysis framework, whose design has been motivated by the requirements of an operational system capable of efficiently processing thousands of documents/gigabytes of data. It has been engineered for flexible configuration in implementing a broad range of document analysis and linguistic processing tasks. The common architecture features it shares with TAF include:
* interchangeable document parsers allow the 'ingestion' of source documents in more than one format (specifically, XML, HTML, ASCII, as well as a range of proprietary ones);
* a document model provides an abstraction layer between the character-based document stream and annotation-based document components, both structurally derived (such as paragraphs and sections) and linguistically discovered (such as named entities, terms, or phrases);
* linguistic analysis functionalities are provided via tightly coupled individual plugin components; these share the annotation repository, lexical cache, and vocabulary, and communicate with each other by posting results to, and reading prior analyses from, them;
* plugins share a common interface, and are dispatched by a plugin manager according to declared dependencies among plugins; a resource manager controls shared resources such as lexicons, glossaries, or gazetteers; and, at a higher level of abstraction, an engine maintains the document processing cycle;
* the system and individual plugins are softly configurable, entirely from the outside;
* the architecture allows for processing of a stream of documents; furthermore, by means of collection-level plugins and applications, cross-document analyses and statistics can be derived for entire document collections.</Paragraph>
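The dispatch scheme just described (a common plugin interface, with a plugin manager ordering execution by declared dependencies) can be sketched in C++. The names below (Plugin, PluginManager, SimplePlugin) are illustrative assumptions for this sketch, not TEXTRACT's actual interfaces.

```cpp
#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plugin contract: each plugin declares its name and the
// analyses it depends on; process() would read from and post to the
// shared annotation repository.
class Plugin {
public:
    virtual ~Plugin() = default;
    virtual std::string name() const = 0;
    virtual std::vector<std::string> dependsOn() const = 0;
    virtual void process() = 0;
};

class PluginManager {
public:
    void add(std::unique_ptr<Plugin> p) { plugins_.push_back(std::move(p)); }

    // Dependency-ordered dispatch: run a plugin only once everything it
    // declared a dependency on has already run.
    std::vector<std::string> runAll() {
        std::vector<std::string> order;
        std::map<std::string, bool> done;
        std::size_t guard = plugins_.size() * plugins_.size() + 1;
        while (order.size() < plugins_.size() && guard--) {
            for (auto& p : plugins_) {
                if (done[p->name()]) continue;
                auto deps = p->dependsOn();
                bool ready = std::all_of(deps.begin(), deps.end(),
                    [&](const std::string& d) { return done[d]; });
                if (ready) {
                    p->process();
                    done[p->name()] = true;
                    order.push_back(p->name());
                }
            }
        }
        return order;  // the dispatch order actually used
    }

private:
    std::vector<std::unique_ptr<Plugin>> plugins_;
};

// Minimal concrete plugin for illustration only.
class SimplePlugin : public Plugin {
public:
    SimplePlugin(std::string n, std::vector<std::string> d)
        : name_(std::move(n)), deps_(std::move(d)) {}
    std::string name() const override { return name_; }
    std::vector<std::string> dependsOn() const override { return deps_; }
    void process() override {}
private:
    std::string name_;
    std::vector<std::string> deps_;
};
```

Registration order is thus decoupled from execution order: a tagger registered before a tokenizer it depends on will still run after it.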
    <Paragraph position="4"> TEXTRACT is industrial-strength (IBM, 1997), Unicode-ready, and language-independent (currently, analysis functionalities are implemented primarily for English). It is a cross-platform implementation, written in C++. TEXTRACT is 'populated' by a number of plugins, providing functionalities for:
* tokenization;
* document structure analysis, from tags and white space;
* lexicon interface, complete with efficient look-up and full morphology;
* importation of lexical and vocabulary analyses from a non-TEXTRACT process via XML markup;
* analysis of out-of-vocabulary words (Park, 2002);
* abbreviation finding and expansion (Park and Byrd, 2001);
* named entity identification and classification (person names, organizations, places, and so forth) (Ravin and Wacholder, 1997);
* technical term identification, in technical prose (Justeson and Katz, 1995);
* vocabulary determination and glossary extraction, in specialized domains (Park et al., 2002);
* vocabulary aggregation, with reduction to canonical form, within and across documents;
* part-of-speech tagging (with different taggers) for determining syntactic categories in context;
* shallow syntactic parsing, for identifying phrasal and clausal constructs and semantic relations (Boguraev, 2000);
* salience calculations, both inter- and intra-document;
* analysis of topic shifts within a document (Boguraev and Neff, 2000a);
* document clustering, cluster organization, and cluster labeling;
* single-document summarization, configurable to deploy different algorithmic schemes (sentence extraction, topical highlights, lexical cohesion) (Boguraev and Neff, 2000a, 2000b);
* multi-document summarization, using iterative residual rescaling (Ando et al., 2000);
* pattern matching, deploying finite state technology specially designed to operate over document content abstractions (as opposed to a character stream alone).</Paragraph>
    <Paragraph position="5"> The list above is not exhaustive, but indicative of the kinds of text mining TEXTRACT is being utilized for; we anticipate new technologies being continually added to the inventory of plugins. As will become clear later in the paper, the architecture of this system openly caters for third-party plugin writers.</Paragraph>
    <Paragraph position="6"> Specific TEXTRACT configurations may deploy custom subsets of the available plugin components, in order to carry out particular processing; such configurations typically implement an application for a specific content analysis / text mining task. From an application's point of view, TEXTRACT plugins deposit analysis results in the shared repository; the application itself 'reads' these via a well-defined interface. Document application examples to date include document summarization, a customer claims analysis system (Nasukawa and Nagano, 2001), and so forth.</Paragraph>
    <Paragraph position="7"> Collection applications have a document analysis component, which may also write to the shared repository. These include named relation extraction (Byrd and Ravin, 1999), custom dictionary building (Park et al., 2001), indexing for question answering (Prager et al., 2000), cross-document coreference (Ravin and Kazi, 1999), and statistical collection analysis for document summarization or lexical navigation (Cooper and Byrd, 1997).</Paragraph>
    <Paragraph position="8"> For packaging in applications, TEXTRACT has, in addition to its native APIs, a C API layer for exporting the contents of the data store to external components written in C++ or Java.</Paragraph>
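A C API layer over a C++ data store is conventionally built from an opaque handle plus plain-C accessor functions, so that C callers (and Java, via JNI) never see C++ types. The sketch below illustrates that pattern under assumed names (TextractStore, textract_store_*); it is not TEXTRACT's actual export API.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Opaque to C callers: they only ever hold a TextractStore* handle.
struct TextractStore {
    std::vector<std::string> vocabulary;  // simplified data store content
};

extern "C" {

TextractStore* textract_store_new() { return new TextractStore(); }
void textract_store_free(TextractStore* s) { delete s; }

void textract_store_add_term(TextractStore* s, const char* term) {
    s->vocabulary.emplace_back(term);
}

int textract_store_term_count(const TextractStore* s) {
    return static_cast<int>(s->vocabulary.size());
}

// Copies the i-th term into buf, NUL-terminated, truncating if needed;
// copying out (rather than returning internal pointers) keeps the C++
// memory ownership on the library side of the boundary.
void textract_store_term_at(const TextractStore* s, int i,
                            char* buf, int buflen) {
    std::strncpy(buf, s->vocabulary[static_cast<std::size_t>(i)].c_str(),
                 static_cast<std::size_t>(buflen) - 1);
    buf[buflen - 1] = '\0';
}

}  // extern "C"
```

Because the functions have C linkage and take only C types, the same header can be consumed by a C compiler or wrapped with JNI native method declarations.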
  </Section>
</Paper>