File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/03/w03-0801_concl.xml
Size: 4,244 bytes
Last Modified: 2025-10-06 13:53:40
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0801"> <Title>The Talent System: TEXTRACT Architecture and Data Model</Title> <Section position="7" start_page="10598" end_page="10598" type="concl"> <SectionTitle> 7 Conclusion </SectionTitle> <Paragraph position="0"> In this paper, we have described an industrial infrastructure for composing and deploying natural language processing components that has evolved in response to both research and product requirements. It has been widely used in both research projects and product-level applications. A goal of the Talent project has been to create technology that is well suited for building robust text analysis systems. With its simple plugin interface (see Section 5), its rich declarative data model, and the flexible APIs to it (Section 4), TEXTRACT has achieved that goal by providing a flexible framework for system builders. The system is habitable (external processes can be 'wrapped' as plugins, thus becoming available as stages in the processing pipeline) and open (completely new plugins can be written--by anyone--to a simple API, as long as their interfaces to the annotation repository, the lexical cache, and the vocabulary (Section 4) follow the published set of specifications).</Paragraph> <Paragraph position="1"> Openness is further enhanced by encouraging the use of TFST, which directly supports the development and subsequent deployment of grammar-based plugins in a congenial style. Overall, TEXTRACT's design characteristics prompted the adoption of most of the architecture by a new framework for management and processing of unstructured information at IBM Research (see below).</Paragraph> <Paragraph position="2"> Performance is not generally an inherent property of an architecture, but rather of implementations of that architecture.
Also, the performance of any given configuration of the system depends on the number, type, and algorithmic design and implementation of the plugins deployed in it. Thus it is hard to quantify TEXTRACT's performance.</Paragraph> <Paragraph position="3"> The most recent implementation of the architecture is in C++ and makes extensive use of algorithms, container classes and iterators from the C++ Standard Template Library for manipulating the data objects in the data model; its performance therefore benefits from state-of-the-art implementations of the STL. As an informal indication of achievable throughput, an earlier product implementation of the tokenization base services and annotation subsystem, in the context of an information retrieval indexer, was able to process documents at a rate of over 2 gigabytes per hour on a mid-range Unix workstation.</Paragraph> <Paragraph position="4"> Allowing TEXTRACT's plugins to introduce -- dynamically -- new annotation types and properties is an important part of an open system. However, a limitation of the current design is the fixed organization of annotations into families (see Section 4).
This makes it hard to accommodate new plugins that need to appeal to information that either is not naturally encodable in the family space TEXTRACT pre-defines, or requires a richer substrate of (possibly mutually dependent) feature sets.</Paragraph> <Paragraph position="5"> In a move towards a fully declarative representation of linguistic information, where an annotation maximally shares an underlying set of linguistic properties, a rational re-design of TEXTRACT (Ferrucci and Lally, 2003) is adopting a hierarchical system of feature-based annotation types; it has been demonstrated that even systems supporting only strict single inheritance are powerful enough for a variety of linguistic processing applications (Shieber, 1986), largely through their well-understood mathematical properties (Carpenter, 1992). Some of this migration is naturally supported by the initial TEXTRACT data model design. Other architectural components will require re-tooling; in particular, the FST subsystem will need further extensions for the definition of FS algebra over true typed feature structures (see, for instance, Brawer, 1998; Wunsch, 2003). We will return to this issue in a subsequent paper.</Paragraph> </Section> </Paper>