<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0808"> <Title>InfoXtract: A Customizable Intermediate Level Information Extraction Engine</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper focuses on new intermediate level information extraction tasks that are defined and implemented in an IE engine named InfoXtract.</Paragraph> <Paragraph position="1"> InfoXtract is a domain-independent but portable information extraction engine designed for information discovery applications.</Paragraph> <Paragraph position="2"> The last decade has seen great advances in the area of IE. In the US, MUC [Chinchor & Marsh 1998] has been the driving force for developing this technology. The most successful IE task thus far has been Named Entity (NE) tagging. The state of the art, exemplified by systems such as NetOwl [Krupka & Hausman 1998], IdentiFinder [Miller et al 1998] and InfoXtract [Srihari et al 2000], has reached near-human performance, with an F-measure of 90% or above. On the other hand, the deep level MUC IE task Scenario Template (ST) is designed to extract detailed information for predefined event scenarios of interest. It involves filling the slots of complicated templates. It is generally felt that this task is too ambitious for commercial application at present.</Paragraph> <Paragraph position="3"> Information Discovery (ID) is a term which has traditionally been used to describe efforts in data mining [Han 1999]. The goal is to extract novel patterns of transactions that may reveal interesting trends. The key assumption is that the data is already in a structured form. In this paper, ID is defined within the context of unstructured text documents: it is the ability to extract, normalize/disambiguate, merge and link entities, relationships, and events that provides significant support for ID applications. 
Furthermore, there is a need to accumulate information across documents about entities and events. Because events in the real world change rapidly, what is of no interest one day may be especially interesting the next. Thus, information discovery applications demand breadth and depth in IE technology.</Paragraph> <Paragraph position="4"> A variety of IE engines, reflecting various goals in terms of extraction as well as architectures, are now available. Among these, the most widely used are the GATE system from the University of Sheffield [Cunningham et al 2003], the IE components from Clearforest (www.clearforest.com), SIFT from BBN [Miller et al 1998], REES from SRA [Aone & Ramon-Santacruz 1998] and various tools provided by Inxight (www.inxight.com). Of these, the GATE system most closely resembles InfoXtract in terms of its goals as well as its architecture and customization tools. Cymfony differentiates itself by using a hybrid model that efficiently combines statistical and grammar-based approaches, as well as by using an internal data structure known as a token-list that can represent hierarchical linguistic structures and IE results for multiple modules to work on.</Paragraph> <Paragraph position="5"> The research presented here focuses on a new intermediate level of information extraction which supports information discovery. Specifically, it defines new IE tasks such as Entity Profile (EP) extraction, which is designed to accumulate interesting information about an entity across documents as well as within a discourse. Furthermore, Concept-based General Event (CGE) is defined as a domain-independent representation of event information that is more feasible than MUC ST.</Paragraph> <Paragraph position="6"> InfoXtract represents a hybrid model for both shallow and intermediate level IE: it exploits both statistical and grammar-based paradigms. 
A key feature is the ability to rapidly customize the IE engine for a specific domain and application. Information discovery applications are required to process an enormous volume of documents, and hence any IE engine must be able to scale up in terms of processing speed and robustness; the design and architecture of InfoXtract reflect this need.</Paragraph> <Paragraph position="7"> In the remainder of the paper, Section 2 defines the new intermediate level IE tasks. Section 3 presents extensions to InfoXtract to support cross-document IE. Section 4 presents the hybrid technology. Section 5 delves into the engineering architecture and implementation of InfoXtract. Section 6 discusses domain porting. Section 7 presents two applications which have exploited InfoXtract, and finally, Section</Paragraph> </Section> </Paper>