<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0808"> <Title>InfoXtract: A Customizable Intermediate Level Information Extraction Engine</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 InfoXtract: Defining New IE Tasks </SectionTitle> <Paragraph position="0"> InfoXtract [Li & Srihari 2003, Srihari et al 2000] is a domain-independent, domain-portable, intermediate-level IE engine. Figure 1 illustrates the overall architecture of the engine.</Paragraph> <Paragraph position="1"> A description of the increasingly sophisticated IE outputs from the InfoXtract engine is given below: * NE: Named Entity objects represent key items such as proper names of persons, organizations, products, locations, and targets; contact information such as addresses, email addresses, phone numbers, and URLs; and time and numerical expressions such as dates, years, and various measurements (weight, money, percentage, etc.).</Paragraph> <Paragraph position="2"> * CE: Correlated Entity objects capture relationship mentions between entities, such as the affiliation relationship between a person and his employer. These results are consolidated into the Entity Profile (EP) information object based on co-reference and alias support.</Paragraph> <Paragraph position="3"> * EP: Entity Profiles are complex, rich information objects that collect entity-centric information, in particular all the CE relationships and all the events that a given entity is involved in. This is achieved through document-internal fusion and cross-document fusion of related information based on support from co-reference, including alias association.</Paragraph> <Paragraph position="4"> Work is in progress to enhance the fusion by correlating the extracted information with information in a user-provided existing database.</Paragraph> <Paragraph position="5"> * GE: General Events are verb-centric information objects representing 'who did what to whom, when, and where' at the logical level.</Paragraph> <Paragraph position="6"> Concept-based GE (CGE) further requires that participants of events be filled by EPs instead of NEs and that the other values of the GE slots (the action, time, and location) be disambiguated and normalized.</Paragraph> <Paragraph position="7"> * PE: Predefined Events are domain-specific or user-defined events of a specific event type, such as Product Launch and Company Acquisition in the business domain. They represent a simplified version of the MUC Scenario Template (ST) task. InfoXtract provides a toolkit that allows users to define and write their own PEs based on automatically generated PE rule templates.</Paragraph> <Paragraph position="8"> The InfoXtract engine has been deployed both internally to support Cymfony's Brand Dashboard(TM) product and externally to a third-party integrator for building IE applications in the intelligence domain.</Paragraph> <Paragraph position="10"/> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Hybrid Technology </SectionTitle> <Paragraph position="0"> InfoXtract represents a hybrid model for IE, since it combines grammar formalisms with machine learning. Achieving the right balance of these two paradigms is a major design objective of InfoXtract. The core of the parsing and information extraction process in InfoXtract is organized as a simple pipeline of processing modules. All modules operate on a single in-memory data structure, called a token list.
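As an illustration of this pipeline organization, here is a minimal sketch in Python (the actual engine is implemented in C++ and configured via a configuration file; all class and function names below are hypothetical, and the token list structure itself is described in the next paragraph):

# Hypothetical sketch of InfoXtract-style pipeline processing.
# Each module reads and mutates the shared in-memory token list;
# the ordering of modules comes from the configuration file.

class TokenList:
    """Shared in-memory structure that all modules operate on."""
    def __init__(self, text):
        self.text = text
        self.tokens = []       # token nodes, later grouped into trees
        self.relations = []    # relational links asserted between tokens

class Module:
    """Base class for one stage of the pipeline (lexical lookup, grammar, ...)."""
    def process(self, token_list):
        raise NotImplementedError

def run_pipeline(text, modules):
    """Apply the configured modules, in order, to a single token list."""
    token_list = TokenList(text)
    for module in modules:          # ordering comes from the configuration file
        module.process(token_list)
    return token_list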
A token list is essentially a sequence of tree structures, overlaid with a graph whose edges define relations that may be either grammatical or informational in nature. The nodes of these trees are called tokens. InfoXtract's typical mode of processing is to skim along the roots of the trees in the token list, building up structure &quot;strip-wise&quot;; so even non-terminal nodes behave, in the typical case, as complex tokens. Representing a marked-up text using trees explicitly, rather than implicitly as an interpretation of paired bracket symbols, has several advantages. For example, it allows a somewhat richer organization of the information contained &quot;between the brackets,&quot; such as direct links from a root node to its semantic head.</Paragraph> <Paragraph position="1"> The processing modules that act on token lists range from lexical lookup and the application of hand-written grammars, through statistical analysis based on machine learning, all the way to arbitrary procedures written in C++. The configuration of the InfoXtract processing pipeline is controlled by a configuration file, which handles pre-loading of required resources as well as the ordering of module application. Despite the variety of implementation strategies available, InfoXtract Natural Language Processing (NLP) modules are restricted to the following three types of actions on the token list: 1. Assertion and erasure of token properties (features, normal forms, etc.) 2. Grouping token sequences into higher-level constituent tokens.</Paragraph> <Paragraph position="2"> 3. Linking token pairs with a relational link.</Paragraph> <Paragraph position="3"> Grammatical analysis of the input text makes use of a combination of phrase-structure and relational approaches to grammar. Basically, early modules build up structure to a certain level (including relatively simple noun phrases, verb groups, and prepositional phrases), after which further grammatical structure is represented by asserting relational links between tokens. This mix of phrase-structural and relational approaches is very similar to the approach of Lexical Functional Grammar (LFG) [Kaplan & Bresnan 1982], much scaled down.</Paragraph> <Paragraph position="4"> Our grammars are written in a formalism developed for our own use, and also in a modified formalism developed for outside users, based on our in-house experience. In both cases, the formalism mixes regular expressions with boolean expressions.</Paragraph> <Paragraph position="5"> Actions affecting the token list are implemented as side effects of pattern matching. So although our processing modules are, in the technical sense, token list transducers, they do not resemble Finite State Transducers (FSTs) so much as the regular-expression-based pattern-action rules used in Awk or Lex.</Paragraph> <Paragraph position="6"> Grammars can contain (non-recursive) macros, with parameters.</Paragraph> <Paragraph position="7"> This means that some long-distance dependencies, which are very awkward to represent directly in finite state automata, can be represented very compactly in macro form. While this has the advantage of decreasing grammar sizes, it does increase the size of the resulting automata. Grammars are compiled to a special type of finite state automaton. These token list automata can be thought of as an extension of tree-walking automata [Monnich et al 2001, Aho & Ullman 1971, Engelfriet et al 1999].
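The token list structure and the three permitted action types can be sketched as follows (illustrative Python only; the real data structures are implemented in C++ and are considerably richer):

# Illustrative sketch: a token list is a root sequence of token nodes; tokens
# may dominate child tokens (trees) and may be linked to other tokens by
# labeled relational edges.

class Token:
    def __init__(self, surface):
        self.surface = surface
        self.features = {}        # e.g. {"POS": "NP", "normal_form": ...}
        self.children = []        # constituent structure (tree)
        self.links = []           # (relation_name, target_token) pairs

# 1. Assertion and erasure of token properties
def assert_feature(token, name, value):
    token.features[name] = value

def erase_feature(token, name):
    token.features.pop(name, None)

# 2. Grouping a token sequence into a higher-level constituent token
def group(roots, start, end, label):
    """Replace roots[start:end] with a single constituent token."""
    parent = Token(" ".join(t.surface for t in roots[start:end]))
    parent.features["cat"] = label
    parent.children = roots[start:end]
    roots[start:end] = [parent]   # the new node replaces its children as a root
    return parent

# 3. Linking a token pair with a relational link
def link(source, target, relation):
    source.links.append((relation, target))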
These are linear automata (as opposed to standard finite state tree automata [Gecseg & Steinby 1997], which are more naturally thought of as parallel) that run over trees. The problem with linear automata on trees is that there can be a number of &quot;next&quot; nodes to move the read head to: right sister, left sister, parent, first child, etc. So the vocabulary of the automaton is increased to include not only symbols that might appear in the text (test instructions) but also symbols that indicate where to move the read head (directive instructions). We have extended the basic tree-walking formalism in several directions. First, we extend the power of test instructions to allow them to check features of the current node and to perform string matching against the semantic head of the current node (so that a syntactically complex constituent can be matched against a single word). Second, we include symbols for action instructions, to implement side effects.</Paragraph> <Paragraph position="8"> Finally, we allow movement not only along the root sequence (string-automaton style) and branches of a tree (tree-walking style) but also along the terminal frontier of the tree and along relational links. These extensions to standard tree-walking automata extend the power of that formalism tremendously, and could in principle pose problems. However, the grammar formalisms that compile into these token-list-walking automata are restrictive, in the sense that there exist many token list transductions that are implementable as automata but not implementable as grammars. Also, the nature of the shallow parsing task itself is such that we only need to dip into the reserves of power that this representation affords us on relatively rare occasions. As a result, the automata that we actually plug into the InfoXtract NLP pipeline generally run very fast.</Paragraph> <Paragraph position="9"> Recently, we have developed an extended finite state formalism named Expert Lexicon, following the general trend of lexicalist approaches to NLP. An Expert Lexicon rule combines grammatical components with proximity-based keyword matching. All Expert Lexicon entries are indexed, similar to the case for the finite state tool in INTEX [Silberztein 2000]. The pattern matching time is therefore reduced dramatically compared to a sequential finite state device.</Paragraph> <Paragraph position="10"> Some unique features of this formalism include: (i) the flexibility of inserting any number of Expert Lexicons at any level of the process; (ii) the capability of proximity checking within a window size as a rule constraint, in addition to pattern matching using an FST call, so that the rule writer can exploit the combined advantages of both; and (iii) support for the propagation of semantic tagging results, to accommodate principles like one sense per discourse.</Paragraph> <Paragraph position="11"> Expert Lexicons are used for lexicon customization, named entity glossaries, and alias lists, as well as for concept tagging.</Paragraph> <Paragraph position="12"> Both supervised machine learning and unsupervised learning are used in InfoXtract. Supervised learning is used in hybrid modules such as NE [Srihari et al 2000], NE Normalization [Li et al 2002], and Co-reference. It is also used in the preprocessing module for orthographic case restoration of case-insensitive input [Niu et al 2003]. Unsupervised learning involves the acquisition of lexical knowledge and rules from a raw corpus.
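A generic sketch of this kind of seed-driven acquisition is given below (illustrative Python; it follows the general bootstrapping scheme of the work cited in the next paragraph rather than Cymfony's actual implementation):

# Generic bootstrapping sketch: start from seed instances (e.g. known person
# names or CE examples), harvest the contexts they occur in as candidate
# patterns, keep the best-scoring patterns, and use them to extract new
# instances from the raw corpus. Repeat for a few iterations.

def bootstrap(corpus_sentences, seeds, iterations=3, patterns_per_round=5):
    """Seed-driven acquisition of a glossary plus extraction patterns."""
    lexicon = set(seeds)
    patterns = set()
    for _ in range(iterations):
        # 1. Harvest the contexts of known instances as candidate patterns.
        candidates = {}
        for sent in corpus_sentences:
            for item in lexicon:
                if item in sent:
                    left, _, right = sent.partition(item)
                    pattern = (left[-20:], right[:20])   # crude context window
                    candidates[pattern] = candidates.get(pattern, 0) + 1
        # 2. Keep the highest-scoring (here: most frequent) new patterns.
        ranked = sorted(candidates, key=candidates.get, reverse=True)
        patterns.update(ranked[:patterns_per_round])
        # 3. Apply the patterns to extract new instances from the raw corpus.
        for sent in corpus_sentences:
            for left, right in patterns:
                if left and right and left in sent:
                    start = sent.index(left) + len(left)
                    end = sent.find(right, start)
                    if end > start:
                        lexicon.add(sent[start:end].strip())
    return lexicon, patterns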
The lexical knowledge acquired in this way includes word clusters, automatically acquired name glossaries, and thesaurus entries; rule acquisition involves bootstrapped learning of NE and CE rules, similar to the techniques used in [Riloff 1996]. The results of unsupervised learning can be post-edited and added as additional resources for InfoXtract processing.</Paragraph> <Paragraph position="13"> InfoXtract has been benchmarked using the MUC-7 data sets, which are recognized as standards by the research community. Precision and recall figures for the person and location entity types were above 90%.</Paragraph> <Paragraph position="14"> For organization entity types, precision and recall were in the high 80s, reflecting the fact that organization names tend to be highly domain-specific.</Paragraph> <Paragraph position="15"> InfoXtract provides the ability to create customized named entity glossaries, which boosts the performance of organization tagging for a given domain. No such customization was done in the testing just described. The accuracy of shallow parsing is well over 90%, reflecting highly accurate part-of-speech tagging and named entity tagging. Table 1 shows the benchmarks for CE relationships, which are the basis for EPs, and for the SVO parsing which supports event extraction.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Engineering Architecture </SectionTitle> <Paragraph position="0"> The InfoXtract engine has been developed as a modular, distributed application and is capable of processing up to 20 MB per hour on a single processor. The system has been tested on very large document collections (over 1 million documents). The architecture facilitates the incorporation of the engine into external applications requiring an IE subsystem. Requests to process documents can be submitted through a web interface or via FTP. The results of processing a document can be returned in XML. Since various tools are available to automatically populate databases based on XML data models, the results are easily usable in web-enabled database applications.</Paragraph> <Paragraph position="1"> Configuration files enable the system to be used with different lexical/statistical/grammar resources, as well as with subsets of the available IE modules.</Paragraph> <Paragraph position="2"> InfoXtract supports two modes of operation: active and passive. It can act either as an active retriever of documents to process or as a passive receiver of documents to process. In active mode, InfoXtract can retrieve documents via HTTP, FTP, or the local file system. In passive mode, InfoXtract can accept documents via HTTP. Figure 2 illustrates a typical multiple-processor configuration of InfoXtract. The architecture facilitates scalability by supporting multiple, independent Processors. The Processors can run on a single server (if multiple CPUs are available) or across multiple servers. The Document Manager distributes requests to process documents to all available Processors. Each component is an independent application. All direct inter-module communication is accomplished using the Common Object Request Broker Architecture (CORBA). CORBA provides a robust, programming-language-independent, and platform-neutral mechanism for developing and deploying distributed applications. Processors can be added and removed without stopping the InfoXtract engine.
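A minimal sketch of this register-and-dispatch pattern, with plain Python objects standing in for the CORBA interfaces (all names are illustrative, not the actual API):

# Illustrative stand-in for the CORBA-based wiring: Processors register with
# the Document Manager when they come up, and can be removed at any time
# without restarting the engine.

class Processor:
    def __init__(self, name):
        self.name = name

    def process(self, document):
        # In the real engine this runs the full NLP pipeline and returns XML.
        return f"<result processor='{self.name}' doc='{document}'/>"

class DocumentManager:
    def __init__(self):
        self.processors = []

    def register(self, processor):          # called by a Processor at startup
        self.processors.append(processor)

    def unregister(self, processor):        # Processors can leave at any time
        self.processors.remove(processor)

    def submit(self, document):
        if not self.processors:
            raise RuntimeError("no Processors registered")
        processor = self.processors[0]      # scheduling policy described below
        self.processors.append(self.processors.pop(0))   # simple rotation
        return processor.process(document)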
All modules are self-registering and will announce their presence to other modules once they have completed initialization.</Paragraph> <Paragraph position="3"> The Document Retriever module is used only in the active retriever mode. It is responsible for retrieving documents from a content provider and storing the documents for use by the InfoXtract Controller. The Document Retriever handles all interfacing with the content provider's retrieval process, including the interface protocol (authentication, retrieve requests, etc.), throughput management, and document packaging. It has been tested with content providers such as Northern Light, Factiva, and LexisNexis. Since the Document Retriever and the InfoXtract Controller do not communicate directly, it is possible to run the Document Retriever standalone and process all retrieved documents in batch mode at a later time.</Paragraph> <Paragraph position="4"> The InfoXtract Controller module is used only in the active retriever mode. It is responsible for retrieving documents to be processed, submitting documents for processing, storing extracted information, and system logging. The InfoXtract Controller is a multi-threaded application capable of submitting multiple simultaneous requests to the Document Manager. As processing results are returned, they are stored to a repository or database, an XML file, or both.</Paragraph> <Paragraph position="5"> The Document Manager module is responsible for managing document submission to available Processors. As Processors are initialized, they register with the Document Manager. The Document Manager uses a round-robin scheduling algorithm for sending documents to available Processors. A document queue is maintained with a size of four documents per Processor. The Processor module forms the core of the IE engine. InfoXtract utilizes a multi-level approach to NLP; each level uses the results of the previous levels in order to achieve more sophisticated parsing. The JIX module is a web application that is responsible for accepting requests for documents to be processed. This module is used only in the passive mode. Document requests are received via an HTTP POST request, and processing results are returned in XML format in the HTTP POST response.</Paragraph> <Paragraph position="6"> Table 2 presents an example of the performance that can be expected when all modules within the engine are applied. It should be noted that considerably faster processing per processor can be achieved if output is restricted to a certain IE level, such as named entity tagging only.</Paragraph> <Paragraph position="7"> The output in this benchmark includes all major tasks such as NE, EP, parsing, and event extraction, as well as XML generation.</Paragraph> <Paragraph position="8"> This configuration provides a throughput of approximately 12,000 documents (avg. 10 KB) per day. A smaller average document size will increase the document throughput. Increased throughput can be achieved by dedicating a CPU to each running Processor. Each Processor instance requires approximately 500 MB of RAM to run efficiently.</Paragraph> <Paragraph position="9"> Processing speed increases linearly with the number of Processors/CPUs and with CPU speed. In its current state, with no speed optimization, a bank of eight Processors can handle approximately 100,000 documents per day. Thus, InfoXtract is suitable for high-volume deployments.
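A rough back-of-the-envelope check of these capacity figures (a sketch only; actual throughput depends on document size, hardware, and which modules are enabled):

# Rough capacity estimate implied by the figures above (not a benchmark):
# ~12,000 documents/day per Processor on ~10 KB documents, scaling roughly
# linearly with the number of dedicated Processors/CPUs.

DOCS_PER_DAY_PER_PROCESSOR = 12_000   # reported single-configuration figure
RAM_PER_PROCESSOR_MB = 500

def estimated_capacity(num_processors):
    """Assumes one dedicated CPU per Processor and roughly linear scaling."""
    return {
        "documents_per_day": DOCS_PER_DAY_PER_PROCESSOR * num_processors,
        "ram_mb": RAM_PER_PROCESSOR_MB * num_processors,
    }

# estimated_capacity(8) -> {'documents_per_day': 96000, 'ram_mb': 4000},
# consistent with the ~100,000 documents/day reported for eight Processors.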
The use of CORBA provides seamless inter-process and over-the-wire communication between modules.</Paragraph> <Paragraph position="10"> Computing resources can be dynamically assigned to handle increases in document volume.</Paragraph> <Paragraph position="11"> A standard document input model is used to develop effective preprocessing capabilities.</Paragraph> <Paragraph position="12"> Preprocessing adapts the engine to the source by presenting metadata and zoning information in a standardized format and by performing restoration tasks (e.g., case restoration). Efforts are underway to configure the engine so that zone-specific processing controls are enabled; for example, zones identified as titles or subtitles must be tagged using different criteria than running text. The engine has been deployed on a variety of input formats, including HUMINT documents (all uppercase), the Foreign Broadcast Information Service (FBIS) feed, live feeds from content providers such as Factiva (Dow Jones/Reuters) and LexisNexis, as well as web pages. A user-trainable, high-performance case restoration module [Niu et al 2003] has been developed that transforms case-insensitive input such as speech transcripts into mixed case before it is processed by the engine. The case restoration module eliminates the need for separate IE engines for case-insensitive and case-sensitive documents, which makes the system easier and more cost-effective to maintain.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Corpus-level IE </SectionTitle> <Paragraph position="0"> Efforts have extended IE from the document level to the corpus level. Although most IE systems perform corpus-level information consolidation at the application level, it is felt that much can be gained by doing this as an extended step within the IE engine. A repository has been developed for InfoXtract that is able to hold the results of processing an entire corpus.</Paragraph> <Paragraph position="1"> A proprietary indexing scheme for token-list data has been developed that enables querying over the linguistic structures as well as statistical similarity queries (e.g., the similarity between two documents or two entity profiles). The repository is used by a fusion module to generate cross-document entity profiles, as well as for text mining operations. The results of the repository module can subsequently be fed into a relational database to support applications. This has the advantage of filtering much of the noise from the engine level and performing sophisticated information consolidation before populating a relational database.</Paragraph> <Paragraph position="2"> The architecture of these subsequent stages is shown in Figure 3.</Paragraph> <Paragraph position="3"> Information Extraction has two anchor points: (i) entity-centric information, which leads to an EP, and (ii) action-centric information, which leads to an event scenario. Compared with the consolidation of extracted events into cross-document event scenarios, cross-document EP merging and consolidation is a more tangible task, based mainly on resolving aliases. Even with modest recall, the corpus-level EP demonstrates tremendous value in collecting information about an entity. Table 3 shows part of the profile of 'Mohamed Atta' from one experiment based on a collection of news articles; among other items, the profile records accuse-events (9), convict-events (10), confess-events (2), arrest-events (3), rent-events (3), and so on.
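A minimal sketch of alias-based cross-document EP merging (illustrative Python; the actual fusion module also draws on co-reference support and the repository's similarity queries):

# Illustrative sketch: merge document-level entity profiles whose names or
# aliases match, accumulating CE relationships and events into a single
# corpus-level profile. Field names are hypothetical.

def merge_profiles(document_profiles):
    """document_profiles: list of dicts with 'names', 'relationships', 'events'."""
    corpus_profiles = []
    for profile in document_profiles:
        names = {n.lower() for n in profile["names"]}
        target = None
        for existing in corpus_profiles:
            if names & existing["names"]:          # shared name or alias
                target = existing
                break
        if target is None:
            target = {"names": set(), "relationships": [], "events": []}
            corpus_profiles.append(target)
        target["names"] |= names
        target["relationships"].extend(profile["relationships"])
        target["events"].extend(profile["events"])
    return corpus_profiles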
The extracted EP centralizes a significant amount of valuable information about this terrorist.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Domain Porting </SectionTitle> <Paragraph position="0"> Considerable effort has been made to keep the core engine as domain-independent as possible; domain specialization or tuning happens with minimal change to the core engine, assisted by automatic or semi-automatic domain porting tools we have developed.</Paragraph> <Paragraph position="1"> Cymfony has taken several distinct approaches to achieving domain portability: (i) the use of a standard document input model, pre-processors, and configuration scripts to tailor input and output formats for a given application; (ii) the use of tools to customize lexicons and grammars; and (iii) unsupervised machine learning techniques for learning new named entities (e.g. weapons) and relationships based on sample seeds provided by a user.</Paragraph> <Paragraph position="3"> It has been one of Cymfony's primary objectives to facilitate domain portability [Srihari 1998] [Li & Srihari 2000a,b, 2003]. This has resulted in a development/customization environment known as the Lexicon Grammar Development Environment (LGDE). The LGDE permits users to modify named entity glossaries, alias lexicons, and general-purpose lexicons. It also supports example-based grammar writing: users can find events of interest in sample documents, process these through InfoXtract, and modify the constraints in the automatically generated rule templates for event detection. With some basic training, users can easily use the LGDE to customize InfoXtract for their applications. This facilitates customization of the system in user applications where access to the input data to InfoXtract is restricted.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Applications </SectionTitle> <Paragraph position="0"> The InfoXtract engine has been used in two applications: the Information Discovery Portal (IDP) and Brand Dashboard (www.branddashboard.com).</Paragraph> <Paragraph position="1"> The IDP supports both traditional top-down methods of browsing through large volumes of information and novel, data-driven browsing. A sample user interface is shown in Figure 4.</Paragraph> <Paragraph position="2"> Users may select &quot;watch lists&quot; of entities (people, organizations, targets, etc.) that they are interested in monitoring. Users may also customize the sources of information they are interested in processing.</Paragraph> <Paragraph position="3"> Top-down methods include topic-centric browsing, whereby documents are classified by topics of interest. IE-based browsing techniques include entity-centric and event-centric browsing.</Paragraph> <Paragraph position="4"> Entity-centric browsing permits users to track key entities (people, organizations, targets) of interest and monitor information pertaining to them. Event-centric browsing focuses on significant actions, including money movement and people movement events.</Paragraph> <Paragraph position="5"> Visualization of extracted information is a key component of the IDP. The Information Mesh enables a user to visualize an entity, its attributes, and its relations to other entities and events.
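The kind of relationship-chain traversal described in the next sentence can be sketched as follows (illustrative Python over a simple graph of entity profiles; not the actual Information Mesh implementation):

# Illustrative breadth-first traversal over entities connected by CE
# relationships and shared events, up to a fixed number of hops.

from collections import deque

def explore(start_entity, related, max_hops=2):
    """related(entity) -> iterable of (relation_label, neighbor) pairs."""
    seen = {start_entity}
    frontier = deque([(start_entity, 0)])
    chains = []
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for label, neighbor in related(entity):
            chains.append((entity, label, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return chains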
Starting from an entity (or event), relationship chains can be traversed to explore related items. Timelines facilitate visualization of information along the temporal axis.</Paragraph> <Paragraph position="6"> Examples of the alerts presented in the IDP include: (3) new reports of al-Qaeda terrorist activity; (1) new report of bin Laden sighting; (4) new quotes by bin Laden; (1) new target identified. Recent efforts have included a tight integration of InfoXtract with visualization tools such as the Web-based Timeline Analysis System (WebTAS) (http://www.webtas.com). The IDP allows users to select events of interest and automatically export them to WebTAS for visualization. Efforts are underway to integrate higher-level event scenario analysis tools such as the Terrorist Modus Operandi Detection System (TMODS) (www.21technologies.com) into the IDP.</Paragraph> <Paragraph position="7"> Brand Dashboard is a commercial application for marketing and public relations organizations to measure and assess media perception of consumer brands. The InfoXtract engine is used to analyze several thousand electronic sources of information provided by various content aggregators (Factiva, LexisNexis, etc.). The engine focuses on tagging and generating brand profiles that capture salient information such as the descriptive phrases used in describing brands (e.g. cost-saving, non-habit forming), as well as user-configurable messages that companies are trying to promote and track (safe and reliable, industry leader, etc.). The output from the engine is fed into a database-driven web application, which produces brand report cards containing quantitative metrics pertaining to brand perception as well as qualitative information describing brand characteristics. A sample screenshot from Brand Dashboard is presented in Figure 5. It depicts a report card for a particular brand, highlighting brand strength as well as the metrics that have changed the most in the last time period. The &quot;buzz box&quot; on the right-hand side illustrates companies/brands, people, analysts, and messages most frequently associated with the brand in question.</Paragraph> </Section> </Paper>