<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1039">
  <Title>Integrated Information Management: An Interactive, Extensible Architecture for Information Retrieval</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. ARCHITECTURAL DESIGN
</SectionTitle>
    <Paragraph position="0"> IIM uses a flow-based (pipe and filter [16]) processing model.</Paragraph>
    <Paragraph position="1"> Information processing steps are represented as nodes in a graph.</Paragraph>
    <Paragraph position="2"> Each edge in the graph represents a flow connection between a parent node and a child node; the documents produced by the parent node are passed to each child node. In IIM, the flow graph is referred to as a node chain. A sample node chain is shown in Figure 1. The IIM class model includes six basic node types, which can be used to model a variety of IR problems: 1. Source. Generates a document stream (from a static collection, web search, etc.) and passes documents one at a time to its child node(s).</Paragraph>
    <Paragraph position="3"> 2. Filter. Passes only documents which match the filter to its child node(s).</Paragraph>
    <Paragraph position="4"> 3. Annotator. Adds additional information to the document regarding a particular region in the document body.</Paragraph>
    <Paragraph position="5"> [Figure 1: IIM User Interface] 4. Sink. Creates and passes either a single document or a collection to its child node(s), after pooling the input documents it receives.</Paragraph>
    <Paragraph position="6"> 5. Transformer. Creates and passes on a single new document, presumably the result of processing its input document.</Paragraph>
    <Paragraph position="7"> 6. Renderer. Produces output for documents received (to disk, to screen, etc.).</Paragraph>
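    <Paragraph> The six node types can be pictured as a small Java type hierarchy driven by a simple loop. The following is a minimal sketch only: the Doc class, the method names (documents, accepts, annotate, transform, collect, flush, render) and the run helper are assumptions made for illustration and are not taken from the IIM source. The sketch handles a linear chain and omits the pooling behavior of Sink nodes.</Paragraph>
    <Paragraph><![CDATA[
```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical document type: a body plus free-form annotations.
class Doc {
    final String body;
    final List<String> annotations = new ArrayList<>();
    Doc(String body) { this.body = body; }
}

// Marker interface; each subinterface declares assumed method names.
interface Node {}
interface Source extends Node { Iterator<Doc> documents(); }
interface Filter extends Node { boolean accepts(Doc d); }
interface Annotator extends Node { void annotate(Doc d); }
interface Transformer extends Node { Doc transform(Doc d); }
interface Sink extends Node { void collect(Doc d); List<Doc> flush(); }
interface Renderer extends Node { void render(Doc d); }

class NodeChainSketch {
    // Pushes each source document through a linear chain of nodes
    // and returns the documents that reached the end of the chain.
    static List<Doc> run(Source source, List<Node> chain) {
        List<Doc> out = new ArrayList<>();
        Iterator<Doc> it = source.documents();
        while (it.hasNext()) {
            Doc d = it.next();
            boolean dropped = false;
            for (Node n : chain) {
                if (n instanceof Filter && !((Filter) n).accepts(d)) {
                    dropped = true; // a filter stops the document here
                    break;
                }
                if (n instanceof Annotator) ((Annotator) n).annotate(d);
                if (n instanceof Transformer) d = ((Transformer) n).transform(d);
                if (n instanceof Renderer) ((Renderer) n).render(d);
            }
            if (!dropped) out.add(d);
        }
        return out;
    }

    public static void main(String[] args) {
        Source src = () -> List.of(new Doc("short"),
                                   new Doc("a much longer document")).iterator();
        Filter longOnly = d -> d.body.length() > 10;
        Annotator tagger = d -> d.annotations.add("LENGTH=" + d.body.length());
        Renderer printer = d -> System.out.println(d.body + " " + d.annotations);
        List<Node> chain = List.of(longOnly, tagger, printer);
        System.out.println(run(src, chain).size() + " document(s) passed");
    }
}
```
]]></Paragraph>
    <Paragraph> Because every node is handled through the common Node type, a chain can be assembled from any mix of node types without the driver knowing the concrete classes.</Paragraph>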
    <Paragraph position="8"> The IIM class model is embedded in a Model-View-Controller architecture [5], which allows the system to be run with or without the graphical interface. Pre-stored node chains can be executed directly from the shell, or as a background process, completely bypassing all user interaction when optimal performance is required. The Controller subsystem and the interface event-dispatching subsystem must run as separate threads to support dynamic update of parameters in a running system. The View (user interface) should support: a) plug-and-play creation of new node chains; b) saving, loading and importing node chains; c) dynamic visualization of a task's status; and d) direct manipulation of a node's parameters at any time.</Paragraph>
    <Paragraph position="9"> In addition to the nodes themselves, IIM supports two other important abstractions for IR task flows: • Macro Nodes. Certain sequences of nodes are useful in more than one application, so it is convenient to store them together as a single reusable unit, or macro node. IIM allows the user to export a portion of a node chain as a macro node to be loaded into the Node Library and inserted into a new chain as a single node. The user may specify which of the properties of the original nodes are visible in the exported macro node (see Figure 3).</Paragraph>
    <Paragraph position="10"> • Controllers. Some IR tasks require iteration through multiple runs; the system's behavior on each successive trial is modified based on feedback from a previous run. For example, a system might wish to ask for more documents or perform query expansion if the original query returns an insufficient number of relevant documents. IIM includes a Controller interface, which specifies methods for sending feedback from one node to another. The user can implement a variety of controllers, depending on the needs of the particular application.</Paragraph>
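    <Paragraph> The feedback loop described for Controllers can be sketched as follows. This is an illustration under stated assumptions, not the IIM Controller interface itself: the feedback method, the AdjustableSource class and the doubling strategy are all hypothetical. The controller asks the source for more documents until enough relevant ones are found or the collection is exhausted.</Paragraph>
    <Paragraph><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical feedback interface: decides whether another run is needed.
interface Controller {
    // Returns true if the chain should be re-run; may adjust the source first.
    boolean feedback(int relevantFound, AdjustableSource source);
}

// Hypothetical source whose request size can be changed between runs.
class AdjustableSource {
    int requested = 10;
    private final List<String> collection;

    AdjustableSource(List<String> collection) { this.collection = collection; }

    List<String> fetch() {
        int n = Math.min(requested, collection.size());
        return new ArrayList<>(collection.subList(0, n));
    }

    int size() { return collection.size(); }
}

class ControllerSketch {
    // Re-runs a one-filter chain until the controller is satisfied.
    static int runWithFeedback(AdjustableSource source, String keyword, int needed) {
        Controller askForMore = (found, src) -> {
            if (found >= needed || src.requested >= src.size()) return false;
            src.requested *= 2; // ask for more documents on the next run
            return true;
        };
        int found;
        do {
            found = 0;
            for (String doc : source.fetch()) {
                if (doc.contains(keyword)) found++;
            }
        } while (askForMore.feedback(found, source));
        return found;
    }

    // Builds 40 documents, every fifth of which mentions the keyword.
    static int demo() {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            docs.add(i % 5 == 0 ? "jupiter " + i : "other " + i);
        }
        return runWithFeedback(new AdjustableSource(docs), "jupiter", 5);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```
]]></Paragraph>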
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. JAVA IMPLEMENTATION
</SectionTitle>
    <Paragraph position="0"> In the IIM Java implementation, nodes are specified by the abstract interface Node and its six abstract subinterfaces: Source, Filter, Annotator, Transformer, Sink and Renderer (see Figure 2). Any user-defined Java class which implements one of the Node subinterfaces can be loaded into IIM and used in a node chain. The visualization of a node is represented by a separate Java class, Box, which handles all of the details related to drawing the node and various visual cues in the node chain display.</Paragraph>
    <Paragraph position="1"> The graphical user interface (Figure 1) is implemented as a set of Java Swing components: • Node Chain Display. The canvas to the right displays the current node chain, as described in the previous section. While the node chain is running, IIM provides two types of visual feedback regarding task progress. To indicate the percentage of overall run-time that each node is active, the border color of the node varies from bright green (low) to bright red (high). To indicate throughput (the amount of output per node per unit of time spent), the system shows bytes per second as a text label under each node. A rectangular meter at the right of each node provides a graphic visualization of relative throughput; the node with the highest throughput will have a solid red meter, while other nodes will have a meter level which shows their throughput as a percentage of maximum throughput.</Paragraph>
    <Paragraph position="2"> • Node Library. The tree view to the upper left displays the library of nodes currently available on the user's machine for building and extending node chains. New nodes or node directories can be downloaded from the web and added while the system is running. The component loader examines each loaded class using Java's reflection capabilities, and places it in the appropriate place(s) in the component tree according to which of the Node subinterfaces it implements.</Paragraph>
    <Paragraph position="3"> • Node Property Editor. The Property Editor (table view) to the lower left in Figure 1 displays the properties of a selected node; the user can update a property by clicking on it and entering a new value.</Paragraph>
    <Paragraph position="4"> • Node Chain Editor. IIM supports dynamic, interactive manipulation of node chains. The left side of the toolbar at the top of the IIM Window contains a set of chain editing buttons. These allow the user to create, modify and tune new node chains built from pre-existing components.</Paragraph>
    <Paragraph position="5"> • Transport Bar. IIM uses a tape transport metaphor to model the operation of the node chain on a given data source. The &quot;Play&quot;, &quot;Pause&quot; and &quot;Rewind&quot; buttons in the toolbar (right side) allow the user to pause the system in mid-task to adjust component parameters, or to start a task over after the node chain has been modified.</Paragraph>
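    <Paragraph> The classification step performed by the component loader for the Node Library can be approximated with standard Java reflection. The sketch below is an assumption about how such a loader might work, not the IIM code: the interface stubs, the two example classes and the categories method are hypothetical. It relies only on Class.isAssignableFrom, and a class that implements several subinterfaces is listed under each of them.</Paragraph>
    <Paragraph><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Stub interfaces standing in for three of the Node subinterfaces.
interface Node {}
interface Source extends Node {}
interface Filter extends Node {}
interface Annotator extends Node {}

// Hypothetical user-defined nodes.
class RegexpNode implements Filter, Annotator {}
class WebNode implements Source {}

class ComponentLoaderSketch {
    static final List<Class<?>> KINDS =
            List.of(Source.class, Filter.class, Annotator.class);

    // Returns the simple name of every known subinterface the class implements.
    static List<String> categories(Class<?> loaded) {
        List<String> result = new ArrayList<>();
        for (Class<?> kind : KINDS) {
            if (kind.isAssignableFrom(loaded)) {
                result.add(kind.getSimpleName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(categories(RegexpNode.class));
        System.out.println(categories(WebNode.class));
    }
}
```
]]></Paragraph>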
    <Paragraph position="6"> The run-time Controller subsystem is implemented as a Java class called ChainRunner, which can be invoked with or without a graphical interface component. ChainRunner is implemented as a Thread object separate from the Java Swing event dispatching thread, so that user actions can be processed concurrently with the ongoing operation of a node chain on a particular task.</Paragraph>
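    <Paragraph> Running the chain on its own thread is what lets a parameter change while a task is in progress. The following sketch illustrates the idea under stated assumptions: the class ChainRunnerSketch, its minSize parameter and the pauseChain and resumeChain methods are hypothetical and do not reproduce the actual ChainRunner API. The chain starts paused, the calling thread edits the parameter, and the chain then resumes with the new value.</Paragraph>
    <Paragraph><![CDATA[
```java
// Hypothetical stand-in for a chain runner: filters document sizes on its own thread.
class ChainRunnerSketch extends Thread {
    private final Object lock = new Object();
    private final int[] docSizes;
    private volatile int minSize; // parameter that may change while running
    private boolean paused;       // guarded by lock
    private int passed;           // written by this thread, read after join()

    ChainRunnerSketch(int[] docSizes, int minSize) {
        this.docSizes = docSizes;
        this.minSize = minSize;
    }

    void setMinSize(int value) { minSize = value; }

    void pauseChain() {
        synchronized (lock) { paused = true; }
    }

    void resumeChain() {
        synchronized (lock) {
            paused = false;
            lock.notifyAll();
        }
    }

    int passedCount() { return passed; }

    @Override
    public void run() {
        for (int size : docSizes) {
            synchronized (lock) {
                while (paused) { // block here until resumeChain() is called
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
            if (size >= minSize) passed++;
        }
    }

    // Starts paused, edits the parameter from the calling thread, then resumes.
    static int demo() {
        ChainRunnerSketch runner = new ChainRunnerSketch(new int[] {5, 50, 500}, 10);
        runner.pauseChain();
        runner.start();
        runner.setMinSize(100);
        runner.resumeChain();
        try {
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return runner.passedCount();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```
]]></Paragraph>
    <Paragraph> In a Swing application the calls to setMinSize, pauseChain and resumeChain would come from event handlers on the event dispatching thread, which therefore never blocks on the running chain.</Paragraph>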
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. IIM COMPONENTS
</SectionTitle>
    <Paragraph position="0"> The current IIM system includes a variety of nodes which implement the different IIM component interfaces. These nodes are described in this section.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Source Nodes
</SectionTitle>
      <Paragraph position="0"> • EditableSource. Prompts the user to interactively enter sample documents (used primarily for testing, or entering queries).</Paragraph>
      <Paragraph position="1"> • WebSource. Generic support for access to web search engines (e.g., Google). Includes multithreading support for simultaneous retrieval of multiple result documents.</Paragraph>
      <Paragraph position="2"> • NativeBATSource. Generic support for access to document collections stored on local disk. Implemented in C, with a Java wrapper that uses the Java Native Interface (JNI).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Filter Nodes
</SectionTitle>
      <Paragraph position="0"> • SizeFilter. Only passes documents which are above a user-defined size threshold.</Paragraph>
      <Paragraph position="1"> • RegexpFilter. Only passes documents which match a user-defined regular expression; incorporates the GNU regexp package.</Paragraph>
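      <Paragraph> The two filters can be illustrated with a few lines of Java. This is a sketch only: it represents documents as plain strings, measures size in characters, and substitutes the standard java.util.regex package for the GNU regexp package used by IIM. The class and method names are hypothetical.</Paragraph>
      <Paragraph><![CDATA[
```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class FilterSketch {
    // Passes documents whose length in characters exceeds the threshold.
    static Predicate<String> sizeFilter(int threshold) {
        return doc -> doc.length() > threshold;
    }

    // Passes documents that contain at least one match of the expression.
    static Predicate<String> regexpFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return doc -> pattern.matcher(doc).find();
    }

    public static void main(String[] args) {
        Predicate<String> both = sizeFilter(10).and(regexpFilter("Jup[a-z]+"));
        System.out.println(both.test("What is Jupiter?"));
        System.out.println(both.test("Jupiter"));
    }
}
```
]]></Paragraph>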
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Annotator Nodes
</SectionTitle>
      <Paragraph position="0"> • NameAnnotator. Locates named entities (currently, person names) in the body of the document, and adds appropriate annotations to the document.</Paragraph>
      <Paragraph position="1"> • IVEAnnotator. For each named entity (person) annotation, checks a networked database for supplemental information about that individual. The node serves as an interface to a database of information about individuals, publications, and organizations, created as part of the Information Validation and Evaluation project at CMU [12]. Implemented using Java Database Connectivity (JDBC).</Paragraph>
      <Paragraph position="2"> • BrillAnnotator. Accepts a user-defined annotation (e.g., PASSAGE) and adds a new annotation created by calling the Brill Tagger [1] on the associated text. Implemented via a TCP/IP socket protocol which accesses a remote instance of the tagger running as a network service.</Paragraph>
      <Paragraph position="3"> • ChartAnnotator. Accepts a user-defined annotation, and adds new annotations based on the results of bottom-up chart parsing with a user-defined grammar. The user can select which linguistic categories (e.g., NP, VP) are to be annotated. • RegexpAnnotator. Annotates passages which match a user-defined regular expression.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Transformer Nodes
</SectionTitle>
      <Paragraph position="0"> • BrillTransformer. Similar to the BrillAnnotator (see above), but operates directly on the document body (does not create separate annotations).</Paragraph>
      <Paragraph position="1"> • Inquery. Accepts a query (represented as an input document) and retrieves a set of documents from the Inquery search engine [2]. Accesses an Inquery server running as a networked service, using TCP/IP sockets.</Paragraph>
      <Paragraph position="2"> • WordNet. Accepts a document, and annotates each word with a hypernym retrieved from WordNet [19]. Accesses a WordNet server running as a networked service, using TCP/IP sockets.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Sink Nodes
</SectionTitle>
      <Paragraph position="0"> • Ranker. Collects documents and sorts them according to a user-defined comparator. The current implementation supports sorting by document size or by annotation count.</Paragraph>
      <Paragraph position="1"> • CooccurrenceSink. Builds a matrix of named entity associations within a given text window; uses NAME annotations created by the NameAnnotator (see above). The output of this node is a special subclass of Document, called MatrixDocument, which stores the association matrix created from the document collection.</Paragraph>
      <Paragraph position="2"> • QAnswer. Collects a variety of annotations from documents relevant to a particular query (e.g., &quot;What is Jupiter?&quot;), and uses them to synthesize an answer.</Paragraph>
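      <Paragraph> The windowed counting behind a named entity association matrix can be sketched as below. The representation is an assumption made for this example: each NAME annotation is reduced to a name and a token offset, the window is measured in tokens, and the matrix is stored as a sorted map keyed by the name pair. None of these choices is taken from the IIM implementation.</Paragraph>
      <Paragraph><![CDATA[
```java
import java.util.Map;
import java.util.TreeMap;

class CooccurrenceSketch {
    // names[i] occurs at token offset positions[i]. Counts unordered pairs of
    // distinct names whose offsets differ by at most `window` tokens.
    static Map<String, Integer> count(String[] names, int[] positions, int window) {
        Map<String, Integer> matrix = new TreeMap<>();
        for (int i = 0; i < names.length; i++) {
            for (int j = i + 1; j < names.length; j++) {
                if (names[i].equals(names[j])) continue;
                if (Math.abs(positions[i] - positions[j]) > window) continue;
                // Order the pair alphabetically so (A, B) and (B, A) share a cell.
                boolean iFirst = names[i].compareTo(names[j]) < 0;
                String first = iFirst ? names[i] : names[j];
                String second = iFirst ? names[j] : names[i];
                matrix.merge(first + "|" + second, 1, Integer::sum);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        String[] names = {"Smith", "Jones", "Smith", "Lee"};
        int[] positions = {0, 3, 40, 42};
        System.out.println(count(names, positions, 5));
    }
}
```
]]></Paragraph>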
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.6 Renderer Nodes
</SectionTitle>
      <Paragraph position="0"> • StreamRenderer. Outputs any documents it receives to a user-specified file stream (or to standard output, by default).</Paragraph>
      <Paragraph position="1"> • DocumentViewer. Pops up a document display window, which allows the user to browse documents as they are accepted by this node.</Paragraph>
      <Paragraph position="2"> • MatrixRenderer. A two-dimensional visualization of the association matrix created by the CooccurrenceSink (see above). Accepts instances of MatrixDocument.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. IIM APPLICATIONS
</SectionTitle>
    <Paragraph position="0"> The initial set of component nodes has been used as the basis for three experimental applications: • Filtering and Annotation. An interactive node chain that allows the user to annotate and collect documents matching any regular expression; the resulting collection can then be viewed interactively (with highlighted annotations) in a pop-up viewer window.</Paragraph>
    <Paragraph position="1"> • Named Entity Association. A node chain which performs named-entity annotation and computes entity associations using a phi-square measure [3], producing a MatrixDocument object (a user-defined Document subclass, which represents the association matrix). Note that the addition of a specialized Document subclass does not require recompilation of IIM (although the user must take care that specialized document objects are properly handled by user-defined nodes).</Paragraph>
    <Paragraph position="2"> • Question Answering. A node chain which answers &quot;What is&quot; questions by querying the web for relevant documents, finding relevant passages [8, 10], and synthesizing answers from the results of various regular expression matches (see footnote 3).</Paragraph>
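    <Paragraph> For readers unfamiliar with the phi-square measure used in the Named Entity Association application, the sketch below computes the standard statistic for a 2x2 contingency table: with a windows containing both entities, b containing only the first, c containing only the second and d containing neither, phi-square is (ad - bc) squared divided by (a+b)(c+d)(a+c)(b+d). Whether IIM uses exactly this formulation is an assumption; the class and method names are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```java
class PhiSquareSketch {
    // a: windows with both entities, b: only the first,
    // c: only the second, d: neither.
    static double phiSquare(long a, long b, long c, long d) {
        double denominator = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denominator == 0.0) {
            return 0.0; // undefined when a row or column is empty; report no association
        }
        double difference = (double) a * d - (double) b * c;
        return difference * difference / denominator;
    }

    public static void main(String[] args) {
        System.out.println(phiSquare(10, 0, 0, 10)); // entities always co-occur
        System.out.println(phiSquare(5, 5, 5, 5));   // entities are independent
    }
}
```
]]></Paragraph>
    <Paragraph> The value ranges from 0 (no association) to 1 (the two entities always appear together or not at all), which makes cells of the association matrix directly comparable.</Paragraph>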
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6. PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> In order to support accurate side-by-side evaluation of different modules, IIM implements two kinds of instrumentation for run-time performance data: • Per-Node Run Time. The ChainRunner and Box classes automatically maintain run-time statistics for every node in a chain (including user-defined nodes). These statistics are printed at the end of every run.</Paragraph>
    <Paragraph position="1"> • Node-Specific Statistics. For user-defined nodes, it may be useful to report task-specific statistics (e.g., for an Annotator, the total number of annotations, the average annotation size, etc.). IIM provides a class called Options, which contains a set of optional interfaces that can be implemented to customize a node's behavior. Any node that wishes to report task-specific statistical data can implement the ReportsStatistics interface, which is called by the ChainRunner when the chain finishes.</Paragraph>
    <Paragraph position="2"> An example of the statistical data produced by the system is shown in Figure 4. The system is careful to keep track of time spent &quot;inside&quot; the nodes, as well as the overall clock time taken for the task. This allows the user to determine how much overhead is added by the IIM system itself.</Paragraph>
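    <Paragraph> Both kinds of instrumentation can be sketched in a few lines. Only the interface name ReportsStatistics comes from the text above; its method name, the CountingAnnotator class and the timing code are assumptions made for illustration. The node accumulates the time spent inside it, and the driver subtracts that from the overall clock time to estimate framework overhead.</Paragraph>
    <Paragraph><![CDATA[
```java
import java.util.List;

// The interface name comes from the text; the method name is an assumption.
interface ReportsStatistics {
    String reportStatistics();
}

// Hypothetical annotator that treats each capitalized word as one annotation.
class CountingAnnotator implements ReportsStatistics {
    private int annotations;
    private int documents;
    private long nanosInside; // time spent inside this node

    void process(String doc) {
        long start = System.nanoTime();
        for (String word : doc.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                annotations++;
            }
        }
        documents++;
        nanosInside += System.nanoTime() - start;
    }

    long nanosInside() { return nanosInside; }

    @Override
    public String reportStatistics() {
        return "annotations=" + annotations + " documents=" + documents;
    }
}

class StatisticsSketch {
    // Runs the node over all documents, prints timing, returns the node's report.
    static String run(List<String> docs) {
        CountingAnnotator node = new CountingAnnotator();
        long wallStart = System.nanoTime();
        for (String doc : docs) {
            node.process(doc);
        }
        long wall = System.nanoTime() - wallStart;
        long overhead = wall - node.nanosInside(); // time spent outside the node
        System.out.println("inside ns: " + node.nanosInside() + ", overhead ns: " + overhead);
        return node.reportStatistics(); // called once, when the chain finishes
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("What is Jupiter", "a planet")));
    }
}
```
]]></Paragraph>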
    <Paragraph position="3"> The throughput speed of the prototype system is acceptably fast, averaging better than 50 MB of text per minute on a sample filtering task (530 MB of web documents), running on a typical Pentium III PC with 128 MB of RAM. IIM requires about 10 MB of memory (including the Java run-time environment) for the core system and user interface, with additional memory requirements depending on the size of the document stream and the sophistication of the node chain (see footnote 4). Although the core system is implemented in Java, we have also implemented nodes in C++, using appropriate wrapper classes and the Java Native Interface (JNI). This technique allows us to implement critical, resource-intensive nodes using native code, without sacrificing the benefits of the Java-based core system.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7. DISCUSSION
</SectionTitle>
    <Paragraph position="0"> The preliminary results of the IIM prototype are promising. IIM's drag-and-drop component library makes it possible to build and tune a new application in a matter of minutes, greatly reducing the amount of effort required to integrate and reuse existing modules.</Paragraph>
    <Paragraph position="1"> Footnote 3: We are currently expanding this application to include part of speech tagging and syntactic parsing, both of which are straightforwardly modeled as examples of the Annotator interface.</Paragraph>
    <Paragraph position="2"> Footnote 4: Node chains which create a high volume of annotations per document use more memory, as do node chains which create new collections, transform documents, etc.</Paragraph>
    <Paragraph position="3"> In the future, we hope this high degree of flexibility will encourage greater experimentation and the creation of new aggregate systems from novel combinations of components, leading to a true &quot;marketplace of modules&quot;.</Paragraph>
    <Paragraph position="4"> Building extensible architectures as &quot;class library plus application framework&quot; is not a new idea, and has been discussed before with respect to information retrieval systems [7, 18, 9]. One might claim that any new IR architecture should adopt a similar design pattern, given the proven benefits of separating the modules from the application framework (flexibility, extensibility, high degree of reuse, easy integration, etc.). To some extent, IIM consolidates, refines and/or reimplements ideas previously published in the literature. Specifically, the following characteristics of the IIM architecture can be directly compared with prior work: • The IIM classes Renderer, Document, MultiDocument, and annotations on Document can be considered alternative implementations of the InfoGrid classes Visualizer, Document, DocumentSet and DocumentPart [15]. However, in IIM annotations are &quot;lightweight&quot;, meaning that they do not require the instantiation of a separate user object, but can be modeled as simple String instances in Java when a high degree of annotation requires optimal space efficiency.</Paragraph>
    <Paragraph position="5"> • The use of color to indicate status of a node is also used in the SketchTrieve system [18].</Paragraph>
    <Paragraph position="6"> • IIM's visualization of the document flow as a &quot;node chain&quot; can be compared to the &quot;wire and dock&quot; approach used in other IR interfaces [9, 4, 13].</Paragraph>
    <Paragraph position="7"> • The use of a Property Editor to customize component behavior is an alternative approach to the IrDialogs provided by the FireWorks toolkit [9] for display and update of a component's state.</Paragraph>
    <Paragraph position="8"> Nevertheless, IIM is at once simpler and more general than systems such as InfoGrid [15] and FIRE [18]. One could claim that IIM supports a higher degree of informality [9] than FIRE, since it enforces no type-checking on node connectivity. Since all tasks are modeled abstractly as document flows, nodes need only implement one of the Node sub-interfaces, and each node chain must begin with a Source. Another point of comparison is the task-specific detail present in the FIRE class hierarchy. In IIM, task-specific objects are left up to the developer (for example, representing particulars of access control on information sources, or details of indexing and retrieval, such as Index, Query, etc.).</Paragraph>
    <Paragraph position="9"> Hendry and Harper [9] have used the degree of user control as a dimension of comparison for IR architectures. At one extreme are systems which allow dynamic view and access to the run-time state of components, while at the other lie systems which hide implementation detail and perform some functions automatically, for improved performance. In their comparison of SketchTrieve and InfoGrid, Hendry and Harper note that &quot;a software architecture should provide abstractions for implementing both these&quot;. In IIM, the use of macro nodes can hide component details from the end user, especially when the component's parameter values have been tuned in advance for optimal performance.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8. ONGOING RESEARCH
</SectionTitle>
    <Paragraph position="0"> While the initial results reported here show promise, we are still evaluating the usability of IIM in terms of trainability (how quickly a novice learns the system), reusability (how easily a novice can build new applications from existing node libraries) and ease of integration (the effort required to integrate external components and systems). The current version of IIM lacks the explicit document management component found in systems like GATE [4] and Corelli [20]; we are in the process of adding this functionality for the official release of IIM.</Paragraph>
    <Paragraph position="1"> The IIM system (source code, class documentation, and node libraries) will be made available via the web as one of our final project milestones later in 2001. Anyone interested in using the system or participating in ongoing research and development is invited to visit the IIM web site and join the IIM mailing list: [URL not recoverable from the source text]</Paragraph>
  </Section>
</Paper>