<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1045">
  <Title>Construction and Visualization of Key Term Hierarchies</Title>
  <Section position="3" start_page="0" end_page="309" type="metho">
    <SectionTitle>
2. SYSTEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> The ultimate goal of this prototype system is to offer an automated toolkit which allows the domain expert or the user to visualize and examine key terms in a large information collection. Such a toolkit has proven to be useful in a number of real applications. For example, it has helped us reduce the time and manual effort needed to develop and maintain our on-line document indexing and classification schemes.</Paragraph>
    <Paragraph position="1"> The system consists of two components: a preprocessing component for the automatic construction of key terms and a front-end component that provides a user-guided graphic interface.</Paragraph>
    <Section position="1" start_page="0" end_page="307" type="sub_section">
      <SectionTitle>
2.1 Automatic Generation of Key Terms
</SectionTitle>
      <Paragraph position="0"> Automatically identifying meaningful terms from naturally running texts has been an important task for information technologists. It is widely believed that a set of good terms can be used to express the content of the document. By capturing a set of good terms, for example, relevant documents can be searched and retrieved from a large document collection. Though what constitutes a good term still remains to be answered, we know that a good term can be a word stem, a single word, a multiple word term (a phrase), or simply a syntactic unit.</Paragraph>
      <Paragraph position="1"> Various existing and workable term extraction tools are either statistically driven, or linguistically oriented, or some hybrid of the two. They all target frequently co-occurring words in running text. The earlier work of Choueka (1988) proposed a pure frequency approach in which only quantitative selection criteria were established and applied.</Paragraph>
      <Paragraph position="2"> Church and Hanks (1990) introduced a statistical measurement called mutual information for extracting strongly associated or collocated words.</Paragraph>
      <Paragraph position="3"> Tools like Xtract (Smadja 1993) were based on the work of Church and others, but made a step forward by incorporating various statistical measurements like z-score and variance of distribution, as well as shallow linguistic techniques like part-of-speech tagging and lemmatization of input data and partial parsing of raw output.</Paragraph>
      <Paragraph position="4"> Exemplary linguistic approaches can be found in the work of Strzalkowski (1993), where a fast and accurate syntactic parser is the prerequisite for the selection of significant phrasal terms.</Paragraph>
      <Paragraph position="5"> Different applications aim at different types of key terms. To generate key terms for our prototype system, we have adopted a &quot;learn data from data&quot; approach. The novelty of this approach lies in the automatic comparison of two sample datasets: a topic-focused dataset based on a predefined topic, and a larger and more general base dataset. The focused dataset is created by the domain expert either by submitting an on-line search or by compiling documents from a specific source. The corresponding base dataset is constructed by pulling documents from a number of sources, such as news wires, newspapers, magazines and legal databases. The intention is to make the resulting corpus cover a much greater variety of topics or domain subjects than the focused dataset.</Paragraph>
      <Paragraph position="6"> To identify interesting word patterns in both samples, a set of statistical measures is applied. The identification of single word terms is based on a variation of the t-test. Two-word terms are captured through the computation of mutual information (Church et al. 1991), and an extension of mutual information assists in extracting three-word and four-word terms. Once the significant terms of these four types are identified, a comparison algorithm is applied to differentiate terms across the two samples. If significant changes in the values of certain statistical variables are detected, the associated terms are selected from the focused sample and included in the final generated lists. (For a complete description of the algorithm and preliminary experiments, please refer to Zhou and Dapkus 1995.)</Paragraph>
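      <Paragraph> The two-sample comparison described above can be illustrated with a small sketch. It is not the algorithm of Zhou and Dapkus (1995), which is not reproduced in this paper: it covers two-word terms only, uses plain pointwise mutual information, and the function names, the minimum count and the gain threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI in bits} for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count >= min_count:
            p_xy = count / n_bi
            p_x = unigrams[w1] / n_uni
            p_y = unigrams[w2] / n_uni
            # Pointwise mutual information: log2 of P(x, y) / (P(x) * P(y)).
            scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def select_focused_terms(focused_tokens, base_tokens, min_gain=1.0):
    """Keep focused-sample bigrams whose PMI exceeds their base-sample PMI by min_gain bits.

    Bigrams that never occur in the base sample are kept as well
    (an assumption of this sketch, not a rule stated in the paper).
    """
    focused = bigram_pmi(focused_tokens)
    base = bigram_pmi(base_tokens, min_count=1)
    selected = []
    for pair, score in focused.items():
        if pair not in base or score - base[pair] >= min_gain:
            selected.append(pair)
    return sorted(selected)
```

      Calling select_focused_terms with the tokens of a focused sample and of a base sample returns the word pairs that are strongly associated in the focused sample but absent or weaker in the base sample. A realistic threshold would have to be tuned on actual data.</Paragraph>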
    </Section>
    <Section position="2" start_page="307" end_page="309" type="sub_section">
      <SectionTitle>
2.2 Graphic User Interface (GUI)
</SectionTitle>
      <Paragraph position="0"> We view our prototype system as a means to achieve information visualization. Analogous to scientific visualization that allows scientists to make sense out of intellectually large data collections, information visualization aims at organizing large information spaces so that information technologists can visualize what is out there and how various parts are related to each other (Robertson et al. 1991). The guiding principle for building the GUI component of our prototype system is to automate the manual process of capturing information content out of large document collections.</Paragraph>
      <Paragraph position="1"> The design of the GUI component relies on a number of well understood elements, which include a suggestive graphic design and a direct manipulation metaphor to achieve an easy-to-learn user interface. The layout of the graphic design is intended to facilitate the quick comprehension of the displayed information. The GUI component is divided into two main areas, one for interacting with key term structures and one for browsing targeted document collections.</Paragraph>
      <Paragraph position="2"> The following descriptions should be viewed together with the appropriate figures of the GUI component. Figure 1, attached at the end of the paper, represents the overall GUI picture. Figures 2 and 3 capture the area where the interaction with the key term structures occurs. Figures 4 and 5 present the area for document browsing and key term selection. The topic illustrated in the figures is the legal topic &quot;Medical Malpractice&quot;.</Paragraph>
      <Paragraph position="3">  The left area of the GUI component (see figures 2 and 3) is devoted to selecting, retrieving and operating on the key terms generated by the preprocessing component of the prototype system. As can be seen, the key terms, ranging from single word terms to four word terms, are organized in a tree structure. The tree is a two dimensional visualization of the term hierarchy. Single word terms are represented as root nodes and multiple word terms can be positioned uniformly below the parent node in the term hierarchy. The goal of the visualization is to present the key term lists in such a way that a high percentage of the hierarchy is visible with minimal scrolling.</Paragraph>
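      <Paragraph> As an illustration of the tree structure described above, the following sketch groups multiple word terms under single word root terms. The paper does not state how a phrase is assigned to its parent, so the rule used here (a phrase is listed under every root word it contains) and the name build_term_hierarchy are assumptions of this example.

```python
from collections import defaultdict


def build_term_hierarchy(single_terms, multi_terms):
    """Map each single word root term to the multiple word terms that contain it."""
    roots = set(single_terms)
    children = defaultdict(list)
    for term in multi_terms:
        # dict.fromkeys keeps each word once, in its original order.
        for word in dict.fromkeys(term.split()):
            if word in roots:
                children[word].append(term)
    # Order the children from shortest to longest term, then alphabetically.
    return {
        root: sorted(children[root], key=lambda t: (len(t.split()), t))
        for root in single_terms
    }
```

      For the roots malpractice and medical and the phrases medical malpractice, malpractice cases and medical malpractice action, the root malpractice receives all three phrases and the root medical receives the two phrases that contain it.</Paragraph>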
      <Paragraph position="4"> The user interaction is structured around term retrieval and navigation as the top level user interactions. The retrieval of the key terms is treated as an iterative process in which the user may select single word terms from the term hierarchy and navigate to multiple word terms accordingly.</Paragraph>
      <Paragraph position="5"> The user begins term navigation by selecting from a list of available topics. In this case, the legal topic &quot;Medical Malpractice&quot; (i.e., medmal3) is selected (see figure 2). Often data structures are organized linearly by some metric. Frequency of key term usage is the metric used to organize and partition the term hierarchy in ascending numerical order. The partitioning is necessary because it is difficult to accommodate a large portion of the term hierarchy on the screen. Currently, each partition contains 100 root nodes (or folders), representing single word terms. Once a partition has been selected, the corresponding document collection is loaded into the document browser. The browser provides the user with the ability to quickly navigate through the document collection to locate relevant key terms.</Paragraph>
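      <Paragraph> The frequency-based partitioning can be sketched as follows. The function name, the input format (a mapping from root term to frequency) and the alphabetical tie-breaking are assumptions of this example; only the ascending order and the partition size of 100 come from the description above.

```python
def partition_by_frequency(term_frequencies, partition_size=100):
    """Split root terms into consecutive partitions in ascending order of frequency."""
    # Sort by frequency first; break ties alphabetically so the order is stable.
    ordered = sorted(term_frequencies.items(), key=lambda item: (item[1], item[0]))
    terms = [term for term, _ in ordered]
    return [
        terms[start:start + partition_size]
        for start in range(0, len(terms), partition_size)
    ]
```

      With 250 root terms this yields three partitions of 100, 100 and 50 terms, with the least frequent terms in the first partition.</Paragraph>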
      <Paragraph position="6"> The primary interaction with the key term hierarchy is accomplished by direct manipulation of the tree visualization. The user can select individual nodes in the tree structure by pointing at and clicking the corresponding folders. When a node with children is selected, the tree expands to display the multiple word terms of the root key term. For example, when &quot;malpractice&quot; is selected as the root key term, a list of multiple word terms is displayed, including &quot;medical malpractice&quot;, &quot;malpractice cases&quot;, &quot;medical malpractice action&quot;, &quot;medical malpractice claims&quot;, &quot;limitations for medical malpractice&quot;, etc. (see figure 3). Functionality to shrink and collapse subtrees is also in place. When a term is selected from the tree, a corresponding term lookup is conducted on the document collection to locate the selected term within the currently displayed document. Documents representing the four highest frequencies for the selected term are displayed first. Once located, the selected term is always highlighted within the document browser.</Paragraph>
      <Paragraph position="7"> The right area of the GUI component (see figures 4 and 5) is occupied by the document browser. The design of the document browser is intended to provide an easy-to-learn interface for the management and manipulation of the document collection.</Paragraph>
      <Paragraph position="8"> There are three subwindows: the document identifier window, the document window and the navigation window. The document identifier window identifies the document that is currently displayed in the document window. It shows the document id and the total frequency of the selected key term in the document collection. The document window provides a view of the content of the targeted document (see figure 4).</Paragraph>
      <Paragraph position="9"> The user can move through the document by using the scroll bar, by using the document buttons in the navigation window, or by dragging the mouse up and down while holding down the middle mouse button. The user can copy relevant key terms to a holding area by selecting &quot;Edit&quot; from the menubar. The user is then presented with a popup dialog for importing the selected key terms (see figure 5).</Paragraph>
      <Paragraph position="10"> The navigation window enables the user to navigate through the documents to view the selected key terms in context. In addition, the user is provided with information regarding term frequencies and term relevance ranking scores.</Paragraph>
      <Paragraph position="11"> The GUI component described above is implemented using the C++ programming language and the OSF Motif graphical user interface toolkit. The user interface consists of a small set of classes that play various roles in the overall architecture. The two major objects of the user interface interaction model are the ListTree and the Document Store objects.</Paragraph>
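      <Paragraph> The term lookup and the frequency-based ordering of documents described in this section might look roughly like the sketch below. The actual system is written in C++ and its lookup code is not shown in the paper. The function names, the case-insensitive whole-word matching and the text marker used for highlighting are assumptions made for illustration.

```python
import re


def _term_pattern(term):
    """Compile a case-insensitive, whole-word pattern for a key term."""
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


def rank_documents(documents, term, top_n=4):
    """Return (doc_id, count) pairs for the top_n documents by occurrences of term."""
    pattern = _term_pattern(term)
    counts = {doc_id: len(pattern.findall(text)) for doc_id, text in documents.items()}
    hits = [(doc_id, n) for doc_id, n in counts.items() if n > 0]
    # Highest frequency first; ties are broken by document id.
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits[:top_n]


def highlight(text, term, marker="**"):
    """Wrap every occurrence of term in marker, standing in for on-screen highlighting."""
    return _term_pattern(term).sub(lambda m: marker + m.group(0) + marker, text)
```

      Here rank_documents takes a mapping from document id to document text and returns the documents to display first, while highlight marks each occurrence of the selected term in a document that is about to be shown.</Paragraph>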
      <Paragraph position="12"> ListTree is the primary class for implementing the tree visualization. Operations for growing, shrinking and manipulating the tree visualization have been implemented.</Paragraph>
      <Paragraph position="13"> Document Store provides the interface to document collections. In particular, a document store provides operations to create, modify and navigate document collections.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="309" end_page="309" type="metho">
    <SectionTitle>
3. RESULTS OF USABILITY TESTING
</SectionTitle>
    <Paragraph position="0"> Despite being a prototype, the system has proven to be useful and applicable in a commercial business environment. Since the system was put in place, we have conducted a series of usability tests within our company. The preliminary results indicate that the system can provide developers of internal specialized libraries, as well as subject indexing domain experts, with an ideal automated toolkit to select and examine significant terms from a sample dataset.</Paragraph>
    <Paragraph position="1"> A number of general topics have been tested for developing specialized libraries for our on-line search system. These include four legal topics, &quot;State Tax&quot;, &quot;Medical Malpractice&quot;, &quot;Uniform Commercial Code&quot; and &quot;Energy&quot;, and three news topics, &quot;Campaign&quot;, &quot;Legislature&quot; and &quot;Executives&quot;. Specific subject indexing topics that have been tested are &quot;Advertising Expenditure&quot;, &quot;Intranet&quot;, &quot;Job interview&quot; and &quot;Mutual fund&quot;. Two sets of questionnaires were filled out by the domain experts who participated in the usability testing.</Paragraph>
    <Paragraph position="2"> The overall ranking for the prototype system falls between &quot;somewhat useful&quot; and &quot;very useful&quot;, depending on the topic. The domain experts pointed out that the system is particularly helpful when dealing with a completely new or unfamiliar topic. It helps them spot significant terms that would normally be missed and objectively examine the significance level of certain fuzzy and ambiguous terms.</Paragraph>
  </Section>
</Paper>