<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1083">
  <Title>A Methodology for Terminology-based Knowledge Acquisition and Integration</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 TIMS - system architecture
</SectionTitle>
    <Paragraph position="0"> The XML-based Tag Information Management System (TIMS) is the core mechanism for managing XML tag information obtained from its functional sub-components. Its main aim is to provide an efficient mechanism for knowledge acquisition (KA) and knowledge integration (KI) through a query answering system for XML-based documents in the domain of molecular biology, using a tag information database.</Paragraph>
    <Paragraph position="1"> Figure 1 shows the system architecture of TIMS. It integrates the following modules via XML-based data exchange: JTAG -- an annotation tool, ATRACT -- an automatic term recognition and clustering workbench, and the LiLFeS abstract machine, which we briefly describe in this section. ATRACT and LiLFeS play a central role in the knowledge acquisition process, which includes term recognition, ontology population, and ontology-based inference. In addition to these modules, TIMS implements an XML-data manager and a TIQL query processor (see Section 2).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 JTAG
</SectionTitle>
      <Paragraph position="0"> JTAG is an XML-based manual annotation and resource description aid tool. Its purpose is to support manual annotation (e.g. semantic tagging), adjusting term recognition results, developing RDF logic, etc. In addition, ontology information described in XML can also be developed and modified using the tool. All the annotations can be managed via a GUI.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 ATRACT
</SectionTitle>
      <Paragraph position="0"> In the domain of molecular biology, an increasing number of new terms represent newly created concepts. Since existing term dictionaries cannot cover the needs of specialists, automatic term extraction tools are important for consistent term discovery. ATRACT (Mima et al., 2001a) is a terminology management workbench that integrates automatic term recognition (ATR) and automatic term clustering (ATC). Its main aim is to help biologists gather and manage terminology in the domain. The module retrieves and classifies terms on the fly and sends the results as XML tag information to TIMS.</Paragraph>
      <Paragraph position="1"> The ATR method is based on the C/NC-value method (Frantzi et al., 2000). The original method has been augmented with acronym acquisition and term variation management (Nenadic et al., 2002) in order to link different terms that denote the same concept. Term variation management is based on term normalisation as an integral part of the ATR process. All orthographic, morphological and syntactic term variations and acronym variants (if any) are conflated prior to the statistical analysis, so that each term candidate comprises all of its variants that appear in a corpus.</Paragraph>
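      <Paragraph> As a rough illustration of the statistical step, the C-value measure of Frantzi et al. (2000) can be sketched as below. The candidate terms and frequencies are invented toy data, variant conflation is assumed to have been applied already, and the function names are hypothetical; this is a minimal sketch and not the ATRACT implementation.

```python
import math

def is_nested(shorter, longer):
    """True if the word sequence `shorter` occurs contiguously inside `longer`."""
    n, m = len(shorter), len(longer)
    if n >= m:
        return False
    return any(longer[i:i + n] == shorter for i in range(m - n + 1))

def c_value(freq):
    """Compute the C-value of each multi-word candidate term.

    `freq` maps a candidate (a tuple of words) to its corpus frequency,
    counted after all variants have been conflated to one normal form.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        containing = [g for other, g in freq.items() if is_nested(term, other)]
        weight = math.log2(len(term))  # 0 for single words: C-value targets multi-word terms
        if containing:
            scores[term] = weight * (f - sum(containing) / len(containing))
        else:
            scores[term] = weight * f
    return scores

# Toy frequencies (assumed, for illustration only).
freq = {
    ("nuclear", "receptor"): 10,
    ("orphan", "nuclear", "receptor"): 4,
}
scores = c_value(freq)
```

In this toy case the nested candidate "nuclear receptor" is discounted by the frequency of the longer term that contains it, which is the effect the measure is designed to have.</Paragraph>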
      <Paragraph position="2"> Besides term recognition, term clustering is an indispensable component in a knowledge management process (see figure 2). Since terminological opacity and polysemy are very common in molecular biology, term clustering is essential for the semantic integration of terms, the construction of a domain ontology and for choosing the appropriate semantic information.</Paragraph>
      <Paragraph position="3"> The ATC method is based on Ushioda's AMI (Average Mutual Information) hierarchical clustering method (Ushioda, 1996). Our implementation uses parallel symmetric processing for high-speed clustering and is built on the C/NC-value results. As input, we use co-occurrences of automatically recognised terms and their contexts, and the output is a dendrogram of hierarchical term clusters (like a thesaurus). The calculated term cluster information is stored in LiLFeS (see below) and combined with a predefined ontology according to the term classes automatically assigned.</Paragraph>
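      <Paragraph> The idea behind AMI-based clustering can be sketched as follows: compute the average mutual information of a term/context co-occurrence table and greedily merge the two term classes whose merge keeps it highest. The counts are invented, the function names are hypothetical, and the sketch is sequential and far simpler than the parallel implementation described above.

```python
import math
from collections import Counter
from itertools import combinations

def average_mutual_information(counts):
    """AMI of a joint count table {(term_class, context): count}."""
    total = sum(counts.values())
    left, right = Counter(), Counter()
    for (x, y), n in counts.items():
        left[x] += n
        right[y] += n
    ami = 0.0
    for (x, y), n in counts.items():
        if n > 0:
            # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
            ami += (n / total) * math.log2(n * total / (left[x] * right[y]))
    return ami

def merge(counts, a, b):
    """Return a new table in which term classes a and b form one class."""
    merged = Counter()
    for (x, y), n in counts.items():
        key = (a, b) if x in (a, b) else x
        merged[(key, y)] += n
    return merged

def best_merge(counts):
    """Pick the pair of term classes whose merge keeps AMI highest."""
    classes = sorted({x for x, _ in counts}, key=str)
    return max(combinations(classes, 2),
               key=lambda pair: average_mutual_information(merge(counts, *pair)))

# Toy co-occurrence counts of terms and context words (assumed).
counts = {
    ("IL-2", "activates"): 5,
    ("IL-4", "activates"): 5,
    ("DNA", "binds"): 5,
}
pair = best_merge(counts)  # ("IL-2", "IL-4"): they share all their contexts
```

Repeating the merge step until one class remains, and recording the order of merges, would give a dendrogram of the kind mentioned in the text.</Paragraph>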
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.3 LiLFeS
</SectionTitle>
      <Paragraph position="0"> LiLFeS (Miyao et al., 2000) is a Prolog-like programming language and language processor used for defining definite clause programs with typed feature structures. Since typed feature structures can be used like first order terms in Prolog, the LiLFeS language can describe various kinds of applications based on feature structures. Examples include HPSG parsers, HPSG-based grammars and compilers from HPSG to CFG. Furthermore, other NLP modules can be easily developed because feature structure processing can be directly written in the LiLFeS language. Within TIMS, LiLFeS is used to: 1) infer similarity between terms using hierarchical matching, and 2) parse sentences using HPSG-based parsers and convert the results into an XML-based formalism.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Knowledge Integration and Management
</SectionTitle>
    <Paragraph position="0"> Knowledge integration and management in TIMS is organised by integrating XML-data management (section 2.1) and tag- and ontology-based information extraction (section 2.2). Figure 3 illustrates a model of knowledge management based on the knowledge integration and question-answering process within TIMS. In this scenario, a user formulates a query, which is processed by a query manager.</Paragraph>
    <Paragraph position="1"> The tag data manager retrieves the relevant data from the collection of documents via a tag database and ontology-based inference (such as</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 XML-tag data management
</SectionTitle>
      <Paragraph position="0"> Communication within TIMS is based on XML-data exchange. TIMS initially parses the XML documents (which contain relevant terminology information generated automatically by ATRACT) and "de-tags" them. Then, as in the TIPSTER architecture (Grishman, 1995), all tag information is stored separately from the original documents and managed by external database software. As shown in figure 4, this allows different types of tags (POS, syntactic, semantic, etc.) to be supported for the same document.</Paragraph>
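      <Paragraph> The de-tagging step can be pictured with the sketch below, which separates a small document into plain text and stand-off tag records holding character offsets. The record layout (tag, start, end), the function name and the toy sentence are assumptions made for illustration; the actual TIMS database schema is not described here.

```python
import xml.etree.ElementTree as ET

def detag(root):
    """Split an element tree into plain text plus stand-off tag records.

    Returns (text, records), where each record is (tag, start, end)
    with character offsets into the de-tagged text.
    """
    pieces, records = [], []
    offset = 0

    def visit(elem):
        nonlocal offset
        start = offset
        if elem.text:
            pieces.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            visit(child)
            if child.tail:
                pieces.append(child.tail)
                offset += len(child.tail)
        records.append((elem.tag, start, offset))

    visit(root)
    # Outer elements first when two records start at the same offset.
    return "".join(pieces), sorted(records, key=lambda r: (r[1], -r[2]))

# Toy sentence built through the API (a real system would use ET.parse on a file):
# S contains a term, then a VP holding a V and a second term.
s = ET.Element("S")
t1 = ET.SubElement(s, "term")
t1.text = "IL-2"
t1.tail = " "
vp = ET.SubElement(s, "VP")
v = ET.SubElement(vp, "V")
v.text = "activates"
v.tail = " "
t2 = ET.SubElement(vp, "term")
t2.text = "NF-kappa B"

text, records = detag(s)
# text    == "IL-2 activates NF-kappa B"
# records == [("S", 0, 25), ("term", 0, 4), ("VP", 5, 25), ("V", 5, 14), ("term", 15, 25)]
```

Because the records refer to the text only by offset, several independent tag sets (POS, syntactic, semantic) can be stored for the same document without touching the text itself.</Paragraph>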
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Tag- and ontology-based IE
</SectionTitle>
      <Paragraph position="0"> The key feature of KA and KI within TIMS is a facility to logically retrieve data that is represented by different tags. This feature is implemented via interval operations. The main assumption is that the XML tags specify certain intervals within documents. Interval operations are XML-specific text/data retrieval operations that operate on such textual intervals. Each interval operation takes two sets of intervals as input and returns a set of intervals according to the specified logical operations. Currently, we define four types of logical operations:  intervals of all the continuous intervals.</Paragraph>
      <Paragraph position="1"> For example, the interval operation &lt;VP&gt;[?](&lt;V&gt;[?]&lt;term&gt;) describes all verb (&lt;V&gt;)-term (&lt;term&gt;) pairs within a verb phrase (&lt;VP&gt;). Similarly, suppose X denotes a set of intervals of manually annotated tags for a document and Y denotes a set of intervals of automatically annotated tags for the same document. The interval operation ((X[?]Y) [?]{X[?]Y}) yields the differences between human and machine annotations (see figure 5).</Paragraph>
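      <Paragraph> The two examples above can be approximated with ordinary set operations over (start, end) offsets, as in the sketch below. It assumes that the operators shown as [?] correspond to containment, union, intersection and subtraction, and that intervals are compared by exact offsets; the function names and toy offsets are hypothetical and the real operator semantics in TIMS may differ.

```python
def within(inner, outer):
    """Intervals of `inner` that lie inside at least one interval of `outer`."""
    return {(s, e) for (s, e) in inner
            if any(s >= lo and hi >= e for (lo, hi) in outer)}

def pairs_within(first, second, outer):
    """(a, b) pairs where a precedes b and both lie in the same outer interval."""
    result = []
    for o in sorted(outer):
        inside_a = sorted(within(first, {o}))
        inside_b = sorted(within(second, {o}))
        result.extend((a, b) for a in inside_a for b in inside_b if b[0] >= a[1])
    return result

def annotation_diff(x, y):
    """Intervals present in exactly one annotation set: (x union y) minus (x intersect y)."""
    return x.union(y) - x.intersection(y)

# Toy offsets for one sentence (assumed): one VP, one verb, two terms.
VP = {(5, 25)}
V = {(5, 14)}
TERM = {(0, 4), (15, 25)}
verb_term_pairs = pairs_within(V, TERM, VP)   # [((5, 14), (15, 25))]

# Manual versus automatic term annotation of the same text (assumed).
human = {(0, 4), (15, 25)}
machine = {(5, 14), (15, 25)}
disagreements = annotation_diff(human, machine)   # {(0, 4), (5, 14)}
```

The term at (0, 4) is excluded from the pairs because it lies outside the verb phrase, and the annotation comparison keeps only the spans on which the two annotators disagree.</Paragraph>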
      <Paragraph position="2"> Interval operations are powerful means for textual mining from different sources using tag information. In addition, LiLFeS enables tag</Paragraph>
      <Paragraph position="4"> subordinate classes using either a predefined or an automatically derived term ontology. Thus, semantically based tag information retrieval can be achieved. For example, the interval operation &lt;VP&gt;[?]&lt;nucleic_acid*&gt; will retrieve all subordinate terms/classes of nucleic acid that are contained within a VP.</Paragraph>
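      <Paragraph> A simplified picture of this ontology-based expansion is sketched below: a class name is expanded to all of its subordinate classes, and only tagged intervals of those classes that fall inside a VP are kept. The ontology fragment, the tagged intervals and the function names are invented for illustration, and a plain dictionary stands in for the typed feature structures that LiLFeS would use.

```python
def subordinate_classes(ontology, cls):
    """Return `cls` together with every class below it in the hierarchy."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(ontology.get(current, []))
    return found

def retrieve(tagged, ontology, cls, outer):
    """Tagged intervals whose class falls under `cls` and which lie inside `outer`."""
    wanted = subordinate_classes(ontology, cls)
    return sorted((tag, s, e) for (tag, s, e) in tagged
                  if tag in wanted
                  and any(s >= lo and hi >= e for (lo, hi) in outer))

# Hypothetical ontology fragment: parent class mapped to its direct subclasses.
ontology = {"nucleic_acid": ["DNA", "RNA"], "RNA": ["mRNA"]}

# Hypothetical semantic tags as (class, start, end) and one VP interval.
tagged = [("DNA", 10, 20), ("mRNA", 30, 34), ("protein", 40, 50), ("RNA", 60, 63)]
VP = [(5, 35)]

hits = retrieve(tagged, ontology, "nucleic_acid", VP)
# hits == [("DNA", 10, 20), ("mRNA", 30, 34)]
```

The RNA interval at (60, 63) belongs to the right class but lies outside the VP, and the protein interval lies outside the class hierarchy, so neither is returned.</Paragraph>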
      <Paragraph position="5"> The interval operations can be performed over the specified documents and/or tag sets (e.g. syntactic, semantic tags, etc.) simultaneously or in batch mode, by selecting the documents/tag sets from a list. This accelerates the process of KA, as users are able to retrieve information from multiple KSs simultaneously.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 TIQL - Tag Information Query Language
</SectionTitle>
      <Paragraph position="0"> In order to integrate and expand the above components, we have developed a tag information query language (TIQL). Using this language, a user can specify the interval operations to be performed on selected documents (including the ontology inference to expand queries). The basic expression in TIQL has the following form:  where, [n-tuple variables] specifies the table output format, [XML document(s)] denotes the document(s) to be processed, and [interval operation] denotes an interval operation to be performed over the corresponding document with variables of each interval to be bound.</Paragraph>
      <Paragraph position="1"> For example, the following expression:  extracts all the hierarchically subordinate classes that match the (&lt;EVENT&gt;, &lt;nucleic_acid&gt;) pair within a VP from the specified XML documents, and then automatically builds a table to display the results (see figure 6).</Paragraph>
      <Paragraph position="2"> Since formulating an appropriate TIQL expression using interval operations might be cumbersome, in particular for novice users, TIMS has been augmented with the capability of "recycling" predefined queries and macros.</Paragraph>
    </Section>
  </Section>
</Paper>