<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1110">
  <Title>Automated Alignment and Extraction of Bilingual Domain Ontology for Medical Domain Web Search</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Methodologies
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the block diagram for ontology construction and the framework of the domain-specific web search system. There are four major processes in the proposed system: bilingual ontology alignment, domain ontology extraction, knowledge representation and domain-specific web search.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Bilingual Ontology Alignment
</SectionTitle>
      <Paragraph position="0"> In this approach, the cross-lingual ontology is constructed by aligning the words in WordNet with their corresponding words in HowNet. First, the Sinorama (Sinorama 2001) database is adopted as the bilingual parallel corpus to compute the conditional probability of the words in WordNet given the words in HowNet. Second, a bottom-up algorithm is used for relation mapping.</Paragraph>
      <Paragraph position="1"> Figure 1 Ontology construction framework and the domain-specific web search system
In WordNet, a word may be associated with many synsets, each corresponding to a different sense of the word. When we look for a relation between two different words, we consider all the synsets associated with each word (Christiane 1998). In HowNet, each word is composed of primary features and secondary features. The primary features indicate the word's category. The purpose of this approach is to increase the coverage of relations and structural information by aligning these two language-dependent ontologies, WordNet and HowNet, which have different semantic features.</Paragraph>
      <Paragraph position="2"> The relation &amp;quot;is-a&amp;quot; defined in WordNet corresponds to the primary feature defined in HowNet. Equation (1) shows the mapping between the words in HowNet and the synsets in WordNet.</Paragraph>
      <Paragraph position="3"> Given a Chinese word, CW_i, the probability that the word is related to a synset, synset_k, can be obtained via its corresponding English synonyms, EW_j, as given in Equation (1),</Paragraph>
      <Paragraph position="5"> where {entity, event, act, play} is the concept set in the root nodes of HowNet and WordNet.</Paragraph>
      <Paragraph position="6"> Finally, the Chinese concept, CW_i, is aligned to the synset, synset_k, as long as the probability, Pr(synset_k | CW_i), exceeds a predefined threshold.</Paragraph>
      <Paragraph position="10"> Figure 2 illustrates the alignment of WordNet and HowNet. The nodes with a bold circle represent the operative nodes after concept extraction. The nodes with a gray background represent the operative nodes after relation expansion.</Paragraph>
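      <Paragraph> The marginalization over English translations can be sketched as follows; the uniform sense prior, the toy translation table and all names here are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def align_word_to_synsets(cw, translation_prob, synsets_of):
    """Estimate Pr(synset_k | CW_i) by marginalizing over the English
    translations EW_j of the Chinese word CW_i:
        Pr(synset_k | CW_i) = sum_j Pr(synset_k | EW_j) * Pr(EW_j | CW_i)
    Pr(EW_j | CW_i) comes from the bilingual parallel corpus; here
    Pr(synset_k | EW_j) is assumed uniform over the synsets of EW_j.
    """
    scores = defaultdict(float)
    for ew, p_trans in translation_prob.get(cw, {}).items():
        synsets = synsets_of.get(ew, [])
        if not synsets:
            continue
        p_sense = 1.0 / len(synsets)  # uniform sense prior (assumption)
        for syn in synsets:
            scores[syn] += p_trans * p_sense
    return dict(scores)

# Toy data (hypothetical): the Chinese word translates mostly to "bank"
translation_prob = {"yinhang": {"bank": 0.9, "shore": 0.1}}
synsets_of = {"bank": ["bank.n.01", "bank.n.02"], "shore": ["shore.n.01"]}
print(align_word_to_synsets("yinhang", translation_prob, synsets_of))
```

The Chinese word is then aligned to every synset whose score exceeds the chosen threshold.
</Paragraph>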
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Domain ontology extraction
</SectionTitle>
      <Paragraph position="0"> The domain ontology is constructed in two phases: 1) extract the ontology from the cross-language ontology by the island-driven algorithm, and 2) integrate the terms and axioms defined in a medical encyclopaedia into the domain ontology.</Paragraph>
      <Paragraph position="1"> 2.2.1 Extraction by island-driven algorithm
An ontology provides the consistent concepts and world representations necessary for clear communication within the knowledge domain. Even in domain-specific applications, the number of words can be expected to be large. Synonym pruning is an effective alternative to word sense disambiguation.</Paragraph>
      <Paragraph position="2"> This paper proposes a corpus-based statistical approach to extracting the domain ontology. The steps are listed as follows: Step 1: Linearization: This step decomposes the tree structure of the universal ontology shown in Figure 2 into vertex lists; each vertex list is an ordered node sequence starting at a leaf node and ending at the root node.</Paragraph>
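      <Paragraph> The linearization step can be sketched as follows; the child-to-parent map and the mini-ontology are illustrative assumptions, not the paper's data.

```python
def linearize(parent):
    """Decompose a concept tree (given as a child -> parent map) into
    vertex lists, one per leaf, each ordered from the leaf to the root."""
    children = set(parent)
    parents = set(parent.values())
    leaves = children - parents          # nodes that are never a parent
    paths = []
    for leaf in sorted(leaves):
        path, node = [leaf], leaf
        while node in parent:            # climb until the root
            node = parent[node]
            path.append(node)
        paths.append(path)
    return paths

# Hypothetical mini-ontology: entity -> {organism -> person, object}
parent = {"organism": "entity", "person": "organism", "object": "entity"}
print(linearize(parent))
# [['object', 'entity'], ['person', 'organism', 'entity']]
```
</Paragraph>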
      <Paragraph position="3"> Step 2: Concept extraction from the corpus: A node is defined as an operative node when the tf-idf value of its word, W_i, in the domain corpus is higher than that in the corresponding contrastive (out-of-domain) corpus, where the document frequencies in the tf-idf values are the numbers of documents containing word W_i in the domain documents and in the contrastive documents, respectively. The nodes with a bold circle in Figure 2 represent the operative nodes.</Paragraph>
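      <Paragraph> The operative-node test can be sketched as follows; the smoothed idf form and the example counts are assumptions, since the paper's exact formula was lost in extraction.

```python
import math

def tfidf(tf, df, n_docs):
    """tf-idf with a smoothed idf term (one common variant; assumption)."""
    return tf * math.log((1 + n_docs) / (1 + df))

def is_operative(tf_dom, df_dom, n_dom, tf_con, df_con, n_con):
    """A node is operative when the tf-idf of its word in the domain
    corpus exceeds the tf-idf of the same word in the contrastive corpus."""
    return tfidf(tf_dom, df_dom, n_dom) > tfidf(tf_con, df_con, n_con)

# Hypothetical counts: the word appears in 30 of 100 domain documents
# but only 5 of 100 contrastive documents, with higher term frequency.
print(is_operative(tf_dom=12, df_dom=30, n_dom=100,
                   tf_con=2, df_con=5, n_con=100))
```
</Paragraph>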
      <Paragraph position="8"> Step 3: Relational expansion using the island-driven algorithm: Some domain concepts remain inoperative after the previous steps due to insufficient data. From our observations during ontology construction, most of the inoperative concept nodes have operative hypernym and hyponym nodes. Therefore, the island-driven algorithm is adopted to activate these inoperative concept nodes if their ancestors and descendants are all operative. The nodes with a gray background shown in Figure 2 are the activated operative nodes.
Step 4: Domain ontology extraction: The final step is to merge the linear vertex lists back into a hierarchical tree. However, some noisy concepts that do not belong to the domain ontology remain operative after Step 3; these noisy nodes should be filtered out automatically. Finally, the domain ontology is extracted, and the result is shown in Figure 3.</Paragraph>
      <Paragraph position="9"> After the above steps, a dummy node is added as the root node of the domain concept tree.</Paragraph>
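      <Paragraph> The island-driven expansion of Step 3 can be sketched on a single vertex list as follows; the activation rule (all nodes below and above on the path must be operative) is our reading of the text, and the example path is hypothetical.

```python
def expand_path(path, operative):
    """Activate an inoperative node on a leaf-to-root vertex list when
    all of its descendants (below it) and ancestors (above it) on the
    path are already operative."""
    activated = set(operative)
    for i, node in enumerate(path):
        if node in activated:
            continue
        below, above = path[:i], path[i + 1:]
        if all(n in activated for n in below) and all(n in activated for n in above):
            activated.add(node)
    return activated

# "organism" is inoperative but surrounded by operative nodes, so it
# gets activated; an isolated inoperative node would stay inactive.
path = ["person", "organism", "living_thing", "entity"]
print(sorted(expand_path(path, {"person", "living_thing", "entity"})))
```
</Paragraph>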
      <Paragraph position="10"> Figure 3 The domain ontology after filtering out the isolated concepts
In practice, domain-specific terminologies and axioms should be derived and introduced into the ontology for domain-specific applications. In our approach, 1213 axioms derived from a medical encyclopaedia have been integrated into the domain ontology. Figure 4 shows an example of an axiom. In this example, the disease &amp;quot;diabetes&amp;quot; is tagged with level &amp;quot;A&amp;quot;, which indicates that the disease is frequent in occurrence, and the degrees attached to the corresponding syndromes represent the causality between the disease and those syndromes. Each axiom also provides two fields, &amp;quot;department of the clinical care&amp;quot; and &amp;quot;the category of the disease&amp;quot;, for medical information retrieval.</Paragraph>
      <Paragraph position="11"> Figure 4 An axiom example</Paragraph>
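      <Paragraph> The axiom fields described above can be pictured as a small record; the field names, syndrome degrees and the department/category values below are illustrative assumptions, as the source does not give the schema.

```python
from dataclasses import dataclass

@dataclass
class Axiom:
    """One medical axiom, mirroring the fields described for Figure 4."""
    disease: str
    occurrence_level: str   # "A" = frequent in occurrence
    syndromes: dict         # syndrome -> causality degree (assumed smaller = stronger)
    department: str         # "department of the clinical care"
    category: str           # "the category of the disease"

diabetes = Axiom(
    disease="diabetes",
    occurrence_level="A",
    syndromes={"thirst": 1, "frequent urination": 1, "fatigue": 2},  # illustrative
    department="endocrinology",      # illustrative value, not from the source
    category="metabolic disease",    # illustrative value, not from the source
)
```
</Paragraph>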
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Domain-specific web search
</SectionTitle>
      <Paragraph position="0"> This paper proposes a medical web search engine based on the constructed medical domain ontology.</Paragraph>
      <Paragraph position="1"> The engine consists of a natural language interface, a web crawler and indexer, a relation inference module and an axiom inference module. The functions and techniques of these modules are described as follows.</Paragraph>
      <Paragraph position="2"> 2.3.1 Natural language interface and web crawler and indexer
A natural language interface is generally considered an enticing prospect because it offers many advantages: it is easy to learn and easy to remember, because its structure and vocabulary are already familiar to the user, and it is particularly powerful because of the multitude of ways in which a search action can be accomplished with natural language input. A natural language query is transformed into the desired representation after word segmentation, stop-word removal, stemming and tagging.</Paragraph>
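      <Paragraph> The query normalization pipeline can be sketched as follows; the whitespace tokenizer and the one-line stemmer are toy stand-ins for real segmentation and stemming, and the stop-word list is an assumption.

```python
def preprocess(query, stopwords, stemmer):
    """Sketch of the query pipeline: segmentation, stop-word removal,
    stemming (part-of-speech tagging omitted for brevity)."""
    tokens = query.lower().split()                # stand-in for real segmentation
    tokens = [t for t in tokens if t not in stopwords]
    return [stemmer(t) for t in tokens]

stop = {"the", "of", "what", "are", "is"}
stem = lambda t: t[:-1] if t.endswith("s") else t  # toy stemmer
print(preprocess("What are the symptoms of diabetes", stop, stem))
# ['symptom', 'diabete']
```
</Paragraph>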
      <Paragraph position="3"> The web crawler and indexer are designed to seek medical web pages from the Internet, extract their content and establish the indices automatically. For semantic representation, traditional keyword-based systems introduce two problems. First, ambiguity usually results from the polysemy of words; the domain ontology gives a clear description of the concepts, and, in addition, not all the synonyms of a word should be expanded without constraints. Second, relations between the concepts should be expanded and weighted in order to include more semantic information for semantic inference. We treat each of the user's input and the content of a web page as a sequence of words; that is, the sequence of words is treated as a bag of words, regardless of word order. Given the word sequence of the user's input and the word sequence of the web page, the similarity between the input query and the page is defined as the similarity between the two bags of words, measured over the key concepts in the ontology.</Paragraph>
      <Paragraph position="5"> Some axioms, such as &amp;quot;result in&amp;quot; and &amp;quot;result from,&amp;quot; that are expected to affect the performance of a web search system in a technical domain are defined to describe the relationship between syndromes and diseases. This reflects the use of terms specific to the medical domain. We collected data about syndromes and diseases from a medical encyclopaedia and tagged the diseases with three levels according to their frequency of occurrence, and the syndromes with four levels according to their significance to the specific disease. The &amp;quot;result in&amp;quot; relation score is applied</Paragraph>
      <Paragraph position="7"> if a disease occurs in the input query and its corresponding syndromes appear in the web page.</Paragraph>
      <Paragraph position="8"> Similarly, if a syndrome occurs in the input query and its corresponding disease appears in the web page, the &amp;quot;result from&amp;quot; relation score is defined analogously.</Paragraph>
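      <Paragraph> The two relation scores can be sketched as follows; the weighting by causality degree and the toy axiom table are assumptions, since the paper's exact formulas were lost in extraction.

```python
def result_in_score(query_terms, page_terms, axioms):
    """"Result in": a disease in the query whose syndromes appear in the
    page raises the score; a stronger causality degree (assumed smaller
    number) contributes a higher weight."""
    score = 0.0
    for disease, syndromes in axioms.items():
        if disease in query_terms:
            for syndrome, degree in syndromes.items():
                if syndrome in page_terms:
                    score += 1.0 / degree
    return score

def result_from_score(query_terms, page_terms, axioms):
    """"Result from" is the symmetric case: syndromes in the query,
    disease in the page."""
    return result_in_score(page_terms, query_terms, axioms)

# Toy axiom table (hypothetical degrees)
axioms = {"diabetes": {"thirst": 1, "fatigue": 2}}
print(result_in_score({"diabetes"}, {"thirst", "fatigue"}, axioms))  # 1.5
```
</Paragraph>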
    </Section>
  </Section>
</Paper>