<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-2013">
  <Title>SOAT: A Semi-Automatic Domain Ontology Acquisition Tool from Chinese Corpus</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. The InfoMap Framework
</SectionTitle>
    <Paragraph position="0"> Gruber defines an ontology to be a description of concepts and relationships (Gruber 1993).</Paragraph>
    <Paragraph position="1"> Our knowledge representation scheme, InfoMap, can serve as an ontology framework. InfoMap provides the knowledge necessary for understanding natural language related to a certain knowledge domain. To do so, it must integrate linguistic knowledge, commonsense knowledge and domain knowledge when making inferences.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Structure of InfoMap
</SectionTitle>
      <Paragraph position="0"> InfoMap consists of domain concepts and their related concepts, such as associated attributes and activities. Each concept forms a tree-like taxonomy. InfoMap defines &amp;quot;reference&amp;quot; nodes to connect nodes on different branches, thereby integrating these concepts into a semantic network.</Paragraph>
      <Paragraph position="1"> InfoMap not only classifies concepts, but also classifies the relationships among concepts.</Paragraph>
      <Paragraph position="2"> There are two types of nodes in InfoMap: concept nodes and function nodes. The root node of a domain is the name of the domain.</Paragraph>
      <Paragraph position="3"> Below the root node are the topics in this domain that may be of interest to users. These topics have sub-categories that list related sub-topics in a recursive fashion.</Paragraph>
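      <Paragraph position="4"> The structure described above can be sketched as a small tree of typed nodes. This is only an illustration under stated assumptions: the class name Node, its fields, the helper paths, and the toy "finance" domain are hypothetical and are not taken from the InfoMap implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One InfoMap node; the field names are illustrative assumptions."""
    name: str
    kind: str  # "concept", "function" or "reference"
    children: list = field(default_factory=list)
    target: Optional["Node"] = None  # set only for reference nodes

    def add(self, child):
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def paths(node, prefix=()):
    """Yield every root-to-node path as a tuple of node names."""
    here = prefix + (node.name,)
    yield here
    for child in node.children:
        yield from paths(child, here)


# The root node of a domain is the name of the domain (toy example).
root = Node("finance", "concept")
stock = root.add(Node("stock", "concept"))
bank = root.add(Node("bank", "concept"))
# A reference node connects one branch to a node on another branch.
bank.add(Node("see stock", "reference", target=stock))

all_paths = list(paths(root))
```

Because the reference node stores a pointer to a node on another branch, following target links turns the separate trees into a network, which is the role the text assigns to reference nodes.</Paragraph>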
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Function Nodes in InfoMap
</SectionTitle>
      <Paragraph position="0"> InfoMap uses function nodes to label different relationships among related concept nodes. The basic function nodes are category, attribute, associated activity, and synonym, which are described below.</Paragraph>
      <Paragraph position="1"> 1. Category: Various ways of dividing up a concept A. For example, the concept of &amp;quot;people&amp;quot; can be divided into young, middle-aged and old people according to &amp;quot;age&amp;quot;, into men and women according to &amp;quot;sex&amp;quot;, or into rich and poor people according to &amp;quot;wealth&amp;quot;. To each such partition we attach a &amp;quot;cause&amp;quot;. Each such division can be regarded as an angle from which to view concept A.</Paragraph>
      <Paragraph position="2"> 2. Attribute: Properties of concept A. For example, the attributes of a human being can be the organs, the height, the weight, hobbies, etc.</Paragraph>
      <Paragraph position="3"> 3. Associated activity: Actions that can be associated with concept A. For example, if A is a &amp;quot;car&amp;quot;, then it can be driven, parked, raced, washed, repaired, etc.</Paragraph>
      <Paragraph position="4"> 4. Synonym: Expressions that are synonymous with concept A in the given context.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 The Contextual View of InfoMap
</SectionTitle>
      <Paragraph position="0"> Generally speaking, an ontology consists of definitions of concepts, relations and axioms. A well-known ontology, WordNet (Miller 1990), has the following features: hypernymy, hyponymy, antonymy, semantic relationship, and synset. Compared with the global view of concepts in WordNet, InfoMap defines category, event, attribute, and synonym in a more contextual fashion. For example, a synonym of a concept in InfoMap is valid only in that particular context. This is very different from the synset in WordNet. Each node B underneath a function node (synonym, attribute, activity or category) of A can be treated as a related concept of A and can be further expanded by describing other relations pertaining to B. However, the relations for B described therein will be &amp;quot;limited under the context of A&amp;quot;. For example, if A is &amp;quot;organization&amp;quot; and B is the &amp;quot;facility&amp;quot; attribute of A, then underneath the node B we list those facilities one can normally find in an organization, whereas for the &amp;quot;facility&amp;quot; attribute of &amp;quot;hotel&amp;quot;, we list only those facilities found in a hotel.</Paragraph>
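      <Paragraph position="1"> The contextual behaviour described above amounts to a lookup that is always keyed by the parent concept. The following sketch assumes a nested-dictionary layout and a helper named related, both hypothetical. The facility lists are invented placeholders, not data from InfoMap.

```python
# Facility lists are invented placeholders, not data from the paper.
infomap = {
    "organization": {"attribute": {"facility": ["office", "meeting room"]}},
    "hotel": {"attribute": {"facility": ["guest room", "lobby"]}},
}


def related(context, function, name):
    """Return the concepts listed under `name`, limited to `context`."""
    return infomap[context][function][name]


hotel_facilities = related("hotel", "attribute", "facility")
org_facilities = related("organization", "attribute", "facility")
```

The same node name, "facility", yields different lists because there is no global entry for it. Each entry exists only under the concept that owns it.</Paragraph>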
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 The Inference Engine of InfoMap
</SectionTitle>
      <Paragraph position="0"> The kernel program maps a natural language sentence into a set of nodes and uses the edited knowledge to recognize the events in the user's sentences. Technically, InfoMap matches a natural language sentence to a collection of concept nodes. A firing mechanism finds the nodes in InfoMap relevant to the input sentence. Suppose we want to find the event of the sentence &amp;quot;How do I invest in stocks?&amp;quot;. The interrogative word &amp;quot;how&amp;quot; fires the word &amp;quot;method&amp;quot;. Then, along the path from &amp;quot;method&amp;quot; to &amp;quot;stock&amp;quot;, the sentence fires the concepts &amp;quot;stock&amp;quot; and &amp;quot;invest&amp;quot;. Thus, the sentence corresponds to the path: stock - event - invest - attribute - method. Given enough knowledge about the events related to the main concept, InfoMap can be used to parse Chinese sentences. Readers can refer to (Hsu et al. 2001) for a thorough description of InfoMap.</Paragraph>
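      <Paragraph position="1"> The firing example above can be approximated with a few lines of code. This is a rough sketch, not the InfoMap kernel: the firing table FIRES, the toy path list PATHS, and the rule of accepting the first path whose concept nodes are all fired are assumptions made for illustration. Only the first path comes from the text.

```python
FUNCTION_NODES = {"category", "attribute", "event", "synonym"}

# Hypothetical firing table: surface word -> InfoMap node it fires.
FIRES = {"how": "method", "invest": "invest", "stocks": "stock"}

# Root-to-leaf paths of a toy InfoMap for the stock domain.
PATHS = [
    ("stock", "event", "invest", "attribute", "method"),
    ("stock", "event", "invest", "attribute", "risk"),
    ("stock", "attribute", "price"),
]


def fired_nodes(sentence):
    """Return the set of nodes fired by the words of the sentence."""
    words = sentence.lower().strip("?.!").split()
    return {FIRES[w] for w in words if w in FIRES}


def best_path(sentence):
    """Return the first path whose concept nodes are all fired, if any."""
    fired = fired_nodes(sentence)
    for path in PATHS:
        concepts = [n for n in path if n not in FUNCTION_NODES]
        if all(c in fired for c in concepts):
            return path
    return None


result = best_path("How do I invest in stocks?")
```

Function nodes such as "event" and "attribute" are skipped during matching because they label relations and are not expected to appear in the sentence itself.</Paragraph>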
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Automatic Domain Ontology Acquisition
</SectionTitle>
    <Paragraph position="0"> To build an ontology for a new domain, we need to collect domain keywords and find the relationships among them. We designed an acquisition process, SOAT, that constructs a new ontology from a domain corpus. Thus, with little human intervention, SOAT can build a prototype of the domain ontology.</Paragraph>
    <Paragraph position="1"> As described in previous sections, InfoMap consists of two major kinds of relations among concepts: taxonomic relations (category and synonym) and non-taxonomic relations (attribute and event). We defined sentence templates, which consist of patterns of keywords and variables, to capture these relations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Description of SOAT
</SectionTitle>
      <Paragraph position="0"> Given a POS-tagged domain corpus, SOAT can be described as follows.</Paragraph>
      <Paragraph position="1"> Input: domain corpus with POS tags
Output: domain ontology prototype
Steps:
1 Select a keyword (usually the name of the domain) in the corpus as the seed to form a potential root set R
2 Begin the following recursive process:
2.1 Pick a keyword A as the root from R
2.2 Find a new related keyword B of the root A by the extraction rules and add it into the domain ontology according to the rules
2.3 If there are no more related keywords, remove A from R
2.4 Put B into the potential root set
2.5 Repeat step 2 until either R becomes empty or the total number of nodes generated exceeds a prescribed threshold
We find that most of the domain keywords are not in the dictionary, so the traditional TF/IDF method would fail. Instead, we use the high-frequency new words discovered by PAT-tree as the seeds. Ideally, SOAT can generate a domain ontology prototype automatically. However, the extraction rules need to be refined and updated by a human editor. The details of the SOAT extraction rules are in Section 3.2.</Paragraph>
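      <Paragraph position="2"> The steps above can be sketched as a simple loop over the potential root set. This is a minimal illustration under stated assumptions: the function soat, its parameter max_nodes, the callback find_related that stands in for the extraction rules, and the toy table TOY_RULE_HITS are all hypothetical. The sketch stops once the node count reaches the cap, which is one possible reading of the threshold in step 2.5.

```python
def soat(seed, find_related, max_nodes=100):
    """Grow an ontology prototype from a seed keyword.

    find_related(root, known) stands in for the extraction rules: it
    returns (relation, keyword) for one new keyword, or None.
    """
    roots = [seed]   # step 1: potential root set R
    known = {seed}   # every node generated so far
    ontology = []    # (root, relation, keyword) triples
    # Step 2.5: stop when R is empty or the node cap is reached.
    while roots and max_nodes > len(known):
        a = roots[0]                    # step 2.1: pick a root A
        found = find_related(a, known)  # step 2.2: apply the rules
        if found is None:
            roots.pop(0)                # step 2.3: A is exhausted
            continue
        relation, b = found
        ontology.append((a, relation, b))
        known.add(b)
        roots.append(b)                 # step 2.4: B may become a root
    return ontology


# Toy stand-in for rule matches found in a corpus (invented data).
TOY_RULE_HITS = {
    "university": [("category", "national university"),
                   ("attribute", "library")],
    "library": [("attribute", "collection")],
}


def toy_find_related(root, known):
    """Return the first not-yet-known keyword related to root, or None."""
    for relation, keyword in TOY_RULE_HITS.get(root, []):
        if keyword not in known:
            return relation, keyword
    return None


prototype = soat("university", toy_find_related)
```

Passing the extraction rules in as a callback keeps the loop independent of any particular rule set, which matches the point that the rules are refined separately by a human editor.</Paragraph>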
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The Extraction Rules of SOAT
</SectionTitle>
      <Paragraph position="0"> Each extraction rule in Tables 1, 2, 3 and 4 consists of a specific noun as the root and the POS tags of the neighboring words. A rule is a linguistic template for finding keywords related to the root. The target of extraction is usually a word or a compound word that has strong semantic links to the root. Our rules are especially effective in identifying essential compound words for a specific domain.</Paragraph>
      <Paragraph position="1"> We use the POS tags defined by CKIP (CKIP 1993), in which Na is the generic noun, Nb is the proper noun, and Nc is the toponym. Generally, an Na can be a subject or an object in a sentence and includes concrete and abstract nouns, such as &amp;quot;cloth&amp;quot;, &amp;quot;table&amp;quot;, &amp;quot;tax&amp;quot;, and &amp;quot;technology&amp;quot;. An Nc is the name of a place. Readers can refer to (CKIP 1993) for more information about the POS tags. In our experiment, we focus on Na and Nc, because the topics that we are interested in usually fall into these two categories. The extraction rules for finding categorical (taxonomic) relationships from a given Na (or Nc) are in Table 1 (or Table 3). The rules for finding attribute (non-taxonomic) relationships from a given Na (or Nc) are in Table 2 (or Table 4).</Paragraph>
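      <Paragraph position="2"> A rule of the kind described above, a root noun plus the POS tags of its neighbors, can be sketched as a small template matcher. The rule shown here is hypothetical and is not one of the rules in Tables 1 to 4: it treats a noun tagged Na or Nc that immediately precedes the root as forming a compound keyword. The function name, its parameters, and the English glosses standing in for Chinese words are assumptions, and the non-noun tags in the sample sentence are only illustrative.

```python
def match_compound_rule(tagged, root, root_tag="Na",
                        neighbor_tags=("Na", "Nc")):
    """Hypothetical rule: a neighbor noun followed by the root.

    `tagged` is a list of (word, POS tag) pairs. Returns the compound
    keywords formed by the preceding noun and the root.
    """
    hits = []
    # Walk over adjacent token pairs: (previous token, current token).
    for (prev_word, prev_tag), (word, tag) in zip(tagged, tagged[1:]):
        if word == root and tag == root_tag and prev_tag in neighbor_tags:
            hits.append(prev_word + " " + word)
    return hits


# English glosses stand in for the words of a tagged Chinese sentence.
sentence = [("he", "Nh"), ("works", "VC"), ("at", "P"),
            ("commercial", "Na"), ("bank", "Nc")]

compounds = match_compound_rule(sentence, "bank", root_tag="Nc")
```

A matcher like this needs only the tag sequence around the root, so it can pick up compound words that are absent from the dictionary.</Paragraph>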
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Discussion
</SectionTitle>
    <Paragraph position="0"> Li and Thompson (1981) describe Mandarin Chinese as a topic-prominent language in which the subject or the object is not as obvious as in other languages. Therefore, the high-precision shallow parsing results (Munoz et al. 1999) on NN and SV pairs in English are probably not applicable to Chinese.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Experiment of Extraction Rate
</SectionTitle>
      <Paragraph position="0"> To test the qualitative and quantitative performance of SOAT, we design two experiments. We construct three domain ontology prototypes for three different domains and corpora. Table 5 shows the results. The frequently asked questions (FAQs) for stocks are taken from the test sentences of the financial QA system. The university and bank corpora are collected from the CKIP corpus (CKIP 1995).</Paragraph>
      <Paragraph position="1"> We select sentences containing the keyword &amp;quot;University&amp;quot; or &amp;quot;Bank&amp;quot; as the domain corpora. The results in Table 5 show that SOAT can capture related keywords and the relationships among them from a limited number of sentences very efficiently, without using frequency information.</Paragraph>
    </Section>
  </Section>
</Paper>