<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4211">
  <Title>KNOWLEDGE ACQUISITION AND CHINESE PARSING BASED ON CORPUS</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The traditional approaches of natural language parsing are based on rewriting rules. We know that when the number of rules have already increased to a certain level, the performance of parsing will be improved little by increasing the number of rules further. So using corpus-based approach, i.e. extracting linguistic knowledge with t'me grain size from corpus directly to support natural language parsing is more impressive.</Paragraph>
    <Paragraph position="1"> In this paper we will introduce the work on Knowledge acquisition and Chinese parsing based on corpus. Our work includeds:  * Take out a total of 500 sentences from geography text book of middle school to form a small Chinese corpus.</Paragraph>
    <Paragraph position="2"> * Because Dependency Grammar (DG) directly describes the functional relations between words, and s dependency tree has not any non-terminal nodes, DG is suitable for our Corpus-Bused Chinese Parser (CBCP) particularly. We marked the dependency relations of every sentence in our corpus manually.</Paragraph>
    <Paragraph position="3"> * Input the analyzed corpus into the computer and form u matrix f'de for every sentence in the corpus.</Paragraph>
    <Paragraph position="4"> * Extract the knowledge from the matrix f'de and form a knowledge base.</Paragraph>
    <Paragraph position="5"> * Implement the CBCP system for parsing input sentences and assigning dependency trees to them.</Paragraph>
    <Paragraph position="6"> 2. Construction of the knowledge base (I) Thl. project is supported by National Science Fundation of China under grant No. 69073063  Then we run a program to input the dependency relations of every sentence to the computer and form a matrix file as bellow:</Paragraph>
    <Paragraph position="8"> In order to expound the knowledge representation, we give some definitions as below. If there are four words wl, w2, w3 and w4 with dependency relations RI, R2 and R3:</Paragraph>
    <Paragraph position="10"> Then for the word ~w3&amp;quot;, its d-relation Ls R2; its g-relatinn is R1; and its s-relation is R3.</Paragraph>
    <Paragraph position="11"> We extract the knowledge from the matrix file to form a frame as below :</Paragraph>
    <Paragraph position="13"> The slots of the frame are: governor frequency (govfreq): It indicates that wltether the given word can be a governor of a sentence and how many times it has been in our corpus.</Paragraph>
    <Paragraph position="14"> governor list (govlLst): It indicates which word can be the parent node of thc given word, and what is the dependency relation between the word and its parent node. In other words, what is the word's d-relation and how many times it has occurred in the corpus, i.e. govlist :: = \[{ &lt; governor-name &gt; {\[ &lt; d-relation :&gt;, &lt;frcqncncy &gt; \]} * } * \] dependency link list 0inkli~t): The d-relation and g-reintion of the given words can form a pair of relations described as d-relation &lt; .... ~-relatiou. The information on iinklist includes: how many kinds of dependency links the given word have in our corpus? And what are they? how many times it has occurred? what Ls the position of the word's parent node ( to the right or to the left of the word) ia a sentence? i.e.</Paragraph>
    <Paragraph position="15"> AttirEs DE COLING-92, NAtCrEs, 23-28 AOt}I 1992 l 3 0 1 PROC. OF COL,ING-92. NAtCr~S, Ann. 23-28, 1992 llnklist :: = \[{ &lt; d-relatinn &gt; {\[ &lt; g-relation &gt;, &lt; position &gt;, &lt; frequency &gt; \]} * } * I pattern list (patlist): The given word and its s-relations constitute a pattern of the word as: (s-relation1 s--relation2 s-relation3 ...). This pattren information describes the rationality of the syntactic structure in a dependency tree. The patlist knowledge extracted from the corpus includes: how many patterns can the word act in our corpus? What is each pattern? how many times has it occurred? What Ls the position (to the right or left of the word) of the children node in a sentence in our corpus? i.e.</Paragraph>
    <Paragraph position="16"> patlist :: = \[{ \[pattern \[ &lt; frequency &gt;, {\[ &lt; s-relation &gt;, &lt; positinn &gt; \]} * \]\]} * \] (notes: the content inside the &amp;quot;{ } * &amp;quot; can be repeated n times, where n &gt; 1)</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. The parser
</SectionTitle>
    <Paragraph position="0"> In our CBCP system, the knowledge base will first be searched for all the possible linklist information of each word pair, according to the words in the input sentence. We use this information to construct a Specific Matrix of the Sentence (SMS). Sccond, remove impossible links in the SMS, and form a network. Third, we search all the possible depcndcncy trees in the network, using the pruning algorithm. Finally, the solutions will be selected by evaluating the dependency trees. The process of removing and pruning is based on the knowledge base and the four axioms of Dependency Grammar (Robinson, J.J.1970). The four axioms are: I. There is only one independent element (governor) in a sentence.</Paragraph>
    <Paragraph position="1"> \]\]. Other elements must directly depend on one certain clement in the sentence.</Paragraph>
    <Paragraph position="2"> l\[I. There should not be any element which depends on two or more elememts.</Paragraph>
    <Paragraph position="3"> IV. If the element A directly depends on element B, and clement C is located between A and B in a sentence, element C must be either directly dependent on A or B or an element which is between A and B in the sentence.</Paragraph>
    <Paragraph position="4"> According to our Dcpendcncy Grammar practice in Chinese, we populate the fifth axiom as follows: V. There is no direct dependent relation between two elements which one is on the left hand side and the other is on the right hand side of a governor.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Construct a specific matrix of a sentence
</SectionTitle>
      <Paragraph position="0"> Suppose there are k words in a sentence marked as S=(wl w2 w3 ... wi... wk), CBCP searches the linklist information of every word in the sentence. For example, ff one link of wi is ATRA&lt;----OBJ, and the link of wj is OBJ&lt;----GOV (GOVernor) in the knowledge base, CBCP can construct the link between wi and wj as ATRA &lt; ----OBJ. The SMS will be constructed by searching all the links of words in the input sentence.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Remove impossible governors and links
</SectionTitle>
      <Paragraph position="0"> Since an input sentence may form a large number of dependency trees based on the SMS, it is necessary to remove the impossible links before connecting every node to a network. Suppose in a SMS, the word A is dependent on the word B and the link between them is ACIF~ DE COLING-92, N^rzff~s, 23-28 ^o(rr 1992 1 3 0 2 l))~oc. OF COL1NG-92, NANTES, AUG. 23-28, 1992 Ra&lt;--Rb. If there exists a (RI R2 ...Ra...Rk) in B's patlist, the dependent relation of Ra&lt;--Rb is reasonable. Otherwise, the Ra&lt;--Rb relation is impossible, and should be removcdo The CBCP system looks for the govfreq information of each word in an Input sentence. If the govfreq of a word is greater than zero, the word can be a governor. The rules of removing impossible governors arc:  * Ifa word has no parent node in SMS, the word must be the governor (based on axiom ~\[ ). Other words which can also act as a governor must be removed.</Paragraph>
      <Paragraph position="1"> * If a word A has only one rink to word B with the link Ra &lt;--GOV, and the word B can  not he a governor, the word A will not depend on any word in the dependency tree* According to axiom I\] this is impossible, therefore word B must he the governor. Other words which also can act as a governor must be removed.</Paragraph>
      <Paragraph position="2"> * When n word A has only one link to word B with the link Ra &lt;--Rb (Rb &lt; &gt; GOV), and the d-relatinn of the word B is not Rb, the word A will not depend on any words in the dependency tree. According to axiom \]\] this is impossible. So the d-rclatinu of the word B must not be the governor. Then this kind of link in which the word B is used as a governor must he ~movcd. After removing all the impossible governors and links, the SMS of the sentence in Fig-2.i is as follows:</Paragraph>
      <Paragraph position="4"> 3*3 Search the possible integrated tree from the specific tree Let the governor be the root node, connecting nil the nodes in order. If a node have n (n &gt; 1) parent nodes, we can sprit this node to n same nodes. Let these n same nodes depend on the n parent nodes respectirely. Thus Specific Tree (ST) will be constructed. The ST of the sentence in Fig-2.1 Ls as bellow:  m. If there is only one word, whose ~ equala to m in a ST, then m dependency trees may be constructed. If the degree of freedom of the word-i equals to n, the degree of freedom of the word-j equals to m then the n * m dependency trees will be constructed. If there are many words with ~ greater than one, the number of dependency trees being formed will be very large. Therefore, in the process of seaching an integrated dependency tree, the pruning technology must be taken. The pruning technology derives from axiom V.</Paragraph>
      <Paragraph position="5"> After the integrated dependency trees have been produced, we use the numerical evaluation to produce the parsing result \[1\].</Paragraph>
      <Paragraph position="6"> 4. Experimental result and future work When CBCP analyzed Chinese sentences in a closed corpus, it has an approximately 90% success rate (comparing with the result of manual parsing). If each word in a sentence can be found in our corpus and the corresponding dependence relation can also be found in our knowtcdge base, it is also feasible for CBCP to perform syntactic parsing in an open corpus. As our research is advancing, we will enlarge the scale of our corpus and make it work on open corpus more effectively. On the other hand, we have great interests in how to retrieve more information from different aspects. For example, we want to acquire grammatical category information and semantic features for our system or equip complex feature set for each word to support corpus-based as well as rule-based system. We want to add a few rules to our system, in order to replace the frames of the words which frequently appear in our corpus. The frame of such a word is very large, but it is easy to describe its dependency relations by rules. We plan to do further research in this field.</Paragraph>
      <Paragraph position="7"> In addition, our work can be easily expanded to set up a Chinese Collocation Dictionary. It is very difficult to make this kind of dictionary by man power, beacuase it is impossible to seek all the possible collocations of a particular word just by thinking. But it is easy to achieve this with corpus-based approach like our work. The more refined analyzing of the texts in the corpus, the more knowledge can be acquired from the corpus.</Paragraph>
    </Section>
  </Section>
</Paper>