KNOWLEDGE ACQUISITION AND CHINESE PARSING BASED 
ON CORPUS 
Yuan Chunfa, Huang Changning and Pan Shimei
Dept. of Computer Science 
Tsinghua University, Beijing, China 
Fax: 861-256-2768 
ABSTRACT 
In Natural Language Processing (NLP), one key problem is how to design a robust and effective parsing system. In this paper, we introduce a corpus-based Chinese parsing system. Our efforts are concentrated on: (1) knowledge acquisition and representation; and (2) the parsing scheme. The knowledge of this system is principally extracted from an analyzed corpus; the rest consists of a few grammatical principles, i.e. the four axioms of Dependency Grammar (DG). In addition, we propose a fifth axiom of DG to support the parsing of Chinese sentences.
1. Introduction 
The traditional approaches to natural language parsing are based on rewriting rules. It is well known that once the number of rules has grown to a certain level, adding further rules improves parsing performance very little. A corpus-based approach, i.e. extracting fine-grained linguistic knowledge directly from a corpus to support natural language parsing, is therefore more attractive.
In this paper we introduce our work on knowledge acquisition and Chinese parsing based on corpus. Our work includes:
• Take 500 sentences from a middle-school geography textbook to form a small Chinese corpus.
• Mark the dependency relations of every sentence in the corpus manually. Because Dependency Grammar (DG) directly describes the functional relations between words, and a dependency tree has no non-terminal nodes, DG is particularly suitable for our Corpus-Based Chinese Parser (CBCP).
• Input the analyzed corpus into the computer and form a matrix file for every sentence in the corpus.
• Extract the knowledge from the matrix files and form a knowledge base.
• Implement the CBCP system for parsing input sentences and assigning dependency trees to them.
2. Construction of the knowledge base 
(1) This project is supported by the National Science Foundation of China under grant No. 69073063.
At first, we marked the dependency relations of every sentence in our corpus manually. An example of an analyzed sentence is as follows:
Fig 2.1: Dependency analysis of an example sentence. Word-by-word gloss: (each) (river) (of) (middle and low reaches) (mostly) (are) (flatlands); the arcs carry the relations DETA, CDE, ATRA, SUBJ, ADVA and OBJ. English translation: "Most of the middle and low reaches of each river are flatlands."
Here: DETA (DETerminative Adjunct), CDE (Complement of '的 (DE)'), ATRA (ATtRibute Adjunct), SUBJ (SUBJect), ADVA (ADVerbial Adjunct), OBJ (OBJect).
Then we ran a program to input the dependency relations of every sentence into the computer, forming a matrix file as below:
M(0 1)=DETA  M(1 2)=CDE  M(2 3)=ATRA  M(3 5)=SUBJ
M(4 5)=ADVA  M(6 5)=OBJ
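For illustration only, the matrix file of one analyzed sentence could be held in memory as a mapping from (dependent, governor) word positions to relation labels. The dictionary form below is our own assumed encoding, not the file format actually used by the system:

```python
# A minimal sketch (assumed representation, not the original file format):
# each analyzed sentence is stored as a mapping from
# (dependent_position, governor_position) to a dependency relation label.
sentence_matrix = {
    (0, 1): "DETA",   # w0 depends on w1 as a determinative adjunct
    (1, 2): "CDE",    # w1 depends on w2 as complement of 'DE'
    (2, 3): "ATRA",   # w2 depends on w3 as an attribute adjunct
    (3, 5): "SUBJ",   # w3 depends on w5 as subject
    (4, 5): "ADVA",   # w4 depends on w5 as an adverbial adjunct
    (6, 5): "OBJ",    # w6 depends on w5 as object
}
```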
In order to expound the knowledge representation, we give some definitions below. Suppose there are four words w1, w2, w3 and w4 linked by the dependency relations R1, R2 and R3:
Fig 2.2: a dependency chain in which w2 depends on w1 through R1, w3 depends on w2 through R2, and w4 depends on w3 through R3.
Then for the word 'w3', its d-relation is R2; its g-relation is R1; and its s-relation is R3.
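In terms of the assumed matrix encoding above, these three kinds of relations could be read off a sentence as in the following sketch; the function name d_g_s_relations is ours, introduced purely for illustration:

```python
def d_g_s_relations(matrix, w):
    """For word position w: the d-relation labels the link to its parent,
    the g-relation is the parent's own d-relation, and the s-relations
    label the links from w's children up to w."""
    parent = next((gov for (dep, gov) in matrix if dep == w), None)
    d_rel = matrix.get((w, parent)) if parent is not None else None
    grandparent = next((gov for (dep, gov) in matrix if dep == parent), None)
    g_rel = matrix.get((parent, grandparent)) if grandparent is not None else None
    s_rels = [rel for (dep, gov), rel in matrix.items() if gov == w]
    return d_rel, g_rel, s_rels
```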
We extract the knowledge from the matrix file to form a frame as below:
word-name ::= [<govfreq>, <govlist>, <linklist>, <patlist>]
The slots of the frame are:
governor frequency (govfreq): It indicates whether the given word can be the governor of a sentence, and how many times it has been one in our corpus.
governor list (govlist): It indicates which words can be the parent node of the given word, and what the dependency relation between the word and its parent node is; in other words, what the word's d-relation is and how many times it has occurred in the corpus, i.e.
govlist ::= [{<governor-name> {[<d-relation>, <frequency>]}*}*]
dependency link list (linklist): The d-relation and g-relation of the given word form a pair of relations, written as d-relation <---- g-relation. The information in linklist includes: how many kinds of dependency links the given word has in our corpus; what they are; how many times each has occurred; and what the position of the word's parent node is (to the right or to the left of the word) in a sentence, i.e.
linklist ::= [{<d-relation> {[<g-relation>, <position>, <frequency>]}*}*]
pattern list (patlist): The given word and its s-relations constitute a pattern of the word: (s-relation1 s-relation2 s-relation3 ...). This pattern information describes the rationality of a syntactic structure in a dependency tree. The patlist knowledge extracted from the corpus includes: how many patterns the word takes part in in our corpus; what each pattern is; how many times it has occurred; and what the position (to the right or left of the word) of each child node is in a sentence, i.e.
patlist ::= [{[pattern [<frequency>, {[<s-relation>, <position>]}*]]}*]
(Note: the content inside "{ }*" can be repeated n times, where n > 1.)
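As an informal illustration of this frame, a knowledge-base entry for one word could be held in a structure like the following sketch; the Python representation and the way the three lists are indexed are our assumptions, chosen to mirror the slots defined above:

```python
from dataclasses import dataclass, field

@dataclass
class WordFrame:
    """Knowledge-base entry for one word, mirroring the frame
    word-name ::= [<govfreq>, <govlist>, <linklist>, <patlist>]."""
    # how many times the word appeared as the governor of a sentence
    govfreq: int = 0
    # governor-name -> list of (d-relation, frequency)
    govlist: dict = field(default_factory=dict)
    # d-relation -> list of (g-relation, position, frequency); position 'L' or 'R'
    linklist: dict = field(default_factory=dict)
    # pattern (tuple of s-relations) -> (frequency, [(s-relation, position), ...])
    patlist: dict = field(default_factory=dict)
```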
3. The parser 
In our CBCP system, the knowledge base is first searched for all the possible linklist information of each word pair, according to the words in the input sentence. We use this information to construct a Specific Matrix of the Sentence (SMS). Second, impossible links are removed from the SMS to form a network. Third, we search for all the possible dependency trees in the network, using a pruning algorithm. Finally, the solutions are selected by evaluating the dependency trees. The processes of removing and pruning are based on the knowledge base and the four axioms of Dependency Grammar (Robinson, J.J. 1970). The four axioms are:
I. There is only one independent element (the governor) in a sentence.
II. All other elements must directly depend on some element of the sentence.
III. No element may directly depend on two or more elements.
IV. If element A directly depends on element B, and element C is located between A and B in the sentence, then element C must directly depend on A, on B, or on some other element located between A and B.
According to our Dependency Grammar practice with Chinese, we postulate a fifth axiom as follows:
V. There is no direct dependency relation between two elements of which one is on the left-hand side and the other on the right-hand side of the governor.
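To make the role of these axioms concrete, the following sketch checks a candidate set of dependency links against axioms I and V; the dependent-to-governor map and the function name are illustrative assumptions of ours, axioms II and III are enforced by the mapping itself (exactly one head per word), and axiom IV (projectivity) would need a separate check:

```python
def satisfies_axioms(heads):
    """Check a candidate link assignment against the DG axioms.
    'heads' maps each word position to its governor's position, or to
    None for the single independent element."""
    roots = [w for w, h in heads.items() if h is None]
    if len(roots) != 1:            # axiom I: exactly one independent element
        return False
    root = roots[0]
    for w, h in heads.items():
        if h is None or h == root:
            continue
        # axiom V: no link may join words on opposite sides of the governor
        if (w < root) != (h < root):
            return False
    return True
```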
3.1 Construct a specific matrix of a sentence
Suppose there are k words in a sentence, written as S = (w1 w2 w3 ... wi ... wk). CBCP searches the linklist information of every word in the sentence. For example, if one link of wi is ATRA<----OBJ, and one link of wj is OBJ<----GOV (GOVernor) in the knowledge base, CBCP can construct the link between wi and wj as ATRA<----OBJ. The SMS is constructed by searching all the links of the words in the input sentence.
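A minimal sketch of this matching step, under the same assumed data structures as in Section 2, might look as follows; the function name and the decision to ignore the recorded position and frequency information are simplifications of ours:

```python
def build_sms(words, kb):
    """Build the Specific Matrix of the Sentence (SMS): for each word pair
    (wi, wj) record a candidate link d-relation <---- g-relation whenever
    wi's linklist contains a link whose g-relation is a d-relation that wj
    can itself take (or 'GOV', if wj can be the sentence governor)."""
    sms = {}
    for i, wi in enumerate(words):
        for d_rel, entries in kb[wi].linklist.items():
            for g_rel, _position, _freq in entries:
                for j, wj in enumerate(words):
                    if i == j:
                        continue
                    if (g_rel == "GOV" and kb[wj].govfreq > 0) \
                            or g_rel in kb[wj].linklist:
                        # candidate: wi depends on wj with relation d_rel
                        sms.setdefault((i, j), set()).add((d_rel, g_rel))
    return sms
```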
3.2 Remove impossible governors and links 
Since an input sentence may form a large number of dependency trees based on the SMS, it is necessary to remove the impossible links before connecting every node into a network. Suppose that in an SMS the word A depends on the word B and the link between them is Ra<--Rb. If there exists a pattern (R1 R2 ... Ra ... Rk) in B's patlist, the dependency relation Ra<--Rb is reasonable. Otherwise, the Ra<--Rb relation is impossible and should be removed.
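Under the assumed frame structure above, this link test reduces to a membership check against the governor's patlist (an illustrative sketch, not the original implementation):

```python
def link_is_reasonable(d_rel, governor_frame):
    """A link Ra <-- Rb from word A to governor B is kept only if some
    pattern in B's patlist contains Ra among B's s-relations."""
    return any(d_rel in pattern for pattern in governor_frame.patlist)
```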
The CBCP system looks up the govfreq information of each word in an input sentence. If the govfreq of a word is greater than zero, the word can be a governor. The rules for removing impossible governors are:
• If a word has no parent node in the SMS, that word must be the governor (based on axiom II), and every other word that could act as a governor is removed (a sketch of this rule is given after the example SMS below).
• If a word A has only one link, to word B, with the link Ra<--GOV, and word B cannot be a governor, then word A would not depend on any word in the dependency tree. According to axiom II this is impossible; therefore word B must be the governor, and every other word that could act as a governor is removed.
• If a word A has only one link, to word B, with the link Ra<--Rb (Rb <> GOV), and Rb were not the d-relation of word B, then word A would not depend on any word in the dependency tree. According to axiom II this is impossible, so the d-relation of word B must be Rb and word B cannot be the governor; the links in which word B is used as a governor are therefore removed.
After removing all the impossible governors and links, the SMS of the sentence in Fig 2.1 is as follows:
M(0 1) = DETA<--CDE    M(0 5) = ADVA<--GOV    M(1 2) = CDE<--ATRA
M(1 3) = ATRA<--SUBJ   M(1 5) = SUBJ<--GOV    M(2 3) = ATRA<--SUBJ
M(3 5) = SUBJ<--GOV    M(4 5) = ADVA<--GOV    M(6 5) = OBJ<--GOV
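As referenced above, a compressed sketch of the first removal rule could read as follows; the helper name and the set-based bookkeeping are our own illustrative assumptions:

```python
def force_governor(sms, governor_candidates):
    """First removal rule: a word that never occurs as a dependent in the
    SMS cannot depend on anything, so by axiom II it must be the governor;
    all other governor candidates are then discarded."""
    parentless = [w for w in governor_candidates
                  if not any(dep == w for (dep, _gov) in sms)]
    if len(parentless) == 1:
        return {parentless[0]}          # only this word may act as governor
    return set(governor_candidates)     # rule not applicable here
```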
3.3 Search for the possible integrated trees in the specific tree
Let the governor be the root node, and connect all the nodes in order. If a node has n (n > 1) parent nodes, we split this node into n identical nodes and let these n nodes depend on the n parent nodes respectively. In this way a Specific Tree (ST) is constructed. The ST of the sentence in Fig 2.1 is as below:
Fig 3.1: The Specific Tree of the example sentence, rooted at the governor w5; nodes that have more than one candidate parent in the SMS (such as w0 and w1) appear once under each of their candidate parents.
If a node appears m times in the ST, we say that the degree of freedom of this node is m. If there is only one word whose degree of freedom equals m in an ST, then m dependency trees can be constructed. If the degree of freedom of word-i equals n and the degree of freedom of word-j equals m, then n * m dependency trees can be constructed. If there are many words with a degree of freedom greater than one, the number of dependency trees that can be formed becomes very large. Therefore, in the process of searching for an integrated dependency tree, a pruning technique must be adopted. The pruning technique derives from axiom V.
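A compact sketch of this search, under the same illustrative data structures, enumerates one parent choice per word and discards combinations that violate axiom V; the real CBCP prunes during the search, whereas this simplified sketch of ours filters complete assignments and does not verify that the result is a connected tree:

```python
from itertools import product

def candidate_trees(sms, n_words, governor):
    """Enumerate candidate dependency trees from the SMS by choosing one
    parent for every word except the governor, discarding any combination
    in which a dependent and its parent lie on opposite sides of the
    governor (axiom V)."""
    choices = []
    for w in range(n_words):
        if w == governor:
            continue
        parents = [gov for (dep, gov) in sms if dep == w]
        choices.append([(w, p) for p in parents])
    for combo in product(*choices):
        if all(p == governor or (w < governor) == (p < governor)
               for (w, p) in combo):
            yield dict(combo)           # word position -> chosen parent
```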
After the integrated dependency trees have been produced, we use a numerical evaluation to produce the parsing result [1].
4. Experimental results and future work
When CBCP analyzed Chinese sentences from a closed corpus, it achieved an approximately 90% success rate (compared with the result of manual parsing). If each word of a sentence can be found in our corpus and the corresponding dependency relations can also be found in our knowledge base, CBCP can perform syntactic parsing on an open corpus as well.
As our research advances, we will enlarge the scale of our corpus and make the system work more effectively on open corpora. On the other hand, we are greatly interested in how to retrieve more information from different aspects. For example, we want to acquire grammatical category information and semantic features for our system, or to equip each word with a complex feature set, to support corpus-based as well as rule-based systems. We also want to add a few rules to our system in order to replace the frames of the words that appear most frequently in our corpus: the frame of such a word is very large, but its dependency relations are easy to describe by rules. We plan to do further research in this field.
In addition, our work can easily be extended to build a Chinese Collocation Dictionary. It is very difficult to compile this kind of dictionary by hand, because it is impossible to enumerate all the possible collocations of a particular word by introspection alone; with a corpus-based approach like ours, however, it is easy. The more refined the analysis of the texts in the corpus, the more knowledge can be acquired from it.
References 
[1] van Zuijlen, Job M. (1990): "Notes on a Probabilistic Parsing Experiment". BSO/Language Systems, Utrecht, The Netherlands.
[2] van Zuijlen, Job M. (1989): "The Application of Simulated Annealing in Dependency Grammar Parsing". BSO/Language Systems, Utrecht, The Netherlands.
\[3\] \]~l~'-~ (1991) : ((4,'~t\[t~'-~i~SO) , ~{~t~._~.. 
I41 ~,~ (1987): ((~o~, ~±~R~±. 
