A Quasi-Dependency Model for Structural Analysis 
of Chinese BaseNPs* 
Zhao Jun Huang Changning 
Department of Computer Science & Technology, 
The State Key Lab of Intelligent Technology & Systems, 
Tsinghua University, Beijing, China, 100084 
Email: zj@sl000e.cs.tsinghua.edu.cn, Hcn@tsinghua.edu.cn 
Abstract: The paper puts forward a quasi-dependency model for structural analysis of Chinese baseNPs and an MDL-based algorithm for quasi-dependency-strength acquisition. The experiments show that the proposed model is more suitable for Chinese baseNP analysis than the adjacency model, and that the proposed MDL-based algorithm is superior to the traditional ML-based algorithm. The paper also discusses the problem of incorporating linguistic knowledge into the above statistical model.
1. Introduction 
The concept of baseNP was initially put forward by Church. In English, a baseNP is defined as a 'simple non-recursive noun phrase', which means that no sub-noun-phrase is contained in a baseNP [1]. But this definition cannot meet the needs of Chinese information retrieval. Noun phrases such as "natural + language + process", "Asian + finance + crisis" and "political + system + reformation + process" are critical for information retrieval, but they are not non-recursive noun phrases.
In Chinese, the attributes of noun phrases can be classified into three types: restrictive attributes, distinctive attributes and descriptive attributes, among which the restrictive attributes have an agglutinative relation with their heads. Using the restrictive attributes, the paper defines the Chinese baseNP as follows.
[Definition 1] Chinese baseNP (hereafter abbreviated as baseNP):
baseNP → baseNP + baseNP
baseNP → baseNP + N | VN
baseNP → restrictive-attribute + baseNP
baseNP → restrictive-attribute + N | VN
restrictive-attribute → A | B | V | N | S | X | (M+Q)
where the terminal symbols A, B, V, N, VN, S, X, M and Q stand respectively for adjectives, distinctives, verbs, nouns, nominalized verbs, locatives, non-Chinese strings, numerals and quantifiers.
According to the definition, noun phrases fall into baseNPs and non-baseNPs (hereafter abbreviated as ¬baseNP). Table-1 gives some examples.
Table-1: Examples of baseNPs and ¬baseNPs
Type      Examples
baseNP    air + corridor
baseNP    politics + system + reform
baseNP    export + commodity + price + index
¬baseNP   complicated + 的(de) + feature
¬baseNP   research + and + development
¬baseNP   teacher + write + 的(de) + comment
Both baseNP recognition and baseNP structural 
analysis are basic tasks in Chinese information 
retrieval. The paper mainly discusses the problems 
in structural analysis of baseNPs, which is 
essential for generating the compositional 
indexing units from a baseNP. The task of baseNP 
* The research is supported by the key project of the National Natural Science Foundation.
structural analysis is to determine the syntactic 
structure of a baseNP. In this paper, we use 
dichotomy for baseNP analysis. For example, the structure of "natural + language + process" is "(natural language) process". Obviously, a baseNP composed of three or more words has syntactic ambiguities. For example, the baseNP "x y z" has two possible structures, namely "(x y) z" and "x (y z)". The task of baseNP structural analysis is to select the correct structure from the possible structures.
The paper mainly discusses the problems related to Chinese baseNP structural analysis. Section 2 puts forward a quasi-dependency model for structural analysis of Chinese baseNPs. Section 3 gives an unsupervised quasi-dependency-strength estimation algorithm based on the minimum description length (MDL) principle. Section 4 analyzes the performance of the proposed model and algorithm. Section 5 discusses some issues in the implementation of baseNP structural analysis and quasi-dependency-strength estimation. Section 6 is the conclusion.
2. The quasi-dependency model 
There are two kinds of structural analysis models for English noun phrases: the adjacency model and the dependency model. Lauer's research shows that the dependency model is superior to the adjacency model for structural analysis of English noun phrases [2]. However, no model has so far been proposed for structural analysis of Chinese baseNPs.
According to dependency grammar, two constituents can be bound together only if they are determined to be dependent. The determination of the dependency relation between two constituents is composed of two steps. The first step is to determine whether they have the possibility to constitute a dependency relation. The second step is to determine whether they actually have the dependency relation in the given context. The former is called the quasi-dependency-relation, which can be acquired from collocation dictionaries or corpora. The determination of the latter is difficult, because multiple kinds of information in the given context, such as syntactic or semantic information, should be taken into consideration.
[Definition 2] Quasi-dependency-relation: If two words x and y have the possibility to constitute a dependency relation, then we say that they have a quasi-dependency-relation in the given baseNP, formulated as x→y (where y is called the head) or y→x (where x is called the head); otherwise, we say that they have no quasi-dependency-relation, formulated as x↛y and y↛x.
[Assumption 1] In a Chinese baseNP, if two words x and y can constitute a dependency relation, then the head is always the post-position word y, that is x→y.
According to Definition 1, there is no prepositional phrase, verb phrase, locality phrase or 的(de)-structure in a baseNP, so Assumption 1 is reasonable.
On the basis of Assumption 1, we put forward the quasi-dependency model for structural analysis of Chinese baseNPs.
There are the following three kinds of quasi-dependency-patterns for a tri-word baseNP x y z:

s31: x→y, y→z, x↛z;  s32: x→z, y→z, x↛y;  s33: x→y, y→z, x→z

Pattern s31 corresponds to the structure (x y) z, and pattern s32 corresponds to the structure x (y z). For pattern s33, however, the quasi-dependency-strengths must be used to determine the corresponding structure. For example, for the baseNP "politics + system + reform", there are quasi-dependency-relations "politics→system", "politics→reform" and "system→reform". If we know that the quasi-dependency-relations "politics→system" and "system→reform" are stronger than "politics→reform", the structure of the baseNP can be determined to be "(politics system) reform".
In the following, we give the definition of 
quasi-dependency-strength and the formula for 
determining the syntactic structure of baseNPs 
based on the quasi-dependency-strengths. 
[Definition 3] Quasi-dependency-strength: Given a baseNP set NP = {np_1, np_2, …, np_M} and a lexicon W = {w_1, …, w_N}, for any w_i, w_j ∈ W, the quasi-dependency-strength of w_i→w_j is defined as:

ds(w_i → w_j) = Σ_{np_k ∈ NP} dep(w_i → w_j, np_k) / Σ_{np_k ∈ NP} co(w_i, w_j, np_k)

where dep(w_i → w_j, np_k) is the count of the dependent word pair (w_i, w_j) contained in np_k, and co(w_i, w_j, np_k) is the count of the co-occurring word pair (w_i, w_j) contained in np_k.
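Definition 3 can be read as the fraction of co-occurrences of a word pair that are actually dependent. A minimal sketch of that computation (the observations below are illustrative counts, not the paper's data):

```python
from collections import Counter

def quasi_dependency_strength(pairs):
    """Estimate ds(wi -> wj) from (wi, wj, is_dependent) observations,
    one observation per co-occurrence of wi and wj in some baseNP."""
    dep = Counter()   # co-occurrences judged dependent
    co = Counter()    # all co-occurrences
    for wi, wj, is_dependent in pairs:
        co[(wi, wj)] += 1
        if is_dependent:
            dep[(wi, wj)] += 1
    # ds = dependent count / co-occurrence count, per ordered pair
    return {p: dep[p] / co[p] for p in co}

# Hypothetical observations:
obs = [
    ("politics", "system", True),
    ("politics", "system", True),
    ("system", "reform", True),
    ("politics", "reform", True),
    ("politics", "reform", False),
]
ds = quasi_dependency_strength(obs)
```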
The formula for determining the syntactic structure of a baseNP based on the quasi-dependency-strengths is as follows:

belief(s_j | np_i) = Σ_{(u→v) ∈ D(np_i, s_j)} ds(u → v) / Σ_{s_l} Σ_{(u→v) ∈ D(np_i, s_l)} ds(u → v)

where belief(s_j | np_i) represents the belief that the structure of np_i is s_j, D(np_i, s_j) represents the set of quasi-dependency-relations included in the quasi-dependency-pattern corresponding to structure s_j, and the denominator sums over all candidate structures s_l of np_i.
A tri-word baseNP has two possible syntactic structures, s31 and s32. Similarly, a four-word baseNP w x y z has the following five possible structures:

s41 = ((w x) y) z;  s42 = (w x)(y z);  s43 = (w (x y)) z;  s44 = w ((x y) z);  s45 = w (x (y z))
In summary, we can compute the belief that the structure of np_i is s_j using the correspondence between the quasi-dependency-patterns and the baseNP structures. The acquisition of the quasi-dependency-strengths between words is the critical problem.
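For a tri-word baseNP, the belief computation reduces to comparing the strength sums over D(np, s31) = {x→y, y→z} and D(np, s32) = {x→z, y→z}. A sketch under assumed strength values (the numbers are hypothetical, not the paper's estimates):

```python
def analyze_triword(x, y, z, ds):
    """Choose between (x y) z and x (y z) by the normalized sum of
    quasi-dependency-strengths over each pattern's relation set."""
    s31 = ds.get((x, y), 0.0) + ds.get((y, z), 0.0)   # D(np, s31) = {x->y, y->z}
    s32 = ds.get((x, z), 0.0) + ds.get((y, z), 0.0)   # D(np, s32) = {x->z, y->z}
    total = s31 + s32
    belief31 = s31 / total if total else 0.5
    return ("(x y) z", belief31) if belief31 >= 0.5 else ("x (y z)", 1 - belief31)

# Assumed strengths for the "doctor dissertation outline" example:
ds = {("doctor", "dissertation"): 0.9,
      ("dissertation", "outline"): 0.8,
      ("doctor", "outline"): 0.1}
structure, belief = analyze_triword("doctor", "dissertation", "outline", ds)
```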
3. The acquisition of quasi- 
dependency-strength between words 
If we had a large-scale annotated baseNP corpus in which the baseNPs had been assigned their syntactic structures, the quasi-dependency-strengths between words could be acquired through simple counting. However, such an annotated corpus is not available; we only have a baseNP corpus which carries no structural information. How to acquire the quasi-dependency-strengths from such a corpus is the main task of this section.
Given a baseNP set NP = {np_1, np_2, …, np_M} and a lexicon W = {w_1, w_2, …, w_N}, the problem can be described as learning a quasi-dependency-strength set G (abbreviated as the model) from the training set, where G = { ds(w_i → w_j) | w_i, w_j ∈ W }.
Zhai Chengxiang put forward an unsupervised algorithm for acquiring quasi-dependency-strengths from a noun phrase set [3]. The algorithm is derived from the EM algorithm. Because the algorithm is based on the maximum likelihood (ML) principle, it usually leads to overfitting between the data and the model [4]. For example, given the simple baseNP set NP = {politics + system + reform, economics + system + reform, politics + system + revolution, economics + system + revolution}, there are sixteen possible models for the training set, among which G4, G7, G10 and G13 have the best fit to NP, that is, Num(NP|G) = 6. However, from the linguistic point of view, G1 is the correct model, though it has a lower fit to NP, that is, Num(NP|G) = 4 (see the appendix).
3.1 The estimation of the quasi-dependency- 
strength under Bayesian framework 
In the Bayesian framework, the task of acquiring the quasi-dependency-strengths can be described as the problem of selecting the G which has the highest posterior probability p(G | NP):

G = argmax_G p(G | NP)

According to Bayes' theorem, we have the following inference:

G = argmax_G p(NP | G) p(G) / p(NP)
  = argmax_G p(NP | G) p(G)
Besides using the conditional probability p(NP | G) to measure the fit between the training set and the model G, Bayesian modeling gives additional consideration to the generality of the model through the prior probability p(G); that is, a simpler model has a higher prior probability. The central idea of Bayesian modeling is to find a compromise between the goodness of fit and the simplicity of the model.
3.2 Defining the evaluation function of Bayesian 
modeling using MDL principle 
The difficulty in Bayesian modeling is the estimation of the prior probability p(G). According to coding theory, the lower bound of the coding length (in bits) of a message with probability p is log2(1/p) [5]. This theorem connects Bayesian modeling with the MDL principle in coding theory.
G = argmax_G p(NP | G) p(G)
  = argmin_G { −log2 [ p(NP | G) p(G) ] }
  = argmin_G { log2 (1 / p(NP | G)) + log2 (1 / p(G)) }
  = argmin_G { L(NP | G) + L(G) }

where L(a) is the optimal coding length of information a. In particular, L(NP | G) is called the data description length and L(G) is called the model description length.
Therefore, the problem of estimating the prior probability p(G) and the conditional probability p(NP | G) is converted into the problem of estimating the model description length L(G) and the data description length L(NP | G).
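The conversion can be made concrete: minimizing the two code lengths is the same as maximizing p(NP|G)p(G). A one-line sketch of the correspondence:

```python
import math

def code_length(p):
    """Optimal coding length, in bits, of an event with probability p."""
    return -math.log2(p)

def total_description_length(p_data_given_model, p_model):
    """MDL objective L(NP|G) + L(G) for a candidate model, expressed
    through the probabilities p(NP|G) and p(G)."""
    return code_length(p_data_given_model) + code_length(p_model)
```

For instance, a model with p(NP|G) = 0.25 and p(G) = 0.5 costs 2 + 1 = 3 bits in total.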
3.3 The MDL-based quasi-dependency-strength estimation algorithm
Under the MDL principle, the modeling problem can be viewed as the problem of finding a model G which has the smallest sum of the data description length and the model description length. Because the search space is huge, we cannot find the optimal model by exhaustive search; the model must instead be improved in an iterative manner in order to arrive at a minimum description length.
In this research, the model is composed of the quasi-dependency-strengths ds(w_i → w_j), where each ds(w_i → w_j) can be decomposed into two parts: ① the structure part: the quasi-dependency-relation (w_i → w_j); ② the parameter part: the quasi-dependency-strength ds. Therefore, the learning process is divided into two steps: ① keeping the structure part fixed, optimize the parameter part; ② keeping the parameter part fixed, optimize the structure part. The two steps go on alternately until the process arrives at a convergence point.
Algorithm 1: The MDL-based algorithm for quasi-dependency-strength estimation
① Initialize model G;
② Let L = L(NP | G) + L(G) and G = (G_s, G_p), where G_s and G_p represent respectively the structure part and the parameter part. Execute the following two steps alternately until L converges:
• Keeping G_s fixed, optimize G_p until L(NP | G) converges, that is, L converges;
• Keeping G_p fixed, optimize G_s until L(G) converges, that is, L converges.
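The control flow of Algorithm 1 can be sketched with the two optimization steps as pluggable callbacks. The description-length functions and target strengths below are toy stand-ins (the paper's parameter step is EM-based, which is not reproduced here):

```python
def mdl_estimate(G, data_len, model_len, optimize_params, prune_structure,
                 tol=1e-6, max_iter=50):
    """Alternate parameter optimization (structure fixed) and structure
    pruning (parameters fixed) until L(NP|G) + L(G) stops decreasing."""
    L = data_len(G) + model_len(G)
    for _ in range(max_iter):
        G = optimize_params(G)   # step 1: G_s fixed, optimize G_p
        G = prune_structure(G)   # step 2: G_p fixed, optimize G_s
        new_L = data_len(G) + model_len(G)
        if L - new_L < tol:      # converged: total length no longer drops
            break
        L = new_L
    return G

# Toy stand-ins: quadratic data cost toward target strengths,
# model cost = number of surviving quasi-dependency-relations.
targets = {"a": 0.9, "b": 0.0, "c": 0.8}
data_len = lambda G: sum((t - G.get(k, 0.0)) ** 2 for k, t in targets.items())
model_len = lambda G: len(G)
opt = lambda G: {k: targets[k] for k in G}              # "EM" stand-in
prune = lambda G: {k: v for k, v in G.items() if v >= 0.1}

G = mdl_estimate({"a": 0.5, "b": 0.5, "c": 0.5},
                 data_len, model_len, opt, prune)
```

In this toy run the near-zero relation "b" is pruned, shortening the model description at no data cost.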
On condition that the structure part of the model is fixed, parameter optimization means finding the optimal set of quasi-dependency-strengths such that the data description length is minimized, that is:

G = argmin_G L(NP | G)

where L(NP | G) is the optimal coding length of NP when G is known.
The parameter optimization step can be implemented using the EM algorithm [3]. In the process of parameter optimization, the structure part of the model is kept fixed, and the optimal estimates of the parameters are obtained through the gradual reduction of the data description length.
Under the MDL principle, the model description length can be gradually reduced through modification of the structure part of the model, and therefore the overall description length of the model is reduced.
Algorithm 2: The structure optimization algorithm
Let the model after the parameter optimization process be G, which is composed of the quasi-dependency-strengths ds(w_i → w_j).
① Sort the quasi-dependency-strengths of model G in ascending order: ds[1], ds[2], ds[3], …; let i = 1.
② Repeat the following steps:
• Delete the quasi-dependency-strength ds[i] from model G;
• Construct the new model G';
• If [L(NP | G') + L(G')] − [L(NP | G) + L(G)] <= Th_L (where Th_L is a selected threshold), then let G = G', i = i + 1 and continue the next cycle; otherwise, the cycle ends.
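Algorithm 2's greedy pruning pass can be sketched as follows; a deletion is kept while the total description length rises by at most the threshold, and the loop stops at the first failure. The description-length functions are toy stand-ins, not the paper's coding scheme:

```python
def prune_weakest(G, data_len, model_len, th=0.0):
    """Try deleting quasi-dependency-strengths in ascending order
    ds[1], ds[2], ...; keep a deletion only if L(NP|G) + L(G) rises
    by at most th, and stop at the first rejected deletion."""
    for key in sorted(G, key=G.get):                 # weakest strength first
        candidate = {k: v for k, v in G.items() if k != key}
        delta = (data_len(candidate) + model_len(candidate)) \
              - (data_len(G) + model_len(G))
        if delta <= th:
            G = candidate                            # accept G = G'
        else:
            break                                    # cycle ends
    return G

# Toy stand-ins (hypothetical): dropping the near-zero strength "b"
# shortens the model without hurting the data description length.
targets = {"a": 0.9, "b": 0.0, "c": 0.8}
data_len = lambda G: 10 * sum((t - G.get(k, 0.0)) ** 2 for k, t in targets.items())
model_len = lambda G: len(G)

G = prune_weakest({"a": 0.9, "b": 0.05, "c": 0.8}, data_len, model_len)
```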
4. The performance analysis 
This section takes N2+N2+N2-type baseNPs (where N2 represents a bi-syllable noun) as the testing data in order to discuss the performance of the quasi-dependency model for structural analysis of baseNPs and of the MDL-based algorithm for quasi-dependency-strength acquisition. The training set includes 7,500 N2+N2+N2-type baseNPs. The close testing set consists of 500 baseNPs included in the training set; the open testing set consists of 500 baseNPs outside the training set. The testing target is the precision of baseNP structural analysis, that is:

precision = a / b × 100%

where a is the count of the baseNPs which are correctly analyzed and b is the count of the baseNPs in the testing set.
4.1 The performance of the quasi-dependency 
model 
The experiments show: ① among the N2+N2+N2-type baseNPs, the left-binding structure is about twice as frequent as the right-binding structure; ② the analysis precision of the quasi-dependency model is about 7% higher than that of the adjacency model. This conclusion can be explained intuitively through the following example. The structure of the baseNP "doctor + dissertation + outline" cannot be correctly determined by the adjacency model, because we cannot establish that the dependency strength of "doctor + dissertation" is stronger than that of "dissertation + outline". On the other hand, the structure of the above baseNP can be determined to be "(doctor dissertation) outline" by the quasi-dependency model, because both "doctor + dissertation" and "dissertation + outline" are dependent word pairs, while "doctor + outline" is an independent word pair. Table-2 shows the testing results.
Table-2: The analysis precision of N2+N2+N2-type baseNPs
Testing type Right-binding Left-binding Adjacency model Quasi-dependency model 
Close test 31.5% 68.5% 84.6% 91.5% 
Open test 32.7% 67.3% 81.5% 88.7% 
4.2 The performance of the MDL-based 
algorithm for quasi-dependency-strength 
acquisition 
The ML algorithm is equivalent to the first parameter optimization pass of the MDL algorithm. The MDL process is composed of two iterative optimization steps. In the iterative process, the parameters are optimized gradually and the model is simplified gradually as well. Therefore, the overfitting problem inherent in the ML algorithm is solved to a great extent. In the following, the performance of the ML algorithm and the MDL algorithm is compared through the baseNP analysis precision of the models constructed using the two algorithms. The precision is listed in Table-3. The experiments show that the MDL algorithm is superior to the ML algorithm.
Table-3: The baseNP analysis precision of the ML algorithm and the MDL algorithm
Testing type    ML algorithm    MDL algorithm
Close test      89.0%           91.5%
Open test       82.5%           88.7%
5. Implementation issues 
The most difficult problem related to the structural analysis of baseNPs is the acquisition of the quasi-dependency-strengths. The proposed algorithm (Algorithm 1) is an unsupervised algorithm; that is, the parameters are estimated over a baseNP corpus which has no structural information. In order to improve the estimation results and speed up the iteration process, some measures are taken during the implementation.
5.1 The pre-assignment of the baseNP structure 
The structures of some baseNPs can be determined using linguistic knowledge. Such knowledge includes:
① In a baseNP, a word pair with one of the following syntactic compositions is independent:
• Noun + Adjective: for example, "ground(Noun) + complicated(Adjective) + condition" and "glass(Noun) + curved(Adjective) + pipe";
• Noun + Distinctive: for example, "elementary-school(Noun) + of-the-right-age(Distinctive) + child";
• Distinctive + Verb: for example, "large(Distinctive) + fight(Verb) + plane" and "elementary(Distinctive) + creep(Verb) + animal".
② If two verbs co-occur in a baseNP, then they are dependent. For example, "(prospect(Verb) design(Verb)) group" and "(Anti-Japanese(Verb) save-the-nation(Verb)) campaign".
If we preprocess the baseNP corpus using the above knowledge, it is beneficial for the estimation process.
5.2 The complex-feature-based modeling 
If the lexicon size is |W|, then the parameter number of the above word-based acquisition algorithm amounts to |W|². The enormous parameter space leads to the data sparseness problem during estimation. Therefore, the paper puts forward a complex-feature-based acquisition algorithm. First, map each word to a complex-feature-set according to the multiple features of the word; then, acquire the quasi-dependency-strengths between the complex-feature-sets. When analyzing the structure of a baseNP, the strength between the complex-feature-sets is used instead of that between the words. In this research, the multiple features include part-of-speech, number of syllables and word sense category.
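The back-off from word pairs to complex-feature-set pairs can be sketched as follows; the lexicon entries and sense-category labels are hypothetical, not the paper's feature inventory:

```python
def complex_feature(word, lexicon):
    """Map a word to its complex-feature-set:
    (part-of-speech, syllable count, sense category)."""
    return lexicon[word]

def pair_strength(wi, wj, lexicon, ds_feat):
    """Quasi-dependency-strength looked up between complex-feature-sets
    rather than between the words themselves."""
    return ds_feat.get((complex_feature(wi, lexicon),
                        complex_feature(wj, lexicon)), 0.0)

# Hypothetical lexicon entries: (POS, number of syllables, sense category)
lexicon = {
    "politics": ("N", 2, "ABSTRACT"),
    "system":   ("N", 2, "ABSTRACT"),
    "reform":   ("VN", 2, "ACT"),
}
# Hypothetical strengths between feature-set pairs
ds_feat = {
    (("N", 2, "ABSTRACT"), ("N", 2, "ABSTRACT")): 0.7,
    (("N", 2, "ABSTRACT"), ("VN", 2, "ACT")): 0.6,
}
s = pair_strength("politics", "system", lexicon, ds_feat)
```

Every word sharing a feature set shares its parameters, which collapses the |W|² word-pair space to the much smaller space of feature-set pairs.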
6. Conclusions 
The paper has put forward a quasi-dependency model for structural analysis of Chinese baseNPs, and an MDL-based algorithm for quasi-dependency-strength acquisition. The experiments show that the proposed model is more suitable for Chinese baseNP analysis than the adjacency model, and that the proposed MDL-based algorithm is superior to the traditional ML-based algorithm. Further research will focus on incorporating more linguistic knowledge into the above statistical model.

References 
[1] Church K. A stochastic parts program and noun phrase parser for unrestricted text. In: Proceedings of the Second Conference on Applied Natural Language Processing, 1988.
[2] Lauer M. Conceptual association for compound noun analysis. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Student Session, Las Cruces, NM, 1994.
[3] Zhai Chengxiang. Fast statistical parsing of noun phrases for document indexing. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997, 311-318.
[4] Stolcke A. Bayesian learning of probabilistic language models. Ph.D. dissertation, University of California, Berkeley, 1994.
[5] Solomonoff R. The mechanization of linguistic learning. In: Proceedings of the 2nd International Conference on Cybernetics.
