Statistics Based Hybrid Approach to Chinese Base Phrase Identification 
Tie-jun ZHAO, Mu-yun YANG~ Fang LIU, Jian-min YAO, Hao YU 
Department of Computer Science and Engineering, Harbin Institute of Technology 
{tjzaho, )may, flu.fang, james, yu} @mtlab.hit.edu.en 
ABSTRACT 
This paper extends the base noun 
phrase(BNP) identification into a research 
on Chinese base phrase identification. ARer 
briefly introducing some basic concepts on 
Chinese base phrase, this paper presents a 
statistics based hybrid model for identifying 
7 types of Chinese base phrases in view. 
Experiments show the efficiency of the 
proposed method in simplifying sentence 
structure. Significance of the research hes in 
it provides a solid foundation for the 
Chinese parser. 
Keywords: Chinese base phrase 
identification, parsing, statistical model 
1 Introduction 
Decomposing syntactic analysis into 
several phases so as to decrease its difficulty 
is a new stream in NIP research. The 
successful POS tagging has encouraged 
researchers to explore further possibility for 
resolving sub-problems in parsing(Zhou, et 
al, 1999). The typical examples are the 
recognition of BaseNP in English and 
Chinese. 
In English BNP (base noun phrase) is 
defined as simple and non-nesting noun 
phrases, i.e. noun phrases that do not contain 
other noun phrase descendants (Church, 
1988). After that researches on BNP 
identification reports promising results for 
such task in English. Observing that the 
Chinese BNP is different form English, 
(Zhao & Huang, 1999) puts forward the 
definition of Chinese BNP in terms of 
combination of determinative modifier and 
head noun. According to them a BNP in 
Chinese can be recursively defined as: 
BaseNP ::= Determinative modifier + 
Noun I Nominalized verb(NIO 
Determinative modifier ::= Adjective I 
Differentiable Adjective(DA) I Verb I Noun I 
Location I String l Numeral + Classifier 
Inspired by these researches, we extend 
the concept of BNP to Base Phrase in 
Chinese. It is based on such knowledge that 
there are many structures, not only NP, in 
which the trivial components closely attach 
to their central words and constitute a basic 
phrase in a Chinese sentence. Obviously, 
resolving all these base phrases will greatly 
benefit Chinese parser by reliving it from 
some pre-processing (though non-trivial) 
and enable it focus on the most subtle 
syntactic structures. 
Since the whole system of Chinese base 
phrase is still under discussing, this paper 
just presents some tentative research 
achievements on statistics based hybrid 
model to Chinese base phrase identification. 
For the 7 types we considered at present, our 
algorithm turns out promising results and 
smoothes the way for a better Chinese 
parser. 
2 Statistics Based Hybrid Approach to 
Chinese Base Phrase Identification 
2.1 Concepts and Defmitions 
In addition to BNP, constituents of 
many local structure in Chinese centers 
around a core word with certain fixed POS 
sequences. Therefore their identification is 
slightly different from parsing in that it 
bears relatively simple phenomenon. Like 
BNP identification, identification of these 
phenomena before parsing will provide a 
simpler sequence for parser, and thus 
deserves a separate research. 
CutTenfly, we are considering 7 Chinese 
base phrases in our research, namely base 
adjective phrase(BADJP), base adverbial 
phrase (BADVP), base noun phrase (BNP), 
73 
base temporal phrase (BTN), base location 
phrase (BNS), base verb phrase (BVP) and 
base quantity phrase (BMP) Though 
theoretically definitions for these base 
phrases are still unavailable, Appendix I lists 
the preliminary illustrations for them in 
BNF format (necessary account for POS 
annotation can also be found).. 
To frame the identification of Chinese 
base phrases, we fm'ther develop the 
following concepts: 
Definition 1: Chinese based phrases are 
recognized as atomic parts of a sentence 
beyond words that posses certain functions 
and meanings. A base phrase may consist of 
words or other base phrases, but its 
constituents, in turn, should not contain any 
base phrases. 
Definition 2: Base phrase tag is the 
token representing the syntactic function of 
the phrase. At present, base tag either falls in 
one of the 7 Chinese base phrases we are 
considering or not: 
Phrase-Tag ::= BADJP I BADVP I BNP I 
Br r l Bm I BrP I BMP I lVULL 
Definition 3: Boundary tag denotes the 
possible relative position of a word to a base 
phrase. A boundary tag for a gfven word is 
either L( left boundary of a base phrase), 
R( right boundary of a ), I(inside a base 
phrase) or O(outside the base phrase). 
2.2 Duple Based HMM Parser 
Based on above definitions, we could, 
in view of Wojciech's proposal \[Wojeieeh and 
Thorsten, 1998\], interpret the parsing of 
Chinese base phrases as the following: 
Suppose the input as a sequence of POS 
annotations T= (to, ....... t,,). The task is to 
find RC, a most possible sequence of duples 
formed by base phrase tags and boundary 
tags, among the POS sequence T. 
RC = (<ro, co > ........ <rn, Cn>), 
in whil~h ri (l <i< =n )indicates the boundary 
tags, ci represents the base phrase tags. 
To go along with the POS tagger 
developed previously by us, we first think of 
preserving HMM (hidden Markov Model) 
for parsing Chinese base phrases. Thus the 
following formula is usually•at hand: 
RC = arg max p(RC I T) 
= arg max p(RC)* p(TIRC) 
p(T) 
For a given sequence of T, this formula 
can be transformed into: 
RC = arg max p(RC IT) 
= arg max p(RC)*p(T \[RC) 
Essentially this model could be 
established through bigram or tri-gram 
statistical training by a annotated corpus. In 
practice, we just build our model from 
l O, O00 manual annotated sentences with 
common bi-gram training: 
p(RC p(RC ,IRC ,_,) 
i=1 
p(T I RC ) = 1FI p(Ti I RC i) 
i=l 
In realization, a Viterbi algorithm is 
adopted to search the best path. An open test 
on additional 1000 sentences is performed to 
check its accuracy. Results are shown in 
Tablel(note precision is calculated by 
word'k 
Precision 
for R 
Close 85.7% 
Test 
Open 82.4% Test 
Precision Precision 
for Both for .C 
RandC 
87.5% 79.0% 
85.1% 74.7% 
Table 1. Results for Duple Based HMM 
2.3 Triple Based MM Exploiting 
Linguistic Information 
Although results shown in Table 1 is 
encouraging enough for research purposes, 
it is still lies a long way for practical 
Chinese parser we are aiming at. Reasons 
for errors may be account by too 
coarse-grained information provided by RC. 
Observing the fact that the Chinese base 
phrase occurs more frequently with some 
fixed patterns, i.e. some frozen POS chains, 
we decide to improved our previous model 
by emphasizing the contribution given by 
POS information. 
Adding t denoting POS in the duple (r, 
74 
c), we develop a triple in the form of (t,r,e) 
for the calculation of a node. Naturally, the 
new model is changed into a MM (Markov 
model) as: 
TRC = arg max p(TRC ) 
= arg max I~ p(TRC i I TRC i - 1) 
To train this model, we still using a 
bi-gram model. Applying the same corpus 
and tests described above, we got the 
performance of triple based MM identifier 
for Chinese base phrases (see Table 2). 
Precision Precision Precision 
for R ~rC 
89.2% 91.5% 84.6% 
88.4% 89.9% 83% 
Close 
Open 
for Both 
R and C 
Table 2. Result for Triple Based MM 
2.4 Further Improvement Through TBED 
Learning 
Like other statistical models, the above 
model, whether duple based or triple based, 
both seem to reach an accuracy ceiling after 
enlarging training set to 12, 000 or so. To 
cover the remaining accuracy, we apply the 
transformation-based error driven (TBED) 
learning strategy described in \[Brill, 1992\] 
to acquired desired rules. 
In our module, some initial rules are 
first designed as compensation of statistical 
model. Applying these rules will cause new 
mistakes as well as make correct 
identifications. Then the module will 
compare the processed texts with training 
sentences, generate new rules according to 
pre-defmed actions and update its rule bank 
after evaluation (see Fig 1.). 
I I Compare and Rules Passing 
Generate New Rules ----!~ Evaluation I Tt 
TextTraining \] TextPr°eessed \] 
Identifier 
'T 
Input Text 
Figure 1. TBED Learning Module 
The dotted line in fig 2. will stop 
functioning if pre-set accuracy is reached by 
the identifier for the Chinese base phrase. 
Evaluation of new rules is based on an 
greedy algorithm: only rule with max 
contribution (max correction and rain error) 
will be added. Design of rule generation 
(pre-defined actions) is similar to those 
described in \[Brill, 1992\]. 
Table 3 shows a significant 
improvement after applying rules obtained 
through TBED learner. It is also the final 
performance of the proposed Chinese base 
phrase identification model. 
Precision Precision Precision 
for Both for R for C 
Rand C 
91.2% 92.8% 89% 
90.4% 91.1% 87.1% 
Close 
open 
Table 3. Results after TBED Module 
3 Conclusions and Discussions 
We have accomplished preliminary 
expedments on identification of various 
types of base phrases defined in this paper. 
The data shown in last seetion prove that our 
method generates satisfactory results for 
75 
Chinese base phrase identification. The 
overall process of our method is outlined the 
following figure. 
Input Chinese Sentences 
after Sengmentation and 
POS tagging 
~ Converted into Nodes 
to Be Parsed 
Triple Based Bi-gram 
MM with Viterbi 
Algorithm 
TBED Based 
Correction 
T" ' Output 
\ 
Fig 2. Processing of Chinese Based Phrase Identification 
However, the 7 types Chinese base 
phrases we have proposed are far l~om 
perfection. Even what we have proposed for 
the 7 phrases is still under test. Further 
improvement will focus on two aspects: one 
is to discuss and add new base phrase for a 
broader coverage; the other is to define, 
theoretically or empirically, the Chinese 
base phrases with more strict constraints. Of 
course, new techniques to improved the 
accuracy of statistical model are the constant 
aim of our research. 
To sum up, Chinese base phrase 
identification will reduce complexity of a 
Chinese parser. The successful idemifieation 
of the 7 base phrases clearly simplifies the 
structure of the sentence. We expect that the 
research described in this paper will lay a 
solid foundation for a high-accuracy 
Chinese parser. 

Reference 
\[Church, 1988\] K. Church, A stochastic 
parts program and noun phrase parser for 
unrestricted text, In: Proc. of Second 
Conference on Applied Natural Language 
Processing, 1988 
\[Wojciech and Thorsten, 1998\] Wojciech Skut and 
Thorsten Brants, Chunk Tagger, Statistical 
Recongnition of Noun Phrases, In ESSLLI-98 
Workshop on Automated Acquisition of Syntax 
and Parsing, Saarbrvcken, 1998. 
\[Zhao & Huang, 1999\] Zhao Jun and Huang 
Chang-Ning, The model for Chinese baseNP 
structure analysis, Chinese J. Computer,
22(2): pp141-146 
\[Zhou, et al, 1999\] Zhou Qiang, Sun 
Mao-Song, Huang Chang-Ning, Chunk 
parsing scheme for Chinese sentences, 
Chinese J. Computer, 22(11): pp1159-1165 
