AN EFFICIENT SYNTACTIC TAGGING TOOL FOR CORPORA (1)
Ming Zhou, Changning Huang
Dept. of Computer Science, Tsinghua University,
Beijing, 100084, China.
ABSTRACT
A tree bank is an important resource for MT and linguistics
research, but it requires that a large number of sentences be
annotated with syntactic information. Manual annotation is time
consuming and troublesome, and it is difficult to keep consistent.
In this paper, we present a new technique for the semi-automatic
tagging of Chinese text. The system takes Chinese text as input and
outputs syntactically tagged sentences (dependency trees). We use
dependency grammar and employ a stack-based shift/reduce
context-dependent parser as the tagging mechanism. The system works
in a human-machine cooperative way, in which the machine acquires
tagging rules from human intervention. The automation level improves
step by step as rules accumulate during annotation. In addition,
good consistency of tagging is guaranteed.
KEYWORDS: syntactic tagging, tree bank 
1. INTRODUCTION 
In recent years, corpora, either monolingual or bilingual, have
played an important role in MT and linguistics research (Komatsu,
Jin & Yasuhara, 1993; Sato, 1993; Isabelle & Dymetman, 1993). This
is because a corpus with a large amount of running text is
considered an ideal resource of linguistic knowledge. However, to
acquire knowledge from a corpus (Watanabe, 1993; Mitamura, Nyberg &
Carbonell, 1993), or to use its sentences effectively as examples,
as in the example-based approach (Nagao, 1984; Furuse & Iida, 1992),
the corpus has to be annotated with certain information, which may
be morphological, syntactic or semantic.
Take Chinese monolingual corpora, for instance. A raw corpus, i.e.
text which has not been segmented into word strings, can only be
used for statistics on Chinese characters. If you want to work out
the frequency of words, the corpus has to be segmented into word
strings, i.e., it has to be annotated with word boundary
information. Furthermore, if you want to obtain the co-occurrence
frequency of each pair of adjacent parts of speech, which is helpful
to the study of part-of-speech (POS) tagging, you must annotate the
corpus with POS information. And if you want to extract syntactic
knowledge from the corpus, the corpus must be annotated with
syntactic information such as dependency relations and phrase
structures. Such a corpus is called a tree bank, and it is used as a
resource for knowledge acquisition and as a source of examples in
EBMT research.
There are usually five levels of annotation for a corpus, in order
of increasing depth: word boundary tagging, POS tagging, sense
tagging, syntactic relation tagging and semantic relation tagging.
To improve tagging automation and keep good consistency, a mechanism
is required at each level of tagging to acquire knowledge from human
intervention and from the annotated corpus. The knowledge acquired
should be fed back to the tagging model to improve tagging
automation and correctness.
Our group has been doing research on Chinese corpora for many years,
and has done successful experiments on word boundary tagging, POS
tagging (Bai & Xia, 1992) and sense tagging (Tong, Huang & Guo,
1993). Syntactic relation tagging, however, has not been resolved
well, for two reasons. First, there is no clear answer as to which
grammar formalism, such as phrase structure grammar, dependency
grammar or any other grammar, is suitable for syntactic tagging of
large-scale running text. Second, how can human labor in tagging be
saved while good consistency is kept?
(1) Supported by National Foundation of Natural Science of China.
For the first question, some researchers adopt phrase structure
grammar (PSG) as the tagging formalism (Leech & Garside, 1991), and
some adopt dependency grammar (DG) (Komatsu, Jin & Yasuhara, 1993).
In comparison with PSG, the authors think, DG has some advantages.
First, it is economical and convenient to use DG for syntactic
relation tagging of a corpus because there is no non-terminal node
in a DG parse tree. Second, DG stresses relations among individual
words, so the acquisition of collocation knowledge and of syntactic
relations among words is straightforward. Third, there is a
relatively direct mapping between the dependency tree and case
representation.
Based on the above discussion, the authors chose dependency grammar
as the syntactic formalism for corpora, and defined 44 kinds of
dependency relation for Chinese (Zhou & Huang, 1993).
For the second question, we must develop an efficient tagging tool,
for which we need to take account of two factors: (1) the power to
acquire tagging knowledge from human intervention, in order to
improve the automation level; (2) the ability to keep good
consistency.
Simmons & Yu (1992) introduced context-dependent grammar for English
parsing, in which the context-dependent rules are acquired through
an interactive mechanism. Phrase structure analysis and case
analysis were conducted by a stack-based shift/reduce parser, with a
success ratio as high as 99%. Inspired by their work, we designed a
dependency relation tagging tool for Chinese corpora, called CSTT.
CSTT adopts context-dependent grammar as well, and it can learn the
human annotator's tagging knowledge. In the initial stage, tagging
is mainly done by the human; the system records the human's
operations and forms tagging rules. When a sufficient number of
rules has accumulated, the system can help the human to tag, for
example by offering the annotation operations which the human
performed before in the same context, or even by doing some
annotation itself in some cases. The level of annotation automation
thus gets higher and higher, and good consistency is guaranteed. It
should be mentioned that since PSG non-terminal symbols are used in
the shift/reduce tagging process, CSTT can produce PSG versions of
syntactically tagged sentences as well. In addition, the two
versions of a tree can be mapped into each other given a set of
transfer rules.
A small corpus of 1300 sentences about daily life was used for the
experiment, with an average length of 20 Chinese characters per
sentence. For the first 300 sentences, 1455 rules were obtained, and
for the whole corpus a total of 6521 rules were obtained. The
tagging automation improved as the number of rules increased, and
the automatic tagging ratio was above 50% after 1200 sentences had
been tagged.
2. DESIGN OF CSTT
2.1 The context-dependent shift/reduce tagging mechanism
Context-dependent tagging proceeds as follows. When a sentence is
input (the input string is its sequence of parts of speech), we look
up the rule base with the top two elements of the stack to see
whether there are rules that match the current context. If there are
none, a human operation is required to decide whether to reduce or
shift. In the case of a reduce, the human further decides what
phrase structure will be constructed, and what dependency relation
will be built between the top two elements. The system records the
current context and the operation to form a new rule, and puts it
into the rule base.
Formally, a context-dependent rule is represented as:
    $\alpha\, x\, y\, \beta \rightarrow s$                  (Shift)
    $\alpha\, x\, y\, \beta \rightarrow (z, \gamma, h)$     (Reduce)
where $x$, $y$ are the top two elements in the stack, and $\alpha$,
$\beta$ are the context on the left-hand side of $x$ and the context
on the right-hand side of $y$ respectively. The context is
represented as a sequence of parts of speech. There are two actions
on the right-hand side of a rule: the shift action, denoted $s$, and
the reduce action, denoted $(z, \gamma, h)$. For a reduce action,
$z$ denotes the phrase structure after reduction, $\gamma$ denotes
the dependency relation between $x$ and $y$, and $h$ denotes which
element is the head of the phrase structure and of the dependency
relation: $h$ = 'A' means that the top element is the head, and
$h$ = 'B' means that the second element from the top of the stack is
the head. Now let us see the tagging process for a simple sentence:
    R VY R USDE A NG
(where R: pronoun, VY: the verb 是, USDE: 的, A: adjective, NG:
general noun.)
Table 1 The context-dependent shift/reduce tagging process
<Stack> * <Input string>                          Action   Phrase      Dependency
                                                           structure   relation
- - - - - * <R> <VY> <R> <USDE> <A> <NG> <。>      shift
- - - - <R> * <VY> <R> <USDE> <A> <NG> <。>        shift
- - - <R> <VY> * <R> <USDE> <A> <NG> <。>          reduce   SV          SUB
- - - - <SV> * <R> <USDE> <A> <NG> <。>            shift
- - - <SV> <R> * <USDE> <A> <NG> <。>              shift
- - <SV> <R> <USDE> * <A> <NG> <。>                reduce   DE          DEP
- - - <SV> <DE> * <A> <NG> <。>                    shift
- - <SV> <DE> <A> * <NG> <。>                      shift
- <SV> <DE> <A> <NG> * <。>                        reduce   NP          ATTA
- - <SV> <DE> <NP> * <。>                          reduce   NP          ATTA
- - - <SV> <NP> * <。>                             reduce   SS          OBJ
- - - - <SS> * <。>                                shift
- - - <SS> <。> *                                  reduce   SP          MARK
- - - - <SP> *                                    pop                  GOV
(where SV: subject-verb phrase, DE: "的" structure, NP: noun phrase,
SS: sub-sentence, SP: sentence; SUB: subject, DEP: "的" structure,
ATTA: modifier, OBJ: object, MARK: punctuation mark, GOV: the
predicate of the sentence.)
A dependency relation is represented as a triple of the form
<modifier, head, dependency relation>. The tagging result is
represented as a set of triples. For the sentence above, writing
each word by its part of speech (R1 and R2 are the first and second
pronoun), the result is: {<R1, VY, SUB>, <VY, nil, GOV>,
<R2, USDE, DEP>, <USDE, NG, ATTA>, <A, NG, ATTA>, <NG, VY, OBJ>}.
At each step, we can obtain a rule by recording the content of the
stack and of the input string, together with the operation (shift or
reduce) given by the user. If the operation is a reduction, the
phrase structure and dependency relation are decided by the user.
Here are two rules obtained (the * separates the stack from the
input string):
    - - - <R> <VY> * <R> <USDE> <A> <NG> <。> → (SV, SUB, A)
    - - <SV> <R> <USDE> * <A> <NG> <。> → s
After the reduction, the phrase structure formed replaces the top
two elements in the stack, and its head represents the phrase in
later processing. Since sentences vary in length, we use a
fixed-size context: the three elements to the left of the top two
elements in the stack, and the first five elements of the input
string.
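As a rough illustration of how such a rule could be stored, the sketch below builds the ten-element context (three stack elements, the top two, five input elements) and prints a rule in the notation used above. The names `Rule`, `make_rule` and `show`, the `-` blank marker and the `*` separator are assumptions made for this example; they are not taken from CSTT itself.

```python
from typing import NamedTuple, Tuple, Union

BLANK = "-"                                   # assumed marker for an empty slot
Action = Union[str, Tuple[str, str, str]]     # "s" or (z, relation, head)


class Rule(NamedTuple):
    left: Tuple[str, ...]     # three stack elements left of the top two
    pair: Tuple[str, str]     # top two stack elements, written left to right
    ahead: Tuple[str, ...]    # first five elements of the input string
    action: Action


def make_rule(stack, inp, action):
    """Record the current configuration and the chosen action as a rule.

    `stack` is a list of tags with the top of the stack at the end;
    `inp` is the list of tags not yet shifted.
    """
    window = ([BLANK] * 5 + list(stack))[-5:]      # pad on the left
    ahead = (list(inp) + [BLANK] * 5)[:5]          # pad on the right
    return Rule(tuple(window[:3]), (window[3], window[4]), tuple(ahead), action)


def show(rule):
    """Render a rule as 'stack * input -> action'."""
    rhs = "s" if rule.action == "s" else "(" + ", ".join(rule.action) + ")"
    lhs = " ".join(rule.left + rule.pair) + " * " + " ".join(rule.ahead)
    return lhs + " -> " + rhs


# The first reduce step of Table 1: stack R VY, input R USDE A NG 。
rule = make_rule(["R", "VY"], ["R", "USDE", "A", "NG", "。"], ("SV", "SUB", "A"))
print(show(rule))
```

Keeping the three parts of the left-hand side separate makes it easy to treat the top two elements differently from the surrounding context when rules are later matched.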
2.2 The tagging algorithm
The input is the sequence of parts of speech of a sentence, and the
output is the dependency tree, denoted as a set of triples of the
form (modifier, head, dependency relation); as a by-product,
context-dependent rules are acquired. Obviously, the phrase
structure tree can be worked out as well by modifying the algorithm
(not detailed in this paper).
Let CDG be the context-dependent rule base acquired so far; CDG is
empty when the system is first put into use. NUMBER-OF-ACTION
records the total number of actions (either shift or reduce) during
tagging, and NUMBER-OF-AUTOMATION is the number of actions proposed
by the system itself which are confirmed to be right by the human.
The automatic tagging ratio is therefore defined as
NUMBER-OF-AUTOMATION / NUMBER-OF-ACTION.
At present the system works under supervision: human intervention is
applied at each step, either to confirm the action given by the
system or to supply a new action. Ideally, the tagging process
should be nearly fully automatic, with minimal human intervention,
but this is a long-term process. We believe that as the size of the
tagged corpus increases, the automatic tagging ratio will improve,
and when it is high enough, human intervention may be removed, or
may be needed only when no rule is matched.
Table 2 The supervised tagging algorithm
BEGIN
  STACK = EMPTY
  NUMBER-OF-AUTOMATION = 0
  NUMBER-OF-ACTION = 0
  DO UNTIL (INPUT = EMPTY AND STACK = EMPTY)
    CONTEXT = APPEND(TOP-FIVE(STACK), FIRST-FIVE(INPUT))  /* get the context */
    RULE-LIST = CONSULT-TO-CDG(CONTEXT)                   /* match with CDG */
    RULE = CONSULT-TO-HUMAN(RULE-LIST)                    /* human intervention */
    IF (RULE = FIRST(RULE-LIST))              /* the default operation is right */
      NUMBER-OF-AUTOMATION++
    NUMBER-OF-ACTION++
    IF RHS(RULE) = 'S'
      STACK = PUSH(FIRST(INPUT), STACK)
    ELSE
    {
      LET (Z, γ, h) BE RHS OF THE RULE
      LET X = FIRST(STACK), Y = SECOND(STACK)
      BUILD A PHRASE STRUCTURE Z FROM X AND Y
      STACK = PUSH(Z, POP(POP(STACK)))
      /* the phrase structure replaces the top two elements of the stack */
      IF h = 'A'
        BUILD-DEPENDENCY-RELATION(HEAD(Y), HEAD(X), γ)
        /* build the dependency triple */
      ELSE
        IF h = 'B'
          BUILD-DEPENDENCY-RELATION(HEAD(X), HEAD(Y), γ)
          /* build the dependency triple */
    }
    IF (INPUT = EMPTY AND NUMBER(STACK) = 1) STACK = POP(STACK)
  ENDDO
END
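The loop of Table 2 can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the CSTT implementation: rule lookup uses exact context matching instead of weighted matching, the human is replaced by a scripted callback, a GOV triple is emitted at the final pop (as in the last row of Table 1), and words are stood in for by their tags. The head choices after the first reduce are derived from the triple set given for the example sentence, and the head for MARK is a guess.

```python
from collections import Counter

BLANK = "-"  # assumed padding symbol


def get_context(stack, inp):
    """TOP-FIVE(STACK) + FIRST-FIVE(INPUT) as POS tags, padded with blanks."""
    s = [cat for cat, _ in stack[-5:]]
    a = [cat for cat, _ in inp[:5]]
    return tuple([BLANK] * (5 - len(s)) + s + a + [BLANK] * (5 - len(a)))


def tag_sentence(tokens, cdg, ask_human):
    """tokens: list of (pos, word); cdg: dict mapping context -> Counter of actions.

    ask_human(context, suggestions) returns "S" or a (z, relation, h) tuple.
    Returns (triples, number_of_action, number_of_automation).
    """
    stack, inp, triples = [], list(tokens), []
    n_action = n_auto = 0
    while inp or stack:
        ctx = get_context(stack, inp)
        # Suggestions sorted by descending usage frequency.
        suggestions = [a for a, _ in cdg.get(ctx, Counter()).most_common()]
        action = ask_human(ctx, suggestions)
        if suggestions and action == suggestions[0]:
            n_auto += 1                      # the default operation was right
        n_action += 1
        cdg.setdefault(ctx, Counter())[action] += 1   # acquire / reinforce rule
        if action == "S":
            stack.append(inp.pop(0))         # shift
        else:
            z, rel, h = action
            second, top = stack[-2], stack[-1]
            head, mod = (top, second) if h == "A" else (second, top)
            triples.append((mod[1], head[1], rel))
            stack[-2:] = [(z, head[1])]      # phrase replaces the top two
        if not inp and len(stack) == 1:
            triples.append((stack.pop()[1], None, "GOV"))
    return triples, n_action, n_auto


def scripted(actions):
    """Stand-in for CONSULT-TO-HUMAN: replays a fixed list of decisions."""
    it = iter(actions)
    return lambda ctx, suggestions: next(it)


TOKENS = [("R", "R1"), ("VY", "VY"), ("R", "R2"), ("USDE", "USDE"),
          ("A", "A"), ("NG", "NG"), ("。", "。")]
SCRIPT = ["S", "S", ("SV", "SUB", "A"), "S", "S", ("DE", "DEP", "A"), "S", "S",
          ("NP", "ATTA", "A"), ("NP", "ATTA", "A"), ("SS", "OBJ", "B"),
          "S", ("SP", "MARK", "B")]

cdg = {}
first_pass = tag_sentence(TOKENS, cdg, scripted(SCRIPT))
second_pass = tag_sentence(TOKENS, cdg, scripted(SCRIPT))
print(first_pass[1:], second_pass[1:])
```

On the first pass the rule base is empty, so no action counts as automatic; replaying the same sentence afterwards finds every decision already recorded, which is the effect the automatic tagging ratio is meant to measure.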
The functions TOP-FIVE and FIRST-FIVE return the first five elements
of the stack and of the input string respectively. If there are
fewer than five elements in the stack or in the input string, the
remaining positions are filled with blanks. APPEND merges two lists
to obtain the current context. CONSULT-TO-CDG looks up the rule base
and returns a list of rules matching the current context. The list
is empty when no rule is matched. If the list is not empty, the
rules are sorted in descending order of their usage frequency. If
human intervention is omitted (this may become possible when the
automatic tagging ratio is high enough), the system takes the action
given by the rule of the highest frequency.
CONSULT-TO-HUMAN returns exactly one rule after human inspection. In
this interactive process, the human is asked to determine what
action should be taken. He first inspects the rule list to see
whether it already contains a rule that is correct for the current
context. If not, he tells the system whether to "shift" or "reduce";
in the case of "reduce", he is asked to tell the system what phrase
structure and what dependency relation are to be built, and which
element, the top element of the stack or the second, is the head. A
new rule is acquired whenever the human performs an operation
different from the existing rules, by recording the current context
and the operation.
NUMBER-OF-AUTOMATION records the number of times that the rule with
the highest frequency coincides with the human's decision, that is,
the number of times the highest-frequency rule would have been right
had the system worked automatically. NUMBER-OF-ACTION records the
total number of operations (shift or reduce) during tagging. The
function HEAD returns the head word of a phrase. The function PUSH
pushes an element onto the stack, and POP pops the top element off
the stack. FIRST and SECOND return the first and second element of a
list respectively.
In the matching process, the weighted matching approach (Simmons &
Yu, 1992) is used. Assume the set of CDG rules is
$R = \{R_1, R_2, \ldots, R_m\}$, where the left-hand side of each
rule is $R_i = \{r_{i1}, r_{i2}, \ldots, r_{i10}\}$, and assume the
current context is $C = \{c_1, c_2, \ldots, c_{10}\}$, where $c_4$
and $c_5$ are the top two elements in the stack. We set up a match
function:
    $\mu(c_j, r_{ij}) = 1$, if $c_j = r_{ij}$,
    $\mu(c_j, r_{ij}) = 0$, if $c_j \neq r_{ij}$.
The score function is
    $\mathrm{SCORE} = \sum_{j=1}^{3} \mu(c_j, r_{ij})\, j + \sum_{j=6}^{10} \mu(c_j, r_{ij})\,(11 - j)$
A rule is preferred if and only if SCORE is at least a threshold
$\xi$ set in advance. $\xi = 21$ means full matching, since
$(1+2+3) + (5+4+3+2+1) = 21$. At the beginning of tagging, full
matching is recommended in order to reduce conflicts. After a
certain period of tagging, we may set the threshold lower than 21 to
overcome the shortage of rules in some cases. The CDG base is
controlled dynamically so as to keep matching efficient: a rule is
removed from the CDG base if it is seldom used.
3. EXPERIMENT AND ANALYSIS
3.1 The experiment
A small corpus of 1300 sentences about daily life was prepared for
the experiment, with an average length of 20 Chinese characters per
sentence. The corpus covers the main classes of Chinese simple
declarative sentences. The experiment was conducted in the following
steps:
(1) input a sentence;
(2) word segmentation;
(3) part of speech tagging.
The tagging model is a bi-gram model (Bai & Xia, 1991), and its
accuracy is about 94%, so human confirmation is needed.
(4) tagging the dependency relation by CSTT.
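The weighted matching that CSTT applies when consulting the rule base can be illustrated with a short Python sketch. It assumes a ten-element context in which positions 1 to 3 are the left stack context (weights 1, 2, 3), positions 4 and 5 are the top two stack elements, and positions 6 to 10 are the input context (weights 5, 4, 3, 2, 1), so that a full match scores 21. Requiring the top two elements to be identical, treating the threshold as inclusive, and the function names `score` and `consult` are assumptions of this sketch.

```python
def score(context, lhs):
    """Weighted match of a 10-element context against a rule's left-hand side.

    Returns None when the top two stack elements (positions 4 and 5) differ,
    since such a rule is not considered a candidate in this sketch.
    """
    context, lhs = tuple(context), tuple(lhs)
    if context[3:5] != lhs[3:5]:
        return None
    total = 0
    for j in range(1, 4):        # left stack context, weights 1, 2, 3
        total += j * (context[j - 1] == lhs[j - 1])
    for j in range(6, 11):       # input context, weights 5, 4, 3, 2, 1
        total += (11 - j) * (context[j - 1] == lhs[j - 1])
    return total


def consult(context, rules, threshold=21):
    """rules: list of (lhs, action, frequency).

    Returns the actions of all rules whose score reaches the threshold,
    most frequently used first.
    """
    hits = []
    for lhs, action, freq in rules:
        s = score(context, lhs)
        if s is not None and s >= threshold:
            hits.append((freq, action))
    hits.sort(key=lambda item: -item[0])     # stable sort on frequency only
    return [action for _, action in hits]


ctx = ("-", "-", "-", "R", "VY", "R", "USDE", "A", "NG", "。")
near = ctx[:9] + ("X",)                      # differs only in the last position
rules = [(ctx, "S", 1), (near, ("SV", "SUB", "A"), 4)]
print(score(ctx, ctx), score(ctx, near))
print(consult(ctx, rules), consult(ctx, rules, threshold=20))
```

Elements next to the top of the stack and the front of the input carry the largest weights, so lowering the threshold first tolerates mismatches far from the point where the action is taken.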
As shown in Table 3, 1455 rules were obtained from the first 300
sentences, and a total of 6521 rules were obtained in the whole
experiment. The more sentences are tagged, the higher the automatic
tagging ratio tends to be. After 1200 sentences had been tagged, the
ratio of automatic operations was above 50%.
Table 3 The experiment result
Sentence                1-300   400   500   600   700   800   900  1000  1100  1200  1300
No. of rules acquired    1455   447   384   455   486   628   565   572   564   483   492
No. of operations        2072   768   776   792   851   834   846   837  1153  1164  1111
No. of auto operations    487   291   336   281   317   121   237   210   572   641   580
Automatic ratio (%)      23.5  37.8  43.3   35.              30.0  25.1  49.6  55.1  52.2
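As a worked check of how the automatic ratio relates to the two operation counts, the following Python snippet recomputes NUMBER-OF-AUTOMATION / NUMBER-OF-ACTION for four columns of Table 3 whose figures are fully legible. The dictionary layout and the function name `automatic_ratio` are illustrative only.

```python
def automatic_ratio(n_auto, n_action):
    """Automatic tagging ratio in percent."""
    return 100.0 * n_auto / n_action


# column label -> (number of operations, number of automatic operations)
table3 = {
    "1-300": (2072, 487),
    "1100": (1153, 572),
    "1200": (1164, 641),
    "1300": (1111, 580),
}

ratios = {label: round(automatic_ratio(auto, total), 1)
          for label, (total, auto) in table3.items()}
print(ratios)
```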
3.2 Discussion 
(1) The rule conflict 
Although this system has some power for disambiguation thanks to the
context-dependent rules, some ambiguities are difficult to resolve.
A conflict occurs whenever such an ambiguity is encountered. For
example, the sequence VG A NG may be {(A, VG, COMPLEMENT),
(NG, VG, OBJ)} or {(A, NG, ATTA), (NG, VG, OBJ)}, and the sequence
NG1 NG2 may be {(NG2, NG1, COORDINATE)} or {(NG1, NG2, ATTA)}, as
the following two pairs of phrases demonstrate:
(i)   VG      A       NG
      treat   well    relation    (A, VG, COMPLEMENT)
      form    good    habit       (A, NG, ATTA)
(ii)  NG      NG
      plane   gun                 (NG, NG, COORDINATE)
      wood    table               (NG, NG, ATTA)
There are two kinds of ambiguity: context-dependent ambiguity and
context-independent ambiguity. CSTT can resolve some ambiguities of
the former kind. For example, a phrase of the form VG NG1 USDE NG2
(word by word: kill, hunter, 的, dog) is ambiguous. It may be
{(VG, nil, GOV), (NG1, USDE, DEP), (USDE, NG2, ATTA),
(NG2, VG, OBJ)}, which means "killed the hunter's dog", or
{(VG, USDE, DEP), (NG1, VG, OBJ), (USDE, NG2, ATTA),
(NG2, nil, GOV)}, which means "the dog which killed the hunter".
However, if the context is considered, the ambiguity may be
resolved:
    VG NG USDE NG VG Y
    M Q VG NG USDE NG
Unfortunately, CSTT cannot resolve ambiguities of the latter kind,
and human intervention is necessary.
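One way to see where context-independent ambiguity shows up in a rule base is to look for contexts that have been recorded with more than one action. The Python sketch below does this for the NG NG example; the function name `find_conflicts`, the phrase label NP for both readings and the sample contexts are assumptions made for illustration.

```python
from collections import defaultdict


def find_conflicts(rules):
    """rules: iterable of (context, action) pairs.

    Returns a dict of the contexts that occur with more than one action.
    """
    actions_by_context = defaultdict(set)
    for context, action in rules:
        actions_by_context[context].add(action)
    return {c: a for c, a in actions_by_context.items() if len(a) > 1}


# Two nouns on top of the stack, followed by the full stop.
ng_ng = ("-", "-", "-", "NG", "NG", "。", "-", "-", "-", "-")
a_ng = ("-", "-", "-", "A", "NG", "。", "-", "-", "-", "-")

rules = [
    (ng_ng, ("NP", "COORDINATE", "B")),   # "plane gun": first noun is the head
    (ng_ng, ("NP", "ATTA", "A")),         # "wood table": second noun is the head
    (a_ng, ("NP", "ATTA", "A")),
]

conflicts = find_conflicts(rules)
print(len(conflicts))
```

Because both readings share an identical context, no amount of additional context of this kind separates them, which is why the decision falls back to the human annotator.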
(2) The convergence of the CDG rules
According to the analysis of Simmons & Yu (1992), 25,000 CDG rules
will be sufficient to cover 99% of the phenomena of common English
sentences. In this sense, the CDG rule set is convergent. If our
only aim is syntactic tagging, the convergence issue can be set
aside for the time being: once the automatic ratio rises above 80%,
we can stop acquisition, since at that point the tagging tool can
already provide a lot of help to users. Of course, with some further
effective work on CSTT, it may be developed into an efficient
dependency parser as well.
4. CONCLUDING REMARK 
In this paper, we argued that dependency grammar is a suitable
formalism for syntactic tagging and presented a new technique for
developing a syntactic tagging tool for large corpora, in which a
simple shift/reduce mechanism is employed and context-dependent
rules are accumulated during tagging. The supervised tagging
algorithm was described. The experiment shows that the automatic
tagging ratio rises as the number of tagged sentences increases, and
that good consistency is kept. This idea may be helpful for POS
tagging and case tagging of corpora as well.
We hope that the automatic tagging ratio will rise above 80% in the
future as the rule base is enlarged, so that the tool can be used in
practice for syntactic tagging of running text.
REFERENCES 
Bai, Shuan-hu, Yin Xia (1992). A Scheme for Tagging Chinese Running
Text. Proc. of NLPRS, p25-26, 1991, Singapore.
Furuse, O., H. Iida (1992). An example-based method for
transfer-driven machine translation. Proc. 4th TMI-92, Montreal,
1992.
Isabelle, Pierre, Marc Dymetman et al. (1993). Translation analysis
and translation automation. Proc. of TMI-93, p201-217.
Komatsu, Eiji, Cui Jin, and Hiroshi Yasuhara (1993). A mono-lingual
corpus-based machine translation of the interlingua method. Proc. of
TMI-93, p24-46.
Leech, Geoffrey and Roger Garside (1991). Running a grammar factory:
the production of syntactically analyzed corpora or "tree banks".
In: English Computer Corpora, p15-32, Mouton de Gruyter, 1991.
Mitamura, Teruko, Eric H. Nyberg, 3rd and Jaime G. Carbonell (1993).
Automated corpus analysis and the acquisition of large,
multi-lingual knowledge bases for MT. Proc. of TMI-93, p292-301,
Kyoto, Japan, July 1993.
Nagao, M. (1984). A framework of a mechanical translation between
Japanese and English by analogy example. In: A. Elithorn, R. Banerji
(Eds.), Artificial and Human Intelligence, Elsevier: Amsterdam.
Sato, Satoshi (1993). Example-based translation of technical terms.
Proc. of TMI-93, p58-68.
Simmons, Robert F., Yeong-Ho Yu (1992). The Acquisition and Use of
Context-Dependent Grammars for English. Computational Linguistics,
Vol. 18, No. 4, 1992.
Tong, Xiang, Changning Huang, and Chengming Guo (1993).
Example-Based Sense Tagging of Running Chinese Text. Proc. of the
Workshop on Very Large Corpus, Academic and Industrial Perspectives,
p102-112, Columbus, Ohio, USA, June 22, 1993.
Watanabe, Hideo (1993). A method for extracting translation patterns
from translation examples. Proc. of TMI-93, p292-301, Kyoto, Japan,
July 1993.
Zhou, Ming, and Changning Huang (1993). Viewing the Dependency
parsing as a statistically based tagging process. Proc. NLPRS'93,
Japan, Dec. 6-7, 1993.
