AN EMPIRICAL STUDY ON THE
GENERATION OF ZERO ANAPHORS IN CHINESE
Ching-Long Yeh* and Chris Mellish†
Department of Artificial Intelligence 
University of Edinburgh 
80 South Bridge 
Edinburgh EH1 1HN 
Britain 
Abstract 
In this paper, we describe the creation of 
rules for generating Chinese zero anaphors 
through a sequence of experiments in a 
stepwise enhanced manner. In the exper- 
iments, we basically examined the occur- 
rence of zero anaphors in a real text and the 
ones generated by the algorithms employing 
the rules, assuming the same semantic and 
discourse structures as the text. The fac- 
tors of locality, syntactic constraints, dis- 
course structure and salience of objects 
were considered in the rules. The results of 
the experiment show that 93% of the zero 
anaphors in the text can be correctly gener- 
ated by an algorithm using a rule involving 
all the above factors. 
1 Introduction 
Anaphoric expressions in Chinese can be classified as
zero, pronominal and nominal forms, as exemplified in
(1) by φ_i, ta_i (he) and nage ren_j (that person),
respectively.
(1) a. Zhangsan_i jinghuang de paokai,
       Zhangsan frightened NOM to run-away
       Zhangsan was frightened and ran away.
    b. φ_i zhuangdau yige dahan_j.
       (he) bump-to a big-man
       (He) ran into a big man.
    c. ta_i kanqing le na ren_j de zhangxiang,
       he see-clear ASP that man GEN appearance
       He saw that man's appearance clearly.
    d. φ_i renchu na ren_j shi shei.
       (he) recognize that man is who
       (He) recognized who that man was.
In their paper \[Li and Thompson 79\], Li and
Thompson have shown that zero anaphors in Chinese
can occur in any grammatical slot with an antecedent
that may occur in any grammatical slot, regardless of
the distance between them. Although there is no clear
rule to account for zero anaphora, as pointed out by
Li and Thompson, zero anaphora commonly occurs in
the situation of a "topic chain," where a referent is
referred to in the first clause, and then several more
clauses follow talking about the same referent but
with it omitted. In \[Chen 87\], Chen proposed the
notion of "continuity" of referent in discourse to
give a more specific account of zero anaphora.

*Also with the Department of Information Engineering
at Tatung Institute of Technology, Taipei, Taiwan. Email
address is chingyeh@aisb.ed.ac.uk.
†Email address is chrism@aisb.ed.ac.uk.
In this paper, we aim at deciding when to generate
zero anaphors from some internal semantic structure.
Although no clear rules are stated in previous
linguistic work, we can nevertheless summarize a very
simple rule, Rule 1 as shown below, for the generation
of zero anaphors.
Rule 1: If an entity, e, in the current utterance
was referred to in the immediately preceding
utterance, then a zero anaphor is used for e;
otherwise a non-zero anaphor is used.
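Rule 1 can be sketched in a few lines of Python. This is
only an illustrative sketch, not the procedure used in the
experiment (which was carried out by hand): the function
name rule1_forms and the representation of a paragraph as
a list of utterances, each a list of entity identifiers,
are assumptions made for the example.

```python
from typing import List, Set


def rule1_forms(utterances: List[List[str]]) -> List[List[str]]:
    """Label every mention "Z" (zero) or "NZ" (non-zero) following Rule 1.

    Each utterance is a list of entity identifiers (a hypothetical encoding).
    """
    forms: List[List[str]] = []
    previous: Set[str] = set()  # entities mentioned in the preceding utterance
    for mentions in utterances:
        forms.append(["Z" if e in previous else "NZ" for e in mentions])
        previous = set(mentions)
    return forms


# Entities of example (1): Zhangsan in (1a-d), the big man in (1b-d).
example = [
    ["zhangsan"],
    ["zhangsan", "big_man"],
    ["zhangsan", "big_man"],
    ["zhangsan", "big_man"],
]
print(rule1_forms(example))
# [['NZ'], ['Z', 'NZ'], ['Z', 'Z'], ['Z', 'Z']]
```

Applied to example (1), the sketch zeroes both mentions in
(1c) and (1d), whereas the real text uses ta and na ren
there; this is the kind of over-generation that Rule 1
produces.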
We performed an experiment by comparing the zero
anaphors generated by the algorithm employing this
rule and those occurring in real text to see how well
it works. The initial result showed that zero anaphors
were over-generated to a large extent in the text
produced by employing Rule 1. Consequently, we
considered other well-known factors, namely syntactic
constraints \[Li and Thompson 81\], discourse structure
\[Grosz and Sidner 86\] and the salience of objects in
utterances \[Sidner 83\], to get better results.
2 Experiment 
A number of articles written by different authors were
selected as the linguistic sources with which the text
produced by employing the generation algorithms can
be compared. For the moment, the selected articles are
restricted to the exposition type, namely, ones which
explain an idea or discuss a problem. Two sets of
data were selected; one consists of a number of
scientific questions and answers for children and the
other is a brief introduction to modern Chinese grammar.
The experiment was executed in three steps.
First, zero anaphors within the selected articles were
identified. Second, for each paragraph in the selected
articles, we examined each utterance sequentially and
recorded the occurrence of zero anaphors that would
be obtained by applying the algorithm using a rule,
like Rule 1. Third, we noted down the differences
between the results of steps 1 and 2.
In step 3, we categorized the outcomes of the
comparison into correct, false and missing types. If a
reference created by the algorithm is the same as the
one in the real text, then it belongs to the correct
type. If a zero anaphor is created by the algorithm,
while the corresponding position in the real text is a
non-zero anaphor, then it belongs to the false type.
Conversely, if a zero anaphor is found in some position
in the real text, while a non-zero anaphor is created
by the algorithm, then it belongs to the missing type.
The task of step 3 is to count the number of cases in
each type.
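The three-way classification of step 3 amounts to a simple
tally. The following Python sketch shows one way to
compute it; the function name tally and the encoding of
each anaphor as a pair (form in the real text, form chosen
by the algorithm) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count correct, false and missing cases.

    Each pair is (form in real text, form chosen by algorithm),
    where a form is "Z" (zero) or "NZ" (non-zero).
    """
    counts = Counter(correct=0, false=0, missing=0)
    for real, generated in pairs:
        if real == generated:
            counts["correct"] += 1
        elif generated == "Z":
            counts["false"] += 1    # zero generated, text has non-zero
        else:
            counts["missing"] += 1  # text has zero, non-zero generated
    return counts


print(tally([("Z", "Z"), ("NZ", "Z"), ("Z", "NZ"), ("NZ", "NZ")]))
# Counter({'correct': 2, 'false': 1, 'missing': 1})
```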
3 Results 
Having done this, we carried out similar experiments
with enhanced rules.
3.1 Effect of using Rule 1 and adding
syntactic constraints
In Sets 1 and 2 of the testing data, there are 651 and
149 anaphors, respectively. The result of using the
algorithm of Rule 1 on the data is shown in Table 1.
In the data, 7 and 1 long-distance zero anaphors
occur but the algorithm decides to use non-zero ones
for the corresponding positions. Consequently, they
belong to the missing type. From the result shown in
Table 1, the performance of the algorithm is obviously
unpromising.
There are certain syntactic constraints on zero
anaphora, regardless of discourse factors, as shown in
\[Li and Thompson 79, Li and Thompson 81\]. Therefore,
we enhanced Rule 1 by adding the above syntactic
constraints on zero anaphora, which becomes Rule 1a
as below. Rule 1a can alternatively be represented
as a decision tree in Fig. 1, where internal nodes
are conditions in the rule and leaf nodes are decisions
about the anaphor type, either zero or non-zero.
Rule 1a: If an entity, e, in the current utterance
was referred to in the immediately preceding
utterance and does not violate any syntactic
constraint on zero anaphora, then a zero anaphor is
used for e; otherwise a non-zero anaphor is used.
In Table 1, by using Rule 1a, the correct cases
increase from 408 to 510 and 98 to 126 for Sets 1 and 2,
respectively. Though Rule 1a improves on its
predecessor, the result still discourages us
from using it for the generation of zero anaphors.
immediate?
  yes: violates syntactic constraints?
    yes: NZ
    no:  Z
  no:  NZ
Figure 1: Decision tree for Rule 1a.

Table 1: The results of the algorithms of
Rules 1 and 1a.
Set  Alg.  Correct  False  Missing
S1   R1    408      236    7
S1   R1a   510      134    7
S2   R1    98       50     1
S2   R1a   126      22     1

3.2 The effect of adding discourse structure
Grosz and Sidner suggest that three structures can
be identified within a discourse: linguistic structure,
intentional structure, and attentional state
\[Grosz and Sidner 86\]. An important idea in the
theory is the mutual effect between the linguistic
expressions in utterances constituting the discourse
and the discourse segment structure. What concerns us
here is the interrelationship between the forms of
referring expressions and the discourse segment
structures. In NL generation systems, the semantic
structures of messages to be produced are usually
organized according to hierarchical intentional
structures; then, based on the structures, referring
expressions are decided \[Hovy 90, Dale 92\]. Hence, in
this subsection, we employ the idea of discourse
structure to improve our algorithm for the generation
of zero anaphors.
In their study \[Li and Thompson 79\], Li and
Thompson propose that "the degree of preference for
the occurrence of pronominal anaphora in a clause
inversely corresponds to the degree of connection with
the preceding clause." They listed the following
conditions of decreasing connection: switching from
background to foreground information, or vice versa,
between two clauses; the second clause being headed by
an adverbial expression; and two clauses being spoken
by two different participants.
In general, a zero anaphor used to refer to some
entity in the previous utterance might be expected to
indicate the continuation of a discourse segment, while
a non-zero anaphor occurring in the same situation
signals a boundary of a discourse segment. From the
generator's perspective, when the decision of the
anaphoric form for a phrase referring to some entity in
the previous utterance is to be made, the factor of
discourse segment boundary must be taken into
consideration. Therefore, based on this idea, we improve
the previous rules for generation of zero anaphors,
Rules 1 and 1a, to make the following rule. The decision
tree for Rule 2 is shown in Fig. 2.
Rule 2: If an entity, e, in the current utterance,
u, was referred to in the immediately preceding
utterance and does not violate any syntactic
constraints on zero anaphora, then if u is not the
beginning of a discourse segment, then a zero anaphor
is used for e; otherwise, a non-zero anaphor is used.

immediate?
  yes: violates syntactic constraints?
    yes: NZ
    no:  beginning of D.S.?
      yes: NZ
      no:  Z
  no:  NZ
Figure 2: Decision tree for Rule 2.
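As a rough sketch of how Rule 2 could be applied to an
annotated paragraph, the Python below walks through the
utterances in order and tests the three conditions in the
order of the rule. The classes Mention and Utterance, the
flags violates_syntax and starts_segment, and the function
rule2_forms are hypothetical names; the syntactic check
and the segment boundaries are assumed to be supplied by
prior annotation and are not computed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mention:
    entity: str
    violates_syntax: bool = False  # assumed result of a separate syntactic check


@dataclass
class Utterance:
    mentions: List[Mention] = field(default_factory=list)
    starts_segment: bool = False   # assumed discourse-segment annotation


def rule2_forms(utterances: List[Utterance]) -> List[List[str]]:
    """Label each mention "Z" or "NZ" following Rule 2."""
    forms: List[List[str]] = []
    previous = set()  # entities mentioned in the preceding utterance
    for u in utterances:
        row = []
        for m in u.mentions:
            if m.entity not in previous:
                row.append("NZ")   # not referred to in preceding utterance
            elif m.violates_syntax:
                row.append("NZ")   # zero form syntactically ruled out
            elif u.starts_segment:
                row.append("NZ")   # utterance opens a discourse segment
            else:
                row.append("Z")
        forms.append(row)
        previous = {m.entity for m in u.mentions}
    return forms


paragraph = [
    Utterance([Mention("x")]),
    Utterance([Mention("x")]),
    Utterance([Mention("x")], starts_segment=True),
    Utterance([Mention("x", violates_syntax=True)]),
]
print(rule2_forms(paragraph))
# [['NZ'], ['Z'], ['NZ'], ['NZ']]
```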
To perform the experiments for the new rules, we
have to access the discourse segment structures of the
testing data. Therefore, we annotated the boundaries
between discourse segments in the testing data and
the hierarchical discourse structures according to the
discourse segment intentions. We further carried out
a test by comparing our annotations with those of other
native speakers of Chinese. In the test, four native
speakers of Chinese were asked to do the same tasks we
had done for five articles selected from the testing
data. On average, 76% of the speakers' annotations
coincide with ours. On the basis of this comparison, we
considered our annotations reliable for the purpose of
the experiment.
We then performed the experiment by employing
the algorithm of Rule 2. As shown in Table 2, for the
Set 1 data, 49 and 12 zero anaphors were over- and
under-generated by the algorithm, respectively. For
the other set of testing data, Rule 2 achieves an even
better result.

Table 2: The results of the algorithm of
Rule 2.
Set  Alg.  Correct  False  Missing
S1   R2    590      49     12
S2   R2    (entries not legible)
3.3 The effect of topic
In this subsection, we use the feature of topic in
Chinese to further refine the previous rules. The basic
idea here is to investigate the positions of antecedent
and anaphor in their respective utterances. In the
following, we divide the position of anaphors in their
respective utterances into topic and non-topic. For each
anaphor, its antecedent's position is one of the
following categories: topic, direct object or the NP
following a presentative verb, and others. We thus
classify the following types, A to F, of
antecedent-anaphor pairs: the antecedents of Types A and
C are in topic position, of B and D are in direct object
position or are the NP following a presentative verb,
and of E and F are in other positions; the anaphors of
Types A, B and E are in topic position, and of C, D and
F are in non-topic positions.
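The six pair types can be written down as a lookup on the
two positions. The sketch below is illustrative only: the
enum Position and the function pair_type are hypothetical
names, and deciding which position a phrase occupies is
assumed to be done beforehand.

```python
from enum import Enum


class Position(Enum):
    TOPIC = "topic"
    OBJECT = "direct object or NP following a presentative verb"
    OTHER = "other"


# (antecedent position, anaphor is in topic position) -> pair type
PAIR_TYPES = {
    (Position.TOPIC, True): "A",
    (Position.OBJECT, True): "B",
    (Position.TOPIC, False): "C",
    (Position.OBJECT, False): "D",
    (Position.OTHER, True): "E",
    (Position.OTHER, False): "F",
}


def pair_type(antecedent: Position, anaphor_is_topic: bool) -> str:
    """Return the type, A to F, of an antecedent-anaphor pair."""
    return PAIR_TYPES[(antecedent, anaphor_is_topic)]


print(pair_type(Position.TOPIC, True))    # A
print(pair_type(Position.OBJECT, False))  # D
```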
Since in the new rule conditions on topic and
non-topic will only be considered after the conditions
in Rule 2, in investigating the antecedent-anaphor
pairs, we have to exclude the ones with their anaphors
either violating syntactic constraints on zero anaphora
or at the beginning of discourse segments. In other
words, the new condition will be attached under the
Z-node in the decision tree of Fig. 2. In the Set 1
testing data, there are 239 such pairs, among which
anaphors of 49 pairs are zeroed by the algorithm of
Rule 2 but appear in non-zero forms in the text. In
other words, the 49 anaphors were over-generated by our
algorithm, which in our terms belong to the false type;
the other 190 cases belong to the correct type. The
number of each type of pairs for both correct and
over-generated cases in the testing data is shown in
Table 3.
Table 3: Occurrence of antecedent-anaphor
pairs in the data.
Type     A    B   C  D  E   F  total
correct  158  27  1  0  4   0  190
false    15   14  6  1  9   4  49
total    173  41  7  1  13  4  239
For the Set 1 data in Table 3, the over-generated
cases of both Types A and B, 15 and 14 out of 173
and 41, respectively, are the minorities of the
respective types, while on the contrary, the numbers of
over-generated cases of Types C and E are greater than
their counterparts. Thus, if we let anaphors of Types A
and B be zero and Types C and E non-zero, then there
will be 29 (15+14) over-generated zero anaphors and 5
(1+4) under-generated ones for the Set 1 testing data.
The numbers for Types D and F do not conclusively
support either using zero or non-zero in this case. In
Chen's study \[Chen 87\], he found a higher percentage
of zero anaphors occurring in the topic position with
their antecedent most frequently in the topic or object
positions of the immediately previous utterance, which
strongly supports the idea of letting anaphors of Types
A and B be zero and others non-zero. We choose to
generate non-zero anaphora for Types D and F. We
thus obtain Rule 3 by adding the effect of topic into
Rule 2. The decision tree for Rule 3 is shown in Fig.
3. The results of using the new algorithm are shown
in Table 4.
Rule 3: If an entity, e, in the current utterance,
u, was referred to in the immediately preceding
utterance, does not violate any syntactic constraints
on zero anaphora, and u is not at the beginning of a
discourse segment, then if e is in either a Type A or
B pair, then a zero anaphor is used for e; otherwise,
a non-zero anaphor is used.

immediate?
  yes: violates syntactic constraints?
    yes: NZ
    no:  beginning of D.S.?
      yes: NZ
      no:  salience?
        yes: Z
        no:  NZ
  no:  NZ
Figure 3: Decision tree for Rule 3.

Table 4: The results of the algorithm of
Rule 3.
Set  Alg.  Correct  False  Missing
S1   R3    605      29     17
S2   R3    146      1      2
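Rule 3 reduces to a short chain of tests ending in the
salience check on the pair type. The Python sketch below
is one possible rendering under stated assumptions: the
function name rule3 and its boolean parameters are
hypothetical, and each condition is assumed to have been
determined already for the entity in question.

```python
from typing import Optional


def rule3(immediate: bool,
          violates_syntax: bool,
          starts_segment: bool,
          pair_type: Optional[str]) -> str:
    """Return "Z" (zero) or "NZ" (non-zero) for one anaphor under Rule 3.

    pair_type is the type, "A" to "F", of the antecedent-anaphor pair,
    or None when there is no antecedent in the preceding utterance.
    """
    if not immediate:
        return "NZ"
    if violates_syntax:
        return "NZ"
    if starts_segment:
        return "NZ"
    # Salience: only Types A and B (anaphor in topic position, antecedent
    # in topic or object position) are realized as zero anaphors.
    return "Z" if pair_type in ("A", "B") else "NZ"


print(rule3(True, False, False, "A"))   # Z
print(rule3(True, False, False, "C"))   # NZ
print(rule3(True, False, True, "A"))    # NZ
print(rule3(False, False, False, None)) # NZ
```

Testing the cheaper, more decisive conditions first
mirrors the order in which the rules were added during
the experiments.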
As a short summary, the numbers of anaphors in
the Set 1 testing data satisfying the conditions of Rule
3 are shown in Fig. 4, where Z, P and N represent
zero, pronominal and nominal anaphors, respectively.
Indicated in the root node are the total numbers of
all kinds of anaphors in the data. The correct match
is calculated by summing up the numbers of non-zero
anaphors, that is, pronouns and nominal anaphors, under
non-zero leaf nodes and zero anaphors under zero leaf
nodes. Non-zero anaphors under zero leaf nodes are
the false matches. Conversely, zero anaphors under
non-zero leaf nodes are the missing matches.
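The calculation just described can be checked
mechanically. The sketch below takes the (Z, P, N) counts
at the leaves of Fig. 4 for the Set 1 data and, for each
rule, sums the correct, false and missing matches; the
leaf names, the ZERO_LEAVES mapping and the function score
are labels introduced here for illustration only.

```python
# (Z, P, N) counts at the leaves of Fig. 4, Set 1 data.
LEAVES = {
    "not_immediate": (7, 16, 197),
    "violates_syntax": (0, 18, 84),
    "segment_start": (5, 25, 60),
    "salient": (185, 14, 15),      # Types A and B
    "not_salient": (5, 9, 11),     # Types C to F
}

# Leaves whose anaphors each rule would realize as zero.
ZERO_LEAVES = {
    "Rule 1": {"violates_syntax", "segment_start", "salient", "not_salient"},
    "Rule 1a": {"segment_start", "salient", "not_salient"},
    "Rule 2": {"salient", "not_salient"},
    "Rule 3": {"salient"},
}


def score(zero_leaves):
    """Return (correct, false, missing) for a given set of zero leaves."""
    correct = false = missing = 0
    for name, (z, p, n) in LEAVES.items():
        if name in zero_leaves:
            correct += z        # zero in text, zero generated
            false += p + n      # non-zero in text, zero generated
        else:
            correct += p + n    # non-zero in text, non-zero generated
            missing += z        # zero in text, non-zero generated
    return correct, false, missing


total = sum(sum(counts) for counts in LEAVES.values())  # 651 anaphors
for rule, zero_leaves in ZERO_LEAVES.items():
    correct, false, missing = score(zero_leaves)
    print(f"{rule}: correct={correct} false={false} "
          f"missing={missing} ({correct / total:.1%})")
# Rule 1: correct=408 false=236 missing=7 (62.7%)
# Rule 1a: correct=510 false=134 missing=7 (78.3%)
# Rule 2: correct=590 false=49 missing=12 (90.6%)
# Rule 3: correct=605 false=29 missing=17 (92.9%)
```

The figures for Rules 1, 1a and 2 agree with those
reported in Sections 3.1 and 3.2 (408 and 510 correct
cases; 49 over-generated and 12 under-generated), and the
result for Rule 3 corresponds to the 93% quoted in the
abstract.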
4 Future Work 
In this paper, we focus on distinguishing zero and
non-zero anaphors. To have a full account of anaphors
in Chinese, two tasks remain to be done. The first
is to distinguish pronouns and nominal anaphors for
the non-zero cases, namely, to further add conditions
under the non-zero nodes in the decision tree of Fig.
3. The second task is to develop an algorithm for the
decision of an appropriate form for nominal anaphors
\[Dale 92, Reiter and Dale 92\]. Afterwards, a Chinese
NL generation system will be developed to test the
performance of the algorithms.
All anaphors: Z=202, P=82, N=367
immediate?
  no:  NZ (Z=7, P=16, N=197)
  yes: (Z=195, P=66, N=170) violates syntactic constraints?
    yes: NZ (Z=0, P=18, N=84)
    no:  (Z=195, P=48, N=86) beginning of D.S.?
      yes: NZ (Z=5, P=25, N=60)
      no:  (Z=190, P=23, N=26) salience?
        yes: Z (Z=185, P=14, N=15)
        no:  NZ (Z=5, P=9, N=11)
Figure 4: Anaphors in Set 1 testing data satisfying
conditions of Rule 3.
5 Conclusion 
A study on the generation of Chinese zero anaphors,
as opposed to the usual work from the comprehension
side, is presented. By doing experiments on a number
of descriptive articles, we obtained a rule for the
generation of zero anaphors, which incorporates the
ideas of recency of occurrence, syntactic constraints,
discourse segment structure and salience of objects in
discourse. In the text generated by hand employing
the algorithm of the above rule, assuming the same
semantic structure and discourse segment structure as
the real text, the use of zero anaphors is fairly close
to that occurring in the real text. In the stepwise
empirical study, the algorithms are improved through the
test against real data, which in some sense provides the
assessment for the effectiveness of the rule. The result
of the assessment thus encourages us to employ the
rule as a part of the referring expression component in
the Chinese NL generation system we are developing.
References 
\[Chao 68\] Chao, Y. R., A Grammar of Spoken Chinese,
University of California Press, Berkeley,
CA, 1968.
\[Chen 87\] Chen, P., "Hanyu lingxin huizhi de huayu
fenxi (A discourse approach to zero anaphora in
Chinese)" (in Chinese), Zhongguo Yuwen (Chinese
Linguistics), pp. 363-378, 1987.
\[Dale 92\] Dale, R., Generating Referring Expressions: 
Constructing Descriptions in a Domain of Ob- 
jects and Processes, The MIT Press, Cambridge, 
Massachusetts, 1992. 
\[Grosz and Sidner 86\] Grosz, B. J. and Sidner, C. L., 
"Attention, intentions, and the structure of dis- 
course," Computational Linguistics, 12(3), pp. 
175-204, 1986. 
\[Hovy 90\] Hovy, E., "Approaches to the planning of
coherent text," in Natural Language in Artifi- 
cial Intelligence and Computational Linguistics, 
Paris, C. L., Swartout, W. R., and Mann, W. C., 
(eds.), 1990. 
\[Li and Thompson 79\] Li, C. N. and Thompson, S. 
A., "Third-person pronouns and zero-anaphora 
in Chinese Discourse," in Givon, T. (ed.), Syn- 
tax and Semantics: Discourse and Syntax, Vol. 
12, pp. 311-335, Academic Press, 1979. 
\[Li and Thompson 81\] Li, C. N. and Thompson, S.
A., Mandarin Chinese: a Functional Reference 
Grammar, University of California Press, Berke- 
ley, CA, 1981. 
\[Liu 84\] Liu, Y. C., Zhuowen de Fang Fa (Approaches 
to Composition), Xuesheng Chubanshe, Taipei, 
Taiwan, 1984. 
\[Reiter and Dale 92\] Reiter, E. and Dale, R., "A fast
algorithm for the generation of referring expres- 
sions," COLING-92, pp. 232-238. 
\[Sidner 83\] Sidner, C. L., "Focusing in the compre- 
hension of definite anaphora," in Brady, M. and 
Berwick, R. C. (eds.), Computational Models of
Discourse, pp. 267-330, MIT Press, 1983.
