An Evaluation of Anaphor Generation in Chinese 
Ching-Long Yeh 
Dept. of Computer Science and Engineering 
Tatung Institute of Technology 
40 Chungshan North Road, Section 3 
Taipei 104 
Taiwan 
chingyeh@cse.ttit.edu.tw
Chris Mellish 
Dept. of Artificial Intelligence 
University of Edinburgh 
80 South Bridge 
Edinburgh EH1 1HN 
Scotland 
chrism@aisb.ed.ac.uk
Abstract 
In this paper, we present an eval- 
uation of anaphors generated by a 
Chinese natural language generation 
system. In the evaluation work, the 
anaphors in five test texts generated 
by three test systems employing generation rules of different complexity
were compared with the ones in
the same texts created by twelve nat- 
ive speakers of Chinese. We took the 
average number of anaphors matching 
between the machine and human texts 
as a measure of the quality of anaphors 
generated by the test systems. The results suggest that the system we have chosen, which uses the most complex rule, is better than the other two. There are, however, real difficulties in establishing the significance of the results because of the degree of disagreement among the native speakers.
1 Introduction 
We have established several rules for the gener- 
ation of anaphors in Chinese, including rules to 
make the decision between zero, pronominal and 
nominal anaphors. Zero anaphors are omissions of noun phrases in surface sentences; pronominal anaphors, ta (s/he, it), are like s/he and it in English; and nominal anaphors are like definite NPs in English (Che87). These types of anaphors are exemplified in (1) by φ^i,¹ ta^i (he) and na ren^j (that person), respectively.
¹ We use φ^a to denote a zero anaphor, where the superscript a is the index of the referent.
(1)a. Zhangsan^i jinghuang de wang wai pao,
Zhangsan frightened NOM towards outside run
Zhangsan was frightened and ran outside.
b. φ^i zhuangdao yige ren^j,
(he) bump-to a person
(He) bumped into a person.
c. ta^i kanqing le na ren^j de lian,
he see-clear ASPECT that person GEN face
He saw clearly that person's face.
d. φ^i renchu na ren^j shi shui.
(he) recognise that person is who
(He) recognised who that person is.
In addition, we have established a rule for the choice of a description once a nominal anaphor is decided upon; for instance, it would choose between fang zuozi (square table) and simply zuozi (table) to refer to a square table. These rules were implemented in our Chinese natural language generation system, and a number of texts describing entities in a national park were generated (Yeh95). As shown in our previous studies (YM94; YM95; Yeh95), these rules were obtained from empirical studies. The experimental results show that the anaphors generated by using these rules largely match the ones in the test texts we used, assuming the same semantic structures and contextual information. This gives an indication of the performance of the rules. However, in this previous work the same data served as both training and test data. Furthermore, the assumed contextual information, for example, discourse structures, may be difficult to implement in a real system. Thus,
the performance of a real anaphor generation algorithm based on the previous rules may differ from the experimental results. In this paper, we attempt a post-evaluation by asking native speakers of Chinese to judge the anaphors generated by our system.
2 Previous Work and Our Approach 
Though the field of natural language genera- 
tion has progressed towards composing complex 
texts, the evaluation of natural language gener- 
ation systems has remained at the discussion 
stage (May90; MM91). Two broad methods 
have been identified for evaluating natural lan- 
guage generation systems: glass box and black 
box evaluation (MM91). The glass box method 
is concerned with examining the internal work- 
ing of individual components in a system, while 
the latter looks at the behaviour of the input 
and output to the generation systems. The dif- 
ficulty of the glass box method is the lack of 
a clear division between components in gener- 
ation systems. Even if the black box method 
is adopted, however, it is difficult to determine 
what is the appropriate input for generation and 
to be objective in evaluating the output text. 
In this paper, we aim to investigate the qual- 
ity of anaphors generated by the referring ex- 
pression component in our Chinese natural lan- 
guage generation system. The referring expres- 
sion component lies between the text planner 
and the linguistic realisation component in the 
system, as shown in Fig. 1. On accepting an 
input goal from the user, the system invokes 
the text planner which uses the operators in the 
plan library to build up a plan which is a hier- 
archical discourse structure to satisfy the input 
goal. After the text planning is finished, the 
decision of anaphoric forms and descriptions is 
then made by traversing the plan tree. As is dis- 
cussed in (YM95; Yeh95), the algorithm of the 
referring expression component first determines 
an appropriate form for an anaphor to be gen- 
erated. 
Suppose that the referring expression components we wish to compare all adopt the above basic algorithm. Then the essential characteristic that distinguishes them from each other becomes the rules used in the components and how these rules are implemented. If all of these referring expression components are embedded in the same Chinese natural language generation system, as in Fig. 1, for example, then, given an input to the system, anaphors in the resulting texts can be characterised by the rules used in the referring expression component and their implementation.

Figure 1: Referring expression component in the Chinese natural language system. (Pipeline: input goal, text planner, referring expression component, linguistic realisation, surface sentences.)
By adopting this approach, we need not worry 
about the problems of either of the evaluation 
methods stated above, except the objective 
evaluation of output text. Since there is no ma- 
chine that can read the generated texts and give 
an impartial judgement about them, we rely on 
the opinions of human readers who are native 
speakers of Chinese to investigate the quality of 
the generated anaphors. This is an easier task 
than assessing the quality of whole texts. To 
compensate for possible bias among the indi- 
vidual readers, we sent the output texts to a 
group of readers for viewing and took the aver- 
age of their outcomes as the measurement. 
In brief, each object system in our evaluation work is thought of as having the same individual components, including control and knowledge bases (which are discussed in full in (Yeh95) but cannot be presented here for reasons of space), except that the anaphor generation rules used in the referring expression components differ from one another. In the existing literature, we cannot find other work on the generation of Chinese referring expressions (or indeed on the full evaluation of anaphor generation for any other language), which means that we have no real working systems to compare with. In practice, we employ our Chinese natural language generation system described in (Yeh95) as the backbone of the evaluation work because it is easy for us to control and maintain. What we have to do for each generation system is simply to insert the corresponding generation rule.

Figure 2: A Chinese anaphor generation rule. (Decision tree whose internal nodes test locality, syntactic constraints, segment beginning, salience and animacy, and whose leaves are Z, P or N.)
3 Systems to Compare and the Test Task
Having described the framework of the evaluation, in this section we give details of the object systems to be compared and the task to be performed.
3.1 Systems to compare 
The anaphor generation rule we obtained in 
our previous studies (YM94; YM95; Yeh95) is 
shown in Fig. 2, where the internal nodes repres- 
ent constraints and the terminal nodes are the 
decisions of using a zero (Z), pronominal (P), 
or nominal (N) form. The locality constraint checks whether the antecedent of the anaphor in question occurs in the immediately previous utterance or at a long distance. The second constraint determines whether an anaphor occurs in a position violating syntactic constraints on zero anaphors. We adopted the concept of discourse segment structure in (GS86) to build up the constraint on segment beginnings, which checks whether an anaphor is at the beginning of a discourse segment. The salience constraint requires that both an anaphor and its antecedent are the topics of their respective utterances. The animacy constraint checks whether the referent of the anaphor in question is animate. The following rule is then used if a nominal form is decided on.
If a nominal anaphor, n, is at the beginning of a "sentence",² or is the first mention of the referent in a "sentence", then a full description is preferred; otherwise, if n is within a "sentence" or has been mentioned previously in the same "sentence" without distracting elements, then a reduced description is preferred; otherwise a full description is preferred.
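
The description rule quoted above can be read as a three-way test. The following Python sketch is one possible reading, not the implementation in (Yeh95): the class name, the field names and the boolean encoding of "distracting elements" are assumptions made for illustration, and the second clause is interpreted as "mentioned earlier in the same sentence with no distracting elements in between".

```python
from dataclasses import dataclass


@dataclass
class NominalContext:
    """Hypothetical features of a nominal anaphor n (names are illustrative)."""
    at_sentence_start: bool          # n begins a "sentence"
    first_mention_in_sentence: bool  # n is the first mention of its referent in the "sentence"
    has_distractors: bool            # distracting elements occur since the previous mention


def choose_description(ctx: NominalContext) -> str:
    """Return "full" or "reduced" under the assumed reading of the rule."""
    if ctx.at_sentence_start or ctx.first_mention_in_sentence:
        return "full"      # e.g. fang zuozi (square table)
    if not ctx.has_distractors:
        return "reduced"   # e.g. zuozi (table)
    return "full"          # fall back to the full description


# A later mention inside the same "sentence", nothing distracting in between.
print(choose_description(NominalContext(False, False, False)))  # reduced
```

How "distracting elements" are detected is left open here; the source does not define it in this section.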
The constraints in the anaphor generation 
rule were established by consulting relevant lin- 
guistic studies (YM94; YM95; Yeh95). Con- 
sequently, subsets of constraints in the above 
rule can be thought of as possible rules, if 
not complete, for the generation of anaphors in 
Chinese. As described previously, the systems 
to compare in this evaluation work are assumed 
to share the same individual components, ex- 
cept the anaphor generation rules. In this pa- 
per, we equipped each system with such a pos- 
sible anaphor generation rule. 
We chose three rules, termed TR1, TR2 and TR3, with different complexities among the possible candidates as the targets of the test.³ The rules are shown in Fig. 3. The first one uses locality, syntactic constraints and animacy. The second and the third rules each add one constraint to their predecessor, namely, discourse segment beginnings and salience, respectively. In the following, we use the above rule names to represent the systems.

² A "sentence" is in general a meaning-complete unit (Liu84). A sentential mark is used to indicate the full stop of a "sentence"; a comma within a "sentence" indicates a temporary stop.
³ The use of these rules enables us to investigate the effectiveness of individual constraints.

Figure 3: Rules used in the comparison systems. (Three decision trees, labelled TR1, TR2 and TR3.)
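
To make the difference between the three rules concrete, here is a minimal Python sketch. The branch order is an assumption reconstructed from the prose (a long-distance antecedent gives a nominal form; a syntactic violation, a segment beginning, or a failed salience test falls back to the animacy choice between pronoun and nominal; otherwise a zero anaphor is used). The figures themselves are not reproduced here, so the published trees may differ in detail, and all names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnaphorContext:
    """Hypothetical feature bundle for one anaphor position."""
    immediate: bool          # locality: antecedent is in the immediately previous utterance
    violates_syntax: bool    # position violates syntactic constraints on zero anaphors
    segment_beginning: bool  # anaphor is at the beginning of a discourse segment
    salient: bool            # anaphor and antecedent are both topics
    animate: bool            # referent is animate


def pronoun_or_nominal(ctx: AnaphorContext) -> str:
    """Animacy test: pronominal form for animate referents, nominal otherwise."""
    return "P" if ctx.animate else "N"


def generate_form(ctx: AnaphorContext, rule: str) -> str:
    """Return "Z", "P" or "N" for rule "TR1", "TR2" or "TR3" (assumed tree shape)."""
    if rule not in ("TR1", "TR2", "TR3"):
        raise ValueError(f"unknown rule: {rule}")
    if not ctx.immediate:
        return "N"
    if ctx.violates_syntax:
        return pronoun_or_nominal(ctx)
    if rule in ("TR2", "TR3") and ctx.segment_beginning:
        return pronoun_or_nominal(ctx)   # constraint added in TR2
    if rule == "TR3" and not ctx.salient:
        return pronoun_or_nominal(ctx)   # constraint added in TR3
    return "Z"


# A topic shift inside a "sentence": not salient, inanimate referent.
shift = AnaphorContext(immediate=True, violates_syntax=False,
                       segment_beginning=False, salient=False, animate=False)
print([generate_form(shift, r) for r in ("TR1", "TR2", "TR3")])  # ['Z', 'Z', 'N']
```

Under this reading, only TR3 gives up the zero form at a topic shift, which is the behaviour the rule descriptions lead one to expect.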
3.2 The test task 
The task can be divided into an annotation 
and a comparison stage. Each of twelve native 
speakers of Chinese was given a number of test 
sheets to finish. On each sheet is a text generated by our generation system. Each anaphor position in a generated text was left empty, and all candidate forms of the anaphor (zero, pronominal, and full or reduced descriptions) were put under the empty space. The task
for a speaker to perform was to annotate which 
form he or she preferred for each anaphor posi- 
tion on the sheets. 
We selected five texts generated by our sys- 
tem for the test. The numbers of clauses in the 
texts are 5, 12, 12, 21 and 34; the numbers of 
anaphors in the texts are 4, 11, 11, 20 and 34. 
See the Appendix for the first three test texts. 
For convenience, we summarise the occurrence 
of anaphors in the test texts in a graphical form 
in Fig. 4. In the figure, each box represents a 
clause and at the right end is the accompany- 
ing punctuation mark. Each box is divided into 
three parts which represent the topic, the sub- 
ject and the direct object positions of the clause. 
The numbers in a box, except for the first occur- 
rences in the text, are the indices of anaphors 
in the corresponding clauses. Initial references 
are indicated by bold italics. For example, in Text 2, the numbers 1, 2, 3 and 4 occurring in the first, 5th, 8th and 10th clauses, respectively, are initial references; others are anaphors.
After the annotations were collected, we car- 
ried out comparisons between the speakers' res- 
ults and the generated texts to investigate the 
performance of the test rules. In each compar- 
ison, we noted down the number of matches 
between the computer generated text and the 
human result. In the following, we use Cij to denote the text indexed j generated by the system equipped with Rule TRi, where i is 1 to 3 and j is 1 to 5; and Hkj to denote the resulting text indexed j of speaker k, where k is 1 to 12. The comparison work is summarised procedurally as below.
for each rule TRi
  for each speaker k
    for each text j
      compare Cij with Hkj and
      note down the number of matches
      of anaphors between them
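
The comparison procedure can be sketched in Python as follows. This is an illustrative sketch only: it assumes each text is stored as a list with one chosen form per anaphor position, the function names are hypothetical, and the toy data are invented, not taken from the experiment.

```python
def count_matches(machine, human):
    """Number of anaphor positions where the two texts use the same form."""
    if len(machine) != len(human):
        raise ValueError("texts must cover the same anaphor positions")
    return sum(m == h for m, h in zip(machine, human))


def compare_all(machine_texts, human_texts):
    """Compare every machine text with every speaker's version of the same text.

    machine_texts maps (rule, text) to a list of forms;
    human_texts maps (speaker, text) to a list of forms.
    Returns {(rule, speaker, text): number of matches}.
    """
    matches = {}
    for (rule, text), machine in machine_texts.items():
        for (speaker, human_text), human in human_texts.items():
            if human_text == text:
                matches[(rule, speaker, text)] = count_matches(machine, human)
    return matches


# Invented toy data: one text with four anaphor positions, two speakers.
machine_texts = {("TR1", 1): ["Z", "Z", "Z", "N"], ("TR3", 1): ["Z", "N", "Z", "N"]}
human_texts = {(1, 1): ["Z", "N", "Z", "N"], (2, 1): ["Z", "Z", "P", "N"]}
matches = compare_all(machine_texts, human_texts)
print(matches[("TR3", 1, 1)])  # 4
```

Averaging these counts over speakers and dividing by the number of anaphors in a text gives a per-text matching rate.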
4 Results 
In this section, we examine the results of the comparisons described in the last section. The comparison results are shown in Table 1. The average matching rates over all test texts are 72, 74 and 76% for TR1, TR2 and TR3, respectively.
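
As a quick arithmetic check, the overall figures can be recovered from the per-text matching rates in Table 1, assuming (this is an assumption, since the paper does not say how the overall figure was computed) that each is the unweighted mean of a system's five per-text rates, rounded to the nearest percent.

```python
# Per-text matching rates (%) for Texts 1-5, as reported in Table 1.
rates = {
    "TR1": [90, 70, 62, 70, 70],
    "TR2": [90, 70, 66, 75, 71],
    "TR3": [90, 79, 65, 77, 71],
}


def overall_rate(per_text):
    """Unweighted mean of the per-text rates, rounded to a whole percent."""
    return round(sum(per_text) / len(per_text))


overall = {rule: overall_rate(values) for rule, values in rates.items()}
print(overall)  # {'TR1': 72, 'TR2': 74, 'TR3': 76}
```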
These average matching rates, however, are lower than the matching rates of about 92% that we obtained in the empirical studies described previously (YM94; Yeh95). This is partly because the test texts used in the former comparison are human-created, while the test texts used here are machine-generated. The grammatical structures of the machine-generated texts are simplified; they are not as sophisticated as human texts. In the evaluation work, when the speakers were asked to decide their preferences for anaphors in the machine-generated texts, they may have found less complete information in the test texts than they are used to when creating their own texts, and hence it may have been difficult for them to make their own decisions. In the empirical study, the human-created texts perhaps provided more information for the hypothetical machine to decide on an appropriate anaphoric form.

Figure 4: Occurrence of anaphors in the test texts. (Panels: Text 1 to Text 5; one row of topic / subject / direct object boxes per clause, followed by its punctuation mark.)

Table 1: Match between the results of test systems and native speakers.
System  Speaker         Text 1  Text 2  Text 3  Text 4  Text 5
TR1     1               4       10      9       16      27
        2               4       8       6       16      24
        3               4       7       5       15      24
        4               4       8       5       13      23
        5               3       7       8       14      23
        6               4       8       7       15      28
        7               4       7       6       16      25
        8               4       10      9       17      32
        9               2       6       7       9       14
        10              4       8       9       14      23
        11              2       5       6       10      20
        12              4       9       5       13      23
        Average         3.6     7.8     6.8     14      23.8
        Total anaphors  4       11      11      21      34
        Matching rate   90%     70%     62%     70%     70%
TR2     1               4       10      8       17      26
        2               4       8       6       17      25
        3               4       7       5       16      23
        4               4       8       5       14      24
        5               3       7       9       15      24
        6               4       8       7       16      29
        7               4       7       7       17      26
        8               4       10      9       18      33
        9               2       6       8       9       14
        10              4       8       10      15      24
        11              2       5       7       11      19
        12              4       9       6       14      24
        Average         3.6     7.8     7.3     14.9    24.3
        Total anaphors  4       11      11      21      34
        Matching rate   90%     70%     66%     75%     71%
TR3     1               4       7       5       13      18
        2               4       11      9       16      25
        3               4       10      8       19      29
        4               4       9       4       14      24
        5               3       9       9       16      26
        6               4       11      6       14      24
        7               4       8       10      13      23
        8               4       ?       ?       14      25
        9               2       7       ?       12      20
        10              4       9       7       16      23
        11              2       6       6       12      21
        12              4       10      9       16      30
        Average         3.6     8.7     7.1     14.6    24
        Total anaphors  4       11      11      21      34
        Matching rate   90%     79%     65%     77%     71%

Table 2: Agreement of annotations among speakers.
Speaker         Text 1  Text 2  Text 3  Text 4  Text 5
1               3.9     8       7.5     14.3    24
2               3.9     9.5     7.8     16.1    26.5
3               3.9     9.1     7.8     15.8    26.3
4               3.9     8.9     6.6     15.4    23.9
5               3.3     8.5     8.3     15.2    25.4
6               3.9     9.5     8.3     14.1    26.5
7               3.9     8.3     7.1     15      26.2
8               3.9     8.1     7.9     15.8    26.4
9               2.4     6.8     7       12.1    20.5
10              3.9     8.6     8.1     14.5    25
11              2.3     5.7     7       12.7    21.2
12              3.9     9.4     7.8     15.3    26.3
Average         3.6     8.4     7.6     14.7    24.9
Total anaphors  4       11      11      21      34
Matching rate   90%     76%     69%     73%     73%
A more important reason why the matching 
rates are lower than before could be that in 
some circumstances there may be more than 
one acceptable solution and the speakers may 
not always choose the same one as the machine. 
This hypothesis can be investigated by look- 
ing at the extent to which the speakers agree 
among themselves. To see how the speakers 
agree among themselves, we further made a 
comparison between the speakers' annotations, 
which is summarised as below. 
for each speaker i 
for each text j 
compare i's with the rest of speakers' 
annotations and note down 
the average number of the matches 
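
A minimal Python sketch of this inter-speaker comparison for a single text is given below. The data layout (one list of chosen forms per speaker) and the function name are assumptions made for illustration, and the three-speaker example is invented.

```python
def average_agreement(annotations):
    """For each speaker, the mean number of positions matching each other speaker.

    annotations maps a speaker to the list of forms chosen for one text.
    """
    result = {}
    for speaker, own in annotations.items():
        others = [forms for other, forms in annotations.items() if other != speaker]
        totals = [sum(a == b for a, b in zip(own, forms)) for forms in others]
        result[speaker] = sum(totals) / len(others)
    return result


# Invented example: three speakers, three anaphor positions.
toy = {
    "A": ["Z", "P", "N"],
    "B": ["Z", "P", "Z"],
    "C": ["Z", "N", "N"],
}
print(average_agreement(toy))  # {'A': 2.0, 'B': 1.5, 'C': 1.5}
```

With twelve speakers, each entry is an average over the other eleven.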
The comparison result is shown in Table 2. For 
each speaker, the number for each test text is 
the average number of matches with the other 
eleven speakers. For example, Speaker 1 re- 
ceives, on average, 8 matches for Text 2. At the 
end of the table are the average numbers for the 
speakers' agreement among themselves. The 
figures in the table show that the speakers do 
not achieve an agreement among themselves for 
the use of anaphors in this test. These figures 
are further supported by the kappa statistic, a 
standard measure of agreement between a set of 
judges (SC88). The overall kappa value for all 
speakers is about 0.41, whereas a value of 0.8 or 
over would normally be required for good evid- 
ence of agreement. The measure of agreement 
gets worse if only the zero/pronoun/nominal
distinction is considered or if zero and non-zero 
pronouns are lumped together. Only two speak- 
ers agree with one another with a kappa value 
of more than 0.7 (none with a value of greater 
than 0.8). The speakers as a whole agreed with 
kappa greater than 0.7 on 30 out of the 80 ana- 
phors, with complete agreement only 14 times. 
To get an overall agreement of greater than 0.8 
would require reducing the set of speakers from 
12 to a carefully selected 3. 
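
For readers unfamiliar with the statistic, the sketch below computes a multi-judge kappa of the usual form K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed proportion of agreeing judge pairs and P(E) the agreement expected by chance from the overall category proportions. It is assumed here that this is the variant of (SC88) used in the paper; the text does not give the formula, and the function name is hypothetical.

```python
from collections import Counter


def multi_rater_kappa(items):
    """Kappa for several judges assigning one category to each item.

    items is a list of lists; each inner list holds the category chosen by
    every judge for one item. All items need the same number of judges (>= 2).
    """
    n_items = len(items)
    n_judges = len(items[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for ratings in items:
        if len(ratings) != n_judges:
            raise ValueError("every item needs the same number of judges")
        counts = Counter(ratings)
        category_totals.update(counts)
        # Proportion of judge pairs that agree on this item.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (n_judges * (n_judges - 1))
    p_agree = agreement_sum / n_items
    p_chance = sum((t / (n_items * n_judges)) ** 2 for t in category_totals.values())
    if p_chance == 1.0:
        return 1.0  # only one category was ever used, so agreement is trivially complete
    return (p_agree - p_chance) / (1 - p_chance)


# Invented example: two anaphor positions, three judges each.
print(multi_rater_kappa([["Z", "Z", "Z"], ["N", "N", "N"]]))  # 1.0
```

Values near 0 indicate agreement no better than chance, and negative values indicate agreement below chance.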
As shown in Fig. 4, the anaphors in Text 1 
form a "topic chain"⁴ within a single "sentence". These anaphors are all zeroed according
to the conditions of locality and syntactic con- 
straints in the three test rules. All three systems 
produce the same result for Text 1 and, hence, 
unsurprisingly all three systems have the same 
matching rate, 90%, as shown in Table 1. 
Text 2 similarly contains a single "sentence" 
but has three topic shifts in addition to "topic 
chains" within the "sentence" as shown in 
Fig. 4. Since no discourse segment boundaries 
occur within the "sentence", the discourse seg- 
ment boundary constraint in TR2 has no effect 
on this test text, which means that both TR1 
and TR2 produce the same output. However, 
there are three topic shifts within the "sen- 
tence", namely, clauses 5 and 6, 8 and 9, and 
10 and 11, as shown in Fig. 4. These shifts make the rule containing the salience constraint, TR3, produce different output from the rules without this constraint. TR1 and TR2 obtain the same matching rate, 70%. TR3 obtains a higher matching rate than the other two, 79%, which shows the effectiveness of the salience constraint in it.
We then examine another middle-sized test 
text, Text 3, which is broken into three "sen- 
tences," as shown in Fig. 4. The beginning of a 
"sentence" is the beginning of a discourse seg- 
ment in our implementation (Yeh95). Further- 
more, there are three topic shifts occurring in 
Text 3, i.e., clauses 8 and 9, 10 and 11, and 
11 and 12. The constraint on discourse segment beginnings in TR2 and TR3 and the salience constraint in TR3 would therefore have some effect on the output texts. The matching rates, as shown in Table 1, increase from 62% for TR1 to 66% for TR2, which shows that the constraint on discourse segment beginnings in TR2 is effective. TR3 obtains a 65% matching rate, on average, which is one percentage point lower than its predecessor TR2. However, this decrease in average matching rate does not deny the effectiveness of the salience constraint in TR3. TR2's text differs from TR3's in the three topic shifts: TR2 generates zero anaphors for these shifts, while TR3 generates full descriptions. The speakers varied greatly in choosing anaphoric forms for these topic shifts: among the twelve speakers, four chose all full descriptions, three used all zero anaphors, and the other five chose a mixture of zero, pronominal and nominal anaphors. Thus, four of the twelve speakers completely agree with TR3 on these shifts, while three agree with TR2. This suggests that the salience constraint in TR3 is still effective.
⁴ A "topic chain" is a situation where a referent is referred to in the first clause, and then several more clauses follow talking about the same referent, namely, the topic.
Then we examine the more complicated texts, Texts 4 and 5. As shown in Table 1, the increases in matching rates from TR1 to TR2 show the effectiveness of the constraint on discourse segment beginnings in TR2. Again, the average matching rates of TR3 are slightly lower than those of TR2 for these two texts. However, similar to the situation in Text 3, the speakers varied in their choice of anaphors for the topic shifts in these two texts. For Text 4, three speakers completely agree with TR2 and one with TR3. As for Text 5, two speakers completely agree with TR2, while the others partly agree with TR2 and TR3.
The above discussions show that the salience constraint in TR3 is sometimes effective in getting small improvements in the output texts. They also point to a difference between the concept of salience used by the speakers and that used in TR3. In brief, the more sophisticated constraints a rule contains, the better it performs. Both TR2 and TR3 perform better than TR1. TR3 performs
better than TR2 for texts with simple discourse 
segment structure. For the texts having com- 
plicated discourse segment structures, TR2 is 
slightly better than TR3 on average matching 
rates. Adding the results of the rules to those of 
the speakers leads to a slight decrease in kappa 
for TR1 but progressively better (though only 
from 0.41 to 0.43) values for kappa for TR2 
and TR3. This indicates that the better rules 
seem to disagree with the speakers no more than 
the speakers disagree among themselves. There are 9 anaphors where the kappa score including TR3 is less than that for the speakers alone (in many other cases, the results are better).
These seem to involve places where the speakers 
were more willing to use a zero pronoun (where 
the system used a reduced nominal anaphor) 
and where the speakers reduced nominal ana- 
phors less than the system did. 
5 Conclusion 
In this paper, we evaluated the quality of ana- 
phors in the texts generated by using various 
rules. As shown by the results of comparisons between the anaphors created by computers and by native speakers of Chinese, the individual constraints we collected in our previous studies (YM94; YM95; Yeh95) seem to be effective to a large extent in the generation of anaphors in Chinese. They can also be implemented successfully. The comparison results suggest
that a Chinese natural language generation sys- 
tem employing the combination of these con- 
straints might produce more effective anaphors 
than one using individual constraints. Although 
the average matching rates between the differ- 
ent rules and the speakers are lower than those 
from our previous experiments based on human- 
generated texts, this at least in part reflects 
considerable disagreement among the speakers 
themselves. 

References 
P. Chen. Hanyu lingxin huizhi de huayu fenxi (A discourse approach to zero anaphora in Chinese) (in Chinese). Zhongguo Yuwen (Chinese Linguistics), pages 363-378, 1987.
B. J. Grosz and C. L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204, 1986.
Y. C. Liu. Zuowen de fangfa (Approaches to Composition) (in Chinese). Xuesheng Chubanshe, Taipei, Taiwan, 1984.
M. T. Maybury. Planning Multisentential English Text Using Communicative Acts. PhD thesis, Cambridge University, 1990.
M. Meteer and D. McDonald. Evaluation for generation. In J. G. Neal and S. M. Walter, editors, Natural Language Processing Systems Evaluation Workshop, pages 127-131, NY, 1991. Rome Laboratory.
S. Siegel and N. J. Castellan, Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1988.
C. L. Yeh. Generation of Anaphors in Chinese. PhD thesis, University of Edinburgh, Edinburgh, Scotland, 1995.
C. L. Yeh and C. Mellish. An empirical study on the generation of zero anaphors in Chinese. In Proc. of the 15th International Conference on Computational Linguistics, pages 732-736, Kyoto, Japan, 1994.
C. L. Yeh and C. Mellish. An empirical study on the generation of descriptions for nominal anaphors in Chinese. In Proc. of Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, 1995.
