A DISCOURSE GRAMMATICO-~TATISTICAL APPROACH TO 
PARTITIONING 
Tadashi Nomoto and Yoshihiko Nitta 
Adwmced Research Laboratory, IIitachi Ltd.* 
cmMl:{nomoto, nitta}¢harl, hitachi, co. j p 
Abstract 
The paper presents a new approach to text segmen- 
tation --. which concerns dividing a text into coher- 
ent discourse units. The approach builds on tile ttle- 
ory of discourse segment (Nomoto and Nitta, 1993), 
incorporating ideas from the research on information 
retrieval (Salton, 1988). A discourse segment has to 
do with a structure of Japanese discourse; it could be 
thought of as a linguistic unit delnarcated by wa, a 
Japanese topic particle, which may extend over sev- 
eral sentences. The segmentation works with discourse 
segments and makes use of coherence measure ba~scd 
on tfidf, a standard information retrieval measurement 
(Salton, 1988; IIearst, 1993). Experi,nents have been 
done with a Japanese newspaper corpus. It has been 
found that the present approach is quite sucecssfld in 
recovering articles fronl tile unstructured corpus. 
Introduction 
In this paper, we describe a method for discovering 
coherent texts from the unstructured corpus. The 
method is both linguistically and statistically moti- 
vated. It derives a linguistic motivation front the view 
that discourse consists of what we call discourse seg- 
ments, minimal coherent mtits of discourse (Nomoto 
and Nitta, 1993), while statistically it is guided by ideas 
from intbrmatiou retrieval (Salton, 1988). Previous 
quantitative approaches to text segmentation (Hearst, 
1993; Kozima, 1993; Youmans, 1991) have paid little 
attention to a statistically important structure that a 
discourse might have and detined it away ~Ls a lump of 
words or sentences. 1'art of our concern here is with 
explicating possible effects of a discourse segment on 
the quantitative structuring of discourse. 
In what follows, we will describe some important 
features about discourse segment and see how it can 
be incorporated into a statistical analysis of discourse. 
Also some comparison is made with other apl)roaches 
such as (Youmans, 1991), followed by discussion on the 
results of the present method. 
Theory of Discourse Segment 
The theory of discourse seF, ment (Nomoto a.nd Nitta, 
*2520 |tatoyama Saitama 350-03 Japan 
tel. +81-492-9C~6111 fax. +81-492-9(~-6006 
1993) carries with it a set of empirical hypotheses about 
structure of Japanese discourse. Among them is the 
claim that Japanese discourse is constructed from a se- 
ries of linguistics units called discourse segment. The 
discourse segment is thought of az a topic-comment 
structure, where a topic corresponds to the subject 
matter and a comment a discussion about it. In partic- 
ular, Japanese haz a special way of marking the topic: 
by sutllxing it with a postpositionM particle wa. Thus 
in Japanese, a topic-comment structure takes the form: 
topic comment 
where "*" represents a word. The comment part could 
become quite long, extending over quite a few sentences 
(Mikami, 1960). Now Japanese provides for a variety 
of ways to mark off a topic-comment structure; the wa- 
marking is one such and a typographical device such 
as a line- or a page-break is another. For the present 
discussiou, we take a discourse segment to bca block of 
sentences bounded by a text break and/or a wa-marked 
element. 
I T1 ~'1 5'2 "'' Sn \[ ~t~2 Sn-F1 Sn+2 ' "" Sn-Frn \], 
Discourse Segment Discourse Segment 
where "T" denotes a boundary marker, "S" a sentence, 
and "\[" a segment gap. For the semantics of a discourse 
segment, Nomoto and Nitta (1993) observes an inter- 
esting tendency that zero (elliptical) anapl,ora occur- 
ring within the segment do not refer across the segment 
boundary; that is, their references tend to be resolved 
internally 1 . 
Now we take a very simple view about tile gk)bal 
structure of discourse: syntactically, discourse is just 
lllerc and throughout we use 01 for a subject(NOMinative) 
zero; 02 for a object(ACCusative) zero; TOP for a topic case; 
I)AT fox" a dative(indirect object) c~me; PASS for a passive mor- 
\].)heIIle. 
I 'l'aro<,> -wa 01<i> rojin<i > -hi seki 
TOP old man DAT seat 
-wo yuzutte -ageta node, 01<i> 02<j> 
ACC give help because 
orei -wo iwar et,~. \[ 
thank say PASS 
"l\]ecause ~\]hro gave the old man a favor of 9iv- 
im.l a seat, he thanked Taro." 
Note that all the instances of 01 aa~d 02 have internal antecedents: 
Taro and rojin. 
114.5 
a chronological juxtaposition of contiguous, disjoint 
blocks of sentences, each of which corresponds to a 
discourse segment; semantically, discourse is a set of 
anaphoric islauds set side by side. Thus a discourse 
should look like Figure 1, where G denotes a discourse 
segment. Fnrthermore, we do not intend the dis- 
D 
Figure 1: Discourse Structure 
course structure to be anything close to the ones that 
rhetorical theories of discourse (tIovy, 1990; Mann and 
Tholnpson, 1987; Itobbs, 1979) claim it to be, or inteu- 
lional structure (Grosz and Sidner, 1986) ; indeed we 
do not assmne any functional relation, i.e. causation, 
elaboration, extension, etc., among the segments that 
constitute a discourse structure. The present theory 
is not so much about the rhetoric or the function of 
discourse as about the way anaphora are interpreted. 
It is quite possible that a set of discourse segments 
are not aggregated into a single discourse but may have 
diverse discourse groupings (Nomoto and Nitta, 1993). 
This happens when discourses are interleaved or em- 
bedded into some other. An interleaving or an embed- 
ding of discourse is often invoked by changes in narra- 
tive mode such as direct/indirect speech, quoting, or 
interruption; which will cue the reader/hearer to sus- 
pend a current discourse flow, start another, or resume 
the interrupted discourse. 
A Quantitative Structuring of 
Discourse 
Vector Space Model 
Formally, a discourse segment is represented as a lerm 
vector of the form: 
Gi = (gil, gi2, gi3 .... ,git) 
where a..qi represents a nominal occurrence in Gi. Ill 
the information retrieval terms, what happens here is 
that a discourse segment Gi is indexed with a set of 
terms gll through tit; namely, we characterize Gi with 
a set of indices gil,... ,tit. A term vector can either 
be binary, where each term in the w~'ctor takes 0 or 1, 
i.e., absence or presence, or weighted, where a term is 
assigned to a certain importance value. In general, the 
weighted indexing is preferable to the biuary indexing, 
as the latter policy is known to have problems with 
precision (Salton, 1988) 2. The weighting policy that 
we will adopt is known as If. idf It is an indicator of 
term importance Wij defined by: 
N wij .= t fij * log ~j 
2 Precision measures the proportioi1 of correct items retrieved 
against the total number of retrieved items. 
where tf (term frequency) is the number of occurrences 
of a term Tj in a document Di; df (document fre- 
quency) is tile number of documents in a collection of 
N documents in which 7) occurs; and the importance, 
wljis given ~s the product of if and the inverse dffactor 
,or idf, log N/dfi. With the tf idfpolicy, high-frequency 
terms that are scattered evenly over the entire docu- 
ments collection are considered to be less important 
than those that are frequent but whose occurrences 
are concentrated in particnlar documents 3. Thus the 
tf.idfindexing favors rare words, which distinguish the 
documents more effectively than common words. 
With the indexing method in place, it is now pos- 
sible to define the cohercnce between two term vec- 
tors. For term vectors X = (xl,x~,...,x~) and Y = 
(Yt, Y2,..., Yt), let the coherence be defined by: 
t 
/=1 C(X,Y) = t t 
i-.~1 4=1 
where w(xi) represents a tfidf weight assigned to the 
term xi. The measure is known as Dice coefficienl 4. 
Experiments 
Earlier quantitative approaches to text partitioning 
(Youmans, 1991; Kozima, 1993; Ilearst, 1993) work 
with an arbitrary block of words or sentences to de- 
termine a strneture of discourse. In contrast, we work 
with a block of discourse segments. It is straightfor- 
ward to apply the tfidf to the analysis of discourse; 
one may just treat a block of discourse segments as a 
document unit by itself and then define the term fre- 
quency (t J), the document fi'equency (dJ), and the size 
of docmnents collcction (N), accordingly. Coherence 
would then be determined by the number of terms seg- 
ment blocks share and tf.idf weights the terms carry. 
Thus one pair of blocks will become more cohesive than 
auother if the pair share more of the terms that arc lo- 
cally frequent. 
The partitioning proceeds in two steps. We start 
with the following: 
1. Collect all the nominal occurrences found in a 
corpus 5 
2. Divide the collection into disjoint discourse seg- 
lrlents. 
3. Compare all pairs of adjacent blocks of discourse 
segments. 
3 Precision depends on the size of docmnents coLLection; as the 
collection get, s smaller in size, index ternls become less extensive 
and more discriminatory. The id\]factor could be dispensed with 
ill such cases. 
4Other measures standardly available in the information re- 
n'ieval include inner product, cosine coefficient~ and Jaccard co- 
eJficient. 
5Thls is (tone by JUMAN, a Japanese morphological analyzer 
(Mat.sumoto et al., 1993). 
1146 
O 
m nJ 
o o 
S EGM EN T S 
Figure 2: A coherence curve 
4, Assign a coherence/similarity wdue to each pair. 
Next, we examine the coherence curve for hills and 
valleys and partition it manually. Valleys (viz, low co- 
herence val,es) are likely to signal a potential break 
in a discourse tlow, whereas hills (viz, high coherence 
values) would signal local coherency, l,'igure 2 shows 
how a coherence curve lnight appear. 
Coherence is measured at every segment with a 
paired comparison window moving along the sequence 
of segments. Or more precisely, given a segment dj 
and a block size n, what we do is to compare a block 
Sl)am'ing dj-t~4-1 through dj and one sl)anning dj+l 
through dj+n-1. The lneasnrement is l)lotted at the 
jth position on the x-axis. If either of |;lie comparison 
y 
l,'igure 3: A Moving Paired Window 
windows is underiilled, no measurelnent will be made. 
The graph that the procedure gives out is smoothed 
(with an appropriate width) to bring out its global 
trends. The length of a single segment, i.e., noun 
counts, wtries from text to text, genre to gem:e, rang- 
ing from a few words (a junior high science book) to 
somewhere around 60 words (a newspaper editorial). 
We performed experiments on a tbur-week col- 
lection of editorials fi'om Niho, Keizai Shimbun, a 
,lapanese economics newsl)al)er, which contains the to- 
tal of 1111 sentences with some 10,000 nouns, and 556 
discourse segments. The corpus was divided into seg- 
mental sets of nouns semi-autolnatically 6 (k)herence 
was measured ff)r each adjacent pair of segments, using 
6The lll~lllllill part consists in tumd-liltoering the corpus to 
eliminate non-topic marking htstemccs of tl~e particle wa, i.e., 
those that are suffixed to case particles such a.s t0 (C;ONmNCTIVI.;), 
de (LOcATIVF,/IN'rI'~UM~:NTAL), he (F:ImnCrlONA\[,), kara (SOUltCP,), 
nl (DATIVE), ere., or to & particular form of verbM inflection 
(renyou-kei, i.e. intinitive); thus wa is treated as non-topical 
unless it occm* as a postpo~itlon to the bare noun. 
the Dice coefficient. It wa~s found that the block size of 
10 segments yiehls good results. Figure 4 shows a co- 
herence graph covering ahmlt a week's amount of the 
corpus, The graph is smoothed with tile width of 5. 
We see that article boundaries (vertical lines) coincide 
fairly well with major minima on the graph: with only 
one miss at 65, which falls on a paragraph boundary. 
o£-= 
05' 
OA' 
03' 
O2 2 
02: 
0 
0 20 40 
I 
6O 80 100 120 :140 160 
l:igure 4: A Dice Analysis 
Experinmnts with w~rious block sizes suggest that 
the choice of block size relates in some way to the struco 
ture of discourse; an increasing block size would extract 
a more global or general structure of discourse. 
Youmans (1991) has suggested a information mea- 
surement b~sed on word frequency. It is intended to 
measure the ehb and flow of 'new information' in dis° 
coarse. The idea is simply to count the mmlber of new 
words introduced ow'x a moving interval and 1)roduce 
what he calls a vocabulary managemenl profile (VMI'), 
or lneasurements at intervals. Now given a discourse 
1) = {wl,...,w,,}, tile k-th interval of the size A is 
defined by lk :: {wt.,. •., w,.} where: 
k+A-I ifk<n-A 
v n otherwise 
Measurements are made at interwfls I1 through/n-l. 
11{o7 
1.60" 
140" 
\] 2o 
100" 
80" 
60" 
40- 
= 
0 • • .1".. N" 
-'~" ",,,r' 
.u.-.'T',.- I,,-I-, u-,,u ',u 
200 400 600 800 1.0001200140016001800 
WORDS (TOKENS) 
Figure 5: A VMP Analysis 
What we like to see is how the scheme ('Omlmres with 
1147 
0D9- 
OD8- 
0D7- 
0D6- 
0D5- 
004- 
003-- 
0D2- 
0D1-- 
O' 
, o • , , • 
i a i g t i i n a a I o i 
i i ! ! i ! 
l I l I I I I 
i I I I I I I ' : ; I : i 
i ffi i !^ , - 
I i ; I i 
I I l l , , , o : i 
' \[ ! ~ : : . . 
a i | | i i J i a l t i | 
I I I "I' ' I I I I I 
100 200 
, . i 
I 
I q 
J 
300 
i I I I ! 
l l : \[ 
' ' I ! 
a 
4OO 
| i 
| | i 
500 
Figure 6: The Dice on the Nikkei corpus 
' I 
600 
ours. Figure 5 shows the results of a VMP analysis for 
the same nominal collection as above. The interval is 
set to 300 words, or the average length of a paired win- 
dow in the previous analysis. The y-axis corresponds to 
the number of new words (TYro,:) and the x-axis to an 
interval position (TOK~;N). As it tnrns out, the VMP 
fails to detect any significant pattern in the corpus. 
One of the problems with tim analysis has to do with 
its generality (Kozima, 1993); a text with many repeti- 
tions and/or limited vocabulary would yield a flattened 
VMP, which, of course, does not tell us much about its 
inner strncturings. Indeed, this could be the case with 
Figure 5. We suspect that the VMP scheme fares bet- 
ter with a short-term coherency than with a long-term 
or global coherency. 
Evaluation 
l,'igure 6 demonstrates the results of the Dice analysis 
on the nikkei collection of editorial articles. What we 
see here is a close correspondence between the Dice 
curve an(l the global discourse structure. Evaluation 
here simply consists of finding major minima on the 
graph and locating them on the list of those discourse 
segments which comprise the corpus. The procedure is 
performed by hand. 
Correspondences evaluation has been l)roblemati- 
cal, since it requires human judgments o~ how dis- 
course is structnred, whose reliability is yet to be 
demonstrated. It was decided here to use copious 
boundary indicators such as an article or paragraph 
break for evalnating matches between the Dice analysis 
and the discourse. For us, discourse structure reduces 
to just an orthographic structure 7. 
In the figure, article boundaries are marked by 
dashes. 7 out of 27 local minima are found to be in- 
correct, which puts the error rate at around 25%. We 
obtained similar results for the Jaccard and cosine co- 
efficient. A possible improvement would include ad- 
justing the document frequency (d\]) factor for index 
terms; the average df factor we had for the Nikkei cor- 
pns is around 1.6, which is so low as to be negligible s. 
;'Yet, there is some evidence that an orthographic structure is 
liuguislically significant (Fujisawa et al., 1993; Nunberg, 19(00). 
8IIowever, the averagc df factor would increase in proportion 
Another interesting possibilty is to use an alternative 
weighting policy, the weighled invet~se documenl fre- 
quency (Tokunaga and Iwayama, 1994). A widfvalue 
of a term is its frequency within the docmnent divided 
by its frequency throughout the entire document col- 
lection. "\['he widf policy is reported to have a marked 
advantage over the idffor the text categorization task. 
Recall and Precision 
As with the document analysis, the effectiveness of text 
segmentation appears to be dictated by recall/precision 
parameters where: 
number of correct boundaries retrieved Fecall = 
total number o/'correct boundaries 
number of correct boundaries retrieved 
precision = total number of boundaries retrieved 
A boundary here is meant to be a minimum on the 
coherence graph. Precision is strongly affected by the 
size of block or intervalg; a large-block segmentation 
yields less boundaries than a small-block segmentation. 
(Table 1). Experiments were made on the Nikkci cor- 
Block Size 5 10 15 20 25 30 35 
Boundm'ies 52 25 24 20 21 19 12 
Table h Block size (in word) and the nnmber of bound- 
aries retrieved. 
pus to examine the effccts of the block size parameter 
on text segmentation. The corpus was divided into 
equMly sized blocks, ranging from 5 to 35 words in 
length. The window size was kept to 10 blocks. Shown 
in Figure 7 are the results given in terms of recall and 
precision. Also, a partitioning task with discourse seg- 
ments, whose length varies widely, is measured for re- 
call and precision, and the result is represented as G. 
Averaging recall and precision values for each size gives 
to the growth of corpus size. It is likely, therefore, that with a 
1148 
0.8- 
ftT- 
0.(r 
0.~ 
O4- 
0~ 
(LI- 
0 
0 
th'ecision 
+ 
35 
30 15 $+ + 
20 10 + 
25 
G + 
5+ 
0.1 0.2 113 0.4 
Recall 
"1 I' J 
0.7 (1.8 0.9 
Figure 7: Rec.all and Precision 
an ordering: 
35<25<20<5<30<15<10<(;. 
O ranks highest, whose average value, 0.66, is higher 
than any other. ~10' comes second (0.61) 1° . (It is an 
interesting coincidence that the average length of dis= 
course segments is \[3.7 words.) The results demon: 
strate in quantitative terms the significance of dis- 
eotlrse segments. 
It is worth t)ointing out that l, he method here ix 
rather selectlw'~ about a level of granularity it detects, 
namely, that of a news article. It is possible, how- 
ever, to have a much smaller granularity; as shown in 
Table 1, decreasing the block size would give a segmen- 
tation of a smaller granularity. Still, we chose not to 
work on fine-grained segmentations I)(;cause they lack 
a reliable evaluation metric ~ t. 
Conclusion 
In this l)aper, we haw; described a method for I)art,ition - 
ing ,qll unstructured eorlms into coherent textual milts. 
We have adoptc~d the view that discourse consists of 
contiguous, non-overlal)ping discourse segments. We 
have referred to a vector sl)acc model for a statisti- 
cal representation of discourse seglnellt. (Joherence b(> 
tween segments is determined by the Dice coefficient 
with the If .idf term weighting. 
We have demonstrated in quantitative terms that 
the me.thod here is quite suceessfld in discovering arti- 
cles fi'om the eort)us. An interesting question yet to be 
answered is how the corpus size affects the docul|mnt 
larger corpus, we might get, better results. 
°Blocks here a.re in|elided 1;o lll(}all minlntal l.extll;t| units into 
which a discourse is divided and fro' which c()hcrellCe is lttetmure(l. 
1°In general, recall is inversely proport.i(mate to precision; a 
high precision implies a low recall and vice versa. 
11 Passonneau and l,itman (199.' 0 reports a psychological study 
on the htlnt;I.n reliability of discourse segmental.ion. 
frequency and coherence measurements. Another prob- 
lem h~s to do with relating the present discussion to 
rhetorical analyses of discourse. 
References 
Sl,inji Fujisawa, Shigeru Masuyama, and Shozo Naito. 
1993. An Inspection oll Effect of Discourse 
Contraints pertaining to l,',llipsis Supplement in 
Japanese Sentences. In Kouen-Ronbun-Shuu 3 
(conference papers 3). Inforn\]ation Processing So- 
ciety of Japan. In Japanese. 
Barbara Grosz and Candance Sidner. 1986. Attention, 
Intentions and the. Structure of 1)iseourse. Compu- 
tational Linguistics, 12(3):175--204. 
Mart| A. Hearst. 1993. TextTiling: A Quantitative 
Approach to l)iseom'se Segmentation. Sequoia 2000 
93/24, University of California, Berkeley. 
Jerry R. l\[obbs. 1979. Coherence and Corefernce. 
Cognitive 5'cicncc, 3(l):67-90. 
F, duard II. l\[ovy. 1990. I)arsimonious and Profligate 
A1)proaches to the Question of l)iscourse Structure 
I(elations. In 5th ACI, Workshop on Natural Lan- 
guage Geucralion, l)awson, Pemnsylwmia. 
llideki Kozima. 19!i)3. 3\'.xt Segmentation Based on 
Similarity Between Words. In Proceedings of the 
31st Annual Meelin.g of the ACL. 
W. C. Mann and S. A. q'homI)son. 1987. Rhetorical 
Structure Theory. lu L. Polyani, editor, The S'truc- 
ture of Discourse. Ablex Publishing Corp., Nor- 
wood, Nil. 
Yuji Matsmnoto, Sadao I(urohashi, Takehito Utsuro, 
Yutaka Tack|, and Makoto Nagao, 19.q3. Japanese 
Morphological Analysis System JUMAN Manual. 
Kyoto University. In Japanese. 
Akira Mikami. 1960. Zou wa tlana ga Nagai (7'he 
elephant has a long trunk.). Kuroshio Shuppan, 
Tokyo. 
Tadashi Nomoto and Yoshihiko Nitta. 1993. Resolv- 
ing Zero Anaphora in Japanese. In A C\[, Proceed- 
ings of Sizlh European CoT@r'ence, pages 315 321, 
Utrecht, The Netherla,lds. 
(~eotl'rey Nunberg. 1990. The LinguiMics of Fuuctua- 
lion, volume 18 of (/SLI Leclure holes. CSLI. 
ll.ebecca J. Passotmeau and Diane J. I,itman. i\[9!)3. 
Intention-based Segmentation: lluman Relial)ility 
and (Jorrclation with Linguistic Cues. In Proceed- 
ings of the 3lst Annual Meeting of the Associalion 
for Computational Linguistics. The Association for 
Computational I,inguistics. Ohio State University, 
(~'olulnbus, Ohio, USA. 
1149 
Gerald Salton. 1988. Automatic Text Processing: the 
"l~'ansfovmalion, Analysis, and Retl%val of Infor- 
mation by Compuler. Addison-Wesley, Reading, 
MA. 
Takenobn Tokunaga and Makoto Iwayama. 1994. Text 
Categorization based on Weighted Inverse Docu- 
ment Frequency. unpublished manuscript, submit- 
ted to ACM SIGIR 1994. 
Gilbert Youmans. 1991. A New Tool for Dis- 
course Analysis:The Vocabulary Management Pro- 
file. Language, 67:763-789. 
1750 
