Cohesion and Collocation: 
Using Context Vectors in Text Segmentation 
Stefan Kaufmann 
CSLI, Stanford University 
Linguistics Dept., Bldg. 460 
Stanford, CA 94305-2150, U.S.A. 
kaufmann@csli.stanford.edu
Abstract 
Collocational word similarity is considered a source 
of text cohesion that is hard to measure and quan- 
tify. The work presented here explores the use of in- 
formation from a training corpus in measuring word 
similarity and evaluates the method in the text seg- 
mentation task. An implementation, the VecTile 
system, produces similarity curves over texts using 
pre-compiled vector representations of the contex- 
tual behavior of words. The performance of this 
system is shown to improve over that of the purely 
string-based TextTiling algorithm (Hearst, 1997). 
1 Background 
The notion of text cohesion rests on the intuition 
that a text is "held together" by a variety of inter- 
nal forces. Much of the relevant linguistic literature 
is indebted to Halliday and Hasan (1976), where co- 
hesion is defined as a network of relationships be- 
tween locations in the text, arising from (i) gram- 
matical factors (co-reference, use of pro-forms, ellip- 
sis and sentential connectives), and (ii) lexical fac- 
tors (reiteration and collocation). Subsequent work 
has further developed this taxonomy (Hoey, 1991) 
and explored its implications in such areas as
paragraphing (Longacre, 1979; Bond and Hayes, 1984;
Stark, 1988), relevance (Sperber and Wilson, 1995) 
and discourse structure (Grosz and Sidner, 1986). 
The lexical variety of cohesion is semantically de- 
fined, invoking a measure of word similarity. But 
this is hard to measure objectively, especially in the 
case of collocational relationships, which hold be- 
tween words primarily because they "regularly co- 
occur." Halliday and Hasan refrained from a deeper 
analysis, but hinted at a notion of "degrees of prox- 
imity in the lexical system, a function of the prob- 
ability with which one tends to co-occur with an- 
other." (p. 290) 
The VecTile system presented here is designed 
to utilize precisely this kind of lexical relationship, 
relying on observations on a large training corpus 
to derive a measure of similarity between words and 
text passages. 
2 Related Work 
Previous approaches to calculating cohesion dif- 
fer in the kind of lexical relationship they quan- 
tify and in the amount of semantic knowledge they 
rely on. Topic parsing (Hahn, 1990) utilizes both 
grammatical cues and semantic inference based on 
pre-coded domain-specific knowledge. More general
approaches assess word similarity based on thesauri
(Morris and Hirst, 1991) or dictionary definitions
(Kozima, 1994).
Methods that solely use observations of patterns
in vocabulary use include vocabulary management
(Youmans, 1991) and the blocks algorithm implemented
in the TextTiling system (Hearst, 1997).
The latter is compared below with the system intro- 
duced here. 
A good recent overview of previous approaches 
can be found in Chapters 4 and 5 of (Reynar, 1998). 
3 The Method 
3.1 Context Vectors 
The VecTile system is based on the WordSpace
model of (Schütze, 1997; Schütze, 1998). The idea
is to represent words by encoding the environments 
in which they typically occur in texts. Such a rep- 
resentation can be obtained automatically and often 
provides sufficient information to make deep linguis- 
tic analysis unnecessary. This has led to promis- 
ing results in information retrieval and related ar- 
eas (Flournoy et al., 1998a; Flournoy et al., 1998b). 
Given a dictionary W and a relatively small set
C of meaningful "content" words, for each pair in
W × C, the number of times the two co-occur within
some measure of distance in a training corpus is
recorded. This yields a |C|-dimensional vector
for each w ∈ W. The direction that the vector has in
the resulting |C|-dimensional space then represents
the collocational behavior of w in the training
corpus. In the present implementation, |W| = 20,500
and |C| = 1000. For computational efficiency and to
avoid the high number of zero values in the resulting
matrix, the matrix is reduced to 100 dimensions
using Singular-Value Decomposition (Golub and van
Loan, 1989).
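
The counting step can be illustrated with a minimal sketch. The function name, the toy corpus, and the window of two tokens on either side are assumptions made for illustration (the text only speaks of "some measure of distance"), and the reduction by Singular-Value Decomposition is left out.

```python
def context_vectors(tokens, content_words, window=2):
    """For each word, count how often each content word occurs
    within `window` tokens of it (an assumed distance measure)."""
    index = {c: j for j, c in enumerate(content_words)}
    counts = {}
    for i, word in enumerate(tokens):
        vec = counts.setdefault(word, [0] * len(content_words))
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for k in range(lo, hi):
            # Skip the word itself; count only neighbors that are content words.
            if k != i and tokens[k] in index:
                vec[index[tokens[k]]] += 1
    return counts

# Toy corpus, invented for illustration only.
corpus = "the bank raised interest rates while the river bank flooded".split()
vectors = context_vectors(corpus, content_words=["interest", "river"], window=2)
print(vectors["bank"])
```

Each row of the resulting table is one |C|-dimensional context vector; in a realistic setting the rows would be collected into a matrix before dimensionality reduction.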
Figure 1: Example of a VecTile similarity plot
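As a measure of similarity in collocational behavior
between two words, the cosine between their vectors
is computed: Given two n-dimensional vectors
$\vec{v}$, $\vec{w}$,

$$\cos(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2 \, \sum_{i=1}^{n} w_i^2}} \qquad (1)$$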
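
As a small illustration of the cosine measure, the following sketch computes it for plain Python lists. Returning 0.0 for a zero vector is an assumption of this sketch; the text does not say how that case is handled.

```python
import math

def cosine(v, w):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    # Assumed convention: a zero vector is similar to nothing.
    return dot / norm if norm else 0.0

same_direction = cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```

Because only the direction of the vectors matters, a frequent word and a rare word with the same collocational profile receive a cosine of 1.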
3.2 Comparing Window Vectors
In order to represent pieces of text larger than sin- 
gle words, the vectors of the constituent words are 
added up. This yields new vectors in the same space, 
which can again be compared against each other and 
word vectors. If the word vectors in two adjacent 
portions of text are added up, then the cosine be- 
tween the two resulting vectors is a measure of the 
lexical similarity between the two portions of text. 
The VecTile system uses word vectors based on 
co-occurrence counts on a corpus of New York Times 
articles. Two adjacent windows (200 words each in 
this experiment) move over the input text, and at 
pre-determined intervals (every 10 words), the vec- 
tors associated with the words in each window are 
added up, and the cosine between the resulting window
vectors is assigned to the gap between the windows
in the text. High values indicate lexical closeness.
Troughs in the resulting similarity curve mark
spots with low cohesion.
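
The moving-window procedure can be sketched as follows. The function names and the toy vectors are hypothetical; the defaults of 200 and 10 follow the settings reported in the text. Words without a context vector are simply skipped here, which is an assumption of the sketch.

```python
import math

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

def window_vector(words, vectors, dim):
    """Add up the context vectors of all known words in a window."""
    total = [0.0] * dim
    for word in words:
        for j, x in enumerate(vectors.get(word, ())):
            total[j] += x
    return total

def similarity_curve(tokens, vectors, dim, window=200, step=10):
    """Return (gap position, cosine) pairs for two adjacent windows
    that move over the token sequence in increments of `step`."""
    curve = []
    for gap in range(window, len(tokens) - window + 1, step):
        left = window_vector(tokens[gap - window:gap], vectors, dim)
        right = window_vector(tokens[gap:gap + window], vectors, dim)
        curve.append((gap, cosine(left, right)))
    return curve

# Toy data, invented for illustration: two "topics" with orthogonal vectors.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_tokens = ["a"] * 4 + ["b"] * 4
toy_curve = similarity_curve(toy_tokens, toy_vectors, dim=2, window=2, step=2)
```

In the toy input the curve drops at the gap where the vocabulary changes, which is the kind of trough the segmenter looks for.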
3.3 Text Segmentation 
To evaluate the performance of the system and facil- 
itate comparison with other approaches, it was used 
in text segmentation. The motivating assumption 
behind this test is that cohesion reinforces the topical
unity of subparts of text and lack of it correlates
with their boundaries; hence, if a system correctly
predicts segment boundaries, it is indeed measuring
cohesion. For want of a way of observing cohesion 
directly, this indirect relationship is commonly used 
for purposes of evaluation. 
4 Implementation 
The implementation of the text segmenter resembles
that of the TextTiling system (Hearst, 1997).
The words from the input are stemmed and asso- 
ciated with their context vectors. The similarity 
curve over the text, obtained as described above, 
is smoothed out by a simple low-pass filter, and low 
points are assigned depth scores according to the dif- 
ference between their values and those of the sur- 
rounding peaks. The mean and standard deviation 
of those depth scores are used to calculate a cutoff 
below which a trough is judged to be near a sec- 
tion break. The nearest paragraph boundary is then 
marked as a section break in the output. 
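
The post-processing steps described above might be sketched as follows. Several details are assumptions not given in the text: the filter is a moving average of width 3, the depth score is the sum of the drops from the nearest peak on either side, and the cutoff is the mean depth minus half a standard deviation (parameter k).

```python
import statistics

def smooth(values, width=3):
    """Moving-average low-pass filter; the width of 3 is an assumed setting."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def depth_scores(values):
    """Map each strict local minimum to the summed drop from its two
    neighboring peaks."""
    scores = {}
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            left = i
            while left > 0 and values[left - 1] >= values[left]:
                left -= 1
            right = i
            while right < len(values) - 1 and values[right + 1] >= values[right]:
                right += 1
            scores[i] = (values[left] - values[i]) + (values[right] - values[i])
    return scores

def find_troughs(values, k=0.5):
    """Keep troughs whose depth reaches mean - k * stdev (k is assumed)."""
    scores = depth_scores(values)
    if not scores:
        return []
    depths = list(scores.values())
    cutoff = statistics.mean(depths) - k * statistics.pstdev(depths)
    return sorted(i for i, depth in scores.items() if depth >= cutoff)

def nearest_paragraph_break(position, paragraph_breaks):
    """Move a trough position to the closest paragraph boundary."""
    return min(paragraph_breaks, key=lambda b: abs(b - position))

# Invented similarity values; only the deep trough at index 3 should survive.
curve = [0.9, 0.8, 0.9, 0.5, 0.9, 0.85, 0.9]
troughs = find_troughs(curve)
```

In a full pipeline, `smooth` would be applied to the similarity curve before `find_troughs`, and each surviving trough would be passed through `nearest_paragraph_break`.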
An example of a text similarity curve is given in 
Figure 1. Paragraph numbers are inside the plot at 
the bottom. Speaker judgments by five subjects are 
inserted in five rows in the upper half. 
Table 1: Precision and recall (in %) on the text segmentation task

          TextTiling     VecTile       Subjects
Text #    Prec   Rec     Prec   Rec    Prec   Rec
1          60     50      60     50     75     77
2          14     20     100     80     76     76
3          50     50      50     50     72     73
4          25     50      10     25     70     75
5          10     25      40     50     70     74
avg        32     40      52     51     73     75
The crucial difference between this and the 
TextTiling system is that the latter builds win- 
dow vectors solely by counting the occurrences of 
strings in the windows. Repetition is rewarded by 
the present approach, too, as identical words contribute
most to the similarity between the block vectors.
However, similarity scores can be high even
in the absence of pure string repetition, as long as 
the adjacent windows contain words that co-occur 
frequently in the training corpus. Thus what a di- 
rect comparison between the systems will show is 
whether the addition of collocational information 
gleaned from the training corpus sharpens or blunts
the judgment. 
For comparison, the TextTiling algorithm was
implemented and run with the same window size 
(200) and gap interval (10). 
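
The contrast with a purely string-based comparison can be made concrete with a short sketch. It assumes that the window vectors are raw term counts compared by cosine; this is an illustration of the idea, not a reimplementation of TextTiling.

```python
import math
from collections import Counter

def string_based_similarity(left_tokens, right_tokens):
    """Cosine between term-count vectors of two windows.
    Only literally repeated strings contribute to the score."""
    left, right = Counter(left_tokens), Counter(right_tokens)
    dot = sum(left[t] * right[t] for t in left.keys() & right.keys())
    norm = (math.sqrt(sum(c * c for c in left.values()))
            * math.sqrt(sum(c * c for c in right.values())))
    return dot / norm if norm else 0.0

# Invented windows: related vocabulary, but no shared strings.
no_overlap = string_based_similarity(["doctor", "nurse"], ["hospital", "patient"])
full_overlap = string_based_similarity(["doctor", "nurse"], ["nurse", "doctor"])
```

Two windows that share no word forms score zero under this measure however closely related their vocabulary is, whereas context vectors for words that co-occur frequently in the training corpus would still yield a positive cosine.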
5 Evaluation 
5.1 The Task 
In a pilot study, five subjects were presented with 
five texts from a popular-science magazine, all be- 
tween 2,000 and 3,400 words, or between 20 and 35 
paragraphs, in length. Section headings and any 
other clues were removed from the layout. Para- 
graph breaks were left in place. Thus the task was 
not to find paragraph breaks, but breaks between 
multi-paragraph passages that, according to the
subject's judgment, marked topic shifts. All subjects
were native speakers of English. [1]
[1] The instructions read:
"You will be given five magazine articles of roughly equal 
length with section breaks removed. Please mark the places 
where the topic seems to change (draw a line between para- 
graphs). Read at normal speed, do not take much longer than 
you normally would. But do feel free to go back and recon- 
sider your decisions (even change your markings) as you go 
along. 
Also, for each section, suggest a headline of a few words that 
captures its main content. 
If you find it hard to decide between two places, mark both, 
giving preference to one and indicating that the other was a 
close rival." 
5.2 Results 
To obtain an "expert opinion" against which to
compare the algorithms, a paragraph boundary was
counted as a "correct" section break if at least
three out of the five subjects had marked it.
(Three out of seven (Litman and Passonneau, 1995; 
Hearst, 1997) or 30% (Kozima, 1994) are also some- 
times deemed sufficient.) For the two systems as well 
as the subjects, precision and recall with respect to 
the set of "correct" section breaks were calculated. 
The results are listed in Table 1. 
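
The scoring procedure can be sketched as follows. The function names and the sample markings are invented for illustration; boundaries are represented as paragraph indices.

```python
from collections import Counter

def consensus_breaks(markings, min_votes=3):
    """Boundaries marked by at least `min_votes` subjects."""
    votes = Counter(b for marked in markings for b in set(marked))
    return {b for b, n in votes.items() if n >= min_votes}

def precision_recall(predicted, gold):
    """Precision and recall of predicted boundaries against the gold set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Invented markings of five subjects (paragraph indices).
subjects = [{3, 7, 12}, {3, 8, 12}, {3, 12}, {4, 7}, {12, 15}]
gold = consensus_breaks(subjects)
precision, recall = precision_recall({3, 9, 12, 20}, gold)
```

Note that a predicted boundary counts as a hit only when it coincides exactly with a consensus boundary.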
The context vectors clearly led to an improved 
performance over the counting of pure string repeti- 
tions. 
The simple assignment of section breaks to the 
nearest paragraph boundary may have led to noise 
in some cases; moreover, it is not really part of 
the task of measuring cohesion. Therefore the texts 
were processed again, this time moving the windows 
over whole paragraphs at a time, calculating gap- 
values at the paragraph gaps. For each paragraph 
break, the number of subjects who had marked it 
as a section break was taken as an indicator of the 
"strength" of the boundary. There was a significant 
negative correlation between the values calculated 
by both systems and that measure of strength, with 
r = -.338 (p = .0002) for the VecTile system and
r = -.220 (p = .0172) for TextTiling. In other
words, deep gaps in the similarity measure are
associated with strong agreement between subjects that
the spot marks a section boundary. Although r²
is low in both cases, the VecTile system yields more
significant results.
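
For illustration, the correlation between gap values and boundary strength can be computed as in the following sketch. The data are invented and the significance test is omitted; only Pearson's r is shown.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented example: similarity at five paragraph gaps and the number
# of subjects (0 to 5) who marked each gap as a section break.
gap_values = [0.95, 0.80, 0.92, 0.70, 0.90]
votes = [0, 3, 1, 5, 0]
r = pearson_r(gap_values, votes)
```

A negative r means that lower similarity values go together with more subjects marking the gap, which is the direction reported above.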
5.3 Discussion and Further Work 
The results discussed above need further support 
with a larger subject pool, as the level of agreement
among the judges was at the low end of what
can be considered significant. This is shown by 
the Kappa coefficients, measured against the expert 
opinion and listed in Table 2. The overall average 
was .594. 
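
The text does not state which variant of the Kappa statistic was used. As an assumption, the following sketch computes Cohen's kappa between one subject and the expert opinion, treating every paragraph gap as a binary decision (1 = section break, 0 = no break). The example data are invented.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two binary label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from the two marginal break rates.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented decisions over ten paragraph gaps.
subject = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
expert = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
kappa = cohen_kappa(subject, expert)
```

Because most gaps are not section breaks, raw agreement is high by chance alone; kappa corrects for this.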
Despite this caveat, the results clearly show that 
adding collocational information from the training 
Table 2: Kappa coefficients

                        Subject #
Text #        1      2      3      4      5     All
1           .775   .629   .596   .444   .642   .617
2           .723   .649   .491   .753   .557   .635
3           .859   .121   .173   .538   .738   .486
4           .870   .532   .635   .299   .870   .641
5           .833   .500   .625   .423   .500   .576
All texts   .814   .491   .508   .481   .675   .594
corpus improves the prediction of section breaks, 
hence, under common assumptions, the measure- 
ment of lexical cohesion. It is likely that these en- 
couraging results can be further improved. Follow- 
ing are a few suggestions of ways to do so. 
Some factors work against the context vector 
method. For instance, the system currently has no 
mechanism to handle words that it has no context 
vectors for. Often it is precisely the co-occurrence 
of uncommon words not in the training corpus (per- 
sonal names, rare terminology etc.) that ties text 
together. Such cases pose no challenge to the string- 
based system, but the VecTile system cannot utilize 
them. The best solution might be a hybrid system 
with a backup procedure for unknown words. 
Another point to note is how well the much simpler
TextTiling system compares. Indeed, a close look
at the figures in Table 1 reveals that the better re- 
sults of the VecTile system are due in large part to 
one of the texts, viz. #2. Considering the additional 
effort and resources involved in using context vec- 
tors, the modest boost in performance might often 
not be worth the effort in practice. This suggests 
that pure string repetition is a particularly strong 
indicator of similarity, and the vector-based system 
might benefit from a mechanism to give those vec- 
tors a higher weight than co-occurrences of merely 
similar words. 
Another potentially important parameter is the 
nature of the training corpus. In this case, it con- 
sisted mainly of news texts, while the texts in the 
experiment were scientific expository texts. A more 
homogeneous setting might have further improved 
the results. 
Finally, the evaluation of results in this task is 
complicated by the fact that "near-hits" (cases in 
which a section break is off by one paragraph) do 
not have any positive effect on the score. This problem
has been dealt with in the Topic Detection and
Tracking (TDT) project by a more flexible score that 
becomes gradually worse as the distance between hy- 
pothesized and "real" boundaries increases (TDT, 
1997a; TDT, 1997b). 
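
One simple way to credit near-hits is sketched below. It is not the TDT score, which degrades gradually with distance; it merely counts a hypothesized boundary as a hit if it lies within a fixed tolerance of a reference boundary that has not been matched yet. The function name, the greedy matching, and the tolerance of one paragraph are assumptions.

```python
def tolerant_hits(predicted, gold, tolerance=1):
    """Count predicted boundaries within `tolerance` paragraphs of a
    still-unmatched gold boundary (greedy, left to right)."""
    remaining = sorted(gold)
    hits = 0
    for p in sorted(predicted):
        match = next((g for g in remaining if abs(g - p) <= tolerance), None)
        if match is not None:
            remaining.remove(match)  # each gold boundary is credited once
            hits += 1
    return hits

# Invented example: the break predicted at 4 is one paragraph off.
strict = tolerant_hits([4, 10], [3, 10, 20], tolerance=0)
lenient = tolerant_hits([4, 10], [3, 10, 20], tolerance=1)
```

Precision and recall can then be computed from the hit count in the usual way.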
Acknowledgements 
Thanks to Stanley Peters, Yasuhiro Takayama, Hinrich
Schütze, David Beaver, Edward Flemming and
three anonymous reviewers for helpful discussion 
and comments, to Stanley Peters for office space 
and computational infrastructure, and to Raymond 
Flournoy for assistance with the vector space. 

References 
S.J. Bond and J.R. Hayes. 1984. Cues people use 
to paragraph text. Research in the Teaching of 
English, 18:147-167. 
Raymond Flournoy, Ryan Ginstrom, Kenichi Imai, 
Stefan Kaufmann, Genichiro Kikui, Stanley Peters,
Hinrich Schütze, and Yasuhiro Takayama.
1998a. Personalization and users' semantic expec- 
tations. ACM SIGIR'98 Workshop on Query In- 
put and User Expectations, Melbourne, Australia. 
Raymond Flournoy, Hiroshi Masuichi, and Stanley
Peters. 1998b. Cross-language information
retrieval: Some methods and tools. In D. Hiemstra,
F. de Jong, and K. Netter, editors, TWLT 13 Lan- 
guage Technology in Multimedia Information Re- 
trieval, pages 79-83. 
Talmy Givón, editor. 1979. Discourse and Syntax.
Academic Press. 
G. H. Golub and C. F. van Loan. 1989. Matrix
Computations. Johns Hopkins University Press.
Barbara J. Grosz and Candace L. Sidner. 1986. At- 
tention, intentions, and the structure of discourse. 
Computational Linguistics, 12(3):175-204.
Udo Hahn. 1990. Topic parsing: Accounting for text 
macro structures in full-text analysis. Information 
Processing and Management, 26:135-170. 
Michael A.K. Halliday and Ruqaiya Hasan. 1976. 
Cohesion in English. Longman. 
Marti Hearst. 1997. TextTiling: Segmenting text
into multi-paragraph subtopic passages. Compu- 
tational Linguistics, 23(1):33-64. 
Michael Hoey. 1991. Patterns of Lexis in Text. Ox- 
ford University Press. 
Hideki Kozima. 1994. Computing Lexical Cohesion 
as a Tool for Text Analysis. Ph.D. thesis, Univer- 
sity of Electro-Communications. 
Chin-Yew Lin. 1997. Robust Automatic
Topic Identification. Ph.D. thesis, University
of Southern California. [Online]
http://www.isi.edu/~cyl/thesis/thesis.html
[1999, April 24].
Diane J. Litman and Rebecca J. Passonneau. 1995. 
Combining multiple knowledge sources for dis- 
course segmentation. In Proceedings of the 33rd 
ACL, pages 108-115. 
R.E. Longacre. 1979. The paragraph as a grammatical
unit. In Givón (Givón, 1979), pages 115-134.
Jane Morris and Graeme Hirst. 1991. Lexical co- 
hesion computed by thesaural relations as an in- 
dication of the structure of text. Computational 
Linguistics, 17(1):21-48. 
Jeffrey C. Reynar. 1998. Topic Segmentation:
Algorithms and Applications. Ph.D.
thesis, University of Pennsylvania. [Online]
http://www.cis.edu/~jcreynar/research.html
[1999, April 24].
K. Richmond, A. Smith, and E. Amitay. 1997. 
Detecting subject boundaries within text: A 
language independent statistical approach. In 
Proceedings of The Second Conference on Empirical
Methods in Natural Language Processing
(EMNLP-2).
Hinrich Schütze. 1997. Ambiguity Resolution in
Language Learning. CSLI. 
Hinrich Schütze. 1998. Automatic word sense
discrimination. Computational Linguistics, 
24(1):97-123. 
Dan Sperber and Deirdre Wilson. 1995. Relevance:
Communication and Cognition. Harvard Univer- 
sity Press, 2nd edition. 
Heather Stark. 1988. What do paragraph markings 
do? Discourse Processes, 11(3):275-304. 
TDT. 1997a. The TDT Pilot Study Corpus Documentation
version 1.3, 10. Distributed by the Linguistic
Data Consortium. 
TDT. 1997b. The Topic Detection and Tracking (TDT) Pilot
Study Evaluation Plan, 10. Distributed by the
Linguistic Data Consortium. 
Gilbert Youmans. 1991. A new tool for discourse 
analysis: The vocabulary-management profile. 
Language, 67(4):763-789.
