COLING 82, J. Horecký (ed.)
North-Holland Publishing Company 
© Academia, 1982 
LEXICAL PARALLELISM IN TEXT STRUCTURE DETERMINATION 
AND CONTENT ANALYSIS 
Yoshiyuki Sakamoto 
Electrotechnical Laboratory 
Tsukuba, Japan 
Tetsuya Okamoto 
University of Electrocommunications 
Tokyo, Japan 
ABSTRACT 
This paper discusses text structure determination and content
analysis by lexical parallelism, i.e. the repetition of lexical
items. Intersentential relations are determined through
identical, partly identical or lexico-semantic repetition in
Japanese scientific texts. The lexical parallelism ratio and the
lexical parallelism indicator distance are obtained by computer
and by hand, and the application of these characteristics to
automatic content analysis is discussed.
1. INTRODUCTION
Lexical parallelism, that is, the
repetition of lexical items, is an
important device for indicating the
sentence connections in a text
(discourse). The recurrent lexical
items, or lexical equivalents, need not
have the same syntactic function or
part of speech in the two sentences
in which they occur. They may be
identical in form and in meaning, or
they may be related by a lexico-semantic
relationship, such as synonymy,
hyponymy or antonymy. In a special case
they may be partly identical both in
form and in meaning, as in the Japanese
words for "ultrasonic wave", "sound
wave" and "sound".
Another device for indicating the
sentence connections is a syntactic
device, such as substitutes, logical
connecters, time and place relaters
and structural parallelism [1]. In
Japanese, for example, this device
includes substitutes (the words for
"this", "here", "we/our", "it"), time
relaters ("next", "above mentioned")
and logical connecters ("and", "or",
"secondly").
Sevbo studied lexical parallelism
in normalized text, where substitutes
were replaced by their lexical
equivalents and complex sentences were
decomposed into successive simple
sentences (clauses).
She traced the repetition patterns
of lexical items in the Subject/Predicate
opposition. She assumed that
the syntactic subject or its
dependent, direct or indirect,
corresponds to the "Subject (old
information) of elementary thought"
and the syntactic predicate or its
dependent to the "Predicate (new
information) of elementary
thought" [2].
In Japanese, sentence components
may occur in any position before the
predicate, and old information or topic
is placed, as a rule, at/near the
beginning of a sentence [3]. In the
following discussion we analyze the
repetition of lexical items in an
unnormalized text without regard to
their syntactic functions, parts of
speech and topic/comment distinctions,
assuming that the lexical equivalents
at/near the beginning of the sentences
function as the keywords in indicating
the sentence connections and the
contents of a text.
Nouns do not inflect, and most verbs
and adjectives have unchanging
stems and inflectional suffixes in
Japanese. The important concepts and
technical terms (noun, verb or
adjective stems) are written in Kanji
(Chinese ideographs) or
Katakana (square Japanese syllabary).
Katakana is used to transcribe foreign
technical terms. Hiragana (Japanese
cursive syllabary), on the other hand,
is used to write postpositional
particles and suffixes denoting case,
topic, mood, tense, aspect, etc. In
view of these facts we define a lexical
item as a word or phrase in Kanji or
Katakana.
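As an illustration of this definition, the following Python sketch extracts lexical items from a sentence by character class alone. It is not the original program. It assumes Unicode input (the 1982 experiment used the machine character codes of its day), and it assumes that a maximal run of Kanji and/or Katakana characters counts as one lexical item. The function name lexical_items and the code point ranges are illustrative choices.

```python
import re

# Assumption: one lexical item = one maximal run of Kanji and/or Katakana.
# U+4E00-U+9FFF: CJK Unified Ideographs (Kanji); U+3005: iteration mark;
# U+30A1-U+30FA: Katakana letters; U+30FC: Katakana long-vowel mark.
_LEXICAL_RUN = re.compile(r"[\u4e00-\u9fff\u3005\u30a1-\u30fa\u30fc]+")


def lexical_items(sentence: str) -> list[str]:
    """Return the Kanji/Katakana runs of a sentence in order of appearance.

    Hiragana (particles, inflectional suffixes), punctuation and Latin
    characters act as separators and are discarded.
    """
    return _LEXICAL_RUN.findall(sentence)


if __name__ == "__main__":
    # Hiragana particles split the sentence into three lexical items.
    print(lexical_items("超音波は音波の一種である"))
    # A Katakana loanword is kept as one item, including the long-vowel mark.
    print(lexical_items("コウモリはレーダーに似ている"))
```

No syntactic or morphological analysis is involved, so a verb such as 似ている contributes only its Kanji stem 似.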
We have studied lexical
parallelism in a short tale [4] and in
technical and scientific texts [5,6],
based upon Sevbo's approach. The
purpose of the present paper is to
obtain the characteristics of lexical
parallelism in Japanese technical and
scientific texts and to explore the
possibilities of utilizing these
characteristics for automatic content
analysis.
Five text samples are used for
experiment and discussion. They are
essays on "Ultrasonic
amplification" (Text A), "Brain and
automaton" (Text B), "Petrochemical
industry" (Text C), "Chemical industry
in Japan" (Text D) and "Between
organism and inanimate matter" (Text
E).
2. LEXICAL PARALLELISM RATIO
The sentence connection of type t in position w is determined
between the given j-th sentence Sj and the i-th sentence Si
(i < j) if and only if Si is the nearest preceding sentence which
contains a lexical item lexically equivalent, through the type t
repetition, to the w-th lexical item from the beginning of the
given sentence Sj (t = 1,2,3; w = 1,2,3,4,5). The repetitions of
type 1, 2 and 3 correspond to the identical, partly identical and
lexico-semantic repetitions, respectively.
The lexical equivalents in Sj and Si are called lexical
parallelism indicators, and Sj is called a dependent on Si.
Lexical parallelism ratio of type t in position w is defined as
follows:

$$ \text{ratio}^{t}_{w} = \frac{n}{N-1} \times 100 $$

where n is the number of the determined connections in a text;
N-1 is the determinable maximum number of the sentence
connections in a text, N being the total number of the sentences
in the text; t is the type of lexical repetition and w is the
position, i.e. the sequence number from the beginning of the
sentence.
Experiments were carried out by computer and by hand to obtain
the characteristics of the lexical parallelism in the sample
texts.
In the computer experiment lexical items, i.e. the sequences in
Kanji or Katakana, were identified and segmented by machine
character codes without syntactic and morphological analysis.
Then the sentence connections of type 1 (identical repetition)
were determined in each position and the lexical parallelism
ratios were obtained (Table 1). On the same samples the optimal
sentence connections were determined manually and the lexical
parallelism ratios were calculated (Table 2). Except for Text E,
the totals of the ratios amount to 72-83% (cf. Table 2), and in
the computer experiment the ratios of type 1 in the initial
position amount to 57-68% (cf. Table 1). Moreover, the initial
lexical items (w=1) show the maxima in most samples in Table 1
and by far the highest value in all samples in Table 2, and the
ratios decrease with increasing w in Table 2. It is clear from
the results that lexical parallelism plays an important role in
the intersentential dependency and that lexical items at the
beginning of the sentences are the most relevant lexical
parallelism indicators.

3. LEXICAL PARALLELISM INDICATOR DISTANCE
As an example, the intersentential dependency determined manually
in Text A, which is the essay on "Ultrasonic amplification" with
123 sentences in four paragraphs, is shown in Table 3 and
Figure 1. The lexical parallelism indicator distances are shown
as well.
Lexical parallelism indicator distance is defined as follows:

$$ D^{t}_{w,j} = j - i $$

where D is the lexical parallelism indicator distance; t is the
type of lexical repetition; w is the position of the lexical
indicator; i and j are the sequence numbers of the governor
sentence and the dependent sentence, respectively.
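The determination rule and the distance can be illustrated with a short Python sketch. It covers only type 1 (identical) repetition, since types 2 and 3 need a dictionary of partial and lexico-semantic equivalents that is not described here. Sentences are assumed to be already segmented into lists of lexical items. The function names and the English toy data are hypothetical and serve only as illustration.

```python
from typing import Optional

Sentence = list[str]  # lexical items of one sentence, in order of appearance


def nearest_governor(sentences: list[Sentence], j: int, w: int) -> Optional[int]:
    """Return the number i (1-based, i < j) of the nearest preceding sentence
    that contains the w-th lexical item of sentence j, or None if there is
    no such sentence. Only identical repetition (type 1) is checked."""
    items = sentences[j - 1]
    if w > len(items):
        return None  # sentence j has fewer than w lexical items
    target = items[w - 1]
    for i in range(j - 1, 0, -1):  # scan backwards: j-1, j-2, ..., 1
        if target in sentences[i - 1]:
            return i
    return None


def connections(sentences: list[Sentence], w: int) -> dict[int, int]:
    """Map each dependent sentence j to its governor i for position w."""
    found = {}
    for j in range(2, len(sentences) + 1):
        i = nearest_governor(sentences, j, w)
        if i is not None:
            found[j] = i
    return found


def distances(conns: dict[int, int]) -> dict[int, int]:
    """Lexical parallelism indicator distance D = j - i per dependent j."""
    return {j: j - i for j, i in conns.items()}


if __name__ == "__main__":
    # Hypothetical toy text, one list of lexical items per sentence.
    toy = [
        ["sound", "wave"],
        ["ultrasound", "wave"],
        ["sound", "ear"],
        ["bat", "ultrasound"],
    ]
    first = connections(toy, w=1)
    print(first, distances(first))
    print(connections(toy, w=2))
```

In the toy text only sentence 3 is connected through its initial item ("sound", found again in sentence 1), which gives D = 2. In the second position, sentences 2 and 4 find their governors in sentences 1 and 2.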
The distance is supposed to
represent the semantic extent of the
lexical parallelism indicators, or
better, of the concepts referred to by them.
In Figure 1 a diagonal unit
distance line indicates the
hypothetical situation where every
sentence depends on the immediately
preceding sentence. The data tend to
lie near this line
in all samples.
The lexical parallelism indicators in
Table 3 show the progress of the
author's thought in the text. Sevbo pointed
out the significance of the indicators
with large D in indicating the
contents of paragraphs and texts. The
lexical items with large D are
supposed to be the important topics,
to which the author of the text
returns after commenting on other
topics. In the example the items with
large D (D>10) are shown in Figure 2.
These indicators are distributed
among paragraphs. For example, the
indicator "ultrasonic wave"
extends over 15 sentences (from the 9th to
the 24th) within paragraph 2, which ranges
from the 2nd to the 40th sentence, and the
indicator "traveling-wave
tube" likewise extends over 22
sentences (100th-122nd) within
paragraph 4 (85th-123rd). The
indicator "traveling-wave
amplification" covers paragraph 3
completely, ranging from the 41st
sentence, or the first sentence of the
paragraph, through the 67th sentence
to the 85th sentence, or the first
sentence of the next paragraph. In
short, these indicators divide the
text into the three paragraphs.
In addition, they reflect
appropriately the contents of
paragraphs in the sample text, as
suggested by the fact that they are
partly identical with the following
paragraph names:
"Introduction" (paragraph 1), "What is
the ultrasonic wave?" (paragraph 2),
"Microwave and traveling-wave
tube" (paragraph 3) and "Ultrasonic
wave and traveling-wave
amplification" (paragraph 4).
These data suggest that the
indicators with large D may be useful
as keywords to the contents of a
text.
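The selection of keyword candidates by large D can be sketched in Python as follows. This is only an illustration under stated assumptions. The three long-distance pairs and the paragraph boundaries are those quoted in the text above for Text A, and paragraph 1 is assumed to consist of the first sentence only. Two short-distance rows from Table 3 are added for contrast. The class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Connection(NamedTuple):
    indicator: str  # English equivalent of the lexical parallelism indicator
    i: int          # governor sentence number
    j: int          # dependent sentence number

    @property
    def distance(self) -> int:
        """Lexical parallelism indicator distance D = j - i."""
        return self.j - self.i


# Paragraph boundaries of Text A as (first sentence, last sentence).
PARAGRAPHS = {1: (1, 1), 2: (2, 40), 3: (41, 84), 4: (85, 123)}


def paragraph_of(sentence: int) -> Optional[int]:
    """Return the paragraph number that contains the given sentence."""
    for number, (first, last) in PARAGRAPHS.items():
        if first <= sentence <= last:
            return number
    return None


def long_distance_indicators(conns, threshold: int = 10):
    """Keep the connections with D greater than the threshold (D > 10)."""
    return [c for c in conns if c.distance > threshold]


SAMPLE = [
    Connection("ultrasonic wave", 9, 24),
    Connection("traveling-wave amplification", 41, 67),
    Connection("traveling-wave tube", 100, 122),
    Connection("phase velocity", 63, 64),  # short distance, filtered out
    Connection("crystal", 107, 114),       # short distance, filtered out
]

if __name__ == "__main__":
    for c in long_distance_indicators(SAMPLE):
        print(c.indicator, c.distance, paragraph_of(c.i), paragraph_of(c.j))
```

Each of the three surviving indicators starts and ends inside a single paragraph (2, 3 and 4 respectively), which is the property the discussion above relies on.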
4. CONCLUSION 
Lexical parallelism plays an
important role in the intersentential
dependency, or text structure, and
lexical items at the beginning of the
sentences are the most relevant
lexical parallelism indicators.
The initial lexical parallelism
indicators with long lexical
parallelism indicator distances
reflect the contents of paragraphs and
may be useful keywords in information
retrieval.
Four problems are left for future
research: first, partly identical and
lexico-semantic repetition through the
lexical items at/near the beginning of
the sentence; secondly, intersentential
dependency by syntactic devices;
thirdly, the recognition of
topic/comment opposition in the
sentence; and lastly, the
application to automatic keyword or
key-sentence extraction in content
analysis.
Table 1  Lexical parallelism ratios of type 1 in computer experiment (%)

T \ w       1           2           3           4           5
A       60.4 (75)   61.9 (75)   57.1 (64)   54.2 (58)   56.4 (57)
B       68.2 (71)   64.4 (67)   56.3 (58)   58.4 (59)   57.4 (58)
C       59.4 (41)   45.5 (31)   43.2 (29)   37.5 (24)   32.2 (19)
D       57.2 (71)   61.2 (76)   54.9 (67)   52.5 (60)   56.7 (58)
E       41.1 (37)   53.3 (48)   49.4 (43)   42.1 (35)   50.0 (40)

Note: T - sample texts, w - sequence numbers of indicators, values in ( )
are numbers of determined sentence connections.

Table 2  Lexical parallelism ratios determined by hand (%)

T \ w   N-1       1           2           3          4          5
A       122   60.7 (74)    6.6 (8)     3.2 (4)    0.8 (1)    0.8 (1)
B       103   68.9 (71)    9.7 (10)    1.9 (2)    0.9 (1)    0.9 (1)
C        69   50.7 (35)    8.7 (6)    13.0 (9)    2.9 (2)    0   (0)
D       123   54.9 (67)   13.9 (17)    2.4 (3)    1.6 (2)    0   (0)
E        89   29.2 (26)    5.6 (5)     2.2 (2)    1.1 (1)    0   (0)

Note: N-1 --- the determinable maximum number of intersentential
relations.
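As a check on how the percentages follow from the counts in parentheses, the short Python sketch below recomputes the initial-position (w = 1) ratios of Table 2 from the number of determined connections and N-1. The function name parallelism_ratio is illustrative. Only the rows for Texts A, B, C and E are used.

```python
def parallelism_ratio(n_connections: int, max_connections: int) -> float:
    """Lexical parallelism ratio in percent: n / (N - 1) * 100,
    where max_connections is N - 1."""
    return n_connections / max_connections * 100


# (number of determined connections at w = 1, N - 1), read from Table 2.
TABLE2_W1 = {"A": (74, 122), "B": (71, 103), "C": (35, 69), "E": (26, 89)}

if __name__ == "__main__":
    for text, (n, n_minus_1) in TABLE2_W1.items():
        print(text, round(parallelism_ratio(n, n_minus_1), 1))
```

Rounded to one decimal place, the computed values are 60.7, 68.9, 50.7 and 29.2, matching the tabulated percentages for these four texts.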
Figure 1  Lexico-semantic intersentential dependency graph
in sample text A
(Axes: sentence numbers from 0 to 130, with the i-th sentence on the
horizontal axis. Legend: connections through the identical repetition
in the initial position; connections through other repetitions in any
positions.)
Table 3  Lexico-semantic intersentential dependency in sample text A

  J    I    D   w   t   Indicator (English equivalent)
  1    -    -   -   -   sound
  2    1    1   1   1   ultrasonic wave
  3    -    -   -   -   hear
  4    3    1   1   1   ultrasonic wave
  5    4    1   1   1   one
  6    5    1   1   2   the second
  7    6    1   1   2   ?
  8    -    -   -   -   here
  9    -    -   -   -   as mentioned before
 10    -    -   -   -   sound
 11   10    1   1   1   ?
 12    -    -   -   -   ?
 13   12    1   1   1   wavelength
 14   13    1   1   1   ?
 15   13    2   1   1   ?
 16    -    -   -   -   this fact
 17   16    1   1   1   ?
 18    -    -   -   -   we
 19    -    -   -   -   our
 20   19    1   1   1   wavelength
 21   19    2   1   1   ?
 22   21    1   1   1   eyes
 23   21    2   1   1   sound
 24    9   15   1   1   ultrasonic wave
 25    -    -   -   -   ?
 26   25    1   1   1   bat
 27   26    1   1   1   bat
 28   27    1   1   1   bat
 29   28    1   1   1   radar
 30   29    1   2   1   together with
 31   29    2   1   1   radar
 32   29    3   2   2   sound wave
 33   32    1   1   1   ultrasonic wave
 34    -    -   -   -   for example
 35    -    -   -   -   ?
 36   35    1   1   3   ?
 37    -    -   -   -   ?
 38   37    1   3   1   diagnosis
 39    -    -   -   -   introduction
 40    -    -   -   -   discussion
 41   40    1   1   1   traveling-wave amplification
 42    -    -   -   -   figure 1
 43   42    1   4   1   pendulum
 44   42    2   1   1   spring
 45   44    1   1   1   ?
 46   44    2   2   1   ?
 47   45    2   1   1   ?
 48   47    1   2   1   ?
 49    -    -   -   -   such thing
 50   47    3   1   1   figure
 51    -    -   -   -   this fact
 52   47    5   1   1   ?
 53   52    1   1   1   ?
 54   52    2   1   1   ?
 55   49    6   1   2   repetition
 56   54    2   1   1   ?
 57   53    4   3   1   amplitude
 58   50    8   1   1   figure
 59   54    5   1   1   ?
 60   57    3   1   1   energy
 61   58    3   1   1   ?
 62    ?    ?   ?   ?   this fact
 63   60    3   1   1   energy
 64   63    1   1   2   phase velocity
 65   61    4   1   1   ?
 66   65    1   2   1   amplitude
 67   41   26   1   1   traveling-wave amplification
 68    -    -   -   -   this fact
 69   68    1   1   1   radio wave
 70   69    1   1   2   ?
 71    -    -   -   -   electric signal
 72    -    -   -   -   electric field
 73   72    1   ?   1   electron
 74   73    1   1   1   electron
 75    -    -   -   -   this
 76   73    3   2   3   outside
 77   62   14   3   1   ?
 78   77    1   1   1   Ve
 79   77    2   1   1   (1)
 80    -    -   -   -   this
 81   78    3   1   1   (2)
 82    -    -   -   -   this
 83    -    -   -   -   microwave ?
 84    -    -   -   -   ?
 85   67   18   1   1   traveling-wave amplification
 86   85    1   1   1   ?
 87   81    6   1   2   energy source
 88   87    1   1   1   electric crystal
 89   88    1   1   1   piezoelectric phenomenon
 90   89    1   1   1   piezoelectric phenomenon
 91    -    -   -   -   ?
 92   91    1   1   2   pick-up
 93   90    3   1   1   piezoelectric crystal
 94   93    1   1   1   this
 95    -    -   -   -   reverse effect
 96   95    1   1   2   receiver
 97   93    4   3   1   piezoelectric crystal
 98   97    1   1   2   in piezoelectric crystal
 99   98    1   1   1   electric field
100   98    2   1   1   piezoelectric crystal
101   90   11   1   1   Rochelle salt
102    -    -   -   -   such thing
103    -    -   -   -   piezoelectric semiconductor
104  103    1   1   1   piezoelectric semiconductor
105  104    1   2   1   CdS
106    -    -   -   -   it
107  106    1   1   2   CdS crystal
108  107    1   1   1   amplifier
109    -    -   -   -   figure 3
110    -    -   -   -   equipment
111  107    4   1   1   ?
112  110    1   1   2   accelerated voltage
113  112    1   1   1   attenuation quantity
114  107    7   1   2   crystal
115  114    1   1   1   voltage
116  115    1   1   1   output
117  114    3   1   1   crystal
118  113    5   1   1   attenuation
119  117    2   1   1   accelerated voltage
120    -    -   -   -   above mentioned
121  120    1   1   1   ?
122  100   21   2   1   traveling-wave tube
123  122    1   1   2   birth

Note: 1) English equivalents are shown in ( ); 2) underlined Hiragana
sequences are postpositional particles, denoting topic, case, contrast,
etc.; 3) a hyphen means that the J-th sentence is not connected with any
preceding sentence by lexical equivalence.
Symbols: J, I -- sequence numbers of the dependent sentence and governor
sentence respectively; D -- lexical parallelism indicator distance;
w -- sequence number of the lexical indicator from the beginning of the
sentence; t -- type of lexical repetition: 1, 2, 3 - identical, partial,
lexico-semantic respectively.
Fig. 2  Distribution of long distance indicators (D > 10)
(Diagram: "ultrasonic wave", sentences (9) to (24), distance 15;
"traveling-wave amplification", from sentence (41), distance 26;
an indicator from sentence (62); "Rochelle salt", from sentence (90);
"traveling-wave tube", from sentence (100).)
Note: numbers in ( ) correspond to the sequence numbers
of the sentences, the numbers on the lines to the distances.

REFERENCES

[1] Quirk, R., Greenbaum, S., Leech,
G., and Svartvik, J., A grammar of
contemporary English (Longman, London,
1972).

[2] Sevbo, I.N., Struktura svjaznovo
teksta i avtomatizatsija (Nauka, M.,
1969).

[3] Makino, S., Grammar of repetition
(Taisyukan, Tokyo, 1980).

[4] Okamoto, T., Text structure
determination and content analysis by
lexical parallelism, Proceeding of the
Univ. of Electro-Communications,
vol. 24 (1973), no. 1, 177-190.

[5] Okamoto, T., Structure analysis of
Japanese text, Mathematical
linguistics, No. 62 (1972), 1-11.

[6] Sakamoto, Y., Okamoto, T., Yatsu,
N., Text structure and a model of
discourse understanding by lexical
parallelism, Proceeding of the 10th
annual meeting on information science
and technology, (1973), 55-64.
