A DIRECTED RANDOM PARAGRAPH GENERATOR 
Stanley Y.W. Su & Kenneth E. Harper 
(The RAND Corporation, Santa Monica, California) 
I. INTRODUCTION 
The work described in the present paper represents a 
combination of two widely different approaches to the 
study of language. The first of these, the automatic gen- 
eration of sentences by computer, is recent and highly 
specialized: Yngve (1962), Sakai and Nagao (1965), 
Arsent'eva (1965), Lomkovskaja (1965), Friedman (1967), 
and Harper (1967) have applied a sentence generator to the 
study of syntactic and semantic problems of the level of 
the (isolated) sentence. The second, the study of units 
of discourse larger than the sentence, is as old as rhetor- 
ic, and extremely broad in scope; it includes, in one way 
or another, such diverse fields as beyond-the-sentence
analysis (cf. Hendricks, 1967) and the linguistic study of
literary texts (Bailey, 1968, 53--76). The present study 
is an application of the technique of sentence generation 
to an analysis of the paragraph; the latter is seen as a 
unit of discourse composed of lower-level units (sentences), 
and characterized by some kind of structure. To repeat: 
the object of our investigation is the paragraph; the 
technique is analysis by synthesis, i.e. via the automatic 
generation of strings of sentences that possess the 
properties of paragraphs. 
Harper's earlier sentence generation program differed 
from other versions in its use of data on lexical co- 
occurrence and word behavior, both obtained from machine 
analysis of written text. These data are incorporated 
with some modifications in a new program designed to pro- 
duce strings of sentences that possess the properties of 
coherence and development found in "real" discourse. (The 
actual goal is the production of isolated paragraphs, not 
an extended discourse.) In essence the program is designed 
(i) to generate an initial sentence; (ii) to "inspect" 
the result in order to determine strategies for producing 
the following sentence; (iii) to build a second sentence, 
making use of one of these strategies, and employing, in
addition, such criteria of cohesion as lexical class
recurrence, substitution, anaphora, and synonymy; (iv) to
continue the process for a prescribed number of sentences,
observing both the general strategic principles and the
lexical context. Analysis of the output will lead to
modification of the input materials, and the cycle will be
repeated.
This paper describes the implementation of these
ideas, and discusses the theoretical implications of the
paragraph generator. First we give a description of the
language materials on which the generator operates. The
next section deals with a program which converts the
language data into tables with associative links to minimize
the storage requirement and access time. Section 4 describes
(1) the function of the main components of the generation
program, and (2) the generation algorithm. Section 5
describes the implementation of some linguistic assumptions
about semantic and structural connections in a discourse.
Table 1
GOVERNING PROBABILITIES

                         Type of Dependent
Governor      S      O      N      A      DV     DS
  VT         P1      1      0      0      P2     P3
  VI          1      0      0      0      P4     P5
  N           0      0      P6     P7     0      0
  A           0      0      0      0      0      0
  DV          0      0      0      0      0      0
  DS          0      0      1      0      0      0
The governing probabilities for a word are independent 
of each other. In paragraph generation the decision to 
select a dependent type will be made without regard to the 
selection of other dependent types. For example, a noun 
can have probabilities P6 and P7 of being the governor of 
a noun and an adjective respectively. The selection of a 
noun as a dependent based on P6 will not affect, and will 
not be affected by, the selection of an adjective as a 
dependent. 
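
The independence can be pictured with a minimal Python sketch
(the program itself is written in PL/I; the probability values
below stand in for P6 and P7 of some noun and are assumptions):

    import random

    # Each dependent type is decided by its own independent probability
    # test; selecting one type neither blocks nor forces any other.
    governing_probs = {"N": 0.60, "A": 0.75}   # hypothetical P6, P7

    def choose_dependent_types(probs):
        """Return the dependent types this word will govern."""
        return [dep for dep, p in probs.items() if random.random() < p]

    print(choose_dependent_types(governing_probs))  # e.g. ['N', 'A'] or []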
There are two types of co-occurrence data accompanying
every word in the glossary: a set of governing probabil- 
ities and a list of dependents. The probability values 
associated with a word are determined on the basis of the 
syntactic behavior of the word in the processed text. If 
a noun occurs in 75 instances as the governor of an
adjective in 100 occurrences in a text, the probability of
having an adjective as a dependent is 0.75. The zeroes and
ones in Table 1 are constant for all words in the glossary.
These values are not listed in the sets of probability
values for the entries of the glossary; however, they are
known to the system. For instance, the set of probability
values for a transitive verb will contain P1, P2, and P3.
The probability 1 of governing a noun as object will not
be listed in the data.
The second type of co-occurrence data accompanying
every word in the glossary is a list of possible dependents. 
The list is specified in terms of word numbers and semantic 
classes (to be described later). It contains the words that 
actually appear in the processed physics text as dependents 
of the word with which the list is associated. Since the 
lists of dependents are compiled on the basis of word co- 
occurrence in the text, legitimate word combinations are 
guaranteed. In the list of dependents for a verb, those
words which can only be the subject are marked "S" and
those which can only be the direct object are marked "O".
The co-occurrence data can be regarded as either
syntactic or semantic. They are distinguished here from 
both the dependency rules and part of speech designation, 
and from the semantic classes that have been established. 
At present, seventy-four semantic classes have been set up.
Some of these are formed distributionally (i.e., on the
basis of their tendency to co-occur syntactically with the
same words in text; cf. Harper, 1965); other classes contain
words of the same root, synonyms, hypernyms, and words
arbitrarily classified as "concrete." The semantic
classifications are highly tentative, and are subject to
modification. Their extent is shown in Table 2.
Table 2
SEMANTIC DATA

                               Number of    Number of
Classification                  Classes       Words
Distributional Classes             22          150
Hypernym Classes                   10          160
Word Families                      25           52
Synonym-antonym Classes            16           48
"Concrete" Words                    1           54
TOTAL                              74          464
The language materials described above are punched
on approximately 2500 cards. The data are processed by a
conversion program in order to form the data base for the
paragraph generation program.
3. DATA CONVERSION PROGRAM
The paragraph generator is written in PL/I and run on
the IBM 360 Model 65. It consists of two main programs:
a data conversion program and a generation program. These 
two programs run as separate jobs. The data conversion 
program converts the language materials described above 
into compact data tables with associative links. The 
converted data are stored on a magnetic tape which is used 
as input to the generation program. 
During the process of paragraph generation it is 
desirable that the language data described in the preced- 
ing section remain in core storage. However, since the 
data base is rather large, its conversion into a more
compact and flexible form is desirable so that storage 
requirements and access time can be reduced. 
In view of the characteristics of the language data
and the generation algorithm (to be described later), we
structure the data base in the following way.
a. Words belonging to the same parts of speech or
semantic classes are stored in consecutive locations. In
the process of generation, random selection of a word
from a part of speech or a semantic class is often
required. If words are grouped together in the form of a
table, a randomly selected number in a proper range can
be used as an index to look up the word in the table.
b. The data storage for each entry of the glossary 
is of variable length, since the lists of dependents, 
governing probabilities, hypernyms and semantic classes 
associated with the entries are of variable length. 
c. Word numbers in the lists of dependents and 
semantic classes are replaced by pointers, which identify 
the locations where the word numbers are actually stored. 
Thus, data tables containing different types of informa- 
tion are linked to one another, and access to this informa- 
tion can be carried out by straight table lookup. 
In keeping with these principles, data tables of the
form shown in Fig. 1 have been constructed. An example
will illustrate the organization of the data base and the
procedure of setting up these data tables. As a noun,
represented by word number 2466, and its associated
language data are read from the input unit, the word
number is stored in the block in table (1) reserved for
nouns. The last three digits of the word number are used
as an index to a location in the lookup table (2), where
the word number 2466 and its address in table (1), i.e.
309, are stored. If the location in table (2) has been
occupied (when more than one word has the same last three
digits), the word number and its address are stored at the
first unused space in table (2) following that location.
Table (2) allows us to replace word numbers with their
addresses after all data have been processed.
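
The scheme of table (2) amounts to hashing on the last three
digits with forward probing. A minimal Python sketch (the
original is a PL/I program; the table size, the names, and the
wrap-around of the probe are assumptions):

    TABLE_SIZE = 1000
    lookup = [None] * TABLE_SIZE   # slot: (word number, address in table (1))

    def store(word_number, address):
        slot = word_number % 1000           # last three digits of the number
        while lookup[slot] is not None:     # occupied: try the following space
            slot = (slot + 1) % TABLE_SIZE
        lookup[slot] = (word_number, address)

    def find_address(word_number):
        slot = word_number % 1000
        while lookup[slot] is not None:
            if lookup[slot][0] == word_number:
                return lookup[slot][1]
            slot = (slot + 1) % TABLE_SIZE
        return None

    store(2466, 309)            # the noun 2466 sits at address 309 in table (1)
    print(find_address(2466))   # -> 309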
[Fig. 1 is not legible in the scanned copy. It diagrams the eight linked
data tables: (1) word numbers grouped by part of speech; (2) the lookup
table of word numbers and their addresses in table (1); (3) the words in
each semantic class; (4) the dependent lists; (5) the hypernym lists;
(6) the semantic class lists; (7) the probability values; and (8) the
table of starting locations of the semantic class lists in table (3).]

Fig. 1 -- Data table organization
There are four pointers associated with each word
stored in table (1). Pointer D specifies the location in
table (4) where the list of dependents associated with
the word is stored. A counter is used to specify the
number of words and semantic classes in the list. A
semantic class in the original data is prefixed by a C
(C1 identifies semantic class 1). In table (4) all the
counters and semantic classes (the numerical values) are
stored as negative values so that the positive values
(i.e. word numbers) can be conveniently changed to pointers
at a later stage. In our example the pointer D is 130
and the words 726 and 4594, and also the semantic classes
C1, C2 and C16, are in the dependent list associated with
word 2466. The value which identifies a semantic class in
table (4) is actually a pointer to a table which contains
the starting locations of the lists of words in all
semantic classes. This is illustrated in Fig. 1 by the links
from table (4) through table (8) to table (3).
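
The convention can be made concrete with a small Python sketch
of one block of table (4) (a dictionary stands in for the packed
array of the actual program; the data are those of the example):

    # A block starts with a negated counter; semantic classes are stored
    # as negatives (C16 as -16), so the remaining positive entries are
    # word numbers that can later be overwritten with addresses.
    table4 = {130: [-5, 726, -1, -2, 4594, -16]}   # word 2466's list, D = 130

    def read_dependents(pointer):
        block = table4[pointer]
        count = -block[0]                  # first cell holds the negated length
        words, classes = [], []
        for value in block[1:1 + count]:
            if value < 0:
                classes.append(-value)     # a semantic class, e.g. -16 -> C16
            else:
                words.append(value)        # a word number (later: an address)
        return words, classes

    print(read_dependents(130))   # -> ([726, 4594], [1, 2, 16])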
The set of governing probabilities associated with 
word 2466 is stored in table (7). Pointer P specifies the 
starting location where the probability values are stored. 
In the example, P is set to 142. Notice that no spaces
are reserved for adjectives and adverbs because they do
not have governing probabilities.
The pointer H associated with a word in table (1)
specifies the location in table (5) where a counter and
the hypernyms of the word are stored. Word 613 is a 
hypernym of the word 2466. Thus, H is set to 17 which is 
the location in table (5) where a counter and the word 613 
are stored. Since the word 2466 is a member of the
semantic class C47, the pointer S associated with the word
2466 is set to the location in table (6) where a counter
and C47 are stored.
Table (3) contains 74 blocks, which are reserved for 
the 74 semantic classes established in the system. Each 
block contains a counter and the addresses of the words 
in a semantic class. For example, the address 309 is
stored in the 47th block in table (3). Table (3) is thus
linked to table (1).
After all data have been entered in the tables, the
word numbers (positive values) in tables (4) and (5) are
replaced by their addresses in table (1). This operation
is done by using the lookup table (2).
The data are organized in tables with associative
links. All word numbers in tables (3), (4), and (5) are
replaced by their addresses in table (1). From an entry
in table (1) (where the generation of a sentence usually
begins), we can trace its possible dependents; since these
dependents are specified as pointers to their addresses in
table (1), it is simple to obtain the lists of dependents
associated with these dependents. In turn we can trace
third-level dependent lists. We can easily continue this
operation down to any desired level. Table (1) is linked
to tables (4), (5), (6), and (7), and tables (4), (5), and
(6) are linked either directly back to table (1) or
indirectly through table (8) and then table (3) back to
table (1). Thus, access to any piece of information in
these data tables is gained by simple table lookup.
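
The chained lookup can be pictured with a short Python sketch
(illustrative only: a dictionary stands in for the packed tables,
and the addresses are those of the example in Sec. 4 below):

    # address in table (1) -> addresses of its dependents, themselves
    # entries of table (1); the walk is pure table lookup at every level.
    dependent_lists = {
        531: [317, 261, 179],   # a verb and its dependents
        317: [], 261: [308], 179: [],
        308: [42], 42: [],
    }

    def trace(address, depth=0):
        """Print every entry reachable from one entry of table (1)."""
        print("  " * depth + "address %d" % address)
        for dep in dependent_lists[address]:
            trace(dep, depth + 1)

    trace(531)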
In view of the variability in the number of words in
each part of speech and semantic class, and in the number
of governing probabilities, hypernyms, semantic classes
and dependents associated with each word, we have packed
these data in large arrays as illustrated in tables (1),
(3), (4), (5), (6), and (7). The advantages are (i)
reduction in storage requirements, and (ii) capacity for
rapid selection of a word from a part of speech or a
semantic class. The disadvantage is that we have placed
a restriction on the amount of additional data that may
be added to the existing lists. To avoid modifying the
program when new data are added, indices (such as x, y,
and z in Fig. 1) to the reserved spaces in tables (1), (3),
and (7) are made input parameters to the program. At
present the parameters are set to leave space for expansion
of input data. Further expansion can be handled simply by
readjusting the input parameters.
[A page of the source, containing the opening of Sec. 4 and the
restriction pattern of Fig. 2, is missing from the scanned copy.]
The restriction pattern in Fig. 2 specifies that the
sentence to be generated should contain a transitive verb
which belongs to either semantic class C1 or C2. The verb
should govern (1) a noun as the subject of the sentence,
(2) an object which is to be selected from the words in
semantic class C15 or the specified words W1 and W2, and
(3) an adverb which does not belong to semantic class C19.
The subject of the sentence should not govern an adjective.
As illustrated in the pattern, each node in a pattern
contains a word class and selection restrictions which are
positively or negatively specified in terms of semantic
class(es), specific word(s) or a word class. Restriction
patterns are stored in the following form: Q-P1P2...Pn.
Q is a single pattern, or a combination of patterns, and
P1P2...Pn are single restriction patterns. Essentially,
Q-P1P2...Pn is a rule which specifies that a sentence
(or string of sentences) whose sentence skeleton(s) matches
Q can be followed by a sentence whose sentence skeleton is
one of these Ps. Thus, one of these Ps is
randomly selected to be used as a restriction pattern for
a succeeding sentence. The pattern selection procedure is
not yet coded. At present, strings of restriction patterns
are given directly to the pattern selection routine. The
generation program generates strings of sentences under
the control or direction of the restrictions specified in
the patterns. The use of restriction patterns to control
the general "development" of paragraphs will be described
in a later section.
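
Since the selection procedure is not yet coded, the following
Python sketch is purely illustrative of the Q-P1P2...Pn rule
(the pattern names are hypothetical):

    import random

    # Q -> [P1, P2, ...]: a sentence whose skeleton matches Q may be
    # followed by a sentence built on one of the listed patterns.
    rules = {
        "pattern_1": ["pattern_2", "pattern_3"],
        "pattern_2": ["pattern_3", "pattern_4"],
    }

    def next_pattern(matched_skeleton):
        candidates = rules.get(matched_skeleton)
        return random.choice(candidates) if candidates else None

    print(next_pattern("pattern_1"))   # e.g. 'pattern_3'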
4.1.3. Discourse Relator (RELATOR). Input to this
procedure are (1) a dependent type, (2) a probability
value, and (3) a restriction pattern. This procedure
determines whether the given dependent type conflicts 
with the restrictions specified in the pattern. If no 
conflict is found, this procedure determines whether a 
word should be selected from the given dependent type 
based on the input probability value. If the selection of 
the dependent type conflicts with the restriction pattern, 
or if the dependent type fails the probability test, no 
word will be selected from the dependent type. 
4.1.4. Criteria (CRITERIA). Whenever, during any stage of
sentence generation, the selection of one word from a 
list of candidates is required, this procedure determines 
which criteria should be applied to control the selection. 
All criteria (implemented principles of cohesion to be 
described in a later section) are presented to the genera- 
tion program in the form of a table that reweights the 
probabilities. The generation program increases or de- 
creases the probability of selecting words on the basis of 
the values in this table. It has the following format 
(Fig. 3): each entry is specified by a semantic class or
a specific word followed by a positive or negative value. 
Identifier     Weight
C2             +5
W1             -7
W2             +3
C12            -4

Fig. 3 -- Format of a reweighting table
To illustrate the use of this table, let us suppose
that in a certain stage of generating a sentence there are
five words, each of which can be the subject of the verb
previously selected for the sentence. The selection of
any word from these five will satisfy the restriction
pattern for the sentence. Instead of randomly selecting
one word out of these five candidates, we may want to
increase the probability of selecting a word which will
have semantic connections with the word(s) in the preceding
or current sentence. When there are choices in word
selection, all candidates are preassigned equal weights, and
criteria relevant to the current selection are applied to
form a reweighting table. If a word in the list of
candidates matches a word or belongs to a semantic class in
the table, the associated weight is added to its preassigned
weight. The final positive weights of all candidates are
added, and a random number in the range from 1 to the
total sum is generated to determine which candidate should
be selected.
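
The selection step can be sketched in Python (the base weight,
the class memberships, and the candidate names are assumptions;
the reweighting entries are those of Fig. 3):

    import random

    reweight = {"C2": +5, "W1": -7, "W2": +3, "C12": -4}   # table of Fig. 3
    word_class = {"W1": "C2", "W4": "C12", "W5": "C2"}     # hypothetical

    def select(candidates, base_weight=10):
        # every candidate starts from the same preassigned weight; a matching
        # entry for the word itself or for its semantic class is added to it
        weights = {}
        for w in candidates:
            weights[w] = (base_weight
                          + reweight.get(w, 0)
                          + reweight.get(word_class.get(w), 0))
        positive = {w: wt for w, wt in weights.items() if wt > 0}
        total = sum(positive.values())
        draw = random.randint(1, total)    # random number in the range 1..total
        for w, wt in positive.items():
            draw -= wt
            if draw <= 0:
                return w

    print(select(["W1", "W2", "W3", "W4", "W5"]))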
4.1.5. Word Generator (WORD-GEN). This procedure finds
all possible candidates which satisfy the restrictions
specified in a restriction pattern, and assigns different
weights to them on the basis of the contents of a proba-
bility reweighting table. It selects a word at random
from the candidates according to their weights.
4.1.6. Random Number Generator (RANDOM). Input to
this procedure is an integer N. This procedure generates
a random integer in the range from 1 to N.
4.2. The Generation Algorithm
The general strategy for generating a paragraph is,
first, to generate the initial sentence based on a selected
restriction pattern, and then to generate each noninitial
sentence based not only on a selected restriction pattern
but also on the semantic properties of the words in all
the previously generated sentences of the paragraph. The
algorithm and the sentence generation procedure can best be
illustrated by an example. Let us suppose that the restric-
tion pattern shown in Fig. 4(a) is chosen for a sentence.
For ease of reference we will letter the steps involved in
this procedure.
a. If the restriction pattern specifies a restriction
on the selection of the sentence governor (usually a
transitive verb (VT) or an intransitive verb (VI)), a VT or
VI will be randomly chosen from the specified semantic
class(es) or word(s). Otherwise a VT or VI will be randomly
chosen. In our example the restriction pattern in Fig. 4(a)
specifies that a word should be selected from the word
class VT which is not a member of the semantic classes
C1, C2, and C3, but is a governor of a word in C16, a
word in word class N, and a word in C19. (Note also that
the sentence should not contain a sentence adverb.)
There are 16 candidates which satisfy the restrictions.
They are shown in Fig. 4(b) by their addresses, at which
word numbers are stored. A random number is generated in
the range 1 to 16, for example, 8. Thus the eighth VT is
chosen: word number 3336 whose address is 531.
b. The possible dependent types of a VT are NS 
(noun subject), NO (noun object), DV (adverb), and DS 
(sentence adverb). The probabilities (in percentages) for
the word 3336 to govern words of these dependent types
are, say, 65, 100, 35, and 40 respectively. The procedure
RELATOR is called in order to determine whether the selec- 
tion of each dependent type agrees with the restriction 
pattern. If the selection of a dependent type conflicts 
with the pattern, the dependent type is ignored, i.e., no 
word will be selected from this dependent type as the 
dependent of the verb 3336. In our example, NS, NO, and 
DV have passed this test. 
[Fig. 4(a) is only partly legible in the scanned copy. As described in
step a. above, the pattern calls for a transitive verb that is not a
member of C1, C2, or C3 and that governs a word in C16, a word in word
class N, and a word in C19, with no sentence adverb (DS(-DS)).]

Fig. 4(a) -- A restriction pattern

[Fig. 4(b), the dependency tree, is only partly legible. It lists, by
address, the 16 candidates for the node VT and the candidates for the
nodes NS, NO, DV, N, and A, from which addresses 531, 317, 261, 179,
308, and 42 are chosen.]

Fig. 4(b) -- A dependency tree

English:   Muxin published a study of linear method in a previous paper.
Russian:   Muxin opublikoval izučenie linejnyj metod v predyduščej rabote.
Word No.:    2625     3336       1610      2263     2466      6505
Address:      317      531        261        42      308       179

Fig. 4(c) -- A generated sentence
c. The probabilities associated with NS, NO, and DV 
are used to determine whether words should be selected 
from these dependent types. For each dependent type, the 
random number generator is called to generate a number
ranging from 1 to 100. If the random number is greater
than the probability associated with the dependent type,
the type is ignored. Otherwise a word will be selected
from this type. Let us assume that all three types have 
passed the probability test. 
d.1. A noun is to be selected as the subject of verb
3336. If the sentence to be generated is the first sen- 
tence of a paragraph, a noun which is in the dependent 
list associated with the verb 3336 and also a member of 
C16 is chosen. However, if the sentence is a noninitial 
one, the procedure CRITERIA is called to form a probability 
reweighting table on the basis of the criteria applicable 
to the verb 3336 and to this local structure (i.e., a 
VT dominates an NS). All candidates (those words which 
belong to C16 and which are in the dependent list associa- 
ted with 3336) are first assigned an equal weight. Then 
the probability reweighting table is used to adjust the 
weights of the candidates. Fig. 4(b) shows the candidates
for the node NS. An individual word is randomly chosen
from the candidates based on their different weights:
word number 2625 whose internal address is 317.
d.2. A noun is to be selected as the object of the
verb 3336. As in d.1, the restriction pattern is consulted
and, if the sentence is a noninitial one, the procedure
CRITERIA is called. Fig. 4(b) shows the candidates for the 
node NO. The same probability reweighting scheme is 
applied to adjust the weights of the candidates. A word is 
selected at random: word number 1610 whose address is 261. 
d.3. An adverb is to be selected for the verb 3336.
As in the previous procedure, the restriction pattern
restricts the selection of candidates; CRITERIA is called 
for a noninitial sentence to construct the probability 
reweighting table, and an adverb is randomly selected. In 
the figure we see the candidates for the node DV, and the 
adverb 6505, whose address is 179, is chosen. 
e. The dependents of the words 2625, 1610, and 6505 
are now considered with respect to their possible dependents 
and associated probabilities. We are working from the top 
to the second level of the dependency tree structure. 
e.1. The noun 2625 may govern the dependent types
adjective and noun. Each of these is considered in turn 
by the same operations described in steps b. and c. For 
brevity, let us assume that none of these dependent types 
pass the probability test. Thus, no word is selected from 
these dependent types. 
e.2. The noun 1610 may govern the dependent types
adjective and noun with different probabilities. Assuming
that the adjective fails the probability test, none is 
chosen for the word 1610. Since the restriction pattern 
specifies that a word should be selected from the word 
class N as a dependent of the word 1610, the same operation 
described in step d. is performed to select the word 2466, 
whose address is 308. 
e.3. The adverb 6505 selected in d.3. has no dependent 
since adverbs never govern. 
f. We now move to the third level of the dependency 
tree structure. The noun 2466 may govern the dependent 
types adjective and noun. Let us assume that the dependent 
type A passes the tests described in step c. and the 
dependent type N fails. An adjective 2263, whose address 
is 42, is selected from the list of candidates shown in 
the figure. 
g. We now move from the third level to the fourth 
level of the dependency tree structure. Since the only 
word on the fourth level is an adjective, which does not 
govern, we have reached the lowest level. The generation 
of a sentence is completed. Fig. 4(c) shows the generated
sentence. (In the Russian sentence, morphology is ignored.)
The restriction pattern of the sentence just generated,
together with those of the previously generated sentences,
are again used as the basis for selecting the restriction
pattern for the next sentence.
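
Steps a. through g. amount to a recursive descent over the
dependency tree. The following condensed Python sketch follows
the example (the word numbers and the percentages 65, 100, 35,
and 40 are those of Fig. 4; the grammar table, the remaining
probabilities, and the omission of the CRITERIA reweighting are
simplifications, not the authors' PL/I program):

    import random

    DEPENDENT_TYPES = {"VT": ["NS", "NO", "DV", "DS"], "NS": ["A", "N"],
                       "NO": ["A", "N"], "N": ["A", "N"],
                       "A": [], "DV": [], "DS": []}

    def relator(dep_type, probability, pattern):
        # RELATOR: drop a type barred by the pattern or failing its test
        if pattern.get(dep_type) == "forbidden":
            return False
        return random.randint(1, 100) <= probability   # percent probability

    def generate(word, word_type, lexicon, pattern, depth=0):
        print("  " * depth + "%s: %d" % (word_type, word))
        for dep_type in DEPENDENT_TYPES[word_type]:
            prob = lexicon[word]["probs"].get(dep_type, 0)
            if not relator(dep_type, prob, pattern):
                continue
            candidates = lexicon[word]["deps"].get(dep_type, [])
            if candidates:             # WORD-GEN, here without reweighting
                generate(random.choice(candidates), dep_type,
                         lexicon, pattern, depth + 1)

    lexicon = {
        3336: {"probs": {"NS": 65, "NO": 100, "DV": 35, "DS": 40},
               "deps": {"NS": [2625], "NO": [1610], "DV": [6505]}},
        2625: {"probs": {}, "deps": {}},
        1610: {"probs": {"N": 90}, "deps": {"N": [2466]}},
        6505: {"probs": {}, "deps": {}},
        2466: {"probs": {"A": 80}, "deps": {"A": [2263]}},
        2263: {"probs": {}, "deps": {}},
    }
    generate(3336, "VT", lexicon, pattern={})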
At the present stage of development no criterion is 
used to determine the end of a paragraph. The number of 
restriction patterns input to the pattern selection procedure 
determines the number of sentences in a paragraph. When the 
sentences of a paragraph have been generated, glossary look- 
up is performed and the transliterated Russian forms and 
their structural relations are printed. 
5. IMPLEMENTATION OF LINGUISTIC ASSUMPTIONS
The structure of paragraphs is poorly understood, and 
is in any event subject to enormous variety. Nevertheless, 
we have adopted a simplified model, which postulates that 
the units (sentences) of a paragraph should be arranged in 
a recognizable pattern. Specifically, it is assumed that 
each pair of sentences should be characterized by the 
attributes of development and cohesion. Development implies
progression, for example, some kind of spatial, temporal,
or logical movement: a paragraph can be assumed to "get
somewhere." Cohesion, on the other hand, implies contin-
uity or relatedness; as such, it is a kind of curb on
progression. Although it is difficult, perhaps impossible,
to distinguish between these two attributes, they will be
discussed separately, in an admittedly artificial way.
The chief function of the restriction pattern is to achieve
intersentence development, and an overall pattern to the
sequence of sentence pairs; to a degree, lexical coherence
is also affected through the restriction pattern (e.g.,
through the recurrence of semantic classes). The main
function of the probability reweighting tables is to
achieve cohesion, through the device of increasing the
likelihood of lexical recurrence; the principle of devel-
opment is also implemented here, to the extent that similar,
but not identical, words are chosen in noninitial sentences.
In general it may be said that the restriction pattern is
designed to effect an overall pattern, whereas the reweight-
ing tables are more local in effect, dealing with purely
lexical materials.
5.1. Development 
An examination of hundreds of sentence pairs, and 
scores of paragraphs, of Russian scientific texts, suggests 
that the following principles of development are commonly
employed in intersentence connection: (1) progress from
the general to the specific (more rarely, the reverse);
(2) from whole to part, or from multiplicity to singularity
(presumably a variation of the first-cited principle);
(3) past action to present; (4) "other" to "present" agent;
(5) "other" to "present" place; (6) cause to effect (more
rarely, the reverse); (7) action to purpose of the action;
(8) action to means of performing the action; (9) simple
rephrasing. Lack of space prevents illustration of these
principles; it should be obvious that even this small stock
of strategies will suffice for the production of innumerable
paragraphs. It should also be noted that a random ordering
of sentences built on the above pair-wise strategies will
produce less than satisfactory results; certain sequences
of sentence pairs are more likely than others to fit into
an acceptable pattern for the paragraph.
It was stated in Sec. 1 that the computer program would
provide a means of inspecting the initial sentence of a
paragraph before deciding on a strategy for further develop-
[A page of the source is missing here; the sentence continues on the
lost page.]
[Fig. 5 is only partly legible in the scanned copy. Its four restriction
patterns correspond to sentences a. through d. below; the legible nodes
include NS(-NS) and DV(+C19) in pattern 1; NS(+C16), VT(+C7), and
DV(+C19) in pattern 2; VT(+C7), DV(+C20), and N(+N) in pattern 3; and
VT(+VT), NO(+method), and DV(-DV) in pattern 4.]

Fig. 5 -- Restriction patterns for a paragraph
a.  The nature / of scattering / was investigated / in an earlier paper.
    NO(+NO)      N(+N)           VT(+C2)            DV(+C19)

b.  Belov / proposed / in paper (1) / a means / of determining /
    NS(+C16) VT(+C7)   DV(+C19)       NO(+NO)   N(+N)
    the probability / of absorption.

c.  A theory / of interaction / is worked out / in the present paper.
    NO(+NO)    N(+N)            VT(+C7)          DV(+C20)

d.  A method / of analyzing / the method / of analyzing /
    NO(+method)  N(+N)
    the magnitude / of distortion / is proposed.
                                    VT(+VT)
The use of patterns to control development is summar-
ized in Table 3. Since the verb investigate in sentence (a)
belongs to semantic class C2, and C2 contains such verbs
as "study" and "investigate" which specify very general
actions, the node VT(+C7) in pattern 2 controls the selec-
tion of a verb of greater specificity; the verbs in C7 are
appropriate. The node VT(+C7) in pattern 3 serves this
purpose, i.e., to control the development of actions
from generality to specificity. The node NS(+C16) in
pattern 2 specifies that the agent for the second sentence
is not the present author implicitly specified in the third
sentence. This restriction introduces another type of
text development, i.e. from other writer to present author.
The node DV(+C19) in pattern 1 and pattern 2, and node
DV(+C20) in pattern 3 introduce time progression and
location change into the paragraph. Class C19 contains such
adverbs as "in an earlier paper," "in paper 1," "in an
earlier study," etc., which specify that the time is past.
Class C20 contains such adverbs as "in the present work,"
"in the present paper," etc., which specify the different
locations in which some actions were performed.
[Table 3, which summarizes the use of the patterns of Fig. 5 to control
development, is not legible in the scanned copy.]
Table 3 also suggests partial introduction of coherence
into the sentence sequence. For example, the verbs in all
sentences are in some degree related, and a general parallel-
ism is maintained in the selection of agents and adverbs of
time and location. Nonetheless it is clear that sentences
a. through d. do not form a good paragraph. One deficiency
is the excessively general character of the noun phrase,
"the nature of scattering," in a.; a more obvious shortcom-
ing is the lack of continuity in the noun object. Such
deficiencies suggest the need for greater cohesion.
5.2. Cohesion
As a first approximation we have chosen to implement
the following principles of cohesion: (1) selection of a
"concrete" word in noun phrases; (2) word repetition;
(3) use of hypernyms and synonyms; (4) use of anaphoric
words; (5) increased repetition of members of the same
semantic classes; (6) avoidance of word repetition within
a single noun phrase. The implementation of these tactics
is carried out in the reweighting table described in Sec. 4.
In essence each tactic is a criterion for determining which
words and semantic classes, together with reweighting values,
are to be entered into the reweighting table. The content
of the table depends on the criteria applied for each word
selection; the main generation routine accepts the table as
input for control of the selection, without "knowing" which
criteria are used in forming the table.
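
A Python sketch of how these tactics might fill the table before
one selection (the weight magnitudes, the names, and the class
identifier C_CONCRETE are assumptions; only the sign of each
entry follows the principles above):

    def build_reweighting_table(previous_nouns, hypernyms, synonyms,
                                current_phrase_words):
        table = {}
        for w in previous_nouns:           # (2) favor repeating earlier nouns
            table[w] = table.get(w, 0) + 8
        for w in hypernyms + synonyms:     # (3) favor hypernyms and synonyms
            table[w] = table.get(w, 0) + 5
        table["C_CONCRETE"] = 4            # (1) keep up-weighting "concrete" words
        for w in current_phrase_words:     # (6) inhibit repetition in one phrase
            table[w] = table.get(w, 0) - 10
        return table

    print(build_reweighting_table(["scattering"], ["phenomenon"], [],
                                  ["method", "analyzing"]))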
When the principles of cohesion have been applied, 
sentences a. through d. above might have the following form: 
a'. The nature of nuclear scattering was investigated in 
an earlier paper. 
b'. Belov proposed in paper (1) a means of determining the
probability of such phenomena.
c'. A theory of proton scattering is worked out in the 
present paper. 
d'. A method of analyzing the interaction of these particles 
is proposed. 
The following are some of the improvements in this
sentence sequence over a. through d.:
(1) The addition of the "concrete" adjectives
nuclear and proton gives the noun phrases in a'. and c'.
a specificity that is lacking in a. and c. This effect is
forced upon the generation routine by requiring the selec-
tion of dependents in a noun phrase to continue until a
word coded as "concrete" has been chosen. (Since the
effect may also be one of very long noun phrases, a counter-
effect is achieved by constant up-weighting of the semantic
class of concrete words in the reweighting table.)
(2) The recurrence of scattering in a'. and c'.
increases continuity in the sentence sequence. The gener-
ation program achieves word repetition in adjacent or
nearly adjacent sentences by entering the noun-subjects
or noun-objects of previously generated sentences (the
choice between noun-subject or noun-object is made by
reference to the restriction pattern for the sentence 
being generated) into the reweighting table, together with 
a high positive reweighting value. Moreover, the possible 
governors of these nouns are also entered into the table 
with the same reweighting value. The value controls the 
probability of repeating one of the nouns in a previously 
generated sentence or of selecting a noun which is the 
governor of a word in a previously generated sentence. In 
the latter case word repetition will occur on the next 
level of dependency structure, i.e., when the program 
selects a dependent for the selected governor. 
(3) The selection in b'. and d'. of phenomenon and
particle, hypernyms of scattering and proton respectively,
introduces semantic continuity and, in addition, reduces
the redundancy and monotony of word repetition. The use
of hypernyms and synonyms is implemented by entering any
hypernym and synonym of the words in previous sentences
into the reweighting table with a positive reweighting value,
thus increasing the probability of their selection.
(4) The hypernyms phenomenon and particle in b'. and
d'. acquire "concreteness" by the addition of the anaphoric
dependents such and these. The concreteness of the noun
phrase such phenomena in b'. has presumably been provided
by the dependents of scattering in a'. In the present
system the addition of an anaphoric dependent for a hypernym
automatically terminates the selection of other dependents
for the hypernym.
(5) The selection of proton and interaction in c'.
and d'. is a result of increasing the repetition of members
of the same semantic class: the semantic classes repre-
sented by nuclear in a'. and scattering in c'. are up-
weighted during the generation of c'. and d'.
(6) The undesirable repetition in d. is eliminated:
words generated in a noun phrase are entered with negative
value in the reweighting table, so that their repetition
in the same phrase is inhibited.
The results of implementing even these few simple
cohesion principles are encouraging. Experimentation
with additional constraints continues.
6. CONCLUSION
The paragraph generator is currently operational, and
produces output in reasonable times. Using the strategies
for achieving development and cohesion so far developed,
it is capable of generating ten-sentence strings in
approximately fifteen seconds. Some of the main difficul-
ties connected with the output are the following:
(1) Deficiencies in the co-occurrence data affect
the quality of individual sentences. For example, some
nouns have very few dependents, a characteristic deriving
from their behavior in the text on which the data are based;
the selection of one of these nouns in a sentence may
nullify the effect of applying strategies for development
or cohesion. In general a generated paragraph is only as
strong as its weakest link; defective single sentences can
disturb the implementation of structural principles.
(2) The grammar permits the generation of simple
sentences only. Complex or compound sentences can, of
course, be created by the device of juxtaposing these
simple sentences with the help of conjunctions or relatives;
the conditions under which this can be done remain to be
specified.
(3) The creation of "lexical fields" (containing,
e.g., such words as "to photograph," "camera," "film")
would greatly increase the effect of cohesion. Distribution-
al data for the formation of such "fields" are not readily
available; if the classes are to be intuitively created,
the result will be inconsistent with our present system of 
classification. 
Study of these problems continues through analysis of
the output. The effects of strengthening or relaxing
various criteria for achieving development and cohesion
have been observed in a series of experiments. The use of
alternative sets of language input data (e.g., different
dependent probabilities or semantic classes) is also con-
templated. (It should be emphasized that the program is
not oriented toward a particular language or set of language
data.) The experimental design of the generation program
is consistent with this kind of modification.
