American Journal of Computational Linguistics
Microfiche 36 
13TH ANNUAL MEETING 
Timothy C. Diller, Editor 
Sperry-Univac 
St. Paul, Minnesota 55101 
Copyright © 1975 by the Association for Computational Linguistics
PREFACE 
The fifth and final ACL session was split into two subsessions:
one continued the treatment of discourse structure
and general knowledge begun in session 4; the other provided
a look at several automated text analysis systems. Georgette
Silva kindly chaired both subsessions.
Only five of the six talks given are represented in this
Proceedings. The paper detailing Salton's talk on automatic
indexing was far too extensive to be included on this fiche
and hence will be published separately. The paper by Klappholz
and Lockman discusses the problems involved in the resolution
of cross-sentential reference and sketches an algorithm
for their solution. (Note the closely related paper by Deutsch
in Session 4.) Rosenschein addresses the problem of restricting
the generation of inferential propositions given a set of
beliefs and proposes a structural constraint upon inferencing.
Beckles et al. present a man-machine approach to the description
of idiolect variations in an environment extraordinarily
complex linguistically and sociologically. Brill and Oshika
describe a set of programs which permit both batch and interactive
processing of orthographic and phonological strings to
provide information on frequency, contextual variation, and
associational relations. Anderson, Bross, and Sager present
a theory of linguistic compression in written texts and describe
the results of an implementation of that theory.
Timothy C. Diller, Program Committee Chairman
TABLE OF CONTENTS
MODELING DISCOURSE AND WORLD KNOWLEDGE
Contextual Reference Resolution
  David Klappholz and Abe Lockman ..................... 4
How Does a System Know When to Stop Inferencing?
  Stan Rosenschein ..................... 26
Developing a Computer System to Handle Inherently
Variable Linguistic Data
  D. Beckles, L. Carrington, and G. Warner in collaboration
  with C. Borely, H. Knight, P. Aquing, and J. Marquee ..... 40
A Natural Language Processing Package
  David Brill and Beatrice T. Oshika ..................... 52
On the Role of Words and Phrases in Automatic Text
Analysis
  Gerard Salton (Abstract only) ..................... 67
Grammatical Compression in Notes and Records: Analysis
and Computation
  Barbara B. Anderson, Irwin D. J. Bross, and Naomi Sager ... 68
American Journal of Computational Linguistics Microfiche 36 : 4 
DAVID KLAPPHOLZ AND ABE LOCKMAN
Department of Electrical Engineering
and Computer Science
Columbia University
New York, New York 10027
ABSTRACT 
With the exception of pronominal reference, little has been written
(in the field of computational linguistics) about the phenomenon of reference
in natural language. This paper investigates the power and use of
reference in natural language, and the problems involved in its resolution.
An algorithm is sketched for accomplishing reference resolution using
a notion of cross-sentential focus, a mechanism for hypothesizing all
possible contextual references, and a judgment mechanism for
discriminating among the hypotheses.
The reference resolution problem 
The present work began as an attempt to develop a set of
algorithms and/or heuristics to enable a primitive-based,
inference-driven model of a natural language user (Schank 1972,
Rieger 1974) to properly resolve pronominal references across
sentence boundaries. The authors quickly realized, however, that
the problem of pronominal reference resolution is only a small
aspect of a problem which might be termed nominal reference
resolution, itself but a small aspect of the problem of the
coherence of a text (or conversation), i.e. the manner in which it
"means" more than the logical conjunction of the meanings of its
individual constituent sentences.
Examples of the first problem, i.e. pronominal reference resolution,
are given in sentence sequences 1-4 below.
1. Yesterday some boys from our village chased a pack of wild dogs; 
the largest one fell into a ditch. 
2. The wild dogs which forage just outside our village suffer from a
strange bone-weakening disease. Yesterday some boys from our
village chased a pack of wild dogs; the largest one broke a leg and
fell into a ditch.
3. Yesterday John chased Bill half a block; he was soon out of breath. 
4. My friend Bill has an extremely severe case of asthma. Yesterday 
John chased Bill half a block; he was soon out of breath. 
The problem in utterance (text, conversation, etc.) excerpts of the
above type is that of determining the referents of the various occurrences
of the pronouns "one" and "he."
For the moment we simply note that the usually preferred referents
of the two occurrences of "one" are "boy" and "dog" (examples 1 and
2 respectively) and those of the two occurrences of "he" are "John"
and "Bill" (examples 3 and 4 respectively).
The more general problem of nominal reference resolution is
exhibited in the following annotated excerpt from a recent newspaper
article (N. Y. Times 7/15/75, byline Arnold Lubasch); subscripted
bracketing of the excerpt is intended only to enable later reference to
specific parts of the text.
[1 Some of the major provisions of [2 the state's Fair Campaign
Code ]2 ]1 were declared unconstitutional here yesterday by [3 a special
Federal court ]3 that assailed [4 the restrictions on election
campaigning ]4 as "repugnant to the right of freedom of speech."

[5 The three-judge court, ]5 which was convened to consider a
constitutional challenge by three State Assembly candidates last year,
threw out [6 [7 the code's ]7 prohibition against attacking any political
candidate's race, sex, religion or ethnic background. ]6

[8 It ]8 also overturned [9 [10 [11 the code's ]11 ban ]10 on any
misrepresentation of a candidate's party affiliation, position on political
issues and personal qualifications, including the use of "character
defamation" and "scurrilous attacks." ]9

According to [12 the court's ]12 38-page decision, written by
[13 Judge Henry F. Werker ]13 with the concurrence of [14 Judges Leonard
P. Moore and Mark A. Costantino ]14, [15 [16 the provisions ]16 banning
misrepresentation ]15 "cast a substantial chill on the expression of
protected speech that are unconstitutionally overbroad and vague."
If newspaper reporters had a bit more sympathy for those of us
concerned with natural language processing, the above excerpt might
have read as follows:
The state has a Fair Campaign Code. 
Some of the major provisions of the state's Fair Campaign Code
are provisions which restrict something.
Some of the things restricted by some of the major provisions of
the state's Fair Campaign Code which restrict something are activities
having to do with election campaigning.
Some of the activities having to do with election campaigning which
are restricted by some of the major provisions of the state's Fair
Campaign Code which restrict something are attacking a political
candidate's race, sex, religion or ethnic background and misrepresenting
a candidate's party affiliation, position on political issues ...
Last year three state assembly candidates filed a constitutional
challenge to some of the major provisions of the state's Fair Campaign
Code which restrict something.
Yesterday a special Federal court declared unconstitutional those of the
major provisions of the state's Fair Campaign Code which restrict
something ...
The point is that in order for a machine or a human to validly
claim to have "understood" the original excerpt, he/she/it must be able
at the very least to demonstrate that he/she/it has established the
following relationships between various items occurring in the excerpt.
(Integers represent subscripted bracketed segments of the original excerpt.)
(i) The identity of 2, 7, and 11
(ii) The identity of 3, 5, 8, and 12
(iii) The fact that 4, 6, 9, and 15 are elements, subsets or parts of 1
(iv) The fact that 13 and 14 are members of 3
and on and on and on. (I.e., a closer analysis of the original excerpt
reveals many more relationships which must be established before
"understanding" may be claimed.)
If people actually wrote/spoke in the style of the somewhat
facetious paraphrase of the original excerpt, the nominal reference
problem would be reduced to one of matching lexical patterns and
recognizing a few syntactic cues; to state the obvious, the necessity for
more succinct linguistic communication has forced the development of
elliptical devices which shift the burden of nominal reference resolution
from syntactic analysis to an analysis of the "semantics" of sentences
in context. More specifically, nominal references cannot in general be
resolved without the use of general semantic information as well as
specific world knowledge.
While the fact that syntactic analysis alone is insufficient for
understanding is anything but novel, the question of the magnitude of the
nominal reference problem and of its solution's crucial dependence upon
local context seems to have been little commented upon. (Clark (1975)
discusses the problem from a viewpoint different from that of this paper.)
The reader who remains unconvinced by the examples above that
local context (and specific world knowledge relating to local context)
must play a crucial role in reference resolution is asked to consider the
two sentence sequences 5a, 6, and 5b, 6.
5. 
a. The founding fathers had a difficult time agreeing on how the 
basic laws governing our country should be framed. 
b. Those foolish people at the country club have spent an incredible 
amount of time arguing about club rules. 
6. The second article of the constitution, for example, was argued 
about for months before agreement was reached. 
In sentence sequence 5a, 6, "the second article" clearly refers to
the second article of the constitution of the United States, while in
sentence sequence 5b, 6, the reference is to the second article of the
constitution of the country club. In each case the only factor involved
in resolving the reference is the semantic content of its local context -
in this case the meaning of the sentence preceding the one in which the
reference occurs.
Since the lexical item "the constitution" appears in the example
just considered, a word concerning such proper-noun-like objects is in
order. In any language there are lexical items and phrases, such as
those appearing in 7 below, which, in the absence of a compelling
alternative, have standard default referents; for example the standard
default referents of the items in 7 are the corresponding items in 8.
7. a. The constitution 
b. The founding fathers 
c. Wall Street 
d. The establishment 
e. The president
f. Madison Avenue 
8. a. The constitution of the U. S.
b. The founding fathers of the U. S.
c. The U. S. business community (or that part of it residing in
New York City.)
d. Those people who have the power to influence the course of events
in the nation etc. etc.
e. The president of the U. S.
f. The advertising industry.
In order for textual occurrences of such proper-noun-like objects to be
properly handled, their standard default referents must be listed in the
lexicon. This is not to say that occurrences of proper-noun-like objects
cannot be references to objects occurring previously in the text; rather
it is the case that their default options must also be considered as
possible referents.
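
The treatment of default referents described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
table name DEFAULT_REFERENTS and the function candidate_referents are
hypothetical, and the entries are simply items from lists 7 and 8. Candidates
from local context are kept, and the default referent is added as one more
possibility rather than replacing them.

```python
# Hypothetical lexicon of standard default referents (entries from lists 7 and 8).
DEFAULT_REFERENTS = {
    "the constitution": "the constitution of the U.S.",
    "the founding fathers": "the founding fathers of the U.S.",
    "the president": "the president of the U.S.",
    "madison avenue": "the advertising industry",
}


def candidate_referents(phrase, context_candidates):
    """Return referents found in local context plus the default referent, if any.

    The default is appended, not substituted: it is one more hypothesis that
    a later judgment step would have to weigh against the contextual ones.
    """
    candidates = list(context_candidates)
    default = DEFAULT_REFERENTS.get(phrase.lower())
    if default is not None and default not in candidates:
        candidates.append(default)
    return candidates
```

For sentence sequence 5b, 6, for instance, "the constitution" would come back
with both the country club's constitution (from context) and the U.S.
constitution (the default) as candidates.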
As final examples of the reference resolution problem let us con- 
sider sentence sequences 9 and 10 below. 
9. The president was shot while riding in a motorcade down one of the 
major boulevards of Dallas yesterday; it caused a panic on Wall 
Street. 
10. John was invited to tea at the Quimbys' last Saturday; he would have
loved to go, but he knew he'd be busy then.
In example 9, while the first sentence of the sequence contains a
number of noun objects (president, motorcade, boulevards, Dallas)
which are potential referents for the occurrence of "it" in the second
sentence, none of these is, in fact, the proper referent; rather, the
proper referent of "it" is the event (or fact) that "The president was
shot while ...."
In example 10 we have an instance of an adverbial reference
("then") which must be recognized as referring to "last Saturday" rather
than to some non-adverbial object occurring in the first sentence of
that example.
Sketch of a Solution 
From the point of view of computer implementation, the problem
of nominal reference resolution is one of creating tokens for noun 
objects mentioned in a text, and discovering and encoding the relations, 
alluded to in the text, which hold between them and various other tokens 
in memory. 
This problem, though certainly not its magnitude or ramifications,
was noticed by Rieger (1974) in his pioneering implementation of a
primitive-based model of a natural language user. Rieger's system,
however, suffers from the incredible inefficiency resulting from its need
to search all of memory in order to attempt any reference resolution;
in addition it will often miss a quite obvious referent entirely, and, in
fact, resolves non-pronominal references only accidentally if at all.
Before presenting a sketch of a proposed solution to the nominal
reference resolution problem, it would be well to detail more precisely
the overall language processing environment within which it is meant to
operate and of which it is a most necessary part.
First, we assume that a relatively small set, S, of semantic
primitives and a logical-calculus-like language, L, for expressing
"meanings" are available. The set S and language L must satisfy the
following two conditions.
(i) The predicate, function, and constant symbols of L are members of S.
(ii) There is a one-to-one mapping from meanings of (natural language)
sentences to formulas of L.
While a set of primitives and a meaning representation language
even demonstrably close to satisfying the above conditions have yet to be
produced, we will, in examples to follow, make use of meaning
representations; the only claim we will make for them is that the functions
served by their constituent constructs must be served by the elements
of any adequate system.
In addition to a meaning representation scheme we will assume an
encoding of world knowledge of the sort which a "typical" adult might
possess, again with the same obvious caveat.
While the question of translation from natural language sentences
to meaning representations will not be touched upon here, we will
assume sentence-by-sentence translation of the sort exhibited in various
examples to follow.
The solution to the reference resolution problem rests in recognizing
the fact that reference is an elliptical device, and that the
human understander of natural language cannot recapture that which was
elided once he is too far from it in the text; in fact, he cannot resolve
a reference to a point in the text more than a few sentences back without
going back and pondering it (if he can do so at all). We should note
that this is true even in the case in which the referent doesn't
actually appear in the text, but appears only in an inference from some
statement made in the text. In this latter case - a case which we will
discuss only at the very end of this paper - the reference is not resolvable
(and would not therefore have been made by the creator of the text in the
first place) unless the statement from which the inference is made
appears shortly before in the text. Though we cannot say precisely
how far back is meant by "shortly before," it is certainly no more than
a few sentences. For a given sentence, S, appearing in a text we will
refer to the sequence of sentences preceding S by no more than the
intended distance as the focus of S.
In terms of computer implementation, we will, in the processing of
a text (which we conceive of as proceeding sentence-by-sentence),
maintain the following focus sets.
(i) The noun-object focus - the set of noun-object tokens in the meaning
representation of the focus of S (where S is the sentence currently
being processed)
(ii) The event focus - a set containing, for every sentence W in the
focus of S, the object EVENT(F), where F is the meaning
representation of W, and EVENT is a function which maps a
formula, F, into a noun-like object whose meaning
is "the event (or fact) that F"
(iii) The time focus - a set containing tokens for all time references
(e.g. yesterday, five o'clock, etc.) occurring in the meaning
representation of the focus of S.
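
One way to picture these three focus sets is as views over a sliding window
of the most recent sentence meanings. The sketch below is an assumption-laden
illustration, not the authors' implementation: the names SentenceMeaning,
Focus and FOCUS_SPAN are hypothetical, meaning representations are reduced to
plain strings and token lists, and the window length of 3 is an arbitrary
stand-in for "no more than a few sentences."

```python
from collections import deque
from dataclasses import dataclass, field

FOCUS_SPAN = 3  # assumed stand-in for "a few sentences"; the paper fixes no number


@dataclass
class SentenceMeaning:
    """Hypothetical, much-simplified meaning representation of one sentence."""
    formula: str                  # e.g. "C1 & C2 & C3"
    noun_tokens: list             # tokens for noun objects, e.g. ["x1", "x2"]
    time_tokens: list = field(default_factory=list)  # e.g. ["YESTERDAY"]


class Focus:
    """Sliding window over the meanings of the most recent sentences."""

    def __init__(self, span=FOCUS_SPAN):
        # deque(maxlen=...) silently drops the oldest sentence when full.
        self._window = deque(maxlen=span)

    def update(self, meaning):
        """Call after a sentence has been processed, so it joins the focus."""
        self._window.append(meaning)

    def noun_object_focus(self):
        return [t for m in self._window for t in m.noun_tokens]

    def event_focus(self):
        # EVENT(F) is rendered as a string; a real system would build a token.
        return [f"EVENT({m.formula})" for m in self._window]

    def time_focus(self):
        return [t for m in self._window for t in m.time_tokens]
```

Because the window is bounded, a referent that fell out of focus is simply
not offered as a candidate, which mirrors the claim that distant references
cannot be resolved without going back in the text.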
The reader may question our inclusion of every object appearing
in the meaning representation of the focus of S in one of the above focus
sets, i.e. in the set of potential referents. In fact, however, it seems
to be the case that any object (of one of the above-mentioned types)
occurring in the meaning representation of the focus of S may be the
referent of an object occurring later in the meaning representation of S.
Consider, for example, the sentence sequences formed by taking each of
the sentences of 12 below, in turn, as an immediate continuation of a text
containing sentence 11 below.
11. Stan argued with his sister Fran in an attempt to convince her that
she should bring Mary, whom he would like to get to know, on their
planned trip to the San Diego Zoo tomorrow.
12. a. _He_ was really insistent.
b. _She_ was hard to convince.
c. _It_ was useless.
d. He thinks _she's_ the prettiest one of all Fran's friends.
e. _The prospect_ really excites him.
f. He argued that _it_ wouldn't tie Mary up for more than half a day.
g. _It's_ the best one in the country, you know.
h. _She_ thought _it_ was a terrible idea.
i. She happened to be busy then, but expressed an interest in coming
along another time.
Each of the underlined items in sentences 12a-12i references some
object in sentence 11. (For the sake of clarity we present in 13 below
the referents as we understand them.)
13. a. Stan
b. Fran
c. The attempt (to convince ...)
d. Mary
e. EVENT (Stan will get to know Mary)
f. The trip
g. The San Diego Zoo
h. Both "she" and "it" are ambiguous; if "she" is taken to be "Fran," then
"it" refers to EVENT (Fran will bring Mary ...); if "she" is taken
to be "Mary," then "it" refers to EVENT (Mary will come ...)
The point is, of course, that any item in (the meaning representation
of) a sentence, S, may be referenced by some item in (the meaning
representation of) a later sentence.
On the other side of the coin, the question of identifying potential
references is just as important as that of identifying the set of all
possible referents for an object which is known to reference something.
If we were concerned only with pronominal reference resolution, the
problem would have a simple solution; every pronoun is a reference.
For nominal items other than pronouns the problem is far less simple;
if a noun occurs in a text just how do we know if there is a previously
occurring nominal item to which it refers? As much as we would like
there to be algorithmically testable criteria, i.e. recognizable syntactic
and/or semantic cues, for making the decision, there seem to be none.
Thus, the mechanism we propose considers every token appearing 
in the translation of a sentence as a possible reference. 
At present, we hypothesize the existence of a small set, R, of
relations which are sufficient to account for all instances of nominal
reference. Included in this set are, at the very least, the relations
identity, member of, subset of, and part of. Note that although this
list of relations is quite small, it suffices to handle all the examples of
reference presented thus far (i.e. those occurring in sentence sequences
1-6 and 9-12 as well as those occurring in the excerpted newspaper
article above).
All of the above observations taken together lead to the following 
sketch of an algorithm for reference resolution. 
I. As each new sentence, S, is translated into its meaning representation,
the various focus sets (noun-object, event, time) are updated.
II. A set, H, is formed containing all tuples of the form (N1, N2, P) such
that N1 is a nominal item occurring in the meaning representation
of S, N2 is an object occurring in the focus set (noun-object,
event, or time) appropriate to N1, and P is a member of R; H
is the set of all current reference hypotheses arising from S.
III. A "judgment mechanism," discussed below, is invoked to determine
the likelihoods of the correctness of the various members of H.
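
Step II above amounts to forming a cross product, which the following Python
sketch makes concrete. It is a minimal illustration under assumptions that
are not in the paper: nominal items are plain names tagged with a
hypothetical "kind" (noun, event, or time) that selects the appropriate focus
set, and the function name reference_hypotheses is invented for the example.

```python
from itertools import product

# The relation set R, with the four relations the paper names as a minimum.
RELATIONS = ("identical to", "member of", "subset of", "part of")


def reference_hypotheses(nominal_items, focus_sets, relations=RELATIONS):
    """Form the hypothesis set H of triples (N1, N2, P).

    nominal_items: dict mapping each nominal item of the current sentence to
        the kind of focus set appropriate to it ("noun", "event" or "time").
    focus_sets: dict mapping each kind to the list of objects in that focus.
    """
    hypotheses = []
    for n1, kind in nominal_items.items():
        # Pair N1 with every object in its focus set under every relation.
        for n2, p in product(focus_sets.get(kind, []), relations):
            hypotheses.append((n1, n2, p))
    return hypotheses
```

With three new nominal items, two objects in the noun-object focus, and four
relations, this yields 24 hypotheses, all of which are handed to the judgment
mechanism of step III. Nothing is filtered at this stage.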
It is clear that following step II any further processing of reference
hypotheses requires that all members of H be considered relative to one
another, since the correctness or incorrectness of one may depend
crucially upon that of others. In the general case not all hypotheses will
turn out to be correct, and in fact some may contradict others - for
instance in the case of two hypothesis-triples with identical first and
second elements and different third elements.
Once it has been created, the set H is submitted to a "judgment
mechanism" whose task it is to choose some of the hypotheses as valid
and others as invalid. The judgment mechanism must clearly have
access to the world knowledge stored in memory, and must be capable of
performing inferencing of a sort which produces decisions as to the
relative likelihoods of the various hypotheses.
Before giving examples of just how such a judgment mechanism
might work, we should make it clear that our sense of "inferencing" is very
different from Rieger's (1974). In Rieger's sense inferencing is
undirected, while ours is directed toward the goal of validating hypotheses.
There is, in addition, another sense in which the sort of inferencing to
be done by the judgment mechanism is directed. The fact that the reasons
for validating or throwing out a particular reference hypothesis (on the
part of human natural language users) involve the information conveyed in
local context as well as world knowledge relating to items contained in
that information (and world knowledge relating to items contained in world
knowledge relating to items contained in that information, etc.) constitutes
a good guess as to the particular pieces of world knowledge and the rules
of inference which must be involved in judging that hypothesis.
Examples of reference resolution
14 and 15 below contain components of possible meaning
representations of the two sentences of sentence sequence 1 at the beginning
of this paper.
14. C1: CHASED (x1, x2)
C2: TIME (C1, YESTERDAY)
C3: SUBSET (x1, [BOY])
C4: SUBSET (x2, [DOG])
C5: GREATER (SIZE (x1), 1)
C6: GREATER (SIZE (x2), 1)
15. C7: FALL INTO (y1, y2)
C8: TIME (C7, PAST)
C9: MEMBER (y2, [DITCH])
C10: MEMBER (y1, y3)
C11: LARGEST (y1, y3)
The meaning representations proposed for the two sentences are
C1 ∧ C2 ∧ C3 ∧ C4 ∧ C5 ∧ C6 and C7 ∧ C8 ∧ C9 ∧ C10 ∧ C11 respectively.
Note that we are not claiming that the predicates CHASED and FALL INTO
and the constants YESTERDAY, BOY, DOG, PAST and DITCH are at the level
of semantic primitives; rather, the above analyses are at just the level
which we need to illustrate the operation of the reference resolution
mechanism. Furthermore, the symbols YESTERDAY, BOY, DOG, PAST and DITCH
should be taken as pointers to the definitions of the appropriate items
encoded in memory in whatever fashion. The bracketing in the notation [A],
where A is a pointer to a definition, is meant to be a function which takes
A into an object whose meaning is the class of items satisfying the meaning
pointed to by A.
Once the translation of the first sentence of sequence 1 into its
meaning representation has been completed - on the assumption that that
sentence is at the beginning of the text being processed - the various
focus sets will contain the following:
noun-object focus: {x1, x2}; event focus: {EVENT (C1 ∧ C2 ∧ C3 ∧ C4 ∧ C5 ∧ C6)};
time focus: {YESTERDAY}.
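After the second sentence is translated the set, H, of reference-triple
hypotheses presented to the judgment mechanism will then be the
following, i.e. every combination of one item from each column:

    { y1 }     { is identical to }     { x1 }
    { y2 }     { is a member of  }     { x2 }
    { y3 }     { is a subset of  }
               { is a part of    }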
Note that no member of the event focus occurs in H because the
translation of the second sentence contains no term of the form EVENT(y); for
simplicity we omit the question of time referencing.
All of the relations between y2 and x1 or x2 can be ruled out
on the basis of SUBSET (x2, [DOG]), SUBSET (x1, [BOY]), MEMBER
(y2, [DITCH]) and of the world knowledge to the effect that boys/dogs
cannot be identical to, members of, subsets of or parts of ditches (of
course in some weird fairy tale setting one of these might be possible
and shouldn't be thrown out; but in such a case local context would
inform us of the "weird" situation and the appropriate one wouldn't be
thrown out).
The hypothesis that y1 or y3 is a part of either x1 or x2 can be
ruled out on the basis of SUBSET (x1, [BOY]) and SUBSET (x2, [DOG]),
which tell us that x1 and x2 are sets of objects, and the world
knowledge that sets don't have "parts" in the sense of the "part of"
relation.
Identity between y1 and either x1 or x2 can be ruled out on the
basis of MEMBER (y1, y3), which tells us that y1 is an individual, and
SUBSET (x1, [BOY]), SUBSET (x2, [DOG]), GREATER (SIZE (x1), 1),
and GREATER (SIZE (x2), 1), which tell us that x1 and x2 are sets
containing more than one object. (Remember that we're not doing
axiomatic set theory, in which there are no "individuals" in our sense
and in which the sort of "individual" which is dealt with can be a subset
of some set.)
Finally, the "member of" relation between y3 and either x1 or x2
can be ruled out on the basis of MEMBER (y1, y3), which requires that
y3 be a set, SUBSET (x1, [BOY]), SUBSET (x2, [DOG]), GREATER
(SIZE (x1), 1), and GREATER (SIZE (x2), 1), which tell us that x1 and x2
are sets containing more than one element each, and the fact that sets
are not members of sets. (Again, we're not dealing with set theory; if
in fact, we were talking about axiomatic set theory in English, then
local context would contain that information, and different inferences
would come into play.)
This leaves us with the following hypotheses:

    y3   { is identical to }     { x1 }
         { is a subset of  }     { x2 }

    y1   is a member of          { x1 }
                                 { x2 }

But some of these hypotheses are consistent with one another: in fact
the hypotheses

    y3   { is identical to }     xi,   i = 1, 2
         { is a subset of  }

imply the hypotheses

    y1   is a member of xi,   i = 1, 2

respectively because of MEMBER (y1, y3).
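
The eliminations just described can be replayed mechanically. The sketch
below is a toy reconstruction, not the authors' judgment mechanism: the
tables KIND, CLASS and UNRELATABLE are hypothetical hand encodings of what
C3-C6 and C9-C11 and the cited world knowledge tell us, and the rule that an
individual cannot be a subset of a set is only implied by the paper's
parenthetical remark about "individuals."

```python
from itertools import product

RELATIONS = ("identical to", "member of", "subset of", "part of")

# Assumed sort information read off C3-C6 and C9-C11.
KIND = {"x1": "set", "x2": "set", "y1": "individual", "y2": "individual", "y3": "set"}
# Assumed class information: the class an object (or its elements) falls under.
CLASS = {"x1": "BOY", "x2": "DOG", "y2": "DITCH"}
# World knowledge: ditches stand in none of the relations to boys or dogs.
UNRELATABLE = {("DITCH", "BOY"), ("DITCH", "DOG")}


def ruled_out(n1, n2, relation):
    """Return True if the hypothesis (n1, n2, relation) can be discarded."""
    if (CLASS.get(n1), CLASS.get(n2)) in UNRELATABLE:
        return True                      # all relations between y2 and x1/x2
    if relation == "part of" and KIND[n2] == "set":
        return True                      # sets don't have "parts"
    if relation == "identical to" and KIND[n1] != KIND[n2]:
        return True                      # an individual is not a set
    if relation == "member of" and KIND[n1] == "set":
        return True                      # sets are not members of sets
    if relation == "subset of" and KIND[n1] != "set":
        return True                      # implied: individuals are not subsets
    return False


def surviving_hypotheses(new_items=("y1", "y2", "y3"), focus=("x1", "x2")):
    """Generate all triples and keep those no rule eliminates."""
    return [(n1, n2, r)
            for n1, n2, r in product(new_items, focus, RELATIONS)
            if not ruled_out(n1, n2, r)]
```

Under these assumptions, 6 of the 24 triples survive: y3 identical to or a
subset of x1 or x2, and y1 a member of x1 or x2, matching the hypotheses
listed above.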
At any rate, the judgment
mechanism assumes at this point that either y1 is a member of x1 or
y1 is a member of x2. The reader is asked to recall at this point
that in presenting the usually preferred referents for references in
sentence sequences 1-4 the claim was made that in sentence sequence 1,
the usually preferred referent for "one" is "boys." The reason for this
claim is the authors' observation that, when such a pronominal reference
occurs as the surface subject of a sentence, in the absence of semantic
content which discriminates among the various possible referents, most
people seem to take the surface subject of the last sentence in the focus
as the intended referent. The reason for this human judgment is probably
that the reader/hearer takes the surface subject to be the "topic" of a
sentence. If this observation is correct, the judgment mechanism should,
in the current example, simply choose "one of the boys" (y1 is a member
of x1) as the proper referent. If this observation is incorrect, the judgment
mechanism should judge that there is ambiguity in the reference "one."
Sentence sequence 2 at the beginning of this paper would be handled
in precisely the same manner as sentence sequence 1 up to the point at
which "y1 is a member of x1" and "y1 is a member of x2" were the
remaining hypotheses. The knowledge that "the dogs" referred to suffer
from a strange bone-weakening disease would then cause the judgment
mechanism to strengthen the likelihood that "one" refers to "dogs," thus
causing "y1 is a member of x2" to be the preferred judgment.
Sentence sequence 16 below contains an example of EVENT reference.
16. The president was shot yesterday. It caused a panic on Wall Street.
Omitting all other details of the translation into meaning representation we
simply note that the primitive-level predicate into which "cause" is
translated requires an object of the form EVENT (F) as its subject (i.e. if we
say something like "John caused a stir" what we mean is that John did
something and the event (or fact) that he did that caused a stir). Thus,
when the 2nd sentence is handled, the only possible referents for "it" will
be the objects contained in the EVENT focus, namely just EVENT (the
president was shot yesterday). The judgment mechanism thus must simply
decide if the event (or fact) that the president was shot yesterday was likely
to have caused a panic on Wall Street, a judgment which, with adequate
world knowledge, should certainly be confirmed.
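
The restriction that "cause" places on its subject can be read as a sort
check that selects which focus set is consulted. The following sketch
illustrates only that selection step; the table SUBJECT_SORT, the predicate
name CAUSE and the function possible_referents are hypothetical, and the
likelihood judgment itself is left out.

```python
# Assumed table: the sort a predicate requires of its subject, if any.
SUBJECT_SORT = {"CAUSE": "event"}


def possible_referents(predicate, focus_sets):
    """Return the focus objects a pronoun subject of `predicate` may refer to.

    focus_sets maps "noun", "event" and "time" to lists of focus objects.
    If the predicate constrains its subject, only that focus set is used;
    otherwise every object in every focus set remains a candidate.
    """
    required = SUBJECT_SORT.get(predicate)
    if required is None:
        return [obj for objs in focus_sets.values() for obj in objs]
    return list(focus_sets.get(required, []))


# Focus sets after the first sentence of sequence 16 (contents simplified).
focus_16 = {
    "noun": ["the president"],
    "event": ["EVENT(the president was shot yesterday)"],
    "time": ["YESTERDAY"],
}
```

Calling possible_referents("CAUSE", focus_16) leaves the single EVENT object
as a candidate for "it", so the judgment mechanism has only one hypothesis
left to confirm or reject.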
Sentence sequence 17 is a very similar case. 
17. The president was shot yesterday. Bill told me all about it. It 
caused a panic on Wall Street. 
In order to resolve the reference "it" in the last sentence of 17, the
judgment mechanism would have to decide on the relative likelihoods of
i and ii below.
(i) The event (or fact) that the president was shot yesterday caused a
panic on Wall Street.
(ii) The event (or fact) that Bill told me about the president being shot
yesterday caused a panic on Wall Street.
Again, with the availability of reasonable world knowledge about such
things as presidents, their being shot and panics, the judgment mechanism
should be able to choose the proper referent for "it."
While a fully detailed specification of the judgment mechanism must 
await further investigation, the above examples should illustrate, at least 
in part, the manner in which we conceive of its operation. 
Conclusions 
The phenomenon with which we have been dealing is one example of
what we would like to call the "creative" aspect of language use; more
specifically, reference of the sort we have described - and attempted to
handle - is an elliptical device necessary for effective communication;
moreover, it is a device which exhibits the ability of language to "change
the ground rules" in a very flexible and fluid manner in response to
context.
At this point we must admit that there is an even more creative
type of reference than the sort we have dealt with. 18 below is an
example of this type of reference.
18. Last week I caught a cold while visiting my mother in Chicago; as
usual, the chicken soup had too much pepper in it.
The interesting reference in the above example is "chicken soup."
There is no item in the first sentence to which it is directly related; on the
other hand, few people have any trouble resolving it by interpolating
between the two sentences of example 18 the idea expressed in sentence 19
below:
19. When I get sick my mother makes me chicken soup.
If sentence 19 were available, our reference resolution mechanism would
easily come up with an identity relation between the two occurrences
of "chicken soup." Obviously, for our proposed mechanism to resolve
this reference, some sort of inferencing must first work on the 1st
sentence of 18 to produce the meaning of 19 as an inference. Thus it is
clear that reference resolution and general inferencing must be
interleaved.
The mechanism proposed above does not handle the entire problem.
It does, however, seem to be a minimal model of reference resolution
(minimal in the sense that at least this much must be going on). In
addition, it provides for that control over the use of general inferencing
which is required to avoid a combinatorial explosion (BOOM).
HOW DOES A SYSTEM KNOW WHEN TO STOP INFERENCING?*
STAN ROSENSCHEIN**
The Moore School of Electrical Engineering
University of Pennsylvania, Philadelphia 19174
Abstract The problem of constraining the set of inferences added to a set of
beliefs is considered. One method, based on finding a minimal unifying
structure, is presented and discussed. The method is meant to provide
internal criteria for inference cut-off.
I. Introduction 
Natural language processing systems that are sensitive to the semantic
and logical content of processed sentences and to the pragmatics of their use
generally draw inferences. A set of formulas representing the meaning of a
sentence and the 'state of belief' of the system is augmented by other related
formulas (the inferences) which are retrieved and/or constructed during the
processing. The problem to be investigated here is: How can this process be
controlled? Can reasonable criteria be found for restraining the addition of
inferences?
Top-down inferences following from the meaning of lexical items (often
expressed by decomposition into primitives) are clearly bounded, if no
interactions are allowed among the generated sub-formulas. This process
(which we call EXPANSION) will not be discussed here. Rather, we shall be
concerned with SYNTHESIS, i.e., the addition of new formulas based on the
presence of already generated lower-level formulas, which we shall call
beliefs. In particular, we are concerned with inferences added because a set
of beliefs is recognized as fitting a pre-defined pattern.
* This work was partially supported by NSF Grant SCC 72-05465A01.
** Author's current address: Courant Institute of Mathematical Sciences,
New York University, 251 Mercer Street, New York, New York 10012.
The question we ask is: Given an initial set of beliefs over a set of
primitives, what criterion can be used to limit the process of pattern matching
and associated inference addition? The major structural feature that we use
to provide such a criterion is a partial order over the set of patterns.
Before pursuing this suggestion any further, let us examine some of
the alternative approaches to inference and inference cut-off.
To logicians, deductive inference involves rules by which formulas can
be added to a set (which initially contains the axioms) in certain ways
provided other formulas are already in the set. In general, this sort of
inference is quite open-ended in that one can keep applying the rules of
inference and come up with more and more formulas, all of which represent
'provable' statements. The termination criterion for a particular invocation
of the mechanism might be the appearance of an 'interesting' formula or the
loss of interest of the inferencer, but in general the statement of the rules
of inference says nothing about when to cease deriving formulas.
This paradigm from logic has been carried over into Artificial
Intelligence systems, where the issue of termination is very real. The
usual solution has been to invoke the inferencer under the very strict
control of a supervising program which has its own goals programmed in and
which makes certain that appropriate criteria are applied to halt the
inferencing. This is most apparent in systems written in PLANNER-like
languages, which have user-programmable mechanisms for controlling the proof
process.
In the work of Schank and Rieger (Sch,75) (Ri,74) inference has more
of the flavor of free association; inferences are conceived of as expanding
spheres in 'inference space.' Two termination strategies are employed: (1)
the discovery of a chain of inferences leading from one of the initial
beliefs to another through a shared formula, or 'contact point' in inference
space, and (2) the association of numerical 'strengths' to formulas so that
a line of inference can be discontinued if the strength falls below a certain
threshold.
Strategy (2) is somewhat unsatisfying in view of the arbitrariness and
attendant difficulties in evaluating the role of particular numerical
constants in the total behavior of a complex system. These constants,
presumably, have little to do with the mathematical structure of the formal
inference scheme, and as such we would call them 'external criteria.' A
strategy like (1) above, on the other hand, is more 'internal' and is to be
preferred.
A goal of the present work is to formulate a reasonable internal criterion
for inference cut-off which can be stated formally as part of the inference
rule. To do this, we shall impose a structure on the set of patterns to be
used in inferencing, and the rule for adding inferences will be formulated
in terms of this structure.
The operations to be described below are explained more fully in (R,75),
where a description of a computer implementation is also presented.
* See also (C,75), (W,75). 
II. A Partial Order for the Pattern Set
The inference rule we are aiming for is to depend on the set of input
beliefs and the set of patterns. The notion we are trying to formalize is
"What does this set of beliefs suggest with respect to this set of patterns?"
The particular class of inferences we are concerned with are those gotten
by matching beliefs in the input set against a pattern and augmenting the
beliefs with additional propositions as dictated by the pattern. We want to
find the least instances of patterns which cover (include) the set of input
beliefs. We will take as inferences all propositions (an arbitrary number)
which are entailed by that instance of the pattern.
Put another way, the inference operation is to jump to conclusions.
However, it is only to jump to those conclusions required to make the resulting
set an instance of the least possible pattern in the pattern set.
The key concept here is 'least' in that this is what controls how many
inferences are added. What would be a suitable ordering relation for patterns
and propositional beliefs? One which naturally suggests itself and which is
currently under investigation relies on the relations of instantiation and
set inclusion: (1) $p \le q$ if $q$ is a substitution instance of $p$,
and (2) $S \le S'$ if $S \subseteq S'$.
Combining these two, we say that
$\{p_1, \ldots, p_n\} \le \{q_1, \ldots, q_m\}$, where the $p_i$'s and
$q_j$'s are propositional forms, if there is a substitution, $s$, for the
variables of $\{p_1, \ldots, p_n\}$ such that
$\{s(p_1), \ldots, s(p_n)\} \subseteq \{q_1, \ldots, q_m\}$.
We adopt the notational convention of prefixing variables with '?'.
Let P be a set of propositional forms in the variables ?x and ?y,
and let Q = {(HAPPY JOHN), (GIVE MR. JONES JOHN TOY),
(PARENT MR. JONES JOHN)}.
If every form of P becomes a member of Q under the substitution
?x -> JOHN, ?y -> MR. JONES, then P $\le$ Q.
The 'less-than-or-equal' relation is also defined for pairs of patterns:
Let PAT-1 = {(P ?x ?y), (Q ?y ?z)}
and let PAT-2 = {(R ?u ?v ?w), (Q ?w ?v), (P ?u ?w)}.
Clearly, PAT-1 $\le$ PAT-2 under the substitution ?x -> ?u, ?y -> ?w, ?z -> ?v.
This definition of $\le$ is quite straightforward and can be made to
accommodate expressions with embeddings and predicate variables. (These are
included in the implementation.)
Note that the relation $\le$ can be thought of as an information-content
comparison; if $S \le S'$ then $S'$ contains at least as much 'information'
as $S$ (and possibly more) either by virtue of variables having been replaced
by particular constants or by additional formulas having been added to the set.
Given $\le$ for relating pairs of belief sets, pairs of patterns, and
belief-set/pattern pairs, we can now formulate the belief-set-extending
operation.
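The ordering on sets of propositional forms can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation described in (R,75): the names is_var, match_clause, and find_substitution are hypothetical, clauses are flat tuples (no embeddings or predicate variables), and the variables of the larger set are treated as opaque symbols.

```python
def is_var(term):
    """Variables follow the paper's convention of a leading '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_clause(small_clause, big_clause, subst):
    """Extend subst so that small_clause maps onto big_clause, else None."""
    if len(small_clause) != len(big_clause):
        return None
    extended = dict(subst)
    for s_term, b_term in zip(small_clause, big_clause):
        if is_var(s_term):
            # Bind the variable, or check an earlier binding for consistency.
            if extended.setdefault(s_term, b_term) != b_term:
                return None
        elif s_term != b_term:
            return None
    return extended


def find_substitution(small, big, subst=None):
    """Return a substitution s with s(small) a subset of big, or None.

    Any non-None result (possibly the empty dict) means small <= big.
    """
    subst = {} if subst is None else subst
    small = list(small)
    if not small:
        return subst
    for clause in big:
        extended = match_clause(small[0], clause, subst)
        if extended is not None:
            result = find_substitution(small[1:], big, extended)
            if result is not None:
                return result
    return None


# The two patterns from the text.
PAT_1 = [("P", "?x", "?y"), ("Q", "?y", "?z")]
PAT_2 = [("R", "?u", "?v", "?w"), ("Q", "?w", "?v"), ("P", "?u", "?w")]

print(find_substitution(PAT_1, PAT_2))  # the substitution ?x->?u, ?y->?w, ?z->?v
print(find_substitution(PAT_2, PAT_1))  # None: PAT-2 is not <= PAT-1
```

The backtracking search is exponential in the worst case, which is acceptable for patterns of a few clauses; an implementation handling embeddings would make match_clause recursive.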
III. The Inference Operation: SYNTHESIZE
Given a set P of patterns and an input set Bel of beliefs,
SYNTHESIZE returns a set I of instantiated patterns from P such that the
following three conditions all hold:
(1) (Coverage of input beliefs) For each instantiated pattern $p \in I$,
Bel $\le p$.
(2) (Pairwise incomparability) If $p, q \in I$ and $p \ne q$, then neither
$p \le q$ nor $q \le p$.
(3) (Minimality) There are no other instances $r$ of patterns in P
which are not in I and yet which are $\le$ to some element of I
and for which Bel $\le r$.
The elements of I = SYNTHESIZE(Bel) represent possible minimal extensions
of Bel; $\bigcap I$ represents clear extensions of Bel, namely the superset
of Bel contained in all minimal extensions.
Let P = { p1 = {(A ?x), (B ?x), (C ?x)},
          p2 = {(B ?x), (C ?x), (D ?x)},
          p3 = {(A ?x), (B ?x), (C ?x), (G ?x)} }
[Figure: the pattern set P represented graphically.]
If input set Bel = {(A JOHN), (C JOHN)}
then SYNTHESIZE(Bel) = { {(A JOHN), (B JOHN), (C JOHN)} }.
There is only one possible minimal extension; (B JOHN) is inferred.
If input set Bel = {(B JOHN), (C JOHN)}
then SYNTHESIZE(Bel) = { {(A JOHN), (B JOHN), (C JOHN)},
                         {(B JOHN), (C JOHN), (D JOHN)} }.
There are two possible minimal extensions, but the set of clear extensions
contains no inferences beyond the input set, Bel. (Had p1 and p2 shared
another clause, however, an inference would have been added.)
If the input set Bel = {(G JOHN), (B JOHN)}
then SYNTHESIZE(Bel) = { {(G JOHN), (A JOHN),
                          (B JOHN), (C JOHN)} }.
Pattern p3 is the least pattern which when instantiated covers the inputs,
and there are two inferred propositions:
(A JOHN) and (C JOHN).
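The three worked examples can be reproduced with a small Python sketch. It is an illustration under stated assumptions rather than the operation as implemented in (R,75): all function names are hypothetical, clauses are flat tuples, beliefs are ground, and SYNTHESIZE is approximated by keeping the minimal covering instances (candidates that are strictly above another candidate are dropped).

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_clause(pattern_clause, target_clause, subst):
    """Extend subst so pattern_clause maps onto target_clause, else None."""
    if len(pattern_clause) != len(target_clause):
        return None
    extended = dict(subst)
    for p, t in zip(pattern_clause, target_clause):
        if is_var(p):
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended


def covering_substitutions(pattern, beliefs, subst=None):
    """Yield substitutions s for the pattern with beliefs a subset of s(pattern)."""
    subst = {} if subst is None else subst
    beliefs = list(beliefs)
    if not beliefs:
        yield subst
        return
    for clause in pattern:
        extended = match_clause(clause, beliefs[0], subst)
        if extended is not None:
            yield from covering_substitutions(pattern, beliefs[1:], extended)


def instantiate(pattern, subst):
    return frozenset(tuple(subst.get(t, t) for t in clause) for clause in pattern)


def leq(small, big, subst=None):
    """True if some substitution maps every clause of small into big."""
    subst = {} if subst is None else subst
    small = list(small)
    if not small:
        return True
    for clause in big:
        extended = match_clause(small[0], clause, subst)
        if extended is not None and leq(small[1:], big, extended):
            return True
    return False


def synthesize(patterns, beliefs):
    """Return the minimal pattern instances that cover the beliefs."""
    candidates = {
        instantiate(pattern, s)
        for pattern in patterns
        for s in covering_substitutions(pattern, beliefs)
    }
    return {
        c for c in candidates
        if not any(leq(o, c) and not leq(c, o) for o in candidates)
    }


def clear_extension(extensions):
    """Intersection of all minimal extensions (empty if there are none)."""
    return frozenset.intersection(*extensions) if extensions else frozenset()


P = [
    [("A", "?x"), ("B", "?x"), ("C", "?x")],               # p1
    [("B", "?x"), ("C", "?x"), ("D", "?x")],               # p2
    [("A", "?x"), ("B", "?x"), ("C", "?x"), ("G", "?x")],  # p3
]

for bel in (
    [("A", "JOHN"), ("C", "JOHN")],
    [("B", "JOHN"), ("C", "JOHN")],
    [("G", "JOHN"), ("B", "JOHN")],
):
    extensions = synthesize(P, bel)
    inferred = clear_extension(extensions) - set(bel)
    print(len(extensions), "minimal extension(s); clear inferences:", sorted(inferred))
```

Under these assumptions the three inputs yield one, two, and one minimal extensions respectively, and only the first and third add clear inferences, as in the text. Distinct candidates that are equivalent under the ordering would both be kept by this sketch; the formal treatment in (R,75) is the authority on such cases.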
The description given here has been necessarily brief and incomplete.
A more formal treatment of SYNTHESIZE in terms of lattice-theoretic operations
is given in (R,75) and is summarized in (JR,75). One additional technical
point should be made: It often happens that for a given input set there are
no single pattern instances which cover all the inputs, though patterns
exist whose instances cover subsets of the inputs. In such a case we use
an extended SYNTHESIZE operation which is defined in the same spirit as
SYNTHESIZE. (See (R,75).)
Even without the full formal treatment, several things should now be
clear. First, the actual number of inferences drawn (propositions added) for
a particular input set may be small or large (depending on the inputs and
the pattern set), but it is bounded in a principled way because of the
definition of SYNTHESIZE.
Second, the usual distinction between 'antecedent' and 'consequent'
clauses in the pattern is not maintained; a clause in the pattern may serve
as an antecedent on one occasion and a consequent on another.
Third, if 'defined' lexical items were to be associated with the
patterns, noting which variables are to be bound as arguments upon
instantiation, then the SYNTHESIZE function can be used to compute summarizing
expressions. Thus SYNTHESIZE represents a possible formalism for lexical
insertion.
IV. An Example of the Operation SYNTHESIZE
For the sake of illustration, let the primitives be:
(BENIGN ?x)
(THREATEN ?x ?y)       -- ?x threatens ?y
(GIVE ?x ?ob ?y)       -- ?x gives ?ob to ?y
(BELONG ?ob ?x)        -- ?ob belongs to ?x
(INTEND ?x ?Q)         -- ?x intends to do ?Q
(RETURN ?x ?ob ?y)     -- ?x returns ?ob to ?y
(PAYS-INTEREST ?x ?y)  -- ?x pays interest to ?y
(These primitives and the patterns below may appear somewhat artificial, but
we have chosen a simple illustration due to the difficulties in following
examples with more than a few clauses.)
Let the pattern set consist of the following four patterns:
PAT-1: ?x borrows-from ?y:
{(BENIGN ?x), (BELONG ?ob ?y), (GIVE ?y ?ob ?x),
 (INTEND ?x (RETURN ?x ?ob ?y))}
PAT-2: ?x takes-loan-from ?y:
{(BENIGN ?x), (BELONG ?ob ?y), (GIVE ?y ?ob ?x),
 (INTEND ?x (RETURN ?x ?ob ?y)), (PAYS-INTEREST ?x ?y)}
PAT-3: ?x robs ?y:
{(NOT (BENIGN ?x)), (BELONG ?ob ?y), (THREATEN ?x ?y),
 (GIVE ?y ?ob ?x), (NOT (INTEND ?x (RETURN ?x ?ob ?y)))}
PAT-4: ?x plays-practical-joke-on ?y:
{(BENIGN ?x), (BELONG ?ob ?y), (THREATEN ?x ?y),
 (GIVE ?y ?ob ?x), (INTEND ?x (RETURN ?x ?ob ?y))}
A rough graphic representation of the set of patterns is shown in
Figure 1.
[Figure 1: the pattern set drawn over the clause types PAYS-INTEREST, BENIGN,
BELONG, GIVE, THREATEN, INTEND, NOT-BENIGN, and NOT-INTEND; the surviving
pattern labels are PAT-2 (takes-loan-from) and PAT-4 (plays-practical-joke).]
Now consider the following situations:
Situation 1. Input beliefs:
Bel = {(BELONG WALLET HARRY), (GIVE HARRY WALLET MOE)}
SYNTHESIZE(Bel) =
{ {(BELONG WALLET HARRY), (BENIGN MOE), (GIVE HARRY WALLET MOE),
   (INTEND MOE (RETURN MOE WALLET HARRY))},
  {(NOT (BENIGN MOE)), (BELONG WALLET HARRY), (THREATEN MOE HARRY),
   (GIVE HARRY WALLET MOE),
   (NOT (INTEND MOE (RETURN MOE WALLET HARRY)))} }
The minimal matched patterns are rob and borrow, adding the
(conjectural) information that either Harry was threatened, or Moe intends to
return the wallet.
Situation 2. Input beliefs:
Bel = {(GIVE BANK 1000-DOLLARS JOHNDOE), (PAYS-INTEREST JOHNDOE BANK)}
SYNTHESIZE(Bel) =
{ {(BENIGN JOHNDOE), (BELONG 1000-DOLLARS BANK),
   (GIVE BANK 1000-DOLLARS JOHNDOE),
   (INTEND JOHNDOE (RETURN JOHNDOE 1000-DOLLARS BANK)),
   (PAYS-INTEREST JOHNDOE BANK)} }
As a result of matching the loan pattern, we have added three clauses.
Situation 3. Input beliefs:
Bel = {(INTEND JOHNDOE (RETURN JOHNDOE 1000-DOLLARS BANK)),
       (PAYS-INTEREST JOHNDOE BANK)}
Here SYNTHESIZE(Bel) returns exactly the same set as was returned
in Situation 2. Note, however, that the roles of
(1) (GIVE BANK 1000-DOLLARS JOHNDOE)
and (2) (INTEND JOHNDOE (RETURN JOHNDOE 1000-DOLLARS BANK))
have been reversed. In Situation 2, (1) was an input and (2) was inferred,
whereas in Situation 3, (2) was input and (1) inferred. The corresponding
clauses of the loan pattern were serving as antecedents on one occasion and
consequents on the other. This follows naturally from the way SYNTHESIZE was
defined.
In this regard the reader may notice that some input belief sets might
yield 'unwarranted' or 'spurious' inferences--jumping to too many conclusions.
However, the incremental addition of new patterns corrects this anomaly in a
natural way: Patterns which formerly were 'least covers' may cease to be so in
the extended pattern set.
V. Using Definitions to Set Up the Pattern Space
We have been particularly interested in using definitions of words
to set up pattern spaces in which SYNTHESIZE could work as an inferencer
and a lexical insertion technique. Special attention was paid to the
'speech act' verbs, and a brief sample list is presented below. (The symbol
'?Pr' denotes a predicate variable. Also, primitive predicates are capitalized,
while defined predicates are underlined.) Again, the definitions are greatly
oversimplified for illustrative purposes.
(define tell (?x ?y ?p ?t)
   (and (BEFORE ?t0 ?t)
        (NOT (KNOW ?y ?p ?t0))
        (SAY ?x ?y ?p ?t)
        (KNOW ?y ?p ?t)
        (CAUSE (SAY ?x ?y ?p ?t) (KNOW ?y ?p ?t))))
(define request (?x ?y ?p ?t)
   (tell ?x ?y (WANT ?x ?p ?t) ?t))
(define promise (?x ?y ?Pr ?t)
   (and (FEELS-OBLIGATED ?x (?Pr ?x) ?t)
        (tell ?x ?y (INTEND ?x (?Pr ?x) ?t) ?t)))
(define command (?x ?y ?Pr ?t)
   (request ?x ?y (?Pr ?y) ?t))
(define implore (?x ?y ?Pr ?t)
   (and (WANTS-FAVOR-FROM ?x ?y)
The expansion of these items to patterns over the primitives yields
a set in which, for example, KNOW $\le$ tell $\le$ request $\le$ command.
The input set
Bel = {(BEFORE t1 t2),
       (SAY JAMES MASTER (INTEND JAMES (OPEN JAMES DOOR) t2) t2),
       (FEELS-OBLIGATED JAMES (OPEN JAMES DOOR) t2)}
would be synthesized to (promise JAMES MASTER (OPEN + DOOR) t2), with added
inferences (KNOW MASTER (INTEND JAMES (OPEN JAMES DOOR) t2) t2), etc., as
dictated by the pattern instance of promise.
A method has been proposed for 'free' inferencing by pattern matching in
which inference cut-off can be structurally constrained: A pattern is matched
if it is one of the minimal patterns whose instantiation covers the input
information--even if this necessitates adding an arbitrary amount of additional
information. Similarly, on the question of how many inferences to draw:
'Enough extra inferences are drawn to enable a coherent pattern to be matched.'
The method we have proposed is general in that it makes no assumptions
about the particular predicates to be used in the patterns and beliefs. (Of
course, it does make assumptions about what counts as a pattern or a belief.)
The inferencing could be done by a general purpose component which accepts a
set of patterns as a parameter. Thus, a programmer designing a system for
inference by pattern match need not devise external criteria, and certainly
not criteria to be associated with every pattern. Rather the criteria are
implicit in the system as a whole; any set of patterns which can be described
in a very general pattern description language will generate its own set of
internal criteria for inference cut-off.
We are continuing to investigate formalisms for structuring pattern
sets in the hope of gaining further insights into this class of inferences.
D. BECKLES, L. CARRINGTON, AND G. WARNER IN COLLABORATION WITH 
C. BORELY, H. KNIGHT, P. AQUING, AND J.  MARQUE^ 
Department of Mathematics and School of Education 
University of the West Indies
St. Augustine, Trinidad
ABSTRACT
Linguistic communication in Trinidad and Tobago is characterised by
intra- and inter-idiolectal variation in a spectrum ranging from
Creole-English to Internationally Acceptable English. The tape-recorded
speech of a sample of children is being analysed to determine the structure
of their language, its correlation with socio-linguistic factors and their
progress in the use of English. The computer system is designed to deal with
manually codified data in the form of parse trees with associated grammatical
and semantic information. The communication complex does not have readily
identifiable norms. The analytical method and computer system effect
recognition of stable sub-systems (regardless of the external criteria which
determine these sub-systems) and comparison of these sub-systems with English,
and state the evolution of the children's language.
Acknowledgement 
The research of which this paper is a working document is partially
funded by Ford Foundation Grant 690-06641). The authors acknowledge the kind
assistance of the IBM World Trade Corporation, Port of Spain, Trinidad.
The design and some results of the research to which the computer
system relates are described by Carrington, Borely and Knight (1969, 1972,
1974 a + b). Part of the intention of the project is to describe, in terms
applicable to curriculum development and teacher education, the structure of
the speech of school-children aged 5-11+ in Trinidad and Tobago and to compare
this speech with English.
The official language and medium of instruction is English. However,
the medium of daily communication ranges from a type of Creole-English to a
modified variety of Internationally Acceptable English (IAE). The term
"post-creole dialect continuum" has been used by several researchers, notably
Le Page (1957), De Camp (1971) and Bickerton (1973) to refer to apparently
analogous situations in Jamaica and Guyana. In addition to Creole, English
and variants of both, a large part of the population is exposed to a local
variety of Hindi (Bhojpuri). Smaller numbers are exposed to Lesser
Antillean French Creole and fewer still to Spanish.
Communication within the society is characterised by inter-idiolectal
variation related to several socio-linguistic factors - ethno-linguistic
background, social class, educational level, occupation, sex and age.
Code-switching and intra-idiolectal variation related to the context, content
and purpose of communication complicate the examination of the communication
system. Since the variant levels of the complex appear to overlap they are
difficult to separate into distinct sub-systems.
The Linguistic Data
The available corpus comprises 100 hours of the recorded conversation
of almost 1,000 children between 5 and 11+ selected randomly from 30 schools.
The data fall into two pre-determined categories: (a) free (with peer group);
(b) controlled (with investigator). Given the nature of the communication
complex stated above, variation and contrast are central to the data. In
addition to the usual socio-linguistic correlates of variation, these data
have the possibility of containing linguistic elements which are not
paralleled anywhere else in the community. These elements may occur as a
result of the instability intrinsic to the performance of a vulnerable age
cohort. We are not dealing with fully learned discrete languages or dialects
but with partially learned systems of speech communication being used by
children who, by virtue of being in school, are under pressure to abandon
part of their communication repertoire in favour of another variety of speech.
Implications of the Data Type for the Analytical Procedure
English is the only code of the communication complex for which
adequate grammatical descriptions are available. It is demonstrably
untenable to assume that the informants are attempting to speak English at
all times. They are communicating in a set of language varieties which are
assumed to be rule-governed. A statement of frequency and type of deviation
from English cannot therefore be an adequate analysis. The first task of the
analysis must be to determine the structures, both major and minor, used by
informants of various socio-linguistic descriptions.
A preliminary examination of the data shows that at the level of
phrase-structure of utterances, the structures will appear to be
predominantly identical with English. It is the components of the elements,
their meanings and functions that will show the differences from English.
Consequently, the analysis must note the levels at which derivational trees
cease to be compatible with English.
In view of the variability inherent in the data, the analysis must
discover the socio-linguistic correlates of the occurrence of elements, as
well as state co-occurrence restrictions of a given element. Since it is
possible that some elements may be distributed in a way that does not permit
correlation with the stated socio-linguistic factors, the analysis must
permit grouping of informants based on shared linguistic features for
subsequent re-examination. This provision admits the possibility that sets
of features may be typical of a language acquisition stage of the informants
regardless of their socio-linguistic descriptions.
The Analytical Procedure
1. Each utterance is phonetically transcribed and ascribed to an
   informant by an identification procedure. Doubtful identity is
   specially coded.
2. Each utterance is rewritten in English orthography.
3. For each utterance a parse tree is constructed using the following
   protocol where each category described below forms the content of a
   node of the parse tree. The numbers are for reference and indicate
   the hierarchical relationship of the nodes.

   0.0  Utterance type         S     sentence
                               SEL   elliptical S
                               FRAG  fragment
                               FREL  elliptical FRAG
   0.1  Utterance complexity   SIMP  simple
                               CP    compound
                               CX    complex
                               CPCX  compound-complex
   0.2  Structural type        DEC   declarative
                               INT   interrogative
                               IMP   imperative
   0.3  Semantic type          STMT  statement
                               QU    question
                               COMM  command
                               RHET  rhetorical intent
   0.4  Linear order and type of clauses occurring
        e.g. MC1 + ADVC TEMP 2
   0.5  Linear order and type of phrases occurring
        (where not part of a clause)
        e.g. PREP P1 + VBL P2
   0.6  Dependency of clauses - dependent
                                embedded
                                co-ordinate
                                included
        e.g. 2/1 = clause 2 is embedded in clause 1
   0.7  AFM   affirmative      ACTV  active
        NEG   negative         PAS   passive
                               EQ    equational
                               STAT  stative
                               LOC   locative
   1.0  Surface structure of the clause/phrase occurring first.
        e.g. MC1 --> SUBJ + PRED* + IOBJ + DOBJ + PREP P
        *PRED = predicator - not predicate
   1.1  Detailed analysis of first occurring element of 1.0.
        e.g. SUBJ --> PRMD + HDW
   1.1.1  First element of subject.
        e.g. PRMD --> [HE] PADJ, RD, MASC, SG, NOK; IAE: [HIS] etc.
   2.0  Surface structure of the clause/phrase occurring second,
        etc. to 7.0.

As exemplified at 1.1.1, the last node of each sub-part states the actual
literal being described. The acceptability of the item as IAE is noted, OK
or NOK, together with a reasonable IAE alternative. Apart from the obligatory
information required by the procedure, the analyst may make additional
comments which may be either in keywords or English,
e.g. CMNT: probably idiosyncratic or CMNT: double NEG.
8.0 is reserved for special idioms.
e.g. 8.0 [SCRUNT] --> scrounge for a living
9.0 is reserved for tags.
e.g. 9.0 TAG --> [YOU HEAR]
Fig. 1 shows a sample analysis.
Figure 1
4652072
[MY SISTER AND THEM DOES BREAK A SET OF PLATE, YES]
0.0 S; 0.1 SIMP; 0.2 DEC; 0.3 STMT; 0.4 MC + TAG; 0.5 NA; 0.6 NA; 0.7 AFM ACTV
1.0 MC --> SUBJ + PRED + DOBJ
1.1 SUBJ --> PRMD + HDW
1.1.1 PRMD --> [MY] PADJ, ST, SG, OK
1.1.2 HDW --> N ASOC, ANIM, NOK; IAE: NEQV
1.1.2.1 N ASOC --> NCO + ASOC
1.1.2.1.1 NCO --> [SISTER] N SG, ANIM, OK
1.1.2.1.2 ASOC --> [AND THEM] NOK; IAE: NEQV; VIDE 8.0
1.2 PRED --> AUX + VT; GR: @ CTN, @ PROG, PATT, NEUT TM
1.2.1 AUX --> [DOES] NOK; IAE: ZERO
1.2.2 VT --> [BREAK] OK TRAN
1.3 DOBJ --> PRMD + HDW
1.3.1 PRMD --> IND DET + N + PREP
1.3.1.1 IND DET --> [A]
1.3.1.2 N --> [SET] NCO, SG; LEX: NOK; IAE: [LOT]
1.3.1.3 PREP --> [OF] OK
1.3.2 HDW --> [PLATE] N PL, INAN, NOK; IAE: [PLATES]
1.3.2.1 N PL --> NCO + PLZR; NOK; IAE: NCO + PLZR
1.3.2.1.1 NCO --> [PLATE] @ BCL, OK
1.3.2.1.2 PLZR --> ZERO, NOK; IAE: PLZR = +S, CLF
8.0 [MY SISTER AND THEM] --> [MY SISTERS] * [MY SISTER AND HER FRIENDS]
9.0 TAG --> [YES]
Glossary of keywords
ADV - adverb(ial), ANIM - animate, ASOC - associative,
AUX - auxiliary, BCL - base form final cluster, C - clause,
CLF - final cluster results from suffixation, CMNT - comment,
CTN - completion, DET - determiner, DOBJ - direct object, GR - grammar,
HDW - headword, INAN - inanimate, IND - indefinite, IOBJ - indirect object,
LEX - lexical, MASC - masculine, MC - main clause, N - noun,
NCO - countable noun, NEQV - no equivalent, NEUT - neutral, P - phrase,
PADJ - possessive adjective, PATT - pattern, PL - plural, PLZR - pluralizer,
PRED - predicator, PREP - preposition, PRMD - pre-head modifier,
PROG - progressive, RD - third person, SG - singular, SUBJ - subject,
TEMP - temporal, TM - time, TRAN - transitive, VBL - verbal,
VT - verb used transitively
* - alternative parse or meaning, @ - absence of ..., [ ] enclose literals,
; - end of information set, , - minor separator.
Developing the Computer System
The structure of the parse tree is, in general, quite complex and a
simple ad hoc approach to validity checking was quickly seen to be inadequate.
As a result a formal description of the tree was developed and used to
construct a (partially) syntax-driven validity checking routine. The output
of this routine consists of a listing of the input, with error comments where
necessary, together with the internal representation of the valid trees which
is written onto a file - the parse-tree file - for the subsequent analyses.
Several other files are used in addition to the parse tree file.
There is the informant file which contains profiles of the informants
(e.g. age, sex, linguistic background, etc.), a set of form class files and a
set of classification files. The form class files are groupings of the
various keywords which may occur in the data. Thus, for example, one form
class file contains all keywords which may occur on the left-hand side of a
rewrite. A classification file contains a group number for each informant;
for example, one classification file contains 0 for each informant not aged 5,
1 if the informant is aged 5 with a Hindi linguistic background and 2
otherwise. In any operation on the data the utterances of informants in
group 0 of the relevant classification will be ignored.
Each node of a tree in the parse tree file consists of a name - in the
case of a rewrite this is the left-hand side of the rewrite, otherwise it is
the level number - and a set of descriptors, e.g. the grammar associated with
the name. Thus, in the example of Figure 1, the lines 1.1, 1.1.1, 1.1.2,
1.1.2.1 become the sub-tree of Figure 2 where the descriptors are put in
parentheses.
Figure 2
                        SUBJ
             ____________|____________
            |                         |
PRMD (PADJ, ST, SG, OK)    HDW (ANIM, NOK; IAE: NEQV)
                                  ____|____
                                 |         |
                                NCO       ASOC
For any tree, each analysis starts at the root and many of the tasks
to be described below may be regarded, in part, as a pattern matching
exercise. The difficulties, and interest, arise because each node of the
parse tree carries a substantial amount of information, and except for
literals, only a partial matching of the nodes is usually required. In
addition, some tasks require the matching of disjoint sub-trees within a given
parse tree, occasionally subject to side conditions which may involve nodes
not lying on the paths between the root and any of the sub-trees of interest.
Apart from the pattern matching, there is the problem of classification of the
occurrences of the various patterns. This is a simple tabulation complicated,
in some cases, by the fact that the total number of categories is unknown.
The basic task of the system may be cast in the form: count with
respect to a given classification file, and subject to stated side conditions,
the occurrences of a given pattern.
Since there are only 1,000 informants and they fall into a reasonably
small number of classes it is economical to pre-classify on the basis of the
informant profiles rather than build the classification process into the rest
of the analysis. The system is instructed to produce a classification file
by a statement of the form:
CLASS = <classification file name>, (<expression list>)
where <classification file name> is the name by which the file will be known,
and each expression in <expression list> is a Boolean expression. For
example:
CLASS = HINDI, (AGE = 5 & LANG = HINDI, AGE = 5 & LANG ≠ HINDI)
will produce the classification file given earlier as an example.
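The effect of such a CLASS statement can be sketched in Python. This is only an illustration under stated assumptions: the function name build_classification, the informant identifiers, and the profile fields are hypothetical; each Boolean expression is modelled as a Python function of the profile; and the rule that the first expression that holds decides the group is an assumption, since the text does not say how overlapping expressions are resolved.

```python
def build_classification(informants, expressions):
    """Map each informant id to a group number.

    Expression number i (counting from 1) assigns group i; an informant
    satisfying no expression gets group 0 and is ignored in later analyses.
    """
    classification = {}
    for ident, profile in informants.items():
        group = 0
        for number, expression in enumerate(expressions, start=1):
            if expression(profile):
                group = number
                break  # assumption: the first matching expression wins
        classification[ident] = group
    return classification


# Hypothetical informant profiles (not data from the study).
informants = {
    "A01": {"AGE": 5, "LANG": "HINDI"},
    "A02": {"AGE": 5, "LANG": "CREOLE"},
    "A03": {"AGE": 7, "LANG": "HINDI"},
}

# Counterpart of the HINDI classification: group 1 is aged 5 with a Hindi
# background, group 2 is aged 5 otherwise, group 0 is everyone not aged 5.
HINDI = build_classification(
    informants,
    [
        lambda p: p["AGE"] == 5 and p["LANG"] == "HINDI",
        lambda p: p["AGE"] == 5 and p["LANG"] != "HINDI",
    ],
)
print(HINDI)
```

Pre-computing the mapping once, as the text recommends, keeps the later tree analyses free of any profile logic: they only need to look up a group number.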
The side conditions refer to items in the parse trees which must occur
if the tree is to be included in a given analysis. For example, if only
affirmative active utterances are to be analysed the side condition 0.7
AFM ACTV is used. The pattern to be used is stated in a manner similar to
that used in specifying the input data. Thus, the pattern description
PRED --> ... + AUX ...; GR: @ CTN, NEUT TM, @ PROG, PATT
indicates that the sub-tree
PRED (GR: @ CTN, NEUT TM, @ PROG, PATT)
is of interest, subject to the convention that both the order of node
descriptors (where given) and node descriptors not mentioned in the pattern
are to be ignored. The occurrence of the keyword FORM = <form class file name>
indicates that the contents of the stated form class file are to form an
additional dimension to the final tabulations. Thus the pattern
AUX --> [?] FORM = OKFILE
where OKFILE contains the keywords OK and NOK, is an abbreviation for the
pair of patterns
AUX --> [?] OK
AUX --> [?] NOK
The symbol ? indicates that the items found there are also to add an
additional dimension to the tabulations. The output of each tabulation may
also be used to construct a classification file of the informants, to be
used in further analyses.
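The basic counting task, with partial node matching, a side condition, and a classification file, can be sketched as follows. This is a hypothetical illustration and not the system described here: the Node class, the function names, and the sample trees are assumptions, and for brevity the affirmative/active information is stored as descriptors on the root node.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                            # left-hand side of a rewrite, or a level number
    descriptors: frozenset = frozenset()
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def node_matches(node, name, required, required_children=()):
    """Partial match: descriptor order and unmentioned descriptors are ignored."""
    return (
        node.name == name
        and set(required) <= node.descriptors
        and all(any(c.name == r for c in node.children) for r in required_children)
    )


def count_pattern(trees, classification, name, required,
                  required_children=(), side_condition=None):
    """Count pattern occurrences per informant group, skipping group 0."""
    counts = Counter()
    for informant, root in trees:
        group = classification.get(informant, 0)
        if group == 0:
            continue
        if side_condition is not None and not side_condition(root):
            continue
        counts[group] += sum(
            node_matches(n, name, required, required_children) for n in walk(root)
        )
    return counts


def utterance(aux_ok):
    """Build a small hypothetical parse tree with a PRED node dominating AUX."""
    aux = Node("AUX", frozenset({"OK" if aux_ok else "NOK"}))
    pred = Node("PRED", frozenset({"@ CTN", "@ PROG", "PATT", "NEUT TM"}), [aux])
    return Node("MC", frozenset({"AFM", "ACTV"}), [Node("SUBJ"), pred])


trees = [("A01", utterance(False)), ("A02", utterance(True)), ("A03", utterance(True))]
classification = {"A01": 1, "A02": 2, "A03": 0}

counts = count_pattern(
    trees,
    classification,
    "PRED",
    required={"@ CTN", "NEUT TM", "@ PROG", "PATT"},
    required_children=("AUX",),
    side_condition=lambda root: {"AFM", "ACTV"} <= root.descriptors,
)
print(dict(counts))
```

Storing descriptors as a set makes the stated convention (ignore order, ignore descriptors the pattern does not mention) a single subset test. A further tabulation dimension, such as the FORM keyword or the ? symbol, could be modelled by counting on a (group, keyword) pair instead of the group alone.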
CONCLUSION 
In respect of performance of groups with different socio-linguistic
descriptions, for purposes of this study, it is assumed that the frequency of
occurrence of particular basic parse trees is a meaningful indicator of
differences in speech patterns. A major difficulty is that no two trees in
the study are identical, but at the same time if we strip too much information
from each node there are too few trees to make an analysis worthwhile. In
part, the study aims at determining the degree to which stripping of
information at interior nodes is necessary if the computer is to be a useful
aid.
DAVID BRILL AND BEATRICE T. OSHIKA 
Speech Communications Research Laboratory, Inc. 
800A Miramonte Drive
Santa Barbara, California 93109 
ABSTRACT 
A set of SAIL programs has been implemented for analyzing 
large bodies of natural language data in which associations 
exist between strings and sets of strings. These programs 
include facilities for compiling information such as 
frequency of occurrence of strings (e.g. word frequencies) 
or substrings (e.g. consonant cluster frequencies), and 
describing relationships among strings (e.g. various phonological
realizations of a word). Also, an associative data
base may be interactively accessed on the basis of keys 
corresponding to different types of data elements, and a 
pattern matcher allows retrieval of incompletely specified 
elements. Applications of this natural language processing
package include analysis of phonological variation for speci- 
fying and testing phonological rules, and comparison across 
languages for historical reconstruction. 
I. NATURAL LANGUAGE PROCESSING PACKAGE
A. General characteristics 
The natural language processing package implemented at
the Speech Communications Research Laboratory (SCRL) is
currently used in the analysis of associated lists of string
data such as discourse transcriptions or pronouncing dictionaries.
The package consists of
a) a set of "batch" programs which provide frequency
and context information on the lexical and phonological forms
appearing in the input; and
b) a system for interactively accessing the data on the
basis of orthographic and phonological patterns.
All of the programs in this package are written in SAIL, 
an ALGOL-based language offering extended string and set 
manipulation operations and an associative data base. The 
programs run on a DEC PDP-10 at Carnegie-Mellon University 
via the Advanced Research Projects Agency (ARPA) computer 
network (ARPANET). The ARPANET is accessed by the ELF operating
system developed by SCRL, which runs on a local PDP-11 [1].
While the processing package is applicable to various 
types of natural language data, it has been used most exten- 
sively at SCRL in the analysis of discourse transcriptions. 
The discourses consist of conversational speech gathered in 
interviews with adult speakers of various dialects of American 
English. More than twenty-five discourses, transcribed orthographically
and phonologically, have been processed, yielding
detailed information on over 28,000 utterances representing
about 3,500 distinct lexical items. All examples in this
section are taken from a typical discourse.
B. "Batch" Facility
Discourse processing usually begins with the generation 
of a transcription reference file in which orthographic and 
phonological representations are listed in discourse order, 
as illustrated in Figure 1. 
Orthographic    Phonological
WELL
LET'S
TRY             086 TRAY
CLASSIFYING     887 KLAES$CFAYIHNS
THEM            888 DHAXM
ACCORDING=      88s //AXK$ORDIHN/TUW//
TO
THE
EXCUSES
Figure 1
In this example, the phonological realization of TRY 
is /tray/ (coded TRAY). The phonological code shown is a 
basic ARPA phonemic alphabet augmented by special symbols 
indicating some phonetic detail, such as vowel height. The
realization of THE, for example, is coded DH$I, indicating 
that the vowel fell between /i/ and /I/. 
Reference numbers assigned to each utterance serve as an
index to the discourse context in which utterances occur, and
are used to interpret the output of other programs in the
package. Separate reference number sequences are provided for
the orthographic and phonological forms in the reference files,
since there may not be a one-to-one correspondence between
these forms, as in the case of phonological merging which
obscures word boundaries. In Figure 1, for example, the two
orthographic items WELL and LET'S are realized as a single
phonological item /wɛlɛts/ (coded WELEHTS).
The core of the "batch" processing facility is a set of
three programs: PROCON, ENVIRN and CLUSTR. PROCON provides
frequency and context information on the lexical level, while
the other two provide similar information on the phonological
level.
PROCON output contains an alphabetically sorted list of 
the utterance types occurring in the input discourse trans- 
cription file as illustrated in Figure 2. Frequency of oc- 
currence of each type is given, along with the various phono- 
logical realizations. For each phonological realization, 
frequency count and reference numbers are provided. 
8 HAVE    3 AXV     11, 337, 703
          3 HHAEV   354, 828, 1397
          1 HHAXh   710
          1 HH$GV   1067
Figure 2
In Figure 2, for example, HAVE occurred eight times, and
was pronounced AXV (/əv/) three times and HHAEV (/hæv/) three
times. Using the reference numbers associated with these pronunciations,
it is possible to establish the discourse context.
One would find that the three AXV pronunciations (i.e.
utterances 11, 337 and 703) all involved the auxiliary
construction in "...may have felt...seemed to have been...
which have since been...".
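The kind of tabulation PROCON produces can be sketched as a two-level index from spelling to realization to reference numbers. This is an illustrative Python sketch, not the SAIL implementation; the function names and the tuple input format are assumptions, and the sample tokens are a subset of the HAVE entries in Figure 2.

```python
from collections import defaultdict


def procon_index(tokens):
    """Build {spelling: {realization: [reference numbers]}}.

    tokens is assumed to be an iterable of
    (reference_number, orthographic_form, phonological_form) tuples.
    """
    index = defaultdict(lambda: defaultdict(list))
    for ref, orth, phon in tokens:
        index[orth][phon].append(ref)
    return index


def report(index):
    """Return report lines: alphabetical by spelling, realizations by frequency."""
    lines = []
    for orth in sorted(index):
        realizations = index[orth]
        total = sum(len(refs) for refs in realizations.values())
        lines.append(f"{total} {orth}")
        # Most frequent realization first; ties are broken alphabetically.
        ordered = sorted(realizations.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        for phon, refs in ordered:
            lines.append(f"    {len(refs)} {phon} {','.join(map(str, refs))}")
    return lines


# Subset of the Figure 2 data, for illustration only.
tokens = [(11, "HAVE", "AXV"), (337, "HAVE", "AXV"),
          (354, "HAVE", "HHAEV"), (703, "HAVE", "AXV")]
index = procon_index(tokens)
lines = report(index)
```

The reference numbers kept for each realization are what allow the discourse context to be looked up afterwards.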
ENVIRN tallies occurrences of phonological segments and 
environments in the discourse transcriptions. The output of 
this program lists frequencies of all phonemes appearing in
the input file, as illustrated in Figure 3.
Figure 3 
Glottal stop, coded Q, occurred a total of thirty times
in the discourse. The immediate environments of Q are listed
alphabetically by left context, with word boundaries indicated
by slash /, and a frequency count and reference numbers are
given for each environment. For example, Q appeared eight
times in the context EH--EN (/ɛ-n/), and a check of the
reference list shows that all these occurrences were in the
word sentence(s).
ENVIRN output also provides a frequency ordered list of
phonemes, with frequency totals broken down according to
occurrence in word initial, medial and final position. 
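The two ENVIRN tabulations described above (immediate environments of a phoneme, and counts by word position) can be approximated as follows. This is a hedged Python sketch: the function names, the dictionary input format and the two sample transcriptions are invented for illustration and are not taken from the discourse data.

```python
from collections import Counter, defaultdict


def environments(utterances, target):
    """Map (left, right) contexts of `target` to reference numbers.

    utterances is assumed to be {reference_number: [segments]};
    "/" marks a word boundary, as in the ENVIRN output.
    """
    env = defaultdict(list)
    for ref, segs in utterances.items():
        padded = ["/"] + list(segs) + ["/"]
        for i in range(1, len(padded) - 1):
            if padded[i] == target:
                env[(padded[i - 1], padded[i + 1])].append(ref)
    return env


def positional_counts(utterances):
    """Count each phoneme in word-initial, medial and final position."""
    counts = defaultdict(Counter)
    for segs in utterances.values():
        last = len(segs) - 1
        for i, seg in enumerate(segs):
            if i == 0:
                position = "initial"
            elif i == last:
                position = "final"
            else:
                position = "medial"
            counts[seg][position] += 1
    return counts


# Hypothetical transcriptions, not taken from the paper.
utterances = {1: ["S", "EH", "Q", "EN", "S"], 2: ["DH", "AE", "Q"]}
env = environments(utterances, "Q")
counts = positional_counts(utterances)
```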
CLUSTR, the third of the "batch" programs, is used in the
analysis of phoneme cluster distribution in the discourse data.
All clusters are indexed by each of their component phonemes,
so that the cluster N D Z (/ndz/), which is listed under D in
Figure 4, also appears under N and Z in the full output.
D EN T S      1   699
D Q EN T S    1   486
D V           2   1417, 1445
D Z           5   278, 284, 837, 1341, 1350
N D Z         1   1429
Figure 4
Separate output may be generated for clusters occurring within
words or across word boundaries. Currently, consonant and
vowel clusters are tallied, but the program can be easily
modified to handle sequences of phonemes belonging to arbitrary
user-defined classes (e.g. voiced sounds, nasals, unvoiced
stops, etc.).
For each phoneme belonging to a selected class, CLUSTR 
provides a count of the number of times that the phoneme appears 
in clusters, an alphabetically sorted list of those clusters, 
and a frequency count and reference numbers for each cluster. 
Figure 4, a sample of CLUSTR output for within-word consonant 
clusters, shows that D appeared in clusters a total of 70 times, 
with 32 of these being ND clusters. Reference numbers may be 
used to establish the discourse context of any cluster. For 
example, the cluster D Q EN T S (/d?n̩ts/) appears in utterance
486, which is the word students. Like ENVIRN, CLUSTR provides
a frequency ordered list of cluster types in addition to the
alphabetic list.
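The indexing of each cluster under every one of its component phonemes can be sketched as below. This is an illustrative Python approximation of the within-word case only; the function name, the CONSONANTS set and the two transcriptions (including the word assumed for utterance 1429) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical sound class; in practice this would be user-defined.
CONSONANTS = {"S", "T", "D", "Q", "EN", "N", "Z", "HH"}


def clusters(utterances, sound_class):
    """Index within-word clusters under each of their component phonemes.

    A cluster is taken to be a run of two or more adjacent segments that
    all belong to sound_class. Returns {phoneme: {cluster: [refs]}}.
    """
    by_phoneme = defaultdict(lambda: defaultdict(list))
    for ref, segs in utterances.items():
        for in_class, run in groupby(segs, key=lambda s: s in sound_class):
            run = list(run)
            if in_class and len(run) >= 2:
                name = " ".join(run)
                for phoneme in set(run):
                    by_phoneme[phoneme][name].append(ref)
    return by_phoneme


# Invented segmentations; only the cluster strings follow Figure 4.
utterances = {
    486: ["S", "T", "UW", "D", "Q", "EN", "T", "S"],
    1429: ["HH", "AE", "N", "D", "Z"],
}
index = clusters(utterances, CONSONANTS)
```

Passing a different sound_class (for example a set of vowels or of nasals) would give the user-defined groupings mentioned above.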
C. Interactive Retrieval Facility
The set of "batch" programs is complemented by a language 
data retrieval system which allows the user to interactively 
retrieve data items conforming to various orthographic, phono- 
logical and syntactic patterns. 
Linguistic data is internally stored in the system as a
network of associations between items of various types. These
associations are implemented in SAIL as LEAP triples [2], and
the element types entering into these associations vary according
to the particular application. For example, in analysis
of the discourse data described above, triples contain orthographic,
phonological and syntactic elements. For study of
phonetic-to-phonemic mapping, triples might be orthographic,
phonemic and phonetic elements. In comparative linguistic
research, triples might consist of an orthographic element
and two phonological elements corresponding to two languages
or dialects.
Data can be accessed on the basis of patterns directed to 
any one (or any combination) of these elements. For example, 
if the data base contains associations between orthographic, 
phonological and syntactic elements, then the query 
P/ O: THE
retrieves the phonological items associated with the spelling
THE, and might return DHAX (/ðə/) and DHIY (/ði/). The query
O/ P: TUW
would return the orthographic items pronounced TUW (/tu/), e.g.
two, too, to.
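A query of this form asks for one element of each association, given exact values for one or more of the others. The Python sketch below mimics that behaviour over a small in-memory list; it is not the LEAP mechanism, and the Entry type, the query function and the syntactic codes DET, NUM, ADV and PREP are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    O: str  # orthographic element
    P: str  # phonological element
    S: str  # syntactic element (codes here are invented)


DATA = [
    Entry("THE", "DHAX", "DET"),
    Entry("THE", "DHIY", "DET"),
    Entry("TWO", "TUW", "NUM"),
    Entry("TOO", "TUW", "ADV"),
    Entry("TO", "TUW", "PREP"),
]


def query(data, want, **constraints):
    """Return distinct values of field `want` for entries matching every constraint."""
    found = []
    for entry in data:
        if all(getattr(entry, field) == value for field, value in constraints.items()):
            value = getattr(entry, want)
            if value not in found:
                found.append(value)
    return found


the_realizations = query(DATA, "P", O="THE")   # corresponds to  P/ O: THE
tuw_spellings = query(DATA, "O", P="TUW")      # corresponds to  O/ P: TUW
```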
Patterns such as THE and TUW completely specify the
element to which they are directed, but various special forms
allow partial specifications to be expressed also. The symbol
$ matches any single segment (in a phonological pattern) or
character (in an orthographic pattern), and the symbol =
matches any number, including zero, of contiguous segments
(or characters). Thus, if N is the syntactic code for Noun,
the query
O/ P: $$, S: N, O: D=
searches for all two-phoneme nouns which begin with the letter
D, and might return dye, day, doe, dough.
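The two wildcard symbols behave much like "exactly one unit" and "zero or more units" in a regular expression over segments or characters. The Python sketch below translates a pattern into such an expression; the token-list representation, the function names and the small lexicon with its segmentations are assumptions for illustration.

```python
import re


def compile_pattern(tokens):
    """Translate pattern tokens into a regex over space-terminated units.

    '$' matches exactly one unit, '=' matches zero or more units,
    and any other token matches that unit literally.
    """
    parts = []
    for tok in tokens:
        if tok == "$":
            parts.append(r"(?:\S+ )")
        elif tok == "=":
            parts.append(r"(?:\S+ )*")
        else:
            parts.append(re.escape(tok) + " ")
    return re.compile("".join(parts))


def matches(pattern_tokens, units):
    """True if the whole sequence of units fits the pattern."""
    text = "".join(unit + " " for unit in units)
    return compile_pattern(pattern_tokens).fullmatch(text) is not None


# Hypothetical lexicon: (spelling, phonological segments, syntactic code).
LEXICON = [
    ("DAY", ["D", "EY"], "N"),
    ("DOUGH", ["D", "OW"], "N"),
    ("DO", ["D", "UW"], "V"),
    ("TOE", ["T", "OW"], "N"),
    ("DOG", ["D", "AO", "G"], "N"),
]

# Counterpart of the query  O/ P: $$, S: N, O: D=
two_phoneme_d_nouns = [
    orth for orth, phon, syn in LEXICON
    if matches(["$", "$"], phon) and syn == "N" and matches(["D", "="], list(orth))
]
```

Terminating every unit with a space keeps multi-letter codes such as EY or OW from being split or partially matched.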
Each phonological element is defined in terms of a set of 
features such as UV (unvoiced) and ST (stop), and these 
features may be used to specify segments in phonological 
patterns. To search for phonological realizations containing
/i/ between unvoiced stops, one could use the query
P/ P: =(UV + ST)IY(UV + ST)=
to find /kip/ (keep), /pikɪŋ/ (peeking), and /rɪpitɪd/ (repeated).
Boolean operators are also available for specifying 
pattern segments. For example, the query 
O/ O: (C OR K)=, P: (NOT K)=
returns orthographic items which begin with C or K and are not
pronounced with initial k, e.g. cite, change, know. 
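Feature specifications and Boolean operators can both be treated as predicates on a single segment, combined with the = wildcard by simple backtracking. The following Python sketch shows one way to do this; the FEATURES table, the helper names and the transcriptions are assumptions for illustration and do not reproduce the system's feature definitions.

```python
# Hypothetical feature table: UV = unvoiced, ST = stop.
FEATURES = {
    "K": {"UV", "ST"}, "P": {"UV", "ST"}, "T": {"UV", "ST"},
    "D": {"ST"},
}

ANY = "="  # zero or more segments


def has(*feats):
    """Predicate: the segment carries all of the given features."""
    return lambda seg: set(feats) <= FEATURES.get(seg, set())


def lit(symbol):
    """Predicate: the segment is exactly this symbol."""
    return lambda seg: seg == symbol


def either(*preds):
    """Predicate: Boolean OR of several predicates."""
    return lambda seg: any(p(seg) for p in preds)


def negate(pred):
    """Predicate: Boolean NOT of a predicate."""
    return lambda seg: not pred(seg)


def match(pattern, segs):
    """Backtracking match of a list of predicates and ANY against segments."""
    if not pattern:
        return not segs
    head, rest = pattern[0], pattern[1:]
    if head == ANY:
        return any(match(rest, segs[i:]) for i in range(len(segs) + 1))
    return bool(segs) and head(segs[0]) and match(rest, segs[1:])


# Counterpart of  P: =(UV + ST)IY(UV + ST)=
STOP_IY_STOP = [ANY, has("UV", "ST"), lit("IY"), has("UV", "ST"), ANY]


def c_or_k_not_pronounced_k(spelling, segs):
    """Counterpart of  O: (C OR K)=, P: (NOT K)=  for one entry."""
    return (match([either(lit("C"), lit("K")), ANY], list(spelling))
            and match([negate(lit("K")), ANY], segs))
```

For example, match(STOP_IY_STOP, ["K", "IY", "P"]) accepts keep, while a form with /i/ between voiced stops is rejected.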
Several capabilities lacking in the current interactive 
system will be available in the near future. The user will be 
able to (1) specify optional segments and sequences of segments 
in phonological patterns; (2) create and name sets containing 
items of interest, e.g. monosyllabic function words, and use 
set operations such as union and intersection; (3) interactively
modify feature definitions of phonological symbols; (4) retrieve
several elements, e.g. orthographic and phonological
forms, simultaneously; (5) display the discourse context of any
given item; and (6) write retrieval queries and responses to a
file for subsequent analysis. 
II. APPLICATIONS
The processing package can be used in the analysis of 
various kinds of natural language data, as illustrated in the
following examples. 
A. Phonological variation 
The programs can be used to efficiently index and sort 
natural language data so that systematic phonological varia- 
tion can be easily examined. For example, inspection of a 
PROCON output for a ten minute interview consisting of over 
2,000 utterance tokens yields general observations such as
-- final /t/ alternates with final glottal stop /?/ under 
certain conditions; 
-- alveolar flapping occurs under several stress 
conditions which appear to be related to noun affixes.
These preliminary observations can be systematically 
investigated using the interactive query system. 
The data base can be queried for all phonological realiza- 
tions ending in T (/t/) or Q (/?/), and the corresponding ortho- 
graphic entries, using the queries 
P/ P: =(T OR Q) and O/ P: =(T OR Q)
The resulting list might include 
art      /art/
but      /bʌt/
can't    /kænt/
         /kæn?/
fished   /fɪʃt/
limit
raft
that
want
That is, final /t/ appears to vary with final /?/ following
vowels and following nasals, but not elsewhere. This hypothesis,
represented as a context-sensitive phonological
rule, could then be tested against additional data using any
of several computer rule testers [3-5].
Forthcoming modifications will allow queries with set
operations, such that the intersection of orthographic entries
having final /t/ alternating with /?/ can be requested directly
by the query
O/ P: =T ∩ P: =Q
That is, only entries with /t/ and /?/ alternation would be
retrieved, and the entries art, fished and raft would not be
returned.
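The effect of such an intersection query can be illustrated with ordinary set operations. The Python sketch below is hypothetical: the function name and the sample entries (with their segmentations) are invented to mirror the example, and Q again stands for glottal stop.

```python
def spellings_ending_in(entries, final_segment):
    """Set of spellings having at least one realization ending in final_segment."""
    return {orth for orth, segs in entries if segs and segs[-1] == final_segment}


# Invented (spelling, realization) pairs for illustration.
entries = [
    ("art", ["AA", "R", "T"]),
    ("can't", ["K", "AE", "N", "T"]),
    ("can't", ["K", "AE", "N", "Q"]),
    ("fished", ["F", "IH", "SH", "T"]),
    ("that", ["DH", "AE", "T"]),
    ("that", ["DH", "AE", "Q"]),
]

# Spellings attested with both a final T and a final Q realization.
alternating = spellings_ending_in(entries, "T") & spellings_ending_in(entries, "Q")
```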
In order to determine the conditions under which alveolar 
flapping occurs, the queries 
O/ P: =DX= and P/ P: =DX= 
can be used to retrieve phonological items which contain DX
(/ɾ/) and corresponding orthographic items. Such a list might
include 
ability          /əbɪlɪɾi/
city             /sɪɾi/
facility         /fəsɪlɪɾi/
letter           /lɛɾə/
petty            /pɛɾi/
responsibility   /rɪspansɪbɪlɪɾi/
writing          /raɪɾɪŋ/
Flapping occurs in a descending stress pattern, e.g. city,
letter, petty, writing, in which a stressed vowel precedes
the flap and an unstressed vowel follows. In addition, the
flap appears to occur between unstressed vowels when the
sequence represents the noun affix -ity, as in ability. To
check this, the query
P/ O: =ITY, S: N
could be used to retrieve all nouns ending in -ity, and the
subset involving affixed forms (i.e. excluding city, pity)
could be examined for occurrences of flapping.
B. Word Error Recognition testing 
The interactive facility can be used to examine the kinds 
of word recognition errors which might occur in a speech under- 
standing system due to indeterminacies in segment labelling. 
If a string is completely specified as /likɪŋ/ (coded LIYKIHNX),
then it matches a single word, leaking. However, if labelling
is less precise, then alternative (and incorrect) word matches
might occur. Using the interactive retrieval system, alternative
labels and resulting word matches can be examined for
any given lexicon. 
In the example above, the labelled string might be 
L (VOC HIGH ANT) K IH NX 
with the stressed vowel represented as a set of features: 
vocalic, high, anterior. Resulting word matches might include 
leaking and licking. 
If the initial consonant is also specified as a set of
features (consonant, sonorant, continuant), as in the string
(CON SON CONT) (VOC HIGH ANT) K IH NX
then the resulting word matches might be leaking, licking,
reeking. If the K is specified less precisely as a voiceless
stop, word matches might include leaking, licking,
reeking, leaping, ripping.
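The growth of the candidate set as labels become less precise can be reproduced with a small lexicon in which each label is either a phoneme symbol or a set of features. This Python sketch is illustrative only; the feature assignments, the lexicon (including the extra word looking) and the function names are assumptions chosen to mirror the example.

```python
# Hypothetical feature assignments for the segments used below.
FEATURES = {
    "L": {"CON", "SON", "CONT"}, "R": {"CON", "SON", "CONT"},
    "IY": {"VOC", "HIGH", "ANT"}, "IH": {"VOC", "HIGH", "ANT"},
    "UH": {"VOC", "HIGH"},
    "K": {"UV", "ST"}, "P": {"UV", "ST"},
    "NX": {"CON", "SON"},
}

# Hypothetical lexicon with invented segmentations.
LEXICON = {
    "leaking": ["L", "IY", "K", "IH", "NX"],
    "licking": ["L", "IH", "K", "IH", "NX"],
    "reeking": ["R", "IY", "K", "IH", "NX"],
    "leaping": ["L", "IY", "P", "IH", "NX"],
    "ripping": ["R", "IH", "P", "IH", "NX"],
    "looking": ["L", "UH", "K", "IH", "NX"],
}


def fits(label, segment):
    """A label is either a phoneme symbol (str) or a set of required features."""
    if isinstance(label, str):
        return label == segment
    return set(label) <= FEATURES[segment]


def candidates(labels, lexicon=LEXICON):
    """Words whose segments are all compatible with the label string."""
    return sorted(
        word for word, segs in lexicon.items()
        if len(segs) == len(labels) and all(fits(l, s) for l, s in zip(labels, segs))
    )


exact = candidates(["L", "IY", "K", "IH", "NX"])
vowel_loose = candidates(["L", {"VOC", "HIGH", "ANT"}, "K", "IH", "NX"])
onset_loose = candidates([{"CON", "SON", "CONT"}, {"VOC", "HIGH", "ANT"}, "K", "IH", "NX"])
stop_loose = candidates([{"CON", "SON", "CONT"}, {"VOC", "HIGH", "ANT"},
                         {"UV", "ST"}, "IH", "NX"])
```

Each relaxation of a label can only keep or enlarge the set of matching words, which is the effect a system designer would want to inspect.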
The interactive facility allows the system designer to 
easily determine the nature of possible incorrect matches 
due to phonological indeterminacy, especially as the size of 
the lexicon increases.
C. Comparative Linguistic Relationships 
If the data base is represented as an orthographic list 
with two associated phonological lists representing two 
languages or dialects, the interactive system can be used to 
discover systematic sound correspondences, and to aid in the 
study of dialect relationships and historical reconstruction. 
A sample data base might be:
Gloss       Language A    Language B
a fish      plaa          pa
to have     mii           mia
no, not     plaaw         paw
brother     phii          fia
bamboo      phaay         fay
The query 
would retrieve those items in language B which correspond to 
items in language A with initial /pl-/ clusters, e.g. pa and
paw, indicating that consonant cluster simplification may have 
occurred in language B. The query 
B/ A: =IYIY 
would retrieve those items in language B which correspond to 
items in language A with final /-ii/, e.g. the diphthongized
mia and fia.
A large data base could be accessed in this way to dis- 
cover systematic correspondences between languages A and B, 
such as the correspondences /pl-/:/p-/, m:m, /ph-/:/f/,
ii:ia, aa:a, etc.
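Tallying initial correspondences over the sample data base gives a feel for how such regularities surface. The Python sketch below works directly on the romanized forms; the vowel set, the function names and the crude onset split are simplifying assumptions, not part of the system described.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: adequate for the romanized sample forms

# (language A form, language B form), taken from the sample data base.
PAIRS = [("plaa", "pa"), ("mii", "mia"), ("plaaw", "paw"),
         ("phii", "fia"), ("phaay", "fay")]


def split_onset(word):
    """Split a word into its initial consonant(s) and the remainder."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i], word[i:]
    return word, ""


def onset_correspondences(pairs):
    """Count how often each A onset corresponds to each B onset."""
    return Counter((split_onset(a)[0], split_onset(b)[0]) for a, b in pairs)


tally = onset_correspondences(PAIRS)
b_forms_for_pl = [b for a, b in PAIRS if a.startswith("pl")]
```

On this sample the tally shows pl:p and ph:f twice each and m:m once, in line with the correspondences listed above.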
The flexibility of the interactive system, combined with
the linguistic intuition of the user, can be used to specify
and retrieve any set of correspondences, without the need to
format the data according to initial consonants or clusters,
vowel nuclei, finals, etc. Information such as tonal contours
and stress can also be represented and accessed.
ACKNOWLEDGEMENT 
This research was supported in part by the Advanced 
Research Projects Agency of the Department of Defense through
Contract N00014-73-C-0221 administered by the Office of Naval
Research Information Systems Program.
1975 ACL Meeting 
ON THE ROLE OF WORDS AND PHRASES IN AUTOMATIC TEXT ANALYSIS 
Automatic indexing normally consists in assigning to documents
either single terms, or more specific entities such as phrases, or 
more general entities such as term classes. Discrimination value 
analysis assigns an appropriate role in the indexing operation to 
the single terms, term phrases, and thesaurus categories. To 
enhance precision it is useful to form phrases from high-frequency 
single term components. To improve recall, low-frequency terms 
should be grouped into affinity classes, assigned as content 
identifiers instead of the single terms. 
Collections in different subject areas are used in experiments
to characterize the type of phrase and word class most effective
for content representation. 
The following typical conclusions can be reached: 
a) the addition of phrases improves performance considerably; 
b) use of phrases is better with corresponding deletion of 
single terms in practically all cases; 
c) the use of both high-frequency and medium-frequency 
phrases is generally more effective than the use of either phrase- 
type alone; 
d) the most effective thesaurus categories are those which 
include a large number of low-frequency terms; 
e) the least effective classes either consist of only one or 
two terms, or else they include terms with unequal frequency characteristics,
permitting the high-frequency terms to overcome the
others. 
The discrimination value theory is developed and appropriate
experimental output is supplied. 
American Journal of Computational Linguistics 
Microfiche 36 : 68 
Department of Anthropology
University of New Brunswick
Roswell Park Memorial Institute
Buffalo, New York 
Linguistic String Project 
New York University 
2 Washington Square Village, 28
New York, New York 10012
ABSTRACT 
Linguistic mechanisms of compression are used when making notes within a
context where the objects and meanings are known. Mechanisms of compression in
medical records for a collaborative study of breast cancer are described. The
syntactic devices were mainly deletion of words having a special status in the
grammar of the whole language and deletion in particular positions of words
having a special status in the sublanguage. The deleted forms are described
and sublanguage word classes defined. A subcorpus of the medical records was
parsed by an existing computer parsing system; a component covering the
deletion-forms was added to the grammar. Modifications to the computer grammar
are discussed and the parsing results are summarized.
Introduction 
All languages have mechanisms of compression. Sentences may be embedded
within other sentences by means of nominalization and complementation. Various
grammatical transformations involve deletion of certain parts of the sentence.
In medical records, we find entries such as no evidence of metastases, which
may be said to be derived from something like There is no evidence of metastases.
Such incomplete sentences are not common in the spoken language of the medical
records (i.e. dictated reports). However, when physicians themselves are required
to write material for records, compression mechanisms are commonly used.
Although this paper will deal with a specific corpus, similar devices would
often be used for compression in other situations where there is pressure to
write as little as possible. Legal, educational, and scientific records where
informal notes are kept would be other examples of this class of situations.
The original motivation for this study was to develop effective methods for
storing the information in a medical record and to be able to retrieve this
information for purposes of research, medical care, or administration. From
previous research, the feasibility of verbatim input of dictated narrative has been
established. Computerized extraction of the information has been shown to be
feasible in a test system ACORN (Automated Coding of Report Narrative). This
system has been described in detail in a series of previous papers.1-3
For a highly structured medical record where the entries are single words
or very restricted sentences, the feasibility of computer-assisted editing and
coding has also been established. A procedure for typing in the entries verbatim
in a medical record, called TICES (Type-In Coding and Editing System),
has been reported elsewhere.4 However, the third, intermediate class of material
cannot be handled by ACORN or by TICES. Therefore, a linguistic analysis of
this type of material has been undertaken with the ultimate objective of setting
up a comprehensive computer system that can handle almost everything in the
medical records.
In the earlier efforts to develop natural language technology, the work was
facilitated by the fact that the documents involved were strictly for the
transmission of factual information.5 Such documents are regarded as important both
by the persons who are filling them out and by the persons who read them. In
this no-nonsense situation where the record may be critically reviewed by the
peers of the person who is reporting the information, unambiguous and informative
transmission of information is a critical need. Some of the simplicities
in the present analysis may be peculiar to this type of situation.
The existence of a subculture with shared training, objectives, and experience
may facilitate the note-taking process in somewhat the same way that a
person taking notes for himself can somehow be more concise without ambiguity.
However, many other note-taking situations would involve a subculture, though
not necessarily a medical one, and the findings here might be expected to have
some general applicability.
Source of Material 
The medical notes discussed here are from the records of the Surgical
Adjuvant Breast Project, a nationwide collaborative study involving 36 medical
institutions. The records were filled out by medical and paramedical personnel
at the participating institutions and centralized at Roswell Park Memorial
Institute in a statistical unit under the direction of Dr. Nelson Slack. A
sample of approximately 50 was taken from the 2734 case histories of patients
in the program and is being used in the linguistic analysis. Each case history
ordinarily consists of 3-6 pages of detailed information on the patient's initial
status, treatment, pathology report, medical problems, and subsequent
fate. When the structured information in the record was excluded, each case
history had between 6 and 26 notated items, each item consisting of 1 to 5 partial
sentences. While this material is specialized to the purposes of the collaborative
study, this type of information is fairly typical of what is found
in the usual hospital record.
The notes were typed verbatim using an IBM Mag Card Communicator so as to
obtain simultaneously a typed paper document and a record in computer-usable
form. This device is used in the data-input system of TICES, an existing system
for handling completely structured records. It would presumably be used in any
extension of TICES which would handle medical notes. In this analysis the computer
was used to reorganize the material in a form more convenient for manual
analysis by the linguist.
Anderson analyzed the linguistic structure of the entries in a sample of
the medical records involving radiation findings. A discussion of this analysis
will take up the next part of the paper. Sager and associates used some
of the findings from this study to develop methods for processing these same
medical records by computer, adapting a program and grammar which had been
developed for parsing science articles. This project will be discussed in the
final part of the paper.
Linguistic Characteristics of Medical Notes 
Many of the entries on the medical records are in the form of notes which
are neither complete sentences nor single word entries, but linguistic strings
of an intermediate type, which we will hereafter call fragments. Fragments are
a compressed type of linguistic material resulting from various transformations
which have the effect of making linguistic strings shorter by reducing or deleting
material. The writer of these stretches of material must make his entries
brief, in order to save time and effort, but also make them informative
and unambiguous. For this reason the deleted material has to be easily recoverable,
or in other words it must not contain much information. An analysis of
the fragments shows that deletion is mainly of a small class of sentence parts:
(1) tense and the verb be (t be); (2) subject, tense and the verb be; (3) the
subject; and (4) subject, tense, and verb (V) other than be.
A second characteristic of fragments which makes deleted material recoverable
is that both the deleted material and the remainders consist of words in
easily defined subclasses, based on both distributional and semantic criteria.
These subclasses are easily defined because of the nature of the sublanguage;
in general the vocabulary is limited and each word has a limited semantic range.
The question on a form which is being answered can also be used as a basis for
restoring deleted material.
One of the most commonly deleted items in the medical records is t be (1
and 2). Tense is perhaps the most important information be gives. The deletion
of tense in the medical records causes no ambiguity because usually the physician
describes the situation at the time of filling out the report. Otherwise he
gives the time in a time phrase: x-rays on November 2.
Fragment Types 
In Table 1 we list the fragment types, giving an example of each, but not
with all occurring word subclasses. The types will first be given according to
what material is deleted and then will be further subclassed according to the
highest nodes of the tree structure of the remainder. The material in brackets
is the word subclasses which are assumed to have been deleted.
TABLE 1. FRAGMENT TYPES

Material Deleted                 Structure      Example
                                 of Fragment

1. t be, by N-physician          N Ven          no metastatic lesions [were] detected [by physician]
                                 N Adj          chest films [were] normal
                                 N P N          patient [was] without cough
                                 N to V         this form [is] to be used ...
                                 N Ving         wound [is] healing well

2. Subject t be                  Ven            [N-disease was] aspirated once
                                 Adj            [N-patient is] dead
                                 to be Ven      [N-patient is] to be seen by gynecologist
                                 Ving           [N-patient is] doing well

3. Subject                       t V Object
   3a. N-physician Subject                      [I] found osteochondritis in rib (5th right)
   3b. N-patient Subject                        [N-patient] had period one week ago
   3c. N-disease Subject (rare)                 [N-disease] invades skin
                                                [N-disease] seems minor

4. Subject t V
   4a. N-physician t V-discover  Object         [I V-discovered] no bony metastases
   4b. N-physician t V-do        Object         [N-physician did] excision of (r) 5th costal cartilage
   4c. N-patient t have          Object         [N-patient has] no bone pain
Word Subclasses 
The word subclasses should have three characteristics: (1) they should
enable deleted material to be recovered, (2) they should make it possible to
extract and store informational units such as those in ACORN,6 and (3) they should
be defined so that a linguistically unsophisticated person can easily put words
into their subclasses.
The word subclasses are based on both semantic and distributional criteria.
To a large extent nouns can conveniently be subclassed on a semantic basis and
verbs can be subclassed on a distributional basis, according to the subclasses
of nouns which they take as subject and object. Due to the nature of the sublanguage
there is relatively little overlap (e.g., a given verb is likely to
take only one noun subclass as subject) compared to what we would find in the
language as a whole.
Two important subclasses of human nouns used in the medical records are
N-physician and N-patient. Each has only a few members, but is important because
many verbs characteristically take it as subject or object, and also because
both, but particularly N-physician, are usually deleted. It is on the basis of
the verbs which characteristically take them as subject or object that they can
usually be recovered without ambiguity.
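Recovery of a deleted subject from the verb can be pictured as a table lookup. The Python sketch below is a deliberately small illustration and not the procedure used in the study: the lookup table, its three inflected verb forms (taken from the Table 1 examples) and the function name are assumptions.

```python
# Hypothetical lookup from an inflected verb form to the noun subclass it
# characteristically takes as subject (forms taken from the Table 1 examples).
SUBJECT_FOR_VERB = {
    "found": "N-physician",
    "had": "N-patient",
    "invades": "N-disease",
}


def restore_subject(fragment):
    """Prefix a bracketed subject class when the fragment opens with a known verb."""
    words = fragment.split()
    subject = SUBJECT_FOR_VERB.get(words[0].lower()) if words else None
    if subject is None:
        # No subject-deleting verb recognized; leave the fragment as written.
        return fragment
    return f"[{subject}] {fragment}"


restored = restore_subject("found osteochondritis in rib")
```

A fuller treatment would work from verb subclasses rather than individual word forms and would need to handle verbs that belong to more than one subclass.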
Other noun subclasses concern more directly the subject matter of the reports,
the concrete objects with which the physician is dealing. Unlike
N-physician and N-patient, these classes usually have many members and they are
seldom deleted. As with N-physician and N-patient, certain verb subclasses
characteristically take them as subject or object.
Table 2 gives some of the word subclasses with examples of each. 
6 Bross et al., "Information in Natural Languages: A New Approach," 1969.
TABLE 2. SOME WORD SUBCLASSES

N-bodypart            abdomen, axilla, bone, breast, cervix, pelvis
N-change              change, elevation, enlargement, gain, increase
N-dimension           pressure, rate, rhythm, size, weight
N-disease             carcinoma, cough, disease, edema, fibrosis
N-exam                biopsy, exam, film, mammogram, scan, x-ray
N-location            area, field, floor, lobe, neck, part, region
N-patient             she, her, patient, lady, woman
N-physician           doctor, he, him, his, I, M.D., radiologist
N-therapy             drug, insulin, medication, medicine, radiation
N-time                date, month, time, visit, winter, year
V-be-equivalent       appear, feel, indicate, remain, represent, seem
V-change              alter, clear, change, enlarge, heal, progress
V-discover            detect, find, identify, note, observe, see
V-patient-object      admit, give, leave, place, readmit, see, transfer, treat
V-patient-subject     complain, come, cooperate, enter, feel, gain, go, have,
                      refuse, show, suffer, take
V-physician-subject   feel, have, place, tell, transfer, treat, see
V-show                show, demonstrate, indicate, reveal, suggest
Adj-bodypart          axillary, bony, clavicular, lumbar, pelvic
Adj-changed           elevated, enlarged, healed, stable, unchanged
Adj-degree            considerable, extensive, intermittent, little
Adj-discover          absent, evident, known, possible, present
Adj-disease quality   active, bad, benign, degenerative, firm, hard,
                      malignant, metastatic, nodular
[illegible]           adjoining, distal, dorsal, frontal, left
[illegible]           clear, free, healthy, negative, normal
Computer Parsing of Medical Records
To test the linguistic analysis, a subset of the manually analyzed corpus
of medical records was parsed by computer, using the NYU Linguistic String Parser.8
The LSP grammar of English is based on the same linguistic principles as the
ACORN grammar. Hence it could also serve to test the feasibility of adding a
note-handling capability to the ACORN-TICES system. The LSP system, which was
designed for text-processing, was adapted to the parsing of medical records by
deleting portions of the grammar which are not required for this type of material
and adding a section covering sentence fragments. These changes are
described below, followed by the parsing results.
The corpus which was parsed consisted of 12 sections of the Radiation
Findings extracted in their order of appearance from the medical records. These
sections contained 245 sentences or sentence fragments (word sequences ending
in a period). Of these, 37 were complete English sentences and 205 were fragments;
3 were combinations of both types. 21 entries were identical to others
in the corpus, accounting in all for 139 of the sentences or sentence fragments.
Of the complete sentences, some were quite long, e.g., Reexamination shows some
scarring and thickening over the right apex which is perhaps slightly more evident
than it was before, but nothing is seen that is typical of tumor involvement.
Typical shorter sentences are Chest films on 10-25-68 and 12-14-68 do not show
any essential changes since last reports, Liver scan 1-29-69 was normal. Fragments
were, as predicted, of the types listed in Table 1, above, though not all
types were represented in the parsed corpus.
Table 3 shows the new definitions or redefinitions which were added to the
LSP grammar to cover fragments. These definitions are written in Backus-Naur Form
(BNF), as are all the ca. 180 definitions which comprise the context-free part
of the LSP English grammar. The BNF definitions are used by the parser to construct
a tree representing the structure of the input sentence.

TABLE 3. DEFINITIONS ADDED TO THE LSP GRAMMAR TO COVER SENTENCE FRAGMENTS

 1. <SENTENCE>     ::= <TEXTLET>.
 2. <TEXTLET>      ::= <OLD-SENTENCE><MORESENT>.
 3. <OLD-SENTENCE> ::= <INTRODUCER><CENTER><ENDMARK>.
 4. <MORESENT>     ::= NULL/<TEXTLET>.
 5. <INTRODUCER>   ::= NULL.
 6. <CENTER>       ::= <ASSERTION>/<FRAGMENT>/<IMPERATIVE>.
 7. <FRAGMENT>     ::= <SA> (<SOBJBESHOW>/<ASTG><SA>/<NSTG><SA>/
                       <VENPASS>/<NSTG> (<ASSERTION>/<SOBJBESHOW>)).
 8. <SOBJBESHOW>   ::= <SUBJECT><BE-OR-SHOW><OBJBE><SA>.
 9. <BE-OR-SHOW>   ::= +--+/NULL.
10. <ENDMARK>      ::= +.+/+,+/+;+/+-+.

In addition to BNF definitions, the grammar contains restrictions, which
test the sentence trees for grammatical and selectional well-formedness. The
starting, or root, definition of the grammar is SENTENCE, so this is the first
definition seen in Table 3. In the case of medical records, the unit may be
longer than one sentence, but we have retained the root-word SENTENCE and
defined SENTENCE in this case to be a TEXTLET (definition 2), which consists of
a sentence (called OLD-SENTENCE, definition 3) optionally followed by more
sentences (MORESENT, definition 4). The definition of OLD-SENTENCE has the same
three elements (INTRODUCER, CENTER, ENDMARK) that the definition of SENTENCE
does in the LSP grammar; however, in this case, the INTRODUCER (definition 5)
is NULL; the CENTER (definition 6) contains an option FRAGMENT in addition to
the options ASSERTION and IMPERATIVE defined in the English grammar (other
options of CENTER, e.g. QUESTION, have been deleted); and the ENDMARK
(definition 10) contains unconventional punctuation, such as dashes and commas,
in addition to the period and semicolon. Since our main interest here is in
FRAGMENT (definition 7), we will elaborate only this definition.
9 For more explanation of the LSP system and grammar, see N. Sager and
R. Grishman, "The Restriction Language for Computer Grammars of Natural
Language", Commun. of the ACM, 18, 390-400, 1975, and the references cited
there.
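The top-level definitions of Table 3 (TEXTLET, OLD-SENTENCE, MORESENT,
ENDMARK) can be illustrated with a minimal recursive sketch. This is not the
LSP parser: CENTER is stubbed out as any non-empty run of tokens up to the
next end mark, the tokenizer is a hypothetical convenience, and every function
name is invented for the illustration.

```python
import re

# ENDMARK (definition 10): period, comma, semicolon, or a free-standing dash.
ENDMARKS = {".", ",", ";", "-", "--"}

# Hypothetical tokenizer: runs of non-space characters, with '.', ',' and ';'
# split off as separate tokens. Dates such as 1-29-69 stay in one piece.
TOKEN = re.compile(r"[^\s.,;]+|[.,;]")


def tokenize(text):
    return TOKEN.findall(text)


def parse_old_sentence(tokens, i):
    """OLD-SENTENCE ::= INTRODUCER CENTER ENDMARK, with INTRODUCER = NULL.

    CENTER is a stand-in here: any non-empty run of non-endmark tokens.
    Returns (center_tokens, endmark, next_index) or None.
    """
    j = i
    while j < len(tokens) and tokens[j] not in ENDMARKS:
        j += 1
    if j == i or j == len(tokens):
        return None  # empty CENTER, or no ENDMARK found
    return tokens[i:j], tokens[j], j + 1


def parse_textlet(tokens, i=0):
    """TEXTLET ::= OLD-SENTENCE MORESENT;  MORESENT ::= NULL / TEXTLET."""
    first = parse_old_sentence(tokens, i)
    if first is None:
        return None
    center, endmark, nxt = first
    if nxt == len(tokens):                # MORESENT = NULL
        return [(center, endmark)]
    rest = parse_textlet(tokens, nxt)     # MORESENT = TEXTLET
    if rest is None:
        return None
    return [(center, endmark)] + rest


units = parse_textlet(tokenize("Liver scan 1-29-69 was normal. None this visit."))
for center, endmark in units:
    print(" ".join(center), endmark)
```

Because the stub splits at every comma, it also shows in miniature why commas
used both inside entries and as separators are a source of ambiguity that the
real grammar has to resolve syntactically.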
In defining FRAGMENT, we have used parts of the grammar which were defined
independently of the fragment problem. That this is possible is in itself a
partial verification of the conclusion from manual analysis that only limited,
grammatically specifiable, deletion-forms occur in the fragments seen in notes
and records. For example, the dropping of the verb (type 1 of Table 1) can
occur in normal English when a sentence containing the verb be occurs as the
object of a verb like find, e.g. "We found the chest clear to percussion and
auscultation." In the LSP grammar there is an object string defined for such
occurrences; it is called SOBJBE (Subject + Object of be). This same string can
then be made an option of CENTER to analyze fragments having the same form,
e.g. "Chest clear to percussion and auscultation."
In detail, the definition of FRAGMENT begins with the element SA (Sentence
Adjunct). The definition of SA (not shown here) contains 16 options covering
all types of sentence modifiers. In this material the most frequent SA is a
time expression, usually a date (called PDATE, for optional Preposition + date)
or "this examination", "this visit". Following SA in the definition of FRAGMENT
are the options proper, naming definitions already in the LSP grammar. The
first option, SOBJBESHOW (Subject + Object of be or show), corresponds to the
second and third structures of type 1 and also occurrences like "Chest film no
change", which is an expansion of SOBJBE, discussed above. This option covers
deletions of the two most common verbs in this material, be and show. The place
of be or show (definition 8) in a fragment is either empty or is filled by a
dash.
The second and fourth options, ASTG and VENPASS, in FRAGMENT correspond to
structures of type 2 in Table 1 (e.g., "Negative", "felt to be a benign
lesion"), where the subject, tense and verb be have been dropped. In the LSP
grammar, ASTG (Adjective string) is an option of OBJBE, and VENPASS (V-en
passive string) is also permitted after be and in other places. The third
option, NSTG (Noun string), is an object of show, e.g., "Mild degenerative
changes" (from "X-rays show mild degenerative changes"). It also covers
occurrences of the first structure of type 1 (e.g. "No X-rays taken") where for
regularity with more complete entries the passive verb ("taken") is seen as a
right adjunct of the noun. The last option, consisting of NSTG followed by
either ASSERTION or SOBJBESHOW, covers such occurrences as "PA and lateral
chest 11-5-71 reexamination shows some scarring and thickening over the right
apex", where a noun phrase ("PA and lateral chest 11-5-71") precedes an
assertion about that noun phrase.
Space permits only a few remarks about these definitions. It was helpful
to order the options so that the longer options precede the shorter ones, since
some of the shorter options (e.g., NSTG) can have the same form as the first
element of the longer ones. This is not required in parsing texts, since in
full sentences there is usually no other way of fitting in the remainder of the
sentence. Also, in text sentences, many nouns require a preceding determiner,
so that compound nouns are not split into separate noun phrases. In this
material, determiners are rarely employed, so this constraint cannot be
applied. This, combined with verb deletions and the use of commas both in the
text and as sentence separators, makes for a great deal of syntactic ambiguity.
However, as the next section shows, it was possible to obtain the intended
parse as the first output in most cases. This was accomplished without using
the subclasses special to the medical material, which are used in a subsequent
stage of processing preparatory to information retrieval.
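The effect of option ordering on which analysis comes out first can be seen
in a toy backtracking parser. The sketch below is an assumption-laden
illustration, not the LSP algorithm: the input is a list of pre-assigned word
class tags (N, V), end marks are omitted, UNIT is a hypothetical stand-in for
CENTER, and NSTG and ASSERTION are reduced to one trivial option each. Options
are tried strictly in the order listed, and the first analysis that covers the
whole input is returned.

```python
def parses(grammar, symbol, tags, i):
    """Yield (tree, next_index) for each way `symbol` matches tags[i:],
    trying the options of a definition in the order they are listed."""
    if symbol not in grammar:                  # terminal: compare with next tag
        if i < len(tags) and tags[i] == symbol:
            yield symbol, i + 1
        return
    for option in grammar[symbol]:
        for children, j in sequence(grammar, option, tags, i):
            yield (symbol, children), j


def sequence(grammar, symbols, tags, i):
    """Match a list of symbols one after another (empty list = NULL)."""
    if not symbols:
        yield [], i
        return
    for tree, j in parses(grammar, symbols[0], tags, i):
        for rest, k in sequence(grammar, symbols[1:], tags, j):
            yield [tree] + rest, k


def first_parse(grammar, tags):
    """Return the first analysis of TEXTLET that consumes every tag."""
    for tree, j in parses(grammar, "TEXTLET", tags, 0):
        if j == len(tags):
            return tree
    return None


def count(tree, label):
    """Number of nodes named `label` in a parse tree."""
    if isinstance(tree, str):
        return 0
    symbol, children = tree
    return (symbol == label) + sum(count(c, label) for c in children)


def make_grammar(fragment_options):
    """Toy grammar; only the order of the FRAGMENT options is varied."""
    return {
        "TEXTLET": [["UNIT", "MORESENT"]],
        "MORESENT": [[], ["TEXTLET"]],
        "UNIT": [["FRAGMENT"], ["ASSERTION"]],
        "FRAGMENT": fragment_options,
        "NSTG": [["N"]],
        "ASSERTION": [["N", "V", "N"]],
    }


# Tags for a toy entry of the shape: noun, noun, verb, noun.
TAGS = ["N", "N", "V", "N"]

long_first = make_grammar([["NSTG", "ASSERTION"], ["NSTG"]])
short_first = make_grammar([["NSTG"], ["NSTG", "ASSERTION"]])

print(count(first_parse(long_first, TAGS), "UNIT"))
print(count(first_parse(short_first, TAGS), "UNIT"))
```

With the longer option listed first, the first parse is a single fragment (a
noun phrase followed by an assertion about it). With the shorter option first,
the parser commits to a bare noun phrase and fits the remainder in as a
separate unit, so the first parse splits the entry in two.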
Parsing Results 
Parsing output is in the form of a tree, illustrated for a typical fragment
in Fig. 1. (Only the nodes mentioned above are shown, plus LN/RN = left/right
modifiers of Noun.) The full power of the parser is better illustrated by the
long full sentences, but space does not permit presenting them here.
Fig. 1. Parse tree for FRAGMENT = 5-2-67 chest--no change since 2-7-67.
A summary of the parsing results is given in Table 4. Of the total 245
sentences, a correct first parse was obtained for 171 or 69.8%, and a first
parse adequate for further processing to obtain an "information format" in 213
cases, or 86.9%. The latter statement brings us to the important question of
how these parses are to be used.

TABLE 4. PARSING RESULTS

                                           Number of Sentences   Percentage
Full sentence                                       37              15.1
Fragment                                           205              83.7
Full S + Fragment                                    3               1.2
TOTAL                                              245             100.0

1st parse correct                                  171              69.8
1st parse OK for format                            213              86.9
2nd or 3rd parse OK for format
No parse or parses 1-3 not OK for format
TOTAL                                              245

Average time for 1st parse: 5.158 seconds

The aim in processing natural language notes and records is to arrive at
forms for the data which are suitable for computerized information retrieval.
The data structures must not change the meaning. This is why syntactic methods
are important. Parsing with an English grammar provides the gross structure of
input sentences. (The use of English transformations makes the grammatical
analysis more refined.) In each specialized technical area, more specific
structure is possible, making use of the restricted word usage characteristic
of the discourse in the given subject area (Ref. 10).
A second stage of processing of this type is now being applied to the parsed
corpus of medical records and will be reported in a subsequent paper. A
convenient test of the adequacy of the parsing outputs is therefore whether
they can serve as input to this second stage of processing (called formatting).
It can be seen in Table 4 that a number of "wrong" parses were still adequate
as input to the formatting; the segmentation of the sentence into parts was
correct even if the parts were assigned an incorrect syntactic status, e.g.,
object instead of adjunct. Only when the first parse was not adequate for
formatting was the sentence rerun to obtain alternative analyses.
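The figures quoted in the text allow two derived quantities to be worked out.
The sketch below assumes that every correct first parse is also adequate for
formatting, so that the difference between the two reported counts is the
number of "wrong" first parses that were still usable; the names are
illustrative only.

```python
total = 245                # sentences and fragments parsed
first_correct = 171        # correct first parse
first_ok_for_format = 213  # first parse adequate for formatting


def pct(n, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * n / total, 1)


# Assumption: correct first parses are a subset of the adequate ones.
wrong_but_adequate = first_ok_for_format - first_correct

# Entries whose first parse was not adequate, and which were therefore rerun.
rerun = total - first_ok_for_format

print(pct(first_correct, total), pct(first_ok_for_format, total))
print(wrong_but_adequate, rerun)
```

Under that assumption, 42 first parses were syntactically wrong yet segmented
the entry well enough for formatting, and 32 entries had to be rerun for
alternative analyses.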
The parsing times are a rough indication of the efficiency of the parsing,
but two points should be kept in mind. (1) The present LSP system is not a
production model, but a research tool, with all that implies. (2) A significant
fraction of the input sentences were "no data" types, e.g., "None this visit."
These word sequences were so limited linguistically that a literal formula
could serve to recognize them. The experimental use of such a formula cut down
parsing times on the no-data entries from about 1.817 to 0.030. However, this
formula was not used in the parsing summarized in Table 4.
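A "literal formula" of this kind might look like a fixed pattern that is
checked before the parser is called. The pattern below is purely hypothetical:
the text gives only the example "None this visit.", and the actual formula
used in the experiment is not described. The function names and the
`full_parse` callback are likewise invented for the sketch.

```python
import re

# Hypothetical pattern for no-data entries: "None", optionally followed by
# "this visit" or "this examination", with an optional final period.
NO_DATA = re.compile(
    r"^\s*none(\s+this\s+(visit|examination))?\s*\.?\s*$",
    re.IGNORECASE,
)


def is_no_data(entry):
    """True if the entry matches the fixed no-data pattern."""
    return NO_DATA.match(entry) is not None


def analyze(entry, full_parse):
    """Short-circuit no-data entries; hand everything else to the parser."""
    if is_no_data(entry):
        return ("NO-DATA", entry.strip())
    return full_parse(entry)


# Ratio of the two reported times for no-data entries.
speedup = round(1.817 / 0.030, 1)
print(speedup)
```

On the reported times, bypassing the parser for such entries corresponds to
roughly a sixty-fold reduction.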
This investigation was supported by Public Health Service Research Grant
number CA-11531 from the National Cancer Institute.
10 See Ref. 5 and N. Sager, "Syntactic Formatting of Scientific Information",
Proc. FJCC, AFIPS Press, Montvale, N. J., 1972.

References 
Clark, H. H. (1975), "Bridging", Conference on Theoretical Issues in
Natural Language Processing, 10-13 June 1975, Cambridge, Mass.
Rieger, C. J. (1974), Conceptual Memory: A Theory and Computer
Program for Processing the Meaning Content of Natural Language Utterances,
Ph.D. Thesis. Stanford University. 1974.
Schank, R. (1972), "Conceptual Dependency: A Theory of Natural Language
Understanding", Cognitive Psychology 3(4), 1972.
(C,75) Clark, H.H. "Bridging," in Proceedings of the Workshop on
Theoretical Issues in Natural Language Processing. Cambridge,
Massachusetts. June 1975.
(JR,75) Joshi, A.K. and Rosenschein, S. "A Formalism for Relating Lexical
and Pragmatic Information," in Proceedings of the Workshop on
Theoretical Issues in Natural Language Processing. Cambridge,
Massachusetts. June 1975.
(Ri,74) Rieger, C. Conceptual Memory. Ph.D. Thesis. Stanford University.
Stanford, California. 1974.
(R,75) Rosenschein, S. Structuring a Pattern Space, with Applications to
Lexical Information and Event Interpretation. Doctoral
Dissertation. University of Pennsylvania. Philadelphia,
Pennsylvania. 1975.
(Sch,75) Schank, R., Goldman, N., Rieger, C., and Riesbeck, C. "Inference
and Paraphrase by Computer." Journal of the A.C.M. Volume 22,
No. 3. July, 1975.
(W,75) Wilks, Y. "A Preferential, Pattern-seeking Semantics for Natural
Language Inference." Artificial Intelligence. Volume 6. 1975.
Bickerton, D. 1973. "The Nature of a Creole Continuum". Language 49 (3),
p. 640-669.
Carrington, L. and Borely, C. 1969. "An Investigation into English Language
Learning and Teaching Problems in Trinidad and Tobago Progress Report".
U.W.I. Institute of Education, St. Augustine (mimeo).
Carrington, L., Borely, C. and Knight, H. 1972. Away Robin Run: A Critical
Description of the Teaching of Language Arts in the Primary Schools of
Trinidad and Tobago. U.W.I. Institute of Education, St. Augustine (mimeo).
Carrington, L., Borely, C. and Knight, H. 1974. "Linguistic Exposure of
Trinidadian Children". Caribbean Journal of Education No. 1, p. 12-22.
De Camp, D. 1971. "The study of pidgin and creole languages" in Hymes,
Pidginization and Creolization of Languages. CUP, p. 13-39.
Le Page, R.B. 1957. "General outlines of creole English dialects in the
British Caribbean". Orbis 6, p. 373-391.
Knight, H., Carrington, L. and Borely, C. 1974. "Preliminary Comments on
Language Arts Textbooks in use in the primary schools of Trinidad and
Tobago". Caribbean Journal of Education No. 2.
[1] Retz, D. L., J. R. Miller, J. L. McClurg, B. W. Schafer,
Elf Kernel Programmer's Guide, Speech Communications
Research Laboratory, Santa Barbara, California. April, 1975.
[2] Feldman, J. A. and P. Rovner, "An ALGOL-based Associative
Language," Comm. ACM, Volume 12, August, 1969, 439-449.
[3] Barnett, J. A., A Phonological Rules System, TM-5478/000/00,
System Development Corporation, Santa Monica, California,
1975.
[4] Bobrow, D. G. and J. B. Fraser, "A Phonological Rule
Tester," Comm. ACM, Volume 11, November, 1968, 766-772.
[5] Friedman, J. and Y. C. Morin, Phonological Grammar Tester:
Description, Natural Language Studies No. 9, Phonetics
Laboratory, The University of Michigan, 1971.
1 I.D.J. Bross et al. "Information in Natural Languages: A New Approach". Journal of the American Medical Association, Vol. 207, No. 11, 1969, pp. 2080-2084.
2 I.D.J. Bross et al. "Feasibility of Automated Information Systems in the Users' Natural Language". American Scientist, Vol. 57, No. 2, 1969, pp. 193-205.
3 P.A. Shapiro and D.F. Stermole. "ACORN (Automated Coding of Report Narrative): An Automated Natural-Language Question-Answering System for Surgical Reports". Computers and Automation, Vol. 20, No. 7, 1971.
4 I.D.J. Bross et al. "Unobtrusive Biomedical Data-Input Systems". Bio-Medical Computing, No. 4, 1973, pp. 219-228.
5 I.D.J. Bross, P.A. Shapiro and B.B. Anderson. "How Information Is Carried in Scientific Sublanguages". Science, Vol. 176, No. 4041, 1972, pp. 1303-1307.
6 Bross et al. Information in Natural Languages: A New Approach, 1969.
8 R. Grishman, N. Sager, C. Raze, and B. Bookchin, "The Linguistic String Parser". Proceedings of the NCC, AFIPS Press, Montvale, N. J., 1973.
9 N. Sager and R. Grishman, The restriction language for computer grammars of natural language, CACM, 1975.
10 N. Sager, Syntactic formatting of scientific information, FJCC, AFIPS Press, 1972.
