COLLOCATIONAL GRAMMAR AS A MODEL FOR HUMAN-COMPUTER
INTERACTION
W. Randolph Ford 
Prism Associates 
7402 York Road, Suite 301
Towson, Maryland 21204 
Raoul N. Smith
GTE Laboratories, Inc. 
40 Sylvan Road
Waltham, Massachusetts 02254 
Contrary to what transformational grammarians have long held
about communication in general, the majority of the
natural language sentences which people actually use in
communicating with a computer, in an unconstrained mode, are not
novel. As Thompson and Thompson 1981:660 observe, "monotony
of structure is the rule rather than the exception in
human-computer communication." Thompson 1981:41 reports in her
study of such communications that 75 percent of the queries 
were wh-questions, 19 percent were commands, 5 percent were
statements, and 1 percent were yes/no questions. 
The repetitive feature of natural language is not a new
concept. Similar observations have been made before. Damerau
1971 used collocation of lexical items as the basis for a
Markov model in an experiment for text generation. Becker
claims that "the wonderful feats of the human intellect ... are
based as much on memorization as on any impromptu
problem-solving" (1975:62). He posits a phrasal lexicon consisting
of six major categories of lexical phrases by which we "stitch
together swatches of text that we have heard before;
productive processes have the secondary role of adapting the old
phrases to the new situations" (1975:60).
All of these approaches to natural language data rely
heavily on the observation that many lexical items tend to
co-occur. This surface co-occurrence is the result of what
may at times be complicated syntactic and semantic
interrelations of language units. Unfortunately, a systematic
accounting of these interrelations has not been achieved in any
linguistic theory. The thrust of our approach is that the
more of language that can be handled lexically, the more
easily human language can be modelled.
Actual data on the frequency of lexical collocation are
very sparse. A study of word sequences in PANALOG text has
shown a surprising amount of repetition of word sequences
(Bienstock and Smith in preparation). (PANALOG is a system
for passing messages among small groups in computer
conferencing with telemail and calendar features. See Housman 1979.
The data are of human-human communication and not
human-computer communication and are thematically restricted. They
therefore resemble Damerau's data.) A study of parts of the
Brown English Corpus has been undertaken in order to get less
thematically homogeneous material. In addition, Wizard of Oz
experiments with unconstrained human-computer input will
begin soon at GTE Laboratories in order to gather the more
relevant human-computer data.
A PANALOG text of 16,133 words was chosen for study. The
longest string which occurred more than once was a seven-word
quote from Nietzsche. Two six-word strings occurred twice
and, at length five, one string occurred three times and
thirteen occurred twice.
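
A tally of this kind can be sketched in a few lines of Python. This is only an illustration of the counting, not the procedure actually used in the PANALOG study: the function names are hypothetical, and whitespace tokenization of a short invented sample is an assumption.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every contiguous n-word string in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def repetition_profile(words, max_n):
    """For each string length 2..max_n, map an occurrence count (> 1)
    to the number of distinct strings occurring that many times."""
    profile = {}
    for n in range(2, max_n + 1):
        counts = ngram_counts(words, n)
        tally = Counter(c for c in counts.values() if c > 1)
        if tally:
            profile[n] = dict(tally)
    return profile

# Invented sample text (an assumption, not PANALOG data).
sample = "please list the ships in port please list the ships at sea".split()
print(repetition_profile(sample, 5))
# -> {2: {2: 3}, 3: {2: 2}, 4: {2: 1}}
```

Read against the figures above, an entry such as `5: {3: 1, 2: 13}` would correspond to one five-word string occurring three times and thirteen occurring twice.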
An interesting feature of these distributions is that 
the number of hapaxes (those strings occurring only once at 
a given string length) reaches a peak at length three (see 
Figure 1).
Figure 1. Number of hapaxes by string length (string lengths 2 through 8).
This is a revealing measure of the amount of repetition
in a text of this length. In particular, recurring two-word
strings account for 40.8 percent of the running text and
recurring three-word strings comprise 8.1 percent of the text.
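
The two measures discussed here, hapax counts per string length and the share of running text taken up by recurring strings, can be sketched as follows. The definition of coverage is an assumption (a word token counts as covered if it lies inside at least one occurrence of a string that appears more than once); the text does not state how the percentages were computed, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(words, n):
    """Return the list of contiguous n-word strings, in text order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def hapax_count(words, n):
    """Number of distinct n-word strings that occur exactly once."""
    counts = Counter(ngrams(words, n))
    return sum(1 for c in counts.values() if c == 1)

def recurring_coverage(words, n):
    """Fraction of word tokens lying inside at least one occurrence
    of an n-word string that occurs more than once (assumed definition)."""
    if not words:
        return 0.0
    grams = ngrams(words, n)
    counts = Counter(grams)
    covered = [False] * len(words)
    for i, gram in enumerate(grams):
        if counts[gram] > 1:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(words)

# Invented sample text (an assumption, not PANALOG data).
sample = "please list the ships in port please list the ships at sea".split()
for n in range(2, 6):
    print(n, hapax_count(sample, n), round(recurring_coverage(sample, n), 3))
```

Other definitions of coverage are possible, for example counting n-gram occurrences rather than word tokens, and they would give different percentages on the same text.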
The frequent occurrence of lexical collocation in natural
language texts, especially in human-computer communication,
is the basic assumption behind the development
of a new type of natural language processor. Ford 1981 has
constructed a natural language processing system for database
updating, retrieval, and manipulation, which relies critically
on the observation that real users tend to employ a very
limited set of lexical string types in querying databases.
The Ford natural language processor consists of a two-stage
reduction algorithm for translating natural language
inputs into basic functions which are then used to perform the
query. The first stage of the reduction changes the input words
to meaning representations using a list of lexical items and
a meaning correlate list. The second stage takes as input
strings of these meaning correlates and changes them into
basic [...] list.
409 numeric representations for words were mapped down to 132
unique meanings and 1328 canonical sentence vectors were mapped down
to 19 functions. This two-stage reduction scheme worked
efficiently enough to respond to 93.8 percent of the 1697
input queries, including ungrammatical ones from inexperienced
users, with a response time of 1.5 seconds, operating in an
environment of 90K 8-bit bytes. This compares very favorably
to Thompson 1981, where only 67.7 percent of REL queries were
correctly parsed, with an average response time of 10 seconds.
(Space requirements were not reported.) Similarly, Damerau
1981 and Petrick 1981 report a success rate for TQA of 65.1
percent of inputs correctly parsed, with the time required to
process a sentence typically being 10 seconds.
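
The two-stage reduction described above can be pictured as two successive table lookups. The following Python sketch is illustrative only and is not Ford's implementation: the word list, meaning correlates, sentence vectors, and function names are all invented, string labels stand in for the numeric representations, and dropping words absent from the lexicon is an assumption.

```python
# Stage-one table: lexical items -> meaning correlates (hypothetical entries).
WORD_TO_MEANING = {
    "show": "DISPLAY", "list": "DISPLAY", "display": "DISPLAY", "give": "DISPLAY",
    "employees": "EMPLOYEE", "workers": "EMPLOYEE", "staff": "EMPLOYEE",
    "salary": "SALARY", "pay": "SALARY", "wages": "SALARY",
    "delete": "REMOVE", "remove": "REMOVE",
}

# Stage-two table: strings of meaning correlates -> basic functions (hypothetical).
VECTOR_TO_FUNCTION = {
    ("DISPLAY", "EMPLOYEE"): "retrieve_records",
    ("DISPLAY", "SALARY", "EMPLOYEE"): "retrieve_field",
    ("DISPLAY", "EMPLOYEE", "SALARY"): "retrieve_field",
    ("REMOVE", "EMPLOYEE"): "delete_records",
}

def stage_one(query):
    """Reduce input words to meaning correlates.
    Assumption: words not found in the lexicon are simply dropped."""
    tokens = query.lower().replace("?", " ").replace(".", " ").split()
    return tuple(WORD_TO_MEANING[t] for t in tokens if t in WORD_TO_MEANING)

def stage_two(vector):
    """Reduce a string of meaning correlates to a basic function name,
    or None if the string is not in the table."""
    return VECTOR_TO_FUNCTION.get(vector)

def reduce_query(query):
    """Run both reduction stages on one input query."""
    return stage_two(stage_one(query))

print(reduce_query("Please show me the employees"))   # -> retrieve_records
print(reduce_query("list the pay of all workers"))    # -> retrieve_field
print(reduce_query("workers pay give"))               # -> None
```

Both tables are many-to-one, which is where the reduction comes from: on the figures reported above, roughly three word representations share each meaning (409 to 132) and roughly seventy sentence vectors share each function (1328 to 19).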
The system works so well in terms of accuracy, speed, and
small storage requirements because of the two-stage
reduction technique which, in turn, is based on
the fact that a great many inputs in human-computer
communication are repetitious syntactically, semantically, and
lexically. Repetition is a principal characteristic of
human-computer communication.

REFERENCES 

Becker, Joseph. 1975. "The Phrasal Lexicon," in Schank, Roger
and Bonnie Nash-Webber, eds. Theoretical Issues in Natural
Language Processing, ACL Workshop, Cambridge, MA, pp. 38-41.

Bienstock, Daniel and Raoul N. Smith. In preparation. "Lexical
Collocation in Three Types of Texts."

Damerau, Frederick J. 1971. Markov Models and Linguistic
Theory. Janua linguarum series minor 95. Mouton: The Hague.

Damerau, Frederick J. 1981. "Operating Statistics for the
Transformational Question Answering System," AJCL 7.1:30-42.

Ford, W. Randolph. 1981. "Natural-Language Processing by
Computer - A New Approach," Unpublished Ph.D. dissertation,
The Johns Hopkins University, Baltimore, Maryland.

Housman, Edward. 1979. "Computer Mediated Communication,"
Profile 3:1-4.

Petrick, Stanley. 1981. "Field Testing the Transformational
Question Answering (TQA) System," Proceedings of the
Nineteenth Annual Meeting of the Association for Computational
Linguistics, pp. 35-36.

Rieger, Chuck. 1977. "Viewing Parsing as Word Sense
Discrimination," in Dingwall, William O., ed. A Survey of Linguistic
Science, Greylock Publishers.

Small, Steven. 1980. "Word Expert Parsing: A Theory of
Distributed Word-Based Natural Language Understanding,"
University of Maryland Computer Science TR-954.

Thompson, Bozena H. 1981. "Evaluation of Natural Language
Interfaces to Data Base Systems," Proceedings of the
Nineteenth Annual Meeting of the Association for Computational
Linguistics. Menlo Park, CA: Association for Computational
Linguistics, pp. 39-42.

Thompson, Bozena H. and Frederick B. Thompson. 1981. "Shifting
to a Higher Gear in a Natural Language System," Proceedings
of the 1981 National Computer Conference. Arlington, VA:
AFIPS Press, pp. 657-662.
