TOPIC IDENTIFICATION TECHNIQUES FOR PREDICTIVE LANGUAGE
ANALYSERS
J.I. Tait 
University of Cambridge Computer Laboratory, Corn Exchange 
St., Cambridge CB2 3QG, England.
1. Introduction
The use of prediction as the basis for inferential
analysis mechanisms for natural language has become
increasingly popular in recent years. Examples of systems which
use prediction are FRUMP (DeJong 79) and SAM (Schank 75a). The
property of interest here is that their basic mode of working is
to determine whether an input text follows one of the system's
pre-specified patterns; in other words they predict, to some
extent, the form their input texts will take. A crucial
problem for such systems is the selection of suitable sets of
predictions, or patterns, to be applied to any particular
text, and it is this problem I want to address in the paper.
I will assume that the predictions are organised into
bundles according to the topic of the texts to which they
apply. This is a generalisation of the script idea employed
by (DeJong 79) and (Schank 75a). I will call such bundles
stereotypes.
The basis of the technique described here is a distinction
between the process of suggesting possible topics of a
section of text and the process of eliminating candidate
topics (and associated predictions) which are not, in fact, 
appropriate for the text section. Those candidates which are 
not eliminated are then identified as the topics of the text 
section. (There may only be one such candidate.) This approach 
allows the use of algorithms for suggesting possible topics 
which try to ensure that if the system possesses a suitable 
stereotype for a text section it is activated, even at the 
expense of activating large numbers of irrelevant stereotypes.
This technique has been tested in a computer system 
called Scrabble. 
2. Suggesting Candidate Topics
The discovery of candidate topics for a text segment is
driven by the association of a set of patterns of semantic
primitives with each stereotype. (For the purposes of this
paper it is assumed that the system has access to a lexicon
containing entries whose semantic component is something like
that used by (Wilks 77).) As a word is input to the system
the senses of the word are examined to determine if any of
them have a semantic description which contains a pattern
associated with any of the system's stereotypes. If any do
contain such a pattern the corresponding stereotypes are
loaded into the active workspace of the system, unless they
are already active.
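As an illustration only, the suggestion step might be sketched in Python as follows. The lexicon layout, the primitive names (FOOD, FISH and so on) and the stereotype names are invented for the example; they are assumptions, not Scrabble's actual data structures, and the matching rule (a pattern matches when all of its primitives occur in a word sense) is a simplification.

```python
# Hypothetical toy data: primitive names, lexicon layout and stereotype
# names are illustrative assumptions, not Scrabble's actual structures.
LEXICON = {
    # word -> list of senses; each sense is a set of semantic primitives
    "tuna": [{"FISH", "FOOD"}],
    "shelf": [{"THING", "SUPPORT"}],
    "basket": [{"THING", "CONTAINER"}],
}

# Each stereotype is associated with a list of primitive patterns.
STEREOTYPE_PATTERNS = {
    "restaurant": [{"FOOD"}],
    "supermarket": [{"FOOD"}],
    "kitchen": [{"FOOD"}],
    "library": [{"BOOK"}],
}


def suggest(word, active):
    """Load every stereotype whose pattern occurs in some sense of `word`.

    `active` is the set of stereotype names already in the workspace;
    stereotypes that are already active are not loaded again.
    Returns the set of names newly loaded for this word.
    """
    newly_loaded = set()
    for sense in LEXICON.get(word, []):
        for name, patterns in STEREOTYPE_PATTERNS.items():
            if name in active:
                continue
            # Assumed rule: a pattern matches when all of its
            # primitives occur in the sense description.
            if any(pattern <= sense for pattern in patterns):
                active.add(name)
                newly_loaded.add(name)
    return newly_loaded
```

With an empty workspace, `suggest("tuna", active)` loads the three food-related stereotypes and leaves "library" alone; calling it again for the same word loads nothing further, since the stereotypes are already active.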
3. Eliminating Irrelevant Candidates
In parallel with the suggestion process, the predictions
of each stereotype in the active workspace are compared with
the text. In Scrabble, the sentences of the text are first
parsed into a variant of Conceptual Dependency (CD)
representation (Schank 75b) by a program described in (Cater 80).
The semantic representation scheme has been extended to include
nominal descriptions similar in power to those used by (Wilks
77). The predictions are compared with the CD representation
structures at the end of each sentence, but nothing in the
scheme described in this paper prevents its application to a
system which integrates the process of parsing with that of
determining whether or not a fragment of the text satisfies
some prediction, as is done in (DeJong 79).
It is likely that stereotypes which are not relevant to
the topic of the current text segment will have been loaded
as a result of the suggestion process. Since the cost of the
comparison of a prediction with the CD representation of a
sentence of the text is not trivial, it is important that
irrelevant stereotypes are removed from the active workspace as
rapidly as possible. The primary algorithm used by Scrabble
removes any stereotype which has failed to predict more of the
propositions in the incoming text than it has successfully
predicted. This simple algorithm has proved adequate in tests
and its simplicity also ensures that the cost of removing
irrelevant stereotypes is minimised.
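The removal rule just stated amounts to keeping two counters per active stereotype. A minimal Python sketch, assuming a hypothetical ActiveStereotype record (the class and function names are mine, not Scrabble's), might look like this:

```python
from dataclasses import dataclass


@dataclass
class ActiveStereotype:
    """Hypothetical bookkeeping record for one stereotype in the workspace."""
    name: str
    predicted: int = 0  # propositions this stereotype successfully predicted
    failed: int = 0     # propositions it failed to predict

    def record(self, was_predicted: bool) -> None:
        """Count one proposition of the incoming text."""
        if was_predicted:
            self.predicted += 1
        else:
            self.failed += 1

    def is_irrelevant(self) -> bool:
        # Removal rule from the text: strictly more failures than successes.
        return self.failed > self.predicted


def prune(workspace):
    """Return the workspace without the stereotypes judged irrelevant."""
    return [s for s in workspace if not s.is_irrelevant()]
```

Note that the comparison is strict: a stereotype with as many successes as failures stays in the workspace, which is what keeps a candidate alive through a single unpredicted sentence.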
Further processing is subsequently done to separate
stereotypes which were never appropriate for the text from
stereotypes which were useful for the analysis of some part
of the text, but are no longer useful.
4. An Example
Consider the following short text, adapted from (Charniak
78).
Jack picked a can of tuna off the shelf. He put it in
his basket. He paid for it and went home.
Assume that associated with the primitive pattern for
food the system has stereotypes for eating in a restaurant,
shopping at a supermarket, and preparing a meal in the kitchen.
The lexicon entry for tuna (a large sea fish which is
caught for food) will contain this pattern, and this will
cause the loading of the above three stereotypes into the
active workspace. The restaurant stereotype will not predict
the first sentence, and so will immediately be unloaded. Both
the supermarket and kitchen stereotypes expect sentences like
the first in the text. When the second sentence is read, the
supermarket stereotype will be expecting it (since it expects
purchases to be put into baskets), but the kitchen stereotype
will not. However the kitchen stereotype will not be unloaded
since, so far, it has predicted as many propositions as it has
failed to predict. When the third sentence is read, again the
supermarket stereotype has predicted propositions of this
form, but the kitchen stereotype has not. Therefore the kitchen
stereotype is removed from the active workspace, and the
topic of the text is firmly identified as a visit to the
supermarket.
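The walkthrough above can be replayed with a few lines of Python. This is a sketch under stated assumptions: the table of which stereotype predicts which sentence is hand-coded from the description above rather than computed from any CD representation, and each sentence is counted as a single outcome, which simplifies the proposition-level counting the text describes.

```python
# Hand-coded assumption: whether each stereotype predicts each of the
# three sentences, following the walkthrough (one outcome per sentence).
PREDICTS = {
    "restaurant":  [False, False, False],
    "supermarket": [True, True, True],
    "kitchen":     [True, False, False],
}


def trace(predicts):
    """Replay the text; return surviving topics and when others were unloaded."""
    scores = {name: [0, 0] for name in predicts}  # name -> [predicted, failed]
    unloaded_at = {}
    n_sentences = len(next(iter(predicts.values())))
    for i in range(n_sentences):
        for name in list(scores):  # copy, since entries are deleted below
            hit = predicts[name][i]
            scores[name][0 if hit else 1] += 1
            predicted, failed = scores[name]
            if failed > predicted:  # removal rule from Section 3
                del scores[name]
                unloaded_at[name] = i + 1  # 1-based sentence number
    return sorted(scores), unloaded_at


topics, unloaded_at = trace(PREDICTS)
print(topics)       # ['supermarket']
print(unloaded_at)  # {'restaurant': 1, 'kitchen': 3}
```

The kitchen stereotype survives the second sentence with one success and one failure, and is only unloaded at the third, matching the account given above.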
It should be noted that a completely realistic system
would have to perform much more complex processing to analyse 
the above example. In such a system additional stereotypes 
would probably be activated by the occurrence of the primitive 
pattern for food, and it is likely that yet more stereotypes 
would be activated by different primitive patterns in the 
lexicon entries for the words in the input text. 
5. Conclusions
The technique described in this paper for the identification
of the topic of a text section has a number of advantages
over previous schemes. First, its use of information
which will probably already be stored in the natural language
processing system's lexicon has obvious advantages over
schemes which require large, separate data-structures purely
for topic identification, as well as for making the predictions
associated with a topic. In practice, Scrabble uses a
slightly doctored lexicon to improve efficiency, but the
necessary work could be done by an automatic preprocessing
of the lexicon.
Second, the scheme described here can make use of
nominals which suggest a candidate topic, and associated
stereotypes, without complex manipulation of semantic
information which is not useful for this purpose. The scheme of
(DeJong 79), for example, would perform complex operations
on semantic representations associated with "pick" before it
processed the more useful word "tuna" if it processed the
above example text.
Third, the use of semantic primitive patterns has greater
generality than techniques which set up direct links between
words and bundles of predictions, as appeared to be done
in early versions of the SAM program (Schank 75a).
One final point. The technique for topic identification
in this paper would not be practical either if it were very
expensive to load stereotypes which turn out to be irrelevant,
or if the cost of comparing the predictions of such
stereotypes with the text representation were high. The Scrabble
system, running under Cambridge LISP on an IBM 370/165, took
8770 milliseconds to analyse the example text above, of which
756 milliseconds was used by loading and activating the two
irrelevant stereotypes and 103 milliseconds was spent
comparing their predictions with the CD representation of the text.
The system design is such that these figures would not
increase dramatically if more stereotypes were considered whilst
processing the example.
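To put the timings in proportion, the share of the run attributable to the two irrelevant stereotypes can be worked out directly from the three figures quoted above. The variable names are mine; only the numbers come from the text.

```python
total_ms = 8770      # whole analysis of the example text
loading_ms = 756     # loading and activating the two irrelevant stereotypes
comparing_ms = 103   # comparing their predictions with the CD representation

overhead_ms = loading_ms + comparing_ms
overhead_share = overhead_ms / total_ms

print(overhead_ms)              # 859
print(f"{overhead_share:.1%}")  # 9.8%
```

On these figures, roughly a tenth of the run was spent on candidates that were later discarded.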

References 

(Cater 80) 
Cater, A.W.S. Analysing English Text: A Non-deterministic
Approach with Limited Memory. AISB-80 Conference
Proceedings. Society for the Study of Artificial
Intelligence and the Simulation of Behaviour. July 1980.

(Charniak 78)
Charniak, E. With Spoon in Hand this must be the Eating
Frame. TINLAP-2, 1978.

(DeJong 79)
DeJong, G.F. Skimming Stories in Real Time: an Experiment
in Integrated Understanding. Research Report No.
158. Yale University Department of Computer Science,
New Haven, Connecticut. May 1979.

(Schank 75a)
Schank, R.C. and the Yale A.I. Project. SAM -- A Story
Understander. Research Report No. 43. Yale University
Department of Computer Science, New Haven, Connecticut.
1975.

(Schank 75b)
Schank, R.C. Conceptual Information Processing.
North-Holland, Amsterdam. 1975.

(Wilks 77) 
Wilks, Y.A. Good and Bad Arguments about Semantic
Primitives. Communication and Cognition, 10. 1977.
