AUTOMATIC ANALYSIS OF DESCRIPTIVE TEXTS 
James R. Cowls 
Computer Centre 
University of Strathclyde, 
Royal College, George Street, 
Glasgow, GI IXI~. SCOTLAND 
ABSTRACT 
This paper describes a system that attempts 
to interpret descriptive texts without the uJe of 
complex grammars. The purpose of the system is to 
transform the descriptions to a standard form 
which may be used as the basis of a database sys- 
tem knowledgeable in the subject matter of the 
teXt. 
The texts currently used are wild plant 
descriptions taken directly from a popular book 
on the subject. Properties such as size, shape 
and colour are abstracted from the descriptions 
and related to parts of the plant in which we are 
interested. The resulting output is a standar- 
dined hierarchical structure holding only signi- 
ficant features of the description. 
The system, implemented in the PROLOG pro- 
gramming language, uses keywords co identify 
the way segments of the text relate to the object 
described. Information on words is held in a 
keyword list of nouns relating to parts of the 
object described. A dictionary contains the at- 
tributes of ordinary words used by the system to 
analyse the text. The text is divided into seE" 
ments using information provided by conjunctions 
and punctuation. 
About half the texts processed are correct- 
ly analysed at present. Proposals are made for 
future work to improve this figure. There seems 
Co be no inherent reason why the technique cannot 
be generalised so chac any text of seml-standard 
descriptions can be automatically converted to a 
canonical form. 
I INTRODUCTION 
A lot of useful information, covering many 
subject areas, is presently available in printed 
form in catalogues, directories and guides. Good" 
examples are plants in "Collins Pocket Guide to 
Wild Flowers", aeroplanes in "Jane's All the 
World's Aircraft" and people in '~ho's Who". Be- 
cause chls informaClon is represented in a styl- 
ised form, it is amenable CO machine processing 
Co abstract salient details concerning the entity 
being described. The research described here is 
part of a long term project to develop a system 
which can "read" descriptive text and so become 
an expert on the -~terial which has been read. 
The first stage of this research is to es- 
tablish that it is indeed possible co abstract 
useful information from descriptive text and we 
have chosen as a typical example a text consist- 
ing of descriptions of wild plants. Our system 
reade this text and generates a formal canonical 
plant description. Ultimately this will be in- 
put to a knowledse-baeed system which will then 
be able to answer questions on wild plants. 
The paper gives a limited overview of the 
recent work in text analysis in order to estab- 
lish a context for the approach we adopt. An 
outline of the operation of the system is then 
nadeo 
The analysis of our text proceeds in four 
separate stages and these are considered in con- 
Junction with a sample text. The first stage at- 
Caches to each word in the text attributes which 
are held in either a keyword llst or the system 
dictionary. This expanded text is then split up 
using conjunctions, punctuation marks and the 
keywords in the text to assign each segment of 
the text to a particular part of the plant. The 
chard stage gathers up the descriptions for a 
particular part and abstracts properties from 
them. The final operation formats the output as 
required. 
We then look at the more detailed operation 
of the system in terms of specific parts of £n- 
teresto This covers the dictionary, skeleton 
structures, text splitting, text analysis and the 
limited word guessing attempted by the system. 
Future developments are then considered. In 
particular the possibility of generalising the 
system to handle ocher topics. The actual imple- 
mentation of the system and the use of FROLOG are 
examined and we conclude with some notes on the 
current ucillty of our system. 
II BACKGROUND 
Many research workers are interested in 
different aspects of text analysis. Much of the 
I17 
emphasis of this work depends on the use of so- 
phisticated grammars to map to the internal 
representation. The work done by Schank (197~) 
and that of Sager (1981) are two contrasting ex- 
amples of this interest. In addition to the 
research oriented work, some commercial groups 
are interested in the practicability of generat- 
ing database input from text. 
its properties in a slnEle piece of text. The 
basic properties we are lookin 8 for - shape, 
colour, size - are all described by words wi~h a 
direct physical relation or with a simple men~al 
association. What we are really trying to do is 
tidy the description into a set of suitable noun 
phrases. 
Although the internal details of the vari- 
ous systems are totally different the final 
result is some form of layout, script or struc- 
ture which has been filled out with details from 
the text. The approach of the various groups can 
be contrasted according to how much of the text 
is preserved at this point and how much addition- 
al detail has been added by the system. DeJong 
(1979) processes newswire stories and once the 
key elements have been found the rest of the text 
is abandoned. Sager makes the whole text fit into 
the layout as here small details ~ay be of vital 
importance to the end user of the processed text. 
Schank in his story understanding programs may 
actually end up with more information than the 
original text, supplied from the system's own 
world knowledge. 
The other contrasting factor is the degree 
of limitation of the domain of interest of the 
text processors. The more a system has been 
designed with a practical end in view, the more 
limited the domain. Schank is operatlng at the 
level of general language understanding. DeJong 
is limiting this to the task of news recognition 
and abstraction, but only certain stories are 
handled by the system. Sager has reduced the 
range still further to a particular type of medi- 
cal diagnoses. 
Very recent work appears to be approaching 
text understanding from a word oriented 
viewpoint. Each word has associated with it 
processes which drive the analysis of the text 
(S~ail, 198\[). We have also been encouraged in 
our own approach by Kelly and Stone's (1979) work 
on word dlsamblguation. The implication of which 
seems to be that word driven rules can resolve 
ambiguities of meaning in a local context. 
Our own case is a purely practical attempt 
to generate large amounts of database building 
information from single topic texts. It should 
not be assumed however that a truly comprehensive 
syntax for a descriptive text would be simpler 
than for other types. The reverse may be true 
and the author of the descriptions may attempt to 
liven up his work with asides, unusual word- 
orders and additional atmospheric details. 
Our system does not use sophisticated gram- 
marital techniques. It is our contention that in 
the domain of descriptive texts we can make cer- 
tain assumptions about the way the descriptive 
data is handled. These allow very crude parsing 
to be sufficient in most cases. 
Similarly the semantic structures involved 
are simple. A description of an object consisting 
of several parts usually mentions the part and 
III OUTLINE OF TME SYSTEM 
The text analysis system has been con- 
structed on the assumption that much of the in- 
formation held in descriptive texts can be ex- 
tracted using very simple rules. These rules are 
analogous to the "sketchy syntax" suggested by 
Kelly and Stone and operate on the text on a lo- 
cal rather than a global basis. 
At the time of writing our system processes 
plant descriptions, in search of ten properties 
which we consider distinctive. Examples of these 
properties are the size of the plant, the colour 
of its flowers and the shape of its flowers. New 
properties can be added simply by extending the 
skeleton plant description. 
Example I. A Sample Analysis 
SMALL BUGLOS$. 
An erect bristly annual, up to a fooC high, with 
wavy lanceolate leaves and small blue flowers 
which are the only ones of their family to have 
their corolla-tube kinked at the base; calyx with 
lanceolate teeth, hardly enlarging but much 
exceeding the fruit. Rabitat: Widespread and lo- 
cally frequent in open spaces on light soils. 
April onwards. 
TOPIC COMPONENT PROPERTY PROPERTY 
PARTS NAMES VALUES 
plant 
general 
name small bugloss 
size a foot high 
flower 
leaf 
colour blue 
shape noinfo 
size small 
shape wavy lanceolate 
size noinfo 
colour noinfo 
habitat 
geog-location widespread 
season april onwards 
Figurl t. System Outline 
dictionary 
expanded 
text 
'a . 
sesmenced 
~exc 
plant 
structure 
final 
output 
outline 
structure 
property 
rules 
I19 
The texts being processed are plant 
descriptions as found in McCllntock and Fitter 
(1974). The system has been built to handle 
this topic and it attempts to fill out various 
properties for selected parts of a plant. A 
skeleton description is used to drive the pro- 
cessing of the text. This indicates the parts of 
the plant of interest and the properties required 
for each part. 
The structure which we presently use is 
shown in Example I after it has been filled out 
by processing the accompanying description. It 
should be noted that if the system cannot find a 
property then the null property "nolnfo' is re- 
turned. 
An outline of how a description is pro- 
cessed by the system and converted to canonical 
form is given in Figure i. There are four dis- 
tinct stages in the transformation of the text. 
each with an attached keyword. This keyword In- 
dentlfles the text as describing a particular 
part of the plant. Text segments are gathered 
together for a particular keyword. This may pull 
together text from separate parts of the original 
description 
This new unit of text is then examined to 
see if any of the words or phrases in it satisfy 
the specific property rules required for this 
part of the plant. If found the phrases are in- 
serted into appropriate parts of the structure. 
D. Formatter. 
The ultimate output of the system is in- 
tended as input to a relational database system 
developed at the University of Strathclyde. At 
the moment the structure is displayed in a form 
that allows checking of the system performance. 
A. Dictionary processor. 
The raw text is read in and each word in 
the text is checked in a dlctionary/keyword llst. 
Each dictionary entry has an associated list of 
attributes describing both syntactic and semantic 
attributes of that word. These attributes are 
looked at in more detail in section IV. If a 
word in the text appears in the dictionary it is 
supplemented with an attribute llst abstracted 
from the dictionary. 
The keywords for a text depend on which 
parts of the object we are interested. Thus for a 
plant we need to include all possible variants of 
flower (floret, bud) and of leaf (leaflet) and so 
on. Fortunately this is not a large number of 
words and they can be easily acquired from a 
thesaurus. 
The output from this stage is a llst of 
words and attached to each word is a llst of the 
attributes of this word. 
8. Text splitting. 
The expanded text ks then burst into seg- 
ments associated with each keyword. We identify 
segments by using "pivotal points" in the text. 
Pivotal points are pronouns, conjuntlons, prepo- 
sitions and punctuation marks. This is the sim- 
plifying assumption which we make which allows 
us to avoid detailed grammars. The actual words 
and punctuation marks chosen to split the text 
are critical to the success of this method. It 
may be necessary to change these for texts by a 
different author as each author's usage of punc- 
tuation is fairly Idiosynchratic. Within a given 
work however fairly consistent results are ob- 
tained. The actual splitting of the text is 
covered more fully An section IV C. 
C. Text analysis. 
We now have many small segments of text 
IV SYSTEM DETAILS 
A. ThE Dictionary 
The dictionary is the source of the mean- 
ings of words used during the search for proper- 
ties. Two other word sources are incorporated in 
the system, a llst of keywords which is specific 
to the subject being described and a list of 
words which may be used to split the text. This 
second list could probably be incorporated in the 
dictionary, but we have avoided this until the 
system has been generallsed to handle other types 
of text. 
The dictionary entry for each word consists 
of three lists of attributes. The first contains 
it's part of speech, a flag indicating the word 
carries no semantic information and some addi- 
tional attributes to control processing. For ex- 
ample the attribute "take-next" indicates that if 
a property rule is already satisfied when this 
word is reached in the text then the next word 
should be attached to the property phrase already 
found. Thus the word "-" carries this property 
and pulls in a successive word. 
The second llst contains attributes whose 
meaning would appear to be expressible as a phy- 
sical measure of some kind:- "touch-roughness", 
"vision-intenslty". Many of the words used in 
descriptions can be adequately categorised by a 
single attribute of this type. Thus the word red 
is an "adjective" with a physical property 
"vlslon-colour". 
The third contains those which require phy- 
sical measures to be mapped and compared to 
internal representations or which deal with the 
manipulation of internal representations alone:- 
"form-shape", "context-location". Words using 
these attributes generally tend to be more com- 
plex and may have multiple attributes. Thus the 
word field has as attributes "context-location" 
120 
and "relaclonshlp-multlple-example" whereas the 
word Scotland also carries "context-location" but 
is qualified by "relatlonship-single-example". 
We realize this cLtvis£on is delimited by an 
extremely fuzzy border, but when the search for a 
basis for word definition was made chls helped 
the intuitive allocation of attributes. Sixty 
five different attributes have been allocated. 
Only sixteen of these are used in the rules for 
our current list of properties. 
The size of the dictionary has been consid- 
erably reduced by including the algorithm, given 
by Kelly and Stone (1979), for suffix removal in 
the lookup process. 
B. Skeleton Structure 
The structure we wish to fill ouC is mapped 
directly to a hierarchical PROLOG structure with 
the uninstantiated variables, shown in the struc- 
ture in capital letters, indicating where pieces 
of text are required. The PROLOC system fills in 
these variables at run time with the appropriate 
words from the text. Each variable in a complet- 
ed structure should hold a llst of words which 
describe that particular property. Thus a partial 
plant structure is defined as:- 
plant( 
general( 
size(Gl), 
name(G2), ), 
flower( 
colour(Fl), 
shape(F2), 
), 
). 
This skeleton is accompanied by a set of 
keyword lists. Each llst being associated with 
one of the first levels of the structure. Thus a 
partial I/st for •flower" ~/ght be:- 
keyword(flower,l). 
keyword(bud,l). 
keyword(pecal,l). 
keyword(floret,l). 
The number indicates which item on the 
first level of the structure is associated with 
these keywords. 
another, 
We assume initially that we are describing 
the general details of the plant, so the text 
read up to the first pivotal poin~ belongs to 
that part of our structure, keyword level O. 
Each subsequent piece of text found assigns to 
the same keyword until a piece of text is found 
containing • new keyword. This becomes the 
current keyword and following pieces of cex~ be- 
long to this kayword until yec another keyword is 
found. 
D. Propert 7 gules 
We now gather together the pieces of text 
for a part of the structure and look for proper- 
ties as defined An the skeleton structure. A 
property search is carried out for each of the 
property names found at level two of the strut- 
cure. The property rules have the general form:- 
See "property" to NO 
repeat( 
examine attribuces of next word 
if(suitable modifier attributes) 
then keep word 
i~(sultable property attributes) 
then keep word and set "property' to YES 
if(no suitable attributes and "propercy'Is NO) 
then throw away any words kept so far 
if(no suitable attributes and 'property 
then exlc repeat 
if(no more words) 
then exit repeat 
} 
if('property" is YES) then return words kept 
if('property" is NO) then return "nolnfo'. 
• is YES) 
C. Text Splitting 
The fundamental assumption we make for 
descriptions of objects is that the part 
described will be mentioned within the piece of 
cexc referring to ic. Thus conjunctions and punc- 
tuation marks are taken to flag pivotal points lo 
the text where attention shifts from one part to 
121 
E. Special Purpose Rules 
We are trying to avoid rules specifically 
associated with layout which would need redeflnl- 
tion for different texts. However the system does 
assume a certain ordering in the initial title of 
the descriptions. Thus the name of the plant is 
any adjectives followed by a word or words not in 
the dictionary. It is intended to add rules to 
detect the Latin specific name of the plant. We 
have excluded these from our current texts. 
These will in all probability be based on a 
similar rule of ignorance, reinforced by some 
knowledge of permissible suffices. 
F. Specially Recosnised Words 
Certain words are identified in the dlc- 
tlonary by the attributes "take-next" and "take- 
previous". They imply that if a property rule is 
satisfied at the time that word is processed then 
the successor or predecessor of that word and the 
word itself should be included in the property. 
The principal use of this occurs in hyphenated 
words. These are treated as three words; wordl, 
hyphen, word2. The hyphen carries both "take- 
next" and "take-previous" attributes. This often 
allows attachment of unknown words in a property 
phrase. Thus "chocolate-brown" would be recog- 
nlsed as a colour phrase despite the fact that 
the word chocolate is not included in the dic- 
tionary. 
Words which actually name the property be- 
ing sought after carry a "take-previous" a~tri- 
bute. Thus "coloured" when found will pull in the 
previous word e.g. "butter colour" although the 
word butter may be unknown or have no specific 
dictionary attribute recognised by the rule. 
particular, we intend to provide a user interface 
to allow the system to be modified for a specific 
topic by user definitions and examples. 
The potential also exists for mapping from our 
word based internal representation to a more 
abstract machine manipulable form. This may be 
the most interesting direction in which the work 
will lead. 
¥I I~LEMENTAT ION 
The code for the system is written in PRO- 
LOG (Clocksln and Mellish, 1981) as implemented 
on the Edinburgh Multi Access System (Byrd,1981). 
This is a standard implementation of the 
language, with the single enhancement of a second 
internal database which is accessed using a hash- 
ing algorithm rather than a linear search. This 
has been used to improve the efficiency of the 
dictionary search procedures. 
PROLOG was chosen as an implementation 
language mainly because of the ease of manipula- 
tion of structures, lists and rules. The skeleton 
plant and keyword lists are held as facts in the 
PROLOG database. The implementation of the suf- 
fix stripping algorithm is a good example of the 
ease of expressing algorithms in PROLOG. The map- 
ping from the original to our code being almost 
one to one. 
V FURTHER DEVELOPMENTS 
In the short term, the size of the diction- 
ary and the rules built into the system must be 
increased so that a higher proportion of descrip- 
tions are correctly processed. Another problem 
which we must handle is the use of qualifiers 
referring to previous descriptions e.g. 'darker 
green" or "much less hairy than the last 
species'. We intend to tackle this problem by 
merging the current canonical description with 
that of plants referred to previously 
It would appear from work that has been 
carried out on dictionary analysis (Amsler, 1981) 
that a less intuitive method of word meaning 
categorization may be available. If it proves 
possible to ~ap from a standard dictionary to our 
set of attributes or some related set then the 
rigour of out internal dictionary would be signi- 
ficantly improved and a major area of repetitive 
work might be removed from the system. 
It is also intended to extend the suffix 
algorithm to handle prefixes and to convert the 
part of speech attribute according to the 
transformations carried out on the word. This 
has not proved important to us up to the present 
but future uses of the dictionary may depend on 
its being handled correctly. 
In the longer term we intend to generallse 
the system to cope with other topic areas. In 
In addition the implementation on EMAS al- 
lows large PROLOG programs to be run. The inter- 
pretive nature of the language also means that 
trace debugging facilities are available and new 
pieces of code can be easily incorporated into 
the system. 
Vll CONCLUSIONS 
Initial indications suggest that for about 
50% of descriptions, all ten properties are 
correctly evaluated and for about 30%, 8 or 9 
properties are correct. The remaining 20% are 
unacceptable as less than 8 properties are 
correctly determined by the system. 
We anticipate that increasing the knowledge base 
of the system will significantly increase its ac- 
curacy. 
The very primitive "sketchy syntax" ap- 
proach appears to offer practical solutions in 
analysing descriptive texts. Furthermore, there 
seems to be no intrinsic reason why a similar 
method could not be used to analyse temporal or 
causal structures. There will always be segments 
of text that the system cannot cope with and to 
achieve a greater degree of accuracy we will need 
to allow the system to consult with the user in 
resolving difficult pieces of text. 
122 
The structured nature of the system output 
allows the possibility of building a complex da- 
tabase system. A dace base system based on the 
raw text alone has no ability to dlsclngulsh to 
which part of an object any property belongs as 
its searches are made on the basis of keywords 
alone without caking contextual information into 
accott~t. 
VIII ACKNOWLEDGEMENTS 
I would llke co thank the director of the 
Computer Centre Mr. Grant Fraser for making 
available time co carry ouc thls work and my su- 
pervlsor Dr. fan $o--.-rvllle for his help in the 
development of the system and in the writing of 
thls paper. 

REFERENCES 

Amsler, Robert A. A Taxonomy for Engllsh Nouns 
and Verbs. Proc. 19th Annual ACL, 1981, 133-138. 

Byrd, Lawrence, ed. A User's Guide co EMAS PRO- 
LOG. D.A.Z. Occasional Paper No. 26. Departmnc 
of A.I. Edinburgh Unlverslty, 1981. 

Clocksln, William F. and Christopher $. Hellish. 
Programalng in PROLOG. Heidelberg: Springer- 
Verlag, 1961. 

DeJong, Gerald F. Sklmalns Stories In Real Time. 
Research Report 158. Department of Computer Scl- 
ence, Yale University, 1979. 

Kelly E. and P. Stone. 
Ensllsh Word Senses. 
1979. 
Computer Reco~nltlon o~f 
Amsterdam: North-Holland, 

McCllncock, David and R.S.R. Fitter. The Collins 
Pocket Guide to Wlld Flowers. London: Collins, 
1975. 

Sager, Naomi. Natural Language Information Pro- 
cessln~. Reading, .~ass.: Addison-Wesley, 1981. 

Schank, Roger C. and Kenneth M. Colby, eds. Com- 
puter Models of Thought and Language. San Fran- 
cisco: Freeman, 1973. 
