TOPIC Essentials* 
Udo Hahn / Ulrich Reimer 
Universitaet Konstanz 
Informationswissenschaft 
Postfach 5560
D-7750 Konstanz, F.R.G. 
Abstract 
An overview of TOPIC is provided, a knowledge-based 
text information system for the analysis of German- 
language texts. TOPIC supplies text condensates 
(summaries) on variable degrees of generality and 
makes available facts acquired from the texts. The 
presentation focuses on the major methodological 
principles underlying the design of TOPIC: a frame 
representation model that incorporates various integ- 
rity constraints, text parsing with focus on text 
cohesion and text coherence properties of expository 
texts, a lexically distributed semantic text grammar
in the format of word experts, a model of partial 
text parsing, and text graphs as appropriate repre- 
sentation structures for text condensates. 
1. Introduction
This paper provides an overview of TOPIC, a text 
understanding and text condensation system which 
analyzes German-language texts: complete magazine 
articles in the domain of information technology
products. TOPIC performs the following functions: 
* Text summarization (abstracting)
TOPIC produces a graph representation of the most 
relevant topics dealt with in a text. This summary 
is derived from text representation structures and 
its level of generality varies from quite generic 
descriptions (similar to a system of index terms) 
to rather detailed information concerning facts, 
newly acquired concepts and their properties. Due 
to the flexibility inherent to this cascaded 
approach to text summarization (cf. KUHLEN 84) we 
refer to it as text condensation. This is opposed 
to invariant forms of text summarization based on 
summary schemata (DeJONG 79, TAIT 82) or struc- 
tural features of the text representations (TAYLOR 
74, LEHNERT 81), and dynamic abstracting proce- 
dures which depend on a priori specifications of 
appropriate parameters (FUM et al. 82) or rule
sets for importance evaluation (FUM et al. 85)
prior to text analysis. 
* Extraction of facts / acquisition of new concepts 
Knowledge extraction resulting from text analysis 
not only leads TOPIC to the assignment of specific 
properties to concepts already known to the sys- 
tem, but also comprises the acquisition of new 
concepts and corresponding properties. 
* Linking thematic descriptions with text passages
TOPIC's analytic devices are by no means exhaus- 
tive to capture all the knowledge encoded in a 
text. Thus, the text representation structures 
provided might be incomplete. However, the thematic
descriptions generated are linked to the corresponding
text passages so that querying a text
knowledge base may end up in the retrieval of
relevant fragments of the original text (cf.
similar approaches in LOEF 80, HOBBS et al. 82).
* The development of the TOPIC system is supported by
BMFT/GID under contract 1020016 0. We want to thank
D. Soergel for his contributions to this paper.
To perform these functions, the design of TOPIC is 
based on the following methodological principles: 
* a method for making strategic decisions to control 
the depth of text understanding according to the 
functional level of system performance desired 
* a knowledge representation model whose expressive 
power primarily comes from various integrity constraints
which control the validity of the knowledge
representation structures during text analysis
* a parsing model adapted to the specific construc- 
tive requirements of expository prose (local text 
cohesion and global text coherence phenomena) 
* a text condensation model based on empirical well- 
formedness conditions on texts (text grammatical 
macro rules) and criteria derived from the knowl- 
edge representation model (complex operations) 
2. Methodological Principles of Text Analysis Under- 
lying the TOPIC Text Condensation System 
Partial Text Parsing 
The current version of TOPIC acts as a shallow under- 
stander of the original text (cf. the approach to 
"integrated parsing" in SCHANK et al. 80). It concen- 
trates on the thematic foci of texts and significant 
facts related to them and thus establishes an indica- 
tive level of text understanding. Partial parsing is 
realized by restricting the text analysis to taxo- 
nomic knowledge representation structures and by 
providing only those limited amounts of linguistic 
specifications which are needed for a text parser 
with respect to a taxonomic representation level.
Primarily, the concepts which are available in the 
knowledge base correspond to nouns or nominal groups 
and their attributes (adjectives, numerical values). 
A Frame Representation Model that Incorporates 
Various Integrity Constraints 
The world knowledge underlying text analysis is 
represented by means of a frame representation model 
[REIMER/HAHN 85]. The large degree of schematization
inherent to frame representations provides knowledge 
of the immediate semantic context of a concept (lexi- 
cal cohesion). Additionally supplied integrity con- 
straints formally restrict the execution of various 
operations (e.g. property assignment to support 
knowledge extraction from a text) in order to keep 
the knowledge base valid. 
Text Parsing with Focus on Text Cohesion and Text 
Coherence Patterns 
Text linguists seriously argue that texts constitute 
an object to be modeled differently from sentences in 
isolation. This is due to the occurrence of phenomena 
which establish textuality above the sentence level.
A common distinction is made between local text 
cohesion for immediate connectivity among adjacent 
text items (due to anaphora, lexical cohesion, 
co-ordination, etc.; see HALLIDAY/HASAN 76) and the 
thematic organization in terms of text coherence 
which primarily concerns the global structuring of 
texts according to pragmatic well-formedness con- 
straints. Instances of global text structuring 
through text coherence phenomena are given by regular 
patterns of thematic progression in a text [DANES
74], or by various additional functional coherence
relations, such as contrast, generalization, explanation,
compatibility [REICHMAN 78, HOBBS 83]. Disregarding
textual cohesion and coherence structures
will inevitably result in invalid (text cohesion) and 
understructured (text coherence) text knowledge 
comparable to mere sentence level accumulations of 
knowledge structures which completely lack indicators 
of text structure. Therefore, there should be no 
question that specially tuned text grammars are 
needed. Unfortunately, the overwhelming majority of 
grammar/parser specifications currently available is 
unable to provide broad coverage of textual phenomena 
on the level of text cohesion and coherence, so that 
the format of text grammars and corresponding parsing 
devices is still far from being settled. 
A Lexically Distributed Semantic Text Grammar 
Since major linguistic processes provide textual 
cohesion by immediate reference to conceptual struc- 
tures of the world knowledge, and since many of the 
text coherence relations can be attributed to these 
semantic sources, a semantic approach to text parsing 
has been adopted which primarily incorporates the 
conceptual constraints inherent to the domain of 
discourse as well as structural properties of the 
text class considered (for an opposite view of text 
parsing, primarily based on syntactic considerations, 
cf. POLANYI/SCHA 84). Thus, the results of a text
parse are knowledge structures in terms of the frame
representation model, i.e. valid extensions of the 
semantic representation of the applicational domain
in terms of text-specific knowledge. 
Text parsing, although crucially depending on semantic
knowledge, demands that additional knowledge
sources (focus indications, parsing memory, etc.) be 
accessible without delay. This can best be achieved 
by highly modularized grammatical processes (actors) 
which take over/give up control and communicate with 
each other and with the knowledge sources mentioned 
above. Since the semantic foundation of text under- 
standing is most evidently reflected by the interac- 
tion of the senses of the various lexical items that 
make up a text, these modular elements themselves 
provide the most natural point of departure to 
propose a lexical distribution of grammatical 
knowledge [HAHN 86] when deciding on the linguistic
organization of a semantic text grammar (ALTERMAN 85 
argues in a similar vein). 
Text Graphs as Representation Structures for Text 
Condensates 
Knowledge representation structures built up during 
text parsing are submitted to a condensation process 
which transforms them into a condensate representation
on different levels of thematic specialization
or explicitness. The structure resulting from
corresponding complex operations is a text graph (its 
visualized form resembles an idea first introduced by 
STRONG 74). It is a hypergraph which is composed of
* leaf nodes each of which contains a semantic net 
that indicates the topic description of a themati- 
cally coherent text passage 
* the text passages that correspond to these topic 
descriptions 
* the higher-order nodes which comprise generalized 
topic descriptions 
From this condensate representation of a text access 
can also be provided to the factual knowledge ac- 
quired during text analysis. TOPIC does not include 
natural language text generation devices since the 
retrieval interface to TOPIC, TOPOGRAPHIC
[HAMMWOEHNER/THIEL 84], is exclusively based on an
interactive-graphical access mode.
3. An Outline of the Text Model 
Despite the apparent diversity of linguistic phenome- 
na occurring in expository texts, a large degree of 
the corresponding variety can be attributed to two 
basic processes (cf. HALLIDAY/HASAN 76): various 
forms of anaphora (and cataphora), and processes 
incorporating lexical cohesion. Both serve as basic 
text cohesion preserving mechanisms on the local 
level of textuality. Their repeated application 
yields global text coherence patterns which either 
follow the principle of constant theme, linear 
thematization of rhemes, or derived themes (see DANES 
74). In Fig 1 we give a fairly abstracted presentation
of these coherence patterns which should be
considered together with the linguistic examples 
provided in Fig 2 and their graphical reinterpreta- 
tion in Fig 3. The notions of frames, slots, and slot 
entries occurring in Fig 1 correspond to concepts of 
the world knowledge, their property domains, and 
associated properties, which may be frames again. 
I Constant Theme
  a single frame with its slots, up to < slot n >
II Linear Thematization of Rhemes
  a chain of frames in which a slot entry of one frame
  is itself a frame
III Derived Themes
  a superordinate frame (frame_sup) with subordinate frames
  frame_sub1 ... frame_subk ... frame_subm, each with its
  own slots
Fig 1: Basic Patterns of Thematic Progression
The interpretation of coherence patterns as given in 
Fig 1 refers to two kinds of knowledge structures:
* concept specialization corresponds to the phenomena 
of anaphora 
* aggregation of slots to frames corresponds to the 
phenomena of lexical cohesion 
This tight coupling of text linking processes and
representation structures of the underlying world 
knowledge strongly supports the hypothesis that text 
understanding is basically a semantic process which, 
as a consequence, requires a semantic text parser. 
A linguistic illustration of the coherence patterns 
introduced above is given by the following text 
passages. For convenience, the examples in this paper 
are in English, although the TOPIC system deals with 
German textual input only. 
I Constant Theme 
The PC2000 is equipped with an 8086 cpu as
opposed to the 8088 of the previous model. The 
standard amount of dynamic RAM is 256K bytes. 
One of the two RS-232C ports also serves as a 
higher-speed RS-422 port. 
II Linear Thematization of Rhemes 
A floppy disk drive by StorComp is available 
which holds around 710K bytes. Also available 
by StorComp is a hard disk drive which provides 
20M bytes of mass storage. 
III Derived Themes
Compared to the FS-190 by DP-Products which
comes with Concurrent CP/M the PC2000 runs UNIX
just like the new UNPC by PCP Inc.
Fig 2: Linguistic Examples of the Basic Patterns of 
Thematic Progression 
Fig 3 shows an interpretation of the text passages of 
Fig 2 in terms of thematic progression patterns. 
I Constant Theme (PC2000)
  PC2000
    < cpu > 8086
    < main memory > RAM
      < size > 256K bytes
    < port > RS-232C, RS-422
II Linear Thematization of Rhemes (disk drives
from StorComp)
  PC2000
    < mass storage > floppy disk drive
      < size > 710K bytes
      < manufacturer > StorComp
        < product > hard disk drive
          < size > 20M bytes
III Derived Themes (personal computers)
  personal computer
    FS-190
      < manufacturer > DP-Products
      < operating system > Concurrent CP/M
    PC2000
      < operating system > UNIX
    UNPC
      < manufacturer > PCP Inc.
      < operating system > UNIX
Fig 3: Interpretation of the Text Passages of Fig 2 
in Terms of Thematic Progression Patterns 
4. The Process of Text Parsing 
TOPIC is a knowledge-based system with focus on se- 
mantic parsing. Accordingly, incoming text is direct- 
ly mapped onto the frame representation structures of 
the system's predefined world knowledge without 
considering in-depth intermediate linguistic descriptions.
Basically, these mappings perform continuous
activations of frames and slots in order to provide 
operational indicators for text summarization. 
Together with slot filling procedures they build up 
the thematic structure of the text under analysis in 
the system's world knowledge base. To account for 
linguistic phenomena these concept activation and 
property assignment processes are controlled by a set 
of decision procedures which test for certain struc- 
tural patterns in the world knowledge and the text to 
occur. Consequently, TOPIC's text parser consists of 
two main components: the world knowledge which 
provides the means of correctly associating concepts 
with each other (see sec.4.1) and the decision proce- 
dures (word experts) which utilize this foreknowledge 
to relate the concepts that are actually referred to 
by lexical items in a text, thus determining the 
patterns of thematic progression (see sec.4.2).
4.1 Representation of World Knowledge by a Frame 
Representation Model 
Knowledge of the underlying domain of discourse is 
provided through a frame representation model 
[REIMER/HAHN 85] which supports relationally connected
frames. A frame can be considered as providing
highly stereotyped and pre-structured pieces of knowl- 
edge about the corresponding concept of the world. It 
describes the semantic context of a concept by asso- 
ciating slots to it which either refer to semanti- 
cally closely related frames or which simply describe 
basic properties. A slot may roughly be considered as
a property domain while actual properties of a frame 
are represented by entries in these slots (Fig 4). An 
entry may only be assigned to a slot if it is 
declared as being a permitted entry (see below). 
Schema:
  frame
    < slot 1 > { permitted entry 11, ... }
      slot entry 11 ... slot entry 1r
    ...
    < slot n > { permitted entry n1, ... }
      slot entry n1 ... slot entry ns
Example:
  PC2000
    < cpu > 8086
    < main memory > RAM-1
      < size > 256K bytes
    < port > RS-232C, RS-422
    < mass storage >
      hard disk drive-1
        < size > 20M bytes
      floppy disk drive-1
        < size > 710K bytes
Fig 4: Examples of Frames, Slots and Slot Entries 
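The constraint that an entry may only be assigned to a slot if it is declared as a permitted entry can be sketched as follows. This is a minimal illustration and not TOPIC's implementation; the class names `Slot` and `Frame`, the method names, and the concrete permitted sets are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """A property domain of a frame (hypothetical structure)."""
    name: str
    permitted: set                      # entries that may be assigned
    entries: list = field(default_factory=list)


@dataclass
class Frame:
    """A concept of the world knowledge together with its slots."""
    name: str
    slots: dict = field(default_factory=dict)

    def add_slot(self, name, permitted):
        self.slots[name] = Slot(name, set(permitted))

    def assign(self, slot_name, entry):
        """Assign an entry only if it is declared as permitted."""
        slot = self.slots[slot_name]
        if entry not in slot.permitted:
            return False                # integrity constraint violated
        if entry not in slot.entries:
            slot.entries.append(entry)
        return True


# Illustrative permitted sets (assumed, not taken from TOPIC's knowledge base).
pc2000 = Frame("PC2000")
pc2000.add_slot("cpu", {"8086", "8088"})
pc2000.add_slot("port", {"RS-232C", "RS-422"})

ok = pc2000.assign("cpu", "8086")           # permitted entry
rejected = pc2000.assign("cpu", "RS-232C")  # not permitted for this slot
```

Rejecting the assignment instead of raising an error mirrors the idea that the knowledge base simply stays unchanged when a constraint is violated.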
Two kinds of frames are distinguished. A prototype 
frame acts as a representative of a concept class 
consisting of instance frames which all have the same 
slots but differ from the prototype in that they are 
further characterized by slot entries. Thus, instance 
frames stand for individual concepts of a domain of 
discourse. This point may be illustrated by a micro- 
processor frame which represents as a prototype the 
set of all microprocessor instances (Fig 5). 
Prototype frame (concept class):
  microprocessor
    < word length > { 4 bit, 8 bit, 16 bit, 32 bit }
    < manufacturer >
Associated instance frames (individual concepts):
  Z80
    < word length > 8 bit
    < manufacturer > Zilog
  68000
    < word length > 16 bit
    < manufacturer > Motorola
Fig 5: A Prototype and Associated Instance Frames 
Frames are connected with each other by semantic 
relations (cf. Fig 6). Concept specialization between 
prototypes (is-a relation) is of fundamental importance
to anaphora resolution. Concept specialization
between a prototype and its instances (instance-of) 
requires the instances to have the same slots as the 
prototype with the same set of permitted entries, 
resp. This property supports learning of new con- 
cepts from the text (i.e. incorporating new data in 
the knowledge base). When a new concept occurs in the 
text and it is possible to determine its concept 
class, the structural description of the new concept
is taken from the prototype that stands for the 
concept class. Indicators of what concept class a new 
concept belongs to are e.g. given by composite nouns, 
which are particularly characteristic of German 
language (8-Bit-Cpu, Sirius-Computer), attributions 
(serial interface, monochromatic display), or 
specific noun phrases (laser printer LA-9). 
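The acquisition of a new concept from its prototype can be illustrated with a short sketch. It assumes that the concept class is read off the last component of a hyphenated compound noun, which is only one of the indicators mentioned above. The function names, the dictionary layout, and the permitted sets are hypothetical.

```python
def new_instance(prototype):
    """Create an instance frame that has the prototype's slots and
    permitted entries but no slot entries yet."""
    return {
        slot: {"permitted": set(permitted), "entries": []}
        for slot, permitted in prototype.items()
    }


def concept_class_of(compound, prototypes):
    """Guess the concept class from the last component of a compound
    noun such as 'Sirius-Computer' (assumed heuristic)."""
    head = compound.split("-")[-1].lower()
    return head if head in prototypes else None


# Prototype frames: slot name -> permitted entries (illustrative values).
prototypes = {
    "computer": {
        "cpu": {"8086", "8088"},
        "mass storage": {"floppy disk drive", "hard disk drive"},
    },
}

cls = concept_class_of("Sirius-Computer", prototypes)
sirius = new_instance(prototypes[cls])
```

The new instance inherits the structural description of its concept class, so later slot filling for the new concept is subject to the same permitted entries as for any other instance.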
The semantic relation part-of is a special kind of
aggregation which expresses a particularly tight
semantic closeness.
personal computer
  < cpu >
  < mass storage >
PC2000 (instance-of personal computer)
  < cpu > 8086 (part-of)
  < main memory > RAM-1 (part-of)
  < port >
  < mass storage >
memory
  < size >
RAM (is-a memory)
  < size >
  < cycle time >
RAM-1 (instance-of RAM)
  < size > 256K bytes
  < cycle time >
8086
  < word length > 16 bit
  < manufacturer > Intel
Fig 6: Semantic Relations among Frames
While the learning of new concepts is supported by 
the distinction of prototypes and instances, the 
acquisition of new facts from the text is possible by 
utilizing knowledge about the permitted entries of a 
slot. Two cases can be distinguished which correspond 
to two slot types. Non-terminal slots are slots whose 
name is identical to the name of a frame in the knowl- 
edge base. Permitted entries for them are defined 
implicitly and are given by all those frames which 
are specializations of the frame whose name equals 
the slot name (cf. the slot "operating system" in
Fig 7). On the other hand, entries of the complementary
class of terminal slots must be specified
explicitly (cf. the slot "word length" in Fig 7).
operating system
  is-a: single-user system, multi-user system
    is-a: CP/M, ..., UNIX, VMS, ...
a frame with the non-terminal slot
  < operating system > { single-user system, multi-user system,
                         CP/M, ..., UNIX, VMS, ... }
microprocessor
  < word length > { 4bit, 8bit, 16bit, 32bit }
Fig 7: Permitted Entries for (Non-)Terminal Slots
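The difference between implicitly and explicitly defined permitted entries can be sketched as follows. The is-a table is a hand-written stand-in for the knowledge base; the assignment of CP/M, UNIX and VMS to the single-user or multi-user branch is an assumption made for the example, as are the function and variable names.

```python
# is-a links: specialized frame -> more general frame (assumed layout).
IS_A = {
    "single-user system": "operating system",
    "multi-user system": "operating system",
    "CP/M": "single-user system",
    "UNIX": "multi-user system",
    "VMS": "multi-user system",
}

# Terminal slots list their permitted entries explicitly.
TERMINAL = {"word length": {"4bit", "8bit", "16bit", "32bit"}}


def specializations(frame):
    """All direct and indirect specializations of a frame."""
    result = set()
    for child, parent in IS_A.items():
        if parent == frame:
            result.add(child)
            result |= specializations(child)
    return result


def permitted_entries(slot_name):
    """Explicit set for terminal slots; for non-terminal slots, every
    specialization of the frame whose name equals the slot name."""
    if slot_name in TERMINAL:
        return TERMINAL[slot_name]
    return specializations(slot_name)


os_entries = permitted_entries("operating system")
wl_entries = permitted_entries("word length")
```

Because the permitted entries of a non-terminal slot are computed from the specialization hierarchy, a newly acquired operating system frame would become a permitted entry without any change to the slot itself.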
Further devices for controlling slot filling are 
given by the construct of singleton slots which may 
hold at most one entry (e.g. the slots "cpu" and
"size" in Fig 4). Singleton slots are of use when
several permitted entries for a slot occur at
adjacent text positions. Only if that slot is a
singleton slot, the filling is constrained to one of 
those candidates; linguistic knowledge has to account 
for the selection of the appropriate one. Moreover, 
such a situation is interpreted as an indication of 
comparison (see Fig 2/I and the parsing effects
occurring with respect to "cpu" and the candidate
entries "8086" and "8088" in Fig 10).
Control of slot filling is also supported by an 
inferential construct called cross reference filling. 
When two frames, frame-1 and frame-2 (Fig 8), refer
to each other in such a way that each has a non-terminal
slot for which the other frame is a permitted
entry, then assigning frame-1 to the appropriate slot
of frame-2 automatically results in assigning frame-2
to the appropriate slot of frame-1. Now, if the
second slot assignment is not permitted and therefore 
blocked, the primary assignment is blocked, too. The 
following sentence gives an example (Fig 8): "Com- 
pared to the FS-190 by DP-Products the PC2000 runs 
UNIX". The concept "PC2000" is a permitted entry of 
the product slot of the manufacturer "DP-Products".
Its assignment would trigger the assignment of 
"DP-Products" in the manufacturer slot of "PC2000" 
which is a singleton slot and already occupied. 
Therefore no slot filling at all is performed. 
Schema:
  frame-1
    < slot-2 > { frame-2, ... }
  frame-2
    < slot-1 > { frame-1, ... }
Example:
  DP-Products
    < products > { PC2000, ... }
  PC2000
    < manufacturer > { PCP Inc., DP-Products, ... }
      PCP Inc.
Fig 8: Cross Reference Filling 
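Cross reference filling combined with a singleton slot can be sketched as an all-or-nothing pair of assignments. This is an illustration under simplifying assumptions, not the system's code; the class `Frame`, its methods, and the function `cross_reference_fill` are hypothetical names.

```python
class Frame:
    def __init__(self, name):
        self.name = name
        self.slots = {}   # slot name -> permitted set, singleton flag, entries

    def add_slot(self, name, permitted, singleton=False):
        self.slots[name] = {
            "permitted": set(permitted),
            "singleton": singleton,
            "entries": [],
        }

    def can_assign(self, slot, entry):
        s = self.slots[slot]
        if entry not in s["permitted"]:
            return False
        if entry in s["entries"]:
            return True                       # already present
        return not (s["singleton"] and s["entries"])


def cross_reference_fill(frame_a, slot_a, frame_b, slot_b):
    """Assign frame_b to slot_a of frame_a and frame_a to slot_b of
    frame_b; if either assignment is blocked, perform neither."""
    if not (frame_a.can_assign(slot_a, frame_b.name)
            and frame_b.can_assign(slot_b, frame_a.name)):
        return False
    for frame, slot, entry in ((frame_a, slot_a, frame_b.name),
                               (frame_b, slot_b, frame_a.name)):
        if entry not in frame.slots[slot]["entries"]:
            frame.slots[slot]["entries"].append(entry)
    return True


dp = Frame("DP-Products")
dp.add_slot("products", {"PC2000"})
pc = Frame("PC2000")
pc.add_slot("manufacturer", {"PCP Inc.", "DP-Products"}, singleton=True)
pc.slots["manufacturer"]["entries"].append("PCP Inc.")   # already occupied

blocked = cross_reference_fill(dp, "products", pc, "manufacturer")
```

Checking both assignments before performing either one keeps the knowledge base unchanged in the blocked case, which corresponds to "no slot filling at all is performed" in the example sentence.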
The structural features of the frame representation 
model are extended by activation weights attached to 
frames and slots. They serve the purpose of indicat- 
ing the frequency of reference to the corresponding 
concepts in a text and are of significant importance 
for the summarization procedures. 
Currently, TOPIC's frame knowledge base comprises 
about 120 frames, with an average of 6 slots per frame.
4.2 A Generalized Word Expert Model of Lexically 
Distributed Text Parsing 
Characterizations of what texts are about are carried 
predominantly in domain-specific keywords as designators
of contents (cf. SMETACEK/KOENIGOVA 77 for the
task domain of abstracting) - in linguistic terminology:
nominals or nominal groups. Accordingly,
TOPIC's parsing system is essentially based on a noun 
phrase grammar adapted to the requirements of text 
phenomena. Its shallow text parsing performance can 
be attributed to the exhaustive recognition of all 
relevant keywords and the semantic and thematic 
relationships holding among them. This is sufficient 
for the provision of indicative text condensates. 
Accordingly, word experts [SMALL/RIEGER 82] have been
designed which reflect the specific role of nominals 
in the process of making up connected text. The 
current section illustrates this idea through a 
discussion of a word expert for lexical cohesion (for
a more detailed account cf. HAHN 86). Together with
various forms of anaphora (not considered here,
although we refer to the effects of a corresponding
expert in Fig 10 by NA) it provides for a continuous
cohesion stream and a corresponding thematic develop- 
ment in (expository) texts. Exceptions to this basic 
rule are due to special linguistic markers in terms 
of quantifiers, connectors, etc. As a consequence, 
supplementary word experts have to be provided which 
reflect the influence these markers have on the basic 
text cohesion and text coherence processes: experts 
applying to quantifiers and comparative expressions 
typically block simple text cohesion processes (for 
an example cf. Fig 10), experts for conjunctions
trigger them, and experts referring to negation 
particles provide appropriately modified assignments 
of properties to frames. 
This kind of selective parsing is based on strategic 
considerations which, however, do not affect the 
linguistic generality of the approach at all. On the 
contrary, due to the high degree of modularization 
inherent to word expert specifications a word expert 
grammar can easily be extended to incrementally cover 
more and more linguistic phenomena. Moreover, the 
partial specifications of grammatical knowledge in 
the format of word experts lead to a highly robust 
parsing system, while full-fledged text grammars 
accounting for the whole range of propositional and 
pragmatic implications of a comprehensive understand- 
ing of texts are simply not available (not even in 
sublanguage domains). In other words, current text 
analysis systems must cope with linguistic descrip- 
tions that will reveal specification lags in the 
course of a text analysis if "realistic texts"
[RIESBECK 82] are being processed. Therefore, the text
parser carries the burden of recovering even in cases 
of severe under-specification of lexical, grammatical,
and pragmatic knowledge. Unlike question-
answering systems, this problem cannot be 
side-stepped by asking a user to rephrase unparsable 
input, since the input to text understanding systems 
is entirely fixed. Distributing knowledge over 
various interacting knowledge sources allows easy 
recovery mechanisms since the agents which are 
executable take over the initiative while those 
lacking appropriate information simply shut down.
Summing up, each of the word expert specifications 
supplied (those for nominals, quantifiers, conjunc- 
tions, etc.) is not bound to a particular lexical 
item and its idiosyncrasies, but reflects function- 
ally regular linguistic processes (anaphora, lexical 
cohesion, coordination, etc.). Accordingly, a rela- 
tively small number of general grammatical descrip- 
tions encapsulated in highly modularized communities 
of agents form the declarative base of lexically 
distributed text parsing. 
By word experts (consider the word expert prototype 
provided below) we refer to a declarative organiza- 
tion of linguistic knowledge in terms of a decision 
net whose root is assigned the name of a lexical 
class or a specific word. Appropriate occurrences of
lexical items in the text prompt the execution of 
corresponding word experts. Non-terminal nodes of a 
word expert's decision net are constructed of boolean 
expressions of query predicates or messages while its 
terminal nodes are composed of readings. With respect 
to non-terminal nodes word experts 
- query the frame knowledge base, e.g. testing for
semantic relations (e.g. is-a, instance-of) to hold,
for the existence and activation weight of concepts
in the knowledge base, or for integrity criteria
that restrict the assignment of slot entries
- investigate the current state of text analysis,
e.g. the types of operations already performed in
the knowledge base (activation, slot entry assignment,
creation of new concepts, etc.)
- consider the immediate textual environment, e.g.
testing co-occurrences of lexical items under
qualified conditions, e.g. within sentence or noun
phrase boundaries
- have message sending facilities to force direct
communication among the running experts for blocking,
canceling, or re-starting companion experts
According to the path actually taken in the decision 
net of a word expert, readings are worked out which 
either demand various actions to be performed on the 
knowledge base in order to keep it valid in terms of 
text cohesion (incrementing/decrementing activation
weights of concepts, assignment of slot entries, 
creation of new frames as specializations of already 
existing ones, etc.), or which indicate functional 
coherence relations (e.g. contrast, classificatory 
relations) and demand overlaying the knowledge base 
by the corresponding textual macro structure. 
Apparently, the basic constructs of the word expert 
model (query predicates, messages, and readings) do 
not refer to any particular domain of discourse. This 
guarantees a high degree of transportability of a 
corresponding word expert grammar. 
The word expert collection currently comprises about 
15 word expert prototypes, i.e. word experts for 
lexical classes, like frames, quantifiers, negation 
particles, etc. Word expert modules encapsulating 
knowledge common to different word experts amount to 
20 items. The word expert system is implemented in C 
and running under UNIX. Grammatical knowledge is
represented using a high-level word expert specification
language, and it is inserted and modified using
an interactive graphical word expert editor.
These principles will be illustrated by considering 
an informal specification of a word expert (a more
formal treatment is given in HAHN/REIMER 85) which accounts
for lexical cohesion that is due to relations between 
a concept and its corresponding properties. 
Fig 10 shows a sample parse of text I (Fig 2) which
gives an impression of the way text parsing is real- 
ized by word experts that incorporate the linguistic 
phenomena just mentioned. 
With respect to text summarization (cf. HAHN/REIMER 
84) it is an important point to determine the proper 
extension of the world knowledge actually considered 
in a text as well as its conceptual foci. This is 
achieved by incrementing activation weights 
associated to frames and slots whenever they are 
referred to in the text (this default activation 
process is denoted DA in Fig 10). In order to guarantee
valid activation values their assignment must be
independent from linguistic interferences. As an 
example for a process that causes illegal activation 
values consider the case of nominal anaphora which 
holds for [17] in Fig 10 (the associated word expert
NA is not considered here, cf. HAHN 86). 
Recognizing lexical cohesion phenomena contributes to
associating concepts with each other in terms of
aggregation. The word expert for lexical cohesion, an
extremely simplified version of which is given in 
Fig 9, tests if a frame refers to a slot or to an 
actual or permitted entry of a frame preceding in the 
text. In the case of a slot or of an actual entry the 
activation weight of the slot (entry) is incremented; 
in the case of a permitted entry the appropriate slot 
filling is performed, thus acquiring new knowledge
from the text. Examples of lexical cohesion processes
are given by positions [02/07], [07.1/24], [24/26],
[26.2/32], and [32.1/38] in Fig 10.
Word expert for <frame>:
  Condition: an item of the knowledge base <kb item> occurs
  in the immediate left context of <frame>, and <kb item>
  denotes an active frame
  1. <frame> denotes a slot of <kb item>?
     T: increment weight of slot <frame> in frame <kb item>
     F: go to 2
  2. <frame> denotes an actual slot value of slot <slot>
     in frame <kb item>?
     T: increment weight of slot value <frame> for slot
        <slot> in frame <kb item>
     F: go to 3
  3. <frame> denotes a permitted entry of slot <slot>
     in frame <kb item>?
     T: assign <frame> to slot <slot> of frame <kb item>
Fig 9: Word Expert for Lexical Cohesion (= LC) 
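The three tests of the lexical cohesion expert described above (slot, actual entry, permitted entry) can be sketched as a small decision procedure. This is a simplified illustration: singleton slots, blocking messages and the determination of the left context are left out, and the function name, the dictionary layout and the reading labels are assumptions.

```python
def lexical_cohesion(frame, kb_item, kb):
    """Return the reading for <frame>, given the active <kb item> that
    precedes it in the text.

    kb maps a frame name to {"slots": {slot: {"permitted": set,
    "entries": list}}, "weights": {key: int}} (hypothetical layout).
    """
    item = kb[kb_item]
    weights = item["weights"]
    # 1) <frame> names a slot of <kb item>: increment the slot weight.
    if frame in item["slots"]:
        weights[frame] = weights.get(frame, 0) + 1
        return ("increment slot", frame)
    # 2) <frame> is an actual entry of some slot: increment its weight.
    for slot, data in item["slots"].items():
        if frame in data["entries"]:
            key = (slot, frame)
            weights[key] = weights.get(key, 0) + 1
            return ("increment entry", slot)
    # 3) <frame> is a permitted entry of some slot: fill the slot.
    for slot, data in item["slots"].items():
        if frame in data["permitted"]:
            data["entries"].append(frame)
            return ("assign", slot)
    return ("no reading", None)


kb = {
    "PC2000": {
        "slots": {"cpu": {"permitted": {"8086", "8088"}, "entries": []}},
        "weights": {},
    }
}
r1 = lexical_cohesion("8086", "PC2000", kb)   # permitted entry: slot filling
r2 = lexical_cohesion("8086", "PC2000", kb)   # now an actual entry
r3 = lexical_cohesion("cpu", "PC2000", kb)    # names a slot
r4 = lexical_cohesion("UNIX", "PC2000", kb)   # unrelated concept
```

The first call acquires a new fact, while the later calls only raise activation weights, which matches the distinction between knowledge acquisition and activation drawn in the text.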
Fig 10 shows a sample parse with respect to the text
I given in Fig 2. It includes all actions taken with
respect to nominal anaphora resolution (NA) and 
lexical cohesion recognition (LC). 
[02]    PC2000       DA: 'PC2000': 0 ---> 1
[07]    8086         DA: '8086': 0 ---> 1
[07.1]               LC: PC2000 < cpu : 8086 >
[09/10] opposed to   < start blocking of LC >
[13]    8088         DA: '8088': 0 ---> 1
[13.1]               LC: PC-01 < cpu : 8088 >
[17]    model        DA: 'model': 0 ---> 1
[17.1]               NA: 'model': 1 ---> 0, 'PC-01': 1 ---> 2
                     < stop blocking of LC >
[24]    RAM          DA: 'RAM': 0 ---> 1
[24.1]               LC: PC2000 < main memory : RAM >
[26]    256K bytes
[26.1]               LC: RAM-01 < size : 256K bytes >
[26.2]               LC: PC2000 < main memory : RAM-01 >
[32]    RS-232C      DA: 'RS-232C': 0 ---> 1
[32.1]               LC: PC2000 < port : RS-232C >
[38]    RS-422       DA: 'RS-422': 0 ---> 1
[38.1]               LC: PC2000 < port : RS-232C, RS-422 >
Fig 10: Sample Parse of Text Fragment I in Fig 2
Applying the Experts LC and NA 
Some comments are in order:
1) [13.1]: The frame "8088" is not considered as an
entry of the slot <cpu> of "PC2000" since that slot
has already been assigned an entry and is a singleton
slot (cf. sec.4.1). Instead, a new instance
of a personal computer is created ("PC-01") to
which "8088" is assigned as a slot entry.
2) [24.1]: "RAM" does not refer to "PC-01" as might
be expected from the specification of LC because a
comparative expression ([09/10]) occurs in the
text. This blocks the execution of the LC expert
with respect to the noun phrase occurring
immediately after that expression.
3) [26.1/26.2]: The instance created ("RAM-01")
describes the main memory of the "PC2000". Therefore
it is assigned as an entry to "PC2000" and
readjusts the previous assignment of "RAM".
Our constructive approach to text cohesion and 
coherence provides a great amount of flexibility, 
since the identification of variable patterns of 
thematic organization of topics is solely based on 
generalized, dynamically re-combinable cohesion 
devices yielding fairly regular coherence patterns. 
This is in contrast to the analytic approach of story 
grammars [RUMELHART 75] which depend completely on
pre-defined global text structures and thus can only 
account for fairly idealized texts in static domains. 
5. Text Condensation 
During the process of text parsing, activation
patterns and patterns of property assignment (slot
filling) continuously evolve in the knowledge
base, which consequently exhibits an increasing
degree of connectivity between the frames involved
(text cohesion). If the analysis of a whole text
proceeded this way, we would finally get an
amorphous mass of activation and slot filling data in
the knowledge base without any structural
organization, although the original text does not lack
appropriate organizational indicators. To avoid
this deficiency, text parsing must recognize topic
shifts and breaks in texts in order to delimit the
extension of topics exactly and to relate different
topics properly. For this purpose, every paragraph
boundary triggers a condensation process which
determines the topic of the latest paragraph (in the
sublanguage domain we are working in, topic shifts
occur predominantly at paragraph boundaries). If its
topic description matches the topic description of
the preceding paragraph(s), both descriptions are
merged; they thus form a text passage with a coherent
thematic characterization, called a text constituent.
If the topic descriptions do not match, a new text
constituent is created. After the topic of a
paragraph has been determined, the activation weights
in the world knowledge are reset, except for a
residual activation of the frame(s) in focus. This
way the thematic characterization of a paragraph can
be determined exactly without any interference from
knowledge structures that result from parsing
preceding paragraphs.
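
The paragraph-wise control flow just described can be outlined as follows. This is a minimal sketch under assumptions: topic descriptions are reduced to sets of concept names, the matching criterion `topics_match` (any shared concept) is a placeholder for TOPIC's actual test, and the residual activation value of 1 is arbitrary.

```python
def topics_match(a, b):
    """Placeholder criterion (an assumption): two topics match if they share a concept."""
    return bool(a & b)

def condense(paragraph_topics):
    """Group consecutive paragraphs into text constituents.

    paragraph_topics holds one set of concept names per paragraph.
    """
    constituents = []
    for index, topic in enumerate(paragraph_topics):
        if constituents and topics_match(constituents[-1]["topic"], topic):
            # Same topic as the preceding paragraph(s): merge the descriptions.
            constituents[-1]["topic"] |= topic
            constituents[-1]["paragraphs"].append(index)
        else:
            # Topic shift: open a new text constituent.
            constituents.append({"topic": set(topic), "paragraphs": [index]})
    return constituents

def reset_activation(weights, focus, residual=1):
    """Reset all activation weights, keeping a residual value for frames in focus."""
    return {concept: (residual if concept in focus else 0) for concept in weights}

topics = [{"PC2000", "cpu"}, {"PC2000", "main memory"}, {"disk drive"}]
constituents = condense(topics)
weights = reset_activation({"PC2000": 5, "RAM": 2}, focus={"PC2000"})
```

Here the first two paragraphs are merged into one constituent and the third opens a new one.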
The next section presents the main ideas underlying 
the process of determining the thematic charac- 
terization of a text passage. Sec.5.2 concludes by 
giving a very concise discussion of the concept of a 
text graph which is the representational device for 
text condensates in the TOPIC system. 
5.1 Determination of Text Constituents 
The condensation process (for details cf. HAHN/REIMER
84) depends completely on the knowledge structures
generated in the course of text analysis. As outlined
above, this text knowledge consists of frame
structures which have been extended by an activation
counter associated with each concept (frame, slot, or
slot entry) to indicate the frequency of reference to
a concept in the text under analysis. These
activation weights, together with the distribution
patterns of slot filling among frames and the
connectivity patterns of frames via semantic
relations, provide the major estimation parameters
for computing text constituents (connectivity
approaches to text summarization are also described
in TAYLOR 74, LEHNERT 81).
These indicators are evaluated in a first
condensation step where the significantly salient
concepts are determined. We distinguish between
dominant frames, dominant slots and dominant clusters
of frames, the latter being represented by their
common superordinate frame (for a detailed discussion
see HAHN/REIMER 84). The determination of dominant
concepts can be viewed as a complex query operation
on a frame knowledge base. In a subsequent step the
dominant concepts are related to each other with
respect to concept specialization as well as the
frame/slot relationship. The topic of a text passage
is thus represented by a semantic net all of whose
elements are given by the dominant concepts (cf. the
nodes of the text graph in Fig 11).
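
Viewing the determination of dominant concepts as a query over activation weights, one possible reading is sketched below. The concrete criteria are assumptions made for illustration only (a frame is dominant if its weight exceeds the mean weight of all active frames; a cluster is dominant if at least two active frames share a superordinate frame); the actual criteria are given in HAHN/REIMER 84. The `is_a` table and its concept names are likewise hypothetical.

```python
from collections import Counter

def dominant_frames(activation, factor=1.0):
    """Frames whose activation weight exceeds factor * mean weight of active frames."""
    active = {frame: w for frame, w in activation.items() if w > 0}
    if not active:
        return set()
    mean = sum(active.values()) / len(active)
    return {frame for frame, w in active.items() if w > factor * mean}

def dominant_clusters(activation, is_a, min_members=2):
    """Superordinate frames with at least min_members active subordinate frames."""
    counts = Counter(
        is_a[frame] for frame, w in activation.items() if w > 0 and frame in is_a
    )
    return {parent for parent, n in counts.items() if n >= min_members}

# Illustrative activation weights and a hypothetical is-a table.
activation = {"PC2000": 6, "8086": 1, "8088": 1, "RS-232C": 1, "RS-422": 1, "RAM": 2}
is_a = {"8086": "processor", "8088": "processor",
        "RS-232C": "interface", "RS-422": "interface", "RAM": "main memory"}

frames = dominant_frames(activation)
clusters = dominant_clusters(activation, is_a)
```

With these weights only "PC2000" is a dominant frame, while "processor" and "interface" are dominant clusters represented by their superordinate frames. Dominant slots could be handled analogously and are omitted here.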
5.2 The Text Graph 
The text graph (Fig 11) is a hierarchical hypergraph
whose leaf nodes are the text constituents (as given
above) and whose higher-order nodes represent
generalizations of their topics. Similar to the
distinction of micro and macro propositions
[CORREIRA 80], its nodes are associated by different
kinds of relationships which are based on the frame
representation model (is-a, instance-of, is-slot,
identity) or which are constituted by the coherence
relations (e.g. contrast).
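
As a rough data-structure sketch, the text graph can be approximated by nodes with child lists plus typed edges. This is a simplification under assumptions: a plain hierarchy stands in for the hypergraph, the class `TextGraph` and its methods are hypothetical, and only the relation names are taken from the text.

```python
from dataclasses import dataclass, field

# Relation types named in the text.
RELATIONS = {"is-a", "instance-of", "is-slot", "identity", "contrast"}

@dataclass
class TextGraph:
    nodes: dict = field(default_factory=dict)  # node name -> list of child node names
    edges: list = field(default_factory=list)  # (source, relation, target) triples

    def add_node(self, name, children=()):
        self.nodes[name] = list(children)

    def relate(self, source, relation, target):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.append((source, relation, target))

    def leaves(self):
        """Leaf nodes, standing for the text constituents."""
        return [name for name, children in self.nodes.items() if not children]

graph = TextGraph()
graph.add_node("PC2000")                                   # text constituent
graph.add_node("PC-01")                                    # text constituent
graph.add_node("personal computer", ["PC2000", "PC-01"])   # generalization
graph.relate("PC2000", "instance-of", "personal computer")
graph.relate("PC-01", "instance-of", "personal computer")
graph.relate("PC2000", "contrast", "PC-01")
```

The node names in the usage example are illustrative and are not a transcription of Fig 11.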
[Figure: text graph. Legible node labels include
PC2000, 8086, 8088, RS-232C, RS-422, manufacturer,
product, disk drive, floppy disk drive, hard disk
drive, FS-190 and operating system. The legend
distinguishes the edge types identity, is-a,
instance-of and is-slot; one edge is labeled
contrast.]
Fig 11: Text Graph for Text Fragments I-III (Fig 2)
6. Conclusions 
This paper has described the text condensation
system TOPIC, which performs the conceptual analysis
of textual input for a knowledge-based full-text
information system. The following features are most
characteristic of it:
- a frame representation model which incorporates 
various integrity constraints 
- a text grammar with focus on text cohesion and text 
coherence properties of expository texts 
- a lexically distributed semantic text grammar in 
the format of word experts 
- partial text parsing based on a noun phrase word 
expert parser and a taxonomic knowledge repre- 
sentation 
- text graphs as representation structures of text
condensates which provide different layers of
informational specificity

References 

Alterman, R.: A Dictionary Based on Concept
Coherence. In: Art. Intell. 25. 1985, pp.153-186.

Correira, A.: Computing Story Trees. In: Amer. J. Com-
puting. 6. 1980, pp.135-149.

Danes, F.: Functional Sentence Perspective and the
Organization of the Text. In: Danes (ed): Papers on
Functional Sentence Perspective. Academia, 1974,
pp.106-128.

DeJong, G.: Skimming Stories in Real Time: an Exper- 
iment in Integrated Understanding. Yale Univ, 1979. 

Fum, D. et al.: Forward and Backward Reasoning in
Automatic Abstracting. In: Proc. COLING 82, pp.83-88.

Fum, D. et al.: Evaluating Importance: a Step towards
Text Summarization. In: Proc. IJCAI-85, pp.840-844.

Hahn, U.: On Lexically Distributed Text Parsing: A
Computational Model for the Analysis of Textuality
on the Level of Text Cohesion and Text Coherence.
In: Kiefer (ed): Linking in Text. Reidel, 1986.

Hahn, U.; U. Reimer: Computing Text Constituency: An
Algorithmic Approach to the Generation of Text
Graphs. In: Rijsbergen (ed): Research and
Development in Information Retrieval. Cambridge U.P.,
1984, pp.343-368.

Hahn, U.; U. Reimer: The TOPIC Project: Text-Oriented 
Procedures for Information Management and Condensa- 
tion of Expository Texts. Final Report Univ. Kon- 
stanz, 1985 (TOPIC-17/85) 

Halliday, M.; R. Hasan: Cohesion in English. Longman,
1976.

Hammwoehner, R.; U. Thiel: TOPOGRAPHIC: eine gra- 
phisch-interaktive Retrievalschnittstelle. In: 
Proc. MICROGRAPHICS. GI, 1984, pp.155-169. 

Hobbs, J.: Why is Discourse Coherent? In: Neubauer 
(ed): Coherence in Natural-Language Texts. Buske, 
1983, pp.29-70. 

Hobbs, J. et al.: Natural Language Access to Struc- 
tured Text. In: Proc. COLING 82, pp.127-132. 

Kuhlen, R.: A Knowledge-Based Text Analysis System
for the Graphically Supported Production of
Cascaded Text Condensates. Univ. Konstanz, 1984
(TOPIC-9/84)

Lehnert, W.: Plot Units and Narrative Summarization.
In: Cognitive Science 5. 1981, pp.293-331.

Loef, S.: The POLYTEXT/ARBIT Demonstration System. 
Umea/Sweden: Foersvarets Forskningsanstalt, FOA 4 
rapport, C 40121-M7, 1980. 

Polanyi, L.; R. Scha: A Syntactic Approach to
Discourse Semantics. In: Proc. COLING 84, pp.413-419.

Reichman, R.: Conversational Coherency. In: Cognitive 
Science 2. 1978, pp.283-327. 

Reimer, U.; U. Hahn: On Formal Semantic Properties of
a Frame Data Model. In: Computers and Artificial
Intelligence 4. 1985, pp.335-351.

Riesbeck, C.: Realistic Language Comprehension. In:
Lehnert / Ringle (eds): Strategies for Natural
Language Processing. Erlbaum, 1982, pp.37-54.

Rumelhart, D.: Notes on a Schema for Stories. In:
Bobrow / Collins (eds): Representation and
Understanding. Academic P., 1975, pp.211-236.

Schank, R. et al.: An Integrated Understander. In: 
AJCL. 6. 1980, pp.13-30. 

Small, S.; C. Rieger: Parsing and Comprehending with
Word Experts (a Theory and its Realization). In:
Lehnert / Ringle (eds): Strategies for Natural
Language Processing. Erlbaum, 1982, pp.89-147.

Smetacek, V.; M. Koenigova: Vnimani odborneho textu: 
experiment. In: Ceskoslovenska Informatika 19. 
1977, pp.40-46. 

Strong, S.: An Algorithm for Generating Structural
Surrogates of English Text. In: JASIS 25. 1974,
pp.10-24.

Tait, J.: Automatic Summarising of English Texts.
Univ. of Cambridge, 1982 (= Technical Report 47)

Taylor, S.: Automatic Abstracting by Applying
Graphical Techniques to Semantic Networks. Evanston/
Ill.: Northwestern Univ., 1974.
