Text Analysis and Knowledge Extraction 
Fujio Nishida, Shinobu Takamatsu, 
Tadaaki Tani and Hiroji Kusaka 
Department of Electrical Engineering, 
Faculty of Engineering, University of Osaka Prefecture,
Sakai, Osaka, 591 JAPAN 
1. Introduction
The study of text understanding and
knowledge extraction has been actively pursued by many
researchers. The authors have also studied a method of
structured-information extraction from texts without
a global text analysis. That method is applicable to
comparatively short texts such as a patent claim
clause or an abstract of a technical paper.
This paper outlines a method
of knowledge extraction from longer texts, which
need a global text analysis. The kinds of texts considered are
expository texts or explanation texts.
Expository texts described here mean those which 
have various hierarchical headings such as a title, 
a heading of each section and sometimes an abstract. 
In this definition, most texts, including
technical papers, reports and newspapers, are
expository. Texts of this kind disclose the main
knowledge in a top-down manner and show not only
the location of an attribute value in a text but
also several key points of the content. This
property of expository texts contrasts with that of 
novels and stories in which an unexpected 
development of the plot is preferred. 
This paper pays attention to such 
characteristics of expository texts and describes a 
method of analyzing texts by referring to
information contained in the intersentential 
relations and the headings of texts and then 
extracting requested knowledge such as a summary 
from texts in an efficient way. 
2. Analysis of intersentential relations
The global sentential analysis is performed
by using the information contained in the
intersentential relations and the headings of a text,
combining the bottom-up and the
top-down manner. Various kinds of intersentential
relations have been proposed so far by many
linguists. By referring to these proposals,
intersentential relations are classified
tentatively into about 8 items. They are a detail,
an additional, a parallel, a rephrase, an example, a
temporal succession, a causal and a reasoning
relation as described in the following subsections.
2.1 Detail relations
If a term t_2 is the topic term in a sentence
S_2 and if it is a complementary term of the topic
term t_1 in the preceding sentence S_1 as shown in
Expr.(1), S_2 is called the detail of S_1.

$$
\begin{aligned}
S_1 &: (\mathrm{PRED}: p_1,\; K_{11}: \underline{\underline{t_1}},\; K_{12}: t_2,\; K_{r1}: t_{r1}) \\
S_2 &: (\mathrm{PRED}: p_2,\; K_{21}: \underline{\underline{t_2}},\; K_{r2}: t_{r2}) \\
S_3 &: \cdots
\end{aligned}
\tag{1}
$$

where K:t represents a pair of a case label and a
term, and the term with a double underline denotes a
topic.
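
The condition of Expr.(1) can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the `Sentence` class, the `is_detail` function and the case labels AG, OBJ, SRC and COMP are hypothetical names introduced here, and terms are assumed to be already normalized so that the same term is spelled identically in both sentences.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    pred: str    # predicate term p
    cases: dict  # case label K -> term t
    topic: str   # topic term (double underlined in Expr.(1))


def is_detail(s1: Sentence, s2: Sentence) -> bool:
    """True if the topic of s2 is a complementary (non-topic) term of s1."""
    complementary = {t for t in s1.cases.values() if t != s1.topic}
    return s2.topic in complementary


# Two hand-coded sentences; the case frames are assumptions of this sketch.
s1 = Sentence("receive",
              {"AG": "SGS", "OBJ": "ordered-triple", "SRC": "user"},
              topic="SGS")
s2 = Sentence("have-form",
              {"OBJ": "ordered-triple",
               "COMP": "category, input-frames, conditions"},
              topic="ordered-triple")

print(is_detail(s1, s2))  # True
print(is_detail(s2, s1))  # False: "SGS" is not a term of s2
```

The check only looks at the topic of the second sentence, so it says nothing yet about which of the two sentences is the principal one.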
The sentence level of S_1 relative to that of S_2
depends on the property of the sentence S_3
following S_2 and the relation among the terms
contained in the sentences S_1, S_2 and S_3. If the
sentence S_3 is connected to S_1 more closely than to S_2,
for example, if the sentence S_3 has the topic term
t_1 of the sentence S_1 as its topic, it is
considered that the principal sentence is S_1 and the
sentence level of S_2 is lower than that of S_1.
On the other hand, if S_1 is an introductory
sentence of a term t_2 and the articles related to t_2
are described in some sentences following S_1, or
if t_2 is the global topic of the section, the
sentence S_2 is considered the principal sentence.
The global topic can be easily identified by
inspecting the headings of the section, the title and
the like, whether it is an attribute name or an
attribute value, without reading through the whole
text.
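
The choice of the principal sentence in a detail pair might be sketched as below. The function name, the reduced `Sentence` class, the order in which the conditions are tested and the final fallback are assumptions of this sketch; the text itself does not fix a priority among the conditions. "Sentences following S_1 describe t_2" is approximated by checking only whether S_3 has t_2 as its topic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    topic: str  # topic term of the sentence


def principal_of_detail_pair(s1: Sentence, s2: Sentence,
                             s3: Optional[Sentence],
                             global_topic: Optional[str]) -> int:
    """Return 1 or 2: which of S_1, S_2 is taken as the principal sentence.

    Assumes S_2 is already known to be a detail of S_1.
    """
    # S_3 returns to the topic of S_1: S_1 is principal, S_2 is lower.
    if s3 is not None and s3.topic == s1.topic:
        return 1
    # The topic of S_2 is the global topic of the section.
    if global_topic is not None and s2.topic == global_topic:
        return 2
    # S_1 only introduces t_2 and the next sentence elaborates on it.
    if s3 is not None and s3.topic == s2.topic:
        return 2
    # Fallback chosen for this sketch only.
    return 1


# S_3 comes back to the topic of S_1.
print(principal_of_detail_pair(Sentence("SGS"), Sentence("ordered-triple"),
                               Sentence("SGS"), None))   # 1
# The topic of S_2 is the global topic given by the section heading.
print(principal_of_detail_pair(Sentence("overview"), Sentence("LFG"),
                               Sentence("c-structure"), "LFG"))  # 2
```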
If the term t_2 in the sentence S_1 belongs to
a kind of pronoun such as "in the following ones"
or "as follows", the sentence S_2 is set at the same
level as that of S_1. At the summarization stage, the
system tries to shorten the part consisting of S_1
and S_2 by replacing the pronoun t_2 in S_1 by the main
content given in S_2, namely, the main part
consisting of t_r2 and p_2.
[Example 1]
(a) S_1: SGS receives an ordered triple from a user.
    S_2: The triple's form is category, input-frames,
         conditions on the sentence.
    S_3: SGS regards the ordered triple as a
         goal.
S_2 describes the content of a term "ordered
triple" in S_1, and S_3 has the topic term "SGS" in
S_1. Hence S_2 is the detail of S_1, and S_1 is the
principal sentence.
(b) S_1: In this section, the overview of LFG is
         described.
    S_2: LFG is an extension of context free grammar
         and has the following two structures.
    S_3: One is a c-structure which represents the
         surface word and phrase configurations, and
         the other is an f-structure ......
S_1 is an introductory sentence of a term "LFG"
which is the global topic in a section taken from a
text. S_2 has a kind of pronoun "the following two
structures" whose contents are described in S_3.
Hence, S_2 is the principal sentence and the sentence
level of S_3 is the same as that of S_2.
As special cases of detail relations, there are
a rephrase relation and an example relation. These
intersentential relations between sentences S_1 and
S_2 can be identified by referring to their
sentential constructions and sentence modifying
adverbs such as "in other words" and "for example".
The principal sentence of them is, in most cases,
the sentence S_1 in an expository text.
2.2 Additional relations
If the current sentence has the same
sentential topic t_i as that of the preceding
sentences and describes other attributes or
functions of the topic, the current sentence is 
called an additional sentence to the preceding 
sentences. The sentential form of the relation is 

$$
\begin{aligned}
S_1 &: (\mathrm{PRED}: p_1,\; K_i: \underline{\underline{t_i}},\; K_{r1}: t_{r1}) \\
S_2 &: (\mathrm{PRED}: p_2,\; K_i: \underline{\underline{t_i}},\; K_{r2}: t_{r2})
\end{aligned}
\tag{2}
$$

The levels of both sentences S_1 and S_2 are
generally assumed to be the same except for the case
that the global topic is put in a predicate part of
them. It can also be considered that additional
relations hold among various sentential groups of
the same level such as chapters, sections or
paragraphs under a global topic contained in a
title.
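
A test for the additional relation between two adjacent sentences could look like the following sketch. The `Sentence` class, the function name and the two sample case frames are hypothetical; "describes other attributes or functions" is approximated by requiring that the predicate or the remaining case terms differ.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    pred: str    # predicate term p
    cases: dict  # case label K -> term t
    topic: str   # sentential topic


def is_additional(prev: Sentence, cur: Sentence) -> bool:
    """True if cur keeps the topic of prev but states something else about it."""
    same_topic = cur.topic == prev.topic
    new_content = (cur.pred, cur.cases) != (prev.pred, prev.cases)
    return same_topic and new_content


# Hand-made sentences about one topic, "system".
a = Sentence("parse", {"AG": "system", "OBJ": "query"}, topic="system")
b = Sentence("retrieve", {"AG": "system", "OBJ": "information"}, topic="system")
c = Sentence("contain", {"AG": "query", "OBJ": "term"}, topic="query")

print(is_additional(a, b))  # True: same topic, another function
print(is_additional(a, c))  # False: the topic changes
```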
2.3 Other sentential relations
There are other intersentential relations. They 
are roughly classified into a serial and a 
concurrent or an extended parallel relation. 
A serial relation such as a temporal succession,
a causal or a reasoning relation has the same
physical location of focus or the same logical
object while it has a time shift or a logical
inference step shift between adjacent sentential
groups.
A concurrent relation has the same time instant
of the event occurrences or the same stage of
logical inference while it has a distance or a
spatial positional shift between the physical or the
logical objects described in the adjacent sentential
groups.
The level number of a sentence relative to the adjacent
sentential groups in these relations is assigned in
a similar way to that of the detail or the
additional relation by referring to the
intersentential relations and the global topics.
The difference between a
principal sentence level and the adjacent sentence
level is usually set within one level.
As seen above, a sentence or a sentential
group has an intersentential relation to some
adjacent sentences or sentential groups. The
intersentential relation between adjacent sentences
is similar to a relation between adjacent words or
word groups combined through rewriting rules of a
sentence. The intersentential relations are
classified into two classes. One of them is a
relation such as a detail relation, which holds
between a principal sentence and the auxiliary or
modifying sentential group with a lower level than
the principal sentence as shown in Fig.1(a). The
other is a juxtaposition relation like an additional
relation, which holds among several coherent
sentences usually with the same level as shown in
Fig.1(b).
[Figure: tree diagrams (a) with nodes n_1, n_2 and (b) with nodes n_1, n_2, ..., n_m]
Fig.1 Intersentential relations
In these diagrams a leaf node represents a
sentence of a text and an intermediate node
denotes a representative sentence of the direct
descendants or the principal parts of them. A name r
attached to an arc bridging over several branches
denotes an intersentential relation.
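
One way to hold such a diagram in memory and read sentence levels off it is sketched below. The `Node` class, the `SUBORDINATING` set and the convention that the first child of a subordinating relation is the principal part are assumptions made for this illustration, not part of the described system.

```python
from dataclasses import dataclass, field

# Relations of the first class: a principal part plus lower-level modifiers.
SUBORDINATING = {"detail", "rephrase", "example"}


@dataclass
class Node:
    label: str          # sentence id for a leaf, any name otherwise
    relation: str = ""  # relation name r on the arc over the children
    children: list = field(default_factory=list)


def assign_levels(node, level=1, out=None):
    """Map each leaf label to a level number (1 is the highest level)."""
    if out is None:
        out = {}
    if not node.children:
        out[node.label] = level
        return out
    for i, child in enumerate(node.children):
        # In a subordinating relation the first child keeps the level and
        # the others sit one level lower; juxtaposed children share a level.
        lower = node.relation in SUBORDINATING and i > 0
        assign_levels(child, level + 1 if lower else level, out)
    return out


tree = Node("g1", "detail", [
    Node("S1"),
    Node("g2", "additional", [Node("S2"), Node("S3")]),
])
print(assign_levels(tree))  # {'S1': 1, 'S2': 2, 'S3': 2}
```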
3. Text analysis
An expository text has a title and consists 
of several sections. The title shows the main
topics of the text. The heading of each section
shows local topics of each section and constitutes
the attributes of the main topics.
Each of the main sections sometimes has an
introductory remark followed by the main part. The
content of the main part is almost covered by the
subframe predetermined by the heading and the
title.
The global cohesion of a section is assured
by a relation in which each main part of the section
shares some items of the same subframe with other
main parts.
Based on the above idea of text
construction, a text analysis is done after parsing
each sentence. First, each pronoun is replaced by
the antecedent noun with the aid of
anaphora analysis. Then, the intermediate expression
of each sentence of the text is transformed into the
normal form in which each topic term is inherited
together with a double underline mark. The
expressions to be normalized are object-apposition
expressions, object-component expressions,
predicate-cause expressions, expressions which have
a term consisting of a case label, and others.
After normalization, the topic part and
the content of each sentence are first identified.
Second, intersentential relations between two
adjacent sentences are identified nondeterministically
based on the assumptions of two classes of
intersentential relations mentioned in section 2.
Third, the main sentence is identified by referring
to the intersentential relations and the heading of
the section under the main topics of the title. The
lower level sentence is indented as a modifier of
the main sentence. Sometimes, knowledge of the
specific field is required for better understanding
of the relations among main sentential groups and
various headings of the text. A case frame of a
knowledge base for the specific field is provided in
which each slot is filled with the most general term
in the specific field. Fourth, a subframe name is
prefixed to each main sentential group by referring
to the category of the main predicate term of the
main sentence and the subframe designated by the
heading of the section and the title of the text.
The basic subframe names are, for example, FUNCTION,
COMPOSITION and PROPERTY in description of actions
and physical objects.
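
The fourth step, prefixing a subframe name, might be sketched as a pair of table lookups. Only the names FUNCTION, COMPOSITION and PROPERTY come from the text; both tables, the category names and the function `subframe_name` are hypothetical and stand in for whatever lexicon and field knowledge a real system would use.

```python
# Hypothetical lexicon: main predicate term -> predicate category.
PRED_CATEGORY = {
    "construct": "action", "copy": "action", "produce": "action",
    "consist-of": "part-whole", "have": "part-whole",
    "be": "state",
}

# Hypothetical mapping from predicate category to a basic subframe name.
CATEGORY_SUBFRAME = {
    "action": "FUNCTION",
    "part-whole": "COMPOSITION",
    "state": "PROPERTY",
}


def subframe_name(pred, allowed_by_heading):
    """Return the subframe name for a main sentence, or None if undecided."""
    name = CATEGORY_SUBFRAME.get(PRED_CATEGORY.get(pred))
    # Accept only a name that the heading and the title allow.
    return name if name in allowed_by_heading else None


print(subframe_name("construct", {"FUNCTION", "COMPOSITION"}))  # FUNCTION
print(subframe_name("have", {"FUNCTION"}))                      # None
```

Returning None instead of guessing leaves room for the interactive mode with users that the paper mentions for unresolved cases.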
As seen above, the main work of the
text analysis is to identify the main sentential
groups and to assign to them a standard attribute
name of a subframe in a specified field. These
frames and attribute names are used as a key of a
specific field for efficiently storing and
retrieving the knowledge contained in texts.
The next example of text analysis is taken
from a technical paper in language processing.
[Example 2]
Title: A natural language understanding system
for data management
Heading of Section: Generating English sentences
Heading of Subsection: The selector
(1)The selector's main job is to construct a graph
relevant to the input statement. (2)In constructing
this graph the selector first copies the portion of
the semantic net which is to be output. (3)It then
uses inverse mapping functions to produce a more
surface, but still case grammar based representation
of the information to be output. (4)Inverse mapping
functions map the numeric representation for date to
a more surface one. (5)The selector constructs
modality lists next and chooses a surface ordering
rule (SOR) for each verb of the resulting structure.
(6)SORs specify the order of the syntactic cases
associated to a particular verb to be output.
In the above text, the intersentential
relations and the levels of sentences are
identified, and the label of a subframe is prefixed
to each sentence as shown in Fig.2(a) and (b).
[Figure: tree of the sentences of Example 2 with arcs labeled "detail" and "temporal succession"]
Fig.2(a) The intersentential relations
(1) FUNCTION; (PRED: construct, AG: selector,
      OBJ: graph(.....),
      SUB-PROCESS:
  (2) FUNCTION; (PRED: copy, AG: selector,
        OBJ: portion(....), MANN: first)
  (3) FUNCTION; (PRED: produce, AG: selector,
        OBJ: more-surface-representation(......),
        MANN: then, INSTR: inverse-mapping-functions
    (4) (PRED: map, AG: *, OBJ: numeric-representation(...),
          ............))
  (5) FUNCTION; (PRED: construct, AG: selector,
        OBJ: modality-lists, MANN: next)
      (PRED: choose, AG: selector,
        OBJ: surface-ordering-rule
    (6) (PRED: specify, AG: *, OBJ: order(...)) ......))
Fig.2(b) The composition of the text
A symbol "*" denotes a term prefixed to the
subframe containing the mark "*" and modified by the
subframe.
4. Generation of answering sentences for queries
In this section, sentence generation or text
generation for answering a request is described
briefly. Text generation is the inverse process of
text analysis and is inseparable from text analysis
in the sense that text generation provides a
basic idea on text construction for given
information to be represented. A given query is
parsed and the intermediate expression is
constructed. Then the required information is
retrieved and transformed into a surface expression
in the following steps:
(1) The intermediate expressions related to the
main topics of the query are extracted in the order
of the level related to the query from the analyzed
text or the database storing it under the guidance of
the frame label and other heading information as
well as the index of the terms contained in the
text. The level of a description in the text is
available for selection of the knowledge source to
be extracted.
(2) The intermediate expressions are rearranged in
a coherent and readable order, for example, in
the occurrence order of the events, and an answer
sequence is constructed.
(3) Under a given bounded length the answer
sequence is grouped or segmented into several parts
and sentential topics are selected to be expanded
into surface expressions.
(4) The sentential form of each of the segments is
selected from phrase, simple, complex and
compound surface expressions by referring to the
sentential topic.
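
Steps (1) to (3) above can be sketched as follows; step (4), the choice of a surface form, is left out. The `Entry` class, the character-count bound and the idea that a subordinate expression inherits the topics of its principal sentence are assumptions of this sketch, and the short strings merely stand in for intermediate expressions.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    order: int   # position of the sentence in the source text
    level: int   # sentence level, 1 = principal
    topics: set  # terms the expression is about
    text: str    # stands in for the intermediate expression


def extract(entries, query_topics, max_level):
    """Steps (1) and (2): select by topic and level, then order."""
    hits = [e for e in entries
            if e.level <= max_level and e.topics & query_topics]
    return sorted(hits, key=lambda e: e.order)  # occurrence order


def segment(sequence, max_chars):
    """Step (3): group the answer sequence into parts of bounded length."""
    parts, current, size = [], [], 0
    for e in sequence:
        if current and size + len(e.text) > max_chars:
            parts.append(current)
            current, size = [], 0
        current.append(e)
        size += len(e.text)
    if current:
        parts.append(current)
    return parts


entries = [
    Entry(1, 1, {"selector", "graph"}, "construct graph"),
    Entry(2, 2, {"selector"}, "copy portion"),
    Entry(3, 2, {"selector"}, "produce representation"),
    Entry(4, 3, {"selector", "inverse-mapping-functions"}, "map representation"),
    Entry(5, 2, {"selector"}, "construct modality lists"),
    Entry(6, 3, {"selector", "SOR"}, "specify order"),
]

level2 = extract(entries, {"selector"}, max_level=2)
print([e.order for e in level2])  # [1, 2, 3, 5]
print([[e.order for e in part] for part in segment(level2, 30)])
# [[1, 2], [3], [5]]
```

Raising `max_level` to 3 brings in the subordinate expressions as well, which corresponds to asking for a more detailed answer.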
The summary of the text given in Example 2 is
generated from the analyzed results shown in
Fig.2(b) by referring to the steps 2, 3 and 4. Fig.3
shows two summaries constructed from the
descriptions of the text up to level 1 and 3, where
the part enclosed with brackets is the part
generated from the descriptions of level 3.
level 1: The selector constructs a graph relevant
         to the input statement.
level 3: The selector constructs a graph relevant
         to the input statement. In the
         construction, the selector performs
         the following processes. First, the
         selector copies the portion of the
         semantic net. Then, it produces a more
         surface but case grammar based
         representation with inverse mapping
         functions [which map a numeric
         representation to a more surface one].
         Finally, it constructs modality lists and
         chooses a surface ordering rule [which
         specifies the order of syntactic cases]
         for each verb.
Fig.3 Generated summaries
5. Conclusion
An experimental system is under construction
based on our structured-information extraction
system constructed previously. This paper focuses
attention on the content suggested by the heading
and intersentential structures and assigns a
sentence level to each sentence. The problem of
ellipsis and the restoration of known structures based on syntax
and special field knowledge is not considered here.
However, it seems that there are no serious problems
in many specific fields in an interactive mode with
users.

References 

Rumelhart, D.E.: Notes on a Schema for Stories, in
Bobrow, D.G. and Collins, A. (eds.), Representation
and Understanding, pp.211-236, Academic Press,
New York (1975).

Hobbs, J.R.: Coherence and Interpretation in
English Texts, Proc. 5th IJCAI, pp.110-116
(1977).

Rigney, J.W. and Munro, A.: On Cognitive Strategies
for Processing Text, University of Southern
California, Behavioral Technology Laboratories,
Tech. Rep. No.80 (1971).

Takamatsu, S., Fujita, Y. and Nishida, F.:
Normalization of Titles and Their Retrieval,
Information Processing & Management, Vol.16,
pp.155-167 (1980).

Nishida, F. and Takamatsu, S.: Structured-
Information Extraction from Patent-Claim
Sentences, Information Processing & Management,
Vol.18, No.1, pp.1-13 (1982).

Nishida, F., Takamatsu, S. and Fujita, Y.:
Semiautomatic Indexing of Structured Information
of Text, J. Chem. Inf. Comput. Sci., Vol.24,
No.1, pp.15-20 (1984).
