Tools for Extracting and Structuring Knowledge from Texts. 
Authors: Antoine Ogonowski* (Antoine.Ogonowski@erli.gsi.fr), Marie Luce Herviou***
(Marie-Luce.Herviou@der.edf.fr) and Eva Dauphin** (sylvie.regnier@siege.aerospatiale.fr)
The authors wish to thank M. Bernard*, G. Clémencin*, S. Lacep*, R. Leblond**, MG. Monteil***, G. Morize* and
B. Normier* for their collaboration while working on this project and precious advice for the preparation of this note.
* GSI-Erli, 1 pl. des Marseillais, 94227 Charenton Cedex FRANCE
** AEROSPATIALE CCR, 12 rue Pasteur, BP76, 92152 Suresnes Cedex FRANCE
*** Electricité de France (EDF), Research Center, 1 Av du Général de Gaulle, 92141 Clamart FRANCE
Abstract: We demonstrate an approach and an accompanying UNIX toolbox for performing various
kinds of Knowledge Extraction and Structuring. The goal is to "practically" enhance productivity
while constructing resources for NLP systems on the basis of large corpora of technical texts. Users are
lexicon/grammar builders, terminologists and knowledge engineers. We stay open to already explored 
methods in this or neighbouring activities but put a greater stress on the use of linguistic knowledge. 
The originality of the work presented here lies in the scope of applications addressed and in the degree 
of use of linguistic knowledge. 
1. Introduction
Since NLP has started moving from toy
problems to real applications one of the biggest
difficulties has been Knowledge Acquisition (KA)
of different types (lexical, grammatical, domain
and application specific). A lot of the needed
information resides (in an implicit or explicit
form) in texts that most of the time now exist in
machine readable form.
It seems however too difficult to fully automate
the KA process although steps have to be taken in
that direction [WIL93]. Tools to help users deal
with large corpora have been developed for some
time; most of these however rely either on crude
non linguistic approaches or mostly on statistical
methods (eg [CAL90]). Credit should also be
given to new approaches based on neural nets
especially for dealing with Machine Readable
Dictionaries (eg [IDE90]).
This project note illustrates a more
linguistic and "open" approach (not ignoring the
achievements of existing methods), basing itself
on existing large electronic dictionaries
compatible with the GENELEX model [GEN90].
The EUREKA project GENELEX (5 years,
39 MEcu, 250 man years) has produced public
models encompassing morphological, syntactic
and semantic knowledge in both monolingual
(French, English and soon Italian and Portuguese)
as well as bilingual contexts.
The tools that the authors want to discuss
here are developed in the context of the closely
related EUREKA project GRAAL 1 (acronym for
Grammars that can be Reused to Automatically
Analyse Languages). This project [GRA92] has
the following objectives:
• the development of grammars that are
easily maintainable and reusable (ie different
types of NLP applications can be built on their
basis),
• the development of tools (parsers,
generators as well as workbenches for grammar
construction, customisation and integration in
specific application environments),
• delivery of industrial level applications.
1 This 23 MEcu, 150 man years project is conducted by an
international consortium currently gathering in France:
GSI-Erli (project leader), EDF, Aérospatiale and IRIT; in
Italy: IRST, Centro Ricerche FIAT; in Switzerland:
ISSCO; in Greece: ILSP; in Finland: Lingsoft, Nokia; in
Portugal: ILTEC.
This 4 year project is currently divided into
several subprojects ("SPs") one of which is called
"KES" (Knowledge Extraction and Structuring)
and aims at the second and third types of GRAAL
goals.
The three partners of this SP (EDF,
AEROSPATIALE and GSI-Erli) have built a
modular extensible toolbox that should cover
most of the needs that may occur in "any"
knowledge extraction process and are now validating the
performance of the toolbox on several
applications.
For the partners, Knowledge Extraction covers
needs arising in various types of applications
ranging from "terminology construction and
enrichment" (a problem largely studied in recent years
[TER90], [LEX92] ...) and "extension of lexicon
coverage", through "grammar development", up to
"construction of Large Knowledge Bases" for AI
systems, or for technology assessment survey
purposes. This means that potential users of the
toolbox range from terminology experts, lexicon
and grammar writers to knowledge engineers.
Two languages are currently considered: French
and English, but the tools developed should be
easily adaptable to other languages.
2. The Approach 
Rather than developing a new automatic 
KA theory we have opted for a "practical" 
approach i.e. a set of tools that can assist the user 
in a bootstrapping process. 
2.1 Principles: Our platform integrates all the
resources and processes needed to proceed from
raw texts to a structured set of knowledge items
(taken to mean words, terms, concepts, links, rules
etc.) extracted from these texts.
The partners believe that future industrial tools are
to use much more linguistic knowledge than the
tools currently available on the market (eg
[SAT92]). Our goal is not to be 100% exact at
the different stages of processing but to help the
user rapidly explore various hypotheses.
2.2 Phases: Three main phases organise the 
KES process: "Corpus Characterisation", 
"Extraction" and "Structuring". 
The first step takes as input text in a "KES" 
SGML format and performs a linguistic tagging 
of these texts (for more details see section 3.1.1). 
The "extraction" and "structuring" phases are the
real core of the KES process: implemented as
cooperative processes (rather than purely
sequential operations) they allow the manipulation
of information found in the results of the
previous stage, in the input texts or in lexicons,
according to different criteria:
- linguistic information (morpho-syntactic
tags, syntactic properties, thematic roles ...),
- statistical considerations (frequencies,
weights ...),
- "factual data" (eg typographical structure
indicators such as "title" or "list" markups ...).
The aim is to select, extract and group items of
information and link them together (cf. section
3.1.2 for details of the process).
The main idea is to manipulate "properties" added
to words, terms or texts (see the SATO approach)
like tags, statistical information, links ...; our
novel contribution is to use linguistic information
in all steps to add or control these properties (we
can use more information than [ANI90]) while
staying open to different modelling choices.
Furthermore, one of our constant concerns is to 
establish well-defined and standardised exchange 
formats (SGML DTDs) between the different steps 
ensuring modularity and simplifying data 
import/export from/to application databases or 
tools manipulating textual data. 
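The idea of "properties" attached to words, terms or texts can be pictured with a small data structure. The following is only a minimal sketch in Python (the toolbox itself is written in C++); the class name `KnowledgeItem`, its fields and the sample values are assumptions made for illustration and do not come from the KES data model.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class KnowledgeItem:
    """A word, term or text span carrying an open-ended set of properties.

    Hypothetical structure: the real KES model is not described in detail.
    """
    text: str
    properties: dict[str, Any] = field(default_factory=dict)
    links: list[tuple[str, str]] = field(default_factory=list)  # (link type, target)

    def add_property(self, name: str, value: Any, source: str) -> None:
        # Keep track of which kind of criterion produced the property:
        # linguistic, statistical or "factual" (markup).
        self.properties[name] = {"value": value, "source": source}


# Illustrative values only.
item = KnowledgeItem("data processing")
item.add_property("category", "NP", source="linguistic")
item.add_property("frequency", 42, source="statistical")
item.add_property("in_title", True, source="factual")
item.links.append(("isa", "processing"))
```

Recording the source of each property is one possible way of letting later rules select items on linguistic, statistical or factual grounds alike.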
3. The Tools 
Our tools are developed in a modular way
in C++, based on standards like OSF/MOTIF and
SGML, and run under the SUN OS UNIX
operating system.
3.1 Current State: Two groups of tools compose
the current toolbox. GCE [GCE93], the first one,
implements (in batch mode) a parameterised
corpus characterisation and a first extraction of
"interesting" items. EAEKES, the second tool, is
much more interactive and accounts for the more
domain specific part of extraction as well as for
knowledge structuring and validation.
3.1.1 Corpus Characterisation +
preliminary extraction
GCE (Graal Corpus Exploration) has been
developed by the partners in a previous phase of
GRAAL and is a set of software tools that, run in
batch on a corpus, perform a morpho-syntactic
analysis (pattern-matching approach), and
produce structured data representing:
- lists of tagged words (GENELEX
categories),
- predictions on the categories of unknown
strings ("date", "numeral", "proper noun" ...)
based on "morpho-graphic" patterns and context,
- lists of syntactic groups (Noun Phrases that
appear to be potential terms of the domain,
verbal forms ...),
- various statistical information (ranging
from frequencies of particular punctuations to
frequencies of syntactic patterns).
Thus the tool can produce several representations
of the corpus (eg: lemmatised, with various levels
of tags etc.).
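To make the notion of category prediction from "morpho-graphic" patterns and context more concrete, here is a minimal Python sketch. It is not GCE's implementation: the regular expressions, the tiny lexicon and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical morpho-graphic patterns, checked in order; first match wins.
MORPHOGRAPHIC_PATTERNS = [
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("numeral", re.compile(r"^\d+([.,]\d+)?$")),
    ("proper noun", re.compile(r"^[A-Z][a-z]+$")),
]


def predict_category(token, sentence_initial=False):
    """Guess the category of a string that is absent from the lexicon."""
    for category, pattern in MORPHOGRAPHIC_PATTERNS:
        if category == "proper noun" and sentence_initial:
            continue  # context: a capital is uninformative at sentence start
        if pattern.match(token):
            return category
    return "unknown"


def characterise(tokens, lexicon):
    """Tag known words from the lexicon, guess the others, count the tags."""
    tagged = []
    for i, tok in enumerate(tokens):
        tag = lexicon.get(tok.lower()) or predict_category(tok, i == 0)
        tagged.append((tok, tag))
    return tagged, Counter(tag for _, tag in tagged)


# Toy lexicon standing in for a large coverage dictionary.
lexicon = {"the": "det", "system": "noun", "shipped": "verb",
           "with": "prep", "tools": "noun"}
tokens = ["The", "Unix", "system", "shipped", "12/03/1994", "with", "42", "tools"]
tagged, counts = characterise(tokens, lexicon)
```

The frequency counter is a very reduced stand-in for the statistical information mentioned in the list above.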
GCE uses for its purpose large GENELEX
lexicons (French 55 000 simple words and
18 000 compound words, English 40 000 words)
and a constraint grammar like approach.
Because GCE performs a bottom-up analysis 
using a large coverage lexicon and makes lexical 
category predictions on unknown words the 
results are usually very satisfactory and constitute 
a valuable starting point for the subsequent phases, 
even for texts in very technical domains. 
3.1.2 Extraction and Structuring 
Here the implementing software tool called 
EAEKES is based on GSI-Erli's AlethSAC 
software (based in turn on GSI-Erli's experience 
in the E.C.A.I.M. project Menelas). 
EAEKES' main goal is to allow for both 
interactive and batch knowledge extraction and 
characterisation. It automatically bases itself on 
the GCE results. 
The most basic operation consists in manually
creating domains of information and manually
(either by typing them in, or by mouse selection
in source text) inserting items 2 into them.
The tool in this mode of operation allows the
navigation between items (on the basis of the links
between them) and domains of information in a
somewhat "hypertext" style (mouse clicking).
The user can also interactively change the
inter-relationships of both terms and domains (in a
cut/paste way), with inverse links maintained automatically.
2 items can be made up of parts of words, words, phrases,
or even disjoint text elements.
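The manual mode described above (domains, items, navigable links and automatically maintained inverse links) can be sketched as a small in-memory store. This is an illustrative Python sketch only; the class `KnowledgeStore`, the link type names and the table of inverses are assumptions, not the EAEKES data model.

```python
from collections import defaultdict

# Assumed pairs of link types and their inverses.
INVERSE = {"isa": "has_kind", "has_kind": "isa",
           "occurs_in": "contains", "contains": "occurs_in"}


class KnowledgeStore:
    def __init__(self):
        self.domains = defaultdict(set)  # domain name -> set of items
        self.links = defaultdict(set)    # item -> {(link type, other item)}

    def insert(self, domain, item):
        self.domains[domain].add(item)

    def move(self, item, source, target):
        """Cut an item from one domain and paste it into another."""
        self.domains[source].discard(item)
        self.domains[target].add(item)

    def link(self, a, link_type, b):
        self.links[a].add((link_type, b))
        self.links[b].add((INVERSE[link_type], a))  # inverse kept in sync

    def unlink(self, a, link_type, b):
        self.links[a].discard((link_type, b))
        self.links[b].discard((INVERSE[link_type], a))

    def neighbours(self, item):
        """Items reachable in one step, as in hypertext style navigation."""
        return sorted(self.links[item])


store = KnowledgeStore()
store.insert("methods", "data processing")
store.insert("methods", "textual data processing")
store.link("textual data processing", "isa", "data processing")
```

Because every `link` call also writes the inverse, navigation works in both directions without the user having to enter the second link.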
The second mode of operation offers the
possibility to describe selection patterns that are
then applied on the corpus in a batch mode.
The selection patterns are coupled with a
description of actions that are to be performed,
forming together "KES rules".
The actions can among others perform "parsing
like operations" by using a type of chart like
representation of the analysed text.
Most often however users will perform actions that
extract identified parts of text and assign them some
characteristics and/or link them to other already
extracted items.
The reader will find below examples of patterns
that can be specified and examples of actions
performed with matching items.
The types of patterns can be:
- morphological - for example: "all words
beginning with "aqu" or containing the infix
"hydro" are to be placed in the domain
"[illegible]" 3",
- simple contextual patterns - eg "all words
that are not adverbs and immediately precede
verbs related to the verb "to flow" are to be
characterised as nouns denoting liquefied
bodies" 4. Note that the type of relations that are
to hold between verbs can be what is found in a
rich GENELEX dictionary, but can also be
user defined criteria,
- syntactic - for example: all NP heads
following a form of "obtained by" are to be
placed in the domain "methods"; all phrases of
the type "all <NP-head> 'like' <enumeration
heads>" describe an 'is a' relationship between
the NP head and each one of the enumeration
heads (ex: "the data processing methods like
automatic classification, formal links ...") 5,
- combinations of the above types 6.
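A "KES rule" couples a selection pattern with an action. The sketch below shows one possible reading of the morphological and simple contextual examples above, in Python. The rule language of EAEKES is not given in this note, so the classes `Token` and `KesRule`, the verb set `FLOW_VERBS` (a stand-in for relations between verbs found in a dictionary) and the domain name "liquids" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Token:
    form: str
    tag: str         # e.g. "noun", "verb", "adv"
    lemma: str = ""


@dataclass
class KesRule:
    name: str
    pattern: Callable[[list, int], bool]   # (tokens, position) -> match?
    action: Callable[[dict, Token], None]  # updates the extracted knowledge


def morphological(tokens, i):
    form = tokens[i].form.lower()
    # Simplification: "hydro" is accepted anywhere in the word.
    return form.startswith("aqu") or "hydro" in form


FLOW_VERBS = {"flow", "pour", "leak"}  # assumed stand-in for lexicon relations


def precedes_flow_verb(tokens, i):
    if i + 1 >= len(tokens) or tokens[i].tag == "adv":
        return False
    nxt = tokens[i + 1]
    return nxt.tag == "verb" and nxt.lemma in FLOW_VERBS


def place_in(domain):
    def action(store, token):
        store.setdefault(domain, set()).add(token.form.lower())
    return action


RULES = [
    KesRule("aqu/hydro", morphological, place_in("liquids")),
    KesRule("before flow verb", precedes_flow_verb, place_in("liquefied bodies")),
]


def apply_rules(tokens, rules):
    """Batch mode: try every rule at every position of the tagged text."""
    store = {}
    for i, token in enumerate(tokens):
        for rule in rules:
            if rule.pattern(tokens, i):
                rule.action(store, token)
    return store


sentence = [Token("Water", "noun", "water"), Token("flows", "verb", "flow"),
            Token("into", "prep", "into"), Token("the", "det", "the"),
            Token("aquifer", "noun", "aquifer")]
extracted = apply_rules(sentence, RULES)
```

Keeping pattern and action as separate callables is one way to let the same pattern be reused with different actions, for instance linking items instead of placing them in a domain.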
The above mentioned types of rules are to be 
provided by the user, this however is a task too 
difficult for some users that are not linguists or 
knowledge engineers. Therefore the toolbox 
provides a library of basic rules that can either be 
used as such or serve as starting patterns that users 
may refine and adapt. 
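As an illustration of what a basic library rule might look like, the following Python sketch handles the "<NP-head> like <enumeration heads>" pattern and produces 'is a' links. It works on raw strings with a regular expression, whereas the toolbox works on GCE's tagged output; the function names and the "last word is the head" heuristic are assumptions suitable only for simple English noun phrases.

```python
import re

# "<NP> like <A>, <B>, ..." over plain text (a deliberate simplification).
LIKE_PATTERN = re.compile(
    r"(?P<np>[\w ]+?)\s+like\s+(?P<enum>[\w ]+(?:\s*,\s*[\w ]+)*)"
)


def head_of(np):
    """Naive heuristic: the head of an English NP is its last word."""
    return np.strip().split()[-1]


def isa_links(sentence):
    """Return (enumeration head, 'isa', NP head) triples found in a sentence."""
    match = LIKE_PATTERN.search(sentence)
    if not match:
        return []
    head = head_of(match.group("np"))
    members = [m.strip() for m in match.group("enum").split(",") if m.strip()]
    return [(head_of(member), "isa", head) for member in members]


links = isa_links(
    "the data processing methods like automatic classification, formal links"
)
```

A user could start from such a rule and refine it, for example by lemmatising the heads or by accepting "such as" in place of "like".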
3 various application domains offer degrees of such
regularities - some applications in chemistry being perhaps
the most illustrative (the above "hydro" would probably
assign a different domain in chemistry).
4 presuming that we are dealing with a technical text.
5 regularities like these have been observed in technical
texts (eg cf [ROU92]).
6 users with different skills write different types of
rules. For instance a Knowledge engineer usually does
not use the notion of a syntagmatic head.
Whether the extraction is made by "hand"
or by rules it can be performed on any of the
forms output by GCE. It is thus possible to
combine forms as they appeared in the source text
with results of lemmatisation, taking into account
frequency data, logical markups or
co-occurrences. The extraction process can be made
"information sensitive" i.e. the selection patterns
can be made to check whether a knowledge item is
not already classified somewhere (by another rule,
by the user or in an external source 7); thus, it is
possible to use all the information available on an
item, coming from the original corpus or external
resources 7.
Therefore information predicted by the rules can
be used in other rules, thus achieving a bootstrap
type of effect. Facilities are available to keep track
of the dependencies between hypotheses; the user
can interactively explore retrieval of hypotheses
and see the effect on the extracted knowledge.
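The bootstrap effect and the tracking of dependencies between hypotheses can be sketched as follows. This Python sketch is an interpretation, not the EAEKES mechanism: it reads "retrieval of hypotheses" as withdrawing a hypothesis, represents a hypothesis as an (item, domain) pair, and uses two hypothetical rules and toy data.

```python
CANDIDATES = ["data processing", "formal links"]           # toy NP list
ENUMERATED_TOGETHER = [("data processing", "formal links")]  # toy co-enumerations


def head_rule(facts):
    """An NP goes into the domain in which its head is already classified."""
    out = []
    for np in CANDIDATES:
        head = np.split()[-1]
        for item, domain in list(facts):
            if item == head:
                out.append(((np, domain), {(item, domain)}))
    return out


def enumeration_rule(facts):
    """Items enumerated with a classified item share its domain."""
    out = []
    for a, b in ENUMERATED_TOGETHER:
        for item, domain in list(facts):
            if item == a:
                out.append(((b, domain), {(item, domain)}))
    return out


def run_to_fixpoint(facts, rules):
    """facts maps each hypothesis to the set of hypotheses it rests on."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for conclusion, premises in list(rule(facts)):
                if conclusion not in facts:
                    facts[conclusion] = set(premises)
                    changed = True
    return facts


def retract(facts, hypothesis):
    """Withdraw a hypothesis together with everything derived from it."""
    removed = {hypothesis}
    facts.pop(hypothesis, None)
    changed = True
    while changed:
        changed = False
        for fact, premises in list(facts.items()):
            if premises & removed:
                removed.add(fact)
                del facts[fact]
                changed = True
    return removed


# A user hypothesis (no premises) seeds the process.
facts = run_to_fixpoint({("processing", "methods"): set()},
                        [head_rule, enumeration_rule])
derived = set(facts)
withdrawn = retract(facts, ("processing", "methods"))
```

Here the second rule fires only because the first one has produced a result, and withdrawing the seed hypothesis removes both derived classifications.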
This extracted knowledge (set of
knowledge items) can be interactively checked
and cleaned up.
Once checked the knowledge items can be
structured: various types of links can be made,
domains can be divided into subdomains and
items dispatched into them 8.
Several ways are available to accomplish this task:
- Manually selecting and moving items
using the mouse.
- Rules similar to the extraction patterns can
also be applied on the extracted set of
knowledge items (eg: all items placed in the
domain of "energy production" that begin with
the strings "atom", "nucl" or contain the word
"fission" are to be moved into the subdomain
"energy production by nuclear means") 9.
These structuring rules can also establish links
between items; it is therefore possible to
perform actions like: check for "inclusion" of
an item in another one and if positive link them
with a "generic" link. Because of both the
possibilities of the rule language and the
availability of large stocks of linguistic
knowledge, the previously mentioned
"inclusion tests" can range from simple
character string matching to testing for
example the variation of the prepositions used
in a corresponding position in several terms.
Note that established links can also be tested in
the rules and for example it is possible to detect
"shortcut links" in hierarchies of items. These
identified links can then be presented to the
user for further operations.
7 note that such an external source is the GENELEX
compatible dictionary but may also be a thesaurus that
the user is trying to enrich, but could also be an
ontology in the context of Expert System
construction (cf [MIZ93]). The toolbox's underlying
data model can in a "meta model way" host a large
variety of resources.
8 the user can have two modes of operation: either an
unconstrained one where any "domain" or link can be
created or a "model" guided mode in which the
"administrator" user has to specify the links used, the
types of domains, the types of items and specify for
each the possible interrelationships.
9 note that such rules can implement some simple
forms of generalisation strategy.
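The structuring rules described in this section (dispatching items into a subdomain, "inclusion" tests producing "generic" links, detection of "shortcut links") can be sketched in Python as below. Only the simplest, string-level inclusion test is shown; the function names and sample items are assumptions, and a shortcut link is taken here to mean a direct link that duplicates a longer path in the hierarchy.

```python
def dispatch(domains, source, target, prefixes=(), words=()):
    """Move items that start with a prefix or contain a word to a subdomain."""
    moved = {item for item in domains[source]
             if item.startswith(tuple(prefixes))
             or any(w in item.split() for w in words)}
    domains[source] -= moved
    domains.setdefault(target, set()).update(moved)
    return moved


def generic_links(items):
    """Link (specific, generic) when the generic term is included word for word."""
    links = set()
    for specific in items:
        for generic in items:
            if specific != generic and f" {generic} " in f" {specific} ":
                links.add((specific, generic))
    return links


def shortcut_links(links):
    """Return links (a, c) for which a longer path from a to c also exists."""
    children = {}
    for a, b in links:
        children.setdefault(a, set()).add(b)

    def reachable(start, goal, skip):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in children.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == goal:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(a, c) for a, c in links if reachable(a, c, skip=(a, c))}


domains = {"energy production": {"atomic pile", "nuclear reactor",
                                 "fission chamber", "wind turbine"}}
moved = dispatch(domains, "energy production",
                 "energy production by nuclear means",
                 prefixes=("atom", "nucl"), words=("fission",))

links = generic_links({"processing", "data processing", "textual data processing"})
shortcuts = shortcut_links(links)
```

The detected shortcuts are only reported, in line with the text above: deciding whether to delete such a link is left to the user.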
The standard type of result display is in a
workview which can handle lists of items and lists
of links between items. Some graphical
manipulation is also possible (cf. figures 1 and 2).
Fig. 1: textual mode
Here the user works in a navigation mode; the window in
the foreground of this view corresponds to the
knowledge item "îlot nucléaire" of type "term". Through a
roll-down menu one can view all occurrences of the term.
The "spread sheet" like window ("work view") in the
background allows the user to manage sets of links (cf first
column), knowledge items (second column) and
attributes of knowledge items (not shown here).
Note that the steps during which the user writes
different extraction rules and visualises their
results can also be used in grammar engineering
tasks. There exists in fact a third viewing mode
(not shown here) in which the user may see the
places in his corpus where a rule applies. Nothing
prohibits the rules from being "normal" parsing
rules that the user wants to explore.
This type of use has however been
reserved for a subsequent phase of the SP (after
July 1994) and thus has not been fully explored
yet.
3.2 Outline of an example session 
We will illustrate the working of our toolbox in the
context of an application whose goal is to enrich
an existing thesaurus.
1. A large corpus (10 MBs) of texts in the domain
of informatics is batch processed by GCE,
yielding lists of nouns, adjectives, potential terms,
unknown words and statistics ...
2. Using a spell checker called from within the
EAEKES system, the user eliminates unknown
words that in fact were misspelled technical terms.
3. The remaining unknown words are studied in
context thanks to the retrieval of the sentence
where they occurred. The pertinent ones (very
technical terms, domain specific proper nouns like
"Unix", "Salton" ...) are kept.
4. The most frequent nouns and noun phrases are
observed. Extraction rules make it possible to extract the
most "productive" NP heads and thus to build
"domains" such as: "machines" (d1), "methods"
(d2), "languages" (d3) ... This extraction is made
"information sensitive": the existing thesaurus is
used to help define these domains. Then, the
NPs based on these heads are dispatched: "Unix
machines", "IBM machines" in d1, "statistic
methods" in d2, "C programming language" in d3,
etc.
5. With rules using syntactic or semantic
information found in the GENELEX dictionary
(for example, synonyms of "method" and
"processing") and using contextual patterns (eg
variants of the form "...methods such as ...") other
items are dispatched: in d2 we will then find items
like "document classification", "data processing",
"textual data processing", "formal links", "Salton
theory" ...
6. Graphical facilities make it possible to establish links
between the different items: eg "isa links" between
"data processing" and "textual data processing",
between "textual data processing" and "document
classification" etc.
7. The structured results are then exported 
(encoded according to an SGML DTD) in order to 
be recovered by a terminology management 
system which will allow their integration in the 
original thesaurus. 
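Steps 4 and 7 of the session above can be sketched in a few lines of Python: counting the most "productive" NP heads, dispatching NPs into the corresponding domains and exporting the result as markup. The KES DTD is not given in this note, so the element names `kes`, `domain` and `item` are purely hypothetical, as are the threshold `min_count`, the "last word is the head" heuristic and the sample NP "classification methods".

```python
from collections import Counter, defaultdict
from xml.sax.saxutils import escape


def productive_heads(noun_phrases, min_count=2):
    """Heads (last words) occurring in at least min_count noun phrases."""
    heads = Counter(np.split()[-1] for np in noun_phrases)
    return [h for h, n in heads.most_common() if n >= min_count]


def dispatch_by_head(noun_phrases, heads):
    """Place each NP in the domain named after its head, if productive."""
    domains = defaultdict(list)
    for np in noun_phrases:
        head = np.split()[-1]
        if head in heads:
            domains[head].append(np)
    return dict(domains)


def export(domains):
    """Serialise domains with hypothetical element names (not the KES DTD)."""
    lines = ["<kes>"]
    for name in sorted(domains):
        lines.append('<domain name="%s">' % escape(name, {'"': "&quot;"}))
        lines += ["<item>%s</item>" % escape(np) for np in sorted(domains[name])]
        lines.append("</domain>")
    lines.append("</kes>")
    return "\n".join(lines)


nps = ["Unix machines", "IBM machines", "statistic methods",
       "C programming language", "classification methods"]
heads = productive_heads(nps)
domains = dispatch_by_head(nps, heads)
markup = export(domains)
```

In a fuller version the existing thesaurus would also be consulted when choosing the domains, which is what makes the extraction "information sensitive" in step 4.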
4. Where are we? 
The set of tools described here is a prototype and
further work is planned in both the LRE project
TRANSTERM and the continuation of this
GRAAL subproject.
5. Conclusions
We have presented an approach supported by a
toolbox corresponding to the aims of industrial
actors in the field of NLP. The objective targeted
is an increase in the productivity of people
manipulating large corpora. Rather than
introducing a new theory of automatic KA we
have presented a "practical" approach allowing the
combination of automatic and "hand" methods
which can be based on large generic repositories
of knowledge, working in a bootstrap type of
cooperation.
References 
[ANI90] "An Application of Lexical Semantics to
Knowledge Acquisition from Corpora", P. Anick
and J. Pustejovsky, in Proc. of Coling 1990.
[CAL90] "Acquisition of Lexical Information
from a Large Textual Italian Corpus", N. Calzolari
and R. Bindi, in Proc. of Coling 1990.
[GCE93] "Etude de corpus: un préalable
nécessaire pour l'adaptation des systèmes de TA
aux besoins des utilisateurs", E. Dauphin, in
proceedings of "Troisièmes Journées Scientifiques
TA, TAO, Traductique", to be published as an
"UREF" publication, Université de Montréal.
[GEN90] "GENELEX project: EUREKA for
linguistic engineering", B. Normier and M. Nossin,
in Proc. of International workshop on electronic
dictionaries 1990.
[GRA92] "Outline Eureka GRAAL", Coling 1992
(International Project Presentations Volume).
[IDE90] "Very Large Neural Networks for Word-
Sense Disambiguation", N. Ide and J. Véronis, in
Proc. of ECAI90 1990.
[LEX92] "Surface grammatical Analysis for the
extraction of terminological noun phrases",
Bourigault Didier, in Proc. of Coling 1992.
[MIZ93] "Knowledge Acquisition and Ontology",
Riichiro Mizoguchi, in Proc. of KB&KS93 -
International Conference on Building and Sharing
of Very Large-Scale Knowledge Bases 1993.
[ROU92] "Elaboration de Techniques d'analyse
adaptées à la construction de bases de
connaissances", F. Rousselot and B. Migault (ENSAIS),
in Proc. of Coling 1992.
[SAT92] "L'analyse du contenu textuel en vue de
la construction de thésaurus et de l'indexation
assistée par ordinateur : applications possibles
avec SATO", S. Bertrand-Gastaldy, G. Pagola,
"Documentation et bibliothèques", April-June
1992.
[TER90] "Termino v. 1.0", research report,
November 1990, by the RDLC group (Recherche
et Développement en Linguistique
Computationnelle), Centre d'Analyse de Textes
par Ordinateur, Université du Québec, Montréal.
[WIL93] "Towards Automated Knowledge
Acquisition", Yorick Wilks and Sergei Nirenburg,
in Proc. of KB&KS 93 - International Conference
on Building and Sharing of Very Large-Scale
Knowledge Bases 1993.
Fig. 2: Graphical Mode
All objects and links can be displaced and updated using the mouse and keyboard.
Here the user working on nuclear emergency manuals has chosen to display part of the extracted
linguistic composition links: "nucléaire" "occurs in" "îlot nucléaire" and "bâtiment auxiliaire nucléaire",
but also thesaurus like links: "îlot nucléaire" appears to be a generic term of "bâtiment auxiliaire
nucléaire".
