Knowledge Extraction from Texts: 
a method for extracting predicate-argument structures from texts 
Florence PUGEAULT (1, 2), Patrick SAINT-DIZIER (1), Marie-Gaëlle MONTEIL (2) 
(1) IRIT-CNRS, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse FRANCE 
(2) EDF, D.E.R., 1, avenue Gal de Gaulle, 92140 Clamart, FRANCE 
1. Aims of the project 
The general aim of our project is to improve the quality 
of existing systems extracting knowledge from texts by 
introducing refined lexical semantics data. The 
contribution of lexical semantics to knowledge extraction 
is not new and has already been demonstrated in a few 
systems. Our more precise aims are to: 
- propose and show the feasibility of more radical semantic 
classifications which facilitate lexical descriptions by 
factoring out as much information as possible, 
enhancing re-usability of linguistic resources. We 
show how the different linguistic resources can be 
organized and how they interact, 
- investigate different levels of granularity in the 
semantic descriptions and their impact on the quality of 
the extracted knowledge. In our system, granularity is 
considered at two levels: (1) linguistic: linguistic 
knowledge representations may be more or less precise, 
(2) functional: most modules of our system can work 
independently and thus can be used separately, 
- evaluate different algorithms for extracting knowledge, 
taking into account efficiency aspects, 
- evaluate the costs of extending our system to larger 
sets of texts and to different application domains. 
Our project is applied to research project descriptions 
(noted hereafter as RPD) where the annual work of 
researchers at the DER of EDF (Direction des Etudes et 
des Recherches, Electricité de France) is described in 
terms of research actions. The extracted knowledge must 
be sufficiently accurate to allow for the realization of the 
following purposes: (1) evaluation of the importance of 
the use of techniques, procedures and equipment, (2) 
automatic distribution of documents to different services, 
(3) interrogation, e.g. who does what and what kind of 
results are available, (4) identification of relations of 
various types between projects, (5) construction of 
synthesis of research activities on precise topics, and (6) 
creation of the 'history' of a project. 
About 2,000 RPD are produced each year, each about 
200 words long. The total vocabulary is about 50,000 
different words. Texts include fairly complex linguistic 
constructs. We also use the EDF thesaurus (encoding for 
nouns: taxonomies, associative relations, and synonyms, 
in a broad sense). 
In this document, we first introduce the linguistic 
organization of our project, present the general form of 
texts and identify the type of information which must be 
extracted out of them. Next, we present a semantic 
representation for the extracted knowledge, and study in 
more depth the extraction of information in the form 
of predicate-argument and predicate-modifier structures 
(Jackendoff 87a, Katz and Fodor 63). 
2. The overall organization of the 
linguistic system 
Let us first introduce the way linguistic knowledge is 
organized. Due to space limitation, we just outline the 
main elements of the system. Here are the different 
linguistic components of our system: 
[Figure: diagram relating surface forms (head forms, prepositions and complements), thematic roles with selectional restrictions (which determine basic syntactic forms), verb semantic classes (B. Levin 93) (which determine complex syntactic behavior, i.e. alternations, according to general grammatical principles), the LCS, and the set of language semantic primitives.] 
Fig. 1 The General Linguistic Organization 
Thematic roles (Dowty 89), (Dowty 91) paired with 
selectional restrictions and semantic information allow 
for the production or recognition of surface forms 
corresponding to 'basic' sentential forms. More complex 
forms will be treated by a system of alternations, derived 
from the semantic classification of verbs defined by 
(Levin 93). 
In our approach, we consider a set of primitive 
elements, either general or related to our application 
domain, which includes notions such as being in contact 
with, being in spatial motion, or being the cause of. 
This set of primitives is designed so that it corresponds 
to those needed for the definition of the semantic classes 
of verbs, where the syntactic behavior of a verb (and thus 
the different ways the arguments can be distributed and 
should be analysed by the parser and put at the right place 
in the semantic representation) essentially depends on the 
verb's semantic nature. This approach allows for a really 
comprehensive treatment of predicate-argument structures 
because it complements the basic syntactic mappings 
realized from thematic roles specifications. Furthermore, 
this approach requires very economical lexical means 
since it removes a lot of idiosyncrasies previously 
encoded in lexical entries. 
We are reformulating B. Levin's work for a subset of 
verbs of French. Although our study is quite general, we 
focus primarily on verbs found in applications. Verbs of 
a given class have almost identical thematic distributions 
which are predictable from their semantics. For each of 
the semantic classes we have considered, we have defined 
a relatively small set of thematic grids, which define the 
'regular' thematic distributions. 
From a different perspective, we also consider that a 
subset of the semantic primitives we have identified are 
those used in the LCS, which we use in a slightly 
simplified way, since we do not consider for our 
application its deepest refinements. The efficient use of 
LCS for practical applications has been shown in a 
number of works, including (Dorr 93). 
3. Semantic typology of the RPD texts 
Let us first illustrate the type of text we are dealing 
with. Here is a standard text: 
"Les mesures destructives (ou assimilables) posent 
toujours des problèmes concernant le faible nombre de 
données disponibles ou encore leur coût qui s'associe 
généralement à la nécessité d'une bonne précision. Il est 
donc nécessaire d'optimiser les campagnes de mesure 
pour mieux analyser les incertitudes de mesure, et, 
lorsque cela est possible, réduire les coûts induits. Ces 
problèmes sont d'autant plus difficiles à traiter que les 
paramètres en jeu ont des comportements non-linéaires. 
Il est donc nécessaire, au préalable, d'étudier les 
méthodes permettant de prendre en compte cette 
non-linéarité." 
3.1 General organization of texts 
A global study of these texts shows a great regularity 
in their overall organization. We have identified four 
major facets in most texts, called articulations. These 
articulations are not necessarily all present in a 
text. We have the following articulations: 
- THEME, which characterizes the main purpose of the 
text. This articulation includes the topic of the text, and 
the domain that the engineers are investigating, 
- MOTIVATIONS, which relate the main objectives, the 
needs and the goals, and which explain the development of 
the current project, 
- PROBLEMS, which correspond to the difficulties 
related to the current state of the art or to the limitations 
of certain equipment or methods, 
- REALIZATIONS, which describe the different tasks 
required for the achievement of the project. 
Articulations may cover one or more fragments of a 
sentence, a whole sentence or a set of sentences. They do 
not necessarily appear in the order they have been defined 
here. The decomposition of texts in articulations defines 
the pragmatic level. We view the articulations as 
defining semantic fields. The above text can be 
decomposed as follows: 
[[theme [les mesures destructives]], 
[motivations [optimiser les campagnes de mesure 
pour mieux analyser les incertitudes de mesure, et, 
lorsque cela est possible, réduire les coûts induits.]], 
[problems [[posent toujours des problèmes 
concernant le faible nombre de données disponibles ou 
encore leur coût qui s'associe généralement à la 
nécessité d'une bonne précision], [problèmes sont 
d'autant plus difficiles à traiter que les paramètres en jeu 
ont des comportements non-linéaires.]]], 
[realizations [étudier les méthodes permettant de 
prendre en compte cette non-linéarité.]]]. 
For this level, we have implemented a method which 
permits the identification of the different articulations of 
a text. This problem is divided into two sub-problems: 
(1) identification of the articulations, and (2) extraction of 
relevant sentence fragments from the original text. 
A study of the RPD texts has shown that these four 
articulations can relatively easily be identified by means 
of specific terms or constructions. Let us call these terms 
or constructions articulation triggers. Articulation 
triggers belong to different linguistic domains: 
(1) lexical, where triggers are just words, e.g. 'devoted 
to', 'in the context of', 'propose', for THEME, 
(2) grammatical, where triggers can be phrases, or 
related to grammatical information (such as tense and 
aspect, e.g. 'in the past years', 'since 1989', for 
THEME), or verbs or nouns of certain semantic class, 
e.g. verbs of volition, of creation (Levin 93), 
(3) discursive, where triggers are mainly propositional 
connectors such as 'therefore', 'because', etc., 
(4) pragmatic, where the relative positions of sentences 
and more generally, the physical form of texts (e.g. 
enumerations) can determine articulations. 
The next stage is to extract those portions of text 
which are relevant for the articulation considered. Since 
the linguistic treatments of this first level are 
necessarily superficial, we must carefully discard 
irrelevant portions of texts. This approach has been 
modelled by means of extraction rules, which specify 
words and constructions to skip and which delimit zones 
of texts to be extracted. Evaluation of results is given in 
fig. 2 in the annex. 
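
The two sub-problems above (trigger-based identification, then rule-based extraction) can be sketched in a few lines of Python. This is only an illustrative sketch, not the system's implementation: the trigger lists are English stand-ins chosen for readability, and the names `TRIGGERS`, `identify_articulations` and `extract_fragment` are hypothetical. Only lexical triggers are modelled; grammatical, discursive and pragmatic triggers are left out.

```python
import re

# Hypothetical trigger lexicon. The English cue words are illustrative
# stand-ins, not the triggers actually used on the French RPD texts.
TRIGGERS = {
    "THEME": [r"\bdevoted to\b", r"\bin the context of\b"],
    "MOTIVATIONS": [r"\bin order to\b", r"\bit is necessary to\b"],
    "PROBLEMS": [r"\bproblems?\b", r"\bdifficult\b"],
    "REALIZATIONS": [r"\bstudy\b", r"\bimplement\b"],
}

def identify_articulations(sentence):
    """Return the set of articulations whose triggers occur in the sentence."""
    found = set()
    for articulation, patterns in TRIGGERS.items():
        if any(re.search(p, sentence, re.IGNORECASE) for p in patterns):
            found.add(articulation)
    return found

def extract_fragment(sentence, skip_prefixes=("it is necessary to ",)):
    """Toy extraction rule: skip a leading construction and keep the rest."""
    lowered = sentence.lower()
    for prefix in skip_prefixes:
        if lowered.startswith(prefix):
            return sentence[len(prefix):]
    return sentence
```

A sentence may fire triggers of several articulations at once, which is why a set is returned; the extraction rule then delimits the zone of text to keep for a given articulation.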
3.2 Identification of knowledge to be 
extracted 
Let us now concentrate on the nature of the semantic 
information which should be extracted by the system. We 
have identified three types of information: 
- general nominal terms (e.g. 'methods', 'data'), and 
specific nominal terms belonging to technical domains, 
- states or actions in which these terms are involved, 
- general roles played by these terms in actions or 
states. 
Roughly speaking, the first class identifies arguments, 
the second class defines predicates, while the third one 
introduces the notion of semantic roles such as thematic 
roles. This latter level is of a crucial importance in 
knowledge extraction because it avoids making incorrect 
interpretations on the role of an argument with respect to 
the action or state being described. This level is called 
the linguistic level. 
The level of granularity we are considering in this 
project suggests that we group predicates with a close 
meaning into a class and to represent them by the same 
predicate name, viewed as a primitive term. For example, 
we have terms which express the notion of definition 
(e.g. define, specify, describe, identify, qualify, represent) 
or the notion of building (e.g. assemble, build, compile, 
develop, forge) as defined in B. Levin's work. However, 
for a relatively small number of classes, in particular for 
those classes of predicates which denote complex actions 
and for those which exhibit a high degree of 
incorporation (Baker 88), where incorporated knowledge 
needs to be made more explicit, it may be necessary to 
use a more conceptual type of representation. We want 
to investigate the use of the Lexical Conceptual Structures 
(LCS) (Jackendoff 87, 90), which match very well with 
the planned uses of the extracted knowledge on the one 
hand, and with the notion of thematic roles on the other 
hand. Let us call this the conceptual level. Since this paper 
is mainly devoted to the linguistic level, the conceptual level 
will not be investigated here. 
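
Grouping predicates with a close meaning under one primitive predicate name amounts to a many-to-one lookup. The Python sketch below uses the two verb lists quoted above; the class names `define` and `build`, and the helper `normalize_predicate`, are assumptions made for illustration only.

```python
# Verb lists are the ones quoted in the text; the class names used as
# primitive predicate names ("define", "build") are hypothetical.
PREDICATE_CLASSES = {
    "define": ["define", "specify", "describe", "identify", "qualify", "represent"],
    "build": ["assemble", "build", "compile", "develop", "forge"],
}

# Invert the table once so that lookup by verb is direct.
VERB_TO_CLASS = {
    verb: cls for cls, verbs in PREDICATE_CLASSES.items() for verb in verbs
}

def normalize_predicate(verb):
    """Map a verb to the primitive name of its class, or None if unclassified."""
    return VERB_TO_CLASS.get(verb.lower())
```

Verbs outside the listed classes return `None`, which leaves room for the more conceptual (LCS) treatment reserved for complex predicates.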
4. The linguistic level 
4.1 Identification of predicative terms 
Predicative terms characterize states or actions. The goal 
at this stage is to be able to determine in a way which is 
as systematic as possible which terms are predicative in 
the RPD texts. A priori, verbs denoting states or actions 
and prepositions are considered to be predicative terms. 
Nouns are slightly more difficult to treat. The EDF 
dictionary includes the specification of nouns derived 
from verbs. We consider that these nouns are predicative. 
A few nouns, not derived from verbs are also predicative, 
such as algorithm, sort or departure; these are identified 
so far by hand. They may later be semantically classified 
as describing, for example, actions or events. 
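
The decision procedure of this subsection can be summarized as a small predicate over lexical entries. The following Python sketch is a simplification under stated assumptions: `LEXICON` is a hypothetical stand-in for the EDF dictionary, the `derived_from` field is an invented encoding of the "derived from a verb" information, and all verbs are treated as denoting states or actions.

```python
# Hypothetical mini-lexicon standing in for the EDF dictionary.
LEXICON = {
    "optimize": {"cat": "v"},
    "with": {"cat": "prep"},
    "optimization": {"cat": "n", "derived_from": "optimize"},
    "algorithm": {"cat": "n"},
    "pump": {"cat": "n"},
}

# Predicative nouns not derived from verbs, listed by hand as in the text.
HAND_LISTED_NOUNS = {"algorithm", "sort", "departure"}

def is_predicative(word):
    """Verbs and prepositions are predicative; nouns only if deverbal or hand-listed."""
    entry = LEXICON.get(word)
    if entry is None:
        return False
    if entry["cat"] in ("v", "prep"):
        return True
    return "derived_from" in entry or word in HAND_LISTED_NOUNS
```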
4.2 Identification of relevant predicates and 
arguments in texts 
The second aspect of the linguistic level is the 
identification of predicates and related arguments which 
are sufficiently relevant to be extracted. Relevance can be 
defined a priori and once and for all, or may depend on the 
text. The relevance of a term can be defined according to 
several criteria: 
(1) genericity, terms defining a research action, a 
realization, or a problem such as: define, improve, 
implement, test, evaluate and explore are of much 
interest. At this level, it is most useful to use B. 
Levin's verb classification to determine relevance. 
(2) specialization, corresponding to very precise terms 
describing a material, an equipment, a method or a 
system. Specialized terms can be defined a priori from 
the thesaurus by extracting the most specialized terms. 
(3) local importance, where importance in a text is 
explicitly marked, for example, by a construction such 
as 'it is important to...' or by a negation. 
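
As a rough illustration, the three criteria can be combined into a single check that reports which of them apply to a term. The Python sketch below rests on assumptions that are not in the paper: specialization is approximated by a depth in the thesaurus taxonomy (`THESAURUS_DEPTH` and the threshold `min_depth` are hypothetical), local importance is reduced to one marker string, and negation is not handled.

```python
# Generic terms quoted in criterion (1).
GENERIC_TERMS = {"define", "improve", "implement", "test", "evaluate", "explore"}

# Hypothetical taxonomy depths: deeper means more specialized.
THESAURUS_DEPTH = {"measurement": 1, "destructive measurement": 3}

# Explicit importance markers of criterion (3); negation is left out.
IMPORTANCE_MARKERS = ("it is important to",)

def relevance_criteria(term, sentence, min_depth=3):
    """Return the names of the criteria under which the term is relevant."""
    criteria = []
    if term in GENERIC_TERMS:
        criteria.append("genericity")
    if THESAURUS_DEPTH.get(term, 0) >= min_depth:
        criteria.append("specialization")
    lowered = sentence.lower()
    if any(marker in lowered for marker in IMPORTANCE_MARKERS) and term in lowered:
        criteria.append("local importance")
    return criteria
```

The first two criteria can be evaluated a priori from the lexicon and the thesaurus, whereas the third depends on the sentence in which the term occurs.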
4.3 Representing predicate arguments and 
modifiers by means of thematic roles 
The relationship between a predicate and one of its 
arguments can be represented by a thematic role. 
Thematic roles confer a much stronger meaning on 
predicate structures, in particular when thematic roles 
have a relatively precise meaning. Thematic roles can be 
defined in a more refined way than the usual definitions. 
From that perspective, our claim is that thematic roles 
can form the basis of a good and stable general 
descriptive semantics of predicate-argument relationships. 
Thematic roles have then a conceptual dimension, and 
not only a linguistic one. However, they must not be 
confused with the conceptual labels of the LCS. 
Thematic roles must remain general; they form a bridge 
between conceptual representations and syntax. Fig. 3 
shows the thematic roles we consider. 
We consider here an extended use of thematic roles 
since they are also assigned to predicate modifiers, 
realized as prepositional phrases or as propositions, in 
order to represent in a more explicit and uniform way 
essential arguments and modifiers, since they all play an 
important role in the semantics of a proposition. 
The general form of a semantic representation 
introduces two functions for thematic roles: 
(1) an argument typing function: 
predicate_name(..., role_i: {arg_i}, ...) 
(2) a predicate modifier typing function, where a predicate 
is marked by a thematic role, if the modifier is a 
predicate: 
role_j: predicate_name(..., role_k: {arg_k}, ...) 
The arg_i are fragments of texts (NPs and PPs), which 
may be further analyzed in a similar way, if necessary. 
For example, a sentence such as: 
John got injured by changing a wheel 
is represented by: 
injured(theme: {john}) ^ causal theme: 
change(agent: {john}, theme: {wheel}). 
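
The two typing functions map naturally onto a small data structure: an argument carries a role and a text fragment, and a predicate optionally carries the role it plays as a modifier. The Python sketch below is one possible encoding, not the authors' formalism; the class names `Arg` and `Pred`, the function `conjoin`, and the role labels used in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Arg:
    """Argument typing function: a thematic role paired with a text fragment."""
    role: str
    fragment: str

    def render(self):
        return f"{self.role}: {{{self.fragment}}}"

@dataclass
class Pred:
    """Predicate with typed arguments; `role` is set when it acts as a modifier."""
    name: str
    args: List[Arg] = field(default_factory=list)
    role: Optional[str] = None

    def render(self):
        body = f"{self.name}({', '.join(a.render() for a in self.args)})"
        return f"{self.role}: {body}" if self.role else body

def conjoin(*preds):
    """Join predicate representations with the conjunction symbol used in the text."""
    return " ^ ".join(p.render() for p in preds)
```

For the sentence "John got injured by changing a wheel", `conjoin(Pred("injured", [Arg("theme", "john")]), Pred("change", [Arg("agent", "john"), Arg("theme", "wheel")], role="causal theme"))` renders a main predicate conjoined with a modifier predicate marked by its own thematic role.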
If in an articulation, we only extract an NP, it is 
represented as an argument as follows: 
arg({fragments of text corresponding to the NP}). 
and no thematic role is assigned to it. The general 
representation of an articulation is then: 
[articulation_name, 
[extracted text from pragmatic level], 
partial predicate-arg representation] 
The result of the parse of our sample text is given below. 
[[theme [les mesures destructives (ou assimilables)] 
arg: {mesures destructives}], 
[motivations [optimiser les campagnes de mesure 
pour mieux connaître, voire améliorer, les incertitudes de 
mesure, et, lorsque cela est possible, réduire les coûts 
induits.] 
optimize(_, incremental beneficiary theme: 
{campagnes de mesure}) ^ 
goal: (analyze(_, holistic theme: 
{incertitudes de mesure}) ^ 
reduce(_, incremental victim theme: {coûts}))], 
[problems [[posent toujours des problèmes 
concernant le faible nombre de données disponibles ou 
encore leur coût qui s'associe généralement à la 
nécessité d'une bonne précision.] [problèmes sont 
d'autant plus difficiles à traiter que les paramètres en 
jeu ont des comportements non-linéaires.]] 
arg: ({faible nombre de données}, {coût}, 
{comportements non-linéaires})], 
[realizations [étudier les méthodes permettant de 
prendre en compte cette non-linéarité.] 
study(_, general theme: {methods})]]. 
4.4 Parsing and assigning thematic roles 
Let us now show how our parser works and how 
thematic roles are concretely assigned to arguments. For 
that purpose, we introduce three main criteria: 
(1) the semantic class of the predicative term where 
thematic grids are given, 
(2) the semantic type of the preposition, if any, which 
introduces the argument (we have also defined thematic 
grids for prepositions), 
(3) the general semantic type of the head noun of the 
argument NP. Semantic types are mainly defined from 
the semantic fields given in the EDF thesaurus. 
These criteria are summarized in fig. 4 at the end of this 
document. They are implemented by means of 
thematic role assignment rules. 
The parsing of the RPD texts works independently on 
each fragment of text associated with each articulation 
(referential aspects will be considered later). We have the 
three following stages: 
(1) Identification of predicates and arguments: due to the 
complexity of texts, a partial analysis is the only 
possible and efficient solution. We have a grammar that 
identifies basic verbal and nominal 
constructions. The parser works bottom-up and 
identifies maximal structures which are not ambiguous. 
(2) Thematic role assignment: the assignment 
procedure considers each thematic role in a thematic grid 
and searches for a nominal or propositional structure to 
which the thematic role can be assigned. This 
assignment is based on the thematic role assignment 
rules. The general form of a thematic role assignment 
rule is the following: 
assign_role(<name of role>, 
<grammatical form of predicate>, 
<grammatical form of argument>) :- <unification or 
subsumption constraints on semantic features>. 
This is illustrated as follows, where grammatical forms 
(xp) are given in Login form (Aït-Kaci and Nasr 86), 
following the TFS approach: 
assign_role(effective_agent, 
xp(syntax => syn(cat => v), semantics => 
sem( pred => yes, relevance => yes)), 
xp(syntax => syn(cat => n), semantics => 
sem( pred => no, 
sem_type => tsem( semp => X )))) :- 
subsumed(X, \[human, technical\]). 
This process can be applied recursively on those 
arguments which contain predicates. The depth of 
recursion is a parameter of the system. 
(3) Semantic representation construction. At this level, 
deeper representations (such as the LCS) can be used. 
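
To make the role assignment step more concrete, here is a minimal Python transcription of the `effective_agent` rule shown above. It is a sketch under assumptions, not the Login implementation: feature structures are flattened into dictionaries, the type hierarchy `PARENT` is invented for the example, and `subsumed` is taken to mean that the argument's semantic type or one of its ancestors belongs to the allowed set.

```python
# Hypothetical semantic type hierarchy (child -> parent), for illustration only.
PARENT = {
    "engineer": "human",
    "human": "animate",
    "pump": "technical",
    "animate": "entity",
    "technical": "entity",
}

def subsumed(sem_type, allowed):
    """True if sem_type, or one of its ancestors, is among the allowed types."""
    current = sem_type
    while current is not None:
        if current in allowed:
            return True
        current = PARENT.get(current)
    return False

# One rule, mirroring the effective_agent rule given in the text.
RULES = [
    {"role": "effective_agent", "pred_cat": "v", "arg_cat": "n",
     "allowed": {"human", "technical"}},
]

def assign_role(predicate, argument):
    """Return the first role whose grammatical and semantic constraints hold."""
    for rule in RULES:
        grammatical_ok = (
            predicate["cat"] == rule["pred_cat"]
            and predicate["pred"] and predicate["relevant"]
            and argument["cat"] == rule["arg_cat"]
            and not argument["pred"]
        )
        if grammatical_ok and subsumed(argument["sem_type"], rule["allowed"]):
            return rule["role"]
    return None
```

With a relevant verbal predicate and a non-predicative noun of type `engineer`, `assign_role` returns `effective_agent`, since `engineer` is subsumed by `human` in the assumed hierarchy; a noun typed only as `entity` receives no role from this rule.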
Conclusion 
The novelty of our approach with respect to knowledge 
extraction can be summarized as follows: 
(1) We have defined three levels of knowledge 
representation (pragmatic, linguistic and conceptual), 
which are homogeneous, expressed within a single, 
incremental formalism, incremental in the sense that 
knowledge extracted at an outer level is refined at a deeper 
one, and that representations support partial information. 
(2) We have defined simple methods for extracting 
relevant terms in texts, using a thesaurus. 
(3) We show that the syntactic alternations given in 
Levin's work complement the basic syntactic forms 
generated from thematic roles. These semantic classes of 
verbs, because of their semantic basis and because of the 
way they are defined, are a very powerful tool for 
correctly assigning thematic roles to predicate arguments 
in a large number of syntactic forms. 
(4) The different types of data and the level of granularity 
at which they are considered establish linguistic levels 
of description which correspond to a certain descriptive 
reality and to a certain autonomous and homogeneous 
level of semantic representation. 
Acknowledgements 
We are very grateful to Marie-Luce Herviou, Palmira 
Marrafa and Sophie Daubèze for discussions on this 
project. We also thank Martha Palmer and Bonnie Dorr 
for several discussions and for introducing us to B. 
Levin's work. This project is funded by EDF-DER. 
References 
Aït-Kaci, H., Nasr, R., LOGIN: A Logic Programming 
Language with Built-in Inheritance, Journal of Logic 
Programming, vol. 3, pp 185-215, 1986. 
Baker, M. C., Incorporation, A Theory of Grammatical 
Function Changing, Chicago University Press, 1988. 
Blosseville MJ, Hebrail G, Monteil MG, Penot N, 
Automatic Document Classification: Natural Language 
Processing, Statistical Analysis and Expert System 
Used Together, ACM SIGIR, Copenhagen, June 1992. 
Dorr, B., Machine Translation: a View from the Lexicon, 
MIT Press, 1993. 
Dowty, D., On the Semantic Content of the Notion of 
Thematic Role, in Properties, Types and Meaning, G. 
Chierchia, B. Partee, R. Turner (Eds), Kluwer Academic 
Press, 1989. 
Dowty, D., Thematic Proto-roles and Argument 
Selection, Language, vol. 67-3, 1991. 
Grimshaw, J., Argument Structure, MIT Press, 1990. 
Jackendoff, R., The Status of Thematic Relations in 
Linguistic Theory, Linguistic Inquiry 18, 369-411, 
1987. 
Jackendoff, R., Consciousness and the Computational 
Mind, MIT Press, 1987. 
Jackendoff, R., Semantic Structures, MIT Press, 1990. 
Katz, J. J., Fodor, J. A., The Structure of a Semantic 
Theory, in Language 39, pp. 170-210, 1963. 
Levin, B., English Verb Classes and Alternations, the 
University of Chicago Press, 1993. 
articulation      fully correct    partly correct    incorrect 
                  extraction       extraction        extraction 
<THEME>           86%              11.5%             2.5% 
<MOTIVATIONS>     70%              21.5%             8.4% 
<PROBLEMS>        61%              33.5%             5.5% 
<REALISATIONS>    46.5%            30%               23.5% 
Fig. 2 Evaluation of level 1 
Agent Effectif 
    Agent Volitif 
    Agent Initiatif 
    Agent Perceptif 
    Agent de Mouvement 
Moyen 
    Instrument 
        Instrument Direct 
        Instrument Indirect 
    Non Instrumental 
Thème Général 
    Thème Holistique 
    Thème Incrémental 
        Bénéficiaire 
        Victime 
    Thème Causal 
Localisation 
    Source 
    Position 
        Position Absolue 
        Position Relative 
    But 
    Direction 
Fig. 3 The thematic role hierarchy (in French) 
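Thematic role: Effective Agent (ae) 
    Semantic class of predicate: characterize, creation and transfo., continue, service, transfer of possession, searching, etc. 
    Selectional restr. on argument + prep: human 
    Examples: définir, représenter, créer, réaliser, continuer, poursuivre, aider, collaborer, donner, échanger, rechercher, résoudre, etc. 
Thematic role: Volitive agent 
    Semantic class of predicate: volition, obligation 
    Selectional restr. on argument + prep: human 
    Examples: vouloir, désirer, devoir, obliger, nécessiter. 
Thematic role: Initiative agent 
    Semantic class of predicate: allowing, decision 
    Selectional restr. on argument + prep: human | technical 
    Examples: favoriser, permettre, conduire, décider, diriger, mener. 
Thematic role: Perceptive agent 
    Semantic class of predicate: knowledge 
    Selectional restr. on argument + prep: human 
    Examples: savoir, connaître. 
Thematic role: Agent of Movement 
    Semantic class of predicate: continue 
    Selectional restr. on argument + prep: concrete element | human 
    Examples: étendre, poursuivre. 
Thematic role: Theme 
    Semantic class of predicate: searching, obligation, transfer of possession, attaching, etc. 
    Selectional restr. on argument + prep: -animate | technical | human 
    Examples: explorer, observer, devoir, obliger, donner, échanger, attacher, chaîner, etc. 
Thematic role: Means 
    Semantic class of predicate: creation and transfo., characterize, etc. 
    Selectional restr. on argument + prep: prep: avec, en, par 
    Examples: construire, réaliser, utiliser, spécifier (par). 
Thematic role: Localization 
    Semantic class of predicate: moving, attaching 
    Selectional restr. on argument + prep: place (spatial loc.), temporal (temporal loc.), abstract | technical (abstract loc.); prep: dans, sur, de, etc. 
    Examples: aller, venir, attacher, chaîner, relier (à). 
Thematic role: Identifier 
    Semantic class of predicate: identification 
    Selectional restr. on argument + prep: proper noun | profession 
    Examples: baptiser, nommer. 
Thematic role: Accompaniment 
    Semantic class of predicate: service, attaching 
    Selectional restr. on argument + prep: animate; prep: avec 
    Examples: collaborer, participer, attacher (avec), unir. 
Fig. 4 Sample of the organization of thematic roles w.r.t. semantic classes of verbs, 
selectional restrictions and prepositions. 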
