In: Proceedings of CoNLL-2000 and LLL-2000, pages 194-198, Lisbon, Portugal, 2000. 
Learning from Parsed Sentences with INTHELEX 
F. Esposito and S. Ferilli and N. Fanizzi and G. Semeraro 
Dipartimento di Informatica 
Universit£ di Bari 
via E. Orabona, 4 - 70126 Bari - Italia 
{esposito, ferilli, fanizzi, semeraro}~di.uniba.it 
Abstract 
In the context of  learning, we address a 
logical approach to information extraction. The 
system INTHELEX, used to carry out this task, 
requires a logic representation of sentences to 
run the learning algorithm. Hence, the need 
for parsers to produce structured representa- 
tions from raw text. This led us to develop a 
prototypical Italian  parser, as a pre- 
processor in order to obtain the structured rep- 
resentation of sentences required for the sym- 
bolic learner to work. A preliminary exper- 
imentation proved that the logic approach to 
learning from  is able to capture the 
semantics underlying the kind of sentences that 
were processed, even if a comparison with clas- 
sical methods as regards efficiency has still to 
be done. 
1 Introduction 
Language learning has gained growing attention 
in the last years. Statistical approaches, so far 
extensively used -- see (Saitta and Neri, 1997) 
for an overview of the research in this area --, 
have severe limitations, whereas the flexibility 
and expressivity of logical representations make 
them highly suitable for natural  anal- 
ysis (Cussens, 1999). Indeed, logical approaches 
may have a relevant impact at the level of se- 
mantic interpretation, where a logical represen- 
tation of the meaning of a sentence is important 
and useful (Mooney, 1999). 
Logical approaches have been already em- 
ployed in Text Categorization and/or Informa- 
tion Extraction. Yet they try to use an expres- 
sive representation  such as first-order 
logic to define simple properties about tex- 
tual sources, regarded, for instance, as bags of 
words (Junker et al., 1999) or as semi-structured 
texts (Freitag, 2000). These properties are of- 
ten loosely related with the grammar of the 
underlying , often relying on extra- 
grammatical features (Cohen, 1996). We intend 
to exploit a logic representation for exploiting 
the grammatical structure of texts, as it could 
be detected using a proper parser. Indeed, a 
more knowledge intensive technique is likely to 
perform better applied on the tasks mentioned 
above. 
When no background knowledge about the 
 structure is assumed to be available, 
one of the fundamental problems with the adop- 
tion of logic learning techniques is that a struc- 
tured representation of sentences is required 
on which the learning algorithm can be run. 
Thus, the need arises for parsers that are able 
to discover such a structure starting from raw, 
unstructured text. Research in this field has 
produced a variety of tools and techniques for 
English, that cannot be applied to other lan- 
guages, such as Italian, because of the differ- 
ent, and sometimes much more complex, gram- 
matical structure. Such considerations led us to 
develop a prototypical Italian  parser, 
that could serve as a pre-processor of texts in 
order to obtain the structured representation of 
sentences that is needed for the symbolic learner 
to work. It is fundamental to note that the fo- 
cus of this paper is not the parser, that does not 
adopt sophisticated NLP techniques. The aim 
here is investigating the feasibility of learning 
semantic definitions for some kinds of sentences. 
Even more so, the availability of a professional 
parser will further enhance the performance of 
the whole process. 
Further problems in applying relational learn- 
ing to  are due to the intrinsic compu- 
tational complexity of these methods, as a draw- 
194 
back of the expressive power gained through re- 
lations. Moreover, another weakness of our ap- 
proach could be the dependence on the quality 
of the data coming from the preprocessing step: 
it is possible that noise coming from wrongly 
parsed sentences be present, thus having a neg- 
ative influence towards the model to be induced. 
After briefly presenting in Section 2 the 
parser performance, in order to establish the de- 
gree of reliability of the data on which the learn- 
ing step is performed, Section 3 shows the re- 
sults of applying the first-order learning system 
INTHELEX (Esposito et al., 2000) for the in- 
ference of some simple events related to foreign 
commerce. Lastly, Section 4 draws some prelim- 
inary conclusions on this research and outlines 
future work issues to be addressed. 
2 A Stratified Parser for Italian 
Language 
This section presents a parser for the Italian 
, based on context-free grammars and 
designed to manage texts having a simple and 
standard phrase structure (e.g., foreign com- 
merce texts as opposed to poetry texts). It 
is composed by 12 parsing levels and 106 pro- 
duction rules, and uses the longest-match tech- 
nique, which complies with the typical ambigu- 
ity of Italian . Syntactic lookahead is 
used to overcome ambiguity and to prevent the 
parsing from stopping in case of grammatically 
wrong input. 
The text is segmented in progressively larger 
syntactic constructs. Subject, main verb, di- 
rect or indirect object and clauses referring to 
them are identified. Nested syntactic constructs 
at the same abstraction level (e.g., expressions 
including a sentence in parentheses) are sup- 
ported. 
Plain text documents are provided to a lexical 
analyzer and a noun-recognizer (XEROX MUL- 
TEXT), whose output is the document text 
tagged with parts of speech to be fed to the 
parser. Since Italian grammar is very differ- 
ent from the English one, some terms do not 
have an English equivalent and, hence, cannot 
be translated. 
The parser was validated on a set of 72 
sentences drawn from a corpus of articles on 
foreign commerce available on the Internet, 
and the results obtained were evaluated with 
Parsing Phase Precision Recall 
Noun Groups 0.984 0.992 
lst-level NPs 0.994 0.994 
2nd level NPs 0.983 0.983 
PPs 0.951 0.951 
Clauses 0.840 0.840 
Refined clauses 0.913 0.913 
Sentences 0.736 0.736 
Table 1: Summary of parser validation results 
(Precision/Recall) 
Parsing Phase Errorl Error2 
Noun Groups 0.787% 0.793% 
1st level NPs 0.596% 0.596% 
2nd level NPs 1.666% 1.666% 
PPs 4.918% 4.918% 
Clauses 15.941% 15.941% 
Refined clauses 8.695% 8.695% 
Sentences 26.384% 26.384% 
Table 2: Summary of parser validation results 
(Errorl/Error2) 
respect to precision, recall (reported in Table 1) 
and two measures about error ratio: 
Errorl = # errors/# total constituents extracted 
Error2 = # errors/# total correct constituents expected 
(see Table 2). 
3 Information extraction 
The grammar above was used to parse Italian 
texts downloaded from the Internet, and con- 
cerning foreign commerce. Through such pre- 
processing, the aim was to obtain some struc- 
ture for those texts that could then be trans- 
lated in the input  of the learning sys- 
tem INTHELEX (Esposito et al., 2000) in order 
to make it learn simple events concerning that 
domain. 
INTHELEX (INcremental THEory Learner 
from EXamples) is a fully incremental, multi- 
conceptual closed loop learning system for the 
induction of hierarchical theories from exam- 
ples. In detail, full incrementality avoids the 
need of a previously generated version of the 
theory to be available, so that learning can start 
from an empty theory and from the first exam- 
195 
ple; multi-conceptual means that it ,:an learn si- 
multaneously various concepts, possibly related 
to each other; a closed loop system is a system 
in which the learned theory is checked to be 
valid on any new example available, and in case 
of failure a revision process is activated on it, 
in order to restore the completeness and consis- 
tency properties. 
Incremental learning is necessary when either 
incomplete information is available at the time 
of initial theory generation, or the nature of the 
concepts evolves dynamically. The latter situ- 
ation is the most difficult to handle since time 
evolution needs to be considered. In any case, 
it is useful to consider learning as a closed loop 
process, where feedback on performance is used 
to activate the theory revision phase. 
INTHELEX learns theories, from positive and 
negative examples described in the same lan- 
guage. It adopts a full memory storage strategy 
-- i.e., it retains all the available examples, thus 
the learned theories are guaranteed to be valid 
on the whole set of known examples. 
In the formal representation of texts, we used 
the following descriptors: 
• sent (el,e2) e2 is a sentence fi:om el 
• subj (el,e2) e2 is the subject of el 
• obj (el,e2) e2 is the (direct) object of el 
• indirect_obj (el ,e2) e2 is an indirect ob- 
ject of el 
• rel_subj (el,e2) e2 is a clause related to 
the subject el 
• rel_obj(el,e2) e2 is a clause related to 
the object el 
• verb(el,e2) e2 is the verb of el 
• lemma(e2) word e2 has lemma lemma 
• infinite(e2) verb e2 is in an infinite 
mood 
• finite (e2) verb e2 is in a finite mood 
• affirmative(e2) verb e2 is in an affirma- 
tive mood 
• negative(e2) verb e2 is in a negative 
mood 
• np(el,e2) e2 is a 2nd level NP of el 
• pp(el,e2) e2 is a PP of el 
where lemma is a meta-predicate. This allows 
the system to exploit information about word 
lemmas in generalizations/specializations, and 
in the recognition of higher level concepts of 
which lemma is an instance. 
Thus, the following Horn clause is an instance 
of an example: 
imports (example) ~-- 
sent (example ,el), 
subj (example, e2), 
np(e2,e3), 
impresa(e3), 
rel_subj (el ,e4), 
verb (e4, e5), 
specializzare (e5), 
infinite (e5), 
affirmative (e5), 
pp(e4,e6), 
distrubuzione (e6), 
componente (e6), 
verb(el,eT), 
interessare (e7), 
finite (eT), 
affirmative (eT), 
indirect_obj (el, e8), 
pp(e8,e9), 
importazione (eg), 
macchina(e9) , 
produzione (e9), 
ombrello (eg). 
A first experiment aimed at learning the con- 
cept of specialization (of someone in some field). 
The system was run on 40 examples, 24 posi- 
tive and 16 negative. The resulting theory was 
made up by 5 clauses, some of which differ just 
in one literal (e.g., the lemma of the word in the 
subject). By exploiting the background knowl- 
edge that terms 'impresa', 'societY', 'ditta' and 
'agenzia' are all instances of the concept 'per- 
sona giuridica', i.e. clauses: 
persona_giuridica (X) +- 
ditta(X). 
persona_giuridica (X) +-- 
societa(X). 
persona_giuridica(X) ~- 
impresa(X). 
persona_giuridica(X) +- 
agenzia(X). 
196 
the theory becomes more compact, yielding 
the following rules: 
specialization(A)~- 
sent(A,B), 
subj(B,C), np(C,D), 
persona_giuridica(D), 
verb(B,E), 
specializzare(E), 
finite(E), 
affirmative(E), 
indirect_obj(B,F), 
pp(F,_). 
specialization(A)~- 
sent(A,B), 
subj(B,C), np(C,D), 
persona_giuridica(D), 
rel_subj(B,E), 
verb(E,F), 
specializzare(F), 
affirmative(F), 
pp(E,_), verb(B,_). 
Another experiment aimed at learning the 
concept of "imports". INTHELEX was run 
starting from the empty theory, and was fed 
with a total of 67 examples (39 positive and 28 
negative). It should be noted that not all posi- 
tive examples explicitly use verb 'importare' (to 
import): e.g., in the sentence "Societ& belga, 
specializzata nella lavorazione del legno, cerca 
fornitori di legname" the imports event is char- 
acterized by the noun 'societ&' (society) as the 
sentence subject, by the verb 'cercare' (to look 
for) and by the object including the noun 'for- 
nitore' (provider). We obtained the following 
results (in which the above background knowl- 
edge was used to compress more rules into one, 
too): 
imports(A) ~- sent(A,B), 
subj(B,C), np(C,D), 
persona_giuridica(D), 
verb(B,E), 
cercare(E), 
finite(E), 
affirmative(E), 
obj(B,F), np(F,G), 
fornitore(G). 
imports(A) ~-sent(A,B), 
subj(B,C), np(C,D), 
persona_giuridica(D), 
societa(D), 
verb(B,E), 
cercare(E), 
finite(E), 
affirmative(E), 
obj(B,F), np(F,G), 
distributore(G). 
imports(A) ~-sent(A,B), 
subj(B,C), np(C,D), 
persona_giuridica(D), 
verb(B,E), 
interessare(E), 
finite(E), 
affirmative(E), 
indirect_obj(B,F), 
pp(F,G), 
importazione(G). 
imports(A) ~-sent(A,B), 
subj(B,C), np(C,D), 
persona_giuridica(D), 
verb(B,E), 
acquistare(E), 
finite(E), 
affirmative(E). 
imports(A) ~-sent(A,B), 
subj(B,C), np(C,D), 
persona_giuridica(D), 
impresa(D), 
verb(B,E), 
importare(E), 
finite(E), 
affirmative(E). 
For instance, the third clause means: "Text A 
deals with imports if it contains a sentence with 
a subject composed by a NP containing a per- 
sona giuridica, the verb of the main sentence 
is interessare (to interest) in finite affirmative 
mood, and the indirect object consists of a PP 
containing the word importazione". Note that, 
by exploiting a background knowledge that rep- 
resents a more complex ontology than the cur- 
rent one, it would be possible to further merge 
conceptual descriptors and, as a consequence, 
clauses in the theory. For example, 'fornitore' 
(provider) and 'distributore' (distributor) could 
be recognized as instances of a common higher 
level concept; the same applies to 'acquistaxe' 
(to buy) and 'importare' (to import). 
197 
4 Conclusions & Future Work 
We have addressed the problem of learning logic 
theories for information extraction so to bene- 
fit by the semantic interpretation provided by 
a logical approach. This has required struc- 
tured sentences in a logic representation on 
which to run our learning algorithms. Hence, 
we needed a parser to produce structured repre- 
sentations from raw unstructured text. Though 
many techniques have been developed for En- 
glish, they cannot be applied to other s, 
such as Italian, because of the different gram- 
matical structure. This has led us to develop a 
prototypical Italian  parser, as a pre- 
processor in order to obtain the structured rep- 
resentation of sentences needed for the symbolic 
learner to work. 
Future work will concern a more extensive ex- 
perimentation, an empirical evaluation of our 
approach, and application of the same kind of 
experiments on English parsed texts. If good 
results will be obtained, it is possible thinking 
to carry out experiments that take advantage 
also from the structure of semi-structured doc- 
uments. Indeed, we are involved in the project 
CDL (Esposito et al., 1998; Costabile et al., 
1999), that could profit by this kind of tech- 
niques as regard semantic indexing of the stored 
documents (cf. (Chanod, 1999)). 

References 
J.-P. Chanod. 1999. Natural  processing 
and digital libraries. In M.T. Pazienza, editor, 
Information Extraction, volume 1714 of Lecture 
Notes in Artificial Intelligence Tutorial, pages 17- 
31. Springer. 
W. Cohen. 1996. Learning to classify english text 
with ilp methods. In Luc de Raedt, editor, Ad- 
vances in Inductive Logic Programming, pages 
124-143. IOS Press, Amsterdam, NL. 
M.F. Costabile, F. Esposito, G. Semeraro, and 
N. Fanizzi. 1999. An adaptive visual environment 
for digital libraries. International Journal of Dig- 
ital Libraries, 2:124-143. 
J. Cussens, editor. 1999. Learning Language in 
Logic. Workshop on Learning Language in Logic, 
Workshop Notes. 
F. Esposito, D. Malerba, G. Semeraro, N. Fanizzi, 
and S. Ferilli. 1998. Adding machine learning 
and knowledge intensive techniques to a digital 
library service. International Journal of Digital 
Libraries, 2(1):3-19. 
F. Esposito, G. Semeraro, N. Fanizzi, and S. Ferilli. 
2000. Multistrategy Theory Revision: Induction 
and abduction in INTHELEX. Machine Learn- 
ing, 38(1/2):133-156. 
D. Freitag. 2000. Machine learning for information 
extraction in informal domains. Machine Learn- 
ing, 39(2/3):169-202. 
M. Junker, M. Sintek, and M. Rinck. 1999. Learning 
for text categorization and information extraction 
with ILP. In James Cussens, editor, Proceedings 
of the First International Workshop on Learning 
Language in Logic - LLL99. 
R. Mooney. 1999. Learning for semantic interpre- 
tation: Scaling up without dumbing down. In 
Cussens (Cussens, 1999), pages 7-15. Workshop 
on Learning Language in Logic, Workshop Notes. 
L. Saitta and F. Neff. 1997. Machine learning for 
information extraction. In M.T. Pazienza, editor, 
Information Extraction, volume 1299 of Lecture 
Notes in Artificial Intelligence Tutorial, pages 
171-191. Springer. 
