Inducing Terminology for Lexical Acquisition 
Roberto Basili, Gianluca De Rossi, Maria Teresa Pazienza 
Department of Computer Science, Systems and Production 
University of Roma, Tor Vergata 
{basili,derossi,pazienza}@info.utovrm.it
Abstract 
Little attention has been paid to terminology extraction with respect to the possibilities it offers to corpus linguistics and lexical acquisition. The problem of detecting terms in textual corpora has been approached in a complex framework. Terminology is seen as the acquisition of domain specific knowledge (i.e. semantic features, selectional restrictions) for complex terms and/or unknown words. This has useful implications for more complex text processing tasks (e.g. information extraction). A hybrid symbolic and probabilistic approach to terminology extraction has been defined. The proposed inductive method pays specific attention to the linguistic description of what terms are, as well as to the statistical characterization of terms as complex units of information typical of domain sublanguages. Experimental evidence for the proposed method is discussed.
1 Introduction 
Nowadays corpus processing techniques are widely adopted to approach the well-known lexical bottleneck problems in language engineering. Lexical acquisition methods rely on collocational analysis (pure statistics), robust parsing (syntax-driven acquisition) or semantic annotations as they are found in large thesauri or on-line dictionaries. The lexical information that triggers induction varies from simple word/tokens to syntactically annotated or semantically typed collocations (e.g. powerful vs. strong tea (Smadja, 1989)); syntactic disambiguation rules (e.g. (Hindle and Rooth, 1993), (Brill and Resnik, 1994)) or sense disambiguation rules are usually derived. Such information is lexical as it encodes constraints (of different types) at the word level, to be thus inherited by morphologic variants of a given lemma.
This strongly lexicalized knowledge, as it is extracted from corpus data, requires lexical entries to be known in advance in some morphologic database. POS taggers or lemmatizers are generally used to suitably map tokens to lemmas. It should be noted that lemmas in a corpus depend on the underlying sublanguage, and their nature and shape is not as general as what is usually encoded in a morphologic dictionary. As an example, let studio (i.e. study as a noun) be an entry in an Italian morphologic dictionary. Typical information in such a database is the following:

studio pos=noun gen=mas num=sing

The only legal morphologic variant of studio is studi (studies, with num=plur). When searching for studio in a corpus of environment related texts 1, we found this kind of occurrence (e.g. short contexts):

... studi di base ... (basic studies)
... studi di impatto ambientale ...
(*studies on the environmental impact)
... studi di fattibilità ... (feasibility studies)
... studi di riferimento ... (reference studies)
It is very common in a corpus (not balanced, thus focused on a limited domain) to find a set of specifications of nouns that have some specific properties:
• they are not always compositional (e.g. studio di base);

• they describe complex concepts (e.g. studi di fattibilità) in the underlying technical domain, so they are relevant for text understanding and classification/extraction;

• they select specific and independent senses of the related term: studi di base refers to the abstract notion of study as on-going research, while studi di fattibilità is not a research activity but a specific engineering task;

• the related nominal compounds show independent lexical properties. For example, all the examples are potential objects of verbs like carry out, do, ..., but only feasibility studies or studies on the environmental impact can be modelled by some techniques or policies. Furthermore, studies on the environmental impact have specific social and political implications that are no longer valid for the general notion of study.

1 Although our approach is in principle language independent, we will systematically describe rules and examples in Italian, as they have been derived from text corpora in Italian. The environmental corpus, called ENEA, is a collection of short scientific abstracts and newspaper articles dealing with pollution.
In the same environmental corpus the typical short contexts of the lemma attività (activity) include notions like:

attività umana (human activity),
attività antropica (anthropic activity),
attività di costruzione (building activity).

These very common instances show that lexical acquisition for attività or studio cannot be fully accomplished without discriminating the lexical properties of such pure collocations from those related to their complex nominals. The results of lexical acquisition should thus be different for entries like attività and attività antropica.
The underlying hypothesis is that complex concepts related to a lemma do not support all the generalizations related to the source lemma. In fact, whenever a concept is built it acquires an autonomous role within a language, so it behaves in an almost independent fashion. In order to capture the essential differences we need to select the proper set of terms in a given sublanguage, formalize them into independent lexicalizations and carry out a separate lexical acquisition for each of them.
A further aspect worth mentioning is that terms are generally understood as single lexical units during syntactic recognition. They are sentence fragments already parsed. Robust methods widely employed in computational linguistics are thus sensitive to a precise recognition of terms, as much of the ambiguity embedded within the term structures simply disappears after recognition has been accomplished. Let, for example, attività di costruzione and articoli da spiaggia (beach articles) be two terms. Sentence fragments like

... l'inizio della attività di costruzione ...
(the start of the building activity)

or

... trasportavano articoli da spiaggia ...
(they transported beach articles),

although inherently ambiguous (l'inizio della costruzione and trasportavano da spiaggia are sentence readings that also obey selectional constraints (e.g. to transport/bring from a place)), can be correctly parsed when the two terms are employed before syntactic analysis is triggered. Applying syntax driven lexical acquisition (e.g. (Grishman and Sterling, 1994) or (Basili et al., 1996)) after corpus specific term recognition and extraction highly improves the precision and complexity of the parsing activity. Experimental evidence will be discussed in later sections.
In synthesis, corpus driven terminology definition and recognition has positive implications on LA:

• Terms rather than words are the atomic units of information on which LA applies: more selective induction thus results in a more precise acquisition

• Terminological variants of a given term are hints for domain specific word sense disambiguation

• Terms are sentence fragments that have been already parsed: the lower ambiguity resulting from term recognition has a beneficial effect on the later syntagmatic analysis of the corpus
2 Terminology and Lexical Acquisition
In this framework, a term is more than a token or 
word (to be searched for) as it stands in a more sub- 
tle relation with a piece of information in a specific 
knowledge domain. It is a concept, as it requires a 
larger number of constraints on the information to 
be searched for in texts. Furthermore a term con- 
veys a well assessed (usually complex) meaning as 
long as a user community agrees on its content. As 
long as we are interested in automatic terminology 
derivation, we can look at terms as surface canonical 
forms of (possibly structured) expressions indicating 
those contents. 
A term is thus characterized by a general commitment about it, and this has some effects on its usage. Distributional properties of complex terms (nominals) differ significantly from those of their basic elements. Deviance from the usual distributional behavior of single components can be used both as a marker of non compositionality and as a specific hint of domain relevance. The detection of complex terms
assumes a crucial role in improving robust parsing 
and POS tagging for lexical acquisition, thus sup- 
porting a more precise induction of lexical proper- 
ties (e.g. PP disambiguation rules). This specific 
view extends and generalizes the classical notion of 
terminology as used in Information Science. 
Most of the domain specific terms we are interested in are nouns or noun phrases that generally
denote concepts in a knowledge domain. In order 
to approach the problem of terminological induction 
we thus need: 
1. to extract surface forms that are possible can- 
didates as concept markers; 
2. to decide which of those candidates are actu- 
ally concepts within a given knowledge domain, 
identified by the set of analyzed texts. 
Linguistic principles characterize classes of surface 
forms as potential terms (step 1). Note that the 
notion of terminological legal expression here is not 
equivalent to that of legal noun phrases. Concepts 
are lexicalized in surface forms via a set of opera- 
tions that imply semantic specifications. The way 
syntax operates such specification may be very com- 
plex and independent of the notion of grammatical well formedness.
The decision in step (2) is again sensitive to the principled way a language expresses concept specifications, but it also needs to be specific to the given knowledge domain, i.e. to the underlying sublanguage. Given the body of texts, the selective extraction should be sensitive to the different observed information. In this phase statistics is crucial to control the relevance of linguistically plausible forms of all the guessed terms.
3 Integrating linguistic and 
statistical information for term 
discovery 
The principled definitions of legal grammatical 
structures by which terms are expressed and the de- 
scription of their distributional properties in a sub- 
language are crucial for the automatic construction 
of a domain terminological dictionary. A number of 
methods for language driven terminological extrac- 
tion and complex nominals parsing and recognition 
have been proposed to support NLP and lexical ac- 
quisition tasks. They mainly differ in the empha- 
sis they give to syntactic and statistical control of 
the induction process. In (Church, 1988) a well-known purely statistical method for POS tagging is applied to the derivation of simple noun phrases that are relevant in the underlying corpus. On the contrary, more language oriented methods are those where specialized grammars are used. LEXTER (Bourigault, 1992) extracts maximal length noun phrases (mlnp) from a corpus, and then applies a special purpose noun phrase parsing to them in order to focus on significant complex nominals. Although the reported recall of the mlnp extraction is very high (95%), the precision of the method is not reported.
Voutilainen (1993) describes a noun phrase extraction tool (NPtool) based upon a lemmatizer for English (ENGTWOL) and on a Constraint Grammar parser. The set of potential well-formed noun
phrases are selected according to two parsers work- 
ing with different NP-hood heuristics. A very high 
performance of NP recognition is reported (98.5% 
recall, and 95% precision). 
A more statistically oriented approach is undertaken in (Daille et al., 1994), where a methodology for syntactic recognition of complex nominals is described.
Linguistic filters of morphological nature are also ap- 
plied. Corpus driven analysis is mainly based on mu- 
tual information statistics and the resulting system 
has been successfully applied to technical documen- 
tation, e.g. telecommunication. 
All these methods deal with the problem of NP 
recognition. As we are essentially interested to NP 
that are actual terms in a domain, we will need to 
decide which NPs are actual terms. We will define: 
1. well-formedness principles for term denotations
and a description of the different grammatical 
phenomena related to terms of a language 
2. distributional properties that distinguish terms 
from other (accidental) forms (e.g. non termi- 
nological complex nominals). 
3.1 Grammatical descriptions of terms in 
Italian 
It is generally assumed that a terminologic dictio- 
nary is composed of a (possibly structured) list of 
nouns, or complex nominals. Nominal forms are in fact lexicalizations of domain concepts: proper nouns,
acronyms as well as technical concepts are mostly 
represented as nominal phrases of different length 
and complexity. For this reason, we concentrated 
only on noun phrases analysis, as the main source 
of terminologic information 2. A term is obtained by 
applying several mechanisms that add to a source 
word (generally a noun) a set of further specifica- 
tions (as additional constraints of semantic nature). 
2 In lexical acquisition the role of other syntactic categories (e.g. verbs, adjectives, ...) is also very important, but the set of phenomena related to them is very different, as also outlined by (Basili et al., 1996b).
A detailed analysis of the role of syntactic modifiers and specifiers (De Rossi, 1996) revealed that legal structures for modifiers and specifiers in Italian
are mainly of two types: 
1. restrictive (or denotative) modifiers (postnom- 
inal participial, adjectival or prepositional 
phrases) 
2. appositive (or connotative) modifiers (prenomi- 
nal modifiers, i.e. adjectival phrases) 
Restrictive modifiers are generally used to constrain the semantic information related to the corresponding noun, via a further specification of a given type for that noun, as in scambi commerciali (*exchanges commercial): the referent noun is forced to belong to a restricted set of exchanges (that are in fact of commercial nature). On the contrary, appositive modifiers are used by the speaker/writer to add additional details: his own point of view or pragmatic information, as in la bianca cornice (the white frame) or la perduta gente (the lost people). Appositive modifiers do not correspond to any (shared) classification, but rather to the subjective speaker's point of view. Furthermore, prenominal modifications are rather infrequent in Italian. We thus
decided to focus only on restrictive modifiers, the 
best candidates to bring terminological (i.e. assessed 
classificatory) information. The set of syntactic phe- 
nomena that have been studied as good candidates 
for restrictive forms are: 
1. adjectival specification (via postnominal adjectives, as in inquinamento idrologico (*pollution hydrological))

2. nominal specification (postnominal appositions, as in vagone letto (wagon-lit), or Fiat Auto (Fiat Cars))

3. locative phenomena (postnominal proper nouns indicating locations, as in IBM Italia)

4. verbal specification (via postnominal past participles, as in siti inquinati (*sites polluted))

5. prepositional specification (via a particular set of postnominal prepositional structures, as in Istituto di Matematica (Institute of Mathematics), or barca a vela (sailing boat)). 3
Given the above linguistic principles, a special purpose grammar for potential terminological structures can be sketched. With a simple language of regular expressions, the grammar of adjectival, prepositional and participial restrictions can be expressed as:

Term ← noun A_P*
Term ← noun A_P (Cong A_P)*
Term ← noun+
Term ← noun (- noun)*
Term ← noun Term
A_P ← adjective | past_participle
Cong ← ',' | e

3 The set of prepositions that have been selected to introduce typical restrictive descriptions is: di, a, per, da. Only postnominal prepositional phrases introduced by one of these prepositions have been allowed in term expressions.
Prepositional postmodifiers are modeled according to the following rules:

Term ← noun P_P*
P_P ← Prep noun
Prep ← di | a | da | per
Note that the allowed structures are postnominal, due to the typical role of specifications in Italian.
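As an illustration, the core of the grammar above can be sketched as a regular expression over POS-tagged token strings. This is a minimal sketch, not the authors' implementation: the "word/TAG" encoding and the tag names (N, A, PPART, P) are our assumptions.

```python
import re

# Minimal sketch of the term grammar over POS-tagged tokens ("word/TAG").
# Tag names (N=noun, A=adjective, PPART=past participle, P=preposition)
# are illustrative assumptions, not the system's actual encoding.
TERM = re.compile(
    r"\S+/N"                               # head noun
    r"(?:"
    r"(?:\s\S+/(?:A|PPART))+"              # postnominal adjectives/participles
    r"|(?:\s(?:di|a|da|per)/P\s\S+/N)+"    # or licensed prepositional restrictions
    r")?"
)

def candidate_terms(tagged: str):
    """Return maximal spans of `tagged` matching the sketched term grammar."""
    return [m.group(0) for m in TERM.finditer(tagged)]

print(candidate_terms("siti/N inquinati/PPART"))
# ['siti/N inquinati/PPART']
print(candidate_terms("gli/D studi/N di/P fattibilita/N"))
# ['studi/N di/P fattibilita/N']
```

A full implementation would also cover the appositive, hyphenated and recursive Term ← noun Term rules, the last of which a single regular expression cannot express directly.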
3.2 Distributional properties and term 
extraction 
The recursive nature of some rules requires an iterative analysis of the corpus. The following algorithm is used:
1. Select singleton nouns whose distributional 
properties are those for terms and insert them 
in the terminologic dictionary (TD) 
2. Use the valid terms in TD to trigger the gram- 
mar and build complex nominals cn 
3. Select those cn whose distributional properties 
are those for terms and insert them in TD. 
4. Iterate steps 2 and 3 to build longer cn 4 
Note that newly found complex terms, added to TD 
in step 3, force a re-estimation of term probabilities 
obtained by a further corpus scanning, so that their 
heads are not counted twice. 
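The four steps can be rendered schematically as follows. This is a sketch under assumed interfaces: the corpus object, the grammar-driven `expand` function and the two statistical tests are placeholders, not the actual system components.

```python
# Schematic rendering of the iterative algorithm above; the corpus interface
# and helper predicates are illustrative assumptions, not the paper's code.

def build_term_dictionary(corpus, is_singleton_term, expand, passes_mi, max_iter=2):
    """Iteratively grow the terminological dictionary TD.

    is_singleton_term(noun)  -- distributional test on singleton nouns
    expand(term, corpus)     -- complex nominals built around `term` by the grammar
    passes_mi(cn)            -- mutual-information test on a complex nominal
    """
    # Step 1: seed TD with distributionally relevant singleton nouns
    td = {n for n in corpus.nouns() if is_singleton_term(n)}
    # Steps 2-4: iterate, growing longer complex nominals around valid terms
    for _ in range(max_iter):          # terms longer than 5 words are rare: 2 passes
        new = {cn for t in td for cn in expand(t, corpus) if passes_mi(cn)}
        if new <= td:
            break
        td |= new
        corpus.recount(td)             # re-estimate counts so heads are not counted twice
    return td
```

In the actual system the two statistical tests correspond to the distributional criteria defined in the remainder of this section.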
The validation of a limited set of potential surface forms as actual terms is crucial for lowering the complexity of the above algorithm. Given the grammar, we need criteria to decide which surface forms, among those reflecting the typical structure of a potential term, are actual lexicalizations of relevant concepts of the corpus. The kind of observations that are available from
the corpus are: (i) the set of lemmas met in the 
texts, (ii) the set of their well formed restrictions 
(i.e. complex nominals) and (iii) the distributional 
properties of entries in (i) and (ii). We firstly estab- 
lish when a singleton lemma is a relevant concept 
by using distributional properties of nouns. Then 
we characterize which restrictions of those terms are 
valid lexicalizations of more specific concepts. We 
proceed as follows: 
4 As terminological units longer than 5 words are very infrequent in any sublanguage, we decided to stop after the second iteration.
1. Select the set of lemmas that by themselves 
are markers of relevant concepts in the cor- 
pus. Lemmas are detected according to their 
frequency in the observed language sample as 
well as to their selectivity, i.e. how they parti- 
tion the set of documents. This phase produces 
an early TD dictionary of simple terminological 
elements. 
2. Extend TD also with those (well-formed) restrictions, cn(l), of any l ∈ TD, according to the mutual information they exchange with l.
Select and Extend depend on distributional properties of simple lemmas and complex nominals, respectively.
The distributional property needed for the Select 
step is the term specificity. Specific nouns are those 
frequently occurring in a corpus, but whose selectiv- 
ity in sets of documents is very high, that is: they 
are very frequent in a (possibly small) set of docu- 
ments and very rare in the rest. In order to capture 
such behavior we use two scores: the frequency tij of 
a term i in a document j and the inverse document 
frequency of a term (Salton, 1989). Given a term i, its inverse document frequency is defined as follows:

idf_i = log2 (N / df_i)
where dfi is the number of documents of the corpus 
that include term i, while N is the total number of 
documents in the collection. The following criterion is defined to capture singleton terms: if there exists at least one document where a noun i is required as an index (because it is relevant for that document and selective with respect to other documents), then such a noun denotes a relevant domain term (i.e. specific concept). In order to decide, we rely on idf_i and t_ij as follows.
DEF: (Singleton term). A noun i is a term if at least one document j exists for which:

w_ij = t_ij · log2 (N / df_i) ≥ τ     (1)

w_ij captures exactly the notion of specificity required in the Select step of our algorithm. Potential heads of terminological entries are selected according to their selective power in the corpus. Even very rare words of the corpus can be captured by (1).
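Criterion (1) can be sketched as follows; the threshold τ and the toy document counts are illustrative, while the real system computes t_ij and df_i from the POS-tagged corpus.

```python
import math

def singleton_terms(doc_term_freqs, tau):
    """Select nouns i with w_ij = t_ij * log2(N / df_i) >= tau in some document j.

    doc_term_freqs: one {noun: frequency} dict per document (toy input here).
    """
    n_docs = len(doc_term_freqs)
    df = {}
    for doc in doc_term_freqs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    selected = set()
    for doc in doc_term_freqs:
        for term, tf in doc.items():
            if tf * math.log2(n_docs / df[term]) >= tau:
                selected.add(term)
    return selected

docs = [{"studio": 8, "ambiente": 1},
        {"ambiente": 2},
        {"ambiente": 1, "impatto": 3},
        {"impatto": 1}]
# "studio" is rare across documents but frequent in one (w = 8 * log2(4/1) = 16);
# "ambiente" occurs everywhere, hence it is unselective and rejected.
print(sorted(singleton_terms(docs, tau=3.0)))
# ['impatto', 'studio']
```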
In the Extend step of the algorithm we need to evaluate the mutual information values of phrase structures like:

head Mod1
head Mod1 Mod2
head Mod1 Mod2 ... Modn
Mutual Information between two words x and y is defined as (Fano, 1961):

I(x, y) = log2 ( prob(x, y) / (prob(x) prob(y)) )

and it can be estimated by a maximum likelihood method as in (Dagan, 1993):

Î(x, y) = log2 ( N · freq(x, y) / (freq(x) freq(y)) )
where freq(x, y) is the frequency of the joint event (x, y), and freq(x), freq(y) and N are the frequency of x, the frequency of y and the corpus size, respectively. In order to apply the standard definition of mutual information we need to extend it to capture the specific nature of the joint event head-modifier (H M1). Note that M1 denotes postnominal adjectives or past participles, but also prepositional phrases, like dello Stato in territorio dello Stato. We decided to estimate the Mutual Information of such structures in a left to right fashion. The rightmost modifier (i.e. M1 in (H M1) structures, or Mn in (H M1 ... Mn)) is considered as the right event y, and every left incoming substructure (i.e. H or H M1 ... Mn-1) is represented as a
single event x. The generalized evaluation of Mutual Information for cn = ((H, M1, M2, ..., Mn-1), Mn) is thus:

Î(cn) = log2 ( N · freq(H, M1, ..., Mn-1, Mn) / (freq(H, M1, ..., Mn-1) · freq(Mn)) )     (2)
As an example, a term like debito pubblico (public debt) receives a mutual information score according to the following figure:

Î(x, y) = log2 ( N · freq(debito, pubblico) / (freq(debito) freq(pubblico)) )

while debito pubblico estero (foreign public debt) produces the following ratio:

Î(x, y) = log2 ( N · freq(debito, pubblico, estero) / (freq(debito, pubblico) freq(estero)) )
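The left-to-right estimate (2) and the two examples above can be sketched as follows; the count table holds toy values, not corpus statistics.

```python
import math

def mi(cn, freq, n_words):
    """Estimate (2): cn is a tuple like ('debito', 'pubblico', 'estero').
    The rightmost modifier is the right event; everything to its left is
    collapsed into a single left event."""
    left, right = cn[:-1], cn[-1:]
    return math.log2(n_words * freq(cn) / (freq(left) * freq(right)))

# Toy n-gram counts (illustrative only)
counts = {("debito",): 50, ("pubblico",): 40, ("estero",): 20,
          ("debito", "pubblico"): 30, ("debito", "pubblico", "estero"): 10}
N = 100_000

print(mi(("debito", "pubblico"), counts.get, N))            # ≈ 10.55
print(mi(("debito", "pubblico", "estero"), counts.get, N))  # ≈ 10.70
```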
DEF: (Complex Terms). A complex nominal cn = ((H, M1, M2, ..., Mn-1), Mn) is selected as a term (and thus included in TD) if the following condition holds:

Î(cn) > δ(H)     (3)

The threshold δ(H) depends on the noun H, as it is evaluated according to the statistical distribution of every complex nominal headed by H 5. The set of singleton terms is exactly the same set that a clas-
sical indexing model (Salton,1989) obtains from the 
document collection (i.e. the corpus). The Extend phase captures all the relevant specifications of the singleton terms, compiles a more appropriate dictionary (for the corpus) and structures it into hierarchically organized entries.

5 In the experimental tests the best values for δ have been obtained as a function of the mean and variance of the Î distribution over the set of cn headed by H.
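Footnote 5 leaves the exact form of δ(H) unspecified; one plausible instantiation (an assumption on our part, not the authors' formula) is the mean of the Î scores over the cn headed by H plus a multiple of their standard deviation.

```python
import statistics

def delta(mi_scores, k=1.0):
    """Hypothetical threshold delta(H): mean + k * stdev of the I-hat scores
    of the complex nominals headed by H. The exact function is not given in
    the paper; footnote 5 only says it depends on mean and variance."""
    if len(mi_scores) < 2:
        return mi_scores[0] if mi_scores else float("inf")
    return statistics.mean(mi_scores) + k * statistics.pstdev(mi_scores)

scores = [10.5, 10.7, 6.2, 7.9]   # toy I-hat values for nominals headed by H
print(delta(scores))               # ≈ 10.70
```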
4 Implementation Issues 
The model described in the previous section has been 
used to implement a system for terminology derivation from a corpus. The system relies upon the POS tagging activity as it is carried out within a LA framework (e.g. the ARIOSTO system (Basili et al., 1996)) and extracts a full terminologic dictionary TD of:

1. simple terms (i.e. nouns) as seeds of a terminologically structured dictionary (selected according to (1))

2. complex nominal forms of some of those seeds, generated by the grammar and filtered according to (3).
Terminology extraction is triggered after POS tag- 
ging. Morphologic analysis is rerun according to the 
compiled TD. This feedback allows the system to 
exploit complex term extraction before activating 
syntactic recognition, in order to prune out signif- 
icant components of grammatical ambiguity. This 
improves the overall ability of the linguistic proces- 
sor and supports term oriented rather than lemma 
oriented lexical acquisition. 
A dedicated subsystem has been developed to support manual validation of single terms. In Figure 1 a screen dump of the graphical interface that supports the interactive validation (or removal) of terms in TD is shown. TD is hierarchically organized in separate sections where singleton terms dominate all their specified subconcepts. A section is the set of terms that share the same term head. A term like smaltimento dei rifiuti (waste disposal) has the noun "smaltimento" (disposal) as its term head. A specific section includes terms like smaltimento dei rifiuti, smaltimento di materiale tossico, smaltimento di gas di scarico, .... In Figure 1 the head noun debito (debt) is reported: the section related to debito includes all its validated specifications (e.g. debito pubblico (public debt), debito pubblico estero (foreign public debt), ...).
5 Experimental Set-Up 
In this section we describe the experimental set-up 
used to evaluate and assess the described model of 
terminological derivation. 
The method has been tested over two corpora of Italian documents. The first corpus (ENEA) is a
Figure 1: User Interface for Terminology Validation
Table 1: Distribution of indexes headed by attività

Index            1to20  21to40  41to60  61to80  81to106
Method RI:
attività           3      2       5       1       4
Method TI:
attività
antropica
di costruzione
produttiva
umana
collection of scientific abstracts on the environment, 
made of about 350.000 words. The second corpus 
(Sole24Ore) is an excerpt of financial news from the Sole 24 Ore economic newspaper, of about 1.300.000
words. The terminology extraction has been run over both the corpora. From the ENEA corpus we derived a dictionary of about 2828 terms. From the Sole24Ore corpus 5639 terms have been extracted.
In order to carry out the experiments we used a subset of the ENEA corpus, in order to measure performance over manually validated documents. The specific nature of our tests required the definition of particular performance evaluation measures. In fact, together with the classical notions of recall and precision, we also used data compression, i.e. the percentage of incorrect syntactic data that are no longer produced when specific terminology is used. A further index is the average ambiguity, defined according to the notion of collision set (Basili et al., 1994). In order to accomplish the task, further reference information has been used: two standard domain specific thesauri have been used for comparing the result of the terminology extraction in the environmental domain (ENEA corpus).
5.1 Linguistic analysis of corpus data 
In Table 1 the section headed by attività, as it has been derived from the ENEA corpus, is shown. The specific nature of the corpus is well reproduced by the data. Here two specific senses of the lemma attività are captured: natural and biological activity, as in attività antropica, and human activities (like attività produttiva (productive activity) or attività di costruzione (building activity)). These latter have specific implications (for what concerns artificial pollution) in the environment.
Table 1 also reports the distribution of the term in a set of 106 documents. In method RI, terms have been selected by classical inverse document frequency (Salton, 1989) applied to singleton lemmas (i.e. attività). In method TI we ran inverse document frequency after a terminology driven lemmatization of documents (i.e. using complex terms as source lemmas). The two sections of the table show that no index has been lost by the TI method (all of the 15 indexes have been found). This result is more general: the TI method produces more indexes. Over the 106 documents RI extracts 476 simple indexes while TI extracts 732 (terminological) indexes.
Again in Table 1, 5 of the fifteen indexes found by the TI method are complex nominals. In the set of documents from 1 to 20 (1to20 column) these allow discrimination between attività and attività antropica.
Such a higher discriminating power is required not only for document classification/retrieval but, first of all, for lexical acquisition: in this technical domain, in fact, it seems necessary to rely on the information that attività is typically carried out by humans while attività antropica is not. We are convinced that these are the typical selectional constraints to be captured by corpus driven lexical acquisition methods. Finer lexicalizations (like attività antropica) are the only way to provide a better input to the target acquisition tasks.
5.2 Experiment 1: Effectiveness of the terminology extraction
The aim of this experiment was to test the ability 
of the method to capture relevant concepts in the 
sublanguage. We run this test on the environmental 
domain (ENEA corpus). The reference term dic- 
tionary was manually compiled by a team of three 
domain experts, culturally heterogeneous. We got a 
complete list of terms (simple nouns as well as com- 
plex nominals) to be used as a test-set (RT). The 
reference document set was a collection of 106 doc- 
uments. The experts compiled a set of 482 terms 
organized in 155 sections (i.e. relevant head nouns). 
Each section thus includes 3.12 terms on average. For the sake of completeness we selected two large hand-coded thesauri for the environment: the CNR dictionary
Table 2: Smaltimento in different dictionaries

                          RT  CNR  AIB  TD
smaltimento dei fanghi     X   .    .   X
smaltimento dei rifiuti    X   X    X   X
smaltimento delle scorie   X   .    .   X
Table 3: Global Performance of different dictionaries 
Dictionary CNRD AIB TD 
# of Relevant Terms 41 45 331 
# of Terms 880 180 472 
Recall 8,87% 9,74% 71,56% 
Precision 4,66% 23,94% 70,13% 
(CNR, 1995) (which includes 9613 terms) and the AIB dictionary (AIB, 1995). Both these dictionaries, as well as the automatically generated dictionary TD, have been compared with the reference RT. The comparison has been carried out throughout the different aligned sections. The alignment of the section related to the head smaltimento is reported in Table 2 ("X" means the presence of the term in the corresponding dictionary, while "." denotes its absence).
Any dictionary D can thus be evaluated by measuring precision, i.e.

precision = |RTterms ∩ Dterms| / |Dterms|

and recall, i.e.

recall = |RTterms ∩ Dterms| / |RTterms|
For example, within the section related to the head smaltimento, we have 3 RT terms, of which 1 is in CNR and AIB respectively, and 3 are in TD. When applying the recall and precision definitions to every section of the RT dictionary we obtained the average performance scores reported in Table 3 over the three dictionaries.
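The section-level evaluation can be sketched as plain set operations; the smaltimento sets below reproduce Table 2, and averaging such scores over the 155 sections yields Table 3.

```python
def precision_recall(rt_terms, d_terms):
    """Precision and recall of a dictionary section D against the reference RT."""
    common = rt_terms & d_terms
    precision = len(common) / len(d_terms) if d_terms else 0.0
    recall = len(common) / len(rt_terms) if rt_terms else 0.0
    return precision, recall

# The smaltimento section: 3 RT terms, of which CNR covers 1 and TD all 3
rt = {"smaltimento dei fanghi", "smaltimento dei rifiuti", "smaltimento delle scorie"}
cnr = {"smaltimento dei rifiuti"}
td = set(rt)

print(precision_recall(rt, cnr))   # (1.0, 0.3333...)
print(precision_recall(rt, td))    # (1.0, 1.0)
```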
5.3 Experiment 2: Shallow parsing with 
terminological knowledge 
Consulting a terminologic dictionary before activating a shallow syntactic analyzer is helpful to solve several morphological and syntactic ambiguities. For example, given the sentence 6

L'ufficiale della Guardia di Finanza visitò l'aeroporto di Fiumicino
(The officer of the Finance Guard visited the Fiumicino airport)
a typical shallow syntactic analyzer (SSA) (Basili et al., 1992) produces the following elementary syntactic links (esl), due to the syntactic ambiguity of prepositional phrases (PP), e.g. ((di finanza), (di Fiumicino)):

N_P_N ufficiale della guardia
N_P_N ufficiale di finanza
N_P_N guardia di finanza
N_V ufficiale visitò
V_N visitò aeroporto
N_P_N aeroporto di fiumicino
V_P_N visitò di fiumicino

6 This sentence has been extracted from the Sole24Ore corpus.
As each sentence reading cannot assign more than
a single referent to each PP, we can partition the
set of esl into several collision sets (i.e. sets of esl
that cannot belong to the same sentence reading, ac-
cording to (Basili et al., 1994)). The sample sentence
gives rise to the following collision sets:
{ (ufficiale di finanza) (guardia di finanza) }
{ (ufficiale visitò) }
{ (aeroporto di fiumicino) (visitò di fiumicino) }
{ (ufficiale della guardia) }
{ (visitò aeroporto) }
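The partition into collision sets can be sketched as below, under the simplifying assumption that two esl's collide exactly when they propose different heads for the same dependent (the same PP or argument); the tuple representation is hypothetical, not the analyzer's actual data structure.

```python
from collections import defaultdict

# elementary syntactic links as (link_type, head, dependent) tuples
# NOTE: simplified encoding, assumed for illustration only
esls = [
    ("N_P_N", "ufficiale", "della guardia"),
    ("N_P_N", "ufficiale", "di finanza"),
    ("N_P_N", "guardia",   "di finanza"),
    ("N_V",   "ufficiale", "visitò"),
    ("V_N",   "visitò",    "aeroporto"),
    ("N_P_N", "aeroporto", "di fiumicino"),
    ("V_P_N", "visitò",    "di fiumicino"),
]

# links competing for the same dependent cannot co-occur in one reading
collision_sets = defaultdict(list)
for typ, head, dep in esls:
    collision_sets[dep].append((head, dep))

for links in collision_sets.values():
    print(links)
```

On the sample sentence this produces the five collision sets listed above, two of which (those keyed by di finanza and di fiumicino) are ambiguous.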
When terminology is available, many complex nomi-
nals are retained as single tokens and several am-
biguities disappear. In the Sole24Ore corpus our
method produced both the terms guardia di finanza
and aeroporto di Fiumicino, so that the final list of
esl reduces to
N_P_N ufficiale della guardia-di-finanza
N_V ufficiale visitò
N_V guardia-di-finanza visitò
V_N visitò aeroporto-di-fiumicino
and no ambiguous (i.e. non-singleton) collision set
remains. We have two positive effects on the parsing
activity. The first is data compression: the
overgeneration typically due to the shallow gram-
matical approach is significantly limited. In our ex-
ample the initial 7 elementary syntactic links ob-
tained in the absence of terminology reduce to 4, with
an overall data compression of (7-4)/7 = 42.8%. An
extended experiment has been carried out on
a subset of 500 sentences of the corpus. The use of
terminology reduces the number of elementary syn-
tactic links from 500 to 403, with a corresponding
overall data compression of about 20%. Furthermore, the
detection of a term carried out over single tokens
that are morphologically ambiguous also improves
morphological recognition. In fact the detection
of a chain of tokens that are part of the same term
implies a specific choice of the grammatical cate-
gory of each token, thus augmenting the selectivity
of POS tagging. Over the same subset of the corpus
we measured a decrement of 4% in the number of
morphological derivations produced with terminol-
ogy against the recognition carried out in the absence of
any terminological knowledge.
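The retokenization that yields this compression can be sketched as a greedy longest match over the induced term list; the hyphen-joining convention follows the reduced esl list above, while the function itself is an illustrative assumption, not the system's actual implementation.

```python
# Hypothetical sketch: replace multiword terms from the induced dictionary
# with single hyphenated tokens (greedy longest match, left to right).
TERMS = {("guardia", "di", "finanza"), ("aeroporto", "di", "fiumicino")}
MAXLEN = max(len(t) for t in TERMS)

def retokenize(tokens):
    out, i = [], 0
    while i < len(tokens):
        # try the longest possible term match starting at position i
        for n in range(min(MAXLEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in TERMS:
                out.append("-".join(tokens[i:i + n]))  # term becomes one token
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

sent = "l' ufficiale della guardia di finanza visitò l' aeroporto di fiumicino"
print(retokenize(sent.split()))
```

On this sentence the two terms collapse into single tokens, so links internal to a term (and the competing attachments of its fragments) are no longer generated, which is where the (7-4)/7 compression of the example comes from.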
Table 4: Performance evaluation of terminology-driven
parsing

Parser  Ambiguity  #Collisions  Recall  Precision
SP      0.60       3.2          0.65    0.67
TP      0.55       2.9          0.68    0.71

A second positive aspect of having an available
domain specific terminology is the reduction of the
underlying syntactic ambiguity and the increase of
parser precision. As shown in the example, many PP
ambiguities disappear as soon as a set of complex nom-
inals is detected. This has a strong implication for
shallow (or robust, as widely accepted in the literature)
parsing. We conducted a systematic analysis of cor-
rect parsing results by contrasting a parser with and
without access to domain terminology. The analy-
sis of the results has been performed by comparing
collision sets obtained by the two runs over a set of
100 sentences. Four performance scores have been
evaluated: the degree of ambiguity (i.e. the ratio be-
tween the number of ambiguous esl's and the to-
tal number of derived esl's); the average ambiguity
(expressed by the average cardinality of the colli-
sion sets, i.e. the number of reciprocally ambiguous
esl's); finally, precision and recall have been mea-
sured according to a hand validation of the derived
syntactic material 7. The analysis has been carried
out specifically for prepositional esl's (i.e. noun-
preposition-noun, verb-preposition-noun and adjective-
preposition-noun links). Results are reported in Ta-
ble 4, where separate columns express the scores for
the different runs: a simple parser (SP) and a ter-
minology-driven parser (TP). The simple
parser obtains several complex nominals, but only
as syntactic structures, so that it fails in detecting
higher order syntactic links (i.e. syntactic relations
between complex nominals and other sentence seg-
ments). In these cases we penalized also the recall of
the SP method, so that the difference between the
two methods lies not only in the amount of persist-
ing ambiguity (i.e. precision), but also in coverage
(better captured by recall).
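The four scores can be computed from the collision sets roughly as follows; the collision sets and gold standard below are invented for illustration, and averaging cardinality over all sets (rather than only the ambiguous ones) is an assumption, since the text does not specify it.

```python
# Illustrative computation of the four Table 4 scores.
# NOTE: the data below are invented, not the paper's 100-sentence sample.
def parser_scores(collision_sets, gold):
    all_esl = [e for cs in collision_sets for e in cs]
    ambiguous = [e for cs in collision_sets if len(cs) > 1 for e in cs]
    degree = len(ambiguous) / len(all_esl)    # ambiguous esl's / all esl's
    # average cardinality of the collision sets (over all sets: an assumption)
    avg_card = sum(len(cs) for cs in collision_sets) / len(collision_sets)
    correct = [e for e in all_esl if e in gold]
    precision = len(correct) / len(all_esl)   # correct esl's / detected esl's
    recall = len(correct) / len(gold)         # correct esl's / all correct esl's
    return degree, avg_card, precision, recall

sets = [
    [("ufficiale", "di finanza"), ("guardia", "di finanza")],
    [("aeroporto", "di fiumicino"), ("visitò", "di fiumicino")],
    [("ufficiale", "della guardia")],
]
gold = {("guardia", "di finanza"), ("aeroporto", "di fiumicino"),
        ("ufficiale", "della guardia")}
print(parser_scores(sets, gold))
```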
6 Conclusions 
In this paper a method for the automatic extraction
of terminological (possibly complex) units of infor-
mation from corpora is presented. The proposed
method combines principles of grammatical correct-
ness with statistical constraints on the distributional
7Precision is the number of detected correct esl's over
the total number of detected esl's, while recall is the
number of detected correct esl's over the number of cor-
rect esl's.
properties of the detected domain terms. In an in-
cremental fashion, NPs are first selected as possible
candidates for term denotation and then inserted in
an incremental terminological dictionary according
to their mutual information value. The experimen-
tal test has been difficult, as a precise notion of what
is a relevant term in a domain is vague and sub-
jective. Tests against a domain specific, user oriented
dictionary have been carried out, in comparison with
large scale thesauri in the domain. The significant
improvement over these standard sources is very
encouraging. The method has been widely applied to
different corpora and has proved to be easily
portable without any heavy customization. As it re-
lies upon simple POS tagging, it is widely portable
to other languages, as soon as NP grammars are
available. Feedback of the terminological extraction
process to the morphological analysis has also been
designed. A measure of the improvement that ter-
minological NP recognition brings to the activity
of a shallow parser for LA has been carried out. The
result is an overall improvement: data compression
is around 5% while syntactic ambiguity elimination
is about 10%. Recall and precision of the syntactic
analysis are consequently higher.
The main result of this method is to support finer
lexicalization, in the form of complex nominals, for lex-
ical acquisition. Lexical acquisition based on col-
locations between terms (and not simple lemmas)
provides more granular information on lexical senses
as well as (syntactic or semantic) selectional con-
straints. The success of this method allows the design of
automatic methods for taxonomic (thesaurus-like)
knowledge generation. Distributional, as well as syn-
tactic, knowledge is a crucial source of information
for large scale similarity estimation among detected
terms.

References 
AIB, 1995, Ensoli A., Marconi G., Sistema di Classifi-
cazione dei Documenti di Interesse Ambientale, Rapporti
AIB-7, ISSN 1121-1482.
Basili, R., M.T. Pazienza, P. Velardi, 1992, A Shal-
low Syntactic Analyzer to extract word associations from
corpora, Literary and Linguistic Computing, vol. 7,
n. 2, 114-124.
Basili, R., A. Marziali, M.T. Pazienza, 1994, Modelling syn-
tactic uncertainty in lexical acquisition from texts, Jour-
nal of Quantitative Linguistics, vol. 1, n. 1.
Basili, R., M.T. Pazienza, P. Ve-
lardi, 1996a, An Empirical Symbolic Approach to Natural Lan-
guage Processing, Artificial Intelligence, vol. 85, August
1996.
Basili, R., M.T. Pazienza,
P. Velardi, 1996b, Integrating General Purpose and Corpus-
based Verb Classifications, Computational Linguistics,
1996.
Bourigault, D., 1992, Surface Grammatical Analysis
for the Extraction of Terminological Noun Phrases, Proc.
of COLING 1992, Nantes, France, pp. 977-981.
Brill, E., P. Resnik, 1994, A rule-based approach to
prepositional phrase attachment disambiguation, Proc.
of COLING 94, 1198-1204.
Church, K., 1988, A Stochastic Parts Program and
Noun Phrase Parser for Unrestricted Text, Proc. of 2nd
Conf. on Applied Natural Language Processing, Austin,
pp. 136-143.
CNR, 1995, Thesaurus Italiano Generale per
l'Ambiente, Consiglio Nazionale delle Ricerche (CNR),
Rapporto Scientifico 10/95, Ed. Bruno Felluga, Edizione
31/07/1995, Roma.
Dagan, I., K. Church, 1993, Identifying and
Translating Technical Terminology, IJCAI 1993.
Daille, B., E. Gaussier, J.M. Langé,
1994, Towards Automatic Extraction of Monolingual
and Bilingual Terminology, COLING-94, August, Ky-
oto, Japan, 1994.
De Rossi, G., 1996, Elaborazioni Statistiche di Corpora
Testuali mirate all'Acquisizione di Conoscenza per la
Costruzione di Thesaura, Faculty of Engineering, Uni-
versity of Roma, Tor Vergata, 1996.
Fano, R., 1961, Transmission of Information, Cambridge,
Mass., MIT Press.
Hindle, D., M. Rooth, 1993, Structural Ambiguity
and Lexical Relations, Computational Linguistics, 19(1):
103-120.
Salton, G., 1989, Automatic Text Processing: the Transfor-
mation, Analysis and Retrieval of Information by Com-
puter, Addison-Wesley Publ.
