Unsupervised Learning of Syntactic Knowledge: 
methods and measures 
R. Basili (*), A. Marziali (*), M.T. Pazienza (*), P. Velardi (#)
(*) Dipartimento di Informatica, Sistemi e 
Produzione, Universita' di Roma Tor Vergata 
(ITALY) {basili, pazienza}@info.utovrm.it
(#) Istituto di Informatica, Universita' di Ancona 
(ITALY) 
velardi@anvax1.cineca.it
Abstract 
Supervised methods for ambiguity resolution learn in
"sterile" environments, in the absence of syntactic noise.
However, in many language engineering applications
manually tagged corpora are neither available nor easily
implemented. On the other hand, the "exportability" of
disambiguation cues acquired from a given, noise-free,
domain (e.g. the Wall Street Journal) to other domains
is not obvious.
Unsupervised methods of lexical learning likewise
have many inherent limitations. First, the types of
syntactic ambiguity phenomena occurring in real do-
mains are much more complex than the standard V
N PP patterns analyzed in the literature. Second, espe-
cially in sublanguages, syntactic noise seems to be a
systematic phenomenon, because many ambiguities oc-
cur within identical phrases. In such cases there is little
hope of acquiring higher statistical evidence for the cor-
rect attachment. Class-based models may reduce this
problem only to a certain degree, depending upon the
richness of the sublanguage and upon the size of the
application corpus.
Because of these inherent difficulties, we believe that 
syntactic learning should be a gradual process, in which 
the most difficult decisions are made as late as possible, 
using increasingly refined levels of knowledge. 
In this paper we present an incremental, class-based, 
unsupervised method to reduce syntactic ambiguity. 
We show that our method achieves a considerable com- 
pression of noise, preserving only those ambiguous pat- 
terns for which shallow techniques do not allow reliable 
decisions. 
Unsupervised vs. supervised models of 
syntactic learning 
Several corpus-based methods for syntactic ambiguity 
resolution have been recently presented in the litera- 
ture. In (Hindle and Rooth, 1993), hereafter H&R, lexi-
calized rules are derived according to the probability of 
noun-preposition or verb-preposition bigrams for am- 
biguous structures like verb-noun-preposition-noun se- 
quences. This method has been criticised because it
does not consider the PP object in the attachment de-
cision scheme. However, collecting bigrams rather than
trigrams reduces the well-known problem of data sparse-
ness.
In subsequent studies, trigrams rather than bigrams 
were collected from corpora to derive disambiguation 
cues. In (Collins and Brooks, 1995) the problem of data
sparseness is approached with a supervised back-off
model, with interesting results. In (Resnik and Hearst, 
1993) class-based trigrams are obtained by generalizing 
the PP head, using WordNet synonymy sets. In (Rat- 
naparkhi et al, 1994) word classes are derived automat- 
ically with a clustering procedure. (Franz, 1995) uses a 
loglinear model to estimate preferred attachments ac- 
cording to the linguistic features of co-occurring words 
(e.g. bigrams, the accompanying noun determiner, 
etc.). (Brill and Resnik, 1994) use transformation- 
based error-driven learning (Brill, 1992) to derive dis- 
ambiguation rules based on simple context information 
(e.g. right and left adjacent words or POSs). 
All these approaches need extensive collections of 
positive examples (i.e. hand corrected attachment in- 
stances) in order to trigger the acquisition process. 
Probabilistic, backed-off or loglinear models rely en- 
tirely on noise-free data, that is, correct parse trees or 
bracketed structures. In general the training set is the 
parsed Wall Street Journal (Marcus et al, 1993), with 
few exceptions, and the size of the training samples 
is around 10-20,000 test cases. Some methods do not
require manually validated PP attachments, but word
collocations are collected from large sets of noise-free
data. Unfortunately, in language engineering applica-
tions, manually tagged corpora are neither widely avail-
able nor easily implemented [1]. On the other hand, the
"exportability" of disambiguation cues obtained in a 
given domain (e.g. WSJ) to other domains is not obvi- 
ous. 
Unsupervised methods have, on their side, serious 
limitations: 
• First, the types of syntactic ambiguity phenomena
  that occur are on average much more com-
  plex than the standard verb-noun-preposition-noun
  patterns analyzed in the literature. The H&R method
  has proved very weak on complex phenomena
  like verb-noun-preposition-noun-preposition-noun se-
  quences (see (Franz, 1995)). Other methods (super-
  vised or not) do not consider more complex ambigu-
  ous structures.
• Second, in real environments, and especially in sub-
  languages, syntactic noise seems to be a systematic
  phenomenon. Many ambiguities occur within several
  identical phrases, hence the "wrong" and the "right"
associations may gain the same statistical evidence. 
Therefore, there are intrinsic limitations to the pos- 
sibility of using purely statistical approaches to am- 
biguity resolution. 
The nature of ambiguous phenomena in untagged 
corpora has not been studied in detail in the literature,
although such an analysis would be very useful from a
language engineering standpoint. Accordingly, section
2 is devoted to an experimental analysis of the complex-
ity and recurrence of ambiguous phenomena in sub-
languages. This analysis demonstrates that syntactic
disambiguation at large cannot be achieved using
knowledge induced exclusively from the corpus. We
think that corpus based techniques are useful to signifi- 
cantly reduce, not to eliminate, the ambiguous phenom- 
ena. In section 3, we describe an unsupervised, class- 
based, incremental, syntactic disambiguation method 
that is aimed at reducing noisy collocates, to the ex- 
tent that this is allowed by the observation of corpus 
phenomena. The approach that we support is to re- 
duce syntactic ambiguity through an incremental pro- 
cess. Decisions are deferred until enough evidence has 
been gained of a noisy phenomenon. First, a kernel of 
shallow grammatical competence is used to extract a 
collection of noise-prone syntactic collocates. Then, a 
global data analysis is performed to review local choices 
and derive new statistical distributions. This incremen- 
tal process can be iterated to the point that the system
reaches a kernel of "hard" cases for which there is no
more evidence for a reliable decision.

[1] It is not just a matter of time, but also of the required
linguistic skills (see for example (Marcus et al., 1993)).

The output of
the last iteration represents a less noisy environment 
on which additional learning processes can be triggered
(e.g. sense disambiguation, acquisition of subcatego- 
rization frames, ...). These later inductive phases may 
rely on some level of a priori knowledge, like for exam- 
ple the naive case relations used in the ARIOSTO_LEX 
system (Basili et al, 1993c , 1996). 
Complexity and recurrence of 
ambiguous patterns in corpora 
In the previous section we pointed out that unsuper- 
vised lexical learning methods must cope with complex 
and repetitive ambiguities. We now describe an exper- 
iment to measure these phenomena in corpora. In this 
experiment, we wish to demonstrate that: 
• The types of syntactic ambiguity are much more
  complex than V N PP or N N PP sentences. In a
  realistic environment, the correct attachment must
  be selected among several possibilities, not just two.

• The fundamental assumption of most common statis-
  tical analyses is that the events being analyzed (pro-
  ductive word pairs or triples in our case) are indepen-
  dent. Instead, ambiguous patterns are highly repeti-
  tive, especially in sublanguages. This means that in
  many cases, unless we work in the absence of noise, the
  "correct" and "wrong" associations in an ambiguous
  phrase acquire the same or similar statistical evi-
  dence.
To conduct the experiment, we used a shallow syntac- 
tic analyzer (SSA) (Basili et al, 1994) to extract word 
associations from two very different corpora in Italian 
(a scientific corpus of environmental abstracts, called 
ENEA, and a legal corpus on taxation norms, called 
LD) [2].
Given a corpus, SSA produces an extensive database 
of elementary syntactic links (esl). Typical esl 
classes express the following dependency relations: 
noun-preposition-noun (N_P_N), verb-preposition-noun 
(V_P_N), adjective-conjunction-adjective (Adj_C_Adj) 
and others. An esl has the following structure: 
esl( h, mod(p, w) ),

where h is the head of the underlying phrasal structure
and mod(p, w) denotes the modifier, with p the
preposition (or connective) and w the modifier head.
Ambiguity is generated by multiple morphologic 
derivations and intrinsic language ambiguities (PP ref- 
erences, coordination, etc.). Given a sentence, SSA 
[2] SSA is based on a DCG model with controlled skip rules.
produces in general a noise-prone set of esl's, some of 
which represent colliding interpretations. The defini- 
tion of Collision Set (CS) is the following: 
DEF (Collision Set): A Collision Set (CS) is the set
of syntactic groups, derived from a given sentence, that
share the same modifier mod(p, w).
To smooth the weight of ambiguous esl's in lexical 
learning, each detected esl is weighted by a measure 
called plausibility. To simplify, the plausibility of a de- 
tected esl is roughly inversely proportional to the num- 
ber of mutually excluding syntactic structures in the 
text segment that generated the esl (see (Basili et al, 
1993a) for details). 
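As a concrete reading of this weighting, the following sketch
assumes, as a simplification of the full definition in (Basili et
al., 1993a), that the plausibility of an esl is exactly the
reciprocal of the size of its collision set:

```python
def plausibility_weights(collision_sets):
    """Weight each esl by the reciprocal of the size of its
    collision set, i.e. the number of mutually excluding
    attachments competing for the same modifier."""
    weights = {}
    for cs in collision_sets:
        for esl in cs:
            weights[esl] = 1.0 / len(cs)
    return weights

# three N_P_N attachments competing for the modifier (di, credito)
cs = [("azienda", "di", "credito"),
      ("vigilanza", "di", "credito"),
      ("servizio", "di", "credito")]
w = plausibility_weights([cs])
# each colliding esl receives plausibility 1/3 (0.333)
```

This reproduces the 0.333 and 0.166 values attached to the
three-way and six-way collision sets shown below.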
In the following, we show examples of collision sets 
extracted from the LD (an English word-by-word trans-
lation is provided for the sentence fragments that gen- 
erated a collision set). It is important to observe that 
the complexity does not arise simply from the number 
of colliding tuples but also from the structure of am-
biguous patterns (e.g. non-consecutive word strings, as
in the second example). Bold characters identify the
mod(p, w) shared by colliding tuples. Local plausibility
values are reported on the right. 
1. Examples of Simple Collision sets: 
1.1 Minimal Attachment (consecutive word
strings):

su richiesta del ministro per le finanze , il [ ( servizio di
vigilanza sulle aziende ) di credito ] (* service of control of
agencies of credit ) controlla l'esattezza delle attestazioni
contenute nel certificato .

g_N_p_N(2, azienda, di, credito)    0.333
g_N_p_N(4, vigilanza, di, credito)  0.333
g_N_p_N(6, servizio, di, credito)   0.333
1.2 Non-Minimal Attachment (non-consecutive
word strings):

i sostituti d'imposta devono [ ( presentare la dichiarazione
di-cui-a quarto comma dell'articolo 9 , relativamente ai paga-
menti fatti e agli utili distribuiti nell'anno 1974 ) entro il
15-aprile-1975 ] . (* must present the declaration of which
at comma 4th of item 9, relatively to the payments done and
the profits distributed in the year 1974, within april 15,
1975 )

g_N_p_N(17, articolo, entro, x_15_aprile_1975)         0.166
g_N_p_N(7, distribuire, entro, x_15_aprile_1975)       0.166
g_Adv_p_N(14, relativamente, entro, x_15_aprile_1975)  0.166
g_N_p_N(19, comma, entro, x_15_aprile_1975)            0.166
g_V_p_N(24, presentare, entro, x_15_aprile_1975)       0.166
To measure the complexity of the ambiguous struc- 
tures, we collected from fragments of the two corpora 
all the ambiguous collision sets, i.e. those with more
than one esl. 10,433 collision sets were found in the
ENEA corpus and 30,130 in the LD [3]. Figure 1 plots
the percentage of colliding esl's vs. the cardinality of
collision sets. The average size of ambiguous collision 
sets is about 4 in both corpora. 
Of course SSA introduces additional noise due to its 
shallow nature (see the referred papers for an evaluation
of its performance [4]), but as far as our experiment is con-
cerned (measuring the complexity of collision sets) SSA
still provides a good testbed. In fact, some esl's can be
missed in a collision set, or some spurious attachments
can be detected, but on average these phenomena
are sufficiently rare and in any case they tend to be 
equally probable. 
Figure 1: Percentage of collision sets vs. number of collid-
ing tuples for the LD. 
In the second experiment we measure the recurrence 
of ambiguous patterns. This phenomenon is known
to be typical of sublanguages, but has never been analyzed
in detail. A straightforward measure of recurrence is
provided by the average Mutual Information of collid- 
ing esl's. This figure measures the probability of co- 
occurrence of two esl's in a collision set. If the Mu- 
tual Information is high, it means that the measured 
phenomena (productive word tuples) do not indepen- 
dently occur in collision sets, i.e. they systematically 
occur in reciprocal ambiguity in the corpus. The conse- 
quence is that statistically based lexical learning meth- 
ods are faced not only with the problem of data sparse- 
ness (events that are never or rarely encountered), but 
also with the problem of systematic ambiguity (events 
[3] The LD test corpus is larger, and in addition, the legal
language is more verbose and less concise than the scientific
style that characterizes the ENEA corpus.

[4] We measured an average of 80% precision and 75% recall
over three corpora, one of which is in English.
Table 1: Mutual Information of co-occurring esl's

                              LD            ENEA
                              (30,130 CS)   (10,433 CS)
Average MI                    13.65         12.9
σ                             1.8           0.84
σ²                            3.2           0.72
Average frequency of esl's    1.9           1.43

Table 2: Mutual Information of esl's occurring with fre-
quency higher than average

                              LD            ENEA
Average MI                    11.60         11.60
σ                             2.05          1.12
σ²                            4.23          1.27

Table 3: Mutual Information of right-generalized esl's in
two domains

             LD           LD                 ENEA
             (all esl's)  (high freq. esl's) (all esl's)
Average MI   11.5                            11.00
σ            3.10         2.15               2.65
σ²           9.62         4.66               7.05
that always occur in the same sequence). This phe-
nomenon is likely to be more relevant in sublanguages 
(medicine, law, engineering) than in narrative texts, but 
sublanguages are at the basis of many important appli- 
cations. 
The average Mutual Information was evaluated by 
first computing, in the standard way, the Mutual Infor- 
mation of all the pairs of esl's that co-occurred in at 
least one collision set: 
                            Prob(esl_i, esl_j)
MI(esl_i, esl_j) = log2 -------------------------     (1)
                        Prob(esl_i) Prob(esl_j)

where the probabilities are evaluated over the space of
collision sets with cardinality > 1.
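Equation (1) and the averaging can be sketched as follows (a
toy implementation: esl's are represented as plain strings, and
the probabilities are maximum-likelihood estimates over the
space of ambiguous collision sets):

```python
import math
from collections import Counter
from itertools import combinations

def average_pair_mi(collision_sets):
    """Average Mutual Information of esl pairs that co-occur in
    collision sets; probabilities are estimated over the space of
    ambiguous collision sets (cardinality > 1), as in equation (1)."""
    sets = [cs for cs in collision_sets if len(set(cs)) > 1]
    n = len(sets)
    single = Counter()
    pair = Counter()
    for cs in sets:
        items = sorted(set(cs))
        for e in items:
            single[e] += 1
        for a, b in combinations(items, 2):
            pair[(a, b)] += 1
    mis = [math.log2((c / n) / ((single[a] / n) * (single[b] / n)))
           for (a, b), c in pair.items()]
    return sum(mis) / len(mis)

# two esl pairs that always occur together in reciprocal ambiguity:
# each pair gets MI = log2(0.5 / (0.5 * 0.5)) = 1 bit
print(average_pair_mi([["x", "y"], ["x", "y"], ["a", "b"], ["a", "b"]]))
```

Systematically co-occurring pairs thus push the average MI
towards perfect correlation, which is the effect measured in
Tables 1 and 2.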
Tables 1 and 2 summarize the results of the exper-
iment, showing the average MI, standard deviation and
variance for the two domains. The values in Table 1
show that the average MI is close to perfect
correlation [5] and has a small variance, especially in the
ENEA corpus, which is in technical style. This result
could be biased by the esl's occurring just once in the 
collision sets, hence we repeated the computation for 
the pairs of esl's occurring at a frequency higher than
the average (> 2, in both domains). The results are
reported in Table 2. It can be seen that the values remain
rather high, still with a small variance. 
Clustering the esl's would seem an obvious way to
reduce this problem. Therefore, in a subsequent exper-
iment we clustered the heads of PPs in the collision sets
using a set of high-level semantic tags (for a discussion
on semantic tagging see (Basili et al., 1992, 1993b)) [6].

[5] Two esl's occurring exactly as often as the average (1.9 in
the LD) are in perfect correlation when their MI is equal to 13.8.

For
example, the esl 
V_P_N( to_present, within, april_15_1974 )

is generalized as:

V_P_N( to_present, within, TEMPORAL_ENTITY ).

Because of sense ambiguity, the collision sets became
20,353 in the ENEA corpus and 42,681 in the LD. The
average frequency of "right-generalized" esl's is now
4.28 in the ENEA and 4.64 in the LD. The results are
summarised in Table 3. 
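The right-generalization step can be sketched as follows; the
tag lexicon shown is a small hypothetical fragment of the
manually assigned Italian tag set:

```python
# hypothetical fragment of the high-level semantic tag lexicon
TAGS = {
    "persona": "HUMAN_ENTITY",
    "azienda": "HUMAN_ENTITY",
    "x_15_aprile_1975": "TEMPORAL_ENTITY",
}

def right_generalize(esl, tags):
    """Replace the modifier head of an esl (head, prep, mod_head)
    with its high-level semantic tag, when one is known."""
    h, p, w = esl
    return (h, p, tags.get(w, w))

print(right_generalize(("presentare", "entro", "x_15_aprile_1975"), TAGS))
# ('presentare', 'entro', 'TEMPORAL_ENTITY')
```

Note that a word with several senses maps to several tags in
the real lexicon, which is why generalization increases the
number of collision sets.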
Notice that the phenomenon of systematic ambiguity 
is much less striking (lower MI and higher variance), 
though it is not eliminated. It is also important that 
the two corpora, though very different in style, behave 
in the same way as far as systematic ambiguity is con- 
cerned. 
For example, consider the following sentence frag- 
ment: 
... imposta sul reddito delle persone ... ( *... tax on the 
income of people ...) 
that occurs in the LD corpus almost 200 times. 
The global plausibilities of the syntactic collocates (i)
imposta-di-persona (tax-of-people) and (ii) reddito-di-
persona (income-of-people) are (i) 91.66 and (ii) 93.69.
Therefore a reliable decision is not allowed by the set 
of syntactic observations found in the corpus. Further- 
more, similar sentences, like for example 
... imposta sul reddito delle societa'... (*tax on the 
income of companies...), 
always have a HUMAN_ENTITY as head modifier. 
Therefore, the fact that (reddito di persona) is correct 
cannot be captured even when comparing the general- 
ized patterns (reddito di HUMAN_ENTITY) and (im- 
posta di HUMAN_ENTITY). 
[6] Class-based approaches are widely employed. Clusters
are created by means of distributional techniques in (Rat-
naparkhi et al., 1994), while in (Resnik and Hearst, 1993)
low-level synonym sets in WordNet are used. Instead, we use
high-level tags (human, time, abstraction etc.), manually as-
signed in Italian domains and automatically assigned from
WordNet in English domains. For the sake of brevity, we do
not re-discuss the matter here. See the aforementioned papers.
The conclusion we may derive from these two exper- 
iments is that most syntactic disambiguation methods 
presented in literature are tested in an unrealistic en- 
vironment. This does not mean that they don't work, 
but simply that their applicability to real domains is 
yet to be proven. Application corpora are noisy, may 
not be very large, and include repetitive and complex 
ambiguities that are an obstacle to reliable statistical 
learning. 
The experiments also stress the importance of class 
based models of lexical learning. Clustering "similar" 
phenomena is an obvious way of reducing the problems 
just outlined. Unfortunately, Table 3 shows that gener-
alization reduces, but does not eliminate, the problem of
repetitive patterns.
An incremental architecture for 
unsupervised reduction of syntactic 
ambiguity 
The previous section shows that we need to be more 
realistic in approaching the problem of syntactic am-
biguity resolution at large. Certain results can be ob-
tained with purely statistical methods, but there are 
many complex cases for which there seems to be a clear 
need for less shallow techniques. 
The approach that we have undertaken is to attack 
the problem of syntactic ambiguity through increasingly 
refined learning phases. The first stage is noise com- 
pression, in which we adopt an incremental syntactic 
learning method, to create a more suitable framework 
for subsequent steps of learning. Noise compression is 
performed essentially by the use of shallow NLP and 
statistical techniques. This method is described here- 
after, while the subsequent steps, which use deeper (rule-
based) levels of knowledge, are implemented in the
ARIOSTO_LEX lexical learning system, described in
(Basili et al., 1993b, 1993c and 1996).
A feedback algorithm for noise reduction 
The process of incremental noise reduction works as 
follows: 
1. First, use a surface grammatical competence (i.e. 
SSA) to derive the (noise prone) set of observations. 
2. Cluster the collocational data according to semantic 
categories. 
3. Apply class-based disambiguation operators to reduce
the initial source of noise, by first disambiguating the 
non-persistent ambiguity phenomena. 
4. Derive new statistical distributions. 
5. Repeat steps 2-4 on the remaining (i.e. persistent)
ambiguous phenomena. 
The incremental disambiguation activity stops when 
no more evidence can be derived to solve new ambigu- 
ous cases. 
In order to accomplish the outlined noise reduction 
process we need: (i) a disambiguation operator and 
(ii) a disambiguation strategy to eliminate at each step 
"some" noisy collocations. 
The class-based disambiguation operator is the Mu-
tual Conditioned Plausibility (MCPl) (Basili et
al., 1993a). Given an esl, the value of its correspond-
ing MCPl is defined by the following:
DEF (Mutual Conditioned Plausibility): The Mutual
Conditioned Plausibility (MCPl) of a prepositional at-
tachment esl(w, mod(p, n)) is:

                                     Σ_{y∈Γ} pl(esl(w, mod(p, y)))
MCPl(esl(w, mod(p, n))) = -----------------------------------------------------------  (2)
                          Σ_{h,y∈Γ} pl(esl(h, mod(p, y))) · Σ_{y} pl(esl(w, mod(p, y)))

where Γ is the high-level semantic tag assigned to the mod-
ifier n, and pl() is the plausibility function. Examples
of the generalized esl's were presented in the previous
section. For example, to the computation of the MCPl
of esl(reddito, (di, persona)) contribute esl's like

esl(reddito, (di, professionista)), esl(reddito, (di, azienda))

where professionista, persona and azienda are in-
stances of HUMAN_ENTITY.
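Under our reading of equation (2), the MCPl can be sketched as
follows; the plausibility values are hypothetical, `pl` maps
(head, preposition, modifier-head) triples to plausibilities,
and `gamma` stands for the class Γ of modifier heads sharing
the semantic tag of n:

```python
def mcpl(esl, pl, gamma):
    """Mutual Conditioned Plausibility of esl = (w, p, n): the
    plausibility mass of (w, p) over the class gamma of n,
    normalized by the mass of (p, gamma) for any head and the
    mass of (w, p) for any modifier head (reconstructed eq. 2)."""
    w, p, n = esl
    num = sum(v for (h, pp, y), v in pl.items()
              if h == w and pp == p and y in gamma)
    class_mass = sum(v for (h, pp, y), v in pl.items()
                     if pp == p and y in gamma)
    head_mass = sum(v for (h, pp, y), v in pl.items()
                    if h == w and pp == p)
    return num / (class_mass * head_mass)

# hypothetical plausibility values inspired by the LD examples
pl = {("reddito", "di", "persona"): 0.5,
      ("reddito", "di", "azienda"): 0.5,
      ("imposta", "di", "persona"): 0.5}
gamma = {"persona", "azienda", "professionista"}  # HUMAN_ENTITY
print(mcpl(("reddito", "di", "persona"), pl, gamma))
```

All esl's whose modifier head carries the same tag thus
contribute to the score of a single lexicalized attachment.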
After a first scan of the corpus by the SSA and af-
ter the computation of global MCPl values, a primary
knowledge base is available. This knowledge is fully
corpus driven, and it is obtained without a preliminary
training set of hand-tagged patterns. Each esl in a colli-
sion set has its own MCPl value, which has been globally
derived from the corpus. The MCPl is thus employed
to remove the less plausible attachments proposed by
the grammar, with a consequent reduction in the size of
the related collision sets. When more than one esl remains
in a collision set the system is not forced to decide, and
a further disambiguation step is attempted later.
After the first scan of the corpus by means of the SSA 
grammar, the corpus is re-written as a set of possibly 
ambiguous Collision Sets, i.e. if C is the corpus and 
CSi a Collision Set, we have: 
C = CS_0 ∪ CS_1 ∪ ... ∪ CS_i ∪ ... ∪ CS_N

CS_i ∩ CS_j = ∅,  for i ≠ j,  i, j = 0, 1, 2, ..., N
where N is the total number of collision sets found in 
the corpus. 
The cardinality of a generic collision set is directly 
proportional to the degree of ambiguity of its members. 
The feedback algorithm tries to reduce the cardinality
of all the CS_i step by step: esl's with "lower" MCPl val-
ues (as globally derived from the whole corpus) are filtered
out; the MCPl values are then redistributed among the
remaining esl's. In a picturesque way, we can say that
discarded esl's are damned (hell is the right place for them),
while surviving esl's await the next judgment (limbo
is the right place for this wait state); at the end of
the algorithm, if a single winner esl remains, it gains
paradise. Persistently ambiguous esl's of the corpus
may remain ambiguous within the corresponding
collision sets: limbo will be their place forever. The al-
gorithm tries to obtain as many paradise esl's (i.e.
singleton CSs) as possible, but is robust against persis-
tently ambiguous phenomena.

The general feedback algorithm is illustrated in Table
4. It should be noted that the above feedback strategy
has three main phases: (step 2.2) statistical induction
of syntactic preference scores; (step 2.3) a testing phase,
which is necessary in order to quantify the performance
of the disambiguation criteria derived from the current
statistical distributions; (step 2.4.1) a learning phase, to
filter out the syntactically odd esl's (i.e. esl's with locally
low MCPl values).

Table 4: A general feedback algorithm for noise reduction

(1) Use SSA to derive all the syntactic observations O
    from the corpus;
    set the initial performance index PCF' to 0;
(2) REPEAT
    (2.1) Replace PCF with PCF'
    (2.2) Evaluate the MCPl for each esl ∈ O
    (2.3) Use the MCPl on a subset of the corpus (test set)
          and evaluate the current performance
          index PCF'
    (2.4) IF PCF' > PCF THEN:
          (2.4.1) Rewrite the collision sets of O,
                  removing hell esl's,
                  into a new set of observations O'
          (2.4.2) Replace O with O'
    UNTIL PCF' <= PCF
(3) STOP

Table 5: Disambiguation Algorithm: Learning Phase

Let CS = {e_1, e_2, ..., e_N} be any
collision set in the corpus, where the e_i's are esl's.
Let 1/N be the prior probability (pprior).
Let MCPl(e_i) be the Mutual Conditioned
Plausibility (2) of e_i.
The posterior probability of e_i, ppost_i,
is defined as
    ppost_i = MCPl(e_i) / Σ_{j=1..N} MCPl(e_j)
Let σ ∈ [0, 1] be a given learning threshold.
For each e_i in CS DO:
    IF ppost_i / pprior < 1 - σ THEN
        REMOVE e_i from CS, i.e. PUT it
        in the hell set;
    OTHERWISE e_i is a limbo esl.
IF ∀ i ≠ j, e_i is in hell, THEN
    MOVE e_j into the paradise set.
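The learning step of Table 5 can be sketched as follows
(assuming, per our reconstruction of the table, a uniform 1/N
prior over the collision set; the MCPl scores are hypothetical):

```python
def learning_step(cs_mcpl, sigma):
    """One learning pass over a single collision set (Table 5):
    an esl whose posterior/prior ratio falls below 1 - sigma is
    damned (hell); the others stay in limbo, unless exactly one
    survivor remains, in which case it gains paradise."""
    n = len(cs_mcpl)
    prior = 1.0 / n                      # uniform prior over the CS
    total = sum(cs_mcpl.values())
    hell, limbo = [], []
    for esl, score in cs_mcpl.items():
        post = score / total             # posterior probability
        (hell if post / prior < 1.0 - sigma else limbo).append(esl)
    paradise = limbo if len(limbo) == 1 else []
    limbo = [] if paradise else limbo
    return hell, limbo, paradise

# hypothetical MCPl scores for a 3-way collision set, sigma = 0.2
hell, limbo, paradise = learning_step({"e1": 8.0, "e2": 1.0, "e3": 1.0}, 0.2)
# e2 and e3 go to hell; e1 is the single survivor and gains paradise
```

Two esl's with near-identical MCPl (the systematic-ambiguity
case of section 2) would both survive and stay in limbo.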
Learning and Testing disambiguation cues 
According to the disambiguate as late as possible strat-
egy, the learning and testing phases have different ob-
jectives:

• During the learning phase, the objective is to take
  only highly reliable decisions, by eliminating those
  esl's with a very low plausibility, while delaying un-
  reliable choices.

• During the test phase, the objective is to evaluate the
  ability of the system at separating, within each colli-
  sion set, correct from wrong attachment candidates.

This results in two different disambiguation algo-
rithms: the learning phase is used only to remove hell
esl's from the collision sets, without forcing any par-
adise choice (e.g. a maximum likelihood candidate). In
the test phase esl's are classified as (locally) correct or
wrong according to their relative values of MCPl.
The learning phase, called the i-th learning step, is guided
by the following algorithm:

1. Identify all Collision Sets of the corpus, CS_i, i =
   1, 2, ..., N;

2. Apply the preference criterion to each CS_i in order
   to classify hell, limbo or paradise esl's;

3. Redistribute plausibility values among the limbo
   esl's of each CS_i.
Step 2 is further specified in Table 5. 
In step 3 of the Learning algorithm, the new plausi-
bility values are redistributed among the surviving esl's
according to the following rule:

                                                         pl_i(CS_i)
pl_{i+1}(esl(h, mod(p, w))) = pl_i(esl(h, mod(p, w))) ------------------   (3)
                                                      pl_{i+1}(CS_{i+1})

where i is the learning step and CS_{i+1} (⊂ CS_i) does not
contain esl's that have been placed in hell during step
i.

After each learning step the upgraded plausibility val-
ues provide newer MCPl scores that are more reliable
because the hell esl's have been discarded.
28 
Table 6: Disambiguation Algorithm: Test Phase

Let CS = {e_1, e_2, ..., e_N} be any
collision set in the test set,
and let Ncases be the number of test cases.
Let 1/N be the prior probability (pprior).
Let MCPl(e_i) be the Mutual
Conditioned Plausibility (2) of e_i.
The posterior probability of e_i, ppost_i, is defined as
    ppost_i = MCPl(e_i) / Σ_{j=1..N} MCPl(e_j)
Let τ ∈ [0, 1] be a given test threshold.
For each CS and for each e_i ∈ CS DO:
    IF ppost_i / pprior > 1 + τ THEN
        IF e_i is correct, i.e. manually validated, THEN
            ++TruePositives;
        OTHERWISE
            ++FalsePositives;
    OTHERWISE IF ppost_i / pprior < 1 - τ THEN
        IF e_i is correct THEN
            ++FalseNegatives;
        OTHERWISE
            ++TrueNegatives;
    ++Ncases

precision = (TruePositives + TrueNegatives) /
    (TruePositives + TrueNegatives + FalsePositives + FalseNegatives)

recall = (TruePositives + TrueNegatives) / Ncases

coverage = (TruePositives + TrueNegatives + FalsePositives +
    FalseNegatives) / Ncases
The evaluation of each learning step is carried out by
testing the syntactic disambiguation on a selected set of
corpus sentences where ambiguities have been manually
solved.

The general test algorithm is defined in Table 6.
In Table 6, notice that precision and recall evalu-
ate the ability of the system both at eliminating truly
wrong esl's and at accepting truly correct esl's, since, as
remarked in section 2, our objective is noise compres-
sion rather than full syntactic disambiguation. Notice
also that, because of their different classification objec-
tives, learning and testing use different decision thresh-
olds.
Experimental Results. 
To evaluate numerically the benefits of the feedback al- 
gorithm, several experiments and performance indexes 
have been evaluated. The corpus selected for experi-
menting with the incremental technique is the LD: the size
of the corpus is about 500,000 words. The SSA gram-
mar for the LD has about 25 DCG rules and generates
240,493 esl's from the whole corpus. Of these, only 10%
of the esl's are initially unambiguous, while all the remain-
ing ones are limbo esl's. A test set of 1,154 hand-corrected
collision sets was built; 5,285 different esl's are in the
test set. An average of 25.9% correct groups have been
found in the test set, again demonstrating a great level
of ambiguity in the source data.

Table 7: Performance values of the MCPl without learning

τ      Coverage   Recall   Precision
0.0    99.8%      0.75     0.749
0.05   95.0%      0.72     0.75
0.1    87.4%      0.69     0.79
0.2    77.8%      0.62     0.80
0.5    49.9%      0.42     0.84
At first, we need to study the system classification
parameters, σ and τ (see Tables 5 and 6). Dur-
ing the learning phase, we wish to eliminate as many
hell esl's as possible, because the more noise has been
eliminated from the source syntactic data, the more re-
liable is the application of the later inductive operators
(i.e. the ARIOSTO lexical learning system). However we
know from the experiments in section 2 that the com-
petence that we are using (shallow NLP and statistical
operators) is insufficient to cope with highly repetitive
ambiguities. The threshold σ is therefore a crucial pa-
rameter, because it must establish the best trade-off
between precision of choices (i.e. it must classify as hell
truly noisy esl's) and impact on noise compression (i.e.
it must remove as much noise as possible).

Table 7 shows the results.
To select the best value for σ, we measured the values
of recall and precision (defined in Table 6) according to
different values of τ. These measures have been derived
from the early (thus noisy) state of knowledge, where
just the SSA grammar, and no learning, was applied to
the corpus.

According to the results of Table 7, τ = 0.2 was
selected for the best trade-off between recall, pre-
cision and coverage. The learning steps have then been
performed with a threshold value σ = 0.2 over the LD
corpus. In each phase the corresponding recall and
precision have been measured.
The results of the experiment are summarised in Fig- 
ure 2. Figure 2.A plots recall versus precision that 
have been obtained in the early (prior to learning) stage 
(Step 0), after 1 (Step 1) and 2 (Step 2) learning iter- 
ations. Each measure is evaluated for a different value 
of the testing threshold τ, which varies from 0.5 to 0.0
from left to right in Fig. 2.A. 
Figure 2.B plots the Information Gain (Kononenko
and Bratko, 1991), an information-theoretic index that,
roughly speaking, measures the quality of the statisti-
cal distributions of the correct vs. wrong esl's. Fig-
ure 2.C measures the Data Compression, that is, the
mere reduction of esl's in the corpus. The compres-
sion is measured as the ratio between hell esl's and
the number of observed esl's. Figure 2.D plots
the Coverage, i.e. the number of decided cases over
the total number of possible decisions. Finally, Table
8 reports the performance (at the Step 0 phase) of the
H&R Lexical Association (LA) [7]. We experimented with
this disambiguation operator because the H&R method
has, among the others, the merit of being easily repro-
ducible.

Table 8: Performance values of the LA without learning

τ      Coverage   Recall   Precision
0.0    100%       0.610    0.610
0.05   96.5%      0.594    0.615
0.1    93.8%      0.578    0.616
0.2    86.4%      0.544    0.631
0.5    71.9%      0.465    0.647
The first four figures give a global overview of the 
method. In Fig. 2.A (Step 1), a significant improvement 
in precision can be observed. For τ = 0.5 the improve-
ment in recall (.5) and precision (.85) is more noticeable.
Furthermore, a better coverage (60%) is shown in Fig.
2.D (Step 1). A further index to evaluate the status of 
the system knowledge about the PP-attachment prob- 
lem is the Information Gain ((Kononenko and Bratko, 
1991) and (Basili et al., 1996)). The posterior probabil- 
ity (see the algorithms in Tables 5 and 6) improves over 
the "blind" prior probability insofar as it increases 
the confidence of correct esl's and decreases the con- 
fidence of wrong esl's. The improvement is quantified 
by means of the number of saved bits necessary to de- 
scribe the correct decisions when moving from prior to 
posterior probability. The Information Gain does not 
depend on the selected thresholds, since it acts on all 
the probability values, and it is related to the com- 
plexity of the learning task. It gives a measure of the 
global trend of the statistical decision model. A signif- 
icant improvement measured over the testset (12% to 
24% relative increment) is shown by Fig. 2.B as a 
result of the learning steps. As discussed in (Basili et 
al., 1994), the Information Gain produces performance 
results that may contrast with precision and recall. 
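A hedged sketch of how such an information score can be computed (the per-case formula below is our reading of Kononenko and Bratko's criterion; the `cases` structure pairing prior and posterior probabilities of the correct attachment is an illustrative assumption):

```python
import math

def information_gain(cases):
    """Average per-decision bits saved when moving from the prior
    to the posterior probability of the correct attachment
    (a sketch of the Kononenko & Bratko information score).

    `cases` is a list of (prior, posterior) probabilities assigned
    to the *correct* attachment of each ambiguous case.
    """
    bits = 0.0
    for prior, post in cases:
        if post >= prior:
            # the model gained confidence in the correct decision:
            # bits saved relative to the prior description length
            bits += -math.log2(prior) + math.log2(post)
        else:
            # the model lost confidence: penalise symmetrically,
            # via the complement probabilities
            bits -= -math.log2(1 - prior) + math.log2(1 - post)
    return bits / len(cases)
```

Because every probability value contributes, the index is threshold-free, as the text notes.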
In fact, in learning step 2, we observed decreased 
precision and recall. This overlearning effect is 
common to feedback algorithms. Furthermore, 
the small size of the corpus is likely to anticipate 
7Unlike H&R, we did not use the t-score as a decision 
criterion, but forced the system to decide according to differ- 
ent values of the threshold τ, for the sake of readability of the 
comparison. Technical details of our treatment of the LA 
operator within our grammatical framework can be found 
in (Basili et al., 1994). 
this phenomenon. The problem is clearly due to the 
highly repetitive ambiguities. The system quickly re- 
moves from the corpus syntactically wrong esl's with 
low MCP1. But now consider a collision set with 
two esl's that almost always occur together. Their 
MCP1 tends to acquire exactly the same value, so 
they will stay in limbo forever. But if one of the two, 
accidentally the wrong one, has even minimal additional 
evidence with respect to its competitor, this initially 
small advantage may be emphasized by the plausibility 
redistribution rule (see footnote 8). Hence, once the 
learning algorithm reaches the "hard cases" and is still 
forced to discriminate, it gets stuck and may take 
accidental decisions. This phenomenon occurs very 
early in our domains, and it could easily be foreseen 
given the high correlation between esl's that we measured. 
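The amplification effect can be illustrated with a toy simulation (the `redistribute` function below is a deliberately naive stand-in, not the paper's actual plausibility redistribution rule):

```python
def redistribute(plausibilities, rounds=5, boost=1.1):
    """Toy illustration of the 'rich get richer' dynamics: a tiny
    initial advantage inside a collision set grows when plausibility
    mass is repeatedly shifted toward the current leader and the
    distribution is renormalised.  The boost factor and round count
    are arbitrary illustrative choices.
    """
    p = list(plausibilities)
    for _ in range(rounds):
        leader = p.index(max(p))   # the esl currently ahead
        p[leader] *= boost         # reward it, however small its lead
        total = sum(p)
        p = [x / total for x in p] # renormalise to a distribution
    return p
```

Starting from a near-tie such as (0.505, 0.495), a few rounds are enough to open a clear gap, regardless of which competitor was in fact the correct attachment.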
For the current experimental setup, our data show 
a significant reduction of noise, with a 40% 
compression of the data after step 1 and a correspond- 
ing slight improvement in precision-recall, given the 
complexity of the task (see the Lexical Association per- 
formance in Table 8 for a comparison). However, the 
phenomena that we analyzed in Section 2 have a neg- 
ative impact on the possibility of a longer incremental 
learning process. We do not believe that experiment- 
ing over different domains would give different results: 
in fact, the Legal and Environmental sublanguages are 
very different in style, and not so narrow in scope. 
Rather, we believe that the size of the corpora may 
in fact be too small. We could hope for a higher variability 
of language patterns by training over corpora of 
1-2 million words. 
8Whereas, for more independent phenomena, the rule should 
emphasize the right attachments. 
[Figure 2: Incremental Learning: Experimental Results. 
(A) Precision vs. Recall for learning phases Step 0, Step 1 
and Step 2, with σ = 0.2; (B) Information Gain in the three 
learning steps; (C) Data Compression in the three learning 
steps; (D) Coverage in the three learning steps, as a 
function of the test threshold τ.] 
Further improvements could also be obtained using 
a more refined discriminator than MCP1, but there is 
no free lunch. If the corpus is our unique source of 
knowledge, it is not possible to learn things for which 
there is no evidence. Only if we can rely on some a- 
priori model of the world, even a naive one9, to guide 
difficult choices, can we hope for a better coverage 
of repetitive phenomena. 
Conclusions 
As a conclusion we may claim that corpus-driven lexical 
learning should result from the interaction of cooperat- 
ing inductive processes triggered by several knowledge 
sources. The described method is a combination of nu- 
merical techniques (e.g. the probability driven MCP1 
disambiguation operator) and some logical devices: 
• a shallow syntactic analyzer that embodies a surface 
and portable grammatical competence helpful in trig- 
gering the overall induction; 
• a naive semantic type system to obviate the problem 
of data sparseness and to give the learning system 
some explanatory power. 
The interaction of such components has been ex- 
ploited in an incremental process. In the experiments, 
the performance over a typical NLP task10 (i.e. PP- 
disambiguation) has been significantly improved by this 
cooperative approach. Moreover, from the language en- 
gineering standpoint, the main consequences are a sig- 
nificant data compression and a corresponding improve- 
ment of the overall system efficiency. 
One of the purposes of this paper was to show that, 
despite the good results recently obtained in the field 
of corpus-driven lexical learning, we must still demon- 
strate that NLP techniques, after the advent of lexical 
statistics, are industrially competitive. And one good 
way of doing so is by measuring ourselves against the full 
complexities of language. More effort should thus be de- 
voted to evaluating the performance of lexical learning 
methods in real-world, noisy domains. 

REFERENCES 
(Basili et al., 1992) Basili, R., Pazienza, M.T., Velardi, P., 
Computational Lexicons: the Neat Examples and the 
Odd Exemplars, Proc. of Third Int. Conf. on Applied 
Natural Language Processing, Trento, Italy, 1-3 April, 
1992. 
(Basili et al., 1993a) Basili, R., A. Marziali, M.T. Pazienza, 
Modelling syntactic uncertainty in lexical acquisition 
from texts, Journal of Quantitative Linguistics, vol. 1, 
n. 1, 1994. 
(Basili et al., 1993b) Basili, R., M.T. Pazienza, P. Velardi, 
What can be learned from raw texts?, Journal of Ma- 
chine Translation, 8:147-173, 1993. 
(Basili et al., 1993c) Basili, R., M.T. Pazienza, P. Velardi, 
Acquisition of selectional patterns, Journal of Machine 
Translation, 8:175-201, 1993. 
(Basili et al., 1994a) Basili, R., M.T. Pazienza, P. Velardi, A 
(not-so) shallow parser for collocational analysis, Proc. 
of COLING '94, Kyoto, Japan, 1994. 
(Basili et al., 1994b) Basili, R., M.H. Candito, M.T. 
Pazienza, P. Velardi, Evaluating the information gain 
of probability-based PP-disambiguation methods, Proc. 
of the International Conference on New Methods in Lan- 
guage Processing, Manchester, September 1994. 
(Basili et al., 1996) Basili, R., M.T. Pazienza, P. Velardi, 
An Empirical Symbolic Approach to Natural Language 
Processing, Artificial Intelligence, to appear in vol. 85, 
August 1996. 
(Brill, 1992) Brill, E., A simple rule-based part of speech 
tagger, Proc. of the 3rd Conf. on Applied Natural 
Language Processing, ACL, Trento, Italy. 
(Brill and Resnik, 1994) Brill, E., Resnik, P., A rule-based ap- 
proach to prepositional phrase attachment disambigua- 
tion, Proc. of COLING '94, 1198-1204. 
(Collins and Brooks, 1995) Collins, M. and Brooks, J., Prepo- 
sitional Phrase Attachment through a Backed-off Model, 
Proc. of the 3rd Workshop on Very Large Corpora, MIT, 1995. 
(Franz, 1995) Franz, A., A statistical approach to learn- 
ing prepositional phrase attachment disambiguation, 
Proc. of the IJCAI Workshop on New Approaches to 
Learning for Natural Language Processing, Montreal, 
1995. 
(Hindle and Rooth, 1993) Hindle, D. and Rooth, M., Struc- 
tural Ambiguity and Lexical Relations, Computational 
Linguistics, 19(1):103-120. 
(Kononenko and Bratko, 1991) Kononenko, I., Bratko, I., 
Information-Based Evaluation Criterion for Classi- 
fier's Performance, Machine Learning, 6:67-80, 1991. 
(Marcus et al., 1993) Marcus, M., Santorini, B. and 
Marcinkiewicz, M., Building a large annotated corpus 
of English: The Penn Treebank, Computational Lin- 
guistics, 19(2):313-330. 
(Ratnaparkhi et al., 1994) Ratnaparkhi, A., Reynar, J. and 
Roukos, S., A maximum entropy model for prepositional 
phrase attachment, ARPA Workshop on Human Language 
Technology, Plainsboro, NJ, 1994. 
(Resnik and Hearst, 1993) Resnik, P. and Hearst, M., Struc- 
tural Ambiguity and Conceptual Relations, Proc. of the 
1st Workshop on Very Large Corpora, 1993. 
