Probabilistic Models for PP-attachment Resolution and NP Analysis
Eric Gaussier
XRCE
6, Chemin de Maupertuis
38240 Meylan, France
fname.lname@xrce.xerox.com
Nicola Cancedda
XRCE
6, Chemin de Maupertuis
38240 Meylan, France
fname.lname@xrce.xerox.com
Abstract
We present a general model for PP attach-
ment resolution and NP analysis in French.
We make explicit the different assumptions
our model relies on, and show how it gener-
alizes previously proposed models. We then
present a series of experiments conducted
on a corpus of newspaper articles, and as-
sess the various components of the model,
as well as the different information sources
used.
1 Introduction
Prepositional phrase attachment resolution and noun
phrase analysis are known to be two difficult tasks.
Traditional context-free rules for example do not help
at all in selecting the good parse for a noun phrase,
since all valid parses are a priori correct. Subcatego-
rization information can help solve the problem, but
the amount of information necessary to be encoded
in such lexicons is huge, since in addition to subcat-
egorization frames, one should encode the different
senses of words and rules on how these senses can
be combined, as well as the different units (single or
multi word expressions) a language makes use of. It
has been recognized in different works that part of
this knowledge can be (semi-)automatically acquired
from corpora and used in the decision process. Several
models have been proposed for PP-attachment resolu-
tion and NP analysis, built on various building blocks
and making use of diverse information. Most of these
models fit within a probabilistic framework. Such
a framework allows one to estimate various quanti-
ties and to perform inference with incomplete infor-
mation, two necessary steps for the problem at hand.
We present in this paper a model generalizing several
models already proposed and integrating several infor-
mation sources.
We focus here on analysing strings corresponding
to a combination of traditional VERB NOUN PREP
sequences and noun phrases. More precisely, the NPs
we consider are arbitrarily complex noun phrases with
no embedded subordinate clause. The PP attachment
problems we tackle are those involved in analysing
such NPs for languages relying on composition of Ro-
mance type (as French or Italian), as well as those
present in verbal configurations for languages relying
on composition of Germanic type (as German or En-
glish). However, our model can easily be extended to
deal with other cases as well. The problem raised in
analysing such sequences is of the utmost relevance
for at least two reasons: 1) NPs constitute the vast ma-
jority of terms. Correctly identifying the boundaries
and the internal structure of noun phrases is crucial to
the automatic discovery of domain-specific termino-
logical databases; 2) PP-attachment ambiguity is one
of the main sources of syntactic ambiguity in general.
In the present work, we focus on the French lan-
guage since we have various lexical information at our
disposal for this language that we would like to as-
sess in the given context. Furthermore, French dis-
plays interesting propoerties (like gender and num-
ber agreement for adjectives) which makes it an in-
teresting language to test on. The remainder of the
paper is organized as follows: we first describe how
we preprocessed our corpus, and which part we re-
tained for probability estimation. We then present the
general model we designed, and show how it com-
pares with previous ones. We then describe the exper-
iments we performed and discuss the results obtained.
We also describe, in the experiments section, how
we integrated different types of information sources
(e.g. prior knowledge encoded as subcategorization
frames), and how we weighted multiple sources of ev-
idence according to their reliability.
2 Nuclei and sequences of nuclei
We first take the general view that our problem can
be formulated as one of finding dependency relations
between nuclei. Without loss of generality, we define
a nucleus to be a unit that contains both the syntactic
and semantic head and that exhibits only unambiguous
internal syntactic structure. For example, the base NP
”the white horse” is a nucleus, since the attachments
of both the determiner and the adjective to the noun
are straightforward. The segmentation into nuclei re-
lies on a manually built chunker, similar to the one
described in (Ait-Mokhtar and Chanod, 1997), and re-
sembles the one proposed in (Samuelsson, 2000). The
motivation for this assumption is twofold. First, the
amount of grammatical information carried by indi-
vidual words varies greatly across language families.
Grammatical information carried by function words
in non-agglutinative languages, for instance, is real-
ized morphologically in agglutinative languages. A
model manipulating dependencies at the word level
only would be constrained to the specific amount of
grammatical and lexical information associated with
words in a given language. Nuclei, on the other hand,
tend to correspond to phrases of the same type across
languages, so that relying on the notion of nucleus
makes the approach more portable. A second moti-
vation for considering nuclei as elementary unit is that
their internal structure is by definition unambiguous,
so that there is no point in applying any algorithm
whatsoever to disambiguate them.
We view each nucleus as being composed of several
linguistic layers with different information, namely
a semantic layer comprising the possible semantic
classes for the word under consideration, a syntactic
layer made of the POS category of the word and its
gender and number information, and a lexical layer
consisting of the word itself (referred to as the lexeme
in the following), and the preposition, for prepositional
phrases. For nuclei comprising more than two non-
empty words (as ”the white horse”), we retain only one
lexeme, the one associated with the last word which is
considered to be the head word in the sequence. Ex-
cept for the semantic information, all the necessary in-
formation is present in the output of the chunker. The
semantic lexicon we used was encoded as a finite-state
transducer, which was looked up for injecting seman-
tic classes in each nucleus. When no semantic infor-
mation is available for a given word, we use its part-of-
speech category as its semantic class1. For example,
starting with the sentence:
1The semantic resource we used can be purchased from
www.lexiquest.com. This resource contains approximately
90 different sementic classes organized into a hierarchy. We
have not made use of this hierarchy in our experiments.
Il doit rencontrer le pr´esident de la f´ed´eration franc¸aise.
(He has to meet the president of the French federation.)
we obtain the following sequence of nuclei:
il
CAT=”PRON”, GN=”Masc-Sg”, PREP=””,
SEM=”PRON”
rencontrer
CAT=”VERB”, GN=””, PREP=””, SEM=”VERB”
pr´esident
CAT=”NOUN”, GN=”Masc-Sg”, PREP=””,
SEM=”FONCTION”
f´ed´eration
CAT=”NOUN”, GN=”Fem-Sg”, PREP=”de”,
SEM=”HUMAIN”
franc¸aise
CAT=”ADJ”, GN=”Fem-Sg”, PREP=””, SEM=”GEO”
As we see in this example, the semantic resource we
use is incomplete and partly questionable. The at-
tribute HUMAN for federation can be understood if
one views a federation as a collection of human be-
ings, which we believe is the rationale behind this an-
notation. However, a federation also is an institution,
a sense which is missing in the resource we use.
In the preceding example, the preposition de can
be attached to the verb rencontrer or to the noun
pr´esident. It cannot be attached to the pronoun il.
As far as terminology extraction is our final objective,
pr´esident de la f´ed´eration franc¸aise can be deemed
a good candidate term. However, in order to accu-
rately identify this unit, a high confidence in the fact
that the preposition de attaches to the noun pr´esident
must be achieved. Sentences can be conveniently seg-
mented into smaller self-contained units according to
some heuristics to reduce the combinatorics of attach-
ments ambiguities. We define safe chains as being
sequences of nuclei in which all the items but the first
are attached to other nuclei within the chain itself. In
the preceding example, for instance, only the nucleus
associated with rencontrer is not attached to a nucleus
within the chain rencontrer ... franc¸aise. This chain
is thus a safe chain. To keep the number of alterna-
tive (combinations of) attachments as low as possible,
we are interested in isolating as short safe chains as
possible given the information available at this point,
i.e. words and their parts-of-speech (the knowledge of
semantic classes is of little help in this task).
In French, and except for few cases involving em-
bedded clauses and coordination, the following heuris-
tics can be used to identify “minimal” safe chains: ex-
tract the longest sequences beginning with a nominal,
verbal, prepositional or adjectival nucleus, containing
only nominal, prepositional, adjectival, adverbial or
verbal nuclei in indefinite moods.
There is a tension in parameter estimation of prob-
abilistic models between relying on accurate infor-
mation and relying on enough data. In an unsuper-
vised approach to PP-attachment resolution and NP
analysis, accurate information in the form of depen-
dency relations between words is not directly acces-
sible. However, specific configurations can be identi-
fied from which accurate information can be extracted.
Safe chains provide such configurations. Indeed if
there is only one possible attachment site to the left
of a nucleus, then its attachment is unambiguous. Due
to the possible ambiguities the French language dis-
plays (e.g. a preposition can be attached to a noun,
a verb or an adjective), only the first two nuclei of a
safe chain provide reliable information (we skip ad-
verbs, the attachment of which obeys specific and sim-
ple rules). From the preceding example, for instance,
we can infer a direct relation between rencontrer and
pr´esident, but this is the only attachment we can be
sure of. The use of less reliable information sources
for model parameters whose estimation would other-
wise require manual supervision is the object of an ex-
periment described in Section 6.
3 Attachment Model
Let us denote the a0a2a1a4a3 nucleus in a chain by a5 a0a7a6 , and the
the nucleus to which it is attached by a8a9a5
a0a7a6 (for each
chain, we introduce an additional empty nucleus to
which the head of the chain is attached). Given a chain
of nuclei a10 , we denote by a11a13a12 the set of dependency re-
lations covering the chain of nuclei a10a15a14a17a16a19a18a21a20a23a22a24a20 a0 . We
are interested in the set a11a13a25 such that a26a27a5a28a11a13a25 a6 is maxi-
mal. Assuming that the dependencies are built by pro-
cessing the chain in linear order, We have:
a26a29a5a2a11 a25a31a30a10
a6a33a32a35a34
a25a9a36a38a37
a26a27a5a28a11 a25a9a36a38a37a39a30a10
a6
a26a29a5a2a11 a25a38a30a11 a25a17a36a40a37 a16a41a10
a6
a32a42a34
a37a44a43a45a43a45a43
a34
a25a9a36a38a37
a25
a46
a47a49a48
a37
a26a29a5a28a11
a47
a30a11
a47
a36a40a37a50a16a41a10
a6 (1)
a11
a47 differs from
a11
a47
a36a40a37 only in that it additionally speci-
fies a particular attachment site a5
a0a7a6 for
a5a28a51
a6 such that no
cycle nor crossing dependencies are produced. In or-
der to avoid sparse data problems, we make the simpli-
fying assumption (similar to the one presented in (Eis-
ner, 1996)) that the attachment of nucleus a5a28a51
a6 to nu-
cleus a5 a0a7a6 depends only on the set of indices of the pre-
ceding dependency relations (in order to avoid cycles
and crossing dependencies) and on the three nuclei a5a28a51 a6 ,
a5
a0a7a6 and
a5a52a22
a47
a6 , where
a5a53a22
a47
a6 denotes the last nucleus be-
ing attached to a5 a0a7a6 . a5a53a22 a47 a6 is thus the closest sibling of
a5a28a51
a6 . Conditioning attachment on it the attachment of
a5a28a51
a6 allows capturing the fact that the object of a verb
may depend on its subject, that the indirect object may
depend on the direct object, and other similar indirect
dependencies. In order to focus on the probabilities of
interest, we use the following simplified notation:
a26a27a5a28a11
a47
a30a11
a47
a36a38a37a49a16a41a10
a6a33a54
a26a29a5a7a55a56a5a28a11
a47
a6a57a6a15a58
a26a29a5a59a5a28a51
a6
a30 a5
a0a7a6
a16a60a5a53a22
a47
a6
a16a59a55a56a5a28a11
a47
a6a59a6 (2)
where a55a56a5a28a11 a47 a6 represents the graph produced by the
dependencies generated so far. If this graph contains
cycles or crossing links, the associated probability is
0. Making explicit the different elements of a nucleus,
we obtain:
a26a27a5a57a5a2a51
a6
a30 a5
a0a4a6
a16a61a5a52a22
a47
a6
a16a57a55a56a5a2a11
a47
a6a57a6a62a32
a34
a63a9a64a65a63a67a66a68a47a19a69
a26a27a5a53a70 a30 a5
a0a7a6
a16a60a5a53a22
a47
a6a59a6 (3)
a58
a26a27a5a71a26a73a72
a47
a30a70a33a16a61a5
a0a4a6
a16a61a5a52a22
a47
a6a59a6 (4)
a58
a26a29a5a52a74a61a8a65a75
a47
a30a26a76a72
a47
a16a41a70a33a16a61a5
a0a4a6
a16a61a5a52a22
a47
a6a59a6 (5)
a58
a26a29a5a2a77a79a78
a47
a30a74a61a8a65a75
a47
a16a7a26a73a72
a47
a16a41a70a33a16a61a5
a0a4a6
a16a61a5a52a22
a47
a6a59a6 (6)
a58
a26a29a5a7a80a4a81a49a82
a47
a30a77a79a78
a47
a16a83a74a61a8a65a75
a47
a16a7a26a73a72
a47
a16a41a70a33a16a61a5
a0a4a6
a16a61a5a52a22
a47
a6a59a6 (7)
since the graph a55a56a5a28a11 a47 a6 provides the index of the nu-
cleus a5 a0a7a6 to which a5a28a51 a6 is attached to. Obviously, most
of the above probabilities cannot be directly estimated.
A number of simplifying assumptions preserving sig-
nificant conditional dependencies were adopted.
Assumption 1: except for graphs with cycles and
crossing links, for which the associated probability is
0, we assume a uniform distribution on the set of pos-
sible graphs.
A prior probability a26a29a5a4a55a56a5a2a11 a47 a6a59a6 could be used to model
certain corpus-specific preferences such as privileging
attachments to the immediatly preceding nucleus (in
French or English for example). However, we decided
not to make use of this possibility for the moment.
Assumption 2: the semantic class of a nucleus de-
pends only on the semantic class of its regent.
This assumption, also used in (Lauer and Dras,
1994), amounts to considering a 1st-order Markov
chain on the semantic classes of nuclei, and represents
a good trade-off between model accuracy and practical
estimation of the probabilities in (3). It leads to:
a26a29a5a52a70 a30 a5
a0a4a6
a16a61a5a52a22
a47
a6a59a6a62a32
a26a29a5a53a70 a30a70a84a5
a0a7a6a57a6 (8)
Assumption 3: the preposition of a nucleus depends
only on its semantic class and on the lexeme and POS
category of its regent, thus leading to:
a26a29a5a45a26a73a72
a47
a30 a5
a0a4a6
a16a61a5a52a22
a47
a6a59a6a62a32
a26a27a5a71a26a76a72
a47
a30a70a33a16a83a74a60a8a85a75 a12 a16a57a80a4a81a19a82 a12
a6 (9)
The nucleus a5a52a22 a47 a6 does not provide any information on
the generation of the preposition, and is thus not re-
tained. As far as the regent nucleus a5 a0a4a6 is concerned,
the dependence on the POS category controls the fact
that adjectives are less likely to subcategorize prepo-
sitions than verbs. For arguments, the preposition is
controlled by subcategorization frames, which directly
depend on the lexeme under consideration, and to a
less extent to its semantic class (even though this de-
pendence does exist, as for movement verbs which
tend to subcategorize prepositions associated with lo-
cation and motion). In the absence of subcategoriza-
tion frame information, the conditioning is placed on
the lexeme, which also controls prepositional phrases
corresponding to adjuncts. Lastly, the semantic class
of the nucleus under consideration may also play a role
in the selection of the preposition, and is thus retained
in our model.
Assumption 4: the POS category of a nucleus de-
pends only on its semantic class.
This assumption reflects the fact that our lexical re-
sources assign semantic classes from disjoint sets for
nouns, adjectives and adverbs (except for the TOP
class, identical for adjectives and adverbs). This as-
sumption leads to:
a26a29a5a52a74a61a8a65a75
a47
a30a26a73a72
a47
a16a41a70a33a16a61a5
a0a4a6
a16a61a5a52a22
a47
a6a59a6a62a32
a26a27a5a53a74a61a8a65a75
a47
a30a70
a6 (10)
Since any dependence on a5 a0a7a6 and a5a53a22 a47 a6 is lost, this fac-
tor has no impact on the choice of the most probable
attachment for a5a28a51 a6 . However, it is important to note
that this assumption relies on the specific semantic re-
source we have at our disposal, and could be replaced,
in other situations, with a 1st-order Markov assump-
tion.
Assumption 5: the gender and number of a nucleus
depend on its POS category, the POS category of its
regent, and the gender and number of its regent.
In French, the language under study, gender and
number agreements take place between the subject
and the verb, and between adjectives, or past par-
ticiples, and the noun they modify/qualify. All, and
only, these dependencies are captured in assumption 5
which leads to:
a26a27a5a28a77a17a78
a47
a30a74a61a8a65a75
a47
a16a7a26a73a72
a47
a16a83a70a33a16a60a5
a0a7a6
a16a60a5a53a22
a47
a6a57a6a62a32
a26a27a5a28a77a17a78
a47
a30a74a61a8a65a75a52a51a65a16a41a74a61a8a65a75a57a12a52a16a52a77a17a78a76a12
a6 (11)
Assumption 6: the lexeme of a nucleus depends only
on the POS category and the semantic class of the nu-
cleus itself, the lexeme, POS category and semantic
class of its regent lexeme, and the lexeme and POS
category of its closest preceding sibling.
This assumption allows us to take bigram frequencies
for lexemes into account, as well as the dependencies
a given lexeme may have on its closest sibling. In fact,
it accounts for more than just bigram frequencies since
it leads to:
a26a27a5a4a80a4a81a19a82
a47
a30a77a17a78
a47
a16a83a74a61a8a65a75
a47
a16a7a26a73a72
a47
a16a41a70a33a16a60a5
a0a7a6
a16a60a5a53a22
a47
a6a57a6a33a32
a26a29a5a4a80a7a81a49a82
a47
a30a74a61a8a65a75
a47
a16a83a70a33a16a41a74a61a8a65a75a57a12a57a16a59a80a4a81a49a82a9a12a41a16a83a70a84a5
a0a7a6
a16a41a74a61a8a65a75a57a14a61a86a85a16a59a80a4a81a49a82a9a14a61a86
a6 (12)
Assumptions 1 to 6 lead to a set of probabilities
which, except for the last one, can be confidently es-
timated from training data. However, we still need to
simplify equation (12) if we want to derive practical
estimations of lexical affinities. This is the aim of the
following assumption.
Assumption 7: a5 a0a4a6 and a5a53a22 a47 a6 are independent given
a5a2a51
a6 .
Let us first see with an example what this assumption
amounts to. Consider the sequence eat a fish with a
fork. Assumption 7 says that given with a fork, eat
and a fish are independent, that is, once we know
with a fork, the additional observation of a fish doesn’t
change our expectation of observing eat as well, and
vice-versa. This does not entail that with a fork and eat
are independent given a fish, nor that a fish and with a
fork are independent given eat, this last dependence
being the one we try to account for. However, this in-
dependence assumption is violated as soon as nucleus
a5a52a22
a47
a6 brings more or different constraints on the distri-
bution of nucleus a5 a0a7a6 than nucleus a5a2a51 a6 does, i.e. when
with a fork imposes constraints on the possible forms
the verb of nucleus a5 a0a7a6 (eat in our example) can take,
and so does a fish. With assumption 7, we claim that
the constraints imposed by with a fork suffice to deter-
mine eat, and that a fish brings no additional informa-
tion.
Assumption 7 allows us to rewrite equation (12) as:
a26a29a5a4a80a7a81a49a82
a47
a30a77a79a78
a47
a16a41a74a61a8a65a75
a47
a16a7a26a73a72
a47
a16a41a70a33a16a60a5
a0a7a6
a16a60a5a53a22
a47
a6a57a6a62a32
a34
a63a67a87a45a64a39a63a67a66
a12
a69
a26a29a5a52a70a40a88 a30a80a4a81a49a82a9a12a57a16a41a74a61a8a65a75a57a12
a6
a26a29a5a7a80a4a81a49a82
a47
a30a74a61a8a65a75
a47
a16a83a70
a6
a58
a26a27a5a4a80a4a81a19a82
a47
a30a74a60a8a85a75
a47
a16a41a70a33a16a83a74a60a8a85a75a57a12a59a16a57a80a4a81a19a82a9a12a83a16a41a70 a88
a6
a58
a26a27a5a4a80a4a81a19a82
a47
a30a74a60a8a85a75a57a14a60a86a89a16a59a80a4a81a19a82a17a14a60a86
a6 (13)
4 Comparison with other models
It is interesting to compare the proposed models to
others previously studied. The probabilistic model de-
scribed in (Lauer and Dras, 1994), addresses the prob-
lem of parsing English nominal compounds. A com-
parison with this model is of interest to us since the
sequences we are interested in contain both verbal and
nominal phrases in French. A second model relevant
to our discussion is the one proposed in (Ratnaparkhi,
1998), addressing the problem of unsupervised learn-
ing for PP attachment resolution in VERB NOUN PP
sequences. Lastly, the third model, even though used
in a supervised setting, addresses the more complex
problem of probabilistic dependency parsing on com-
plete sentences 2.
In the model proposed in (Lauer and Dras, 1994),
that we will refer to as model L, the quantity denoted
as a90a91a5a53a92 a47a94a93 a92 a12a49a30a95a17a96a98a97a99a96 a93 a92 a12 a6 is the same as the quan-
tity defined by our equation (8). The quantity a26a29a5a52a100 a6 in
model L is the same as our quantity a26a29a5a7a55a56a5a28a11 a47 a6a57a6 . There
is no equivalent for probabilities involved in equations
(9) to (11) in model L, since there is no need for them
in analysing English nominal compounds. Lastly, our
probability to generate a80a4a81a49a82 a47 depends only on a70 in
model L (the dependency on the POS category is ob-
vious since only nouns are considered). For the rest,
i.e. the way these core quantities are combined to pro-
duce a probability for a parse as well as the decision
rule (selection of the most probable parse), there is no
difference between the two models. We thus can view
our model as a generalization of model L since we can
handle PP attachment and take into account indirect
independencies.
The model proposed in (Ratnaparkhi, 1998) is sim-
ilar to a version of our model based solely on equa-
tion (9), with no semantic information. This is not sur-
prising since the goal of this work is to disambiguate
between prepositional attachment to the noun or to the
verb in V N P sequences. In fact, by adding to the set
of prepositions an empty preposition, a101 , the counts of
which are estimated from unsafe configurations (that is
a74a19a5a53a102a103a81a49a72a85a104
a6a21a32a106a105a108a107a19a109a83a110a2a107
a74a19a5a52a102a39a81a19a72a85a104a85a16a7a26a73a72a85a81a59a26
a6a38a111
a74a19a5a53a102a103a81a49a72a85a104a85a16a112a101
a6 ), equa-
tion (9) captures both the contribution from the ran-
dom variable used in (Ratnaparkhi, 1998) to denote the
presence or absence of any preposition that is unam-
biguously attached to the noun or the verb in question,
and the contribution from the conditional probability
that a particular preposition will occur as unambigu-
ous attachment to the verb or to the noun. We present
below the results we obtained with this model.
From the models proposed in (Eisner, 1996), we re-
tain only the model referred to as model C in this work,
since the best results were obtained with it. Model C
does not make use of semantic information, nor does
it rely on nuclei. So the sequence with a fork, which
corresponds to only one nucleus is treated as a three
word sequence in model C. Apart from this difference,
model C directly relies on a combination of equations
(10) and (12), namely conditioning by a80a7a81a49a82a9a12 , a74a61a8a65a75a57a12 and
a74a61a8a65a75a57a14a61a86 , both the probability of generating a74a61a8a65a75
a47 and the
one of generating a80a7a81a49a82
a47 . Thus, model C uses a reduced
version of equation (12) and an extended version of
2Other models, as (Collins and Brooks, 1995; Merlo et
al., 1998) for PP-attachment resolution, or (Collins, 1997;
Samuelsson, 2000) for probabilistic parsing, are somewhat
related, but their supervised nature makes any direct com-
parison impossible.
equation (10). This extension could be used in our case
too, but, since the input to our processing chain con-
sists of tagged words (unless the input of the stochastic
dependency parser of (Eisner, 1996)), we do not think
it necessary.
Furthermore, by marginalizing the counts for the es-
timates of our general model, we can derive the proba-
bilities used in other models. We thus view our model
as a generalization of the previous ones.
5 Estimation of probabilities
We followed a maximum likelihood approach to esti-
mate the different probabilities our model relies on, by
directly computing relative frequencies from our train-
ing data. We then used Laplace smoothing to smooth
the obtained probabilities and deal with unobserved
events.
As mentioned before, we focus on safe configu-
rations to extract counts for probability estimation,
which implies that, except for particular configurations
involving adverbs, we use only the first nuclei of the
chains we arrived at. In most cases, only the first two
nuclei of each chain are not ambiguous with respect
to attachment. However, since equation (12) relies on
a5a52a22
a47
a6 in addition to
a5
a0a7a6 , we consider the first three nuclei
of each chain (but we skip adverbs since their attach-
ment quite often obeys precise and simple rules), but
treat the third nucleus as being ambiguous with respect
to which nucleus it should be attached to, the two pos-
sibilities being a priori equi-probable. Thus, from the
sequence:
[implant´ee, VERB] (a)
d´epartement, NOUN, Masc-Sg, PREP = dans(b)
H´erault, NOUN, Masc-Sg, PREP = de(c)
(located in the county of H´erault) (En.)
we increment the counts between nuclei (a) and (b) by
1, then consider that nucleus (c) is attached to nucleus
(a) and increment the respective counts (in particular
the counts associated with equation 12) by 0.5, and
finally consider that nucleus (c) is attached to nucleus
(b) (which is wrong in this case) and increment the
corresponding counts by 0.5.
6 Experiments
We made two series of experiments, the first one to
assess whether relying on a subset of our training cor-
pus to derive probability estimates was a good strategy,
and the second one to assess the different information
sources and probabilities our general model is based
on. For all our experiments, we used articles from
the French newspaper Le Monde consisting of 300000
sentences, split into training and test data.
6.1 Accurate vs. less accurate information
We conducted a first experiment to check whether the
accurate information extracted from safe chains was
sufficient to estimate probabilities. We focused, for
this purpose, on the task of preposition attachment on
200 VERB NP PP sequences randomly extracted and
manually annotated. Furthermore, we restricted our-
selves to a reduced version of the model, based on a
reduced version of equation (9), so as to have a com-
parison point with previous models for PP-attachment.
In addition to the accurate information, we used a win-
dowing approach in order to extract less accurate infor-
mation and assess the estimates derived from accurate
information only. Each time a preposition is encoun-
tered with a verb or a noun in a window of k (k=3 in
our experiment) words, the corresponding counts are
incremented.
The French lexicons we used for tagging, lemma-
tization and chunking contain subcategorization infor-
mation for verbs and nouns. This information was en-
coded by several linguists over several years. Here’s
for example two entries, one for a verb and one for a
noun, containing subcategorization information:
quˆeter - en faveur de, pour
to raise funds - in favor of, for
constance - dans, en, de
- constancy - in, of
Subcategorization frames only contain part of the in-
formation we try to acquire from our training data,
since they are designed to capture possible arguments,
and not adjuncts, of a verb or a noun. In our approach,
like in other ones, we do not make such a distinction
and try to learn parameters for attaching prepositional
phrases independently of their status, adjuncts or ar-
guments. We used the following decision rule to test
a method solely based on subcategorization informa-
tion:
if the noun subcategorizes the preposition,
then attachment to the noun
else if the verb subcategorizes the preposition,
then attachment to the verb
else attachment according to the default rule
and two default rules, one for attachment to nouns, the
other to verbs, in order to which of these two alterna-
tives is the best. Furthermore, since subcategorization
frames aim at capturing information for specific prepo-
sitional phrases (namely the ones that might consti-
tute arguments of a given word), we also evaluated the
above decision rule on a subset of our test examples in
which either the noun or the verb subcategorizes the
preposition. The results we obtained are summarized
in table 1.
Precision
default: noun 0.68
default: verb 0.56
subset 0.75
Table 1: Using subcategorization frames
We then mixed the accurate and less accurate informa-
tion with a weighting factor a113 to estimate the proba-
bility we are interested in, and let a113 vary from 0 to
1 in order to see what are the respective impacts of
accurate and less accurate information. By using a74a85a18
a47
(resp. a74a61a114 a47 ) to denote the number of times a26a73a72 a47 occurs
with a5a7a80a4a81a49a82a9a12a41a16a83a74a61a8a65a75a57a12 a6 in accurate (resp. less accurate) con-
figurations, and by using a74a50a12 to denote the number of
occurrences of a5a7a80a4a81a19a82a17a12a41a16a83a74a60a8a85a75a57a12 a6 , the estimation we used is
summarized in the following formula:
a26a29a5a45a26a73a72
a47
a30a74a61a8a65a75 a12 a16a57a80a7a81a49a82 a12
a6a62a32 a74a85a18
a47
a111
a113a31a74a60a114
a47
a111
a18
a74a41a12
a111
a78a40a5a71a26a76a72a85a81a57a26
a6 (14)
where a78a40a5a71a26a76a72a85a81a57a26 a6 is the number of different prepositions
introduced by our smoothing procedure. The results
obtained are summarized in table 2, where an incre-
ment step of 0.2 is used.
a113 0 0.2 0.4 0.6 0.8 1
precision 0.83 0.85 0.83 0.81 0.8 0.78
Table 2: Influence of a113
These results first show that the accurate information is
sufficient to derive good estimates. Furthermore, dis-
counting part of the less accurate information seems to
be essential, since the worst results are obtained when
a113
a32
a18 . We can also notice that the best results are
well above the baseline obtained by relying only on
information present in our lexicon, thus justifying a
machine learning approach to the problem of PP at-
tachment resolution. Lastly, the results we obtained
are similar to the ones obtained by different authors
on a similar task, as (Ratnaparkhi, 1998; Hindle and
Rooth, 1993; Brill and Resnik, 1994) for example.
6.2 Evaluation of our general model
The model described in Section 3 was tested against
900 manually annotated sequences of nuclei from the
newspaper “Le Monde”, randomly selected from a
portion of the corpus which was held out from training.
The average length of sequences was of 3.33 nuclei.
The trivial method consisting in linking every nucleus
to the preceding one achieves an accuracy of 72.08%.
The proposed model was used to assign probabil-
ity estimates to dependency links between nuclei in
our own implementation of the parser described in
(Eisner, 1996). The latter is a “bare-bones” depen-
dency parser which operates in a way very similar to
the CKY parser for context-free grammars, in which
the notion of a subtree is replaced by that of a span.
A span consists of two or more adjacent nuclei to-
gether with the dependency links among them. No
cycles, multiple parents, or crossing dependencies are
allowed, and each nucleus not on the edge of the span
must have a parent (i.e.: a regent) within the span it-
self. The parser proceeds by creating spans of increas-
ing size by combining together smaller spans. Spans
are combined using the “covered concatenation” oper-
ator, which connects two spans sharing a nucleus and
possibly adds a dependency link between the leftmost
and the rightmost nucleus, or vice-versa. The proba-
bility of a span is the product of the probabilities of the
dependency links it contains. A span is pruned from
the parse table every time that there is another span
covering the same nuclei and having the same signa-
ture but a higher probability. The signature of a span
consists of three things:
a115 A flag indicating whether the span is minimal or
not. A span is minimal if it is not the simple con-
catenation of other legal spans;
a115 A flag indicating whether the leftmost nucleus in
the span already has a regent within the span;
a115 A flag indicating whether the rightmost nucleus
in the span already has a regent within the span.
Two spans covering the same nuclei and with the
same signiture are interchangeable in terms of the
complete parses they can appear in, and so the one
with the lower probability can be dropped, assuming
that we are only interested in the analysis having the
overall highest probability. For more details concern-
ing the parser, see (Eisner, 1996).
A number of tests using different variants of the
proposed models were done. For some of those
tests, we decided to make use of the subcategoriza-
tion frame information contained in our lexicon, by
extending Laplace smoothing for the probability in-
volved in equation (9) by considering Dirichlet priors
over multinomial distributions of the observed data.
We use three different variables to describe the dif-
ferent experiments we made: se, being 1 or 0 depend-
ing on whether or not we used semantic information,
sb, which indicates the equivalent sample size for pri-
ors that we used in our smoothing procedure for equa-
tion (9) (when a92a49a104
a32
a18 , the subcategorization informa-
tion contained in our lexicon is not used), Kj, which
is 1 if the variables associated with the closest sister
are used in equation (12), and 0 if not. The results
obtained with the different experiments we conducted
were evaluated in terms of accurracy of attachments
against the manually annotated reference. We did not
take into account the attachment of the second nucleus
in a chain to the first one (since this attachment is ob-
vious). Results are summarized in the following table:
Exp. name se sb kj Accuracy
base - - - 0.72
exp1 0 1 0 0.662
exp2 0 1 1 0.701
exp3 0 100 1 0.706
exp4 0 200 1 0.705
exp5 1 1 1 0.731
exp6 1 100 1 0.731
exp7 1 200 1 0.735
exp8 1 500 1 0.737
exp9 1 1000 1 0.735
Table 3: General results
7 Discussion
There are two main conclusions we can draw from the
preceding results. The first one is that the results are
disappointing, in so far as we were not able to really
outperform our baseline. The second one is that the
best results are achieved with the complete model in-
tegrating subcategorization information.
With respect to our model, the difference between
experiment 1 and experiment 2 shows that the closest
sister brings valuable information to establish the best
parse of the chain of nuclei. Even though this infor-
mation was derived from ambiguous configurations,
the extraction heuristics we used does capture ac-
tual dependencies, which validates our assumptions 6
and 7. The integration of subcategorization frame in-
formation in experiments 3 and 4 does not improve
the results, indicating that most of information is al-
ready carried in the corresponding version of the gen-
eral model by bigram lexical statistics. Furthermore,
the results obtained with sucategorization information
only for parsing V N P sequences do not compare
well with an approach solely based on bigram statis-
tics, thus validating the hypothesis behind most work
in probabilistic parsing that world knowledge can be
approximated, up to a certain extent, by bigram statis-
tics.
The main jump in performance is achieved with the
use of semantic classes. All the experiments involv-
ing semantic classes yield results over the baseline,
thus indicating the well-foundedness of models mak-
ing use of them. Even though our semantic resource is
incomplete (out of 70000 different tokens our corpus
comprises3, only 20000 have an entry in our seman-
tic lexicon), its coverage is still sufficient to constrain
word distributions and partly solve the data sparseness
problem. The results obtained in previous works re-
lying on semantic classes are above ours (around 0.82
3This huge number of tokens can be explained by the fact
that the lexicon used for tokenization and tagging integrates
many multi-word expressions which are not part of the se-
mantic lexicon
for (Brill and Resnik, 1994) and 0.77 for (Lauer and
Dras, 1994)), but a direct comparison is difficult inas-
much as only three-word sequences (V N P, for (Brill
and Resnik, 1994) and N N N for (Lauer and Dras,
1994)) were used for evaluation in those works, and
the language studied is English. However, it may well
be the case that the semantic resource we use does not
compare well, in terms of coverage and homogeneity,
with WordNet, the semantic resource usually used.
Several choices we made in the course of develop-
ing our model and estimating its parameters need now
to be more carefully assessed in light of these first re-
sults. First of all, our choice to stick with (almost)
accurate information, if it leads to good results for the
estimation of the probability of generating the prepo-
sition of a nucleus given its parent nucleus, may well
lead us to rely too often on the smoothing parameters
only when estimating other probabilities. This may
well be the case for the probability in (12) where bi-
gram statistics extracted with a windowing approach
may prove to be more suited to the task. Further-
more, the Laplace smoothing, even though appealing
from a theoretical point of view since it can be for-
malized as assuming a prior over our distributions,
may not be fully adequate in the case where the de-
nominator is always low compared to the normalizing
contraint, a situation we encounter for equation (12).
This may result in over-smoothing and thus prevent
our model from accurately discriminating between al-
ternate parses. Lastly, (Lauer and Dras, 1994) uses a
prior over the graphs defined by parse trees to score
the different parses. We have assumed a uniform prior
over graphs, but the results obtained with our baseline
clearly indicate that we should weigh them differently.
8 Conclusion
We have presented a general model for PP attachment
resolution and NP analysis in French, which gener-
alizes previously proposed models. We have shown
how to integrate several different information sources
in our model, and how we could use it in an incremen-
tal way, starting with simple versions to more complex
and accurate ones. We have also presented a series of
experiments conducted on a corpus of newspaper ar-
ticles, and tried to assess the various components of
the model, as well as the different information sources
used. Our results show that the complete model, mak-
ing use of all the available information yields the best
results. However, these results are still low, and we
still need to precisely identify how to improve them.

References
S. Ait-Mokhtar and J.-P. Chanod. 1997. Incremen-
tal Finite-State Parsing. Proceedings of the Inter-
national Conference on Applied Natural Language
Processing
E. Brill and P. Resnik. 1994. A Rule Based Ap-
proach to Prepositional Phrase Attachment Disam-
biguation. Proceedings of the 15th International
Conference on Computational Linguistics.
M. Collins and J. Brooks. 1995. Prepositional Phrase
Attachment through a Backed-Off Model. Proceed-
ings of the Third Workshop on Very Large Corpora.
M. Collins. 1997. Three Generative, Lexicalised
Models for Statistical Parsing. Proceedings of the
35th Annual Meeting of the Association for Compu-
tational Linguistics.
J. Eisner. 1996. Three New Probabilistic Models for
Dependency Parsing: An Exploration. Proceedings
of the 16th International Conference on Computa-
tional Linguistics.
D. Hindle and M. Rooth. 1993. Structural Ambiguity
and Lexical Relations. Journal of the Association
for Computational Linguistics. 19(1).
M. Lauer and M. Dras. 1994. A Probabilistic Model
of Compound nouns. Proceedings of the Seventh
Joint Australian Conference on Artificial Intelli-
gence.
P. Merlo, M.W. Croker and C. Berthouzoz. 1997.
Attaching Multiple Prepositional Phrases: Gener-
alized Backed-off Estimation. Proceedings of the
Second Conference on Empirical Methods in Natu-
ral Language Processing.
A. Ratnaparkhi. 1998. Statistical Models for Un-
supervised Prepositional Phrase Attachment. Pro-
ceedings of the joint COLING-ACL conference.
C. Samuelsson. 2000. A Statistical Theory ofDe-
pendency Syntax. Proceedings of the 18th Inter-
national Conference on Computational Linguistics.
A. Yeh and M. Vilain. 1998. Some Properties of
Preposition and Subordinate Conjunction Attach-
ment. Proceedings of the 17th International Con-
ference on Computational Linguistics and 36th An-
nual Meeting of the Association for Computational
Linguistics”.
