Using Co-Composition for Acquiring Syntactic and Semantic
Subcategorisation
Pablo Gamallo, Alexandre Agustini, Gabriel P. Lopes
Department of Computer Science
New University of Lisbon, Portugal
{gamallo,aagustini,gpl}@di.fct.unl.pt
Abstract
Natural language parsing requires ex-
tensive lexicons containing subcategori-
sation information for specific sublan-
guages. This paper describes an unsuper-
vised method for acquiring both syntac-
tic and semantic subcategorisation restric-
tions from corpora. Special attention will
be paid to the role of co-composition in
the acquisition strategy. The acquired in-
formation is used for lexicon tuning and
parsing improvement.
1 Introduction
Recent lexicalist grammars project the subcategorisation information encoded in the lexicon onto syntactic structures. These grammars use accurately subcategorised lexicons to restrict the range of potential syntactic structures. In terms of parser development, it is broadly assumed that parsers need such information in order to reduce the number of possible analyses and, therefore, to resolve syntactic ambiguity. In recent years, various methods for acquiring subcategorisation information from corpora have been proposed. Some of them induce syntactic subcategorisation from tagged texts (Brent, 1993; Briscoe and Carroll, 1997; Marques,
2000). Unfortunately, syntactic information is not
enough to solve structural ambiguity. Consider the
following verbal phrases:
(1) [peel [NP the potato] [PP with a knife]]
(2) [peel [NP [NP the potato] [PP with a rough stain]]]
The attachment of “with PP” to both the verb
“peel” in phrase (1) and to the NP “the potato” in
(2) does not depend only on syntactic requirements.
Indeed, it is not possible to attach the PP "with a knife" to the verb "peel" merely by asserting that this verb subcategorises a "with PP". Such subcategorisation information cannot explain the analysis of phrase (2), where the "with PP" is attached to the NP "the potato". In order to decide on the correct analysis in both phrases, we are helped by our world knowledge about the action of peeling, the use of knives, and the attributes of potatoes. In general, we know that knives are used for peeling, and that potatoes can have different kinds of stains. So, a parser is able to propose the correct analysis only if the lexicon is provided not only with syntactic subcategorisation information, but also with information on semantic-pragmatic requirements (i.e., with selection restrictions).
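The role of selection restrictions in attachment decisions can be sketched as follows. This is an illustrative toy, not the authors' system: the `SELECTION` table and the function `attach_with_pp` are invented for the "peel the potato with a knife/stain" example.

```python
# Illustrative sketch (not the authors' system): how lexical selection
# restrictions can decide a "with PP" attachment. All table entries are
# invented toy data for the peel/potato example discussed above.

SELECTION = {
    # (head word, preposition) -> words the slot semantically accepts
    ("peel", "with"): {"knife", "peeler"},         # instruments of peeling
    ("potato", "with"): {"stain", "eye", "skin"},  # attributes of potatoes
}

def attach_with_pp(verb, noun, pp_noun):
    """Return the word the 'with PP' should attach to, or None."""
    if pp_noun in SELECTION.get((verb, "with"), set()):
        return verb      # verb selects the PP noun: instrumental reading
    if pp_noun in SELECTION.get((noun, "with"), set()):
        return noun      # object noun selects the PP noun: attributive reading
    return None          # no restriction matches; leave ambiguous

print(attach_with_pp("peel", "potato", "knife"))   # -> peel
print(attach_with_pp("peel", "potato", "stain"))   # -> potato
```

A purely syntactic lexicon could not make this distinction, since both readings are syntactically well-formed.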
Other works attempt to acquire selection restrictions using pre-existing lexical resources. The learning algorithm requires sample corpora made up of verb-noun, noun-verb, or verb-prep-noun dependencies, where the nouns are semantically tagged using lexical hierarchies such as WordNet (Resnik, 1997; Framis, 1995). Selection restrictions are induced by considering those dependencies associated with the same semantic tags. For instance, if the verb ratify frequently appears with nouns semantically tagged as "legal documents" in the direct object position (e.g., article, law, precept, ...), then it follows that it must select for nouns denoting legal documents. Unfortunately, if a pre-defined
[Published in: Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia, July 2002, pp. 34-41. Association for Computational Linguistics.]
set of semantic tags is used to annotate the training corpus, it is not obvious that the available tags are the most appropriate ones for extracting domain-specific semantic restrictions. And if the tags were created specifically to capture corpus-dependent restrictions, there could be serious portability problems when moving to a new specific domain.
By contrast, unsupervised strategies to acquire
selection restrictions do not require a training cor-
pus to be semantically annotated using pre-existing
lexical hierarchies (Sekine et al., 1992; Dagan et
al., 1998; Grishman and Sterling, 1994). They re-
quire only a minimum of linguistic knowledge in or-
der to identify “meaningful” syntactic dependencies.
According to the Grefenstette’s terminology, they
can be classified as “knowledge-poor approaches”
(Grefenstette, 1994). Semantic preferences are in-
duced by merely using co-occurrence data, i.e., by
using a similarity measure to identify words which
occur in the same dependencies. It is assumed that
two words are semantically similar if they appear in
the same contexts and syntactic dependencies. Consider, for instance, that the verb ratify frequently appears with the noun organisation in the subject position. Moreover, suppose that this noun turns out to be similar in a particular corpus to other nouns, e.g., secretary and council. It follows that ratify not only selects for organisation, but also for its similar words. This seems right. Suppose, however, that organisation also appears in expressions like the organisation of society began to be disturbed in the last decade, or they are involved in the actual organisation of things, with a significantly different word meaning. In this case, the noun means a particular kind of process. It seems obvious that its similar words, secretary and council, cannot appear in such subcategorisation contexts, since they are related to the other sense of the word. Soft clusters,
in which words can be members of different clusters
to different degrees, might solve this problem to a
certain extent (Pereira et al., 1993). We claim, how-
ever, that class membership should be modeled by
boolean decisions. Since subcategorisation contexts require words in boolean terms (i.e., words are either required or not required), words are either members or not members of specific subcategorisation classes.
Hence, we propose a clustering method in which a
word may be gathered into different boolean clus-
ters, each cluster representing the semantic restric-
tions imposed by a class of subcategorisation con-
texts.
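The contrast between soft and boolean class membership can be sketched as follows. This is a minimal illustration, assuming invented cluster contents for the two senses of organisation discussed above; it is not the authors' data structure.

```python
# Minimal sketch of the claim that class membership is boolean: a word may
# belong to several clusters, but each membership is yes/no (in contrast
# with soft clustering, where membership is a degree). Cluster contents
# are invented for illustration.

clusters = [
    frozenset({"organisation", "council", "secretary"}),    # institution sense
    frozenset({"organisation", "arrangement", "process"}),  # process sense
]

def classes_of(word):
    """All boolean clusters a (possibly polysemic) word belongs to."""
    return [c for c in clusters if word in c]

print(len(classes_of("organisation")))  # member of both clusters
print(len(classes_of("council")))       # member of one cluster
```

A polysemic word such as organisation is thus a full member of each cluster that matches one of its senses, rather than a graded member of all of them.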
This paper describes an unsupervised method for
acquiring information on syntactic and semantic
subcategorisation from partially parsed text corpora.
The main assumptions underlying our proposal will
be introduced in the following section. Then, section 3 will present the two steps of our learning method: extraction of candidate subcategorisation restrictions and conceptual clustering. In section
4, we will show how the dictionary entries are pro-
vided with the learned information. The accuracy
and coverage of this information will be measured
in a particular application: attachment resolution.
The experiments presented in this paper were performed on 1.5 million words belonging to the P.G.R. (Portuguese General Attorney Opinions) corpus, a domain-specific Portuguese corpus of case-law documents.
2 Underlying Assumptions
Our acquisition method is based on two theoretical
assumptions. First, we assume a very general no-
tion of linguistic subcategorisation. More precisely,
we consider that in a “head-complement” depen-
dency, not only the head imposes constraints on the
complement, but also the complement imposes lin-
guistic requirements on the head. Following Puste-
jovsky’s terminology, we call this phenomenon “co-
composition” (Pustejovsky, 1995). So, for a particu-
lar word, we attempt to learn both what kind of com-
plements and what kind of heads it subcategorises.
For instance, consider the compositional behavior of
the noun republic in a domain-specific corpus. On
the one hand, this word appears in the head position
within dependencies such as republic of Ireland, re-
public of Portugal, and so on. On the other hand, it
appears in the complement position in dependencies
like president of the republic, government of the re-
public, etc. Given that there are interesting semantic
regularities among the words cooccurring with re-
public in such linguistic contexts, we attempt to implement an algorithm that lets us learn two different subcategorisation contexts:
⟨republic↑ of ↓N⟩, where the preposition of introduces a binary relation between the word republic in the role of "head" (a role noted by the arrow "↑") and the words that can be its "complements" (the complement role is noted by the arrow "↓"). This subcategorisation context semantically requires complements referring to particular nations or states (indeed, only nations or states can be republics).
⟨N↑ of ↓republic⟩, which represents a subcategorisation context that must be filled by heads denoting specific parts of the republic: e.g., institutions, organisations, functions, and so on.
Note that the notion of subcategorisation restric-
tion we use in this paper embraces both syntactic and
semantic preferences.
The second assumption concerns the procedure
for building classes of similar subcategorisation con-
texts. We assume, in particular, that different sub-
categorisation contexts are considered to be seman-
tically similar if they have the same word distribu-
tion. Let’s take, for instance, the following contexts:
[four example subcategorisation contexts; their printed symbols are not recoverable from the source]
All of them seem to share the same semantic pref-
erences. As these contexts require words denot-
ing the same semantic class, they tend to possess
the same word distribution. Moreover, we also as-
sume that the set of words required by these simi-
lar subcategorisation contexts represents the exten-
sional description of their semantic preferences. Indeed, since the words minister, president, assembly, ... have a similar distribution in those contexts, they may be used to build the extensional class of nouns that actually satisfy the semantic requirements of the contexts. Such words are, then, semantically subcategorised by them. Unlike most unsupervised methods for selection restriction acquisition, we do not use the well-known strategy for measuring word similarity based on the distributional hypothesis. According to that hypothesis, words cooccurring in similar subcategorisation contexts are semantically similar.
Yet, as was said in the Introduction, such a notion of word similarity is not sensitive to word polysemy. By contrast, the aim of our method is to measure semantic similarity between subcategorisation contexts. This allows us to assign a polysemic word to different contextual classes of subcategorisation. This strategy is also used in the Asium system (Faure and Nédellec, 1998; Faure, 2000).
3 Subcategorisation Acquisition
To evaluate the hypotheses presented above, a software package was developed to support the automatic acquisition of syntactic and semantic subcategorisation information. The learning strategy consists of two sequential procedures. The first aims to extract subcategorisation candidates, while the second both identifies correct subcategorisation candidates and gathers them into semantic classes of subcategorisation. The two procedures are described in detail in the remainder of the section.
3.1 Extraction of Candidates
We have developed the following procedure for extracting syntactic patterns that could later become true subcategorisation contexts. Raw text is tagged (Marques, 2000) and then analyzed using the shallow parser introduced in (Rocio et al., 2001). The parser yields a single partial syntactic description of each sentence, which is analyzed as a sequence of basic chunks (NP, PP, VP, ...). Then, attachment is temporarily resolved by a simple heuristic based on right association (a chunk tends to attach to the chunk immediately to its right). Following our first assumption in section 2, we consider that the head words of two attached chunks form a binary dependency that is likely to be split into two subcategorisation contexts. It can easily be seen that syntactic errors may appear, since the attachment heuristic does not take distant dependencies into account.1 Because of such attachment errors, the identified subcategorisation contexts are mere hypotheses; hence they are only subcategorisation candidates. Finally, the set of words appearing in each subcategorisation context is viewed as a candidate semantic class.
For example, the phrase
emanou de facto da lei
([it] emanated in fact from the law)
1The errors are caused not only by this restrictive attachment heuristic, but also by other problems, e.g., words missing from the dictionary, incorrectly tagged words, and other parser limitations.
would produce the following two attachments:
⟨emanar↑ de ↓facto⟩  ⟨facto↑ de ↓lei⟩
from which the following 4 subcategorisation candi-
dates are generated:
⟨emanar↑ de ↓N⟩  ⟨N↑ de ↓facto⟩  ⟨facto↑ de ↓N⟩  ⟨N↑ de ↓lei⟩
Since the prepositional complement de facto is an adverbial locution interpolated between the verb and its real complement da lei, the two proposed attachments are wrong. Hence, the four subcategorisation contexts should not be acquired. We will see how our algorithm allows us to learn subcategorisation information that can later be used to invalidate such odd attachments and propose new ones. The algorithm basically works by comparing the similarity between the word sets associated with each subcategorisation candidate.
Let’s note finally that unlike many learning ap-
proaches, information on co-composition is avail-
able for the characterization of syntactic subcate-
gorisation contexts. In (Gamallo et al., 2001b),
a strategy for measuring word similarity based on
the co-composition hypothesis was compared to
Grefensetette’s strategy (Grefenstette, 1994). Ex-
perimental tests demonstrated that co-composition
allows a finer-grained characterization of “meaning-
ful” syntactic contexts.
3.2 Clustering Similar Contexts
According to the second assumption introduced above (section 2), two subcategorisation contexts with similar word distributions should have the same extensional definition and, therefore, the same selection restrictions. Thus, the word sets associated with two similar contexts are merged into a more general set, which represents their extensional semantic preferences. Consider the following two subcategorisation contexts and the words that appear in them:
[two similar subcategorisation contexts and their largely overlapping word sets; the printed symbols are not recoverable from the source]
Since both contexts have a similar word distribution, it can be argued that they share the same selection restrictions. Furthermore, it can be inferred that the words associated with them are all co-hyponyms belonging to the same context-dependent semantic class. In our corpus, the context ⟨infringir↑ obj ↓N⟩ (to infringe) is not only considered similar to ⟨infracção↑ de ↓N⟩ (infringement of), but also to other contexts such as ⟨respeitar↑ obj ↓N⟩ (to respect) and ⟨aplicar↑ obj ↓N⟩ (to apply).
In this section, we will specify the procedure
for learning context-dependent semantic classes by
comparing similarity between the previously ex-
tracted contextual word sets. This will be done in
two steps: filtering and clustering.
3.2.1 Filtering
As was said in the introduction, the cooperative system Asium also extracts similar subcategorisation contexts (Faure and Nédellec, 1998; Faure, 2000). This system requires the interactive participation of a language specialist so that the contextual word sets are filtered and cleaned before they are taken as input to the clustering strategy. Such a cooperative method requires the manual removal from the sets of those words that have been incorrectly tagged or analyzed. Our strategy, by contrast, attempts to automatically remove incorrect words from the contextual sets. Automatic filtering involves the following subtasks:
First, each word set is associated with a list of its most similar contextual sets. Intuitively, two sets are considered similar if they share a significant number of words. Various similarity coefficients were tested to create the lists of similar sets. The best results were achieved using a particular weighted version of the Jaccard coefficient, in which words are weighted according to both their dispersion and their relative frequency in each context (Gamallo et al., 2001a).
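A generic weighted Jaccard coefficient can be sketched as follows. The actual weighting scheme (word dispersion and relative frequency per context) is described in Gamallo et al. (2001a); here every word simply carries an arbitrary weight, so this is an illustrative stand-in, not the authors' exact measure.

```python
# A generic weighted Jaccard coefficient between two contextual word sets,
# each represented as a dict mapping word -> weight. The example weights
# below are invented.

def weighted_jaccard(a, b):
    """Similarity in [0, 1]: sum of min weights over sum of max weights."""
    words = set(a) | set(b)
    inter = sum(min(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    union = sum(max(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    return inter / union if union else 0.0

ctx1 = {"lei": 1.0, "norma": 0.5, "preceito": 0.5}
ctx2 = {"lei": 1.0, "norma": 0.5, "direito": 0.5}
print(round(weighted_jaccard(ctx1, ctx2), 2))  # 1.5 / 2.5 = 0.6
```

With all weights set to 1 this reduces to the plain Jaccard coefficient (shared words over total words).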
Then, once each contextual set has been compared to the other sets, we select the words shared by each pair of similar sets, i.e., we take the intersection of each pair of sets considered similar. Since words that are not shared by two similar sets could be incorrect, we remove them. Intersection allows us to discard words that are not semantically homogeneous. Thus, the intersection of two similar sets represents a class of co-hyponyms,
Figure 1: Clustering step. Two basic classes, [CONTX_i] with {lei, norma, preceito} and [CONTX_j] with {preceito, lei, direito}, are merged into the more general cluster [CONTX_ij].
which we call a basic class. Let's take an example. In our corpus, the most similar set to the one extracted from ⟨infracção↑ de ↓N⟩ (infringement of) is the set extracted from ⟨infringir↑ obj ↓N⟩ (to infringe). Both sets share the following words:
sigilo princípios preceito plano norma lei estatuto disposto disposição direito
(secret principle precept plan norm law statute disposition disposition right)
This basic class does not contain incorrect words such as vez, flagrantemente, obrigação, interesse (time, notoriously, obligation, interest), which were wrongly associated with the context ⟨infracção↑ de ↓N⟩ but do not appear in the context ⟨infringir↑ obj ↓N⟩. The class seems to be semantically homogeneous because it contains only co-hyponym words referring to legal documents. Once basic classes have been created, they are used by the conceptual clustering algorithm to build more general classes.
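The filtering step above amounts to a set intersection, which can be sketched as follows. The word sets are abridged (and partly invented, e.g. the noise words) from the infringir / infracção-de example; this is an illustration, not the authors' code.

```python
# Sketch of the filtering step: for each pair of similar contextual sets,
# keep only the shared words. The intersection is a "basic class" of
# co-hyponyms; unshared (potentially incorrect) words are discarded.

infringir = {"norma", "lei", "preceito", "direito", "vez"}           # "vez" is noise
infraccao_de = {"norma", "lei", "preceito", "direito", "obrigacao"}  # noise too

def basic_class(set_a, set_b):
    """Basic class = words shared by two similar contextual sets."""
    return set_a & set_b

print(sorted(basic_class(infringir, infraccao_de)))
# the noise words 'vez' and 'obrigacao' are filtered out
```

The resulting basic classes are the input units for the agglomerative clustering described next.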
3.2.2 Conceptual Clustering
We use agglomerative (bottom-up) clustering to successively aggregate the previously created basic classes. Unlike most research on conceptual clustering, aggregation does not rely on a statistical distance between classes, but on empirically set conditions and constraints (Talavera and Béjar, 1999). These conditions are discussed in (Gamallo et al., 2001a). Figure 1 shows two basic classes associated with two pairs of similar subcategorisation contexts. [CONTX_i] represents a pair of similar subcategorisation contexts sharing the words preceito, lei, norma (precept, law, norm), while [CONTX_j] represents another pair of similar contexts sharing the words preceito,
Table 1: Class membership of trabalho
Cluster 1: contrato execução exercício prazo processo procedimento trabalho (agreement execution practice term/time process procedure work)
Cluster 2: contrato exercício prestação recurso serviço trabalho (agreement practice installment appeal service work)
Cluster 3: actividade atribuição cargo exercício função lugar trabalho (activity attribution post practice function post work/job)
lei, direito (precept, law, right). Both basic classes are obtained from the filtering process described in the previous section. The figure illustrates more precisely how basic classes are aggregated into more general clusters. If two classes meet the clustering conditions, they can be merged into a new class. The two basic classes of the example are clustered into the more general class constituted by preceito, lei, norma, direito. At the same time, the two pairs of contexts [CONTX_i] and [CONTX_j] are merged into the cluster [CONTX_ij]. Such a generalization leads us to induce syntactic data that does not appear in the corpus. Indeed, we induce both that the word norma may appear in the syntactic contexts represented by [CONTX_j], and that the word direito may be attached to the syntactic contexts represented by [CONTX_i].
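The bottom-up aggregation can be sketched as follows. The merge condition here (share at least two words) is invented for illustration; the authors' actual conditions are empirically set and described in Gamallo et al. (2001a).

```python
# Bottom-up aggregation of basic classes, sketched with an invented merge
# condition: two classes merge when they share at least `min_shared` words.

def cluster(classes, min_shared=2):
    classes = [set(c) for c in classes]   # work on copies
    merged = True
    while merged:                         # repeat until no pair can merge
        merged = False
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                if len(classes[i] & classes[j]) >= min_shared:
                    classes[i] |= classes.pop(j)   # merge class j into class i
                    merged = True
                    break
            if merged:
                break
    return classes

# The two basic classes of Figure 1 merge into one general class:
basic = [{"preceito", "lei", "norma"}, {"preceito", "lei", "direito"}]
print(cluster(basic))
```

The merged class contains direito alongside norma, which is exactly the generalization step that induces syntactic data not attested in the corpus.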
3.2.3 Polysemic Words Representation
Polysemic words are placed in different clusters. For instance, consider the word trabalho (work/job). Table 1 shows this word as a member of at least three different contextual classes. Cluster 1 aggregates words referring to temporal objects. Indeed, they are co-hyponyms because they appear in subcategorisation contexts sharing the same selection restrictions: e.g., ⟨interrupção↑ de ↓N⟩ (interruption of) and a context glossed as (in course). Cluster 2 represents the result of an action. Such a meaning becomes salient in contexts like, for instance, one glossed as (to receive in payment for). Indeed, the cause of receiving money is not the action of working, but the object done or the state achieved by working. Finally, Cluster 3 illustrates the most typical meaning of trabalho: a job, function or task, which can be carried out by professionals. This is why these co-
Table 2: Dictionary entries

abono (loan)
⟨N↑ de ↓abono⟩:
{aplicação, caso, fixação, montante, pagamento, título}
(diligence, case, fixing, amount, payment, bond)
⟨abono↑ de ↓N⟩:
{ajuda, despesa, pensão, quantia, remuneração, subsídio, suplemento, valor, vencimento}
(assistance, expense, pension, amount, remuneration, subsidy, additional tax, value, salary)
⟨V↑ obj ↓abono⟩:
{conceder, conter, definir, determinar, fixar, manter, prever}
(concede, comprise, define, determine, fix, maintain, foresee)

emanar (emanate)
⟨emanar↑ de ↓N⟩:
{alínea, artigo, código, decreto, diploma, disposição, estatuto, legislação, lei, norma, regulamento}
(paragraph, article, code, decree, diploma, disposition, statute, legislation, law, norm, regulation)
⟨emanar↑ de ↓N⟩:
{administração, autoridade, comissão, conselho, direcção, estado, governo, ministro, tribunal, órgão}
(administration, authority, commission, council, direction, state, government, minister, tribunal, organ)

presidente (president)
⟨presidente↑ de ↓N⟩:
{assembleia, câmara, comissão, conselho, direcção, estado, empresa, gestão, instituto, região, república, secção, tribunal}
(assembly, chamber, commission, council, direction, state, enterprise, management, institute, region, republic, section, tribunal)
⟨N↑ de ↓presidente⟩:
{cargo, categoria, função, lugar, remuneração, vencimento}
(post, rank, function, place/post, remuneration, salary)
hyponyms can appear in subcategorisation contexts such as those glossed as (of the inspector) and (to accomplish).
4 Application and Evaluation
The acquired classes are used in the following way. First, the lexicon is provided with subcategorisation information; then, a second parsing cycle is performed so that syntactic attachments can be corrected.
4.1 Lexicon Update
Table 2 shows how the acquired classes are used to
provide lexical entries with syntactic and semantic
subcategorisation information. Each entry contains
both the list of subcategorisation contexts and the
list of word sets required by the syntactic contexts.
As we have said before, such word sets are viewed
as the extensional definition of the semantic pref-
erences required by the subcategorisation contexts.
Consider the information our system learnt for the verb emanar (see Table 2). It syntactically subcategorises two kinds of "de-complements": one semantically requires words referring to legal documents (emana da lei - emanates from the law; the law prescribes), while the other selects words referring to institutions (emana da autoridade - emanates from the authority; the authority proposes). These semantic restrictions enable us to correct the odd attachments proposed by our syntactic heuristics for the phrase emanou de facto da lei (emanated in fact from the law). As the word facto does not belong to the semantic class required by the verb in the "de-complement" position, we test the following "de-complement". As lei does belong, a new, correct attachment is proposed.
Consider now the nouns abono (loan) and presidente (president). They subcategorise not only complements, but also different kinds of heads. For instance, the noun abono selects for "de-head nouns" like fixação (fixação do abono - fixing the loan), as well as for verbs like fixar with abono in the direct object position: fixar o abono (to fix the loan).
4.2 Attachment Resolution Algorithm
The syntactic and semantic subcategorisation infor-
mation provided by the lexical entries is used to
check whether the subcategorisation candidates pre-
viously extracted by the parser are true attachments.
The degree of efficiency in such a task may serve as
a reliable evaluation for measuring the soundness of
our learning strategy.
We assume the use of both a traditional chart parser (Kay, 1980) and a set of simple heuristics for identifying attachment candidates. Then, in order to improve the analysis, a "diagnosis parser" (Rocio et al., 2001) receives as input the sequences of chunks proposed as attachment candidates, checks them, and raises correction procedures. Consider, for instance, the expression editou o artigo (edited the article). The diagnoser reads the sequence of chunks VP(editar) and NP(artigo), and then proposes the attachment ⟨editar↑ obj ↓artigo⟩ to be corrected by the system. Correction is performed by accepting or rejecting the proposed attachment. This is done by looking up the subcategorisation information contained in the lexicon, which has been acquired by the clustering method described above. Four tasks are performed to check the attachment heuristics:
Task 1a - Syntactic checking of artigo: look up the word artigo in the lexicon and look for the syntactic restriction ⟨V↑ obj ↓artigo⟩. If artigo has this syntactic restriction, pass to the semantic checking. Otherwise, pass to task 2a.
Task 1b - Semantic checking of artigo: check the semantic restriction associated with ⟨V↑ obj ↓artigo⟩. If the word editar belongs to that restricted class, then we can infer that ⟨editar↑ obj ↓artigo⟩ is a binary relation, and the attachment is confirmed. Otherwise, pass to task 2a.
Task 2a - Syntactic checking of editar: look up the word editar in the lexicon and look for the syntactic restriction ⟨editar↑ obj ↓N⟩. If editar has this syntactic restriction, pass to the semantic checking. Otherwise, the attachment cannot be confirmed.
Task 2b - Semantic checking of editar: check the semantic restriction associated with ⟨editar↑ obj ↓N⟩. If the word artigo belongs to that restricted class, then we can infer that ⟨editar↑ obj ↓artigo⟩ is a binary relation, and the attachment is confirmed. Otherwise, the attachment cannot be confirmed.
Semantic checking is based on the co-composition hypothesis stated above. According to this hypothesis, two chunks are syntactically attached only if one of two conditions is verified: either the complement is semantically required by the head, or the head is semantically required by the complement.
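The four-task procedure can be sketched as follows. This is an illustrative reconstruction with a toy lexicon (the entries and semantic classes are invented for the editou o artigo example), not the authors' implementation.

```python
# Sketch of the four-task check for a candidate attachment (head, rel,
# complement): confirm it if either the complement's entry lists a context
# whose semantic class contains the head (tasks 1a/1b), or the head's
# entry lists a context whose class contains the complement (tasks 2a/2b).

LEXICON = {
    # word -> {subcategorisation context: required semantic class}
    "artigo": {("V", "obj", "artigo"): {"editar", "publicar"}},
    "editar": {("editar", "obj", "N"): {"artigo", "livro"}},
}

def confirmed(head, rel, comp):
    # Tasks 1a/1b: does the complement semantically require this head?
    ctx = LEXICON.get(comp, {}).get(("V", rel, comp))
    if ctx and head in ctx:
        return True
    # Tasks 2a/2b: does the head semantically require this complement?
    ctx = LEXICON.get(head, {}).get((head, rel, "N"))
    return bool(ctx and comp in ctx)

print(confirmed("editar", "obj", "artigo"))  # True: attachment confirmed
print(confirmed("editar", "obj", "facto"))   # False: cannot be confirmed
```

Note that a single match on either side suffices, which is exactly the co-composition condition stated above.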
4.3 Evaluating Performance of Attachment
Resolution
Table 3 shows some results of the corrections proposed by the diagnosis parser. Accuracy and coverage were evaluated on three types of attachment candidates: NP-PP, VP-NP, and VP-PP. We call accuracy the proportion of proposed corrections that actually correspond to true dependencies and, hence, to correct attachments. Coverage indicates the proportion of candidate dependencies that were actually corrected.
Coverage evaluation was performed by randomly selecting as test data three sets of about 100-150 occurrences of candidate attachments from the parsed corpus. Each test set contained only one type of candidate attachment. Because of low coverage, accuracy was evaluated using larger sets of test candidates. The evaluation results are summarised in Table 3.
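In these terms, the two measures are simple proportions. A toy illustration (all sets below are invented for the example):

```python
# Toy illustration of the accuracy and coverage measures defined above.

def accuracy(corrections, true_dependencies):
    """Proportion of proposed corrections that are true dependencies."""
    return len(corrections & true_dependencies) / len(corrections)

def coverage(candidates, corrections):
    """Proportion of candidate dependencies that received a correction."""
    return len(candidates & corrections) / len(candidates)

corrections = {("editar", "artigo"), ("emanou", "lei"), ("peel", "knife")}
true_deps = {("editar", "artigo"), ("emanou", "lei")}
candidates = {("editar", "artigo"), ("emanou", "lei"),
              ("peel", "knife"), ("emanou", "facto")}

print(round(accuracy(corrections, true_deps), 2))   # 0.67
print(round(coverage(candidates, corrections), 2))  # 0.75
```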
Table 3: Evaluation of Attachment Resolution on NP-PP, VP-NP, and VP-PP attachment candidates (accuracy and coverage, in %, for each attachment type and in total)
Even though accuracy reaches a very promising value, coverage remains low. There are two main reasons for this: on the one hand, the learning method requires words to have significant frequencies throughout the corpus; on the other hand, words are sparsely distributed, i.e., most words of a corpus have few occurrences. However, the significant difference between the coverage for NP-PP attachments and that for verbal attachments (i.e., VP-NP and VP-PP) leads us to believe that coverage should increase as corpus size grows. Indeed, given that verbs are less frequent than nouns, verb occurrences are still very low in a corpus of this size. We need larger annotated corpora to improve the learning task, in particular concerning verb subcategorisation.
5 Future Work
As we do not propose long distance attachments, our
method can not be compared with other standard
corpus-based approaches to attachment resolution
(Hindle and Rooth, 1993; Brill and Resnik, 1994;
Li and Abe, 1998). Long distance attachments only
will be considered after having achieved the correc-
tions for immediate dependencies in the first cycle of
syntactic analysis. We are currently working on the
specification of new analysis cycles in order to long
distance attachments be solved. Consider again the
phraseemanou de facto da lei. At the sec-
ond cycle, the diagnoser proposed that the first PP
de facto is not corrected attached to emanou.
At the third cycle, the system will check whether the
second PPda leimay be attached to the verb. We
will perform n-cycles of attachment propositions,
until no candidates are available. At the end of the
process, we will be able to measure in a more accu-
rate way what is the degree of robustness the parser
may achieve.
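The planned cycle-by-cycle strategy can be sketched as a simple loop (a schematic sketch; propose and confirmed below are hypothetical stand-ins for the diagnosis parser and the lexicon checks described in the text):

```python
# Schematic sketch of the planned n-cycle attachment strategy:
# keep proposing candidate attachments, cycle after cycle, until
# no candidates remain.

def resolve_attachments(propose, confirmed):
    """Run attachment-proposal cycles until no candidates remain."""
    attachments = []
    cycle = 1
    while True:
        candidates = propose(cycle, attachments)
        if not candidates:
            break  # no candidates left: stop cycling
        for head, comp in candidates:
            if confirmed(head, comp):
                attachments.append((head, comp))
        cycle += 1
    return attachments

# Toy run loosely mirroring "emanou de facto da lei": the first PP is
# rejected, so the second PP is tried against the verb in a later cycle.
def propose(cycle, attachments):
    if cycle == 1:
        return [("emanou", "de facto")]
    if cycle == 2:
        return [("emanou", "da lei")]
    return []

def confirmed(head, comp):
    return comp == "da lei"  # only the second PP attaches to the verb

print(resolve_attachments(propose, confirmed))  # [('emanou', 'da lei')]
```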
6 Acknowledgement
This work is supported in part by grants of Fundação para a Ciência e Tecnologia, Portugal; Federal
Agency for Post-Graduate Education (CAPES),
Brazil; Pontifical Catholic University of Rio Grande
do Sul (PUCRS), Brazil; and the MLIS 4005 Euro-
pean project TRADAUT-PT.

References
Michael Brent. 1993. From grammar to lexicon: un-
supervised learning of lexical syntax. Computational
Linguistics, 19(3):243–262.
Eric Brill and Philip Resnik. 1994. A rule-based ap-
proach to prepositional phrase attachment disambigua-
tion. In COLING.
Ted Briscoe and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In ANLP'97, Washington, DC, USA.
Ido Dagan, Lillian Lee, and Fernando Pereira. 1998. Similarity-based models of word cooccurrence probabilities. Machine Learning, 43.
David Faure and Claire Nédellec. 1998. Asium: Learning subcategorization frames and restrictions of selection. In ECML'98, Workshop on Text Mining.
David Faure. 2000. Conception de méthode d'apprentissage symbolique et automatique pour l'acquisition de cadres de sous-catégorisation de verbes et de connaissances sémantiques à partir de textes : le système ASIUM. Ph.D. thesis, Université Paris XI Orsay, Paris, France.
Francesc Ribas Framis. 1995. On learning more appro-
priate selectional restrictions. In Proceedings of the
7th Conference of the European Chapter of the Asso-
ciation for Computational Linguistics, Dublin.
Pablo Gamallo, Alexandre Agustini, and Gabriel P.
Lopes. 2001a. Selection restrictions acquisition from
corpora. In EPIA’01, pages 30–43, Porto, Portugal.
LNAI, Springer-Verlag.
Pablo Gamallo, Caroline Gasperin, Alexandre Agustini,
and Gabriel P. Lopes. 2001b. Syntactic-based meth-
ods for measuring word similarity. In TSD-2001,
pages 116–125. Berlin:Springer Verlag.
Gregory Grefenstette. 1994. Explorations in Automatic
Thesaurus Discovery. Kluwer Academic Publishers,
USA.
Ralph Grishman and John Sterling. 1994. Generalizing
automatically generated selectional patterns. In COL-
ING’94.
Donald Hindle and Mats Rooth. 1993. Structural ambi-
guity and lexical relations. Computational Linguistics,
19(1):103–120.
Martin Kay. 1980. Algorithm schemata and data structures in syntactic processing. Technical Report CSL-80-12, Xerox PARC, Palo Alto, CA.
Hang Li and Naoki Abe. 1998. Word clustering and disambiguation based on co-occurrence data. In Coling-ACL'98, pages 749–755.
Nuno Marques. 2000. Uma Metodologia para a Modelação Estatística da Subcategorização Verbal. Ph.D. thesis, Univ. Nova de Lisboa, Lisboa, Portugal.
Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In ACL'93, pages 183–190, Ohio.
James Pustejovsky. 1995. The Generative Lexicon. MIT
Press, Cambridge.
Philip Resnik. 1997. Selectional preference and sense disambiguation. In ACL-SIGLEX Workshop on Tagging with Lexical Semantics, Washington, DC.
V. Rocio, E. de la Clergerie, and J.G.P. Lopes. 2001.
Tabulation for multi-purpose partial parsing. Journal
of Grammars, 4(1).
Satoshi Sekine, Jeremy Carroll, Sofia Ananiadou, and Jun'ichi Tsujii. 1992. Automatic learning for semantic collocation. In Applied Natural Language Processing, pages 104–110.
Luis Talavera and Javier Béjar. 1999. Integrating declarative knowledge in hierarchical clustering tasks. In Intelligent Data Analysis, pages 211–222.
